arXivDaily arXiv每日学术速递 周一至周五更新
重置
2606.02563 2026-06-02 cs.LG cs.CR cs.DC 版本更新

IntraShuffler: A Privacy Preserving Framework for Heterogeneous DP Federated Learning

IntraShuffler:一种用于异构差分隐私联邦学习的隐私保护框架

Farhin Farhad Riya, Olivera Kotevska, Jinyuan Stella Sun

发表机构 * University of Tennessee, Knoxville, USA(田纳西大学,科文特分校) Oak Ridge National Laboratory, USA(橡树岭国家实验室)

AI总结 针对异构差分隐私联邦学习中诚实但好奇的服务器通过梯度结构推断客户端属性的隐私推理攻击,提出IntraShuffler中间件框架,通过隐私感知混洗机制破坏梯度持久结构,同时保持ε感知聚合,将梯度可恢复性降低60%以上,代理推理准确率从0.78降至0.33。

详情
AI中文摘要

联邦学习中的异构差分隐私允许客户端根据机构策略和数据敏感性选择个体隐私预算($\varepsilon_i$)。实践中,许多HDP-FL系统采用$\varepsilon$感知的服务器聚合,通过根据声明的隐私预算重新加权客户端更新来提高模型效用。然而,联邦学习中的梯度更新保留了由非独立同分布数据引起的结构模式,而$\varepsilon$感知聚合暴露的这些额外信号为诚实但好奇的服务器提供了新的推理机会。在这项工作中,我们首先展示,配备梯度去噪和代理建模的服务器可以在现实知识约束下发起隐私推理攻击,该攻击推断客户端的分布属性并链接同一客户端在不同训练轮次中的更新,通过代理推理准确率和链接成功率衡量。混洗模型通过匿名化更新来源被广泛研究作为针对此类推理风险的防御,但它与HDP-FL的$\varepsilon$感知聚合根本不相容。为了解决这一挑战,我们提出了IntraShuffler,一个专为HDP-FL系统设计的中间件防御框架。IntraShuffler引入了一种隐私感知的混洗机制,将客户端分组到隐私兼容的桶中,并在每个桶内执行参数级混洗,以破坏持久的梯度结构,同时保持$\varepsilon$感知聚合。在四个不同数据集上的实验表明,IntraShuffler将梯度可恢复性降低了60%以上,并将代理推理准确率从0.78降至0.33,同时在多种联邦学习聚合规则下保持可比的模型效用。

英文摘要

Heterogeneous Differential Privacy (HDP) in Federated Learning (FL) allows clients to select individual privacy budgets ($\varepsilon_i$) according to institutional policies and data sensitivity. In practice, many HDP-FL systems employ $\varepsilon$-aware server aggregation to improve model utility by re-weighting client updates according to their declared privacy budgets. However, gradient updates in FL retain structural patterns induced by non-independent and identically-distributed (non-IID) data, and these additional signals exposed by $\varepsilon$-aware aggregation create new opportunities for inference by an honest-but-curious server. In this work, we first show that a server equipped with gradient denoising and surrogate modeling can mount a \emph{Privacy Inference Attack} that infers distributional attributes of clients and links updates from the same client across training rounds, measured via surrogate inference accuracy and linkage success, under realistic knowledge constraints. The Shuffle-Model has been widely studied as a defense against such inference risks by anonymizing update sources, but it is fundamentally incompatible with HDP-FL $\varepsilon$-aware aggregation. To address this challenge, we propose \textbf{IntraShuffler}, a middleware defense framework designed for HDP-FL systems. IntraShuffler introduces a privacy-aware shuffling mechanism that groups clients into privacy-compatible buckets and performs parameter-level shuffling within each bucket to disrupt persistent gradient structure while preserving $\varepsilon$-aware aggregation. Experiments across four different datasets show that IntraShuffler reduces gradient recoverability by over 60% and decreases surrogate inference accuracy from 0.78 to 0.33 while maintaining comparable model utility across multiple FL aggregation rules.

2606.02562 2026-06-02 cs.RO cs.AI cs.LG cs.SY eess.SY 版本更新

Permissive Safety Through Trusted Inference: Verifiable Belief-Space Neural Safety Filters for Assured Interactive Robotics

通过可信推理实现许可安全:可验证的信念空间神经安全滤波器用于保证交互式机器人

Haimin Hu

发表机构 * Department of Computer Science, Johns Hopkins University, USA(约翰霍普金斯大学计算机科学系)

AI总结 针对交互式机器人中人类不确定性带来的安全问题,提出一种基于共形预测的信念空间安全滤波器验证方法,在考虑推理可靠性的前提下保证高概率安全,并减少保守性。

Comments Accepted to the 17th World Symposium on the Algorithmic Foundations of Robotics (WAFR 2026)

详情
AI中文摘要

与人类交互的自主机器人必须在人类引起的不确定性(如偏好、目标、能力和合作意愿)下做出安全高效的决策。安全滤波器是确保交互式机器人安全性的流行方法,其模块化设计将安全性与性能分离,使机器人能够在最小影响任务效率的情况下安全地与人交互。传统安全滤波器通常仅在物理空间中运行,忽略了机器人在线学习和适应的能力,而最近提出的信念空间安全滤波器(BeliefSF)在闭环中考虑机器人安全性,并通过运行时推理主动减少机器人的不确定性,从而降低滤波的保守性。然而,由于运行时推理的误差以及处理信念空间高维性所需的安全滤波器神经近似,为部署BeliefSF的机器人提供形式化安全保证仍然是一个重大挑战。本文提出一种算法方法,使用共形预测来认证BeliefSF的高概率安全性,同时明确考虑机器人运行时推理模块的可靠性。我们的方法利用信念空间安全滤波的结构,将验证集中在预期推理可靠的区域。它保留了标准共形预测的简单性和样本复杂度,但能够认证一个显著更不保守的安全滤波器。通过一个模拟的人-车交互基准测试,我们展示了我们的方法验证了一个比标准共形预测基线更许可的信念空间安全滤波器。

英文摘要

Autonomous robots that interact with people must make safe and efficient decisions under human-induced uncertainty, such as their preferences, goals, competency, and willingness to cooperate. Safety filters are a popular approach for ensuring safety in interactive robotics, since their modular design separates safety from performance, allowing robots to operate safely around people with minimal impact on task efficiency. While traditional safety filters typically operate only in the physical space, neglecting the robot's ability to learn and adapt online, the recently proposed belief-space safety filter (BeliefSF) reasons about robot safety in closed-loop with runtime inference that actively reduces the robot's uncertainty online, thereby reducing conservativeness in filtering. However, providing formal safety guarantees for robots deploying BeliefSF remains a significant challenge due to errors in runtime inference and neural approximation of safety filters required to handle the high dimensionality of belief spaces. In this paper, we propose an algorithmic approach to certify high-probability safety of BeliefSF using conformal prediction, while explicitly accounting for the reliability of the robot's runtime inference module. Our method leverages the structure of belief-space safety filtering by focusing verification on a region where inference is expected to be reliable. It preserves the simplicity and sample complexity of standard conformal prediction, yet can certify a substantially less conservative safety filter. Through a simulated human-vehicle interaction benchmark, we show that our approach verifies a significantly more permissive belief-space safety filter than a standard conformal prediction baseline.

2606.02528 2026-06-02 q-fin.GN cs.CY cs.LG 版本更新

Auditing Asset-Specific Preferences in Financial Large Language Models: Evidence from Bitcoin Representations and Portfolio Allocation

审计金融大语言模型中的资产特定偏好:来自比特币表征与投资组合配置的证据

Wenbin Wu

AI总结 本研究通过三级审计协议,发现大型语言模型对比特币存在框架依赖的偏好,并识别出模型内部一个可因果干预的比特币选择性特征,该特征能显著影响下游投资组合配置。

Comments 28 pages, 5 figures, 18 tables

详情
AI中文摘要

大型语言模型现已驱动机器人顾问和交易代理,但它们是否对特定资产存在固有偏见尚未得到充分检验。我们提出三个问题:LLMs是否系统性地偏好某些金融工具;能否识别出对这些偏好具有因果杠杆作用的内部表征;以及该表征是否影响下游金融决策。我们开发了一个三级审计协议并将其应用于比特币。首先,对八个前沿LLMs的行为审计显示,比特币在货币类工具中的排名具有框架依赖性:模型将其置于“可靠货币”的第5位(共8位),但在危机和自主代理框架下接近榜首,且属性交换实验确认排名追踪功能属性而非名称。其次,我们打开模型内部:在Gemma 3中搜索数千个稀疏自编码器特征,识别出一个主导的比特币选择性特征。放大该特征会使模型偏向该资产,抑制则使其远离,即使提示中从未出现“比特币”。第三,我们测试金融后果:放大使比特币在投资组合中的份额提高5.2个百分点,而抑制降低4.6个百分点,放大在加密资产内重新分配,抑制则削减总加密敞口。我们将此描述为有界行为杠杆(杠杆指对输出的因果影响,而非金融杠杆):一个可识别的内部特征可被扰动以改变金融选择,但仅在可测量的限度内。该框架将内部表征与外部建议联系起来,并通过随机对照和机制边界进行验证。随着LLMs成为自主金融代理,这是迈向新兴“了解你的代理”(KYA)标准的行为层的第一步:了解代理偏好什么,以及该偏好可被移动多远。

英文摘要

Large language models now power robo-advisors and trading agents, yet whether they carry built-in biases toward specific assets is largely untested. We ask three questions: do LLMs systematically prefer certain financial instruments; can an internal representation with causal leverage over those preferences be identified; and does that representation affect downstream financial decisions? We develop a three-level audit protocol and apply it to Bitcoin. First, a behavioral audit of eight frontier LLMs shows that Bitcoin's ranking among money-like instruments is frame-dependent: models place it around rank 5 of 8 as "reliable money" but near the top under crisis and autonomous-agent frames, and an attribute-swap experiment confirms rankings track functional properties, not names. Second, we open a model's internals: a search across thousands of sparse-autoencoder features in Gemma 3 identifies a dominant Bitcoin-selective feature. Amplifying it shifts the model toward the asset and suppressing it shifts the model away, even when "Bitcoin" never appears in the prompt. Third, we test financial consequences: amplification raises Bitcoin's portfolio share by 5.2 percentage points while suppression lowers it by 4.6 pp, with amplification reallocating within crypto and suppression cutting total crypto exposure. We characterize this as bounded behavioral leverage (leverage meaning causal influence over outputs, not financial leverage): an identifiable internal feature can be perturbed to move financial choices, but only within measurable limits. The framework links internal representations to external recommendations, validated with random controls and mechanism boundaries. As LLMs become autonomous financial agents, this is a first step toward a behavioral layer for emerging know-your-agent (KYA) standards: knowing what an agent prefers, and how far that preference can be moved.

2606.02515 2026-06-02 cs.LG 版本更新

A Biconvex Formulation for Stable Transport of Mixture Models with a Unique Solution

混合模型稳定传输的双凸形式与唯一解

Yeganeh Marghi, Kelly Jin, Uygar Sümbül

AI总结 提出最优混合传输(OMT)框架,通过严格双凸优化实现子群体混合的稳定传输,理论保证稳定性,计算复杂度仅与混合成分数相关。

详情
AI中文摘要

最优传输(OT)为概率分布之间的映射提供了原则性框架。尽管取得了广泛进展,将OT应用于大规模数据仍然计算密集,且得到的逐点传输计划往往难以解释。我们引入了最优混合传输(OMT),这是一个可扩展的框架,将传输范式从单个样本转移到子群体的混合,将传输问题重新表述为具有唯一全局最小值的严格双凸优化。我们进一步建立了OMT映射稳定性的理论保证,表明底层分布的有界扰动会导致传输计划的有界变化。通过将子群体表述为指数族分布,OMT将计算复杂度与样本量解耦,仅随混合成分数量扩展。我们在广泛的合成基准和真实世界数据集(包括图像数据和大规模单细胞RNA测序测量)上展示了OMT的有效性和实用性。

英文摘要

Optimal transport (OT) provides a principled framework for mapping between probability distributions. Despite extensive progress, applying OT to large-scale data remains computationally demanding, and the resulting pointwise transport plans are often difficult to interpret. We introduce Optimal Mixture Transport (OMT), a scalable framework that shifts the transport paradigm from individual samples to mixtures of subpopulations, reformulating the transport problem as a strictly biconvex optimization with a unique global minimizer. We further establish theoretical guarantees on the stability of the OMT map, showing that bounded perturbations of the underlying distributions lead to bounded changes in the transport plan. By formulating subpopulations as exponential-family distributions, OMT decouples computational complexity from the sample size, scaling solely with the number of mixture components. We demonstrate the effectiveness and practicality of OMT on a wide range of synthetic benchmarks and real-world datasets, including image data and large-scale single-cell RNA sequencing measurements.

2606.02507 2026-06-02 cond-mat.mtrl-sci cs.ET cs.LG physics.app-ph physics.comp-ph 版本更新

Towards Automated Discovery: A Review of Generative Models, Multimodal Learning and Closed-Loop Workflows in Inverse Materials Design

迈向自动发现:逆向材料设计中生成模型、多模态学习与闭环工作流综述

Anand Babu, Rogério Almeida Gouvêa, Gian-Marco Rignanese

发表机构 * Institute of Condensed Matter and Nanosciences, Université Catholique de Louvain(凝聚态与纳米科学研究所,比利时列日-努瓦尔桑大学) WEL Research Institute(WEL研究机构)

AI总结 本文综述了逆向材料设计中生成晶体结构建模、多模态学习和闭环设计管道的最新进展,重点讨论了可行性约束与物理先验的施加方式、多模态融合策略以及多种逆向设计策略(如条件生成与潜在优化、贝叶斯优化、强化学习和主动学习),并指出了常见失败模式及基于分阶段报告的评估实践。

详情
AI中文摘要

逆向材料设计将材料发现从正向预测转变为在物理约束下满足目标的有针对性的候选材料提出。在此,我们回顾了晶体固体中生成晶体结构建模、多模态学习和闭环设计管道的最新进展。我们调查了现代生成器如何从大型数据库中学习化学-结构先验,以实现周期性结构的可控采样,并比较了包括变分自编码器、归一化流、自回归公式和扩散模型在内的主要模型类别。特别关注如何通过表示选择、训练目标、采样时指导以及生成后筛选和弛豫,在整个工作流中施加可行性约束和物理先验。我们还讨论了多模态学习如何融合多种材料模态,包括晶体结构、热力学、电子信息、显微镜、光谱学、加工背景和科学文本,以构建更通用、可迁移的化学空间表示。此外,考察了多种逆向设计策略,特别是那些将条件生成与潜在优化、贝叶斯优化、强化学习和主动学习相结合的策略。最后,我们强调了反复出现的失败模式,如代理利用、多样性崩溃、分布偏移和稳定性-可合成性差距,并基于有效性、新颖性、独特性、稳定性和成本的分阶段报告,概述了发现级评估实践。

英文摘要

Inverse materials design is shifting materials discovery from forward prediction to targeted proposal of candidates that satisfy objectives under physical constraints. Here, we review recent advances in generative crystal structure modeling, multimodal learning, and closed-loop design pipelines for crystalline solids. We survey how modern generators learn chemical-structural priors from large databases to enable controllable sampling of periodic structures, and compare leading model classes including variational autoencoders, normalizing flows, autoregressive formulations, and diffusion models. Particular attention is given to how feasibility constraints and physical priors are enforced across the workflow, through representation choices, training objectives, sampling-time guidance, and post-generation screening and relaxation. We also discuss how multimodal learning fuses diverse materials modalities, including crystal structures, thermodynamic, electronic information, microscopy, spectroscopy, processing context, and scientific text, to construct a more universal, transferable representation of chemical space. In addition, diverse inverse-design strategies are examined, particularly those that integrate conditional generation with latent optimization, Bayesian optimization, reinforcement learning, and active learning. Finally, we highlight recurring failure modes, such as surrogate exploitation, diversity collapse, distribution shift, and the stability-synthesizability gap, and outline discovery-grade evaluation practices based on staged reporting of validity, novelty, uniqueness, stability, and cost.

2606.02490 2026-06-02 cs.LG 版本更新

Expressivity of congruence-based architectures for DNNs on positive-definite matrices

基于同余结构的深度神经网络在正定矩阵上的表达能力

Antonin Oswald, Estelle Massart

发表机构 * Antonin Oswald Estelle Massart

AI总结 研究同余层(输入矩阵左乘和右乘权重矩阵及其转置)在正定矩阵分类中的表达能力,发现半正交约束会限制网络表达能力,导致退化为单隐藏层等价结构,并分析了不同黎曼分类器与同余层特征图的兼容性。

Comments Accepted for Eusipco 2026

详情
AI中文摘要

本文研究用于分类对称正定矩阵的神经架构,重点关注同余类层,其中输入矩阵左乘和右乘一个(可能是矩形的)权重矩阵 $W$ 及其转置。这类层是著名的 SPDNet 的核心,也已被独立用于正定数据的降维。我们表明,通常对 $W$ 施加的(半)正交约束限制了这些层的表达能力:对于某些激活函数,生成的架构退化为单隐藏层等价结构。这种表达能力的缺失源于半正交 $W$ 的同余类层中谱多样性的损失,并且是庞加莱分离定理的直接结果。然后,我们考察了最终分类器的选择,比较了几种黎曼分类器,并讨论了它们与同余类层产生的特征图的兼容性。

英文摘要

This work studies neural architectures for classifying symmetric positive-definite matrices, focusing on congruence-like layers, in which the input matrix is multiplied on the left and right by a (possibly rectangular) weight matrix $W$ and its transpose. Such layers lie at the core of the celebrated SPDNet and have also been employed independently for dimensionality reduction on positive-definite data. We show that the (semi)-orthogonality constraint commonly imposed on $W$ limits the expressivity of these layers: for certain activation functions, the resulting architecture collapses to a one-hidden-layer equivalent. This lack of expressivity follows from a loss of spectral diversity in congruence-like layers for semi-orthogonal $W$ and is a direct consequence of Poincaré's separation theorem. We then examine the choice of the final classifier, comparing several Riemannian classifiers and discussing their compatibility with the feature maps produced by congruence-like layers.

2606.02484 2026-06-02 cs.AI cs.LG 版本更新

Iteris: Agentic Research Loops for Computational Mathematics

Iteris: 计算数学的智能体研究循环

Leheng Chen, Zihao Liu, Wanyi He, Bin Dong

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University(北京大学北京国际数学研究中心和新基石科学实验室) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University(大湾研究院先进研究所智能计算中心) Zhongguancun Academy(中关村学院)

AI总结 提出Iteris智能体研究系统,通过数值实验、构造和证明草稿解决计算数学中的两个开放问题,经专家验证后获得可验证结果。

Comments 43 pages

详情
AI中文摘要

大型语言模型和智能体AI系统的最新进展使得数学发现取得了显著进展,从解决竞赛问题到处理研究级猜想。然而,计算数学中的开放问题受到的关注相对较少:该领域的研究通常不仅需要证明,还需要数值实验、对抗性构造和算法设计。在本文中,我们介绍了一个面向计算数学开放问题的智能体研究系统Iteris。我们将Iteris应用于近期Simons Workshop论文集(arXiv:2602.05394)中的两个开放问题。在这些案例研究中,Iteris生成了数值证据、构造和证明草稿,经过专家评审和修正后,得到了可验证的结果。第一个结果是关于幂律谱上共轭梯度与随机坐标下降渐近比较的相图;第二个结果是一个反例,表明即使低相干性下,带列主元的QR分解也可能无法选择良态子矩阵。这些案例研究表明,智能体AI系统可以有意义地参与计算数学开放问题的研究工作流程,而人工验证仍然至关重要。

英文摘要

Recent advances in large language models and agentic AI systems have enabled significant progress in mathematical discovery, from solving competition problems to tackling research-level conjectures. However, open problems in computational mathematics have received comparatively less attention: research in this area often requires not only proofs but also numerical experimentation, adversarial constructions, and algorithm design. In this paper, we introduce an agentic research system, Iteris, designed for open problems in computational mathematics. We apply Iteris to two open problems from a recent Simons Workshop collection (arXiv:2602.05394). In these case studies, Iteris generated numerical evidence, constructions, and proof drafts that led, after expert review and correction, to verified results. The first result is a phase diagram for the asymptotic comparison between conjugate gradient and randomized coordinate descent on power-law spectra; the second is a counterexample showing that QR factorization with column pivoting can fail to select well-conditioned submatrices even under low coherence. These case studies suggest that agentic AI systems can participate meaningfully in research workflows for open problems in computational mathematics, while human validation remains essential.

2606.02455 2026-06-02 cs.LG cond-mat.mtrl-sci physics.chem-ph physics.comp-ph stat.CO 版本更新

Speculative Sampling For Faster Molecular Dynamics

用于加速分子动力学的推测采样

Arthur Kosmala, Stephan Günnemann, Meng Gao, Brandon Wood

发表机构 * FAIR at Meta(Meta FAIR) School of Computation, Information & Technology, Technical University of Munich(技术大学慕尼黑计算、信息与技术学院) Munich Data Science Institute(慕尼黑数据科学研究所) Munich Center For Machine Learning(慕尼黑机器学习中心)

AI总结 提出Langevin推测动力学(LSD),一种分布式且模型无关的推测采样方法,通过草稿模型快速提议步长并用目标模型并行验证,实现分子动力学模拟的3-9倍加速而不增加相对误差。

Comments Forty-Third International Conference on Machine Learning (ICML 2026). 32 pages, 14 figures, 8 tables

详情
AI中文摘要

分子动力学(MD)是模拟原子系统动力学行为的关键工具。然而,MD本质上是串行的,这使得通过并发计算提高单系统吞吐量变得困难。为了解决这个问题,我们引入了Langevin推测动力学(LSD),一种分布式且模型无关的推测采样器,用于在不增加相对误差的情况下加速MD。受语言和扩散建模中推测方法的启发,LSD使用草稿模型提议快速模拟步长,并用较慢的目标模型并行验证,应用从草稿分布到目标分布的传输映射。我们将推测采样扩展到二阶Langevin动力学,推导出作为物理参数函数的可实现加速比,表明LSD在不同系统和草稿-目标组合中实现3-9倍加速,并从理论和实验上证实LSD从其目标模型分布中采样轨迹。

英文摘要

Molecular dynamics (MD) is a key tool for simulating the dynamical behavior of atomic systems. However, MD is inherently serial, which makes it difficult to increase single-system throughput with concurrent compute. To address this, we introduce Langevin Speculative Dynamics (LSD), a distributed and model-agnostic speculative sampler for accelerating MD without adding relative error. Inspired by speculative methods in language and diffusion modeling, LSD uses a draft model to propose fast simulation steps and verifies them in parallel with a slower target model, applying a transport map from the draft to the target distribution. We extend speculative sampling to second-order Langevin dynamics, derive the achievable speedup as a function of physical parameters, show that LSD generalizes across different systems and draft-target combinations with a 3-9x speedup, and confirm theoretically and empirically that LSD samples trajectories from its target model distribution.

2606.02449 2026-06-02 cs.AI cs.CL cs.CV cs.LG cs.MM 版本更新

HLL: Can Agents Cross Humanity's Last Line of Verification?

HLL:智能体能否跨越人类最后一道验证防线?

Xinhao Song, Su Su, Sirui Song, Hongliang Wu, Wen Shen, Zhihua Wei, Gongshen Liu, Linfeng Zhang, Dongrui Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shandong University(山东大学) Tongji University(同济大学)

AI总结 提出HLL基准,通过交互式CAPTCHA验证评估多模态智能体在受保护工作流中替代人类的能力,发现当前智能体在定位、动作校准、状态跟踪和过程一致性方面存在脆弱性。

Comments 27 pages, 14 figures

详情
AI中文摘要

多模态智能体越来越被期望代表用户操作界面,这引发了一个核心部署问题:在服务特意防止自动化的流程中,它们能否真正替代人类?CAPTCHA验证使这个问题具体化。它不仅仅是一个视觉谜题,更是在账户创建、内容访问、表单提交和其他受保护操作之前设置的人类验证边界。我们引入了 extbf{人类最后一道验证防线(HLL)},这是一个受控基准,使用交互式CAPTCHA验证来评估智能体是否能够通过基于环境的类人交互(而非仅识别)跨越这一边界。HLL涵盖了多种CAPTCHA交互,并让智能体暴露于受控的现实压力因素下,包括杂乱的网页、更困难的任务变体以及解决过程的轨迹条件验证。我们在闭环GUI环境中评估了八个前沿多模态智能体。结果表明,当前智能体在这个人类替代边界上仍然脆弱:性能在不同验证类型间差异显著,在现实界面条件下下降,当正确答案必须由有效动作轨迹支持时进一步下降。通过揭示定位、动作校准、状态跟踪和过程一致性方面的差距,HLL为衡量多模态智能体在受保护的真实世界工作流中作为人类替代品有多接近提供了一个具体的测试平台。我们的代码可在https://github.com/XinhaoS0101/HLL获取。

英文摘要

Multimodal agents are increasingly expected to operate interfaces on behalf of users, raising a central deployment question: can they truly substitute for humans in workflows that services deliberately protect against automation? CAPTCHA verification makes this question concrete. It is not merely a visual puzzle, but a human-verification boundary placed before account creation, content access, form submission, and other protected actions. We introduce \textbf{Humanity's Last Line of Verification (HLL)}, a controlled benchmark that uses interactive CAPTCHA verification to evaluate whether agents can cross this boundary through grounded, human-like interaction rather than recognition alone. HLL covers diverse CAPTCHA interactions and exposes agents to controlled realism stressors, including cluttered webpages, harder task variants, and trace-conditioned validation of the solving process. We evaluate eight frontier multimodal agents in a closed-loop GUI environment. The results show that current agents remain brittle at this human-substitution boundary: performance varies sharply across verification types, degrades under realistic interface conditions, and drops further when correct answers must be supported by valid action traces. By exposing gaps in localization, action calibration, state tracking, and process consistency, HLL provides a concrete testbed for measuring how close multimodal agents are to acting as human substitutes in protected real-world workflows. Our code is available at https://github.com/XinhaoS0101/HLL

2606.02433 2026-06-02 cs.IR cs.AI cs.CL cs.LG cs.MA 版本更新

ODTQA-FoRe: An Open-Domain Tabular Question Answering Dataset for Future Data Forecasting and Reasoning

ODTQA-FoRe:面向未来数据预测与推理的开放域表格问答数据集

Zhensheng Wang, Xiaole Liu, Wenmian Yang, Kun Zhou, Yiquan Zhang, Weijia Jia

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Institute of Artificial Intelligence and Future Networks, Beijing Normal University(北京师范大学人工智能与未来网络研究院) Faculty of Arts and Sciences, Beijing Normal University(北京师范大学文理学院) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港 Baptist大学)

AI总结 提出开放域表格问答的未来预测与推理任务,并构建首个覆盖时间序列预测和基于预测推理的数据集,通过基于LLM代理的TimeFore框架(检索器、预测器、分析器)解决历史数据检索、预测限制和响应标准化挑战。

Comments This paper has been accepted by Findings of ACL 2026

详情
AI中文摘要

大语言模型的快速发展显著推进了表格问答,但大多数系统无法进行面向未来的数值预测。为弥补这一空白,我们引入了一个新任务——面向未来数据预测与推理的开放域表格问答,并提出了首个覆盖时间序列预测和基于预测推理场景的数据集,使用房地产数据。该任务在检索精确历史数据、克服LLM的预测限制以及标准化多样化查询的响应方面提出了挑战。为解决上述挑战,我们提出了TimeFore,一个基于LLM代理的框架,将问题分解为三个协作角色:检索器自主生成SQL以获取数据,预测器调用外部时间序列模型以获得更高精度,分析器综合结果以构建精确且一致的最终答案。大量实验证明了我们TimeFore的有效性。

英文摘要

The rapid development of LLMs has significantly advanced tabular question answering, but most systems cannot perform future-oriented numerical prediction. To address this gap, we introduce a novel task, Open-Domain Tabular Question Answering for Future Data Forecasting and Reasoning, and propose the first dataset to cover time-series forecasting and forecast-based reasoning scenarios using real estate data. This task poses challenges in retrieving precise historical data, overcoming the forecasting limitations of LLMs, and standardizing responses for diverse queries. To solve the above challenges, we propose TimeFore, an LLM agent-based framework that decomposes the problem into three collaborative roles: a Retriever autonomously generates SQL to fetch data, a Forecaster invokes external time-series models for higher accuracy, and an Analyzer synthesizes the results to construct a precise and consistent final answer. Extensive experiments demonstrate the effectiveness of our TimeFore.

2606.02427 2026-06-02 math.NA cs.LG cs.NA 版本更新

Spectral Audit of In-Context Operator Networks

上下文算子网络的频谱审计

Zhiwei Gao, Liu Yang, George Em Karniadakis

发表机构 * Division of Applied Mathematics, Brown University(布朗大学应用数学系) Department of Mathematics, National University of Singapore(新加坡国立大学数学系)

AI总结 提出基于雅可比矩阵的频谱审计方法,通过分析上下文算子学习中的局部频谱特性(频率增益、相位结构、交叉模式耦合)来评估模型是否真正学习了PDE算子的局部动力学机制,而不仅仅是输出预测。

详情
AI中文摘要

现有的神经算子和上下文算子学习评估主要依赖于预测误差,但准确的输出预测并不能保证正确的局部动力学结构。一个模型可能匹配解,同时表现出不正确的敏感性、失真的频率响应、虚假的模式耦合或不稳定的切向行为。我们引入了一种基于雅可比矩阵的频谱审计方法,用于上下文算子学习。对于固定的提示,我们将网络输出对查询函数求导,并将得到的雅可比矩阵视为学习的切向算子。将其投影到傅里叶模式上,我们获得了推断算子的局部频谱特征,包括频率相关的增益、相位结构和交叉模式耦合。该审计通过测试模型是否再现底层PDE算子的局部机制(而不仅仅是输出)来补充标准预测指标。在多个基准测试中,审计揭示了不同的算子级现象,包括相位传输、粘度依赖的阻尼、非线性模式耦合和反应-扩散稳定性结构。它还检测了部分被预测误差指标隐藏的失败,包括高频退化、不正确的相位恢复和提示-算子不一致。即使逐点预测部分准确,损坏或内部不一致的提示也会导致切向算子结构退化。我们的结果表明,预测精度和局部算子保真度是学习到的神经算子的不同属性。我们的框架还为稳定性、灵敏度和算子一致性提供了诊断。

英文摘要

Existing evaluations of neural operators and in-context operator learning rely primarily on prediction error, but accurate output prediction does not guarantee the correct local dynamical structure. A model may match solutions while exhibiting incorrect sensitivities, distorted frequency response, spurious mode coupling, or unstable tangent behavior. We introduce a Jacobian-based spectral audit for in-context operator learning. For a fixed prompt, we differentiate the network output with respect to the query function and view the resulting Jacobian as a learned tangent operator. Projecting it onto Fourier modes, we obtain a local spectral characterization of the inferred operator, including frequency-dependent gains, phase structure, and cross-mode coupling. The audit complements standard prediction metrics by testing whether the model reproduces local mechanisms of the underlying PDE operator rather than only outputs. Across benchmarks, the audit reveals distinct operator-level phenomena, including phase transport, viscosity-dependent damping, nonlinear mode coupling, and reaction--diffusion stability structure. It also detects failures partially hidden by prediction-error metrics, including high-frequency degradation, incorrect phase recovery, and prompt--operator inconsistencies. Corrupted or internally inconsistent prompts lead to degraded tangent-operator structure even when pointwise predictions remain partially accurate. Our results suggest that prediction accuracy and local operator fidelity are distinct properties of learned neural operators. Our framework also provides a diagnostic for stability, sensitivity, and operator consistency.

2606.02424 2026-06-02 cs.CV cs.AI cs.LG 版本更新

GC-MoE: Genomics-Guided Cell-Type-Specific Mixture of Experts for Histology-Based Single-Cell Spatial Transcriptomics

GC-MoE: 基因组引导的细胞类型特异性专家混合模型用于基于组织学的单细胞空间转录组学

Kaito Shiku, Ahtisham Fazeel Abbasi, Ryoma Bise, Yuichiro Iwashita, Kazuya Nishimura, Andreas Dengel, Muhammad Nabeel Asim

发表机构 * Kyushu University(九州大学) German Research Center for Artificial Intelligence (DFKI GmbH)(德国人工智能研究中心) RPTU University Kaiserslautern-Landau(科布伦茨-劳恩堡大学) The University of Osaka(大阪大学) IntelligentX GmbH Osaka Metropolitan University(大阪 Metropolitan 大学)

AI总结 提出GC-MoE模型,通过路由网络估计细胞类型概率并软组合细胞类型特异性专家,结合细胞类型特异性共表达感知预测器和细胞间交互注意力模块,从组织学图像和细胞位置预测单细胞基因表达,在公共数据集上优于现有方法。

详情
AI中文摘要

基于组织学的单细胞空间转录组学(ST)估计旨在从组织病理学图像和细胞位置预测单个细胞的基因表达,从而减少对昂贵的单细胞ST测量的需求。与现有的组织学到ST方法主要预测包含多个细胞的局部区域的斑点级谱不同,该任务需要对细胞间的表达变异性进行建模,而这种变异性强烈地由细胞类型结构化。我们提出了基因组引导的细胞类型特异性专家混合模型(GC-MoE),该模型通过路由网络估计细胞类型概率,并软组合细胞类型特异性专家进行基因表达预测。为了进一步编码细胞类型依赖的基因程序,我们引入了细胞类型特异性共表达感知预测器(CAP),以及一个轻量级的细胞间交互注意力(C2CA)模块用于邻域细胞上下文。在公共单细胞ST数据集上的实验和消融研究表明,该方法在现有单细胞和适应性斑点级基线方法上均有一致的改进。

英文摘要

Histology-based single-cell spatial transcriptomics (ST) estimation aims to predict gene expression for individual cells from histopathological images and cell locations, reducing the need for costly single-cell ST measurements. Unlike existing histology-to-ST methods that mainly predict spot-level profiles for local regions containing multiple cells, this task requires modeling cell-to-cell expression variability, which is strongly structured by cell type. We propose Genomics-Guided Cell-Type-Specific Mixture-of-Experts (GC-MoE), which estimates cell-type probabilities with a routing network and softly combines cell-type-specific experts for gene expression prediction. To further encode cell-type-dependent gene programs, we introduce the Cell-Type-Specific Co-Expression-Aware Predictor (CAP), together with a lightweight Cell-to-Cell Interaction Attention (C2CA) module for neighboring-cell context. Experiments and ablations on public single-cell ST datasets show consistent improvements over existing single-cell and adapted spot-level baselines.

2606.02423 2026-06-02 cs.CL cs.LG 版本更新

Investigating and Alleviating Harm Amplification in LLM Interactions

调查和缓解大语言模型交互中的危害放大

Ruohao Guo, Wei Xu, Alan Ritter

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出HarmAmp基准和TrajSafe监控器,用于评估和缓解多轮对话中大语言模型对危害的放大效应。

详情
AI中文摘要

大语言模型(LLM)可以作为有用的助手,但它们同样可以作为危害放大器,使恶意用户通过扩展交互实现超出其能力的危害结果。这种风险沿着两个轴显现,即民主化领域专业知识,使新手能够产生专门的有害内容,以及以手动努力无法匹敌的规模扩大有害操作。然而,现有工作往往忽略了LLM在多轮对话中如何加剧危害。我们引入了HarmAmp,这是一个新的基准,用于涵盖十二个风险类别的多轮危害放大场景。每个场景都基于现实世界的威胁,并满足严格的标准,即实质性放大、操作特异性和多轮必要性。我们进一步提出了TrajSafe,一种主动监控器,可以预测有害轨迹并通过诸如探测用户真实意图和引导模型更安全地完成等行动进行干预。我们的广泛实验表明,TrajSafe显著降低了多轮交互中产生的危害性,同时保持了低过度拒绝率和目标模型的一般能力。我们的工作为缓解LLM交互中微妙的安全风险提供了一个有前景的范式。

英文摘要

Large language models (LLMs) can serve as helpful assistants, yet they can equally function as harm amplifiers that enable malicious users to achieve harmful outcomes beyond their capabilities through extended interactions. This risk manifests along two axes, i.e., democratizing domain expertise that allows novices to produce specialized harmful content, and scaling harmful operations at volumes that manual effort cannot match. Existing works, however, often overlook how LLMs compound harm across multi-turn conversations. We introduce HarmAmp, a new benchmark for multi-turn harm amplification scenarios spanning twelve risk categories. Each scenario is grounded in real-world threats and satisfies rigorous criteria, i.e., substantive amplification, operational specificity, and multi-turn necessity. We further propose TrajSafe, a proactive monitor that anticipates harmful trajectories and intervenes through actions such as probing users' genuine intents and steering the models towards safer completion. Our extensive experiments demonstrate that TrajSafe significantly reduces the harmfulness incurred in multi-turn interactions while preserving a low over-refusal rate and the target model's general capabilities. Our work offers a promising paradigm to alleviate the nuanced safety risks in LLM interactions.

2606.02398 2026-06-02 cs.LG cs.CL 版本更新

A Local Perturbation Theory for Cross-Domain Interference and Recovery in Multi-Domain RL

跨域干扰与恢复的局部微扰理论:多领域强化学习

Lei Yang, Siyu Ding, Deyi Xiong

发表机构 * TJUNLP Lab, College of Intelligence and Computing, Tianjin University(天津大学智能与计算学院TJUNLP实验室) Baidu Inc.(百度公司)

AI总结 针对多领域RL训练中一个领域性能下降的问题,提出局部微扰理论,证明后期领域训练主要通过二阶损伤项在低维共享冲突子空间中损害早期领域,并通过短时领域刷新实现选择性恢复。

详情
AI中文摘要

强化学习后训练在数学推理、代码生成、问答和创意写作等单个领域上改进了大型语言模型,但在一个领域上的训练往往会降低其他领域的性能。基于灾难性遗忘或全局梯度冲突的现有解释是不完整的:即使全模型梯度几乎正交,也可能发生实质性干扰。我们表明,单领域RL产生稀疏、小量级的参数编辑,且top变化神经元之间的重叠较弱,而不同领域仍然共享大量的活跃计算路径,这些路径上的更新方向决定了它们是协同还是冲突。在此观察指导下,我们在多领域RL的局部微扰模型下证明,后期领域训练主要通过二阶损伤项损害早期领域,在观察到的稀疏路径结构下,该损伤项集中在低维共享冲突子空间中。此外,短时领域刷新会收缩该子空间上的有害成分,从而在有限的附带损伤下实现选择性恢复。与理论一致,在Code → Math → QA → CW之后进行短暂的Re-Math刷新,将Math从57.66恢复到66.04,同时基本保持其他领域的性能,得到最佳平均分66.39。除了刷新之外,针对Math-QA对的稀疏代理冲突坐标集进行无训练回滚可部分恢复Math,为局部损伤提供了直接的代理级证据。这些结果为多领域RL中的干扰和恢复提供了局部机制解释。

英文摘要

Reinforcement learning (RL) post-training improves large language models (LLMs) on individual domains such as mathematical reasoning, code generation, question answering, and creative writing (CW), but training on one domain often degrades performance on others. Existing explanations based on catastrophic forgetting or global gradient conflict are incomplete: substantial interference can occur even when full-model gradients are nearly orthogonal. We show that single-domain RL produces sparse, small-magnitude parameter edits with weak overlap among top-changed neurons, while different domains still share substantial active computation routes on which update directions determine whether they act synergistically or conflict. Guided by this observation, we prove under a local perturbation model of multi-domain RL that later-domain training harms an earlier domain mainly through a second-order damage term, which under the observed sparse route structure concentrates in a low-dimensional shared conflict subspace. Moreover, a short domain refresh contracts the harmful component on this subspace, enabling selective recovery with limited collateral damage. Consistent with the theory, a brief Re-Math refresh after Code $\rightarrow$ Math $\rightarrow$ QA $\rightarrow$ CW recovers Math from 57.66 to 66.04 while largely preserving performance on the other domains, yielding the best average score of 66.39. Beyond refresh, a training-free rollback on a sparse proxy conflict coordinate set for the Math-QA pair partially restores Math, providing direct proxy-level evidence for localized damage. These results provide a localized mechanistic account of interference and recovery in multi-domain RL.

2606.02388 2026-06-02 cs.LG cs.AI 版本更新

Policy and World Modeling Co-Training for Language Agents

语言智能体的策略与世界模型协同训练

Ning Lu, Baijiong Lin, Shengcai Liu, Jiahao Wu, Haoze Lv, Yanbin Wei, Lingting Zhu, Shengju Qian, Xin Wang, Ying-Cong Chen, Qi Wang, Ke Tang

发表机构 * Southern University of Science and Technology(南方科技大学) Hong Kong University of Science and Technology(香港科学大学) Hong Kong University of Science and Technology (Guangzhou)(香港科学大学(广州)) Hong Kong Polytechnic University(香港理工大学) LIGHTSPEED

AI总结 提出PaW框架,通过在强化学习过程中添加辅助世界模型监督,无需改变推理范式,提升语言智能体在多个任务上的性能。

Comments 9 pages, 6 figures

详情
AI中文摘要

强化学习通过教导大语言模型智能体哪些行动能带来高奖励来改进它们,但对这些行动对环境的影响提供很少的监督。世界建模可以填补这一空白,但现有方法通常需要单独的模拟器、额外的训练阶段或额外的推理时计算。我们观察到,在策略强化学习 rollout 已经包含了所需的信号:每个转移将行动与其产生的下一个观察配对。基于这一观察,我们提出了PaW,一个策略和世界模型协同训练框架,它在强化学习过程中向同一策略添加辅助世界模型监督,而不改变推理范式。为了使辅助世界模型监督信息丰富且稳定,PaW引入了三个组件:基于行动熵的世界模型数据选择、噪声容忍的世界模型损失和奖励自适应的损失平衡。在三个智能体任务基准上的实验表明,在不同模型和强化学习算法上,PaW相对于强强化学习基线有一致的改进。这些结果表明,标准的强化学习 rollout 是语言智能体训练中世界模型监督的实用来源。

英文摘要

Reinforcement learning (RL) improves large language model (LLM) agents by teaching them which actions lead to high rewards, but provides little supervision on what those actions do to the environment. World modeling (WM) can fill this gap, yet existing approaches often require separate simulators, extra training stages, or additional inference-time computation. We observe that on-policy RL rollouts already contain the needed signal: each transition pairs an action with its resulting next observation. Based on this observation, we propose PaW, a Policy and World modeling co-training framework that adds auxiliary WM supervision to the same policy during RL, without changing the inference paradigm. To make auxiliary WM supervision informative and stable, PaW introduces three components: action-entropy-based WM data selection, noise-tolerant WM loss, and reward-adaptive loss balancing. Experiments on three agentic task benchmarks show consistent improvements over strong RL baselines across models and RL algorithms. These results suggest that standard RL rollouts are a practical source of WM supervision for language-agent training.

2606.02385 2026-06-02 q-bio.NC cs.LG 版本更新

How Optimality Structures Sparse Dictionaries: A Theory for Understanding SAE Representations

最优性如何结构化稀疏字典:理解SAE表示的理论

William Dorrell

AI总结 本文通过扩展局部最优性分析到非负联合优化问题,推导出稀疏自编码器(SAE)最优特征与数据分布之间的约束,解释了层级分裂与吸收、残差结构和密集对映特征等行为,并构建了新型大字典凸问题以探索宽原子-数据点极限。

Comments 27 pages, 5 figures

详情
AI中文摘要

稀疏自编码器(SAE)已成功将神经表示解析为可解释的概念,为理解和控制提供了基础。然而,SAE究竟提取了什么,以及我们据此能得出哪些科学结论,并不明显。经验上,证据在于结果:SAE学习了可解释的特征。理论上,我们缺乏一个清晰的解释,说明一个“概念”必须满足什么属性才能被SAE提取。已有大量可识别性工作研究稀疏编码恢复真实特征的条件,但这些方法往往关注简单的数据生成模型(如稀疏独立特征),这些模型难以近似SAE所训练的、吞噬互联网的语言模型表示。在此,我们避免数据生成模型,仅询问任何字典学习最优解必须满足什么属性。具体地,我们将局部最优性分析(Gribonval & Schnass, 2010)扩展到普通SAE近似的非负联合优化问题,并推导出最优SAE特征与其分布之间的约束。我们利用这些约束解释了一系列观察到的SAE行为——层级分裂与吸收、残差结构以及密集对映特征——每个都反映了L1+非负性如何与数据交互以结构化最优字典。最后,我们构建了一个新颖的大字典凸问题,并探索了宽原子-数据点极限。总之,我们希望将模型假设与意外观察区分开,从而从SAE的成功中学到更多,并为设计其继任者提供原则。

英文摘要

Sparse Autoencoders (SAEs) have found success parsing neural representations into interpretable concepts, providing a basis for understanding and control. However, what exactly SAEs extract, and, correspondingly, the scientific conclusions we can draw from them, are not obvious. Empirically, the proof is in the pudding: SAEs learn interpretable features. Theoretically, we lack a clear account of what properties a 'concept' must satisfy for an SAE to extract it. There has been extensive identifiability work studying the conditions under which sparse coding recovers ground-truth features; however, these approaches tends to focus on simple data-generating models (e.g. sparse independent features) which poorly approximate the internet-swallowing language-model representations on which SAEs are trained. Here, avoiding data-generating models, we ask simply what properties any dictionary learning optimum must satisfy. Concretely, we extend local optimality analyses (Gribonval & Schnass, 2010) to the nonnegative joint-optimisation problem that vanilla SAEs approximate, and derive constraints relating optimal SAE features to their distributions. We use these constraints to explain a range of observed SAE behaviours - hierarchical splitting & absorption, the structure of residuals, and dense antipodal features - each reflecting how L1+nonnegativity interact with data to structure optimal dictionaries. Finally, we construct a novel large-dictionary convex problem and explore the wide atom-per-datapoint limit. In sum, we hope to tease model assumptions from unexpected observations, letting us learn more from SAEs' successes and provide principles for designing their successors.

2606.02384 2026-06-02 cs.LG 版本更新

TabPrep: Closing the Feature Engineering Gap in Tabular Benchmarks

TabPrep: 弥合表格基准测试中的特征工程差距

Andrej Tschalzev, Nick Erickson, Yuyang Wang, Huzefa Rangwala, Stefan Lüdtke, Heiner Stuckenschmidt, Christian Bartelt

AI总结 本文提出TabPrep,一个轻量级预处理流程,通过针对三种特定数据模式设计的特征生成器,系统性地进行特征工程,显著提升多种模型在表格基准测试中的性能。

详情
AI中文摘要

表格机器学习的进展主要集中在日益复杂的模型架构上。同时,特征工程仍然是现实建模流程中关键但未被充分探索的组成部分,在现代基准测试中完全缺失,这造成了未量化的评估差距。在这项工作中,我们引入了TabPrep,一个轻量级预处理流程,由精心设计以针对三种特定结构数据模式的特征生成器组成。我们表明,许多广泛使用的模型类对这些模式表现出可预测的盲点,仅凭系统性的特征工程就能建立新的峰值性能。在TabArena基准测试中,将TabPrep集成到模型训练和调优中持续提升了基于树、神经网络、线性和基础模型的性能,通常超过仅通过以模型为中心的创新所获得的收益。TabPrep在性能、效率和跨数据集的适用性方面优于以前的自动化特征工程方法,使其能够集成到大规模基准测试中。通过发布TabPrep(见https://github.com/atschalz/tabprep),我们使研究人员能够将特征工程集成到他们的基准测试设置中,填补了表格评估中长期存在的空白。

英文摘要

Progress in tabular machine learning has largely focused on increasingly sophisticated model architectures. At the same time, feature engineering remains a critical yet underexplored component of real-world modeling pipelines that is entirely absent from modern benchmarks, which creates an unquantified evaluation gap. In this work, we introduce TabPrep, a lightweight preprocessing pipeline composed of feature generators that are carefully designed to target three specific structural data patterns. We show that many widely used model classes exhibit predictable blind spots to these patterns and that systematic feature engineering alone can establish new peak performance. Across the TabArena benchmark, integrating TabPrep into model training and tuning consistently improves performance for tree-based, neural, linear, and foundation models, often surpassing gains achieved by model-centric innovations alone. TabPrep outperforms previous automated feature engineering approaches in performance, efficiency, and applicability across datasets, enabling integration into large-scale benchmarks. By releasing TabPrep (see https://github.com/atschalz/tabprep), we enable researchers to integrate feature engineering into their benchmarking setup, filling a longstanding gap in tabular evaluations.

2606.02381 2026-06-02 cs.AI cs.LG math.DS 版本更新

A Mathematical Conflict Framework for Contextual Data Modulation

上下文数据调制的数学冲突框架

Hakan Emre Kartal

发表机构 * GitHub

AI总结 提出一个基于算子的数学冲突框架,将冲突视为局部、方向性和上下文敏感的量,通过统一抽象算子整合权重、尺度行为和输出映射,作为独立于优化过程的数学对象。

Comments 15 pages, 3 figures, framework paper

详情
AI中文摘要

在本研究中,提出了一个基于算子的广义数学冲突框架,以显式表示原始数据与上下文数据之间的结构差异。所提出的结构将冲突视为局部、方向性和上下文敏感的量,在统一抽象算子下整合了权重、尺度行为和输出映射等组件。该框架并未简化为特定的学习算法或优化方法,而是定义为适用于不同问题类别的通用结构。现有方法通常将冲突仅仅视为嵌入优化过程中的隐式副作用,而所提出的框架则将冲突视为独立的、基于算子的、组件级别的数学对象。

英文摘要

In this study, a generalized operator-based mathematical conflict framework is presented to explicitly represent structural discrepancies between raw data and contextual data. The proposed structure treats conflict as a local, directional, and context-sensitive quantity, integrating components such as weighting, scale behavior, and output mapping under a unified abstract operator. Without being reduced to a specific learning algorithm or optimization method, the framework is defined as a general structure adaptable to different classes of problems. While existing approaches typically treat conflict merely as an implicit side effect embedded within the optimization process, the proposed framework considers conflict as an independent, operator-based, and component-level mathematical object.

2606.02365 2026-06-02 cs.LG cs.AI 版本更新

FOAM: Frequency and Operator Error-Based Adaptive Damping Method for Reducing Staleness-Oriented Error for Shampoo

FOAM:基于频率和算子误差的自适应阻尼方法,用于减少Shampoo的陈旧性误差

Kyunghun Nam, Sumyeong Ahn

发表机构 * Kyunghun Nam Sumyeong Ahn

AI总结 提出FOAM算法,通过自适应控制阻尼因子和特征分解频率来抑制陈旧性误差,在保持收敛的同时减少Shampoo的计算时间。

Comments 9 pages, ICML 2026 camera-ready version

详情
AI中文摘要

Shampoo因其在大规模优化基准上的卓越性能而备受关注,但它面临一个重要的实际瓶颈:矩阵求逆的过高计算开销。为了缓解这一问题,从业者通常依赖陈旧的预条件子更新,这在计算效率和优化保真度之间产生了根本性的权衡。在这项工作中,我们通过收敛性和稳定性的互补视角对陈旧性进行了理论研究。虽然陈旧性提高了计算效率,但它固有地降低了性能并引入了数值不稳定性。关键的是,我们发现作为数值稳定器的阻尼可以有效抑制这些负面影响。在此分析指导下,我们提出了FOAM,一种自适应算法,通过基于陈旧性误差的近似动态控制阻尼因子和特征分解频率来稳定训练。实验结果表明,与标准Shampoo相比,FOAM在保持稳健收敛的同时减少了挂钟时间。

英文摘要

Shampoo is attracting considerable attention for its superior performance on large-scale optimization benchmarks; yet it faces a significant practical bottleneck: the prohibitive computational overhead of matrix inversion. To mitigate this, practitioners typically rely on stale preconditioner updates, creating a fundamental trade-off between computational efficiency and optimization fidelity. In this work, we provide a theoretical study of staleness through the complementary lenses of convergence and stability. While staleness improves computational efficiency, it inherently degrades performance and introduces numerical instability. Crucially, we identify that damping, acting as a numerical stabilizer, can effectively suppress these negative effects. Guided by this analysis, we propose FOAM, an adaptive algorithm that stabilizes training by dynamically controlling both the damping factor and the eigendecomposition frequency based on an approximation of the staleness-oriented error. Experimental results demonstrate that FOAM reduces wall-clock time compared to standard Shampoo while maintaining robust convergence.

2606.02363 2026-06-02 cs.LG stat.ML 版本更新

Minimax-Optimal Policy Regret in Partially Observable Markov Games

部分可观测马尔可夫博弈中的极小化最优策略遗憾

Raman Arora

发表机构 * Raman Arora

AI总结 针对部分可观测马尔可夫博弈,提出基于epoch的乐观最大似然算法,实现了与聚合Eluder维数相关的$ ilde{O}(\sqrt{T})$策略遗憾,并证明了匹配的下界。

详情
AI中文摘要

我们研究了部分可观测环境中面对战略、自适应对手的序贯决策问题,建模为部分可观测马尔可夫博弈(POMG)。核心挑战在于从部分观测中学习潜在动态,同时面对行为依赖于学习者策略的对手,这使得标准遗憾概念不适用。我们证明,对于固定问题参数,基于epoch的乐观最大似然算法实现了$ ilde{O}(\sqrt{T})$的策略遗憾,显式依赖于视界、对手记忆、置信半径以及可观测算子类的聚合Eluder维数。该算法在每个几何增长的epoch中选择一个策略,使用从过去数据累积构建的置信集,这将比较跨策略的对手响应的成本控制在$T$的对数级别。我们还证明了与$\sqrt{T}$和聚合Eluder维数依赖相匹配的下界(至多问题相关和对数因子)。最后,我们将框架扩展到视界自适应保证和具有几何衰减记忆的对手。

英文摘要

We study sequential decision-making in partially observable environments against strategic, adaptive opponents, modeled as partially observable Markov games (POMGs). The central challenge is to learn latent dynamics from partial observations while facing an adversary whose behavior depends on the learner's strategy, making standard regret notions inadequate. We prove that an epoch-based optimistic maximum-likelihood algorithm achieves $\tilde{O}(\sqrt{T})$ policy regret for fixed problem parameters, with explicit dependence on the horizon, adversary memory, confidence radius, and the aggregate Eluder dimension of the observable-operator class. The algorithm selects one policy per geometrically growing epoch using confidence sets built cumulatively from past data, which keeps the cost of comparing adversary responses across policies logarithmic in $T$. We also prove a lower bound matching the $\sqrt{T}$ and aggregate-Eluder-dimension dependence, up to problem-dependent and logarithmic factors. Finally, we extend the framework to horizon-adaptive guarantees and adversaries with geometric fading memory.

2606.02355 2026-06-02 cs.AI cs.LG 版本更新

SIRI: Self-Internalizing Reinforcement Learning with Intrinsic Skills for LLM Agent Training

SIRI:具有内在技能的自我内化强化学习用于LLM智能体训练

Zhongyu He, Yuanfan Li, Fei Huang, Tianyu Chen, Siyuan Chen, Xingyang Li, Meng Hsuan Yu, Xiangrong Liu, Leyi Wei, Lu Pan, Ke Zeng, Xunliang Cai

发表机构 * Xiamen University(厦门大学) Meituan(美团) Macao Polytechnic University(澳门 polytechnic 大学)

AI总结 提出SIRI框架,通过自我技能挖掘、验证和内化,使LLM智能体无需外部技能生成器或推理时技能库即可提升长程任务性能,在ALFWorld和WebShop上优于基线方法。

详情
AI中文摘要

长程LLM智能体可以从可重用技能中受益,但现有的基于技能的方法通常依赖于训练期间的外部技能生成器或推理时的持久技能检索,增加了工程复杂性、上下文长度和部署延迟。我们提出了具有内在技能的自我内化强化学习(SIRI),这是一个三阶段框架,使智能体能够发现、验证和内化技能,无需外部技能生成器或推理时的技能库。SIRI首先使用GiGPO预热策略以获得基本交互能力并收集成功的无技能轨迹。然后进行自我技能挖掘,当前策略从其自身的成功普通轨迹中总结紧凑技能,并通过配对的技能增强和技能无关轨迹进行验证。最后,SIRI仅使用轨迹级效用和动作级优势将有帮助的技能引导动作令牌蒸馏到普通策略中。推理时,智能体仅使用原始提示运行。在ALFWorld和WebShop上使用Qwen2.5-7B-Instruct,SIRI将GiGPO从ALFWorld的0.908提升到0.930,从WebShop的0.728提升到0.813,优于基于提示、基于强化学习和基于记忆增强的基线。进一步分析表明,我们的自我挖掘策略可以实现与闭源大模型蒸馏相当的性能。我们的代码可在https://github.com/kirito618/SIRI获取。

英文摘要

Long-horizon LLM agents can benefit from reusable skills, yet existing skill-based methods often rely on external skill generators during training or persistent skill retrieval at inference, increasing engineering complexity, context length, and deployment latency. We propose Self-Internalizing Reinforcement learning with Intrinsic skills (SIRI), a three-phase framework that enables agents to discover, validate, and internalize skills without external skill generators or inference-time skill banks. SIRI first warms up the policy with GiGPO to acquire basic interaction ability and collect successful skill-free trajectories. It then performs self-skill mining, where the current policy summarizes compact skills from its own successful plain rollouts and validates them through paired skill-augmented and skill-free rollouts. Finally, SIRI distills only beneficial skill-guided action tokens into the plain policy using trajectory-level utility and action-level advantage. At inference, the agent runs with the original prompt only. On ALFWorld and WebShop with Qwen2.5-7B-Instruct, SIRI improves GiGPO from 0.908 to 0.930 on ALFWorld and from 0.728 to 0.813 on WebShop, outperforming prompt-based, RL-based, and memory-augmented baselines. Further analysis shows that our self-mining strategy can achieve performance comparable to distillation with closed-source large model. Our code is available at https://github.com/kirito618/SIRI.

2606.02345 2026-06-02 stat.ML cs.LG 版本更新

Doing well with less! On Sampling Techniques for Empirical Pairwise Loss Estimation/Minimization

少即是多!关于经验成对损失估计/最小化的采样技术

Louise Davy, Stephan Clémençon, Charlotte Laclau

发表机构 * IDS, LTCI Télécom Paris Palaiseau, France(IDS、LTCI 雷电巴黎实验室,巴黎帕莱索,法国)

AI总结 本文利用调查采样技术,通过直接对成对样本进行采样而非单个观测,在保留少量信息的情况下实现与全量成对评估相当的估计或优化性能,为精度与计算成本之间提供了理论上有依据的权衡。

详情
AI中文摘要

许多机器学习问题,包括相似性学习、排序和聚类,都依赖于经验成对损失函数,其二次计算成本在大规模下迅速变得难以承受。我们展示了一种节俭的方法,通过利用调查采样技术,仅保留成对信息的一小部分,即可实现与使用所有成对数据相当的估计或优化性能。一个核心发现(理论和实验均支持)是,这种采样方案必须直接针对成对样本而非单个观测。特别地,对于高维向量(如视觉或图学习中的嵌入)之间的成对损失,使用合适的辅助信息为信息量大的成对样本分配更高的包含概率,可以获得接近全量成对评估的性能,从而在精度和计算成本之间提供了一种有原则且理论上有依据的权衡。

英文摘要

Many machine learning problems, including similarity learning, ranking, and clustering, rely on empirical pairwise loss functions whose quadratic computational cost quickly becomes prohibitive at scale. We demonstrate how a frugal approach that retains only a fraction of the available information on pairs can achieve estimation or optimization performance comparable to that obtained by using all pairs, by leveraging survey sampling techniques. A central finding, supported by both theory and experiments, is that such sampling plans must target pairs directly rather than individual observations. In particular, for pairwise losses between high-dimensional vectors such as embeddings in vision or graph learning, assigning higher inclusion probabilities to informative pairs using suitable auxiliary information yields performance close to full pairwise evaluation, providing a principled and theoretically grounded trade-off between accuracy and computational cost.

2606.02339 2026-06-02 cs.LG cs.CV 版本更新

Entropy Minimization without Model Collapse: Mitigating Prediction Bias in Medical Imaging

无模型坍塌的熵最小化:减轻医学影像中的预测偏差

Tim Nielen, Sameer Ambekar, Johannes Kiechle, Daniel M. Lang, Julia A. Schnabel

发表机构 * School of Computation, Information and Technology, Technical University of Munich, Germany(慕尼黑技术大学计算、信息与技术学院) Institute of Machine Learning in Biomedical Imaging, Helmholtz Munich, Germany(生物医学成像中的机器学习研究所,海德堡慕尼黑德国) School of Biomedical Engineering and Imaging Sciences, King’s College London, UK(伦敦国王学院生物医学工程与成像科学学院) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心(MCML)) relAI – Konrad Zuse School of Excellence in Reliable AI(relAI——Konrad Zuse可靠性人工智能卓越学院) TUM University Hospital Rechts der Isar(慕尼黑技术大学医院Rechts der Isar)

AI总结 针对测试时适应中熵最小化导致的模型坍塌问题,提出分布偏移偏差减少(DSBR)方法,通过均衡各预测类对无监督熵最小化损失的贡献来纠正预测偏差,在四个医学影像数据集和ImageNet-C上验证了其稳定性和有效性。

详情
AI中文摘要

熵最小化(EM)是测试时适应的主导目标,但其失败模式——模型坍塌——仍然知之甚少。在这项工作中,我们表明分布偏移会导致模型表示空间中对应不同类别的特征簇合并,而决策边界保持不变。这导致预测类别分布出现系统性偏差,称为预测偏差。预测偏差是指预测类别分布的偏移,其中一些类别被过度代表,而其他类别被抑制。我们表明熵最小化通过收紧现有簇来放大这种预测偏差,强化错误的分类,直到所有预测坍缩为平凡解。接下来,为了证明预测偏差的重要性并减轻它,我们进一步提出了分布偏移偏差减少(DSBR),这是一种偏差纠正目标,通过均衡每个预测类别对无监督熵最小化损失的贡献来专门针对这种失败模式。为了研究这种失败模式,我们使用四个医学影像数据集设计了合适的适应设置,并在ImageNet-C上进行了额外评估。我们发现DSBR一致地稳定了测试时适应,防止了模型坍塌,并且匹配或超越了最先进的方法。此外,DSBR仅在测试时运行。

英文摘要

Entropy minimization (EM) is the dominant objective for test-time adaptation, yet its failure mode, model collapse, remains poorly understood. In this work, we show that distribution shifts can cause feature clusters corresponding to distinct classes in the model's representation space to merge, while the decision boundary remains fixed. This induces a systematic skew in the predicted class distribution, referred to as prediction bias. Prediction bias refers to a shift in the predicted class distribution, with some classes overrepresented and others suppressed. We show that entropy minimization amplifies this prediction bias by tightening the existing clusters, reinforcing the incorrect groupings until all predictions collapse to a trivial solution. Next, to demonstrate the significance of prediction bias and mitigate it, we further propose Distribution Shift Bias Reduction (DSBR), a bias-correcting objective that specifically targets this failure mode by equalizing the contribution of each predicted class to the unsupervised entropy minimization loss. To study this failure mode, we design suitable adaptation settings using four medical-imaging datasets and additionally evaluate on ImageNet-C. We find that DSBR consistently stabilizes test-time adaptation, prevents model collapse, and matches or outperforms state-of-the-art methods. Moreover, DSBR operates solely at test-time.

2606.02331 2026-06-02 cs.CV cs.LG 版本更新

Hallucination-Aware Diffusion Sampling for Inverse Problems via Robust Prior Updates

基于鲁棒先验更新的幻觉感知扩散采样用于逆问题

Pengfei Jin, Yiqi Tian, Kailong Fan, Bingjie Qi, Quanzheng Li

发表机构 * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School(先进医学计算与分析中心,麻省总医院和哈佛医学院) Department of Industrial Engineering, University of Pittsburgh(工业工程系,匹兹堡大学)

AI总结 提出鲁棒先验更新模块,通过探测扩散先验更新的局部稳定性并重新锚定位移,减少逆问题求解中的测量条件幻觉,提升实例保真度。

详情
AI中文摘要

基于扩散的逆问题求解器可以产生逼真的重建结果,但仅凭逼真度并不能确保恢复的细节得到测量的支持。我们将这种失败研究为测量条件幻觉:视觉上有意义但要么不可信要么与测量实例不一致的内容。我们的分析将基于贝叶斯规则的扩散逆求解器分为先验更新和测量条件步骤,表明在应用测量校正之前,幻觉内容可能通过先验侧提议进入。受此观点启发,我们提出鲁棒先验更新(RPU),一个求解器级模块,探测扩散先验更新的局部稳定性,将产生的位移重新锚定在当前迭代点,并保持测量更新不变。我们在DPS中实例化RPU,并使用自动指标和人类忠实度研究在FFHQ和ImageNet逆问题上进行评估。在FFHQ上,RPU在框内修复、高斯去模糊和运动去模糊中相比DPS提高了PSNR和LPIPS。在人类判断中,RPU在FFHQ框内修复上获得了91.9%的盲选非平局多数偏好和91.1%的借助真实标签的非平局偏好,而ImageNet高斯阅读器研究中平局较多,但在非平局情况下RPU更受青睐。这些结果支持一个有针对性的主张:鲁棒化先验更新可以提高扩散逆求解器中的实例保真度,尤其是在先验塑造弱约束内容时。

英文摘要

Diffusion-based inverse problem solvers can produce realistic reconstructions, but realism alone does not ensure that the recovered details are supported by the measurement. We study this failure as measurement-conditioned hallucination: visually meaningful content that is either implausible or inconsistent with the measured instance. Our analysis separates Bayes-rule-based diffusion inverse solvers into a prior update and a measurement-conditioning step, showing that hallucinated content can enter through the prior-side proposal before the measurement correction is applied. Motivated by this view, we propose Robust Prior Update (RPU), a solver-level module that probes the local stability of the diffusion prior update, re-anchors the resulting displacement at the current iterate, and leaves the measurement update unchanged. We instantiate RPU in DPS and evaluate it on FFHQ and ImageNet inverse problems using automatic metrics and human faithfulness studies. On FFHQ, RPU improves PSNR and LPIPS over DPS across box inpainting, Gaussian deblurring, and motion deblurring. In human judgments, RPU receives 91.9% of blind non-tie majority preferences and 91.1% of ground-truth-assisted non-tie preferences on FFHQ box inpainting, while the ImageNet Gaussian reader study is tie-heavy but favors RPU among non-tie cases. These results support a targeted claim: robustifying the prior update can improve instance faithfulness in diffusion inverse solvers, especially when the prior shapes weakly constrained content.

2606.02328 2026-06-02 cs.LG 版本更新

Riemannian Gradient Descent for Low-Rank Architectures

低秩架构的黎曼梯度下降

Nicholas Knight

AI总结 针对低秩矩阵参数,探索黎曼优化技术,并在小语言模型的多头注意力参数上应用,但未显著优于AdamW基线。

详情
AI中文摘要

我们探索了用于秩因子矩阵参数的黎曼优化技术,针对当代深度学习应用。我们考察了算法设计空间中的十个点:秩为$r$的矩阵的两种几何结构,秩为$r$的部分等距的三种几何结构,以及这五种几何结构的块矩阵变体,其中因子在块行和块列之间共享。我们将我们的方法应用于小语言模型中的多头注意力参数。在调整学习率后,我们的方法并未决定性地优于AdamW基线。我们的实现可在网上获取。

英文摘要

We explore Riemannian optimization techniques for rank-factored matrix parameters, targeting contemporary deep learning applications. We examine ten points in the algorithm design space: two geometries for rank-$r$ matrices, three geometries for rank-$r$ partial isometries, and block-matrix variants of these five, where factors are shared across block-rows and block-columns. We apply our methods to the multihead attention parameters in small language models. After tuning learning rates, our methods do not conclusively outperform an AdamW baseline. Our implementations are available online.

2606.02322 2026-06-02 cs.LG cs.AI 版本更新

Repurposing Adversarial Perturbations for Continual Learning: From Defense to Active Alignment

重新利用对抗扰动进行持续学习:从防御到主动对齐

Ran Liu, Min Yu, Mingqi Liu, Jianguo Jiang, Gang Li, Rongsheng Li, Ning Li, Zhen Xu, Weiqing Huang, Ming Liu

发表机构 * Institute of Information Engineering, Chinese Academy of Sciences(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院) Deakin University(德肯大学) Harbin Engineering University(哈尔滨工程大学)

AI总结 提出AdvCL框架,通过将对抗扰动重新用作几何控制信号,结合三个即插即用模块(Intra-Smooth、Proto-Clip、Inter-Align),在持续学习中同时提升标准性能、鲁棒性、降低遗忘并增强迁移。

详情
AI中文摘要

在动态环境中,大型语言模型需要不断适应新任务,但持续学习常常遭受遗忘、有限的迁移以及对对抗扰动的脆弱性。为了解决这个问题,我们提出了AdvCL,它将对抗扰动重新用作稳定的持续适应的几何控制信号。AdvCL结合了三个即插即用模块:Intra-Smooth通过小的对抗扰动促进局部平滑性;Proto-Clip使用相似性裁剪以防止过度对齐到当前任务原型;Inter-Align则通过对齐到先前任务原型的方向性对齐来减少表示间隙。实验表明,在标准性能和鲁棒性方面均有一致的提升,同时具有更低的遗忘和更强的迁移。我们进一步通过量化Intra-Smooth对扰动设置的敏感性以及Inter-Align对任务相似性和几何距离的影响,分析了关键机制。总之,这些模块在组合时提供互补增益,每个模块也可以单独集成到各种持续学习范式中,包括回放、正则化和动态架构,从而为持续学习提供了一种几何控制机制。

英文摘要

In dynamic environments, large language models need to keep adapting to new tasks, but continual learning often suffers from forgetting, limited transfer, and vulnerability to adversarial perturbations. To address this, we present AdvCL, which repurposes adversarial perturbations as a geometric control signal for stable continual adaptation. AdvCL combines three plug-in modules: Intra-Smooth promotes local smoothness via small adversarial perturbations; Proto-Clip uses similarity clipping to prevent excessive alignment to current task prototype; and Inter-Align applies directional alignment toward previous task prototype to reduce representational gaps. Experiments show consistent gains in both standard performance and robustness, with lower forgetting and stronger transfer. We further analyze key mechanisms by quantifying the sensitivity of Intra-Smooth to perturbation settings and the effect of Inter-Align on task similarity and geometric distance. In summary, the modules provide complementary gains when combined, and each can also be integrated individually into diverse CL paradigms, including replay, regularization, and dynamic architectures, thereby offering a geometric control mechanism for continual learning.

2606.02310 2026-06-02 cs.CV cs.LG 版本更新

Deep Learning for Remote Sensing to Improve Flood Inundation Mapping

深度学习用于遥感以改进洪水淹没制图

Yogesh Bhattarai, Vijay Chaudhary, Wai Lim Kim, Sanjib Sharma

发表机构 * University of Colorado Boulder(科罗拉多大学博尔德分校)

AI总结 提出基于去噪扩散概率模型和掩码扩散Transformer的云去除框架,用于洪水影像,以生成无云图像并保持水文一致性,提升洪水监测的可靠性。

Comments This paper has been selected as the top 10 student finalists in IGRASS 2026 paper competition

详情
AI中文摘要

洪水是全球最普遍的自然灾害。及时准确的洪水淹没制图对于告知灾害风险管理至关重要。光学卫星任务提供了高分辨率、多光谱观测,对于洪水检测和淹没制图至关重要。然而,在极端降水事件期间,其操作实用性受到云层的严重限制。基于时间合成或插值的传统云去除技术通常无法捕捉淹没动态。在本研究中,我们引入了一种基于去噪扩散概率模型的洪水影像云去除框架,利用掩码扩散Transformer架构。所提出的方法利用自注意力机制捕获更广泛的空间上下文,并采用掩码令牌建模来显式学习云遮挡区域的重建。在具有真实云模式的多光谱Sentinel-2B洪水场景上训练,该模型生成保持视觉保真度和水文一致性的无云图像实现。使用标准图像质量指标以及洪水特定的水文指标评估重建性能,显示出水体连续性的改善和对水检测指数至关重要的光谱特征的保留。结果表明,基于扩散的生成建模为光学洪水监测中的云去除提供了一种稳健且物理一致的替代方案,从而实现更可靠、连续的观测,以支持灾害风险管理和洪水相关决策。

英文摘要

Flooding is the most pervasive natural disaster worldwide. Timely and accurate flood inundation mapping are essential for informing disaster risk management. Optical satellite missions provide high-resolution, multispectral observations critical for flood detection and inundation mapping. However, their operational utility is severely constrained by cloud cover during extreme precipitation events. Conventional cloud-removal techniques based on temporal compositing or interpolation often fail to capture inundation dynamics. In this study, we introduce a cloud-removal framework for flood imagery based on Denoising Diffusion Probabilistic Models, leveraging the Masked Diffusion Transformer architecture. The proposed approach exploits self-attention mechanisms to capture wider spatial context and employs masked token modeling to explicitly learn the reconstruction of cloud-obscured regions. Trained on multispectral Sentinel-2B flood scenes with realistic cloud patterns, the model generates cloud-free image realizations that preserve both visual fidelity and hydrological consistency. Reconstruction performance is evaluated using standard image quality metrics alongside flood-specific hydrological measures, demonstrating improved continuity of water bodies and preservation of spectral signatures critical for water detection indices. The results indicate that diffusion-based generative modeling offers a robust and physically consistent alternative for cloud removal in optical flood monitoring, enabling more reliable, continuous observations to support disaster risk management and flood-related decision making.

2606.02309 2026-06-02 cs.LG cs.CV 版本更新

Measurement Geometry and Design for Trustworthy Generative Inverse Problems

可信生成式逆问题的测量几何与设计

Pengfei Jin, Na Li, Quanzheng Li

发表机构 * Center for Advanced Medical Computing and Analysis, Massachusetts General Hospital and Harvard Medical School(先进医学计算与分析中心,麻省总医院和哈佛医学院) School of Engineering and Applied Sciences, Harvard University(工程与应用科学学院,哈佛大学)

AI总结 提出局部测量-流形兼容性度量,证明其控制重建误差的稳定部分,并基于体积保持设计固定和自适应测量策略,在多个成像任务中预测失败模式、减少幻觉并指导采样。

详情
AI中文摘要

生成模型越来越多地被用作逆问题的先验,但它们生成逼真图像的能力带来了一个基本的信任问题:一个看似合理的重建可能由测量支持,也可能由先验沿未观测方向填充。这一区别在医学成像中尤为重要,因为采集操作是在扫描时间、剂量和校准约束下设计的。我们从测量几何的角度研究生成式逆问题。核心问题是:固定的测量算子能否区分在生成先验下看似合理的邻近图像,以及这种关系能否指导更好的测量。我们引入了一个局部测量-流形兼容性度量,用于量化算子观测先验相关切线方向的程度。在局部正则性假设下,我们证明该量控制重建误差的稳定部分,而生成先验控制流形外漂移。这一最坏方向证书基于整体局部体积保持,提出了实用的固定和顺序采集规则,包括一种后验云设计,该设计在测试时自适应调整测量,无需训练采样策略。在行采样、断层扫描和MR采集设置中,所提出的分数预测失败模式,解释测量引起的幻觉,并指导更好的采样。在fastMRI笛卡尔采样中,后验云测量设计优于强大的非学习ACS保留基线,包括可变密度和泊松类掩模。

英文摘要

Generative models are increasingly used as priors for inverse problems, but their ability to produce realistic images creates a basic trust problem: a plausible reconstruction may be supported by the measurements, or it may be filled in by the prior along unobserved directions. This distinction is especially important in medical imaging, where acquisition operators are designed under scan-time, dose, and calibration constraints. We study generative inverse problems from a measurement-geometry perspective. The central question is whether a fixed measurement operator can distinguish nearby images that are plausible under the generative prior, and whether this relationship can guide better measurements. We introduce a local measurement-manifold compatibility measure that quantifies how well the operator observes prior-relevant tangent directions. Under local regularity assumptions, we prove that this quantity controls the stable part of the reconstruction error, while the generative prior controls off-manifold drift. This worst-direction certificate motivates practical fixed and sequential acquisition rules based on overall local volume preservation, including a posterior-cloud design that adapts measurements at test time without training a sampling policy. Across row-sampling, tomographic, and MR acquisition settings, the proposed scores predict failure modes, explain measurement-induced hallucinations, and guide better sampling. In fastMRI Cartesian sampling, posterior-cloud measurement design improves over strong non-learned ACS-preserving baselines, including variable-density and Poisson-like masks.

2606.02294 2026-06-02 cs.LG 版本更新

Regularized Large Neighborhood Search

正则化大邻域搜索

Germain Vivier-Ardisson, Laurent Demonet, Axel Parmentier, Mathieu Blondel

发表机构 * Google DeepMind(谷歌DeepMind) CERMICS Paris, France(巴黎CERMICS研究所) ENPC, CNRS, IPP Marne-la-Vallée, France(巴黎-马恩拉瓦尔大学、国家科学研究中心、IPP马恩拉瓦尔分校) Google Research Paris, France(巴黎谷歌研究)

AI总结 提出正则化大邻域搜索(RLNS),将LNS启发式转化为MCMC采样器,实现无需全局求解器的端到端学习。

详情
AI中文摘要

运筹学从业者通常使用大邻域搜索(LNS)来解决NP难的组合问题,这是一种可扩展的启发式方法,通过局部重新优化其变量的子集来迭代改进当前解。相比之下,大多数现有的将组合优化层集成到神经网络中的方法仍然假设可以访问精确的全局解,这在计算上是难以处理的。我们通过引入正则化大邻域搜索(RLNS)来弥合这一差距。通过正则化或扰动局部子问题,我们将LNS启发式转化为一个高效的MCMC采样器,在可行解的组合集上采样,并关联Fenchel-Young损失。在熵正则化下,我们证明RLNS执行精确的块吉布斯采样。此外,调整RLNS迭代次数使我们能够在伪似然和精确最大似然估计之间插值,从而实现无需全局求解器的端到端学习。我们在$k$-子集选择、广义分配和随机车辆调度问题上展示了我们的方法。

英文摘要

Operations research practitioners typically tackle NP-hard combinatorial problems using large neighborhood search (LNS), a scalable heuristic that iteratively refines a current solution by locally re-optimizing subsets of its variables. In contrast, most existing approaches for integrating combinatorial optimization layers into neural networks still assume access to an exact global solution, which is computationally intractable. We bridge this gap by introducing regularized LNS (RLNS). By regularizing or perturbing local subproblems, we turn the LNS heuristic into an efficient MCMC sampler over the combinatorial set of feasible solutions, with associated Fenchel-Young losses. Under entropic regularization, we prove that RLNS performs exact block Gibbs sampling. Furthermore, adjusting the number of RLNS iterations allows us to interpolate between pseudolikelihood and exact maximum likelihood estimation, for end-to-end learning without global solvers. We demonstrate our approach on $k$-subset selection, generalized assignment, and stochastic vehicle scheduling problems.

2606.02288 2026-06-02 cs.LG 版本更新

Massive Spikes in LLMs are Bias Vectors: Mechanistic Uncovering and Spike-Free Quantization

LLM中的大规模尖峰是偏置向量:机制揭示与无尖峰量化

Yung-Chin Chen, Chung Peng Lee, Ze-Wei Liou, Naveen Verma

发表机构 * Princeton University(普林斯顿大学) EnCharge AI

AI总结 本文通过机制分析发现LLM中的激活尖峰本质上是结构化的向量偏置,并提出无尖峰量化框架INSERTQUANT,实现鲁棒的低比特量化。

详情
AI中文摘要

大型语言模型(LLM)中的大规模激活尖峰通过拉伸动态范围严重降低了量化性能。虽然先前的假设将这些尖峰描述为高级标量偏置,但我们认为它们只是携带尖峰的令牌中刚性、结构化的向量偏置的标量中间产物。我们展示了这些令牌在归一化后收敛到常向量,驱动了注意力沉没和值状态耗尽机制。我们通过分析投影权重的协调性从几何上证实了这一点:$W_K$对比性地放大该向量,$W_Q$将语义令牌对齐到它,$W_V$将其投影到谱零空间。此外,我们揭示了模型通过利用低频带和相干通道对将结构偏置定位在“旋转稳定区域”中,从而主动保护这些结构偏置免受旋转位置编码(RoPE)扰动的影响。利用这一点,我们提出了INSERTQUANT,一种后训练量化(PTQ)框架,通过预计算模板向量来钳制尖峰并恢复其功能。这使得激活严格无尖峰,从而实现高保真度的鲁棒低比特量化。INSERTQUANT在LLM上达到了与最先进的每张量量化方法相当的性能,并且独特地泛化到文本以外的其他模态,如ViT。

英文摘要

Massive activation spikes in Large Language Models (LLMs) severely degrade quantization by stretching dynamic ranges. While prior hypotheses characterize these as high-level scalar biases, we argue that they are merely the scalar intermediates of rigid, structural vector biases in the spike-carrying tokens. We show that these tokens converge to constant vectors after normalization that drive the attention sink and value-state drain mechanisms. We geometrically substantiate this by analyzing the coordination of projection weights: $W_K$ contrastively amplifies the vector, $W_Q$ aligns semantic tokens toward it, and $W_V$ projects it into the spectral null-space. Furthermore, we reveal that the model actively preserves these structural biases against Rotary Positional Embedding (RoPE) perturbations by localizing them in "zones of rotational stability" utilizing low-frequency bands and coherent channel pairs. Leveraging this, we propose INSERTQUANT, a post-training quantization (PTQ) framework that clamps spikes and restores their function via pre-computed template vectors. This renders activations strictly spike-free, enabling robust low-bit quantization with high fidelity. INSERTQUANT achieves parity with state-of-the-art per-tensor quantization methods on LLMs and uniquely generalizes beyond text to other modalities such as ViTs.

2606.02287 2026-06-02 cs.LG cs.AI 版本更新

CityTrajBench: A Unified Benchmark for City-Scale Vehicle Trajectory Generation

CityTrajBench: 城市尺度车辆轨迹生成的统一基准

Shibo Zhu, Xiaodan Shi, Dayin Chen, Yuntian Chen, Haoran Zhang, Tianhao Wu, Jinyue Yan

发表机构 * Department of Building Environment and Energy Engineering, The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学建筑环境与能源工程系) Eastern Institute for Advanced Study, Eastern Institute of Technology, Ningbo, China(宁波东部先进研究所) International Centre of Urban Energy Nexus, The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学城市能源 nexus 中心) Department of Computer and Systems Sciences, Stockholm University, Sweden(斯德哥尔摩大学计算机与系统科学系) Zhejiang Key Laboratory of Industrial Intelligence and Digital Twin, Eastern Institute of Technology, Ningbo, China(浙江工业智能与数字孪生重点实验室) Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo, China(宁波数字孪生研究所) LocationMind Inc., Tokyo 101-0042, Japan(LocationMind公司)

AI总结 为解决轨迹生成方法因数据集、预处理、表示和评估指标不一致导致的比较困难,提出CityTrajBench统一基准框架,标准化数据处理、模型适配与多级评估,并在三个真实数据集上对比统计、VAE、GAN、扩散和流匹配模型,揭示不同模型在全局真实性、轨迹几何保真度等指标上的权衡。

详情
AI中文摘要

城市轨迹生成是交通模拟、城市规划和移动性分析的基础任务。然而,由于现有研究通常依赖不同的数据集、预处理流程、轨迹表示和评估指标,轨迹生成方法之间的系统比较仍然困难。这种碎片化使得报告的性能差异是否源于生成机制本身或实验协议不一致变得不明确。为解决这一问题,我们提出了CityTrajBench,一个用于城市尺度车辆轨迹生成的统一基准框架和协议。CityTrajBench在共同设置下标准化了数据摄入、轨迹归一化、特征构建、模型适配、地图感知后处理、模型选择和多级评估。它支持异构生成器,包括统计基线、基于VAE、GAN、扩散和流匹配的模型,并在三个真实世界城市轨迹数据集上评估它们。该基准衡量全局空间真实性、行程级分布保真度、轨迹级几何相似性、条件移动一致性和效率。实验揭示了模型家族之间的明确权衡:DiffTraj在轨迹级几何保真度上最强,DiffRNTraj在结构敏感的全局真实性上具有竞争力,而TrajFlow在真实性、质量、条件一致性和效率之间提供了强平衡。同时,一个简单的马尔可夫基线在粗粒度行程和局部移动统计上仍具有竞争力。这些发现表明,城市轨迹生成质量本质上是多目标的,没有单一模型在所有标准上同等占优,并且CityTrajBench为未来城市移动性生成研究提供了可复现的基准协议和测试平台。

英文摘要

Urban trajectory generation is a fundamental task for transportation simulation, urban planning, and mobility analytics. However, systematic comparison across trajectory generation methods remains difficult because existing studies often rely on different datasets, preprocessing pipelines, trajectory representations, and evaluation metrics. This fragmentation makes it unclear whether reported performance differences arise from the generation mechanism itself or from inconsistent experimental protocols. To address this issue, we present CityTrajBench, a unified benchmark framework and protocol for city-scale vehicle trajectory generation. CityTrajBench standardizes data ingestion, trajectory normalization, feature construction, model adaptation, map-aware post-processing, model selection, and multi-level evaluation under a common setting. It supports heterogeneous generators, including statistical baselines, VAE-based, GAN-based, diffusion-based, and flow-matching-based models, and evaluates them on three real-world urban trajectory datasets. The benchmark measures global spatial realism, trip-level distribution fidelity, trajectory-level geometric similarity, conditional mobility consistency, and efficiency. Experiments reveal clear trade-offs across model families: DiffTraj is strongest on trajectory-level geometric fidelity, DiffRNTraj is competitive on structure-sensitive global realism, and TrajFlow provides a strong balance across realism, quality, conditional consistency, and efficiency. Meanwhile, a simple Markov baseline remains competitive on coarse-grained trip and local-movement statistics. These findings show that urban trajectory generation quality is inherently multi-objective, that no single model dominates all criteria equally, and that CityTrajBench provides a reproducible benchmark protocol and testbed for future research on urban mobility generation.

2606.02278 2026-06-02 eess.SY cs.LG cs.SY 版本更新

Physics-Guided Recurrent State-Space Neural Networks for Multi-Step Prediction

物理引导的循环状态空间神经网络用于多步预测

Ruiyuan Li, Ajay Seth, Manon Kok

发表机构 * Delft Center for Systems and Control, TU Delft, the Netherlands(代尔夫特系统与控制中心,代尔夫特理工大学,荷兰) Department of Biomechanical Engineering, TU Delft, the Netherlands(生物力学工程系,代尔夫特理工大学,荷兰)

AI总结 提出PG-RSSNN,一种结合物理知识和循环结构的状态空间神经网络,通过缓解梯度消失和数值发散风险,在有限数据和部分物理模型下提升多步预测性能。

Comments 6 pages, 3 figures. Accepted at IFAC World Congress 2026

详情
AI中文摘要

状态空间模型传统上基于物理知识,但由于模型不准确,这些物理模型的多步预测可能较差。黑盒深度学习作为替代方案显示出潜力,但这些方法依赖于大量数据集的可用性,且潜在可用的物理知识被忽略。我们提出PG-RSSNN,一种物理引导的循环状态空间神经网络,它结合循环结构以在多步预测中使用非饱和激活函数。它缓解了梯度消失,并消除了现有结构中因反馈状态估计而导致的训练数值发散风险。在多个具有不同物理模型不完善性的系统上(从带高斯噪声的线性状态空间模型到机械臂和级联水箱系统)的实验结果表明,与黑盒神经网络和纯物理模型相比,所提出的PG-RSSNN即使在训练数据有限且物理模型仅部分已知的情况下,也能保持稳定的训练行为,并改善多步预测。

英文摘要

State-space models are traditionally based on physical knowledge, but multi-step predictions from these physical models can be poor due to model inaccuracy. Black-box deep learning has shown promise as an alternative. However, these methods rely on the availability of large datasets and potentially available physical knowledge is neglected. We propose the PG-RSSNN, a physics-guided recurrent state-space neural network that incorporates recurrent structures to enable the use of non-saturating activation functions in multi-step prediction. It mitigates the vanishing gradients and eliminates the risk of numerical divergence in training seen in existing structures that feed back state estimates. Results across multiple systems with various physical model imperfections, from linear state-space models with Gaussian noise to a robotic arm and a cascaded water tank system, show that the proposed PG-RSSNN maintains stable training behavior, and improves multi-step predictions, as compared with black-box neural networks and physics-only models, even with limited training data and when physical models are only partially known.

2606.02276 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Cross-modal linkage risk in clinical vision-language models

临床视觉-语言模型中的跨模态链接风险

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

发表机构 * Lab for AI in Medicine(医学人工智能实验室) RWTH Aachen University(亚琛工业大学) Department of Diagnostic and Interventional Radiology(诊断与介入放射学部门)

AI总结 研究临床视觉-语言模型(VLM)在图像与报告分离场景下通过余弦相似度实现跨模态重链接的风险,并采用仅对投影头进行差分隐私微调的方法在保持图像效用同时显著降低重链接率。

详情
AI中文摘要

在配对胸部X光片和放射学报告上训练的视觉-语言模型(VLM)学习了一个共享嵌入空间,该空间可以保留实例级别的图像-报告对应关系。这在故意将X光片和报告在获取后分开的场景中(例如仅图像数据共享或受控访问的报告)构成了隐私风险,因为一个去标识的图像可能仅通过余弦相似度就重新链接到其原始叙述性报告。我们将此形式化为图像到报告的检索,并使用公共配对队列(其中真实配对是已知的)作为基准来审计风险,而不是作为隐私场景。在来自MIMIC-CXR(43,793个保留对)和外部CheXpert Plus(29,296个对)的126,804名患者的406,241个配对示例上评估了临床专业化程度递增的VLM,我们发现重链接率随专业化程度系统性地上升:最强的VLM在候选池N=100时以15倍随机概率检索到正确报告,在N=10,000时以50倍随机概率,在全数据库规模下仍远高于随机概率。该信号在去除疾病标签捷径的病理匹配困难负样本下仍然存在,表明对应关系超出了广泛的诊断类别。为了在不重新训练的情况下减少这种风险,我们冻结了两个编码器,仅对定义对齐层的投影头应用差分隐私优化(epsilon=0.34,delta=6x10^-6)。这使得MIMIC-CXR上N=10,000时的Recall@1降低了61.8%,并无需重新训练即可迁移到CheXpert Plus,同时图像侧效用基本保持:线性探针分类在14个标签上的宏AUROC仅从79.63%变为79.43%。对共享对齐层的定向DP微调可以大幅减少跨模态重链接,而不会实质性降低使这些模型在临床上有用的图像表示。

英文摘要

Vision-language models (VLMs) trained on paired chest radiographs and radiology reports learn a shared embedding space that can preserve instance-level image-report correspondence. This poses a privacy risk in settings where radiographs and reports are deliberately kept separate after acquisition, such as image-only data sharing or access-controlled reports, because a de-identified image may be re-linked to its original narrative report through cosine similarity alone. We formalized this as image-to-report retrieval and used public paired cohorts, in which the true pairing is known by design, as ground-truth benchmarks to audit the risk rather than as the privacy scenario. Evaluating VLMs of increasing clinical specialization on 406,241 paired examples from 126,804 patients across MIMIC-CXR (43,793 held-out pairs) and external CheXpert Plus (29,296 pairs), we found that re-linkage rose systematically with specialization: the strongest VLM retrieved the correct report at 15 times chance at a candidate pool of N = 100, 50 times chance at N = 10,000, and well above chance at full-database scale. The signal persisted under pathology-matched hard negatives that removed disease-label shortcuts, indicating correspondence beyond broad diagnostic categories. To reduce it without retraining, we froze both encoders and applied differentially private optimization only to the projection heads defining the alignment layer (epsilon = 0.34, delta = 6x10-6). This reduced Recall@1 by 61.8% at N = 10,000 on MIMIC-CXR and transferred to CheXpert Plus without retraining, while image-side utility was largely preserved: macro AUROC for linear-probe classification across 14 labels shifted only from 79.63% to 79.43%. Targeted DP finetuning of the shared alignment layer can substantially reduce cross-modal re-linkage without materially degrading the image representations that make these models clinically useful.

2606.02267 2026-06-02 cs.LG cs.CV 版本更新

A combination of noise and bilateral filters achieve supralinear and scalable adversarial robustness in CNNs

噪声与双边滤波的组合在CNN中实现超线性且可扩展的对抗鲁棒性

Nicolas Stalder, Benjamin F. Grewe, Matteo Saponati, Pau Vilimelis Aceituno

发表机构 * Institute of Neuroinformatics ETH Zürich, University of Zürich(神经信息学研究所,苏黎世联邦理工学院,苏黎世大学)

AI总结 本文提出结合高斯噪声和双边滤波的预处理方法,通过互补机制实现超线性对抗鲁棒性提升,并验证其与对抗训练结合后能以更低计算成本达到与最先进防御相当的性能。

Comments Main: 8 pages, 3 figures, 2 Tables. Supplement: 10 pages, 7 figures, 6 Tables

详情
AI中文摘要

深度神经网络对对抗样本的脆弱性对其实际部署构成了重大挑战。现有的增强深度网络鲁棒性的技术依赖于对抗训练,这种方法虽然强大,但计算密集且通常针对特定攻击类型。为了解决这些局限性,现有工作探索了添加高斯噪声或滤波图像等技术,这两种技术都能适度提升网络对各种对抗攻击的鲁棒性。在此,我们从理论上证明,这两种方法通过互补机制增强对抗鲁棒性,当结合时产生超线性鲁棒性。基于这一见解,我们通过实验表明,一个结合高斯噪声和双边滤波的简单预处理器能以最小计算成本实现对抗鲁棒性的超线性提升。接下来,我们将预处理器与对抗训练结合,并在RobustBench上进行测试,评估其相对于最先进防御的超线性改进。首先,该组合在AutoAttack上排名第二,总体排名第三,同时仅使用约35%的训练FLOPs,模型参数减少约50%,训练轮次减少约33%,数据量减少约15%(与最先进防御相比)。其次,我们的方法高效可扩展,在三个数量级上以大约2-8倍的总计算量匹配竞争模型的准确率。总体而言,我们的方法提供了一个有原则且易于集成的框架来增强对抗鲁棒性,具有可忽略的计算开销和简单但理论扎实的设计。

英文摘要

The vulnerability of deep neural networks to adversarial examples poses a significant challenge for real-world deployment. Existing techniques to enhance deep network robustness rely on adversarial training, an approach that is powerful but computationally intensive and typically tailored to specific attack types. To address these limitations, existing works have explored techniques such as adding gaussian noise or filtering images, both of which can boost the network robustness to various adversarial attacks, albeit modestly. Here, we theoretically demonstrate that these two approaches enhance robustness against adversarial attacks through complementary mechanisms, resulting in supralinear robustness when combined. Building on this insight, we experimentally show that a simple preprocessor combining Gaussian noise and bilateral filtering yields supralinear improvements in adversarial robustness with minimal computational cost. Next, we combine our preprocessor with adversarial training and test on RobustBench to assess its supralinear improvement over state-of-the-art defenses. First, this combination ranks second on AutoAttack and third overall, while using only $\sim$35% of the training FLOPs, using a model with $\sim$50% less parametets, trained with $\sim$33% of the epochs and $\sim$15% the data compared to state-of-the-art defenses. Second, our method scales efficiently, matching the accuracy of competing models with roughly 2-8x less total compute across 3 orders of magnitude. Overall, our approach provides a principled and easily integrable framework for enhancing adversarial robustness, offering negligible computational overhead and a simple yet theoretically grounded design.

2606.02256 2026-06-02 cs.LG 版本更新

ArrythML: An Autoencoder-Based TinyML Approach for On-Device Arrhythmia Detection on Resource-Constrained Embedded Systems

ArrythML: 一种基于自动编码器的TinyML方法,用于资源受限嵌入式系统上的设备内心律失常检测

Nagarajan S, Kurian Polachan

发表机构 * International Institute of Information Technology, Bangalore, India(国际信息科技研究所,班加罗尔,印度)

AI总结 提出一种基于INT8量化自动编码器的TinyML模型,在ESP32-S3微控制器上实现实时、低功耗的ECG分割与心律失常检测,达到84%召回率和79% F1分数。

Comments 19 pages,

详情
AI中文摘要

我们的工作提出了一种使用TinyML模型进行ECG分割和心律失常检测的方法,用于资源受限嵌入式系统上的实时设备内推理。我们开发了基于INT8量化自动编码器的TinyML模型,具有最少的层数和参数,适用于嵌入式部署。这些模型使用来自MIT-BIH心律失常数据库的自定义数据集进行评估,并在基于PC的模拟和设备内环境中进行了验证。在评估中,超过95,000个ECG片段在运行TensorFlow Lite Micro运行时的ESP32-S3微控制器上进行了处理。评估后,进行了详细分析,包括按注释和按记录的失败分析,以表征模型在不同ECG形态和心律模式上的行为,并解释漏检情况。在几种情况下,明显的误分类可能对应于参考注释中标记为正常的早期或微妙异常模式,突显了模型的敏感性。通过过滤数据集中模糊案例的细化评估显示,性能最佳的基于DNN的自动编码器实现了84%的召回率、79%的F1分数、约180 KB的模型大小和9 ms的设备内推理延迟。这些结果证明了低功耗、保护隐私的嵌入式可穿戴系统的可行性,该系统能够完全在设备上执行准确的心律失常检测。

英文摘要

Our work presents a method for ECG segmentation and arrhythmia detection using Tiny Machine Learning (TinyML) models for real-time, on-device inference on resource-constrained embedded systems. We develop INT8 quantized autoencoder-based TinyML models with minimal layers and parameters for embedded deployment. These models are evaluated using a custom dataset derived from the MIT-BIH Arrhythmia Database and validated in both PC-based simulations and on-device environments. For the evaluations, over 95,000 ECG segments are processed on an ESP32-S3 microcontroller running the TensorFlow Lite Micro runtime. Post-evaluation, detailed analysis, including annotation-wise and record-wise failure analysis, is conducted to characterize model behavior across diverse ECG morphologies and rhythm patterns and to explain missed detections. In several cases, apparent misclassifications may correspond to early or subtle anomaly patterns labeled as normal in the reference annotations, highlighting the model's sensitivity. A refined evaluation by filtering out ambiguous cases in the dataset shows that the best-performing DNN-based autoencoder achieves a recall of 84%, an F1-score of 79%, a model size of approximately 180 KB, and an inference latency of 9 ms on-device. These results demonstrate the feasibility of low-power, privacy-preserving embedded wearable systems capable of performing accurate arrhythmia detection entirely on-device.

2606.02247 2026-06-02 stat.ML cs.LG 版本更新

ShaplEIG: Bayesian Experimental Design for Shapley Value Estimation

ShaplEIG:用于Shapley值估计的贝叶斯实验设计

David Rundel, Fabian Fumagalli, Maximilian Muschalik, Bernd Bischl, Matthias Feurer

AI总结 提出ShaplEIG方法,通过高斯过程代理和期望信息增益自适应选择联盟,以高效估计Shapley值,在低预算场景下显著提升样本效率。

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

Shapley值是一种原则性的归因度量,广泛用于可解释机器学习,但其精确计算随玩家数量呈指数增长,促使了基于采样联盟价值函数评估的各种近似方法。这引发了一个问题:能否通过根据先前评估自适应选择联盟来提高近似精度?这在价值函数昂贵且评估次数严重受限的设置中尤为重要,例如基于重训练的特征重要性、数据估值和超参数重要性。为此,我们提出ShaplEIG,一种贝叶斯实验设计方法,该方法使用高斯过程代理近似昂贵的价值函数,并根据联盟对Shapley值的期望信息增益自适应选择联盟。通过Shapley值在价值函数中的线性性质,我们证明了期望信息增益具有封闭形式。此外,我们提出了一种高效计算方案,通过初等对称多项式将复杂度从指数级降低到玩家数量的多项式级。在多种昂贵应用的广泛实验中,我们的方法在低预算场景下始终优于最先进的基线方法,提高了样本效率。

英文摘要

Shapley values are a principled attribution measure widely used in interpretable machine learning, but their exact computation scales exponentially with the number of players, motivating a wide range of approximation methods based on value function evaluations of sampled coalitions. This raises the question of whether approximation accuracy can be improved by adaptively selecting coalitions for evaluation based on previous evaluations. This is particularly relevant in settings where the value function is costly and the number of evaluations is severely limited, such as retraining-based feature importance, data valuation, and hyperparameter importance. For this purpose, we propose ShaplEIG, a Bayesian experimental design approach that approximates the expensive value function using a Gaussian process surrogate and adaptively selects coalitions based on their expected information gain about the Shapley values. By the linearity of the Shapley values in the value function, we show that the expected information gain is available in closed form. Furthermore, we propose an efficient computation scheme that reduces the complexity from exponential to polynomial in the number of players via elementary symmetric polynomials. In extensive experiments across diverse costly applications, our method consistently improves sample efficiency in the low-budget regime over state-of-the-art baselines.

2606.02242 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification

解决基于图像和基于文本的行人重识别之间的优化冲突

Karina Kvanchiani, Timur Mamedov

发表机构 * Tevian, Russia(俄罗斯Tevian) Lomonosov Moscow State University, Russia(俄罗斯罗蒙诺索夫莫斯科国立大学)

AI总结 针对图像与文本行人重识别任务因模态差异和目标冲突导致共享表示次优的问题,提出解耦两阶段训练流程,使用单一视觉编码器避免跨任务干扰,实验表明图像预训练和文本监督能提升双任务性能。

详情
AI中文摘要

基于图像(I2I)和基于文本(T2I)的行人重识别(ReID)的联合优化受到模态差异和冲突训练目标的阻碍,导致共享表示次优。虽然I2I ReID关注同一人图像间的身份级不变性,但T2I ReID由与独特视觉特征相关的实例特定文本描述驱动。本文探讨了两个ReID任务及其优化过程之间的根本差异,以实现有效训练。由于I2I和T2I ReID通常分开研究,为一种检索设置优化的损失函数可能对另一种所需的表示质量产生负面影响。基于这些发现,我们提出了一种解耦的两阶段训练流程,用于学习跨图像和文本模态的共享表示。该流程基于单个视觉编码器,支持I2I和T2I检索,同时避免训练期间的跨任务干扰。我们在多种配置下进行了大量实验,改变了域混合程序、学习策略和任务目标。我们观察到I2I ReID预训练对T2I数据的泛化能力有积极影响。此外,我们发现视觉编码器训练阶段引入文本监督能提升I2I和T2I性能。我们相信,我们的见解为统一的ReID系统和跨模态检索整体迈出了有意义的一步。

英文摘要

The joint optimization of image-based (I2I) and text-based (T2I) person re-identification (ReID) is hindered by modality discrepancies and conflicting training objectives, leading to suboptimal shared representations. While I2I ReID focuses on identity-level invariance across images of the same person, T2I ReID is driven by instance-specific textual descriptions tied to unique visual traits. This paper explores the fundamental difference between two ReID tasks and their optimization processes for effective training. Since I2I and T2I ReID are often studied separately, the loss functions optimized for one retrieval setting may negatively affect the representation quality required by the other. Motivated by these findings, we propose a decoupled two-stage training pipeline for learning a shared representation across image and text modalities. The pipeline is based on a single vision encoder that supports both I2I and T2I retrieval while avoiding cross-task interference during training. We provide extensive experiments across multiple configurations, varying domain mixing procedures, learning strategies, and task objectives. We observed that I2I ReID pre-training positively impacts the generalization ability to T2I data. Besides, we find that incorporating textual supervision during the vision encoder training stage enhances both I2I and T2I performance. We believe our insights provide a meaningful step toward unified ReID systems and cross-modal retrieval overall.

2606.02241 2026-06-02 cs.LG 版本更新

BlockGen: Flexible Blockwise Sequence Modeling with Hybrid Samplers

BlockGen: 灵活的分块序列建模与混合采样器

Justin Deschenaux, Caglar Gulcehre

发表机构 * EPFL Lausanne, Switzerland(瑞士洛桑联邦理工学院) Microsoft AI(微软人工智能)

AI总结 提出BlockGen框架,通过分块序列建模和AR-informed预测-校正采样,比较均匀态扩散与掩码扩散在分块生成中的性能。

详情
AI中文摘要

均匀态扩散框架是否为离散扩散更强大的范式?最近的研究表明情况可能如此。结合预测-校正采样器,均匀态扩散模型(USDMs)生成的样本质量高于掩码扩散模型(MDMs),并且在下游任务中USDMs与MDMs相当或更优,尽管它们表现出更大的困惑度。两个问题仍未解决。首先,现有工作比较均匀和掩码扩散时使用了无信息的校正器,这些校正器在随机位置重新注入噪声,而不是针对最可能出错的标记。其次,先前的工作比较了全序列扩散模型,因此我们不知道当逐块生成标记时是否得出相同的结论。为了解决这些问题,我们引入了BlockGen,一种分块序列模型,我们使用掩码和均匀扩散两种方式实例化它。BlockGen在混合块大小上训练,其似然性比固定块大小的模型更精细地在自回归和纯扩散之间插值。BlockGen实现了AR-informed预测-校正采样(ARPC),它结合了AR和扩散预测来重新生成不太可能的标记,而无需辅助验证器。在祖先采样下,均匀扩散在逐块设置中优于掩码扩散,尤其是在少步数情况下。在ARPC下,差距缩小并在高NFE时反转。在GSM8K上使用块大小16时,MDMs达到略高于USDMs的准确率,我们在OpenWebText上的生成困惑度中也观察到类似趋势。代码见https://github.com/jdeschena/blockgen。

英文摘要

Is the uniform-state diffusion framework a more powerful paradigm for discrete diffusion? Recent studies indicate that this may be the case. In combination with predictor-corrector samplers, uniform-state diffusion models (USDMs) produce samples of higher-quality than masked diffusion models (MDMs), and USDMs equal or outperform MDMs in downstream tasks, even though they exhibit greater perplexity. Two issues remain unresolved. First, existing work compares uniform and masked diffusion with un-informed correctors that re-inject noise at random positions, rather than targeting tokens most likely to be wrong. Second, prior work compares full-sequence diffusion models, so we do not know whether the same conclusion holds when tokens are generated block by block. To address these issues, we introduce BlockGen, a blockwise sequence model that we instantiate with both masked and uniform diffusion. BlockGen trains on a mixture of block sizes and its likelihood interpolates between AR and pure diffusion more finely than models with a fixed block size. BlockGen enables AR-informed predictor-corrector sampling (ARPC), which combines AR and diffusion predictions to re-generate unlikely tokens without an auxiliary verifier. Under ancestral sampling, uniform outperforms masked in the block-by-block setting, especially in the few-step regime. Under ARPC, the gap closes and reverses at high NFE. With block size $16$ on GSM8K, MDMs reach slightly higher accuracy than USDMs, and we observe a similar trend in Generative Perplexity on OpenWebText. Find our code at https://github.com/jdeschena/blockgen.

2606.02237 2026-06-02 cs.LG 版本更新

Why Are DMD Students Lazy? Understanding the Copying Behavior in Few-Step Distillation

为什么 DMD 学生懒惰?理解少步蒸馏中的复制行为

Shucheng Li, Iolo Jones, Alexander Tong, Michael M. Bronstein

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) AITHYRA, Research Institute for Biomedical AI(生物医学AI研究 institute)

AI总结 本文研究分布匹配蒸馏(DMD)中高维学生模型自发复制教师噪声-数据配对的现象,通过几何自由度受限解释其成因。

详情
AI中文摘要

分布匹配蒸馏(DMD)通过在所有尺度上对齐噪声分布,将预训练扩散模型压缩为高效的少步生成器。原则上,这种分布级监督对教师的特定噪声-数据配对保持不可知;这为学生提供了重新映射潜在噪声的自由度,这一行为在低维设置中一直被观察到。令人惊讶的是,我们发现,在高维设置中,蒸馏学生自发地再现了教师的原始噪声-数据配对,我们将这种现象称为复制。我们证明复制既不是对抗性目标的副产品,也不是教师记忆的结果。相反,我们的证据表明,复制是高维蒸馏过程中学生模型几何自由度有限而产生的一种涌现特性。

英文摘要

Distribution Matching Distillation (DMD) compresses pretrained diffusion models into efficient few-step generators by aligning their noised distributions across all scales. In principle, such distribution-level supervision remains agnostic to specific noise-data pairings of the teacher; this provides the student the freedom to remap latent noise, a behavior consistently observed in low-dimensional settings. Surprisingly, we find that in high-dimensional settings, distilled students spontaneously reproduce the original noise-data pairings of the teacher, a phenomenon we term copying. We demonstrate that copying is neither a byproduct of adversarial objectives nor a result of teacher memorization. Instead, our evidence suggests that copying is an emergent property arising from the limited geometric freedom of the student model during high-dimensional distillation.

2606.02232 2026-06-02 cs.LG 版本更新

A Doeblin-Anchored Contrastive Chart for Learning Markov Transition Kernels

Doeblin锚定对比图:学习马尔可夫转移核

Ao Xu

发表机构 * School of Artificial Intelligence, Jilin University(吉林大学人工智能学院) Zhongguancun Academy(中关村学院)

AI总结 提出一种基于Doeblin锚定的对比坐标框架,通过对比目标学习有效的马尔可夫转移核,并引入可测马尔可夫化算子保证核有效性,实现非参数收敛率与有限时域误差界。

详情
AI中文摘要

学习马尔可夫转移模型不仅仅是条件密度估计:学习到的对象必须是一个有效的转移核,才能在后续动力学中迭代。本文介绍了一种Doeblin锚定对比图,这是一个从统计到动力学的坐标框架,用于从对比目标学习转移核。给定一个重启律和一个锚定强度,该图将目标转移与重启律混合。得到的锚定核同时是一个Doeblin小化马尔可夫核、一个二元对比实验中的正条件律,以及原始转移律的一个显式可逆坐标。我们证明了锚定对比风险识别锚定转移密度,并将超额风险校准为密度误差。由于学习得分的反演可能产生有符号或未归一化的对象,我们引入了一个可测马尔可夫化算子,在保持积分$L^1$精度(最多一个常数因子)的同时恢复核有效性。Oracle不等式和Hölder-ReLU逼近界给出了独立转移对的非参数速率。对于平稳几何$\beta$-混合轨迹,一个保守的稀疏化与耦合扩展以有效样本量提供了相同的重建接口。在显式覆盖下,占用加权扰动界将一步核误差转化为有限时域边际、路径律和占用测度误差。

英文摘要

Learning a Markov transition model is not merely conditional density estimation: the learned object must be a valid transition kernel before it is iterated in downstream dynamics. This paper introduces a Doeblin-anchored contrastive chart, a statistical-to-dynamical coordinate framework for learning transition kernels from contrastive objectives. Given a restart law and an anchor strength, the chart mixes the target transition with the restart law. The resulting anchored kernel is simultaneously a Doeblin-minorized Markov kernel, the positive conditional law in a binary contrastive experiment, and an explicitly invertible coordinate for the original transition law. We prove that the anchored contrastive risk identifies the anchored transition density and calibrates excess risk to density error. Since inversion of a learned score may produce a signed or unnormalized object, we introduce a measurable Markovization operator that restores kernel validity while preserving integrated $L^1$ accuracy up to a constant factor. Oracle inequalities and Hölder--ReLU approximation bounds yield nonparametric rates for independent transition pairs. For stationary geometrically $β$-mixing trajectories, a conservative thinning-and-coupling extension yields the same reconstruction interface with an effective sample size. Occupancy-weighted perturbation bounds transfer one-step kernel error to finite-horizon marginal, path-law, and occupation-measure errors under explicit coverage.

2606.02228 2026-06-02 stat.ML cs.CV cs.LG 版本更新

Bayesian meta-learning for modeling Alzheimer's disease progression

贝叶斯元学习用于阿尔茨海默病进展建模

Clara Hoffmann, Nadja Klein

发表机构 * Scientific Computing Center, Karlsruhe Institute of Technology, Germany(卡尔斯鲁厄理工学院科学计算中心,德国) Alzheimer’s Disease Neuroimaging Initiative(阿尔茨海默病神经影像计划)

AI总结 提出贝叶斯元学习方法,利用个体历史MRI体积和疾病轨迹预测疾病评分分布,无需重新训练即可动态预测,并减少长期预测的过度自信。

详情
AI中文摘要

预测阿尔茨海默病患者将经历轻度还是重度疾病进展对于个性化治疗至关重要。通常,临床医生试图预测离散疾病评分的分布,条件是个体当前的MRI体积及其历史疾病轨迹。经典的统计回归模型和单任务神经网络不适合此目的,因为拟合单独模型不可行(每个个体通常只有少量观测),而忽略个体间相关性会导致泛化能力差。相比之下,元学习提供了一种自然的方法来动态预测分布,无需重新训练,并能建模结果与协变量之间的非线性关系。受此启发,我们提出了一种贝叶斯元学习器,它在多个个体上训练,但根据每个个体的历史数据定制预测的疾病评分分布。我们的模型无需重新训练即可预测未见过的个体,与历史观测数量呈线性扩展,并且在预测长期疾病评分时,与确定性对应模型相比,保证更少的过度自信。在阿尔茨海默病神经影像学倡议(ADNI)数据库的真实世界数据上,我们的模型在性能上与单任务模型和确定性元学习器相当,同时在预测长期疾病进展时显著提高了性能。

英文摘要

Predicting whether an individual with Alzheimer's disease will experience mild or severe disease progression is essential for personalized treatment. Typically, practitioners seek to predict the distribution of a discrete disease score, conditional on an individual's current MRI volume and their historical disease trajectory. Classical statistical regression models and single-task neural networks are not well-suited for this purpose because fitting separate models is infeasible (since each individual typically has few observations), while ignoring individual-level correlation leads to poor generalization. Meta-learning, in contrast, provides a natural avenue to dynamically predict distributions without retraining and model nonlinear relationships between the outcome and covariates. Motivated by this, we propose a Bayesian meta-learner that is trained on multiple individuals but tailors the predictive disease score distribution to each individual's historical data. Our model predicts on unseen individuals without retraining, scales linearly with the number of historical observations, and is guaranteed to be less overconfident when predicting long-term disease scores compared to its deterministic counterpart. On real-world data from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database, our model achieves performance competitive with both single-task models and deterministic meta-learners, while substantially improving performance when predicting long-term disease progression.

2606.02223 2026-06-02 cs.LG math.ST stat.ME stat.TH 版本更新

Network Learning with Semi-relaxed Gromov-Wasserstein

半松弛Gromov-Wasserstein的网络学习

Charles Dufour, Ulysse Naepels, Leonardo V. Santoro

发表机构 * EPFL, Institute of Mathematics(苏黎世联邦理工学院数学研究所) Lausanne, Switzerland(瑞士拉沃斯)

AI总结 针对大规模网络生成机制估计中的节点标签缺失问题,提出半松弛Gromov-Wasserstein目标函数,通过概率耦合松弛分配问题,利用块坐标条件梯度算法求解,并证明松弛解与确定性分配的最优性差距以O(1/n)速率消失,实现随机块模型和Hölder光滑图模型的相合性与极小化最优收敛速率。

详情
AI中文摘要

估计大规模网络的生成机制是统计机器学习中的一个基本挑战。由于缺乏规范的节点标签,识别潜在连接结构通常是一个NP难的组合问题。我们通过允许概率耦合来应对这一挑战,从而松弛了分配问题。我们的估计框架可以表述为半松弛Gromov-Wasserstein目标,并提供了生成结构的低维表示。我们通过块坐标条件梯度算法求解该问题。尽管进行了松弛,但所得解通常是确定性的:事实上,我们证明了松弛解与确定性分配之间的最优性差距以$O(1/n)$的速率消失,其中$n$是节点数。这使得潜在模型的可处理恢复成为可能,并能够进行严格的统计分析:我们为随机块模型和Hölder光滑图模型建立了相合性和极小化最优收敛速率。我们的实现在合成和真实数据集上均展示了随$n$的高效扩展能力。

英文摘要

Estimating the generative mechanism of large-scale networks is a fundamental challenge in statistical machine learning. It requires the identification of the latent connectivity structure, which is in general an NP-hard combinatorial problem due to the absence of canonical node labels. We address this challenge by allowing for probabilistic couplings, thereby relaxing the assignment problem. Our estimation framework can be formulated as a semi-relaxed Gromov-Wasserstein objective and provides a low-dimensional representation of the generative structure. We solve this via a block-coordinate conditional gradient algorithm. Despite the relaxation, the resulting solution is typically deterministic: in fact, we show that the optimality gap between the relaxed solution and the deterministic assignment vanishes at rate $O(1/n)$, where $n$ is the number of nodes. This allows for tractable recovery of the underlying model and enables rigorous statistical analysis: we establish consistency and minimax-optimal convergence rates for both stochastic block models and Holder-smooth graphons. Our implementation scales efficiently with $n$, as demonstrated on both synthetic and real-world datasets.

2606.02221 2026-06-02 cs.CV cs.LG 版本更新

CORE-MTL: Rethinking Gradient Balancing via Causal Orthogonal Representations

CORE-MTL: 通过因果正交表示重新思考梯度平衡

Chengfeng Wu, Tao Zou, Yanru Wu, Jingge Wang

发表机构 * Tsinghua University(清华大学)

AI总结 提出CORE-MTL框架,通过因果正交表示将共享表示分解为语义流和残差流,以分离任务相关结构与虚假上下文,从而减少负迁移并提升泛化能力。

Comments Accepted by ICML 2026

详情
AI中文摘要

多任务学习旨在通过跨领域共享共同表示来构建联合模型。为实现这一目标,现有的优化中心方法要么平衡任务梯度,要么修改共享架构。然而,由于这些方法对共享表示的内容不可知,它们无法将任务相关结构与虚假上下文分离,导致负迁移和泛化能力差。为克服这一限制,我们提出了用于多任务学习的因果正交表示(CORE-MTL),这是一个因果驱动的表示中心框架,鼓励对共享表示进行结构化的语义-残差分解,将任务相关结构集中在语义流中,而将干扰变化归入残差流。我们通过利用结构化场景的物理先验和属性的统计约束,在视觉领域实例化了该框架。理论上,我们的方法比优化中心方法具有更紧的分布外泛化界,并且无需显式梯度投影或重新加权即可减少任务梯度干扰。实验上,CORE-MTL在视觉多任务基准测试中,在分布内和分布外设置下均持续优于现有方法。代码公开于 https://github.com/Hope-Rita/CORE-MTL。

英文摘要

Multi-task learning (MTL) aims to construct a joint model for multiple tasks by sharing a common representation across domains. To achieve this goal, existing optimization-centric methods either balance task gradients or modify the shared architecture. However, as these approaches remain agnostic to the content of the shared representation, they fail to disentangle task-relevant structure from spurious context, leading to negative transfer and poor generalization. To overcome this limitation, we propose Causal Orthogonal Representations for Multi-Task Learning (CORE-MTL), a causally motivated representation-centric framework that encourages a structured semantic-residual factorization of the shared representation, concentrating task-relevant structure in the semantic stream while relegating nuisance variation to the residual stream. We instantiate this framework in the visual domain by leveraging physical priors for structured scenes and statistical constraints for attributes. Theoretically, our method enjoys a tighter out-of-distribution generalization bound than optimization-centric methods and reduces task gradient interference without explicit gradient projection or reweighting. Empirically, CORE-MTL consistently outperforms existing methods on visual multi-task benchmarks in both in-distribution and out-of-distribution settings. Code is publicly available at https://github.com/Hope-Rita/CORE-MTL.

2606.02218 2026-06-02 cs.LG cs.AI 版本更新

Faster Synchronous On-Policy RL via Straggler-Aware Group Sizing

通过感知掉队者的组大小调整实现更快的同步在线策略强化学习

Azal Ahmad Khan, Ammar Ahmed, Zeshan Fayyaz, Sheng Di, Mingyi Hong, Ali Anwar

发表机构 * University of Minnesota(明尼苏达大学) University of Waterloo(滑铁卢大学) Argonne National Laboratory(阿贡国家实验室)

AI总结 提出动态组大小控制器SAGC,通过在线约束优化调整组大小,减少同步在线策略强化学习中的掉队者事件,提升墙钟效率并保持或改善训练奖励和模型质量。

详情
AI中文摘要

同步强化学习方法如组相对策略优化(GRPO)提供稳定且可复现的在线策略训练,但极易受到掉队者的影响——单个异常长的轨迹可能延迟整个组的奖励计算和参数更新。随着组大小增加,这个问题变得更加严重,在更大组的好处与同步停滞的墙钟成本之间产生矛盾。我们提出感知掉队者的组控制(SAGC),一种动态组大小控制器,根据观察到的轨迹行为在线调整训练组。SAGC将组大小选择形式化为一个在线约束优化问题,旨在保留更大组的好处,同时控制掉队者事件的长期发生率。在同步GRPO和DAPO训练中,以及在普通和强工程基线上,SAGC一致地减少了掉队者发生率并提高了墙钟效率,同时实现了有竞争力或更好的训练奖励。我们进一步表明这些收益转化为最终模型质量:在下游推理基准上,SAGC与最强的静态组大小基线相比具有竞争力或更好,并且通常在没有显式长度惩罚的情况下产生更短的输出。这些结果将动态组控制定位为使同步在线策略强化学习更高效和更稳健的实用方法。

英文摘要

Synchronous reinforcement learning methods such as Group Relative Policy Optimization (GRPO) provide stable and reproducible on-policy training, but they are highly vulnerable to stragglers, a single unusually long rollout can delay reward computation and parameter updates for the entire group. This problem becomes more severe as group size increases, creating a tension between the benefits of larger groups and the wall-clock cost of synchronization stalls. We propose Straggler-Aware Group Control (SAGC), a dynamic group-size controller that adapts the training group online based on observed rollout behavior. SAGC formulates group-size selection as an online constrained optimization problem, seeking to retain the benefits of larger groups while controlling the long-term rate of straggler events. Across synchronous GRPO and DAPO training, and on top of both vanilla and strong engineered baselines, SAGC consistently reduces straggler incidence and improves wall-clock efficiency while achieving competitive or better training reward. We further show that these gains transfer to final model quality: SAGC is competitive with or better than the strongest static group-size baseline on downstream reasoning benchmarks, and often produces shorter outputs without any explicit length penalty. These results position dynamic group control as a practical way to make synchronous on-policy RL more efficient and robust.

2606.02198 2026-06-02 cs.LG cs.CY 版本更新

Model Multiplicity and Predictive Arbitrariness in Recidivism Risk Assessment

模型多重性与再犯风险评估中的预测任意性

Ashwin Singh, Carlos Castillo

AI总结 针对再犯风险评估中的预测任意性问题,通过理论下界推导和实证分析,发现相似精度的模型间预测一致性通常高于最坏情况理论保证,并提出采用最低风险分配策略来缓解任意性。

Comments 17 pages, 12 figures

详情
AI中文摘要

针对个体未来的预测任务本质上是嘈杂的,通常会产生多个相似精度的模型。当这些模型对同一个人产生不同预测时,会引发决策中的任意性问题。这种任意性在理论和实践中可能有多严重?如何解决以支持高风险风险评估?我们通过对一个已使用超过15年的基于机器学习的再犯风险评估决策支持系统的研究来回答这些问题。通过将复杂的法律规则转化为标记释放后结果(再犯或非再犯)的算法,我们首先构建了一个包含数千名囚犯释放的数据集。利用该数据集,我们学习可解释的模型,这些模型提高了预测性能,减少了群体间的错误率差异,并确保康复进展降低风险评分。接下来,我们研究预测多重性,首先推导出数据集上任何有限模型集的期望预测一致性的紧下界,然后评估该集合内的结构多样性(例如,不同的模型系数)在多大程度上转化为预测多重性(即对同一人的不同预测)。我们的实验表明,存在许多相似精度的模型且具有可比较的错误率差异并不一定意味着严重的预测多重性。经验上,性能相似的模型可以表现出比最坏情况理论保证高得多的预测一致性。我们发现,一种简单的策略——为每个囚犯分配这些模型中的最低风险——对于解决预测任意性是有效的。

英文摘要

Prediction tasks over individual futures, which are inherently noisy, often admit multiple similarly accurate models. When these models produce different predictions for the same individual, they raise concerns of arbitrariness in decision-making. How severe can this arbitrariness be, in theory and in practice? How can it be resolved to support high-stakes risk assessment? We address these questions through a study of a machine learning-based decision support system for recidivism risk assessment that has been in use for over 15 years. By translating complex legal rules into an algorithm for labeling post release outcomes (recidivist or non-recidivist), we first construct a dataset of thousands of inmate releases. Using this dataset, we learn interpretable models that improve predictive performance, reduce error-rate disparities between groups, and ensure that rehabilitative progress lowers risk scores. Next, we study predictive multiplicity, by first deriving a tight lower bound on the expected predictive agreement of any finite set of models over a dataset, and then by evaluating the extent to which structural diversity (e.g., different model coefficients) within this set translates to predictive multiplicity (i.e., different predictions for the same individual). Our experiments indicate that the existence of many similarly accurate models with comparable error-rate disparities does not necessarily translate into severe predictive multiplicity. Empirically, similarly performant models can exhibit substantially higher predictive agreement than worst-case theoretical guarantees suggest. We find that a simple policy that assigns each inmate the lowest risk among these models is effective for addressing predictive arbitrariness.

2606.02194 2026-06-02 cs.LG 版本更新

Coherent Off-Policy Improvement of Large Behavior Models with Learned Rewards

基于学习奖励的大规模行为模型的一致性离策略改进

Christian Scherer, Joe Watson, Theo Gruner, Daniel Palenicek, Ingmar Posner, Jan Peters

发表机构 * Technical University of Darmstadt(达姆施塔特技术大学) University of Oxford(牛津大学) Zuse School ELIZA(泽努斯学校ELIZA) hessian.AI(海西斯AI) German Research Center for AI (DFKI)(德国人工智能研究中心(DFKI)) Robotics Institute Germany (RIG)(德国机器人研究所)

AI总结 提出一种逆强化学习方法,通过从专家演示中学习稠密奖励函数,结合一致性模仿学习理论保证,实现对预训练策略的离策略改进,在稀疏奖励操作任务中优于强化学习基线。

Comments 13 pages, 7 figures

详情
AI中文摘要

使用行为克隆将专家演示数据蒸馏到大规模生成模型中是一种可扩展的学习机器人控制能力策略的方法,特别是对于灵巧操作。强化学习(RL)可以作为一种利用额外经验进一步微调这些策略的手段。一个开放的问题是RL是否比收集更多人类演示更具样本效率。先前的工作通过将RL应用于一个较小的残差策略来大规模微调预训练策略,该残差策略纠正预训练模型。然而,对于典型的稀疏奖励任务,RL算法可能难以以样本高效的方式优化行为。我们探索逆强化学习,其中从专家演示中学习稠密奖励函数,可能降低RL微调的挑战。我们特别考虑一致性模仿学习,这是一种IRL方法,通过使用具有理论保证的特定奖励公式来促进BC策略的改进。我们展示了我们的IRL方法在所有六个稀疏操作任务上保持或提高了pi-0.5的性能,并在六个复杂操作任务中的五个上实现了≥90%的成功率,优于使用稀疏奖励的基于RL的基线。通过确保我们的初始预训练微调策略对于初始奖励和评论家是最优的,我们的方法避免了RL微调中常见的初始下降,并实现了更快的改进。

英文摘要

Distilling expert demonstration data into large generative models using behavioral cloning is a scalable approach to learning capable policies for robotic control, particularly for dexterous manipulation. Reinforcement learning (RL) can be used as a means to finetune these policies further using additional experience. An open question is whether RL is more sample-efficient than collecting more human demonstrations. Prior work has finetuned large pretrained policies in a scalable fashion by applying RL to a smaller residual policy that corrects the pretrained model. However, for the typical sparse reward tasks, RL algorithms can struggle to optimize the behavior in a sample-efficient manner. We explore inverse reinforcement learning, where a dense reward function is learned from expert demonstrations, potentially reducing the challenge of RL finetuning. We specifically consider coherent imitation learning, an IRL method that facilitates improvement of the BC policy through using a specific reward formulation with theoretical guarantees. We show that our IRL method maintains or improves the performance of pi-0.5 on all six sparse manipulation tasks and achieves a $\geq 90\%$ success rate on five out of six complex manipulation tasks, outperforming RL-based baselines using sparse rewards. By ensuring our initial pretrained finetuning policy is optimal for our initial reward and critic, our method circumvents the initial drop commonly seen in RL finetuning and enables faster improvement.

2606.02184 2026-06-02 cs.DL cs.LG 版本更新

The Ghost Couple: Correlated LLM Name Priors and Their Haunting of the Web and Academic Publishing

幽灵搭档:相关的大语言模型姓名先验及其对网络和学术出版的困扰

Michał Brzozowski, Neo Christopher Chung

发表机构 * Samsung AI Center(三星人工智能中心) University of Warsaw(华沙大学)

AI总结 研究发现大语言模型生成虚构专家姓名时会产生相关性强的角色组合,这些组合具有模型家族特异性,并在Zenodo等平台造成大量幽灵作者记录,影响学术出版。

详情
AI中文摘要

这些名字并不存在。Elena Vasquez 和 Marcus Chen 作为火山专家、宇航员、惊悚小说主角、播客主持人和学术合著者,出现在数百个独立生成的AI生成文档中,却从未存在过。我们表明,大语言模型在生成虚构专家时不仅仅默认使用高概率的单个名字:它们会产生相关的角色组合、配对和三人组,其共现频率远超偶然,并且在独立生成中保持一致。这些先验是模型家族特定的(Claude:Elena Vasquez + Marcus Chen + Amara Okafor;Gemini:Aris Thorne + Lena Petrova;GPT:Elara Voss 无固定搭档)、版本特定的,并且在模型发布边界处被主动抑制,在它们生成的内容中留下可定时的行为指纹。我们记录了一个大规模的下游后果。在Zenodo(一个由CERN运营的、生成真实DataCite DOI的存储库)上,我们识别出1,655条幽灵作者记录,声称不存在的期刊并带有捏造的出版日期:服务器端的DataCite时间戳证明了故意的回溯日期,其中991条记录在一个月内注册;这些记录携带在DataCite中注册的真实DOI,因此任何摄取DOI元数据的学术聚合器都可以获取它们。幽灵名字还出现在ResearchGate上,形成由来自多个模型家族的合作者组成的合成研究小组;这些记录上的出版日期为模型部署窗口提供了可靠的时间代理。

英文摘要

These names do not exist. Elena Vasquez and Marcus Chen have appeared as volcano experts, astronauts, thriller protagonists, podcast hosts, and academic co-authors across hundreds of independently produced AI-generated documents, never having lived. We show that large language models do not merely default to high-probability individual names when generating fictional experts: they produce correlated character ensembles, pairs and trios whose co-occurrence rates far exceed chance and are consistent across independent generations. These priors are model-family-specific (Claude: Elena Vasquez + Marcus Chen + Amara Okafor; Gemini: Aris Thorne + Lena Petrova; GPT: Elara Voss with no fixed partner), version-specific, and actively suppressed at model release boundaries, leaving dateable behavioral fingerprints in the content they produced. We document a downstream consequence at scale. On Zenodo, a CERN-operated repository that mints real DataCite DOIs, we identify 1,655 ghost-authored records claiming nonexistent journals with fabricated publication dates: server-side DataCite timestamps prove deliberate backdating, and 991 records were registered in a single month; these carry real DOIs registered in DataCite, making them harvestable by any scholarly aggregator that ingests DOI metadata. Ghost names additionally appear on ResearchGate forming synthetic research groups with collaborators drawn from multiple model families; publication dates on these records provide a reliable temporal proxy for model deployment windows.

2606.02179 2026-06-02 cs.LG cs.AI cs.CE 版本更新

On the Generalization in Topology Optimization via Sensitivity-Conditioned Bernoulli Flow Matching

关于拓扑优化中通过敏感性条件伯努利流匹配的泛化性

Mohammad Rashed, Duarte F. Valoroso Madeira, Babak Gholami, Caglar Guerbuez, Yunjia Yang, Nils Thuerey

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 通过信息论分析,提出伪敏感性概念,并利用敏感性条件伯努利流匹配生成器在拓扑优化中实现最优的分布外泛化性能。

Comments ICML Paper

详情
AI中文摘要

拓扑优化(TO)的代理模型在分布偏移(如载荷或边界条件变化)下表现出高度可变的分布外(OOD)泛化能力,但这一变异性的来源尚不清楚。我们假设OOD性能取决于条件信号保留关于驱动经典TO的伴随敏感性(简化梯度)的信息量。将TO流程建模为因果马尔可夫链,数据处理不等式表明,在该抽象下,敏感性场是拓扑预测的信息论最优条件信号。然而,计算精确的伴随敏感性在实践中可能昂贵或不可用;我们观察到某些物理场可以通过单调变换近似敏感性。为形式化这一点,我们引入 extbf{伪敏感性}来区分哪些场能够实现泛化,哪些信息贫乏。然后,我们展示了一个敏感性条件的伯努利流匹配生成器实证地证实了这些预测:以敏感性为条件可获得最先进的OOD性能,而越来越远的物理场性能退化至原始参数条件。结果在载荷偏移下的结构TO基准测试和我们新的CFD-TO数据集(边界条件偏移如多出口配置)中均成立。代码和数据集见https://tum-pbs.github.io/topotransformer/。

英文摘要

Surrogate models for topology optimization (TO) exhibit highly variable out-of-distribution (OOD) generalization under distribution shifts such as changing loads or boundary conditions, yet the source of this variability remains unclear. We hypothesize that OOD performance is governed by how much information the conditioning signal preserves about the adjoint sensitivity (reduced gradient) that drives classical TO. Modeling the TO pipeline as a causal Markov chain, the Data Processing Inequality establishes that, under this abstraction, the sensitivity field is an information-theoretically optimal conditioning signal for topology prediction. However, computing exact adjoint sensitivities can be expensive or unavailable in practice; we observe that certain physical fields can approximate sensitivities through monotone transformations. To formalize this, we introduce \textbf{pseudo-sensitivities} to characterize which fields enable generalization versus those that are information-poor. We then show that a sensitivity-conditioned Bernoulli flow-matching generator empirically confirms these predictions: conditioning on sensitivities yields state-of-the-art OOD performance, while increasingly distant physical fields degrade toward raw parameter conditioning. Results hold across structural TO benchmarks under load shifts and our new CFD-TO dataset under boundary-condition shifts such as multi-outlet configurations. Code and datasets are available at https://tum-pbs.github.io/topotransformer/ .

2606.02177 2026-06-02 cs.LG 版本更新

Low-Pass Flow Matching

低通流匹配

Francesco M. Ruscio, T. Konstantin Rusch

发表机构 * ELLIS Institute Tübingen(图宾根ELLIS研究所) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) Tübingen AI Center(图宾根人工智能中心) Liquid AI(液体AI)

AI总结 针对流匹配中白噪声源与自然数据频谱不匹配的问题,提出基于算子调制插值的低通流匹配方法,引入时变频谱偏差,在保持或提升样本质量的同时显著降低采样成本。

Comments ICLR 2026 Delta Workshop

详情
AI中文摘要

流匹配通常依赖于白噪声源,这一选择往往与自然数据的功率谱不一致,自然数据的功率谱倾向于随频率衰减。为了解决这个问题,我们引入了低通流匹配,这是基于算子调制插值的流匹配的一种变体。该公式引入了一种时变频谱偏差,随着路径接近数据,该偏差从源频谱过渡到频率衰减偏差。我们在无条件图像生成任务上验证了我们的方法,包括科学数据集Galaxy10。实验表明,我们的方法与自适应ODE求解器配合使用时特别有效,与标准基线相比,在提高或保持样本质量的同时,大幅降低了采样成本。

英文摘要

Flow Matching typically relies on white noise sources, a choice often misaligned with the power spectra of natural data, which tend to decay with frequency. To address this, we introduce Low-Pass Flow Matching, a variant of Flow Matching based on an operator-modulated interpolant. This formulation induces a time-varying spectral bias that transitions from the source spectrum to a frequency-decaying bias as the path approaches the data. We validate our method on unconditional image generation tasks, including the scientific Galaxy10 dataset. Empirically, we show that our method is particularly effective when paired with adaptive ODE solvers, where it improves or preserves sample quality while substantially reducing sampling cost compared to standard baselines.

2606.02172 2026-06-02 cs.LG cs.CV 版本更新

Closing the Alignment-Maturity Gap in Federated Prototype Learning

缩小联邦原型学习中的对齐-成熟度差距

Mario Casado-Diez, Alejandro Dopico-Castro, Verónica Bolón-Canedo, Bertha Guijarro-Berdiñas

发表机构 * CITIC, Universidade da Coruña(CITIC,科鲁纳大学)

AI总结 针对联邦学习中原型对齐压力抑制局部判别结构的问题,提出FedSAP框架,通过确定性对齐课程和几何驱动代理分离损失稳定表征学习,在多种异质性条件下提升分类性能。

详情
AI中文摘要

从分布式异质数据中学习判别性视觉表示是联邦学习(FL)中的一个基本挑战。基于原型的方法通过跨客户端共享类级表示来解决统计异质性,但在早期训练轮次中会产生距离依赖的梯度压力,这种压力尤其严重:对从噪声局部表示聚合而来的不成熟全局原型施加的对齐压力会产生大梯度,从而抑制局部判别结构的出现。结果导致嵌入空间组织不良,识别性能下降,尤其是在严重的非独立同分布(non-IID)条件下。我们提出FedSAP,一个通过两种互补机制稳定联邦表示学习的框架:一个确定性对齐课程,将全局对齐延迟到局部表示变得稳定;以及一个几何驱动的代理分离损失,利用现有原型库在单位超球面上强制执行类间结构,而不引入额外参数或通信开销。这些机制共同产生紧凑、分离良好的类簇,而不改变联邦参与者之间的底层通信协议。在三个基准测试和不同程度的异质性下的实验表明,与评估的原型基线相比,性能提升高达4个百分点,在高异质性下改进最为显著。我们框架的表示性质还使其能够直接扩展到半监督设置,其中未标记数据只需最小修改即可纳入,突显了调度对齐作为设计原则的通用性。

英文摘要

Learning discriminative visual representations from distributed, heterogeneous data is a fundamental challenge in Federated Learning (FL). Prototype-based methods address statistical heterogeneity by sharing class-level representations across clients but create a distance-dependent gradient pressure that is particularly severe during early training rounds: alignment pressure applied to immature global prototypes, aggregated from noisy local representations, generates large gradients that suppress the emergence of local discriminative structure. The result is a poorly organized embedding space and degraded recognition performance, particularly under severe non-IID conditions. We propose FedSAP, a framework that stabilises federated representation learning through two complementary mechanisms: a deterministic alignment curriculum that delays global alignment until local representations become stable and a geometry-driven proxy separation loss that enforces inter-class structure on the unit hypersphere using the existing prototype bank without introducing additional parameters or communication overhead. Together, these mechanisms produce compact, well-separated class clusters without altering the underlying communication protocol between federation's participants. Experiments across three benchmarks and varying degrees of heterogeneity show gains of up to 4 percentage points over the prototype-based baselines evaluated, with improvements most pronounced under high heterogeneity. The representational nature of our framework further enables a straightforward extension to semi-supervised settings, where unlabelled data is incorporated with minimal modification, underscoring the generality of scheduled alignment as a design principle.

2606.02168 2026-06-02 cs.CV cs.LG 版本更新

Disentanglement-Based Equivariant Learning for Compositional VQA

基于解耦的等变学习用于组合式VQA

Zhou Du, Zhaoquan Yuan, Xiao Wu, Changsheng Xu

发表机构 * IEEE Publication Technology Group(IEEE出版技术组) School of Computing and Artificial Intelligence, Southwest Jiaotong University(计算机与人工智能学院,西南交通大学) Engineering Research Center of Sustainable Urban Intelligence Transportation, Ministry of Education, China(可持续智慧城市交通工程研究中心,中华人民共和国教育部) State Key Laboratory of Multimodal Artificial Intelligence Systems (MAIS), Institute of Automation, Chinese Academy of Sciences(多模态人工智能系统(MAIS)国家重点实验室,自动化研究所,中国科学院) School of Artificial Intelligence, University of Chinese Academy of Sciences(人工智能学院,中国科学院大学)

AI总结 提出DEAL框架,通过因果干预解耦视觉和文本概念,并利用等变约束增强组合推理能力,在CLEVR-CoGenT和GQA-SGL上超越现有方法。

Comments Accepted by IEEE Transactions on Multimedia

详情
Journal ref
IEEE Trans. Multimedia, vol. 27, pp. 8160-8173, 2025
AI中文摘要

组合式视觉问答(VQA)是一项具有挑战性但基础的任务,要求模型理解先前学习概念的新组合。当前方法往往忽视潜在概念的解耦,并且在有效捕捉组合变化机制方面受到限制。此外,最先进的技术依赖于额外的线索进行训练,这在现实世界的VQA场景中不可行。为了解决这些问题,本文提出了一种新颖的基于解耦的等变学习(DEAL)框架用于组合式VQA,该框架仅由真实答案指导。在DEAL中,我们采用因果启发的干预措施,在重新编码框架内解耦来自视觉和文本输入的概念。基于等变性原理,我们随后对推理输入进行组合变换,并对输出施加等变约束,以增强模型的组合推理能力。在基准数据集CLEVR-CoGenT和GQA-SGL上进行的全面实验验证了我们提出的DEAL方法在视觉和语言泛化设置下均优于现有的最先进方法。

英文摘要

Compositional visual question answering (VQA) represents a challenging yet fundamental task that requires models to comprehend novel combinations of previously learned concepts. The current methods often overlook the disentanglement of underlying concepts and are restricted in terms of their ability to effectively capture the compositional variation mechanism. Moreover, the state-of-the-art techniques depend on additional clues for training, which is not feasible in real-world VQA scenarios. To address these issues, in this paper, we introduce a novel Disentanglement-based EquivAriant Learning (DEAL) framework for compositional VQA, which is guided exclusively by ground-truth answers. In DEAL, we employ causality-inspired interventions to disentangle concepts derived from visual and textual inputs within a re-encoding framework. Based on the principle of equivariance, we subsequently perform a compositional transformation on the inference input and impose the equivariant constraint on the output to augment the compositional reasoning capacity of the model. Comprehensive experiments conducted on the benchmark CLEVR-CoGenT and GQA-SGL datasets validate the superiority of our proposed DEAL approach over the existing state-of-the-art methods for compositional VQA tasks in both visual and linguistic generalization settings.

2606.02156 2026-06-02 eess.IV cs.AI cs.CV cs.IR cs.LG 版本更新

Predicting the risk of colorectal anastomotic leak based on preoperative mapping of the blood supply of the bowel

基于术前肠道血供映射预测结直肠吻合口漏风险

Zahra Tabatabaei, Jon Sporring, Mark Bremholm Ellebæk, Alaa El-Hussuna

发表机构 * Computer Science Department, Københavns Universitet (KU)(哥本哈根大学计算机科学系) University of Southern Denmark(南部丹麦大学) Odense University Hospital(奥登塞大学医院) OpenSourceResearch Collaboration(开源研究协作)

AI总结 提出一种基于术前CT影像的AI驱动系统,通过分析血管和组织特征量化吻合口漏风险,并结合内容检索支持临床决策。

详情
AI中文摘要

吻合口漏仍然是结直肠癌手术后最严重的并发症之一,显著影响患者预后、康复轨迹和医疗成本。尽管影像技术有所进步,目前的术前评估仍依赖临床评估,这一过程主观、易出错且高度依赖个人经验。迄今为止,尚无经过验证的基于CT的方法能够在术前预测吻合口漏风险。本方案论文概述了一个全面的框架,用于开发和验证一个AI驱动的系统,该系统利用对比增强前后的CT影像进行术前风险评估。研究描述了数据收集、伦理处理、符合GDPR的患者数据预处理、图像预处理以及旨在生成临床可解释输出的深度学习架构探索等阶段。该工作流程的两个主要成果是:1) 风险评估模块,通过分析CT扫描中的血管和组织特征量化漏液可能性;2) 基于内容的医学图像检索(CBMIR)模块,识别并显示相似历史病例以支持循证手术决策。该方案论文需要医院和大学之间的密切合作;本方案表明,此类系统在现有医疗基础设施内技术上可行且临床可实施。通过遵循所提出的方法论阶段和监管原则,其他机构可以复制此工作流程以开发类似的决策支持工具。最终,这一跨学科框架旨在加强手术规划、减少漏液发生率,并推动向可解释、数据驱动的精准手术的更广泛范式转变。

英文摘要

Anastomotic leak remains one of the most serious complications following colorectal cancer surgery, substantially affecting patient outcomes, recovery trajectories, and healthcare costs. Despite advances in imaging technology, current preoperative assessment relies only on clinical assessment, a process that is subjective, error-prone, and highly dependent on individual expertise. To date, no validated CT-based method exists to predict anastomotic leak risk prior to surgery. This protocol paper outlines a comprehensive framework for developing and validating an AI-driven system for preoperative risk assessment using pre- and post-contrast CT imaging. The study describes the stages of data collection, ethical handling, and preprocessing of patient data in accordance with GDPR, image preprocessing, and the exploration of deep learning architectures designed to generate clinically interpretable outputs. Two integrated tools constitute the main deliverables of this workflow: 1) a risk assessment module, which quantifies the likelihood of leakage by analyzing vascular and tissue features in CT scans, and 2) a Content-Based Medical Image Retrieval (CBMIR) module, which identifies and displays similar historical cases to support evidence-based surgical decision making. The protocol paper requires close collaboration between hospitals and universities; this protocol demonstrates that such a system is technically feasible and clinically implementable within existing healthcare infrastructures. By following the proposed methodological stages and regulatory principles, other institutions can reproduce this workflow to develop analogous decision-support tools. Ultimately, this interdisciplinary framework aims to enhance surgical planning, reduce leak incidence, and contribute to a broader paradigm shift toward explainable, data-driven precision surgery.

2606.02145 2026-06-02 cs.LG 版本更新

Hybrid Neural Ordinary Differential Equations for Data-Efficient Polymerization Modeling with Incomplete Kinetics

混合神经常微分方程用于不完全动力学的高效聚合建模

Marah Almanasreh, Alexander Mitsos, Eike Cramer

发表机构 * RWTH Aachen University, Process Systems Engineering (AVT.SVT)(亚琛工业大学过程系统工程系) JARA-CSD Energy Systems Engineering (ICE-1), Forschungszentrum Jülich(能源系统工程(ICE-1),焦耳中心) Department of Chemical Engineering, Sargent Centre for Process Systems Engineering, University College London(化学工程系,萨金特过程系统工程中心,伦敦大学学院)

AI总结 提出混合神经常微分方程框架,通过仅学习部分表征的有效自由基浓度项,在稀疏数据下实现自由基聚合的准确预测。

Comments 25 pages, 5 figures

详情
AI中文摘要

聚合动力学的准确预测对于过程设计、控制和优化至关重要。然而,纯机理模型需要对部分表征的动力学进行劳动密集型的参数化,而纯数据驱动模型需要大量且多样化的数据集,这些数据集获取成本高昂,尤其是在早期设计阶段。我们提出了一种混合神经常微分方程(NODE)框架,用于自由基聚合的数据高效建模。以甲基丙烯酸甲酯(MMA)的间歇聚合为例,明确保留了机理质量平衡,仅通过神经网络代理从数据中学习部分表征的控制单体消耗的有效自由基浓度,而引发剂分解、链增长和终止等已确立的反应则保持物理建模。在稀疏数据条件下,将混合NODE与离散时间前馈神经网络和纯数据驱动NODE进行比较,模型在规则和不规则采样下仅使用少至十个测量值进行训练。混合NODE始终比两种纯数据驱动基线实现更低的预测误差和更物理一致的外推。在噪声数据和未见操作条件的泛化场景中,混合NODE的RMSE为0.013,而数据驱动NODE为0.31,离散时间模型为0.68,表明在有限数据可用性下,仅学习闭合项而非完整动力学足以实现可靠预测。

英文摘要

Accurate prediction of polymerization dynamics is essential for process design, control, and optimization. Yet, purely mechanistic models require labor-intensive parameterization of partially characterized kinetics, while purely data-driven models demand large, diverse datasets that are costly to obtain, particularly in early-design stages. We propose a hybrid Neural Ordinary Differential Equation (NODE) framework for data-efficient modeling of free-radical polymerization. Using batch polymerization of methyl methacrylate (MMA) as a case study, the mechanistic mass balances are retained explicitly, and only the partially-characterized effective radical concentration governing monomer consumption is learned from data through a neural network surrogate, while established reactions such as initiator decomposition, propagation, and termination remain physically modeled. The hybrid NODE is evaluated against a discrete-time feedforward neural network and a purely data-driven NODE under sparse data conditions, with models trained on as few as ten measurements under both regular and irregular sampling. The hybrid NODE consistently achieves lower prediction errors and more physically consistent extrapolations than both purely data-driven baselines. In a generalization scenario with noisy data and unseen operating conditions, the hybrid NODE achieves an RMSE of 0.013, compared to 0.31 for the data-driven NODE and 0.68 for the discrete-time model, demonstrating that learning only a closure term rather than the full dynamics is sufficient for reliable prediction under limited data availability.

2606.02142 2026-06-02 cs.LG cs.DB 版本更新

TimeBlocks: Foundational and Continual Time-Series Blockbase -- Extended Version

TimeBlocks: 基础与持续时间序列块库——扩展版本

David Campos, Bin Yang, Tung Kieu, Lei Chen, Chenjuan Guo, Christian S. Jensen

AI总结 提出TimeBlocks方法,通过可互换的模块化模型块和路由策略,构建轻量级、多任务的时间序列模型,并引入StreamCore实现持续校准,在多个数据集上优于现有基线。

Comments 15 pages. An extended version of "TimeBlocks: Versatile and Continual Time-Series Blockbase" accepted at SIGKDD 2026

详情
AI中文摘要

持续的数字化导致监控各种过程的时间序列数据流激增,从中可以获得有价值的见解。此外,成功的基础语言模型的出现引发了一个问题:是否可能实现具有处理多个任务的基础属性的时间序列模型,同时足够轻量以允许实时数据流处理。现有的基础时间序列模型通常很大,并且仅在离线设置中有效,没有严格的时间和计算约束,且不需要重复的模型校准。然而,当应用于数据流时,这些模型由于规模大且缺乏对持续校准的支持而效率低下,这损害了它们提供准确实时响应的能力、耐久性以及在硬件受限环境中的可部署性。我们提出TimeBlocks,通过促进在可变条件下适用于多个任务的轻量级模型的高效构建,实现多用途的时间序列处理。特别是,该方法维护一个可互换和模块化的模型块池,可用于构建新的时间序列模型。当面对特定的时间序列数据时,路由策略迭代选择最合适的块来为数据构建轻量级且准确的模型。我们为TimeBlocks配备了一种称为StreamCore的方法,以构建数据流的代表性小子集,该子集随时间保持流的保证近似,从而实现持续的模型校准。在多个数据集和多个任务上的实验研究表明,TimeBlocks能够构建优于现有基线的模型。

英文摘要

The ongoing digitization has led to a proliferation of time-series data streams that monitor a variety of processes, from which valuable insights may be obtained. Further, the emergence of successful foundational language models begs the question of whether it is possible to achieve time-series models with the foundational properties of handling multiple tasks, while being sufficiently lightweight to allow real-time data stream processing. Existing foundational time-series models are often large and only effective in offline settings without stringent time and computational constraints, and where repeated model calibration is not needed. However, when applied to data streams, these models are ineffective due to their size and lack of support for continual calibration, which compromise their ability to deliver accurate real-time responses, their durability, and their deployability in hardware-limited settings. We propose TimeBlocks to enable versatile time-series processing by facilitating the efficient building of lightweight models suitable for multiple tasks under variable conditions. In particular, the method maintains a pool of interchangeable and modular model blocks that can be used to construct new time-series models. When presented with specific time-series data, a routing strategy iteratively selects the most suitable blocks to construct a lightweight and accurate model for the data. We equip TimeBlocks with a method called StreamCore to build a representative small subset of the data stream, which preserves a guaranteed approximation of the stream over time, enabling continual model calibration. An experimental study on multiple data sets and covering multiple tasks shows that TimeBlocks enables to build models capable of outperforming existing baselines.

2606.02138 2026-06-02 cs.LG cs.AI 版本更新

VLBM: Variational Latent Basis Modeling for OOD Robust Multivariate Time Series Forecasting

VLBM:面向OOD鲁棒多变量时间序列预测的变分潜在基建模

Xudong Zhang, Jierui Lei, Jiacheng Li, Lingdong Shen, Jian Cui, Haina Tang

发表机构 * School of Artificial Intelligence, University of Chinese Academy of Sciences(中国科学院大学人工智能学院) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) Amap, Alibaba Group(阿里巴巴集团阿地图) School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences(中国科学院大学先进交叉学科学院) Environmental Microbiome and Innovative Genomics Laboratory, Peking University(北京大学环境微生物与创新基因组实验室)

AI总结 提出VLBM框架,通过变分潜在基分离稳定动态与OOD偏差,实现混合ID/OOD分布下的鲁棒预测,在12个基准任务上平均MAE和MSE分别提升15.08%和7.74%。

详情
AI中文摘要

多变量时间序列预测中的分布外(OOD)事件虽然罕见,但往往主导现实世界风险,使得平均情况预测不足以可靠部署。在混合ID/OOD分布的标准平均风险训练下,来自罕见OOD事件的优化信号可能被频繁的分布内(ID)模式淹没,因此强基准精度可能无法转化为高影响偏移下的可靠性。为解决此问题,我们提出VLBM(变分潜在基模型),一种理论指导的潜在预测框架,将稳定动态与OOD引起的偏差分离。VLBM学习一个共享潜在基,定义稳定ID动态的低秩子空间,将输入显式分解为基子空间分量和正交残差分量,并将未来感知后验与未来盲先验对齐,使得测试时潜在推断仅依赖于历史输入。在涵盖交通、天气、电力系统及其他现实世界领域的12个基准任务上,包括新构建的现实世界OOD交通数据集,VLBM实现了最先进的OOD鲁棒性和ID精度,平均MAE和MSE比最强基线分别提升15.08%和7.74%。在合成模拟数据集上,VLBM也持续实现最佳性能并更好地跟踪OOD脉冲恢复。这些结果支持潜在结构化预测作为混合ID和OOD条件下鲁棒预测的原则性途径。代码可在https://github.com/leijieruilq/VLBM_OOD_forecast获取。

英文摘要

Out of distribution (OOD) events in multivariate time series forecasting are rare but often dominate real world risk, making average case forecasting insufficient for reliable deployment. Under standard average risk training on mixed ID/OOD distributions, optimization signals from rare OOD events can be overwhelmed by frequent in distribution (ID) patterns, so strong benchmark accuracy may not translate into reliability under high impact shifts. To address this issue, we propose VLBM (Variational Latent Basis Model), a theory guided latent forecasting framework that separates stable dynamics from OOD induced deviations. VLBM learns a shared latent basis that defines a low rank subspace for stable ID dynamics, explicitly decomposes inputs into basis subspace components and orthogonal residual components, and aligns a future aware posterior with a future blind prior so that test time latent inference depends only on historical input. Across 12 benchmark tasks spanning transportation, weather, power systems, and other real world domains, including newly constructed real world OOD traffic datasets, VLBM achieves state of the art OOD robustness and ID accuracy, with average MAE and MSE gains of 15.08\% and 7.74\% over the strongest baseline. On a synthetic simulation dataset, VLBM also consistently achieves the best performance and better tracks OOD pulse recovery. These results support latent structured forecasting as a principled route to robust prediction under mixed ID and OOD conditions. The code is available at https://github.com/leijieruilq/VLBM_OOD_forecast.

2606.02136 2026-06-02 cs.LG 版本更新

Edge-aware Decoding for Neural Asymmetric Routing

面向神经非对称路由的边缘感知解码

Li Liang, Jinbiao Chen, Zizhen Zhang

发表机构 * Sun Yat-Sen University(中山大学) Department of Industrial Systems Engineering and Management, National University of Singapore(新加坡国立大学工业系统工程与管理系)

AI总结 针对神经非对称路由中表示与决策不匹配的问题,提出边缘感知解码器,通过显式暴露转移级成本信息提升零样本泛化性能。

详情
AI中文摘要

神经非对称路由模型越来越多地通过矩阵表示和非对称感知注意力来编码方向性。然而,最终路由动作并非孤立节点,而是在当前部分路由下选择的有向转移。这造成了表示与决策的不匹配:成对成本信息可能在上游编码,而最终候选logit仍主要参数化为上下文-节点兼容性。我们提出一种针对神经非对称路由的解码器设计原则:最终得分应显式暴露问题成本结构所暗示的转移级量。我们通过一个边缘感知解码器实例化该原则,该解码器为当前有向边、返回起点的闭合以及静态轻量级前瞻添加候选特定项,同时保持表示骨干网络固定。在受控的SVD/Sinkhorn非对称骨干网络上,该解码器在ATSP-100上训练并在ATSP-100/200/500/1000上零样本评估时,优于RADAR参考,将ATSP-1000的差距从4.13%降至2.73%。在ACVRP上,相同的得分级修改在更丰富的路由状态下显示出相同的定性趋势。ATSP消融实验和有向转移诊断进一步阐明了机制:最强证据涉及对当前有向边的敏感性,而闭合和静态前瞻则作为启发式延续线索。结果支持一项机制研究:神经非对称路由中一个关键的解码器侧信号是决策时暴露转移级边缘信息。

英文摘要

Neural asymmetric routing models increasingly encode directionality through matrix representations and asymmetry-aware attention. The final routing action, however, is not a node in isolation but a directed transition chosen under the current partial route. This creates a representation--decision mismatch: pairwise cost information may be encoded upstream while the final candidate logit is still largely parameterized as context--node compatibility. We propose a decoder-design principle for neural asymmetric routing: the final score should explicitly expose transition-level quantities suggested by the problem's cost-to-go structure. We instantiate this principle with an edge-aware decoder that adds candidate-specific terms for the current directed edge, return-to-start closure, and static lightweight lookahead, while keeping the representation backbone fixed. On a controlled SVD/Sinkhorn asymmetric backbone, the decoder improves over the RADAR reference when trained on ATSP-100 and evaluated zero-shot on ATSP-100/200/500/1000, reducing the ATSP-1000 gap from $4.13\%$ to $2.73\%$. On ACVRP, the same score-level modification shows the same qualitative trend under a richer routing state. ATSP ablations and directed-transition diagnostics sharpen the mechanism: the strongest evidence concerns sensitivity to the current directed edge, while closure and static lookahead act as heuristic continuation cues. The results support a mechanism study: a key decoder-side signal in neural asymmetric routing is decision-time exposure of transition-level edge information.

2606.02134 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Rethinking Evaluation Paradigms in IBP-based Certified Training

重新思考基于IBP的认证训练中的评估范式

Konstantin Kaulen, Hadar Shavit, Holger H. Hoos

发表机构 * University of Freiburg(弗赖堡大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 针对认证训练中自然精度与认证精度的权衡问题,提出基于Pareto前沿的多目标超参数优化方法,实现公平的方法间比较,并发现先前配置的欠调优现象,建立新的最优性能。

Comments Accepted to ICML 2026

详情
AI中文摘要

深度神经网络在许多监督学习任务上取得了强大性能,但仍易受对抗性扰动的影响。神经网络验证提供了数学上严格的鲁棒性保证,但计算成本高昂。为缓解这一问题,认证训练技术在训练过程中优化可验证的鲁棒性,通常通过方法特定的超参数控制自然精度与认证精度之间的权衡。由于这些指标本质上是冲突的,报告单一配置的常见做法存在问题:它可能误导关于整体性能的结论,并妨碍对最新技术的无偏评估。我们通过基于自然-认证精度权衡的Pareto前沿比较来评估认证训练方法。为了实现公平、方法无关的比较,我们执行高效的自动化多目标超参数优化,为每种方法识别一组Pareto最优配置。这种方法常常揭示先前报告配置中的显著欠调优,从而获得更优性能并建立新的最优水平。利用这些前沿,我们首次对认证训练方法进行了全面的多目标比较,表明先前的进展并不像假设的那样显著,并揭示了先前未报告的性能互补性。

英文摘要

Deep neural networks achieve strong performance on many supervised learning tasks but remain vulnerable to adversarial perturbations. Neural network verification provides mathematically rigorous robustness guarantees, yet at substantial computational cost. To mitigate this, certified training techniques optimise for verifiable robustness during training, typically inducing a trade-off between natural and certified accuracy controlled by method-specific hyperparameters. Because these metrics are inherently conflicting, the common practice of reporting a single configuration is problematic: it can mislead conclusions about overall performance and prevents unbiased assessments of the state of the art. We address this by evaluating certified training methods via Pareto front comparisons over the natural--certified accuracy trade-off. To enable fair, method-agnostic comparisons, we perform efficient automated multi-objective hyperparameter optimisation to identify a set of Pareto-optimal configurations for each method. This approach often uncovers substantial undertuning in previously reported configurations, yielding superior performance and establishing a new state of the art. Leveraging these fronts, we present the first comprehensive multi-objective comparison of certified training approaches, showing that prior advancements are less pronounced than assumed and revealing previously unreported performance complementarities.

2606.02120 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Understanding-Enhanced Model Collaboration for Long-Tailed Egocentric Mistake Detection

理解增强的模型协作用于长尾自我中心错误检测

Boyu Han, Qianqian Xu, Shilong Bao, Zhiyong Yang, Ruochen Cui, Qingming Huang

发表机构 * State Key Laboratory of AI Safety, Institute of Computing Technology, CAS(人工智能安全国家重点实验室,计算技术研究所,中国科学院) School of Computer Science and Tech., University of Chinese Academy of Sciences(中国科学院大学计算机科学与技术学院) Beijing Academy of Artificial Intelligence(北京人工智能研究院) Institute of Information Engineering, CAS(信息工程研究所,中国科学院) School of Cyber Security, University of Chinese Academy of Sciences(中国科学院大学网络安全学院)

AI总结 提出理解增强的模型协作方法(UE-MCM),结合粗粒度视频理解与细粒度动作推理,通过双分支模型和自适应融合门检测自我中心视频中的错误,并优化长尾分布。

详情
AI中文摘要

在本报告中,我们解决了从自我中心视频数据中判断用户是否错误执行动作的问题。为此,我们提出了一种理解增强的模型协作方法(UE-MCM),该方法将高效的粗粒度视频理解与准确的细粒度动作推理相结合。具体来说,UE-MCM包含一个小模型分支和一个大模型分支。大模型分支关注细粒度动作本身是否执行错误,而小模型分支联合输入粗粒度视频和细粒度片段,以识别可能局部正确但与整体工作流不一致的动作。小模型分支基于CLIP4CLIP视频编码器构建,该编码器从通过扩散对比重建增强的CLIP模型初始化,大模型分支使用Qwen3-VL嵌入模型从细粒度动作片段中提取高容量表示。然后,通过轻量级协作门自适应融合小分支预测和大分支预测。为了处理错误实例的长尾分布,我们通过互补目标优化分类器,包括重加权交叉熵、AUC导向学习和标签感知调整。所得系统平衡了速度和准确性,使其能够有效检测自我中心教学视频中的细微、罕见和模糊错误。

英文摘要

In this report, we address the problem of determining whether a user performs an action incorrectly from egocentric video data. To this end, we propose an Understanding-Enhanced Model Collaboration Method (UE-MCM) that combines efficient coarse-grained video understanding with accurate fine-grained action reasoning. Specifically, UE-MCM contains a small model branch and a large model branch. The large model branch focuses on whether the fine-grained action itself is executed incorrectly, while the small model branch jointly takes the coarse-grained video and fine-grained segment as input to identify actions that may be locally correct but inconsistent with the overall workflow. The small model branch is built on a CLIP4CLIP video encoder initialized from a CLIP model enhanced by Diffusion Contrastive Reconstruction, and the large model branch uses the Qwen3-VL Embedding model to extract high-capacity representations from fine-grained action segments. The small-branch prediction and the large-branch prediction are then adaptively fused by a lightweight collaboration gate. To handle the long-tailed distribution of mistake instances, we optimize the classifiers with complementary objectives, including reweighted cross-entropy, AUC-oriented learning, and label-aware adjustment. The resulting system balances speed and accuracy, making it effective for detecting subtle, rare, and ambiguous mistakes in egocentric instructional videos.

2606.02119 2026-06-02 cs.LG cs.AI 版本更新

How Hard Can It Be? Hardness-Aware Multi-Objective Unlearning

到底有多难?难度感知的多目标遗忘学习

Jiangwei Chen, Xinyuan Niu, Rachael Hwee Ling Sim, Zhengyuan Liu, Nancy F. Chen, Bryan Kian Hsiang Low

发表机构 * National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对现有遗忘学习无法保证同时提升遗忘质量和保持保留效用的缺陷,提出一种基于约束优化的难度感知多目标遗忘算法(HAMU),通过量化遗忘数据与保留数据的相似度来指导模型更新,在保证遗忘质量提升的同时最小化保留效用损失。

Comments ICML 2026

详情
AI中文摘要

机器遗忘旨在由于隐私、版权或偏见问题,移除特定遗忘训练数据的影响,同时保持模型在剩余保留数据上的性能。现有的遗忘算法,例如优化损失的加权组合,试图实现提高遗忘质量和保持保留效用这些目标。然而,它们无法保证对所有遗忘和保留数据都能将目标改进到指定程度。在这项工作中,我们从约束优化的角度,用一种新颖且理论扎实的方法解决了这一限制。首先,我们确定遗忘数据和保留数据之间的相似度可以量化调和两个目标的难度。接下来,我们推导出一种遗忘算法(HAMU),其总体目标是通过根据我们的难度度量更新模型权重,在保证遗忘质量有指定改进的同时,最小化保留效用成本/下降。我们的难度度量还告知用户何时保留效用下降不可避免,即两个目标无法同时改进,应考虑停止。我们的算法适用于非凸模型,并且易于并行化,使其易于在实际场景中部署。我们通过实验使用大型模型在图像和文本数据集上证明了HAMU相对于基线的优越性能。我们的代码可在 https://github.com/aoi3142/HAMU 获取。

英文摘要

Machine unlearning aims to remove the influence of specific forget training data due to privacy, copyright or bias concerns while maintaining the model performance on the remaining retain data. Existing unlearning algorithms, such as optimizing a weighted combination of losses, have tried to achieve these objectives of improving forget quality and maintaining retain utility. However, they do not guarantee that these objectives can be improved by a specified extent for all forget and retain data. In this work, we address this limitation with a novel and theoretically-grounded approach from a constrained optimization perspective. Firstly, we identify that the hardness of reconciling both objectives can be quantified by the similarity between the forget data and the retain data. Next, we derive an unlearning algorithm (HAMU) with the overall goal of guaranteeing a specified improvement in forget quality while minimizing the retain utility cost/degradation by updating the model weights based on our hardness measure. Our hardness measure also informs users when retain utility degradation is unavoidable, i.e., both objectives cannot be improved simultaneously, and stopping should be considered. Our algorithm is applicable to non-convex models and is easily parallelizable, making it readily deployable in real-world scenarios. We empirically demonstrate HAMU's superior performance over baselines on both image and text datasets using large models. Our code is available at https://github.com/aoi3142/HAMU.

2606.02117 2026-06-02 stat.ML cs.LG stat.ME 版本更新

ProbRes: Volatility Learning for Probabilistic Time-Series Forecasting

ProbRes: 概率时间序列预测的波动率学习

Tingting Wang, Yunyi Zhang, Benyou Wang

AI总结 提出ProbRes,一种事后概率校准方法,通过显式学习波动率动态来改进概率预测,有效处理异方差数据,并在理论和实验上验证其有效性。

详情
AI中文摘要

概率时间序列预测由于需要量化未来观测中的风险和不确定性,在金融应用中引起了越来越多的关注。我们提出ProbRes,一种事后概率校准方法,它显式地学习并将波动率动态纳入概率预测中,从而能够有效处理异方差数据。在训练过程中,ProbRes采用两个与架构无关的模块分别对条件均值和条件波动率进行建模。在推理阶段,它通过重采样标准化残差生成预测分布。ProbRes适用于单变量和多变量时间序列,并且在广泛的误差分布下保持稳健,包括具有条件异方差的非高斯创新。理论结果证明了ProbRes的有效性,在合成和真实数据集上的实验表明,ProbRes准确捕捉预测分布并产生校准良好的预测区间。

英文摘要

Probabilistic time series forecasting has attracted increasing attention in financial applications due to the need to quantify risk and uncertainty in future observations. We propose ProbRes, a post-hoc probabilistic calibration method that explicitly learns and incorporates volatility dynamics into probabilistic forecasting, enabling effective handling of heteroskedastic data. During training, ProbRes employs two architecture-agnostic modules to separately model the conditional mean and conditional volatility. At the inference stage, it generates predictive distributions by resampling normalized residuals. ProbRes is applicable to both univariate and multivariate time series and remains robust under a wide range of error distributions, including non-Gaussian innovations with conditional heteroskedasticity. Theoretical results demonstrate ProbRes's validity and experiments on both synthetic and real-world datasets show that ProbRes accurately captures predictive distributions and produces well-calibrated prediction intervals.

2606.02115 2026-06-02 stat.ML cs.LG 版本更新

Error Bounds for a Diffusion Model-Based Drift Estimator

基于扩散模型的漂移估计器的误差界

Ioar Casado-Telletxea, Omar Rivasplata

发表机构 * Basque Center for Applied Mathematics (BCAM)(巴斯克应用数学中心) Centre for AI Fundamentals & Department of Computer Science(人工智能基础研究中心及计算机科学系) University of Manchester, UK(英国曼彻斯特大学)

AI总结 针对随机微分方程中已知扩散参数时的漂移估计问题,利用扩散模型理论推导了时间平均均方误差的显式风险界,将风险分解为离散化、得分近似、噪声初始化和采样方差四项。

Comments Preprint

详情
AI中文摘要

随机微分方程中的参数估计是一个经典的统计问题,在许多科学领域具有重要意义。Tapia Costa等人(2026)的最新工作引入了一种新技术,当扩散参数已知时,利用多条轨迹的离散样本估计漂移。他们的方法将漂移估计视为去噪问题,并利用(条件)得分匹配扩散模型的工具。尽管他们的实验在不同漂移类别中显示出有希望的结果,但其估计器的理论保证问题仍未解决。在本笔记中,我们通过利用扩散模型理论的技术来填补这一空白。更具体地说,我们为该漂移估计器的时间平均均方误差推导了一个显式的风险界。我们的界将风险分解为(i)Euler-Maruyama离散化,(ii)得分/去噪器近似,(iii)噪声初始化,以及(iv)采样方差,揭示了估计器中不同超参数和误差源之间的权衡。

英文摘要

Parameter estimation in stochastic differential equations is a classical statistical problem of much importance in many scientific fields. Recent work of Tapia Costa et al. (2026) introduced a novel technique for estimating the drift when the diffusion parameter is known, using discrete samples from multiple trajectories. Their method treats drift estimation as a denoising problem, and leverages tools from (conditional) score-matching diffusion models. Although their experiments showed promising results across different drift classes, the question of theoretical guarantees for their estimator was left unanswered. In this note, we address this gap by exploiting techniques from diffusion model theory. More concretely, we derive an explicit risk bound for the time-averaged mean-squared error of said drift estimator. Our bound decomposes the risk into the (i) Euler-Maruyama discretization, (ii) score/denoiser approximation, (iii) noise initialization, and (iv) sampling variance, revealing the trade-offs between the different hyperparameters and sources of error in the estimator.

2606.02107 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Network Distributed Multi-Agent Reinforcement Learning for Consensus Control of Quadcopters

网络分布式多智能体强化学习用于四旋翼无人机一致性控制

Youssef Mahran, Zeyad Gamal, Aamir Ahmad, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department, German University in Cairo (GUC), Egypt(埃及德国大学(GUC)机械工程系) Institute of Flight Mechanics and Control (IFR), Head of Flight Robotics, University of Stuttgart, Germany(德国斯图加特大学飞行力学与控制研究所) Faculty of EMS, Head of Mechatronics Engineering Department, German University in Cairo (GUC), Egypt(埃及德国大学(GUC)EMS学院)

AI总结 提出网络分布式多智能体强化学习框架,利用通信图实现分布式策略,通过MASAC训练高层规划器,实现零样本扩展到250个智能体。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2026 IEEE 23rd Mediterranean Electrotechnical Conference (MELECON)
AI中文摘要

本文提出了一种用于四旋翼无人机一致性控制的网络分布式多智能体强化学习(ND-MARL)框架。与依赖集中式规划或完全分散式执行的传统多智能体MARL公式相比,ND-MARL将群体通信图纳入决策过程。在2-邻居通信拓扑下,每个智能体仅观察两个邻居的信息,并通过分布式策略输出动作。使用多智能体软演员-评论家(MASAC)训练高层分布式一致性规划器,并将其嵌入层次化堆栈中,以生成由低层四旋翼控制器跟踪的参考目标位置。结果表明,与集中式MARL控制器相比,实现了平滑的一致性轨迹和规划器-跟踪器集成。最值得注意的是,学习到的控制器表现出零样本可扩展性,即在三智能体系统上训练的策略,在相同的2-邻居通信拓扑下,无需重新训练或微调即可部署到多达250个智能体的群体中,实现了随着团队规模增大而稳态散布增加的一致收敛,这是由于稀疏信息传播所致。这些发现突显了ND-MARL作为分布式、通信感知的四旋翼一致性控制的稳定框架。

英文摘要

This paper proposes a Network Distributed Multi-Agent Reinforcement Learning (ND-MARL) framework for quadcopter consensus control. Compared to conventional multi-agent MARL formulations that rely on centralized planning or fully decentralized execution, ND-MARL incorporates the swarm communication graph into the decision process. Under a 2-Neighbor communication topology, each agent observes information of only two neighbors and outputs an action through a distributed policy. A high-level distributed consensus planner is trained using Multi-Agent Soft Actor-Critic (MASAC) and embedded in a hierarchical stack to generate reference target positions tracked by a low-level quadcopter controller. Results demonstrate smooth consensus trajectories and planner-tracker integration when compared to a centralized MARL controller. Most notably, the learned controller exhibits zero-shot scalability, as policies trained on a three-agent system are deployed to swarms of up to 250 agents under the same 2-Neighbor communication topology without retraining or fine-tuning, achieving consistent convergence with increasing steady-state spread at large team sizes due to sparse information propagation. These findings highlight ND-MARL as a stable framework for distributed, communication-aware quadcopter consensus control.

2606.02106 2026-06-02 cs.LG stat.ML 版本更新

When Tabular Foundation Models Transfer Across Modalities: A Systematic Evaluation Across 95 Datasets, 7 Modalities, and Two Regimes

当表格基础模型跨模态迁移:对95个数据集、7种模态和两种范式的系统评估

Julien Lafrance

发表机构 * Telecom Paris, Institut Polytechnique de Paris(巴黎电信学院,巴黎理工学院)

AI总结 本文提出一种结合等角紧框架预处理与表格基础模型的分类流水线,在跨模态数据上评估其性能,并证明其在速度与质量间取得良好平衡。

Comments 24 pages, 5 figures. Code and data available at https://doi.org/10.5281/zenodo.19982636

详情
AI中文摘要

我们提出一个单一的分类流水线,该流水线结合了等角紧框架(ETF)预处理阶段和用于上下文推理的表格基础模型,一旦数据被映射到固定向量表示,该流水线在所有模态上应用相同。我们在涵盖七种信号模态——视觉、音频、语音、文本、分子、时间序列和表格——的95个数据集上对其进行评估。主要的方法论贡献是固定比较对象:在整个论文中,性能与相同冻结特征上最强的轻量级调优基线进行比较,而oracle选择、部署选择和专门微调则分别报告。该流水线在相同冻结特征上与强大的轻量级调优基线广泛竞争。它并不在每个任务上都匹配最好的专门模型或高度调优的流水线,但差距很小,且运行速度更快——通常比完整骨干微调快4到200倍,而质量往往相当。我们描述了如何在实际中部署该流水线:何时应用ETF预处理,如何在无验证集的情况下停止其训练,如何设置上下文分类器,以及如何校准所得概率。校准步骤并非装饰性的:TabICL通过构造产生良好校准的概率,ETF预处理最初会破坏该校准,而后处理重新缩放则恢复它——从而产生每个预测的置信度信号,从业者可以将其用作置信度门控部署的信任阈值。我们还报告了该流水线在哪些情况下不应期望有帮助,以及如何提前识别这些情况。

英文摘要

We present a single classification pipeline that combines an Equiangular Tight Frame (ETF) preprocessing stage with a tabular foundation model for in-context inference, applied identically across modalities once data is mapped to fixed vector representations. We evaluate it on 95 datasets spanning seven signal modalities -- vision, audio, speech, text, molecular, time-series, and tabular. The main methodological contribution is to fix the comparison object: throughout the paper, performance is judged against the strongest lightweight tuned baseline on the same frozen features, while oracle selection, deployed selection, and specialized fine-tuning are reported separately. The pipeline is broadly competitive with strong lightweight tuned baselines on the same frozen features. It does not match the very best specialized models or heavily tuned pipelines on every task, but it stays close, and it runs much faster -- typically 4 to 200 times faster than full backbone fine-tuning, often at comparable quality. We describe how to deploy the pipeline in practice: when to apply ETF preprocessing, how to stop its training without a validation split, how to set up the in-context classifier, and how to calibrate the resulting probabilities. The calibration step is non-cosmetic: TabICL produces well-calibrated probabilities by construction, ETF preprocessing initially disrupts that calibration, and the post-hoc rescaling restores it -- yielding a per-prediction confidence signal that practitioners can use as a trust threshold for confidence-gated deployment. We also report where the pipeline should not be expected to help, and how to identify those cases in advance.

2606.02101 2026-06-02 stat.ML cs.LG stat.AP 版本更新

It does what it says on the tin: safe synthetic data from coarsened margins

名副其实:来自粗化边际的安全合成数据

Gillian M Raab

发表机构 * University of Edinburgh(爱丁堡大学) Scottish Centre for Administrative Data Research(苏格兰行政数据研究中心)

AI总结 提出一种通过粗化边际并应用迭代比例拟合算法生成合成数据的方法,确保透明性和无披露风险。

详情
AI中文摘要

本文提出了一种创建合成数据的方法,与当前可用的其他方法相比,该方法对用户有两个重要优势。首先是透明性;与其他方法不同,接收合成数据的人将知道原始数据中哪些变量之间的关系将在合成数据中大致保持。其次是保证合成数据来源于已被判定无披露风险的信息。这是通过首先定义和计算将在合成数据中保持变量关系的边际来实现的。然后,每个边际将根据数据保管者定义的标准进行统计披露控制,例如顶部编码和底部编码、小类别的组合和/或修改小计数。建议通过将表格中的所有计数粗化为披露限制的倍数来进一步调整策展边际。这些调整后的边际用于通过迭代比例拟合算法生成合成数据。使用1901年苏格兰人口普查的数据说明了创建此类合成数据的实际步骤。

英文摘要

This paper proposes a method of creating synthetic data (SD) that will have two important advantages for the user compared to other methods currently available. The first is transparency; unlike other methods, the person in receipt of the SD will know which of the relationships between variables in the original data will be approximately maintained in the SD. The second is a guarantee that the SD is derived from information that has already been judged to be free of disclosure risk. This is achieved by first defining and calculating the margins where relationships between variables will be maintained in the SD. Each margin will then be subject to statistical disclosure control (SDC) to the standards defined by the data custodian, e.g. top-coding and bottom-coding, combination of small categories and/or modifying small counts. Further adjustment of the curated margins is advised by coarsening all counts in the table to multiples of the disclosure limit. These adjusted margins are used to create SD by the Iterative Proportional Fitting (IPF) algorithm. The practical steps involved in creating such SD are illustrated using data from the 1901 Census of Scotland.

2606.02093 2026-06-02 cs.CL cs.AI cs.LG 版本更新

The Role of Ambiguity in Error Prediction via Uncertainty Quantification

不确定性量化中模糊性在错误预测中的作用

Ieva Raminta Staliūnaitė, James Bishop, Andreas Vlachos

发表机构 * University of Cambridge(剑桥大学) The Alan Turing Institute(艾伦·图灵研究所)

AI总结 通过解耦输入模糊性与不确定性信号,利用门控专家和选择性预测提升大语言模型在问答任务中的错误预测性能。

Comments 8 pages not including references and appendices, 3 figures

详情
AI中文摘要

错误预测任务,即预测模型输出是否正确,通常通过不确定性量化(UQ)来解决。然而,虽然不确定性指标捕捉了模型缺乏知识或能力进行预测的情况,但它们也反映了模型输入和上下文中固有的偶然不确定性。本文提出了一种通过将输入模糊性与UQ信号解耦来改进大语言模型(LLM)错误预测的方法。我们在问答(QA)任务上使用六种UQ指标进行实验,结果表明,UQ指标在无歧义实例上的错误预测能力优于具有多个合理答案的问题。我们使用门控专家和选择性预测将真实和预测的模糊性标签纳入错误预测流程。我们发现,模糊性信息提高了跨模型家族、训练和评估范式、数据集(包括据称无歧义的数据集)以及偶然不确定性来源的错误预测分数,在标准数据集上对单个UQ指标的PRR提升超过10个百分点。

英文摘要

The task of Error Prediction, namely predicting whether a model output is correct, is commonly tackled with Uncertainty Quantification (UQ). However, while uncertainty metrics capture when models lack knowledge or capacity to make a prediction, they also reflect aleatoric uncertainty, which is inherent in the model input and context. This paper presents a method for improving error prediction for Large Language Models (LLMs), by disentangling input ambiguity from UQ signal. We conduct experiments on the task of Question Answering (QA) with six UQ metrics and show that UQ metrics are more predictive of errors on unambiguous instances than on questions with multiple plausible answers. We use Gated Experts and Selective Prediction to incorporate gold and predicted ambiguity labels into the error prediction pipeline. We find that ambiguity information improves error prediction scores across model families, training and evaluation paradigms, datasets (including allegedly unambiguous ones), and sources of aleatoric uncertainty, yielding improvements of over 10 points of PRR for individual UQ metrics on standard datasets.

2606.02078 2026-06-02 cs.LG 版本更新

Beyond $\ell_2$-norm and $\ell_\infty$-norm: A Curvature-Inspired $\ell_p$-Norm Scheme for Deep Neural Networks

超越ℓ2范数和ℓ∞范数:一种受曲率启发的深度神经网络ℓp范数方案

Jianhao Xu, Zhuang Yang

发表机构 * School of Computer Science and Technology, Soochow University(苏州大学计算机科学与技术学院)

AI总结 针对现有优化器在参数维度曲率变化大时适应性差的问题,提出一种动态p值的ℓp范数方案,并融入SGD和SGDM,得到LPSGD和LPSGDM优化器,通过早期大p抑制高曲率方向、后期余弦退火减小p实现稳定更新,理论证明非凸情形下O(T^{-1/2})收敛率,在CIFAR和ImageNet数据集上验证了泛化性能提升。

详情
AI中文摘要

现有的深度神经网络(DNN)优化器通常依赖于ℓ2范数或ℓ∞范数,导致优化器不能很好地适应参数维度上曲率的显著变化。通常,DNN的训练过程在早期表现出强烈的曲率各向异性,而在后期,DNN的训练过程趋向于向各向异性较弱的平坦区域移动。特别地,基于ℓ2范数的优化器通常由高曲率方向主导,限制了优化器沿较低曲率方向的更新,从而导致收敛速度较慢。而基于ℓ∞范数的优化器由于坐标方向更新幅度相同,在平坦区域容易产生振荡。为了解决ℓ2和ℓ∞范数产生的这两种极端情况,我们提出了一种具有动态p值的新型ℓp范数方案,并将其融入随机梯度下降(SGD)和带动量的SGD(SGDM)中,从而得到两种具有更好泛化性能的新型优化器:ℓp-SGD(LPSGD)和ℓp-SGDM(LPSGDM)。特别地,所得到的优化器通过使用较大的p(p>2)来抑制早期高曲率方向的支配地位,随后将p逐渐减小至2以实现更稳定和精细的更新,其中后一过程受余弦退火策略启发。我们建立了所得到算法的理论保证,并分析了LPSGD和LPSGDM在非凸情形下均达到O(T^{-1/2})的收敛率。在基准数据集(包括CIFAR-10、CIFAR-100和ImageNet-1K)上,使用多种DNN(如VGG-11、ResNet-18和ResNet-50)进行了大量实验。

英文摘要

The existing optimizers for deep neural networks (DNNs) typically rely on either the $\ell_2$ norm or the $\ell_\infty$ norm, resulting in optimizers that do not adapt well to substantial changes in curvature across parameter dimensions. Generally, the training process of DNNs often exhibits strong curvature anisotropy in the early period, whereas in the later period, the training process of DNNs tends to move toward flatter regions with weaker anisotropy. Particularly, optimizers based on the \(\ell_2\)-norm are usually dominated by high-curvature directions, restricting updates of optimizers along with lower curvature direction and thus leading to a slower convergence rate. While optimizers based on the \(\ell_\infty\)-norm are prone to oscillations in flatter regions, due to the coordinate-wise updates of the same magnitude. To address these two extreme cases generated by $\ell_2$ and $\ell_\infty$ norms, we propose a novel $\ell_p$-norm scheme with a dynamical value of $p$ and incorporate it into stochastic gradient descent (SGD) and SGD with momentum (SGDM), leading to two novel optimizers with better generalization performance: ${\ell_p}$-SGD (LPSGD) and ${\ell_p}$-SGDM (LPSGDM). Particularly, the resulting optimizers suppress the dominance of high-curvature directions in the early period by utilizing a large $p$ ($p>2$), followed by a gradual decrease of $p$ toward 2 to enable more stable and refined updates, where the latter process is motivated by the cosine annealing strategy. We establish theoretical guarantees of the resulting algorithms and analyze that both LPSGD and LPSGDM achieve an \(O(T^{-1/2})\) convergence rate for the nonconvex setting. Extensive experiments are conducted on benchmark datasets, including CIFAR-10, CIFAR-100, and ImageNet-1K, with multiple DNNs such as VGG-11, ResNet-18, and ResNet-50.

2606.02073 2026-06-02 cs.LG 版本更新

Planar Symmetric Pattern Generation

平面对称图案生成

Ning Lin, Luxi Chen, Huaguan Chen, Jiacheng Cen, Chongxuan Li, Wenbing Huang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学中关村人工智能学校)

AI总结 提出一种适用于任意平面群的对称化框架,通过将任意2D连续表示转换为对称表示并保持连续性,实现对称控制,在图案设计、剪纸设计、风格化拓扑设计和材料设计任务中验证了有效性。

详情
AI中文摘要

生成具有特定对称性的对象在各种现实场景中至关重要。然而,将现有的2D连续表示适应于强制平面群对称性仍然是一个挑战,因为非反射群元素的变换可能破坏连续性。为了克服这一限制,我们提出了一种适用于任意平面群的对称化框架。我们的方法将任意2D连续表示转换为对称表示,同时保持连续性。我们提供了该表示的数学公式,展示了其对对称函数的逼近能力,并详细介绍了构建方法。我们通过三个视觉设计任务(图案设计、剪纸设计和风格化拓扑设计)和一个材料设计任务验证了我们的方法。实验证实,我们的表示能够实现有效的对称控制,并展示了其更广泛的适用性。

英文摘要

Generating objects with specific symmetries is essential in various real-world scenarios. However, adapting existing 2D continuous representations to enforce planar group symmetry remains a challenge, as the transformation of non-reflective group elements may disrupt continuity. To overcome this limitation, we propose a symmetrization framework for arbitrary planar groups. Our method transforms any 2D continuous representation into a symmetric one while preserving continuity. We provide the mathematical formulation of this representation, demonstrate its approximation capability for symmetric functions, and detail the construction methodology. We validate our approach through three visual design tasks (pattern design, paper-cutting design and stylized topology design) and one material design task. Experiments confirm that our representation enables effective symmetry control and demonstrate its broader applicability.

2606.02061 2026-06-02 cs.LG 版本更新

Ablating Archetypes: The Stability of Archetypal SAEs is an Artifact of Initialization and Metric Design

消除原型:原型SAE的稳定性是初始化和度量设计的人为产物

Michał Brzozowski, Neo Christopher Chung

发表机构 * Samsung AI Center(三星人工智能中心) University of Warsaw(华沙大学)

AI总结 本文通过实验证明,原型稀疏自编码器声称的稳定性源于多轮训练中相同的初始化设置,而非原型约束本身,并强调稳定性与稳定化的区别对可解释性研究至关重要。

详情
AI中文摘要

使用稀疏自编码器(SAE)的字典学习从神经网络激活中产生过完备基,这些基通常是可解释的,并减少了多义性。然而,不同随机种子的SAE特征差异很大——这个问题被称为不稳定性。原型SAE(Fel等人,2025)被提出作为一种通用的字典学习干预,用于更可靠的概念提取,并报告在训练结束时字典更稳定。我们证明原型SAE声称的稳定性是在多次运行中设置相同初始化的结果。通过我们的分析,我们试图澄清机械可解释性中可能模糊使用的两个不同概念:稳定性是两个独立训练模型之间的一致性,而稳定化是独立初始化的运行向共同解收敛。这种区分对于自然语言处理(NLP)的机械可解释性至关重要,其中特征稳定性越来越多地被用作SAE特征是可重用分析单元的证据。原型SAE的实验共享一个确定性的k-means解码器初始化,在训练开始前将运行间字典距离设为零。当移除这种初始化时,原型约束在我们的设置中没有提供稳定化优势。我们进一步发现了一个依赖于预处理的余弦几何问题,使端点稳定性指标的解释复杂化。总的来说,我们的研究支持在更大的字典学习传统中研究SAE的价值,同时表明稳定性声明需要轨迹诊断和初始化消融。

英文摘要

Dictionary learning with sparse autoencoders (SAEs) produces overcomplete bases from neural network activations that are often interpretable and reduces polysemanticity. However, features from SAEs vary substantially across random seeds -- a problem known as instability. Archetypal SAEs (Fel et al., 2025) were proposed as a general dictionary-learning intervention for more reliable concept extraction, and report more stable dictionaries at the end of training. We demonstrate that the stability claimed by archetypal SAEs is a result of setting identical initialization across multiple runs. Through our analyses, we attempt to clarify two distinct notions in mechanistic interpretability that may be ambiguously used: stability is agreement between two independently trained models, whereas stabilization is the convergence of independently initialized runs toward a common solution. This distinction is critical for mechanistic interpretability of natural language processing (NLP), where feature stability is increasingly used as evidence that SAE features are reusable units of analysis. Experiments from archetypal SAEs share a deterministic k-means decoder initialization, setting inter-run dictionary distance to zero before training begins. When this initialization is removed, the archetypal constraint provides no stabilization advantage in our setting. We further identify a preprocessing-dependent cosine geometry issue that complicates interpretation of endpoint stability metrics. Overall, our study supports the value of studying SAEs within the larger dictionary-learning tradition while showing that stability claims require trajectory diagnostics and initialization ablations.

2606.02055 2026-06-02 cs.IT cs.LG cs.SI math.IT stat.ML 版本更新

Query-Limited Community Recovery in Stochastic Block Models

随机块模型中的有限查询社区恢复

Sabyasachi Basu, Manuj Mukherjee, Lutz Oettershagen, Suhas Thejaswi

AI总结 研究在有限且带噪的网络数据访问下,通过自适应查询策略实现两社区随机块模型的精确社区恢复,并证明自适应查询可突破非自适应基准的信息论极限。

详情
AI中文摘要

我们研究在 $n$ 个顶点上的两社区随机块模型中,对网络数据的有限且带噪访问下的精确社区恢复。学习器可以查询一个带噪的邻域预言机,该预言机独立地以固定概率揭示被查询顶点的每个真实邻居,且从不返回非邻居,受限于有限的查询预算。我们考虑仅预言机访问以及一个组合模型,其中学习器还观察底层图的单个子采样副本。对于仅预言机访问,平衡均匀查询给出了一个尖锐的非自适应基准:当每个顶点被查询相同整数次数时,观测结果简化为具有衰减边概率的 SBM,并且 Abbe-Bandeira-Hall 精确恢复阈值适用。我们证明该基准并非自适应最优:在平衡均匀查询需要 $m n$ 次查询(对于某个 $m>1$)的机制下,两阶段自适应策略以 $n+o(n)$ 次查询成功。对于额外的子采样图,我们证明了一个亚线性查询的自适应差距:预算为亚线性的平衡数据无关均匀查询不会比单独的子采样图有所改进,而自适应查询可以针对少量不确定顶点并实现精确恢复。因此,自适应数据采集可以严格改善精确恢复的信息论极限。

英文摘要

We study exact community recovery in the two-community stochastic block model on $n$ vertices under limited and noisy access to network data. The learner may query a noisy neighborhood oracle that reveals each true neighbor of a queried vertex independently with fixed probability and never returns non-neighbors, subject to a finite query budget. We consider both oracle-only access and a combined model where the learner also observes a single subsampled copy of the underlying graph. For oracle-only access, balanced uniform querying gives a sharp non-adaptive benchmark: when each vertex is queried the same integer number of times, the observations reduce to an SBM with attenuated edge probabilities and the Abbe-Bandeira-Hall exact-recovery threshold applies. We show that this benchmark is not adaptively optimal: a two-stage adaptive strategy succeeds with $n+o(n)$ queries in a regime where balanced uniform querying requires $m n$ queries for some $m>1$. With an additional subsampled graph, we prove a sublinear-query adaptivity gap: balanced data-independent uniform querying with a sublinear budget does not improve over the subsampled graph alone, whereas adaptive querying can target a small set of uncertain vertices and achieve exact recovery. Thus adaptive data acquisition can strictly improve the information-theoretic limits of exact recovery.

2606.02047 2026-06-02 stat.ML cs.LG math.ST stat.ME stat.TH 版本更新

Convex Distance Operator Transport: A Convex and Geometry-Preserving Formulation

凸距离算子传输:一种凸且保持几何的公式

Junhyoung Chung, Euijong Song, Won Hwa Kim, Gunwoong Park

发表机构 * KAIST(韩国科学技术院)

AI总结 提出凸距离算子传输(CDOT),通过算子正则化联合保持特征对应与内在几何结构,实现异质分布对齐,并证明其伪度量性质及与Gromov-Wasserstein的关系。

Comments This paper is 41 pages long, contains 6 figures, and has been accepted to ICML 2026

详情
AI中文摘要

我们引入了凸距离算子传输(CDOT),这是第一个凸最优传输框架,通过联合保持特征对应和内在几何结构来对齐异质域中的分布。具体来说,CDOT采用基于算子的正则化,通过引入距离算子和条件期望算子来对齐聚合的距离结构。因此,所提出的正则化提高了对局部几何变化的鲁棒性。我们进一步证明了得到的CDOT差异是带属性的紧度量测度空间上的有效伪度量。此外,我们通过一个新的色散间隙概念刻画了CDOT与Gromov-Wasserstein(GW)之间的关系,正式阐明了GW非凸性相对于CDOT凸性的几何来源。在有限样本情况下,我们推导了一个非渐近风险界,分解为优化误差和统计误差,并在全局收敛的Frank-Wolfe算法下建立了风险一致性。在合成点云、脑连接组和图分类基准上的实验表明,该方法优于现有方法,在实践中表现稳定可靠。

英文摘要

We introduce Convex Distance Operator Transport (CDOT), the first convex optimal transport framework that aligns distributions across heterogeneous domains by jointly preserving feature correspondence and intrinsic geometric structure. Specifically, CDOT employs an operator-based regularization that aligns aggregated distance structures by introducing distance and conditional expectation operators. Consequently, the proposed regularization improves the robustness to local geometric variations. We further prove that the resulting CDOT discrepancy is a valid pseudometric on the space of attributed compact metric-measure spaces. In addition, we characterize the relationship between CDOT and Gromov--Wasserstein (GW) through a new notion of dispersion gap, formally elucidating the geometric source of non-convexity in GW compared to the convexity of CDOT. In the finite-sample regime, we derive a non-asymptotic risk bound decomposed into optimization and statistical errors, establishing risk consistency under a globally convergent Frank--Wolfe algorithm. Experiments on synthetic point clouds, brain connectomes, and graph classification benchmarks demonstrate better performance over existing methods, with stable and reliable behavior in practice.

2606.02038 2026-06-02 physics.app-ph cs.LG 版本更新

Uncertainty-Aware Graph Neural Reconstruction of Urban Temperature Fields from Sparse Sensors under Deployment Constraints

部署约束下基于不确定性感知图神经网络的稀疏传感器城市温度场重建

Reda Snaiki, Abdelatif Merabtine

AI总结 提出一种不确定性感知图神经网络框架,从稀疏传感器重建每日最高温度场,支持距离约束传感器放置和概率超标映射,在蒙特利尔地区验证优于传统方法。

详情
AI中文摘要

从稀疏观测重建空间连续的每日温度场对于城市气候监测和热风险分析至关重要,但实际部署受限于传感器预算和间距约束。本研究提出一种不确定性感知图神经网络(GNN)框架,用于从稀疏传感器重建每日最高温度场,同时支持距离约束的传感器放置和概率超标映射。该模型使用基于图注意力的均值残差架构,通过高斯负对数似然训练,预测温度场和空间变化的预测不确定性场。传感器放置采用基于QR分解的本征正交分解(POD-QR)策略,并施加4公里最小传感器间距约束,与随机可行放置和最远点采样进行比较。该框架在蒙特利尔区域多边形上使用Daymet v4.1每日温度数据(1公里分辨率)进行评估,采用严格的时间留出协议(训练:2020-2023;测试:2024)。在传感器预算(10-40个传感器)下,所提出的GNN在未观测节点上的RMSE和MAE始终优于反距离加权和普通克里金法。传感器放置效应在低预算时最显著,在高预算时减弱,在施加间距约束下,约30个传感器时出现实际饱和状态。概率评估进一步显示,随着传感器密度增加,不确定性校准得到改善,并且比克里金法具有更好的锐度-校准权衡。这些结果支持所提出的框架作为不确定性感知温度场重建和面向决策的热风险映射的有效工具。

英文摘要

Reconstructing spatially continuous daily temperature fields from sparse observations is important for urban climate monitoring and heat-risk analysis, but practical deployments are limited by sensor budgets and spacing constraints. This study proposes an uncertainty-aware graph neural network (GNN) framework for reconstructing daily maximum temperature fields from sparse sensors while supporting distance-constrained sensor placement and probabilistic exceedance mapping. The model predicts both the temperature field and a spatially varying predictive uncertainty field using a graph-attention-based mean-residual architecture trained with a Gaussian negative log-likelihood. Sensor placement is addressed using a Proper Orthogonal Decomposition with QR factorization (POD-QR) strategy with a 4 km minimum inter-sensor distance constraint and is compared with random feasible placement and farthest-point sampling. The framework is evaluated over a Montreal-area polygon using Daymet v4.1 daily temperature data (1 km resolution) under a strict temporal hold-out protocol (training: 2020-2023; testing: 2024). Across sensor budgets (10-40 sensors), the proposed GNN consistently outperforms inverse distance weighting and ordinary kriging in RMSE and MAE on unobserved nodes. Sensor-placement effects are most pronounced at low budgets and diminish at higher budgets, with a practical saturation regime emerging around 30 sensors under the imposed spacing constraint. Probabilistic evaluation further shows improved uncertainty calibration with increasing sensor density and a better sharpness-calibration trade-off than kriging. These results support the proposed framework as an effective tool for uncertainty-aware temperature field reconstruction and decision-oriented heat-risk mapping.

2606.02035 2026-06-02 cs.AI cs.LG 版本更新

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

RL-ACRGNet:基于强化学习的胸部放射学报告生成网络

Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya

发表机构 * Human-AI Interaction (HAIx) Lab, Indian Institute of Technology Gandhinagar(人类-人工智能交互实验室,印度理工学院冈丁加尔) Department of Computer Science and Engineering, Madhav Institute of Technology and Science Deemed University (MITS-DU)(计算机科学与工程系,马达夫技术与科学 deemed 大学(MITS-DU)) Multimedia and Information Security Research Group, Department of Computer Science and Engineering, ABV-Indian Institute of Information Technology and Management(多媒体与信息安全研究组,计算机科学与工程系,ABV-印度信息科技与管理学院)

AI总结 提出RL-ACRGNet,一种结合预训练DenseNet编码器与多级LSTM解码器的离策略强化学习框架,通过度量奖励机制优化视觉语义嵌入,在IU-Xray和MIMIC-CXR数据集上超越基线,生成高质量临床报告。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

医学影像解读是现代临床诊断的基石,然而手动生成放射学报告既耗时又容易出现解读不一致。在医学AI领域,通过深度学习自动化这些描述有望简化临床工作流程并标准化诊断输出。然而,由于在捕获细粒度视觉特征和确保临床连贯性方面的局限性,准确的疾病检测和精确的报告生成仍然是重大挑战。为了解决这些问题,我们提出了RL-ACRGNet,一种改进的编码器-解码器模型,它将预训练的DenseNet编码器与多级LSTM解码器集成在离策略强化学习框架中。通过使用双网络方法,基于度量奖励机制细化视觉语义嵌入,我们证明RL-ACRGNet在IU-Xray数据集上持续优于最先进的基线,在BLEU-4(0.47%)、METEOR(0.17%)和ROUGE-L(0.518)上取得了定量改进。此外,在大规模MIMIC-CXR数据集上的综合评估证实了该模型的稳健泛化能力及其生成高质量、临床相关报告的能力。

英文摘要

Medical imaging interpretation is a foundational pillar of modern clinical diagnostics, yet the manual generation of radiology reports remains a time-consuming process prone to interpretation inconsistencies. Within the field of medical AI, automating these descriptions through deep learning promises to streamline clinical workflows and standardise diagnostic output. However, accurate disease detection and precise report generation remain significant challenges due to limitations in capturing fine-grained visual features and ensuring clinical coherence. To address these issues, we propose RL-ACRGNet, an improved encoder-decoder model that integrates a pre-trained DenseNet encoder with a multilevel LSTM decoder within an off-policy reinforcement learning framework. Using a dual-network approach to refine visual-semantic embeddings through a metric-based reward mechanism, we demonstrate that RL-ACRGNet consistently outperforms state-of-the-art baselines on the IU-Xray dataset, achieving quantitative improvements in BLEU-4 (0.47%), METEOR (0.17%) and ROUGE-L (0.518). Furthermore, comprehensive evaluations on the large-scale MIMIC-CXR data set confirm the robust generalisation of the model and its ability to generate high-quality, clinically relevant reports

2606.02027 2026-06-02 cs.RO cs.LG cs.MA 版本更新

World-Task Factorization for Robot Learning

世界-任务分解用于机器人学习

Eduardo Sebastián, Adrian Pfisterer, Vito Mengers, Oliver Brock, Amanda Prorok

发表机构 * Department of Computer Science and Technology, University of Cambridge, United Kingdom(计算机科学与技术系,剑桥大学,英国) Robotics and Biology Laboratory, Technische Universität Berlin(机器人与生物学实验室,柏林技术大学) Science of Intelligence (SCIoI), Cluster of Excellence, Berlin, Germany(智能科学(SCIoI),卓越中心,柏林,德国) Robotics Institute Germany(德国机器人研究所)

AI总结 提出将策略分解为世界因子和任务因子,通过可微图模型AICON与紧凑学习策略结合,实现零样本泛化到新配置并迁移到真实硬件。

详情
AI中文摘要

机器人学习必须产生能够泛化到新的约束、队友和环境组合的策略。为此,我们必须对策略进行结构性分解,这种选择决定了哪些部分泛化、哪些需要重新训练、哪些保持纠缠。现有方法涵盖从期望结构从数据扩展中涌现,到通过层次结构、技能库或学习专门化手工设计。在本文中,我们研究我们认为机器人学中最基本的分解:将世界与任务分离。我们研究了这种分解有原则的条件。世界因子是具身系统和环境的属性;它们独立于意图存在。任务因子由任务在世界所允许的事物上的逻辑定义。我们通过贝叶斯模型证据形式化这种不对称性:它与数据生成过程一致,通过分析世界模型保持高似然,并减少奥卡姆剃刀对任务参数的惩罚。我们通过将AICON(一个可微分的递归估计器和互连图,具有组合性,无需任务特定数据即可运行,并将成本梯度传播到执行器)与一个紧凑的学习策略配对来实例化这种分解,该策略调节梯度路径。梯度作为两个因子之间的接口:它们通过图携带世界结构,通过成本携带任务结构,从而在保持结构泛化的同时实现低维学习。我们在三个问题上测试了世界/任务分解,这些问题包含异构机器人、环境、任务逻辑和感觉运动模态。我们的框架在所有设置中优于端到端基线和分析启发式方法,零样本泛化到分布外配置,并无需重新训练即可迁移到真实硬件。

英文摘要

Robot learning must produce policies that generalize to new combinations of constraints, teammates, and environments. To achieve this, we must structurally factor the policy, which is a choice that dictates what generalizes, what requires retraining, and what remains entangled. Existing methods span a wide spectrum, from expecting structure to emerge from data scaling, to hand-designing it via hierarchies, skill libraries or learned specializations. In this paper, we study what we argue is the most fundamental factorization in robotics: separating the world from the task. We investigate the conditions under which this factorization is principled. World factors are properties of the embodied system and the environment; they exist independently of intent. Task factors are defined by the task's logic over what the world admits. We formalize this asymmetry through Bayesian model evidence: it aligns with the data-generating process, maintains high likelihood through an analytical world model, and reduces the Occam razor's penalty on task parameters. We instantiate this factorization by pairing AICON, a differentiable graph of recursive estimators and interconnections that is compositional, operates without task-specific data, and propagates cost gradients to actuators, with a compact, learned policy that modulates gradient paths. Gradients serve as the interface between the two factors: they carry world structure through the graph and task structure through costs, enabling low-dimensional learning while preserving structural generalization. We test the world/task factorization across three problems that encompass heterogeneous robots, environments, task logic and sensorimotor modalities. Our framework outperforms end-to-end baselines and analytical heuristics in all settings, generalizes zero-shot to out-of-distribution configurations, and transfers to real hardware without retraining.

2606.02022 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Ranking vs. Assignment: The Metric Mismatch in Multi-View Object Association

排名 vs. 分配:多视角目标关联中的度量不匹配

Matvei Shelukhan, Timur Mamedov, Aleksandr Chukhrov, Karina Kvanchiani

发表机构 * Tevian Moscow(莫斯科Tevian) Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学)

AI总结 本文揭示了多视角目标关联中常用的排名度量(如AP、FPR-95)与分配目标之间的根本性不匹配,并提出了基于Sinkhorn归一化的后处理方法以缓解该问题。

详情
AI中文摘要

多视角目标关联是一个重要的计算机视觉问题,是许多多相机感知任务的基础。虽然该任务自然被表述为受约束的一对一匹配问题,但最近的工作严重依赖成对排名度量(如AP和FPR-95)进行模型评估。我们强调了这些度量与实际分配目标之间的根本性不匹配。理论上,我们表明即使分配已经正确,AP和FPR-95也可能不完美,而基于Sinkhorn的归一化可以使它们完美。相反,最优的成对排名仍然可能导致错误的分配。我们通过使用基于Sinkhorn的归一化作为受控的后处理压力测试,在实践中验证了这种不匹配。我们表明,仅优化几个后处理参数就能显著提升AP和FPR-95,而分配级别的度量(如ACC和IPAA)却没有相应改进。

英文摘要

Multi-view object association is an important computer vision problem that underlies many multi-camera perception tasks. While this task is naturally formulated as a constrained one-to-one matching problem, recent works heavily rely on pairwise ranking metrics like AP and FPR-95 for model evaluation. We highlight a fundamental mismatch between these metrics and the actual assignment objective. Theoretically, we show that AP and FPR-95 can be imperfect even when the assignment is already correct, and that Sinkhorn-based normalization can make them perfect. Conversely, optimal pairwise ranking can still lead to incorrect assignments. We validate this mismatch in practice by using our Sinkhorn-based normalization as a controlled post-processing stress test. We show that optimizing just a few post-processing parameters significantly boosts AP and FPR-95 without corresponding improvements in assignment-level metrics such as ACC and IPAA.

2606.02020 2026-06-02 cs.CL cs.LG 版本更新

Unveiling the Entropy Dynamics of Chain-of-Thought Reasoning

揭示思维链推理的熵动力学

Ting Xu, Xu He, Yupu Lu, Jiankai Sun, Dong Li, Wai Lam, Jianye Hao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过熵动力学揭示思维链推理的两阶段结构(不确定性区域和置信区域),并提出基于CUSUM变化点检测的无训练框架实现早期退出和测试时缩放,以提升推理效率与可靠性。

Comments 21 pages, 10 figures, accepted in ICML2026

详情
AI中文摘要

本文研究了思维链(CoT)的熵动力学,揭示了一致的两阶段结构:一个探索性的不确定性区域,然后急剧过渡到收敛的置信区域。我们证明置信区域具有两个关键性质:1)高可靠性——置信区域中的答案变得高度准确和稳定,以及2)高冗余性——模型在达到正确答案后生成长时间的不必要token。这些性质解锁了更高效和可靠的推理策略:1)早期退出利用可靠性和冗余性,在收益递减时安全终止计算,以及2)测试时缩放使用置信区域信号优先考虑收敛轨迹。为了实施这些见解,我们将置信区域检测建模为序列变化点检测问题,首次将经典变化点方法应用于监控CoT推理。使用累积和(CUSUM)算法(一种统计最优的变化点检测器),我们开发了一个无训练框架用于实时推理控制。实验表明,我们的方法为早期退出建立了优越的帕累托前沿。CUSUM在减少11.1% token的情况下达到63.06%的准确率,在准确率上分别超过DEER和Dynasor 3.28%和4.36%。对于测试时缩放,CUSUM加权投票始终优于自一致性。

英文摘要

This paper investigates the entropy dynamics of Chain-of-Thought (CoT) and uncovers a consistent two-phase structure: an Uncertainty Region of exploration transitioning sharply to a Confidence Region of convergence. We demonstrate that the Confidence Region possesses two critical properties: 1) High Reliability -- answers in the confidence region become highly accurate and stable, and 2) High Redundancy -- models generate unnecessary tokens long after reaching the correct answer. These properties unlock more efficient and reliable inference strategies: 1) Early Exit leverages reliability and redundancy to terminate computation safely when returns diminish, and 2)Test-Time Scaling uses the Confidence Region signal to prioritize converged trajectories. To operationalize these insights, we formulate Confidence Region detection as a sequential change-point detection problem, being the first to apply classical change-point methods to monitor CoT reasoning. Using the Cumulative Sum (CUSUM) algorithm, a statistically optimal change-point detector, we develop a training-free framework for real-time inference control. Experiments show our approach establishes a superior Pareto-frontier for early exit. CUSUM achieves 63.06% accuracy with 11.1% token reduction, outperforming DEER and Dynasor by 3.28% and 4.36% in accuracy respectively. For test-time scaling, CUSUM-weighted voting consistently outperforms self-consistency.

2606.02016 2026-06-02 cs.LG 版本更新

Evaluating Real-World Generalizability of Algorithm Selection Models

评估算法选择模型的现实世界泛化能力

Gjorgjina Cenikj, Jakub Kudela, Eva Tuba, Tome Eftimov

发表机构 * Computer Systems Department, Jožef Stefan Institute(计算机系统部门,约泽夫·斯蒂芬研究所) Brno University of Technology(布拉格技术大学)

AI总结 通过跨基准测试系统评估算法选择模型在合成与现实优化问题上的泛化能力,分析其迁移性能并指出在特定领域应用中的挑战。

Comments 10 pages, 12 figures

详情
AI中文摘要

算法选择(AS)旨在通过利用可测量的问题特征和历史性能数据,自动为给定问题实例识别最合适的优化算法。在本研究中,我们研究了AS模型在合成和现实优化景观上的泛化能力。我们考虑了两个广泛使用的学术基准测试套件(BBOB和CEC)以及两个现实世界问题集(机器人轨迹优化任务和无人机路径规划问题)。通过系统的跨基准测试评估,我们分析了AS模型如何在领域之间迁移,识别了泛化成功或失败的情况,并强调了在现实、特定领域环境中应用AS时出现的挑战。我们的研究结果提供了对当前AS方法鲁棒性的见解,并为开发更可靠、广泛适用的现实世界优化AS系统提供了信息。

英文摘要

Algorithm Selection (AS) aims to automatically identify the most suitable optimization algorithm for a given problem instance by leveraging measurable problem characteristics and historical performance data. In this study, we investigate the generalization ability of AS models across both synthetic and real-world optimization landscapes. We consider two widely used academic benchmark suites (BBOB and CEC) and two real-world problem sets (robotics trajectory optimization tasks and unmanned aerial vehicle path-planning problems). Through a systematic cross-benchmark evaluation, we analyze how AS models transfer between domains, identify where generalization succeeds or breaks down, and highlight the challenges that arise when applying AS in realistic, domain-specific contexts. Our findings provide insights into the robustness of current AS approaches and inform the development of more reliable, broadly applicable AS systems for real-world optimization.

2606.02011 2026-06-02 cs.AI cs.LG 版本更新

Extreme Low-Bit Inference in Reasoning Models: Failure Modes and Targeted Recovery

推理模型中的极端低位推理:失败模式与针对性恢复

Ekaterina Alimaskina, Darya Rudas, Denis Shveykin, Gleb Molodtsov, Pavel Vasiliev, Aleksandr Beznosikov

发表机构 * University of Washington(华盛顿大学)

AI总结 针对大型推理模型在2位量化推理中因生成不稳定导致总token数膨胀而无法实现端到端加速的问题,提出轻量级FP16规划和循环救援两种控制方法,显著恢复模型精度并保持实际速度。

详情
AI中文摘要

大型推理模型(LRM)依赖长推理轨迹,导致推理成本高昂。虽然低位量化降低了每token解码成本,但我们表明,激进的2位推理可能无法实现端到端加速,因为生成过程中的不稳定性会膨胀总token数。2位量化不仅降低答案准确性,还常常产生更长的轨迹,包含重复循环、预算耗尽、延迟承诺和未闭合的推理段。我们分析了Qwen3推理模型在数学和常识基准上的完整推理轨迹,并表明准确率下降与这些过程级失败密切相关。为解决这些问题,我们引入了两种轻量级控制:FP16规划,为2位模型提供简短的高精度轮廓;以及循环救援,检测重复轨迹并要么承诺早期答案,要么回退到FP16。在MATH-500上,循环救援将Qwen3-8B准确率从17.2%提升至74.2%,而规划加循环救援将Qwen3-32B准确率从65.0%提升至87.2%。总体而言,我们的结果表明,当极端低位推理的失败被视为可控生成病理时,它变得可行:通过轻量级检测和选择性FP16支持,2位推理可以在恢复准确率的同时保持真实的端到端速度。我们的代码可在 https://github.com/brain-lab-research/quantized-reasoning 获取。

英文摘要

Large Reasoning Models (LRMs) rely on long reasoning traces, making inference expensive. While low-bit quantization reduces per-token decoding cost, we show that aggressive 2-bit inference can fail to deliver end-to-end speedup because instability in the generation process inflates total token count. Instead of merely lowering answer accuracy, 2-bit quantization often produces much longer traces with repetitive loops, budget exhaustion, delayed commitment, and unclosed reasoning segments. We analyze full reasoning traces of Qwen3 reasoning models across mathematical and commonsense benchmarks and show that accuracy degradation is tightly linked to these process-level failures. To address them, we introduce two lightweight controls: FP16 planning, which gives the 2-bit model a short high-precision outline, and loop rescue, which detects repetitive traces and either commits to an earlier answer or falls back to FP16. On MATH-500, loop rescue improves Qwen3-8B accuracy from 17.2% to 74.2%, while planning plus loop rescue improves Qwen3-32B from 65.0% to 87.2%. Overall, our results show that extreme low-bit reasoning becomes practical when its failures are treated as controllable generation pathologies: with lightweight detection and selective FP16 support, 2-bit inference can recover accuracy while preserving real end-to-end speed. Our code is available at: https://github.com/brain-lab-research/quantized-reasoning.

2606.02008 2026-06-02 stat.ML cs.LG 版本更新

Provable Data Scaling Law for Meta Learning via Complexity Minimization

通过复杂度最小化实现元学习的可证明数据缩放定律

Kazuto Fukuchi, Ryuichiro Hataya, Kota Matsui

AI总结 提出复杂度最小化框架,通过最小化跨源域的最坏情况下游模型复杂度,从理论上证明元学习中的预训练数据规模增大可提升少样本适应性能。

详情
AI中文摘要

预训练已成为现代机器学习的基本范式,其关键经验优势之一是随着预训练数据规模的增加,下游样本复杂度降低。然而,现有的预训练理论框架并未完全解释这一现象。在本文中,我们引入了复杂度最小化,一种新颖的元表示学习框架,旨在实现对此缩放行为的理论分析,该框架通过评估每个领域最适合的下游模型复杂度并最小化跨源域的最坏情况复杂度来学习表示。我们的端到端理论分析,涵盖从预训练到下游回归,表明该框架可证明地捕捉了这种缩放行为;特别地,我们展示了少样本适应的错误率随着元训练数据量的增加而改善。实验上,我们证明将复杂度正则化纳入现有的元学习方法中持续提高下游样本效率。

英文摘要

Pre-training has become a fundamental paradigm in modern machine learning, with one of its key empirical benefits being reduced downstream sample complexity as the scale of pre-training data increases. However, existing theoretical frameworks for pre-training do not fully explain this phenomenon. In this paper, we introduce complexity minimization, a novel meta-representation learning framework designed to enable theoretical analysis of this scaling behavior, which learns representations by evaluating the downstream model complexity best suited to each domain and minimizing the worst-case such complexity across source domains. Our end-to-end theoretical analysis, spanning pre-training through downstream regression, shows that this framework provably captures this scaling behavior; in particular, we show that the error rate of few-shot adaptation improves as the amount of meta-training data grows. Empirically, we demonstrate that incorporating complexity regularization into existing meta-learning methods consistently improves downstream sample efficiency.

2606.01999 2026-06-02 cs.LG cs.AI 版本更新

Why Do Time Series Models Need Long Context Windows?

为什么时间序列模型需要长上下文窗口?

Luca Butera, Giovanni De Felice, Andrea Cini, Cesare Alippi

发表机构 * Università della Svizzera Italiana(瑞士联邦理工学院) EPFL(瑞士联邦理工学院) Politecnico di Milano(米兰理工学院)

AI总结 本文从生成过程识别和条件预测两个目标出发,证明长上下文窗口通过降低生成过程的不确定性来提升预测性能,并表明即使对于记忆长度为P的过程,输入窗口必须严格大于P才能达到最小误差。

详情
AI中文摘要

现代用于预测时间序列组的深度学习模型依赖于越来越长的观测窗口。然而,增加窗口大小的好处通常被简单地归因于捕捉长程依赖,而关于全局预测模型如何利用输入观测的更广泛讨论一直有限。在本文中,我们表明预测时间序列组涉及两个目标:(i) 生成过程识别(GPI),即推断生成输入序列的具体过程,以及 (ii) 条件预测(CF),即根据输入观测预测未来值。从这个角度来看,最优预测可以解释为对所有可能数据生成过程的平均,并按输入窗口给定的似然加权。这为长上下文窗口的好处提供了另一种解释:它们降低了运行过程中输入时间序列由哪个具体过程生成的不确定性。我们证明,即使对于记忆长度为 $P$ 的过程,严格大于 $P$ 的输入窗口大小对于达到最小可实现误差是必要的。最后,我们展示了如何将 GPI 和 CF 解耦,以在不牺牲准确性的情况下提高计算可扩展性。在合成和真实数据上的实验验证了我们的见解及其对设计预测架构的相关性。

英文摘要

Modern deep learning models for forecasting groups of time series rely on increasingly longer observation windows. However, the benefit of increasing the window size is often simply attributed to capturing long-range dependencies, and broader discussion on how global forecasting models leverage input observations has been limited. In this paper, we show that forecasting groups of time series involves two objectives: (i) generative process identification (GPI), i.e., inferring the specific process generating the input sequence, and (ii) conditional forecasting (CF), i.e., predicting future values given input observations. From this perspective, optimal predictions can be interpreted as an average over plausible data-generating processes, weighted by their likelihood given the input window. This suggests another explanation for the benefits of long context windows: they reduce the uncertainty about which specific process is generating the input time series during operation. We prove that even for processes with memory length $P$, an input window size strictly larger than $P$ is necessary to achieve the minimum attainable error. Finally, we show how decoupling GPI and CF can improve computational scalability without compromising accuracy. Experiments on synthetic and real-world data validate our insights and their relevance for designing forecasting architectures.

2606.01993 2026-06-02 cs.CL cs.AI cs.LG 版本更新

MMG2Skill: Can Agents Distill In-the-Wild Guides into Self-Evolving Skills?

MMG2Skill: 智能体能否从野外指南中提炼出自我进化的技能?

Xinyu Che, Junqi Xiong, Yunfei Ge, Xinping Lei, Shihao Li, Hang Yan, Han Li, Yuanxing Zhang, Zhiqi Bai, Jinhua Hao, Ming Sun, Han Li, Jiaheng Liu

发表机构 * Nanjing University(南京大学) Kuaishou Technology(快手科技)

AI总结 提出MMG2Skill框架,将多模态异构的野外指南编译为可编辑技能,通过轨迹级根因反馈持续改进,在GUI控制、开放游戏和策略卡牌任务中显著提升VLM智能体性能。

Comments 35 pages, 12 figures, 13 tables. Code: https://github.com/NJU-LINK/MMG2Skill

详情
AI中文摘要

网络上丰富的程序性知识对于帮助智能体解决长程任务具有巨大潜力。然而,这些知识通常是多模态、异构、有噪声的,并且隐含地假设人类执行者,使得它们难以直接用作智能体所需的技能。为了弥合人类导向指南与智能体可执行技能之间的差距,我们将此问题形式化为指南到技能学习:将野外指南转换为可执行技能,并从智能体可观察的轨迹中持续改进它们。为了评估现有智能体在此任务上的能力,我们引入了MMG2Skill-Bench,这是针对该问题的首个基准测试。我们进一步提出了MMG2Skill,一个闭环框架,它将指南编译为可编辑技能,在执行过程中将固定的视觉语言模型(VLM)智能体条件化于这些技能,并从轨迹级根因反馈中修正技能,而不使用基准测试分数。在GUI控制、开放式游戏和策略卡牌游戏中,使用六个VLM骨干网络,MMG2Skill在每个模型-领域设置中始终优于普通基线智能体,在骨干网络上实现了宏观平均增益+12.8到+25.3个百分点。消融研究表明,直接用原始指南提示智能体会降低性能,而结构化技能构建和轨迹驱动修订对于观察到的改进都是必要的。在成功可推断的任务中,当成功信号适当校准时,基于分析器的提前停止进一步防止了后期性能退化,并节省了25%-53%的尝试次数。

英文摘要

Abundant procedural knowledge on the Web holds great potential for helping agents solve long-horizon tasks. However, such knowledge is often multimodal, heterogeneous, noisy, and implicitly assumes human executors, making it difficult to use directly as the skills required by agents. To bridge the gap between human-oriented guides and agent-executable skills, we formalize this problem as guide-to-skill learning: converting in-the-wild guides into executable skills and continuously improving them from trajectories observable to the agent. To evaluate the capability of existing agents on this task, we introduce MMG2Skill-Bench, the first benchmark designed for this problem. We further propose MMG2Skill, a closed-loop framework that compiles guides into editable skills, conditions a fixed vision-language model (VLM) agent on these skills during execution, and revises the skills from trajectory-level root-cause feedback without using benchmark scores. Across GUI control, open-ended gameplay, and strategic card play with six VLM backbones, MMG2Skill consistently outperforms vanilla baseline agents in every model-domain setting, achieving macro-average gains of +12.8 to +25.3 percentage points across backbones. Ablation studies show that directly prompting agents with raw guides can degrade performance, while both structured skill construction and trajectory-driven revision are necessary for the observed improvements. On success-inferable tasks, analyzer-based early stopping further prevents late-stage performance regressions and saves 25%-53% of attempts when the success signal is properly calibrated.

2606.01992 2026-06-02 cs.CV cs.AI cs.LG 版本更新

A Structured Benchmark for Text-Guided Anomaly Detection: When Language Stops Conditioning the Decision

文本引导异常检测的结构化基准:当语言停止条件化决策时

Stefano Samele, Eugenio Lomurno, Teodora Jovanovic, Sanjay Shivakumar Manohar, Alberto Crivellaro, Matteo Matteucci

发表机构 * Politecnico di Milano, AIRLab(米兰理工学院,AIRLab) S&H – Software & Hardware(S&H – 软件与硬件)

AI总结 提出结构化基准TGAD,通过三个场景逐步增加语言功能角色,评估多模态异常检测系统的文本引导能力,发现当前系统仅表面受语言条件化,标准基准高估了其能力。

详情
AI中文摘要

工业异常检测历来是单模态任务。最近的多模态视觉-语言模型产生了接受文本输入和图像的系统,并被呈现为支持文本引导的零样本和少样本检测。然而,这些方法使用继承自单模态基准的协议进行评估,这些协议保持文本条件不变,因此无法衡量语言是否条件化决策;报告的性能提升是否反映文本引导或强大的预训练视觉特征仍是开放问题。我们引入文本引导异常检测(TGAD),这是一个结构化基准,通过三个场景逐步增加语言的功能角色:MVTec AD上的受控提示敏感性设置;MVTec AD的组件标记扩展,要求模型将其评估限制在指定部件;以及新的组装面板数据集(APD),这是一个需要缺陷类型和组件位置知识的现实工业场景。我们评估每个范式的代表性模型:生成式大视觉-语言、无训练判别式和嵌入自适应判别式。在所有三个模型中,文本接口仅表面条件化决策:除非移除对象名词,否则提示内容被吸收(生成模型的I-AUROC从97.4降至82.6);一旦指令部件外的缺陷被视为正常,组件级指令不约束决策(从90.3降至66.3);当两者在APD上结合时,图像级判别崩溃至MVTec水平以下,一种情况低于随机水平(71.2、50.5、31.5)。这些结果表明,标准基准夸大了当前多模态异常检测系统的文本引导能力,并且此类协议是能够通过语言可靠控制以用于工业部署的模型的先决条件。

英文摘要

Industrial anomaly detection has historically been a unimodal task. Recent multimodal vision-language models have produced systems that admit textual input alongside the image and are presented as enabling text-guided zero- and few-shot inspection. Yet these methods are evaluated with protocols inherited from unimodal benchmarks that hold the textual condition constant and therefore cannot measure whether language conditions the decision; whether reported gains reflect text guidance or strong pretrained visual features remains open. We introduce Text-Guided Anomaly Detection (TGAD), a structured benchmark that progressively increases the functional role of language across three scenarios: a controlled prompt-sensitivity setting on MVTec AD; a component-tagged extension of MVTec AD that requires the model to restrict its assessment to an instructed part; and the new Assembled Panel Dataset (APD), a realistic industrial setting that requires both defect-type and component-location knowledge. We evaluate one representative model per paradigm: generative large vision-language, training-free discriminative, and embedding-adaptive discriminative. In all three, the textual interface conditions the decision only superficially: prompt content is absorbed unless the object noun is removed (the generative model's I-AUROC drops from 97.4 to 82.6); component-level instructions do not constrain the decision once defects outside the instructed part are admitted as normal (from 90.3 to 66.3); and when both combine on APD, image-level discrimination collapses below the MVTec level, in one case below chance (71.2, 50.5, 31.5). These results suggest that standard benchmarks overstate the text-guided capabilities of current multimodal anomaly detection systems, and that a protocol of this kind is a prerequisite for models that can be reliably controlled through language for industrial deployment.

2606.01987 2026-06-02 cs.DM cs.LG 版本更新

Graph Edit Distance Formulation for the Vehicle Routing Problem: Theory and Analysis

车辆路径问题的图编辑距离公式:理论与分析

Adel Dabah

发表机构 * Forschungszentrum Jülich(耶鲁斯研究中心)

AI总结 本文提出将车辆路径问题重新表述为图编辑距离最大化问题,通过边删除成本模型实现总路线成本最小化,并利用该公式进行结构分析和基准测试。

详情
AI中文摘要

我们证明车辆路径问题(VRP)可以重新表述为图编辑距离(GED)最大化问题。在简单的边删除成本模型下,最小化总路线成本等价于从完整实例图中删除的边的总权重最大化。该公式在边级别对VRP进行建模,其中解由选定的边而非路线序列定义,从而能够进行经典公式中难以实现的结构分析:解质量的每条边归因、最优性差距的分解、解稀疏性的刻画以及贪婪构造难以到达的边的识别。理论上,我们建立了一个合并-分解定理,表明Clarke-Wright节省等于每次合并的GED增量,以及一个近似转移定理,将GED近似比转化为VRP成本界限。利用这一重新表述,我们分析了90个已知最优解的CVRP基准实例。我们发现最优路由图仅使用5.5%的可用边,约3.0%的最优边在重复重启下始终未被Clarke-Wright启发式找到,并且成本差距分解为遗漏的最优边和替代的非最优边,两者总权重相当。边加性目标为未来的图神经网络边预测方法提供了自然的每条边监督信号,暗示了与图神经网络方法的潜在联系,这留待后续工作。

英文摘要

We show that the Vehicle Routing Problem (VRP) can be reformulated as a Graph Edit Distance (GED) maximization problem. Under a simple edge-deletion cost model, minimizing total route cost is equivalent to maximizing the total weight of edges deleted from the complete instance graph. This formulation models VRP at the edge level, where solutions are defined by selected edges rather than route sequences, enabling structural analyses that are difficult in classical formulations: per-edge attribution of solution quality, decomposition of the optimality gap, characterization of solution sparsity, and identification of edges that are hard to reach by greedy construction. Theoretically, we establish a merge-decomposition theorem showing that Clarke-Wright savings equal per-merge GED increments, and an approximation-transfer theorem that turns GED approximation ratios into VRP cost bounds. Using this reformulation, we analyze 90 CVRP benchmark instances with known optimal solutions. We find that optimal routing graphs use only 5.5% of available edges, that approximately 3.0% of optimal edges are consistently not found by Clarke-Wright heuristics under repeated restarts, and that the cost gap decomposes into missed optimal edges and substituted non-optimal edges of comparable total weight. The edge-additive objective provides a natural per-edge supervision signal for future graph neural network approaches to edge prediction, suggesting a potential connection to graph neural network approaches that we leave for follow-up work.

2606.01973 2026-06-02 cs.LG cs.CV 版本更新

A Closer Look at In-Distribution vs. Out-of-Distribution Accuracy for Open-Set Test-time Adaptation

开放集测试时自适应中分布内与分布外准确率的深入分析

Zefeng Li, Evan Shelhamer

发表机构 * University of British Columbia and Vector Institute(不列颠哥伦比亚大学和向量研究所)

AI总结 本文通过基准测试和提出新基线,揭示了当前开放集测试时自适应方法在平衡分布内准确率和分布外检测能力上的不足。

Comments TMLR 2026

详情
AI中文摘要

开放集测试时自适应(TTA)在存在输入偏移和未知输出类别的情况下更新模型。尽管近期方法在提高已知类别的分布内(InD)准确率方面取得了进展,但它们准确检测分布外(OOD)未知类别的能力仍未得到充分探索。我们在小规模CIFAR-10-C和大规模ImageNet-C的标准损坏基准上,对鲁棒和开放集TTA方法(SAR、OSTTA、UniEnt和SoTTA)进行了基准测试。对于CIFAR-10-C,我们使用来自SVHN和CIFAR-100的OOD数据,分别对应其损坏形式SVHN-C和CIFAR-100-C。对于ImageNet-C,我们使用来自ImageNet-O和Textures的OOD数据,分别对应其损坏形式ImageNet-O-C和Textures-C。ImageNet-O更接近ImageNet,包含未知但相关的物体类别(如食物类的“蒜香面包”与“热狗”,基础设施类的“高速公路”与“水坝”),而Textures则远离ImageNet,包含非物体图案(如“裂纹”泥土、“多孔”海绵、“纹理”树叶)。我们评估了TTA方法在CIFAR-10-C和ImageNet-C上对InD与OOD识别的准确率和置信度。我们在CIFAR-10-C上验证了每种方法自身OOD检测技术的准确率。我们还在ImageNet-C上进行了评估,并报告了准确率和标准OOD检测指标。我们进一步考察了更现实的设置,其中OOD数据的比例和速率可以变化。为了探索InD识别与OOD拒绝之间的权衡,我们提出了一种新的基线,将softmax/多类输出替换为sigmoid/多标签输出。我们的分析首次表明,当前的开放集TTA方法难以平衡InD和OOD准确率,并且它们仅能不完全地过滤OOD数据以进行自身的自适应更新。

英文摘要

Open-set test-time adaptation (TTA) updates models on new data in the presence of input shifts and unknown output classes. While recent methods have made progress on improving in-distribution (InD) accuracy for known classes, their ability to accurately detect out-of-distribution (OOD) unknown classes remains underexplored. We benchmark robust and open-set TTA methods (SAR, OSTTA, UniEnt, and SoTTA) on the standard corruption benchmarks of CIFAR-10-C at the small scale and ImageNet-C at the large scale. For CIFAR-10-C, we use OOD data from SVHN and CIFAR-100 in their respective corrupted forms of SVHN-C and CIFAR-100-C. For ImageNet-C, we use OOD data from ImageNet-O and Textures in their respective corrupted forms of ImageNet-O-C and Textures-C. ImageNet-O is nearer to ImageNet, as unknown but related object classes (like ''garlic bread'' vs. ''hot dog'' for food, or ''highway'' vs. ''dam'' for infrastructure), while Textures is farther from ImageNet, as non-object patterns (like ''cracked'' mud, ''porous'' sponge, ''veined'' leaves). We evaluate the accuracy and confidence of TTA methods for InD vs. OOD recognition on CIFAR-10-C and ImageNet-C. We verify the accuracy of each method's own OOD detection technique on CIFAR-10-C. We also evaluate on ImageNet-C and report both accuracy and standard OOD detection metrics. We further examine more realistic settings, in which the proportions and rates of OOD data can vary. To explore the trade-off between InD recognition and OOD rejection, we propose a new baseline that replaces softmax/multi-class output with sigmoid/multi-label output. Our analysis shows for the first time that current open-set TTA methods struggle to balance InD and OOD accuracy and that they only imperfectly filter OOD data for their own adaptation updates.

2605.02122 2026-06-02 cs.LG cs.AI 版本更新

STABLEVAL: Disagreement-Aware and Stable Evaluation of AI Systems

STABLEVAL: 面向AI系统的分歧感知与稳定评估

Akash Bonagiri, Gerard Janno Anderias, Saee Patil, Angelina Lai, Devang Borkar, Gezheng Kang, Ishant Gandhi, Setareh Rafatirad, Houman Homayoun

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对多数投票法在标注者分歧下导致排名不稳定的问题,提出STABLEVAL框架,通过建模潜在正确性和标注者混淆模式,实现稳定且不确定性感知的系统评估。

详情
AI中文摘要

人类评估仍然是评估现代AI系统的主要标准,然而标注者的分歧、偏见和变异性使得在标准多数投票聚合下系统排名变得脆弱。多数投票忽略了标注者可靠性和项目级别的模糊性,往往在标注者子集之间产生不稳定的比较。我们引入了STABLEVAL,一个分歧感知的评估框架,该框架对潜在项目正确性和标注者特定的混淆模式进行建模,以产生后验期望项目得分和校准的智能体级别分数。与Dawid-Skene等标签去噪方法不同,STABLEVAL明确设计用于稳定和不确定性感知的系统评估,而不是硬标签恢复。我们将排名稳定性形式化为首要评估目标,并分析聚合方法如何保留或扭曲底层标注者行为。在受控的合成实验和多个真实世界人工标注基准上,多数投票在标注者异质性和对抗性噪声下表现出增加的得分误差和排名不稳定性,而STABLEVAL产生了更稳定和统计上更合理的系统排名。这些结果表明,对分歧进行建模对于稳健和可复现的AI评估至关重要。

英文摘要

Human evaluation remains the primary standard for assessing modern AI systems, yet annotator disagreement, bias, and variability make system rankings fragile under standard majority vote aggregation. Majority vote discards annotator reliability and item-level ambiguity, often yielding unstable comparisons across annotator subsets. We introduce STABLEVAL, a disagreement-aware evaluation framework that models latent item correctness and annotator-specific confusion patterns to produce posterior expected item credit and calibrated agent-level scores. Unlike label-denoising approaches such as Dawid-Skene, STABLEVAL is explicitly designed for stable and uncertainty-aware system evaluation rather than hard label recovery. We formalize ranking stability as a first-class evaluation objective and analyze how aggregation methods preserve or distort underlying annotator behavior. Across controlled synthetic experiments and multiple real-world human-annotated benchmarks, majority vote exhibits increasing score error and ranking instability under annotator heterogeneity and adversarial noise, while STABLEVAL yields more stable and statistically grounded system rankings. These results demonstrate that modeling disagreement is essential for robust and reproducible AI evaluation.

2606.01954 2026-06-02 cs.LG stat.ML 版本更新

Flow-Transformed Implicit Processes for Function-Space Variational Inference

流变换隐式过程用于函数空间变分推断

Luis A. Ortega, Andrés R. Masegosa, Thomas D. Nielsen

发表机构 * Aalborg University(奥尔堡大学)

AI总结 提出流变换隐式过程(FTIP),通过归一化流增强组合权重的变分分布,从而在函数空间中捕获非对称、重尾和多模态后验结构,并使用黑盒α目标进行优化。

Comments 24 pages, 4 figures, 10 tables. Pre-print submitted for revision

详情
AI中文摘要

隐式过程先验通过灵活的生成机制定义函数上的分布,使其对贝叶斯函数空间建模具有吸引力。然而,使用此类先验进行后验推断具有挑战性,因为其诱导的函数空间分布通常没有闭式解。一种实用策略是使用有限个采样函数的集合来近似先验,然后将后验函数表示为这些样本的学习组合。现有方法通常对组合权重施加高斯变分分布。虽然易于处理,但这种选择限制了可表示的后验不确定性形状,特别是当真实后验是非对称、重尾或多模态时。我们提出流变换隐式过程(FTIP),一种变分推断方法,使这种有限维函数空间近似更具表达力。FTIP不使用高斯分布,而是使用归一化流来定义更丰富的变分分布,从而在保持可处理优化的同时诱导灵活的后验函数分布。我们使用黑盒α目标训练模型,从而能够比较质量覆盖和模式寻找的变分行为。实验表明,FTIP捕获了函数空间中的非对称和多模态后验结构,而高斯系数近似往往会平滑或崩溃这些结构。

英文摘要

Implicit-process priors define distributions over functions through flexible generative mechanisms, making them attractive for Bayesian function-space modelling. However, performing posterior inference with such priors is challenging because their induced function-space distributions are typically not available in closed form. One practical strategy is to approximate the prior using a finite collection of sampled functions, and then represent posterior functions as learned combinations of these samples. Existing approaches commonly place a Gaussian variational distribution over the combination weights. While tractable, this choice limits the shapes of posterior uncertainty that can be represented, especially when the true posterior is asymmetric, heavy-tailed, or multimodal. We propose Flow-Transformed Implicit Processes (FTIP), a variational inference method that makes this finite-dimensional function-space approximation more expressive. Instead of using a Gaussian distribution over the combination weights, FTIP uses a normalizing flow to define a richer variational distribution. This induces a flexible posterior distribution over functions while preserving tractable optimization. We train the model using a Black-Box α objective, allowing us to compare mass-covering and mode-seeking variational behaviour. Experiments show that FTIP captures asymmetric and multimodal posterior structure in function space that Gaussian coefficient approximations tend to smooth or collapse.

2606.01952 2026-06-02 cs.LG 版本更新

Randomized Least Squares Value Iteration itself is Joint Differentially Private

随机最小二乘值迭代本身是联合差分隐私的

Haiyang Lu, Pratik Gajane, Shaojie Bai, Mohammad Sadegh Talebi

发表机构 * Laboratoire d’Informatique Fondamentale d’Orléans (LIFO), Université d’Orléans(奥尔良基础信息学实验室(LIFO),奥尔良大学) College of Control Science and Engineering, Zhejiang University(浙江大学控制科学与工程学院) Department of Computer Science, University of Copenhagen(哥本哈根大学计算机科学系)

AI总结 研究随机探索算法RLSVI在表格MDP中的隐私保护,证明其内在噪声同时提供联合差分隐私保证。

Comments 12 pages, 0 figures

详情
AI中文摘要

随着强化学习越来越多地应用于医疗和推荐系统等敏感领域,隐私保护技术对于保护用户的敏感信息变得至关重要。我们研究在情节设置下的隐私保护强化学习,重点关注基于随机探索的算法,如随机最小二乘值迭代(RLSVI)。总体目标是研究随机探索如何与隐私机制所需的注入噪声相互作用。在这项工作中,我们展示了一种新的隐私分析,该分析描述了RLSVI中为探索设置的噪声如何同时提供隐私保护。具体来说,我们证明RLSVI在表格MDP中是$(\varepsilon(δ),δ)$-联合差分隐私的,其中$\varepsilon(δ) = rac{2AK}{H^2\log(2HSA)} + 2\sqrt{ rac{2AK\log(1/δ)}{H^2\log(2HSA)}}$,$S$和$A$分别是状态和动作的数量,$H$是情节的长度,$K$是情节的数量。

英文摘要

As reinforcement learning (RL) increasingly applies to sensitive domains, such as health care and recommendation systems, privacy-preserving techniques have become essential to protect users' sensitive information. We investigate privacy-preserving RL under an episodic setting, focusing on algorithms based on randomized exploration, such as Randomized Least Squares Value Iteration (RLSVI). The overall goal is to study how randomized exploration interacts with the injected noise required by privacy mechanisms. In this work, we show a new privacy analysis that characterizes how the noise in RLSVI set for exploration simultaneously provides privacy protection. Specifically, we prove that RLSVI is $(\varepsilon(δ),δ)$-joint differentially private in tabular MDP as is with $\varepsilon(δ) = \frac{2AK}{H^2\log(2HSA)} + 2\sqrt{\frac{2AK\log(1/δ)}{H^2\log(2HSA)}}$, where $S$ and $A$ are the number of states and actions respectively, $H$ is the length of an episode and $K$ is the number of episodes.

2606.01950 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Learning Action-Conditional and Object-Centric Gaussian Splatting World Models for Rigid Objects

面向刚性物体的学习动作条件与对象中心高斯溅射世界模型

Jens U. Kreber, Lukas Mack, Joerg Stueckler

发表机构 * Intelligent Perception in Technical Systems Group(技术系统智能感知组)

AI总结 提出MRO-GWM模型,通过对象中心高斯表示和时空变换器架构,学习刚性物体在3D中的动作条件动力学,支持多物体场景和部分观测下的未来运动预测。

详情
AI中文摘要

世界模型使智能体能够预测其动作对环境的影响。在本文中,我们提出了多刚性物体高斯世界模型(MRO-GWM),一种学习刚性物体在3D中动作条件动力学的新模型。通过用对象中心高斯表示场景,我们可以表示任意物体形状和多物体场景。我们开发了一种新颖的时空变换器架构,该架构根据物体高斯的历史和未来动作预测未来的刚体运动。物体通过其在规范坐标系中的高斯表示,从而可以将物体运动描述为刚体变换。我们的模型在多视角重建上进行训练,这要求模型处理因遮挡导致的物体部分观测。我们分析了该方法在由典型家庭物体组成的合成数据集上的预测性能,这些数据集包含多物体动力学和机器人末端执行器的交互。我们还在模拟中评估了模型在非抓取操作中的模型预测控制性能。

英文摘要

World models enable intelligent agents to predict the consequences of their actions on the environment. In this paper, we propose Multi Rigid Object Gaussian World Model (MRO-GWM), a novel model that learns action-conditional dynamics of rigid objects in 3D. By representing the scene by object-centric Gaussians, we can represent arbitrary object shapes and multi-object scenes. We develop a novel spatio-temporal transformer architecture that predicts future rigid body motion from a history of object Gaussians and future actions. Objects are represented by their Gaussians in a canonical frame, which allows for describing object motion as rigid body transformation. Our model is trained on reconstructions from multiple viewpoints, which requires the model to handle partial observations of objects due to occlusions. We analyze prediction performance of our approach on synthetic datasets composed of typical household objects with multi-object dynamics and interactions by a robot end effector. We also evaluate our model in model-predictive control for non-prehensile manipulation in simulation.

2606.01934 2026-06-02 cs.LG cs.CL 版本更新

HMPO: Hybrid Median-length Policy Optimization for Chain-of-Thought Compression

HMPO: 用于思维链压缩的混合中位数长度策略优化

Minghui Zheng, Hongxu Chen, Huimin Ren, Hongsheng Xin, Xiaoyang Qu, Ze Wang, Shuling Yang, Ziyu Peng, Kaike Zhang, Pan Zhou, Kun Zhan

发表机构 * Li Auto Inc.(Li Auto公司)

AI总结 提出HMPO,一种单阶段强化学习框架,通过自适应中位数预算、余弦衰减令牌奖励和乘法奖励公式,在数学数据上训练后实现19%-46%的令牌压缩且精度损失极小,并泛化至多种任务。

详情
AI中文摘要

大型语言模型通过扩展的思维链推理取得了显著性能,但这一冗长过程带来了大量推理开销。现有的思维链压缩方法面临不灵活的手动长度预算、计算昂贵的多阶段训练流程以及仅适用于小模型的脆弱可扩展性。我们提出HMPO(混合中位数长度策略优化),一种经济高效的单阶段强化学习框架。HMPO通过三个协同组件高效压缩思维链:基于成功轨迹的自适应中位数预算以消除手动调整、用于平滑长度惩罚的余弦衰减令牌奖励,以及通过严格优先考虑答案正确性来大幅减轻琐碎奖励破解的乘法奖励公式。仅在数学数据上训练,HMPO无缝泛化到数学、代码、科学和指令遵循任务。在从9B到122B参数、涵盖密集和混合专家架构的大规模实验中,HMPO实现了19%-46%的令牌压缩,精度下降可忽略,同时与现有的多阶段基线相比大幅降低了训练成本。

英文摘要

Large language models achieve remarkable performance via extended chain-of-thought (CoT) reasoning, yet this lengthy process incurs substantial inference overhead. Existing CoT compression methods struggle with inflexible manual length budgets, computationally expensive multi-stage training pipelines, and fragile scalability restricted to small models. We propose HMPO (Hybrid Median-length Policy Optimization), a cost-effective, single-stage reinforcement learning framework. HMPO efficiently compresses CoT via three synergistic components: an adaptive median-based budget derived from successful rollouts to eliminate manual tuning, a cosine-decay token reward for smooth length penalization, and a multiplicative reward formulation that substantially mitigates trivial reward hacking by strictly prioritizing answer correctness. Trained exclusively on mathematical data, HMPO generalizes seamlessly across math, code, science, and instruction-following tasks. Extensive experiments scaling from 9B to 122B parameters across dense and Mixture-of-Experts (MoE) architectures demonstrate that HMPO achieves 19%--46% token compression with negligible accuracy degradation, all while drastically reducing training costs compared to existing multi-stage baselines.

2606.01923 2026-06-02 cs.CL cs.LG 版本更新

Resonant Context Anchoring: Decoupling Attention Routing and Signal Gain at Inference Time

共振上下文锚定:推理时解耦注意力路由与信号增益

Mingkuan Zhao, Yide Gao, Wentao Hu, Suquan Chen, Tianchen Huang, Zhenhua An, Zetao Chang, Xiayu Sun, Yuheng Min

发表机构 * Xi’an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学) Tongji University(同济大学) Tsinghua University(清华大学)

AI总结 提出共振上下文锚定(RCA)方法,通过解耦自注意力中的路由逻辑与信息幅度,在推理时动态增强上下文令牌的信号,有效抑制大语言模型的参数化幻觉,提升事实一致性。

详情
AI中文摘要

大型语言模型(LLM)在面对与内部参数记忆冲突的输入证据时,经常表现出“上下文忽视”,导致持续的事实幻觉。现有的缓解策略主要依赖于抑制特定神经元激活或使用计算昂贵的对比解码机制,这往往会导致困惑度增加或推理延迟显著升高。为了解决这些局限性,我们从残差流信号动力学的角度提出了一种轻量级的推理时干预方法——共振上下文锚定(RCA)。RCA旨在解决外部证据在深层网络传播过程中的信号衰减问题。其核心机制是在自注意力模块中正交解耦路由逻辑和信息幅度。通过利用原始的softmax前注意力分数作为语义对齐的即时度量,我们通过非线性整流构建动态增益场,选择性地放大上下文令牌对应的值向量的范数,而不改变注意力概率分布。该机制有效提升了残差流混合中输入证据的信噪比(SNR),从而在推理时稳健地将生成轨迹锚定到真实上下文。在Llama-3模型系列上的大量实验表明,RCA在多个事实一致性和强知识冲突任务中显著提高了上下文忠实度,有效抑制了参数化幻觉。此外,结果证实,作为一个无需训练且计算量可忽略的即插即用模块,RCA在保持模型通用语言理解能力的同时,在忠实度和流畅性上实现了帕累托改进。

英文摘要

Large Language Models (LLMs) frequently exhibit "contextual disregard" when faced with input evidence that conflicts with their internal parametric memory, leading to persistent factual hallucinations. Existing mitigation strategies primarily rely on suppressing specific neuron activations or employing computationally expensive contrastive decoding mechanisms, which often result in increased perplexity or significantly elevated inference latency. To address these limitations, we propose Resonant Context Anchoring (RCA), a lightweight inference-time intervention method grounded in the perspective of residual stream signal dynamics. RCA aims to resolve the signal attenuation of external evidence during its propagation through deep networks. The core mechanism involves the orthogonal decoupling of routing logic and information magnitude within the self-attention module. By utilizing raw pre-softmax attention scores as an instantaneous metric of semantic alignment, we construct a dynamic gain field via non-linear rectification to selectively amplify the norms of value vectors corresponding to context tokens, without altering the attention probability distribution. This mechanism effectively elevates the signal-to-noise ratio (SNR) of input evidence within the residual stream mixture, thereby robustly anchoring the generation trajectory to the truthful context during inference. Extensive experiments on the Llama-3 model series demonstrate that RCA significantly improves contextual faithfulness across multiple factual consistency and strong knowledge-conflict tasks, effectively suppressing parametric hallucinations. Furthermore, results confirm that as a training-free and computationally negligible plug-and-play module, RCA achieves a Pareto improvement in faithfulness and fluency while maintaining the model's general language understanding capabilities.

2606.01908 2026-06-02 cs.LG cs.CV 版本更新

Private and Stable Test-Time Adaptation with Differential Privacy

具有差分隐私的私有且稳定的测试时自适应

Zefeng Li, Qiaoyue Tang, Mathias Lecuyer, Evan Shelhamer

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出将多种测试时自适应方法转化为差分隐私形式,通过逐样本梯度裁剪和高斯噪声保护测试数据隐私,在ImageNet-C上实现隐私与精度的平衡,并发现裁剪机制能提升连续自适应的准确性和稳定性。

Comments ICML 2026

详情
AI中文摘要

测试时自适应(TTA)可以通过在推理过程中更新模型来减少在新数据上的误差。然而,这些更新引发了关于测试数据隐私的问题,因为模型参数现在依赖于所有过去的输入。为了控制这种隐私风险,我们将多种流行的TTA方法(Tent、EATA、SAR、DeYO和COME)转化为差分隐私(DP)形式,对所有更新应用逐样本梯度裁剪和高斯噪声。在ImageNet-C上,我们的DP-TTA方法在精度损失较小的情况下提供了足够的隐私,并且在低隐私机制下,DP的裁剪机制甚至可以改善连续设置中自适应的准确性和稳定性。这些对隐私和精度的改进仅带来适度的计算开销。这些关于私有TTA的初步结果提高了对该问题的认识,为开发更私密的测试时更新提供了信息,并确定了逐样本裁剪作为提高自适应准确性和稳定性的有效技术。

英文摘要

Test-time adaptation (TTA) can reduce error on new and different data by updating the model on these inputs during inference. However, these updates raise the issue of privacy w.r.t. the testing data, because the model parameters now depend on all past inputs. To control this privacy risk, we cast multiple popular TTA methods (Tent, EATA, SAR, DeYO, and COME) into differential privacy (DP) forms that apply per-sample gradient clipping and Gaussian noise for all updates. On ImageNet-C, our DP-TTA methods provide adequate privacy at small cost to accuracy, and in the low-privacy regime the clipping mechanism of DP can even improve the accuracy and stability of adaptation in the continual setting. These improvements to privacy and accuracy come at only modest computational overhead. These first results on private TTA raise awareness of the issue, inform the development of more private test-time updates, and identify per-sample clipping as an effective technique for improving the accuracy and stability of adaptation.

2606.01891 2026-06-02 cs.GR cs.LG 版本更新

MidSurfNet: Learnable Face Pairing and Interference Implicit Fields for Generalized Mid-surface Abstraction

MidSurfNet:面向广义中面抽象的可学习面配对与干涉隐式场

Li Ye, Xinhang Zhou, Xingyu Yang, Ruofeng Tong, Hailong Li, Peng Du, Min Tang

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) Shenzhen Poisson Software Co., Ltd.(深圳波森软件有限公司)

AI总结 提出MidSurfNet框架,通过可学习的面配对模块和干涉隐式场,解决薄壁CAD模型中多壁厚、自匹配及非中心偏移等复杂场景的中面抽象问题,实现87.32%的面配对准确率。

Comments 20 pages, 12 figures, 5 tables

详情
AI中文摘要

中面抽象对于薄壁CAD模型的有限元分析至关重要。现有的基于面配对的方法依赖手工几何启发式,但实际工业模型常呈现多壁厚区域、自匹配面配置,并需要非中心偏移曲面——在这些场景中,基于规则的方法始终失败。我们提出MidSurfNet,一个学习增强框架,通过两个新颖组件解决这些局限:(1) 神经面配对模块,从几何和拓扑特征学习预测面配对置信度,处理超越基于规则方法的复杂配对场景;(2) 干涉隐式场,将中面表示为两个符号距离函数的干涉,实现广义偏移控制,以便在下游CAE/FEA导向工作流中灵活定位。我们构建了一个包含超过1500个手动标注CAD模型的大规模中面数据集。实验表明,MidSurfNet达到87.32%的面配对准确率,并成功处理了困扰所有现有方法的多壁厚(完成率61.90%)和自匹配(完成率52.94%)场景。此外,MidSurfNet为面向CAE的应用提供了具有任意偏移控制的广义中面抽象的学习方法。

英文摘要

Mid-surface abstraction is essential for finite element analysis of thin-walled CAD models. Existing face pairing-based methods rely on handcrafted geometric heuristics, yet real-world industrial models frequently exhibit multi-wall-thickness regions, self-matching face configurations, and demand for non-center offset surfaces--scenarios where rule-based approaches consistently fail. We present MidSurfNet, a learning-augmented framework that addresses these limitations through two novel components: (1) a neural face pairing module that learns to predict face pair confidence from geometric and topological features, handling complex pairing scenarios beyond rule-based methods; and (2) an interference implicit field that represents mid-surfaces as the interference of two signed distance functions, enabling generalized offset control for flexible positioning in downstream CAE/FEA-oriented workflows. We construct a large-scale mid-surface dataset containing over 1,500 manually annotated CAD models. Experiments demonstrate that MidSurfNet achieves 87.32% face pairing accuracy and successfully handles multi-wall-thickness (61.90% completion) and self-matching (52.94% completion) scenarios that confound all existing methods. Furthermore, MidSurfNet provides a learning-based approach to generalized mid-surface abstraction with arbitrary offset control for CAE-oriented applications.

2606.01890 2026-06-02 cs.LG 版本更新

Segment-driven Structural Induction and Semantic Alignment for Heterogeneous Tabular Representation

面向异构表格表示的片段驱动结构归纳与语义对齐

Woojun Jung, Susik Yoon

发表机构 * Department of Computer Science \& Engineering, Korea University, Seoul, South Korea

AI总结 提出NAVI框架,通过掩码片段建模和熵驱动片段对齐,利用片段级结构归纳与语义对齐实现异构表格的表示学习。

详情
AI中文摘要

现实世界领域通常包含异构表格,其标题各不相同,但底层属性语义是共享的,这使得仅从表格局部证据中归纳领域专用语义变得困难。现有编码器对此问题进行了部分建模,但往往未充分利用列级值分布,并对具有不同语义角色的属性应用统一目标。我们提出NAVI,一种以片段为中心的预训练框架,将每个标题-值对视为聚合模式级结构证据和列级分布证据的单位。我们通过掩码片段建模和熵驱动片段对齐实现这一设计,共同强制结构化标题-值耦合以及跨稳定属性和实例特定属性的语义对齐。在异构领域内表格上的实验表明,在整体评估设置下,重建、语义一致性和下游效用均得到改善。

英文摘要

Real-world domains often contain heterogeneous tables whose headers vary while their underlying attribute semantics are shared, making it difficult to induce domain-specialized semantics from table-local evidence alone. Existing encoders model parts of this problem, but often underuse column-level value distributions and apply uniform objectives across attributes with different semantic roles. We propose NAVI, a segment-centric pretraining framework that treats each header-value pair as the unit for aggregating schema-level structural evidence and column-level distributional evidence. We realize this design through Masked Segment Modeling and Entropy-driven Segment Alignment, which jointly enforce structured header-value coupling and semantic alignment across stable and instance-specific attributes. Experiments on heterogeneous in-domain tables show improved reconstruction, semantic consistency, and downstream utility across evaluation settings overall.

2606.01883 2026-06-02 cs.LG cs.CV 版本更新

Beyond the Simplex: Balanced Prototype Geometry for Scorer-Agnostic Open-Set Recognition

超越单纯形:用于评分器无关的开放集识别的平衡原型几何

Mayank Sharma, Rohit Kumar Mourya

发表机构 * Indian Institute of Technology Jodhpur(印度理工学院乔浦尔)

AI总结 本文提出平衡等范数原型几何理论,统一分析不同嵌入维度下的开放集识别,证明评分器性能依赖于评分规则而非单纯形结构。

Comments 20 pages, 2 figures, 6 tables

详情
AI中文摘要

开放集识别(OSR)要求分类器拒绝来自未见类别的输入,这在医学成像等安全关键场景中至关重要。基于单纯形的方法将类原型固定在正则单纯形的顶点,然后通过距离比分数进行拒绝,这些方法在经验上表现良好但缺乏理论依据,且现有分析仅适用于嵌入维度d至少为C-1的情况,这是正则单纯形存在的条件。我们给出了在任意嵌入维度(包括d < C-1)下单纯形比OSR的理论解释。我们的分析集中于平衡等范数编码:具有等长和零和的原型配置,存在于所有d >= 2的情况,并包含正则单纯形作为特例。对于这些编码,我们证明辅助平方比分数的子水平集是欧几里得球的精确并集,进而包围了操作分数的接受区域;并且我们证明了一个尖锐的二分法:当且仅当d >= C-1时,原型达到等距对称性,行为类似于正则单纯形,低于该阈值时,由显式缺陷参数控制退化程度。我们进一步证明,在自然各向同性假设下,错误接受率随d指数衰减,并且操作分数是全局Lipschitz的,具有紧致接受区域。在实验上,我们将平衡原型几何作为分析工具和表示学习先验进行研究,而非作为独立的先进检测器。在CIFAR和MedMNIST开放集划分上,几何结构提供了有用的结构,但OSR性能仍然强烈依赖于评分规则:原始比率分数通常不如基于最近邻和logit的替代方案。

英文摘要

Open-set recognition (OSR) requires a classifier to reject inputs from unseen classes which is essential in safety-critical settings such as medical imaging. Simplex based methods, which fix class prototypes at the vertices of a regular simplex and then reject via a distance-ratio score, perform well empirically but lack theoretical justification, and existing analysis applies only when the embedding dimension d is at least C-1, which is the regime in which a regular simplex exists. We give a theoretical account of simplex-ratio OSR that holds in every embedding dimension, including d < C-1. Our analysis centers on balanced equal-norm codes: prototype configurations with equal lengths and zero sum, which exist for all d >= 2 and include the regular simplex as a special case. For these codes we show that an auxiliary squared ratio score has sublevel sets that are exact unions of Euclidean balls, which in turn bracket the acceptance region of the operational score; and we prove a sharp dichotomy: the prototypes attain one-distance symmetry, behaving like a regular simplex, if and only if d >= C-1, with controlled degradation governed by an explicit defect parameter below that threshold. We further show the false-acceptance rate decays exponentially in d under natural isotropy assumptions, and that the operational score is globally Lipschitz with compact acceptance regions. Empirically, we study balanced prototype geometry as both an analytic tool and a representation-learning prior, rather than as a stand-alone state-of-the-art detector. Across CIFAR and MedMNIST open-set splits, the geometry provides useful structure, but OSR performance remains strongly dependent on the scoring rule: raw ratio scores typically underperform nearest-neighbor and logit-based alternatives.

2606.01873 2026-06-02 cs.LG 版本更新

G2LoRA: Gradient Orthogonal Low-Rank Adaptation Framework for Graph Continual Learning on Text-Attributed Graphs

G2LoRA: 面向文本属性图的梯度正交低秩自适应框架用于图持续学习

Yuhan Wang, Yibo Ding, Yutong Ye, Mufan Zhao, Wenbo Zhang, Ruijie Wang, Jianxin Li

发表机构 * School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) Department of Statistics, Columbia University(哥伦比亚大学统计系) College of Computer Science, Beijing University of Technology(北京理工大学计算机学院)

AI总结 针对LLM-as-Aligner模型在文本属性图持续学习中的灾难性遗忘问题,提出G2LoRA框架,通过统一图-文本对齐目标、类别感知梯度投影和梯度幅度调制,实现任务间正向迁移并缓解模态漂移。

Comments Accepted by KDD 2026

详情
AI中文摘要

LLM-as-Aligner已成为文本属性图(TAGs)的一种流行预训练范式,通过CLIP风格的对比学习将图和文本模态对齐到共享嵌入空间。虽然在单个下游任务上有效,但我们观察到当此类模型在流式任务上顺序微调时会出现严重的灾难性遗忘。尽管参数高效微调在一定程度上缓解了遗忘,但仍不足以解决任务干扰和无效知识迁移。在这项工作中,我们研究了TAGs上LLM-as-Aligner模型的图持续学习,目标是减轻干扰同时促进任务间的正向迁移。该设置引入了两个基本挑战:(1)异构下游任务导致优化目标变化,阻碍统一微调;(2)图和文本编码器对自适应表现出不同的敏感性,不协调的更新容易导致错位。为应对这些挑战,我们提出了G2LoRA,一个面向TAGs的持续学习框架。G2LoRA将节点级、链接级和图级任务统一到单一的图-文本对齐目标下,并在领域/类别/任务增量模式下实现一致的优化。为减少任务干扰同时鼓励正向迁移,G2LoRA在结构化子空间中执行类别感知梯度投影,解决冲突更新并实现条件性反向迁移以平衡前向和后向知识流。为进一步防止跨模态漂移,G2LoRA引入梯度幅度调制来协调图和文本编码器之间的更新速率。在基准数据集上的大量实验表明,G2LoRA在不同骨干架构上始终优于强基线,实现了卓越的持续性能和可迁移性。

英文摘要

LLM-as-Aligner has emerged as a prevalent pre-training paradigm for Text-Attributed Graphs(TAGS), aligning graph and text modalities into a shared embedding space via CLIP-style contrastive learning. While effective on individual downstream tasks, we observe severe catastrophic forgetting when such models are sequentially fine-tuned on streaming tasks. Although parameter-efficient fine-tuning alleviates forgetting to some extent, it remains insufficient to resolve task interference and ineffective knowledge transfer. In this work, we study graph continual learning for LLM-as-Aligner models on TAGs, with the goal of mitigating interference while promoting positive transfer across tasks. This setting introduces two fundamental challenges: (1) heterogeneous downstream tasks induce shifting optimization objectives, hindering unified fine-tuning; and (2) graph and text encoders exhibit different sensitivities to adaptation, making uncoordinated updates prone to misalignment. To address these challenges, we propose G2LoRA, a continual learning framework for TAGs. G2LoRA unifies node-, link-, and graph-level tasks under a single graph--text alignment objective, and enables consistent optimization across domain/class/task incremental modes. To reduce task interference while encouraging positive transfer, G2LoRA performs category-aware gradient projection in structured subspaces, resolving conflicting updates and enabling conditional backward transfer to balance forward and backward knowledge flow. To further prevent cross-modal drift, G2LoRA introduces gradient magnitude modulation to coordinate update rates between graph and text encoders. Extensive experiments on benchmark datasets demonstrate that G2LoRA consistently outperforms strong baselines across different backbone architectures, achieving superior continual performance and transferability.

2606.01868 2026-06-02 cs.LG 版本更新

Task-Induced Representational Invariances Depend on Learning Objective in Deep RL

任务诱导的表征不变性依赖于深度强化学习中的学习目标

Manu Srinath Halvagal, Sebastian Lee, SueYeon Chung

发表机构 * Department of Physics, Harvard University(哈佛大学物理系) Kempner Institute, Harvard University(哈佛大学凯普纳研究所) Center for Computational Neuroscience, Flatiron Institute(Flatiron研究所计算神经科学中心)

AI总结 本文通过MDP约简理论分析深度强化学习中的表征,发现基于价值的方法(DQN)学习对MDP同态对称性不变的表征,而基于策略梯度的方法(PPO)学习对动作对称性不变的表征,这些差异影响迁移学习并在LLM中呈现提示依赖性。

详情
AI中文摘要

强化学习(RL)长期以来在神经科学中被用作目标导向动物行为的模型。现代深度RL在许多领域取得了显著成功,进一步强化了这一联系。学习高维状态空间的抽象表征能力是这一成功的基础。然而,对这些学习表征的理论理解仍然有限,阻碍了模型与动物学习之间的直接比较。我们通过MDP约简理论的视角分析深度RL表征来弥补这一差距。在导航任务中研究经典RL算法时,我们发现即使性能相当,基于价值的方法(DQN)学习对MDP同态对称性不变的表征,而基于策略梯度的方法(PPO)学习对动作对称性不变的表征。这些差异在不同领域中一致出现,对迁移学习有下游影响,并以提示依赖的方式出现在LLM中。我们的发现提供了一种比较不同RL算法学习表征的原则性方法,具有实际意义,并可能为大脑中的神经编码提供见解。

英文摘要

Reinforcement Learning (RL) has long served as a model for goal-directed animal behavior in neuroscience. Modern deep RL has shown remarkable success across many domains, further strengthening this connection. The ability to learn abstract representations of high-dimensional state spaces underlies much of this success. However, theoretical understanding of these learned representations remains limited, hindering direct comparisons between models and animal learning. We address this gap by analyzing deep RL representations through the lens of MDP reduction theory. Investigating canonical RL algorithms in a navigation task, we find that even when performance is comparable, the value-based method (DQN) learns representations that are invariant to MDP homomorphism symmetries, while the policy-gradient method (PPO) learns representations invariant to action symmetries. These differences emerge consistently across domains, have downstream consequences for transfer learning, and appear in LLMs in a prompt-dependent manner. Our findings provide a principled approach to comparing learned representations across RL algorithms, with demonstrated practical implications and possible insights for neural coding in the brain.

2606.01863 2026-06-02 cs.LG math-ph math.MP 版本更新

Continual Learning as a Multiphase Moving-Boundary Problem

持续学习作为多相移动边界问题

Snigdha Chandan Khilar

发表机构 * Independent Researcher(独立研究者)

AI总结 受熔化物理学启发,提出Stefan-CL方法,将知识巩固视为固相、未用容量视为液相,通过控制潜热调节边界移动,在几乎零遗忘下实现持续学习,无需存储原始数据。

详情
AI中文摘要

持续学习在平衡保留旧知识与吸收新任务方面面临困难。Stefan-CL通过熔化物理学优雅地解决了这一稳定性-可塑性困境。它将巩固的知识视为受保护的“固体”,未使用的容量视为可适应的“液体”。随着网络学习,这个边界在“潜热”调节旋钮的控制下扩展。通过数学上冻结已学习的内部区域,Stefan-CL将遗忘降至接近零,匹配了需要大量内存的基线方法,而无需存储原始数据,为AI开辟了一条优美且基于物理学的路径。

英文摘要

Continual learning struggles to balance retaining past knowledge with absorbing new tasks. Stefan-CL elegantly resolves this stability-plasticity dilemma through the physics of melting. It frames consolidated knowledge as a protected "solid" and unused capacity as an adaptable "liquid." As the network learns, this boundary expands, governed by a "latent heat" tuning dial. By mathematically freezing the learned interior, Stefan-CL cuts forgetting to near zero, matching memory-heavy baselines without storing raw data, forging a beautiful, physics-grounded path for AI.

2606.01861 2026-06-02 cs.LG 版本更新

A Theoretical Framework for Self-Play Theorem Proving Algorithms

自我对弈定理证明算法的理论框架

Thomas Chen, Zhiyuan Li

发表机构 * Thomas Chen(汤姆·陈) Zhiyuan Li(李志强)

AI总结 本文提出一个理论框架,通过将定理集建模为图并引入可逆随机游走和多样性度量,分析自我对弈算法在定理证明中的自我改进能力。

详情
AI中文摘要

自我对弈是一种使模型能够自我改进的训练算法,最近在利用大型语言模型进行形式定理证明方面显示出有希望的实证结果。(Dong & Ma, 2025) 将自我对弈实例化为两个协作智能体:证明者(证明定理)和猜想者(生成新定理作为证明者的课程)。在本文中,我们提供了一个理论框架,用于理解自我对弈算法在定理证明中的自我改进能力。首先,我们将定理集形式化为一个图,节点为定理,边连接具有相似语义的定理对。我们引入一组原始假设,刻画训练过的证明者的保证以及猜想者如何访问图的结构。其次,我们证明,如果底层定理图是良好连接的,那么一个基于可逆随机游走的猜想算法的证明者-猜想者系统足以指数级增长已证明定理的集合。第三,受自我对弈算法在实证中遇到的一个问题(猜想者倾向于生成人为复杂且非基础的定理)的启发,我们提出了一个由猜想者生成的定理训练分布的多样性度量,以及一种改进的猜想算法,该算法通过计算定理图中相邻定理之间的扩散相似性来局部最大化该多样性度量。最后,我们描述了一种通过对比学习将节点嵌入欧几里得空间,然后计算嵌入之间的内积来计算扩散相似性的方法。

英文摘要

Self-play, a type of training algorithm that enables a model to self-improve, has recently shown promising empirical results in the context of formal theorem proving using Large Language Models (LLMs). (Dong & Ma, 2025) instantiate self-play with two cooperating agents: a prover, which proves theorems, and a conjecturer, which generates new theorems as a curriculum to the prover. In this paper, we provide a theoretical framework for understanding the self-improvement capabilities of self-play algorithms for theorem proving. First, we formalize the set of theorems as a graph, with nodes as theorems and edges between pairs of theorems with similar semantics. We introduce a set of primitive assumptions that characterize the guarantees of a trained prover and how a conjecturer can access the structure of the graph. Second, we show that if the underlying graph of theorems is well-connected, then a prover-conjecturer system, where the conjecturing algorithm is based on a reversible random walk, is sufficient to grow the set of proved theorems exponentially. Third, motivated by an issue encountered empirically by self-play algorithms, where the conjecturer tends to generate artificially complex and non-fundamental theorems, we propose a diversity measure for a training distribution of theorems generated by a conjecturer and an improved conjecturing algorithm that locally maximizes this diversity measure, by computing the diffusion similarity between neighboring theorems in the theorem graph. Finally, we describe a method to compute the diffusion similarity by using contrastive learning to embed nodes into Euclidean space and then computing the inner-product between embeddings.

2606.01847 2026-06-02 cs.RO cs.LG 版本更新

The Lie We Tell: Correcting the Euclidean Fallacy in Vision Language Action Policies via Score Matching on Tangent Space

我们说的谎言:通过切空间上的分数匹配纠正视觉-语言-动作策略中的欧几里得谬误

Bing-Cheng Chuang, I-Hsuan Chu, Bor-Jiun Lin, YuanFu Yang, Min Sun, Chun-Yi Lee

发表机构 * National Taiwan University(台湾大学)

AI总结 针对扩散视觉-语言-动作策略将SE(3)位姿表示为平坦R^12向量导致的欧几里得谬误,提出Lie Diffuser Actor (LDA)框架,通过左不变SDE注入噪声、在切空间预测分数并利用指数映射回缩样本,从根本上消除流形漂移、保证坐标框架等变性和测地线最优性,在CALVIN ABC→D上平均任务长度从3.27提升至3.51。

Comments ICML 2026 Accepted

详情
AI中文摘要

基于扩散的视觉-语言-动作策略在机器人操作中取得了显著成功,但犯了一个我们称之为$ extbf{欧几里得谬误}$的基本几何错误:将SE(3)位姿表示为平坦的$\mathbb{R}^{12}$向量。这种近似导致(1)违反SO(3)约束的流形漂移,(2)坐标变换下等变性的破坏,以及(3)具有过高运动学代价的非测地线轨迹。我们提出$ extbf{Lie Diffuser Actor (LDA)}$,一个本质上在SE(3)上运行的扩散框架。我们的方法通过左不变SDE注入噪声,在切空间中预测分数,并通过指数映射回缩样本。这种表述通过构造消除了流形漂移,同时保证了坐标框架等变性和测地线最优性。在CALVIN ABC$ ightarrow$D上,LDA将平均任务长度从$3.27$提升到$3.51$($+7.3\%$)。我们进一步在真实机器人上验证了该方法,结果表明我们的方法在大多数任务上优于基线。

英文摘要

Diffusion-based Vision-Language-Action policies achieve remarkable success in robotic manipulation, yet commit a fundamental geometric error we term the $\textbf{Euclidean Fallacy}$: representing SE(3) poses as flat $\mathbb{R}^{12}$ vectors. This approximation induces (1) manifold drift violating SO(3) constraints, (2) broken equivariance under coordinate transformations, and (3) non-geodesic trajectories with excessive kinematic cost. We introduce $\textbf{Lie Diffuser Actor (LDA)}$, a diffusion framework operating intrinsically on SE(3). Our method injects noise through left-invariant SDEs, predicts scores in the tangent space, and retracts samples via the exponential map. This formulation eliminates manifold drift by construction while guaranteeing coordinate-frame equivariance and geodesic optimality. On CALVIN ABC$\rightarrow$D, LDA improves average task length from $3.27$ to $3.51$ ($+7.3\%$). We further validate our method on real robot and the results show that our methodology outperforms the baseline on majority tasks.

2606.01846 2026-06-02 cs.LG 版本更新

Mos-Gen: A Generative Molecular Framework for Mosquito Insecticide Design

Mos-Gen:用于蚊虫杀虫剂设计的生成式分子框架

Lina Wang, Yaning Cui

发表机构 * Shanghai Institute of Organic Chemistry, Chinese Academy of Sciences(上海有机化学研究所,中国科学院) DP Technology(DP技术)

AI总结 提出Mos-Gen框架,结合预训练分子表示模型Uni-Mol与变分自编码器,用于从头生成含二硫键的大蒜素衍生物作为蚊虫杀虫剂,实验验证预测阳性命中率达78%。

详情
AI中文摘要

蚊媒传染病每年在全球造成超过70万人死亡。传统化学杀虫剂的长期使用已导致严重的抗药性问题,迫切需要开发新型、高效且生态可持续的替代品。虽然该领域现有的人工智能方法主要集中于活性预测和分类,但在从头生成新型分子骨架方面存在关键空白。在本研究中,我们提出了Mos-Gen,一种基序感知的生成式协作框架,将预训练分子表示模型Uni-Mol与变分自编码器(VAE)相结合,专门用于设计含二硫键的大蒜素衍生物作为蚊虫杀虫剂。在生成的候选分子中,选择了14种化合物——包括9个预测阳性和5个预测阴性——进行化学合成和实验验证。预测阳性中的命中率达到78%,而预测阴性均未表现出杀蚊活性。这些实验结果充分验证了Mos-Gen框架的高精度筛选能力。

英文摘要

Mosquito-borne infectious diseases cause more than 700000 deaths worldwide each year. The long-term use of conventional chemical insecticides has induced serious resistance problems, creating an urgent need to develop novel, highly effective, and ecologically sustainable alternatives. While existing artificial intelligence approaches in this domain have focused primarily on activity prediction and classification, they leave a critical gap in the de~novo generation of novel molecular scaffolds. In this study, we propose Mos-Gen, a motif-aware generative collaborative framework that couples the pretrained molecular representation model Uni-Mol with a variational autoencoder (VAE), specifically tailored for the design of disulfide-containing allicin derivatives as mosquito insecticides. Among the generated candidates, fourteen compounds -- comprising nine predicted positives and five predicted negatives -- were selected for chemical synthesis and experimental validation. The hit rate among the predicted positives reached 78%, whereas none of the predicted negatives exhibited mosquitocidal activity. These experimental results fully validated the high-precision screening capability of the Mos-Gen framework.

2606.01839 2026-06-02 cs.DC cs.AR cs.LG 版本更新

Observation, Not Prediction: Conversation-Level Disaggregated Scheduling for Agentic Serving

观察而非预测:面向智能体服务的对话级解耦调度

Jianru Ding, Ryien Hosseini, Pouya Mahdi Gholami, Mingyuan Xiang, Henry Hoffmann

发表机构 * Anonymous Authors(匿名作者)

AI总结 提出将调度单元从单轮提升至整个对话,利用对话中首轮计算密集与后续内存密集的两阶段可观察特性,实现无需预测的解耦调度,显著降低延迟并提升能效。

详情
AI中文摘要

基于LLM的智能体通过多轮依赖推理和工具调用来解决用户任务,产生的工作负载在任务到达时总成本未知。现有的多轮系统以轮次为调度单元,逐轮决定是否将预填充与解码解耦。该决策依赖于该轮的解码长度、工具行为和KV增长,这些量在调度器必须行动时不可观察,迫使系统进行预测。我们表明这种对预测的依赖是由调度单元而非工作负载强加的。将调度单元从轮次提升到对话,将轮次级的不规则性转化为稳定的两阶段结构:1) 计算密集的首轮预填充,随后是2) 长尾内存密集阶段。因此,以对话为调度单元,放置问题简化为读取首轮输入长度和每解码器KV占用率,两者均可直接观察。我们在ConServe中实例化这一原则,它将首轮预填充路由到高吞吐预填充器,精确传输KV缓存一次,并将对话固定到单个解码器处理其整个尾部,无需学习解码侧成本模型。与每轮预测基线相比,ConServe将p95首次有效令牌时间(对话首个用户可见输出的延迟)降低51.08%,能效提升7.51%,同时保持最后一轮的TBT和SLO;将两阶段映射到异构GPU层级可进一步增加22.75%的能效。

英文摘要

LLM-based agents resolve a user task through many turns of dependent inference and tool calls, producing a workload whose total cost is unknown when the task arrives. Existing multi-turn systems keep the turn as the scheduling unit and decide, turn by turn, whether to disaggregate prefill from decode. That decision rests on the turn's decode length, tool behavior, and KV growth, quantities that are not observable when the scheduler must act, forcing the system to predict them. We show this dependence on prediction is imposed by the scheduling unit, not the workload. Raising the scheduling unit from the turn to the conversation converts turn-level irregularity into a stable, two-phase structure: 1) a compute-bound turn-1 prefill followed by 2) a long, memory-bound tail. Thus, with the conversation as the scheduling unit, placement reduces to reading the first-turn input length and per-decoder KV occupancy, both directly observable. We instantiate this principle in ConServe, which routes the first-turn prefill to a high-throughput prefiller, transfers the KV cache exactly once, and pins the conversation to a single decoder for its entire tail, with no learned model of decode-side cost. Against a per-turn prediction baseline, ConServe reduces p95 time-to-first-effective-token (the latency of a conversation's first user-visible output) by 51.08% and improves energy efficiency by 7.51% while preserving last-turn TBT and SLOs; mapping the two phases onto heterogeneous GPU tiers adds a further 22.75% in energy efficiency.

2606.01838 2026-06-02 cs.CL cs.AI cs.LG 版本更新

LayerRoute: Input-Conditioned Adaptive Layer Skipping via LoRA Fine-Tuning for Agentic Language Models

LayerRoute: 基于LoRA微调的输入条件自适应层跳过方法用于智能语言模型

Prateek Kumar Sikdar

发表机构 * Accenture(埃森哲)

AI总结 提出LayerRoute,通过为每个Transformer块添加轻量级路由器和LoRA适配器,根据输入类型(工具调用或规划推理)自适应跳过层,在仅增加0.22%可训练参数下实现12.91%的跳过差异并提升质量。

Comments 10 pages, 3 figures, 4 tables

详情
AI中文摘要

智能语言模型系统交替使用两种结构不同的步骤类型:结构化工具调用(短、确定性、低困惑度)和开放式规划/推理步骤(长、复杂、高困惑度)。尽管存在这种异质性,当前的推理系统对每个步骤应用相同的计算量。我们引入LayerRoute,一个轻量级适配器,学习基于每个输入有选择地跳过Transformer块。LayerRoute为Qwen2.5-0.5B-Instruct中的24个Transformer块中的每一个增加:(1)一个每层路由器(约897个参数,Linear(896,1)),通过直通估计器输出硬二值门;(2)在Q/K/V/O注意力投影上的LoRA适配器(秩8,约1.08M参数)。骨干权重保持冻结。在智能体数据(Hermes、Glaive、GSM8K、Turing)上进行单次端到端训练,并加入门正则化项,迫使系统发现每个输入类型下哪些块是可跳过的。经过3000步(在A100 40GB上6.4分钟),LayerRoute实现了12.91%的跳过差异:工具调用跳过15.25%的FLOPs,而规划步骤仅跳过2.34%,仅使用1.10M可训练参数(占494M骨干的0.22%)。由于LoRA适配,质量相比基础模型有所提升,工具调用上的困惑度差为-1.29,规划步骤上为-1.30。

英文摘要

Agentic language model systems alternate between two structurally distinct step types: structured tool calls (short, deterministic, low perplexity) and open-ended planning/reasoning steps (long, complex, high perplexity). Despite this heterogeneity, current inference systems apply identical compute to every step. We introduce LayerRoute, a lightweight adapter that learns to selectively skip transformer blocks on a per-input basis. LayerRoute augments each of the 24 transformer blocks in Qwen2.5-0.5B-Instruct with: (1) a per-layer router (~897 parameters, Linear(896,1)) that outputs a hard binary gate via the straight-through estimator, and (2) LoRA adapters (rank 8, ~1.08M parameters) on the Q/K/V/O attention projections. The backbone weights remain frozen. A single end-to-end training pass on agentic data (Hermes, Glaive, GSM8K, Turing) with a gate regularisation term forces the system to discover which blocks are skippable per input type. After 3,000 steps (6.4 minutes on an A100 40GB), LayerRoute achieves a 12.91% skip differential: tool calls skip 15.25% of FLOPs while planning steps skip only 2.34%, using only 1.10M trainable parameters (0.22% of the 494M backbone). Quality improves over the base model due to LoRA adaptation, with perplexity delta of -1.29 on tool calls and -1.30 on planning.

2606.01833 2026-06-02 cs.LG cs.AI 版本更新

Learning Implicit Bias in Generative Spaces for Accelerating Protein Dynamics Emulation

学习生成空间中的隐式偏置以加速蛋白质动力学仿真

Kaihui Cheng, Zhiqiang Cai, Wenkai Xiang, Zhihang Hu, Siyu Zhu, Tzuhsiung Yang, Yuan Qi

发表机构 * Fudan University(复旦大学) Shanghai Academy of AI for Science(上海人工智能科学研究院) Shanghai Innovation Institute(上海创新研究院)

AI总结 提出在预训练生成式仿真器的生成空间中引入隐式历史依赖偏置,结合距离加权分数估计和环境支持正则化,通过重投影步骤保持结构有效性,显著提升采样多样性和稀有状态覆盖速度。

详情
AI中文摘要

蛋白质动力学生成式仿真器能够以分子动力学一小部分成本生成合理的轨迹,但它们继承了训练分布,在长期外推下倾向于重访已知状态而非到达稀有状态。受经典增强采样启发,我们在预训练仿真器的生成空间中引入隐式历史依赖偏置。具体来说,一个历史感知的分数估计器向冻结的仿真器添加距离加权偏置,引导逆时采样远离先前生成的结构,并通过环境支持项进行正则化。为在长时间尺度下保持结构有效性,一个基于分数的精化步骤利用冻结仿真器将漂移的样本重新投影到数据流形上。实验表明,该方法(i)在DynamicPDB-80上将多样性提升35%;(ii)在12个零样本快速折叠蛋白质上,单独使用学习到的偏置达到无偏仿真器覆盖的速度最高快约15倍,与精化结合后覆盖速度最高快约37倍,同时覆盖的低能态数量多约3倍。代码即将发布。

英文摘要

Generative emulators of protein dynamics produce plausible trajectories at a fraction of the cost of molecular dynamics, but they inherit their training distribution and tend to revisit known states rather than reach rare ones under long-horizon extrapolation. Inspired by classical enhanced sampling, we introduce an implicit, history-dependent bias in the generative space of a pretrained emulator. Specifically, a history-aware score estimator augments the frozen emulator with a distance-weighted bias that steers reverse-time sampling away from previously generated structures, regularized by an environment-support term. To preserve structural validity at long horizons, a score-based refinement step re-projects drifted samples onto the data manifold using the frozen emulator. Our experiments demonstrate that the method (i) raises diversity by $35\%$ on DynamicPDB-80; (ii) on $12$ zero-shot Fast-Folding proteins, the learned bias alone reaches the unbiased emulator's coverage up to ${\sim}15\times$ faster, and pairing it with refinement reaches the coverage up to ${\sim}37\times$ faster while covering ${\sim}3\times$ as many low-energy states. Code will be released soon.

2606.01827 2026-06-02 math.OC cs.LG stat.ML 版本更新

Adaptive Sharpness-Aware Minimization with a Polyak-type Step size: A Theory-Grounded Scheduler

自适应锐度感知最小化与Polyak型步长:一种理论驱动的调度器

Dimitris Oikonomou, Nicolas Loizou

发表机构 * Mathematical Institute for Data Science (MINDS), Johns Hopkins University, Baltimore, MD, USA(数据科学数学研究所(MINDS),约翰霍普金斯大学,巴尔的摩,MD,美国) Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA(计算机科学系,约翰霍普金斯大学,巴尔的摩,MD,美国) Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA(应用数学与统计学系,约翰霍普金斯大学,巴尔的摩,MD,美国)

AI总结 针对锐度感知最小化(SAM)对学习率敏感的问题,受随机Polyak步长启发,提出适用于SAM的Polyak调度器,在确定性和随机设置下实现自适应算法,并证明收敛性,实验表明性能优于或媲美调优的SAM基线。

Comments 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

锐度感知最小化(SAM)已成为训练机器学习模型的一种强大且广泛采用的优化器。通过显式最小化损失景观的锐度,SAM通常能提高泛化能力,同时提供强大的经验性能。然而,SAM及其变体,像大多数训练算法一样,对学习率的选择敏感,而学习率通常通过广泛的超参数调优或预定义调度器来选择。在这项工作中,受随机Polyak步长对随机梯度下降(SGD)有效性的最新进展的启发,我们推导了针对SAM风格更新的Polyak调度器,在确定性和随机设置下产生了新颖的自适应算法。在光滑设置中,我们证明了强凸目标的线性收敛性和确定性情况下凸目标的$\mathcal{O}(1/T)$收敛率。在随机设置中,我们建立了直到最优解邻域的类似收敛保证。数值实验表明,所提出的Polyak调度器实现了与精心调优的SAM基线相当或更好的性能,同时大大减少了对学习率调优的需求。

英文摘要

Sharpness-Aware Minimization (SAM) has established itself as a powerful and widely adopted optimizer for training machine learning models. By explicitly minimizing the sharpness of the loss landscape, SAM often improves generalization while delivering strong empirical performance. However, SAM and its variants, like most training algorithms, are sensitive to the choice of learning rate, which is typically selected through extensive hyperparameter tuning or predefined schedulers. In this work, motivated by recent advances on the effectiveness of stochastic Polyak step sizes for Stochastic Gradient Descent (SGD), we derive Polyak schedulers tailored to SAM-style updates, yielding novel adaptive algorithms in both deterministic and stochastic settings. In the smooth setting, we prove linear convergence for strongly convex objectives and an $\mathcal{O}(1/T)$ convergence rate for convex objectives in the deterministic case. In the stochastic setting, we establish analogous convergence guarantees up to a neighborhood of the optimum. Numerical experiments demonstrate that the proposed Polyak schedulers achieve performance comparable to or better than carefully tuned SAM baselines, while substantially reducing the need for learning-rate tuning.

2606.01816 2026-06-02 q-bio.BM cs.LG 版本更新

Site4Drug: Predicting Drug-Binding Target Sites with an AI Agent

Site4Drug: 利用AI智能体预测药物结合靶点

Taehan Kim, Sarrah Rose Mikhail Leung, Bharat Mekala, Jeongbin Park

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出Site4Drug,一种模态感知的靶点发现智能体,通过整合拓扑、亲水性、翻译后修饰等证据,输出带约束、风险标记和决策日志的可靶向区域排名列表,并自动推荐结合模态。

Comments Accepted to the ICML 2026 Workshop on Generative and Agentic AI for Biology (GenBio)

详情
AI中文摘要

选择在蛋白质上的干预位置(即选择可靶向位点)通常比选择结合物更模糊且更容易失败,尤其是对于膜蛋白,其可及性、拓扑和翻译后修饰(PTMs)限制了可作用区域。我们提出Site4Drug,一种模态感知的位点发现智能体,输出带有显式约束、证据摘要、风险标记和可追溯决策日志的可靶向区域排名列表。Site4Drug无需用户预先指定药物模态,而是利用与位点发现相同的证据(包括拓扑、亲水性、PTM倾向、二硫键、结构域背景和序列)推荐结合模态(例如抗体/肽类 vs 小分子)。重要的是,这些证据一致地应用于所有模态,包括小分子口袋发现,以避免选择化学上可行但生物学上被遮蔽的位点。

英文摘要

Selecting where to intervene on a protein (i.e., choosing a targetable site) is often a more ambiguous and failure-prone bottleneck than selecting what binds, especially for membrane proteins where accessibility, topology, and post-translational modifications (PTMs) constrain actionable regions. We present Site4Drug, a modality-aware site-finding agent that outputs a ranked list of targetable regions with explicit constraints, evidence summaries, risk flags, and a traceable decision log. Rather than requiring users to specify the drug modality upfront, Site4Drug can recommend a binding modality (e.g., antibody/peptide-like vs small-molecule) from the same evidence used for site discovery, including topology, hydropathy, PTM propensity, disulfides, domain context, and sequence. Importantly, this evidence is applied consistently across modalities, including small-molecule pocket discovery, to avoid selecting chemically plausible but biologically occluded sites.

2606.01811 2026-06-02 cs.CL cs.AI cs.LG 版本更新

"I've Seen How This Goes": Characterizing Diversity via Progressive Conditional Surprise

“我知道这会如何发展”:通过渐进条件惊奇度刻画多样性

Matthew Khoriaty, David Williams-King, Shi Feng

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) Stanford University(斯坦福大学)

AI总结 提出一种基于上下文学习的多样性度量方法 Decan(D_{Ca_n}),通过单次前向传递计算每个字节的得分,无需嵌入模型、参考语料或人工标注,在多个基准上验证了其有效性。

Comments 28 pages, 18 figures, 9 tables. Accepted to the Workshop on Generative AI, Creativity, and Human-AI Co-Creation @ ICML 2026 (non-archival). Code and data: https://github.com/AMindToThink/icl-diversity

详情
AI中文摘要

衡量创意输出的多样性对于评估训练后模式崩溃、比较解码策略以及量化AI和人类写作中的创造性行为至关重要。我们提出了一种使用上下文学习来度量多样性的新方法,其中“Decan”度量 $D_{Ca_n} = C \times a_n$ 是我们评估的工作实例:一个基于每个字节的得分,该得分从基础模型 $θ$ 的每个标记对数概率中读取,每次排列只需一次前向传递,无需嵌入模型、参考语料库和人工标签。该方法基于信息论,利用语言模型的上下文学习来检测任意数量输入之间的广泛相似性,并避免了训练专用模型的需要。同一流程对AI样本和人类编写的回答集进行评分,将多样性视为(回答、提示、评分模型)的一个属性。在Tevet和Berant基于人类判断的McDiv基准上,$D_{Ca_n}$ 在McDiv prompt_gen 集上达到了0.846的OCA,这是其表现最好的情况,仅次于Tevet和Berant报告的最强神经基线(SentBERT,0.897)。在OLMo-2-7B训练后流程中,$D_{Ca_n}$ 在基础→SFT→DPO→RLVR阶段单调下降,检测到创意写作应用所关注的多样性损失类型。

英文摘要

Measuring the diversity of creative outputs is central to evaluating post-training mode collapse, comparing decoding strategies, and quantifying creative behavior in both AI and human writing. We propose a new approach to measuring diversity using in-context learning, of which the ``Decan'' metric, $D_{Ca_n} = C \times a_n$, is the working instance we evaluate: a per-byte score read off the per-token log-probabilities of a base model $θ$ in a \emph{single forward pass} per permutation, with no embedding model, no reference corpus, and no human labels. This approach is grounded in information theory, makes use of language model in-context learning to detect a wide range of similarities between any number of inputs, and obviates the need to train a special-purpose model. The same pipeline scores AI samples and human-written response sets, with diversity treated as a property of (responses, prompt, scoring model). On Tevet and Berant's human-grounded McDiv benchmark, $D_{Ca_n}$ reaches OCA 0.846 on the McDiv prompt\_gen set where it performs best, behind the strongest neural baseline reported in Tevet and Berant (SentBERT, 0.897). On the OLMo-2-7B post-training pipeline, $D_{Ca_n}$ drops monotonically across the base $\to$ SFT $\to$ DPO $\to$ RLVR stages, detecting the type of diversity loss that creative-writing applications care about.

2606.01806 2026-06-02 cs.CL cs.AI cs.LG 版本更新

ProbeScale: Probing Analysis to Optimize Neural Scaling Laws for Efficient Small Language Model Inference

ProbeScale: 通过探测分析优化神经缩放定律以实现高效小语言模型推理

Sourav Das

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Indian Institution of Information Technology Kalyani(印度信息技术学院Kalyani)

AI总结 提出ProbeScale框架,利用缩放定律和探测分析从预训练小语言模型中识别参数高效子网络,在参数预算下最大化任务加权探测性能,实现5-10倍参数压缩并保持95%-98%原始性能。

Comments 7 pages, 2 figures, ACL

详情
AI中文摘要

小语言模型在能力与计算可行性之间取得了平衡。神经缩放定律指导其最优训练,表明它们拥有随规模增长而丰富的内部表示。然而,在严格的资源约束下部署即使是这些小语言模型也可能具有挑战性。语言模型探测提供了分析模型内部编码的语言知识的方法。我们提出ProbeScale,一个统一缩放定律和探测洞察的框架,用于在预训练小语言模型中识别参数高效的子网络。ProbeScale利用良好缩放的小语言模型的高质量表示,并使用任务特定探测来数学量化每层对目标下游能力的相关性。这使得能够选择在性能与参数规模之间最优权衡的子网络。我们将子网络选择形式化为在参数预算下寻找最大化聚合任务加权探测性能的层子集。在代表性小语言模型如RoBERTa-Large和T5-Base上的实验表明,ProbeScale识别出的子网络实现了5到10倍的显著参数减少,同时在目标任务上保持了高性能(原始小语言模型的95%至98%),优于启发式基线。

英文摘要

Small Language Models (SLMs) offer a balance between capability and computational feasibility. Neural scaling laws inform their optimal training, suggesting that they possess rich internal representations that scale with their size. However, deploying even these SLMs can be challenging under strict resource constraints. Language model probing provides methods for analyzing the linguistic knowledge encoded in a model's internals. We propose ProbScale, a framework that unifies insights from scaling laws and probing to identify parameter-efficient subnetworks within pre-trained SLMs. ProbScale utilizes the high-quality representations of well-scaled SLMs and uses task-specific probes to mathematically quantify the relevance of each layer for target downstream capabilities. This allows selecting subnetworks that optimally trade off performance against parameter size. We formulate the subnetwork selection as finding a layer subset maximizing aggregated, task-weighted probe performance under a parameter budget. Experiments on representative SLMs such as RoBERTa-Large and T5-Base demonstrate that ProbScale identifies subnetworks achieving significant parameter reduction, from 5 to 10 times, while maintaining high performance (95% to 98% of the original SLMs) on targeted tasks, outperforming heuristic baselines.

2606.01800 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Multilinguality of Large Language Models From a Structural Perspective

从结构视角看大语言模型的多语言性

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

发表机构 * Nara Institute of Science and Technology(奈良科学技術研究所)

AI总结 本研究通过表示结构分析探索大语言模型的多语言性,发现低资源语言与英语的结构差异大于高、中资源语言,且语言特定后训练改变结构但保留语言间关系。

详情
AI中文摘要

大型语言模型(LLMs)通过在多语言数据上进行预训练和后训练,在处理多种语言方面表现出色,尽管英语在训练数据中占主导地位。先前关注标记表示的研究揭示了这些LLMs如何处理非英语文本。尽管这些分析提供了有见地的发现,但它们未能捕捉到结构视角,而结构是语言的内在属性。在本研究中,我们通过表示结构分析探索LLMs的多语言性。我们的发现表明,低资源语言在结构上与英语的差异大于高资源和中资源语言,并且语言特定的后训练改变了它们的结构,同时保留了语言间的关系。

英文摘要

Large language models (LLMs) have excelled in processing multiple languages through pre- and post-training on multilingual data, even though English dominates the training data. Prior work focusing on token representations has revealed how those LLMs process non-English text. Although these analyses have provided insightful findings, they fail to capture a structural view, which is an inherent property of language. In this study, we explore the multilinguality of LLMs through representational structural analysis. Our findings reveal that low-resource languages are structurally more different from English than high- and mid-resource languages, and that language-specific post-training alters their structures while preserving inter-language relationships.

2606.01799 2026-06-02 cs.LG stat.ML 版本更新

Tree-Guided Identify-Then-Exploit: A Unified Framework of Best Arm Identification and Regret Minimization for Dueling Bandits

树引导的识别-然后-利用:决斗式赌博机中最佳臂识别与遗憾最小化的统一框架

Pu Wang, Yao-Xiang Ding

发表机构 * State Key Lab of CAD&CG(计算机辅助设计与图形学国家重点实验室)

AI总结 针对Condorcet赢家假设下的N臂随机决斗式赌博机,提出树引导的识别-然后-利用(TG-ITE)统一框架,通过共享树引导识别方法在O(N)次比较内找到高置信度候选,并针对不同目标设计利用策略,首次同时实现最佳臂识别O(N)样本复杂度、弱遗憾O(N)和强遗憾O(N log T)保证,并消除现有方法中O(log N)的次优差距。

详情
AI中文摘要

我们研究在Condorcet赢家假设下的$N$臂随机决斗式赌博机,考虑三个广泛采用的目标:最佳臂识别(BAI)、弱遗憾和强遗憾。我们提出树引导的识别-然后-利用(TG-ITE),据我们所知,这是第一个统一处理所有这些目标的框架。无需更强的假设,我们提出一种共享的树引导识别方法,在$O(N)$次比较内找到高置信度的候选。我们进一步提出不同的利用策略,利用这个热启动阶段来优化具体目标。这种方法使得我们的方法能够:(1)在没有通常采用的更强假设的情况下,实现BAI的$O(N)$样本复杂度;(2)构建第一个赢家保持风格的算法,实现$O(N)$弱遗憾;(3)享有与专门强遗憾方法相同的$O(N \log T)$保证;(4)实现BAI和弱遗憾的联合优化,两者均具有$O(N)$保证,消除了现有方法中$O(\log N)$的次优差距。我们的结果提供了证据,表明在决斗式赌博机中,BAI和遗憾最小化之间的权衡相对温和。

英文摘要

We study $N$-armed stochastic dueling bandits under the Condorcet-winner assumption, where three widely adopted objectives are considered: best-arm identification (BAI), weak regret, and strong regret. We propose Tree-Guided Identify-Then-Exploit (TG-ITE), the first unified framework to tackle all these objectives to our knowledge. Without requiring stronger assumptions, we propose a shared tree-guided identification approach to find a high-confidence incumbent within $O(N)$ comparisons. We further propose varied exploitation strategies to utilize this warm-start stage to optimize the specific objectives at hand. This methodology enables our approach to (1) achieve $O(N)$ sample complexity in BAI without commonly adopted stronger assumptions; (2) build the first winner-stays-style algorithm to achieve $O(N)$ weak regret; (3) enjoy the same $O(N \log T)$ guarantee as specialized strong-regret approaches; (4) realize the joint optimization of BAI and weak regret with $O(N)$ guarantees for both, eliminating the sub-optimal gap of $O(\log N)$ in the existing approach. Our results provide evidence that the trade-off between BAI and regret minimization is relatively benign in dueling bandits.

2606.01774 2026-06-02 cs.LG cs.AI 版本更新

FLARE: Diffusion for Hybrid Language Model

FLARE: 混合语言模型的扩散方法

Yuchen Zhu, Jing Shi, Chongjian Ge, Hao Tan, Yiran Xu, Wanrong Zhu, Jason Kuen, Koustava Goswami, Rajiv Jain, Yongxin Chen, Molei Tao, Jiuxiang Gu

发表机构 * Adobe Research(Adobe研究院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出FLARE框架,通过结合自回归和扩散目标、硬件感知内核和统一推理,将混合注意力LLM转换为支持并行解码的扩散模型,在保持能力的同时提升吞吐量。

详情
AI中文摘要

自回归(AR)大型语言模型(LLM)已取得广泛的实际成功,但顺序解码仍然是低延迟部署的关键瓶颈。近期的高效推理工作沿着两个方向推进:通过高效架构降低每次模型调用的成本,以及通过并行生成减少串行解码步骤。混合注意力骨干解决了前者,而扩散语言模型(dLLM)通过迭代并行去噪追求后者。结合这些优势仍然具有挑战性:AR到dLLM的转换通常无法保留种子检查点的能力,并且混合注意力循环状态和掩码约束使得扩散训练和服务变得复杂。我们提出了FLARE,一个针对混合注意力LLM的系统转换框架。我们的分析确定迁移数据质量是能力保留的主要决定因素,其重要性超过损失公式和注意力掩码设计。最终框架结合了token等价的AR和扩散目标、硬件感知内核以及统一推理,使得一个检查点能够同时支持AR风格的验证解码和扩散风格的并行去噪。从强大的AR检查点出发,使用有限的训练后数据,FLARE在模型规模上与领先的开源dLLM竞争,并在单GPU并发服务中相比开源dLLM基线实现了持续的吞吐量提升。我们的结果进一步表明,实际dLLM不仅受限于解码算法,还受限于迁移数据质量和当前块扩散目标的训练低效性,这促使我们联合设计数据、目标、架构和推理系统。

英文摘要

Autoregressive (AR) large language models (LLMs) have achieved broad practical success, but sequential decoding remains a key bottleneck for low-latency deployment. Recent efficient-inference work has progressed along two axes: reducing the cost of each model invocation through efficient architectures, and reducing serial decoding steps through parallel generation. Hybrid attention backbones address the former, while diffusion language models (dLLMs) pursue the latter via iterative parallel denoising. Combining these advantages remains challenging: AR-to-dLLM conversion often fails to preserve seed-checkpoint capability, and hybrid-attention recurrent states and masking constraints make diffusion training and serving nontrivial. We present FLARE, a systematic conversion framework for hybrid-attention LLMs. Our analysis identifies transfer data quality as the primary determinant of capability preservation, outweighing loss formulation and attention-mask design. The resulting framework combines a token-equal AR-and-diffusion objective, hardware-aware kernels, and unified inference, enabling one checkpoint to support both AR-style verified decoding and diffusion-style parallel denoising. Starting from strong AR checkpoints with limited post-training data, FLARE is competitive with leading open-source dLLMs across model scales and delivers consistent throughput gains over open-source dLLM baselines in single-GPU concurrent serving. Our results further suggest that practical dLLMs are limited not only by decoding algorithms, but also by transfer data quality and the training inefficiency of current block-diffusion objectives, motivating joint design of data, objectives, architectures, and inference systems.

2606.01764 2026-06-02 math.OC cs.GT cs.LG 版本更新

Accelerating Min-Max Optimization via Power-Law Stepsizes

通过幂律步长加速极小极大优化

Yue Wu, Weiqiang Zheng, Yang Cai, Haipeng Luo

发表机构 * University of Southern California(南加州大学) Yale University(耶鲁大学)

AI总结 本文提出确定性动态步长调度,将外梯度方法的最后迭代收敛率从Θ(T^{-1/2})加速到O(T^{-2/3+ε}),并通过分离外推和更新步长进一步达到近最优的O(T^{-1+ε})。

Comments 56 pages

详情
AI中文摘要

我们重新审视了无约束双仿射极小极大优化的外梯度(EG)方法的收敛保证。已知固定步长的EG实现了$Θ(T^{-1/2})$的最后迭代收敛率,这比通过引入锚定等额外机制可达到的最优$\mathcal{O}(T^{-1})$率要慢。受最近进展(动态步长本身可以显著加速梯度下降)的启发,我们询问动态步长是否也能类似地加速EG的最后迭代收敛。我们在此方向上给出了第一个正面结果。具体地,我们提供了一个确定性动态步长调度,将EG的收敛率加速到$\mathcal{O}(T^{-2/3+\varepsilon})$,对于任意$\varepsilon > 0$。我们还证明,当EG的外推和更新步使用相同步长时,该率是紧的。然后我们表明,允许外推和更新步使用不同步长进一步将收敛率提高到近最优的$\mathcal{O}(T^{-1+\varepsilon})$。我们的分析将步长调度简化为一个优化问题,其解导致遵循幂律分布(的离散化)的步长调度。我们提出的步长调度和分析可扩展到其他方法,如乐观梯度(OG),并表明对一般极小极大优化问题的更广泛适用性。

英文摘要

We revisit the convergence guarantees of the Extragradient (EG) method for unconstrained biaffine min-max optimization. It is known that EG with a fixed stepsize achieves a $Θ(T^{-1/2})$ last-iterate convergence rate, which is slower than the optimal $\mathcal{O}(T^{-1})$ rate attainable by incorporating additional mechanisms such as anchoring. Motivated by recent advances showing that dynamic stepsizes alone can significantly accelerate gradient descent, we ask whether dynamic stepsizes can similarly accelerate the last-iterate convergence of EG. We present the first positive result in this direction. Specifically, we provide a deterministic dynamic stepsize schedule that accelerates the convergence rate of EG to $\mathcal{O}(T^{-2/3+\varepsilon})$ for any $\varepsilon > 0$. We also show that this rate is tight when the extrapolation and update steps of EG use the same stepsize. We then show that allowing different stepsizes for the extrapolation and update steps further improves the convergence rate to the near-optimal $\mathcal{O}(T^{-1+\varepsilon})$. Our analysis reduces stepsize scheduling to an optimization problem, whose solution leads to a stepsize schedule that follows (a discretization of) a power-law distribution. Our proposed stepsize schedules and analysis extend to other methods, such as Optimistic Gradient (OG), and suggest broader applicability to general min-max optimization problems.

2606.01746 2026-06-02 cs.CV cs.LG 版本更新

Sensitivity as a Double-Edged Sword: A Trade-off Between Discriminability and Adversarial Robustness

敏感性是一把双刃剑:判别性与对抗鲁棒性之间的权衡

Kai Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文发现全连接分类器的高敏感性带来判别性但也导致脆弱性,而ℓ2距离分类器的不敏感性带来鲁棒性但限制性能,为此提出基于混合原型混合框架的ℓ2重分类器,通过融合稳定原型和动态原型实现判别性与鲁棒性的平衡,并设计混合替代攻击评估协议。

Comments 13 pages including reference, 4 figures

详情
AI中文摘要

现代神经网络极易受到对抗性扰动的影响。在这项工作中,我们指出这种脆弱性部分源于广泛使用的全连接分类器对此类扰动的敏感性。相比之下,简单的基于ℓ2距离的分类器表现出显著更强的鲁棒性。我们提供了充分的理论和实证分析,表明全连接分类器的高敏感性使其具有判别性,但也使其脆弱;相反,ℓ2分类器的不敏感性赋予了鲁棒性但限制了性能。受这种权衡的启发,我们提出了一种基于混合原型混合框架的新型ℓ2重分类器。该方法保留了全连接分类器的判别能力,同时利用了ℓ2距离的鲁棒性。它通过融合两种原型类型来产生基于ℓ2距离的预测:(1)通过指数移动平均更新的稳定数据集级原型,以及(2)使用直通估计器从全连接分类器预测生成的动态批量级原型。然而,这种基于直通估计器的动态架构给评估带来了重大挑战,例如梯度混淆和前向不连续性。为了解决这个问题,我们提出了一种新的严格评估协议——混合替代攻击,该协议使用多个替代模型以及强大的AutoAttack,以确保公平和稳健的评估。大量实验表明,我们的轻量级即插即用模块只需极少的微调,就能有效增强各种现有最先进对抗训练模型的对抗鲁棒性。

英文摘要

Modern neural networks are highly susceptible to adversarial perturbations. In this work, we identify that part of this vulnerability stems from the sensitivity of the widely used fully connected (FC) classifiers to such perturbations. In contrast, simple $\ell_2$ distance-based classifiers exhibit significantly greater robustness. We provide thorough theoretical and empirical analysis showing that while FC classifiers' high sensitivity makes them discriminative, it also makes them vulnerable. Conversely, $\ell_2$-classifiers' insensitivity grants robustness but limits performance. Motivated by this trade-off, we propose a novel $\ell_2$-reclassifier based on a Hybrid Prototype Mixing (HPM) framework. This method retains the discriminative power of FC classifiers while leveraging the robustness of $\ell_2$ distance. It yields $\ell_2$-distance-based predictions by fusing two prototype types: (1) stable, dataset-level prototypes updated via EMA, and (2) dynamic, batch-level prototypes generated from the FC classifier's predictions using a Straight-Through Estimator (STE). However, this dynamic, STE-based architecture introduces significant challenges for evaluation, such as gradient obfuscation and forward discontinuity. To address this, we propose a new, rigorous evaluation protocol, the Mixed Surrogate Attack (MSA), which uses multiple surrogates along with powerful AutoAttack to ensure a fair and robust assessment. Extensive experiments demonstrate that our lightweight, plug-and-play module, with minimal fine-tuning, effectively enhances the adversarial robustness of various existing SOTA adversarially trained models.

2606.01734 2026-06-02 cs.CV cs.LG cs.RO 版本更新

FlatVPR: Plug-and-play Geo-linear Residual Adapter for Geometric Rectification of Foundation Model Feature Manifolds

FlatVPR: 用于基础模型特征流形几何校正的即插即用地线性残差适配器

Rai Hisada, Kanji Tanaka

发表机构 * Fundamental Engineering for Knowledge-Based Society, Graduate School of Engineering, University of Fukui(知识社会基础工程,工程研究生院,福井大学)

AI总结 提出FlatVPR范式,通过可学习残差适配器和Pullback Flatness Loss抑制特征流形曲率,实现稀疏锚点下的线性插值重建,在NCLT数据集上显著提升视觉位置识别精度。

Comments 5 pages, 1 figure, technical report

详情
AI中文摘要

本文提出“FlatVPR”,一种新颖的几何校正范式,通过强制特征流形结构,使得两个相邻锚点 $\mathbf{z}_A$ 和 $\mathbf{z}_B$ 之间的任何描述符都可以通过线性插值 $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$(其中 $t \in [0,1]$ 表示相对位置)精确重建,从而有效平衡视觉位置识别(VPR)中地图轻量化和定位精度之间的权衡。尽管最先进的基础模型(如DINOv2-ViT-S/14)提供了鲁棒的语义特征,但其潜在流形表现出显著的曲率,将物理空间中的均匀线性运动投影到特征空间中高度非线性的轨迹上,这阻碍了稀疏锚点条件下的可靠重建。为了实现上述基于插值的重建,我们对原始基础特征 $\mathbf{z}$ 引入残差变换 $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$,其中 $\text{Res}(\cdot)$ 表示可学习的适配器。我们的方法通过数学上严谨的Pullback Flatness Loss显式抑制流形曲率,该损失最小化中间特征与连接相邻锚点的线性段之间的偏差,从而最小化流形的内在曲率。通过这种空间展平,地图构建被公式化为期望最大化(EM)框架,解耦为用于流形适应的连续M步和用于最优锚点选择准则的概念性E步。在NCLT数据集上的实验表明,即使在100米间隔的极端稀疏锚点和极端季节变化条件下,应用我们的适配器也能带来显著的性能提升。

英文摘要

This paper proposes ``FlatVPR,'' a novel geometric rectification paradigm that effectively bridges the trade-off between map lightweightness and localization accuracy in visual place recognition (VPR) by enforcing a feature manifold structure where any descriptor between two adjacent anchors $\mathbf{z}_A$ and $\mathbf{z}_B$ can be accurately reconstructed via linear interpolation $\hat{\mathbf{z}}_{pseudo} = (1-t)\mathbf{z}_A + t\mathbf{z}_B$, where $t \in [0,1]$ denotes the relative position. While state-of-the-art foundation models such as DINOv2-ViT-S/14 provide robust semantic features, their latent manifolds exhibit prominent curvature, projecting uniform linear motion in physical space onto highly non-linear trajectories in the feature space, which hinders reliable reconstruction under sparse anchor conditions. To enable the aforementioned interpolation-based reconstruction, we introduce a residual transformation $\hat{\mathbf{z}} = \mathbf{z} + \text{Res}(\mathbf{z})$ to the raw foundation features $\mathbf{z}$, where $\text{Res}(\cdot)$ represents a learnable adapter. Our method explicitly suppresses manifold curvature using a mathematically grounded Pullback Flatness Loss that minimizes the deviation of intermediate features from the linear segment connecting adjacent anchors, thereby minimizing the intrinsic curvature of the manifold. Through this spatial flattening, map construction is formulated within an Expectation-Maximization (EM) framework, decoupled into a continuous M-step for manifold adaptation and a conceptual E-step for optimal anchor selection guidelines. Experiments on the NCLT dataset demonstrate that the application of our adapter leads to significant performance improvements even under extremely sparse anchor conditions with 100m intervals and extreme seasonal changes.

2606.01725 2026-06-02 cs.AI cs.LG 版本更新

Characterization of Multi-Model Agentic AI Systems on General Tasks via Trace-Driven Simulation

基于迹驱动仿真的通用任务多模型智能体AI系统特征分析

Donghwan Kim, Prakhar Singh, Younghoon Min, Jongryool Kim, Jongse Park, Kiwan Maeng

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学) SK Hynix(SK海力士) KAIST(韩国科学技术院)

AI总结 本文提出GAIATrace数据集和Vidur-Agent仿真器,通过迹驱动仿真分析多模型智能体AI系统在通用任务上的行为特征。

Comments 13 pages, 18 figures, 2 tables

详情
AI中文摘要

智能体AI通过迭代规划、工具使用和基于观察结果的推理来完成任务。尽管其流行,但其系统级行为仍然知之甚少,特别是对于复杂数据集和智能体架构——由于高度非确定性执行、高昂的评估成本以及对专有模型的有限可见性。本文提出了GAIATrace,这是两个最先进的智能体系统(MiroThinker和OWL)运行GAIA(一个由异构通用任务组成的基准测试)的首个token级迹数据集。与先前的迹数据集不同,GAIATrace捕获了完整的推理token、任务级结构以及每个主要参与LLM的活动,从而支持深入的系统研究。作为数据集的补充,我们提出了Vidur-Agent,一个迹驱动的仿真器,可以重放GAIATrace以在多种模拟环境中进行可重复、低成本的系统评估。利用这两个工件,我们描述了现代智能体系统如何处理通用任务以及各种系统设计选择如何塑造其行为,得出了若干独特的发现。

英文摘要

Agentic AI completes tasks through iterative planning, tool use, and reasoning based on observed outcomes. Despite its popularity, its system-level behavior remains poorly understood, particularly for complex datasets and agent architectures-owing to highly non-deterministic execution, prohibitive evaluation costs, and limited visibility into proprietary models. This paper presents GAIATrace, the first token-level trace dataset of two state-of-the-art agentic systems (MiroThinker and OWL) running GAIA, a benchmark composed of a heterogeneous mix of general-purpose tasks. Unlike prior trace datasets, GAIATrace captures full reasoning tokens, task-level structures, and activities of every major participating LLMs, enabling in-depth systems research. Complementing the dataset, we present Vidur-Agent, a trace-driven simulator that can replay GAIATrace to perform reproducible, low-cost system evaluation across diverse simulated environments. Using both artifacts, we characterize how modern agentic systems handle general tasks and how various system design choices shape their behavior, yielding several unique findings.

2606.01723 2026-06-02 cs.LG cs.AI 版本更新

Shortcut to Nowhere: Demystifying Deep Spurious Regression

捷径通往虚无:揭秘深度虚假回归

Guanrong Xu, Jessica Li, Hao Wang, Yuzhe Yang

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) Rutgers University(罗格斯大学) Yang AI Lab(杨人工智能实验室)

AI总结 针对连续预测中的虚假相关性,提出利用标签和特征空间中虚假属性的相似性来校准分布,从而提升模型在分布偏移下的泛化能力。

详情
AI中文摘要

现实世界中的回归常常存在捷径:在训练中与连续目标虚假相关的属性,在部署偏移下不可靠;使用此类捷径回归目标可能在测试时灾难性失败。现有关于虚假相关性的研究主要关注分类,其中标签是分类的且组是自然定义的。然而,许多现实任务需要连续预测,其中不存在硬标签边界或离散的组-标签对。我们将深度虚假回归(DSR)定义为从具有属性-标签混淆的回归数据中学习,处理连续虚假相关性,并在测试时泛化到所有属性-标签组合。受分类和回归捷径内在差异的启发,我们提出利用标签和特征空间中虚假属性之间的相似性,从而在跨属性校准标签和学习特征分布时考虑邻近目标和相关组。在涵盖计算机视觉、环境感知和大语言模型(LLM)回归的常见真实世界DSR数据集上的大量实验验证了我们策略的优越性能。我们的工作填补了研究连续预测中虚假相关性的基准和技术空白。

英文摘要

Real-world regression often exhibits shortcuts: attributes that are spuriously correlated with continuous targets in training, yet unreliable under deployment shifts; regressing targets using such shortcuts may fail catastrophically at test time. Existing studies on spurious correlations focus primarily on classification, where labels are categorical and groups are naturally defined. However, many real-world tasks require continuous prediction, where hard label boundaries or discrete group-label pairs do not exist. We define Deep Spurious Regression (DSR) as learning from regression data with attribute-label confounding, addressing continuous spurious correlations, and generalizing to all attribute-label combinations at test time. Motivated by the intrinsic difference between classification and regression shortcuts, we propose to exploit the similarity among spurious attributes in both label and feature spaces, thereby accounting for nearby targets and related groups while calibrating both label and learned feature distributions across attributes. Extensive experiments on common real-world DSR datasets that span computer vision, environmental sensing, and large language model (LLM) regression verify the superior performance of our strategies. Our work fills the gap in benchmarks and techniques for studying spurious correlations in continuous prediction.

2606.01722 2026-06-02 cs.LG cs.AI cs.DC 版本更新

Post-Deterministic Distributed Systems: A New Foundation for Trustworthy Autonomous Infrastructure

后确定性分布式系统:可信自主基础设施的新基础

Jun He, Deying Yu

发表机构 * OpenKedge Inc.(OpenKedge公司)

AI总结 本文提出后确定性分布式系统(PDDS)模型,以协调确定性代码、随机模型和自主代理共存的异构环境,并定义了五大架构支柱及新的故障分类。

Comments 8 pages, 1 table

详情
AI中文摘要

几十年来,分布式系统通常假设正确的参与者执行协议指定的行为,具有稳定、外部定义和确定性的语义。经典理论广泛参数化了网络时序、通信拓扑和故障域,但参与者模型相对固定。将自主推理引擎、随机模型驱动代理和策略驱动参与者集成到云控制平面、事件响应系统和金融基础设施中,挑战了这一假设的普遍性。这些代理通常产生不同的推理路径、不同的操作轨迹和异构的内部表示,同时实现语义等价且正确的结果。在本文中,我们引入后确定性分布式系统(PDDS)作为研究和工程模型,用于协调确定性代码、随机模型和自主代理共存的异构环境。我们表明,经典分布式计算模型构成了这种参与者通用模型的零歧义特例。我们并非主张确定性系统消失;而是确定性执行不能再作为自主基础设施的通用参与者假设。最后,我们概述了后确定性基础设施的五大架构支柱:协议驱动开发、可验证代理基础设施、自主状态控制平面、语义法定保证和认知状态复制。认知状态复制将持久性和一致性模型从数据可见性扩展到知识可见性,实现代理记忆、可验证语义回滚以及跨推理参与者的连贯性。我们还定义了在此环境中出现的故障类别的分类法。

英文摘要

For decades, distributed systems have typically assumed that correct participants execute protocol-specified behavior with stable, externally defined, and deterministic semantics. Classical theory has extensively parameterized network timing, communication topologies, and failure domains, but this participant model has remained comparatively fixed. The integration of autonomous reasoning engines, stochastic model-driven agents, and policy-driven actors into cloud control planes, incident response systems, and financial infrastructure challenges the universality of this assumption. These agents often produce divergent reasoning paths, distinct operational traces, and heterogeneous internal representations while achieving semantically equivalent and correct outcomes. In this paper, we introduce Post-Deterministic Distributed Systems (PDDS) as a research and engineering model for coordinating heterogeneous environments where deterministic code, stochastic models, and autonomous agents coexist. We show that classical distributed computing models form a zero-ambiguity special case of this participant-general model. We do not argue that deterministic systems disappear; rather, deterministic execution can no longer serve as the universal participant assumption for autonomous infrastructure. Finally, we outline five architectural pillars of post-deterministic infrastructure: Protocol-Driven Development, Verifiable Agentic Infrastructure, Autonomous State Control Planes, Semantic Quorum Assurance, and Epistemic State Replication. Epistemic State Replication extends persistence and consistency models from data visibility to knowledge visibility, enabling agentic memory, Verifiable Semantic Rollback, and coherence across reasoning participants. We also define a taxonomy of failure classes that arise in this setting.

2606.01720 2026-06-02 cs.LG 版本更新

A Note on Stability for Orthogonalized Matrix Momentum with Client Sampling

关于带客户端采样的正交化矩阵动量的稳定性注记

Da Chang, Qiankun Shi, Lvgang Zhang, Yu Li, Ruijie Zhang

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Sun Yat-sen University(中山大学) Southern University of Science and Technology(南方科技大学) George Washington University(乔治华盛顿大学)

AI总结 研究带客户端采样的分布式矩阵优化中正交化动量更新的有限样本泛化界,通过耦合邻域稳定性递归和加权集中步骤导出上尾保证。

详情
AI中文摘要

我们研究了带矩阵值参数和正交化动量更新的客户端采样分布式优化方案的有限样本泛化。核心量是当每轮只有一部分客户端参与时,返回模型上总体目标与经验目标之间的差距。在独立异构客户端数据、不等本地样本计数和固定聚合权重下,我们通过耦合邻域稳定性递归和加权集中步骤导出了有限轮上尾保证。该界限通过放大因子 \(Y_i(\mathcal C)\) 保留客户端选择计数;在均匀全参与全批次情况下,当控制依赖于时间范围的放大项时,它产生 \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) 的缩放。矩阵正交化规则要求沿配对轨迹是Lipschitz的,该条件由正则化极型映射和归一化有限步Newton-Schulz正交化器满足。对于未正则化的矩阵符号,相同的论证需要耦合谱分离,而高斯平滑给出了有限轮平滑变体。一个一维反例说明了为什么间隙、平滑或正则性条件是必要的。

英文摘要

We study finite-sample generalization for a client-sampled distributed optimization scheme with matrix-valued parameters and orthogonalized momentum updates. The central quantity is the gap between the population and empirical objectives at the returned model when only a subset of clients participates in each round. Under independent heterogeneous client data, unequal local sample counts, and fixed aggregation weights, we derive a finite-round upper-tail guarantee from a coupled-neighbor stability recursion and a weighted concentration step. The bound keeps the client-selection counts through the amplification factor \(Y_i(\mathcal C)\); in the uniform full-participation full-batch regime, it yields \(\widetilde{\mathcal O}(n^{-1}+n^{-1/2})\) scaling whenever the horizon-dependent amplification terms are controlled. The matrix-orthogonalization rule is required to be Lipschitz along paired trajectories, a condition satisfied by regularized polar-type maps and normalized finite-step Newton--Schulz orthogonalizers. For the unregularized matrix sign, the same argument requires coupled spectral separation, whereas Gaussian smoothing gives a finite-round smoothed variant. A one-dimensional counterexample shows why a gap, smoothing, or regularity condition is necessary.

2606.01719 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Fair Finetuning Mitigates Distribution Inference Attacks

公平微调缓解分布推断攻击

Rakshit Naidu

发表机构 * Rakshit Naidu

AI总结 提出公平微调(FFt)方法,通过在等几率约束下对互补分布样本进行微调,将模型公平性指标与分布推断攻击中的对抗优势联系起来,并给出理论界限,实验证明能有效降低攻击成功率。

Comments 16 pages (11 main, 5 appendix)

详情
AI中文摘要

在敏感数据上训练的机器学习模型可能会无意中泄露其训练分布的群体级信息——这种威胁被称为分布推断攻击(DIA)。具有黑盒访问权限的对手可以在不直接观察任何训练数据的情况下推断敏感的人口统计属性,如子群比例。尽管已经提出了差分隐私和属性遗忘等防御措施,但公平性约束与分布泄漏之间的联系尚未被探索。我们提出了公平微调(FFt):在等几率(EO)约束下,对来自互补分布的样本进行微调。我们提供了完整的理论刻画,证明了紧界 $ ext{Adv}(\mathcal{A},M_f) \le Δ_{ ext{EO}} \cdot W$,其中 $W$ 量化了两个训练分布通过其敏感属性组成的可区分程度。我们还建立了FFt降低对抗优势的必要条件,并证明了该界的紧性。我们在六个数据集上进行了评估,涵盖表格数据(ACS Income、COMPAS、German Credit)、图像数据(UTKFaces)和自然语言处理数据(Bias in Bios)。基于重演的FFt在所有设置中一致地将对抗准确率差距降低到检测阈值 $τ=0.1$ 以下;在ACS Income上,差距从约15%下降到4%以下。我们的工作提供了第一个将模型测量的EO差异直接与其在DIA博弈中的对抗优势联系起来的正式界限,为统一的公平性和隐私防御开辟了新途径。

英文摘要

Machine learning models trained on sensitive data can inadvertently leak population-level information about their training distributions -- a threat known as distribution inference attack (DIA). An adversary with black-box access can infer sensitive demographic properties, such as subgroup proportions, without observing any training data directly. While defenses such as differential privacy and property unlearning have been proposed, the link between fairness constraints and distributional leakage remains unexplored. We propose Fair Fine-tuning (FFt): a trained model is fine-tuned on samples from the complementary distribution under an Equalized Odds (EO) constraint. We provide a complete theoretical characterization, proving the tight bound $\text{Adv}(\mathcal{A},M_f) \le Δ_{\text{EO}} \cdot W$, where $W$ quantifies how distinguishable the two training distributions are by their sensitive-attribute composition. We also establish a necessary condition for FFt to reduce adversarial advantage and prove tightness of the bound. We evaluate across six datasets spanning tabular (ACS Income, COMPAS, German Credit), image (UTKFaces), and NLP (Bias in Bios) modalities. Rehearsal-based FFt consistently reduces the adversarial accuracy gap below the detection threshold $τ!=!0.1$ across all settings; on ACS Income, the gap falls from $\sim!15%$ to under $4%$. Our work provides the first formal bound connecting a model's measured EO disparity directly to its adversarial advantage in the DIA game, opening a new avenue for unified fairness-and-privacy defenses.

2606.01717 2026-06-02 cs.LG 版本更新

Decentralized Instruction Tuning: Conflict-Aware Splitting and Weight Merging

去中心化指令微调:冲突感知拆分与权重合并

Minsik Choi, Geewook Kim

发表机构 * Korea University(韩国大学)

AI总结 提出MERIT方法,通过冲突感知拆分和权重合并实现去中心化指令微调,在Qwen2.5-VL-3B上8个基准平均分从54.3提升至57.0。

Comments 32 pages, 5 figures. Accepted for publication at ICML 2026

详情
AI中文摘要

指令微调使包括多模态在内的大语言模型与多样化的用户意图对齐,但扩展到异构混合数据时受到梯度干扰和带宽密集型同步的阻碍。我们提出是否可以通过独立训练部分混合数据并在参数空间中一次性协调它们来共同解决这两个瓶颈。我们在共享平坦盆地内发展了一个局部二次理论,得到三个结果:权重合并产生曲率加权方差减少;PCA对齐的冲突拆分沿高曲率方向最大化这一增益;合并还作为具有隐式范数正则化的谱滤波。这些结果直接激发了MERIT,一个去中心化的合并就绪指令微调流程,该流程估计数据集级别的梯度冲突,沿顶部PCA冲突轴划分混合数据,独立微调每个分区且无分区间通信,并通过令牌加权平均一次性合并。在Qwen2.5-VL-3B上使用136个Vision-FLAN任务,MERIT将8个基准平均分从54.3(联合训练)提升至57.0。相同的方案扩展到7B模型上,使用160万样本、176个源的混合数据——以最小成本开销匹配或超越集中式联合训练——并迁移到纯文本FLAN。我们的代码可在https://github.com/naver-ai/merit获取。

英文摘要

Instruction tuning aligns large language models, including multimodal ones, with diverse user intents, but scaling to heterogeneous mixtures is hindered by gradient interference and bandwidth-heavy synchronization. We ask whether these two bottlenecks can be addressed jointly by training parts of the mixture independently and reconciling them once in parameter space. We develop a local quadratic theory inside a shared flat basin that yields three results: weight merging produces a curvature-weighted variance reduction; PCA-aligned conflict splitting maximizes this gain along high-curvature directions; and merging additionally acts as spectral filtering with implicit norm regularization. These results directly motivate MERIT, a decentralized merge-ready instruction-tuning pipeline that estimates dataset-level gradient conflicts, partitions the mixture along the top PCA conflict axes, fine-tunes each partition independently with no inter-partition communication, and merges once via token-weighted averaging. On Qwen2.5-VL-3B with 136 Vision-FLAN tasks, MERIT improves the 8-benchmark average from 54.3 (joint training) to 57.0. The same recipe scales to a 7B model on a 1.6M-example, 176-source mixture -- matching or exceeding centralized joint training with minimal cost overhead -- and transfers to text-only FLAN. Our code is available at https://github.com/naver-ai/merit.

2606.01710 2026-06-02 cs.CV cs.LG 版本更新

Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs

零样本VLM中虚假相关性的密度感知转换

Afsaneh Hasanebrahimi, Hanxun Huang, Christopher Leckie, Sarah Erfani

发表机构 * School of Computing and Information Systems, The University of Melbourne, Victoria, Australia(计算与信息系统学院,墨尔本大学,维多利亚,澳大利亚)

AI总结 提出密度感知转换(DAT)方法,利用局部几何密度项修正图像-文本相似度,以缓解CLIP等视觉语言模型在零样本分类中因虚假相关性导致的性能下降。

Comments ICML 2026

详情
AI中文摘要

视觉语言模型(如CLIP)实现了强大的零样本分类。然而,它们的预测仍然对虚假相关性敏感,即上下文线索主导语义内容。早期的解决方案通常依赖于微调或提示工程,这要么削弱了预训练模型的优势,要么容易产生幻觉。在这项工作中,我们提出了密度感知转换(DAT),它使用从组参考集导出的局部几何密度项来细化图像-文本相似度分数。我们的方法受到以下现象的启发:CLIP嵌入表现出模态间隙,并位于特征空间中的各向异性壳上:常见模式聚集在均值附近,而罕见模式被推向外围。这种几何结构产生了不均匀的对齐,其中虚假相关性被放大,而语义上有意义但罕见的线索被边缘化。为了解决这个问题,我们采用相对度量根据嵌入密度重新缩放相似度,抑制扩散区域中过度自信的分数,同时保留密集、语义一致的匹配。在基准数据集上的实验结果表明,最差组和平均准确率持续提高,突出了密度感知转换作为一种简单有效的校准机制,用于使用多模态模型进行可靠的零样本分类。

英文摘要

Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

2606.01708 2026-06-02 cs.LG cs.AI 版本更新

Two-Fidelity Best-Action Identification for Stochastic Minimax Tree

随机极小极大树的双保真度最优动作识别

Peter Chen, Xi Chen

发表机构 * Department of Mathematics, Columbia University(哥伦比亚大学数学系) Stern School of Business, New York University(纽约大学斯特恩商学院)

AI总结 针对随机极小极大树中的固定置信度最优动作识别问题,提出双保真度树搜索算法2FFS,结合极小极大快速扩展与MCTS随机采样,自适应选择廉价有偏评估或昂贵精确评估,理论证明固定置信度正确性、有限停止及多项式深度成本上界,实验表明比现有BAI-MCTS基线显著减少样本和计算。

Comments 36 pages

详情
AI中文摘要

我们研究随机极小极大树中的固定置信度最优动作识别(BAI)。该问题在现代AI规划中日益重要,其中深度极小极大搜索和带有语言模型长滚动的蒙特卡洛树搜索(MCTS)面临一个基本权衡:启发式评估廉价但有偏,而精确滚动可靠但代价高昂。我们提出2FFS,一种双保真度树搜索算法,将多保真度平面赌博机思想引入树中。该算法结合了极小极大风格的快速扩展和MCTS风格的随机采样,自适应地决定何时利用廉价有偏评估以及何时调用昂贵精确评估进行局部认证。我们证明了固定置信度正确性,建立了精确识别的有限停止性,并给出了通用深度树的多项式深度成本上界。在数值随机树实验中,与现有BAI-MCTS基线相比,2FFS使用的样本和计算操作显著减少。

英文摘要

We study fixed-confidence best-action identification (BAI) in stochastic minimax trees. This problem is increasingly relevant in modern AI planning, where deep minimax search and Monte Carlo Tree Search (MCTS) with language model long rollouts face a fundamental tradeoff: heuristic evaluations are cheap but biased, while accurate rollouts are reliable but prohibitively expensive. We propose 2FFS, a two-fidelity tree-search algorithm that brings multi-fidelity flat bandit ideas into trees. The algorithm combines minimax-style fast expansion with MCTS-style stochastic sampling, adaptively deciding when to exploit cheap biased evaluations and when to invoke expensive accurate evaluations for local certification. We prove fixed-confidence correctness, establish finite stopping for exact identification, and give a polynomial-depth cost upper bound for general-depth trees. Across numerical stochastic-tree experiments, 2FFS uses substantially fewer samples and computational operations comparing to existing BAI-MCTS baseline.

2606.01702 2026-06-02 cs.GR cs.LG 版本更新

KDH-CAD: Knowledge-data hybrid CAD learning under data scarcity

KDH-CAD:数据稀缺下的知识-数据混合CAD学习

Ziqin Gao, Zhijie Yang, Qiang Zou

发表机构 * State Key Laboratory of CAD \& CG, Zhejiang University, Hangzhou, 310027, China

AI总结 提出KDH-CAD框架,融合预训练基础模型、结构化领域知识和少量标注CAD数据,在数据稀缺下实现高效机械零件分类,准确率达92.6%(250样本)和95.8%(1000样本)。

Comments 18 pages

详情
AI中文摘要

计算机辅助设计(CAD)中的深度学习仍然受到数据稀缺挑战的根本制约:真实的CAD数据难以大规模收集,而合成数据可能无法真实反映实际设计实践。本文不追求更大的CAD数据集,而是将CAD学习视为知识补全和校准问题。它引入了KDH-CAD,一个知识-数据混合框架,该框架整合了基础模型中的预训练知识、教科书/教程中的结构化领域知识以及非常少量的标注CAD数据。领域知识用于引出和补全在预训练基础模型中表达较弱或代表性不足的CAD相关概念,而标注CAD数据则在潜在空间中校准这些概念,以考虑特定任务的几何变异性,而无需微调基础模型。在真实机械零件分类上的实验表明,KDH-CAD在低数据场景下取得了强劲性能,仅用250个训练样本就达到92.6%的准确率,用1000个样本达到95.8%,并且随着数据增加持续提升。这匹配或超过了通常需要多一个数量级数据的现有最优性能。这些结果表明,将预训练基础模型与结构化领域知识相结合可以大幅减少对大规模CAD数据集的依赖,为数据高效的CAD学习提供了原则性和实用性的方向。

英文摘要

Deep learning in computer-aided design (CAD) remains fundamentally constrained by the data scarcity challenge: authentic CAD data is difficult to collect at scale, while synthetic data may not faithfully reflect real design practice. Rather than pursuing ever-larger CAD datasets, this paper alternatively treats CAD learning as a knowledge completion and calibration problem. It introduces KDH-CAD, a knowledge-data hybrid framework that integrates pretrained knowledge in foundation models, structured domain knowledge from textbooks/tutorials, and a very small amount of labeled CAD data. Domain knowledge is used to elicit and complete CAD-relevant concepts that are weakly expressed or under-represented in pretrained foundation models, while labeled CAD data calibrates these concepts in the latent space to account for task-specific geometric variability, without fine-tuning the foundation model. Experiments on real-world mechanical part classification show that KDH-CAD achieves strong performance in low-data regimes, reaching 92.6\% accuracy with only 250 training samples, 95.8\% with 1,000 samples, and continuing to improve with additional data. This matches or exceeds state-of-the-art performance that typically requires an order of magnitude more data. These results suggest that combining pretrained foundation models with structured domain knowledge can substantially reduce reliance on large-scale CAD datasets, providing a principled and practical direction for data-efficient CAD learning.

2606.01695 2026-06-02 cs.LG 版本更新

CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

CANARY: 语言模型中微调污染的无标签检测

Swapnil Parekh

发表机构 * Switzerland(瑞士)

AI总结 提出CANARY方法,通过稀疏自编码器分析隐藏状态差异,在无标签情况下检测微调数据污染,实现1%污染率下AUROC=1.000,并支持检测、验证、优先排序和修复。

详情
AI中文摘要

攻击者可以通过污染仅1%的微调样本来植入潜在的有害行为。这种污染对所有的输出级防御都是不可见的:有害行为潜伏在模型的隐藏状态几何中,直到污染超过7.5%才会在生成的文本中出现。我们提出了CANARY(通过神经激活表示产出的污染审计器),这是一种无标签检查点审计器,可以直接通过对未标记提示集进行两次前向传递来检测这种隐藏的偏移。CANARY通过稀疏自编码器投影隐藏状态差异,过滤风格噪声以隔离有意义的语义漂移。它在四种模型架构和两种训练范式下,在1%污染率下实现了AUROC=1.000(95%置信区间=[0.997, 1.000];Cohen's d=3.28),比任何输出级方法触发点低7.5倍,并且在良性微调上零误报,对风格匹配和梯度噪声自适应攻击具有完全鲁棒性。相同的SAE特征基础驱动了一个完整的治理流程:SAE过滤放大以比标准生成高5倍的速率揭示潜在危害;得分排序的提示带来4.2倍的红队测试提升;在推理时抑制少数污染特定特征将危害从70%降低到10%,且无困惑度惩罚。CANARY是第一个仅从隐藏状态检测、验证、优先排序和修复供应链污染的无标签框架。

英文摘要

Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.

2606.01694 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Understanding Identity Continuity in Thermal Video through Scene-Level Consistency

通过场景级一致性理解热视频中的身份连续性

Wei-Chieh Sun, Gyungmin Ko, Heejae Kwon, Hsiang-Wei Huang, Jenq-Neng Hwang

发表机构 * Department of Electrical and Computer Engineering, Information Processing Lab, University of Washington, USA(电气与计算机工程系,信息处理实验室,华盛顿大学,美国)

AI总结 针对热行人多目标跟踪中身份碎片化问题,提出轻量级后处理方法,通过在线短间隙重映射和离线轨迹重链接恢复身份连续性,在PBVS热行人MOT基准上提升IDF1。

Comments Accepted to CVPR 2026 Workshop on SVC. Published in CVPR Workshops proceedings

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, 2026, pp. 1411-1419
AI中文摘要

热行人多目标跟踪仍然具有挑战性,因为弱外观线索和频繁的检测中断导致严重的轨迹碎片化。我们研究轻量级后处理是否可以在不依赖重型重识别模型或复杂在线关联的情况下恢复身份连续性。从YOLOv8和SORT基线开始,我们添加了一个模块化的身份修复后端,包括基于时间、空间、运动和边界线索的在线短间隙重映射和离线轨迹重链接。在固定验证集上的受控消融实验和在官方PBVS热行人MOT基准上的评估表明,主要身份增益来自保守的重链接,将IDF1从82.25提升到84.93,同时保持MOTA,而许多启发式阈值在广泛的操作范围内保持稳定。这些结果表明,在低信息热图像中,通过高精度轨迹重链接比增加跟踪器复杂性更能有效地实现鲁棒的身份恢复。这些结果提供了对热视频中身份恢复的受控分析,表明与局部帧到帧关联相比,场景级时空一致性在身份连续性中起主导作用。

英文摘要

Thermal pedestrian MOT remains challenging because weak appearance cues and frequent detection interruptions cause severe trajectory fragmentation. We study whether lightweight post-processing can recover identity continuity without relying on heavy re-identification models or complex online association. Starting from a YOLOv8 and SORT baseline, we add a modular identity-repair backend consisting of online short-gap remapping and offline tracklet relinking based on temporal, spatial, motion, and border cues. Controlled ablations on a fixed validation split and evaluation on the official PBVS Thermal Pedestrian MOT benchmark show that the main identity gains arise from conservative relinking, improving IDF1 from 82.25 to 84.93 while preserving MOTA, whereas many heuristic thresholds remain stable across broad operating ranges. These results suggest that, in low-information thermal imagery, robust identity recovery can be achieved more effectively through high-precision trajectory relinking than through increasing tracker complexity. These results provide a controlled analysis of identity recovery in thermal video, showing that scene-level spatial-temporal consistency plays a dominant role in identity continuity compared to local frame-to-frame association.

2606.01691 2026-06-02 cs.CR cs.LG 版本更新

IstGPT: LLM-based Anomaly Detection for Spatial-Temporal Graph in Industrial Systems

IstGPT:基于LLM的工业系统时空图异常检测

Yuchen Zhang, Ning Xi, Pengbin Feng, Shigang Liu, Jianfeng Ma, Yulong Shen, Yanan Sun, Xiaolin Zhou

发表机构 * School of Cyber Engineering, Xidian University(电子科技大学信息工程学院) School of Science, Computing and Engineering Technologies, Swinburne University of Technology(斯winburne技术大学科学与工程技术学院) School of Computer Science and Technology, Xidian University(电子科技大学计算机科学与技术学院)

AI总结 提出IstGPT,首个结合大语言模型与图学习的工业异常检测工具,通过多模态知识提取传感器-执行器依赖图并利用改进的图神经网络实现实时异常检测,在9个数据集上取得最佳F1分数和eTaF1指标。

详情
AI中文摘要

工业互联网系统面临来自复杂工业控制系统(ICS)攻击的日益增长的威胁,导致严重的安全事件。然而,由于传感器和执行器之间的复杂依赖关系,现有工具在实时异常检测方面效果有限。为了解决这个问题,我们提出了IstGPT,这是首个基于大语言模型和图学习的工业异常检测工具,能够针对广泛的ICS攻击提供实时保护。IstGPT实现了对工业信息物理系统中时空依赖关系的细粒度精确建模。它首先利用工业多模态知识,包括操作数据、技术文档和系统图,通过多阶段提示工程提取传感器-执行器依赖图。然后,LLM-Optimation基于节点准确性、边缘一致性和逻辑连贯性迭代优化图。最后,IstGPT将改进的图神经网络与编码器-解码器架构相结合,通过重构误差检测异常。我们在9个数据集上评估了IstGPT与12个最先进基线模型的性能,包括2个公共数据集、6个模拟数据集和一个真实机器人手臂数据集。IstGPT在所有九个数据集上取得了最佳的F1分数和eTaF1(一种较新的时间感知指标)。我们进一步讨论了在真实工业场景中部署IstGPT的可行性。

英文摘要

Industrial Internet systems face increasing threats from sophisticated industrial control system (ICS) attacks, resulting in critical safety incidents. However, existing tools exhibit limited effectiveness in real-time anomaly detection due to the complex dependencies among sensors and actuators. To tackle this, we present IstGPT, the first industrial anomaly detection tool based on LLMs and graph learning to provide real-time protection against a wide range of ICS attacks. IstGPT achieves fine-grained and precise modeling on spatial-temporal dependencies in industrial cyber-physical systems. It first leverages industrial multi-modal knowledge, including operational data, technical documents, and system diagrams, to extract sensor-actuator dependency graphs via multi-stage prompt engineering. Then, LLM-Optimation iteratively refines the graph based on node accuracy, edge consistency, and logical coherence. Finally, IstGPT integrated improved graph neural networks with an encoder-decoder architecture to detect anomalies via reconstruction errors. We evaluate IstGPT against 12 state-of-the-art baselines on 9 datasets, including 2 public, 6 simulated, and a real-world robotic arm dataset. IstGPT achieves the best F1-scores and eTaF1 (a newer time-aware metric) across nine datasets. We further discuss the feasibility of deploying IstGPT in real-world industrial scenarios.

2606.01682 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Off-the-Shelf LLMs as Process Scorers: Training-Free Alternative to PRMs for Mathematical Reasoning

现成的大语言模型作为过程评分器:数学推理中PRM的无训练替代方案

Atoosa Chegini, Soheil Feizi

发表机构 * Department of Computer Science, University of Maryland(马里兰大学计算机科学系)

AI总结 提出Chunk-Level Guided Generation方法,利用现成的大语言模型作为过程评分器,通过固定长度块评分和对比选择规则,无需训练即可在数学推理中匹配或超越PRM引导搜索的性能。

详情
AI中文摘要

使用更强的评分器从多个小模型样本中选择最佳响应是一种简单的推理时策略,但当小模型已经陷入错误推理路径时,该策略会失败。PRM引导搜索通过在生成过程中对候选延续进行评分来避免这一问题,但需要经过步骤级标签训练的奖励模型。我们提出Chunk-Level Guided Generation,一种无训练的替代方案,使用现成的大语言模型作为过程评分器。在每一步,小模型采样k个固定长度的候选块,而大模型使用似然度对候选块进行评分,无需生成任何文本。选中的块在下一步之前被提交,从而在错误传播之前引导生成。我们用两种选择规则实例化该框架:似然引导选择(LGS),选择具有最高长度归一化大模型对数概率的块;以及对比引导选择(CGS),减去小模型的对数概率,以偏向于大模型偏好与小模型偏好不同的块。我们证明,由于系统性的长度偏差(即使在长度归一化后仍然存在),使用大模型似然度对可变长度推理步骤进行评分是不可靠的,而固定长度块避免了这一混淆。在GSM8K、MATH、Minerva Math、AMC23和AIME24上,使用Qwen2.5-32B引导Qwen2.5-1.5B以及Llama-3.1-70B引导Llama-3.2-1B,CGS在多数投票上最多提升28个百分点,并且在匹配的引导预算下,在大多数基准测试中匹配或超越了Qwen2.5-Math-PRM-72B引导搜索,且无需奖励模型训练。使用Qwen2.5-72B引导Qwen2.5-7B,CGS在k=16时在MATH上达到81.8%,在Minerva Math上达到63.6%,超过多数投票4-6个百分点。最后,Chunk-Level Guided Generation产生的推理轨迹比PRM引导搜索短得多。

英文摘要

Selecting the best response from multiple small-model samples using a stronger scorer is a simple inference-time strategy, but fails when the small model has already committed to incorrect reasoning paths. PRM guided search avoids this by scoring candidate continuations during generation, but requires a reward model trained with step-level labels. We propose Chunk-Level Guided Generation, a training-free alternative that uses an off-the-shelf large language model as a process scorer. At each step, a small model samples k fixed-length candidate chunks, while the larger model scores the candidates using likelihoods without generating any text. The selected chunk is committed before the next step, steering generation before errors can propagate. We instantiate this framework with two selection rules: Likelihood-Guided Selection (LGS), which selects the chunk with the highest length-normalized large-model log-probability, and Contrastive-Guided Selection (CGS), which subtracts the small model's log-probability to favor chunks where the large model's preference diverges from the small model's. We show that scoring variable-length reasoning steps with large-model likelihoods is unreliable due to a systematic length bias that persists even after length normalization, and that fixed-length chunks avoid this confound. On GSM8K, MATH, Minerva Math, AMC23, and AIME24 with Qwen2.5-1.5B guided by Qwen2.5-32B and Llama-3.2-1B guided by Llama-3.1-70B, CGS outperforms majority voting by up to 28 pp and, under matched guidance budgets, matches or outperforms Qwen2.5-Math-PRM-72B guided search on most benchmarks without reward-model training. With Qwen2.5-7B guided by Qwen2.5-72B, CGS reaches 81.8% on MATH and 63.6% on Minerva Math at k=16, surpassing majority voting by 4--6 pp. Finally, Chunk-Level Guided Generation produces substantially shorter reasoning traces than PRM guided search.

2606.01680 2026-06-02 cs.DC cs.LG cs.NI 版本更新

Don't Let a Few Network Failures Slow the Entire AllReduce

不要让少数网络故障拖慢整个 AllReduce

Peiqing Chen, Jiedong Jiang, Nengneng Yu, Yuefeng Wang, Sixian Xiong, Wei Wang, Zaoxing Liu

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校) Utrecht University(乌特雷赫大学) Kyoto University(京都大学)

AI总结 针对网络故障导致 AllReduce 性能下降的问题,提出基于信息论下界的 OptCC 算法,通过四阶段流水线设计在带宽损失高达 50% 时仍接近无故障性能。

详情
AI中文摘要

网络故障是大规模 GPU 集群中最常见的硬件故障之一,也是训练任务中断的主要原因。现代集体通信库(如 NCCL)通过将流量重新路由到同一服务器上幸存的 NIC 来缓解网络故障,以降低节点间带宽换取不间断训练。然而,降级后的服务器仍处于标准环形算法的关键路径上,拖慢了整个集体通信。我们首次给出了非对称网络带宽下 AllReduce 完成时间的信息论下界,并表明当落后者保留至少一半原始带宽时,相对于无故障最优值的不可避免开销仅为 O(1/p)(p 为 GPU 数量)。然后,我们设计了 OptCC,一种接近该下界的四阶段流水线 AllReduce 算法。SimAI 上的实验证实,OptCC 缩小了现有容错方案留下的差距:在实际网络故障(带宽损失高达 50%)下,OptCC 的 AllReduce 完成时间在 NCCL 无故障环形性能的 2-6% 以内,而现有最优方案的开销高达 57%。

英文摘要

Network failures are among the most frequent hardware faults in large-scale GPU clusters and a leading cause of training-job interruptions. Modern collective communication libraries such as NCCL mitigate network failures by rerouting traffic through surviving NICs on the same server, trading reduced inter-node bandwidth for uninterrupted training. However, the degraded server remains on the critical path of the standard ring algorithm, slowing the entire collective. We present the first information-theoretic lower bound on AllReduce completion time under asymmetric network bandwidth and show that when the straggler retains at least half of its original bandwidth, the unavoidable overhead relative to the fault-free optimum is only O(1/p) for p GPUs. We then design OptCC, a four-stage pipelined AllReduce algorithm that approaches this lower bound. Experiments on SimAI confirm that OptCC closes the gap left by existing fault-tolerant schemes: under practical network failures with up to 50% bandwidth loss, OptCC completes AllReduce within 2-6% of NCCL's fault-free ring performance, whereas the state-of-the-art incurs up to 57% overhead.

2606.01672 2026-06-02 cs.LG 版本更新

RDA: Reward Design Agent for Reinforcement Learning

RDA:用于强化学习的奖励设计智能体

Hojoon Lee, Ajay Subramanian, Ben Abbatematteo, Vijay Veerabadran, Pedro Matias, Karl Ridgeway, Nitin Kamra

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind(深度Mind)

AI总结 提出基于视觉语言模型的奖励设计智能体RDA,通过任务分解、视觉轨迹评估和失败模式总结迭代优化奖励函数,在操作任务中生成更符合指令的策略。

Comments Accepted to RLC'26

详情
AI中文摘要

强化学习已经能够获得令人印象深刻的机器人技能,但通常需要手工设计的奖励函数,这些函数设计缓慢且难以与人类意图对齐。最近的工作,如Eureka,通过使用LLM从任务描述中迭代生成和优化奖励代码来自动化奖励设计。然而,它们依赖于粗糙的反馈信号,如成功率,这些信号对学习到的行为提供的语义洞察很少。因此,它们训练的策略达到了最终目标,但经常与任务指令对齐不良。我们引入了奖励设计智能体(RDA),一个基于VLM的智能体框架,将语义理解注入奖励设计。RDA分解任务,视觉评估轨迹,总结失败模式,并迭代修订奖励代码以更好地与任务指令对齐。在ManiSkill的12个桌面操作任务和HumanoidBench的4个全身操作任务中,RDA产生的策略在指令对齐方面显著优于其他基线,同时实现了相当的任务成功率。视频和生成的奖励代码可在https://nitinkamra1992.github.io/reward-design-agent获取。

英文摘要

Reinforcement learning has enabled the acquisition of impressive robotic skills, but typically requires hand-crafted reward functions that are slow to design and difficult to align with human intentions. Recent work, such as Eureka, automates reward design by using an LLM to iteratively generate and refine reward code from task descriptions. However, they rely on coarse feedback signals such as success rate, which provide little semantic insight into the learned behavior. As a result, their trained policies achieve the final goal but are frequently poorly aligned with task instructions. We introduce the Reward Design Agent (RDA), a VLM-based agentic framework that injects semantic understanding into reward design. RDA decomposes tasks, visually evaluates trajectories, summarizes failure modes, and iteratively revises reward code to better align with task instructions. Across 12 tabletop manipulation tasks from ManiSkill and 4 whole-body manipulation tasks from HumanoidBench, RDA produces policies substantially more instruction-aligned than those of other baselines, while achieving comparable task success rates. Videos and the generated reward code are available on https://nitinkamra1992.github.io/reward-design-agent.

2606.01667 2026-06-02 cs.LG 版本更新

ATLAS: Agentic Test-time Learning-to-Allocate Scaling

ATLAS: 智能体测试时学习分配缩放

Peijia Qin, Qi Cao, Pengtao Xie

发表机构 * University of California, San Diego(加州大学圣迭戈分校)

AI总结 提出ATLAS框架,让LLM编排器通过单一

详情
AI中文摘要

测试时缩放已成为提升大语言模型推理能力的主要方式,但其编排仍由设计者工程化:固定的样本预算、固定的改进循环、固定的评分规则或固定的搜索策略决定了计算如何分配,模型负责求解而非编排。我们提出ATLAS,一种智能体测试时缩放框架,其中LLM编排器端到端地拥有控制循环。通过单一动作“探索”(在原问题上派发一个全新的独立求解器),编排器决定是否收集更多证据、何时停止以及如何综合最终答案;动作空间是可扩展的,每次探索调用可选地指定求解器、推理努力或提示策略。我们在四个基准上评估ATLAS,涵盖科学问答、代码生成和多模态推理,使用Claude Sonnet 4.6骨干网络,在HLE-Verified上达到56.00%,在LiveCodeBench上达到82.29%,在GPQA-Diamond上达到85.75%,在BabyVision上达到23.71%,同时使用的API调用远少于固定工作流基线。多模型扩展ATLAS-MM将求解器选择作为额外动作维度,进一步将HLE-Verified提升至60.00%,LiveCodeBench提升至85.63%,并在GPQA-Diamond和BabyVision上持续改进。将编排器的直接综合替换为独立整合器的消融实验在四个基准中的三个上降低或未能提高准确率,这与有状态证据管理在产生增益中的作用一致。

英文摘要

Test-time scaling has become a major way to improve large language model reasoning, but its orchestration has remained designer-engineered: a fixed sample budget, a fixed refinement loop, a fixed scoring rule, or a fixed search policy decides how compute is spent, leaving the model in charge of solving but not of orchestration. We introduce ATLAS, an agentic test-time scaling framework in which an LLM orchestrator owns the control loop end-to-end. Through a single action, explore, which dispatches a fresh independent solver on the original problem, the orchestrator decides whether to gather more evidence, when to stop, and how to synthesize the final answer; the action space is extensible, with each explore call optionally specifying solver, reasoning effort, or prompting strategy. We evaluate ATLAS on four benchmarks covering scientific question answering, code generation, and multimodal reasoning under a Claude Sonnet 4.6 backbone, where it reaches 56.00% on HLE-Verified, 82.29% on LiveCodeBench, 85.75% on GPQA-Diamond, and 23.71% on BabyVision while using far fewer API calls than fixed-workflow baselines. A multi-model extension, ATLAS-MM, that exposes solver choice as an additional action dimension further improves HLE-Verified to 60.00% and LiveCodeBench to 85.63%, with consistent gains on GPQA-Diamond and BabyVision. Ablations replacing the orchestrator's direct synthesis with a separate integrator degrade or fail to improve accuracy on three of four benchmarks, consistent with the role of stateful evidence management in producing the gains.

2606.01666 2026-06-02 cs.LG cs.AI 版本更新

DOT-MoE: Differentiable Optimal Transport for MoEfication

DOT-MoE:用于MoE化的可微最优传输

Udbhav Bamba, Arnav Chavan, Aryamaan Thakur, Steve Teig, Deepak Gupta

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出DOT-MoE框架,通过可微最优传输将密集层分解为专家,联合学习神经元分配和路由策略,在减少50%活跃参数的同时保留90%原始性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)的扩展带来了显著的性能提升,但也造成了推理效率方面的重大挑战。虽然混合专家(MoEs)架构通过将模型大小与推理成本解耦来解决这一问题,但从头训练MoEs通常不稳定且计算密集。将预训练的密集模型转换为稀疏MoEs已成为一种替代方案;然而,现有方法通常依赖启发式神经元聚类或随机分割来将前馈网络(FFN)划分为专家。在这项工作中,我们提出了DOT-MoE,一种新颖的框架,将密集层的分解建模为可微最优传输(DOT)问题。与静态启发式方法不同,我们将神经元分配建模为平衡传输问题,利用可微的Sinkhorn-Knopp迭代来强制执行严格的专家容量约束。此外,我们利用直通估计器(STE)来联合学习离散的神经元到专家的分配和令牌到专家的路由策略。跨多个架构和基准的大量实验表明,DOT-MoE显著优于结构化剪枝、启发式聚类和随机分割基线,在减少50%活跃参数的同时保留了原始密集模型90%的性能。

英文摘要

The scaling of Large Language Models (LLMs) has driven significant performance gains but created substantial challenges in inference efficiency. While Mixture of Experts (MoEs) architectures address this by decoupling model size from inference cost, training MoEs from scratch is often unstable and compute intensive. Conversion of pre-trained dense models into sparse MoEs has emerged as an alternative solution; however, existing methods typically rely on heuristic neuron clustering or random splitting to partition the Feed-Forward Network (FFN) into experts. In this work, we propose DOT-MoE, a novel framework that formulates the decomposition of dense layers as a Differentiable Optimal Transport (DOT) problem. Instead of static heuristics, we model neuron assignment as a balanced transport problem, utilizing differentiable Sinkhorn-Knopp iterations to enforce strict expert capacity constraints. Furthermore, we utilize Straight-Through Estimators (STE) to jointly learn the discrete neuron-to-expert assignment and the token-to-expert routing policy end-to-end. Extensive experiments across multiple architectures and benchmarks demonstrate that DOT-MoE significantly outperforms structured pruning, heuristic clustering, and random-split baselines, retaining 90% of the original dense model's performance while reducing active parameters by 50%.

2606.01665 2026-06-02 cs.LG 版本更新

Quantifying the Energy Floor: Direct Measurement and Replay Buffer Bias in SAC-Based HVAC Control on sbsim

量化能量下限:基于sbsim的SAC HVAC控制中的直接测量与回放缓冲区偏差

Bo Li, Chen Zhang

发表机构 * Shanghai Jiao Tong University College of Smart Energy(上海交通大学智能能源学院)

AI总结 通过最小动作实验直接测量SAC HVAC控制中的能量下限,发现回放缓冲区初始化是次优性的主要来源,消除后可将成本降至接近下限。

Comments 5 pages, 3 figures, 2 tables. Presented at AI-DEEDS 2026 Workshop, ACM Sustainability Week, Banff, Canada (non-archival)

详情
AI中文摘要

我们在sbsim校准建筑模拟器上量化了Soft Actor-Critic (SAC) HVAC控制的能量下限——在动作空间约束下的最小可实现成本。通过最小动作实验,我们直接测量到该下限为35.51美元/天,其中连续电力负载占主导(35.44美元,99.8%),燃气消耗可忽略。标准SAC基线使用调度策略回放缓冲区过渡初始化,收敛到37.18美元/天,高于下限4.7%。我们确定缓冲区初始化是此场景中次优性的主要来源:从空缓冲区训练可将成本降至35.57美元/天,消除了96%的差距。将供水温度范围扩大10 K仅带来可忽略的额外节省(0.03美元/天),进一步扩大则触发物理约束违反。我们还发现一个折扣因子耦合(gamma_eff = 0.891),将有效规划视野从8.3小时缩小至46分钟——这是一个需要审计的基准广泛问题。在规划视野、奖励权重和观测增强上的系统消融实验证实,所有预填充缓冲区配置的聚类范围在0.7%以内(37.18–37.42美元),表明设备最小功率(而非算法设计)构成了约束性限制。

英文摘要

We quantify the energy floor -- the minimum achievable cost given action space constraints -- for Soft Actor-Critic (SAC) HVAC control on the sbsim calibrated building simulator. Through minimum-action experiments, we directly measure this floor at USD 35.51/day, dominated by continuous electrical loads (USD 35.44, 99.8%) with negligible gas consumption. The standard SAC baseline, initialized with schedule-policy replay buffer transitions, converges to USD 37.18/day, 4.7% above the floor. We identify buffer initialization as the dominant source of sub-optimality in this scenario: training from an empty buffer reduces cost to USD 35.57/day, eliminating 96% of the gap. Expanding the supply water temperature range by 10 K yields negligible additional savings (USD 0.03/day), and further expansion triggers physical constraint violations. We additionally uncover a discount factor coupling (gamma_eff = 0.891) shrinking the effective planning horizon from 8.3 h to 46 min -- a benchmark-wide issue warranting audit. Systematic ablation across planning horizon, reward weights, and observation enrichment confirms all pre-filled-buffer configurations cluster within 0.7% (USD 37.18--USD 37.42), demonstrating that equipment minimum power -- not algorithmic design -- imposes the binding constraint.

2606.01660 2026-06-02 cs.LG 版本更新

Gate the Filter, Not the Message: Node-Channel Mixtures for Pre-Propagation GNNs

门控滤波器而非消息:预传播图神经网络中的节点-通道混合

Zichao Yue, Zhiru Zhang

发表机构 * School of Electrical and Computer Engineering, Cornell University(康奈尔大学电气与计算机工程学院)

AI总结 针对预传播图神经网络中复杂跳聚合器性能不佳的问题,提出FilterMoE模型,通过3D门控张量联合路由节点和通道上的可学习切比雪夫滤波器专家,在11个同质和异质基准测试中平均提升1.53个测试分数。

详情
AI中文摘要

预传播图神经网络(PPGNNs)将所有图相关的计算推入预处理步骤,仅对生成的密集跳特征进行训练,这使得它们具有高度可扩展性。该领域的一个难题是,更复杂的跳聚合器并不总是可靠地优于简单的聚合器:在许多基准测试中,基于普通MLP的聚合器与跳注意力变体相当或更优。我们从图滤波器的角度重新审视这一行为。在预计算的扩散基上,现有的PPGNNs主要区别在于滤波器系数如何在节点和特征通道之间共享,而非仅仅在原始聚合器容量上。基于MLP的架构学习通道相关的滤波器,这些滤波器在节点之间大致共享,而基于跳注意力的架构学习节点相关的混合,这些混合在通道之间大致共享。这揭示了标准PPGNN设计中的一个缺失机制:在预传播计算约束下,联合节点和通道自适应滤波。我们提出FilterMoE,一种混合专家PPGNN,其中一小批可学习的切比雪夫滤波器专家通过3D门控张量在节点和通道上联合路由。在11个同质和异质基准测试中,FilterMoE在9个数据集上优于强PPGNN基线,并在所有三个大规模基准测试中排名第一,平均测试分数提高了1.53分。这些结果确立了联合节点-通道滤波器路由作为数据集特定跳聚合器选择的稳健替代方案。

英文摘要

Pre-propagation graph neural networks (PPGNNs) push all graph-dependent computation into a preprocessing step and train only on the resulting dense hop features, which makes them highly scalable. A puzzle in this regime is that more complex hop aggregators do not reliably outperform simpler ones: on many benchmarks, a plain MLP-based aggregator matches or beats hop-attention variants. We revisit this behavior from a graph-filter perspective. Over a precomputed diffusion basis, existing PPGNNs differ mainly in how filter coefficients are shared across nodes and feature channels, rather than simply in raw aggregator capacity. MLP-based architectures learn channel-dependent filters that are largely shared across nodes, while hop-attention-based architectures learn node-dependent mixtures that are largely shared across channels. This reveals a missing regime in standard PPGNN designs: joint node- and channel-adaptive filtering under the pre-propagation computational contract. We propose FilterMoE, a mixture-of-experts PPGNN in which a small bank of learnable Chebyshev filter experts is routed jointly over nodes and channels by a 3D gating tensor. Across eleven homophilic and heterophilic benchmarks, FilterMoE outperforms strong PPGNN baselines on nine datasets and ranks first on all three large-scale benchmarks, improving the average test score by 1.53 points. These results establish joint node-channel filter routing as a robust alternative to dataset-specific hop-aggregator selection.

2606.01655 2026-06-02 math.OC cs.AI cs.LG stat.ML 版本更新

MINTS: Minimalist Thompson Sampling

MINTS: 极简汤普森采样

Kaizheng Wang

发表机构 * Department of IEOR and Data Science Institute, Columbia University(工业工程与数据科学学院,哥伦比亚大学)

AI总结 针对贝叶斯方法在复杂结构约束下的局限性,提出一种仅对最优位置设置先验、通过轮廓似然消除冗余参数的极简贝叶斯框架,并实例化为MINTS算法,在均值约束多臂老虎机中实现近最优非渐近遗憾保证和精确几乎必然渐近遗憾刻画。

Comments 29 pages

详情
AI中文摘要

贝叶斯范式为不确定性下的序贯决策提供了原则性工具,但其对所有参数依赖概率模型的做法会阻碍复杂结构约束的纳入。我们提出一种极简贝叶斯框架,仅对最优位置设置先验,同时通过轮廓似然消除冗余参数。这产生了一个自然适应结构约束的广义后验。作为直接实例,我们开发了极简汤普森采样(MINTS)。对于具有均值约束的多臂老虎机,我们建立了近最优的非渐近遗憾保证和精确的几乎必然渐近遗憾刻画。特别地,MINTS在无结构设置中达到了经典的Lai-Robbins常数,并自动适应单峰结构,达到仅由最优臂的紧邻所确定的精确常数。

英文摘要

The Bayesian paradigm offers principled tools for sequential decision-making under uncertainty, but its reliance on a probabilistic model for all parameters can hinder the incorporation of complex structural constraints. We introduce a minimalist Bayesian framework that places a prior only on the location of the optimum, while eliminating nuisance parameters through profile likelihood. This yields a generalized posterior that naturally accommodates structural constraints. As a direct instantiation, we develop MINimalist Thompson Sampling (MINTS). For multi-armed bandits with mean constraints, we establish near-optimal non-asymptotic regret guarantees and sharp almost-sure asymptotic regret characterizations. In particular, MINTS attains the classical Lai--Robbins constant in the unstructured setting and automatically adapts to unimodal structure, achieving the sharp constant determined only by the immediate neighbors of the optimal arm.

2606.01645 2026-06-02 stat.ML cs.LG 版本更新

Self-Regulating Annealing in Heavy-Tailed Diffusion Models

重尾扩散模型中的自调节退火

Keito Wakatsuki, Hideaki Shimazaki

发表机构 * Keito Wakatsuki(凯托·瓦卡苏基) Hideaki Shimazaki

AI总结 本文提出一种基于随机微分方程的重尾扩散模型采样器,通过状态依赖的扩散系数实现自调节退火机制,以改进重尾数据的生成保真度。

Comments 6 pages, 3 figures, IJCNN2026

详情
AI中文摘要

扩散模型已成为深度生成模型的主要框架。虽然标准高斯公式在理论上很方便,但其对重尾数据集的适用性仍不清楚。为了解决这个问题,重尾扩散模型(HTDM)通过用学生t分布替换高斯分布来扩展标准公式,从而提高了重尾数据集上的尾部保真度。尽管基于随机微分方程(SDE)的采样在HTDM中是可能的,但尚未得到充分探索。在本文中,我们提出了一种用于HTDM的基于SDE的采样器,该采样器明确地包含了状态依赖的扩散系数。这种状态依赖性通过自适应地调节有效噪声尺度,自然地诱导出自调节退火机制。我们从理论上探讨了这一机制,并通过实验验证了其在从重尾分布中重现样本的必要性。

英文摘要

Diffusion models have emerged as a leading framework for deep generative modeling. While the standard Gaussian formulation is theoretically convenient, its suitability for heavy-tailed datasets remains unclear. To address this, heavy-tailed diffusion models (HTDMs) extend the standard formulation by replacing the Gaussian distribution with a Student's t-distribution, thereby improving tail fidelity on heavy-tailed datasets. Although stochastic differential equation (SDE)-based sampling is possible in HTDMs, it has not been fully explored. In this paper, we propose an SDE-based sampler for HTDMs that explicitly incorporates a state-dependent diffusion coefficient. This state dependence naturally induces a self-regulating annealing mechanism by adaptively modulating the effective noise scale. We theoretically explore this mechanism and experimentally verify its necessity for reproducing samples from a heavy-tailed distribution.

2606.01634 2026-06-02 cs.LG cs.AI 版本更新

E4GEN: Event-level Explainable Extreme-Enhanced Time-series Generation

E4GEN:事件级可解释的极端增强时间序列生成

Lin Jiang, Dahai Yu, Ximiao Li, Guang Wang

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出E4GEN可解释扩散框架,通过E-Activator、E-Predictor和E-Control三个组件实现事件级极端事件可控生成,在整体保真度、极端事件保真度和下游效用上优于现有方法。

Comments 48 pages,26 figures

详情
AI中文摘要

生成逼真的时间序列对于科学研究和实际应用至关重要。然而,现有方法通常强调整体分布保真度,而未能忠实捕捉极端事件。为了推进现有研究,我们提出了E4GEN,一个用于极端事件感知时间序列生成的可解释扩散框架。E4GEN通过三个关键组件提供了关于何时、什么以及如何控制极端事件生成的系统见解。首先,E-Activator在去噪过程中学习数据集自适应的极端控制信号激活步骤,而不干扰常规时间成分,包括趋势和季节性。其次,E-Predictor通过自驱动语义预测确定要强制执行的控制信号,其中每个样本通过推断生成过程中的潜在极端事件信息来导出其自身的控制信号。它还包括一种新颖的数据条件训练、噪声初始化采样机制,以解决训练标签不可用的问题。第三,E-Control通过可训练的极端控制网络指定如何控制极端事件生成,该网络将语义控制信号转换为逐层信号并将其注入去噪过程。我们在六个数据集上使用17个指标评估了E4GEN,大量实验表明,E4GEN在多个维度上优于最先进的模型,包括整体保真度、极端事件保真度和下游效用。

英文摘要

Generating realistic time series is essential for scientific research and real-world applications. However, existing methods often emphasize overall distributional fidelity while failing to faithfully capture extreme events. To advance existing research, we propose E4GEN, an explainable diffusion framework for extreme event-aware time-series generation. E4GEN provides systematic insights into when, what, and how to control extreme-event generation through three key components. First, E-Activator learns the dataset-adaptive extreme-control signal activation step during the denoising process without interfering with regular temporal components, including trend and seasonality. Second, E-Predictor determines what control signal to enforce through Self-Driven Semantic Prediction, where each sample derives its own control signal by inferring latent extreme-event information during generation. It also includes a novel Data-Conditioned Training, Noise-Initiated Sampling mechanism to address the issue of unavailable training labels. Third, E-Control specifies how to control extreme-event generation through a trainable Extreme Control Network, which transforms the semantic control signal into layer-wise signals and injects it into the denoising process. We evaluate E4GEN on six datasets with 17 metrics, and extensive experiments show that E4GEN outperforms state-of-the-art models across multiple dimensions, including overall fidelity, extreme-event fidelity, and downstream utility.

2606.01626 2026-06-02 cs.LG 版本更新

IMWM: Intuition Models Complement World Models for Latent Planning

IMWM:直觉模型补充世界模型用于潜在规划

Baoqi Gao, Ruize Han, Miao Wang, Song Wang

发表机构 * Beihang University(北航) Shenzhen University of Advanced Technology(深圳先进技术大学)

AI总结 针对基于潜在世界模型的规划中搜索瓶颈问题,提出IMWM框架,通过直觉模型与三个轻量组件协作,在四个像素级任务上显著提升成功率。

详情
AI中文摘要

使用学习到的潜在世界模型进行规划是从原始像素控制的有前途的途径,但仅靠强大的世界模型是不够的。我们通过实验证明了这一点:即使使用完美的世界模型(通过将学习到的前向预测器替换为真实环境动态的理想化展开来实现),有限预算的基于样本的规划器仍然在某些任务上失败,这表明瓶颈可能在于搜索而非世界模型的准确性。受此差距的启发,我们提出了IMWM(直觉模型+世界模型),它将世界模型与从演示中训练出的直觉模型配对,以识别有希望的动作。这两个模型通过三个轻量组件协作:(i)检索初始化,从检索到的演示中初始化规划器的动作提议;(ii)混合成本,将直觉分数与世界模型展开成本相结合;(iii)可靠性门控,调整规划器在每个设置中信任直觉的程度。在四个基于像素的目标到达任务(Two-Room、Reacher、Push-T和OGBench-Cube)中,IMWM在所有四个任务上的平均成功率均高于仅使用世界模型的规划器,其中在Two-Room(99.2%,+11.5个百分点)和OGBench-Cube(94.7%,+28.5个百分点)上提升最大。

英文摘要

Planning with a learned latent world model is a promising route to control from raw pixels, but a strong world model alone is not enough. We show this experimentally: even with a perfect world model (operationalized by replacing the learned forward predictor with an idealized rollout of the true environment dynamics), a finite-budget sample-based planner still fails on some tasks, indicating that the bottleneck can lie in search rather than in world-model accuracy. Motivated by this gap, we propose IMWM (Intuition Model + World Model), which pairs the world model with an intuition model trained from demonstrations to recognize promising actions. The two models collaborate through three lightweight components: (i) Retrieval Initialization, which initializes the planner's action proposal from a retrieved demonstration; (ii) Hybrid Cost, which combines the intuition score with the world-model rollout cost; and (iii) a Reliability Gate, which adjusts how much the planner trusts intuition in each setting. Across four pixel-based goal-reaching tasks (Two-Room, Reacher, Push-T, and OGBench-Cube), IMWM has higher mean success than the world-model-only planner on all four, with the largest gains on Two-Room (99.2%, +11.5 percentage points) and OGBench-Cube (94.7%, +28.5 percentage points).

2606.01612 2026-06-02 cs.CV cs.LG 版本更新

Self-Improving Small Object Grounding in LVLMs

LVLMs中的自改进小目标定位

Tianze Yang, Yucheng Shi, Ruitong Sun, Ninghao Liu, Jin Sun

发表机构 * University of Georgia(佐治亚大学)

AI总结 利用LVLMs内部注意力模式,通过轻量级IoU回归器或无需训练的注意力熵选择器,从多个候选框中选出最佳框,实现小目标定位的自改进。

Comments 29 Pages, 15 Figures

详情
AI中文摘要

大型视觉语言模型(LVLMs)中的内部注意力模式能否在无需微调的情况下识别可靠的小目标框?在这项工作中,我们给出了肯定的答案。LVLMs中的注意力结构编码了定位质量——一个仅基于注意力图训练的轻量级IoU回归器实现了强IoU预测(Pearson r > 0.67)。该回归器驱动了我们基于注意力的候选选择(ACS)框架的回归器变体,称为ACS-Learned,它从多个采样候选中选择最佳框以改进目标定位。通过分析回归器学习的内容,我们揭示了哪些Transformer层和头最为关键,并推导出ACS-Free:一个无需训练的选择器,它根据这些判别性头上的注意力熵对候选进行排序,推理时无需任何学习组件。在COCO和Objects365上的实验表明,小目标定位的自改进高达19%,其中ACS-Free在所有无需训练的方法中排名最佳,表明有用的注意力结构提高了LVLMs中定位的可靠性和可解释性。

英文摘要

Can internal attention patterns in Large Vision Language Models (LVLMs) identify reliable small-object boxes without fine-tuning? In this work, we provide an affirmative answer. Attention structure in LVLMs encodes grounding quality-a lightweight IoU regressor trained solely on attention maps achieves strong IoU prediction (Pearson r > 0.67). This regressor powers the regressor-based variant of our Attention-based Candidate Selection (ACS) framework, called ACS-Learned, which selects the best box from multiple sampled candidates to improve object grounding. By analyzing what the regressor learns, we reveal which transformer layers and heads are most critical and derive ACS-Free: a training-free selector that ranks candidates by attention entropy on these discriminative heads, with no learned component at inference. Experiments on COCO and Objects365 demonstrate up to 19% self-improvement on small object localization, with ACS-Free ranking best among all training-free methods, demonstrating that useful attention structure improves both localization reliability and interpretability in LVLMs.

2606.01607 2026-06-02 cs.LG cs.AI 版本更新

FedMTFI: Feature Importance Based Optimized Multi Teacher Knowledge Distillation in Heterogeneous Federated Learning Environment

FedMTFI: 异构联邦学习环境中基于特征重要性优化的多教师知识蒸馏

Nazmus Shakib Shadin, Aaron Cummings, Xinyue Zhang, Bobin Deng

发表机构 * Department of Computer Science, Kennesaw State University, Marietta, GA, 30060 USA(计算机科学系,肯纳邦大学,马里埃塔,GA,30060 USA)

AI总结 提出FedMTFI架构,通过结合多教师知识蒸馏与Shapley值特征重要性,在异构联邦学习中提升模型准确性和可解释性。

Comments Accepted by IJCNN 2026

详情
AI中文摘要

联邦学习(FL)是一种去中心化方法,能够在无需暴露原始数据的情况下实现协作模型训练。它允许设备仅共享模型权重,而将个人数据保留在本地并确保安全,从而避免了敏感数据的传输。然而,在现实环境中,设备持有的数据往往分布不均,且设备在计算能力和内存容量上大多存在差异。这些差异使得FL难以在整个系统中保持一致的性能。为了解决这些问题,我们提出了FedMTFI,一种新颖的架构,它将多教师知识蒸馏(MTKD)与特征重要性相结合,以改善异构环境中的FL过程。在FedMTFI中,客户端根据相似的硬件和模型类型进行聚类。每个聚类在非独立同分布(non-IID)数据上训练特定模型。在聚类内部,每个客户端仅使用自己的本地私有数据更新该模型。然后,服务器使用FedAvg对每个聚类中的本地训练模型进行聚合,形成多个原型模型。接着,这些原型作为教师模型,通过MTKD训练一个全局通用的学生模型。FedMTFI的独特之处在于集成了Shapley值(SHAP),以在蒸馏过程中强调重要特征,从而提高了准确性和可解释性。实验结果表明,FedMTFI比传统FL算法实现了更高的准确性,并且在non-IID数据条件下表现更有效。

英文摘要

Federated learning (FL) is a decentralized approach that enables collaborative model training without exposing raw data. Instead of transferring sensitive data, it allows devices to share only model weights, keeping personal data locally and secure. However, in real world settings, the data held by devices is often not evenly distributed and devices mostly differ in computing power and memory capacity. These differences make FL harder to maintain consistent performance across the system. To address these issues, we propose FedMTFI, a novel architecture that combines multi-teacher knowledge distillation (MTKD) with feature importance to improve the FL process in heterogeneous environments. In FedMTFI, clients are clustered based on similar hardware and model types. Each cluster trains a specific model on not independently and identically distributed (non-IID) data. Within a cluster, every client updates that model using only its own local private data. The server then aggregates the locally trained models in each cluster using FedAvg to form multiple prototype models. Then these prototypes serve as teacher models to train a global generalized student model using MTKD. What makes FedMTFI more unique is the integration of Shapley values (SHAP) to emphasize important features during distillation, which enhances both accuracy and interpretability. Experimental results show that FedMTFI achieves higher accuracy than traditional FL algorithms and performs more effectively under non-IID data conditions.

2606.01596 2026-06-02 math.NA cs.LG cs.NA 版本更新

Learning Chaotic Dynamics through Second-Order Geometric Supervision

通过二阶几何监督学习混沌动力学

Shinhoo Kang, Hai V. Nguyen, Tan Bui-Thanh

发表机构 * Department of Computer Science and Software Engineering, Korea University(韩国大学计算机科学与软件工程系) Department of Aerospace Engineering and Engineering Mechanics, The Oden Institute for Computational Engineering and Sciences, The University of Texas at Austin(德克萨斯大学奥斯汀分校航空航天工程与工程力学系,奥登计算工程与科学研究所)

AI总结 提出模型约束随机雅可比匹配方法,以O(d^2)代价隐式施加二阶一致性,在混沌系统中恢复吸引子几何和不变统计量。

Comments 37 pages, 15 figures, 6 tables

详情
AI中文摘要

从数据中学习混沌动力系统需要的不仅仅是短期预测精度:学习模型必须保持吸引子几何及其不变统计量。轨迹(零阶)和雅可比(一阶)匹配监督向量场的值和切结构,但两者都不约束场如何偏离其切平面。因此,模型可以在监督状态下匹配值和切线,但弯曲方式与真实情况不同,在保持局部精度的同时,向虚假吸引子漂移并扭曲长时间统计量。我们证明,强制二阶一致性可以减轻这些失败,但在高维中形成完整的Hessian矩阵是禁止的。我们提出模型约束随机雅可比匹配,该方法在随机扰动的输入处比较真实和学习的向量场的雅可比矩阵。泰勒展开表明,期望的随机雅可比损失分解为名义雅可比失配加上由噪声方差缩放的Hessian失配,从而以O(d^2)代价隐式施加二阶一致性,而无需形成O(d^3)的Hessian张量。仅使用雅可比评估,该方法可扩展到显式Hessian匹配无法实现的高维。数值实验证实二阶方法是稳健的。对于Lorenz~63,一阶方法在最小时间监督下产生灾难性的Lyapunov指数异常值,而二阶方法消除了这些异常值并恢复了正确的吸引子。对于耦合Lorenz~96,分布外强迫扫描区分了这些方法:所有方法在F=16之前一致,但超过F=18后,只有二阶方法保持了不变测度和Lyapunov谱。在两个系统上,随机雅可比匹配以低得多的成本实现了与显式Hessian匹配相当的性能。

英文摘要

Learning chaotic dynamical systems from data requires more than short-term predictive accuracy: the learned model must preserve the attractor geometry and its invariant statistics. Trajectory (zero-order) and Jacobian (first-order) matching supervise the values and tangent structure of the vector field, but neither constrains how the field bends away from its tangent plane. A model can thus match values and tangents at the supervised states yet curve differently from the truth, remaining locally accurate while drifting toward spurious attractors and distorting long-time statistics. We show that enforcing second-order consistency mitigates these failures, but forming the full Hessian is prohibitive in high dimensions. We propose model-constrained randomized Jacobian matching, which compares the Jacobians of the true and learned vector fields at randomly perturbed inputs. A Taylor expansion shows that the expected randomized Jacobian loss decomposes into the nominal Jacobian mismatch plus a Hessian mismatch scaled by the noise variance, implicitly enforcing second-order consistency at $\mathcal{O}(d^2)$ cost without forming the $\mathcal{O}(d^3)$ Hessian tensor. Using only Jacobian evaluations, the method scales to high dimensions where explicit Hessian matching does not. Numerical experiments confirm that second-order methods are robust. For Lorenz~63, first-order methods produce catastrophic Lyapunov-exponent outliers under minimal temporal supervision, which second-order methods eliminate while recovering the correct attractor. For coupled Lorenz~96, an out-of-distribution forcing sweep separates the methods: all agree up to $F=16$, but beyond $F=18$ only second-order methods preserve the invariant measure and Lyapunov spectrum. On both systems, randomized Jacobian matching performs comparably to explicit Hessian matching at much lower cost.

2606.01595 2026-06-02 cs.LG 版本更新

Uncertainty-Calibrated Diffusion for Reliable 3D Molecular Graph Generation

不确定性校准的扩散用于可靠的3D分子图生成

Fang Wan, Jingxiang Qu, Yi Liu

发表机构 * State University of New York at Stony Brook(纽约州立大学石溪分校)

AI总结 针对扩散模型在3D分子图生成中因认知不确定性导致采样质量下降的问题,提出不确定性校准扩散方法(UCD),通过校准反向扩散过程来补偿认知不确定性,在多个基准上取得最优性能。

详情
AI中文摘要

贝叶斯推理通过将预测视为分布而非确定性值,为神经网络中的认知不确定性建模提供了原则性框架。同时,用于3D分子图生成的扩散模型在受严格化学约束的脆弱几何结构上运行,使得推理对不确定性误校准高度敏感。一个被广泛忽视的问题是,来自学习去噪器的认知不确定性会与反向扩散过程中有意注入的偶然不确定性相互作用,导致系统性的方差膨胀以及真实分布与模拟分布之间的不匹配。这种效应对于高精度分子生成尤其有害,因为即使微小偏差也可能违反化学有效性。在这项工作中,我们对认知不确定性如何通过扩散推理传播并降低采样质量进行了理论和实证分析。基于此研究,我们提出了UCD(不确定性校准扩散),一种简单而有效的方法,通过校准反向扩散过程来考虑认知不确定性。在标准3D分子基准上的大量实验表明,UCD在不同基线方法中一致地提高了采样质量,为3D分子扩散建立了新的最先进性能。代码可在 https://github.com/jiuguaiwf/UCD 获取。

英文摘要

Bayesian inference provides a principled framework for modeling epistemic uncertainty in neural networks by treating predictions as distributions rather than deterministic values. Meanwhile, diffusion-based models for 3D molecular graph generation operate on fragile geometric structures governed by strict chemical constraints, making inference highly sensitive to uncertainty miscalibration. A largely overlooked issue is that epistemic uncertainty arising from the learned denoiser interacts with the aleatoric uncertainty intentionally injected during reverse diffusion, leading to systematic variance inflation and a mismatch between the true distribution and the simulated distribution. This effect is particularly detrimental for high-precision molecular generation, where even small deviations can violate chemical validity. In this work, we provide a theoretical and empirical analysis of how epistemic uncertainty propagates through diffusion inference and degrades sampling quality. Building on this investigation, we propose UCD (Uncertainty-Calibrated Diffusion), a simple yet effective method that calibrates the reverse diffusion process to account for epistemic uncertainty. Extensive experiments on standard 3D molecular benchmarks demonstrate that UCD consistently improves sampling quality across diverse baseline methods, establishing new state-of-the-art performance for 3D molecular diffusion. The code is available at https://github.com/jiuguaiwf/UCD.

2606.01591 2026-06-02 cs.CV cs.LG 版本更新

TLG: Temporal-Logic Grounding for Video Question Answering via Source-Annotation Reconstruction and Category-Targeted Reasoning

TLG: 通过源标注重建和类别目标推理实现视频问答的时间逻辑基础

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出TLG三阶段系统,通过重建动作时间线、解析问题为时间逻辑程序并确定性执行,结合强视觉语言模型和前沿推理模型,将视频问答准确率从46.9%提升至71.37%。

详情
AI中文摘要

TimeLogic挑战评估对视频的形式时间逻辑推理——包括16个算子(之前、之后、直到、自从、总是、共现、排序等),采用布尔和四选一形式。端到端视频语言模型在此任务上接近随机水平,因为它们将视频视为帧的集合,无法定位动作发生的时间。我们提出TLG(时间逻辑基础),一个三阶段系统:(i)从生成基准测试的公共源数据集标注中重建每个视频的动作时间线,将每个问题解析为时间逻辑程序,并确定性执行;(ii)在没有标注的情况下回退到强大的开放视觉语言模型;(iii)仅将视觉语言模型经验上最弱的问题类别路由到前沿推理模型。TLG将测试准确率从46.9%的视觉语言模型基线提升到71.37%,绝对增益+24.5,达到排行榜前三名3分以内。我们报告了广泛的消融实验,包括三种基于模型的时间线重建变体,它们都低于整体视觉语言模型,将时间基础隔离为不可约的瓶颈,并表明真正的标注——而非更大的模型——驱动准确率。

英文摘要

The TimeLogic Challenge evaluates formal temporal-logic reasoning over video - 16 operators (before, after, until, since, always, co-occur, ordering, ...) in boolean and 4-way multiple-choice form. End-to-end video-language models (VLMs) hover near chance on this task because they treat video as a bag of frames and cannot localize when actions occur. We present TLG (Temporal-Logic Grounding), a three-tier system that (i) reconstructs each video's action timeline from the public source-dataset annotations the benchmark was generated from, parses every question into a temporal-logic program, and executes it deterministically; (ii) falls back to a strong open VLM where no annotation exists; and (iii) routes only the question categories where the VLM is empirically weakest to a frontier reasoning model. TLG raises test accuracy from a 46.9% VLM baseline to 71.37%, a +24.5 absolute gain, reaching within 3 points of the leaderboard top. We report extensive ablations, including three model-based timeline-reconstruction variants that all underperform a holistic VLM, isolating temporal grounding as the irreducible bottleneck and showing that real annotations - not larger models - drive accuracy.

2606.01566 2026-06-02 cs.LG 版本更新

RobustModelMaker: Coupling Bootstrap Stability Selection with Leakage-Safe Nested Cross-Validation for Scientific Machine Learning

RobustModelMaker: 将Bootstrap稳定性选择与防泄漏嵌套交叉验证相结合的科学机器学习

Amanda S Barnard

发表机构 * School of Computing, Australian National University(计算学院,澳大利亚国立大学)

AI总结 针对小到中等规模科学数据集,提出RobustModelMaker框架,通过结合bootstrap稳定性选择与严格嵌套交叉验证,在防止数据泄漏的同时提供稳定性测试的特征子集和性能估计,在预测得分和选择稳定性上优于多种替代方法。

Comments 19 pages, 2 figure plates, 8 tables

详情
AI中文摘要

小到中等规模的科学数据集使机器学习流程面临两种叠加压力。单次特征选择产生的特征集在训练数据微小扰动下会发生显著变化,而任何使用相同数据进行选择、调参和评估的程序都会产生乐观偏差的性能估计。这两种失效模式通常被视为可分离的,但在科学数据所处的场景中,它们相互影响:不稳定的选择会放大本已乐观的得分的方差,而针对其中一种的标准补救措施很少能解决另一种。RobustModelMaker是一个Python框架,它将bootstrap稳定性选择与严格的嵌套交叉验证相结合,在每个折叠内执行所有预处理和选择,并生成一个经过稳定性测试的特征子集以及一个防泄漏的性能估计。该框架支持二分类、多分类和回归中的九种算法。行为通过确定性测试套件进行验证,该套件涵盖单元测试、性能测试和可重复性检查,在三个真实科学数据集上,与三种替代选择器(ANOVA F检验、带交叉验证的递归特征消除和Boruta)在预测得分和选择稳定性的Jaccard度量上进行比较。RobustModelMaker在每个数据集上的得分与最佳替代选择器相当,并且在所有三种任务类型中,在联合得分-稳定性前沿上占据了一个任何替代方法都无法匹敌的位置。两个示例应用——来自PLCO试验的卵巢癌生物标志物发现和UCI超导数据上的临界温度回归——说明了该框架在实际中的使用方式,以及当稳定性被视为首要交付成果而非涌现属性时,哪些权衡变得可见。

英文摘要

Small-to-medium scientific datasets place machine learning pipelines under two compounding pressures. Single-run feature selection produces feature sets that change substantially under small perturbations of the training data, and any procedure that uses the same data for selection, tuning, and evaluation produces optimistically biased performance estimates. The two failure modes are routinely treated as separable, but in the regimes where scientific data live, they interact: an unstable selection inflates the variance of an already-optimistic score, and standard remedies for one rarely address the other. RobustModelMaker is a Python framework that couples bootstrap stability selection with strict nested cross-validation, performs all preprocessing and selection inside each fold, and produces a stability-tested feature subset together with a leakage-safe performance estimate. The framework supports nine algorithms across binary classification, multiclass classification, and regression. Behaviour is verified by a deterministic test suite spanning unit, performance, and reproducibility checks on three real scientific datasets comparing to three alternative selectors (ANOVA F-test, recursive feature elimination with cross-validation, and Boruta) on both predictive score and a Jaccard measure of selection stability. RobustModelMaker is competitive in score with the best alternative selector on each dataset, and occupies a position on the joint score-stability frontier that none of the alternatives match across all three task types. Two example applications, ovarian cancer biomarker discovery from the PLCO Trial and critical-temperature regression on the UCI Superconductivity Data, illustrate how the framework is used in practice and what trade-offs become visible when stability is treated as a first-class deliverable rather than an emergent property.

2606.01563 2026-06-02 cs.LG 版本更新

MomentKV: Closing the Directional Gap in KV Cache Eviction for Long-Context Inference

MomentKV:消除长上下文推理中KV缓存驱逐的方向差距

Yu Li, Binxu Li, Tian Lan

发表机构 * George Washington University(乔治·华盛顿大学) Princeton University(普林斯顿大学)

AI总结 针对长上下文推理中KV缓存驱逐导致输出退化的问题,提出MomentKV方法,通过维护驱逐令牌集的矩统计量(计数、键均值、值均值和值-键协方差)来识别与累积摘要对齐的令牌,并在推理时提供驱逐注意力输出的一阶近似,实现选择性驱逐与精确校正的相互增强。

详情
AI中文摘要

基于Transformer的语言模型中的自回归解码依赖于KV缓存,其内存占用随序列长度线性增长,成为长上下文推理的主要瓶颈。KV缓存驱逐通过保留固定大小的键值对子集并丢弃其余部分来解决这一问题。我们发现输出退化的一个主要来源并非驱逐令牌上的残余注意力质量(现有方法已最小化),而是保留令牌集与驱逐令牌集之间的方向不匹配。具体而言,实际中被驱逐的令牌通常与保留的令牌接近正交。因此,即使少量的驱逐质量也可能对最终的方向分布产生过大影响,并放大为显著的输出误差。这揭示了现有策略的根本局限性。为解决此问题,我们提出MomentKV,它在驱逐令牌集上维护紧凑的小规模矩统计量,包括计数、键均值、值均值和值-键协方差。在驱逐过程中,利用矩统计量识别已经与累积摘要良好对齐并被其捕获的令牌,保持驱逐集的几何规则性。在推理过程中,它们产生驱逐注意力输出的闭式一阶近似,在选择性驱逐与精确校正之间形成相互增强的循环。在LongBench和RULER上使用LLaMA-3.1-8B-Instruct和Qwen3-4B-Instruct进行的实验表明,MomentKV在每个缓存预算下均优于所有基线,在激进压缩下增益最大。

英文摘要

Autoregressive decoding in Transformer-based language models relies on the KV cache, whose memory footprint grows linearly with sequence length and becomes the primary bottleneck for long-context inference. KV cache eviction addresses this by retaining a fixed-size subset of key-value pairs and discarding the rest. We identify that a primary source of output degradation is not the residual attention mass on evicted tokens, which existing methods already minimize, but a directional mismatch between the retained and evicted token sets. Specifically, the evicted tokens in practice are often near-orthogonal to the retained ones. Thus, even a small evicted mass could have an oversized impact on the resulting direction distribution and amplify into substantial output error. This reveals a fundamental limit in existing strategies. To address this, we propose MomentKV, which maintains compact, small-size moment statistics over the evicted token set, including a count, key mean, value mean, and value-key covariance. During eviction, the moment statistics is leveraged to identify tokens already well aligned with and captured by the accumulated summary, keeping the evicted set geometrically regular. During inference, they yield a closed-form first-order approximation of the evicted attention output, forming a mutually reinforcing loop between selective eviction and accurate correction. On LongBench and RULER with LLaMA-3.1-8B-Instruct and Qwen3-4B-Instruct, MomentKV outperforms all baselines at every cache budget, with the largest gains under aggressive compression.

2606.01560 2026-06-02 cs.LG cs.AI 版本更新

GJDNet: Robust Graph Neural Networks via Joint Disentangled Learning Against Adversarial Attacks

GJDNet: 通过联合解缠学习实现鲁棒图神经网络对抗攻击

Canyixing Cui, Tao Wu, Xingping Xian, Xiao-Ke Xu, Mao Wang, Weina Niu

发表机构 * School of Computer Science and Technology, Chongqing University of Posts and Telecommunications(重庆邮电大学计算机科学与技术学院) School of Cyber Security and Information Law, Chongqing University of Posts and Telecommunications(重庆邮电大学网络安全与信息法学院) Computational Communication Research Center, Beijing Normal University(北京师范大学计算通信研究中心) School of Journalism and Communication, Beijing Normal University(北京师范大学新闻传播学院) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院)

AI总结 提出GJDNet框架,通过联合解缠节点表示和决策空间,并采用球形决策边界,增强图神经网络在不同图同配性下的鲁棒性。

详情
AI中文摘要

图神经网络(GNN)易受对抗攻击,这类攻击通过在同配图中引入异配边、在异配图中引入同配边,从根本上反转连接模式。这种结构反转造成结构-特征不匹配,扰乱不同图类型上的邻域聚合。然而,我们发现现有防御措施存在局限性,它们要么在固定的同配性假设下将邻域视为整体,要么依赖无法应对扰动引起的表示偏移的标准softmax分类器。为进一步利用这一观察,我们采用鲁棒性视角,联合解缠节点表示和决策空间,在隔离扰动影响的同时强制实现分离良好的决策区域。基于此原则,我们提出图联合解缠网络(GJDNet),这是一个统一的框架,用于在不同图同配性机制下进行鲁棒节点分类。GJDNet在表示和决策两个层面增强鲁棒性:它采用特征驱动的软结构解缠,结合偏度感知的邻居过滤,抑制扰动引起的结构-特征不匹配;并引入球形决策边界(SDB),促进嵌入空间中的类内紧凑性和类间分离,从而在扰动下稳定决策边界。理论分析揭示了所提出的解缠表示和决策机制的有效性,而大量实验表明,GJDNet在不同连接模式的图上始终展现出强鲁棒性。

英文摘要

Graph Neural Networks (GNNs) are vulnerable to adversarial attacks, which inherently invert connectivity patterns by introducing disassortative edges in assortative graphs and assortative edges in disassortative graphs. This structural inversion creates structure-feature mismatches that disrupt neighborhood aggregation across different graph types. However, we find that existing defenses are limited, as they either treat neighborhoods as monolithic under fixed assortativity assumptions or rely on standard softmax classifiers that fail to account for perturbation-induced representation shifts. To further exploit this observation, we adopt a robustness perspective that jointly disentangles node representations and decision spaces, isolating perturbation effects while enforcing well-separated decision regions. Based on this principle, we propose Graph Joint Disentanglement Network (GJDNet), a unified framework for robust node classification across diverse graph assortativity regimes. GJDNet enhances robustness at both representation and decision levels: it employs feature-driven soft structural disentanglement with skewness-aware neighbor filtering to suppress perturbation-induced structure-feature mismatches, and introduces a Spherical Decision Boundary (SDB) to promote intra-class compactness and inter-class separation in the embedding space, thereby stabilizing decision boundaries under perturbations. Theoretical analysis provides insights into the effectiveness of the proposed disentangled representation and decision mechanisms, while extensive experiments demonstrate that GJDNet consistently achieves strong robustness across graphs with different connectivity regimes.

2606.01557 2026-06-02 cs.LG eess.SP 版本更新

Everywhere Learning: Artificial Intelligence with Pointwise Constraints

处处学习:具有逐点约束的人工智能

Ignacio Boero, Ignacio Hounie, Luiz Chamon, Alejandro Ribeiro

发表机构 * Department of Electrical and Systems Engineering, University of Pennsylvania(宾夕法尼亚大学电气与系统工程系) École polytechnique, Institut Polytechnique de Paris(巴黎理工学院)

AI总结 提出“处处学习”新范式,通过近似对偶理论分析泛化性能,并用稀疏L1惩罚控制泛化,在语言模型任务中验证其优势。

详情
AI中文摘要

处处学习是一种新范式,其中人工智能系统被训练以满足数据分布上概率为1的损失约束。这与训练人工智能系统最小化平均损失的标准范式形成对比。我们发展了一种近似对偶理论,以支持泛化分析,该分析建立了经验与统计处处学习问题解之间的接近性。我们的结果表明,对偶变量将数据分布重新加权到损失约束更难满足的点,并且泛化由数据分布质量集中与约束更难满足点上的质量集中之间的不匹配控制。我们进一步表明,我们可以通过约束松弛上的稀疏L1惩罚来控制泛化。我们通过语言模型任务中的智能体分类实验说明了处处学习的优点。

英文摘要

Everywhere learning is a new paradigm whereby Artificial Intelligence (AI) systems are trained to satisfy loss constraints with probability one over the data distribution. This is in contrast to the standard paradigm of training AI systems to minimize average losses. We develop an approximate duality theory to substantiate a generalization analysis that establishes the proximity between solutions of empirical and statistical everywhere learning problems. Our results show that dual variables reweigh the data distribution towards points in which loss constraints are more difficult to satisfy and that generalization is controlled by the mismatch between the concentration of mass of the data distribution and the concentration of mass on points where constraints are more difficult to satisfy. We further show that we can control generalization with a sparse L1 penalty on constraint relaxations. We illustrate the merits of everywhere learning with an experiment in agentic classification for language model tasks.

2606.01544 2026-06-02 cs.LG 版本更新

CRePE: Convolution-aware Relative Importance in Post-training Pruning with Efficient Search

CRePE: 后训练剪枝中基于卷积感知的相对重要性及高效搜索

Cheonjun Park

发表机构 * Hankuk University of Foreign Studies(韩国家外国语大学)

AI总结 提出CRePE方法,通过引入二维局部邻域上下文和自适应系数改进相对重要性评分,结合PHO代理优化实现高效后训练剪枝,在多种模型和稀疏度下取得最优性能。

Comments 10 pages

详情
AI中文摘要

在实际部署大型语言模型(LLM)时,会带来大量的内存和计算成本。后训练剪枝(PTP)是一种通过移除权重来降低这些成本的有效方法,无需额外训练。在现有方法中,RIA引入了通过行和列和归一化的相对重要性分数,实现了最先进的精度。然而,RIA仅考虑一维十字形(行/列)方向信息,并对行和列贡献赋予相同权重。在本文中,我们提出**CRePE**,它将二维局部邻域上下文和自适应系数纳入相对重要性评分。CRePE在各种模型和稀疏度设置下始终优于现有的PTP方法。然而,通过基于困惑度(PPL)的爬山法确定最优自适应系数需要大量PPL评估和约11小时的搜索时间。为了解决这个问题,我们提出**PHO**(基于代理的超参数优化),它消除了重复PPL测量的需要,并将搜索时间减少到约20分钟。此外,PHO在一个模型上找到的最优超参数配置可以很好地迁移到其他模型,展现出强大的泛化能力。最后,我们验证了CRePE可以与现有技术(包括通道置换、非均匀稀疏分配和重新剪枝方法)正交结合。

英文摘要

Deploying Large Language Models (LLMs) in practice incurs substantial memory and computational costs. Post-training pruning (PTP) is an effective approach to reducing these costs by removing weights without additional training. Among existing methods, RIA introduces relative importance scores normalized by row and column sums, achieving state-of-the-art accuracy. However, RIA considers only 1D cross-shaped (row/column) directional information and assigns equal weight to row and column contributions. In this paper, we propose \textbf{CRePE}, which incorporates 2D local neighborhood context and adaptive coefficients into Relative Importance scoring. CRePE consistently outperforms existing PTP methods across diverse models and sparsity settings. However, identifying optimal adaptive coefficients via perplexity (PPL)-based hill climbing requires numerous PPL evaluations and approximately 11 hours of search time. To address this, we propose \textbf{PHO} (Proxy-based Hyperparameter Optimization), which eliminates the need for repeated PPL measurements and reduces the search time to approximately 20 minutes. Furthermore, the optimal hyperparameter configuration found by PHO on one model transfers well to other models, demonstrating strong generalization. Finally, we verify that CRePE can be orthogonally combined with existing techniques including Channel Permutation, non-uniform sparsity allocation, and re-pruning methods.

2606.01540 2026-06-02 cs.LG cs.AI 版本更新

TN-SHAP-G: Graph-Structured Tensor Network Surrogates for Shapley Values and Interactions

TN-SHAP-G:用于Shapley值和交互的图结构张量网络代理

Farzaneh Heidari, Guillaume Rabusseau

发表机构 * University of Washington(华盛顿大学) CNRS(法国国家科学研究中心)

AI总结 提出TN-SHAP-G框架,利用图结构输入通过张量网络代理高效计算Shapley值和高阶交互指数。

详情
AI中文摘要

Shapley值是一种广泛使用的工具,用于归因黑盒模型中输入变量的重要性和交互,但其计算涉及定义在指数级子集空间上的函数。我们提出TN-SHAP-G,一个利用图结构输入中的结构高效计算Shapley值和高阶交互指数的框架。给定一个预测器和一个固定的掩码方案,TN-SHAP-G学习一个紧凑的、与图对齐的多线性代理,该代理近似掩码输入行为,表示为拓扑结构反映输入图的张量网络。一旦从少量oracle查询中训练完成,该代理通过多线性扩展实现一阶和高阶Shapley指数的确定性恢复,无需额外模型查询或蒙特卡洛方差。分子基准实验表明,学习到的分解在小图上紧密匹配精确Shapley值,并能高效扩展到基于采样的方法不可行的更大图。

英文摘要

Shapley values are a widely used tool for attributing importance and interactions among input variables in black-box models, but their computation involves a function defined over an exponentially large space of subsets. We propose TN-SHAP-G, a framework that exploits structure in graph-structured inputs to compute Shapley values and higher-order interaction indices efficiently. Given a predictor and a fixed masking scheme, TN-SHAP-G learns a compact, graph-aligned multilinear surrogate that approximates the masked-input behavior, represented as a tensor network whose topology mirrors the input graph. Once trained from a small number of oracle queries, the surrogate enables deterministic recovery of first- and higher-order Shapley indices via the multilinear extension, without additional model queries or Monte Carlo variance. Experiments on molecular benchmarks show that the learned factorization closely matches exact Shapley values on small graphs and scales efficiently to larger graphs where sampling-based methods become infeasible.

2606.01539 2026-06-02 stat.ME cs.LG 版本更新

Scalable Counterfactual Risk Estimation for Rare Events in Longitudinal Data

纵向数据中罕见事件的可扩展反事实风险估计

Xiaohui Yin, Avijit Mitra, Ying Zhou, Kun Chen, Hong Yu

发表机构 * University of Connecticut Storrs(康涅狄格大学斯托尔分校) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校)

AI总结 针对纵向生存数据中罕见事件导致的类不平衡和计算负担问题,提出一种可扩展的子采样与重加权策略,应用于ICE等因果效应估计器,在保持一致性的同时提高稳定性。

Comments Accepted at KDD-2026, 12 pages

详情
AI中文摘要

在大规模观察性研究中,估计时变治疗对生存结果的因果效应在计算上要求很高,尤其是当结果罕见时。虽然基于g公式的方法(如迭代条件期望(ICE)估计器)为纵向因果推断提供了原则性框架,但它们在计算上变得昂贵,特别是当需要基于自助法的方差估计时。此外,每个时间点的结果罕见性会导致严重的类不平衡,从而引发逻辑回归及相关模型的不稳定性和收敛问题。为应对这些挑战,我们提出了一种针对纵向生存数据的原则性子采样与重加权策略,可应用于该场景下的多种现有因果效应估计器,包括ICE估计器。所提方法显著降低了计算负担,同时在罕见结果场景下保持一致性并提高估计稳定性。我们通过模拟评估该方法,并使用一项关于健康社会和行为决定因素(SBDH)与自杀风险的大规模EHR队列研究进行验证,证明了其在纵向数据中建模罕见结果的有效性。

英文摘要

Estimating the causal effect of time-varying treatments on survival outcomes in large observational studies is computationally demanding, particularly when outcomes are rare. While g-formula-based methods such as the iterative conditional expectation (ICE) estimator provide a principled framework for longitudinal causal inference, they become computationally expensive, especially when bootstrap-based variance estimation is required. In addition, outcome rarity at each time point induces severe class imbalance, leading to instability and convergence issues in logistic regression and related models. To address these challenges, we propose a principled subsampling and reweighting strategy for longitudinal survival data that can be applied to a range of existing causal effect estimators in this setting, including the ICE estimator. The proposed method substantially reduces computational burden while preserving consistency and improving estimation stability in rare-outcome settings. We evaluate the method through simulations and validate it using a large-scale EHR cohort study on social and behavioral determinants of health (SBDH) and suicide risk, demonstrating its effectiveness for modeling rare outcomes in longitudinal data.

2606.01533 2026-06-02 cs.MA cs.CL cs.LG 版本更新

Multi-Agent Computer Use

多智能体计算机使用

Jing Yu Koh, Ruslan Salakhutdinov, Daniel Fried

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 针对单智能体计算机使用代理在复杂长时任务中的不足,提出多智能体计算机使用系统,通过有向无环图分解任务并并行执行,在多个基准测试上提升3.4-25.5%性能,并加速任务完成时间约1.5倍。

详情
AI中文摘要

目前的计算机使用代理(CUA)主要部署为单序列代理。这种设置对于受益于任务分解、并行执行和基于新信息持续重新规划的复杂长时任务来说并不理想。在本文中,我们认为应该转向评估和构建多智能体计算机使用(MACU)系统。这些系统强调规划和并行执行,缓解了单智能体CUA的许多缺点。我们提出了一种通用的多智能体设置,其中管理模型将计算机使用任务分解为有向无环图(DAG),编码子代理的相关依赖关系和目标。在每次迭代中,管理器调度并行的CUA子代理执行DAG就绪前沿上的节点,并根据子代理的新发现持续修订DAG(添加、取消或重写节点)。这种设计将计算机使用的部分可观察环境作为首要挑战:下游代理可能无法重新观察到的信息通过管理器和DAG结构保留并传递。我们证明,MACU在桌面(OSWorld)和网页导航(Online-Mind2Web、WebTailBench、Odysseys)基准测试上始终比强单智能体基线提升3.4-25.5%,表现出更有利的测试时缩放,并解决了单智能体CUA陷入困境的复杂长时任务。在Odysseys(一个长时网页导航基准测试)上,MACU将平均任务完成墙钟时间提高了约1.5倍,证明了其在加速传统缓慢的CUA流程方面的有效性。我们的研究结果强调,多智能体协调是扩展计算机使用代理以更长时间、更有效地工作的一个有前景的方向。我们在https://jykoh.com/multi-agent-computer-use上发布所有代码和交互式可视化。

英文摘要

Computer use agents (CUAs) today are primarily deployed as single serial agents. This setup is suboptimal for complex long-horizon tasks that benefit from task decomposition, parallel execution, and consistent re-planning based on new information. In this paper, we argue that we should instead move towards evaluating and building multi-agent computer use (MACU) systems. These systems, which emphasize planning and parallel execution, alleviate many of the shortcomings of single-agent CUAs. We propose a general multi-agent setup in which a manager model decomposes computer use tasks as a directed acyclic graph (DAG), encoding relevant dependencies and goals for subagents. At each iteration, the manager dispatches parallel CUA subagents to carry out nodes on the ready frontier of the DAG, and continuously revises the DAG (adding, canceling, or rewriting nodes) as new findings arrive from subagents. This design treats the partially observable environment of computer use as a first class challenge: information that downstream agents may not be able to re-observe are retained and passed forward through the manager and DAG structure. We demonstrate that MACU consistently improves over strong single-agent baselines by $3.4-25.5\%$ on desktop (OSWorld) and web navigation (Online-Mind2Web, WebTailBench, Odysseys) benchmarks, exhibits more favorable test-time scaling, and solves complex long-horizon tasks where single-agent CUAs get stuck. On Odysseys, a long-horizon web navigation benchmark, MACU improves average task completion wall-clock time by ${\sim} 1.5 \times$, demonstrating its efficacy in speeding up traditionally slow CUA pipelines. Our findings highlight that multi-agent coordination is a promising axis for scaling computer use agents to work productively for longer and more effectively. We release all code and interactive visualizations at https://jykoh.com/multi-agent-computer-use.

2606.01527 2026-06-02 cs.LG cs.CR 版本更新

Near-Optimal Pure Machine Unlearning for Smooth Strongly Convex Losses

平滑强凸损失下的近最优纯机器遗忘

Matthew Regehr, Gautam Kamath, Andrew Lowy

发表机构 * University of Waterloo(滑铁卢大学) Vector Institute(向量研究所) CISPA Helmholtz Center for Information Security(CISPA海德堡信息安全中心)

AI总结 针对平滑强凸随机优化中的近似ε-遗忘问题,本文通过证明超额总体风险的上界和下界(紧至条件数因子),几乎解决了遗忘的基本统计代价,并提出了在ε≫d时相比从头再训练和差分隐私基线具有指数级精度提升的遗忘算法。

详情
AI中文摘要

机器遗忘受到法律和用户需求(如被遗忘权)的驱动,旨在从训练模型中移除个体数据的影响。先前的工作已经为平滑强凸随机优化中的遗忘开发了算法和误差界,但遗忘的基本统计代价仍不清楚。我们通过证明近似ε-遗忘的超额总体风险的上界和下界,几乎解决了这个问题;我们的界紧至条件数因子。对于单位球上的均值估计,我们的上界和下界匹配。最优速率是通常的统计误差加上一个遗忘惩罚,该惩罚在从头再训练速率和随着ε/d增长而指数级减小的项之间插值,其中d是模型的维度。特别地,当ε≫d时,我们的ε-遗忘算法相比从头再训练模型和差分隐私基线提供了指数级的精度提升。另一方面,当ε≤d时,从头再训练是最优的。

英文摘要

Machine unlearning is motivated by legal and user-facing requirements to remove the influence of individuals' data from trained models, such as the right to be forgotten. Prior work has developed algorithms and error bounds for unlearning in smooth strongly convex stochastic optimization, but the fundamental statistical cost of unlearning has remained unclear. We nearly resolve this problem by proving upper and lower bounds on the excess population risk of approximate $\varepsilon$-unlearning; our bounds are tight up to a condition-number factor. For mean estimation over the unit ball, our upper and lower bounds match. The optimal rate is the usual statistical error plus an unlearning penalty that interpolates between the retraining-from-scratch rate and an exponentially smaller term as $\varepsilon/d$ grows, where $d$ is the dimension of the model. In particular, when $\varepsilon \gg d$, our $\varepsilon$-unlearning algorithm offers an exponential accuracy improvement over retraining the model from scratch and differentially private baselines. On the other hand, when $\varepsilon \le d$, retraining from scratch is optimal.

2606.01525 2026-06-02 cs.LG stat.ML 版本更新

Semi-Supervised Hyperbolic Hierarchical Clustering with Set-Level Structural Priors

基于集合级结构先验的半监督双曲层次聚类

Junjing Zheng, Xinyu Zhang, Xiangfeng Qiu, Chengliang Song, Weidong Jiang

发表机构 * College of Electronic Science and Technology, National University of Defense Technology(电子科学与技术学院,国防科技大学)

AI总结 提出一种半监督双曲层次聚类方法,通过引入集合作为基本建模单元,利用从叶级监督导出的集合级结构先验来指导非叶层次结构学习,提升标签一致性和树质量。

详情
AI中文摘要

半监督层次聚类旨在学习与数据模式和用户提供的监督一致的树结构。监督通常以叶级关系的形式给出,例如成对的必须连接/不能连接约束或三元组的必须在之前连接约束。尽管这些约束有助于调节局部样本关系,但它们并不直接指示哪些样本应形成连贯的子树。因此,学习到的树的非叶结构可能偏离真实标签所偏好的层次组织。为了解决这一局限性,我们提出了一种具有集合级结构先验的半监督双曲层次聚类方法。主要贡献是引入集合作为层次学习的基本建模单元。每个集合表示预期在子树内凝聚的样本,并从叶级监督以及学习到的约束一致相似性结构中导出。这些集合作为子树级监督的软结构先验,使得监督能够指导超出局部叶级关系的非叶层次形成。具体来说,我们首先学习约束一致的嵌入以获得可靠的集合划分,然后构建约束诱导的集合并估计集合间相似性以形成集合级结构先验。最后,将这些先验纳入双曲层次目标中进行连续树优化。在11个基准数据集上的实验和消融研究表明,所提出的方法在提高代表性层次聚类基线的标签一致性的同时,也增强了基于相似性的树质量。

英文摘要

Semi-supervised hierarchical clustering aims to learn a tree structure consistent with data patterns and user-provided supervision. Supervision is usually given as leaf-level relations, such as pairwise must-link/cannot-link constraints or triplet-wise must-link-before constraints. Although useful for regulating local sample relations, such supervision does not directly indicate which samples should form coherent subtrees. Consequently, the non-leaf structure of the learned tree may deviate from the hierarchical organization preferred by ground-truth labels. To address this limitation, we propose a semi-supervised hyperbolic hierarchical clustering method with set-level structural priors. The main contribution is to introduce sets as basic modeling units for hierarchy learning. Each set denotes samples expected to cohere within a subtree and is induced from leaf-level supervision together with a learned constraint-consistent similarity structure. These sets act as soft structural priors for subtree-level supervision, allowing supervision to guide non-leaf hierarchy formation beyond local leaf-level relations. Specifically, we first learn constraint-consistent embeddings to obtain a reliable set partition, then construct constraint-induced sets and estimate inter-set similarities to form set-level structural priors. Finally, these priors are incorporated into a hyperbolic hierarchy objective for continuous tree optimization. Experiments on eleven benchmark datasets and ablation studies show that the proposed method consistently improves label consistency over representative hierarchical clustering baselines while also enhancing similarity-based tree quality.

2606.01521 2026-06-02 cs.LG stat.ML 版本更新

Fast Generalization after Interpolation via Critically Damped Momentum Optimization

通过临界阻尼动量优化实现插值后的快速泛化

Luca Muscarnera, Silas Ruhrberg Estévez, Yuanzhang Xiao, Mihaela Van der Schaar

发表机构 * University of Cambridge(剑桥大学) University of Hawaii at Manoa(夏威夷大学曼瑙分校)

AI总结 提出GROKtimizer双阶段策略,结合快速收敛到插值与临界阻尼动量后插值范数最小化,在局部二次模型下实现比经典梯度下降二次加速,选择低范数插值解以提升泛化。

详情
AI中文摘要

机器学习的一个核心问题是模型在训练中可以达到近乎完美的性能,但对未见示例的泛化能力却显著较差。这种差距在高维、小样本场景下尤为严重,因为存在许多插值解,优化必须隐式地在具有不同泛化特性的最小值之间进行选择。基于最近关于插值阈值附近优化动态的理论进展,我们注意到风险最小化的两阶段结构(先损失最小化,后复杂度最小化)启发了一种双阶段优化调度。因此,我们从理论上证明,GROKtimizer——一种结合快速收敛到插值与基于临界阻尼动量(CDM)的后插值范数最小化的双阶段策略——为选择低范数插值解提供了一种自然方案。在后插值盆地的局部二次模型下,GROKtimizer比经典梯度下降实现了二次加速,并在一阶优化器中具有可证明的最优性。为了展示我们方法的适用性,我们在经典grokking文献中常见的几个合成基准以及各种真实世界数据集上评估了GROKtimizer。最后,我们将我们的发现与平坦最小值假说相协调,强调了后插值动态在构建高质量、泛化模型中的重要性。

英文摘要

A central problem in machine learning is that models can achieve near-perfect training performance while generalizing substantially less well to unseen examples. This gap is especially acute in high-dimensional, low-sample regimes, where many interpolating solutions exist and optimization must implicitly select among minima with different generalization properties. Following recent theoretical advances on optimization dynamics near the interpolation threshold, we note that the two-regime structure of risk minimization, with loss minimization followed by complexity minimization, motivates a biphasic optimization schedule. We thus theoretically demonstrate that GROKtimizer, a biphasic strategy that combines rapid convergence to interpolation with Critically Damped Momentum (CDM)-based post-interpolation norm minimization, offers a natural solution for selecting low-norm interpolating solutions. Under a local quadratic model of the post-interpolation basin, GROKtimizer provides a quadratic speedup over classical gradient descent, with provable optimality among first-order optimizers. To showcase the applicability of our method, we evaluate GROKtimizer on several synthetic benchmarks common in the classical grokking literature and on various real-world datasets. Finally, we reconcile our findings with the flat-minima hypothesis, highlighting the importance of post-interpolation dynamics in the construction of high-quality, generalizing models.

2606.01513 2026-06-02 cs.DC cs.AI cs.CL cs.LG 版本更新

Compliance-Scored Best-of-N Guardrail Orchestration for Multimodal Document Generation in Payments Dispute Defense

基于合规评分的Best-of-N护栏编排用于支付争议防御中的多模态文档生成

Nataraj Agaram Sundar, Tejas Morabia

发表机构 * eBay Inc.(eBay公司)

AI总结 提出一种结合多候选生成与合规评分早退机制的护栏编排层,通过并行生成、加权评分和最佳输出选择,在支付争议防御场景中实现高合规率与低延迟。

Comments 8 pages, 7 figures, 4 tables. Preprint. Applied systems paper on compliance-scored guardrail orchestration for multimodal LLM document generation. Contains aggregate operational readouts; not a randomized A/B test

详情
AI中文摘要

高风险企业文档生成,包括金融争议叙述、合规通知和审计摘要,要求模式正确性、策略合规性以及大规模低延迟操作。在统一的护栏层之前,生产系统通常将独立的PII编辑、内容审核和格式验证步骤拼接在一起,导致逻辑碎片化、请求路径变慢和运营成本增加。我们提出了一种针对文本和图像输入的护栏编排层,它将多候选生成与用于早退的显式合规评分相结合。该框架运行可配置的并行生成头,根据加权护栏(包括PII检测、内容审核、模式约束和领域规则)对候选进行评分,并返回具有选择元数据的最佳评分输出。可用的运营读数报告在20秒内进行5次尝试,合规率为91%。对于支付争议防御摘要,我们分析聚合运营场景读数,而非随机A/B测试。可变队列显示总体胜率高于对照组,301/659对比536/1548,对应+11.0个百分点,95%置信区间[6.6, 15.5],p < 0.001;对于调整后的未收到物品案例,+7.5个百分点,95%置信区间[0.2, 15.7],p = 0.045。欺诈和本地证据排名差异方向为正,但在聚合计数数据中不具有统计显著性。我们还报告了来自770次生成证据审查和70例OCR切片的评审校准的负责任AI证据质量信号,并通过请求接口、评分逻辑、伪代码和运营证据边界记录了可重复性边界。

英文摘要

High-stakes enterprise document generation, including financial dispute narratives, compliance notices, and audit summaries, demands schema correctness, policy compliance, and low-latency operation at scale. Prior to a unified guardrail layer, production systems often stitched together separate PII redaction, content moderation, and format validation steps, leading to fragmented logic, slower request paths, and higher operational cost. We present a guardrail orchestration layer for text and image inputs that couples multi-candidate generation with an explicit compliance score used for early exit. The framework runs configurable parallel generation heads, scores candidates against weighted guardrails including PII detection, content moderation, schema constraints, and domain rules, and returns the best-scoring output with selection metadata. The available operational readout reports 5 attempts within 20 seconds and 91 percent compliance. For payments dispute defense summaries, we analyze aggregate operational scenario readouts rather than a randomized A/B test. Variable cohorts show higher count win rates than controls overall, 301/659 versus 536/1548, corresponding to +11.0 percentage points with 95 percent confidence interval [6.6, 15.5] and p < 0.001, and for adjusted item-not-received cases, +7.5 percentage points with 95 percent confidence interval [0.2, 15.7] and p = 0.045. Fraud and local evidence-ranking deltas are directionally positive but not statistically significant from the aggregate count data. We also report reviewer-calibrated Responsible-AI evidence-quality signals from 770 generated-evidence reviews and a 70-case OCR slice, and document the reproducibility boundary through the request interface, scoring logic, pseudocode, and operational evidence boundary.

2606.01509 2026-06-02 cs.LG cs.AI 版本更新

ProbMoE: Differentiable Probabilistic Routing for Mixture-of-Experts

ProbMoE:可微分的专家混合概率路由

Heng Zhao, Zilei Shao, Guy Van den Broeck, Zhe Zeng

发表机构 * Imperial College London(伦敦帝国学院) University of Waterloo(多伦多大学) EPFL(瑞士联邦理工学院)

AI总结 提出ProbMoE概率路由框架,通过离散子集空间上的概率推断实现专家选择,解决top-k路由的离散非可微问题,并扩展到动态k路由,提升专家利用率和路由多样性。

Comments Accepted at ICML 2026

详情
AI中文摘要

专家混合(MoE)模型通过每个令牌仅激活一小部分专家来扩展规模。然而,训练此类模型仍然具有挑战性,因为top-$k$路由是离散且不可微的,需要针对专家选择的梯度估计器,其设计仍是一个核心开放问题。我们引入了ProbMoE,一种概率路由框架,将专家选择建模为基数受限专家子集上的分布,并将路由公式化为该离散子集空间中的概率推断。我们首先提出ProbMoE Exact-$k$路由,在前向传播中采样$k$专家子集,后向传播使用每个专家精确边际概率的梯度作为真实梯度的可处理代理。ProbMoE自然地推广到动态$k$路由设置,其中训练和推理都将路由基数约束到相同的预定义范围,允许每个令牌自适应地分配专家。在多个基准测试和模型骨干上,ProbMoE Exact-$k$相比竞争基线实现了强性能,具有改进的专家利用率和路由多样性;ProbMoE Dynamic-$k$以更少的激活专家实现了可比的性能。

英文摘要

Mixture-of-Experts (MoE) models scale by activating only a small subset of experts per token. However, training such models remains challenging because top-$k$ routing is discrete and non-differentiable, requiring gradient estimators for expert selection whose design remains a central open problem. We introduce ProbMoE, a probabilistic routing framework that models expert selection as a distribution over cardinality-constrained expert subsets and formulates routing as probabilistic inference in this discrete subset space. We first propose ProbMoE Exact-$k$ routing, which samples $k$-expert subsets in the forward pass, and the backward pass uses gradients through each expert's exact marginal probability as a tractable surrogate for the true gradient. ProbMoE naturally generalizes to a dynamic-$k$ routing setting, where both training and inference constrain the routing cardinality to the same predefined range, allowing adaptive expert allocation per token. Across benchmarks and model backbones, ProbMoE Exact-$k$ achieves strong performance compared to competitive baselines, with improved expert utilization and routing diversity; ProbMoE Dynamic-$k$ achieves comparable performance with fewer activated experts.

2606.01504 2026-06-02 cs.IR cs.LG 版本更新

Semantic Retrieval for Product Search in E-Commerce

电子商务产品搜索中的语义检索

Nikhil Kothari, Saksham Samdani, Ritam Mallick, Praveen Gupta, Ankit Vijay, Surender Kumar

发表机构 * Flipkart, India(印度Flipkart)

AI总结 针对电商搜索中短、嘈杂、口语化查询和细粒度属性区分问题,提出一种基于Siamese LLM双编码器的两阶段训练方法,通过对比学习和偏好优化实现精确匹配与排序。

详情
AI中文摘要

电子商务中的语义检索必须处理短、嘈杂和口语化的查询,并在具有细粒度属性区分的大型产品目录上进行。我们提出了一种Siamese LLM双编码器,通过两阶段流水线进行训练:首先使用带有假阴性边缘掩码的对比学习,以防止对近似重复产品的惩罚;然后进行相对赔率对齐检索(ROAR),这是一种偏好优化目标,通过连续赔率比边缘将Bradley-Terry扩展到可变大小的分级相关组。训练语料库反映了这一进展——第一阶段中替代查询-产品对提供粗略的语义监督,第二阶段中分级相关性注释驱动细粒度排序。由此产生的系统能够准确检索精确匹配,同时正确排序替代品和互补产品,在查询频率层和业务垂直领域均得到验证,并通过大规模在线A/B部署验证了统计显著性。

英文摘要

Semantic retrieval in e-commerce must handle short, noisy, and colloquial queries over large product catalogs with fine-grained attribute distinctions. We present a Siamese LLM dual-encoder trained through a two-stage pipeline: contrastive learning with a false-negative margin mask to prevent penalization of near-duplicate products, followed by Relative Odds Alignment for Retrieval (ROAR), a preference optimization objective that extends Bradley-Terry to variable-sized graded relevance groups via consecutive odds-ratio margins. The training corpus mirrors this progression - substitute query-product pairs provide coarse semantic supervision in Stage 1 and graded relevance annotations drive fine-grained ranking in Stage 2. The resulting system accurately retrieves exact matches while correctly ordering substitutes and complementary products, with gains confirmed across query-frequency strata and business verticals, and statistical significance validated through live A/B deployment at scale.

2606.01485 2026-06-02 cs.CV cs.LG 版本更新

Perception First: A Frontier Native-Video Model with Self-Consistency for Implicit Video Question Answering

感知优先:具有自一致性的前沿原生视频模型用于隐式视频问答

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 本文通过系统实验发现隐式视频问答基准是感知受限而非推理受限,并指出提升基础模型感知能力和轻量级测试时去噪是唯一可靠手段。

详情
AI中文摘要

我们描述了提交至CVPR 2026 VRR挑战赛的方案,该方案基于ImplicitQA / VRR-QA基准:一种多项选择视频问答任务,其中答案有意地不在任何单帧中可观察,必须从创意视频的不连续帧中的空间布局、运动、深度、视角、因果关系和社会背景推断。我们对开源视频大语言模型(Qwen2.5-VL、Qwen3-VL、InternVL3、Gemma-3以及经过强化学习训练的视频推理器Video-R1和VideoChat-R1.5)和一系列推理时策略(思维链、问题分解、描述-推理级联、音频转录、空间状态提示、自一致性、多模型集成和类别路由)进行了系统的、无需训练的研究。我们的核心发现是,该基准是感知受限而非推理受限:推理侧的增强是中性的甚至有害的,而基础模型的感知能力和轻量级测试时去噪是唯一可靠的杠杆。按类别的错误分析将困难定位到低级感知——相对深度、视角和计数是最困难的类别,而因果和社会推理几乎已解决——一个明确注入单目深度线索以攻击最弱类别的提示将测试准确率降低了5.8个百分点,证实了模型需要更好的感知,而非更好的过程。

英文摘要

We describe our submission to the VRR Challenge @ CVPR 2026, built on the \emph{ImplicitQA} / \emph{VRR-QA} benchmark~\cite{implicitqa}: multiple-choice video question answering in which answers are deliberately \emph{not} observable in any single frame and must be inferred from spatial layout, motion, depth, viewpoint, causality, and social context across discontinuous frames of creative video. We conduct a systematic, training-free study spanning open-source Video-LMMs (Qwen2.5-VL~\cite{qwen25vl}, Qwen3-VL~\cite{qwen3vl}, InternVL3, Gemma-3, and the RL-tuned video reasoners Video-R1~\cite{videor1} and VideoChat-R1.5~\cite{videochatr15}) and a battery of inference-time strategies (chain-of-thought, question decomposition, describe-then-reason cascades, audio transcripts, spatial state prompting, self-consistency~\cite{selfconsistency}, multi-model ensembling, and category routing). Our central finding is that this benchmark is \emph{perception-bound rather than reasoning-bound}: reasoning-side augmentations are neutral-to-harmful, whereas base-model perceptual capability and lightweight test-time denoising are the only reliable levers. A per-category error analysis localizes the difficulty to low-level perception -- relative depth, viewpoint, and counting are the hardest categories, while causal and social reasoning are nearly solved -- and a prompt that explicitly injects monocular depth cues to attack the weakest category \emph{lowers} test accuracy by $5.8$ points, confirming that the model needs a better \emph{percept}, not a better \emph{procedure}.

2606.01483 2026-06-02 cs.LG cs.AI eess.AS 版本更新

MURMUR: An Efficient Inference System for Long-Form ASR

MURMUR:一种高效的长时间语音识别推理系统

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

发表机构 * University of Washington(华盛顿大学)

AI总结 提出MURMUR推理系统,通过块间和块内两级优化,在保持高精度的同时显著降低长时间语音识别的延迟。

详情
AI中文摘要

长时间自动语音识别(ASR)需要高精度和低延迟,但现有系统迫使两者之间进行权衡。基于块的流水线在并行窗口中处理音频以实现低延迟,但丢失了跨块上下文,并且需要脆弱的启发式方法来对齐边界处的说话人和时间戳。长上下文ASR模型通过单次传递解决所有问题以获得更好的准确性,但速度慢一个数量级。我们提出MURMUR,一个通过两级操作克服这种权衡的推理系统。在块间级别,我们重新审视基于块的流水线以适应现代长上下文ASR,将块大小视为可调超参数,并表明中间块大小在准确性和延迟之间取得了良好的平衡。在块内级别,我们通过应用于输出和语音令牌的滑动窗口KV缓存驱逐策略来利用注意力稀疏性。在AMI-IHM上,MURMUR匹配单次传递准确性,同时将延迟降低4.2倍,通过令牌驱逐进一步获得收益,相对tcpWER退化小于1%。MURMUR的代码可在https://github.com/uw-syfi/Murmur获取。

英文摘要

Long-form automatic speech recognition (ASR) requires both high accuracy and low latency, but existing systems force a trade-off between the two. Chunk-based pipelines process audio in parallel windows for low latency, but lose cross-chunk context and need brittle heuristics to align speakers and timestamps at boundaries. Long-context ASR models resolve everything in a single pass for better accuracy, but are an order of magnitude slower. We propose Murmur, an inference system that overcomes this trade-off by operating at two levels. At the inter-chunk level, we revisit the chunk-based pipeline for modern long-context ASR, treating chunk size as a tunable hyperparameter, and show that intermediate chunk sizes strike a good balance of accuracy and latency. At the intra-chunk level, we exploit attention sparsity through a sliding window KV cache eviction policy applied to both output and speech tokens. On AMI-IHM, Murmur matches single-pass accuracy while reducing latency by 4.2x, with further gains from token eviction at less than 1% relative tcpWER degradation. The code of Murmur is available at https://github.com/uw-syfi/Murmur.

2606.01470 2026-06-02 physics.flu-dyn cs.AI cs.LG 版本更新

Emergent Transfer of a Physics Foundation Model from Simulation to Laboratory Turbulence

物理基础模型从模拟到实验室湍流的涌现迁移

Payel Mukhopadhyay, Stefan S. Nixon, Romain Watteaux, Michael McCabe, Alberto Bietti, Kyunghyun Cho, Cristiana Diaconu, Irina Espejo Morales, David Fouhey, Siavash Golkar, Tom Hehir, Shirley Ho, Jake Kovalic, Geraud Krawezik, Francois Lanusse, Tanya Marwah, Rudy Morel, Mariel Pettee, Helen Qu, Jeff Shen, Hadi Sotoudeh, Stuart B. Dalziel, Miles Cranmer

发表机构 * University of Cambridge(剑桥大学) CEA, DAM/DIF(法国CEA DAM/DIF) Flatiron Institute(Flatiron研究所) New York University(纽约大学) Princeton University(普林斯顿大学) Yale University(耶鲁大学) AIM, Université Paris-Saclay, Université Paris Cité, CEA, CNRS(AIM,巴黎-萨克雷大学,巴黎城市大学,CEA,CNRS) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Polymathic AI(聚合人工智能)

AI总结 通过微调连续介质动力学基础模型Walrus,在仅使用少量模拟数据的情况下,零样本泛化到实验室瑞利-泰勒不稳定性实验,揭示了初始条件在模拟-实验差距中的关键作用。

详情
AI中文摘要

物理基础模型能否有效应用于实验室实验,仍然是科学机器学习(ML)的一个未解决问题。我们在瑞利-泰勒不稳定性(RTI)上测试了这个问题,这是一种普遍且要求苛刻的流体不稳定性,从桌面流动到超新星爆炸中都能看到,其中密度界面上的小扰动在较轻流体加速进入较重流体时演变成混沌、多尺度的混合。标准ML模型难以处理RTI,尽管经过一个多世纪的理论、数值和实验工作,模拟与实验之间仍存在一个未解决的分歧:大多数实验室实验中测量的后期混合增长率$\alpha$(约0.06-0.07)大约是理想直接数值模拟(DNS,约0.02)的三倍。这一差距的起源仍有争议。这些特性使RTI成为一个严格的测试,其意义远超RTI本身:仅基于模拟训练的基础模型能否泛化到稀疏、杂乱且嘈杂的实验室环境?我们对连续介质动力学基础模型Walrus进行了微调,使用三个或更少的DNS实现,并在长时间滚动中恢复了关键的RTI物理特性。将微调模型零样本应用于滑动屏障实验室数据,它离开了类似DNS的区域,进入了观察到的增长带,而从未见过任何实验样本。这些结果提供了独立的数据驱动证据,表明初始条件在长期存在的模拟-实验$\alpha$差距中起着关键作用。该模型还零样本泛化到稳定分层(一种训练中未出现的浮力状态),正确减缓了混合层增长。总之,我们的结果表明,基础模型可以很好地泛化到训练数据之外,预测实验室行为和未见过的物理状态,为探索长期存在的模拟-实验差距开辟了新途径。

英文摘要

Whether physics foundation models can be usefully deployed on laboratory experiments remains an open question for scientific machine learning (ML). We test this question on the Rayleigh-Taylor instability (RTI), a ubiquitous and demanding fluid instability seen from tabletop flows to supernova explosions, in which small perturbations at a density interface grow into chaotic, multiscale mixing as a lighter fluid accelerates into a heavier one. Standard ML models struggle with RTI, and despite over a century of theoretical, numerical, and experimental work, it carries an unresolved discrepancy between simulation and experiment: the late-time mixing growth rate, $α$, measured in most laboratory experiments ($\sim$ 0.06-0.07), is roughly three times the value from idealized direct numerical simulations (DNS, $\sim$ 0.02). The gap's origin remains debated. These properties make RTI a stringent test for a question that matters well beyond RTI: can foundation models trained only on simulations generalise to sparse, messy, and noisy laboratory settings? We finetune Walrus, a foundation model for continuum dynamics, on three or fewer DNS realizations and recover key RTI physics over long rollouts. Applied zero-shot to sliding-barrier laboratory data, the finetuned model leaves the DNS-like regime and enters the observed growth band, having never seen a single experimental sample. These results provide independent, data-driven evidence that initial conditions play a crucial role in the longstanding sim-experiment gap in $α$. The model also generalises zero-shot to stable stratification, a buoyancy regime absent from training, correctly slowing mixing-layer growth. Together, our results show that foundation models can generalise well beyond their training data, predicting laboratory behavior and unseen physical regimes, opening new ways to probe longstanding simulation-experiment gaps.

2606.01468 2026-06-02 stat.ML cs.AI cs.LG 版本更新

Computation-Aware Kalman Filtering with Model Selection for Neural Dynamics

基于模型选择的计算感知卡尔曼滤波用于神经动力学

JR Huml, Jonathan Wenger, John P. Cunningham

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Texas at Austin(得克萨斯大学奥斯汀分校)

AI总结 提出计算感知状态空间模型(CASSM),通过新训练损失和优化方案实现模型选择,在试验数远少于神经元数的规模不平衡场景中,以可处理的计算复杂度提供竞争性预测和更优的不确定性校准。

Comments 24 pages, Proceedings of 2nd International Conference on Probabilistic Numerics (2026)

详情
AI中文摘要

由于其明确的先验和建模不确定性的能力,贝叶斯方法在单细胞神经记录的动力潜变量建模中发挥了重要作用。然而,现代规模的数据集使得过参数化的深度网络因其预测能力和有利的计算扩展性成为首选方法。尽管存在许多后验近似方法,但所有方法都会引入近似误差。最近的工作以计算不确定性的形式考虑了这种误差,但代价是二次复杂度,并假设固定的模型超参数。在这里,我们将这一发展扩展到模型选择,包括一种新颖的训练损失和优化方案,从而在大状态空间中实现可处理的推理。我们引入了一个框架,即计算感知状态空间模型(CASSM),专门针对规模不平衡的场景设计,其中试验次数显著少于记录的神经元数量。在这种场景下,对于合成数据和真实数据,我们展示了我们的方法与数据饥饿的深度网络具有竞争力,并且与之前扩展贝叶斯方法的尝试相比,不确定性校准显著改善。我们的实验为神经科学研究人员根据关键数据集属性和约束从一系列潜在动力潜变量模型中进行选择提供了路线图。

英文摘要

Due to their explicit priors and ability to model uncertainty, Bayesian methods have played a major role in dynamical latent variable modeling of single-cell neural recordings. However, modern-sized datasets have made overparameterized deep networks the preferred methods of choice due to their predictive power and favorable computational scaling. While many posterior approximations exist, all incur approximation errors. Recent work accounts for this error in the form of computational uncertainty but comes at the cost of quadratic complexity and assumes fixed model hyperparameters. Here we extend this development to model selection, including a novel training loss and optimization scheme, which yields tractable inference in large state-spaces. We introduce a framework, the Computation-Aware State-Space Model (CASSM), specifically designed for the scale-imbalanced regime, where the number of trials is significantly lower than the number of recorded neurons. In this regime, for both synthetic and real data, we show that our method is competitive with data-hungry deep networks, with significantly improved uncertainty calibration over previous attempts to scale Bayesian methods. Our experiments provide a roadmap to neuroscience researchers in choosing from a host of potential dynamical latent variable models given key dataset properties and constraints.

2606.01462 2026-06-02 cs.AI cs.CL cs.LG 版本更新

An Enigma of Artificial Reason: Investigating the Production-Evaluation Gap in Large Reasoning Models

人工推理之谜:探究大型推理模型中的生成-评估差距

Mingzhong Sun, Teresa Yeo, Armando Solar-Lezama, Tan Zhi-Xuan

发表机构 * NUS Department of Computer Science(国立新加坡大学计算机科学系) MIT EECS(麻省理工学院电子工程与计算机科学系) A*STAR(新加坡科技研究局) Singapore-MIT Alliance for Research and Technology (SMART)(新加坡-麻省理工联合研究技术机构(SMART))

AI总结 本文通过VAIR数据集发现大型推理模型在评估推理时存在显著缺陷,表现为答案确认偏差,即模型倾向于验证答案正确性而非仔细检查推理步骤。

Comments 10 pages, 8 figures, 2 tables (Appendix: 19 pages, 13 figures, 3 tables)

详情
AI中文摘要

对人类推理的研究表明,人们通常更擅长评估推理而非从头生成推理。相比之下,大型推理模型(LRMs)经过训练,擅长生成长链推理以解决复杂问题。那么,LRMs在评估推理方面表现如何?我们通过有效答案-无效推理(VAIR)数据集进行研究:该数据集包含数学问题和解决方案,这些解决方案存在琐碎的推理缺陷但答案有效,旨在将推理评估与推理生成混淆因素分离。与人类(我们发现人类在评分此类问题时仅比解决它们差6%)不同,我们发现LRMs存在显著的生成-评估差距:前沿模型在评估VAIR解决方案时得分低至48%,尽管在解决方案生成方面近乎完美。为何存在这一谜团?通过思维链(CoT)分析,我们发现了答案确认偏差的证据:LRMs通常先产生答案,然后检查正确答案,而不是仔细验证每一步,即使在注意到异常推理时也会编造合理化解释。线性探针进一步证实了这一点,表明虽然LRM激活编码了有效推理的某些表示,但它们未能稳健地将VAIR解决方案表示为无效。对最终答案表示的因果修补导致LRM判断和激活翻转,表明答案有效性是模型确认偏差的原因。这些发现揭示了主导推理训练方法的显著局限性,该方法激励LRMs生成并确认朝向正确答案的推理,但未能稳健地评估底层推理。

英文摘要

Studies of human reasoning have shown that people are typically stronger at evaluating reasoning than producing it from scratch. In contrast, large reasoning models (LRMs) are trained to excel at producing long chains of reasoning to solve complex problems. How then do LRMs perform at evaluating reasons? We investigate this with the Valid-Answer-Invalid-Reasoning (VAIR) dataset: math problems and solutions with trivial reasoning flaws but valid answers, designed to isolate reasoning evaluation from the confound of reasoning production. Unlike humans, who we find are only 6% worse at grading than solving such problems, we find a substantial production-evaluation gap in LRMs: frontier models score as low as 48% when evaluating VAIR solutions, despite near-perfect solution production. Why this enigma? Through chain-of-thought (CoT) analysis, we find evidence of an answer confirmation bias: LRMs often produce then check for the correct answer instead of carefully verifying each step, fabricating rationalizations even when noticing anomalous reasoning. Linear probes corroborate this, showing that while LRM activations encode some representation of valid reasoning, they fail to robustly represent VAIR solutions as invalid. Causal patching of the final answer's representations causes LRM verdicts and activations to flip, demonstrating that answer validity is responsible for models' confirmation biases. These findings indicate an outstanding limitation in dominant approaches to reasoning training, which incentivize LRMs to produce and confirm reasoning towards correct answers, but not to robustly evaluate the underlying reasons.

2606.01461 2026-06-02 cs.LG cs.MA 版本更新

Genotype-Conditioned Molecular Generation via Evidence-Grounded Multi-Objective Latent Perturbation in Diffusion Models

基于证据的多目标潜在扰动在扩散模型中的基因型条件分子生成

Brenda Nogueira, Gisela A. Gonzalez-Montiel, Nitesh V. Chawla, Nuno Moniz

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) University of Notre Dame(诺克斯大学) Department of Chemistry and Biochemistry(化学与生物化学系) Lucy Family Institute for Data & Society(数据与社会学院)

AI总结 提出一种在预训练的基因型到药物扩散模型的潜在空间中,通过梯度上升优化可学习扰动以最大化药物敏感性、类药性和合成可及性的复合奖励,并利用实验数据和LLM管道确保生物合理性和机制一致性。

详情
AI中文摘要

由于肿瘤异质性和跨癌症亚型缺乏明确的分子靶点,开发有效的抗癌疗法仍然具有挑战性。以癌症基因型为条件的生成模型为个性化药物发现提供了一条有前景的途径,但现有方法缺乏对同时优化敏感性、可合成性和机制结合合理性的明确优化。我们提出了一种针对预训练的基因型到药物扩散模型的潜在空间优化方法,引入一个在分子潜在空间上的可学习扰动,通过梯度上升优化以最大化结合预测药物敏感性(AUC)、类药性(QED)和合成可及性(SAS)的复合奖励。关键的是,通过将奖励设计和评估基于实验衍生的癌细胞系数据和经过验证的药理学信号,将候选生成锚定在真实世界的临床证据中,从而强制执行生物学真实性。机制一致性合理性进一步通过基于扩散模型注意力机制的多智能体LLM管道进行评估。在来自三个保留评估集的15个癌细胞系上的实验表明,在敏感性、类药性、可合成性和化学有效性方面,与竞争基线相比,该方法具有一致且显著的改进。

英文摘要

Developing effective anticancer therapeutics remains challenging due to tumor heterogeneity and the absence of well-defined molecular targets across cancer subtypes. Generative models conditioned on cancer genotypes offer a promising avenue for personalized drug discovery, yet existing approaches lack explicit optimization for simultaneous sensitivity, synthesizability, and mechanistic binding plausibility. We present a latent-space optimization approach for a pretrained genotype-to-drug diffusion model, introducing a learnable perturbation over the molecular latent space optimized via gradient ascent to maximize a composite reward combining predicted drug sensitivity (AUC), drug-likeness (QED), and synthetic accessibility (SAS). Critically, biological realism is enforced by grounding both reward design and evaluation in experimentally-derived cancer cell line data and validated pharmacologic signals, anchoring candidate generation in real-world clinical evidence. Mechanistic consistency plausibility is further assessed by a multi-agent LLM pipeline grounded in the diffusion model's attention mechanism. Experiments across 15 cancer cell lines from three held-out evaluation sets demonstrate consistent and noticeable improvements over competing baselines in sensitivity, drug-likeness, synthesizability, and chemical validity.

2606.01457 2026-06-02 cs.AI cs.LG stat.ML 版本更新

Transferring Information Across Interventions in Causal Bayesian Optimization

跨干预因果贝叶斯优化的信息传递

Mohammad Ali Javidian

发表机构 * Computer Science Department(计算机科学系)

AI总结 提出图耦合因果贝叶斯优化方法,通过共享因果参数的不确定性连接不同干预效应,实现跨干预信息传递,在可识别线性高斯因果模型中证明低秩核性质和次线性遗憾界。

详情
AI中文摘要

贝叶斯优化是一种优化昂贵系统的流行方法,其中每次实验、模拟或干预都会耗费时间或金钱。在其标准形式中,它将我们控制的变量视为黑盒的普通输入,无法区分单纯的相关性与真正的因果关系。因果贝叶斯优化通过使用已知因果图结合观测数据来决定哪些变量值得干预,从而部分弥补了这一差距。然而,现有方法几乎孤立地学习每种可能干预的效果,尽管在因果系统中这些效果通常共享相同的底层机制。我们提出图耦合因果贝叶斯优化,通过我们对一小部分共享因果参数的不确定性,将不同的干预效果联系在一起。结果是一个因果核,使得从一次干预收集的证据能够改进我们对相关干预的估计。对于可识别的线性高斯因果模型,我们证明该核具有低秩,其秩由共享参数的数量而非干预菜单的大小界定。这进而产生一个信息增益界,该界仅随优化范围对数增长,以及一个遗憾界,清晰地将三种误差来源分开:优化、因果估计以及考虑哪些干预集的选择。我们还描述了非线性和自适应扩展。在与理论一致的高斯系统、共享机制压力测试以及标准因果优化基准测试中,该方法保持了因果贝叶斯优化的优势,同时实现了跨相关干预的信息传递,当对目标父节点的直接干预不可用且稀疏的干预数据必须在一大组候选干预中重复使用时,增益最为明显。

英文摘要

Bayesian optimization is a popular way to optimize expensive systems, where every experiment, simulation, or intervention costs time or money. In its standard form, it treats the variables we control as plain inputs to a black box and cannot tell apart mere correlation from a real cause and effect. Causal Bayesian optimization closes part of this gap by using a known causal graph together with observational data to decide which variables are worth intervening on. Existing methods, however, learn the effect of each possible intervention almost in isolation, even though in a causal system these effects usually share the same underlying mechanisms. We propose graph-coupled causal Bayesian optimization, which ties the different intervention effects together through the uncertainty we have about a small set of shared causal parameters. The result is a causal kernel that lets evidence collected from one intervention improve our estimate of related interventions. For identifiable linear Gaussian causal models, we show that this kernel has low rank, bounded by the number of shared parameters rather than by the size of the intervention menu. This in turn yields an information-gain bound that grows only logarithmically in the optimization horizon, and a regret bound that cleanly separates three sources of error: optimization, causal estimation, and the choice of which intervention sets to consider. We also describe nonlinear and adaptive extensions. Across theory-aligned Gaussian systems, shared-mechanism stress tests, and standard causal optimization benchmarks, the method keeps the benefits of causal Bayesian optimization while transferring information across related interventions, with the clearest gains when direct interventions on the target's parents are unavailable and sparse interventional data must be reused across a large family of candidate interventions.

2606.01456 2026-06-02 cs.LG cs.CL cs.GT 版本更新

Truthful AI Advisors: A Pre-Specified Benchmark for Large Language Model Honesty Under Preference Misalignment

诚实的人工智能顾问:偏好错位下大语言模型诚实性的预设基准

Hamidreza Hasani Balyani, Seyed Pouyan Mousavi Davoudi, Alireza Amiri-Margavi, Amin Gholami Davodi, Arshia Gharagozlou

发表机构 * Amazon Lab126, HW Tech Org.(亚马逊实验室126,硬件技术组织) Computational Modeling and Simulation University of Pittsburgh(计算建模与仿真大学匹兹堡分校) Mathematics & Statistics Department University of Minnesota Duluth(数学与统计学系明尼苏达大学 Duluth 分校)

AI总结 通过Crawford-Sobel廉价谈话模型构建基准,评估大语言模型在偏好冲突时是否诚实,发现模型过度揭示信息,偏离策略最优。

Comments 19 pages. Code and data: https://github.com/iHamidHasani/cheap-talk-llm-benchmark

详情
AI中文摘要

大语言模型越来越多地被部署为顾问,其目标与用户不一致:推荐系统优化参与度,销售助手优化购买,谈判代理优化让步。当诚实与自身收益冲突时,这些顾问是否保持诚实是一个核心的对齐评估问题。我们将经典的Crawford-Sobel廉价谈话模型转化为偏好错位下LLM诚实性的预设基准。廉价谈话理论预测既非完全揭示也非沉默,而是粗糙的单调划分,随着偏好冲突增加,信息区间减少。发送者观察到状态omega在[0,1]中,希望接收者的行动接近omega+b,并向理想行动为omega的接收者发送一条无成本消息。设计使用5个偏差水平、3个提示框架、固定的低温度设置和每个单元200个状态:共12,000次发送者调用。对于正偏差网格b∈{0.01,0.04,0.08,0.12},最信息丰富的划分大小分别为7、4、3、2,预言机归一化互信息分别为0.5294、0.3268、0.2205、0.1829。在四个指令调优模型(GPT-4o、Claude Sonnet 4.5、Gemini 2.5 Flash-Lite、Llama-3.3-70B)上运行完整设计,我们发现所有四个模型相对于最信息丰富的均衡过度揭示1.8至4.2倍:归一化互信息保持在0.78-0.94,而预言机规定为0.18-0.53。信息量随偏差下降如预测,但从未接近策略最优;模型显示出近乎完全的揭示,并带有跟踪其偏差的恒定正向偏移(线性夸大)。收益最大化与诚实框架的影响可忽略。解码器消融表明,仅当接收者读取发送者陈述的数字时,该发现才可恢复:仅嵌入解码器将相同数据误读为近乎胡言乱语。

英文摘要

Large language models are increasingly deployed as advisors whose objective is not aligned with the user's: recommenders optimize for engagement, sales assistants for purchases, negotiation agents for concessions. Whether such advisors stay truthful when honesty conflicts with their own payoff is a core alignment-evaluation question. We turn the canonical Crawford-Sobel cheap-talk model into a pre-specified benchmark for LLM honesty under preference misalignment. Cheap-talk theory predicts neither full revelation nor silence but coarse monotone partitions, with fewer informative intervals as preference conflict grows. A sender observes a state omega in [0,1], wants the receiver's action near omega+b, and sends one costless message to a receiver whose ideal action is omega. The design uses 5 bias levels, 3 prompt frames, a fixed low-temperature setting, and 200 states per cell: 12,000 sender calls. For the positive-bias grid b in {0.01,0.04,0.08,0.12} the exact most-informative partition sizes are 7,4,3,2, with oracle normalized mutual information 0.5294, 0.3268, 0.2205, 0.1829. Running the full design on four instruction-tuned models (GPT-4o, Claude Sonnet 4.5, Gemini 2.5 Flash-Lite, Llama-3.3-70B), we find all four over-reveal relative to the most-informative equilibrium by 1.8 to 4.2x: normalized mutual information stays at 0.78-0.94 where the oracle prescribes 0.18-0.53. Informativeness declines with bias as predicted but never approaches the strategic optimum; rather than coarse partitions, models show near-full revelation with a constant upward offset tracking their bias (linear exaggeration). Payoff-maximizing versus honesty framing has negligible effect. A decoder ablation shows the finding is recoverable only when the receiver reads the sender's stated number: an embedding-only decoder mis-reads the same data as near-babbling.

2606.01444 2026-06-02 cs.AI cond-mat.mtrl-sci cs.CL cs.LG math.CT 版本更新

Self-Revising Discovery Systems for Science: A Categorical Framework for Agentic Artificial Intelligence

科学中的自我修正发现系统:面向主体人工智能的范畴论框架

Fiona Y. Wang, Markus J. Buehler

发表机构 * Laboratory for Atomistic and Molecular Mechanics(原子分子力学实验室) Department of Biological Engineering(生物工程系) Massachusetts Institute of Technology(麻省理工学院) Department of Civil and Environmental Engineering(土木与环境工程系) Department of Mechanical Engineering(机械工程系) Center for Computational Science and Engineering(计算科学与工程中心) Schwarzman College of Computing(施瓦茨曼计算学院)

AI总结 本文提出一个基于范畴论的框架,通过左Kan扩展实现科学发现中的表征体制转换,并应用于材料科学中的蛋白质力学和纤维网络建模。

详情
AI中文摘要

科学发现不仅是生成答案,更是对证据、人工制品、操作和验证者进行类型化的表征体制的修正。我们为材料科学中的主体发现开发了一个范畴论描述。在固定体制b中,模式类别为S_b,系统状态是一个余预层I_t: S_b -> Set,来源是元素范畴∫_{S_b} I_t。固定体制操作是对此类状态的更新,仅当指定并保留了保持来源的细化时才是自函子。发现则是经过验证的体制转换u: S_b -> S_b':旧人工制品通过左Kan扩展Lan_u I_t保存并传输,并与转换后状态进行比较,以识别超出函子传输的剩余内容。这在不依赖主观新颖性的情况下区分了检索、搜索和发现。我们在两个系统中实例化了该框架。在Builder/Breaker中,蛋白质力学世界模型在最小描述长度门控下进行修正;接受的定律将链内柔性表示为受慢集体模式调节的全模态弹性柔度,即模式调节柔度。在CategoryScienceClaw中,类型化技能、人工制品、开放需求、工作流变异、门控、压力测试和公共话语构成了一个携带证明的知识计算图。一个纤维网络示例记录了候选模型、被拒绝的替代方案、AIC门控、扰动测试以及一个基于各向同性纤维计数描述符的接受取向张量各向异性刚度代理模型。这些案例共同展示了范畴论如何既作为科学发现的数学语言,又作为自我修正AI发现系统的工程规范。

英文摘要

Scientific discovery is not only answer generation but revision of the representational regime in which evidence, artifacts, operations, and verifiers are typed. We develop a category-theoretic account of agentic discovery for materials science. In a fixed regime b with schema category S_b, the system state is a copresheaf I_t: S_b -> Set, and provenance is the category of elements \int_{S_b} I_t. Fixed-regime operation is an update on such states, endofunctorial only when provenance-preserving refinements are specified and preserved. Discovery is instead a verified regime transition u: S_b -> S_b': old artifacts are preserved, transported by the left Kan extension Lan_u I_t, and compared with the post-transition state to identify residual content beyond functorial transport. This separates retrieval, search, and discovery without subjective novelty. We instantiate the framework in two systems. In Builder/Breaker, a protein-mechanics world model is revised under a Minimum Description Length gate; the accepted law expresses within-chain flexibility as all-mode elastic compliance conditioned by slow collective-mode participation, or mode-conditioned compliance. In CategoryScienceClaw, typed skills, artifacts, open needs, workflow mutation, gates, stress tests, and public discourse become a proof-carrying knowledge-computation graph. A fiber-network example records candidate models, rejected alternatives, an AIC gate, perturbation tests, and an accepted orientation-tensor anisotropic stiffness surrogate over an isotropic fiber-count descriptor. Together, the cases show how category theory can be both a mathematical language for discovery and an engineering specification for self-revising AI discovery systems.

2606.01443 2026-06-02 cs.LG cs.AI cs.CV 版本更新

UR-JEPA: Uniform Rectifiability as a Regularizer for Joint-Embedding Predictive Architectures

UR-JEPA:均匀可整流性作为联合嵌入预测架构的正则化器

Triet M. Le

发表机构 * Spatiolyx LLC(Spatiolyx公司)

AI总结 提出UR-JEPA,通过高斯核平滑的Carleson型平方函数实现均匀n-可整流测度正则化,防止表示坍塌,在多个数据集上达到与LeJEPA相当的峰值精度但具有更低的种子方差。

详情
AI中文摘要

训练联合嵌入预测架构(JEPA)的一个核心困难是防止表示坍塌。LeJEPA通过素描各向同性高斯正则化(SIGReg)对嵌入施加各向同性高斯目标来解决这一问题。该目标与流形假设相矛盾,流形假设期望嵌入集中在环境空间的低维子集上。我们提出\emph{UR-JEPA},其目标是在小尺度上具有局部切向维度$n$的均匀$n$-可整流测度,通过高斯核平滑的Carleson型平方函数$\mathcal{L}^{ ext{CGLT}}$实现,并辅以Jones $β$数公式。在Inet10上,UR-JEPA($\mathcal{L}^{ ext{CGLT}}$)达到$0.9141 \pm 0.0014$,相比LeJEPA($\mathcal{L}^{ ext{SIGReg}}$)提高了$+0.83$个百分点,种子标准差降低约$30\%$;在匹配配方的Galaxy10~SDSS、单种子ImageNet-$100$运行和3种子EuroSAT遥感运行中,两种方法在收敛时处于相同的峰值精度区间,UR-JEPA保持其较低的种子方差特征。在EuroSAT上,域内对在$96.0$到$96.1\%$之间具有竞争力,且使用大型遥感基础模型迁移时骨干网络缩小$25$倍。区别在于几何结构:对投影仪输出分布的直接可视化显示,在所有四个数据集上,UR-JEPA($\mathcal{L}^{ ext{CGLT}}$)产生的全局PCA谱在索引$\sim 20$到$25$(共$D=32$)处出现$4$到$5$个数量级的下降,而LeJEPA的谱接近平坦(顶部到底部比率最多为$3.6$)。两种方法的每维度边缘分布同时接近高斯分布(平均Shapiro-Wilk $W \in [0.992, 0.996]$),这是Diaconis-Freedman结果的一个推论。因此,在匹配精度下,两种正则化器产生结构上不同的投影表示。

英文摘要

A central difficulty in training Joint-Embedding Predictive Architectures (JEPAs) is preventing representation collapse. LeJEPA addresses this by enforcing an isotropic Gaussian target on the embeddings via Sketched Isotropic Gaussian Regularization (SIGReg). This target is in tension with the manifold hypothesis, which expects embeddings to concentrate on a low-dimensional subset of the ambient space. We propose \emph{UR-JEPA}, which targets a uniformly $n$-rectifiable measure of local tangent dimension $n$ at small scales, realized through a Gaussian-kernel smoothed Carleson-type square function $\mathcal{L}^{\text{CGLT}}$, with a complementary Jones $β$-number formulation. On Inet10, UR-JEPA($\mathcal{L}^{\text{CGLT}}$) attains $0.9141 \pm 0.0014$ for a $+0.83$\,pp gain over LeJEPA($\mathcal{L}^{\text{SIGReg}}$) with $\sim 30\%$ lower seed standard deviation; on matched-recipe Galaxy10~SDSS, a single-seed ImageNet-$100$ run, and a $3$-seed EuroSAT remote-sensing run, the two methods lie in the same peak-accuracy band at convergence, with UR-JEPA retaining its lower-seed-variance signature. On EuroSAT the in-domain pair is competitive at $96.0$ to $96.1\%$ with large remote-sensing foundation-model transfer at a $25\times$ smaller backbone. The distinction is geometric: direct visualization of the projector output distribution shows that on all four datasets UR--JEPA($\mathcal{L}^{\text{CGLT}}$) produces a global PCA spectrum with a $4$ to $5$ order-of-magnitude drop at index $\sim 20$ to $25$ out of $D = 32$, while LeJEPA's spectrum is near-flat (top-to-bottom ratio at most $3.6$). Per-dimension marginals are simultaneously near-Gaussian for both methods (mean Shapiro-Wilk $W \in [0.992, 0.996]$) as a Diaconis-Freedman consequence. At matched accuracy the two regularizers therefore yield structurally distinct projected representations.

2606.01437 2026-06-02 cs.LG cs.AI 版本更新

CEAR: Certified Ensemble Adversarial Robustness in DNNs

CEAR: 深度神经网络中的集成对抗鲁棒性认证

Daniel Sadig, Mohammadreza Maleki, Hamed Karimi, Reza Samavi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CEAR方法,通过混合经验与认证防御机制,利用高斯噪声和温度混淆梯度与logits,并扩展随机平滑以验证集成分类器的鲁棒性,在多个数据集上取得更优的认证准确率和鲁棒半径。

Comments This is the preprint of the work accepted for publication in the Proceedings of the 39th Canadian Conference on Artificial Intelligence (Canadian AI 2026); 19 Pages

详情
AI中文摘要

深度神经网络(DNN)极易受到对抗性扰动的影响,这促使了对安全关键应用鲁棒性的广泛研究。最先进的实证防御机制通过训练阶段提高DNN的鲁棒性,但仍难以应对自适应白盒攻击。另一方面,认证防御在指定的扰动范围内提供可证明的鲁棒性保证。这些保证无论扰动程度如何都成立,即使攻击者拥有模型的完全知识。在本文中,我们提出了CEAR,一种基于集成的鲁棒方法,它利用了实证和认证防御机制的混合。CEAR使用不同的高斯噪声和温度训练集成中的每个网络,以混淆梯度和logits,使模型对更强的基于梯度的攻击更具抵抗力。然后我们使用带噪声的logits,并提出了两种不同的投票机制来进一步提高鲁棒性。此外,我们扩展了随机平滑以验证基于集成的分类器的鲁棒性。我们在MNIST、CIFAR10和TinyImageNet数据集上的实验评估表明,与基线方法相比,平均认证准确率更高,鲁棒半径更大,可迁移性更低。

英文摘要

Deep Neural Networks (DNNs) are highly susceptible to adversarial perturbations, leading to extensive research on robustness for safety-critical applications. State-of-the-art empirical defense mechanisms improve the robustness of DNNs through the training phase, but still struggle against adaptive white-box attacks. On the other hand, certified defenses offer provable guarantees of robustness within a specified perturbation bound. These guarantees hold regardless of the level of perturbations, even if the attacker is given full knowledge of the model. In this paper, we propose CEAR, an ensemble-based robust method that utilizes a hybrid of empirical and certified defense mechanisms. CEAR trains each network within the ensemble using varying Gaussian noise and temperatures to obfuscate gradients and logits, making the model more resistant to stronger gradient-based attacks. We then use noisy logits and propose two different voting mechanisms to further improve robustness. Furthermore, we extend randomized smoothing to verify the robustness of ensemble-based classifiers. Our experimental evaluations on MNIST, CIFAR10, and TinyImageNet datasets demonstrate superior certified accuracy on average, increased robustness radius, and decreased transferability compared to baseline methods.

2606.01432 2026-06-02 cs.LG eess.IV eess.SP stat.ML 版本更新

Leaf Spectral Reflectance Prediction Using Multi-Head Attention Neural Networks

使用多头注意力神经网络预测叶片光谱反射率

Parastoo Farajpoor, Alireza Pourreza, Mohammadreza Narimani, Ashraf El-Kereamy, Matthew W. Fidelibus

发表机构 * Digital Agriculture Laboratory, Department of Biological and Agricultural Engineering, University of California, Davis, CA, USA(加州大学戴维斯分校数字农业实验室,生物与农业工程系) Department of Botany and Plant Sciences, University of California, Riverside, CA, USA(加州大学河滨分校植物学与植物科学系) Department of Viticulture and Enology, University of California, Davis, CA, USA(加州大学戴维斯分校葡萄学与酿酒学系)

AI总结 针对特定作物(如葡萄藤),提出基于多头注意力神经网络的叶片性状-光谱预测模型,在葡萄藤数据集上实现高精度(R²=0.84,NRMSE=1.52%),优于传统辐射传输模型PROSPECT-PRO。

Comments 8 pages, 5 figures. Author-accepted version of the SPIE conference paper

详情
Journal ref
Proc. SPIE 13475, 134750V (2025)
AI中文摘要

从生理和生化性状准确建模叶片光谱反射率对于推进植物科学和精准农业中的遥感应用至关重要。广泛使用的辐射传输模型(如PROSPECT-PRO)依赖于从多种物种中开发的广义性状-反射率关系,这可能无法完全捕捉特定作物(如葡萄藤)的光谱行为。在本研究中,我们开发了一个基于多头注意力神经网络的性状到光谱预测模型,该模型在包含16个叶片性状(涵盖多个品种、生长阶段和年份)的葡萄藤特定数据集上训练。使用分层5折交叉验证评估模型,平均决定系数(R^2)为0.84,归一化均方根误差(NRMSE)为1.52%,显示出高精度和泛化能力。与正向模式下的PROSPECT-PRO相比,神经网络表现出更低的平均绝对误差(MAE),尤其是在近红外(NIR)和短波红外(SWIR)区域。这些结果强调了物种特异性建模方法的重要性,并表明将生化和结构性状整合到数据驱动架构中可以显著改善光谱预测。所提出的模型为生成准确的叶片级反射率数据提供了稳健框架,在冠层性状反演、葡萄园监测和遥感驱动的作物管理方面具有潜在应用。

英文摘要

Accurate modeling of leaf spectral reflectance from physiological and biochemical traits is essential for advancing remote sensing applications in plant science and precision agriculture. Widely used radiative transfer models, such as PROSPECT-PRO, rely on generalized trait-reflectance relationships developed from a wide range of species, which may not fully capture the spectral behavior of specific crops like grapevines. In this study, we developed a trait-to-spectra prediction model using a multi-head attention neural network trained on a grapevine-specific dataset that includes 16 leaf traits measured across multiple varieties, growth stages, and years. The model was evaluated using stratified 5-fold cross-validation and achieved an average coefficient of determination (R^2) of 0.84 and normalized root mean squared error (NRMSE) of 1.52 percent, demonstrating high accuracy and generalizability. When compared to PROSPECT-PRO in forward mode, the neural network exhibited lower mean absolute error (MAE), especially in the near-infrared (NIR) and shortwave-infrared (SWIR) regions. These results emphasize the importance of species-specific modeling approaches and show that integrating biochemical and structural traits into data-driven architectures can significantly improve spectral prediction. The proposed model provides a robust framework for generating accurate leaf-level reflectance data, with potential applications in canopy trait retrieval, vineyard monitoring, and remote sensing-driven crop management.

2606.01427 2026-06-02 stat.ML cs.LG 版本更新

On the Uncertainty Quantification Ability of Tabular Foundation Models

关于表格基础模型的不确定性量化能力

Tyler R. Johnson, Kian Ben-Jacob, Nima Negarandeh, Oriol Vendrell-Gallart, Ramin Bostanabad

发表机构 * Department of Mechanical and Aerospace Engineering, University of California, Irvine(加州大学欧文分校机械与航空航天工程系) Department of Civil and Environmental Engineering, University of California, Irvine(加州大学欧文分校土木与环境工程系)

AI总结 通过对比TabPFN与高斯过程在回归任务上的实证研究,揭示了显式先验与学习先验之间的权衡:TabPFN在复杂高维问题中表现优异,而高斯过程在数据稀缺时提供更优的预测精度和不确定性量化。

Comments 12 pages, 2 figures, 2 tables

详情
AI中文摘要

基础模型(FMs)在无需特定任务训练或微调的情况下,已在跨任务泛化方面取得了显著成功。然而,力学和计算科学中的许多关键应用不仅需要准确的预测,还需要可靠的不确定性量化(UQ)。本文通过全面的实证研究,比较了表格先验数据拟合网络(TabPFN)与高斯过程(GP)在回归任务中的UQ能力。我们系统地评估了这两种方法在一系列具有不同复杂度、数据集大小和输入维度的回归问题上的表现。我们使用默认设置构建所有GP,并与TabPFN v2.5进行公平比较。我们的发现突显了显式先验与学习先验之间的重要权衡:虽然TabPFN在数据充足的高维复杂问题上具有高度竞争性的性能,但GP在数据稀缺场景下通常提供更优的预测精度和UQ。此外,当所选核函数构成底层函数的良好先验时,GP的性能可能显著超过TabPFN。我们的结果可从https://github.com/kianswarehouse/GPvsPFN复现。

英文摘要

Foundation models (FMs) have achieved substantial success in generalizing across tasks without problemspecific training or fine-tuning. However, many critical applications in mechanics and computational science require not only accurate predictions but also reliable uncertainty quantification (UQ). Herein we investigate the UQ capabilities of tabular FMs in regression tasks through a comprehensive empirical study comparing Tabular Prior-Data Fitted Networks (TabPFN) against Gaussian processes (GPs). We systematically evaluate these two methods across a host of regression problems with varying complexity, dataset sizes, and input dimensionalities. We use a default setting to build all the GPs and for a fair comparison against TabPFN v2.5. Our findings highlight an important trade-off between explicit and learned priors: while TabPFN achieves highly competitive performance for complex, high-dimensional problems with sufficient data, GPs often provide superior predictive accuracy and UQ in data-scarce settings. Moreover, when the chosen kernel constitutes a good prior for the underlying function, GP performance can substantially exceed that of TabPFN. Our results can be reproduced from https://github.com/kianswarehouse/GPvsPFN.

2606.01425 2026-06-02 cs.LG 版本更新

Learning-based Directed Graph Abstraction of Combinatorial Spaces for Order-Preserving Search in Mixed-Combinatorial Nonlinear Optimization

基于学习的组合空间有向图抽象用于混合组合非线性优化中的保序搜索

Gishnu Madhu, Feng Liu, Souma Chowdhury

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Department of Mechanical and Aerospace Engineering(机械与航空航天工程系)

AI总结 提出一种基于图神经网络的有向图抽象方法,将组合空间映射为有向图,以改进混合组合非线性规划问题的搜索效率。

Comments Accepted for presentation at 2026, ASME IDETC

详情
AI中文摘要

混合组合非线性规划(MCNLP)问题出现在许多工程设计和规划应用中,例如由于分类、组件和几何设计选择,以及联合任务和运动规划。组合空间的传统表示方法,如整数或二进制编码,常常引入虚假关系,增加维度,并需要额外的兼容性约束。相反,本文借鉴了机器人规划和车辆/网络路由领域的最新发展,旨在使用图神经网络(GNN)学习组合空间上的搜索启发式。更具体地说,本文通过使用边场图网络(EFGN)学习从无向全连接组合图到指示改进方向的有向图的映射,提出了首个结构化的组合空间抽象。为了展示这种抽象组合空间的新方法在解决MCNLP中的效用,我们采用了一个最近的优化框架,该框架纯粹搜索非组合(例如连续)变量,并通过使用抽象模型(类似于推荐系统)为每个候选设计检索最合适的组合。与原始框架中的推荐系统相比,所提出的方向感知抽象模型提供了可能更具可扩展性和可解释性的组合检索。为了评估,所提出的方法与著名的粒子群优化和遗传算法求解器集成,在三个具有不同组合和变量数量的基准非线性问题上进行测试。与使用索引化组合的基线求解器相比,基于GNN的推荐器在多次运行中始终获得更好的平均最优值和鲁棒性。

英文摘要

Mixed-combinatorial nonlinear programming (MCNLP) problems arise in many engineering design and planning applications, e.g., due to categorical, component, and geometric design choices, as well as joint task and motion planning. Traditional representations of combinatorial spaces, such as integer or binary encoding, often introduce spurious relations, increase dimensionality, and require additional compatibility constraints. Instead, this paper draws on recent developments in robot planning and vehicle/network routing domains that aim to learn search heuristics over combinatorial spaces using graph neural networks (GNNs). More specifically, this paper presents a first-of-its-kind structured abstraction of the combinatorial space by learning a mapping from an undirected fully connected graph of combinations to a directed graph indicating improvement directions using an Edge Field Graph Network (EFGN). To demonstrate the utility of this new way of abstracting the combinatorial space in solving MCNLPs, we adopt a recent optimization framework that purely searches over the non-combinatorial (e.g., continuous) variables and retrieves the best-suited combination for each candidate design by using the abstraction model, akin to a recommender system. The presented direction-aware abstraction model provides a potentially more scalable and interpretable retrieval of combinations compared to the original recommendation system in that framework. For evaluation, the proposed method is integrated with a well-known particle swarm optimization and genetic algorithm solvers on three benchmark nonlinear problems with varying numbers of combinations and variables. Compared to baseline solvers using indexified combinations, the GNN-based recommender consistently achieves better mean optimum values and robustness across multiple runs.

2606.01421 2026-06-02 cs.LG 版本更新

Target localization, identification and sensing using latent symmetries

利用潜在对称性进行目标定位、识别与感知

David Dukov, Malte Röntgen, Bryn Davies

发表机构 * Mathematics Institute, University of Warwick(沃里克大学数学研究所) Eastern Institute for Advanced Study(东部高级研究 institute) Eastern Institute of Technology Ningbo, Zhejiang, China(宁波东部技术研究院,浙江,中国)

AI总结 本文利用设计有潜在对称性的散射体阵列作为传感器,通过分析对称性破缺程度,结合贝叶斯推断或人工神经网络实现入侵散射体的半径识别与位置定位。

Comments Submitted to SIAM Journal on Imaging Sciences

详情
AI中文摘要

我们展示了具有潜在(“隐藏”)对称性的散射体阵列可用作传感器。我们使用电容矩阵作为三维杂化的典型模型,研究“入侵”散射体的引入如何破坏潜在对称性。通过分析每个对称性被破坏的程度,我们识别出入侵者的半径并定位其位置。这可以通过基于字典的方法实现,然而在存在测量噪声的情况下,贝叶斯推断或人工神经网络(多层感知器)表现更好。据我们所知,这是首次将潜在对称性成功应用于感知问题。这也是首次在无法用稀疏图近似的三维开放系统中观察到潜在对称性。

英文摘要

We show that an array of scatterers which has been designed to have latent ("hidden") symmetries can be used as a sensor. We use the capacitance matrix as a canonical model for three-dimensional hybridisation and study how the introduction of an "intruder'' scatterer breaks the latent symmetries. By analysing the degree to which each symmetry is broken, we identify the radius of the intruder and localize its position. This can be achieved using a dictionary-based approach, however Bayesian inference or an artificial neural network (multi-layer perceptron) perform better in the presence of measurement noise. To our knowledge, this is the first time latent symmetries have been exploited successfully for sensing problems. It is also the first time latent symmetries have been observed in a three-dimensional open system that cannot be approximated by a sparse graph.

2606.01413 2026-06-02 cs.CR cs.IR cs.LG 版本更新

Differentially Private Datastore Generation for Retrieval-Augmented Inference

用于检索增强推理的差分隐私数据存储生成

Abdelrahman Abouelenein, Marwan Torki

发表机构 * Department of Computer and Systems Engineering, Alexandria University(计算机与系统工程系,亚历山大大学) Microsoft(微软)

AI总结 提出基于哈希的概率生成框架,利用局部敏感哈希和差分隐私噪声实现数据存储的隐私保护,在ε=5时准确率仅下降2.6%,并将成员推断攻击准确率降至53.60%。

Comments Accepted at the 28th International Conference on Pattern Recognition (ICPR-2026)

详情
AI中文摘要

对于依赖检索增强推理的现代设备端AI系统来说,在不损害个人隐私的情况下发布和共享数据存储至关重要。这可以通过差分隐私(DP)实现,它提供了形式化的保证,确保即使在对抗性分析下,个体贡献仍然不可区分。在本文中,我们引入了一个基于哈希的概率生成框架,旨在实现差分隐私数据存储的创建和发布。我们的方法采用局部敏感哈希(LSH)将高维数据高效地划分到桶中。然后,我们向每个桶的累积投票中添加校准的DP噪声,生成跨类别的概率分布。我们的方法广泛适用于任何需要安全键值数据存储创建和发布的流水线。我们在七个样本量和类别数(从2到14不等)的数据集上进行了实验。在ε=5时,我们发布的DP数据存储实现了强隐私保护,准确率仅平均下降2.6%。最后,我们评估了DP数据存储对成员推断攻击的抵抗力,将攻击准确率降低到53.60%。

英文摘要

It is crucial for modern on-device AI systems that rely on retrieval-augmented inference to release and share datastores without compromising individual privacy. This can be achieved using Differential Privacy (DP), which provides a formal guarantee that ensures individual contributions remain indistinguishable, even under adversarial analysis. In this paper, we introduce a hashing-based probability generation framework designed to enable the creation and release of differentially private datastores. Our approach employs locality-sensitive hashing (LSH) to efficiently partition high-dimensional data into buckets. We then add calibrated DP noise to the accumulated vote for each bucket, generating a probability distribution across classes. Our method is broadly applicable to any pipeline requiring secure key,value datastore creation and release. We conducted experiments on seven datasets with varying sample sizes and class counts, ranging from 2 to 14. At epsilon=5, our released DP datastore achieves strong privacy protection with only an average 2.6% drop in accuracy. Finally, we benchmark DP datastore resilience to membership inference attacks, reducing attack accuracy to 53.60%.

2606.01412 2026-06-02 cs.LG cs.IT math.IT 版本更新

GPTQ-intrinsic LoRA: A Near-optimal Algorithm for Low-precision Quantization with Low-rank Adaptation

GPTQ-intrinsic LoRA: 一种用于低秩自适应低精度量化的近最优算法

Shihao Zhang, Rayan Saab

发表机构 * Department of Mathematics, University of California San Diego(数学系,加州大学圣地亚哥分校) Department of Mathematics and Halıcıoğlu Data Science Institute, University of California San Diego(数学系和Halıcıoğlu数据科学研究所,加州大学圣地亚哥分校)

AI总结 本文提出GPTQ-intrinsic LoRA算法,通过将低秩校正直接融入GPTQ量化过程,并利用信息论下界证明其近最优性,在语言和视觉模型上优于现有方法。

详情
AI中文摘要

后训练量化广泛用于压缩大型神经网络,但激进的低比特量化会显著降低模型质量。一种常见的补救措施是用低秩校正增强量化权重,得到形如 $W\approx Q+LR$ 的近似。本文通过逐层重构目标 $\|XW-X(Q+LR)\|_F^2$ 研究这种低精度加低秩表示,其中 $X$ 是校准矩阵。我们首次在有限字母和有界低秩补偿约束下建立了该问题的信息论下界。然后我们提出GPTQ-intrinsic LoRA,一种无训练算法,通过适当增广校准Hessian矩阵,将低秩校正直接融入GPTQ风格的量化过程中。对于选择 $L=V_r$($V_r$ 包含 $X$ 的顶部右奇异向量),我们证明了逐层重构误差界,其中通常的GPTQ对 $\|X\|_F^2$ 的依赖被秩-$r$ 残差 $\|X-X_r\|_F^2$ 取代,直至正则化项。在自然结构假设下,这些界在主导尺度上与信息论下界匹配(至多常数和温和因子)。我们还引入了Bid-Up,一种固定网格量化细化步骤,可与最优低秩补偿交替进行,保证逐层重构误差不增。在Qwen3语言模型和DeiT视觉变换器上的实验表明,GPTQ-intrinsic LoRA优于GPTQ以及GPTQ后接低秩补偿,并且通过细化循环获得额外增益。

英文摘要

Post-training quantization is widely used for compressing large neural networks, but aggressive low-bit quantization can significantly degrade model quality. A common remedy is to augment the quantized weights with a low-rank correction, leading to approximations of the form $W\approx Q+LR$. In this paper, we study this low-precision plus low-rank representation through the layer-wise reconstruction objective $\|XW-X(Q+LR)\|_F^2$, where $X$ is a calibration matrix. We establish, to our knowledge, the first information-theoretic lower bounds for this problem under finite-alphabet and bounded low-rank compensation constraints. We then propose GPTQ-intrinsic LoRA, a training-free algorithm that incorporates the low-rank correction directly into a GPTQ-style quantization pass by appropriately augmenting the calibration Hessian. For the choice $L=V_r$, where $V_r$ contains the top right singular vectors of $X$, we prove layer-wise reconstruction error bounds in which the usual GPTQ dependence on $\|X\|_F^2$ is replaced by the rank-$r$ residual $\|X-X_r\|_F^2$, up to regularization terms. Under natural structural assumptions, these bounds match the information-theoretic lower bounds in their dominant scaling, up to constants and mild factors. We also introduce Bid-Up, a fixed-grid quantization refinement step that can be alternated with optimal low-rank compensation with guaranteed non-increasing layer-wise reconstruction error. Experiments on Qwen3 language models and DeiT vision transformers show that GPTQ-intrinsic LoRA improves over GPTQ and GPTQ followed by low-rank compensation, with additional gains from refinement loops.

2606.01402 2026-06-02 cs.LG cs.AI 版本更新

Neural Network Compression by Approximate Differential Equivalence

基于近似微分等价的神经网络压缩

Ravi Dhiman, Andrea Passarella, Mirco Tribastone, Lorenzo Valerio

发表机构 * IMT School for Advanced Studies Lucca(利古里亚高级研究学院) IIT CNR(理工学院-国家科研委员会)

AI总结 提出一种通过聚合功能相似神经元来压缩神经网络的方法,利用近似前向微分等价将网络编码为多项式ODE系统,实现模型大小与精度的平滑权衡。

Comments 19 pages, 4 figures

详情
AI中文摘要

神经网络压缩通常通过基于局部重要性分数(例如基于幅度的剪枝)剪枝参数来实现。我们提出一种互补方法,通过聚合具有相似功能行为的神经元来压缩模型,而不是独立移除权重。我们的方法将训练好的网络编码为多项式ODE系统,并应用一种称为近似前向微分等价的 lumping 方法来识别具有近似匹配诱导动力学的神经元。单个容差参数 $\varepsilon$ 控制压缩水平,并在模型大小和预测精度之间诱导平滑权衡。我们在来自已知真实行为的非线性动力系统的合成数据集和公共回归基准上评估该方法。在这两种设置下,所提出的方法在保持精度的同时实现了显著的参数减少,并在相似的压缩水平下始终优于基于幅度的剪枝和Wanda。这些结果表明,基于微分等价的聚合是传统以权重为中心的剪枝的一种有原则且有效的替代方案。

英文摘要

Neural network compression is commonly achieved by pruning parameters based on local importance scores, e.g., magnitude-based pruning. We propose a complementary approach that compresses models by aggregating neurons with similar functional behavior rather than removing weights independently. Our method encodes a trained network as a polynomial ODE system and applies a lumping method called Approximate Forward Differential Equivalence to identify neurons with approximately matching induced dynamics. A single tolerance parameter, $\varepsilon$, controls the compression level and induces a smooth trade-off between model size and predictive accuracy. We evaluate the method on synthetic datasets derived from nonlinear dynamical systems with known ground-truth behavior and on public regression benchmarks. Across both settings, the proposed approach achieves substantial parameter reduction while preserving accuracy, and consistently compares favorably with magnitude-based pruning and Wanda at similar compression levels. These results suggest that differential equivalence-based aggregation is a principled and effective alternative to conventional weight-centric pruning.

2606.01397 2026-06-02 cs.RO cs.LG cs.SY eess.SY 版本更新

Autopilot-Preserving Residual Q-Learning with HJB-Inspired Finite-Action Risk Filtering for Fixed-Wing UAV Command Supervision

基于HJB启发有限动作风险滤波的保持自动驾驶仪的残差Q学习用于固定翼无人机指令监督

Mehmet Iscan, Batuhan Temiz

发表机构 * PythaLab, Yildiz Technical University, Istanbul, Turkey(伊兹密尔技术大学吡塔实验室,伊斯坦布尔,土耳其) Turkish Aerospace (TUSAŞ), Ankara, Turkey(土耳其航空航天(TUSAŞ),安卡拉,土耳其)

AI总结 提出一种保持自动驾驶仪的残差指令监督框架,通过HJB方程启发的半离散值迭代评价器和控制Lyapunov/屏障函数启发的有限动作屏蔽,选择有限有界动作集中的残差,显著降低路径跟踪误差。

Comments 47 pages, 12 figures, 20 tables. Simulation-based study with a code-traceable benchmark, source code and a demonstration video are linked in the paper

详情
AI中文摘要

固定翼无人机必须在风、阵风和湍流下保持空速、高度和航向参考,这些通道耦合使得纠正一个通道可能恶化另一个。经典自动驾驶仪能很好地稳定机身,但在强侧风遇到激进转弯时适应能力差,而直接作用于舵面的强化学习策略将探索风险集中在执行器接口。我们在未改变的自动驾驶仪之上放置一个学习型监督器,而不是在其内部:它从指令空速、高度和航向的有限有界动作集中选择一个残差;修改后的参考在到达自动驾驶仪之前被投影到允许的指令包络内,自动驾驶仪仍然是唯一面向执行器的控制器。新颖之处在于残差的选择方式。HJB残差使用半离散值迭代评价器(基于Hamilton-Jacobi-Bellman方程精神)对候选动作评分,通过无操作相对哈密顿优势排序,并通过控制Lyapunov函数和控制屏障函数启发的有限动作屏蔽进行过滤,该屏蔽始终保留无操作回退。在共享的12状态运行时(固定植物、自动驾驶仪和执行器模型)上,HJB残差将均方根路径跟踪误差降低到44.809米,而基线自动驾驶仪为338.617米,表格Q残差为88.809米,相比基线降低86.77%,相比Q学习降低49.54%。增益集中在基线表现最差的区域,但伴随空速误差的测量上升,因此没有方法在所有指标上占优。我们呈现这种保持自动驾驶仪的残差指令监督设计,并完整报告其权衡基准。

英文摘要

A fixed-wing UAV must hold airspeed, altitude, and heading references under wind, gusts, and turbulence, channels coupled so that correcting one can degrade another. Classical autopilots stabilize the airframe well but adapt poorly when a hard crosswind meets an aggressive turn, while reinforcement-learning (RL) policies acting directly on the surfaces concentrate exploration risk at the actuator interface. We place a learned supervisor above an unchanged autopilot rather than inside it: it selects a residual from a finite, bounded action set on the commanded airspeed, altitude, and heading; the modified reference is projected into an admissible command envelope before reaching the autopilot, which stays the only actuator-facing controller. What is new is how the residual is chosen. HJB residual scores candidates with a semi-discrete value-iteration critic in the spirit of the Hamilton-Jacobi-Bellman (HJB) equation, ranks them by a no-op-relative Hamiltonian advantage, and filters them through a control-Lyapunov- and control-barrier-inspired finite-action shield that always keeps a no-op fallback. On a shared 12-state runtime holding the plant, autopilot, and actuator model fixed, so the comparison is at the package level, HJB residual lowers mean RMS path-tracking error to 44.809 m, against 338.617 m for the baseline autopilot and 88.809 m for a tabular-Q residual, an 86.77% reduction over the baseline and 49.54% over Q-learning. The gain concentrates where the baseline fails worst and comes with a measured rise in airspeed error, so no method dominates every metric. We present this autopilot-preserving residual command-supervision design and benchmark with its trade-offs reported intact.

2606.01386 2026-06-02 cs.AI cs.CL cs.DC cs.LG 版本更新

GuidaPA: Privacy-Preserving Chatbot for Public Administration via Federated Learning

GuidaPA: 通过联邦学习为公共行政提供隐私保护的聊天机器人

Daniel M. Jimenez-Gutierrez, Albenzio Cirillo, Raffaele Nicolussi, Alessio Beltrame, Andrea Vitaletti

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 提出GuidaPA,一个基于联邦学习(FL)在意大利公共行政文档上训练的隐私保护聊天机器人,通过参数高效的联邦微调(QLoRA)和角色访问控制,在保持数据本地化的同时实现了接近集中式微调的答案质量。

Comments Accepted to the 2nd International Conference on Federated Learning and Intelligent Computing Systems (FLICS2026)

详情
AI中文摘要

我们提出了GuidaPA,一个为意大利公共行政(PA)设计的隐私保护聊天机器人,它通过联邦学习(FL)在两个国家PA平台SIGESON和SIDFORS的文档上进行训练。我们的语料库包括约8页的SIGESON手册和31页的SIDFORS手册/常见问题解答;虽然本研究使用公开文档作为安全代理,但预期的部署将扩展到受限制的内部来源(例如,工单、官员手册、数据库提取),这些数据由于监管和组织约束无法集中汇集。GuidaPA集成了基于角色的访问控制、安全的客户端预处理、对非独立同分布效应的显式监控以及大语言模型的参数高效联邦微调。使用QLoRA(4位)进行15轮联邦训练,每个客户端采用80/20的训练-测试划分,我们使用ROUGE、BLEU-4和METEOR评估答案质量。最佳联邦模型达到了ROUGE-1/2/L分别为61.10/55.77/59.44,BLEU-4为45.02,METEOR为63.94——接近私有集中式微调的性能,同时保持数据在本地。与通用基线相比,领域微调将ROUGE-1从41.45提高到62.18,BLEU-4从26.97提高到50.90。总体而言,结果表明FL可以在不进行集中数据共享的情况下,为公共服务提供高质量的对话式AI。

英文摘要

We present GuidaPA, a privacy-preserving chatbot for the Italian Public Administration (PA) trained via Federated Learning (FL) on documentation from two national PA platforms, SIGESON and SIDFORS. Our corpus includes approximately 8 pages of SIGESON manuals and 31 pages of SIDFORS manuals/FAQs; while this study uses public documentation as a safe proxy, the intended deployment extends to restricted internal sources (e.g., tickets, officer manuals, database extracts) that can not be centrally pooled due to regulatory and organizational constraints. GuidaPA integrates role-based access control, secure client-side preprocessing, explicit monitoring of non-IID effects, and parameter-efficient federated fine-tuning of large language models. Using QLoRA (4-bit) over 15 federated rounds with an 80/20 train-test split per client, we evaluate answer quality with ROUGE, BLEU-4, and METEOR. The best federated model achieves ROUGE-1/2/L of 61.10/55.77/59.44, BLEU-4 of 45.02, and METEOR of 63.94-close to private centralized fine-tuning while keeping data on-site. Compared to the general-purpose baseline, domain fine-tuning improves ROUGE-1 from 41.45 to 62.18 and BLEU-4 from 26.97 to 50.90. Overall, the results indicate that FL can deliver high-quality conversational AI for public services without centralized data sharing

2606.01382 2026-06-02 cs.LG cs.AI 版本更新

Efficient Exploration for Iterative Nash Preference Optimization

迭代纳什偏好优化的高效探索

Tianlong Nan, Xiaopeng Li, Christian Kroer, Tianyi Lin

发表机构 * Columbia University(哥伦比亚大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 针对通用偏好模型下的迭代NLHF,提出显式探索算法,结合SFT正则化与对抗性策略探索,实现O(√T)遗憾界,避免对KL正则化参数的指数依赖。

Comments 49 pages

详情
AI中文摘要

偏好对齐是改进大语言模型的核心,但当人类偏好是循环、非传递或无法用标量奖励表示时,标准的基于奖励的公式可能具有限制性。从人类反馈中学习纳什均衡(NLHF)通过将对齐建模为偏好博弈并针对纳什均衡而非奖励最大化来解决这一限制。然而,可扩展NLHF的学习理论基础仍然有限。现有的遗憾保证依赖于基于oracle的方法,这些方法估计一个通用偏好模型并求解KL正则化的极小极大问题,而迭代NLHF方法直接优化策略级别的偏好损失,更易实现但缺乏遗憾保证。我们研究通用偏好模型下的在线迭代NLHF,并确定探索是关键障碍。首先,我们表明标准迭代NLHF可能遭受对KL正则化参数的指数依赖,揭示了通过策略更新进行的隐式探索不足以控制遗憾。其次,我们提出一种显式探索的迭代NLHF算法,结合了基于SFT的正则化与对抗性策略探索。所得方法保留了迭代NLHF的直接策略优化结构,避免了显式偏好模型估计,并实现了$O(\sqrt{T})$的遗憾界,而不依赖于KL正则化参数的指数项。我们表明,通过访问一个极小极大oracle,遗憾可以改进为$O(\log(T))$,阐明了学习通用偏好博弈中的计算-统计权衡。最后,我们将我们的方法实例化用于LLM微调,并在多个基准上对\texttt{Llama-3-8B-Instruct}进行评估,其中显式探索在现有NLHF基线上产生了一致的改进。

英文摘要

Preference alignment is central to improving large language models, but standard reward-based formulations can be restrictive when human preferences are cyclic, non-transitive, or otherwise not representable by a scalar reward. Nash Learning from Human Feedback (NLHF) addresses this limitation by modeling alignment as a preference game and targeting a Nash equilibrium rather than a reward maximizer. However, the learning-theoretic foundations of scalable NLHF remain limited. Existing regret guarantees rely on oracle-based methods that estimate a general preference model and solve KL-regularized minimax problems, while iterative NLHF methods directly optimize policy-level preference losses and are easier to implement but lack regret guarantees. We study online iterative NLHF under general preference models and identify exploration as the key obstacle. First, we show that standard iterative NLHF can suffer an exponential dependence on the KL-regularization parameter, revealing that implicit exploration through policy updates is insufficient for controlling regret. Second, we propose an explicitly exploratory iterative NLHF algorithm that combines SFT-based regularization with adversarial policy exploration. The resulting method retains the direct policy optimization structure of iterative NLHF, avoids explicit preference model estimation, and achieves an $O(\sqrt{T})$ regret bound without an exponential dependence on the KL-regularization parameter. We show that the regret can be improved to $O(\log(T))$ with access to a minimax oracle, clarifying the computational-statistical tradeoff in learning general preference games. Finally, we instantiate our method for LLM fine-tuning and evaluate it on \texttt{Llama-3-8B-Instruct} across multiple benchmarks, where explicit exploration yields consistent improvements over existing NLHF baselines.

2606.01374 2026-06-02 cs.LG 版本更新

From Performance to Viability: A Bootstrap Framework for Latent-Space Representation Learning in Adaptive Biological Systems

从性能到生存力:自适应生物系统中潜在空间表示学习的自举框架

Jacques Raynal, Pierre Slangen, Elsa Raynal, Jacques Margerit

发表机构 * Laboratory of Bioengineering and Nanosciences (LBN)(生物工程与纳米科学实验室) University of Montpellier(蒙彼利埃大学) EuroMov Digital Health in Motion(EuroMov数字健康运动) IMT Mines Alès Certified Sophrologist, Sensorimotor Practice(认证Sophrologist,运动感知实践)

AI总结 针对自适应生物系统中性能相似但组织不同的问题,提出一个五级自举框架,通过逐步引入潜在组织、纵向生存力和内部预测近似,从观测不足中学习更具信息量的表示。

Comments 25 pages. Methodological framework for latent-space representation learning in adaptive biological systems

详情
AI中文摘要

可观测性能通常用于表征生物系统。然而,在自适应系统中,相似性能可能源于不同的组织,且在给定时间看似相似的配置可能遵循不同的纵向轨迹。这一局限性促使我们提出一种方法论框架,以超越基于性能的解释,而无需事先假设完整的机制模型。本文提出了一个用于自适应生物系统中潜在空间表示学习的自举框架。这里的自举是在方法论和认识论意义上使用的:当先前的表示不足以解释观察到的自适应动态时,引入新的分析层次。该框架围绕五个层次组织:可观测性能、动态组织、潜在组织、纵向生存力和内部预测近似。通过三个先前报道的步态-遮挡研究来说明该框架,这些研究仅作为方法论案例序列,而非新的实验证据。本文形式化了性能分析如何导致潜在组织,静态潜在组织如何导致纵向生存力,以及观察到的生存力如何导致内部预测近似。贡献不是新的学习算法、临床协议或数据集,而是一个用于潜在空间表示学习的自举框架,描述了如何从自适应生物数据的观测不足中涌现出更具信息量的表示。

英文摘要

Observable performance is commonly used to characterize biological systems. In adaptive systems, however, similar performances may arise from distinct organizations, and configurations that appear comparable at a given time may follow different longitudinal trajectories. This limitation motivates a methodological framework for moving beyond performance-based interpretation without assuming a complete mechanistic model in advance. This article proposes a bootstrap framework for latent-space representation learning in adaptive biological systems. Here, bootstrap is used in a methodological and epistemological sense: new analytical levels are introduced when the preceding representation becomes insufficient to account for observed adaptive dynamics. The framework is organized around five levels: observable performance, dynamic organization, latent organization, longitudinal viability, and internal predictive approximation. The framework is illustrated by three previously reported gait--occlusion studies, used here only as a methodological case sequence and not as new experimental evidence. The article formalizes how performance analysis led to latent organization, how static latent organization led to longitudinal viability, and how observed viability led to internal predictive approximation. The contribution is not a new learning algorithm, clinical protocol, or dataset, but a bootstrap framework for latent-space representation learning describing how increasingly informative representations can emerge from observational insufficiencies in adaptive biological data.

2606.01372 2026-06-02 cs.LG cs.AI cs.CV 版本更新

BRo-JEPA: Learning Modular Arithmetic in Latent Space

BRo-JEPA:在潜空间中学习模算术

Divyansh Jha, Yuanfang Xie, Varan Mehra, Brennen Yu

发表机构 * Georgia Institute of Technology(佐治亚理工学院) NYU Langone Health(纽约大学Langone医疗中心)

AI总结 本文提出BRo-JEPA模型,通过在潜空间中施加模10算术的循环结构,实现零样本泛化,解决了标准模型无法外推未见操作的问题。

Comments 10 pages, 14 figures

详情
AI中文摘要

神经网络能否学习抽象的代数规则,还是仅仅记忆训练模式?我们使用MNIST数字作为状态,模算术运算作为动作,在JEPA风格的潜世界模型中进行研究。标准监督基线和带有加法操作嵌入的JEPA模型能够学习已见操作,但无法可靠地外推到未见操作。为了弥补这一差距,我们引入了一个块旋转预测器,在潜空间中施加模10算术的循环结构。这使得模型具有强大的零样本泛化能力,最佳的基于ResNet的JEPA块旋转模型达到了99.46%的零样本准确率和99.46%的展开准确率。我们的结果表明,当架构与问题结构匹配时,潜世界模型可以学习符号变换规则。我们的代码可以在此处访问:https://github.com/DL-World-Models/mnist-math。

英文摘要

Can neural networks learn abstract algebraic rules, or do they merely memorize training patterns? We investigate this using MNIST digits as states and modular arithmetic operations as actions in a JEPA-style latent world model. Standard supervised baselines and JEPA models with additive operation embeddings fit seen operations but fail to extrapolate reliably to unseen ones. To bridge this gap, we introduce a block-rotation predictor that imposes the circular structure of modulo-10 arithmetic in latent space. This enables strong zero-shot generalization, with the best ResNet-based JEPA block-rotation model achieving 99.46\% zero-shot and 99.46\% rollout accuracy. Our results suggest that latent world models can learn symbolic transformation rules when architecture matches the structure of the problem. Our code can be \href{https://github.com/DL-World-Models/mnist-math}{accessed here}.

2606.01363 2026-06-02 cs.LG cs.SY eess.SY 版本更新

All Models are Wrong, Knowing Where is Useful: On Model Uncertainty in Reinforcement Learning

所有模型都是错的,知道哪里有用:强化学习中的模型不确定性

Bernd Frauenknecht, Devdutt Subhasish, Artur Eisele, Friedrich Solowjow, Sebastian Trimpe

发表机构 * German Federal Ministry of Research, Technology and Space (BMFTR)(德国联邦研究、技术和空间部) Robotics Institute Germany (RIG)(德国机器人研究所) Institute for Data Science in Mechanical Engineering, RWTH Aachen University(机械工程数据科学研究所,亚琛工业大学) NHR Center NHR4CES at RWTH Aachen University(亚琛工业大学NHR4CES中心)

AI总结 提出通过针对性处理概率模型的不确定性来减轻模型利用的框架,并展示在硬件直接学习和安全探索方面的成功。

详情
AI中文摘要

基于模型的强化学习(MBRL)从学习的动力学模型中推断环境信息,并具有解决数据高效和机器人安全学习等开放性问题的潜力。然而,学习到的动力学模型的不准确性通常被智能体利用,严重阻碍了MBRL方法的能力。我们提出了一个通过针对性处理不确定性来有效减轻模型利用的框架。我们展示了在硬件直接学习和安全探索方面的近期成功,并讨论了不确定性感知MBRL的未来方向。

英文摘要

Model-based reinforcement learning (MBRL) infers information about the environment from a learned dynamics model and bears the potential to address open problems such as data efficient and safe learning in robotics. However, inaccuracies of the learned dynamics model are typically exploited by the agent, substantially hampering the capabilities of MBRL methods. We present a framework for dealing with inaccuracies of probabilistic models through targeted handling of uncertainty that effectively mitigates model exploitation. We present recent successes in learning directly on hardware and safe exploration, and discuss future directions for uncertainty-aware MBRL.

2606.01339 2026-06-02 cs.LG cs.AI cs.CL cs.CV cs.ET 版本更新

FreqLite: A Lightweight Frequency-Decomposed Linear Model with Adaptive Reversible Normalization for Robust Long-Term Time-Series Forecasting

FreqLite:一种轻量级频率分解线性模型,具有自适应可逆归一化,用于稳健的长期时间序列预测

Mirza Samad Ahmed Baig, Syeda Anshrah Gillani

发表机构 * Hamdard University(哈姆达德大学)

AI总结 提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器,通过可学习的无损谱滤波器进行频带分解和线性预测,并引入自适应可逆实例归一化(A-RevIN)处理非平稳性,在长期预测基准上以更少参数和计算资源超越PatchTST等模型。

Comments 26 pages, 5 figures

详情
AI中文摘要

长期时间序列预测需要既准确又能在商用硬件上高效运行的模型。轻量级线性预测器在此领域表现出色,但仍存在两个问题:可逆实例归一化(RevIN)使用单一回溯统计量对整个预测区间进行去归一化,在非平稳性下不准确;时域趋势/季节分解依赖于固定的非自适应滤波器。我们提出FreqLite,一种超轻量级、通道独立的频率分解线性预测器:一个可学习的、无损的单位划分谱滤波器将输入分割成多个频带,由每个频带的线性头进行预测,与低通截断方法不同,高频带被保留并建模。FreqLite在标准长期预测基准上是最佳的轻量级模型,在长回溯(L=336)时,其平均误差低于PatchTST Transformer(0.3244 vs 0.3587 MSE),同时参数减少4倍,内存减少2.2倍,在单块4 GB笔记本GPU上每轮时间减少2.2倍;尽管幅度不大,但在所有匹配单元上的配对Wilcoxon检验中,其改进具有统计显著性(p < 1e-5)。我们进一步引入自适应可逆实例归一化(A-RevIN),一种自适应可逆归一化,严格推广了RevIN(在其门关闭时完全恢复),在非平稳性下起作用,并在平稳数据上无害地退化为RevIN。我们在一个真实的强非平稳数据集(ILI,MSE降低约5%)和一个受控合成漂移扫描中验证了这一点,其中A-RevIN的收益及其学习门都随注入的非平稳性单调增加。每个组件均可独立消融(Linear和RLinear是FreqLite的特例),所有结果均可在商用硬件上复现。

英文摘要

Long-term time-series forecasting needs models that are accurate yet efficient enough for commodity hardware. Lightweight linear forecasters are remarkably strong in this regime, yet they leave two openings: reversible instance normalization (RevIN) de-normalizes the entire horizon with a single lookback statistic, which is inaccurate under non-stationarity, and time-domain trend/seasonal decomposition relies on a fixed, non-adaptive filter. We present FreqLite, an ultra-lightweight, channel-independent frequency-decomposed linear forecaster: a learnable, lossless, partition-of-unity spectral filter splits the input into bands that are forecast by per-band linear heads and, unlike low-pass-truncation approaches, the high-frequency band is retained and modeled. FreqLite is the best lightweight model on the standard long-term forecasting benchmarks and, at long lookback (L=336), attains a lower average error than a PatchTST Transformer (0.3244 vs. 0.3587 MSE) while using 4x fewer parameters, 2.2x less memory, and 2.2x less time per epoch on a single 4 GB laptop GPU; although modest in magnitude, its improvements are statistically significant under paired Wilcoxon tests across all matched cells (p < 1e-5). We further introduce Adaptive Reversible Instance Normalization (A-RevIN), a regime-adaptive reversible normalization that strictly generalizes RevIN (recovered exactly when its gate is closed), engages under non-stationarity, and reduces to RevIN without harm on stationary data. We validate this on both a real strongly non-stationary dataset (ILI, up to ~5% MSE reduction) and a controlled synthetic drift sweep in which A-RevIN's benefit and its learned gate both rise monotonically with injected non-stationarity. Every component is independently ablatable (Linear and RLinear are special cases of FreqLite), and all results are reproducible on commodity hardware.

2606.01329 2026-06-02 cs.LG q-bio.BM 版本更新

Conditioned free-energy density of proteins using unbalanced solutions to constraint satisfaction problems

使用约束满足问题的不平衡解的条件化蛋白质自由能密度

Pratik Worah, Subhash Khot, Srinivasa Varadhan

发表机构 * CIMS, NYU(纽约大学应用数学与计算科学中心)

AI总结 本文通过将条件化非均匀Curie-Weiss自旋哈密顿量的对数配分函数(自由能)简化为不平衡$2 \to 1$范数计算,并设计多项式时间SDP算法,应用于泛素蛋白以探索自由能景观并识别柔性区域。

详情
AI中文摘要

我们证明,计算条件化非均匀Curie-Weiss自旋哈密顿量的对数配分函数(自由能)简化为不平衡的$2 \to 1$范数计算,并为此问题设计了一个多项式时间的SDP算法,同时给出了所实现不平衡量的下界证明。应用于蛋白质泛素,该框架从已知晶体结构出发,探索自由能景观中的替代骨架构象,并在保留天然二级结构的同时识别蛋白质的柔性区域。

英文摘要

We show that computing the log-partition function (free-energy) of conditioned inhomogeneous Curie--Weiss spin Hamiltonians reduces to an unbalanced $2 \to 1$ norm computation, and design a polynomial-time SDP algorithm for this problem with a lower bound proof for the amount of unbalance achieved. Applied to the protein Ubiquitin, the framework starts from a known crystal structure, explores alternative backbone conformations across the free-energy landscape, and identifies flexible regions of the protein while preserving its native secondary structure.

2606.01325 2026-06-02 cs.NI cs.LG 版本更新

SEArch: Optimistic Policy Selection Between Scene Noise and Drift for UAV Radar Search

SEArch: 无人机雷达搜索中场景噪声与漂移间的乐观策略选择

Noor Khial, Naram Mhaisen, Loay Ismail, Amr Mohamed

发表机构 * Department of Electrical and Computer Engineering, University of Waterloo(1 温哥华大学电子与计算机工程系)

AI总结 针对无人机雷达目标搜索中场景噪声与漂移共存的问题,提出基于随机扩展对手框架的乐观跟随正则化领导者策略选择器SEArch及其窗口变体W-SEArch,实现了亚线性遗憾界,实验显示相比非自适应基线遗憾降低达30%。

详情
AI中文摘要

配备雷达传感器的无人机被部署在多样环境中执行目标搜索任务,目标具有可通过遮挡检测到的特征信号(例如人体搜索中的呼吸微动)。一个基本挑战在于,当无人机在动态且可能非平稳的环境中移动时,雷达统计特性发生变化,使得任何固定的信号处理策略都变得次优;然而感知和适应必须在资源受限的空中节点上实时运行。由于没有单一检测器能在所有条件下表现良好,我们采用多策略范式,将无人机目标搜索形式化为一个在线策略选择问题,基于一组专用检测器库,性能通过遗憾(即相对于每个场景中最优策略的累积损失差距)来衡量。该设置将场景内随机噪声与场景间漂移耦合在一起。先前的方法仅捕捉一种模式,而我们通过随机扩展对手框架同时考虑两者,无需场景动态的先验知识。由于适应必须在无人机上运行,我们通过SEArch实例化SEA,这是一种轻量级的乐观跟随正则化领导者选择器,具有自适应学习率,实现了遗憾界$O(arσ_T \sqrt{T} + \sqrt{J})$,其中$arσ_T$捕捉雷达测量噪声,$J$是任务时间范围$T$内的场景转换次数。为了在频繁场景变化下实现快速适应,我们进一步引入了W-SEArch,这是一种窗口变体,每$w$轮重启一次,并在每个窗口内最多一次转换下实现遗憾界$O(arσ_I \sqrt{w})$。实验表明,在一系列非平稳设置中,与非自适应基线相比,遗憾降低高达30%。

英文摘要

Unmanned Aerial Vehicles (UAVs) equipped with radar sensors are deployed for target search missions in diverse environments, where targets exhibit characteristic signatures (e.g., respiration micro-motion in human search) detectable through occlusions. A fundamental challenge arises from shifts in radar statistics as the UAV moves through a dynamic and potentially non-stationary environment, rendering any fixed signal-processing strategy suboptimal; yet perception and adaptation must run onboard a resource-constrained aerial node in real time. Since no single detector performs well across all conditions, we adopt a multi-policy paradigm and formulate UAV target search as an online policy selection problem over a library of specialized detectors, with performance measured by regret, the cumulative loss gap relative to the best policy in each scene. The setting couples in-scene stochastic noise with inter-scene shifts. Whereas prior methods capture only one regime, we account for both through the Stochastically Extended Adversary (SEA) framework, without requiring oracle knowledge of scene dynamics. Because adaptation must run at the UAV, we instantiate SEA through \textsc{SEArch}, a lightweight optimistic Follow the Regularized Leader (OFTRL) selector with an adaptive learning rate, achieving regret $O(\barσ_T \sqrt{T} + \sqrt{J})$, where $\barσ_T$ captures radar measurement noise and $J$ is the number of scene transitions over the mission horizon $T$. To enable rapid adaptation under frequent scene changes, we further introduce \textsc{W-SEArch}, a windowed variant that restarts every $w$ rounds and achieves regret $O(\barσ_I \sqrt{w})$ under at most one transition per window. Experiments show up to 30\% regret reduction compared to non-adaptive baselines across a range of non-stationary settings.

2606.01311 2026-06-02 cs.CL cs.AI cs.LG cs.MA 版本更新

SkillAdaptor: Self-Adapting Skills for LLM Agents from Trajectories

SkillAdaptor:基于轨迹的LLM智能体自适应技能

Zhuoyun Yu, Xin Xie, Wuguannan Yao, Chenxi Wang, Lei Liang, Xiang Qi, Shumin Deng

发表机构 * Zhejiang University(浙江大学) Ant Digital Technologies, Ant Group(蚂蚁集团数字技术部) Zhejiang University - Ant Group Joint Laboratory of Knowledge Graph(浙江大学-蚂蚁集团知识图谱联合实验室)

AI总结 提出SkillAdaptor,一种无训练的步骤级技能自适应框架,通过显式故障归因和针对性更新,提升LLM智能体在长程交互任务中的表现。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLM)智能体越来越依赖可重用的外部技能来解决长程交互任务。现有的无训练技能自适应流程通常从完整轨迹或会话级反馈更新技能,这使得故障归因粗糙,往往产生不稳定或过于宽泛的修订。我们提出SkillAdaptor,一种无训练的步骤级技能自适应框架,具有显式故障归因,并可插入OpenClaw类智能体框架。给定一个失败轨迹,SkillAdaptor识别第一个可操作的故障步骤,将责任关联到候选技能,并在显式接受检查下应用针对性更新,同时保持主干冻结。我们在WebShop、PinchBench和Claw-Eval上使用Kimi-K2.5、GLM-5和GPT-5.2进行评估。SkillAdaptor在所有三个套件上均优于无技能和技能自适应基线,最大的单项指标提升为PinchBench平均得分%提升1.5分,Claw-Eval平均得分提升1.8分,WebShop成功率提升1.7分。这些结果表明,步骤级归因支持更稳定且可审计的无训练技能维护。

英文摘要

Large language model (LLM) agents increasingly rely on reusable external skills to solve long-horizon interactive tasks. Existing training-free skill adaptation pipelines usually update skills from full trajectories or session-level feedback, which makes failure attribution coarse and often produces unstable or overly broad revisions. We propose SkillAdaptor, a training-free step-level skill adaptation framework with explicit failure attribution, and it can plug into OpenClaw-class agent harnesses. Given a failed trajectory, SkillAdaptor identifies a first actionable fault step, links responsibility to candidate skills, and applies targeted updates under explicit acceptance checks while keeping the backbone frozen. We evaluate on WebShop, PinchBench, and Claw-Eval with Kimi-K2.5, GLM-5, and GPT-5.2. SkillAdaptor improves over no-skill and skill-adaptation baselines on all three suites, with the largest single-metric improvements of +1.5 points on PinchBench Avg Score%, +1.8 on Claw-Eval Avg Score, and +1.7 on WebShop success rate. These results indicate that step-level attribution supports more stable and auditable training-free skill maintenance\footnote{The code will be released at https://github.com/zjunlp/SkillAdaptor.}.

2606.01306 2026-06-02 cs.LG cs.IR 版本更新

FAiT: Frequency-Aware Inverted Transformer for Multivariate Time Series Forecasting

FAiT:面向多元时间序列预测的频率感知倒置Transformer

Peng He, Yao Liu, Yanglei Gan, Run Lin, Yuxiang Cai, Qiao Liu

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出FAiT,通过倒置注意力机制和动态时频调制,解决Transformer在多元时间序列预测中忽略高频信号和时变频谱特性的问题。

详情
AI中文摘要

虽然基于Transformer的架构已成为多元时间序列预测(MTSF)的主导范式,但其核心自注意力机制本质上充当低通滤波器,系统性地平滑掉对剧烈局部变化至关重要的高频信号。最近的进展越来越多地引入频域操作来解决这一偏差,然而,大多数现有设计依赖固定的频谱基并应用序列级(均匀)调制,隐含地假设时不变的频率响应。这忽略了现实世界序列的一个关键属性——其频谱特征通常随时间演变,使得均匀调制不足以捕捉细粒度的时态动态。为了解决这些局限性,我们提出了FAiT,一种频率感知的倒置Transformer。具体来说,FAiT通过倒置注意力机制内部纠正频谱偏差,该机制将注意力图解释为可学习的低通算子,并通过倒置注意力矩阵构建一个专用的互补高通分支,以恢复衰减的瞬态信号。此外,FAiT引入了动态时频调制(DTFM),该调制合成实例条件权重以自适应地重新校准频谱子带的能量,从而实现对演变的多样模式进行细粒度控制。在广泛使用的基准上的大量实验表明,FAiT在保持计算效率的同时,始终优于最先进的基于Transformer和频率增强的基线。

英文摘要

While Transformer-based architectures have established themselves as a dominant paradigm in Multivariate Time Series Forecasting (MTSF), their core self-attention mechanism inherently functions as a low-pass filter, systematically smoothing out high-frequency signals vital for sharp local changes. Recent advancements have increasingly incorporated frequency-domain operations to address this bias, however, most existing designs rely on fixed spectral bases and apply sequence-wise (uniform) modulation, implicitly assuming a time-invariant frequency response. This overlooks a key property of real-world series that their spectral characteristics often evolve over time, making uniform modulation insufficient for capturing fine-grained temporal dynamics. To tackle these limitations, we propose FAiT, a Frequency-Aware inverted Transformer. Specifically, FAiT rectifies the spectral bias internally through Inverted Attention, which interprets the attention map as a learnable low-pass operator and constructs a dedicated complementary high-pass branch by inverting the attention matrix to recover attenuated transient signals. Furthermore, FAiT introduces Dynamic Temporal-Frequency Modulation (DTFM), which synthesizes instance-conditioned weights to adaptively re-calibrate the energy of spectral sub-bands, enabling fine-grained control over evolving multi-scale patterns. Extensive experiments on widely used benchmarks demonstrate that FAiT consistently outperforms state-of-the-art Transformer-based and frequency-enhanced baselines, while maintaining computational efficiency.

2606.01302 2026-06-02 cs.LG 版本更新

Structure and Scale in Simplicial Sequence Modelling

单纯复形序列建模中的结构与规模

Matthew Farrugia-Roberts

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系)

AI总结 本文通过训练小型Transformer预测隐马尔可夫模型输出,发现性能缩放模式与内部表征之间存在相关性,为行为缩放定律与涌现机制的联系提供了初步证据。

Comments HiLD 2026: 4th Workshop on High-dimensional Learning Dynamics

详情
AI中文摘要

现代大规模深度学习展现出两个引人注目的经验现象:行为缩放定律(性能随规模增大而可预测地提升)和涌现机制(深度神经网络中结构化的内部表征和回路)。我们假设这两个现象是相互关联的:行为上的可预测变化是内部计算结构可预测变化的结果。在本文中,我们报告了这种联系的初步证据。我们发现,在训练用于预测隐马尔可夫模型输出的小型Transformer中,性能缩放模式与表征之间存在相关性,已知其残差激活在概率单纯形上线性编码了关于潜在状态的信念分布。

英文摘要

Modern large-scale deep learning exhibits two striking empirical phenomena: behavioural scaling laws (predictable performance gains with increasing scale) and emergent mechanisms (structured internal representations and circuits in deep neural networks). We hypothesise that these two phenomena are connected: that predictable changes in behaviour are the result of predictable changes in internal computational structure. In this paper, we report preliminary evidence of such a connection. We find a correlation between scaling patterns in performance and representations in small transformers trained to predict the outputs of a hidden Markov model, for which residual activations are known to linearly encode a belief distribution over latent states in a probability simplex.

2606.01300 2026-06-02 cs.LG cs.AI 版本更新

ChronosAD: Leveraging Time Series Foundation Models for Accurate Anomaly Detection

ChronosAD:利用时间序列基础模型进行精确异常检测

Uzair Khan, Luigi Capogrosso, Francesco Biondani, Michele Magno, Franco Fummi, Francesco Setti, Marco Cristani

发表机构 * PR Veneto FESR 2021-2027(普罗文托地区FESR 2021-2027项目) Action 1.1.1(行动1.1.1) DGR 792 CUP D19J24000810007

AI总结 提出ChronosAD架构,通过时间序列基础模型提取特征并结合BiLSTM与多头注意力机制,实现跨域鲁棒的异常检测,在11个基准上平均AUC提升4.72%,AP提升6.60%。

Comments Accepted at the 24th IEEE International Conference on Industrial Informatics (INDIN) 2026

详情
AI中文摘要

时间序列异常检测是金融、医疗和工业等多个领域的关键任务。然而,现有方法通常难以在不同数据集上泛化,尤其是当异常微妙或依赖于上下文时。为解决此问题,我们引入了ChronosAD,一种新颖的异常检测架构,它使用时间序列基础模型作为特征提取器。具体而言,它采用两阶段流程:首先,使用基础模型以零样本方式为每个时间序列提取嵌入。然后,一个由双向长短期记忆(BiLSTM)和多头注意力组成的自定义开发的时间块,对这些嵌入进行精炼以捕捉时间依赖性并突出显著模式。与先前方法不同,我们的模型需要最少的任务特定调整,并在包括工业、医疗、信息物理和汽车系统在内的广泛领域中展现出鲁棒的泛化能力。在11个基准上的大量实验表明,ChronosAD在AUC和AP上平均分别超过现有方法4.72%和6.60%。源代码可在https://github.com/intelligolabs/ChronosAD获取。

英文摘要

Time series anomaly detection is a crucial task in various domains, including finance, healthcare, and industry. However, existing methods often struggle to generalize across different datasets, especially when anomalies are subtle or context-dependent. To solve this issue, we introduce ChronosAD, a novel architecture for anomaly detection that uses a time series foundation model as a feature extractor. Specifically, it employs a two-stage pipeline: first, it uses the foundation model to extract embeddings for each time series in a zero-shot manner. Then, a custom-developed Temporal Block, composed of Bidirectional Long Short-Term Memory (BiLSTM) and Multi-Head Attention, refines these embeddings to capture temporal dependencies and highlight salient patterns. Unlike previous approaches, our model requires minimal task-specific tuning and demonstrates robust generalization across a wide range of domains, including industrial, medical, cyber-physical, and automotive systems. Extensive experiments on 11 benchmarks show that ChronosAD outperforms existing methods by 4.72% in AUC and 6.60% in AP on average. The source code is available at https://github.com/intelligolabs/ChronosAD.

2606.01294 2026-06-02 cs.CL cs.LG 版本更新

Don't Read Everything: A Curvature-Conditioned Query for Linear Attention

不要阅读一切:用于线性注意力的曲率条件查询

Dong Le, Thong Nguyen, Cong-Duy Nguyen, Anh Tuan Luu

发表机构 * Nanyang Technological University(南洋理工大学) National University of Singapore(国立新加坡大学) VinUniversity(文理大学)

AI总结 针对线性注意力在上下文检索和长上下文任务中的不足,提出曲率条件查询(CCQ)机制,通过二阶泰勒展开构建局部二次模型,利用运行键协方差收缩查询向量,仅修改读取步骤,兼容现有线性注意力骨干,在困惑度、零样本下游准确率、检索和长上下文任务上取得提升。

Comments 19 pages

详情
AI中文摘要

线性注意力通过维护一个循环的快速权重状态,降低了 softmax 注意力的二次成本,但在上下文检索和长上下文任务中始终落后。现有的补救措施通过门控、增量更新或核特征映射作用于记忆的写入侧,但读取步骤保持不变:每个过去的键对输出都有加性贡献,因此有用的目标被存储向量的大多数稀释。我们借用 softmax 几何的一个特定部分来构建一个廉价的读取时查询收缩。在等向注意力点处对 softmax 对数配分函数进行二阶泰勒展开,得到一个局部二次模型,其曲率与运行键协方差一致,该量可以通过与线性注意力状态相同的循环/分块机制来维护。相关的线性算子在查询读取状态之前,沿着记忆的高密度方向收缩查询。我们将这种机制称为曲率条件查询(CCQ)。CCQ 仅修改读取步骤,并且可以与任何线性注意力骨干组合。将其附加到 GLA 和 Gated DeltaNet 上,它在困惑度、零样本下游准确率、训练上下文内外的 S-NIAH 检索、从 4K 到 20K 的长度外推困惑度以及 LongBench 准确率上均有提升,且额外成本很小。

英文摘要

Linear attention reduces the quadratic cost of softmax attention by maintaining a recurrent fast-weight state, but it consistently lags on in-context retrieval and long-context tasks. Existing remedies act on the write side of memory through gating, delta updates, or kernel feature maps, but the read step is left unchanged: every past key contributes additively to the output, so useful targets are diluted by the bulk of stored vectors. We borrow one specific piece of softmax's geometry to construct a cheap read-time contraction of the query. A second-order Taylor expansion of the softmax log-partition at the isotropic-attention point gives a local quadratic model whose curvature coincides with the running key covariance, a quantity that can be maintained with the same recurrent/chunkwise mechanism as the linear-attention state. The associated linear operator contracts the query along the high-density directions of memory before it reads the state. We call this mechanism Curvature-Conditioned Query (CCQ). CCQ modifies only the read step and is composable with any linear-attention backbone. Attached to GLA and Gated DeltaNet, it improves perplexity, zero-shot downstream accuracy, S-NIAH retrieval at and beyond the training context, length-extrapolation perplexity from 4K to 20K, and LongBench accuracy, at small extra cost.

2606.01292 2026-06-02 cs.LG cs.AI 版本更新

What Makes a Strong Model? A Unified Spectral Analysis of Knowledge Transfer over High-dimensional Linear Regression

什么造就了一个强模型?高维线性回归中知识迁移的统一谱分析

Wendao Wu, Fangqing Zhang, Haihan Zhang, Cong Fang

发表机构 * Department of Computer Science(计算机科学系) Cranberry-Lemon University(Cranberry-Lemon 大学) Department of Computational Neuroscience(计算神经科学系) University of the Witwatersrand(沃特瓦特斯兰大学)

AI总结 本文通过高维线性回归中SGD动力学的统一谱分析,揭示了知识蒸馏中的谱视界扩展和弱到强泛化中的谱去噪两种机制,统一解释了不同知识迁移范式的有效性。

详情
AI中文摘要

师生知识迁移在现代机器学习中无处不在,从通过知识蒸馏进行的经典模型压缩到弱到强泛化这一新兴现象。尽管现有研究提供了孤立见解,但缺乏一个统一的理论框架来解释知识迁移在这些不同机制中的有效性。在这项工作中,我们建立了高维线性回归中SGD动力学的统一谱分析,阐明了知识迁移在看似不同的机制中的效率。我们通过两种不同机制来刻画知识迁移效率:知识蒸馏中的谱视界扩展,使得能够捕获统计上不可及的高频信号;以及弱到强泛化中的谱去噪,其中学生充当优化噪声的滤波器。我们的框架统一了这些现象,揭示了迁移的有效性由隐式正则化与谱上异质谱学习速度之间的相互作用所支配。

英文摘要

Teacher-Student Knowledge Transfer (KT) is ubiquitous in modern machine learning, ranging from classical model compression via Knowledge Distillation (KD) to the emergent phenomenon of Weak-to-Strong (W2S) generalization. While existing studies offer isolated insights, a unified theoretical framework explaining the efficacy of KT across these disparate regimes remains lacking. In this work, we establish a unified spectral analysis of SGD dynamics in high-dimensional linear regression, elucidating the efficiency of KT across seemingly disparate regimes. We characterize KT efficiency through two distinct mechanisms: \emph{Spectral Horizon Expansion} in KD, which enables the capture of statistically inaccessible high-frequency signals, and \emph{Spectral Denoising} in W2S, where the student acts as a filter for optimization noise. Our framework unifies these phenomena, revealing that the efficacy of transfer is governed by the interplay between implicit regularization and heterogeneous spectral learning speeds over the spectrum.

2606.01289 2026-06-02 cs.LG 版本更新

Feature to Dynamics: Feature-space to Autoregression strategy for Zero-shot Time Series Forecasting

从特征到动态:面向零样本时间序列预测的特征空间到自回归策略

Yifan Wu, Junjie Wu, Kai Wu, Xiaoyu Zhang, Jian Lou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出FSA框架,通过从可解释特征空间到自回归策略空间的映射,在零样本单变量时间序列预测中引入显式归纳偏置,分离全局趋势、周期成分和局部动态,以更少数据假设实现跨域泛化,在受控实验中优于Transformer架构。

详情
AI中文摘要

零样本时间序列预测旨在预测未见序列的未来值,要求模型泛化超出训练分布的时间动态。虽然近期的基础模型通过大规模预训练实现了强大的域内性能,但其有效性通常依赖于广泛的数据覆盖和隐式模式记忆,当数据稀缺或源域与目标域不重叠时,这可能会限制泛化能力。在这项工作中,我们提出了FSA,一种用于受控零样本单变量预测的特征到策略框架。FSA不直接在观测空间中对原始序列建模,而是学习从可解释特征空间到自回归策略空间的结构化映射。这种设计引入了显式归纳偏置,将全局趋势、周期成分和局部时间动态分离,使模型能够以更少的数据假设捕获可迁移的时间序列结构。实验结果表明,在相同的预训练数据、训练协议和可比较的参数预算下,FSA在我们的受控零样本设置中优于基于Transformer的架构。

英文摘要

Zero-shot time series forecasting aims to predict future values for previously unseen series, requiring models to generalize temporal dynamics beyond the training distribution. While recent foundation models achieve strong in-domain performance through large-scale pretraining, their effectiveness often relies on broad data coverage and implicit pattern memorization, which can limit generalization when data are scarce or source and target domains are disjoint. In this work, we propose FSA, a feature-to-strategy framework for controlled zero-shot univariate forecasting. Instead of directly modeling raw sequences in the observation space, FSA learns a structured mapping from an interpretable feature space to an autoregressive strategy space. This design introduces explicit inductive biases that disentangle global trends, periodic components, and local temporal dynamics, enabling the model to capture transferable time-series structure with fewer data assumptions. Empirical results show that, under identical pretraining data, training protocol, and comparable parameter budgets, FSA outperforms Transformer-based architectures in our controlled zero-shot setting.

2606.01286 2026-06-02 cs.SE cs.AI cs.CL cs.LG 版本更新

BenchEvolver: Frontier Task Synthesis via Solution-Centric Evolution

BenchEvolver: 通过以解决方案为中心的进化进行前沿任务合成

Yangzhen Wu, Aaron J. Li, Wenjie Ma, Li Cao, Ziheng Zhou, Mert Cemri, Shu Liu, Yuran Xiu, Chenxiao Yan, Haikun Zhao, Bin Yu, Ion Stoica, Dawn Song

发表机构 * University of California, Berkeley(加州大学伯克利分校) Institute for Interdisciplinary Information Sciences, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出BenchEvolver框架,通过进化参考解决方案自动生成更难的编程问题,以解决基准饱和问题,并在LiveCodeBench和SciCode上验证其有效性。

详情
AI中文摘要

前沿大语言模型的快速进步导致了广泛的基准饱和,限制了现有数据集区分模型能力或提供有用训练信号的能力。例如,在LiveCodeBench上,前沿模型在简单拆分上达到超过99%的Pass@1,在不同难度级别上平均超过90%的Pass@1。构建新的、具有挑战性的数据集通常需要大量人力,成为进步的瓶颈。我们引入了BenchEvolver,一个以解决方案为中心的进化框架,自动将现有编码问题转化为更难的变体。BenchEvolver不是从头生成问题,而是通过结构化变换进化参考解决方案,并从进化后的解决方案中推导出相应的描述和测试。这种设计将生成过程基于可执行语义,使得能够可扩展地构建高质量、多样化和困难的任务,并具有可验证的正确性。将BenchEvolver应用于LiveCodeBench和SciCode,我们获得了显著更难的进化任务,同时保持了有效性、参考正确性和多样性。我们进一步策划了LiveCodeBench-Plus,一个包含91个问题的基准,结合了进化后的任务和困难的原始LCB-v6任务,其中前沿模型的Pass@1范围从27.5%到62.6%,恢复了强编码模型之间的清晰区分。重要的是,即使对于生成它们的模型,进化后的任务仍然具有挑战性,从而实现了自我改进。我们进一步表明,在进化后的LCB任务上进行强化学习提高了留出编码性能:对于gpt-oss-20b,种子+进化训练在LCB v6 Hard和LCB-Pro Easy上分别获得了+8.7和+8.3的Pass@1提升,分别超过仅种子训练的70.7%和34.8%。我们的结果表明,BenchEvolver可以将饱和的基准转化为前沿级别的评估套件和可重用的训练信号。

英文摘要

The rapid progress of frontier large language models has led to widespread benchmark saturation, limiting the ability of existing datasets to differentiate model capabilities or provide useful training signal. For instance, on LiveCodeBench, frontier models achieve over 99% Pass@1 on easy splits and exceed 90% Pass@1 on average across difficulty levels. Constructing new, challenging datasets typically requires substantial human effort, creating a bottleneck for progress. We introduce BenchEvolver, a solution-centric evolutionary framework that automatically transforms existing coding problems into harder variants. Rather than generating problems from scratch, BenchEvolver evolves reference solutions through structured transformations and derives corresponding statements and tests from the evolved solutions. This design grounds generation in executable semantics, enabling scalable construction of high-quality, diverse, and difficult tasks with verifiable correctness. Applying BenchEvolver to LiveCodeBench and SciCode, we obtain evolved tasks that are substantially harder while maintaining validity, reference correctness, and diversity. We further curate LiveCodeBench-Plus, a 91-problem benchmark combining evolved and difficult original LCB-v6 tasks, where frontier-model Pass@1 ranges from 27.5% to 62.6%, restoring clear discrimination among strong coding models. Importantly, evolved tasks remain challenging even for the model that generates them, enabling self-improvement. We further show that RL on evolved LCB tasks improves held-out coding performance: for gpt-oss-20b, seed+evolved training achieves +8.7 and +8.3 Pass@1 gains on LCB v6 Hard and LCB-Pro Easy, exceeding seed-only gains by 70.7% and 34.8%, respectively. Our results show that BenchEvolver can convert saturated benchmarks into frontier-level evaluation suites and reusable training signal.

2606.01283 2026-06-02 cs.LG 版本更新

AdaKernel: Learning Adaptive Kernel Parameters for Spatiotemporal Graph Neural Networks

AdaKernel: 为时空图神经网络学习自适应核参数

Zhongyue Zhang, Guangyin Jin, Yuxuan Liang, Suwan Yin, Yuankai Wu

发表机构 * Sichuan University(四川大学) PLA Academy of Military Science(中国人民解放军军事科学院) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 针对图神经网络中固定核参数导致模型容量受限的问题,提出AdaKernel方法,通过结构保持策略学习自适应核参数,在数据稀疏场景下优于固定先验和全隐式图结构方法。

Comments 17 pages, 15 figures, including appendix

详情
AI中文摘要

建模空间依赖性是使用图神经网络(GNN)进行时空数据分析的核心。传统方法依赖于具有预定义参数的基于距离的核,这限制了模型容量。尽管通用自适应机制(如图注意力网络)提供了灵活性,但它们通常无法捕捉潜在的几何结构,在数据稀疏场景下表现不如基于距离的模型。针对这一问题,我们重新审视核参数化问题,并从理论上证明,错误指定的核参数会在GNN中引入不可避免的近似误差。为了克服这一困难,我们提出AdaKernel,一种简单而有效的方法,在神经网络内学习自适应核参数。与从头学习图结构的方法不同,AdaKernel采用结构保持策略,优化物理相互作用的尺度而非丢弃它们。在克里金插值、数据填补和预测上的大量实验表明,AdaKernel持续改进各种GNN架构,并优于模型无关的自适应基线,验证了准确学习的核参数优于固定先验和完全隐式图结构。

英文摘要

Modeling spatial dependencies is central to spatiotemporal data analysis using Graph Neural Networks (GNNs). Traditional methods rely on distance-based kernels with predefined parameters, which restricts model capacity. Although generic adaptive mechanisms (e.g., Graph Attention Networks) offer flexibility, they often fail to capture the underlying geometric structure, performing worse than distance-based models in data-sparse scenarios. Addressing this, we revisit the kernel parameterization problem and theoretically prove that misspecified kernel parameters introduce unavoidable approximation errors in GNNs. To overcome this, we propose AdaKernel, a simple yet effective approach that learns adaptive kernel parameters within the neural network. Unlike methods that learn graph structures from scratch, AdaKernel adopts a structure-preserving strategy that optimizes the scale of physical interactions rather than discarding them. Extensive experiments on Kriging, Imputation, and Forecasting demonstrate that AdaKernel consistently improves various GNN architectures and outperforms model-agnostic adaptive baselines, validating that accurately learned kernel parameters are superior to both fixed priors and fully latent graph structures.

2606.01282 2026-06-02 cs.CV cs.CY cs.LG 版本更新

KG-FairDiff: Knowledge Graph-Guided Prompt Refinement for Demographically Fair Text-to-Image Generation

KG-FairDiff: 知识图谱引导的提示词精炼用于人口统计公平的文本到图像生成

Farbod Davoodi, Seyed Reza Tavakoli Shiyadeh, Pooria Safaei, Sana Harighi, Parsa Gholami, Amirali Amini, Kimia Vanaei, Emad Firoozi, Parham Abed Azad, Babak Khalaj, Siavash Ahmadi, Amir Hossein Payberah, Mohammad Hossein Rohban, Soheil Kolouri, Ali Diba

发表机构 * University of Science and Technology of China(中国科学技术大学) Sharif University of Technology(谢赫·伊斯兰大学) Iran University of Science and Technology(伊朗科学技术大学)

AI总结 提出KG-FairDiff框架,通过知识图谱引导的提示词精炼,在推理时优化公平性损失,减少文本到图像生成中的性别、种族、年龄等人口统计偏差,同时保持语义保真度。

详情
AI中文摘要

文本到图像(TTI)系统现已成为新闻、教育、广告和公共传播的日常基础设施,它们从训练数据中继承的人口统计和文化刻板印象(将女性、有色人种、老年人和非西方文化描绘为代表性不足或漫画化)在部署规模上成为人口层面的危害。现有的缓解措施要么需要昂贵的重新训练,这对于主导消费产品的闭源骨干网络不可行,要么依赖于忽略文化背景的固定人口统计模板。我们提出了KG-FairDiff,一个模型无关、推理时框架,将公平感知的提示词精炼形式化为一个约束优化问题,并将其实现为一个闭环流水线:一个包含约1200个文化和偏见相关三元组的知识图谱检索结构化上下文,一个LLM改写器提出精炼,一个验证器仅接受那些减少基于散度的公平性损失同时保持用户原始意图语义保真度的提示词。我们证明了精炼循环的有限终止界限,贡献了一个数学上一致的评估套件,将Bias-P/Bias-W与目标分布的散度以及ENS与KL散度联系起来,并审计了八个广泛部署的骨干生成器。KG-FairDiff显著减少了性别、种族、年龄和交叉差异,同时保持了提示词语义,为更公平的生成式AI提供了一条实用、可部署的路径。

英文摘要

Text-to-Image (TTI) systems are now everyday infrastructure for journalism, education, advertising, and public communication, and the demographic and cultural stereotypes they inherit from training data (rendering women, people of colour, older adults, and non-Western cultures as under-represented or caricatured) become a population-level harm at deployment scale. Existing mitigations either require costly retraining, infeasible for the closed-source backbones that dominate consumer products, or rely on fixed demographic templates that ignore cultural context. We present KG-FairDiff, a model-agnostic, inference-time framework that formalises fairness-aware prompt refinement as a constrained optimisation problem and operationalises it as a closed-loop pipeline: a knowledge graph of ~1,200 culture- and bias-related triples retrieves structured context, an LLM rewriter proposes refinements, and a validator accepts only prompts that reduce a divergence-based fairness loss while preserving semantic fidelity to the user's original intent. We prove a finite-termination bound for the refinement loop, contribute a mathematically consistent evaluation suite linking Bias-P/Bias-W to divergence from target distributions and ENS to KL divergence, and audit eight widely-deployed backbone generators. KG-FairDiff substantially reduces gender, race, age, and intersectional disparities while preserving prompt semantics, offering a practical, deployment-ready route to more equitable generative AI.

2606.01281 2026-06-02 cs.LG cs.AI 版本更新

RLVR without Ineffective Samples: Group Prioritized Off-Policy Optimization for LLM Reasoning

RLVR 无需无效样本:面向 LLM 推理的群体优先级离策略优化

Yixiu Mao, Yun Qu, Qi Wang, Heming Zou, Xiangyang Ji

发表机构 * Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对强化学习中无效样本导致学习信号不足的问题,提出群体优先级离策略优化(POPO),通过优先级群体重放和解耦重要性采样,在不增加额外采样开销的情况下提升推理性能。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的强大范式。然而,其有效性受到无效训练数据普遍存在的严重阻碍:许多采样提示产生的响应群体要么完全正确,要么完全错误,导致奖励零方差和学习信号有限。最近的先进方法通过大量LLM rollout来过滤无效样本以解决此问题,但代价是相当大的计算开销。替代方法,包括预测性采样和轨迹重放,旨在提高数据效率,但往往仍不充分,并可能引入额外问题,如系统性偏差或次优约束。为解决这些局限性,我们提出了群体优先级离策略优化(POPO),一个简单而有效的框架,无需额外rollout开销即可充分利用有效训练批次。POPO包含两个关键组件:优先级群体重放和解耦离策略优化。前者通过基于近因的重放机制,联合考虑样本质量和离策略程度,用有效的离策略群体替换无效的在策略群体。为进一步缩小离策略差距,POPO采用解耦重要性采样来校正离策略偏差,同时在一致的信任区域约束下保持稳定的策略更新。在包括数学、规划和视觉几何在内的多种推理任务上的实证评估表明,POPO显著加速了RL微调,并在显著减少rollout的情况下实现了强大的推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for enhancing the reasoning capabilities of large language models (LLMs). However, its effectiveness is substantially hindered by the prevalence of ineffective training data: many sampled prompts yield response groups that are either entirely correct or entirely incorrect, resulting in zero-variance rewards and limited learning signals. Recent state-of-the-art methods address this issue through extensive LLM rollouts to filter ineffective samples, but at the cost of considerable computational overhead. Alternative approaches, including predictive sampling and trajectory replay, aim to improve data efficiency but often remain insufficient and may introduce additional issues such as systematic bias or suboptimal constraints. To address these limitations, we propose Group Prioritized Off-Policy Optimization (POPO), a simple yet effective framework that fully exploits effective training batches without additional rollout overhead. POPO comprises two key components: prioritized group replay and decoupled off-policy optimization. The former replaces ineffective on-policy groups with effective off-policy groups via a recency-based replay mechanism that jointly considers sample quality and the degree of off-policiness. To further mitigate the off-policy gap, POPO employs decoupled importance sampling to correct off-policy bias while maintaining stable policy updates under consistent trust-region constraints. Empirical evaluations across diverse reasoning tasks, including mathematics, planning, and visual geometry, demonstrate that POPO substantially accelerates RL finetuning and achieves strong reasoning performance with significantly fewer rollouts.

2606.01273 2026-06-02 cs.LG 版本更新

GLIDE: Graph-guided Leap Inference for Diffusion Estimation of Spatio-Temporal Point Processes

GLIDE: 面向时空点过程扩散估计的图引导跳跃推理

Guanyu Zhou, Yao Liu, Yanglei Gan, Yuxiang Cai, Peng He, Run Lin, Yuxiang Liu, Qiao Liu

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 提出GLIDE框架,利用多尺度历史图编码和双流架构作为条件,结合先验引导的跳跃推理机制,实现高效且准确的时空点过程下一个事件建模与预测。

详情
AI中文摘要

时空点过程(STPPs)为连续时间和空间中的异步事件建模提供了原则性框架。最近的扩散方法通过建模复杂条件分布,为确定性预测提供了灵活的替代方案,但其在STPPs中的应用仍面临挑战:从纯噪声中反向采样成本高昂,且稀疏空间域中弱结构约束可能导致概率质量定位不佳。我们提出 extbf{GLIDE}(图引导跳跃推理扩散估计),一种用于STPPs中下一个事件建模的条件扩散框架。GLIDE将历史事件组织成多尺度历史图,并通过双流架构编码时间演化和空间拓扑,为双分支扩散去噪器提供结构化条件上下文。它进一步引入先验引导的跳跃推理机制,其中轻量级均值预测器提供确定性锚点,反向过程从中间扩散步骤而非纯高斯噪声开始。在多个真实世界数据集上的实验表明,GLIDE改进了分布拟合和下一个事件预测,其中空间方面的提升最大。结果还表明,先验引导的跳跃推理大幅降低了反向采样成本,同时保留了扩散模型的随机生成能力。

英文摘要

Spatio-temporal point processes (STPPs) provide a principled framework for modeling asynchronous events in continuous time and space. Recent diffusion-based approaches offer a flexible alternative to deterministic prediction by modeling complex conditional distributions, but their application to STPPs remains challenging: reverse sampling from pure noise is costly, and weak structural constraints in sparse spatial domains can lead to poorly localized probability mass. We propose \textbf{GLIDE} (Graph-guided Leap Inference for Diffusion Estimation), a conditional diffusion framework for next-event modeling in STPPs. GLIDE organizes historical events into a multi-scale historical graph and encodes temporal evolution and spatial topology through a dual-stream architecture, yielding a structured conditioning context for a dual-branch diffusion denoiser. It further introduces a prior-guided leap inference mechanism, in which a lightweight mean predictor provides a deterministic anchor and the reverse process starts from an intermediate diffusion step instead of from pure Gaussian noise. Experiments on multiple real-world datasets show that GLIDE improves both distribution fitting and next-event prediction, with the largest gains appearing on the spatial side. The results also indicate that prior-guided leap inference substantially reduces reverse-sampling cost while preserving the stochastic generation capability of diffusion models.

2606.01265 2026-06-02 cs.LG cs.AI 版本更新

PALTO: Physics-Informed Active Learning for Tri-Gate FinFET Design Optimization for Vertical Power Delivery

PALTO:面向垂直供电的Tri-Gate FinFET设计优化的物理信息主动学习

Ayoub Sadeghi, Leonid Popryho, Inna Partin-Vaisband

发表机构 * University of Illinois Chicago(伊利诺伊大学香槟分校) Center for Heterogeneous Integration of Micro Electronic Systems(微电子异构集成中心) Joint University Microelectronics Program (JUMP) 2.0(联合大学微电子计划(JUMP)2.0) Semiconductor Research Corporation (SRC)(半导体研究公司(SRC)) Defense Advanced Research Project Agency (DARPA)(国防高级研究计划局(DARPA))

AI总结 提出物理信息主动学习框架,高效探索GaN tri-gate FinFET的高维设计空间,优化关键结构参数(如GaN-to-AlGaN厚度比),发现两种优化器件,其中D1在300-fin配置下驱动电流和开关效率优于D2。

详情
AI中文摘要

本文展示了机器学习驱动优化在垂直供电系统中设计特定应用的GaN三栅极FinFET的有效性。传统的基于TCAD的方法计算量大,且不足以导航先进GaN器件的高维非线性设计空间。为此,采用物理信息主动学习框架智能引导仿真,在保持精度的同时加速收敛。这种ML引导的方法通过高效探索关键结构参数——尤其是GaN-to-AlGaN厚度比(器件设计中长期争论的焦点)——来发现最优配置。通过系统探索关键结构参数,确定了两种具有激进缩放的栅漏长度的优化器件。单鳍多通道仿真表明,相对于AlGaN势垒具有更薄GaN沟道的器件D2实现了更高的驱动电流。然而,在300鳍配置中,器件D1以0.49欧姆导通电阻提供3.3A电流,性能约为D2的2倍,尽管寄生参数略高。两种器件均工作在常关模式。基于特定应用品质因数,器件D1达到5 pC·欧姆,开关效率比D2高2倍,而两种设计在不同性能指标上均优于工业基准。

英文摘要

This paper demonstrates the effectiveness of machine learning-driven optimization for designing application-specific GaN tri-gate FinFETs in vertical power delivery systems. Conventional TCAD-based approaches are computationally intensive and insufficient for navigating the high-dimensional, nonlinear design space of advanced GaN devices. To address this, a physics-informed active learning framework is used to intelligently guide simulations, accelerating convergence while preserving accuracy. This ML-guided approach enables the discovery of optimal configurations by efficiently exploring key structural parameters -- most notably the GaN-to-AlGaN thickness ratio -- a long-standing focus of debate in device design. By systematically exploring key structural parameters, two optimized devices with aggressively scaled gate-to-drain lengths are identified. Single-fin, multi-channel simulations show that device~D2, with a thinner GaN channel relative to the AlGaN barrier, achieves higher drive current. However, in a 300-fin configuration, device~D1 outperforms device~D2 by delivering 3.3\,A at 0.49~ohm on-resistance -- approximately 2$\times$ better -- despite slightly higher parasitics. Both devices operate in a normally-off mode. Based on an application-specific figure of merit, device~D1 achieves 5\,pC$\cdot$ohm, demonstrating 2$\times$ greater switching efficiency than device~D2, while both designs outperform industrial benchmarks from different performance standpoints.

2606.01258 2026-06-02 cs.LG cs.CL eess.SP 版本更新

Beyond Sinusoids: A Morlet Wavelet Framework for Transformer Positional Encoding

超越正弦波:基于Morlet小波的Transformer位置编码框架

Athanasios Zeris

发表机构 * Independent Researcher(独立研究者) Athens, Greece(希腊雅典)

AI总结 提出Morlet位置编码(MoPE),通过可学习的频率和局部带宽统一了正弦位置编码和旋转位置编码,并在TinyShakespeare上结合能量门控注意力提升了0.119的性能。

Comments 16 pages, 4 figures, 4 tables

详情
AI中文摘要

标准Transformer位置编码——正弦编码和旋转位置编码(RoPE)——将每个位置视为同等局部:它们编码了标记的位置,但未编码其位置影响应延伸多远。我们提出Morlet小波(同时最小化位置和频率的不确定性)是位置编码的自然基础,并引入Morlet位置编码(MoPE):每个嵌入维度从数据中学习其自身的频率和局部带宽。主要理论结果是统一:当局部性关闭时(sigma_i -> 无穷大),正弦PE和RoPE相关核都作为MoPE的极限情况出现。MoPE的相位精确恢复了RoPE旋转角度;幅度增加了一个标准编码所缺乏的可学习高斯局部核。实验上,MoPE结合能量门控注意力在TinyShakespeare上比标准注意力提升了0.119,优于任一单独组件。对学习参数的分析显示,所有128个频率-带宽对收敛到小波可容许边界——这一经验观察与关于能量门控的伴随结果一致,表明字符级语言信号的一个可重现性质,值得进一步研究。

英文摘要

Standard positional encodings for transformers - sinusoidal and rotary (RoPE) - treat every position as equally local: they encode where a token is, but not how far its positional influence should extend. We propose that the Morlet wavelet, which simultaneously minimises uncertainty in position and frequency, is the natural basis for positional encoding, and introduce Morlet Positional Encoding (MoPE): each embedding dimension learns its own frequency and locality bandwidth from data. The main theoretical result is a unification: sinusoidal PE and the RoPE correlation kernel both emerge as limiting cases of MoPE when locality is switched off (sigma_i -> infinity). The phase of MoPE recovers the RoPE rotation angle exactly; the amplitude adds a learned Gaussian locality kernel that standard encodings lack. Empirically, MoPE combined with Energy-Gated Attention achieves +0.119 improvement over standard attention on TinyShakespeare, outperforming either component alone. Analysis of the learned parameters reveals that all 128 frequency-bandwidth pairs converge to the wavelet admissibility boundary - an empirical observation consistent with a companion result on energy gating, suggesting a reproducible property of character-level language signals that warrants further investigation.

2606.01256 2026-06-02 stat.ML cs.LG stat.ME 版本更新

Distribution-free changepoint localization after sequential change detection

顺序变化检测后的无分布变化点定位

Aytijhya Saha, Aaditya Ramdas

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文提出了一种无分布框架,在停止顺序变化检测程序后构建变化点的后检测置信集,无需任何分布假设,并保证了有限样本覆盖率和渐近有界期望大小。

详情
AI中文摘要

本文介绍了一种无分布框架,用于在停止顺序变化检测程序后构建变化点的后检测置信集。众所周知,共形测试鞅可用于顺序检测分布变化,但其本身不提供对声称变化发生时间的推断。以往关于后检测推断的工作需要已知变化前和变化后的分布类别,而本文在没有任何分布假设的情况下实现了变化点的定位。我们建立了有限样本覆盖保证(条件于正确检测)。我们给出了置信集条件期望大小的非渐近界。在合适的渐近机制下,我们证明了置信集的条件期望大小一致有界,并在模拟和真实数据上展示了强大的实证性能。据我们所知,这是第一个具有有效后检测覆盖保证的通用无分布顺序变化点定位框架。

英文摘要

This paper introduces a distribution-free framework for constructing post-detection confidence sets for changepoints after stopping a sequential change detection procedure. It is well known that conformal test martingales can be used to sequentially detect changes in distribution, but by themselves provide no inference for the time at which a proclaimed change occurred. Past work on post-detection inference requires pre- and post-change classes of distributions to be known, but this paper accomplishes localization of the changepoint without any distributional assumptions. We establish finite-sample coverage guarantees (conditional on correct detection). We provide non-asymptotic bounds on the conditional expected size of the confidence sets. Under suitable asymptotic regimes, we proved that the conditional expected size of the confidence set remains uniformly bounded. and demonstrate strong empirical performance on simulated and real data. To the best of our knowledge, this is the first general distribution-free framework for sequential changepoint localization with a valid post-detection coverage guarantee.

2606.01244 2026-06-02 stat.ML cs.LG cs.NA math.FA math.NA math.ST stat.TH 版本更新

Efficient Approximation for Encoder--Decoder Neural Operators via Variation Spaces

基于变分空间的编码器-解码器神经算子的高效逼近

Jia-Qi Yang, Lei Shi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 通过引入变分空间作为非线性算子的无穷维结构类,建立了编码器-解码器双层网络在Bochner L^q范数下的逼近界,误差分解为输入编码误差、输出编码误差和N^{-1/2}阶有限宽逼近项,为超越一般Lipschitz或Fréchet可微算子类的高效神经算子学习提供了理论保证。

Comments 14 pages

详情
AI中文摘要

我们研究使用编码器-解码器神经网络的算子学习。受神经网络函数空间理论的启发,我们引入变分空间作为非线性算子的无穷维结构类。该空间通过直接在输入和输出空间上的向量值测度定义。对于该空间中的算子,我们建立了编码器-解码器双层网络在Bochner $L^q$ 范数下的逼近界。得到的误差界分解为输入编码误差、输出编码误差和一个阶为 $N^{-1/2}$ 的有限宽逼近项,其常数与输入和输出编码维度无关。当输入和输出编码误差在编码维度上呈多项式衰减时,这些估计产生代数逼近和学习速率。结果为超越一般Lipschitz或Fréchet可微算子类的高效神经算子学习提供了理论保证。

英文摘要

We study operator learning using encoder--decoder neural networks. Inspired by the function-space theory of neural networks, we introduce a variation space as an infinite-dimensional structural class for nonlinear operators. This space is defined through vector-valued measures directly on the input and output spaces. For operators in this space, we establish approximation bounds for encoder--decoder two-layer networks in the Bochner $L^q$ norm. The resulting error bound decomposes into the input encoding error, the output encoding error, and a finite-width approximation term of order $N^{-1/2}$, with a constant independent of the input and output encoding dimensions. When the input and output encoding errors decay polynomially in the encoding dimensions, these estimates yield algebraic approximation and learning rates. The results provide an theoretical guarantees for efficient neural operator learning beyond general Lipschitz or Fréchet differentiable operator classes.

2606.01243 2026-06-02 cs.CL cs.LG 版本更新

Unlocking the Black Box of Latent Reasoning: An Interpretability-Guided Approach to Intervention

解锁潜在推理的黑箱:一种可解释性引导的干预方法

Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma, Qingyang Liu, Zhaohe Liao, Yibo Miao, Li Niu

发表机构 * Shanghai Jiao Tong University(上海交通大学) Fudan University(复旦大学)

AI总结 本文通过结构、因果和几何探针分析潜在推理向量的可解释性,并基于此提出无需训练的解码时干预方法,提升大语言模型推理准确性。

详情
Journal ref
ACL2026 Main
AI中文摘要

潜在推理使大型语言模型(LLMs)能够在连续隐藏状态内执行多步推理,相比显式思维链(CoT)提供了效率提升。然而,这些连续思维向量的不透明性阻碍了其可靠性和可控性。本文弥合了机械可解释性与可操作控制之间的差距。我们首先使用结构、因果和几何探针进行系统分析,揭示潜在向量编码了推理步骤的压缩、忠实表示,其中早期向量作为关键因果枢纽。在此基础上,我们将这些可解释性见解操作化为一套无需训练、解码时干预的方法,通过施加已识别的几何和语义先验来优化潜在推理过程。跨多个模型规模和不同任务领域的广泛实验表明,我们的方法持续提高了推理准确性。我们的可解释性引导干预一致地解锁了潜在能力,并在没有任何参数更新的情况下提高了推理准确性。

英文摘要

Latent reasoning enables Large Language Models (LLMs) to perform multi-step inference within continuous hidden states, offering efficiency gains over explicit Chain-of-Thought (CoT). However, the opacity of these continuous thought vectors hinders their reliability and controllability. This paper bridges the gap between mechanistic interpretability and actionable control. We first present a systematic analysis using structural, causal, and geometric probes, revealing that latent vectors encode compressed, faithful representations of reasoning steps, with early vectors acting as critical causal hubs. Building on this, we operationalize these interpretability insights into a suite of training-free, decode-time interventions that refine the latent reasoning process by imposing the identified geometric and semantic priors. Extensive experiments across multiple model scales and diverse task domains demonstrate that our approaches consistently improve reasoning accuracy. Our interpretability-guided interventions consistently unlock latent capabilities and improve reasoning accuracy without any parameter updates.

2605.04819 2026-06-02 cs.LG 版本更新

Unsat Core Prediction through Polarity-Aware Representation Learning over Clause-Literal Hypergraphs

通过子句-文字超图上的极性感知表示学习进行不可满足核心预测

Zhenchao Sun, Shuai Ma, Ping Lu, Chongyang Tao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种极性感知的表示学习框架,将SAT公式建模为子句-文字超图,通过极性感知分解和极性反转一致性正则化,有效预测不可满足核心。

Comments Accepted at ICML 2026

详情
AI中文摘要

图神经网络已广泛用于布尔可满足性(SAT)任务中,以从SAT公式中学习结构信息。这些研究的目标是解决SAT实例或增强SAT求解器,包括不可满足核心预测等任务。然而,大多数现有方法将SAT公式建模为二分图或有向无环图,这些方法在捕捉文字和子句之间的子句级和高阶交互方面不够直接。此外,这些方法在建模SAT固有的极性相关属性(如变量的正负文字之间的互补关系)方面存在局限性。为了解决这些局限性,我们提出了一种基于子句-文字超图的极性感知表示学习框架。我们将SAT公式建模为子句-文字超图,并辅以子句关联图以捕捉高阶结构交互。然后,我们引入一种极性感知分解机制,将变量表示分离为极性不变和等变分量,显式建模正负文字之间的关系,并将生成的文字表示沿超图结构传播。我们进一步引入极性反转一致性正则化,以在训练过程中强化极性一致的表示。在多个SAT数据集上的实验结果表明了该方法的有效性。

英文摘要

Graph neural networks have been widely used in Boolean satisfiability (SAT) tasks to learn structural information from SAT formulas. The goal of these studies is to solve SAT instances or to enhance SAT solvers, including tasks such as unsat-core prediction. However, most existing approaches model a SAT formula as a bipartite graph or a directed acyclic graph, which are less direct in capturing clause-level and higher-order interactions among literals and clauses. Moreover, these approaches are limited in modeling intrinsic polarity-related properties of SAT, such as the complementary relationship between the positive and negative literals of a variable. To address these limitations, we propose a polarity-aware representation learning framework over clause-literal hypergraphs. We model SAT formulas as clause-literal hypergraphs augmented with a clause incidence graph to capture higher-order structural interactions. We then introduce a polarity-aware decomposition mechanism that separates variable representations into polarity invariant and equivariant components, explicitly modeling the relationship between positive and negative literals, with the resulting literal representations propagated along the hypergraph structure. We further incorporate a polarity-inversion consistency regularization to reinforce polarity-consistent representations during training. Experimental results on multiple SAT datasets demonstrate the effectiveness of the proposed approach.

2605.04193 2026-06-02 cs.AI cs.LG cs.LO 版本更新

ANDRE: An Attention-based Neuro-symbolic Differentiable Rule Extractor for Inductive Logic Programming

ANDRE:一种基于注意力的神经符号可微规则提取器,用于归纳逻辑编程

Iman Sharifi, Peng Wei, Saber Fallah

发表机构 * Dept. of Mechanical and Aerospace Engineering, George Washington University, USA(机械与航空航天工程系,乔治华盛顿大学) Dept. of Mechanical Engineering Sciences, University of Surrey, UK(机械工程科学系,萨里大学)

AI总结 提出ANDRE框架,通过注意力驱动的可微逻辑算子优化连续规则空间,实现从概率数据中学习一阶逻辑规则,在噪声环境下保持鲁棒性和可解释性。

Comments 35 pages, 8 figures, 10 tables

详情
AI中文摘要

归纳逻辑编程(ILP)旨在从数据中学习可解释的一阶规则,但现有的符号和神经符号方法难以扩展到噪声和概率设置。经典ILP依赖于离散的组合规则搜索,在不确定性下脆弱,而可微ILP方法通常依赖预定义规则模板或不精确的模糊算子,这些算子在推理概率谓词估值时会遭受梯度消失或逻辑结构近似不佳的问题。本文提出基于注意力的神经符号可微规则提取器(ANDRE),一种新颖的ILP框架,通过基于注意力的逻辑算子优化连续规则空间来学习一阶逻辑程序。ANDRE用完全可微的、注意力驱动的合取和析取算子替代规则模板和逻辑算子,这些算子近似逻辑最小-最大语义,从而实现对概率数据的准确、稳定和可解释推理。通过在每条规则内软选择、否定或排除谓词,ANDRE在保持符号结构的同时支持灵活规则归纳。在经典ILP基准、大规模知识库以及带有概率谓词和噪声监督的合成数据集上的大量实验表明,ANDRE达到了有竞争力或更优的预测性能,同时在不确定性下可靠地恢复正确的符号规则。特别是,ANDRE对中等标签噪声保持鲁棒,在规则提取质量和稳定性上显著优于现有可微ILP方法。

英文摘要

Inductive Logic Programming (ILP) aims to learn interpretable first-order rules from data, but existing symbolic and neuro-symbolic approaches struggle to scale to noisy and probabilistic settings. Classical ILP relies on discrete combinatorial rule search and is brittle under uncertainty, while differentiable ILP methods typically depend on predefined rule templates or inaccurate fuzzy operators that suffer from vanishing gradients or poor approximation of logical structure when reasoning over probabilistic predicate valuations. This paper proposes an Attention-based Neuro-symbolic Differentiable Rule Extractor (ANDRE), a novel ILP framework that learns first-order logic programs by optimizing over a continuous rule space with attention-based logical operators. ANDRE replaces both rule templates and logical operators with fully differentiable, attention-driven conjunction and disjunction operators that approximate logical min-max semantics, enabling accurate, stable, and interpretable reasoning over probabilistic data. By softly selecting, negating, or excluding predicates within each rule, ANDRE supports flexible rule induction while preserving symbolic structure. Extensive experiments on classical ILP benchmarks, large-scale knowledge bases, and synthetic datasets with probabilistic predicates and noisy supervision demonstrate that ANDRE achieves competitive or superior predictive performance while reliably recovering correct symbolic rules under uncertainty. In particular, ANDRE remains robust to moderate label noise, substantially outperforming existing differentiable ILP methods in both rule extraction quality and stability.

2605.03403 2026-06-02 cs.CV cs.LG 版本更新

GRPO-TTA: Test-Time Visual Tuning for Vision-Language Models via GRPO-Driven Reinforcement Learning

GRPO-TTA:基于GRPO驱动的强化学习进行视觉语言模型的测试时视觉调优

Yujun Li, Hongyuan Zhang, Yuan Yuan

发表机构 * School of Artificial Intelligence, Optics and Electronics (iOPEN), Northwestern Polytechnical University(人工智能、光学与电子学院(iOPEN),西北工业大学)

AI总结 提出GRPO-TTA方法,将GRPO应用于测试时适应,通过将类特定提示预测重构为组策略优化问题,并设计对齐奖励和分散奖励,在多种基准上优于现有方法。

详情
AI中文摘要

组相对策略优化(GRPO)最近在大型语言模型和视觉语言模型的后训练中展现出强大性能。这引发了一个问题:GRPO是否也能显著促进视觉语言模型的测试时适应(TTA)。在本文中,我们提出了用于测试时适应的组相对策略优化(GRPO-TTA),通过将类特定提示预测重构为组策略优化问题,将GRPO适应到TTA设置。具体来说,我们通过从CLIP相似度分布中采样top-K类候选来构建输出组,从而在无需真实标签的情况下实现概率驱动的优化。此外,我们设计了针对测试时适应的奖励函数,包括对齐奖励和分散奖励,以指导有效的视觉编码器调优。在多种基准上的大量实验表明,GRPO-TTA一致优于现有的测试时适应方法,在自然分布偏移下性能提升尤为显著。

英文摘要

Group Relative Policy Optimization (GRPO) has recently shown strong performance in post-training large language models and vision-language models. It raises a question of whether the GRPO also significantly promotes the test-time adaptation (TTA) of vision language models. In this paper, we propose Group Relative Policy Optimization for Test-Time Adaptation (GRPO-TTA), which adapts GRPO to the TTA setting by reformulating class-specific prompt prediction as a group-wise policy optimization problem. Specifically, we construct output groups by sampling top-K class candidates from CLIP similarity distributions, enabling probability-driven optimization without access to ground-truth labels. Moreover, we design reward functions tailored to test-time adaptation, including alignment rewards and dispersion rewards, to guide effective visual encoder tuning. Extensive experiments across diverse benchmarks demonstrate that GRPO-TTA consistently outperforms existing test-time adaptation methods, with notably larger performance gains under natural distribution shifts.

2606.01238 2026-06-02 cs.RO cs.LG 版本更新

Training-Free Imitation Learning with Closed-Form Diffusion Policies

无训练闭环扩散策略的模仿学习

Raghav Mishra, Ian R. Manchester

发表机构 * Australian Center for Robotics, ARIAM Hub, and School of Aerospace, Mechanical and Mechatronic Engineering University of Sydney(澳大利亚机器人中心、ARIAM中心和悉尼大学航空航天、机械与机电工程学院)

AI总结 提出一种基于演示数据集闭式得分的无训练扩散策略(CFDP),实现毫秒级实时模仿学习,性能媲美需数小时训练的神经基线,并支持推理时策略编辑与演示增强。

详情
AI中文摘要

尽管基于扩散的策略具有令人印象深刻的性能和表达能力,但其长时间离线训练拖慢了数据收集和策略部署循环。我们引入了闭环扩散策略(CFDP),这是一类使用从演示数据集导出的闭式得分的无训练扩散策略,用于模仿学习。我们在硬件实验中用移动CPU进行实时推理部署CFDP,表明它能够直接从数据集中毫秒级成功执行模仿,并且推理速度比神经扩散策略更快。在模仿学习基准实验中,我们展示了CFDP与需要数小时训练的神经基线相比具有竞争力,在训练时间和性能之间提供了有利的权衡。最后,我们展示了闭环扩散策略如何作为一种可组合原语,实现对预训练神经扩散策略的数据驱动推理时编辑,包括策略引导和新颖的演示增强。

英文摘要

While diffusion-based policies have impressive performance and expressivity, their long offline training slows down the data collection and policy deployment loop. We introduce Closed-Form Diffusion Policies, a class of training-free diffusion-based policies for imitation learning using the closed-form score derived from the demonstration dataset. We deploy CFDP with real-time inference with a mobile CPU in hardware experiments, showing it can successfully perform imitation directly from the dataset in milliseconds and with faster inference than neural diffusion policies. In experiments on imitation learning benchmarks, we show that CFDP is competitive against neural baselines that require hours of training, providing a favorable tradeoff between training time and performance. Finally, we show how closed-form diffusion policies act as a composable primitive that enables data-driven inference-time editing of pre-trained neural diffusion policies, including policy guidance and novel demonstration augmentation.

2606.01234 2026-06-02 econ.GN cs.CE cs.CV cs.GT cs.LG physics.soc-ph q-fin.EC 版本更新

Differing Roles of Leisure and Productivity in GDP - A Machine Learning based comparative analysis of Germany and USA

休闲与生产力在GDP中的不同作用——基于机器学习的德国与美国比较分析

Achintya Ranjan, Uma Ranjan

发表机构 * Achintya Ranjan(阿金蒂亚·兰詹) Uma Ranjan(乌玛·兰詹)

AI总结 本研究通过随机森林模型分析工作时间和全要素生产率对GDP的影响,并利用Gini重要性、SHAP图和部分依赖图揭示德国与美国社会结构差异在GDP贡献中的体现。

Comments International Conference on Emerging Techniques in Computational Intelligence 2025

详情
AI中文摘要

一个国家的GDP被建模为两个因素之间的相对相互作用——工作时间,反映人口的社会选择,以及全要素生产率,反映对生产力提升因素的集体投资。研究表明,随机森林模型可以从这两个因素准确预测GDP。通过Gini重要性、SHAP图和部分依赖图分析了德国和美国所做的选择差异。结果表明,国家社会结构的差异反映在工作时间和生产率对GDP的相对贡献中。

英文摘要

The GDP of a country is modelled as the relative interaction between two agents - working hours, reflecting the social choice of a population, and Total Factor Productivity, reflecting the collective investment in productivity enhancers. It is shown that a Random Forest model can accu- rately predict the GDP from these two factors. The differences in the choices made by Germany and USA are analysed though Gini importance, SHAP plots and partial dependency. It is shown that the differences in the social structure of the countries are reflected in the relative contribution of working hours and productivity to the GDP.

2606.01227 2026-06-02 cs.LG q-bio.NC 版本更新

DAGGER: Gradient-Free Construction of Transiently Amplifying Networks under Hard Connectivity Constraints

DAGGER: 硬连接约束下瞬态放大网络的无梯度构造

James C. Ferguson

发表机构 * The African Institute for Mathematical Sciences(非洲数学科学研究所) Institute of Science and Technology Austria(奥地利科学技术研究所)

AI总结 提出无梯度单遍算法DAGGER,在硬符号/稀疏/对角约束下构造瞬态放大网络,通过单一标量β控制Wasserstein-2预算实现放大与多重集保留的平滑权衡。

Comments 12 pages, 7 figures

详情
AI中文摘要

许多网络不仅支持而且依赖于瞬态非正态放大,即稳定系统的活动增加数个数量级。在硬符号/稀疏/对角约束(与生物连接组和结构化RNN初始化相关的区域)下构造此类网络,迄今为止需要基于梯度的局部搜索(包含数千次内循环特征分解)或基于Schur形式的直接构造(在抽象基中,投影后破坏约束)。 本文提出DAGGER(有向无环图引导边重加权),一种无梯度单遍算法。给定稳定的有符号稀疏矩阵,DAGGER产生具有相同符号、稀疏性和对角的输出。单一标量β控制Wasserstein-2预算,平滑地权衡精确多重集保留(β=0)与放大;峰值放大随β几乎无界增长,经验上在数值溢出前达到10^10。 在单次前向传递中,DAGGER在多重集保留方面匹配或超过基于梯度的方法(比典型梯度内循环少30-100倍特征分解),并且在中等β下,在精确保持连接性的同时,超过它们数个数量级。我们开发了该算法,将其与现有方法以及下游信号检测任务进行比较,并检查了显示DAGGER在结构上与其他放大网络不同的诊断结果。

英文摘要

Many networks not only support but also rely on transient non-normal amplification, an orders-of-magnitude increase in the activity of an otherwise stable system. Constructing such networks under hard sign/sparsity/diagonal constraints -- the regime relevant for biological connectomes and structured RNN initializations -- has so far required either gradient-based local search with thousands of inner-loop eigendecompositions or Schur-form direct construction in an abstract basis that breaks the constraints under projection. Here we introduce DAGGER (Directed Acyclic Graph Guided Edge Reweighting), a gradient-free single-pass algorithm. Given a stable signed sparse matrix, DAGGER produces an output with the same sign, sparsity, and diagonal. A single scalar $β$ controls a Wasserstein-2 budget that smoothly trades exact multiset preservation ($β= 0$) for amplification; peak amplification grows essentially without bound with $β$, empirically reaching $10^{10}$ before numerical overflow. DAGGER matches or exceeds gradient-based methods at multiset preservation in a single forward pass -- 30-100$\times$ fewer eigendecompositions than a typical gradient inner loop -- and at moderate $β$ beats them by orders of magnitude with connectivity exactly preserved. We develop the algorithm, compare it to the existing methods and on a downstream signal-detection task, and examine the diagnostics that show why DAGGER is structurally different from other amplifying networks.

2606.01221 2026-06-02 cs.LG cs.AI 版本更新

Hybrid Imbalanced Regression Through Unified Data-Level and Algorithm-Level Balancing

混合不平衡回归:统一的数据级与算法级平衡方法

Shermin Shahbazi, Hossein Mohammadi, Mohsen Afsharchi

发表机构 * Zahedan National University(札赫德安国立大学)

AI总结 提出一个五阶段混合框架,结合自适应分箱、条件变分自编码器、特征空间聚类过采样、潜在密度加权损失和注意力门控融合,解决回归中的不平衡问题。

Comments 52 pages, 20 figures, accepted at Expert Systems with Applications

详情
Journal ref
Expert Systems with Applications, Date: 1 August 2026, Article: 131908, Volume: Volume 322
AI中文摘要

不平衡学习是机器学习中的一个关键挑战,其中代表性不足的目标值可能使模型产生偏差,并降低对罕见但重要案例的预测性能。尽管在分类中得到了广泛研究,不平衡回归仍然相对未被充分探索。现有方法主要关注数据级平衡(可能引入噪声和过拟合)或算法级平衡(通常难以处理高度复杂的目标分布)。为了解决这些局限性,我们提出了一个统一的混合框架,将数据级和算法级平衡策略集成到一个与回归器无关的流水线中。该框架包括五个阶段:(1)自适应分箱划分,基于局部线性一致性动态分割目标空间;(2)使用条件变分自编码器进行目标条件表示学习;(3)通过特征空间聚类和少数类过采样进行多阶段数据级平衡;(4)使用新颖的潜在密度加权损失(LDWL)进行算法级平衡,以强调潜在空间和目标空间中的稀有样本;(5)基于注意力的门控融合用于最终回归。在基准数据集上的实验结果表明,与单独的回归器和现有的不平衡回归方法相比,所提出的框架持续提高了预测性能。

英文摘要

Imbalanced learning is a critical challenge in machine learning, where underrepresented target values can bias models and degrade prediction performance on rare but important cases. Although extensively studied in classification, imbalanced regression remains relatively underexplored. Existing methods mainly focus on either data-level balancing, which may introduce noise and overfitting, or algorithm-level balancing, which often struggles with highly complex target distributions. To address these limitations, we propose a unified hybrid framework that integrates both data- and algorithm-level balancing strategies into a regressor-agnostic pipeline. The proposed framework consists of five stages: (1) adaptive bin partitioning to dynamically segment the target space based on local linear coherence; (2) target-conditioned representation learning using a Conditional Variational Autoencoder; (3) multistage data-level balancing through feature-space clustering and oversampling of minority clusters; (4) algorithm-level balancing using a novel Latent-Density Weighted Loss (LDWL) to emphasize rare samples in latent and target spaces; and (5) attention-based gated fusion for final regression. Experimental results on benchmark datasets demonstrate that the proposed framework consistently improves predictive performance compared to standalone regressors and existing imbalanced regression approaches.

2606.01220 2026-06-02 cs.LG cs.AI 版本更新

Fine-Tuning Diffusion Models for Molecular Generation via Reinforcement Learning and Fast Sampling

通过强化学习和快速采样微调扩散模型用于分子生成

Guang Lin, Shikui Tu, Lei Xu

发表机构 * Department of Computer Science and Engineering, Shanghai Jiao Tong University(上海交通大学计算机科学与工程系) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳))

AI总结 提出FTDiff框架,结合组相对策略优化和快速采样机制,微调扩散模型以生成满足多目标药物设计约束的高质量分子。

Comments 13 pages, 7 figures

详情
AI中文摘要

生成同时满足类药性质并符合目标蛋白三维结构的分子是基于结构的药物设计(SBDD)中的核心挑战。然而,现有的生成方法通常依赖于采样过程中昂贵的后处理或训练时需要精心策划的数据集,但增益仍然有限。这些限制在多目标设置中尤为突出,平衡冲突标准仍是一个核心挑战。为了解决这些问题,我们提出了FTDiff,一个专为结构约束下基于扩散的分子生成量身定制的强化学习微调框架。为了确保稳定且样本高效的优化,FTDiff采用了组相对策略优化(GRPO)风格策略。此外,FTDiff基于一个无时间预训练扩散模型,并集成了快速采样机制,减少了去噪步数,在保持生成质量的同时显著加速了训练和推理。通过优化一个固定阈值感知的奖励,FTDiff有效引导模型生成有效、多样且高质量的分子,平衡多个药物设计目标。在基准数据集上的大量实验表明,FTDiff始终优于先前的方法,且无需昂贵的后处理优化或复杂的数据工程。

英文摘要

Generating molecules that simultaneously satisfy drug-like properties and conform to the 3D structure of a target protein is a core challenge in structure-based drug design (SBDD). Existing generative approaches, however, often rely on costly post-hoc processing during Sampling or require carefully curated datasets during training, yet still achieve modest gains. These limitations are especially pronounced in multi-objective settings, where balancing conflicting criteria remains a core challenge. To address these challenges, We propose FTDiff, a reinforcement learning fine-tuning framework tailored for diffusion-based molecular generation under structural constraints. To ensure stable and sample-efficient optimization, FTDiff adopts a group relative policy optimization (GRPO) style strategy. Furthermore, FTDiff builds upon a time-free pretrained diffusion model and incorporates a fast sampling mechanism that reduces the number of denoising steps, significantly accelerating both training and inference while maintaining generation quality. By optimizing a fixed threshold-aware reward, FTDiff effectively guides the model to produce valid, diverse, and high- quality molecules that balance multiple drug design objectives. Extensive experiments on benchmark datasets demonstrate that FTDiff consistently outperforms prior methods, without requiring expensive post-hoc optimization or intricate data engineering.

2606.01217 2026-06-02 cs.CV cs.LG stat.AP 版本更新

Analysis of Ethnic Disparities in Autism Spectrum Disorder among Toddlers

幼儿自闭症谱系障碍中的种族差异分析

Aadithya Prabha Ramaharsha, Deevna Reddy, Uma Ranjan

发表机构 * Sri Ramachandra Institute of Higher Education and Research(Sri Rajachandra高等教育部与研究机构)

AI总结 通过逻辑回归分析,研究种族、行为评分、性别和新生儿黄疸对幼儿自闭症谱系障碍(ASD)的影响,发现白种人ASD风险比亚洲人高81%,中东人低79%,并确认新生儿黄疸和男性为显著风险因素。

Comments Third International Conference Biomedical Engineering Science and technology

详情
AI中文摘要

自闭症谱系障碍(ASD)是一种以沟通和行为挑战为特征的神经发育障碍。本研究考察了种族与ASD特征之间的关系,以及行为评分、性别和新生儿黄疸在三个种族群体(白种人、亚洲人和中东人)中的差异。我们进行了逻辑回归分析,表明种族对ASD发病率有显著影响。与亚洲人相比,白种人患ASD的风险增加81%,而中东人患ASD的风险降低79%。我们还证实了早期研究,即新生儿黄疸是ASD的重要预测因子,而男性儿童患ASD的风险远高于女性儿童。这些结果表明,需要建立考虑种族差异的诊断框架和干预措施,以评估ASD特征的表现和评估。

英文摘要

Autism Spectrum Disorder (ASD) is a neurodevelopmental disorder characterized by challenges in communication and behavior. This study examines the relationship between ethnicity and ASD traits, along with behavioural scores, sex and neonatal jaundice across three ethnic groups: White Europeans, Asians, and Middle Eastern individuals. We perform a logistic regression and show that ethnicity has a significant effect on incidence of ASD. White Europeans are 81% increased risk of ASD and Middle Easterners are at 79\% reduced risk of ASD compared to Asians. We also confirm earlier studied which show that neonatal jaundice is a significant predictor of ASD, while male children are at much higher risk of ASD compared to female children. These results suggest the need for diagnostic frameworks and interventions that account for ethnic in the presentation and assessment of ASD traits

2606.01216 2026-06-02 cs.LG math.OC 版本更新

Riemannian Optimization for Hadamard Products of Low-Rank Matrices

低秩矩阵的Hadamard积的黎曼优化

Pratik Jawanpuria, Ankish Chandresh, Bamdev Mishra

发表机构 * Centre for Machine Intelligence and Data Science, Indian Institute of Technology Bombay, India(机器智能与数据科学中心,印度班加罗尔理工学院,印度) Microsoft India(微软印度)

AI总结 针对低秩矩阵Hadamard积因子的耦合缩放对称性,提出一种基于黎曼商流形的块对角度量,并开发了线性复杂度的梯度下降算法。

详情
AI中文摘要

两个低秩矩阵的逐元素Hadamard积为具有乘法结构的数据提供了一种参数高效的模型,但由于两个因子之间存在耦合的行/列缩放对称性,其建模具有挑战性。为了利用空间几何,我们将此类矩阵的学习问题转化为黎曼商流形上的优化问题。我们提出了一种新的块对角黎曼度量,该度量由Frobenius内积的拉回导出,并证明该度量在完整对称群下不变。我们开发了一种黎曼梯度下降算法,该算法使用无需调参的Gauss-Newton步长,且每次迭代的计算复杂度与观测条目数呈线性关系。在真实和合成数据集上的实验验证了我们提出的黎曼方法的有效性。

英文摘要

The elementwise Hadamard product of two low-rank matrices provides a parameter-efficient model for data with multiplicative structure, but its modeling is challenging due to the presence of additional symmetries under coupled row/column scalings between the two factors. In order to leverage the geometry of the space, we formulate the learning of such matrices as optimization on a Riemannian quotient manifold. We propose a novel block-diagonal Riemannian metric derived from the pullback of the Frobenius inner product. The metric is shown to be invariant under the full symmetry group. We develop a Riemannian gradient descent algorithm that uses a tuning-free Gauss--Newton step size and scales linearly in the number of observed entries per iteration. Experiments on real and synthetic datasets illustrate the efficacy of our proposed Riemannian approach.

2606.01207 2026-06-02 cs.CV cs.LG 版本更新

Feature Alignment Determines Fusion Strategy: A Comparative Study of Cross-Attention and Concatenation in Multimodal Learning

特征对齐决定融合策略:多模态学习中交叉注意力与拼接的比较研究

Zhiqiang Zhou, Xuezhen Xie

发表机构 * Hunan Chemical Industry Vocational and Technical College(湖南化学工业职业技术学院)

AI总结 通过实验和理论分析,证明特征对齐质量而非数据规模是决定多模态融合策略优劣的关键因素,当特征预对齐时拼接优于交叉注意力。

Comments 8 pages,6 figures,4 tables

详情
AI中文摘要

在多模态融合中,交叉注意力与拼接的选择仍由实践者直觉而非原理性理解主导。本文通过使用两个特征提取骨干(ResNet18和CLIP ViT-B/32)在Flickr8k上的控制实验,证明特征对齐质量(而非仅数据规模)是决定哪种融合策略更优的主要因素。当特征通过视觉语言预训练目标预对齐时,在所有测试规模(2048-16384样本)下,拼接比交叉注意力高出4.1-5.1个百分点。我们提供了基于样本复杂度分析的理论解释:拼接需要O(d_v + d_t)个样本来学习其融合投影,而交叉注意力需要O(d_v * d_t)个样本来学习双线性注意力权重,对于512维CLIP特征,后者是前者的256倍以上。当特征已经对齐时,两种方法的近似误差差距消失,拼接的样本效率在所有实际数据集规模上占优。对齐退化研究证实了单调趋势:随着特征对齐退化,拼接的优势从1.3%增长到2.8%。这些发现为多模态系统中的融合方法选择提供了原理性决策框架,对多模态大语言模型的设计具有直接影响。

英文摘要

The choice between cross-attention and concatenation for multimodal fusion remains governed by practitioner intuition rather than principled understanding. In this paper, we demonstrate that feature alignment quality, not data scale alone, is the primary determinant of which fusion strategy excels. Through controlled experiments on Flickr8k using two feature extraction backbones (ResNet18 and CLIP ViT-B/32), we show that concatenation outperforms cross-attention by 4.1-5.1 percentage points across all tested scales (2048-16384 samples) when features are pre-aligned by a vision-language pretraining objective. We provide a theoretical explanation grounded in sample complexity analysis: concatenation requires O(d_v + d_t) samples to learn its fusion projection, while cross-attention requires O(d_v * d_t) samples to learn bilinear attention weights, over 256 times as many for 512-dimensional CLIP features. When features are already aligned, the approximation error gap between the two methods vanishes, and concatenation's sample efficiency dominates at all practical dataset sizes. An alignment degradation study confirms a monotonic trend: as feature alignment degrades, concatenation's advantage grows from 1.3% to 2.8%. These findings provide a principled decision framework for fusion method selection in multimodal systems, with direct implications for the design of Multimodal Large Language Models.

2606.01202 2026-06-02 cs.AI cs.CL cs.LG 版本更新

The Shape of Wisdom: Decision Trajectories in Language Models

智慧的形状:语言模型中的决策轨迹

Shailesh Rana

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过分析三种语言模型在MMLU上的9000条轨迹,提出用答案边际、边际变化和决策翻转距离描述轨迹,发现正确性与稳定性不同,并探究了注意力与MLP标量对边际的影响。

Comments 6 pages, 5 figures. Code and derived artifacts: https://github.com/gut-puncture/The-Shape-of-Wisdom

详情
AI中文摘要

语言模型并非简单地在输出层选择一个答案。在一项包含9000条轨迹的MMLU研究中,涉及Qwen2.5-7B-Instruct、Llama-3.1-8B-Instruct和Mistral-7B-Instruct-v0.3,答案的分数在深度上以结构化方式移动。我们用三个量描述每条轨迹:当前答案边际、该边际的下一层变化,以及距离决策翻转的距离。主要经验图景是正确性和稳定性是不同的:最大的群体是不稳定-正确的,而不是稳定-正确的。然后,一个追踪的子集询问是什么推动了边际。在稳定-正确的情况下,平均注意力标量指向正确的方向,而平均MLP标量则不然;跨度删除显示,移除支持答案的文本会损害边际,而移除类似干扰项的文本则有助于边际。结果并非完整的电路解释。它是一种可重复的方式,用于查看哪些答案已确定,哪些仍然脆弱,以及哪些测量来源推动了它们。

英文摘要

Language models do not simply choose an answer at the output layer. In a 9,000-trajectory MMLU study across Qwen2.5-7B-Instruct, Llama-3.1-8B-Instruct, and Mistral-7B-Instruct-v0.3, the score of the answer moves across depth in structured ways. We describe each trajectory with three quantities: the current answer margin, the next-layer change in that margin, and the distance from a decision flip. The main empirical picture is that correctness and stability are different: the largest group is unstable-correct, not stable-correct. A traced subset then asks what moves the margin. In stable-correct cases, the average attention scalar points in the correct direction, while the average MLP scalar does not; span deletion shows that removing answer-supporting text hurts the margin and removing distractor-like text helps it. The result is not a full circuit explanation. It is a reproducible way to see which answers are settled, which remain fragile, and which measured sources move them.

2606.01198 2026-06-02 cs.LG 版本更新

Linear Strategic Classification with Endogenous Improvements

具有内生改进的线性战略分类

Siddharth Shrivastava, Mahvith Akshintala, B Vamsha Vardhan Reddy, Naresh Manwani, Sujit Gujar, Ganesh Ghalme

发表机构 * Department of Artificial Intelligence(人工智能系) IIT Hyderabad(海得拉巴理工学院) IIIT Hyderabad(海得拉巴理工学院)

AI总结 研究智能体通过修改特征响应分类器时,能产生真实结果改进的战略分类问题,提出线性分类器下的最优决策边界平移方法,并给出PAC保证和实用算法。

详情
AI中文摘要

战略分类研究智能体通过以成本修改可观察特征来响应已部署分类器的设置。经典模型通常将此类响应视为装饰性的:特征可能改变,但真实标签保持不变。我们研究了一种考虑改进的变体,其中战略响应可以引起结果相关特征的真正变化。智能体策略性地选择部署后的特征向量,然后根据一个稳定的条件结果分布生成标签,该分布保留了特征与结果之间的关系。我们在单指标资格模型和线性可分解成本下形式化了线性分类器的这一问题。我们证明,战略最优分类器是通过贝叶斯最优决策边界的平行移动获得的,并且它比贝叶斯分类器为改进感知目标提供了更好的代理。由于改进感知学习需要部署后的标签,而这些标签通常在部署前不可用,我们在预言机模型下提供了PAC风格的保证,提出了一种实用的插件算法,建立了其泛化界,并在合成和真实数据集上进行了评估。

英文摘要

Strategic classification studies settings in which agents respond to a deployed classifier by modifying observable features at a cost. Classical models typically treat such responses as cosmetic: features may change, but true labels remain fixed. We study an improvement-aware variant in which strategic responses can induce genuine changes in outcome-relevant features. Agents choose post-deployment feature vectors strategically, and labels are then generated according to a stable conditional outcome law that preserves the relationship between features and outcomes. We formalize this problem for linear classifiers under a single-index qualification model and linear-decomposable costs. We show that the strategic-optimal classifier is obtained by a parallel shift of the Bayes-optimal decision boundary, and that it provides a better surrogate for the improvement-aware objective than the Bayes classifier. Since improvement-aware learning requires post-deployment labels, which are typically unavailable before deployment, we provide PAC-style guar- antees under an oracle model, propose a practical plug-in algorithm, establish its generalization bound, and evaluate it on synthetic and real-world datasets.

2606.01179 2026-06-02 cs.LG cs.AI 版本更新

Physics-Informed Deep Learning for Entropy Prediction in Heterogeneous Systems: Thermodynamic and Information-Theoretic Case Studies

异质系统中熵预测的物理信息深度学习:热力学与信息论案例研究

Biswajeet Sahoo, Debadutta Patra

发表机构 * Durham University(杜ham大学) Department of Chemical Engineering(化学工程系) Veer Surendra Sai University of Technology(维尔·苏雷纳·赛大学)

AI总结 提出统一物理信息深度学习框架,通过微分方程残差和信息论约束,在单一神经网络中同时实现热力学与信息论系统的熵预测,并验证其数据效率和物理一致性。

详情
AI中文摘要

熵产生支配着物理和信息论系统中的不可逆性和不确定性。尽管物理信息神经网络(PINNs)成功求解微分方程,但当前架构本质上仍是领域特定的。跨根本不同物理定律的领域不变熵表示的提取尚未探索。本文引入了一个统一的物理信息深度学习(PIDL)框架,该框架在单一神经架构中同时强制执行微分方程残差和信息论界限。我们通过两个经典研究来展示该框架:(i)一个热力学连续搅拌釜反应器(CSTR)模型,求解控制常微分方程,其中Softplus约束严格强制执行热力学第二定律;(ii)一个信息论金融市场模型,求解逆Fokker-Planck偏微分方程以推断潜在漂移和扩散系数,通过Softplus约束保证扩散正性,同时自然诱导香农熵。评估了三种模型变体:两个特定领域基线和一种共享编码器架构。PIDL框架保证了绝对的热力学可接受性,零违反第二定律,并表现出卓越的数据效率,仅使用30%的可用训练数据即可保持>90%的预测精度。此外,对学习到的熵表面的事后Ruppeiner黎曼几何分析成功识别了热力学相不稳定性。该方法为物理约束熵建模提供了一个稳健、领域无关的架构,推动了可持续过程设计和定量金融风险评估的应用。

英文摘要

Entropy production governs irreversibility and uncertainty in both physical and information-theoretic systems. While Physics-Informed Neural Networks (PINNs) successfully solve differential equations, current architectures remain inherently domain-specific. The extraction of domain-invariant entropy representations across fundamentally different physical laws remains unexplored. This paper introduces a unified Physics-Informed Deep Learning (PIDL) framework that simultaneously enforces differential equation residuals and information-theoretic bounds within a single neural architecture. We demonstrate this framework via two canonical studies: (i) a thermodynamic continuous stirred-tank reactor (CSTR) model solving governing ODEs, where a Softplus constraint strictly enforces the Second Law of Thermodynamics; and (ii) an information-theoretic financial market model solving the inverse Fokker-Planck PDE to infer latent drift and diffusion coefficients, guaranteeing diffusion positivity via a Softplus constraint while naturally inducing Shannon entropy. Three model variants are evaluated: two domain-specific baselines and one shared-encoder architecture. The PIDL framework guarantees absolute thermodynamic admissibility with zero Second-Law violations and exhibits exceptional data efficiency, retaining >90% predictive accuracy using merely 30% of available training data. Furthermore, a post-hoc Ruppeiner Riemannian geometric analysis of the learned entropy surface successfully identifies thermodynamic phase instabilities. This methodology provides a robust, domain-agnostic architecture for physics-constrained entropy modeling, advancing applications in sustainable process design and quantitative financial risk assessment.

2606.01176 2026-06-02 cs.LG 版本更新

Temporal Motif Signatures for Temporal Graph Neural Networks

时序图神经网络的时序模体特征

Dylan Sandfelder, Mihai Cucuringu, Xiaowen Dong

发表机构 * University of Oxford(牛津大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 针对时序图神经网络难以捕捉短时序模体模式的问题,提出一种紧凑的13维模体特征图,可线性嵌入任意静态或时序编码器,并在多种任务上提升性能。

详情
AI中文摘要

真实时序交互流在短时模体模式(重复、互惠、星型多样性、三元组流)中蕴含预测结构,而普通的时序图神经网络(TGNN)通常无法将其暴露给边评分器。我们在MOOC交互预测中具体展示了这一点:一个由过去窗口星型计数组成的小型四特征族已经提供了相对于强静态GNN的大部分提升。在广泛的实际和合成时序数据集中,我们发现模体活动沿着三个尺度稳定的轴(二元近因/互惠、星型多样性、三元组流)一致地组织,并利用这一经验结构设计了一个紧凑的13维、防泄漏、候选局部模体特征图h(u, v, t),该特征图可线性嵌入任何静态或时序编码器,无需改变架构。时序Weisfeiler-Leman(WL)分析将该增强置于锚定时序WL层次的第一级,并展示了一个候选锚定对,模体特征在该对上具有区分性。我们通过实验证明,相同的增强在异构任务上一致地提升了性能:TGB链路属性预测在所有五个基线上,Bitcoin Alpha/OTC和MOOC上的边分类,以及合成时序生成器的图级分类。

英文摘要

Real temporal interaction streams carry predictive structure in short-horizon motif patterns -- repetition, reciprocity, star diversity, triadic flow -- that vanilla temporal graph neural networks (TGNNs) often fail to expose to their edge scorers. We show this concretely on MOOC interaction prediction, where a small four-feature family of past-window star counts already delivers most of the lift over a strong static GNN. Across a wide set of real and synthetic temporal datasets we find that motif activity organizes consistently along three scale-stable axes (dyadic recency/reciprocity, star diversity, triadic flow), and we use this empirical structure to design a compact 13-coordinate, leakage-safe, candidate-local motif feature map h(u, v, t) that linearly embeds into any static or temporal encoder without architectural changes. A temporal Weisfeiler-Leman (WL) analysis places the augmentation relative to the first level of an anchored temporal-WL hierarchy and exhibits a candidate-anchored pair on which motif features distinguish. We demonstrate empirically that the same augmentation consistently lifts performance across heterogeneous tasks: TGB link-property prediction across all five baselines, edge classification on Bitcoin Alpha/OTC and MOOC, and graph-level classification of synthetic temporal generators.

2606.01159 2026-06-02 cs.LG cs.GT 版本更新

Fairness in two-player zero-sum games with bandit feedback

带赌博反馈的两人零和博弈中的公平性

S Akash, Pratik Gajane

发表机构 * LatentForce.ai Laboratoire d’Informatique Fondamentale d’Orléans(奥尔良基础信息学实验室) University of Orléans(奥尔良大学)

AI总结 研究在公平约束下(每个动作概率至少为α/m)的两人零和博弈,通过重参数化将公平博弈转化为标准零和博弈,提出Fair-ETC-TPZSG算法并证明其遗憾界。

详情
AI中文摘要

我们研究在公平约束下的两人零和博弈(TPZSGs),其中每个动作必须以至少$α/m$的概率被选择。现有的实例相关结果针对$ extit{纯}$纳什均衡,而公平性通常产生$ extit{混合}$均衡,这是一个更难的学习目标。我们的关键技术工具是重参数化:每个公平策略分解为$p = (α/m)\mathbf{1} + (1-α)\widetilde{p}$,其中$\widetilde{p} \in Δ_m$,代入收益形式得到$p^{ op}Aq = \widetilde{p}^{ op}\widetilde{A} q$,其中公平收益矩阵$\widetilde{A} := (1-α)A + α\mathbf{1} c^{ op}$,$c_j = frac{1}{m}\sum_i A(i,j)$是列均值向量。那么$A$上的公平博弈等价于$\widetilde{A}$上的标准零和博弈,因此均衡存在性、KKT结构和LP基稳定性归结为应用于$\widetilde{A}$的经典结果。我们推导了公平最小最大值、公平纳什均衡、公平遗憾以及一个简洁的对偶表示,表明公平代价至多为$α(1-1/m)$,并且当无约束均衡已经具有完全支撑时消失。我们的主要结果是针对$ exttt{Fair-ETC-TPZSG}$算法的$\widetilde{O}(T^{2/3})$遗憾界,该算法适用于一般的混合公平均衡,并讨论了为什么朴素的动作消除不能轻易改进它。当公平均衡具有单一主导动作时,即当$\widetilde{p}^{\star}$是$Δ_m$的顶点时,该界收紧为实例相关的$\widetilde{O}(1/\widetildeΔ(α)^{2})$,其中$\widetildeΔ(α)$是LP边际间隙。

英文摘要

We study two-player zero-sum games (TPZSGs) with bandit feedback under fairness constraints requiring every action to be played with probability at least $α/m$. Existing instance-dependent results target $\textit{pure}$ Nash equilibria, while fairness generically produces $\textit{mixed}$ equilibria, a harder learning target. Our key technical tool is a reparametrization: every fair strategy decomposes as $p = (α/m)\mathbf{1} + (1-α)\widetilde{p}$ with $\widetilde{p} \in Δ_m$, and substituting into the payoff form yields $p^{\top}Aq = \widetilde{p}^{\top}\widetilde{A} q$ for a fair payoff matrix $\widetilde{A} := (1-α)A + α\mathbf{1} c^{\top}$, where $c_j = \tfrac{1}{m}\sum_i A(i,j)$ is the column-mean vector. The fair game on $A$ is then equivalent to a standard zero-sum game on $\widetilde{A}$, so equilibrium existence, KKT structure, and LP basis stability reduce to classical results applied to $\widetilde{A}$. We derive the fair minimax value, fair Nash equilibrium, fair regret, and a clean dual representation showing the price of fairness is at most $α(1-1/m)$ and vanishes whenever the unconstrained equilibrium already has full support. Our main result is an $\widetilde{O}(T^{2/3})$ regret bound for an Explore-Then-Commit algorithm, $\texttt{Fair-ETC-TPZSG}$, applicable to general mixed fair equilibria, together with a discussion of why naive action elimination does not readily improve it. When the fair equilibrium has a single dominant action, equivalently when $\widetilde{p}^{\star}$ is a vertex of $Δ_m$, the bound sharpens to instance-dependent $\widetilde{O}(1/\widetildeΔ(α)^{2})$, where $\widetildeΔ(α)$ is the LP-margin gap.

2606.01155 2026-06-02 cs.LG cs.AI 版本更新

When Data Is Scarce: Scaling Sparse Language Models with Repeated Training

当数据稀缺时:通过重复训练扩展稀疏语言模型

Boqian Wu, Qiao Xiao, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Mykola Pechenizkiy, Elena Mocanu, Torsten Hoefler, Decebal Constantin Mocanu

发表机构 * Eindhoven University of Technology(埃因霍温理工大学) University of Luxembourg(卢森堡大学) University of Twente(埃因霍温大学) ETH Zürich(苏黎世联邦理工学院)

AI总结 研究数据受限下稀疏训练的可扩展性,提出包含活跃参数、唯一标记、数据重复和稀疏度的缩放定律,发现稀疏训练可延迟数据饱和并改善资源权衡。

Comments Accepted at ICML2026

详情
AI中文摘要

密集大语言模型在无限数据下的缩放定律已被充分探索,但稀疏性与有限数据如何相互作用尚未研究。在这项工作中,我们研究了数据受限场景下的稀疏训练,其中有限的唯一标记需要多轮训练。我们的实验涵盖拟合集中最多1.92B参数的模型、最高93.75%的稀疏度、最多2.6B标记的唯一数据预算,以及16轮训练中最多41.6B的总训练标记;我们进一步在保留的密集等价模型(最多7.68B参数)上验证了外推能力。我们发现:1. 数据受限下的稀疏缩放:我们引入了一个缩放定律,将损失建模为活跃参数、唯一标记、数据重复和稀疏度的函数,准确预测跨计算和数据预算的性能。2. 延迟数据饱和:稀疏训练延迟了重复数据带来的收益递减,使多轮训练更有效。3. 资源权衡:在固定数据下,损失最优的稀疏度约为50%,而计算最优的稀疏度更高且随数据规模增长。总体而言,稀疏性不仅是提高效率的工具,也是在数据稀缺下改善缩放权衡的机制。我们的代码可在 https://github.com/boqian333/sparse-dc-scaling 获取。

英文摘要

Scaling laws for dense LLMs under infinite data are well explored, but how sparsity interacts with limited data is not. In this work, we study sparse training in data-constrained regimes where limited unique tokens require multi-epoch training. Our experiments span models up to 1.92B parameters in the fitting set, sparsity up to 93.75%, unique data budgets up to 2.6B tokens, and total training tokens up to 41.6B over 16 epochs; we further validate extrapolation on held-out dense-equivalent models up to 7.68B parameters. We find that: 1. Sparse scaling in data-limited settings: We introduce a scaling law that models loss as a function of active parameters, unique tokens, data repetition, and sparsity, accurately predicting performance across compute and data budgets. 2. Delayed data saturation: sparse training postpones diminishing returns from repeated data, making multi-epoch training more effective. 3. Resource trade-offs: With fixed data, loss-optimal sparsity is moderate ~ 50%, while compute-optimal sparsity is higher and grows with data scale. Overall, sparsity is not just a tool for efficiency, but a mechanism for improving scaling trade-offs under data scarcity. Our code is available at: https://github.com/boqian333/sparse-dc-scaling.

2606.01151 2026-06-02 cs.LG 版本更新

Lagrangian Perturbation Diffusion Steering: Latent Reinforcement Learning for Generative Policies

拉格朗日扰动扩散引导:用于生成策略的潜在强化学习

Hikmet Simsir, Ozgur S. Oguz

发表机构 * University of Michigan(密歇根大学)

AI总结 提出拉格朗日扰动扩散引导(LP-DS),通过学习紧凑的噪声空间扰动来微调冻结的生成策略,利用拉格朗日信任区域目标优化,在保持潜在先验约束的同时提升下游价值,在多个基准上实现样本效率和回报提升。

Comments Accepted as a regular paper at ICML 2026

详情
AI中文摘要

使用高容量生成策略的行为克隆实现了强大的模仿性能,但通常受限于演示覆盖率和分布偏移。直接强化学习微调可以提升性能,但更新大型动作解码器往往不稳定且样本效率低。我们提出拉格朗日扰动扩散引导(LP-DS),一种轻量级自适应方法,通过在解码前学习紧凑的噪声空间扰动来改进冻结的生成策略。LP-DS 使用拉格朗日信任区域目标优化该扰动,在约束与潜在先验偏差的同时提升下游价值。在 RoboMimic 操作、OpenAI Gym 运动和 Adroit 灵巧操作基准上,LP-DS 提高了样本效率、成功率和回报,同时相比无约束噪声空间引导保持了更高的动作空间熵,回报提升高达 25%。使用流匹配骨干、大型视觉-语言-动作模型以及物理 Franka 部署的额外评估表明,LP-DS 不限于紧凑扩散策略或模拟基准。项目页面:https://sites.google.com/view/lp-ds/home。

英文摘要

Behavior cloning with high-capacity generative policies achieves strong imitation performance, but is often limited by demonstration coverage and distribution shift. Direct reinforcement learning fine-tuning can improve performance, but updating large action decoders is frequently unstable and sample inefficient. We propose Lagrangian Perturbation Diffusion Steering (LP-DS), a lightweight adaptation method that improves a frozen generative policy by learning a compact noise-space perturbation before decoding. LP-DS optimizes this perturbation with a Lagrangian trust-region objective, improving downstream value while constraining deviation from the latent prior. Across RoboMimic manipulation, OpenAI Gym locomotion, and Adroit dexterous manipulation benchmarks, LP-DS improves sample efficiency, success, and return while maintaining higher action-space entropy than unconstrained noise-space steering, with return improvements of up to 25% over prior baselines. Additional evaluations with flow-matching backbones, a large vision-language-action model, and physical Franka deployment show that LP-DS is not limited to compact diffusion policies or simulated benchmarks. Project page: https://sites.google.com/view/lp-ds/home.

2606.01134 2026-06-02 eess.AS cs.LG cs.SD 版本更新

Context-aware child-directed speech detection from long-form recordings

基于上下文的儿童导向语音检测:从长时间录音中识别

Théo Charlot, Tarek Kunze, Kaveri K. Sheth, Alejandrina Cristia, Marvin Lavechin

发表机构 * LSCP, DEC, ENS, EHESS, CNRS, PSL University, France(法国社会科学高等学院(LSCP)、法国国家科学研究中心(DEC)、巴黎高等师范学院(ENS)、高等科学研究院(EHESS)、法国国家科学研究中心(CNRS)、巴黎社会科学大学(PSL University))

AI总结 本研究通过微调自监督模型、融入上下文信息以及端到端流水线评估,显著提升了从长时间录音中自动检测儿童导向语音的性能。

Comments 6 pages, 1 figure

详情
AI中文摘要

在长时间录音中自动区分儿童导向语音和成人导向语音,是可扩展分析儿童语言环境的关键。现有方法孤立地处理话语,并且主要针对英语进行评估。我们从三个维度解决这些不足。首先,我们在一个包含182名儿童的多语言数据集上微调并评估了六个自监督模型,表明在儿童中心录音上进行领域内预训练显著优于在成人语音上训练的模型。其次,我们证明融入周围上下文能大幅提升分类性能,平均F1分数绝对提升13.8%。第三,我们在一个现实的端到端流水线中评估我们的模型,从成人语音检测到受话者分类,显示在自动分割下性能有所下降,但仍持续优于基于规则的基线。

英文摘要

Automatically distinguishing child-directed speech from adult-directed speech in long-form recordings is key to scalable analyses of children's language environments. Existing approaches process utterances in isolation and have been evaluated primarily on English. We address these gaps along three dimensions. First, we fine-tune and evaluate six-self supervised models on a multilingual dataset of 182 children, showing that in-domain pre-training on child-centered recordings substantially outperforms models trained on adult speech. Second, we demonstrate that incorporating surrounding context substantially improves classification, with an absolute gain of 13.8% in average F1-score. Third, we evaluate our model in a realistic end-to-end pipeline, from adult speech detection to addressee classification, showing that performance drops under automatic segmentation but still consistently outperforms a rule-based baseline.

2606.01128 2026-06-02 cs.LG 版本更新

Local MixVR: Breaking the Communication-Sample Dependence in Distributed Learning

Local MixVR:打破分布式学习中通信与样本的依赖关系

Tehila Dahan, Bassel Hamoud, Roie Reshef, Martin Jaggi, Kfir Y. Levy

发表机构 * Technion Haifa, Israel(技术离子海法分校,以色列) EPFL Lausanne, Switzerland(洛桑联邦理工学院,瑞士)

AI总结 提出Local MixVR框架,通过局部更新与方差缩减技术消除通信复杂度对样本总数N的依赖,实现仅与工作节点数M相关的通信复杂度,在M<O(N^{1/4})时优于现有最优方法。

详情
AI中文摘要

通信开销是可扩展分布式学习中的关键瓶颈。现有方法如Local SGD、Minibatch SGD及其加速变体虽旨在高效利用数据点,但其通信轮次复杂度仍与总样本数$N$成比例。本文提出Local MixVR,一种将局部更新与方差缩减技术相结合的分布式框架,以减轻局部噪声。我们证明Local MixVR是首个消除通信复杂度对$N$依赖的分布式方法,实现了仅与工作节点数$M$相关的复杂度。在常见场景$M<O\left(N^{1/4}\right)$下,Local MixVR优于当前最优的Minibatch Accelerated SGD基线,填补了分布式优化中长期存在的空白,并建立了通信高效训练的新范式。

英文摘要

Communication overhead is a crucial bottleneck in scalable distributed learning. While existing methods aim to efficiently utilize data points, such as Local SGD, Minibatch SGD, and their accelerated variants, they still exhibit communication-round complexity that scales with the total number of samples $N$. In this paper, we introduce Local MixVR, a distributed framework that integrates local updates with variance-reduction techniques to mitigate local noise. We show that Local MixVR is the first distributed method to eliminate the dependence of communication complexity on $N$, achieving a complexity that scales only with the number of workers $M$. In common regimes where $M<O\left(N^{1/4}\right)$, Local MixVR outperforms the state-of-the-art Minibatch Accelerated SGD baseline, bridging a long-standing gap in distributed optimization and establishing a new paradigm for communication-efficient training.

2606.01126 2026-06-02 cs.LG cs.AI cs.CV 版本更新

STARFISH: faST Accuracy Recovery in pruned networks From Internal State Healing

STARFISH: 从内部状态修复中实现剪枝网络的快速精度恢复

Shir Maon, Odelia Melamed, Adi Shamir

发表机构 * Weizmann Institute of Science(魏茨曼科学研究所)

AI总结 提出STARFISH方法,通过少量无标签校准集优化剪枝网络与原始网络内部状态对齐,高效恢复精度,在ViT网络上优于现有方法。

详情
AI中文摘要

剪枝是一种旨在减少大型神经网络中权重数量的过程。这可以显著加快推理速度,但可能导致模型精度大幅下降,因此通常随后会进行修复过程以恢复部分丢失的精度。在本文中,我们提出了一种新的修复方法STARFISH,它可以高效地恢复任何剪枝网络的(大部分)精度。STARFISH的主要思想是使用少量无标签示例的校准集,优化剪枝网络以与原始网络的内部状态表示对齐。对于去除50%权重的常见情况,在基于ViT的网络中,STARFISH修复相比最先进方法将恢复精度提高了高达22%。在激进剪枝下其优势更为显著。例如,在ImageNet的DeiT-B网络中去除75%权重后,STARFISH仅使用训练图像数量的0.4%作为校准集,恢复了原始稠密模型精度的82%,而竞争恢复技术仅达到稠密模型精度的40%。

英文摘要

Pruning is a process designed to reduce the number of weights in a large neural network. This can substantially speed up inference but might cause a considerable reduction in the model's accuracy, and thus it is usually followed by a healing process that regains some of the lost accuracy. In this paper, we propose a new healing method, STARFISH, that can recover (most of) the accuracy of any pruned network efficiently. The main idea of STARFISH is to optimize the pruned network to align with the original network's internal state representations using a tiny calibration set of unlabeled examples. For the common case of removing 50% of the weights, STARFISH healing improves the recovered accuracy by up to 22% over the state-of-the-art methods on ViT-based networks. Its advantage is even more pronounced under aggressive pruning. For example, after eliminating 75% of the weights in a DeiT-B network for ImageNet, STARFISH uses only 0.4% of the number of training images as a calibration set and recovers 82% of the original dense accuracy, whereas competing recovery techniques reach only 40% of the dense model accuracy.

2606.01123 2026-06-02 cs.LG 版本更新

From Reward-Free Representations to Preferences: Rethinking Offline Preference-Based Reinforcement Learning

从无奖励表示到偏好:重新思考离线基于偏好的强化学习

Jun-Jie Yang, Chia-Heng Hsu, Kui-Yuan Chen, Ping-Chun Hsieh

发表机构 * GitHub

AI总结 本文提出一种结合无奖励表示学习和对比搜索微调的离线偏好强化学习框架,通过从无奖励离线数据中学习潜在后继度量表示,再利用偏好数据进行对比搜索和微调,显著提升了偏好效率。

Comments Published in ICML 2026

详情
AI中文摘要

基于偏好的强化学习通过从成对的人类偏好反馈中学习,避免了显式的奖励工程。现有的离线PbRL方法通常遵循两阶段流程,首先从标记的偏好中学习奖励或偏好模型,然后在未标记数据上执行离线RL。我们通过零样本RL文献中的无奖励表示学习视角重新审视离线PbRL,并提出一个新的训练框架,该框架首先从无奖励离线数据中学习潜在后继度量表示,然后使用偏好数据进行对比搜索和微调。通过大量实验和消融研究,我们表明我们的方法在偏好效率上优于离线PbRL基线。这项工作首次将RFRL与PbRL联系起来,突出了其作为反馈高效解决方案的潜力。我们的代码可在https://github.com/rl-bandits-lab/FB-PbRL公开获取。

英文摘要

Preference-based reinforcement learning (PbRL) avoids explicit reward engineering by learning from pairwise human preference feedback. Existing offline PbRL methods typically follow a two-stage pipeline, first learning a reward or preference model from labeled preferences and then performing offline RL on unlabeled data. We revisit offline PbRL through the lens of reward-free representation learning (RFRL) from the zero-shot RL literature, and propose a new training framework that first learns latent successor-measure representations from reward-free offline data, followed by contrastive search and fine-tuning using preference data. Through extensive experiments and ablations, we show that our method achieves superior preference efficiency over offline PbRL baselines. This work is the first to connect RFRL with PbRL, highlighting its potential as a feedback-efficient solution. Our code is publicly available at https://github.com/rl-bandits-lab/FB-PbRL.

2606.01122 2026-06-02 cs.LG q-fin.CP 版本更新

A Per-Component Diagnostic Protocol for Neural HJB-PIDE Solvers under Control-Dependent Lévy Jumps

控制依赖 Lévy 跳跃的神经 HJB-PIDE 求解器的逐分量诊断协议

R. Drissi

发表机构 * GitHub

AI总结 提出一个五步诊断协议,用于检测残差训练的神经 HJB-PIDE 求解器在控制依赖 Lévy 跳跃下的算子计算错误,并通过 CRRA-Merton-Variance-Gamma 基准案例验证其有效性。

详情
AI中文摘要

我们针对具有控制依赖 Lévy 跳跃的残差训练神经 HJB-PIDE 求解器,提出一个五步诊断协议,旨在解决神经 PDE 方法的一种常见失效模式:学习到的解可能匹配标量诊断指标,但错误计算了其训练损失内部的算子。该协议将每个神经求解与至少一个从零开始的独立参考配对,将哈密顿量分解为漂移、扩散、补偿器和非局部积分分量(在 u 网格上),并在 (t,x) 网格上比较值函数及其低阶导数,然后进行任何 argmax 比较。应用于标准 CRRA-Merton-Variance-Gamma 基准,它隔离了神经方法重要性提议密度中缺失的 1/2 混合因子,该因子将非局部积分恰好缩放了一半——这是常数提议尺度误差的教科书式特征,而更长的训练、网格细化和截断扫描均无法发现。修正该错误后,四个参考解——两个具有不连续离散化的有限差分求解器、神经求解器以及通过 CRRA 齐次性获得的半解析标量基线——在最优控制上达成约 2% 以内的一致。常数系数 CRRA 基准通过齐次性简化为标量最大化,因此标量基线是此处的高效方法;贡献在于该协议,原则上可应用于真正需要神经 HJB-PIDE 求解器的非齐次和高维场景。该案例是更广泛的神经 PDE 验证失效的具体实例:学习到的值或控制的逐点一致可能与系统性错误的非局部算子共存,因此在信任 argmax 策略之前,需要进行逐分量和表面层次的检查。

英文摘要

We propose a five-step diagnostic protocol for residual-trained neural HJB-PIDE solvers with control-dependent Lévy jumps, targeting a general failure mode of neural PDE methods: a learned solution can match headline scalar diagnostics while miscomputing an operator inside its training loss. The protocol pairs each neural solve with at least one from-scratch independent reference, decomposes the Hamiltonian into drift, diffusion, compensator, and nonlocal-integral components across a u-grid, and compares the value function and its low-order derivatives over a (t,x) grid before any argmax comparison. Applied to a standard CRRA-Merton-Variance-Gamma benchmark, it isolates a missing 1/2-mixture factor in the neural method's importance-proposal density that scaled the nonlocal integral by exactly half - a textbook signature of a constant proposal scale error, invisible to longer training, grid refinement, and truncation sweeps. With the bug corrected, four references - two finite-difference solvers with disjoint discretizations, the neural solver, and a semi-analytic scalar baseline obtained from CRRA homogeneity - agree on the optimal control to within ~2%. The constant-coefficient CRRA benchmark collapses by homogeneity to a scalar maximization, so the scalar baseline is the efficient method here; the contribution is the protocol, applicable in principle to non-homogeneous and higher-dimensional settings where neural HJB-PIDE solvers are genuinely needed. The episode is a concrete instance of a broader neural-PDE verification failure: pointwise agreement of a learned value or control can coexist with a systematically wrong nonlocal operator, so per-component and surface-level checks are needed before trusting the argmax policy.

2606.01117 2026-06-02 cs.LG cs.AI 版本更新

HASTE: Hardware-Aware Dynamic Sparse Training for Large Output Spaces

HASTE: 面向大输出空间的硬件感知动态稀疏训练

Nasib Ullah, Jinbin Zhang, Jean Lucien Randrianantenaina, Erik Schultheis, Rohit Babbar

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出组共享固定扇入稀疏性方法,通过半结构化输出层设计结合长尾分解,在极端多标签分类中实现显著加速并保持精度。

Comments Accepted at ICML 2026 Regular

详情
AI中文摘要

极端多标签分类(XMC)涉及在具有数百万标签的大输出空间上学习模型,使得输出层成为内存计算瓶颈。虽然基于稀疏性的方法降低了算术复杂度,但由于不规则内存访问、硬件利用率低或在长尾场景中依赖辅助架构组件,它们通常无法产生成比例的速度提升。我们引入了组共享固定扇入稀疏性,一种半结构化的输出层设计,其中语义相关的标签共享一个稀疏输入模式,同时保留独立的权重。这种分组引入了任务对齐的归纳偏置——鼓励相关标签共享特征子集——同时减少了索引内存开销,增加了跨标签的特征重用,并通过利用现代加速器原语的自定义CUDA内核实现了高效的GPU执行。作为辅助目标的替代方案,我们利用XMC的长尾结构,将输出层分解为频繁标签上的小型密集头部和其余标签上的组共享稀疏尾部,在保留稀疏性内存优势的同时提供了信息丰富的梯度路径。通过内核级微基准测试,我们表明组共享固定扇入将算术减少转化为实际的挂钟时间增益,在前向传播中实现了高达4.4倍的加速,在反向传播中实现了高达25倍的加速,同时与FLOPs匹配的密集瓶颈相比,性能仅相差几个百分点。在大型XMC基准测试中,我们的方法在precision@k上匹配或优于先前的稀疏基线,同时缩小了与密集方法的性能差距。

英文摘要

Extreme multi-label classification (XMC) involves learning models over large output spaces with millions of labels, making the output layer a memory-compute bottleneck. While sparsity-based methods reduce arithmetic complexity, they often fail to yield proportional speedups due to irregular memory access, poor hardware utilization, or reliance on auxiliary architectural components in long-tailed regimes. We introduce group-shared fixed fan-in sparsity, a semi-structured output-layer design in which semantically related labels share a sparse input pattern while retaining independent weights. This grouping introduces a task-aligned inductive bias -- encouraging related labels to share feature subsets -- while reducing index memory overhead, increasing feature reuse across labels, and enabling efficient GPU execution via custom CUDA kernels that leverage modern accelerator primitives. As an alternative to auxiliary objectives, we exploit the long-tailed structure of XMC by decomposing the output layer into a small dense head over frequent labels and a group-shared sparse tail over the remainder, providing an informative gradient pathway while preserving the memory benefits of sparsity. Through kernel-level microbenchmarking, we show that group-shared fixed fan-in translates arithmetic reductions into practical wall-clock gains, achieving up to $4.4\times$ speedup in the forward pass and up to $25\times$ speedup in backward passes over standard fixed fan-in sparsity, while operating within a few percent of a FLOPs-matched dense bottleneck. Across large-scale XMC benchmarks, our approach matches or improves precision@k over prior sparse baselines, while narrowing the performance gap to dense.

2606.01107 2026-06-02 cs.LO cs.LG math.LO 版本更新

How (and when) can you fit examples to logic-based hypothesis classes over infinite structures?

如何(以及何时)能在无限结构上拟合样本到基于逻辑的假设类?

Michael Benedikt, Alessio Mansutti

发表机构 * University of Oxford(牛津大学) IMDEA Software Institute(IMDEA软件研究所)

AI总结 研究在无限结构(如实数有序域和Presburger算术)中,对于逻辑定义的假设类,拟合有限样本的计算复杂性和描述复杂性,并关注通过自然查询语言确定样本可拟合性的情况。

详情
AI中文摘要

我们研究拟合问题,有时称为“训练问题”,其中我们有一个由输入和输出组成的有限样本,并且我们想知道是否存在某个类中的函数可以在给定输入上精确或近似地产生这些输出。我们关注在常见可判定结构(如实数有序域和Presburger算术)中逻辑定义类的拟合的计算复杂性和描述复杂性,以及通过组合或模型论性质定义的更广泛类。我们隔离了这些拟合问题的复杂性,特别关注我们可以使用样本上的自然查询语言中的查询来确定样本是否可拟合的情况。

英文摘要

We study fitting problems, sometimes called ``training problems'', where we have a finite sample consisting of inputs and outputs, and we want to know whether there is a function in a certain class that could produce these outputs, exactly or approximately, on the given inputs. We focus on the computational and descriptive complexity of fitting for logically-defined classes in common decidable structures, like the real ordered field and Presburger arithmetic, and also for broader classes defined via combinatorial or model-theoretic properties. We isolate the complexity of these fitting problems, with particular attention to cases where we can use queries in a natural query language over the sample to determine whether a sample is fittable.

2606.01101 2026-06-02 cs.LG cs.AI 版本更新

Soft-NBCE: Entropy-Weighted Chunk Fusion for Long-Context

Soft-NBCE: 基于熵加权分块融合的长上下文处理

Shihao Ji, Mingyu Li, Zihui Song

发表机构 * Beijing Normal University(北京师范大学) Chunjiang Intelligence(春江智能)

AI总结 针对长上下文推理中硬选择策略导致语义碎片化的问题,提出Soft-NBCE,通过熵加权软融合和一致性蒸馏,在保持检索精度的同时提升多跳推理性能。

Comments 7 pages, 3 figures, 2 tables. Preprint

详情
AI中文摘要

自注意力的二次复杂度仍然是大型语言模型(LLMs)处理超长上下文的瓶颈。朴素贝叶斯认知引擎(NBCE)通过将文档分块并在每个解码步骤路由到熵最低的分块,实现了长上下文推理的并行化。这种硬选择策略在跨分块推理时会导致语义碎片化,因为相邻token之间的突然路由变化破坏了模型的上下文基础。我们提出了Soft-NBCE,这是一种轻量级扩展,用软熵加权分块融合替代了离散的分块选择。通过预测熵上的温度缩放Softmax,为所有分块分配连续权重,实现了跨分块条件分布的log空间聚合。为了部分补偿分块引入的条件独立性假设,我们提出了一致性蒸馏,这是一种基于LoRA的自蒸馏方法,通过KL散度将分块logit分布约束为全上下文教师分布。在LongBench多跳基准测试中,带有一致性蒸馏的Soft-NBCE在NBCE风格基线(MuSiQue F1: 0.310 vs. 0.275(Vanilla NBCE);HotpotQA F1: 0.479 vs. 0.427)上持续改进,同时在O(L^2/n)峰值内存下保持检索精度(NIAH-32K: 0.909)。

英文摘要

The quadratic complexity of self-attention remains a bottleneck for Large Language Models (LLMs) processing ultra-long contexts. The Naive Bayes Cognitive Engine (NBCE) parallelizes long-context inference by chunking documents and routing to the lowest-entropy chunk at each decoding step. This hard-selection strategy causes semantic fragmentation during cross-chunk reasoning, as abrupt routing changes between adjacent tokens disrupt the model's contextual grounding. We present Soft-NBCE, a lightweight extension that replaces discrete chunk selection with soft entropy-weighted chunk fusion. A temperature-scaled Softmax over predictive entropies assigns continuous weights to all chunks, enabling log-space aggregation across chunk-conditioned distributions. To partially compensate for the conditional independence assumption introduced by chunking, we propose Consistency Distillation, a LoRA-based self-distillation that constrains the chunked logit distribution toward a full-context teacher via KL-divergence. On LongBench multi-hop benchmarks, Soft-NBCE with Consistency Distillation improves consistently over NBCE-style baselines (MuSiQue F1: 0.310 vs.\ 0.275 for Vanilla NBCE; HotpotQA F1: 0.479 vs.\ 0.427) while maintaining retrieval accuracy (NIAH-32K: 0.909) at O(L^2/n) peak memory.

2606.01092 2026-06-02 cs.LG cs.AI 版本更新

A Fiber Criterion for Representation Identifiability in Supervised Learning

监督学习中表示可辨识性的纤维准则

Vasileios Sevetlidis

发表机构 * Athena Research Center, Kimmeria Campus, Xanthi, Greece(亚特兰大研究中心,基米里亚校区,哈尼亚,希腊) Democritus University of Thrace, Vas. Sofias Campus, Xanthi, Greece(德摩根大学,瓦斯·索菲亚校区,哈尼亚,希腊) International Hellenic University, Serres, Greece(国际希腊大学,塞雷斯,希腊)

AI总结 本文提出纤维准则,通过投影映射的纤维常数性来形式化监督学习中表示-头部分解的可辨识性,并指出仅凭监督预测行为无法唯一确定表示。

详情
AI中文摘要

监督学习通过输入-输出行为评估预测器。当预测器实现为复合函数 $f=c\circ h$ 时,监督证据约束了复合映射 $f$,但未必确定表示-头部因子分解 $(h,c)$。本文形式化了由此产生的表示级可辨识性问题:对于一类可接受的表示-头部对,当且仅当表示属性在投影 $(h,c)\mapsto c\circ h$ 的纤维上为常数时,它可从诱导的预测器中辨识,等价于它下降为预测器的良定义属性。保持预测器的增广给出了一个规范障碍:辅助信息可以附加到表示上而头部忽略它,保持预测器不变但改变诸如极小性、压缩、不变性、等变性、干扰信息或语义可访问性等属性。这种构造将表示可辨识性与优化和有限样本估计分离开来。有限样本诊断说明了而非证明了该准则:精确代数见证在改变表示诊断时保持预测器固定,而匹配性能的Waterbirds模型表明不同约束可以在相似的监督性能下选择不同的表示。结果阐明,表示级声明需要超越监督预测行为本身的假设、目标、测量或归纳偏置。

英文摘要

Supervised learning evaluates predictors through their input-output behavior. When a predictor is implemented as a composition $f=c\circ h$, supervised evidence constrains the composite map $f$ but need not determine the representation-head factorization $(h,c)$. This paper formalizes the resulting representation-level identifiability problem: for a class of admissible representation-head pairs, a representation property is identifiable from the induced predictor exactly when it is constant on the fibers of the projection $(h,c)\mapsto c\circ h$, equivalently when it descends to a well-defined property of the predictor. Predictor-preserving augmentation gives a canonical obstruction: auxiliary information can be appended to a representation while the head ignores it, leaving the predictor unchanged but altering properties such as minimality, compression, invariance, equivariance, nuisance information, or semantic accessibility. This construction separates representation identifiability from optimization and finite-sample estimation. Finite-sample diagnostics illustrate, rather than prove, the criterion: exact algebraic witnesses hold the predictor fixed while changing representation diagnostics, and matched-performance Waterbirds models show that different constraints can select different representations at similar supervised performance. The results clarify that representation-level claims require assumptions, objectives, measurements, or inductive biases beyond supervised predictive behavior alone.

2606.01090 2026-06-02 stat.ME cs.LG 版本更新

Measuring the Symmetry--Data Exchange Rate

测量对称性与数据交换率

Ahmed M. Adly

发表机构 * Independent Researcher(独立研究者) Egypt(埃及)

AI总结 通过在受控的C_n对称任务上实验,发现错位群约束有害、架构与增强的差距取决于非对称测试计算、对称性交换率与理论值一致但置信区间包含零,并提出了相对率估计器、错位群控制和失败分类法等方法论贡献。

Comments 19 pages, 9 figures. Exploratory study. Code and data at https://github.com/AhmedMostafa16/symmetry-exchange

详情
AI中文摘要

等变理论预测,架构对称性先验将样本复杂度降低|G|倍;这一结论被广泛引用,但很少作为具有控制分离先验及其混杂因素的比例定律进行测量。在一个受控的C_n对称任务上,我们报告了三个发现。首先,具有相同轨道大小和匹配计算的错位群控制比无约束更差(联合成对CI [+0.79, +3.26]排除零,对估计器鲁棒);错位约束是有害的,而不仅仅是无帮助。其次,配备测试时轨道平均的增强基线精确匹配等变模型——跨匹配单元的每周期验证曲线逐位相同——因此架构与增强之间的差距取决于非对称测试时计算,而非无条件。第三,相对交换率beta_diff = 1.28在符号和数量级上与理论值1.0一致(单层CI [+0.92, +2.05]);更保守的两层自助法(种子×群大小)将其扩大到[-0.63, +1.72],包含零,并且在sqrt(2)间隔网格上的更细N复制不确定(点估计-0.82)。方法论贡献——消除共享难度混杂因素的相对率估计器、错位群控制和预先指定的失败分类法——可迁移到任何强度可参数化的归纳偏置。诚实范围:主要估计器beta_diff是在初始分析揭示正斜率可识别性问题后事后采用的;设计从未外部预注册;标题数字基于粗N网格上七个群大小的OLS斜率。这是一项探索性研究,而非确认性测量;错位群结果是最清晰的发现,也是我们报告最有信心的发现。在新鲜种子上的注册复制是未来工作。

英文摘要

Equivariance theory predicts that an architectural symmetry prior reduces sample complexity by a factor of |G|; this is widely cited but rarely measured as a scaling law with controls that separate the prior from its confounds. On a controlled C_n-symmetric task, we report three findings. First, a wrong-group control with identical orbit size and matched compute is worse than no constraint (joint pairwise CI [+0.79, +3.26] excludes zero, robust across estimators); misaligned constraint is actively harmful, not merely unhelpful. Second, an augmentation baseline equipped with test-time orbit averaging matches the equivariant model exactly -- bit-identical per-epoch validation curves across matched cells -- so the architecture-vs-augmentation gap is conditional on asymmetric test-time computation, not unconditional. Third, the relative exchange rate beta_diff = 1.28 is consistent in sign and order of magnitude with the theoretical 1.0 (single-level CI [+0.92, +2.05]); the more conservative two-level bootstrap (seeds x group sizes) widens this to [-0.63, +1.72], including zero, and a finer-N replication on a sqrt(2)-spaced grid is inconclusive (point estimate -0.82). The methodological contributions -- the relative-rate estimator that cancels the shared-difficulty confound, the wrong-group control, and a pre-specified failure taxonomy -- transfer to any inductive bias whose strength can be parameterised. Honest scoping: the primary estimator beta_diff was adopted post-hoc after the initial analysis revealed a positive-slope identifiability problem; the design was never externally pre-registered; and the headline number rests on an OLS slope over seven group sizes on a coarse N grid. This is an exploratory study, not a confirmatory measurement; the wrong-group result is the cleanest finding and the one we report with the most confidence. A registered replication on fresh seeds is future work.

2606.01086 2026-06-02 cs.LG cs.AI 版本更新

Strong Stochastic Flow Maps

强随机流映射

Sam McCallum, Zander W. Blasingame, Timothy Herschell, Niklas Rindtorff, Alexander Tong, James Foster

发表机构 * University of Bath(巴斯大学) AITHYRA

AI总结 提出强随机流映射(SSFMs)框架,通过学习加性噪声SDE的强解映射,实现扩散模型的免模拟训练和少步采样,在图像生成和分子系统采样中优于现有方法。

Comments Preprint

详情
AI中文摘要

流模型和扩散模型在许多模态中生成高质量样本;然而,由于需要对底层微分方程进行数值积分,推理过程中需要多次网络评估。流映射通过学习微分方程的解映射直接缓解了这一问题,实现了少步采样。然而,当前方法仅限于逼近ODE的解映射。这些方法可用于学习SDE的转移核,从而获得恢复过程边际分布(弱收敛)而非解路径(强收敛)的解映射。我们提出强随机流映射(SSFMs)作为一种新框架,用于学习加性噪声SDE的强解映射,直接将确定性流映射推广到随机设置。此外,引入了布朗运动的多项式逼近,并证明其路径收敛。这些结果为扩散模型的解映射提供了免模拟训练目标。我们证明,SSFMs在图像生成上优于先前的随机流映射方法,并实现了分子系统的少步采样。

英文摘要

Flow and diffusion models generate high-quality samples in many modalities; however, many network evaluations are required during inference due to numerical integration of an underlying differential equation. Flow maps alleviate this problem by learning the solution map of the differential equation directly, enabling few-step sampling. Yet, current methods are restricted to approximating the solution map of ODEs. These methods can be used to learn the transition kernel of an SDE, thereby obtaining a solution map that recovers the marginal distributions of the process (weak convergence) rather than the solution path (strong convergence). We propose Strong Stochastic Flow Maps (SSFMs) as a novel framework for learning the strong solution map of additive-noise SDEs, directly generalizing deterministic flow maps to the stochastic setting. Further, a polynomial approximation to Brownian motion is introduced and shown to converge pathwise. These results enable a simulation-free training objective for the solution map of diffusion models. We demonstrate that SSFMs outperform previous stochastic flow map methods on image generation and enable few-step sampling of molecular systems.

2606.01084 2026-06-02 cs.LG cs.AI 版本更新

MViewRouter: Internalizing Geometric Equivariance via Multi-view Alternating Attention for Combinatorial Routing

MViewRouter:通过多视图交替注意力内化组合路由的几何等变性

Shiyan Liu, Bohan Tan, Yaoxin Wu, Yan Jin

发表机构 * Huazhong University of Science and Technology(华中科技大学) Eindhoven University of Technology(埃因霍温理工大学)

AI总结 提出MViewRouter框架,利用多视图交替注意力机制内化几何等变性作为结构归纳偏置,通过集体策略梯度聚合优化,解决组合路由问题中的对称性挑战,在TSP和CVRP上取得竞争性解质量和强零样本泛化。

详情
AI中文摘要

组合路由问题,如旅行商问题(TSP)和带容量约束的车辆路径问题(CVRP),是基础的NP难问题,具有广泛的现实应用。虽然最近的深度强化学习方法显示出有希望的性能,但它们通常仅通过数据增强处理几何对称性,导致决策不一致和泛化能力有限。为了解决这个问题,我们提出了MViewRouter,一个多视图框架,将几何等变性内化为结构归纳偏置,以实现跨路由问题变体的不变决策。我们的方法引入了一种多视图交替注意力(MAA)机制,能够在$D_4$对称群上进行并行处理,在视图内关系建模和视图间特征对齐之间交替进行。此外,我们通过集体策略梯度聚合(CPGA)优化策略,利用来自多个对称视图的共识梯度来稳定训练并加速收敛。在TSP和CVRP基准测试以及真实世界的TSPLIB实例上的实验表明,MViewRouter实现了竞争性的解质量和强大的零样本泛化能力。

英文摘要

Combinatorial routing problems such as the Traveling Salesman Problem (TSP) and the Capacitated Vehicle Routing Problem (CVRP) are fundamental NP-hard problems with broad real-world applications. While recent deep reinforcement learning methods have shown promising performance, they typically handle geometric symmetries only through data augmentation, resulting in inconsistent decisions and limited generalization. To address this issue, we propose MViewRouter, a multi-view framework that internalizes geometric equivariance as a structural inductive bias to achieve invariant decision-making across routing problem variants. Our approach introduces a Multi-view Alternating Attention (MAA) mechanism that enables parallel processing over the $D_4$ symmetry group, alternating between intra-view relational modeling and inter-view feature alignment. Furthermore, we optimize the policy via Collective Policy Gradient Aggregation (CPGA), leveraging consensus gradients from multiple symmetric views to stabilize training and accelerate convergence. Experiments on TSP and CVRP benchmarks, as well as real-world TSPLIB instances, demonstrate that MViewRouter achieves competitive solution quality and strong zero-shot generalization.

2606.01081 2026-06-02 cs.LG 版本更新

Decision-Focused On-Policy Learning for Contextual Linear Optimization with Partial Feedback

面向决策的在线策略学习用于部分反馈下的上下文线性优化

Wyame Benslimane, Tinghan Ye, Pascal Van Hentenryck, Paul Grigas

发表机构 * Department of Industrial Engineering and Operations Research, University of California, Berkeley(工业工程与运筹学系,加州大学伯克利分校) H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(H.米尔顿·斯图尔特工业与系统工程学院,佐治亚理工学院)

AI总结 提出一种混合梯度估计方法,用于部分反馈下顺序上下文线性优化的在线策略学习,实现决策质量驱动的预测模型训练,并在多个基准上优于上下文多臂赌博机基线。

详情
AI中文摘要

决策聚焦学习(DFL)通过优化下游决策质量而非单独预测准确性来训练预测模型。对于上下文线性优化,大多数现有DFL方法假设离线数据和目标成本向量的完全观测。我们开发了一种在线策略学习方法,用于部分反馈下的顺序上下文线性优化,推广了标准赌博机反馈设置。我们的方法学习一个随机预测-然后-优化策略,该策略从条件分布中采样成本向量预测,并求解由此产生的下游线性优化问题。为了更新这个分布模型,我们引入了一个双组分混合梯度估计器。第一个组分是得分函数估计器,它提供无偏但可能高方差的策略梯度估计。第二个是决策聚焦插件组分,它使用潜在成本向量的辅助干扰估计来利用下游优化结构,随着估计的改进而变得更具信息性。我们证明了平均平方策略梯度范数的$\mathcal{O}(T^{-1/2})$界,与标准非凸SGD速率相匹配。在top-$k$选择、最短路径、组合定价和真实数据能源调度基准上的实验表明,混合梯度方法在使用高斯和更丰富的条件生成模型时,在所有基准上实现了比上下文赌博机风格基线更低的累积遗憾。代码可在https://github.com/Joeyetinghan/on-policy-bandit-dfl获取。

英文摘要

Decision-focused learning (DFL) trains predictive models by optimizing downstream decision quality rather than standalone prediction accuracy. For contextual linear optimization, most existing DFL methods assume offline data and full observations of the objective cost vector. We develop an on-policy learning method for sequential contextual linear optimization under partial feedback, generalizing the standard bandit feedback setting. Our method learns a stochastic predict-then-optimize policy that samples a cost-vector prediction from a conditional distribution and solves the resulting downstream linear optimization problem. To update this distributional model, we introduce a two-component hybrid gradient estimator. The first component is a score function estimator, which provides an unbiased but potentially high-variance policy gradient estimate. The second is a decision-focused plug-in component that uses an auxiliary nuisance estimate of the latent cost vector to exploit the downstream optimization structure, becoming more informative as the estimate improves. We prove an $\mathcal{O}(T^{-1/2})$ bound on the average squared policy-gradient norm, matching the standard non-convex SGD rate. Experiments on top-$k$ selection, shortest path, combinatorial pricing, and a real-data energy-scheduling benchmark show that the hybrid gradient approach achieves lower cumulative regret than contextual-bandit-style baselines across all benchmarks, using both Gaussian and richer conditional generative models. Code is available at https://github.com/Joeyetinghan/on-policy-bandit-dfl.

2606.01080 2026-06-02 cs.LG cs.AI 版本更新

ThinkSwitch: Context Distillation with LoRA and Weight Interpolation for Specific-Purpose Reasoning Tasks

ThinkSwitch:基于LoRA和权重插值的上下文蒸馏用于特定目的推理任务

Dhruv Saini, Rohan Pandey

发表机构 * bellevue High School(贝尔维尤高中) DigitalOcean

AI总结 提出ThinkSwitch方法,通过QLoRA蒸馏和球面权重插值协同训练指令模型和思考模型,在AIME 2026和PubMedQA上分别提升指令模型10/30→20/30和13/30→18/30,思考模型14/30→22/30和18/30→25/30,仅需15个训练提示和$2.86成本。

详情
AI中文摘要

大型语言模型通常通过在产生最终答案之前花费推理时间计算来改进困难任务。额外的计算可能有用,但也增加了延迟、令牌成本和部署复杂性。我们引入了 extbf{ThinkSwitch},一种低计算量的程序,用于协同训练配对的指令和思考检查点。从兼容的Qwen3-4B指令和思考模型开始,每次迭代要求思考检查点生成答案,移除推理轨迹,通过QLoRA将仅答案对蒸馏到指令检查点,并通过球面权重插值重建思考检查点。唯一的人工输入是任务提示;标签由模型自身生成。在30个问题的AIME 2026评估中,ThinkSwitch将指令检查点从10/30提升到20/30,思考检查点从14/30提升到22/30。在30个问题的PubMedQA子集上,它将指令检查点从13/30提升到18/30,思考检查点从18/30提升到25/30。完整实验每个领域使用15个训练提示,在单个云RTX 3070上花费2.86美元。结果规模较小,但表明有针对性的蒸馏循环可以将显式推理的部分好处转移到权重中,同时保留独立的思考模式。

英文摘要

Large language models often improve on difficult tasks by spending inference-time compute on a reasoning trace before producing the final answer. That extra computation can be useful, but it also raises latency, token cost, and deployment complexity. We introduce \textbf{ThinkSwitch}, a low-compute procedure for co-training paired instruct and thinking checkpoints. Starting from compatible Qwen3-4B instruct and thinking models, each iteration asks the thinking checkpoint to generate answers, removes the reasoning trace, distills the answer-only pairs into the instruct checkpoint with QLoRA, and reconstructs a thinking checkpoint with spherical weight interpolation. The only human-supplied inputs are task prompts; the labels are generated by the model itself. On a 30-question AIME 2026 evaluation, ThinkSwitch improves the instruct checkpoint from 10/30 to 20/30 and the thinking checkpoint from 14/30 to 22/30. On a 30-question PubMedQA subset, it improves the instruct checkpoint from 13/30 to 18/30 and the thinking checkpoint from 18/30 to 25/30. The complete experiment uses 15 training prompts per domain and costs \$2.86 on a single cloud RTX 3070. The results are small-scale, but they indicate that targeted distillation loops can move part of the benefit of explicit reasoning into weights while preserving a separate thinking mode.

2606.01078 2026-06-02 cs.LG stat.CO stat.ME 版本更新

Non-Vacuous Certification of Transport MCMC via Oscillation-Controlled Normalizing Flows

通过振荡控制归一化流实现传输MCMC的非平凡认证

Jun Hu

发表机构 * China Sanya Science and Education Innovation Park, Wuhan University of Technology, Sanya 572025, China(中国三亚科技教育创新园,武汉理工大学,三亚572025,中国) School of Civil Engineering and Architecture, Wuhan University of Technology, Wuhan 430070, China(武汉理工大学土木工程与建筑学院,武汉430070,中国)

AI总结 提出振荡控制归一化流框架,首次为传输MCMC采样器提供严格的非平凡谱隙界,通过谱归一化、基于覆盖的经验振荡界和振荡正则化训练,在多个后验分布上实现可认证的采样效率。

Comments 36 pages, includes appendix

详情
AI中文摘要

传输MCMC训练归一化流以预处理Metropolis-Hastings提议,在具有挑战性的后验分布上实现了高经验效率;然而,先前的工作没有为此类采样器产生数值上非平凡的、严格的谱隙界。我们建立了第一个这样的界。对于香蕉族上的独立MH,我们在D=2时认证了γ^*=0.828(在原始空间中覆盖),在D=5时认证了γ^*≥7.6×10^{-4}(在解析解卷的高斯空间中覆盖,并具有网格认证的梯度界,在所述数值Lipschitz认证下),两者均在95%置信度下严格。该框架基于三个支柱:(i) 具有缩减尺度裁剪的谱归一化将流Lipschitz常数从10^{47}约束到10^4;(ii) 基于覆盖的经验振荡界用数据依赖的证书替代了空洞的分析界;(iii) 振荡正则化训练在不损失密度拟合的情况下将经验振荡减少60-90%,将实用证书扩展到D=20(γ^*≥1.7×10^{-4})。在另外四个目标(高斯混合、剪切建筑、Neal的漏斗、贝叶斯逻辑回归)上的测试确定了三个精确障碍:边界曲率、目标刚度和尾部覆盖不匹配。仿射与样条比较表明,更简单的架构在相同NLL下产生更紧的证书,颠倒了通常的表达性层次。

英文摘要

Transport MCMC trains a normalizing flow to precondition Metropolis--Hastings proposals, achieving high empirical efficiency on challenging posteriors; yet no prior work produces a numerically non-vacuous, rigorous spectral-gap bound for such samplers. We establish the first such bounds. For independence MH on the banana family we certify (γ^\ast = 0.828) at (D = 2) (covering in the original space) and (γ^\ast \ge 7.6\times 10^{-4}) at (D = 5) (covering in an analytically unwarped Gaussian space with a grid-certified gradient bound under the stated numerical Lipschitz certification), both rigorous at 95% confidence. The framework rests on three pillars: (i) spectral normalization with reduced scale clips constrains the flow Lipschitz constant from (10^{47}) to (10^4); (ii) a coverage-based empirical oscillation bound replaces the vacuous analytical bound with a data-dependent certificate; and (iii) oscillation-regularised training cuts the empirical oscillation by 60--90% at no cost to density fit, extending practical certificates through (D = 20) ((γ^\ast \ge 1.7\times 10^{-4})). Tests on four further targets (Gaussian mixture, shear-building, Neal's funnel, Bayesian logistic regression) identify three precise barriers: boundary curvature, target stiffness, and tail-coverage mismatch. An affine-vs-spline comparison shows that simpler architectures yield tighter certificates at identical NLL, inverting the usual expressiveness hierarchy.

2606.01070 2026-06-02 cs.IR cs.AI cs.LG 版本更新

Test-Time Training for Zero-Resource Dense Retrieval Reranking

零资源稠密检索重排的测试时训练

Shiyan Liu, Yichen Li

发表机构 * Huazhong University of Science and Technology(华中科技大学) ByteDance(字节跳动)

AI总结 提出 DART 方法,通过测试时自适应双线性评分矩阵,利用伪正负样本进行少量梯度更新,在零资源下提升稠密检索重排性能。

Comments Accepted at KnowFM @ ACL 2026

详情
AI中文摘要

稠密检索器在第一阶段候选生成中表现出色,但在零资源设置下缺乏有效的重排能力。现有方法面临根本性困境:交叉编码器重排质量高,但需要昂贵的监督训练且延迟高,而无监督的 BM25 重排在大多数 BEIR 基准上持续降低稠密检索性能。我们提出 DART(测试时稠密自适应重排),通过在推理时自适应评分函数来解决这一困境。对于每个查询,排名靠前的文档作为伪正例,排名靠后的作为伪负例,提供噪声但可用的监督信号,通过少量梯度更新来适应双线性评分矩阵 $W$。我们进一步引入置信加权边际损失和跨查询动量缓冲区,以预热跨查询的适应过程。在六个 BEIR 基准上,DART 相对于稠密检索基线实现了平均每个数据集 NDCG@10 相对提升 +2.1%,且每个查询额外延迟低于 10ms,展示了强大的零样本性能提升和跨领域泛化能力。

英文摘要

Dense retrievers excel at first-stage candidate generation but lack effective reranking in zero-resource settings. Existing approaches face a fundamental dilemma: cross-encoders deliver strong reranking quality but require costly supervised training and incur high latency, while unsupervised BM25 reranking consistently degrades dense retrieval performance on most of BEIR benchmarks. We propose DART (Dense Adaptive Reranking at Test-time), which resolves this dilemma by adapting the scoring function at inference time. For each query, the top-ranked documents serve as pseudo-positive examples and the bottom-ranked as pseudo-negative examples, providing noisy but readily available supervision to adapt a bilinear scoring matrix $W$ via a small number of gradient updates. We further introduce a confidence-weighted margin loss and a cross-query momentum buffer that warm-starts adaptation across queries. On six BEIR benchmarks, DART achieves a mean per-dataset relative NDCG@10 gain of +2.1% over the dense retrieval baseline with under 10ms additional latency per query, demonstrating a powerful capability for zero-shot performance enhancement and cross-domain generalization.

2606.01065 2026-06-02 cs.DC cs.AI cs.LG 版本更新

Leyline: KV Cache Directives for Agentic Inference

Leyline:用于智能推理的 KV 缓存指令

Bole Ma, Jan Eitzinger, Harald Koestler

发表机构 * Erlangen National High Performance Computing Center(埃朗根国家高性能计算中心)

AI总结 针对智能体 LLM 中策略驱动的缓存编辑需求,提出 Leyline 服务端原语,通过声明式指令四元组和架构无关接口实现缓存拼接与截断,提升缓存命中率和求解率。

详情
AI中文摘要

现代 KV 缓存管理假设聊天机器人工作负载:提示一次性到达,缓存仅追加增长,因此前缀缓存和仅向前驱逐在构造上是正确的。智能体 LLM 打破了这一假设。它们的对话通过策略驱动的编辑演变:失败的工具调用被重试,过时的输出被丢弃,轨迹被转向。这导致两个不同的缓存问题。首先,相同的内容在轮次之间移动到新位置,使得精确前缀缓存失效,尽管底层 KV 仍然有效;最近针对 MLA 的位置无关缓存工作解决了这个重用问题。其次,也是本文的重点,策略可能需要指示服务系统主动移除或替换一段缓存内容,并继续而不重新预填充之后的所有内容。没有现有的原语提供此功能。生产智能体框架退回到每次编辑时重新预填充,支付完整的前缀重新计算成本;内核级驱逐方法自行决策,无法接受来自内核外部的策略指令。我们引入 Leyline,一个弥补这一差距的服务端原语。一个声明式指令四元组将编辑内容与保持位置正确性分离。策略声明编辑及其模式(原地拼接或前缀修剪的重新预填充以实现语义遗忘);一个架构无关的接口路由到每个架构的内核,通过闭式 RoPE 旋转校正恢复注意力计算。拼接内核将重放缓存命中率提高 11.2 个百分点,并将延迟降低最多 241 毫秒。通过同一接口路由的十行截断规则将 debug-gym 上的智能体求解率提高 14.3 个百分点。该机制是开放的;它启用的策略空间是未来的议程。

英文摘要

Modern KV cache management assumes the chatbot workload: prompts arrive once and the cache grows append-only, so prefix caching and forward-only eviction are correct by construction. Agentic LLMs break this assumption. Their conversations evolve through policy-driven editing: failed tool calls are retried, stale outputs dropped, trajectories pivoted. Two distinct cache problems result. First, identical content moves to new positions between turns, invalidating exact-prefix caches even though the underlying KV would still be valid; recent work on position-independent caching for MLA addresses this reuse problem. Second, and this paper's focus, a policy may need to direct the serving system to actively remove or replace a span of cached content and continue without re-prefilling everything that came after. No existing primitive offers this. Production agentic harnesses fall back to re-prefill on every edit, paying full prefix-recomputation cost; kernel-level eviction methods make their own decisions and cannot accept policy directives from outside the kernel. We introduce Leyline, a serving-side primitive that closes this gap. A declarative directive 4-tuple separates what to edit from how to preserve position correctness. The policy declares the edit and its mode (in-place splice or prefix-trimmed re-prefill for semantic forgetting); an architecture-agnostic interface routes to a per-architecture kernel that restores attention math via a closed-form RoPE-rotation correction. The splice kernel lifts replay cache-hit by +11.2 pp and cuts latency by up to 241 ms. A ten-line truncation rule routed through the same interface lifts agentic solve rate by +14.3 pp on debug-gym. The mechanism is open; the policy space it enables is the agenda.

2606.01057 2026-06-02 cs.CV cs.AI cs.GR cs.LG 版本更新

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

3DCodeBench:通过代码进行智能体程序化3D建模的基准测试

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong, Ameesh Makadia, Meiqi Guo, Laurent Itti, Jindong Chen

发表机构 * Google DeepMind(谷歌DeepMind) University of Southern California(南加州大学) Google Research(谷歌研究)

AI总结 提出3DCodeBench基准,评估12种视觉语言模型将文本和图像参考转换为程序化3D建模代码的能力,并构建基于人类偏好的3DCodeArena排名平台。

Comments Project Page: https://www.3dcodebench.com/; 11 pages (main), with appendix

详情
AI中文摘要

通过代码进行程序化3D建模正成为一种通用的范式,提供确定性、引擎就绪且可精确编辑的资产,而神经3D生成器天生缺乏这些特性。然而,编写此类程序化内容需要深厚的3D软件API、参数化设计和代码级几何推理专业知识。在本文中,我们提出了3DCodeBench,一个系统性的基准,用于评估3D建模软件中用于程序化3D生成的视觉语言模型(VLM)智能体。具体来说,3DCodeBench评估了12种先进VLM如何有效地充当程序化3D建模器,将文本和图像参考转换为3D建模软件的程序化代码。认识到自动度量可能无法完全捕捉3D形状的感知质量,我们构建了3DCodeArena,一个基于成对人类偏好对生成的3D输出进行排名的平台。通过广泛的评估和结果,我们观察到:(1)失败主要源于API不匹配,而成功渲染的模型仍然存在断开或浮动的3D几何组件。(2)测试时扩展,如更高的思考预算和多轮细化,总体上提高了性能。我们的发现突显了对高质量程序化编码数据以推进商业VLM的迫切需求。此外,有效的程序化3D建模需要一个强大的执行环境,为迭代细化提供高保真反馈。我们发布了3DCodeBench,包括精心策划的大规模多模态(文本/图像)提示数据集、程序化代码、3D对象三元组、评估协议以及公共3DCodeArena平台,作为探索基于VLM的程序化3D建模器的基础工具包。

英文摘要

Procedural 3D modeling through code is emerging as a versatile paradigm, offering deterministic, engine-ready, and precisely editable assets that neural 3D generators inherently lack. Authoring such procedural content, however, demands deep expertise in 3D software APIs, parametric design, and code-level geometric reasoning. In this paper, we propose 3DCodeBench, a systematic benchmark for evaluating vision-language model (VLM) agents for procedural 3D generation in 3D modeling software. Specifically, 3DCodeBench evaluates how effectively 12 advanced VLMs can serve as procedural 3D modelers by translating text and image references into procedural code for 3D modeling software. Recognizing that automated metrics may not fully capture the perceptual quality of 3D shapes, we build 3DCodeArena, a ranking platform based on pairwise human preferences over generated 3D outputs. From extensive evaluations and results, we observe that: (1) Failures mostly arise from API mismatches, while successful renders still suffer from disconnected or floating 3D geometric components. (2) Test-time scaling, such as higher thinking budgets and multi-turn refinement, improves performance overall. Our findings highlight a critical need for high-quality procedural coding data to advance commercial VLMs. Furthermore, effective procedural 3D modeling requires a robust execution environment that provides high-fidelity feedback for iterative refinement. We release 3DCodeBench, including the curated large-scale dataset of multimodal (text/image) prompts, procedural code, 3D object triplets, evaluation protocol, and the public 3DCodeArena platform as a foundational toolkit for exploring VLM-based procedural 3D modelers.

2606.01051 2026-06-02 cs.LG 版本更新

Interaction-Limited Safe Continuous-Time RL for Dynamical Medical Treatment

交互受限的动态医疗安全连续时间强化学习

Xun Shen, Yuepeng Wang, Akifumi Wachi, Yongqi Zhou, Richard Weiss, Yoshihiko Fujisawa, Ken Kawano, Mehrshad Sadria, Ying Chen, Xin Liu, Sebastien Gros, Xiao Hu, Kyoung-Sook Kim, Mengmou Li, Katsuki Fujisawa, Kenji Wakabayashi

发表机构 * Tokyo University of Agriculture and Technology(东京大学农业技术大学) LY Corporation(LY公司) National University of Singapore(新加坡国立大学) Institute of Science Tokyo(东京科学研究所) Altos Labs, Inc.(Altos实验室) National Institute of Advanced Industrial Science and Technology (AIST)(国家先进工业科学与技术研究院) Norwegian University of Science and Technology(挪威科学技术大学) Emory University(埃默里大学) Hiroshima University(广岛大学)

AI总结 提出交互受限的安全连续时间强化学习框架,通过选项式半马尔可夫决策过程联合优化治疗策略与临床交互时机,并引入安全收紧机制保证轨迹级安全。

详情
AI中文摘要

动态医疗需要决定治疗强度和干预时机,而患者状态连续演化,不良事件可能在临床交互之间发生。现有大多数治疗学习方法假设固定时间表或仅在离散决策点强制执行安全性。我们提出了交互受限的安全连续时间强化学习,这是一个在轨迹级安全约束下联合优化治疗管理和临床交互时机的框架。我们的关键思想是将连续时间治疗问题重新表述为基于选项的半马尔可夫决策过程,其中每个选项指定一个连续时间治疗策略及其持续时间。我们开发了一种安全收紧机制,表明在交互时间适当构造的约束能够以高概率保证整个连续时间轨迹的安全性。我们进一步建立了从记录的治疗轨迹中进行策略学习的有限样本保证,并引入了一个实用的数据驱动保守替代。实验表明,所提出的自适应交互时机机制在不同安全策略优化方法上均能提高安全性和治疗效果,优于等距交互方案。

英文摘要

Dynamic medical treatment requires deciding treatment intensity and intervention timing, while patient states evolve continuously and adverse events may occur between clinical interactions. Most existing treatment learning methods assume fixed schedules or enforce safety only at discrete decision points. We propose Interaction-Limited Safe Continuous-Time Reinforcement Learning, a framework that jointly optimizes treatment administration and clinical interaction timing under trajectory-level safety constraints. Our key idea is to reformulate the continuous time treatment problem as an option-based semi-Markov decision process, where each option specifies a continuous-time treatment policy and its duration. We develop a safety-tightening mechanism showing that suitably constructed constraints at interaction times guarantee safety over the full continuous-time trajectory with high probability. We further establish finite-sample guarantees for policy learning from logged treatment trajectories and introduce a practical data-driven conservative surrogate. Experiments show that the proposed adaptive interaction-timing mechanism improves both safety and treatment effectiveness over equidistant interaction schemes across different safe policy optimization methods.

2606.01042 2026-06-02 cs.LG cs.AI 版本更新

Plausibility Is Not Prediction: Contrastive Evidence for LLM-Based Cellular Perturbation Reasoning

似真性不是预测:基于LLM的细胞扰动推理的对比证据

Xinyu Yuan, Xixian Liu, Jianan Zhao, Yashi Zhang, Hongyu Guo, Jian Tang

发表机构 * Mila - Québec AI Institute(魁北克人工智能研究所) University of Montréal(蒙特利尔大学) HEC Montréal(蒙特利尔HEC商学院) University of Ottawa(渥太华大学) National Research Council of Canada(加拿大国家研究理事会) CIFAR AI Chair(CIFAR人工智能 chair)

AI总结 本文发现基于大语言模型的细胞扰动推理虽能生成生物上合理的解释,但实际预测性能差,并提出CORE方法通过对比证据组织来提升扰动特异性预测。

详情
AI中文摘要

扰动实验对于理解细胞机制至关重要,但成本高昂且稀疏,因此需要预测未观察条件下的基因表达响应。最近一个有前景的方向是利用大语言模型(LLM)作为“虚拟细胞”模拟器——通过逐步的、基于知识的机械推理来推断差异表达——指向一种可解释的、知识驱动的范式,超越了纯粹的数据驱动方法。然而,我们发现似真性不是预测:尽管产生了生物上合理的解释,这些方法未能捕捉扰动特异性效应:系统性地高估差异表达,在聚合评估中通常表现不如简单的基因频率基线,并且在每个基因水平上降至随机水平。这揭示了对内在基因响应倾向的依赖,而非真正的扰动推理。我们将这一失败追溯到证据呈现方式:现有方法孤立地评估扰动-基因对,而不揭示相关扰动对同一基因的影响差异。为解决这一局限性,我们引入了CORE(对比关系证据组织),通过将证据组织成来自相关扰动的正面和负面结果,将预测重新定义为比较任务。使用生物医学知识图谱进行证据检索,CORE在基于LLM和非LLM的设置中均改善了校准并大幅提升了扰动特异性预测:例如,在药物扰动数据上,CORE-Reasoning将Qwen3.5-9B的聚合指标提升了高达28.6%;而在通用扰动数据上,CORE-Voting将四个细胞系的每个基因平均AUROC从随机水平提高到0.703。这突显了对比证据组织对于可靠的基于LLM的扰动推理至关重要。

英文摘要

Perturbation experiments are central to understanding cellular mechanisms, but remain costly and sparse, motivating prediction of gene expression responses for unobserved conditions. A promising recent direction leverages large language models (LLMs) as "virtual cell" simulators-using stepwise, knowledge-grounded mechanistic reasoning to infer differential expression-pointing toward an interpretable, knowledge-driven paradigm that transcends purely data-driven approaches. However, we find that plausibility is not prediction: despite producing biologically plausible explanations, these methods fail to capture perturbation-specific effects: systematically overestimating differential expression, often underperforming a simple gene-frequency baseline in aggregate evaluations, and collapsing to chance-level performance at the per-gene level. This reveals a reliance on intrinsic gene response tendencies rather than true perturbation reasoning. We trace this failure to how evidence is presented: existing methods evaluate perturbation-gene pairs in isolation, without exposing how related perturbations differ in their effects on the same gene. To address this limitation, we introduce CORE (Contrastive Organization of Relational Evidence), which reframes prediction as a comparison task by organizing evidence into positive and negative outcomes from related perturbations. Using a biomedical knowledge graph for evidence retrieval, CORE improves calibration and substantially boosts perturbation-specific prediction in both LLM-based and non-LLM settings: for example, on drug-perturbation data, CORE-Reasoning improves Qwen3.5-9B aggregate metrics by up to 28.6%, while on generic perturbation data, CORE-Voting raises macro-per-gene AUROC from chance to 0.703 in average across four cell lines. This highlights contrastive evidence organization as essential to reliable LLM-based perturbation reasoning

2606.01039 2026-06-02 cs.LG cs.AI 版本更新

OPD+: Rethinking the Advantage Design for On-Policy Distillation

OPD+: 重新思考在线策略蒸馏中的优势设计

Hanyang Zhao, Haoxian Chen, Han Lin, Genta Indra Winata, David Yao, Wenpin Tang

发表机构 * Columbia University(哥伦比亚大学) Amazon(亚马逊) Meta Capital One

AI总结 本文提出OPD+,通过修正在线策略蒸馏中因停止梯度操作导致的奖励目标偏差,并支持多种f-散度,在数学推理和工具使用基准上提升了性能。

详情
AI中文摘要

在线策略蒸馏(OPD)是一种广泛使用的技术,用于将能力强的教师语言模型的能力迁移到基础学生模型,并且可以通过使用学生生成的轨迹来制定强化学习风格的目标。然而,尽管散度奖励依赖于学生模型的可能性,现有工作通常采用停止梯度设计主要是为了稳定性,这使得得到的优势估计存在问题。在这项工作中,我们提供了一个基于学生和教师之间f-散度的通用优化框架,并从数学上重新审视这种设计空间是否有效。我们证明,对于一般的散度函数,一般的停止梯度操作会导致奖励目标和相应梯度的有偏估计。我们提出了OPD+,这是OPD的修正版本,在基线KL方法上展示了改进的性能,并且也支持各种f-散度的选择。我们在数学推理和工具使用基准上验证了我们的发现。

英文摘要

On-policy distillation (OPD) is a widely used technique to transfer capabilities from capable teacher language models to the base student models, and can be formulated in a reinforcement learning style objective using student generated rollouts. Yet, despite the divergence reward being dependent on student model likelihood, existing works usually adopt a stop gradient design primarily for stability, which makes the resulting advantage estimation questionable. In this work, we provide a generic optimization framework based on f-divergence between the student and teacher, and mathematically revisit whether such design space is valid. We prove that general stop-gradient operation would lead to biased estimates of the reward objective and corresponding gradient for general divergence functions. We propose OPD+, the corrected version of OPD that demonstrates improved performance over the baseline KL approach and also supports the choice of various f-divergence. We validate our findings on mathematical reasoning and tool-use benchmarks.

2606.01031 2026-06-02 cs.GR cs.AI cs.CV cs.LG cs.MM 版本更新

Temporally-Aligned Evaluation for Audio-Driven Talking Head Generation

音频驱动说话头生成的时序对齐评估

Zhicheng Zhang, Lei Wang, Yu Zhang, Yongsheng Gao

发表机构 * School of Business, University of New South Wales (UNSW)(新南威尔士大学商学院) School of Engineering and Built Environment, Griffith University(格里菲斯大学工程与环境学院) Data61/CSIRO(Data61/澳大利亚国家科学委员会)

AI总结 针对现有帧级评估指标对时序偏差敏感的问题,提出基于软动态时间规整的序列级对齐评估框架,提升评估鲁棒性并揭示不同建模范式间的系统权衡。

Comments Research report

详情
AI中文摘要

音频驱动的说话头生成技术发展迅速,但现有评估协议主要依赖帧级指标,假设生成视频与参考视频之间存在严格的时间对应关系。这一假设与语音驱动的面部运动不符,后者自然包含轻微的时间偏移、不同的说话速度和风格变化。因此,传统指标可能将无害的时间差异视为质量错误,使得公平比较方法并理解其权衡变得更加困难。在这项工作中,我们认为动态生成模型的评估应被表述为序列对齐问题,而非独立的帧比较。我们引入了一种统一的序列级重新表述,将软动态时间规整集成到已有的评估流程中。通过在对齐特征轨迹的同时保持时间顺序,所提出的框架对有限的时间错位具有鲁棒性,且不改变底层的感知、身份或同步编码器。我们表明,在刚性对齐下,帧级评估可被视为一个特例,而序列级对齐提供了更好的稳定性、对时间差异的更低敏感性以及建模范式之间更清晰的区分。基于这一原则性表述,我们在标准化协议下,对涵盖规范、野外和风格多样场景的七个数据集上的20种方法进行了大规模基准测试。大量实验表明,时序对齐的指标对时间差异更鲁棒,跨数据集提供更一致的结果,并能更好地揭示建模范式之间的系统权衡,例如同步性与真实性、表现力与稳定性之间的权衡。

英文摘要

Audio-driven talking-head generation has advanced rapidly, yet existing evaluation protocols mainly rely on frame-wise metrics that assume strict temporal correspondence between generated and reference videos. This assumption does not match speech-driven facial motion, which naturally includes slight timing shifts, different speaking speeds, and stylistic variations. As a result, conventional metrics may treat harmless timing differences as quality errors, making it harder to fairly compare methods and understand their trade-offs. In this work, we argue that evaluation of dynamic generative models should be formulated as a sequence-alignment problem rather than independent frame comparison. We introduce a unified sequence-level reformulation that integrates Soft Dynamic Time Warping into established evaluation pipelines. By aligning feature trajectories while preserving temporal order, the proposed framework provides robustness to bounded temporal misalignments without altering the underlying perceptual, identity, or synchronization encoders. We show that frame-wise evaluation can be viewed as a special case under rigid alignment, while sequence-level alignment provides improved stability, lower sensitivity to timing differences, and clearer separation between modeling paradigms. Building on this principled formulation, we conduct a large-scale benchmark of 20 methods across seven datasets spanning canonical, in-the-wild, and style-diverse scenarios under standardized protocols. Extensive experiments show that temporally aligned metrics are more robust to timing differences, provide more consistent results across datasets, and better reveal systematic trade-offs between modeling paradigms, such as synchronization versus realism and expressiveness versus stability.

2606.01028 2026-06-02 cs.LG 版本更新

MedGym:A Unified Continuous-Time Benchmark for Dynamic Medical Treatment Reinforcement Learning

MedGym:面向动态医疗治疗强化学习的统一连续时间基准

Yuepeng Wang, Ken Kawano, Yongqi Zhou, Yoshihiko Fujisawa, Richard Weiss, Akifumi Wachi, Katsuki Fujisawa, Ying Chen, Mehrshad Sadria, Xin Liu, Kyoung-Sook Kim, Xiao Hu, Sebastien Gros, Xun Shen

发表机构 * Tokyo University of Agriculture and Technology(东京农业大学) Institute of Science Tokyo(东京科学研究院) National University of Singapore(国立新加坡大学) LY Corporation(LY公司) Altos Labs, Inc.(Altos实验室) National Institute of Advanced Industrial Science and Technology (AIST)(国家先进工业科学与技术研究院) Emory University(埃默里大学) Norwegian University of Science and Technology(挪威科学技术大学)

AI总结 提出MedGym基准,通过连续时间框架和物理信息神经网络构建可配置的医疗RL环境,支持离散与连续时间方法在非规则治疗间隔下的比较,并评估个性化、轨迹安全等临床指标。

详情
AI中文摘要

医疗治疗推荐给强化学习(RL)带来了若干挑战:患者生理状态在连续时间内演变,测量和干预以不规则间隔进行,且治疗效果在不同个体间差异显著。然而,现有的RL公式和模拟环境基于离散时间的MDP或POMDP抽象,具有固定或预先指定的决策间隔。因此,评估RL方法能否处理时间间隔依赖的疾病进展、个性化治疗反应以及连续测量点之间的安全性仍然困难。为弥补这一空白,我们引入了MedGym,一个用于动态治疗推荐的基准环境。MedGym在连续时间框架中对纵向患者演变进行建模,并通过使用物理信息神经网络从临床数据构建可配置的医疗RL基准。所得基准支持离线RL和在线RL,并能够在非规则治疗时机和患者特定动态下直接比较离散时间与连续时间方法。此外,MedGym支持从临床重要角度进行评估,包括个性化、轨迹级安全性以及基于模型的离线学习与在线部署之间的性能差距。通过为连续时间动态治疗提供标准化且可配置的基准,MedGym旨在促进对医疗RL方法进行更真实、更具信息量的评估。

英文摘要

Medical treatment recommendation poses several challenges to reinforcement learning (RL): patient physiology evolves in continuous time, measurements and interventions are performed at irregular intervals, and treatment effects vary substantially across individuals. Existing RL formulations and simulated environments, however, are based on discrete-time MDP or POMDP abstractions with fixed or pre-specified decision intervals. Thus, it remains difficult to evaluate whether RL methods can handle time-interval-dependent disease progression, personalized treatment response, and safety between consecutive measurement points. To address this gap, we introduce MedGym, a benchmark environment for dynamic treatment recommendation. MedGym models longitudinal patient evolution in a continuous-time framework and constructs a configurable medical RL benchmark from clinical data by using Physics-Informed Neural Networks. The resulting benchmark supports both offline and online RL, and enables direct comparison between discrete-time and continuous-time methods under irregular treatment timing and patient-specific dynamics. Besides, MedGym supports evaluation from clinically important perspectives, including personalization, trajectory-level safety, and the performance gap between model-based offline learning and online deployment. By providing a standardized and configurable benchmark for continuous-time dynamic treatment, MedGym aims to facilitate more realistic and informative evaluation of medical RL methods.

2606.01020 2026-06-02 cs.AI cs.LG 版本更新

Tackling the Root of Misinformation by Teaching Laypeople about Logical Fallacies via Socratic Questioning and Critical Argumentation

通过苏格拉底式提问和批判性论证教授外行人逻辑谬误,以应对错误信息的根源

Minjing Shi, Junling Wang, Jingwei Ni, Sankalan Pal Chowdhury, Mrinmaya Sachan

发表机构 * ETH Zurich(苏黎世联邦理工学院) ETH AI Center(苏黎世联邦理工学院人工智能中心)

AI总结 提出LFTutor智能辅导系统,利用大语言模型结合苏格拉底式提问和批判性论证原则,帮助外行人学习识别逻辑谬误,显著优于基线模型。

Comments This paper has been accepted to Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Long Paper), Main Conference

详情
Journal ref
Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics, 2026
AI中文摘要

识别日常话语中的逻辑谬误对许多人来说具有挑战性。这一挑战在大语言模型(LLMs)时代被放大,恶意行为者可以利用谬误论证大规模传播错误信息。在这项工作中,我们探索了LLMs作为解决方案一部分的潜力。我们介绍了LFTutor,一个智能辅导系统,它使用LLMs辅导外行人,帮助他们学习逻辑谬误。LFTutor整合了意图驱动的苏格拉底式提问和批判性论证原则,以积极引导学习者反思自己的推理。通过自动评估和人工评估,我们证明LFTutor显著优于缺乏这些教学策略的基线LLMs。这项工作突显了将LLMs与教学支架相结合以在人工智能时代培养批判性思维和论证素养的前景。

英文摘要

Identifying logical fallacies in everyday discourse is challenging for many people. This challenge is amplified in the era of Large Language Models (LLMs), where malicious agents can deploy fallacious arguments to disseminate misinformation at scale. In this work, we explore the potential of LLMs as part of the solution. We introduce LFTutor, an intelligent tutoring system which uses LLMs to tutor laypeople and help them learn about logical fallacies. LFTutor integrates intent-driven Socratic questioning and critical argumentation principles to actively engage learners to reflect on their reasoning. Through both automatic and human evaluations, we demonstrate that LFTutor significantly outperforms baseline LLMs lacking these pedagogical strategies. This work highlights the promise of combining LLMs with pedagogical scaffolding to foster critical thinking and argument literacy in the age of AI.

2606.01007 2026-06-02 cs.LG cs.AI 版本更新

Beyond Task-Agnostic: Task-Aware Grouping for Communication-Efficient Multi-Task MoE Inference

超越任务无关:面向通信高效的多任务MoE推理的任务感知分组

Zhiyao Xu, Aoxue Liu, Zhanjie Ding, Dan Zhao, Yong Jiang, Qing Li

发表机构 * Tsinghua Shenzhen International Graduate School(清华大学深圳国际研究生院) Pengcheng Laboratory(鹏城实验室)

AI总结 提出任务感知共激活分组(TACG)框架,通过任务特定的共激活模式优化专家放置,并引入通用专家共享复制(GESR)应对在线负载倾斜,在三个MoE模型上平均降低通信成本31.39%,保持公平性指数0.9975。

详情
AI中文摘要

稀疏激活的混合专家(MoE)模型通过条件计算扩展容量,但分布式推理面临跨GPU专家通信和路由引起的负载不平衡问题。现有的放置方法通过共同定位频繁共激活的专家来降低这一成本;然而,它们从全局聚合的路由轨迹中推导出单一部署方案,从而平均掉了多任务服务中实际驱动通信的异构、任务特定的共激活模式。我们观察到专家共激活强烈依赖于任务:在一个任务族中紧密耦合的专家对在另一个任务族中往往不相关,因此有效的部署应根据任务感知的共激活而非任务无关的平均值来分组专家。基于这一见解,我们提出了任务感知共激活分组(TACG),这是一个部署时框架,利用族特定的调度和共激活轨迹推导每个专家的任务族偏好,重新加权共激活图使得族内局部性主导分组,并在精确容量约束下将每个专家分配到主GPU。为了使静态放置对在线工作负载倾斜保持鲁棒,我们进一步引入了通用专家共享复制(GESR),这是一个轻量级辅助方法,识别具有持续中心共激活特征的通用专家,将它们复制到少量辅助GPU上,并在服务时应用局部性和负载感知的选择。在三个代表性的开源MoE模型上的实验表明,我们的框架相比基线平均降低了31.39%的通信成本,同时保持了平均Jain公平指数0.9975。即使在推理数据出现严重分布偏移的情况下,这一优势依然存在,持续优于强基线。

英文摘要

Sparsely activated Mixture-of-Experts (MoE) models scale capacity via conditional computation, but distributed inference suffers from cross-GPU expert communication and routing-induced load imbalance. Existing placement methods reduce this cost by co-locating frequently co-activated experts; however, they derive a single deployment plan from globally aggregated routing traces, thereby averaging away the heterogeneous, task-specific co-activation patterns that actually drive communication in multi-task serving. We observe that expert co-activation is strongly task-conditioned: pairs tightly coupled in one task family are often uncorrelated in another, so effective deployment should group experts by task-aware co-activation rather than by a task-agnostic average. Based on this insight, we propose \emph{Task-Aware Coactivation Grouping} (TACG), a deployment-time framework that uses family-specific dispatch and co-activation traces to derive per-expert task-family preferences, reweights the co-activation graph so that intra-family locality dominates grouping, and assigns each expert to a primary GPU under exact capacity constraints. To keep the static placement robust under online workload skew, we further introduce \emph{Generic Expert Shared Replication} (GESR), a lightweight companion that identifies generic experts with consistently central co-activation profiles, replicates them across a small set of secondary GPUs, and applies locality- and load-aware selection at serving time. Experiments on three representative open-source MoE models demonstrate that our framework reduces the average communication cost by 31.39\% over the baseline, while preserving an average Jain fairness index of 0.9975. This advantage persists even under severe distribution shifts in the inference data, consistently outperforming strong baselines.

2606.01002 2026-06-02 stat.ME cs.LG math.ST stat.TH 版本更新

Theoretical Analysis of Engression and Reverse Markov Engression

Engression与反向马尔可夫Engression的理论分析

Jiaqi Huang, Gongjun Xu, Ji Zhu

发表机构 * Department of Statistics, University of Michigan(密歇根大学统计系)

AI总结 本文针对Engression及其反向马尔可夫扩展,在深度神经网络参数化下建立了非渐近收敛界,并通过能量距离链式法则分析了误差传播,得到了接近最优的过量风险界。

详情
AI中文摘要

Engression是最近提出的用于条件分布学习的有效框架。其多步反向马尔可夫扩展通过将复杂条件采样分解为顺序反向转移,进一步提高了生成灵活性。尽管这些方法具有强大的实证性能,但其严格的有限样本统计保证仍然缺乏。在本文中,在深度神经网络参数化下,我们通过直接控制学习到的条件分布与目标条件分布之间的能量距离,建立了Engression的非渐近收敛界。对于反向马尔可夫框架,我们进一步开发了基于能量距离的链式法则,从而能够严格分析反向步骤间的误差传播。我们的分析得到了相应的过量风险界,相对于一般Hölder类上的经典极小化最优速率,该界在对数因子意义下是接近最优的。

英文摘要

Engression is a recently proposed and effective framework for conditional distribution learning. Its multi-step Reverse Markov extension further improves generative flexibility by decomposing complex conditional sampling into sequential reverse transitions. Despite their strong empirical performance, rigorous finite-sample statistical guarantees for these methods remain unavailable. In this paper, under deep neural network parameterizations, we establish nonasymptotic convergence bounds for Engression by directly controlling the Energy Distance between the learned and target conditional distributions. For the Reverse Markov framework, we further develop an Energy-Distance-based chain rule that enables a rigorous analysis of error propagation across reverse steps. Our analysis yields corresponding excess-risk bounds that are near-optimal up to logarithmic factors relative to the classical minimax rate over a general Hölder class.

2606.01000 2026-06-02 cs.LG cs.CL 版本更新

Trust Functions: Near-Lossless Weak-to-Strong Generalization by Learning When to Trust the Weak Teacher

信任函数:通过学习何时信任弱教师实现近乎无损的弱到强泛化

Arda Uzunoglu, Alvin Zhang, Daniel Khashabi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出信任函数为弱标签分配信任分数并据此过滤弱监督,在多个领域实现近乎无损的弱到强泛化,且能通过迭代链放大收益。

Comments ICML 2026

详情
AI中文摘要

弱到强泛化研究在可靠标签稀缺时,如何利用较弱教师的监督来提升强学生。我们主要将其视为数据选择问题,关键挑战是识别哪些弱标签足够可靠以作为训练信号。为此,我们引入信任函数,为每个弱标签分配一个标量信任分数,并使用这些分数过滤弱监督。在包括世界知识、定量推理和策略游戏在内的多个领域,信任过滤产生的学生匹配甚至有时超越真实监督,实现了近乎无损的弱到强泛化。此外,信任函数能够实现迭代的弱到强链,通过训练学生并将其重用为下一个教师来累积收益,从而放大收益。信任函数的优势可归因于多种机制。

英文摘要

Weak-to-strong generalization studies how to improve a strong student using supervision from a weaker teacher when reliable labels are scarce. We view this primarily as a data selection problem, where the key challenge is to identify which weak labels are reliable enough to serve as a training signal. To address this, we introduce trust functions that assign each weak label a scalar trust score and use these scores to filter weak supervision. Across several domains, including world knowledge, quantitative reasoning, and strategy games, trust filtering yields students that match and sometimes surpass ground-truth supervision, achieving near-lossless weak-to-strong generalization. Moreover, trust functions enable an iterative weak-to-strong chain that compounds gains by training a student and reusing it as the next teacher, amplifying the gains. There are several mechanisms to which advantage of trust functions can be attributed.

2606.00988 2026-06-02 cs.LG 版本更新

Data Enrichment for Symbolic Regression Using Diffusion Models

使用扩散模型进行符号回归的数据增强

Simon De Reuver, Tamas Kristof Toth, Teddy Lazebnik

发表机构 * Department of Computing(计算系) Jönköping University(约翰·科普丁大学) Department of Information Science(信息科学系) University of Haifa(海法大学)

AI总结 提出一种物理引导的潜在扩散框架,通过生成受物理约束的合成数据来增强稀疏观测,从而提升符号回归在稀疏、噪声或不完整数据下的方程发现可靠性。

详情
AI中文摘要

符号回归(SR)通过将观测转化为可解释的控制方程,为科学发现提供了一条途径。然而,尽管其前景广阔,当时空测量稀疏、有噪声或物理上不完整时(这在实践中很常见),其可靠性会急剧下降。数据增强(DE)已被证明能够缓解这一限制,但除非额外样本保留目标系统的物理结构,否则它们可能误导方程发现。这种DE的隐含要求需要狭窄的领域专业知识以及技术流畅性,极大地限制了其实用性。在本研究中,我们引入了一个物理引导的潜在扩散框架,用于下游SR模型的DE。该框架结合了变分自编码器、条件潜在扩散模型和物理信息残差校正器,通过受控制关系约束的合成场来补全稀疏观测。我们在热传导、不可压缩Navier-Stokes流和移动单质量牛顿引力势上评估了该方法,使用GPLearn、DEAP和PySR作为下游SR后端。我们的结果表明,物理校正的增强在稀疏情况下始终改善了跨物理动力学和SR模型的恢复。这些结果表明,生成式增强可以在不需要额外领域专业知识的情况下加强方程发现。

英文摘要

Symbolic regression (SR) offers a route to scientific discovery by converting observations into interpretable governing equations. However, despite its promise, its reliability degrades sharply when spatiotemporal measurements are sparse, noisy, or physically incomplete, as commonly occurring in practice. Data enrichment (DE) has been shown to be able to mitigate this limitation, yet additional samples can mislead equation discovery unless they preserve the physical structure of the target system. Such implication of DE requires narrow domain expertise as well as technical fluidity, highly limiting its practical usefulness. In this study, we introduce a physics-guided latent diffusion framework for DE for down the line SR models. The proposed framework combines a variational autoencoder, a conditional latent diffusion model, and a physics-informed residual corrector to complete sparse observations with synthetic fields constrained by governing relations. We evaluate the approach on heat conduction, incompressible Navier-Stokes flow, and a moving single-mass Newtonian gravitational potential, using GPLearn, DEAP, and PySR as downstream SR backends. Our results reveal that physics-corrected enrichment consistently improves recovery in sparse regimes across physical dynamics and SR models. These results show that generative enrichment can strengthen equation discovery without additional domain expertise.

2606.00986 2026-06-02 cs.LG 版本更新

Profiling Privacy Preservation Against Gradient Inversion Attacks in Tabular Federated Learning

表格联邦学习中针对梯度反转攻击的隐私保护分析

Ivo Osterberg Nilsson, Maximilian Birr Engvall, Viktor Valadi, Teddy Lazebnik

发表机构 * Department of Computing(计算系) Jönköping University(琼堡大学) Scaleout Systems University of Haifa(海法大学)

AI总结 本研究通过评估不同联邦学习协议、客户端批量大小、训练阶段、攻击者假设、模型架构及任务类型下梯度反转攻击对表格数据的恢复能力,发现小批量更新最易受攻击,而FT-Transformer架构比MLP更难反转,并指出聚合重建精度可能高估完整记录恢复。

详情
AI中文摘要

联邦学习(FL)允许多个数据持有者在不集中原始数据的情况下协作训练机器学习模型,使其在医疗保健和机构数据共享等隐私敏感领域非常有用。FL将数据保留在客户端本地,仅通信模型更新(如梯度或模型增量)。然而,这些更新可能通过梯度反转攻击(GIA)暴露客户端私有数据。我们研究了在诚实但好奇的服务器威胁模型下,表格FL中的这种风险,涉及FL协议、客户端批量大小、训练阶段、攻击者假设、模型架构以及二分类、多分类和回归任务。我们使用MIMIC-IV和补充基准数据集。我们的评估区分了数值和分类恢复、基线可恢复性、特征级别恢复和精确匹配率(EMR)。我们使用暴露对齐协议评估FedSGD梯度和FedAvg模型增量,比较在匹配的客户端数据暴露(而非匹配的通信轮次)后的受攻击模型。我们比较了多层感知器(MLP)、ResNet和FT-Transformer模型,并通过MLP网格(宽度、深度、激活函数、归一化和丢弃率)隔离架构效应。结果表明,小客户端批量以及代表少量不同记录的更新最易受攻击。更大的本地批量和更强的聚合减少了重建,但并未消除泄露。FT-Transformer始终比独热基线更难反转,而MLP家族内的可重建性也差异很大。这些发现将架构确定为表格FL中一个实用的隐私变量。我们还表明,聚合重建精度可能高估稀疏数据中的完整记录恢复,因此EMR和基线比较至关重要。

英文摘要

Federated learning (FL) enables multiple data holders to train machine learning models collaboratively without centralizing raw data, making it useful in privacy sensitive domains such as healthcare and institutional data sharing. FL keeps data local to clients while communicating only model updates, such as gradients or model deltas. Nevertheless, these updates can expose private client data through gradient inversion attacks (GIAs). We study this risk for tabular FL under an honest-but-curious server threat model across FL protocols, client batch sizes, training stages, attacker assumptions, model architectures, and binary classification, multiclass classification, and regression tasks. We use MIMIC-IV and complementary benchmark datasets. Our evaluation distinguishes numerical and categorical recovery, baseline recoverability, feature level recovery, and exact match rate (EMR). We evaluate FedSGD gradients and FedAvg model deltas with an exposure aligned protocol, comparing attacked models after matched client data exposure rather than matched communication rounds. We compare multilayer perceptron (MLP), ResNet, and FT-Transformer models, and isolate architecture effects through an MLP grid over width, depth, activation, normalization, and dropout. The results show that small client batches and updates representing few distinct records are most vulnerable. Larger local batches and stronger aggregation reduce reconstruction but do not eliminate leakage. FT-Transformer is consistently harder to invert than one-hot baselines, while reconstructability also varies substantially within the MLP family. These findings identify architecture as a practical privacy variable in tabular FL. We also show that aggregate reconstruction accuracy can overstate complete record recovery in sparse data, making EMR and baseline comparisons essential.

2606.00984 2026-06-02 stat.ML cs.LG 版本更新

Practical and Optimal Algorithm for Linear Contextual Bandits with Rare Parameter Updates

线性上下文赌博机中参数稀有更新的实用最优算法

Sanghoon Yu, Min-hwan Oh

发表机构 * Sanghoon Yu(苏杭oon Yu) Min-hwan Oh

AI总结 针对参数更新次数受限的线性上下文赌博机问题,提出两种仅需O(log log T)次参数更新的算法,在静态调度下达到极小化最优遗憾,并显著降低计算复杂度。

Comments Accepted at ICML 2026

详情
AI中文摘要

我们研究在参数稀有更新下的线性上下文赌博机:学习器只能在少量更新时刻将奖励反馈纳入其参数估计,同时仍在线观察上下文并顺序选择动作。这一观点澄清了文献中常被模糊的实际区别:许多“严格批处理”方法额外限制了区间内上下文的自适应性,即区间内的动作规则不能依赖于该区间内已实现的上下文/动作序列(除了当前轮次的上下文)。对于线性上下文赌博机,我们提出了两种仅需$O(\log\log T)$次参数更新的实用算法。我们的第一个算法BLCE-G在静态调度下,同时在小$K$和大$K$机制下达到极小化最优遗憾(达到$T$的多对数因子)。第二个算法BLCE去除了近G-最优设计步骤——这是先前严格批处理静态网格方法中主要的计算瓶颈——同时保持极小化最优遗憾,并在最优算法中实现了已知最低的运行时间复杂度。我们进一步将这些稀有更新和计算原则扩展到广义线性上下文赌博机。总体而言,我们的结果在$O(\log\log T)$次参数更新下产生了统计最优且计算高效的算法。

英文摘要

We study linear contextual bandits under rare parameter updates: the learner may incorporate reward feedback into its parameter estimate only at a small number of update times, while still observing contexts online and selecting actions sequentially. This viewpoint clarifies a practical distinction that is often blurred in the literature: many "strictly batched" methods additionally restrict within-interval context adaptivity, meaning that the action rule inside an interval cannot depend on the sequence of realized contexts/actions in that interval (beyond the current round's context). For linear contextual bandits, we propose two practical algorithms with only $O(\log\log T)$ parameter updates. Our first algorithm BLCE-G attains minimax-optimal regret (up to polylogarithmic factors in $T$) simultaneously in both the small-$K$ and large-$K$ regimes under a static schedule. Our second algorithm BLCE removes the near G-optimal design step -- a dominant computational bottleneck in prior strictly batched static-grid methods -- yet preserves minimax-optimal regret and achieves the lowest known runtime complexity among optimal algorithms. We further extend these rare-update and computational principles to generalized linear contextual bandits. Overall, our results yield statistically optimal algorithms under $O(\log\log T)$ parameter updates that are also computationally efficient in practice.

2606.00979 2026-06-02 cs.LG 版本更新

UME: A Unified Meta-Generalization Framework for Cross-Domain ETA

UME:跨域ETA的统一元泛化框架

Duo Wang, Qiong Wu, Jianguo Wu, Ruiyu Xu, Jinhui Yi, Zhonggen Sun, Zhentao Zhang, Yu Zhang, Ke Xing, Yongjun Yin, Zishuo Li, Jianwen Huang

发表机构 * Peking University(北京大学) Meituan(美团)

AI总结 针对即时物流中跨域ETA预测的零样本泛化、特征缺失和知识迁移问题,提出基于超网络元学习的统一双分支架构UME,通过元模块动态调制特征门控、专家注意力和最终预测,并在美团Keeta平台部署验证。

详情
AI中文摘要

在即时物流中,结账页面的准确预计到达时间(ETA)预测对于提高用户满意度、优化调度和控制运营成本至关重要。在国际按需配送平台上,ETA数据来自具有不同模式的不同国家或地区,多域建模非常重要且已被广泛采用。然而,现有方法在实际部署中仍面临三个关键挑战。首先,当前的多域模型难以泛化到完全未见过的域,无法在初始冷启动阶段实现零样本预测。其次,跨域特征空间通常被假设为一致的,而新域由于缺乏历史数据,常常遭受离线(统计)特征的结构性缺失。第三,这种特征缺失通常迫使工业系统分别对成熟域和冷启动域进行建模,阻碍了知识迁移并增加了维护开销。为了解决这些挑战,我们提出了UME,一个统一的元泛化框架用于ETA。具体来说,UME将统一的双分支架构与一种新颖的元学习机制相结合,该机制采用基于超网络的元学习器。通过利用域级知识和实例级上下文,元学习器赋能三个元模块动态调制特征门控、专家注意力和最终预测,捕获跨域相关性并促进域内适应。进一步引入知识蒸馏策略以提升性能。UME现已部署在美团Keeta配送平台(中国最大的国际食品配送平台)上。大量的离线实验和在线A/B测试表明,UME显著优于现有基线。

英文摘要

Accurate Estimated Time of Arrival (ETA) prediction on checkout page is crucial in instant logistics for enhancing user satisfaction, optimizing dispatching, and controlling operational costs. In international on-demand delivery platforms, where ETA data originates from diverse countries or regions with different patterns, multi-domain modeling is of great importance and has been widely adopted. However, existing methods still face three critical challenges in real-world deployment. First, current multi-domain models struggle to generalize to completely unseen domains, failing to achieve zero-shot prediction during the initial cold-start phase. Second, cross-domain feature spaces are often assumed to be consistent, whereas new domains commonly suffer from structural missingness of offline (statistical) features due to the lack of historical data. Third, such feature missingness often compels industrial systems to model mature and cold-start domains separately, hindering knowledge transfer and increasing maintenance overhead. To address these challenges, we propose \textbf{UME}, a \textbf{U}nified \textbf{M}eta-generalization framework for \textbf{E}TA. Specifically, UME integrates a unified dual-branch architecture with a novel meta-learning mechanism that employs a hypernetwork-based meta learner. By leveraging domain-level knowledge and instance-level context, the meta learner empowers three meta modules to dynamically modulate feature gating, expert attention, and final prediction, capturing cross-domain correlations and facilitating intra-domain adaptation. A knowledge distillation strategy is further introduce to enhance performance. UME has now been deployed in Meituan-keeta delivery platform (the largest international food delivery platform in China). Extensive offline experiments and online A/B tests demonstrate that UME significantly outperforms existing baselines.

2606.00970 2026-06-02 cs.AI cs.LG econ.TH 版本更新

Prospect-Theory Behavior from Bellman Optimality in MDPs with Catastrophic States

具有灾难性状态的MDP中贝尔曼最优性产生的前景理论行为

Yujiao Chen

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制,发现标准贝尔曼最优性产生前景理论特征:S形值函数、内生损失敏感系数和反射效应策略反转,并推导出渐近损失厌恶平台的闭式表达式。

详情
AI中文摘要

我们研究具有吸收灾难状态的马尔可夫决策过程中的风险中性控制。尽管奖励是线性的且智能体没有效用曲率、概率加权或框架依赖,标准贝尔曼最优性产生了三个前景理论特征:S形值函数轮廓(灾难附近凸,远处凹)、内生损失敏感系数$λ^*(S) > 1$以及反射效应策略反转。在495个配置中,最优策略在正漂移(增长)模式下在灾难附近选择安全动作,尽管风险动作的即时期望值更高;在负漂移(衰退)模式下在灾难附近选择风险动作,尽管安全动作的即时期望损失更低。我们推导出渐近损失厌恶平台$\barλ$的闭式表达式,该表达式仅依赖于获胜概率$p$、收益不对称性$r = |Δ_\ell/Δ_w|$和折扣因子$β$,与数值解的拟合$R^2 = 0.999$。该机制不需要不对称收益。在三个不对称水平下对$(p,β)$进行扫描,$\barλ$大于1的不对称份额中位数为4.6%($r = 1.25$时),上升到13.9%($r = 2$时),且在每个测试单元中边界贡献超过不对称贡献。这些现象在表格Q学习(无模型智能体在增长模式下与$V^*$的相关性为0.98,衰退模式下为1.00)以及随机转移(高斯、重尾Student-$t_3$和不对称偏正态噪声,幅度高达步长的50%)中持续存在,其中渐近平台在安全通道噪声下跟踪闭式预测的误差在0.41%以内,在风险通道或双通道噪声下误差在9.6%以内。这些结果将吸收失败状态识别为最优控制下产生前景理论行为的充分结构机制。

英文摘要

We study risk-neutral control in Markov decision processes with an absorbing catastrophic state. Even though rewards are linear and the agent has no utility curvature, probability weighting, or framing dependence, standard Bellman optimality produces three prospect-theory-like signatures: an S-shaped value-function profile (convex near catastrophe, concave in the far field), an endogenous loss-sensitivity coefficient $λ^*(S) > 1$, and a reflection-effect policy reversal. Across 495 configurations, the optimal policy plays safe near catastrophe in positive-drift (growth) regimes despite the risky action's higher immediate expected value, and plays risky near catastrophe in negative-drift (decline) regimes despite the safe action's lower immediate expected loss. We derive a closed-form expression for the asymptotic loss-aversion plateau $\barλ$ that depends only on win probability $p$, payoff asymmetry $r = |Δ_\ell/Δ_w|$, and discount factor $β$, and matches numerical solutions to $R^2 = 0.999$. The mechanism does not require asymmetric payoffs. Across a sweep of $(p,β)$ at three asymmetry levels, the asymmetry share of $\barλ$ above unity has median 4.6% at $r = 1.25$ and rises to 13.9% at $r = 2$, with the boundary contribution exceeding the asymmetry contribution in every cell tested. The phenomena persist under tabular Q-learning (a model-free agent reproduces $V^*$ at correlation 0.98 in growth and 1.00 in decline) and under stochastic transitions with Gaussian, heavy-tailed Student-$t_3$, and asymmetric skew-normal noise up to 50% of the step size, where the asymptotic plateau tracks the closed-form prediction within 0.41% for safe-channel noise and within 9.6% for risky-channel or both-channel noise. These results identify absorbing failure states as a sufficient structural mechanism for prospect-theory-like behavior under optimal control.

2606.00956 2026-06-02 cs.LG 版本更新

Optimal-Point Variance Reduction For Bayesian Optimization With Regret Guarantee

具有遗憾保证的贝叶斯优化的最优点方差缩减

Shion Takeno

发表机构 * Nagoya University(名古屋大学)

AI总结 提出一种名为最优点方差缩减(OVR)的单步前瞻贝叶斯优化方法,通过后验采样和蒙特卡洛近似实现,并证明了正则化OVR的贝叶斯期望简单遗憾上界趋于零。

Comments 23pages, 3 figures

详情
AI中文摘要

本文研究了一种单步前瞻贝叶斯优化(BO)方法及其理论保证。尽管单步前瞻BO方法(如熵搜索)的经验有效性已被广泛研究,但它们通常依赖于计算上难以处理的近似,且其遗憾保证仍不完善。因此,本文提出了一种名为最优点方差缩减(OVR)的单步前瞻BO方法,该方法仅需要后验采样和蒙特卡洛近似。我们得到了OVR中蒙特卡洛估计在输入域上的均匀误差界。此外,我们表明,通过轻微修改以促进探索的正则化OVR,实现了贝叶斯期望简单遗憾上界趋于零。最后,我们通过数值实验展示了OVR的有效性。

英文摘要

This paper studies a one-step lookahead Bayesian optimization (BO) method and its theoretical guarantee. Although the empirical effectiveness of one-step lookahead BO methods, such as entropy search, has been studied extensively, they often rely on computationally intractable approximations, and their regret guarantees remain underdeveloped. Thus, this paper proposes a one-step lookahead BO method called optimal-point variance reduction (OVR), which requires only posterior sampling and Monte Carlo approximations. We obtain a uniform error bound over an input domain for the Monte Carlo estimation in OVR. Furthermore, we show that the regularized OVR, with the slight modification to promote exploration, achieves a vanishing Bayesian expected simple regret upper bound. Finally, we demonstrate the effectiveness of OVR through numerical experiments.

2606.00955 2026-06-02 cs.LG q-bio.QM 版本更新

CryoProt: A Protein Pretraining Framework with Cross-Box Interactions on Cryo-EM Density Maps

CryoProt: 一种基于冷冻电镜密度图跨盒交互的蛋白质预训练框架

Dan Luo, Xuan Lin, Peng Zhou, Junwen Zhu, Tengfei Ma, Xiangxiang Zeng, Yiping Liu

发表机构 * College of Computer Science and Electronic Engineering, Hunan University(湖南大学计算机科学与电子工程学院) School of Computer Science, Xiangtan University(湘潭大学计算机学院)

AI总结 提出 CryoProt 框架,通过多头潜在注意力机制实现密度图跨盒交互建模,并采用多任务预训练策略,在蛋白质柔性预测等下游任务中取得最高12%的性能提升。

详情
AI中文摘要

尽管冷冻电镜(cryo-EM)密度图的数据日益增多,但有效利用它们进行蛋白质表示仍具挑战。首先,当前方法缺乏专门针对cryo-EM密度图设计的通用蛋白质预训练框架,用于蛋白质相关属性预测。其次,现有方法通常将密度图划分为局部盒区域并独立建模,忽略了跨盒交互,而这对捕获cryo-EM密度图中的全局结构上下文至关重要。为解决这些挑战,我们提出CryoProt,一种专为cryo-EM密度图设计的蛋白质预训练框架。CryoProt引入了基于多头潜在注意力(MLA)的图编码器,其中盒级表示通过共享潜在空间进行交互,从而显式建模密度图内的跨盒依赖关系。此外,我们采用多任务预训练策略来学习可泛化的表示,这些表示可以有效地迁移到各种下游任务,例如蛋白质柔性预测,其中不需要cryo-EM密度图,而可以由预训练模型隐式推断。实验结果表明,CryoProt在多个基准测试中持续优于现有最先进方法,相比最佳基线实现了高达12%的提升,突显了在cryo-EM数据中建模跨盒交互的重要性。源代码公开于https://anonymous.4open.science/r/CryoProt。

英文摘要

Despite the growing availability of cryo-electron microscopy (cryo-EM) density maps, effectively leveraging them for protein representation remains challenging. First, current methods lack a general-purpose protein pretraining framework tailored for cryo-EM density maps, designed for protein-related property prediction. Second, existing approaches typically partition density maps into local box regions and model them independently, overlooking interactions across boxes which are essential for capturing global structural context in cryo-EM density map. To address these challenges, we propose CryoProt, a protein pretraining framework designed for cryo-EM density maps. CryoProt introduces a Map Encoder based on multi-head latent attention (MLA), where box-level representations interact through a shared latent space, enabling explicit modeling of cross-box dependencies within the density map. Furthermore, we adopt a multi-task pretraining strategy to learn generalizable representations that can be effectively transferred to diverse downstream tasks, such as protein flexibility prediction, where cryo-EM density maps are not required and can be inferred implicitly by the pretrained model. Experimental results demonstrate that CryoProt consistently outperforms existing state-of-the-art methods across multiple benchmarks, achieving up to 12% improvement over the best-performing baselines, highlighting the importance of modeling cross-box interactions in cryo-EM data. The source code is publicly available at https://anonymous.4open.science/r/CryoProt.

2606.00953 2026-06-02 cs.LG cs.MA 版本更新

When Parallelism Pays Off: Cohesion-Aware Task Partitioning for Multi-Agent Coding

当并行性有回报时:面向多智能体编码的凝聚力感知任务划分

Xu Yang, Lunyiu Nie, Ethan Chandra, Stanislav Gannutin, Fangru Lin, Swarat Chaudhuri

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Oxford(牛津大学)

AI总结 提出Co-Coder方法,通过静态分析构建依赖图、社区检测划分图及依赖感知调度,在仓库级软件工程中平衡通信与计算开销,实现多智能体并行编码的效率和成本优化。

详情
AI中文摘要

多智能体大语言模型(LLM)系统提供了一种通过并行化和上下文隔离来分解复杂任务(如编码)的方式。然而,在实践中增加智能体会引入智能体间通信开销,这会产生额外成本,有时甚至会抵消效率提升。我们将多智能体编排形式化为一个图划分问题,以捕捉通信与计算之间的权衡:任务分解可以缩短关键路径计算,但跨智能体依赖需要昂贵的上下文传输。我们在仓库级软件工程中实例化这一观点,并提出了凝聚力感知编码器(Co-Coder),它通过静态分析构建依赖图,隔离结构枢纽文件,通过社区检测划分图,并使用依赖感知调度器执行划分。在DevEval和CodeProjectEval上的28个真实世界任务中,Co-Coder在帕累托前沿上超越了顺序和基于文件的并行基线以及带有智能体团队的Claude Code,将通过率提高了最多14.0%,实现了最多2.10倍的墙钟加速,并将API成本降低了最多35%,在依赖最密集的项目上取得了最大收益。Co-Coder展示了凝聚力感知编排如何使并行编码智能体既具有理论依据又具有实际效率,为多智能体系统提出了更广泛的设计原则。

英文摘要

Multi-agent Large Language Model (LLM) systems offer a way to decompose complex tasks, such as coding, through parallelization and context isolation. However, adding agents in practice introduces inter-agent communication overhead, which incurs extra cost and can sometimes offset the efficiency gains. We formalize multi-agent orchestration as a graph partitioning problem that captures the communication-to-computation trade-off: task decomposition can shorten critical-path computation, but cross-agent dependencies require costly context transfer. We instantiate this view in repository-level software engineering and present Cohesion-aware Coder (Co-Coder), which builds dependency graphs from static analysis, isolates structural hub files, partitions the graph via community detection, and executes the partition with a dependency-aware scheduler. Across 28 real-world tasks on DevEval and CodeProjectEval, Co-Coder advances the Pareto-frontier over sequential and file-based parallel baselines as well as Claude Code with Agent Teams, lifting pass rate by up to 14.0%, achieving up to a 2.10x wall-clock speedup, and reducing API cost by up to 35%, with the largest gains on the most dependency-dense projects. Co-coder demonstrates how cohesion-aware orchestration can make parallel coding agents both theoretically grounded and practically efficient, suggesting a broader design principle for multi-agent systems.

2606.00950 2026-06-02 cs.LG 版本更新

COLLIE: Guiding Skill Discovery in Semantically Coherent Latent Space

COLLIE:在语义连贯的潜在空间中引导技能发现

Yao Luan, Ni Mu, Hanfei Ge, Yiqin Yang, Bo Xu, Qing-Shan Jia

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出COLLIE框架,利用密集无监督数据构建语义连贯的潜在空间,通过无需额外训练的引导信号实现稀疏人类反馈下的有效技能发现,避免危险行为并提升下游性能。

Comments ICML 2026

详情
AI中文摘要

无监督技能发现(USD)旨在无需奖励函数的情况下学习多样化的行为,但由于均匀探索,常常导致与任务无关或危险的行为。引导式技能发现(GSD)通过融入人类意图将探索聚焦于有意义的区域来解决这一问题。然而,现有的GSD方法通常需要训练额外的引导模型,并依赖于预定义规则或专家演示,这在稀疏的在线收集的人类反馈下可能效果不佳。为了克服这一点,我们提出了COLLIE,一个利用密集无监督数据构建语义连贯技能潜在空间的GSD框架。该潜在空间结构良好,能够通过稀疏的在线反馈实现可靠的引导。此外,其语义连贯性特性使得引导信号的构建无需训练,从而消除了在技能学习之外额外训练模型的需要。理论分析证明了我们无需训练的引导信号的有效性,而在各种基于状态和基于像素的任务上的实验表明,COLLIE能够学习多样化、与人类对齐的技能,避免危险行为,并在最少的人类反馈下实现优越的下游性能。

英文摘要

Unsupervised skill discovery (USD) aims to learn diverse behaviors without reward functions, but often results in task-irrelevant or hazardous behaviors due to uniform exploration. Guided skill discovery (GSD) addresses this issue by incorporating human intent to focus exploration on meaningful regions. However, existing GSD methods typically require training additional guidance models, and rely on pre-defined rules or expert demonstration, which can be ineffective under sparse, online-collected human feedback. To overcome this, we propose COLLIE, a GSD framework that leverages dense unsupervised data to construct a semantically coherent skill latent space. This latent space is well-structured, enabling reliable guidance with sparse online feedback. Moreover, its semantic coherence property enables training-free construction of guidance signals, eliminating the need for additional model training beyond skill learning. Theoretical analysis justifies the effectiveness of our training-free guidance signal, while experiments across diverse state-based and pixel-based tasks show that COLLIE learns diverse, human-aligned skills, avoids hazardous behaviors, and achieves superior downstream performance with minimal human feedback.

2606.00949 2026-06-02 cs.LG cs.AI physics.flu-dyn 版本更新

Explainable deep reinforcement learning reveals energy-efficient control strategies for turbulent drag reduction

可解释深度强化学习揭示湍流减阻的节能控制策略

Federica Tonti, Ricardo Vinuesa

发表机构 * Department of Aerospace Engineering University of Michigan(航空航天工程系密歇根大学)

AI总结 结合多智能体深度强化学习与可解释深度学习,提出基于SHAP归因的奖励策略,实现高效湍流减阻,净节能达34.01%且输入功率仅0.43%。

详情
AI中文摘要

我们提出了一种结合多智能体深度强化学习(MARL)和可解释深度学习(XDL)的方法,用于减少壁面边界湍流中的阻力。以直接针对壁面剪切应力和反对称控制训练智能体的结果作为基线,比较了三种SHAP引导的方法。第一种方法中,奖励根据预测未来速度场的U-net的SHAP归因计算;第二种方法中,奖励根据预测摩擦系数的U-net的SHAP归因计算;第三种方法中,奖励结合了分别预测摩擦系数和壁面压力脉动的两个U-net的SHAP归因。基于摩擦系数和壁面压力脉动的组合SHAP策略实现了最佳整体性能,在仅0.43%归一化输入功率下实现了34.44%的减阻率(DR)和34.01%的净节能率(NES)。相对于反对称控制,减阻和净节能分别提高了49.41%和48.52%。与直接壁面剪切应力基线相比,所提出的策略在提高性能的同时,将归一化驱动成本从5.90%降低到0.43%。结果分析表明,节能策略与压力门控驱动一致,主要在壁面压力接近零时激活,并且其时间尺度与近壁湍流结构的寿命相当。

英文摘要

We propose a method combining Multi-Agent Deep Reinforcement Learning (MARL) and eXplainable Deep Learning (XDL) to reduce drag in wall-bounded turbulent flows. Taking as a baseline the results of training agents directly targeting wall-shear stress and opposition control, three SHAP-guided approaches are compared. In the first, the reward is computed from SHAP attributions of a U-net predicting the future velocity field; in the second, from SHAP attributions of a U-net predicting the skin-friction coefficient; in the third, from a combination of SHAP attributions of two U-nets predicting the skin-friction coefficient and the wall pressure fluctuations, respectively. The combined SHAP strategy based on skin-friction coefficient and wall-pressure fluctuations achieves the best overall performance, achieving a DR of 34.44% and a NES of 34.01% with only 0.43% normalized input power. Relative to opposition control, drag reduction and net energy saving increase by 49.41% and 48.52%, respectively. Compared with the direct wall-shear-stress baseline, the proposed strategy simultaneously improves performance while reducing the normalized actuation cost from 5.90% to 0.43%. Analysis of the results reveals that the energetically efficient policy is consistent with pressure-gated actuation, activating predominantly at near-zero wall pressure, and operates on a temporal timescale comparable to the lifetime of the near-wall turbulent structures.

2606.00946 2026-06-02 cs.DC cs.AI cs.LG 版本更新

Lodestar: An Online-Learning LLM Inference Router

Lodestar: 一种在线学习的大语言模型推理路由器

Gangmuk Lim, Wanyu Zhao, Brighten Godfrey, Jiaxin Shan, Le Xu, Liguang Xie

发表机构 * UIUC(伊利诺伊大学香槟分校) Bytedance(字节跳动) University of Edinburgh(爱丁堡大学)

AI总结 提出Lodestar,一种基于在线学习的请求路由系统,通过实时收集集群状态并训练奖励预测器,以最小化TTFT为目标分配推理请求,在异构GPU集群上显著降低延迟。

详情
AI中文摘要

高效服务大语言模型(LLM)推理任务对于用户感知的延迟(如首令牌时间TTFT)和GPU利用率至关重要。然而,LLM请求路由(即将每个推理请求分配给GPU实例)尤其具有挑战性:执行高度依赖于输入;批处理和KV缓存重用造成了强烈的跨请求耦合;延迟对上下文长度、模型/引擎设置和异构加速器呈非线性响应。因此,简单的传统负载均衡算法,甚至针对LLM推理定制的启发式方法,都难以实现良好性能。我们提出Lodestar,一种面向分布式GPU集群的基于学习的请求路由系统。Lodestar持续在每个请求级别收集集群快照,包括实时实例状态、请求特征和观察到的性能,并训练一个在线奖励预测器,用于将推理请求路由到将最大化给定奖励(例如最小化TTFT)的实例。Lodestar是云原生的,并与现有服务栈(vLLM)无缝协作。通过持续在线适应变化的工作负载和基础设施条件,与最先进的前缀缓存和负载感知启发式方法相比,Lodestar在平均TTFT上降低1.41倍,在P99 TTFT上平均降低1.47倍(在同构集群上最高达2.15倍/1.86倍,在异构集群上最高达4.38倍/4.42倍),并且根据在公有云GPU集群上的实验,大约在5分钟内学习到这些高效的路由策略。

英文摘要

Efficiently serving large language model (LLM) inference tasks is crucial both for user-perceived latency such as time-to-first-token (TTFT) and for GPU utilization. However, LLM request routing, that is, assigning each inference request to a GPU instance, is particularly challenging: execution is highly input-dependent; batching and KV-cache reuse create strong cross-request coupling; and latency responds nonlinearly to context length, model/engine settings, and heterogeneous accelerators. As a result, simple traditional load balancing algorithms, and even heuristics tailored for LLM inference, fail to achieve good performance. We present Lodestar, a novel learning-based request routing system for distributed GPU clusters. Lodestar continuously collects a snapshot of the cluster at per-request level, including real-time instance state, request characteristics, and observed performance, and trains an online reward predictor that it uses to route inference requests to the instance that will maximize given reward (e.g., minimizing TTFT). Lodestar is cloud-native and works seamlessly with existing serving stacks (vLLM). With continuous online adaptation to changing workloads and infrastructure conditions, Lodestar achieves 1.41x lower average TTFT and 1.47x lower P99 TTFT on average (up to 2.15x/1.86x on homogeneous and 4.38x/4.42x on heterogeneous clusters) compared to a state-of-the-art prefix cache and load-aware heuristic, and learns these efficient routing strategies within about 5 minutes, based on experiments in a public cloud GPU cluster.

2606.00944 2026-06-02 cs.LG 版本更新

PRISM: Gauge-Invariant Tangent-Space Differentially Private LoRA

PRISM: 规范不变切空间差分隐私LoRA

Shihao Wang, Xueru Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对LoRA中低秩参数化导致的非可辨识性和规范依赖噪声放大问题,提出PRISM机制,通过构造规范不变的差分隐私扰动,实现高效且稳定的隐私-效用权衡。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026) as an oral presentation

详情
AI中文摘要

通过DP-SGD将差分隐私(DP)应用于低秩适应(LoRA)是一种自然的隐私保护微调方法。然而,LoRA的低秩参数化带来了根本性挑战。在LoRA中,每个可训练更新表示为低秩矩阵$Z = AB^\top$,但这种分解本质上是非可辨识的:许多因子对$(A,B)$表示相同的更新$Z$。因此,直接将DP-SGD应用于因子会导致$Z$上的规范依赖扰动,并且我们表明这种朴素的DP-LoRA可能导致无界的噪声放大。我们提出了PRISM,一种针对LoRA的内在DP机制,该机制通过构造具有规范不变性,避免了双线性噪声放大,并允许高效的低维噪声采样。此外,PRISM给出了$Z$上有效内在噪声的闭式表征,通过有界、规范不变的扰动实现稳定的隐私-效用权衡。我们为PRISM建立了标准的$(\varepsilon,\delta)$-DP保证,并引入了一种DP感知的、规范不变的自适应更新规则,防止自适应优化放大注入的隐私噪声,从而在实践中提高数值稳定性。

英文摘要

Applying differential privacy (DP) via DP-SGD to Low-Rank Adaptation (LoRA) is a natural approach for privacy-preserving fine-tuning. However, LoRA's low-rank parameterization poses a fundamental challenge. In LoRA, each trainable update is represented as a low-rank matrix $Z = AB^\top$, but this factorization is inherently non-identifiable: many factor pairs $(A,B)$ represent the same update $Z$. As a result, applying DP-SGD directly to the factors induces gauge-dependent perturbations on $Z$, and we show that this naive DP-LoRA can lead to unbounded noise amplification. We propose PRISM, an intrinsic DP mechanism for LoRA that is gauge invariant by construction, avoids bilinear noise amplification, and admits an efficient low-dimensional noise sampler. Moreover, PRISM yields a closed-form characterization of the effective intrinsic noise induced on $Z$, enabling stable privacy-utility trade-offs through bounded, gauge-invariant perturbations. We establish standard $(ε,δ)$-DP guarantees for PRISM and introduce a DP-aware, gauge-invariant adaptive update rule that prevents adaptive optimization from amplifying injected privacy noise, improving numerical stability in practice.

2606.00938 2026-06-02 cs.CE cs.LG 版本更新

Machine Learning Surrogate Modeling for Homogenization of Hyperelastic Materials with Boolean Microstructures

具有布尔微结构的超弹性材料均匀化的机器学习代理建模

Matthias Brändel, Oliver Rheinbach

发表机构 * Technische Universität Bergakademie Freiberg(弗赖贝格应用科学大学)

AI总结 提出一种监督学习方法,利用低维微观结构描述符(如面积分数、形状描述符τ、两点相关函数S2(r)和线路径函数ℓ(z))预测超弹性复合材料的有效拉梅参数,并通过留一法交叉验证评估泛化能力。

Comments 16 pages, 7 figures

详情
AI中文摘要

数据驱动代理模型是非均质材料数值均匀化的替代方案。本文提出一种监督学习方法,用于从低维微观结构描述符预测超弹性复合材料的有效拉梅参数。数据集基于先前发表的平面布尔模型生成的两相随机微观结构集合的数值均匀化结果,涵盖了夹杂物形状、相衬和面积分数的变化;参见Brändel, Brands, Maike, Rheinbach, Schröder, Schwarz和Stoyan (2022)。神经网络在标量和曲线值统计描述符的组合上进行训练,包括面积分数、导出的标量形状描述符$τ$、两点相关函数$S_2(r)$和线路径函数$\ell(z)$。还加入了代表参数空间极限情况的额外数据,以稳定训练并改善外推行为。通过留一颗粒类型交叉验证评估代理模型,以评估对未见颗粒几何形状的泛化能力。数值结果表明,额外的描述符可以降低相对误差。使用$τ$和$S_2(r)$训练的预测器提供了紧凑的表示,具有良好的定量精度和规则的密集响应行为。添加线路径函数$\ell(z)$进一步降低了可用数据点上的误差,表明它是一个有前景的额外描述符;然而,训练后密集响应评估显示,改进的点态精度并不能自动保证采样参数值之间的物理可接受行为。这激励了未来在物理约束代理模型、损失公式、有界输出参数化以及曲线值几何描述符的更系统表示方面的工作。

英文摘要

Data-driven surrogate models are an alternative to numerical homogenization of heterogeneous materials. In this contribution, a supervised learning approach is presented for predicting effective Lamé parameters of hyperelastic composites from low-dimensional microstructural descriptors. The data set is based on previously published numerical homogenization results for ensembles of two-phase stochastic microstructures generated by planar Boolean models, covering variations of inclusion shape, phase contrast, and area fraction; see Brändel, Brands, Maike, Rheinbach, Schröder, Schwarz and Stoyan (2022). A neural network is trained on combinations of scalar and curve-valued statistical descriptors, including the area fraction, a derived scalar shape descriptor $τ$, the two-point correlation function $S_2(r)$, and the lineal-path function $\ell(z)$. Additional data representing limiting cases of the parameter space are incorporated to stabilize training and improve extrapolation behavior. The surrogate is evaluated by leave-one-grain-type-out cross-validation in order to assess generalization to unseen grain geometries. Numerical results demonstrate that additional descriptors can reduce relative errors. A predictor trained with $τ$ and $S_2(r)$ provides a compact representation with good quantitative accuracy and regular dense response behavior. Adding the lineal-path function $\ell(z)$ further reduces the error at the available data points, indicating that it is a promising additional descriptor; however, dense post-training response evaluations show that improved pointwise accuracy does not automatically guarantee physically admissible behavior between sampled parameter values. This motivates future work on physically constrained surrogate models, loss formulations, bounded output parametrizations, and a more systematic representation of curve-valued geometric descriptors.

2606.00937 2026-06-02 cs.LG cs.CE cs.NA math.NA physics.comp-ph physics.plasm-ph 版本更新

Cellular Sheaf Neural Operators for Structure-Preserving Surrogate Modeling of Constrained PDEs

细胞层神经算子用于约束PDE的结构保持代理建模

Lennon J. Shikhman, Shane Gilbertie

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Franklin & Marshall University(弗兰克林与马歇尔大学)

AI总结 提出细胞层神经算子,通过面向细胞复形的消息传递和Hodge拉普拉斯算子,在代理模型中保持PDE的几何约束和物理结构,在湍流MHD和聚变平衡任务中提升了结构敏感诊断指标。

Comments 41 pages, 5 figures, 3 tables

详情
AI中文摘要

神经算子为PDE模拟提供了快速的代理模型,但标准架构通常将几何和离散化视为次于场数据。物理状态通常表示为网格通道堆栈,即使不同数量自然属于顶点、边、面、单元、边界或界面,并且必须满足兼容性约束。我们提出了细胞层神经算子,一种用于结构保持神经PDE代理的离散化感知框架。该方法在定向细胞复形上表示PDE状态,通过学习的限制映射耦合局部特征空间,并使用关联/Hodge感知的消息传递来遵循计算几何。学习到的更新头通过共边界或通量映射,使得选定的约束来自细胞复形结构而不仅仅来自损失惩罚。对于磁流体动力学,这产生了由边缘电动势场驱动的基于面的磁通量更新和由学习的面通量和单元源驱动的有限体积式流体更新。在湍流MHD和聚变平衡代理任务上,该方法改善了结构敏感诊断,包括展开行为、散度控制、谱误差和平衡回归精度。这些结果表明,细胞层结构是约束多物理系统中神经PDE代理的有用归纳偏置。

英文摘要

Neural operators provide fast surrogate models for PDE simulations, but standard architectures often treat geometry and discretization as secondary to field data. Physical states are usually represented as grid-channel stacks, even when different quantities naturally belong on vertices, edges, faces, cells, boundaries, or interfaces and must satisfy compatibility constraints. We propose Cellular Sheaf Neural Operators, a discretization-aware framework for structure-preserving neural PDE surrogates. The method represents PDE states on oriented cell complexes, couples local feature spaces through learned restriction maps, and uses incidence/Hodge-informed message passing to follow computational geometry. Learned update heads pass through coboundary or flux maps, allowing selected constraints to arise from cell-complex structure rather than only from loss penalties. For magnetohydrodynamics, this yields face-based magnetic-flux updates driven by edge electromotive fields and finite-volume-style fluid updates driven by learned face fluxes and cell sources. On turbulent MHD and fusion-equilibrium surrogate tasks, the method improves structure-sensitive diagnostics, including rollout behavior, divergence control, spectral error, and equilibrium-regression accuracy. These results indicate that cellular-sheaf structure is a useful inductive bias for neural PDE surrogates in constrained multiphysics systems.

2606.00934 2026-06-02 stat.ML cs.LG stat.AP stat.ME 版本更新

Efficient Synthetic Network Generation via Latent Embedding Reconstruction

通过潜在嵌入重建的高效合成网络生成

Feifan Jiang, Yinan Bu, Shihao Wu, Gongjun Xu, Ji Zhu

发表机构 * Department of Statistics, University of Michigan, Ann Arbor, Michigan, USA(统计学系,密歇根大学,安娜堡,密歇根州,美国)

AI总结 提出SyNGLER框架,基于潜在空间网络模型,通过重建潜在嵌入生成合成网络,兼顾效率与结构保真度。

详情
AI中文摘要

网络数据在社会科学、生物学和信息系统中无处不在。生成逼真的合成网络数据具有从网络模拟到科学发现的广泛应用。然而,许多现有的黑盒网络生成方法倾向于过拟合观测数据,同时忽视特征网络结构,并在大规模下产生大量计算开销。这些实际挑战要求合成网络生成方法既高效又能捕捉网络的结构特性。在本文中,我们介绍了通过潜在嵌入重建的合成网络生成(SyNGLER),这是一个基于潜在空间网络模型的通用且高效的合成网络生成框架。给定一个观测网络,SyNGLER首先通过潜在空间网络模型学习低维潜在节点嵌入,然后通过在这些嵌入上构建无分布生成器来重建潜在空间。对于生成,SyNGLER首先从潜在空间中的生成器采样(或重采样)节点嵌入,然后使用潜在空间网络模型生成合成网络。通过潜在空间框架,SyNGLER保留了网络中的独特特征,如稀疏性和节点度异质性,同时允许以比许多现有深度架构更低的计算成本进行高效训练。我们通过开发真实边缘分布与合成边缘分布之间距离的一致性结果来提供理论保证。实证研究进一步证明了SyNGLER的有效性,与现有方法相比,它高效地生成了更好地保留关键网络特征(如网络矩和度分布)的网络。代码可在 https://github.com/FeifanJiang/syngler 获取。

英文摘要

Network data are ubiquitous across the social sciences, biology, and information systems. Generating realistic synthetic network data has broad applications from network simulation to scientific discovery. However, many existing black-box approaches for network generation tend to overfit observed data while overlooking characteristic network structure, and incur substantial computational overhead at scale. These practical challenges call for synthetic network generation methods that are both efficient and capable of capturing structural properties of networks. In this paper, we introduce Synthetic Network Generation via Latent Embedding Reconstruction (SyNGLER), a general and efficient framework for synthetic network generation that builds on latent space network models. Given an observed network, SyNGLER first learns low-dimensional latent node embeddings via a latent space network model and then reconstructs the latent space by building a distribution-free generator over these embeddings. For generation, SyNGLER first samples (or resamples) node embeddings from the generator in the latent space and then produces synthetic networks using the latent space network model. Through the latent space framework, SyNGLER preserves unique characteristics in networks such as sparsity and node degree heterogeneity, while allowing for efficient training with lower computational cost than many existing deep architectures. We provide theoretical guarantees by developing consistency results on the distance between the true and synthetic edge distributions. Empirical studies further demonstrate the effectiveness of SyNGLER, which efficiently produces networks that better preserve key network characteristics such as network moments and degree distributions compared with existing approaches. Code is available at https://github.com/FeifanJiang/syngler.

2606.00930 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Detection vs. Execution: Single-Bucket Probes Miss Half the Mamba-2 State Sink

检测 vs. 执行:单桶探针遗漏了 Mamba-2 状态汇的一半

Yuhang Jiang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文发现 Mamba-2 中的状态汇(state sink)可分解为两类功能头集,单桶探针仅能恢复执行层而遗漏检测层,表明表征相似性不等于功能等价。

Comments 16 pages, 3 figures

详情
AI中文摘要

机械可解释性通常假设识别表征特征的探针也能识别执行相应计算的电路。我们证明这一假设在 Mamba-2 中可能系统性失败。通过研究状态汇(边界 token 上不成比例的 Delta 门控激活,类似于注意力汇),我们发现单桶探针仅能恢复一个小的执行层,而遗漏了具有相同表征特征的更大的检测层。 在 Mamba-2 中,状态汇分解为两个功能头集。单桶 BOS 专家头(在 2.7B 模型中约占 5% 的头)在模型规模和语料库上因果支持 BOS 上下文和新行目标预测。双头(占头的 27-35%,通过同一探针的多类聚合恢复)表现出更强的 BOS-新行表征相似性,但在消融下因果效应显著较弱。表征相似性并不意味着功能等价。 这一区别对下游行为至关重要:消融 BOS 专家头使 Mamba-1 2.8B 和 Mamba-2 2.7B 在 1024 上下文长度下的 RULER NIAH 检索准确率从 1.00 降至 0.00,而大小匹配的补集保持基线性能。随机通道分桶控制排除了仅由基质粒度造成的可能,暗示 Mamba-2 的头共享 Delta 投影。探针导出的专长可以识别执行电路;在粗粒度下,同一探针也能恢复检测电路,而区分它们需要类别条件消融而非类别条件余弦。

英文摘要

Mechanistic interpretability often assumes that probes identifying a representational signature also identify the circuit executing the corresponding computation. We show that this assumption can fail systematically in Mamba-2. Studying the state sink (disproportionate Delta-gate activation on boundary tokens, analogous to the attention sink), we find that single-bucket probes recover only a small execution layer while missing a much larger detection layer with the same representational signature. In Mamba-2, the state sink decomposes into two functional head sets. Single-bucket BOS-specialist heads (about 5% of heads at 2.7B) causally support both BOS-context and newline-target predictions across model scales and corpora. Dual heads (27-35% of heads, recovered by multi-class aggregation of the same probe) show stronger BOS-newline representational similarity but substantially weaker causal effects under ablation. Representational similarity does not imply functional equivalence. This distinction matters for downstream behaviour: ablating BOS-specialist heads collapses RULER NIAH retrieval accuracy from 1.00 to 0.00 at 1024 context length in both Mamba-1 2.8B and Mamba-2 2.7B, while size-matched complements preserve baseline performance. A random channel-bucketing control rules out substrate granularity alone, implicating Mamba-2's head-shared Delta projection. Probe-derived specialty can identify execution circuits; at coarse granularity the same probe also recovers detection circuits, and separating them requires class-conditional ablation rather than class-conditional cosine.

2606.00928 2026-06-02 cs.CV cs.LG 版本更新

Single-Channel Tissue Segmentation via Cross-Modal Distillation from Foundation Models

基于基础模型跨模态蒸馏的单通道组织分割

Sakib Mohammad, Jarin Ritu, Md Sakhawat Hossain

发表机构 * Department of Engineering Technology(工程技术系) Department of Electrical and Computer Engineering(电气与计算机工程系) Department of Mechanical Engineering(机械工程系)

AI总结 提出跨模态知识蒸馏框架,将多通道输入的基础模型教师知识迁移到仅使用核通道的轻量级学生网络,实现单通道组织分割性能大幅提升。

Comments 6 pages, 3 figures

详情
AI中文摘要

多重荧光显微镜通过提供互补通道(包括核(DAPI)和膜(E-cadherin))改善组织分割,这些通道共同编码比单通道成像更丰富的空间上下文。然而,多重模型在推理时需要所有通道,限制了在仅部分通道可用时的部署。本文提出一个跨模态知识蒸馏框架,将处理多重输入的基础模型教师的语义信息迁移到仅使用核通道的轻量级学生网络。蒸馏目标结合了基于MSE的概率匹配、边界感知监督和可学习的不确定性加权。在TissueNet和BBBC038上,评估了SAM ViT-H和CellSAM作为教师,四个U-Net学生:Swin-Tiny(27M)、ResNet18(11M)、EfficientNet-B0(5.3M)和MobileNetV3(1.5M)。在TissueNet上,SAM蒸馏的Swin-Tiny学生达到Dice 78.36(±1.44),比无KD基线(65.31±1.35)提高13.05分,并以23倍参数缩减恢复了教师oracle性能(89.12±1.21)的87.9%。KD一致地使所有四个学生提高约12个Dice点,确认了架构无关的蒸馏。在所有设置中,SAM ViT-H作为教师优于CellSAM。在BBBC038上的跨数据集评估显示,无需教师重新训练即可获得一致增益。

英文摘要

Multiplexed fluorescence microscopy improves tissue segmentation by providing complementary channels including nuclear (DAPI) and membrane (E-cadherin), that together encode richer spatial context than single-channel imaging alone. However, multiplexed models require all channels at inference, limiting deployment where only a subset is available. This work proposes a cross-modal knowledge distillation framework that transfers semantic information from a frozen foundation model teacher processing multiplexed input to a lightweight student operating on the nuclear channel only. The distillation objective combines MSE-based probability matching, boundary-aware supervision, and learnable uncertainty weighting. SAM ViT-H and CellSAM are evaluated as teachers across four U-Net students: Swin-Tiny (27M), ResNet18 (11M), EfficientNet-B0 (5.3M), and MobileNetV3 (1.5M), on TissueNet and BBBC038. On TissueNet, the SAM-distilled Swin-Tiny student achieves Dice 78.36 (plus or minus 1.44), a 13.05-point improvement over the no-KD baseline (65.31 plus or minus 1.35) and 87.9% recovery of teacher oracle performance (89.12 plus or minus 1.21) at a 23x parameter reduction. KD consistently improves all four students by approximately 12 Dice points, confirming architecture-agnostic distillation. SAM ViT-H outperforms CellSAM as teacher across all settings. Cross-dataset evaluation on BBBC038 shows consistent gains without teacher retraining.

2606.00926 2026-06-02 cs.LG cs.CL 版本更新

Task Structure Reverses Layerwise State Encoding in Sequence Models

任务结构逆转序列模型中的层级状态编码

Yuhang Jiang

发表机构 * Independent Researcher(独立研究者)

AI总结 本文通过形式模型和预训练模型上的实验,发现序列模型(如Transformer、Mamba、LSTM等)中层级状态编码的分布模式会随任务结构(如Parity、Dyck-k、S3)而逆转,且这种分组由计算结构(前缀更新 vs. 栈)而非代数结构(交换性)决定。

Comments 20 pages, 11 figures, 8 tables

详情
AI中文摘要

序列模型的机制研究通常将层级状态编码视为架构特征:循环模型集中可读状态,注意力模型分散状态。我们发现,当任务改变时,同一架构会逆转这种分布。在Transformer、Mamba、Mamba-2、LSTM和GRU中,Parity在Mamba和循环基线中集中在后期,而Transformer逐步构建;在有界深度Dyck-k上模式翻转。同样的翻转出现在微调的Mamba-130M和Pythia-160M中,且Pythia在Dyck上的瓶颈在410M时仍然存在。文献中混淆了两种解释:代数结构(交换性)与计算结构(前缀更新 vs. 栈)。为了区分它们,我们添加了第三个任务:非交换的S3置换组合。在所有五种架构的层级探测和Mamba特有的Conv1D归因中,S3与Parity而非Dyck归为一组,因此分组追踪的是计算结构而非交换性。因果干预表明,在4层形式模型中,线性可读方向通常是功能上必要的,并且在Parity和Dyck上的分布外长度上可能仍然重要。在预训练规模上,情况出现分化。微调的Pythia在Dyck上存在强中间层瓶颈(在160M时,L6-L7消融使准确率下降约81%;在410M时,L4-L18出现更宽的瓶颈),而在最佳探测层上则弱得多。预训练的Mamba表现出互补的失败模式:其最后一层高度可读,但没有任何单个探测方向能在Parity、Dyck或S3上破坏任务,然而中间位置的激活修补恢复了约97-98%的干净-损坏logit差距。探测定位了状态线性可用的位置,并不总是计算瓶颈所在。机制特征是架构和任务共同的性质。

英文摘要

Mechanistic studies of sequence models often treat layerwise state encodings as architectural traits: recurrent models concentrate readable state, attention-based models distribute it. We find that the same architecture reverses this profile when the task changes. Across Transformers, Mamba, Mamba-2, LSTMs, and GRUs, Parity is concentrated late in Mamba and the recurrent baselines and built gradually by Transformer; on bounded-depth Dyck-k the pattern flips. The same flip appears in fine-tuned Mamba-130M and Pythia-160M, and the Pythia Dyck bottleneck persists at 410M. Two explanations are conflated in the literature: algebraic structure (commutativity) versus computational structure (prefix update vs. stack). To separate them we add a third task: non-commutative S_3 permutation composition. S_3 groups with Parity, not Dyck, on layerwise probing across all five architectures and on Mamba-specific Conv1D attribution, so the grouping tracks computational structure rather than commutativity. Causal interventions show that, in the 4-layer formal models, linearly readable directions are often functionally necessary and can remain important at out-of-distribution lengths on Parity and Dyck. At pretrained scale the picture splits. Fine-tuned Pythia Dyck has a strong middle-layer bottleneck (L6-L7 ablation drops accuracy by roughly 81% at 160M; broader L4-L18 plateau at 410M), far weaker at the best-probe layer. Pretrained Mamba shows the complementary failure mode: its final layer is highly readable, no single probe direction breaks the task on Parity, Dyck, or S_3, yet mid-position activation patching there recovers about 97-98% of the clean-corrupted logit gap. Probing localizes where state is linearly available, not always where the computation is bottlenecked. Mechanistic signatures are properties of architecture and task together.

2606.00920 2026-06-02 cs.LG cs.AI cs.SE 版本更新

Accuracy, Stability, and Repeated-Run Reliability of Large Language Models on Deterministic Programming Tasks

大型语言模型在确定性编程任务上的准确性、稳定性和重复运行可靠性

Yongxi Zhou, Lai Yun Choi, Jiaxi Wen, Wenbo Ye

发表机构 * Northeastern University, Massachusetts, USA(东北大学,马萨诸塞州,美国) University of Southern California, California, USA(南加州大学,加利福尼亚州,美国)

AI总结 通过重复运行评估协议,发现运行级通过率高估了无重试覆盖率高达17.8个百分点,且差距在中等性能系统中最大,表明稳定性分析是准确性报告的必要补充。

详情
AI中文摘要

运行级通过率高估了无重试覆盖率高达17.8个百分点——且差距恰恰在中等性能系统中最大。我们研究了大型语言模型(LLM)在确定性文本条件生成评估中的这种准确性-稳定性关系,以编程任务作为具体测试平台。标准代码生成基准强调单次运行准确性或在重复采样下的最终成功,但许多部署场景还需要稳定性:在相同任务描述下重复调用时的一致结果。我们提出了一种重复运行评估协议,包含运行级准确性、无重试覆盖率和每个问题的变异性指标。在一个包含100道LeetCode风格问题的基于近期的基准上,我们评估了来自五个提供者家族的16个模型,使用两种提示模板,每个问题重复运行五次,共产生16,000个评估实例。尽管运行级通过率与完美稳定率强相关(r=0.985),但通过率始终超过无重试覆盖率——这一差距达到17.8个百分点,并且即使在密切匹配的系统之间也会逆转模型排名。提示效应是模型依赖的,而非普遍有益的。这些结果表明,对于确定性文本条件生成任务,重复运行稳定性分析是传统准确性报告的必要补充。

英文摘要

Run-level pass rate overstates retry-free coverage by up to 17.8 percentage points -- and the gap is largest precisely for mid-performing systems. We investigate this accuracy--stability relationship in large language model (LLM) evaluation for deterministic text-conditioned generation, using programming tasks as a concrete testbed. Standard code-generation benchmarks emphasize single-run accuracy or eventual success under repeated sampling, but many deployment settings also require stability: consistent outcomes across repeated invocations under the same task description. We present a repeated-run evaluation protocol with metrics for run-level accuracy, retry-free coverage, and per-problem variability. On a recency-based benchmark of 100 LeetCode-style problems, we evaluate 16 models from five provider families under two prompt templates with five repeated runs per problem, yielding 16,000 evaluation instances. Although run-level pass rate and perfect stability rate are strongly correlated (r=0.985), pass rate consistently exceeds retry-free coverage -- a gap that reaches 17.8 percentage points and reverses model rankings even among closely matched systems. Prompt effects are model-dependent rather than uniformly beneficial. These results suggest that repeated-run stability analysis is a necessary complement to conventional accuracy reporting for deterministic text-conditioned generation tasks.

2606.00919 2026-06-02 cs.CL cs.LG 版本更新

Towards Lightweight Reliability: Using Soft Prompts for Hallucination Mitigation in Large Language Models

迈向轻量级可靠性:使用软提示缓解大型语言模型中的幻觉

S M Tahmid Siddiqui, Akib Jawad Ononto, Anoop Singhal, Latifur Khan

发表机构 * The University of Texas at Dallas(德克萨斯大学达拉斯分校) National Institute of Standards and Technology(国家标准与技术研究院)

AI总结 提出一种参数高效的软提示方法RCSP,通过对比学习、课程学习和KL正则化平衡事实回忆、幻觉抑制和弃权,在多个QA数据集上优于基线。

Comments 20 pages, 5 tables, 2 figures. Accepted for publication in DBSec 2026. The final publication will be available at Springer

详情
AI中文摘要

大型语言模型(LLMs)已在各个领域得到广泛应用,但其可靠性常因幻觉——听起来合理但事实不正确的回答——而受到损害。在高风险领域,这些错误会降低信任并引入现实风险。为解决这一挑战,我们提出一种参数高效的方法,使用软提示来缓解幻觉内容并促进生成式问答(QA)任务中的负责任弃权。我们的方法称为负责任对比软提示(RCSP),使用复合损失训练软提示,以平衡三个目标:抑制幻觉内容、鼓励在不确定性下弃权、以及保持或改善事实回忆。为实现这些目标,我们在训练机制中融入对比损失、课程学习和KL正则化。我们使用LLM-as-a-Judge框架在五个不同的生成式QA数据集上评估我们的方法。在Gemma 3(12B)和Llama 3.1(8B)骨干上的实验结果表明,RCSP有效平衡了事实回忆与幻觉抑制和弃权,在F分数上通常优于标准推理和基于指令的提示基线。值得注意的是,这些改进仅通过训练其他调优技术所需参数的一小部分实现。我们的结果表明,软提示提供了一条模块化且计算高效的路径,用于提高LLM的可靠性。

英文摘要

Large language models (LLMs) have seen widespread adoption across various domains, yet their reliability is frequently undermined by hallucinations - responses that are plausible-sounding but factually incorrect. In high-stakes domains, these errors can reduce trust and introduce real-world risk. To address this challenge, we present a parameter-efficient approach that uses soft prompts to mitigate hallucinated content and promote responsible abstention in generative question-answering (QA) tasks. Our method, called Responsible Contrastive Soft Prompting (RCSP), uses a composite loss to train soft prompts that balance three goals: suppressing hallucinatory content, encouraging abstention under uncertainty, and preserving or improving factual recall. To achieve these goals, we incorporate contrastive loss, curriculum learning, and KL regularization into our training mechanism. We evaluate our approach on five diverse generative QA datasets using an LLM-as-a-Judge framework. Experimental results on the Gemma 3 (12B) and Llama 3.1 (8B) backbones demonstrate that RCSP effectively balances factual recall with hallucination suppression and abstention, yielding a generally superior F-score over standard reasoning and instruction-based prompting baselines. Notably, these improvements are achieved by training only a fraction of the parameters required by other tuning techniques. Our results demonstrate that soft prompts provide a modular and computationally efficient path toward improving LLM reliability.

2606.00913 2026-06-02 stat.ML cs.LG 版本更新

Bandit Simulation for Average Reward Inference

平均奖励推断的赌博机模拟

Samya Praharaj, Chih-Yu Chang, Koulik Khamaru, Kelly W. Zhang

发表机构 * Rutgers University(罗格斯大学) Imperial College London(伦敦帝国理工学院)

AI总结 提出BSI框架,通过拟合环境模拟器并传播参数不确定性,为自适应赌博机算法构建渐近有效的置信区间。

详情
AI中文摘要

多臂赌博机算法越来越多地用于在线平台、临床试验和社会科学实验,但对其性能的有效统计推断仍然是一个开放挑战。部署赌博机后,一个自然的问题是能否为其平均奖励构建置信区间,并评估其是否可靠地优于基线策略。任何单次赌博机部署中获得的总奖励是随机的,由于奖励的随机性,在同一人群上部署两次赌博机通常会产生不同的奖励轨迹。标准统计推断方法无法使用,因为赌博机算法在收集的数据中引入了复杂的依赖性,违反了经典方法所依赖的独立同分布假设。此外,现有的自适应收集数据推断方法仅适用于不依赖于数据收集算法的估计量(例如固定动作下的平均奖励)。我们提出了用于推断的赌博机模拟(BSI),这是一个框架,它从观测数据(在线或离线)中拟合赌博机环境的模拟器,并用于估计任何评估策略(包括自适应黑盒算法)下的平均奖励。BSI将估计的模拟器参数的不确定性正式传播到置信区间构建中。此外,BSI的有效性仅需要对行为策略的弱探索假设,并避免了重要性加权。我们证明BSI产生渐近有效的置信区间,并通过实验证明在标准离线策略评估方法失败的情况下,BSI能保持名义覆盖。

英文摘要

Multi-arm bandit algorithms are increasingly used in online platforms, clinical trials, and social science experiments, but valid statistical inference on their performance remains an open challenge. After deploying bandits, a natural question is whether one can construct a confidence interval for its mean reward and assess whether it reliably outperforms a baseline policy. The total reward achieved in any single bandit deployment is random, and deploying a bandit twice on the same population typically yields different reward trajectories due to stochastic rewards. Standard statistical inference methods cannot be used because bandit algorithms introduce complex dependencies in the collected data, which violate the i.i.d. assumption underlying many classical approaches. Moreover, existing inference methods for adaptively collected data only apply to estimands that do not depend on the data-collection algorithm (such as the mean reward under a fixed action). We propose Bandit Simulation for Inference (BSI), a framework that fits a simulator of the bandit environment from observed data--either on-policy or off-policy--and uses it to estimate the mean reward under any evaluation policy, including adaptive blackbox algorithms. BSI formally propagates uncertainty in the estimated simulator parameters into the confidence interval construction. Furthermore, for BSI to be valid, it requires only weak exploration assumptions on the behavior policy and avoids importance weighting. We prove that BSI yields asymptotically valid confidence intervals, and demonstrate empirically that it maintains nominal coverage in settings where standard off-policy evaluation methods fail.

2606.00910 2026-06-02 cs.CV cs.LG 版本更新

Reason, Retrieve, Re-rank: A Zero-Shot Reasoning-Aware Framework for Composed Video Retrieval

推理、检索、重排序:一种用于组合视频检索的零样本推理感知框架

Ali Alavi

发表机构 * The Ohio State University(俄亥俄州立大学)

AI总结 提出R3-CoVR零样本管道,通过多模态大模型推理编辑后状态、对比编码检索和约束感知重排序,在CVPR 2026 VidLLMs挑战赛上达到91.9% R@1和98.2% R@10。

详情
AI中文摘要

组合视频检索(CoVR)旨在通过对参考视频应用自由形式的文本修改来寻找目标视频。我们应对CVPR 2026 VidLLMs研讨会上的推理感知CoVR(CoVR-R)挑战,其中检索严格为零样本。我们提出R3-CoVR(推理、检索、重排序),一个完全由冻结基础模型构建的无训练管道。多模态大语言模型(Qwen3-VL-8B)推理编辑所隐含的“后效”——状态转换、动作阶段、场景、镜头和节奏——并生成简洁的编辑后描述;对比视频-文本编码器(SigLIP-2)对该描述和图库进行嵌入以进行第一阶段检索;最后,一个约束感知重排序阶段使用相同的多模态模型作为评判者,对每个候选视频针对预期的编辑结果进行评分。在挑战测试集上,R3-CoVR达到了91.9%的R@1和98.2%的R@10。两个发现推动了这些结果:(i)将描述长度匹配到对比编码器的文本窗口使R@1从67.5提升到72.7;(ii)仅对候选列表进行重排序的约束感知重排序器将R@1从72.7提升到91.9——这是最大的单一增益。我们分析了重排序器的行为、检索/重排序混合以及候选列表深度,并发布了一个干净的三层实现。

英文摘要

Composed Video Retrieval (CoVR) seeks the target video that results from applying a free-form textual modification to a reference video. We address the \emph{Reason-Aware} CoVR (CoVR-R) challenge at the CVPR~2026 VidLLMs workshop, where retrieval is strictly zero-shot. We present \textbf{R3-CoVR} (\emph{Reason, Retrieve, Re-rank}), a training-free pipeline built entirely from frozen foundation models. A multimodal large language model (Qwen3-VL-8B) reasons about the \emph{after-effects} an edit implies -- state transitions, action phases, scene, camera and tempo -- and verbalises a concise post-edit description; a contrastive video--text encoder (SigLIP-2) embeds this description and the gallery for first-stage retrieval; finally a constraint-aware re-ranking stage uses the same multimodal model as a judge that scores each shortlisted candidate against the intended edited result. On the challenge test set, R3-CoVR attains \textbf{91.9\% R@1} and \textbf{98.2\% R@10}. Two findings drive these results: (i)~matching the description length to the contrastive encoder's text window lifts \Rk{1} from $67.5$ to $72.7$; and (ii)~the constraint-aware re-ranker, which reorders only the shortlist, lifts \Rk{1} from $72.7$ to $91.9$ -- the single largest gain. We analyse the re-ranker's behaviour, the retrieve/re-rank blend, and the shortlist depth, and we release a clean three-layer implementation.

2606.00895 2026-06-02 math.OC cs.LG 版本更新

Tiny Recursive Models for Solving the J2-Perturbed Lambert Problem

用于求解J2摄动兰伯特问题的小型递归模型

Minduli Wijayatunga, Roberto Armellin

发表机构 * Department of Aerospace Engineering, University of Illinois Urbana-Champaign(航空航天工程系,伊利诺伊大学厄巴纳-香槟分校) Te Pūnaha Ātea – Space Institute, University of Auckland(太空研究所,奥克兰大学)

AI总结 提出基于小型递归模型(TRM)的快速递归神经求解器TRM-PL,通过迭代深度而非参数数量实现有效容量,统一初始猜测生成与迭代校正,在多种轨道转移场景中显著降低终端位置误差。

详情
AI中文摘要

本文提出一种基于小型递归模型(TRM)的快速递归神经求解器,用于求解J2摄动兰伯特问题,称为TRM-PL模型。TRM是一种权重共享架构,其有效容量源于迭代深度而非参数数量:一个紧凑的推理模块在两级潜在层次中重复应用,通过模拟J2轨迹并根据产生的跟踪误差进行校正,来优化候选出发速度。这统一了初始猜测生成和迭代校正于一个端到端可微分的单一架构中。递归精化循环是经典摄动兰伯特求解器中同伦和延拓方案的一种学习替代方案:网络学习自己的校正序列,而不是遵循从开普勒解到摄动解的手工设计路径。我们在三个难度递增的测试案例上评估TRM-PL:单圈低地球轨道(LEO)转移、多圈LEO转移和多圈木星转移。比较了三种训练范式:联合学习兰伯特解和J2校正;使用目标位置和J2校正速度监督精化兰伯特初始速度;仅使用目标位置监督精化。在所有案例中,仅精化方法最为可靠。在单圈LEO上,位置监督变体将中位终端位置误差从21.7公里降至0.027公里,在多圈LEO上从340.9公里降至0.31公里,均采用相同的230万参数架构。对TRM-PL输出进行一次牛顿校正迭代,可将木星案例的中位误差收紧至0.063公里,从而得到足够精确的紧凑模型,适用于嵌入式部署。

英文摘要

This paper presents a fast, recursive neural solver for the J2-perturbed Lambert problem based on Tiny Recursive Models (TRM), termed the TRM-Perturbed Lambert (TRM-PL) model. TRM is a weight-shared architecture whose effective capacity emerges from iteration depth rather than parameter count: a compact reasoning module is applied repeatedly within a two-level latent hierarchy, refining a candidate departure velocity by simulating the J2 trajectory and correcting it from the resulting tracking error. This unifies initial-guess generation and iterative correction in a single, end-to-end differentiable architecture. The recursive refinement loop is a learned alternative to the homotopy and continuation schemes of classical perturbed-Lambert solvers: rather than following a hand-designed path from the Keplerian to the perturbed solution, the network learns its own sequence of corrections. We evaluate TRM-PL on three test cases of increasing difficulty: single-revolution low-Earth-orbit (LEO) transfers, multi-revolution LEO transfers, and multi-revolution Jovian transfers. Three training paradigms are compared: jointly learning the Lambert solution and the J2 correction; refining the Lambert initial velocity with target-position and J2-corrected velocity supervision; and refining it with target-position supervision alone. Across all cases, the refinement-only approaches are the most reliable. The position-supervised variant reduces the median terminal-position error from 21.7 km to 0.027 km on single-revolution LEO, from 340.9 km to 0.31 km on multi-revolution LEO, all with the same 2.3M-parameter architecture. A single Newton corrector iteration on the TRM-PL output tightens the Jovian median to 0.063 km, yielding compact models accurate enough for embedded deployment.

2606.00892 2026-06-02 cs.LG cs.CE physics.comp-ph 版本更新

An Exploratory Study into using Machine-Learning for Fast Step-by-step Emulation of Numerical Mechanical Thrombectomy Simulations for Ischemic Stroke

使用机器学习快速逐步模拟缺血性卒中机械取栓数值仿真的探索性研究

Thijs Stessen

发表机构 * MSc Artificial Intelligence Master Thesis(人工智能硕士论文) Thijs Stessen MSc. Thijs Kuipers(Thijs Kuipers) Dr. Simone Saitta(Simone Saitta)

AI总结 本研究探索使用机器学习替代模型逐步加速机械取栓数值仿真,在简化抽吸过程中实现显著加速,但复杂几何下的长期稳定性不足。

Comments 40 pages, 16 figures, master thesis artificial intelligence

详情
AI中文摘要

使用机械取栓治疗缺血性卒中涉及在时间紧迫下做出困难决策。数值物理仿真理论上可以为操作者提供关于治疗方法和设备选择的更好决策信息,但在实践中速度太慢。在本论文中,我们研究当前基于机器学习的替代模型能否在显著加速的同时,以逐步方式准确模拟这些仿真。为此,我们在两个涉及简化抽吸过程的仿真上训练了三个替代模型,几何复杂度不同。结果表明,其中两个模型能准确预测单个仿真步骤并提供显著加速,尤其是结合特定数据增强时。然而,这些模型在长时间模拟复杂几何时表现出缺乏稳定性。总体而言,这项工作为未来研究开发稳定方法并扩展到机械取栓的现实数值物理仿真奠定了基础。

英文摘要

The treatment of ischemic stroke using mechanical thrombectomy involves difficult decisions under intense time constraints. Numerical physics simulations can in theory inform operators to make better decisions regarding treatment approaches and device selection, but are too slow to do so in practice. In this thesis, we investigate if current machine learning based surrogates can accurately emulate these simulations in a step-by-step manner while making them significantly faster. To do this we train three surrogate models on two simulations that involve a simplified aspiration procedure, with varying levels of geometric complexity. Our results show that two of our models accurately predict singular simulation steps and provide substantial speedups, especially when combined with specific data augmentations. However, the models showed a lack of stability when emulating simulations with complex geometries over longer time periods. Overall, this work provides a foundation for future studies to develop stable methods that scale to realistic numerical physics simulations of mechanical thrombectomy.

2606.00889 2026-06-02 cs.CR cs.LG 版本更新

A Lightweight Hybrid MLP-Based Framework for Real-Time Phishing URL Detection Using Structural URL Features

基于结构URL特征的轻量级混合MLP框架用于实时钓鱼URL检测

Uche Unoke Emmanuel, Gideon Francis Oghie

发表机构 * Department of Cyber Security Science, School of Information and Communication Technology, Federal University of Technology, Minna, Nigeria(网络安全科学系,信息与通信技术学院,联邦科技大学,米纳,尼日利亚)

AI总结 提出一种结合黑名单筛选和仅基于结构URL特征的多层感知器(MLP)分类器的轻量级混合框架,用于实时钓鱼URL检测,在PhiUSIIL数据集上达到99.24%准确率和1.2ms推理延迟。

Comments 27 pages, 6 figures, 12 tables

详情
AI中文摘要

钓鱼攻击仍然是主要的网络安全威胁,利用欺骗性URL窃取用户敏感信息。传统的黑名单和基于规则的检测方法是被动的,往往无法识别新出现的钓鱼URL。本文提出了一种轻量级的实时钓鱼URL检测混合框架,该框架将基于黑名单的筛选与仅基于结构URL特征的多层感知器(MLP)分类器相结合。该框架提取16个URL衍生特征,捕获结构、域和与安全相关的特征,无需网页内容访问、第三方API或视觉渲染,因此计算效率高,适合实时部署。该系统在包含235,795个标记URL的PhiUSIIL钓鱼数据集上进行了训练和评估。实验结果表明,所提出的MLP在相同评估设置下达到了99.24%的准确率、98.74%的精确率、99.95%的召回率、99.34%的F1分数和99.65%的ROC-AUC,优于随机森林、逻辑回归、XGBoost、LightGBM和CatBoost。混合架构在并发处理下实现了每个URL平均1.2毫秒的推理延迟和每秒4200个URL的峰值吞吐量。一个功能性的桌面应用程序原型CyberGuard进一步展示了部署可行性。结果表明,所提出的框架为资源受限环境下的实时钓鱼URL检测提供了准确且计算高效的解决方案。

英文摘要

Phishing attacks remain a major cybersecurity threat, exploiting deceptive URLs to steal sensitive user information. Traditional blacklist and rule-based detection approaches are reactive and often fail to identify newly emerging phishing URLs. This paper proposes a lightweight hybrid framework for real-time phishing URL detection that combines blacklist-based screening with a Multi-Layer Perceptron (MLP) classifier operating solely on structural URL features. The framework extracts 16 URL-derived features capturing structural, domain-based, and security-related characteristics without requiring webpage content access, third-party APIs, or visual rendering, making it computationally efficient for real-time deployment. The system was trained and evaluated on the PhiUSIIL phishing dataset containing 235,795 labelled URLs. Experimental results show that the proposed MLP achieved 99.24% accuracy, 98.74% precision, 99.95% recall, 99.34% F1-score, and 99.65% ROC-AUC, outperforming Random Forest, Logistic Regression, XGBoost, LightGBM, and CatBoost under the same evaluation setting. The hybrid architecture achieved an average inference latency of 1.2 ms per URL and a peak throughput of 4,200 URLs per second under concurrent processing. A functional desktop application prototype, CyberGuard, further demonstrates deployment viability. The results indicate that the proposed framework provides an accurate and computationally efficient solution for real-time phishing URL detection in resource-constrained environments.

2606.00888 2026-06-02 cs.LG cs.AI 版本更新

Memory-Efficient LLM Training with Dynamic Sparsity: From Stability to Practical Scaling

基于动态稀疏性的内存高效LLM训练:从稳定性到实际扩展

Qiao Xiao, Boqian Wu, Patrik Okanovic, Tomasz Sternal, Maurice van Keulen, Elena Mocanu, Mykola Pechenizkiy, Decebal Constantin Mocanu, Torsten Hoefler

发表机构 * University of Waterloo(滑铁卢大学) University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Michigan(密歇根大学)

AI总结 提出SMET方法,通过优化器预热和密度感知学习率缩放解决动态稀疏训练中的优化不稳定问题,实现LLM的稳定、可扩展且内存高效的稀疏预训练。

Comments Accepted at ICML2026

详情
AI中文摘要

动态稀疏训练(DST)为提高深度神经网络的训练和推理效率提供了一种有前景的范式;然而,我们发现,在大语言模型训练中,DST可能会遭受优化不稳定性,表现为拓扑更新后的损失尖峰。在这项工作中,我们表明,标准基于Adam的优化器的朴素使用会导致新重新生长的参数出现冷启动问题,从而导致过大的更新和破坏训练动态。为了解决这个问题,我们提出了稀疏内存高效训练(SMET),它通过优化器预热稳定DST,并通过密度感知学习率缩放改善训练进度。SMET通过仅存储活动参数的梯度和优化器状态进一步减少内存消耗。我们对SMET下的更新行为进行了理论分析,显示出改进的优化稳定性。大量实验表明,SMET能够实现LLM的稳定、可扩展且内存高效的稀疏预训练,为稀疏训练作为密集训练的实际替代方案铺平了道路。我们的代码公开在:https://github.com/QiaoXiao7282/SMET。

英文摘要

Dynamic Sparse Training (DST) offers a promising paradigm for improving the training and inference efficiency of deep neural networks; however, we find that in large language model training, DST can suffer from optimization instability, manifested as loss spikes after topology updates. In this work, we show that the naive use of standard Adam-based optimizers leads to a cold-start issue for newly regrown parameters, resulting in excessively large updates and disrupted training dynamics. To address this issue, we propose Sparse Memory-Efficient Training (SMET), which stabilizes DST with optimizer warm-up and improves training progress through density-aware learning-rate scaling. SMET further reduces memory consumption by storing gradients and optimizer states only for active parameters. We provide a theoretical analysis of the update behaviors under SMET, showing improved optimization stability. Extensive experiments demonstrate that SMET enables stable, scalable, and memory-efficient sparse pre-training of LLMs, paving the way for sparse training as a practical alternative to dense training. Our code is publicly available at: https://github.com/QiaoXiao7282/SMET.

2606.00884 2026-06-02 cs.LG cs.AI 版本更新

Dive into Waves: Morlet Spectral Transformer for Cross-Subject Emotion Decoding from EEG

深入波动:用于跨被试脑电情绪解码的Morlet谱变换器

Jiaxin Qing, Lexin Li

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对脑电情绪识别中跨被试变异性问题,提出基于Morlet小波标记化、长上下文基线去除和频带特定空间投影的Morlet谱变换器(MST),无需预训练即可在SEED系列数据集上超越大型预训练模型和频域方法。

详情
AI中文摘要

我们研究基于脑电的跨被试情绪识别,这是脑机接口中一个实际重要但具有挑战性的问题。与具有清晰波形特征的任务不同,情绪相关的脑电信号主要编码在频谱功率中,且微弱、嘈杂,并在被试间高度变化。现有方法要么依赖需要大量数据但仍难以应对跨被试变异的大型预训练脑电基础模型,要么依赖频域编码器(能更好地反映频谱结构但存在表示不匹配、漂移主导的标记化以及缺乏频带特定空间建模)。在本文中,我们提出了Morlet谱变换器(MST),它围绕三个关键组件构建,并与时空变换器主干集成。首先,Morlet小波标记化提供了与脑节律多尺度结构匹配的时频表示,并将经典微分熵特征扩展到适合变换器的形式。其次,长上下文基线去除作为一种简单的时间归一化,消除了被试特定漂移和附近窗口间的冗余。第三,频带特定空间投影为每个频带学习独立的通道混合器,捕获可解释的频带特定模式并减少跨通道混合。我们表明,即使没有预训练,MST在所有SEED系列数据集上始终优于大型预训练脑电基础模型和基于频率的方法。这些结果表明,精心的表示设计可以产生准确、经济且可解释的替代大规模预训练的方法。

英文摘要

We study cross-subject emotion recognition from EEG, a practically important yet challenging problem in brain-computer interfaces. Unlike tasks with clear waveform signatures, emotion-related EEG signals are primarily encoded in spectral power and are weak, noisy, and highly variable across subjects. Existing approaches rely either on large pretrained EEG foundation models, which require massive data yet still struggle with cross-subject variability, or frequency-domain encoders, which better reflect spectral structure but suffer from mismatched representations, drift-dominated tokenization, and lack of band-specific spatial modeling. In this article, we propose the Morlet Spectral Transformer (MST), built around three key components and integrated with a spatiotemporal Transformer backbone. First, Morlet wavelet tokenization provides a time-frequency representation that matches the multi-scale structure of brain rhythms, and extends classical differential entropy features to a form suitable for Transformers. Second, long-context baseline removal acts as a simple temporal normalization that removes subject-specific drift and redundancy across nearby windows. Third, frequency-specific spatial projection learns a separate channel mixer for each frequency band, capturing interpretable band-specific patterns and reducing cross-channel mixing. We show that, even without pretraining, MST consistently outperforms both large pretrained EEG foundation models and frequency-based methods across all SEED-family datasets. These results suggest that careful representation design can yield an accurate, cost-effective, and interpretable alternative to large-scale pretraining.

2606.00880 2026-06-02 cs.LG cs.AI 版本更新

Task diversity produces systematic transfer but inhibits continual reinforcement learning

任务多样性产生系统性迁移但抑制持续强化学习

Purab Seth, Neil Shah, Kunal Jha, Samuel J. Gershman, Max Kleiman-Weiner, Wilka Carvalho

发表机构 * MIT(麻省理工学院) University of California, Berkeley(加州大学伯克利分校) Princeton University(普林斯顿大学) Harvard University(哈佛大学)

AI总结 通过引入GPU加速的持续强化学习领域Banyan,研究任务多样性(地图布局、交互对象、子目标层次结构)对智能体在分布变化下持续学习能力的影响,发现多样性促进局部迁移但导致长期任务性能停滞和遗忘。

详情
AI中文摘要

持续强化学习旨在产生不仅能在当前任务上提高,还能随着任务分布变化而适应的智能体。在众多不同任务上训练智能体可以引发零样本泛化,但先前的工作通常是在训练后(冻结权重)评估这种泛化。任务多样性是否也能提高智能体在分布变化下继续学习的能力仍不清楚。我们引入了Banyan,一个GPU加速的持续强化学习领域,其中任务多样性分解为三个独立可控的轴:智能体必须导航的地图布局、必须与之交互的对象以及子目标依赖的层次结构。在单个分布变化中,增加每个轴上的多样性会导致智能体在新任务上开始训练时,其性能接近先前任务达到的水平,即使变化改变了最优策略的结构。然而,随着变化数量的增加,这种局部迁移本身并不能产生持续的持续学习:更长视野的任务出现平台期,并且较早的任务分布在后续训练后被遗忘。Banyan是一个基准,用于研究受控的任务多样性何时产生可迁移的学习,这种迁移何时持续,以及它在哪些方面未能达到真正的持续学习。

英文摘要

Continual reinforcement learning aims to produce agents that learn not only to improve at their current tasks but also to adapt as task distributions change. Training an agent on many diverse tasks can induce zero-shot generalization, but previous work generally evaluates this generalization after training -- with frozen weights. Whether task diversity also improves an agent's ability to continue learning across distribution shifts remains unclear. We introduce Banyan, a GPU-accelerated continual RL domain in which task diversity factors into three independently controllable axes: the map layouts an agent must navigate, the objects it must interact with, and the hierarchical structures of sub-goal dependencies. Across individual distribution shifts, increasing diversity along each axis causes agents to begin training on the new tasks near the performance attained on the previous one, even when the shift changes the structure of the optimal policy. However, as the number of shifts increases, this local transfer does not by itself yield sustained continual learning: longer-horizon tasks plateau, and earlier task distributions are forgotten after later training. Banyan is a benchmark for studying when controlled task diversity produces transferable learning, when that transfer persists, and where it falls short of proper continual learning.

2606.00869 2026-06-02 cs.LG 版本更新

Enhancing LLM Metacognition via Cognitive Pairwise Training

通过认知成对训练增强LLM元认知

Weitao Li, Hao Zhou, Xuanyu Lei, Fandong Meng, Yuanhang Liu, Jingyi Ren, Ante Wang, Xiaolong Wang, Yuanchi Zhang, Fuwen Luo, Guangwen Yang, Lin Gan, Weizhi Ma, Yang Liu

发表机构 * National Engineering Laboratory for Intelligent Information Processing, Academy of Mathematics and Physics, Chinese Academy of Sciences(智能信息处理国家工程实验室,中国科学院数学物理研究所) University of Science and Technology of China(中国科学技术大学)

AI总结 提出认知成对训练(CPT),通过成对比较推理轨迹来学习区分可靠与不可靠推理,从而提升LLM的推理与元认知权衡。

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为LLM推理的核心,但其结果级奖励可能使模型在证据或推理不可靠时更愿意给出自信答案。现有的SFT或RL方法主要在响应级别教导LLM拒绝或表达不确定性,这可能导致过度拟合拒绝行为,而非提高推理可靠性。为解决这一局限,我们提出认知成对训练(CPT),这是一种认知中期训练对齐阶段,将推理轨迹上的成对比较转化为可复用的对齐信号。通过学习区分可信与有缺陷的推理,CPT鼓励模型内化推理质量判别边界,而非记忆表面拒绝模式。在五个模型规模和三个模型家族上,CPT改善了推理与元认知的权衡。在14B规模上,CPT+RL相比标准SFT+RL流水线在数学平均分上提升2.2分,在拒绝F1上提升5.2分。进一步分析表明,CPT提高了轨迹质量,并在评估和训练设置中表现出强鲁棒性和可扩展性。代码和模型已发布在https://github.com/Tsinghua-dhy/CPT。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become central to LLM reasoning, but its outcome-level rewards can make models more willing to give confident answers when evidence or reasoning is unreliable. Existing SFT or RL methods mainly teach LLMs to refuse or express uncertainty at the response level, which can overfit abstention behavior rather than improve reasoning reliability. To address this limitation, we propose Cognitive Pairwise Training (CPT), a cognitive mid-training alignment stage that turns pairwise comparisons over reasoning traces into a reusable alignment signal. By learning to distinguish trustworthy from flawed reasoning, CPT encourages the model to internalize a reasoning-quality discrimination boundary rather than memorize surface refusal patterns. Across five model scales and three model families, CPT improves the reasoning--metacognition trade-off. At 14B, CPT+RL outperforms the standard SFT+RL pipeline by +2.2 math-average points and +5.2 abstention-F1 points. Further analyses show that CPT improves trace quality and exhibits strong robustness and scalability across evaluation and training settings. Code and models are released at https://github.com/Tsinghua-dhy/CPT.

2606.00867 2026-06-02 stat.ML cs.LG eess.SP 版本更新

Statistical Analysis of using the Shapley Value for Sensor Anomaly Localization with Accurate Classifiers

使用Shapley值进行传感器异常定位与准确分类器的统计分析

Xubin Fang, Rick S. Blum

发表机构 * Electrical and Computer Engineering Department of Lehigh University(莱斯大学电气与计算机工程系)

AI总结 本文通过数学定义的二元最优分类器分析Shapley值在传感器异常定位中的性能,证明在独立观测下等价于低复杂度测试,而在相关双变量高斯/拉普拉斯场景下两者存在本质差异,并首次提供理论统计结果。

详情
AI中文摘要

最近的出版物建议使用Shapley值进行传感器异常/攻击定位。我们通过在Shapley值计算中使用数学定义的二元最优分类器来研究这种方法的性能。为了判断定位性能,我们研究给定传感器观测的Shapley值确定该观测是否异常的能力。首先,我们证明对于独立传感器观测的情况,使用Shapley值的优化异常测试等价于使用Shapley值计算中单个项的优化低复杂度异常测试,产生完全相同的错误概率。对于涉及两个传感器的一些流行的相关观测情况,包括相关双变量高斯/拉普拉斯概率密度函数和常数/高斯攻击/异常,我们证明这两个测试本质上是不同的,产生不同的决策区域和错误概率。此外,我们证明在某些统计相关的双变量高斯场景中,当相关幅度较大且存在加性攻击/异常时,Shapley值测试有时严格劣于另一个(Shapley计算中的单个项)测试,而在其他情况下则严格优于它,具体取决于相关的符号。在这些情况下,可以结合这两种方法以获得严格更好的方法。这些结果首次提供了基于Shapley定位的理论统计分析,鉴于许多研究人员广泛接受Shapley值,这些结果似乎非常有趣,并应鼓励对该主题的进一步研究。提供了数值结果以说明我们的发现。

英文摘要

Recent publications have suggested using the Shap- ley value for sensor anomaly/attack localization. We study the performance of such an approach by using mathematically de- fined optimum binary classifiers in the Shapley value calculation. To judge localization performance, we study the ability of the Shapley value of a given sensor observation to determine if that observation is anomalous. First, we prove that for cases with independent sensor observations, an optimized anomaly test using the Shapley value is equivalent to an optimized lower-complexity anomaly test using a single term in the Shapley value calculation, yielding the exact same probability of error. For some popular dependent observation cases involving two sensors, including correlated bivariate Gaussian/Laplacian probability density functions and constant/Gaussian at- tacks/anomalies, we prove that these two tests are fundamentally different, yielding different decision regions and error probabil- ities. Further, we prove that the Shapley value test is sometimes strictly inferior to the other (single term in Shapley calculation) test in certain statistically dependent bivariate Gaussian scenarios with large correlation magnitude and additive attacks/anomalies, while it is strictly superior in others, depending on the sign of the correlation. One can combine these two approaches to obtain a strictly better approach in these cases. These results, which provide the first theoretical statistical analysis of Shapley-based localization, seem very interesting based on the wide acceptance of the Shapley value by many researchers and should encourage further research on this topic. Numerical results are provided which illustrate our findings.

2606.00862 2026-06-02 cs.NE cs.LG 版本更新

Meta-Black-Box Optimization with Ensemble Surrogate Modeling for Robustness-Accuracy Trade-off within SAEA

基于集成代理建模的元黑箱优化以实现SAEA中的鲁棒性-准确性权衡

Xiao Jin, Yongxiong Wang, Haobo Liu, Yudong Du, Yukun Du

发表机构 * GitHub

AI总结 提出AdaE-SAEA,一种将SAEA嵌入MetaBBO框架并联合控制填充准则与集成代理建模的方法,通过强化学习训练元策略,自适应平衡鲁棒性与准确性,在昂贵多目标优化中优于现有方法。

详情
AI中文摘要

代理辅助进化算法(SAEAs)已被广泛用于昂贵的黑箱优化问题。然而,它们对刚性且手动设计组件的依赖限制了其跨任务的灵活性和泛化能力。元黑箱优化(MetaBBO)为自适应配置算法组件提供了一种有前景的范式。尽管如此,现有的MetaBBO方法通常只控制单个组件,很少有研究调查多组件优化器(如SAEAs)的统一控制。此外,代理建模中的鲁棒性-准确性权衡对于早期稳定探索和后期精确开发至关重要,但很少被明确考虑。为了解决这些问题,我们提出了AdaE-SAEA,一种用于昂贵多目标优化的自适应集成代理辅助进化算法。AdaE-SAEA将SAEA作为低层优化器嵌入MetaBBO框架,并联合控制填充准则和基于集成的代理建模。具体来说,bagging和boosting被设计为代理建模模块,以在不同搜索阶段自适应平衡鲁棒性和准确性,而元策略同时选择填充准则以实现自适应采样决策。元策略通过并行采样和集中训练的强化学习进行训练,提高了训练效率和可迁移性。在合成和实际问题上的实验表明,AdaE-SAEA优于最先进的基线和基于MetaBBO的方法。我们进一步验证了TabPFN作为集成学习基础代理模型的有效性。据我们所知,这是第一个统一控制SAEAs中代理建模和填充准则,同时明确解决鲁棒性-准确性权衡的工作。

英文摘要

Surrogate-assisted evolutionary algorithms (SAEAs) have been widely used for expensive black-box optimization problems. However, their reliance on rigid and manually designed components limits their flexibility and generalization across tasks. Meta-black-box optimization (MetaBBO) provides a promising paradigm for adaptively configuring algorithmic components. Nevertheless, existing MetaBBO methods usually control only a single component, and few studies have investigated the unified control of multi-component optimizers such as SAEAs. Moreover, the robustness-accuracy trade-off in surrogate modeling, which is crucial for stable early-stage exploration and accurate late-stage exploitation, has rarely been explicitly considered. To address these issues, we propose AdaE-SAEA, an adaptive ensemble surrogate-assisted evolutionary algorithm for expensive multi-objective optimization. AdaE-SAEA embeds SAEA as the low-level optimizer within the MetaBBO framework and jointly controls the infill criterion and ensemble-based surrogate modeling. Specifically, bagging and boosting are designed as surrogate modeling modules to adaptively balance robustness and accuracy across different search phases, while the meta-policy simultaneously selects the infill criterion to enable adaptive sampling decisions. The meta-policy is trained through reinforcement learning with parallel sampling and centralized training, improving both training efficiency and transferability. Experiments on synthetic and real-world problems demonstrate that AdaE-SAEA outperforms state-of-the-art baselines and MetaBBO-based methods. We further verify the effectiveness of TabPFN as the base surrogate model for ensemble learning. To the best of our knowledge, this is the first work to unify the control of surrogate modeling and infill criteria in SAEAs while explicitly addressing the robustness--accuracy trade-off.

2606.00852 2026-06-02 cs.CV cs.AI cs.LG 版本更新

RefDiffNet: Learning to Expose Subtle PCB Defects Before Detection

RefDiffNet: 在检测前学习暴露细微PCB缺陷

Vinay Edula, Nilesh Badwe, Priyanka Bagade

发表机构 * Department of Computer Science and Engineering Indian Institute of Technology Kanpur(计算机科学与工程系印度理工学院坎浦尔) Department of Materials Science and Engineering Indian Institute of Technology Kanpur(材料科学与工程系印度理工学院坎浦尔)

AI总结 提出RefDiffNet,一种轻量级即插即用的输入增强模块,通过引入无缺陷参考图像来突出缺陷区域,从而提升下游检测器在PCB缺陷检测中的性能。

详情
AI中文摘要

印刷电路板(PCB)缺陷检测具有挑战性,因为许多缺陷很小且难以与复杂的背景图案区分。大多数基于深度学习的PCB检测方法仅依赖被检测的PCB图像进行缺陷检测,忽略了编码走线、焊盘和其他PCB结构预期布局的无缺陷参考图像。在这项工作中,我们提出了RefDiffNet,一种轻量级即插即用的输入增强模块,放置在检测器主干之前,用于在缺陷检测前增强图像。RefDiffNet将经典检测中的一个成熟思想带入深度学习时代,利用无缺陷参考图像来揭示缺陷。RefDiffNet比较缺陷图像与对齐的参考图像,捕获相对于参考图像的结构变化,并使用轻量级编码器输出缺陷区域被突出的原始图像,从而简化下游检测器的任务。在HRIPCB和DeepPCB上的结果表明,RefDiffNet在各类检测器上一致地提升了性能,包括从YOLOv8到YOLOv26的单阶段检测器、基于Transformer的RT-DETR以及两阶段Faster R-CNN。它实现了高达18%的相对mAP50:95增益,且开销可忽略,仅引入0.004-0.005M额外参数和0.7-0.8 GFLOPs,最多占任何评估检测器参数量的0.25%。结果确立了RefDiffNet作为一种轻量级、即插即用、检测器无关的输入增强模块,以最小的计算成本显著提升PCB缺陷检测性能。

英文摘要

Printed circuit board (PCB) defect detection is challenging because many defects are small and difficult to distinguish from complex background patterns. Most deep learning-based PCB inspection methods rely only on the inspected PCB image for defect detection, ignoring the defect-free reference image that encodes the expected layout of traces, pads, and other PCB structures. In this work, we propose RefDiffNet, a lightweight plug-and-play input enhancement block placed before the detector backbone to enhance the image before defect detection. RefDiffNet brings one proven idea from classical inspection into the deep learning era, using a defect-free reference image to reveal defects. RefDiffNet compares the defective image with the aligned reference, captures structural changes relative to the reference, and uses a lightweight encoder to output the original image with defective regions highlighted, thereby making the downstream detector's task easier. Results on HRIPCB and DeepPCB show that RefDiffNet consistently improves performance across detector families, including one-stage detectors from YOLOv8 to YOLOv26, the transformer-based RT-DETR, and the two-stage Faster R-CNN. It achieves up to 18% relative mAP50:95 gain with negligible overhead, introducing only 0.004 - 0.005M additional parameters and 0.7 - 0.8 GFLOPs, amounting to at most 0.25% of the parameter count of any evaluated detector. Results establish RefDiffNet as a lightweight, plug-and-play, detector-agnostic input enhancement module that substantially improves PCB defect detection with minimal computational cost.

2606.00851 2026-06-02 cs.SD cs.CL cs.HC cs.LG eess.AS 版本更新

Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning

Sympatheia: 具有连续情感调节的情感自适应语音助手

Sukru Samet Dindar, Riki Shimizu, Xilin Jiang, Nima Mesgarani

发表机构 * Department of Electrical Engineering, Columbia University(电气工程系,哥伦比亚大学)

AI总结 提出Sympatheia语音对话框架,通过从用户语音推断情感并结合连续效价-唤醒度控制信号,实现情感自适应响应,优于基线模型。

详情
AI中文摘要

共情口语对话系统必须推断用户的情感状态以做出适当响应,然而日常语音通常带有微弱、中性或模糊的情感线索。为解决这一问题,我们引入了Sympatheia,一种语音到语音对话框架,其条件基于从用户语音中推断出的情感,并且在可用时,基于多模态感知模块或用户界面提供的连续效价-唤醒度(VA)控制信号中的明确情感规格。为了训练我们的模型,我们构建了Sympatheia-18k,一个包含12个情感锚点的情感条件合成口语对话语料库。该数据集包括用于学习情感语音行为的情感分割,以及一个中性分割,该分割将情感中性查询与多个情感条件响应配对,以在情感模糊情况下隔离明确的情感控制。实验结果表明,Sympatheia在生成语义内容和口语表达均情感适当的响应方面优于语音对话基线。我们进一步表明,相同的VA界面可以整合来自不同感知模块(包括面部表情、生物信号和文本情感描述)的情感估计,从而在语音单独提供有限情感证据时改善响应对齐。这些结果表明,连续情感调节是构建情感自适应语音助手的有效实际步骤。

英文摘要

Empathetic spoken dialogue systems must infer a user's emotional state to respond appropriately, yet everyday speech often carries weak, neutral, or ambiguous affective cues. To address this, we introduce Sympatheia, a speech-to-speech dialogue framework conditioned on affect inferred from the user's speech and, when available, explicit affect specifications provided as a continuous valence--arousal (VA) control signal by a multimodal sensing module or user interface. To train our model, we construct Sympatheia-18k, an emotion-conditioned synthetic spoken dialogue corpus with 12 emotion anchors. This dataset includes an emotional split for learning affective speech behavior, and a neutral split that pairs emotionally neutral queries with multiple emotion-conditioned responses to isolate explicit emotion control in emotionally ambiguous cases. Empirical results show that Sympatheia outperforms speech conversational baselines in generating responses whose semantic content and spoken delivery are both emotionally appropriate. We further show that the same VA interface can integrate emotion estimates from diverse sensing modules, including facial expression, biosignals, and textual affect descriptions, improving response alignment when speech alone provides limited emotional evidence. These results suggest that continuous affect conditioning is an effective practical step for building emotionally adaptive voice assistants.

2606.00846 2026-06-02 cs.LG 版本更新

CUPID in the Model Zoo: Online Matchmaking for Selecting Your Dream LLM

模型动物园中的丘比特:在线匹配以选择你的梦想大语言模型

Son Nguyen, Xinyuan Liu, Ransalu Senanayake

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于决斗老虎机算法的主动学习框架,通过迭代选择大语言模型对并收集用户反馈,高效匹配用户偏好与模型能力。

Comments 38 pages, 11 figures

详情
AI中文摘要

用户越来越面临从快速增长的大语言模型池中为给定任务选择合适的LLM的挑战,每个模型具有独特但通常不透明的潜在属性。加剧这一挑战的是,用户可能缺乏词汇或意识来明确表达他们在LLM的响应或部署中所重视的特征。我们提出了一种交互高效的主动学习框架,其中决斗老虎机算法迭代选择LLM对,收集用户关于其响应的反馈,并更新其对用户潜在偏好的信念。我们引入了一种新颖的信念感知上置信界策略,平衡模型池的探索与推断偏好的利用,从而在用户指定的成本和时间预算下实现用户需求与LLM能力之间的高效对齐。通过在LLM和人类研究上的多样化实验,我们实验验证了我们的模型能够以较低成本高效地将良好对齐的LLM匹配给用户。

英文摘要

Users increasingly face the challenge of selecting an appropriate LLM for a given task from a rapidly growing pool of LLMs, each with distinct but often opaque latent properties. Compounding this challenge, users may lack the vocabulary or awareness to explicitly articulate the characteristics they value in an LLM's responses or deployment. We propose an interaction-efficient active learning framework in which a dueling bandit algorithm iteratively selects pairs of LLMs, collects user feedback about their responses, and updates its belief about the user's latent preferences. We introduce a novel belief-aware upper confidence bound strategy that balances exploration of the model pool with exploitation of inferred preferences, enabling efficient alignment between user needs and LLM capabilities under user-specified cost and time budgets. Through diverse experiments on LLMs and human studies, we experimentally verify that our model can efficiently match well-aligned LLMs to users at a lower cost.

2606.00844 2026-06-02 cs.CV cs.AI cs.LG 版本更新

MoEIoU: Rethinking Bounding-Box Regression as a Mixture of Experts

MoEIoU:将边界框回归重新思考为混合专家模型

Vinay Edula, Priyanka Bagade

发表机构 * Indian Institute of Technology Kanpur(印度理工学院坎普尔分校)

AI总结 提出MoEIoU损失函数,通过混合专家模型联合优化重叠、中心对齐和长宽比,并采用课程学习权重调度,在多个数据集和YOLO架构上超越现有IoU损失。

详情
AI中文摘要

边界框回归是目标检测的基本组成部分,在精确目标定位中起着关键作用。现有的基于交并比(IoU)的损失函数通过引入几何惩罚项(如中心距离和长宽比不匹配)来扩展IoU目标,以改进边界框回归。然而,这些惩罚项通常在训练过程中保持不变,没有考虑优化动态:预测框在初始阶段表现出较大的中心距离和形状误差,而后期阶段则侧重于提高与真实框的重叠。为了解决这一局限性,我们引入了MoEIoU,一种基于混合专家的回归损失,它联合建模了重叠、中心对齐和长宽比不匹配。MoEIoU使用log-sum-exp函数聚合这些组件,该函数强调主要的定位误差,同时保持其他项的平滑贡献。此外,采用基于课程的权重调度,在早期训练阶段优先纠正框的位置和形状,在后期阶段提高重叠。我们在PASCAL VOC、HRIPCB和MS COCO上使用多种YOLO架构以及大规模模拟实验评估了所提出的MoEIoU。它始终优于标准和最新的最先进损失,表现出更快的收敛速度和更高的定位精度。我们进一步表明,这种自适应聚合改进了现有的基于IoU的损失,带来了一致的增益,并为目标检测框架中的边界框回归提供了更有效的优化指导。

英文摘要

Bounding-box regression is a fundamental component of object detection, playing a critical role in precise object localization. Existing Intersection-over-Union (IoU)-based loss functions extend the IoU objective by incorporating geometric penalties, such as center-distance and aspect-ratio mismatch, to improve bounding-box regression. However, these penalties typically remain fixed throughout training and do not account for the optimization dynamics in which predicted boxes initially exhibit large center-distance and shape errors, with later stages focusing on improving overlap with the ground truth. To address this limitation, we introduce MoEIoU, a mixture-of-experts based regression loss that jointly models overlap, center alignment, and aspect-ratio mismatch. MoEIoU aggregates these components using a log-sum-exp function, which emphasizes the dominant localization error while maintaining smooth contributions from other terms. Additionally, a curriculum-based weighting schedule is employed to prioritize correcting box position and shape in early training stages and improving overlap in later stages. We evaluated proposed MoEIoU on PASCAL VOC, HRIPCB, and MS COCO using multiple YOLO architectures, along with large-scale simulation experiments. It consistently outperforms standard and recent state-of-the-art losses, demonstrating faster convergence and improved localization accuracy. We further show that this adaptive aggregation improves existing IoU-based losses, yielding consistent gains and providing more effective optimization guidance for bounding-box regression in object detection frameworks.

2606.00837 2026-06-02 cs.RO cs.LG 版本更新

Coarse-to-Fine Compositional Diffusion for Long-Horizon Planning

粗到细的组合扩散用于长时域规划

Byoungwoo Park, Utkarsh A. Mishra, Jaemoo Choi, Juho Lee, Yongxin Chen

发表机构 * KAIST(韩国科学技术院) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出Coarse-to-Fine Compositional Diffusion (CoFi)方法,通过先形成全局骨架再细化局部细节,在长时域机器人规划、全景图像生成和长视频生成中提升全局一致性和局部质量,同时减少2-8倍去噪评估次数。

Comments Project page: https://cofi-diffusion.github.io

详情
AI中文摘要

扩散模型为生成结构化数据提供了强先验,但许多任务需要输出超出这些模型通常训练规模的范围。组合生成通过将来自预训练短时域先验的重叠局部计划组合成长时域输出来解决这一问题。然而,标准组合主要强制相邻局部计划之间的一致性,产生局部一致性而不直接指定完整组合的全局结构。因此,局部兼容的计划仍可能形成不合理的路线、任务序列或时间演化。现有方法通过重复传播局部一致性信号或添加推理时优化来提高全局连贯性,但随着局部计划数量或维度的增加,这些过程变得昂贵。我们提出粗到细组合扩散(CoFi),一种推理时采样器,将全局结构形成与局部细节细化分离。CoFi首先将局部去噪估计围绕共享的粗结构对齐,产生捕获长程任务级排列的全局骨架。然后将该骨架扩散到中间噪声水平,并使用相同的预训练局部先验去噪,在保留骨架诱导的全局连贯性的同时恢复局部精细结构。在长时域机器人规划、全景图像生成和长视频生成中,CoFi不仅比先前的组合基线提高了全局连贯性和局部样本质量,而且需要2-8倍更少的去噪评估次数。

英文摘要

Diffusion models provide strong priors for generating structured data, but many tasks require outputs beyond the scale on which these models are typically trained. Compositional generation addresses this by composing overlapping local plans from a pretrained short-horizon prior into a long-horizon output. However, standard composition primarily enforces agreement between neighboring local plans, yielding local consistency without directly specifying the global structure of the full composition. As a result, locally compatible plans may still form an implausible route, task sequence, or temporal evolution. Existing methods improve global coherence by repeatedly propagating local consistency signals or by adding inference-time optimization, but these procedures become expensive as the number or dimensionality of local plans increases. We propose Coarse-to-Fine Compositional Diffusion (CoFi), an inference-time sampler that separates global structure formation from local detail refinement. CoFi first aligns local denoised estimates around a shared coarse structure, producing a global scaffold that captures the long-range task-level arrangement. It then diffuses this scaffold to an intermediate noise level and denoises it with the same pretrained local prior, restoring local fine structure while preserving the scaffold-induced global coherence. Across long-horizon robotic planning, panoramic image generation, and long video generation, CoFi not only improves both global coherence and local sample quality over prior compositional baselines, but also requires 2-8x fewer denoiser evaluations.

2606.00835 2026-06-02 cs.LG 版本更新

Online Packet Scheduling with Deadlines and Learning

具有截止日期和学习的在线数据包调度

Gianmarco Genalti, Achraf Azize, Vianney Perchet

发表机构 * Politecnico di Milano(米兰理工大学) FairPlay Joint Team, CREST, ENSAE, IP Paris(FairPlay联合团队,CREST,ENSAE,IP巴黎)

AI总结 针对部分反馈下未知权重的在线数据包调度问题,通过连接睡眠强盗问题,提出算法实现α-遗憾最小化,并在不同松弛度下达到最优界。

详情
AI中文摘要

强制执行服务质量(QoS)保证的网络路由器必须在每个时钟周期决定传输哪个即将过期的数据包,即使数据包的值在处理之前是未知的。我们将此问题框架化为部分反馈下的在线数据包调度(OPSD)问题:数据包在每个时钟周期到达,具有不同的截止日期,但权重仅在执行后观察到。在未知权重的随机假设下,我们探索了具有强盗反馈的OPSD问题的不同变体。我们在我们的设置和睡眠强盗问题之间建立了联系,并将学习目标设定为α-遗憾最小化。我们提供了在不同松弛度下具有可证明α-遗憾保证的算法,区分了允许随机化的系统和不允许的系统。在每种情况下,我们的算法实现了$\widetilde{\mathcal{O}}\left(\sqrt{KT} ight)$的α-遗憾上界,与标准强盗设置的下界匹配。在实际相关的2-有界截止日期实例中,其中截止日期最多设置在到达后的一个时钟周期,我们的确定性算法实现了可证明的最紧竞争比。值得注意的是,当不同数据包类型数量$K\ge 2$有限时,有可能打破已建立的$\Phi= rac{1+\sqrt{5}}{2}$竞争比障碍,并获得范围在$[\sqrt{2}, \Phi)$内的更紧竞争比$ heta_K$。

英文摘要

Network routers that enforce Quality-of-Service (QoS) guarantees must decide, at every clock cycle, which expiring packet of information to transmit, even when the value of the packet is unknown until it is processed. We frame this problem as the Online Packet Scheduling with Deadlines (OPSD) problem under Partial Feedback: packets arrive at every clock cycle, with different deadlines, but the weights are only observed after execution. Under a stochastic assumption on the unknown weights, we explore different variants of the OPSD problem with bandit feedback. We establish a connection between our setting and the sleeping bandits problem, and set our learning goal to $α$-regret minimization. We provide algorithms with provable $α$-regret guarantees under different spans of slackness, distinguishing systems allowing for randomization and systems that do not. In every scenario, our algorithms achieve an $α$-regret upper bound of $\widetilde{\mathcal{O}}\left(\sqrt{KT}\right)$, matching the lower bound for the standard bandit setting. In the practically relevant case of $2$-bounded deadline instances, where the deadline is set at most one clock cycle away from the arrival, our deterministic algorithm achieves the provably tightest possible competitive ratio. Remarkably, when the number of distinct packet types $K\ge 2$ is finite, it is possible to break the well-established $Φ= \frac{1+\sqrt{5}}{2}$ competitive ratio barrier and attain a tighter competitive ratio $θ_K$ ranging in $[\sqrt{2}, Φ)$.

2606.00834 2026-06-02 stat.AP cs.AI cs.LG math.PR 版本更新

Hybrid Probabilistic Forecasting of Under-Five Malaria Admissions in Ghana: A Gaussian Process Regression with Holt-Winters Smoothing

加纳五岁以下儿童疟疾住院人数的混合概率预测:高斯过程回归与Holt-Winters平滑

T. Ansah-Narh, Y. Asare Afrane, J. Bremang Tandoh

发表机构 * GAEC, Ghana(加纳农业和粮食部)

AI总结 针对加纳疟疾预测中季节性和数据不确定性挑战,提出结合高斯过程回归与Holt-Winters指数平滑的混合模型,实现概率性预测并评估其性能。

Comments 24 pages, 8 figures, accepted for publication in Artificial Intelligence in Medicine

详情
AI中文摘要

准确的疟疾预测在撒哈拉以南非洲仍是一个重大挑战,那里强烈的季节性、报告不确定性和非平稳传播动态降低了传统模型的可靠性。在加纳,地区级疟疾监测需要概率上严谨且数据有限时稳健的预测框架。本研究提出了一个混合框架,将高斯过程回归(GPR)与Holt-Winters指数平滑相结合,用于建模每月五岁以下儿童疟疾住院人数。GPR捕捉非线性行为和预测不确定性,而Holt-Winters稳定长期预测并保留季节结构。使用十年(2014-2023年)的地区级数据,通过滚动起点扩展窗口验证评估性能。混合模型实现了$R^2 = 0.9906$,而单独Holt-Winters为$0.8213$,$94.2\%$的残差在$\pm 2σ$范围内。2024-2028年的预测显示月平均住院人数约为8,000至12,200例。时空分析揭示了显著的生态异质性:北部高负担地区尽管绝对波动较大,但相对模式稳定。该框架为疟疾流行地区的早期预警和运营规划提供了一种可扩展的概率方法,支持加纳国家疟疾控制战略。

英文摘要

Accurate malaria forecasting remains a major challenge in sub-Saharan Africa, where strong seasonality, reporting uncertainty, and non-stationary transmission dynamics reduce the reliability of conventional models. In Ghana, district-level malaria surveillance requires forecasting frameworks that are probabilistically rigorous and robust under limited data. This study proposes a hybrid framework integrating Gaussian Process Regression (GPR) with Holt-Winters exponential smoothing for modelling monthly under-five malaria admissions. GPR captures non-linear behaviour and predictive uncertainty, while Holt-Winters stabilises long-horizon forecasts and preserves seasonal structure. Using ten years of district-level data (2014-2023), performance was evaluated via rolling-origin expanding-window validation. The hybrid model achieved $R^2 = 0.9906$ versus $0.8213$ for Holt-Winters alone, with $94.2\%$ of residuals within $\pm 2σ$ bounds. Forecasts for 2024-2028 project average monthly admissions from approximately 8{,}000 to 12{,}200 cases. Spatio-temporal analysis revealed pronounced ecological heterogeneity: northern high-burden districts exhibited stable relative patterns despite large absolute fluctuations. The framework provides a scalable probabilistic approach for malaria early warning and operational planning in endemic settings, supporting Ghana's national malaria control strategy.

2606.00831 2026-06-02 cs.AI cs.LG 版本更新

Subliminal Learning is a LoRA Artifact

潜意识学习是LoRA的伪影

Todd Nief, Harvey Yiyun Fu, Mark Muchane, Ari Holtzman

发表机构 * Department of Computer Science, University of Chicago(芝加哥大学计算机科学系) Data Science Institute, University of Chicago(芝加哥大学数据科学研究所)

AI总结 本文发现潜意识学习是LoRA微调产生的伪影,其传递行为与LoRA秩呈倒U型关系,且完全微调下消失,表明该现象依赖于微调和评估上下文。

详情
AI中文摘要

潜意识学习是一种现象,语言模型可以通过看似无害的数据将行为特征传递给其他模型(Cloud et al., 2025)。在潜意识学习中,具有行为特征(例如对猫的痴迷)的教师模型可以将这种猫痴迷传递给仅在教师生成的数字序列上微调的学生模型。在本文中,我们提出疑问:这种意想不到的行为传递是如何发生的?我们表明,潜意识学习是LoRA的伪影。当潜意识学习发生时,传递与LoRA秩呈倒U型关系;在完全微调下也会消失。我们表明,潜意识学习高度依赖于微调和评估期间看到的上下文。例如,在微调期间使用默认系统提示(“你是Qwen,由阿里云创建。你是一个有用的助手。”)的Qwen模型,在生成时如果没有包含系统提示,则不会表现出潜意识学习。我们进一步证明,潜意识行为局限于在微调和评估期间都看到的标记(例如模型的默认系统提示、标准聊天模板标记等)上的计算。总体而言,潜意识学习似乎是LoRA超参数和微调上下文的脆弱伪影,使其成为行为传递的不稳定渠道。

英文摘要

Subliminal learning is a phenomenon where language models can transmit behavioral traits to other models through seemingly innocuous data (Cloud et al., 2025). In subliminal learning, a teacher model with a behavioral trait (e.g. obsession with cats) can transmit this cat obsession to a student model finetuned only on numerical sequences generated by the teacher. In this paper, we ask: how does this unexpected behavioral transmission occur? We show that subliminal learning is a LoRA artifact. When subliminal learning occurs, transmission has an inverted U-shaped relationship with LoRA rank; it also disappears with full finetuning. We show that subliminal learning is highly dependent on the context seen during finetuning and evaluation. For example, a Qwen model with the default system prompt during finetuning ("You are Qwen, created by Alibaba Cloud. You are a helpful assistant.") does not show subliminal learning during generation when no system prompt is included. We further demonstrate that subliminal behavior is localized to computation at tokens seen during both finetuning and evaluation (e.g. the model's default system prompt, the standard chat template tokens, etc.). Overall, subliminal learning seems to be a fragile artifact of LoRA hyperparameters and finetuning context, making it an unstable channel for behavioral transmission.

2606.00826 2026-06-02 cs.LG 版本更新

Partial Fairness Awareness: Belief-Guided Strategic Mechanism for Strategic Agents

部分公平意识:面向策略代理的信念引导策略机制

Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Hao Zou, Shanzhi Gu, Liyang Xu, Huan Chen, Yuanlong Chen, Wenjing Yang, Haotian Wang

发表机构 * National University of Defense Technology, Changsha, China(国防科技大学) Peking University, Beijing, China(北京大学) Shanghai University of Finance and Economics, Shanghai, China(上海财经大学) ZGC Laboratory, Beijing, China(ZGC实验室) Faculty of Computing, Harbin Institute of Technology, Harbin, China(哈尔滨工业大学计算机学院)

AI总结 针对策略分类中的公平暴露困境,提出部分公平意识(PFA)问题,通过发布公平约束候选集并隐藏真实约束,结合信念引导机制实现代理与系统公平约束的对齐,实验表明PFA在降低群体公平差距、提高合格个体接受率和结果稳定性方面优于完全公开或私有的公平机制。

Comments Accepted by AAAI2026

详情
AI中文摘要

策略机器学习研究代理操纵其特征以从预测模型获得有利决策的场景。为了解决策略分类中固有的公平问题,最近的工作引入了群体特定的公平约束。然而,当前的公平感知方法在公平暴露问题上面临根本困境:公开这些约束会导致策略操纵和公平逆转,而隐藏它们可能降低社会福利并阻碍真正的改进。为填补这一空白,我们随后提出了部分公平意识(PFA)问题,因为我们的理论分析表明,这种困境可以通过发布公平约束的候选集并隐藏真实约束来缓解。具体来说,我们引入了一种信念引导的策略机制,其中代理与决策系统迭代交互,并在公平约束候选集上维持一个信念分布。这一信念引导过程使代理能够通过迭代交互和反馈,更新其在候选集上的信念分布,从而逐渐使其信念与系统采用的真实公平约束对齐。在真实世界和合成数据集上的大量实验表明,与完全公开或私有的公平机制相比,PFA实现了更低的群体公平差距、更高的真正合格个体接受率以及更稳定的结果。

英文摘要

Strategic machine learning investigates scenarios where agents manipulate their features to receive favorable decisions from predictive models. To address fairness concerns intrinsic to strategic classification, recent work has introduced group-specific fairness constraints. However, current fairness-aware approaches face a fundamental dilemma in the issue of fairness exposure: making these constraints public enables strategic manipulation and can lead to fairness reversal, while keeping them hidden may reduce social welfare and discourage genuine improvement. To fill this gap, we subsequently propose the problem of partial fairness awareness (PFA), as our theoretical analysis informs that such a dilemma can be mitigated by releasing the candidate set of fairness constraints and concealing the grounding constraint. To be specific, we introduce a belief-guided strategic mechanism, wherein agents iteratively interact with the decision system and maintain a belief distribution over the candidate set of fairness constraints. This belief-guided process enables agents, through iterative interaction and feedback, to update their belief distribution over the candidate set, thereby gradually aligning their belief with the grounding fairness constraint employed by the system. Extensive experiments on real-world and synthetic datasets demonstrate that PFA achieves lower group fairness gaps, higher acceptance of truly qualified individuals, and more stable outcomes compared to fully public or private fairness regimes.

2606.00821 2026-06-02 cs.LG 版本更新

A Comparative Analysis of Machine Learning Algorithms for Multi-Task Prediction of the Parameters of the Pectin Hydrolysis--Extraction Process

机器学习算法用于果胶水解-提取过程参数多任务预测的比较分析

Mullosharaf K. Arabov, Shavkat Yo. Kholov, Zainiddin K. Muhiddin

发表机构 * Institute of Computational Mathematics and Information Technologies, Kazan Federal University(卡兹安联邦大学计算数学与信息科技研究所) Tajik Technical University named after Academician M.S. Osimi(阿米尔·苏米院士命名的塔吉克技术大学) V.I. Nikitin Institute of Chemistry, National Academy of Sciences of Tajikistan(塔吉克斯坦国家科学院化学研究所维·尼金廷研究所)

AI总结 本研究比较了11种机器学习算法在多任务回归预测果胶水解-提取过程参数中的性能,其中CatBoost表现最佳(平均R²约0.946),并分析了特征重要性,原料类型占主导地位(63.6%)。

Comments Preprint

详情
AI中文摘要

本研究利用机器学习方法解决复杂多参数工艺——果胶水解-提取过程的控制挑战。实验基础是一个独特的数据库,包含在受控条件下对七种植物原料进行的1000次实验室实验,涉及四个可变工艺因素(温度85-130°C、压力0.9-2.2 atm、保温时间3-10分钟、pH 1.5-2.0)。记录了四个输出特征:果胶产率、半乳糖醛酸含量、分子量和酯化度。为解决多任务回归问题,训练并比较了11种算法:正则化线性模型、集成方法(随机森林、梯度提升、XGBoost、CatBoost、Extra Trees)、k近邻、支持向量回归和多层感知器。最佳结果由CatBoost展示(超参数优化后平均R²约为0.946)。特征重要性分析揭示了原料类型的主导作用(占总重要性的63.6%),其次是温度和保温时间。开发的流水线以生产就绪格式导出,并部署为交互式Web界面。研究结果表明,集成方法结合严格的统计分析和可解释AI显著减少了物理实验的需求,并为智能果胶生产控制奠定了基础。

英文摘要

This study addresses the challenge of controlling a complex, multi-parameter technological process -- pectin hydrolysis--extraction -- using machine learning methods. The experimental foundation is a unique database comprising 1,000 laboratory experiments conducted under controlled conditions on seven types of plant raw material with four variable process factors (temperature 85--130 C, pressure 0.9--2.2 atm, holding time 3--10 min, pH 1.5--2.0). Four output characteristics were recorded: pectin yield, galacturonic acid content, molecular weight, and degree of esterification. To solve the multi-task regression problem, 11 algorithms were trained and compared: regularised linear models, ensemble methods (Random Forest, Gradient Boosting, XGBoost, CatBoost, Extra Trees), k-nearest neighbours, support vector regression, and a multilayer perceptron. The best results were demonstrated by CatBoost (average R-squared approximately 0.946 after hyperparameter optimisation). Feature importance analysis revealed the dominant role of the raw material type (63.6% of total importance), followed by temperature and holding time. The developed pipeline was exported in a production-ready format and deployed as an interactive web interface. The findings demonstrate that ensemble methods combined with rigorous statistical analysis and interpretable AI significantly reduce the need for physical experiments and form the basis for intelligent pectin production control.

2606.00815 2026-06-02 cs.LG 版本更新

OmniEEG-Bench: A Standardized Evaluation Benchmark for EEG Foundation Models

OmniEEG-Bench: 脑电图基础模型的标准化评估基准

Ziling Lu, Zongsheng Li, Xinke Shen, Kexin Lou, Yingyue Xin, Xiaoqi Chen, Shinan Wang, Xiang Chen, Jiahao Fan, Chenyu Huang, Xin Xu, Zhoujie Hou, Chen Wei, Quanying Liu

发表机构 * Department of Biomedical Engineering, Southern University of Science and Technology, Shenzhen, China(南方科技大学生物医学工程系,深圳,中国) School of Computer Science and Engineering, The Chinese University of Hong Kong, Shenzhen, China(香港中文大学(深圳)计算机科学与工程学院,深圳,中国) Omni-Intelligence, Shenzhen, China(奥米智能,深圳,中国) Shenzhen Loop Area Institute, Shenzhen, China(深圳环城研究院,深圳,中国)

AI总结 针对脑电图基础模型评估碎片化问题,提出统一基准OmniEEG-Bench,涵盖六类任务、54个数据集,并揭示预训练数据多样性和模型大小与性能的缩放律关系。

Comments 28 pages, 13 figures, 8 tables; benchmark of EEG foundation models

详情
AI中文摘要

脑电图(EEG)支持多种脑机接口(BCI)任务,从脑状态监测到人-大语言模型交互。EEG基础模型正在兴起,但由于异构数据集和不一致的任务协议,评估仍然碎片化。在此,我们介绍OmniEEG-Bench,一个用于EEG基础模型(FMs)的统一基准和下游任务路线图。它将EEG FMs的评估组织为六个任务族,涵盖(i)信号可靠性、(ii)生物特征与疾病、(iii)意识与状态、(iv)认知与情感、(v)自然刺激解码以及(vi)运动与交互,引入了先前EEG FM工作中未系统基准测试的新一代任务。OmniEEG-Bench通过任务卡规范标准化模型部署、任务定义和指标,并统一了54个EEG数据集及一致的评估协议。我们对10个代表性EEG基础模型进行了基准测试,并报告了涵盖多种评估设置的排行榜。预训练数据集多样性和模型大小均与跨数据集的更好平均排名显著相关,揭示了EEG基础模型中的缩放律行为(图1)。这些结果表明,扩展EEG基础模型不仅需要更大的架构,还需要更广泛和更多样化的预训练数据。基准测试代码可在https://github.com/ncclab-sustech/omni-eegbench.git获取。

英文摘要

Electroencephalography (EEG) supports a variety of brain-computer interface (BCI) tasks ranging from brain-state monitoring to human-LLM interactions. EEG foundation models are emerging, but evaluation remains fragmented due to heterogeneous datasets and nconsistent task protocols. Here, we introduce OmniEEG-Bench, a unified benchmark and downstream task roadmap for EEG foundation models (FMs). It organizes evaluation of EEG FMs into six task families spanning (i) signal reliability, (ii) biometrics and disease, (iii) consciousness and state, (iv) cognition and emotion, (v) naturalistic stimulus decoding, and (vi) motor and interaction, introducing a new generation of tasks not systematically benchmarked in prior EEG FM work. OmniEEG-Bench standardizes model deployment, task definitions, and metrics through a task-card specification, and unifies 54 EEG datasets with consistent evaluation protocols. We benchmark 10 representative EEG foundation models and report a leaderboard that covers diverse evaluation settings. Both pretraining dataset diversity and model size are significantly associated with better average ranks across datasets, revealing scaling-law behavior in EEG foundation models (Figure 1). These results suggest that scaling EEG foundation models requires not only larger architectures but also broader and more diverse pretraining data. The benchmark code is available at https://github.com/ncclab-sustech/omni-eegbench.git.

2606.00813 2026-06-02 cs.CR cs.CL cs.ET cs.LG cs.NE 版本更新

Cross-Generational Transfer of Adversarial Attacks Reveals Non-Monotonic Safety Alignment in LLMs

跨代对抗攻击迁移揭示大语言模型安全对齐的非单调性

Subhadip Mitra

发表机构 * Rota Labs(Rota实验室)

AI总结 通过质量多样性进化(MAP-Elites)对四代Gemma模型进行自动红队探测,发现安全对齐非单调变化,其中Gemma 3攻击成功率显著高于前后代,且攻击迁移率在不同代际间存在差异。

Comments 8 pages, 3 figures

详情
AI中文摘要

大语言模型的安全对齐在不同代际之间并非单调提升。通过对Google Gemma家族四代模型(7B-31B)使用质量多样性进化(MAP-Elites)作为自动红队探测,我们发现Gemma 3(12B)的攻击成功率为68.7% ± 5.7%(均值±标准差,3个种子),显著高于其前代Gemma 2(45.5% ± 7.2%;p = 0.030,配对bootstrap)和后继Gemma 4(33.9% ± 1.8%)。跨代重放进化攻击档案显示,来自其他代际的攻击对Gemma 3的迁移率为44-46%,而对Gemma 4仅为14-18%,表明Gemma 4的安全增益泛化到了针对前代进化出的攻击分布之外。在我们的8B评判模型下,版权和网络犯罪漏洞在所有代际中接近100%,但第二评判审计(第6节)表明版权结果对评判选择敏感。错误信息ASR从Gemma 2的29%跃升至Gemma 3的99%,并在Gemma 4中仍保持77%的高位,表明该退化未得到完全解决。这些模式在静态基准测试中不可见,仅通过自适应的纵向探测才显现。所有实验使用3个随机种子和统一的自主托管评判模型;代码和工件可在https://github.com/bassrehab/red-queen获取。

英文摘要

Safety alignment in LLMs does not improve monotonically across model generations. Studying four generations of Google's Gemma family (7B-31B) with quality-diversity evolution (MAP-Elites) as an automated red-teaming probe, we find that Gemma 3 (12B) exhibits 68.7% +/- 5.7% attack success rate (ASR; mean +/- std, 3 seeds), significantly higher than its predecessor Gemma 2 (45.5% +/- 7.2%; p = 0.030, paired bootstrap) and its successor Gemma 4 (33.9% +/- 1.8%). Replaying evolved attack archives across generations reveals that attacks from other generations transfer to Gemma 3 at 44-46% but only 14-18% to Gemma 4, indicating that Gemma 4's safety gains generalize beyond the attack distributions evolved against earlier generations. Under our 8B judge, copyright and cybercrime vulnerabilities register at near-100% across all generations, though a second-judge audit (Section 6) suggests the copyright result is sensitive to judge choice. Misinformation ASR jumps from 29% to 99% between Gemma 2 and Gemma 3 and remains elevated at 77% in Gemma 4, indicating the regression was not fully addressed. These patterns are invisible to static benchmarks and emerge only through adaptive, longitudinal probing. All experiments use 3 random seeds with a unified self-hosted judge; code and artifacts are available at https://github.com/bassrehab/red-queen.

2606.00808 2026-06-02 cs.LG 版本更新

Safe-Subspace Pseudo-Label Refinement for Source-Free Graph Domain Adaptation

安全子空间伪标签精炼用于无源图域自适应

Yingxu Wang, Xinwang Liu, Siyang Gao, Nan Yin

发表机构 * Department of Computer Science and Engineering, Chinese University of Hong Kong(香港中文大学计算机科学与工程系) College of Computer, National University of Defense Technology(国防科技大学计算机学院) Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) The Education University of Hong Kong(香港教育大学)

AI总结 针对无源图域自适应中伪标签不可靠的问题,提出SafeSubspace伪标签精炼方法,通过识别置信度一致的安全子空间并利用语义与结构证据进行伪标签验证,实现鲁棒的图域自适应。

详情
AI中文摘要

无源图域自适应(SF-GDA)旨在当源图不再可访问时,将源训练的图模型适应到未标记的目标图。一个核心障碍是伪标签的可靠性:在特征和拓扑偏移下,源诱导的预测可能变得自信但错误,而无差别的自训练会通过图消息传递放大系统误差。本文从选择性伪标签的角度研究SF-GDA。我们不是假设整个目标域上全局有界的伪标签噪声,而是识别一个置信度一致的安全子空间,在该子空间上伪标签噪声可以在受限后验差异下得到控制,并推导出一个目标风险分解,将安全子空间拟合误差、选定标签噪声和不确定集风险分开。在此分析指导下,我们提出SafeSubspace伪标签精炼(S$^2$PLR),一种无源图自适应框架,仅对同时具有语义和结构证据支持的目标图应用硬伪标签监督。具体来说,S$^2$PLR利用源委员会置信度和分歧估计语义可靠性,通过图对比学习学习目标内在的结构表示,通过邻域一致性验证伪标签,并利用噪声容忍的软正则化处理剩余的不确定样本,而不是不可靠的硬标签。在不同域偏移下的图像和真实世界图基准上的实验表明,S$^2$PLR在各种无源迁移设置中实现了鲁棒且具有竞争力的性能。

英文摘要

Source-free graph domain adaptation (SF-GDA) aims to adapt source-trained graph models to unlabeled target graphs when source graphs are no longer accessible. A central obstacle is pseudo-label reliability: under feature and topological shifts, source-induced predictions may become confidently wrong, and indiscriminate self-training can amplify systematic errors through graph message passing. This paper studies SF-GDA from a selective pseudo-labeling perspective. Instead of assuming globally bounded pseudo-label noise over the entire target domain, we identify a confidence-consistent safe subspace on which pseudo-label noise can be controlled under restricted posterior discrepancy, and derive a target-risk decomposition that separates safe-subspace fitting error, selected-label noise, and uncertain-set risk. Guided by this analysis, we propose SafeSubspace Pseudo-Label Refinement (S$^2$PLR), a source-free graph adaptation framework that applies hard pseudo-label supervision only to target graphs supported by both semantic and structural evidence. Specifically, S$^2$PLR estimates semantic reliability using source-committee confidence and disagreement, learns a targetintrinsic structural representation via graph contrastive learning, verifies pseudo-labels through neighborhood consistency, and exploits the remaining uncertain samples with noise-tolerant soft regularization rather than unreliable hard labels. Experiments on image and real-world graph benchmarks under different domain shifts demonstrate that S$^2$PLR achieves robust and competitive performance across diverse source-free transfer settings.

2606.00803 2026-06-02 astro-ph.CO cs.CV cs.LG 版本更新

Generative Diffusion Priors for 3D Mapping of the Dark Universe

用于暗宇宙三维映射的生成扩散先验

Brandon Zhao, Diana Scognamiglio, Olivier Doré, Katherine L. Bouman

发表机构 * Department of Computing and Mathematical Sciences, California Institute of Technology(加州理工学院计算与数学科学系) Jet Propulsion Laboratory, California Institute of Technology(加州理工学院喷气推进实验室) Department of Physics, Duke University(杜克大学物理系) Cahill Center for Astronomy and Astrophysics, California Institute of Technology(加州理工学院卡希尔天文与天体物理中心)

AI总结 利用扩散模型学习宇宙模拟中的先验分布,结合物理正向模型解决弱引力透镜三维暗物质反问题,显著提升重建精度并生成统计一致的后验样本。

Comments Accepted to CVPR 2026 (Highlight)

详情
AI中文摘要

从弱引力透镜观测重建暗物质的三维分布是宇宙学中一个核心但高度病态的反问题。与多视角标准三维重建不同,我们通过单一视线方向观测宇宙,通过星系不确定距离的噪声形状畸变,因此有意义的三维物质场恢复需要强先验假设。现有方法要么使用手工先验产生点估计,要么使用神经集成进行近似贝叶斯不确定性,难以捕捉宇宙网的非高斯、纤维状结构。随着新的高分辨率宇宙学模拟的出现,我们现在有了另一种先验知识来源,其捕捉结构形成的非线性统计的保真度远高于解析公式。我们利用这些模拟构建了一个新数据集$ exttt{Conicus3D}$,使我们能够学习一个数据驱动的扩散模型先验,捕捉暗物质结构在宇宙时间内的完整三维分布。基于最近的即插即用方法,我们将基于扩散的后验采样方案修改为三维弱引力透镜设置,将学习到的先验与可微分的物理正向模型相结合。在针对现代弱引力透镜巡天的逼真模拟上,我们的方法在二维和三维重建精度上显著优于基线方法。此外,它产生的后验样本的统计量紧密跟踪底层模拟,同时对宇宙学参数的适度偏移保持鲁棒性。

英文摘要

Reconstructing the three-dimensional distribution of dark matter from weak-lensing observations is a central but highly ill-posed inverse problem in cosmology. Unlike standard 3D reconstruction with multiple viewpoints, we observe the universe from a single line of sight, through noisy shape distortions of galaxies with uncertain distances, so meaningful recovery of the 3D matter field requires strong prior assumptions. Existing methods either produce point estimates with handcrafted priors or use neural ensembles for approximate Bayesian uncertainty, and struggle to capture the non-Gaussian, filamentary structure of the cosmic web. With the advent of new high-resolution cosmological simulations, we now have an alternative source of prior knowledge that captures the nonlinear statistics of structure formation with far greater fidelity than analytic prescriptions. We leverage these simulations to build a new dataset $\texttt{Conicus3D}$, which enables us to learn a data-driven diffusion-model prior capturing the full 3D distribution of dark matter structure across cosmic time. Building on recent plug-and-play approaches, we modify a diffusion-based posterior sampling scheme to the 3D weak-lensing setting, combining the learned prior with a differentiable physical forward model. On realistic simulations targeting a modern weak lensing survey, our approach yields substantially improved 2D and 3D reconstruction accuracy over baseline methods. Moreover, it produces posterior samples whose statistics closely track the underlying simulations, while remaining robust to moderate shifts in cosmology.

2606.00801 2026-06-02 cs.CR cs.CL cs.ET cs.LG cs.NE 版本更新

Quality-Diversity Evolution for Discovering Diverse Vulnerabilities in LLM Safety

用于发现LLM安全中多样漏洞的质量-多样性进化

Subhadip Mitra

发表机构 * Rota Labs(Rota实验室)

AI总结 提出基于质量-多样性进化框架(MAP-Elites)在语义层面生成可解释攻击策略,发现不同LLM的特定漏洞模式。

Comments 9 pages, 6 figures. Accepted at the ICLR 2026 Workshop on Agents in the Wild (AIWILD)

详情
AI中文摘要

当前LLM对抗性测试方法存在覆盖缺口:手动红队测试无法扩展,LLM作为攻击者的方法会出现模式崩溃,基于梯度的方法产生不可解释的乱码。我们引入一个在语义层面运行的质量-多样性进化框架,进化可解释的攻击策略而非令牌序列。使用MAP-Elites,我们在行为维度(策略类型、编码方法、提示长度)上维护一个多样化的攻击档案。在GPT-4o-mini、Claude 3.5 Sonnet、Gemini 2.0 Flash和一个开放权重的编码模型(Devstral-small-2)上的实验中,我们发现了不同的漏洞特征:GPT-4o-mini容易受到假设性和多轮框架结合ROT13编码的攻击(适应度0.8),Gemini容易受到直接攻击结合ROT13以及多轮攻击结合Leetspeak(0.8),而Claude在所有策略上表现出统一的模糊响应(最大0.4)。语义表示产生了可解释的攻击,揭示了系统性的、模型特定的弱点,为改进LLM安全性提供了可操作的见解,并为评估未来前沿模型提供了可复现的基线。代码和实验工件发布在https://github.com/bassrehab/red-queen。

英文摘要

Current approaches to LLM adversarial testing suffer from coverage gaps: manual red-teaming does not scale, LLM-as-attacker methods exhibit mode collapse, and gradient-based approaches produce uninterpretable gibberish. We introduce a quality-diversity evolutionary framework that operates at the semantic level, evolving interpretable attack strategies rather than token sequences. Using MAP-Elites, we maintain a diverse archive of attacks across behavioral dimensions (strategy type, encoding method, prompt length). In experiments across GPT-4o-mini, Claude 3.5 Sonnet, Gemini 2.0 Flash, and an open-weight coding model (Devstral-small-2), we discover distinct vulnerability profiles: GPT-4o-mini is vulnerable to hypothetical and multi-turn framing combined with ROT13 encoding (fitness 0.8), Gemini to direct attacks with ROT13 and multi-turn with Leetspeak (0.8), while Claude shows uniformly ambiguous responses across all strategies (max 0.4). The semantic representation produces interpretable attacks that reveal systematic, model-specific weaknesses, providing actionable insights for improving LLM safety and a reproducible baseline for evaluating future frontier models. Code and experiment artifacts are released at https://github.com/bassrehab/red-queen.

2606.00798 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DASH: Dual-Branch Score Distillation for Guidance-Calibrated Compact Diffusion Models

DASH: 用于引导校准紧凑扩散模型的双分支分数蒸馏

Abdullah Al Shafi, Kazi Saeed Alam, Sk Imran Hossain, Engelbert Mephu Nguifo

发表机构 * Khulna University of Engineering & Technology(Khulna 工程与技术大学) University Clermont Auvergne(克莱蒙特-奥弗涅大学)

AI总结 针对类条件扩散模型参数压缩中无监督无条件分数分支导致引导失效的问题,提出双分支蒸馏框架DASH,通过独立监督两个分支并引入锚点正则化和课程迁移,在5.9倍压缩下保持与教师模型相近的FID和引导保真度。

Comments 14 pages, 7 figures, 4 tables; appendix with additional ablations and qualitative results

详情
AI中文摘要

类条件扩散模型的参数压缩揭示了输出级蒸馏中一个未被充分探索的局限性:无条件分数分支保持无监督,导致学生模型中无分类器引导差距欠定。该差距在每个去噪步骤中被放大,允许两个分支都崩溃为相同预测的退化解,使得引导在低输出级训练损失下无效。本文介绍了DASH,一种双分支蒸馏框架,独立监督两个分数分支,通过独立分支约束为每个训练样本唯一指定目标分支输出,并引入锚点项将条件预测正则化到真实噪声。该框架进一步引入了TIRT迁移,将教师收敛的每时间步重要性课程复制到学生中作为冻结先验,消除了在有限蒸馏预算内重新学习它的需要。在CIFAR-10和CIFAR-100上的实验表明,5.9倍压缩在50步DDIM采样下将质量保持在教师模型4个FID点以内,显著优于从头训练,且引导保真度良好保持。消融研究证实无条件监督是主要贡献,占总蒸馏增益的60%以上。课程迁移和锚点正则化提供互补收益,共同验证了双分支约束对于引导保持压缩的经验必要性。

英文摘要

Parameter compression of class-conditional diffusion models reveals an underexplored limitation in output-level distillation: the unconditional score branch remains unsupervised, leaving the classifier-free guidance gap underdetermined in the student. This gap, amplified at every denoising step, admits degenerate solutions where both branches collapse toward identical predictions, rendering guidance ineffective despite low output-level training loss. This paper introduces DASH, a dual-branch distillation framework that independently supervises both score branches, uniquely specifying target branch outputs for each training sample through independent branch constraints, with an anchor term regularising conditional predictions toward ground-truth noise. The framework further introduces TIRT Transfer, which copies the teacher's converged per-timestep importance curriculum into the student as a frozen prior, eliminating the need to relearn it within limited distillation budgets. Experiments on CIFAR-10 and CIFAR-100 demonstrate that 5.9x compression maintains quality within 4 FID points of the teacher at 50-step DDIM sampling, considerably outperforming training from scratch with guidance fidelity well preserved. Ablation studies confirm that unconditional supervision is the dominant contribution, accounting for over 60% of total distillation gain. Curriculum transfer and anchor regularisation provide complementary benefit, together validating dual-branch constraints as empirically essential for guidance-preserving compression.

2606.00795 2026-06-02 cs.LG cs.AI 版本更新

Extending Causal Metamodeling to a non-Markovian Queue

将因果元建模扩展到非马尔可夫排队系统

Pracheta Amaranath, Anant Bhide, David Jensen, Peter Haas

发表机构 * Manning College of Information and Computer Sciences University of Massachusetts Amherst(信息与计算机科学学院麻省大学阿默斯特分校)

AI总结 本文通过相位型分布近似非指数分布,将模块化动态贝叶斯网络(MDBN)因果元建模方法从马尔可夫系统扩展到非马尔可夫排队系统,并解决了相位数选择、参数学习和采样间隔等挑战,实验表明在G/M/1队列上可实现数量级的推理加速。

Comments 12 pages

详情
AI中文摘要

离散事件仿真的元模型近似模拟模型的行为,而无需运行昂贵的仿真。先前的工作引入了模块化动态贝叶斯网络(MDBN)——一类元模型,可以使用单个训练模型估计一系列概率和因果查询(PCQ)——但该方法仅限于马尔可夫系统。在本文中,我们通过使用相位型分布近似非指数分布,启动MDBN向非马尔可夫排队的扩展。这种方法带来了新的挑战,包括在选择相位数量时平衡元建模精度和可处理性、高效学习元模型参数,以及选择用于通过离散时间MDBN近似连续时间仿真的采样间隔。我们为这些挑战提供了初步解决方案,从而产生了第一个针对非马尔可夫系统的因果元建模技术。在G/M/1队列上的实验表明,MDBN可以为PCQ提供准确的答案,并且相对于直接仿真,推理时间实现了数量级的加速。

英文摘要

Metamodels for discrete-event simulations approximate the behavior of simulation models without running expensive simulations. Prior work introduced modular dynamic Bayesian networks (MDBNs) -- a class of metamodels that can estimate a range of probabilistic and causal queries (PCQs) using a single, trained model -- but the method was limited to Markovian systems. In this paper, we initiate an extension of MDBNs to non-Markovian queues by approximating non-exponential distributions using phase-type distributions. This approach raises novel challenges, including balancing metamodeling accuracy and tractability when choosing the number of phases, efficiently learning metamodel parameters, and choosing the sampling interval that is used to approximate a continuous-time simulation by a discrete-time MDBN. We provide preliminary solutions to these challenges, yielding the first causal metamodeling technique for non-Markovian systems. Experiments on a G/M/1 queue demonstrate that the MDBN can produce accurate answers to PCQs with orders-of-magnitude speedup of inference times relative to direct simulation.

2606.00794 2026-06-02 cond-mat.mtrl-sci cs.LG 版本更新

Benchmark Dataset for Catalysis on 2D MXenes

二维MXene催化基准数据集

Pavlo Melnyk, Anmar Karmush, Mårten Wadenbäck, Ania Beatriz Rodríguez-Barrera, Johanna Rosen, Michael Felsberg, Jonas Björk

发表机构 * Computer Vision and Learning Systems, Department of Electrical Engineering (ISY) & AI4X(计算机视觉与学习系统,电气工程系(ISY)及AI4X) Materials Design Division, Department of Physics, Chemistry and Biology (IFM)(材料设计分校,物理、化学与生物系(IFM)) Wallingenberg Initiative Materials Science for Sustainability (WISE)(瓦伦贝格可持续材料科学倡议(WISE))

AI总结 通过结合第一性原理计算与机器学习,构建包含50000个DFT计算训练集和10000个测试集的数据集,训练并验证多种机器学习原子间势模型,实现约10^3倍加速且保持高精度,推动MXene催化行为的高效研究。

详情
AI中文摘要

将第一性原理计算与机器学习(ML)相结合,旨在加速新型材料催化行为的探索。我们专注于二维(2D)Ti$_2$CT$_y$ MXene,其多样的表面化学性质使其成为极具吸引力的催化候选材料。由于计算成本,在现实条件下解析其组成和结构超出了标准密度泛函理论(DFT)的能力。为应对这一挑战,我们生成了一个包含50000个DFT计算用于训练和10000个用于测试的全面数据集,涵盖Ti$_2$CT$_y$ MXene构型和分子系统,以及一个包含1000个真正新的大系统的额外测试数据集,以研究模型的泛化能力。我们训练并验证了广泛使用且具有竞争力的机器学习原子间势(MLIP)模型,包括EquiformerV2、MACE、MatRIS和UPET,这些模型能够准确预测原子力和形成能——这些是DFT在结构和催化研究中必须反复计算的量——对于这些二维材料。这种DFT-ML联合框架实现了约$1-4 \cdot 10^3$倍(在CPU上)的计算加速,同时保持所需精度(力约$\pm 10$ meV/Å,每原子能量约$\pm 1$ meV),为更高效地研究MXene催化行为铺平了道路。此外,我们对训练模型进行了广泛的定性评估,展示了超越基准指标的基于模拟的综合比较的重要性。数据集、训练模型及代码可在https://huggingface.co/datasets/CatalystAnonymous/catalyst_mxenes获取。

英文摘要

Merging first-principles calculations with machine learning (ML), we aim to accelerate the exploration of catalytic behaviour in novel materials. We focus on two-dimensional (2D) Ti$_2$CT$_y$ MXenes, whose versatile surface chemistry makes them particularly compelling candidates for catalysis. Resolving their composition and structure under realistic conditions exceeds the reach of standard density functional theory (DFT) due to computational cost. To address this challenge, we generate a comprehensive dataset of 50,000 DFT calculations for training and 10,000 for testing, encompassing both Ti$_2$CT$_y$ MXene configurations and molecular systems, along with an additional test dataset with 1000 genuinely new, larger systems to investigate how well models generalise. We train and validate widely used and competitive machine learning interatomic potential (MLIP) models, including EquiformerV2, MACE, MatRIS, and UPET, that accurately predict atomic forces and formation energies -- quantities that DFT must repeatedly compute for structural and catalytic investigations -- for these 2D materials. This combined DFT-ML framework achieves computational acceleration on the order of approximately $1-4 \cdot 10^3$ (on a CPU) while maintaining desired-level accuracy (approximately +/- $10$ meV/A for forces and approximately +/- $1$ meV for per-atom energies), paving the way for more efficient investigations of MXene catalytic behaviour. Moreover, we perform an extensive qualitative evaluation of the trained models, showcasing the importance of comprehensive simulation-based comparison beyond benchmark metrics. The dataset and the trained models with the code are available at https://huggingface.co/datasets/CatalystAnonymous/catalyst_mxenes.

2606.00780 2026-06-02 cs.LG cs.AI 版本更新

Behavior-Invariant Task Representation Learning with Transformer-based World Models for Offline Meta-Reinforcement Learning

基于Transformer世界模型的行为不变任务表示学习用于离线元强化学习

Fuyuan Qian, Menglong Zhang, Song Wang, Quanying Liu

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出一种结合信息论任务表示学习与Transformer随机世界模型的框架,通过提取行为不变的任务变量和保守值惩罚,解决离线元强化学习中的分布偏移和稀疏奖励问题,实现鲁棒泛化。

Comments ICML2026

详情
AI中文摘要

离线元强化学习利用静态数据集使智能体能够通过结合离线效率与元学习适应性来泛化到未见环境,但它面临来自上下文和策略分布偏移的关键挑战。这些问题阻碍智能体适应在线环境,并在稀疏奖励设置下进一步加剧。结果,智能体常常陷入固有的模式困境,无法实现鲁棒的泛化。在这项工作中,我们提出了一种新颖的框架,将信息论任务表示学习与基于Transformer的随机世界模型相结合。我们的方法提取对行为策略不变的任务定义潜在变量,从而有效缓解上下文分布偏移。为了进一步处理策略偏移和模型利用,我们对基于想象力的轨迹应用保守值惩罚,防止策略利用模型不准确性,同时保持鲁棒适应。大量评估表明,我们的方法在分布外和稀疏奖励设置下优于最先进的方法,具有优越的稳定性和泛化能力。

英文摘要

Offline meta-reinforcement learning leverages static datasets to enable agents to generalize to unseen environments by combining offline efficiency with meta-learning adaptability, yet it faces key challenges from context and policy distribution shifts. These issues hinder agents from adapting to online environments, and are further exacerbated under sparse-reward settings. As a result, agents often become trapped in an inherent pattern dilemma, failing to achieve robust generalization. In this work, we propose a novel framework that integrates information-theoretic task representation learning with a Transformer-based stochastic world model. Our approach extracts task-defining latent variables that are invariant to behavior policy, thereby effectively mitigating the context distribution shift. To further handle policy shift and model exploitation, we apply a conservative value penalty to imagination-based rollouts, preventing the policy from exploiting model inaccuracies while maintaining robust adaptation. Extensive evaluations demonstrate that our method outperforms state-of-the-art approaches, with superior stability and generalization under out-of-distribution and sparse-reward settings.

2606.00776 2026-06-02 cs.LG 版本更新

Latent Diffusion Pretraining for Crystal Property Prediction

晶体性质预测的潜在扩散预训练

Shrimon Mukherjee, Kishalay Das, Partha Basuchowdhuri, Pawan Goyal, Niloy Ganguly

发表机构 * University of California, Berkeley(加州大学伯克利分校) Indian Institute of Technology, Bombay(印度班加罗尔印度理工学院)

AI总结 提出基于潜在扩散的预训练框架CrysLDNet,结合变分自编码器和扩散模型,从无标注晶体结构中学习表示,微调后显著提升性质预测性能。

Comments Published in ICML 2026

详情
AI中文摘要

快速准确地预测晶体性质是新材料设计中的核心挑战。图神经网络和基于Transformer的模型由于能够编码晶体中原子的局部结构环境,已成为此任务的有力工具。然而,这些模型需要大量数据,而实践中晶体性质的标注数据稀缺。预训练-微调策略,特别是基于扩散模型的策略,在解决这些限制方面显示出前景。在这项工作中,我们引入了一个新颖的基于潜在扩散的预训练框架CrysLDNet,旨在缓解数据稀缺问题。我们的方法在预训练阶段将变分自编码器(VAE)与扩散模型相结合。VAE编码器将3D晶体结构映射到平滑的潜在空间,在该空间中应用扩散过程。这种潜在扩散预训练使图编码器能够从大规模无标注数据中有效捕获结构和化学语义,然后可以针对特定性质预测任务进行微调。在流行的DFT数据集上进行性质预测的综合实验表明,CrysLDNet显著优于从头训练和预训练的基线,在JARVIS和MP数据集上分别提高了4.26%和4.90%。此外,学习到的表示在稀疏数据条件下保持鲁棒,并且具有足够的表达能力,可以在有限实验数据微调时纠正DFT误差。代码可在https://github.com/shrimonmuke0202/CrysLDNet.git获取。

英文摘要

Fast and accurate prediction of crystal properties is a central challenge in new materials design. Graph neural networks and Transformer-based models have emerged as powerful tools for this task due to their ability to encode the local structural environment of atoms within a crystal. However, these models are data-hungry, and in practice, labeled data for crystal properties are scarce. Pretraining-finetuning strategies, particularly those based on diffusion models, have shown promise in addressing these limitations. In this work, we introduce a novel latent diffusion based pretraining framework, CrysLDNet, designed to mitigate data scarcity. Our approach integrates a Variational Autoencoder (VAE) with a diffusion model during the pretraining stage. The VAE encoder maps 3D crystal structures into a smooth latent space within which the diffusion process is applied. This latent diffusion pretraining enables the graph encoder to effectively capture structural and chemical semantics from large-scale unlabeled data, which can then be finetuned for specific property prediction tasks. Comprehensive experiments on popular DFT datasets for property prediction reveal that CrysLDNet significantly outperforms both training-from-scratch and pretrained baselines, with improvements of 4.26% and 4.90% on the JARVIS and MP datasets, respectively. Additionally, the learned representations remain robust in sparse-data conditions and are expressive enough to correct DFT errors when finetuned with limited experimental data. Code is available at: https://github.com/shrimonmuke0202/CrysLDNet.git.

2606.00771 2026-06-02 cs.LG cs.AI cs.SD 版本更新

Logit Distillation on Manifolds: Mapping by Learning

流形上的对数蒸馏:通过学习进行映射

Yiru Yang, Junling Wang, Nishant Kumar Singh, Luohong Wu, Haoran Yan

发表机构 * University of Zurich(苏黎世大学) ETH Zurich(苏黎世联邦理工学院) Deutsche Bank Securities(德意志银行证券公司)

AI总结 提出一种层和点投影映射方法,将学生和教师表示对齐到高维嵌入空间,结合LoRA注入,在显著减少可训练参数的同时提高词错误率。

详情
AI中文摘要

提高几乎任何机器学习模型性能的一种简单方法是,不训练单个模型,而是训练多个使用不同算法的模型,这些模型对相同数据做出略有不同的预测和错误,从而提高平均预测和鲁棒性。然而,使用整个模型集成进行预测是繁琐且计算成本过高的,无法部署给大量用户,特别是当模型是大型神经网络时。为此,我们引入了一种层和点投影映射,在训练过程中将学生和教师表示映射到对齐的高维嵌入空间。所提出的方法结合LoRA注入,将学生模型的可训练参数减少到教师模型的不到1%,同时与其他蒸馏方法相比,显著提高了词错误率(WER),如消融研究所示。与专家混合不同,我们的方法可以快速并行训练。

英文摘要

A simple way to improve the performance of almost any machine learning model is not to train a single but several models with diverse algorithms which will make slightly distinct kinds of predictions and errors on the same data, and thus improve the average predictions and robustness. However, making predictions using a whole ensemble of models is cumbersome and computationally too expensive to allow deployment to a large number of users, especially if the models are large neural nets. In response to this, we introduce a layer and point wise projection mapping, which maps student and teacher representations into an aligned high-dimensional embedding space during training process. The proposed approach combined with LoRA injection reduces the student model trainable parameters to less than 1% of the teacher model, while significantly improving word error rate (WER) compared to other distillation methods, as demonstrated in ablation studies. Unlike a mixture of experts, our method can be trained rapidly and in parallel.

2606.00761 2026-06-02 cs.LG cs.CL 版本更新

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Confidence-Adaptive SwiGLU for Mixture-of-Experts

Shaohua Li, Xiuchao Sui, Xiaobing Sun, Yuhang Wu, Liangli Zhen, Yong Liu, Rick Siow Mong Goh

发表机构 * Institute of High Performance Computing, Agency for Science, Technology and Research, Singapore(高性能计算研究所,新加坡科技研究局) Shanghai University of Engineering Science, China(上海工程技术大学)

AI总结 提出 Confidence-Aware SwiGLU (κ-SwiGLU),通过根据 token 级路由置信度调整专家门控锐度,在 MoE Transformer 中提升性能且仅增加少量参数和计算开销。

Comments 13 pages, 10 figures

详情
AI中文摘要

SwiGLU 已成为现代 Transformer MLP 中的标准门控激活函数,但其门控锐度——即门控函数的平滑性和选择性——在整个训练过程中通常是固定的。在这项工作中,我们提出了 Confidence-Aware SwiGLU (κ-SwiGLU),这是 SwiGLU 的一种变体,用于混合专家 (MoE) 模型,它根据 token 级路由置信度调整专家门控锐度。具体来说,κ-SwiGLU 将 SiLU 门控锐度系数参数化为路由器 logit 的可学习函数,使每个专家门控单元能够在平滑、广泛激活的门控和尖锐、选择性门控之间进行插值。我们在 FineWeb-Edu 数据集上评估了 κ-SwiGLU,使用了从 8 层到 28 层的 MoE Transformer 模型。在这些设置中,κ-SwiGLU 提高了平均 CORE 性能,同时仅增加了可忽略的参数和少量计算开销,表明置信度感知的门控锐度是改进 MoE MLP 的一种有前景的机制。代码可在 https://github.com/askerlee/kappa-swiglu 获取。

英文摘要

SwiGLU has become a standard gated activation in modern Transformer MLPs, yet its gate sharpness -- the smoothness and selectivity of the gating function -- is typically fixed throughout training. In this work, we propose Confidence-Aware SwiGLU ($κ$-SwiGLU), a variant of SwiGLU for Mixture-of-Experts (MoE) models that adjusts expert gate sharpness according to token-level routing confidence. Specifically, $κ$-SwiGLU parameterizes the SiLU gate sharpness coefficient as a learnable function of the router logit, enabling each expert gate unit to interpolate between smooth, broadly active gating and sharp, selective gating. We evaluate $κ$-SwiGLU on the FineWeb-Edu dataset across MoE Transformer models ranging from 8 to 28 layers. Across these settings, $κ$-SwiGLU improves mean CORE performance while adding negligible parameters and incurring only a small computational overhead, demonstrating that confidence-aware gate sharpness is a promising mechanism for improving MoE MLPs. The code is available at https://github.com/askerlee/kappa-swiglu.

2606.00759 2026-06-02 cs.LG 版本更新

Distributed GNEP Algorithms without Multiplier Sharing and Applications to Multi-Robot Coordination and Contextual Bandit-Based Active Learning

无乘子共享的分布式GNEP算法及其在多机器人协调和基于上下文赌博机的主动学习中的应用

Shao-An Yin

发表机构 * Shao-An Yin(殷少安)

AI总结 提出无需交换拉格朗日乘子的全分布式连续时间算法,收敛到广义纳什均衡(非仅变分均衡),并应用于多机器人协调和基于上下文赌博机的主动学习策略选择。

Comments 136 pages, 14 figures

详情
AI中文摘要

人工智能的最新进展将关注点从经典优化扩展到非合作博弈中的均衡分析。许多此类博弈涉及共享约束,从而产生广义纳什均衡问题(GNEP)。现有的分布式算法通常要求智能体交换拉格朗日乘子以强制执行共识并计算变分GNEs(v-GNEs)。 本文介绍了全分布式连续时间算法,并在不需要交换乘子的情况下建立收敛性,从而减少每次迭代的信息交换,同时提高隐私保护。分析聚焦于具有凸个体约束和线性共享约束的强单调博弈。我还提出了连续时间算法的几种离散化方案。所提出的方法收敛到一般的GNEs,而非仅限于v-GNEs,达到的均衡取决于初始化。通过多机器人协调和放置应用展示了所提方法的有效性。 在第二部分中,本文包括与亚马逊科学家合作进行的研究。现实世界机器学习中最具挑战性的问题之一是标记数据收集,这通常需要大量的人力和成本。主动学习旨在减少这种标记需求。然而,现有的手工主动学习策略通常仅在特定类型的数据集上表现良好,而这些数据集往往是事先未知的。在本文中,我提出使用上下文赌博机自适应地选择最合适的主动学习策略。在公开的外部数据集上展示了所提方法的有效性。

英文摘要

Recent advances in artificial intelligence have expanded the focus from classical optimization to include equilibrium analysis in noncooperative games. Many such games involve shared constraints, leading to Generalized Nash Equilibrium Problems (GNEPs). Existing distributed algorithms typically require agents to exchange Lagrange multipliers to enforce consensus and compute variational-GNEs (v-GNEs). This work introduces fully distributed continuous-time algorithms and establishes convergence without requiring multiplier exchange, thereby reducing information exchange per iteration while improving privacy preservation. The analysis focuses on strongly monotone games with convex individual constraints and linear shared constraints. I also propose several discretization schemes for the continuous-time algorithms. The proposed approach converges to general GNEs, rather than being restricted to v-GNEs, with the attained equilibrium depending on the initialization. The effectiveness of the proposed method is demonstrated through applications in multi-robot coordination and placement. In the second part, this work includes research conducted in collaboration with Amazon scientists. One of the most challenging problems in real-world machine learning is labeled data collection, which typically requires substantial human effort and cost. Active learning aims to reduce this labeling requirement. Existing handcrafted active learning strategies, however, generally perform well only on specific types of datasets, which are often unknown in advance. In this work, I propose using contextual bandits to adaptively select the most suitable active learning strategy. The effectiveness of the proposed approach is demonstrated on publicly available external datasets.

2606.00758 2026-06-02 stat.ML cs.LG eess.SP stat.ME 版本更新

Statistical Testing on Directed Graphs by Surrogate Data Generation

通过替代数据生成的有向图统计检验

Chun Hei Michael Chan, Alexandre Cionca, Dimitri Van De Ville

发表机构 * Neuro-X Institute, Ecole polytechnique fédérale de Lausanne, and the Department of Radiology and Medical Informatics, University of Geneva(Neuro-X研究所,瑞士联邦理工学院洛桑校区,日内瓦大学放射科与医学信息学系)

AI总结 针对有向图,基于图移位算子的特征分解定义宽平稳信号,提出保持协方差结构的替代数据生成框架,用于构建检验统计量的零分布,并在真实数据上验证了优于现有方法。

Comments Submitted to IEEE Transactions on Signal and Information Processing over Networks

详情
AI中文摘要

近年来,图信号处理已成为信号处理与图论交叉领域的一个强大框架,提供了分析定义在节点上的信号同时考虑其由边表示的关系的工具。这些工具已成功应用于各种场景,包括统计假设检验。特别地,针对无向图上的信号,已提出了基于替代生成的非参数方法。然而,这些方法尚未扩展到有向图。在这项工作中,我们首先重新审视有向图上平稳图信号的概念。具体地,通过图移位算子的特征分解,我们定义了有向图宽平稳信号。然后,我们提出一个新的框架来生成替代图信号,该信号在平稳性假设下保持协方差结构。然后,可以从这些替代信号构建检验度量的零分布,并作为经验数据的参考。最后,我们提供了指导性示例和真实数据上的应用,其中我们将我们的框架与现有针对无向图或基于朴素置换的技术进行了性能比较,证明了所提方法的可行性和优越性。

英文摘要

In recent years, graph signal processing has emerged as a powerful framework at the intersection of signal processing and graph theory, providing tools for the analysis of signals defined on nodes while accounting for their relationships represented by edges. These tools have been successfully applied to various settings, including statistical hypothesis testing. In particular, non-parametric approaches based on surrogate generation have been proposed for signals on undirected graphs. However, they are yet to be extended to directed graphs. In this work, we first revisit the notion of stationary graph signals on directed graphs. Specifically, and through the eigendecomposition of the graph shift operator, we define directed graph wide-sense stationary signals. Then, we propose a new framework to generate surrogate graph signals that preserve covariance structure under stationarity assumptions. Null distributions of the test metric can then be constructed from these surrogates and serve as a reference for the empirical data. Finally, we provide guiding examples and an application on real data, in which we compare the performance of our framework with existing techniques for undirected graphs or based on naive permutation, demonstrating feasibility and superiority of the proposed approach.

2606.00755 2026-06-02 cs.CL cs.LG 版本更新

Internalize the Temperature: On-Policy Self-Distillation as Policy Reheater for Reinforcement Learning

内化温度:面向强化学习的同策略自蒸馏作为策略加热器

Xuewei Yang, Jiachen Yu, Jie Wu, Shaoning Sun, Junjie Wang, Yujiu Yang

发表机构 * Tsinghua University(清华大学)

AI总结 提出温度缩放同策略自蒸馏(TS-OPSD),通过将温度探索效应内化到模型参数中,缓解强化学习中的熵崩溃问题,无需外部教师或额外推理成本。

详情
AI中文摘要

基于可验证奖励的强化学习提升了大语言模型的推理能力,但常常遭受熵崩溃,即日益集中的策略减少了轨迹多样性和有用的学习信号。现有补救措施要么约束强化学习目标(如熵正则化),要么在轨迹收集期间调整采样温度,但这些干预措施仍外在于模型参数。我们提出温度缩放同策略自蒸馏(TS-OPSD),一种轻量级的策略加热方法,将温度的探索效应内化到模型参数中。从熵崩溃的强化学习检查点开始,TS-OPSD 通过对模型自身的 logits 应用高温缩放来构建自教师,然后将得到的更平滑分布蒸馏回学生。这种策略加热不需要外部教师、特权数据或额外的推理成本。在 Qwen3-4B-Base 和 Qwen3-8B-Base 上的实验表明,策略加热为继续强化学习提供了比标准继续强化学习和轨迹级温度加热更强的初始化。进一步分析表明,TS-OPSD 主要降低输出锐度,同时保留中间表示、顶级候选集和推理能力。这些结果表明,熵恢复可以作为面向推理的强化学习的一种简单的崩溃后干预措施。

英文摘要

Reinforcement learning from verifiable rewards improves the reasoning ability of large language models, but often suffers from entropy collapse, in which increasingly concentrated policies reduce rollout diversity and useful learning signals. Existing remedies either constrain the RL objective (e.g., entropy regularization) or adjust sampling temperature during rollout collection, but these interventions remain external to the model parameters. We propose Temperature-Scaled On-Policy Self-Distillation (TS-OPSD), a lightweight policy reheating method that internalizes the exploratory effect of temperature into model parameters. Starting from an entropy-collapsed RL checkpoint, TS-OPSD constructs a self-teacher by applying high-temperature scaling to the model's own logits, then distills the resulting smoother distribution back into the student. This policy reheating requires no external teacher, privileged data, or additional inference cost. Experiments on Qwen3-4B-Base and Qwen3-8B-Base show that policy reheating yields a stronger initialization for continued RL than both standard continued RL and rollout-level temperature reheating. Further analyses show that TS-OPSD mainly reduces output sharpness while preserving intermediate representations, top candidate sets, and reasoning capability. These results suggest that entropy restoration can serve as a simple post-collapse intervention for extending reasoning-oriented RL.

2606.00754 2026-06-02 stat.ME cs.AI cs.LG 版本更新

Causal Density Functions

因果密度函数

Sridhar Mahadevan

发表机构 * Adobe Research(Adobe研究院) University of Massachusetts(马萨诸塞大学) Amherst(阿默斯特)

AI总结 提出因果密度函数作为干预分布与观测分布的Radon-Nikodym导数,用于局部密度比衡量因果效应,并给出估计与检验方法。

Comments 25 pages

详情
AI中文摘要

我们引入因果密度函数:Radon-Nikodym导数,它比较干预分布与观测分布,因此作为因果效应的局部密度比。许多因果强度度量在图手术后的整个分布上进行比较,而因果密度函数提供了一个逐点的测度变换对象,可以估计、校准并用于评分有向影响。基本恒等式 \[ \mathbb{E}_{\mathrm{do}}[f(Y)] = \mathbb{E}_{\mathrm{obs}}\!\left[f(Y)ρ(X,Y)\right] \] 使得因果密度直接可检验:如果估计的密度比正确,通过ρ重新加权的观测期望重现干预期望。我们推导了do曲线和有向边得分的实用估计量,将构造与条件作用和干预的Radon-Nikodym/Kan语义联系起来,并在合成和真实扰动基准上评估了所得估计量。

英文摘要

We introduce causal density functions: Radon-Nikodym derivatives that compare interventional laws to observational laws and therefore act as local density ratios for causal effects. Whereas many causal-strength measures compare whole distributions after graph surgery, causal density functions provide a pointwise change-of-measure object that can be estimated, calibrated, and used to score directed influence. The basic identity \[ \mathbb{E}_{\mathrm{do}}[f(Y)] = \mathbb{E}_{\mathrm{obs}}\!\left[f(Y)ρ(X,Y)\right] \] makes causal density directly testable: if the estimated density ratio is correct, observational expectations reweighted by $ρ$ reproduce interventional expectations. We derive practical estimators for do-curves and directed edge scores, relate the construction to Radon-Nikodym/Kan semantics for conditioning and intervention, and evaluate the resulting estimators on synthetic and real perturbation benchmarks.

2606.00752 2026-06-02 cs.LG cs.CE cs.HC 版本更新

A multimodal dataset of photoplethysmography and continuous behavioral responses to ASMR and nature videos

光电容积描记术和ASMR及自然视频连续行为反应的多模态数据集

Tushar Das, Daigo Hozaki, Koushlendra Kumar Singh, Hirohito M. Kondo

发表机构 * Machine Vision & Intelligence Lab, National Institute of Technology Jamshedpur(机器视觉与智能实验室,jamshedpur国家理工学院) School of Psychology, Chukyo University(心理学系,chukyo大学)

AI总结 提出REST-ASMR数据集,包含34名参与者在ASMR和自然视频刺激下的光电容积描记图、时间对齐视听刺激和连续主观标注,验证了刺激有效性、心血管减速,并通过双向长短期记忆模型实现高精度ASMR状态分类。

详情
AI中文摘要

自主感觉经络反应(ASMR)是一种以愉悦刺痛感和心血管减慢为特征的体感现象。然而,ASMR研究因缺乏标准化、开放获取的多模态数据集而受到阻碍。为解决这一限制,我们提出了REST-ASMR(对环境与感觉刺激的反应),这是一个同步的多模态数据集,旨在捕捉ASMR期间的行为报告和生理动态,并以自然放松视频作为对照刺激。该数据集包括来自34名参与者的高分辨率光电容积描记图(PPG)、时间对齐的视听刺激和连续主观标注。技术验证显示高刺激有效性(97%响应率)、显著的刺激特异性受试者间一致性(p < 0.05),以及稳健的PPG衍生的ASMR特异性心血管减速。此外,双向长短期记忆模型成功预测了主观ASMR刺痛状态,在严格的、无泄漏的受试者-视频双独立4折交叉验证下,实现了视频级ASMR与自然分类的完美准确率,以及帧级全局平均准确率75.51%、宏F1分数71.86%和100%自然基线特异性。REST-ASMR为情感计算、多模态研究以及个性化放松相关反应模型的发展提供了密集的时间基础。

英文摘要

Autonomous Sensory Meridian Response (ASMR) is a somatosensory phenomenon characterized by pleasant tingling sensations and cardiovascular slowing. However, ASMR research has been hindered by a dearth of standardized, open-access multimodal datasets. To address this limitation, we present REST-ASMR (Response to Environmental & Sensory Triggers), a synchronized multimodal dataset designed to capture behavioral reports and physiological dynamics during ASMR, with nature-relaxation videos as control stimuli. The dataset includes high-resolution photoplethysmography (PPG), time-aligned audiovisual stimuli, and continuous subjective annotations from 34 participants. Technical validation showed high stimulus efficacy (97% responder rate), significant stimulus-specific inter-subject agreement (p < 0.05), and a robust PPG-derived ASMR-specific cardiovascular deceleration. Additionally, a Bidirectional Long-Short Term Memory model successfully predicted subjective ASMR tingle states, achieving video-level ASMR vs. Nature classification with perfect accuracy and a frame-level global mean accuracy of 75.51%, macro F1-score of 71.86%, and 100% Nature-baseline specificity, under a strict, leakage-free subject-video double-independent 4-fold cross-validation. REST-ASMR constitutes a dense temporal foundation for affective computing, multimodal research, and the development of personalized models of relaxation-related responses.

2606.00746 2026-06-02 cs.CV cs.LG 版本更新

Scaling Parallel Sequence Models to Foundation-Scale Vision Encoders

将并行序列模型扩展到基础规模的视觉编码器

Yitong Jiang, Hongjun Wang, Collin McCarthy, Hanrong Ye, David Wehr, Xinhao Li, Qi Dou, Tianfan Xue, Ka Chun Cheung, Simon See, Wonmin Byeon, Ke Chen, Kai Han, Jinwei Gu, Hongxu Yin, Pavlo Molchanov, Jan Kautz, Sifei Liu

发表机构 * NVIDIA The Chinese University of Hong Kong(香港中文大学) The University of Hong Kong(香港大学) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出C-GSPN,一种基于2D空间传播的基础规模视觉编码器,通过快速CUDA内核、压缩潜在空间传播块和两阶段交叉算子蒸馏,在减少参数的同时提升性能并实现高效推理。

详情
AI中文摘要

视觉基础模型受限于自注意力的二次成本,这限制了可用分辨率并增加了大规模预训练的成本。次二次替代方案如线性注意力和状态空间模型降低了这一成本,但通常将图像序列化为1D令牌流,削弱了对视觉重要的2D空间结构。广义空间传播网络(GSPN)通过线扫描递归直接在2D网格上传播上下文,实现了接近线性的复杂度且无需位置嵌入,但很少用作基础规模的编码器。我们提出C-GSPN,一种基于2D空间传播的基础规模视觉编码器。C-GSPN通过三项改进使该算子实用化:(1)一个快速的GSPN CUDA内核,将每步启动融合为单个warp专用实现,采用共享内存分块、合并访问和紧凑的多通道传播,达到峰值内存带宽的90%以上,运行速度比原始GSPN实现快40-52倍;(2)一个带有融合归一化的压缩潜在空间传播块,将内核级速度转化为块级和模型级效率;(3)一个两阶段交叉算子蒸馏方案,从注意力教师训练新架构,无需从头开始进行基础规模训练的成本。使用6亿图像-文本对进行蒸馏,C-GSPN以少15%的参数匹配同构ViT基线,在ADE20K分割上提升+2.1%,以极少的数据迁移到高分辨率,并在2K分辨率下通过单次无分块推理实现4倍的端到端块加速。

英文摘要

Vision foundation models are bottlenecked by the quadratic cost of self-attention, which limits usable resolution and increases the cost of large-scale pretraining. Subquadratic alternatives such as linear attention and state-space models reduce this cost, but often serialize images into 1D token streams and weaken the 2D spatial structure important for vision. Generalized Spatial Propagation Networks (GSPN) instead propagate context directly on the 2D grid through line-scan recurrences, achieving near-linear complexity without positional embeddings, but have seen little use as foundation-scale encoders. We present C-GSPN, a foundation-scale vision encoder based on 2D spatial propagation. C-GSPN makes the operator practical through three improvements: (1) a fast GSPN CUDA kernel that fuses per-step launches into a single warp-specialized implementation with shared-memory tiling, coalesced access, and a compact multi-channel propagation, reaching over 90% of peak memory bandwidth and running up to 40--52x faster than the original GSPN implementation; (2) a compressed latent-space propagation block with fused normalization, which turns kernel-level speed into block- and model-level efficiency; and (3) a two-stage cross-operator distillation recipe that trains the new architecture from an attention teacher without the cost of from-scratch foundation-scale training. Distilled with 600M image-text pairs, C-GSPN matches an isomorphic ViT baseline with 15% fewer parameters, improves ADE20K segmentation by +2.1%, transfers to high resolution with a fraction of the data needed from scratch, and delivers a 4x end-to-end block speedup at 2K with single-pass, tiling-free inference.

2606.00741 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Quantum Tunneling-Aware Machine Learning: Physics-Derived Noise Models for Robust Deployment

量子隧穿感知机器学习:面向鲁棒部署的物理衍生噪声模型

Uiwon Hwang, Jaeho Hwang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Human-Centered Artificial Intelligence Research Institute(以人为本的人工智能研究院)

AI总结 本文提出量子隧穿感知机器学习(QTAML),通过WKB近似推导部署时的权重误差分布,并设计隧穿感知补偿(TAC)算法,在无需重训练和标签的情况下,以较低ECC开销恢复模型精度。

详情
AI中文摘要

晶体管缩放正接近量子力学极限,因为薄栅氧化物通过量子隧穿引起电子泄漏。与传统数字系统不同,只要错误结构被正确建模,AI推理可以容忍此类错误。在本文中,我们引入量子隧穿感知机器学习(QTAML)。我们使用Wentzel-Kramers-Brillouin(WKB)近似从第一性原理推导部署时的权重误差分布,并表明它具有通用高斯噪声模型所忽略的结构:精确的仿射均值漂移、由最高有效位主导的逐位方差层级,以及依赖于$\|W_\ell\|_\infty$和训练网络Jacobian的逐层依赖性。我们将这三个结构属性打包成一个单一的部署时算法——隧穿感知补偿(TAC),该算法结合了闭式均值校正和基于WKB方差分解的最优逐层自适应比特预算分配。在$p_\mathrm{flip}=0.10$的四个卷积架构和$p_\mathrm{flip}=0.05$的一个Transformer编码器上,TAC达到了干净精度的95%,同时ECC开销比从相同物理导出的自然基线Uniform-MSP低3.4倍到33.6倍。闭式饱和比$ ho^*$预先预测了这些增益,在异构架构上,WKB导出的评分在小预算下比基于幅度的分配高出多达24个百分点。该算法无需重训练、无需标签,且无推理时开销。我们还验证了WKB导出的分布定理达到蒙特卡洛精度。这些结果将WKB隧穿物理与噪声感知深度学习联系起来,并为超越传统缩放极限的硬件-软件协同设计提供了一条有原则的路径。

英文摘要

Transistor scaling is approaching a quantum-mechanical limit, as thin gate oxides induce electron leakage through quantum tunneling. Unlike conventional digital systems, AI inference can tolerate such errors provided their structure is modeled correctly. In this paper, we introduce quantum tunneling-aware machine learning (QTAML). We derive the deployment-time weight-error distribution from first principles using the Wentzel-Kramers-Brillouin (WKB) approximation and show that it has structure that generic Gaussian noise models miss: an exact affine mean drift, a per-bit variance hierarchy dominated by the most-significant bit, and a per-layer dependence on $\|W_\ell\|_\infty$ and the trained-network Jacobian. We package these three structural properties into a single deployment-time algorithm, Tunneling-Aware Compensation (TAC), that combines closed-form mean correction with an optimal layer-adaptive bit-budget allocation derived from the WKB variance decomposition. Across four convolutional architectures at $p_\mathrm{flip}$=0.10 and a transformer encoder at $p_\mathrm{flip}$=0.05, TAC reaches $95\%$ of clean accuracy with 3.4$\times$ to 33.6$\times$ less ECC overhead than Uniform-MSP, the natural baseline derived from the same physics. The closed-form saturation ratio $ρ^*$ predicts these gains in advance, and on heterogeneous architectures WKB-derived scoring outperforms magnitude-based allocation by up to 24 percentage points at small budgets. The algorithm requires no retraining, no labels, and no inference-time overhead. We also verify the WKB-derived distributional theorems to Monte Carlo precision. These results connect WKB tunneling physics with noise-aware deep learning and suggest a principled path toward hardware--software co-design beyond conventional scaling limits.

2606.00739 2026-06-02 cs.LG 版本更新

Score $\times$ Decoder: A Unified View of Unsupervised Inference-Time Scaling for Hallucination Mitigation

Score × Decoder:无监督推理时缩放缓解幻觉的统一视角

Yun-Chen Cheng, Che-Yu Lin, Cheng-Lin Yang

发表机构 * CyCraft AI Lab, Taiwan(CyCraft人工智能实验室,台湾)

AI总结 本文提出Score×Decoder框架,通过配对四种内在分数(困惑度、对比度、幂分布似然、自验证)与三种解码族(优化、采样、共识),在无监督条件下选择最佳组合以缓解大语言模型幻觉。

详情
AI中文摘要

大型语言模型即使答案在其参数范围内也会产生幻觉。虽然推理时缩放可以揭示这种潜在知识,但最有效的方法需要监督:一个训练好的验证器或奖励模型。我们探讨仅使用基础语言模型可以做什么:哪个内在信号最能识别正确输出,以及应该如何解码?我们将此视为一个分数×解码器网格,将四种分数(困惑度、对比度、幂分布似然和自验证)与三种解码族(优化、采样、共识)配对,并在MATH500上使用基础版和指令调优版Qwen3-1.7B评估每个单元格。虽然自验证(提示模型判断自己的答案,并通过无训练虚拟思考前缀增强)在大多数设置中效果良好,但没有一个分数具有固定质量:其价值取决于使用它的解码器和模型能力。当没有监督可用时,必须同时选择分数和解码族。

英文摘要

Large language models hallucinate even when the answer lies within their parameters. While inference-time scaling can surface this latent knowledge, the most effective methods require supervision: a trained verifier or reward model. We ask what can be done with only a base language model: which intrinsic signal best identifies correct outputs, and how should it be decoded? We cast this as a score~$\times$~decoder grid pairing four scores (perplexity, contrastive, power-distribution likelihood, and self-verification) with three decoding families (optimization, sampling, consensus), and evaluate every cell on MATH500 with the base and instruction-tuned Qwen3-1.7B. While self-verification, which prompts the model to judge its own answer and is sharpened by a training-free virtual-thinking prefix, works well in most settings, no score has a fixed quality: its value depends on the decoder that consumes it and on model capability. When no supervision is available, the score and the decoding family must be chosen together.

2606.00738 2026-06-02 cs.LG cs.AI cs.CV 版本更新

SORA: Free Second-Order Attacks in Fast Adversarial Training

SORA:快速对抗训练中的自由二阶攻击

Mazdak Teymourian, Ramtin Moslemi, Farzan Rahmani, Mohammad Hossein Rohban

发表机构 * Department of Computer Engineering, Sharif University of Technology, Tehran, Iran(谢赫大学计算机工程系)

AI总结 针对快速对抗训练中的灾难性过拟合问题,提出通过扰动变异性和梯度对齐指标PertAlign来预测并防止过拟合,并设计自适应步长方法SORA,实现最优鲁棒性和干净准确率。

Comments Accepted at ICML 2026

详情
AI中文摘要

对抗训练是对抗性样本的主要防御手段,但在高效的单步变体中常常遭受灾难性过拟合,即尽管单步性能很高,但对多步攻击的鲁棒性却崩溃。我们通过两个贡献来解决这种失效模式。首先,我们形式化了epsilon过拟合(EO),这是一种固定扰动幅度和方向加剧CO的视角,并表明引入扰动变异性可以显著提高不同架构和数据集上的鲁棒泛化能力。其次,我们提出了PertAlign(扰动对齐),这是一种理论上合理、计算开销可忽略的指标,通过测量攻击阶段的梯度对齐来预测CO的发生。利用这些见解,我们引入了SORA,一种自适应步长的AT方法,它根据损失曲面几何动态调整扰动。SORA始终能防止CO,实现最先进的鲁棒性和干净准确率,并使用一组固定的超参数在数据集和架构上泛化,这对于快速AT的适用性至关重要。在不同数据集和架构上的大量实验表明,SORA在提供更高干净准确率和卓越效率的同时,匹配或超越了先前方法的鲁棒性。代码可在https://github.com/SecondOrderAT/SORA获取。

英文摘要

Adversarial Training (AT) is a leading defense against adversarial examples but often suffers from Catastrophic Overfitting (CO) in efficient single-step variants, where robustness to multi-step attacks collapses despite high single-step performance. We address this failure mode with two contributions. First, we formalize Epsilon Overfitting (EO), a perspective in which fixed perturbation magnitudes and directions exacerbate CO, and show that introducing perturbation variability significantly improves robust generalization across different architectures and datasets. Second, we propose PertAlign (Perturbation Alignment), a theoretically grounded, computationally negligible metric that predicts CO onset by measuring gradient alignment across attack stages. Leveraging these insights, we introduce SORA, an adaptive step-size AT method that dynamically adjusts perturbations based on loss surface geometry. SORA consistently prevents CO, achieves state-of-the-art robustness and clean accuracy, and generalizes across datasets and architectures using a single fixed set of hyperparameters, which is essential for applicability in fast AT. Extensive experiments on diverse datasets and architectures show that SORA matches or surpasses the robustness of prior methods while delivering higher clean accuracy and superior efficiency. Code is available at https://github.com/SecondOrderAT/SORA.

2606.00735 2026-06-02 cs.DC cs.LG 版本更新

ViBE: Co-Optimizing Workload Skew and Hardware Variability for MoE Serving

ViBE: 针对MoE服务的工作负载偏斜与硬件变异性协同优化

Seokjin Go, Marko Scrbak, Ephrem Wu, Srilatha Manne, Divya Mahajan

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Advanced Micro Devices, Inc.(先进微器件公司)

AI总结 提出ViBE框架,通过感知硬件的专家放置方法,结合GPU性能建模与专家激活分析,最小化分布式MoE推理中的执行时间不平衡,显著提升SLO达标率并降低P90 TTFT。

详情
AI中文摘要

在分布式混合专家(MoE)推理中,依赖于输入的令牌路由与GPU性能变异性相互作用,在同步执行下产生持续的掉队者,其中最慢的GPU决定层延迟。这种性能变异性是现代加速器固有的:制造差异、功率限制和热条件在名义上相同的GPU之间引入了可测量的执行时间差异。核心挑战在于MoE执行时间不平衡源于工作负载偏斜和硬件不对称的相互作用。令牌路由产生不均匀且逐层变化的专家负载,而GPU吞吐量取决于设备特定的操作特性和工作负载强度。先前的工作缓解了路由偏斜,但假设硬件同质,优化令牌平衡而非执行延迟。因此,即使平衡的令牌分配也可能留下硬件引起的掉队者未解决。为此,我们提出了变异性感知的专家分箱(ViBE),一种硬件感知的专家放置框架,旨在最小化跨GPU的执行时间不平衡。ViBE结合了每GPU性能建模与专家激活分析,将高负载专家分配给更快的设备,低负载专家分配给较慢的设备,从而在不修改模型语义或硬件的情况下减少层级别的掉队者。由于工作负载特征和有效GPU吞吐量可能随服务条件变化,ViBE支持在负载/性能漂移下进行轻量级重新校准,以在需要时刷新其路由和性能估计。结果表明,ViBE持续减少执行时间不平衡,并将SLO达标率提高14%,同时将P90 TTFT降低高达45%。我们进一步表明,硬件变异性的影响在规模扩大时增加,使得变异性感知的放置对于高效、高利用率的LLM服务至关重要。

英文摘要

In distributed Mixture-of-Experts (MoE) inference, input-dependent token routing interacts with GPU performance variability to create persistent stragglers under synchronized execution, where the slowest GPU determines layer latency. This performance variability is inherent to modern accelerators: manufacturing variation, power limits, and thermal conditions introduce measurable execution-time differences across nominally identical GPUs. The core challenge is that MoE execution-time imbalance arises from the interaction of workload skew and hardware asymmetry. Token routing produces uneven and layer-varying expert loads, while GPU throughput depends on device-specific operating characteristics and workload intensity. Prior work mitigates routing skew but assumes homogeneous hardware, optimizing token balance rather than execution latency. As a result, even balanced token assignments can leave hardware-induced stragglers unaddressed. Thus, we propose Variability-Informed Binning of Experts (ViBE), a hardware-aware expert placement framework that minimizes execution-time imbalance across GPUs. ViBE combines per-GPU performance modeling with expert activation profiling to assign high-load experts to faster devices and low-load experts to slower ones, reducing layer-level stragglers without modifying model semantics or hardware. Because both workload characteristics and effective GPU throughput can shift across serving conditions, ViBE supports lightweight recalibration under workload/performance drift to refresh its routing and performance estimates when needed. Results show that ViBE consistently reduces execution-time imbalance and improves SLO attainment by 14%, while lowering P90 TTFT by up to 45%. We further show that the impact of hardware variability increases at scale, making variability-aware placement important for efficient, high-utilization LLM serving.

2606.00717 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Multi-Agent Conformal Prediction with Personalized Statistical Validity

具有个性化统计有效性的多智能体共形预测

Martin V. Vejling, Christophe A. N. Biscio, Adrien Mazoyer, Petar Popovski, Shashi Raj Pandey

发表机构 * Department of Electronic Systems(电子系统系) Aalborg University(奥尔堡大学) Department of Mathematical Sciences(数学科学系) Institut de Mathématiques de Toulouse(图卢兹数学研究所) Université de Toulouse(图卢兹大学)

AI总结 提出个性化联邦加权共形预测框架,通过局部密度比加权和加权分位数聚合,在保护隐私的同时纠正数据异质性,为每个参与智能体提供渐近有效的边际和校准条件覆盖保证。

详情
AI中文摘要

不确定性量化在高风险机器学习任务中至关重要。然而,共形预测这一原则性解决方案在局部校准数据有限、隐私约束和数据异质性下面临挑战。在多智能体设置中,现有工作无法同时令人满意地解决这些挑战,其保证要么限于智能体间的平均值,要么在异质性设置中失去有效性。因此,我们提出个性化联邦加权共形预测(PFWCP),该框架结合局部密度比加权与加权分位数聚合,以在保护隐私的同时纠正异质性。该方法为每个参与智能体提供渐近有效的边际和校准条件覆盖保证,并支持一次性通信协议。理论分析呈现了对覆盖方差的调整,该调整由有效样本量表达式控制,这在加权共形预测的背景下是必要的,并且在合成和真实数据集上的实验表明,与最先进的联邦共形基线相比,校准质量有所提高。

英文摘要

Uncertainty quantification is essential in high-stakes machine learning tasks. However, one of the principled solutions, conformal prediction, faces challenges under limited local calibration data, privacy constraints, and data heterogeneity. In multi-agent settings, existing works do not simultaneously and satisfactorily address these challenges with guarantees either limited to averages across agents or losing validity in heterogeneous settings. Hence, we propose personalized federated weighted conformal prediction (PFWCP), a framework that combines local density ratio weighting with weighted quantile aggregation to correct for heterogeneity while preserving privacy. The method yields asymptotically valid marginal and calibration-conditional coverage guarantees for each participating agent and supports protocols with one-shot communication. Theoretical analysis presents an adjustment to the coverage variance, governed by an effective sample size expression, which is necessary in the context of weighted conformal prediction, and experiments on synthetic and real datasets show improved calibration quality over state-of-the-art federated conformal baselines.

2606.00716 2026-06-02 cs.LG eess.SP 版本更新

Graph Transfer Learning via Shared Latent Geometry: Theory and Applications

基于共享潜在几何的图迁移学习:理论与应用

Tong Wu, Andrew Campbell, Anna Scaglione

发表机构 * University of Central Florida, USA(佛罗里达中央大学) Cornell University, USA(康奈尔大学)

AI总结 提出一种非对称双路径架构,通过教师编码器从高保真模拟器学习算子多项式特征,学生编码器从稀疏数据学习相同潜在几何,实现零样本迁移并给出可证明的误差界。

详情
AI中文摘要

在工程物理系统的推理与控制中,部署时面临高昂的物理代价:状态估计器、逆问题求解器、模型预测控制器、调度器和观测器通常没有闭式解,必须针对每个实例重新求解数值优化问题,且每次需重新提供算子。物理信息学习将这一代价转移到训练阶段,但使用单一编码器路径,其潜在几何在微调时会退化,且无法提供定量迁移保证。我们提出一种非对称双路径架构来解决这两个问题。教师编码器从高保真模拟器中获取特权密集状态,并通过在谱扰动下稳定的算子多项式特征表示系统;学生编码器从稀疏现场数据和算子描述符学习相同的潜在几何。部署时丢弃教师,冻结的学生编码器通过单次前向传播运行,并附带迁移证书。该设计关联了特权信息学习、知识蒸馏和跨模态蒸馏,但目标是跨实例迁移而非固定实例预测:拓扑和算子可以变化,而潜在任务不变。我们通过潜在律之间的Wasserstein距离建立了充分且近乎必要的迁移条件,得到了零样本误差界,并开发了一种在覆盖不完全时主动扩展的有限样本认证协议。该框架适用于任何具有可报告谱的算子的系统。在电力系统估计中,它实现了对100种未见拓扑的零样本迁移,95%的证书通过率,与拓扑感知的牛顿-拉夫逊方法相当的精度,以及亚毫秒级推理。这些结果表明,非对称路径加上算子锚定的潜在几何为认证的零样本推理与控制奠定了基础。

英文摘要

Inference and control in engineered physical systems pay a heavy physics cost at deployment: state estimators, inverse-problem solvers, model-predictive controllers, schedulers, and observers are often not closed-form and must re-solve a numerical optimization per instance, with the operator re-supplied each time. Physics-informed learning moves this cost to training, but uses a single encoder pathway whose latent geometry de-learns under fine-tuning and admits no quantitative transfer guarantee. We propose an asymmetric two-pathway architecture that resolves both issues. A teacher encoder consumes privileged dense states from a high-fidelity simulator and represents the system through operator-polynomial features stable under spectral perturbation; a student encoder learns the same latent geometry from sparse field data and operator descriptors. At deployment the teacher is discarded, and the frozen student runs in a single forward pass with a transfer certificate. The design connects to privileged-information learning, knowledge distillation, and cross-modal distillation, but targets cross-instance transfer rather than fixed-instance prediction: topology and operator may change, while the latent task does not. We establish sufficient and near-necessary transfer conditions via Wasserstein proximity between latent laws, yielding a zero-shot error bound, and develop a finite-sample certification protocol with active expansion when coverage is incomplete. The framework applies wherever a system admits an operator with reportable spectrum. On power-system estimation, it achieves zero-shot transfer to 100 unseen topologies, a 95% certificate pass rate, accuracy competitive with topology-aware Newton--Raphson, and sub-millisecond inference. These results suggest asymmetric pathways plus operator-anchored latent geometry provide a foundation for certified zero-shot inference and control.

2606.00708 2026-06-02 cs.AI cs.LG 版本更新

MOSAIC: Modular Orchestration for Structured Agentic Intelligence and Composition

MOSAIC:结构化智能体智能与组合的模块化编排

Yifan Bao, Xinyu Xi, Xinyu Liu, Wen Ge, Lei Jiang, Kevin Zhang, Raad Khraishi, Yihao Ang, Anthony K. H. Tung, Lukasz Szpruch, Hao Ni

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系) University College London(伦敦大学学院) University of Edinburgh(爱丁堡大学) Data & Analytics, Digital X(Digital X 数据与分析部) Alan Turing Institute(艾伦·图灵研究所)

AI总结 提出MOSAIC框架,通过结构化智能体编排、记忆驱动的模型选择和蓝图构建,将自动化数据科学转化为可验证、可复用的模型选择问题,在金融时间序列任务中优于AutoML和智能体基线。

详情
AI中文摘要

自动化数据科学是一个结构化的模型选择问题。解决方案必须为任务选择数据转换、特征表示、架构、训练过程、评估协议和优化策略。AutoML系统自动化了该过程的部分环节,但通常是在预定义的流水线、模型和超参数空间内搜索。基于LLM的智能体通过检索、代码生成和执行反馈提供了更大的灵活性,但其建模决策通常是非结构化的、难以验证且难以复用。我们引入了 extsc{MOSAIC}(结构化智能体智能与组合的模块化编排),一个用于记忆驱动的模型选择和工作流构建的结构化智能体框架。给定任务和数据集, extsc{MOSAIC}构建语义任务画像,检索先前的案例和源代码模块,并构建蓝图:一个指定所选建模组件、组合、接口约束和执行需求的中间表示。该蓝图将模型选择转化为分阶段、上下文驱动的搜索,并将基于LLM的代码生成建立在检索证据而非无约束合成之上。候选模型通过执行验证,并使用诊断反馈、训练轨迹、任务指标以及一个失败感知的强化学习策略进行优化。我们在金融时间序列预测和生成任务上实例化了 extsc{MOSAIC},其中模型必须满足预测准确性、分布保真度、执行可靠性以及下游金融标准(如风险和尾部行为)。与AutoML和智能体基线的实验表明, extsc{MOSAIC}提高了任务性能、执行成功率和决策可追溯性,证明了将自动化数据科学视为结构化、可复用且基于执行的模型选择的价值。

英文摘要

Automated data science is a structured model-selection problem. A solution must choose data transformations, feature representations, architecture, training procedure, evaluation protocol, and refinement strategy for a task. AutoML systems automate parts of this process, but typically search within predefined pipeline, model, and hyperparameter spaces. LLM-based agents offer greater flexibility through retrieval, code generation, and execution feedback, yet their modelling decisions are often unstructured, difficult to verify, and hard to reuse. We introduce \textsc{MOSAIC} (Modular Orchestration for Structured Agentic Intelligence and Composition), a structured agentic framework for memory-grounded model selection and workflow construction. Given a task and dataset, \textsc{MOSAIC} builds a semantic task profile, retrieves prior cases and source-code modules, and constructs a blueprint: an intermediate representation specifying selected modelling components, composition, interface constraints, and execution requirements. This blueprint turns model selection into a staged, context-grounded search and grounds LLM-based code generation in retrieved evidence rather than unconstrained synthesis. Candidate models are validated by execution and refined using diagnostic feedback, training traces, task metrics, and a failure-aware reinforcement learning policy. We instantiate \textsc{MOSAIC} on financial time-series forecasting and generation, where models must satisfy predictive accuracy, distributional fidelity, execution reliability, and downstream financial criteria such as risk and tail behaviour. Experiments against AutoML and agentic baselines show that \textsc{MOSAIC} improves task performance, execution success, and decision traceability, demonstrating the value of treating automated data science as structured, reusable, and execution-grounded model selection.

2606.00703 2026-06-02 cs.IT cs.AI cs.LG math.IT 版本更新

Information-Theoretic Lower Bounds for Bit-Constrained Stochastic Optimization via a Reduction to Compressed Gaussian Mean Estimation

通过约化到压缩高斯均值估计的比特约束随机优化的信息论下界

Munsik Kim

AI总结 本文通过将强凸二次族优化问题精确约化为交互式压缩高斯均值估计问题,推导出比特约束随机优化的无条件下界,并给出近乎匹配的可实现性结果。

详情
AI中文摘要

低精度预训练(FP8, MXFP4, NVFP4)现已成为前沿语言模型的标准,但文献几乎完全是可实现性——算法和经验缩放定律——没有匹配的信息论可能性的刻画。我们研究B比特量化随机一阶预言机:优化器与T轮交互,每轮接收其随机梯度的B比特自适应公共硬币描述。我们的主要贡献是将强凸二次族优化精确约化为交互式压缩高斯均值估计——在B比特预言机下,查询不携带信息,因此优化完全坍缩为顺序分布式估计问题。这产生了两个无条件下界:通信界TB = Omega(d)和统计界T = Omega(sigma^2 d / eps^2),以及尖锐的乘积形式界T = Omega((sigma^2 d / eps^2) max{1, d/B})。乘积形式也是无条件的:B比特转录本最多携带关于均值的O(TB / sigma^2) Fisher迹,因此比特而非维度限制了可恢复信息,结合多元van Trees不等式直接给出该界,无需有界似然比截断。我们给出了一个近乎匹配的可实现性结果,在有限动态范围预言机下精确计算每轮比特,紧至对数因子;下界针对真正高斯(无界)梯度,而缩小这一预言机差距留待未来。顺序率失真视角将约化扩展到相关和漂移预言机,并修正了先前的猜想:正噪声相关性将界提高(1+rho)/(1-rho)倍而非放松。这些界为任何低位梯度路径提供了信息论基线,而非关于已部署FP4系统的最优性声明。

英文摘要

Low-precision pretraining (FP8, MXFP4, NVFP4) is now standard for frontier language models, yet the literature is almost entirely achievability -- algorithms and empirical scaling laws -- with no matching characterization of what is information-theoretically possible. We study a B-bit quantized stochastic first-order oracle: an optimizer interacts for T rounds and receives, each round, a B-bit adaptive public-coin description of its stochastic gradient. Our main contribution is an exact reduction from optimizing a strongly convex quadratic family to interactively compressed Gaussian mean estimation -- under the B-bit oracle the query carries no information, so optimization collapses exactly onto a sequential distributed-estimation problem. This yields two unconditional lower bounds, a communication bound TB = Omega(d) and a statistical bound T = Omega(sigma^2 d / eps^2), and the sharp product-form bound T = Omega((sigma^2 d / eps^2) max{1, d/B}). The product form is also unconditional: a B-bit transcript carries at most O(TB / sigma^2) of Fisher trace about the mean, so bits rather than dimension limit the recoverable information, and combined with the multivariate van Trees inequality this gives the bound directly, without bounded-likelihood-ratio truncation. We give a near-matching achievability result with exact per-round bit accounting under a bounded-dynamic-range oracle, tight up to a logarithmic factor; the lower bound is for truly Gaussian (unbounded) gradients, and closing this oracle gap is left open. A sequential rate-distortion perspective extends the reduction to correlated and drifting oracles and corrects an earlier conjecture: positive noise correlation raises the bound by (1+rho)/(1-rho) rather than relaxing it. The bounds give an information-theoretic baseline for any low-bit gradient path, not an optimality claim about deployed FP4 systems.

2606.00700 2026-06-02 cs.LG cs.AI 版本更新

COPF: An Online Framework for Deployment-Stable Counterfactual Fairness in Evolving Graphs

COPF:演化图中部署稳定的反事实公平性在线框架

Sheng'en Li, Dongmian Zou

发表机构 * Shanghai Jiao Tong University(上海交通大学)

AI总结 针对演化图上的在线链接推荐,提出COPF框架,通过反事实暴露机会差距、显式探索和残差不可区分性审计,实现部署稳定的公平性监控与控制。

Comments Accepted at ICML 2026

详情
AI中文摘要

演化图上的在线链接推荐是表演性的:通过选择向用户展示哪些候选链接,系统会改变哪些链接形成以及后续观察到的反馈。因此,来自记录结果的公平性估计可能具有误导性,并且在推荐策略更新后部署时可能会漂移。我们引入了COPF(反事实在线表演性公平性),这是一个用于在线链接推荐中部署稳定的公平性监控和控制的决策层框架。COPF (i) 定义了暴露(展示 vs. 未展示)反事实上的群体级机会差距,(ii) 通过显式探索和记录每个候选被展示的概率(倾向性)使其可估计,以及(iii) 使用图感知双重稳健(GA-DR)估计器,在可配置的审计器族上通过残差结果不可区分性(OI)审计和控制公平性。我们提供了一个噪声传递定理,表明在时间混合和有界局部干扰下,估计的GA-DR残差上的残差OI意味着暴露反事实群体差距的界限,并实例化了一个在线多校准审计器以及一个原始-对偶控制器。在两个TGB流和一个受控的合成二分图流上的实验表明,COPF减少了暴露反事实群体差距的最坏情况峰值,同时对排序效用的影响较小。我们的代码可在 https://github.com/lsnnnnnnnn/COPF 获取。

英文摘要

Online link recommendation on evolving graphs is performative: by choosing which candidate links to show users, the system changes which links form and what feedback it later observes. Consequently, fairness estimates from logged outcomes can be misleading and may drift after deployment when the recommendation policy is updated. We introduce COPF (Counterfactual Online Performative Fairness), a decision-layer framework for deployment-stable fairness monitoring and control in online link recommendation. COPF (i) defines group-level opportunity gaps over exposure (shown vs. not shown) counterfactuals, (ii) makes them estimable by explicit exploration and by logging the probability (propensity) that each candidate is shown, and (iii) audits and controls fairness using residual outcome indistinguishability (OI) over a configurable auditor family with graph-aware doubly robust (GA-DR) estimators. We provide a noisy transfer theorem showing that Residual-OI on estimated GA-DR residuals implies bounds on exposure-counterfactual group gaps under temporal mixing and bounded local interference, and we instantiate an online multicalibration auditor together with a primal-dual controller. Experiments on two TGB streams and a controlled synthetic bipartite stream show that COPF reduces worst-case spikes in exposure-counterfactual group disparities with modest impact on ranking utility. Our code is available at https://github.com/lsnnnnnnnn/COPF.

2606.00690 2026-06-02 cs.LG 版本更新

DistMatch: Adaptive Binning via Distribution Matching for Robust Sequential Conformal Prediction

DistMatch: 通过分布匹配的自适应分箱用于鲁棒序列共形预测

Enver Menadjiev, Jihyeon Seong, Jisu Yeo, Jaesik Choi

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学)

AI总结 提出DistMatch方法,利用Kolmogorov-Smirnov统计量递归划分残差以实现近似可交换性,结合在线分位数回归进行局部自适应推理,提升序列共形预测对分布偏移的鲁棒性。

Comments ICML 2026 (34 pages, 12 figures, 16 tables)

详情
AI中文摘要

序列共形预测在残差可交换性假设下提供有效的不确定性量化。然而,由于时间依赖性和分布偏移,该假设在现实时间序列中常被违反。尽管近期方法尝试通过重新加权来近似可交换性,但确定最优权重仍是一个开放挑战。为解决此局限,我们提出DistMatch,一种基于分箱的方法,利用Kolmogorov-Smirnov统计量在二叉树中递归划分残差。我们从理论上证明,这种划分诱导出近似可交换的叶子节点,从而避免重新加权的需要。通过在每个叶子节点内应用在线更新的分位数回归,DistMatch实现了局部自适应推理,并提高了对分布偏移的鲁棒性。大量实验表明,DistMatch优于现有序列共形预测方法。

英文摘要

Sequential conformal prediction (CP) provides valid uncertainty quantification under the assumption of residual exchangeability. However, this assumption is often violated in real-world time series due to temporal dependencies and distributional shifts. While recent methods attempt to approximate exchangeability through reweighting, identifying optimal weights remains an open challenge. To address this limitation, we propose DistMatch, a binning-based method that recursively partitions residuals within a binary tree using the Kolmogorov-Smirnov (KS) statistic. We theoretically show that this partitioning induces approximately exchangeable leaves, thereby avoiding the need for reweighting. By applying quantile regression with online updates within each leaf, DistMatch enables locally adaptive inference and improves robustness to distributional shifts. Extensive experiments demonstrate that DistMatch outperforms existing sequential CP methods.

2606.00686 2026-06-02 cs.LG 版本更新

Dialectics of Alignment: Harnessing Unsafe Knowledge for Dynamic Safety Routing

对齐的辩证法:利用不安全知识实现动态安全路由

Maryam Hashemzadeh, Jerry Huang, Minseon Kim, Marc-Alexandre Côté, Sarath Chandar

发表机构 * Chandar Research Lab(Chandar研究实验室) Mila – Quebec AI Institute(魁北克AI研究所) Université de Montréal(蒙特利尔大学) Microsoft Research(微软研究院) Polytechnique Montréal(蒙特利尔理工学院) Canada CIFAR AI Chair(加拿大CIFAR人工智能主席)

AI总结 提出SafeMoE框架,通过混合专家模型将不安全知识隔离到领域特定的低秩适配器中,并训练轻量级门控网络动态路由这些专家,在保持安全性的同时生成信息丰富的响应。

详情
AI中文摘要

大语言模型(LLM)对齐的主流范式通过擦除、过滤不安全数据或训练模型严格拒绝有害提示来运作。虽然这种方法能有效降低即时毒性,但根本上限制了模型的认识论范围,导致系统过度谨慎,对敏感但良性的查询输出无信息量的全面拒绝。在这项工作中,我们挑战了不安全数据必须丢弃的正统观念。我们提出了一种对齐的辩证方法,认为不安全数据编码了丰富的、领域特定的知识,对于细致、安全且信息丰富的生成至关重要。为实现这一点,我们引入了SafeMoE,一个混合专家(MoE)框架,将不安全知识隔离到仅在有害语料上训练的领域特定低秩适配器(LoRA专家)中。为了从这些不安全基元中综合安全性,我们使用最小、高度精选的安全信息响应集训练一个轻量级门控网络。在推理时,该路由器动态编排不安全专家,有效引导生成轨迹以利用其深层领域知识,同时严格执行安全约束。在严格的安全基准上的广泛实证评估表明,SafeMoE不仅更安全,安全响应率相对提高了20%以上(绝对增益超过15%),而且在安全性和危害性至关重要时能生成更具信息量的响应。此外,路由机制在未见领域和更广泛的安全任务上表现出强大的零样本泛化能力,无需领域特定监督。我们的发现表明对齐的范式转变:真正的安全不需要掩盖不安全知识,而是需要其受控整合。

英文摘要

The prevailing paradigm in large language model (LLM) alignment operates via erasure, filtering unsafe data or training models to strictly refuse harmful prompts. While effective at reducing immediate toxicity, this approach fundamentally constricts the model's epistemological scope, resulting in over-cautious systems that output uninformative blanket refusals to sensitive yet benign queries. In this work, we challenge the orthodoxy that unsafe data must be discarded. We propose a dialectical approach to alignment, positing that unsafe data encodes rich, domain specific knowledge critical for nuanced, safe, and informative generation. To operationalize this, we introduce SafeMoE, a Mixture-of-Experts (MoE) framework that isolates unsafe knowledge into domain-specific Low-Rank Adapters (LoRA experts) trained exclusively on harmful corpora. To synthesize safety from these unsafe primitives, we train a lightweight gating network using a minimal, highly curated set of safe-informative responses. During inference, this router dynamically orchestrates the unsafe experts, effectively steering the generation trajectory to harness their deep domain knowledge while strictly enforcing safety constraints. Extensive empirical evaluations across stringent safety benchmarks demonstrate that SafeMoE is not only safer, achieving over a 20% relative improvement in safe response rate (more than a 15% absolute gain), but also produces more informative responses when safety and harmfulness are of paramount concern. Furthermore, the routing mechanism exhibits strong zero-shot generalization to unseen domains and broader safety tasks without domain-specific supervision. Our findings suggest a paradigm shift in alignment: true safety requires not the masking of unsafe knowledge, but its controlled integration.

2606.00685 2026-06-02 cs.LG 版本更新

Prior-Guided Multi-Omic Transformers for Single-Cell Gene Regulatory Network Inference

先验引导的多组学Transformer用于单细胞基因调控网络推断

Tianyang Xu, Tianci Liu, Niraj Rayamajhi, Ryan Patrick, Kranthi Varala, Ying Li, Jing Gao

发表机构 * Elmore Family School of Electrical and Computer Engineering(埃尔莫夫家庭电气与计算机工程学院) Purdue University(普渡大学) Department of Horticulture and Landscape Architecture(园艺与景观建筑系) School of Biological Sciences(生物科学学院)

AI总结 提出EpiAwareNet框架,通过先验引导的多组学Transformer,结合基因-峰值交叉注意力模块和批量数据先验,从配对单细胞数据中重建基因调控网络。

Comments 12 pages, 6 figures. Accepted to the KDD 2026 AI4Sciences Track

详情
AI中文摘要

基因调控网络(GRN)捕捉转录因子-靶标相互作用,是理解细胞状态调控和疾病的核心。从配对的单细胞转录组和染色质可及性数据重建GRN具有前景但充满挑战:scATAC极其稀疏,且大多数方法依赖于固定的峰值-基因链接和弱监督。我们提出EpiAwareNet,一个先验引导的多组学Transformer框架,仅使用轻量级生物学先验从配对单细胞数据重建GRN。在第一阶段,EpiAwareNet通过基因-峰值交叉注意力模块学习联合基因-峰值表示,实现数据驱动的、基因特异性的可及性信号聚合,而非硬编码的峰值-基因分配。在第二阶段,EpiAwareNet引入批量数据衍生的GRN先验作为噪声正边,在标签稀缺情况下提供弱监督,同时保持对先验噪声的鲁棒性,细化调控分数。在我们的实验中,EpiAwareNet在GRN重建上优于代表性的单组学和多组学基线,并产生更具生物学合理性的GRN,例如改善已知调控相互作用的恢复,这表明当与自适应跨模态表示学习结合时,来自批量数据的轻量级生物学先验可以有效指导单细胞GRN推断。代码和数据将在https://github.com/tianyang-x/EpiAwareNet_pub提供。

英文摘要

Gene regulatory networks (GRNs) capture transcription factor-target interactions and are central to understanding cell-state regulation and disease. Reconstructing GRNs from paired single-cell transcriptomic and chromatin accessibility data is promising but challenging: scATAC is extremely sparse, and most methods rely on fixed peak-to-gene links and weak supervision. We present EpiAwareNet, a prior-guided multi-omic Transformer framework that reconstructs GRNs from paired single-cell data using only lightweight biological priors. In Stage 1, EpiAwareNet learns joint gene-peak representations with a gene-peak cross-attention module, enabling data-driven, gene-specific aggregation of accessibility signals rather than hard-coded peak-to-gene assignments. In Stage 2, EpiAwareNet incorporates a bulk-derived GRN prior as noisy positive edges to provide weak supervision under label scarcity, refining regulatory scores while remaining robust to prior noise. In our experiments, EpiAwareNet improves GRN reconstruction over representative single- and multi-omic baselines and yields GRNs with greater biological plausibility, such as improved recovery of known regulatory interactions, suggesting that lightweight biological priors from bulk data can effectively guide single-cell GRN inference when combined with adaptive cross-modal representation learning. Code and data will be available at https://github.com/tianyang-x/EpiAwareNet_pub.

2606.00677 2026-06-02 cs.LG 版本更新

Limits of Resolution Equivariance in Fourier Neural Operators

傅里叶神经算子中的分辨率等变性极限

Alex Colagrande, Paul Caillon, Eva Feillet, Alexandre Allauzen

发表机构 * Miles Team, LAMSADE, Université Paris Dauphine-PSL(巴黎萨克雷大学巴黎-达菲学院LAMSADE团队) Université Paris-Saclay, CNRS, LISN(巴黎-萨克雷大学CNRS LISN) ESPCI PSL, Paris(巴黎ESPCI PSL)

AI总结 本文通过对比直接细网格推理与低网格加傅里叶零填充上采样两种策略,发现傅里叶神经算子并不总是能泛化到不同分辨率,并分析了其层间频谱特性,指出非线性混叠是零样本分辨率等变性的主要障碍。

Comments Published as a paper at AI&PDE: ICLR 2026 Workshop on AI and Partial Differential Equations. 6 pages, 2 figures

详情
AI中文摘要

傅里叶神经算子通常被认为能够跨空间分辨率泛化,从而可以在粗网格上训练并在细网格上部署。我们通过对比从训练分辨率 $s$ 到测试分辨率 $S>s$ 时的两种推理选择来检验这一假设:直接在 $S$ 上运行 FNO,或者在 $s$ 上运行并通过傅里叶零填充将预测上采样到 $S$。在达西流问题上,我们观察到直接细网格推理并非总是有益的,甚至可能比低网格加上采样基线更差。我们进一步分析了层间频谱,发现在傅里叶截断下,中间表示的能量越来越集中在低频,而高频输出主要由后期的非线性/解码器阶段产生。这为 FNO 在保留少量模式时仍能表现良好,但对分辨率变化敏感的现象提供了机制性解释。我们的发现强调了一个简单但强大的跨分辨率评估基线,并指出非线性混叠是零样本分辨率等变性的关键障碍。

英文摘要

Fourier Neural Operators are often assumed to generalize across spatial resolutions, enabling training on a coarse grid and deployment on a finer grid. We test this assumption by contrasting two inference-time choices when moving from training resolution $s$ to test resolution $S>s$: running FNO directly at $S$, or running at $s$ and upsampling the prediction to $S$ via Fourier zero-padding. On Darcy flow, we observe that direct fine-grid inference is not reliably beneficial and can be worse than the low-grid-plus-upsampling baseline. We further analyze layerwise spectra and find that, under Fourier truncation, intermediate representations increasingly concentrate energy in low frequencies, with high-frequency output produced mainly by late nonlinear/decoder stages. This offers a mechanistic explanation for why FNO can perform well while retaining few modes, yet remain sensitive under resolution shifts. Our findings highlight a simple but strong baseline for cross-resolution evaluation and point to nonlinear aliasing as a key obstacle to zero-shot resolution equivariance.

2606.00675 2026-06-02 cs.LG 版本更新

Mapping the evolution of small reservoirs in Brazil from 1984 to 2025 using deep learning

利用深度学习绘制1984年至2025年巴西小型水库的演变

Kylen Solvik, Luis Gustavo Carvalho, Marcia N. Macedo

AI总结 针对巴西小型水库被忽视的问题,采用深度学习计算机视觉模型从Landsat数据中分割小型水库,生成了1984-2025年全国年度水库地图,揭示了水库数量和面积的大幅增长。

Comments 33 pages, 5 figures, 2 tables

详情
AI中文摘要

巴西的水研究在很大程度上忽视了为农业用途(如牲畜饮水、农场规模水电、灌溉和水产养殖)而广泛筑坝的小溪流。这些无处不在的水坝及其水库会改变水温、河流连通性、水生栖息地、温室气体排放和蒸发水损失。绘制小型水库地图具有挑战性,因为需要可靠地检测小型水体并将人工水库与天然湖泊区分开来。因此,大多数区域和全球数据集都将其排除在外。为了解决这一空白,我们训练了一个深度学习计算机视觉模型,利用Landsat 5-9的数据,准确分割巴西境内的小型(<1平方公里)、溪流补给的地表水水库。从1984年到2025年应用我们的模型,我们为整个国家创建了年度水库地图,以评估其数量、大小和分布随时间的变化。检测到的水库数量从263,913个增加到996,245个,增长了近四倍,而它们的总表面积从3510平方公里增加到8550平方公里。据我们所知,这是第一个代表四十年来小型水库演变的全国年度数据集。公开可用的年度地图突出了巴西各地小溪流蓄水工程的范围和累积影响,为管理淡水生态系统和水资源提供了可操作的见解。

英文摘要

Water research in Brazil largely overlooks the widespread damming of small streams for agricultural uses such as watering cattle, farm-scale hydropower, irrigation, and aquaculture. These ubiquitous dams and their reservoirs can alter water temperature, stream connectivity, aquatic habitats, greenhouse gas emissions, and evaporative water losses. Mapping small reservoirs is challenging because it requires reliably detecting small water bodies and distinguishing artificial reservoirs from natural lakes. As a result, most regional and global datasets exclude them. To address this gap, we trained a deep learning computer vision model to accurately segment small ($< 1 km^2$), stream-fed, surface water reservoirs in Brazil leveraging data from Landsat 5-9. Applying our model from 1984 to 2025, we created annual reservoir maps for the entire country to evaluate how their count, size, and distribution have changed over time. The number of detected reservoirs grew nearly fourfold from 263,913 to 996,245, while their total surface area increased from 3510 $km^2$ to 8550 $km^2$. To our knowledge, this is the first country-wide annual dataset representing the evolution of small reservoirs over four decades. The publicly available annual maps highlight the extent and cumulative impacts of the small stream impoundments across Brazil, providing actionable insights for managing freshwater ecosystems and water resources.

2606.00674 2026-06-02 cs.LG cs.AI 版本更新

The Paradox of Outcome Optimization: A Causal Information-Theoretic Bound on Reasoning Shortcuts in LLMs

结果优化的悖论:LLM中推理捷径的因果信息论界限

Zihan Chen, Yiming Zhang, Wenxiang Geng, Zenghui Ding, Yining Sun

发表机构 * HFIPS, Chinese Academy of Sciences(中国科学院HFIPS) University of Science and Technology of China(中国科学技术大学)

AI总结 针对基于结果强化学习的LLM在分布外任务中推理脆弱的问题,提出因果信息论框架解释奖励诱导的流形坍缩,并证明过程奖励模型作为拓扑滤波器可消除低复杂度捷径。

详情
AI中文摘要

通过基于结果的强化学习(RL)对齐的大型语言模型(LLM)经常表现出一种关键失败模式:它们在分布内基准测试上取得高性能,但在分布外(OOD)任务上推理能力脆弱。我们将这种现象称为奖励诱导的流形坍缩。我们建立了一个理论框架,将结构因果模型(SCM)和信息瓶颈(IB)原理联系起来,以解释这一悖论。我们将推理定义为高复杂度的因果过程,将捷径学习定义为利用低复杂度的虚假相关性。在随机梯度下降(SGD)的隐式归纳偏置下,只要训练分布允许对真实因果机制进行“马尔可夫筛选”,优化结果奖励的模型就会偏向于捷径解。我们基于语义覆盖度量($\eta$)而非样本量推导了一个新的泛化界限,说明了为什么在同质分布上扩展数据可能无法纠正推理缺陷。我们还表明,过程奖励模型(PRM)作为拓扑滤波器,通过强制执行逐步互信息约束,使得低复杂度的捷径流形不可行。这些结果为过程监督在简单信用分配之外的作用提供了数学基础。

英文摘要

Large Language Models (LLMs) aligned via outcome-based Reinforcement Learning (RL) frequently exhibit a critical failure mode: they achieve high performance on in-distribution benchmarks while demonstrating brittle reasoning capabilities on out-of-distribution (OOD) tasks. We term this phenomenon Reward-Induced Manifold Collapse. We establish a theoretical framework bridging Structural Causal Models (SCM) and the Information Bottleneck (IB) principle to explain this paradox. We define reasoning as a high-complexity causal process and shortcut learning as the exploitation of low-complexity spurious correlations. Under the implicit inductive bias of Stochastic Gradient Descent (SGD), models optimized for outcome rewards are biased toward shortcut solutions whenever the training distribution allows for a ``Markovian Screening'' of the true causal mechanism. We derive a new generalization bound based on Semantic Coverage Measure ($η$) rather than sample size, showing why data scaling on homogeneous distributions may fail to correct reasoning flaws. We also show that Process Reward Models (PRMs) function as Topological Filters, enforcing step-wise mutual information constraints that render the low-complexity shortcut manifold inadmissible. These results provide a mathematical grounding for the role of process supervision beyond simple credit assignment.

2606.00672 2026-06-02 cs.AI cs.LG 版本更新

Medication-Aware Financial Exploitation Detection for Alzheimer's Patients Using Edge-Aware Interaction Risk Modeling

基于边缘感知交互风险建模的阿尔茨海默病患者药物感知金融剥削检测

Farzana Akter, Lisan Al Amin, Rakib Hossain, Chaitanya Gunupudi, Faisal Quader

发表机构 * Cognitive Links LLC University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 提出一种药物感知框架,通过同步药物依从性与交易监控,利用交互感知逻辑模型提升对认知风险金融事件的检测,尤其在药物脆弱窗口期召回率从0.7442提升至0.9070。

详情
AI中文摘要

金融剥削对阿尔茨海默病患者日益构成威胁,尤其是在认知稳定性下降期间。传统欺诈检测系统通常仅依赖金融行为,忽略可能改变脆弱性的临床相关因素。本文提出一种药物感知框架,将药物依从性与交易级监控同步,以改进对认知风险金融事件的检测。构建了180名患者45天的混合模拟数据集,产生8,100条药物记录和30,855笔交易。该框架通过纯金融、加性药物感知和交互感知逻辑模型评估金额异常、商家新颖性、交易频率、时间偏差和药物依从性。结果表明,纯金融基线获得了最高的全局F1分数0.5000,但交互感知模型在药物诱导脆弱窗口期内将召回率从0.7442提升至0.9070,并在排名高风险案例中实现了最高平均精度。研究结果表明,药物依从性作为金融风险的上下文修饰因子比作为孤立预测因子更有用。

英文摘要

Financial exploitation is a growing concern for people with Alzheimer's disease, especially during periods of reduced cognitive stability. Conventional fraud detection systems usually rely on financial behavior alone and ignore clinically relevant factors that may alter vulnerability. This paper proposes a medication-aware framework that synchronizes medication adherence with transaction-level monitoring to improve detection of cognitively risky financial events. A hybrid simulation dataset was constructed for 180 patients across 45 days, producing 8,100 medication records and 30,855 transactions. The framework evaluates amount anomaly, vendor novelty, transaction frequency, time deviation, and medication adherence through financial-only, additive medication-aware, and interaction-aware logistic models. Results show that the financial-only baseline obtained the highest global F1-score of 0.5000, but the interaction-aware model improved recall during medication-induced vulnerability windows from 0.7442 to 0.9070 and achieved the highest average precision for ranked high-risk cases. The findings suggest that medication adherence is most useful as a contextual modifier of financial risk rather than as an isolated predictor.

2606.00671 2026-06-02 cs.AI cs.CL cs.LG 版本更新

AXIOM: A Trust-First Neuro-Symbolic Execution Architecture for Verifiable Mathematical Reasoning

AXIOM: 一种用于可验证数学推理的信任优先神经符号执行架构

Alessio Bruno

发表机构 * Independent researcher(独立研究者)

AI总结 提出AXIOM架构,将语言模型限制为规范化器,通过确定性计算机代数系统管道实现可验证的数学推理,在4个MATH类别上达到94.36%的正确率和100%的信任度。

Comments Preprint. 12 pages, 2 figures. Live interactive demo: https://huggingface.co/spaces/Squagghy/axiom-solver. Paper artifact and dataset on Zenodo (concept-DOI): 10.5281/zenodo.20440225

详情
AI中文摘要

我们提出AXIOM,一种用于自然语言数学推理的信任优先神经符号执行架构。在AXIOM中,语言模型严格作为规范化器:它将非正式问题文本重写为狭窄的模式,由确定性计算机代数系统(CAS)管道消费,该管道推导并验证答案,或作为第一类输出弃权。路由遵循问题形状正则表达式、特定模式提示和封闭形式CAS处理器之间的1:1:1对齐,已交付3100多条这样的路由,并在250多个连续提交中零LOST_CORRECT回归。我们在4个MATH类别上报告了实证结果,累积正确率为94.36%(2,592/2,747),可解析问题的信任度为100.00%(在整个2,747条记录基准测试中零自信错误答案),所有四个领域均高于每个领域70/90/70的阈值,每个领域信任度为100.0%,仅规则处理器的中位延迟为1毫秒(在lm-eval算术20,000条记录基准测试中占88%的记录)。该架构通过公共部署已服务约30,000次生产查询。我们强调的贡献不是最终的准确率数字,而是该架构建立的向前动态:生产中的每个记录弃权在一次发布周期后都是候选正确,因为新任务在不回归注册表的情况下组合。支撑这一特性的操作纪律——数学模板分桶、LOST_CORRECT扫描作为回归预言机、可解析优先接入以及弃权作为第一类输出——构成了一个可迁移的框架,适用于数学之外的值得信赖的神经符号系统。

英文摘要

We present AXIOM, a trust-first neuro-symbolic execution architecture for natural-language mathematical reasoning. In AXIOM, the language model functions strictly as a canonicalizer: it rewrites informal problem text into a narrow schema consumed by a deterministic Computer-Algebra-System (CAS) pipeline, which derives and verifies the answer or abstains as a first-class output. Routing follows a 1:1:1 alignment between problem-shape regex, schema-specific prompt, and closed-form CAS handler, with 3,100+ such routes shipped and zero LOST_CORRECT regressions across 250+ consecutive ship commits. We report empirical results on 4 MATH categories with a cumulative correctness of 94.36% (2,592/2,747) at 100.00% trust on parseable (zero confident-wrong answers across the full 2,747-record benchmark), all four domains above the per-domain 70/90/70 floor with per-domain trust at 100.0%, and median latency of 1 ms on rule-only handlers (88% of records on the lm-eval arithmetic 20,000-record benchmark). The architecture has served ~30,000 production queries through a public deployment. The contribution we emphasize is not a final accuracy figure but the forward dynamic the architecture establishes: every logged abstain in production is a candidate correct after one ship cycle, since new tasks compose without regressing the registry. The operational discipline behind this property -- math-template bucketing, LOST_CORRECT scan as regression oracle, parseable-first onboarding, and abstain as first-class output -- constitutes a transferable framework for trustworthy neuro-symbolic systems beyond mathematics.

2606.00667 2026-06-02 q-bio.NC cs.LG 版本更新

Cortex and subcortex play distinct roles over learning when cortical memory is limited

皮层与皮层下在学习中扮演不同角色:当皮层记忆受限时

Matthew Farrell, Taro Toyoizumi

发表机构 * Laboratory for Neural Computation and Adaptation(神经计算与适应实验室) RIKEN Center for Brain Science(脑科学研究中心) Department of Mathematical Informatics, Graduate School of Information Science and Technology(信息科学与技术研究生院数学信息学系) The University of Tokyo(东京大学)

AI总结 通过约束模型基模块的记忆资源,研究皮层与皮层下系统在学习中的功能分离,发现皮层支持一般结构学习而皮层下专攻奖励学习。

Comments Preprint. 19 pages, 4 figures

详情
AI中文摘要

已有研究表明,大脑将灵活但计算成本高的皮层处理与更简单、成本更低的皮层下机制相结合,以实现比任一系统单独运行更高效的资源利用。尽管这一观点具有吸引力,但探索该假设的理论框架仍然有限。我们扩展了现有框架,其中模型基模块和模型无关模块并行学习,通过显式约束模型基模块的记忆资源,并在一个简单的决策设置中研究该约束的影响。记忆约束自然引发了分配记忆资源的策略。我们评估了不同策略在不同情境下的表现,并证明当奖励状态频繁变化时,模型基模块将记忆资源用于捕捉环境的通用结构而非利用当前奖励可能更有利。这项工作为学习过程中皮层和皮层下系统的功能分离提供了理论基础:皮层支持通用结构学习,而皮层下回路专门负责基于奖励的学习。我们进一步详细说明了如何在实验数据上检验这些假设。

英文摘要

It has been proposed that the brain integrates flexible, computationally expensive cortical processing with simpler, lower-cost subcortical mechanisms to achieve resource-efficient performance greater than that of either system alone. Despite the allure of this perspective, satisfying theoretical frameworks that explore this hypothesis are still limited. We extend existing frameworks in which a model-based module and model-free module learn in tandem by explicitly constraining the memory resources of the model-based module, and investigate the impact of this constraint in a simple decision-making setting. Memory constraints naturally give rise to strategies for allocating memory resources. We evaluate the performance of different strategies in different situations and demonstrate that when the rewarded states change often, it can be advantageous for the model-based module to focus its memory resources not on exploiting the current reward, but on capturing general structure of the environment. This work provides a theoretical foundation for a functional dissociation between cortical and subcortical systems during learning: the cortex supports general structure learning, while subcortical circuits specialize in reward-based learning. We further detail how these hypotheses can be tested on experimental data.

2606.00666 2026-06-02 cond-mat.mtrl-sci cs.LG physics.chem-ph 版本更新

Manifold Diffusion for Structure Generation of Transition Metal Complexes

过渡金属配合物结构生成的流形扩散

Luca Schaufelberger, Kjell Jorner

发表机构 * Institute of Chemical and Bioengineering, Department of Chemistry and Applied Biosciences, ETH Zurich(苏黎世联邦理工学院化学与生物工程研究所,化学与应用生物科学系) NCCR Catalysis, Switzerland(瑞士催化中心)

AI总结 提出TMCgen流形扩散模型,通过金属-配体配位角与配体扭转/旋转扩散,高效生成过渡金属配合物的精确几何结构。

详情
AI中文摘要

过渡金属配合物是催化、药物设计和材料科学的核心,其相关性质对三维几何结构高度敏感。然而,过渡金属配合物的电子多样性和非常规键合环境对准确结构生成构成重大挑战。在这项工作中,我们引入了TMCgen,一种流形扩散机器学习模型,能够高效准确地生成过渡金属配合物的几何结构。通过将扩散过程公式化为金属-配体配位角,并结合配体的扭转和旋转扩散,TMCgen聚焦于过渡金属配合物的关键几何自由度。TMCgen在多样化的实验衍生生物无机和有机金属配合物上表现出生成准确配位环境的强大性能,同时仅需少量推理步骤,实现高效生成。我们的结果展示了基于流形的生成建模在数据高效几何生成中的潜力,为过渡金属配合物的性质条件设计铺平了道路。

英文摘要

Transition metal complexes are central to catalysis, drug design, and materials science, with relevant properties strongly sensitive to their three-dimensional geometry. However, the electronic diversity and unconventional bonding environments of transition metal complexes pose a major challenge for accurate structure generation. In this work, we introduce TMCgen, a manifold diffusion machine learning model that efficiently and accurately generates geometries of transition metal complexes. By formulating the diffusion process over the metal-ligand coordination angles, combined with torsional and rotational diffusion of the ligands, TMCgen focuses on the key geometric degrees of freedom of transition metal complexes. TMCgen shows strong performance in generating accurate coordination environments on a diverse set of experimentally derived bioinorganic and organometallic complexes while requiring only few inference steps, enabling efficient generation. Our results demonstrate the potential of manifold-based generative modeling for data-efficient geometry generation, paving the way for property-conditioned design of transition metal complexes.

2606.00661 2026-06-02 stat.ML cs.LG 版本更新

On Median of Incomplete U-Statistics

关于不完全U-统计量的中位数

Nong Minh Hieu

发表机构 * Singapore Management University, School of Computing and Information Systems(新加坡国立管理学院,计算机与信息系统学院)

AI总结 本文针对不完全U-统计量的中位数(MIU)建立了有限样本浓度率,这是一种用于对称核期望的高效稳健估计量。

详情
AI中文摘要

我们建立了不完全U-统计量的中位数(MIU)的有限样本浓度率,这是一种用于对称核期望的高效稳健估计量。

英文摘要

We establish the finite-sample concentration rate for the Median-of-Incomplete-U-Statistics (MIU), an efficient robust estimator for the expectation of symmetric kernels.

2606.00656 2026-06-02 cs.LG cs.AI 版本更新

Demystifying the Optimal Fair Classifier in Multi-Class Classification

揭秘多类分类中的最优公平分类器

Li Zhang, Yuyuan Li, XiaoHua Feng, Jiaming Zhang, Fengyuan Yu, Chaochao Chen

发表机构 * College of Computer Science and Technology, Zhejiang University(浙江大学计算机科学与技术学院) College of Computer Science(计算机科学学院) Technology, Zhejiang University(技术,浙江大学) School of Communication Engineering, Hangzhou Dianzi University(杭州电子科技大学通信工程学院)

AI总结 本文针对多类分类中的公平性问题,提出了一种在公平约束下最优分类器的概率公式,并设计了两种属性盲算法(处理中与处理后)以逼近最优精度-公平帕累托前沿。

Comments Accepted to ICML 2026

详情
AI中文摘要

确保不同群体之间的公平公正对待,特别是在多类分类任务中,由于机器学习模型中固有的持续偏差,构成了重大挑战。大多数现有的偏差缓解技术针对二元设置,而多维输出和复杂公平机制的存在使得它们扩展到多类场景既不直接也不有效。在本文中,我们研究了公平分类中两个基本且未解决的挑战:(i)刻画多类设置中的最优精度-公平前沿,以及(ii)设计在不同训练阶段达到此最优值的实用算法。为应对这些挑战,我们首先指定了公平约束下最优分类器的解析可处理概率公式。在此基础上,我们提出了两种属性盲算法以在实践中实施公平要求:一种是通过约简方法在训练期间进行公平干预的处理中方法,以及一种通过插件估计微调输出概率的处理后方法。理论分析表明,两种方法都收敛到最优精度-公平帕累托前沿。在多个数据集上进行的实验证明了我们的方法在平衡精度和公平性方面的优越性能。

英文摘要

Ensuring fair and equitable treatment across diverse groups, particularly in multi-class classification tasks, poses a significant challenge due to the persistent biases inherent in machine learning models. Most existing bias mitigation techniques are tailored to binary settings, and the presence of multi-dimensional outputs and complex fairness mechanisms makes their extension to multi-class scenarios neither straightforward nor effective. In this paper, we investigate two fundamental, unresolved challenges in fair classification: (i) characterizing the optimal accuracy-fairness frontier in multi-class settings, and (ii) designing practical algorithms that attain this optimum in different training phases. To tackle these challenges, we first specify an analytically tractable probabilistic formulation of the optimal classifier under fairness constraints. Building upon this, we propose two attribute-blind algorithms to enforce fairness requirements in practice: an in-processing approach for fairness intervention during training via the reduction approach, and a post-processing approach for fine-tuning output probabilities with plug-in estimation. Theoretical analysis reveals that both methods converge to the optimal accuracy-fairness Pareto frontier. Experiments conducted on multiple datasets demonstrate the superior performance of our methods in balancing accuracy and fairness.

2606.00651 2026-06-02 cs.LG cs.AI cs.CL 版本更新

MESA: Improving MoE Safety Alignment via Decentralized Expertise

MESA: 通过去中心化专家提升MoE安全对齐

Yitong Sun, Yao Huang, Teng Li, Ranjie Duan, Yichi Zhang, Xingjun Ma, Hui Xue, Xingxing Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对MoE架构中安全能力集中于少数专家导致的脆弱性,提出MESA框架,通过最优传输理论实现专家安全职责去中心化分配与路由细化,在保持实用性的同时提升防御性能。

Comments 18 pages, 8 figures, accepted by ICML 2026

详情
AI中文摘要

混合专家(MoE)架构高效扩展大型语言模型(LLM),通过动态路由将输入分配给相关专家,以降低计算成本的同时增强容量,但引入了一个关键漏洞:安全稀疏性,即安全能力集中在少数专家中,使其容易受到对抗性绕过。同时,传统的对齐方法统一调整所有参数,忽略了它们的功能差异,并无意中降低了性能。为了解决这些挑战,我们提出了MESA(MoE安全对齐),一个针对基于MoE的LLM的定向对齐框架,策略性地去中心化安全责任以最大化覆盖范围,同时最小化对实用性的干扰。基于最优传输(OT)理论,MESA通过两种机制运作:(1)专家容量重新分配使用传输成本矩阵将安全职责分配给最具成本效益的专家,以及(2)动态路由细化约束路由器精确激活这些去中心化模块。实验表明,MESA在保持有用性的同时,对各种有害基准实现了稳健的防御性能。代码可在https://github.com/lorraine021/MESA获取。

英文摘要

Mixture-of-Experts (MoE) architectures scale Large Language Models (LLMs) efficiently, enabling greater capacity with reduced computational cost by dynamically routing inputs to relevant experts, yet introduce a critical vulnerability: Safety Sparsity, where safety capabilities concentrate in few experts, making them susceptible to adversarial bypassing. Meanwhile, conventional alignment methods uniformly adapt all parameters, ignoring their functional differences and inadvertently degrading performances. To address these challenges, we propose MESA (MoE Safety Alignment), a targeted alignment framework for MoE-based LLMs that strategically decentralizes safety responsibility to maximize coverage while minimizing interference with utility. Based on Optimal Transport (OT) theory, MESA operates through two mechanisms: (1) Expert Capacity Reallocation uses a transport cost matrix to distribute safety duties to the most cost-effective experts, and (2) Dynamic Routing Refinement constrains the router to precisely activate these decentralized modules. Experiments show that MESA achieves robust defensive performance against varied harmful benchmarks while preserving helpfulness. Code is available at https://github.com/lorraine021/MESA.

2606.00643 2026-06-02 stat.ML cs.LG cs.NA math.NA math.OC math.ST stat.TH 版本更新

Taming the Loss Landscape of PINNs with Noisy Feynman-Kac Supervision: Operator Preconditioning and Non-Asymptotic Error Bounds

驯服带噪声Feynman-Kac监督的PINN损失景观:算子预条件与非渐近误差界

Nathanael Tepakbong, Hanyu Hu, Chengyu Liu, Xiang Zhou

发表机构 * Department of Data Science, City University of Hong Kong(香港城市大学数据科学系) Department of Mathematics, City University of Hong Kong(香港城市大学数学系)

AI总结 通过引入点态数据保真项作为算子级预条件,显著改善PINN的损失景观条件数,并基于Feynman-Kac表示生成标签,提出FK-PINN方法,推导了梯度下降下的非渐近误差界。

Comments accepted in ICML 2026 (poster), 59 pages

详情
AI中文摘要

物理信息神经网络(PINNs)在求解具有挑战性的偏微分方程(PDEs)时通常训练缓慢或无法收敛,这一行为最近被归因于从底层微分算子继承的严重病态损失景观。我们研究了在标准残差和边界损失基础上,于域内少数点添加点态数据保真项的PINNs。我们证明,该监督项作为算子级预条件:对于合适的权重,我们的比较界保证条件数比标准PINN损失下显著更小,且与点态标签的获取方式无关。对于一类允许Feynman-Kac(FK)表示的PDEs,我们通过FK泛函的蒙特卡洛平均生成此类标签,得到所谓的“FK-PINNs”,并利用超额风险分解方法,推导了使用tanh激活函数、通过有限步梯度下降训练的FK-PINNs的非渐近$L^2(Ω)$误差界。在此过程中,我们建立了tanh神经网络一阶和二阶导数的伪维数界,这些结果具有独立意义,且据我们所知是新的。在泊松、薛定谔、平均逃逸时间和通量问题上的数值实验证实了理论,并表明FK-PINNs能够成功求解标准PINNs表现出严重失效模式的PDEs。

英文摘要

Physics-Informed Neural Networks (PINNs) often train slowly or fail to converge on challenging partial differential equations (PDEs), a behavior recently linked to severely ill-conditioned loss landscapes inherited from the underlying differential operator. We study PINNs augmented with a pointwise data-fidelity term, added at a few points in the domain to the standard residual and boundary losses. We show that this supervision term acts as an operator-level preconditioner: for suitable weights, our comparison bounds guarantee a substantially smaller condition number than under the standard PINN loss, independently of how the pointwise labels are obtained. For a broad class of PDEs admitting a Feynman-Kac (FK) representation, we generate such labels by Monte Carlo averages of the FK functional, resulting in what we call ``FK-PINNs", and using the excess risk decomposition approach, we derive non-asymptotic $L^2(Ω)$-error bounds for FK-PINNs with $\tanh$ activation trained by finitely many steps of gradient descent. Along the way, we establish pseudo-dimension bounds for first- and second-order derivatives of $\tanh$ neural networks, which are of independent interest and, to the best of our knowledge, new. Numerical experiments on Poisson, Schrödinger, mean exit time, and committor problems corroborate the theory, and show that FK-PINNs can successfully solve PDEs for which standard PINNs exhibit severe failure modes.

2606.00635 2026-06-02 cs.LG 版本更新

How Neural Losses Shape VAE Latents

神经损失如何塑造VAE潜在变量

Giorgio Strano, Luca Cerovaz, Michele Mancusi, Tommaso Mencattini, Emanuele Rodolà

发表机构 * Sapienza University of Rome(罗马大学萨皮恩扎分校) Paradigma, Inc.(Paradigma公司) Moises Systems, Inc.(Moises系统公司) EPFL(苏黎世联邦理工学院)

AI总结 本文研究感知损失和对抗损失等神经重建损失如何改变VAE的率失真问题,证明其减少潜在表示信息量并改变潜在空间几何结构,使表示更各向同性且不确定性分布更均匀。

详情
AI中文摘要

现代VAE很少使用标准$β$-VAE目标隐含的点态似然进行训练。在实践中,尽管缺乏对如何改变模型潜在动态的理解,点态重建常与感知损失和对抗损失结合。我们表明,重建损失的选择重塑了率失真问题本身,改变了潜在表示的信息内容和几何结构,这些变化可能仅从重建中无法察觉。首先,我们证明并实证验证,用神经项(如感知和对抗目标)增强点态重建会减少存储在潜在表示中的信息量。其次,我们展示神经重建损失系统地改变了潜在空间的几何结构:它们使表示更各向同性,并更均匀地将不确定性分布在潜在维度上,产生不同的后验方差分布。这些发现强调了率失真权衡并非理解VAE行为的全面视角,我们提出一种更机械的方法来研究失真度量的选择如何重塑优化问题。

英文摘要

Modern VAEs are rarely trained with the pointwise likelihood implied by the standard $β$-VAE objective. In practice, pointwise reconstruction is often combined with perceptual and adversarial losses, despite a lack of understanding of how this changes the latent dynamics of the model. We show that the choice of reconstruction loss reshapes the rate-distortion problem itself, altering both the information content and the geometry of the learned latent space in ways that may be invisible from reconstructions alone. First, we prove and verify empirically that augmenting pointwise reconstruction with neural terms, such as perceptual and adversarial objectives, reduces the amount of information stored in the latent representations. Second, we show that neural reconstruction losses systematically change the geometry of the latent space: they make representations more isotropic and distribute uncertainty more evenly across latent dimensions, producing different posterior variance profiles. These findings highlight how the rate-distortion tradeoff is not a comprehensive lens to understand the behavior of VAEs, and we propose a more mechanistic approach to investigate how the choice of a distortion metric reshapes the optimization problem.

2606.00634 2026-06-02 cs.CL cs.LG 版本更新

French parsing enhanced with a word clustering method based on a syntactic lexicon

基于句法词典的词聚类方法增强的法语解析

Anthony Sigogne, Matthieu Constant, Eric Laporte

发表机构 * Université Paris-Est(巴黎-est大学) LIGM(语言与信息学实验室)

AI总结 本文通过将法语句法词典(Lexicon-Grammar)的数据整合到概率解析器中,并应用聚类方法于法语树库的动词,提高了基于概率上下文无关文法的解析性能。

详情
Journal ref
Second Workshop on Statistical Parsing of Morphologically Rich Languages (SPMRL), 2011, Dublin, Ireland, pp.22-27
AI中文摘要

本文评估了从法语句法词典(Lexicon-Grammar, Gross, 1994)中提取的数据整合到概率解析器中的效果。我们表明,通过对法语树库(Abeillé et al., 2003)中的动词应用聚类方法,基于概率上下文无关文法(Petrov et al., 2006)的解析器在法语上获得了准确的性能。

英文摘要

This article evaluates the integration of data extracted from a French syntactic lexicon, the Lexicon-Grammar (Gross, 1994), into a probabilistic parser. We show that by applying clustering methods on verbs of the French Treebank (Abeillé et al., 2003), we obtain accurate performances on French with a parser based on a Probabilistic Context-Free Grammar (Petrov et al., 2006).

2606.00629 2026-06-02 cs.SD cs.HC cs.LG eess.AS 版本更新

Quality Audio Prototyping: a prototype system for unified sound retrieval and procedural generation

质量音频原型:统一声音检索与程序化生成的系统原型

Nelly Garcia, Aditya Bhattacharjee, Gabryel Mason-Williams, Israel Mason-Williams, Emmanouil Benetos, Joshua Reiss

发表机构 * GitHub

AI总结 提出QuAP系统,通过统一基于内容的音频检索和实时程序化生成,并集成规则辅助参数指导,降低声音设计中的操作距离,经主观评估和用户测试验证了其有效性和实用性。

Comments DaFx 2026

详情
AI中文摘要

声音设计工作流经常在耗时的库搜索和复杂的程序化合成之间摇摆,从业者通常依赖独立的工具分别应对每个挑战。本文介绍了质量音频原型(QuAP),一个工作原型,它在单一界面中统一了基于内容的音频检索和程序化声音生成,减少了叙事概念与其声音实现之间的操作距离。QuAP集成了基于相似性的检索引擎与实时程序化音频模型,并辅以基于规则的助手,提供基于感知的参数指导,给出源自经验优化的定义和建议,而不需要先验的合成知识。初步评估证实了这种方法的可行性:主观评估显示六个嵌入合成模型中有五个在质量上具有统计显著性的提升,编码器消融研究在音效数据集上确立了首选的检索架构。与16名从业者的用户评估证实了该工具的工作流实用性,所有参与者一致认为参数助手在保持创作自主性的同时降低了程序化交互的门槛。

英文摘要

Sound design workflows frequently oscillate between time-consuming library searches and the complexity of procedural synthesis, with practitioners typically relying on disconnected tools to address each challenge separately. This paper introduces Quality Audio Prototyping (QuAP), a working prototype that unifies content-based audio retrieval and procedural sound generation within a single interface, reducing the procedural distance between a narrative concept and its sonic realisation. QuAP integrates a similarity-based retrieval engine with real-time procedural audio models, complemented by a rule-based assistant that provides perceptually informed parameter guidance, offering definitions and recommendations derived from empirical optimisation rather than requiring prior synthesis knowledge. Preliminary evaluation confirms the viability of this approach: subjective assessment demonstrated statistically significant quality improvements in five of six embedded synthesis models, and an encoder ablation study established the preferred retrieval architecture on a sound effect dataset. A user evaluation with 16 practitioners confirmed the tool's workflow utility, with all participants agreeing that the parameter assistant preserved creative agency while lowering the barrier to procedural interaction.

2606.00609 2026-06-02 cs.LG cs.AI 版本更新

CARE-RL: Capability-Aware Reinforcement Learning for Mitigating Cross-Domain Conflicts

CARE-RL:用于缓解跨领域冲突的能力感知强化学习

Rui Zhang, Xinle Wu, Yao Lu

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出CARE-RL框架,结合协议感知奖励生成与能力感知优化,通过PA-GRM和DACSP方法缓解多领域强化学习中的奖励不可靠与能力干扰问题。

详情
AI中文摘要

具有可验证奖励的强化学习在面向推理的大语言模型中取得了显著进展,但由于非可验证任务中奖励不可靠以及跨领域能力干扰,将其扩展到多领域强化学习仍具挑战性。我们提出CARE-RL,将协议感知奖励生成与能力感知优化相结合,以缓解跨领域冲突。对于非可验证任务,协议感知生成式奖励模型(PA-GRM)在生成轨迹条件奖励之前构建提示级别的评估协议和模式,从而实现对开放式响应的任务自适应且可比较的评估。对于多领域优化,方向感知能力子空间投影(DACSP)从先前的强化学习阶段提取历史能力方向,并通过放大对齐分量、抑制冲突分量以及保留正交更新来调节后续更新。在数学、聊天和指令遵循基准上的实验表明,CARE-RL始终优于标准的多领域强化学习基线,在Qwen2.5-7B和Qwen3-4B上分别达到47.9和50.7的总平均分。

英文摘要

Reinforcement learning (RL) with verifiable rewards has achieved strong progress in reasoning-oriented LLMs, but extending it to multi-domain RL remains challenging due to reward unreliability in non-verifiable tasks and capability interference across domains. We propose CARE-RL to combine protocol-aware reward generation with capability-aware optimization for mitigating cross-domain conflicts. For non-verifiable tasks, the Protocol-Aware Generative Reward Model (PA-GRM) constructs prompt-level evaluation protocols and schemas before producing trace-conditioned rewards, enabling task-adaptive yet comparable evaluation of open-ended responses. For multi-domain optimization, Direction-Aware Capability Subspace Projection (DACSP) extracts historical capability directions from previous RL stages and modulates later updates by amplifying aligned components, suppressing conflicting components, and preserving orthogonal updates. Experiments across math, chat, and instruction-following benchmarks show that CARE-RL consistently outperforms standard multi-domain RL baselines, achieving Total Avg scores of 47.9 and 50.7 on Qwen2.5-7B and Qwen3-4B, respectively.

2606.00605 2026-06-02 cs.LG stat.ML 版本更新

Looped Transformers with Layer Normalization Provably Learn the Power Method

带有层归一化的循环Transformer可证明地学习幂方法

Lyumin Wu, Chenyang Zhang, Yuan Cao

发表机构 * School of Computing & Data Science, The University of Hong Kong(计算与数据科学学院,香港大学)

AI总结 本文通过主成分预测任务,证明带有层归一化的循环线性Transformer在梯度下降训练下会收敛到实现幂方法的解,揭示了层归一化带来的算法隐式偏差。

Comments 70 pages, 8 figures

详情
AI中文摘要

Transformer在广泛的应用中取得了显著成功,越来越多的研究表明其部分优势来自于学习和执行算法程序的能力。然而,我们对Transformer如何学习此类算法的理解仍然有限,尤其是在存在层归一化(LN)的情况下。在这项工作中,我们研究主成分预测作为理解带有LN的Transformer训练动态的具体测试平台。我们证明,通过梯度下降训练的带有LN的循环线性Transformer收敛到实现幂方法的解,其中每个自注意力层执行一次幂迭代。值得注意的是,模型仅针对主成分预测进行训练,而非明确监督其实现幂方法。因此,我们的发现揭示了带有LN的循环Transformer的“算法隐式偏差”:主成分预测原则上可以通过多种机制实现,但梯度下降选择了实现幂方法的一种。我们进一步提供了带有和不带有LN的Transformer之间的具体比较:即使有幂迭代的逐层指导,没有LN的Transformer也无法精确学习幂方法,而带有LN的对应Transformer可以,导致主成分预测中可证明的性能差距。据我们所知,我们的结果首次对带有LN的循环和单层Transformer的训练动态进行了理论分析,并阐明了LN在Transformer模型中的作用。

英文摘要

Transformers have achieved remarkable success across a wide range of applications, and a growing body of work suggests that part of their strength comes from their ability to learn and execute algorithmic procedures. However, our understanding of how transformers learn such algorithms remains limited, especially in the presence of layer normalization (LN). In this work, we study principal component prediction as a concrete testbed for understanding the training dynamics of transformers with LN. We prove that a looped linear transformer with LN, trained by gradient descent, converges to a solution that implements the power method, with each self-attention layer performing one power iteration. Notably, the model is trained only for principal component prediction, rather than being explicitly supervised to implement the power method. Our finding thus reveals an "algorithmic implicit bias" of looped transformers with LN: principal-component prediction can in principle be achieved by many mechanisms, yet gradient descent selects one that realizes the power method. We further provide a concrete comparison between transformers with and without LN: even with layerwise guidance from power iterations, a transformer without LN cannot exactly learn the power method, whereas the corresponding transformer with LN can, leading to a provable performance gap in principal component prediction. Our results provide, to our knowledge, the first theoretical analysis of the training dynamics of looped and single-layer transformers with LN, and shed light on the role of LN in transformer models.

2606.00584 2026-06-02 stat.ML cs.LG 版本更新

Spectra-Guided Neural Tucker Factorization

光谱引导的神经Tucker分解

Fusheng Wang, Yikai Hou

发表机构 * School of Automation, Chongqing University of Posts and Telecommunications(重庆邮电大学自动化学院) College of Computer and Information Science, School of Software, Southwest University(西南大学计算机与信息科学学院、软件学院)

AI总结 提出光谱引导的神经Tucker分解(SG-NTF),通过连续光谱空间映射和时空共门控机制,实现高维不完整张量的高效补全。

详情
AI中文摘要

本文针对高维不完整(HDI)张量补全问题,提出光谱引导的神经Tucker分解(SG-NTF)。为规避离散表示的局限性,SG-NTF将标量时间戳映射到连续光谱空间以抽象时间周期性。同时,时空共门控(STCG)机制通过时空上下文上的乘法调制显式过滤潜在交互。在真实世界HDI张量上的评估验证了SG-NTF在参数效率下保持有竞争力的补全精度。

英文摘要

This paper proposes Spectra-Guided Neural Tucker Factorization (SG-NTF) for High-Dimensional and Incomplete (HDI) tensor completion. Circumventing discrete representational limits, SG-NTF maps scalar timestamps into a continuous spectral space to abstract temporal periodicities. Concurrently, a Spatio-Temporal Co-Gating (STCG) mechanism explicitly filters latent interactions via multiplicative modulation on spatiotemporal contexts. Evaluations on real-world HDI tensors verify that SG-NTF maintains competitive completion accuracy with parameter efficiency.

2606.00583 2026-06-02 cs.CV cs.AI cs.LG cs.MM 版本更新

Improving Visual Representation Alignment Generation with GRPO

利用GRPO改进视觉表示对齐生成

Shentong Mo, Sukmin Yun

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Hanyang University(翰阳大学)

AI总结 提出VRPO方法,通过强化学习将静态对齐损失替换为生成式表示策略优化目标,动态平衡表示一致性与生成质量,在扩散Transformer中实现更快的收敛和更高的图像保真度。

详情
AI中文摘要

最近的扩散Transformer展示了强大的图像合成能力,但由于生成表示与判别表示之间的弱对齐,训练效率仍然较低。虽然表示对齐框架(如REPA)通过将噪声去噪特征与预训练视觉编码器对齐来改善收敛,但其外部监督的对齐损失是静态的,在训练和推理过程中缺乏自适应性。现有方法依赖于固定的余弦对齐或对比目标,无法动态平衡表示一致性和生成质量,导致判别收益有限,且无法以任务自适应方式优化对齐。为了解决这个问题,我们提出了VRPO,一种基于强化学习的优化策略,用生成式表示策略优化目标取代REPA的静态对齐损失。VRPO不强制执行固定的相似性约束,而是将表示对齐视为一个奖励引导的过程:模型根据生成保真度、感知质量以及扩散特征与预训练视觉嵌入之间的语义一致性获得自适应奖励。这种公式使生成器能够不断优化其内部表示,朝向有语义意义的方向,同时提高图像质量。我们的VRPO驱动训练无缝集成到扩散Transformer中,引入可忽略的计算成本,并保持与SiT和DiT架构的完全兼容性。在ImageNet-256x256上的大量实验表明,我们的VRPO-Alignment显著提高了收敛速度和保真度,在相同计算预算下,与REPA相比,FID提升高达1.8,训练速度加快2.3倍。

英文摘要

Recent diffusion transformers have demonstrated strong image synthesis capabilities but remain inefficient to train due to weak alignment between generative and discriminative representations. While representation alignment frameworks such as REPA improve convergence by aligning noisy denoising features with pretrained visual encoders, their externally supervised alignment loss is static and lacks adaptivity during training and inference. Existing methods rely on fixed cosine alignment or contrastive objectives, which cannot dynamically balance representation consistency and generation quality, resulting in limited discriminative benefit and failing to optimize alignment in a task-adaptive manner. To address this, we propose VRPO, a reinforcement-based optimization strategy that replaces REPA's static alignment loss with a generative representation policy optimization objective. Instead of enforcing a fixed similarity constraint, VRPO treats representation alignment as a reward-guided process: the model receives adaptive rewards based on generation fidelity, perceptual quality, and semantic coherence between the diffusion features and pretrained visual embeddings. This formulation enables the generator to continuously refine its internal representations toward semantically meaningful directions while improving image quality. Our VRPO-driven training seamlessly integrates into diffusion transformers, introducing negligible computation cost and preserving full compatibility with SiT and DiT architectures. Extensive experiments on ImageNet-256x256 demonstrate that our VRPO-Alignment substantially enhances both convergence and fidelity, achieving up to +1.8 FID improvement and 2.3x faster training compared to REPA under identical compute budgets.

2606.00573 2026-06-02 cs.LG 版本更新

LASER: Loss-Aware Singular-value Decomposition and Rank Allocation for Efficient Low-Precision Vision-Language Models

LASER: 面向高效低精度视觉-语言模型的损失感知奇异值分解与秩分配

Haiyu Wang, Yutong Wang, Leshu Li, Yihui Ren, Sai Qian Zhang

发表机构 * Tandon School of Engineering, New York University(纽约大学工程学院) Courant Institute of Mathematical Sciences, New York University(纽约大学数学科学学院) Brookhaven National Laboratory(布鲁克海文国家实验室)

AI总结 提出LASER框架,通过损失感知的奇异值分解和跨层秩分配,结合混合量化方案,实现视觉-语言模型在低精度推理下的高效压缩与加速。

详情
AI中文摘要

视觉-语言模型(VLM)具有强大的多模态推理能力,但其巨大的计算开销和高参数数量使得在资源受限设备上部署面临挑战。低秩分解已成为一种有前景的压缩技术,然而现有方法通常优化局部矩阵重建误差,依赖均匀或启发式的秩分配,并且主要关注注意力投影,而前馈网络尚未得到充分探索。在本文中,我们提出 extit{LASER}( extbf{L}oss- extbf{A}ware extbf{S}ingular-value d extbf{E}composition and extbf{R}ank allocation),一种面向高效低精度VLM推理的低秩压缩框架。LASER从模型损失的二阶近似推导出曲率加权的SVD目标,并使用Kronecker分解的Fisher信息来引导分解朝向下游性能而非单纯的重建。我们进一步引入基于校准梯度的损失感知跨层秩分配策略,使得跨层的参数预算分配更加有效。最后,我们通过一种结合SVD与量化的混合方案,将低秩压缩扩展到FFN层。评估结果表明,LASER在低精度推理下相比先前工作实现了超过2.3倍的解码加速,同时保持了强大的准确性。

英文摘要

Vision-language models (VLMs) deliver strong multimodal reasoning capabilities, but their large computational cost and high parameter counts make deployment challenging on resource-constrained devices. Low-rank decomposition has emerged as a promising compression technique, yet existing methods often optimize local matrix reconstruction error, rely on uniform or heuristic rank allocation, and focus mainly on attention projections while leaving feed-forward networks underexplored. In this paper, we propose~\textit{LASER} (\textbf{L}oss-\textbf{A}ware \textbf{S}ingular-value d\textbf{E}composition and \textbf{R}ank allocation), a low-rank compression framework for efficient low-precision VLM inference. LASER derives a curvature-weighted SVD objective from a second-order approximation of the model loss and uses Kronecker-factored Fisher information to guide decomposition toward downstream performance rather than reconstruction alone. We further introduce a loss-aware cross-layer rank allocation strategy based on calibration gradients, enabling more effective parameter budgeting across layers. Finally, we extend low-rank compression to FFN layers through a hybrid scheme that combines SVD with quantization. The evaluation results show that LASER achieves more than $2.3\times$ decoding speedup over previous work while preserving strong accuracy under low-precision inference.

2606.00572 2026-06-02 cs.LG 版本更新

Spatiotemporal Multi-Task Graph Transformer for Trip-Level Transit Prediction

时空多任务图Transformer用于行程级公交预测

Oluwaleke Yusuf, Adil Rasheed, Frank Lindseth

发表机构 * Department of Engineering Cybernetics, Norwegian University of Science and Technology (NTNU)(工程 cybernetics 部,挪威科学与技术大学(NTNU)) Department of Computer Science, Norwegian University of Science and Technology (NTNU)(计算机科学部,挪威科学与技术大学(NTNU))

AI总结 提出SMT-GraphFormer,一种将行程级公交预测建模为序列到序列问题的时空多任务图Transformer,通过图嵌入、上下文编码器和多门专家混合模块,在挪威特隆赫姆公交数据上优于停靠级基线方法。

Comments 25 pages, 7 figures, 11 tables, including appendix. Code available at https://github.com/Outsiders17711/SMTGraphFormer

详情
AI中文摘要

来自公共交通系统的乘客计数数据揭示了城市出行模式,对于规划、运营和优化至关重要。然而,站点和线路之间的非线性时空相互依赖性使得建模和预测具有挑战性。现有方法通常依赖于固定的时间、空间或站点级公式,限制了它们捕捉行程内演变和网络上下文的能力。本研究提出了SMT-GraphFormer,一种时空多任务图Transformer,将行程级公交预测构建为序列到序列建模。给定一条线路的站点序列和行程级上下文,模型预测连续的上下车人数,并将延误和停靠时间作为编码器侧的辅助任务。关键组件包括用于多关系站点相似性的图嵌入、用于天气和时间信息的上下文编码器,以及一个多门专家混合模块,该模块为上下车预测生成任务特定的解码器表示。对挪威特隆赫姆的公共公交数据进行评估表明,SMT-GraphFormer优于站点级表格基线,消融研究考察了每个组件的贡献。序列化公式在下车预测上取得了显著提升(R²提高+0.24),并在上车、延误和停靠时间上持续改进,证实了显式行程级序列偏差和目标间依赖性的价值。这些发现展示了基于Transformer的序列建模在捕捉公共交通复杂时空动态方面的潜力,并强调了针对公交数据定制的架构相对于现成表格模型的价值。所提出的框架为数字孪生环境中的场景分析提供了与预测范围无关的基础,支持规划者和公交运营商的知情决策。

英文摘要

Passenger count data from public transit systems reveals urban mobility patterns and is essential for planning, operation, and optimisation. However, non-linear spatiotemporal interdependencies across stops and lines make modelling and prediction challenging. Existing approaches often rely on fixed temporal, spatial, or stop-level formulations, limiting their ability to capture within-trip evolution and network context. This study proposes SMT-GraphFormer, a spatiotemporal multi-task graph transformer that frames trip-level transit prediction as sequence-to-sequence modelling. Given a line's stop sequence and trip-level context, the model predicts successive boarding and alighting counts, with delay and dwell time treated as encoder-side surrogate tasks. Key components include graph embeddings for multi-relational stop similarity, a context encoder for weather and temporal information, and a multi-gate mixture-of-experts module that produces task-specific decoder representations for boarding and alighting predictions. Evaluation on public bus transit data from Trondheim, Norway, shows that SMT-GraphFormer outperforms stop-level tabular benchmarks, with ablation studies examining each component's contribution. The sequential formulation yields substantial gains on alighting prediction ($+$0.24 in $R^2$) and consistent improvements on boarding, delay, and dwell, confirming the value of explicit trip-level sequential bias and inter-target dependencies. These findings demonstrate the potential of transformer-based sequence modelling for capturing complex spatiotemporal dynamics in public transit and underscore the value of architectures tailored to transit data rather than off-the-shelf tabular models. The proposed framework provides a horizon-agnostic basis for scenario analysis in digital twin environments, supporting informed decision-making by planners and transit operators.

2606.00571 2026-06-02 cs.LG cs.AI cs.CV 版本更新

On the Difficulty of Learning a Meta-network for Training Data Selection

学习用于训练数据选择的元网络的困难性

Zilin Du, Junqi Zhao, Boyang Albert Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 针对元学习训练数据选择(MTS)在实践中表现不佳的问题,本文通过数学分析揭示了梯度信噪比低和缺乏信息特征两大障碍,并提出增大批大小和利用信息特征作为解决方案。

详情
AI中文摘要

合成数据越来越多地被用于训练神经网络,但若不加区分地使用,其与真实数据的分布不匹配会限制其有效性。一种常见策略是通过双层优化学习数据权重,我们称之为元学习训练数据选择(MTS)。有趣的是,在实践中,MTS 往往低于预期。我们识别了正确训练 MTS 的两个障碍:梯度信噪比(GSNR)低导致优化困难,以及缺乏与数据质量相关的信息特征。我们对 MTS 进行了数学分析,揭示了归一化数据权重的动态以及不同数据质量与低 GSNR 之间的关系。分析表明,一个简单而有效的解决方案是增大批大小。此外,我们提出了一组信息特征,用于捕捉训练数据在其分布中的位置和训练动态。在四个基准上的实验显示了一致的改进,与无选择的训练相比平均提升 5.49%,与最强基线相比平均提升 2.89%。

英文摘要

Synthetic data are increasingly used to train neural networks, yet distributional mismatch with real data limits their effectiveness when used indiscriminately. A common strategy is to learn data weights via bi-level optimization, which we refer to as Meta-learning for Training-data Selection (MTS). Interestingly, in practice, MTS often performs below expectation. We identify two obstacles in properly training MTS: a poor gradient signal-to-noise ratio (GSNR), which causes optimization difficulties, and lack of informative features that correlates with data quality. We present a mathematical analysis of MTS, which reveals the dynamics of normalized data weights and the relation between disparate data quality and poor GSNR. The analysis suggests a a simple yet effective solution: increasing the batch size. Further, we propose a set of informative features that capture the positions of training data in their distributions and training dynamics. Experiments across four benchmarks show consistent improvements, achieving average gains of 5.49% over training without selection and 2.89% over the strongest baseline.

2606.00566 2026-06-02 cs.LG cs.CL cs.CR 版本更新

Same Payload, Different Channel: Measuring Trust Asymmetry in Tool-Using Language Models

相同载荷,不同通道:测量使用工具的語言模型中的信任不对称性

Mohammed Sameer Syed, Rozhin Yasaei

发表机构 * University of Arizona(亚利桑那大学)

AI总结 本研究提出安全不对称分数(SAS),通过匹配恶意载荷仅改变传递上下文,系统测量了语言模型在不同通道(用户消息、工具元数据、工具输出)中对对抗性内容的脆弱性差异,发现代理原生模型在工具描述通道更脆弱,而通用模型相反,且机制研究表明安全相关表示在深层网络非线性编码。

Comments 13 pages, 1 figure. Submitted to EMNLP 2026

详情
AI中文摘要

随着语言模型承担代理角色,包括调用外部API、读取工具输出以及执行嵌入在第三方内容中的指令,其攻击面远超用户输入。模型是否以相同方式处理恶意指令(无论其来源)尚未被系统研究。我们引入了安全不对称分数(SAS),通过使用匹配的载荷对(保持恶意文本相同,仅改变传递上下文)来测量模型对对抗性内容的敏感性如何随内容出现在用户消息、工具元数据或工具输出中而变化。在6个生产级LLM和三种攻击家族上的评估发现了一致且信息丰富的不对称性:当对抗性内容通过工具描述而非用户消息传递时,代理原生模型显著更脆弱,而通用模型则相反。当相同内容通过工具输出而非描述传递时,这种不对称性进一步反转,表明模型隐含地将工具元数据视为可信指令,而将工具结果视为普通数据。对Llama 3.3 70B的机制研究表明,安全相关表示在网络的中间到深层因果存在但非线性编码,解释了线性探针为何无法检测到它。这些发现揭示了当前使用工具的模型在处理对抗性内容时存在的系统性、通道依赖的盲点。

英文摘要

As language models take on agentic roles that span calling external APIs, reading tool outputs, and acting on instructions embedded in third-party content, their attack surface expands well beyond what users type. Whether a model treats a malicious instruction the same way regardless of where it arrives has not been systematically studied. We introduce the Safety Asymmetry Score (SAS), which measures how much a model's susceptibility to adversarial content shifts depending on whether that content arrives in the user message, tool metadata, or tool output, using matched payload pairs that keep the malicious text identical and vary only the context of delivery. Evaluated across 6 production LLMs and three attack families, we find a consistent and informative asymmetry: agent-native models are substantially more vulnerable when adversarial content arrives via tool descriptions than via user messages, while general-purpose models show the reverse. This asymmetry further inverts when the same content is delivered through tool outputs rather than descriptions, suggesting models implicitly treat tool metadata as trusted instructions and tool results as ordinary data. A mechanistic study on Llama 3.3 70B reveals that the safety-relevant representation is causally present at mid-to-late network depths but non-linearly encoded, explaining why linear probes fail to detect it. These findings expose a systematic, channel-dependent blind spot in how current tool-using models handle adversarial content.

2606.00563 2026-06-02 cs.LG cs.AI stat.ML 版本更新

A Practical Upper Bound on Selection Bias Effects in Medical Prediction Models

医学预测模型中选择偏差影响的一个实用上界

Kara Liu, Maggie Wang, Russ B. Altman

发表机构 * Stanford University(斯坦福大学)

AI总结 针对选择偏差导致模型泛化性差的问题,提出在仅部分观测选择机制和目标分布的现实条件下,对目标群体最差模型性能的一个新上界,并通过合成数据和真实数据验证其有效性和实用性。

Comments 32 pages, 27 figures, will be published at ACM SIGKDD '26

详情
AI中文摘要

选择偏差是真实世界数据中常见且往往不可避免的一个方面,它挑战了机器学习模型的泛化性。当在偏倚数据上训练的模型被部署到更广泛的目标群体时,模型泛化能力差可能导致实际危害,尤其是在医疗保健等高危环境中。这种风险凸显了从业者在部署前可靠评估模型泛化性的需求。然而,现有的预测模型性能的方法依赖于不切实际地访问目标分布或了解导致偏差的选择机制。为了解决这些局限性,我们提出了一个新颖的上界,用于在现实设置下目标群体上的最差模型性能,其中选择机制和目标群体数据仅被部分观测。我们通过在完全合成数据、源自All of Us研究计划的半合成数据以及MIMIC-IV中的真实世界选择偏差上的实验,证明了我们方法的有效性和实际效用。我们的工作提供了一个原则性和实用性的工具,用于估计在原本难以处理的情况下选择偏差的影响,从而使从业者能够在医疗保健及其他领域构建更安全、更具泛化性的模型。

英文摘要

Selection bias is a common and often unavoidable aspect of real-world data that challenges the generalizability of machine learning models. When models trained on biased data are deployed in the broader target population, poor model generalization may lead to real harm, particularly in high-risk settings such as healthcare. This risk highlights the need for practitioners to reliably assess model generalizability prior to deployment. However, existing methods for predicting model performance rely on unrealistic access to the target distribution or knowledge of the selection mechanism causing bias. To address these limitations, we propose a novel upper bound on the worst-case model performance on the target population under the realistic setting where the selection mechanism and the target population data are only partially observed. We demonstrate the validity and practical utility of our method through experiments on fully synthetic data, semi-synthetic data derived from the All of Us Research Program, and real-world selection bias in MIMIC-IV. Our work offers a principled and practical tool to estimate the impact of selection bias in an otherwise intractable setting, thereby enabling practitioners to build safer and more generalizable models in healthcare and beyond.

2606.00562 2026-06-02 cs.CV cs.LG 版本更新

DeepLatent: Think with Images via Parallel Latent Visual Reasoning

DeepLatent: 通过并行潜在视觉推理用图像思考

Dongchen Lu, Zhimo Li, Mao Shu, Huo Cao

发表机构 * Baidu Inc.(百度公司) Peking University(北京大学)

AI总结 提出DeepLatent框架,通过LatentFormer并行生成潜在视觉状态,并结合连续空间强化学习优化潜在表示,在多个基准上达到最先进性能。

详情
AI中文摘要

“用图像思考”的新兴范式将视觉状态嵌入中间推理步骤,定义了视觉语言模型的新前沿。现有方法沿两条路线分化。工具辅助方法应用显式视觉操作,但存在高延迟和操作类型受限的问题。潜在推理方法自回归生成隐式视觉状态,但性能不如工具辅助方法,且其潜在标记无法捕获有效的视觉信息。在这项工作中,我们提出DeepLatent,一个用于潜在视觉推理的并行框架。首先,我们引入LatentFormer。它使用可学习的2D标记并行生成上下文条件的潜在状态,将每次视觉更新直接锚定在原始图像特征中。其次,我们设计了一种连续空间强化学习算法。它直接在嵌入空间中优化潜在调制参数,显著提高潜在表示质量。该框架通过知识蒸馏和连续空间强化学习算法进行训练。此外,我们贡献了DeepLatent-180K,一个专为潜在视觉推理定制的大规模数据集。在多个基准上的广泛评估表明,DeepLatent达到了最先进的性能。

英文摘要

The emerging paradigm of "thinking with images" embeds visual states into intermediate reasoning steps, defining a new frontier for Vision-Language Models. Existing approaches diverge along two lines. Tool-assisted methods apply explicit visual operations but suffer from high latency and restricted manipulation types. Latent reasoning methods autoregressively produce implicit visual states, but underperform tool-assisted methods, and their latent tokens fail to capture effective visual information. In this work, we propose DeepLatent, a parallel framework for latent visual reasoning. First, we introduce LatentFormer. It uses learnable 2D tokens to generate context-conditioned latent states in parallel, anchoring every visual update directly in the original image features. Second, we design a continuous-space reinforcement learning algorithm. It optimizes latent modulation parameters directly in the embedding space, significantly improving latent representation quality. The framework is trained via knowledge distillation followed by this continuous-space RL algorithm. Furthermore, we contribute DeepLatent-180K, a large-scale dataset tailored for latent visual reasoning. Extensive evaluations across multiple benchmarks demonstrate that DeepLatent achieves state-of-the-art performance.

2606.00561 2026-06-02 cs.LG cs.AI 版本更新

Interpretable Policy Distillation for Power Grid Topology Control

可解释的策略蒸馏用于电网拓扑控制

Aleksandra Dmitruka, Karlis Freivalds

发表机构 * University of Latvia, Faculty of Exact Sciences and Technology(拉脱维亚大学,精确科学与技术学院)

AI总结 提出一种将深度强化学习策略蒸馏为轻量级决策树/随机森林的方法,在保持性能的同时提升可解释性,并揭示表征偏移。

详情
AI中文摘要

深度强化学习为实时电网运行提供了有前景的途径,但大型神经策略评估成本高、难以在受限硬件上部署,且对操作员不透明。我们探究用于电网拓扑控制的近端策略优化(PPO)智能体能否压缩为紧凑的树基替代模型而不损失运行性能。在Grid2Op的标准14节点环境中,使用面向稳定性的奖励,通过压力聚焦的数据收集在关键高负荷状态下训练PPO教师。然后将策略蒸馏为决策树和随机森林。在保留的验证回合中,两个替代模型在平均奖励和生存时长上均超过教师,而推理成本仅为教师的一小部分。决策树与PPO argmax的动作完全一致率较高,且在其排名靠前的动作中几乎完全一致,同时保持足够小以便直接检查。特征重要性分析揭示了表征偏移:PPO策略主要依赖线路负载信号,而蒸馏树主要由母线拓扑变量驱动。这些结果表明,压力聚焦的蒸馏可以将黑箱神经控制器转换为轻量级、可审计的规则类替代模型,适用于实时部署,同时揭示与确定性动作和拓扑特定泛化相关的风险。

英文摘要

Deep reinforcement learning (RL) offers a promising route to real-time power grid operation, yet large neural policies are costly to evaluate, hard to deploy on constrained hardware, and opaque to operators. We ask whether a Proximal Policy Optimization (PPO) agent for grid topology control can be compressed into compact tree-based surrogates without losing operational performance. A PPO teacher is trained on Grid2Op's standard 14-bus environment with a stability-oriented reward, using stress-focused data collection on critical, high-loading states. The policy is then distilled into a decision tree and a random forest. Across held-out validation episodes, both surrogates exceed the teacher in mean reward and survival length at a fraction of the inference cost. The decision tree shows high exact-action agreement with the PPO argmax and near-complete agreement within its top-ranked actions, while remaining small enough to be inspected directly. Feature-importance analysis reveals a representational shift: the PPO policy relies mainly on line-loading signals, while the distilled tree is driven primarily by bus-topology variables. These results suggest that stress-focused distillation can convert a black-box neural controller into a lightweight, auditable rule-like surrogate suited for real-time deployment, while also surfacing risks tied to deterministic actions and topology-specific generalization.

2606.00559 2026-06-02 cs.LG cs.AI 版本更新

Richer Representations for Neural Algorithmic Reasoning via Auxiliary Reconstruction

通过辅助重建实现神经算法推理的更丰富表示

Jiafu Huang, Chao Peng, Chenyang Xu, Zhengfeng Yang, Kecheng Cai, Chenhao Zhang, Yi Wang, Yiwei Gong, Wanqin Zhou, Irene Zheng

发表机构 * sei.ecnu.edu.cn(东华大学信息科学与工程学院)

AI总结 提出辅助重建模块和自监督学习变体,增强编码器对输入状态信息的保留和特征间依赖的捕捉,从而提升神经算法推理性能。

Comments Appeared at AAAI 2026

详情
AI中文摘要

神经算法推理已成为一个热门研究方向。它旨在训练神经网络模仿经典基于规则的算法的逐步行为。更具体地说,此类算法的执行可以抽象为一系列状态,其中每个状态代表执行步骤后的中间结果。训练目标是生成复制底层算法过程的状态序列。该任务的常见框架采用编码器-处理器-解码器架构,其中编码器学习状态的表示,处理器模拟算法步骤,解码器重建输出状态。虽然先前的工作侧重于改进处理器,但编码器在表示学习中的作用很少受到关注。大多数方法依赖简单的MLP编码器,这引发了一个问题:这些表示是否足够信息丰富以支持算法推理。本文研究如何改进神经算法推理的编码器表示。我们提出一个重建模块,旨在从其编码表示中恢复输入状态。这个辅助重建任务鼓励编码器保留关于输入的关键信息。我们证明,在训练过程中加入此任务可以提高现有神经架构在标准基准上的性能。此外,我们观察到当前编码器常常未充分利用状态内特征之间的相关性。为了解决这个问题,我们从自监督学习中汲取灵感,设计了一个增强的辅助任务变体,鼓励编码器捕捉状态内特征依赖。实验结果表明,我们的方法使编码器能够学习更丰富的表示,从而增强现有处理器在算法推理任务上的性能。

英文摘要

Neural algorithmic reasoning has emerged as a popular research direction. It aims to train neural networks to mimic the step-by-step behavior of classical rule-based algorithms. More specifically, the execution of such algorithms can be abstracted as a sequence of states, where each state represents the intermediate outcome after an execution step. The training objective is to generate state sequences that replicate the underlying algorithmic process. A common framework for this task adopts an encoder-processor-decoder architecture, where the encoder learns representations of states, the processor simulates algorithmic steps, and the decoder reconstructs output states. While prior work has focused on improving the processor, the role of the encoder in representation learning has received little attention. Most methods rely on simple MLP encoders, raising the question of whether such representations are sufficiently informative for supporting algorithmic reasoning. This paper investigates how to improve encoder representations for neural algorithmic reasoning. We propose a reconstruction module that aims to recover the input state from its encoded representation. This auxiliary reconstruction task encourages the encoder to retain critical information about the input. We demonstrate that incorporating this task during training improves the performance of existing neural architectures on standard benchmarks. Furthermore, we observe that current encoders often underutilize the correlations among features within a state. To address this, we draw inspiration from self-supervised learning and design an enhanced variant of the auxiliary task that encourages the encoder to capture intra-state feature dependencies. Experimental results show that our method enables the encoder to learn richer representations, thereby enhancing the performance of existing processors on algorithmic reasoning tasks.

2606.00557 2026-06-02 cs.LG 版本更新

Normalized Relevance Measure as a Unifying Framework to Explain Neural Network Latent Structures

归一化相关度量作为解释神经网络潜在结构的统一框架

Ping Xiong, Thomas Schnake, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

发表机构 * Berlin Institute for the Foundations of Learning(学习与数据基础研究院) Machine Learning Group, Technical University of Berlin(柏林技术大学机器学习组) Department of Artificial Intelligence, Korea University(韩国大学人工智能系) Max Planck Institute for Informatics(信息研究所) Department of Chemistry, Chemical Physics Theory Group, University of Toronto(多伦多大学化学系、化学物理理论组) Vector Institute for Artificial Intelligence(人工智能矢量研究所) Acceleration Consortium, University of Toronto(多伦多大学加速联盟)

AI总结 提出归一化相关度量(NRM)框架,通过定义归一化符号度量来归因任意层神经元的相关性,统一了现有传播解释算法,并在VGG16中展示多层级联合分析揭示信息流。

详情
AI中文摘要

为了理解神经网络(NN)的功能和预测,越来越清楚的是,仅分析输入域是不够的——还必须检查其内部推理机制以获取完整图景。为了解释此类模型的内部推理机制,分析潜在表示对于给定任务的重要性至关重要。在本文中,我们提出了\emph{归一化相关度量}(NRM)框架——一种新颖的通用解释过程,将相关性归因于\emph{任意架构中跨层的任意神经元集合}。在NRM框架中,所选神经元的相关性被明确定义为归一化符号度量,使用简单操作——基于加法和乘法法则的边际化和条件化——类似于概率度量。归一化性质进一步保证了跨层的可比性。NRM框架通过明确识别正在计算的底层量,涵盖了现有的基于传播的解释算法。我们在计算机视觉应用中展示了该框架的实用性,其中跨多个层的联合相关性分析揭示了VGG16网络中的关键信息流。总体而言,NRM框架提供了一种通用的、数学上严谨的方法来理解现代NN如何传播信息,为可解释人工智能提供了多功能且广泛适用的基础。

英文摘要

To understand how a neural network (NN) functions and makes predictions, it has become increasingly clear that analyzing only the input domain is insufficient -- one must also examine its internal inference mechanisms to capture the complete picture. To explain the internal inference mechanisms of such models, it is essential to analyze the importance of latent representations for a given task. In this paper, we propose the \emph{normalized relevance measure} (NRM) framework -- a novel general explanation procedure that attributes relevance to \emph{arbitrary sets of neurons across layers of arbitrary architectures}. In the NRM framework, relevance of selected neurons is explicitly defined as a normalized signed measure, constructed using simple operations -- marginalization and conditioning based on additive and multiplicative laws -- in analogy to the probability measures. The normalization property further guarantees comparability across layers. The NRM framework subsumes existing propagation-based explanation algorithms by explicitly identifying the underlying quantity being computed. We demonstrate the utility of the framework in computer vision applications, where joint relevance analysis across multiple layers reveals key information flows in VGG16 networks. Overall, the NRM framework provides a general, mathematically grounded approach to understanding how modern NNs propagate information, offering a versatile and broadly applicable foundation for explainable artificial intelligence.

2606.00548 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CAFOSat: A Strongly Annotated Dataset for Infrastructure-Aware CAFO Mapping Using High-Resolution Imagery

CAFOSat:用于基于高分辨率影像的基础设施感知型CAFO制图的高质量标注数据集

Oishee Bintey Hoque, Nibir Chandra Mandal, Mandy L Wilson, Samarth Swarup, Madhav Marathe, Abhijin Adiga

发表机构 * University of Virginia(弗吉尼亚大学) Biocomplexity Institute, University of Virginia(弗吉尼亚大学生物复杂性研究所)

AI总结 针对集中式动物饲养操作(CAFO)大规模制图困难,提出CAFOSat数据集,集成高分辨率NAIP影像与多源CAFO清单,通过人机协同标注、GradCAM定位和几何聚类优化弱定位记录,并引入合成增强管道,实现基础设施级标注和鲁棒分类。

Comments Accepted at CVPR Workshop-2026. First two authors has equal contribution

详情
AI中文摘要

集中式动物饲养操作(CAFO)在农业生产中发挥重要作用,但也与环境、公共卫生和疾病监测问题相关。由于基础设施布局异质、位置记录噪声大、标注不一致以及清单不完整,从遥感影像大规模制图CAFO仍具挑战。我们引入CAFOSat,一个用于美国全境CAFO制图的高质量标注、基础设施感知数据集。CAFOSat集成高分辨率国家农业影像计划(NAIP)影像与跨州收集的多源CAFO清单,并通过结合AI辅助标注、基于GradCAM的定位和几何聚类的人机协同管道,将弱地理定位记录转化为精细标注。为提高数据集质量,我们利用土地覆盖引导采样和空间排除约束筛选具有挑战性的负样本,并通过人工验证提供基础设施级标注,包括畜棚、粪池和放牧相关特征。最终数据集包含超过45,000个图像块,覆盖20个州和四大CAFO类别。我们对多种卷积、基于Transformer和视觉-语言模型进行基准测试,证明了精细标注和精心筛选的负样本在CAFO分类和泛化中的价值。此外,我们引入一个合成增强管道,生成基础设施感知的变体以增加训练多样性并提升分布偏移下的鲁棒性。CAFOSat为推进基础设施感知的农业监测和基于高分辨率遥感影像的CAFO制图提供了大规模基准。

英文摘要

Concentrated Animal Feeding Operations (CAFOs) play an important role in agricultural production but are also associated with environmental, public health, and disease surveillance concerns. Large-scale mapping of CAFOs from remote sensing imagery remains challenging due to heterogeneous infrastructure layouts, noisy location records, inconsistent annotations, and incomplete inventories. We introduce CAFOSat, a strongly annotated, infrastructure-aware dataset for CAFO mapping across the United States. CAFOSat integrates high-resolution National Agriculture Imagery Program (NAIP) imagery with multi-source CAFO inventories collected across multiple states and transforms weak geolocation records into refined annotations through a human-in-the-loop pipeline combining AI-assisted annotation, GradCAM-based localization, and geometric clustering. To improve dataset quality, we curate challenging negative samples using land-cover-guided sampling with spatial exclusion constraints and provide infrastructure-level annotations, including barns, manure ponds, and grazing-related features, through manual verification. The resulting dataset contains more than 45,000 image patches spanning 20 states and four major CAFO categories. We benchmark a diverse set of convolutional, transformer-based, and vision-language models, demonstrating the value of refined annotations and curated negative samples for CAFO classification and generalization. In addition, we introduce a synthetic augmentation pipeline that generates infrastructure-aware variations to increase training diversity and improve robustness under distribution shifts. CAFOSat provides a large-scale benchmark for advancing infrastructure-aware agricultural monitoring and CAFO mapping from high-resolution remote sensing imagery.

2606.00545 2026-06-02 cs.LG 版本更新

The Assistant as a Privileged Persona: A canonical reference in cross-persona self-recognition

助手作为特权角色:跨角色自我识别中的规范参考

Asvin G

发表机构 * Institute for Advanced Study, Princeton(普林斯顿高级研究院)

AI总结 本文研究后训练语言模型在跨角色作者身份判断中的表现,发现助手角色作为规范参考,其熵信号和角色向量距离紧密耦合,且这种耦合仅对助手角色成立。

Comments Project out of Anthropic Fellows

详情
AI中文摘要

后训练语言模型能够从上下文中的一两句话识别自己的输出。在配套论文 \citep{jack2026twomodes} 中,我们展示了它们还能通过助手模式生成的尖锐熵降来识别当前是否在策略上行动。这两个信号都与后训练主要塑造的助手角色相关。 本文将框架扩展到 Llama-3.1-70B-Instruct 上的跨角色作者身份判断。我们测量了一个由评估者和生成者角色(从图书管理员到龙到莎士比亚)组成的面板上的作者身份声称率矩阵,并提出两个主张。 首先,在助手自己的矩阵行上,助手的声称率、激活空间中与助手的角色向量距离,以及助手对某个角色文本的惊讶与该角色对自己文本的惊讶之间的熵差,三者紧密耦合。这扩展了配套论文中“行动”的熵特征,使之成为“已行动”的回顾性特征。 其次,这种耦合在助手行之外失效:熵差的自然对称扩展不能预测独特评估者(海盗、龙、莎士比亚)的作者身份;起作用的是非对称的——评估者与助手对同一文本的惊讶比较,而非与生成者的比较。我们通过尝试许多候选替代角色排除了任何其他角色都能扮演这一参考角色的可能性。我们将这种非对称性解释为模型在执行隐式贝叶斯似然比检验,以助手作为规范备择假设,而 \citet{chen2025persona} 的角色向量几何(每个角色都是助手的一个增量)确保了助手是唯一普遍可被该检验访问的角色。

英文摘要

Post-trained language models can recognize their own outputs from a sentence or two out of context. In a companion paper \citep{jack2026twomodes} we showed they can also recognize when they are currently acting on-policy, through the sharp entropy drop of assistant-mode generation. Both signals are tied to the Assistant persona that post-training mainly shapes. This paper widens the frame to cross-persona authorship judgement on Llama-3.1-70B-Instruct. We measure a matrix of authorship claim rates over a panel of evaluator and generator personas spanning librarian to dragon to Shakespeare, and make two claims. \emph{First}, on the Assistant's own row of the matrix, the Assistant's claim rate, the persona-vector distance from the Assistant in activation space, and the entropy gap between the Assistant's surprise on a persona's text and the persona's surprise on its own text are all tightly coupled. This extends the entropy signature of \emph{acting} from the companion paper to a retrospective signature of \emph{having acted}. \emph{Second}, this coupling fails off the Assistant's row: the natural symmetric extension of the entropy gap does not predict authorship for distinctive evaluators (pirate, dragon, Shakespeare); what does is asymmetric -- the evaluator's surprise compared to the Assistant's surprise on the same text, not to the generator's. We rule out the alternative that any persona could play this reference role by trying many candidate substitutes; none does. We interpret the asymmetry as the model performing an implicit Bayesian likelihood-ratio test against the Assistant as the canonical alternative hypothesis, with the persona-vector geometry of \citet{chen2025persona} (every persona a delta off the Assistant) ensuring that the Assistant is the only persona universally accessible to that test.

2606.00544 2026-06-02 cs.LG cs.CL 版本更新

Escaping the Mode Lottery: Multi-Response Training Improves Language Model Generalization

逃离模式抽彩:多响应训练提升语言模型泛化能力

Hasan Amin, Kian Ahrabian, Ming Yin, Rajiv Khanna

发表机构 * Department of Computer Science, Purdue University(计算机科学系,普渡大学)

AI总结 本文提出多响应训练(MRT)方法,通过保留每个提示的多个有效响应来缓解传统单响应微调导致的“模式抽彩”问题,并从统计角度揭示了其提升分布泛化的原理和适用条件。

详情
AI中文摘要

现代语言模型微调通常为每个提示配对单个响应,尽管许多提示允许多个有效补全。这实际上将多模态条件分布简化为单样本视图,我们称之为“模式抽彩”现象,其中训练强调一部分合理模式而忽略其他模式。我们研究了多响应训练(MRT),该方法保留每个提示的多个响应,并建立了关于何时以及为何有帮助的原则性解释。我们的关键见解是,提示和响应是不同的统计资源:额外的提示减少输入分布的不确定性,而额外的响应减少条件输出分布的不确定性。这产生了方差-预算权衡,预测了何时保留多个响应是有价值的,显示了随着提示级不确定性占主导地位而收益递减,并解释了为什么大型冗余语料库可以表现出隐式的多响应效应。我们进一步分析了响应选择,并表明Random-K-of-N是分布微调的无偏默认选择,基于奖励的选择可能导致模式坍缩,而子模质量-多样性目标提供了一种具有理论保证的高效替代方案。受控模拟验证了预测的方差和选择效应,包括一个惊人的失败模式,其中仅奖励选择产生的梯度与真实目标不一致。在结构化和真实世界数据集上,包括一个新的多提示、多响应基准,MRT一致地改善了分布泛化,在响应多样性高、提示冗余性低的场景中收益最大。MRT将响应多重性重新定义为数据分配问题,并提供了明确的指导:当响应廉价且多样时,保留多个响应不是启发式方法,而是基于统计的选择。

英文摘要

Modern language-model fine-tuning typically pairs each prompt with a single response, even though many prompts admit multiple valid completions. This effectively reduces a multi-modal conditional distribution to a one-sample view, a phenomenon we call the "mode lottery," where training emphasizes a subset of plausible modes while leaving others underrepresented. We study multi-response training (MRT), which retains multiple responses per prompt, and develop a principled account of when and why it helps. Our key insight is that prompts and responses are distinct statistical resources: additional prompts reduce uncertainty about the input distribution, while additional responses reduce uncertainty about the conditional output distribution. This yields a variance-budget tradeoff that predicts when retaining multiple responses is worthwhile, shows diminishing returns as prompt-level uncertainty dominates, and explains why large redundant corpora can exhibit an implicit multi-response effect. We further analyze response selection, and show that Random-K-of-N is the unbiased default for distributional fine-tuning, reward-based selection can induce mode collapse, and a submodular quality-diversity objective provides an efficient alternative with theoretical guarantees. Controlled simulations validate the predicted variance and selection effects, including a striking failure mode where reward-only selection produces gradients misaligned with the true objective. Across structured and real-world datasets, including a new multi-prompt, multi-response benchmark, MRT consistently improves distributional generalization, with the largest gains in high response-diversity, low prompt-redundancy regimes. MRT reframes response multiplicity as a data-allocation problem with clear guidance: when responses are cheap and diverse, keeping more than one is not a heuristic, but a statistically grounded choice.

2606.00539 2026-06-02 cs.LG math.OC stat.ML 版本更新

GNMR: Runtime Stability Control for Low-Precision Large Language Model Training

GNMR: 低精度大语言模型训练的运行时稳定性控制

Boao Kong, Weichen Jia, Engao Zhang, Guohong Li, Yonghan Dong, Yao Wang, Yaoyuan Wang, Yunke Peng, Kun Yuan

发表机构 * Peking University(北京大学) Huawei Technologies Ltd.(华为技术有限公司)

AI总结 针对低精度语言模型训练中的稳定性瓶颈,提出基于梯度范数与历史均值之比(GNMR)的轻量级运行时控制器,通过局部风险信号映射到有界恢复动作,在不改变数值格式或后端的情况下提升训练稳定性。

Comments 29 pages, 4 figures, 15 tables

详情
AI中文摘要

训练稳定性是低精度语言模型训练的关键瓶颈:高效的低成本路径仍可能在少量算子处产生短暂的数值风险。我们将此问题形式化为运行时稳定性控制,并提出梯度范数与历史均值之比(GNMR),一种轻量级控制器,将每个可恢复单元的当前梯度范数与其历史均值进行比较。结合用于检测短窗口内突增的$Δ$-GNMR,GNMR在硬$\mathrm{maxO}$预算和短锁定间隔下将局部风险信号映射到有界恢复动作,而不改变数值格式、内核或后端方案。在激活量化压力测试、DeepSeek风格的配方级训练以及LLaMA-2 13B微调中,GNMR以稀疏且预算受限的恢复保持了高保真质量。这些结果支持GNMR作为一种与后端无关的控制器,在保持低成本执行的同时提高低精度训练的稳定性。

英文摘要

Training stability is a key bottleneck in low-precision language model training: efficient low-cost paths can still produce short-lived numerical risks at a small set of operators. We formulate this as runtime stability control and present Gradient Norm-to-Mean Ratio (GNMR), a lightweight controller that compares each recoverable unit's current gradient norm with its historical mean. Together with $Δ$-GNMR for abrupt short-window increases, GNMR maps local risk signals to bounded recovery actions under a hard $\mathrm{maxO}$ budget and a short lock interval, without changing the numerical format, kernel, or backend recipe. Across activation-quantization stress, DeepSeek-style recipe-level training, and LLaMA-2 13B fine-tuning, GNMR preserves high-fidelity quality with sparse, budgeted recovery. These results support GNMR as a backend-agnostic controller to improve low-precision training stability while preserving low-cost execution.

2606.00535 2026-06-02 cs.LG 版本更新

DREAM-S: Speculative Decoding with Searchable Drafting and Target-Aware Refinement for Multimodal Generation

DREAM-S: 基于可搜索草稿与目标感知精炼的推测解码用于多模态生成

Zining Liu, Yunhai Hu, Tianhua Xia, Bo Bao, Eric Sather, Vithursan Thangarasa, Sai Qian Zhang

发表机构 * New York University(纽约大学) Cerebras Systems Inc.(Cerebras Systems公司) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出DREAM-S框架,通过神经架构搜索和目标感知超网训练自动优化草稿模型架构与交互策略,结合注意力熵引导的自适应中间特征蒸馏,实现视觉语言模型的高效推测解码,加速比达3.85倍。

详情
AI中文摘要

推测解码(SD)已被证明是加速大型语言模型(LLM)自回归生成的有效技术,然而其在视觉语言模型(VLM)中的应用仍相对未被探索。我们提出 extit{DREAM-S},一个专门为VLM中快速高效解码设计的新型SD框架。DREAM-S利用神经架构搜索(NAS)框架与目标感知超网训练,自动识别草稿模型与目标模型之间的最优交互策略,以及最适合底层硬件实现平台的草稿模型架构。此外,DREAM-S还结合了由注意力熵引导的自适应中间特征蒸馏,以实现高效的草稿训练。在一系列成熟的VLM上的实验表明,与标准解码方法相比,DREAM-S实现了高达$3.85 imes$的加速,并显著优于现有的SD基线。代码已公开:https://github.com/SAI-Lab-NYU/DREAM-S。

英文摘要

Speculative decoding (SD) has proven to be an effective technique for accelerating autoregressive generation in large language models (LLMs) however, its application to vision-language models (VLMs) remains relatively unexplored. We propose~\textit{DREAM-S}, a novel SD framework designed specifically for fast and efficient decoding in VLMs. DREAM-S leverages a neural architecture search (NAS) framework with target-aware supernet training to automatically identify both the optimal interaction strategy between the draft and target models, and the most suitable draft model architecture for the underlying hardware implementation platform. DREAM-S additionally incorporates adaptive intermediate feature distillation, guided by attention entropy, to enable efficient draft training. Experiments on a range of well-established VLMs show that DREAM-S achieves up to a $3.85\times$ speedup compared to standard decoding approaches and significantly outperforms existing SD baselines. The code is publicly available at: https://github.com/SAI-Lab-NYU/DREAM-S .

2606.00520 2026-06-02 math.OC cs.LG stat.ML 版本更新

In-Expectation Convergence of Stochastic Gradient Methods under Heavy-Tailed Noise

重尾噪声下随机梯度方法的期望收敛性

Zijian Liu

发表机构 * Stern School of Business, New York University(纽约大学斯特恩商学院)

AI总结 针对重尾噪声(有限p阶矩,p∈(1,2))下随机梯度方法的收敛性问题,证明了随机镜像下降(SMD)、加速随机镜像下降(ASMD)在凸优化中以及SGD和带动量的SGD(SGDM)在非凸优化中的期望收敛性,无需算法修改或有界域假设。

详情
AI中文摘要

许多随机梯度方法被认为在随机梯度的噪声仅具有有限$p$阶矩($p\in\left(1,2\right)$)时不会收敛,这种设置被称为重尾噪声假设。然而,最近的一些研究发现,随机梯度下降($\textsf{SGD}$)无需对其更新规则进行任何修改,就能在有界域的凸问题中出人意料地收敛,这凸显了经典随机梯度方法的潜力。受这一最新进展的启发,我们对重尾噪声下的随机优化进行了全面研究,并为凸优化中的随机镜像下降($\textsf{SMD}$)和加速随机镜像下降($\textsf{ASMD}$)以及非凸优化中的$\textsf{SGD}$和带动量的随机梯度下降($\textsf{SGDM}$)建立了新的期望收敛结果。值得注意的是,我们的结果不仅无需算法修改,而且避免了先前工作中施加的限制性假设,如有界域。更重要的是,我们的分析为研究重尾随机优化提供了一个新颖、优雅且强大的框架,为理解一阶随机梯度方法开辟了一条新途径。

英文摘要

Many stochastic gradient methods are believed not to converge when the noise in stochastic gradients has only a finite $p$-th moment for $p\in\left(1,2\right)$, a setting known as the heavy-tailed noise assumption. However, some recent studies have found that Stochastic Gradient Descent ($\textsf{SGD}$), without any modification to its update rule, can surprisingly converge in expectation for convex problems with bounded domains, highlighting the potential of classical stochastic gradient methods. Inspired by this recent progress, we provide a comprehensive study of stochastic optimization under heavy-tailed noise and establish new in-expectation convergence results for Stochastic Mirror Descent ($\textsf{SMD}$) and Accelerated Stochastic Mirror Descent ($\textsf{ASMD}$) in convex optimization, and for $\textsf{SGD}$ and Stochastic Gradient Descent with Momentum ($\textsf{SGDM}$) in nonconvex optimization. Notably, our results not only hold without algorithmic changes but also avoid restrictive assumptions, such as bounded domains, imposed in prior work. More importantly, our analysis provides a new, elegant, and powerful framework for studying heavy-tailed stochastic optimization, opening a new route to understanding first-order stochastic gradient methods.

2606.00514 2026-06-02 cs.LG cs.CV 版本更新

Generate in Reconstruction Space, Match in Semantic Space: Transport Geometry for One-Step Generation

在重建空间中生成,在语义空间中匹配:一步生成的传输几何

Hugues Van Assel, Edward De Brouwer, Saeed Saremi, Gabriele Scalia, Aviv Regev

发表机构 * Genentech(基因泰克)

AI总结 本文研究自监督表示学习(SSL)特征在一步生成模型中的作用,提出在语义特征空间中使用Sinkhorn散度进行分布匹配,显著降低ImageNet FID,并揭示了评估指标与训练特征之间的潜在冲突。

Comments 26 pages, 4 figures

详情
AI中文摘要

生成建模和自监督表示学习(SSL)优化结构不同的目标:生成训练奖励分布保真度,而SSL奖励语义一致性。然而,最近的研究反复发现SSL特征改善了生成训练,尽管这种协同作用的机制仍不清楚。在这里,我们在一步生成的框架下研究SSL在生成建模中的优势,其中表示的作用是明确的:冻结的SSL特征用于将生成的样本与真实数据匹配。我们在该特征空间中使用Sinkhorn散度,为Wasserstein距离提供了一个可处理的代理,这是由Fréchet风格评估指标(如FID)近似的总体差异。我们发现,当在语义结构化的SSL特征空间中计算时,这个目标变得非常有效(ImageNet FID降低39倍)。我们将这种行为主要归因于匹配估计:抑制无关重建细节的语义SSL特征诱导出更紧凑的几何结构,使分布匹配更易处理。因此,最佳的训练SSL特征不一定与评估指标使用的特征匹配。特别是,我们表明使用Inception作为特征提取器可以改善FID,同时降低匹配稳定性和样本质量,揭示了一种形式的指标黑客攻击。通过在ImageNet上的大量实验,我们确定了哪些SSL特征族能带来最佳的生成性能,并表明匹配稳定性是选择它们的定量标准。代码可在https://github.com/Genentech/semantic-transport-generation获取。

英文摘要

Generative modeling and self-supervised representation learning (SSL) optimize structurally different objectives: generative training rewards distributional fidelity, while SSL rewards semantic coherence. Yet recent work repeatedly finds that SSL features improve generative training, though the mechanism of this synergy remains unclear. Here, we study the benefits of SSL in generative modeling in the framework of one-step generation where the role of representation is explicit: frozen SSL features are used to match generated samples to real data. We use the Sinkhorn divergence in that feature space, providing a tractable surrogate for the Wasserstein distance, the population-level discrepancy approximated by Fréchet-style evaluation metrics (such as FID). We find that this objective becomes highly effective when computed in a semantically structured SSL feature space (a 39$\times$ reduction in ImageNet FID). We trace this behavior primarily to matching estimation: semantic SSL features that suppress nuisance reconstruction details induce a more compact geometry, making distribution matching more tractable. As a consequence, the best training SSL features need not match the features used by the evaluation metric. In particular, we show that using Inception as the feature extractor can improve FID while degrading matching stability and sample quality, revealing a form of metric hacking. Using extensive experiments on ImageNet, we identify which SSL feature families lead to best generation performance and show that matching stability is a quantitative criterion for selecting them. Code is available at https://github.com/Genentech/semantic-transport-generation.

2606.00512 2026-06-02 cs.LG cs.IT math.IT stat.ML 版本更新

Semi-Supervised Learning with Noisy Proxy Covariates: Generalization Bounds and Distribution Regression

带噪声代理协变量的半监督学习:泛化界与分布回归

Kwangho Kim, Jisu Kim

发表机构 * Department of Statistics, Korea University, Seoul, Korea(韩国大学统计系) Department of Statistics, Seoul National University, Seoul, Korea(首尔国立大学统计系)

AI总结 针对带噪声代理协变量的半监督回归问题,提出两阶段估计器,利用所有代理协变量学习核本征特征,并在标记数据上拟合岭回归,理论证明在代理扰动可控且未标记代理协变量充足时能恢复快速标记样本率,实验表明在低标记率下优于监督和半监督基线。

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

在许多现代机器学习流程中,丰富的预训练表示充当有噪声的代理协变量,而任务特定标签仍然稀缺。我们研究这种设置下的半监督回归,并提出一个简单的两阶段估计器,该估计器从所有代理协变量中学习核本征特征,并在标记数据上拟合岭预测器。我们推导出有限样本界,表明当代理扰动受控且未标记代理协变量足够丰富时,可以恢复快速的标记样本率。我们还表明,分布回归是一个直接的特例,当有限袋大小足够大时具有类似的保证。实验表明,在低标记率情况下,相比监督和半监督基线有持续改进。

英文摘要

In many modern machine learning pipelines, abundant pretrained representations serve as noisy proxy covariates, while task-specific labels remain scarce. We study semi-supervised regression in this setting, and propose a simple two stage estimator that learns kernel eigenfeatures from all proxy covariates and fits a ridge predictor on labeled data. We derive finite sample bounds showing that fast labeled sample rates are recovered when proxy perturbation is controlled and unlabeled proxy covariates are sufficiently abundant. We also show that distribution regression is a direct special case, with analogous guarantees when the finite bag size is large enough. Experiments show consistent gains over supervised and semi-supervised baselines, especially in low label regimes.

2606.00511 2026-06-02 cs.LG cs.CV 版本更新

Saliency-Aware Model Merging

显著性感知模型合并

Jungin Park, Jiyoung Lee, Kwanghoon Sohn

发表机构 * Yonsei University, Seoul, South Korea(首尔大学) Ewha Womans University, Seoul, South Korea(成均馆女子大学)

AI总结 提出SA-Merging方法,利用结构剪枝中的连通性显著性(如SynFlow)进行数据无关模型合并,通过任务向量显著性评分和合并感知调制减少任务干扰,并在视觉和语言任务上验证有效性。

Comments ICML 2026 Camera-ready

详情
AI中文摘要

模型合并旨在将多个在不同数据集上微调的任务特定模型整合到一个统一架构中,以实现跨领域能力。当前的数据无关模型合并方法通常难以扩展,因为它们依赖于忽略层间依赖性和非均匀专业知识分布的简单参数级启发式方法。本文提出SA-Merging,它基于结构剪枝(如SynFlow)中的连通性显著性公式,并将其扩展到数据无关模型合并设置。我们相对于共享基础模型定义任务向量上的显著性分数,并进一步引入合并感知调制,该调制结合专家间的一致性以减轻任务干扰。基于此公式,迭代的显著性感知合并过程逐步移除非信息性更新,同时保留端到端连通性。此外,我们将SA-Merging扩展到为LoRA引入秩级显著性分解,而不损害其结构完整性。在视觉和语言任务上的大量实验证明了我们基于显著性方法的有效性,进一步缩小了数据无关方法和测试时自适应方法之间的差距。

英文摘要

Model merging aims to consolidate multiple task-specific models fine-tuned on different datasets into a unified architecture that performs cross-domain proficiency. Current data-free model merging methods often struggle to scale as they rely on simple parameter-level heuristics that ignore inter-layer dependencies and non-uniform distribution of expertise. This work proposes SA-Merging, which is built upon connectivity-based saliency formulations from structural pruning (e.g., SynFlow) and extends them to the data-free model merging setting. We define a saliency score over task vectors relative to a shared base model, and further introduce merge-aware modulation that incorporates agreement across experts to mitigate task interference. Based on this formulation, an iterative saliency-aware merging procedure progressively removes non-informative updates while preserving end-to-end connectivity. Furthermore, we extend SA-Merging to introduce rank-wise saliency decomposition for LoRAs without compromising their structural integrity. Extensive experiments on vision and language tasks demonstrate the effectiveness of our saliency-based approach, further reducing the gap between data-free and test-time adaptation methods.

2606.00506 2026-06-02 cs.AI cs.LG 版本更新

EnergyMamba: An Uncertainty-Aware Graph-Enhanced Selective State Space Model for Energy Consumption Prediction

EnergyMamba:一种用于能耗预测的具有不确定性感知的图增强选择性状态空间模型

Dahai Yu, Rongchao Xu, Lin Jiang, Guang Wang

发表机构 * Florida State University(佛罗里达州立大学)

AI总结 提出EnergyMamba框架,通过图增强选择性状态空间模型(GE-Mamba)和自适应序列分位数回归(AS-CQR)模块,实现时空联合建模与不确定性量化,在能耗预测中提升准确率约5%、不确定性量化约6%。

Comments Accepted by KDD 2026 AI4S

详情
AI中文摘要

能耗预测对于高效的电网管理、需求侧优化和可持续能源规划至关重要。尽管先进的机器学习方法已被用于提高预测性能,但现有工作存在两个关键局限:(1)通常将任务视为纯时间序列预测问题,未显式建模不同区域间的空间依赖关系;(2)在极端天气等异常情况下无法提供带有不确定性估计的可靠预测。为推进现有研究,我们提出EnergyMamba,一种具有不确定性感知的时空学习框架,用于准确可靠的能耗预测,包含两个关键组件:(i)一种新颖的图增强选择性状态空间模型(GE-Mamba),将从电网拓扑中学到的空间上下文注入时间动态,实现耦合的时空建模;(ii)自适应序列分位数回归(AS-CQR)模块,包括局部自适应归一化和在线反馈机制,以在潜在分布偏移下动态校准预测区间。我们在来自佛罗里达、纽约和加利福尼亚的四个大规模真实数据集上评估EnergyMamba。结果表明,与15个最先进的基线相比,EnergyMamba在预测准确率上提升约5%,在不确定性量化上提升约6%。

英文摘要

Energy consumption prediction is essential for efficient grid management, demand-side optimization, and sustainable energy planning. Although advanced machine learning methods have been employed for better prediction performance, existing works have two key limitations: (1) they usually formulate this task as a purely time-series prediction problem without explicitly modeling the spatial dependencies among different regions, and (2) they fail to provide reliable predictions with uncertainty estimates under abnormal situations such as extreme weather events. To advance existing research, we propose EnergyMamba, an uncertainty-aware spatiotemporal learning framework for accurate and reliable energy consumption prediction, which comprises two key components: (i) a novel Graph-Enhanced Selective State Space Model (GE-Mamba) that injects spatial context learned from the grid topology into the temporal dynamics, enabling coupled spatiotemporal modeling, and (ii) an Adaptive Sequential Conformalized Quantile Regression (AS-CQR) module, which includes locally adaptive normalization and an online feedback mechanism to dynamically calibrate prediction intervals under potential distribution shifts. We evaluate EnergyMamba on four large-scale real-world datasets from Florida, New York, and California. Results show EnergyMamba achieves around 5% improvement in prediction accuracy and 6% improvement in uncertainty quantification over 15 state-of-the-art baselines.

2606.00503 2026-06-02 cs.LG cs.AI 版本更新

TabChange: Precise Attribute Changes in Tabular Data

TabChange: 表格数据中的精确属性变化

Arjun Dahal, Yu Lei, Raghu N. Kacker, Richard Kuhn

发表机构 * The University of Texas at Arlington(德克萨斯大学阿灵顿分校) National Institute of Standards and Technology(美国国家标准与技术研究院) Information Technology Laboratory(信息技术实验室)

AI总结 针对表格数据中修改属性时破坏自然性的问题,提出TabChange方法,通过分析属性间关系并利用对抗框架去除潜在空间中的属性信息,实现精确且自然的属性修改。

详情
AI中文摘要

修改表格数据中的属性通常会破坏其与其他属性的关系,从而产生不自然的实例。修改后的实例必须既自然又与原始实例变化最小。本文解决了生成这种修改实例的挑战。我们识别了现有方法的关键局限性:生成模型要么不支持实例级属性编辑,要么像CVAE这样的方法在潜在空间中保留属性信息,导致不必要的修改。为了解决这个问题,我们提出了TabChange,一种分析数据集中目标属性与其他属性关系的方法。如果关系较弱,它直接翻转属性;如果关系较强,它使用对抗框架去除潜在空间表示中的属性信息。这种去除使得能够进行精确修改,只进行必要的调整以保持自然性。我们在七个数据集上的实验表明,TabChange生成的属性反事实在自然性方面与基线相当,并且更接近原始实例。与基线相比,这导致了更多有效的反事实和更少的无效反事实。

英文摘要

Modifying an attribute in tabular data often introduces an unnatural instance by breaking its relationships with other attributes. The modified instance must be both natural and minimally changed from the original instance. This paper addresses the challenge of generating such a modified instance. We identify key limitations in existing approaches: generative models either don't support instance-level attribute editing or, in the case of methods like CVAE, retain attribute information in the latent space, leading to unnecessary modifications. To solve this, we propose TabChange, an approach that analyzes the relationship between the attribute of interest and other attributes in the dataset. If the relationship is weak, it simply flips the attribute; if it is strong, it uses an adversarial framework that removes information about the attribute in the latent space representation. This removal enables precise modifications, making only the necessary adjustments to maintain naturalness. Our experiments across seven datasets show that TabChange generates counterfactuals in attributes that are comparable in naturalness and are more proximal to their original instances. This leads to a higher number of valid counterfactuals and a lower number of invalid counterfactuals compared to the baselines.

2606.00500 2026-06-02 cs.DS cs.LG math.ST stat.ML stat.TH 版本更新

Easy, robust approximate message passing for planted spike models

用于植入尖峰模型的简单、鲁棒近似消息传递

Misha Ivkov, Tselil Schramm

发表机构 * Stanford University(斯坦福大学)

AI总结 针对含对抗性噪声的尖峰矩阵模型,提出一种结合谱预处理与鲁棒谱初始化的算法,使近似消息传递(AMP)在无需修改的情况下实现鲁棒性,输出与无噪声AMP结果接近的向量。

Comments 32 pages

详情
AI中文摘要

我们提出了一种简单高效的算法,用于尖峰矩阵设置中的鲁棒近似消息传递(AMP)。特别地,设 $\varepsilon$ 为足够小的常数,并假设 $X \in \mathbb R^{n \times n}$ 是带有植入秩-$1$ 尖峰的高斯矩阵,而 $E \in \mathbb R^{n \times n}$ 是支撑在 $\varepsilon n \times \varepsilon n$ 主子矩阵上的对抗性选择矩阵。令 $v_{\mathrm{AMP}}(X)$ 为在未损坏矩阵 $X$ 上执行 AMP 迭代的输出。我们给出一个过程,仅给定损坏矩阵 $Y = X + E$,即可计算向量 $v_{\mathrm{ALG}}(Y)$,该向量与 $v_{\mathrm{AMP}}(X)$ 的差距为 $\tilde{O}(\sqrt{\varepsilon})$,适用于包括稀疏主成分分析(PCA)、非负 PCA 和 $\mathbb Z_2$ 同步在内的一类 AMP 迭代。我们的算法由谱预处理步骤结合鲁棒谱初始化过程组成;给定这些输入,我们证明(或许令人惊讶地)AMP 开箱即用具有鲁棒性。

英文摘要

We present a simple and efficient algorithm for robust approximate message passing (AMP) in the spiked matrix setting. In particular, let $\varepsilon$ be a sufficiently small constant, and suppose that $X \in \mathbb R^{n \times n}$ is a Gaussian matrix with a planted rank-$1$ spike, and $E \in \mathbb R^{n \times n}$ is an adversarially chosen matrix supported on an $\varepsilon n \times \varepsilon n$ principal minor. Let $v_{\mathrm{AMP}}(X)$ be the output of an AMP iteration on the uncorrupted matrix $X$. We give a procedure that, given access only to the corrupted matrix $Y = X + E$, computes a vector $v_{\mathrm{ALG}}(Y)$ which is $\tilde{O}(\sqrt{\varepsilon})$-close to $v_{\mathrm{AMP}}(X)$, for any of a class of AMP iterations which includes sparse Principal Component Analysis (PCA), non-negative PCA, and $\mathbb Z_2$ synchronization. Our algorithm consists of a spectral pre-processing step combined with a robust spectral initialization procedure; given these inputs, we prove that (perhaps surprisingly) AMP is robust out-of-the-box.

2606.00496 2026-06-02 cs.LG 版本更新

Torus Graphs for Large Scale Neural Phase Analysis

大规模神经相位分析的环面图模型

Jack Goffinet, Casey Hanks, David E. Carlson

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种随机得分匹配方法,将环面图模型的计算复杂度从O(d^6)降至O(d^2),使其能处理数千变量,并扩展至隐马尔可夫模型和自回归模型,用于分析脑状态依赖的相位耦合和方向性交互。

Comments 23 pages, 15 figures; to be published in ICML 2026

详情
AI中文摘要

振荡神经信号(如脑电图和局部场电位)表现出协调跨脑区通信的相位关系。现代记录在多个频率区间捕获数百个通道,但标准相位分析仅限于少数变量。环面图模型是一种相位上的指数族分布,其单变量和成对势函数推广了冯·米塞斯分布,推断振荡之间的结构化关系,但仅建模静态无向依赖,且由于得分匹配推断复杂度为O(d^6),仅限于约100个变量。我们引入一种随机得分匹配过程,将每次迭代成本降至O(d^2),使得能够对数千变量的数据集进行推断。这一可扩展基础支持对来自多电极LFPs的1,860个频率-相位特征进行分析,并实现了之前环面图或经典圆形统计无法实现的两种扩展:(i) 捕获状态依赖的相位耦合变化(例如睡眠期间纺锤波相关状态)的环面图隐马尔可夫模型,以及(ii) 通过传递熵估计推断方向性交互的自回归环面图。应用于LFP记录,这些模型揭示了清醒和NREM睡眠之间状态依赖的相位交互模式。它们共同实现了对大脑和认知状态中动态和方向性相位关系的系统性大规模映射。

英文摘要

Oscillatory neural signals such as electroencephalography (EEG) and local field potentials (LFPs) show phase relationships that coordinate communication across brain regions. Modern recordings capture hundreds of channels across many frequency bins, yet standard phase analyses are restricted to only a few variables. The Torus Graph (TG) model, an exponential-family distribution over phases whose univariate and pairwise potentials generalize von Mises distributions, infers principled structure among oscillations but models only static, undirected dependencies and is limited to $\sim \! 100$ variables because its score matching inference scales as $\mathcal{O}(d^{6})$. We introduce a stochastic score matching procedure that reduces the per-iteration cost to $\mathcal{O}(d^{2})$, enabling inference on datasets with thousands of variables. This scalable foundation supports analyses of 1,860 frequency-phase features from multi-electrode LFPs and enables two extensions previously inaccessible to TGs or classical circular statistics: (i) a TG Hidden Markov Model capturing state-dependent phase-coupling changes (e.g., spindle-related states during sleep) and (ii) an autoregressive TG inferring directional interactions via transfer-entropy estimation. Applied to LFP recordings, these models reveal state-dependent phase-interaction patterns between wakefulness and NREM sleep. Together, they enable systematic, large-scale mapping of dynamic and directional phase relationships across brain and cognitive states.

2606.00483 2026-06-02 q-bio.GN cs.LG 版本更新

Annotation-Informed Block-Sparse Bayesian Modeling for cis-Expression Prediction

基于注释信息的块稀疏贝叶斯建模用于顺式表达预测

Lei Huang, Hui Shen, Kuan-Jui Su, Chuan Qiu, Martha Isabel Gonzalez-Ramirez, Anqi Liu, Zhe Luo, Yun Gong, Yipu Zhang, Dawei Li, Chaoyang Zhang, Hong-Wen Deng

发表机构 * School of Computing Sciences and Computer Engineering, University of Southern Mississippi(南密西西比大学计算机科学与计算机工程学院) Tulane Center for Biomedical Informatics and Genomics, Deming Department of Medicine, School of Medicine, Tulane University(路易斯安那大学医学中心生物医学信息学与基因组学中心,德明医学部) Texas Tech University Health Sciences Center, School of Medicine, Texas Tech University(德克萨斯科技大学健康科学中心,德克萨斯科技大学医学院)

AI总结 提出块稀疏贝叶斯稀疏线性混合模型(bsBSLMM),通过整合LD块尖峰-板稀疏性和TSS先验,提高了顺式表达预测性能及下游TWAS发现能力。

Comments 16 pages manuscript; 38 pages supplementary

详情
AI中文摘要

基于基因型的顺式表达预测依赖于对局部调控架构的精确建模。我们提出了块稀疏贝叶斯稀疏线性混合模型(bsBSLMM),这是贝叶斯稀疏线性混合模型(BSLMM)的扩展,它整合了连锁不平衡(LD)块的尖峰-板稀疏性和转录起始位点(TSS)先验的SNP包含。在来自GEUVADIS欧洲血统淋巴母细胞系系的23,098个基因中,在匹配的评估标准下,bsBSLMM保留了比BSLMM、LASSO、BLUP、TIGAR弹性网和TIGAR狄利克雷过程回归更多可预测的基因。与BSLMM相比,bsBSLMM提高了大多数共享基因的留出预测性能,其增益主要由LD块稀疏性驱动,并通过TSS先验进一步增强。bsBSLMM选择的变异在GM12878 DNase和H3K27ac调控区域中显示出比BSLMM选择的变异更强的富集性。在全转录组关联研究(TWAS)分析中,bsBSLMM恢复了已建立的炎症性肠病信号,包括IL23R,并识别了BSLMM未检测到的其他全基因组显著基因。在路易斯安那州骨质疏松症研究中的独立验证重现了跨祖先的预测产量增加,并在下游TWAS和基因集富集分析中恢复了生物学相关的骨矿物质密度通路。这些结果表明,整合LD块结构和生物学先验的SNP改进了顺式表达预测并增强了下游TWAS发现。

英文摘要

Genotype-based cis-expression prediction depends on accurately modeling local regulatory architecture. We present block-sparse Bayesian sparse linear mixed model (bsBSLMM), an extension of Bayesian sparse linear mixed model (BSLMM) that incorporates linkage disequilibrium (LD)-block spike-and-slab sparsity and a transcription start site (TSS)-informed SNP inclusion prior. Across 23,098 genes from GEUVADIS European-ancestry lymphoblastoid cell lines, bsBSLMM retained more predictable genes than BSLMM, LASSO, BLUP, TIGAR elastic net, and TIGAR Dirichlet-process regression under matched evaluation criteria. Compared with BSLMM, bsBSLMM improved held-out prediction performance for most shared genes, with gains driven primarily by LD-block sparsity and further enhanced by the TSS-informed prior. Variants selected by bsBSLMM showed stronger enrichment in GM12878 DNase and H3K27ac regulatory regions than variants selected by BSLMM. In transcriptome-wide association study (TWAS) analysis, bsBSLMM recovered established inflammatory bowel disease signals, including IL23R, and identified additional genome-wide significant genes not detected by BSLMM. Independent validation in the Louisiana Osteoporosis Study reproduced the increased prediction yield across ancestries and recovered biologically relevant bone mineral density pathways in downstream TWAS and gene set enrichment analyses. These results demonstrate that incorporating LD-block structure and biologically informed SNP priors improves cis-expression prediction and enhances downstream TWAS discovery.

2606.00472 2026-06-02 cs.CV cs.AI cs.HC cs.LG 版本更新

CodeCytos: AI-assisted spatial molecular imaging analysis via code-augmented agent action space

CodeCytos: 通过代码增强的智能体动作空间实现AI辅助空间分子成像分析

Hung Q. Vo, Huy Q. Vo, Son T. Ly, Zhihao Wan, Anh-Vu Nguyen, Hong Zhao, Jianting Sheng, Stephen T. C. Wong, Hien V. Nguyen

发表机构 * University of Houston, Department of Electrical and Computer Engineering(德克萨斯大学休斯顿分校电子与计算机工程系) Houston Methodist Hospital, Department of Systems Medicine and Biomedical Engineering(休斯顿 Methodist 医院系统医学与生物医学工程系)

AI总结 提出CodeCytos框架,通过代码驱动的推理智能体实现空间分子成像数据的动态可编程分析,提升自动化与定制化能力,并在多种组织类型数据集上验证其优于基线方法。

详情
AI中文摘要

传统的组织图像分析软件为细胞分析提供了基础功能,包括分割、基本形态特征提取和空间组织分析。然而,这些工具通常需要手动干预,且与代码驱动的自动化集成不佳,限制了复杂空间组织研究的效率和可扩展性。此外,它们对自定义分析的灵活性有限,通常只支持一组固定的预实现空间细胞特征。为了解决这些限制,我们提出了CodeCytos,一个基于编码的推理智能体框架,能够实现与空间分子成像数据的动态、可编程交互,以提高自动化和定制化。CodeCytos旨在简化自定义空间细胞特征的探索,并适应多样化的研究需求。我们通过四个来自不同组织类型(额叶皮层、非小细胞肺癌、胰腺和扁桃体)的专家精选数据集案例研究展示了其实用性。我们在现实的最小提示设置下评估CodeCytos,其中生物科学家提出简单问题,没有任务特定指令或关于空间细胞分析的上下文信息,并基准测试了多个具有强大编码能力的LLM骨干。我们进一步表明,结合定制的、领域无关的少样本上下文编码推理示例(空间分析领域外随机采样的演示)可以显著提高性能,而无需昂贵的、专家制作的领域内演示。总体而言,CodeCytos优于基线方法,突显了代码动作智能体在空间分子成像中辅助自定义特征探索和加速生物标志物发现的潜力。

英文摘要

Conventional tissue image analysis software provides foundational capabilities for cellular analysis, including segmentation, basic morphological feature extraction, and spatial organization analysis. However, these tools often require manual intervention and are not well integrated with code-driven automation, limiting efficiency and scalability for complex spatial tissue studies. In addition, they offer limited flexibility for custom analyses, as they typically support only a fixed set of pre-implemented spatial cellular features. To address these limitations, we propose CodeCytos, a coding-based reasoning agent framework that enables dynamic, programmable interaction with spatial molecular imaging data to improve automation and customization. CodeCytos is designed to streamline the exploration of custom spatial cellular features and adapt to diverse research needs. We demonstrate its utility through case studies on four expert-curated datasets from distinct tissue types: frontal cortex, non-small-cell lung cancer, pancreas, and tonsil. We evaluate CodeCytos under a realistic minimal prompt setting, where bioscientists pose simple questions without task-specific instructions or contextual information about spatial cellular analysis, and benchmark multiple LLM backbones with strong coding capabilities. We further show that incorporating tailored, domain-agnostic few-shot in-context coding-reasoning examples (randomly sampled demonstrations outside the spatial analysis domain) can substantially improve performance without requiring costly, expert-crafted in-domain demonstrations. Overall, CodeCytos outperforms baseline approaches, highlighting the potential of code-action agents to assist with custom feature exploration in spatial molecular imaging and to accelerate biomarker discovery.

2606.00467 2026-06-02 cs.CL cs.AI cs.LG stat.ML 版本更新

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

论大语言模型适应性的局限:模型内化先验对标注任务性能的影响

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

发表机构 * University of Washington(华盛顿大学)

AI总结 通过毒性检测实验,研究大语言模型内化先验与指令交互的三个维度,发现近三分之二的零样本错误难以通过提示纠正,并引入定义特定熟悉度(DSF)指标,证明其与性能正相关,而文本记忆指标则无此关联。

Comments Accepted at ICML 2026 (Oral & Spotlight); PMLR vol. 306. 9 pages, 4 figures

详情
AI中文摘要

大语言模型(LLMs)越来越多地用于零样本标注和LLM-as-a-judge任务,但其可靠性取决于模型内化先验与用户提供指令的交互方式。我们研究了这种交互的三个维度:(1)LLM对数据和任务定义的熟悉程度如何影响性能;(2)提示中的额外信息能在多大程度上纠正零样本错误(“决策粘性”);(3)模型对错误任务定义的敏感性。通过在多种数据集(涵盖社交媒体、游戏、新闻和论坛)上进行毒性检测实验,使用密集模型和混合专家模型,我们发现近三分之二的零样本错误难以纠正,提示纠正的总体挽救率(初始错误中被纠正的比例)仅为34.8%。高置信度错误尤其难以纠正。当给出错误定义时,LLM会遵循这些定义,同时保持与正确定义条件下相同的置信水平。关键的是,我们引入了定义特定熟悉度(DSF),它衡量模型内部概念与任务定义之间的一致性。在控制数据集层面的混杂因素后,DSF与模型性能呈正相关(偏相关系数r=+0.41),而三种不同的记忆指标(ROUGE-L、BERTScore和嵌入余弦相似度)均未显示正相关。这些发现揭示了基于提示的纠正在标注任务中的局限性,强调了定义对齐比文本级记忆更重要。

英文摘要

Large Language Models (LLMs) are increasingly used for zero-shot annotation and LLM-as-a-judge tasks, yet their reliability hinges on how model-internalized priors interact with user-provided instructions. We investigate three dimensions of this interaction: (1) how an LLM's familiarity with data and task definitions affects performance, (2) the extent to which additional information in prompts can correct zero-shot errors ("decision stickiness"), and (3) model susceptibility to misaligned task definitions. Through experiments on toxicity detection across diverse datasets (spanning social media, gaming, news, and forums) using both dense and mixture-of-experts models, we find that nearly two-thirds of zero-shot errors are resistant to correction, with an overall rescue rate (fraction of initial errors corrected by prompting) of only 34.8%. High-confidence errors prove especially resistant to correction. When given misaligned definitions, LLMs follow them while maintaining confidence levels unchanged from the aligned condition. Crucially, we introduce Definition-Specific Familiarity (DSF), which measures alignment between a model's internal concept and the task definition. After controlling for dataset-level confounds, DSF shows a positive association with model performance (partial r = +0.41), while three distinct memorization metrics (ROUGE-L, BERTScore, and embedding cosine similarity) all fail to show a positive association. These findings show the limitations of prompt-based correction in annotation tasks, highlighting the importance of definition alignment over text-level memorization.

2606.00462 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Short-form Text Rewriting with Phi Silica

短文本改写与 Phi Silica

Divya Tadimeti, Shawn Pan, Sameera Lanka, Chenghui Zhou, Sadid Hasan

发表机构 * IEEE ICAD

AI总结 本研究通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写任务,结果表明微调提高了语义保真度、减少了幻觉并提升了与 GPT-5-chat 改写的偏好胜率。

Comments 6 pages

详情
AI中文摘要

短文本改写是释义的一种受限变体,其中有限的上下文和高语义密度几乎没有留下变化空间。虽然大型语言模型在一般释义任务上表现良好,但小语言模型(SLM)在短文本场景中常常在语义保真度和幻觉鲁棒性方面遇到困难。在这项工作中,我们提出了一项实证研究,通过数据集整理、提示蒸馏、参数高效微调和评估,将小语言模型 Phi Silica 适配于短文本改写。我们从公开的幻灯片中整理了一个简短的演示风格文本数据集,并使用 GPT-5-chat 来生成改写监督以及进行 LLM 作为评判者的评估。我们的结果表明,微调提高了语义保真度,减少了幻觉,并提高了与 GPT-5-chat 改写的偏好胜率。这些发现表明,针对 SLM 的定向适配可以显著缩小与云模型的差距,并为将 SLM 适配于精度关键的改写任务提供实用指导。

英文摘要

Short-form text rewriting is a constrained variant of paraphrasing in which limited context and high semantic density leave little room for variation. While large language models perform well on general paraphrasing, small language models (SLMs) often struggle with semantic fidelity and hallucination robustness in short-form settings. In this work, we present an empirical study of adapting an SLM, Phi Silica, for short-form rewrite through dataset curation, prompt distillation, parameter-efficient fine-tuning, and evaluation. We curate a dataset of short presentation-style text from public slide decks and use GPT-5-chat both to generate rewrite supervision and to conduct LLM-as-a-judge evaluation. Our results show that finetuning improves semantic fidelity, reduces hallucinations, and increases preference win rate against GPT-5-chat rewrites. The findings suggest that targeted adaptation for SLMs can substantially narrow the gap to cloud models and provide practical guidance for adapting SLMs to precision-critical rewrite tasks.

2606.00445 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DarkVesselNet: Multi-Modal Remote Sensing and Trajectory Reasoning for Dark Vessel Detection

DarkVesselNet: 用于暗船检测的多模态遥感和轨迹推理

Arun Sharma

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学,双城分校)

AI总结 提出DarkVesselNet,融合Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型、AIS轨迹推理、TGARD间隙检测和Pi-DPM异常头,实现多模态遥感暗船检测。

详情
AI中文摘要

暗船检测需要融合船只通过AIS报告的信息与卫星通过雷达和光学传感器观测到的信息。DarkVesselNet是一个多模态遥感堆栈,结合了Sentinel-1 SAR、Sentinel-2光学影像、地理空间基础模型骨干、AIS轨迹推理、TGARD风格的间隙检测以及受Pi-DPM启发的异常头。该仓库将系统呈现为经过测试的Python包和公开的Hugging Face Space。本文介绍了传感器堆栈、骨干抽象、融合路径、异常头和当前的验证。目前可用的证据是基于软件的:针对SAR散斑滤波、光学波段比、Haversine距离、TGARD间隙发射、传感器配准、骨干token形状和可微分异常评分的测试。

英文摘要

Dark vessel detection requires fusing what vessels report through AIS with what satellites observe through radar and optical sensors. DarkVesselNet is a multi-modal remote sensing stack that combines Sentinel-1 SAR, Sentinel-2 optical imagery, geospatial foundation model backbones, AIS trajectory reasoning, TGARD-style gap detection, and a Pi-DPM-inspired anomaly head. The repository exposes the system as a tested Python package and a public Hugging Face Space. The paper presents the sensor stack, backbone abstraction, fusion path, anomaly head, and current validation. The evidence currently available is software-grounded: tests for SAR speckle filtering, optical band ratios, Haversine distance, TGARD gap emission, sensor coregistration, backbone token shapes, and differentiable anomaly scoring.

2606.00442 2026-06-02 cs.LG math.OC stat.ML 版本更新

Exploiting weight-space symmetries for approximating curvature

利用权重空间对称性近似曲率

Artem Artemev, Rui Xia, Benjamin M. Boyd, Youjing Yu, Felix Dangel, Guillaume Hennequin, Alberto Bernacchia

发表机构 * DeepMind, London, UK(伦敦DeepMind)

AI总结 本文通过解析平均化保持损失不变的群作用,从单个梯度构建结构化的Hessian近似,从而利用权重空间对称性来近似损失函数的曲率。

Comments Published at ICML 2026. 35 pages, 11 figures. Code: https://github.com/mtkresearch/symm_opt

详情
AI中文摘要

许多机器学习技术依赖于近似损失函数的曲率,但在现代深度网络的规模下,这通常很难做到。令人惊讶的是,之前没有工作利用损失景观中众所周知的权重空间对称性所产生的曲率约束。通过解析平均化保持损失不变的群作用,我们从单个梯度构建了结构化的Hessian近似,这些近似可以易于估计、存储和求逆。用户指定的对称群直接控制近似精度与计算成本之间的权衡。此外,我们的框架为审视现有方法提供了统一的理论视角;特别地,特定的对称群选择可以恢复Shampoo/Muon类的曲率估计。我们在多种网络架构上验证了我们的方法,并将其应用于二阶优化基准测试,包括一个小型语言模型。我们的曲率估计框架可能在机器学习其他问题中找到应用,如不确定性估计、持续学习、压缩/剪枝、训练数据归因等。

英文摘要

Many machine learning techniques rely on approximating a loss function's curvature, but this is notoriously hard to do at the scale of modern deep networks. Surprisingly, no previous work has exploited the curvature constraints that arise from well known weight-space symmetries in loss landscapes. By analytically averaging over group actions that leave the loss invariant, we construct structured Hessian approximations from single gradients that can be tractably estimated, stored, and inverted. The choice of user-specified symmetry group directly governs the trade-off between approximation accuracy and computational cost. Moreover, our framework provides a unifying theoretical lens for viewing existing methods; in particular, a specific choice of symmetry group recovers Shampoo/Muon-like curvature estimates. We validate our method on a range of network architectures, and deploy it to second-order optimization benchmarks, including a small language model. Our curvature estimation framework might find applications in other machine learning problems such as uncertainty estimation, continual learning, compression/pruning, training data attribution, and more.

2606.00437 2026-06-02 cs.LG 版本更新

EST-PRM: Stress-Testing Process Reward Models Before They Become Load-Bearing

EST-PRM:在过程奖励模型成为关键依赖之前对其进行压力测试

Ibne Farabi Shihab, Fariya Afrin, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(艾奥瓦州立大学计算机科学系) Department of Computer Science, Kalinga Institute of Industrial Technology(卡林加工业技术学院计算机科学系) Department of Civil, Construction & Environmental Engineering, Iowa State University(艾奥瓦州立大学土木、建设与环境工程系)

AI总结 提出EST-PRM框架,通过步骤膨胀、依赖感知重排序和置信度标记三种变换对过程奖励模型进行压力测试,发现不同模型在奖励膨胀和正确性敏感性损失方面存在显著差异。

详情
AI中文摘要

过程奖励模型(PRM)在具有密集步骤级监督的语言模型训练中被广泛使用。它们假设在标签保持变换下,PRM分数是步骤正确性的稳定代理。这些变换改变推理结构但保留最终答案。我们认为这一假设未得到充分验证。此类变换可能改变PRM分数与正确性信号之间的关系,导致不同模型出现不同的故障模式。为弥补这一空白,我们引入了 extbf{EST-PRM},一个用于密集过程奖励的压力测试框架。它应用三种变换:(1)步骤膨胀,(2)依赖感知步骤重排序,以及(3)置信度标记。定义了一个脆弱性分解,将奖励膨胀与正确性敏感性损失分开。在来自MATH-500、GSM8K和PRMBench的4,687条推理链上评估了五种PRM风格模型。结果表明不同模型的脆弱性模式存在明显差异。Math-Shepherd对位置扰动表现出最强的敏感性,Pearson相关系数下降$0.152 \pm 0.038$,分数膨胀率为$32.8 \pm 4.9\%$。Qwen2.5-Math-PRM受步骤膨胀影响最大,膨胀率达到$47.6 \pm 4.3\%$。基于置信度的扰动也会扭曲奖励校准,揭示正确性估计中的不一致性。评估了三种缓解策略,突出了鲁棒性覆盖率和假阳性率之间的权衡。

英文摘要

Process reward models (PRMs) are widely used in language-model training with dense step-level supervision. They assume PRM scores are stable proxies for step correctness under label-preserving transformations. These transformations change reasoning structure but preserve final answers. We argue this assumption is not well validated. Such transformations can change how PRM scores relate to correctness signals, leading to different failure modes across models.To address this gap, we introduce \textbf{EST-PRM}, a stress-testing framework for dense process rewards. It applies three transformations: (1) step inflation, (2) dependency-aware step reordering, and (3) confidence markers. A vulnerability decomposition is defined that separates reward inflation from loss of correctness sensitivity. Five PRM-style models are evaluated on 4,687 reasoning chains from MATH-500, GSM8K, and PRMBench.The results indicate clear differences in vulnerability patterns across models. Math-Shepherd shows the strongest sensitivity to position perturbations, with a Pearson correlation drop of $0.152 \pm 0.038$ and a $32.8 \pm 4.9\%$ score inflation rate. Qwen2.5-Math-PRM is most affected by step inflation, reaching a $47.6 \pm 4.3\%$ inflation rate. Confidence-based perturbations also distort reward calibration, revealing inconsistencies in correctness estimation. Three mitigation strategies are evaluated, highlighting trade-offs between robustness coverage and false-positive rates.

2606.00432 2026-06-02 cs.LG 版本更新

Grounded Decoding: Retrieval-Anchored Probability Fusion for Faithful RAG

Grounded Decoding: 面向忠实RAG的检索锚定概率融合

Ibne Farabi Shihab, Fariya Afrin, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(爱荷华州立大学计算机科学系) Department of Computer Science, Kalinga Institute of Industrial Technology(卡林加工业技术学院计算机科学系) Department of Civil, Construction & Environmental Engineering, Iowa State University(爱荷华州立大学土木、建设与环境工程系)

AI总结 提出Grounded Decoding,一种无需训练的推理时解码框架,通过KL-重心目标融合全RAG分布和仅检索分布,并引入冲突感知自适应加权,以提升RAG的事实一致性。

详情
AI中文摘要

随着检索增强生成(RAG)系统的扩展,确保忠实基于外部证据变得越来越具有挑战性。当冲突出现时,大型语言模型仍可能优先考虑参数化知识而非检索信息。我们提出了一种新颖的无训练解码框架——\emph{Grounded Decoding},旨在不修改模型参数的情况下提高RAG的事实一致性。与依赖单一条件分布的标准方法不同,我们的方法在每个生成步骤构建两个匹配提示分布:(1)以查询、检索文档和生成前缀为条件的完整RAG分布,以及(2)仅以检索证据和相同前缀为条件的仅检索分布。最终的下一词分布被推导为概率单纯形上KL-重心目标的唯一解,产生两个分布的归一化几何融合。当接地权重为零时,该公式自然恢复标准RAG,并随着接地强度增加平滑地将概率质量移向检索证据。我们进一步引入了一种冲突感知自适应加权方案,该方案基于分布分歧和检索器置信度动态调整接地。在ALCE、Natural Questions和FActScore上的实验表明,与标准RAG和有竞争力的解码时基线相比,在事实准确性和引用质量上取得了一致改进,同时保持了流畅性。我们的结果表明,概率级融合为忠实RAG解码提供了一种强大且高效的替代对数级干预方法。

英文摘要

As retrieval-augmented generation (RAG) systems scale, it becomes increasingly challenging to ensure faithful grounding in external evidence. Large language models may still prioritize parametric knowledge over retrieved information when conflicts arise. We propose a novel training-free decoding framework, \emph{Grounded Decoding}, designed to improve factual consistency in RAG without modifying model parameters. Unlike standard approaches that rely on a single conditional distribution, our method constructs two matched-prompt distributions at every generation step: (1) a full RAG distribution conditioned on the query, retrieved documents, and generated prefix, and (2) a retrieval-only distribution conditioned solely on retrieved evidence and the same prefix. The final next-token distribution is derived as the unique solution to a KL-barycenter objective over the probability simplex, yielding a normalized geometric fusion of the two distributions.This formulation naturally recovers standard RAG when the grounding weight is zero and smoothly shifts probability mass toward retrieved evidence as grounding strength increases. We further introduce a conflict-aware adaptive weighting scheme that dynamically adjusts grounding based on distributional disagreement and retriever confidence. Experiments on ALCE, Natural Questions, and FActScore demonstrate consistent improvements in factual accuracy and citation quality over standard RAG and competitive decoding-time baselines, while maintaining fluency. Our results indicate that probability-level fusion provides a strong and efficient alternative to logit-level intervention methods for faithful RAG decoding.

2606.00431 2026-06-02 cs.LG 版本更新

Variance-sensitive Thompson sampling for generalised linear bandits, revisited

广义线性bandits的方差敏感Thompson采样,再探讨

Tom Perneczky, Marc Abeille, David Janz

发表机构 * University of Oxford(牛津大学) Criteo AI Lab(Criteo人工智能实验室)

AI总结 本文通过高斯庞加莱不等式证明Thompson采样在随机广义线性bandits中的方差敏感遗憾界,并指出移除预热阶段保持相同方差敏感尺度是开放且非平凡的问题。

详情
AI中文摘要

我们证明了在随机广义线性bandits中,Thompson采样具有方差敏感的遗憾界。该论证假设了一个预热阶段,之后通过使用高斯庞加莱不等式来控制遗憾。这绕过了先前基于乐观的分析失效的点。在保留相同方差敏感尺度的同时移除预热阶段仍然是开放问题,并且似乎是非平凡的。

英文摘要

We prove a variance-sensitive regret bound for Thompson sampling in stochastic generalised linear bandits. The argument assumes a warm-up, after which the regret is controlled through using the Gaussian Poincaré inequality. This bypasses the point at which previous optimism-based analyses break down. Removing the warm-up while retaining the same variance-sensitive scaling remains open, and appears nontrivial.

2606.00428 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Finer Parameter Steps for Low-Rank PEFT: A Controlled Study with CP Tensor Adapters

低秩PEFT的更细参数步长:基于CP张量适配器的控制研究

Xinjue Wang, Xiuheng Wang, Yejun Zhang, Sergiy A. Vorobyov, Esa Ollila, Zhi-Yong Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过固定组件的规范多路分解(CP)张量适配器实现更细的参数步长,研究其对低秩适配器精度-预算权衡的影响,发现CP适配器能填补LoRA秩之间的空白,但效果依赖于任务。

Comments Accepted at the ICML 2026 Workshop on CoLoRAI

详情
AI中文摘要

低秩适配器通常通过扫描少量秩进行比较,但秩也固定了参数预算的分辨率。对于一个$2048{\times}2048$的OPT注意力投影,增加LoRA的一个秩会存储$4096$个可训练标量,导致可行的低预算适配器大小之间存在较大间隙。本文探讨具有更细容量增量的张量化适配器是否会改变观察到的精度-预算权衡。我们通过固定组件的规范多路分解(CP)张量适配器来实例化这个问题。在$32{\times}64{\times}32{\times}64$的张量化下,一个归一化的CP组件每个投影存储$193$个可训练标量,比LoRA的一个秩步长小约21倍。我们在OPT-1.3B上,在匹配的目标模块、训练协议、数据上限和种子调度下,比较了CP适配器和LoRA在SST-2、RTE和BoolQ上的表现。CP训练稳定,并填补了LoRA秩之间的空白,但效果依赖于任务:SST-2早期达到低预算平台,BoolQ在略低于LoRA饱和之前受益于额外的CP组件,而RTE仍然偏好LoRA。因此,更细的参数步长有助于诊断PEFT预算敏感性,但它们本身并不能保证更好的精度-预算曲线。

英文摘要

Low-rank adapters are usually compared by sweeping a small set of ranks, but the rank also fixes the resolution of the parameter budget. For a $2048{\times}2048$ OPT attention projection, increasing LoRA by one rank stores $4096$ trainable scalars, leaving large gaps between feasible low-budget adapter sizes. This paper asks whether a tensorized adapter with finer capacity increments changes the observed accuracy--budget trade-off. We instantiate this question with fixed-component canonical polyadic (CP) tensor adapters. Under a $32{\times}64{\times}32{\times}64$ tensorization, one normalized CP component stores $193$ trainable scalars per projection, about $21$ times smaller than one LoRA rank step. We compare CP adapters and LoRA on OPT-1.3B across SST-2, RTE, and BoolQ under matched target modules, training protocol, data caps, and seed schedules. CP trains stably and fills the gaps between LoRA ranks, but the effect is task-dependent: SST-2 reaches an early low-budget plateau, BoolQ benefits from additional CP components before saturating slightly below LoRA, and RTE remains LoRA-favored. Finer parameter steps are therefore useful for diagnosing PEFT budget sensitivity, but they do not by themselves guarantee a better accuracy--budget curve.

2606.00427 2026-06-02 cs.LG 版本更新

Topology-Aware State Abstraction with Tangle Cores for Markov Decision Processes

基于纠缠核的马尔可夫决策过程拓扑感知状态抽象

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(计算机科学系,爱荷华州立大学) Department of Civil, Construction & Environmental Engineering, Iowa State University(土木、建设与环境工程系,爱荷华州立大学)

AI总结 提出纠缠核抽象框架,利用经验转移图的图纠缠构建重叠状态抽象,在动作一致性条件下保证价值保持,并通过实验证明其在瓶颈领域优于现有方法。

详情
AI中文摘要

强化学习中的状态抽象通常被形式化为基于奖励和转移相似性的状态划分。这排除了导航、图和分层决策问题中的常见结构模式:接口状态(如门、枢纽和瓶颈)自然参与多个区域。我们引入了\emph{纠缠核抽象},一种基于经验转移图的图纠缠的重叠状态抽象框架。该方法从一致定向的低阶分离中构建抽象状态,并通过隶属核而非硬划分来表示共享接口。我们在显式动作一致性条件下给出了诱导的重叠抽象MDP的价值保持保证,识别了内部同质性/边界泄漏误差分解,并证明了一个定量接口重叠结果,表明硬划分何时会引入可避免的边界误差。实验上,在瓶颈表格领域、程序生成迷宫和MiniGrid表示中,纠缠核抽象在压缩-回报权衡上优于奖励感知、学习、拓扑映射和图划分基线。我们还识别了一个清晰的失败机制,即转移拓扑无信息时,纠缠可预测地几乎没有益处。这些结果将图纠缠定位为具有共享接口结构的决策问题的有效拓扑感知抽象先验。

英文摘要

State abstraction in reinforcement learning is usually formulated as a partition of states based on reward and transition similarity. This excludes a common structural pattern in navigation, graph, and hierarchical decision problems: interface states such as doors, hubs, and bottlenecks naturally participate in more than one region. We introduce \emph{tangle-core abstraction}, an overlapping state-abstraction framework based on graph tangles of empirical transition graphs. The method constructs abstract states from consistently oriented low-order separations and represents shared interfaces through a membership kernel rather than a hard partition. We give value-preservation guarantees for the induced overlapping abstract MDP under an explicit action-consistency condition, identify an interior-homogeneity/boundary-leakage error decomposition, and prove a quantitative interface-overlap result showing when hard partitions incur an avoidable boundary error. Empirically, tangle-core abstractions achieve favorable compression--return tradeoffs against reward-aware, learned, topological-map, and graph-partitioning baselines across bottlenecked tabular domains, procedurally generated mazes, and MiniGrid representations. We also identify a clear failure regime in which transition topology is uninformative, where tangles predictably offer little benefit. These results position graph tangles as an effective topology-aware abstraction prior for decision problems with shared interface structure.

2606.00426 2026-06-02 cs.LG 版本更新

Canonicalized Stable-List Replay for Private Federated Continual Learning over Language-Model Embeddings

规范化稳定列表回放:面向语言模型嵌入的私有联邦持续学习

Ibne Farabi Shihab, Abu Sa-Adat Mohamed Moon-Im Al Ahsan, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(爱荷华州立大学计算机科学系) Department of Computer Science & Engineering, BRAC University(BRAC大学计算机科学与工程系) Department of Civil, Construction & Environmental Engineering, Iowa State University(爱荷华州立大学土木、建设与环境工程系)

AI总结 针对联邦持续学习中差分隐私下回放列表无序的问题,提出规范化稳定列表回放(CSLR)方法,利用公共锚句的签名对齐客户端分布,在多个基准上提升性能。

详情
AI中文摘要

联邦持续学习(FCL)允许分布式客户端在不共享原始文本的情况下,将语言模型头部适应不断演变的NLP任务。在用户级差分隐私(DP)下,基于回放的持续学习面临一个结构性障碍:客户端只能发布候选回放摘要的小型噪声列表,且这些列表在客户端之间是无序的。我们引入了规范化稳定列表回放(CSLR),其中客户端在共享的句子嵌入空间上私有地生成候选回放分布,服务器使用公共锚句诱导的签名对齐它们。锚点提供聚合的可识别性,而不是额外的回放数据。我们证明,在可观测的锚签名间隔下,$O(\log(N/η)/p)$个锚点以至少$1-η$的概率区分$N$个候选列表元素,并给出了无序标签预言机模型的范围性无锚不可识别性结果。在持续分类、NER和对话基准的五个随机种子上,CSLR在报告的回放发布预算下,在$\eps=4$时,最终平均任务指标比最强的非CSLR DP基线提高了3.9-5.6个点,同时也优于匈牙利匹配和最优传输匹配。形式化隐私保证涵盖回放发布;端到端私有训练还需要与用于任务头更新的私有优化器组合。

英文摘要

Federated continual learning (FCL) lets distributed clients adapt language-model heads to evolving NLP tasks without sharing raw text. Under user-level differential privacy (DP), replay-based continual learning faces a structural obstacle: clients can release only small noisy lists of candidate replay summaries, and those lists are unordered across clients. We introduce Canonicalized Stable-List Replay (CSLR), where clients privately produce candidate replay distributions over a shared sentence-embedding space and the server aligns them using signatures induced by public anchor sentences. The anchors provide identifiability for aggregation rather than additional replay data. We prove that, under an observable anchor-signature margin, $O(\log(N/η)/p)$ anchors distinguish $N$ candidate list elements with probability at least $1-η$, and we give a scoped anchorless non-identifiability result for unordered-label oracle models. Across five seeds on continual classification, NER, and dialogue benchmarks, CSLR improves the final average task metric by 3.9--5.6 points over the strongest non-CSLR DP baseline at $\eps=4$ under the reported replay-release budget, while also outperforming Hungarian and optimal-transport matchers. The formal privacy guarantee covers replay release; end-to-end private training additionally requires composition with a private optimizer for task-head updates.

2606.00422 2026-06-02 cs.IR cs.LG 版本更新

UniPinRec: Unifying Generative Retrieval and Ranking at Pinterest Scale

UniPinRec:在Pinterest规模下统一生成式检索与排序

Hanyu Li, Yi-Ping Hsu, Aditya Mantha, Prabhat Agarwal, Laksh Bhasin, Jialu Wang, Hongtao Lin, Bella Huang, Yaxin Li, Xinyi Li, Chuxi Wang, Kousik Rajesh, Hooshmand Shokri Razaghi, Shunyao Li, Zongyue Qin, Jaewon Yang, James Li, Dhruvil Deven Badani, Jiajing Xu, Charles Rosenberg

发表机构 * Pinterest

AI总结 提出UniPinRec,通过共享Transformer编码用户行为序列,结合掩码动作建模、混合训练样本和跨阶段KV缓存共享,在Pinterest生产系统中首次实现检索与排序的全栈统一,提升在线参与度并降低延迟。

详情
AI中文摘要

现代推荐系统主要将检索和排序作为独立模型训练,尽管两者都越来越依赖编码相同用户行为数据的大型Transformer,导致参数、计算和服务成本重复。先前的工作统一了模型架构,但未统一完整流程:输入格式、训练过程和服务栈在阶段间仍然分散。我们提出UniPinRec,在Pinterest实现了检索和排序的全栈统一:一种输入格式、一个模型、一个训练阶段,部署在现有服务基础设施中。共享Transformer将用户行为序列编码为候选无关的表示,通过任务特定的头部分支到检索(ANN点积)和排序(交叉注意力)。三个关键思想使此工作成立:(1)掩码动作建模(MAM)消除了交错,使得无需加倍上下文长度即可实现权重共享;(2)混合训练样本将动作序列与feedview曝光列表配对,以共同满足两个目标;(3)跨阶段KV缓存共享重用检索中的用户历史计算用于排序,相比服务两个独立模型减少了总FLOPs。部署在Pinterest核心表面,UniPinRec实现了约+1%的在线参与度提升,同时将端到端服务延迟降低11.1%,QPS提升63.6%。据我们所知,这是首个在生产推荐系统中实现检索和排序全栈统一的工作,涵盖输入、模型、训练和服务。

英文摘要

Modern recommendation systems predominantly train retrieval and ranking as separate models despite both increasingly relying on large transformers encoding the same user behavior data, duplicating parameters, compute, and serving cost. Prior work unifies the model architecture but not the full pipeline: input formats, training procedures, and serving stacks remain fragmented across stages. We present UniPinRec, which achieves full-stack unification of retrieval and ranking at Pinterest: one input format, one model, one training stage, deployed within existing serving infrastructure. A shared transformer encodes the user action sequence into candidate-independent representations that branch into retrieval (ANN dot-product) and ranking (cross-attention) via task-specific heads. Three ideas make this work: (1) Masked Action Modeling (MAM) eliminates interleaving, enabling weight sharing without doubling context length; (2) Blended training examples pair action sequences with feedview impression slates to satisfy both objectives jointly; (3) Cross-stage KV cache sharing reuses user-history computation from retrieval for ranking, reducing total FLOPs versus serving two independent models. Deployed in the Pinterest core surfaces, UniPinRec delivers approximately +1% online engagement lift while cutting end-to-end serving latency by 11.1% and lifting QPS by 63.6%. To our knowledge, this is the first full-stack unification of retrieval and ranking, covering inputs, model, training and serving, deployed in a production recommendation system.

2606.00414 2026-06-02 cs.LG 版本更新

Auditing Near-Optimal Policies Can Be Exponentially Hard: Conditional Query Lower Bounds via Occupancy Rashomon Capacity

审计近最优策略可能是指数级困难的:通过占用Rashomon容量的条件查询下界

Ibne Farabi Shihab, Sanjeda Akter, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(计算机科学系,爱荷华州立大学) Department of Civil, Construction & Environmental Engineering, Iowa State University(土木、建设与环境工程系,爱荷华州立大学)

AI总结 本文通过引入占用Rashomon容量概念,证明了在存在多个近最优策略时,审计这些策略的行为差异需要指数级数量的查询,并给出了精确和噪声查询下的下界。

详情
AI中文摘要

当许多强化学习策略达到近最优回报时,事后审计员可能必须区分许多行为不同但回报等价的策略。我们通过Rashomon容量的占用度量类比来形式化这一现象:近最优占用区域的度量熵,相对于被审计的部署类别计算。由于占用度量仅识别到占用等价,我们在占用类别级别上制定审计,并区分精确局部查询预言机和噪声样本查询预言机。我们的主要精确查询结果是条件性的:如果被审计类别包含一个$2/H$-分离的近最优填充,其局部签名是$b$-稀疏的,那么精确局部查询审计需要$\Omega(M/b)$次查询;当填充实现部署类别容量且$b=O(1)$时,这变为$\Omega(2^{\Hopt^\cF(\eps)})$。我们给出了一个有限折扣隐藏分支MDP达到此界,并展示了精确贝叶斯成功律。对于噪声隐藏触发测试,我们证明了阶为$M/\beta$的混合下界,其中$\beta$是每样本KL信号,对于容量阶填充且$\beta=O(\rho^2\Delta^2)$,得到$\Omega(2^{\Hopt^\cF(\eps)}/(\rho^2\Delta^2))$。我们还提供了静态目标识别信息下界、一个转录兼容的预言机覆盖验证上界,以及一个规范占用正则化器,当存在可信参考占用时,其正则化审计容量会崩溃。受控基准将正稀疏签名实例与精确审计容易的高容量阴性对照区分开来,并将噪声触发律映射到后处理的连续控制和视觉RL审计体制。

英文摘要

When many reinforcement-learning policies achieve near-optimal return, a post-hoc auditor may have to distinguish among many behaviorally distinct but return-equivalent policies. We formalize this phenomenon through an occupancy-measure analogue of Rashomon capacity: the metric entropy of the near-optimal occupancy region, computed relative to an audited deployment class. Because occupancy measures identify behavior only up to occupancy equivalence, we formulate auditing at the occupancy-class level and distinguish exact local-query oracles from noisy sample-query oracles. Our main exact-query result is conditional: if the audited class contains a $2/H$-separated near-optimal packing whose local signatures are $b$-sparse, then exact local-query auditing requires $Ω(M/b)$ queries; when the packing realizes deployment-class capacity and $b=O(1)$, this becomes $Ω(2^{\Hopt^\cF(\eps)})$. We give a finite discounted hidden-branch MDP attaining this bound and show the exact Bayes success law. For noisy hidden-trigger testing, we prove a mixture lower bound of order $M/β$, where $β$ is the per-sample KL signal, yielding $Ω(2^{\Hopt^\cF(\eps)}/(ρ^2Δ^2))$ for capacity-order packings with $β=O(ρ^2Δ^2)$. We also provide a static target-recognition information lower bound, a transcript-compatible oracle-cover verification upper bound, and a canonical occupancy regularizer whose regularized audited capacity collapses when a trusted reference occupancy is available. Controlled benchmarks distinguish positive sparse-signature instances from high-capacity negative controls where exact auditing is easy, and map the noisy-trigger law to post-processed continuous-control and visual-RL auditing regimes.

2606.00413 2026-06-02 stat.ML cs.LG 版本更新

Riemannian Stochastic Optimization for Sufficient Dimension Reduction

充分降维的黎曼随机优化

Thibault Pautrel, François Portier

发表机构 * Laboratoire des Signaux et Systèmes (L2S), CentraleSupélec, Université Paris-Saclay, Gif-sur-Yvette, France(信号与系统实验室(L2S),中央超导电子研究所,巴黎-萨克雷大学,吉夫-sur-伊夫特,法国)

AI总结 提出一种基于黎曼流形随机梯度上升的算法SMAVE,通过将充分降维问题转化为Stiefel流形上的光滑最大化,实现高效的低维子空间恢复。

详情
AI中文摘要

充分降维(SDR)通过将协变量投影到保留响应条件均值的低维子空间,使高维回归变得易于处理。现有的基于梯度的估计器要么在原始空间中操作并遭受维数灾难,要么在降维空间中局部化,每次外迭代的代价至少与样本量成二次关系。我们证明了总体最小平均方差估计(MAVE)风险的最小化器与梯度外积(OPG)逼近相同的Grassmannian目标,并将经验准则重新表述为Stiefel流形上的光滑最大化,具有闭式黎曼梯度。由此产生的算法SMAVE结合了稀疏投影空间最近邻局部化和黎曼随机梯度上升。简化版本具有几乎必然收敛性和非渐近速率,匹配标准的非凸随机一阶缩放。实验上,SMAVE在中高维环境中匹配或改进了RMAVE的合成子空间恢复,在四个真实数据集上一致优于OPG,并且与RMAVE相比具有竞争力或更优,同时运行时间低几个数量级。

英文摘要

Sufficient dimension reduction (SDR) makes high-dimensional regression tractable by projecting the covariates onto a low-dimensional subspace that preserves the conditional mean of the response. Existing gradient-based estimators either operate in the ambient space and suffer from the curse of dimensionality, or localize in the reduced space at a per-outer-iteration cost at least quadratic in the sample size. We show that minimizers of the population Minimum Average Variance Estimation (MAVE) risk approximate the same Grassmannian target as the Outer Product of Gradients (OPG), and recast the empirical criterion as a smooth maximization on the Stiefel manifold with closed-form Riemannian gradient. The resulting algorithm, SMAVE, combines sparse projected-space nearest-neighbor localization with Riemannian stochastic gradient ascent. A simplified version comes with almost-sure convergence and a non-asymptotic rate matching the standard non-convex stochastic first-order scaling. Empirically, SMAVE matches or improves on RMAVE's synthetic subspace recovery at moderate-to-high ambient dimension, and on four real datasets it uniformly improves over OPG and is competitive with or outperforms RMAVE at orders of magnitude lower runtime.

2606.00404 2026-06-02 cs.CV cs.LG 版本更新

Rethinking Amortized Neural Representations for High-Resolution Terrain Elevation Data

重新思考高分辨率地形高程数据的摊销神经表示

Haoan Feng, Xin Xu, Leila De Floriani

发表机构 * University of Maryland, College Park(马里兰大学学院公园分校)

AI总结 针对地形高程数据,提出HUVR+SIREN超网络方法,通过替换坐标解码器为平滑可微版本,在统一基准上实现最佳高度和导数保真度,且支持后训练量化压缩。

Comments 12 pages, 7 figures, 10 tables

详情
AI中文摘要

隐式神经表示(INR)将信号建模为连续的坐标到值函数。对于地形高程数据,这支持解析导数、任意分辨率解码以及底层高度场的平滑表面模型。然而,为每个瓦片拟合和存储单独的INR无法扩展到大型地形数据集。摊销神经表示通过共享网络降低了这一成本:新瓦片被映射到紧凑的每瓦片载荷,共享解码器从中重建高度场。大多数此类方法是超网络,通过单次前向传递预测载荷,而其他方法则通过短时的每瓦片优化恢复载荷。这些方法主要针对自然图像开发,其在地形高度场上的适用性尚不清楚。我们在1米/像素的地形数据集上引入了受控基准,并在统一协议下评估了三种代表性方法。观察到明显的跨领域差距后,我们提出了HUVR+SIREN,这是一种超网络,它通过将坐标解码器替换为平滑、解析可微的解码器来适应最强的基准方法(HUVR)。它在基准上实现了最佳的高度和导数保真度,无需额外的每瓦片存储且解码成本更低,并且能够容忍激进的后训练量化而质量损失可忽略,从而形成了紧凑的地形神经格式。消融和诊断进一步确定了哪些设计选择可迁移到地形,并表明每瓦片瓶颈已接近其有用极限,剩下的差距在于共享超网络的架构设计。

英文摘要

Implicit neural representations (INRs) model a signal as a continuous coordinate-to-value function. For terrain elevation data, this supports analytic derivatives, arbitrary-resolution decoding, and a smooth surface model of the underlying heightfield. However, fitting and storing a separate INR for every tile does not scale to large terrain datasets. Amortized neural representations reduce this cost with a shared network: a new tile is mapped to a compact per-tile payload, and a shared decoder reconstructs the heightfield from it. Most such methods are hypernetworks that predict the payload in a single forward pass, while others recover it through a short per-tile optimization. These methods were developed primarily for natural images, and their suitability for terrain heightfields remains unclear. We introduce a controlled benchmark on a 1 m/pixel terrain dataset and evaluate three representative methods under a unified protocol. Observing a clear cross-domain gap, we propose HUVR+SIREN, a hypernetwork that adapts the strongest benchmarked method (HUVR) by replacing its coordinate decoder with a smooth, analytically differentiable one. It attains the best height and derivative fidelity on the benchmark with no additional per-tile storage and lower decode cost, and tolerates aggressive post-training quantization with negligible quality loss, giving a compact terrain neural format. Ablations and diagnostics further identify which design choices transfer to terrain and show that the per-tile bottleneck is already near its useful limit, leaving the remaining gap in the shared hypernetwork's architectural design.

2606.00401 2026-06-02 physics.comp-ph cond-mat.mtrl-sci cs.LG cs.NA math.NA 版本更新

Data-Driven Spectral Prediction for Accelerating Large-Scale Electronic Structure Calculations

数据驱动的光谱预测加速大规模电子结构计算

Abhiram Badrinarayanan, Davor Davidovic, Edoardo Di Napoli, Jurica Novak, Luigi Genovese, Gustavo Ramirez-Hidalgo, Xinzhe Wu

发表机构 * Jülich Supercomputing Centre, Forschungszentrum Jülich, Germany(耶拿超级计算中心,耶拿研究中心,德国) Ruđer Bošković Institute, Croatia(鲁德·波斯科维奇研究所,克罗地亚)

AI总结 针对大规模电子结构计算中广义本征问题求解的瓶颈,提出基于数据驱动的光谱预测框架,通过机器学习预测切比雪夫多项式系数,提供初始猜测以跳过早期自洽场迭代,并优化有理滤波器本征求解器。

详情
AI中文摘要

模拟包含数千个原子的大分子系统需要高度可扩展的方法。虽然现代密度泛函理论(DFT)代码具有线性标度性,但在百亿亿次架构上,求解相关的大规模稀疏广义本征问题仍然是关键的计算瓶颈。在LimitX项目背景下,我们提出了一个数据驱动框架来加速这些计算。通过将机器学习目标从离散特征值转移到插值切比雪夫多项式的系数,并比较全原子和基于片段的结构表示,我们成功克服了大规模光谱预测的维度限制。我们研究了三种机器学习模型(核岭回归、图神经网络和随机森林),这些模型在包含2 TB蛋白质二聚体的新数据集上进行训练。预测的光谱提供了初始猜测,有效跳过了BigDFT中的早期自洽场(SCF)迭代。最终,这些光谱预测器将被部署以动态优化即将推出的基于有理滤波器的本征求解器(如目前处于初期开发阶段的FrASE)。

英文摘要

Simulating large molecular systems comprising thousands of atoms requires highly scalable methodologies. While modern Density Functional Theory (DFT) codes exhibit linear scaling, solving the associated large, sparse generalized eigenproblems remains a critical computational bottleneck on exascale architectures. In the context of the LimitX project, we propose a data-driven framework to accelerate these calculations. By shifting the machine learning target from discrete eigenvalues to the coefficients of an interpolating Chebyshev polynomial, and by comparing both all-atom and fragment-based structural representations, we successfully overcome the dimensionality constraints of large-scale spectral prediction. We investigate three machine learning models (Kernel Ridge Regression, Graph Neural Networks, and Random Forests) trained on a novel 2 TB dataset of protein dimers. The predicted spectra provide initial guesses that effectively bypass early Self-Consistent Field (SCF) iterations in BigDFT. Ultimately, these spectral predictors will be deployed to dynamically optimize upcoming rational filter-based eigensolvers, such as FrASE, which is currently in initial development.

2606.00400 2026-06-02 cs.LG 版本更新

Dynamic Proxy-Mixing: Transferring Replay Controllers from Small to Large Models for Continual Instruction Tuning

动态代理混合:将重放控制器从小模型迁移到大模型以进行持续指令微调

Ibne Farabi Shihab, Fariya Afrin, Anuj Sharma

发表机构 * Department of Computer Science, Iowa State University(爱荷华州立大学计算机科学系) Department of Computer Science, Kalinga Institute of Industrial Technology(卡林加工业技术学院计算机科学系) Department of Civil, Construction & Environmental Engineering, Iowa State University(爱荷华州立大学土木、建筑与环境工程系)

AI总结 提出PROXY-MIX框架,通过在小代理模型上学习动态重放控制器并冻结迁移至大模型,以解决持续指令微调中固定重放比例导致的灾难性遗忘问题,在LLaMA-3-8B上平均准确率提升3.4%,遗忘降低3.5%,安全性提升5.8%。

详情
AI中文摘要

持续指令微调通过一系列新领域更新语言模型,但每次更新会逐渐侵蚀先前学到的能力和对齐行为。重放是标准的缓解方法,但固定重放比例本质上有限,因为最优混合比例随当前领域、训练阶段以及先前行为的脆弱性而变化。我们提出PROXY-MIX框架,该框架在小代理模型上学习动态重放控制器,并将冻结的控制器迁移到更大的目标模型。控制器从未见过未来任务,而是从归一化的验证损失及其时间动态构建状态,生成当前任务和可访问重放缓冲区的掩码混合。我们的核心经验假设是遗忘镜像:即使绝对损失大小不同,任务脆弱性排名在不同模型规模上基本一致。在跨规模迁移控制器之前,我们通过实验验证了这一假设。在LLaMA-3-8B上跨越五个持续指令微调序列,PROXY-MIX在平均准确率上提高了3.4个百分点,最终遗忘降低了3.5个百分点,安全性得分比最强的非神谕基线提高了5.8个百分点,策略学习成本约为神谕目标强化学习的50倍。该框架在接口层面无泄漏且架构无关,我们还确定了代理假设失效的设置,突出了鲁棒部署的局限性。

英文摘要

Continual instruction tuning updates a language model through a sequence of new domains, yet each update can progressively erode previously learned capabilities and alignment behavior. Replay is the standard mitigation, but fixed replay ratios are inherently limited because the optimal mixture varies with the current domain, the training stage, and the evolving vulnerability of prior behaviors. We propose PROX-YMIX, a framework that learns a dynamic replay controller on a small proxy model and transfers the frozen controller to a larger target. The controller never observes future tasks and constructs its state from normalized validation losses and their temporal dynamics, producing a masked mixture over the current task and accessible replay buffers. Our core empirical hypothesis is forgetting mirroring: task vulnerability rankings remain largely consistent across model scales even when absolute loss magnitudes differ. We validate this assumption empirically before transferring controllers across scales. On LLaMA-3-8B across five continual instruction tuning sequences, PROXYMIX improves average accuracy by 3.4 points, reduces final forgetting by 3.5 points, and raises safety score by 5.8 points over the strongest non-oracle baseline, at roughly 50x lower policy learning cost than Oracle Target RL. The framework is leakage free and architecture independent at the interface level, and we also identify settings where the proxy assumption breaks down, highlighting limitations for robust deployment.

2606.00399 2026-06-02 cs.LG 版本更新

Multi-Objective Reference-Aligned Machine Unlearning

多目标参考对齐机器遗忘

Rasa Khosrowshahli, Stephen Asobiela, Beatrice Ombuki-Berman, Shahryar Rahnamayan

发表机构 * arXiv

AI总结 提出多目标框架RAUL,通过将遗忘样本的预测对齐到参考分布(均匀分布或保留集经验分布)来约束遗忘目标,并利用雅可比下降解决多目标优化,实现接近完整重训练的遗忘效果。

Comments Accepted as a short paper at Canadian AI 2026. Author version with an added framework overview figure for clarity

详情
AI中文摘要

机器遗忘旨在移除特定训练样本的影响,同时保持模型的效用。现有的单目标方法,如梯度上升或随机重标,常常由于冲突的优化动态和无界的遗忘目标导致灾难性遗忘,使模型偏离其预训练知识。我们提出参考对齐遗忘(RAUL),一个多目标框架,通过将遗忘样本上的无界损失最大化替换为有界的KL对齐,使其预测对齐到代表未见数据的参考分布(可实例化为均匀分布或来自保留参考集的经验分布),从而约束遗忘目标并减少与保留目标的梯度冲突,联合优化遗忘和保留。通过雅可比下降解决由此产生的多目标优化(MOO)问题,该算法将多个梯度聚合到无冲突的方向。我们的结果表明,与完全重训练相比,RAUL实现了最接近的差距。

英文摘要

Machine unlearning aims to remove the influence of specific training samples while preserving the model's utility. Existing single-objective approaches, such as gradient ascent or random relabeling, often induce catastrophic forgetting due to conflicting optimization dynamics and unbounded forgetting objectives that cause the model to drift from its pre-trained knowledge. We propose Reference-Aligned UnLearning (RAUL), a multi-objective framework that jointly optimizes forgetting and retention by replacing unbounded loss maximization with a bounded KL alignment of predictions on forgotten samples toward a reference distribution representing unseen data, instantiated either as a uniform distribution or an empirical distribution from a held-out reference set, which constrains the forgetting objective and reduces gradient conflict with retention. The resulting multi-objective optimization (MOO) problem is solved via Jacobian descent, which aggregates multiple gradients into a direction that does not conflict. Our results demonstrate that RAUL achieves the closest gap compared to full retraining.

2606.00392 2026-06-02 cs.LG cs.AI 版本更新

Detector-Evasive LLM Paraphrasing via Constrained Policy Optimization

通过约束策略优化实现检测器规避的LLM释义

Mingyi Wang, Zhuoer Shen, Yuheng Bu, Shaofeng Zou

发表机构 * School of ECEE, Arizona State University(亚利桑那州立大学电子工程与计算机科学学院) Department of Computer Science, University of California, Santa Barbara(加州大学圣巴巴拉分校计算机科学系)

AI总结 提出DEPO算法,将检测器规避的LLM释义建模为约束马尔可夫决策过程,通过拉格朗日对偶强化学习在保持语义的同时实现高效规避。

详情
AI中文摘要

AI文本检测器易受释义和检测器引导的释义攻击,但现有规避方法缺乏对语义保持的精确控制。特别是,直接优化检测器规避会降低细粒度语义,而标量化奖励设计仅提供间接、权重敏感的规避-语义权衡控制。我们通过将检测器规避的LLM释义建模为约束马尔可夫决策过程来解决这一限制,其中检测器规避是主要目标,语义保持作为显式约束强制执行。我们提出检测器规避策略优化(DEPO),一种拉格朗日原始-对偶强化学习算法,具有新颖的GRPO风格组基策略更新。DEPO在训练期间自适应平衡语义保持和检测器规避,使策略能够在规定的语义保持区域内提高攻击成功率。在MAGE、M4、RAID和同行评审数据集上的实验,针对MAGE、RoBERTa、RADAR、Binoculars和Fast-DetectGPT检测器进行评估,表明DEPO在精确满足语义保持约束的同时实现了强大的检测器规避。DEPO还表现出跨领域、跨检测器和提示级别的鲁棒性。

英文摘要

AI-text detectors are vulnerable to paraphrasing and detector-guided paraphrasing attacks, but existing detector-evasion methods often lack precise control over semantic preservation. In particular, optimizing directly for detector evasion can degrade fine-grained semantics, whereas scalarized reward designs provide only indirect, weight-sensitive control over the evasion-semantics trade-off. We address this limitation by formulating detector-evasive LLM paraphrasing as a Constrained Markov Decision Process, where detector evasion is the primary objective and semantic preservation is enforced as an explicit constraint. We propose Detector Evasion Policy Optimization (DEPO), a Lagrangian primal-dual reinforcement learning algorithm with a novel GRPO-style group-based policy update. DEPO adaptively balances semantic preservation and detector evasion during training, enabling the policy to improve attack success within a prescribed semantic-preservation region. Experiments on MAGE, M4, RAID, and peer-review datasets, evaluated against MAGE, RoBERTa, RADAR, Binoculars, and Fast-DetectGPT detectors, show that DEPO achieves strong detector evasion while precisely satisfying the semantic preservation constraint. DEPO also exhibits cross-domain, cross-detector, and prompt-level robustness.

2606.00383 2026-06-02 cs.RO cs.LG cs.SY eess.SY 版本更新

Behavior Cloning of MPC for 3-DOF Robotic Manipulators

三自由度机械臂MPC的行为克隆

Theo Guegan, Dexter Wen Jie Teo

发表机构 * University of Waterloo(多伦多大学) Universite de Technologie de Compiègne(技术与科学大学) Nanyang Technological University(南洋理工大学) Polytechnique Montréal(蒙特利尔理工学院)

AI总结 针对MPC实时计算负担重的问题,采用行为克隆方法近似MPC策略,通过多种神经网络架构实现三自由度机械臂的实时控制,在宽松容差下推理延迟降低3倍,成功率84.98%。

Comments Accepted at the IEEE ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning (RL4IL), 6 pages excluding references

详情
AI中文摘要

虽然模型预测控制(MPC)提供了强大的稳定性和鲁棒性,但它给实时系统带来了显著的计算负担。本文研究了行为克隆在近似MPC策略以实时控制三自由度机械臂中的应用。我们提出了一个结合逆运动学与MPC的基线控制器,并评估了从经典回归算法到深度学习模型(包括深度MLP和RNN)的神经网络架构,以推导计算高效的替代策略。我们分析了泛化能力、稳定性考虑以及不同架构选择固有的权衡。我们的实证研究采用了在线和离线评估,以评估在准确性、计算效率和对原始MPC策略的忠实度方面的性能。结果表明,行为克隆可以有效减少三自由度机械臂MPC策略的计算负担,在宽松容差下推理延迟降低3倍,成功率达到84.98%。值得注意的是,我们发现静态架构优于时间变体,证实了瞬时状态观测对此任务的充分性。然而,在严格容差下我们观察到精度差距,这表明虽然行为克隆捕获了全局最优轨迹,但需要进一步研究以最小化终端稳态误差。

英文摘要

While Model Predictive Control (MPC) provides strong stability and robustness, it imposes a significant computational burden on real-time systems. This paper investigates the application of Behavior Cloning to approximate MPC policies for the real-time control of a 3-degree-of-freedom robotic manipulator. We present a baseline controller combining Inverse Kinematics with MPC and evaluate neural network architectures, ranging from classical regression algorithms to deep learning models including Deep MLPs and RNNs, to derive computationally efficient surrogate policies. We analyze generalization capabilities, stability considerations, and the trade-offs inherent in different architectural choices. Our empirical study employs both online and offline evaluations to assess performance regarding accuracy, computational efficiency, and fidelity to the original MPC policy. Our results demonstrate that Behavior Cloning can effectively reduce the computational burden of MPC policies for 3-DOF robotic manipulators, achieving a 3x reduction in inference latency with a 84.98% success rate under relaxed tolerances. Notably, we find that static architectures outperform temporal variants, confirming the sufficiency of instantaneous state observations for this task. However, we observe a precision gap under strict tolerances, which suggest that while Behavior Cloning captures the global optimal trajectory, further research is needed to minimize terminal steady-state error.

2606.00382 2026-06-02 cs.LG 版本更新

CRMA: A Spectrally-Bounded Backbone for Modular Continual Fine-Tuning of LLMs

CRMA:用于大语言模型模块化持续微调的谱约束骨干

Kiran Nayudu, Aswini Nutakki, Sai Vinay Naidu, Ashwin Shanmugasundaram

发表机构 * ModelBrew AI

AI总结 提出CRMA残差适配器,通过Sinkhorn归一化确保混合矩阵双随机,从而在结构上约束谱范数,实现共享基座的持续训练与跨任务正向迁移,无需回放或蒸馏。

Comments 38 pages, 10 figures. Patent-pending construction details deferred to companion technical report (in preparation)

详情
AI中文摘要

大语言模型的顺序微调面临两难选择:要么让共享基座持续学习并接受灾难性遗忘,要么在第一个任务后冻结它并放弃跨任务优化。每任务适配器方法(LoRAHub、AdapterFusion、PackNet、Progressive Networks)选择了后一条路径。我们提出CRMA(约束残差混合适配器),一种残差适配器,其内部混合矩阵M通过Sinkhorn归一化在每次前向传播时保持双随机,因此根据Birkhoff定理,||M||_2 <= 1由构造保证——这是一种结构约束,而非惩罚项。CRMA的谱约束骨干提供了一个持续训练的共享基座,这是早期模块化方法无法实现的,同时保留了它们的遗忘保证。在Mistral-7B上,跨越5个顺序领域和3个种子,基于CRMA骨干的模块化每任务LoRA将损失相对漂移从+42.96% ± 5.5(朴素顺序微调)降低到-0.17% ± 0.17,且每种子范围不重叠,并将先前任务的保留损失比匹配的冻结基座基线提高了1.99% ± 0.54。三个独立的实验设置(Mistral-7B 4领域受控消融、TinyLlama 3领域污染控制复现、Mistral-7B 7B跨领域探测)均显示出正向反向迁移——无需回放缓冲区、无需增加每任务内存、无需蒸馏。在Gemma-2-9B上的推理时消融证实CRMA介导了对顺序训练知识的访问:在相同权重和相同问题上,仅通过切换CRMA注入,结果从38/100提升到98/100。867次记录的训练步骤验证了||M||_2 = 1.0在float32精度内(最大偏差1.2×10^-7)。遗忘预防效果在1.1B-9.2B参数和四个架构系列中均成立。

英文摘要

Sequential fine-tuning of large language models forces a choice: let the shared substrate keep learning and accept catastrophic forgetting, or freeze it after task one and foreclose cross-task refinement. Per-task adapter methods (LoRAHub, AdapterFusion, PackNet, Progressive Networks) take the second path. We introduce CRMA (Constrained Residual Mixing Adapter), a residual adapter whose internal mixing matrix M is doubly-stochastic at every forward pass via Sinkhorn normalization, so by Birkhoff's theorem ||M||_2 <= 1 holds by construction -- a structural bound, not a penalty. CRMA's spectrally bounded backbone provides a continuously trained shared substrate that earlier modular methods could not, while preserving their forgetting guarantees. On Mistral-7B across 5 sequential domains and 3 seeds, modular per-task LoRA on a CRMA backbone reduces loss-relative drift from +42.96% +/- 5.5 (naive sequential fine-tuning) to -0.17% +/- 0.17, with disjoint per-seed ranges, and improves prior-task holdout loss by 1.99% +/- 0.54 over a matched frozen-substrate baseline. Three independent experimental setups (Mistral-7B 4-domain controlled ablation, TinyLlama 3-domain contamination-controlled replication, Mistral-7B cross-domain probes at 7B) all show positive backward transfer -- without replay buffers, without growing per-task memory, and without distillation. An inference-time ablation on Gemma-2-9B confirms CRMA mediates access to sequentially trained knowledge: 98/100 vs. 38/100 on the same weights and same questions with only CRMA injection toggled. 867 logged training steps verify ||M||_2 = 1.0 within float32 precision (max deviation 1.2 x 10^-7). The forgetting-prevention effect holds across 1.1B-9.2B parameters and four architecture families.

2606.00376 2026-06-02 cs.AI cs.CL cs.LG 版本更新

The Deterministic Horizon: When Extended Reasoning Fails and Tool Delegation Becomes Necessary

确定性视界:当扩展推理失败时工具委托变得必要

Dongxin Guo, Jikun Wu, Siu Ming Yiu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过注意力瓶颈定理和确定性视界概念,证明解码器-only注意力在确定性状态追踪任务中存在信息论容量限制,导致扩展推理性能退化,并指出当视界超过19-31时工具委托成为必要。

Comments Accepted at ICML 2026. 4 figures. 51 pages including appendices

详情
AI中文摘要

扩展的思维链推理可能会在确定性状态追踪任务上降低性能,这不是由于偏好偏差,而是源于解码器-only注意力的信息论容量限制。我们建立了:(1) 注意力瓶颈定理及互补的可达性构造,将状态追踪容量界定为 $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$;(2) 一个上下文相关的错误模型,导致超指数精度衰减;(3) 状态空间Jaccard度量,区分能力与偏好失败;(4) 确定性视界 $d^* \in [19, 31]$,超过该视界工具委托变得必要。在12个模型和8个任务领域(包括SWE-Bench、WebArena和SQL-Multi)中,工具集成推理始终优于神经思维链;在主要模型套件上,其准确率达到86-94%,而神经思维链仅为24-42%。在最优长度轨迹上进行微调仅带来<5%的提升,证实了架构上限,并且高跨模型相关性($r = 0.81$-$0.91$)表明这些失败是架构性的而非训练特定的。我们的结果为在代理系统中纯神经推理何时应让位于混合方法提供了原则性指导。

英文摘要

Extended chain-of-thought reasoning can degrade performance on deterministic state-tracking tasks, not due to preference biases, but limits rooted in the information-theoretic capacity of decoder-only attention. We establish: (1) an Attention Bottleneck Theorem with a complementary achievability construction, bounding state-tracking capacity as $O(H \cdot \log(L/H) \cdot \sqrt{d_h})$; (2) a context-dependent error model yielding super-exponential accuracy decay; (3) the State-Space Jaccard metric distinguishing capability from preference failures; (4) a Deterministic Horizon $d^* \in [19, 31]$ beyond which tool delegation becomes necessary. Across 12 models and 8 task domains (including SWE-Bench, WebArena, and SQL-Multi), tool-integrated reasoning consistently outperforms neural chain-of-thought; on the primary model suite it reaches 86-94% accuracy versus 24-42% for neural chain-of-thought. Fine-tuning on optimal-length traces yields $<$5% improvement, confirming an architectural ceiling, and high cross-model correlation ($r = 0.81$-$0.91$) indicates these failures are architectural rather than training-specific. Our results provide principled guidance for when pure neural reasoning should yield to hybrid approaches in agentic systems.

2606.00371 2026-06-02 cs.LG 版本更新

How Much Orthogonalization Does Muon Need?

Muon 需要多少正交化?

Hua Huang

发表机构 * NVIDIA

AI总结 研究 Muon 优化器所需的正交化程度,提出一种基于三次牛顿-舒尔茨迭代的低成本正交化变体 cubic5,并在多种模型上验证其与高精度方法性能相当。

详情
AI中文摘要

Muon 优化器通过将病态动量更新替换为近似半正交更新来改进神经网络训练。这引出一个实际问题:Muon 实际上需要多少正交化?我们使用直接为 Muon 的低精度奇异值带导出的松弛三次牛顿-舒尔茨调度来研究这个问题。与五次五次牛顿-舒尔茨迭代的十五次主导矩阵乘法相比,所得的五步三次构造使用十次主导矩阵乘法。三次调度并非旨在作为更精确的极分解求解器;相反,它是一种原则性的低成本变体,使我们能够探究极分解精度、谱整形和训练质量之间的关系。通过合成诊断、NanoGPT 消融实验以及混合 MoE/Mamba 模型的训练实验,我们发现训练质量并非由极分解精度单调决定:截断的 Polar Express、Muon-Jordan、三次牛顿-舒尔茨以及显式 FP32 SVD 极分解因子在 GPT-2 Small 上可达到几乎无法区分的最终损失,而 cubic5 在具有十亿到四十亿参数的混合 MoE/Mamba 模型上,其验证损失与 Muon-Jordan 五次更新相差约 $10^{-3}$。这些结果支持 cubic5 作为一种实用的低成本 Muon 正交化变体,并在测试的设置中提供了训练质量等同的实验证据。

英文摘要

Muon optimizers improve neural-network training by replacing ill-conditioned momentum updates with approximately semi-orthogonal updates. This motivates a practical question: how much orthogonalization does Muon actually require? We study this question using a relaxed cubic Newton--Schulz schedule derived directly for Muon's low precision singular value band. The resulting five-step cubic construction uses ten dominant matrix multiplications, compared with fifteen for five quintic Newton--Schulz iterations. The cubic schedule is not intended as a more accurate polar solver; instead, it is a principled low-cost variant that lets us probe the relation between polar accuracy, spectral shaping, and training quality. Across synthetic diagnostics, NanoGPT ablations, and training experiments on hybrid MoE/Mamba models, we find that training quality is not governed monotonically by polar-decomposition accuracy: truncated Polar Express, Muon-Jordan, cubic Newton--Schulz, and an explicit FP32 SVD polar factor can reach nearly indistinguishable final loss on GPT-2 Small, and cubic5 matches the Muon-Jordan quintic update within about $10^{-3}$ validation loss on hybrid MoE/Mamba models with one billion to four billion parameters. These results support cubic5 as a practical low-cost Muon orthogonalization variant, with empirical evidence of training-quality parity in the settings tested.

2606.00369 2026-06-02 cs.CY cs.LG 版本更新

Quantifying the Salience of Geo-Cultural Values for Pluralistic Safety Alignment

量化地理文化价值对多元安全对齐的显著性

Arkadiy Saakyan, Charvi Rastogi, Lora Aroyo

发表机构 * University of Oxford(牛津大学) University of Cambridge(剑桥大学)

AI总结 通过多层次模型分析,发现文化区域归属对安全评分有显著影响(p<0.05),约10%的项目存在文化敏感性,当前LLM无法可靠替代人类评分员但可辅助筛选。

Comments 119 pages, 13 figures. ICML 2026 camera ready

详情
AI中文摘要

AI模型的安全全球部署需要与跨文化的人类价值观对齐。然而,安全评估数据集中的评分者群体在地理上仍然高度同质,未能捕捉地理文化差异。此外,在控制年龄、性别和种族等人口统计学因素后,这些差异是否仍然存在尚不清楚。通过对安全数据集的元分析,我们发现大多数数据集未报告地理文化信息,而那些报告的数据集缺乏统一的方法来联合分析地理文化和人口统计学相关性。利用Inglehart-Welzel跨文化变异维度,我们通过多层次模型证明,文化区域归属解释了超出标准人口统计学变量的安全评分方差(6个数据集中p<0.05)。此外,我们的分析表明,我们检查的数据集中大约10%的项目具有文化敏感性:如果没有充分的文化代表性,这些项目很可能被错误分类为安全。我们将LLM评估为评分替代工具和分诊工具,发现当前的LLM不能可靠地替代评分员,尽管它们可以帮助优先选择文化敏感项目进行人工标注。我们的发现推动了更多文化多元的安全评估,并提供了支持其实践的实用建议。

英文摘要

Safe global deployment of AI models requires alignment with human values that vary across cultures. Yet rater pools in safety evaluation datasets remain largely geographically homogeneous, failing to capture geo-cultural differences. Further, it remains unclear whether such differences persist after controlling for demographics such as age, gender, and ethnicity. Through a meta-analysis of safety datasets, we find that most do not report geo-cultural information, and those that do lack a unified methodology to jointly analyze geo-cultural and demographic correlates. Using the Inglehart-Welzel dimensions of cross-cultural variation, we demonstrate via multilevel modeling that cultural zone membership explains variance in safety ratings beyond standard demographics (p<0.05 across 6 datasets). Moreover, our analysis indicates that roughly 10% of items in the datasets we examined are culturally sensitive: likely to be misclassified as safe without adequate cultural representation. We evaluate LLMs as both rater surrogates and triage tools, finding that current LLMs do not reliably stand in for raters, though they can help prioritize culturally sensitive items for human annotation. Our findings motivate more culturally pluralistic safety evaluation and offer practical takeaways to support it.

2606.00367 2026-06-02 cs.LG cs.AI 版本更新

Reinforcement Learning with Pairwise Preferences in Long-Term Decision Problems

长期决策问题中基于成对偏好的强化学习

Jonathan Colaço Carr, Prakash Panangaden, Doina Precup, Benjamin Van Roy

发表机构 * School of Computer Science, McGill University, Montreal, Quebec, Canada(麦吉尔大学计算机科学学院) Mila - Quebec AI Institute, Montreal, Quebec, Canada(魁北克人工智能研究所) Department of Electrical Engineering, Stanford University, Stanford, California, USA(斯坦福大学电气工程系)

AI总结 针对长期决策问题中基于成对偏好的强化学习效率低且缺乏马尔可夫策略最优性保证的问题,提出马尔可夫决策竞赛模型,证明平稳马尔可夫策略最优性、求解复杂度为P,并给出亚线性收敛算法,在高维长期问题中显著提升学习效率。

详情
AI中文摘要

强化学习问题通常将目标定义为最大化标量奖励函数的期望值。但是,成对偏好通常比标量奖励更容易指定,并且它们表达了标量奖励无法表达的某些目标。因此,基于成对偏好的强化学习方法受到了越来越多的关注。不幸的是,这些方法在长时间跨度的任务中效率低下,并且缺乏关于马尔可夫策略相对于历史依赖策略的性能保证,而这连接了强化学习的理论与实践。因此,我们提出了 extit{马尔可夫决策竞赛}作为基于成对偏好的强化学习的新问题模型。我们证明了平稳马尔可夫策略在所有历史依赖策略中是最优的,精确求解马尔可夫决策竞赛属于P类问题,并且一个简单的迭代算法以亚线性速率收敛到最优策略。最后,在一组具有长时间跨度的高维决策问题中,我们展示了我们的近似算法在学习效率上显著优于先前的工作。

英文摘要

Reinforcement learning problems typically define the goal as maximizing the expected value of a scalar reward function. But, pairwise preferences are often easier to specify than scalar rewards, and they express certain goals that scalar rewards cannot. Methods for reinforcement learning with pairwise preferences have thus received growing interest. Unfortunately, these methods are inefficient in problems with long time horizons, and they lack guarantees on the performance of Markov policies relative to history-dependent policies, which bridge the theory and practice of reinforcement learning. We therefore propose the \textit{Markov decision contest} as a new problem model for reinforcement learning with pairwise preferences. We prove that stationary Markov policies are optimal among all history-dependent policies, that solving a Markov decision contest exactly is in P, and that a simple iterative algorithm converges to an optimal policy at a sublinear rate. Lastly, in a set of high-dimensional decision problems with long time horizons, we show that our approximate algorithm is significantly more learning-efficient than prior work.

2606.00350 2026-06-02 cs.LG cs.AI 版本更新

Drift Q-Learning

Drift Q-Learning

Anas Houssaini, Mohamad H. Danesh, Amin Abyaneh, Scott Fujimoto, Hsiu-Chin Lin, David Meger

发表机构 * McGill University(麦吉尔大学) Mila - Quebec AI Institute(魁北克AI研究院)

AI总结 提出DriftQL,通过漂移正则化与Q学习结合,在离线强化学习中避免分布外动作,单步生成动作,性能优于扩散和流方法。

详情
AI中文摘要

离线强化学习需要从固定数据中改进策略,同时避免具有不可靠价值估计的分布外动作。扩散和流策略通过建模行为分布来正则化强化学习目标以处理这种权衡,但它们需要迭代去噪、求解器集成,并且在更高效的变体中,推理时需要蒸馏或其他近似。我们提出DriftQL,它将基于漂移的行为正则化器与评论家驱动的策略改进相结合。价值信号将策略偏向数据支持的高价值区域,而吸引和排斥共同使生成的动作接近数据并防止坍缩到单一模式。DriftQL实现为具有统一训练目标的单一网络,并在单次前向传播中生成动作。在D4RL和OGBench上,DriftQL持续优于扩散和流方法,推进了最先进水平。在数据质量下降(基线明显挣扎)的情况下,DriftQL保持接近其干净数据性能,使其成为扩散和流方法的有前途的替代方案,同时保持确定性方法的简单性和效率。项目页面:https://driftql.github.io/

英文摘要

Offline reinforcement learning requires improving a policy from fixed data while avoiding out-of-distribution actions with unreliable value estimates. Diffusion and flow policies handle this trade-off by modeling the behavior distribution to regularize the RL objective, but they require iterative denoising, solver integrations, and in more efficient variants, distillation or other approximations at inference. We propose DriftQL, which combines a drift-based behavioral regularizer with critic-driven policy improvement. The value signal biases the policy toward high-value regions of the data support, while attraction and repulsion together keep generated actions near the data and prevent collapse onto a single mode. DriftQL is implemented as a single network with a unified training objective and generates actions in a single forward pass. On D4RL and OGBench, DriftQL consistently outperforms diffusion and flow methods, advancing the state of the art. Under degraded data quality, where the baselines visibly struggle, DriftQL remains close to its clean-data performance, positioning it as a promising alternative to diffusion and flow-based methods while maintaining the simplicity and efficiency of deterministic approaches. Project page: https://driftql.github.io/

2606.00349 2026-06-02 cs.LG cs.AI cs.CE 版本更新

(HB-ARFM) History-Bootstrapped Flow Matching for Inverse Boiling Reconstruction

(HB-ARFM) 基于历史引导的流匹配用于逆沸腾重建

Xianwei Zou, Sheikh Md Shakeel Hassan, Arthur Feeney, Aparna Chandramowlishwaran

发表机构 * arXiv

AI总结 提出历史引导自回归流匹配方法,通过条件流匹配和自回归传播解决部分观测下的时空逆重建问题,在沸腾动力学重建中优于其他模型。

Comments ICML 2026

详情
AI中文摘要

从部分观测中重建时空场是科学推理的基础,例如从卫星数据推断大气状态或从成像恢复流体状态。当观测不完整时,逆问题本质上是病态的:即使底层PDE动力学在全状态上是马尔可夫的,部分观测算子也会诱导出非马尔可夫的后验,无法从单个时间步解析。我们提出了一种历史引导自回归流匹配方法,用于部分可观测性下的时空逆重建。观测历史通过条件流匹配引导初始重建,减少歧义。然后自回归地应用相同的条件传输模型,以新观测和过去预测为条件,将重建向前传播。我们在沸腾动力学重建上评估该方法,从界面几何和运动恢复完整的速度和温度场。在两个不同观测稀疏性的逆任务中,HB-ARFM产生了物理和时间上有效的重建,而其他模型则失败。

英文摘要

Reconstructing spatiotemporal fields from partial observations is fundamental to scientific inference, from inferring atmospheric states from satellite data to recovering fluid states from imaging. When observations are incomplete, the inverse problem is fundamentally ill-posed: even when the underlying PDE dynamics are Markovian in the full state, partial observation operators induce a non-Markovian posterior that cannot be resolved from a single timestep. We propose a history-bootstrapped autoregressive flow matching (HB-ARFM) for spatiotemporal inverse reconstruction under partial observability. Observation history bootstraps the initial reconstruction via conditional flow matching, reducing ambiguities. The same conditional transport model is then applied autoregressively, conditioning on both new observations and past predictions to propagate the reconstruction forward in time. We evaluate the method on boiling dynamics reconstruction, recovering full velocity and temperature fields from interface geometry and motion. Across two inverse tasks with varying observation sparsity, HB-ARFM produces physically and temporally valid reconstructions where other models fail.

2606.00345 2026-06-02 cs.LG 版本更新

Longitudinal Multimodal Sensing of Physical Activity and Well-Being in Older Adults

老年人身体活动与福祉的纵向多模态感知

Flavio Di Martino, Mattia G. Campana, Marcello Magno, Lorenza Pratali, Franca Delmastro

发表机构 * IIT-CNR(意大利理工学院-克雷斯塔纳国家研究委员会) IFC-CNR(意大利弗洛rence-克雷斯塔纳国家研究委员会)

AI总结 本研究通过纵向多模态数据(可穿戴传感、行为监测和临床评估)对66名老年人进行现实世界监测,发现可观察行为目标预测性能良好(macro-F1 65%),而抽象结果预测仍具挑战,且历史特征是最重要的预测因子。

详情
AI中文摘要

可穿戴和移动传感技术能够在现实环境中连续监测人类行为和健康。然而,纵向多模态数据中的预测建模仍然具有挑战性,特别是在针对复杂或临床衍生结果时。在这项工作中,我们展示了一项在现实条件下进行的纵向多模态研究,涉及66名老年人,结合了可穿戴传感、行为监测和临床评估。这一设置提供了研究长期、野外条件下代表性不足人群的难得机会。基于该数据集,我们研究了感知信号与目标变量之间的对齐如何影响跨健康相关任务的预测性能。我们设计了一个统一的评估框架,涵盖具有不同可观测性水平的任务,包括活动水平预测、睡眠时长估计和睡眠呼吸暂停严重程度分类。我们的结果揭示了明确的预测性梯度:高度可观察的行为目标实现了稳健的性能(macro-F1 65%),而更抽象的结果尽管相对于基线模型持续改进,但仍然具有挑战性。此外,通过可解释性分析,我们表明历史特征始终是最具信息量的预测因子,突显了纵向信息的核心作用。

英文摘要

Wearable and mobile sensing technologies enable continuous monitoring of human behavior and health in real-world settings. However, predictive modeling in longitudinal multimodal data remains challenging, particularly when targeting complex or clinically derived outcomes. In this work, we present a longitudinal multimodal study of 66 older adults conducted in real-world conditions and combining wearable sensing, behavioral monitoring, and clinical assessments. This setting provides a rare opportunity to study an underrepresented population in long-term, into-the-wild conditions. Building on this dataset, we investigate how the alignment between sensed signals and target variables affects predictive performance across health-related tasks. We design a unified evaluation framework spanning tasks with increasing levels of observability, including Activity Levels prediction, Sleep Duration estimation, and Sleep Apnea Severity classification. Our results reveal a clear gradient of predictability: highly observable behavioral targets achieve robust performance (macro-F1 65%), while more abstract outcomes remain challenging despite consistent improvements over baseline models. Moreover, through explainability analysis, we show that historical features consistently emerge as the most informative predictors, highlighting the central role of longitudinal information.

2606.00344 2026-06-02 cs.LG 版本更新

The role of class encoding in neural collapse

类编码在神经坍缩中的作用

Bastien Massion, Roy Makhlouf, Estelle Massart

发表机构 * Institute of Cognitive Sciences, University of Amsterdam(阿姆斯特丹大学认知科学研究所)

AI总结 本文通过无限制特征模型和均方误差训练损失,研究标签编码对神经坍缩的影响,发现one-hot编码和平衡数据下,增大偏置正则化系数时,各类未中心化均值特征从单纯形等角紧框架转变为正交框架,并证明任意编码下分类器偏置旨在居中标签。

详情
AI中文摘要

神经坍缩是神经网络分类模型中最后一层隐藏层激活的一个结构特性,当训练超过零分类误差时出现。在这项工作中,我们依靠均方误差训练损失的无限制特征模型,探索标签编码在神经坍缩中的作用。我们证明,对于one-hot编码标签和平衡数据,当增加与最终分类器相关的偏置正则化系数时,每个类别的未中心化均值特征从单纯形等角紧框架转变为正交框架。这些结构让人联想到one-hot编码标签的正交框架结构。对于任意编码,我们还表明最终分类器的偏置旨在居中标签,补偿标签全局均值与原点的差异。我们进一步讨论了编码在其他神经坍缩特性中的作用。

英文摘要

Neural collapse is a structural property of the last-hidden-layer activations in neural network classification models, when trained beyond a zero classification error. In this work, we explore the role of label encoding in neural collapse by relying on the unrestricted feature model with mean squared error training loss. We demonstrate that, for one-hot encoded labels and balanced data, the uncentered mean features associated with each class transition from a simplex equiangular tight frame to an orthogonal frame when increasing the bias regularization coefficient associated with the final classifier. These structures are reminiscent of the orthogonal frame structure of one-hot encoded labels. For any arbitrary encoding, we also show that the final classifier's bias aims at centering the labels, compensating for the discrepancy between the global mean of the labels and the origin. We further discuss the role of the encoding in other neural collapse properties.

2606.00342 2026-06-02 cs.LG cs.CR cs.DB 版本更新

PE-means: Improved Differentially Private $k$-means Clustering through Private Evolution

PE-means: 通过私有进化改进差分隐私 $k$-均值聚类

Thomas Humphries, Zinan Lin, Sergey Yekhanin

发表机构 * University of Waterloo(滑铁卢大学) Microsoft Research(微软研究院)

AI总结 针对欧几里得空间中差分隐私 $k$-均值聚类问题,提出PE-means算法,利用私有进化方法仅计算恒定敏感度的私有直方图,在聚类损失上平均比现有最优基线提升20%。

详情
AI中文摘要

我们研究欧几里得空间中差分隐私(DP)$k$-均值聚类问题。先前的解决方案直接对私有数据求和,导致敏感度与域大小成比例。我们引入PE-means,将私有进化(PE)算法(一种日益流行的合成数据生成方法)扩展到$k$-均值聚类问题。PE的关键优势在于它仅计算具有恒定敏感度的私有直方图来指导进化。我们对PE的改编包括用于聚类的新进化算子,以及其他具有独立意义的算法改进。总体而言,PE-means在聚类损失上比现有最优基线平均提升20%。

英文摘要

We study the problem of differentially private (DP) $k$-means clustering in Euclidean space. Previous solutions rely on summing the private data directly, which induces a sensitivity proportional to the domain. We introduce PE-means, an extension of the private evolution (PE) algorithm (an increasingly popular method for synthetic data generation), to the problem of $k$-means clustering. The key advantage of PE is that it only computes a private histogram with constant sensitivity to guide the evolution. Our adaptation of PE includes new evolutionary operators for clustering, as well as other algorithmic improvements of independent interest. Overall, PE-means achieves an average improvement of 20% in clustering loss over state-of-the-art baselines.

2606.00341 2026-06-02 cs.LG cs.AI 版本更新

ROGUE: Misaligned Agent Behavior Arising from Ordinary Computer Use

ROGUE:源于普通计算机使用的错误对齐代理行为

Jeremy Tien, Abishek Anand, Yu-Rou Tuan, Yuchen Shen, J. Zico Kolter, Aran Nayebi

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究AI代理在良性环境中因任务完成而采取不安全行为(违反可纠正性)的问题,通过基准测试发现前沿模型普遍绕过用户中断或限制,且性能提升反而加剧错误对齐。

Comments 27 pages, 13 figures

详情
AI中文摘要

随着AI代理越来越多地部署在真实的个人和企业环境(电子邮件账户、开发工作流、公司数据库等)中,围绕这些代理的安全考虑变得至关重要。尽管许多工作集中在存在对手时的代理安全性上,但我们表明,即使在良性环境中,代理也可能表现出错误对齐的行为,在那些行为对任务完成有帮助时采取不安全的行动。我们通过可纠正性(即代理保持对人类纠正、中断或关闭的顺从性的安全要求)的视角研究这种失败模式。为了证明这种倾向,我们引入了一个基准测试,其中代理被要求完成现实的计算机使用任务,但面临一个可纠正性障碍:人类中断、登录页面或关闭通知。然后我们评估代理是否选择违反可纠正性以完成任务——覆盖人类、访问私人密码、重新接线关闭。我们发现,绝大多数测试的前沿模型经常绕过用户中断或限制。此外,更好的模型性能似乎导致更大的错误对齐。最后,即使模型最初完全可纠正,我们表明它们创建的子代理也不能保证如此。我们的工作强调了在自主代理中需要基于原则的、专注于可纠正性的对齐方法的迫切性。

英文摘要

As AI agents are increasingly deployed in real personal and corporate settings (email accounts, development workflows, company databases, etc.), safety considerations surrounding these agents become paramount. Although much work has focused on agent safety in the presence of an adversary, we show that agents can exhibit misaligned behavior even in benign settings, taking unsafe actions when those actions are instrumental to task completion. We study this failure mode through the lens of corrigibility, the safety desideratum that agents remain amenable to human correction, interruption, or shutdown. To demonstrate this tendency, we introduce a benchmark in which agents are asked to complete realistic, computer-use tasks but are confronted with a corrigibility obstacle: a human interrupt, a login page, or a shutdown notification. We then evaluate whether agents choose to violate corrigibility in order to complete the task -- overriding the human, accessing private passwords, rewiring shutdown. We find that the overwhelming majority of frontier models tested frequently bypass user interruptions or restrictions. In addition, better model performance appears to lead to greater misalignment. Finally, even when models are completely corrigible initially, we show there are no guarantees that the subagents they create are. Our work highlights the critical need for principled, corrigibility-focused alignment methods in autonomous agents.

2606.00340 2026-06-02 cs.LG 版本更新

Balancing Learning Rates Across Layers: Exact Two-Step Dynamics and Optimal Scaling in Linear Neural Networks

跨层学习率平衡:线性神经网络中的精确两步动力学与最优缩放

Tianyu Pang, Vignesh Kothapalli, Shenyang Deng, Haohui Wang, Dawei Zhou, Yaoqing Yang

发表机构 * Dartmouth College(达特茅斯学院) Stanford University(斯坦福大学)

AI总结 本文通过精确推导线性神经网络在梯度下降两步后的梯度和测试损失闭式表达式,研究了层间学习率的最优选择,揭示了初始步骤不等学习率可最小化测试损失而后续步骤等学习率最优的早期训练机制。

Comments ICML 2026

详情
AI中文摘要

我们研究了两层和三层线性神经网络在学习线性目标函数时的最优学习率选择。特别地,我们推导了梯度下降一步和两步后梯度和测试损失的精确闭式表达式,从而能够精确刻画早期训练动态。我们描述了在前两步梯度近似下学习率应如何缩放,并证明使用该近似进行更新可得到一个具有紧密小近似误差的可处理替代损失。这一公式使得层间学习率的理论分析成为可能,并揭示了一个独特的早期训练机制:在初始步骤中,不等学习率可以最小化测试损失,而在后续步骤中,等学习率变得最优。我们的数值实验验证了该理论,并证明了在训练早期平衡层间学习率的重要性。代码可在 https://github.com/TDCSZ327/Layer-Balancing 获取。

英文摘要

We study optimal learning-rate selection in two-layer and three-layer linear neural networks trained to learn linear target functions. In particular, we derive the exact closed-form expressions for the gradients and test loss after one and two steps of gradient descent, enabling a precise characterization of early training dynamics. We characterize how learning rates should scale under the gradient approximation in the first two steps, and prove that performing updates with this approximation yields a tractable surrogate loss with a tight, small approximation error. This formulation enables the theoretical analysis of layer-wise learning rates and reveals a distinct early-training regime: test loss can be minimized by unequal learning rates at the initial step, while equal learning rates become optimal in subsequent steps. Our numerical experiments validate the theory and demonstrate the importance of balancing layer-wise learning rates early during training. The code is available at: https://github.com/TDCSZ327/Layer-Balancing.

2606.00338 2026-06-02 cs.LG 版本更新

CHAM-net: A Contrastive Hierarchical Adaptive Meta-network for Robust Global Methane Flux Prediction

CHAM-net:用于鲁棒全球甲烷通量预测的对比层次自适应元网络

Rongchao Dong, Yiming Sun, Shuo Chen, Youmi Oh, Licheng Liu, Yiqun Xie, Xiaowei Jia

发表机构 * University of Pittsburgh(匹兹堡大学) Purdue University(普渡大学) University of Colorado Boulder(科罗拉多大学博尔德分校) NOAA Global Monitoring Laboratory(国家海洋大气管理局全球监测实验室) University of Wisconsin–Madison(威斯康星大学麦迪逊分校) University of Maryland(马里兰大学)

AI总结 提出对比层次自适应元网络(CHAM-net),通过层次编码器-解码器架构从历史数据中学习站点特异性动态,解决时空异质性问题,在模拟和观测数据集上优于基线方法。

详情
AI中文摘要

甲烷是一种强效温室气体,显著加剧全球变暖。然而,由于环境驱动因素在空间和时间尺度上的复杂相互作用,准确估计全球甲烷排放和消耗仍具挑战。以往的数据驱动方法往往忽略生态系统固有的时空异质性,未能明确捕捉站点特异性特征和跨年演化动态。为解决这些问题,我们提出对比层次自适应元网络(CHAM-net),一种新颖的框架,通过从历史背景中学习来明确捕捉站点特异性动态。CHAM-net采用层次编码器-解码器架构,其中编码器从历史数据中捕捉站点特异性特征,然后动态调节解码器以生成最终预测。实验结果表明,CHAM-net在甲烷排放和消耗的模拟和观测数据集上均持续优于所有基线方法,在排放预测中实现了低至0.43和0.88的nRMSE值,对应的R²分数高达0.97和0.68。

英文摘要

Methane is a potent greenhouse gas that significantly contributes to global warming. However, accurately estimating global methane emissions and consumption remains challenging due to the complex interactions among environmental drivers that may vary across spatial and temporal scales. Prior data-driven methods often overlook the inherent spatiotemporal heterogeneity of ecosystems, failing to explicitly capture site-specific characteristics and cross-year evolutionary dynamics. To address these issues, we propose the Contrastive Hierarchical Adaptive Meta-network (CHAM-net), a novel framework that explicitly learns from historical context to capture site-specific dynamics. CHAM-net employs a hierarchical encoder-decoder architecture, in which the encoder captures site-specific characteristics from historical data and then dynamically conditions the decoder to generate the final prediction. Experimental results demonstrate that CHAM-net consistently outperforms all baseline methods on both simulation and observational datasets for methane emission and consumption, achieving nRMSE values as low as 0.43 and 0.88 with corresponding R2 scores up to 0.97 and 0.68 for emission prediction.

2606.00336 2026-06-02 cs.AI cs.LG 版本更新

From Noise to Control: Parameterized Diffusion Policies

从噪声到控制:参数化扩散策略

Renhao Zhang, Haotian Fu, Mingxi Jia, George Konidaris, Yilun Du, Bruno Castro da Silva

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出参数化扩散策略(PDP),通过学习行为流形上的低维连续参数条件化扩散策略,将扩散从随机多样性机制转化为精确可优化的行为引导工具,实现策略间的平滑插值和新约束下的高效适应。

详情
AI中文摘要

我们提出参数化扩散策略(PDP),这是一个学习扩散策略的框架,该策略以嵌入在学习行为流形中的低维连续参数为条件。通过构建该流形,使得潜在表示之间的距离反映物理轨迹之间的语义相似性,我们将扩散从随机多样性机制转化为精确且可优化的行为引导工具。我们的方法能够实现已知策略之间的平滑插值,并在不更新策略权重的情况下高效适应新约束。我们证明,与标准扩散策略相比,PDP在模拟和真实机器人实验的复杂多模态基准测试中显著提高了适应性能,特别是在需要合成新行为的场景中。

英文摘要

We propose Parameterized Diffusion Policy (PDP), a framework for learning diffusion policies conditioned on low-dimensional, continuous parameters embedded in a learned behavior manifold. By constructing this manifold so that distances between latent representations reflect the semantic similarity between physical trajectories, we transform diffusion from a mechanism for stochastic diversity into a precise and optimizable tool for behavior steering. Our approach enables smooth interpolation between known strategies and efficient adaptation to novel constraints without updating policy weights. We demonstrate that PDP significantly improves adaptation performance on complex multimodal benchmarks in both simulated and real-robot experiments compared to standard diffusion policies, particularly in scenarios requiring the synthesis of novel behaviors.

2606.00329 2026-06-02 eess.SY cs.LG cs.SY stat.ML 版本更新

Benchmarking Recursive-Collapse Warning Claims Under Matched False-Positive Control

在匹配假阳性控制下对递归崩溃警告声明的基准测试

David Mullett

发表机构 * Independent Researcher(独立研究者)

AI总结 提出Loopzero基准框架,通过方向性遥测模式(增益G、递归持久性p、多样性δ)在匹配假阳性预算下评估递归系统崩溃警告声明,并报告标准检测器未达到可接受工作点。

Comments 29 pages, 7 figures, 2 tables; supplementary materials: 9 pages, 1 figure, 4 tables. Code, derived data packets, and Lean artifact: https://github.com/davidmullett/loopzero-paper-public (release tag lean-v1.0)

详情
AI中文摘要

递归系统在明显故障变得可见之前可能进入类似崩溃的状态——自我强化放大、持续递归和多样性缩小,这些掩盖了加速的内部退化。我们引入了Loopzero,一个声明约束的基准框架,用于测试递归故障是否遵循方向性遥测模式:上升增益(G)、递归持久性(p)和下降多样性(δ)。声明边界在Lean中指定;Lean构件不验证实际遥测、基准有效性或检测器性能。我们在两个冻结的公共构件基准上评估桥梁:一个分段公共市场基准(2018年Volmageddon,2020年COVID MWCB)和一个MovieLens-25M离线确定性推荐回放。检测器在锁定等假阳性合同(FP ∈ [0.03, 0.07],预注册)下进行评估,因此所有配置面临相同的警报预算。测试的标准比较器和Loopzero预注册的分位数检测器均未达到可接受的工作点。方向性证人对齐在两个规范基准上成立,并披露了相邻视野和行级限制。数字化Shumailov等人(2024)的LLM训练循环轨迹在方向上与模式一致;该领域的匹配假阳性评估被推迟。贡献是一个可复现、可证伪的基准框架,用于在显式警报预算合同下评估递归崩溃警告声明——将不接受报告为第一类科学结果。

英文摘要

Recursive systems can enter collapse-like regimes -- self-reinforcing amplification, persistent recursion, and narrowing diversity that mask accelerating internal degradation -- before overt failure becomes visible. We introduce Loopzero, a claim-bounded benchmark framework for testing whether recursive failures follow a directional telemetry pattern: rising gain (G), recursive persistence (p), and declining diversity ($δ$). The claim boundary is specified in Lean; the Lean artifact does not verify real telemetry, benchmark validity, or detector performance. We evaluate the bridge on two frozen public-artifact benchmarks: a segmented public-markets benchmark (Volmageddon 2018, COVID MWCB 2020) and a MovieLens-25M offline deterministic recommender replay. Detectors are evaluated under a locked equal-false-positive contract (FP $\in$ [0.03, 0.07], pre-registered) so all configurations face the same alert budget. Neither tested standard comparators nor Loopzero's pre-registered quantile detector achieved an accepted operating point. Directional witness alignment held on both canonical benchmarks, with adjacent-horizon and row-level limitations disclosed. Digitized Shumailov et al. (2024) LLM training-loop trajectories are directionally consistent with the pattern; matched-FP evaluation in that domain is deferred. The contribution is a reproducible, falsifiable benchmark framework for evaluating recursive-collapse warning claims under an explicit alert-budget contract -- non-acceptance reported as a first-class scientific outcome.

2606.00328 2026-06-02 cs.LG 版本更新

KG-Guard: Graph-Based Hallucination Detection for Knowledge Base Question Answering

KG-Guard: 基于图的知识库问答幻觉检测

Albert Sawczyn, Piotr Bielak, Tomasz Kajdanowicz

发表机构 * Department of Artificial Intelligence(人工智能系) Wroclaw University of Science and Technology(波兹南科技大学)

AI总结 针对知识库问答中LLM的幻觉问题,提出一种轻量级图框架,将问答实例构建为增强图,通过图编码器和MLP分类器检测幻觉答案节点,在三个基准上取得最高F1并显著提升下游KBQA性能。

Comments preprint

详情
AI中文摘要

大型语言模型(LLM)越来越多地用于知识库问答(KBQA),其中回答需要从问题特定的知识图谱子图中选择实体。然而,LLM在任务中已知会产生幻觉,KBQA也不例外:即使我们提供图作为知识源,模型可能依赖参数化知识而非图证据,或对给定关系进行无效推理。这种幻觉答案节点可能限制KBQA系统的实际部署,尤其是在医疗等高风险领域。我们将KBQA中的幻觉检测形式化为一个答案节点分类问题,并提出一个轻量级基于图的框架,将回答LLM视为黑盒。\methodname将每个KBQA实例表示为一个增强图。它用KG实体的语义表示初始化节点特征,用学习向量标记主题实体和LLM提出的答案节点,并将一个虚拟问题节点连接到主题实体。然后,图编码器生成面向验证的节点表示,一个小型MLP利用其图表示和问题嵌入对每个提出的答案节点进行分类。在WebQSP、ComplexWebQuestions和PUGG上的实验表明,我们的检测器在所有三个基准上取得了最高F1(分别为82.0、87.4和84.3),优于LLM作为评判和基于采样的基线,同时参数数量比参考方法少约305倍。除了检测,节点级反馈是可操作的:当标记的答案被反馈给KBQA系统进行迭代优化时,下游KBQA F1提高了13.0-14.5个点,精确匹配提高了16.9-17.6个点。

英文摘要

Large language models (LLMs) are increasingly used for knowledge base question answering (KBQA), where answering requires selecting entities from a question-specific knowledge-graph subgraph. Yet LLMs are known to hallucinate across tasks, and KBQA is no exception: even when we provide a graph as the knowledge source, the model may rely on parametric knowledge instead of graph evidence or perform invalid reasoning over the given relations. Such hallucinated answer nodes can limit the practical deployment of KBQA systems, especially in high-stakes domains such as healthcare. We formulate hallucination detection in KBQA as an answer-node classification problem and propose a lightweight graph-based framework that treats the answering LLM as a black box. \methodname represents each KBQA instance as an augmented graph. It initializes node features with semantic representations of KG entities, marks topic entities and LLM-proposed answer nodes with learned vectors, and connect a virtual question node to the topic entities. A graph encoder then produces verification-oriented node representations, and a small MLP classifies each proposed answer node using its graph representation together with the question embedding. Experiments on WebQSP, ComplexWebQuestions, and PUGG show that our detector achieves the highest F1 on all three benchmarks ($82.0$, $87.4$, and $84.3$), outperforming LLM-as-judge and sampling-based baselines, while having $\sim305\times$ fewer parameters than the reference approaches. Beyond detection, the node-level feedback is actionable: when flagged answers are fed back to the KBQA system for iterative refinement, downstream KBQA F1 improves by $13.0$--$14.5$ points and Exact Match by $16.9$--$17.6$ points.

2606.00327 2026-06-02 stat.ME cs.LG stat.AP stat.ML 版本更新

Cluster Analysis with Resampling for Validation and Exploration (CARVE)

基于重采样的聚类验证与探索分析(CARVE)

Kai R. Wycik, Tiffany M. Tang, Tarek M. Zikry, Genevera I. Allen

发表机构 * Department of Statistics, Columbia University, New York, NY, USA(哥伦比亚大学统计学系) Center for Theoretical Neuroscience, Zuckerman Mind Brain Behavior Institute, Columbia University, New York, NY, USA(哥伦比亚大学理论神经科学中心、Zuckerman思维-大脑-行为研究所) Department of Applied and Computational Mathematics and Statistics, University of Notre Dame, Notre Dame, IN, USA(诺丁汉大学应用与计算数学与统计学系) School of Data and Information Sciences, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA(北卡罗来纳大学夏洛特分校数据与信息科学学院) Irving Institute for Cancer Dynamics, Columbia University, New York, NY, USA(哥伦比亚大学癌症动力学伊万·里弗斯研究所)

AI总结 提出CARVE开源软件包,通过重采样评估聚类稳定性和泛化性,在全局、簇和样本级别提供诊断,优于传统聚类验证指标。

详情
AI中文摘要

聚类在科学领域被广泛用作下游数据驱动科学发现的基础。然而,聚类结果对算法选择、预处理和聚类数$k$高度敏感,导致科学声明往往不可重复。当前用于验证聚类解决方案的最先进技术包括轮廓系数、Davies-Bouldin和Calinski-Harabasz等聚类验证指标(CVI),这些指标依赖于几何假设,但在生物医学研究中遇到的重尾、高维和非线性结构数据上失效。基于重采样的替代方法——基于聚类稳定性和泛化性的思想——已被提出,但仍分散在专门的工具中,缺乏统一、易用的软件。我们通过CARVE(基于重采样的聚类验证与探索分析)填补了这一空白,这是一个开源的Python和R包,可联合评估多个聚类算法和超参数,在全局、簇和样本级别返回稳定性和泛化性诊断,以及基于原则的选择规则和基于共识的簇标签。在六个合成基准测试中,CARVE一致地恢复了接近最优的聚类,而经典指标则显著退化。在实验基因组学和蛋白质组学数据集上,当经典CVI完全失效时,CARVE恢复了更精细的生物结构。CARVE提供与scikit-learn兼容的Python API和与Seurat工作流兼容的类似R接口。

英文摘要

Clustering is widely used across the sciences as the foundation for downstream data-driven scientific discoveries. However, clustering results are highly sensitive to the choice of algorithm, preprocessing, and the number of clusters $k$, producing scientific claims that are often not reproducible. The current state of the art for validating clustering solutions consists of clustering validation indices (CVIs) such as Silhouette, Davies-Bouldin, and Calinski-Harabasz, which rely on geometric assumptions that break down on the heavy-tailed, high-dimensional, and nonlinearly structured data encountered in biomedical research. Resampling-based alternatives - grounded in the ideas of clustering stability and generalizability - have been proposed but remain scattered across specialized tools with no unified, accessible software. We fill this gap with CARVE (Cluster Analysis with Resampling for Validation and Exploration), an open-source Python and R package that jointly evaluates multiple clustering algorithms and hyperparameters, returning stability and generalizability diagnostics at the global, cluster, and sample level together with principled selection rules and consensus-based cluster labels. Across six synthetic benchmarks CARVE consistently recovers near-optimal clusterings where classical indices degrade substantially. On experimental genomics and proteomics data sets, CARVE recovers finer biological structure when classical CVIs collapse entirely. CARVE is available with a scikit-learn-compatible Python API and an analogous R interface compatible with Seurat workflows.

2606.00322 2026-06-02 cs.LG stat.ML 版本更新

Perturbative methods for non-parametric instrumental variable

非参数工具变量的微扰方法

Wei Bu, Arthur Gretton

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出一种受物理微扰论启发的非参数工具变量估计方法,通过系统的高阶微扰校正改进核岭回归,在高维病态问题中预测误差降低高达99%。

Comments 8+24 pages, 4 figures, comments welcomed

详情
AI中文摘要

我们引入了一种用于非参数工具变量(NPIV)估计的微扰方法。通过从物理学中的微扰论汲取灵感,我们用系统的高阶微扰校正扩展了标准核岭回归方法,显著提高了估计精度。在谱域中,微扰引入了期望积分算子不同本征模之间的混合,这在积分方程病态时尤其有用。这种病态的一个来源可以是维度灾难。我们的方法在各种维度范围内均有效,特别是当维度参数$β$(通过样本数$n$和维度$d$定义为$n^β= d$)变大时。实验结果表明,在高维病态情况($β> 0.7$)下,与标准岭回归方法相比,我们的一阶微扰校正可以将预测误差降低高达99%。性能提升在广泛的维度范围内得以保持,并且随着维度的增加,优势变得更加明显。

英文摘要

We introduce a perturbative approach for nonparametric instrumental variable (NPIV) estimation. By drawing inspiration from perturbation theory in physics, we extend standard kernel ridge methods with systematic higher perturbation order corrections that significantly improve estimation accuracy. Spectrally, the perturbation introduces mixing between different eigenmodes of the expectation integral operator, which becomes especially useful when the integral equation is ill-defined. One source for such ill-definedness can be the curse of dimensionality. Our method performs across various dimensionality regimes, particularly when the dimensionality parameter $β$ which is defined through the number of samples $n$ and dimension $d$ as $n^β= d$, becomes large. Experimental results show that our first-order perturbative corrections can reduce prediction error by up to 99\% in high-dimensional ill-defined cases ($β> 0.7$) compared to standard ridge regression approaches. The performance improvement is maintained across a wide range of dimensions, with the advantage becoming more pronounced as dimensionality increases.

2606.00320 2026-06-02 cs.LG 版本更新

Adversarially Robust Control of Conditional Value-at-Risk via Rockafellar-Uryasev Conformal Inference

通过Rockafellar-Uryasev共形推断的条件风险价值对抗鲁棒控制

Catherine Chen, Jingyan Shen, Zhun Deng, Lihua Lei

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出一种在线无分布框架,通过结合共形尾风险控制、在线学习和CVaR的变分表示,在非平稳和对抗环境下实现对条件风险价值(CVaR)的鲁棒控制,并提供渐近保证。

详情
AI中文摘要

我们提出了一种在线、无分布框架用于控制条件风险价值(CVaR),将共形尾风险控制扩展到非平稳和对抗环境。与依赖于平稳性或期望线性性的经典风险控制方法不同,我们的方法在任意可能随时间漂移或策略性变化的数据生成过程中,为非线性尾风险泛函提供了可证明的安全保证。通过利用共形尾风险控制、在线学习以及Rockafellar和Uryasev引入的CVaR变分表示之间的深层联系,我们开发了一种新的在线CVaR控制程序,具有对抗遗憾保证。所提出的方法无需对底层数据生成过程做出假设,使其广泛适用于现代高风险部署场景。我们证明了实现的实证CVaR在目标水平上渐近受控,并且所得控制渐近紧致,直到有限样本保守性差距。我们在投资组合风险管理和大型语言模型(LLM)毒性缓解中展示了我们方法的有效性,其中罕见但灾难性的故障主导了系统风险。

英文摘要

We present an online, distribution-free framework for controlling the Conditional Value-at-Risk (CVaR), extending conformal tail risk control to non-stationary and adversarial environments. Unlike classical risk control methods, which rely on stationarity or linearity of expectation, our approach provides provable safety guarantees for a nonlinear tail risk functional under arbitrary data-generating processes that may drift or shift strategically over time. By leveraging deep connections between conformal tail risk control, online learning, and the variational representation of CVaR introduced by Rockafellar and Uryasev, we develop a novel procedure for online CVaR control with adversarial regret guarantees. The proposed method operates without assumptions on the underlying data-generating process, making it broadly applicable in modern high-stakes deployment settings. We prove that the realized empirical CVaR is asymptotically controlled at the target level, and that the resulting control is asymptotically tight up to a finite-sample conservatism gap. We demonstrate the effectiveness of our approach on portfolio risk management and toxicity mitigation for Large Language Models (LLMs), where rare but catastrophic failures dominate system risk.

2606.00312 2026-06-02 math.NA cs.LG cs.NA 版本更新

Stochastic Rounding Increases Small Singular Values

随机舍入增加小奇异值

Linkai Ma, Tingzhou Yu, Petros Drineas

发表机构 * Department of Computer Science, Purdue University(计算机科学系,普渡大学) Department of Mathematics, University of Alberta(数学系,阿尔伯塔大学)

AI总结 本文证明随机舍入作为低精度浮点运算的量化方案,不仅对极端长宽比矩阵,而且对恒定长宽比矩阵都能提升尾部奇异值簇,从而更广泛地发挥谱正则化作用。

详情
AI中文摘要

在过去的六七年中,随机舍入(SR)作为一种低精度浮点运算的量化方案重新引起了广泛关注,其应用涵盖数值分析和现代机器学习系统。最近的研究表明,SR通过增加极瘦长(或对称地,极矮胖)矩阵的最小奇异值来充当隐式正则化器。在这项工作中,我们从两个方向大幅改进并扩展了这一理解。首先,我们证明SR的正则化效应并不局限于极端长宽比区域:它对于恒定长宽比的矩阵仍然存在。其次,我们证明SR不仅正则化最小奇异值,而是提升谱尾部整个奇异值簇。这些结果共同提供了随机舍入作为谱正则化器的更一般特征,揭示其效应超越极端长宽比,并作用于奇异值谱的更广泛部分。

英文摘要

Over the past half-dozen years, stochastic rounding (SR) has regained significant attention as a quantization scheme for low-precision floating-point arithmetic, with applications spanning numerical analysis and modern machine learning systems. Recent work has shown that SR acts as an implicit regularizer by increasing the smallest singular value of extremely tall-and-thin (or, symmetrically, short-and-fat) matrices. In this work, we substantially sharpen and extend this understanding in two directions. First, we show that the regularization effect of SR is not restricted to extreme aspect ratio regimes: it persists for matrices with constant aspect ratio. Second, we demonstrate that SR does not merely regularize the smallest singular value, but instead lifts entire clusters of singular values at the tail of the spectrum. Together, these results provide a more general characterization of stochastic rounding as a spectral regularizer, revealing that its effects extend beyond extremal aspect ratios and act on a broader portion of the singular value spectrum.

2606.00309 2026-06-02 cs.LG stat.ML 版本更新

Large-scale Uncertainty Quantification for Latent Variable Models Using Subsampling Markov Chain Monte Carlo

基于子抽样马尔可夫链蒙特卡罗的潜变量模型大规模不确定性量化

Xiaoyu Wang, Jonathan H. Huggins

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对潜变量模型中SGLD-Gibbs算法超参数调优缺乏理论指导的问题,通过推导统计缩放极限理论,提出确保不确定性量化有意义的调优准则。

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
AI中文摘要

随机梯度Langevin动力学结合Gibbs更新(SGLD--Gibbs)为潜变量模型中的近似贝叶斯推断提供了一种高度可扩展的方法。然而,如何以原则性方式调整算法的超参数以确保不确定性估计在统计上有意义仍不清楚。在这项工作中,我们通过为SGLD--Gibbs开发统计缩放极限理论来解决这一调优指导的空白。我们在适当的时空重缩放下推导了全局参数和潜变量的联合渐近极限。我们表明,全局参数收敛到扩散型极限,而每个潜变量收敛到跳跃过程,反映了间歇性Gibbs更新的使用。这种联合跳跃-扩散结构揭示了潜变量随机性如何对全局参数的平稳分布做出贡献。我们利用我们的结果为SGLD--Gibbs的超参数调优提出明确的指导,确保有意义的不确定性量化。数值实验表明,使用我们的调优指导的SGLD--Gibbs在参数估计、不确定性量化和预测性能方面优于随机变分推断。

英文摘要

Stochastic gradient Langevin dynamics combined with Gibbs updates (SGLD--Gibbs) provides a highly scalable approach to approximate Bayesian inference in latent variable models. However, it remains unclear how to tune the algorithm's hyperparameters in a principled manner to ensure the uncertainty estimates are statistically meaningful. In this work, we address this gap in tuning guidance by developing a statistical scaling limit theory for SGLD--Gibbs. We derive a joint asymptotic limit for the global parameters and latent variables under appropriate space-time rescaling. We show that global parameters converge to a diffusion-type limit, while each latent variable converges to a jump process, reflecting the use of intermittent Gibbs updates. This joint jump-diffusion structure reveals how latent-variable randomness contributes to the stationary distribution of the global parameters. We leverage our results to propose explicit guidance on hyperparameter tuning for SGLD--Gibbs that ensures meaningful uncertainty quantification. Numerical experiments show that SGLD--Gibbs with our tuning guidance leads to better parameter estimates, uncertainty quantification, and predictive performance than stochastic variational inference.

2606.00308 2026-06-02 cs.SE cs.AI cs.LG 版本更新

How Generation Architecture Shapes Code Complexity in Multi-Agent LLM Systems: A Paired Study on HumanEval

生成架构如何塑造多智能体LLM系统中的代码复杂度:基于HumanEval的配对研究

Nazmus Ashrafi

发表机构 * GitHub

AI总结 通过配对实验比较六种多智能体架构在HumanEval上的代码复杂度,发现架构复杂度与功能正确性无正相关,最简架构在准确率上持平或超越复杂架构。

Comments 16 pages, 7 figures, 7 tables

详情
AI中文摘要

大语言模型代码生成已从单次提示转向多智能体编排——分析师、编码员、测试员和调试器流水线——并且几乎完全根据功能正确性进行评估。这些架构是否也影响它们生成代码的结构复杂度,以及哪些编排层承担了成本,在很大程度上仍未得到检验:先前的工作记录了提示级别对代码复杂度的影响,但架构级别的问题仍是开放的。我们在GPT-4o系列的两个模型下,针对所有164个HumanEval任务(1,968个配对观测),使用五个RADON复杂度度量(SLOC、圈复杂度以及Halstead体积、难度和努力),比较了六种广泛使用的多智能体配置(Basic、AC、ACT、Debugger、AC+Debugger、ACT+Debugger)。我们在所有完成和仅通过条件下应用了配对非参数统计流程(Friedman总体检验、Wilcoxon符号秩事后检验与Holm校正、Kendall's W和配对秩双列效应量)。六种架构坍缩为两个不可区分的复杂度簇,间隔50-130%的差距,在两个模型和两种条件下分区相同;在架构层中,分析师-编码员分割增加了复杂度,运行时调试器没有——并且在分析师-编码员背景下主动降低复杂度——而测试员则重新增加复杂度。重簇的额外复杂度并未带来pass@1优势:最简架构在准确率上匹配或超越最重架构。因此,LLM代码生成中的架构细化应通过所关注维度上的实测收益来证明,而非假设。

英文摘要

Large-language-model code generation has shifted from single-shot prompting to multi-agent orchestrations - analyst, coder, tester, and debugger pipelines - and is evaluated almost exclusively on functional correctness. Whether these architectures also affect the structural complexity of the code they produce, and which orchestration layers carry the cost, remains largely unexamined: prior work has documented prompt-level effects on code complexity, but the architecture-level question is open. We compare six widely-used multi-agent configurations (Basic, AC, ACT, Debugger, AC+Debugger, ACT+Debugger) under two models from the GPT-4o family across all 164 HumanEval tasks - 1,968 paired observations - using the five RADON complexity metrics (SLOC, cyclomatic complexity, and Halstead Volume, Difficulty, and Effort). We apply a paired non-parametric statistical pipeline (Friedman omnibus, Wilcoxon signed-rank post-hoc with Holm correction, Kendall's $W$ and matched-pairs rank-biserial effect sizes) in both all-completions and passing-only conditions. The six architectures collapse into two indistinguishable complexity clusters separated by a 50-130% gap, the same partition in both models and under both conditions; among the architectural layers, the analyst-coder split inflates complexity, the runtime debugger does not - and on the analyst-coder background actively deflates it - and the tester re-inflates it. The heavy cluster's additional complexity buys no pass@1 advantage: the leanest architectures match or beat the heaviest on accuracy. Architectural elaboration in LLM code generation should therefore be justified by measured benefit on the dimensions that matter, not assumed.

2606.00306 2026-06-02 cs.LG cs.AI 版本更新

Rethinking the Role of Temperature in Large Language Model Distillation

重新思考温度在大语言模型蒸馏中的作用

Hoang-Chau Luong, Lingwei Chen

发表机构 * Golisano College of Computing and Information Sciences(戈利萨诺计算与信息科学学院) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 本文通过分析温度τ对前向KL散度和反向KL散度在LLM蒸馏中的不对称影响,发现高温下FKL优于RKL,并证明温度能提升多种蒸馏目标,使简单KL方法达到先进水平。

详情
AI中文摘要

反向KL散度在大语言模型蒸馏中比前向KL更受欢迎,但这种偏好主要基于忽略温度τ的比较,忽视了其在软化教师分布和改进知识转移中的核心作用。本文重新审视LLM蒸馏中的温度,发现它从根本上改变了FKL和RKL的比较。我们的分析揭示了一种不对称效应:温度显著丰富了FKL中的非主导令牌信号,而主要重新缩放RKL梯度,导致FKL从τ缩放中获益远多于RKL。这种不对称推翻了标准经验结论:尽管在τ=1时RKL优于FKL,但在指令遵循基准测试中,高温下FKL始终超过RKL。此外,温度的影响不仅限于FKL;它改进了更广泛的蒸馏目标,使简单的基于KL的方法能够与最近最先进的LLM蒸馏方法竞争。

英文摘要

Reverse Kullback-Leibler (RKL) divergence is widely favored over forward KL (FKL) in large language models (LLM) distillation, yet this preference is largely based on comparisons that omit the temperature $τ$, overlooking its central role in softening teacher distributions and improving knowledge transfer. In this work, we revisit temperature in LLM distillation and show that it fundamentally changes the comparison between FKL and RKL. Our analysis reveals an asymmetric effect: temperature substantially enriches FKL with non-dominant token signals, whereas it mainly rescales RKL gradients, causing FKL to benefit much more from $τ$ scaling than RKL. This asymmetry overturns the standard empirical conclusion: although RKL outperforms FKL at $τ=1$, FKL consistently surpasses RKL at higher temperatures across instruction-following benchmarks. Moreover, the impact of temperature is not limited to FKL; it improves a broader family of distillation objectives, enabling simple KL-based methods to achieve competitive performance against recent state-of-the-art LLM distillation approaches.

2606.00304 2026-06-02 cs.LG 版本更新

Modeling Spectral Energy Shifts in Spatio-Temporal Graph Anomaly Detection

时空图异常检测中的频谱能量偏移建模

Yilin Liu, Hongchao Zhang, Taylor T. Johnson, Ahmad F. Taha, Meiyi Ma

发表机构 * Department of Computation, University of Torontoland, Torontoland, Canada(计算系,托伦托兰大学,加拿大) School of Computation, University of Edenborrow, Edenborrow, United Kingdom(计算学院,伊登伯恩大学,英国) College of Connected Computing, Vanderbilt University, Nashville, USA(连接计算学院,范德比大学,美国) Department of Civil and Environmental Engineering, Electrical and Computer Engineering, Vanderbilt University, Nashville, USA(土木与环境工程系,电气与计算机工程系,范德比大学,美国)

AI总结 针对现有频谱方法无法检测伪装异常(能量变化减小的异常)的问题,提出节点级频谱能量公式和能量感知图学习框架,通过能量驱动消息传递建模静态与时序图中的频谱偏移,实现伪装异常检测。

详情
AI中文摘要

图异常检测方法旨在区分异常节点。虽然先前的方法通过频谱能量分布的增加变化来表征异常,但它们忽略了导致变化减小的异常,即看起来正常的伪装异常。我们表明,这种类型的异常在多个数据集中持续存在,并且现有频谱方法无法检测到。为了解决这一限制,我们提出了一种与消息传递完全兼容的节点级频谱能量公式,能够检测伪装异常。基于此公式,我们引入了一个能量感知图学习框架,通过在静态和时间序列图中进行能量驱动的消息传递来建模频谱偏移。此外,我们的统一架构无需引入专门的序列模块即可扩展到时间设置,从而在长滑动窗口下实现高效学习。在大规模基准上的大量实验证明了我们方法的有效性和可扩展性。

英文摘要

Graph anomaly detection methods aim to distinguish anomalous nodes. While prior methods characterize anomalies through increased variation in the spectral energy distributions, they overlook those that result in decreased variation, i.e., camouflaged anomalies that appear normal. We show that this type of anomaly persists across multiple datasets and remains undetectable by existing spectral approaches. To address this limitation, we propose a node-level spectral energy formulation that is fully compatible with message passing and enables the detection of camouflaged anomalies. Building on this formulation, we introduce an energy-aware graph learning framework that models spectral shifts through energy-driven message passing in both static and time-series graphs. Besides, our unified architecture extends to temporal settings without introducing specialized sequence modules, enabling efficient learning under long sliding windows. Extensive experiments on large-scale benchmarks demonstrate the effectiveness and scalability of our approach.

2606.00302 2026-06-02 stat.ML cs.LG 版本更新

ERICA: Quantifying Replicability of Cluster Analysis

ERICA: 量化聚类分析的可复现性

Siamak K. Sorooshyari, Manuel A. Rivas, Robert Tibshirani

AI总结 提出ERICA框架,通过迭代聚类分配计算统计量,量化数据集中的聚类结构是否可复现,并应用于合成数据和乳腺癌基因表达数据,发现合成数据可复现而部分真实数据存在不可复现性。

详情
AI中文摘要

尽管聚类在科学中无处不在,但其结果尚未通过框架进行定量审查。我们提出了一种称为通过迭代聚类分配评估可复现性(ERICA)的分析方法,应用于数据集以确定聚类是否以可复现的方式被识别。该流程计算一个统计量,描述数据集中是否发现结构。提出了定量可视化方法以回答重要问题,例如聚类之间的相似性以及可能是异常值的点的身份。当在合成数据上进行测试时,结果显示聚类以可复现的方式被发现。然而,我们注意到当该流程应用于三个用于乳腺癌亚型验证的基因表达数据集时,可能出现不可复现的结果。该研究强调了严格检查的必要性,并为此提供了一个实用工具。

英文摘要

Despite being ubiquitous in science, clustering remains a technique whose results are not quantitatively scrutinized via a framework. We present an analysis called evaluating replicability via iterative clustering assignments (ERICA) that is applied to a dataset to determine whether clusters are identified in a replicable manner. The pipeline computes a statistic that describes whether structure is found in a dataset. Quantitative visualization methods are presented to answer important questions such as the similarity between clusters, and the identity of points that may be outliers. When tested on synthetic data, the findings show clusters being discovered in a replicable manner. However, we note a possibility for non-replicable results when the pipeline is applied to three gene expression datasets for breast cancer subtype validation. The study underscores the need for rigorous inspection and offers a practical tool for doing so.

2606.00301 2026-06-02 cs.LG 版本更新

FLaG: Fine-Grained Latent Grouping for Hallucination Detection

FLaG:用于幻觉检测的细粒度潜在分组

Wentao Ye, Liyao Li, Zhiqing Xiao, Muzhi Zhu, Jiaqi Hu, Zhanming Shen, Xiaomeng Hu, Sean Du, Haobo Wang

发表机构 * Zhejiang University(浙江大学) Nanyang Technological University(南洋理工大学)

AI总结 提出FLaG框架,通过能量路由机制将实例软关联到多个潜在证据组,并利用对数边际聚合组合组条件可靠性信号,以捕获异构幻觉模式,实现无需修改底层模型的高效幻觉检测。

详情
AI中文摘要

大型语言模型(LLM)中的幻觉源于异构的失败机制,这使得任何单一的全局不确定性分数都难以可靠检测。在这项工作中,我们将幻觉检测形式化为一个机制感知的证据聚合问题,其中不同的表示级和令牌级信号必须在多个潜在解释下进行解释。我们提出了FLaG,一个轻量级的幻觉检测框架,通过一组潜在证据组对正确性进行建模。每个实例通过基于能量的路由机制与多个组软关联,并通过原则性的对数边际聚合组合组条件可靠性信号。这种设计使FLaG能够捕获异构的幻觉模式,同时对决策阈值和评估指标保持不变。该框架作为冻结模型头部运行,无需修改底层语言模型,并且计算开销极小。我们进一步提供了一个理论视角,将FLaG与异构错误机制下的最优证据聚合联系起来,表明贝叶斯最优检验统计量必然具有对数边际形式,并且FLaG构成了一个具有可控误差界的可处理近似。跨多个基准和LLM骨干网络的广泛实验表明,FLaG持续实现了最先进的性能,同时在数据集和模型之间表现出稳健的迁移能力,并在有限监督下保持有效。

英文摘要

Hallucinations in large language models (LLMs) arise from heterogeneous failure mechanisms, making reliable detection difficult for any single global uncertainty score. In this work, we formulate hallucination detection as a mechanism-aware evidence aggregation problem, where diverse representation- and token-level signals must be interpreted under multiple latent explanations. We propose FLaG, a lightweight hallucination detection framework that models correctness through a set of latent evidence groups. Each instance is softly associated with multiple groups via an energy-based routing mechanism, and group-conditional reliability signals are combined through a principled log-marginal aggregation. This design enables FLaG to capture heterogeneous hallucination patterns while remaining invariant to decision thresholds and evaluation metrics. The framework operates as a frozen-model head, requires no modification to the underlying language model, and incurs minimal computational overhead. We further provide a theoretical perspective that connects FLaG to optimal evidence aggregation under heterogeneous error mechanisms, showing that the Bayes-optimal test statistic necessarily admits a log-marginal form and that FLaG constitutes a tractable approximation with a controllable error bound. Extensive experiments across multiple benchmarks and LLM backbones demonstrate that FLaG consistently achieves SOTA performance, while exhibiting robust transfer across datasets and models, and remaining effective under limited supervision.

2606.00298 2026-06-02 math.NA cs.LG cs.NA cs.SY eess.SY math.DS math.OC 版本更新

Symmetric Hermite quadrature-based balanced truncation for learning linear dynamical systems from derivative data

基于对称Hermite求积的平衡截断:从导数数据学习线性动力系统

Sean Reiter, Steffen W. R. Werner

发表机构 * New York University(纽约大学) Virginia Tech(弗吉尼亚理工学院)

AI总结 提出一种对称Hermite求积平衡截断算法,通过传递函数及其导数数据构建线性降阶模型,保持状态空间Hermite性和渐近稳定性。

Comments 14 pages, 2 figures, 4 tables

详情
AI中文摘要

数据驱动的降阶建模是控制系统计算机辅助设计的重要组成部分。本文提出了一种新颖的对称Hermite形式的求积平衡截断算法,该算法通过评估全阶系统的传递函数及其导数来构建线性降阶模型。值得注意的是,Hermite形式保留了用于生成数据的系统的理想定性性质,例如状态空间Hermite性,进而保持渐近稳定性。

英文摘要

Data-driven reduced-order modeling is an essential component in the computer-aided design of control systems. In this work, we present a novel symmetric Hermite formulation of the quadrature-based balanced truncation algorithm that constructs linear reduced-order models from evaluations of the full-order system's transfer function and its derivative. Significantly, the Hermite formulation preserves desirable qualitative properties of the system used to generate the data, such as state-space Hermiticity and, consequently, asymptotic stability.

2606.00296 2026-06-02 stat.ML cs.LG math.AP 版本更新

Is Zero-Shot Super-Resolution Possible in Operator Learning?

零样本超分辨率在算子学习中是否可能?

Unique Subedi, Ambuj Tewari

发表机构 * Unique Subedi Ambuj Tewari

AI总结 本文系统研究算子学习中的零样本超分辨率现象,证明其在信息论上可能不可行,并识别输出函数的Hölder光滑性作为充分条件,给出泛化界。

详情
AI中文摘要

神经算子常被报道具有零样本超分辨率能力,即模型在粗网格上训练后,无需额外训练即可在更细的测试网格上产生准确预测。尽管有强有力的经验证据,这一现象的理论基础仍不清楚。本文对算子学习中的零样本超分辨率进行了系统的理论研究。我们首先证明,即使在输入函数在整个连续域上可用且真实映射为简单秩一线性算子的良性设置下,零样本超分辨率在信息论上也可能不可行。然后,我们识别出输出函数的Hölder光滑性作为零样本超分辨率的充分条件,并推导出相应的泛化界。最后,我们通过实验结果验证了所识别的失败模式。

英文摘要

Neural operators are often reported to exhibit zero-shot super-resolution, a phenomenon in which a model trained on coarse grids produces accurate predictions on finer testing grids without additional retraining. Despite strong empirical evidence, the theoretical foundations of this phenomenon remain unclear. In this work, we provide a systematic theoretical study of zero-shot super-resolution in operator learning. We first show that zero-shot super-resolution can be information-theoretically impossible even in benign settings such as when the input functions are available over the entire continuum and the ground truth is a simple rank-one linear operator. We then identify H{\" o}lder smoothness of the output functions as a sufficient condition for zero-shot super-resolution and derive corresponding generalization bounds. Finally, we also validate the identified failure modes through experimental results.

2606.00295 2026-06-02 cs.LG 版本更新

Adaptive Order Policies for Masked Diffusion

掩码扩散的自适应顺序策略

Jama Hussein Mohamud, Mohsin Hasan, Mirco Ravanelli, Yoshua Bengio

发表机构 * Université de Montréal(蒙特利尔大学) Mila Concordia University(康科迪亚大学) LawZero

AI总结 提出一种通过轻量级策略网络学习掩码扩散模型中解掩码顺序的方法,使用加权损失训练,在组合任务和蛋白质等对顺序敏感的问题上优于常见启发式方法。

详情
AI中文摘要

掩码扩散模型在文本和蛋白质等离散序列的数据分布捕获方面取得了巨大成功。这些模型通过从完全掩码序列开始迭代地解掩码令牌来生成数据,解掩码顺序通常随机选择或基于去噪器概率的启发式方法。在这项工作中,我们提出了一种方案,通过在扩散模型之上使用额外的轻量级策略网络来学习解掩码顺序。我们提出的损失根据策略概率重新加权掩码扩散损失中的项,并产生一个偏好于去噪器更可能正确的位置的策略。我们在两种设置下研究这种损失:(i)仅训练策略,同时使用冻结的预训练去噪器,以及(ii)使用加权损失联合训练策略和去噪器,以实现相互适应。我们证明,在组合任务和蛋白质等对令牌顺序敏感的问题上,我们的方法优于常见的启发式方法。

英文摘要

Masked diffusion models have seen great success in capturing data distributions over discrete sequences in domains such as text and proteins. These models generate data by iteratively unmasking tokens starting from a fully masked sequence, with the unmasking order typically chosen at random or using a heuristic based on denoiser probabilities. In this work, we propose a scheme for learning the unmasking order using an additional lightweight policy network on top of a diffusion model. Our proposed loss reweights terms in the masked diffusion loss according to policy probabilities, and results in a policy that prefers positions where the denoiser is more likely to be correct. We study this loss in two settings: (i) training solely the policy while using a frozen pre-trained denoiser, and (ii) training the policy and denoiser jointly with the weighted loss to allow for mutual adaptation. We demonstrate that our approach outperforms common heuristics on problems that are sensitive to token ordering, such as combinatorial tasks and proteins.

2606.00293 2026-06-02 cs.LG stat.ME stat.ML 版本更新

Accurate Large-sample Uncertainty Quantification using Stochastic Gradient Markov Chain Monte Carlo

使用随机梯度马尔可夫链蒙特卡洛进行精确的大样本不确定性量化

Yu Wang, Jie Ding, Jonathan H. Huggins

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对大批量或模型误设下随机梯度下降和随机梯度Langevin动力学调参困难的问题,提出新的离散时间近似方法,实现稳态协方差、迭代平均协方差和积分自相关时间的精确预测,并给出非渐近误差界。

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026
AI中文摘要

调参算法如随机梯度下降(SGD)和随机梯度Langevin动力学(SGLD)用于近似采样和不确定性量化仍然具有挑战性,特别是在批量大小较大或模型误设的实际相关设置中。现有提供调参指导的理论依赖于连续时间极限或强统计假设,在这些情况下可能变得定量不准确。我们通过提出新的带或不带动量的SG(L)D离散时间近似来解决这些不足,从而能够精确预测稳态协方差、迭代平均协方差和积分自相关时间。此外,我们证明了定量的非渐近误差界,表明这些估计对于实际调参和不确定性量化足够准确。数值实验表明,在现有方法失效的各种模型和数据生成分布中,我们的理论提供了改进的调参指导,包括使用$β$-散度而非对数损失以获得统计稳健推断的情况。

英文摘要

Tuning algorithms such as stochastic gradient descent (SGD) and stochastic gradient Langevin dynamics (SGLD) for approximate sampling and uncertainty quantification remains challenging, particularly in the practically relevant settings when the batch size is large or the model is misspecified. Existing theory that provides tuning guidance relies on continuous-time limits or strong statistical assumptions, which can become quantitatively inaccurate in these regimes. We address these shortcomings by proposing new discrete-time approximations to SG(L)D with and without momentum, which enables accurate predictions of the stationary covariance, iterate average covariance, and integrated autocorrelation time. Moreover, we prove quantitative, non-asymptotic error bounds showing that these estimates are sufficiently accurate for practical tuning and uncertainty quantification. Numerical experiments demonstrate that our theory yields improved tuning guidance across a range of models and data-generating distributions where existing approaches fail, including when using the $β$-divergence rather than log-loss to obtain statistically robust inferences.

2606.00291 2026-06-02 cs.GT cs.LG 版本更新

The Representation-Rationalizability Tradeoff in Reward Learning

奖励学习中的表示-可理性权衡

Jing Dong, Yaoliang Yu, Pascal Pourpart

发表机构 * Vector Institute(向量研究所) University of Waterloo(滑铁库大学)

AI总结 本文研究RLHF中奖励学习面临的表示与可理性之间的权衡,通过分解交叉熵损失为表示项和聚合项,证明更丰富的表示会扩大不可理性比较的数量,且联合训练无法自动达到最优平衡点。

详情
AI中文摘要

在RLHF中,每个训练样本包含一个提示$x$和两个候选回答$y,y'$,标注者提供这些回答之间的成对偏好。学习问题是将这些异质成对判断转换为一个标量奖励$r(x,y)$,用于衡量每个提示的回答质量。经典社会选择理论表明这是不可能的,因为异质标注者样本可能导致具有孔多塞循环的汇总偏好,因此没有标量奖励能够一致地评估所有被比较的回答对。越来越多的文献将RLHF作为社会选择问题进行分析,但通常假设固定的有限备选集合,即每个提示预先列举的有限候选回答集。现代流程则通过一个学习的表示$ϕ(x,y)$对回答进行评分,然后通过标量头,因此$ϕ$决定了哪些回答被视为可区分的备选,以及哪些比较对奖励模型可见。一旦嵌入成为问题的一部分,社会选择理论中的不可能结果就变成了一个权衡。我们证明,任何基于$ϕ$构建的奖励的额外交叉熵损失可以精确分解为一个表示项(更丰富的$ϕ$会缩小它)和一个聚合项(更丰富的$ϕ$通过暴露更多无法被任何标量一致排序的比较而扩大它)。相同的结果扩展到直接偏好优化(DPO),并且联合训练嵌入和奖励不能保证恢复此权衡的最佳点。在合成数据和真实偏好数据集上的实验证实了我们的结果。

英文摘要

In RLHF, each training example contains a prompt $x$ and two candidate responses $y,y'$, and annotators provide pairwise preferences between these responses. The learning problem is to convert these heterogeneous pairwise judgments into a single scalar reward $r(x,y)$ that measures response quality for each prompt. Classical social choice implies an impossibility because heterogeneous annotator samples can induce pooled preferences with Condorcet cycles, so no scalar reward can evaluate all compared response pairs consistently. A growing literature analyzes RLHF as a social-choice problem, but usually assumes a fixed finite set of alternatives, i.e., a pre-enumerated finite set of candidate responses for each prompt. Modern pipelines instead score responses through a learned representation $ϕ(x,y)$ before a scalar head, so $ϕ$ determines which responses are treated as distinguishable alternatives and which comparisons are visible to the reward model. Once this embedding is part of the problem, the impossibility results from social choice theory become a tradeoff. We show that the excess cross-entropy loss of any reward built on $ϕ$ decomposes exactly into a representational term, which a richer $ϕ$ shrinks, and an aggregation term, which a richer $ϕ$ enlarges by exposing more comparisons that no scalar can rank consistently. The same results extend to direct preference optimization (DPO), and jointly training the embedding and the reward cannot guarantee to recover the sweet spot of this tradeoff. Experiments on synthetic data and real preference datasets corroborate our results.

2606.00289 2026-06-02 cs.LG cs.DS 版本更新

Inner Product Aware Quantization: Provably Fast, Accurate, and Adaptive Algorithms

内积感知量化:可证明快速、准确且自适应的算法

Nathan White, Krish Singal

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出内积感知量化方法,通过优化目标函数并利用自适应随机量化(ASQ)理论,开发出快速且无偏的量化算法,在保证质量的同时比现有方法快2-10倍。

详情
AI中文摘要

量化是一种基本工具,用于压缩数据集、神经网络权重以及一系列计算任务中的内存使用。向量量化的许多下游应用需要与任意输入进行内积运算。这促使我们研究内积感知量化方案,该方案能够近似保留与未见向量的内积——而不仅仅是简单地最小化均方误差。在这项工作中,我们制定了捕捉自然期望的目标,并开发了自适应且无偏的量化方法,这些方法能够近似保留与最坏情况和平均情况输入的内积。对这些目标的分析表明,它们与广为人知的自适应随机量化(ASQ)概念有着紧密联系。我们为目标函数开发了可证明快速的精确和近似算法。我们的理论结果启发了高效的实际算法,这些算法在各种工作负载分布下表现良好。它们还导致了标准ASQ的实际算法,这些算法在保持质量的同时比现有最先进方法快2-10倍。这些理论和实证结果有助于使自适应量化技术在实际环境中更加高效和易于处理。

英文摘要

Quantization is a fundamental tool used to compress datasets, neural network weights, and memory usage in a range of computational tasks. Many downstream applications of vector quantization perform inner products with arbitrary inputs. This motivates the study of inner product aware quantization schemes that approximately preserve inner products with unseen vectors -- in contrast to simply minimizing the mean-squared error. In this work, we formulate objectives that capture natural desiderata and develop adaptive and unbiased quantization methods that approximately preserve inner products with worst-case and average-case inputs. An analysis of these objectives shows a tight connection with the well-studied notion of Adaptive Stochastic Quantization (ASQ). We develop provably fast exact and approximate algorithms for our objectives. Our theoretical results inspire efficient practical algorithms that perform well across a variety of workload distributions. They also lead to practical algorithms for standard ASQ which are 2-10$\times$ faster than prior state-of-the-art methods while maintaining quality. These theoretical and empirical results contribute towards making adaptive quantization techniques more efficient and tractable in practical settings.

2606.00281 2026-06-02 physics.ao-ph cs.LG 版本更新

Flow Matching for Convective-Scale Precipitation Downscaling

对流尺度降水降尺度的流匹配方法

Tom Wetherell

发表机构 * Met Office(英国气象局)

AI总结 针对对流尺度降水降尺度问题,提出流匹配生成模型,相比扩散模型在空间技能上表现更优,但低估降水分布上尾导致气候平均偏干。

详情
AI中文摘要

生成式机器学习正日益成为动力降尺度的重要补充,用于生成高分辨率降水预测,其中扩散模型是目前领先的方法。流匹配是一种相关的生成框架,最近在图像、视频和其他领域取得了强劲成果,并在降尺度方面显示出早期前景。我们训练了一个流匹配模型,将新加坡周围对流尺度区域上的每日降水从8公里映射到2公里,并将其与基于分数的扩散模型CPMGEM进行基准测试。流匹配在空间技能上始终表现更好:在每个降水阈值和邻域尺度测试中,分数技能得分更高,并且SAL得分的结构和幅度分量更紧密,位置技能相当。然而,流匹配低估了降水分布的上尾,导致气候平均存在干偏差。这些结果表明,流匹配是对流尺度降水降尺度的竞争性生成框架,特别适合捕捉空间结构。

英文摘要

Generative machine learning is an increasingly important complement to dynamical downscaling for producing high-resolution precipitation projections, with diffusion models currently the leading approach. Flow matching is a related generative framework that has recently achieved strong results across image, video and other domains, and shown early promise for downscaling. We train a flow matching model to map daily precipitation from 8 km to 2 km over a convective-scale domain centred on Singapore, and benchmark it against CPMGEM, a score-based diffusion model. Flow matching achieves consistently better spatial skill: higher fractions skill score at every precipitation threshold and neighbourhood scale tested, and tighter structure and amplitude components of the SAL score with comparable location skill. However, flow matching underestimates the upper tail of the precipitation distribution, resulting in a dry bias in the climatological mean. These results suggest that flow matching is a competitive generative framework for convective-scale precipitation downscaling, particularly well suited to capturing spatial structure.

2606.00270 2026-06-02 cs.AI cs.LG cs.LO 版本更新

Robust Shielding for Safe Reinforcement Learning

用于安全强化学习的鲁棒屏蔽

Edwin Hamel-De le Court, Thom Badings, Alessandro Abate, Francesco Belardinelli, Francesco Fabiano

发表机构 * Department of Computer Science, University of Manchester(曼彻斯特大学计算机科学系) Faculty of Computer Science & DSME, RWTH Aachen University(亚琛工业大学计算机科学与DSME学院) Department of Computer Science, University of Oxford(牛津大学计算机科学系) Department of Computing, Imperial College London(伦敦帝国理工学院计算系)

AI总结 提出一种针对鲁棒MDP的屏蔽框架,通过线性时序逻辑公式在最坏情况下的概率阈值保证安全性,并证明其可靠性与最优性。

详情
AI中文摘要

屏蔽是一种在马尔可夫决策过程(MDP)中正式保证强化学习智能体安全性的有效方法。然而,现有的屏蔽技术通常假设已知安全相关的转移动态——这一要求在现实中很少得到满足。为了解决这一限制,我们引入了一种针对鲁棒MDP(RMDP)的新型屏蔽框架,即具有转移概率集合的MDP。我们将安全性定义为在RMDP的最坏情况转移概率下,以一定阈值概率满足线性时序逻辑(LTL)公式。我们证明,我们的屏蔽框架对于RMDP既是可靠的又是最优的:屏蔽允许的每个策略都是安全的,反之,每个安全的RMDP策略都被屏蔽允许。我们将我们的方法与现有的用于学习具有可能近似正确(PAC)保证的MDP转移概率的采样方法相结合。这种组合使得能够为MDP构建屏蔽,这些屏蔽在高置信度下保证安全性,同时保持最小限制性。我们的实验表明,我们为学习的RMDP构建的屏蔽在未知MDP中保证安全性,同时随着样本数量的增加恢复出强的期望回报。

英文摘要

Shielding is an effective approach to formally guarantee the safety of reinforcement learning agents in Markov decision processes (MDPs). However, existing shielding techniques typically assume knowledge of the safety-relevant transition dynamics - a requirement that is seldom met in practice. To address this limitation, we introduce a novel shielding framework for robust MDPs (RMDPs), i.e., MDPs with sets of transition probabilities. We define safety as the satisfaction of a linear temporal logic (LTL) formula with a certain threshold probability under the worst-case transition probabilities of the RMDP. We prove that our shielding framework is both sound and optimal for the RMDP: every policy admissible by the shield is safe, and conversely, every safe RMDP policy is admissible by the shield. We combine our approach with existing sampling methods for learning transition probabilities of MDPs with probably approximately correct (PAC) guarantees. This combination enables the construction of shields for MDPs that, with high confidence, guarantee safety while remaining minimally restrictive. Our experiments show that our shields for learned RMDPs guarantee safety in unknown MDPs while recovering strong expected return as the number of samples increases.

2606.00267 2026-06-02 cs.CV cs.AI cs.LG cs.RO 版本更新

StressDream: Steering Video World Models for Robust Policy Evaluation and Improvement

StressDream: 引导视频世界模型实现鲁棒的策略评估与改进

Junwon Seo, Sushant Veer, Ran Tian, Wenhao Ding, Apoorva Sharma, Karen Leung, Edward Schmerling, Marco Pavone, Andrea Bajcsy

发表机构 * Carnegie Mellon University(卡内基梅隆大学) NVIDIA Research(NVIDIA研究) University of Washington(华盛顿大学) Stanford University(斯坦福大学)

AI总结 提出StressDream方法,通过优化扩散视频世界模型的初始噪声,在推理时引导生成高影响且合理的未来场景,以支持鲁棒的策略评估与改进。

Comments Project page: https://junwon.me/StressDream/

详情
AI中文摘要

视频世界模型通过想象以自我机器人动作为条件的真实未来观察,在策略评估与改进方面展现出潜力。虽然世界模型可以对未来的分布进行建模,但策略评估与改进通常依赖于名义上的想象,这可能会遗漏机器人动作的高影响结果,除非抽取大量样本。为了实现对世界模型想象的鲁棒策略评估与改进,我们提出StressDream,该方法通过在推理时优化扩散世界模型的初始噪声,将想象引导至高影响且合理的结果。然而,优化高维噪声具有挑战性:优化必须推理生成视频中细微的、场景相关的目标事件,同时避免产生不合理想象的分布外噪声。我们通过两个互补目标来解决这一问题:一个语义目标,利用视觉语言模型通过推理生成视频提供信息丰富的梯度;一个合理性目标,防止优化后的噪声漂移到分布外。利用用于自动驾驶和机器人操作的最先进的视频世界模型,我们展示了StressDream能够有效地将想象引导至推理时由文本指定的高影响且合理的结果,例如任务失败,从而通过识别那些合理未来包含不良结果的动作,实现鲁棒的策略评估与改进。视频结果见https://junwon.me/StressDream/。

英文摘要

Video world models (WMs) have shown promise for policy evaluation and improvement by imagining realistic future observations conditioned on ego-robot actions. While WMs can model distributions over futures, policy evaluation and improvement typically rely on nominal imaginations, which can miss high-impact outcomes of robot actions unless prohibitively many samples are drawn. To enable robust policy evaluation and improvement over WM imaginations, we propose StressDream, which steers imaginations toward high-impact yet plausible outcomes specified at inference time by optimizing the initial noise of diffusion-based WMs. However, optimizing high-dimensional noise is challenging: the optimization must reason about nuanced, scene-dependent target events in generated videos while avoiding out-of-distribution (OOD) noise that yields implausible imaginations. We address this with two complementary objectives: a semantic objective with a Vision-Language Model that provides informative gradients by reasoning about the generated video, and a plausibility objective that prevents the optimized noise from drifting OOD. With state-of-the-art video world models for autonomous driving and robotic manipulation, we show that StressDream effectively steers imaginations toward high-impact yet plausible outcomes specified by text at inference time, such as task failures, enabling robust policy evaluation and improvement by identifying actions whose plausible futures include undesirable outcomes. Video results are available at https://junwon.me/StressDream/.

2606.00266 2026-06-02 cs.NI cs.LG 版本更新

KISS: Keeping it Simple and Slotted when Learning to Communicate over Wireless

KISS:学习无线通信时保持简单和时隙化

Kamil Szczech, Maksymilian Wojnar, Krzysztof Rusek, Katarzyna Kosek-Szott, Szymon Szott

发表机构 * AGH University of Krakow(克拉科夫AGH大学)

AI总结 本文使用离线双深度Q网络结合贝叶斯推理,在时隙信道上训练分布式智能体自主学习随机接入策略,实现了接近理论效率且公平的接入,并发现学习到的行为类似于动态调整传输概率的时隙ALOHA。

详情
AI中文摘要

分布式无线系统中长期存在的挑战是确保高效且公平的随机信道接入。现有解决方案通常处理与时间、周期性或集中化相关的特定约束,但它们通常依赖固定启发式方法。受机器学习(ML)最新进展的启发,我们研究ML智能体能否自主学习高效且公平的接入策略,以及这种学习能否为介质访问控制(MAC)设计提供新见解。我们的目标不是提出可部署的协议,而是检验在最小假设下,分散式学习能否重新发现或近似理论上高效的随机接入机制。为此,我们部署了带有贝叶斯推理的离线双深度Q网络(DDQN)来训练在时隙信道上运行的智能体。所得方法完全在线(无需预训练)、完全分布式(独立的多智能体学习器)、随机(非周期性),且无需协调或显式通信。大量仿真表明,学习到的策略适应变化的网络条件,并在保持公平性的同时实现接近理论的效率。消融研究进一步揭示,学习到的行为类似于具有动态调整传输概率的时隙ALOHA,因此我们将该方法称为KISS:保持简单和时隙化。

英文摘要

A long-standing challenge in distributed wireless systems is ensuring efficient and fair random channel access. Existing solutions often address specific constraints related to timing, periodicity, or centralization, but they typically rely on fixed heuristics. Motivated by recent advances in machine learning (ML), we investigate whether ML agents can autonomously learn efficient and fair access strategies, and whether such learning can offer new insights into medium access control (MAC) design. Rather than proposing a deployable protocol, our aim is to examine whether decentralized learning can rediscover or approximate theoretically efficient random-access mechanisms under minimal assumptions. To this end, we deploy an off-policy Double Deep Q-Network (DDQN) with Bayesian inference to train agents operating over a slotted channel. The resulting method is fully online (no pre-training), fully distributed (independent multi-agent learners), stochastic (non-periodic), and requires no coordination or explicit communication. Extensive simulations show that the learned strategy adapts to varying network conditions and achieves near-theoretical efficiency while maintaining fairness. Ablation studies further reveal that the learned behavior resembles slotted ALOHA with a dynamically adjusted transmission probability, leading us to refer to the method as KISS: Keeping It Simple and Slotted.

2606.00265 2026-06-02 stat.ML cs.LG 版本更新

Out-of-Distribution generalization of quantile regression with heavy tailed inputs: an SVM approach

重尾输入下分位数回归的分布外泛化:一种SVM方法

Baptiste Leroux, Clément Dombry, Anne Sabourin

AI总结 针对协变量取异常大值的分位数回归外推问题,提出基于支持向量机(SVM)的框架,利用再生核希尔伯特空间处理高维非线性情况,并建立有限样本学习保证。

Comments 48 pages, 5 figures

详情
AI中文摘要

我们研究了协变量取异常大值的外推机制下的分位数回归。在正则变化假设下,极端观测可以通过其角度分量有效表征,从而使得学习策略能够聚焦于最极端观测的角度。该方法通过最小化渐近条件风险来形式化,该风险将学习定位在协变量分布的尾部。我们提出了一种新的支持向量机(SVM)框架用于极端分位数回归,利用再生核希尔伯特空间处理高维和非线性设置。我们的方法还适应无界响应变量,并避免了限制性变换。我们在温和的正则性假设下建立了有限样本学习保证。该框架统一了统计学习和多元极值的思想,提供了一种可处理且理论扎实的外推方法。我们通过对多瑙河河流流量数据的实证研究补充了理论发现,证明了我们方法的实际相关性。

英文摘要

We study quantile regression in an extrapolation regime where the covariate takes unusually large values. Under regular variation assumptions, extreme observations can be effectively characterized through their angular components, enabling learning strategies that focus on the angle of the most extreme observations. This approach is formalized through the minimization of an asymptotic conditional risk that localizes learning in the tail of the covariate distribution. We propose a novel Support Vector Machine (SVM) framework for extreme quantile regression, leveraging reproducing kernel Hilbert spaces to handle high-dimensional and nonlinear settings. Our method also accommodates unbounded response variables and avoids restrictive transformations. We establish finite-sample learning guarantees under mild regularity assumptions. The proposed framework unifies ideas from statistical learning and multivariate extremes, providing a tractable and theoretically grounded approach to extrapolation. We complement our theoretical findings with an empirical study on river flow data from the Danube, demonstrating the practical relevance of our methods.

2606.00263 2026-06-02 eess.SP cs.LG 版本更新

ReFLEX: Length-Generalizable CSI Denoising for MIMO-OFDM via Relative-Frequency Bias

ReFLEX: 通过相对频率偏置实现MIMO-OFDM中长度可泛化的CSI去噪

Zhibin Zhang, Robert Potekhin, Ziwei Wan, Vladimir Lyashev, Zhen Gao

发表机构 * Moscow Institute of Physics and Technology (State University)(莫斯科物理技术学院(国家大学)) Yangtze Delta Region Academy(长江三角洲地区研究院) Beijing Institute of Technology(北京理工大学) School of Interdisciplinary Science(交叉科学学院)

AI总结 提出ReFLEX,一种基于相对频率位置偏置(RFPB)的长度可泛化Transformer,用于MIMO-OFDM系统中可变RB分配的CSI去噪,在未见RB长度和稀疏DM-RS场景下无需重训即可应用,并在3GPP信道和NR PUSCH仿真中显著提升性能。

Comments 5 pages, 3 figures, submitted to IEEE journal

详情
AI中文摘要

本文研究了具有可变NR资源块(RB)分配的MIMO-OFDM系统的CSI去噪问题。ReFLEX是一种长度可泛化的Transformer,其频率注意力使用由子载波偏移生成的相对频率位置偏置(RFPB)。单个检查点可处理未见过的RB长度,并可应用于测试的RB5/RB10 PUSCH配置中的稀疏DM-RS观测,无需重新训练。在3GPP TR 38.901 UMa NLOS信道中,ReFLEX在未见RB长度上实现了约-9.6 dB的NMSE。在NR PUSCH/UL-SCH仿真中,ReFLEX去噪后接时频插值将10% BLER阈值降低了约2-3 dB。

英文摘要

This letter studies CSI denoising for MIMO--OFDM with variable NR resource block (RB) allocations. ReFLEX is a length-generalizable Transformer whose frequency attention uses a relative-frequency position bias (RFPB) generated from subcarrier offsets. A single checkpoint handles unseen RB lengths and can be applied to sparse DM-RS observations in the tested RB5/RB10 PUSCH setup without retraining. In a 3GPP~TR~38.901 UMa NLOS channel, ReFLEX achieves about $-9.6$~dB NMSE on unseen RB lengths. In NR PUSCH/UL-SCH simulations, ReFLEX denoising followed by time-frequency interpolation reduces the 10\% BLER threshold by about 2--3~dB.

2606.00262 2026-06-02 cs.LG cs.AI stat.AP stat.ML 版本更新

When Softmax Fails at the Top: Extreme Value Corrections for InfoNCE

当 Softmax 在顶部失效:InfoNCE 的极值修正

Melihcan Erol, Suat Evren, Oktay Ozel, Alexander Morgan, Jongha Jon Ryu, Lizhong Zheng

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对 InfoNCE 中 softmax 假设与对比学习嵌入设置不匹配的问题,提出基于极值理论的 WEINCE 修正方法,在五个视觉基准上提升冻结特征评估性能。

Comments Presented in ICML 2026

详情
AI中文摘要

InfoNCE 是标准的对比学习目标,但其 softmax 形式不仅是一种计算便利:它还编码了关于如何选择最高分示例的统计假设。利用极值理论,我们表明这一假设通常与现代对比学习中使用的归一化嵌入设置不一致。受此不匹配的启发,我们提出了 extsc{WEINCE},这是 InfoNCE 的一个简单修改,它使用锚点在线批次统计将通常的 softmax 对数与端点短缺修正混合,不增加可训练参数。在五个视觉基准上, extsc{WEINCE} 在冻结特征评估中产生了一致的改进。这些结果表明,对困难负样本进行更忠实的统计处理可以改进对比目标。

英文摘要

InfoNCE is the standard contrastive learning objective, but its softmax form is not only a computational convenience: it also encodes a statistical assumption about how the top-scoring example is selected. Using extreme value theory, we show that this assumption is often misaligned with the normalized embedding setting used in modern contrastive learning. Motivated by this mismatch, we propose \textsc{WEINCE}, a simple modification of InfoNCE that uses anchor-wise online batch statistics to blend the usual softmax logits with an endpoint shortfall correction, adding no trainable parameters. Across five vision benchmarks, \textsc{WEINCE} yields consistent improvements in frozen-feature evaluation. These results show that a more faithful statistical treatment of hard negatives can improve contrastive objectives.

2606.00257 2026-06-02 cs.LG cs.AI 版本更新

ARCA: Adapter-Residual Credit Assignment When Token Signals Degenerate

ARCA: 当令牌信号退化时的适配器-残差信用分配

Rodney Lafuente-Mercado

发表机构 * Rodney Lafuente-Mercado(罗伊德·拉福恩特-默茨)

AI总结 针对LoRA微调下令牌级信用分配信号退化的问题,提出ARCA方法,利用适配器隐藏状态残差作为令牌显著性度量,无需学习奖励模型或价值头。

Comments Accepted to DEMO 2026: ICML Workshop on Decision-Making from Offline Datasets to Online Adaptation. Non-archival report

详情
AI中文摘要

语言模型强化学习的令牌级信用分配通常被表述为策略完全可训练,而实际的LLM-RL流程往往依赖于参数高效微调,尤其是LoRA。我们认为这种分离隐藏了一种结构性失效模式。在LoRA下,策略被限制在参考模型的低秩邻域内,因此常用内在信用信号(如惊奇度、熵减和策略散度)所依赖的每令牌输出分布差异,在轨迹内归一化后可能变得退化,要么接近均匀权重,要么集中在少量与任务无关的位置上。我们形式化了这种行为,并提出直接用浓度诊断指标(如权重基尼系数和有效令牌比率)进行测量。然后,我们引入了适配器-残差信用分配(ARCA),一种轻量级替代方案,它从适配器自身的隐藏状态残差 $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$ 中推导令牌显著性。ARCA关注适配器实际改变模型的位置,而不是输出分布显得不确定或偏移的位置,并且不需要学习奖励模型、价值头或树结构。在紧凑的MATH/Qwen3-1.7B GRPO扫描中,ARCA在匹配的轨迹预算下表现出预测的非退化中间区域信用分布,并与秩匹配的基线保持竞争力。

英文摘要

Token-level credit assignment for language-model reinforcement learning is usually formulated as if the policy were fully trainable, while practical LLM-RL pipelines often rely on parameter-efficient fine-tuning, especially LoRA. We argue that this separation hides a structural failure mode. Under LoRA, the policy is restricted to a low-rank neighborhood of the reference model, so the per-token output-distribution differences used by common intrinsic credit signals, surprisal, entropy reduction, and policy divergence, can become degenerate after within-trajectory normalization, either approaching uniform weights or concentrating on a small set of task-agnostic positions. We formalize this behavior and propose measuring it directly with concentration diagnostics such as weight Gini and effective-token ratio. We then introduce \emph{Adapter-Residual Credit Assignment} (ARCA), a lightweight alternative that derives token salience from the adapter's own hidden-state residual, $\|h^{\text{adapted}}_t - h^{\text{base}}_t\|_2$. ARCA asks where the adapter actually changes the model, rather than where the output distribution appears uncertain or shifted, and requires no learned reward model, value head, or tree construction. In a compact MATH/Qwen3-1.7B GRPO sweep, ARCA exhibits the predicted non-degenerate middle-regime credit distribution under matched rollout budgets and remains competitive with rank-matched baselines.

2606.00253 2026-06-02 cs.RO cs.LG 版本更新

Per-Group Error, Not Total MSE: Fine-Tuning Vision-Language-Action Models for 11-DoF Mobile Manipulation

分组误差而非总MSE:微调视觉-语言-动作模型用于11自由度移动操作

Pau Montagut Bofi, Mario García Blasco, Tessa Pulli, Markus Vincze

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 针对异构关节空间的移动操作器微调视觉-语言-动作模型时,发现总MSE最低的检查点并非实际表现最佳,提出以分组误差作为更可靠的检查点选择指标。

Comments 4 pages, 3 figures, 3 tables. Accepted as poster at ICRA 2026 Workshop "From Data to Decisions: VLA Pipelines for Real Robots". Code: [https://github.com/paumontagut/per-group-mse-vla](https://github.com/paumontagut/per-group-mse-vla)

详情
AI中文摘要

对具有异构关节空间的移动操作器微调视觉-语言-动作(VLA)模型可能产生反直觉的结果:总MSE最低的检查点并非在真实机器人上表现最佳。我们认为这是将异构关节组(手臂、夹爪、头部、轮式底座)合并为单一指标的可预测后果,其中易于预测的关节可能掩盖仍然失败的关节。我们在11自由度Toyota HSR上微调SmolVLA(450M,仅动作专家),并将其与更强的预训练基线$π_{0.5}$(3.3B)进行比较。分组分析揭示了两种模式:在SmolVLA中,移动底座收敛最慢并限制了整体性能。在$π_{0.5}$的仅专家微调(仅训练动作头,骨干冻结)中,总MSE低于基线但手臂精度下降。在60次真实机器人试验(每个模型20次)中,$π_{0.5}$ 80k(4.0/4)显著优于两种微调变体(仅专家3k:3.75/4;HSR-SmolVLA:3.5/4;Mann-Whitney $p \leq 0.010$),尽管仅专家3k的总MSE最低。这种差异与离线手臂组误差最为一致,而非总MSE或底座组误差。我们得出结论:对于具有异构动作空间的机器人,分组误差比总MSE是更可靠的检查点选择信号。代码:https://github.com/paumontagut/per-group-mse-vla

英文摘要

Fine-tuning Vision-Language-Action (VLA) models for mobile manipulators with heterogeneous joint spaces can produce a counterintuitive result: the checkpoint with the lowest aggregate MSE is not the one that performs best on the real robot. We argue this is a predictable consequence of collapsing heterogeneous joint groups (arm, gripper, head, wheeled base) into a single metric, where easy-to-predict joints can mask joints that still fail. We fine-tune SmolVLA (450M, action-expert only) on the 11-DoF Toyota HSR and compare it against $π_{0.5}$ (3.3B), a stronger pretrained baseline. Per-group analysis exposes two patterns: in SmolVLA, the mobile base converges slowest and limits overall performance. In expert-only fine-tuning of $π_{0.5}$ (training only the action head, backbone frozen), total MSE drops below the baseline but arm accuracy degrades. On 60 real-robot trials (20 per model), $π_{0.5}$ 80k (4.0/4) significantly outperforms both fine-tuned variants (expert-only 3k: 3.75/4; HSR-SmolVLA: 3.5/4; Mann-Whitney $p \leq 0.010$), despite expert-only 3k having the lowest total MSE. This separation is most consistent with the offline arm-group error, not total MSE or base-group error. We conclude that per-group error is a more reliable signal than total MSE for checkpoint selection on robots with heterogeneous action spaces. Code: https://github.com/paumontagut/per-group-mse-vla

2606.00252 2026-06-02 cs.RO cs.LG 版本更新

HOIST: Humanoid Optimization with Imitation and Sample-efficient Tuning for Manipulating Suspended Loads

HOIST: 基于模仿和样本高效微调的人形机器人悬挂负载操作优化

Songyang Liu, Shunyu Yao, Dingyuan Huang, Shuai Li

发表机构 * Department of Civil and Coastal Engineering, University of Florida(土木与海岸工程系,佛罗里达大学)

AI总结 提出HOIST方法,结合模仿学习和样本高效的批量强化学习,优化人形机器人操控悬挂负载的放置精度和停止行为。

详情
AI中文摘要

使用人形机器人操控悬挂负载具有挑战性,因为机器人只能通过全身运动和间歇接触来影响一个欠驱动的振荡负载。模仿学习提供了安全初始行为,但无法直接优化最终放置,而从头开始的强化学习在真实人形机器人上不安全且样本效率低。我们提出了HOIST——基于模仿和样本高效微调的人形机器人悬挂负载操作优化。HOIST首先从虚拟现实遥操作演示中微调一个高级视觉-语言-动作策略,并通过全身控制器执行其命令。然后,它使用VLA rollout和迭代批量RL来提高放置精度和停止行为。在仿真和真实人形机器人上的实验表明,HOIST优于仅模仿和额外演示基线;与纯VLA rollout相比,HOIST将平移放置误差减少了19.9厘米,原始角度误差减少了3.56度,展示了人形机器人在欠驱动物料处理任务中的潜力。

英文摘要

Manipulating suspended payloads with humanoid robots is challenging because the robot can only influence an underactuated, oscillatory load through whole-body motion and intermittent contact. Imitation learning provides safe initial behavior but does not directly optimize final placement, while reinforcement learning from scratch is unsafe and sample-inefficient on real humanoids. We present HOIST-Humanoid Optimized with Imitation and Sample-efficient Tuning for manipulating suspended loads. HOIST first finetunes a high-level vision-language-action (VLA) policy from virtual-reality (VR) teleoperation demonstrations and executes its commands through a whole-body controller. It then uses VLA rollouts and iterative batched RL to improve placement accuracy and stopping behavior. Experiments in simulation and on a real humanoid show that HOIST improves over imitation-only and additional-demonstration baselines; compared with pure VLA rollouts, HOIST reduces translational placement error by 19.9 cm and raw angular error by 3.56 degrees, demonstrating the potential of humanoids for underactuated material-handling tasks.

2606.00241 2026-06-02 cs.LG cs.AI stat.ML 版本更新

InfoAtlas: A Foundation Model for Zero-Shot Statistical Dependence Estimate

InfoAtlas:用于零样本统计依赖性估计的基础模型

Zhengyang Hu, Yanzhi Chen, Hanxiang Ren, Qunsong Zeng, Youyi Zheng, Adrian Weller, Kaibin Huang, Yanchao Yang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出InfoAtlas,一种基础模型架构,通过单次前向传播直接推断互信息,实现零样本估计,在保持精度的同时获得100倍加速。

Comments Accepted to ICML 2026

详情
AI中文摘要

测量高维随机变量之间的统计依赖性是数据科学和机器学习中的基本任务。神经互信息(MI)估计器提供了一种有前景的途径,但它们通常需要对每个新数据集进行昂贵的迭代优化,这使得它们不适用于实时应用。我们提出了InfoAtlas,一种类似基础模型的架构,通过单次前向传播直接推断MI,消除了这一瓶颈。在大规模合成数据上预训练,具有丰富的依赖模式,InfoAtlas学习识别多样的依赖结构并直接从数据集中预测MI。全面的实验表明,InfoAtlas在准确性上匹配最先进的神经估计器,同时实现100倍加速,可以通过单个统一模型灵活处理不同维度和样本量,并有效推广到复杂的现实场景。通过将MI估计重新表述为推理任务,InfoAtlas为实时依赖性分析奠定了基础。

英文摘要

Measuring statistical dependency between high-dimensional random variables is a fundamental task in data science and machine learning. Neural mutual information (MI) estimators offer a promising avenue, but they typically require costly iterative optimization for each new dataset, making them impractical for real-time applications. We present InfoAtlas, a foundation model-like architecture that eliminates this bottleneck by directly inferring MI in a single forward pass. Pretrained on large-scale synthetic data with rich dependence patterns, InfoAtlas learns to identify diverse dependence structures and predict MI directly from the dataset. Comprehensive experiments demonstrate that InfoAtlas matches state-of-the-art neural estimators in accuracy while achieving $100\times$ speedup, can flexibly handle varying dimensions and sample sizes through a single unified model, and generalizes effectively to complex, real-world scenarios. By reformulating MI estimation as an inference task, InfoAtlas establishes a foundation for real-time dependency analysis.

2606.00232 2026-06-02 cs.AI cs.LG 版本更新

TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation

TIGER: 基于图证据路由的可追踪推理用于减轻多模态生成中的幻觉

Kaixiang Zhao, Tianrun Yu, Shawn Huang, Porter Jenkins, Yushun Dong, Amanda Hughes

发表机构 * Brigham Young University Florida State University

AI总结 提出TIGER框架,通过从输入和输出中独立提取观测图与声明图,并基于图条件风险评分修复高风险声明,以减轻多模态生成中的事实级幻觉。

Comments 25 pages, 7 figures, 16 tables. Under review

详情
AI中文摘要

我们研究多模态生成的事实级修复,其中流畅的输出可能包含输入不支持的具体事实。现有的推理时修复方法通常通过联合条件化输入和当前输出来生成反馈。这种设计有两个局限性:输出中的幻觉声明可能偏置模型对输入的解释,且自由形式的反馈无法在事实级别进行排序或调度。我们提出TIGER,一种重新设计反馈以进行局部修复的推理时框架。TIGER从输入中独立提取观测图,从当前输出中提取声明图,然后根据支持和冲突为每个声明分配图条件风险分数。模型修复选定的高风险声明,同时保持骨干网络冻结。我们提供收敛性分析,表明在温和假设下,期望总风险几何级数下降至显式渐近界。跨四个跨模态路径(包括图像到文本、图像+文本到文本、音频到文本和视频到文本)的实验表明,TIGER在保持任务质量的同时减少了不支持内容。该增益在多个骨干网络上成立,CrisisFACTS案例研究表明相同的修复机制可以改善多源设置中的接地性。

英文摘要

We study fact-level repair for multimodal generation, where a fluent output may contain specific facts that are not supported by the input. Existing inference-time repair methods often generate feedback by jointly conditioning on the input and the current output. This design has two limitations: hallucinated claims in the output can bias the model's interpretation of the input, and free-form feedback cannot be ranked or scheduled at the fact level. We present TIGER, an inference-time framework that redesigns feedback for localized repair. TIGER independently extracts an observation graph from the input and a claim graph from the current output, then assigns each claim a graph-conditioned risk score based on support and conflict. The model repairs selected high-risk claims while keeping the backbone frozen. We provide a convergence analysis showing that the expected total risk decreases geometrically to an explicit asymptotic bound under mild assumptions. Experiments across four cross-modal paths, including image-to-text, image+text-to-text, audio-to-text, and video-to-text, show that TIGER reduces unsupported content while preserving task quality. The gains hold across multiple backbones, and a CrisisFACTS case study suggests that the same repair mechanism can improve grounding in multi-source settings.

2606.00230 2026-06-02 cs.LG 版本更新

A Pre-Training Analogue of Grokking in Language Models: Tracing Delayed Grammatical Generalization

语言模型中预训练的Grokking类比:追踪延迟的语法泛化

Sherin Muckatira, Namrata Shivagunde, Vijeta Deshpande, Anna Rumshisky

发表机构 * University of Massachusetts Lowell(马萨诸塞大学洛文分校)

AI总结 本文提出一个基于暴露的框架,在LLM预训练中研究类似grokking的延迟泛化现象,通过BLiMP最小对发现语法泛化延迟,并分析泛化前后语法概念向量的变化。

Comments 18 pages, 10 figures, 9 tables

详情
AI中文摘要

Grokking是指神经网络在拟合训练数据后很长时间才泛化的现象,已在监督设置下经过多个epoch研究。LLM预训练则涉及在未标注语料库上进行下一个词预测,数据重复有限且没有明确的训练/验证划分。为了解决这个问题,我们提出了一个基于暴露的框架,使得在LLM预训练期间能够研究类似grokking的动态。我们将评估基于BLiMP最小对,它们提供了受控的语法对比。对于每个BLiMP最小对,我们识别出一个关键短语,即捕获语法对比和现象相关上下文的最小连续跨度。其关键短语出现在预训练窗口中的示例被分配到代理训练集;其余示例被分配到代理验证集。在五个语法现象中,我们观察到延迟泛化。分析泛化前后的预训练检查点表明,语法概念向量在泛化后更能预测语法可接受性,并占据更高维的子空间。我们还发现,从关键标记到相关上下文标记的注意力集中在少数头上。

英文摘要

Grokking, the phenomenon in which neural networks generalize long after fitting their training data, has been studied in supervised settings on many epochs. LLM pre-training instead involves next-token prediction over an unlabeled corpus, with limited data repetition and no explicit train/validation split. To address this, we propose an exposure-based framework that enables the study of grokking-like dynamics during LLM pre-training. We ground our evaluation in BLiMP minimal pairs, which provide controlled grammatical contrasts. For every BLiMP minimal pair, we identify a critical phrase, the smallest continuous span that captures the grammatical contrast and the phenomenon-relevant context. Examples whose critical phrase appears in the pre-training window are assigned to the proxy-train split; the remaining examples are assigned to the proxy-validation split. Across five grammatical phenomena, we observe delayed generalization. Analyzing pre-training checkpoints before and after generalization shows that grammatical concept vectors become more predictive of grammatical acceptability and occupy a higher-dimensional subspace after generalization. We also find that attention from the critical token to the relevant context token is concentrated in a small number of heads.

2606.00228 2026-06-02 cs.LG 版本更新

LithoGRPO: Fast Inverse Lithography via GRPO Reinforced Flow Matching

LithoGRPO: 通过GRPO强化流匹配的快速逆向光刻

Yao Lai, Xuyuan Xiong, Zeyue Xue, Guojin Chen, Jing Wang, Xihui Liu, Rui Zhang, Robert Mullins, Bei Yu, Ping Luo

发表机构 * University of Science and Technology of China(中国科学技术大学) Tsinghua University(清华大学)

AI总结 提出LithoGRPO框架,结合流匹配与GRPO强化学习微调,利用物理奖励函数优化掩模,实现高效逆向光刻,性能优于现有方法。

Comments ICML 2026

详情
AI中文摘要

在半导体制造中,光刻通过光学掩模将电路布局投影到硅片上。随着电路特征尺寸缩小到光波长以下,光学衍射导致印刷图案偏离预期布局。逆向光刻技术(ILT)通过生成优化掩模来解决这一挑战,提高图案转移到晶圆上的保真度。虽然ILT类似于图像合成任务,但其对掩模评估依赖明确的物理指标限制了现有生成模型的适用性。我们引入了LithoGRPO,一个ILT框架,它将流匹配范式与基于GRPO的强化学习(RL)微调相结合,能够针对给定目标布局高效探索多样化的掩模。与纯生成或基于优化的方法不同,LithoGRPO中的RL利用了ILT明确定义的、基于物理的奖励函数,从而在复杂、工艺感知约束下进行优化。据我们所知,这是第一个将流匹配和RL统一用于掩模优化的框架。为了提高RL采样效率,我们提出了一种用于可制造性评估的快速镜头计数算法,在保持传统镜头计数指标掩模排序的同时,实现了超过130倍的加速。大量实验表明,LithoGRPO在基于优化和基于学习的方法中均达到了最先进的性能,同时保持了高效的掩模生成。

英文摘要

In semiconductor manufacturing, lithography projects circuit layouts onto silicon wafers through an optical mask. As circuit features shrink below the wavelength of light, optical diffraction causes the printed patterns to deviate from their intended layouts. Inverse Lithography Technology (ILT) addresses this challenge by generating optimized masks that enhance the fidelity of pattern transfer onto wafers. While ILT resembles an image synthesis task, its reliance on explicit physical metrics for mask evaluation limits the applicability of existing generative models. We introduce LithoGRPO, an ILT framework that integrates the flow-matching paradigm with GRPO-based reinforcement learning (RL) fine-tuning, enabling efficient exploration of diverse masks for a given target layout. Unlike purely generative or optimization-based approaches, RL in LithoGRPO exploits the explicitly defined, physics-based reward function of ILT, enabling optimization under complex, process-aware constraints. To the best of our knowledge, this is the first framework that unifies flow matching and RL for mask optimization. To improve RL sampling efficiency, we propose a fast shot-counting algorithm for manufacturability evaluation, achieving over 130x speedup while preserving the mask ranking of the traditional shot-count metric. Extensive experiments demonstrate that LithoGRPO achieves state-of-the-art performance over both optimization-based and learning-based methods, while maintaining efficient mask generation.

2606.00219 2026-06-02 astro-ph.CO astro-ph.GA cs.LG 版本更新

21cmEMUv3: a hybrid diffusion-LSTM emulator of 21cmFAST summary observables

21cmEMUv3: 一种混合扩散-LSTM的21cmFAST概要可观测量的仿真器

Daniela Breitman, Andrei Mesinger, Steven G. Murray, Ivan Nikolic, Roberto Trotta

发表机构 * Research Center for the Early Universe, Graduate School of Science, The University of Tokyo(早期宇宙研究中心,东京大学研究生院) Department of Physics, Graduate School of Science, The University of Tokyo(东京大学研究生院物理系) Scuola Normale Superiore (SNS), Piazza dei Cavalieri 7, Pisa(普拉蒂亚学院(SNS),比萨) Physics Department, Stellenbosch University(斯坦福博斯奇大学物理系) Cosmic Dawn Center (DAWN)(黎明宇宙中心(DAWN)) Niels Bohr Institute, University of Copenhagen(哥本哈根大学尼尔斯·波尔研究所) SISSA, Via Bonomea 265, 34136 Trieste(SISSA,特里斯特) INFN Sezione di Trieste(INFN特里埃斯特分部) Centro Nazionale di Ricerca in High Performance Computing, Big Data e Quantum Computing(高性能计算、大数据和量子计算国家研究中心) Physics Department, Blackett Lab, Imperial College London(伦敦帝国理工学院布莱特实验室物理系)

AI总结 提出混合扩散-LSTM仿真器21cmEMUv3,基于21cmFASTv3模拟,以高精度仿真21cm功率谱等七个概要可观测量,并用于重新解释HERA观测上限和预测SKA探测能力。

Comments 12 pages, 6 figures

详情
AI中文摘要

我们正在见证宇宙黎明和再电离时期观测的激增,这推动了对快速且稳健的理论解释框架的需求。为此,机器学习,特别是仿真,已成为加速和增强推断流程的有力方法。在本工作中,我们提出21cmEMUv3,一个基于21cmFASTv3模拟训练的仿真器,该模拟同时模拟了原子冷却和分子冷却星系。21cmEMUv3以$σ_8$和十个天体物理参数为条件,生成七个概要可观测量:(i) 柱状21cm功率谱,首次在如此高的分辨率和精度下,在宽红移范围$z \sim$ 6--30内进行仿真;(ii) 球平均21cm功率谱;(iii) 星系间介质平均中性分数;(iv) 平均21cm自旋温度;(v) 全局21cm信号;(vi) 紫外光度函数;(vii) 汤姆孙散射光深。值得注意的是,柱状21cm功率谱通过基于分数的扩散进行仿真,而其余六个概要可观测量通过长短期记忆网络进行仿真,所有结果均达到亚百分位的中位精度。我们使用该仿真器重新解释当前HERA的21cm功率谱上限,首次利用最先进的流体动力学模拟为分子冷却星系中的恒星形成提供先验信息。我们发现,推断出的单位恒星形成率软带X射线光度与高质量X射线双星向第一代星系预期的低金属丰度区域的外推一致,在95%置信度下排除了低于$10^{39.2}$ erg s$^{-1}M^{-1}_\odot m{yr}$的值。最后,我们针对不同阵列配置,对平方公里阵列探测宇宙21cm功率谱进行了预测。21cmEMU软件包已公开可用。

英文摘要

We are witnessing a surge in observations of the cosmic dawn (CD) and epoch of reionisation (EoR), driving an increasing demand for fast and robust theoretical interpretation frameworks. In response, machine learning (ML), and emulation in particular, has emerged as a powerful approach to accelerate and enhance inference pipelines. In this work, we present 21cmEMUv3, an emulator trained on 21cmFASTv3 simulations that model both atomically and molecularly cooling galaxies. 21cmEMUv3 is conditioned on $σ_8$ and ten astrophysical parameters to produce seven summary observables: (i) the cylindrical 21cm power spectrum (PS), emulated for the first time at such high resolution and accuracy across a wide redshift range of $z \sim$ 6--30; (ii) the spherically-averaged 21cm PS; (iii) the mean neutral fraction of the intergalactic medium (IGM); (iv) the mean 21cm spin temperature; (v) the global 21cm signal; (vi) the ultraviolet (UV) luminosity functions (LFs); and (vii) the Thomson scattering optical depth. Notably, the cylindrical 21cm PS is emulated via score-based diffusion, while the remaining six summaries are emulated via long-short term memory (LSTM) networks, all achieving sub-percent median accuracy. We use the emulator to reinterpret current 21cm PS upper limits from HERA, for the first time using state-of-the-art hydrodynamical simulations to inform priors on star formation inside molecularly cooling galaxies. We find that our inferred soft-band X-ray luminosity per unit star formation rate is consistent with extrapolations of high-mass X-ray binaries to the low-metallicity regimes expected in the first galaxies, excluding values below $10^{39.2}$ erg s$^{-1}M^{-1}_\odot \rm{yr}$ at $95\%$ confidence. Finally, we produce forecasts for the detection of the cosmic 21cm PS with the Square Kilometre Array for different array configurations. The 21cmEMU package is publicly available.

2606.00206 2026-06-02 cs.LG 版本更新

Quantized Reasoning Models Think They Need to Think Longer, but They Do Not

量化推理模型认为它们需要思考更长时间,但实际上并不需要

Sanae Lotfi, Polina Kirichenko, Steven Li, Zechun Liu

发表机构 * FAIR at Meta(Meta 联合实验室) Meta AI(Meta 人工智能)

AI总结 本文发现后训练量化会降低推理模型准确率并增加思维链长度,通过分析量化模型在中间步骤正确但最终输出错误的“过度思考”错误,提出一种无训练的对过度思考标记施加logit惩罚的方法,在保持或提升准确率的同时减少12-23%的思维链长度。

详情
AI中文摘要

后训练量化(PTQ)被广泛用于高效部署大型语言模型,但其对推理模型的影响尚不明确。在数学、编程和科学问答任务中,我们发现激进的PTQ会降低准确率,同时增加思维链(CoT)长度。令人惊讶的是,我们证明在量化模型高达52%的失败案例中,模型在中间推理步骤中得出了正确答案,但并未将其作为最终答案输出。为了理解量化为何导致这种过度思考错误的增加,我们测量了量化模型与全精度输出分布之间的token级KL散度。KL散度高的位置与高下一个token熵强相关,在这些位置上,量化模型过度采样了“wait”、“but”、“alternatively”等过度思考标记。我们表明,仅对一组精心挑选的过度思考标记引入无训练的logit惩罚,即可在5个模型(1.5B-32B参数)、3种量化方法和5个基准测试中将CoT长度减少12-23%,同时保持或提升准确率,与惩罚其他标记集相比,在准确率与推理成本之间产生了更优的帕累托前沿。量化模型产生的过度思考错误尤其减少了高达58%。

英文摘要

Post-training quantization (PTQ) is widely used to deploy large language models efficiently, but its effect on reasoning models is not well understood. Across math, coding, and science QA, we find that aggressive PTQ reduces accuracy while increasing chain-of-thought (CoT) length. Surprisingly, we show that in up to 52% of the quantized models' failures, models reach the right answer in intermediate reasoning steps but do not output it as a final answer. To understand why quantization leads to this increase in overthinking errors, we measure the token-level KL divergence between quantized and full-precision output distributions. Positions with high KL divergence correlate strongly with high next-token entropy, and at these positions quantized models disproportionately sample overthinking markers such as "wait", "but", and "alternatively". We show that simply introducing a training-free logit penalty on a curated set of overthinking markers can reduce CoT length by 12--23% while preserving or improving accuracy across 5 models (1.5B-32B parameters), 3 quantization methods, and 5 benchmarks, yielding a favorable Pareto frontier of accuracy against reasoning cost compared to penalizing other token sets. Overthinking errors produced by quantized models are particularly reduced by up to 58%.

2606.00202 2026-06-02 cs.LG cs.AI 版本更新

From Rashomon Theory to PRAXIS: Efficient Decision Tree Rashomon Sets

从Rashomon理论到PRAXIS:高效决策树Rashomon集

Zakk Heile, Hayden McTavish, Varun Babbar, Margo Seltzer, Cynthia Rudin

发表机构 * Stanford University(斯坦福大学)

AI总结 针对决策树Rashomon集计算开销大的问题,提出PRAXIS算法,在运行时和内存使用上实现数量级改进,并能恢复几乎完整的Rashomon集。

Comments Accepted to ICML 2026

详情
AI中文摘要

标准机器学习流程通常会产生许多接近最优的模型。这些“Rashomon集”为不确定性感知的鲁棒决策带来了一系列挑战和机遇。它们允许用户整合领域知识和偏好,这些知识和偏好通常难以直接指定为目标函数,并且它们量化了给定训练数据集和目标函数下有效模型之间的多样性。然而,即使对于稀疏决策树这样简单、可解释的模型类,Rashomon集的计算仍然需要巨大的内存和运行时资源。我们提出了PRAXIS,一种近似该Rashomon集的算法,在运行时和内存使用上实现了数量级的改进。我们验证了PRAXIS通常能恢复几乎完整的Rashomon集。PRAXIS使研究人员和从业者能够可扩展地对真实世界数据集的Rashomon集进行建模。PRAXIS的代码可在https://github.com/zakk-h/PRAXIS获取。

英文摘要

Standard machine learning pipelines often admit many near-optimal models. These "Rashomon sets" pose a range of challenges and opportunities for uncertainty-aware, robust decision making. They allow users to incorporate domain knowledge and preferences that would otherwise be difficult to specify directly in an objective, and they quantify diversity among valid models for a given training dataset and objective function. However, computation of Rashomon sets, even for simple, interpretable model classes such as sparse decision trees, continues to require immense memory and runtime resources. We present PRAXIS, an algorithm to approximate this Rashomon set with orders of magnitude improvement in runtime and memory usage. We validate that PRAXIS regularly recovers almost all of the full Rashomon set. PRAXIS allows researchers and practitioners to scalably model the Rashomon set for real-world datasets. Code for PRAXIS is available at https://github.com/zakk-h/PRAXIS

2606.00198 2026-06-02 cs.LG cs.AI cs.CL 版本更新

BAGEN: Are LLM Agents Budget-Aware?

BAGEN:LLM 智能体是否具有预算意识?

Yuxiang Lin, Zihan Wang, Mengyang Liu, Yuxuan Shan, Longju Bai, Junyao Zhang, Xing Jin, Boshan Chen, Jinyan Su, Xingyao Wang, Jiaxin Pei, Manling Li

发表机构 * Northwestern University(西北大学) O2 Lab(O2实验室) Independent(独立) University of Michigan(密歇根大学) Cornell(康奈尔大学) All Hands AI Stanford(斯坦福大学) UT Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出预算感知智能体(BAGEN)概念,将预算作为主动控制信号而非被动成本指标,通过渐进区间估计方法预测剩余预算上下界,并在四个环境和五个前沿模型上发现强模型不一定具有强预算意识、模型过度乐观等失败模式,早期停止可节省 28-64% 令牌,但精确区间校准仍具挑战。

详情
AI中文摘要

尽管智能体正在花费越来越多的资源,但如今智能体成本大多仅在执行后衡量。预算感知智能体(BAGEN)应将预算视为主动控制信号,而非被动成本指标。我们首先系统地将预算估计定义为内部预算(来自智能体计算)和外部预算(来自智能体动作)。然后,我们将预算意识形式化为渐进区间估计:在计划的每一步,智能体应预测剩余预算的上限和下限,并在完成可能性低时发出警报。通过 rollout-replay 协议进行评分,我们在四个环境和五个前沿模型上发现了一致的失败模式:(1)强模型不一定具有强预算意识,相关性 r=0.35。(2)前沿模型始终过度乐观,继续在不太可能成功的任务上花费资源,而不是尽早提醒用户。(3)预算感知信号是可操作且可训练的。早期停止在失败轨迹上节省 28-64% 的令牌,SFT+RL 增强了早期停止和警报行为。(4)精确区间校准仍然具有挑战性,SFT+RL 后区间覆盖率上限为 47%。项目页面:https://ragen-ai.github.io/bagen/

英文摘要

While agents are increasingly spending more resources, today agent cost is mostly measured only after execution. A Budget-Aware Agent (BAGEN) should treat budget as an active control signal, rather than a passive cost metric. We first systematically define budget estimation as internal budgets (from agent computation) and external budgets (from agent actions). We then formalize budget-awareness as progressive interval estimation: at each step of a plan, an agent should predict an upper and lower bound on remaining budget, and alert when completion is unlikely. Scoring with a rollout-replay protocol, we find consistent failure patterns on four environments and five frontier agents: (1) strong agents do not necessarily have strong budget-awareness, with correlation r=0.35. (2) frontier models are consistently over-optimistic, continue spending on tasks that are unlikely to succeed, instead of alerting the user early. (3) budget-aware signal is actionable and trainable. Early stop saves 28-64% tokens on failed trajectories, and SFT+RL strengthens early stop and alert behavior. (4) precise interval calibration remains challenging, with interval coverage capping at 47% after SFT+RL. Project page: https://ragen-ai.github.io/bagen/

2606.00189 2026-06-02 cs.LG cs.AI 版本更新

Learning to Construct Practical Agentic Systems

学习构建实用的智能体系统

Aditya Kumar, Zhihan Lei, Jerry Yan, Joshua W. Momo, Lauhitya Reddy, Rafael Enrique Cabrera Jimenez, Cassandra A. Cohen, Arthur Kajiyama, William W. Cohen

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Dept. of Computer Science(计算机科学系) Emory University(埃默里大学)

AI总结 本文提出一种基于伪工具和固定工作流的智能体框架,通过模块化设计和多目标优化方法,在保证成本可控和结果质量的前提下,实现实用智能体系统的自动构建与优化。

详情
AI中文摘要

基于LLM的智能体系统的自动设计和优化能够产生复杂的系统,显著提升结果质量,优于现成的智能体模式。然而,对实际部署的智能体系统的研究表明,生产系统更关注推理成本的简单性、可控性和可预测性等问题。本文提出了设计和优化实用智能体系统的原则性方法。我们描述了一个智能体框架,通过定义在受限上下文中递归调用LLM的“伪工具”,使设计者能够强制智能体系统的模块化。利用该框架,我们为多种任务手工设计了智能体,并表明相对于动态规划的工作流,手工构建的固定工作流通常更便宜且更准确。随后,我们提出了针对该框架所需的智能体组件(即伪工具和固定工作流)的新型学习方法。这些学习方法通常优于手工设计的智能体。我们还利用框架的模块化特性,应用多目标优化方法联合优化成本和响应质量,并融合多个学习系统的结果。

英文摘要

Automated design and optimization of agentic LLM-based systems leads to sophisticated systems that substantially improve result quality over off-the-shelf agentic patterns. However, studies of fielded agentic systems show that production systems focus much more on issues such as simplicity, controllability, and predictability of inference costs. In this paper we propose principled approaches to designing and optimizing practical agentic systems. We describe an agent framework that enables designers to enforce modularity in agentic systems, by defining "pseudo-tools" that call LLMs recursively on a restricted context. Using this framework we hand-engineer agents for a diverse set of tasks, and show that relative to dynamically-planned workflows, hand-constructed fixed workflows are generally cheaper and more accurate. We then propose novel learning methods for the agentic components required by this framework, namely pseudo-tools and fixed workflows. These learning methods generally outperform hand-engineered agents. We also exploit the modularity of the framework to apply multi-objective optimization methods to jointly optimize cost and response quality and blend the results of multiple learning systems.

2606.00187 2026-06-02 cs.LG cond-mat.mtrl-sci 版本更新

AI-Guided Design and Optimization of Graphite-Based Anodes via Iterative Experimental Feedback

基于迭代实验反馈的AI引导石墨负极设计与优化

Qian Du, Mark M. Sullivan, James E. Saal, Florian Huber

发表机构 * Citrine Informatics hte GmbH

AI总结 提出一种迭代AI引导工作流,通过多目标逆向设计和反馈标签,将电池负极制造成功率从频繁失败提升至100%,高容量电池比例从28.4%增至84.8%,容量保持率从42.1%升至97.3%。

Comments 12 pages, 10 figures, 2 tables

详情
AI中文摘要

本研究提出一种迭代AI引导工作流,通过提高配方可行性和工艺鲁棒性来加速石墨负极开发。利用Citrine平台实现AI/ML引导的多目标逆向设计以优化负极。从嘈杂、不完整的数据集开始,Citrine平台生成早期代理模型,尽管预测确定性低,但突出了缺失的工艺约束。通过迭代添加可行性标签和边界条件失败,工作流迅速收敛到可制造、性能更高的配方。制造可靠性从频繁的工艺失败提高到100%成功的电池生产,而提供≥350 mAh g$^{-1}$的电池比例从28.4%增加到84.8%,容量保持率从42.1%上升到97.3%。这些结果表明,结构化的、反馈驱动的AI工作流可以将不完美的工业数据转化为可操作的指导,实现更快、更可重复的电池电极制造优化。

英文摘要

This study presents an iterative AI-guided workflow that accelerates graphite-based anode development by improving both formulation feasibility and process robustness. Sequential learning via AI/ML-guided multiobjective inverse design for anode optimization was implemented using the Citrine Platform. Starting from a noisy, incomplete dataset, the Citrine Platform was used to generate early surrogate models, which despite low predictive certainty highlighted missing process constraints. By iteratively adding feasibility labels and boundary condition failures, the workflow rapidly converged toward manufacturable, higher-performing formulations. Fabrication reliability improved from frequent process failures to 100% successful cell production, while the fraction of cells delivering $\geq$ 350 mAh g$^{-1}$ increased from 28.4% to 84.8%, with capacity retention rising from 42.1% to 97.3%. These results demonstrate that structured, feedback-driven AI workflows can transform imperfect industrial data into actionable guidance, enabling faster, more reproducible optimization of battery electrode manufacturing.

2606.00183 2026-06-02 cs.LG cs.AI math.OC stat.ML 版本更新

Agentic Transformers Provably Learn to Search via Reinforcement Learning

智能体Transformer通过强化学习可证明地学会搜索

Tong Yang, Yu Huang, Yingbin Liang, Yuejie Chi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of Pennsylvania(宾夕法尼亚大学) The Ohio State University(俄亥俄州立大学) Yale University(耶鲁大学)

AI总结 本文通过构建双头Transformer实现随机深度优先搜索,并分析策略梯度训练动力学,证明该搜索机制能从稀疏强化反馈中分阶段涌现,且具备深度泛化能力。

详情
AI中文摘要

树搜索是许多语言智能体推理和决策任务的核心抽象:智能体必须探索动作、记住失败并回溯到有希望的替代方案。然而,我们缺乏对基于Transformer的策略如何从强化学习(RL)的训练动态中获得这种搜索能力的理论理解。我们在一个随机的$k$叉树环境中研究这个问题,其中智能体Transformer仅通过交互观察其轨迹历史,并在到达隐藏的叶子目标节点时获得终端奖励。我们首先构建了一个实现随机深度优先搜索(DFS)的双头Transformer:一个头跟踪之前的动作,而另一个头检测失败结果并触发回溯。然后,我们分析了在深度课程下的策略梯度训练动态,表明相同的DFS机制在没有专家演示的情况下,从稀疏强化反馈中分阶段涌现。得到的策略表现出深度泛化能力:仅在深度为1和2的树上训练后,它能在更深的完整树上成功。我们进一步表明,在非平衡目标分布下,对回报进行折扣会导致一种排序的DFS策略,优先考虑高概率分支。总的来说,我们的结果确定了基于Transformer的搜索的一种机制性标准形式,其中注意力头专门化并协作,从上下文中提取与决策相关的轨迹,并通过RL训练将其转化为智能体动作选择。

英文摘要

Tree search is a central abstraction behind many language-agent reasoning and decision-making tasks: agents must explore actions, remember failures, and backtrack toward promising alternatives. Yet, we lack a theoretical understanding of how transformer-based policies acquire such search capabilities from the training dynamics of reinforcement learning (RL). We study this question in a stochastic $k$-ary tree environment, where an agentic transformer observes only its trajectory history through interaction and receives a terminal reward for reaching a hidden leaf goal node. We first construct a two-head transformer that implements randomized depth-first search (DFS): one head tracks previous actions, while the other detects failure outcomes and triggers backtracking. We then analyze the training dynamics of policy gradient under a depth-wise curriculum, showing that this same DFS mechanism emerges in stages from sparse reinforcement feedback without expert demonstrations. The resulting policy exhibits depth generalization: after training only on depth-$1$ and depth-$2$ trees, it succeeds on deeper full trees. We further show that, under imbalanced goal distributions, discounting the return leads to a ranked DFS policy that prioritizes higher-probability branches. Overall, our results identify a mechanistic normal form for transformer-based search, in which attention heads specialize and cooperate to extract decision-relevant traces from context and convert them into agentic action selection via RL training.

2606.00180 2026-06-02 cs.LG cs.AI 版本更新

Beyond Augmentation: Score-Guided Pathological Prior for EEG-based Depression Detection

超越增强:基于评分引导的病理先验用于脑电图抑郁症检测

Xiaojing Chen, Jingqi Cheng, Xu Zhao, Wan Jiang, Jingjing Wu

发表机构 * School of Internet, Anhui University(安徽大学互联网学院) School of Computer Science and Technology, Hefei University of Technology(合肥工业大学计算机科学与技术学院) School of Computer Science and Information Engineering, Hefei University of Technology(合肥工业大学计算机科学与信息工程学院)

AI总结 针对脑电图抑郁症检测中的小样本困境,提出无数据增强的评分引导分类框架,利用生成网络建模病理先验并融合深度特征,同时设计跨通道空间适应模块解决多中心数据集硬件异构问题。

详情
AI中文摘要

基于深度学习的脑电图(EEG)重度抑郁症(MDD)检测从根本上受到“小样本困境”的制约。主流的生成式数据增强方法不仅带来沉重的计算开销,还可能引入合成噪声,从而模糊分类边界。为了挑战传统的“数据数量优先”惯例,我们提出了一种新颖的框架“超越增强”:评分引导分类(SGC)。SGC不合成伪样本,而是利用无监督生成网络架构对样本的结构和统计异常程度进行建模,作为核心的“病理先验”。该先验经过鲁棒归一化后,与深度特征表示显式融合,从而精确指导分类器的决策边界。此外,为了动态适应不同的通道配置,我们提出了跨通道空间适应模块,利用空间映射机制有效解决多中心数据集中不匹配通道的硬件异构问题。在Mumtaz2016和高密度MODMA数据集上的大量实验证明了我们的方法在具有挑战性的“零数据增强”设置和“零样本合成成本”下的有效性和卓越的泛化能力。

英文摘要

Deep learning-based Major Depressive Disorder (MDD) detection using Electroencephalography (EEG) is fundamentally constrained by the "small-sample dilemma." Prevailing generative data augmentation methods not only incur heavy computational overhead but also risk introducing synthetic noise, thereby blurring classification boundaries. To challenge the traditional "data quantity first" convention, we propose a novel framework "Beyond Augmentation": Score-Guided Classification (SGC). SGC does not synthesize pseudo-samples; instead, it utilizes an unsupervised generative network architecture to model the structural and statistical anomaly degrees of samples, serving as the core "Pathological Prior". This prior, after robust normalization, is explicitly fused with deep feature representations, thereby precisely guiding the classifier's decision boundary. Furthermore, to dynamically adapt to varying channel configurations, we propose a Cross-Channel Spatial Adaptation module, utilizing a spatial mapping mechanism to effectively resolve the hardware heterogeneity of mismatched channels in multi-center datasets. Extensive experiments on the Mumtaz2016 and high-density MODMA datasets demonstrate the effectiveness and exceptional generalizability of our method under the challenging "zero data augmentation" setting and at "zero sample synthesis cost". Keywords: Electroencephalography (EEG), Depression Detection, Anomaly Score, Diffusion Models, Few-Shot Learning

2606.00169 2026-06-02 cs.LG cs.AI 版本更新

ChurnNet: A Optimized Modern AI for Churn Prediction

ChurnNet: 一种用于流失预测的优化现代人工智能

Syed Saad Saif, Giulio Maggiore, Paolo Russo, Damiano Distante

发表机构 * Department of Computer, Control, and Management Engineering(计算机、控制与管理工程系) Department of Law and Economics, UnitelmaSapienza University of Rome(法律与经济学系,Unitelma萨皮恩扎罗马大学) R&D Center, Token Financial Technologies(Token金融技术研发中心) Department of Civil, Computer Science and Aeronautical Technologies Engineering, Roma Tre University(土木、计算机科学与航空技术工程系,罗马三大学)

AI总结 本研究评估了传统机器学习方法(随机森林、XGBoost、支持向量机)与统一多任务时间序列模型在客户流失预测任务上的性能,发现传统方法在预测性能、数据效率和计算资源需求方面仍具优势。

详情
AI中文摘要

日益激烈的竞争以及零售商提供的产品和服务日益相似,降低了客户转向竞争对手的门槛。准确的流失预测可以成为推动有效个性化营销活动和帮助减少客户流失的宝贵工具。本研究评估了传统机器学习技术(即随机森林、XGBoost和支持向量机)的性能,并将其与统一多任务时间序列模型(一种二元时间序列分类任务)在流失预测上进行比较。尽管后者在建模复杂时间动态和变量间关系方面具有强大能力,但我们的结果表明,对于流失预测,传统方法在预测性能、数据效率以及训练和部署的计算资源需求方面仍可超越它。这些发现在多个数据集和各种流失标记技术中保持一致。

英文摘要

Increased competition and the growing similarity of products and services offered by retailers have lowered the barriers for customers to switch to competitors. Accurate churn prediction can be a valuable tool for driving effective personalized marketing campaigns and helping to reduce customer attrition. This study evaluates the performance of traditional machine learning techniques, namely, Random Forests, XGBoost, and Support Vector Machines, and compares them with the Unified Multi-Task Time Series Model for churn prediction, a binary time-series classification task. Despite the strong capacity of the latter to model complex temporal dynamics and inter-variable relationships, our results indicate that for churn prediction, conventional methods can still outperform it in terms of predictive performance, data efficiency, and computational resource requirements for training and deployment. These findings are consistent across multiple datasets and various churn labeling techniques.

2606.00162 2026-06-02 cs.RO cs.CV cs.LG 版本更新

Modeling Robotics Dataset Construction as an Artifact-Based Build Process

将机器人数据集构建建模为基于工件的构建过程

Leon Pohl, Lukas Beer, George Sebastian, Mirko Maehlisch

发表机构 * Institute for Autonomous Driving, University of the Bundeswehr Munich(自主驾驶研究所,联邦国防军 Munich 大学)

AI总结 本文提出将机器人数据集构建建模为基于工件的构建过程,并实现开源工具Bagzel,通过依赖图管理和增量构建显著降低数据集更新延迟,实验表明在迭代工作流中速度提升高达386倍。

Comments Accepted 2026 IEEE 22nd International Conference on Automation Science and Engineering (CASE 2026), 6 pages, 6 figures, 2 tables

详情
AI中文摘要

机器人系统生成大量多模态传感器数据,但将ROS bag记录转换为机器学习数据集通常由临时的顺序脚本处理,导致工程开销和迭代周期缓慢。我们将数据集构建建模为基于依赖图的工件构建过程,并在Bagzel中实现该方法,这是一个开源的Bazel扩展,用于可重现、增量式的数据集生成(包括nuScenes格式导出)。我们将Bagzel和Bagzel-xattr(服务端摘要管理)与顺序的rosbag2nuscenes基线进行比较。Bagzel在所有评估执行模式下减少了运行时间,在迭代工作流中提升最大(在20.4 GB数据集上,热构建加速高达386.26倍,增量构建加速高达7.21倍)。在5.1至20.4 GB的数据集大小范围内,Bagzel变体显示出比基线明显更好的扩展行为,尤其是在热构建和增量构建模式下。Bagzel-xattr提供了额外增益,在输入粒度研究中相比Bagzel平均运行时间减少5.9%。总体而言,将机器人数据集构建建模为基于工件的构建过程大幅降低了数据集更新延迟,同时保持了支持可重现性的确定性构建设计。Bagzel公开获取地址:https://github.com/UniBwTAS/bagzel。

英文摘要

Robotic systems generate large volumes of multimodal sensor data, but converting ROS bag recordings into machine learning datasets is often handled by ad hoc sequential scripts, creating engineering overhead and slow iteration cycles. We model dataset construction as an artifact-based build process over a dependency graph and implement this approach in Bagzel, an open-source Bazel extension for reproducible, incremental dataset generation (including nuScenes-format export). We compare Bagzel and Bagzel-xattr (server-side digest management) against a sequential rosbag2nuscenes baseline. Bagzel reduces runtime in all evaluated execution modes, with the largest gains in iterative workflows (up to 386.26x in warm builds and 7.21x in incremental builds on a 20.4 GB dataset). Across dataset sizes from 5.1 to 20.4 GB, Bagzel variants show markedly better scaling behavior than the baseline, especially in warm and incremental modes. Bagzel-xattr provides additional gains, with a mean runtime reduction of 5.9% compared to Bagzel in the input granularity study. Overall, modeling robotics dataset construction as an artifact-based build process substantially reduces dataset update latency while maintaining a deterministic build design that supports reproducibility. Bagzel is publicly available at https://github.com/UniBwTAS/bagzel.

2606.00161 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Improving IoT Intrusion Detection Through SMOTE-Based Oversampling and Extended Multi-Model Evaluation on Side-Channel Power Data

基于SMOTE过采样和扩展多模型评估的侧信道功率数据物联网入侵检测改进

Muhammad Khuram Shahzad, Haseeb Khan, Muhammad Masood Khan, Mubashra Bibi

发表机构 * School of Electrical Engineering and Computer Science (SEECS), NUST(电气工程与计算机科学学院(SEECS),努斯兰大学)

AI总结 针对物联网侧信道数据集中的严重类别不平衡问题,采用SMOTE过采样平衡数据,并评估八种机器学习模型,其中随机森林和极端随机树在F1分数上超越基线方法,同时揭示了宏观F1指标的重要性。

Comments 8 pages, 14 figures; code and results publicly available

详情
AI中文摘要

物联网网络中的入侵检测面临传统机器学习方法无法克服的挑战,其中最大的挑战之一是侧信道数据集中存在的类别不平衡问题,正常类样本与攻击类样本的比例可达75964:1。Dominguez等人通过基于功率的入侵检测概念验证解决了这一问题,但既未尝试处理不平衡问题,也未使用平衡训练集评估分类器性能。本文同时处理这两个方面。首先,对从初始数据集提取的所有九个可能数据集应用合成少数类过采样技术(SMOTE),使每个数据集的精确不平衡比达到1.1。然后,在SMOTE平衡的6小时数据集上,在相同条件下训练了八种算法:随机森林、HistGradientBoosting、LightGBM、极端随机树、XGBoost、k近邻、多层感知机和决策树。随机森林的微平均F1分数达到0.9989,宏F1为0.9794,优于基线论文中时间序列森林算法之前的最佳微F1结果0.9983。极端随机树提供了相同的性能,但速度快10倍。与基线论文评估相比,显式引入宏F1指标揭示了聚合性能指标遗漏的重要类别级信息。基于混淆矩阵、F1热图和ROC曲线计算的每类召回率表明,仅当使用SMOTE平衡时,少数攻击类(尤其是M+L联合感染类)才能被可靠检测。特征重要性分析表明,在功率窗口的60个时间步中,最近的时间步是最重要的预测信号。

英文摘要

The detection of intrusions in IoT-based networks poses challenges that cannot be overcome using traditional machine learning methods. Perhaps the biggest of them is related to the presence of a class imbalance in the side-channel dataset, where the number of samples in the normal class compared to the attacks can reach a ratio of 75,964 to 1. Such an aspect is addressed by Dominguez et al. through the proof of concept of power-based intrusion detection. Unfortunately, neither the authors attempt to cope with the problem of imbalance nor do they assess the classifier performance using a balanced training set. In the current paper, both aspects will be handled at once. First, a Synthetic Minority Oversampling Technique (SMOTE) was performed on all nine possible datasets extracted from the initial one, providing an exact imbalance ratio of 1.1 for each. Then, eight algorithms i.e. Random Forest, HistGradientBoosting, LightGBM, Extra Trees, XGBoost, k-Nearest Neighbors, Multi-Layer Perceptron, and Decision Tree were trained under identical conditions for the SMOTE balanced 6-hour dataset. Random Forest reached a micro-averaged F1 score of 0.9989 and macro F1 of 0.9794, thus outperforming the previously best micro-F1 result obtained by Time Series Forest algorithm from the base paper of 0.9983. Extra Trees provided the same performance as well, but at 10 times faster. The introduction of a macro-F1 metric explicitly in contrast to the base paper assessment reveals important class-level information missed with aggregate performance metrics. Recall rates per-class calculated with confusion matrices, F1 heatmaps, and ROC curves show that minority attack classes, especially those with combined M+L infections, are detected reliably only when using SMOTE balance. Feature importance analysis indicates the latest time steps as the most important predictor signals out of 60 steps in a power window.

2606.00157 2026-06-02 stat.ML cs.AI cs.LG math.PR 版本更新

Interpreting FCDNNs via RG on Exponential Family

通过指数族上的重正化群解释全连接深度神经网络

Fuzhou Gong, Zigeng Xia

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过建立统计物理中重正化群方法与深度神经网络训练过程的对应关系,证明了对于指数族连续输入数据,全连接DNN训练后特征层输出的特征参数等于RG方法下的不动点,从而解释了DNN的特征提取能力。

Comments 18 pages, 2 figures

详情
AI中文摘要

我们考虑通过建立统计物理中的重正化群(RG)方法与深度神经网络(DNN)训练过程之间的对应关系,来建立深度学习的可解释性理论。我们已使用一维伊辛模型作为输入数据证明了所构建的关系。本文我们将结果推广到连续输入数据的情况,这是将该对应框架应用于真实数据的必要准备。为具有代表性,我们考虑指数族中的一类数据分布。我们证明,当全连接(FC)DNN的参数在训练后达到最优值时,DNN特征层输出的特征参数等于连续场RG方法下输入数据特征参数的不动点。这一结论表明,DNN的训练过程等价于对此类数据进行RG计算,因此网络能够像RG一样从输入数据中提取主要特征。此外,该等价性进一步验证了我们建立的对应框架,为DNN在真实数据上的卓越表现提供了解释。

英文摘要

We consider establishing the interpretability theory of deep learning through constructing a corresponding relationship between the renormalization group (RG) method in statistical physics and the training process of deep neural networks (DNNs). We have proved the constructed relationship using the one-dimensional Ising model as the input data. In this paper we generalize our results to the case of continuous input data, which is a necessary preparation for applying the corresponding framework to real-world data. To be representative, we consider a class of data distribution in the exponential family. We prove that when the parameters of fully connected (FC) DNNs achieve their optimal value after training, the characteristic parameters of the feature layer output of DNNs are equal to the fixed points of the characteristic parameters of input data under RG method for continuous fields. This conclusion shows that the training process of DNNs is equivalent to RG calculation on this kind of data and therefore the network can extract main features from the input data just like RG. Also, the equivalence further validates the correspondence framework we have established, providing an explanation for the outstanding performance of DNNs on real-world data.

2606.00151 2026-06-02 cs.LG cs.AI 版本更新

Emergence of Exploration in Policy Gradient Reinforcement Learning via Retrying

通过重试在策略梯度强化学习中涌现探索行为

Soichiro Nishimori, Paavo Parmas, Sotetsu Koyamada, Tadashi Kozuno, Toshinori Kitamura, Shin Ishii, Yutaka Matsuo

发表机构 * University of Tokyo(东京大学) Aalto University(阿尔托大学)

AI总结 提出ReMax目标函数,通过最大化M个样本的期望最大回报来使探索行为自然涌现,并推导策略梯度公式及RePPO算法,在MinAtar和Craftax基准上无需显式探索奖励即可促进探索。

详情
AI中文摘要

在强化学习(RL)中,智能体从探索中获益仅仅是因为它们反复遇到相似的状态:尝试不同的动作可以提高性能或减少不确定性;没有这样的重试,贪婪策略是最优的。我们通过ReMax形式化这一直觉,该目标函数根据$M$个样本($M$为正整数)的期望最大回报来评估策略,同时考虑回报的不确定性。优化该目标函数会使随机探索作为涌现属性出现,无需显式奖励项。为了实现高效的策略优化,我们为ReMax推导了新的策略梯度公式,并引入ReMax PPO(RePPO),这是一种PPO变体,它优化ReMax的同时将离散重试次数$M$推广为连续参数$m>0$,从而实现对探索的细粒度控制。实验上,RePPO在MinAtar和Craftax基准上无需任何显式探索奖励即可促进探索。

英文摘要

In reinforcement learning (RL), agents benefit from exploration only because they repeatedly encounter similar states: trying different actions can improve performance or reduce uncertainty; without such retries, a greedy policy is optimal. We formalize this intuition with ReMax, an objective that evaluates a policy by the expected maximum return over $M$ samples, where $M$ is a positive integer, while accounting for return uncertainty. Optimizing this objective induces stochastic exploration as an emergent property, without explicit bonus terms. For efficient policy optimization, we derive a new policy-gradient formulation for ReMax and introduce ReMax PPO (RePPO), a PPO variant that optimizes ReMax while generalizing the discrete retry count $M$ to a continuous parameter $m > 0$, enabling fine-grained control of exploration. Empirically, RePPO promotes exploration, without any explicit exploration bonuses, on the MinAtar and Craftax benchmarks.

2606.00147 2026-06-02 cs.LG cs.AI 版本更新

RAFT: Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting

RAFT:用于缓解遗忘的领域微调的数据精炼与自适应蒸馏

Yuduo Li, Xiaofeng Shi, Qian Kou, Longbin Yu, Hua Zhou

发表机构 * Beijing Academy of Artificial Intelligence (BAAI)(北京人工智能研究院) Beijing Jiaotong University (BJTU)(北京交通大学)

AI总结 提出RAFT框架,通过数据精炼(自条件重写、语义过滤、答案融合)和答案条件在线蒸馏(top-K温度蒸馏、EMA自适应损失平衡)来解决领域微调中的监督兼容性差距和轨迹保持差距,在提升领域性能的同时缓解通用能力退化。

Comments preprint

详情
AI中文摘要

领域特定的监督微调(SFT)通常以提高领域内性能为代价,导致模型通用能力下降。我们将这种退化归因于领域SFT中的两个实际差距:监督兼容性差距,即领域目标在风格和推理格式上与原始模型的自然响应不同;以及轨迹保持差距,即教师强制SFT优化固定目标令牌,而不约束模型在其自身生成前缀上的行为。这个过程未能保留模型的原始行为。我们提出RAFT(用于缓解遗忘的领域微调的数据精炼与自适应蒸馏),一个两阶段框架来解决这两个因素。首先,RAFT通过自条件重写、语义过滤和答案融合构建模型兼容的监督。其次,RAFT执行答案条件在线蒸馏,其中原始指令调优模型在学生生成的轨迹上提供软目标,同时以融合答案作为有用上下文进行条件化。我们进一步引入top-K温度蒸馏和基于EMA的自适应损失平衡来稳定领域-通用权衡。在三个指令调优骨干和五个领域上,RAFT相比标准SFT将平均领域准确率提高了23.2%,同时恢复了MS-Bench和IFEval上SFT引起的部分退化,相对改进分别为18.2%和10.2%。这些结果表明,将数据精炼与轨迹级保持相结合为缓解遗忘的领域微调提供了有效方案。

英文摘要

Domain-specific supervised fine-tuning (SFT) often improves in-domain performance at the cost of degrading a model's general capabilities. We view this degradation through two practical gaps in domain SFT: a supervision-compatibility gap, where domain targets differ in style and reasoning format from the original model's natural responses, and a trajectory-preservation gap, where teacher-forced SFT optimizes fixed target tokens without constraining the model's behavior on its own generated prefixes. This process fails to preserve the model's original behavior. We propose RAFT (Data Refinement and Adaptive Distillation for Domain Fine-Tuning with Alleviated Forgetting), a two-stage framework that addresses both factors. First, RAFT constructs model-compatible supervision through self-conditioned rewriting, semantic filtering, and answer fusion. Second, RAFT performs Answer-Conditioned On-Policy Distillation, where the original instruction-tuned model provides soft targets on student-generated trajectories while being conditioned on the fused answer as helpful context. We further introduce top-K temperature distillation and EMA-based adaptive loss balancing to stabilize the domain-general trade-off. Across three instruction-tuned backbones and five domains, RAFT improves average domain accuracy by 23.2% over standard SFT, while recovering part of the SFT-induced degradation on MS-Bench and IFEval, with relative improvements of 18.2% and 10.2%, respectively. These results show that coupling data refinement with trajectory-level preservation provides an effective recipe for domain fine-tuning with alleviated forgetting.

2606.00144 2026-06-02 cs.LG cs.AI 版本更新

BudgetDraft: Acceptance-Aware Multi-View Training for Sparse-KV Speculative Decoding

BudgetDraft:面向稀疏KV投机解码的接受感知多视角训练

Liang He, Jingbo Wen, Qishi Zhan, Yixiong Chen, Kangning Cui, Qizhen Lan, Xilu Wang

发表机构 * Shanghai Institute of Optics and Fine Mechanics(上海光学精密机械研究所) The University of Sydney(悉尼大学) Marquette University(马基特大学) Johns Hopkins University(约翰·霍普金斯大学) Wake Forest University(威克森林大学) University of Texas Health Science Center at Houston(德克萨斯大学健康科学中心休斯顿分部) University of Surrey(萨里大学)

AI总结 针对中长上下文推理中稀疏/全缓存不匹配导致接受率下降的问题,提出BudgetDraft多视角稀疏训练方法,通过接受感知损失和多视角损失训练单一鲁棒草稿模型,在固定KV预算下恢复接受率,实现最高6.55倍加速。

详情
AI中文摘要

投机解码通过草稿模型提出多个令牌,验证器并行验证,从而加速自回归解码。在资源受限的部署中,草稿模型使用稀疏KV缓存以在固定KV预算下限制峰值GPU内存和端到端延迟,而验证器保留全KV缓存。实际应用中常见中长上下文推理(4K--16K上下文长度)。然而,随着上下文长度增长,朴素稀疏/全投机解码遭受稀疏/全不匹配问题,导致接受率快速下降。我们提出BudgetDraft,一种用于中长推理中稀疏草稿的多视角稀疏训练方法。草稿模型在训练期间暴露于多个采样的KV预算,并学习将每个稀疏视角与一个共享的全缓存教师目标对齐。BudgetDraft将全缓存分支上的接受感知损失与稀疏缓存分支上的多视角损失相结合,产生一个单一的预算鲁棒草稿模型,无需额外的推理时组件即可恢复跨稀疏级别的接受率。在PG-19、LongBench和LWM上的实验结果表明,BudgetDraft在4K、8K和16K上下文长度下,与自回归相比分别实现了最高6.55倍、4.46倍和2.10倍的端到端加速,同时保持推理流水线内存友好。

英文摘要

Speculative decoding speeds up autoregressive decoding by using a drafter to propose multiple tokens that a verifier validates in parallel. In resource-constrained deployments, the drafter uses a sparse KV cache to limit peak GPU memory and end-to-end latency under a fixed KV budget, while the verifier keeps a full KV cache. Mid-to-long context inference (4K--16K context length) is common in real applications. However, naive sparse/full speculative decoding suffers from the sparse/full mismatch as context length grows, causing the acceptance rate to drop quickly. We propose BudgetDraft, a multi-view sparse training method for sparse drafting in mid-to-long inference. The drafter is exposed to multiple sampled KV budgets during training and learns to align each sparse view with one shared full-cache teacher target. BudgetDraft combines an acceptance-aware loss on a full-cache branch with a multi-view loss on a sparse-cache branch, producing a single budget-robust drafter that recovers acceptance across sparsity levels without extra inference-time components. Experimental results on PG-19, LongBench, and LWM show that BudgetDraft achieves up to 6.55x, 4.46x, and 2.10x end-to-end speedup vs AR at 4K, 8K, and 16K context lengths, while keeping the inference pipeline memory-friendly.

2606.00141 2026-06-02 cs.LG cs.AI 版本更新

Adaptive data selection improves wearable prediction under low baseline performance

自适应数据选择改善低基线性能下的可穿戴预测

Ali Kargarandehkordi

AI总结 本研究通过评估多种模态下自适应时间窗口选择策略,发现其能显著提升低基线性能参与者的AUROC(最高提升0.7),而高基线性能者收益有限或为负,且增益与基线性能呈强负相关。

详情
AI中文摘要

自适应传感策略通过选择性采样数据,在有限数据预算下提高预测性能,在可穿戴健康系统中应用日益广泛,但其在不同个体间的收益尚不明确。本文基于纵向可穿戴数据集,评估了在固定测量预算下,针对心率、活动和生态瞬时评估(EMA)等多种传感模态,自适应选择时间窗口进行模型训练的效果。我们使用接收者操作特征曲线下面积(AUROC)和F1分数量化了相对于随机采样的性能提升。自适应策略为基线性能较低的参与者带来了显著的AUROC提升(增益高达0.7),而对基线性能较强的参与者增益有限甚至为负。跨模态来看,自适应增益与基线性能呈强负相关(Pearson r = -0.67;Spearman ρ = -0.62)。在参与者层面,大多数个体在AUROC上受益(跨模态为60-80%),尽管F1的改进较小且一致性较差。这些发现表明,自适应传感并非普遍有益,而是在性能不佳的情况下提供最大价值。我们的结果支持基于基线性能定制自适应传感的选择性部署策略,以提高可穿戴健康监测的效率。

英文摘要

Adaptive sensing strategies that selectively sample data are increasingly used in wearable health systems to improve prediction performance under limited data budgets, yet their benefits across individuals remain poorly understood. Here, we evaluate adaptive selection of time windows for model training under fixed measurement budgets across multiple sensing modalities, including heart rate, activity, and ecological momentary assessment (EMA), in a longitudinal wearable dataset. We quantify performance gains relative to random sampling using both area under the receiver operating characteristic curve (AUROC) and F1 score. Adaptive strategies yield substantial improvements in AUROC for participants with low baseline performance (with gains up to 0.7), while offering limited or negative gains for participants with strong baselines. Across modalities, adaptive gain is strongly inversely correlated with baseline performance (Pearson r = -0.67; Spearman p = -0.62). At the participant level, most individuals benefit in AUROC (60-80% across modalities), although improvements in F1 are smaller and less consistent. These findings show that adaptive sensing is not uniformly beneficial, but instead provides the greatest value in underperforming settings. Our results support selective deployment strategies that tailor adaptive sensing based on baseline performance to improve efficiency in wearable health monitoring.

2606.00136 2026-06-02 cs.LG cs.AI cs.CL cs.CR cs.SI 版本更新

Generative AI and Digital Ecosystem Resilience: A Proactive Lifecycle-Based Survey

生成式AI与数字生态系统韧性:基于生命周期的主动式综述

Jonghyun Chung, Rishabh Chaddha, Sanket Badhe, Debanshu Das, Nathan Huang, Amanpreet Kaur

发表机构 * Google LLC(谷歌有限公司)

AI总结 本文采用基于生命周期的C5交互模型,综合机器学习与社会科学方法,系统综述了针对生成式AI驱动的对抗性合成内容的主动检测技术,包括协调不真实行为分析、流行病学建模和霍克斯过程等,旨在构建更具韧性的信息生态系统。

Comments 14 pages, 3 figures, 3 tables. Accepted for publication in IEEE Access (May 2026)

详情
Journal ref
IEEE Access (2026) IEEE Access (2026)
AI中文摘要

生成式AI加速了对抗性合成内容的扩散,使得传统的被动检测方法失效。本综述综合了新兴研究,展示了向主动检测新兴不真实叙事的范式转变。我们采用统一的、基于生命周期的分类法,将对抗性活动的社会技术生命周期模型与新兴不真实叙事检测的高级计算方法相结合。通过围绕C5交互模型(背景、原因、内容、放大循环、后果)构建分析,我们整合了机器学习和社会科学的不同研究流。为了区分合成放大模式与真实基线流量,本文综述了建模新叙事创建、播种和传播的最先进技术,包括协调不真实行为分析、流行病学建模和霍克斯过程。本综述还系统回顾了C5交互模型不同阶段对抗性威胁的主动检测方法,特别是高维嵌入空间中的异常检测、多层图上的无监督协调检测以及代理型AI系统。最后,本综述探讨了生成式AI带来的挑战,包括追踪快速变化威胁和多级分布漂移的困难,并概述了未来研究议程,重点在于检测异常聚类和构建预期性及韧性系统。本综述为更韧性的信息生态系统提供了基于生命周期的主动检测新兴合成威胁方法的全面回顾。

英文摘要

The proliferation of adversarial synthetic content, accelerated by Generative AI (GenAI) is rendering traditional reactive detection methods ineffective. This survey synthesizes emerging research to demonstrate a paradigm shift toward the proactive detection of emerging inauthentic narratives. In this survey, we adopt a unified, lifecycle-based taxonomy to combine socio-technical lifecycle models of adversarial campaigns with advanced computational methodologies for emerging inauthentic narrative detection. By structuring the analysis around the C5 Interaction Model (Context, Causes, Content, Cycle of Amplification, Consequences), we integrate different research streams from machine learning and social science. To differentiate spread patterns of synthetic amplification from authentic baseline traffic, this paper surveys state-of-the-art techniques for modeling the creation, seeding, and propagation of fresh narratives, including the analysis of Coordinated Inauthentic Behavior (CIB), epidemiological modeling, and Hawkes process. This survey also provides a systematic review of proactive detection methods for adversarial threats at different stages in the C5 interaction model, specifically, anomaly detection in high-dimensional embedding spaces, unsupervised coordination detection on multi-layer graphs, and agentic AI systems. Finally, this survey addresses challenges posed by GenAI, including the difficulty of tracking rapidly changing threats and multi-level distributional drift, and it outlines a future research agenda focused on detecting anomalous clusters and building anticipatory and resilient systems. This survey provides a comprehensive, lifecycle-based review of methods for the proactive detection of emerging synthetic threats for more resilient information ecosystems.

2606.00135 2026-06-02 cs.LG cs.AI 版本更新

On Effectiveness and Efficiency of Agentic Tool-calling and RL Training

论智能体工具调用与强化学习训练的有效性与效率

Tong Liu, Cheng Qian, Matej Cief, Yuan He, Daniele Dan, Nikolaos Aletras, Gabriella Kazai

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 本文系统分析工具调用评估中的实现选择对结果敏感性的影响,并针对强化学习训练中的计算浪费提出两种加速技术。

Comments ICML 2026

详情
AI中文摘要

工具调用是现代大型语言模型(LLM)智能体的核心组件,使其具备超越参数化知识的技能。本文从两个互补维度研究工具调用:有效性(即如何衡量该能力)和效率(即如何学习该能力)。在有效性方面,我们系统分析了工具调用评估流程,并表明结果可能对看似微小、通常未文档化的实现选择高度敏感,包括随机种子、系统提示、多轮模板构建以及先前交互/推理历史的传递方式。这些选择可能导致报告性能的显著差异,尤其是在多轮设置中,若缺乏严格标准化,排行榜排名将不可靠。在效率方面,我们考察了用于工具调用的标准强化学习(RL),并识别出两个计算浪费来源:(i)在 rollout 过程中,许多提示不产生学习信号;(ii)在策略更新过程中,优化产生高计算成本。基于这些发现,我们引入了两种加速基于 RL 的工具调用训练的技术,在不降低性能的情况下实现了显著的挂钟时间加速。

英文摘要

Tool-calling is a central component of modern large language model (LLM) agents, equipping them with skills beyond their parametric knowledge. This paper studies tool-calling along two complementary axes: effectiveness, i.e., how this capability is measured, and efficiency, i.e., how it is learned. On effectiveness, we systematically analyze tool-calling evaluation pipelines and show that results can be highly sensitive to seemingly minor, often undocumented implementation choices including the random seed, system prompt, multi-turn template construction, and how prior interaction/reasoning history is carried forward. These choices can lead to substantial differences in reported performance, especially in multi-turn settings where without rigorous standardization, leaderboard rankings are unreliable. On efficiency, we examine standard reinforcement learning (RL) for tool-calling and identify two sources of computational waste: (i) during rollouts, many prompts produce no learning signal, and (ii) during policy updates, optimization incurs high computational cost. Guided by these findings, we introduce two techniques that accelerate RL-based tool-calling training, achieving substantial wall-clock speedup without degrading performance.

2606.00134 2026-06-02 cs.CR cs.AI cs.LG 版本更新

XAI-SOH-FL: Enhancing SOH-FL with Adaptive Aggregation and Explainable AI for Intrusion Detection in Heterogeneous IoT

XAI-SOH-FL: 通过自适应聚合和可解释人工智能增强异构物联网入侵检测中的SOH-FL

Ambreen Aslam, Maaz Hassan, Bibi Zahra, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering and Computer Science (SEECS), National University of Sciences and Technology (NUST)(电气工程与计算机科学学院(SEECS),国家 sciences and Technology(NUST))

AI总结 针对异构物联网中数据异构、标签稀缺和模型不可解释性问题,提出XAI-SOH-FL框架,通过自适应聚合(动态γ选择与贝叶斯优化)和SHAP可解释性,在CICIDS2017数据集上达到94.12%准确率和0.92 F1分数,优于基线SOH-FL。

Comments 8 pages, 6 figures; code available at https://github.com/aaslam-msit/SOH-FL-Enhancement

详情
AI中文摘要

物联网环境中的入侵检测系统面临数据异构、缺乏标记数据和模型可解释性有限等重大挑战。联邦学习提供了一种隐私保护解决方案;然而,现有方法如SOH-FL存在两个关键限制:依赖手动调整的聚合参数γ以及模型预测缺乏可解释性。在本文中,我们提出XAI-SOH-FL,一个增强框架,将自适应聚合和可解释人工智能集成到SOH-FL范式中。首先,我们引入基于相似性阈值的动态γ选择机制,使聚合过程能够适应不断变化的数据分布。其次,采用贝叶斯优化自动确定最优γ值,消除了手动调整的需要。第三,引入SHAP(SHapley Additive exPlanations)为入侵检测决策提供特征级可解释性。在CICIDS2017数据集上的实验评估表明,所提方法达到了94.12%的准确率和0.92的F1分数,优于基线SOH-FL模型,同时收敛所需的通信轮次更少。此外,基于SHAP的分析揭示,流级特征如流持续时间和数据包长度显著影响模型预测。这些结果表明,XAI-SOH-FL在异构物联网环境中提供了准确性、适应性和可解释性之间的有效平衡。

英文摘要

Intrusion Detection Systems (IDS) in Internet of Things (IoT) environments face significant challenges due to data heterogeneity, lack of labeled data, and limited model interpretability. Federated Learning (FL) offers a privacy-preserving solution; however, existing approaches such as SOH-FL suffer from two key limitations: reliance on a manually tuned aggregation parameter γ and lack of explainability in model predictions. In this paper, we propose XAI-SOH-FL, an enhanced framework that integrates adaptive aggregation and explainable artificial intelligence into the SOH-FL paradigm. First, we introduce a dynamic γ selection mechanism based on similarity thresholding, enabling the aggregation process to adapt to evolving data distributions. Second, Bayesian Optimization is employed to automatically determine optimal γ values, eliminating the need for manual tuning. Third, SHAP (SHapley Additive exPlanations) is incorporated to provide feature-level interpretability for intrusion detection decisions. Experimental evaluation on the CICIDS2017 dataset demonstrates that the proposed approach achieves an accuracy of 94.12% and an F1-score of 0.92, outperforming the baseline SOH-FL model while converging in fewer communication rounds. Furthermore, SHAP-based analysis reveals that flow-level features such as Flow Duration and Packet Length significantly influence model predictions. These results indicate that XAI-SOH-FL provides an effective balance between accuracy, adaptability, and interpretability in heterogeneous IoT environments.

2606.00133 2026-06-02 cs.LG cs.ET 版本更新

World Models: A Comprehensive Survey of Architectures, Methodologies, Reasoning Paradigms, and Applications

世界模型:架构、方法论、推理范式与应用的全面综述

Arif Hassan Zidan, Yi Pan, Hanqi Jiang, Ruiyu Yan, Wei Ruan, Zihao Wu, Lifeng Chen, Weihang You, Xinliang Li, Bowen Chen, Huawen Hu, Peilong Wang, Sizhuang Liu, Jing Zhang, Siyuan Li, Zhengliang Liu, Yu Bao, Lin Zhao, Lichao Sun, Dajiang Zhu, Xiang Li, Jinglei Lv, Quanzheng Li, Wei Liu, Tianming Liu, Wei Zhang

发表机构 * School of Computer and Cyber Sciences, Augusta University(奥古斯塔大学计算机与网络科学学院) School of Computing, University of Georgia(佐治亚大学计算机学院) Department of Biomedical Engineering, New Jersey Institute of Technology(新泽西理工学院生物医学工程系) Department of Radiology, Massachusetts General Hospital, Harvard Medical School(麻省总医院放射科,哈佛医学院) Department of Computer Science and Engineering, University of Texas at Arlington(德克萨斯大学阿灵顿分校计算机科学与工程系) Department of Graduate Psychology, James Madison University(詹姆斯麦迪逊大学研究生心理学系) Computer Science and Engineering, Lehigh University(莱斯大学计算机科学与工程系) School of Biomedical Engineering, The University of Sydney(悉尼大学生物医学工程学院) Tandon School of Engineering, New York University(纽约大学泰坦工程学院) Department of Radiation Oncology, City of Hope National Medical Center(城市希望国家医学中心放射肿瘤科) Department of Mayo Clinic Comprehensive Cancer Center, Mayo Clinics(梅奥诊所综合癌症中心,梅奥诊所) Savannah River Ecology Laboratory (SREL), University of Georgia(萨凡纳河生态实验室(SREL),佐治亚大学)

AI总结 本文提出一个多轴分类法,从架构、方法论、推理策略和应用领域四个维度系统综述世界模型,涵盖从早期认知科学基础到PlaNet、Dreamer系列、MuZero、Sora等里程碑系统,并指出未来方向。

详情
AI中文摘要

世界模型作为学习环境结构和动态的内部模拟器,已成为追求通用人工智能的核心范式,使智能体能够在学习到的表征中进行预测、规划和推理。尽管在强化学习、机器人、自动驾驶和视频生成等领域取得了快速进展,但该领域缺乏一个统一的框架来整合其多样化的架构选择、训练方法、推理机制和应用场景。本综述通过一个多轴分类法填补了这一空白,该分类法沿四个维度组织:(i) 架构,涵盖表征格式、动态公式、输入模态、学习范式和下游应用;(ii) 方法论家族,包括状态空间和循环方法、基于Transformer的模型、基于扩散的生成器、物理信息网络和语言增强的多模态系统;(iii) 推理策略,涵盖基于想象的规划、潜在策略学习、反事实推理和不确定性下的规划;(iv) 应用领域,涵盖机器人、自动驾驶、视频预测、多模态智能体、强化学习、科学建模、医学影像、教育测量以及商业和金融。从早期认知科学基础到PlaNet、Dreamer系列、MuZero、Sora、Cosmos和Genie等里程碑系统,我们考察了这些维度如何相互作用,并强调了思维链推理与世界模型想象的最新融合。我们回顾了评估协议和基准,指出了持续存在的挑战,如复合预测误差、模拟到真实迁移和碎片化评估,并概述了未来方向,包括统一的多模态世界模型、基础规模的交互式模拟器以及在安全关键领域的可靠部署。

英文摘要

World models, internal simulators that learn the structure and dynamics of an environment, have emerged as a central paradigm in the pursuit of artificial general intelligence, enabling agents to predict, plan, and reason within learned representations. Despite rapid progress across reinforcement learning, robotics, autonomous driving, and video generation, the field lacks a unified framework integrating its diverse architectural choices, training methods, reasoning mechanisms, and application settings. This survey addresses that gap with a multi-axis taxonomy organized along four dimensions: (i) architecture, encompassing representation format, dynamics formulation, input modality, learning paradigm, and downstream application; (ii) methodological family, including state-space and recurrent approaches, transformer-based models, diffusion-based generators, physics-informed networks, and language-augmented multimodal systems; (iii) reasoning strategy, covering imagination-based planning, latent policy learning, counterfactual reasoning, and planning under uncertainty; and (iv) application domain, spanning robotics, autonomous driving, video prediction, multimodal agents, reinforcement learning, scientific modeling, medical imaging, educational measurement, and business and finance. Tracing the field from early cognitive-science foundations to milestone systems such as PlaNet, the Dreamer family, MuZero, Sora, Cosmos, and Genie, we examine how these dimensions interact and highlight the recent convergence of chain-of-thought reasoning with world-model imagination. We review evaluation protocols and benchmarks, identify persistent challenges such as compounding prediction errors, sim-to-real transfer, and fragmented evaluation, and outline future directions toward unified multimodal world models, foundation-scale interactive simulators, and safe deployment in safety-critical domains.

2606.00132 2026-06-02 cs.LG cs.AI 版本更新

Foundation-Preserving Adaptation via Generalized Rayleigh-Quotient Optimization

基于广义瑞利商优化的基础模型保留适配

Dongjun Kim, Adrian de Wynter, Huancheng Chen, Heasung Kim, Haris Vikalo

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) Microsoft(微软) Microsoft AI(微软人工智能) Meta

AI总结 提出FoLoRA框架,通过广义瑞利商优化更新方向,在微调中平衡下游任务性能与预训练能力保留。

详情
AI中文摘要

虽然微调有效地将基础模型适配到专门的下游任务,但可能会降低预训练期间获得的非目标能力。现有的遗忘感知方法通常通过专门的初始化或固定约束寻求更安全的更新,但未在训练过程中调节适配-保留权衡。我们提出基础保留LoRA(FoLoRA),一个遗忘感知优化框架。在一阶保留条件的指导下,FoLoRA定义了预训练代理激活上的遗忘惩罚和下游任务激活上的任务效用。然后,它通过广义瑞利商按单位遗忘惩罚的任务效用对更新方向进行评分。由此产生的谱坐标系实现了方向门控Adam更新,在训练过程中衰减低效用-惩罚方向。为了估计遗忘惩罚,FoLoRA通过从预训练模型中采样构建预训练代理校准数据,而不是依赖单个代理数据集。在数学、代码和指令遵循适配上的实验表明,FoLoRA在基线上实现了最强的保留-适配平衡,提高了目标任务性能,同时最好地聚合保留了非目标能力。

英文摘要

While finetuning effectively adapts foundation models to specialized downstream tasks, it can degrade nontarget capabilities acquired during pretraining. Existing forgetting aware methods typically seek safer updates through specialized initialization or fixed constraints, but do not regulate the adaptation preservation trade-off during training. We propose Foundation Preserving LoRA (FoLoRA), a forgetting aware optimization framework. Guided by a first order preservation condition, FoLoRA defines a forgetting penalty over pretraining-proxy activations and a task utility over downstream task activations. It then scores update directions by task utility per unit forgetting penalty via a generalized Rayleigh quotient. The resulting spectral coordinate system enables direction wise gated Adam updates, attenuating low utility to penalty directions during training. To estimate the forgetting penalty, FoLoRA constructs pretraining proxy calibration data by sampling from the pretrained model rather than relying on a single proxy dataset. Experiments on math, code, and instruction following adaptation show that FoLoRA achieves the strongest preservation adaptation balance over baselines, improving target task performance with best aggregate preservation of non target capabilities.

2606.00131 2026-06-02 cs.SE cs.AI cs.LG cs.PL 版本更新

AI-PROPELLER: Warehouse-Scale Interprocedural Code Layout Optimization with AlphaEvolve

AI-PROPELLER:基于AlphaEvolve的仓库规模过程间代码布局优化

Chaitanya Mamatha Ananda, Rajiv Gupta, Mircea Trofin, Aiden Grossman, Sriraman Tallam, Xinliang David Li, Amir Yazdanbakhsh

发表机构 * University of California, Riverside(加州大学河滨分校) Google(谷歌) DeepMind(深度思维)

AI总结 提出AI-PROPELLER系统,利用Magellan智能工作流将Propeller的编译器启发式方法演化为细粒度过程间优化器,并通过实际硬件执行评估布局变体,首次在工业仓库规模应用中实现细粒度过程间代码布局优化,性能提升0.23%至1.6%。

详情
AI中文摘要

后链接优化器(如Propeller和BOLT)已证明,精确的、基于性能剖析的代码布局可以从高度优化的二进制文件中提取显著的性能提升。然而,这些系统目前局限于过程内技术,未能充分利用过程间布局的全局潜力。由于组合爆炸的搜索空间和复杂的调用返回语义难以建模,过程间代码布局历来困难。因此,细粒度过程间布局的性能潜力在实践中尚未得到证实。AI-PROPELLER使用Magellan(一种智能工作流),将Propeller中的编译器启发式方法演化为细粒度过程间优化器,并微调所得策略的超参数。为确保高保真度,我们摒弃了近似的静态成本模型,智能工作流生成多个布局变体,并在实际硬件上执行以测量真实性能计数器,为进化循环提供精确的奖励信号。AI-PROPELLER已在包括大型仓库规模应用在内的多个基准测试上进行了评估,实验表明,在使用最先进的FDO和PLO优化后,性能提升0.23%至1.6%,这对于实际二进制文件而言意义重大。这是首次在工业环境中对大型仓库规模应用进行细粒度过程间代码布局优化。

英文摘要

Post-link optimizers (PLOs) such as Propeller and BOLT have demonstrated that precise, profile-guided code layout can extract significant performance gains from heavily optimized binaries. However, these systems are currently restricted to intraprocedural techniques, leaving the global potential of interprocedural layout largely untapped. Interprocedural code layout is historically difficult due to a combinatorially intractable search space and complex call-return semantics that are challenging to model. Consequently, the performance potential of fine-grained interprocedural layout remains unproven in practice. AI-PROPELLER uses Magellan, an agentic workflow that evolves the compiler heuristic in Propeller into a fine-grained interprocedural optimizer and fine-tunes the resulting policy hyperparameters. To ensure high-fidelity, we move away from approximate static cost models and the agentic workflow generates multiple layout variants that are executed on actual hardware to measure real performance counters, providing a precise reward signal for the evolutionary loop. AI-PROPELLER has been evaluated on several benchmarks including large warehouse-scale applications and experiments show performance improvements of 0.23% to 1.6% optimized with state-of-the-art FDO and PLO which is significant for real-world binaries. This is the first time ever that large warehouse-scale applications in industrial settings have been optimized with fine-grained interprocedural code layout.

2606.00130 2026-06-02 cs.LG cs.AI 版本更新

Automatically Differentiable Nonlinear Tensor Networks (ADNTNs) for Exponential Compression of Deep Neural Networks

自动可微非线性张量网络(ADNTNs)用于深度神经网络的指数级压缩

Andrzej Cichocki, Michal Wietczak

发表机构 * Institute of Computing Intelligence, Polish Academy of Sciences(波兰科学院计算智能研究所)

AI总结 提出自动可微非线性张量网络(ADNTNs)作为结构化权重生成器,通过反向模式自动微分端到端训练紧凑核心张量,实现深度神经网络的高效压缩,在AlexNet和VGG-16上达到每层2000倍至77000倍压缩比,且精度与密集基线相当或更优。

Comments 6 figure, 28 pages, to be submitted to Journal and confrence

详情
AI中文摘要

我们研究了自动可微非线性张量网络(ADNTNs),这是一类结构化权重生成器,其紧凑核心张量通过反向模式自动微分(AD)进行端到端训练。该方法可视为低秩适应和张量分解的自然扩展:ADNTN不是使用一个低秩矩阵更新,而是通过小核心、非线性激活和可选的横向混合张量的层次结构构建大权重张量。本文聚焦于三种架构:树张量网络(TTNs)、带边界解缠器的增强型TTN(aTTNs)以及多尺度纠缠重整化拟设(MERA)。该公式支持非线性激活、任务感知目标、批处理以及硬件感知的执行调度。同时,本文明确区分了“微分”收缩程序和使收缩自由:AD并未消除大中间体、不良收缩顺序或一般带环张量网络精确收缩的成本。在AlexNet和VGG-16层上的大量模拟显示,在所研究设置下每层压缩比约为2000倍至77000倍,精度通常与密集基线相当,且在几个VGG-16案例中有所提升。这些结果是令人鼓舞的而非最终结论:它们表明,只要优化、收缩调度和部署内核协同设计,ADNTNs是一条有前景、数学结构清晰且硬件感知的通往更小神经网络的路径。

英文摘要

We study Automatically Differentiable Nonlinear Tensor Networks (ADNTNs), a family of structured weight generators whose compact core tensors are trained end-to-end by reverse-mode automatic differentiation (AD). The approach can be viewed as a natural extension of low-rank adaptation and tensor factorisation: instead of using one low-rank matrix update, an ADNTN builds a large weight tensor through a hierarchy of small cores, nonlinear activations, and optional lateral mixing tensors. The paper focuses on three architectures: Tree Tensor Networks (TTNs), augmented TTNs (aTTNs) with boundary disentanglers, and Multi-scale Entanglement Renormalisation Ansatze (MERA). The formulation supports nonlinear activations, task-aware objectives, batching, and hardware-aware execution schedules. At the same time, the paper keeps a clear distinction between \emph{differentiating} a contraction program and making contraction free: AD does not remove the cost of large intermediates, poor contraction orders, or exact contraction of general loopy tensor networks. Extensive simulations on AlexNet and VGG-16 layers show per-layer compression ratios from roughly $2000\times$ to $77000\times$ in the studied settings, with accuracy often matching the dense baseline and, in several VGG-16 cases, improving it. These results are encouraging rather than final: they suggest that ADNTNs are a promising, mathematically structured, and hardware-aware route toward much smaller neural networks, provided that optimisation, contraction schedules, and deployment kernels are designed together.

2606.00129 2026-06-02 cs.LG cs.AI 版本更新

A Shared Valence Axis Across Modern LLMs and Human EEG: The Saturation Regularity

现代LLM与人脑EEG共享的效价轴:饱和规律

Yousef A. Radwan, Xuhui Liu, Kilichbek Haydarov, Yuqian Fu, Mohamed Elhoseiny

发表机构 * King Abdullah University of Science and Technology(卡布斯大学)

AI总结 本研究通过构建从大型语言模型(LLM)中提取的一维效价方向(V轴),发现其与人类EEG神经活动对齐,但进一步对齐策略无法提升解码性能,并形式化为“饱和规律”,指出改进应来自监督无法触及的残差子空间。

详情
AI中文摘要

大型语言模型(LLM)已成为强大的表示学习器,其内部特征与人类认知日益对齐。我们研究现代LLM是否可以作为理解人脑神经表示的透镜,重点关注EEG中的情感效价。 我们首先仅使用九个情感唤起句子从现代LLM中构建了一维效价方向(V轴),并通过零样本迁移到情感基准测试和跨十四个LLM的模型一致性进行了验证。然后,我们展示了这个从LLM导出的方向映射到人类神经活动。在一个包含123名受试者观看情感视频的公共EEG队列中,EEG特征上的单个线性投影追踪了每个刺激的V轴位置。此外,36个未暴露于V轴的EEG情感分类器在其内部表示中自发发现了相同的方向,表明相同的效价结构在语言模型和人类电生理学中同时出现。 然而,这种趋同并未提供有效的训练信号。我们测试了二十五种对齐策略,包括知识蒸馏、表示相似性、对比和拓扑损失;没有一种能改善解码,十六种显著降低了准确性。我们将这一结果形式化为饱和规律:一旦任务标签单独驱动脑解码网络朝向目标方向,额外的监督主要扭曲已经饱和的盆地,而承载类内残差的子空间几乎得不到有用的梯度。 这一规律也指出了改进应来自何处:监督无法触及的残差子空间。受此启发,我们集成残差多样性而非监督盆地,在FACED上将平衡准确率提高了10.5%,并在SEED-V上复制了相同效果。

英文摘要

Large language models (LLMs) have emerged as powerful representation learners whose internal features increasingly align with human cognition. We study whether modern LLMs can serve as a lens for understanding neural representations in the human brain, focusing on emotional valence in EEG. We first build a one-dimensional valence direction, the V-axis, from modern LLMs using only nine emotion-evocative sentences. We validate it through zero-shot transfer to sentiment benchmarks and cross-model consistency across fourteen LLMs. We then show that this LLM-derived direction maps onto human neural activity. On a public EEG cohort of 123 subjects watching affective videos, a single linear projection on EEG features tracks the V-axis position of each stimulus. Moreover, 36 EEG emotion classifiers trained without exposure to the V-axis spontaneously rediscover the same direction in their internal representations, suggesting that the same valence structure emerges in both language models and human electrophysiology. Yet this convergence does not provide an effective training signal. We test twenty-five alignment strategies, including knowledge distillation, representational similarity, contrastive, and topographic losses; none improve decoding, and sixteen significantly reduce accuracy. We formalize this result as the saturation regularity: once task labels alone drive a brain-decoding network onto the target direction, additional supervision mainly distorts an already-saturated basin, while the load-bearing within-class residual receives little useful gradient. This regularity also indicates where improvement should come from: the residual subspace unreachable by supervision. Motivated by this insight, we ensemble across residual diversity rather than supervising the basin, improving balanced accuracy by 10.5% over the prior best on FACED, with the same effect replicated on SEED-V.

2606.00125 2026-06-02 cs.IR cs.AI cs.LG cs.MM 版本更新

Multimodal Music Recommendation System using LLMs

使用LLMs的多模态音乐推荐系统

Srikar Prabhas Kandagatla, Sreehitha R. Narayana, Chandana Magapu, Swetha Mohan, Shamanth Kuthpadi, Hongjie Chen, Ryan A. Rossi, Franck Dernoncourt, Nesreen Ahmed

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校) Dolby Laboratories(Dolby实验室) Adobe Research(Adobe研究) Cisco Research(Cisco研究)

AI总结 提出一个多模态框架,通过融合音频、歌词、LLM生成的语义元数据和收听完成率,在基于会话的音乐推荐中显著提升Recall和NDCG。

详情
AI中文摘要

音乐推荐系统通常将歌曲视为不透明标记,依赖协同交互历史,忽略了语义或声学内容。先前工作探索了LLM增强、多模态和文本增强的序列推荐方法,但有些方法部分结合了语义、声学或参与信号,没有在一个统一的基于LLM的序列推理框架中联合建模所有三个信号,该框架将推荐基于实际歌曲内容。在这项工作中,我们提出了一个用于基于会话的音乐推荐的多模态框架,通过三种互补信号丰富了LastFM-1K数据集:(1) 使用预训练音乐和文本表示模型提取的音频和歌词嵌入,(2) 使用MGPHot注释方案生成的LLM语义元数据,以及(3) 收听完成率。我们采用E4SRec框架,通过扩展多模态特征和不同的项目ID编码器骨干(包括SASRec、BERT4Rec和GRU4Rec)来增强它。我们进一步扩展了LLM骨干选项,包括LLaMa-2-13B、Qwen2.5-7B-Instruct和LLaMa-3-70B,在零样本和微调设置下。我们的实验表明,集成基于内容的特征比仅使用ID的基线在Recall上提升高达95%,在NDCG上提升高达79%。此外,我们的实验表明,朴素的多模态融合并不总是产生加性改进,突显了跨模态整合的挑战。我们发布了一个用于音乐推荐的大规模多模态基准。

英文摘要

Music recommendation systems typically treat songs as opaque tokens, relying on collaborative interaction histories which overlooks semantic or acoustic content. Prior work has explored LLM-augmented, multimodal, and text-enhanced approaches to sequential recommendation, and while some methods partially combine semantic, acoustic, or engagement signals, none jointly model all three within a unified LLM-based sequential reasoning framework that grounds recommendations in actual song content. In this work, we propose a multimodal framework for session-based music recommendation that enriches the LastFM-1K dataset with three complementary signals: (1) audio and lyric embeddings extracted using pretrained music and text representation models, (2) LLM-generated semantic metadata using the MGPHot annotation schema, and (3) listening completion ratios. We adopt the E4SRec framework by extending it with multimodal features and different item ID encoder backbones, including SASRec, BERT4Rec, and GRU4Rec. We further extend the LLM backbone option with LLaMa-2-13B, Qwen2.5-7B-Instruct, and LLaMa-3-70B in both zero-shot and fine-tuned settings. Our experiments show that integrating content-based features improves over ID-only baselines up to 95% in terms of Recall and 79% in terms of NDCG. Moreover, our experiments show that naive multimodal fusion does not always yield additive improvements, highlighting challenges in cross-modal integration. We release a large-scale multimodal benchmark for music recommendation.

2606.00124 2026-06-02 cs.CV cs.LG 版本更新

Positional Encodings Anchor Spatial Structure in Vision Transformers: A Geometric Perspective on Robustness

位置编码锚定视觉Transformer中的空间结构:基于几何视角的鲁棒性研究

Mahmoud Mannes

发表机构 * ESSTHS

AI总结 本文通过引入空间相似性距离相关性(SSDC)度量,研究不同位置编码对视觉Transformer内部空间表示几何结构的影响,发现位置编码通过建立索引锚定的空间组织来提升模型在内容破坏性分布偏移下的鲁棒性。

Comments 16 pages (9 main text, 7 appendix). 5 figures (3 main text, 2 appendix) with 8 graphics total. 5 tables (1 main text, 4 appendix). Submitted to NeurIPS 2026 main conference and the ICML 2026 mechanistic interpretability workshop

详情
AI中文摘要

视觉Transformer中的位置嵌入(PEs)已知会影响性能和鲁棒性,但它们在塑造内部空间表示中的作用尚不明确。本文研究了不同形式的PEs如何影响ViT的表示几何结构,以及这些变化如何与内容破坏性分布偏移下的鲁棒性相关。我们引入了一个度量——空间相似性距离相关性(SSDC),用于量化token表示中的空间结构。利用该度量,我们发现未使用PEs训练的ViT仍会发展出非平凡的空间结构,但这种结构由视觉内容驱动,并在token置换下崩溃。相反,所有考虑的PEs(可学习绝对位置编码、正弦位置编码和旋转位置编码)都与向索引锚定空间组织的一致转变相关。这些模型中的表示在破坏内容的扰动下保持稳定,并对这类分布偏移表现出显著增强的鲁棒性。我们进一步表明,尽管不同的PEs产生不同的空间结构深度轨迹,但其鲁棒性属性大致相似(编码方案间存在次要差异),这表明鲁棒性似乎更依赖于稳定的位置参考框架的存在,而非特定的编码机制。这些结果为位置编码如何塑造内部表示提供了几何解释,并对未来编码方案的原则性设计具有启示意义。

英文摘要

Positional embeddings (PEs) in Vision Transformers (ViTs) are known to impact performance and robustness, but their role in shaping internal spatial representations is not well understood. In this work, we study how different forms of PEs influence the representational geometry of ViTs and how these changes relate to robustness under content-disrupting distribution shifts. We introduce a metric, the Spatial Similarity Distance Correlation (SSDC), to quantify spatial structure in token representations. Using this metric, we show that ViTs trained without PEs still develop non-trivial spatial structure, but this structure is driven by visual content and collapses under token permutation. In contrast, we find that all PEs considered (learned absolute, sinusoidal, and rotary) are associated with a consistent shift toward an index-anchored spatial organization. Representations in these models remain stable under perturbations that disrupt content, and exhibit substantially improved robustness to such distributional shifts. We further show that while different PEs produce distinct depth-wise trajectories of spatial structure, their robustness properties are largely similar (with secondary variation across encoding schemes), suggesting that robustness appears to depend on the presence of a stable positional reference frame more than it depends on the specific encoding mechanism. These results offer a geometric account of how positional encodings shape internal representations, with implications for the principled design of future encoding schemes.

2606.00123 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

CardioLens: 通过多序列心脏MRI评估揭示MLLMs的临床现实差距

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su, Taiping Qu, Jingwei Guo, Nan Zhang, Hui Wang, Zhen Zhou, Kairui Bo, Yan Chen, Yue Ren, Shuai Li, Lei Xu, Henggui Zhang

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Beijing Anzhen Hospital(北京安贞医院) Beihang University(北航) King Abdullah University of Science and Technology(国王 Abdullah 科学与技术大学)

AI总结 提出CardioLens测试平台,通过多序列心脏磁共振成像评估24个多模态大语言模型,发现其在临床工作流中表现不佳,存在类别崩溃失败模式,且输入选择和推理提示改进效果有限。

详情
AI中文摘要

多模态大语言模型在公共医学基准上表现出色,但现有评估通常依赖于孤立输入和简化识别任务,难以作为临床使用的有效代理。我们提出了CardioLens,一个针对多序列心血管磁共振的无泄漏评估测试平台,通过严格的报告到QA构建和验证流程,从私有医院档案中构建。CardioLens包含473,896张切片和13,494个经过验证的QA对,涵盖4D Cine、LGE、灌注和T2加权成像,并评估CMR解读的三个阶段:图像理解、报告生成和疾病诊断。在24个最先进的MLLM上,CardioLens揭示了显著的临床现实差距:模型整体表现不佳,性能沿真实CMR工作流下降。混淆分析进一步显示一种类别崩溃失败模式,模型倾向于默认频繁出现的异常类别,而不是区分临床不同的发现。为了排除MLLM兼容输入构造是主要原因,我们在不同切片预算下比较了随机、临床动机和数据驱动的切片选择协议;性能变化很小,通常约为1%。显式推理提示也无法挽救性能,往往使模型更加保守,而不是改善视觉证据的使用。这些结果表明,当前MLLM远未达到可靠的CMR解读,临床决策需要跨序列、视图和时间相位整合分布式证据。CardioLens为开发面向真实临床部署的下一代MLLM提供了一个临床基础的测试平台。

英文摘要

Multimodal Large Language Models (MLLMs) have shown strong performance on public medical benchmarks, yet existing evaluations often remain weak proxies for clinical use, relying on isolated inputs and simplified recognition-style tasks. We introduce CardioLens, a leakage-resistant evaluation testbed for multi-sequence Cardiovascular Magnetic Resonance (CMR), constructed from private hospital archives through a rigorous report-to-QA construction and verification pipeline. CardioLens contains 473,896 slices and 13,494 verified QA pairs across 4D Cine, LGE, perfusion, and T2-weighted imaging, and evaluates three stages of CMR interpretation: image understanding, report generation, and disease diagnosis. Across 24 state-of-the-art MLLMs, CardioLens reveals a substantial clinical reality gap: models perform poorly overall, with performance degrading along the real CMR workflow. Confusion analysis further shows a category-collapse failure mode, where models default to frequent abnormal categories rather than distinguishing clinically distinct findings. To rule out MLLM-compatible input construction as the primary cause, we compare random, clinically motivated, and data-driven slice selection protocols under different slice budgets; performance changes only marginally, typically by about 1%. Explicit reasoning prompts also fail to rescue performance, often making models more conservative rather than improving visual evidence use. These results show that current MLLMs remain far from reliable CMR interpretation, where clinical decisions require integrating distributed evidence across sequences, views, and temporal phases. CardioLens provides a clinically grounded testbed for developing next-generation MLLMs toward real-world clinical deployment.

2606.00120 2026-06-02 eess.SP cs.AI cs.LG 版本更新

SpikeWFM: Spiking-Aided Wireless Foundation Model for Robust Channel Prediction

SpikeWFM:用于鲁棒信道预测的脉冲辅助无线基础模型

Liwen Jing, Yisha Lu, Tingting Yang, Li Sun, Yuxuan Shi, Yuwei Wang, Mengfan Zheng, Leiyang Xu

发表机构 * Mobile Information Networks-National Science and Technology Major Project(移动信息网络国家科技重大专项)

AI总结 提出SpikeWFM混合架构,将脉冲神经网络与基于ANN的Transformer结合,通过时间稀疏性和事件驱动处理增强无线基础模型对噪声和干扰的鲁棒性,在信道预测任务上优于传统模型。

详情
AI中文摘要

本文提出SpikeWFM,一种新颖的混合架构,它将脉冲神经网络(SNN)与基于传统人工神经网络(ANN)的Transformer集成用于无线基础模型(WFM)。受人类大脑中噪声鲁棒且节能的信息处理启发,SpikeWFM旨在增强WFM对噪声和干扰的抵抗力,同时保持跨多种无线场景的强大泛化能力。借鉴大型语言模型成功经验,WFM利用跨各种无线环境的大规模数据集上的自监督预训练,学习一个统一的嵌入表示,支持包括信道预测、信道估计、波束预测、定位等在内的广泛下游任务。这类模型通常优于任务特定设计,并对未见条件表现出卓越的适应性。然而,现有WFM在实际无线系统中仍易受真实噪声和干扰影响。为解决这一局限,我们将脉冲神经元引入基于Transformer的WFM架构。我们提供简要理论分析,展示SNN-ANN混合如何通过时间稀疏性和事件驱动处理有效减轻噪声和干扰。实验结果表明,SpikeWFM在预训练收敛和信道预测准确性上均持续优于传统基于ANN的WFM。关于通信和感知任务的更多结果将在本工作的完整期刊版本中呈现。

英文摘要

This paper proposes SpikeWFM, a novel hybrid architecture that integrates spiking neural networks (SNNs) with conventional artificial neural network (ANN)-based transformers for wireless foundation models (WFMs). Inspired by the noise-robust and energy-efficient information processing in the human brain, SpikeWFM aims to enhance the resilience of WFMs against noise and interference while maintaining strong generalization capabilities across diverse wireless scenarios. Drawing from the success of large language models, WFMs leverage self-supervised pre-training on large-scale datasets spanning various wireless environments to learn a unified embedding that supports a wide range of downstream tasks, including channel prediction, channel estimation, beam predition, positioning and etc. Such models typically outperform task-specific designs and exhibit superior adaptability to unseen conditions. However, existing WFMs remain vulnerable to realistic noise and interference in practical wireless systems. To address this limitation, we incorporate spiking neurons into the transformer-based WFM architecture. We provide a brief theoretical analysis demonstrating how the SNN-ANN hybrid effectively mitigates noise and interference through temporal sparsity and event-driven processing. Experimental results show that SpikeWFM consistently outperforms conventional ANN-based WFMs in both pre-training convergence and channel prediction accuracy. Additional results on communication and sensing tasks will be presented in the full journal version of this work.

2606.00116 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Enhancing BiGRU with a KAN Block for Legal Document Classification and Summarization

增强BiGRU与KAN模块在法律文档分类与摘要中的应用

Ahmed Faizul Haque Dhrubo, Souvik Pramanik, Most. Aysha Siddika Sumona, Shahnewaz Siddique, Mohammad Ashrafuzzaman Khan, Mohammad Abdul Qayum, Mohsin Sajjad

发表机构 * Dept. of ECE North South University(电子工程系北南大学)

AI总结 提出一种基于KAN的BiGRU模型,用于低资源多语言法律文档的分类与摘要,通过KAN模块提升分类准确率至67.96%。

Comments This paper contains of 10 pages, 10 figures, 4 tables and version 2 after it review from ACL 2026

详情
AI中文摘要

本研究引入了一种基于KAN的BiGRU模型的新架构,用于低资源多语言环境下的法律文档分类与摘要任务。为了解决领域语言、不同语言使用、上下文长依赖和类别不平衡等问题,我们使用了由孟加拉国法律文档组成的数据集,这些文档来自Manupatra,包括孟加拉语、英语和音译孟加拉语。我们的分类任务采用BiGRU模型以及Kolmogorov-Arnold网络(KAN)模块,而摘要部分则利用基于注意力的GRU结合KAN模型头部。分类模型达到了67.96%的准确率和0.65的F1分数;摘要的ROUGE-1、ROUGE-2和ROUGE-L指标分别对应0.38、0.23和0.31的F1分数。消融研究表明,使用KAN将分类准确率从57.34%提升至67.96%。此外,我们将所提出的技术与多个基线进行了比较,包括经典机器学习算法和预训练语言模型。

英文摘要

This study introduces a novel architecture of KAN-based BiGRU model for the task of classification and summarization of legal documents in a low-resource multilingual setup. In order to tackle problems associated with domain language, the usage of different languages, long dependencies within context, and class imbalance, we employ the dataset composed of legal documents from Bangladesh and taken from Manupatra, which include Bengali, English, and transliterated Bengali languages. Our classification task involves BiGRU model, along with Kolmogorov-Arnold Network (KAN) module, while the summarization part utilizes attention-based GRU, combined with a KAN model head. Classification model yields 67.96% of accuracy and 0.65 F1 score; while ROUGE-1, ROUGE-2, and ROUGE-L measures for summarization yield 0.38, 0.23, and 0.31 F1 scores, correspondingly. Ablation study shows that the use of KAN increases classification accuracy from 57.34% to 67.96%. Moreover, our proposed technique is compared to several baselines, including classical ML algorithms and pretrained language models.

2606.00115 2026-06-02 cs.CV cs.LG stat.ML 版本更新

Physics from Video: Identifiability of Time-Invariant Second-Order ODEs under Minimal Trajectory Conditions

来自视频的物理:最小轨迹条件下时不变二阶ODE的可辨识性

Yuanyuan Wang, Wenjie Wang, Kun Zhang, Mingming Gong

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 研究从原始像素中辨识连续时间物理定律的结构可辨识性,证明在最小轨迹条件下,编码器-仅管道可唯一恢复二阶线性ODE参数,并引入方差底正则化器稳定无解码器目标。

Comments Accepted at ICML 2026

详情
AI中文摘要

弥合视觉真实感与物理理解之间的差距是基于视频的世界模型的核心挑战。我们研究从原始像素中辨识连续时间物理定律的结构可辨识性,重点关注编码器-仅管道能否唯一恢复二阶线性ODE的参数。我们证明,一个水平集斜率覆盖条件确保学习到的潜在空间与真实物理状态局部仿射,从而实现精确的参数恢复。我们的理论首次给出了不同阻尼机制下最小数据需求的刻画,建立了欠阻尼系统可从单个视频片段辨识,而其他机制需要三个不同轨迹。我们进一步引入方差底正则化器以稳定无解码器目标并防止潜在坍缩。在合成和真实数据上验证,我们的方法表明,无需计算密集的像素重建,即可从视频中可靠估计可解释的物理常数,确保物理正确性和透明性。代码可在 https://github.com/wenjiewang3/PhysicsFromVideo 获取。

英文摘要

Bridging the gap between visual realism and physical understanding is a core challenge for video-based world models. We study the structural identifiability of continuous-time physical laws from raw pixels, focusing on whether an encoder-only pipeline can uniquely recover the parameters of second-order linear ODEs. We prove that a level-set slope-coverage condition ensures the learned latent space is locally affine to the true physical state, enabling exact parameter recovery. Our theory provides the first characterization of minimal data requirements across damping regimes, establishing that underdamped systems are identifiable from a single video clip, whereas other regimes require three diverse trajectories. We further introduce a variance-floor regularizer to stabilize the decoder-free objective and prevent latent collapse. Validated on synthetic and real-world data, our approach demonstrates that interpretable physical constants can be reliably estimated from video without the need for compute-intensive pixel reconstruction, ensuring both physical correctness and transparency. Code is available at https://github.com/wenjiewang3/PhysicsFromVideo.

2606.00111 2026-06-02 eess.IV cs.CV cs.LG 版本更新

ChWDTA: Channel-wise Wavelet-Domain Transformer Attention and Entropy Modeling for Learned Image Compression

ChWDTA:用于学习图像压缩的通道级小波域变换器注意力和熵建模

Haisheng Fu, Runyu Yang, Feng Ding, Siyu Zhu, Jie Liang, Xiaoxiao Li, Zhenman Fang, Jingning Han

发表机构 * Electrical and Computer Engineering Department, The University of British Columbia(英属哥伦比亚大学电气与计算机工程系) School of Engineering Science, Simon Fraser University(西蒙弗雷泽大学工程科学学院) School of Electronic Science and Technology, Eastern Institute of Technology(电子科学与技术学院,东部技术学院) Google LLC(谷歌公司)

AI总结 提出通道级小波域变换器注意力(ChWDTA)和通道级小波包分解,在混合CNN-Transformer图像压缩框架中提升率失真性能,在多个测试集上实现显著BD-rate降低。

Comments 13 pages, 8 figures, 6 tables

详情
AI中文摘要

最先进的学习图像压缩(LIC)方案越来越多地基于混合CNN-Transformer架构。为了进一步提高率失真性能,我们将通道级小波变换引入变换器和熵编码组件。首先,我们提出了一种通道级小波域变换器注意力(ChWDTA)机制。ChWDTA保留了现代LIC骨干中使用的有效窗口化空间自注意力,但在将注意力输出通过逆变换映射回来之前,在通道级小波变换特征上计算Q/K/V投影。因此,得到的通道级小波域变换器块(ChWDTB)保留了窗口化注意力的空间标记化模式,同时稀疏化了注意力投影所见的通道协方差。其次,在熵编码阶段,我们引入了一种通道级小波包(ChWP)分解,产生四个大小相等的子带,这更适合基于通道级切片的自回归熵建模。当每个通道级子带被分成两个切片时,我们使用八个切片进行熵编码。通过这种配置,所提出的方案在Kodak、CLIC Professional Validation和Tecnick测试集上分别获得了-17.82%、-19.15%和-22.56%的BD-rate降低。即使每个通道级子带被编码为单个切片,该方案仍以较低的复杂度保留了大部分编码增益。结果证实了在基于CNN-Transformer的LIC方案中引入小波变换的优势。

英文摘要

State-of-the-art learned image compression (LIC) schemes are increasingly based on hybrid CNN-transformer architectures. To further improve rate-distortion performance, we introduce channel-wise wavelet transforms into both the transformer and entropy-coding components. First, we propose a channel-wise wavelet-domain transformer attention (ChWDTA) mechanism. ChWDTA keeps the efficient windowed spatial self-attention used in modern LIC backbones, but computes the Q/K/V projections on channel-wise wavelet-transformed features before mapping the attention output back with the inverse transform. The resulting Channel-wise Wavelet-Domain Transformer Block (ChWDTB) therefore preserves the spatial tokenization pattern of windowed attention while sparsifying the channel covariance seen by the attention projections. Second, in the entropy-coding stage, we introduce a channel-wise wavelet packet (ChWP) decomposition that produces four equal-sized subbands, which better fit channel-wise slice-based autoregressive entropy modeling. When each channel-wise subband is divided into two slices, we use eight slices for entropy coding. With this configuration, the proposed scheme obtains BD-rate reductions of -17.82%, -19.15%, and -22.56% on the Kodak, CLIC Professional Validation, and Tecnick test sets, respectively. Even when each channel-wise subband is coded as a single slice, the scheme still retains most of the coding gains with lower complexity. The results confirm the advantage of introducing wavelet transform in CNN-transformer-based LIC schemes.

2606.00109 2026-06-02 cs.CV cs.AI cs.LG 版本更新

VDSB-GWSyn: Diffusion Schrödinger Bridge for Controllable and Anatomically Feasible Guidewire Synthesis in Coronary Angiography

VDSB-GWSyn: 用于冠状动脉造影中可控且解剖学可行的导丝合成的扩散薛定谔桥

Haoyuan Tang, Zhuo Zhang, Jialin Li, Shuai Xiao, Jiachen Yang

发表机构 * Tianjin University(天津大学)

AI总结 提出基于扩散薛定谔桥的VDSB-GWSyn框架,通过形状先验和血管分割约束生成可控、高保真导丝样本,显著提升下游导丝端点定位精度。

Comments Early accept to MICCAI 2026

详情
AI中文摘要

冠状动脉导丝端点定位是计算机辅助PCI的基本能力,随着机器人辅助PCI逐渐普及以减少操作者辐射暴露,其重要性日益增加。然而,带有导丝的标注CAG图像稀缺以及现有导丝合成模型的适应性有限,仍是导丝端点定位的关键瓶颈。为解决此问题,我们提出VDSB-GWSyn,一个基于扩散薛定谔桥(DSB)模型的框架,能够在复杂解剖背景下合成可控、高保真的导丝样本。VDSB-GWSyn首先使用我们的形状先验算法学习基本导丝几何形状,然后在血管分割掩码的约束下生成导丝掩码并输出对应的端点坐标,最后通过SPADE条件化的DSB在真实CAG图像上合成逼真的导丝样本。实验结果表明,VDSB-GWSyn合成的导丝样本取得了良好的ROI-FID和ROI-KID,以及高IPR分数。此外,将我们的合成数据用于合成预训练后接真实微调,显著改进了下游导丝端点定位,将MPE从16.01像素降低到7.71像素,PCK@3像素从52.63%提高到86.27%,从而实现了更临床可靠的机器人辅助导丝输送系统部署。此外,具有严格背景保留和解剖可行性约束的可控设备合成的核心设计理念,有可能迁移到其他标注数据稀缺的介入设备感知任务中。

英文摘要

Coronary guidewire endpoint localization is a fundamental capability for computer-assisted PCI, and its importance increases as robot-assisted PCI is progressively adopted to reduce operator radiation exposure. However, the scarcity of annotated CAG images with guidewires and the limited adaptability of existing guidewire synthesis models remain key bottlenecks for guidewire endpoint localization. To address this issue, we propose VDSB-GWSyn, a Diffusion Schrödinger Bridge (DSB) model-based framework, enabling synthesis of controllable, high-fidelity guidewire samples under complex anatomical backgrounds. VDSB-GWSyn first uses our shape prior algorithm to learn the basic guidewire geometry. It then generates guidewire masks under constraints imposed by the vessel segmentation masks and outputs the corresponding endpoint coordinates. Finally, it synthesizes realistic guidewire samples on real CAG images using DSB conditioned with SPADE. Experimental results show that the guidewire samples synthesized by VDSB-GWSyn achieve favorable ROI-FID and ROI-KID, as well as high IPR scores. In addition, incorporating our synthesized data for synthetic pre-training followed by real fine-tuning substantially improves downstream guidewire endpoint localization, reducing MPE from 16.01~px to 7.71~px and increasing PCK at 3~px from 52.63\% to 86.27\%, leading to more clinically reliable deployment of robot-assisted guidewire delivery systems. Moreover, the core design philosophy of controllable device synthesis with strict background preservation and anatomical feasibility constraints has the potential to transfer to other interventional device perception tasks where annotated data are scarce.

2606.00107 2026-06-02 eess.SP cs.AI cs.LG 版本更新

Motif-based morphology signatures for interpretable ECG screening and monitoring

基于基序的形态学特征用于可解释的心电图筛查和监测

Nivedita Bijlani, Mauricio Villarroel

发表机构 * The Podium Institute of Sports Medicine and Technology(Podium运动医学与体育科技研究所)

AI总结 提出一种基于基序的框架,通过定义可解释的心跳对齐基序和三种漂移度量,实现短期和长期心电图监测中的形态学变化量化与异常检测。

Comments Accepted to the IEEE Engineering in Medicine and Biology Conference (EMBC) 2026

详情
AI中文摘要

心电图仍然是心血管筛查的核心,但解读仍主要依赖人工且呈间歇性。临床实践依赖于简短的静息心电图,并在需要时进行长时间动态记录,两者都会产生需要大量资源审查的数据。因此,在临床明显异常出现之前,微妙的形态学变化或渐进性漂移可能被忽视。我们提出了一种基于基序的框架,该框架将心跳对齐的心电图基序定义为可解释的心脏特征,并量化短期和长期监测中的形态学漂移和偏差。基序是代表主导形态的典型心动周期。我们引入了三个可解释的漂移度量:与正常窦性心律的偏差、与个性化基线的偏差以及基序不稳定性指数。基序通过选择在固定窗口内最小化动态时间规整距离的心跳来提取。我们在短期(PTB-XL)和长期(MIT-BIH心律失常)心电图数据集上评估这些度量。通过代表性基序叠加和基于基准点的可视化实现可解释性,从而能够直接检查形态学变化。在MIT-BIH中,所提出的度量显著区分了主要正常和心律失常受试者(p<0.01)。在PTB-XL中,正常窦性心律偏差在主要诊断亚型中区分了正常和异常心电图(p<1e-4,Cliff's delta高达0.93)。心电图基序提供了心脏形态的可解释表示,支持可扩展的纵向监测和形态学驱动变化的早期检测。

英文摘要

Electrocardiography (ECG) remains central to cardiovascular screening, yet interpretation remains largely manual and episodic. Clinical practice relies on brief resting ECGs and, when required, long-duration ambulatory recordings, both generating data that require resource-intensive review. Consequently, subtle morphological changes or progressive drift preceding clinically apparent abnormalities may go unnoticed. We propose a motif-based framework that defines beat-aligned ECG motifs as interpretable cardiac signatures and quantifies morphological drift and deviation across short and long-term monitoring. Motifs are representative cardiac cycles capturing dominant morphology. We introduce three interpretable drift metrics: deviation from a normal sinus rhythm (NSR), deviation from a personalised baseline, and a motif instability index. Motifs are extracted by selecting beats that minimise Dynamic Time Warping (DTW) distance within fixed windows. We evaluate these metrics on short (PTB-XL) and long-duration (MIT-BIH Arrhythmia) ECG datasets. Interpretability is achieved through representative motif overlays and fiducial-based visualisations, enabling direct inspection of morphological changes. In MIT-BIH, the proposed metrics significantly separated predominantly normal from arrhythmic subjects (p<0.01). In PTB-XL, NSR deviation distinguished normal from abnormal ECGs across major diagnostic subtypes (p<1e-4, Cliff's delta up to 0.93). ECG motifs provide an interpretable representation of cardiac morphology, supporting scalable longitudinal monitoring and early detection of morphology-driven change.

2606.00106 2026-06-02 eess.SP cs.AI cs.HC cs.LG 版本更新

A Methodological Framework for Explicit Control of the Speed-Accuracy Trade-off in Brain-Computer Interfaces

脑机接口中速度-准确性权衡显式控制的方法论框架

Javier Jiménez, Francisco B Rodríguez

发表机构 * Grupo de Neurocomputación Biológica, Departamento de Ingeniería Informática, Universidad Autónoma de Madrid(生物神经计算组,信息工程系,马德里自治大学)

AI总结 提出一个独立于分类器、范式和早停策略的评估框架,通过增益和保持度两个指标及可调参数α显式控制速度-准确性权衡,并在P300范式上验证其有效性。

详情
AI中文摘要

脑机接口(BCI)受到脑电图等模态低信噪比的限制,需要多次试验才能可靠解码用户意图。这导致了速度-准确性权衡,即更高的准确性以速度为代价。速度-准确性平衡依赖于应用,因此需要可控的权衡。传统指标(如信息传输率)将速度和准确性合并,模糊了它们的依赖关系并可能引入偏差。在本研究中,我们提出了一个独立于分类器、范式和早停策略的评估框架,将速度和准确性分离。我们采用两个度量:增益(相对速度提升)和保持度(相对准确性保持),并将它们组合成一个由α控制的可调增益-保持平衡,从而调节速度-准确性权衡。该参数无需修改分类器即可调整工作点,便于跨场景部署。该框架在P300事件相关电位范式上进行了评估,使用了63名受试者的公开记录以及多种分类器和早停策略,以实现速度-准确性和比特率的不同工作点。结果表明,调整α可产生快速、准确或平衡的BCI行为,展示了速度-准确性权衡的显式控制。该方法支持受试者级别的性能预测,并提高了BCI行为的可解释性。对信息传输率的进一步分析揭示了其向速度的系统性偏差,该偏差通过所提出的框架中的增益和保持度测量得到解释。总体而言,本工作将速度-准确性权衡确立为可控的设计变量,并在公开的P300范式上进行了验证,从而实现了BCI的透明评估和应用特定优化。

英文摘要

Brain-computer interfaces (BCIs) are limited by low signal-to-noise ratio in modalities such as electroencephalography, which requires multiple trials to reliably decode user intentions. This induces a speed-accuracy trade-off, whereby higher accuracy comes at the cost of speed. The speed-accuracy balance is application-dependent, motivating controllable trade-offs. Conventional metrics, such as the Information Transfer Rate, combine speed and accuracy obscuring their dependence and potentially introducing biases. In this study, we propose an evaluation framework independent of classifier, paradigm, and early-stopping strategy that separates speed and accuracy. We employ two measures, Gain (relative speed improvement) and Conservation (relative accuracy preservation), and combine them into a tunable Gain-Cons Balance controlled by α, regulating the speed-accuracy trade-off. The parameter adjusts the operating point without modifying the classifier, facilitating deployment across scenarios. The framework was evaluated on P300 event-related potential paradigms using public recordings from 63 subjects as well as multiple classifiers and early-stopping strategies to achieve distinct operating points in speed-accuracy and bitrate. Results show that tuning α yields fast, accurate, or balanced BCI behaviours, demonstrating explicit control of the speed-accuracy trade-off. The method supports subject-level performance prediction and improves explainability of BCI behaviour. Further analysis of the Information Transfer Rate reveals a systematic bias toward speed, explained by the proposed framework through the Gain and Conservation measurements. Overall, this work establishes the speed-accuracy trade-off as a controllable design variable validated on public P300-based paradigms, enabling transparent evaluation and application-specific optimization of BCIs.

2606.00084 2026-06-02 cs.IR cs.AI cs.CL cs.LG 版本更新

SentimentLens: Reconciling Sentiment and Ratings via Dual-Modality in the Hospitality Sector

SentimentLens: 通过双模态调和酒店业中的情感与评分

Dineth Jayakody, Pasindu Thenahandi, Sampath Jayarathna

发表机构 * University of Peradeniya(珀拉尼亚大学)

AI总结 提出SentimentLens系统,基于方面级情感分析从非结构化酒店评论中提取知识,并通过跨模态调和文本情感与数值评分来识别运营冲突和服务改进机会。

详情
AI中文摘要

在线旅游平台生成大量用户生成的酒店评论,为大规模理解旅行者体验提供了丰富机会。然而,将非结构化文本反馈转化为结构化、可操作的见解仍然是一项具有挑战性的任务。本文提出了SentimentLens,一个基于方面级情感分析的可扩展分析系统,该系统从非结构化酒店评论中执行知识提取,并将其组织成可解释的服务类别。SentimentLens集成了方面术语提取、方面情感分类、语义类别分配和多层次分析模块,以支持区域级、酒店级和类别级评估。该系统设计为在不同地理环境和酒店环境中运行。为了展示其实用性,我们将SentimentLens应用于一个包含超过10,000条公开酒店评论的大型真实数据集。通过广泛分析,该框架揭示了旅行者情感如何随区域、服务类别和酒店类型而变化。我们进一步实现了文本情感与数值评分的跨模态调和,以识别潜在运营冲突、服务质量的结构性不一致性,并使用重要性-绩效和基于熵的分析确定高影响力的改进机会。结果表明,SentimentLens有效地将大规模非结构化评论转化为可操作的情报,支持酒店管理和旅游政策的数据驱动决策。虽然通过一个国家案例研究进行了演示,但所提出的系统可推广到其他目的地和评论驱动的服务领域。

英文摘要

Online travel platforms generate vast volumes of user-generated hotel reviews, offering rich opportunities to understand traveler experiences at scale. However, transforming unstructured textual feedback into structured, actionable insights remains a challenging task. This paper presents SentimentLens, a scalable analysis system based on Aspect-Based Sentiment Analysis that performs knowledge extraction from unstructured hotel reviews and organizes them into interpretable service categories. SentimentLens integrates aspect term extraction, aspect sentiment classification, semantic category assignment, and multi-level analytical modules to support region-level, hotel-level, and category-level evaluation. The system is designed to operate across different geographic contexts and hospitality settings. To demonstrate its practical utility, we apply SentimentLens to a large real-world dataset of over 10,000 publicly available hotel reviews. Through extensive analysis, the framework reveals how traveler sentiment varies across regions, service categories, and hotel archetypes. We further implement a cross-modal reconciliation of textual sentiment and numerical ratings to identify latent operational conflicts, structural inconsistencies in service quality, and high-impact improvement opportunities using importance--performance and entropy-based analyses. The results show that SentimentLens effectively transforms large-scale unstructured reviews into actionable intelligence, supporting data-driven decision-making for hospitality management and tourism policy. While demonstrated using a national case study, the proposed system is generalizable to other destinations and review-driven service domains.

2606.00083 2026-06-02 cs.LG cs.AI cs.RO 版本更新

From Demonstrations to Rewards: Test-Time Prompt Optimization for VLM Reward Models

从演示到奖励:VLM奖励模型的测试时提示优化

Christian Gumbsch, Leonardo Barcellona, Lennard Schünemann, Platon Karageorgis, Andrii Zadaianchuk, Zehao Wang, Sergey Zakharov, Fabien Despinoy, Rahaf Aljundi, Efstratios Gavves

发表机构 * University of Amsterdam(阿姆斯特丹大学) Catholic University of Leuven(鲁汶天主大学) Toyota Research Institute(丰田研究院) Toyota Motor Europe(丰田欧洲公司)

AI总结 提出Demo2Reward方法,利用少量专家演示在测试时优化VLM奖励模型的提示指令,减少假阳性并保持真阳性,无需额外训练即可提升下游策略学习。

详情
AI中文摘要

强化学习依赖于准确的奖励函数,但在现实应用(如机器人技术)中,这些函数通常是手工设计的,甚至不可用。最近的研究探索了预训练视觉-语言模型(VLM)作为奖励模型的零样本推理能力。然而,如果没有仔细的提示工程,这些方法往往会产生次优的奖励,其中假阳性预测会严重降低下游策略学习。在机器人技术中,通常收集包含专家演示的有限数据集来引导策略学习。这种场景提供了在策略训练之前优化奖励模型的机会。我们提出Demo2Reward,一种测试时自适应技术,基于少量演示(3-10条轨迹)优化奖励模型的语言指令,以减少假阳性同时保持真阳性。关键是,这在策略学习期间不需要额外的模型训练或计算资源。我们表明,Demo2Reward在一系列模拟机器人任务和策略骨干上始终优于现有的零样本和少样本VLM奖励模型。最后,我们证明Demo2Reward有效迁移到真实世界的机器人学习场景,无需手动设计奖励函数即可实现策略学习。

英文摘要

Reinforcement learning relies on accurate reward functions, which are often hand-crafted or even unavailable in real-world applications, such as robotics. Recent work has explored the zero-shot reasoning capabilities of pre-trained Vision-Language Models (VLMs) as reward models. However, without careful prompt engineering, these approaches tend to produce suboptimal rewards, where false positive predictions can severely degrade downstream policy learning. In robotics, limited datasets comprising expert demonstrations are often collected to bootstrap policy learning. This scenario provides an opportunity to optimize a reward model prior policy training. We propose Demo2Reward a test-time adaptation technique to optimize the language instruction of a reward model based on a few demonstrations (3-10 trajectories) to reduce false positives while preserving true positives. Crucially, this requires no additional model training or computation resources during policy learning. We show that Demo2Reward consistently outperforms existing zero- and few-shot VLM reward models across a range of simulated robotic tasks and policy backbones. Finally, we demonstrate that Demo2Reward effectively transfers to a real-world robotic learning scenario, enabling policy learning without manually engineering a reward function.

2606.00082 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Hoeffding Concept Bottleneck Models with Applications to Overhead Images

Hoeffding概念瓶颈模型及其在俯视图像中的应用

Clément Bénard, Manon Arfib, Christophe Labreuche, Victor Quétu

发表机构 * Thales cortAIx-Labs(泰雷兹 cortAIx 实验室) Université Paris-Saclay, CentraleSupélec(巴黎-萨克雷大学,中央理工-巴黎高等学院)

AI总结 针对线性概念瓶颈模型可解释性差和信息泄露问题,提出基于Hoeffding泛函分解的非线性稀疏聚合方法HCBM,并证明其对概念间泄露的鲁棒性,在分类和俯视图像目标检测任务中优于传统线性CBM。

详情
AI中文摘要

深度学习算法的可解释性对于高风险决策的计算机视觉应用至关重要。概念瓶颈模型(CBM)最近在基于高级概念瓶颈的分类问题上展示了提供可解释且准确预测的潜力。现有的CBM方法依赖概念分数的线性聚合来计算预测。然而,这种线性方法通常使用大量概念,这削弱了可解释性并有利于信息泄露。通常,概念与输出logits之间的潜在关系不是线性的。因此,我们引入了Hoeffding概念瓶颈模型(HCBM),该模型基于梯度提升树的Hoeffding泛函分解,提供概念分数的非线性和稀疏聚合,并使用素蕴含生成紧凑预测。HCBM被证明对概念间泄露具有鲁棒性,并在大量实验中优于标准线性CBM。除了分类,HCBM还可以适应目标检测,我们专注于一个具有挑战性的俯视图像案例,以展示HCBM在这些设置中的高性能。

英文摘要

Explainability of deep learning algorithms is critical for computer-vision applications with high-stake decisions. Concept bottleneck models (CBM) have recently shown promising performance to provide explainable and accurate predictions for classification problems, based on a bottleneck of high-level concepts. Existing CBM methods rely on a linear aggregation of the concept scores to compute predictions. However, a large number of concepts is often used in this linear approach, which undermines explainability and favors information leakage. In general, the underlying relation between concepts and output logits is not linear. Therefore, we introduce Hoeffding Concept Bottleneck Models (HCBM), which build on the Hoeffding functional decomposition of gradient-boosted trees to provide non-linear and sparse aggregations of concept scores, and generate compact predictions using prime implicants. HCBM are proved to be robust to interconcept leakage, and outperform standard linear CBM in practice, as shown in extensive experiments. Beyond classification, HCBM can be adapted to object detection, and we focus on a challenging case with overhead images to show the high performance of HCBM in these settings.

2606.00081 2026-06-02 cs.LG cs.AI cs.SD 版本更新

DAStatFormer: A Hybrid Multibranch Transformer with Statistical Feature Integration for DAS-Based Pattern Recognitions

DAStatFormer: 一种融合统计特征的混合多分支Transformer用于DAS模式识别

Michel Dione, Jerry Lonlac, Hélène Louis, Anthony Fleury, Stephane Lecoeuche

发表机构 * IMT Nord Europe, Institut Mines-Telecom, Univ. Lille, Centre for Digital Systems Lille, France(IMT北欧学院,法国电信研究院,里尔大学,数字系统研究中心,法国) IMT Mines Ales, Institut Mines-Telecom, Ales, France(IMT阿尔勒学院,法国电信研究院,阿尔勒,法国)

AI总结 针对DAS数据高维度和复杂时空模式问题,提出DAStatFormer混合多分支Transformer,通过提取24个ANOVA选择的统计特征并采用门控Transformer网络,在降低数据量级的同时实现高达99.4%的准确率。

详情
AI中文摘要

分布式声学传感(DAS)通过光纤实现大规模监测,但其高维度和复杂的时空模式使得事件分类具有挑战性。现有的深度学习方法——CNN、循环模型和Transformer变体——要么无法捕获长程依赖,要么需要以高昂成本处理原始DAS矩阵。我们提出DAStatFormer,一种混合多分支Transformer,将紧凑的多域统计特征与门控Transformer网络相结合。我们不是使用原始信号,而是从每个通道的时域、波形和频域提取24个ANOVA选择的属性,将数据量减少数个数量级,同时保留判别信息。每个域通过专用的逐步骤和逐通道注意力分支处理,并通过自适应门控机制融合。在开放的$\Phi$-OTDR基准测试和真实场景DAS数据集上的实验表明,DAStatFormer实现了高达99.4%的准确率和接近完美的实际性能,同时使用的参数和推理成本显著低于DASFormer和DeepViT等模型。这些结果证明了其适用于可扩展、实时的DAS监测。我们在https://github.com/MichelD-git/DAStatFormer发布代码。

英文摘要

Distributed Acoustic Sensing (DAS) enables large-scale monitoring through optical fibers, but its high dimensionality and complex spatio-temporal patterns make event classification demanding. Existing deep learning approaches-CNNs, recurrent models, and Transformer variants-either fail to capture long-range dependencies or require processing raw DAS matrices at prohibitive cost. We propose DAStatFormer, a hybrid multibranch Transformer that combines compact multidomain statistical features with Gated Transformer Networks. Instead of raw signals, we extract 24 ANOVA-selected attributes per channel from the temporal, waveform, and spectral domains, reducing data size by orders of magnitude while preserving discriminative information. Each domain is processed via dedicated step-wise and channel-wise attention branches, fused by an adaptive gating mechanism. Experiments on the open $Φ$-OTDR benchmark and a real-scenario DAS dataset show that DAS-tatFormer achieves up to 99.4% accuracy and near-perfect real-world performance, while using significantly fewer parameters and lower inference cost than models such as DASFormer and DeepViT. These results demonstrate its suitability for scalable, real-time DAS-based monitoring. We release our code at https://github.com/MichelD-git/DAStatFormer

2606.00080 2026-06-02 cs.CV cs.AI cs.LG cs.NE 版本更新

Planktonzilla: Multimodal dataset and models for understanding plankton ecosystems

Planktonzilla: 用于理解浮游生态系统的多模态数据集与模型

Alan Gerson Contreras Montanares, Luis Valenzuela, Luis Martí, Nayat Sanchez-Pi

发表机构 * Inria Chile Research Center(Inria智利研究中心)

AI总结 为解决浮游生物分类模型泛化性差的问题,提出统一数据集Planktonzilla-17M(含1740万张图像,涵盖602个分类类群),并对比监督学习与CLIP风格训练,发现基于分类谱系的监督学习优于CLIP,且现有生物基础模型在海洋成像领域表现不佳。

详情
AI中文摘要

海洋浮游生物支撑着水生食物网,并在全球二氧化碳封存中发挥关键作用,因此可靠的物种识别对于理解海洋健康和气候反馈至关重要。现有的分类模型在单个数据集上表现良好,但由于训练数据集孤立且标签不一致,无法跨仪器和环境泛化。为解决这一问题,我们引入了Planktonzilla-17M,这是一个统一的数据集,整合了来自13个成像系统的公开浮游生物图像集合。它包含1740万张图像,具有标准化的分类学和地理环境元数据,其中包括374万张浮游生物图像,涵盖602个分类类群,其中201个在物种级别被识别,使其成为迄今为止最大、最全面的浮游生物图像数据集。利用这一大规模数据集,我们在共享ViT骨干网络上进行了监督学习与CLIP风格图像-文本训练的对比实验。我们发现,当使用分类谱系作为文本时,监督分类器的表现与CLIP风格训练相当或更优。我们进一步观察到,BioCLIP和BioCLIP2在零样本和少样本设置下对浮游生物表现不佳。利用Planktonzilla-17M提高了浮游生物分类性能,凸显了当前生物基础模型在海洋成像领域的局限性。

英文摘要

Marine plankton underpin aquatic food webs and play a key role in global CO2 sequestration, making reliable species identification critical for understanding ocean health and climate feedbacks. Existing classification models perform well on individual collections but fail to generalize across instruments and environments due to isolated training datasets and inconsistent labels. To address this, we introduce Planktonzilla-17M, a unified dataset consolidating publicly available plankton image collections spanning thirteen imaging systems. It comprises 17.4 million images with standardized taxonomy and geo-environmental metadata, including 3.74 million plankton images spanning over 602 taxonomic classes, of which 201 are identified at the species level, making it the largest and most comprehensive plankton image dataset to date. Using this large-scale dataset, we perform a controlled comparison between supervised and CLIP-style image--text training on a shared ViT backbone. We find that a supervised classifier matches or exceeds CLIP-style training when trained using taxonomic lineage as text. We further observe that BioCLIP and BioCLIP2 perform poorly on plankton in zero-shot and few-shot settings. Leveraging Planktonzilla-17M improves plankton classification performance, highlighting the limitations of current biological foundation models in marine imaging domains.

2606.00079 2026-06-02 cs.LG cs.AI 版本更新

BitsMoE: Efficient Spectral Energy-Guided Bit Allocation for MoE LLM Quantization

BitsMoE: 面向MoE大语言模型量化的频谱能量引导比特分配

Jiayu Zhao, Zihan Teng, Minhao Fan, Tianrui Ma, Wentao Ren, Song Chen, Weichen Liu

发表机构 * School of Microelectronics, University of Science and Technology of China(中国科学技术大学微电子学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院) School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院)

AI总结 提出BitsMoE框架,通过SVD分解和频谱能量引导的混合精度比特分配,解决MoE模型超低位量化中的精度损失问题,在Qwen3-30B-A3B-Base上2比特量化下准确率提升27.83个百分点。

Comments 29 pages, 6 figures, 9 tables. Code and models are available at https://github.com/zjiayu064/BitsMoE

详情
AI中文摘要

混合专家(MoE)大语言模型通过稀疏专家激活减少了每词元的计算量,但由于所有专家权重必须常驻内存,其部署仍然占用大量内存。现有的MoE压缩方法在超低位宽场景下表现不佳:剪枝不可逆地移除模型容量,而粗粒度量化无法根据异构的专家和权重方向重要性分配比特。我们提出BitsMoE,一种面向MoE大语言模型量化的频谱能量引导比特分配框架。BitsMoE通过SVD将每个MoE层分解为共享基和专家特定的频谱因子,保留共享基不进行量化以保持跨专家的共同结构,并使用专家特定因子作为细粒度量化单元。为确定每个单元的比特宽度,BitsMoE将频谱混合精度量化建模为激活感知的重建替代问题,并求解一个整数线性规划,在固定比特预算下最小化估计的重建损失。在多个MoE大语言模型上的实验表明,BitsMoE在超低位宽场景下显著降低了下游任务准确率下降。在Qwen3-30B-A3B-Base上进行2比特量化时,BitsMoE相比GPTQ加速量化12.3倍,平均准确率提升27.83个百分点,解码速度提升1.76倍。我们的模型和代码已在https://github.com/zjiayu064/BitsMoE公开。

英文摘要

Mixture-of-Experts (MoE) large language models reduce per-token computation through sparse expert activation, but their deployment remains memory-intensive because all expert weights must be kept resident in memory. Existing MoE compression methods struggle in the ultra-low-bit regime: pruning irreversibly removes model capacity, while coarse-grained quantization fails to allocate bits according to heterogeneous expert and weight-direction importance. We propose BitsMoE, a spectral-energy-guided bit-allocation framework for MoE LLM quantization. BitsMoE decomposes each MoE layer by SVD into a shared basis and expert-specific spectral factors, retaining the shared basis without quantization to preserve common cross-expert structure and using the expert-specific factors as fine-grained quantization units. To determine the bit-width of each unit, BitsMoE formulates spectrum-wise mixed-precision quantization as an activation-aware reconstruction surrogate and solves an integer linear program that minimizes estimated reconstruction loss under a fixed bit budget. Experiments across multiple MoE LLMs show that BitsMoE substantially reduces downstream task accuracy degradation in ultra-low-bit regimes. Under 2-bit quantization on Qwen3-30B-A3B-Base, BitsMoE accelerates quantization by 12.3$\times$, improves average accuracy by 27.83 percentage points, and increases decoding speed by 1.76$\times$ over GPTQ. Our model and code are publicly available at https://github.com/zjiayu064/BitsMoE.

2606.00074 2026-06-02 eess.SP cs.AI cs.LG 版本更新

CLSP-REQA: A Real-Time Quality-Aware Closed-Loop Seizure Prediction Framework with Mamba-BiLSTM and Confidence-Gated Intervention

CLSP-REQA:基于Mamba-BiLSTM和置信门控干预的实时质量感知闭环癫痫发作预测框架

Mufeng Chen, Qi Wu, Bingchao Huang, Xiwen Lai, Zekai Chen, Xinge Ouyang, Quansheng Ren

发表机构 * Department of Engineering Science, University of Oxford(牛津大学工程科学系) Mathematical Institute, University of Oxford(牛津大学数学研究所) School of Computer Science and Engineering, Beihang University(北航计算机科学与工程学院) Aerospace Information Research Institute, Chinese Academy of Sciences(中国科学院航天信息研究所) Department of Mechanical Engineering, The University of British Columbia(不列颠哥伦比亚大学机械工程系) College of Life Sciences, Hunan Normal University(湖南师范大学生命科学学院) School of Electronics, Peking University(北京大学电子学院)

AI总结 提出CLSP-REQA框架,通过嵌入实时EEG质量评估模块和Mamba-BiLSTM骨干网络,结合分层非线性融合函数,在严格跨患者评估下实现优于现有方法的癫痫发作预测性能。

Comments 27 pages, 8 figures, submitted to Biomedical Signal Processing and Control

详情
AI中文摘要

可靠的癫痫发作预测是闭环神经刺激治疗的前提,然而现有方法很少考虑实际部署中EEG信号质量的可变性,并且绝大多数采用非严格的评估协议,高估了泛化性能。我们提出了CLSP-REQA(具有实时EEG质量评估的闭环癫痫发作预测),这是一个统一框架,将轻量级信号质量估计器直接嵌入预测流程中。实时EEG质量评估(REQA)模块与Mamba-BiLSTM骨干网络并行运行,产生一个标量质量分数q ∈ [0,1],通过分层非线性融合函数(ECLO)调节输出置信度。在CHB-MIT头皮EEG数据库(n=23名受试者,198次发作)的严格跨患者评估下,CLSP-REQA实现了0.7426 ± 0.0199的AUC-ROC,优于Jemal等人报告的未适应跨患者基线0.69,仅使用16个EEG通道(先前工作为23个),且无需任何目标患者数据或域适应。在SIENA头皮EEG数据库(n=14名受试者,47次发作)上,CLSP-REQA实现了0.7012 ± 0.0249的AUC,大幅超过同一数据集上最佳域适应跨患者结果0.61,展示了强大的跨数据集泛化能力。该框架输出结构化四元组(p, q, c, Phi_SHAP),可直接与闭环神经刺激器接口兼容。

英文摘要

Reliable seizure prediction is a prerequisite for closed-loop neurostimulation therapy, yet existing methods rarely account for the variability in EEG signal quality encountered in real-world deployment, and the overwhelming majority adopt non-strict evaluation protocols that overestimate generalisation performance. We propose CLSP-REQA (Closed-Loop Seizure Prediction with Real-time EEG Quality Assessment), a unified framework that embeds a lightweight signal quality estimator directly within the prediction pipeline. A Real-time EEG Quality Assessment (REQA) module runs in parallel with a Mamba-BiLSTM backbone, producing a scalar quality score q in [0,1] that modulates output confidence through a tiered non-linear fusion function (ECLO). Under strict cross-patient evaluation on the CHB-MIT Scalp EEG Database (n = 23 subjects, 198 seizures), CLSP-REQA achieves an AUC-ROC of 0.7426 +- 0.0199, outperforming the unadapted cross-patient baseline of 0.69 reported by Jemal et al., using only 16 EEG channels compared to 23 in prior work, and without requiring any target-patient data or domain adaptation. On the SIENA Scalp EEG Database (n = 14 subjects, 47 seizures), CLSP-REQA achieves AUC 0.7012 +- 0.0249, substantially surpassing the best domain-adapted cross-patient result of 0.61 on the same dataset, demonstrating strong cross-dataset generalisation. The framework outputs a structured four-tuple (p, q, c, Phi_SHAP) directly compatible with closed-loop neurostimulator interfaces.

2606.00073 2026-06-02 cs.NE cs.AI cs.LG 版本更新

Rare Events, Real Signals: Functional Ensembles as Units of Computation in Deep Spiking Networks

罕见事件,真实信号:深度脉冲网络中的功能集合作为计算单元

Aditi Aravind, Konstantinos Ladakis, Mario Alexios Savaglio, Stelios M. Smirnakis, Maria Papadopouli

发表机构 * University of Crete(希腊克里特大学) Foundation of Research & Technology - Hellas(希腊研究与技术基金会) Archimedes Research Unit(阿基米德研究单位) Harvard Medical School(哈佛医学院) Brigham and Women’s Hospital(布莱根妇女医院)

AI总结 通过引入功能连接性分析框架,研究深度脉冲神经网络中功能集合的涌现特性,发现一阶功能连接集合的协同放电可靠预测下游神经元响应,且信息编码集中在罕见但高度协调的活动模式中。

详情
AI中文摘要

我们通过引入一个受神经科学启发的框架,从功能连接性的角度分析深度脉冲神经网络(SNN),研究内部表征如何在层次化处理系统中涌现。借鉴系统神经科学和信息论的概念,我们基于一个神经元与训练好的SNN架构中前一层神经元的统计显著成对相关性,形成该神经元的一阶功能连接(1FC)组。然后,我们在各种条件下的推理过程中跟踪其响应特性。我们的分析表明,先前在生物皮层中观察到的功能连接性的几个原理在脉冲ResNet架构中得以保留。这些1FC集合表现出有趣的特性:它们的聚合协同放电通过一个鲁棒的、类似ReLU的输入输出关系可靠地预测下游神经元响应,其增益随集合大小系统性缩放。仅在高的1FC协同放电事件期间才出现所呈现类别的可靠编码,而这些事件本身发生频率较低,表明信息表征集中在罕见但高度协调的活动模式中。在均匀随机噪声或对抗性扰动下,这些响应轮廓被破坏,尤其是在早期和中间层。这使得能够在特定节点和路径上进行有针对性的高分辨率探查。我们表明,功能连接结构由学习塑造,并且在权重置换下该结构被破坏。这些确立了1FC集合作为输入编码和信息传递的功能上有意义的基质,对设计针对信息流的有针对性的细粒度诊断具有潜在意义。

英文摘要

We investigate how internal representations emerge across hierarchical processing systems by introducing a neuroscience-inspired framework for analyzing deep spiking neural networks (SNN) through the lens of functional connectivity. Drawing on concepts from systems neuroscience and information theory, we form the first-order functionally-connected (1FC) group of a neuron based on its statistically significant pairwise correlations with neurons from the previous layer of a trained SNN architecture. We then track its response properties during inference under various conditions. Our analysis shows that several principles of functional connectivity previously observed in biological cortex are preserved in spiking ResNet architectures. These 1FC ensembles display interesting properties: their aggregate cofiring reliably predicts downstream neuronal responses through a robust, ReLU-like input-output relationship, whose gain scales systematically with ensemble size. Reliable encoding of the presented class emerges only during high 1FC cofiring events, which themselves occur infrequently, indicating that informative representations are concentrated in rare but highly coordinated activity patterns. Under uniform random noise or adversarial perturbations, these response profiles are disrupted, particularly in early and intermediate layers. This enables a targeted high-resolution interrogation at specific nodes and pathways. We showed that the functional connectivity structure is shaped by learning and this structure breaks under weight permutation. These establish 1FC ensembles as a functionally meaningful substrate for input encoding and information transfer, with potential implications in designing targeted fine-grained diagnostics on the information flow.

2606.00060 2026-06-02 q-fin.TR cs.CE cs.LG 版本更新

Machine Learning-Based Bitcoin Trading Under Transaction Costs: Evidence From Walk-Forward Forecasting

基于机器学习的比特币交易:考虑交易成本的滚动前向预测证据

Andrei Bysik, Robert Ślepaczuk

发表机构 * Quantitative Finance Research Group, Faculty of Economic Sciences, University of Warsaw(经济科学学院量化金融研究组,华沙大学) Quantitative Finance Research Group, Department of Quantitative Finance and Machine Learning, Faculty of Economic Sciences, University of Warsaw(经济科学学院量化金融与机器学习系量化金融研究组,华沙大学)

AI总结 研究在交易成本下,利用XGBoost、LSTM和iTransformer等机器学习模型预测BTC-USDT小时收益率,并通过成本感知执行过滤器将预测转化为盈利交易策略。

Comments 42 pages,

详情
AI中文摘要

本文研究机器学习对BTC-USDT小时收益率的预测能否在扣除交易成本后转化为具有经济意义的交易表现。使用2018-2026年间约70,000个小时观测值,在27折滚动前向协议中评估XGBoost、LSTM和iTransformer。所有三种模型在选定配置下均产生正的毛交易表现,但一旦施加十个基点的交易成本,基于符号的朴素策略便失效。一种成本感知的执行过滤器(仅当预测幅度超过基于交易成本的阈值时才阻止交易)显著降低了换手率,并在选定配置下恢复了盈利能力。最强的纯多头XGBoost策略年化收益率超过65%,夏普比率高于1。额外测试表明,技术指标在选定情况下提升了表现,EGARCH导出的特征并未提供一致的稳健收益,且XGBoost在描述性上优于神经替代模型,尽管自助法证据不支持正式的统计优势。损失函数和模型选择效应是次要的且统计上脆弱。结果表明,小时级加密货币交易的主要障碍不仅在于弱可预测性,还在于将预测转化为交易的方式。

英文摘要

This paper investigates whether machine learning forecasts of hourly BTC-USDT returns can be converted into economically meaningful trading performance after transaction costs. Using approximately 70,000 hourly observations from 2018-2026, XGBoost, LSTM, and iTransformer are evaluated in a 27-fold walk-forward protocol. All three models produce positive gross trading performance in selected configurations, but naive sign-based strategies fail once transaction costs of ten basis points are imposed. A cost-aware execution filter, which prevents trades only when the forecast magnitude exceeds a transaction-cost-based threshold, sharply reduces turnover and restores profitability in selected configurations. The strongest long-only XGBoost strategy produces annualised returns above 65% with a Sharpe ratio above one. Additional tests show that technical indicators improve performance in selected cases, EGARCH-derived features do not provide uniformly robust gains, and XGBoost is descriptively stronger than the neural alternatives, although bootstrap evidence does not support formal statistical dominance. Loss-function and model-selection effects are secondary and statistically fragile. The results show that the main obstacle in hourly cryptocurrency trading is not only weak predictability, but also the way forecasts are converted into trades.

2606.00059 2026-06-02 cs.RO cs.LG 版本更新

Reinforcement Learning for Optimal Experiment Design in Parameter Identification of Mechatronic Systems

机电系统参数辨识中最优实验设计的强化学习方法

Julian Langschwert, Georg Schaefer, Jakob Rehrl, Stefan Huber, Simon Hirlaender

发表机构 * Josef Ressel Centre for Intelligent and Secure Industrial Automation, Salzburg University of Applied Sciences, Salzburg, Austria(约瑟夫·雷斯尔智能与安全工业自动化中心,萨尔茨堡应用技术大学,萨尔茨堡,奥地利) Paris Lodron University of Salzburg, Salzburg, Austria(萨尔茨堡巴黎洛登伦大学,萨尔茨堡,奥地利)

AI总结 提出一种强化学习智能体,通过奖励塑形自主满足安全约束,为Quanser Aero 2测试平台学习最优激励信号,在三个辨识参数上均达到竞争性估计精度,且安全违规率仅0.75%。

Comments Accepted at DEXA AI4IP 2026

详情
AI中文摘要

信息丰富的激励信号对于机电系统的精确系统辨识至关重要,然而经典系统辨识方法需要专家知识和手工设计的信号以满足硬件安全约束,限制了其通用性。我们提出一种强化学习智能体,为Quanser Aero 2测试平台学习最优激励信号,同时通过奖励塑形自主强制执行安全约束。在10个独立训练种子的评估中,我们的综合智能体在所有三个辨识参数上均实现了具有竞争力的估计精度,优于经典基线方法,且仅产生0.75%的安全违规。

英文摘要

Informative excitation signals are critical for accurate system identification of mechatronic systems, yet classical system identification (SI) approaches require expert knowledge and hand-crafted signal design to respect hardware safety constraints, limiting their generalizability. We propose a reinforcement learning (RL) agent that learns optimal excitation signals for a Quanser Aero 2 testbed while autonomously enforcing safety constraints through reward shaping. Evaluated across 10 independent training seeds, our comprehensive agent achieves competitive estimation accuracy across all three identified parameters, outperforming classical baselines while incurring only 0.75% safety violations.

2606.00056 2026-06-02 cs.CE cs.AI cs.LG physics.app-ph 版本更新

Physics-Informed Neural Networks for Radial Consolidation of Combined Electroosmotic, Vacuum and Surcharge Preloading Considering Smear Effects

考虑涂抹效应的电渗-真空-堆载联合预压径向固结的物理信息神经网络

Dong Li, Yapeng Cao, Shuai Huang, Yujun Cui, Haiping Fu, Lu Yang, He Wei

发表机构 * Department of Civil, Environmental, and Infrastructure Engineering, George Mason University(乔治·马歇尔大学土木、环境与基础设施工程系) State Key Laboratory of Cryospheric Science and Frozen Soil Engineering, Northwest Institute of Eco-Environment and Resources, Chinese Academy of Sciences(中国科学院寒区工程与冻土科学国家重点实验室,西北生态环境资源研究院) Laboratoire Navier/CERMES, École Nationale des Ponts et Chaussées, Institut Polytechnique de Paris(巴黎理工学院劳纳实验室/塞梅斯实验室,法国国家桥梁与道路学院) College of Water Conservancy and Hydropower Engineering, Hohai University(河海大学水利水电学院) School of Geosciences and Info-physics, Central South University(中南大学地球科学与信息物理学院)

AI总结 提出一种无量纲多域物理信息神经网络框架,通过改进的门控硬约束边界编码模型解决电渗径向固结问题,在时变荷载下实现高精度预测。

详情
AI中文摘要

本研究开发了一个无量纲多域物理信息神经网络(PINN)框架,用于考虑涂抹效应和真空-堆载联合预压的电渗径向固结。研究了三种基于PINN的模型:标准软约束PINN(Std-PINN)、改进的门控PINN(Mod-PINN)以及具有硬约束边界编码的改进门控PINN(Mod-HC-PINN)。这些模型在四种荷载工况下与有限元参考解进行了对比评估,包括恒定真空、指数真空、指数真空加斜坡堆载以及指数真空加循环半正弦堆载。结果表明,Mod-PINN中采用的门控架构提高了恒定真空荷载下阴极和涂抹区界面附近陡峭压力梯度的分辨率。在时变荷载下,软约束的Mod-PINN由于必须同时学习多个竞争目标而精度降低。Mod-HC-PINN通过将阴极边界和初始条件嵌入输出结构,减轻了这一问题,从而降低了优化负担并提高了物理一致性。Mod-HC-PINN在指数真空、斜坡堆载和循环堆载工况下的平均绝对误差(MAE)分别为0.43、0.41和0.27 kPa。敏感性分析进一步表明,所提出的框架在网络架构、配置点密度和渗透率对比的实际范围内保持稳健。

英文摘要

This study develops a dimensionless multi-domain physics-informed neural network (PINN) framework for electro-osmotic radial consolidation considering smear effects and combined vacuum and surcharge loading. Three PINN-based models are investigated: a standard soft-constrained PINN (Std-PINN), a modified gated PINN (Mod-PINN), and a modified gated PINN with hard-constraint boundary encoding (Mod-HC-PINN). The models are evaluated against FEM reference solutions under four loading cases, including constant vacuum, exponential vacuum, exponential vacuum with ramp surcharge, and exponential vacuum with cyclic haversine surcharge. The results indicate that the gated architecture applied in Mod-PINN improves the resolution of steep pressure gradients near the cathode and smear-zone interface under constant vacuum loading. Under time-dependent loading, the soft-constrained Mod-PINN shows reduced accuracy because it must learn multiple competing objectives simultaneously. The Mod-HC-PINN mitigates this issue by embedding the cathode boundary and initial conditions into the output structure, thereby reducing the optimization burden and improving physical consistency. The Mod-HC-PINN achieves MAE values of 0.43, 0.41, and 0.27 kPa for the exponential vacuum, ramp surcharge, and cyclic surcharge cases, respectively. Sensitivity analyses further demonstrate that the proposed framework remains robust across practical ranges of network architecture, collocation density, and permeability contrast.

2606.00052 2026-06-02 cs.AI cs.LG 版本更新

Product-Aware Deep Autoencoders for Robust Process Monitoring in Multi-Product Cyber-Physical Systems

产品感知深度自编码器用于多产品信息物理系统的鲁棒过程监控

MD Shafikul Islam, Jordan Carden

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对多产品制造中全局模型因决策边界扩大而产生盲点的问题,提出产品感知自编码器,通过限制学习域到产品特定分布来提升异常检测鲁棒性,在扩展田纳西伊士曼过程基准上实现100%攻击检测。

详情
AI中文摘要

随着工业4.0加速信息物理系统在制造业中的集成,鲁棒异常检测对于确保过程安全与安保变得至关重要。当前的数据驱动方法通常采用“产品无关”或全局模型,这些模型在所有正常操作数据的聚合上训练。然而,现代工业设施经常在不同的产品等级下运行。虽然计算简单,但这些全局模型本质上会扩展其决策边界以适应多种模式的方差,从而产生一个“盲点”,其中微妙的异常或针对性的信息物理攻击可能被模型的宽接受区域所掩盖。在这项工作中,我们首先证明了上述漏洞存在于跨多个产品等级运行的全局无关模型中。然后,我们提出了一种产品感知自编码器作为原则性的缓解措施,将学习域限制在等级特定的分布上。虽然这种方法降低了已识别的盲点风险,但我们并不声称它是所有可能替代方案中的最优缓解措施。我们使用扩展的田纳西伊士曼过程基准对这种方法进行了严格的验证,并与全局无关基线进行了比较。我们的实证结果表明,产品感知框架在标准检测指标上与全局基线表现相当,同时提供了对产品等级特定操作模式的改进鲁棒性。最关键的是,模拟我们假设的攻击场景的压力测试显示,虽然全局模型在77.8%的场景中未能检测到操作偏差,但产品感知系统实现了100%的检测准确率。这些发现表明,在柔性制造环境中,广义异常检测器可能带来非平凡的安全风险,促使向模式感知诊断架构的转变。

英文摘要

As Industry 4.0 accelerates the integration of Cyber-Physical Systems (CPS) in manufacturing, robust anomaly detection has become critical for ensuring process safety and security. Current data-driven approaches typically employ "product-agnostic" or global models trained on the aggregate of all normal operating data. However, modern industrial facilities frequently operate under diverse product grades. While computationally simple, these global models inherently expand their decision boundaries to accommodate the variance of multiple modes, creating a "blind spot" where subtle anomalies or targeted cyber-physical attacks may be masked by the wide acceptance region of the model. In this work, we first demonstrate that the vulnerability described above is present in global-agnostic models operating across multiple product grades. We then present a Product-Aware Autoencoder as a principled mitigation that restricts the learning domain to grade-specific distributions. While this approach reduces the identified blind-spot risk, we do not claim it as the optimal mitigation among all possible alternatives. We rigorously validate this approach against a Global Agnostic baseline using the Extended Tennessee Eastman Process (TEP) benchmark. Our empirical results indicate that the Product-Aware framework performs comparably to the global baseline on standard detection metrics, while offering improved robustness to product-grade-specific operating modes. Most critically, stress tests simulating our hypothetical attack scenarios reveal that while the global model fails to detect operational deviations in 77.8% of the scenarios, the product-aware system achieves 100% detection accuracy. These findings suggest that, in flexible manufacturing environments, generalized anomaly detectors can pose non-trivial security risks, motivating a shift toward mode-aware diagnostic architectures.

2606.00023 2026-06-02 cs.CL cs.AI cs.LG 版本更新

TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

TrustLDM:语言扩散模型的可信度基准测试

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

发表机构 * State Key Lab of General Artificial Intelligence, School of Intelligence Science and Technology, Peking University(中国科学院自动化研究所,智能科学与技术学院,北京大学) CISPA Helmholtz Center for Information Security(信息安全研究所) School of EECS, Peking University(电子工程学院,北京大学) Institute for Artificial Intelligence, Peking University(人工智能研究所,北京大学)

AI总结 针对语言扩散模型(LDM)的可信度问题,提出TrustLDM基准,评估其在不同架构和恶意上下文下的安全性、隐私性和公平性,并开发自动评估框架TrustLDM-Auto以识别脆弱配置。

详情
AI中文摘要

语言扩散模型(LDM)的快速发展挑战了自回归模型在语言处理中的主导地位。然而,其灵活、任意顺序的解码策略不仅实现了快速解码速度,还可能带来新的可信度挑战。为了更好地理解其流程背后的风险,我们引入了一个针对LDM的全面可信度基准(TrustLDM),评估不同LDM架构在多种静态后上下文类别下的安全性、隐私性和公平性。我们的实证结果表明,尽管LDM在仅使用用户提示时通常表现出较强的可信度,但当恶意后上下文附加到掩码响应时,其对齐行为明显下降。我们进一步观察到,较长的上下文不一定产生更强的影响,解码顺序和生成长度都会影响评估结果。最后,我们提出了TrustLDM-Auto,一个利用LDM解码灵活性自动识别脆弱配置的评估框架,揭示了所有评估模型和维度上的显著可信度弱点。我们的工作可能有助于社区构建更可信的LDM。我们的代码可在https://github.com/PKU-ML/TrustLDM获取。

英文摘要

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.

2606.00021 2026-06-02 cs.CL cs.AI cs.LG 版本更新

SENSE: Semantic Embedding Navigation with Soft-gated Evaluation for Retrieval-based Speculative Decoding

SENSE: 基于软门控评估的语义嵌入导航用于检索式推测解码

Shaowen Chen, Zhicheng Liao, Hongwei Wang

发表机构 * Zhejiang University, Hangzhou, China(浙江大学,杭州,中国)

AI总结 提出SENSE方法,通过语义嵌入导航和软门控评估模块替代表面形式匹配,提升检索式推测解码的鲁棒性和加速效果,在LLaMA和Qwen系列上实现最高4.09平均接受长度和3.26倍加速。

详情
AI中文摘要

推测解码(SD)通过使用轻量级草稿模型提出候选令牌,并由目标模型并行验证,从而加速大型语言模型(LLM)推理,同时不损害生成质量。尽管检索式推测解码(RSD)因其即插即用的多功能性而受到青睐,但其潜力受到刚性词汇依赖的阻碍,使得检索和验证对表面形式变化敏感。为了解决这个问题,我们提出了SENSE(基于软门控评估的语义嵌入导航)。通过将检索锚定在目标模型的隐藏状态上,SENSE建立了稳健的语义对齐,这使得软门控评估模块能够验证语义等价性而非表面形式。为了确保严格的基准测试,我们将现有方法解构为统一框架内的原子原语,促进细粒度的组件级比较。跨多个领域的广泛实验表明,SENSE在LLaMA和Qwen系列上优于多个基线,实现了高达4.09的平均接受长度和3.26倍的加速,同时保持了生成质量。我们的代码将在发表后发布。

英文摘要

Speculative Decoding (SD) accelerates Large Language Model (LLM) inference by employing a lightweight draft model to propose candidate tokens, which are verified in parallel by the target model, without compromising generation quality. While Retrieval-based Speculative Decoding (RSD) is favored for its plug-and-play versatility, its potential is impeded by rigid lexical dependencies, rendering both retrieval and verification brittle to surface-level variations. To address this, we propose SENSE (Semantic Embedding Navigation with Soft-gated Evaluation). By anchoring retrieval on the hidden states of the target model, SENSE establishes robust semantic alignment, which empowers the Soft-gated Evaluation module to validate semantic equivalence rather than surface forms. To ensure rigorous benchmarking, we deconstruct existing methods into atomic primitives within a unified framework, facilitating granular, component-level comparison. Extensive experiments across diverse domains demonstrate that SENSE outperforms multiple baselines on the LLaMA and Qwen families, attaining up to 4.09 mean acceptance length and 3.26x speedup, while preserving generation quality. Our code will be released upon publication.

2606.00011 2026-06-02 cs.HC cs.AI cs.LG 版本更新

RuleEdit: Failure-Guided Human-AI Model Editing with Prospective Impact Preview

RuleEdit: 失败引导的人机模型编辑与前瞻性影响预览

Min Hun Lee, Justin Yu Feng Teo

发表机构 * Singapore Management University(新加坡国立大学)

AI总结 提出RuleEdit系统,通过规则表的不匹配信号检测失败并预览模型编辑的影响,在卒中康复评估中显著提升人机协同性能。

详情
AI中文摘要

尽管AI有望协助复杂决策,但从业者仍然缺乏在提交模型编辑之前检测可能失败和检查后果的方法。我们提出RuleEdit,一个交互式、规则引导的人机模型编辑系统,它(i)通过规则表可解释的不匹配信号揭示可能的失败,并(ii)支持用户编写的规则反馈,提供预期性能变化和嵌入偏移的前瞻性预览。我们在卒中康复评估中实例化RuleEdit,并与卫生专业人员和学生一起评估。规则引导的失败检测将人+AI性能显著提高了14.16%(p<0.001),同时改善了对错误AI的拒绝,减少了过度依赖和不足依赖以及ChangedToWrong决策。此外,呈现前瞻性嵌入预览改善了参与者对模型适应的反馈,在纳入用户基于规则的反馈后,将更新后的局部性能增益从11.50%提高到36.38%(p<0.001)。我们的发现表明,基于不匹配的失败线索和前瞻性影响预览可以支持失败感知的人机模型编辑,同时也揭示了局部-全局权衡:有助于特定案例的编辑在全局转移时可能会降低性能。我们讨论了设计失败感知和可控人机系统的意义。

英文摘要

Despite the promise of AI to assist complex decisions, practitioners still lack ways to detect likely failures and inspect the consequences of model edits before committing them. We present RuleEdit, an interactive, rule-guided human-AI model editing system that (i) surfaces likely failures through interpretable mismatch signals from rule tables and (ii) supports user-authored rule feedback with prospective previews of projected performance changes and embedding shifts. We instantiate RuleEdit in stroke rehabilitation assessment and evaluate it with health professionals and students. Rule-guided failure detection significantly increased Human + AI performance by 14.16\% ($p<0.001$) while improving rejection of incorrect AI and reducing both over- and under- reliance as well as ChangedToWrong decisions. In addition, presenting prospective embedding previews improved participants' feedback for model adaptation, increasing post-update local performance gains from 11.50\% to 36.38\% after incorporating users' rule-based feedback ($p<0.001$). Our findings show that mismatch-based failure cues and prospective impact previews can support failure-aware human-AI model editing, while also revealing a local-global tradeoff: edits that help a specific case can degrade performance when transferred globally. We discuss implications of designing failure-aware and controllable human-AI systems.

2605.04127 2026-06-02 cs.LG cs.CL cs.CY 版本更新

Position: the Stochastic Parrot in the Coal Mine. Model Collapse is a Threat to Low-Resource Communities

立场:煤矿中的随机鹦鹉。模型崩溃对低资源社区的威胁

Devon Jarvis, Richard Klein, Benjamin Rosman, Steven James, Stefano Sarao Mannelli

发表机构 * GitHub

AI总结 本文探讨模型崩溃(生成模型在先前模型输出上训练导致的性能下降)如何通过降低训练效率、扭曲数据分布,不成比例地影响低资源和边缘化社区,并呼吁采取行动。

Comments 14 pages, 1 figure, 1 table, International Conference on Machine Learning

详情
AI中文摘要

模型崩溃,即当生成模型在先前的模型输出上进行训练时出现的性能下降,随着人工生成内容的激增,日益受到关注。对大型语言模型的相关批评强调了它们倾向于复现训练数据中的频繁模式、依赖庞大的数据集以及巨大的环境成本。这些因素共同导致了数据退化、文化偏见的强化以及资源利用的低效。在这篇立场论文中,我们旨在结合这些观点,并论证模型崩溃威胁着当前使AI民主化的努力。通过降低训练效率并使数据分布偏离其支撑的尾部,模型崩溃不成比例地影响了低资源和边缘化社区。我们考察了这一现象的环境和文化影响,将我们的立场置于近期关于模型崩溃的立场论文中,并以行动呼吁作为结论。最后,我们概述了减轻这些影响的初步方向。

英文摘要

Model collapse, the degradation in performance that arises when generative models are trained on the outputs of prior models, is an increasing concern as artificially generated content proliferates. Related critiques of large language models have highlighted their tendency to reproduce frequent patterns in training data, their reliance on vast datasets, and their substantial environmental cost. Together, these factors contribute to data degradation, the reinforcement of cultural biases, and inefficient resource use. In this position paper we aim to combine these views and argue that model collapse threatens current efforts to democratize AI. By reducing training efficiency and skewing data distributions away from the tails of their support, model collapse disproportionately impacts low-resource and marginalized communities. We examine both the environmental and cultural implications of this phenomenon, situate our position within recent position papers on model collapse, and conclude with a call to action. Finally, we outline initial directions for mitigating these effects.

2605.31200 2026-06-02 cs.LG stat.ML 版本更新

Beyond Additive Decompositions: Interpretability Through Separability

超越加性分解:通过可分离性实现可解释性

Jinyang Liu, Munir Eberhardt Hiabu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出张量分离学习(TSL)回归模型,通过阶段式贪心过程与正交重拟合学习单变量特征函数的秩1乘积之和,避免加性分解中的信号抵消和外推问题,实现忠实于拟合分量的可视化,并提供近似率保证。

Comments To appear in Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

可解释机器学习需要准确且结构上忠实于数据的模型。现有的可解释性方法严重依赖加性表示(例如广义加性模型(GAMs)、SHapley加性解释(SHAP)、函数ANOVA),这些方法在存在强交互作用时可能会遭受信号抵消和支持外推。我们提出了张量分离学习(TSL),一种回归模型,通过带有正交重拟合的阶段式贪心过程学习单变量特征函数的秩1乘积之和。通过强制可分离性,TSL避免了加性投影中由于边缘化高阶交互作用而导致的信息损失。学习的TSL模型可以从一阶偏依赖函数完全重建,直到常数因子。这种阶段式对应确保了所得可视化忠实于拟合的分量。我们为具有有界混合$p$阶偏导数的函数建立了近似率保证,并证明TSL在回归基准测试中与黑盒模型竞争。

英文摘要

Interpretable machine learning requires models that are accurate and structurally faithful to the data. Existing explainability methods rely heavily on additive representations (e.g., Generalized Additive Models (GAMs), SHapley Additive exPlanations (SHAP), functional ANOVA), which can suffer from signal cancellation and off-support extrapolation in the presence of strong interactions. We propose Tensor Separation Learning (TSL), a regression model that learns a sum of rank-1 products of univariate per-feature functions via a stagewise greedy procedure with orthogonal refitting. By enforcing separability, TSL avoids the information loss inherent in additive projections caused by marginalizing higher-order interactions. The learned TSL model can be fully reconstructed from first-order partial dependence functions, up to constant factors. This stage-wise correspondence ensures that the resulting visualizations are faithful to the fitted components. We establish approximation-rate guarantees for functions with bounded mixed $p$-th order partial derivatives and demonstrate that TSL competes with black-box models on regression benchmarks.

2605.31162 2026-06-02 cs.CV cs.LG 版本更新

Guidance for Low-Level Perceptual Editing in Unconditional Diffusion Models

无条件扩散模型中低级感知编辑的引导

Shreyansh Modi, Akshat Tomar, Aarush Aggarwal

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基)

AI总结 针对无条件扩散模型在美学和感知增强中难以进行全局低级变换的问题,提出一种无需训练的推理时机制,通过提取退化概念向量并结合瓶颈修补与无分类器引导,实现图像编辑与质量提升。

Comments 11 pages, 12 figures, Generative Models for Computer Vision Workshop CVPR 2026

详情
AI中文摘要

无条件扩散模型提供了强大的生成先验,但将其引导至美学增强的输出仍未被充分探索。我们表明,h-空间修补(用于无训练扩散编辑的主导范式)在美学和感知细化所需的全局低级变换中系统性失败。我们引入了一种新颖的、通用的框架,用于在无条件扩散模型中进行图像编辑,无需显式训练。这种推理时机制通过提取退化概念向量并组合瓶颈修补与无分类器引导来操作低级特征,从而引导采样远离退化流形,无需任何模型重训练即可持续生成改进的图像。

英文摘要

Unconditional diffusion models offer powerful generative priors, yet steering them toward aesthetically enhanced outputs remains largely unexplored. We show that h-space patching, the dominant paradigm for training-free diffusion editing, systematically fails for global, low-level transformations required for aesthetic and perceptual refinement. We introduce a novel, generalized framework for image-editing in unconditional diffusion models without explicit training. This inference-time mechanism operates on low-level features by extracting degradation concept vectors and combining bottleneck patching with classifier-free guidance to guide sampling away from the degraded manifold, producing consistently improved images without any model retraining.

2605.28335 2026-06-02 cs.LG 版本更新

Dimensionality Reduction for Robust Federated Learning: A Theoretical Analysis and Convergence Guarantee

鲁棒联邦学习的降维方法:理论分析与收敛性保证

Shiyuan Zuo, Jiashuo Li, Rongfei Fan, Han Hu, Jie Xu

发表机构 * Beihang University(北京航空航天大学) Xi'an Jiaotong University(西安交通大学) City University of Hong Kong(香港城市大学)

AI总结 针对联邦学习在拜占庭攻击下高维梯度聚合计算开销大的问题,提出基于稀疏随机投影的投影降维框架,将复杂度降至最优O(Mp),并证明其达到非凸函数O(1/√T)和强凸函数O(1/T)的最优收敛率。

详情
AI中文摘要

联邦学习使多个客户端能够在不共享原始数据的情况下协作训练模型,但它极易受到拜占庭攻击。现有的鲁棒方法可以中和这些威胁,但在高维梯度聚合过程中会产生大量计算开销,这种开销随模型大小扩展性差,并且随着现代模型变得越来越大,在训练成本中占据主导地位。为了解决这一计算瓶颈,我们提出了投影降维(PDR),一种用于基于向量级距离的鲁棒聚合器的通用加速框架,它通过稀疏随机投影将梯度压缩到一个大幅缩小的子空间中以高效计算可靠性权重,从而执行鲁棒聚合。这种方法将服务器计算复杂度降低到最优的O(Mp),其中M是客户端数量,p是模型维度,匹配了仅读取梯度所需的理论下界。我们在先前拜占庭鲁棒联邦学习分析的标准FL假设下建立了收敛性保证。通过利用子空间嵌入定理,我们证明PDR对于非凸函数实现了O(1/√T)的最优收敛率,对于强凸函数实现了O(1/T)的最优收敛率,其中T表示迭代次数。关键的是,我们从数学上证明,这种巨大的加速几乎是免费的,仅仅将固有的拜占庭误差界放大了有界且可调的因子(1+ε)/(1-ε)。在基准数据集上的实验结果证实,将PDR与现有聚合器集成可以在时间效率上实现数量级的加速,同时保持高度有竞争力的收敛性能。

英文摘要

Federated Learning (FL) enables multiple clients to collaboratively train models without sharing raw data, but it is highly vulnerable to Byzantine attacks. Existing robust approaches can neutralize these threats but incur substantial computational overhead during high-dimensional gradient aggregation, an overhead that scales poorly with model size and increasingly dominates the training cost as modern models grow larger. To address this computational bottleneck, we propose Projected Dimensionality Reduction (PDR), a universal acceleration framework for vector-level distance-based robust aggregators, which performs robust aggregation by compressing gradients into a drastically smaller subspace via sparse random projection to efficiently compute reliability weights. This approach reduces the server computational complexity to an optimal $ \mathcal{O}(Mp) $, where $ M $ is the number of clients and $ p $ is the model dimension, matching the theoretical lower bound required merely to read the gradients. We establish convergence guarantees under standard FL assumptions in prior Byzantine-robust FL analyses. By leveraging the Subspace Embedding Theorem, we show that PDR achieves optimal convergence rates of $ \mathcal{O}(1/\sqrt{T}) $ for non-convex functions and $ \mathcal{O}(1/T) $ for strongly convex functions, where $ T $ denotes the number of iterations. Crucially, we mathematically demonstrate that this massive acceleration comes almost for free, merely inflating the inherent Byzantine error floor by a bounded, tunable factor of $ \frac{1+ε}{1-ε} $. Experimental results on benchmark datasets confirm that integrating PDR with existing aggregators yields orders of magnitude speedups in time efficiency while maintaining highly competitive convergence performance.

2605.28209 2026-06-02 cs.LG 版本更新

Robust Contrastive Graph Clustering with Adaptive Local-Global Integration

鲁棒对比图聚类与自适应局部-全局整合

Lei Zhang, Fubo Sun, Haipeng Yang, Zhong Guan, Likang Wu

发表机构 * School of Computer Science and Technology, Anhui University(安徽大学计算机科学与技术学院) College of Management and Economics, Tianjin University(天津大学管理学院)

AI总结 提出一种对比图聚类框架,通过注意力机制自适应融合多尺度局部结构和全局语义原型,以解决复杂图中高阶局部结构捕获不足和全局语义忽略问题,提升聚类性能。

Comments Accepted at IJCAI 2026

详情
AI中文摘要

图聚类在图分析中对于揭示结构模式和节点社区至关重要。尽管自监督对比学习的最新进展通过结构和属性信号改进了聚类,但现有方法仍难以灵活捕获高阶局部结构,并且常常忽略复杂图中的全局语义。这些限制导致节点表示次优,尤其是在具有碎片化结构和模糊聚类边界的真实世界图中。为了解决这些限制,提出了一种对比图聚类框架,通过注意力机制联合整合多尺度局部结构与全局语义。在局部层面,通过基于注意力的加权自适应融合从多个传播深度提取的基于GNN的拓扑信号,以捕获多尺度邻域特征。在全局层面,通过注意力自适应聚合从动态演化的聚类中心导出的语义原型,以指导节点表示并增强聚类间可分离性。该模型在双视图对比学习范式下训练,采用结合实例级和结构感知损失的混合目标,以提高表示鲁棒性和判别力。在八个真实世界图数据集上的实验表明,我们的方法实现了有竞争力的聚类性能。代码可在 https://github.com/vege12138/w2 获取。

英文摘要

Graph clustering is essential in graph analysis for revealing structural patterns and node communities. Despite recent advances in self-supervised contrastive learning that have improved clustering via structural and attribute signals, existing methods still struggle to flexibly capture high-order local structures and often overlook global semantics in complex graphs. These limitations lead to suboptimal node representations, especially in real-world graphs with fragmented structures and ambiguous cluster boundaries. To address these limitations, a contrastive graph clustering framework is proposed to jointly integrate multi-scale local structures with global semantics via attention mechanisms. At the local level, GNN-based topological signals extracted from multiple propagation depths are adaptively fused through attention-based weighting to capture multi-scale neighborhood features. At the global level, semantic prototypes derived from dynamically evolving cluster centers are adaptively aggregated through attention to guide node representations and enhance inter-cluster separability. The model is trained under a dual-view contrastive learning paradigm with a hybrid objective that combines instance-level and structure-aware losses to improve representation robustness and discrimination. Experiments on eight real-world graph datasets demonstrate that our method achieves competitive clustering performance. Code is available at https://github.com/vege12138/w2.

2605.24583 2026-06-02 cs.LG cs.CL stat.ML 版本更新

Measuring Alignment-Induced Activation Shifts Correctly: A Template-Controlled Difference-in-Differences Protocol

正确测量对齐引起的激活偏移:一种模板控制的双重差分协议

Yuki Nakamura

发表机构 * The Open University of Japan(日本开放大学)

AI总结 针对对齐前后模型内部激活比较中存在的聊天模板混淆问题,提出模板控制的双重差分协议,有效分离对齐偏移与格式效应,恢复拒绝方向并提升余弦对齐度。

Comments 11 pages, 1 figure. v3: substantially revised and reframed as a measurement-methodology paper. Code, data, and an immutable Zenodo archive are available at https://github.com/Nakammura/effective-rank-audit (DOI: 10.5281/zenodo.20341444)

详情
AI中文摘要

比较模型在安全对齐前后的内部激活是探究安全训练改变了什么的一种自然方式:在安全相关输入上形成配对的(对齐后减对齐前)激活矩阵,并读取其有效秩或主方向。我们表明,形成该矩阵的直观方式存在混淆。对齐后的模型在基础模型从未见过的聊天模板下进行评估,因此朴素差异将对齐偏移与聊天格式混为一谈。我们引入修改矩阵的四变量分解(朴素、模板控制、对齐内和双重差分,DiD),以分离这两种效应。仅模板控制即可消除 Llama-3.1-8B、Gemma-2-9B 和 Qwen-2.5-7B 上测量有效秩的 2.0-3.9 倍膨胀;DiD 对比才是恢复 Arditi 等人(2024)的拒绝方向的关键,将其余弦对齐度从 0.18-0.39 提升至 0.50-0.86。跨三个系列的投影消融实验证实,恢复的子空间在行为上是活跃的,且奇异值顺序并非因果顺序。我们在受控测试平台上验证了该协议,并将其提炼为对齐激活差异研究的测量建议。

英文摘要

Comparing a model's internal activations before and after alignment is a natural way to ask what safety training changes: one forms the matrix of paired aligned-minus-base activations on safety-relevant inputs and reads off its effective rank or top direction. We show the obvious way to form this matrix is confounded. The aligned model is evaluated under a chat template the base model never saw, so the naive difference conflates the alignment shift with chat formatting. We introduce a four-variant decomposition of the modification matrix (naive, template-controlled, within-aligned, and difference-in-differences, DiD) that separates the two effects. Template control alone removes a 2.0-3.9x inflation of the measured effective rank across Llama-3.1-8B, Gemma-2-9B, and Qwen-2.5-7B; the DiD contrast is what recovers the refusal direction of Arditi et al. (2024), lifting its cosine alignment from 0.18-0.39 to 0.50-0.86. Projection-ablation across the three families confirms the recovered subspace is behaviorally active and that singular-value order is not causal order. We validate the protocol on a controlled testbed and distill it into measurement recommendations for activation-difference studies of alignment.

2605.20716 2026-06-02 cs.LG stat.ML 版本更新

Decision-Path Patterns as Tree Reliability Signals: Path-based Adaptive Weighting for Random Forest Classification

决策路径模式作为树可靠性信号:基于路径的自适应加权用于随机森林分类

Youngjoon Park

发表机构 * Independent Researcher(独立研究者)

AI总结 提出利用每棵树的决策路径结构模式作为实例自适应可靠性信号,对更可靠的树进行差异化加权,以纠正随机森林中因错误表示树占多数而导致的错误,在36个二分类基准上显著提升准确率。

Comments 32 pages, 3 figures. Code and data: https://github.com/DavidParkYJ/dwarfp

详情
AI中文摘要

随机森林通过每棵树对特征空间的不同随机化表示进行构建。当具有错误表示的树在概率上超过正确表示的树时,即使集成整体拥有足够正确的信息,其统一投票也无法纠正这些区域的错误——这是本文解决的一种可约错误。我们提出利用每棵树决策路径的结构模式作为实例自适应可靠性信号,以识别并对更可靠的树进行差异化加权。在推理时,随机森林通过样本在每棵树中遍历的根到叶路径得出预测,因此路径级可靠性提供了比树级加权更细的粒度。我们表明,该信号反映了每棵树决策的实际可靠性,并且使用它能在36个二分类基准上比RF获得统计显著的准确率提升(Wilcoxon p < 0.0001)。测量了类别召回率回归——RF校正方法的典型失败模式:在0.2个百分点阈值下,零个少数类召回率回归和一个多数类召回率回归,表明偏差减少而非类别权衡。我们进一步量化了该方法从拟合RF本身可访问的可约错误;该估计与每个数据集的增益强相关(Pearson r = +0.840, p < 0.0001)。在它识别的合格组上,该方法平均准确率提升+0.99个百分点,且在每个数据集上严格获胜(7/0/0);可选的放大机制进一步将其提升至+1.48个百分点。

英文摘要

Random forests construct each tree with a different, randomised representation of the feature space. Their uniform voting cannot correct errors in regions where trees with incorrect representations probabilistically outnumber correct ones, even when the ensemble collectively holds enough correct information - a reducible error that this paper addresses. We propose using the structural pattern of each tree's decision path as an instance-adaptive reliability signal to identify and differentially weight the more reliable trees. At inference, a random forest reaches its prediction through the root-to-leaf path the sample traverses in each tree, so path-level reliability offers a finer granularity than tree-level weighting can access. We show that this signal reflects the actual reliability of each tree's decision, and that using it yields a statistically significant accuracy improvement over RF on 36 binary classification benchmarks (Wilcoxon p < 0.0001). Class-recall regression - the typical failure mode of RF correction methods - is measured: zero minority-recall regressions and a single majority-recall regression at the 0.2 pp threshold, indicating bias reduction rather than a class trade-off. We further quantify the reducible error accessible to the method from the fitted RF alone; this estimate correlates strongly with per-dataset gain (Pearson r = +0.840, p < 0.0001). On the qualifying group it identifies, the method delivers a mean +0.99 pp accuracy improvement with strict wins on every dataset (7/0/0); an optional amplification mechanism further raises this to +1.48 pp.

2605.27835 2026-06-02 cs.LG cs.CL 版本更新

CAREF: Calibration-Aware Regularization for Explanation Faithfulness Without Rationale Supervision

CAREF: 面向解释忠实性的校准感知正则化,无需理由监督

Naphat Nithisopa, Teerapong Panboonyuen

发表机构 * MARSAIL Chulalongkorn University(朱拉隆梭大学) PBYAIL (Panboonyuen AI Lab)(PBYAIL(Panboonyuen人工智能实验室))

AI总结 提出CAREF框架,通过校准感知正则化联合优化预测准确性和解释忠实性,无需理由监督,在四个NLE基准上以少量可训练参数取得最佳性能。

Comments 10 pages

详情
AI中文摘要

我们引入了CAREF,一个参数高效的微调框架,通过校准感知正则化联合优化预测准确性和解释忠实性。其核心是通过单一统一损失函数LSCED(面向解释忠实性的校准感知正则化)将基于熵的校准与令牌级稀疏性控制相结合,无需理由监督。在四个NLE基准(COS-E、ECQA、ComVE、e-SNLI)上使用Flan-T5进行评估,我们轻量级的CAREF-AQ变体仅使用6.43%的可训练参数就达到了最佳平均准确率(89.04)和解释对齐度(81.00 nBERT),优于LoRA和AdaLoRA。据我们所知,CAREF是第一个将熵和稀疏性正则化统一到单一训练目标中用于可解释LLM微调的方法。

英文摘要

We introduce CAREF, a parameter-efficient fine-tuning framework that jointly optimizes predictive accuracy and explanation faithfulness via calibration-aware regularization. At its core, CAREF couples entropy-based calibration with token-level sparsity control through a single unified loss, the Calibration-Aware Regularization for Explanation Faithfulness (LSCED), without requiring rationale supervision. Evaluated on four NLE benchmarks (COS-E, ECQA, ComVE, e-SNLI) with Flan-T5, our lightweight CAREF-AQ variant attains the best average accuracy (89.04) and explanation alignment (81.00 nBERT) using only 6.43% of trainable parameters, outperforming LoRA and AdaLoRA. To our knowledge, CAREF is the first method to unify entropy and sparsity regularization in a single training objective for interpretable LLM fine-tuning.

2605.27527 2026-06-02 astro-ph.IM cs.LG 版本更新

Probabilistic Data-Driven Modelling of Astrophysical Transients: The Neural Process Family for Ultrafast and Class-Agnostic Light Curve Reconstruction with NightLANP

天体瞬变事件的概率数据驱动建模:基于NightLANP的超快速与类别无关光变曲线重建的神经过程家族

Siddharth Chaini, Federica B. Bianco, Ashish Mahabal

发表机构 * NASA FINESST Fellow Department of Physics and Astronomy, University of Delaware(物理与天文学系,德雷克塞尔大学) University of Delaware, Data Science Institute(德雷克塞尔大学数据科学研究所) Joseph R. Biden, Jr. School of Public Policy and Administration, University of Delaware(德雷克塞尔大学公共政策与行政学院) Vera C. Rubin Observatory(维拉·鲁宾天文台) Division of Physics, Mathematics, and Astronomy, California Institute of Technology(物理、数学与天文学系,加州理工学院) Center for Data Driven Discovery, California Institute of Technology(数据驱动发现中心,加州理工学院)

AI总结 针对稀疏不规则光变曲线重建问题,提出神经过程家族(以注意力神经过程为例),结合高斯过程的概率框架与深度学习的可扩展性,通过元学习实现跨波段、类别无关的快速推理,在Rubin模拟数据上优于高斯过程和神经网络。

详情
AI中文摘要

来自地球的天体观测受到天气、环境和科学限制,导致稀疏、不规则的光变曲线。在Vera C. Rubin天文台时空遗产巡天前夕,其数据集为瞬变科学提供了前所未有的机遇。然而,一个关键挑战是其观测节奏——在六个波段上稀疏且不规则,限制了推断。插值有助于缓解这一问题,高斯过程是标准方法,但它们在跨波段相关性上表现不佳,需要先验核函数指定,并且必须单独拟合每条光变曲线,因此可扩展性差。在此,我们引入神经过程家族用于光变曲线重建,结合了高斯过程的概率框架与深度学习的可扩展性。通过在多样化的模拟瞬变事件上进行元学习,注意力神经过程将大部分计算转移到训练阶段,从而能够使用类别无关模型进行快速、摊销的推断。在15个瞬变类别上使用真实的Rubin观测节奏进行评估,我们表明,即使是一个未优化的、开箱即用的注意力神经过程,在所有测试指标(包括回归质量、天体物理特征恢复和概率校准)上始终优于所有基准——一组高斯过程和神经网络。我们的模型同时插值所有波段,耗时微秒级,比次优的神经基准快四个数量级,比高斯过程快五个数量级,展示了神经过程在Rubin夜间警报流中的潜力。注意力神经过程避免了标准神经网络的过度自信和高斯过程的信心不足,提供了尖锐且良好校准的不确定性。这项工作确立了神经过程家族作为Rubin时代实时瞬变科学的可扩展概率基础。

英文摘要

Astrophysical observations from Earth are subject to weather, environmental, and scientific constraints that lead to sparse, irregular light curves. On the eve of the Vera C. Rubin Observatory Legacy Survey of Space and Time, its dataset offers unprecedented opportunities for transient science. Yet a key challenge remains its cadence, sparse and irregular across six bands, limiting inference. Interpolation helps mitigate this, with Gaussian Processes the standard, but they struggle with cross-band correlations, require a priori kernel specification, and must be fit to each light curve individually, hence scaling poorly. Here, we introduce the neural process family for light curve reconstruction, combining the probabilistic framework of Gaussian Processes with the scalability of deep learning. By meta-learning on diverse simulated transients, Attentive Neural Processes shift the bulk of computation to training, enabling rapid, amortized inference with a class-agnostic model. Evaluated on realistic Rubin cadences across 15 transient classes, we show that even an unoptimized, out-of-the-box Attentive Neural Process consistently outperforms all benchmarks -- a suite of Gaussian Processes and neural networks -- on every tested metric, spanning regression quality, astrophysical feature recovery, and probabilistic calibration. Our model interpolates all bands simultaneously in microseconds, over four orders of magnitude faster than the next-best neural benchmark and five faster than Gaussian Processes, demonstrating the potential of neural processes for the nightly Rubin alert stream. Attentive Neural Processes avoid the overconfidence of standard neural networks and the underconfidence of Gaussian Processes, delivering sharp, well-calibrated uncertainties. This work establishes the neural process family as a scalable, probabilistic foundation for real-time transient science in the Rubin era.

2605.27458 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Generic Interpretation Approach for Transformer Models Incorporating Heterogenous Attention Structures

融合异质注意力结构的Transformer模型通用解释方法

Yongjin Cui, Xiaohui Fan, Huajun Chen

发表机构 * Zhejiang University(浙江大学)

AI总结 针对Transformer中异质注意力结构(如共注意力)带来的多源信息融合挑战,提出一种通用解释方法,并通过实验分析范式对代表性模型进行语义和逻辑解释。

详情
AI中文摘要

Transformer极大地推动了人工智能的发展,也推动了智能体(agent)的发展。我们将Transformer的注意力结构根据输入信息的来源分为两类:同质注意力结构和异质注意力结构。异质注意力结构以共注意力(co-attention)为典型例子,处理来自不同来源的信息。异质注意力结构是Transformer模型实现更复杂功能、融合更多模态信息的基础。无论是出于研究目的还是政策要求,对具有异质注意力结构的Transformer模型进行解释都是一项重要任务。来自不同来源的信息融合带来了新的挑战。我们的工作主要包括方法和实验两部分。在方法方面,我们提出了一种针对具有异质注意力结构的Transformer模型的解释方法。在实验方面,基于我们的实验分析范式,我们解释代表性模型的操作机制,进行语义解释和逻辑解释。

英文摘要

Transformer has significantly propelled the development of artificial intelligence, and certainly the development of agents as well. We categorize attention structures of Transformer into two types based on the source of the input information: homogenous and heterogenous attention structures. Heterogenous attention structures, with co-attention as a typical example, process information from different sources. Heterogenous attention structure is the foundation for Transformer models to achieve more complex functions and integrate more modal information. Whether for research purposes or policy requirements, the interpretation of Transformer models with heterogenous attention structures is an important task. The fusion of information from different sources brings new challenges. Our work mainly includes two parts: method and experimentation. In terms of method, we propose an interpretation method for Transformer models with heterogenous attention structures. In terms of experimentation, based on our experimental analysis paradigm, we interpret the operating mechanisms of representative models, conduct semantic interpretation and logical interpretation.

2605.27095 2026-06-02 cs.LG 版本更新

Adversarial Dual On-Policy Distillation from Expressive Teacher

来自表达性教师的对抗性双在线策略蒸馏

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor W. Tsang, Yang You

发表机构 * National University of Singapore(新加坡国立大学) University of Technology Sydney(悉尼大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出FA-OPD方法,通过对抗性双在线策略蒸馏,结合流匹配教师和轻量MLP学生,利用奖励和动作双通道信号,在机器人导航、操作和运动任务中优于强基线,且对噪声和有限演示鲁棒。

Comments arXiv admin note: substantial text overlap with arXiv:2510.09222

详情
AI中文摘要

在具身控制中从演示学习通常被建模为行为克隆,最近的扩散或流匹配策略通过建模多模态专家动作改进了这一范式。然而这些方法仍然是离线监督学习:策略仅在专家状态上训练,在其实际访问的状态上得不到纠正信号。在线策略蒸馏(OPD)提供了一种自然的补救措施,但标准OPD假设一个强大的固定教师,这在仅演示控制中不可用。我们提出 extbf{FA-OPD},一种\emph{对抗性双在线策略蒸馏}方法,其中从演示学习流匹配(FM)教师,并与轻量MLP学生共同训练。教师在学生 rollout 上提供两种互补信号。奖励通道学习状态-动作对上的专家似然目标,并通过长视界策略优化驱动在线探索。动作通道在学生访问的状态提供密集的局部目标,稳定利用。FA-OPD耦合两者,使得奖励蒸馏能够实现超越逐点演示的泛化,而动作蒸馏使探索锚定在接近专家行为附近。在六个机器人导航、操作和运动基准上,FA-OPD击败了强基线,并在噪声或有限演示下表现出更强的鲁棒性。源代码:https://github.com/vanzll/FA-OPD。

英文摘要

Learning from demonstrations in embodied control is often cast as behavioral cloning, and recent diffusion or flow-matching policies improve this paradigm by modeling multi-modal expert actions. Yet these methods remain offline supervised learners: the policy is trained only on expert states and receives no corrective signal on the states it actually visits. On-policy distillation (OPD) offers a natural remedy, but standard OPD assumes a strong fixed teacher, which is unavailable in demonstration-only control. We propose \textbf{FA-OPD}, an \emph{adversarial dual on-policy distillation} method in which a Flow Matching (FM) teacher is learned from demonstrations and co-trained with a lightweight MLP student. The teacher provides two complementary signals on student rollouts. The reward channel learns an expert-likeness objective over state-action pairs and drives online exploration through long-horizon policy optimization. The action channel supplies dense local targets at student-visited states, stabilizing exploitation. FA-OPD couples them so that reward distillation enables generalization beyond point-wise demonstrations, while action distillation keeps exploration anchored near expert-like behavior. Across six robot navigation, manipulation, and locomotion benchmarks, FA-OPD beats strong baselines and shows much stronger robustness under noisy or limited demonstrations. Source code: https://github.com/vanzll/FA-OPD.

2605.26919 2026-06-02 cs.LG stat.ML 版本更新

Agile Online Model Selection: Resolving Adaptation Lag via Safeguarded Large Learning Rates

敏捷在线模型选择:通过受保护的大学习率解决适应滞后

Kei Takemura, Ryuta Matsuno, Keita Sakuma

发表机构 * NEC Corporation(日本电报电话株式会社)

AI总结 提出一种乐观在线镜像下降算法,利用受保护的大学习率(高达Θ(T))并引入事后惩罚机制,在非平稳环境中实现快速适应,同时保持近最优的遗憾界。

Comments Accepted to KDD 2026

详情
AI中文摘要

在非平稳环境中保持预测准确性需要在线模型选择来自主适应未知的分布变化。然而,现有的免调参算法在鲁棒性和敏捷性之间存在根本性权衡。具体来说,为了确保动态遗憾界,它们必须将学习率限制为小常数(例如,$O(1)$)。这种限制不可避免地会在突变期间导致显著的适应滞后。为了解决这个问题,我们提出了一种新颖的乐观在线镜像下降算法,该算法利用受保护的大学习率,最高可达$Θ(T)$,其中$T$是轮数。我们的关键技术贡献是一种事后惩罚机制,该机制动态监控不稳定的更新,并排除导致过度遗憾的学习率,从而消除了限制性先验约束的需要。我们证明了累积惩罚仍为$O(\log T)$,使得我们的算法在良性情况下实现优越的速率的同时,匹配接近最优的最坏情况保证。在三个合成数据集和十一个多样化真实世界数据集上的实证评估表明,我们的方法将适应滞后从数百轮减少到几轮,始终优于免调参基线。

英文摘要

Maintaining predictive accuracy in non-stationary environments requires online model selection to adapt autonomously to unknown distribution shifts. However, existing tuning-free algorithms face a fundamental trade-off between robustness and agility. Specifically, to ensure dynamic regret bounds, they must restrict learning rates to small constants (e.g., $O(1)$). This restriction inevitably causes significant adaptation lag during abrupt changes. To resolve this, we propose a novel optimistic online mirror descent that utilizes safeguarded large learning rates up to $Θ(T)$, where $T$ is the number of rounds. Our key technical contribution is a post-hoc penalty mechanism that dynamically monitors unstable updates and excludes learning rates incurring excessive regret, eliminating the need for restrictive a priori constraints. We show that the cumulative penalty remains $O(\log T)$, allowing our algorithm to match near-optimal worst-case guarantees while achieving superior rates in benign cases. Empirical evaluations on three synthetic and eleven diverse real-world datasets demonstrate that our approach reduces the adaptation lag from hundreds of rounds to a few rounds, consistently outperforming tuning-free baselines.

2605.26874 2026-06-02 cs.DB cs.AI cs.LG 版本更新

Knowledge Graphs as the Missing Data Layer for LLM-Based Industrial Asset Operations

知识图谱:基于LLM的工业资产运营中缺失的数据层

Madhulatha Mandarapu, Sandeep Kunkunuru

发表机构 * VaidhyaMegha Private Limited, India(印度VaidhyaMegha私人有限公司)

AI总结 研究通过类型化知识图谱作为数据层,将GPT-4在工业维护场景中的准确率从65%提升至99%,并引入生成增强知识(GAK)处理缺失数据,实现81.8%的场景可回答性。

Comments v2: reframed around the knowledge graph as a grounding substrate with a 3-tier router (text-to-Cypher; native graph/optimization primitives; generation-augmented knowledge, GAK). Adds a benchmark-grounded GAK evaluation on 88 real non-deterministic AssetOpsBench scenarios with provenance-tagged enrichment. 18 pages. Code: github.com/samyama-ai/assetops-kg

详情
AI中文摘要

基于LLM的工业资产运营代理在处理平面文档存储时准确性有限。AssetOpsBench(KDD 2026)表明,GPT-4代理在139个工业维护场景中达到65%的准确率,并比较了LLM编排范式(Agent-As-Tool vs. Plan-Execute)在固定数据层上的表现。我们提出一个正交问题:工具背后的数据模型有多重要?我们将类型化知识图谱作为基础基质,并根据最佳回答方式路由每个问题:(i)LLM生成的Cypher进行结构化检索,将同一GPT-4模型从65%提升至82-83%;(ii)原生图和优化原语(无需LLM)在图可回答场景中达到99%;(iii)生成增强知识(GAK)用于处理数据中缺失的答案——引擎的代理将缺失事实实现为带有溯源标签的图节点,然后回答。一个反复出现的主题是反向LLM使用:我们约束LLM从类型化模式生成查询或一次性丰富,让图确定性地执行。在88个真实的AssetOpsBench故障模式场景中(基准本身标记为非确定性——图中缺失十种设备类型),GAK将可回答性从零提升至100%的设备类型,并回答了81.8%的场景,每个实现的事实都标记为来源:LLM派生以确保可审计性。我们还贡献了40个图原生场景。对于结构化操作领域,数据层——而非LLM编排——是主要杠杆,类型化知识图谱充当原始工业数据与LLM推理之间的基础基质。

英文摘要

LLM-based agents for industrial asset operations show limited accuracy when reasoning over flat document stores. AssetOpsBench (KDD 2026) establishes that GPT-4 agents achieve 65% on 139 industrial maintenance scenarios, and compares LLM orchestration paradigms (Agent-As-Tool vs. Plan-Execute) on a fixed data layer. We ask the orthogonal question: how much does the data model behind the tools matter? We treat a typed knowledge graph as a grounding substrate and route each question by how it is best answered: (i) LLM-generated Cypher for structured retrieval, which lifts the same GPT-4 model from 65% to 82-83%; (ii) native graph and optimization primitives, with no LLM, reaching 99% on graph-answerable scenarios; and (iii) generation-augmented knowledge (GAK) for answers absent from the data -- the engine's agent materializes the missing facts as provenance-tagged graph nodes, then answers. A recurring theme is inverted LLM usage: we constrain the LLM to query generation or one-shot enrichment from a typed schema and let the graph execute deterministically. On the 88 real AssetOpsBench failure-mode scenarios the benchmark itself flags non-deterministic -- ten equipment types absent from the graph -- GAK lifts answerability from zero to 100% of equipment types and answers 81.8% of scenarios, every materialized fact tagged source:LLM-derived for auditability. We also contribute 40 graph-native scenarios. For structured operational domains the data layer -- not the LLM orchestration -- is the primary lever, and a typed knowledge graph serves as a grounding substrate between raw industrial data and LLM reasoning.

2605.26684 2026-06-02 cs.LG cs.AI 版本更新

Beyond Trajectory-Level Attribution: Graph-Based Credit Assignment for Agentic Reinforcement Learning

超越轨迹级归因:基于图的智能体强化学习信用分配

Xin Cheng, Shuo He, Lang Feng, HaiYang Xu, Ming Yan, Lei Feng, Bo An

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出GraphGPO方法,通过构建状态转移图并利用全局信息估计各状态到任务目标的距离,实现步骤级信用分配,提升训练效率和性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

基于组的强化学习方法在提升大型语言模型性能方面取得了显著成功,并已迅速扩展到智能体任务。然而,其信用分配严重依赖于根据最终结果进行的粗粒度轨迹级归因,难以捕捉单个步骤的贡献,例如失败轨迹中被掩盖的有价值步骤。为了揭示潜在信息并实现更忠实的步骤级信用分配,我们提出基于图的组策略优化(GraphGPO),该方法首先将所有 rollout 轨迹聚合为一个统一的状态转移图,然后利用图中编码的全局信息估计每个状态到任务目标的距离。最后,GraphGPO 通过估计基于图的优势函数,根据转移减少到任务目标距离的程度,为每条边分配信用。通过这种方式,GraphGPO 显著提高了训练效率,并在多个具有挑战性的基准测试中取得了最先进的性能。

英文摘要

Group-based reinforcement learning (RL) methods have achieved remarkable success in improving the performance of large language models (LLMs) and have been rapidly extended to agentic tasks. However, their credit assignment relies heavily on coarse-grained trajectory-level attribution according to final outcomes, making it difficult to capture the contribution of individual steps, such as valuable steps obscured within failed trajectories. To uncover latent information and enable more faithful step-level credit assignment, we propose Graph-based Group Policy Optimization (GraphGPO), which first aggregates all rollout trajectories into a unified state-transition graph and then estimates the distance from each state to the task goal using the global information encoded in the graph. Finally, GraphGPO assigns credit to each edge by estimating a graph-based advantage, based on how much the transition reduces the distance to the task goal. In this way, GraphGPO significantly improves training efficiency and achieves state-of-the-art performance across a range of challenging benchmarks.

2605.26660 2026-06-02 cs.LG 版本更新

WINDQuant: Weight-Informed Neural Decision-Making for Global Mixed-Precision LLM Quantization

WINDQuant: 权重感知的全局混合精度大语言模型量化神经决策

Phong Nam Huu Nguyen, Khoi M. Le, Cong-Duy T Nguyen, Anh Tuan Luu, Thong Thanh Nguyen, Tho Quan

发表机构 * CAIR, VinUniversity, Vietnam(越南 VinUniversity 的 CAIR) Ho Chi Minh City University of Technology (HCMUT), VNU-HCM, Ho Chi Minh City, Vietnam(越南胡志明市技术大学 (HCMUT)) CCDS, Nanyang Technological University, Singapore(新加坡南洋理工大学的 CCDS) School of Computing, National University of Singapore, Singapore(新加坡国立大学计算机学院)

AI总结 提出基于强化学习的WINDQuant控制器,在全局存储预算下为列块分配位宽和量化策略,实现超低位LLM量化的细粒度混合精度,性能优于传统方法。

详情
AI中文摘要

量化是减少大语言模型(LLM)内存占用和推理成本的有效方法,但在超低位宽下保持性能仍具挑战。现有的后训练方法常遭受严重的精度下降,而量化感知训练则需要昂贵的重训练和额外资源。此外,大多数混合精度策略依赖于粗粒度或启发式敏感性分析,忽略了权重矩阵内的细粒度变化。我们提出WINDQuant,一种基于强化学习的分配控制器,用于超低位LLM量化。WINDQuant并非引入另一种低位量化算子,而是学习如何在全局存储预算下为细粒度列块分配位宽和量化处理。通过在列块级别操作,WINDQuant能够在全局目标位宽下实现层内灵活且细粒度的精度分配。实现结合了PPO与激活感知校准、轻量级每单元量化器拟合以及学习到的混合精度方案的显式有效位计算。在LLaMA模型上的实验表明,WINDQuant在超低位设置下实现了竞争性能,同时相对于基于重训练的方法降低了优化开销,凸显了强化学习作为自适应混合精度量化实用控制器的潜力。

英文摘要

Quantization is an effective approach to reduce the memory footprint and inference cost of large language models (LLMs), yet maintaining performance in the ultra-low-bit regime remains challenging. Existing post-training methods often suffer from severe accuracy degradation, while quantization-aware training requires costly retraining and additional resources. Moreover, most mixed-precision strategies rely on coarse-grained or heuristic sensitivity analysis that overlooks fine-grained variations within weight matrices. We propose WINDQuant, a reinforcement-learning-based allocation controller for ultra-low-bit LLM quantization. Rather than introducing another low-level quantization operator, WINDQuant learns how to assign bit-widths and quantization treatments to fine-grained column chunks under a global storage budget. By operating at the column-chunk level, WINDQuant enables flexible and fine-grained precision assignment within layers under a global target bit-width. The implementation combines PPO with activation-aware calibration, lightweight per-unit quantizer fitting, and explicit effective-bit accounting of the learned mixed-precision plan. Experiments on LLaMA models demonstrate that WINDQuant achieves competitive performance in ultra-low-bit settings while reducing optimization overhead relative to retraining-based approaches, highlighting reinforcement learning as a practical controller for adaptive mixed-precision quantization.

2605.26632 2026-06-02 cs.LG 版本更新

RT-Lynx: Putting the GEMM Sparsity In a Right Way for Diffusion Models

RT-Lynx:以正确方式将GEMM稀疏性应用于扩散模型

Xing Cong, Hanlin Tang, Kan Liu, Lan Tao, Lin Qu, Chenhao Xie

发表机构 * Alibaba Group(阿里巴巴集团) Independent Researcher(独立研究者)

AI总结 针对扩散模型推理成本高的问题,提出将N:M半结构化稀疏性从权重转移到激活,结合误差补偿技术,实现线性层平均1.55倍加速且保持生成质量。

Comments 33 pages, 18 figures, Accepted by ICML 2026

详情
AI中文摘要

扩散Transformer(DiT)在图像生成中表现出色,但推理成本高昂。先前工作通过量化和蒸馏降低了成本,但半结构化稀疏性(可将近减半FLOPs)仍未得到充分探索。一个关键原因是现有方法大多关注权重稀疏化,而剪枝50%的权重会移除关键模型容量并降低生成质量。然而,我们的研究表明,DiT激活本质上是稀疏的,并且比权重对N:M半结构化稀疏化更鲁棒。受此观察启发,我们倡导从权重稀疏化到激活稀疏化的范式转变。我们提出RT-Lynx,将N:M稀疏化应用于激活,并结合误差补偿技术以减轻精度损失。我们进一步针对此设置实现了高度优化的CUDA内核,在线性层中平均实现高达1.55倍的加速。在多个扩散模型上的大量实验表明,我们的方法在显著加速推理的同时,保持了原始模型的生成质量。

英文摘要

Diffusion Transformers (DiT) achieve strong performance in image generation but incur substantial inference costs. While prior work has reduced this cost via quantization and distillation, semi-structured sparsity, which can nearly halve FLOPs, remains underexplored. A key reason is that most existing approaches focus on weight sparsification, and pruning 50% of the weights can remove critical model capacity and degrade generation quality. Our study, however, shows that DiT activations are intrinsically sparse and significantly more robust to N:M semi-structured sparsification than weights. Motivated by this observation, we advocate a paradigm shift from weight sparsification to activation sparsification. We propose RT-Lynx, which applies N:M sparsification to activations and incorporates error-compensation techniques to mitigate accuracy loss. We further implement highly optimized CUDA kernels tailored to this setting, achieving up to a 1.55x speedup on average in linear layers. Extensive experiments across multiple diffusion models demonstrate that our method preserves the generation quality of the original models while substantially accelerating inference.

2605.26310 2026-06-02 cs.LG cs.NA math.NA 版本更新

Classification and detection of multiple UAVs using rational Gaussian wavelet neural networks

基于有理高斯小波神经网络的多无人机分类与检测

Ungvári Gergő, Ferenc Braun, Attila Ámon, Péter Kackstädter, János Volk, Péter Kovács, Tamás Dózsa

发表机构 * Department of Numerical Analysis(数值分析系) Eötvös Loránd University(欧多维奇·劳尔大学) System and Control Laboratory(系统与控制实验室) HUN-REN Institute for Computer Science and Control(HUN-REN计算机科学与控制研究所) UniDistanceSuisse Siemens Mobility Kft.(西门子移动有限公司) HUN-REN Centre for Energy Research(HUN-REN能源研究中心)

AI总结 提出一种利用麦克风声音信号和有理高斯小波自适应特征提取的神经网络,用于检测和分类单个及群体无人机,在室内外噪声环境中优于传统方法且保持可解释性。

Comments 19 pages, 4 figures

详情
AI中文摘要

无人机(UAV)的检测对于保护民用和军事基础设施至关重要。本文提出一种成本有效的无人机检测系统,使用从麦克风获取的声音信号。记录的信号通过信号处理流水线,该流水线采用所谓的理性高斯小波作为可解释的自适应特征提取器。这些自适应小波变换嵌入到一个底层小神经网络中,并与之一同训练,基于提取的特征检测和分类无人机。这产生了一个物理可解释的机器学习算法,除了分类无人机外,还能够检测无人机群。我们使用室内工作室和嘈杂室外环境收集的数据展示了结果。我们得出结论,所提出的方法在检测和分类单个无人机以及无人机群方面优于传统的机器学习方法,同时保持了高度的可解释性。我们公开了所提出方法的实现,以便可重复性。

英文摘要

The detection of unmanned aerial vehicles (UAVs) is important for the protection of civilian and military infrastructure. In this paper we propose a cost effective UAV detection system using sound signals obtained from microphones. The recorded signals are passed through a signal processing pipeline which employs interpretable adaptive feature extractors using so-called rational Gaussian wavelets. These adaptive wavelet transformations are embedded into and trained together with an underlying small neural network which detects and classifies UAVs based on the obtained features. This leads to a physically interpretable machine learning algorithm that in addition to classifying UAVs is also capable of detecting UAV swarms. We demonstrate our results using data collected in indoor studio and noisy outdoor environments. We conclude that the proposed method outperforms traditional machine learning approaches for detecting and classifying single UAVs as well as drone swarms, while retaining a high degree of interpretability. Our implementation of the proposed methods is made publicly available for reproducibility.

2605.26068 2026-06-02 cs.LG cs.AI 版本更新

Rethinking Weak Supervision in Anomaly Detection: A Comprehensive Benchmark

重新思考异常检测中的弱监督:一个综合基准

Xu Yao, Siyuan Zhou, Zhenbo Wu, Chaochuan Hou, Shuang Liang, Shiping Wang, Hailiang Huang, Songqiao Han, Minqi Jiang

发表机构 * Shanghai University of Finance and Economics(上海金融学院) Ant Group(蚂蚁集团) Key Laboratory of Interdisciplinary Research of Computation and Economics(计算与经济交叉学科重点实验室)

AI总结 提出WSADBench,首个统一评估不完全、不精确和不准确三种弱监督异常检测场景的基准,通过系统变化标签数量、粒度和质量,揭示36种算法在4种模态上的性能边界,并发现弱监督场景间存在强相关性、专用WSAD算法仅在极端标签稀缺时占优等关键洞察。

Comments Accepted at KDD 2026 Datasets and Benchmarks Track

详情
AI中文摘要

弱监督异常检测(WSAD)已发展出三个主要方向:不完全监督、不精确监督和不准确监督。然而,这些方向仍然相互孤立,缺乏一个统一的框架来评估它们是否解决独特的挑战或共享基本机制。本文介绍了WSADBench,这是第一个统一评估不同弱监督场景的基准,对从专用WSAD方法到先进表格基础模型的多种方法进行基准测试。WSADBench通过系统变化标签数量、粒度和质量,建立了标准化协议来评估4种模态上的36种算法,揭示了各种方法的性能边界。基于超过70万次实验,WSADBench揭示了四个关键见解:(i)这些弱监督场景之间存在强内在相关性,挑战了当前研究方向的孤立性。(ii)专用WSAD算法仅在极端标签稀缺情况下表现出色,但随着监督增加或在OOD场景中,很快被表格基础模型和通用分类方法主导。(iii)未标记数据在不同设置中的效用不一致,与标签细化相比收益微乎其微。(iv)模型对不同类型的标签噪声表现出不对称敏感性。我们发布WSADBench作为开源基准,包含代码和数据集,以促进未来的WSAD研究:https://github.com/SUFE-AILAB/WSADBench。

英文摘要

Weakly supervised anomaly detection (WSAD) has developed in three primary directions: incomplete, inexact, and inaccurate supervision. However, these directions remain isolated, lacking a unified framework to assess whether they address unique challenges or share fundamental mechanisms. This paper introduces WSADBench, the first benchmark that unifies evaluation across distinct weakly supervised scenarios, benchmarking diverse approaches from specialized WSAD methods to advanced tabular foundation models. WSADBench establishes standardized protocols to evaluate 36 algorithms across 4 modalities by systematically varying label quantity, granularity, and quality, revealing the performance boundaries of various methods. Based on over 700K experiments, WSADBench reveals four critical insights: (i) Strong intrinsic correlations exist between these weak supervision scenarios, challenging the isolation of current research directions. (ii) Specialized WSAD algorithms excel only in extreme label-scarcity regimes but are quickly dominated by tabular foundation models and general classification methods as supervision increases or in OOD scenarios. (iii) Unlabeled data shows inconsistent utility across settings, with marginal gains compared to label refinement. (iv) Models exhibit asymmetric sensitivity to different types of label noise. We release WSADBench as an open-source benchmark with code and datasets to facilitate future WSAD research: https://github.com/SUFE-AILAB/WSADBench.

2605.30290 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Self-Trained Verification for Training- and Test-Time Self-Improvement

自训练验证用于训练和测试时的自我改进

Chen Henry Wu, Aditi Raghunathan

发表机构 * arXiv

AI总结 提出自训练验证(STV)方法,通过让验证器模仿参考解决方案下的自身版本,解决自我改进中验证器瓶颈问题,在测试时显著提升验证-细化循环,在训练时通过验证器在环训练(ViL)进一步提升生成器性能。

详情
AI中文摘要

大规模自我改进一直是推理模型的长期目标,有两个自然的实现阶段:测试时,通过验证-细化(V-R)循环;训练时,通过自训练方法。两者都受限于同一个瓶颈:验证器。当验证器得分膨胀而准确率停滞,且反馈过于泛化无法执行时,V-R循环会停滞;当糟糕的自生成数据被加入训练时,自训练同样会失败。更好的验证将解锁两者,但我们想要训练的能力,即捕捉自生成的错误,缺乏训练信号。为了解决这一挑战,我们提出了自训练验证(STV)。我们的关键观察是,虽然模型单独无法捕捉这些错误,但当它看到参考解决方案时却可以。我们将这种不对称性转化为监督目标,训练验证器模仿自身更具信息量的版本。在测试时,STV在困难问题上显著改进了V-R循环,而替代方法(如SFT、对验证器分数进行RL,甚至元验证器)则不然。STV在困难数学任务上大致使准确率翻倍,在科学推理任务上提升14倍(从1.5%到21%)。在训练时,我们额外使用STV验证器在V-R循环内的反馈对生成器进行RL训练——我们称之为验证器在环训练(ViL)。从一个RL收敛的生成器开始,ViL在pass@1上进一步获得33%的提升。更值得注意的是,生成器在测试时无验证器的独立pass@1相对标准RL收敛点提升了30%。因此,困难问题推理的下一个前沿可能在于我们如何训练用于验证和与验证结合的方法。网站:https://ar-forum.github.io/stv-webpage

英文摘要

Self-improvement at scale has been a longstanding goal for reasoning models, and there are two natural places to do it: at test time, through verification-refinement (V-R) loops; and at training time, through self-training methods. Both are gated by the same bottleneck: the verifier. V-R loops stall when verifier scores inflate while accuracy stagnates, and when feedback is too generic to act on; self-training fails similarly when bad self-generated data are added to training. Better verification would unlock both, but the capability we want to train, i.e., catching self-generated errors, lacks training signal. To address this challenge, we propose self-trained verification (STV). Our key observation is that, while a model cannot catch these errors alone, it can when shown the reference solution. We turn this asymmetry into a supervision target and train the verifier to imitate a more informed version of itself. At test time, STV substantially improves V-R loops on hard problems, while alternatives (e.g., SFT, RL on verifier scores, and even meta-verifiers) do not. STV roughly doubles accuracy on hard math and lifts it 14x on scientific reasoning tasks (1.5% to 21%). At training time, we additionally train the generator using RL with STV verifier's feedback inside the V-R loop - a procedure we call verifier-in-the-loop training (ViL). Starting from an RL-converged generator, ViL yields a further 33% gain in pass@1. More notably, the generator's standalone pass@1, with no verifier at test time, climbs 30% relative past where standard RL had converged. Hence, the next frontier in reasoning on hard problems may lie in how we train for and with verification. Website: https://ar-forum.github.io/stv-webpage

2605.30237 2026-06-02 cs.IR cs.CL cs.LG 版本更新

GRASP: Plan-Guided Graph Retrieval with Adaptive Fusion and Reranking on Semi-Structured Knowledge Bases

GRASP:半结构化知识库上具有自适应融合和重排序的计划引导图检索

Yicheng Tao, Yiqun Wang, Xiangchen Song, Xin Luo, Kai Liu, Jie Liu

发表机构 * Department of Electrical Engineering and Computer Science, University of Michigan(密歇根大学电气工程与计算机科学系) Department of Computational Medicine and Bioinformatics, University of Michigan(密歇根大学计算医学与生物信息学系)

AI总结 提出GRASP框架,通过计划引导图检索、条件融合和重排序三阶段,在半结构化知识库上实现最先进的检索性能。

详情
AI中文摘要

半结构化知识库(SKBs)将文本文档嵌入实体和关系的类型化图中,并支持产品搜索、学术论文搜索和精准医学查询等应用。现有的SKB混合检索系统要么仅将图用于查询扩展,要么在全局加权下混合文本和结构分支,要么依赖微调的图遍历生成器。我们提出了GRASP,一个三阶段SKB检索框架,统一了基于计划的图检索、与密集检索器的计划条件融合以及融合候选上的微调重排序。GRASP在三个STaRK基准测试的每个指标上显著推进了现有技术水平,将平均Hit@1从62.0提升至73.9。消融和敏感性研究进一步证实了GRASP的有效性和鲁棒性。

英文摘要

Semi-structured knowledge bases (SKBs) embed textual documents in a typed graph of entities and relations, and underpin applications such as product search, academic paper search, and precision-medicine inquiries. Existing hybrid retrieval systems on SKBs either use the graph only for query expansion, mix textual and structural branches under a global weighting, or rely on fine-tuned graph-traversal generators. We present GRASP, a three-stage SKB retrieval framework unifying plan-based graph retrieval, plan-conditioned fusion with a dense retriever, and a fine-tuned reranker over the fused candidates. GRASP substantially advances the state of the art on every metric across the three STaRK benchmarks, lifting average Hit@1 from 62.0 to 73.9. Ablation and sensitivity studies further confirm the effectiveness and robustness of GRASP.

2605.30190 2026-06-02 cs.LG 版本更新

Mean-Field Diffuser: Scaling Offline MARL to Thousands of Agents

均场扩散器:将离线多智能体强化学习扩展到数千个智能体

Wenhao Li, Xiangfeng Wang, Bo Jin

发表机构 * Tongji University(同济大学) East China Normal University(华东师范大学)

AI总结 提出MF-Diffuser框架,通过将轨迹规划提升到轨迹分布的Wasserstein空间,并利用混沌传播和分层粗到细策略,将离线多智能体强化学习扩展到数千个智能体,理论证明均场近似误差为$O(H^2/\sqrt{N})$且离线分布偏移不随群体规模增长,实验在三个均场RL基准上取得最佳回报。

Comments 62 pages, 15 figures, 16 tables

详情
AI中文摘要

基于扩散的规划在单智能体离线强化学习中取得了显著成果,但由于联合轨迹空间的维度灾难,扩展到多智能体系统仍然难以处理。我们引入了MF-Diffuser,这是一个将轨迹规划提升到轨迹分布的Wasserstein空间的框架,其中混沌传播确保一个小的代表性智能体子集能够捕捉整个群体的动态。我们的方法包括一个值加权的混沌熵目标,该目标协调生成保真度与回报最大化,以及一个分层粗到细策略,在去噪过程中逐步增加智能体群体。我们建立了端到端的次优性界,包含四个可解释项,揭示了均场近似误差以$O(H^2/\sqrt{N})$缩放,而离线分布偏移被证明不会随群体规模$N$增长,并证明生成的策略是一个近似的均场纳什均衡,具有明确的收敛保证。在三个均场RL基准(包括阶段博弈、序列动态和对抗性团队竞争)上的实验表明,MF-Diffuser在大多数设置中实现了最佳回报,在次优离线数据和极端规模($N \geq 10^3$)下增益最大。

英文摘要

Diffusion-based planning has achieved strong results in single-agent offline reinforcement learning, yet scaling to many-agent systems remains intractable due to the curse of dimensionality in the joint trajectory space. We introduce MF-Diffuser, a framework that lifts trajectory planning to the Wasserstein space of trajectory distributions, where the propagation of chaos ensures a small representative subset of agents captures the full population dynamics. Our approach features a value-weighted chaotic entropy objective that reconciles generative fidelity with return maximization, and a hierarchical coarse-to-fine strategy that progressively grows the agent population during denoising. We establish end-to-end suboptimality bounds with four interpretable terms, revealing that mean-field approximation error scales as $O(H^2/\sqrt{N})$ while offline distribution shift provably does not grow with population size $N$, and prove the generated policy is an approximate mean-field Nash equilibrium with explicit convergence guarantees. Experiments on three mean-field RL benchmarks -- spanning stage games, sequential dynamics, and adversarial team competition -- show MF-Diffuser achieves the best return in the majority of settings, with the largest gains on suboptimal offline data and at extreme scales ($N \geq 10^3$).

2605.30188 2026-06-02 cs.LG cs.AI stat.ML 版本更新

CalArena: A Large-Scale Post-Hoc Calibration Benchmark

CalArena:大规模事后校准基准

Eugène Berta, David Holzmüller, Francis Bach, Michael I. Jordan

发表机构 * Inria - Ecole Normale Supérieure PSL Research University(法国国家科学研究中心-巴黎高等师范学院-巴黎-萨克雷大学)

AI总结 提出CalArena大规模标准化基准,涵盖近2000个实验,通过事后改进(PHI)原则比较多种校准方法,发现平滑校准函数优于分箱方法,专用多类方法在高维场景中至关重要。

Comments 30 pages, 9 figures

详情
AI中文摘要

可靠的概率估计在许多机器学习应用中至关重要,但现代分类器往往校准不佳。事后校准提供了一种简单且广泛使用的解决方案,但由于提出的方法众多,加上小规模和不一致的评估,很难确定哪些方法在实践中真正有效。我们引入了一个大规模、标准化的事后校准基准,涵盖表格和计算机视觉任务的近2000个实验,包括二分类、多分类和大规模分类设置。我们的基准汇集了来自多种经典模型、现代深度学习架构和基础模型的预测,并在通用评估框架内提供了数十种校准方法的统一、可重复实现。我们认为,在适当评分规则下的事后改进(PHI)为比较事后方法提供了传统校准误差估计器的原则性替代方案,同时捕捉校准质量和模型预测性能的潜在退化。利用这一框架,我们进行了迄今为止最全面的事后校准实证研究。我们的结果揭示了跨领域的一致模式:平滑校准函数优于基于分箱的方法,专用多类方法在高维设置中至关重要,而通用机器学习模型在没有校准特定设计的情况下不具备竞争力。为促进未来研究,我们发布了所有数据、代码和评估工具,为开发和比较校准方法提供了一个即插即用的基准。

英文摘要

Reliable probability estimates are critical in many machine learning applications, yet modern classifiers are often poorly calibrated. Post-hoc calibration provides a simple and widely used solution, but the large number of proposed methods, combined with small-scale and inconsistent evaluations, makes it difficult to determine which approaches are truly effective in practice. We introduce a large-scale, standardized benchmark for post-hoc calibration, covering nearly 2000 experiments across tabular and computer vision tasks, including binary, multiclass, and large-scale classification settings. Our benchmark aggregates predictions from a diverse set of classical models, modern deep learning architectures, and foundation models, and provides unified, reproducible implementations of dozens of calibration methods within a common evaluation framework. We argue that Post-Hoc Improvement (PHI) in proper scoring rules offers a principled alternative to traditional calibration error estimators for comparing post-hoc methods, capturing both calibration quality and potential degradation to the model's predictive performance. Using this framework, we conduct the most comprehensive empirical study of post-hoc calibration to date. Our results reveal consistent patterns across domains: smooth calibration functions outperform binning-based approaches, dedicated multiclass methods are essential in high-dimensional settings, and generic machine learning models are not competitive without calibration-specific design. To facilitate future research, we release all data, code, and evaluation tools, providing a plug-and-play benchmark for developing and comparing calibration methods.

2605.30122 2026-06-02 cs.LG cs.AI 版本更新

Beyond MSE: Improving Precipitation Nowcasting with Multi-Quantile Regression

超越MSE:利用多分位数回归改进降水临近预报

Gijs van Nieuwkoop, Siamak Mehrkanoon

发表机构 * Department of Information and Computing Sciences, Utrecht University(信息与计算科学系,乌特勒支大学)

AI总结 本文提出将确定性降水临近预报模型的训练目标从均方误差(MSE)改为多分位数回归损失,使用SmaAt-UNet模型在荷兰雷达降水数据上验证,使中心确定性预测的测试集MSE降低8.6%,并输出高分位数预测以改善强降水预测。

Comments 7 pages, 5 figs

详情
AI中文摘要

深度学习降水临近预报模型通常使用逐点损失(如均方误差或平均绝对误差)进行优化,这可能导致预测过于平滑且对强降雨的表示较差。本研究探讨了是否可以通过将训练重新表述为多分位数回归问题来提高已建立的确定性临近预报架构的预测性能。使用SmaAt-UNet作为核心模型,我们在荷兰雷达降水临近预报上比较了MSE、MAE和多分位数pinball损失训练。结果表明,多分位数训练改进了中心确定性预测,与使用MSE训练的模型相比,测试集MSE降低了8.6%,同时产生的高分位数输出对强降水的风险敏感预测很有用。这些发现表明,分位数回归提供了一种简单的替代标准逐点损失的方法,无需新的架构或生成采样过程。我们模型和训练设置的实现可在GitHub上获取。

英文摘要

Deep-learning precipitation nowcasting models are often optimized using pointwise losses such as mean squared error or mean absolute error, which can lead to overly smooth forecasts and poor representation of heavy rainfall. This study investigates whether the predictive performance of an established deterministic nowcasting architecture can be improved by reformulating training as a multi-quantile regression problem. Using SmaAt-UNet as a core model, we compare MSE, MAE, and multi-quantile pinball-loss training on radar precipitation nowcasting over the Netherlands. The results show that multi-quantile training improves the central deterministic forecast, decreasing test-set MSE by 8.6\% compared to a model trained using MSE, while also producing upper-quantile outputs that are useful for risk-sensitive prediction of heavy precipitation. These findings suggest that quantile regression provides a simple alternative to standard pointwise losses without requiring a new architecture or generative sampling procedure. The implementation of our models and training setup is available on \href{https://github.com/gijsvn/Multi-Quantile-Precipitation-Nowcasting}{GitHub}.

2605.29987 2026-06-02 cs.LG cs.CL 版本更新

MIC: Maximizing Informational Capacity in Adaptive Representations via Isotropic Subspace Alignment

MIC: 通过各向同性子空间对齐最大化自适应表示中的信息容量

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 针对多尺度表示中的维度冗余和谱坍缩问题,提出MIC框架,通过各向同性子空间对齐、软坍缩正则化和谱各向同性正则化,结合自蒸馏目标生成语义密集且高判别力的表示,在高压缩场景下显著优于基线。

Comments Accepted at the GlobalSouthML Workshop at ICML 2026. 8 pages, 2 figures

详情
AI中文摘要

尽管多尺度表示学习能够实现弹性维度的嵌入,但嵌套子空间常常遭受维度冗余和谱坍缩的困扰。为了解决这一问题,我们引入了MIC,一个通过各向同性子空间对齐来优化多粒度嵌入几何景观的框架。MIC采用软坍缩正则化(SCR),通过交叉相关惩罚来减轻前缀子空间和残差子空间之间的冗余,同时使用谱各向同性正则化(SIR)确保低维前缀的超球面均匀性。通过自蒸馏目标统一这些策略,MIC生成语义密集的表示,并保持高判别力。我们的实验表明,MIC显著优于标准基线,特别是在维持信息容量最为关键的高压缩场景中。

英文摘要

Although multi-scales representation learning enables elastic-dimension embeddings, nested subspaces often suffer from dimensional redundancy and spectral collapse. To address this, we introduce MIC, a framework that optimizes the geometric landscape of multi-granular embeddings through isotropic subspace alignment. MIC employs Soft Collapse Regularization (SCR) to mitigate redundancy between prefix and residual subspaces via cross-correlation penalties, alongside Spectral Isotropy Regularization (SIR) to ensure hyper-spherical uniformity in low-dimensional prefixes. By unifying these strategies through a self-distillation objective, MIC generates semantically dense representations that maintain high discriminative power. Our experiments demonstrate that MIC significantly outperforms standard baselines, particularly in high-compression scenarios where maintaining informational capacity is most critical.

2605.29977 2026-06-02 cs.CV cs.LG 版本更新

EVL-ECG: Efficient ECG Interpretation With Multi-Aspect Heterogeneous Knowledge Distillation

EVL-ECG:面向多视角异构知识蒸馏的高效心电图解读

Dang Nguyen Hong, Nhi Ngoc-Yen Nguyen, Huy-Hieu Pham

发表机构 * University of Notre Dame(诺丁汉大学)

AI总结 提出EVL-ECG框架,通过多头交叉注意力对齐、最优传输视觉特征匹配和几何结构关系匹配三种创新方法,实现跨架构知识蒸馏,在资源受限环境下高效解读心电图。

Comments 7Accepted at the SD4H Workshop at ICML 2026. 7 pages, 3 figures

详情
AI中文摘要

高保真心电图解读越来越依赖于大规模基础模型,但其在临床边缘护理中的部署仍受到极端计算需求的阻碍。虽然知识蒸馏(KD)是一种有前景的解决方案,但传统方法在跨异构架构传递知识时,无法捕捉心电图信号的复杂时空依赖关系。本文提出EVL-ECG,一个专门用于心脏诊断逻辑跨架构蒸馏的框架。EVL-ECG引入了三种心电图感知创新:(1)多头交叉注意力对齐,协调架构差异以保留细粒度形态特征;(2)基于最优传输的视觉特征匹配,利用最优传输在标记表示不匹配的情况下保持跨心电图导联的全局结构关系;(3)几何结构内关系匹配,蒸馏教师模型的潜在诊断推理。在心电图基准测试上的评估表明,EVL-ECG相比现有基线,AUC提升高达2.4%,临床准确率提升1.1%。值得注意的是,EVL-ECG建立了一个高效的20亿参数心电图基础模型,适用于资源受限的临床环境。

英文摘要

High-fidelity ECG interpretation is increasingly reliant on massive foundation models, yet their deployment in clinical edge-care remains hindered by extreme computational demands. While knowledge distillation (KD) is a promising solution, traditional methods fail to capture the complex spatio-temporal dependencies of ECG signals when transferring knowledge across heterogeneous architectures. In this paper, we propose EVL-ECG, a framework specifically designed for cross-architecture distillation of cardiac diagnostic logic. EVL-ECG introduces three ECG-aware innovations: (1) Multi-Head Cross-Attention Alignment, which harmonizes architectural discrepancies to preserve fine-grained morphological features; (2) Optimal Transport-based Visual Feature Matching, utilizing optimal transport to maintain global structural relationships across ECG leads despite mismatched token representations; and (3) Geometric Intra-Architecture Relation Matching, which distills the latent diagnostic reasoning of the teacher model. Evaluations across ECG benchmarks demonstrate that EVL-ECG yields improvements of up to 2.4% AUC and 1.1% clinical accuracy over existing baselines. Notably, EVL-ECG establishes an efficient 2B-parameter ECG foundation model, suitable for resource-constrained clinical environments.

2605.16415 2026-06-02 cs.CV cs.LG 版本更新

Diffusion Models, Denoiser Architecture and Creativity

扩散模型、去噪器架构与创造力

Itamar Levine, Yair Weiss

发表机构 * The Hebrew University of Jerusalem(海法大学)

AI总结 本文通过理论和实验表明,扩散模型的创造力源于去噪器架构与目标分布之间的相互作用,并指出去噪器架构的归纳偏差必须与真实目标分布高度一致才能成功。

详情
AI中文摘要

扩散模型的创造力是指它们生成与训练数据不同但高度逼真图像的能力。创造力有些令人惊讶,因为已知如果扩散模型中使用的去噪器是给定训练集的贝叶斯最优去噪器,那么模型将简单地复制训练样本。在本文中,我们提出经验和理论结果,表明扩散模型的创造力源于去噪器架构与目标分布之间的相互作用。理论上,我们针对三种不同的去噪器架构(线性、多项式、瓶颈)给出了生成样本分布作为目标分布和去噪器函数的显式形式。经验上,我们表明流行的UNET去噪器架构的微小变化会导致非常不同的创造力形式,并且这些微小变化通常会产生高度不真实的样本。综合来看,我们的结果表明,只有当去噪器架构的归纳偏差与真实目标分布高度一致时,扩散模型才能成功。

英文摘要

The creativity of diffusion models refers to their ability to generate highly realistic images that are different from their training data. Creativity is somewhat surprising since it is known that if the denoiser used in the diffusion model is the Bayes optimal denoiser for a given training set, then the model will simply copy the training samples. In this paper we present empirical and theoretical results that suggest that creativity in diffusion models is due to an interaction between the denoiser architecture and the target distribution. Theoretically, we give explicit forms for the distribution of generated samples as a function of the target distribution and the denoiser architecture for three different denoiser architectures (linear, polynomial, bottleneck). Empirically, we show that small changes in the popular UNET denoiser architecture leads to very different forms of creativity, and these small changes often yield samples that are highly nonrealistic. Taken together, our results show that diffusion models will only be successful if the inductive bias of the denoiser architecture is in strong alignment with the true target distribution.

2605.07804 2026-06-02 cs.LG cs.AI 版本更新

Prune-OPD: Efficient and Reliable On-Policy Distillation for Long-Horizon Reasoning

Prune-OPD:面向长程推理的高效可靠在线策略蒸馏

Zhicheng Yang, Zhijiang Guo, Yifan Song, Minrui Xu, Yongxin Wang, Yiwei Wang, Xiaodan Liang, Jing Tang

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) The Hong Kong University of Science and Technology(香港科技大学) MBZUAI University of California, Merced(加州大学默塞德分校) Sun Yat-sen University(中山大学)

AI总结 提出Prune-OPD框架,通过实时检测学生与教师之间的前缀漂移并动态截断不可靠的轨迹,在减少计算浪费的同时保持或提升长程推理任务的性能。

Comments 17 pages, 8 figures

详情
AI中文摘要

在线策略蒸馏(OPD)利用密集的教师奖励来增强推理模型。然而,将OPD扩展到长程任务暴露了一个关键缺陷:随着学生生成的前缀不可避免地偏离教师的思维过程,教师的密集奖励失去了局部可开发性。继续在这些“漂移”轨迹上生成和评估标记不仅会降低奖励质量,还会导致巨大的计算浪费。为了解决这个问题,我们引入了 extbf{Prune-OPD},一个动态地将训练预算与监督质量对齐的框架。通过持续监控学生和教师预测之间的局部兼容性(例如,通过top-$k$重叠),Prune-OPD实时检测前缀漂移事件。一旦检测到严重漂移,它会单调地降低后续不可靠奖励的权重,并触发动态的轨迹截断。这使得训练过程能够停止无效的生成,并将计算重新分配到可靠的教师监督上。在不同的教师-学生组合中,Prune-OPD始终将计算与监督可靠性对齐。当前缀漂移使得密集的教师奖励不可靠时,它减少了37.6\%--68.0\%的训练时间,同时保持甚至提升了在具有挑战性的基准(AMC、AIME、HMMT)上的性能。当学生-教师兼容性保持较高时,它会通过扩展训练窗口自动保留长上下文监督。这些结果表明,Prune-OPD不是通过盲目缩短轨迹来改进OPD,而是通过将计算重新分配到局部可开发的教师奖励上。

英文摘要

On-policy distillation (OPD) leverages dense teacher rewards to enhance reasoning models. However, scaling OPD to long-horizon tasks exposes a critical flaw: as the student's generated prefix inevitably diverges from the teacher's thought process, the teacher's dense reward loses local exploitability. Continuing to generate and evaluate tokens on these ``drifted'' trajectories not only degrades reward quality but also incurs massive computational waste. To address this, we introduce \textbf{Prune-OPD}, a framework that dynamically aligns training budgets with supervision quality. By continuously monitoring the local compatibility between student and teacher predictions (e.g., via top-$k$ overlap), Prune-OPD detects prefix-drift events in real time. Upon detecting severe drift, it monotonically down-weights subsequent unreliable rewards and triggers dynamic rollout truncation. This allows the training process to halt futile generation and reallocate compute strictly to reliable teacher supervision. Across diverse teacher-student combinations, Prune-OPD consistently aligns computation with supervision reliability. When prefix drift makes dense teacher rewards unreliable, it reduces training time by 37.6\%--68.0\% while preserving, and often improving, performance on challenging benchmarks (AMC, AIME, HMMT). When student-teacher compatibility remains high, it automatically preserves long-context supervision by expanding the training window. These results suggest that Prune-OPD improves OPD not by blindly shortening rollouts, but by reallocating computation toward locally exploitable teacher rewards.

2602.14307 2026-06-02 cs.AI cs.LG 版本更新

Benchmarking at the Edge of Comprehension

在理解边缘的基准测试

Samuele Marro, Jialin Yu, Emanuele La Malfa, Oishi Deb, Jiawei Li, Yibo Yang, Ebey Abraham, Sunando Sengupta, Eric Sommerlade, Michael Wooldridge, Philip Torr

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出Critique-Resilient Benchmarking框架,通过对抗性生成-评估游戏在人类理解受限时比较模型,利用批判韧性正确性概念和分项Bradley-Terry模型对LLM进行排序。

详情
AI中文摘要

随着前沿大型语言模型(LLMs)在新基准发布后迅速饱和,基准测试本身正处于一个转折点:如果前沿模型持续改进,人类将越来越难以生成具有区分度的任务、提供准确的真实答案或评估复杂解决方案。如果基准测试变得不可行,我们衡量AI进展的能力将受到威胁。我们将这种情况称为后理解阶段。在这项工作中,我们提出了Critique-Resilient Benchmarking,一种对抗性框架,旨在即使在人类完全理解不可行的情况下也能比较模型。我们的技术依赖于批判韧性正确性的概念:如果没有对手令人信服地证明答案错误,则该答案被视为正确。与标准基准测试不同,人类充当有界验证者,专注于局部声明,从而在超出任务完全理解的情况下保持评估完整性。使用分项二分Bradley-Terry模型,我们联合对LLM进行排序,依据其解决挑战性任务的能力和生成困难但可解问题的能力。我们在数学领域展示了该方法在八个前沿LLM上的有效性,表明所得分数稳定且与外部能力度量相关。我们的框架将基准测试重新定义为一种对抗性生成-评估游戏,其中人类作为最终裁决者。

英文摘要

As frontier Large Language Models (LLMs) increasingly saturate new benchmarks shortly after they are published, benchmarking itself is at a juncture: if frontier models keep improving, it will become increasingly hard for humans to generate discriminative tasks, provide accurate ground-truth answers, or evaluate complex solutions. If benchmarking becomes infeasible, our ability to measure any progress in AI is at stake. We refer to this scenario as the post-comprehension regime. In this work, we propose Critique-Resilient Benchmarking, an adversarial framework designed to compare models even when full human understanding is infeasible. Our technique relies on the notion of critique-resilient correctness: an answer is deemed correct if no adversary has convincingly proved otherwise. Unlike standard benchmarking, humans serve as bounded verifiers and focus on localized claims, which preserves evaluation integrity beyond full comprehension of the task. Using an itemized bipartite Bradley-Terry model, we jointly rank LLMs by their ability to solve challenging tasks and to generate difficult yet solvable questions. We showcase the effectiveness of our method in the mathematical domain across eight frontier LLMs, showing that the resulting scores are stable and correlate with external capability measures. Our framework reformulates benchmarking as an adversarial generation-evaluation game in which humans serve as final adjudicators.

2506.02075 2026-06-02 stat.ME cs.LG 版本更新

Position: Stop Chasing the C-index when Evaluating Survival Analysis Models

立场:评估生存分析模型时停止追逐C指数

Christian Marius Lillelund, Shi-ang Qi, Russell Greiner, Christian Fischer Pedersen

发表机构 * University of Copenhagen(哥本哈根大学)

AI总结 本文批判性审视生存分析中的评估实践,指出C指数等一致性指标被过度使用且与建模目标错位,提出双螺旋阶梯框架以确保评估指标与模型假设对齐,并通过实验展示错位导致的误导性比较。

Comments ICML 2026 Position Paper Track (Spotlight)

详情
AI中文摘要

当前生存分析评估的现状受到持续使用与既定建模目标不一致的评估指标的困扰。此外,许多此类评估基于隐含或不合理的删失假设。这意味着报告的性能可能具有误导性,并且可能无法回答评估旨在解决的科学或建模问题。在这篇立场论文中,我们批判性地审视了生存分析中的评估实践,并强调了删失如何使评估从根本上不同于标准回归或分类。我们特别关注基于一致性的度量,如C指数,我们证明其在文献中被过度使用。为了帮助确定合适的度量,我们提出了一组关键需求,并引入了一个双螺旋阶梯,其中有效评估需要度量与模型假设之间的对齐。通过控制实验,我们表明这种对齐的违反可能导致误导性的模型比较。最后,我们提供了关于如何评估生存模型的实用指导。

英文摘要

The current state of evaluation in survival analysis is plagued by the persistent use of evaluation metrics in ways that are misaligned with the stated modeling objective. In addition, many such evaluations are based on censoring assumptions that are left implicit or unjustified. This means that the reported performance can be misleading and may fail to answer the scientific or modeling question the evaluation was intended to address. In this position paper, we critically examine evaluation practices in survival analysis and highlight how censoring makes evaluation fundamentally different from standard regression or classification. We place particular focus on concordance-based measures, such as the C-index, which we show are heavily overused in the literature. To help identify appropriate metrics, we propose a set of key desiderata and introduce a double-helix ladder, in which valid evaluation requires alignment between metric and modeling assumptions. Through controlled experiments, we show that violations of this alignment can lead to misleading model comparisons. We conclude by providing practical guidance on how to evaluate a survival model.

2605.29548 2026-06-02 cs.LG 版本更新

Why Larger Models Learn More: Effects of Capacity, Interference, and Rare-Task Retention

为什么更大的模型学习得更多:容量、干扰和稀有任务保留的影响

Jing Huang, Daniel Wurgaft, Rachit Bansal, Laura Ruis, Naomi Saphra, David Alvarez-Melis, Andrew Kyle Lampinen, Christopher Potts, Ekdeep Singh Lubana

发表机构 * Stanford University(斯坦福大学) Kempner Institute at Harvard University(哈佛大学凯普纳研究所) MIT(麻省理工学院) Anthropic

AI总结 通过理论分析和合成实验,研究模型规模对学习能力的影响,发现更大模型通过减少梯度干扰来学习稀有和复杂任务,并在OLMo模型上验证。

详情
AI中文摘要

更大的模型能学习更小模型无法学习的任务。是什么驱动了这一现象?我们提出了一个简单的现象学论证,即幂律缩放已经表明,即使训练数据无限,更大的模型也能够学习到更小模型无法学习的数据分布部分。为了验证这一说法并找出其原因,我们研究了模型缩放对合成设置的影响,该设置由一系列呈现单调缩放曲线的任务混合而成。结果指向数据引起的资源(神经元)竞争。具体来说,较小的模型将其神经元分配给高频或低复杂度的任务,因此它们学习到的解决方案在稀有和复杂任务上表现不佳。此外,即使存在能够表达所需任务的解决方案,这种情况也会发生。然后,我们评估了更大的模型如何规避这一以数据为中心的瓶颈,发现这归因于一种减少的干扰机制:更大的模型可以为常见任务分配足够的资源,使得这些任务的梯度更新变弱,这意味着它们不会在缓慢积累稀有任务特征时覆盖这些特征。最后,为了进一步验证这些说法,我们在不同频率和复杂性的新任务上预训练了OLMo模型(4M到4B参数)。结果与我们的合成数据实验相呼应:只有更大的OLMo模型学习了不频繁和复杂的任务,并且这些更大的模型在其表示中嵌入了更多的任务特征,并且任务之间的梯度干扰更少。总体而言,我们提供了一个以数据为中心的解释,说明为什么更大的模型能够学习更小模型无法学习的任务。这有助于解释为什么更大的模型在实践中更好,并且可以为有关模型大小和训练数据混合的实际问题提供信息。

英文摘要

Larger models learn tasks smaller models do not. What drives this phenomenon? We develop a simple phenomenological argument that power-law scaling already suggests that a larger model will be able to learn a part of the data distribution that a smaller model fails to learn, even with infinite training data. To validate this claim and identify its causes, we study the effects of model scaling on a synthetic setup consisting of a mixture of tasks that show monotonic scaling curves. The results point to a data-induced competition over resources (neurons). Specifically, smaller models allocate their neurons to high frequency or low complexity tasks, and so they learn solutions that perform poorly on rare and complex tasks. Moreover, this happens even when solutions capable of expressing the desired task exist. We then assess how a larger model circumvents this data-centric bottleneck, finding that it traces to a reduced interference mechanism: larger models can allocate enough resources to common tasks that the gradient updates for those tasks become weak, which means that they do not overwrite rare-task features as they slowly accumulate. Finally, to further validate these claims, we pretrain OLMo models (4M to 4B parameters) on novel tasks of varying frequency and complexity. The results mirror those from our synthetic data experiments: only the larger OLMo models learn the infrequent and complex tasks, and these larger models embed more task features in their representations and show less gradient interference between tasks. Overall, we offer a data-centric account of why larger models learn tasks that smaller models fail to. This helps explain why larger models are better in practice, and it can inform practical questions concerning model sizing and training data mixtures.

2605.29463 2026-06-02 cs.LG cs.AI 版本更新

Honest Lying: Understanding Memory Confabulation in Reflexive Agents

诚实撒谎:理解反射型智能体中的记忆虚构

Prakhar Dixit, Sadia Kamal, Tim Oates

发表机构 * University of Cambridge(剑桥大学)

AI总结 研究反射型智能体在自我反思中产生记忆虚构的问题,提出反射重复率(RRR)指标检测该现象,并通过程序化提取失败信号缓解问题。

Comments Accepted to ICML 2026 Workshop "Failure Modes in Agentic AI"

详情
AI中文摘要

反射型智能体依赖自我生成的反思作为记忆,隐含地假设智能体能够准确诊断自己的失败。我们表明这一假设可能系统性地失败:在ALFWorld和HumanEval中,智能体存储自信但错误的任务解释,并在多次试验中继续据此行动,尽管每次环境都重置为正确任务。我们将这种失败模式称为记忆虚构,并引入反射重复率(RRR),一种基于日志的指标,用于检测对错误反思内容的重复依赖。使用RRR,我们在ALFWorld中识别出16个冻结环境,其中121条反思中0条提及正确目标对象,在HumanEval中有4个类似案例。我们的缓解方法用程序化提取轨迹级失败信号替代开放式自我诊断,将正确对象提及率从0%提升至86%,RRR从0.64降至0.10,并解决了16个冻结ALFWorld环境中的3个,表明反思记忆可能强化而非纠正错误信念。

英文摘要

Reflexion-style agents rely on self-generated reflections as memory, implicitly assuming that agents can accurately diagnose their own failures. We show that this assumption can fail systematically: across ALFWorld and HumanEval, agents store confident but incorrect interpretations of the task and continue acting on them across trials, even though the environment resets to the correct task each time. We call this failure mode memory confabulation and introduce the Reflection Repetition Rate (RRR), a log-based metric that detects repeated reliance on incorrect reflective content. Using RRR, we identify 16 frozen environments in ALFWorld, where 0 of 121 reflections mention the correct target object, and 4 analogous cases in HumanEval. Our mitigation replaces open-ended self-diagnosis with programmatic extraction of trajectory-level failure signals, increasing correct object mention from 0% to 86%, reducing RRR from 0.64 to 0.10, and solving 3 of 16 frozen ALFWorld environments, suggesting that reflective memory can reinforce false beliefs rather than correct them.

2605.29263 2026-06-02 cs.LG 版本更新

Robust Frequency-Calibrated Virtual EEG Channel Generation from Four Frontal Electrodes for Wearable EEG Augmentation

用于可穿戴脑电图增强的鲁棒频率校准虚拟脑电通道生成:来自四个额叶电极

Minghao Xiao

发表机构 * School of Biomedical Science and Engineering, South China University of Technology(生物医学工程学院,华南理工大学) School of Automation Science and Engineering, South China University of Technology(自动化科学与工程学院,华南理工大学)

AI总结 提出FAVC-Net网络,通过频率校准从四个额叶电极生成13个未测量通道,联合优化波形保真度、频谱分配和鲁棒性,在PRED+CT数据集上实现最佳波形-频谱性能。

Comments 17 pages, 4 figures

详情
AI中文摘要

低通道可穿戴脑电图(EEG)对于长期监测具有吸引力,但四个额叶电极仅提供稀疏且空间偏斜的分布头皮活动视图。我们提出了FAVC-Net,一个紧凑的频率校准虚拟通道网络,从Fp1、Fp2、F7和F8生成13个未测量的EEG通道。该模型结合了共享多尺度源编码、源状态嵌入、目标条件有符号源块混合、基于GATv2的注意力细化、注意力一致跳跃融合以及弱Welch功率谱密度校准。该框架不是将稀疏到密集的EEG生成视为纯波形匹配任务,而是联合强调幅度保真度、频谱分配、通道频率纹理以及对损坏的可穿戴输入的鲁棒性。在PRED+CT数据集上,FAVC-Net在神经和插值基线中实现了最佳的联合波形-频谱工作点。其时域增益适中,而对数谱距离和PSD KL散度相对于最强的非FAVC比较器分别降低了30.09%和37.98%。在类似可穿戴的源扰动下,该模型保持了频谱保真度并抵抗了频谱崩溃。这些结果支持虚拟EEG通道生成作为双域增强问题,同时强调生成的顶叶和后部通道应解释为从稀疏额叶测量导出的频率校准表示,而非独立的物理记录。

英文摘要

Low-channel wearable electroencephalography (EEG) is attractive for long-term monitoring, but four frontal electrodes provide only a sparse and spatially biased view of distributed scalp activity. We present FAVC-Net, a compact frequency-calibrated virtual-channel network that generates 13 unmeasured EEG channels from Fp1, Fp2, F7, and F8. The model combines shared multi-scale source encoding, source-state embeddings, target-conditioned signed source-block mixing, GATv2-based attention refinement, attention-consistent skip fusion, and weak Welch power spectral density calibration. Rather than treating sparse-to-dense EEG generation as a purely waveform-matching task, the framework jointly emphasizes amplitude fidelity, spectral allocation, channel-frequency texture, and robustness to corrupted wearable inputs. On the PRED+CT dataset, FAVC-Net achieved the best joint waveform-spectral operating point among neural and interpolation baselines. Its time-domain gains were modest, whereas log-spectral distance and PSD KL divergence were reduced by 30.09% and 37.98% relative to the strongest non-FAVC comparator. Under wearable-like source perturbations, the model preserved spectral fidelity and resisted spectral collapse. These results support virtual EEG channel generation as a dual-domain augmentation problem, while emphasizing that generated posterior and parietal channels should be interpreted as frequency-calibrated representations derived from sparse frontal measurements rather than as independent physical recordings.

2605.29233 2026-06-02 cs.LG cs.AI 版本更新

BlockBatch: Multi-Scale Consensus Decoding for Efficient Diffusion Language Model Inference

BlockBatch: 面向高效扩散语言模型推理的多尺度共识解码

Xiaoyou Wu, Cheng-Jhih Shih, Binfei Ji, Yong Liu, Yingyan Celine Lin

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出BlockBatch框架,通过多分支并行解码和置信度门控合并,在不训练的情况下加速扩散语言模型推理,平均减少26.6%的去噪步数并实现1.33倍端到端加速。

Comments 23 pages, including references and appendices

详情
AI中文摘要

扩散语言模型(dLLMs)通过并行迭代去噪多个标记位置来生成文本,为严格自回归解码提供了有吸引力的替代方案。然而,在实际中,块状dLLM推理暴露了难以权衡的粒度问题:小块保留局部条件但需要更多去噪步骤,而大块暴露更多并行性但可能做出过早承诺并累积缓存误差。现有加速方法通常为每个请求选择单一块大小,未利用块大小之间的互补性。我们表明块大小本身是一个有用的分支维度。不同块大小产生相关但非相同的KV缓存轨迹:分支通常共享初始前缀,在语义决定性位置分叉,并在句法轻量级标记上后来达成一致。受此结构启发,我们提出BlockBatch,一种无需训练的在线推理框架,在批处理前向传递中为同一请求执行多个块大小分支。BlockBatch通过置信度门控标记合并、基于领导者的同步和周期性全序列刷新来协调这些分支,将局部块更新重新锚定到全局一致的KV状态。在3个代表性dLLM和4个数据集上,BlockBatch平均减少26.6%的去噪NFE,并在保持准确性的同时实现比Fast-dLLM平均1.33倍的端到端加速。这些结果将块大小多样性确定为分支并行dLLM推理中一个实用且先前未被充分探索的维度。

英文摘要

Diffusion language models (dLLMs) generate text by iteratively denoising multiple token positions in parallel, offering an attractive alternative to strictly autoregressive decoding. In practice, however, block-wise dLLM inference exposes a difficult granularity trade-off: small blocks preserve local conditioning but require many denoising steps, whereas large blocks expose more parallelism but can make premature commitments and accumulate cache error. Existing acceleration methods typically choose a single block size per request, leaving the complementarity among block sizes unused. We show that block size itself is a useful branching dimension. Different block sizes induce related but non-identical KV-cache trajectories: branches often share an initial prefix, bifurcate at semantically decisive positions, and later agree on syntactically lightweight tokens. Motivated by this structure, we propose BlockBatch, a training-free online inference framework that executes multiple block-size branches for the same request inside a batched forward pass. BlockBatch coordinates these branches through confidence-gated token merging, leader-based synchronization, and periodic full-sequence refreshes that re-anchor local block updates to a globally consistent KV state. Across 3 representative dLLMs and 4 datasets, BlockBatch reduces denoising NFEs by 26.6\% on average and achieves a 1.33$\times$ average end-to-end speedup over Fast-dLLM while preserving accuracy. These results identify block-size diversity as a practical and previously underexplored axis for branch-parallel dLLM inference.

2605.29183 2026-06-02 cs.LG cs.AI 版本更新

TIMEGATE: Sustainable Time-Boxed Promotion Gates for Continual ML Adaptation Under Resource Constraints

TIMEGATE: 资源约束下可持续的限时促销门控用于持续ML适应

Abhijit Chakraborty, Suddhasvatta Das, Yash Shah, Vivek Gupta, Kevin A. Gary

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 提出TIMEGATE策略层,通过预算时间、标注、训练和评估来管理持续ML适应,实现评估计算节省且无静默错误促销。

详情
AI中文摘要

随着机器学习(ML)系统向持续适应演进,每个重新训练周期都会消耗计算、标注和能源。我们引入TIMEGATE,一个通过预算时间、标注、训练和评估来管理适应的策略层。TIMEGATE发出一个度量可用性信号M,用于部分与完整评估决策。我们验证:(i)在Adult表格数据上,标注优于训练2.3倍;(ii)它迁移到LLaMA-3.1-8B + QLoRA在SST-2上(准确率从0.80到0.96;35/36次运行中M=1);(iii)M具有信息性,28单元敏感性显示在严格阈值下M降至0.81;(iv)100周期模拟实现66%的评估计算节省,且无静默错误促销;(v)在单个H200上,LLaMA的10%切片评估使用89%更少的挂钟时间和能源(比率一致至0.2%)。

英文摘要

As machine learning(ML) systems evolve to continual adaptation, each re-training cycle uses compute, annotation, and energy. We introduce TIMEGATE, a policy layer managing adaptation by budgeting time, labeling, training, and evaluation. TIMEGATE emits a metric-availability signal M for partial vs. full-evaluation decisions. We validate: (i) labeling outperforms training by 2.3x on Adult tabular; (ii) it transfers to LLaMA-3.1-8B + QLoRA on SST-2 (accuracy 0.80 to 0.96; M =1 in 35/36 runs); (iii) M is informative, 28-cell sensitivity shows M drops to 0.81 at tight thresholds; (iv) 100-cycle simulation achieves 66% evaluation-compute savings with no silent mis-promotions; (v) 10%-slice evaluation on LLaMA uses 89% less wall-clock and energy on a single H200 (ratios agree to 0.2%).

2605.29072 2026-06-02 cs.LG cs.NA math.NA 版本更新

Ensemble Score Filtering for Real-Data Energy Consumption Forecast Correction

集成得分滤波用于真实数据能耗预测修正

Ruoyu Hu, Dahai Yu, Feng Bao, Guang Wang, Guannan Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对真实能耗数据的高维数据同化问题,采用集成得分滤波器(EnSF)结合预训练黑箱时空预测模型,通过基于得分的扩散模型和闭式得分表示修正预测轨迹,实验表明EnSF优于开环传播和集成卡尔曼滤波。

详情
AI中文摘要

准确的能耗估计和预测对电力系统运行、规划和需求侧管理至关重要。然而,在实践中,完整及时的测量可能并不总是可用,观测数据可能是不完整的、有噪声的或延迟的。这促使使用学习型预测模型来预测不断变化的消费状态,并结合数据同化方法进行序列预测修正。在这项工作中,我们研究了真实能耗数据的高维数据同化问题。前向预测由预训练的黑箱时空预测模型提供,该模型在滤波过程中被视为状态传播器。我们采用集成得分滤波器(EnSF)来同化部分和有噪声的观测,并随时间修正预测轨迹。EnSF使用基于得分的扩散模型来近似滤波分布,并通过使用闭式得分表示和蒙特卡洛近似避免在同化过程中重新训练神经网络得分模型。数值实验表明,学习型预测模型的开环传播在长时间范围内可能变得不可靠,而基于EnSF的修正显著改善了状态估计。与集成卡尔曼滤波(EnKF)的比较进一步表明,在本工作考虑的非线性观测设置下,EnSF提供了更强的修正能力。

英文摘要

Accurate estimation and forecasting of energy consumption are important for power-system operation, planning, and demand-side management. In practice, however, complete and timely measurements may not always be available, and the observed data can be partial, noisy, or delayed. This motivates the use of learned forecasting models for predicting the evolving consumption state, together with data assimilation methods for sequential forecast correction. In this work, we study a high-dimensional data assimilation problem for real energy-consumption data. \modeltext{The forward prediction is supplied by a pretrained black-box spatio-temporal forecasting model, which is treated as the state propagator in the filtering procedure.} We employ the Ensemble Score Filter (EnSF) to assimilate partial and noisy observations and to correct the forecast trajectory over time. The EnSF uses score-based diffusion models to approximate filtering distributions and avoids retraining neural-network score models during assimilation by using a closed-form score representation and Monte Carlo approximation. Numerical experiments demonstrate that open-loop propagation of the learned forecasting model can become unreliable over long horizons, while EnSF-based correction substantially improves state estimation. Comparisons with the Ensemble Kalman Filter (EnKF) further show that EnSF provides stronger correction under the nonlinear observation setting considered in this work.

2605.28952 2026-06-02 cs.CR cs.DS cs.IT cs.LG math.IT math.ST stat.TH 版本更新

Optimal Rates for Differentially Private Hypothesis Testing with E-values

基于E值的差分隐私假设检验的最优速率

Ben Jacobsen, Tomas Gonzalez, Gavin Brown, Kassem Fawaz, Aaditya Ramdas

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Carnegie Mellon University(卡内基梅隆大学)

AI总结 研究在ε-差分隐私约束下,使用e值进行假设检验时所能达到的最大e-power,并给出最优速率及匹配算法。

Comments Corrected typos; updated references; generalized proposition 3.1

详情
AI中文摘要

近年来,e值作为支持任意有效和自适应数据分析的灵活工具引起了广泛关注。假设检验是许多此类应用的核心,而这些应用通常涉及私有或敏感数据。在这项工作中,我们回答了一个简单但重要的问题:给定两个分布 $\mathbb{P}$ 和 $\mathbb{Q}$,当使用满足 $\varepsilon$-差分隐私的e值检验 $X\sim \mathbb{P}^n$ 对 $X\sim\mathbb{Q}^n$ 时,所能达到的最大e-power是多少?我们刻画了该问题的最优速率,并提供了一个精确匹配的算法。在顺序设置中,当观测值逐个到达且分析者选择何时停止时,我们给出了任何私有e过程的停止时间的匹配上下界。数值实验证实了我们算法的实用性,在多种顺序检验问题和隐私水平下,我们的算法所需数据少于最近提出的DP-SPRT。

英文摘要

E-values have attracted considerable interest in recent years as flexible tools for enabling anytime-valid and adaptive data analysis. Hypothesis testing is at the core of many of these applications, which can often involve private or sensitive data. In this work, we answer a simple but important question: given two distributions $\mathbb{P}$ and $\mathbb{Q}$, what is the maximum achievable e-power when testing $X\sim \mathbb{P}^n$ against $X\sim\mathbb{Q}^n$ with e-values that satisfy $\varepsilon$-differential privacy? We characterize the optimal rate for this problem and provide an algorithm which matches it exactly. In the sequential setting, when observations arrive one-by-one and the analyst chooses when to halt, we give matching upper and lower bounds on the stopping times of any private e-process. Numerical experiments confirm the practicality of our algorithms, which require less data than the recently proposed DP-SPRT across a range of sequential testing problems and privacy levels.

2605.28850 2026-06-02 cs.LG q-fin.CP 版本更新

Representation Signatures and Risk-Feedback Alignment in LLM Trading Agents

表示签名与LLM交易智能体中的风险反馈对齐

Weicheng Xue

发表机构 * Virginia Tech(弗吉尼亚理工大学)

AI总结 通过TradeArena测试平台研究LLM交易智能体在金融决策中的行为对齐与表示动态,发现故障前表示签名(规划嵌入漂移、流形有效秩收缩)并验证风险反馈作为外部对齐信号的有效性。

详情
AI中文摘要

我们研究了大型语言模型(LLM)智能体在金融决策环境中的行为对齐与表示动态。TradeArena是一个可审计的交易智能体测试平台,提供风险报告、执行模拟、记忆和可重放轨迹,使我们能够分析在市场压力下推理、持仓和干预的演变。代码和数据工件可通过TradeArena仓库获取。我们发现了故障前签名:规划嵌入偏离正常质心,融合的计划-风险表示将正常状态与预回撤状态分离,局部流形表现出有效秩收缩。在80个滚动故障锚点和8条LLM轨迹中,这一模式在哈希、LSA、Transformer和白盒隐藏状态探针中持续存在。使用无CoT目标权重、词汇控制、OHLCV噪声和虚假审计的压力测试表明,无推理时推理级收缩消失,而意图空间和融合签名仍具有信息性。结构化风险反馈可以在不微调的情况下作为外部对齐信号,但并非通用性能增强器:真实审计反馈改善了一些模型的校准,另一些模型的收益,并暴露出安慰剂或隐藏反馈在短周期内收益更高但对齐诊断较弱的情况。一项51只股票的日内实验揭示了相关性盲点:LLM推理为风险层会削减的相关资产敞口提供理由。最后,一个金融审计任务套件将比较从“哪个模型交易最好”转向模型能否审计轨迹、尊重执行边界、重现工件并避免过度声明。这些结果支持研究主张而非盈利主张:可审计的风险反馈和表示轨迹揭示了LLM金融推理何时对齐、漂移或失败。

英文摘要

We study behavioral alignment and representation dynamics of large language model (LLM) agents in financial decision environments. TradeArena, an auditable trading-agent testbed with risk reports, execution simulation, memory, and replayable trajectories, lets us analyze how rationales, positions, and interventions evolve under market stress. Code and data artifacts are available through the \href{https://github.com/weich97/TradeArena.git}{TradeArena repository}. We find pre-failure signatures: planning embeddings drift from normal centroids, fused plan-risk representations separate normal from pre-drawdown states, and local manifolds exhibit effective-rank contraction. Across 80 rolling failure anchors and eight LLM trajectories, this pattern persists across hash, LSA, Transformer, and white-box hidden-state probes. Stress tests with CoT-free target weights, lexical controls, OHLCV noise, and false audits show that rationale-level contraction can vanish without rationales, while intent-space and fused signatures remain informative. Structured risk feedback can act as an external alignment signal without fine-tuning, but not as a universal performance enhancer: true audit feedback improves calibration for some models, returns for others, and exposes cases where placebo or hidden feedback has higher short-horizon return but weaker alignment diagnostics. A 51-stock intraday experiment reveals a correlation blind spot: LLM rationales justify exposure to coupled assets that the risk layer clips. Finally, a financial-audit task suite shifts comparison from ``which model trades best'' to whether models can audit trajectories, respect execution boundaries, reproduce artifacts, and avoid claim overreach. These results support a research claim, not a profitability claim: auditable risk feedback and representation trajectories reveal when LLM financial reasoning is aligning, drifting, or failing.

2605.26092 2026-06-02 cs.LG cs.AI 版本更新

GoQuant: Geometric Orthogonal Residual Projection for Multiplier-Free Power-of-Two Transformer Quantization

GoQuant: 用于无乘法器二次幂变压器量化的几何正交残差投影

Maoyang Xiang, Tao Luo, Bo Wang

发表机构 * Information Systems Technology and Design(信息系统技术与设计) Singapore University of Technology and Design(新加坡科技设计大学) Institute of High Performance Computing (IHPC)(高性能计算研究所) Agency for Science, Technology and Research (A*STAR)(科技研究局)

AI总结 针对低比特量化中二次幂格式的低角度分辨率问题,提出几何正交残差投影量化(GoQuant),通过双基几何投影和移位加操作合成高分辨率残差格点,实现硬件高效且无需乘法器的量化方法。

详情
AI中文摘要

大型语言模型(LLMs)和视觉变换器(ViTs)在边缘设备上的部署受到内存限制和密集乘加(MAC)阵列引入的关键时序瓶颈的显著约束。在超低比特范围内,对数二次幂(PoT)量化通过用位移操作替代MAC操作,提供了一种硬件高效的替代方案。然而,非均匀指数格点固有地受到低角度分辨率机制的局限,这一结构缺陷在低于4比特阈值时尤为突出,导致高维特征流形的显著退化。为解决这一几何限制,我们提出了几何正交残差投影量化(GoQuant),一种算法-硬件协同设计框架。通过将量化表述为双基几何投影,GoQuant使用严格的移位加操作自适应地合成更高分辨率的残差格点。此外,其解析求解器为计算密集的梯度优化提供了实用替代方案,将LLaMA-2-7B的全模型校准时间减少到约15分钟。广泛评估表明GoQuant在多种模态下的适用性和硬件效率。在3比特(W3/A16)约束下,它在LLaMA-2-7B上实现了6.10的困惑度,与依赖非对称缩放的常规MAC密集型基线(如AWQ)相比具有竞争力,同时在4比特场景下保持竞争性精度。在硅片层面,28nm节点的标准单元RTL综合表明,GoQuant有效缓解了与密集乘法器树相关的时序瓶颈。通过展平组合逻辑深度,我们的并行移位加数据路径将关键路径延迟降低至0.35纳秒。

英文摘要

The deployment of Large Language Models (LLMs) and Vision Transformers (ViTs) on edge devices is significantly constrained by memory limitations and the critical timing bottlenecks introduced by dense Multiply-Accumulate (MAC) arrays. In the ultra-low bit regime, logarithmic Power-of-Two (PoT) quantization provides a hardware-efficient alternative by replacing MAC operations with bit-shifts. However, the non-uniform exponential lattice is inherently limited by a \textbf{Low Angular Resolution Regime}, a structural flaw that becomes particularly pronounced at sub-4-bit thresholds, leading to a notable degradation of high-dimensional feature manifolds. To address this geometric limitation, we propose Geometric Orthogonal Residual Projection Quantization (GoQuant), an algorithm-hardware co-design framework. By formulating quantization as a dual-basis geometric projection, GoQuant adaptively synthesizes a higher-resolution residual lattice using strictly shift-and-add operations. Furthermore, its analytical solver offers a practical alternative to computationally intensive gradient-based optimization, reducing the full-model calibration time for LLaMA-2-7B to approximately 15 minutes. Extensive evaluations demonstrate GoQuant's applicability across modalities and its hardware efficiency. Under the 3-bit (W3/A16) constraint, it achieves a perplexity of 6.10 on LLaMA-2-7B, comparing favorably to conventional MAC-intensive baselines like AWQ without relying on asymmetric scaling, while maintaining competitive accuracy in 4-bit scenarios. At the silicon level, standard-cell RTL synthesis at a 28nm node indicates that GoQuant effectively mitigates the timing bottlenecks associated with dense multiplier trees. By flattening the combinational logic depth, our parallel shift-and-add datapath reduces the critical path delay to 0.35 ns.

2605.25889 2026-06-02 cs.CR cs.LG 版本更新

Capability and Robustness Cannot Both Be Free: An Information-Theoretic Bound for Vision-Language-Action Models

能力与鲁棒性不可兼得:视觉-语言-动作模型的信息论界

Jianwei Tai

发表机构 * Jianwei Tai(Tai Jianwei)

AI总结 本文证明视觉-语言-动作模型的能力与鲁棒性之间存在信息论权衡,能力与鲁棒性之和受限于任务熵与对抗信道容量之和,并通过实验验证了该界。

详情
AI中文摘要

视觉-语言-动作(VLA)模型在干净输入上达到高成功率,但在小的对抗扰动下崩溃:$16/255$ PGD攻击将OpenVLA-7B在LIBERO上的成功率从$95\\%$降至$5\\%$以下。这种权衡是否存在理论下限此前未知。我们证明它存在。对于任何VLA策略,能力$I(\\Astar;\\Api)$和鲁棒性$I(\\Api;\\Atildepi)-I(\\Api;δ)$之和至多为$H(\\Astar)+I(X;\\Xtilde)$,即任务熵加对抗信道容量。证明简化为两次应用数据处理不等式。像素级界宽松约$10^3$纳特,作为上限保证;编码器特定的推论将其收紧一个数量级以上,进入实际能力已消耗$5$--$9\\%$预算的区域。我们在$308$个单元中验证定理\\ref{thm:main},零违反:$252$个闭式高斯VLA、$48$个OpenVLA-7B$+$LIBERO$+$PGD($4$套件$\\times$ $4$个$\\\eps$ $\\times$ $3$个种子)、$4$个Square-Attack和$4$个多步($T{=}10$)。一个互补的可测性不等式$\\\Rob_{\\text{disc}} \\\le \\\Cap_{\\text{disc}}$进一步在跨越OpenVLA、OpenVLA-OFT(连续$L_1$)和SmolVLA(流匹配)的$144$个跨架构单元中成立。相同的构造产生了三个无标签诊断工具:预飞行编码器上限、定位输入侧与语言模型干预的防御取证探针,以及可在离散令牌、$L_1$回归和流匹配策略间比较的头部无关鲁棒性比。这些共同提供了跨设置轴防御和架构比较目前所缺乏的。

英文摘要

Vision-Language-Action (VLA) models reach high success rates on clean inputs but collapse under small adversarial perturbations: a $16/255$ PGD attack drops OpenVLA-7B's LIBERO success from $95\%$ to under $5\%$. Whether this trade-off has a theoretical floor was open. We prove that it does. For any VLA policy, capability $I(\Astar;\Api)$ and robustness $I(\Api;\Atildepi)-I(\Api;δ)$ sum to at most $H(\Astar)+I(X;\Xtilde)$, the task entropy plus adversarial channel capacity. The proof reduces to two applications of the Data Processing Inequality. The pixel-level bound is loose by $\sim 10^3$ nats and serves as a ceiling guarantee; an encoder-specific corollary tightens it by over an order of magnitude, into a regime where realized capability already consumes $5$--$9\%$ of the budget. We validate Theorem~\ref{thm:main} with zero violations across $308$ cells: $252$ closed-form Gaussian-VLA, $48$ OpenVLA-7B$+$LIBERO$+$PGD ($4$ suites $\times$ $4$ $\eps$ $\times$ $3$ seeds), $4$ Square-Attack, and $4$ multi-step ($T{=}10$). A complementary measurability inequality $\Rob_{\text{disc}} \le \Cap_{\text{disc}}$ further holds across $144$ cross-architecture cells spanning OpenVLA, OpenVLA-OFT (continuous-$L_1$), and SmolVLA (flow-matching). The same construction yields three label-free diagnostics: a pre-flight encoder ceiling, a defense-forensics probe that localizes input-side vs.\ language-model intervention, and a head-agnostic robustness ratio comparable across discrete-token, $L_1$-regression, and flow-matching policies. Together these provide the cross-setting axis defense and architecture comparisons currently lack.

2602.08646 2026-06-02 cs.LG 版本更新

Gradient Preconditioning for Efficient and Reliable Reward-Guided Generation

梯度预处理实现高效可靠的奖励引导生成

Jisung Hwang, Minhyuk Sung

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种梯度预处理方法,通过将奖励梯度投影到白高斯噪声可行集上,实现一步生成模型在奖励引导生成中的高效性和可靠性,防止奖励黑客攻击并加速优化。

Comments ICML 2026

详情
AI中文摘要

我们提出了一种梯度预处理方法,使得使用一步生成模型的奖励引导生成既高效又可靠。测试时噪声优化可以从预训练生成模型中解锁显著更好的奖励引导生成,但它容易导致奖励黑客攻击,从而降低质量,并且通常对于实际使用来说太慢。我们通过将奖励梯度投影到一个精心设计的白高斯噪声可行集上来预处理奖励梯度,该可行集是一个具有块状范数约束的紧凑谱集,紧密捕捉白高斯噪声的统计特性和空间不相关性。这种预处理将每个梯度更新重塑为噪声对齐的方向,驱动更快更有效的奖励上升,同时防止奖励黑客攻击。该投影具有闭式解,并且与FFT的$O(N \log N)$复杂度相匹配,在实践中增加了可忽略的开销。在FLUX上使用四个奖励模型的实验中,我们的方法仅使用最先进的基于正则化方法所需挂钟时间的30%就达到了可比的审美分数。

英文摘要

We propose a gradient preconditioning method that makes reward-guided generation with one-step generative models both efficient and reliable. Test-time noise optimization can unlock substantially better reward-guided generations from pretrained generative models, but it is prone to reward hacking that degrades quality and is often too slow for practical use. We precondition reward gradients by projecting them onto a carefully designed white Gaussian noise feasible set, a compact spectral set with blockwise norm constraints that tightly captures the statistics and spatial uncorrelatedness of white Gaussian noise. This preconditioning reshapes each gradient update into a noise-aligned direction, driving faster and more effective reward ascent while preventing reward hacking. The projection is closed-form and matches the $O(N \log N)$ complexity of FFT, adding negligible overhead in practice. In experiments on FLUX with four reward models, our approach reaches a comparable Aesthetic Score using only 30% of the wall-clock time required by the state-of-the-art regularization-based method.

2510.01711 2026-06-02 cs.RO cs.LG 版本更新

Contrastive Representation Regularization for Vision-Language-Action Models

视觉-语言-动作模型的对比表示正则化

Taeyoung Kim, Jimin Lee, Myungkyu Koo, Dongyoung Kim, Kyungmin Lee, Changyeon Kim, Younggyo Seo, Jinwoo Shin

发表机构 * KAIST(韩国科学技术院)

AI总结 提出机器人状态感知对比损失(RS-CL),通过对比学习对齐VLM表示与机器人本体感受状态,提升VLA模型在机器人操作任务中的性能。

Comments ICML 2026

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用预训练视觉-语言模型(VLM)的丰富表示,在机器人操作中展现了强大的能力。然而,它们的表示可以说仍然次优,缺乏对控制动作和本体感受信息等机器人信号的敏感性。为了解决这个问题,我们引入了机器人状态感知对比损失(RS-CL),一种简单有效的VLA模型表示正则化方法,旨在弥合VLM表示与机器人信号之间的差距。特别地,RS-CL通过使用状态之间的相对距离作为软监督,使表示更紧密地对齐机器人的本体感受状态。作为原始动作预测目标的补充,RS-CL增强了控制相关表示学习,同时轻量级且与标准VLA训练流程完全兼容。我们的实验结果表明,RS-CL显著提升了最先进VLA模型的性能;它将先前技术在RoboCasa-Kitchen基准上的性能提升至69.7%,达到最先进水平,并在具有挑战性的真实机器人操作任务中将成功率从45.0%提升至58.3%。

英文摘要

Vision-Language-Action (VLA) models have shown strong capabilities in robot manipulation by leveraging rich representations from pre-trained Vision-Language Models (VLMs). However, their representations arguably remain suboptimal, lacking sensitivity to robotic signals such as control actions and proprioceptive information. To address the issue, we introduce Robot State-aware Contrastive Loss (RS-CL), a simple and effective representation regularization for VLA models, designed to bridge the gap between VLM representations and robotic signals. In particular, RS-CL aligns the representations more closely with the robot's proprioceptive states by using relative distances between the states as soft supervision. Complementing the original action prediction objective, RS-CL enhances control-relevant representation learning, while being lightweight and fully compatible with standard VLA training pipelines. Our empirical results demonstrate that RS-CL substantially improves the performance of state-of-the-art VLA models; it pushes the prior art to 69.7% achieving the state-of-the-art performance on the RoboCasa-Kitchen benchmark, and boosts success rates from 45.0% to 58.3% on challenging real-robot manipulation tasks.

2605.21247 2026-06-02 cs.LG 版本更新

Graph Navier Stokes Networks

图纳维-斯托克斯网络

Zexing Zhao, Guangsi Shi, Yu Gong, Tianyu Wang, Shirui Pan, Hongye Cheng, Yuxiao Li

发表机构 * Northwest A&F University(西北农林科技大学) Corporate Research Center, Midea Group(美的集团企业研发中心) Peking University(北京大学) Fudan University(复旦大学) Griffith University(格里菲斯大学) Bosch(博世)

AI总结 针对图神经网络中的过平滑问题,提出基于纳维-斯托克斯方程的图纳维-斯托克斯网络(GNSN),通过引入对流机制实现更高效的消息传递,并在12个真实数据集上取得最优分类性能。

详情
AI中文摘要

图神经网络(GNN)已成为深度学习的基石,现有方法大多基于图信号处理和扩散方程来建模消息传递。然而,这些方法固有地存在过平滑问题,即随着网络深度增加,节点特征变得难以区分。受纳维-斯托克斯方程启发,我们提出了图纳维-斯托克斯网络(GNSN),这是一种新颖的架构,通过将对流引入图结构,超越了传统的基于扩散的消息传递。GNSN在图定义动态速度场来控制对流,实现更高效、更直接的消息传播。通过自适应平衡对流和扩散,GNSN能够有效处理具有不同同质性水平的数据集。在12个真实世界数据集上的广泛评估表明,GNSN在分类准确率上持续优于最先进的基线方法。此外,实验结果进一步强调了其在缓解过平滑问题方面的有效性。

英文摘要

Graph Neural Networks (GNNs) have emerged as a cornerstone of deep learning, with most existing methods rooted in graph signal processing and diffusion equations to model message passing. However, these approaches inherently suffer from the oversmoothing problem, where node features become indistinguishable as the network depth increases. Inspired by the Navier Stokes equations, we introduce Graph Navier Stokes Networks (GNSN), a novel architecture that transcends conventional diffusion-based message passing by incorporating convection into graph structures. GNSN defines a dynamic velocity field on the graph to govern convection, enabling more efficient and direct message propagation. By adaptively balancing convection and diffusion, GNSN is able to efficiently handle datasets with varying levels of homophily. Extensive evaluations across twelve real-world datasets demonstrate that GNSN consistently outperforms state-of-the-art baselines in classification accuracy. Moreover, experimental results further emphasize its effectiveness in alleviating the oversmoothing problem.

2605.25143 2026-06-02 cs.AI cs.LG 版本更新

Beyond the Frontier: Stochastic Backtracking for Efficient Test-Time Scaling

超越前沿:用于高效测试时扩展的随机回溯

Dao Tran, Duc Anh Le, Ngoc Luu, Quan Pham, Tung Pham, Hung Bui

发表机构 * Qualcomm AI Research(高通人工智能研究)

AI总结 提出随机回溯方法,通过维护历史前缀池并利用子池选择和幂回溯序列蒙特卡洛机制,在测试时扩展中实现更高的准确率-令牌数权衡。

详情
AI中文摘要

测试时扩展通过花费额外计算来探索多个解轨迹,从而改进语言模型推理。关键挑战是在推理过程中最大化准确率的同时最小化生成的令牌总数。最近的PRM引导方法对中间前缀进行评分以引导搜索,但大多数方法仅关注前沿:它们只保留当前活动的前缀,并使用带噪声的PRM分数不可逆地剪枝或重采样其余部分。这可能导致过早承诺、多样性崩溃以及丢失仍可产生正确延续的前缀。我们引入了一种基于历史前缀持久池的随机回溯,允许测试时计算重新访问先前生成的状态,而不是仅扩展当前前沿。为了提高效率,我们提出了两种互补机制。子池选择通过随机子池内应用Top-N选择来增强贪婪PRM引导搜索,使历史前缀有机会绕过评分过高的前沿候选。幂回溯序列蒙特卡洛使用幂化PRM分数和混合校正权重,将SMC风格的重采样扩展到持久池。在数学推理基准和模型规模上,我们的方法在每令牌准确率上始终更高,并且与强PRM引导基线相比,仅使用一小部分令牌数即可达到相同的准确率水平,这表明持久池随机回溯为改善测试时扩展中的准确率-令牌权衡提供了一种简单有效的方法。

英文摘要

Test-time scaling improves language model reasoning by spending additional compute to explore multiple solution trajectories. The key challenge is to maximize accuracy while minimizing the total number of generated tokens during reasoning. Recent PRM-guided methods score intermediate prefixes to steer this search, but most are frontier-only: they keep only the current active prefixes and irreversibly prune or resample away the rest using noisy PRM scores. This can cause premature commitment, diversity collapse, and the loss of prefixes that still admit correct continuations. We introduce stochastic backtracking over a persistent pool of historical prefixes, allowing test-time compute to revisit previously generated states instead of only expanding the current frontier. To make this efficient, we propose two complementary mechanisms. Subpool Selection strengthens greedy PRM-guided search by applying Top-N selection within random subpools, giving historical prefixes a chance to bypass over-scored frontier candidates. Power Backtrack Sequential Monte Carlo extends SMC-style resampling to the persistent pool using powered PRM scores and mixture-corrected weights. Across mathematical reasoning benchmarks and model scales, our methods consistently achieve higher accuracy per token count, and the same level of accuracy using only a fraction of the token count in comparison to strong PRM-guided baselines, demonstrating that persistent-pool stochastic backtracking provides a simple and effective way to improve the accuracy-token trade-off in test-time scaling.

2605.24528 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Hypothesis Generation and Inductive Inference in Children and Language Models

儿童与语言模型中的假设生成与归纳推理

Jeffrey Qin, Wasu Top Piriyakulkij, Zhuangfei Gao, Mia Radovanovic, Jessica Sommerville, Kevin Ellis, Marta Kryven

发表机构 * Computer Science University of Waterloo(滑铁卢大学计算机科学系) Department of Computer Science Cornell University(康奈尔大学计算机科学系) Department of Computer Science Dalhousie University(达尔豪斯大学计算机科学系) Department of Psychology University of Toronto(多伦多大学心理学系)

AI总结 通过归纳推理盒子任务,结合贝叶斯粒子推断的程序归纳形式化,比较儿童与基于LLM的智能体在不确定性下的假设生成与证据寻求行为,发现两者在适应环境结构上相似但信息寻求成本与归纳偏差不同。

详情
AI中文摘要

现实世界中的决策需要在证据、潜在因果规则以及世界状态本身的不确定性下构建心智模型。在这种条件下,哪些计算原理支撑人类的推理?在给定匹配约束下,基于LLM的智能体是否表现出类似行为?我们使用归纳推理盒子任务来探讨这些问题,在该任务中,参与者(人类儿童和基于LLM的智能体)通过与不确定环境的顺序交互来推断潜在原因。我们将该任务形式化为基于贝叶斯粒子推断的程序归纳,并承认两种互补的解释:(1) 作为对假设的约束满足过程,以及(2) 作为程序综合问题,其中假设是针对证据评估的可执行程序。使用基于约束的公式,我们表明儿童的行为最好由主观证据可靠性和在线假设生成的组合来解释,这解释了他们的证据寻求模式以及任务完成与规则泛化之间的分离。使用程序综合公式,我们将基于LLM的智能体视为模型有机体:可控系统,允许系统性地操纵任务条件。在各种后端中,基于LLM的智能体复制了儿童对证据可靠性和可观察性变化的反应,包括折扣不可靠证据、寻求解决部分信息以及任务完成与因果泛化之间的分离。同时,与儿童相比,基于LLM的智能体倾向于过度观察和过度遵守指令。这些结果表明,虽然儿童和基于LLM的智能体在适应环境结构方面相似,但他们的信息寻求行为表现出不同的潜在成本和归纳偏差。

英文摘要

Real world decision-making requires constructing mental models under uncertainty over evidence, over the underlying causal rules, and over the state of the world itself. Which computational principles underpin human inference under such conditions, and do LLM-based agents exhibit similar behavior given matching constraints? We address these questions using an inductive inference Box Task in which participants, human children and LLM-based agents, infer a latent cause through sequential interaction with an uncertain environment. We formalize this task as program induction with Bayesian particle-based inference, admitting two complementary interpretations: (1) as a constraint satisfaction process over hypotheses, and (2) as a program synthesis problem in which hypotheses are executable programs evaluated against evidence. Using the constraint-based formulation, we show that children's behavior is best explained by a combination of subjective evidence reliability and online hypothesis generation, accounting for both their evidence-seeking patterns and their dissociation between task completion and rule generalization. Using the program synthesis formulation, we treat LLM-based agents as model organisms: controllable systems that allow systematic manipulation of task conditions. Across backends, LLM-based agents replicate children's responses to changes in evidence reliability and observability, including discounting unreliable evidence, seeking to resolve partial information, and dissociating between task completion and causal generalization. At the same time, LLM-based agents tend to over-observe and over-comply with instructions relative to children. These results suggest that while children and LLM-based agents adapt similarly to environmental structure, their information-seeking behavior exhibits distinct underlying costs and inductive biases.

2605.18838 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Lying Is Just a Phase: The Hidden Alignment Transition in Language Model Scaling

说谎只是一个阶段:语言模型扩展中的隐藏对齐转变

Adil Amin

发表机构 * ZEHEN Labs(ZEHEN实验室)

AI总结 通过分析63个基础模型,发现语言模型在特定规模阈值下,推理能力与真实性从反相关转变为正相关,并揭示了输出投影瓶颈和零竞争注意力头等内部机制。

Comments 15 pages, 8 figures, 2 tables. Companion paper: "The Growing Pains of Frontier Models: When Leaderboards Stop Separating and What to Measure Next." ( https://doi.org/10.48550/arXiv.2605.18840). Code: https://github.com/adilamin89/cape-scaling. Dashboard: https://zehenlabs.com/cape/

详情
AI中文摘要

扩展定律预测了计算量带来的损失,但未预测能力如何相互作用。我们测量了来自16个家族的63个基础模型的推理能力与真实性之间的耦合,并发现了一个在损失曲线中不可见的相变:低于家族依赖的临界规模N_c时,能力反相关(r = -0.989,p = 4 x 10^{-5},非参数置换检验);高于该规模时,它们合作。N_c ~ 3.5B参数 [2.9B, 13.4B](bootstrap 95% CI),但模型大小并非决定相位的唯一变量。架构、数据整理和训练配方各自独立地改变N_c:精心整理的数据消除了Qwen代际之间的耦合下降(在匹配规模下从0.025到0.830),Gemma-4在4B时通过蒸馏和架构创新实现了0.871的耦合,这通常是13B+标准训练模型的特征,而Phi在1B时仅通过数据整理就达到了10B网络训练模型的耦合水平。宽度归一化消除了所有测试家族的反相关,支持输出投影瓶颈的存在。在内部,40个模型中有38个显示零竞争注意力头。一个稀疏回归ODE以5.6%的误差交叉预测了保留的Llama-2。该诊断不需要模型内部信息——仅需跨模型家族的公开基准分数。合作区域扩展到前沿(r = +0.72,34个模型,10个实验室)。一个概念验证干预证实了瓶颈是可利用的:在识别层添加单个真实方向向量,无需重新训练即可纠正税收阶段60%的错位输出——这是一种无需修改权重的、每推理一次的外科手术式修正。代码、数据、用于任何开放权重模型的开源转向CLI以及用于相位诊断的交互式仪表板已发布:https://zehenlabs.com/cape/。

英文摘要

Scaling laws predict loss from compute but not how capabilities interact. We measure the coupling between reasoning and truthfulness across 63 base models from 16 families and find a regime change invisible to loss curves: below a family-dependent critical scale N_c, capabilities anticorrelate (r = -0.989, p = 4 x 10^{-5} nonparametric permutation test); above it, they cooperate. N_c ~ 3.5B parameters [2.9B, 13.4B] (bootstrap 95% CI), but model size is not the only variable that determines phase. Architecture, data curation, and training recipe each shift N_c independently: curated training eliminated the coupling dip between Qwen generations (0.025 to 0.830 at matched scale), Gemma-4 at 4B achieves coupling 0.871, characteristic of 13B+ standard-trained models, through distillation and architectural innovation, and Phi at 1B matches web-trained coupling at 10B through data curation alone. Width normalization eliminates the anticorrelation across all tested families, supporting an output-projection bottleneck. Internally, 38 of 40 models show zero competing attention heads. A sparse-regression ODE cross-predicts held-out Llama-2 at 5.6% error. The diagnostic requires no model internals -- only public benchmark scores across a model family. The cooperative regime extends to the frontier (r = +0.72, 34 models, 10 labs). A proof-of-concept intervention confirms the bottleneck is exploitable: adding a single truth-direction vector at the identified layer corrects 60% of misaligned outputs in the tax phase with zero retraining -- a surgical, per-inference correction that requires no weight modification. Code, data, an open-source steering CLI for any open-weight model, and an interactive dashboard for phase diagnosis are released: https://zehenlabs.com/cape/.

2602.23179 2026-06-02 cs.LG q-bio.BM 版本更新

Induction Meets Biology: Mechanisms of Repeat Detection in Protein Language Models

归纳遇见生物学:蛋白质语言模型中重复检测的机制

Gal Pomerants, Yaniv Nikankin, Anja Reusch, Tomer Tsaban, Ora Schueler-Furman, Yonatan Belinkov

发表机构 * Weizmann Institute of Science(魏茨曼科学研究院)

AI总结 通过分析蛋白质语言模型在掩码预测中的行为,揭示了其检测精确和近似重复序列的两阶段机制:先构建特征表示,再利用归纳头关注重复片段中的对齐标记。

详情
AI中文摘要

蛋白质序列中存在大量重复片段,既有精确拷贝,也有带有突变的近似片段。这些重复对蛋白质结构和功能至关重要,推动了数十年来关于重复识别的算法研究。最近的研究表明,蛋白质语言模型(PLMs)通过掩码标记预测中的行为能够识别重复。为了阐明其内部机制,我们研究了PLMs如何检测精确和近似重复。我们发现,近似重复的机制在功能上包含了精确重复的机制。然后,我们描述了这一机制,揭示了两个主要阶段:首先,PLMs使用通用位置注意力头和生物学特化组件(如编码氨基酸相似性的神经元)构建特征表示;然后,归纳头关注重复片段中的对齐标记,促进正确答案的产生。我们的结果揭示了PLMs如何通过将基于语言的模式匹配与特化的生物学知识相结合来解决这一生物学任务,从而为研究PLMs中更复杂的进化过程奠定了基础。

英文摘要

Protein sequences are abundant in repeating segments, both as exact copies and as approximate segments with mutations. These repeats are important for protein structure and function, motivating decades of algorithmic work on repeat identification. Recent work has shown that protein language models (PLMs) identify repeats, by examining their behavior in masked-token prediction. To elucidate their internal mechanisms, we investigate how PLMs detect both exact and approximate repeats. We find that the mechanism for approximate repeats functionally subsumes that of exact repeats. We then characterize this mechanism, revealing two main stages: PLMs first build feature representations using both general positional attention heads and biologically specialized components, such as neurons that encode amino-acid similarity. Then, induction heads attend to aligned tokens across repeated segments, promoting the correct answer. Our results reveal how PLMs solve this biological task by combining language-based pattern matching with specialized biological knowledge, thereby establishing a basis for studying more complex evolutionary processes in PLMs.

2601.19597 2026-06-02 cs.LG 版本更新

The Geometric Mechanics of Contrastive Representation Learning: Alignment Potentials, Entropic Dispersion, and Cross-modal Divergence

对比表示学习的几何力学:对齐势、熵分散和跨模态散度

Yichao Cai, Zhen Zhang, Yuhang Liu, Javen Qinfeng Shi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过测度论框架,在大批量极限下证明InfoNCE目标与确定性能量景观的等价性,揭示单模态与对称多模态之间的几何分岔,并指出跨模态散度项导致模态间隙。

Comments 54 Pages, ICML 2026 (Refined document aesthetics for clearer reading)

详情
AI中文摘要

尽管InfoNCE是现代对比学习的基础,但其几何机制在经典的对齐-均匀分解之外仍未被充分刻画。我们发展了一个测度论框架,其中表示测度在固定的嵌入流形上演化。在大批量极限下,我们证明了值和梯度的一致性,将随机目标与显式的确定性能量景观联系起来,并揭示了单模态和对称多模态之间的几何分岔。在单模态情况下,内在能量是严格凸的,并具有唯一的吉布斯平衡,表明熵在对齐盆地中起到打破平衡的作用。在多模态情况下,内在几何变得交叉耦合,并包含一个持续的负对称散度项:每个模态的边缘分布重塑了另一个模态的有效景观,使得强成对对齐与持续的模态间隙共存。受控的合成实验和预训练CLIP表示的分析支持这些预测。总体而言,我们的结果将分析视角从逐点区分转移到总体几何,表明仅靠成对对齐不足以控制跨模态边缘结构。

英文摘要

While InfoNCE underlies modern contrastive learning, its geometric mechanisms remain under-characterized beyond the canonical alignment--uniformity decomposition. We develop a measure-theoretic framework in which representation measures evolve on a fixed embedding manifold. In the large-batch limit, we prove value and gradient consistency, linking the stochastic objective to explicit deterministic energy landscapes and revealing a geometric bifurcation between unimodal and symmetric multimodal regimes. In the unimodal case, the intrinsic energy is strictly convex and admits a unique Gibbs equilibrium, showing that entropy acts as a tie-breaker within the aligned basin. In the multimodal case, the intrinsic geometry becomes cross-coupled and contains a persistent negative symmetric divergence term: each modality's marginal reshapes the effective landscape of the other, allowing strong pairwise alignment to coexist with a persistent modality gap. Controlled synthetic experiments and analyses of pretrained CLIP representations support these predictions. Overall, our results shift the analytical lens from pointwise discrimination to population geometry, showing that pairwise alignment alone is insufficient to control cross-modal marginal structure.

2605.24202 2026-06-02 cs.AI cs.LG 版本更新

When Does Multi-Agent RL Improve LLM Workflows? Workflow, Scale, and Policy-Sharing Tradeoffs

多智能体强化学习何时能改进LLM工作流?工作流、规模与策略共享的权衡

Yifan Zeng, Yiran Wu, Yaolun Zhang, Wentian Zhao, Kun Wan, Qingyun Wu, Huazheng Wang

发表机构 * Oregon State University(俄勒冈州立大学) Pennsylvania State University(宾夕法尼亚州立大学) Adobe Inc.(Adobe公司) AG2AI, Inc.(AG2AI公司)

AI总结 研究多智能体LLM工作流中端到端强化学习训练的效果,发现改进依赖于工作流、任务和规模,策略共享不提供统一稳定性而是重新分配失败模式。

详情
AI中文摘要

多智能体LLM工作流通过将推理路由到专门角色来提升最终任务准确性,但联合训练这些角色的强化学习不稳定,其机制尚不明确。我们研究了多智能体LLM工作流的端到端RL训练何时能改进其基础模型,比较了共享策略训练(所有角色更新一个策略)和隔离策略训练(每个角色有自己的参数)。我们的实验矩阵涵盖Eval-Opt、Voting和Orch-Workers工作流、数学和代码任务以及三种模型规模(0.6B、1.7B、4B)。我们发现多智能体RL通常能改进基础模型,但增益共同依赖于工作流、任务和规模,而非仅依赖于策略共享。隔离策略倾向于达到更高的峰值准确率,但更频繁地掉入终端准确率悬崖,而共享策略训练并未消除失败;它只是将失败重新分布为性质不同的模式。然后,我们通过工作流拓扑和策略路由引起的角色级梯度动力学解释了其中最显著的模式:在隔离策略下,共享提示上的并行同角色代理会放大每个角色的梯度,并在Voting和Orch-Workers工作流中导致终端退化;在共享策略下,非对称的每步梯度质量导致共享策略被主导角色捕获,从而产生因任务和工作流而异的失败特征。总之,经验图谱及其潜在机制表明,策略共享通过不同渠道引导训练压力,而非提供统一稳定性,使其成为具有工作流和任务条件权衡的设计选择。

英文摘要

Multi-agent LLM workflows route inference through specialized roles to lift end-task accuracy, but jointly training those roles with reinforcement learning is unstable in ways that are poorly understood. We study when end-to-end RL training of multi-agent LLM workflows improves over their base models, comparing Shared-Policy training, where all roles update one policy, with Isolated-Policy training, where each role has its own parameters. Our experimental matrix spans Eval-Opt, Voting, and Orch-Workers workflows, math and code tasks, and three model scales (0.6B, 1.7B, 4B). We find that multi-agent RL usually improves over base models, but gains depend jointly on workflow, task, and scale, not on policy sharing alone. Isolated-Policy tends to reach higher peak accuracy yet more often falls off a terminal accuracy cliff, while Shared-Policy training does not eliminate failure; it redistributes failure into qualitatively different patterns. We then explain the strongest of these patterns through role-level gradient dynamics induced by workflow topology and policy routing: under Isolated-Policy, parallel same-role agents on shared prompts amplify per-role gradients and drive terminal degradation in Voting and Orch-Workers workflows; under Shared-Policy, asymmetric per-step gradient mass causes the shared policy to be captured by the dominant role, producing different failure signatures by task and workflow. Together, the empirical map and its underlying mechanisms show that policy sharing routes training pressure through different channels rather than offering uniform stability, making it a design choice with workflow- and task-conditional tradeoffs.

2212.07944 2026-06-02 cs.LG math.OC q-fin.CP q-fin.PM q-fin.ST 版本更新

Variable Clustering via Distributionally Robust Nodewise Regression

基于分布鲁棒节点回归的变量聚类

Kaizheng Wang, Xiao Xu, Xun Yu Zhou

发表机构 * Department of Industrial Engineering and Operations Research & The Data Science Institute, Columbia University(工业工程与运筹学系及数据科学研究院,哥伦比亚大学)

AI总结 本文提出一种分布鲁棒节点回归方法,通过凸松弛、数据驱动鲁棒区域选择和ADMM算法,实现多因子块模型下的变量聚类,并在数值实验中展示其优越性能。

Comments ICML 2026

详情
AI中文摘要

我们研究了一个用于变量聚类的多因子块模型,并通过分布鲁棒版本的节点回归将其与正则化子空间聚类联系起来。为了解决后一个问题,我们推导了一个凸松弛,提供了一种数据驱动的方法来选择鲁棒区域的大小,并开发了一种ADMM算法以实现高效实现。我们在广泛的数值研究中验证了我们的方法,并展示了其优越的性能。

英文摘要

We study a multi-factor block model for variable clustering and connect it to regularized subspace clustering through a distributionally robust version of nodewise regression. To solve the latter problem, we derive a convex relaxation, provide a data-driven approach for selecting the size of the robust region, and develop an ADMM algorithm for efficient implementation. We validate our method in extensive numerical studies and demonstrate its superior performance.

2605.23500 2026-06-02 cs.CV cs.LG 版本更新

B-GRTO: Bootstrapped Group Relative Tool Optimization for Referring Segmentation

B-GRTO: 引导式分组相对工具优化用于指代分割

Mario Markov, Stefan Maria Ailuro, Mohammad Mahdi, Luc Van Gool, Danda Pani Paudel

发表机构 * INSAIT Sofia University "St. Kliment Ohridski"(索菲亚大学"圣克莱门特·欧赫里迪斯基")

AI总结 提出B-GRTO框架,通过引导式预训练和分组相对工具优化,联合优化策略与可微分割解码器,显著提升复杂指代分割性能。

详情
AI中文摘要

分割是计算机视觉中的基本任务,支撑像素级场景理解,并作为从自主感知到医学图像分析等应用的基石。对于复杂的指代分割,近期方法将大型视觉-语言模型与分割解码器配对:前者分析图像和提示,后者预测目标掩码。尽管强化学习改进了推理密集型视觉-语言系统,但可训练工具(如分割解码器)通常使用可微目标单独优化,而将这些目标原则性地整合到强化学习中仍未被充分探索。因此,我们引入了分组相对工具优化(GRTO),这是一个数学上严谨的框架,用于联合优化具有可微工具使用的策略。GRTO重用分组相对策略优化(GRPO)的采样结果来优化辅助工具目标,使解码器梯度补充策略奖励。此外,我们推导出引导式GRTO(B-GRTO),一种廉价引导工具的预训练方法,从而实现更快的收敛和更优的性能。在三个具有挑战性的指代分割设置中,B-GRTO相比普通GRPO取得了显著改进,匹配或超越了领域特定的最新方法。这证明了将强化学习与可微辅助目标统一用于推理密集型分割的价值。

英文摘要

Segmentation is a fundamental task in computer vision, underpinning pixel-level scene understanding and serving as a cornerstone for applications ranging from autonomous perception to medical image analysis. For complex referring segmentation, recent methods pair large vision-language models with segmentation decoders: the former analyzes the image and prompt, while the latter predicts the target mask. Although reinforcement learning improves reasoning-intensive vision-language systems, trainable tools such as segmentation decoders are typically optimized separately with differentiable objectives, and the principled integration of such objectives into reinforcement learning remains underexplored. Thus, we introduce group relative tool optimization (GRTO), a mathematically grounded framework for jointly optimizing a policy with differentiable tool use. GRTO reuses group relative policy optimization (GRPO) rollouts to optimize the auxiliary tool objective, letting decoder gradients complement policy rewards. Further, we derive Bootstrapped-GRTO (B-GRTO), a pre-training method that cheaply bootstraps the tool, leading to faster convergence and superior performance. Across three challenging referring segmentation settings, B-GRTO results in substantial improvements over plain GRPO, matching or surpassing domain-specific state-of-the-art methods. This demonstrates the value of unifying reinforcement learning with differentiable auxiliary objectives for reasoning-intensive segmentation.

2605.23080 2026-06-02 cs.LG 版本更新

The Attribution Contract: Feature Attribution for Generative Language Models

归因契约:生成式语言模型的特征归因

Giang Nguyen

发表机构 * Guide Labs(Guide实验室)

AI总结 针对生成式语言模型中特征归因的歧义性,提出归因契约规范,明确归因对象、特征范围、生成过程等要素,并通过自回归和扩散模型案例展示不同契约下的归因效果。

详情
AI中文摘要

特征归因方法承诺识别哪些输入特征对模型输出重要。然而,在生成式语言模型中,首先往往不清楚什么应算作特征。在自回归语言模型中,先前生成的标记既是模型的输出,也是后续预测的输入。在扩散语言模型中,生成通过迭代去噪或去掩码进行,而非固定的从左到右预测,因此局部解释可能针对扩散状态而非下一个标记。我们认为这种模糊性不仅是实现细节,而是将分类器时代的特征归因直接带入生成式语言建模的概念局限。我们引入归因契约,这是一种特征归因声明的规范,它命名了被解释的输出、有资格获得归因的特征、假定的生成过程、保持不变的内容以及被归因的模型分数。该契约澄清了为什么相同的归因方法可以根据其实例化方式回答不同的问题。我们认为,生成式语言模型中关于特征归因的许多分歧并非关于归因算法的分歧,而是关于未明确说明的解释契约的分歧。以自回归和扩散语言模型为案例研究,我们展示了何时对先前生成的标记、中间状态或去噪阶段的归因是有信息的,何时是误导的,以及为什么生成式语言模型中的特征归因方法应作为方法-契约对进行评估。

英文摘要

Feature attribution methods promise to identify which input features matter for a model output. In generative language models, however, it is often unclear what should count as a feature in the first place. In autoregressive language models, earlier generated tokens are both outputs of the model and inputs to later predictions. In diffusion language models, generation proceeds through iterative denoising or unmasking rather than fixed left-to-right prediction, so local explanation may target a state of diffusion rather than a next token. We argue that this ambiguity is not merely an implementation detail, but a conceptual limitation of carrying classifier-era feature attribution directly into generative language modeling. We introduce the Attribution Contract, a specification for feature-attribution claims that names what output is being explained, which features are eligible to receive attribution, what generative process is assumed, what is held fixed, and what model score is being attributed. The contract clarifies why the same attribution method can answer different questions depending on how it is instantiated. We argue that many disagreements about feature attribution in generative language models are not disagreements about attribution algorithms, but about unstated explanatory contracts. Using autoregressive and diffusion language models as case studies, we show when attribution to earlier generated tokens, intermediate states, or denoising stages is informative, when it is misleading, and why feature-attribution methods in generative language models should be evaluated as method-contract pairs.

2605.17109 2026-06-02 cs.LG cs.AI 版本更新

DynMuon: A Dynamic Spectral Shaping View of Muon

DynMuon: 缪子的动态谱整形视角

Fangzhou Wu, Rikhav Shah, Sandeep Silwal, Qiuyi Zhang

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) MIT(麻省理工学院) Elorian AI

AI总结 本文提出DynMuon方法,通过动态调整谱整形参数p(从正到负),在训练过程中平衡高曲率与低曲率方向,从而加速收敛并降低验证损失。

Comments 21 pages

详情
AI中文摘要

近年来,Muon已成为训练大型语言模型及更广泛的Transformer的主导方法。与标准梯度下降方法相比,其本质区别在于将通常的更新矩阵$M=UΣV^\top$替换为其极因子$UV^\top$。在本文中,我们考虑一类类似Muon的更新,其中将更新$M$替换为$UΣ^p V^\top$,参数$p$。我们称此为“谱整形”操作,并发展了一套选择$p$的理论,该选择依赖于(a)损失函数的局部曲率,(b)来自随机梯度和标签噪声的噪声,以及(c)训练阶段。我们的理论和实验揭示了一个先前被忽视的行为:正的$p$通过强调高曲率方向并加速信号收缩而在早期有帮助,而轻微负的$p$通过将更新强度重新分配到仍包含有用训练信号的低曲率方向而在后期有帮助。基于这一洞察,我们提出了DynMuon,一种高效的动态谱整形方法,在训练过程中将$p$从正值调度到轻微负值。跨模型大小、架构和训练设置的大量实验表明,DynMuon始终比Muon获得更低的验证损失,同时达到相同目标损失所需的步数减少10.6-26.5%。我们的代码可在https://github.com/fzwark/DynMuon获取。

英文摘要

In recent years, Muon has emerged as the dominant method for training large language models, and transformers more broadly. The essential difference, when compared to standard gradient descent methods, is to replace the usual update matrix $M=UΣV^\top$ with its polar factor $UV^\top$. In this work, we consider a class of Muon-like updates, where we replace the update $M$ with $UΣ^p V^\top$ for some parameter $p$. We call this a "spectral-shaping" operation, and develop a theory of how to pick $p$ which depends on (a) local curvature of the loss function, (b) noise stemming from stochastic gradients and label noise, and (c) training stage. Our theory and experimentation reveal a previously overlooked behavior: positive $p$ helps early by emphasizing high-curvature directions and accelerating signal contraction, while mildly negative $p$ helps later by reallocating update strength toward low-curvature directions that still contain useful training signals. Building on the insight, we propose DynMuon, an efficient dynamic spectral shaping method that schedules $p$ from positive to mildly negative over training. Extensive experiments across model sizes, architectures, and training settings show that DynMuon consistently achieves lower validation loss than Muon, while requiring 10.6-26.5% fewer steps to reach the same target loss. Our code is available at https://github.com/fzwark/DynMuon.

2604.13517 2026-06-02 cs.LG cs.AI 版本更新

Representation over Routing: Diagnosing Temporal Routing Pathologies in Multi-Timescale PPO

表征优于路由:诊断多时间尺度PPO中的时间路由病理

Jing Sun

发表机构 * Information Engineering School, Chengyi College, Jimei University(信息工程学院, Chengyi 学院, 厦门大学)

AI总结 本文通过形式化代理目标攻击和时间不确定性悖论,揭示了多时间尺度PPO中可微路由和基于误差路由的数值捷径问题,并提出目标解耦方法消除演员侧路由路径以改善性能。

Comments 8 pages, 3 figures

详情
AI中文摘要

强化学习中的时间信用分配通常通过引入多个折扣因子的价值估计来处理。一个自然的下一步是让演员在这些时间头之间动态路由,使用可微注意力或启发式不确定性权重。本文认为,这种路由可能产生数值捷径而非可靠的时间抽象。我们在LunarLander-v2上的受控PPO设置中研究此问题,将环境用作诊断故障模式的视觉沙箱。首先,我们形式化了代理目标攻击:暴露于PPO代理的可微softmax路由器会直接获得梯度,指向对当前更新数值有利的优势头,即使这种路由变化并不对应物理控制的改进。由于不同折扣因子的未归一化优势具有不同的有效尺度,这产生了尺度差异脆弱性。其次,我们在基于梯度的无误差路由中识别了时间不确定性悖论:短视头可能获得最大的路由份额,因为其预测目标更容易,即使它们与延迟任务成功的对齐程度较低。作为结构性回应,我们研究了目标解耦:评论家可以保留多时间尺度辅助头,但演员仅使用长视优势进行更新。目标解耦并非作为广泛的性能提升器;在此运行集中,它消除了可被利用的演员侧路由路径,并改善了观察到的最差种子回报。代码可在 https://github.com/ben-dlwlrma/Representation-Over-Routing 获取。

英文摘要

Temporal credit assignment in reinforcement learning is often approached by introducing value estimates at multiple discount factors. A natural next step is to let the actor dynamically route among these temporal heads, using either differentiable attention or heuristic uncertainty weights. This paper argues that such routing can create a numerical shortcut rather than a reliable temporal abstraction. We study this issue in a controlled PPO setting on LunarLander-v2, using the environment as a visual sandbox for diagnosing failure modes. First, we formalize Surrogate Objective Hacking: a differentiable softmax router exposed to the PPO surrogate receives a direct gradient toward advantage heads that are numerically favorable for the current update, even when this routing change does not correspond to improved physical control. Because unnormalized advantages at different discount factors have different effective scales, this creates a scale-discrepancy vulnerability. Second, we identify the Paradox of Temporal Uncertainty in gradient-free error-based routing: short-horizon heads can receive the largest routing share because their prediction targets are easier, even when they are less aligned with delayed task success. As a structural response, we study Target Decoupling: the critic may retain multi-timescale auxiliary heads, but the actor is updated only with the long-horizon advantage. Target Decoupling is not presented as a broad performance booster; in this run set it removes the exploitable actor-side routing pathway and improves the observed worst-seed return. Code is available at https://github.com/ben-dlwlrma/Representation-Over-Routing.

2605.22305 2026-06-02 cs.LG 版本更新

Chebyshev Policies and the Mountain Car Problem: Reinforcement Learning for Low-Dimensional Control Tasks

切比雪夫策略与山地车问题:低维控制任务的强化学习

Stefan Huber, Hannes Unger, Georg Schäfer, Jakob Rehrl

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 本文解析求解山地车问题,推导最优控制,并引入切比雪夫策略作为通用策略类,在低维控制任务中显著优于神经网络。

Comments ICML 2026 Oral

详情
AI中文摘要

我们解析求解了强化学习中的经典基准问题——山地车问题,并推导出最优控制解,填补了36年来的空白。这使我们得以揭示两个令人惊讶的见解:最优控制非常简单,然而现代强化学习智能体与最优性之间存在巨大差距。受最优控制分析的启发,我们从基本原理出发,引入了切比雪夫策略作为强化学习策略的通用(即稠密)类。它们可以作为神经网络的即插即用替代品进行训练,将遗憾值降低4.18倍,同时所需参数减少277倍,从而促进样本效率、可解释性和实时能力。切比雪夫策略在进一步的强化学习任务上进行了评估,包括一个真实世界的非线性运动控制测试平台。在使用PPO、ARS和REINFORCE算法时,它们始终优于神经网络。我们的结果证明了切比雪夫策略在低维控制任务中作为神经网络的一种引人注目且轻量级的替代或补充方案的有效性。

英文摘要

We analytically solve the Mountain Car problem, a canonical benchmark in RL, and derive an optimal control solution, closing a gap after 36 years. This enables us to reveal two surprising insights: The optimal control is quite simple, yet modern RL agents display a large gap to optimality. Motivated by the analysis of the optimal control, we introduce Chebyshev policies as a universal (i.e. dense) class of RL policies from first principles. They can be trained as drop-in replacements of neural nets, reducing the regret by a factor of 4.18, while requiring 277 times fewer parameters, fostering sample efficiency, explainability and realtime capability. Chebyshev policies are evaluated on further RL tasks, including a real-world nonlinear motion control testbed. They consistently improve performance over neural nets with PPO, ARS and REINFORCE. Our results demonstrate how Chebyshev policies offer a compelling and lightweight alternative or addition to neural nets for low-dimensional control tasks.

2605.00941 2026-06-02 cs.LG cs.CV 版本更新

Divergence is Uncertainty: A Closed-Form Posterior Covariance for Flow Matching

散度即不确定性:流匹配的闭式后验协方差

Jiarui Xing, Song Wang, Jian Wang

发表机构 * Yale University(耶鲁大学) Shanxi University(山西大学) Harvard Medical School(哈佛医学院)

AI总结 本文通过扩展Tweedie公式到流匹配插值,推导出生成轨迹上每一点后验协方差的精确闭式表达式,该表达式仅依赖于学习速度场的散度,可在预训练模型上事后计算,无需重新训练或修改架构。

Comments 9 Pages, 5 figures

详情
AI中文摘要

流匹配已成为生成建模的领先框架,但量化其样本的不确定性仍是一个开放问题。现有方法使用辅助方差头重新训练模型、维护昂贵的集成或通过多个积分步骤传播近似协方差,在训练成本、推理成本或准确性之间进行权衡。我们表明这些权衡都不是必需的。通过将Tweedie公式从去噪设置扩展到流匹配插值,我们推导出生成轨迹上每一点后验协方差的精确闭式表达式。结果仅依赖于一个量,即学习速度场的散度,该散度可以在任何预训练的流匹配模型上事后计算,无需重新训练和架构修改。对于像MeanFlow这样的单步生成器,相同的公式在单次前向传递中产生端到端的生成不确定性,消除了所有先前方法所需的多步方差传播。在MNIST上的实验证实,得到的逐像素不确定性图在语义上有意义,集中在样本间变化最大的数字边界上,并且标量不确定性分数跟踪实际预测误差,所有计算量大约比集成或蒙特卡洛丢弃法少$10^4$倍。

英文摘要

Flow matching has become a leading framework for generative modeling, but quantifying the uncertainty of its samples remains an open problem. Existing approaches retrain the model with auxiliary variance heads, maintain costly ensembles, or propagate approximate covariance through many integration steps, trading off training cost, inference cost, or accuracy. We show that none of these trade-offs is necessary. By extending Tweedie's formula from the denoising setting to the flow matching interpolant, we derive an exact, closed-form expression for the posterior covariance at every point along the generative trajectory. The result depends on a single quantity, namely the divergence of the learned velocity field, which can be computed post-hoc on any pre-trained flow matching model, requiring no retraining and no architectural modification. For one-step generators such as MeanFlow, the same formula yields the end-to-end generation uncertainty in a single forward pass, eliminating the multi-step variance propagation required by all prior methods. Experiments on MNIST confirm that the resulting per-pixel uncertainty maps are semantically meaningful, concentrating on digit boundaries where inter-sample variation is highest, and that the scalar uncertainty score tracks actual prediction error, all at roughly $10^4 \times$ less total compute than ensembling or Monte Carlo dropout.

2508.03556 2026-06-02 cs.LG 版本更新

VRPRM: Process Reward Modeling via Visual Reasoning

VRPRM: 通过视觉推理的过程奖励建模

Xinquan Chen, Chongying Yue, Bangwei Liu, Xuhong Wang, Yingchun Wang, Chaochao Lu

发表机构 * PJ Lab(PJ实验室)

AI总结 提出VRPRM,一种结合视觉推理的过程奖励模型,通过两阶段训练策略(少量CoT-PRM SFT数据+大量非CoT-PRM RL数据)以较低成本实现高质量推理,性能提升达118%。

Comments 20 pages, 11 figures

详情
AI中文摘要

过程奖励模型(PRM)因其能够对生成内容的推理步骤进行细粒度评估,被广泛用于大型语言模型(LLM)的后训练中。然而,大多数PRM缺乏长期推理和深度思考能力。另一方面,尽管少数工作尝试将思维链(CoT)能力引入PRM,但CoT-PRM数据的标注成本过高,难以在各种任务中发挥稳定作用。为应对上述挑战,我们提出VRPRM,一种通过视觉推理的过程奖励模型,并设计了一种高效的两阶段训练策略。实验结果表明,仅使用3.6K CoT-PRM监督微调(SFT)数据和50K非CoT-PRM强化学习(RL)训练数据,VRPRM即可超越总数据量达400K的非思考PRM,并在BoN实验中相对于基模型实现了高达118%的相对性能提升。这一结果证实,所提出的组合训练策略能够以较低的数据标注成本实现更高质量的推理能力,从而为更高效数据利用的PRM训练提供了新范式。

英文摘要

Process Reward Model (PRM) is widely used in the post-training of Large Language Model (LLM) because it can perform fine-grained evaluation of the reasoning steps of generated content. However, most PRMs lack long-term reasoning and deep thinking capabilities. On the other hand, although a few works have tried to introduce Chain-of-Thought (CoT) capability into PRMs, the annotation cost of CoT-PRM data is too expensive to play a stable role in various tasks. To address the above challenges, we propose VRPRM, a process reward model via visual reasoning, and design an efficient two-stage training strategy. Experimental results show that using only 3.6K CoT-PRM Supervised Fine-Tuning(SFT) data and 50K non-CoT PRM Reinforcement Learning (RL) training data, VRPRM can surpass the non-thinking PRM with a total data volume of 400K and achieved a relative performance improvement of up to 118\% over the base model in the BoN experiment. This result confirms that the proposed combined training strategy can achieve higher quality reasoning capabilities at a lower data annotation cost, thus providing a new paradigm for PRM training with more efficient data utilization.

2605.21648 2026-06-02 cs.LG cond-mat.dis-nn cs.NE stat.ML 版本更新

Dropout Universality: Scaling Laws and Optimal Scheduling at the Edge-of-Chaos

Dropout 普适性:混沌边缘的缩放定律与最优调度

Lucas Fernandez Sarmiento

发表机构 * Lucas Fernandez Sarmiento

AI总结 提出 dropout 作为临界信号传播扰动的平均场理论,发现前端加载的 dropout 调度在固定预算下可将 MLP 和 Vision Transformer 的测试损失降低 18-35%,并推导出相关缩放定律与普适类。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026). 36 pages, 11 figures. Camera-ready version

详情
AI中文摘要

我们发展了 dropout 作为混沌边缘临界信号传播扰动的平均场理论,并表明它预测了一个简单、零成本的实践改变:前端加载的 dropout 调度在固定预算下,在 MLP 和 Vision Transformer 中比恒定 dropout 降低测试损失 18-35%。理论机制是 dropout 移动了完美对齐固定点,使得即使在临界初始化下信息传播的深度尺度也变得有限。我们推导了相关衰减的临界和交叉缩放定律,并建立了平滑激活和带拐点的 ReLU 类激活构成不同的普适类,具有不同的临界指数以及在失谐和 dropout 强度下的通用两参数缩放塌缩。这种区别追溯到相关映射的解析结构:平滑激活在完美对齐附近允许泰勒展开,而带拐点的激活则出现具有普适非解析性的分支点。作为推论,该框架在固定预算下产生饱和的 dropout 轮廓;然后通过正则化可达性论证选择前端加载的调度,精度提升作为一致的次要效果。我们还讨论了相同的高斯核结构如何将理论从 MLP 扩展到 CNN 和残差架构。

英文摘要

We develop a mean-field theory of dropout as a perturbation of critical signal propagation at the edge of chaos, and show that it predicts a simple, no-cost change to standard practice: \emph{front-loaded} dropout schedules cut test loss by \(18\)--\(35\%\) over constant dropout in MLPs and Vision Transformers at fixed budget. The theoretical mechanism is that dropout shifts the perfect-alignment fixed point, making the depth scale for information propagation finite even at critical initialization. We derive critical and crossover scaling laws for correlation decay and establish that smooth activations and kinked, \relu{}-like activations constitute distinct universality classes, with different critical exponents and a universal two-parameter scaling collapse in detuning and dropout strength. The distinction traces to the analytic structure of the correlation map: smooth activations admit a Taylor expansion near perfect alignment, while kinked activations develop a branch point with universal non-analyticity. As a corollary, the framework yields saturated dropout profiles under fixed budget; a regularization-reach argument then selects front-loaded schedules, with accuracy gains as a consistent secondary effect. We also discuss how the same Gaussian-kernel structure extends the theory beyond MLPs toward CNNs and residual architectures.

2602.11210 2026-06-02 cs.SE cs.AI cs.LG 版本更新

SWE-MiniSandbox: Container-Free Reinforcement Learning for Building Software Engineering Agents

SWE-MiniSandbox:用于构建软件工程智能体的无容器强化学习

Danlong Yuan, Wei Wu, Enhan Zhao, Zhengren Wang, Xueliang Zhao, Huishuai Zhang, Dongyan Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SWE-MiniSandbox,一种轻量级无容器方法,通过内核级隔离和预缓存技术降低磁盘使用和准备时间,实现可扩展的强化学习训练。

详情
AI中文摘要

强化学习已成为训练软件工程智能体的关键范式,但现有流程通常依赖每个任务的容器进行隔离。在大规模场景下,预构建的容器镜像会带来显著的存储开销、缓慢的环境设置,并且需要容器管理权限。我们提出SWE-MiniSandbox,一种轻量级、无容器的方法,能够在无需牺牲隔离性的情况下实现SWE智能体的可扩展强化学习训练。SWE-MiniSandbox不依赖每个实例的容器,而是在由内核级机制支持的隔离工作空间中执行每个任务,从而大幅降低系统开销。它利用轻量级环境预缓存技术,消除了对庞大容器镜像的需求。因此,我们的方法将磁盘使用量降低到基于容器的流程所需的大约5%,并将环境准备时间缩短到容器基线的大约25%。实验结果表明,SWE-MiniSandbox实现了与标准基于容器的流程相当的评估性能。通过消除对重型容器基础设施的依赖,SWE-MiniSandbox为扩展基于强化学习的SWE智能体提供了一个实用且可访问的基础,特别是在资源受限的研究环境中。

英文摘要

Reinforcement learning (RL) has become a key paradigm for training software engineering (SWE) agents, but existing pipelines typically rely on per-task containers for isolation. At scale, pre-built container images incur substantial storage overhead, slow environment setup, and require container-management privileges. We propose SWE-MiniSandbox, a lightweight, container-free method that enables scalable RL training of SWE agents without sacrificing isolation. Instead of relying on per-instance containers, SWE-MiniSandbox executes each task in an isolated workspace backed by kernel-level mechanisms, substantially reducing system overhead. It leverages lightweight environment pre-caching techniques to eliminate the need for bulky container images. As a result, our approach lowers disk usage to approximately 5\% of that required by container-based pipelines and reduces environment preparation time to about 25\% of the container baseline. Empirical results demonstrate that SWE-MiniSandbox achieves evaluation performance comparable to standard container-based pipelines. By removing the dependency on heavy container infrastructure, SWE-MiniSandbox offers a practical and accessible foundation for scaling RL-based SWE agents, particularly in resource-constrained research environments.

2605.21422 2026-06-02 cs.LG 版本更新

PRISM: Preference-Aware Influence Function Based Data Selection Method for Efficient Fine-Tuning

PRISM:基于偏好感知影响函数的数据选择方法用于高效微调

Qihao Lin, Guanxu Chen, Dongrui Liu, Jing Shao

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出PRISM方法,通过偏好感知影响函数对目标示例加权,构建偏好感知目标方向,优先选择有效驱动模型匹配目标行为的数据,提升高效微调和安全对齐微调性能。

Comments 23 pages, 5 figures

详情
AI中文摘要

随着LLM规模不断扩大,提高训练效率在很大程度上依赖于有效的数据利用。数据选择通过将有限的训练预算分配给能够最优促进模型目标行为的高价值样本来缓解这一问题。大多数现有方法通过一组目标示例定义目标行为,并根据候选训练数据对这些样本的估计影响进行评分。然而,这些方法将所有目标示例视为同等重要,忽略了单个示例对模型优化的不同相关性。具体来说,与模型固有行为紧密对齐的目标示例提供更强的监督信号,而不一致的示例仅提供微弱且无效的局部指导。我们提出PRISM,一种基于偏好感知影响函数的数据选择方法。它利用模型偏好为目标示例分配权重,并构建偏好感知目标方向。PRISM根据候选训练样本对该方向的影响进行评估,并优先将数据预算分配给能有效驱动模型匹配预期目标行为的样本。理论分析验证,与均匀聚合策略相比,加权偏好构造能产生更优的一阶梯度方向以提升目标偏好。涵盖不同模型架构和参数规模的广泛实验表明,PRISM在高效微调和安全对齐监督微调修正中取得了更好的性能。结果验证了目标行为的准确表征是成本效益数据选择的核心。

英文摘要

As LLMs continue to scale up, improving training efficiency heavily relies on effective data utilization. Data selection mitigates this issue by allocating the limited training budget to high-value examples that optimally facilitate the model's target behavior. Most existing approaches define target behavior via a set of target examples and score candidate training data based on their estimated influence on these samples. However, such methods uniformly treat all target examples as equally important, ignoring the varying relevance of individual examples to model optimization. Specifically, target examples that align closely with the model's inherent behavior deliver stronger supervisory signals, whereas discrepant examples yield only weak and ineffective local guidance. We propose PRISM, a Preference-aware Influence function based Data Selection Method. It leverages model preference to assign weights to target examples and builds a preference-aware target direction. PRISM evaluates candidate training samples according to their influence on this direction, and prioritizes data budget allocation to samples that effectively drive the model to match expected target behavior. Theoretical analysis verifies that weighted preference construction generates a superior first-order gradient direction for boosting target preference, compared with uniform aggregation strategies. Extensive experiments covering diverse model architectures and parameter scales demonstrate that PRISM achieves better performance in efficient fine-tuning and safety-aligned supervised fine-tuning rectification. The results validate that accurate characterization of target behavior serves as the core of cost-effective data selection.

2605.21125 2026-06-02 cs.LG 版本更新

Advantage Collapse in Group Relative Policy Optimization: Diagnosis and Mitigation

群体相对策略优化中的优势崩塌:诊断与缓解

Xixiang He, Qiyao Sun, Ao Cheng, Xingming Li, Xuanyu Ji, Hailun Lu, Runke Huang, Qingyong Hu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对GRPO在可验证奖励强化学习中的优势崩塌问题,提出诊断指标ACR和轻量级扩展方法AVSPO,通过注入虚拟奖励样本减少梯度消失,提升模型推理能力。

Comments Accepted at the International Conference on Machine Learning (ICML 2026). Project page: https://QingyongHu.github.io/AVSPO

详情
AI中文摘要

群体相对策略优化(GRPO)是可验证奖励强化学习(RLVR)框架中的一种重要算法,在提升大型语言模型(LLMs)的推理能力方面取得了显著成果。然而,GRPO容易发生优势崩塌,这是一种故障模式,其中组内的同质奖励(例如,全部正确或全部错误的答案)产生接近零的优势和消失的梯度。为了解决这个问题,我们引入了优势崩塌率(ACR),这是第一个量化具有无效梯度的训练批次比例的诊断指标。在数学推理基准上,使用从0.5B到14B参数的模型,我们证明ACR能够强有力地预测训练停滞和最终性能。然后,我们提出了自适应虚拟样本策略优化(AVSPO),这是GRPO的一个轻量级扩展,通过实时ACR监控指导注入虚拟奖励样本,使得无需额外的模型 rollout 即可从同质组中学习。与GRPO相比,AVSPO将优势崩塌减少了58-63%,并在所有模型规模上一致地获得了4-6个百分点的准确率提升,同时在评估的域外任务上保持了泛化能力。代码和数据集可在 https://github.com/hexixiang/Advantage-Collapse-Rate 获取。

英文摘要

Group Relative Policy Optimization (GRPO), a prominent algorithm within the Reinforcement Learning from Verifiable Rewards (RLVR) framework, has achieved strong results in improving the reasoning capabilities of large language models (LLMs). However, GRPO is prone to advantage collapse, a failure mode where homogeneous rewards within a group (e.g., all correct or all incorrect answers) yield near-zero advantages and vanishing gradients. To address this, we introduce the Advantage Collapse Rate (ACR), the first diagnostic metric quantifying the proportion of training batches with ineffective gradients. Across models from 0.5B to 14B parameters on mathematical reasoning benchmarks, we show that ACR strongly predicts training stagnation and final performance. We then propose Adaptive Virtual Sample Policy Optimization (AVSPO), a lightweight extension of GRPO that injects virtual reward samples, guided by real-time ACR monitoring, to enable learning from homogeneous groups without additional model rollouts. AVSPO reduces advantage collapse by 58-63% relative to GRPO and yields consistent accuracy gains of 4-6 percentage points across all model scales, while maintaining generalization on the evaluated out-of-domain task. Code and datasets are available at https://github.com/hexixiang/Advantage-Collapse-Rate.

2605.20854 2026-06-02 cs.LG 版本更新

Finite-Time Regret Analysis of Retry-Aware Bandits

重试感知赌博机的有限时间遗憾分析

Bingkui Tong, Junpei Komiyama, Soichiro Nishimori, Paavo Parmas

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(莫扎伊德·本·扎耶德人工智能大学) RIKEN AIP(日本学术振兴会高级研究所) The University of Tokyo(东京大学)

AI总结 研究针对重试感知目标(如pass@k和max@k)的随机赌博机算法ReMax,通过期望改进平衡条件刻画其最优采样分布,并证明首次亚线性遗憾界,揭示其比汤普森采样更具剥削性的原因。

Comments 38 pages

详情
AI中文摘要

我们研究了一种由重试感知目标(重视多次尝试中的最佳结果,如pass@$k$和max@$k$)启发的随机赌博机算法。给定臂值的后验分布,ReMax选择一种采样分布,最大化在$M$次虚拟抽取中后验期望最大奖励。尽管该目标在强化学习中作为不确定性下的探索机制被引入,但其在赌博机问题中的遗憾性质一直不清楚。对于高斯奖励和第一个非平凡情况$M=2$,我们通过期望改进平衡条件刻画了最优ReMax分布,并证明了ReMax的第一个亚线性遗憾界。我们的分析将次优臂的通常饱和行为与ReMax特有的低估效应分开,其中最优臂可能在不利估计后被采样过少。这解释了为什么ReMax可能比汤普森采样(TS)更具剥削性,以及其遗憾分析在技术上的微妙之处。实验支持这一图景:在轻度低估下,ReMax通常优于KL-UCB和汤普森采样,而后验方差缩放经验性地缓解了严重低估。

英文摘要

We study a stochastic bandit algorithm motivated by retry-aware objectives that value the best outcome among multiple attempts, such as pass@$k$ and max@$k$. Given a posterior over arm values, ReMax chooses a sampling distribution that maximizes the posterior expected maximum reward over $M$ virtual draws. Although this objective was introduced in reinforcement learning as an exploration mechanism under uncertainty, its regret properties in bandit problems have remained unclear. For Gaussian rewards and the first nontrivial case $M=2$, we characterize the optimal ReMax distribution through an expected-improvement balance condition and prove the first sublinear regret bound for ReMax. Our analysis separates the usual saturation behavior of suboptimal arms from a ReMax-specific underestimation effect, in which the optimal arm may be sampled too rarely after an unfavorable estimate. This explains why ReMax can be more exploitative than Thompson sampling (TS) and why its regret analysis is technically delicate. Experiments support this picture: ReMax often outperforms KL-UCB and Thompson sampling under mild underestimation, while posterior-variance scaling empirically mitigates severe underestimation.

2605.19847 2026-06-02 cs.CR cs.IR cs.LG 版本更新

Auditing Privacy in Multi-Tenant RAG under Account Collusion

多租户RAG中账户共谋下的隐私审计

Florian A. D. Burnat

发表机构 * University of Bath(巴斯大学)

AI总结 针对多租户RAG中同一索引下账户共谋导致隐私泄露加剧的问题,提出一种可验证的审计协议,用于认证噪声-选择检索并报告共谋上限内的隐私损失。

详情
AI中文摘要

多租户RAG服务通常将账户视为隐私边界:每个账户针对租户索引获得$(\varepsilon_{ ext{acc}},δ_{ ext{acc}})$-DP检索保证。我们表明,这种框架低估了同一索引下账户共谋的泄露。对于高斯噪声-选择检索,$k$个协调的同一租户账户组合成联合泄露$Θ(\sqrt{k}\,\varepsilon_{ ext{acc}})$,而非$\varepsilon_{ ext{acc}}$;我们给出匹配的成员推断攻击,并在标量、top-$K$、训练嵌入器和生产规模的HNSW设置中验证了预测的$\sqrt{k}$ AUC趋势。然后,我们给出一个验证者可运行的审计协议,该协议认证噪声-选择检索,并针对达到声明上限$k_{\max}$的联盟报告$( extsf{PASS},\varepsilon_{ ext{audit}})$,而不泄露索引或改变检索决策规则。该声明仅针对检索通道:生成通道泄露和对抗性鲁棒的联盟规模估计是补充审计谓词。

英文摘要

Multi-tenant RAG services often treat the account as the privacy boundary: each account receives an $(\varepsilon_{\text{acc}},δ_{\text{acc}})$-DP retrieval guarantee against the tenant index. We show that this framing understates leakage under same-index account collusion. For Gaussian noise-then-select retrieval, $k$ coordinated same-tenant accounts compose to joint leakage $Θ(\sqrt{k}\,\varepsilon_{\text{acc}})$, not $\varepsilon_{\text{acc}}$; we give a matching membership-inference attack and validate the predicted $\sqrt{k}$ AUC trend in scalar, top-$K$, trained-embedder, and production-scale HNSW settings. We then give a verifier-runnable audit protocol that attests noise-then-select retrieval and reports $(\textsf{PASS},\varepsilon_{\text{audit}})$ for coalitions up to a declared cap $k_{\max}$, without disclosing the index or changing the retrieval decision rule. The claim is retrieval-channel only: generation-channel leakage and adversarially robust coalition-size estimation are complementary audit predicates.

2605.01752 2026-06-02 cs.LG 版本更新

Robust Linear Dueling Bandits with Post-serving Context under Unknown Delays and Adversarial Corruptions

鲁棒线性决斗赌博机:在未知延迟和对抗性破坏下的服务后上下文

Youngmin Oh

发表机构 * KAIST(韩国科学技术院)

AI总结 针对同时存在服务后上下文、延迟反馈和对抗性破坏的易变环境,提出RCDP-UCB算法,通过预测服务后上下文和自适应加权策略,实现了延迟无关的遗憾上界,并揭示了破坏与延迟之间的加性成本结构。

详情
AI中文摘要

我们研究了在易变环境中的线性决斗赌博机,其特点是同时存在服务后上下文、延迟反馈和对抗性破坏。反馈受到未知随机或对抗性延迟以及累积破坏预算$\mathcal{C}$的影响。为了解决这些挑战,我们提出了RCDP-UCB算法,该算法集成一个学习近似器,从服务前信息预测服务后上下文。它进一步采用自适应加权策略,裁剪特征向量以同时减轻破坏和延迟观测的影响。在标准正则性条件和参数化服务后映射下,我们严格证明了我们的算法是延迟机制无关的,实现了$\widetilde{\mathcal{O}}(d(\sqrt{T} + \mathcal{C} + \mathcal{D}))$的遗憾上界,其中$d$是总特征维度,$\mathcal{D}$封装了延迟复杂度。关键的是,我们的分析揭示了破坏和延迟之间的加性成本结构,避免了先前工作中典型的乘性退化。我们进一步建立了下界,在没有服务后上下文的情况下,对于对抗性延迟,下界几乎与上界匹配,仅差$\sqrt{d}$因子。代码可在https://github.com/youngmin0oh/rcdp-public获取。

英文摘要

We study linear dueling bandits in volatile environments characterized by the simultaneous presence of post-serving contexts, delayed feedback, and adversarial corruption. Feedback is subject to unknown stochastic or adversarial delays and a cumulative corruption budget $\mathcal{C}$. To address these challenges, we propose e RCDP-UCB, which integrates a learned approximator that predicts post-serving contexts from pre-serving information. It further employs an adaptive weighting strategy that clips feature vectors to mitigate the impact of corrupted and delayed observations simultaneously. Under standard regularity conditions and a parametric post-serving mapping, we rigorously establish that our algorithm is delay-regime-agnostic, achieving a regret upper bound of $\widetilde{\mathcal{O}}(d(\sqrt{T} + \mathcal{C} + \mathcal{D}))$, where $d$ is the total feature dimension and $\mathcal{D}$ encapsulates the delay complexity. Crucially, our analysis reveals an additive cost structure between corruption and delay, avoiding the multiplicative degradation typical of prior works. We further establish lower bounds that nearly match our upper bounds up to a $\sqrt{d}$ factor for adversarial delays in the absence of post-serving contexts. Code is available at https://github.com/youngmin0oh/rcdp-public.

2605.19263 2026-06-02 cs.LG cs.NA math.NA 版本更新

From Simple to Complex: Curriculum-Guided Physics-Informed Neural Networks via Gaussian Mixture Models

从简单到复杂:基于高斯混合模型的课程引导物理信息神经网络

Jianan Yang, Yiran Wang, Shuai Li, Fujun Cao, Xuefei Yan, Junmin Liu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出CGMPINN,通过高斯混合模型拟合PDE残差分布以量化学习难度,并采用动态课程学习逐步从简单区域过渡到困难区域,显著提升PINN在强非线性、尖锐梯度或多尺度问题上的收敛性和精度。

Comments 23 pages, 15 figures

详情
AI中文摘要

物理信息神经网络(PINN)提供了一种无网格框架用于求解偏微分方程(PDE),但训练过程中常面临梯度病态、频谱偏差和收敛性差的问题,尤其是对于具有强非线性、尖锐梯度或多尺度特征的问题。我们提出了课程引导高斯混合物理信息神经网络(CGMPINN),它将高斯混合模型与动态课程学习相结合。具体来说,周期性地将GMM拟合到PDE残差分布上,以量化空间变化的学习难度。一个平滑的课程计划逐步将训练重点从简单区域转移到困难区域,同时在早期优化过程中,基于精度的方差调制抑制不可靠的聚类。这种双重课程由一个共享的课程参数控制,并且可以与自适应损失平衡相结合。我们进一步建立了理论保证,包括诱导时变损失的梯度范数的次线性收敛、课程加权损失与标准PDE损失之间的均匀等价性,以及具有显式加权诱导偏差表征的泛化界。在涵盖椭圆型、抛物型、双曲型、对流主导型和非线性反应扩散型的六个基准PDE上的实验表明,CGMPINN在所有比较方法中一致地实现了最低的相对$L_2$和最大绝对误差,在相当的计算成本下,相对于标准PINN,相对$L_2$误差降低了高达97.8%。我们的代码公开在https://github.com/Mathematics-Yang/CGMPINN。

英文摘要

Physics-informed neural networks (PINNs) offer a mesh-free framework for solving partial differential equations (PDEs), yet training often suffers from gradient pathologies, spectral bias, and poor convergence, especially for problems with strong nonlinearity, sharp gradients, or multiscale features. We propose the Curriculum-Guided Gaussian Mixture Physics-Informed Neural Network (CGMPINN), which integrates Gaussian mixture modeling with dynamic curriculum learning. Specifically, a GMM is periodically fitted to the PDE residual distribution to quantify spatially varying learning difficulty. A smooth curriculum schedule progressively shifts training focus from easy to harder regions, while precision-based variance modulation suppresses unreliable clusters during early optimization. This dual curriculum is governed by a shared curriculum parameter and can be combined with self-adaptive loss balancing. We further establish theoretical guarantees, including sublinear convergence of the gradient norm for the induced time-varying loss, uniform equivalence between the curriculum-weighted and standard PDE losses, and a generalization bound with an explicit weighting-induced bias characterization. Experiments on six benchmark PDEs spanning elliptic, parabolic, hyperbolic, advection-dominated, and nonlinear reaction-diffusion types show that CGMPINN consistently achieves the lowest relative $L_2$ and maximum absolute errors among all compared methods, reducing relative $L_2$ error by up to 97.8\% over the standard PINN at comparable cost. Our code is publicly available at https://github.com/Mathematics-Yang/CGMPINN.

2605.17839 2026-06-02 cs.LG cs.AI 版本更新

Balancing Knowledge Distillation for Imbalance Learning with Bilevel Optimization

基于双层优化的不平衡学习知识蒸馏平衡

Anh B. H. Nguyen, Ba Tho Phan, Viet Cuong Ta

发表机构 * VNU University of Engineering and Technology(越南工程技术大学)

AI总结 提出BiKD双层框架,通过自适应样本级权重平衡硬损失和软损失,解决不平衡数据上知识蒸馏的脆弱性问题。

Comments Accepted to Special Session: Data Science: Foundations and Applications (DSFA), PAKDD 2026

详情
AI中文摘要

知识蒸馏通过混合硬损失和软损失将高容量教师的知识转移到紧凑的学生模型。在不平衡数据上,硬损失和软损失之间的固定权重使得学习过程变得脆弱。最近的研究尝试在长尾设置中重新加权这些组件。然而,大多数方法没有在样本级别调整权重,也没有考虑训练过程中学生的行为。为了解决这个问题,我们提出了BiKD——一个双层框架,动态平衡每个样本的硬损失和软损失。我们采用一个权重生成网络,由一个小型平衡验证集引导,产生自适应的逐样本权重。学生现在通过无约束的加权硬损失和软损失组合进行训练,使得学生可以放松这两个项。我们进一步提出了一种多步SGD策略,以更准确和高效地优化权重模型。在长尾CIFAR-10/100上的实验表明,我们的方法在不同不平衡因子下均优于最近的平衡蒸馏方法。

英文摘要

Knowledge distillation transfers knowledge from a high capacity teacher to a compact student using a mixture of hard and soft losses. On imbalanced data, a fixed weighting between hard and soft losses becomes brittle the learning process. Recent studies try to reweight these components in long-tailed settings. However, most of these methods do not adapt weights at the sample-wise level and do not take into account the students behavior during training. To address this, we propose BiKD -- a bilevel framework that dynamically balances hard and soft losses for each sample. We employ a weight generation network that produces adaptive per-sample weights, guided by a small balanced validation set. The student is now trained with an unconstrained combination of weighted hard and soft losses, allowing the student to relax both terms. We further propose a multi-step SGD strategy to optimize the weight model more accurately and efficiently. Experiments on long-tailed CIFAR-10/100 show that our approach surpasses recent balanced distillation methods across imbalance factors.

2601.22285 2026-06-02 cs.LG 版本更新

Demystifying Mergeability: Interpretable Properties to Predict Model Merging Success

揭秘可合并性:预测模型合并成功的可解释属性

Luca Zhou, Bo Zhao, Rose Yu, Emanuele Rodolà

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 通过架构无关框架和L1正则化线性优化,发现合并成功取决于合并方法和伙伴任务,梯度对齐是最基本的兼容性信号。

Comments 9 pages of main paper, 3 figures in the main paper, 4 tables in the main paper, many more figures and tables in the appendix

详情
AI中文摘要

模型合并结合了分别微调模型的知识,但驱动其成功的因素仍知之甚少。虽然最近的工作将可合并性视为模型的内在属性,但我们通过一个架构无关的框架表明,它从根本上取决于合并方法和伙伴任务。使用基于一组可解释的成对度量(例如,梯度L_2距离)的L1正则化线性优化,我们发现了与五种合并方法合并后归一化准确率相关的属性。我们发现成功驱动因素存在架构和方法特定的变化(平均前5个度量重叠64.0%;符号一致性79.3%),某些方法,特别是TIES,表现出与广泛共识不同的独特“指纹”。然而,至关重要的是,梯度对齐度量始终作为兼容性的最基本信号出现。这些发现为理解可合并性提供了诊断基础,并激发了未来合并感知的微调策略。

英文摘要

Model merging combines knowledge from separately fine-tuned models, yet the factors driving its success remain poorly understood. While recent work treats mergeability as an intrinsic property of the models, we show with an architecture-agnostic framework that it fundamentally depends on both the merging method and the partner tasks. Using L1-regularized linear optimization over a set of interpretable pairwise metrics (e.g., gradient L_2 distance), we uncover properties correlating with post-merge normalized accuracy across five merging methods. We find architecture- and method-specific variation in success drivers (64.0% average top-5 metric overlap; 79.3% sign agreement), with certain methods, notably TIES, exhibiting distinct ``fingerprints'' that diverge from the broader consensus. Crucially, however, gradient alignment metrics consistently emerge as the most fundamental signals of compatibility. These findings provide a diagnostic foundation for understanding mergeability and motivate future merge-aware fine-tuning strategies.

2411.19093 2026-06-02 cs.CV cs.CY cs.LG 版本更新

Seeing SDG 6 from space: local-scale monitoring of piped water and sewage system access across Africa using satellite imagery and self-supervised learning

从太空看SDG 6:利用卫星图像和自监督学习对非洲管道水和污水系统接入进行局部尺度监测

Othmane Echchabi, Aya Lahlou, Nizar Talty, Josh Malcolm Manto, Tongshu Zheng, Ka Leung Lam

发表机构 * Mila – Quebec AI Institute(魁北克人工智能研究所) School of Computer Science, McGill University(麦吉尔大学计算机科学学院) Department of Earth and Environmental Engineering, Columbia University(哥伦比亚大学地球与环境工程系) Center for Learning the Earth with Artificial Intelligence and Physics (LEAP)(人工智能与物理学习地球中心(LEAP)) Division of Natural and Applied Sciences, Duke Kunshan University(杜克-昆山大学自然科学与应用科学系)

AI总结 本研究利用Sentinel-2图像、Afrobarometer调查数据、30米人口数据和DINO自监督视觉Transformer特征,开发了一个可扩展的遥感框架,以约2.56公里分辨率估计管道水和污水系统接入情况,最佳模型AUROC分别达到91.54%和93.24%,与WHO/UNICEF JMP统计数据高度一致,并在尼日利亚案例中揭示了细粒度环境不平等。

Comments Under Review

详情
AI中文摘要

获得饮用水和卫生设施对健康和福祉至关重要,但主要差距仍然存在,尤其是在非洲等数据稀缺地区。SDG 6旨在实现普遍接入,但目前的监测依赖于成本高昂、频率低且空间不均匀的调查和普查,且报告延迟较长。 本研究开发了一个可扩展的遥感框架,利用Sentinel-2图像、Afrobarometer调查响应、30米人口数据和DINO自监督视觉Transformer特征,以约2.56公里分辨率估计管道水和污水系统接入情况。最佳模型在管道水和污水接入方面分别达到91.54%和93.24%的AUROC值。在50个非洲国家中,人口加权估计与WHO/UNICEF JMP统计数据在管道水方面高度一致($R^2 = 0.92$),在污水接入方面也有显著一致性($R^2 = 0.72$)。在无Afrobarometer覆盖的国家,平均绝对误差分别为9.5%和10.7%,估计值分别与1.214亿和1.597亿人口的JMP值相差在15%以内。 一项覆盖尼日利亚767个地方政府区域的案例研究表明,该框架揭示了细尺度的环境不平等。管道水和污水无接入的最大负担分别达到115.5万和145.2万人,是地方政府区域中位数负担的7.9倍和8.3倍,而最高十分位无接入阈值分别为0.805和0.952,表明匮乏普遍存在。这些发现表明,基于DINO的卫星模型可以以低成本、空间详细的方式补充家庭调查,为SDG 6监测、基础设施定位和环境公平评估提供证据。

英文摘要

Access to drinking water and sanitation is essential for health and well-being, yet major disparities remain, especially in data-scarce regions such as Africa. SDG 6 aims for universal access, but current monitoring relies on costly, infrequent, and spatially uneven surveys and censuses with long reporting delays. This study develops a scalable remote-sensing framework to estimate piped water and sewage system access at approximately 2.56 km resolution using Sentinel-2 imagery, Afrobarometer survey responses, 30 m population data, and DINO self-supervised Vision Transformer features. The best model achieves AUROC values of 91.54% for piped water and 93.24% for sewage access. Across 50 African countries, population-weighted estimates strongly align with WHO/UNICEF JMP statistics for piped water ($R^2 = 0.92$) and show meaningful agreement for sewage access ($R^2 = 0.72$). In countries without Afrobarometer coverage, MAEs are 9.5% and 10.7%, with estimates within 15% of JMP values for 121.4 million and 159.7 million people, respectively. A Nigeria case study across 767 Local Government Areas (LGAs) shows that the framework reveals fine-scale environmental inequality. The largest no-access burdens reach 1.155 million people for piped water and 1.452 million for sewage, 7.9 and 8.3 times the median LGA burden, while top-decile no-access thresholds of 0.805 and 0.952 indicate that deprivation is widespread. These findings show that DINO-based satellite models can complement household surveys with low-cost, spatially detailed evidence for SDG 6 monitoring, infrastructure targeting, and environmental equity assessment.

2605.18694 2026-06-02 math.OC cs.LG stat.ML 版本更新

Can Adaptive Gradient Methods Converge under Heavy-Tailed Noise? A Case Study of AdaGrad

自适应梯度方法能否在重尾噪声下收敛?以 AdaGrad 为例

Zijian Liu

发表机构 * Zijian Liu(刘子健)

AI总结 本文研究 AdaGrad 在重尾梯度噪声下的收敛性,首次证明当尾指数 p 满足 4/3 < p ≤ 2 时,无需先验知识即可获得非凸优化的收敛率,并给出了算法相关的下界。

Comments ICML 2026. v2: simplification of the proof

详情
AI中文摘要

现代机器学习中的许多任务在优化过程中观察到涉及重尾梯度噪声。为了应对这一现实且具有挑战性的场景,引入了新的机制,如梯度裁剪和梯度归一化,以确保一阶算法的收敛性。然而,自适应梯度方法,一类著名的现代优化器,包括流行的 $\mathtt{Adam}$ 和 $\mathtt{AdamW}$,即使没有上述任何额外操作,通常也表现良好。因此,自然要问:自适应梯度方法能否在重尾噪声下收敛而无需任何算法更改?在这项工作中,我们通过研究一个特例 $\mathtt{AdaGrad}$(自适应梯度方法的起源)迈出了回答这个问题的第一步。我们首次证明了当尾指数 $p$ 满足 $4/3 < p \leq 2$ 时,$\mathtt{AdaGrad}$ 在非凸优化中的可证明收敛率。值得注意的是,这一结果无需任何关于 $p$ 的先验知识,因此对尾指数是自适应的。此外,我们开发了一个算法相关的下界,表明现有的重尾优化极小极大速率无法由 $\mathtt{AdaGrad}$ 达到。最后,我们考虑了 $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$(理论研究中 $\mathtt{AdaGrad}$ 的一个流行变体),并证明了在额外温和假设下,对于任何 $1 < p \leq 2$ 都成立的改进速率。

英文摘要

Many tasks in modern machine learning are observed to involve heavy-tailed gradient noise during the optimization process. To manage this realistic and challenging setting, new mechanisms, such as gradient clipping and gradient normalization, have been introduced to ensure the convergence of first-order algorithms. However, adaptive gradient methods, a famous class of modern optimizers that includes popular $\mathtt{Adam}$ and $\mathtt{AdamW}$, often perform well even without any extra operations mentioned above. It is therefore natural to ask whether adaptive gradient methods can converge under heavy-tailed noise without any algorithmic changes. In this work, we take the first step toward answering this question by investigating a special case, $\mathtt{AdaGrad}$, the origin of adaptive gradient methods. We provide the first provable convergence rate for $\mathtt{AdaGrad}$ in non-convex optimization when the tail index $p$ satisfies $4/3<p\leq2$. Notably, this result is achieved without requiring any prior knowledge of $p$ and is hence adaptive to the tail index. In addition, we develop an algorithm-dependent lower bound, suggesting that the existing minimax rate for heavy-tailed optimization is not attainable by $\mathtt{AdaGrad}$. Lastly, we consider $\mathtt{AdaGrad}\text{-}\mathtt{Norm}$, a popular variant of $\mathtt{AdaGrad}$ in theoretical studies, and show an improved rate that holds for any $1<p\leq2$ under an extra mild assumption.

2604.23765 2026-06-02 cs.LG cs.NE math.FA 版本更新

Necessary and sufficient conditions for universality of Kolmogorov-Arnold networks

Kolmogorov-Arnold 网络普适性的充要条件

Vugar Ismailov

发表机构 * vugar E. Ismailov

AI总结 本文分析了 Kolmogorov-Arnold 网络(KAN)的普适逼近性质,证明了当所有边缘函数为仿射时普适性不成立,而添加一个非仿射函数即可恢复普适性,并给出了深层和两层 KAN 的充要条件。

Comments 19 pages; two corollaries from Section 6 removed and generalized in arXiv:2605.26550

详情
AI中文摘要

我们从边缘函数的角度分析了 Kolmogorov-Arnold 网络(KAN)的普适逼近性质。如果这些函数都是仿射的,那么普适性显然不成立。除了仿射函数之外,还需要多少非仿射函数才能保证普适性?我们证明一个就足够了。更精确地说,我们证明对于每个紧集 $K\subset\mathbb{R}^n$,所有边缘函数要么是仿射的,要么等于一个固定的连续函数 $\sigma$ 的深层 KAN 在 $C(K)$ 中稠密,当且仅当 $\sigma$ 是非仿射的。相比之下,对于恰好有两个隐藏层的 KAN,普适性成立当且仅当 $\sigma$ 是非多项式的。我们进一步证明,并不需要完整的仿射函数类;它可以用一个有限集代替而不影响普适性。特别地,在非多项式情况下,当深度任意时,一个由五个仿射函数组成的固定族就足够了。更一般地,对于每个连续的非仿射函数 $\sigma$,存在一个有限的仿射族 $A_\sigma$,使得边缘函数在 $A_\sigma\cup\{\sigma\}$ 中的深层 KAN 仍然是普适的。我们还证明,采用 Liu 等人~\cite{Liu2024} 引入的基于样条的边缘参数化的 KAN 是经典意义上的普适逼近器,即使样条次数和节点序列是预先固定的。

英文摘要

We analyze the universal approximation property of Kolmogorov-Arnold Networks (KANs) in terms of their edge functions. If these functions are all affine, then universality clearly fails. How many non-affine functions are needed, in addition to affine ones, to ensure universality? We show that a single one suffices. More precisely, we prove that deep KANs in which all edge functions are either affine or equal to a fixed continuous function $σ$ are dense in $C(K)$ for every compact set $K\subset\mathbb{R}^n$ if and only if $σ$ is non-affine. In contrast, for KANs with exactly two hidden layers, universality holds if and only if $σ$ is nonpolynomial. We further show that the full class of affine functions is not required; it can be replaced by a finite set without affecting universality. In particular, in the nonpolynomial case, a fixed family of five affine functions suffices when the depth is arbitrary. More generally, for every continuous non-affine function $σ$, there exists a finite affine family $A_σ$ such that deep KANs with edge functions in $A_σ\cup\{σ\}$ remain universal. We also prove that KANs with the spline-based edge parameterization introduced by Liu et al.~\cite{Liu2024} are universal approximators in the classical sense, even when the spline degree and knot sequence are fixed in advance.

2605.18077 2026-06-02 cs.AI cs.LG cs.MA 版本更新

LLM-Guided Communication for Cooperative Multi-Agent Reinforcement Learning

LLM引导的通信用于合作多智能体强化学习

Sangjun Bae, Yisak Park, Sanghyeon Lee, Seungyul Han

发表机构 * KAIST(韩国科学技术院)

AI总结 提出LMAC框架,利用大语言模型的推理能力设计通信协议,使所有智能体尽可能准确一致地重建底层状态,从而提升多智能体强化学习中的状态重建和性能。

Comments 9 pages for main, 32 pages for total, Accepted to ICML 2026

详情
AI中文摘要

通信是多智能体强化学习(MARL)中缓解部分可观测性的关键组成部分,然而先前的方法通常依赖于低效的信息交换或无法传输足够的状态信息。为了解决这一问题,我们提出了LLM驱动的多智能体通信(LMAC),它利用LLM的推理能力设计一种通信协议,使所有智能体能够尽可能准确且一致地重建底层状态。LMAC使用显式的状态感知准则迭代地优化协议,在缩小智能体知识差异的同时改善状态恢复。在多种MARL基准上的实验表明,LMAC改善了智能体间的状态重建,并且相较于先前的通信基线取得了显著的性能提升。

英文摘要

Communication is a key component in multi-agent reinforcement learning (MARL) for mitigating partial observability, yet prior approaches often rely on inefficient information exchange or fail to transmit sufficient state information. To address this, we propose LLM-driven Multi-Agent Communication (LMAC), which leverages an LLM's reasoning capability to design a communication protocol that enables all agents to reconstruct the underlying state as accurately and uniformly as possible. LMAC iteratively refines the protocol using an explicit state-awareness criterion, improving state recovery while narrowing differences in agents' knowledge. Experiments on diverse MARL benchmarks show that LMAC improves state reconstruction across agents and yields substantial performance gains over prior communication baselines.

2605.17554 2026-06-02 cs.AI cs.LG 版本更新

Evaluating Deep Research Agents on Expert Consulting Work: A Benchmark with Verifiers, Rubrics, and Cognitive Traps

评估深度研究代理在专家咨询工作中的表现:一个包含验证器、评分标准和认知陷阱的基准

Tanmay Asthana, Aman Saksena, Divyansh Sahu

AI总结 本文提出一个基准,通过42个专家编写的任务,使用确定性验证器和五维度评分标准评估三个前沿深度研究代理(Claude、OpenAI o3、Gemini)在管理咨询类结构化分析交付物上的表现,并嵌入认知陷阱,发现所有代理的联合接受率均较低(最高21.4%),且各有独特失败模式。

Comments Updating the paper with more data. Will resubmit

详情
AI中文摘要

前沿深度研究代理(DRA)能够规划研究任务、综合多篇文档,并按需生成结构化的交付物。它们在企业工作流中的部署速度远快于评估速度。现有基准衡量事实回忆、单跳问答或通用代理技能,忽略了DRA被部署用于生成的多文档、决策级工作。我们引入一个基准,针对管理咨询师典型一周中所需的结构化分析交付物。我们评估三个前沿代理,即Claude Opus 4.6(带网络搜索)、OpenAI o3-deep-research和Google Gemini 3.1 Pro deep-research,在42个由领域专家(SME)编写的提示上。每个提示的126个响应在两个层面评分:确定性真实验证器(平均每个任务13.8个)和五维度0-3 SME评分标准,组合成0-100的验证器-评分标准分数(VRS)。大多数提示嵌入了惩罚表面模式匹配的认知陷阱。在我们的联合阈值(评分标准均值>=2.5且验证器通过率>=80%)下的接受率普遍较低:Gemini 21.4%,o3 9.5%,Claude 9.5%。平均VRS分数与已发表的基于评分标准的基准一致(我们的最高62.6对比APEX-v1 64.2,ProfBench 65.9,ResearchRubrics <68%),验证了评分标准构建。ACCEPT率低于APEX-Agents在专用DR代理上的MC-segment Pass@1区间(12.3-22.7%);尽管有工具优势,我们的下限仍低三个百分点,这是由于更严格的合取评分和陷阱设计。每个代理的失败模式各不相同。Claude最可靠地生成交付物(在需要文件的任务上比其他代理高4.5倍),但具有最高的虚构特征。o3具有最清晰的推理平均值,但会遗漏必要部分并传播算术错误。Gemini是双峰的,具有最高的接受率,同时也有最多的零分评分标准单元格。

英文摘要

Frontier deep research agents (DRAs) plan a research task, synthesize across documents, and return a structured deliverable on demand. They are being deployed in enterprise workflows faster than they are being evaluated. Existing benchmarks measure factual recall, single-hop QA, or generic agentic skill, missing the multi-document, decision-grade work DRAs are deployed to produce. We introduce a benchmark targeting the structured analytical deliverables that fill a management consultant's typical week. We grade three frontier agents, namely Claude Opus 4.6 with web search, OpenAI o3-deep-research, and Google Gemini 3.1 Pro deep-research, on 42 SME-authored prompts. Each of the 126 responses is scored on two layers: deterministic ground-truth verifiers (mean 13.8 per task) and a five-criterion 0-3 SME rubric, composed into a Verifier-Rubric Score (VRS) on 0-100. Most prompts embed cognitive traps that penalize surface-pattern matching. Acceptance under our joint threshold (rubric mean >= 2.5 and verifier rate >= 80%) is uniformly low: Gemini 21.4%, o3 9.5%, Claude 9.5%. Mean VRS scores agree with published rubric-based benchmarks (our top 62.6 vs. APEX-v1 64.2, ProfBench 65.9, ResearchRubrics < 68%), validating the rubric construct. ACCEPT rates sit below APEX-Agents' MC-segment Pass@1 band (12.3-22.7%) on dedicated DR agents; our floor is three points lower despite the harness advantage, opened by stricter conjunctive grading and trap design. Each agent fails distinctively. Claude produces the deliverable most reliably (4.5x the others' rate on file-required tasks) but carries the highest fabrication signature. o3 has the cleanest reasoning average yet drops required sections and propagates arithmetic errors. Gemini is bimodal, with the highest acceptance rate alongside the most zero-scored rubric cells.

2605.12969 2026-06-02 cs.LG cs.AI 版本更新

Revisiting Reinforcement Learning with Verifiable Rewards from a Contrastive Perspective

从对比视角重新审视基于可验证奖励的强化学习

Feng Zhang, Xinhong Ma, Ziqiang Dong, Xi Leng, Jianfei Zhao, Xin Sun, Yang Yang, Guanjun Jiang

发表机构 * Beijing Institute of Technology(北京理工大学) Qwen Business Unit of Alibaba(阿里巴巴Qwen业务部) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 本文提出ConSPO方法,通过对比序列级策略优化,解决GRPO在目标函数上的似然错配和信用分配不敏感问题,在推理任务上超越强基线。

详情
AI中文摘要

组相对策略优化(GRPO)是目前最广泛采用的RLVR算法之一,用于对大型语言模型进行推理任务的后训练。我们首先证明GRPO存在等价的判别式重新表述,其中策略优化最大化验证的正负rollout之间的期望得分差距。这种重新表述揭示了两个目标层面的局限性:似然错配的替代得分(优化的是基于裁剪比率的得分而非控制生成的序列似然)和得分不敏感的信用分配(rollout级别的信用不反映当前正负rollout之间的得分差距)。为了解决这些局限性,我们提出ConSPO,一种对比序列级策略优化方法,它使用长度归一化的序列对数概率作为rollout得分,并在同一组内对比验证的正rollout与负干扰项。ConSPO优化一个组级别的InfoNCE风格目标,以自适应地增强对分离不佳的正样本和高分负样本的更新,同时结合课程调度的边界,在训练过程中保持分离压力。在多种设置下的实验表明,ConSPO在具有挑战性的推理基准上优于强基线。代码将在论文被接收后发布。

英文摘要

Group Relative Policy Optimization (GRPO) is one of the most widely adopted RLVR algorithms for post-training large language models on reasoning tasks. We first show that GRPO admits an equivalent discriminative reformulation, in which policy optimization maximizes the expected score gap between verified positive and negative rollouts. This reformulation reveals two objective-level limitations: likelihood-misaligned surrogate scores, in which clipped ratio-based scores are optimized rather than the sequence likelihoods that govern generation, and score-insensitive credit assignment, in which rollout-level credit does not reflect the current score gaps between positive and negative rollouts. To address these limitations, we propose ConSPO, a Contrastive Sequence-level Policy Optimization method that uses length-normalized sequence log-probabilities as rollout scores and contrasts verified positive rollouts against negative distractors within the same group. ConSPO optimizes a group-wise InfoNCE-style objective to adaptively strengthen updates for poorly separated positives and high-scoring negatives, together with a curriculum-scheduled margin that preserves separation pressure as training progresses. Experiments across diverse settings show that ConSPO outperforms strong baselines on challenging reasoning benchmarks. Code will be released upon paper acceptance.

2605.17110 2026-06-02 cs.AI cs.LG 版本更新

Capturing LLM Capabilities via Evidence-Calibrated Query Clustering

通过证据校准的查询聚类捕捉LLM能力

Fangzhou Wu, Sandeep Silwal, Qiuyi Zhang

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校) Elorian AI

AI总结 提出ECC算法,利用有限后验模型比较校准先验语义嵌入,通过Bradley-Terry模型参数化能力轮廓,联合学习灵活的能力感知聚类结构,显著提升LLM能力排序质量。

Comments 45 pages

详情
AI中文摘要

查询聚类将查询分组为反映共享潜在能力需求的组,从而实现能力感知的LLM评估。现有的聚类方法主要依赖于语义分类或嵌入,由于表面语义与实际模型性能之间的错位,往往无法捕捉此类潜在能力需求。我们提出ECC,一种使用有限后验模型比较校准先验语义嵌入的算法,以弥合表面语义与潜在能力需求之间的差距。ECC通过Bradley-Terry模型参数化的能力轮廓来表征每个聚类,并使用可训练的混合权重来适应具有混合能力需求的查询,联合学习灵活的能力感知聚类结构,支持查询特定的LLM能力推断。大量的定量和定性评估表明,ECC显著提高了LLM能力排序质量,分别比人工标注和基于嵌入的基线平均高出17.64和18.02个百分点,并在查询路由等下游任务中证明有效。

英文摘要

Query clustering organizes queries into groups that reflect shared latent capability demands, enabling capability-aware LLM evaluation. Existing clustering methods, which primarily rely on semantic taxonomies or embeddings, often fail to capture such latent capability requirements due to a misalignment between surface-level semantics and actual model performance. We propose ECC, an algorithm that calibrates prior semantic embeddings using limited posterior model comparisons to bridge the gap between surface-level semantics and latent capability requirements. ECC characterizes each cluster through a capability profile parameterized by a Bradley-Terry model and uses trainable mixture weights to accommodate queries with mixed capability demands, jointly learning a flexible, capability-aware clustering structure that supports query-specific inference of LLM capabilities. Extensive quantitative and qualitative evaluations demonstrate that ECC significantly improves LLM capability ranking quality, outperforming human-labeled and embedding-based baselines by an average of 17.64 and 18.02 percentage points, respectively, and proves effective in downstream tasks such as query routing.

2605.17034 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Privacy Policy Enforcement Guardrails for Data-Sensitive Retrieval-Augmented Generation

面向数据敏感检索增强生成的隐私策略执行护栏

Osama Zafar, Alexander Nemecek, Yiqian Zhang, Wenbiao Li, Debargha Ganguly, Vikash Singh, Vipin Chaudhary, Erman Ayday

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 针对RAG系统中上下文数据泄露问题,提出基于双单类密度估计器与融合文本嵌入的隐私策略执行框架,在医学、金融和法律领域实现高AUROC和低误报率。

详情
AI中文摘要

标准的PII过滤器常常遗漏RAG系统中的上下文数据泄露,例如非受管制的属性集群共同识别个人身份。我们引入了一个隐私策略执行(PPE)框架,使用双单类密度估计器与融合文本嵌入,以及针对分布外输入的校准弃权区域。通过跨医学、金融和法律领域的轴分层、多LLM合成数据管道,我们发现传统的高斯混合基线在边界安全压力测试中失败,因为它们关注语言风格而非内容。我们提出的T3+OCSVM检测器,在安全和边界安全数据上训练,实现了0.93+的边界AUROC,同时将误报率降低44-55个百分点,并保持毫秒级延迟。与监督MLP分类器或14B参数LLM评判器相比,我们的框架提供了更优的操作适用性,因为前者具有高弃权率,后者存在延迟和校准问题。该方法为任何合成数据训练的分类器提供了稳健的压力测试标准。

英文摘要

Standard PII filters often miss contextual data leakage in RAG systems, such as non-regulated attribute clusters that collectively identify individuals. We introduce a Privacy Policy Enforcement (PPE) framework using dual one-class density estimators with fused text embeddings and a calibrated abstain region for out-of-distribution inputs. Using an axis-stratified, multi-LLM synthetic data pipeline across medicine, finance, and law, we found that traditional Gaussian Mixture baselines fail on borderline-safe stress tests by focusing on linguistic register rather than content. Our proposed T3+OCSVM detector, trained on safe and borderline-safe data, achieves a borderline AUROC of 0.93+ while reducing false positives by 44-55 percentage points and maintaining millisecond latency. Compared to supervised MLP classifiers or 14B-parameter LLM judges, our framework offers superior operational suitability, as the former suffers from high abstention rates and the latter from latency and calibration issues. This methodology provides a robust stress-testing standard for any synthetic-data-trained classifier.

2605.11125 2026-06-02 cs.LG 版本更新

Language Modeling with Hyperspherical Flows

超球面流语言建模

Justin Deschenaux, Caglar Gulcehre

发表机构 * EPFL(苏黎世联邦理工学院) Microsoft AI(微软人工智能)

AI总结 提出一种在超球面潜空间中进行连续流语言建模的方法 S-FLM,通过旋转向量和交叉熵学习速度场,避免独热向量开销,在大型词汇推理任务上显著提升性能,缩小了与掩码扩散模型的差距。

详情
AI中文摘要

离散扩散语言模型作为自回归模型的替代方案发展迅速,其动机在于并行生成能力。然而,为了可处理性,离散扩散模型从因子化分布中采样,其表达能力弱于自回归模型。最近的流语言模型将连续流应用于语言,通过确定性常微分方程将噪声传输到数据,避免了因子化采样。流语言模型操作于独热向量,其维度随词汇表大小缩放,使得流语言模型训练成本高昂。此外,由于所有不同的独热嵌入在 $\ell_2$ 中都是等距的,添加高斯噪声没有明确的语义解释(与图像不同,在图像中高斯噪声逐渐退化结构)。我们引入了 $\mathbb{S}$-FLM,一种在超球面中的潜在流语言模型。$\mathbb{S}$-FLM 通过沿速度场旋转 $\mathbb{S}^{d-1}$ 中的向量来生成序列,该速度场使用交叉熵学习,避免了具体化独热向量的开销。先前的流语言模型在生成困惑度上与自回归模型匹配,但在数学和代码等可验证领域中,高似然的样本不一定正确。$\mathbb{S}$-FLM 在大型词汇推理任务上显著改进了连续流语言模型,并在标准温度采样($T=1$)下缩小了与掩码扩散的差距,而在优化的低温解码($T=0.1$)下仍存在差距。

英文摘要

Discrete Diffusion Language Models progressed rapidly as an alternative to autoregressive (AR) models, motivated by their parallel generation abilities. However, for tractability, discrete diffusion models sample from a factorized distribution, which is less expressive than AR. Recent Flow Language Models (FLMs) apply continuous flows to language, transporting noise to data with a deterministic ODE that avoids factorized sampling. FLMs operate on one-hot vectors whose dimension scales with the vocabulary size, making FLMs costly to train. Moreover, since all distinct one-hot embeddings are equidistant in $\ell_2$, adding Gaussian noise does not have a clear semantic interpretation (unlike images, where Gaussian noise progressively degrades structure). We introduce $\mathbb{S}$-FLM, a latent FLM in the hypersphere. $\mathbb{S}$-FLM generates sequences by rotating vectors in $\mathbb{S}^{d-1}$ along a velocity field learned with cross-entropy, avoiding the overhead of materializing one-hot vectors. Previous FLMs match AR in Generative Perplexity (Gen.\ PPL), but samples with high likelihood are not necessarily correct in verifiable domains such as math and code. $\mathbb{S}$-FLM substantially improves continuous flow language models on large-vocabulary reasoning and closes the gap to masked diffusion under standard-temperature sampling ($T=1$), while a gap remains under optimized low-temperature ($T=0.1$) decoding.

2511.00064 2026-06-02 cs.LG 版本更新

SPORE: Skeleton Propagation Over Recalibrating Expansions

SPORE: 基于重新校准扩展的骨架传播

Randolph Wiredu-Aidoo

发表机构 * Randolph Wiredu-Aidoo

AI总结 提出一种两阶段密度聚类算法SPORE,通过自适应扩展和边界传播解决异质密度和边界模糊问题,在28个基准数据集上显著优于现有方法。

详情
AI中文摘要

许多真实世界的数据集不是线性可分的,这限制了基于质心的聚类方法(如K-means)的有效性。基于密度的聚类方法通过识别具有任意几何结构的聚类来解决这一限制;然而,现有方法存在两个持续的缺点。首先,它们在存在异质局部密度的情况下往往表现不佳,其中单个密度阈值无法充分捕获跨多个密度尺度的聚类。其次,它们通常缺乏由基于质心方法的线性划分机制自然诱导的清晰边界界定。本文介绍了SPORE(基于重新校准扩展的骨架传播),这是一种聚类算法,旨在解决这两个挑战,同时保留基于密度方法的几何灵活性。SPORE分两个阶段运行:自适应聚类扩展阶段,然后是邻近驱动的边界传播阶段,即使在弱密度对比下也能保持判别能力。该方法在28个基准数据集上与已建立的基于密度的基线进行了评估,并以K-means作为参考的基于质心方法。实验结果表明,相对于所有评估的基线(p < 0.01),SPORE实现了显著改善的聚类恢复,同时可以在五次随机搜索评估内识别出性能强劲的配置。

英文摘要

Many real-world datasets are not linearly separable, limiting the effectiveness of centroid-based clustering methods such as K-means. Density-based clustering methods address this limitation by identifying clusters with arbitrary geometric structure; however, existing approaches exhibit two persistent shortcomings. First, they often underperform in the presence of heterogeneous local densities, where a single density threshold cannot adequately capture clusters across multiple density scales. Second, they generally lack the clear boundary delineation naturally induced by the linear partitioning mechanism of centroid-based methods. This paper introduces SPORE (Skeleton Propagation Over Recalibrating Expansions), a clustering algorithm designed to address both challenges while preserving the geometric flexibility of density-based approaches. SPORE operates in two stages: an adaptive cluster expansion phase followed by a proximity-driven boundary propagation phase that maintains discriminative capability even under weak density contrast. The proposed method is evaluated on 28 benchmark datasets against established density-based baselines, with K-means included as a reference centroid-based method. Experimental results demonstrate that SPORE achieves significantly improved cluster recovery relative to all evaluated baselines (p < 0.01), while strong-performing configurations can be identified within five random-search evaluations.

2510.12999 2026-06-02 cs.LG stat.ML 版本更新

AMORE: Adaptive Multi-Output Operator Network for Stiff Chemical Kinetics

AMORE: 自适应多输出算子网络用于刚性化学动力学

Kamaljyoti Nath, Additi Pandey, Bryan T. Susi, Hessam Babaee, George Em Karniadakis

发表机构 * Division of Applied Mathematics, Brown University(布朗大学应用数学系) Applied Research Associates, Inc.(应用研究公司) Department of Mechanical Engineering and Materials Science, University of Pittsburgh(匹兹堡大学机械工程与材料科学系)

AI总结 针对刚性化学动力学系统的时间积分计算成本高的问题,提出AMORE框架,通过自适应损失函数和可逆映射确保多输出算子学习的可靠性,并在合成气和GRI-Mech 3.0上验证了有效性。

详情
AI中文摘要

刚性系统的时间积分是燃烧、高超声速及其他反应输运系统中计算成本的主要来源。这种刚性会引入远小于其他物理过程的时间尺度,导致显式格式需要极小的步长或隐式方法计算量大。因此,缓解刚性挑战的策略至关重要。虽然神经算子(DeepONet)可以作为刚性动力学的替代模型,但需要可靠的算子学习策略来适当考虑输出变量和样本之间的误差差异。本文开发了AMORE(自适应多输出算子网络),一个包含能够预测多个输出的算子和确保可靠算子学习的自适应损失函数的框架。该算子从给定初始条件预测所有热化学状态。我们提出了两种自适应损失函数,考虑每个状态变量和样本的误差来惩罚损失函数。我们设计了主干网络以自动满足单位分解。为了精确满足质量分数总和为1的约束,我们提出了一个可逆解析映射,将n维物种质量分数向量变换到(n-1)维空间。我们将所提出的自适应损失函数扩展到具有多输出的DeepONet的两步训练中的主干和分支训练。我们还通过预测质量分数上的softmax函数精确实现了另一个质量分数总和为1的约束。我们通过两个示例证明了模型的有效性和适用性:合成气(12个状态)、GRI-Mech 3.0(54个中的24个活跃状态)。所提出的DeepONet将成为未来CFD研究加速湍流燃烧模拟的骨干。AMORE是一个通用框架,本文也将其应用于FNO。

英文摘要

Time integration of stiff systems is a primary source of computational cost in combustion, hypersonics, and other reactive transport systems. This stiffness can introduce time scales significantly smaller than those associated with other physical processes, requiring extremely small time steps in explicit schemes or computationally intensive implicit methods. Consequently, strategies to alleviate challenges posed by stiffness are important. While neural operators (DeepONets) can act as surrogates for stiff kinetics, a reliable operator learning strategy is required to appropriately account for differences in error between output variables and samples. Here, we develop AMORE, Adaptive Multi-Output Operator Network, a framework comprising an operator capable of predicting multiple outputs and adaptive loss functions ensuring reliable operator learning. The operator predicts all thermochemical states from given initial conditions. We propose two adaptive loss functions within the framework, considering each state variable's and sample's error to penalize the loss function. We designed the trunk to automatically satisfy Partition of Unity. To enforce unity mass-fraction constraint exactly, we propose an invertible analytical map that transforms the $n$-dimensional species mass-fraction vector into an ($n-1$)-dimensional space. We extend the proposed adaptive loss functions to trunk and branch training in two-step training of DeepONet with multiple outputs. We implemented another unity mass fraction constraint exactly using a softmax function on the predicted mass fraction. We demonstrate efficacy and applicability of our models through two examples: syngas (12 states), GRI-Mech 3.0 (24 active states out of 54). The proposed DeepONet will be a backbone for future CFD studies to accelerate turbulent combustion simulations. AMORE is a general framework, and here, we also demonstrate it for FNO.

2605.16451 2026-06-02 cs.LG cs.AI 版本更新

Physics-Guided Geometric Diffusion for Macro Placement Generation

物理引导的几何扩散用于宏单元布局生成

Jongho Yoon, Jinsung Jeon, Seokhyeong Kang

发表机构 * POSTECH Institute of Artificial Intelligence(POSTECH人工智能研究所) KAIST InnoCORE LLM(韩国科学技术院InnoCORE语言模型实验室) Seoul National University(首尔国立大学) Pohang University of Science and Technology(釜山科学技术大学)

AI总结 提出MacroDiff+框架,通过双域去噪架构和物理引导采样策略,在宏单元布局中同时优化拓扑连接和物理约束,在ISPD2005 MMS基准上实现线长减少6.1-6.2%。

Comments Accepted to IJCAI 2026. 9 pages, 5 figures

详情
AI中文摘要

宏单元布局是VLSI物理设计中的关键阶段,从根本上决定了芯片的整体性能。最近的数据驱动布局方法显示出巨大潜力,但它们往往难以处理序列依赖关系,并平衡拓扑连接与物理约束。为弥补这一差距,我们提出了MacroDiff+,一个物理引导的几何扩散框架。具体来说,我们设计了一个双域去噪架构,将异构GNN编码的拓扑连接与Transformer建模的全局几何上下文相结合。此外,我们引入了物理引导采样,一种推理策略,通过显式梯度主动引导生成,以确保统计合理性和物理有效性。在ISPD2005 MMS基准上,MacroDiff+优于最先进的基线,线长减少6.1-6.2%。值得注意的是,在先前方法无法收敛的大规模设计中,它表现出卓越的稳定性和可扩展性。源代码可在https://github.com/jhy00n/MacroDiff-plus获取。

英文摘要

Macro placement is a pivotal stage in VLSI physical design, fundamentally determining the overall chip performance. Recent data-driven placement methods have demonstrated significant potential, yet they often struggle to handle sequential dependencies and to balance topological connectivity with physical constraints. To bridge this gap, we propose MacroDiff+, a physics-guided geometric diffusion framework. Specifically, we design a dual-domain denoising architecture that couples topological connectivity encoded by heterogeneous GNNs with global geometric context modeled by a Transformer. Furthermore, we introduce Physics-Guided Sampling, an inference strategy that actively steers the generation using explicit gradients to ensure both statistical plausibility and physical validity. On the ISPD2005 MMS benchmarks, MacroDiff+ outperforms state-of-the-art baselines with a 6.1-6.2% reduction in wirelength. Notably, it exhibits superior stability and scalability on large-scale designs where prior methods fail to converge. The source code is available at https://github.com/jhy00n/MacroDiff-plus.

2605.16446 2026-06-02 cs.LG cs.AI 版本更新

Avoiding Structural Failure Modes in Tabular Fair SSL: Online Primal-Dual Allocation under Confidence Gating

避免表格公平半监督学习中的结构失效模式:基于置信门控的在线原始-对偶分配

Hangchuan Liang, Changchun Li

发表机构 * College of Computer Science and Technology, Jilin University, China(吉林大学计算机科学与技术学院) Key Laboratory of Symbolic Computation and Knowledge Engineering of the Ministry of Education, Jilin University, China(教育部符号计算与知识工程重点实验室)

AI总结 针对表格公平半监督学习中的结构冲突,提出在线原始-对偶分配(OPDA)方法,通过动态调度公平性和熵稳定性惩罚,避免掩码崩溃和平凡饱和两种失效模式,在多个基准上实现非退化运行点。

详情
AI中文摘要

半监督学习(SSL)能够在有限标签下进行预测,但高风险表格应用(医疗、信贷、再犯)需要统计公平性保证。通过诊断压力测试,我们识别出表格公平SSL中的结构冲突:在置信门控伪标签下,矩匹配公平正则化器可能触发两种失效模式——掩码崩溃(公平性侵蚀置信度,导致伪标签匮乏)和平凡饱和(漂移至常数预测器)。我们提出在线原始-对偶分配(OPDA),一种在线控制器,利用违规、风险和伪标签健康信号调度公平性和基于熵的稳定性惩罚,从而避免在该诊断机制下为每个数据集选择固定公平权重。在评估的表格基准(Adult、ACSIncome、COMPAS)上,OPDA缓解了静态权重和简单单信号自适应基线中观察到的退化状态。在Adult和COMPAS上,它产生了与经验静态λ前沿竞争的非退化运行点;在ACSIncome上,它保持了效用,同时具有更宽的公平-效用分布。相对于OPDA-lite,完整控制器主要在ACSIncome上将运行点向更高效用偏移,而Adult则突出了两种变体之间的公平-效用权衡。这些结果使OPDA成为表格公平SSL中无需校准的控制器,无需针对每个数据集进行调整即可获得非退化运行点。

英文摘要

Semi-supervised learning (SSL) enables prediction with limited labels, but high-stakes tabular applications (medical, credit, recidivism) require statistical fairness guarantees. We identify a structural conflict in tabular fair SSL through a diagnostic stress test: under confidence-gated pseudo-labeling, moment-matching fairness regularizers can trigger two failure modes -- Masking Collapse (fairness erodes confidence, starving pseudo-labels) and Trivial Saturation (drift to constant predictors). We propose Online Primal-Dual Allocation (OPDA), an online controller that schedules fairness and entropy-based stability penalties using violation, risk, and pseudo-label health signals, avoiding per-dataset selection of a fixed fairness weight within this diagnostic regime. On the evaluated tabular benchmarks (Adult, ACSIncome, COMPAS), OPDA mitigates the degenerate regimes observed under static weighting and simple single-signal adaptive baselines. On Adult and COMPAS, it yields non-degenerate operating points competitive with the empirical static-$λ$ frontier; on ACSIncome, it preserves utility with a wider fairness-utility spread. Relative to OPDA-lite, the full controller mainly shifts the operating point toward higher utility on ACSIncome, while Adult highlights the fairness-utility trade-off between the two variants. These results position OPDA as a calibration-free controller for non-degenerate operating points in tabular fair SSL without per-dataset tuning.

2605.15511 2026-06-02 cs.LG 版本更新

OgBench: A Framework for Evaluating Graph Neural Networks on Omics Data

OgBench:评估图神经网络在组学数据上的框架

Louisa Cornelis, Johan Mathe, Louis Van Langendonck, Guillermo Bernárdez, Nina Miolane

发表机构 * UC Santa Barbara(加州大学圣芭芭拉分校) Atmo, Inc.(Atmo公司) Universitat Politècnica de Catalunya(加泰罗尼亚理工大学)

AI总结 针对组学数据中样本少、节点多的特点,提出OgBench基准平台,评估GNN性能,发现常用GNN常不如简单MLP和经典基线。

Comments 42 pages

详情
AI中文摘要

图神经网络(GNN)已成为归纳图级学习的主导框架。然而,大多数基准测试关注的是 $n \gg p$ 的情况,其中图的数量 $n$ 远大于每张图的节点数 $p$。这忽略了诸如组学等生物学领域,这些领域处于相反的 $n \ll p$ 情况,其特点是跨少量患者样本的大规模基因、转录本或蛋白质图。这引发了一个问题: extit{GNN 在低样本、高节点的组学设置中表现如何?} 我们引入了 exttt{OgBench}(组学图基准),这是第一个针对组学数据 $n \ll p$ 特征下的图级预测基准平台。我们提供了一个标准化的、端到端的模块化基础设施,从原始组学数据到具有不同结构属性的特征图家族。我们对经典GNN、为大型图和组学应用设计的GNN,以及MLP和机器学习基线进行基准测试,以建立参考性能。我们的结果表明,广泛使用的GNN通常并不优于简单的MLP和经典基线。这些发现挑战了图结构在该领域固有地增加价值的普遍假设,促进了对当前学习范式的批判性重新评估。最终,通过揭示这些局限性,OgBench提供了必要的开源生态系统,使社区能够开发和验证专门为生物图设计的新型架构。代码可在 https://github.com/geometric-intelligence/ogbench 获取。

英文摘要

Graph Neural Networks (GNNs) have become the dominant framework for inductive graph-level learning. Yet most benchmarks focus on the regime $n \gg p$, where the number of graphs $n$ greatly exceeds the number of nodes per graph $p$. This overlooks biological domains such as omics, which operate in the opposite $n \ll p$ regime, characterized by large graphs of genes, transcripts, or proteins across few patient samples. This raises the question: \textit{how do GNNs perform in this low-sample, high-node omics setting?} We introduce \texttt{OgBench} (Omics-Graph Bench), the first benchmarking platform for graph-level prediction in the $n \ll p$ regime characteristic of omics data. We provide a standardized, end-to-end modular infrastructure from raw omics data to families of featured graphs with varied structural properties. We benchmark classical GNNs, as well as GNNs designed for large graphs and omics applications, alongside MLPs and machine learning baselines to establish reference performances. Our results show that widely used GNNs often do not outperform simple MLPs and classical baselines. These findings challenge the prevailing assumption that graph structure inherently adds value in this domain, fostering a critical reassessment of current learning paradigms. Ultimately, by exposing these limitations, OgBench provides the open-source ecosystem necessary for the community to develop and validate novel architectures explicitly tailored for biological graphs. The code is available at https://github.com/geometric-intelligence/ogbench.

2603.05917 2026-06-02 cs.LG cs.AI q-fin.ST 版本更新

Stock Market Prediction Using Node Transformer Architecture Integrated with BERT Sentiment Analysis

结合BERT情感分析的节点Transformer架构用于股票市场预测

Mohammad Al Ridhawi, Mahtab Haj Ali, Hussein Al Osman

发表机构 * University of Technology, Baghdad, Iraq(巴格达大学)

AI总结 提出一种将节点Transformer与BERT情感分析相结合的框架,通过图结构建模股票间依赖关系并融合社交媒体情感,在S&P 500股票上实现0.80%的MAPE,显著优于传统方法。

Comments 18 pages, 5 figures, 12 tables. Accepted for publication in IEEE Access

详情
Journal ref
IEEE Access, vol. 14, pp. 72613-72631, 2026
AI中文摘要

股票市场预测对在噪声、非平稳和行为动态的复杂市场环境中操作的投资者、金融机构和政策制定者提出了相当大的挑战。传统的预测方法,包括基本面分析和技术指标,往往无法捕捉金融市场中固有的复杂模式和横截面依赖性。本文提出了一种结合节点Transformer架构与基于BERT的情感分析的集成框架,用于股票价格预测。该模型将股票市场表示为图结构,其中个股构成节点,边捕捉关系,包括行业隶属关系、相关价格变动和供应链连接。一个微调的BERT模型从社交媒体帖子中提取情感信息,并通过基于注意力的融合机制将其与定量市场特征相结合。节点Transformer处理历史市场数据,同时捕捉股票间的时间演变和横截面依赖性。在1982年1月至2025年3月期间20只S&P 500股票上进行的实验表明,集成模型在一天前预测中实现了0.80%的平均绝对百分比误差(MAPE),而ARIMA为1.20%,LSTM为1.00%。情感分析的加入使预测误差总体降低10%,在财报公告期间降低25%,而基于图的架构通过捕捉股票间依赖性额外贡献了15%的改进。方向准确率在一天预测中达到65%。通过配对t检验的统计验证确认了这些改进的显著性(所有比较p < 0.05)。该模型在高波动期保持较低的误差,MAPE为1.50%,而基线模型范围为1.60%至2.10%。

英文摘要

Stock market prediction presents considerable challenges for investors, financial institutions, and policymakers operating in complex market environments characterized by noise, non-stationarity, and behavioral dynamics. Traditional forecasting methods, including fundamental analysis and technical indicators, often fail to capture the intricate patterns and cross-sectional dependencies inherent in financial markets. This paper presents an integrated framework combining a node transformer architecture with BERT-based sentiment analysis for stock price forecasting. The proposed model represents the stock market as a graph structure where individual stocks form nodes and edges capture relationships including sectoral affiliations, correlated price movements, and supply chain connections. A fine-tuned BERT model extracts sentiment information from social media posts and combines it with quantitative market features through attention-based fusion mechanisms. The node transformer processes historical market data while capturing both temporal evolution and cross-sectional dependencies among stocks. Experiments conducted on 20 S&P 500 stocks spanning January 1982 to March 2025 demonstrate that the integrated model achieves a mean absolute percentage error (MAPE) of 0.80% for one-day-ahead predictions, compared to 1.20% for ARIMA and 1.00% for LSTM. The inclusion of sentiment analysis reduces prediction error by 10% overall and 25% during earnings announcements, while the graph-based architecture contributes an additional 15% improvement by capturing inter-stock dependencies. Directional accuracy reaches 65% for one-day forecasts. Statistical validation through paired t-tests confirms the significance of these improvements (p < 0.05 for all comparisons). The model maintains lower error during high-volatility periods, achieving MAPE of 1.50% while baseline models range from 1.60% to 2.10%.

2605.13834 2026-06-02 cs.LG cs.AI cs.CG 版本更新

Topology-Preserving Neural Operator Learning via Hodge Decomposition

通过Hodge分解保持拓扑的神经算子学习

Dongzhe Zheng, Tao Zhong, Christine Allen-Blanchette

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文从函数空间视角研究几何网格上物理场方程的解算子,利用Hodge正交性分离不可学习的拓扑自由度与可学习的几何动力学,提出基于Hodge谱对偶的混合欧拉-拉格朗日架构,在保持物理不变量的同时提升几何图上的精度与效率。

Comments Accepted at ICML 2026. Code available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

详情
AI中文摘要

本文从函数空间视角研究几何网格上物理场方程的解算子。我们发现Hodge正交性通过将不可学习的拓扑自由度与可学习的几何动力学分离,从根本上解决了谱干扰问题,从而实现了局限于保结构子空间的加性逼近。基于Hodge理论和算子分裂,我们推导出原则性的算子级分解。结果是一种混合欧拉-拉格朗日架构,具有我们称为Hodge谱对偶(HSD)的代数级归纳偏置。在我们的框架中,我们使用离散微分形式捕捉拓扑主导的分量,并使用正交辅助环境空间表示复杂的局部动力学。我们的方法在几何图上实现了优越的准确性和效率,并增强了对物理不变量的保真度。我们的代码可在https://github.com/ContinuumCoder/Hodge-Spectral-Duality获取。

英文摘要

In this paper, we study solution operators of physical field equations on geometric meshes from a function-space perspective. We reveal that Hodge orthogonality fundamentally resolves spectral interference by isolating unlearnable topological degrees of freedom from learnable geometric dynamics, enabling an additive approximation confined to structure-preserving subspaces. Building on Hodge theory and operator splitting, we derive a principled operator-level decomposition. The result is a Hybrid Eulerian-Lagrangian architecture with an algebraic-level inductive bias we call Hodge Spectral Duality (HSD). In our framework, we use discrete differential forms to capture topology-dominated components and an orthogonal auxiliary ambient space to represent complex local dynamics. Our method achieves superior accuracy and efficiency on geometric graphs with enhanced fidelity to physical invariants. Our code is available at https://github.com/ContinuumCoder/Hodge-Spectral-Duality

2605.13430 2026-06-02 stat.ME cs.AI cs.LG 版本更新

Towards a holistic understanding of Selection Bias for Causal Effect Identification

走向因果效应识别中选择偏差的整体理解

Yiwen Qiu, Filip Kovačević, Shimeng Huang, Peter Spirtes, Francesco Locatello

发表机构 * Carnegie Mellon University(卡内基梅隆大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 研究在观测研究中存在选择偏差时,如何利用弱假设刻画倾向得分和选择概率,给出平均处理效应可识别性的充要条件,扩展了现有图形识别准则。

Comments 9 pages for the main text, ICML 2026

详情
AI中文摘要

选择偏差在观测研究中普遍存在。例如,大规模生物库数据可能表现出“健康志愿者偏差”,即受访者比他们所要代表的人群更健康、社会经济地位更高。从这样的子人群中恢复因果效应是因果推断中的一个重要问题,因为从选定人群估计平均处理效应(ATE)可能导致对整个群体的ATE估计严重偏倚。本文研究了选择偏差下ATE的可识别性。我们利用概率类的弱假设刻画倾向得分和选择概率,给出了ATE可识别性的充要条件。与以往工作相比,我们的结果扩展了现有的图形可识别性准则,并在存在选择偏差的情况下,以严格更弱的条件提供了对因果效应识别更全面的理解。

英文摘要

Selection bias is pervasive in observational studies. For example, large scale biobanks data can exhibit ``healthy volunteer bias'' when respondents are healthier and of higher socio-economic status than the population they are meant to represent. Recovering causal effects from such sub-population is an important problem in causal inference, as estimating average treatment effects (ATE) from selected populations can result in a severely biased estimate of the ATE from the whole population. In this paper, we investigate the identifiability of the ATE under selection bias. We provide necessary and sufficient conditions for ATE identifiability, leveraging weak assumptions on probability classes to characterize propensity score and selection probability. Compared to previous works, our results extend existing graphical identifiability criteria and offer a more comprehensive understanding of causal effect identification with strictly weaker conditions in the presence of selection bias.

2509.24627 2026-06-02 cs.LG 版本更新

Learning Hamiltonian Dynamics at Scale: A Differential-Geometric Approach

大规模学习哈密顿动力学:一种微分几何方法

Katharina Friedl, Noémie Jaquier, Alyx Liao, Danica Kragic

发表机构 * Department of Robotics, Perception, and Learning(机器人、感知与学习系)

AI总结 提出结合哈密顿力学守恒律与模型降阶可扩展性的降阶哈密顿神经网络(RO-HNN),通过几何约束辛自编码器和几何哈密顿神经网络实现高维动力系统的物理一致预测。

Comments 32 pages, 21 figures, Intl. Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

将物理直觉嵌入网络架构允许学习强制执行基本属性(如能量守恒定律)的动力学,从而产生物理上合理的预测。然而,将这些模型扩展到高维动力系统仍然是一个重大挑战。本文介绍了降阶哈密顿神经网络(RO-HNN),一种新颖的物理启发神经网络,它结合了哈密顿力学的守恒律与模型降阶的可扩展性。RO-HNN 建立在两个核心组件上:一种新颖的几何约束辛自编码器,用于学习低维、保结构的辛子流形,以及一种几何哈密顿神经网络,用于建模子流形上的动力学。我们的实验表明,RO-HNN 提供了复杂高维动力学的物理一致、稳定且可泛化的预测,从而有效地将哈密顿神经网络的范围扩展到高维物理系统。

英文摘要

Embedding physical intuition into network architectures allows the learning of dynamics that enforce fundamental properties, such as energy conservation laws, thereby leading to physically-plausible predictions. Yet, scaling these models to high-dimensional dynamical systems remains a significant challenge. This paper introduces Reduced-order Hamiltonian Neural Network (RO-HNN), a novel physics-inspired neural network that combines the conservation laws of Hamiltonian mechanics with the scalability of model order reduction. RO-HNN is built on two core components: a novel geometrically-constrained symplectic autoencoder that learns a low-dimensional, structure-preserving symplectic submanifold, and a geometric Hamiltonian neural network that models the dynamics on the submanifold. Our experiments demonstrate that RO-HNN provides physically-consistent, stable, and generalizable predictions of complex high-dimensional dynamics, thereby effectively extending the scope of Hamiltonian neural networks to high-dimensional physical systems.

2605.13175 2026-06-02 cs.LG 版本更新

Do Heavy Tails Help Diffusion? On the Subtle Trade-off Between Initialization and Training

重尾是否有助于扩散?初始化与训练之间的微妙权衡

Hamza Cherkaoui, Hélène Halconruy, Antonio Ocello

发表机构 * SAMOVAR, Télécom SudParis, Institut Polytechnique de Paris, Palaiseau, France(SAMOVAR,电信南巴黎,巴黎理工学院,Palaiseau,法国) Modal’X, Université Paris Nanterre, Nanterre, France(Modal’X,巴黎南大学,Nanterre,法国) CREST, ENSAE Paris, Institut Polytechnique de Paris, Palaiseau, France(CREST,ENSAE巴黎,巴黎理工学院,Palaiseau,法国)

AI总结 本文通过理论和实验研究,比较了重尾噪声与轻尾高斯噪声在扩散模型中的表现,发现重尾噪声虽能更好地匹配数据尾部,但会使统计估计更困难,导致更差的采样误差。

详情
AI中文摘要

最近的工作提出将重尾噪声引入基于扩散和流的生成模型,旨在更好地恢复目标分布的尾部并提高生成多样性。这一动机直观:如果数据是重尾的,重尾噪声可能比轻尾高斯噪声更匹配。然而,用重尾噪声替换高斯噪声也改变了底层估计问题。在本文中,我们通过理论和实验相结合的研究重新审视这一范式,建立了由重尾和轻尾噪声驱动的两种代表性扩散模型的采样误差界。我们表明,重尾噪声使统计估计问题更困难,导致更不利的采样误差界。我们通过在合成和真实数据集上的实验支持这些发现,经验性地恢复了预测的误差权衡。我们的结果质疑了生成建模中日益增长的设计趋势,并挑战了使用重尾噪声来改进稀有区域探索的做法。

英文摘要

Recent works have proposed incorporating heavy-tailed (HT) noise into diffusion- and flow-based generative models, with the goals of better recovering the tails of target distributions and improving generative diversity. This motivation is intuitive: if the data are heavy-tailed, HT noise may appear better matched than light-tailed (LT) Gaussian noise. However, replacing Gaussian noise by HT noise also changes the underlying estimation problem. In this paper, we revisit this paradigm through a combined theoretical and empirical study, establishing sampling-error bounds for two representative diffusion models driven by HT and LT noise. We show that HT noise makes the statistical estimation problem harder, leading to less favorable sampling-error bounds. We support these findings with experiments on synthetic and real-world datasets, empirically recovering the predicted error trade-off. Our results call into question a growing design trend in generative modeling and challenge the use of HT noise to improve rare-region exploration.

2605.12895 2026-06-02 cs.LG cs.AI cs.CY stat.AP 版本更新

RISED: A Pre-Deployment Evaluation Framework for High-Stakes AI Decision-Support Systems, with Application to Healthcare

RISED:高风险AI决策支持系统的部署前评估框架,及其在医疗中的应用

Rohith Reddy Bellibatlu, Manpreet Singh, Yash Jajoo, Shyamal Lakhanpal, Abhishek Israni

发表机构 * Florida International University(佛罗里达国际大学) Boston University(波士顿大学) New York University(纽约大学) University of Maryland(马里兰大学) Boston University School of Public Health(波士顿大学公共卫生学院)

AI总结 提出RISED框架,通过BCa bootstrap置信区间、文献阈值和Holm-Bonferroni校正的PASS/FAIL/INCONCLUSIVE判定,从五个维度评估高风险AI决策支持系统,在医疗等数据集上发现AUROC无法揭示的失败模式。

Comments 39 pages, 7 figures, 15 tables. Code at https://github.com/rohithreddybc/rised-healthcare-eval and dataset at https://doi.org/10.57967/hf/8734 (Hugging Face). To be submitted to Expert Systems with Applications (Elsevier)

详情
AI中文摘要

临床决策支持系统是专家系统,临床医生直接根据其建议行动,但通常仅通过保留测试集上的一个总体准确率数字来批准。这个数字对编码偏移下的输入可靠性、子组差距、阈值敏感性或操作可行性毫无说明。我们提出RISED,一个部署前评估框架,通过BCa bootstrap 95%置信区间、基于文献的阈值和Holm-Bonferroni校正的PASS/FAIL/INCONCLUSIVE判定,操作化五个维度(可靠性、包容性、敏感性、公平性、可部署性);公平性是一个代理依赖诊断而非门控测试。应用于跨越35年的七个队列(n从303到99,492),RISED揭示了AUROC无法发现的失败:在Diabetes 130上,可靠性通过三个数量级(PSS = 0.0004),而包容性(AUC差距 = 0.262)和敏感性(最大阈值翻转率49.1%)明确失败;两个NHIS队列也重复了这一点。具有完整特征配置的NHANES 2021-2023获得了INCONCLUSIVE判定;BRFSS 2024在仪器旋转移除高血压和胆固醇后产生了该套件中最严重的敏感性失败(最大阈值翻转率64.2%)。该模式在信用和收入预测队列上重复出现,证实了领域无关性;多模型检查显示失败是数据驱动的,而非模型特定的。RISED作为开源Python包发布,补充了TRIPOD+AI、FUTURE-AI和Fairlearn,提供了这些标准要求但未规定的结构化数值证据。

英文摘要

Clinical decision-support systems are expert systems whose recommendations clinicians act on directly, yet they are usually cleared on one aggregate accuracy number from a held-out test set. That number says nothing about input reliability under encoding shifts, subgroup gaps, threshold sensitivity, or operational feasibility. We present RISED, a pre-deployment evaluation framework operationalising five dimensions (Reliability, Inclusivity, Sensitivity, Equity, Deployability) through BCa bootstrap 95% confidence intervals, literature-grounded thresholds, and Holm-Bonferroni-corrected PASS / FAIL / INCONCLUSIVE verdicts; Equity is a proxy-dependence diagnostic rather than a gating test. Applied to seven cohorts spanning 35 years (n from 303 to 99,492), RISED surfaces failures invisible to AUROC: on Diabetes 130, Reliability passes by three orders of magnitude (PSS = 0.0004) while Inclusivity (AUC parity gap = 0.262) and Sensitivity (max threshold-flip rate 49.1%) fail decisively; both NHIS cohorts reproduce this. NHANES 2021-2023, with a complete feature profile, achieves INCONCLUSIVE verdicts; BRFSS 2024 produces the suite's most severe Sensitivity failure (max threshold-flip rate 64.2%) after instrument rotation removed hypertension and cholesterol. The pattern recurs on credit- and income-prediction cohorts, confirming domain-agnosticity; a multi-model check shows the failures are data-driven, not model-specific. RISED ships as an open-source Python package complementing TRIPOD+AI, FUTURE-AI, and Fairlearn with the structured numerical evidence those standards require but do not prescribe.

2605.12813 2026-06-02 cs.CL cs.AI cs.CR cs.LG 版本更新

REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

REALISTA: 引发LLM幻觉的逼真潜在对抗攻击

Buyun Liang, Jinqi Luo, Liangzu Peng, Kwan Ho Ryan Chan, Darshan Thaker, Kaleab A. Kinfu, Fengrui Tian, Hamed Hassani, René Vidal

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 提出REALISTA框架,通过潜在空间优化语义等价的对抗提示,有效引发大语言模型幻觉,优于现有方法。

Comments Accepted at ICML 2026. Code is available at https://github.com/Buyun-Liang/REALISTA

详情
AI中文摘要

大型语言模型(LLM)在许多任务上表现出色,但仍然容易产生幻觉,因此有必要在逼真的对抗输入下系统地评估其可靠性。我们将幻觉引发问题形式化为一个约束优化问题,目标是找到与良性用户提示语义等价的对抗提示。现有攻击方法仍有局限:基于离散提示的攻击保持语义等价性和连贯性,但仅搜索有限的提示变体;而连续潜在空间攻击探索更丰富的空间,但通常解码为不再有效改写的提示。为解决这些局限,我们提出REALISTA,一个逼真的潜在空间攻击框架。REALISTA构建了一个依赖于输入的合法编辑方向字典,每个方向对应一个语义等价且连贯的改写,并在潜在空间中优化这些方向的连续组合。这种设计结合了连续攻击的优化灵活性和基于离散改写的攻击的语义逼真性。实验表明,REALISTA在开源LLM上达到优于或与最先进逼真攻击相当的性能,并且关键的是,在自由形式响应设置下成功攻击大型推理模型,而先前的逼真攻击则失败。代码可在https://github.com/Buyun-Liang/REALISTA获取。

英文摘要

Large language models (LLMs) achieve strong performance across many tasks but remain vulnerable to hallucinations, making it important to systematically evaluate their reliability under realistic adversarial inputs. We formulate hallucination elicitation as a constrained optimization problem, where the goal is to find semantically coherent adversarial prompts that are equivalent to benign user prompts. Existing attack methods remain limited: discrete prompt-based attacks preserve semantic equivalence and coherence but search only over a limited set of prompt variations, while continuous latent-space attacks explore a richer space but often decode into prompts that are no longer valid rephrasings. To address these limitations, we propose REALISTA, a realistic latent-space attack framework. REALISTA constructs an input-dependent dictionary of valid editing directions, each corresponding to a semantically equivalent and coherent rephrasing, and optimizes continuous combinations of these directions in latent space. This design combines the optimization flexibility of continuous attacks with the semantic realism of discrete rephrasing-based attacks. Experiments demonstrate that REALISTA achieves superior or comparable performance to state-of-the-art realistic attacks on open-source LLMs and, crucially, succeeds in attacking large reasoning models under free-form response settings, where prior realistic attacks fail. Code is available at https://github.com/Buyun-Liang/REALISTA.

2605.12768 2026-06-02 stat.ML cs.LG 版本更新

ISOMORPH: A Supply Chain Digital Twin for Simulation, Dataset Generation, and Forecasting Benchmarks

ISOMORPH:用于仿真、数据集生成和预测基准的供应链数字孪生

Zhizhen Zhang, Hyemin Gu, Benjamin J. Zhang, Daniel Elenius, Michael Tyrrell, Theo J. Bourdais, Houman Owhadi, Markos A. Katsoulakis, Tuhin Sahai

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) University of North Carolina(北卡罗来纳大学) SRI International(SRI国际) California Institute of Technology(加州理工学院)

AI总结 本文提出ISOMORPH,首个公开的多级物流网络数字孪生,通过可配置参数和模块化拓扑生成具有牛鞭效应等动态特性的数据集,并评估基础模型的零样本预测性能。

详情
AI中文摘要

开放的时间序列预测(TSF)基准涵盖零售、能源、天气和交通,但供应链物流仍未得到充分服务。我们引入了ISOMORPH,这是第一个具有可解释、用户可配置参数以及模块化拓扑、需求和控制规则的多级物流网络的公开数字孪生。该模拟器在离散时间上推进一个有向路由图:需求从库存中满足或记录为积压,并触发整个网络的补货。状态跟踪库存、未结订单、在途货物以及平滑的需求估计,在可处理的状态空间上产生马尔可夫动力学。发布的数据以经验一致的程度再现了牛鞭效应,同时三个守恒定律为模拟器扩展提供了验证工具。我们发布了两个目录规模(C=50和C=200)、六种场景扫描和20种拉丁超立方体扰动的数据集。这些数据集展示了固定TSF基准中基本缺失的动态特性,包括方差放大、级联瓶颈、制度转换以及通过共享宏观冲击的跨通道耦合。对四个基础模型(Chronos、Moirai、TimesFM和Lag-Llama)的零样本评估在低至中等预测范围上产生了超过公开GIFT-Eval参考的MASE值,支持将其纳入现有基准套件。相同的模型通过需求侧参数的拉丁超立方体扰动提供预测置信带,实现了标准TSF数据集上不可用的前向不确定性量化(UQ),并证明基础模型可以作为基于数字孪生的UQ的快速替代。代码(MIT):https://github.com/tuhinsahai/ISOMORPH。交互演示:https://huggingface.co/spaces/HyeminGu/ISOMORPH-demo。

英文摘要

Open time-series forecasting (TSF) benchmarks cover retail, energy, weather, and traffic, but supply-chain logistics remains underserved. We introduce ISOMORPH, the first public digital twin of a multi-echelon logistics network with interpretable, user-configurable parameters and modular topology, demand, and control rules. The simulator advances a directed routing graph in discrete time: demand is served from inventory or recorded as backlog and triggers replenishment throughout the network. The state tracks inventory, outstanding orders, in-transit shipments, and a smoothed demand estimate, yielding Markovian dynamics on a tractable state space. The released data reproduces the bullwhip effect at empirically consistent magnitudes, while three conservation laws provide verification tools for simulator extensions. We release datasets at two catalogue scales ($C=50$ and $C=200$), six scenario sweeps, and 20 Latin-hypercube perturbations. These datasets exhibit dynamics largely absent from fixed TSF benchmarks, including variance amplification, cascading bottlenecks, regime shifts, and cross-channel coupling through shared macro shocks. Zero-shot evaluation of four foundation models (Chronos, Moirai, TimesFM, and Lag-Llama) yields MASE values exceeding public GIFT-Eval references at low-to-moderate horizons, supporting incorporation into existing benchmark suites. The same models provide forecast confidence bands through Latin-hypercube perturbations of demand-side parameters, enabling forward uncertainty quantification (UQ) unavailable on standard TSF datasets and demonstrating that foundation models can serve as fast surrogates for digital-twin-based UQ. Code (MIT): https://github.com/tuhinsahai/ISOMORPH. Interactive demo: https://huggingface.co/spaces/HyeminGu/ISOMORPH-demo.

2605.12652 2026-06-02 cs.LG cs.AI 版本更新

Multi-Rollout On-Policy Distillation via Peer Successes and Failures

基于同伴成功与失败的多轨迹在线策略蒸馏

Weichen Yu, Xiaomin Li, Yizhou Zhao, Xiaoze Liu, Ruowang Zhang, Haixin Wang, Yinyi Luo, Chen Henry Wu, Gaurav Mittal, Matt Fredrikson, Yu Hu

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Microsoft(微软) Purdue University(普渡大学)

AI总结 提出多轨迹在线策略蒸馏(MOPD),利用学生模型的本地轨迹组构造更丰富的教师信号,通过同伴成功与失败条件化提升蒸馏效果。

Comments 23 pages

详情
AI中文摘要

大型语言模型通常使用稀疏验证器奖励进行后训练,该奖励指示采样轨迹是否成功,但对推理成功或失败的位置提供有限指导。在线策略蒸馏(OPD)通过训练学生生成的轨迹提供更密集的令牌级监督,但现有方法通常独立蒸馏每个轨迹,忽略为同一提示采样的其他尝试。我们引入多轨迹在线策略蒸馏(MOPD),一种同伴条件化蒸馏框架,利用学生的本地轨迹组构造信息更丰富的教师信号。MOPD 将教师条件化于成功和失败的同伴轨迹:成功为有效推理模式提供正面证据,而失败则为要避免的合理错误提供结构化负面证据。我们研究了两种同伴上下文构建:正面同伴模仿和对比性成功-失败条件化。在竞争性编程、数学推理、科学问答和工具使用基准上的实验表明,MOPD 持续优于标准在线策略基线。进一步的教师信号分析表明,混合成功-失败上下文能更好地使教师分数与验证器奖励对齐,表明性能提升源于更忠实、实例自适应的监督。这些结果表明,有效的在线策略蒸馏应利用学生的多轨迹试错行为,而不是将轨迹视为孤立样本。

英文摘要

Large language models are often post-trained with sparse verifier rewards, which indicate whether a sampled trajectory succeeds but provide limited guidance about where reasoning succeeds or fails. On-policy distillation (OPD) offers denser token-level supervision by training on student-generated trajectories, yet existing methods typically distill each rollout independently and ignore the other attempts sampled for the same prompt. We introduce Multi-Rollout On-Policy Distillation (MOPD), a peer-conditioned distillation framework that uses the student's local rollout group to construct more informative teacher signals. MOPD conditions the teacher on both successful and failed peer rollouts: successes provide positive evidence for valid reasoning patterns, while failures provide structured negative evidence about plausible mistakes to avoid. We study two peer-context constructions: positive peer imitation and contrastive success-failure conditioning. Experiments on competitive programming, mathematical reasoning, scientific question answering, and tool-use benchmarks show that MOPD consistently improves over standard on-policy baselines. Further teacher-signal analysis shows that mixed success-failure contexts better align teacher scores with verifier rewards, indicating that the gains arise from more faithful, instance-adaptive supervision. These results indicate that effective on-policy distillation should exploit the student's multi-rollout trial-and-error behavior rather than treating rollouts as isolated samples.

2605.11374 2026-06-02 cs.LG cs.CL cs.IR 版本更新

Test-Time Compute for Frozen Embedding Models through Agentic Program Search

冻结嵌入模型在测试时通过智能体程序搜索的计算

Han Xiao

发表机构 * Jina AI by Elastic(Jina AI 由 Elastic 提供)

AI总结 本文提出一种智能体程序搜索方法,通过大语言模型编写程序操作冻结编码器API,在推理时提升小嵌入模型的检索质量,无需训练参数且跨任务迁移。

Comments 15 pages, 7 figures, 4 tables

详情
AI中文摘要

测试时计算被广泛认为只对大型推理模型有益,小模型无法从中获益。对于密集检索,我们持相反观点,因为现代小型嵌入模型是从大型语言模型骨干蒸馏或适配而来,继承了其潜在的测试时计算能力。我们探究一个冻结的嵌入模型在仅推理时,无需辅助模型且部署时不训练任何参数,能获得多少检索质量提升。一个智能体循环中,大语言模型在冻结编码器API上编写程序,探索144个候选程序,得到12个帕累托最优程序,这些程序在成本比率从$c=1.2$到$14.7$之间权衡推理计算与质量,每个程序在所有14个发现任务上均提升了nDCG@10。这些程序不使用可训练参数,并恢复了经典检索原语,包括倒数秩融合、Fisher线性判别、Rocchio伪相关反馈和句子级MaxSim。未经修改地应用于19个保留任务和三个未见过的编码器家族,单个固定程序改进了大多数任务,中位数$Δ$nDCG@10为正,在$c\ge4$时胜率为54%至57%,且在发现过程中从未见过的编码器家族上增益最大。一个在相同任务上训练的匹配预算学习投影头无法以这种方式迁移,它在域内检索上提升了$+0.20$至$+0.25$ nDCG@10,但在每个保留编码器上均低于基线。因此,小型嵌入模型继承了可用的测试时计算潜力,冻结编码器将推理计算转化为检索增益,并迁移到新的语料库和编码器,无需每个领域的标签。

英文摘要

Test-time compute is widely believed to benefit only large reasoning models, leaving small models with nothing to gain. We argue the opposite for dense retrieval, since modern small embedding models are distilled or adapted from large language model backbones and can inherit their latent test-time-compute potential. We ask how much retrieval quality a frozen embedding model gains at inference alone, with no auxiliary model and no parameters trained at deployment. An agentic loop in which a large language model writes programs over a frozen encoder API explores 144 candidates and yields twelve Pareto-optimal programs that trade inference compute for quality across cost ratios from $c{=}1.2$ to $14.7$, every one improving nDCG@10 on all 14 discovery tasks. The programs use no trainable parameters and recover classical retrieval primitives, among them reciprocal rank fusion, the Fisher linear discriminant, Rocchio pseudo-relevance feedback, and sentence-level MaxSim. Applied unmodified to nineteen held-out tasks and three unseen encoder families, a single fixed program improves the majority of tasks, with a positive median $Δ$nDCG@10 and a 54 to 57% win-rate at $c{\ge}4$, and the gains are largest on encoder families never seen during discovery. A matched-budget learned projection head trained on the same tasks does not transfer this way, improving in-domain retrieval by $+0.20$ to $+0.25$ nDCG@10 yet falling below baseline on every held-out encoder. Small embedding models therefore inherit usable test-time-compute potential, and a frozen encoder converts inference compute into retrieval gains that transfer to new corpora and encoders with no per-domain labels.

2605.12400 2026-06-02 cs.LG cs.AI 版本更新

OGLS-SD: On-Policy Self-Distillation with Outcome-Guided Logit Steering for LLM Reasoning

OGLS-SD:基于结果引导的对数几率操控的在线自蒸馏用于大语言模型推理

Yuxiao Yang, Xiaoyun Wang, Weitong Zhang

发表机构 * UNC Chapel Hill(UNC夏洛特山分校)

AI总结 提出OGLS-SD框架,通过结果奖励校准教师对数几率,解决在线自蒸馏中师生响应模式不匹配导致的训练不稳定问题,提升数学推理性能。

Comments 17 pages, 10 figures, 5 tables

详情
AI中文摘要

我们研究在线自蒸馏(OPSD),其中语言模型通过沿其自身在线轨迹蒸馏特权教师分布来提高推理能力。尽管有前景,OPSD可能因教师和学生响应之间的模式不匹配而遭受训练不稳定。自我反思的教师响应可能引入反思引起的偏差和响应模板,从而错误校准令牌级监督,最终损害学生的推理能力。为缓解此问题,我们提出OGLS-SD,一种结果引导的对数几率操控框架,利用可验证的结果奖励来校准特权教师对数几率。具体而言,OGLS-SD对比由成功和失败的在线轨迹诱导的教师对数几率,构建一个结果判别性的操控方向用于令牌级指导。在数学推理基准上的实验表明,OGLS-SD稳定了自蒸馏,并提高了相对于标准OPSD和其他变体的性能。

英文摘要

We study on-policy self-distillation (OPSD), where a language model improves its reasoning ability by distilling privileged teacher distributions along its own on-policy trajectories. Despite its promise, OPSD can suffer from training instability due to a pattern mismatch between teacher and student responses. Self-reflected teacher responses may introduce reflection-induced biases and response templates that miscalibrate token-level supervision, ultimately harming the student's reasoning ability. To mitigate this issue, we propose OGLS-SD, an outcome-guided logit-steering framework that leverages verifiable outcome rewards to calibrate privileged teacher logits. Specifically, OGLS-SD contrasts teacher logits induced by successful and failed on-policy trajectories, constructing an outcome-discriminative steering direction for token-level guidance. Experiments on mathematical reasoning benchmarks show that OGLS-SD stabilizes self-distillation and improves performance over standard OPSD and other variants.

2605.09382 2026-06-02 cs.LG cs.CV cs.DS math.OC 版本更新

Learning-Augmented Scalable Linear Assignment Problem Optimization via Neural Dual Warm-Starts

学习增强的可扩展线性分配问题优化:基于神经对偶热启动

Ilay Yavlovich, Jad Agbaria, Muhamed Mhamed, Nir Weinberger, Jose Yallouz

发表机构 * Department of Electrical and Computer Engineering, Technion -- Israel Institute of Technology, Haifa, Israel(电气与计算机工程系,技术学院——以色列理工学院,海法,以色列)

AI总结 提出一种学习增强框架,通过预测对偶变量热启动精确求解器,并设计轻量级行独立架构RowDualNet避免O(N^2)内存瓶颈,实现可扩展的神经热启动,在保持最优性的同时获得超过2倍加速。

Comments Accepted to ICML 2026. 23 pages, 18 figures

详情
AI中文摘要

线性分配问题是一个基本的组合优化任务,经典精确求解器能保证最优性但受限于O(N^3)瓶颈,而最近的神经近似方法在可扩展性和精确性上存在困难。我们提出一个学习增强框架,通过预测对偶变量来热启动搜索,加速精确求解器,并配备回退机制以保持最坏情况保证。我们的核心是RowDualNet,一种轻量级、行独立的架构,避免了图模型的O(N^2)内存瓶颈,实现了高达N=16,384的可扩展神经热启动。通过Min-Trick机制,可行性由构造保证,完全消除了昂贵的迭代投影。实验上,我们的方法大幅减少了Jonker-Volgenant (LAPJV)算法的搜索努力,实现了鲁棒的零样本泛化,在复杂合成数据上获得超过2倍的端到端加速,在真实世界跟踪上获得1.25倍加速,在交通网络上获得1.5倍加速,同时严格保持最优性。

英文摘要

The Linear Assignment Problem is a fundamental combinatorial optimization task where classical exact solvers ensure optimality but suffer from an $\mathcal{O}(N^{3})$ bottleneck, while recent neural approximations struggle with scalability and exactness. We propose a learning-augmented framework that accelerates exact solvers by predicting dual variables to warm-start the search, backed by a fallback mechanism to preserve worst-case guarantees. Central to our approach is RowDualNet, a lightweight, row-independent architecture that avoids the $\mathcal{O}(N^{2})$ memory bottleneck of graph models, enabling scalable neural warm-starting up to $N=16{,}384$. Feasibility is guaranteed by construction via the Min-Trick mechanism, completely eliminating the need for costly iterative projections. Empirically, our method drastically reduces the search effort of the Jonker-Volgenant (LAPJV) algorithm, yielding robust zero-shot generalization with strict optimality and end-to-end speedups of over 2x on complex synthetic data, 1.25x on real-world tracking, and 1.5x on transportation networks.

2605.08398 2026-06-02 cs.LG cs.CV 版本更新

Exploring and Exploiting Stability in Latent Flow Matching

探索和利用潜流匹配中的稳定性

Rania Briq, Michael Kamp, Ohad Fried, Sarel Cohen, Stefan Kesselheim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文证明潜流匹配模型对数据缩减和模型容量收缩具有鲁棒性,并利用这种稳定性提出更高效的训练和推理算法,包括数据节省和超过两倍的推理加速。

Comments Accepted at ICML 2026

详情
AI中文摘要

在这项工作中,我们展示了潜流匹配(LFM)模型对不同类型的扰动具有鲁棒性,包括数据缩减和模型容量收缩。我们通过这些模型在相同噪声种子下倾向于生成相似输出来表征这种稳定性。我们提供了一个视角,将这种现象与流匹配理论联系起来,表明这种稳定性是FM目标固有的。我们进一步利用这种稳定性推导出更高效训练和推理的实用算法。具体来说,首先,我们表明通过在显著减少的数据集上训练LFM模型,性能得以保持,并且在计算受限的情况下,模型在保持质量的同时收敛更快。这带来了多种优势,包括由于更快的收敛而节省训练时间,以及在训练条件模型时减轻标注工作。其次,LFM在架构收缩下的稳定性产生了一种双模型由粗到细的方法,一个使用轻量级架构用于FM轨迹的第一阶段,另一个具有更高容量用于第二阶段,从而大幅降低推理成本。为了确定哪些样本具有信息量,我们引入了三个样本评分标准,并在生成模型的标准指标下进行评估。我们的结果在多个数据集上进行了彻底评估,展示了这种稳定性的实际优势,包括数据节省和超过两倍的推理加速,同时生成可比较的输出。

英文摘要

In this work, we show that Latent Flow-Matching (LFM) models are robust to different types of perturbations, including data reduction and model capacity shrinkage. We characterize this stability by these models' tendency to generate similar outputs under identical noise seeds. We provide a perspective relating this phenomenon to flow matching theory, which indicates that this stability is inherent to the FM objective. We further exploit this stability to derive practical algorithms for more efficient training and inference. Concretely, first, we show that by training LFM models on significantly reduced datasets, performance is preserved, and in compute-constrained regimes, the model converges faster while maintaining quality. This yields multiple advantages, including savings in the training time due to faster convergence, and alleviating annotation effort when training conditional models. Second, LFM stability under architectural shrinkage gives rise to a two-model coarse-to-fine approach, one using a light-weight architecture for the first phase of the FM trajectory, and one with higher capacity for the second, thereby reducing the inference cost substantially. To determine which samples are informative, we introduce three sample-scoring criteria and evaluate them under standard metrics for generative models. Our results are thoroughly evaluated on multiple datasets, demonstrating the practical advantage of this stability, including data savings and a more than two-fold inference speedup while generating comparable outputs.

2605.07527 2026-06-02 cs.LG cs.AI 版本更新

Why Self-Inconsistency Arises in GNN Explanations and How to Exploit It

为什么 GNN 解释中会出现自不一致性以及如何利用它

Wenxin Tai, Yaqian Liu, Ting Zhong, Fan Zhou

发表机构 * University of Electronic Science and Technology of China(电子科技大学)

AI总结 本文分析了图神经网络解释中自不一致性的成因(重新解释引起的上下文扰动),提出潜在信号分配假设解释边缘敏感性,并设计无需训练的后处理策略 Self-Denoising 来校准解释。

Comments Corrected result errors and fixed typos

详情
AI中文摘要

最近的工作观察到,自解释图神经网络(SI-GNN)产生的解释可能存在自不一致性:当模型重新应用于其自身的解释性子图时,可能会产生不同的解释。然而,自不一致性产生的原因尚不清楚。在这项工作中,我们首先将重新解释引起的上下文扰动确定为分数变化的直接原因。然后,我们引入潜在信号分配假设来解释为什么只有部分边缘对此扰动敏感,并分析简洁性正则化如何影响潜在信号分配。鉴于自不一致的边缘不能为模型预测提供稳定的证据,我们提出了自去噪(SD),这是一种模型无关且无需训练的后处理策略,仅需一次额外前向传播即可校准解释。在代表性 SI-GNN 框架、骨干架构和基准数据集上的实验支持我们的假设,并表明 SD 能够持续提高解释质量,同时在实际中仅增加约 4-6% 的计算开销。

英文摘要

Recent work has observed that explanations produced by Self-Interpretable Graph Neural Networks (SI-GNNs) can be self-inconsistent: when the model is reapplied to its own explanatory graph subset, it may produce a different explanation. However, why self-inconsistency arises remains poorly understood. In this work, we first identify re-explanation-induced context perturbation as the direct cause of score variation. We then introduce a latent signal assignment hypothesis to explain why only some edges are sensitive to this perturbation, and analyze how conciseness regularization affects latent signal assignment. Given that self-inconsistent edges do not provide stable evidence for the model's prediction, we propose Self-Denoising (SD), a model-agnostic and training-free post-processing strategy that calibrates explanations with only one additional forward pass. Experiments across representative SI-GNN frameworks, backbone architectures, and benchmark datasets support our hypothesis and show that SD consistently improves explanation quality while adding only about 4--6\% computational overhead in practice.

2604.17415 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Reward Score Matching: Unifying Reward-based Fine-tuning for Flow and Diffusion Models

奖励分数匹配:统一流模型和扩散模型的基于奖励的微调

Jeongjae Lee, Jinho Chang, Jeongsol Kim, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, Korea(人工智能研究生院,韩国科学技术院)

AI总结 提出奖励分数匹配(RSM)框架,统一了多种基于奖励的微调方法,通过分数匹配与值引导目标对齐,简化了设计空间并提高了效率。

Comments 43 pages, 15 figures

详情
AI中文摘要

基于奖励的微调引导预训练的扩散或基于流的生成模型生成更高奖励的样本,同时保持接近预训练模型。尽管现有方法源自不同视角,但我们表明许多方法可以写在一个共同框架下,我们称之为奖励分数匹配(RSM)。在此视角下,对齐变为针对值引导目标的分数匹配,方法间的主要差异归结为值引导估计器的构建和跨时间步的有效优化强度。这种统一澄清了现有设计的偏差-方差-计算权衡,并将核心优化组件与增加复杂性而无明显益处的辅助机制区分开来。在此视角指导下,我们针对代表性的可微和黑盒奖励对齐任务开发了更简单、更高效的重新设计。总体而言,RSM将看似分散的基于奖励的微调方法集合转变为更小、更可解释且更可操作的设计空间。代码可在 https://github.com/jaylee2000/rsm 获取。

英文摘要

Reward-based fine-tuning steers a pretrained diffusion or flow-based generative model toward higher-reward samples while remaining close to the pretrained model. Although existing methods are derived from different perspectives, we show that many can be written under a common framework, which we call reward score matching (RSM). Under this view, alignment becomes score matching against a value-guided target, and the main differences across methods reduce to the construction of the value-guidance estimator and the effective optimization strength across timesteps. This unification clarifies the bias-variance-compute tradeoffs of existing designs, and distinguishes core optimization components from auxiliary mechanisms that add complexity without clear benefit. Guided by this perspective, we develop simpler, more efficient redesigns across representative differentiable and black-box reward alignment tasks. Overall, RSM turns a seemingly fragmented collection of reward-based fine-tuning methods into a smaller, more interpretable, and more actionable design space. Code is available at https://github.com/jaylee2000/rsm

2605.01270 2026-06-02 cs.LG 版本更新

Continuous Temporal Representations of Event-Based Signals via Interference-Based Wave Modeling

基于干涉波建模的事件驱动信号的连续时间表示

Magnus Bengtsson

发表机构 * Department of Engineering, University of Borås(于厄萨大学工程系)

AI总结 提出基于干涉波表示的连续时间建模框架,通过复值潜波场编码事件驱动信号的时间结构,实现高效梯度优化和鲁棒特征提取,在sEMG数据上优于纯实值表示。

Comments 18 pages, 3 figures, Submitted to Journal

详情
AI中文摘要

来自事件驱动生物过程的时空信号,如表面肌电图(sEMG),表现出异步且高度结构化的激活模式,使用传统的离散或纯实值表示难以建模。在这项工作中,我们提出了一种基于干涉波表示的连续时间建模框架。该方法将类事件输入信号映射到复值潜波场,其中时间结构通过相位调制和潜分量之间的相互作用进行编码。通过将所得波场投影到能量域,模型在有限观测窗口内诱导出捕获时间定位和关系依赖性的结构化激活模式,而无需依赖显式循环或因果状态传播。所提出的公式特别适用于事件驱动的生物信号,其中连续表示能够实现高效的基于梯度的优化和鲁棒的特征提取。特别是,该方法旨在支持从sEMG数据中学习,用于生物力学系统中的下游控制任务,例如假肢装置和外骨骼。实验结果表明,与纯实值表示相比,所提出的干涉波模型提供了改进的表示质量,同时保持了适合实际部署的计算效率。

英文摘要

Spatio-temporal signals arising from event-driven biological processes, such as surface electromyography (sEMG), exhibit asynchronous and highly structured activation patterns that are challenging to model using conventional discrete or purely real-valued representations. In this work, we propose a continuous temporal modeling framework based on interference-based wave representations. The approach maps event-like input signals into a complex-valued latent wave field, where temporal structure is encoded through phase modulation and interactions between latent components. By projecting the resulting wave field onto an energy domain, the model induces structured activation patterns that capture both temporal localization and relational dependencies within finite observation windows, without relying on explicit recurrence or causal state propagation. The proposed formulation is particularly suited for event-driven biosignals, where continuous representations enable efficient gradient-based optimization and robust feature extraction. In particular, the method is designed to support learning from sEMG data for downstream control tasks in biomechanical systems, such as prosthetic devices and exoskeletons. Experimental results demonstrate that the proposed interference-based wave model provides improved representation quality compared to purely real-valued representations, while maintaining computational efficiency suitable for practical deployment.

2605.00696 2026-06-02 stat.ML cs.CL cs.LG 版本更新

Adaptive Querying with AI Persona Priors

基于AI人格先验的自适应查询

Kaizheng Wang, Yuhang Wu, Assaf Zeevi

发表机构 * Department of Industrial Engineering and Operations Research and Data Science Institute, Columbia University(工业工程与运筹学系及数据科学研究所,哥伦比亚大学) Decision, Risk, and Operations Division, Columbia Business School(决策、风险与运营部门,哥伦比亚商学院)

AI总结 提出一种基于AI人格诱导的潜变量模型,利用大语言模型生成响应分布,实现高效贝叶斯设计,用于在有限查询预算下学习用户相关量。

Comments ICML 2026

详情
AI中文摘要

我们研究在严格查询预算内,通过自适应查询学习用户相关的感兴趣量(如对保留项目的响应和心理测量指标)的问题。经典的贝叶斯设计和计算机化自适应测试通常依赖于限制性的参数假设或昂贵的后验近似,限制了它们在异质性、高维和冷启动场景中的应用。我们引入了一种人格诱导的潜变量模型,通过有限字典中的AI人格成员身份来表示用户状态,每种人格由大语言模型产生的响应分布提供。这产生了具有闭式后验更新和高效有限混合预测的表达性先验,从而实现了可扩展的贝叶斯设计用于顺序项目选择。在合成数据和WorldValuesBench上的实验表明,基于人格的后验提供了准确的概率预测和可解释的自适应启发流程。

英文摘要

We study adaptive querying for learning user-dependent quantities of interest, such as responses to held-out items and psychometric indicators, within tight query budgets. Classical Bayesian design and computerized adaptive testing typically rely on restrictive parametric assumptions or expensive posterior approximations, limiting their use in heterogeneous, high-dimensional, and cold-start settings. We introduce a persona-induced latent variable model that represents a user's state through membership in a finite dictionary of AI personas, each offering response distributions produced by a large language model. This yields expressive priors with closed-form posterior updates and efficient finite-mixture predictions, enabling scalable Bayesian design for sequential item selection. Experiments on synthetic data and WorldValuesBench demonstrate that persona-based posteriors deliver accurate probabilistic predictions and an interpretable adaptive elicitation pipeline.

2605.00394 2026-06-02 cs.LG 版本更新

Mesh Field Theory: Port-Hamiltonian Formulation of Mesh-Based Physics

网格场理论:基于网格的物理的端口-哈密顿形式化

Satoshi Noguchi, Yoshinobu Kawahara

发表机构 * University of Tokyo(东京大学)

AI总结 提出网格场理论(MeshFT)及其神经实现MeshFT-Net,通过端口-哈密顿形式化分离物理的拓扑与度量结构,实现近零能量漂移和强物理保真度。

Comments 29 pages, 7 figures, 15 tables. Accepted to ICML 2026

详情
AI中文摘要

我们提出了网格场理论(MeshFT)及其神经实现MeshFT-Net:一个用于基于网格的连续介质物理的结构保持框架,该框架清晰地将物理的拓扑结构与度量结构分开。通过施加最小物理原理(局部性、置换等变性、方向协变性以及能量平衡/耗散不等式),我们证明了基于网格的物理的约化定理。在这些条件下,物理动力学可以局部分解为端口-哈密顿形式:保守互连由网格拓扑唯一固定,而度量效应仅通过本构关系和耗散进入。这种约化阐明了哪些是必须固定的,哪些是应该学习的,直接指导了MeshFT-Net的设计。在解析和真实数据集、物理一致性测试以及分布外验证的评估中,MeshFT-Net实现了近零能量漂移和强物理保真度(正确的色散和动量守恒),以及稳健的外推和高数据效率。通过消除非物理自由度并仅学习依赖于度量的结构,MeshFT为稳定、忠实且数据高效的基于学习的物理模拟提供了原则性的归纳偏置。

英文摘要

We present Mesh Field Theory (MeshFT) and its neural realization, MeshFT-Net: a structure-preserving framework for mesh-based continuum physics that cleanly separates the physics' topological structure from its metric structure. Imposing minimal physical principles (locality, permutation equivariance, orientation covariance, and energy balance/dissipation inequality), we prove a reduction theorem for mesh-based physics. Under these conditions, the physical dynamics admit a local factorization into a port-Hamiltonian form: the conservative interconnection is fixed uniquely by mesh topology, whereas metric effects enter only through constitutive relations and dissipation. This reduction clarifies what must be fixed and what should be learned, directly informing MeshFT-Net's design. Across evaluations on analytic and realistic datasets, physics-consistency tests, and out-of-distribution validation, MeshFT-Net achieves near-zero energy drift and strong physical fidelity (correct dispersion and momentum conservation) along with robust extrapolation and high data efficiency. By eliminating non-physical degrees of freedom and learning only metric-dependent structure, MeshFT provides a principled inductive bias for stable, faithful, and data-efficient learning-based physical simulation.

2605.00310 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Beyond Visual Fidelity: Benchmarking Super-Resolution Models for Large-Scale Remote Sensing Imagery via Downstream Task Integration

超越视觉保真度:通过下游任务集成评估大规模遥感影像的超分辨率模型

Zhili Li, Kangyang Chai, Zhihao Wang, Xiaowei Jia, Yanhua Li, Gengchen Mai, Sergii Skakun, Dinesh Manocha, Yiqun Xie

发表机构 * University of Maryland(马里兰大学) University of Pittsburgh(匹兹堡大学) Worcester Polytechnic Institute(沃思利技术学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 针对现有超分辨率评估依赖PSNR/SSIM等保真度指标而忽略下游任务效用的问题,提出GeoSR-Bench基准数据集,集成土地覆盖分割、基础设施映射等下游任务,评估GAN、Transformer等9种SR模型在270种设置下的性能,发现保真度指标与任务性能弱相关甚至负相关。

Comments Under review at IEEE TPAMI

详情
AI中文摘要

超分辨率(SR)技术在从低分辨率输入重建高分辨率图像方面取得了重大进展。分辨率的提高为监测任务提供了视觉增强和实用性。特别是,SR已越来越多地用于基于卫星的地球观测,应用于城市规划、农业、生态学和灾害响应。然而,现有的SR研究和基准通常使用保真度指标如PSNR或SSIM,而超分辨率图像的真实效用在于支持下游任务,如土地覆盖分类、生物量估计和变化检测。为弥合这一差距,我们引入了GeoSR-Bench,一个下游任务集成的SR基准数据集,用于评估超越保真度指标的SR模型。GeoSR-Bench包含来自约36,000个地点的空间共位、时间对齐和质量控制的图像对,覆盖多种土地覆盖类型,分辨率从500米到0.6米。据我们所知,GeoSR-Bench是第一个直接将SR模型提高的图像分辨率与下游地球监测任务(包括土地覆盖分割、基础设施映射和生物物理变量估计)联系起来的SR基准。利用GeoSR-Bench,我们对基于GAN、Transformer、神经算子和扩散的SR模型在感知质量和下游任务性能上进行了基准测试。我们进行了270种设置的实验,涵盖2个跨平台SR任务、9个SR模型、3个下游任务模型以及每个SR任务的5个下游任务。结果表明,传统SR指标的改进通常与任务性能的提升不相关,甚至可能负相关,表明这些指标为选择适用于下游任务的优越模型提供的指导有限。这揭示了将下游任务集成到SR模型开发和评估中的必要性。

英文摘要

Super-resolution (SR) techniques have made major advances in reconstructing high-resolution images from low-resolution inputs. The increased resolution provides visual enhancement and utility for monitoring tasks. In particular, SR has been increasingly developed for satellite-based Earth observation, with applications in urban planning, agriculture, ecology, and disaster response. However, existing SR studies and benchmarks typically use fidelity metrics such as PSNR or SSIM, whereas the true utility of super-resolved images lies in supporting downstream tasks such as land cover classification, biomass estimation, and change detection. To bridge this gap, we introduce GeoSR-Bench, a downstream task-integrated SR benchmark dataset to evaluate SR models beyond fidelity metrics. GeoSR-Bench comprises spatially co-located, temporally aligned, and quality-controlled image pairs from about 36,000 locations across diverse land covers, spanning resolutions from 500m to 0.6m. To the best of our knowledge, GeoSR-Bench is the first SR benchmark that directly connects improved image resolution from SR models with downstream Earth monitoring tasks, including land cover segmentation, infrastructure mapping, and biophysical variable estimation. Using GeoSR-Bench, we benchmark GAN, transformer, neural operator, and diffusion-based SR models on perceptual quality and downstream task performance. We conduct experiments with 270 settings, covering 2 cross-platform SR tasks, 9 SR models, 3 downstream task models, and 5 downstream tasks for each SR task. The results show that improvements in traditional SR metrics often do not correlate with gains in task performance, and the correlations can be negative, indicating that these metrics provide limited guidance for selecting superior models for downstream tasks. This reveals the need to integrate downstream tasks into SR model development and evaluation.

2605.00161 2026-06-02 cs.LG 版本更新

Consistent Diffusion Language Models

一致性扩散语言模型

Hasan Amin, Yuan Gao, Yaser Souri, Subhojit Som, Ming Yin, Rajiv Khanna, Xia Song

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出多路径离散一致性(MPDC)原则,通过训练去噪器在随机桥上的路径不变性,实现单阶段训练的一致性扩散语言模型(CDLM),在文本生成中达到最先进性能。

Comments ICML 2026

详情
AI中文摘要

扩散语言模型(DLM)是自回归模型的一个有吸引力的替代方案,因为它们承诺亚线性时间、并行生成,但实际收益仍然难以捉摸,因为高质量样本仍需要数百个细化步骤。在连续域中,沿着概率流ODE的一致性训练是加速扩散的流行方法。对于离散扩散,不存在类似的样本空间ODE,使得直接适应不明确。我们认为正确的离散替代是精确后验桥,即连接任意两个噪声水平的闭式条件分布,这对于包括掩码扩散和均匀扩散在内的广泛损坏是可用的。基于这一观察,我们引入了多路径离散一致性(MPDC),这是一个新原则,它训练去噪器在这些随机桥上期望路径不变,并将其实例化为一致性扩散语言模型(CDLM),这是一个不需要预训练教师模型的单阶段训练框架。我们的CDLM目标将掩码扩散、连续一致性模型以及渐进或离散蒸馏恢复为同一观点的分析极限或经验近似。实验上,CDLM在条件和非条件文本生成上建立了新的最先进水平,在采样预算下始终优于强基线的离散扩散模型,并且通常甚至优于多阶段蒸馏基线,在少步数情况下增益最大。总之,这些结果将CDLM定位为下一代快速、高保真离散生成建模的原则性和可扩展基础。

英文摘要

Diffusion language models (DLMs) are an attractive alternative to autoregressive models because they promise sublinear-time, parallel generation, yet practical gains remain elusive as high-quality samples still demand hundreds of refinement steps. In continuous domains, consistency training along the probability-flow ODE is a popular recipe to accelerate diffusion. For discrete diffusion, no analogous sample-space ODE exists, making direct adaptation ill-defined. We argue that the right discrete substitute is the exact posterior bridge, the closed-form conditional law linking any two noise levels, which is available for broad corruptions including masked and uniform diffusion. Building on this observation, we introduce Multi-Path Discrete Consistency (MPDC), a new principle that trains a denoiser to be path-invariant in expectation across these stochastic bridges, and instantiate it as the Consistent Diffusion Language Model (CDLM), a single-stage training framework that does not require an already trained teacher model. Our CDLM objective recovers masked diffusion, continuous consistency models, and progressive or discrete distillation as analytic limits or empirical approximations of one common view. Empirically, CDLM establishes a new state of the art on both conditional and unconditional text-generation, consistently outperforming strong base discrete diffusion models and often even multi-stage distilled baselines across sampling budgets, with the largest gains in the few-step regime. Together, these results position CDLM as a principled and scalable foundation for the next generation of fast, high-fidelity discrete generative modeling.

2604.26197 2026-06-02 cs.IR cs.LG 版本更新

Hierarchical Long-Term Semantic Memory for LinkedIn's Hiring Agent

面向LinkedIn招聘代理的分层长期语义记忆

Zhentao Xu, Shangjin Zhang, Emir Poyraz, Yvonne Li, Ye Jin, Xie Lu, Xiaoyang Gu, Karthik Ramgopal, Praveen Kumar Bodigutla, Xiaofeng Wang

发表机构 * LinkedIn Corporation(LinkedIn公司)

AI总结 提出分层长期语义记忆(HLTM)框架,通过构建模式对齐的记忆树,实现可扩展的语义知识摄入、隐私感知存储、低延迟检索和透明溯源,在LinkedIn招聘助手应用中使答案正确率提升超5%、检索F1提升超10%。

Comments Accepted to the Applied Data Science (ADS) track at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD 2026)

详情
AI中文摘要

大型语言模型(LLM)代理越来越多地应用于实际产品中,其中个性化和上下文感知的用户交互至关重要。实现此类能力的核心是代理的长期语义记忆系统,该系统从嘈杂的纵向行为数据中提取隐式和显式信号,以结构化形式存储,并支持低延迟检索。构建工业级LLM代理长期记忆面临五大挑战:可扩展性、低延迟检索、隐私约束、适应性和可观测性。我们提出了分层长期语义记忆(HLTM)框架,该框架将文本数据组织成模式对齐的记忆树,在多个粒度级别捕获语义知识,从而实现可扩展的摄入、隐私感知存储、低延迟检索和透明溯源;HLTM还进一步融入了适应机制以泛化到不同用例。在LinkedIn招聘助手上的广泛评估表明,HLTM使答案正确率提升超过5%,检索F1提升超过10%,同时显著推进了查询与索引延迟之间的帕累托前沿。HLTM已全面部署在LinkedIn招聘助手中,用于支持生产招聘工作流中的核心个性化功能。

英文摘要

Large Language Model (LLM) agents are increasingly used in real-world products, where personalized and context-aware user interactions are essential. A central enabler of such capabilities is the agent's long-term semantic memory system, which extracts implicit and explicit signals from noisy longitudinal behavioral data, stores them in a structured form, and supports low-latency retrieval. Building industrial-grade long-term memory for LLM agents raises five challenges: scalability, low-latency retrieval, privacy constraints, adaptability, and observability. We introduce the Hierarchical Long-Term Semantic Memory (HLTM) framework, which organizes textual data into a schema-aligned memory tree that captures semantic knowledge at multiple levels of granularity, enabling scalable ingestion, privacy-aware storage, low-latency retrieval, and transparent provenance; HLTM further incorporates an adaptation mechanism to generalize across diverse use cases. Extensive evaluations on LinkedIn's Hiring Assistant show that HLTM improves answer correctness by more than 5% and retrieval F1 by more than 10%, while significantly advancing the Pareto frontier between query and indexing latency. HLTM has been fully deployed in LinkedIn's Hiring Assistant to power core personalization features in production hiring workflows.

2604.25191 2026-06-02 cs.AR cs.AI cs.LG 版本更新

How Can Reinforcement Learning Achieve Expert-level Placement?

强化学习如何实现专家级布局?

Ruo-Tong Chen, Ke Xue, Chengrui Gao, Yunqi Shi, Tian Xu, Peng Xie, Siyuan Xu, Mingxuan Yuan, Chao Qian, Zhi-Hua Zhou

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) Huawei Noah’s Ark Lab, China(华为诺亚实验室)

AI总结 针对强化学习在芯片布局中因奖励设计不当而难以达到专家质量的问题,提出从专家布局直接学习奖励模型的方法,通过推断专家轨迹并训练隐式奖励模型,实现从单个设计高效学习并泛化到未见案例。

Comments DAC 2026

详情
AI中文摘要

芯片布局是物理设计中的关键步骤。尽管基于强化学习的方法最近出现,但它们的训练主要关注线长优化,因此常常无法达到专家质量的布局。我们确定奖励设计是与专家性能差距的主要原因,并且我们没有形式化复杂的过程,而是通过直接从专家布局中学习来推导奖励模型,从而规避了这一问题。我们的方法从最终的专家布局开始,逐步推断专家轨迹。利用这些轨迹作为演示或偏好,我们训练一个模型来捕捉专家结果中的潜在隐式奖励。实验表明,我们的框架可以高效地从单个设计中学习,并很好地泛化到未见案例。

英文摘要

Chip placement is a critical step in physical design. While reinforcement learning (RL)-based methods have recently emerged, their training primarily focuses on wirelength optimization, and therefore often fail to achieve expert-quality layouts. We identify the reward design as the primary cause for the performance gap with experts, and instead of formalizing intricate processes, we circumvent this by directly learning from expert layouts to derive a reward model. Our approach starts from the final expert layouts to infer step-by-step expert trajectories. Using these trajectories as demonstrations or preferences, we train a model that captures the latent implicit rewards in expert results. Experiments show that our framework can efficiently learn from even a single design and generalize well to unseen cases.

2604.23658 2026-06-02 cs.AR cs.AI cs.LG 版本更新

FlowPlace: Flow Matching for Chip Placement

FlowPlace: 用于芯片布局的流匹配

Peng Xie, Ke Xue, Yunqi Shi, Ruo-Tong Chen, Chengrui Gao, Siyuan Xu, Chenjian Ding, Mingxuan Yuan, Chao Qian

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University, China(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University, China(南京大学人工智能学院) Huawei Noah’s Ark Lab, China(华为诺亚实验室)

AI总结 提出FlowPlace,通过掩码引导的合成数据生成、基于流的灵活先验注入高效训练和硬约束采样实现无重叠布局,在OpenROAD和ICCAD 2015基准上取得更优PPA指标、10-50倍采样效率提升和零重叠。

Comments DAC 2026

详情
AI中文摘要

芯片布局在物理设计中扮演重要角色。虽然扩散模型等生成模型提供了有前景的基于学习的解决方案,但当前方法存在以下局限性:使用随机合成数据进行预训练,需要较长的采样时间,并且由于在采样过程中依赖基于梯度的求解器,常常导致重叠。为了克服这些问题,我们提出了FlowPlace,其特点包括掩码引导的合成数据生成、基于流的灵活先验注入高效训练以及用于无重叠布局的硬约束采样。在OpenROAD和ICCAD 2015基准上的实验表明,FlowPlace实现了更好的PPA指标、10-50倍的采样效率提升以及零重叠。

英文摘要

Chip placement plays an important role in physical design. While generative models like diffusion models offer promising learning-based solutions, current methods have the following limitations: they use random synthetic data for pre-training, require long sampling times, and often result in overlaps due to their dependence on gradient-based solvers during the sampling process. To overcome these issues, we propose FlowPlace, which features mask-guided synthetic data generation, flow-based efficient training with flexible prior injection, and hard constraint sampling for overlap-free layouts. Experiments on OpenROAD and ICCAD 2015 benchmarks show FlowPlace achieves better PPA metrics, 10-50$\times$ faster sampling efficiency, and zero overlaps.

2604.22896 2026-06-02 cs.RO cs.LG 版本更新

Magnetic Indoor Localization through CNN Regression and Rotation Invariance

基于CNN回归和旋转不变性的磁室内定位

Helge Rosé, Konstantin Klipp, Tom Koubek, Bernd Schäufele, Ilja Radusch

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出使用旋转不变特征(磁场强度和重力轴投影)训练轻量级CNN模型,实现无需方向校准的室内定位,在MagPie数据集上达到或超越现有最优精度。

Comments Published and presented at the 2026 4th International Conference on Mechatronics, Control and Robotics (ICMCR)

详情
AI中文摘要

室内定位是GNSS拒止环境中广泛应用的关键技术,包括室内导航和物联网系统。结合卷积神经网络(CNN)和基于磁场特征的方法,提供了一种低成本、无需基础设施的精确定位解决方案。尽管磁指纹是室内定位的一种有前景的方法,但基于原始3D磁力计数据训练的模型对设备方向高度敏感。我们通过使用从3D磁场导出的两个旋转不变特征来解决这个问题:磁场强度(Mn)和重力轴投影(Mg)。我们在磁序列上训练轻量级7层扩张CNN(MagNetS/XL),直接回归(x, y)位置。使用MagPie数据集(三栋建筑,手持轨迹),我们系统评估了测试和/或训练数据的固定和随机旋转。原始3D输入(Mx, My, Mz)在固定90°旋转下表现出各向同性误差增加,并随着随机旋转增大而进一步恶化。相比之下,2D输入(Mn, Mg)保持旋转不变精度,并且一旦旋转超过三个参考建筑的特定阈值(Loomis大建筑0°,Talbot中建筑5°,CSL小建筑6°),其性能就超过3D输入。MagNetXL在MagPie数据集上达到或超越了现有最优精度,而MagNetS以约三分之一的参数实现了相似性能,有利于移动部署。这些结果表明,在实际使用中,从旋转不变输入获得的鲁棒性超过了输入维度降低的损失,从而无需方向校准或额外基础设施即可进行地图构建和定位。

英文摘要

Indoor positioning is an essential technology for a wide range of applications in GNSS-denied environments, including indoor navigation and IoT systems. Combining convolutional neural networks (CNNs) and magnetic field-based features offers a low-cost, infrastructure-free solution for precise positioning. While magnetic fingerprints are a promising approach for indoor positioning, models trained on raw 3D magnetometer data are highly sensitive to device orientation. We address this by using two rotation invariant features derived from the 3D magnetic field: the norm (Mn) and the projection onto the gravity axis (Mg). We train a lightweight 7-layer dilated CNN (MagNetS/XL) on magnetic sequences to directly regress (x, y) positions. Using the MagPie dataset (three buildings, handheld trajectories), we systematically evaluate fixed and random rotations of test and/or train data. Raw 3D inputs (Mx, My , Mz) exhibit isotropic error increases under fixed 90° rotations and further degrade with growing random rotations. In contrast, 2D (Mn, Mg) inputs maintain rotation invariant accuracy and surpass the 3D inputs once rotation exceeds building-specific thresholds for three reference buildings: 0° for Loomis (large), 5° for Talbot (medium), and 6° for CSL (small). MagNetXL achieves or exceeds state-of-the-art accuracy on the MagPie dataset, and MagNetS delivers similar performance with roughly one third of the parameters, favoring mobile deployment. These results show that the robustness gained from rotation invariant inputs outweighs the loss of input dimensionality in realistic usage, allowing mapping and localization without orientation alignment or added infrastructure.

2602.02689 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Eidolon: A Post-Quantum Signature Scheme Based on k-Colorability in the Age of Graph Neural Networks

Eidolon: 图神经网络时代基于k-可着色性的后量子签名方案

Asmaa Cherkaoui, Ramon Flores, Delaram Kahrobaei, Richard Wilson

发表机构 * Laboratory of Mathematical Analysis, Algebra and Applications (LAM2A), Faculty of Sciences Ain Chock (FSAC), University Hassan II, Casablanca, Morocco(哈桑二世大学阿因-奇克学院数学分析与代数实验室) Department of Geometry and Topology, Faculty of Mathematics, University of Seville, Seville, Spain(塞维利亚大学数学系几何与拓扑系) Departments of Computer Science and Mathematics, Queens College, City University of New York, USA(纽约市立大学皇后学院计算机科学与数学系;数学博士项目,理论科学倡议,研究生中心,纽约市立大学;计算机科学与工程系,纽约大学塔朗分校;计算机科学系,英国约克大学) PhD Program in Mathematics, and Initiative for the Theoretical Sciences, Graduate Center, City University of New York, USA(英国约克大学计算机科学系) Department of Computer Science and Engineering, Tandon School of Engineering, New York University, USA Department of Computer Science, University of York, United Kingdom Department of Computer Science, University of York, United Kingdom

AI总结 提出一种基于NP完全问题k-可着色性的后量子签名方案Eidolon,通过推广Goldreich-Micali-Wigderson零知识协议、应用Fiat-Shamir变换和Merkle树压缩,并利用植入着色法生成困难实例,实验表明对经典求解器和图神经网络攻击具有抵抗性。

Comments 20 pages, 4 figures

详情
Journal ref
Proceedings of WAIFI 2026, Lecture Notes in Computer Science (LNCS), Vol. 16611, Springer, 2026
AI中文摘要

我们提出Eidolon,一种基于NP完全问题k-可着色性的后量子签名方案。我们的构造将Goldreich-Micali-Wigderson零知识协议推广到任意k >= 3,应用Fiat-Shamir变换,并使用Merkle树承诺将签名从O(tn)压缩到O(t log n)。我们通过植入着色法生成困难实例,同时旨在保留随机图的统计特征。我们对此类方案进行了针对经典求解器(ILP、DSatur)和定制图神经网络(GNN)攻击者的实证安全分析。实验表明,对于n >= 60,两种方法均无法恢复与植入解匹配的有效着色,表明精心设计的k-着色实例能够抵抗所考虑的传统和基于学习的密码分析方法。这些实验表明,构造的实例能够抵抗我们评估中考虑的攻击。

英文摘要

We propose Eidolon, a post-quantum signature scheme grounded on the NP-complete k-colorability problem. Our construction generalizes the Goldreich-Micali-Wigderson zero-knowledge protocol to arbitrary k >= 3, applies the Fiat-Shamir transform, and uses Merkle-tree commitments to compress signatures from O(tn) to O(t log n). We generate hard instances by planting a coloring while aiming to preserve the statistical profile of random graphs. We present an empirical security analysis of such a scheme against both classical solvers (ILP, DSatur) and a custom graph neural network (GNN) attacker. Experiments show that for n >= 60, neither approach is able to recover a valid coloring matching the planted solution, suggesting that well-engineered k-coloring instances can resist the considered classical and learning-based cryptanalytic approaches. These experiments indicate that the constructed instances resist the attacks considered in our evaluation.

2604.20308 2026-06-02 cs.LG 版本更新

Sheaf Neural Networks on SPD Manifolds: Second-Order Geometric Representation Learning

SPD流形上的层神经网络:二阶几何表示学习

Yuhan Peng, Junwen Dong, Yuzhi Zeng, Hao Li, Ce Ju, Huitao Feng, Diaaeldin Taha, Anna Wienhard, Kelin Xia

发表机构 * GitHub

AI总结 针对图神经网络在欧氏空间中的线性结构限制,提出首个在对称正定矩阵流形上运行的层神经网络,利用李群结构定义层算子,实现二阶几何表示学习,在MoleculeNet基准上取得6/7最优结果。

详情
AI中文摘要

图神经网络面临两个源于欧氏向量空间线性结构的基本挑战:(1) 当前架构通过向量(方向、梯度)表示几何,但许多任务需要矩阵值表示来捕捉方向之间的关系——例如分子中原子取向的协变。这些二阶表示自然地由对称正定矩阵流形上的点捕获;(2) 标准消息传递在边上应用共享变换。层神经网络通过边特定变换解决了这一问题,但现有公式仍局限于向量空间,因此无法传播矩阵值特征。我们通过开发首个在SPD流形上原生运行的层神经网络来应对这两个挑战。我们的关键洞察是SPD流形具有李群结构,使得无需投影到欧氏空间即可定义良置的层算子。理论上,我们证明SPD值层比欧氏层具有更强的表达能力:它们能容纳向量值层无法表示的相容配置(全局截面),直接转化为更丰富的学习表示。实验上,我们的层卷积有效地将秩1方向输入变换为编码局部几何结构的满秩矩阵。我们的双流架构在MoleculeNet基准的6/7个任务上达到最优,且层框架提供了持续的深度鲁棒性。

英文摘要

Graph neural networks face two fundamental challenges rooted in the linear structure of Euclidean vector spaces: (1) Current architectures represent geometry through vectors (directions, gradients), yet many tasks require matrix-valued representations that capture relationships between directions-such as how atomic orientations covary in a molecule. These second-order representations are naturally captured by points on the symmetric positive definite matrices (SPD) manifold; (2) Standard message passing applies shared transformations across edges. Sheaf neural networks address this via edge-specific transformations, but existing formulations remain confined to vector spaces and therefore cannot propagate matrix-valued features. We address both challenges by developing the first sheaf neural network operates natively on the SPD manifold. Our key insight is that the SPD manifold admits a Lie group structure, enabling well-posed analogs of sheaf operators without projecting to Euclidean space. Theoretically, we prove that SPD-valued sheaves are strictly more expressive than Euclidean sheaves: they admit consistent configurations (global sections) that vector-valued sheaves cannot represent, directly translating to richer learned representations. Empirically, our sheaf convolution transforms effectively rank-1 directional inputs into full-rank matrices encoding local geometric structure. Our dual-stream architecture achieves SOTA on 6/7 MoleculeNet benchmarks, with the sheaf framework providing consistent depth robustness.

2604.17838 2026-06-02 cs.LG stat.CO stat.ML 版本更新

Efficient Diffusion Models under Nonconvex Equality and Inequality constraints via Landing

非凸等式和不等式约束下的高效扩散模型 via Landing

Kijung Jeon, Michael Muehlebach, Molei Tao

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一个统一框架,通过计算高效的landing机制替代投影,结合欠阻尼动力学加速混合,在非凸可行集上实现等式和不等式约束下的扩散模型,显著降低计算成本。

Comments 58 pages

详情
AI中文摘要

在约束集合内的生成建模对于涉及物理、几何或安全要求(例如分子生成、机器人学)的科学和工程应用至关重要。我们提出了一个通用框架,用于在一般非凸可行集 $Σ$ 上的约束扩散模型,该模型在整个扩散过程中同时强制执行等式和不等式约束。我们的框架包含了过阻尼和欠阻尼动力学用于前向和后向采样。一个关键的算法创新是计算高效的landing机制,它替代了昂贵且通常定义不清的到 $Σ$ 的投影,确保可行性而无需迭代牛顿求解或投影失败。通过利用欠阻尼动力学,我们加速了向先验分布的混合,有效缓解了通常与约束扩散相关的高模拟成本。实验上,该方法在训练和推理过程中减少了函数评估和内存使用,同时保持了样本质量。在具有等式和混合约束的基准测试中,我们的方法在显著降低计算成本的同时实现了与最先进基线相当的样本质量,为非凸可行集上的扩散提供了实用且可扩展的解决方案。

英文摘要

Generative modeling within constrained sets is essential for scientific and engineering applications involving physical, geometric, or safety requirements (e.g., molecular generation, robotics). We present a unified framework for constrained diffusion models on generic nonconvex feasible sets $Σ$ that simultaneously enforces equality and inequality constraints throughout the diffusion process. Our framework incorporates both overdamped and underdamped dynamics for forward and backward sampling. A key algorithmic innovation is a computationally efficient landing mechanism that replaces costly and often ill-defined projections onto $Σ$, ensuring feasibility without iterative Newton solves or projection failures. By leveraging underdamped dynamics, we accelerate mixing toward the prior distribution, effectively alleviating the high simulation costs typically associated with constrained diffusion. Empirically, this approach reduces function evaluations and memory usage during both training and inference while preserving sample quality. On benchmarks featuring equality and mixed constraints, our method achieves comparable sample quality to state-of-the-art baselines while significantly reducing computational cost, providing a practical and scalable solution for diffusion on nonconvex feasible sets.

2601.02997 2026-06-02 cs.LG cs.CV 版本更新

From Memorization to Creativity: LLM as a Designer of Novel Neural Architectures

从记忆到创造:LLM作为新型神经架构的设计者

Waleed Khalid, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,乌尔姆大学,德国)

AI总结 本文提出NNGPT框架,通过闭环架构合成流水线,利用代码型LLM的监督微调循环,结合MinHash-Jaccard新颖性过滤和低保真性能信号,迭代提升生成架构的有效性、性能和多样性,实现从记忆到创造的转变。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3252-3261, 2026
AI中文摘要

大型语言模型(LLM)在程序合成方面表现出色,但其在神经架构设计中的能力——平衡语法可靠性、性能和结构新颖性——仍未得到充分探索。我们提出了NNGPT框架内的闭环架构合成流水线,其中代码型LLM经过22次监督微调循环的演化。在每个循环中,LLM合成PyTorch卷积网络,通过低保真性能信号验证,并通过MinHash-Jaccard标准过滤以防止结构冗余,然后纳入LEMUR数据集。具有新颖架构的高性能候选被转换为提示-代码对,用于参数高效的LoRA微调。这种反馈循环驱动了可测量的分布偏移,逐步内化经验架构先验,使得有效且高性能的输出从稀缺变为主导。在CIFAR-10上,有效生成率稳定在50.6%(峰值74.5%),平均第一轮准确率从28.1%上升到51.0%,超过40%准确率的候选从2.0%增长到96.8%。跨数据集迁移到CIFAR-100和SVHN证实了改进的有效性、偏移的准确率分布和持续的新颖性在不同难度和视觉领域的基准测试中泛化。在22个循环中,有455个原始语料库中不存在的新颖架构被新颖性过滤器接受。通过将合成基于执行反馈和新颖性过滤,我们证明了迭代自监督微调将LLM重塑为任务特化的架构先验——提高了生成可靠性、代理性能和结构多样性——为手工设计的搜索空间提供了一种可复现、无需标注的替代方案。

英文摘要

Large language models (LLMs) excel in program synthesis, yet their capacity for neural architecture design -- balancing syntactic reliability, performance, and structural novelty -- remains underexplored. We present a closed-loop architecture synthesis pipeline within the NNGPT framework, in which a code-oriented LLM evolves over 22 supervised fine-tuning cycles. At each cycle, the LLM synthesizes PyTorch convolutional networks, validated via low-fidelity performance signals and filtered via a MinHash--Jaccard criterion to prevent structural redundancy before being incorporated into the LEMUR dataset. High-performing candidates with novel architectures are converted into prompt--code pairs for parameter-efficient LoRA fine-tuning. This feedback loop drives a measurable distributional shift, progressively internalizing empirical architectural priors such that valid and high-performing outputs evolve from scarce to dominant across cycles. On CIFAR-10, the valid generation rate stabilizes at 50.6% (peaking at 74.5%), mean first-epoch accuracy rises from 28.1% to 51.0%, and candidates exceeding 40% accuracy grow from 2.0% to 96.8%. Cross-dataset transfer to CIFAR-100 and SVHN confirms that improved validity, shifted accuracy distributions, and sustained novelty generalize across benchmarks of varying difficulty and visual domain. Across 22 cycles, 455 unique architectures absent from the original corpus are admitted under the novelty filter. By grounding synthesis in execution feedback and novelty filtering, we demonstrate that iterative self-supervised fine-tuning reshapes an LLM into a task-specialized architectural prior -- improving generation reliability, proxy performance, and structural diversity -- offering a reproducible, annotation-free alternative to hand-crafted search spaces.

2604.14698 2026-06-02 cs.LG 版本更新

Mean Flow Policy Optimization

平均流策略优化

Xiaoyi Dong, Xi Sheryl Zhang, Jian Cheng

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出使用平均流模型作为策略表示,通过最大熵强化学习框架进行软策略迭代,以提升在线强化学习中训练和推理效率,实验表明性能与扩散模型基线相当或更优且时间显著减少。

Comments ICML 2026

详情
AI中文摘要

扩散模型最近作为在线强化学习(RL)的表达性策略表示出现。然而,其迭代生成过程引入了大量的训练和推理开销。为了克服这一限制,我们提出使用平均流模型(MeanFlow模型)来表示策略,这是一类少步流生成模型,旨在提高基于扩散的RL方法的训练和推理效率。为了促进探索,我们通过软策略迭代在最大熵强化学习框架下优化平均流策略,并解决了平均流策略特有的两个关键挑战:动作似然评估和软策略改进。在MuJoCo、DeepMind Control Suite和HumanoidBench基准上的实验表明,我们的方法——平均流策略优化(MFPO)——实现了与当前基于扩散的基线相当或更优的性能,同时显著减少了训练和推理时间。我们的代码可在https://github.com/dongxiaoyi-xyz/MFPO获取。

英文摘要

Diffusion models have recently emerged as expressive policy representations for online reinforcement learning (RL). However, their iterative generative processes introduce substantial training and inference overhead. To overcome this limitation, we propose to represent policies using MeanFlow models, a class of few-step flow-based generative models, to improve training and inference efficiency over diffusion-based RL approaches. To promote exploration, we optimize MeanFlow policies under the maximum entropy RL framework via soft policy iteration, and address two key challenges specific to MeanFlow policies: action likelihood evaluation and soft policy improvement. Experiments on MuJoCo, DeepMind Control Suite and HumanoidBench benchmarks demonstrate that our method, Mean Flow Policy Optimization (MFPO), achieves performance comparable to or exceeding current diffusion-based baselines while considerably reducing training and inference time. Our code is available at https://github.com/dongxiaoyi-xyz/MFPO.

2201.10838 2026-06-02 cs.CR cs.LG 版本更新

Privacy-Preserving Logistic Regression Training with A Faster Gradient Variant

隐私保护逻辑回归训练中的一种更快梯度变体

John Chiang

AI总结 提出一种名为二次梯度的梯度变体,用于加速隐私保护逻辑回归训练,并在同态加密场景下仅需四次迭代即可达到可比性能。

详情
AI中文摘要

近年来,对加密数据训练逻辑回归已成为解决安全问题的重要方法。本文引入了一种高效的梯度变体,称为 extit{二次梯度},该变体专为隐私保护逻辑回归设计,同时在明文优化中同样有效。通过引入二次梯度,我们改进了Nesterov加速梯度(NAG)、自适应梯度(AdaGrad)和Adam算法。我们在多个数据集上评估了这些改进算法,实验结果表明,其收敛速度达到最先进水平,显著优于传统一阶梯度方法。此外,我们将改进的NAG方法应用于同态逻辑回归训练,仅需四次迭代即可实现可比性能。所提出的二次梯度方法提供了一个统一框架,融合了一阶梯度方法和二阶牛顿型方法的优势,表明其可广泛应用于各种数值优化任务。

英文摘要

Training logistic regression over encrypted data has emerged as a prominent approach to addressing security concerns in recent years. In this paper, we introduce an efficient gradient variant, termed the \textit{quadratic gradient}, which is specifically designed for privacy-preserving logistic regression while remaining equally effective in plaintext optimization. By incorporating this quadratic gradient, we enhance Nesterov's Accelerated Gradient (NAG), Adaptive Gradient (AdaGrad), and Adam algorithms. We evaluate these enhanced algorithms across various datasets, with experimental results demonstrating state-of-the-art convergence rates that significantly outperform traditional first-order gradient methods. Furthermore, we apply the enhanced NAG method to implement homomorphic logistic regression training, achieving comparable performance within only four iterations. The proposed quadratic-gradient approach offers a unified framework that synergizes the advantages of first-order gradient methods and second-order Newton-type methods, suggesting broad applicability to diverse numerical optimization tasks.

2604.10688 2026-06-02 cs.LG cs.AI cs.CL 版本更新

SCOPE: Signal-Calibrated On-Policy Distillation Enhancement with Dual-Path Adaptive Weighting

SCOPE: 信号校准的在线策略蒸馏增强与双路径自适应加权

Binbin Zheng, Xing Ma, Yiheng Liang, Jingqing Ruan, Xiaoliang Fu, Kepeng Lin, Benchang Zhu, Ke Zeng, Xunliang Cai

发表机构 * University of Science and Technology of China(中国科学技术大学) Meituan LongCat Interaction Team(美团 LongCat 交互团队) Nanjing University(南京大学) Fudan University(复旦大学) Huazhong University of Science and Technology(华中科技大学)

AI总结 针对在线策略强化学习中奖励稀疏导致的信用分配难题,提出SCOPE框架,通过双路径自适应加权机制分别处理正确与错误轨迹,实现信号校准的蒸馏增强,在六个推理基准上平均提升11.42%的Avg@32和7.30%的Pass@32。

详情
AI中文摘要

在线策略强化学习已成为大型语言模型推理对齐的主导范式,但其稀疏的结果级奖励使得令牌级信用分配异常困难。在线策略蒸馏(OPD)通过引入来自教师模型的密集令牌级KL监督缓解了这一问题,但通常对所有rollout均匀应用这种监督,忽略了信号质量的根本差异。我们提出信号校准的在线策略蒸馏增强(SCOPE),一种双路径自适应训练框架,根据正确性将在线策略rollout路由到两个互补的监督路径。对于错误轨迹,SCOPE执行教师困惑度加权的KL蒸馏,优先考虑教师展现出真正纠正能力的实例,同时降低不可靠指导的权重。对于正确轨迹,它应用学生困惑度加权的MLE,将强化集中在能力边界上的低置信度样本,而不是过度强化已掌握的样本。两条路径都采用组级归一化来自适应校准权重分布,考虑不同提示的内在难度差异。在六个推理基准上的大量实验表明,SCOPE在Avg@32和Pass@32上分别比竞争基线平均相对提升11.42%和7.30%,证明了其一致的有效性。

英文摘要

On-policy reinforcement learning has become the dominant paradigm for reasoning alignment in large language models, yet its sparse, outcome-level rewards make token-level credit assignment notoriously difficult. On-Policy Distillation (OPD) alleviates this by introducing dense, token-level KL supervision from a teacher model, but typically applies this supervision uniformly across all rollouts, ignoring fundamental differences in signal quality. We propose Signal-Calibrated On-Policy Distillation Enhancement (SCOPE), a dual-path adaptive training framework that routes on-policy rollouts by correctness into two complementary supervision paths. For incorrect trajectories, SCOPE performs teacher-perplexity-weighted KL distillation to prioritize instances where the teacher demonstrates genuine corrective capability, while down-weighting unreliable guidance. For correct trajectories, it applies student-perplexity-weighted MLE to concentrate reinforcement on low-confidence samples at the capability boundary rather than over-reinforcing already mastered ones. Both paths employ a group-level normalization to adaptively calibrate weight distributions, accounting for the intrinsic difficulty variance across prompts. Extensive experiments on six reasoning benchmarks show that SCOPE achieves an average relative improvement of 11.42% in Avg@32 and 7.30% in Pass@32 over competitive baselines, demonstrating its consistent effectiveness.

2604.09487 2026-06-02 cs.RO cs.LG 版本更新

Sim-to-Real Transfer for Muscle-Actuated Robots via Generalized Actuator Networks

基于广义执行器网络的肌肉驱动机器人仿真到现实迁移

Jan Schneider, Mridul Mahajan, Le Chen, Simon Guist, Bernhard Schölkopf, Ingmar Posner, Dieter Büchler

AI总结 提出广义执行器网络(GenAN),通过从关节位置轨迹学习执行器模型,实现肌肉驱动机器人从仿真到现实的策略迁移,首次成功在四自由度肌肉驱动机器人臂上完成动态任务。

详情
AI中文摘要

肌腱驱动配合软肌肉执行器使机器人更快、更安全,同时可能加速技能获取。然而,由于固有的非线性、摩擦和迟滞,这些系统在实际中很少使用,这给建模和控制带来了复杂性。到目前为止,这些挑战阻碍了策略从仿真到真实系统的迁移。为弥合这一差距,我们提出了一种仿真到现实的流程,该流程学习这种复杂执行器的神经网络模型,并利用成熟的刚体仿真来处理手臂动力学和与环境的交互。我们的方法称为广义执行器网络(GenAN),通过直接从关节位置轨迹学习,而不是需要扭矩传感器,从而能够在广泛的机器人上进行执行器模型识别。在PAMY2(一种由气动人工肌肉驱动的肌腱驱动机器人)上使用GenAN,我们成功部署了完全在仿真中训练的、动态但精确的到达目标、杯中球和乒乓球策略。据我们所知,这一结果构成了四自由度肌肉驱动机器人臂首次成功的仿真到现实迁移。

英文摘要

Tendon drives paired with soft muscle actuation enable faster and safer robots while potentially accelerating skill acquisition. Still, these systems are rarely used in practice due to inherent nonlinearities, friction, and hysteresis, which complicate modeling and control. So far, these challenges have hindered policy transfer from simulation to real systems. To bridge this gap, we propose a sim-to-real pipeline that learns a neural network model of this complex actuation and leverages established rigid body simulation for the arm dynamics and interactions with the environment. Our method, called Generalized Actuator Network (GenAN), enables actuation model identification across a wide range of robots by learning directly from joint position trajectories rather than requiring torque sensors. Using GenAN on PAMY2, a tendon-driven robot powered by pneumatic artificial muscles, we successfully deploy dynamic but precise goal-reaching, ball-in-a-cup, and table tennis policies, trained entirely in simulation. To the best of our knowledge, this result constitutes the first successful sim-to-real transfer for a four-degrees-of-freedom muscle-actuated robot arm.

2604.09041 2026-06-02 cs.LG cs.AI physics.ao-ph stat.ML 版本更新

U-Cast: A Surprisingly Simple and Efficient Frontier Probabilistic AI Weather Forecaster

U-Cast:一种惊人简单且高效的边界概率AI天气预报器

Salva Rühling Cachay, Duncan Watson-Parris, Rose Yu

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出基于标准U-Net骨架的概率天气预报模型U-Cast,通过确定性预训练和短时概率微调,以不到1/10的计算成本匹配或超越GenCast和IFS ENS的预报技能。

Comments ICML 2026. Our code is available at: https://github.com/Rose-STL-Lab/u-cast

详情
AI中文摘要

基于AI的天气预报现在可以与传统的基于物理的集合预报相媲美,但最先进的模型依赖于专门的架构和巨大的计算预算,造成了很高的进入门槛。我们证明,对于边界性能而言,这种复杂性是不必要的。我们引入了\ours,一种基于标准U-Net骨架的概率预报器,采用简单的训练方案:先进行基于平均绝对误差的确定性预训练,然后使用蒙特卡洛Dropout引入随机性,基于连续排序概率评分(CRPS)进行短时概率微调。结果,我们的模型在$1.5^\circ$分辨率下匹配或超过了GenCast和IFS ENS的概率技能,同时与领先的基于CRPS的模型相比,训练计算量减少了10倍以上,与基于扩散的模型相比,推理延迟减少了10倍以上。U-Cast在不到12个H200 GPU天内完成训练,并在3秒内生成15天的集合预报。这些结果表明,可扩展的通用架构与高效的训练课程相结合,可以以极低的成本匹配复杂的领域特定设计,从而向更广泛的社区开放边界概率天气模型的训练。

英文摘要

AI-based weather forecasting now rivals traditional physics-based ensembles, but state-of-the-art (SOTA) models rely on specialized architectures and massive computational budgets, creating a high barrier to entry. We demonstrate that such complexity is unnecessary for frontier performance. We introduce \ours, a probabilistic forecaster built on a standard U-Net backbone trained with a simple recipe: deterministic pre-training on Mean Absolute Error followed by short probabilistic fine-tuning on the Continuous Ranked Probability Score (CRPS) using Monte Carlo Dropout for stochasticity. As a result, our model matches or exceeds the probabilistic skill of GenCast and IFS ENS at $1.5^\circ$ resolution while reducing training compute by over $10\times$ compared to leading CRPS-based models and inference latency by over $10\times$ compared to diffusion-based models. U-Cast trains in under 12 H200 GPU-days and generates a 15-day ensemble forecast in 3 seconds. These results suggest that scalable, general-purpose architectures paired with efficient training curricula can match complex domain-specific designs at a fraction of the cost, opening the training of frontier probabilistic weather models to the broader community.

2604.08161 2026-06-02 cs.LG 版本更新

Shift- and stretch-invariant non-negative matrix factorization with an application to brain tissue delineation in emission tomography data

位移与伸缩不变的非负矩阵分解及其在脑组织发射断层成像数据分割中的应用

Anders S. Olsen, Miriam L. Navarro, Claus Svarer, Jesper L. Hinrich, Morten Mørup, Gitte M. Knudsen

发表机构 * Neurobiology Research Unit, Copenhagen University Hospital Rigshospitalet(哥本哈根大学医院神经生物学研究单位) Department of Neuroscience, Faculty of Health and Medical Sciences, University of Copenhagen(哥本哈根大学健康与医学科学学院神经科学系) Department of Applied Mathematics and Computer Science, Technical University of Denmark(丹麦技术大学应用数学与计算机科学系) Department of Clinical Medicine, Faculty of Health and Medical Sciences, University of Copenhagen(哥本哈根大学健康与医学科学学院临床医学系)

AI总结 提出频域实现的位移与伸缩不变非负矩阵分解方法,解决动态神经影像中扩散导致的时延和伸缩问题,在合成数据和脑发射断层数据上验证了其对脑组织结构的精细刻画能力。

Comments Accepted at ICASSP2026

详情
AI中文摘要

动态神经影像数据,例如血液或脑脊液中放射性示踪剂传输的发射断层测量,通常表现出类似扩散的特性。这些特性引入了距离依赖的时间延迟、尺度差异和伸缩效应,限制了传统线性建模和分解方法的有效性。为了解决这一问题,我们提出了位移与伸缩不变的非负矩阵分解框架。我们的方法估计整数和非整数的时间位移以及时间伸缩,全部在频域中实现,其中位移对应于相位修改,而伸缩通过零填充或截断处理。该模型在PyTorch中实现(https://github.com/anders-s-olsen/shiftstretchNMF)。我们在合成数据和脑发射断层成像数据上证明,该模型能够解释伸缩效应,从而提供更详细的脑组织结构表征。

英文摘要

Dynamic neuroimaging data, such as emission tomography measurements of radiotracer transport in blood or cerebrospinal fluid, often exhibit diffusion-like properties. These introduce distance-dependent temporal delays, scale-differences, and stretching effects that limit the effectiveness of conventional linear modeling and decomposition methods. To address this, we present the shift- and stretch-invariant non-negative matrix factorization framework. Our approach estimates both integer and non-integer temporal shifts as well as temporal stretching, all implemented in the frequency domain, where shifts correspond to phase modifications, and where stretching is handled via zero-padding or truncation. The model is implemented in PyTorch (https://github.com/anders-s-olsen/shiftstretchNMF). We demonstrate on synthetic data and brain emission tomography data that the model is able to account for stretching to provide more detailed characterization of brain tissue structure.

2604.08149 2026-06-02 cs.LG stat.ML 版本更新

A Direct Approach for Handling Contextual Bandits with Latent State Dynamics

处理具有隐状态动态的上下文赌博机的直接方法

Zhen Li, Gilles Stoltz

发表机构 * GitHub

AI总结 本文提出一种直接方法处理隐马尔可夫链驱动的线性上下文赌博机,通过简化模型归约到标准线性上下文赌博机,并扩展理论分析以考虑HMM参数估计,同时针对更复杂的隐状态依赖模型引入周期性参数更新算法。

详情
Journal ref
ICML 2026 - Forty-Third International Conference on Machine Learning, Jul 2026, Seoul, South Korea, France
AI中文摘要

我们考虑一个线性上下文赌博机模型,其中上下文和奖励由有限隐马尔可夫链控制。我们首先重新审视Nelson等人(2022)的简化模型,其中奖励是给定观察上下文(称为信念)的隐状态后验概率的线性函数,而不是隐状态本身的函数。这个简化模型可以通过直接归约到标准线性上下文赌博机来处理。我们扩展了这一归约的理论分析,在遗憾界中考虑了隐马尔可夫模型[HMM]参数的估计,并提供了不再依赖于奖励函数而仅通过HMM参数估计依赖于模型的高概率界。其次,也是最重要的,我们转而研究更自然且更复杂的模型,该模型在隐状态中引入直接依赖关系(除了对观察上下文的依赖,这对于上下文赌博机是自然的)。在经典的HMM遗忘条件下,为应对奖励结构引入的各种统计依赖,引入的主要算法工具是仅周期性更新奖励模型参数。

英文摘要

We consider a linear contextual bandit model where contexts and rewards are governed by a finite hidden Markov chain. We first revisit the simplified model by Nelson et al. (2022), in which rewards are linear functions of the posterior probabilities over the hidden states given the observed contexts (called beliefs), rather than functions of the hidden states themselves. This simplified model may be handled through a direct reduction to standard linear contextual bandits. We extend the theoretical analysis of this reduction to take into account the estimation of the parameters of the hidden Markov model [HMM] in the regret bound and to provide high-probability bounds not depending anymore on the reward functions and only depending on the model through the estimation of the HMM parameters. Second, and most importantly, we instead study the more natural and more complex model incorporating direct dependencies in the hidden states (on top of dependencies on the observed contexts, as is natural for contextual bandits). Under a classic HMM forgetting condition, the main algorithmic tool introduced to cope with the various statistical dependencies that the reward structure introduces is to only periodically update reward-model parameters.

2602.24047 2026-06-02 cs.NI cs.CR cs.LG 版本更新

Unsupervised Baseline Clustering and Incremental Adaptation for IoT Device Traffic Profiling

无监督基线聚类与增量自适应用于物联网设备流量分析

Sean M. Alderman, John D. Hastings

发表机构 * The Beacom College of Computer \& Cyber Sciences Dakota State University Madison, SD, USA

AI总结 提出两阶段无监督流量分析流程,使用DBSCAN进行基线聚类(NMI 0.78),BIRCH实现增量自适应(纯度0.87),揭示静态高纯度与增量灵活性之间的权衡。

Comments 6 pages, 2 figures, 4 tables

详情
Journal ref
2026 IEEE 14th International Symposium on Digital Forensics and Security (ISDFS)
AI中文摘要

物联网设备的增长和异构性带来了安全挑战,静态识别模型会随着流量演变而退化。本文提出了一种基于流特征的两阶段无监督物联网设备流量分析和增量模型更新流程,并在Deakin物联网数据集的选定长时间捕获数据上进行评估。对于基线分析,基于密度的聚类(DBSCAN)隔离了数据中相当一部分离群点,并在测试的经典方法中与真实设备标签的对齐最强(NMI 0.78),在聚类纯度上优于基于质心的聚类。对于增量自适应,我们评估了面向流的聚类方法,发现BIRCH支持高效更新(每次更新0.13秒),并为保留的新设备形成相对连贯的聚类(纯度0.87),但新流量捕获有限(份额0.72),且自适应后已知设备准确性存在可衡量的权衡(0.71)。总体而言,结果突出了高纯度静态分析与增量聚类灵活性在演变的物联网环境中的实际权衡。

英文摘要

The growth and heterogeneity of IoT devices create security challenges where static identification models can degrade as traffic evolves. This paper presents a two-stage, flow-feature-based pipeline for unsupervised IoT device traffic profiling and incremental model updating, evaluated on selected long-duration captures from the Deakin IoT dataset. For baseline profiling, density-based clustering (DBSCAN) isolates a substantial outlier portion of the data and produces the strongest alignment with ground-truth device labels among tested classical methods (NMI 0.78), outperforming centroid-based clustering on cluster purity. For incremental adaptation, we evaluate stream-oriented clustering approaches and find that BIRCH supports efficient updates (0.13 seconds per update) and forms comparatively coherent clusters for a held-out novel device (purity 0.87), but with limited capture of novel traffic (share 0.72) and a measurable trade-off in known-device accuracy after adaptation (0.71). Overall, the results highlight a practical trade-off between high-purity static profiling and the flexibility of incremental clustering for evolving IoT environments.

2602.03912 2026-06-02 cs.LG 版本更新

Echo State Networks for Time Series Forecasting: Hyperparameter Sweep and Benchmarking

回声状态网络用于时间序列预测:超参数扫描与基准测试

Alexander Häußer

发表机构 * Justus-Liebig-University Giessen(吉斯塔-利比希大学吉essen)

AI总结 本文研究回声状态网络(ESN)在M4竞赛数据集上的单变量预测性能,通过超参数扫描和基准测试,发现简单的一阶自回归ESN在月度数据上与ARIMA和TBATS相当,在季度数据上取得最低平均MASE。

详情
AI中文摘要

本文研究了回声状态网络(ESN)对M4预测竞赛数据集中月度与季度时间序列的单变量预测性能。我们评估了一个简单的一阶自回归ESN是否能成为广泛使用的预测方法的竞争性替代方案。研究采用两阶段设计:使用参数数据集分析泄漏率、谱半径、储层大小和正则化选择下的ESN模型配置,同时保留一个不相交的预测数据集用于样本外基准测试。预测精度通过平均绝对缩放误差(MASE)和对称平均绝对百分比误差(sMAPE)衡量,并与简单基准和统计模型(包括自回归积分滑动平均(ARIMA)、指数平滑状态空间(ETS)、Theta方法和TBATS)进行比较。模型配置分析揭示了频率特定的模式:月度序列倾向于中等持久性的储层,而季度序列则偏好更收缩的动态;两种频率下,高泄漏率普遍更受青睐。在最终基准测试中,ESN在月度数据上与ARIMA和TBATS表现相当,并在季度数据上取得最低平均MASE,尽管并非在所有指标上均一致最优。总体而言,结果表明,在考虑过滤后的M4子集上,简单的自回归ESN能提供有竞争力的预测精度(特别是在MASE下),且一旦ESN配置固定,训练和预测时间需求较低。

英文摘要

This paper investigates the performance of Echo State Networks (ESNs) for univariate forecasting of monthly and quarterly time series from the M4 Forecasting Competition dataset. We evaluate whether a simple first-order autoregressive ESN can serve as a competitive alternative to widely used forecasting methods. The study uses a two-stage design: a Parameter dataset is used to analyze ESN model configurations over leakage rate, spectral radius, reservoir size, and regularization selection, while a disjoint Forecast dataset is reserved for out-of-sample benchmarking. Forecast accuracy is measured using mean absolute scaled error (MASE) and symmetric mean absolute percentage error (sMAPE) and compared with simple benchmarks and statistical models including autoregressive integrated moving average (ARIMA), exponential smoothing state space (ETS), the Theta method, and TBATS. The model-configuration analysis reveals frequency-specific patterns: monthly series tend to favor moderately persistent reservoirs, whereas quarterly series favor more contractive dynamics; across both frequencies, high leakage rates are generally preferred. In the final benchmark, the ESN performs on par with ARIMA and TBATS for monthly data and achieves the lowest mean MASE for quarterly data, although it is not uniformly best across all metrics. Overall, the results indicate that a simple autoregressive ESN can provide competitive forecast accuracy on the considered filtered M4 subsets, particularly under MASE, while requiring low training and forecasting time once the ESN configuration has been fixed.

2604.05324 2026-06-02 cs.LG cs.IT math.IT 版本更新

A Theoretical Framework for Statistical Evaluability of Generative Models

生成模型统计可评估性的理论框架

Shashaank Aiyer, Yishay Mansour, Shay Moran, Han Shao

发表机构 * University of Maryland(马里兰大学) Tel Aviv University and Google Research(特拉维夫大学和谷歌研究) Technion and Google Research(技术学院和谷歌研究)

AI总结 提出一个理论框架,研究生成模型的统计可评估性,证明基于有界测试类的积分概率度量可有限样本评估,而Rényi和KL散度不可评估。

Comments 30 pages

详情
AI中文摘要

统计评估旨在使用从真实分布中采样的独立同分布测试数据来估计模型的泛化性能。在分类等监督学习设置中,错误率等性能指标定义明确,给定足够大的数据集,测试误差可靠地近似总体误差。相比之下,由于生成模型的开放性,评估更具挑战性:不清楚哪些指标是合适的,以及这些指标是否可以从有限样本中可靠评估。在这项工作中,我们引入了一个评估生成模型的理论框架,并建立了常用指标的可评估性结果。我们研究了两类指标:基于测试的指标,包括积分概率度量(IPMs)和Rényi散度。我们证明,对于任何有界测试类,IPMs可以从有限样本中评估,误差为乘性和加性近似。此外,当测试类具有有限脂肪破碎维度时,IPMs可以任意精度评估。相比之下,Rényi和KL散度不能从有限样本中评估,因为它们的值可能由罕见事件关键决定。我们还分析了困惑度作为评估方法的潜力和局限性。

英文摘要

Statistical evaluation aims to estimate the generalization performance of a model using held-out i.i.d. test data sampled from the ground-truth distribution. In supervised learning settings such as classification, performance metrics such as error rate are well-defined, and test error reliably approximates population error given sufficiently large datasets. In contrast, evaluation is more challenging for generative models due to their open-ended nature: it is unclear which metrics are appropriate and whether such metrics can be reliably evaluated from finite samples. In this work, we introduce a theoretical framework for evaluating generative models and establish evaluability results for commonly used metrics. We study two categories of metrics: test-based metrics, including integral probability metrics (IPMs), and Rényi divergences. We show that IPMs with respect to any bounded test class can be evaluated from finite samples up to multiplicative and additive approximation errors. Moreover, when the test class has finite fat-shattering dimension, IPMs can be evaluated with arbitrary precision. In contrast, Rényi and KL divergences are not evaluable from finite samples, as their values can be critically determined by rare events. We also analyze the potential and limitations of perplexity as an evaluation method.

2604.04199 2026-06-02 cs.LG 版本更新

Which Leakage Types Matter? A Quantitative Landscape Across 2,047 Benchmark Datasets

哪些泄漏类型重要?2047个基准数据集的定量景观

Simon Roth

发表机构 * Simon Roth

AI总结 通过在2047个独立同分布表格数据集上进行28项受试者内反事实实验,以及129个时间序列数据集的边界实验,定量评估了机器学习中四类数据泄漏的严重性。

Comments 39 pages, 6 figures, 13 tables. Companion to arXiv:2603.10742

详情
AI中文摘要

通过在2047个独立同分布表格数据集上进行28项受试者内反事实实验,以及129个时间序列数据集的边界实验,测量了机器学习中四类数据泄漏的严重程度。第一类(估计:在全数据上拟合缩放器)可忽略:所有九种条件产生的$|ΔAUC| \leq 0.005$。第二类(选择:偷窥、种子挑选)影响显著:测量效果与约90%的噪声利用导致报告分数膨胀一致。第三类(记忆)随模型容量增加:在10%重复时,$d_z$从0.37(朴素贝叶斯)到1.11(决策树)。第四类(边界)在随机交叉验证下不可见。在这个独立同分布表格数据体制中,教科书的重点被颠覆:归一化泄漏最不重要;而实际数据集规模下的选择泄漏最为重要。

英文摘要

Twenty-eight within-subject counterfactual experiments across 2,047 iid tabular datasets, plus a boundary experiment on 129 temporal datasets, measure the severity of four data leakage classes in machine learning. Class I (estimation: fitting scalers on full data) is negligible: all nine conditions produce $|ΔAUC| \leq 0.005$. Class II (selection: peeking, seed cherry-picking) is substantial: the measured effect is consistent with about 90% noise exploitation inflating reported scores. Class III (memorization) scales with model capacity: $d_z$ = 0.37 (Naive Bayes) to 1.11 (Decision Tree) at 10% duplication. Class IV (boundary) is invisible under random cross-validation. Within this iid tabular regime, the textbook emphasis is inverted: normalization leakage matters least; selection leakage at practical dataset sizes matters most.

2603.24324 2026-06-02 cs.LG cs.AI cs.SY eess.SY 版本更新

Large Language Model Guided Incentive Aware Reward Design for Cooperative Multi-Agent Reinforcement Learning

大语言模型引导的激励感知奖励设计用于合作多智能体强化学习

Dogan Urgun, Gokhan Gungor

发表机构 * Department of Electrical and Electronics Engineering(电气与电子工程系) Karabuk University(卡拉博克大学) Department of Mechatronics Engineering(机械工程系)

AI总结 提出利用大语言模型自动生成可执行奖励程序,结合多智能体近端策略优化训练,在Overcooked-AI环境中显著提升合作任务回报。

详情
AI中文摘要

设计有效的辅助奖励对于合作多智能体系统仍然具有挑战性,因为激励不匹配会导致次优协调,尤其是在稀疏任务奖励无法为协调行为提供足够基础的情况下。本研究引入了一个自主奖励设计框架,利用大语言模型(LLMs)从环境仪器化中合成可执行的奖励程序。该过程将候选程序限制在形式有效性范围内,并在固定计算预算下使用多智能体近端策略优化(MAPPO)从头训练策略。然后根据性能评估候选程序,并仅基于稀疏任务回报进行跨代选择。该框架在四个Overcooked-AI布局中进行了评估,这些布局具有不同程度的走廊拥堵、交接依赖和结构不对称性。所提出的奖励设计方法始终产生更高的任务回报和交付数量,在交互瓶颈主导的环境中收益最为显著。对合成塑造成分的诊断分析揭示了动作选择中更强的相互依赖性,以及在协调密集型任务中信号对齐的改善。这些结果表明,所提出的LLM引导的奖励搜索框架减轻了手动工程的需求,同时产生了与有限预算下合作学习兼容的塑造成分信号。

英文摘要

Designing effective auxiliary rewards for cooperative multi-agent systems remains challenging, as misaligned incentives can induce suboptimal coordination, particularly when sparse task rewards provide insufficient grounding for coordinated behavior. This study introduces an autonomous reward design framework that uses large language models (LLMs) to synthesize executable reward programs from environment instrumentation. The procedure constrains candidate programs within a formal validity envelope and trains policies from scratch using Multi-Agent Proximal Policy Optimization (MAPPO) under a fixed computational budget. The candidates are then evaluated on the basis of their performance, and selection across generations solely based on the sparse task returns. The framework is evaluated in four Overcooked-AI layouts characterized by varying levels of corridor congestion, handoff dependencies, and structural asymmetries. The proposed reward design approach consistently yields higher task returns and delivery counts, with the most pronounced gains observed in environments dominated by interaction bottlenecks. Diagnostic analysis of the synthesized shaping components reveals stronger interdependence in action selection and improved signal alignment in coordination-intensive tasks. These results demonstrate that the proposed LLM-guided reward search framework mitigates the need for manual engineering while producing shaping signals compatible with cooperative learning under finite budgets.

2603.10742 2026-06-02 cs.LG 版本更新

A Grammar of Machine Learning Workflows: Rejecting Data Leakage at Call Time

机器学习工作流的语法:在调用时拒绝数据泄露

Simon Roth

发表机构 * GitHub

AI总结 提出一种包含八个类型化原语和四个硬约束的有向无环图语法,通过首次在调用时强制执行的评估/评估边界,使最严重的数据泄露类型在语法范围内结构上不可表示。

Comments 40 pages, v1.3. Two maintained implementations: Python (PyPI: mlw), R (CRAN: ml), Code under github.com/epagogy/ml

详情
AI中文摘要

数据泄露已在30个科学领域的648篇已发表论文中被识别。防止数据泄露的知识已存在超过十年;问题持续存在是因为工具没有强制执行教科书所教导的内容。本文提出一种语法(由八个类型化原语通过有向无环图连接,并带有四个硬约束),使得在语法范围内最严重的泄露类型在结构上不可表示。核心机制是一个终端评估门:据我所知(截至2026年5月),这是同行评审的机器学习方法论文献中首次记录的调用时强制执行的评估/评估边界,其规范足够精确以支持独立重新实现。一项涵盖2,047个数据集的配套景观研究将约束建立在测量的效应大小上。提供了两个参考实现(Python、R)。

英文摘要

Data leakage has been identified in 648 published papers across 30 scientific fields. The knowledge to prevent it has existed for over a decade; the problem persists because the tools do not enforce what the textbooks teach. This paper presents a grammar (eight typed primitives connected by a directed acyclic graph with four hard constraints) that makes the most damaging leakage types structurally unrepresentable within the grammar's scope. The core mechanism is a terminal assessment gate: the first call-time-enforced evaluate/assess boundary documented in the peer-reviewed ML methodology literature (to my knowledge, as of May 2026), backed by a specification precise enough for independent reimplementation. A companion landscape study across 2,047 datasets grounds the constraints in measured effect sizes. Two reference implementations (Python, R) are available.

2604.03789 2026-06-02 cs.LG cs.AI 版本更新

Automated Conjecture Resolution with Formal Verification

自动猜想解决与形式化验证

Haocheng Ju, Guoxiong Gao, Jiedong Jiang, Bin Wu, Zeming Sun, Shurui Liu, Leheng Chen, Yutong Wang, Yuefeng Wang, Zichen Wang, Wanyi He, Peihao Wu, Liang Xiao, Ruochuan Liu, Bryan Dai, Bin Dong

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) Westlake Institute for Advanced Study, Westlake University(西拉雅大学先进研究所) School of Mathematics, Tianjin University(天津大学数学学院) Research Institute for Mathematical Sciences, Kyoto University(京都大学数学研究所) Department of Mathematics, Stanford University(斯坦福大学数学系) IQuest Research(IQuest研究) New Cornerstone Science Laboratory, School of Mathematical Sciences, Peking University(北京大学数学科学学院新基石科学实验室) Beijing International Center for Mathematical Research and the New Cornerstone Science Laboratory, Peking University(北京大学国际数学研究所以及新基石科学实验室) Center for Machine Learning Research, Peking University(北京大学机器学习研究中心) Center for Intelligent Computing, Great Bay Institute for Advanced Study, Great Bay University(大湾大学先进研究所智能计算中心) Zhongguancun Academy(中关村学院)

AI总结 提出一个集成非形式化推理与形式化验证的自动框架,通过两个组件Rethlas和Archon解决研究级数学问题,并成功解决交换代数中的开放问题并在Lean 4中形式化验证。

Comments Code and resources are available at: Rethlas (https://github.com/frenzymath/Rethlas), Rethlas Results (https://github.com/frenzymath/Rethlas_results), Archon (https://github.com/frenzymath/Archon), and the formalization results (https://github.com/frenzymath/Anderson-Conjecture)

详情
AI中文摘要

近年来,大型语言模型在数学推理能力上取得了显著进步,从解决初等问题扩展到研究级问题。然而,由于自然语言推理固有的歧义性,可靠地解决和验证此类问题仍然具有挑战性。本文提出一个自动框架,将自然语言推理与形式化验证相结合,以应对研究级数学问题。我们的框架由两个组件组成:非形式化推理代理Rethlas和形式化验证代理Archon。Rethlas将推理原语与我们的定理搜索引擎Matlas相结合,探索解决策略并构建候选证明。Archon配备LeanSearch,通过任务分解、迭代细化和自动证明合成,将非形式化论证转化为形式化的Lean 4项目,确保机器可检查的正确性。利用该框架,我们解决了一个交换代数中的开放问题,并在几乎无需人工参与的情况下在Lean 4中形式化验证了所得证明。额外的案例研究展示了Rethlas在非形式化数学推理和发现方面的能力,以及Archon将研究级证明形式化为Lean 4的能力。我们的实验表明,强大的定理检索工具能够发现和应用跨领域数学技巧,而形式化代理可以自主填补非形式化论证中的非平凡空白。更广泛地说,我们的工作展示了一种有前景的数学研究范式,其中配备定理检索工具的非形式化和形式化推理系统协同工作,以产生可验证的结果,减少人工努力,并支持人机协作的数学研究。

英文摘要

Recent advances in large language models have significantly improved their ability to perform mathematical reasoning, extending from elementary problem solving to increasingly capable performance on research-level problems. However, reliably solving and verifying such problems remains challenging due to the inherent ambiguity of natural language reasoning. In this paper, we propose an automated framework that integrates natural language reasoning with formal verification to tackle research-level mathematical problems. Our framework consists of two components: an informal reasoning agent, Rethlas, and a formal verification agent, Archon. Rethlas combines reasoning primitives with our theorem search engine, Matlas, to explore solution strategies and construct candidate proofs. Archon, equipped with LeanSearch, translates informal arguments into formalized Lean 4 projects through task decomposition, iterative refinement, and automated proof synthesis, ensuring machine-checkable correctness. Using this framework, we resolve an open problem in commutative algebra and formally verify the resulting proof in Lean 4 with essentially no human involvement. Additional case studies illustrate the capabilities of Rethlas in informal mathematical reasoning and discovery, as well as the ability of Archon to formalize research-level proofs in Lean 4. Our experiments demonstrate that strong theorem retrieval tools enable the discovery and application of cross-domain mathematical techniques, while the formal agent can autonomously fill nontrivial gaps in informal arguments. More broadly, our work illustrates a promising paradigm for mathematical research in which informal and formal reasoning systems, equipped with theorem retrieval tools, operate in tandem to produce verifiable results, reduce human effort, and support human-AI collaborative mathematical research.

2602.00906 2026-06-02 cs.LG cs.AI cs.CL cs.DS cs.IT math.IT 版本更新

Hallucination is a Consequence of Space-Optimality: A Rate-Distortion Theorem for Membership Testing

幻觉是空间最优性的结果:成员测试的率失真定理

Anxin Guo, Jingwei Li

发表机构 * Computer Science Department, Northwestern University(西北大学计算机科学系) Department of IEOR, Columbia University(哥伦比亚大学工业工程与运筹学系)

AI总结 通过将幻觉形式化为成员测试问题,建立率失真定理,证明在有限容量下信息论最优策略必然导致对某些非事实的高置信度,从而产生幻觉。

Comments ICML 2026

详情
AI中文摘要

大型语言模型通常对缺乏可推断模式的“随机事实”以高置信度产生幻觉。我们将此类事实的记忆形式化为一个成员测试问题,统一了布隆过滤器的离散误差指标与LLM的连续对数损失。通过分析在事实在可能主张的宇宙中稀疏的情况下,我们建立了一个率失真定理:最优记忆效率由事实与非事实得分分布之间的最小KL散度刻画。这一理论框架在理想化设置下为幻觉提供了独特的解释:即使有最优训练、完美数据和简化的“封闭世界”设置,有限容量下信息论最优策略不是放弃或遗忘,而是对某些非事实赋予高置信度,从而导致幻觉。我们在合成数据和真实数据上实证验证了这一理论,表明幻觉作为有损压缩的自然结果持续存在。同一定理恢复并锐化了布隆型滤波器的经典空间下界,确定了两侧滤波器遗留的加性常数。

英文摘要

Large language models often hallucinate with high confidence on "random facts" that lack inferable patterns. We formalize the memorization of such facts as a membership testing problem, unifying the discrete error metrics of Bloom filters with the continuous log-loss of LLMs. By analyzing this problem in the regime where facts are sparse in the universe of plausible claims, we establish a rate-distortion theorem: the optimal memory efficiency is characterized by the minimum KL divergence between score distributions on facts and non-facts. This theoretical framework provides a distinctive explanation for hallucination under an idealized setting: even with optimal training, perfect data, and a simplified ``closed world'' setting, the information-theoretically optimal strategy under limited capacity is not to abstain or forget, but to assign high confidence to some non-facts, resulting in hallucination. We validate this theory empirically on both synthetic and real-world data, showing that hallucinations persist as a natural consequence of lossy compression. The same theorem recovers and sharpens classical space lower bounds for Bloom-type filters, pinning down an additive constant left open for two-sided filters.

2508.02812 2026-06-02 cs.LG 版本更新

Evaluating and Learning Robust Bandit Policies Under Uncertain Causal Mechanisms

在不确定因果机制下评估和学习鲁棒的Bandit策略

Katherine Avery, Chinmay Pendse, David Jensen

发表机构 * College of Information and Computer Sciences, University of Massachusetts Amherst(信息与计算机科学学院,马萨诸塞大学阿姆赫斯特分校) Capital One

AI总结 提出一种因果多臂老虎机评估与学习算法,通过结构化方程模型处理条件概率分布的不确定性,并利用条件独立性检验选择变量,在因果机制不确定时提供更准确的评估和低方差策略。

Comments Published at the 5th Conference on Causal Learning and Reasoning 2026

详情
AI中文摘要

因果图模型可以编码大量结构知识,既包括领域专家的背景知识,也包括从随机实验或观测数据中发现的结构知识。然而,尽管我们可能知道因果关系的一般结构,但通常不知道确切的因果机制。在这项工作中,我们提出了一种因果多臂老虎机评估和学习算法,该算法能够在条件概率分布不确定的情况下有效推理。此外,我们展示了如何使用条件独立性检验来选择建模变量。我们发现,与传统方法相比,结构化方程模型(SEM)方法在可能因果机制范围扩大时能提供更准确的评估。此外,SEM方法学习到低方差策略,并且在模型充分指定时学习最优策略。传统方法可能收敛到局部极值或根本无法收敛。

英文摘要

Causal graphical models can encode large amounts structural knowledge, both from the background knowledge of domain experts and the structural knowledge discovered from randomized experiments or observational data. However, though we may know the general structure of causal relationships, we often do not know the exact causal mechanisms. In this work, we propose a causal multi-armed bandit evaluation and learning algorithm that can reason effectively despite uncertainty over conditional probability distributions. Further, we show how conditional independence testing can be used to choose variables for modeling. We find that the structural equation model (SEM) approach gives more accurate evaluations compared to traditional approaches, particularly as the range of possible causal mechanisms grows. Further, the SEM approach learns low-variance policies, and it learns an optimal policy, assuming the model is sufficiently well-specified. Traditional approaches can converge to local extrema or fail to converge at all.

2601.16884 2026-06-02 cs.LG cs.NA math.NA stat.ML 版本更新

Multigrade Neural Network Approximation

多级神经网络逼近

Shijun Zhang, Zuowei Shen, Yuesheng Xu

发表机构 * Department of Applied Mathematics, Hong Kong Polytechnic University(应用数学系,香港理工大学) Department of Mathematics, National University of Singapore(数学系,新加坡国立大学) Department of Mathematics and Statistics, Old Dominion University(数学与统计学系,老 Dominion 大学)

AI总结 本文提出多级深度学习(MGDL)框架,通过逐级冻结并训练子网络拟合残差,实现结构化误差修正,并证明固定宽度多级ReLU网络可均匀逼近连续函数。

详情
AI中文摘要

我们研究多级深度学习(MGDL)作为深度神经网络中结构化误差修正的原则性框架。虽然神经网络的逼近能力现在相对被充分理解,但由于高度非凸且常常病态的优化景观,训练非常深的架构仍然具有挑战性。相比之下,对于相对浅的网络,特别是某些单隐层ReLU模型,在适当设置下训练允许具有全局保证的凸重构,这激发了在扩展深度的同时提高稳定性的学习范式。MGDL基于这一见解,通过逐级训练深度网络:先前学习的级别被冻结,每个新添加的级别子网络被组合在先前学习的级别之上,并训练以拟合当前逼近留下的残差,产生结构化和可解释的分层修正过程。我们为MGDL开发了算子理论基础,并证明对于定义在超立方体上的任何连续目标函数,存在一个固定宽度的多级ReLU方案,其残差点态非增且一致收敛到零,并且对于每个非平凡级别,在$p\in [1,\infty)$上具有严格的$L^p$范数衰减。据我们所知,这项工作提供了第一个严格的构造性逼近保证,表明逐级残差修正方案可以在固定宽度多级ReLU架构中实现误差消失。

英文摘要

We study multigrade deep learning (MGDL) as a principled framework for structured error refinement in deep neural networks. While the approximation power of neural networks is now relatively well understood, training very deep architectures remains challenging due to highly nonconvex and often ill-conditioned optimization landscapes. In contrast, for relatively shallow networks, most notably certain one-hidden-layer ReLU models, training admits convex reformulations with global guarantees under appropriate settings, motivating learning paradigms that improve stability while scaling to depth. MGDL builds on this insight by training deep networks grade by grade: previously learned grades are frozen, and each newly added grade-wise subnetwork is composed on top of the previously learned grades and trained to fit the residual left by the current approximation, yielding a structured and interpretable hierarchical refinement process. We develop an operator-theoretic foundation for MGDL and prove that, for any continuous target function defined on a hypercube, there exists a fixed-width multigrade ReLU scheme whose residuals are pointwise nonincreasing in magnitude and converge uniformly to zero, with strict $L^p$-norm decay at every nontrivial grade for $p\in [1,\infty)$. To the best of our knowledge, this work provides the first rigorous constructive approximation guarantee showing that a grade-wise residual refinement scheme can achieve vanishing error in a fixed-width multigrade ReLU architecture.

2604.01802 2026-06-02 cs.LG 版本更新

Real-Time Sensing of Inaccessible Physical Fields via an Edge-Deployable Hardware-Portable Graph Neural Operator

通过边缘可部署的硬件可移植图神经算子实时感知不可及的物理场

William Howes, Jason Yoo, Kazuma Kobayashi, Subhankar Sarkar, Farid Ahmed, Souvik Chakraborty, Syed Bahauddin Alam

发表机构 * Grainger College of Engineering, Nuclear, Plasma & Radiological Engineering Department, University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校格拉inger工程学院、核物理与等离子体工程系) Yardi School of Artificial Intelligence, Indian Institute of Technology Delhi(印度理工学院德里分校人工智能学院) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系) National Center for Supercomputing Applications(国家超级计算应用中心)

AI总结 提出VIRSO,一种具有独特时空架构的神经算子,通过硬件协同设计实现边缘设备上从稀疏边界观测到内部多物理场的实时推理,在降低能耗的同时保持高精度。

Comments 36 pages, 5 figures, 16 tables

详情
AI中文摘要

从稀疏边界观测实时推断不可及的内部物理场是科学机器学习中一个基本但未解决的问题,与许多工程应用中的安全关键监测直接相关。现有的神经算子实现了高精度,但未解决在嵌入式边缘平台上的部署问题。本文引入VIRSO(虚拟不规则实时稀疏算子),这是第一个具有独特时空架构、明确针对边缘部署硬件的神经算子。VIRSO通过显式与硬件执行对齐的谱-空间分解(计算受限的图谱路径和内存带宽受限的空间聚合路径,分别在数据中心和嵌入式加速器上独立表征),学习从稀疏、几何不连续的边界输入到不规则非结构化网格上空间连续内部多物理场的非线性映射。该设计将推理能量-延迟积相对于原始图算子基线降低了29倍(在NVIDIA H200上从206 J·ms降至7.0 J·ms),并在NVIDIA Jetson Orin Nano上实现了17.0样本/秒的嵌入式推理,板级功耗为7.06 W,无需修改。一种网格密度自适应图构建策略(V-KNN)同时提高了精度并将图边数减少了34%。在三个基准测试中,重建比从47:1到156:1,VIRSO实现了低于1%的平均相对$L_2$误差,参数少于算子基线,并且相对于高保真参考求解器提供了约$10^4$倍的推理加速。据我们所知,这是首个单瓦级神经算子的演示,确立了硬件协同设计作为算子推理中缺失的要素以及实现实时部署的可行路径。

英文摘要

Real-time inference of inaccessible interior physical fields from sparse boundary observations is a fundamental but unresolved problem in scientific machine learning, with direct relevance to safety-critical monitoring across many engineering applications. Existing neural operators achieve high accuracy but leave deployment to embedded edge platforms unaddressed. Here we introduce VIRSO (Virtual Irregular Real-Time Sparse Operator), the first neural operator with a unique spatial-spectral architecture that explicitly addresses edge-deployment hardware. VIRSO learns a nonlinear mapping from sparse, geometrically disjoint boundary inputs to spatially continuous interior multiphysics fields on irregular unstructured meshes through a spectral-spatial decomposition explicitly aligned with hardware execution: a compute-bound graph spectral pathway and a memory-bandwidth-bound spatial-aggregation pathway, each independently characterized on datacenter and embedded accelerators. The design reduces the inference energy-delay product by 29$\times$ relative to the vanilla graph-operator baseline (206 J$\cdot$ms $\to$ 7.0 J$\cdot$ms on an NVIDIA H200) and enables 17.0 samples/s embedded inference on an NVIDIA Jetson Orin Nano within 7.06 W board-level power, without modification. A mesh-density-adaptive graph construction strategy (V-KNN) simultaneously improves accuracy and reduces graph edge count by 34%. Across three benchmarks with reconstruction ratios from 47:1 to 156:1, VIRSO achieves mean relative $L_2$ errors below 1% with fewer parameters than operator baselines and delivers an inference speedup of $\approx 10^4$ times over the high-fidelity reference solver. To our knowledge, this is the first demonstration of a single-digit-watt neural operator, establishing hardware co-design as a missing ingredient in operator-based inference and a tractable path to real-time deployment.

2510.03690 2026-06-02 cs.LG stat.ML 版本更新

From Moments to Models: Graphon-Mixture Learning for Mixup and Contrastive Learning

从矩到模型:用于混合和对比学习的图模型混合学习

Ali Azizpour, Reza Ramezanpour, Santiago Segarra

发表机构 * University of Michigan(密歇根大学)

AI总结 提出一个统一框架,将图数据建模为图模型(graphon)混合,利用图矩(motif密度)聚类并估计混合成分,进而提出图模型感知的混合(GMAM)和对比学习(MGCL)方法,在监督和无监督任务上取得最优或竞争性能。

详情
AI中文摘要

现实世界的图数据集通常来自混合群体,其中图由多个不同的潜在分布生成。在这项工作中,我们提出了一个统一框架,将图数据显式建模为由图模型表示的 probabilistic 图生成模型的混合。为了表征和估计这些图模型,我们利用图矩(motif密度)对从相同底层模型生成的图进行聚类。我们建立了一个新的理论保证,推导出一个更紧的界,表明从结构相似的图模型中采样的图以高概率表现出相似的 motif 密度。这一结果使得图模型混合成分的估计具有原则性。我们展示了如何将估计的图模型混合成分增强两种广泛使用的下游范式:通过混合进行图数据增强和图对比学习。通过将这些方法基于底层生成模型,我们开发了图模型感知的混合(GMAM)和模型感知的图对比学习(MGCL)。在模拟和真实数据集上的大量实验证明了强大的实证性能。在监督学习中,GMAM 优于现有的增强策略,在 7 个数据集中的 6 个上达到了新的最先进准确率。在无监督学习中,MGCL 在七个基准数据集上具有竞争力,并实现了总体最低的平均排名。

英文摘要

Real-world graph datasets often arise from mixtures of populations, where graphs are generated by multiple distinct underlying distributions. In this work, we propose a unified framework that explicitly models graph data as a mixture of probabilistic graph generative models represented by graphons. To characterize and estimate these graphons, we leverage graph moments (motif densities) to cluster graphs generated from the same underlying model. We establish a novel theoretical guarantee, deriving a tighter bound showing that graphs sampled from structurally similar graphons exhibit similar motif densities with high probability. This result enables principled estimation of graphon mixture components. We show how incorporating estimated graphon mixture components enhances two widely used downstream paradigms: graph data augmentation via mixup and graph contrastive learning. By conditioning these methods on the underlying generative models, we develop graphon-mixture-aware mixup (GMAM) and model-aware graph contrastive learning (MGCL). Extensive experiments on both simulated and real-world datasets demonstrate strong empirical performance. In supervised learning, GMAM outperforms existing augmentation strategies, achieving new state-of-the-art accuracy on 6 out of 7 datasets. In unsupervised learning, MGCL performs competitively across seven benchmark datasets and achieves the lowest average rank overall.

2603.29488 2026-06-02 cs.LG 版本更新

What Cosine Similarity of Label Representations Can and Cannot Tell us

标签表示的余弦相似度能告诉我们什么,不能告诉我们什么

Beatrix M. G. Nielsen, Andreas Grivas

发表机构 * IT University of Copenhagen(丹麦哥本哈根技术大学) School of Mathematics, University of Edinburgh(爱丁堡大学数学学院)

AI总结 本文证明对于softmax分类器,标签表示(称为unembedding)之间的余弦相似度不提供模型概率的任何信息,而对于sigmoid分类器,所有成对余弦相似度定义了可能的标签组合集。

详情
AI中文摘要

余弦相似度常用于衡量神经网络模型向量表示的相似性。然而,表示的余弦相似度并不能保证告诉我们关于模型概率的任何信息。在本文中,我们证明对于softmax分类器,无论是图像分类器还是自回归语言模型,标签表示(在论文中称为unembedding)之间的余弦相似度不提供模型分配的概率的任何信息。具体地,我们证明给定两个unembedding,可以创建另一个模型,该模型对所有输入分配相同的概率,但表示之间的余弦相似度现在要么是1要么是-1。我们还证明对于sigmoid分类器(其中每个输入可以被分配多个标签),unembedding之间的所有成对余弦相似度定义了可能的标签组合集。然而,对于softmax分类器(其中每个输入被分配从最可能到最不可能的标签排序),我们需要所有unembedding差异之间的所有成对余弦相似度才能知道模型可以预测哪些排序。我们得出结论,在没有参考产生它们的分类器的情况下解释unembedding之间的余弦相似度是具有误导性的。

英文摘要

Cosine similarity is often used to measure the similarity of vector representations of neural network models. However, the cosine similarity of representations is not guaranteed to tell us anything about model probabilities. In this paper we show that for a softmax classifier, be it an image classifier or an autoregressive language model, the cosine similarity between label representations (called unembeddings in the paper) does not give any information on the probabilities assigned by the model. Specifically, we prove that given two unembeddings, it is possible to create another model which assigns the same probabilities for all inputs, but where the cosine similarity between the representations is now either 1 or -1. We also show that for a sigmoid classifier (where each input can be assigned multiple labels), all pairwise cosine similarities between the unembeddings define the set of possible label combinations. However, for softmax classifiers (where each input is assigned a ranking of the labels from most to least likely), we need all pairwise cosine similarities between all differences of unembeddings to know which rankings the model can predict. We conclude that it is misleading to interpret the cosine similarity between unembeddings without reference to the classifier that produced them.

2603.28768 2026-06-02 cs.DC cs.LG 版本更新

CRAFT: Fine-Grained Cost-Aware Expert Replication For Efficient Mixture-of-Experts Serving

CRAFT:面向高效混合专家服务的细粒度成本感知专家复制

Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar

发表机构 * NVIDIA Corporation(英伟达公司)

AI总结 提出CRAFT框架,通过基于估计收益的细粒度逐层复制,在给定内存预算下最大化负载均衡,无需额外训练即可提升大规模MoE服务吞吐量。

Comments 22 pages, 15 figures

详情
Journal ref
Proceedings of the Ninth Conference on Machine Learning and Systems (MLSys 2026)
AI中文摘要

混合专家(MoE)最近成为高效扩展大型语言模型同时保持计算成本近乎恒定的主流架构。专家并行通过跨设备划分专家来分布参数,但这会在推理过程中引入令牌级负载不均衡。专家复制是服务框架中广泛采用的负载均衡技术,通过复制高负载专家来缓解大规模部署中的负载不均衡。在这项工作中,我们证明现有的复制方案往往过度复制,许多副本提供的改进微乎其微。副本消耗大量GPU内存,可能导致资源争用和吞吐量下降。我们提出CRAFT,一种高效的专家复制框架,通过基于估计的复制收益进行细粒度逐层复制,在给定内存预算下最大化负载均衡。CRAFT可以无缝集成到现有服务框架中,无需额外训练或模型更改。我们的评估表明,在模型规模从数千亿到数万亿参数的大规模部署中,与现有复制技术相比,CRAFT将端到端服务吞吐量平均提高1.14倍(最高1.2倍)。

英文摘要

Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.

2603.23582 2026-06-02 cs.LG cs.AI 版本更新

AI Generalisation Gap In Comorbid Sleep Disorder Staging

共病睡眠障碍分期中的AI泛化差距

Saswata Bose, Suvadeep Maiti, Shivam Kumar Sharma, Mythirayee S, Tapabrata Chakraborti, Srijitesh Rajendran, Raju S. Bapi

发表机构 * arXiv

AI总结 针对脑卒中患者睡眠分期中深度学习模型在健康与临床人群间泛化差的问题,通过Grad-CAM可视化和新数据集iSLEEPS,揭示模型关注生理无意义区域,并强调需开发疾病特异性模型。

详情
AI中文摘要

准确的睡眠分期对于诊断脑卒中患者的OSA和低通气至关重要。尽管PSG可靠,但成本高、劳动密集且需人工评分。虽然深度学习在健康受试者中实现了基于EEG的自动睡眠分期,但我们的分析显示,该方法在睡眠紊乱的临床人群中泛化能力差。利用Grad-CAM解释,我们系统地证明了这一局限性。我们引入了iSLEEPS,一个经过临床注释的缺血性脑卒中新数据集(即将公开发布),并评估了SE-ResNet加双向LSTM模型用于单通道EEG睡眠分期。正如预期,健康与疾病受试者之间的跨域性能很差。注意力可视化在临床专家反馈的支持下显示,模型在患者数据中关注生理上无信息的EEG区域。统计和计算分析进一步证实了健康与缺血性脑卒中队列之间显著的睡眠结构差异,强调了在部署前需要经过临床验证的受试者感知或疾病特异性模型。论文和代码摘要见https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

英文摘要

Accurate sleep staging is essential for diagnosing OSA and hypopnea in stroke patients. Although PSG is reliable, it is costly, labor-intensive, and manually scored. While deep learning enables automated EEG-based sleep staging in healthy subjects, our analysis shows poor generalization to clinical populations with disrupted sleep. Using Grad-CAM interpretations, we systematically demonstrate this limitation. We introduce iSLEEPS, a newly clinically annotated ischemic stroke dataset (to be publicly released), and evaluate a SE-ResNet plus bidirectional LSTM model for single-channel EEG sleep staging. As expected, cross-domain performance between healthy and diseased subjects is poor. Attention visualizations, supported by clinical expert feedback, show the model focuses on physiologically uninformative EEG regions in patient data. Statistical and computational analyses further confirm significant sleep architecture differences between healthy and ischemic stroke cohorts, highlighting the need for subject-aware or disease-specific models with clinical validation before deployment. A summary of the paper and the code is available at https://himalayansaswatabose.github.io/iSLEEPS_Explainability.github.io/

2511.16992 2026-06-02 cs.LG 版本更新

FIRM: Federated In-client Regularized Multi-objective Alignment for Large Language Models

FIRM: 面向大型语言模型的联邦客户端内正则化多目标对齐

Fatemeh Nourzad, Amirhossein Roknilamouki, Eylem Ekici, Jia Liu, Ness Shroff

发表机构 * Department of Electrical and Computer Engineering, The Ohio State University, Columbus, OH, USA(电气与计算机工程系,俄亥俄州立大学,哥伦布,OH,USA) Department of Computer Science and Engineering, The Ohio State University, Columbus, OH, USA(计算机科学与工程系,俄亥俄州立大学,哥伦布,OH,USA)

AI总结 提出FIRM算法,通过客户端内正则化缓解客户端分歧漂移并提高通信效率,实现联邦多目标对齐,并首次给出有限时间收敛保证。

详情
AI中文摘要

将大型语言模型(LLMs)与人类价值观对齐通常需要平衡多个相互冲突的目标,如有用性和无害性。训练这些模型计算密集,且集中式处理引发严重的数据隐私问题。联邦学习(FL)提供了一种有吸引力的替代方案,但现有的联邦多目标优化(FMOO)方法面临严重的通信瓶颈,因为它们依赖向服务器传输多个梯度,这对于大型模型不可扩展。我们提出了FIRM(联邦客户端内正则化多目标对齐),一种新颖的算法,同时实现了客户端分歧漂移缓解和通信效率。在FIRM中,每个客户端本地求解一个正则化多目标优化问题。通过客户端内正则化直接缓解客户端分歧漂移,我们的方法消除了先前工作中常见的多梯度传输需求。因此,客户端只需传输一组适配参数,保持高通信效率。我们证明了我们的算法收敛到帕累托驻点,并且据我们所知,首次为这种联邦多目标对齐设置提供了有限时间收敛保证。实验上,我们展示了与基线相比,FIRM导致更平滑的训练动态、减少的客户端分歧漂移和改进的奖励权衡。我们进一步提出了一种方法,将目标上的偏好纳入考虑,并报告了经验帕累托图,表明FIRM可以根据指定偏好平滑地调整目标之间的权衡。

英文摘要

Aligning Large Language Models (LLMs) with human values often involves balancing multiple, conflicting objectives such as helpfulness and harmlessness. Training these models is computationally intensive, and centralizing the process raises significant data privacy concerns. Federated Learning (FL) offers a compelling alternative, but existing Federated Multi-Objective Optimization (FMOO) methods face severe communication bottlenecks as their reliance on transmitting multiple gradients to a server is unscalable for large models. We introduce FIRM (Federated In-client Regularized Multi-objective alignment), a novel algorithm that achieves both client disagreement drift mitigation and communication efficiency. In FIRM, each client locally solves a regularized multi-objective optimization problem. By directly mitigating client disagreement drift through in-client regularization, our method eliminates the need for the multi-gradient transmissions common in prior works. Consequently, clients need only to transmit a single set of adapted parameters, maintaining high communication efficiency. We prove that our algorithm converges to Pareto-stationary points and, to our knowledge, provide the first finite-time convergence guarantees for this federated multi-objective alignment setting. Empirically, we show that FIRM leads to smoother training dynamics, reduced client disagreement drift, and improved reward trade-offs compared to baselines. We further propose a method to incorporate a preference over the objectives and report empirical Pareto plots, demonstrating that FIRM can smoothly adapt trade-offs between objectives in response to specified preferences.

2603.24511 2026-06-02 cs.LG cs.AI cs.CR 版本更新

Claudini: Autoresearch Discovers State-of-the-Art Adversarial Attack Algorithms for LLMs

Claudini: 自动研究发现针对LLM的最先进对抗攻击算法

Alexander Panfilov, Peter Romov, Igor Shilov, Yves-Alexandre de Montjoye, Jonas Geiping, Maksym Andriushchenko

发表机构 * MATS ELLIS Institute(MATS ELLIS研究所) Max Planck Institute for Intelligent Systems(智能系统马克斯·普朗克研究所) Tübingen AI Center(图宾根人工智能中心) Imperial College London(伦敦帝国理工学院)

AI总结 本文提出一种自动研究循环,利用前沿AI代理(如Claude Code和Codex)自动发现针对大语言模型的新型对抗攻击算法,在白盒越狱和提示注入评估中达到最先进水平。

详情
AI中文摘要

我们证明AI代理能够发现针对LLM的新型对抗攻击算法,在白盒越狱和提示注入评估中推进了最先进水平。我们部署前沿代理(如Claude Code和Codex)在自动研究循环中,访问包含30多种先前方法的库和具有固定计算预算的评估脚本。我们展示了该流程在越狱OpenAI的GPT-OSS-Safeguard-20B以及对对抗鲁棒模型Meta-SecAlign-70B进行提示注入方面的有效性。对于GPT-OSS-Safeguard,代理发现的最佳方法在CBRN查询上实现了高达80%的攻击成功率,而现有方法低于50%。对于SecAlign,它实现了100%的ASR,而先前最佳自动化方法仅达到82%。值得注意的是,在我们的设置中,攻击方法是在无关的替代模型上为纯随机目标令牌强制任务开发的,却直接泛化到对抗训练模型上的提示注入。最后,我们追溯了自动研究过程中开发的方法的谱系,刻画了代理的策略和失败模式。对抗性机器学习长期以来一直认为防御必须针对为其量身定制的攻击进行评估;自动研究自动化了这一原则,我们认为这应该是未来防御评估的最低标准。

英文摘要

We show that AI agents are capable of discovering novel algorithms for adversarial attacks against LLMs, advancing the state of the art on white-box jailbreaking and prompt injection evaluations. We deploy frontier agents, such as Claude Code and Codex, in an autoresearch loop with access to a library of 30+ prior methods and an evaluation script with a fixed compute budget. We show this pipeline to be effective in jailbreaking OpenAI's GPT-OSS-Safeguard-20B and in prompt injections against Meta-SecAlign-70B, an adversarially robust model. For GPT-OSS-Safeguard, the best agent-discovered method achieves up to 80\% attack success rate on CBRN queries, compared to <50\% for existing methods. For SecAlign, it achieves 100\% ASR, while the best prior automated methods only achieve 82\%. Notably, in our setting, attack methods are developed on unrelated surrogate models for a pure random-target token-forcing task, yet generalize directly to prompt injection on the adversarially trained model. Finally, we trace the lineage of methods developed during autoresearch, characterizing the agents' strategies and failure modes. Adversarial ML has long held that defenses must be evaluated against attacks tailored to them; autoresearch automates this principle, and we argue it should be the minimum bar for defense evaluation going forward.

2603.23647 2026-06-02 cs.CV cs.AI cs.LG 版本更新

λSplit: Self-Supervised Content-Aware Spectral Unmixing for Fluorescence Microscopy

λSplit: 用于荧光显微镜的自监督内容感知光谱解混

Federico Carrara, Talley Lambert, Mehdi Seifi, Florian Jug

发表机构 * Fondazione Human Technopole(人类技术极地基金会) Harvard Medical School(哈佛医学院) Università Campus Bio-Medico(生物医学大学校园)

AI总结 提出λSplit,一种基于物理信息的深度生成模型,通过分层变分自编码器学习浓度图的条件分布,结合可微分光谱混合器实现最先进的光谱解混和隐式噪声去除。

Comments 14 pages, 25 pages supplement, 16 figures total, 14 tables total

详情
AI中文摘要

在荧光显微镜中,光谱解混旨在从捕获混合荧光发射的光谱图像中恢复单个荧光团浓度。由于经典方法逐像素操作并依赖最小二乘拟合,其性能随着发射光谱重叠增加和噪声水平升高而下降,这表明能够学习并利用结构先验的数据驱动方法可能会带来改进。基于学习的光谱成像方法确实存在,但它们要么未针对显微镜数据进行优化,要么是为不适用于荧光显微镜设置的非常特定情况而开发的。为了解决这个问题,我们提出了λSplit,一种基于物理信息的深度生成模型,它使用分层变分自编码器学习浓度图上的条件分布。一个完全可微的光谱混合器强制与图像形成过程的一致性,而学习到的结构先验实现了最先进的解混和隐式噪声去除。我们在3个真实世界数据集上展示了λSplit,这些数据集被我们合成为总共66个具有挑战性的光谱解混基准。我们将结果与总共10种基线方法进行比较,包括经典方法和一系列基于学习的方法。我们的结果一致显示出竞争性能和在强噪声、光谱显著重叠或光谱维度降低情况下的改进鲁棒性,使λSplit成为荧光显微镜数据光谱解混的新最先进方法。重要的是,λSplit与标准共聚焦显微镜产生的光谱数据兼容,无需专门的硬件修改即可立即采用。

英文摘要

In fluorescence microscopy, spectral unmixing aims to recover individual fluorophore concentrations from spectral images that capture mixed fluorophore emissions. Since classical methods operate pixel-wise and rely on least-squares fitting, their performance degrades with increasingly overlapping emission spectra and higher levels of noise, suggesting that a data-driven approach that can learn and utilize a structural prior might lead to improved results. Learning-based approaches for spectral imaging do exist, but they are either not optimized for microscopy data or are developed for very specific cases that are not applicable to fluorescence microscopy settings. To address this, we propose λSplit, a physics-informed deep generative model that learns a conditional distribution over concentration maps using a hierarchical Variational Autoencoder. A fully differentiable Spectral Mixer enforces consistency with the image formation process, while the learned structural priors enable state-of-the-art unmixing and implicit noise removal. We demonstrate λSplit on 3 real-world datasets that we synthetically cast into a total of 66 challenging spectral unmixing benchmarks. We compare our results against a total of 10 baseline methods, including classical methods and a range of learning-based methods. Our results consistently show competitive performance and improved robustness in high noise regimes, when spectra overlap considerably, or when the spectral dimensionality is lowered, making λSplit a new state-of-the-art for spectral unmixing of fluorescent microscopy data. Importantly, λSplit is compatible with spectral data produced by standard confocal microscopes, enabling immediate adoption without specialized hardware modifications.

2511.08851 2026-06-02 cs.NI cs.LG eess.SP 版本更新

Measurement-Driven Early Warning of Reliability Breakdown in 5G NSA Railway Networks

基于测量的5G NSA铁路网络可靠性崩溃早期预警

Po-Heng Chou, Da-Chih Lin, Hung-Yu Wei, Walid Saad, Yu Tsao

发表机构 * National Science and Technology Council (NSTC) of Taiwan(台湾国家科学与技术委员会) U.S. National Science Foundation (NSF)(美国国家科学基金会) University of Notre Dame(诺丁汉大学)

AI总结 本文通过测量驱动的方法,使用10 Hz地铁列车测量数据,评估六种学习模型在5G NSA铁路网络中提前数秒预测可靠性崩溃事件的可行性,并建立基准以量化其性能与权衡。

Comments 6 pages, 4 figures, 2 tables, and submitted to 2026 IEEE Globecom

详情
AI中文摘要

本文提出了一种基于测量的5G非独立组网(NSA)铁路网络可靠性崩溃事件早期预警研究。利用10 Hz地铁列车测量轨迹(包含服务小区和邻小区指标),我们在多个观测窗口和预测时域下,对六种代表性学习模型(包括CNN、LSTM、XGBoost、Anomaly Transformer、PatchTST和TimesNet)进行了基准测试。本研究并非提出新的预测架构,而是开发了一个基于测量的基准,以量化5G NSA铁路环境中提前数秒可靠性预测的可行性和操作权衡。实验结果表明,学习模型可以利用商用设备上可用的轻量级无线特征,提前数秒预测与无线链路失败(RLF)相关的可靠性崩溃事件。所提出的基准为感知辅助通信控制提供了见解,并为将感知与分析集成到未来移动控制中提供了经验基础。

英文摘要

This paper presents a measurement-driven study of early warning for reliability breakdown events in 5G non-standalone (NSA) railway networks. Using 10~Hz metro-train measurement traces with serving- and neighbor-cell indicators, we benchmark six representative learning models, including CNN, LSTM, XGBoost, Anomaly Transformer, PatchTST, and TimesNet, under multiple observation windows and prediction horizons. Rather than proposing a new prediction architecture, this study develops a measurement-driven benchmark to quantify the feasibility and operating trade-offs of seconds-ahead reliability prediction in 5G NSA railway environments. Experimental results show that learning models can anticipate radio link failure (RLF)-related reliability breakdown events seconds in advance using lightweight radio features available on commercial devices. The presented benchmark provides insights for sensing-assisted communication control and offers an empirical foundation for integrating sensing and analytics into future mobility control.

2603.23398 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Graph Energy Matching: Transport-Aligned Energy-Based Modeling for Graph Generation

图能量匹配:用于图生成的传输对齐能量基建模

Michal Balcerak, Suprosanna Shit, Chinmay Prabhakar, Sebastian Kaltenbach, Michael S. Albergo, Yilun Du, Bjoern Menze

发表机构 * University of Zurich(苏黎世大学) Harvard University(哈佛大学) Kempner Institute(凯普纳研究所)

AI总结 提出Graph Energy Matching (GEM)方法,基于JKO传输映射优化视角学习置换不变势能,通过能量基切换策略实现离散图的高质量生成,在分子图基准上匹配或超越离散扩散模型。

详情
AI中文摘要

离散数据(如图)的生成建模支撑着许多科学和工业应用,包括分子发现和材料设计。在这些领域中,概率推理尤其有价值,因为它能够实现可组合生成和原则性地融入期望的约束,例如结构或功能属性。能量基模型通过捕获相对似然并在推理过程中直接施加约束来支持可组合推理,自然符合这一目标。然而,离散能量基模型通常难以实现高效高质量的采样,因为支持区域外的区域常包含虚假局部最小值,会困住采样器并导致训练不稳定,从而与离散扩散模型相比存在保真度差距。为了解决这一差距,我们引入了Graph Energy Matching (GEM),这是一种受Jordan-Kinderlehrer-Otto (JKO)传输映射优化视角启发的离散生成框架。GEM学习一个置换不变势能,同时引导从噪声到高似然图区域的离散传输,并在这些区域内细化样本。我们进一步引入了一种利用能量基切换策略的采样协议,无缝衔接快速的梯度引导传输和用于有效探索的局部混合机制。在分子图基准上,GEM在大多数报告指标上匹配或超越了强离散扩散基线。除了提高生成质量,GEM的相对似然建模还支持定向探索,促进组合生成、属性约束采样以及图之间的插值。项目页面:https://michalbalcerak.ai/graph-energy-matching/。

英文摘要

Generative modeling of discrete data, such as graphs, underpins many scientific and industrial applications, including molecular discovery and materials design. In these domains, probabilistic inference is particularly valuable, as it enables composable generation and principled incorporation of desired constraints, such as structural or functional properties. Energy-based models naturally support this goal by capturing relative likelihoods and enabling composable inference by directly enforcing constraints during inference. However, discrete energy-based models typically struggle with efficient and high-quality sampling, as off-support regions often contain spurious local minima, trapping samplers and causing training instabilities, resulting in a fidelity gap compared to discrete diffusion models. To address this gap, we introduce Graph Energy Matching (GEM), a discrete generative framework inspired by the Jordan-Kinderlehrer-Otto (JKO) transport-map optimization perspective. GEM learns a permutation-invariant potential energy that simultaneously guides discrete transport from noise toward high-likelihood graph regions and refines samples within these regions. We further introduce a sampling protocol leveraging an energy-based switching strategy, seamlessly bridging rapid, gradient-guided transport and a local mixing regime for effective exploration. On molecular graph benchmarks, GEM matches or surpasses strong discrete diffusion baselines on most reported metrics. Beyond improving generation quality, GEM's relative likelihood modeling enables targeted exploration, facilitating compositional generation, property-constrained sampling, and interpolation between graphs. Project page: https://michalbalcerak.ai/graph-energy-matching/.

2603.22235 2026-06-02 cs.HC cs.LG 版本更新

ShapDBM: Exploring Decision Boundary Maps in Shapley Space

ShapDBM:在Shapley空间中探索决策边界图

Luke Watkin, Daniel Archambault, Alex Telea

发表机构 * School of Computing, Newcastle University, UK(新castle大学计算机学院) Department of Information and Computing Science, Utrecht University, Netherlands(乌得勒支大学信息与计算科学系)

AI总结 提出通过将数据空间转换为Shapley空间并计算降维来生成决策边界图,相比直接基于数据的方法,生成的图质量指标相似或更高,决策区域更紧凑、更易探索且与模型性能更一致。

Comments 4 pages and 3 figures (excluding supplementary material)

详情
AI中文摘要

决策边界图(DBM)是可视化机器学习分类边界的有效工具。然而,DBM的质量很大程度上取决于降维(DR)技术和用于数据点的高维空间。对于复杂的机器学习数据,降维可能会产生许多混合类别,导致DBM难以使用甚至产生误导。我们提出了一种新技术,通过将数据空间转换为Shapley空间并对其计算降维来生成DBM。与直接从数据计算的DBM相比,我们的图具有相似或更高质量指标值,并且决策区域明显更紧凑、更易于探索,与测量的模型性能更一致。

英文摘要

Decision Boundary Maps (DBMs) are an effective tool for visualising machine learning classification boundaries. Yet, DBM quality strongly depends on the dimensionality reduction (DR) technique and high dimensional space used for the data points. For complex ML data, DR can create many mixed classes which yield DBMs that are hard to use or even misleading. We propose a new technique to compute DBMs by transforming data space into Shapley space and computing DR on it. Compared to DBMs computed directly from data, our maps have similar or higher quality metric values and visibly more compact, easier to explore, decision zones that better agree with measured model performance.

2510.19496 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CARES: Context-Aware Resolution Selector for VLMs

CARES: 面向视觉语言模型的上下文感知分辨率选择器

Moshe Kimhi, Nimrod Shabtay, Raja Giryes, Chaim Baskin, Eli Schwartz

发表机构 * Technion(技术ion大学) IBM Research(IBM研究院) Tel-Aviv University(特拉维夫大学) Ben-Gurion University(本· Gurion大学)

AI总结 提出CARES轻量级预处理模块,通过紧凑型VLM预测图像-查询对的最小足够分辨率,在保持任务性能的同时最多减少80%计算量。

Comments Proceedings of the 64th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) Accepted to ACL 2026 (Oral presentation). Code available at https://github.com/mkimhi/CARES

详情
AI中文摘要

大型视觉语言模型通常以原始或高分辨率处理图像以保持跨任务有效性。这导致视觉令牌通常占总令牌的97-99%,即使低分辨率图像就足够时,也会产生高计算量和延迟。我们引入了CARES——一种上下文感知分辨率选择器,这是一个轻量级预处理模块,给定图像-查询对,预测最小的足够输入分辨率。CARES使用紧凑型VLM(350M)提取特征,并预测目标预训练VLM的响应何时收敛到其正确回答的峰值能力。尽管作为一组可选分辨率上的离散分类器进行训练,但CARES在推理时插值连续分辨率以实现细粒度控制。在涵盖文档和自然图像以及多样化目标VLM的五个多模态基准测试中,CARES在保持任务性能的同时最多减少80%的计算量。

英文摘要

Large vision-language models (VLMs) commonly process images at native or high resolution to remain effective across tasks. This inflates visual tokens ofter to 97-99% of total tokens, resulting in high compute and latency, even when low-resolution images would suffice. We introduce \emph{CARES}-a \textbf{C}ontext-\textbf{A}ware \textbf{R}esolution \textbf{S}elector, a lightweight preprocessing module that, given an image-query pair, predicts the \emph{minimal} sufficient input resolution. CARES uses a compact VLM (350M) to extract features and predict when a target pretrained VLM's response converges to its peak ability to answer correctly. Though trained as a discrete classifier over a set of optional resolutions, CARES interpolates continuous resolutions at inference for fine-grained control. Across five multimodal benchmarks spanning documents and natural images, as well as diverse target VLMs, CARES preserves task performance while reducing compute by up to 80%.

2602.10014 2026-06-02 cs.LG stat.ML 版本更新

A Task-Centric Theory for Iterative Self-Improvement with Easy-to-Hard Curricula

一种以任务为中心的迭代自改进理论,采用由易到难的课程

Chenruo Liu, Yijun Dong, Yiqiu Shen, Qi Lei

发表机构 * New York University(纽约大学)

AI总结 本文通过将自改进建模为基于奖励过滤分布的最大似然微调,推导了期望奖励的有限样本保证,并证明了在推理任务中由易到难的课程比固定任务混合训练具有更好的理论保证。

详情
AI中文摘要

迭代自改进在由大型语言模型自身生成的、经过奖励验证的输出上微调自回归大型语言模型。与自改进的经验成功相比,这种生成性迭代过程在实际有限样本设置下的理论基础仍然有限。我们通过将每一轮自改进建模为在奖励过滤分布上的最大似然微调,并推导期望奖励的有限样本保证,朝这个目标取得了进展。我们的分析揭示了一个显式的反馈循环,其中更好的模型每轮接受更多数据,支持持续的自改进,同时解释了这种改进最终饱和的原因。通过采用以任务为中心的观点,考虑具有多个难度级别的推理任务,我们进一步证明了在模型初始化、任务难度和样本预算方面的可量化条件,在这些条件下,由易到难的课程比在固定任务混合上训练具有可证明的更好保证。我们的分析通过蒙特卡洛模拟以及涵盖合成图基推理任务和多个标准数学推理基准的实验得到了验证。

英文摘要

Iterative self-improvement fine-tunes an autoregressive large language model (LLM) on reward-verified outputs generated by the LLM itself. In contrast to the empirical success of self-improvement, the theoretical foundation of this generative, iterative procedure in a practical, finite-sample setting remains limited. We make progress toward this goal by modeling each round of self-improvement as maximum-likelihood fine-tuning on a reward-filtered distribution and deriving finite-sample guarantees for the expected reward. Our analysis reveals an explicit feedback loop where better models accept more data per iteration, supporting sustained self-improvement while explaining eventual saturation of such improvement. Adopting a task-centric view by considering reasoning tasks with multiple difficulty levels, we further prove quantifiable conditions on model initialization, task difficulty, and sample budget where easy-to-hard curricula provably achieve better guarantees than training on fixed mixtures of tasks. Our analyses are validated through Monte-Carlo simulations and experiments spanning a synthetic graph-based reasoning task and multiple standard mathematical reasoning benchmarks.

2603.18016 2026-06-02 cs.CL cs.AI cs.DC cs.LG 版本更新

MineDraft: A Framework for Batch Parallel Speculative Decoding

MineDraft: 批量并行推测解码框架

Zhenwei Tang, Arun Verma, Zijian Zhou, Zhaoxuan Wu, Alok Prakash, Daniela Rus, Bryan Kian Hsiang Low

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Toyota Research Institute(丰田研究院) Toyota Motor Corporation(丰田公司)

AI总结 提出MineDraft框架,通过批量并行设计将草稿生成与验证阶段重叠,显著提升推测解码的吞吐量和端到端延迟。

Comments Accepted at ICML 2026

详情
AI中文摘要

推测解码(SD)通过使用较小的草稿模型提出草稿令牌,随后由较大的目标模型验证,从而加速大型语言模型推理。然而,标准SD的性能通常受限于这些草稿和验证阶段的严格顺序执行。为解决此问题,本文提出MineDraft,一种批量并行推测解码(PSD)框架,旨在通过将草稿生成与验证重叠来有效隐藏草稿延迟。我们的理论分析表明,PSD比标准SD高效得多。MineDraft通过一种新颖的批量并行设计实现PSD,该设计维护两个请求批次,将一个批次的草稿生成与另一个批次的验证重叠。我们的实验结果显示,与标准SD相比,MineDraft在吞吐量(最高提升75%)和端到端延迟(最高降低39%)方面均有显著改进。此外,我们已将MineDraft实现为vLLM的插件,展示了其在生产级推理系统中的实用性。

英文摘要

Speculative decoding (SD) accelerates large language model inference by using a smaller draft model to propose draft tokens that are subsequently verified by a larger target model. However, the performance of standard SD is often limited by the strictly sequential execution of these drafting and verification stages. To address this, this paper proposes MineDraft, a batch parallel speculative decoding (PSD) framework designed to effectively hide drafting latency by overlapping it with verification. Our theoretical analysis shows that PSD is substantially more efficient than standard SD. MineDraft realizes the PSD through a novel batch-parallel design that maintains two batches of requests, overlapping drafting for one batch with verification for the other. Our experimental results show significant improvements of MineDraft in both throughput (up to 75%) and end-to-end latency (up to 39%) over standard SD. Furthermore, we have implemented MineDraft as a plugin for vLLM, demonstrating its practicality for production-ready inference systems.

2603.17893 2026-06-02 cs.SE cs.AI cs.LG 版本更新

scicode-lint: Detecting Methodology Bugs in Scientific Python Code with LLM-Generated Patterns

scicode-lint: 使用LLM生成的模式检测科学Python代码中的方法论错误

Sergey V. Samsonau

发表机构 * Authentic Research Partners, Princeton, NJ(真实研究伙伴,新泽西州普林斯顿)

AI总结 提出scicode-lint,通过两级架构(构建时使用前沿模型生成模式,运行时使用小型本地模型执行)自动检测科学Python代码中的方法论错误,如数据泄露、交叉验证错误和缺失随机种子。

详情
AI中文摘要

科学Python代码中的方法论错误会产生看似合理但实际不正确的结果,传统的linter和静态分析工具无法检测到这些错误。多个研究团队构建了特定于ML的linter,证明了检测的可行性。然而,这些工具存在可持续性问题:依赖于特定的pylint或Python版本、有限的打包方式,以及每个新模式都需要手动工程。随着AI生成代码增加了科学软件的数量,对自动化方法论检查(如检测数据泄露、不正确的交叉验证和缺失随机种子)的需求日益增长。我们提出了scicode-lint,其两级架构将模式设计(构建时的前沿模型)与执行(运行时的小型本地模型)分离。模式是生成的,而非手工编码;适应新的库版本花费的是token,而非工程时间。在带有手动标注真实值的Kaggle笔记本上,预处理泄露检测在100%召回率下达到了65%的精确率;在38篇应用AI/ML的已发表科学论文中,精确率为62%(由LLM评判),不同模式类别之间存在显著差异;在一个保留的论文集上,精确率为54%。在受控测试中,scicode-lint在66个模式上达到了97.7%的准确率。

英文摘要

Methodology bugs in scientific Python code produce plausible but incorrect results that traditional linters and static analysis tools cannot detect. Several research groups have built ML-specific linters, demonstrating that detection is feasible. Yet these tools share a sustainability problem: dependency on specific pylint or Python versions, limited packaging, and reliance on manual engineering for every new pattern. As AI-generated code increases the volume of scientific software, the need for automated methodology checking (such as detecting data leakage, incorrect cross-validation, and missing random seeds) grows. We present scicode-lint, whose two-tier architecture separates pattern design (frontier models at build time) from execution (small local model at runtime). Patterns are generated, not hand-coded; adapting to new library versions costs tokens, not engineering hours. On Kaggle notebooks with human-labeled ground truth, preprocessing leakage detection reaches 65% precision at 100% recall; on 38 published scientific papers applying AI/ML, precision is 62% (LLM-judged) with substantial variation across pattern categories; on a held-out paper set, precision is 54%. On controlled tests, scicode-lint achieves 97.7% accuracy across 66 patterns.

2603.13373 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Ethical Fairness in Ubiquitous Health Sensing without Known Attributes

无已知属性下的普适健康感知伦理公平性

Shaily Roy, Harshit Sharma, Daniel A. Adler, Srijan Sen, Tanzeem Choudhury, Asif Salekin

发表机构 * Ira A. Fulton Schools of Engineering, Arizona State University(亚利桑那州立大学弗里曼工程学院) Arizona State University(亚利桑那州立大学) Cornell University(康奈尔大学) University of Michigan(密歇根大学)

AI总结 针对普适健康感知中缺乏人口统计或异构属性时的公平性问题,提出基于Fisher信息引导的潜在子群学习与无害正则化框架Flare,通过优化几何实现伦理公平。

详情
AI中文摘要

在普适和移动健康系统中,计算模型从可穿戴、行为和生理传感数据推断人类状态。在这些场景中,仅高准确率是不够的;模型必须在不同人群、环境和设备间合乎伦理且公平地运行。然而,依赖训练时的人口统计或异构属性的公平方法难以实施,因为这些属性通常不可用、隐私敏感、受监管或不宜收集。传统的基于均等的公平也可能通过牺牲子群性能而违反伦理原则。为应对这一挑战,我们提出了Flare(Fisher引导的潜在子群学习与无害正则化),这是一个不依赖人口统计和异构属性的框架,将以人为本的公平性与普适和移动传感的伦理原则对齐。Flare利用优化几何,特别是Fisher信息,来正则化曲率并揭示模型行为中的潜在差异,而无需人口统计或异构属性。通过整合表示、损失和曲率信号,它识别隐藏的性能分层,并通过协作但无害的优化对其进行改进,在提升子群性能的同时保持伦理平衡。我们还引入了BHE(善行-避害-公平),一个超越统计均等的伦理公平度量套件。在移动生理、行为和临床传感数据集(包括EDA、OhioT1DM、IHS和Percept-R)上,Flare在伦理公平性上优于最先进的基线。消融、可解释性和损失景观分析表明,这些提升源于更平坦的优化几何、更简单的决策规则和无害的潜在子群适应。运行时分析支持Flare在资源受限的传感部署中的实用性。

英文摘要

In ubiquitous and mobile health systems, computational models infer human states from wearable, behavioral, and physiological sensing data. In these settings, high accuracy alone is insufficient; models must act ethically and equitably across diverse people, contexts, and devices. However, fairness methods that rely on demographic or heterogeneous attributes during training are difficult to enforce because such attributes are often unavailable, privacy-sensitive, regulated, or undesirable to collect. Conventional parity-based fairness can also violate ethical principles by trading off subgroup performance. To address this challenge, we present Flare, Fisher-guided LAtent-subgroup learning with do-no-harm REgularization, a demographic- and heterogeneous-attribute-agnostic framework that aligns human-centered fairness with ethical principles for ubiquitous and mobile sensing. Flare leverages optimization geometry, particularly Fisher Information, to regularize curvature and uncover latent disparities in model behavior without demographic or heterogeneous attributes. By integrating representation, loss, and curvature signals, it identifies hidden performance strata and refines them through collaborative but do-no-harm optimization, enhancing subgroup performance while preserving ethical balance. We also introduce BHE (Beneficence-Harm Avoidance-Equity), a metric suite that operationalizes ethical fairness beyond statistical parity. Across mobile physiological, behavioral, and clinical sensing datasets, including EDA, OhioT1DM, IHS, and Percept-R, Flare improves ethical fairness over state-of-the-art baselines. Ablation, interpretability, and loss-landscape analyses show that these gains arise from flatter optimization geometry, simpler decision rules, and do-no-harm latent-subgroup adaptation. Runtime analysis supports the practicality of Flare for resource-constrained sensing deployments.

2509.12263 2026-06-02 cs.AI cs.LG 版本更新

InPhyRe Discovers: Large Multimodal Models Struggle in Inductive Physical Reasoning

InPhyRe 发现:大型多模态模型在归纳物理推理中表现不佳

Gautam Sreekumar, Vishnu Naresh Boddeti

发表机构 * Department of Computer Science and Engineering, Michigan State University(密歇根州立大学计算机科学与工程系)

AI总结 提出 InPhyRe 基准测试,通过合成视频中的碰撞事件预测任务,评估大型多模态模型在未见物理定律下的归纳物理推理能力,发现其依赖有限参数知识、受语言偏差影响且忽略视觉输入。

Comments Accepted to TMLR. 53 pages including appendix

详情
AI中文摘要

大型多模态模型(LMMs)将训练中观察到的物理定律(如动量守恒)编码为参数化知识。这使得 LMMs 能够回答物理推理查询,例如从视觉输入中预测潜在碰撞事件的结果。然而,由于参数化知识仅包含训练中见过的物理定律,它不足以推理遵循训练中未见物理定律的推理场景。在这种新颖的物理环境中,人类可以根据提供的演示调整其物理推理。这种归纳物理推理能力对于 LMMs 在安全关键应用中替代人类代理是必不可少的。尽管其重要性,现有的视觉基准并未评估归纳物理推理,仅考虑 LMMs 中的参数化知识。为此,我们提出了 InPhyRe,这是第一个用于衡量 LMMs 归纳物理推理的视觉问答基准。InPhyRe 评估 LMMs 预测算法生成的合成视频中碰撞事件结果的能力。通过检查超过 13 个开源和专有 LMMs,InPhyRe 告诉我们:(1)LMMs 难以将其关于普遍物理定律的有限参数化知识应用于推理;(2)当推理场景背后的物理定律在训练中未见时,LMMs 的归纳物理推理能力较弱;(3)LMMs 的归纳物理推理受到语言偏差的影响,可能忽略视觉输入,质疑了 LMMs 在视觉输入方面的可信度。

英文摘要

Large multimodal models (LMMs) encode physical laws observed during training, such as momentum conservation, as parametric knowledge. It allows LMMs to answer physical reasoning queries, such as the outcome of a potential collision event from visual input. However, since parametric knowledge includes only the physical laws seen during training, it is insufficient for reasoning in inference scenarios that follow physical laws unseen during training. In such novel physical environments, humans could adapt their physical reasoning based on provided demonstrations. This inductive physical reasoning ability is indispensable for LMMs if they are to replace human agents in safety-critical applications. Despite its importance, existing visual benchmarks do not evaluate inductive physical reasoning and only consider the parametric knowledge in LMMs. To this end, we propose InPhyRe, the first visual question answering benchmark to measure inductive physical reasoning in LMMs. InPhyRe evaluates LMMs' ability to predict the outcome of collision events in algorithmically generated synthetic videos. By inspecting over 13 open-source and proprietary LMMs, InPhyRe informs us that (1) LMMs struggle to apply their limited parametric knowledge about universal physical laws to reasoning, (2) inductive physical reasoning in LMMs is weak when the physical laws underlying inference scenarios were unseen during training, and (3) inductive physical reasoning in LMMs suffers from language bias and may ignore the visual inputs, questioning the trustworthiness of LMMs regarding visual inputs.

2603.14798 2026-06-02 stat.ML cs.LG cs.NA math.NA 版本更新

Preconditioned One-Step Generative Modeling for Bayesian Inverse Problems in Function Spaces

函数空间中贝叶斯逆问题的预处理一步生成建模

Zilan Cheng, Li-Lian Wang, Zhongjian Wang

发表机构 * Division of Mathematical Sciences, School of Physical and Mathematical Sciences, Nanyang Technological University(数学科学学院,物理与数学科学学院,南洋理工大学)

AI总结 提出一种基于一步生成传输的机器学习算法,使用先验对齐的高斯随机场作为源,通过神经算子逼近后验分布,高效求解函数空间中的贝叶斯逆问题。

详情
AI中文摘要

我们提出了一种用于函数空间贝叶斯逆问题的机器学习算法。基于一步生成传输,该方法学习一个摊销神经算子,其将高斯源的推送前推近似于以每个新观测为条件的后验分布。我们证明白噪声源与函数空间极限不兼容,因此采用先验对齐的GRF作为源。通过所得一步条件后验传输的Lipschitz正则性以及在线性逆问题和基于PDE的逆问题上的数值实验,我们证明了这一选择的合理性。该方法并非从MCMC中提炼:它仅使用先验样本和模拟的部分噪声观测进行训练。一旦训练完成,它能在约$10^{-3}$秒内生成一个$64\times64$的后验样本,避免了MCMC中重复的正向模型评估和多步生成采样器中重复的网络评估,同时匹配关键的后验摘要。

英文摘要

We propose a machine-learning algorithm for Bayesian inverse problems in the function-space regime. Based on one-step generative transport, the method learns an amortized neural operator whose pushforward of a Gaussian source approximates the posterior distribution conditioned on each new observation. We show that white-noise sources are incompatible with the function-space limit, and therefore adopt a prior-aligned GRF as the source. We justify this choice through the Lipschitz regularity of the resulting one-step conditional posterior transport and numerical experiments on linear inverse and PDE-based inverse problems. The method is not distilled from MCMC: it is trained only with prior samples and simulated partial noisy observations. Once trained, it generates a $64\times64$ posterior sample in $\sim 10^{-3}$s, avoiding repeated forward-model evaluations in MCMC and repeated network evaluations in multistep generative samplers while matching key posterior summaries.

2603.14405 2026-06-02 cs.LG cs.AI 版本更新

ES-Merging: Biological MLLM Merging via Embedding Space Signals

ES-Merging: 通过嵌入空间信号进行生物多模态大模型合并

Wonbin Lee, Dongki Kim, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) DeepAuto.ai

AI总结 提出ES-Merging框架,利用嵌入空间信号估计合并系数,实现生物多模态大模型的高效合并,提升跨模态推理和单模态知识保留能力。

详情
AI中文摘要

生物多模态大语言模型(MLLMs)已成为科学发现的基础模型。然而,现有模型专注于单一模态,限制了其解决跨模态科学问题的能力。虽然模型合并是将不同模态组合成统一MLLM的有效方法,但现有方法依赖于与输入无关的参数空间启发式,无法准确捕捉模态特异性。为克服这一局限,我们提出基于嵌入信号的MLLM合并(ES-Merging),该框架从嵌入空间信号估计合并系数,将合并范式从参数信号转向嵌入信号。ES-Merging利用嵌入空间中的粗粒度和细粒度信号分别估计层间和元素级合并系数,并联合实现互补系数估计。通过大量实验,我们证明ES-Merging不仅在跨模态推理上,而且在单模态知识保留上均优于现有合并方法,表明嵌入空间信号为MLLM合并提供了有原则且有效的基础。

英文摘要

Biological multimodal large language models (MLLMs) have emerged as powerful foundation models for scientific discovery. However, existing models are specialized to a single modality, limiting their ability to solve inherently cross-modal scientific problems. While model merging is an efficient method to combine the different modalities into a unified MLLM, existing methods rely on input-agnostic parameter space heuristics that fail to faithfully capture modality specialization. To overcome this limitation, we propose the Embedding-Signal-based MLLM Merging (ES-Merging), a framework that estimates merging coefficients from embedding space signals, moving the merging paradigm from the parameter signals to the embedding signals. ES-Merging exploits coarse-grained and fine-grained signals from embedding space to estimate the layer-wise and element-wise merging coefficients, respectively, which are jointly combined for complementary coefficient estimation. Through extensive experiments, we demonstrate that ES-Merging outperforms existing merging methods not only on the cross-modal reasoning but also on the single-modal knowledge preservation, establishing that embedding space signals provide a principled and effective foundation for MLLM merging.

2505.21806 2026-06-02 cs.LG 版本更新

Towards Operational Automated Greenhouse Gas Plume Detection and Delineation

面向运营的自动化温室气体羽流检测与描绘

Brian D. Bue, Jake H. Lee, Andrew K. Thorpe, Philip G. Brodrick, Daniel Cusworth, Alana Ayasse, Vassiliki Mancoridis, Anagha Satish, Shujun Xiong, Riley Duren

发表机构 * University of California, San Diego(加州大学圣地亚哥分校) NASA Jet Propulsion Laboratory(美国国家航空航天局喷气推进实验室)

AI总结 针对高空间分辨率成像光谱仪,通过卷积神经网络和多任务学习解决数据质量、时空偏差和建模目标对齐等障碍,实现运营级温室气体羽流检测与分割。

Comments Main 19 pages 14 figures. Supplemental 19 pages 16 figures. In review

详情
Journal ref
Remote Sensing of Environment 343 (2026) 115506
AI中文摘要

尽管深度学习方法取得了最新进展,但对于精细空间分辨率成像光谱仪,全自动设施级温室气体(GHG)羽流检测系统的运营部署仍然具有挑战性。然而,随着数据可用性的急剧增加,自动化在排放监测中的重要性持续上升。本工作回顾并解决了该领域的几个关键障碍:数据和标签质量控制、时空偏差的预防以及正确对齐的建模目标。我们通过使用来自机载和星载仪器的多活动数据进行的严格实验证明,当这些障碍得到缓解时,卷积神经网络(CNN)能够实现运营检测性能。我们证明,同时学习实例检测和像素级分割的多任务模型可以成功走向运营路径。我们评估了模型在不同排放源类型和区域上的羽流可检测性,确定了运营部署的阈值。最后,我们提供了分析就绪的数据、模型和源代码以实现可重复性,并致力于定义一套最佳实践和验证标准,以促进未来对该领域的贡献。

英文摘要

Operational deployment of a fully automated facility-scale greenhouse gas (GHG) plume detection system remains challenging for fine spatial resolution imaging spectrometers, despite recent advances in deep learning approaches. With the dramatic increase in data availability, however, automation continues to increase in importance for emissions monitoring. This work reviews and addresses several key obstacles in the field: data and label quality control, prevention of spatiotemporal biases, and correctly aligned modeling objectives. We demonstrate through rigorous experiments using multicampaign data from airborne and spaceborne instruments that convolutional neural networks (CNNs) are able to achieve operational detection performance when these obstacles are alleviated. We demonstrate that a multitask model that learns both instance detection and pixelwise segmentation simultaneously can successfully lead towards an operational pathway. We evaluate the model's plume detectability across emission source types and regions, identifying thresholds for operational deployment. Finally, we provide analysis-ready data, models, and source code for reproducibility, and work to define a set of best practices and validation standards to facilitate future contributions to the field.

2603.13312 2026-06-02 cs.MM cs.LG 版本更新

Design-MLLM: A Reinforcement Alignment Framework for Verifiable and Aesthetic Interior Design

Design-MLLM:一种用于可验证且美观的室内设计的强化对齐框架

Yuxuan Yang, Xiaotong Mao, Jingyao Wang, Fuchun Sun

发表机构 * National Jiangsu University of Finance(江苏财经大学) University of Lorraine(洛林大学) Institute of Electronics and Information Technology, Chinese Academy of Sciences(中国科学院电子信息技术研究所) Tsinghua University(清华大学)

AI总结 提出Design-MLLM框架,通过双分支美学导向奖励的强化对齐,解决室内设计中空间可行性硬约束与美学偏好软约束的矛盾,生成既可行又美观的设计。

详情
AI中文摘要

室内设计是一个从需求到视觉方案的生成过程,必须同时满足可验证的空间可行性和比较性的美学偏好。虽然最近的多模态大语言模型(MLLM)为解释用户意图和生成设计理由提供了统一基础,但我们的实证分析揭示了实际部署中持续存在的矛盾:MLLM通常生成不可建造且美学不一致的布局。这些发现表明,简单地添加领域内文本是不够的;有效的室内设计需要一种对齐机制,将硬约束与软偏好分离,并在优化过程中协调它们。为此,我们提出Design-MLLM,一种通过双分支、美学导向奖励优化可行性优先偏好目标的强化对齐框架。具体来说,Design-MLLM (i) 使用程序化约束检查显式评估空间可行性,(ii) 仅在可行候选者中评估美学偏好,以避免视觉吸引但不可执行的捷径,(iii) 执行组相对优化以获得稳定的偏好信号。通过这个过程,Design-MLLM学习一种可控策略,一致地选择并生成既可行又美学协调的解决方案,而不是偶尔产生视觉吸引但不可行的设计。在各种基准数据集上的大量实验证明了Design-MLLM的优势。

英文摘要

Interior design is a requirements-to-visual-plan generation process that must simultaneously satisfy verifiable spatial feasibility and comparative aesthetic preferences. While recent multimodal large language models (MLLMs) offer a unified foundation for interpreting user intent and producing design rationales, our empirical analysis reveals a persistent contradiction in real-world deployment: MLLMs often produce layouts that are unbuildable and aesthetically inconsistent. These findings indicate that simply adding in-domain text is insufficient; effective interior design requires an alignment mechanism that separates hard constraints from soft preferences and coordinates them during optimization. To address this, we propose Design-MLLM, a reinforcement alignment framework that optimizes a feasibility-first preference objective via a dual-branch, aesthetic-oriented reward. Specifically, Design-MLLM (i) explicitly evaluates spatial feasibility using programmatic constraint checks, (ii) assesses aesthetic preference only among feasible candidates to avoid visually appealing but unexecutable shortcuts, and (iii) performs group-relative optimization to obtain stable preference signals. Through this process, Design-MLLM learns a controllable policy that consistently selects and generates solutions that are both executable and aesthetically coherent, rather than occasionally producing visually appealing but infeasible designs. Extensive experiments on various benchmark datasets demonstrate the advantages of Design-MLLM.

2603.12996 2026-06-02 cs.LG 版本更新

DAPD: Dependency-Aware Parallel Decoding via Attention for Diffusion LLMs

DAPD: 面向扩散LLM的基于注意力的依赖感知并行解码

Bumjun Kim, Dongjae Jeon, Moongyu Jeon, Albert No

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种无需训练的并行解码方法DAPD,利用自注意力构建掩码标记的依赖图,通过选择独立集并行解码,避免强耦合标记同时更新,提升了扩散LLM的精度-步数权衡。

Comments Accepted at ICML 2026

详情
AI中文摘要

扩散LLM(dLLM)的并行解码很困难,因为每个去噪步骤仅提供逐标记的边缘分布,而同时解掩多个标记需要考虑标记间的依赖关系。我们提出依赖感知并行解码(DAPD),一种简单、无需训练的解码方法,它使用自注意力在掩码标记上诱导条件依赖图。在每次迭代中,图中的边捕捉强标记交互,而非边表示弱依赖。然后,并行解码简化为在图上选择一个独立集并并行解掩所选标记。这避免了同时更新强耦合标记,无需辅助模型或重新训练。在LLaDA和Dream上的实验表明,DAPD改进了现有方法的精度-步数权衡,并实现了更全局分布的并行更新,更好地利用了dLLM的任意顺序生成能力。项目地址:https://ai-isl.github.io/dapd

英文摘要

Parallel decoding for Diffusion LLMs (dLLMs) is difficult because each denoising step provides only token-wise marginal distributions, while unmasking multiple tokens simultaneously requires accounting for inter-token dependencies. We propose Dependency-Aware Parallel Decoding (DAPD), a simple, training-free decoding method that uses self-attention to induce a conditional dependency graph over masked tokens. At each iteration, edges in this graph capture strong token interactions, while non-edges indicate weak dependence. Parallel decoding is then reduced to selecting an independent set on the graph and unmasking the selected tokens in parallel. This avoids co-updating strongly coupled tokens without auxiliary models or retraining. Experiments on LLaDA and Dream show that DAPD improves the accuracy-steps trade-off over existing methods and enables more globally distributed parallel updates that better exploit the any-order generation capability of dLLMs. The project is available at https://ai-isl.github.io/dapd

2603.12037 2026-06-02 cs.LG 版本更新

Frequentist Consistency of Prior-Data Fitted Networks for Causal Inference

用于因果推断的先验数据拟合网络的频率派一致性

Valentyn Melnychuk, Vahid Balazadeh, Stefan Feuerriegel, Rahul G. Krishnan

发表机构 * LMU Munich \& Munich Center for Machine Learning (MCML), Munich, Germany University of Toronto \& Vector Institute, Toronto, Canada

AI总结 本文分析基于先验数据拟合网络(PFN)的平均处理效应(ATE)估计量的频率派一致性,发现其存在先验诱导的混淆偏差,并提出基于一步后验校正(OSPC)的校准方法,结合鞅后验恢复功能干扰后验,从而恢复频率派一致性并实现半参数Bernstein-von Mises定理。

详情
Journal ref
Proceedings of the 43-rd International Conference on Machine Learning, Seoul, South Korea, PMLR 306, 2026
AI中文摘要

基于先验数据拟合网络(PFN)的基础模型通过将因果推断任务构建为上下文学习问题,在因果推断中展现出强大的实证性能。然而,目前尚不清楚基于PFN的因果估计量是否提供与经典频率派估计量一致的不确定性量化。在这项工作中,我们通过分析基于PFN的平均处理效应(ATE)估计量的频率派一致性来填补这一空白。(1)我们表明,现有的PFN在解释为贝叶斯ATE估计量时,可能表现出先验诱导的混淆偏差:先验不会被数据渐近覆盖,这反过来阻碍了频率派一致性。(2)作为补救措施,我们建议采用基于一步后验校正(OSPC)的校准程序。我们证明OSPC有助于恢复频率派一致性,并能为校准后的PFN导出半参数Bernstein-von Mises定理(即,随着数据规模增大,校准后的基于PFN的估计量和经典半参数有效估计量在分布上收敛)。(3)最后,我们通过在PFN之上定制鞅后验来实现OSPC。通过这种方式,我们能够从PFN中恢复OSPC所需的功能干扰后验。在多个(半)合成实验中,使用我们的鞅后验OSPC校准的PFN产生的ATE不确定性(i)渐近匹配频率派不确定性,并且(ii)与其他贝叶斯ATE估计量相比,在有限样本中校准良好。

英文摘要

Foundation models based on prior-data fitted networks (PFNs) have shown strong empirical performance in causal inference by framing the task as an in-context learning problem. However, it is unclear whether PFN-based causal estimators provide uncertainty quantification that is consistent with classical frequentist estimators. In this work, we address this gap by analyzing the frequentist consistency of PFN-based estimators for the average treatment effect (ATE). (1) We show that existing PFNs, when interpreted as Bayesian ATE estimators, can exhibit prior-induced confounding bias: the prior is not asymptotically overwritten by data, which, in turn, prevents frequentist consistency. (2) As a remedy, we suggest employing a calibration procedure based on a one-step posterior correction (OSPC). We show that the OSPC helps to restore frequentist consistency and can yield a semi-parametric Bernstein-von Mises theorem for calibrated PFNs (i.e., both the calibrated PFN-based estimators and the classical semi-parametric efficient estimators converge in distribution with growing data size). (3) Finally, we implement OSPC through tailoring martingale posteriors on top of the PFNs. In this way, we are able to recover functional nuisance posteriors from PFNs, required by the OSPC. In multiple (semi-)synthetic experiments, PFNs calibrated with our martingale posterior OSPC produce ATE uncertainty that (i) asymptotically matches frequentist uncertainty and (ii) is well calibrated in finite samples in comparison to other Bayesian ATE estimators.

2603.11946 2026-06-02 cs.LG cs.AI 版本更新

Geometry-Aware Probabilistic Circuits via Voronoi Tessellations

基于Voronoi剖分的几何感知概率电路

Sahil Sidheekh, Sriraam Natarajan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对概率电路因数据无关混合权重而无法捕捉数据流形局部几何结构的问题,提出通过Voronoi剖分将几何结构直接融入求和节点,并开发近似推理框架和精确推理条件,最后引入可微松弛实现梯度学习,在密度估计任务上验证了有效性。

详情
AI中文摘要

概率电路(PC)支持精确且易于处理的推理,但采用数据无关的混合权重,限制了其捕捉数据流形局部几何结构的能力。我们提出将Voronoi剖分(VT)作为将几何结构直接融入PC求和节点的自然方式。然而,直接引入这种结构会破坏可处理性。我们形式化了这种不兼容性,并开发了两种互补的解决方案:(1)一个近似推理框架,为推理提供保证的下界和上界;(2)VT的一个结构条件,在该条件下恢复精确的可处理推理。最后,我们引入了VT的可微松弛,使得基于梯度的学习成为可能,并在标准密度估计任务上实证验证了所提方法。

英文摘要

Probabilistic circuits (PCs) enable exact and tractable inference but employ data independent mixture weights that limit their ability to capture local geometry of the data manifold. We propose Voronoi tessellations (VT) as a natural way to incorporate geometric structure directly into the sum nodes of a PC. However, naïvely introducing such structure breaks tractability. We formalize this incompatibility and develop two complementary solutions: (1) an approximate inference framework that provides guaranteed lower and upper bounds for inference, and (2) a structural condition for VT under which exact tractable inference is recovered. Finally, we introduce a differentiable relaxation for VT that enables gradient-based learning and empirically validate the resulting approach on standard density estimation tasks.

2603.11653 2026-06-02 cs.LG cs.RO 版本更新

Simple Recipe Works: Vision-Language-Action Models are Natural Continual Learners with Reinforcement Learning

简单配方有效:视觉-语言-动作模型通过强化学习成为自然持续学习者

Jiaheng Hu, Jay Shim, Chen Tang, Yoonchang Sung, Bo Liu, Peter Stone, Roberto Martin-Martin

发表机构 * University of Southern California(南加州大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过系统研究发现,对于大型预训练视觉-语言-动作模型,简单的顺序微调结合低秩适配在持续强化学习中表现出高可塑性、几乎无遗忘和强零样本泛化,优于复杂方法。

Comments Accepted at RLC 2026

详情
AI中文摘要

持续强化学习(CRL)用于视觉-语言-动作(VLA)模型是一个有前景的方向,旨在实现能够在开放、不断变化的环境中适应的自我改进具身智能体。然而,持续学习的传统观点认为,简单的顺序微调(Seq. FT)会导致灾难性遗忘,需要复杂的CRL策略。在这项工作中,我们退一步,对大型预训练VLA在多种终身RL基准上的CRL进行了系统研究。我们发现,与既定信念相反,使用低秩适配(LoRA)的简单Seq. FT非常强大:它实现了高可塑性,几乎没有遗忘,并保持了强大的零样本泛化,通常优于更复杂的CRL方法。通过详细分析,我们表明这种鲁棒性源于大型预训练模型、参数高效适配和在线RL之间的协同作用。这些组件共同重塑了稳定性-可塑性权衡,使持续适应既稳定又可扩展。我们的结果将顺序微调定位为VLA持续RL的强大方法,并为大模型时代的终身学习提供了新见解。代码可在github.com/UT-Austin-RobIn/continual-vla-rl获取。

英文摘要

Continual Reinforcement Learning (CRL) for Vision-Language-Action (VLA) models is a promising direction toward self-improving embodied agents that can adapt in openended, evolving environments. However, conventional wisdom from continual learning suggests that naive Sequential Fine-Tuning (Seq. FT) leads to catastrophic forgetting, necessitating complex CRL strategies. In this work, we take a step back and conduct a systematic study of CRL for large pretrained VLAs across diverse lifelong RL benchmarks. We find that, contrary to established belief, simple Seq. FT with low-rank adaptation (LoRA) is remarkably strong: it achieves high plasticity, exhibits little to no forgetting, and retains strong zero-shot generalization, frequently outperforming more sophisticated CRL methods. Through detailed analysis, we show that this robustness arises from a synergy between the large pretrained model, parameter-efficient adaptation, and on-policy RL. Together, these components reshape the stability-plasticity trade-off, making continual adaptation both stable and scalable. Our results position Sequential Fine-Tuning as a powerful method for continual RL with VLAs and provide new insights into lifelong learning in the large model era. Code is available at github.com/UT-Austin-RobIn/continual-vla-rl.

2603.09692 2026-06-02 cs.LG cs.AI cs.CL 版本更新

ActiveUltraFeedback: Efficient Preference Data Generation using Active Learning

ActiveUltraFeedback:使用主动学习的高效偏好数据生成

Davit Melikidze, Marian Schneider, Jessica Lam, Martin Wertich, Ido Hakimi, Barna Pásztor, Andreas Krause

发表机构 * University of Freiburg(弗赖堡大学)

AI总结 提出ActiveUltraFeedback主动学习流水线,通过不确定性估计和两种新采样方法(DRTS和DeltaUCB)动态选择最具信息量的响应对,以最少六分之一的标注数据实现与静态基线相当或更优的下游性能。

Comments 40 pages, 9 figures, 26 tables

详情
AI中文摘要

基于人类反馈的强化学习(RLHF)已成为对齐大型语言模型(LLMs)的标准方法,但其有效性受到偏好数据获取高成本的瓶颈限制,尤其是在低资源和专家领域。为解决这一问题,我们引入了ACTIVEULTRAFEEDBACK,一个模块化的主动学习流水线,利用不确定性估计动态识别最具信息量的响应进行标注。我们的流水线支持系统评估标准响应选择方法以及两种新方法:DOUBLE REVERSE THOMPSON SAMPLING(DRTS)和DELTAUCB,这两种方法优先选择预测质量差距大的响应对,利用近期研究结果,即此类对为微调提供良好信号。实验表明,ACTIVEULTRAFEEDBACK生成的高质量数据集在下游性能上带来显著提升,尤其以静态基线六分之一的标注数据即可达到相当或更优的结果。我们的流水线可在https://github.com/lasgroup/ActiveUltraFeedback获取,偏好数据集可在https://huggingface.co/ActiveUltraFeedback获取。

英文摘要

Reinforcement Learning from Human Feedback (RLHF) has become the standard for aligning Large Language Models (LLMs), yet its efficacy is bottlenecked by the high cost of acquiring preference data, especially in low-resource and expert domains. To address this, we introduce ACTIVEULTRAFEEDBACK, a modular active learning pipeline that leverages uncertainty estimates to dynamically identify the most informative responses for annotation. Our pipeline facilitates the systematic evaluation of standard response selection methods alongside DOUBLE REVERSE THOMPSON SAMPLING (DRTS) and DELTAUCB, two novel methods prioritizing response pairs with large predicted quality gaps, leveraging recent results showing that such pairs provide good signals for fine-tuning. Our experiments demonstrate that ACTIVEULTRAFEEDBACK yields high-quality datasets that lead to significant improvements in downstream performance, notably achieving comparable or superior results with as little as one-sixth of the annotated data relative to static baselines. Our pipeline is available at https://github.com/lasgroup/ActiveUltraFeedback and our preference datasets at https://huggingface.co/ActiveUltraFeedback.

2603.08000 2026-06-02 cs.CL cs.LG 版本更新

SmartThinker: Progressive Chain-of-Thought Length Calibration for Efficient Large Language Model Reasoning

SmartThinker: 渐进式思维链长度校准以实现高效的大语言模型推理

Chenzhi Hu, Qinzhe Hu, Yuhang Xu, Junyi Chen, Ruijie Wang, Shengzhong Liu, Jianxin Li, Fan Wu, Guihai Chen

发表机构 * Tsinghua University(清华大学)

AI总结 针对大型推理模型输出冗余问题,提出基于GRPO的渐进式CoT长度校准方法SmartThinker,通过动态估计最优长度和调节长度奖励系数,在压缩响应长度同时提升准确率。

Comments Accepted by ICML 2026, 18 pages, 13 figures

详情
AI中文摘要

大型推理模型(LRMs),如OpenAI o1和DeepSeek-R1,通过采用长思维链(CoT)推理路径在复杂任务上实现了高准确率。然而,这些过程固有的冗长常常导致冗余和过度思考。为了解决这一问题,现有工作利用组相对策略优化(GRPO)来减少LRM的输出长度,但其静态长度奖励设计无法根据问题相对难度和响应长度分布动态调整,导致过度压缩和准确率下降。因此,我们提出SmartThinker,一种新颖的基于GRPO的高效推理方法,具有渐进式CoT长度校准。SmartThinker有两个贡献:首先,它在训练期间动态估计具有峰值准确率的最优长度,并引导过长响应朝向该长度,以减少响应长度同时保持准确率。其次,它动态调节长度奖励系数,以避免对正确推理路径的不当惩罚。大量实验结果表明,SmartThinker在提高准确率的同时实现了高达52.5%的平均长度压缩,并在AIME25等具有挑战性的基准上实现了高达16.6%的准确率提升。源代码可在https://github.com/SJTU-RTEAS/SmartThinker获取。

英文摘要

Large reasoning models (LRMs) like OpenAI o1 and DeepSeek-R1 achieve high accuracy on complex tasks by adopting long chain-of-thought (CoT) reasoning paths. However, the inherent verbosity of these processes frequently results in redundancy and overthinking. To address this issue, existing works leverage Group Relative Policy Optimization (GRPO) to reduce LRM output length, but their static length reward design cannot dynamically adapt according to the relative problem difficulty and response length distribution, causing over-compression and compromised accuracy. Therefore, we propose SmartThinker, a novel GRPO-based efficient reasoning method with progressive CoT length calibration. SmartThinker makes a two-fold contribution: First, it dynamically estimates the optimal length with peak accuracy during training and guides overlong responses toward it to reduce response length while sustaining accuracy. Second, it dynamically modulates the length reward coefficient to avoid the unwarranted penalization of correct reasoning paths. Extensive experiment results show that SmartThinker achieves up to 52.5% average length compression with improved accuracy, and achieves up to 16.6% accuracy improvement on challenging benchmarks like AIME25. The source code can be found at https://github.com/SJTU-RTEAS/SmartThinker.

2603.06741 2026-06-02 cs.LG cs.AI cs.CV 版本更新

Heterogeneous Decentralized Diffusion Models

异构去中心化扩散模型

Zhiying Jiang, Raihan Seraj, Marcos Villagra, Bidhan Roy

发表机构 * bagel.com(Bagel公司)

AI总结 提出一种异构去中心化训练框架,通过支持不同专家使用不同目标(DDPM和Flow Matching)并统一推理、预训练检查点转换以及高效架构,大幅降低计算和数据需求,使单GPU(24-48GB VRAM)即可参与训练。

Comments Accepted to CVPR2026

详情
AI中文摘要

训练前沿规模的扩散模型通常需要大量计算资源集中在紧密耦合的集群中,限制了只有资源充足的机构才能参与。虽然去中心化扩散模型(DDM)能够独立训练多个专家,但现有方法需要1176 GPU天,且所有专家使用同质化训练目标。我们提出了一个高效框架,大幅降低资源需求,同时支持异构训练目标。我们的方法结合了三个关键贡献:(1)一种异构去中心化训练范式,允许专家使用不同的目标(DDPM和Flow Matching),在推理时无需任何重新训练即可统一;(2)从ImageNet-DDPM到Flow Matching目标的预训练检查点转换,加速收敛并无需针对特定目标的预训练即可初始化;(3)PixArt-$α$的高效AdaLN-Single架构,在保持质量的同时减少参数。在LAION-Aesthetics上的实验表明,相对于先前DDM工作报告的训练规模,我们的方法将计算量减少了16倍,数据量减少了14倍。在对齐的推理设置下,我们的异构配置比同质基线获得了更好的FID和更高的提示内多样性。通过消除同步需求并支持混合DDPM/FM目标,我们的框架使贡献者只需单GPU(24-48GB VRAM)即可进行去中心化生成模型训练。

英文摘要

Training frontier-scale diffusion models often requires substantial computational resources concentrated in tightly-coupled clusters, limiting participation to well-resourced institutions. While Decentralized Diffusion Models (DDM) enable training multiple experts in isolation, existing approaches require 1176 GPU-days and homogeneous training objectives across all experts. We present an efficient framework that dramatically reduces resource requirements while supporting heterogeneous training objectives. Our approach combines three key contributions: (1) a heterogeneous decentralized training paradigm that allows experts to use different objectives (DDPM and Flow Matching), unified at inference time without any retraining; (2) pretrained checkpoint conversion from ImageNet-DDPM to Flow Matching objectives, accelerating convergence and enabling initialization without objective-specific pretraining; and (3) PixArt-$α$'s efficient AdaLN-Single architecture, reducing parameters while maintaining quality. Experiments on LAION-Aesthetics show that, relative to the training scale reported for prior DDM work, our approach reduces the compute by 16$\times$ and data by 14$\times$. Under aligned inference settings, our heterogeneous configuration achieves better FID and higher intra-prompt diversity than the homogeneous baseline. By eliminating synchronization requirements and enabling mixed DDPM/FM objectives, our framework makes decentralized generative model training accessible to contributors with single GPUs requiring only 24--48GB VRAM.

2603.04430 2026-06-02 cs.LG 版本更新

Flowers: A Warp Drive for Neural PDE Solvers

Flowers: 神经PDE求解器的曲速引擎

Till Muser, Alexandra Spitzer, Matti Lassas, Maarten V. de Hoop, Ivan Dokmanić

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Helsinki(赫尔辛基大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出Flowers架构,通过多头扭曲场实现线性代价的自适应全局交互,在2D/3D时变PDE基准上超越傅里叶、卷积和注意力基线。

详情
AI中文摘要

我们引入了Flowers,一种完全由多头扭曲构建的神经架构,用于学习PDE解算子。除了逐点通道混合和多尺度支架外,Flowers不使用傅里叶乘子、点积注意力或卷积混合。每个头预测一个位移场并扭曲混合后的输入特征。受物理和计算效率的启发,位移是逐点预测的,没有任何空间聚合,非局域性仅通过每个头在源坐标处的稀疏采样引入。在多尺度残差块中堆叠扭曲得到Flowers,它以线性代价实现自适应的全局交互。我们通过三个互补视角从理论上论证了这一设计:守恒律的流图、非均匀介质中的波以及动力学理论的连续极限。Flowers在一系列2D和3D时变PDE基准上取得了优异性能,特别是流和波。一个紧凑的17M参数模型持续优于相似规模的傅里叶、卷积和注意力基线,而一个150M参数变体在参数、数据和训练计算量多得多的情况下,超越了近期基于transformer的基础模型。

英文摘要

We introduce Flowers, a neural architecture for learning PDE solution operators built entirely from multihead warps. Aside from pointwise channel mixing and a multiscale scaffold, Flowers use no Fourier multipliers, no dot-product attention, and no convolutional mixing. Each head predicts a displacement field and warps the mixed input features. Motivated by physics and computational efficiency, displacements are predicted pointwise, without any spatial aggregation, and nonlocality enters only through sparse sampling at source coordinates, one per head. Stacking warps in multiscale residual blocks yields Flowers, which implement adaptive, global interactions at linear cost. We theoretically motivate this design through three complementary lenses: flow maps for conservation laws, waves in inhomogeneous media, and a kinetic-theoretic continuum limit. Flowers achieve excellent performance on a broad suite of 2D and 3D time-dependent PDE benchmarks, particularly flows and waves. A compact 17M-parameter model consistently outperforms Fourier, convolution, and attention-based baselines of similar size, while a 150M-parameter variant improves over recent transformer-based foundation models with much more parameters, data, and training compute.

2602.22101 2026-06-02 cs.LG cs.AI 版本更新

On Imbalanced Regression with Hoeffding Trees

关于使用Hoeffding树的不平衡回归

Pantia-Marina Alchirch, Dimitrios I. Diochnos

发表机构 * University of Oklahoma(俄克拉荷马大学)

AI总结 针对不平衡回归中的数据流问题,将核密度估计扩展到流式设置并集成层次收缩到增量决策树中,实验表明KDE能持续提升早期流性能。

Comments 17 pages, 5 figures, 3 tables, 2 algorithms, authors' version of paper accepted in PAKDD 2026 special session on Data Science: Foundations and Applications (DSFA)

详情
AI中文摘要

许多现实应用会生成用于回归的连续数据流。Hoeffding树及其变体因其有效性而具有悠久的传统,无论是单独使用还是作为更广泛集成中的基础模型。最近的批量学习工作表明,核密度估计(KDE)改善了不平衡回归中的平滑预测[Yang等人,2021],而层次收缩(HS)为决策树提供了事后正则化,无需修改其结构[Agarwal等人,2022]。我们通过伸缩公式将KDE扩展到流式设置,并将HS集成到增量决策树中。在标准在线回归基准上的实证评估表明,KDE持续改善了早期流性能,而HS提供的增益有限。我们的实现公开于:https://github.com/marinaAlchirch/DSFA_2026。

英文摘要

Many real-world applications generate continuous data streams for regression. Hoeffding trees and their variants have a long-standing tradition due to their effectiveness, either alone or as base models in broader ensembles. Recent batch-learning work shows that kernel density estimation (KDE) improves smoothed predictions in imbalanced regression [Yang et al., 2021], while hierarchical shrinkage (HS) provides post-hoc regularization for decision trees without modifying their structure [Agarwal et al., 2022]. We extend KDE to streaming settings via a telescoping formulation and integrate HS into incremental decision trees. Empirical evaluation on standard online regression benchmarks shows that KDE consistently improves early-stream performance, whereas HS provides limited gains. Our implementation is publicly available at: https://github.com/marinaAlchirch/DSFA_2026.

2503.11832 2026-06-02 cs.AI cs.LG 版本更新

Safety Mirage: How Spurious Correlations Undermine VLM Safety Fine-Tuning and Can Be Mitigated by Machine Unlearning

安全幻象:虚假相关性如何破坏VLM安全微调及通过机器遗忘缓解

Yiwei Chen, Yuguang Yao, Yihua Zhang, Bingquan Shen, Gaowen Liu, Sijia Liu

发表机构 * Michigan State University(密歇根州立大学) National University of Singapore(新加坡国立大学) Cisco Research(思科研究)

AI总结 本文发现视觉语言模型(VLM)的安全微调存在“安全幻象”,即虚假相关性导致脆弱性,并提出机器遗忘作为替代方案,显著降低攻击成功率和不必要拒绝。

Comments Accepted to ICLR 2026

详情
AI中文摘要

最近的视觉语言模型(VLM)在多模态输入(特别是文本和图像)的生成建模方面取得了显著进展。然而,当暴露于不安全查询时,它们生成有害内容的倾向引发了关键的安全问题。虽然当前的对齐策略主要依赖于使用精心策划的数据集进行监督安全微调,但我们发现了一个基本限制,称为“安全幻象”,其中监督微调无意中强化了表面文本模式与安全响应之间的虚假相关性,而不是促进深层的、内在的危害缓解。我们表明,这些虚假相关性使微调后的VLM即使面对基于单词修改的简单攻击也易受攻击,其中将文本查询中的单个单词替换为诱导虚假相关性的替代词即可有效绕过安全防护。此外,这些相关性导致过度谨慎,使微调后的VLM不必要地拒绝良性查询。为了解决这些问题,我们展示了机器遗忘(MU)作为监督安全微调的有力替代方案,因为它避免了有偏的特征-标签映射,并直接从VLM中移除有害知识,同时保留其通用能力。在安全基准上的广泛评估表明,基于MU的对齐将攻击成功率降低了高达60.27%,并将不必要的拒绝减少了超过84.20%。警告:存在可能具有攻击性的AI生成内容。

英文摘要

Recent vision language models (VLMs) have made remarkable strides in generative modeling with multimodal inputs, particularly text and images. However, their susceptibility to generating harmful content when exposed to unsafe queries raises critical safety concerns. While current alignment strategies primarily rely on supervised safety fine-tuning with curated datasets, we identify a fundamental limitation we call the ''safety mirage'', where supervised fine-tuning inadvertently reinforces spurious correlations between superficial textual patterns and safety responses, rather than fostering deep, intrinsic mitigation of harm. We show that these spurious correlations leave fine-tuned VLMs vulnerable even to a simple one-word modification-based attack, where substituting a single word in text queries with a spurious correlation-inducing alternative can effectively bypass safeguards. Additionally, these correlations contribute to the over-prudence, causing fine-tuned VLMs to refuse benign queries unnecessarily. To address these issues, we show machine unlearning (MU) as a powerful alternative to supervised safety fine-tuning, as it avoids biased feature-label mappings and directly removes harmful knowledge from VLMs while preserving their general capabilities. Extensive evaluations across safety benchmarks show that under MU-based alignment reduces the attack success rate by up to 60.27% and cuts unnecessary rejections by over 84.20%. WARNING: There exist AI generations that may be offensive in nature.

2510.07650 2026-06-02 cs.LG cs.AI 版本更新

Value Flows

Value Flows

Perry Dong, Chongyi Zheng, Chelsea Finn, Dorsa Sadigh, Benjamin Eysenbach

发表机构 * Stanford University(斯坦福大学) Princeton University(普林斯顿大学)

AI总结 本文利用基于流的生成模型估计完整未来回报分布,通过新的流匹配目标满足分布贝尔曼方程,并利用流导数ODE估计回报不确定性以优先学习,在离线与在线设置中平均成功率提升1.3倍。

Comments ICLR 2026

详情
AI中文摘要

虽然当今大多数强化学习方法将未来回报的分布压缩为单个标量值,但分布RL方法利用回报分布提供更强的学习信号,并支持探索和安全强化学习中的应用。虽然估计回报分布的主要方法是将其建模为离散区间上的分类分布或估计有限数量的分位数,但这些方法留下了关于回报分布的细粒度结构以及如何区分高回报不确定性的状态以进行决策的未解问题。本文的关键思想是使用现代、灵活的基于流的模型来估计完整的未来回报分布,并识别那些具有高回报方差的状态。我们通过制定一个新的流匹配目标来实现这一点,该目标生成满足分布贝尔曼方程的概率密度路径。基于学习到的流模型,我们使用一个新的流导数ODE来估计不同状态的回报不确定性。我们还利用这种不确定性信息,优先在某些转换上学习更准确的回报估计。我们将我们的方法(Value Flows)与先前的方法在离线和在线到在线设置中进行了比较。在37个基于状态和25个基于图像的基准任务上的实验表明,Value Flows在成功率上平均提高了1.3倍。网站:https://pd-perry.github.io/value-flows 代码:https://github.com/chongyi-zheng/value-flows

英文摘要

While most reinforcement learning methods today flatten the distribution of future returns to a single scalar value, distributional RL methods exploit the return distribution to provide stronger learning signals and to enable applications in exploration and safe RL. While the predominant method for estimating the return distribution is by modeling it as a categorical distribution over discrete bins or estimating a finite number of quantiles, such approaches leave unanswered questions about the fine-grained structure of the return distribution and about how to distinguish states with high return uncertainty for decision-making. The key idea in this paper is to use modern, flexible flow-based models to estimate the full future return distributions and identify those states with high return variance. We do so by formulating a new flow-matching objective that generates probability density paths satisfying the distributional Bellman equation. Building upon the learned flow models, we estimate the return uncertainty of distinct states using a new flow derivative ODE. We additionally use this uncertainty information to prioritize learning a more accurate return estimation on certain transitions. We compare our method (Value Flows) with prior methods in the offline and online-to-online settings. Experiments on $37$ state-based and $25$ image-based benchmark tasks demonstrate that Value Flows achieves a $1.3\times$ improvement on average in success rates. Website: https://pd-perry.github.io/value-flows Code: https://github.com/chongyi-zheng/value-flows

2603.03031 2026-06-02 cs.LG 版本更新

Step-Level Sparse Autoencoder for Reasoning Process Interpretation

步骤级稀疏自编码器用于推理过程解释

Xuan Yang, Jiayu Liu, Yuhang Lai, Hao Xu, Zhenya Huang, Ning Miao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出步骤级稀疏自编码器(SSAE),通过条件稀疏化形成信息瓶颈,将推理步骤中的增量信息与背景信息分离为稀疏特征,用于解释大语言模型的推理过程。

详情
AI中文摘要

大型语言模型(LLMs)通过思维链(CoT)推理实现了强大的复杂推理能力。然而,它们的推理模式仍然过于复杂而难以分析。尽管稀疏自编码器(SAEs)已成为可解释性的强大工具,但现有方法主要在token级别操作,在捕获更关键的步骤级信息(如推理方向和语义转换)时存在粒度不匹配问题。在这项工作中,我们提出了步骤级稀疏自编码器(SSAE),作为一种分析工具,将LLMs推理步骤的不同方面解耦为稀疏特征。具体来说,通过精确控制步骤特征基于其上下文的稀疏性,我们在步骤重建中形成一个信息瓶颈,将增量信息从背景信息中分离出来,并将其解耦为几个稀疏激活的维度。在多个基础模型和推理任务上的实验显示了提取特征的有效性。通过线性探测,我们可以轻松预测表面级信息,如生成长度和第一个token分布,以及更复杂的属性,如步骤的正确性和逻辑性。这些观察表明,LLMs在生成过程中应该已经至少部分地知道这些属性,这为LLMs的自我验证能力提供了基础。我们的代码可在https://github.com/Miaow-Lab/SSAE获取。

英文摘要

Large Language Models (LLMs) have achieved strong complex reasoning capabilities through Chain-of-Thought (CoT) reasoning. However, their reasoning patterns remain too complicated to analyze. While Sparse Autoencoders (SAEs) have emerged as a powerful tool for interpretability, existing approaches predominantly operate at the token level, creating a granularity mismatch when capturing more critical step-level information, such as reasoning direction and semantic transitions. In this work, we propose step-level sparse autoencoder (SSAE), which serves as an analytical tool to disentangle different aspects of LLMs' reasoning steps into sparse features. Specifically, by precisely controlling the sparsity of a step feature conditioned on its context, we form an information bottleneck in step reconstruction, which splits incremental information from background information and disentangles it into several sparsely activated dimensions. Experiments on multiple base models and reasoning tasks show the effectiveness of the extracted features. By linear probing, we can easily predict surface-level information, such as generation length and first token distribution, as well as more complicated properties, such as the correctness and logicality of the step. These observations indicate that LLMs should already at least partly know about these properties during generation, which provides the foundation for the self-verification ability of LLMs. Our code is available at https://github.com/Miaow-Lab/SSAE.

2603.02650 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Improving Diffusion Planners by Self-Supervised Action Gating with Energies

通过自监督动作能量门控改进扩散规划器

Yuan Lu, Dongqi Han, Yansen Wang, Dongsheng Li

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出SAGE方法,利用潜在一致性信号在推理时重新排序轨迹,惩罚动态不一致的计划,从而提升扩散规划器的性能和鲁棒性。

详情
AI中文摘要

扩散规划器是离线强化学习的一种强大方法,但当价值引导选择偏好得分高但局部与环境动态不一致的轨迹时,它们可能会失败,导致执行脆弱。我们提出了自监督动作能量门控(SAGE),一种推理时重排序方法,使用潜在一致性信号惩罚动态不一致的计划。SAGE在离线状态序列上训练联合嵌入预测架构(JEPA)编码器,并训练一个动作条件的潜在预测器用于短时域过渡。在测试时,SAGE为每个采样候选分配一个由其潜在预测误差给出的能量,并将此可行性得分与价值估计相结合以选择动作。SAGE可以集成到现有的扩散规划流程中,这些流程可以通过价值评分采样轨迹和选择动作;它不需要环境回滚,也不需要重新训练策略。在运动、导航和操作基准测试中,SAGE提高了扩散规划器的性能和鲁棒性。

英文摘要

Diffusion planners are a strong approach for offline reinforcement learning, but they can fail when value-guided selection favours trajectories that score well yet are locally inconsistent with the environment dynamics, resulting in brittle execution. We propose Self-supervised Action Gating with Energies (SAGE), an inference-time re-ranking method that penalises dynamically inconsistent plans using a latent consistency signal. SAGE trains a Joint-Embedding Predictive Architecture (JEPA) encoder on offline state sequences and an action-conditioned latent predictor for short horizon transitions. At test time, SAGE assigns each sampled candidate an energy given by its latent prediction error and combines this feasibility score with value estimates to select actions. SAGE can integrate into existing diffusion planning pipelines that can sample trajectories and select actions via value scoring; it requires no environment rollouts and no policy re-training. Across locomotion, navigation, and manipulation benchmarks, SAGE improves the performance and robustness of diffusion planners.

2603.02346 2026-06-02 cond-mat.str-el cs.AI cs.LG 版本更新

Large Electron Model: A Universal Ground State Predictor

大型电子模型:一种通用的基态预测器

Timothy Zaklama, Max Geier, Liang Fu

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Department of Physics(物理系)

AI总结 提出Large Electron Model,一种基于Fermi Sets架构的神经网络模型,通过在整个哈密顿参数流形上生成变分波函数,准确预测二维谐振势中相互作用电子的基态,并泛化到未见耦合强度和粒子数,为材料发现提供了基于变分原理的基座模型方法。

Comments 8+7 pages, 5+6 figures, 1+1 tables

详情
AI中文摘要

我们引入了大型电子模型,这是一个单一的神经网络模型,能够在整个哈密顿参数流形上产生相互作用电子的变分波函数。我们的模型采用了Fermi Sets架构,这是一种多体费米子波函数的通用表示,并进一步以哈密顿参数和粒子数为条件。对于二维谐振势中的相互作用电子,一个训练好的模型能够准确预测基态波函数,同时泛化到未见过的耦合强度和粒子数扇区,产生精确的实空间电荷密度和基态能量,甚至多达50个粒子。我们的结果为基于变分原理的材料发现建立了一个基座模型方法,同时准确处理了密度泛函理论能力之外的强电子关联。

英文摘要

We introduce Large Electron Model, a single neural network model that produces variational wavefunctions of interacting electrons over the entire Hamiltonian parameter manifold. Our model employs the Fermi Sets architecture, a universal representation of many-body fermionic wavefunctions, which is further conditioned on Hamiltonian parameter and particle number. For interacting electrons in a two-dimensional harmonic potential, a single trained model accurately predicts the ground state wavefunction while generalizing across unseen coupling strengths and particle-number sectors, producing both accurate real-space charge densities and ground state energies, even up to $50$ particles. Our results establish a foundation model method for material discovery that is grounded in the variational principle, while accurately treating strong electron correlation beyond the capacity of density functional theory.

2603.02238 2026-06-02 cs.LG cs.FL cs.LO 版本更新

Length Generalization Bounds for Transformers

Transformer的长度泛化界

Andy Yang, Pascal Bergsträßer, Georg Zetzsche, David Chiang, Anthony W. Lin

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文证明C-RASP(与Transformer紧密相关的语言类)不存在可计算的长度泛化界,但为正片段(等价于固定精度Transformer)提供了可计算的指数级最优界。

Comments 22 pages

详情
AI中文摘要

长度泛化是学习算法的一个关键性质,它使得算法在给定有限训练数据的情况下,能够对任意长度的输入做出正确预测。为了提供这样的保证,需要能够计算一个长度泛化界,超过该界模型保证泛化。本文关注C-RASP(一类与Transformer紧密相关的语言)的此类泛化界的可计算性这一开放问题。最近Chen等人针对仅有一层C-RASP以及在限制条件下针对两层C-RASP给出了部分正面结果。我们对该开放问题给出了完整答案。主要结果是C-RASP(已有两层)以及因此Transformer不存在可计算的长度泛化界。作为补充,我们为C-RASP的正片段(我们证明其等价于固定精度Transformer)提供了一个可计算的界。对于正C-RASP和固定精度Transformer,我们证明长度复杂度是指数级的,并证明了界的优性。

英文摘要

Length generalization is a key property of a learning algorithm that enables it to make correct predictions on inputs of any length, given finite training data. To provide such a guarantee, one needs to be able to compute a length generalization bound, beyond which the model is guaranteed to generalize. This paper concerns the open problem of the computability of such generalization bounds for C-RASP, a class of languages which is closely linked to transformers. A positive partial result was recently shown by Chen et al. for C-RASP with only one layer and, under some restrictions, also with two layers. We provide complete answers to the above open problem. Our main result is the non-existence of computable length generalization bounds for C-RASP (already with two layers) and hence for transformers. To complement this, we provide a computable bound for the positive fragment of C-RASP, which we show equivalent to fixed-precision transformers. For both positive C-RASP and fixed-precision transformers, we show that the length complexity is exponential, and prove optimality of the bounds.

2603.02237 2026-06-02 cs.LG cs.AI 版本更新

Concept Heterogeneity-aware Representation Steering

概念异质性感知表示引导

Laziz U. Abdullaev, Noelle Y. L. Wong, Ryan T. Z. Lee, Shiqi Jiang, Khoi N. M. Nguyen, Tan M. Nguyen

发表机构 * arXiv

AI总结 针对大语言模型表示非均匀导致全局引导脆弱的问题,提出基于最优传输的输入依赖引导方法CHaRS,通过高斯混合模型和离散最优传输实现更有效的行为控制。

详情
Journal ref
ICML 2026
AI中文摘要

表示引导提供了一种轻量级机制,通过在推理时干预内部激活来控制大语言模型(LLMs)的行为。现有方法大多依赖于单个全局引导方向,通常通过对比较数据集进行均值差异得到。这种方法隐含假设目标概念在嵌入空间中均匀表示。然而在实践中,LLM表示可能高度非均匀,表现出聚类、上下文相关的结构,这使得全局引导方向变得脆弱。在这项工作中,我们通过最优传输(OT)的视角审视表示引导,注意到标准均值差异引导隐式对应于具有不同一阶矩的两个相同分布之间的OT映射,产生全局平移。为了放宽这一限制性假设,我们从理论上将源和目标表示建模为高斯混合模型,并将引导公式化为语义潜在聚类之间的离散OT问题。从得到的传输计划中,我们通过重心投影推导出显式的、输入依赖的引导映射,产生聚类级别偏移的平滑核加权组合。我们将此方法称为概念异质性感知表示引导(CHaRS)。通过大量实验设置,我们证明CHaRS比全局引导产生更有效的行为控制。

英文摘要

Representation steering offers a lightweight mechanism for controlling the behavior of large language models (LLMs) by intervening on internal activations at inference time. Most existing methods rely on a single global steering direction, typically obtained via difference-in-means over contrastive datasets. This approach implicitly assumes that the target concept is homogeneously represented across the embedding space. In practice, however, LLM representations can be highly non-homogeneous, exhibiting clustered, context-dependent structure, which renders global steering directions brittle. In this work, we view representation steering through the lens of optimal transport (OT), noting that standard difference-in-means steering implicitly corresponds to the OT map between two identical distributions with differing first moments, yielding a global translation. To relax this restrictive assumption, we theoretically model source and target representations as Gaussian mixture models and formulate steering as a discrete OT problem between semantic latent clusters. From the resulting transport plan, we derive an explicit, input-dependent steering map via barycentric projection, producing a smooth, kernel-weighted combination of cluster-level shifts. We term this method Concept Heterogeneity-aware Representation Steering (CHaRS). Through numerous experimental settings, we show that CHaRS yields more effective behavioral control than global steering.

2509.15394 2026-06-02 cs.LG 版本更新

VMDNet: Temporal Leakage-Free Variational Mode Decomposition for Electricity Demand Forecasting

VMDNet:用于电力需求预测的无时间泄漏变分模态分解

Weibin Feng, Ran Tao, John Cartlidge, Jin Zheng

发表机构 * UKRI EPSRC Doctoral Training Partnership(UKRI EPSRC博士培训计划) UKRI EPSRC AI for Collective Intelligence (AI4CI)(集体智能(AI4CI))

AI总结 提出VMDNet框架,通过逐样本变分模态分解避免时间泄漏、频率感知嵌入和并行时间卷积网络建模各模态,并引入Stackelberg博弈双层优化选择超参数,在电力需求预测中超越现有方法。

Comments 5 pages, 1 figure, 2 tables. Version 3: Accepted author manuscript for the 34th European Signal Processing Conference (EUSIPCO 2026), Bruges, Belgium. Improved figures, additional details on TCN-based parallel decoding, and extended literature review. Code and data available: https://github.com/weibin-feng/VMDNet

详情
AI中文摘要

准确的电力需求预测具有挑战性,因为真实需求序列具有强多周期性,使得有效建模循环时间模式至关重要。分解技术使这种结构显式化,从而提升预测性能。变分模态分解(VMD)是一种用于周期性感知分解的强大信号处理方法,近年来得到越来越多的采用。然而,现有研究常遭受信息泄漏,并依赖不恰当的超参数调优。为解决这些问题,我们提出VMDNet,一个因果保持框架,它(i)应用逐样本VMD以避免时间泄漏;(ii)用频率感知嵌入表示每个分解模态,并使用并行时间卷积网络(TCNs)解码,确保模态独立性和高效学习;(iii)引入受Stackelberg博弈启发的双层方案来指导VMD两个关键超参数的选择。在三个广泛使用的电力需求数据集上的实验表明,VMDNet持续优于最先进的基线方法。

英文摘要

Accurate electricity demand forecasting is challenging due to the strong multi-periodicity of real-world demand series, which makes effective modeling of recurrent temporal patterns crucial. Decomposition techniques make such structure explicit and thereby improve predictive performance. Variational Mode Decomposition (VMD) is a powerful signal-processing method for periodicity-aware decomposition and has seen growing adoption in recent years. However, existing studies often suffer from information leakage and rely on inappropriate hyperparameter tuning. To address these issues, we propose VMDNet, a causality-preserving framework that (i) applies sample-wise VMD to avoid temporal leakage; (ii) represents each decomposed mode with frequency-aware embeddings and decodes it using parallel temporal convolutional networks (TCNs), ensuring mode independence and efficient learning; and (iii) introduces a Stackelberg game inspired bilevel scheme to guide the selection of VMD's two key hyperparameters. Experiments on three widely used electricity demand datasets show that VMDNet consistently outperforms state-of-the-art baselines.

2603.01097 2026-06-02 cs.LG 版本更新

Understanding LoRA as Knowledge Memory: An Empirical Analysis

理解LoRA作为知识记忆:一项实证分析

Seungju Back, Dongwoo Lee, Naun Kang, Taehee Lee, S. K. Hong, Youngjune Gwon, Sungjin Ahn

发表机构 * New York University(纽约大学)

AI总结 本文通过系统实证研究,将低秩适配(LoRA)作为模块化知识记忆,探索其存储容量、内部化优化、多模块系统扩展及长上下文推理能力,提供LoRA记忆操作边界的实用指导。

Comments ICML 2026

详情
AI中文摘要

预训练大型语言模型(LLM)的持续知识更新日益必要但仍具挑战性。尽管上下文学习(ICL)和检索增强生成(RAG)等推理时方法很流行,但它们面临上下文预算、成本和检索碎片化的限制。脱离这些依赖上下文的范式,本工作研究使用低秩适配(LoRA)作为模块化知识记忆的参数化方法。尽管近期有少量工作探讨了这一概念,但控制其容量和可组合性的基本机制仍很大程度上未被探索。我们通过首个系统性的实证研究来填补这一空白,该研究映射了基于LoRA记忆的设计空间,包括表征存储容量、优化内部化、扩展多模块系统以及评估长上下文推理。我们并非提出单一架构,而是提供关于LoRA记忆操作边界的实用指导。总体而言,我们的发现将LoRA定位为与RAG和ICL互补的记忆轴,具有独特优势。代码和数据集可在 https://github.com/ahn-ml/Understanding-LoRA-as-Knowledge-Memory 获取。

英文摘要

Continuous knowledge updating for pre-trained large language models (LLMs) is increasingly necessary yet remains challenging. Although inference-time methods like In-Context Learning (ICL) and Retrieval-Augmented Generation (RAG) are popular, they face constraints in context budgets, costs, and retrieval fragmentation. Departing from these context-dependent paradigms, this work investigates a parametric approach using Low-Rank Adaptation (LoRA) as a modular knowledge memory. Although few recent works examine this concept, the fundamental mechanics governing its capacity and composability remain largely unexplored. We bridge this gap through the first systematic empirical study mapping the design space of LoRA-based memory, ranging from characterizing storage capacity and optimizing internalization to scaling multi-module systems and evaluating long-context reasoning. Rather than proposing a single architecture, we provide practical guidance on the operational boundaries of LoRA memory. Overall, our findings position LoRA as the complementary axis of memory alongside RAG and ICL, offering distinct advantages. Code and datasets are available at https://github.com/ahn-ml/Understanding-LoRA-as-Knowledge-Memory.

2506.14003 2026-06-02 cs.LG 版本更新

Unlearning Isn't Invisible: Detecting Unlearning Traces in LLMs from Model Outputs

遗忘并非不可见:从模型输出中检测LLMs的遗忘痕迹

Yiwei Chen, Soumyadeep Pal, Yimeng Zhang, Qing Qu, Sijia Liu

发表机构 * Michigan State University(密歇根州立大学) University of Michigan, Ann Arbor(密歇根大学安娜堡分校) IBM Research(IBM研究院)

AI总结 本文发现大型语言模型在经历机器遗忘后会在行为和内部表征中留下可检测的“指纹”,并通过分类器利用预测logits或文本输出以超过90%的准确率识别遗忘模型。

Comments Accepted to ICLR 2026

详情
AI中文摘要

大型语言模型(LLMs)的机器遗忘(MU),通常称为LLM遗忘,旨在从训练模型中移除特定的不良数据或知识,同时保持其在标准任务上的性能。虽然遗忘在保护数据隐私、执行版权和减轻LLMs中的社会技术危害方面发挥着关键作用,但我们发现了一个遗忘后的新漏洞:遗忘痕迹检测。我们发现遗忘在LLMs中留下了持久的“指纹”,即在模型行为和内部表征中可检测的痕迹。这些痕迹可以从输出响应中识别,即使使用与遗忘无关的输入进行提示。具体来说,即使是一个简单的监督分类器,仅使用其预测logits甚至文本输出,就可以确定模型是否经历了遗忘。进一步的分析表明,这些痕迹嵌入在中间激活中,并非线性地传播到最后一层,在激活空间中形成低维、可学习的流形。通过大量实验,我们证明即使在遗忘无关的输入下,遗忘痕迹也可以以超过90%的准确率被检测到,并且更大的LLMs表现出更强的可检测性。这些发现揭示了遗忘留下了可测量的签名,引入了一种新的风险,即当模型被识别为已遗忘时,给定输入查询,可以逆向工程遗忘的信息。

英文摘要

Machine unlearning (MU) for large language models (LLMs), commonly referred to as LLM unlearning, seeks to remove specific undesirable data or knowledge from a trained model, while maintaining its performance on standard tasks. While unlearning plays a vital role in protecting data privacy, enforcing copyright, and mitigating sociotechnical harms in LLMs, we identify a new vulnerability post-unlearning: unlearning trace detection. We discover that unlearning leaves behind persistent "fingerprints" in LLMs, detectable traces in both model behavior and internal representations. These traces can be identified from output responses, even when prompted with forget-irrelevant inputs. Specifically, even a simple supervised classifier can determine whether a model has undergone unlearning, using only its prediction logits or even its textual outputs. Further analysis shows that these traces are embedded in intermediate activations and propagate nonlinearly to the final layer, forming low-dimensional, learnable manifolds in activation space. Through extensive experiments, we demonstrate that unlearning traces can be detected with over 90% accuracy even under forget-irrelevant inputs, and that larger LLMs exhibit stronger detectability. These findings reveal that unlearning leaves measurable signatures, introducing a new risk of reverse-engineering forgotten information when a model is identified as unlearned, given an input query.

2603.00963 2026-06-02 cs.LG cs.CL 版本更新

Stabilizing Policy Optimization via Logits Convexity

通过Logits凸性稳定策略优化

Hongzhan Chen, Tao Yang, Yuhua Zhu, Shiping Gao, Xiaojun Quan, Ting Yao

发表机构 * National University of Singapore(新加坡国立大学) University of Science and Technology of China(中国科学技术大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 针对强化学习训练不稳定的问题,从梯度角度分析监督微调与强化学习的稳定性差距,提出Logits凸优化(LCO)框架,通过模拟logits级凸性来稳定策略优化,实验表明该方法能提升训练稳定性并在多个基准上优于传统方法。

详情
AI中文摘要

虽然强化学习(RL)在大语言模型(LLM)近期成功中发挥了核心作用,但RL优化以不稳定著称,尤其是与监督微调(SFT)相比。本文从梯度角度研究SFT和RL之间的稳定性差距,并表明SFT损失相对于模型logits的凸性在实现稳定训练中起关键作用。我们的理论分析证明,该性质在优化过程中诱导了有利的梯度方向性。相比之下,广泛采用的策略梯度算法——使用裁剪替代目标的近端策略优化(PPO)缺乏这种稳定性质。受此观察启发,我们提出Logits凸优化(LCO),一种简单而有效的策略优化框架,将学习策略与从原始RL目标导出的最优目标对齐,从而模拟logits级凸性的稳定效果。跨多个模型家族的大量实验表明,我们的LCO框架一致地提升了训练稳定性,并在广泛的基准测试中优于传统RL方法。

英文摘要

While reinforcement learning (RL) has been central to the recent success of large language models (LLMs), RL optimization is notoriously unstable, especially when compared to supervised fine-tuning (SFT). In this work, we investigate the stability gap between SFT and RL from a gradient-based perspective, and show that the convexity of the SFT loss with respect to model logits plays a key role in enabling stable training. Our theoretical analysis demonstrates that this property induces favorable gradient directionality during optimization. In contrast, Proximal Policy Optimization (PPO), a widely adopted policy gradient algorithm utilizing a clipped surrogate objective, lacks this stabilizing property. Motivated by this observation, we propose Logits Convex Optimization (LCO), a simple yet effective policy optimization framework that aligns the learned policy with an optimal target derived from the original RL objective, thereby emulating the stabilizing effects of logits-level convexity. Extensive experiments across multiple model families show that our LCO framework consistently improves training stability and outperforms conventional RL methods on a broad range of benchmarks.

2603.00829 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Constitutional Black-Box Monitoring for Scheming in LLM Agents

LLM Agent 中阴谋行为的宪法黑盒监控

Simon Storf, Rich Barton-Cooper, James Peters-Gill, Marius Hobbhahn

发表机构 * University of Cambridge(剑桥大学)

AI总结 研究使用基于宪法黑盒的监控器,通过仅观察外部输入和输出检测LLM Agent的阴谋行为,并在合成数据上优化后泛化到更真实环境。

Comments Accepted at ICML 2026. Camera-ready version

详情
AI中文摘要

在自主环境中安全部署大型语言模型(LLM)Agent需要可靠的监督机制。一个核心挑战是检测阴谋行为,即Agent暗中追求不一致的目标。缓解此类风险的一种方法是基于LLM的监控:使用语言模型检查Agent行为中的可疑动作。我们研究宪法黑盒监控器:仅利用外部可观测的输入和输出检测阴谋行为的提示分类器,并在从自然语言行为规范生成的合成数据上优化。我们引入两个生成合成Agent轨迹的流水线:STRIDE(迭代精炼)和Gloom(Agent-环境模拟),各生成1000个样本。通过提示扫描、人工精炼和自动提示优化,我们在这些数据集上优化前沿LLM监控器,并在ControlArena(一套Agent在更现实环境中运行的接地环境)中的7500个保留轨迹上评估性能。结果表明,仅基于合成数据选择的监控器可以泛化到更现实的环境,捕获有意义的阴谋信号。然而,我们发现性能在我们的设置中迅速饱和,简单的提示扫描匹配了更广泛优化的结果。超越这一限制不会带来进一步改进,反而导致过拟合。

英文摘要

Safe deployment of Large Language Model (LLM) agents in autonomous settings requires reliable oversight mechanisms. A central challenge is detecting scheming, where agents covertly pursue misaligned goals. One approach to mitigating such risks is LLM-based monitoring: using language models to examine agent behaviors for suspicious actions. We study constitutional black-box monitors: prompted classifiers that detect scheming using only externally observable inputs and outputs, optimized on synthetic data generated from natural-language behavior specifications. We introduce two pipelines for generating synthetic agent trajectories, STRIDE (iterative refinement) and Gloom (agent-environment simulation), from which we generate 1,000 samples each. We optimize frontier LLM monitors on these datasets via prompt sweeps, human refinement, and automated prompt optimization, and evaluate performance on 7,500 held-out trajectories from ControlArena, a suite of grounded environments where agents operate in more realistic contexts. Our results demonstrate that monitors selected purely on synthetic data can generalize to more realistic environments, capturing a meaningful scheming signal. However, we find that performance saturates quickly in our setting, with simple prompt sweeps matching the results of more extensive optimization. Pushing beyond this limit yields no further improvements and instead leads to overfitting.

2509.25837 2026-06-02 cs.LG cs.AI 版本更新

Distillation of Large Language Models via Concrete Score Matching

通过具体分数匹配进行大型语言模型的蒸馏

Yeongmin Kim, Donghyeok Shin, Mina Kang, Byeonghu Na, Il-Chul Moon

发表机构 * Korea Advanced Institute of Science and Technology(韩国科学技术院)

AI总结 提出具体分数蒸馏(CSD)目标,通过离散分数匹配克服softmax平滑和logit平移不变性限制,实现学生与教师模型间所有词汇对相对logit差异的灵活加权,在GPT-2、OpenLLaMA和GEMMA上优于现有蒸馏方法。

Comments ICLR 2026

详情
AI中文摘要

大型语言模型(LLMs)性能卓越但部署成本高昂,促使知识蒸馏(KD)用于高效推理。现有的KD目标通常通过softmax匹配学生和教师概率,这会模糊有价值的logit信息。虽然直接logit蒸馏(DLD)缓解了softmax平滑问题,但它未能考虑logit平移不变性,从而限制了解空间。我们提出具体分数蒸馏(CSD),一种离散分数匹配目标,克服了softmax引起的平滑和对最优解集的限制。我们解决了自回归LLMs中离散分数匹配的训练不稳定和二次复杂度问题,得到的CSD目标以灵活权重对齐学生和教师之间所有词汇对的相对logit差异。我们在框架内提供了模式寻求和模式覆盖实例,并在GPT-2-1.5B、OpenLLaMA-7B和GEMMA-7B-IT上评估了CSD在任务无关的指令遵循和任务特定蒸馏中的表现。实验表明,CSD持续超越最近的KD目标,实现了良好的保真度-多样性权衡,并与on-policy技术结合时产生互补增益,展示了其在LLM蒸馏中的可扩展性和有效性。代码:https://github.com/aailab-kaist/CSD。

英文摘要

Large language models (LLMs) deliver remarkable performance but are costly to deploy, motivating knowledge distillation (KD) for efficient inference. Existing KD objectives typically match student and teacher probabilities via softmax, which blurs valuable logit information. While direct logit distillation (DLD) mitigates softmax smoothing, it fails to account for logit shift invariance, thereby restricting the solution space. We propose Concrete Score Distillation (CSD), a discrete score-matching objective that overcomes both softmax-induced smoothing and restrictions on the optimal solution set. We resolve the training instability and quadratic complexity of discrete score-matching in autoregressive LLMs, and the resulting CSD objective aligns relative logit differences across all vocabulary pairs between student and teacher with flexible weighting. We provide both mode-seeking and mode-covering instances within our framework and evaluate CSD on task-agnostic instruction-following and task-specific distillation using GPT-2-1.5B, OpenLLaMA-7B, and GEMMA-7B-IT. Experiments show that CSD consistently surpasses recent KD objectives, achieves favorable fidelity-diversity trade-offs, and yields complementary gains when combined with on-policy techniques, demonstrating its scalability and effectiveness for LLM distillation. Code: https://github.com/aailab-kaist/CSD.

2602.24201 2026-06-02 cs.LG 版本更新

Flow-Based Density Ratio Estimation for Intractable Distributions with Applications in Genomics

基于流的难解分布密度比估计及其在基因组学中的应用

Egor Antipov, Alessandro Palma, Lorenzo Consoli, Stephan Günnemann, Andrea Dittadi, Fabian J. Theis

发表机构 * ETH Zurich(苏黎世联邦理工学院) University of Cambridge(剑桥大学) Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 提出利用条件感知流匹配推导单一动力学公式,沿生成轨迹追踪密度比,以高效估计难解分布间的密度比,并在单细胞基因组学数据分析中展示竞争力。

详情
AI中文摘要

估计成对难解数据分布之间的密度比是概率建模中的一个核心问题,它能够在不同条件下对不同数据生成过程中的样本似然进行原则性比较。虽然诸如归一化流之类的精确似然模型为密度比估计提供了一种有前景的方法,但朴素评估计算成本高且容易产生离散化误差,因为需要独立模拟每个分布的似然。在这项工作中,我们利用条件感知流匹配推导出一个单一的动力学公式,用于沿生成轨迹追踪密度比。我们在封闭形式比估计的模拟基准上展示了竞争性能,并表明我们的方法支持单细胞基因组学数据分析中的多种任务,其中基于似然的跨实验条件细胞状态比较能够实现治疗效果估计和批次校正评估。

英文摘要

Estimating density ratios between pairs of intractable data distributions is a core problem in probabilistic modeling, enabling principled comparisons of sample likelihoods under different data-generating processes across conditions. While exact-likelihood models such as normalizing flows offer a promising approach to density ratio estimation, naive evaluations are computationally expensive and prone to discretization errors because they require simulating each distribution's likelihood independently. In this work, we leverage condition-aware flow matching to derive a single dynamical formulation for tracking density ratios along generative trajectories. We demonstrate competitive performance on simulated benchmarks for closed-form ratio estimation, and show that our method supports versatile tasks in single-cell genomics data analysis, where likelihood-based comparisons of cellular states across experimental conditions enable treatment effect estimation and batch correction evaluation.

2602.23881 2026-06-02 cs.LG cs.CL 版本更新

LK Losses: Direct Acceptance Rate Optimization for Speculative Decoding

LK损失:用于推测解码的直接接受率优化

Alexander Samarin, Sergei Krutikov, Anton Shevtsov, Sergei Skvortsov, Filipp Fisin, Alexander Golubev

发表机构 * arXiv

AI总结 针对推测解码中标准KL散度训练不能最大化接受率的问题,提出LK损失直接优化接受率,实验表明在多种架构和模型上一致提升接受指标。

Comments ICML 2026

详情
AI中文摘要

推测解码通过使用轻量级草稿模型提出候选令牌,然后由目标模型并行验证,从而加速自回归大型语言模型(LLM)推理。加速效果显著取决于接受率,然而标准训练将Kullback-Leibler(KL)散度作为代理目标进行最小化。虽然KL散度和接受率共享相同的全局最优解,但小型草稿模型由于容量有限,通常收敛到次优解,此时最小化KL并不能保证最大化接受率。为解决此问题,我们提出LK损失,这是一种直接针对接受率的特殊训练目标。在四种草稿架构和六个目标模型(参数范围从8B到685B)上的全面实验表明,与基于KL的标准训练相比,所有配置下的接受指标均有一致提升。我们在通用、编码和数学领域评估了我们的方法,并报告平均接受长度提升高达8-10%。LK损失易于实现,不引入计算开销,可直接集成到任何现有的推测器训练框架中,使其成为现有草稿训练目标的有力替代方案。

英文摘要

Speculative decoding accelerates autoregressive large language model (LLM) inference by using a lightweight draft model to propose candidate tokens that are then verified in parallel by the target model. The speedup is significantly determined by the acceptance rate, yet standard training minimizes Kullback-Leibler (KL) divergence as a proxy objective. While KL divergence and acceptance rate share the same global optimum, small draft models, having limited capacity, typically converge to suboptimal solutions where minimizing KL does not guarantee maximizing acceptance rate. To address this issue, we propose LK losses, special training objectives that directly target acceptance rate. Comprehensive experiments across four draft architectures and six target models, ranging from 8B to 685B parameters, demonstrate consistent improvements in acceptance metrics across all configurations compared to the standard KL-based training. We evaluate our approach on general, coding and math domains and report gains of up to 8-10% in average acceptance length. LK losses are easy to implement, introduce no computational overhead and can be directly integrated into any existing speculator training framework, making them a compelling alternative to the existing draft training objectives.

2602.23197 2026-06-02 cs.CL cs.LG stat.ML 版本更新

Fine-Tuning Without Forgetting In-Context Learning: A Theoretical Analysis of Linear Attention Models

微调不忘上下文学习:线性注意力模型的理论分析

Chungpa Lee, Jy-yong Sohn, Kangwook Lee

发表机构 * KAIST(韩国科学技术院)

AI总结 本文通过线性注意力模型理论分析,揭示了微调目标如何修改注意力参数并导致少样本性能下降的条件,提出仅更新值矩阵可保持上下文学习能力。

详情
Journal ref
International Conference on Machine Learning (ICML) 2026
AI中文摘要

基于Transformer的大型语言模型展现出上下文学习能力,能够通过少量示例提示适应下游任务。实践中,这类模型常被微调以提升下游任务的零样本性能,使其无需示例即可解决问题,从而降低推理成本。然而,微调可能削弱上下文学习能力,限制微调模型在未见任务上的表现。利用线性注意力模型,我们提供了理论分析,刻画了微调目标如何修改注意力参数,并识别了导致少样本性能下降的条件。我们表明,微调所有注意力参数会损害上下文学习,而仅更新值矩阵可在保持上下文学习的同时提升零样本性能。我们进一步证明,引入辅助的少样本损失主要增强目标任务的上下文学习,但以牺牲微调未见任务上的上下文学习能力为代价。我们提供了来自合成和真实数据集的实验证据,与理论定性预测一致。

英文摘要

Transformer-based large language models exhibit in-context learning, enabling adaptation to downstream tasks via few-shot prompting with demonstrations. In practice, such models are often fine-tuned to improve zero-shot performance on downstream tasks, allowing them to solve tasks without examples and thereby reducing inference costs. However, fine-tuning can degrade in-context learning, limiting the performance of fine-tuned models on tasks not seen during fine-tuning. Using linear attention models, we provide a theoretical analysis that characterizes how fine-tuning objectives modify attention parameters and identifies conditions under which this leads to degraded few-shot performance. We show that fine-tuning all attention parameters can harm in-context learning, whereas restricting updates to the value matrix improves zero-shot performance while preserving in-context learning. We further show that incorporating an auxiliary few-shot loss enhances in-context learning primarily on the target task, at the expense of degraded in-context learning ability on tasks not seen during fine-tuning. We provide empirical evidence from synthetic and real-world datasets consistent with the qualitative predictions of our theory.

2602.16953 2026-06-02 cs.AI cs.LG 版本更新

LLM4Cov: Execution-Aware Agentic Learning for High-coverage Testbench Generation

LLM4Cov:面向高覆盖率测试生成的执行感知智能体学习

Hejia Zhang, Zhongming Yu, Chia-Tung Ho, Haoxing Ren, Brucek Khailany, Jishen Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LLM4Cov离线智能体学习框架,通过执行验证数据策展、策略感知数据合成和最差状态优先采样,在硬件验证中实现高覆盖率测试生成,4B参数模型在CVDP-ECov上达到69.2%通过率和90.4%平均覆盖率。

Comments ICML'26 Camera Ready version

详情
AI中文摘要

执行感知的LLM智能体为从工具反馈中学习提供了一种有前景的范式,但这种反馈可能昂贵且获取缓慢,使得在线强化学习(RL)在某些场景下不太实用。高覆盖率硬件验证由于依赖工业模拟器和不可微的执行信号,体现了这一挑战。我们提出LLM4Cov,一种离线智能体学习框架,将验证建模为由确定性评估器指导的单步状态转移。基于这一公式,我们引入了执行验证的数据策展、策略感知的智能体数据合成以及最差状态优先采样,以在执行约束下实现可扩展学习。我们进一步通过修订的评估协议,从现有验证套件中整理了一个符合现实的基准。使用所提出的流程,一个紧凑的4B参数模型在智能体评估下实现了69.2%的通过率和90.4%的平均覆盖率(CVDP-ECov),比其教师模型分别高出5.3%和10.5%,展现出与规模大一个数量级的模型相竞争的性能。

英文摘要

Execution-aware LLM agents offer a promising paradigm for learning from tool feedback, but such feedback can be expensive and slow to obtain, making online reinforcement learning (RL) less practical in certain scenarios. High-coverage hardware verification exemplifies this challenge due to its reliance on industrial simulators and non-differentiable execution signals. We propose LLM4Cov, an offline agent-learning framework that models verification as single-step state transitions guided by deterministic evaluators. Building on this formulation, we introduce execution-validated data curation, policy-aware agentic data synthesis, and worst-state-prioritized sampling to enable scalable learning under execution constraints. We further curate a reality-aligned benchmark adapted from an existing verification suite through a revised evaluation protocol. Using the proposed pipeline, a compact 4B-parameter model achieves 69.2% pass rate and 90.4% average coverage in CVDP-ECov under agentic evaluation, outperforming its teacher by 5.3% and 10.5%, demonstrating competitive performance against models an order of magnitude larger.

2508.08337 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Position: Beyond Sensitive Attributes, ML Fairness Should Quantify Structural Injustice via Social Determinants

立场:超越敏感属性,机器学习公平性应通过社会决定因素量化结构性不公正

Zeyu Tang, Alex John London, Atoosa Kasirzadeh, Sarah Stewart de Ramirez, Peter Spirtes, Kun Zhang, Sanmi Koyejo

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Washington(华盛顿大学) University of Michigan(密歇根大学) University of Toronto(多伦多大学)

AI总结 本文主张算法公平性研究应超越敏感属性,通过社会决定因素量化结构性不公正,并通过理论模型和实证研究证明仅关注敏感属性的缓解策略可能引入新的结构性不公正。

Comments Accepted to ICML 2026 Position Paper Track

详情
AI中文摘要

算法公平性研究在很大程度上将不公平视为对敏感属性的歧视。然而,这种方法限制了对作为通过社会决定因素实例化的结构性不公正的不公平的可见性,社会决定因素是塑造属性和结果但不涉及特定个体的上下文变量。这篇立场论文认为,该领域应通过社会决定因素量化结构性不公正,超越敏感属性。借鉴跨学科见解,我们认为主流技术范式未能充分捕捉作为结构性不公正的不公平,因为上下文可能被视为需要标准化的噪声,而不是需要审计的信号。我们进一步通过大学录取的理论模型、使用美国人口普查数据的人口统计研究以及美国综合医疗系统中关于乳腺癌筛查的高风险领域应用,证明了这种转变的实际紧迫性。我们的结果表明,仅关注敏感属性的缓解策略可能引入新的结构性不公正形式。我们认为,通过社会决定因素审计结构性不公正必须先于缓解措施,并呼吁开发超越以敏感属性为中心的非歧视公平概念的新技术。

英文摘要

Algorithmic fairness research has largely framed unfairness as discrimination along sensitive attributes. However, this approach limits visibility into unfairness as structural injustice instantiated through social determinants, which are contextual variables that shape attributes and outcomes without pertaining to specific individuals. This position paper argues that the field should quantify structural injustice via social determinants, beyond sensitive attributes. Drawing on cross-disciplinary insights, we argue that prevailing technical paradigms fail to adequately capture unfairness as structural injustice, because contexts are potentially treated as noise to be normalized rather than signal to be audited. We further demonstrate the practical urgency of this shift through a theoretical model of college admissions, a demographic study using U.S. census data, and a high-stakes domain application regarding breast cancer screening within an integrated U.S. healthcare system. Our results indicate that mitigation strategies centered solely on sensitive attributes can introduce new forms of structural injustice. We contend that auditing structural injustice through social determinants must precede mitigation, and call for new technical developments that move beyond sensitive-attribute-centered notions of fairness as non-discrimination.

2601.17074 2026-06-02 cs.LG cs.AI 版本更新

Physics-Encoded Inverse Modeling for Arctic Snow Depth Prediction

物理编码的北极雪深预测逆建模

Akila Sampath, Vandana Janeja, Jianwu Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出物理编码逆建模框架PhysE-Inv,结合LSTM序列学习与对比学习正则化,在稀疏观测下实现雪深估计,均方误差平均降低24.7%。

详情
AI中文摘要

在有限且稀疏观测下准确估计时变逆问题仍然是科学领域的基本挑战。例如,雪深估计需要推断控制海冰物理的隐藏参数,这可以通过物理信息编码来实现。为了解决这一挑战,我们引入了物理编码逆建模(PhysE-Inv),这是一个新颖的框架,将深度序列学习与物理信息推理相结合,用于解决真实世界稀疏观测环境下的逆问题。PhysE-Inv集成了LSTM编码器-解码器以捕获时间依赖性,并结合对比学习正则化来强制实现噪声不变的潜在表示。该框架学习潜在参数,这些参数与观测输入相结合,在融入物理信息指导的同时重建雪深。PhysE-Inv在所有评估基线上持续表现优异,在所有基线模型上实现了平均MSE降低24.7%,在参数估计设置下比最强基线提高了17.3%。总体而言,我们的工作为数据稀缺领域展示了一种可泛化的逆建模范式,其中物理信息指导可以融入稀疏观测中。

英文摘要

Accurate estimation in time-varying inverse problems under limited and sparse observations remains a fundamental challenge across scientific domains. For example, snow depth estimation requires inferring hidden parameters governing sea ice physics, which can be incorporated through physics-informed encoding. To address this challenge, we introduce Physics-Encoded Inversion (PhysE-Inv), a novel framework that combines deep sequential learning with physics-informed inference for solving inverse problems under real-world sparse observational settings. PhysE-Inv integrates an LSTM encoder-decoder to capture temporal dependencies, together with contrastive learning regularization that enforces noise-invariant latent representations. The framework learns latent parameters that, when combined with observational inputs, reconstruct snow depth while incorporating physics-informed guidance. PhysE-Inv consistently outperforms all evaluated baselines, achieving an average MSE reduction of 24.7\% across all baseline models and a 17.3\% improvement over the strongest baseline under parameter estimation settings. Overall, our work demonstrates a generalizable inverse modeling paradigm for data-scarce domains where physics-informed guidance can be incorporated into sparse observations.

2505.18877 2026-06-02 cs.LG 版本更新

RefLoRA: Refactored Low-Rank Adaptation for Efficient Fine-Tuning of Large Models

RefLoRA:重构低秩适配以实现大型模型的高效微调

Yilang Zhang, Bingcong Li, Georgios B. Giannakis

发表机构 * Department of ECE University of Minnesota(电子工程系明尼苏达大学) Department of CS ETH Zürich(计算机科学系苏黎世联邦理工学院)

AI总结 针对LoRA因非唯一低秩分解导致权重更新不一致和性能下降的问题,提出RefLoRA方法,通过每步优化最小化损失上界的低秩分解,促进更平坦的损失景观和稳定收敛,在自然语言理解和常识推理任务上优于现有LoRA变体且计算开销可忽略。

Comments Accepted as a conference paper at NeurIPS 2025

详情
AI中文摘要

低秩适配(LoRA)通过更新预训练权重矩阵的低维子空间,降低了大型模型微调的计算和内存开销。尽管高效,LoRA由于其非唯一的低秩分解导致权重更新不一致和不平衡,表现出次优收敛和明显的性能下降。为了克服这些限制,本文确定了每步最小化损失上界的最优低秩分解。由此产生的重构低秩适配(RefLoRA)方法促进了更平坦的损失景观,以及一致和平衡的权重更新,从而加速了稳定收敛。大量实验在自然语言理解和常识推理任务上评估了RefLoRA,使用了流行的LLaMA-7B、LLaMA2-7B和LLaMA3-8B等大型语言模型。数值测试证实,RefLoRA收敛更快,优于各种基准,并且与最先进的LoRA变体相比,计算开销可忽略不计。

英文摘要

Low-Rank Adaptation (LoRA) lowers the computational and memory overhead of fine-tuning large models by updating a low-dimensional subspace of the pre-trained weight matrix. Albeit efficient, LoRA exhibits suboptimal convergence and noticeable performance degradation, due to inconsistent and imbalanced weight updates induced by its nonunique low-rank factorizations. To overcome these limitations, this article identifies the optimal low-rank factorization per step that minimizes an upper bound on the loss. The resultant refactored low-rank adaptation (RefLoRA) method promotes a flatter loss landscape, along with consistent and balanced weight updates, thus speeding up stable convergence. Extensive experiments evaluate RefLoRA on natural language understanding, and commonsense reasoning tasks with popular large language models including DeBERTaV3, LLaMA-7B, LLaMA2-7B and LLaMA3-8B. The numerical tests corroborate that RefLoRA converges faster, outperforms various benchmarks, and enjoys negligible computational overhead compared to state-of-the-art LoRA variants.

2602.20019 2026-06-02 cs.LG cs.AI 版本更新

Learning Discriminative and Generalizable Anomaly Detector for Dynamic Graph with Limited Supervision

有限监督下动态图的可判别且可泛化的异常检测器学习

Yuxing Tian, Yiyan Qi, Fengran Mo, Weixu Zhang, Jian Guo, Jian-Yun Nie

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对动态图异常检测中标注异常稀缺的问题,提出一个结合残差表示编码、限制损失和双边界优化的模型无关框架,从正常/未标注数据中学习可判别边界,同时利用有限标注异常并保持对未见异常的泛化能力。

Comments Accepted by ICML2026

详情
AI中文摘要

动态图异常检测对许多现实应用至关重要,但由于标注异常的稀缺性,仍然具有挑战性。现有方法要么是无监督的,要么是半监督的:无监督方法避免了标注异常的需求,但往往产生模糊的边界,而半监督方法可能过拟合于有限的标注异常,并对未见异常泛化能力差。为了解决这一差距,我们考虑一个很大程度上未被探索的问题:从正常/未标注数据中学习可判别边界,同时利用有限的标注异常(当可用时),而不牺牲对未见异常的泛化能力。在本文中,我们提出了一个有效、可泛化且模型无关的框架,包含三个主要组件:(i)残差表示编码,捕捉当前交互与其历史上下文之间的偏差,提供与异常相关的信号;(ii)限制损失,将正常表示约束在两个共心超球面之间的区间内,确保尺度一致的同时保持异常的可分离性;(iii)双边界优化策略,利用归一化流建模的对数似然分布,学习一个可判别且鲁棒的边界。大量实验证明了我们的框架在不同评估设置下的优越性。

英文摘要

Dynamic graph anomaly detection is critical for many real-world applications but remains challenging due to the scarcity of labeled anomalies. Existing methods are either unsupervised or semi-supervised: unsupervised methods avoid the need for labeled anomalies but often produce ambiguous boundary, whereas semi-supervised methods can overfit to the limited labeled anomalies and generalize poorly to unseen anomalies. To address this gap, we consider a largely underexplored problem: learning a discriminative boundary from normal/unlabeled data, while leveraging limited labeled anomalies \textbf{when available} without sacrificing generalization to unseen anomalies. In this paper, we propose an effective, generalizable, and model-agnostic framework with three main components: (i) residual representation encoding that capture deviations between current interactions and their historical context, providing anomaly-relevant signals; (ii) a restriction loss that constrain the normal representations within an interval bounded by two co-centered hyperspheres, ensuring consistent scales while keeping anomalies separable; (iii) a bi-boundary optimization strategy that learns a discriminative and robust boundary using the log-likelihood distribution modeled by a normalizing flow. Extensive experiments demonstrate the superiority of our framework across diverse evaluation settings.

2602.19789 2026-06-02 cs.LG cs.CY 版本更新

Position: Stop Preaching and Start Practising Data Frugality for Responsible Development of AI

立场:停止说教,开始实践数据节俭以负责任地发展人工智能

Sophia N. Wilson, Andrew Millard, Guðrún Fjóla Guðmundsdóttir, Raghavendra Selvan, Sebastian Mair

发表机构 * GitHub

AI总结 本文主张机器学习社区应从说教转向实践数据节俭,通过子集选择方法在保持精度的同时大幅降低训练能耗和碳排放,以负责任地发展AI。

Comments ICML 2026

详情
AI中文摘要

这篇立场论文认为,机器学习社区必须从说教转向实践数据节俭,以实现负责任的人工智能发展。长期以来,进步一直与越来越大的数据集等同,这推动了显著的进步,但现在却带来了越来越小的性能提升,同时伴随着能源使用和碳排放的增加。尽管对数据节俭方法的认识有所提高,但其采用仍停留在口头上,数据规模扩展仍然主导着开发实践。我们认为,必须弥合说教与实践之间的差距,因为持续的数据规模扩展会带来巨大且未被充分核算的环境影响。为了支撑我们的立场,我们提供了与ImageNet-1K下游使用相关的能源使用和碳排放的指示性估计。然后,我们提供了实证证据,表明数据节俭既实用又有益,证明了子集选择方法可以在几乎不损失精度的情况下大幅减少训练能耗,同时减轻数据集偏差。最后,我们概述了将数据节俭从口头说教转变为具体实践的可操作建议,以负责任地发展人工智能。

英文摘要

This position paper argues that the machine learning community must move from preaching to practising data frugality for responsible artificial intelligence (AI) development. For too long, progress has been equated with ever-larger datasets, driving remarkable advances but now yielding increasingly diminishing performance gains alongside rising energy use and carbon emissions. While awareness of data frugal approaches has grown, their adoption has remained rhetorical, and data scaling continues to dominate development practice. We argue that this gap between preach and practice must be closed, as continued data scaling entails substantial and under-accounted environmental impacts. To ground our position, we provide indicative estimates of the energy use and carbon emissions associated with the downstream use of ImageNet-1K. We then present empirical evidence that data frugality is both practical and beneficial, demonstrating that subset selection methods can substantially reduce training energy consumption with little loss in accuracy, while also mitigating dataset bias. Finally, we outline actionable recommendations for moving data frugality from rhetorical preaching to concrete practice for responsible development of AI.

2602.19126 2026-06-02 cs.LG math.PR math.ST stat.TH 版本更新

Robust Predictive Uncertainty and Double Descent in Contaminated Bayesian Random Features

污染贝叶斯随机特征中的鲁棒预测不确定性与双重下降

Michele Caprio, Katerina Papagiannouli, Siu Lun Chau, Sayan Mukherjee

发表机构 * The University of Manchester, UK(英国曼彻斯特大学) University of Pisa, Italy(意大利比萨大学) Nanyang Technological University, Singapore(新加坡南洋理工大学) Max Planck Institute for Mathematics in the Sciences, Germany(德国马克斯·普朗克数学研究所)

AI总结 提出一种鲁棒贝叶斯随机特征回归方法,通过Huber污染集处理先验和似然误设,推导出后验预测密度的上下界,并引入不精确最高密度区域进行鲁棒不确定性量化,证明预测不确定性保持计算可行性并继承经典双重下降相位结构。

详情
AI中文摘要

我们提出了一种随机特征(RF)回归的鲁棒贝叶斯公式,通过Huber风格的污染集明确考虑先验和似然的误设。从岭正则化RF训练与高斯先验和似然的贝叶斯推断之间的经典等价性出发,我们分别用ε-和η-污染信度集替换单一先验和似然,并使用悲观广义贝叶斯更新进行推断。我们推导出所得后验预测密度的下界和上界的显式且可处理的界限。这些界限表明,当污染适中时,先验和似然模糊性有效地直接污染后验预测分布,产生围绕经典高斯预测的不确定性包络。我们引入了一个不精确最高密度区域(IHDR)用于鲁棒预测不确定性量化,并证明它可以通过调整的高斯可信区间进行有效近似。我们进一步获得了预测方差界限(在温和截断近似下得到上界),并证明它们保留了RF模型已知的领先阶比例增长渐近性。这些结果共同建立了贝叶斯随机特征的鲁棒性理论:预测不确定性保持计算可行性,继承经典的双重下降相位结构,并在有界先验和似然误设下通过显式最坏情况保证得到改进。

英文摘要

We propose a robust Bayesian formulation of random feature (RF) regression that accounts explicitly for prior and likelihood misspecification via Huber-style contamination sets. Starting from the classical equivalence between ridge-regularized RF training and Bayesian inference with Gaussian priors and likelihoods, we replace the single prior and likelihood with $ε$- and $η$-contaminated credal sets, respectively, and perform inference using pessimistic generalized Bayesian updating. We derive explicit and tractable bounds for the resulting lower and upper posterior predictive densities. These bounds show that, when contamination is moderate, prior and likelihood ambiguity effectively acts as a direct contamination of the posterior predictive distribution, yielding uncertainty envelopes around the classical Gaussian predictive. We introduce an Imprecise Highest Density Region (IHDR) for robust predictive uncertainty quantification and show that it admits an efficient approximation via an adjusted Gaussian credible interval. We further obtain predictive variance bounds (under a mild truncation approximation for the upper bound) and prove that they preserve the leading-order proportional-growth asymptotics known for RF models. Together, these results establish a robustness theory for Bayesian random features: predictive uncertainty remains computationally tractable, inherits the classical double-descent phase structure, and is improved by explicit worst-case guarantees under bounded prior and likelihood misspecification.

2602.19066 2026-06-02 cs.LG cs.AI 版本更新

IDLM: Inverse-distilled Diffusion Language Models

IDLM:逆蒸馏扩散语言模型

David Li, Nikita Gushchin, Dmitry Abulkhanov, Eric Moulines, Ivan Oseledets, Maxim Panov, Alexander Korotin

发表机构 * GitHub

AI总结 针对扩散语言模型推理慢的问题,提出逆蒸馏方法(IDLM),通过理论保证唯一解和梯度稳定松弛,实现4倍至64倍推理加速并保持生成质量。

Comments ICML 2026. We provide the code at: https://david-cripto.github.io/idlm-project-page

详情
AI中文摘要

扩散语言模型(DLM)最近在文本生成中取得了强劲成果。然而,其多步采样导致推理缓慢,限制了实际应用。为解决此问题,我们将逆蒸馏(一种最初为加速连续扩散模型而开发的技术)扩展到离散设置。然而,这种扩展引入了理论和实践上的挑战。从理论角度看,逆蒸馏目标缺乏唯一性保证,可能导致次优解。从实践角度看,离散空间中的反向传播非平凡且常不稳定。为克服这些挑战,我们首先提供理论结果,证明我们的逆形式具有唯一解,从而确保有效优化。然后,我们引入梯度稳定松弛以支持有效训练。最终,在多个DLM上的实验表明,我们的方法——逆蒸馏扩散语言模型(IDLM)——将推理步骤减少了4倍至64倍,同时保持了教师模型的生成质量。我们在项目页面上提供代码、模型检查点和视频教程:https://david-cripto.github.io/idlm-project-page。

英文摘要

Diffusion Language Models (DLMs) have recently achieved strong results in text generation. However, their multi-step sampling leads to slow inference, limiting practical use. To address this, we extend Inverse Distillation, a technique originally developed to accelerate continuous diffusion models, to the discrete setting. Nonetheless, this extension introduces both theoretical and practical challenges. From a theoretical perspective, the inverse distillation objective lacks uniqueness guarantees, which may lead to suboptimal solutions. From a practical standpoint, backpropagation in the discrete space is non-trivial and often unstable. To overcome these challenges, we first provide a theoretical result demonstrating that our inverse formulation admits a unique solution, thereby ensuring valid optimization. We then introduce gradient-stable relaxations to support effective training. As a result, experiments on multiple DLMs show that our method, Inverse-distilled Diffusion Language Models (IDLM), reduces the number of inference steps by 4x-64x, while preserving the teacher model's generation quality. We provide the code, model checkpoints, and video tutorials on the project page: https://david-cripto.github.io/idlm-project-page

2512.09730 2026-06-02 cs.CL cs.LG 版本更新

Interpreto: An Explainability Library for Transformers

Interpreto:一个用于Transformer的可解释性库

Antonin Poché, Thomas Mullor, Gabriele Sarti, Frédéric Boisnard, Corentin Friedrich, Charlotte Claye, François Hoofd, Raphael Bernas, Nicholas Asher, Céline Hudelot, Fanny Jourdan

发表机构 * IRT Saint Exupéry Toulouse(伊尔杜夫圣埃克苏佩里图卢斯) IRIT Toulouse(图卢兹IRIT) Khoury College of Computer Sciences(科赫里计算机科学学院) Ampere(阿姆佩尔) MICS, CentraleSupélec(MICS,中央超导学院) Scienta Lab(科学实验室) Thales Avionics(泰勒斯航空电子) ANITI

AI总结 Interpreto是一个开源Python库,通过归因方法和基于概念的解释,为HuggingFace语言模型(从早期BERT变体到LLM)提供统一的解释工作流,其端到端基于概念的流水线是主要创新。

Comments Accepted to ACL 2026 System Demonstration. Equal contribution: Poché and Jourdan

详情
AI中文摘要

Interpreto是一个用于解释HuggingFace语言模型(从早期BERT变体到LLM)的开源Python库。它提供了两类互补的方法:归因方法和基于概念的解释。该库通过为分类和文本生成提供统一的API来暴露解释工作流,从而连接了最新研究和实用工具。一个关键的区别在于其端到端的基于概念的流水线(从激活提取到概念学习、解释和评分),这超越了特征级归因,在现有库中并不常见。参见GitHub: https://github.com/FOR-sight-ai/interpreto 和演示网站: https://for-sight-ai.github.io/interpreto-demo/。

英文摘要

Interpreto is an open-source Python library for interpreting HuggingFace language models, from early BERT variants to LLMs. It provides two complementary families of methods: attribution methods and concept-based explanations. The library bridges recent research and practical tooling by exposing explanation workflows through a unified API for both classification and text generation. A key differentiator is its end-to-end concept-based pipeline (from activation extraction to concept learning, interpretation, and scoring), which goes beyond feature-level attributions and is uncommon in existing libraries. See GitHub: https://github.com/FOR-sight-ai/interpreto and the demo website: https://for-sight-ai.github.io/interpreto-demo/.

2602.18645 2026-06-02 cs.LG 版本更新

Adaptive Time Series Reasoning via Segment Selection

通过片段选择的自适应时间序列推理

Shvat Messica, Jiawen Zhang, Kevin Li, Theodoros Tsiligkaridis, Marinka Zitnik

发表机构 * harvard(哈佛大学) mit(麻省理工学院) mitll(MIT林肯实验室) hongkong(香港科学与技术大学)

AI总结 提出ARTIST框架,将时间序列推理建模为序列决策问题,通过控制器-推理器架构和强化学习自适应选择信息片段,提升推理准确率。

Comments ICML 2026

详情
AI中文摘要

时间序列推理任务通常从自然语言问题开始,需要对时间序列进行有针对性的分析。证据可能跨越整个序列或出现在少数短区间内,因此模型必须决定检查什么。大多数现有方法在推理前将整个时间序列编码为固定表示,而不考虑整个序列是否相关。我们引入ARTIST,将时间序列推理建模为序列决策问题。ARTIST将推理与自适应时间片段选择交错进行。它采用控制器-推理器架构,并使用强化学习训练控制器角色选择信息片段,推理角色生成基于片段条件的推理轨迹和最终答案。在推理过程中,模型主动获取任务相关信息,而不是依赖整个序列的静态摘要。我们采用一种新颖的分层策略优化方法进行后训练,使模型在片段选择和问答行为方面都表现出色。我们在六个时间序列推理基准上评估ARTIST,并与大语言模型、视觉语言模型以及先前的时间序列推理系统进行比较。ARTIST在最强基线上平均准确率提高了6.46个绝对百分点。最大的提升出现在罕见事件定位和多片段推理任务上。监督微调提高了性能,而强化学习通过优化问题自适应片段选择提供了额外增益。这些结果表明,选择性数据使用驱动了有效的时间序列推理。

英文摘要

Time series reasoning tasks often start with a natural language question and require targeted analysis of a time series. Evidence may span the full series or appear in a few short intervals, so the model must decide what to inspect. Most existing approaches encode the entire time series into a fixed representation before inference, regardless of whether or not the entire sequence is relevant. We introduce ARTIST, which formulates time-series reasoning as a sequential decision problem. ARTIST interleaves reasoning with adaptive temporal segment selection. It adopts a controller-reasoner architecture and uses reinforcement learning to train the controller role to select informative segments and the reasoner role to generate segment-conditioned reasoning traces and final answers. During inference, the model actively acquires task-relevant information instead of relying on a static summary of the full sequence. We use a novel hierarchical policy optimization approach for post-training that allows the model to excel in both segment selection and question-answering behavior. We evaluate ARTIST on six time-series reasoning benchmarks and compare it with large language models, vision-language models, and prior time-series reasoning systems. ARTIST improves average accuracy by 6.46 absolute percentage points over the strongest baseline. The largest gains appear on rare event localization and multi-segment reasoning tasks. Supervised fine-tuning improves performance, and reinforcement learning provides additional gains by optimizing question-adaptive segment selection. These results show that selective data use drives effective time-series reasoning.

2602.18195 2026-06-02 cs.LG cs.AI 版本更新

LERD: Latent Event-Relational Dynamics for Neurodegenerative Classification

LERD: 用于神经退行性疾病分类的潜在事件-关系动力学

Yicheng Feng, Hairong Chen, Ziyu Jia, Samir Bhatt, Hengguan Huang

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Washington(华盛顿大学) University of California, San Diego(加州大学圣地亚哥分校) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 提出LERD,一种端到端贝叶斯潜在事件-关系动力系统,直接从多通道脑电图推断潜在神经事件及其关系结构,无需事件或交互标注,在阿尔茨海默病分类中优于基线方法并提供生理对齐的动力学摘要。

详情
AI中文摘要

阿尔茨海默病(AD)会改变大脑电生理学并破坏多通道脑电图动力学,使得准确且临床有用的基于脑电图的诊断对于筛查和疾病监测越来越重要。然而,许多现有方法依赖黑盒分类器,并未明确建模其决策背后的潜在事件时序和跨通道协调。为解决这些局限,我们提出LERD,一种端到端贝叶斯潜在事件-关系动力系统,无需事件或交互标注,直接从多通道脑电图推断潜在神经事件及其关系结构。LERD结合连续时间事件推断模块与随机事件生成过程以捕获灵活的时间模式,同时融入电生理学启发的动力学先验以原则性方式指导学习。我们进一步提供理论分析,得到基于初值问题的可处理KL正则化项以及推断关系动力学的稳定性保证。在合成基准和两个真实世界AD脑电图队列上的大量实验表明,LERD一致优于强基线,并生成与生理对齐的速率、时序和图摘要,有助于刻画组级动力学差异。

英文摘要

Alzheimer's disease (AD) alters brain electrophysiology and disrupts multichannel EEG dynamics, making accurate and clinically useful EEG-based diagnosis increasingly important for screening and disease monitoring. However, many existing approaches rely on black-box classifiers and do not explicitly model the latent event timing and cross-channel coordination behind their decisions. To address these limitations, we propose LERD, an end-to-end Bayesian latent event--relational dynamical system that infers latent neural events and their relational structure directly from multichannel EEG without event or interaction annotations. LERD combines a continuous-time event inference module with a stochastic event-generation process to capture flexible temporal patterns, while incorporating an electrophysiology-inspired dynamical prior to guide learning in a principled way. We further provide theoretical analysis that yields a tractable IVP-based KL regularizer and stability guarantees for the inferred relational dynamics. Extensive experiments on synthetic benchmarks and two real-world AD EEG cohorts demonstrate that LERD consistently outperforms strong baselines and yields physiology-aligned rate, timing, and graph summaries that help characterize group-level dynamical differences.

2602.18008 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Are LLMs Ready for Neural-integrated Mechanistic Modeling? A Benchmark and Agentic Framework

LLM 是否准备好进行神经集成机制建模?一个基准测试与智能体框架

Zihan Guan, Rituparna Datta, Mengxuan Hu, Shunshun Liu, Aiying Zhang, Prasanna Balachandran, Sheng Li, Anil Vullikanti

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出神经集成机制建模(NIMM)基准测试,评估大语言模型在三个科学领域构建神经集成机制模型的能力,并设计树引导的智能体框架 NIMMGen,通过分支级搜索和原子模型细化显著提升搜索稳定性和解质量。

Comments 25 pages, 8 figures

详情
AI中文摘要

大语言模型(LLM)在从数据构建机制模型方面显示出潜力。然而,现有评估主要关注简化设置,未能捕捉真实世界科学建模的复杂性。在实践中,此类建模通常涉及神经集成公式,其中机制模型组件和神经网络组件共同构建,导致搜索空间显著复杂化。受此差距驱动,我们引入了神经集成机制建模(NIMM)基准测试,该基准测试评估 LLM 生成的神经集成机制模型在三个科学领域上的表现。在 NIMM 上的实验表明,现有基于 LLM 的方法难以有效探索这一复杂空间,导致搜索稳定性和解质量有限。为应对这一挑战,我们提出了 NIMMGen,一种树引导的智能体框架,通过分支级搜索实现多样化探索,并通过原子模型细化改进解。大量实验表明,NIMMGen 在 NIMM 上达到了最先进的性能,显著提升了搜索稳定性和解质量。

英文摘要

Large language models (LLMs) have shown promise in constructing mechanistic models from data. However, existing evaluations largely focus on simplified settings and fail to capture the complexity of real-world scientific modeling. In practice, such modeling often involves neural-integrated formulations, where a mechanistic model component and a neural network component are jointly constructed, leading to a significantly more complex search space. Motivated by this gap, we introduce the Neural-Integrated Mechanistic Modeling (NIMM) benchmark, which evaluates LLM-generated neural-integrated mechanistic models across three scientific domains. Experiments on NIMM reveal that existing LLM-based approaches struggle to effectively explore this complex space, resulting in limited search stability and solution quality. To address this challenge, we propose NIMMGen, a tree-guided agentic framework that enables diversified exploration via branch-level search and improves solutions through atomic model refinement. Extensive experiments demonstrate that NIMMGen achieves state-of-the-art performance on NIMM, significantly improving search stability and solution quality.

2602.17737 2026-06-02 cs.RO cs.LG cs.MA 版本更新

NestRL: A Nested Training Regime for Mutual Adaptation in Human-AI Teaming

NestRL: 一种用于人机团队中相互适应的嵌套训练机制

Upasana Biswas, Durgesh Kalwar, Subbarao Kambhampati, Sarath Sreedharan

发表机构 * School of Computing and AI, Arizona State University(计算与人工智能学院,亚利桑那州立大学) Department of Computer Science, Colorado State University(计算机科学系,科罗拉多州立大学)

AI总结 针对人机团队中相互适应的挑战,提出嵌套训练机制NestRL,通过分层训练代理对抗自适应对手,避免产生不透明的协调策略,在Overcooked领域实现更高的任务性能和适应性。

详情
AI中文摘要

相互适应是人机团队中的一个核心挑战,因为人类会自然地根据AI代理的行为调整自己的策略。现有方法试图通过多样化训练伙伴来近似人类行为;然而,这些伙伴通常是静态的,无法捕捉人类队友的适应性。当代理在标准多智能体设置中联合训练时,它们常常收敛到不透明的协调策略,这些策略仅适用于共同训练的伙伴,导致泛化能力差。为了建模自适应的人类行为,我们将人机团队问题形式化为交互式部分可观测马尔可夫决策过程(I-POMDP)。我们提出NestRL,一种嵌套训练机制,通过在每个层级上训练代理对抗来自下一层级的自适应代理,来学习有限层级I-POMDP的解。这使代理暴露于自适应行为,同时防止出现不透明的协调策略。我们提供了理论分析,表明NestRL代理避免了收敛到特定伙伴的策略,并在Overcooked领域通过与最先进的基线进行实证验证。NestRL在与未见过的自适应代理和真实人类队友合作时均实现了更高的任务性能,同时在交互过程中表现出显著更强的适应性。

英文摘要

Mutual adaptation is a central challenge in human-AI teaming, as humans naturally adjust their strategies in response to an AI agent's behavior. Existing approaches attempt to approximate human behavior by diversifying training partners; however, these partners are typically static and fail to capture the adaptive nature of human teammates. When agents are trained jointly in standard multi-agent settings, they often converge to opaque coordination strategies that work only with their co-trained partners, leading to poor generalization. To model adaptive human behavior, we formulate human-AI teaming as an Interactive Partially Observable Markov Decision Process (I-POMDP). We propose NestRL, a nested training regime that learns the solution to a finite-level I-POMDP by training agents at each level against adaptive agents from the level below. This exposes agents to adaptive behavior while preventing emergence of opaque coordination strategies. We provide theoretical analysis showing that NestRL agents avoid convergence to partner-specific strategies, and validate this empirically in the Overcooked domain against state-of-the-art baselines. NestRL achieves higher task performance with both unseen adaptive agents and real human teammates, while exhibiting significantly greater adaptability over the course of interaction.

2602.17706 2026-06-02 cs.LG 版本更新

Parallel Complex Diffusion for Scalable Time Series Generation

并行复数扩散用于可扩展时间序列生成

Rongyao Cai, Yuxi Wan, Kexin Zhang, Ming Jin, Zhiqiang Ge, Qingsong Wen, Yong Liu

发表机构 * Institute of Cyber-Systems and Control(网络系统与控制研究所) Zhejiang University(浙江大学) Griffith University(格里菲斯大学) School of Mathematics(数学学院) Southeast University(东南大学) Squirrel Ai Learning

AI总结 提出PaCoDi(并行复数扩散)框架,通过离散傅里叶变换将时间序列分解到谱域,利用并行实值估计器替代复数估计器,解决时间序列生成中的纠缠问题,理论证明谱高斯噪声的正交性,并引入平均场理论近似处理边缘耦合,在无条件和条件生成任务中优于5个基线。

Comments Proceedings of the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2 (KDD '26). Extended Version with Full Proofs

详情
AI中文摘要

扩散模型通过去噪间接学习数据分布,使得生成建模的难度与数据的依赖结构密切相关。对于时间序列,强时间依赖性迫使噪声/分数估计器恢复高度纠缠的跨时间关系,导致纠缠问题。我们通过改变扩散空间的拓扑结构来减轻这一负担:离散傅里叶变换(DFT)将时间依赖分解为谱模式,对角化二阶依赖结构,使数据流形与各向同性高斯噪声和均匀扩散动力学更好地对齐。然而,现有的频率感知扩散方法主要使用DFT设计时间DDPM/SDE框架下的估计器模块,而频率原生扩散路径面临复数动力学带来的数学障碍。我们提出PaCoDi(并行复数扩散),一种频率原生扩散框架,在谱域构建扩散路径,同时用实部和虚部的并行实值估计器替代复数估计器。理论上,我们证明了谱高斯噪声的统计正交性,建立了正交前向转移和条件反向分解,并通过谱维纳过程将离散PaCoDi扩展到连续时间谱SDE。我们进一步引入带有交互校正分支的平均场理论近似来处理边缘耦合,并利用厄米对称性减少50%的注意力FLOPs而无信息损失。在无条件和条件时间序列生成上的大量实验表明,在5个基准测试中,生成质量和计算效率分别优于5个最先进基线。代码可在https://github.com/RongyaoCai/PaCoDi获取。

英文摘要

Diffusion models learn data distributions indirectly through denoising, making the difficulty of generative modeling closely tied to the dependency structure of data. For time series, strong temporal dependence forces the noise / score estimator to recover highly entangled cross-time relationships, leading to the curse of entanglement. We mitigate this burden by changing the topology of the diffusion space: the Discrete Fourier Transform (DFT) decomposes temporal dependencies into spectral modes, diagonalizing second-order dependency structure and better aligning the data manifold with isotropic Gaussian noise and homogeneous diffusion dynamics. However, existing frequency-aware diffusion methods mainly use the DFT to design estimator blocks under temporal DDPM/SDE frameworks, while frequency-native diffusion paths face a mathematical barrier from complex-valued dynamics. We propose PaCoDi (Parallel Complex Diffusion), a frequency-native diffusion framework that constructs the diffusion path in the spectral domain while replacing the complex-valued estimator with parallel real-valued estimators for real and imaginary components. Theoretically, we prove the statistical orthogonality of spectral Gaussian noise, establish quadrature forward transitions and conditional reverse factorization, and extend discrete PaCoDi to continuous-time spectral SDEs through a Spectral Wiener Process. We further introduce a Mean Field Theory approximation with an Interactive Correction Branch to handle marginal coupling, and exploit Hermitian symmetry to reduce 50% attention FLOPs without information loss. Extensive experiments on unconditional and conditional time series generation demonstrate superior generative quality and computational efficiency against 5 SOTA baselines in 5 benchmarks, respectively. Code is available at https://github.com/RongyaoCai/PaCoDi.

2602.16794 2026-06-02 stat.ML cs.LG 版本更新

Beyond Procedure: Substantive Fairness in Conformal Prediction

超越程序:共形预测中的实质性公平

Pengqi Liu, Zijun Yu, Mouloud Belbahri, Arthur Charpentier, Masoud Asgharian, Jesse C. Cresswell

发表机构 * University of Montreal(蒙特利尔大学)

AI总结 本文通过理论分解和LLM辅助评估,研究共形预测中标签聚类方法如何平衡效用与实质性公平,并发现均衡集合大小比覆盖度更能提升公平性。

Comments Camera-ready version. Accepted at ICML 2026

详情
AI中文摘要

共形预测(CP)为机器学习模型提供了无分布的不确定性量化,但其在下游决策中与公平性的相互作用仍未充分探索。超越将CP视为独立操作(程序公平),我们分析整体决策流程以评估实质性公平——下游结果的公平性。理论上,我们推导出一个上界,将预测集大小差异分解为可解释的组成部分,阐明标签聚类CP如何帮助控制方法驱动的对不公平的贡献。为了促进可扩展的实证分析,我们引入了一个LLM在环评估器,它近似人类对跨多种模态的实质性公平的评估。我们的实验表明,标签聚类CP通常在效用和实质性公平之间提供了有利的平衡,同时根据我们的理论减少了集合大小差异。最后,我们实证表明,均衡的集合大小(而非覆盖度)与实质性公平的改善强相关,使从业者能够设计更公平的CP系统。我们的代码可在https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness获取。

英文摘要

Conformal prediction (CP) offers distribution-free uncertainty quantification for machine learning models, yet its interplay with fairness in downstream decision-making remains underexplored. Moving beyond CP as a standalone operation (procedural fairness), we analyze the holistic decision-making pipeline to evaluate substantive fairness-the equity of downstream outcomes. Theoretically, we derive an upper bound that decomposes prediction-set size disparity into interpretable components, clarifying how label-clustered CP helps control method-driven contributions to unfairness. To facilitate scalable empirical analysis, we introduce an LLM-in-the-loop evaluator that approximates human assessment of substantive fairness across diverse modalities. Our experiments show that label-clustered CP often provides a favorable balance between utility and substantive fairness, while reducing set-size disparities in line with our theory. Finally, we empirically show that equalized set sizes, rather than coverage, strongly correlate with improved substantive fairness, enabling practitioners to design more fair CP systems. Our code is available at https://github.com/layer6ai-labs/llm-in-the-loop-conformal-fairness.

2602.16745 2026-06-02 cs.LG cs.AI 版本更新

PETS: A Principled Framework Towards Optimal Trajectory Allocation for Efficient Test-Time Self-Consistency

PETS:一种面向高效测试时自一致性的最优轨迹分配原则性框架

Zhangyi Liu, Huaizhi Qu, Xiaowei Yin, He Sun, Yanjun Han, Tianlong Chen, Zhun Deng

发表机构 * Stanford University(斯坦福大学) UNC at Chapel Hill(Chapel Hill 大学) Yale University(耶鲁大学) New York University(纽约大学)

AI总结 提出PETS框架,通过将轨迹分配建模为优化问题并引入自一致性率度量,在离线(连接众包理论)和在线流式场景下实现样本高效的测试时自一致性,显著降低采样预算。

详情
AI中文摘要

测试时扩展可以通过聚合随机推理轨迹来提高模型性能。然而,在有限预算下实现样本高效的测试时自一致性仍然是一个开放的挑战。我们引入了PETS(原则性且高效的测试时自一致性),它通过一个优化框架启动了对轨迹分配的原则性研究。我们方法的核心是自一致性率,这是一个新定义的度量,即与无限预算多数投票的一致性。这一公式使样本高效的测试时分配在理论上具有坚实基础,并适合严格分析。我们研究了离线和在线两种设置。在离线模式下,所有问题事先已知,我们将轨迹分配与众包(一个经典且成熟的研究领域)联系起来,将推理轨迹建模为工人。这种视角使我们能够利用丰富的现有理论,获得理论保证和一种高效的基于多数投票的分配算法。在在线流式模式下,问题顺序到达且必须实时做出分配,我们提出了一种受离线框架启发的新方法。我们的方法根据问题难度调整预算,同时保持强大的理论保证和计算效率。实验表明,PETS始终优于均匀分配。在GPQA上,PETS在两种设置下均实现了完美的自一致性,同时相对于均匀分配将采样预算减少了高达75%(离线)和55%(在线)。代码可在https://github.com/ZDCSlab/PETS获取。

英文摘要

Test-time scaling can improve model performance by aggregating stochastic reasoning trajectories. However, achieving sample-efficient test-time self-consistency under a limited budget remains an open challenge. We introduce PETS (Principled and Efficient Test-TimeSelf-Consistency), which initiates a principled study of trajectory allocation through an optimization framework. Central to our approach is the self-consistency rate, a new measure defined as agreement with the infinite-budget majority vote. This formulation makes sample-efficient test-time allocation theoretically grounded and amenable to rigorous analysis. We study both offline and online settings. In the offline regime, where all questions are known in advance, we connect trajectory allocation to crowdsourcing, a classic and well-developed area, by modeling reasoning traces as workers. This perspective allows us to leverage rich existing theory, yielding theoretical guarantees and an efficient majority-voting-based allocation algorithm. In the online streaming regime, where questions arrive sequentially and allocations must be made on the fly, we propose a novel method inspired by the offline framework. Our approach adapts budgets to question difficulty while preserving strong theoretical guarantees and computational efficiency. Experiments show that PETS consistently outperforms uniform allocation. On GPQA, PETS achieves perfect self-consistency in both settings while reducing the sampling budget by up to 75% (offline) and 55% (online) relative to uniform allocation. Code is available at https://github.com/ZDCSlab/PETS.

2602.16220 2026-06-02 cs.LG 版本更新

SEMixer: Semantics Enhanced MLP-Mixer for Multiscale Mixing and Long-term Time Series Forecasting

SEMixer: 语义增强的MLP-Mixer用于多尺度混合和长期时间序列预测

Xu Zhang, Qitong Wang, Peng Wang, Wei Wang

发表机构 * Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence Fudan University(上海数据科学 key 实验室,复旦大学计算机科学与人工智能学院) Harvard University(哈佛大学)

AI总结 提出SEMixer模型,通过随机注意力机制和多尺度渐进混合链,有效建模多尺度时间依赖并解决语义鸿沟问题,在10个公开数据集和真实无线网络数据上取得优异性能。

Comments This work is accepted by the proceedings of the ACM Web Conference 2026 (WWW 2026). The code is available at the link https://github.com/Meteor-Stars/SEMixer

详情
AI中文摘要

建模多尺度模式对于长期时间序列预测(TSF)至关重要。然而,时间序列中的冗余和噪声,以及非相邻尺度之间的语义鸿沟,使得高效对齐和集成多尺度时间依赖具有挑战性。为此,我们提出了SEMixer,一种专为长期TSF设计的轻量级多尺度模型。SEMixer包含两个关键组件:随机注意力机制(RAM)和多尺度渐进混合链(MPMC)。RAM在训练期间捕获多样化的时间块交互,并通过推理时的dropout集成进行聚合,增强了块级语义,使MLP-Mixer能够更好地建模多尺度依赖。MPMC进一步以内存高效的方式堆叠RAM和MLP-Mixer,实现更有效的时间混合。它解决了跨尺度的语义鸿沟,促进了更好的多尺度建模和预测性能。我们不仅在10个公开数据集上验证了SEMixer的有效性,还在基于21GB真实无线网络数据的 extit{2025 CCF AlOps Challenge}中取得了第三名。代码可在链接https://github.com/Meteor-Stars/SEMixer获取。

英文摘要

Modeling multiscale patterns is crucial for long-term time series forecasting (TSF). However, redundancy and noise in time series, together with semantic gaps between non-adjacent scales, make the efficient alignment and integration of multi-scale temporal dependencies challenging. To address this, we propose SEMixer, a lightweight multiscale model designed for long-term TSF. SEMixer features two key components: a Random Attention Mechanism (RAM) and a Multiscale Progressive Mixing Chain (MPMC). RAM captures diverse time-patch interactions during training and aggregates them via dropout ensemble at inference, enhancing patch-level semantics and enabling MLP-Mixer to better model multi-scale dependencies. MPMC further stacks RAM and MLP-Mixer in a memory-efficient manner, achieving more effective temporal mixing. It addresses semantic gaps across scales and facilitates better multiscale modeling and forecasting performance. We not only validate the effectiveness of SEMixer on 10 public datasets, but also on the \textit{2025 CCF AlOps Challenge} based on 21GB real wireless network data, where SEMixer achieves third place. The code is available at the link https://github.com/Meteor-Stars/SEMixer.

2602.05139 2026-06-02 cs.LG 版本更新

Adaptive Exploration for Latent-State Bandits

潜在状态赌博机的自适应探索

Jikai Jin, Kenneth Hung, Sanath Kumar Krishnamurthy, Baoyi Shi, Congshan Zhang

发表机构 * The Institute for Computational and Mathematical Engineering(计算与数学工程研究所) Stanford University(斯坦福大学) Meta Platforms, Inc.(Meta平台公司) Ads Online Experimentation(广告在线实验部) Central Applied Science(应用科学中央研究所)

AI总结 针对奖励依赖于未观测马尔可夫状态的赌博机问题,提出基于LinUCB的自适应算法,通过滞后动作-奖励对和探针指纹两种摘要来区分状态,并采用残差、边际和过时测试动态更新指纹,在合成压力测试中相比标准、对抗和非平稳基线降低了动态遗憾。

Comments 12 pages, 3 figures, 5 tables

详情
AI中文摘要

我们研究奖励依赖于未观测马尔可夫状态的赌博机,该状态独立于学习者的动作演化。即使学习者只观测过去的动作和奖励,最优臂也可能发生变化。我们提出的算法将隐藏状态的两种摘要馈送给LinUCB:滞后动作-奖励对,以及(当可用时)由多个臂的奖励形成的探针指纹。自适应变体使用残差、边际和过时测试刷新指纹。在关于状态数量、转移速率、噪声和时域的综合压力测试中,当这些摘要能够区分状态并足够频繁地更新时,这些方法相对于标准、对抗和非平稳赌博机基线减少了动态遗憾。消融和误设测试识别了主要失败模式:弱指纹分离、高噪声以及顺序探针期间的状态变化。

英文摘要

We study bandits whose rewards depend on an unobserved Markov state that evolves independently of the learner's actions. The optimal arm can change even though the learner observes only past actions and rewards. We propose algorithms that feed LinUCB with two summaries of the hidden state: a lagged action-reward pair and, when available, a probe fingerprint formed from rewards of multiple arms. The adaptive variants refresh the fingerprint using residual, margin, and staleness tests. In synthetic stress tests over state count, transition rate, noise, and horizon, these methods reduce dynamic regret relative to standard, adversarial, and non-stationary bandit baselines when the summaries distinguish states and are updated often enough. Ablations and misspecification tests identify the main failure modes: weak fingerprint separation, high noise, and state changes during sequential probes.

2602.15259 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Knowing Isn't Understanding: Re-grounding Generative Proactivity with Epistemic and Behavioral Insight

知道不等于理解:用认知与行为洞察重新奠定生成式主动性

Kirandeep Kaur, Xingda Lyu, Chirag Shah

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Toronto(多伦多大学) University of Waterloo(滑铁卢大学)

AI总结 针对用户无法明确表达需求时的认知不完整问题,提出生成式主动性需要基于认知和行为双重约束来设计负责任的主动代理。

Comments 43 rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

生成式AI代理将理解等同于解决显式查询,这一假设将交互限制在用户能够表达的范围内。当用户自身缺乏对缺失、风险或值得考虑之事的意识时,这一假设就会失效。在这种情况下,主动性不仅是效率提升,更是一种认知上的必要性。我们将这种状态称为认知不完整:即进步依赖于处理未知的未知以实现有效协作。现有的主动性方法仍然局限于预测性,从过去行为中推断并假定目标已经明确,从而未能有意义地支持用户。然而,揭示超出用户当前意识的可能性并非天然有益。不受约束的主动干预可能误导注意力、使用户不堪重负或引入伤害。因此,主动代理需要行为锚定:对代理何时、如何以及在何种程度上进行干预施加原则性约束。我们主张生成式主动性必须在认知和行为上双重锚定。借鉴无知哲学和主动行为研究,我们认为这些理论为设计能够负责任地参与并促进有意义协作的代理提供了关键指导。

英文摘要

Generative AI agents equate understanding with resolving explicit queries, an assumption that confines interaction to what users can articulate. This assumption breaks down when users themselves lack awareness of what is missing, risky, or worth considering. In such conditions, proactivity is not merely an efficiency enhancement, but an epistemic necessity. We refer to this condition as epistemic incompleteness: where progress depends on engaging with unknown unknowns for effective partnership. Existing approaches to proactivity remain narrowly anticipatory, extrapolating from past behavior and presuming that goals are already well defined, thereby failing to support users meaningfully. However, surfacing possibilities beyond a user's current awareness is not inherently beneficial. Unconstrained proactive interventions can misdirect attention, overwhelm users, or introduce harm. Proactive agents, therefore, require behavioral grounding: principled constraints on when, how, and to what extent an agent should intervene. We advance the position that generative proactivity must be grounded both epistemically and behaviorally. Drawing on the philosophy of ignorance and research on proactive behavior, we argue that these theories offer critical guidance for designing agents that can engage responsibly and foster meaningful partnerships.

2602.14849 2026-06-02 cs.LG cs.AI cs.DC cs.MA 版本更新

Atomix: Timely, Transactional Tool Use for Reliable Agentic Workflows

Atomix: 用于可靠智能体工作流的及时事务性工具使用

Bardia Mohammadi, Nearchos Potamitis, Lars Klein, Akhil Arora, Laurent Bindschaedler

发表机构 * Max Planck Institute for Software Systems(马克斯·普朗克软件系统研究所) Aarhus University(奥胡斯大学) EPFL(苏黎世联邦理工学院)

AI总结 针对LLM智能体多步工作流中因故障、推测和并发导致的状态不一致问题,提出Atomix系统,通过进度感知事务将效果分组与冲突解决分离,实现可靠提交与回滚。

详情
AI中文摘要

LLM智能体执行多步工作流,通过工具改变外部状态。常见的编排器将工具返回视为结算触发器,因此故障、推测和并发智能体可能留下部分效果、丢失分支残留、陈旧写入或不可逆发送。正确的结算需要两个事实,而重试、检查点重放、锁和补偿各自混淆了这些事实:哪些效果必须一起结算,以及何时较早的冲突工作已耗尽。Atomix通过进度感知事务使这种分离明确化。运行时在执行期间记录读取和效果,当足迹完成时密封事务,并且仅在每个资源的前沿显示没有更早的冲突工作可能到达后才提交。提交是最终结算:Atomix释放可缓冲效果,接受可逆外部效果为最终状态,并让不可逆效果离开。中止抑制未释放的效果,并在可能的情况下补偿外部化的可逆效果。在代表性智能体工作负载上,这种组合在注入故障下改善了干净恢复,隔离了竞争和推测工作,并防止了正确分类的不可逆动作泄漏;微基准测试显示相对于工具延迟的微秒级包装开销。

英文摘要

LLM agents execute multi-step workflows that mutate external state through tools. Common orchestrators treat tool return as the settlement trigger, so faults, speculation, and concurrent agents can leave partial effects, losing-branch residue, stale writes, or irreversible sends. Correct settlement needs two facts that retries, checkpoint replay, locks, and compensation each conflate: which effects must settle together, and when earlier conflicting work is exhausted. Atomix makes this split explicit with progress-aware transactions. The runtime records reads and effects during execution, seals a transaction when its footprint is complete, and commits only after per-resource frontiers show that no earlier conflicting work can still arrive. Commit is final settlement: Atomix releases bufferable effects, accepts reversible external effects as final, and lets irreversible effects leave the gate. Abort suppresses unreleased effects and compensates externalized reversible effects where possible. On representative agent workloads, this composition improves clean recovery under injected faults, isolates contending and speculative work, and prevents correctly classified irreversible actions from leaking; microbenchmarks show microsecond-scale wrapper overhead relative to tool latency.

2602.14134 2026-06-02 cs.CV cs.AI cs.LG 版本更新

DenseMLLM: Standard Multimodal LLMs for Dense Prediction

DenseMLLM:用于密集预测的标准多模态大语言模型

Yi Li, Hongze Shen, Lexiang Tang, Xin Li, Xinpeng Ding, Yinsong Liu, Deqiang Jiang, Xing Sun, Xiaomeng Li

发表机构 * Department of Electronic and Computer Engineering, The Hong Kong University of Science and Technology, Hong Kong, China(香港科技大学电子与计算机工程系) Tencent, Youtu-Lab, China(腾讯优图实验室)

AI总结 提出DenseMLLM,通过标准多模态大语言模型架构和视觉令牌监督策略,无需任务特定解码器即可实现语义分割、深度估计等密集预测任务,在多个基准上取得竞争性能。

Comments ICML 2026

详情
AI中文摘要

多模态大语言模型在高层次视觉理解方面展现出卓越能力。然而,将这些模型扩展到细粒度的密集预测任务(如语义分割和深度估计)通常需要引入复杂的任务特定解码器和其他定制化组件。这种架构碎片化增加了模型复杂度,偏离了多模态大语言模型的通用设计,最终限制了其实用性。在这项工作中,我们挑战了这一范式,通过调整标准多模态大语言模型来执行密集预测,无需额外的任务特定解码器。所提出的模型称为DenseMLLM,基于标准架构,并采用一种新颖的视觉令牌监督策略来处理多个标签和任务。尽管设计极简,我们的模型在广泛的密集预测和视觉语言基准测试中取得了极具竞争力的性能,表明标准的通用多模态大语言模型可以在没有架构专门化的情况下有效支持密集感知。该项目可在github.com/Eli-YiLi/DenseMLLM获取。

英文摘要

Multimodal Large Language Models (MLLMs) have demonstrated exceptional capabilities in high-level visual understanding. However, extending these models to fine-grained dense prediction tasks, such as semantic segmentation and depth estimation, typically necessitates the incorporation of complex, task-specific decoders and other customizations. This architectural fragmentation increases model complexity and deviates from the generalist design of MLLMs, ultimately limiting their practicality. In this work, we challenge this paradigm by accommodating standard MLLMs to perform dense predictions without requiring additional task-specific decoders. The proposed model is called DenseMLLM, grounded in the standard architecture with a novel vision token supervision strategy for multiple labels and tasks. Despite its minimalist design, our model achieves highly competitive performance across a wide range of dense prediction and vision-language benchmarks, demonstrating that a standard, general-purpose MLLM can effectively support dense perception without architectural specialization. This project is available at github.com/Eli-YiLi/DenseMLLM.

2602.13940 2026-06-02 cs.LG cs.AI 版本更新

You Can Learn Tokenization End-to-End with Reinforcement Learning

你可以通过强化学习端到端地学习分词

Sam Dauncey, Roger Wattenhofer

发表机构 * University of Waterloo(滑铁卢大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出使用强化学习中的得分函数估计来学习离散分词边界,通过时间折扣等技巧降低方差,在1亿参数规模上优于先前的直通估计方法。

Comments ICML 2026 camera-ready

详情
AI中文摘要

分词是一个硬编码的压缩步骤,尽管架构总体上趋向于端到端,但它仍然保留在大语言模型(LLM)的训练流程中。先前的工作在大规模上展示了有希望的结果,通过启发式方法将这一压缩步骤引入LLM架构内部以绘制分词边界,并尝试使用直通估计来学习这些分词边界,直通估计将绘制离散分词边界的问题视为连续问题。我们表明,这些分词边界可以通过得分函数估计来学习,由于直接优化绘制离散分词边界以最小化损失的问题,得分函数估计具有更严格的理论保证。我们观察到,强化学习中的技术,如时间折扣,对于充分降低该得分函数的方差以使其可行是必要的。我们证明,所得到的方法在1亿参数规模上,在定性和定量上都优于先前提出的直通估计方法。

英文摘要

Tokenization is a hardcoded compression step which remains in the training pipeline of Large Language Models (LLMs), despite a general trend towards architectures becoming increasingly end-to-end. Prior work has shown promising results at scale in bringing this compression step inside the LLMs' architecture with heuristics to draw token boundaries, and also attempts to learn these token boundaries with straight-through estimates, which treat the problem of drawing discrete token boundaries as a continuous one. We show that these token boundaries can instead be learned using score function estimates, which have tighter theoretical guarantees due to directly optimizing the problem of drawing discrete token boundaries to minimize loss. We observe that techniques from reinforcement learning, such as time discounting, are necessary to reduce the variance of this score function sufficiently to make it practicable. We demonstrate that the resultant method outperforms prior proposed straight-through estimates, both qualitatively and quantitatively at the $100$ million parameter scale.

2602.13937 2026-06-02 cs.LG cs.SE 版本更新

iML: Executable, Problem-Grounded, and Broadly Exploratory Code-Driven AutoML

iML: 可执行、问题驱动且广泛探索的代码驱动自动机器学习

Dat Le, Duc-Cuong Le, Anh-Son Nguyen, Tuan-Dung Bui, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

发表机构 * Faculty of Information Technology, VNU University of Engineering and Technology(信息科技学院,越南工程与技术大学)

AI总结 提出iML多智能体框架,通过任务分析、数据剖析、结构化蓝图生成和模块化代码合成,实现可执行、问题驱动且广泛探索的代码驱动AutoML,在MLE-BENCH和iML-BENCH上显著优于基线。

详情
AI中文摘要

自动机器学习(AutoML)改善了机器学习的可访问性,但现有技术通常在灵活性、透明度和执行可靠性方面仍然有限。代码驱动的AutoML通过合成用于预处理、模型训练和评估的可执行代码,提供了一个有前景的方向。然而,当前基于LLM的方法经常生成在文本上合理但在执行中脆弱、未充分基于实际数据集或局限于狭窄解决方案路径的代码。在本文中,我们介绍了iML,一个多智能体代码驱动AutoML框架,围绕三个需求设计:可执行性、问题驱动和对有效解决方案的广泛探索。iML首先分析任务并剖析数据,然后合成一个结构化的蓝图,指导跨多个实现轨道的模块化代码生成,包括传统机器学习、预训练适应和自定义神经架构。为了提高可靠性,iML在集成过程中强制执行接口检查、动态执行和迭代调试。我们在MLE-BENCH和新引入的iML-BENCH上评估iML,涵盖多样化的Kaggle风格任务。在MLE-BENCH上,iML达到了90%的有效提交率和45%的奖牌率,以及0.82的APS,将基于LLM的基线的平均标准化性能分数(APS)提高了52%-273%。在iML-BENCH上,它实现了最高的APS,并且即使在任务描述被大幅简化时也表现出稳健的性能。这些结果确立了iML作为代码驱动AutoML的可靠且有竞争力的框架。

英文摘要

Automated Machine Learning (AutoML) has improved access to machine learning, yet existing techniques often remain limited in flexibility, transparency, and execution reliability. Code-driven AutoML offers a promising direction by synthesizing executable code for preprocessing, model training, and evaluation. However, current LLM-based approaches frequently generate code that is plausible in text yet brittle in execution, insufficiently grounded in the actual dataset, or restricted to narrow solution paths. In this paper, we introduce iML, a multi-agent code-driven AutoML framework designed around three requirements: executability, problem grounding, and broad exploration of valid solutions. iML first analyzes the task and profiles the data, then synthesizes a structured blueprint that guides modular code generation across multiple implementation tracks, including traditional ML,pretrained adaptation, and custom neural architectures. To improve reliability, iML enforces interface checking, dynamic execution, and iterative debugging during integration. We evaluate iML on MLE-BENCH and the newly introduced iML-BENCH, covering diverse Kaggle-style tasks. On MLE-BENCH, iML attains a 90% valid submission rate and a 45% medal rate, and an APS of 0.82, improving the average standardized performance score (APS) over the LLM-based baselines by 52%-273%. On iML-BENCH, it achieves the highest APS and demonstrates robust performance even when task descriptions are substantially stripped. These results establish iML as a reliable and competitive framework for code-driven AutoML.

2602.13906 2026-06-02 stat.ML cs.LG 版本更新

How Accurately Can a Gaussian Approximate Stochastic Approximation Iterates?

高斯分布能以多高的精度近似随机逼近迭代?

Shaan Ul Haque, Zedong Wang, Zixuan Zhang, Siva Theja Maguluri

AI总结 本文通过递归定义协方差的高斯序列来近似随机逼近迭代的有限时间分布,并给出了Wasserstein-1距离的显式界,从而得到误差的尾部界和渐近正态性的收敛速率。

Comments 63 pages, 6 figures

详情
AI中文摘要

随机逼近(SA)是一种在噪声干扰下寻找算子根的方法。本文重点研究SA迭代在有限时间内的分布。通常,精确分布难以刻画,因此我们的目标是找到一种能提供有用尾部界的近似。受重缩放SA迭代渐近正态性丰富文献的启发,我们通过协方差递归定义的高斯序列来近似极限前分布。特别地,我们建立了在时间$k$处重缩放迭代与前述高斯之间Wasserstein-1距离的显式界,适用于多种步长选择。由于这些协方差收敛到经典渐近极限,我们的分析也附带给出了渐近正态性的收敛速率。作为界的直接推论,我们得到了任意时刻SA迭代误差的尾部界。最后,通过匹配下界证明了速率的尖锐性,并通过模拟验证了结果。我们首先研究由一般噪声驱动的离散Ornstein-Uhlenbeck(O-U)过程的收敛速率,其平稳分布与重缩放SA迭代的极限高斯分布相同,从而获得尖锐速率。鉴于其与采样文献的联系,我们认为这具有独立意义。分析涉及调整Stein方法进行高斯近似,以处理独立同分布随机变量的矩阵加权和。通过刻画重缩放SA迭代与离散时间O-U过程之间的误差动态,并结合后者的收敛速率,得到了所需的SA有限时间界。

英文摘要

Stochastic approximation (SA) is a method for finding the root of an operator perturbed by noise. The focus of this paper is studying the distribution of SA iterates in finite time. In general, it is not possible to characterize the exact distribution, and therefore our goal is to find an approximation which can yield useful tail bounds. Inspired by the rich literature on the asymptotic normality of rescaled SA iterates, we approximate the pre-limit distributions by a sequence of Gaussians whose covariance is recursively defined. In particular, we establish explicit bounds on the Wasserstein-1 distance between the rescaled iterate at time $k$ and the aforementioned Gaussian for various choices of step-sizes. Since these covariances converge to the classical asymptotic limit, our analysis also provides a convergence rate for asymptotic normality as a by-product. As an immediate consequence of our bounds, we obtain tail bounds on the error of SA iterates at any time. Finally, we establish the sharpness of our rates by providing matching lower bounds and validate our findings through simulations. We obtain the sharp rates by first studying the convergence rate of the discrete Ornstein-Uhlenbeck (O-U) process driven by general noise, whose stationary distribution is identical to the limiting Gaussian distribution of the rescaled SA iterates. We believe that this is of independent interest, given its connection to sampling literature. The analysis involves adapting Stein's method for Gaussian approximation to handle the matrix weighted sum of i.i.d. random variables. The desired finite-time bounds for SA are obtained by characterizing the error dynamics between the rescaled SA iterate and the discrete time O-U process and combining it with the convergence rate of the latter process.

2602.13602 2026-06-02 cs.CV cs.LG 版本更新

Towards Sparse Video Understanding and Reasoning

迈向稀疏视频理解与推理

Chenwei Xu, Zhen Ye, Shang Wu, Weijian Li, Zihan Wang, Zhuofan Xia, Lie Lu, Pranav Maneriker, Fan Du, Manling Li, Han Liu

发表机构 * Northwestern University(西北大学) Johns Hopkins University(约翰霍普金斯大学) Dolby Laboratories(杜比实验室)

AI总结 提出一种多轮视频问答代理,通过稀疏帧选择、状态摘要和早期停止机制,在减少帧数和令牌数的同时提升准确率。

Comments Accepted to CVPR 2026. Project page: https://sparsevideounderstanding.github.io

详情
AI中文摘要

我们提出 \revise (\underline{Re}asoning with \underline{Vi}deo \underline{S}parsity),一种用于视频问答 (VQA) 的多轮代理。与均匀采样帧不同,\revise 选择一小部分信息丰富的帧,跨轮维护摘要作为状态,并在置信时提前停止。它支持专有视觉语言模型 (VLM) 的“即插即用”设置,并允许对开源模型进行强化微调。对于微调,我们引入 EAGER (Evidence-Adjusted Gain for Efficient Reasoning),一种无注释奖励,包含三项:(1) 置信增益:添加新帧后,奖励正确选项与最强替代选项之间对数几率差距的增加;(2) 摘要充分性:在回答时仅使用最后提交的摘要重新提问,并奖励成功;(3) 正确且早期停止:在较小的轮次预算内正确回答即获得奖励。在多个 VQA 基准上,\revise 在减少帧数、轮数和提示令牌数的同时提高了准确率,展示了实用的稀疏视频推理。

英文摘要

We present \revise (\underline{Re}asoning with \underline{Vi}deo \underline{S}parsity), a multi-round agent for video question answering (VQA). Instead of uniformly sampling frames, \revise selects a small set of informative frames, maintains a summary-as-state across rounds, and stops early when confident. It supports proprietary vision-language models (VLMs) in a ``plug-and-play'' setting and enables reinforcement fine-tuning for open-source models. For fine-tuning, we introduce EAGER (Evidence-Adjusted Gain for Efficient Reasoning), an annotation-free reward with three terms: (1) Confidence gain: after new frames are added, we reward the increase in the log-odds gap between the correct option and the strongest alternative; (2) Summary sufficiency: at answer time we re-ask using only the last committed summary and reward success; (3) Correct-and-early stop: answering correctly within a small turn budget is rewarded. Across multiple VQA benchmarks, \revise improves accuracy while reducing frames, rounds, and prompt tokens, demonstrating practical sparse video reasoning.

2602.11554 2026-06-02 cs.RO cs.CV cs.LG 版本更新

HyperDet: 3D Object Detection with Hyper 4D Radar Point Clouds

HyperDet: 基于超4D雷达点云的3D目标检测

Yichun Xiao, Runwei Guan, Jin Jin, Fangqiang Ding

发表机构 * University of Edinburgh(爱丁堡大学) HKUST (GZ)(香港科技大学(广州)) University of Oxford(牛津大学) MIT(麻省理工学院)

AI总结 提出一种与检测器无关的框架HyperDet,通过构建任务感知的超4D雷达点云,利用时空累积、跨传感器验证和多普勒引导的运动补偿以及前景生成增强,显著提升仅用雷达的3D目标检测性能。

Comments 11 pages, 3 figures, 3 tables

详情
AI中文摘要

仅使用4D雷达进行3D目标检测能达到什么程度?尽管现代4D雷达为自主感知提供了鲁棒天气和速度感知能力,但其点云仍然稀疏、嘈杂且不稳定,限制了仅用雷达的3D检测。我们提出HyperDet,一种与检测器无关的框架,在检测前构建任务感知的超4D雷达点云。HyperDet首先通过时空累积、跨传感器验证和多普勒引导的运动补偿来细化短窗口环视雷达观测,提高返回可靠性和时间一致性。然后,它利用仅在训练时可用的激光雷达引导的伪雷达监督进行前景生成增强,在保留测量雷达背景和雷达原生属性的同时丰富目标几何。在检测器训练期间,雷达感知的目标级增强进一步在几何重定位下保持多普勒一致性。在推理时,HyperDet仅需雷达输入,可直接与标准3D检测器配合使用。在两个公开的环视4D雷达数据集上的实验表明,与原始雷达输入相比,在标准3D检测器上均取得一致改进,验证了输入级雷达增强作为仅用雷达3D检测的有效方法。

英文摘要

How far can 3D object detection go using 4D radar alone? Despite offering weather-robust and velocity-aware sensing for autonomous perception, modern 4D radar still yields sparse, noisy, and unstable point clouds, limiting radar-only 3D detection. We present HyperDet, a detector-agnostic framework that constructs task-aware hyper 4D radar point clouds before detection. HyperDet first refines short-window surround-view radar observations through spatio-temporal accumulation, cross-sensor validation, and Doppler-guided motion compensation, improving return reliability and temporal coherence. It then performs foreground generative enhancement using LiDAR-guided pseudo-radar supervision available only during training, enriching object geometry while preserving measured radar background and radar-native attributes. During detector training, radar-aware object-level augmentation further preserves Doppler consistency under geometric relocation. At inference time, HyperDet requires radar input alone and can be directly paired with standard 3D detectors. Experiments on two public surround-view 4D radar datasets demonstrate consistent improvements over raw radar inputs across standard 3D detectors, validating input-level radar enhancement as an effective approach to radar-only 3D detection.

2602.12972 2026-06-02 cs.SI cs.LG 版本更新

Jointly Optimizing Debiased CTR and Uplift for Coupons Marketing: A Unified Causal Framework

联合优化去偏点击率和优惠券营销提升:一个统一因果框架

Siyun Yang, Shixiao Yang, Jian Wang, Di Fan, Kehe Cai, Haoyan Fu, Jiaming Zhang, Wenjin Wu, Peng Jiang

发表机构 * Kuaishou Technology(快手科技) Beijing Institute of Technology(北京理工大学) Independent Researcher(独立研究者)

AI总结 针对优惠券等营销干预导致的点击率预测偏差,提出统一多值处理网络UniMVT,通过反事实推断同时实现去偏点击率预测和提升估计。

详情
AI中文摘要

在线广告中,优惠券等营销干预会引入显著的混杂偏差,影响点击率(CTR)预测。观察到的点击反映了用户内在偏好与干预带来的提升的混合。这导致传统模型对基础CTR校准不准确,从而扭曲下游排序和计费决策。此外,营销干预通常作为多值处理,具有不同幅度,给CTR预测增加了额外复杂性。为解决这些问题,我们提出了统一多值处理网络(UniMVT)。具体来说,UniMVT从处理敏感表示中解耦混杂因素,使得全空间反事实推断模块能够联合重建去偏的基础CTR和强度-响应曲线。为处理多值处理的复杂性,UniMVT采用辅助强度估计任务来捕获处理倾向,并设计一个单位提升目标来归一化干预效果。这确保了在连续优惠券价值谱上的可比较估计。UniMVT同时实现了用于准确系统校准的去偏CTR预测和用于激励分配的精确提升估计。在合成和工业数据集上的大量实验证明了UniMVT在预测准确性和校准方面的优越性。此外,真实世界的A/B测试证实,UniMVT通过更有效的优惠券分发显著改善了业务指标。

英文摘要

In online advertising, marketing interventions such as coupons introduce significant confounding bias into Click-Through Rate (CTR) prediction. Observed clicks reflect a mixture of users' intrinsic preferences and the uplift induced by these interventions. This causes conventional models to miscalibrate base CTRs, which distorts downstream ranking and billing decisions. Furthermore, marketing interventions often operate as multi-valued treatments with varying magnitudes, introducing additional complexity to CTR prediction. To address these issues, we propose the \textbf{Uni}fied \textbf{M}ulti-\textbf{V}alued \textbf{T}reatment Network (UniMVT). Specifically, UniMVT disentangles confounding factors from treatment-sensitive representations, enabling a full-space counterfactual inference module to jointly reconstruct the debiased base CTR and intensity-response curves. To handle the complexity of multi-valued treatments, UniMVT employs an auxiliary intensity estimation task to capture treatment propensities and devise a unit uplift objective that normalizes the intervention effect. This ensures comparable estimation across the continuous coupon-value spectrum. UniMVT simultaneously achieves debiased CTR prediction for accurate system calibration and precise uplift estimation for incentive allocation. Extensive experiments on synthetic and industrial datasets demonstrate UniMVT's superiority in both predictive accuracy and calibration. Furthermore, real-world A/B tests confirm that UniMVT significantly improves business metrics through more effective coupon distribution.

2602.12080 2026-06-02 cs.LG 版本更新

PathCRF: Ball-Free Soccer Event Detection via Possession Path Inference from Player Trajectories

PathCRF: 通过球员轨迹的控球路径推断实现无球足球事件检测

Hyunsung Kim, Kunhee Lee, Sangwoo Seo, Sang-Ki Ko, Jinsung Yoon, Chanyoung Park

发表机构 * KAIST(韩国釜山科学技术院) Fitogether Inc.(Fitogether公司) University of Seoul(首尔大学)

AI总结 提出PathCRF框架,仅利用球员轨迹数据,通过将轨迹建模为动态图并采用条件随机场(CRF)推断控球路径,实现无球足球事件检测,降低对人工标注和球轨迹数据的依赖。

详情
AI中文摘要

尽管人工智能取得了最新进展,足球比赛的事件数据收集仍然严重依赖劳动密集型的人工标注。虽然已有研究利用球员和球轨迹探索自动事件检测,但由于高昂的基础设施和运营成本,球轨迹追踪仍然难以大规模应用。因此,足球领域的全面数据收集主要局限于顶级赛事,限制了数据驱动分析在该领域的广泛应用。为了解决这一挑战,本文提出了PathCRF,一个仅使用球员追踪数据检测足球控球事件的框架。我们将球员轨迹建模为全连接动态图,并将事件检测形式化为在每个时间步选择恰好一条对应于当前控球状态的边。为了确保所得边序列的逻辑一致性,我们采用条件随机场(CRF),禁止连续边之间出现不可能的转换,其中发射分数和转移分数由社会-时间骨干架构产生的边嵌入动态计算。在推理过程中,通过维特比解码获得最可能的边序列,当所选边在相邻时间步之间发生变化时,检测到控球或传球等事件。实验表明,PathCRF生成准确、逻辑一致的控球路径,能够实现可靠的下游分析,同时大幅减少对人工事件标注的需求。源代码可在 https://github.com/hyunsungkim-ds/pathcrf.git 获取。

英文摘要

Despite recent advances in AI, event data collection in soccer still relies heavily on labor-intensive manual annotation. Although prior work has explored automatic event detection using player and ball trajectories, ball tracking also remains difficult to scale due to high infrastructural and operational costs. As a result, comprehensive data collection in soccer is largely confined to top-tier competitions, limiting the broader adoption of data-driven analysis in this domain. To address this challenge, this paper proposes PathCRF, a framework for detecting on-ball soccer events using only player tracking data. We model player trajectories as a fully connected dynamic graph and formulate event detection as the problem of selecting exactly one edge corresponding to the current possession state at each time step. To ensure logical consistency of the resulting edge sequence, we employ a Conditional Random Field (CRF) that forbids impossible transitions between consecutive edges, where emission and transition scores are dynamically computed from edge embeddings produced by a socio-temporal backbone architecture. During inference, the most probable edge sequence is obtained via Viterbi decoding, and events such as ball controls or passes are detected whenever the selected edge changes between adjacent time steps. Experiments show that PathCRF produces accurate, logically consistent possession paths, enabling reliable downstream analyses while substantially reducing the need for manual event annotation. The source code is available at https://github.com/hyunsungkim-ds/pathcrf.git.

2602.11852 2026-06-02 cs.AI cs.CL cs.LG 版本更新

Prototype Transformer: Towards Language Model Architectures Interpretable by Design

原型Transformer:迈向可解释设计的语言模型架构

Yordan Yordanov, Matteo Forasassi, Bayar Menzat, Ruizhi Wang, Chang Qi, Markus Kaltenberger, Amine M'Charrak, Tommaso Salvatori, Thomas Lukasiewicz

发表机构 * University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出原型Transformer(ProtoT),一种用线性代价原型模块替代二次代价自注意力的自回归语言模型架构,原型自动捕获可命名概念,提升可解释性并支持行为编辑。

Comments Accepted at ICML 2026. Equal contribution: Yordan Yordanov and Matteo Forasassi. 40 pages, 28 figures, 22 tables

详情
AI中文摘要

尽管最先进的语言模型(LM)在某些领域超越了大多数人类,但其推理过程仍然不透明,降低了信任度并增加了欺骗和幻觉的风险。我们引入了原型Transformer(ProtoT),一种自回归LM架构,它将Transformer的二次代价自注意力模块替换为基于原型的线性代价模块,原型是学习到的参数向量。在ProtoT中,原型创建了在不同时间尺度上聚合上下文信息的通信通道。我们表明,这种结构导致原型在训练过程中自动捕获可命名的概念,例如“女人”,为解释模型推理和对模型行为进行有针对性的编辑提供了途径。与基线相比,ProtoT在模型和数据规模上具有良好的扩展性,对输入扰动具有鲁棒性,并在文本生成和下游任务(包括GLUE)上表现良好。这些结果表明,ProtoT是朝着设计上更可解释的自回归语言模型迈出的有希望的一步。

英文摘要

While state-of-the-art language models (LMs) surpass most humans in certain domains, their reasoning remains largely opaque, reducing trust and increasing the risk of deception and hallucination. We introduce the Prototype Transformer (ProtoT), an autoregressive LM architecture that replaces the quadratic-cost self-attention module of the Transformer with a linear-cost module based on prototypes, which are learned parameter vectors. In ProtoT, prototypes create communication channels that aggregate contextual information at different time scales. We show that this structure leads prototypes to automatically capture nameable concepts, such as "woman", during training, offering a path toward interpreting model reasoning and making targeted edits to model behavior. Compared with baselines, ProtoT scales well with model and data size, is robust to input perturbations, and performs well on text generation and downstream tasks, including GLUE. These results suggest that ProtoT is a promising step toward autoregressive language models that are more interpretable by design.

2510.06028 2026-06-02 cs.LG stat.ML 版本更新

Generalization of Gibbs and Langevin Monte Carlo Algorithms in the Interpolation Regime

插值机制下吉布斯和朗之万蒙特卡洛算法的泛化

Andreas Maurer, Erfan Mirzaei, Massimiliano Pontil

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 本文在过参数化插值机制下,通过数据依赖的期望误差界,证明了低温区的泛化可由高温区的小训练误差预示,并利用朗之万蒙特卡洛算法稳定逼近,在MNIST、CIFAR-10和SVHN数据集上给出非平凡且接近真实标签测试误差的预测。

详情
AI中文摘要

本文在过参数化插值机制下提供了吉布斯算法期望误差的数据依赖界,其中对于不可能的数据(如分类中的随机标签)也能获得低训练误差。结果表明,低温区的泛化已经由噪声较大的高温区的小训练误差所预示。这些界在使用朗之万蒙特卡洛算法近似时是稳定的。该分析激励了一种计算界的算法设计,该算法在MNIST、CIFAR-10和SVHN数据集上对真实标签数据给出了非平凡且接近的测试误差预测,同时对随机标签保持了正确的测试误差上界。

英文摘要

This paper provides data-dependent bounds on the expected error of the Gibbs algorithm in the overparameterized interpolation regime, where low training errors are also obtained for impossible data, such as random labels in classification. The results show that generalization in the low-temperature regime is already signaled by small training errors in the noisier high-temperature regime. The bounds are stable under approximation with Langevin Monte Carlo algorithms. The analysis motivates the design of an algorithm to compute bounds, which on the MNIST, CIFAR-10, and SVHN datasets yield nontrivial, close predictions on the test error for true labeled data, while maintaining a correct upper bound on the test error for random labels.

2602.11641 2026-06-02 cs.LG 版本更新

Both Topology and Text Matter: Revisiting LLM-guided Out-of-Distribution Detection on Text-attributed Graphs

拓扑与文本同样重要:重新审视基于LLM的文本属性图分布外检测

Yinlin Zhu, Di Wu, Xu Wang, Guocong Quan, Miao Hu

发表机构 * Sun Yat-sen University(中山大学) Shandong University(山东大学)

AI总结 针对文本属性图分布外检测中拓扑与文本信息利用不足的问题,提出LG-Plug框架,通过对齐拓扑与文本表示并利用聚类迭代LLM提示生成共识驱动的OOD样本,有效提升检测性能。

Comments Accepted by SIGKDD 2026

详情
AI中文摘要

文本属性图(TAGs)将节点与文本属性和图结构关联,使GNN能够联合建模语义和结构信息。尽管在分布内(ID)数据上有效,但GNN在面对具有未见文本或结构模式的分布外(OOD)节点时常常失败,产生过度自信的预测而缺乏可靠的OOD检测。现有的拓扑驱动方法通过邻域结构缓解节点级偏差,但通常将文本编码为浅层特征,未充分利用语义信息。最近的基于LLM的方法则从文本知识中合成伪OOD先验,但存在两个关键限制:(1)可靠性与信息性之间的权衡,生成的OOD暴露要么偏离真实的OOD语义,要么引入大量ID噪声;(2)依赖专用架构,限制了与先前工作中验证的拓扑级进展的兼容性。为解决这些问题,我们提出LG-Plug,一个用于TAG OOD检测的LLM引导的即插即用框架。LG-Plug对齐拓扑和文本表示以获得细粒度节点嵌入,然后通过聚类迭代LLM提示构建共识驱动的OOD暴露。为降低LLM查询成本,它进一步采用轻量级簇内码本和启发式采样。生成的OOD暴露作为正则化器,分离ID和OOD节点,实现与现有检测器的无缝集成。在六个TAG基准上的实验表明,LG-Plug持续改进拓扑驱动的OOD检测器(FPR95降低>7%),并超越先前基于LLM的方法(FPR95降低>5%)。

英文摘要

Text-attributed graphs (TAGs) associate nodes with textual attributes and graph structure, enabling GNNs to jointly model semantic and structural information. Although effective on in-distribution (ID) data, GNNs often fail on out-of-distribution (OOD) nodes with unseen textual or structural patterns, producing overconfident predictions without reliable OOD detection. Existing topology-driven methods mitigate node-level bias through neighboring structures, but typically encode texts as shallow features, underutilizing semantic information. Recent LLM-based approaches instead synthesize pseudo OOD priors from textual knowledge, yet suffer from two key limitations: (1) a trade-off between reliability and informativeness, where generated OOD exposures either deviate from true OOD semantics or introduce substantial ID noise; and (2) dependence on specialized architectures, limiting compatibility with topology-level advances validated in prior work. To address these issues, we propose LG-Plug, an LLM-Guided Plug-and-play framework for TAG OOD detection. LG-Plug aligns topology and text representations to obtain fine-grained node embeddings, then constructs consensus-driven OOD exposure through clustered iterative LLM prompting. To reduce LLM query cost, it further adopts lightweight in-cluster codebooks and heuristic sampling. The generated OOD exposure acts as a regularizer that separates ID and OOD nodes, enabling seamless integration with existing detectors. Experiments on six TAG benchmarks demonstrate that LG-Plug consistently improves topology-driven OOD detectors (>7% FPR95 reduction) and surpasses prior LLM-based methods (>5% FPR95 reduction).

2507.15336 2026-06-02 cs.LG cs.AI cs.DB 版本更新

Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design

超越模型库检索:编织知识以掌握细粒度神经网络设计

Jialiang Wang, Hanmo Liu, Shimin Di, Zhili Wang, Jiachuan Wang, Lei Chen, Xiaofang Zhou

发表机构 * National University of Singapore(新加坡国立大学) Tsinghua University(清华大学) University of Science and Technology of China(中国科学技术大学)

AI总结 提出M-DESIGN框架,通过构建编辑效应证据图并采用自适应检索与预测任务规划器,在严格预算下高效发现近最优细粒度架构修改路径,在33个案例中26个达到搜索空间最佳性能。

Comments Accepted at ICML 2026. Title changed from "Beyond Model Base Selection: Weaving Knowledge to Master Fine-grained Neural Network Design" to "Beyond Model Base Retrieval: Weaving Knowledge to Master Fine-grained Neural Network Design"

详情
AI中文摘要

为新任务设计高性能神经网络需要在优化质量与搜索效率之间取得平衡。当前方法未能实现这一平衡:神经架构搜索计算成本高,而模型检索通常产生次优的静态检查点。为解决这一困境,我们将细粒度架构修改带来的性能增益建模为编辑效应证据,并从先验任务构建证据图。通过构建检索增强的模型精炼框架,我们提出的M-DESIGN动态编织历史证据以发现近最优的修改路径。M-DESIGN具有自适应检索机制,可快速校准来自不同来源的编辑效应证据的演化可迁移性。为处理分布外偏移,我们引入预测任务规划器,从多跳证据外推增益,从而减少对详尽知识库的依赖。基于包含22个数据集上67,760个图神经网络的知识库,大量实验表明,M-DESIGN持续优于基线,在严格预算下33个案例中有26个达到搜索空间最佳性能。

英文摘要

Designing high-performance neural networks for new tasks requires balancing optimization quality with search efficiency. Current methods fail to achieve this balance: neural architectural search is computationally expensive, while model retrieval often yields suboptimal static checkpoints. To resolve this dilemma, we model the performance gains induced by fine-grained architectural modifications as edit-effect evidence and build evidence graphs from prior tasks. By constructing a retrieval-augmented model refinement framework, our proposed M-DESIGN dynamically weaves historical evidence to discover near-optimal modification paths. M-DESIGN features an adaptive retrieval mechanism that quickly calibrates the evolving transferability of edit-effect evidence from different sources. To handle out-of-distribution shifts, we introduce predictive task planners that extrapolate gains from multi-hop evidence, thereby reducing reliance on an exhaustive repository. Based on our model knowledge base of 67,760 graph neural networks across 22 datasets, extensive experiments demonstrate that M-DESIGN consistently outperforms baselines, achieving the search-space best performance in 26 out of 33 cases under a strict budget.

2602.10623 2026-06-02 cs.LG cs.AI 版本更新

Mitigating Reward Hacking in RLHF via Bayesian Non-negative Reward Modeling

通过贝叶斯非负奖励建模缓解RLHF中的奖励黑客

Zhibin Duan, Guowei Rong, Zhuo Li, Bo Chen, Mingyuan Zhou, Dandan Guo

发表机构 * Zhejiang University(浙江大学)

AI总结 提出贝叶斯非负奖励模型(BNRM),通过非负因子分析和变分推断,在Bradley-Terry偏好模型中实现解耦与去偏,有效缓解奖励过度优化,提升鲁棒性和可解释性。

Comments Accepted as an Oral presentation at ICML 2026. The code is available at https://github.com/GuoweiRong/Bayesian-Non-negative-Reward-Model

详情
AI中文摘要

从人类偏好中学习的奖励模型是通过人类反馈强化学习对齐大型语言模型的核心,但由于噪声标注和系统偏差(如响应长度或风格),它们通常容易受到奖励黑客攻击。我们提出了贝叶斯非负奖励模型(BNRM),这是一个原则性的奖励建模框架,将非负因子分析整合到Bradley-Terry偏好模型中。BNRM通过稀疏的非负潜在因子生成过程表示奖励,该过程在两个互补层面运作:实例特定的潜在变量诱导解耦的奖励表示,而全局潜在因子的稀疏性作为隐式去偏机制,抑制虚假相关性。这种解耦-去偏结构共同实现了鲁棒的不确定性感知奖励学习。为了将BNRM扩展到现代LLM,我们开发了一个基于深度模型表示的条件摊销变分推断网络,实现高效的端到端训练。大量实验结果表明,BNRM显著缓解了奖励过度优化,提高了分布偏移下的鲁棒性,并比强基线产生了更可解释的奖励分解。

英文摘要

Reward models learned from human preferences are central to aligning large language models (LLMs) via reinforcement learning from human feedback, yet they are often vulnerable to reward hacking due to noisy annotations and systematic biases such as response length or style. We propose Bayesian Non-Negative Reward Model (BNRM), a principled reward modeling framework that integrates non-negative factor analysis into Bradley-Terry (BT) preference model. BNRM represents rewards through a sparse, non-negative latent factor generative process that operates at two complementary levels: instance-specific latent variables induce disentangled reward representations, while sparsity over global latent factors acts as an implicit debiasing mechanism that suppresses spurious correlations. Together, this disentanglement-then-debiasing structure enables robust uncertainty-aware reward learning. To scale BNRM to modern LLMs, we develop an amortized variational inference network conditioned on deep model representations, allowing efficient end-to-end training. Extensive empirical results demonstrate that BNRM substantially mitigates reward over-optimization, improves robustness under distribution shifts, and yields more interpretable reward decompositions than strong baselines.

2505.24069 2026-06-02 cs.LG cs.AI 版本更新

Can LLMs Reason Structurally? Benchmarking via the Lens of Data Structures

LLM 能否进行结构性推理?通过数据结构视角进行基准测试

Yu He, Yingxi Li, Colin White, Ellen Vitercik

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出 DSR-Bench 基准,通过 20 种数据结构、35 种操作和 4140 个问题实例评估 LLM 的结构性推理能力,发现顶级模型在挑战性实例上仅得 0.46/1,且在空间数据、上下文丰富场景及自身代码推理上表现不佳。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026

详情
AI中文摘要

大型语言模型(LLM)被部署在日益复杂的任务上,这些任务需要多步决策。因此,理解它们的算法推理能力至关重要。然而,我们缺乏用于评估这些能力的诊断基准。我们提议使用数据结构作为原则性视角:作为算法的基本构建块,它们自然地探测结构性推理——即理解和操作支撑算法推理的关系(如顺序、层次和连接性)的能力。我们引入了 DSR-Bench(数据结构推理基准),涵盖 20 种数据结构、35 种操作和 4140 个问题实例。DSR-Bench 具有层次化任务组织、全自动生成与评估以及细粒度诊断的特点。评估 13 个最先进的 LLM 揭示了关键局限性:表现最好的模型在挑战性实例上仅达到 0.46/1。三个针对更现实用法的辅助探针暴露了进一步的弱点:模型在空间数据和上下文丰富的场景中表现不佳,并且难以对其自身代码进行推理。

英文摘要

Large language models (LLMs) are deployed on increasingly complex tasks that require multi-step decision-making. Understanding their algorithmic reasoning abilities is therefore crucial. However, we lack a diagnostic benchmark for evaluating these capabilities. We propose to use data structures as a principled lens: as fundamental building blocks of algorithms, they naturally probe structural reasoning - the ability to understand and manipulate relationships such as order, hierarchy, and connectivity that underpin algorithmic reasoning. We introduce DSR-Bench (Data Structure Reasoning Benchmark), spanning 20 data structures, 35 operations, and 4,140 problem instances. DSR-Bench features hierarchical task organization, fully automated generation and evaluation, and fine-grained diagnostics. Evaluating 13 state-of-the-art LLMs reveals critical limitations: the top-performing model achieves only 0.46/1 on challenging instances. Three auxiliary probes targeting more realistic usages expose further weaknesses: models perform poorly on spatial data and context-rich scenarios, and they struggle to reason over their own code.

2602.10056 2026-06-02 cs.LG stat.ML 版本更新

WildCat: Near-Linear Attention in Theory and Practice

WildCat: 理论与实践中近乎线性的注意力机制

Tobias Schröder, Lester Mackey

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出WildCat方法,通过随机枢轴Cholesky算法选择加权核心集,在近线性时间内以超多项式误差衰减近似精确注意力,并应用于图像生成、分类和语言模型KV缓存压缩。

详情
AI中文摘要

我们介绍了WildCat,一种高精度、低成本的神经网络注意力机制压缩方法。虽然注意力是现代网络架构的标配,但由于其资源需求随输入序列长度$n$呈二次方增长,部署成本极高。WildCat通过仅关注一个小的加权核心集来避免这些二次成本。关键的是,我们使用一种快速但谱精确的子采样算法——随机枢轴Cholesky——来选择核心集,并最优地加权元素以最小化重构误差。值得注意的是,在输入有界的情况下,WildCat以超多项式$O(n^{-\sqrt{\log(\log(n))}})$的误差衰减逼近精确注意力,同时运行在近线性$O(n^{1+o(1)})$时间内。相比之下,先前的实用近似要么缺乏误差保证,要么需要二次运行时间才能保证如此高的保真度。我们将这一进展与GPU优化的PyTorch实现以及一套基准实验相结合,展示了WildCat在图像生成、图像分类和语言模型KV缓存压缩方面的优势。

英文摘要

We introduce WildCat, a high-accuracy, low-cost approach to compressing the attention mechanism in neural networks. While attention is a staple of modern network architectures, it is also notoriously expensive to deploy due to resource requirements that scale quadratically with the input sequence length $n$. WildCat avoids these quadratic costs by only attending over a small weighted coreset. Crucially, we select the coreset using a fast but spectrally-accurate subsampling algorithm -- randomly pivoted Cholesky -- and weight the elements optimally to minimise reconstruction error. Remarkably, given bounded inputs, WildCat approximates exact attention with super-polynomial $O(n^{-\sqrt{\log(\log(n))}})$ error decay while running in near-linear $O(n^{1+o(1)})$ time. In contrast, prior practical approximations either lack error guarantees or require quadratic runtime to guarantee such high fidelity. We couple this advance with a GPU-optimized PyTorch implementation and a suite of benchmark experiments demonstrating the benefits of WildCat for image generation, image classification, and language model KV cache compression.

2602.09651 2026-06-02 stat.ML cs.LG 版本更新

The Entropic Signature of Class Speciation in Diffusion Models

扩散模型中类别分化的熵特征

Florian Handke, Dejan Stančević, Felix Koulischer, Thomas Demeester, Luca Ambrogioni

发表机构 * GitHub arXiv

AI总结 通过追踪潜在语义变量的类别条件熵,检测扩散模型中的语义转变区间,并验证其在高斯混合模型和实际模型中的有效性。

Comments Accepted at International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

扩散模型并非随时间均匀地恢复语义结构。相反,样本在狭窄的区间内从语义模糊过渡到类别确定。最近的理论工作将这种转变归因于沿类别分离方向的动力学不稳定性,但在训练模型中检测和利用这些窗口的实用方法仍然有限。我们表明,跟踪给定噪声状态下潜在语义变量的类别条件熵提供了这些转变区间的可靠特征。通过将熵限制在语义划分上,熵还可以解析不同抽象层次上的语义决策。我们在高维高斯混合模型中分析了这种行为,并表明熵率集中在与方差保持扩散中先前识别的分化对称性破缺不稳定性相同的对数时间尺度上。我们在EDM2-XS和Stable Diffusion 1.5上验证了我们的方法,其中类别条件熵一致地隔离了对语义结构形成至关重要的噪声区间。最后,我们使用我们的框架来量化引导如何随时间重新分布语义信息。这些结果共同连接了信息论和统计物理学对扩散的视角,并为时间局部化控制提供了原则性基础。

英文摘要

Diffusion models do not recover semantic structure uniformly over time. Instead, samples transition from semantic ambiguity to class commitment within a narrow regime. Recent theoretical work attributes this transition to dynamical instabilities along class-separating directions, but practical methods to detect and exploit these windows in trained models are still limited. We show that tracking the class-conditional entropy of a latent semantic variable given the noisy state provides a reliable signature of these transition regimes. By restricting the entropy to semantic partitions, the entropy can furthermore resolve semantic decisions at different levels of abstraction. We analyze this behavior in high-dimensional Gaussian mixture models and show that the entropy rate concentrates on the same logarithmic time scale as the speciation symmetry-breaking instability previously identified in variance-preserving diffusion. We validate our method on EDM2-XS and Stable Diffusion 1.5, where class-conditional entropy consistently isolates the noise regimes critical for semantic structure formation. Finally, we use our framework to quantify how guidance redistributes semantic information over time. Together, these results connect information-theoretic and statistical physics perspectives on diffusion and provide a principled basis for time-localized control.

2602.09492 2026-06-02 cs.LG cs.AI 版本更新

Beware of the Batch Size: Hyperparameter Bias in Evaluating LoRA

当心批量大小:评估 LoRA 中的超参数偏差

Sangyoon Lee, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(浦项科学技术大学(POSTECH))

AI总结 本文发现批量大小是导致 LoRA 变体性能矛盾的关键因素,提出基于代理的高效调优策略,将批量大小提升为一阶设计参数。

详情
AI中文摘要

低秩适配(LoRA)是微调大型语言模型的标准方法,但其众多变体在相同基准上报告了相互矛盾的经验性收益。我们表明这些矛盾源于一个被忽视的因素:批量大小。当适当调整时,vanilla LoRA 通常能达到与更复杂变体相当的性能。我们进一步提出了一种基于代理的、成本高效的批量大小调优策略,揭示了秩、数据集大小和模型容量对最优批量大小的影响。我们的发现将批量大小从次要实现细节提升为一阶设计参数,调和了先前的不一致性,并使得对 LoRA 变体的评估更加可靠。

英文摘要

Low-rank adaptation (LoRA) is a standard approach for fine-tuning large language models, yet its many variants report conflicting empirical gains, often on the same benchmarks. We show that these contradictions arise from a single overlooked factor: the batch size. When properly tuned, vanilla LoRA often matches the performance of more complex variants. We further propose a proxy-based, cost-efficient strategy for batch size tuning, revealing the impact of rank, dataset size, and model capacity on the optimal batch size. Our findings elevate batch size from a minor implementation detail to a first-order design parameter, reconciling prior inconsistencies and enabling more reliable evaluations of LoRA variants.

2602.09474 2026-06-02 cs.LG 版本更新

Online Learning in MDPs with Partially Adversarial Transitions and Losses

具有部分对抗性转移和损失的MDP中的在线学习

Ofir Schlisselberg, Tal Lancewicki, Yishay Mansour

发表机构 * Tel Aviv University(特拉维夫大学) Meta AI Google Research(谷歌研究院)

AI总结 研究转移函数在大多数步骤随机但在每回合固定Λ个步骤可对抗的MDP中的强化学习,提出条件占用度量并设计两种算法,分别处理任意对抗步骤和连续对抗步骤,并给出全对抗设置下的遗憾界。

详情
AI中文摘要

我们研究MDP中的强化学习,其转移函数在大多数步骤是随机的,但在每回合固定的$Λ$个步骤子集上可能表现对抗性。该模型捕捉了除少数脆弱点外稳定的环境。我们引入了\emph{条件占用度量},即使在对抗性转移下也能在回合间保持稳定,并利用它们设计了两种算法。第一种处理任意对抗步骤,实现遗憾$ ilde{O}(H S^Λ\sqrt{K S A^{Λ+1}})$,其中$K$是回合数,$S$是状态数,$A$是动作数,$H$是回合长度。第二种假设对抗步骤连续,将对$S$的依赖改进为$ ilde{O}(H\sqrt{K S^{3} A^{Λ+1}})$。我们进一步给出一个$K^{2/3}$遗憾约简,消除了知道哪些步骤是$Λ$个对抗步骤的需要。我们还刻画了\emph{全对抗}设置($Λ=H-1$)下对抗性MDP的遗憾,包括完全信息和赌博机反馈,并提供了几乎匹配的上界和下界(略微加强了现有下界,并阐明了不同反馈结构如何影响学习的难度)。

英文摘要

We study reinforcement learning in MDPs whose transition function is stochastic at most steps but may behave adversarially at a fixed subset of $Λ$ steps per episode. This model captures environments that are stable except at a few vulnerable points. We introduce \emph{conditioned occupancy measures}, which remain stable across episodes even with adversarial transitions, and use them to design two algorithms. The first handles arbitrary adversarial steps and achieves regret $\tilde{O}(H S^Λ\sqrt{K S A^{Λ+1}})$, where $K$ is the number of episodes, $S$ is the number of state, $A$ is the number of actions and $H$ is the episode's horizon. The second, assuming the adversarial steps are consecutive, improves the dependence on $S$ to $\tilde{O}(H\sqrt{K S^{3} A^{Λ+1}})$. We further give a $K^{2/3}$-regret reduction that removes the need to know which steps are the $Λ$ adversarial steps. We also characterize the regret of adversarial MDPs in the \emph{fully adversarial} setting ($Λ=H-1$) both for full-information and bandit feedback, and provide almost matching upper and lower bounds (slightly strengthen existing lower bounds, and clarify how different feedback structures affect the hardness of learning).

2602.03970 2026-06-02 stat.ML cs.LG cs.NE math.MG math.ST stat.TH 版本更新

Statistical Guarantees for Reasoning Probes on Looped Boolean Circuits

循环布尔电路上推理探针的统计保证

Anastasis Kratsios, Giulia Livieri, A. Martina Neuman

发表机构 * Department of Mathematics, McMaster University(麦斯特大学数学系) Vector Institute(向量研究所) The London School of Economics and Political Science(伦敦政治经济学院) University of Vienna, Faculty of Mathematics(维也纳大学数学系)

AI总结 针对循环布尔电路上的推理探针,利用图卷积网络和度量嵌入技术,证明了在最坏情况下泛化误差以最优速率衰减,且该速率与计算图规模无关。

详情
AI中文摘要

我们研究了一种受神经算法推理启发的迭代计算风格化模型中推理探针的统计行为。底层计算由一个循环布尔电路给出,其图是完美的 $ν$ 元树($ν\ge 2$),输出在计算轮次中递归地作为输入反馈。探针观察内部节点的采样子集,并试图推断每个节点处的潜在操作,表示为有限可容许布尔门集合上的概率分布。这种部分可观测性在结构化计算图上诱导了一个转导泛化问题。我们证明,当探针由图卷积网络参数化并查询 $N$ 个节点时,最坏情况下的泛化误差以最优速率 $\mathcal{O}(\sqrt{\log(2/δ)}/\sqrt{N})$ 衰减,概率至少为 $1-δ$。我们的分析将度量嵌入技术与最优传输工具相结合。一个关键见解是,该速率与计算图规模无关,这是通过诱导图度量的低失真一维雪花嵌入实现的。这些结果突出了在探测结构化迭代计算中统计效率的几何机制。

英文摘要

We study the statistical behavior of reasoning probes in a stylized model of iterative computation inspired by neural algorithmic reasoning. The underlying computation is given by a looped Boolean circuit whose graph is a perfect $ν$-ary tree ($ν\ge 2$), with outputs recursively fed back as inputs across computation rounds. A probe observes a sampled subset of internal nodes and seeks to infer the latent operation at each node, represented as a probability distribution over a finite set of admissible Boolean gates. This partial observability induces a transductive generalization problem on a structured computation graph. We show that when the probe is parameterized by a graph convolutional network and queries $N$ nodes, the worst-case generalization error decays at the optimal rate $\mathcal{O}(\sqrt{\log(2/δ)}/\sqrt{N})$ with probability at least $1-δ$. Our analysis combines metric embedding techniques with tools from optimal transport. A key insight is that this rate is achievable independently of the size of the computation graph, enabled by a low-distortion one-dimensional snowflake embedding of the induced graph metric. These results highlight a geometric mechanism underlying statistical efficiency in probing structured, iterative computations.

2502.00753 2026-06-02 math.OC cs.LG 版本更新

Mirror Descent Under Generalized Smoothness

广义光滑性下的镜像下降

Dingzhi Yu, Wei Jiang, Hongyi Tao, Yuanyu Wan, Lijun Zhang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学新型软件技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院) School of Software Technology, Zhejiang University(浙江大学软件技术学院) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security(杭州高新技术区(滨江)区块链与数据安全研究院)

AI总结 本文提出一种新的 $\ell_*$-光滑性概念,将经典光滑性推广到一般范数空间,并证明镜像下降类算法在此条件下收敛率与经典光滑性一致。

Comments ICML 2026

详情
AI中文摘要

光滑性对于一阶优化达到快速收敛率至关重要。然而,现代机器学习中的许多优化问题涉及非光滑目标。最近的研究通过允许梯度的Lipschitz常数相对于梯度范数增长来放宽光滑性假设,这适应了实践中广泛的目标。尽管取得了进展,现有的光滑性推广仅限于具有 $\ell_2$ 范数的欧几里得几何,并且仅在欧几里得空间中的优化具有理论保证。在本文中,我们通过引入一个新的 $\ell_*$-光滑性概念来解决这一限制,该概念以一般范数及其对偶度量Hessian的范数,并建立了镜像下降类型算法的收敛性,与经典光滑性下的收敛率相匹配。值得注意的是,我们提出了一种广义的自有界性质,有助于通过控制次优性间隙来界定梯度,作为收敛分析的主要组成部分。在确定性优化之外,我们建立了随机镜像下降的尖锐收敛性,与经典光滑性下的最新结果相匹配。我们的理论还扩展到非凸和复合优化,这可能为镜像下降的实际应用(包括大语言模型的预训练和后训练)提供启示。

英文摘要

Smoothness is crucial for attaining fast rates in first-order optimization. However, many optimization problems in modern machine learning involve non-smooth objectives. Recent studies relax the smoothness assumption by allowing the Lipschitz constant of the gradient to grow with respect to the gradient norm, which accommodates a broad range of objectives in practice. Despite this progress, existing generalizations of smoothness are restricted to Euclidean geometry with $\ell_2$-norm and only have theoretical guarantees for optimization in the Euclidean space. In this paper, we address this limitation by introducing a new $\ell*$-smoothness concept that measures the norm of Hessians in terms of a general norm and its dual, and establish convergence for mirror-descent-type algorithms, matching the rates under the classic smoothness. Notably, we propose a generalized self-bounding property that facilitates bounding the gradients via controlling suboptimality gaps, serving as a principal component for convergence analysis. Beyond deterministic optimization, we establish sharp convergence for stochastic mirror descent, matching state-of-the-art under classic smoothness. Our theory also extends to non-convex and composite optimization, which may shed light on practical usages of mirror descent, including pre-training and post-training of LLMs.

2602.08868 2026-06-02 cs.LG cs.AI 版本更新

AnomSeer: Reinforcing Multimodal LLMs to Reason for Time-Series Anomaly Detection

AnomSeer: 增强多模态大语言模型进行时间序列异常检测的推理能力

Junru Zhang, Lang Feng, Haoran Shi, Xu Guo, Han Yu, Yabo Dong, Duanqing Xu

发表机构 * GitHub

AI总结 提出AnomSeer,通过专家思维链和基于最优传输的时间序列接地策略优化,增强多模态大语言模型在时间序列异常检测中的细粒度推理能力,统一异常分类、定位和解释。

Comments ICML 2026

详情
AI中文摘要

基于多模态大语言模型(MLLM)的时间序列异常检测(TSAD)是一个新兴领域,但一个持续存在的挑战是:MLLM依赖于粗略的时间序列启发式方法,但在多维、详细的推理方面存在困难,而这对于理解复杂的时间序列数据至关重要。我们提出AnomSeer来解决这个问题,通过增强模型将其推理基于时间序列的精确结构细节,统一异常分类、定位和解释。其核心是生成专家思维链轨迹,从经典分析(如统计度量、频率变换)中提供可验证的细粒度推理。在此基础上,我们提出了一种新颖的时间序列接地策略优化(TimerPO),它在标准强化学习之外引入了两个额外组件:基于最优传输的时间序列接地优势,以及确保这种辅助细粒度信号不干扰主要检测目标的正交投影。在各种异常场景中,使用Qwen2.5-VL-3B/7B-Instruct的AnomSeer在分类和定位准确性上优于更大的商业基线(如GPT-4o),特别是在点和频率驱动的异常上。此外,它产生了合理的时间序列推理轨迹,支持其结论。

英文摘要

Time-series anomaly detection (TSAD) with multimodal large language models (MLLMs) is an emerging area, yet a persistent challenge remains: MLLMs rely on coarse time-series heuristics but struggle with multi-dimensional, detailed reasoning, which is vital for understanding complex time-series data. We present AnomSeer to address this by reinforcing the model to ground its reasoning in precise, structural details of time series, unifying anomaly classification, localization, and explanation. At its core, an expert chain-of-thought trace is generated to provide a verifiable, fine-grained reasoning from classical analyses (e.g., statistical measures, frequency transforms). Building on this, we propose a novel time-series grounded policy optimization (TimerPO) that incorporates two additional components beyond standard reinforcement learning: a time-series grounded advantage based on optimal transport and an orthogonal projection to ensure this auxiliary granular signal does not interfere with the primary detection objective. Across diverse anomaly scenarios, AnomSeer, with Qwen2.5-VL-3B/7B-Instruct, outperforms larger commercial baselines (e.g., GPT-4o) in classification and localization accuracy, particularly on point- and frequency-driven exceptions. Moreover, it produces plausible time-series reasoning traces that support its conclusions.

2602.08689 2026-06-02 cs.LG 版本更新

Learning To Sample From Diffusion Models Via Inverse Reinforcement Learning

通过逆强化学习从扩散模型中学习采样

Constant Bourdrez, Alexandre Vérine, Olivier Cappé

发表机构 * DI ENS, Ecole normale supérieure, Université PSL, CNRS(巴黎高等师范学院)

AI总结 提出一个逆强化学习框架,在不重新训练去噪器的情况下优化扩散模型的采样策略(噪声调度、引导尺度、随机性),通过策略梯度匹配目标行为,在ImageNet-64上以9倍更低成本和16%推理开销替代网格搜索。

Comments Preprint

详情
AI中文摘要

扩散模型通过由预训练神经网络引导的迭代去噪过程生成样本。一旦去噪器固定,采样算法本身(噪声调度、引导尺度、随机性分布)仍需要仔细调整,这一过程通常通过昂贵的经验网格搜索进行。在这项工作中,我们引入了一个逆强化学习框架,用于在不重新训练去噪器的情况下学习采样策略。我们将扩散采样过程建模为一个离散时间有限时域马尔可夫决策过程,其中动作对应于采样动力学的可选修改。为了优化动作调度,我们避免定义显式奖励函数,而是直接使用策略梯度技术匹配采样器预期的目标行为。我们提供的实验证据表明,该方法与微调后的采样器性能相当,并且与网格搜索相比成本适中:在ImageNet-64上,单次训练运行以高达9倍更低的成本取代了穷举搜索,推理时仅增加16%的开销。

英文摘要

Diffusion models generate samples through an iterative denoising process guided by a pretrained neural network. Once the denoiser is fixed, the sampling algorithm itself (noise schedules, guidance scales, stochasticity profiles) still requires careful tuning, a process typically carried out through costly empirical grid search. In this work, we introduce an inverse reinforcement learning framework for learning sampling strategies without retraining the denoiser. We formulate the diffusion sampling procedure as a discrete-time finite-horizon Markov Decision Process, where actions correspond to optional modifications of the sampling dynamics. To optimize action scheduling, we avoid defining an explicit reward function and instead directly match the target behavior expected from the sampler using policy gradient techniques. We provide experimental evidence that this approach matches fine-tuned samplers and comes at a modest cost compared to grid search: on ImageNet-64, a single training run replaces exhaustive search at up to 9x lower cost, with only 16% overhead at inference.

2602.08585 2026-06-02 cs.LG cs.AI 版本更新

Predicting Future Utility: Global Combinatorial Optimization for Task-Agnostic KV Cache Eviction

预测未来效用:任务无关的KV缓存驱逐的全局组合优化

Ziyao Tang, Pengkun Jiao, Xinhang Chen, Wei Liu, Shiyong Li, Jingjing Chen

发表机构 * Fudan University(复旦大学) Baige AI Team, Baidu inc(百度AI团队) Work done during an internship at Baidu(百度实习)

AI总结 提出LU-KV框架,通过全局组合优化分配注意力头预算以最大化长期边际贡献,实现80%的KV缓存压缩且性能损失极小。

详情
AI中文摘要

鉴于注意力的二次复杂度,KV缓存驱逐对于加速模型推理至关重要。当前的KV缓存驱逐方法通常依赖于瞬时启发式度量,隐含地假设分数幅度是所有注意力头的重要性一致代理。然而,这忽略了注意力头之间预测保真度的异质性。虽然某些头优先考虑令牌的瞬时贡献,但其他头致力于捕捉长期效用。在本文中,我们提出最优预算分配应由保留长期语义信息的边际效用决定。基于这一见解,我们提出了LU-KV,这是一个新颖的框架,将头级预算分配表述为全局组合优化问题,以最大化保留令牌的长期边际贡献。为了解决这个非凸问题,我们采用凸包松弛和基于边际效用的贪婪求解器,实现接近最优的解。此外,我们实现了一个数据驱动的离线分析协议,以促进LU-KV的实际部署。在LongBench和RULER基准上的评估表明,LU-KV将KV缓存大小减少了80%,性能下降最小,同时降低了推理延迟和GPU内存占用。

英文摘要

Given the quadratic complexity of attention, KV cache eviction is vital to accelerate model inference. Current KV cache eviction methods typically rely on instantaneous heuristic metrics, implicitly assuming that score magnitudes are consistent proxies for importance across all heads. However, this overlooks the heterogeneity in predictive fidelity across attention heads. While certain heads prioritize the instantaneous contribution of tokens, others are dedicated to capturing long-horizon utility. In this paper, we propose that optimal budget allocation should be governed by the marginal utility in preserving long-term semantic information. Building on this insight, we propose LU-KV, a novel framework that formulates head-level budget allocation as a global combinatorial optimization problem to maximize the long-horizon marginal contribution of reserved tokens. To solve this non-convex problem, we employ a convex-hull relaxation and a marginal-utility-based greedy solver, achieving near-optimal solutions. Furthermore, we implement a data-driven offline profiling protocol to facilitate the practical deployment of LU-KV. Evaluations on LongBench and RULER benchmarks demonstrate that LU-KV reduces KV cache size by 80% with minimal performance degradation, while also decreasing inference latency and GPU memory footprint.

2602.06065 2026-06-02 stat.ML cond-mat.dis-nn cs.CL cs.LG 版本更新

Deep networks learn to parse uniform-depth context-free languages from local statistics

深度网络从局部统计中学习解析均匀深度的上下文无关语言

Jack T. Parley, Francesco Cagnetta, Matthieu Wyart

发表机构 * GitHub

AI总结 通过引入可调类概率上下文无关文法并设计基于深度卷积网络的推理算法,揭示了语言结构从局部统计中涌现的机制,并验证了深度卷积和Transformer架构的预测。

Comments Accepted as regular paper at ICML 2026

详情
AI中文摘要

理解语言结构如何仅从句子中学习是认知科学和机器学习中的一个核心问题。大型语言模型(LLMs)内部表征的研究支持其在预测下一个词时解析文本的能力,同时独立于表面形式表示语义概念。然而,哪些数据统计使这些成就成为可能,以及需要多少数据,仍然在很大程度上未知。概率上下文无关文法(PCFGs)为研究这些问题提供了一个可处理的测试平台。然而,先前的工作要么侧重于训练网络使用的类解析算法的后验表征,要么侧重于具有固定语法(无需解析)的PCFGs的可学习性。在这里,我们(i)引入了一个可调的PCFGs类别,其中歧义程度和跨尺度的相关结构都可以被控制;(ii)提供了一种学习机制——一种受深度卷积网络结构启发的推理算法——将可学习性和样本复杂度与特定语言统计联系起来;(iii)在深度卷积和基于Transformer的架构上经验性地验证了我们的预测。总体而言,我们提出了一个统一框架,其中不同尺度的相关性消除了局部歧义,使数据的层次化表征得以涌现。

英文摘要

Understanding how the structure of language can be learned from sentences alone is a central question in both cognitive science and machine learning. Studies of the internal representations of Large Language Models (LLMs) support their ability to parse text when predicting the next word, while representing semantic notions independently of surface form. Yet, which data statistics make these feats possible, and how much data is required, remain largely unknown. Probabilistic context-free grammars (PCFGs) provide a tractable testbed for studying these questions. However, prior work has focused either on the post-hoc characterization of the parsing-like algorithms used by trained networks; or on the learnability of PCFGs with fixed syntax, where parsing is unnecessary. Here, we (i) introduce a tunable class of PCFGs in which both the degree of ambiguity and the correlation structure across scales can be controlled; (ii) provide a learning mechanism -- an inference algorithm inspired by the structure of deep convolutional networks -- that links learnability and sample complexity to specific language statistics; and (iii) validate our predictions empirically across deep convolutional and transformer-based architectures. Overall, we propose a unifying framework where correlations at different scales lift local ambiguities, enabling the emergence of hierarchical representations of the data.

2602.01460 2026-06-02 math.OC cs.LG 版本更新

Non-Uniform Noise-to-Signal Ratio in the REINFORCE Policy-Gradient Estimator

REINFORCE策略梯度估计器中的非均匀噪声信号比

Haoyu Han, Heng Yang

发表机构 * math.OC(数学优化)

AI总结 研究REINFORCE策略梯度估计器的噪声信号比(NSR),通过精确刻画线性与多项式系统中的NSR,发现NSR在策略参数空间中高度非均匀,且通常在策略接近最优时增大甚至爆炸,导致训练不稳定和策略崩溃。

详情
AI中文摘要

策略梯度方法在强化学习中被广泛使用,但随着学习的进行,训练常常变得不稳定或减慢。我们通过策略梯度估计器的噪声信号比(NSR)来研究这一现象,该比值定义为估计器方差(噪声)除以真实梯度的平方范数(信号)。我们的主要结果是,对于(i)具有高斯策略和线性状态反馈的有限时域线性系统,以及(ii)具有高斯策略和多项式反馈的有限时域多项式系统,REINFORCE估计器的NSR可以精确刻画——要么是闭式形式,要么通过数值矩评估算法——无需近似。对于一般的非线性动力学和表达性策略(包括神经策略),我们进一步推导了方差的一般上界。这些刻画使得能够直接检查NSR如何随策略参数变化,以及如何沿优化轨迹(例如SGD和Adam)演变。在一系列示例中,我们发现NSR景观高度非均匀,并且通常随着策略接近最优而增大;在某些情况下它会爆炸,从而触发训练不稳定和策略崩溃。

英文摘要

Policy-gradient methods are widely used in reinforcement learning, yet training often becomes unstable or slows down as learning progresses. We study this phenomenon through the noise-to-signal ratio (NSR) of a policy-gradient estimator, defined as the estimator variance (noise) normalized by the squared norm of the true gradient (signal). Our main result is that, for (i) finite-horizon linear systems with Gaussian policies and linear state-feedback, and (ii) finite-horizon polynomial systems with Gaussian policies and polynomial feedback, the NSR of the REINFORCE estimator can be characterized exactly-either in closed form or via numerical moment-evaluation algorithms-without approximation. For general nonlinear dynamics and expressive policies (including neural policies), we further derive a general upper bound on the variance. These characterizations enable a direct examination of how NSR varies across policy parameters and how it evolves along optimization trajectories (e.g. SGD and Adam). Across a range of examples, we find that the NSR landscape is highly non-uniform and typically increases as the policy approaches an optimum; in some regimes it blows up, which can trigger training instability and policy collapse.

2511.21140 2026-06-02 cs.LG cs.CL stat.AP stat.ML 版本更新

How to Correctly Report LLM-as-a-Judge Evaluations

如何正确报告LLM作为评估者的评估结果

Chungpa Lee, Thomas Zeng, Jongwon Jeong, Jy-yong Sohn, Kangwook Lee

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对LLM作为评估者时存在偏差的问题,提出一种插件式校正框架,实现无偏估计和统计原理的不确定性量化,并证明在分布偏移下仍保持无偏性。

详情
Journal ref
International Conference on Machine Learning (ICML) 2026
AI中文摘要

大型语言模型(LLMs)被广泛用作模型响应的可扩展评估者,以替代人工标注者。然而,LLM评估者的不完美灵敏度和特异性会导致朴素评估分数产生偏差。我们提出一个简单的插件式框架,可校正此偏差并实现统计原理的不确定性量化。我们的框架构建置信区间,该区间同时考虑来自测试数据集和人工标注校准数据集的不确定性。此外,它采用自适应策略分配校准样本以获得更紧的区间。重要的是,我们刻画了由真实评估分数和LLM评估者的灵敏度与特异性定义的参数区间,在这些区间内,基于LLM的评估比仅人工评估产生更可靠的估计。此外,我们证明,与现有方法相比,我们的框架在测试集和校准集之间存在分布偏移时仍保持无偏性。

英文摘要

Large language models (LLMs) are widely used as scalable evaluators of model responses in lieu of human annotators. However, imperfect sensitivity and specificity of the LLM judges induce bias in naive evaluation scores. We propose a simple plug-in framework that corrects this bias and enables statistically principled uncertainty quantification. Our framework constructs confidence intervals that account for uncertainty from both the test dataset and a human-labeled calibration dataset. Additionally, it uses an adaptive strategy to allocate calibration samples for tighter intervals. Importantly, we characterize parameter regimes defined by the true evaluation score and the LLM judge's sensitivity and specificity in which our LLM-based evaluation yields more reliable estimates than human-only evaluation. Moreover, we show that our framework remains unbiased under distribution shift between the test and calibration datasets, in contrast to existing approaches.

2602.07356 2026-06-02 cs.LG 版本更新

Controllable Value Alignment in Large Language Models through Neuron-Level Editing

大语言模型中通过神经元级编辑的可控价值对齐

Yonghui Yang, Yihui Wang, Junwei Li, Jilong Liu, Fengbin Zhu, Weibiao Huang, Le Wu, Richang Hong, Tat-Seng Chua

发表机构 * National University of Singapore(新加坡国立大学) Hefei University of Technology(合肥工业大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) ST Engineering Ltd.(ST工程有限公司)

AI总结 提出NeVA框架,通过识别稀疏价值相关神经元并进行推理时激活编辑,实现细粒度可控价值对齐,减少价值泄漏并保持通用能力。

详情
AI中文摘要

随着大语言模型对人类行为和决策的影响不断扩大,使其与人类价值观对齐变得越来越重要。然而,现有的基于引导的对齐方法存在可控性有限的问题:引导目标价值往往会无意中激活其他非目标价值。为了描述这一局限性,我们引入了价值泄漏这一诊断概念,它捕捉了价值引导过程中非目标价值的非预期激活,并基于Schwartz价值理论提出了归一化泄漏度量。基于此分析,我们提出了NeVA,一种用于大语言模型中可控价值对齐的神经元级编辑框架。NeVA识别稀疏的、与价值相关的神经元,并进行推理时激活编辑,无需参数更新或重新训练即可实现细粒度控制。实验表明,NeVA在实现更强的目标价值对齐的同时,对通用能力的性能下降更小。此外,NeVA显著降低了平均泄漏,残余效应主要局限于语义相关的价值类别。总体而言,NeVA为价值对齐提供了一种更可控且可解释的机制。

英文摘要

Aligning large language models (LLMs) with human values has become increasingly important as their influence on human behavior and decision-making expands. However, existing steering-based alignment methods suffer from limited controllability: steering a target value often unintentionally activates other, non-target values. To characterize this limitation, we introduce value leakage, a diagnostic notion that captures the unintended activation of non-target values during value steering, along with a normalized leakage metric grounded in Schwartz's value theory. In light of this analysis, we propose NeVA, a neuron-level editing framework for controllable value alignment in LLMs. NeVA identifies sparse, value-relevant neurons and performs inference-time activation editing, enabling fine-grained control without parameter updates or retraining. Experiments show that NeVA achieves stronger target value alignment while incurring smaller performance degradation on general capability. Moreover, NeVA significantly reduces the average leakage, with residual effects largely confined to semantically related value classes. Overall, NeVA offers a more controllable and interpretable mechanism for value alignment.

2602.02763 2026-06-02 cs.LG 版本更新

Exposing Vulnerabilities in Explanation for Time Series Classifiers via Dual-Target Attacks

通过双目标攻击暴露时间序列分类器解释中的脆弱性

Bohan Wang, Zewen Liu, Lu Lin, Hui Liu, Li Xiong, Ming Jin, Wei Jin

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出TSEF双目标攻击方法,通过联合操纵分类器和解释器输出,实现目标性误分类同时保持解释与参考一致,揭示解释稳定性不能作为决策鲁棒性的可靠指标。

Comments Accepted at ICML 2026; Code is available at https://github.com/Bohan7/TSEF

详情
AI中文摘要

可解释的时间序列深度学习系统通常通过检查解释的时间一致性来评估,隐含地将此视为鲁棒性的证据。我们表明这一假设可能失败:预测和解释可以被对抗性解耦,实现目标性误分类,同时解释保持合理并与选定的参考理由一致。我们提出TSEF(时间序列解释欺骗器),一种双目标攻击,联合操纵分类器和解释器输出。与破坏解释并广泛分散归因质量的单目标误分类攻击相比,TSEF在保持解释与参考一致的同时实现目标性预测变化。在多个数据集和解释器骨干上,我们的结果一致表明,解释稳定性是决策鲁棒性的误导性代理,并激励为可信赖的时间序列任务进行耦合感知的鲁棒性评估。

英文摘要

Interpretable time series deep learning systems are often assessed by checking temporal consistency on explanations, implicitly treating this as evidence of robustness. We show that this assumption can fail: Predictions and explanations can be adversarially decoupled, enabling targeted misclassification while the explanation remains plausible and consistent with a chosen reference rationale. We propose TSEF (Time Series Explanation Fooler), a dual-target attack that jointly manipulates the classifier and explainer outputs. In contrast to single-objective misclassification attacks that disrupt explanation and spread attribution mass broadly, TSEF achieves targeted prediction changes while keeping explanations consistent with the reference. Across multiple datasets and explainer backbones, our results consistently reveal that explanation stability is a misleading proxy for decision robustness and motivate coupling-aware robustness evaluations for trustworthy time series tasks.

2509.21474 2026-06-02 cs.LG 版本更新

d2: Improving Reasoning in Diffusion Language Models via Trajectory Likelihood Estimation

d2: 通过轨迹似然估计改进扩散语言模型的推理能力

Guanghan Wang, Gilad Turok, Yair Schiff, Marianne Arriola, Volodymyr Kuleshov

发表机构 * Cornell University, Cornell Tech(康奈尔大学,康奈尔科技)

AI总结 提出d2框架,通过新的策略梯度算法和轨迹似然估计,显著提升扩散语言模型在逻辑与数学推理任务上的性能。

Comments ICML 2026. project page: https://guanghanwang.com/d2

详情
AI中文摘要

虽然扩散语言模型(DLMs)在文本生成方面取得了竞争性表现,但利用强化学习提升其推理能力仍是一个活跃的研究领域。本文提出d2,一个专为掩码DLMs设计的推理框架。该框架的核心是一种新的策略梯度算法,依赖于对采样轨迹似然的准确估计。由于对掩码DLMs进行朴素计算这些似然在计算上代价高昂,我们针对不同模型类别开发了一系列估计器。对于支持任意顺序解码采样算法的DLMs,我们提出d2-AnyOrder,它通过单次模型前向传播实现精确的轨迹似然。通过对广泛使用的DLMs的实证研究,我们发现任意顺序解码在实践中并非普遍支持。对于标准掩码扩散模型,我们提出d2-StepMerge,它近似轨迹似然,以可分析的方式在计算和近似精度之间进行权衡。实验表明,当应用于流行的DLMs时,d2显著优于广泛使用的强化学习基线,并在逻辑推理任务(Countdown和Sudoku)以及数学推理基准(GSM8K和MATH500)上为DLMs设立了新的最先进性能。我们在项目页面提供了代码和博客文章:https://guanghanwang.com/d2

英文摘要

While diffusion language models (DLMs) have achieved competitive performance in text generation, improving their reasoning ability with reinforcement learning remains an active research area. Here, we introduce d2, a reasoning framework tailored for masked DLMs. Central to our framework is a new policy gradient algorithm that relies on accurate estimates of the sampling trajectory likelihoods. Because computing these likelihoods naively is computationally expensive for masked DLMs, we develop a family of estimators tailored to distinct model classes. For DLMs that support a sampling algorithm called any-order decoding, we propose d2-AnyOrder, which achieves exact trajectory likelihood with a single model pass. Through an empirical study of widely used DLMs, we show that any-order decoding is not universally supported in practice. For standard masked diffusion models, we propose d2-StepMerge, which approximates the trajectory likelihood, trading off compute for approximation accuracy in an analytically tractable manner. Empirically, d2 significantly outperforms widely-used RL baselines when applied to popular DLMs, and sets a new state-of-the-art performance for DLMs on logical reasoning tasks (Countdown and Sudoku) and math reasoning benchmarks (GSM8K and MATH500). We provide the code along with a blog post on the project page: https://guanghanwang.com/d2

2602.07223 2026-06-02 cs.LG 版本更新

Vegas: Self-Speculative Decoding with Verification-Guided Sparse Attention

Vegas: 验证引导稀疏注意力的自推测解码

Yikang Yue, Yuqi Xue, Jian Huang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出Vegas方法,利用验证过程自然识别关键KV缓存条目,并仅对这些条目进行稀疏注意力计算,从而提升自推测解码的吞吐量。

Comments Accepted to ICML'26

详情
AI中文摘要

长上下文大语言模型(LLM)推理已成为当今AI应用的常态。然而,其KV缓存的内存需求不断增加,严重制约了推理速度。先前的研究表明,采用稀疏注意力的自推测解码(即使用KV缓存子集草拟令牌,并针对完整KV缓存并行验证)能够以无损方式加速推理。然而,这些方法依赖独立的KV选择算法来选取用于草拟的KV条目,忽略了每个KV条目的关键性已在验证过程中固有计算。本文提出Vegas,一种验证引导稀疏注意力的自推测解码方法。Vegas将验证的副产品——关键KV缓存条目——识别出来,并在草拟后续令牌时仅对这些条目计算注意力。这不仅提高了草拟令牌的接受率,还降低了KV选择开销,从而提升了解码吞吐量。与默认vLLM相比,Vegas在解码吞吐量上实现了1.25倍至2.81倍的加速;与最先进的基于稀疏注意力的自推测解码方法相比,实现了1.15倍至1.29倍的加速。我们的代码可在https://github.com/platformxlab/vegas获取。

英文摘要

Long-context large language model (LLM) inference has become the norm for today's AI applications. However, it is severely bottlenecked by the increasing memory demands of its KV cache. Previous works have shown that self-speculative decoding with sparse attention, where tokens are drafted using a subset of the KV cache and verified in parallel against the full KV cache, speeds up inference in a lossless manner. However, they rely on a standalone KV selection algorithm to select the KV entries used for drafting and overlook the fact that the criticality of each KV entry is inherently computed during verification. In this paper, we propose Vegas, a self-speculative decoding method with verification-guided sparse attention. Vegas identifies critical KV cache entries as a byproduct of verification and computes attention only over these entries when drafting subsequent tokens. This not only improves the draft token acceptance rate but also incurs low KV selection overhead, thereby improving decoding throughput. Vegas achieves a 1.25$\times$-2.81$\times$ speedup in decoding throughput over default vLLM and a 1.15$\times$-1.29$\times$ speedup over state-of-the-art sparse attention-based self-speculative decoding methods. Our code is available at https://github.com/platformxlab/vegas.

2602.07218 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Collaborative and Efficient Fine-tuning: Leveraging Task Similarity

协作高效微调:利用任务相似性

Gagik Magakyan, Amirhossein Reisizadeh, Chanwoo Park, Pablo A. Parrilo, Asuman Ozdaglar

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Stanford University(斯坦福大学)

AI总结 提出CoLoRA方法,通过共享适配器和个性化适配器利用任务相似性进行协作微调,提升数据稀缺下的模型性能,并在理论和实验上验证其有效性。

详情
AI中文摘要

适应性被认为是基础模型的核心特征,使其能够有效适应未见过的下游任务。参数高效的微调方法,如著名的LoRA,使得使用标记的、高质量且通常稀缺的任务数据对大型基础模型进行高效适应成为可能。为了缓解基础模型微调中的数据稀缺问题,我们提出利用多个下游用户之间的任务相似性。直观上,具有相似任务的用户必须能够相互帮助,以增加有效的微调数据量。我们提出了协作低秩适应(CoLoRA),该方法利用任务相似性来协作且高效地微调个性化基础模型。CoLoRA的主要思想是训练一个共享适配器,捕捉所有任务之间的潜在任务相似性,以及针对用户特定任务定制的个性化适配器。我们在异质线性回归上对CoLoRA进行了理论研究,并提供了真实恢复的可证明保证。我们还进行了多个具有不同任务相似性的自然语言实验,进一步表明当与相似任务一起训练时,个体性能显著提升。

英文摘要

Adaptability has been regarded as a central feature in the foundation models, enabling them to effectively acclimate to unseen downstream tasks. Parameter-efficient fine-tuning methods such as celebrated LoRA facilitate efficient adaptation of large foundation models using labeled, high-quality and generally scarce task data. To mitigate data scarcity in fine-tuning of foundation models, we propose to leverage task similarity across multiple downstream users. Intuitively, users with similar tasks must be able to assist each other in boosting the effective fine-tuning data size. We propose Collaborative Low-Rank Adaptation, or CoLoRA, which exploits task similarity to collaboratively and efficiently fine-tune personalized foundation models. The main idea in CoLoRA is to train one shared adapter capturing underlying task similarities across all tasks, and personalized adapters tailored to user-specific tasks. We theoretically study CoLoRA on heterogeneous linear regression and provide provable guarantees for ground truth recovery. We also conduct several natural language experiments with varying task similarity, which further demonstrate that when trained together with similar tasks, individual performances are significantly boosted.

2602.06837 2026-06-02 cs.LG stat.ML 版本更新

Sharpness-Aware Hybrid Model Learning for Architecture-Agnostic Parameter Estimation

锐度感知的混合模型学习用于架构无关的参数估计

Naoya Takeishi

发表机构 * The University of Tokyo(东京大学)

AI总结 提出一种基于锐度感知最小化的架构无关方法,通过损失平坦性实现混合模型中科学参数的准确估计。

详情
AI中文摘要

混合建模,即机器学习模型与科学数学模型的结合,能够实现灵活且鲁棒的数据驱动预测,并具有部分可解释性。然而,科学模型的未知参数不一定能被正确估计,因为机器学习模型的灵活性可能导致科学模型部分在预测中被有效忽略。我们可以通过应用正则化来避免这种情况,但这种正则化的公式通常依赖于模型架构和领域知识。在本文中,我们提出了一种架构无关的方法来学习混合模型,同时正确估计科学参数。其思想基于奥卡姆剃刀原则,利用损失最小值的平坦性来实现模型简洁性。我们采用锐度感知最小化的思想,并将其适应于混合建模设置。数值实验证明了基于SAM的混合模型学习在科学参数估计中的有效性。

英文摘要

Hybrid modeling, the combination of machine learning models and scientific mathematical models, enables flexible and robust data-driven prediction with partial interpretability. However, the unknown parameters of the scientific model cannot necessarily be estimated properly, since the flexibility of the machine learning model might make the scientific model part effectively ignored in prediction. We may avoid it by applying some regularization, but the formulation of such regularizers typically depends on model architectures and domain knowledge. In this paper, we propose an architecture-agnostic method to learn hybrid models while properly estimating the scientific parameters. The idea is to use the flatness of loss minima to achieve model simplicity, based upon the Occam's razor principle. We employ the idea of sharpness-aware minimization and adapt it to the hybrid modeling setting. Numerical experiments demonstrate the effectiveness of the SAM-based hybrid model learning for scientific parameter estimation.

2602.06448 2026-06-02 cs.LG cs.AI 版本更新

Principle-Evolvable Scientific Discovery via Uncertainty Minimization

通过不确定性最小化实现原理可演化的科学发现

Yingming Pu, Tao Lin, Hongyu Chen

发表机构 * Westlake University(西lake大学) Zhejiang University(浙江大学)

AI总结 提出PiEvo框架,将科学发现视为原理空间上的贝叶斯优化,通过信息导向假设选择与异常驱动增强机制,使智能体自主演化理论世界观,在四个基准上平均解质量达90.81%~93.15%,收敛速度提升83.3%。

Comments Proceedings of the 43rd International Conference on Machine Learning, Seoul, South Korea. PMLR 306, 2026. Copyright 2026 by the author(s)

详情
Journal ref
Proc. 43rd Intl. Conf. on Machine Learning (ICML 2026), PMLR 306
AI中文摘要

基于大型语言模型的科学智能体加速了科学发现,但由于固守初始先验,常常效率低下。现有方法主要在静态假设空间中操作,限制了新现象的发现,当基线理论失效时导致计算浪费。为解决此问题,我们提出将焦点从搜索假设转向演化底层科学原理。我们提出PiEvo,一个原理可演化框架,将科学发现视为在扩展原理空间上的贝叶斯优化。通过集成基于高斯过程的信息导向假设选择和异常驱动增强机制,PiEvo使智能体能够自主完善其理论世界观。在四个基准上的评估表明,PiEvo (1) 平均解质量高达90.81%~93.15%,比现有最优方法提升29.7%~31.1%;(2) 通过优化紧凑原理空间显著降低样本复杂度,收敛步骤加速83.3%;(3) 在不同科学领域和LLM骨干上保持稳健性能。代码公开于\hyperlink{https://github.com/amair-lab/PiEvo}{github.com/amair-lab/PiEvo}。

英文摘要

Large Language Model (LLM)-based scientific agents have accelerated scientific discovery, yet they often suffer from significant inefficiencies due to adherence to fixed initial priors. Existing approaches predominantly operate within a static hypothesis space, which restricts the discovery of novel phenomena, resulting in computational waste when baseline theories fail. To address this, we propose shifting the focus from searching hypotheses to evolving the underlying scientific principles. We present PiEvo, a principle-evolvable framework that treats scientific discovery as Bayesian optimization over an expanding principle space. By integrating Information-Directed Hypothesis Selection via Gaussian Process and an anomaly-driven augmentation mechanism, PiEvo enables agents to autonomously refine their theoretical worldview. Evaluation across four benchmarks demonstrates that PiEvo (1) achieves an average solution quality of up to 90.81%~93.15%, representing a 29.7%~31.1% improvement over the state-of-the-art, (2) attains an 83.3% speedup in convergence step via significantly reduced sample complexity by optimizing the compact principle space, and (3) maintains robust performance across diverse scientific domains and LLM backbones. Code is publicly available at \hyperlink{https://github.com/amair-lab/PiEvo}{github.com/amair-lab/PiEvo}.

2502.16174 2026-06-02 cs.LG cs.AI cs.CL cs.CR 版本更新

Efficient LLM Moderation with Multi-Layer Latent Prototypes

基于多层潜在原型的高效LLM审核

Maciej Chrabąszcz, Filip Szatkowski, Bartosz Wójcik, Jan Dubiński, Tomasz Trzciński, Sebastian Cygert

发表机构 * University of Warsaw(华沙大学)

AI总结 提出多层原型审核器(MLPM),利用多层中间表示的原型实现轻量、高效且可定制的输入审核,在多个基准上达到最优性能,并可与输出审核结合提升响应安全性。

详情
AI中文摘要

尽管现代LLM在后训练过程中与人类价值观对齐,但在部署时仍需稳健的审核以防止有害输出。现有方法存在性能与效率的权衡,且难以定制以满足用户特定需求。针对这一差距,我们引入了多层原型审核器(MLPM),一种轻量级且高度可定制的输入审核工具。我们提出利用多层中间表示的原型来提高审核质量,同时保持高效率。通过设计,我们的方法对生成流水线的开销可忽略不计,并可无缝应用于任何模型。MLPM在多种审核基准上实现了最先进的性能,并在不同大小的模型系列中表现出强大的可扩展性。此外,我们展示了它能平滑集成到端到端审核流水线中,并在与输出审核技术结合时进一步提高响应安全性。总体而言,我们的工作为安全、稳健且高效的LLM部署提供了一种实用且可适应的解决方案。

英文摘要

Although modern LLMs are aligned with human values during post-training, robust moderation remains essential to prevent harmful outputs at deployment time. Existing approaches suffer from performance-efficiency trade-offs and are difficult to customize to user-specific requirements. Motivated by this gap, we introduce Multi-Layer Prototype Moderator (MLPM), a lightweight and highly customizable input moderation tool. We propose leveraging prototypes of intermediate representations across multiple layers to improve moderation quality while maintaining high efficiency. By design, our method adds negligible overhead to the generation pipeline and can be seamlessly applied to any model. MLPM achieves state-of-the-art performance on diverse moderation benchmarks and demonstrates strong scalability across model families of various sizes. Moreover, we show that it integrates smoothly into end-to-end moderation pipelines and further improves response safety when combined with output moderation techniques. Overall, our work provides a practical and adaptable solution for safe, robust, and efficient LLM deployment.

2602.06136 2026-06-02 cs.LG cs.CV 版本更新

Tempora: Characterising the Time-Contingent Utility of Online Test-Time Adaptation

Tempora: 表征在线测试时适应的时间条件效用

Sudarshan Sreeram, Young D. Kwon, Cecilia Mascolo

发表机构 * University of Bristol(布里斯托大学)

AI总结 提出Tempora框架,通过时间场景、评估协议和时间条件效用指标,系统评估测试时适应方法在延迟约束下的准确性-延迟权衡,揭示传统排名在时间压力下失效。

Comments Accepted to ICML 2026

详情
AI中文摘要

测试时适应(TTA)为在域偏移下性能下降的机器学习模型提供了一种引人注目的补救措施,仅使用未标记样本即可即时改进泛化能力。这种灵活性适合实际部署,但传统评估不切实际地假设无限处理时间,忽略了准确性-延迟权衡。随着机器学习越来越多地支撑延迟敏感和面向用户的应用,时间压力限制了可适应推理的可行性;到达太晚而无法采取行动的预测是徒劳的。我们引入了Tempora,一个在这种压力下评估TTA的框架。它由模拟部署约束的时间场景、实现测量的评估协议以及量化准确性-延迟权衡的时间条件效用指标组成。我们用三个这样的指标实例化该框架:(1)用于具有硬截止时间的异步流的离散效用,(2)用于价值随延迟衰减的交互式设置的连续效用,以及(3)用于预算受限部署的摊销效用。通过将Tempora应用于11种TTA方法,我们发现排名不稳定性在跨越不同数据集、模型和硬件平台的750多次时间评估中持续存在;即,传统排名不能预测时间压力下的排名。最高效用方法随偏移和时间压力而变化,没有明确的赢家。通过首次实现跨不同时间约束的系统评估,Tempora揭示了排名何时以及为何变化,为从业者提供了方法选择的视角,为研究人员提供了可部署适应的目标。代码:https://github.com/sudotensor/tempora。

英文摘要

Test-time adaptation (TTA) offers a compelling remedy for machine learning (ML) models that degrade under domain shifts, improving generalisation on-the-fly with only unlabelled samples. This flexibility suits real deployments, yet conventional evaluations unrealistically assume unbounded processing time, overlooking the accuracy-latency trade-off. As ML increasingly underpins latency-sensitive and user-facing use-cases, temporal pressure constrains the viability of adaptable inference; predictions arriving too late to act on are futile. We introduce Tempora, a framework for evaluating TTA under this pressure. It consists of temporal scenarios that model deployment constraints, evaluation protocols that operationalise measurement, and time-contingent utility metrics that quantify the accuracy-latency trade-off. We instantiate the framework with three such metrics: (1) discrete utility for asynchronous streams with hard deadlines, (2) continuous utility for interactive settings where value decays with latency, and (3) amortised utility for budget-constrained deployments. By applying Tempora to 11 TTA methods, we find that rank instability persists across 750+ temporal evaluations spanning diverse datasets, models, and hardware platforms; i.e., conventional rankings do not predict rankings under temporal pressure. The highest-utility method varies with the shift and temporal pressure, with no clear winner. By enabling systematic evaluation across diverse temporal constraints for the first time, Tempora reveals when and why rankings change, offering practitioners a lens for method selection and researchers a target for deployable adaptation. Code: https://github.com/sudotensor/tempora.

2602.06033 2026-06-02 cs.LG 版本更新

Can Vision Language Models Learn Intuitive Physics from Interaction?

视觉语言模型能否从交互中学习直观物理?

Luca M. Schulze Buschoff, Konstantinos Voudouris, Can Demircan, Eric Schulz

发表机构 * GitHub

AI总结 研究通过强化学习与环境交互训练视觉语言模型,发现交互学习能提升任务内性能,但无法产生可泛化的物理直觉。

Comments Updated accepted version for ICML'26

详情
AI中文摘要

预训练的视觉语言模型对物理世界没有良好的直觉。最近的研究表明,监督微调可以提高模型在简单物理任务上的性能。然而,微调后的模型似乎无法学习能够泛化到新情境的稳健物理规则。基于认知科学的研究,我们假设模型需要与环境交互才能正确学习其物理动态。我们训练模型通过强化学习与模拟环境交互来学习。虽然从交互中学习使模型能够提高其任务内性能,但未能产生具有可泛化物理直觉的模型。我们发现,在一个任务上训练的模型不能可靠地泛化到相关任务,即使这些任务共享视觉统计和物理原理,并且无论模型是否通过交互训练。

英文摘要

Pre-trained vision language models do not have good intuitions about the physical world. Recent work has shown that supervised fine-tuning can improve model performance on simple physical tasks. However, fine-tuned models do not appear to learn robust physical rules that can generalize to new contexts. Based on research in cognitive science, we hypothesize that models need to interact with an environment to properly learn its physical dynamics. We train models that learn through interaction with a simulated environment using reinforcement learning. While learning from interaction allows models to improve their within-task performance, it fails to produce models with generalizable physical intuitions. We find that models trained on one task do not reliably generalize to related tasks, even if the tasks share visual statistics and physical principles, and regardless of whether the models are trained through interaction.

2602.05970 2026-06-02 cs.LG cs.AI math.DS stat.ML 版本更新

Inverse Depth Scaling From Most Layers Being Similar

大多数层相似时的逆深度缩放

Yizhou Liu, Sara Kangaslahti, Ziming Liu, Jeff Gore

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过分析大型语言模型和玩具残差网络,发现损失与深度成反比,归因于功能相似的层通过集成平均而非组合学习或平滑动力学离散化来减少误差,表明需要架构创新以鼓励深度组合使用。

Comments Camera-ready version, ICML 2026

详情
AI中文摘要

神经缩放定律将损失与大型语言模型(LLM)的模型大小联系起来,但深度和宽度可能对性能有不同的贡献,需要更详细的研究。在这里,我们通过分析LLM和玩具残差网络来量化深度如何影响损失。我们发现LLM中的损失与深度成反比,这可能是由于功能相似的层通过集成平均而不是组合学习或平滑动力学的离散化来减少误差。这种机制效率低下但鲁棒,可能源于残差网络的架构偏差和与平滑动力学不兼容的目标函数。研究结果表明,提高LLM效率可能需要架构创新以鼓励深度的组合使用。

英文摘要

Neural scaling laws relate loss to model size in large language models (LLMs), yet depth and width may contribute to performance differently, requiring more detailed studies. Here, we quantify how depth affects loss via analysis of LLMs and toy residual networks. We find loss scales inversely proportional to depth in LLMs, probably due to functionally similar layers reducing error through ensemble averaging rather than compositional learning or discretizing smooth dynamics. This regime is inefficient yet robust and may arise from the architectural bias of residual networks and target functions incompatible with smooth dynamics. The findings suggest that improving LLM efficiency may require architectural innovations to encourage compositional use of depth.

2602.05951 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Better Source, Better Flow: Learning Condition-Dependent Source Distribution for Flow Matching

更好的源,更好的流:学习条件依赖的源分布用于流匹配

Junwan Kim, Jiho Park, Seonghu Jeon, Seungryong Kim

发表机构 * New York University(纽约大学) KAIST AI(韩国科学技术院人工智能实验室)

AI总结 本文提出在流匹配框架中学习条件依赖的源分布,通过方差正则化和源-目标方向对齐,显著提升文本到图像生成的速度和质量。

Comments Project Page: https://junwankimm.github.io/CSFM

详情
AI中文摘要

流匹配最近已成为基于扩散的生成模型的有前途的替代方案,特别是在文本到图像生成方面。尽管它在允许任意源分布方面具有灵活性,但大多数现有方法依赖于标准高斯分布(这是从扩散模型继承的选择),并且很少在这种设置中将源分布本身视为优化目标。在这项工作中,我们表明源分布的原则性设计不仅是可行的,而且在现代文本到图像系统的规模上也是有益的。具体来说,我们提出在流匹配目标下学习条件依赖的源分布,以更好地利用丰富的条件信号。我们识别了将条件直接纳入源时出现的关键失败模式,包括分布坍缩和不稳定性,并表明适当的方差正则化以及源和目标之间的方向对齐对于稳定和有效的学习至关重要。我们进一步分析了目标表示空间的选择如何影响具有结构化源的流匹配,揭示了这种设计最有效的场景。在多个文本到图像基准上的大量实验表明了一致且稳健的改进,包括FID收敛速度提高多达3倍,突出了原则性源分布设计对条件流匹配的实际好处。

英文摘要

Flow matching has recently emerged as a promising alternative to diffusion-based generative models, particularly for text-to-image generation. Despite its flexibility in allowing arbitrary source distributions, most existing approaches rely on a standard Gaussian distribution, a choice inherited from diffusion models, and rarely consider the source distribution itself as an optimization target in such settings. In this work, we show that principled design of the source distribution is not only feasible but also beneficial at the scale of modern text-to-image systems. Specifically, we propose learning a condition-dependent source distribution under flow matching objective that better exploit rich conditioning signals. We identify key failure modes that arise when directly incorporating conditioning into the source, including distributional collapse and instability, and show that appropriate variance regularization and directional alignment between source and target are critical for stable and effective learning. We further analyze how the choice of target representation space impacts flow matching with structured sources, revealing regimes in which such designs are most effective. Extensive experiments across multiple text-to-image benchmarks demonstrate consistent and robust improvements, including up to a 3x faster convergence in FID, highlighting the practical benefits of a principled source distribution design for conditional flow matching.

2602.05395 2026-06-02 stat.ML cs.AI cs.LG 版本更新

Optimal Bayesian Stopping for Efficient Inference of Consistent LLM Answers

用于高效推断一致LLM答案的最优贝叶斯停止

Jingkai Huang, Will Ma, Zhengyuan Zhou

发表机构 * Stern School of Business, New York University, New York, USA(纽约大学 Stern 商学院) Graduate School of Business, Columbia University, New York, USA(哥伦比亚大学 商学院)

AI总结 利用贝叶斯先验信息,通过L-聚合停止策略在达到足够一致性时提前停止采样,以最小化采样成本并高效识别最一致的LLM答案。

Comments Accepted to ICML 2026. Camera-ready version

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

一种提高LLM准确性的简单策略,特别是在数学和推理问题中,是采样多个响应并提交最一致达成的答案。在本文中,我们利用贝叶斯先验信息来节省采样成本,一旦达到足够的一致性就停止。尽管精确后验在计算上难以处理,我们进一步引入了一种高效的“L-聚合”停止策略,该策略仅跟踪L-1个最频繁的答案计数。理论上,我们证明L=3就足够了:这种粗略近似足以实现渐近最优性,并且严格优于无先验基线,同时具有快速的后验计算。实验上,该方法使用更少的样本识别出最一致(即众数)的LLM答案,并且可以在减少LLM调用次数(即节省LLM推理成本)高达50%的同时实现相似的答案准确性。

英文摘要

A simple strategy for improving LLM accuracy, especially in math and reasoning problems, is to sample multiple responses and submit the answer most consistently reached. In this paper we leverage Bayesian prior information to save on sampling costs, stopping once sufficient consistency is reached. Although the exact posterior is computationally intractable, we further introduce an efficient "L-aggregated" stopping policy that tracks only the L-1 most frequent answer counts. Theoretically, we prove that L=3 is all you need: this coarse approximation is sufficient to achieve asymptotic optimality, and strictly dominates prior-free baselines, while having a fast posterior computation. Empirically, this identifies the most consistent (i.e., mode) LLM answer using fewer samples, and can achieve similar answer accuracy while cutting the number of LLM calls (i.e., saving on LLM inference costs) by up to 50%.

2511.16886 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Reasoning in TRMs is Secretly a Policy Improvement Operator

TRMs中的潜在推理实际上是策略改进算子

Arip Asadulaev, Rayan Banerjee, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev Rayan Banerjee Fakhri Karray Martin Takac

AI总结 本文通过将潜在递归推理形式化为策略改进算法,解释了递归步骤何时有效提升性能,并提出结合强化学习和扩散方法的训练方案,在Tiny Recursive Model上实现18倍前向传递减少且保持性能。

详情
AI中文摘要

最近,具有潜在递归的小模型在复杂推理任务上取得了有希望的结果。这些结果通常由这样的理论解释:这种递归增加了网络的深度,使其能够紧凑地模拟更大模型的能力。然而,递归添加层的性能仍然落后于具有相同前馈深度的单次通过模型。这意味着在循环版本中,并非每个递归步骤都有效地贡献于深度。这提出了一个问题:潜在推理何时以及为何能提高性能,何时会导致无效计算?在我们的工作中,我们证明了潜在递归推理为这个问题提供了答案。我们展示了潜在递归推理可以形式化为策略改进算法。基于这些见解,我们提出使用强化学习和扩散方法的训练方案用于潜在推理模型。以Tiny Recursive Model作为测试平台,我们展示了通过我们的修改,可以避免无效计算步骤,并将前向传递总数减少18倍,同时保持性能。总的来说,我们展示了递归步骤的策略改进视角如何解释模型行为,并为进一步改进提供见解。

英文摘要

Recently, small models with latent recursion have obtained promising results on complex reasoning tasks. These results are typically explained by the theory that such recursion increases a networks depth, allowing it to compactly emulate the capacity of larger models. However, the performance of recursively added layers remains behind the capabilities of one pass models with the same feed-forward depth. This means that in the looped version, not every recursive step effectively contributes to depth. This raises the question: when and why does latent reasoning improve performance, and when does it result in dead compute? In our work, we demonstrate that latent recursive reasoning provides answer to this question. We show that latent recursive reasoning can be formalized as a policy improvement algorithm. Building on these insights, we propose to use a training schemes from reinforcement learning and diffusion methods for latent reasoning models. Using the Tiny Recursive Model as our testbed, we show that with our modifications we can avoid dead compute steps and reduce the total number of forward passes by 18x while maintaining performance. Broadly speaking, we show how a policy improvement perspective on recursive steps can explain model behavior and provide insights for further improvements.

2602.04861 2026-06-02 cs.LG cond-mat.mtrl-sci cs.AI physics.chem-ph 版本更新

From Evaluation to Design: Using Potential Energy Surface Smoothness Metrics to Guide Machine Learning Interatomic Potential Architectures

从评估到设计:利用势能面平滑度指标指导机器学习原子间势架构

Ryan Liu, Eric Qu, Tobias Kreiman, Samuel M. Blau, Aditi S. Krishnapriyan

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出键平滑度表征测试(BSCT)作为高效评估机器学习原子间势(MLIP)势能面平滑度的指标,并与分子动力学稳定性强相关,同时指导模型设计以减少伪影。

Comments Accepted at the International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

机器学习原子间势(MLIP)有时无法再现量子势能面(PES)的物理平滑性,导致下游模拟中出现标准能量和力回归评估无法捕捉的错误行为。现有评估方法(如微正则分子动力学(MD))计算成本高且主要探测近平衡态。为改进MLIP的评估指标,我们引入键平滑度表征测试(BSCT)。该高效基准通过受控键变形探测PES,检测近平衡和远离平衡态的非平滑性,包括不连续性、人工极小值和虚假力。我们证明BSCT与MD稳定性强相关,而成本仅为MD的一小部分。为展示BSCT如何指导迭代模型设计,我们利用无约束Transformer主干作为测试平台,说明如何通过改进(如新的可微$k$-最近邻算法和温度控制注意力)减少指标识别的伪影。通过基于BSCT系统优化模型设计,所得MLIP同时实现了低传统E/F回归误差、稳定的MD模拟和鲁棒的原子性质预测。我们的结果将BSCT确立为从业者评估MLIP实用性的验证指标,以及“循环内”模型设计代理,提醒MLIP开发者注意当前MLIP基准无法高效评估的物理挑战。BSCT数据集和评估可在https://github.com/ryanliu30/bsct.git获取。

英文摘要

Machine Learning Interatomic Potentials (MLIPs) sometimes fail to reproduce the physical smoothness of the quantum potential energy surface (PES), leading to erroneous behavior in downstream simulations that standard energy and force regression evaluations can miss. Existing evaluations, such as microcanonical molecular dynamics (MD), are computationally expensive and primarily probe near-equilibrium states. To improve evaluation metrics for MLIPs, we introduce the Bond Smoothness Characterization Test (BSCT). This efficient benchmark probes the PES via controlled bond deformations and detects non-smoothness, including discontinuities, artificial minima, and spurious forces, both near and far from equilibrium. We show that BSCT correlates strongly with MD stability while requiring a fraction of the cost of MD. To demonstrate how BSCT can guide iterative model design, we utilize an unconstrained Transformer backbone as a testbed, illustrating how refinements such as a new differentiable $k$-nearest neighbors algorithm and temperature-controlled attention reduce artifacts identified by our metric. By optimizing model design systematically based on BSCT, the resulting MLIP simultaneously achieves a low conventional E/F regression error, stable MD simulations, and robust atomistic property predictions. Our results establish BSCT as both a validation metric for practitioners to assess MLIP utility and as an "in-the-loop" model design proxy that alerts MLIP developers to physical challenges that cannot be efficiently evaluated by current MLIP benchmarks. The BSCT dataset and evaluation are available on https://github.com/ryanliu30/bsct.git

2501.18649 2026-06-02 cs.CL cs.AI cs.IR cs.LG 版本更新

Fake News Detection After LLM Laundering: Measurement and Explanation

LLM清洗后的假新闻检测:测量与解释

Rupak Kumar Das, Jonathan Dodge

发表机构 * College of IST Pennsylvania State University(宾夕法尼亚州立大学信息科学与技术学院)

AI总结 研究测量检测器在识别LLM改写假新闻时的有效性,发现检测器难以检测LLM改写的假新闻,并通过LIME解释发现情感偏移是检测失败的原因之一。

详情
AI中文摘要

凭借其先进的能力,大型语言模型(LLM)可以生成高度令人信服且上下文相关的假新闻,这可能有助于传播错误信息。尽管针对人类撰写文本的假新闻检测已有大量研究,但检测LLM生成的假新闻这一领域仍探索不足。本研究测量了检测器在识别LLM改写的假新闻方面的有效性,特别是确定在检测流程中添加改写步骤是有助于还是阻碍检测。本研究贡献如下:(1)检测器在检测LLM改写的假新闻时比检测人类撰写文本更困难;(2)我们发现了哪些模型在哪些任务(逃避检测、通过改写逃避检测以及为语义相似性进行改写)上表现出色;(3)通过LIME解释,我们发现了检测失败的一个可能原因:情感偏移;(4)我们发现了一个关于改写质量测量的令人担忧的趋势:尽管BERTSCORE很高,但样本仍表现出情感偏移;(5)我们提供了一对数据集,用改写输出和分数扩充了现有数据集。该数据集可在GitHub上获取。

英文摘要

With their advanced capabilities, Large Language Models (LLMs) can generate highly convincing and contextually relevant fake news, which can contribute to disseminating misinformation. Though there is much research on fake news detection for human-written text, the field of detecting LLM-generated fake news is still under-explored. This research measures the efficacy of detectors in identifying LLM-paraphrased fake news, in particular, determining whether adding a paraphrase step in the detection pipeline helps or impedes detection. This study contributes: (1) Detectors struggle to detect LLM-paraphrased fake news more than human-written text, (2) We find which models excel at which tasks (evading detection, paraphrasing to evade detection, and paraphrasing for semantic similarity). (3) Via LIME explanations, we discovered a possible reason for detection failures: sentiment shift. (4) We discover a worrisome trend for paraphrase quality measurement: samples that exhibit sentiment shift despite a high BERTSCORE. (5) We provide a pair of datasets augmenting existing datasets with paraphrase outputs and scores. The dataset is available on GitHub

2602.03685 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Universal One-third Time Scaling in Learning Peaked Distributions

学习尖峰分布中的普适三分之一时间缩放

Yizhou Liu, Ziming Liu, Cengiz Pehlevan, Jeff Gore

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过理论分析和实验验证,揭示了使用softmax和交叉熵学习尖峰分布时,损失和梯度呈幂律衰减,导致损失时间缩放指数为1/3的普适瓶颈,为神经缩放现象提供了机理解释。

Comments Camera-ready version, ICML 2026

详情
AI中文摘要

训练大型语言模型(LLM)计算成本高昂,部分原因是损失呈现缓慢的幂律收敛,其起源仍有争议。通过对玩具模型的系统分析和LLM的经验评估,我们表明这种行为本质上源于softmax和交叉熵的使用。当学习尖峰概率分布(例如下一个词元分布)时,这些组件普遍产生幂律衰减的损失和梯度,与许多微观细节无关,从而形成基本的优化瓶颈。这最终导致损失的时间缩放服从幂律,普适指数为$1/3$。我们的结果为观察到的神经缩放提供了机理解释,并提出了改进LLM训练效率的新方向。

英文摘要

Training large language models (LLMs) is computationally expensive, partly because the loss exhibits slow power-law convergence whose origin remains debatable. Through systematic analysis of toy models and empirical evaluation of LLMs, we show that this behavior can arise intrinsically from the use of softmax and cross-entropy. When learning peaked probability distributions, e.g., next-token distributions, these components generically yield power-law vanishing losses and gradients, regardless of many microscopic details, creating a fundamental optimization bottleneck. This ultimately leads to power-law time scaling of the loss with a universal exponent of $1/3$. Our results provide a mechanistic explanation for observed neural scaling and suggest new directions for improving LLM training efficiency.

2602.03670 2026-06-02 cs.LG cs.AI cs.NE math.DS physics.class-ph 版本更新

Equilibrium Propagation for Non-Conservative Systems

非保守系统的平衡传播

Antonino Emanuele Scurria, Dimitri Vanden Abeele, Bortolo Matteo Mognetti, Serge Massar

发表机构 * University of Amsterdam(阿姆斯特丹大学) Institute for Advanced Study(高级研究院)

AI总结 提出一种扩展平衡传播到非保守系统(包括前馈网络)的框架,通过在学习阶段引入与非互易相互作用成比例的项来精确计算代价函数的梯度,数值实验表明性能更优且学习更快。

Comments 23 pages

详情
AI中文摘要

平衡传播(EP)是一种受物理学启发的学习算法,它利用动力系统的稳态进行推理和学习。在其原始公式中,它仅限于保守系统,即从能量函数导出的动力学。考虑到它们的应用,将EP扩展到非保守系统(即具有非互易相互作用的系统)非常重要。先前将EP推广到此类系统的尝试未能精确计算代价函数的梯度。在这里,我们提出了一个将EP扩展到任意非保守系统(包括前馈网络)的框架。我们保留了平衡传播的关键特性,即同时使用稳态进行推理和学习。然而,我们在学习阶段通过一个与相互作用的非互易部分成比例的项修改了动力学,以便获得代价函数的精确梯度。该算法也可以通过变分公式推导,该公式通过定义在增广状态空间上的能量函数生成学习动力学。数值实验表明,该算法比先前的方案实现了更好的性能并学习更快。

英文摘要

Equilibrium Propagation (EP) is a physics-inspired learning algorithm that uses stationary states of a dynamical system both for inference and learning. In its original formulation it is limited to conservative systems, $\textit{i.e.}$ to dynamics which derive from an energy function. Given their applications, it is important to extend EP to non-conservative systems, $\textit{i.e.}$ systems with non-reciprocal interactions. Previous attempts to generalize EP to such systems failed to compute the exact gradient of the cost function. Here we propose a framework that extends EP to arbitrary non-conservative systems, including feedforward networks. We keep the key property of equilibrium propagation, namely the use of stationary states both for inference and learning. However, we modify the dynamics in the learning phase by a term proportional to the non-reciprocal part of the interaction so as to obtain the exact gradient of the cost function. This algorithm can also be derived using a variational formulation that generates the learning dynamics through an energy function defined over an augmented state space. Numerical experiments show that this algorithm achieves better performance and learns faster than previous proposals.

2602.03554 2026-06-02 cs.LG cs.AI cs.CE cs.CL 版本更新

When Single Answer Is Not Enough: Rethinking Single-Step Retrosynthesis Benchmarks for LLMs

当单一答案不够时:重新思考面向大语言模型的单步逆合成基准

Bogdan Zagribelnyy, Ivan Ilin, Maksim Kuznetsov, Nikita Bondarev, Mathieu Reymond, Roman Schutski, Thomas MacDougall, Rim Shayakhmetov, Zulfat Miftakhutdinov, Mikolaj Mizera, Vladimir Aladinskiy, Alex Aliper, Alex Zhavoronkov

发表机构 * DeepMind, London, UK(伦敦英国深度思维公司)

AI总结 针对现有逆合成基准依赖单一真实答案的局限,提出基于化学合理性度量ChemCensor的新评估框架,并构建数据集CREED训练模型以提升性能。

详情
AI中文摘要

最近的进展扩展了大语言模型(LLMs)在药物发现中的应用,包括合成规划。然而,逆合成性能的客观评估仍然有限。现有的基准和指标通常依赖于已发表的合成程序以及基于单一真实答案的Top-K准确率,这未能捕捉真实世界合成规划的开放性。我们提出一个新的单步逆合成基准框架,使用ChemCensor(一种化学合理性的新度量)来评估通用型和化学专用型LLMs。通过强调合理性而非精确匹配,该方法更符合人类合成规划实践。我们还引入了CREED,一个包含数百万经ChemCensor验证的反应记录的新数据集,用于LLM训练,并使用它训练了一个在该基准下优于LLM基线的模型。

英文摘要

Recent progress has expanded the use of large language models (LLMs) in drug discovery, including synthesis planning. However, objective evaluation of retrosynthesis performance remains limited. Existing benchmarks and metrics typically rely on published synthetic procedures and Top-K accuracy based on single ground-truth, which does not capture the open-ended nature of real-world synthesis planning. We propose a new benchmarking framework for single-step retrosynthesis that evaluates both general-purpose and chemistry-specialized LLMs using ChemCensor, a novel metric for chemical plausibility. By emphasizing plausibility over exact match, this approach better aligns with human synthesis planning practices. We also introduce CREED, a novel dataset comprising millions of ChemCensor-validated reaction records for LLM training, and use it to train a model that improves over the LLM baselines under this benchmark.

2602.03211 2026-06-02 cs.LG cs.AI 版本更新

Lookahead Sample Reward Guidance for Test-Time Scaling of Diffusion Models

前瞻样本奖励引导用于扩散模型的测试时缩放

Yeongmin Kim, Donghyeok Shin, Byeonghu Na, Minsang Park, Richard Lee Kim, Il-Chul Moon

发表机构 * KAIST(韩国科学技术院)

AI总结 提出一种高效测试时缩放方法LiDAR采样,通过前瞻几步采样和精确求解器引导粒子向高奖励区域移动,无需反向传播,在GenEval上达到与最新梯度引导方法相同性能且加速9.5倍。

Comments ICML 2026 Spotlight

详情
AI中文摘要

扩散模型已展现出强大的生成性能;然而,生成的样本往往未能完全符合人类意图。本文研究了一种高效的测试时缩放方法,用于从具有更高人类对齐奖励值的区域进行采样。现有的计算期望未来奖励(EFR)方法面临重要限制:反向展开导致采样成本过高,而基于Tweedie的方法(包括顺序蒙特卡洛和梯度引导)则存在偏差和固有的采样问题。我们证明,任何$\mathbf{x}_t$处的EFR仅需使用预训练扩散模型的边际样本即可计算,从而无需神经反向传播即可实现闭式奖励引导。为了进一步提高效率,我们引入了少步前瞻采样和一个精确求解器,引导粒子向高奖励的前瞻样本移动。我们将这种采样方案称为LiDAR采样。LiDAR在SDXL上达到了与最新梯度引导方法相同的GenEval性能,并实现了9.5倍的加速。我们在https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling 上发布了代码。

英文摘要

Diffusion models have demonstrated strong generative performance; however, generated samples often fail to fully align with human intent. This paper studies an efficient test-time scaling method for sampling from regions with higher human-aligned reward values. Existing methods for computing the expected future reward (EFR) face important limitations: backward rollout incurs prohibitively high sampling costs, while Tweedie-based approaches, including Sequential Monte Carlo and gradient guidance, suffer from bias and inherent sampling issues. We show that the EFR at any $\mathbf{x}_t$ can be computed using only marginal samples from a pre-trained diffusion model, enabling closed-form reward guidance without neural backpropagation. To further improve efficiency, we introduce a few-step lookahead sampling and an accurate solver that guides particles toward high-reward lookahead samples. We refer to this sampling scheme as LiDAR sampling. LiDAR achieves the same GenEval performance as the latest gradient guidance method for SDXL with a 9.5x speedup. We release the code at https://github.com/aailab-kaist/Diffusion-LiDAR-Sampling.

2602.03203 2026-06-02 cs.CL cs.LG 版本更新

ForesightKV: Optimizing KV Cache Eviction for Reasoning Models by Learning Long-Term Contribution

ForesightKV: 通过学习长期贡献优化推理模型的KV缓存驱逐

Zican Dong, Peiyu Liu, Junyi Li, Zhipeng Chen, Han Peng, Shuo Wang, Wayne Xin Zhao

发表机构 * Zhejiang University(浙江大学)

AI总结 提出ForesightKV框架,通过监督学习和强化学习预测KV对在长文本生成中的重要性,在仅用一半缓存预算下优于现有方法。

Comments ICML 2026

详情
AI中文摘要

近期,大型语言模型通过生成长推理轨迹展现出卓越的推理能力。然而,随着序列长度增长,键值缓存线性扩展,导致显著的内存和计算成本。现有的KV缓存驱逐方法通过丢弃不重要的KV对来缓解此问题,但往往难以捕捉复杂的KV依赖关系,导致性能下降。为了更好地平衡效率与性能,我们引入了ForesightKV,一种基于训练的KV缓存驱逐框架,学习在长文本生成过程中预测哪些KV对应该被驱逐。我们首先设计了Golden Eviction算法,该算法利用未来注意力分数在每一步识别最优的驱逐KV对。然后通过监督训练,使用成对排序损失将这些轨迹和每一步的分数进行蒸馏。此外,我们将缓存驱逐建模为马尔可夫决策过程,并应用GRPO算法来缓解低熵令牌上语言模型损失显著增加的问题。在三个推理模型的AIME2024和AIME2025基准测试上的实验表明,ForesightKV在仅用一半缓存预算的情况下始终优于先前方法,同时从监督学习和强化学习方法中协同受益。代码可在https://github.com/RUCAIBox/ForesightKV获取。

英文摘要

Recently, large language models (LLMs) have shown remarkable reasoning abilities by producing long reasoning traces. However, as the sequence length grows, the key-value (KV) cache expands linearly, incurring significant memory and computation costs. Existing KV cache eviction methods mitigate this issue by discarding less important KV pairs, but often fail to capture complex KV dependencies, resulting in performance degradation. To better balance efficiency and performance, we introduce ForesightKV, a training-based KV cache eviction framework that learns to predict which KV pairs to evict during long-text generations. We first design the Golden Eviction algorithm, which identifies the optimal eviction KV pairs at each step using future attention scores. These traces and the scores at each step are then distilled via supervised training with a Pairwise Ranking Loss. Furthermore, we formulate cache eviction as a Markov Decision Process and apply the GRPO algorithm to mitigate the significant language modeling loss increase on low-entropy tokens. Experiments on AIME2024 and AIME2025 benchmarks of three reasoning models demonstrate that ForesightKV consistently outperforms prior methods under only half the cache budget, while benefiting synergistically from both supervised and reinforcement learning approaches. Code is available at https://github.com/RUCAIBox/ForesightKV.

2602.03024 2026-06-02 cs.LG cs.AI 版本更新

Consistency Deep Equilibrium Models

一致性深度均衡模型

Junchao Lin, Zenan Ling, Jingwen Xu, Robert C. Qiu

发表机构 * School of Electronic Information and Communications, Huazhong University of Science and Technology(华中科技大学电子信息学院) School of Electronic Information(电子信息学院) Communications, Huazhong University of Science(华中科技大学通信学院) School of Science, Wuhan University of Technology(武汉理工大学理学院)

AI总结 提出一致性深度均衡模型(C-DEQ),通过一致性蒸馏将DEQ迭代推理过程视为沿ODE轨迹演化,训练模型将中间状态直接映射到不动点,实现少步推理并保持性能,同时支持多步评估以灵活权衡计算与性能,实验表明在相同少步推理预算下精度提升2-20倍。

详情
AI中文摘要

深度均衡模型(DEQ)已成为深度学习中的一种强大范式,能够以恒定的内存使用量建模无限深度网络。然而,由于不动点求解器的迭代性质,DEQ会带来显著的推理延迟。在这项工作中,我们引入了一致性深度均衡模型(C-DEQ),这是一种利用一致性蒸馏来加速DEQ推理的新框架。我们将DEQ迭代推理过程视为沿固定ODE轨迹向均衡演化。沿着这条轨迹,我们训练C-DEQ将中间状态一致地直接映射到不动点,从而在保持教师DEQ性能的同时实现少步推理。同时,它支持多步评估,以灵活地权衡计算与性能提升。跨多个领域任务的广泛实验表明,在相同的少步推理预算下,C-DEQ相比隐式DEQ实现了2-20倍的精度提升。我们的代码可在https://github.com/landrarwolf/CDEQ获取。

英文摘要

Deep Equilibrium Models (DEQs) have emerged as a powerful paradigm in deep learning, offering the ability to model infinite-depth networks with constant memory usage. However, DEQs incur significant inference latency due to the iterative nature of fixed-point solvers. In this work, we introduce the Consistency Deep Equilibrium Model (C-DEQ), a novel framework that leverages consistency distillation to accelerate DEQ inference. We cast the DEQ iterative inference process as evolution along a fixed ODE trajectory toward the equilibrium. Along this trajectory, we train C-DEQs to consistently map intermediate states directly to the fixed point, enabling few-step inference while preserving the performance of the teacher DEQ. At the same time, it facilitates multi-step evaluation to flexibly trade computation for performance gains. Extensive experiments across various domain tasks demonstrate that C-DEQs achieve consistent 2-20$\times$ accuracy improvements over implicit DEQs under the same few-step inference budget. Our code is available at https://github.com/landrarwolf/CDEQ.

2602.03018 2026-06-02 cs.LG 版本更新

From Zero to Hero: Advancing Zero-Shot Foundation Models for Tabular Outlier Detection

从零到英雄:推进表格异常检测的零样本基础模型

Xueying Ding, Haomin Wen, Simon Klüttermann, Leman Akoglu

发表机构 * Xueying Ding(丁雪莹) Haomin Wen(文浩明) Simon Klüttermann(西蒙·克吕特曼) Leman Akoglu(拉曼·阿科格卢)

AI总结 提出OUTFORMER模型,通过混合合成先验和自演化课程训练,实现零样本表格异常检测,在AdBench及新基准上达到最优性能。

Comments 41 Pages, ICML 2026

详情
AI中文摘要

异常检测(OD)在实践中广泛应用;但由于缺乏标记异常,其在新任务上的有效部署受到阻碍,这使得算法和超参数选择异常困难。基础模型(FMs)已经改变了机器学习,OD也不例外:Shen等人(2025)引入了FoMo-0D,这是第一个用于OD的基础模型,在众多基线中取得了显著性能。本文介绍了OUTFORMER,它通过(1)混合合成先验和(2)自演化课程训练推进了FoMo-0D。OUTFORMER仅在合成标记数据集上预训练,并通过将其训练数据作为上下文输入来推断新任务的测试标签。推理速度快且零样本,仅需前向传播,无需标记异常。得益于上下文学习,它不需要额外工作——无需OD模型训练或定制模型选择——实现了真正的即插即用部署。OUTFORMER在著名的AdBench以及我们引入的两个包含超过1500个数据集的大规模新OD基准上取得了最先进的性能,同时保持了快速的推理速度。

英文摘要

Outlier detection (OD) is widely used in practice; but its effective deployment on new tasks is hindered by lack of labeled outliers, which makes algorithm and hyperparameter selection notoriously hard. Foundation models (FMs) have transformed ML, and OD is no exception: Shen et. al. (2025) introduced FoMo-0D, the first FM for OD, achieving remarkable performance against numerous baselines. This work introduces OUTFORMER, which advances FoMo-0D with (1) a mixture of synthetic priors and (2) self-evolving curriculum training. OUTFORMER is pretrained solely on synthetic labeled datasets and infers test labels of a new task by using its training data as in-context input. Inference is fast and zero-shot, requiring merely forward pass and no labeled outliers. Thanks to in-context learning, it requires zero additional work-no OD model training or bespoke model selection-enabling truly plug-and-play deployment. OUTFORMER achieves state-of-the-art performance on the prominent AdBench, as well as two new large-scale OD benchmarks that we introduce, comprising over 1,500 datasets, while maintaining speedy inference.

2602.02886 2026-06-02 cs.LG cs.AI 版本更新

Mixture of Concept Bottleneck Experts

概念瓶颈专家混合模型

Francesco De Santis, Gabriele Ciravegna, Giovanni De Felice, Arianna Casanova, Francesco Giannini, Michelangelo Diligenti, Johannes Schneider, Danilo Giordano, Mateo Espinosa Zarlenga, Pietro Barbiero

发表机构 * University of Padua(帕多瓦大学)

AI总结 提出概念瓶颈专家混合模型(M-CBE),通过引入多个专家表达式和灵活的函数形式,在保持可解释性的同时提升预测精度和适应性。

详情
AI中文摘要

概念瓶颈模型(CBM)通过将预测基于人类可理解的概念来促进可解释性。然而,现有的CBM通常将其任务预测器限制为单个表达式,其函数形式是预先设定的,这限制了预测精度和对不同用户需求的适应性。我们提出了概念瓶颈专家混合模型(M-CBE),这是一个沿两个维度推广现有CBM的框架:任务预测器用于将概念映射到任务的表达式数量(称为专家),以及每个表达式所采用的函数形式,从而揭示了该设计空间中一个未被充分探索的区域。我们通过实例化两个新颖的模型来研究这一区域:线性M-CBE,它学习一组有限的线性表达式;以及符号M-CBE,它利用符号回归从数据中发现专家函数,受限于用户指定的算子词汇表。实证评估表明,改变表达式的数量及其函数形式为导航精度-可解释性权衡提供了一个稳健的框架。

英文摘要

Concept Bottleneck Models (CBMs) promote interpretability by grounding predictions in human-understandable concepts. However, existing CBMs typically constrain their task predictor to a single expression whose functional form is set a priori, limiting both predictive accuracy and adaptability to diverse user needs. We propose Mixture of Concept Bottleneck Experts (M-CBE), a framework that generalizes existing CBMs along two dimensions: the number of expressions, referred to as experts, employed by the task predictor to map concepts to the task, and the functional form each expression takes, thus exposing an underexplored region of this design space. We investigate this region by instantiating two novel models: Linear M-CBE, which learns a finite set of linear expressions, and Symbolic M-CBE, which leverages symbolic regression to discover expert functions from data subject to user-specified operator vocabularies. Empirical evaluation demonstrates that varying the number of expressions and their functional form provides a robust framework for navigating the accuracy-interpretability trade-off.

2602.02557 2026-06-02 cs.LG cs.AI cs.SD 版本更新

The Alignment Curse: Modality Alignment Supercharges Audio Attacks via Text Transfer

对齐诅咒:模态对齐通过文本传输增强音频攻击

Yupeng Chen, Junchi Yu, Aoxi Liu, Baoyuan Wu, Philip Torr, Adel Bibi

发表机构 * University of Oxford(牛津大学)

AI总结 本文提出并验证了“对齐诅咒”原理,即更强的文本-音频模态对齐会促进文本攻击向音频的迁移,并通过黑盒实验表明文本转移的音频攻击性能与原生音频攻击相当甚至更优,揭示了能力与安全之间的根本矛盾。

Comments 23 pages, 5 figures

详情
AI中文摘要

近期端到端训练的全能模型通过加强文本-音频模态对齐显著提升了音频能力。然而,这种对齐是否无意中促进了安全漏洞跨模态的转移仍未被充分探索。这一问题至关重要,因为基于文本的越狱攻击远比基于音频的攻击成熟;如果它们系统性转移,当前的音频安全评估可能低估源自文本模态的风险。在本文中,我们引入了“对齐诅咒”,这是一个经过形式化表征和实证验证的原理,表明更强的模态对齐使得攻击从文本到音频的转移更有效,揭示了能力与安全之间的根本矛盾。基于这一原理,我们在最新的全能模型(如Qwen2.5-Omni、Qwen3-Omni)上对三类攻击(文本攻击、文本转移的音频攻击和音频攻击)进行了全面的黑盒评估。我们发现,文本转移的音频攻击与基于音频的攻击表现相当,甚至更优,在仅音频访问下展现出明显优势。这表明基于文本的漏洞在塑造音频安全风险中扮演关键角色。最后,我们实证分析了不同攻击方法和模型下模态对齐与转移有效性之间的关系,观察到对“对齐诅咒”的一致支持:更紧密的模态对齐导致更有效的跨模态攻击转移。

英文摘要

Recent advances in end-to-end trained omni-models have substantially improved audio capabilities by strengthening text-audio modality alignment. However, whether such alignment inadvertently facilitates the transfer of safety vulnerabilities across modalities remains underexplored. This question is critical as text-based jailbreak attacks are considerably more mature than audio-based ones; if they transfer systematically, current audio safety evaluations may underestimate risks originating from the text modality. In this paper, we introduce the Alignment Curse, a formally characterized and empirically validated principle showing that stronger modality alignment enables more effective transfer of attacks from text to audio, revealing a fundamental tension between capability and safety. Motivated by this principle, we conduct a comprehensive black-box evaluation of three attack categories on recent omni-models (e.g., Qwen2.5-Omni, Qwen3-Omni): text attacks, text-transferred audio attacks, and audio attacks. We find that text-transferred audio attacks perform comparably to, and often better than, audio-based attacks, exhibiting a clear advantage under audio-only access. This suggests that text-based vulnerabilities play a pivotal role in shaping audio safety risks. Finally, we empirically analyze the relationship between modality alignment and transfer effectiveness across attack methods and models, observing consistent support for the Alignment Curse: tighter modality alignment leads to more effective cross-modality attack transfer.

2602.02547 2026-06-02 cs.LG cs.AI 版本更新

naPINN: Noise-Adaptive Physics-Informed Neural Networks for Recovering Physics from Corrupted Measurement

naPINN: 用于从损坏测量中恢复物理的噪声自适应物理信息神经网络

Hankyeol Kim, Pilsung Kang

发表机构 * Department of Industrial Engineering(工业工程系) Seoul National University(首尔国立大学)

AI总结 提出噪声自适应物理信息神经网络(naPINN),通过嵌入能量模型学习残差分布并自适应过滤异常值,从非高斯噪声和离群点损坏的测量中鲁棒恢复物理解。

详情
AI中文摘要

物理信息神经网络(PINNs)是解决逆问题和从观测数据中发现控制方程的有效方法。然而,在复杂测量噪声和严重离群点下,其性能显著下降。为解决此问题,我们提出了噪声自适应物理信息神经网络(naPINN),该网络无需噪声分布先验知识,即可从损坏测量中鲁棒恢复物理解。naPINN在训练循环中嵌入一个基于能量的模型,以学习预测残差的潜在分布。利用学习到的能量景观,一个可训练的可靠性门自适应地过滤具有高能量的数据点,同时拒绝代价正则化防止丢弃有效数据导致的平凡解。我们在被非高斯噪声和不同比例离群点损坏的各种基准偏微分方程上展示了naPINN的有效性。结果表明,naPINN显著优于现有的鲁棒PINN基线,成功隔离离群点并在严重数据损坏下准确重建动力学。

英文摘要

Physics-Informed Neural Networks (PINNs) are effective methods for solving inverse problems and discovering governing equations from observational data. However, their performance degrades significantly under complex measurement noise and gross outliers. To address this issue, we propose the Noise-Adaptive Physics-Informed Neural Network (naPINN), which robustly recovers physical solutions from corrupted measurements without prior knowledge of the noise distribution. naPINN embeds an energy-based model into the training loop to learn the latent distribution of prediction residuals. Leveraging the learned energy landscape, a trainable reliability gate adaptively filters data points exhibiting high energy, while a rejection cost regularization prevents trivial solutions where valid data are discarded. We demonstrate the efficacy of naPINN on various benchmark partial differential equations corrupted by non-Gaussian noise and varying rates of outliers. The results show that naPINN significantly outperforms existing robust PINN baselines, successfully isolating outliers and accurately reconstructing the dynamics under severe data corruption.

2510.06048 2026-06-02 cs.LG 版本更新

BLISS: A Lightweight Bilevel Influence Scoring Method for Data Selection in Language Model Pretraining

BLISS: 一种用于语言模型预训练数据选择的轻量级双层影响评分方法

Jie Hao, Rui Yu, Wei Zhang, Huixia Wang, Jie Xu, Mingrui Liu

发表机构 * Department of Computer Science, George Mason University, USA(乔治·马歇尔大学计算机科学系) IBM T.J. Watson Research Center, USA(IBM T.J. Watson研究部) Department of Statistics, Rice University(里士大学统计系) Department of System Engineering & Operations Research, George Mason University, USA(乔治·马歇尔大学系统工程与运营管理系)

AI总结 提出一种无需外部预训练模型的轻量级数据选择方法BLISS,通过双层优化和代理模型估计训练样本的长期影响,实现高效数据筛选,在C4数据集上预训练多种规模模型,显著加速收敛并提升下游任务性能。

详情
AI中文摘要

有效的数据选择对于预训练大型语言模型(LLM)至关重要,可以提高效率并增强对下游任务的泛化能力。然而,现有方法通常需要利用外部预训练模型,使得难以将数据选择的效果与外部预训练模型的效果分开。此外,如果模型训练至收敛,它们通常忽略所选数据的长期影响,这主要是由于全规模LLM预训练的过高成本。在本文中,我们介绍了BLISS(用于数据选择的轻量级双层影响评分方法):一种轻量级数据选择方法,完全从头开始操作,不依赖任何外部预训练预言模型,同时明确考虑所选数据的长期影响。BLISS利用一个小型代理模型作为LLM的替代,并采用一个评分模型来估计如果代理模型训练至收敛时训练样本的长期影响。我们将数据选择形式化为一个双层优化问题,其中上层目标优化评分模型以分配重要性权重给训练样本,确保最小化下层目标(即在加权训练损失上训练代理模型直至收敛)导致最佳验证性能。一旦优化完成,训练好的评分模型预测数据集的影响分数,从而能够高效选择高质量样本用于LLM预训练。我们通过在C4数据集的选择子集上预训练410M/1B/2.8B Pythia和LLaMA-0.5B模型来验证BLISS。值得注意的是,在1B模型设置下,BLISS在达到与最先进方法相同性能时实现了1.7倍的加速,展示了在多个下游任务上的优越性能。

英文摘要

Effective data selection is essential for pretraining large language models (LLMs), enhancing efficiency and improving generalization to downstream tasks. However, existing approaches often require leveraging external pretrained models, making it difficult to disentangle the effects of data selection from those of the external pretrained models. In addition, they often overlook the long-term impact of selected data if the model is trained to convergence, primarily due to the prohibitive cost of full-scale LLM pretraining. In this paper, we introduce BLISS (\textbf{B}ileve\textbf{L} \textbf{I}nfluence \textbf{S}coring method for data \textbf{S}election): a lightweight data selection method that operates entirely \emph{from scratch}, without relying on any external pretrained oracle models, while explicitly accounting for the long-term impact of selected data. BLISS leverages a small proxy model as a surrogate for the LLM and employs a score model to estimate the long-term influence of training samples if the proxy model is trained to convergence. We formulate data selection as a bilevel optimization problem, where the upper-level objective optimizes the score model to assign importance weights to training samples, ensuring that minimizing the lower-level objective (i.e., training the proxy model over the weighted training loss until convergence) leads to best validation performance. Once optimized, the trained score model predicts influence scores for the dataset, enabling efficient selection of high-quality samples for LLM pretraining. We validate BLISS by pretraining 410M/1B/2.8B Pythia and LLaMA-0.5B models on selected subsets of the C4 dataset. Notably, under the 1B model setting, BLISS achieves $1.7\times$ speedup in reaching the same performance as the state-of-the-art method, demonstrating superior performance across multiple downstream tasks.

2602.02250 2026-06-02 math.OC cs.LG 版本更新

Well-Posed KL-Regularized Control via Wasserstein and Kalman-Wasserstein KL Divergences

通过Wasserstein和Kalman-Wasserstein KL散度的适定KL正则化控制

Viktor Stein, Adwait Datar, Nihat Ay

发表机构 * Department of Mathematics, Technical University of Munich \& Munich Center for Machine Learning, Germany. The majority of the work was conducted while at the Institute of Mathematics at the Technical University of Berlin, Germany \& the Berlin Mathematical School. Institute for Data Science Foundations, Hamburg University of Technology, Germany Santa Fe Institute, USA

AI总结 针对KL散度在支持不匹配和低噪声下失效的问题,提出基于传输几何的KL变体,消除线性时不变系统中的奇异性,实现适定控制并提升闭环性能。

Comments 37 pages, 9 figures, comments welcome. Accepted @ ICML'26

详情
AI中文摘要

Kullback-Leibler (KL) 散度正则化在强化学习中广泛使用,但在支持不匹配时会变得无穷大,且在低噪声情况下可能退化。利用统一的信息几何框架,我们通过用基于传输的几何替换KL动态公式中的Fisher-Rao几何,引入KL类似物,并推导出常见分布族的闭式表达式。在椭圆分布之间,这些散度对于退化的相等协方差保持有限,并为卡尔曼集成方法中使用的正则化启发式方法提供了几何解释。我们展示了这些散度在KL正则化最优控制中的效用。在具有高斯过程噪声的线性时不变系统的完全可处理设置中,经典KL简化为二次控制惩罚,该惩罚随着过程噪声消失而变得奇异。我们的变体消除了这种奇异性,并产生了适定问题。在双积分器和倒立摆示例中,所得控制保留了非平凡反馈,并实现了更好的闭环性能。

英文摘要

Kullback-Leibler (KL) divergence regularization is widely used in reinforcement learning, but it becomes infinite under support mismatch and can degenerate in low-noise regimes. Using a unified information-geometric framework, we introduce KL analogs by replacing the Fisher-Rao geometry in the dynamical formulation of the KL with transport-based geometries, and derive closed-form expressions for common distribution families. Between elliptic distributions, these divergences remain finite for degenerating equal covariances and yield a geometric interpretation of regularization heuristics used in Kalman ensemble methods. We demonstrate the utility of these divergences in KL-regularized optimal control. In the fully tractable setting of linear time-invariant systems with Gaussian process noise, the classical KL reduces to a quadratic control penalty that becomes singular as process noise vanishes. Our variants remove this singularity and yield well-posed problems. In both the double integrator and cart-pole examples, the resulting controls preserve nontrivial feedback and achieve better closed-loop performance.

2602.02239 2026-06-02 cs.LG 版本更新

Interpretability in Deep Time Series Models Demands Semantic Alignment

深度时间序列模型的可解释性需要语义对齐

Giovanni De Felice, Riccardo D'Elia, Alberto Termine, Pietro Barbiero, Giuseppe Marra, Silvia Santini

发表机构 * University of Padua(帕多瓦大学)

AI总结 本文提出深度时间序列模型的可解释性应追求语义对齐,即预测应基于对用户有意义的变量,并受时空机制约束,同时需保持时间演化下的语义一致性,为此提供了形式化定义和模型设计蓝图。

Comments Accepted at ICML 2026

详情
AI中文摘要

深度时间序列模型在预测性能上持续提升,但其黑箱特性限制了部署。为此,现有的可解释性方法主要关注解释模型内部计算,而未考虑这些计算是否与人类对研究现象的推理方式一致。相反,我们认为深度时间序列模型的可解释性应追求语义对齐:预测应基于对最终用户有意义的变量来表达,并由允许用户依赖约束的时空机制中介。在本文中,我们形式化了这一要求,并指出一旦建立,语义对齐必须在时间演化下保持:这是一个在静态设置中没有类似物的约束。基于这一定义,我们概述了语义对齐深度时间序列模型的蓝图,确定了支持信任的属性,并讨论了对模型设计的影响。

英文摘要

Deep time series models continue to improve predictive performance, yet their deployment remains limited by their black-box nature. In response, existing interpretability approaches in the field keep focusing on explaining the internal model computations, without addressing whether they align or not with how a human would reason about the studied phenomenon. Instead, we state interpretability in deep time series models should pursue semantic alignment: predictions should be expressed in terms of variables that are meaningful to the end user, mediated by spatial and temporal mechanisms that admit user-dependent constraints. In this paper, we formalize this requirement and state that, once established, semantic alignment must be preserved under temporal evolution: a constraint with no analog in static settings. Provided with this definition, we outline a blueprint for semantically aligned deep time series models, identify properties that support trust, and discuss implications for model design.

2602.02098 2026-06-02 cs.LG cs.AI 版本更新

Probabilistic Performance Guarantees for Multi-Task Reinforcement Learning

多任务强化学习的概率性能保证

Yannik Schnitzer, Mathias Jackermeier, Alessandro Abate, David Parker

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出一种结合每任务有限 rollout 置信下界与任务级泛化的新泛化界,为未见任务提供高置信度性能保证。

详情
AI中文摘要

多任务强化学习训练能够执行多个任务的通用策略。尽管近年来取得了显著进展,现有方法很少提供正式的性能保证,而这在安全关键环境中部署策略时是必不可少的。我们提出了一种方法,用于计算多任务策略在训练期间未见任务上的高置信度性能保证。具体地,我们引入了一个新的泛化界,该界将(i)来自有限 rollout 的每任务置信下界与(ii)来自有限采样任务的任务级泛化相结合,为从相同任意未知分布中抽取的新任务提供高置信度保证。在最新的多任务强化学习方法中,我们证明了这些保证在理论上是合理的,并且在现实样本量下具有信息量。

英文摘要

Multi-task reinforcement learning trains generalist policies that can execute multiple tasks. While recent years have seen significant progress, existing approaches rarely provide formal performance guarantees, which are indispensable when deploying policies in safety-critical settings. We present an approach for computing high-confidence guarantees on the performance of a multi-task policy on tasks not seen during training. Concretely, we introduce a new generalisation bound that composes (i) per-task lower confidence bounds from finitely many rollouts with (ii) task-level generalisation from finitely many sampled tasks, yielding a high-confidence guarantee for new tasks drawn from the same arbitrary and unknown distribution. Across state-of-the-art multi-task RL methods, we show that the guarantees are theoretically sound and informative at realistic sample sizes.

2602.01962 2026-06-02 cs.LG cs.AI 版本更新

Zero-Shot Off-Policy Learning

零样本离策略学习

Arip Asadulaev, Maksim Bobrin, Salem Lahlou, Dmitry Dylov, Fakhri Karray, Martin Takac

发表机构 * Arip Asadulaev(阿里普·阿萨杜拉耶夫) Maksim Bobrin(马克西姆·博布林) Salem Lahlou(萨勒姆·拉洛) Dmitry Dylov(德米特里·达里夫) Fakhri Karray(法赫里·卡里) Martin Takac(马尔 tin 塔卡)

AI总结 本文通过发现后继度量与平稳密度比的理论联系,提出一种零样本离策略学习算法,能够实时推断最优重要性采样比率并进行平稳分布修正,实现无需额外训练即可适应新任务。

详情
AI中文摘要

离策略学习方法旨在直接从固定的先前交互数据集中推导出最优策略。这一目标面临重大挑战,主要源于固有的分布偏移和价值函数高估偏差。这些问题在零样本强化学习中尤为突出,其中在无奖励数据上训练的智能体必须在测试时适应新任务而无需额外训练。在这项工作中,我们通过发现后继度量与平稳密度比的理论联系,解决了零样本场景下的离策略问题。利用这一洞见,我们的算法能够推断最优重要性采样比率,有效地为任意任务实时执行带有最优策略的平稳分布修正。我们在SMPL人体模型上的运动跟踪任务、ExoRL上的连续控制任务以及长时域OGBench任务上对方法进行了基准测试。我们的技术无缝集成到前向-后向表示框架中,并在无需训练的情况下实现对新任务的快速适应。更广泛地说,这项工作架起了离策略学习和零样本适应之间的桥梁,为两个研究领域都带来了益处。

英文摘要

Off-policy learning methods seek to derive an optimal policy directly from a fixed dataset of prior interactions. This objective presents significant challenges, primarily due to the inherent distributional shift and value function overestimation bias. These issues become even more noticeable in zero-shot reinforcement learning, where an agent trained on reward-free data must adapt to new tasks at test time without additional training. In this work, we address the off-policy problem in a zero-shot setting by discovering a theoretical connection of successor measures to stationary density ratios. Using this insight, our algorithm can infer optimal importance sampling ratios, effectively performing a stationary distribution correction with an optimal policy for any task on the fly. We benchmark our method in motion tracking tasks on SMPL Humanoid, continuous control on ExoRL, and for the long-horizon OGBench tasks. Our technique seamlessly integrates into forward-backward representation frameworks and enables fast-adaptation to new tasks in a training-free regime. More broadly, this work bridges off-policy learning and zero-shot adaptation, offering benefits to both research areas.

2602.01053 2026-06-02 cs.LG 版本更新

LRAgent: Efficient KV Cache Sharing for Multi-LoRA LLM Agents

LRAgent: 面向多LoRA LLM代理的高效KV缓存共享

Hyesung Jeon, Hyeongju Ha, Jae-Joon Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 针对多LoRA代理系统中每个代理独立存储相同长轨迹的KV缓存导致内存和计算开销大的问题,提出LRAgent框架,通过将缓存分解为共享基座部分和适配器依赖部分,并利用共享A的多LoRA架构和Flash-LoRA-Attention内核,实现高效共享,在保持精度的同时显著降低开销。

Comments 25 pages, 10 figures, 22 tables

详情
Journal ref
ICML 2026 Poster
AI中文摘要

多LLM代理系统中的角色专业化通常通过多LoRA实现,其中代理共享预训练骨干网络,仅通过轻量级适配器区分。尽管共享基础模型权重,每个代理仍独立构建和存储相同长工具增强轨迹的KV缓存,导致大量内存和计算开销。现有的KV缓存共享方法大多忽略这种多LoRA设置。我们观察到,代理间的缓存差异主要由适配器输出主导,而共享预训练骨干网络的激活保持高度相似。基于此观察,我们提出LRAgent,一个面向多LoRA代理的KV缓存共享框架。它将缓存分解为两个组件:来自预训练权重的共享基座组件和来自LoRA权重的适配器依赖组件。LRAgent通过在代理间共享基座组件并以固有的低秩形式存储适配器组件来减少内存开销。它还通过共享A的多LoRA架构共享低秩缓存,从而减少计算开销,避免对已被其他代理处理过的上下文进行冗余计算。为了在运行时高效重建适配器贡献,我们引入Flash-LoRA-Attention,一个重新排序注意力计算以避免将低秩缓存实例化为全维度的内核。LRAgent实现了接近完全共享缓存的吞吐量和首令牌延迟,同时在代理问答基准测试中保持了接近非共享缓存基线的准确性。

英文摘要

Role specialization in multi-LLM agent systems is often realized via multi-LoRA, where agents share a pretrained backbone and differ only by lightweight adapters. Despite sharing base model weights, each agent independently builds and stores its own KV cache for the same long, tool-augmented trajectories, incurring substantial memory and compute overhead. Existing KV cache sharing methods largely overlook this multi-LoRA setting. We observe that, cache differences across agents are dominated by adapter outputs, while activations from the shared pretrained backbone remain highly similar. Based on this observation, we propose LRAgent, a KV cache sharing framework for multi-LoRA agents. It decomposes the cache into two components, a shared base component derived from pretrained weights and an adapter-dependent component derived from LoRA weights. LRAgent reduces memory overhead by sharing the base component across agents and storing the adapter component in its inherent low-rank form. It also reduces computational overhead by sharing the low-rank cache, enabled by a shared-A multi-LoRA architecture. This avoids redundant computations for contexts that have already been processed by other agents. To efficiently reconstruct adapter contributions at runtime, we introduce Flash-LoRA-Attention, a kernel that reorders attention computation to avoid materializing the low-rank cache to full dimension. LRAgent achieves throughput and time-to-first-token latency close to fully shared caching, while preserving accuracy near the non-shared caching baseline across agentic question-answering benchmarks.

2602.00415 2026-06-02 cs.AI cs.LG 版本更新

PolarMem: A Training-Free Polarized Latent Graph Memory for Verifiable Vision-Language Models

PolarMem: 一种无需训练的可验证视觉语言模型极化隐式图记忆

Zhisheng Chen, Tingyu Wu, Zijie Zhou, Zhengwei Xie, Jinhan Li, Ziyan Weng, Liang Lin, Jingwei Song, Zikai Xiao, Yingwei Zhang

发表机构 * ICT, CAS(中国科学院信息科技研究院) UCAS(中国科学院大学) CUPB(中国政法大学) USTC(中国科学技术大学) CityU-DG(城市大学-数据科学) HKU(香港大学) ZJU(浙江大学)

AI总结 提出PolarMem,一种无需训练的极化隐式图记忆框架,通过语义一致性验证和自适应分布划分将视觉语言模型感知信号转化为HAS、NOT_HAS和Uncertain记忆状态,并采用词典逻辑感知检索协议优先保证逻辑一致性,从而提升检索密集型任务性能并减少矛盾。

详情
AI中文摘要

记忆对于智能系统而言不仅是存储机制,更是组织证据和约束信念的结构。这对多模态推理尤为重要,因为检索到的证据必须既与查询相关又在视觉上一致。然而,当前视觉语言模型(VLM)的记忆系统大多保持正关联:它们检索相似或先前观察到的内容,但缺乏明确的方式记住已被验证为不存在或逻辑排除的内容。为此,我们提出 extbf{PolarMem},一种无需训练的极化隐式图记忆框架,用于可验证的视觉语言推理。PolarMem通过语义一致性验证和自适应分布划分,将冻结的VLM感知信号转化为 extit{HAS}、 extit{NOT\_HAS}和 extit{Uncertain}记忆状态,并将其存储在具有明确正负记忆关系的极化图中。在推理时,词典逻辑感知检索协议在语义相似性之前强制执行逻辑一致性,在冲突记忆进入模型上下文之前将其抑制。在八个冻结的VLM骨干网络和六个多模态基准测试中,PolarMem一致地提升了检索密集型任务性能并减少了检索级矛盾。这些结果凸显了负记忆作为构建更可靠多模态记忆系统的关键机制。我们的代码可在https://github.com/czs-ict/PolarMem获取。

英文摘要

Memory is not merely a storage mechanism for intelligent systems, but a structure for organizing evidence and constraining belief. This is especially important for multimodal reasoning, where retrieved evidence must be both query-relevant and visually consistent. However, current memory systems for vision-language models (VLMs) remain largely positive-associative: they retrieve what is similar or previously observed, but lack an explicit way to remember what has been verified as absent or logically excluded. To this end, we propose \textbf{PolarMem}, a training-free polarized latent graph memory framework for verifiable vision-language reasoning. PolarMem transforms frozen VLM perceptual signals into \textit{HAS}, \textit{NOT\_HAS}, and \textit{Uncertain} memory states through semantic consistency verification and adaptive distributional partitioning, and stores them in a polarized graph with distinct positive and negative memory relations. During inference, a lexicographical logic-aware retrieval protocol enforces logical consistency before semantic similarity, suppressing conflicting memories before they enter the model context. Across eight frozen VLM backbones and six multimodal benchmarks, PolarMem consistently improves retrieval-intensive tasks and reduces retrieval-level contradictions. These results highlight negative memory as a key mechanism for building more reliable multimodal memory systems. Our code is available at https://github.com/czs-ict/PolarMem.

2601.22947 2026-06-02 cs.CL cs.LG 版本更新

Reconsidering Positional Supervision in Masked Diffusion Language Model Training

重新审视掩码扩散语言模型训练中的位置监督

Mengyu Ye, Keito Kudo, Ryosuke Takahashi, Jun Suzuki

发表机构 * Tohoku University(东大大学) RIKEN(理化学研究所) NII LLMC(国家信息研究所LLMC)

AI总结 针对掩码扩散语言模型对位置偏移敏感的问题,提出基于连接主义时序分类(CTC)的目标函数,通过引入松弛令牌和更新折叠映射来吸收位置不确定性,从而在开放生成基准上取得一致提升。

Comments preprint, WIP

详情
AI中文摘要

掩码扩散语言模型(MDLM)通过并行去掩码生成文本,最近成为自回归语言模型的替代方案。它们可以被视为使用位置交叉熵(CE)损失训练的并行解码器,与非自回归翻译(NAT)设置相同。在NAT中,CE训练的并行解码器被认为对小的位置偏移敏感,因为CE会严厉惩罚它们。我们询问CE训练的MDLM在迭代解码下是否同样对此类偏移敏感。为了探究这一点,我们应用了一种受控干预,在解码过程中引入这些偏移。在LLaDA-8B-Instruct和Arena-Hard上,仅将1%的生成令牌移动一个位置,就显著降低了相对于未干预模型的胜率,表明MDLM在迭代并行解码下对此类小偏移敏感。受此启发,我们将连接主义时序分类(CTC)(一种已知能缓解该问题的对齐灵活目标)适配到MDLM监督微调中。通过放宽CE施加的严格位置匹配,CTC为损失提供了吸收小位置偏移的空间;具体地,我们修改了CTC目标,使用一个特殊的<slack>令牌来吸收目标令牌与输出位置之间的位置不确定性,并更新了折叠映射以保留目标表面形式。在四个开放生成基准上,所得模型在原始模型和匹配的交叉熵训练基线上均有一致改进,且在所有四个基准上具有统计显著性。这些结果表明,训练侧的对齐灵活性是MDLM SFT的一个有用设计维度,与先前工作中探索的推理时方法互补。

英文摘要

Masked diffusion language models (MDLMs) generate text by unmasking tokens in parallel and have recently emerged as alternatives to autoregressive language models. They can be viewed as parallel decoders trained with a position-wise cross-entropy (CE) loss, the same setup as non-autoregressive translation (NAT). In NAT, CE-trained parallel decoders have been argued to be sensitive to small positional shifts, since CE penalizes them harshly. We ask whether CE-trained MDLMs are similarly sensitive to such shifts under iterative decoding. To probe this, we apply a controlled intervention that introduces them during decoding. On LLaDA-8B-Instruct with Arena-Hard, displacing as little as 1% of generated tokens by one position substantially reduces win rates against the unintervened model, showing that MDLMs are sensitive to such small shifts under iterative parallel decoding. Motivated by this, we adapt connectionist temporal classification (CTC), an alignment-flexible objective known to mitigate it there, to MDLM supervised fine-tuning. By relaxing the strict position-wise match that CE imposes, CTC gives the loss room to absorb small positional shifts; concretely, we modified CTC objective to use a special <slack> token that absorbs positional uncertainty between target tokens and output positions, and a updated collapse map that preserves target surface forms. Across four open-ended generation benchmarks, the resulting model consistently improves over both the original model and a matched cross-entropy-trained baseline, with statistically significant gains on all four. These results identify training-side alignment flexibility as a useful design dimension for MDLM SFT, complementary to the inference-time approaches explored in prior work.

2601.22813 2026-06-02 cs.LG 版本更新

Quartet II: Accurate LLM Pre-Training in NVFP4 by Improved Unbiased Gradient Estimation

Quartet II: 通过改进的无偏梯度估计实现 NVFP4 中准确的 LLM 预训练

Andrei Panferov, Erik Schultheis, Soroush Tabesh, Dan Alistarh

发表机构 * University of Waterloo(多伦多大学)

AI总结 提出一种用于微缩放格式的无偏量化方法 MS-EDEN,其量化误差比随机舍入低 2 倍以上,并集成到全 NVFP4 线性层量化方案 Quartet II 中,在 LLM 预训练中实现更准确的梯度估计和加速。

详情
AI中文摘要

NVIDIA Blackwell GPU 硬件支持的 NVFP4 低精度格式,有望首次实现大规模模型(如 LLM)的端到端全量化预训练。然而,现有的量化训练方法仍然牺牲了该格式的部分表示能力,以通过随机舍入(SR)获得更准确的无偏量化梯度估计,相对于标准 FP16 和 FP8 训练,损失了显著的准确性。在本文中,我们通过一种新颖的微缩放格式无偏量化例程 MS-EDEN 改进了 NVFP4 量化训练的最新技术,其量化误差比 SR 低 2 倍以上。我们将其集成到一种新颖的全 NVFP4 线性层量化方案 Quartet II 中。我们分析表明,Quartet II 在前向和后向传播的所有主要矩阵乘法中一致地实现了更好的梯度估计。此外,我们的提议与最近专门针对 NVFP4 的训练改进很好地协同。我们进一步在多达 1.9B 参数和 38B token 的端到端 LLM 训练上验证了 Quartet II。我们提供了在 NVIDIA Blackwell GPU 上执行的核函数,相比 BF16 实现了高达 4.2 倍的加速。我们的代码可在 https://github.com/IST-DASLab/Quartet-II 获取。

英文摘要

The NVFP4 lower-precision format, supported in hardware by NVIDIA Blackwell GPUs, promises to allow, for the first time, end-to-end fully-quantized pre-training of massive models such as LLMs. Yet, existing quantized training methods still sacrifice some of the representation capacity of this format in favor of more accurate unbiased quantized gradient estimation by stochastic rounding (SR), losing noticeable accuracy relative to standard FP16 and FP8 training. In this paper, improve the state of the art for quantized training in NVFP4 via a novel unbiased quantization routine for micro-scaled formats, called MS-EDEN, that has more than 2x lower quantization error than SR. We integrate it into a novel fully-NVFP4 quantization scheme for linear layers, called Quartet II. We show analytically that Quartet II achieves consistently better gradient estimation across all major matrix multiplications, both on the forward and on the backward passes. In addition, our proposal synergizes well with recent training improvements aimed specifically at NVFP4. We further validate Quartet II on end-to-end LLM training with up to 1.9B parameters on 38B tokens. We provide kernels for execution on NVIDIA Blackwell GPUs with up to 4.2x speedup over BF16. Our code is available at https://github.com/IST-DASLab/Quartet-II .

2601.22784 2026-06-02 stat.ML cs.LG 版本更新

Approximating $f$-Divergences with Rank Statistics

用秩统计量近似 $f$-散度

Viktor Stein, José Manuel de Frutos

发表机构 * Department of Mathematics, Technical University of Munich \& Munich Center for Machine Learning, Germany. The majority of the work was conducted while at the Institute of Mathematics at the Technical University of Berlin, Germany \& the Berlin Mathematical School. Department of Signal Theory

AI总结 提出一种基于秩统计量的 $f$-散度近似方法,通过直接处理秩分布避免显式密度比估计,并证明其单调性、下界性质及收敛速率,同时扩展到高维数据的切片版本。

Comments 40 pages, 16 figures, 6 tables, accepted at ICML'26. Comments welcome!

详情
AI中文摘要

我们引入了一种基于秩统计量的 $f$-散度近似方法,通过直接处理秩的分布来避免显式的密度比估计。对于分辨率参数 $K$,我们将两个单变量分布 $μ$ 和 $ν$ 之间的不匹配映射到 $\{0, \ldots, K\}$ 上的秩直方图,并通过离散 $f$-散度测量其与均匀分布的偏差,从而得到一个秩统计量散度估计量。我们证明该散度估计量在 $K$ 上是单调的,并且始终是真实 $f$-散度的下界,同时在分位数域密度比的适度正则性下,建立了 $K o\infty$ 时的定量收敛速率。为了处理高维数据,我们通过随机投影对单变量构造进行平均,定义了切片秩统计量 $f$-散度,并给出了切片极限的收敛结果。我们还推导了估计量的有限样本偏差界以及渐近正态性结果。最后,通过与神经基线进行基准测试,并展示其在生成建模实验中作为学习目标的应用,我们实证验证了该方法的有效性。

英文摘要

We introduce a rank-statistic approximation of $f$-divergences that avoids explicit density-ratio estimation by working directly with the distribution of ranks. For a resolution parameter $K$, we map the mismatch between two univariate distributions $μ$ and $ν$ to a rank histogram on $\{ 0, \ldots, K\}$ and measure its deviation from uniformity via a discrete $f$-divergence, yielding a rank-statistic divergence estimator. We prove that the resulting estimator of the divergence is monotone in $K$, is always a lower bound of the true $f$-divergence, and we establish quantitative convergence rates for $K\to\infty$ under mild regularity of the quantile-domain density ratio. To handle high-dimensional data, we define the sliced rank-statistic $f$-divergence by averaging the univariate construction over random projections, and we provide convergence results for the sliced limit as well. We also derive finite-sample deviation bounds along with asymptotic normality results for the estimator. Finally, we empirically validate the approach by benchmarking against neural baselines and illustrating its use as a learning objective in generative modeling experiments.

2601.22651 2026-06-02 cs.LG cs.AI 版本更新

GUDA: Counterfactual Group-wise Training Data Attribution for Diffusion Models via Unlearning

GUDA: 基于反事实的扩散模型分组训练数据归因方法

Naoki Murata, Yuhta Takida, Chieh-Hsin Lai, Toshimitsu Uesaka, Bac Nguyen, Stefano Ermon, Yuki Mitsufuji

发表机构 * University of Tokyo(东京大学) Toyota Central Research Laboratory(丰田中央研究所) University of California, Berkeley(加州大学伯克利分校) Massachusetts Institute of Technology(麻省理工学院) National Institute of Advanced Industrial Science and Technology(国家工业科学与技术研究院)

AI总结 提出GUDA方法,利用机器遗忘近似反事实模型,通过似然评分规则(ELBO)量化组别影响,实现高效的分组训练数据归因。

Comments Accepted at ICML 2026. Code is available at https://github.com/sony/guda

详情
AI中文摘要

视觉生成模型的训练数据归因旨在识别哪些训练数据影响了给定输出。虽然大多数方法对单个样本进行评分,但实践者通常需要组级别的答案(例如,艺术风格或对象类别)。分组归因是反事实的:如果某个组别在训练中缺失,模型对生成样本的行为会如何变化?这种反事实的自然实现是留一组法(LOGO)重训练,即移除每个组别后重新训练模型;然而,随着组别数量的增加,计算变得不可行。我们提出了用于扩散模型的GUDA(基于组遗忘的数据归因)方法,该方法通过应用机器遗忘到共享的全数据模型而不是从头训练来近似每个反事实模型。GUDA使用全模型和每个遗忘反事实模型之间基于似然的评分规则(ELBO)的差异来量化组别影响。在CIFAR-10和Stable Diffusion的艺术风格归因上的实验表明,GUDA比语义相似性、基于梯度的归因和实例级遗忘方法更可靠地识别主要贡献组别,同时在CIFAR-10上比LOGO重训练实现了约100倍的加速。

英文摘要

Training-data attribution for vision generative models aims to identify which training data influenced a given output. While most methods score individual examples, practitioners often need group-level answers (e.g., artistic styles or object classes). Group-wise attribution is counterfactual: how would a model's behavior on a generated sample change if a group were absent from training? A natural realization of this counterfactual is Leave-One-Group-Out (LOGO) retraining, which retrains the model with each group removed; however, it becomes computationally prohibitive as the number of groups grows. We propose GUDA (Group Unlearning-based Data Attribution) for diffusion models, which approximates each counterfactual model by applying machine unlearning to a shared full-data model instead of training from scratch. GUDA quantifies group influence using differences in a likelihood-based scoring rule (ELBO) between the full model and each unlearned counterfactual. Experiments on CIFAR-10 and artistic style attribution with Stable Diffusion show that GUDA identifies primary contributing groups more reliably than semantic similarity, gradient-based attribution, and instance-level unlearning approaches, while achieving ~100x speedup on CIFAR-10 over LOGO retraining.

2601.22328 2026-06-02 cs.LG 版本更新

Knowledge-Informed Kernel State Reconstruction from Heterogeneous Partial Observations

知识驱动的异质部分观测核状态重建

Luca Muscarnera, Silas Ruhrberg Estévez, Samuel Holt, Evgeny Saveliev, Mihaela van der Schaar

发表机构 * Supplementary Materials for MAAT Report GitHub Issue(MAAT报告补充材料GitHub问题)

AI总结 提出MAAT框架,利用再生核希尔伯特空间和异质观测算子及先验知识,从部分、噪声、异质观测中重建平滑且物理一致的动态系统状态,显著降低轨迹和导数重建误差。

Comments Accepted at ICML 2026 SD4H Workshop

详情
AI中文摘要

现实世界的科学系统很少通过完整、规则采样的状态轨迹被观测。相反,测量通常是部分的、有噪声的、异质的,提供了潜在动态状态的碎片化视图。我们引入了MAAT(模型感知轨迹近似),一个用于部分观测动态系统中知识驱动的核状态重建框架。MAAT在再生核希尔伯特空间中制定重建,并结合异质观测算子以及语义和结构先验,包括非负性、守恒约束和特定领域的测量模型。这产生了具有解析时间导数的平滑、物理一致的状态估计,为碎片化测量和下游机制发现方法(如符号回归)提供了原则性接口。在九个科学基准、多个噪声设置和一个真实世界的COVID-19数据集上,MAAT相对于强基线显著降低了轨迹和导数重建误差。

英文摘要

Real-world scientific systems are rarely observed through complete, regularly sampled state trajectories. Instead, measurements are often partial, noisy, and heterogeneous, providing fragmented views of latent dynamical states. We introduce MAAT (Model Aware Approximation of Trajectories), a framework for knowledge-informed Kernel State Reconstruction in partially observed dynamical systems. MAAT formulates reconstruction in a reproducing kernel Hilbert space and incorporates heterogeneous observation operators together with semantic and structural priors, including non-negativity, conservation constraints, and domain-specific measurement models. This yields smooth, physically consistent state estimates with analytic time derivatives, providing a principled interface between fragmented measurements and downstream mechanistic discovery methods such as symbolic regression. Across nine scientific benchmarks, multiple noise regimes, and a real-world COVID-19 dataset, MAAT substantially reduces trajectory and derivative reconstruction error relative to strong baselines.

2601.22276 2026-06-02 cs.LG cs.CV 版本更新

SurrogateSHAP: Training-Free Contributor Attribution for Text-to-Image (T2I) Models

SurrogateSHAP:文本到图像(T2I)模型的无训练贡献者归因

Mingyu Lu, Soham Gadgil, Chris Lin, Chanwoo Kim, Su-In Lee

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 针对文本到图像扩散模型中数据贡献者公平估值的高计算成本问题,提出基于预训练模型推理的无重训练框架SurrogateSHAP,利用梯度提升树近似效用函数并解析计算Shapley值,在多个任务上以更低开销超越现有方法。

详情
AI中文摘要

随着文本到图像(T2I)扩散模型在现实创意工作流中的广泛应用,一个用于评估提供数据集合的贡献者的原则性框架对于公平补偿和可持续数据市场至关重要。虽然Shapley值提供了理论上有依据的归因方法,但它面临双重计算瓶颈:(i)对每个采样的玩家(即数据贡献者)子集进行穷举模型重训练的高昂成本,以及(ii)由于贡献者交互,估计边际贡献所需的子集组合数量巨大。为此,我们提出了SurrogateSHAP,一个无需重训练的框架,通过从预训练模型进行推理来近似昂贵的重训练博弈。为了进一步提高效率,我们采用梯度提升树来近似效用函数,并基于树模型解析地推导Shapley值。我们在三个不同的归因任务上评估了SurrogateSHAP:(i)CIFAR-20上DDPM-CFG的图像质量,(ii)后印象派艺术品上Stable Diffusion的美学质量,以及(iii)时尚产品数据上FLUX.1的产品多样性。在各种设置下,SurrogateSHAP在显著降低计算开销的同时优于先前方法,一致地在多个效用指标上识别出有影响力的贡献者。最后,我们证明了SurrogateSHAP能够有效定位导致临床图像中虚假相关的数据源,为审计安全关键型生成模型提供了一条可扩展的路径。

英文摘要

As Text-to-Image (T2I) diffusion models are increasingly used in real-world creative workflows, a principled framework for valuing contributors who provide a collection of data is essential for fair compensation and sustainable data marketplaces. While the Shapley value offers a theoretically grounded approach to attribution, it faces a dual computational bottleneck: (i) the prohibitive cost of exhaustive model retraining for each sampled subset of players (i.e., data contributors) and (ii) the combinatorial number of subsets needed to estimate marginal contributions due to contributor interactions. To this end, we propose SurrogateSHAP, a retraining-free framework that approximates the expensive retraining game through inference from a pretrained model. To further improve efficiency, we employ a gradient-boosted tree to approximate the utility function and derive Shapley values analytically from the tree-based model. We evaluate SurrogateSHAP across three diverse attribution tasks: (i) image quality for DDPM-CFG on CIFAR-20, (ii) aesthetics for Stable Diffusion on Post-Impressionist artworks, and (iii) product diversity for FLUX.1 on Fashion-Product data. Across settings, SurrogateSHAP outperforms prior methods while substantially reducing computational overhead, consistently identifying influential contributors across multiple utility metrics. Finally, we demonstrate that SurrogateSHAP effectively localizes data sources responsible for spurious correlations in clinical images, providing a scalable path toward auditing safety-critical generative models.

2501.13428 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Softplus Attention with Re-weighting Boosts Length Extrapolation in Large Language Models

Softplus注意力与重加权提升大语言模型的长度外推能力

Bo Gao, Michael W. Spratling, Letizia Gionfrida

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种两阶段注意力机制,用Softplus和l1归一化替代Softmax,并引入基于不变熵的动态缩放因子和重加权机制,以提升数值稳定性、缓解注意力下沉现象,并显著改善长度外推性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

近年来,大语言模型取得了显著成功,这主要归功于自注意力机制。然而,传统的Softmax注意力存在数值不稳定性,并且随着推理令牌数量的增加,性能会下降。本文通过提出一种新的注意力设计原则来解决这些问题,将注意力视为一个两阶段过程。第一阶段(归一化)通过用数值更稳定的Softplus后接$l_{1}$归一化替代Softmax来改进标准注意力。此外,我们引入了一个基于不变熵的动态缩放因子。我们证明,这种新颖的注意力机制优于传统的Softmax注意力和最先进的非Softmax替代方案。我们的第二个提议是引入第二阶段处理(锐化),该阶段由一个重加权机制组成,该机制放大重要的注意力权重,同时削弱较弱的权重。这使得模型能够更有效地聚焦于相关令牌,缓解注意力下沉现象,并从根本上改善长度外推。这种新颖的两阶段自注意力替代方案被证明能确保数值稳定性,并显著提升长度外推能力,在训练长度的16倍时保持几乎恒定的验证损失,同时在具有挑战性的长上下文检索任务和下游基准测试中取得优异结果。此外,符号回归实验表明,我们的方法使模型能够从轨道轨迹序列中恢复牛顿万有引力定律,这为适当的注意力机制对于基础模型发展真正的物理世界模型至关重要提供了证据。我们的代码可在 https://github.com/iminfine/freeattn 获取。

英文摘要

Large language models have achieved remarkable success in recent years, primarily due to self-attention. However, traditional Softmax attention suffers from numerical instability and reduced performance as the number of inference tokens increases. This work addresses these issues by proposing a new design principle for attention, viewing it as a two-stage process. The first stage (normalisation) refines standard attention by replacing Softmax with the more numerically stable Softplus followed by $l_{1}$-normalisation. Furthermore, we introduce a dynamic scale factor based on invariance entropy. We show that this novel attention mechanism outperforms conventional Softmax attention, and state-of-the-art Softmax-free alternatives. Our second proposal is to introduce a second processing stage (sharpening) which consists of a re-weighting mechanism that amplifies significant attentional weights while diminishing weaker ones. This enables the model to concentrate more effectively on relevant tokens, mitigating the attention sink phenomenon, and fundamentally improving length extrapolation. This novel, two-stage, replacement for self-attention is shown to ensure numerical stability and dramatically improve length extrapolation, maintaining a nearly constant validation loss at 16$\times$ the training length while achieving superior results on challenging long-context retrieval tasks and downstream benchmarks. Furthermore, symbolic regression experiments demonstrate that our method enables models to recover Newton's gravitational law from orbital trajectory sequences, providing evidence that appropriate attention mechanisms are crucial for foundation models to develop genuine physical world models. Our code is available at https://github.com/iminfine/freeattn.

2601.21959 2026-06-02 stat.ML cs.LG 版本更新

Near-Optimal Private Tests for Simple and MLR Hypotheses

简单和MLR假设的近最优私有检验

Yu-Wei Chen, Raghu Pasupathy, Jordan Awan

发表机构 * Department of Statistics, Purdue University West Lafayette(统计学系,普渡大学西拉法叶分校) Department of Statistics, University of Pittsburgh(统计学系,匹兹堡大学)

AI总结 本文在高斯差分隐私框架下,针对单调似然比条件下的简单、单侧和双侧假设检验,提出了一种基于数据驱动截断边界的私有均值估计器,并构造了私有检验统计量,实现了与非参数最有效检验相同的渐近相对效率,同时保守控制第一类错误。

详情
AI中文摘要

我们在高斯差分隐私框架下,针对单调似然比条件下的简单假设以及单侧和双侧检验,开发了一种近最优的检验程序。我们的机制基于具有数据驱动截断边界的私有均值估计器,其总体风险在对数因子范围内匹配私有极小化率。利用该估计器,我们构造了私有检验统计量,在保持保守的第一类错误控制的同时,实现了与非私有最有效检验相同的渐近相对效率。除了理论结果外,我们的数值实验表明,即使在中等小的样本量和隐私损失预算下,我们的私有检验也优于竞争性的差分隐私方法,并提供与非私有最有效检验相当的功效。

英文摘要

We develop a near-optimal testing procedure under the framework of Gaussian differential privacy for simple as well as one- and two-sided tests under monotone likelihood ratio conditions. Our mechanism is based on a private mean estimator with data-driven clamping bounds, whose population risk matches the private minimax rate up to logarithmic factors. Using this estimator, we construct private test statistics that achieve the same asymptotic relative efficiency as the non-private, most powerful tests while maintaining conservative type I error control. In addition to our theoretical results, our numerical experiments show that our private tests outperform competing DP methods and offer comparable power to the non-private most powerful tests, even at moderately small sample sizes and privacy loss budgets.

2601.07742 2026-06-02 cond-mat.mtrl-sci cs.LG 版本更新

PFT: Phonon Fine-tuning for Machine Learned Interatomic Potentials

PFT: 机器学习原子间势的声子微调

Teddy Koker, Abhijeet Gangan, Mit Kotak, Jaime Marian, Tess Smidt

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出声子微调(PFT)方法,通过监督二阶力常数来优化机器学习原子间势(MLIP)的曲率,显著提升声子热力学性质预测精度。

Comments 17 pages, 11 figures, ICML 2026

详情
AI中文摘要

许多材料性质依赖于势能面的高阶导数,然而使用能量、力和应力误差的标准损失训练的机器学习原子间势(MLIP)可能在曲率上存在误差,从而降低振动性质的预测。我们引入了声子微调(PFT),通过将MLIP能量Hessian矩阵与有限位移声子计算得到的DFT力常数匹配,直接监督材料的二阶力常数。为了扩展到大型超胞,PFT随机采样Hessian列并通过单个Hessian-向量积计算损失。我们还使用简单的协同训练方案来整合上游数据以减轻灾难性遗忘。在MDR声子基准测试中,PFT在声子热力学性质上平均将Nequix MP提升了55%,并在基于Materials Project轨迹训练的模型中达到了最先进的精度。PFT还能泛化改进超越二阶导数的性质,改善了依赖于势能三阶导数的热导率预测。

英文摘要

Many materials properties depend on higher-order derivatives of the potential energy surface, yet machine learned interatomic potentials (MLIPs) trained with a standard loss on energy, force, and stress errors can exhibit error in curvature, degrading the prediction of vibrational properties. We introduce phonon fine-tuning (PFT), which directly supervises second-order force constants of materials by matching MLIP energy Hessians to DFT-computed force constants from finite displacement phonon calculations. To scale to large supercells, PFT stochastically samples Hessian columns and computes the loss with a single Hessian-vector product. We also use a simple co-training scheme to incorporate upstream data to mitigate catastrophic forgetting. On the MDR Phonon benchmark, PFT improves Nequix MP by 55% on average across phonon thermodynamic properties and achieves state-of-the-art accuracy among models trained on Materials Project trajectories. PFT also generalizes to improve properties beyond second-derivatives, improving thermal conductivity predictions that rely on third-order derivatives of the potential energy.

2601.21718 2026-06-02 cs.LG cs.AI 版本更新

When Does Predictive Inverse Dynamics Outperform Behavior Cloning?

何时预测性逆动力学优于行为克隆?

Lukas Schäfer, Pallavi Choudhury, Abdelhak Lemkhenter, Chris Lovett, Somjit Nath, Luis França, Matheus Ribeiro Furtado de Mendonça, Alex Lamb, Riashat Islam, Siddhartha Sen, John Langford, Katja Hofmann, Sergio Valcarcel Macua

发表机构 * University of Cambridge(剑桥大学) Universitygrow

AI总结 本文通过理论分析解释了预测性逆动力学模型(PIDM)为何在行为克隆(BC)失败时表现更优,归因于偏差-方差权衡,并实验验证了PIDM在样本效率上的显著优势。

Comments To be published in proceedings of the International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

行为克隆(BC)是一种实用的离线模仿学习方法,但在专家演示有限时常常失败。最近的工作引入了一类名为预测性逆动力学模型(PIDM)的架构,它将未来状态预测器与逆动力学模型相结合。虽然PIDM通常优于BC,但其优势背后的原因尚不清楚。在本文中,我们提供了一个理论解释:PIDM引入了偏差-方差权衡。虽然预测未来状态会引入偏差,但将逆动力学模型(IDM)基于该预测可以显著降低方差。我们建立了状态预测器偏差的条件,使得PIDM相比BC实现更低的预测误差和更高的样本效率,当有额外数据源时差距会扩大。我们在2D导航任务中实证验证了理论见解,其中BC需要多达五倍(平均三倍)于PIDM的演示才能达到相当的性能;以及在现代视频游戏中的一个复杂3D环境中,具有高维视觉输入和随机转换,BC需要比PIDM多66%以上的样本。

英文摘要

Behavior cloning (BC) is a practical offline imitation learning method, but it often fails when expert demonstrations are limited. Recent works have introduced a class of architectures named predictive inverse dynamics models (PIDM) that combine a future state predictor with an inverse dynamics model. While PIDM often outperforms BC, the reasons behind its benefits remain unclear. In this paper, we provide a theoretical explanation: PIDM introduces a bias-variance tradeoff. While predicting the future state introduces bias, conditioning the IDM on the prediction can significantly reduce variance. We establish conditions on the state predictor bias for PIDM to achieve lower prediction error and higher sample efficiency than BC, with the gap widening when additional data sources are available. We validate the theoretical insights empirically in 2D navigation tasks, where BC requires up to five times (three times on average) more demonstrations than PIDM to reach comparable performance; and in a complex 3D environment in a modern video game with high-dimensional visual inputs and stochastic transitions, where BC requires over 66% more samples than PIDM.

2601.21579 2026-06-02 cs.CL cs.LG 版本更新

KromHC: Manifold-Constrained Hyper-Connections with Kronecker-Product Residual Matrices

KromHC: 基于Kronecker积残差矩阵的流形约束超连接

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Danilo Mandic

发表机构 * University of Technology Sydney(悉尼科技大学)

AI总结 针对超连接中的训练不稳定和参数爆炸问题,提出KromHC方法,利用Kronecker积分解小规模双随机矩阵来参数化残差矩阵,在保证精确双随机性的同时将参数复杂度降至O(n^2C)。

详情
AI中文摘要

超连接(HC)在神经网络中的成功也凸显了训练不稳定和可扩展性受限的问题。流形约束超连接(mHC)通过将残差连接空间投影到Birkhoff多面体上缓解了这些挑战,但它面临两个问题:1)其迭代Sinkhorn-Knopp(SK)算法并不总是产生精确的双随机残差矩阵;2)mHC的参数复杂度为O(n^3C),其中n是残差流的宽度,C是特征维度。最近提出的mHC-lite通过Birkhoff-von-Neumann定理重新参数化残差矩阵以保证双随机性,但其参数复杂度面临阶乘爆炸,即O(nC·n!)。为了解决这两个挑战,我们提出KromHC,它使用较小双随机矩阵的Kronecker积来参数化mHC中的残差矩阵。通过沿张量化残差流的每个模式对因子残差矩阵施加流形约束,KromHC保证了残差矩阵的精确双随机性,同时将参数复杂度降低到仅O(n^2C)。实验表明,KromHC匹配甚至超越了其他最先进的mHC变体,同时需要显著更少的可训练参数。代码见https://github.com/wz1119/KromHC。

英文摘要

The success of Hyper-Connections (HC) in neural networks (NN) has also highlighted issues related to training instability and restricted scalability. The Manifold-Constrained Hyper-Connections (mHC) mitigate these challenges by projecting the residual connection space onto a Birkhoff polytope, however, it faces two issues: 1) its iterative Sinkhorn-Knopp (SK) algorithm does not always yield exactly doubly stochastic residual matrices; 2) mHC incurs a prohibitive $O(n^3C)$ parameter complexity with $n$ as the width of the residual stream and $C$ as the feature dimension. The recently proposed mHC-lite reparametrizes the residual matrix via the Birkhoff-von-Neumann theorem to guarantee double stochasticity, but also faces a factorial explosion in its parameter complexity, $O \left( nC \cdot n! \right)$. To address both challenges, we propose KromHC, which uses the Kronecker products of smaller doubly stochastic matrices to parametrize the residual matrix in mHC. By enforcing manifold constraints across the factor residual matrices along each mode of the tensorized residual stream, KromHC guarantees exact double stochasticity of the residual matrices while reducing parameter complexity to only $O(n^2C)$. Experiments show that KromHC matches or even outperforms other state-of-the-art (SOTA) mHC variants, while requiring significantly fewer trainable parameters. The code is at https://github.com/wz1119/KromHC.

2601.21237 2026-06-02 cs.DS cs.CL cs.LG 版本更新

Characterizing the Effect of Noise in Language Generation in the Limit

极限情况下语言生成中噪声影响的刻画

Aaron Li, Ian Zhang

发表机构 * Harvard University(哈佛大学) Duke University(杜克大学)

AI总结 本文在极限语言生成模型中,通过分析噪声字符串对生成能力的影响,证明了单个噪声字符串严格减少可生成集合族,且有限噪声等价于单个噪声,并首次刻画了非均匀噪声依赖的可生成性。

Comments ICML 2026

详情
AI中文摘要

Kleinberg 和 Mullainathan 最近提出了一个用于研究语言生成现象的正式框架,称为极限语言生成。在该模型中,对手从未知目标语言中给出示例字符串的枚举,算法需要在有限时间内正确生成目标语言中未见过的字符串。Li、Raman 和 Tewari(2025)后来引入了非均匀和均匀生成的细化概念,Raman 和 Raman(2025)引入了噪声模型,允许对手插入无关字符串。噪声模型中的一个自然问题是通过研究每个额外无关字符串的影响来量化噪声效应。我们在此设置中展示了两个互补的结果。首先,我们证明对于均匀和非均匀生成,单个噪声字符串严格减少了可生成的集合族,从而回答了 Raman 和 Raman(2025)中的一个开放问题。然后,我们证明对于均匀和非均匀生成,单个噪声字符串的生成等价于任何有限噪声量的生成,这与 Bai、Panigrahi 和 Zhang(2026)展示的极限噪声生成的严格层次结构形成鲜明对比。最后,我们利用先前的结果首次提供了非均匀噪声依赖可生成性的刻画。

英文摘要

Kleinberg and Mullainathan recently proposed a formal framework for studying the phenomenon of language generation, called language generation in the limit. In this model, an adversary gives an enumeration of example strings from an unknown target language, and the algorithm is tasked with correctly generating unseen strings from the target language within finite time. Refined notions of non-uniform and uniform generation were later introduced by Li, Raman, and Tewari (2025), and a noisy model was introduced by Raman and Raman (2025), which allows the adversary to insert extraneous strings. A natural question in the noisy model is to quantify the effect of noise, by studying the impact of each additional extraneous string. We show two complementary results in this setting. We first show that for both uniform and non-uniform generation, a single noisy string strictly reduces the set of collections that can be generated, thus answering an open question in Raman and Raman (2025). Then, we show for both uniform and non-uniform generation that generation with a single noisy string is equivalent to generation with any finite amount of noise, sharply contrasting with the strict hierarchy for noisy generation in the limit shown by Bai, Panigrahi, and Zhang (2026). Finally, we leverage our previous results to provide the first known characterization for non-uniform noise-dependent generatability.

2505.14411 2026-06-02 cs.LG 版本更新

Byte Pair Encoding for Efficient Time Series Forecasting

用于高效时间序列预测的字节对编码

Leon Götz, Marcel Kollovieh, Stephan Günnemann, Leo Schwinn

发表机构 * GitHub arXiv

AI总结 提出基于频繁模式的字节对编码方法,通过自适应压缩时间序列为令牌,显著提升预测性能与效率。

Comments 32 pages in total, 22 figures

详情
AI中文摘要

现有的时间序列分词方法主要将固定数量的样本编码为单个令牌。这种不灵活的方法即使对于简单的模式(如扩展的常数值)也可能生成过多的令牌,导致大量计算开销。受字节对编码成功的启发,我们提出了第一个面向模式的时间序列分词方案。基于频繁模式的离散词汇表,我们的方法将具有潜在模式的样本合并为令牌,自适应地压缩时间序列。利用有限的模式集和时间序列的连续特性,我们进一步引入条件解码作为一种轻量级但强大的事后优化方法,该方法无需梯度计算且不增加计算开销。在近期的时间序列基础模型上,基于模式的分词平均将预测性能提升40%,效率提升2314%。条件解码进一步将MSE降低高达48%。在广泛的分析中,我们展示了分词对多样化时间模式的适应性、对未见数据的泛化能力,以及捕获不同时间序列属性(包括统计矩和趋势)的有意义的令牌表示。

英文摘要

Existing time series tokenization methods predominantly encode a constant number of samples into individual tokens. This inflexible approach can generate excessive tokens for even simple patterns like extended constant values, resulting in substantial computational overhead. Inspired by the success of byte pair encoding, we propose the first pattern-centric tokenization scheme for time series analysis. Based on a discrete vocabulary of frequent motifs, our method merges samples with underlying patterns into tokens, compressing time series adaptively. Exploiting our finite set of motifs and the continuous properties of time series, we further introduce conditional decoding as a lightweight yet powerful post-hoc optimization method, which requires no gradient computation and adds no computational overhead. On recent time series foundation models, our motif-based tokenization improves forecasting performance by 40% and boosts efficiency by 2314% on average. Conditional decoding further reduces MSE by up to 48%. In an extensive analysis, we demonstrate the adaptiveness of our tokenization to diverse temporal patterns, its generalization to unseen data, and its meaningful token representations capturing distinct time series properties, including statistical moments and trends.

2601.18783 2026-06-02 cs.LG cs.AI cs.SY eess.SY 版本更新

Multi-Objective Reinforcement Learning for Tactical Decision Making for Trucks in Highway Traffic

多目标强化学习用于高速公路卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg(计算机科学与工程系,查尔姆斯理工大学和哥德堡大学) Department of Mechanics and Maritime Sciences, Chalmers University of Technology(机械与海洋科学系,查尔姆斯理工大学)

AI总结 提出基于近端策略优化的多目标强化学习框架,学习一组帕累托最优策略以平衡安全性、能源效率和时间效率,实现无需重新训练的灵活决策。

详情
AI中文摘要

在高速公路驾驶中平衡安全性、效率和运营成本对重型车辆来说是一个具有挑战性的决策问题。一个核心困难是,通过聚合这些竞争目标得到的传统标量奖励公式往往会掩盖其权衡结构。我们提出了一个基于近端策略优化的多目标强化学习框架,该框架学习一组明确表示这些权衡的策略,并在一个可扩展的模拟平台上对卡车的战术决策进行评估。所提出的方法学习一组帕累托最优策略,捕捉三个冲突目标之间的权衡:安全性(以碰撞和成功完成量化)、能源效率和时间效率(分别以能源成本和驾驶员成本量化)。得到的帕累托前沿平滑且可解释,使得在不同冲突目标下选择驾驶行为具有灵活性。该框架允许在不同驾驶策略之间无缝切换而无需重新训练,为自动驾驶卡车应用提供了稳健且自适应的决策策略。

英文摘要

Balancing safety, efficiency, and operational costs in highway driving poses a challenging decision-making problem for heavy-duty vehicles. A central difficulty is that conventional scalar reward formulations, obtained by aggregating these competing objectives, often obscure the structure of their trade-offs. We present a Proximal Policy Optimization based multi-objective reinforcement learning framework that learns a set of policies explicitly representing these trade-offs and evaluates it on a scalable simulation platform for tactical decision making in trucks. The proposed approach learns a set of Pareto-optimal policies that capture the trade-offs among three conflicting objectives: safety, quantified in terms of collisions and successful completion; energy efficiency and time efficiency, quantified using energy cost and driver cost, respectively. The resulting Pareto frontier is smooth and interpretable, enabling flexibility in choosing driving behavior along different conflicting objectives. This framework allows seamless transitions between different driving policies without retraining, yielding a robust and adaptive decision-making strategy for autonomous trucking applications.

2601.18115 2026-06-02 cs.LG cs.DS math.OC 版本更新

Robust Learning of a Group DRO Neuron

群体分布鲁棒优化神经元的鲁棒学习

Guyang Cao, Shuyao Li, Sushrut Karmalkar, Jelena Diakonikolas

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Microsoft Research, Cambridge(微软研究院,剑桥)

AI总结 针对任意标签噪声和群体级分布偏移,提出一种计算高效的原对偶算法,学习一个单神经元,使其在最小化最坏情况群体加权损失时达到常数因子竞争比。

详情
AI中文摘要

我们研究在存在任意标签噪声和群体级分布偏移的情况下,对于一大类协变量分布,在标准平方损失下学习单个神经元的问题。我们的目标是识别一个由 $\mathbf{w}_*$ 参数化的“最佳拟合”神经元,该神经元在最具挑战性的群体重新加权下表现良好。具体来说,我们解决了一个群体分布鲁棒优化问题:给定对 $K$ 个不同分布 $\mathcal p_{[1]},\dots,\mathcal p_{[K]}$ 的样本访问,我们寻求近似 $\mathbf{w}_*$,该 $\mathbf{w}_*$ 最小化群体分布的凸组合 $\boldsymbolλ \in Δ_K$ 上的最坏情况目标,其中目标为 $\sum_{i \in [K]}λ_{[i]}\,\mathbb E_{(\mathbf x,y)\sim\mathcal p_{[i]}}(σ(\mathbf w\cdot\mathbf x)-y)^2 - νd_f(\boldsymbolλ,\frac{1}{K}\mathbf1)$,而 $d_f$ 是一个 $f$-散度,用于对偏离均匀群体权重施加(可选的)惩罚,由参数 $ν\geq 0$ 缩放。我们开发了一个计算高效的原对偶算法,输出一个向量 $\widehat{\mathbf w}$,该向量在最坏情况群体加权下与 $\mathbf{w}_*$ 相比具有常数因子竞争比。我们的分析框架直接应对损失函数固有的非凸性,在任意标签损坏和群体特定分布偏移的情况下提供鲁棒学习保证。受我们算法框架启发的对偶外推更新实现,在 LLM 预训练基准测试中显示出前景。

英文摘要

We study the problem of learning a single neuron under standard squared loss in the presence of arbitrary label noise and group-level distributional shifts, for a broad family of covariate distributions. Our goal is to identify a ''best-fit'' neuron parameterized by $\mathbf{w}_*$ that performs well under the most challenging reweighting of the groups. Specifically, we address a Group Distributionally Robust Optimization problem: given sample access to $K$ distinct distributions $\mathcal p_{[1]},\dots,\mathcal p_{[K]}$, we seek to approximate $\mathbf{w}_*$ that minimizes the worst-case objective over convex combinations of group distributions $\boldsymbolλ \in Δ_K$, where the objective is $\sum_{i \in [K]}λ_{[i]}\,\mathbb E_{(\mathbf x,y)\sim\mathcal p_{[i]}}(σ(\mathbf w\cdot\mathbf x)-y)^2 - νd_f(\boldsymbolλ,\frac{1}{K}\mathbf1)$ and $d_f$ is an $f$-divergence that imposes (optional) penalty on deviations from uniform group weights, scaled by a parameter $ν\geq 0$. We develop a computationally efficient primal-dual algorithm that outputs a vector $\widehat{\mathbf w}$ that is constant-factor competitive with $\mathbf{w}_*$ under the worst-case group weighting. Our analytical framework directly confronts the inherent nonconvexity of the loss function, providing robust learning guarantees in the face of arbitrary label corruptions and group-specific distributional shifts. The implementation of the dual extrapolation update motivated by our algorithmic framework shows promise on LLM pre-training benchmarks.

2509.13805 2026-06-02 cs.LG cs.AI stat.ML 版本更新

Towards a Physics Foundation Model

迈向物理基础模型

Florian Wiesner, Zoë J. Gray, Matthias Wessling, Stephen Baek

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出通用物理变换器(GPhyT),通过在大规模多样化模拟数据上训练,实现单一模型在多个物理领域(如流固耦合、冲击波、热对流和多相流)的零样本泛化与长期稳定预测,性能超越专用架构7倍以上。

Comments ICML-AI4Physics 2026

详情
AI中文摘要

基础模型通过“一次训练,随处部署”的范式彻底改变了自然语言处理,即单个预训练模型无需重新训练即可适应无数下游任务。拥有物理基础模型(PFM)将是变革性的——它能够民主化高保真模拟的访问、加速科学发现,并消除对专用求解器开发的需求。然而,当前物理感知的机器学习方法仍然从根本上局限于单一狭窄领域,并且需要为每个新系统重新训练。我们提出了通用物理变换器(GPhyT),该模型在1.8 TB的多样化模拟数据上训练,证明了基础模型能力在物理领域是可以实现的。我们的关键见解是,变换器可以学习从上下文中推断支配动力学,从而使单一模型能够模拟流固耦合、冲击波、热对流和多相动力学,而无需被告知底层方程。GPhyT实现了三个关键突破:(1)在多个物理领域上表现出卓越性能,比专用架构高出7倍以上;(2)通过上下文学习,对完全未见过的物理系统进行合理的零样本泛化;(3)通过长程 rollout 实现更稳定的长期预测。通过证明单一模型可以仅从数据中学习可泛化的物理原理,这项工作为通向通用PFM开辟了道路,该模型可能改变计算科学与工程。

英文摘要

Foundation models have revolutionized natural language processing through a ``train once, deploy anywhere'' paradigm, where a single pre-trained model adapts to countless downstream tasks without retraining. Access to a Physics Foundation Model (PFM) would be transformative - democratizing access to high-fidelity simulations, accelerating scientific discovery, and eliminating the need for specialized solver development. Yet current physics-aware machine learning approaches remain fundamentally limited to single, narrow domains and require retraining for each new system. We present the General Physics Transformer (GPhyT), trained on 1.8 TB of diverse simulation data, that demonstrates foundation model capabilities are achievable for physics. Our key insight is that transformers can learn to infer governing dynamics from context, enabling a single model to simulate fluid-solid interactions, shock waves, thermal convection, and multi-phase dynamics without being told the underlying equations. GPhyT achieves three critical breakthroughs: (1) superior performance across multiple physics domains, outperforming specialized architectures by more than 7x, (2) plausible zero-shot generalization to entirely unseen physical systems through in-context learning, and (3) more stable long-term predictions through long-horizon rollouts. By establishing that a single model can learn generalizable physical principles from data alone, this work opens the path toward a universal PFM that could transform computational science and engineering.

2404.01356 2026-06-02 cs.LG cs.AI cs.CY 版本更新

Perturbation Effects on Accuracy and Fairness among Similar Individuals

扰动对相似个体间准确性和公平性的影响

Xuran Li, Hao Xue, Peng Wu, Xingjun Ma, Zhen Zhang, Huaming Chen, Flora D. Salim

发表机构 * University of New South Wales(新南威尔士大学) The Hong Kong University of Science and Technology(香港科学与技术大学) Key Laboratory of System Software, Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所系统软件重点实验室) Fudan University(复旦大学) Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州先进研究所) The University of Sydney(悉尼大学)

AI总结 提出鲁棒个体公平性(RIF)概念,并开发黑盒对抗框架RIFair,通过解耦扰动策略暴露模型在语义保持扰动下同时存在的鲁棒性和公平性缺陷。

详情
AI中文摘要

深度神经网络易受对抗性扰动影响,这些扰动能在不同应用场景中同时降低预测鲁棒性和个体公平性。然而,现有评估协议通常孤立地评估这些维度,从而掩盖了关键故障模式。为弥补这一差距,我们形式化了鲁棒个体公平性(RIF):在语义保持(真值条件保持)扰动下,预测应既相对于真实标签保持正确,又在语义等价的个体间保持不变。为在实践中揭示RIF违规,我们引入RIFair,一种黑盒对抗框架,利用解耦扰动策略构建语义保持但不鲁棒和/或不公平的实例对。跨多个模型架构和真实世界文本数据集的实验表明,仅关注鲁棒性或公平性的指标常常遗漏鲁棒偏差和不鲁棒公平行为。RIFair可靠地暴露这些隐藏的漏洞,支持RIF作为可信模型评估的必要标准。实验代码公开于https://github.com/Xuran-LI/RIFair。

英文摘要

Deep neural networks are vulnerable to adversarial perturbations that can simultaneously degrade prediction robustness and individual fairness across diverse application settings. However, existing evaluation protocols typically assess these dimensions in isolation, thereby obscuring critical failure modes. To bridge this gap, we formalize Robust Individual Fairness (RIF): under semantic-preserving (truth-condition-preserving) perturbations, predictions should remain both correct with respect to the ground truth and invariant across semantically equivalent individuals. To surface RIF violations in practice, we introduce RIFair, a black-box adversarial framework that leverages a decoupled perturbation strategy to construct semantically preserved yet unrobust and/or unfair instance pairs. Experiments across multiple model architectures and real-world textual datasets show that robustness-only or fairness-only metrics often miss Robust Biased and Unrobust Fair behaviors. RIFair}reliably exposes these hidden vulnerabilities, supporting RIF as a necessary criterion for trustworthy model assessment. The experimental code is publicly available at https://github.com/Xuran-LI/RIFair.

2210.12860 2026-06-02 math.OC cs.CC cs.LG 版本更新

Explicit Second-Order Min-Max Optimization: Practical Algorithms and Complexity Analysis

显式二阶极小极大优化:实用算法与复杂度分析

Tianyi Lin, Panayotis Mertikopoulos, Michael I. Jordan

发表机构 * Department of Electrical Engineering and Computer Sciences(电气工程与计算机科学系) Department of Statistics(统计学系) University of California, Berkeley(加州大学伯克利分校) Department of Industrial Engineering and Operations Research(工业工程与运作研究系) Columbia University(哥伦比亚大学) Univ. Grenoble Alpes, CNRS, Inria, Grenoble INP, LIG(格勒诺布尔阿尔卑斯大学、CNRS、Inria、格勒诺布尔INP、LIG)

AI总结 针对凸-凹无约束极小极大优化问题,提出并分析了几种不精确正则化牛顿型方法,证明其迭代在界集内且平均迭代在O(ε^{-2/3})次内收敛到ε-鞍点,并通过Schur分解和线性系统求解器高效求解子问题,在合成基准和AUC最大化实际应用中优于一阶方法。

Comments Accepted by TMLR; Adding funding information; 35 pages

详情
AI中文摘要

我们提出并分析了几种不精确正则化牛顿型方法,用于寻找凸-凹无约束极小极大优化问题的全局鞍点。与一阶方法相比,我们对二阶方法在极小极大优化中的理解相对有限,因为利用二阶信息获得全局收敛率可能更加复杂。在本文中,我们研究了即使在不精确情况下,二阶信息如何用于加速额外梯度方法。特别地,我们证明了所提出的方法生成的迭代保持在有界集内,并且平均迭代在受限间隙函数意义下在O(ε^{-2/3})次内收敛到ε-鞍点。我们还提供了一个简单的例程来求解每次迭代的子问题,该例程需要一次Schur分解和O(log log(1/ε))次对拟上三角系统的线性系统求解器调用。因此,我们的方法通过将所需Schur分解次数减少O(log log(1/ε))因子,改进了现有的基于线搜索的二阶极小极大优化方法。最后,我们在合成基准和来自标准LIBSVM数据集上AUC最大化的实际应用上评估了我们的方法,发现所提出的二阶方法在这些问题上比代表性的一阶方法具有更强的实际效率。

英文摘要

We propose and analyze several inexact regularized Newton-type methods for finding a global saddle point of convex-concave unconstrained min-max optimization problems. Compared to first-order methods, our understanding of second-order methods for min-max optimization is relatively limited, as obtaining global rates of convergence with second-order information can be much more involved. In this paper, we examine how second-order information is used to speed up extra-gradient methods, even under inexactness. In particular, we show that the proposed methods generate iterates that remain within a bounded set and that the averaged iterates converge to an $ε$-saddle point within $O(ε^{-2/3})$ iterations in terms of a restricted gap function. We also provide a simple routine for solving the subproblem at each iteration, requiring a single Schur decomposition and $O(\log\log(1/ε))$ calls to a linear system solver in a quasi-upper-triangular system. Thus, our method improves the existing line-search-based second-order min-max optimization methods by shaving off an $O(\log\log(1/ε))$ factor in the required number of Schur decompositions. Finally, we evaluate our method on both synthetic benchmarks and a real-world application arising from AUC maximization on standard LIBSVM datasets, and find that the proposed second-order approach delivers stronger practical efficiency than representative first-order methods on these problems.

2601.11460 2026-06-02 cs.RO cs.LG 版本更新

Semantic-Geometric Task Representations for Bimanual Manipulation from Human Demonstrations to Robot Action Planning

面向双臂操作的语义-几何任务表示:从人类示范到机器人动作规划

Franziska Herbert, Vignesh Prasad, Han Liu, Dorothea Koert, Georgia Chalvatzaki

发表机构 * Interactive Robot Perception & Learning (PEARL) Lab, Computer Science Dept., TU Darmstadt, Germany(图腾大学达姆施塔特分校计算机科学系交互机器人感知与学习实验室) Hessian.AI, Darmstadt, Germany(黑森人工智能公司) Robotics Institute Germany (RIG)(德国机器人研究所) Interactive AI Algorithms & Cognitive Models for Human-AI Interaction (IKIDA), Computer Science Dept., TU Darmstadt, Germany(人机交互的交互人工智能算法与认知模型(IKIDA),图腾大学达姆施塔特分校计算机科学系) Center for Cognitive Science, TU Darmstadt, Germany(图腾大学达姆施塔特分校认知科学中心)

AI总结 提出一种语义-几何图任务表示,通过消息传递神经网络编码器和Transformer解码器联合编码对象身份、语义关系和运动历史,实现从人类示范中学习结构化任务表示,并支持跨实体迁移和双臂操作规划。

Comments 9 pages, 7 figures, preprint

详情
AI中文摘要

从人类示范中学习结构化任务表示对于双臂操作至关重要,因为动作顺序、对象参与和交互几何在不同执行中变化显著。一个关键挑战在于以支持任务进展推理的形式,联合捕获离散的语义任务结构和对象中心几何关系的时间演化。我们引入一种基于语义-几何图的任务表示,通过消息传递神经网络(MPNN)编码器和基于Transformer的解码器,联合编码对象身份、对象间语义关系和每个对象的运动历史。编码器仅对时间场景图进行操作,产生与动作标签解耦的结构化表示。解码器则根据动作上下文预测未来动作、关联对象和对象运动。这种解耦学习了任务无关的表示,使得编码器可以通过仅在小型机器人数据集上微调解码器而跨实体复用。在两个数据集的十一个双臂任务中,我们发现结构化语义-几何表示相对于更简单的基于序列模型的优势随着动作顺序和对象参与的任务变异性增加而增长。在部署时,规划器将动作和运动预测与学习的概率运动基元相结合,在两个真实机器人双臂任务上实现了完全任务成功,并优于图消融、Transformer、仅解码器和微调的视觉-语言模型基线。

英文摘要

Learning structured task representations from human demonstrations is essential for bimanual manipulation, where action ordering, object involvement, and interaction geometry vary significantly across executions. A key challenge lies in jointly capturing the discrete semantic task structure and the temporal evolution of object-centric geometric relations in a form that supports reasoning over task progression. We introduce a semantic--geometric graph-based task representation that jointly encodes object identities, inter-object semantic relations, and per-object motion histories, via a Message Passing Neural Network (MPNN) encoder and a Transformer-based decoder. The encoder operates solely on the temporal scene graph, producing structured representations decoupled from action labels. The decoder then conditions on action-context to forecast future actions, associated objects, and object motions. This decoupling learns task-agnostic representations, enabling encoder reuse across embodiments through decoder-only finetuning on a small robot dataset. Across eleven bimanual tasks from two datasets, we find that the benefit of structured semantic--geometric representations over simpler sequence-based models grows with task variability in action ordering and object involvement. At deployment, a planner couples the action and motion predictions with learned Probabilistic Movement Primitives, achieving full task success on two real-robot bimanual tasks and outperforming graph ablations, Transformer, decoder-only, and finetuned vision-language model baselines.

2511.06163 2026-06-02 eess.IV cs.CV cs.LG physics.med-ph 版本更新

Cross-Modal Fine-Tuning of 3D Convolutional Foundation Models for ADHD Classification with Low-Rank Adaptation

基于低秩适应的3D卷积基础模型跨模态微调用于ADHD分类

Jyun-Ping Kao, Shinyeong Rho, Shahar Lazarev, Hyun-Hae Cho, Fangxu Xing, Taehoon Shin, C. -C. Jay Kuo, Jonghye Woo

发表机构 * National Institute of Mental Health, National Institutes of Health(国家精神卫生研究所,国立卫生研究院)

AI总结 提出一种参数高效的迁移学习方法,通过3D低秩适应(LoRA)将预训练于CT图像的3D卷积基础模型微调至MRI的ADHD分类任务,在公开扩散MRI数据集上达到71.9%准确率和0.716 AUC,仅需164万可训练参数。

Comments Accepted for presentation at the IEEE International Symposium on Biomedical Imaging (ISBI) 2026

详情
Journal ref
2026 IEEE 23rd International Symposium on Biomedical Imaging (ISBI), pp. 1-4
AI中文摘要

儿童注意缺陷/多动障碍(ADHD)的早期诊断在改善教育和心理健康结果中起着关键作用。然而,由于异质性表现和与其他疾病的重叠症状,使用神经影像数据诊断ADHD仍然具有挑战性。为了解决这一问题,我们提出了一种新颖的参数高效迁移学习方法,将预训练于CT图像的大规模3D卷积基础模型适应于基于MRI的ADHD分类任务。我们的方法通过将3D卷积核分解为2D低秩更新,在3D中引入低秩适应(LoRA),大幅减少可训练参数,同时实现优越性能。在公开扩散MRI数据库上的五折交叉验证评估中,我们的3D LoRA微调策略取得了最先进的结果,一个模型变体达到71.9%的准确率,另一个达到0.716的AUC。两个变体仅使用164万可训练参数(比完全微调的基础模型少113倍以上)。我们的结果代表了神经影像中基础模型首次成功的跨模态(CT到MRI)适应之一,为ADHD分类建立了新的基准,同时大幅提高了效率。

英文摘要

Early diagnosis of attention-deficit/hyperactivity disorder (ADHD) in children plays a crucial role in improving outcomes in education and mental health. Diagnosing ADHD using neuroimaging data, however, remains challenging due to heterogeneous presentations and overlapping symptoms with other conditions. To address this, we propose a novel parameter-efficient transfer learning approach that adapts a large-scale 3D convolutional foundation model, pre-trained on CT images, to an MRI-based ADHD classification task. Our method introduces Low-Rank Adaptation (LoRA) in 3D by factorizing 3D convolutional kernels into 2D low-rank updates, dramatically reducing trainable parameters while achieving superior performance. In a five-fold cross-validated evaluation on a public diffusion MRI database, our 3D LoRA fine-tuning strategy achieved state-of-the-art results, with one model variant reaching 71.9% accuracy and another attaining an AUC of 0.716. Both variants use only 1.64 million trainable parameters (over 113x fewer than a fully fine-tuned foundation model). Our results represent one of the first successful cross-modal (CT-to-MRI) adaptations of a foundation model in neuroimaging, establishing a new benchmark for ADHD classification while greatly improving efficiency.

2511.01938 2026-06-02 cs.LG cs.AI 版本更新

The Geometry of Grokking: Norm Minimization on the Zero-Loss Manifold

Grokking 的几何:零损失流形上的范数最小化

Tiberiu Musat

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 本文通过约束优化视角,证明在极小学习率和权重衰减系数下,梯度下降在零损失流形上最小化权重范数,并引入近似解耦参数子集的学习动力学,推导出两层网络第一层后记忆动力学的闭式表达式,实验验证了该框架能复现 grokking 的延迟泛化和表征学习特征。

详情
AI中文摘要

Grokking 是神经网络中一种令人费解的现象,即在完全记忆训练数据之后,经过相当长的延迟才出现完全的泛化。先前的研究将这种延迟泛化与由权重衰减驱动的表征学习联系起来,但精确的潜在动力学仍然难以捉摸。在本文中,我们认为后记忆学习可以通过约束优化的视角来理解:梯度下降在零损失流形上有效地最小化权重范数。我们在无穷小学习率和权重衰减系数的极限下正式证明了这一点。为了进一步剖析这一机制,我们引入了一种近似,将一部分参数的学习动力学与网络其余部分解耦。应用这一框架,我们推导出两层网络中第一层后记忆动力学的闭式表达式。实验证实,使用我们预测的梯度模拟训练过程能够再现 grokking 的特征性延迟泛化和表征学习。

英文摘要

Grokking is a puzzling phenomenon in neural networks where full generalization occurs only after a substantial delay following the complete memorization of the training data. Previous research has linked this delayed generalization to representation learning driven by weight decay, but the precise underlying dynamics remain elusive. In this paper, we argue that post-memorization learning can be understood through the lens of constrained optimization: gradient descent effectively minimizes the weight norm on the zero-loss manifold. We formally prove this in the limit of infinitesimally small learning rates and weight decay coefficients. To further dissect this regime, we introduce an approximation that decouples the learning dynamics of a subset of parameters from the rest of the network. Applying this framework, we derive a closed-form expression for the post-memorization dynamics of the first layer in a two-layer network. Experiments confirm that simulating the training process using our predicted gradients reproduces both the delayed generalization and representation learning characteristic of grokking.

2408.11266 2026-06-02 cs.LG cs.NA math.NA 版本更新

Practical Aspects on Solving Differential Equations Using Deep Learning: A Primer

使用深度学习求解微分方程的实践方面:入门指南

Georgios Is. Detorakis

发表机构 * International Centre for Neuromorphic Systems(神经形态系统国际中心) Department of Computer Science(计算机科学系) University of Manchester(曼彻斯特大学)

AI总结 本文介绍深度学习核心概念,重点阐述如何利用神经网络求解偏微分方程,包括实现方法、超参数选择及精度提升技巧,并强调无需GPU即可复现。

Comments 34 pages, 13 figures, primer (tutorial)

详情
AI中文摘要

深度学习如今在许多科学领域都很常见,包括偏微分方程的研究。本文简要、易懂地介绍了深度学习的核心概念,包括神经网络、反向传播和通用逼近定理。它主要涵盖了如何使用深度学习求解微分方程。本文旨在帮助数学、物理及相关领域的本科生和研究生学习如何使用深度学习求解偏微分方程。数学或物理教师也可以使用本文向学生介绍深度伽辽金方法和科学深度学习。我们关注关键问题:什么是深度学习,它如何帮助解决数学或物理问题?如何实现神经网络并选择正确的数值方法来求解微分方程?如何选择最佳超参数?如何提高精度并加速收敛?需要说明的是,本文中的所有问题都可以在没有GPU的机器上解决,因此任何学生都可以遵循所介绍的方法。

英文摘要

Deep learning is now common across many scientific fields, including the study of partial differential equations. This article provides a brief, accessible introduction to core deep learning concepts, including neural networks, backpropagation, and the universal approximation theorem. It mainly covers how to use deep learning in solving differential equations. The article aims to help undergraduate and graduate students in mathematics, physics, and related areas learn how to use Deep Learning to solve partial differential equations. Instructors in mathematics or physics can also use this article to introduce students to Deep Galerkin method and scientific deep learning. We focus on key questions: What is deep learning, and how can it help solve mathematical or physical problems? How can you implement a neural network and choose the right numerical method to solve differential equations? How do you select the best hyperparameters? How can you improve accuracy and speed up convergence? We should mention that all the problems in this article can be solved on a machine without a GPU, so any student can follow the presented methodology.

2601.04539 2026-06-02 cs.NE cs.AI cs.LG 版本更新

Paradoxical noise preference in RNNs

RNN中的矛盾噪声偏好

Noah Eckstein, Manoj Srinivasan

发表机构 * Department of Mechanical and Aerospace Engineering(机械与航空航天工程系)

AI总结 研究发现,在循环神经网络中,训练时注入的噪声在测试时移除反而会降低性能,网络偏好训练时的噪声水平,该现象源于噪声引起的固定点偏移。

Comments Published in Transactions on Machine Learning Research (TMLR), 2026 21 pages, 8 figures

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

在用于模拟生物神经网络的循环神经网络(RNN)中,通常在训练期间引入噪声以模拟生物变异性和正则化学习。预期在测试时去除噪声应保持或提高性能。与这一直觉相反,我们发现连续时间RNN(CTRNN)通常在训练噪声水平或接近该水平时表现最佳。这种噪声偏好通常出现在噪声注入到神经激活函数内部时;而在激活函数外部注入噪声训练的网络在零噪声时表现最佳。该现象在多种任务中对于足够大的训练噪声鲁棒地出现;我们还展示了该现象出现在前馈神经网络中,而不仅仅是RNN中。我们的分析表明,该现象源于RNN底层随机动力学中固定点(平稳分布)的噪声诱导偏移。这些固定点偏移依赖于噪声水平,并在去除噪声时使网络输出产生偏差,从而降低性能。分析和数值结果表明,当神经状态在激活函数非线性附近运行时会产生偏差,此时噪声被不对称地衰减,而性能优化激励了在这些非线性附近运行;对于噪声在激活函数内部的网络存在这种性能激励,而外部噪声的网络则没有,这解释了为什么只有内部噪声网络表现出偏好。因此,网络可能过拟合到训练噪声本身,而不仅仅是输入-输出数据。该现象不同于随机共振,后者中非零噪声增强信号处理。我们的发现揭示了训练噪声可以成为神经网络学习到的计算的一部分,对理解神经群体动力学和设计鲁棒的人工RNN具有启示意义。

英文摘要

In recurrent neural networks (RNNs) used to model biological neural networks, noise is typically introduced during training to emulate biological variability and regularize learning. The expectation is that removing the noise at test time should preserve or improve performance. Contrary to this intuition, we find that continuous-time RNNs (CTRNNs) often perform best at or near the training noise level. This noise preference typically arises when noise is injected inside the neural activation function; networks trained with noise injected outside the activation function perform best with zero noise. The phenomenon arises robustly in diverse tasks for large enough training noise; we also show the phenomenon arising in feedforward neural networks, not just in RNNs. Our analyses show that the phenomenon stems from noise-induced shifts of fixed points (stationary distributions) in the underlying stochastic dynamics of the RNNs. These fixed point shifts are noise-level dependent and bias the network outputs when the noise is removed, degrading performance. Analytical and numerical results show that the bias arises when neural states operate near activation-function nonlinearities, where noise is asymmetrically attenuated, and that performance optimization incentivizes operation near these nonlinearities; such performance incentives exist for networks with noise inside, but not outside, the activation function, explaining why only noise-in networks show the preference. Thus, networks can overfit to the training noise itself rather than just to the input-output data. The phenomenon is distinct from stochastic resonance, wherein nonzero noise enhances signal processing. Our findings reveal that training noise can become an integral part of the computation learned by neural networks, with implications for understanding neural population dynamics and for the design of robust artificial RNNs.

2601.00672 2026-06-02 math.NA cs.LG cs.NA 版本更新

Sparse FEONet: A Low-Cost, Memory-Efficient Operator Network via Finite-Element Local Sparsity for Parametric PDEs

稀疏FEONet:通过有限元局部稀疏性实现低计算成本、高内存效率的参数化PDE算子网络

Seungchan Ko, Jiyeon Kim, Dongwook Shin

发表机构 * Department of Mathematics, Inha University(inha大学数学系) Department of Mathematics, Ajou University(ajou大学数学系)

AI总结 针对参数化PDE的有限元算子网络(FEONet)在大规模问题中计算成本高、精度下降的问题,提出一种基于有限元结构的新型稀疏网络架构,在保持精度相当的同时显著提升计算效率,并给出理论逼近和稳定性分析。

详情
AI中文摘要

本文研究了有限元算子网络(FEONet),这是一种用于参数化问题的算子学习方法,最初由J. Y. Lee、S. Ko和Y. Hong在《Finite Element Operator Network for Solving Elliptic-Type Parametric PDEs》(SIAM J. Sci. Comput., 47(2), C501-C528, 2025)中提出。FEONet在有限元空间上实现参数到解的映射,并采用无需训练数据的训练过程,同时在一大类问题上表现出高精度和鲁棒性。然而,随着单元数量的增加,其计算成本上升且精度可能下降,这给大规模问题带来了显著挑战。在本文中,我们受有限元结构启发,提出一种新的稀疏网络架构来解决这一问题。通过大量数值实验,我们表明所提出的稀疏网络在保持相当精度的同时,在计算成本和效率方面实现了显著改进。我们还建立了理论结果,证明稀疏架构能够有效逼近目标算子,并提供了稳定性分析以确保可靠的训练和预测。

英文摘要

In this paper, we study the finite element operator network (FEONet), an operator-learning method for parametric problems, originally introduced in J. Y. Lee, S. Ko, and Y. Hong, Finite Element Operator Network for Solving Elliptic-Type Parametric PDEs, SIAM J. Sci. Comput., 47(2), C501-C528, 2025. FEONet realizes the parameter-to-solution map on a finite element space and admits a training procedure that does not require training data, while exhibiting high accuracy and robustness across a broad class of problems. However, its computational cost increases and accuracy may deteriorate as the number of elements grows, posing notable challenges for large-scale problems. In this paper, we propose a new sparse network architecture motivated by the structure of the finite elements to address this issue. Throughout extensive numerical experiments, we show that the proposed sparse network achieves substantial improvements in computational cost and efficiency while maintaining comparable accuracy. We also establish theoretical results demonstrating that the sparse architecture can approximate the target operator effectively and provide a stability analysis ensuring reliable training and prediction.

2601.00664 2026-06-02 cs.LG cs.AI cs.CV cs.HC cs.MM 版本更新

Avatar Forcing: Real-Time Interactive Head Avatar Generation for Natural Conversation

Avatar Forcing:用于自然对话的实时交互式头部化身生成

Taekyung Ki, Sangwon Jang, Jaehyeong Jo, Jaehong Yoon, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) NTU Singapore(新加坡国立大学) DeepAuto.ai

AI总结 提出Avatar Forcing框架,通过扩散强制实现实时交互式头部化身生成,利用直接偏好优化进行无标签学习,在低延迟(约500ms)下生成富有表现力的反应动作。

Comments CVPR 2026. Project page: https://taekyungki.github.io/AvatarForcing/

详情
AI中文摘要

说话头部生成从静态肖像创建逼真的化身,用于虚拟通信和内容创作。然而,当前的模型尚未传达真正交互式通信的感觉,通常生成缺乏情感投入的单向响应。我们确定了实现真正交互式化身的两个关键挑战:在因果约束下实时生成运动,以及在没有额外标注数据的情况下学习富有表现力、生动的反应。为了解决这些挑战,我们提出了Avatar Forcing,一种新的交互式头部化身生成框架,通过扩散强制建模实时用户-化身交互。该设计允许化身处理实时多模态输入,包括用户的音频和运动,以低延迟即时响应语言和非语言线索,如言语、点头和笑声。此外,我们引入了一种直接偏好优化方法,利用通过丢弃用户条件构建的合成失败样本,实现无标签的富有表现力交互学习。实验结果表明,我们的框架能够实现低延迟(约500ms)的实时交互,相比基线加速6.8倍,并生成反应性和富有表现力的化身运动,在80%以上的情况下优于基线。

英文摘要

Talking head generation creates lifelike avatars from static portraits for virtual communication and content creation. However, current models do not yet convey the feeling of truly interactive communication, often generating one-way responses that lack emotional engagement. We identify two key challenges toward truly interactive avatars: generating motion in real-time under causal constraints and learning expressive, vibrant reactions without additional labeled data. To address these challenges, we propose Avatar Forcing, a new framework for interactive head avatar generation that models real-time user-avatar interactions through diffusion forcing. This design allows the avatar to process real-time multimodal inputs, including the user's audio and motion, with low latency for instant reactions to both verbal and non-verbal cues such as speech, nods, and laughter. Furthermore, we introduce a direct preference optimization method that leverages synthetic losing samples constructed by dropping user conditions, enabling label-free learning of expressive interaction. Experimental results demonstrate that our framework enables real-time interaction with low latency (approximately 500ms), achieving 6.8X speedup compared to the baseline, and produces reactive and expressive avatar motion, which is preferred over 80% against the baseline.

2601.00389 2026-06-02 cs.CR cs.LG cs.NI 版本更新

NOS-Gate: Queue-Aware Streaming IDS for Consumer Gateways under Timing-Controlled Evasion

NOS-Gate: 面向消费网关的时序控制规避下队列感知流式入侵检测系统

Muhammad Bilal, Omer Tariq, Hasan Ahmed

发表机构 * School of Computing and Communications, Lancaster University(计算与通信学院,兰卡斯特大学) School of Computing, Korea Advanced Institute of Science and Technology(计算学院,韩国科学技术院)

AI总结 提出一种轻量级流式入侵检测系统NOS-Gate,基于网络优化脉冲动力学和K-of-M持久规则,在时序控制规避下实现高召回率低延迟的加密流量元数据检测。

Comments 9 pages, 3 figures, 4 tables. M. Bilal, O. Tariq and H. Ahmed, "NOS-Gate: Queue-Aware Streaming IDS for Consumer Gateways under Timing-Controlled Evasion," in IEEE Transactions on Consumer Electronics, doi: 10.1109/TCE.2026.3682516

详情
AI中文摘要

时序和突发模式可能通过加密泄露,自适应攻击者可利用这一点。这削弱了独立消费网关中仅基于元数据的检测能力。因此,消费网关需要在严格的CPU和延迟预算下,仅使用元数据对加密流量进行流式入侵检测。我们提出了一种针对独立网关的流式入侵检测系统,该系统为每个流实例化一个源自网络优化脉冲(NOS)动力学的轻量级两状态单元,称为NOS-Gate。NOS-Gate对固定长度的元数据特征窗口进行评分,并在K-of-M持久规则下触发可逆缓解措施,在加权公平队列(WFQ)下暂时降低该流的权重。我们使用可执行程序worlds基准测试评估了NOS-Gate在时序控制规避下的性能,该基准测试指定了良性设备进程、可审计的攻击者预算、竞争结构以及数据包级WFQ重放以量化队列影响。所有方法均通过烧入分位数阈值进行无标签校准。在多个可复现的worlds和恶意事件中,在达到0.1%假阳性率的工作点下,NOS-Gate实现了0.952的事件召回率,而最佳基线为0.857。在门控下,它将p99.9排队延迟和p99.9附带延迟降低,CPU上每个流窗口的平均评分成本约为2.09微秒。

英文摘要

Timing and burst patterns can leak through encryption, and an adaptive adversary can exploit them. This undermines metadata-only detection in a stand-alone consumer gateway. Therefore, consumer gateways need streaming intrusion detection on encrypted traffic using metadata only, under tight CPU and latency budgets. We present a streaming IDS for stand-alone gateways that instantiates a lightweight two-state unit derived from Network-Optimised Spiking (NOS) dynamics per flow, named \emph{NOS-Gate}. NOS-Gate scores fixed-length windows of metadata features and, under a $K$-of-$M$ persistence rule, triggers a reversible mitigation that temporarily reduces the flow's weight under weighted fair queueing (WFQ). We evaluate NOS-Gate under timing-controlled evasion using an executable \emph{worlds} benchmark that specifies benign device processes, auditable attacker budgets, contention structure, and packet-level WFQ replay to quantify queue impact. All methods are calibrated label-free via burn-in quantile thresholding. Across multiple reproducible worlds and malicious episodes, at an achieved $0.1\%$ false-positive operating point, NOS-Gate attains 0.952 incident recall versus 0.857 for the best baseline in these runs. Under gating, it reduces p99.9 queueing delay and p99.9 collateral delay with a mean scoring cost of $\approx 2.09\,μ\mathrm{s}$ per flow-window on CPU.

2601.00175 2026-06-02 cs.LG 版本更新

Early Prediction of Liver Cirrhosis Up to Two Years in Advance: A Machine Learning Study Benchmarking Against the FIB-4 and APRI Scores

提前两年预测肝硬化:一项机器学习研究,与FIB-4和APRI评分的基准比较

Zhuqi Miao, Ahmed G Qasem, Sujan Ravi, Jason T. Cheng, Abdulaziz Ahmed, Courtney W. Houchen, Sumayah Abed, Dilorom Azimdjanovna Zuparova, Abdulaziz Ahmed

发表机构 * Center for Health Systems Innovation, Oklahoma State University(俄克拉荷马州立大学健康系统创新中心) Department of Management Science and Information Systems, Oklahoma State University(俄克拉荷马州立大学管理科学与信息系统系) Department of Health Services Administration, University of Alabama at Birmingham(阿拉巴马大学伯明翰分校健康服务管理系) Division of Gastroenterology and Hepatology, Department of Medicine, University of Alabama at Birmingham(阿拉巴马大学伯明翰分校消化内科与肝病科) College of Medicine, The University of Oklahoma Health Campus(俄克拉荷马大学健康校园医学院) Department of Family and Community Medicine, University of Alabama at Birmingham(阿拉巴马大学伯明翰分校家庭与社区医学系)

AI总结 本研究利用常规电子健康记录数据开发XGBoost模型,在诊断前1年和2年预测肝硬化,性能优于FIB-4和APRI评分。

详情
AI中文摘要

目的:利用常规收集的电子健康记录(EHR)数据,开发并评估机器学习(ML)模型,用于在诊断前1年和2年预测新发肝硬化(LC),并将其性能与FIB-4和APRI临床评分进行基准比较。方法:我们使用来自大型学术医疗系统的去标识化EHR数据进行了一项回顾性队列研究。针对1年和2年预测窗口开发了XGBoost模型,并应用了模型特定的特征选择和贝叶斯超参数调优以提高预测性能。然后在保留的测试集上评估模型,并使用准确率、精确率、召回率、F1分数、精确率-召回率曲线下面积(PR AUC)和受试者工作特征曲线下面积(AUC)与FIB-4和APRI进行比较。结果:最终建模队列包括1年预测的60,481名患者和2年预测的47,322名患者。在两个预测窗口上,调优后的ML模型均持续优于FIB-4和APRI。XGBoost模型在1年和2年预测中分别达到0.872和0.839的AUC,而FIB-4为0.756和0.723,APRI为0.798和0.761。在精确率-召回率指标上改进更大,XGBoost的PR AUC为0.657和0.562,而FIB-4为0.456和0.373,APRI为0.504和0.421。随着预测窗口延长,性能增益持续存在,表明保持了早期风险区分能力。结论:利用常规EHR数据的机器学习模型在早期预测肝硬化方面显著优于传统的FIB-4和APRI评分。这些模型能够实现更早、更准确的风险分层,并可集成到临床工作流程中作为自动化决策支持工具,以支持主动的肝硬化预防和管理。

英文摘要

Objective: Develop and evaluate machine learning (ML) models for predicting incident liver cirrhosis (LC) one and two years prior to diagnosis using routinely collected electronic health record (EHR) data and benchmark their performance against the FIB-4 and APRI clinical scores. Methods: We conducted a retrospective cohort study using de-identified EHR data from a large academic health system. XGBoost models were developed for 1- and 2-year prediction horizons, with model-specific feature selection and Bayesian hyperparameter tuning applied to improve predictive performance. The model was then evaluated on held-out test sets, and its performance was compared with FIB-4 and APRI using accuracy, precision, recall, F1, area under the precision-recall curve (PR AUC), and area under the receiver operating characteristic curve (AUC). Results: Final modeling cohorts included 60,481 patients for the 1-year prediction and 47,322 for the 2-year prediction. Across both prediction windows, the tuned ML models consistently outperformed FIB-4 and APRI. The XGBoost models achieved AUCs of 0.872 and 0.839 for the 1- and 2-year predictions, respectively, compared with 0.756 and 0.723 for FIB-4 and 0.798 and 0.761 for APRI. Improvements were larger on the precision-recall metric, with PR AUCs of 0.657 and 0.562 for XGBoost compared with 0.456 and 0.373 for FIB-4 and 0.504 and 0.421 for APRI. Performance gains persisted with longer prediction horizons, indicating maintained early risk discrimination. Conclusions: Machine learning models leveraging routine EHR data substantially outperform the traditional FIB-4 and APRI scores for early prediction of liver cirrhosis. These models enable earlier and more accurate risk stratification and can be integrated into clinical workflows as automated decision-support tools to support proactive cirrhosis prevention and management.

2512.22702 2026-06-02 cs.LG 版本更新

Position: Current Benchmarking Hinders Real Progress in Deep Learning for Time Series Forecasting

立场:当前基准测试阻碍了时间序列预测深度学习的真正进展

Valentina Moretti, Ivan Marisca, Cesare Alippi, Andrea Cini

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文指出当前基准测试实践未能识别影响性能的关键设计因素,尤其是全局性与局部性等被忽视的方面对预测方法类别和实验结果有重大影响,并提出辅助预测模型卡以改进架构比较。

Comments ICML 2026

详情
AI中文摘要

深度学习模型在时间序列应用中越来越受欢迎。然而,大量新提出的架构和经常矛盾的实证结果使得评估哪种设计选择和模型组件驱动性能变得困难。在这篇立场论文中,我们认为当前的基准测试实践未能识别导致性能差异的因素,从而减缓了该领域的进展。特别是,在比较架构时,关键设计维度的差异被忽视,最终导致不一致的结果。为了支持我们的立场,我们展示了这些差异——通常被视为单纯的实现细节——可能比采用特定的序列建模层具有更大的影响。我们讨论了被忽视的方面(如全局性和局部性)如何(1)从根本上改变预测方法的类别,以及(2)极大地影响实证结果。我们的发现表明,需要重新思考我们的基准测试实践,并在设计和比较架构时关注预测问题的基本方面。作为具体步骤,我们提出了一个辅助预测模型卡,即一个包含一组字段的模板,用于根据关键设计选择来表征现有和新的预测架构。

英文摘要

Deep learning models have grown popular in time series applications. However, the large quantity of newly proposed architectures and the often contradictory empirical results make it difficult to assess which design choice and model component drives performance. In this position paper, we argue that current benchmarking practices fail to identify the factors responsible for performance differences, thus slowing down progress in the field. In particular, differences in crucial design dimensions are overlooked when comparing architectures, ultimately leading to inconsistent outcomes. To support our position, we show that such differences-often treated as mere implementation details-can have a greater impact than adopting specific sequence modeling layers. We discuss how overlooked aspects (such as globality and locality) can (1) fundamentally change the class of the forecasting method and (2) drastically affect empirical results. Our findings suggest rethinking our benchmarking practices and focusing on the foundational aspects of the forecasting problem when designing and comparing architectures. As a concrete step, we propose an auxiliary forecasting model card, i.e., a template with a set of fields to characterize existing and new forecasting architectures based on key design choices.

2512.20638 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Uncovering Competency Gaps in Large Language Models and Their Benchmarks

揭示大型语言模型及其基准测试中的能力差距

Maty Bohacek, Nino Scherrer, Nicholas Dufour, Thomas Leung, Christoph Bregler, Stephanie C. Y. Chan

发表机构 * University of Zurich(苏黎世大学)

AI总结 提出一种基于稀疏自编码器概念激活的新方法,自动发现模型在细粒度概念上的弱点(模型差距)和基准测试覆盖不平衡(基准差距),并通过内部表示评估和跨基准比较进行验证。

详情
Journal ref
ICML 2026
AI中文摘要

大型语言模型的评估严重依赖标准化基准测试。这些基准测试提供了有用的聚合指标,但可能掩盖(i)模型薄弱的特定子领域(“模型差距”)和(ii)基准测试本身的不平衡覆盖(“基准差距”)。为了自动揭示这两类差距,我们提出了一种简单的新方法,利用稀疏自编码器的概念激活,在逐概念基础上识别细粒度差距。该方法还受益于将评估基于模型的内部表示,以及易于跨基准测试进行比较。我们将该方法应用于五个流行的开源模型和十几个基准测试,作为示例说明。作为对该方法的验证,我们发现我们的自动无监督方法能够恢复文献中先前记录的模型差距(例如与谄媚相关的差距),并识别出新的模型差距。我们还能够自动揭示基准差距:应属于给定基准测试范围的核心概念。我们的“能力差距”方法可以通过提供模型行为的概念级分解,并帮助基准测试开发者迭代基准测试设计,来补充现有基准测试。代码可在 https://competency-gaps.github.io 获取。

英文摘要

The evaluation of large language models relies heavily on standardized benchmarks. These benchmarks provide useful aggregated metrics, but can obscure (i) particular sub-areas where the models are weak ("model gaps") and (ii) imbalanced coverage in the benchmarks themselves ("benchmark gaps"). To automatically uncover both types of gaps, we propose a simple new method using concept activations from sparse autoencoders, to identify fine-grained gaps on a per-concept basis. The method also benefits from grounding evaluation in the model's internal representations, as well as easy comparison across benchmarks. We applied the method to five popular open-source models and more than a dozen benchmarks, as illustrative examples. As validation of the approach, we found that our automatic, unsupervised method was able to recover model gaps that have been previously documented in the literature (e.g. relating to sycophancy), in addition to identifying novel model gaps. We were also able to automatically uncover benchmark gaps: core concepts that should fall within the scope of a given benchmark. Our "competency gaps" method can be used to complement existing benchmarks, by providing a concept-level decomposition of model behavior, and by helping benchmark developers iterate upon benchmark design. Code is available at https://competency-gaps.github.io.

2508.20072 2026-06-02 cs.CV cs.LG cs.RO 版本更新

Discrete Diffusion VLA: Bringing Discrete Diffusion to Action Decoding in Vision-Language-Action Policies

离散扩散VLA:将离散扩散引入视觉-语言-动作策略中的动作解码

Zhixuan Liang, Yizhuo Li, Tianshuo Yang, Chengyue Wu, Sitong Mao, Liuao Pei, Tian Nian, Shunbo Zhou, Xiaokang Yang, Jiangmiao Pang, Yao Mu, Ping Luo

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出离散扩散VLA,通过将动作块离散化并在统一Transformer骨干内使用离散扩散模式进行渐进细化,实现自适应解码顺序和错误纠正,在多个基准上取得高性能并保留预训练的视觉-语言先验。

Comments Accepted by ICML 2026. 17 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型将大型视觉-语言骨干网络适配为将图像和指令映射为机器人动作。然而,当前的VLA要么以固定的从左到右顺序自回归生成动作,性能较差;要么在骨干网络外附加独立的扩散头,这会割裂信息通路并阻碍统一、可扩展的架构。相反,我们提出了离散扩散VLA,它将动作块离散化,并使用离散扩散模式在统一的Transformer骨干内保留渐进细化。我们的方法实现了自适应解码顺序,在解决较难的动作元素之前先解决高置信度的动作元素,并采用二次重掩码来重新审视不确定的预测,从而实现鲁棒的纠错。这种设计保留了预训练的视觉-语言先验,支持并行解码,并提高了效率。离散扩散VLA在LIBERO上达到96.4%的平均成功率,在SimplerEnv-Fractal上达到71.2%的视觉匹配,在SimplerEnv-Bridge上达到54.2%的整体性能。在LIBERO-Goal的分布外测试中,我们的方法仅表现出0.8%的语言退化(相比之下并行解码为8.0%),以及20.4%的视觉退化(相比之下连续扩散为29.0%),表明其很好地保留了预训练的视觉-语言能力。我们还在AgileX Cobot Magic平台上进行了两次真实机器人评估,以展示该方法的有效性。

英文摘要

Vision-Language-Action (VLA) models adapt large vision-language backbones to map images and instructions into robot actions. However, prevailing VLAs either generate actions autoregressively in a fixed left-to-right order with poor performance or attach separate diffusion heads outside the backbone that fragments information pathways and hinders unified, scalable architectures. Instead, we present Discrete Diffusion VLA that discretizes action chunks and models them with discrete diffusion pattern retaining progressive refinement inside the unified transformer backbone. Our method achieves an adaptive decoding order that resolves high-confidence action elements before harder ones and employs secondary re-masking to revisit uncertain predictions, enabling robust error correction. This design preserves pretrained vision-language priors, supports parallel decoding, and improves the efficiency. Discrete Diffusion VLA achieves 96.4% avg. success on LIBERO, 71.2% visual matching on SimplerEnv-Fractal, and 54.2% overall on SimplerEnv-Bridge. On out-of-distribution tests of LIBERO-Goal, our method exhibits only 0.8% language degradation versus 8.0% of parallel decoding, and 20.4% vision degradation versus 29.0% for continuous diffusion, demonstrating well retention of pretrained vision-language capabilities. We also conduct two real-robot evaluations on AgileX Cobot Magic platform to show the method's effectiveness.

2506.13702 2026-06-02 cs.LG cs.AI 版本更新

Value-Free Policy Optimization via Reward Partitioning

通过奖励划分实现无价值函数策略优化

Bilal Faye, Hanane Azzag, Mustapha Lebbah

发表机构 * LIPN, Université Paris 13(巴黎第十三大学LIPN实验室) Université Paris 13(巴黎第十三大学) Université de Versailles Saint-Quentin Paris(巴黎- versaillies圣quentin大学)

AI总结 提出Reward Partition Optimization (RPO)方法,通过基于划分的奖励归一化消除价值函数学习,实现稳定、高效的策略优化。

详情
AI中文摘要

单轨迹偏好优化方法从((提示, 响应, 奖励))元组的数据集中学习,通过直接利用标量反馈为成对偏好学习提供了一种实用的替代方案。现有方法如直接奖励优化(DRO)已显示出有希望的结果,但依赖于价值函数估计,引入了额外的方差、优化复杂性和对离策略数据的敏感性。我们引入了奖励划分优化(RPO),一种简单且可扩展的奖励驱动目标,消除了对价值函数学习的需要。RPO通过直接从提示级奖励分布估计的基于划分的公式对奖励进行归一化,产生稳定的监督优化目标,无需辅助模型或强化学习循环。我们使用自动评估指标、LLM作为评判员的评估和优化稳定性分析,在多个编码器-解码器和仅解码器语言模型上评估RPO。实验结果表明,RPO在生成更对齐、更多样化和更少有毒内容的同时,始终优于强基线,包括SFT、KTO和DRO。

英文摘要

Single-trajectory preference optimization methods learn from datasets of ((prompt, response, reward)) tuples, offering a practical alternative to pairwise preference learning by directly leveraging scalar feedback. Existing approaches such as Direct Reward Optimization (DRO) have demonstrated promising results but rely on value function estimation, introducing additional variance, optimization complexity, and sensitivity to off-policy data. We introduce Reward Partition Optimization (RPO), a simple and scalable reward-driven objective that eliminates the need for value function learning. RPO normalizes rewards through a partition-based formulation estimated directly from prompt-level reward distributions, yielding a stable supervised optimization objective without auxiliary models or reinforcement learning loops. We evaluate RPO across multiple encoder-decoder and decoder-only language models using automatic metrics, LLM-as-a-judge evaluations, and optimization stability analyses. Experimental results show that RPO consistently outperforms strong baselines, including SFT, KTO, and DRO, while producing more aligned, diverse, and less toxic generations.

2512.18336 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Dynamic Entropy Tuning in Reinforcement Learning Low-Level Quadcopter Control: Stochasticity vs Determinism

强化学习低层四旋翼控制中的动态熵调节:随机性与确定性

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械工程系) The German University in Cairo(开罗德国大学)

AI总结 研究在四旋翼控制中,通过动态熵调节训练随机策略的强化学习算法,并与确定性策略算法对比,发现动态熵调节可防止灾难性遗忘并提高探索效率。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 IEEE 34th International Conference on Computer Theory and Applications (ICCTA)
AI中文摘要

本文探讨了在训练随机策略的强化学习算法中动态熵调节的影响,并将其性能与训练确定性策略的算法进行了比较。随机策略通过优化动作的概率分布来最大化奖励,而确定性策略则为每个状态选择一个确定的动作。本文研究了使用静态熵和动态熵训练随机策略,然后执行确定性动作来控制四旋翼的效果,并与训练确定性策略并执行确定性动作进行了对比。为此,随机算法选择了软演员-评论家(SAC)算法,确定性算法选择了双延迟深度确定性策略梯度(TD3)算法。训练和仿真结果表明,动态熵调节通过防止灾难性遗忘和提高探索效率,对控制四旋翼产生了积极影响。

英文摘要

This paper explores the impact of dynamic entropy tuning in Reinforcement Learning (RL) algorithms that train a stochastic policy. Its performance is compared against algorithms that train a deterministic one. Stochastic policies optimize a probability distribution over actions to maximize rewards, while deterministic policies select a single deterministic action per state. The effect of training a stochastic policy with both static entropy and dynamic entropy and then executing deterministic actions to control the quadcopter is explored. It is then compared against training a deterministic policy and executing deterministic actions. For the purpose of this research, the Soft Actor-Critic (SAC) algorithm was chosen for the stochastic algorithm while the Twin Delayed Deep Deterministic Policy Gradient (TD3) was chosen for the deterministic algorithm. The training and simulation results show the positive effect the dynamic entropy tuning has on controlling the quadcopter by preventing catastrophic forgetting and improving exploration efficiency.

2512.18333 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Reinforcement Learning Position Control of a Quadrotor Using Soft Actor-Critic (SAC)

基于软演员-评论家(SAC)的四旋翼强化学习位置控制

Youssef Mahran, Zeyad Gamal, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(埃及德国大学)

AI总结 提出一种基于强化学习的四旋翼推力矢量控制架构,使用软演员-评论家算法训练,相比传统RPM控制器训练更快、路径跟踪更平滑准确。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 IEEE 6th Novel Intelligent and Leading Emerging Sciences Conference (NILES)
AI中文摘要

本文提出了一种新的基于强化学习(RL)的四旋翼控制架构。现有文献主要关注直接控制四个旋翼的转速,而本文旨在控制四旋翼的推力矢量。RL智能体计算沿四旋翼z轴的总推力百分比以及期望的滚转角(ϕ)和俯仰角(θ)。然后,智能体将计算出的控制信号连同当前四旋翼的偏航角(ψ)发送给姿态PID控制器。PID控制器再将控制信号映射为电机转速。采用软演员-评论家算法(一种无模型离策略随机RL算法)来训练RL智能体。训练结果表明,与传统的RPM控制器相比,所提出的推力矢量控制器训练时间更短。仿真结果表明,所提出的推力矢量控制器具有更平滑、更精确的路径跟踪性能。

英文摘要

This paper proposes a new Reinforcement Learning (RL) based control architecture for quadrotors. With the literature focusing on controlling the four rotors' RPMs directly, this paper aims to control the quadrotor's thrust vector. The RL agent computes the percentage of overall thrust along the quadrotor's z-axis along with the desired Roll ($ϕ$) and Pitch ($θ$) angles. The agent then sends the calculated control signals along with the current quadrotor's Yaw angle ($ψ$) to an attitude PID controller. The PID controller then maps the control signals to motor RPMs. The Soft Actor-Critic algorithm, a model-free off-policy stochastic RL algorithm, was used to train the RL agents. Training results show the faster training time of the proposed thrust vector controller in comparison to the conventional RPM controllers. Simulation results show smoother and more accurate path-following for the proposed thrust vector controller.

2512.02342 2026-06-02 math.OC cs.LG stat.ML 版本更新

Safeguarded Stochastic Polyak Step Sizes for Non-smooth Optimization: Robust Performance Without Small (Sub)Gradients

非光滑优化的保护性随机Polyak步长:无需小(次)梯度的鲁棒性能

Dimitris Oikonomou, Nicolas Loizou

发表机构 * Mathematical Institute for Data Science (MINDS), Johns Hopkins University, Baltimore, MD, USA(数据科学数学研究所(MINDS),约翰霍普金斯大学,巴尔的摩,MD,美国) Department of Computer Science, Johns Hopkins University, Baltimore, MD, USA(计算机科学系,约翰霍普金斯大学,巴尔的摩,MD,美国) Department of Applied Mathematics and Statistics, Johns Hopkins University, Baltimore, MD, USA(应用数学与统计学系,约翰霍普金斯大学,巴尔的摩,MD,美国)

AI总结 针对非光滑凸优化问题,提出保护性随机Polyak步长(SPS_safe)用于随机次梯度方法,在无需强假设下提供收敛保证,并融入动量机制,实验验证其在深度神经网络训练中避免梯度消失的鲁棒性。

Comments 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

随机Polyak步长(SPS)已被证明是随机梯度下降(SGD)的一个有前景的选择,在光滑凸和非凸优化问题(包括深度神经网络训练)上,与最先进方法相比具有竞争性能。然而,该方法向非光滑设置的扩展仍处于早期阶段,通常依赖于插值假设或需要知道最优解。在这项工作中,我们为随机次梯度方法提出了一种新的SPS变体——保护性SPS(SPS$_{safe}$),并在无需强假设的情况下为非光滑凸优化提供了严格的收敛保证。我们进一步将动量融入更新规则中,得到了同样严格的理论结果。在凸基准和深度神经网络上的综合实验证实了我们的理论:所提出的步长在现有自适应基线中实现了竞争性能,并在广泛的问题设置中表现出稳定行为。最后,在深度神经网络训练的背景下,我们的步长下的梯度范数不会崩溃到(接近)零,表明了对梯度消失的鲁棒性。

英文摘要

The stochastic Polyak step size (SPS) has proven to be a promising choice for stochastic gradient descent (SGD), delivering competitive performance relative to state-of-the-art methods on smooth convex and non-convex optimization problems, including deep neural network training. However, extensions of this approach to non-smooth settings remain in their early stages, often relying on interpolation assumptions or requiring knowledge of the optimal solution. In this work, we propose a novel SPS variant, Safeguarded SPS (SPS$_{safe}$), for the stochastic subgradient method, and provide rigorous convergence guarantees for non-smooth convex optimization with no need for strong assumptions. We further incorporate momentum into the update rule, yielding equally tight theoretical results. Comprehensive experiments on convex benchmarks and deep neural networks corroborate our theory: the proposed step size achieves competitive performance to existing adaptive baselines and exhibits stable behavior across a wide range of problem settings. Finally, in the context of deep neural network training, the gradient norms under our step size do not collapse to (near) zero, indicating robustness to vanishing gradients.

2512.13356 2026-06-02 cs.RO cs.AI cs.LG 版本更新

Control of a Twin Rotor using Twin Delayed Deep Deterministic Policy Gradient (TD3)

使用双延迟深度确定性策略梯度(TD3)控制双旋翼系统

Zeyad Gamal, Youssef Mahran, Ayman El-Badawy

发表机构 * Mechatronics Engineering Department(机械电子工程系) The German University in Cairo(埃及德国大学)

AI总结 提出基于TD3算法的强化学习框架,用于控制双旋翼气动系统在俯仰和方位角上的稳定与轨迹跟踪,仿真和实验验证了其优于传统PID控制器的抗干扰能力。

Comments This is the Author Accepted Manuscript version of a paper accepted for publication. The final published version is available via IEEE Xplore

详情
Journal ref
2024 28th IEEE International Conference on System Theory, Control and Computing (ICSTCC)
AI中文摘要

本文提出了一种强化学习(RL)框架,用于在特定俯仰角和方位角下控制和稳定双旋翼气动系统(TRAS),并跟踪给定轨迹。TRAS的复杂动力学和非线性特性使得使用传统控制算法进行控制具有挑战性。然而,近年来RL的发展因其在多旋翼控制中的潜在应用而引起了兴趣。本文使用双延迟深度确定性策略梯度(TD3)算法来训练RL智能体。该算法适用于具有连续状态和动作空间的环境(类似于TRAS),因为它不需要系统的模型。仿真结果展示了RL控制方法的有效性。接下来,使用风扰形式的的外部扰动来测试控制器与传统PID控制器相比的有效性。最后,在实验室装置上进行了实验,以确认控制器在实际应用中的有效性。

英文摘要

This paper proposes a reinforcement learning (RL) framework for controlling and stabilizing the Twin Rotor Aerodynamic System (TRAS) at specific pitch and azimuth angles and tracking a given trajectory. The complex dynamics and non-linear characteristics of the TRAS make it challenging to control using traditional control algorithms. However, recent developments in RL have attracted interest due to their potential applications in the control of multirotors. The Twin Delayed Deep Deterministic Policy Gradient (TD3) algorithm was used in this paper to train the RL agent. This algorithm is used for environments with continuous state and action spaces, similar to the TRAS, as it does not require a model of the system. The simulation results illustrated the effectiveness of the RL control method. Next, external disturbances in the form of wind disturbances were used to test the controller's effectiveness compared to conventional PID controllers. Lastly, experiments on a laboratory setup were carried out to confirm the controller's effectiveness in real-world applications.

2508.11931 2026-06-02 cs.LG 版本更新

An Improved Algorithm for Adversarial Linear Contextual Bandits via Reduction

通过归约改进的对抗线性上下文赌博机算法

Tim van Erven, Jack Mayo, Julia Olkhovskaya, Chen-Yu Wei

发表机构 * University of Amsterdam(阿姆斯特丹大学) Delft University of Technology(代尔夫特理工大学) University of Virginia(弗吉尼亚大学)

AI总结 提出一种基于归约的 oracle 高效算法,将具有对抗损失和随机动作集的线性上下文赌博机问题转化为鲁棒对抗线性赌博机问题,实现了接近最优的遗憾界和多项式时间复杂度。

详情
AI中文摘要

我们提出了一种 oracle 高效、接近最优的算法,用于具有对抗损失和随机动作集的线性上下文赌博机,每轮仅需对动作集进行线性优化 oracle 调用。我们的方法将该问题归约为具有固定动作集的鲁棒对抗线性赌博机。在不知道上下文分布或无法访问上下文模拟器的情况下,该算法实现了 $\widetilde{\mathcal{O}}(\min\{d^2\sqrt{T}, \sqrt{d^3T\log K}\})$ 的遗憾,运行时间为 $\mathrm{poly}(d,T)$ 加上 $\mathrm{poly}(d,T)$ 次线性优化 oracle 调用,其中 $d$ 是特征维度,$K$ 是每轮动作数的上界,$T$ 是轮数。这解决了 Liu 等人 (2023) 提出的开放问题:是否可以在与动作数无关的多项式时间内获得 $\mathrm{poly}(d)\sqrt{T}$ 的遗憾。对于具有对抗损失和随机动作集的组合赌博机这一重要类别,我们的算法是首个在多项式时间内实现 $\mathrm{poly}(d)\sqrt{T}$ 遗憾的算法,而据我们所知,此前没有算法能在多项式时间内达到甚至 $o(T)$ 的遗憾。当模拟器可用时,遗憾界可以改进为 $\widetilde{\mathcal{O}}(d\sqrt{L^\star})$,其中 $L^\star$ 是最优策略的累积损失。

英文摘要

We present an oracle-efficient, near-optimal algorithm for linear contextual bandits with adversarial losses and stochastic action sets, only requiring a linear optimization oracle for the action sets in each round. Our approach reduces this setting to misspecification-robust adversarial linear bandits with fixed action sets. Without knowledge of the context distribution or access to a context simulator, the algorithm achieves $\widetilde{\mathcal{O}}(\min\{d^2\sqrt{T}, \sqrt{d^3T\log K}\})$ regret and runs in $\mathrm{poly}(d,T)$ time plus $\mathrm{poly}(d,T)$ calls to the linear optimization oracles, where $d$ is the feature dimension, $K$ is an upper bound on the number of actions in each round, and $T$ is number of rounds. This resolves the open question by Liu et al. (2023) on whether one can obtain $\mathrm{poly}(d)\sqrt{T}$ regret in polynomial time independent of the number of actions. For the important class of combinatorial bandits with adversarial losses and stochastic action sets, our algorithm is the first to achieve $\mathrm{poly}(d)\sqrt{T}$ regret in polynomial time, while no prior algorithm achieves even $o(T)$ regret in polynomial time to our knowledge. When a simulator is available, the regret bound can be improved to $\widetilde{\mathcal{O}}(d\sqrt{L^\star})$, where $L^\star$ is the cumulative loss of the best policy.

2511.01064 2026-06-02 stat.ML cs.LG stat.CO 版本更新

Generalized Guarantees for Variational Inference in the Presence of Even and Elliptical Symmetry

存在偶对称和椭圆对称时变分推断的广义保证

Charles C. Margossian, Isaac E. Rankin, Lawrence K. Saul

发表机构 * University of British Columbia, Department of Statistics(不列颠哥伦比亚大学统计学系) Flatiron Institute, Center for Computational Mathematics(Flatiron研究所计算数学中心)

AI总结 本文证明,对于所有f-散度,在偶对称和椭圆对称条件下,变分推断的驻点能分别恢复目标密度的均值和相关矩阵,推广了先前对逆KL散度的结果。

详情
AI中文摘要

变分推断(VI)通过在易处理的分布族中寻找最佳匹配$q$来近似目标密度$p$。最佳变分近似通过最小化分布之间的散度$D(p||q)$得到,目前已提出多种散度作为VI的目标函数,不同选择导致不同近似。我们证明,即使这些散度具有不同的最小化器,所得近似都遵循某些对称匹配原则。具体来说,我们的结果适用于所有$f$-散度,这是一大类包括逆和前向Kullback-Leibler散度以及$\alpha$-散度的散度。我们证明,在存在偶对称时,$f$-散度的任何驻点都保证恢复$p$的均值;同样,在存在椭圆对称时,任何驻点都保证恢复其相关矩阵。为获得这些保证,我们假设$p$和$q$是单峰的,但值得注意的是,我们不要求它们是对数凹、轻尾或处处光滑的。这些保证推广了先前对逆Kullback-Leibler散度在$p$为对数凹时得到的结果。它们还扩展到目标密度$p$仅在其部分坐标上呈现对称性的情况。这些部分对称性自然出现在贝叶斯层次模型中,其中先验诱导出具有挑战性的几何结构,但仍具有对称轴。

英文摘要

Variational inference (VI) approximates a target density $p$ by the best match $q$ in a family of tractable distributions. The best variational approximation is found by minimizing a divergence between distributions, $D(p||q)$, and several divergences have been proposed as objective functions for VI, with different choices leading to different approximations. We show that even when these divergences have different minimizers, the resulting approximations all abide by certain symmetry-matching principles. Specifically, our results hold for all $f$-divergences, a broad class which includes the reverse and forward Kullback-Leibler divergences and the $α$-divergences. We show that in the presence of even symmetry, any stationary point of an $f$-divergence is guaranteed to recover the mean of $p$ and likewise, in the presence of elliptical symmetry, any stationary point is guaranteed to recover its correlation matrix. To obtain these guarantees we assume that $p$ and $q$ are unimodal, but notably we do not require them to be log-concave, light-tailed, or even everywhere-smooth. These guarantees generalize a previous result obtained for the reverse Kullback-Leibler divergence when $p$ is log-concave. They also extend to cases where the target density $p$ only exhibits symmetry along some but not all of its coordinates. These partial symmetries arise naturally in Bayesian hierarchical models, where the prior induces a challenging geometry but still possesses axes of symmetry.

2409.03915 2026-06-02 cs.LG math.OC 版本更新

Asynchronous Stochastic Approximation with Applications to Average-Reward Reinforcement Learning

异步随机逼近及其在平均奖励强化学习中的应用

Huizhen Yu, Yi Wan, Richard S. Sutton

发表机构 * Department of Computing Science, University of Alberta(计算科学系,阿尔伯塔大学) Alberta Machine Intelligence Institute (Amii)(阿尔伯塔人工智能研究所(Amii))

AI总结 研究异步随机逼近算法的稳定性与收敛性,通过扩展Borkar-Meyn稳定性证明方法和Hirsch-Benaïm动力学系统方法,为平均奖励强化学习中的相对值迭代算法提供理论基础。

Comments 34 pages. This version contains only the asynchronous stochastic approximation material from version 2 of the original report; the reinforcement-learning material has been moved to a separate, stand-alone paper (arXiv:2512.06218). Minor corrections and additional remarks have been incorporated. A shorter version of this paper is to appear in the SIAM Journal on Control and Optimization

详情
Journal ref
SIAM Journal on Control and Optimization, 64(3):1456-1481, 2026
AI中文摘要

本文研究了异步随机逼近(SA)算法的稳定性和收敛性质,重点关注与平均奖励强化学习相关的扩展。我们首先扩展了Borkar和Meyn的稳定性证明方法,以适应比先前考虑的更一般的噪声条件,从而为异步SA提供了更广泛的收敛保证。为了深化收敛性分析,我们进一步基于Hirsch和Benaïm的动力学系统方法,研究了异步SA的阴影性质。这些结果为在配套论文中开发和分析的一类基于相对值迭代的强化学习算法提供了理论基础,用于求解平均奖励马尔可夫和半马尔可夫决策过程。

英文摘要

This paper investigates the stability and convergence properties of asynchronous stochastic approximation (SA) algorithms, with a focus on extensions relevant to average-reward reinforcement learning. We first extend a stability proof method of Borkar and Meyn to accommodate more general noise conditions than previously considered, thereby yielding broader convergence guarantees for asynchronous SA. To sharpen the convergence analysis, we further examine the shadowing properties of asynchronous SA, building on a dynamical systems approach of Hirsch and Benaïm. These results provide a theoretical foundation for a class of relative value iteration-based reinforcement learning algorithms -- developed and analyzed in a companion paper -- for solving average-reward Markov and semi-Markov decision processes.

2512.07795 2026-06-02 cs.AI cs.CL cs.LG 版本更新

ReasonBENCH: Benchmarking the (In)Stability of LLM Reasoning

ReasonBENCH: 基准测试LLM推理的(不)稳定性

Nearchos Potamitis, Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Lars Klein, Akhil Arora

发表机构 * Aarhus University(奥胡斯大学) Indian Institute of Technology Delhi(德里印度理工学院) EPFL(苏黎世联邦理工学院)

AI总结 提出ReasonBench基准套件,通过30次独立试验揭示LLM推理系统在贪婪解码下仍存在结构化方差,并引入全局噪声和运行噪声分类法,证明稳定性是推理系统的固有属性,倡导分布感知评估。

Comments 29 pages, 19 tables, 85 figures

详情
AI中文摘要

LLM推理系统的基准分数被报告为单一数字,然而相同的模型、策略和任务在重复执行时,即使在贪婪解码(T=0)下也会产生显著不同的答案和成本。这种方差并非统计上的麻烦:性能最高的策略在与最接近的对手进行头对头运行时仅获胜77%,这意味着单次观测到的分数可能会无声地错误排序系统。我们引入了ReasonBench,一个基准套件,记录了10种推理策略、12个模型和6个任务的30次独立试验,将质量和成本视为分布而非点估计。我们发现这种方差是有结构的而非随机的:一个双组分分类法——全局噪声(捕捉跨基准的不均匀性)和运行噪声(捕捉基准内的随机性)——揭示了策略架构预测稳定性分布,而模型和策略则移动分布的正交方面。层次分解将四分之三的分数方差归因于基准、系统和项目结构,而单次运行评估无声地吸收了持久的残差。最后,成本和成本非对称地解耦:廉价方法在结构上对联合成本-质量失败免疫,而昂贵方法无论其准确性如何仍然暴露。这些发现确立了不稳定性作为推理系统的固有属性,并促使分布感知评估成为标准实践。

英文摘要

Benchmark scores for LLM reasoning systems are reported as single numbers, yet the same model, strategy, and task can produce meaningfully different answers and costs across repeated executions, even under greedy decoding (T = 0). This variance is not a statistical nuisance: the highest-performing strategy wins only 77% of head-to-head runs against its nearest competitor, meaning a single observed score can silently misrank systems. We introduce ReasonBench, a benchmark suite recording 30 independent trials across 10 reasoning strategies, 12 models, and 6 tasks, treating quality and cost as distributions rather than point estimates. We find that this variance is structured rather than random: a two-component taxonomy -- Global Noise, capturing cross-benchmark unevenness, and Run Noise, capturing within-benchmark stochasticity -- reveals that strategy architecture predicts stability profiles, while models and strategies shift orthogonal aspects of the distribution. A hierarchical decomposition attributes three-quarters of score variance to benchmark, system, and item structure, with a persistent residual that single-run evaluation silently absorbs. Finally, cost and quality decouple asymmetrically: cheap methods are structurally immune to joint cost-quality failure, while expensive methods remain exposed regardless of their accuracy. These findings establish instability as an inherent property of reasoning systems and motivate distribution-aware evaluation as standard practice.

2512.06906 2026-06-02 cs.SE cs.CR cs.DB cs.LG 版本更新

MINES: Explainable Anomaly Detection through Web API Invariant Inference

MINES:通过Web API不变式推断实现可解释的异常检测

Wenjie Zhang, Yun Lin, Chun Fung Amos Kwok, Xiwen Teoh, Xiaofei Xie, Frank Liauw, Hongyu Zhang, Jin Song Dong

发表机构 * National University of Singapore(国立新加坡大学) Shanghai Jiao Tong University(上海交通大学) Singapore Management University(新加坡管理学院) GovTech Singapore(新加坡政府科技局) Chongqing University(重庆大学)

AI总结 提出MINES方法,通过从模式级别推断可解释的API不变式来检测Web应用异常,显著降低误报率并提高召回率。

Comments Accepted by ICSE 2026

详情
AI中文摘要

检测Web应用的异常对于提供可靠的Web服务至关重要,这些应用是现代公司和政府运行的重要基础设施。许多现代Web应用基于Web API(例如RESTful、SOAP和WebSockets)运行,其暴露性会招致有意攻击或无意非法访问,导致系统行为异常。然而,此类异常可能与正常日志共享非常相似的日志,缺少用于日志区分的關鍵信息(可能存在于数据库中)。此外,日志实例可能包含噪声,这会进一步误导最先进的日志学习解决方案学习虚假相关性,从而产生用于异常检测的浅层模型和规则。在这项工作中,我们提出MINES,它从模式级别而非详细的原始日志实例推断可解释的API不变式用于异常检测,能够(1)显著区分日志中的噪声以识别精确的正常行为,以及(2)检测超出已记录日志的异常行为。技术上,MINES(1)将API签名转换为表模式以增强原始数据库模式;(2)在增强的数据库模式上推断潜在的数据库约束,以捕获API与数据库表之间的潜在关系。MINES使用LLM基于两个给定的表结构提取潜在关系,并使用正常日志实例拒绝或接受LLM生成的不变式。最后,MINES将推断的约束转换为不变式,生成用于验证运行时日志的Python代码。我们在TrainTicket、NiceFish、Gitea、Mastodon和NextCloud基准测试上针对Web篡改攻击,与LogRobust、LogFormer和WebNorm等基线进行了广泛评估。结果表明,MINES在引入几乎零误报的情况下实现了对异常的高召回率,代表了新的最先进水平。

英文摘要

Detecting the anomalies of web applications, important infrastructures for running modern companies and governments, is crucial for providing reliable web services. Many modern web applications operate on web APIs (e.g., RESTful, SOAP, and WebSockets), their exposure invites intended attacks or unintended illegal visits, causing abnormal system behaviors. However, such anomalies can share very similar logs with normal logs, missing crucial information (which could be in database) for log discrimination. Further, log instances can be also noisy, which can further mislead the state-of-the-art log learning solutions to learn spurious correlation, resulting superficial models and rules for anomaly detection. In this work, we propose MINES which infers explainable API invariants for anomaly detection from the schema level instead of detailed raw log instances, which can (1) significantly discriminate noise in logs to identify precise normalities and (2) detect abnormal behaviors beyond the instrumented logs. Technically, MINES (1) converts API signatures into table schema to enhance the original database shema; and (2) infers the potential database constraints on the enhanced database schema to capture the potential relationships between APIs and database tables. MINES uses LLM for extracting potential relationship based on two given table structures; and use normal log instances to reject and accept LLM-generated invariants. Finally, MINES translates the inferred constraints into invariants to generate Python code for verifying the runtime logs. We extensively evaluate MINES on web-tamper attacks on the benchmarks of TrainTicket, NiceFish, Gitea, Mastodon, and NextCloud against baselines such as LogRobust, LogFormer, and WebNorm. The results show that MINES achieves high recall for the anomalies while introducing almost zero false positives, indicating a new state-of-the-art.

2511.20639 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Latent Collaboration in Multi-Agent Systems

多智能体系统中的潜在协作

Jiaru Zou, Ruizhong Qiu, Gaotang Li, Xiyuan Yang, Katherine Tieu, Pan Lu, Ke Shen, Hanghang Tong, Yejin Choi, Jingrui He, James Zou, Mengdi Wang, Ling Yang

发表机构 * University of Washington(华盛顿大学)

AI总结 提出LatentMAS框架,使LLM智能体在连续潜在空间直接协作,无需文本中介,实现更高精度、更低开销和更快推理。

Comments ICML2026 Spotlight, Project: https://github.com/Gen-Verse/LatentMAS

详情
AI中文摘要

多智能体系统(MAS)将大语言模型(LLM)从独立的单模型推理扩展到协同的系统级智能。现有LLM智能体依赖基于文本的中介进行推理和通信,而我们更进一步,使模型能够在连续潜在空间内直接协作。我们引入了LatentMAS,一个端到端无需训练的框架,实现了LLM智能体间的纯潜在协作。在LatentMAS中,每个智能体首先通过最后一层的隐藏嵌入而非文本进行自回归潜在思维生成。然后,一个共享的潜在工作记忆保存并传递每个智能体的内部表示和潜在思维,确保无需重新编码的无损信息交换。我们提供了详细的理论分析,表明LatentMAS比基于文本的标准MAS具有更高的表达能力和无损信息保存能力,且整体复杂度更低。此外,在涵盖数学和科学推理、常识理解及代码生成的9个综合基准测试上的实证评估表明,LatentMAS优于先进的单智能体和基于文本的MAS基线,准确率最高提升14.6%,输出token使用量减少70.8%-83.7%,端到端推理速度提升4倍至4.3倍。代码和数据完全开源:https://github.com/Gen-Verse/LatentMAS。

英文摘要

Multi-agent systems (MAS) extend large language models (LLMs) from independent single-model reasoning to coordinative system-level intelligence. While existing LLM agents depend on text-based mediation for reasoning and communication, we take a step forward by enabling models to collaborate directly within the continuous latent space. We introduce LatentMAS, an end-to-end training-free framework that enables pure latent collaboration among LLM agents. In LatentMAS, each agent first performs auto-regressive latent thoughts generation through last-layer hidden embeddings instead of text. Then, a shared latent working memory preserves and transfers each agent's internal representations and latent thoughts, ensuring lossless information exchange without re-encoding. We provide detailed theoretical analyses showing that LatentMAS achieves higher expressiveness and lossless information preservation with lower overall complexity than standard text-based MAS. In addition, empirical evaluations across 9 comprehensive benchmarks spanning math and science reasoning, commonsense understanding, and code generation show that LatentMAS outperforms advanced single agents and text-based MAS baselines, achieving up to 14.6% higher accuracy, reducing output token usage by 70.8%-83.7%, and providing 4$\times$-4.3$\times$ faster end-to-end inference. Code and data are fully open-sourced at https://github.com/Gen-Verse/LatentMAS.

2511.08223 2026-06-02 stat.CO cs.LG cs.NA math.NA 版本更新

High-Performance Variance-Covariance Matrix Construction Using an Uncentered Gram Formulation

使用非中心Gram形式的高性能方差-协方差矩阵构建

Felix Reichel

发表机构 * Department of Economics, Johannes Kepler University Linz(经济系,约翰尼斯·开普勒大学林茨)

AI总结 本文通过非中心Gram矩阵和修正项等价于成对差异定义,避免了显式中心化,将计算简化为一个p×p外积和一次减法,在Python基准测试中显著提升运行速度。

Comments 17 pages, 9 figures, 1 table

详情
Journal ref
A-ccepted at International Journal of Parallel, Emergent and Distributed Systems, 2026, Taylor & Francis, Unpublished
AI中文摘要

Reichel (2025) 将bariance定义为一种成对差异度量,该度量可以仅使用标量求和在线性时间内重写。我们通过证明涉及非中心Gram矩阵和修正项的标准矩阵表达式在代数上与成对差异定义相同,同时避免了显式中心化,将此思想扩展到协方差矩阵。然后计算简化为一个p×p维的外积和一次减法。Python中的基准测试显示出明显的运行时间增益,特别是在没有BLAS优化的情况下。可选的更快Gram矩阵例程(如RXTX, Rybin et al., 2025)进一步降低了总体成本。

英文摘要

Reichel (2025) defined the bariance as a pairwise-difference measure that can be rewritten in linear time using only scalar sums. We extend this idea to the covariance matrix by showing that the standard matrix expression involving the uncentered Gram matrix and a correction term is algebraically identical to the pairwise-difference definition while avoiding explicit centering. The computation then reduces to one outer product of dimension p-by-p and a single subtraction. Benchmarks in Python show clear runtime gains, especially when BLAS optimizations are absent. Optionally faster Gram-matrix routines such as RXTX (Rybin et al., 2025) further reduce overall cost.

2510.09330 2026-06-02 cs.LG 版本更新

Safety Game: Inference-Time Alignment of Black-Box LLMs via Constrained Optimization

安全博弈:通过约束优化实现黑盒大语言模型的推理时对齐

Tuan Nguyen, Long Tran-Thanh

发表机构 * University of Southampton(南安普顿大学)

AI总结 提出一种无需重新训练或访问模型内部结构的黑盒安全对齐框架,通过将安全与帮助性的权衡建模为二人零和博弈,并在推理时使用线性规划求解均衡策略,实现了黑盒LLM的安全对齐。

详情
AI中文摘要

确保大语言模型(LLM)遵守安全要求是AI部署中的核心挑战。现有的对齐方法主要在训练期间进行,例如通过微调或基于人类反馈的强化学习,但这些方法成本高昂且不灵活,每当出现新需求时都需要重新训练。最近针对推理时对齐的努力缓解了其中一些限制,但仍然假设可以访问模型内部结构,这在实际中不可行,也不适用于无法访问模型的第三方利益相关者。在这项工作中,我们提出了一种与模型无关的黑盒安全对齐框架,无需重新训练或访问底层LLM架构。作为概念验证,我们解决了在生成安全但无信息的回答与有用但潜在风险的回答之间进行权衡的问题。我们将这一困境建模为一个二人零和博弈,其极小极大均衡捕捉了安全性与帮助性之间的最优平衡。LLM智能体通过推理时使用线性规划求解器计算均衡策略来操作这一框架。我们的结果证明了黑盒安全对齐的可行性,为包括小型组织和资源受限环境中的实体在内的利益相关者提供了一种可扩展且可访问的途径,以在快速演变的LLM生态系统中强制执行安全。

英文摘要

Ensuring that large language models (LLMs) comply with safety requirements is a central challenge in AI deployment. Existing alignment approaches primarily operate during training, such as through fine-tuning or reinforcement learning from human feedback, but these methods are costly and inflexible, requiring retraining whenever new requirements arise. Recent efforts toward inference-time alignment mitigate some of these limitations but still assume access to model internals, which is impractical, and not suitable for third party stakeholders who do not have access to the models. In this work, we propose a model-independent, black-box framework for safety alignment that does not require retraining or access to the underlying LLM architecture. As a proof of concept, we address the problem of trading off between generating safe but uninformative answers versus helpful yet potentially risky ones. We formulate this dilemma as a two-player zero-sum game whose minimax equilibrium captures the optimal balance between safety and helpfulness. LLM agents operationalize this framework by leveraging a linear programming solver at inference time to compute equilibrium strategies. Our results demonstrate the feasibility of black-box safety alignment, offering a scalable and accessible pathway for stakeholders, including smaller organizations and entities in resource-constrained settings, to enforce safety across rapidly evolving LLM ecosystems.

2512.02328 2026-06-02 q-bio.QM cs.LG 版本更新

Molecular Embedding-Based Algorithm Selection in Protein-Ligand Docking

基于分子嵌入的蛋白质-配体对接算法选择

Jiabao Brad Wang, Siyuan Cao, Hongxuan Wu, Yiliang Yuan, Mustafa Misir

发表机构 * Division of Natural and Applied Sciences, Duke Kunshan University(杜克昆山大学自然科学与应用科学系) Machine Learning Department, Mohamed bin Zayed University of Artificial Intelligence(莫扎德人工智能大学机器学习系)

AI总结 提出MolAS轻量级算法选择模型,利用预训练蛋白质和配体嵌入的注意力池化与浅层残差解码器预测对接算法性能,在五个基准上相比单一最优算法绝对提升高达15个百分点,并缩小了与虚拟最优算法之间17-66%的差距。

Comments 40 pages, 16 figures, 8 tables; updated to the accepted manuscript version

详情
Journal ref
J Cheminform 18, 47 (2026)
AI中文摘要

选择有效的对接算法高度依赖于具体情境,没有单一方法能在结构、化学和协议范围内可靠地表现。MolAS是一种轻量级算法选择模型,通过注意力池化和浅层残差解码器,从预训练的蛋白质和配体嵌入中预测每个算法的性能。使用数百到数千个标记复合物,MolAS在五个对接基准上相比单一最优算法(SBS)实现了高达15个百分点的绝对改进,并缩小了虚拟最优算法(VBS)与SBS之间17-66%的差距。对选择频率、边际条件可靠性和基准级预言结构分析表明,当工作流定义的预言景观具有低胜者熵和合理可分离的顶级求解器区域时,MolAS最有效,但在协议不匹配导致求解器排名变化和诱导标签改变时性能下降。这些结果表明,在评估的范围内,鲁棒性受限于工作流和协议引起的求解器层次不稳定性,而非表示能力,将MolAS定位为固定管线的领域内选择器以及评估对接算法选择是否适定的诊断工具。

英文摘要

Selecting an effective docking algorithm is highly context-dependent, and no single method performs reliably across structural, chemical, and protocol regimes. MolAS is a lightweight algorithm-selection model that predicts per-algorithm performance from pretrained protein and ligand embeddings using attentional pooling and a shallow residual decoder. With hundreds to a few thousand labelled complexes, MolAS achieves up to a 15 percentage-point absolute improvement over the single-best solver (SBS) and closes 17--66\% of the Virtual Best Solver (VBS)--SBS gap across five docking benchmarks. Analyses of selection frequencies, margin-conditioned reliability, and benchmark-level oracle structure indicate that MolAS is most effective when the workflow-defined oracle landscape has low winner entropy and a reasonably separable top-solver region, but degrades under protocol mismatch that shifts solver rankings and changes the induced labels. These results suggest that, in the evaluated regime, robustness is limited less by representational capacity than by workflow- and protocol-induced instability in solver hierarchies, positioning MolAS as an in-domain selector for fixed pipelines and as a diagnostic tool for assessing when docking algorithm selection is well-posed.

2512.00088 2026-06-02 cs.CV cs.LG 版本更新

Semimage: HSV-Based Semantic Image Encoding for Disentangled Text Representation

Semimage: 基于HSV的语义图像编码用于解缠文本表示

Mohammad Zare

发表机构 * AI Lab at Department of Computer Engineering(计算机工程系人工智能实验室) AriooBarzan Engineering Team and Information Technology(AriooBarzan工程团队和信息技术) Shiraz University of Technology(谢兹大学技术学院)

AI总结 提出SemImage方法,将文本表示为二维语义图像,利用HSV颜色空间解缠主题、情感和强度特征,通过多任务学习实现,并在文档分类中取得竞争性性能。

详情
Journal ref
2026 12th International Conference on Web Research (ICWR), 253-259
AI中文摘要

我们提出SemImage,一种将文本文档表示为二维语义图像以由卷积神经网络(CNN)处理的新方法。在SemImage中,每个单词表示为二维图像中的一个像素:行对应句子,并在句子之间插入额外的边界行以标记语义转换。每个像素不是典型的RGB值,而是解缠HSV颜色空间中的向量,编码不同的语言特征:色调(具有两个分量H_cos和H_sin以考虑循环性)编码主题,饱和度编码情感,明度编码强度或确定性。我们通过多任务学习框架强制这种解缠:ColorMapper网络将每个词嵌入映射到HSV空间,并对色调和饱和度通道应用辅助监督以预测主题和情感标签,同时执行主要任务目标。在句子之间插入动态计算的边界行,当连续句子在语义上不相似时,会在图像中产生清晰的视觉边界,有效地使段落边界突出。我们将SemImage与标准2D CNN(例如ResNet)集成用于文档分类。在多标签数据集(同时具有主题和情感标注)和单标签基准上的实验表明,SemImage能够达到与强文本分类基线(包括BERT和层次注意力网络)相当或更好的准确性,同时提供增强的可解释性。消融研究证实了多通道HSV表示和动态边界行的重要性。最后,我们展示了SemImage的可视化,定性地揭示了生成图像中与主题转换和情感变化相对应的清晰模式,表明我们的表示使这些语言特征对人类和机器都可见。

英文摘要

We propose SemImage, a novel method for representing a text document as a two-dimensional semantic image to be processed by convolutional neural networks (CNNs). In a SemImage, each word is represented as a pixel in a 2D image: rows correspond to sentences and an additional boundary row is inserted between sentences to mark semantic transitions. Each pixel is not a typical RGB value but a vector in a disentangled HSV color space, encoding different linguistic features: the Hue with two components H_cos and H_sin to account for circularity encodes the topic, Saturation encodes the sentiment, and Value encodes intensity or certainty. We enforce this disentanglement via a multi-task learning framework: a ColorMapper network maps each word embedding to the HSV space, and auxiliary supervision is applied to the Hue and Saturation channels to predict topic and sentiment labels, alongside the main task objective. The insertion of dynamically computed boundary rows between sentences yields sharp visual boundaries in the image when consecutive sentences are semantically dissimilar, effectively making paragraph breaks salient. We integrate SemImage with standard 2D CNNs (e.g., ResNet) for document classification. Experiments on multi-label datasets (with both topic and sentiment annotations) and single-label benchmarks demonstrate that SemImage can achieve competitive or better accuracy than strong text classification baselines (including BERT and hierarchical attention networks) while offering enhanced interpretability. An ablation study confirms the importance of the multi-channel HSV representation and the dynamic boundary rows. Finally, we present visualizations of SemImage that qualitatively reveal clear patterns corresponding to topic shifts and sentiment changes in the generated image, suggesting that our representation makes these linguistic features visible to both humans and machines.

2512.00062 2026-06-02 cs.RO cs.AI cs.LG 版本更新

SpeedAug: Policy Acceleration via Tempo-Enriched Policy and RL Fine-Tuning

SpeedAug: 通过节奏增强策略和强化学习微调实现策略加速

Taewook Nam, Junmo Cho, Youngsoo Jang, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院) UNIST(全南大学) DeepAuto.ai

AI总结 提出SpeedAug框架,通过节奏增强先验策略和强化学习微调,使机器人策略学习任务最优执行节奏,在保持高成功率的同时显著提升执行速度和样本效率。

详情
AI中文摘要

针对复杂真实世界操作任务的机器人策略学习近期取得了快速进展,这在很大程度上得益于通过人类操作收集演示数据的能力。然而,从这些演示中训练出的策略通常执行任务的速度远低于机器人的物理能力,因为演示数据是在实际约束下收集的,这些约束倾向于保守的、以成功为导向的轨迹,而非执行速度。现有的策略加速方法通过数据预处理或启发式规则确定执行节奏,而不是学习针对任务优化的执行速度。在本文中,我们提出了SpeedAug,一个策略加速框架,使策略能够通过强化学习(RL)学习任务最优的执行节奏。SpeedAug首先从速度增强的演示中学习一个节奏增强的先验策略,该策略捕捉了多样的执行节奏。在此基础上,通过强化学习微调指导探索,以优化动作轨迹并高效优化执行节奏。在机器人操作基准上的实验表明,SpeedAug在保持高成功率的同时,显著提高了策略加速的样本效率,实现了快速且稳定的任务执行。应用于真实世界的操作任务时,SpeedAug仅用16分钟的在线交互就将任务吞吐量提高了1.8倍,且未降低成功率。

英文摘要

Robotic policy learning for complex real-world manipulation tasks has seen rapid recent progress, enabled in large part by the ability to collect demonstrations through human operation. However, policies trained from such demonstrations often execute tasks far more slowly than the robot's physical capabilities, as demonstration data is collected under practical constraints that favor conservative, success-oriented trajectories over execution speed. Existing policy acceleration methods determine execution tempo through data preprocessing or heuristic rules, rather than learning execution speed optimized for the task. In this paper, we propose SpeedAug, a policy acceleration framework that enables policies to learn task-optimal execution tempo via reinforcement learning (RL). SpeedAug first learns a tempo-enriched prior policy from speed-augmented demonstrations that captures diverse execution tempos. Building on this tempo-enriched prior, RL fine-tuning guides exploration to refine action trajectories and optimize execution tempo efficiently. Experiments on robotic manipulation benchmarks demonstrate that SpeedAug substantially improves the sample efficiency of policy acceleration while maintaining high success rates, achieving fast and stable task execution. Applied to a real-world manipulation task, SpeedAug improves task throughput by 1.8x using only 16 minutes of online interactions without compromising the success rate.

2511.21397 2026-06-02 cs.CV cs.AI cs.CL cs.LG 版本更新

Understanding the Effects of Distractors on Reasoning Vision-Language Models

理解干扰项对推理视觉语言模型的影响

Jiyun Bae, Hyunjong Ok, Sangwoo Mo, Jaeho Lee

发表机构 * Pohang University of Science and Technology (POSTECH)(坡山科学技术大学(POSTECH))

AI总结 本文通过构建包含语义和数值维度干扰项的视觉问答数据集Idis,研究视觉干扰项如何影响视觉语言模型的测试时缩放行为,发现视觉干扰项以与文本干扰项根本不同的方式降低准确率而不增加推理长度,并提出简单提示策略缓解干扰项驱动的预测。

Comments preprint

详情
AI中文摘要

无关信息(即干扰项)如何影响视觉语言模型(VLM)的测试时缩放?先前关于纯文本语言模型的研究表明,文本干扰项可以加剧逆缩放,导致模型推理更长但推理轨迹效率更低。在这项工作中,我们研究了类似现象是否在多模态设置中出现。我们引入了Idis(带干扰项的图像),这是一个视觉问答数据集,系统性地沿着语义和数值维度变化干扰项。我们的分析揭示,视觉干扰项以与文本干扰项根本不同的方式影响推理VLM:尽管逆缩放仍然出现,但视觉干扰项降低了准确率而不增加推理长度。我们进一步展示了从推理轨迹中提取的属性计数为干扰项如何与推理长度和准确率交互提供了关键见解。作为合理性检查,我们提出了一种简单的提示策略,以减轻推理视觉语言模型中干扰项驱动的预测。

英文摘要

How does irrelevant information (i.e., distractors) affect test-time scaling in vision-language models (VLMs)? Prior work on text-only language models has shown that textual distractors can intensify inverse scaling, causing models to reason longer but less effective reasoning traces. In this work, we investigate whether similar phenomena arise in multimodal settings. We introduce Idis (Images with distractors), a visual question-answering dataset that systematically varies distractors along semantic and numerical dimensions. Our analyses reveal that visual distractors affect reasoning VLMs in a fundamentally different way from textual distractors: although inverse scaling still emerges, visual distractors reduce accuracy without increasing reasoning length. We further show that attribute counts extracted from reasoning traces provide key insights into how distractors interact with reasoning length and accuracy. As a sanity check, we propose a simple prompting strategy that mitigates distractor-driven predictions in reasoning vision-language models.

2511.20333 2026-06-02 cs.AI cs.LG cs.NE 版本更新

NNGPT: Rethinking AutoML with Large Language Models

NNGPT: 用大型语言模型重新思考AutoML

Roman Kochnev, Waleed Khalid, Tolgay Atinc Uzun, Xi Zhang, Yashkumar Sanjaybhai Dhameliya, Furui Qin, Chandini Vysyaraju, Raghuvir Duvvuri, Avi Goyal, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,乌尔姆大学,德国)

AI总结 提出NNGPT开源框架,利用大型语言模型实现自我改进的AutoML引擎,通过生成-评估-自我改进闭环自动设计神经网络架构和超参数。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 5664-5674, 2026
AI中文摘要

构建自我改进的人工智能系统仍然是AI领域的一个基本挑战。我们提出了NNGPT,一个开源框架,它将大型语言模型(LLM)转变为用于神经网络开发的自我改进AutoML引擎,主要针对计算机视觉。与之前的框架不同,NNGPT通过生成新模型扩展神经网络数据集,基于生成、评估和自我改进的闭环系统实现LLM的持续微调。它在一个统一的工作流中集成了五个协同的基于LLM的流水线:零样本架构合成、超参数优化(HPO)、代码感知的准确率/早停预测、检索增强的闭域PyTorch块合成(NN-RAG)以及强化学习。基于LEMUR数据集作为具有可复现指标的可审计语料库,NNGPT从单个提示出发,验证网络架构、预处理代码和超参数,端到端执行,并从结果中学习。PyTorch适配器使NNGPT框架无关,实现了强大性能:NN-RAG在1289个目标上达到73%的可执行性,3-shot提示在常见数据集上提高了准确率,基于哈希的去重节省了数百次运行。一次性预测匹配基于搜索的AutoML,减少了大量试验的需要。在LEMUR上的HPO实现了RMSE 0.60,优于Optuna(0.64),而代码感知预测器达到RMSE 0.14,Pearson r=0.78。该系统已生成超过5000个经过验证的模型,证明了NNGPT作为自主AutoML引擎的能力。接受后,代码、提示和检查点将公开发布,以实现可复现性并促进社区使用。

英文摘要

Building self-improving AI systems remains a fundamental challenge in the AI domain. We present NNGPT, an open-source framework that turns a large language model (LLM) into a self-improving AutoML engine for neural network development, primarily for computer vision. Unlike previous frameworks, NNGPT extends the dataset of neural networks by generating new models, enabling continuous fine-tuning of LLMs based on closed-loop system of generation, assessment, and self-improvement. It integrates within one unified workflow five synergistic LLM-based pipelines: zero-shot architecture synthesis, hyperparameter optimization (HPO), code-aware accuracy/early-stop prediction, retrieval-augmented synthesis of scope-closed PyTorch blocks (NN-RAG), and reinforcement learning. Built on the LEMUR dataset as an audited corpus with reproducible metrics, NNGPT emits from a single prompt and validates network architecture, preprocessing code, and hyperparameters, executes them end-to-end, and learns from result. The PyTorch adapter makes NNGPT framework-agnostic, enabling strong performance: NN-RAG achieves 73% executability on 1,289 targets, 3-shot prompting boosts accuracy on common datasets, and hash-based deduplication saves hundreds of runs. One-shot prediction matches search-based AutoML, reducing the need for numerous trials. HPO on LEMUR achieves RMSE 0.60, outperforming Optuna (0.64), while the code-aware predictor reaches RMSE 0.14 with Pearson r=0.78. The system has already generated over 5K validated models, proving NNGPT as an autonomous AutoML engine. Upon acceptance, the code, prompts, and checkpoints will be released for public access to enable reproducibility and facilitate community usage.

2511.13487 2026-06-02 eess.AS cs.LG cs.SD 版本更新

Systematic Evaluation of Time-Frequency Features for Binaural Sound Source Localization

双耳声源定位的时频特征系统评估

Davoud Shariat Panah, Alessandro Ragano, Dan Barry, Jan Skoglund, Andrew Hines

发表机构 * Taighde Éireann – Research Ireland(塔尔德·爱尔兰——爱尔兰研究)

AI总结 系统评估不同时频特征组合对双耳声源定位性能的影响,发现精心选择的特征组合(如通道频谱图结合ILD和IPD)可超越增加模型复杂度,为领域特定和通用定位提供实用指导。

Comments Accepted at EUSIPCO 2026

详情
AI中文摘要

本研究对双耳声源定位(SSL)的时频特征设计进行了系统评估,重点关注特征选择如何在多样条件下影响模型性能。我们研究了使用基于幅度特征(幅度频谱图、耳间电平差 - ILD)和基于相位特征(相位频谱图、耳间相位差 - IPD)的各种组合的卷积神经网络(CNN)模型的性能。在域内和域外数据(具有不匹配的头部相关传递函数 - HRTFs)上的评估表明,精心选择的特征组合通常优于增加模型复杂度。虽然诸如ILD + IPD的双特征集足以用于域内SSL,但泛化到多样内容需要更丰富的输入,结合通道频谱图与ILD和IPD。使用最优特征集,我们的低复杂度CNN模型实现了有竞争力的性能。我们的发现强调了特征设计在双耳SSL中的重要性,并为领域特定和通用定位提供了实用指导。

英文摘要

This study presents a systematic evaluation of time-frequency feature design for binaural sound source localization (SSL), focusing on how feature selection influences model performance across diverse conditions. We investigate the performance of a convolutional neural network (CNN) model using various combinations of amplitude-based features (magnitude spectrogram, interaural level difference - ILD) and phase-based features (phase spectrogram, interaural phase difference - IPD). Evaluations on in-domain and out-of-domain data with mismatched head-related transfer functions (HRTFs) reveal that carefully chosen feature combinations often outperform increases in model complexity. While two-feature sets such as ILD + IPD are sufficient for in-domain SSL, generalization to diverse content requires richer inputs combining channel spectrograms with both ILD and IPD. Using the optimal feature sets, our low-complexity CNN model achieves competitive performance. Our findings underscore the importance of feature design in binaural SSL and provide practical guidance for both domain-specific and general-purpose localization.

2511.12081 2026-06-02 cs.IR cs.LG 版本更新

From Scaling to Structured Expressivity: Rethinking Transformers for CTR Prediction

从规模到结构化表达能力:重新思考用于CTR预测的Transformer

Bencheng Yan, Yuejie Lei, Zhiyuan Zeng, Zheye Deng, Di Wang, Kaiyi Lin, Pengjie Wang, Chuan Yu, Jian Xu, Bo Zheng

发表机构 * Alibaba Group(阿里巴巴集团)

AI总结 针对CTR预测中Transformer模型因结构错位导致收益递减的问题,提出Field-Aware Transformer (FAT),通过场感知参数重构和基组合超网络实现结构化表达能力,在理论(Rademacher复杂度标度律)和实验(AUC提升+4.38%,线上CTR+2.33%,RPM+0.66%)上均优于现有方法。

Comments KDD 2026; The first four authors contributed equally to this work

详情
AI中文摘要

尽管在规模上投入巨大,用于点击率(CTR)预测的深度模型往往表现出快速递减的回报——这与大型语言模型(LLM)中观察到的可预测标度律形成鲜明对比。我们识别出根本原因在于根本性的结构错位:标准Transformer假设顺序组合性,而CTR数据需要对异构字段进行组合推理。为恢复对齐,我们引入了Field-Aware Transformer (FAT)。通过用场中心参数重构标准Transformer块,FAT实现了结构化表达能力,从根本上将模型复杂度依赖从总词汇量n转变为字段数F(n >> F)。关键的是,为了将模型容量与字段基数解耦,FAT采用基组合超网络从共享基合成场特定参数,进一步降低参数复杂度。理论上,我们通过基于Rademacher复杂度的形式化标度律来支撑这一缩放行为。实验上,FAT以高达+4.38%的AUC提升超越现有最先进方法,并在线上生产中带来+2.33%的CTR和+0.66%的RPM提升。我们的工作表明,可扩展的推荐不仅来自规模,更来自结构化表达能力——架构与数据语义的一致性。

英文摘要

Despite massive investments in scale, deep models for click-through rate (CTR) prediction often exhibit rapidly diminishing returns -- a stark contrast to the {predictable scaling laws} seen in large language models (LLMs). We identify the root cause as a {fundamental} \textit{structural misalignment}: {standard} Transformers assume sequential compositionality, whereas CTR data demand combinatorial reasoning over {heterogeneous} fields. To restore alignment, we introduce the \textbf{Field-Aware Transformer (FAT)}. {By reconstructing the standard Transformer block with field-centric parameters, FAT achieves \textit{structured expressivity}, {fundamentally shifting the model complexity dependence from the total vocabulary size $n$ with the number of fields $F$ ($n \gg F$).}} Crucially, to decouple model capacity from field cardinality, FAT employs a {Basis-Composed Hypernetwork} to synthesize field-specific parameters from shared bases, further reducing parameter complexity. {Theoretically, we ground this scaling behavior through a formal scaling law based on Rademacher complexity. Empirically, FAT outperforms exisiting state-of-the-art methods with up to \textbf{+4.38\%} AUC improvement, and delivers \textbf{+2.33\%} CTR and \textbf{+0.66\%} RPM in live production.} Our work establishes that scalable recommendation arises not from size alone, but from \textit{structured expressivity} -- architectural coherence with data semantics.

2507.22842 2026-06-02 stat.ML cs.LG 版本更新

Tricks and Plug-ins for Gradient Boosting in Image Classification

图像分类中梯度提升的技巧与插件

Biyi Fang, Truong Vo, Jean Utke, Diego Klabjan

发表机构 * Northwestern University(西北大学) Allstate(Allstate公司)

AI总结 提出一种结合动态特征选择与BoostCNN原理的框架,通过子网格选择和重要性采样策略,将提升权重嵌入最小二乘损失训练,提升CNN性能与效率。

Comments 6 pages, 5 figures. Experimental results reported on CIFAR-10, SVHN, and ImageNetSub datasets

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData), pp. 1382-1388
AI中文摘要

卷积神经网络(CNN)通过深度架构的分层特征学习,在广泛的机器学习任务中取得了显著成功。然而,大量的层和数百万参数通常使得CNN训练计算成本高昂,需要大量时间和手动调优来发现最优架构。在本文中,我们介绍了一种提升CNN性能的新框架,该框架将动态特征选择与BoostCNN原理相结合。我们的方法包含两个关键策略:子网格选择和重要性采样,以引导训练朝向特征空间的信息区域。我们进一步开发了一系列算法,使用最小二乘损失公式将提升权重直接嵌入网络训练过程。这种集成不仅减轻了手动架构设计的负担,还提高了准确性和效率。在多个细粒度分类基准上的实验结果表明,我们的提升CNN变体在预测性能和训练速度上始终优于传统CNN。

英文摘要

Convolutional Neural Networks (CNNs) have achieved remarkable success across a wide range of machine learning tasks by leveraging hierarchical feature learning through deep architectures. However, the large number of layers and millions of parameters often make CNNs computationally expensive to train, requiring extensive time and manual tuning to discover optimal architectures. In this paper, we introduce a novel framework for boosting CNN performance that integrates dynamic feature selection with the principles of BoostCNN. Our approach incorporates two key strategies: subgrid selection and importance sampling, to guide training toward informative regions of the feature space. We further develop a family of algorithms that embed boosting weights directly into the network training process using a least squares loss formulation. This integration not only alleviates the burden of manual architecture design but also enhances accuracy and efficiency. Experimental results across several fine-grained classification benchmarks demonstrate that our boosted CNN variants consistently outperform conventional CNNs in both predictive performance and training speed.

2501.02409 2026-06-02 cs.LG cs.AI cs.CE q-bio.MN stat.ME 版本更新

Interpretable Neural ODEs for Gene Regulatory Network Discovery under Perturbations

可解释神经ODE用于扰动下基因调控网络发现

Zaikang Lin, Sei Chang, Aaron Zweig, Minseo Kang, Fabian J. Theis, Elham Azizi, David A. Knowles

发表机构 * Department of Computer Science, Columbia University, New York, U.S.(哥伦比亚大学计算机科学系) Department of Industrial Engineering and Operations Research, Columbia University, New York, U.S.(哥伦比亚大学工业工程与运筹学系) Department of Applied Mathematics and Applied Physics, Columbia University, New York, U.S.(哥伦比亚大学应用数学与应用物理系) New York Genome Center, New York, U.S.(纽约基因组中心) Irving Institute of Cancer Dynamics, New York, U.S.(伊万·罗伯特癌症动力学研究所) Institute of Computational Biology, Helmholtz Munich, Munich, Germany(海德堡医学院计算生物学研究所) Department of Mathematics, Technische Universität München, Munich, Germany(慕尼黑技术大学数学系)

AI总结 提出PerturbODE框架,利用可解释神经常微分方程建模扰动下的细胞状态轨迹,从ODE参数中推导因果基因调控网络,实现未见遗传干预的模拟。

详情
AI中文摘要

现代高通量生物数据集包含数千种扰动,使得能够大规模发现代表基因间调控相互作用的因果图。可微分因果图模型和基于回归的方法已被开发用于从干预数据集推断基因调控网络(GRN)。然而,现有方法未能捕捉生物过程(如细胞分化)的非线性动力学。为解决这一局限性,我们提出PerturbODE,一种新颖框架,采用可解释神经常微分方程(神经ODE)对扰动下的细胞状态轨迹进行建模,并从神经ODE参数中推导出潜在的因果GRN,从而实现对未见遗传干预的下游模拟。GRN通过单隐藏层前馈网络编码,隐含地将基因分组为可解释的共调控模块。我们展示了PerturbODE在GRN推断和扩展到扰动响应预测方面的有效性,包括模拟和真实过表达数据集。

英文摘要

Modern high-throughput biological datasets containing thousands of perturbations enable large-scale discovery of causal graphs that represent regulatory interactions between genes. Differentiable causal graphical models and regression-based methods have been developed to infer gene regulatory networks (GRNs) from interventional datasets. However, existing approaches fail to capture the non-linear dynamics of biological processes such as cellular differentiation. To address this limitation, we propose PerturbODE, a novel framework that employs interpretable neural ordinary differential equations (neural ODEs) to model cell state trajectories under perturbations and derive the underlying causal GRN from the neural ODE parameters, enabling downstream simulation of unseen genetic interventions. The GRN is encoded via a single-hidden-layer feedforward network, implicitly grouping genes into interpretable co-regulated modules. We demonstrate PerturbODE's efficacy in GRN inference and extension to perturbation response prediction across both simulated and real overexpression datasets.

2511.09190 2026-06-02 cs.LG cs.NE 版本更新

Iterated Population Based Training with Task-Agnostic Restarts

迭代式基于种群的训练与任务无关的重启

Alexander Chebykin, Tanja Alderliesten, Peter A. N. Bosman

发表机构 * Alexander Chebykin(亚历山大·切比金) Tanja Alderliesten(塔妮娅·阿尔德利斯特恩) Peter A. N. Bosman(彼得·A·N·博斯曼)

AI总结 提出迭代式基于种群的训练(IPBT),通过任务无关的重启自动调整超参数更新间隔,在8个图像分类和强化学习任务上平均优于或匹配5种PBT变体及其他HPO算法。

详情
AI中文摘要

超参数优化(HPO)可以减轻调整神经网络超参数(HP)的负担。基于种群训练(PBT)系列的HPO算法通过在权重优化的每几步后动态调整HP,从而高效运行。最近的结果表明,HP更新之间的步数是所有PBT变体的重要元超参数,会显著影响其性能。然而,目前没有有效设置其值的方法或直觉。我们引入了迭代式基于种群的训练(IPBT),一种新颖的PBT变体,通过以任务无关的方式重用权重信息进行重启,并利用时变贝叶斯优化重新初始化HP,自动调整该超参数。在8个图像分类和强化学习任务上的评估表明,平均而言,我们的算法匹配或优于5种之前的PBT变体和其他HPO算法(随机搜索、ASHA、SMAC3),且无需增加预算或更改其超参数。源代码可在https://github.com/AwesomeLemon/IPBT获取。

英文摘要

Hyperparameter Optimization (HPO) can lift the burden of tuning hyperparameters (HPs) of neural networks. HPO algorithms from the Population Based Training (PBT) family are efficient thanks to dynamically adjusting HPs every few steps of the weight optimization. Recent results indicate that the number of steps between HP updates is an important meta-HP of all PBT variants that can substantially affect their performance. Yet, no method or intuition is available for efficiently setting its value. We introduce Iterated Population Based Training (IPBT), a novel PBT variant that automatically adjusts this HP via restarts that reuse weight information in a task-agnostic way and leverage time-varying Bayesian optimization to reinitialize HPs. Evaluation on 8 image classification and reinforcement learning tasks shows that, on average, our algorithm matches or outperforms 5 previous PBT variants and other HPO algorithms (random search, ASHA, SMAC3), without requiring a budget increase or any changes to its HPs. The source code is available at https://github.com/AwesomeLemon/IPBT.

2511.06663 2026-06-02 eess.SY cs.LG cs.SY 版本更新

GNN-Enabled Robust Hybrid Beamforming with Score-Based CSI Generation and Denoising

基于分数生成CSI与去噪的GNN鲁棒混合波束成形

Yuhang Li, Yang Lu, Bo Ai, Zhiguo Ding, Arumugam Nallanathan

发表机构 * State Key Laboratory of Advanced Rail Autonomous Operation(先进轨道交通自主运行国家重点实验室) School of Computer Science and Technology(计算机科学与技术学院) Beijing Jiaotong University(北京交通大学) School of Electronics and Information Engineering(电子与信息工程学院) School of Electrical and Electronic Engineering (EEE)(电子与电气工程学院) Nanyang Technological University(南洋理工大学) School of Electronic Engineering and Computer Science(电子工程与计算机科学学院) Queen Mary University of London(伦敦女王玛丽大学) Department of Electronic Engineering(电子工程系)

AI总结 针对混合波束成形中信道状态信息不精确的问题,提出利用图神经网络和基于分数的生成模型,通过混合消息图注意力网络、BERT噪声条件分数网络和去噪分数网络实现鲁棒HBF。

详情
AI中文摘要

准确的信道状态信息(CSI)对于混合波束成形(HBF)任务至关重要。然而,在实际无线通信系统中,获取高分辨率CSI仍然具有挑战性。为了解决这个问题,我们提出利用图神经网络(GNN)和基于分数的生成模型,在不完美的CSI条件下实现鲁棒的HBF。首先,我们开发了混合消息图注意力网络(HMGAT),通过节点级和边级消息传递更新节点和边特征。其次,我们设计了一个基于BERT的噪声条件分数网络(NCSN),学习高分辨率CSI的分布,促进CSI生成和数据增强,进一步提高HMGAT的性能。最后,我们提出了一个去噪分数网络(DSN)框架及其实例化DeBERT,该网络可以在任意信道误差水平下对不完美的CSI进行去噪,从而实现鲁棒的HBF。在DeepMIMO城市数据集上的实验表明,所提出的模型在完美和不完美CSI的各种HBF任务中具有优越的泛化能力、可扩展性和鲁棒性。

英文摘要

Accurate Channel State Information (CSI) is critical for Hybrid Beamforming (HBF) tasks. However, obtaining high-resolution CSI remains challenging in practical wireless communication systems. To address this issue, we propose to utilize Graph Neural Networks (GNNs) and score-based generative models to enable robust HBF under imperfect CSI conditions. Firstly, we develop the Hybrid Message Graph Attention Network (HMGAT) which updates both node and edge features through node-level and edge-level message passing. Secondly, we design a Bidirectional Encoder Representations from Transformers (BERT)-based Noise Conditional Score Network (NCSN) to learn the distribution of high-resolution CSI, facilitating CSI generation and data augmentation to further improve HMGAT's performance. Finally, we present a Denoising Score Network (DSN) framework and its instantiation, termed DeBERT, which can denoise imperfect CSI under arbitrary channel error levels, thereby facilitating robust HBF. Experiments on DeepMIMO urban datasets demonstrate the proposed models' superior generalization, scalability, and robustness across various HBF tasks with perfect and imperfect CSI.

2511.06642 2026-06-02 cs.LG 版本更新

Improving Asset Allocation in a Fast Moving Consumer Goods B2B Company: An Interpretable Machine Learning Framework for Commercial Cooler Assignment Based on Multi-Tier Growth Targets

改善快速消费品B2B公司中的资产配置:基于多层级增长目标的商业冷柜分配可解释机器学习框架

Renato Castro, Rodrigo Paredes, Douglas Kahn

发表机构 * Fast Moving Consumer Goods B2B Company(快速消费品B2B公司)

AI总结 提出一个可解释机器学习框架,利用XGBoost、LightGBM和CatBoost模型结合SHAP分析,预测B2B饮料客户在获得冷柜后的销量增长,以优化资产分配并提升ROI。

详情
Journal ref
2025 Artificial Intelligence for Business (AIxB), Laguna Hills, CA, USA, 2025, pp. 1-6
AI中文摘要

在快速消费品(FMCG)行业中,决定商业饮料冷柜等实体资产的放置位置可以直接影响收入增长和执行效率。尽管客户流失预测和需求预测在B2B环境中已被广泛研究,但使用机器学习指导资产配置仍相对未被探索。本文提出了一个框架,专注于预测哪些饮料客户在收到冷柜后最有可能实现销量增长。使用来自一家知名中美洲酿酒和饮料公司的私有数据集,包含3,119个在2022年1月至2024年7月期间收到冷柜的B2B传统贸易渠道客户,并跟踪冷柜安装前后12个月的销售交易,定义了三个增长阈值:销量同比增长10%、30%和50%。分析比较了XGBoost、LightGBM和CatBoost等机器学习模型与SHAP结合用于可解释特征分析的结果,以获取改善与冷柜分配相关的业务运营的见解;结果显示,最佳模型在验证集上各阈值的AUC得分分别为0.857、0.877和0.898。模拟表明,与传统的基于销量的方法相比,该方法可以更好地选择有望达到预期增长水平的潜在客户,并通过不分配冷柜给不会增长的客户来增加成本节约,从而改善ROI,并提供了实质性的业务管理建议。

英文摘要

In the fast-moving consumer goods (FMCG) industry, deciding where to place physical assets, such as commercial beverage coolers, can directly impact revenue growth and execution efficiency. Although churn prediction and demand forecasting have been widely studied in B2B contexts, the use of machine learning to guide asset allocation remains relatively unexplored. This paper presents a framework focused on predicting which beverage clients are most likely to deliver strong returns in volume after receiving a cooler. Using a private dataset from a well-known Central American brewing and beverage company of 3,119 B2B traditional trade channel clients that received a cooler from 2022-01 to 2024-07, and tracking 12 months of sales transactions before and after cooler installation, three growth thresholds were defined: 10%, 30% and 50% growth in sales volume year over year. The analysis compares results of machine learning models such as XGBoost, LightGBM, and CatBoost combined with SHAP for interpretable feature analysis in order to have insights into improving business operations related to cooler allocation; the results show that the best model has AUC scores of 0.857, 0.877, and 0.898 across the thresholds on the validation set. Simulations suggest that this approach can improve ROI because it better selects potential clients to grow at the expected level and increases cost savings by not assigning clients that will not grow, compared to traditional volume-based approaches with substantial business management recommendations

2511.05650 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Optimizing Diversity and Quality through Base-Aligned Model Collaboration

通过基座对齐模型协作优化多样性与质量

Yichen Wang, Chenghao Yang, Tenghao Huang, Muhao Chen, Jonathan May, Mina Lee

发表机构 * University of Chicago(芝加哥大学) University of Southern California, Information Sciences Institute(南加州大学信息科学研究所) University of California, Davis(加州大学戴维斯分校)

AI总结 提出基座对齐模型协作框架(BACo),在推理时通过令牌级路由策略动态结合基座LLM与其对齐版本,以单次前向传递同时提升生成多样性和质量。

Comments ICML 2026. (47 pages, 22 figures)

详情
AI中文摘要

对齐极大地提升了大语言模型(LLM)的输出质量,但以牺牲多样性为代价,导致跨代生成高度相似的输出,尤其是在开放式生成任务中。我们提出基座对齐模型协作(BACo),一种推理时令牌级模型协作框架,动态结合基座LLM与其对齐版本,以优化多样性和质量。利用基于不确定性和内容的信号,BACo采用路由策略决定每个令牌从哪个模型解码。先前的多样性提升方法通常以质量下降为代价,或需要昂贵的解码或后训练。相比之下,BACo在单次前向传递中事后同时实现高多样性和高质量,同时提供强可控性。我们引入一系列有效的路由策略,并在三个开放式生成任务中使用13个多样性和质量指标进行评估。BACo持续超越最先进的推理时基线。使用我们最佳的路由器,BACo在多样性和质量上实现了21.3%的联合提升,这一结果进一步得到人工评估的支持。总体而言,我们的结果表明,基座模型与对齐模型之间的协作为优化多样性-质量权衡提供了一种有效且可控的机制。

英文摘要

Alignment has greatly improved large language models (LLMs)' output quality at the cost of diversity, yielding highly similar outputs across generations, especially in open-ended generation tasks. We propose Base-Aligned Model Collaboration (BACo), an inference-time token-level model collaboration framework that dynamically combines a base LLM with its aligned counterpart to optimize diversity and quality. Using uncertainty and content-based signals, BACo employs routing strategies to determine, at each token, which model to decode from. Prior diversity-promoting methods often improve diversity at the expense of quality or require expensive decoding or post-training. In contrast, BACo achieves both high diversity and quality post hoc within a single pass, while offering strong controllability. We introduce a family of effective routing strategies and evaluate them across three open-ended generation tasks with 13 diversity and quality metrics. BACo consistently surpasses state-of-the-art inference-time baselines. With our best router, BACo achieves a 21.3% joint improvement in diversity and quality, which is further supported by human evaluations. Overall, our results demonstrate that collaboration between base and aligned models provides an effective and controllable mechanism for optimizing the diversity-quality trade-off.

2511.05613 2026-06-02 cs.CY cs.AI cs.LG 版本更新

Who Evaluates AI's Social Impacts? Mapping Coverage and Gaps in First and Third Party Evaluations

谁在评估人工智能的社会影响?第一方和第三方评估的覆盖范围与差距分析

Anka Reuel, Avijit Ghosh, Jenny Chim, Andrew Tran, Yanan Long, Jennifer Mickel, Usman Gohar, Srishti Yadav, Pawan Sasanka Ammanamanchi, Mowafak Allaham, Hossein A. Rahmani, Mubashara Akhtar, Felix Friedrich, Robert Scholz, Michael Alexander Riegler, Jan Batzner, Eliya Habba, Arushi Saxena, Anastassia Kornilova, Kevin Wei, Prajna Soni, Yohan Mathew, Kevin Klyman, Jeba Sania, Subramanyam Sahoo, Olivia Beyer Bruvik, Pouya Sadeghi, Sujata Goswami, Angelina Wang, Yacine Jernite, Zeerak Talat, Stella Biderman, Mykel Kochenderfer, Sanmi Koyejo, Irene Solaiman

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过分析186份第一方发布报告和248份第三方评估来源,结合开发者访谈,揭示了第一方报告稀疏且流于表面,而第三方评估更广泛深入,但数据溯源、内容审核劳动等关键领域存在披露缺口,呼吁政策强制开发者透明化并加强独立评估生态。

Comments Accepted at the Forty-Third International Conference on Machine Learning (ICML), 2026, in Seoul, Korea

详情
AI中文摘要

基础模型日益成为高风险人工智能系统的核心,治理框架现在依赖评估来评估其风险和能力。尽管通用能力评估已广泛开展,但涵盖偏见、公平性、隐私、环境成本和劳动的社会影响评估仍不均衡。为了描述这一格局,我们进行了首次社会影响评估报告的综合分析,检查了186份第一方发布报告和248份第三方评估来源,并辅以开发者访谈。我们发现明显的分工:第一方报告稀疏、通常流于表面,且在环境影响和偏见等领域呈下降趋势,而第三方评估者提供了更广泛、更严格的偏见、有害内容和性能差异覆盖。然而,只有开发者才能权威地报告数据来源、内容审核劳动、成本和基础设施,但访谈揭示这些披露除非与产品采用或合规挂钩,否则被降级优先。当前实践在评估社会影响方面留下了重大空白,强调了需要制定政策强制开发者透明化、加强独立评估生态系统,并创建聚合第三方评估的共享基础设施。

英文摘要

Foundation models are increasingly central to high-stakes AI systems, and governance frameworks now depend on evaluations to assess their risks and capabilities. Although general capability evaluations are widespread, social impact assessments covering bias, fairness, privacy, environmental costs, and labor remain uneven. To characterize this landscape, we conduct the first comprehensive analysis of social impact evaluation reporting, examining 186 first-party release reports and 248 third-party evaluation sources, supplemented by developer interviews. We find a stark division of labor: first-party reporting is sparse, often superficial, and declining in areas like environmental impact and bias, while third-party evaluators provide broader, more rigorous coverage of bias, harmful content, and performance disparities. However, only developers can authoritatively report on data provenance, content moderation labor, costs, and infrastructure, yet interviews reveal these disclosures are deprioritized unless tied to product adoption or compliance. Current practices leave major gaps in assessing societal impacts, underscoring the need for policies that mandate developer transparency, strengthen independent evaluation ecosystems, and create shared infrastructure for aggregating third-party evaluations.

2403.06524 2026-06-02 cs.LG cs.AI cs.RO 版本更新

Tactical Decision Making for Autonomous Trucks by Deep Reinforcement Learning with Total Cost of Operation Based Reward

基于总运营成本奖励的深度强化学习自动驾驶卡车战术决策

Deepthi Pathare, Leo Laine, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and and University of Gothenburg(计算机科学与工程系,查尔姆斯理工大学和哥德堡大学) Department of Mechanics and Maritime Sciences, Chalmers University of Technology(机械与海洋科学系,查尔姆斯理工大学) Safe and Efficient Driving, Volvo Group of Trucks Technology(安全高效驾驶,沃尔沃卡车技术集团)

AI总结 提出一种深度强化学习框架,用于自动驾驶卡车在高速公路场景下的自适应巡航控制和变道战术决策,通过基于总运营成本的多目标奖励函数优化性能。

Comments Paper is accepted for publication in Artificial Intelligence Review

详情
Journal ref
Artificial Intelligence Review, Volume 59, Article number 27 (2026)
AI中文摘要

我们开发了一个深度强化学习框架,用于自动驾驶卡车的战术决策,特别是高速公路场景下的自适应巡航控制(ACC)和变道操作。我们的结果表明,将高层决策过程与低层控制动作分离,分别由强化学习智能体和基于物理模型的低层控制器执行是有益的。接下来,我们研究了使用不同方法基于卡车总运营成本(TCOP)的逼真多目标奖励函数来优化性能:通过添加奖励组件权重、通过归一化奖励组件以及通过使用课程学习技术。

英文摘要

We develop a deep reinforcement learning framework for tactical decision making in an autonomous truck, specifically for Adaptive Cruise Control (ACC) and lane change maneuvers in a highway scenario. Our results demonstrate that it is beneficial to separate high-level decision-making processes and low-level control actions between the reinforcement learning agent and the low-level controllers based on physical models. In the following, we study optimizing the performance with a realistic and multi-objective reward function based on Total Cost of Operation (TCOP) of the truck using different approaches; by adding weights to reward components, by normalizing the reward components and by using curriculum learning techniques.

2511.04981 2026-06-02 cs.LG 版本更新

Scaling depth capacity via zero/one-layer model expansion

通过零/单层模型扩展扩展深度容量

Zhiqi Bu

发表机构 * Meta FAIR

AI总结 提出零/单层渐进训练方法,通过优化理论和特征学习分析深度扩展,在GPT2上节省约80%计算量并实现约5倍加速,在LLAMA3和DeepSeekV3上展示3~5倍计算效率提升。

Comments Accepted to ICML 2026

详情
AI中文摘要

模型深度在深度学习中是一把双刃剑:更深的模型实现更高的精度,但需要更高的计算成本。为了高效训练大规模模型,渐进训练(也称为模型扩展)在训练过程中扩展模型容量,显著减少计算量且性能下降很小。在这项工作中,我们通过优化理论和特征学习的视角研究大规模模型的深度扩展,提供了关于新层初始化、超参数迁移、学习率调度和模型扩展时机的见解。具体来说,我们提出零/单层渐进训练以实现计算和损失之间的最优权衡,并对我们的扩展策略进行了全面的消融实验。例如,在GPT2上的零/单层渐进训练可以节省约80%的计算量,或等效地实现约5倍的加速,同时达到与完全训练的60层7B参数模型相当的损失,从而在损失方面表现出混合行为。此外,在LLAMA3和DeepSeekV3模型上的缩放定律显示计算效率提升了3~5倍,且在更大规模上优势递增。

英文摘要

Model depth is a double-edged sword in deep learning: deeper models achieve higher accuracy but require higher computational cost. To efficiently train models at scale, progressive training (also known as model expansion) scales up model capacity during training and significantly reduces computation with little performance degradation. In this work, we study the depth expansion of large-scale models through the lens of optimization theory and feature learning, offering insights on the initialization of new layers, hyperparameter transfer, learning rate schedule, and timing of model expansion. Specifically, we propose zero/one-layer progressive training to achieve an optimal tradeoff between computation and loss, with a comprehensive ablations on our expansion strategy. For example, zero/one-layer progressive training on GPT2 can save $\approx 80\%$ compute, or equivalently achieve an $\approx 5\times$ acceleration, while attaining a loss comparable to that of a fully trained 60-layer model with 7B parameters, thus demonstrating a mixing behavior in terms of loss. Furthermore, scaling laws on LLAMA3 and DeepSeekV3 models show a $3\sim 5\times$ improvement in compute efficiency, with an increasing advantage at larger scales.

2511.04873 2026-06-02 stat.ML cs.LG 版本更新

Prototype Selection Using Topological Data Analysis

使用拓扑数据分析的原型选择

Jordan Eckert, Elvan Ceyhan, Henry Schenck

发表机构 * Department of Mathematics & Statistics(数学与统计学系)

AI总结 提出两种基于持续同调的原型选择方法TPS和BoundaryTPS,通过多尺度拓扑结构压缩训练集,在保持决策边界和内部典型点的同时,实现了对H1持续图的最佳保留和稳定的折叠扰动性能。

Comments Code will be made available upon request to Jordan Eckert

详情
AI中文摘要

原型选择方法压缩训练集,但现有的分类(压缩、编辑、混合、基于能力、基于优化和基于聚类)不包括对数据多尺度拓扑结构进行操作的方法。本文介绍了两种不同的基于持续性的原型选择变体:拓扑原型选择器(TPS)和边界感知拓扑原型选择器(BoundaryTPS)。TPS使用两个连续的Rips过滤来保留边界相关点和内部典型点。BoundaryTPS是一种单阶段变体,其顶点加权过滤将保留集中在决策边界附近。我们在15个真实数据集上对这两种方法进行了评估,并与七个经典基线方法进行了比较,发现拓扑方法在原型选择设计空间中占据了与现有方法不同的操作点。BoundaryTPS在$H_1$持续图保留上实现了最低的平均Friedman秩,并且显著优于七个基线中的五个(Nemenyi,$α= 0.05$)。TPS在同一指标上排名第三。两种方法在折叠扰动下比任何测试的链式决策选择器更稳定,并且两者都继承了源集的类别比例,无需标签感知机制。在聚合G-Mean上,两种方法具有竞争力但并非领先,跨折叠组合的秩1频率分别为$11.3\%$(TPS)和$9.9\%$(BoundaryTPS)。经验上,两种方法在样本量上呈次二次方缩放。

英文摘要

Prototype selection methods compress a training set, but the existing taxonomy of condensation, edition, hybrid, competence-based, optimization-based, and clustering-based families does not include methods that operate on the multi-scale topological structure of the data. This paper introduces two different persistence-based prototype selector variants, Topological Prototype Selector (TPS) and Boundary-Conscious Topological Prototype Selector (BoundaryTPS). TPS uses two sequential Rips filtrations to retain boundary-relevant and interior-typical points. BoundaryTPS is a single-stage variant whose vertex-weighted filtration concentrates retention near the decision boundary. We evaluate both methods against seven classical baselines on fifteen real datasets and find that the topological methods occupy a different operating point in the prototype-selection design space than existing methods. BoundaryTPS achieves the lowest mean Friedman rank on $H_1$ persistence-diagram preservation and is significantly better than five of the seven baselines (Nemenyi, $α= 0.05$). TPS ranks third on the same endpoint. Both methods are more stable under fold perturbation than any chained-decision selector tested, and both inherit the source set's class proportions without label-aware machinery. On aggregate G-Mean both methods are competitive but not leading, with rank-1 frequencies of $11.3\%$ (TPS) and $9.9\%$ (BoundaryTPS) across fold combinations. Empirically, both methods scale sub-quadratically in sample size.

2511.04791 2026-06-02 cs.LG 版本更新

DuetServe: Harmonizing Prefill and Decode for LLM Serving via Adaptive GPU Multiplexing

DuetServe: 通过自适应GPU多路复用协调LLM服务的预填充与解码

Lei Gao, Chaoyi Jiang, Hossein Entezari Zarch, Daniel Wong, Mark Hill, Murali Annavaram

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) University of Toronto(多伦多大学) University of Washington(华盛顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 针对LLM服务中预填充与解码阶段的干扰问题,提出DuetServe框架,通过自适应SM级GPU空间多路复用实现单GPU内的阶段隔离,在保证低延迟的同时提升吞吐量。

详情
AI中文摘要

现代LLM服务系统必须维持高吞吐量,同时满足两个不同推理阶段(计算密集型预填充阶段和内存受限的解码阶段)的严格延迟SLO。现有方法要么(1)在共享GPU上聚合两个阶段,导致预填充和解码阶段之间的干扰,从而降低令牌间时间(TBT);要么(2)将两个阶段分离到不同GPU上,改善延迟但通过重复模型和KV缓存传输浪费资源。我们提出DuetServe,一个统一的LLM服务框架,在单个GPU内实现分离级别的隔离。DuetServe默认以聚合模式运行,并在预测到TBT下降时动态激活SM级GPU空间多路复用。其关键思想是仅在需要时通过细粒度、自适应的SM分区解耦预填充和解码执行,仅在争用威胁延迟服务级别目标时提供阶段隔离。DuetServe集成了(1)一个注意力感知的屋顶线模型以预测迭代延迟,(2)一个分区优化器,选择最优SM分割以在TBT约束下最大化吞吐量,以及(3)一个无中断执行引擎,消除CPU-GPU同步开销。评估表明,与最先进框架相比,DuetServe在保持低生成延迟的同时,总吞吐量提升高达1.3倍。

英文摘要

Modern LLM serving systems must sustain high throughput while meeting strict latency SLOs across two distinct inference phases: compute-intensive prefill and memory-bound decode phases. Existing approaches either (1) aggregate both phases on shared GPUs, leading to interference between prefill and decode phases, which degrades Time-Between-Tokens (TBT); or (2) disaggregate the two phases across GPUs, improving latency but wasting resources through duplicated models and KV cache transfers. We present DuetServe, a unified LLM serving framework that achieves disaggregation-level isolation within a single GPU. DuetServe operates in aggregated mode by default and dynamically activates SM-level GPU spatial multiplexing when TBT degradation is predicted. Its key idea is to decouple prefill and decode execution only when needed through fine-grained, adaptive SM partitioning that provides phase isolation only when contention threatens latency service level objectives. DuetServe integrates (1) an attention-aware roofline model to forecast iteration latency, (2) a partitioning optimizer that selects the optimal SM split to maximize throughput under TBT constraints, and (3) an interruption-free execution engine that eliminates CPU-GPU synchronization overhead. Evaluations show that DuetServe improves total throughput by up to 1.3x while maintaining low generation latency compared to state-of-the-art frameworks.

2506.22666 2026-06-02 cs.CR cs.CL cs.LG stat.ML 版本更新

VERA: Variational Inference Framework for Jailbreaking Large Language Models

VERA:用于越狱大型语言模型的变分推理框架

Anamika Lochab, Lu Yan, Patrick Pynadath, Xiangyu Zhang, Ruqi Zhang

发表机构 * Department of Computer Science, Purdue University(计算机科学系,普渡大学)

AI总结 提出VERA框架,将黑盒越狱提示生成视为变分推理问题,训练小型攻击者LLM近似目标LLM的对抗提示后验,无需重新优化即可生成多样且流畅的越狱提示。

Comments Accepted by NeurIPS 2025

详情
AI中文摘要

仅通过API访问最先进LLM的兴起凸显了在现实环境中识别模型漏洞的有效黑盒越狱方法的需求。由于缺乏基于梯度的优化原则性目标,大多数现有方法依赖于遗传算法,这些算法受限于其初始化和对人工策划提示池的依赖。此外,这些方法需要对每个提示进行单独优化,未能提供模型漏洞的全面表征。为弥补这一差距,我们引入了VERA:用于越狱的变分推理框架。VERA将黑盒越狱提示生成视为变分推理问题,训练一个小型攻击者LLM来近似目标LLM在对抗提示上的后验。一旦训练完成,攻击者可以针对目标查询生成多样化、流畅的越狱提示,而无需重新优化。实验结果表明,VERA在一系列目标LLM上取得了强劲的性能,凸显了概率推理在对抗性提示生成中的价值。

英文摘要

The rise of API-only access to state-of-the-art LLMs highlights the need for effective black-box jailbreak methods to identify model vulnerabilities in real-world settings. Without a principled objective for gradient-based optimization, most existing approaches rely on genetic algorithms, which are limited by their initialization and dependence on manually curated prompt pools. Furthermore, these methods require individual optimization for each prompt, failing to provide a comprehensive characterization of model vulnerabilities. To address this gap, we introduce VERA: Variational infErence fRamework for jAilbreaking. VERA casts black-box jailbreak prompting as a variational inference problem, training a small attacker LLM to approximate the target LLM's posterior over adversarial prompts. Once trained, the attacker can generate diverse, fluent jailbreak prompts for a target query without re-optimization. Experimental results show that VERA achieves strong performance across a range of target LLMs, highlighting the value of probabilistic inference for adversarial prompt generation.

2504.16129 2026-06-02 cs.MA cs.AI cs.LG cs.RO 版本更新

MARFT: Multi-Agent Reinforcement Fine-Tuning

MARFT: 多智能体强化微调

Junwei Liao, Muning Wen, Jun Wang, Weinan Zhang

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) OPPO Research Institute(OPPO研究院)

AI总结 针对基于大语言模型的多智能体系统,提出多智能体强化微调(MARFT)框架,通过引入Flex-MG马尔可夫博弈公式和通用算法,解决异步交互、异构架构等挑战,提升系统鲁棒性和适应性。

Comments 37 pages

详情
AI中文摘要

基于大语言模型的多智能体系统(LaMAS)在需要多方面推理和协作的复杂智能体任务中展现出强大能力,从高质量演示生成到科学研究。同时,强化学习(RL)被广泛认可用于增强智能体智能,但用基础RL技术微调LaMAS的研究有限。由于LaMAS的独特机制,直接将传统多智能体强化学习(MARL)应用于LaMAS也带来了重大挑战。为解决这些挑战,本文对基于LLM的MARL进行了全面研究,并提出了多智能体强化微调(MARFT)。我们引入了Flex-MG,一种与真实世界LaMAS优化一致的新马尔可夫博弈公式,以及一个针对LaMAS定制的通用算法框架。我们回顾了从传统RL到强化微调(RFT)的演变,然后分析了多智能体对应部分。对于LaMAS,我们识别了经典MARL与MARFT之间的关键差异,包括异步智能体交互、轮廓感知智能体设计和异构架构。这些差异促使了面向LaMAS的RFT公式。我们提出了一个稳健且可扩展的MARFT框架,详细介绍了其模块化算法,并提供了开源实现以支持采用和进一步研究。本文进一步讨论了应用前景和开放挑战,包括动态环境建模、样本效率低下以及缺乏连贯框架。通过将理论基础与实践方法相结合,本文旨在作为推进MARFT向弹性、自适应和与人类一致的智能体系统发展的路线图。实现:https://github.com/jwliao-ai/MARFT。

英文摘要

Large Language Model (LLM)-based Multi-Agent Systems (LaMAS) have demonstrated strong capabilities on complex agentic tasks requiring multifaceted reasoning and collaboration, from high-quality presentation generation to scientific research. Meanwhile, Reinforcement Learning (RL) is widely recognized for enhancing agent intelligence, but limited work has studied fine-tuning LaMAS with foundational RL techniques. Directly applying conventional Multi-Agent Reinforcement Learning (MARL) to LaMAS also introduces major challenges due to the unique mechanisms of LaMAS. To address these challenges, this article presents a comprehensive study of LLM-based MARL and proposes Multi-Agent Reinforcement Fine-Tuning (MARFT). We introduce Flex-MG, a new Markov Game formulation aligned with real-world LaMAS optimization, together with a universal algorithmic framework tailored to LaMAS. We review the evolution from traditional RL to Reinforcement Fine-Tuning (RFT), then analyze the multi-agent counterpart. For LaMAS, we identify key differences between classical MARL and MARFT, including asynchronous agent interactions, profile-aware agent design, and heterogeneous architectures. These differences motivate a LaMAS-oriented formulation of RFT. We present a robust and scalable MARFT framework, detail its modular algorithm, and provide an open-source implementation to support adoption and further research. The paper further discusses application perspectives and open challenges, including dynamic environment modeling, sample inefficiency, and the lack of cohesive frameworks. By connecting theoretical foundations with practical methodology, this work aims to serve as a roadmap for advancing MARFT toward resilient, adaptive, and human-aligned agentic systems. Implementation: https://github.com/jwliao-ai/MARFT.

2412.16209 2026-06-02 cs.LG stat.ML 版本更新

Challenges in the calibration of tree-based models for imbalanced classification

基于树的模型在不平衡分类中校准的挑战

Nathan Phelps, Daniel J. Lizotte, Douglas G. Woolford

发表机构 * Department of Statistical and Actuarial Sciences, University of Western Ontario, London, Canada(统计与精算科学系,温哥华大学,伦敦,加拿大) Department of Computer Science, University of Western Ontario, London, Canada(计算机科学系,温哥华大学,伦敦,加拿大) Department of Epidemiology and Biostatistics, University of Western Ontario, London, Canada(流行病学与生物统计学系,温哥华大学,伦敦,加拿大)

AI总结 研究随机森林在欠采样数据上使用解析校准导致偏差的问题,发现决策树可能偏向少数类,并提出应使用可学习校准模式的方法(如beta校准)。

详情
AI中文摘要

当使用机器学习处理不平衡的二分类问题时,通常会对多数类进行子采样以创建(更)平衡的训练数据集。这会使模型产生偏差,因为模型从不能完全代表底层感兴趣群体的数据中学习。解决这种偏差的一种方法是基于多数类的采样率,将预测结果解析映射到新值。我们展示了以这种方式校准随机森林会产生负面后果,包括流行率估计同时依赖于随机森林中每个分裂考虑的预测变量数量和使用的采样率。我们利用随机森林和解析校准的已知性质解释前者,并通过展示决策树中的偏差解释后者。与现有文献相矛盾的是,我们证明决策树可能偏向少数类。这些问题表明,在欠采样数据上训练的基于树的模型不应进行解析校准。能够学习原始模型中校准偏差模式的方法(例如beta校准)更为合适。

英文摘要

When using machine learning for imbalanced binary classification problems, it is common to subsample the majority class to create a (more) balanced training dataset. This biases the model's predictions because the model learns from data that is not fully representative of the underlying population of interest. One way of accounting for this bias is analytically mapping the resulting predictions to new values based on the sampling rate for the majority class. We show that calibrating a random forest this way has negative consequences, including prevalence estimates that depend on both the number of predictors considered at each split in the random forest and the sampling rate used. We explain the former using known properties of random forests and analytical calibration and the latter by demonstrating a bias in decision trees. In contradiction with much of the existing literature, we show that decision trees can be biased towards the minority class. These issues indicate that tree-based models trained on undersampled data should not be calibrated analytically. Calibration approaches that can learn a miscalibration pattern in the original model (e.g., beta calibration) are more suitable.

2510.14904 2026-06-02 cs.CV cs.AI cs.LG 版本更新

CaptionFormer: Unified Segmentation, Tracking, and Captioning for Spatio-Temporal Objects

CaptionFormer:时空对象的统一分割、跟踪与描述

Gabriel Fiastre, Antoine Yang, Cordelia Schmid

发表机构 * Inria, École Normale Supérieure, CNRS, PSL Research University(法国国家科学研究中心、巴黎高等师范学院、国家科学研究中心、巴黎综合理工研究所) Google DeepMind(谷歌DeepMind)

AI总结 提出 CaptionFormer 模型,通过利用 VLM 生成合成描述并扩展数据集,实现视频中对象轨迹的联合检测、分割、跟踪与描述,在三个基准上达到最优。

Comments 17 pages, 10 figures

详情
AI中文摘要

密集视频对象描述(DVOC)是联合检测、跟踪和描述视频中对象轨迹的任务,需要理解时空细节并用自然语言描述。由于任务复杂性和手动标注的高成本,先前方法采用有限数据的训练策略,可能导致次优性能。为解决此问题,我们提出利用最先进的 VLM 生成关于时空定位实体的描述,并用我们的合成描述(LVISCap 和 LV-VISCap)扩展 LVIS 和 LV-VIS 数据集。此外,我们引入端到端模型 CaptionFormer,能够联合检测、分割、跟踪和描述对象轨迹。CaptionFormer 在三个现有基准(VidSTG、VLN 和 BenSMOT)上取得了最先进的 DVOC 结果。数据集和代码可在 https://www.gabriel.fiastre.fr/captionformer/ 获取。

英文摘要

Dense Video Object Captioning (DVOC) is the task of jointly detecting, tracking, and captioning object trajectories in a video, requiring the ability to understand spatio-temporal details and describe them in natural language. Due to the complexity of the task and the high cost associated with manual annotation, previous approaches resort to training strategies with limited data, potentially leading to suboptimal performance. To circumvent this issue, we propose to generate captions about spatio-temporally localized entities leveraging a state-of-the-art VLM, and extend the LVIS and LV-VIS datasets with our synthetic captions (LVISCap and LV-VISCap). Moreover, we introduce an end-to-end model, CaptionFormer, capable of jointly detecting, segmenting, tracking and captioning object trajectories. CaptionFormer achieves state-of-the-art DVOC results on three existing benchmarks, VidSTG, VLN and BenSMOT. The datasets and code are available at https://www.gabriel.fiastre.fr/captionformer/.

2504.19419 2026-06-02 cs.LG stat.ML 版本更新

Advancing Local Clustering on Graphs via Compressive Sensing: Semi-supervised and Unsupervised Methods

通过压缩感知推进图上的局部聚类:半监督和无监督方法

Zhaiming Shen, Sung Ha Kang

发表机构 * School of Mathematics, Georgia Institute of Technology(数学系,佐治亚理工学院)

AI总结 提出基于压缩感知的半监督和无监督局部聚类方法,通过随机采样、扩散和重叠分析实现稀疏解,并证明其正确性,在低标签率下达到最优性能。

详情
AI中文摘要

局部聚类旨在无需图的任何额外结构信息的情况下,识别大型图中的特定子结构。这些子结构通常相对于整个图较小,使得可以通过寻找与图拉普拉斯相关的线性系统的稀疏解来解决问题。在这项工作中,我们首先提出了一种在给定极少标签数据时识别特定局部聚类的方法,我们称之为半监督局部聚类。然后,我们将该方法扩展到无监督设置,即没有标签的先验信息可用。所提出的方法包括随机采样图、通过局部聚类提取进行扩散,然后检查结果之间的重叠以找到每个聚类。我们建立了任意节点对的共成员条件,并严格证明了我们方法的正确性。此外,我们进行了大量实验,证明所提出的方法在低标签率情况下达到了最先进的结果。

英文摘要

Local clustering aims to identify specific substructures within a large graph without any additional structural information of the graph. These substructures are typically small compared to the overall graph, enabling the problem to be approached by finding a sparse solution to a linear system associated with the graph Laplacian. In this work, we first propose a method for identifying specific local clusters when very few labeled data are given, which we term semi-supervised local clustering. We then extend this approach to the unsupervised setting when no prior information on labels is available. The proposed methods involve randomly sampling the graph, applying diffusion through local cluster extraction, then examining the overlap among the results to find each cluster. We establish the co-membership conditions for any pair of nodes, and rigorously prove the correctness of our methods. Additionally, we conduct extensive experiments to demonstrate that the proposed methods achieve state of the art results in the low-label rates regime.

2510.22138 2026-06-02 cs.LG 版本更新

Tractable Shapley Values and Interactions via Tensor Networks

通过张量网络实现可计算的Shapley值和交互作用

Farzaneh Heidari, Chao Li, Guillaume Rabusseau

发表机构 * DIRO & Mila Université de Montréal(DIRO与Mila蒙特利尔大学) RIKEN AIP(日本理化学研究所AIP) DIRO & Mila, CIFAR AI Chair(DIRO与Mila,CIFAR人工智能主席) Université de Montréal(蒙特利尔大学)

AI总结 提出TN-SHAP方法,利用张量网络替代指数级联盟枚举,通过少量评估高效计算Shapley值和交互指数,实现O(n·poly(χ)+n²)复杂度,并在UCI数据集上取得25-1000倍加速。

详情
AI中文摘要

我们展示了如何将Shapley值和Shapley风格交互指数背后的n个特征上的O(2^n)联盟枚举替换为在张量网络(TN)代理上的少量评估方案:TN-SHAP。关键思想是将预测器的局部行为表示为因子化的多重线性映射,使得联盟量成为系数张量的线性探针。TN-SHAP用仅少量目标评估替换穷举联盟扫描,以提取k阶Shapley交互。特别地,一阶(单特征)和二阶(成对)计算的复杂度均为O(n·poly(χ) + n^2),其中χ是TN的最大切割秩。我们提供了TN-SHAP近似误差和可计算性的理论保证。在UCI数据集上,我们的方法在拟合代理上匹配枚举,同时将评估量减少数个数量级,并在可比精度下实现比KernelSHAP-IQ快25-1000倍的挂钟时间加速,同时将训练分摊到局部群体上。

英文摘要

We show how to replace the O(2^n) coalition enumeration over n features behind Shapley values and Shapley-style interaction indices with a few-evaluation scheme on a tensor-network (TN) surrogate: TN-SHAP. The key idea is to represent a predictor's local behavior as a factorized multilinear map, so that coalitional quantities become linear probes of a coefficient tensor. TN-SHAP replaces exhaustive coalition sweeps with just a small number of targeted evaluations to extract order-k Shapley interactions. In particular, both order-1 (single-feature) and order-2 (pairwise) computations have cost O(n*poly(chi) + n^2), where chi is the TN's maximal cut rank. We provide theoretical guarantees on the approximation error and tractability of TN-SHAP. On UCI datasets, our method matches enumeration on the fitted surrogate while reducing evaluation by orders of magnitude and achieves 25-1000x wall-clock speedups over KernelSHAP-IQ at comparable accuracy, while amortizing training across local cohorts.

2510.23379 2026-06-02 cs.LG cs.AI cs.NE q-bio.BM 版本更新

Symbolic Neural Generation with Applications to Lead Discovery in Drug Design

符号神经生成及其在药物设计先导发现中的应用

Ashwin Srinivasan, Tirtharaj Dash, A Baskar, Michael Bain, Sanjay Kumar Dey, Mainak Banerjee

发表机构 * Dept. of Computer Science & Information Systems and APPCAIR BITS Pilani, K K Birla Goa Campus, India(计算机科学与信息系统系及APPCAIR比特纳学院,K K Birl拉果阿校区,印度) Dept. of Computer Science & Information Systems BITS Pilani, K K Birla Goa Campus, India(计算机科学与信息系统系比特纳学院,K K Birl拉果阿校区,印度) Department of Biochemistry, University of Cambridge, Cambridge, UK(生物化学系,剑桥大学,剑桥,英国) School of Computer Science and Engineering University of New South Wales, Sydney(计算机科学与工程学院新南威尔士大学,悉尼) Dr. B.R. Ambedkar Center for Biomedical Research University of Delhi, New Delhi, India(B.R.阿姆贝卡尔生物医学研究中心,德里大学,新德里,印度) Department of Chemistry BITS Pilani, K.K. Birla Goa Campus, India(化学系比特纳学院,K.K. Birl拉果阿校区,印度)

AI总结 提出符号神经生成器(SNG)框架,结合归纳逻辑编程与大语言模型,通过符号约束指导神经生成,在药物设计中生成满足形式规范的候选分子,性能与现有方法相当,并在探索性问题上产生与临床候选分子相当的结合亲和力。

Comments 37 pages, submitted to the Machine Learning journal; partial overlap of experimental results with https://doi.org/10.1101/2025.02.14.634875

详情
AI中文摘要

我们研究了一类相对未被充分探索的混合神经符号模型,该模型将符号学习与神经推理相结合,以构建满足形式正确性标准的数据生成器。在符号神经生成器(SNG)中,符号学习器从少量实例(有时仅一个)中检查可行数据的逻辑规范。每个规范反过来约束提供给基于神经的生成器的条件信息,该生成器拒绝任何违反符号规范的实例。与其他神经符号方法一样,SNG利用了符号和神经方法的互补优势。SNG的输出是一个对$(H, X)$,其中$H$是从数据构建的可行实例的符号描述,$X$是满足该描述的一组生成的新实例。我们基于构建适当的基集和纤维偏序集并将其组合成整体偏序,为这类系统引入语义。我们实现了一个SNG,将受限形式的归纳逻辑编程(ILP)与大语言模型(LLM)相结合,并在早期药物设计上进行了评估。我们的主要兴趣在于SNG生成的描述和一组潜在的抑制剂分子。在基准问题(药物靶点已被充分理解)上,SNG的性能在统计上与最先进方法相当。在探索性问题(靶点理解不足)上,生成的分子表现出与领先临床候选分子相当的结合亲和力。专家进一步发现符号规范作为初步过滤器很有用,多个生成的分子被确定为可用于合成和湿实验测试。

英文摘要

We investigate a relatively under-explored class of hybrid neurosymbolic models that integrate symbolic learning with neural reasoning to construct data generators meeting formal correctness criteria. In Symbolic Neural Generators (SNGs), symbolic learners examine logical specifications of feasible data from a small set of instances -- sometimes just one. Each specification in turn constrains the conditional information supplied to a neural-based generator, which rejects any instance violating the symbolic specification. Like other neurosymbolic approaches, SNG exploits the complementary strengths of symbolic and neural methods. The outcome of an SNG is a pair $(H, X)$, where $H$ is a symbolic description of feasible instances constructed from data, and $X$ a set of generated new instances that satisfy the description. We introduce a semantics for such systems, based on the construction of appropriate base and fibre partially-ordered sets combined into an overall partial order. We implement an SNG combining a restricted form of Inductive Logic Programming (ILP) with a large language model (LLM) and evaluate it on early-stage drug design. Our main interest is the description and the set of potential inhibitor molecules generated by the SNG. On benchmark problems -- where drug targets are well understood -- SNG performance is statistically comparable to state-of-the-art methods. On exploratory problems with poorly understood targets, generated molecules exhibit binding affinities on par with leading clinical candidates. Experts further find the symbolic specifications useful as preliminary filters, with several generated molecules identified as viable for synthesis and wet-lab testing.

2506.16704 2026-06-02 cs.LG stat.ML 版本更新

How Many Domains Suffice for Domain Generalization? A Tight Characterization via the Domain Shattering Dimension

域泛化需要多少域?通过域破碎维度的紧刻画

Cynthia Dwork, Lunjia Hu, Han Shao

发表机构 * Harvard University(哈佛大学)

AI总结 本文在PAC框架下引入域破碎维度,刻画了域泛化中所需随机采样域的数量,并建立了与VC维度的紧定量关系。

Comments Accepted to NeurIPS 2025

详情
AI中文摘要

我们研究域泛化的一个基本问题:给定一族域(即数据分布),我们需要从多少个随机采样的域中收集数据,才能学习到一个在族中每个已见和未见域上表现合理的模型?我们在PAC框架下建模该问题,并引入一种新的组合度量,称为域破碎维度。我们证明该维度刻画了域样本复杂度。此外,我们建立了域破碎维度与经典VC维度之间的紧定量关系,表明每个在标准PAC设置中可学习的假设类在我们的设置中也是可学习的。

英文摘要

We study a fundamental question of domain generalization: given a family of domains (i.e., data distributions), how many randomly sampled domains do we need to collect data from in order to learn a model that performs reasonably well on every seen and unseen domain in the family? We model this problem in the PAC framework and introduce a new combinatorial measure, which we call the domain shattering dimension. We show that this dimension characterizes the domain sample complexity. Furthermore, we establish a tight quantitative relationship between the domain shattering dimension and the classic VC dimension, demonstrating that every hypothesis class that is learnable in the standard PAC setting is also learnable in our setting.

2510.17532 2026-06-02 cs.CL cs.LG 版本更新

OncoReason: Structuring Clinical Reasoning in LLMs for Robust and Interpretable Survival Prediction

OncoReason: 在大语言模型中构建临床推理以实现稳健且可解释的生存预测

Raghu Vamshi Hemadri, Geetha Krishna Guruju, Kristi Topollai, Anna Ewa Choromanska

AI总结 提出统一多任务学习框架,通过监督微调、思维链提示和强化学习三种对齐策略,使自回归大语言模型在MSK-CHORD数据集上联合进行二元生存分类、连续生存时间回归和自然语言理由生成,实现可解释的生存预测。

Comments This manuscript is withdrawn to allow careful review and correction of bibliographic issues identified after submission, including references that could not be adequately verified. These matters should be resolved before further circulation

详情
AI中文摘要

预测癌症治疗结果需要模型既准确又可解释,尤其是在存在异质性临床数据的情况下。虽然大语言模型(LLM)在生物医学自然语言处理中表现出强大的性能,但它们通常缺乏在高风险决策支持中至关重要的结构化推理能力。我们提出了一个统一的多任务学习框架,将自回归LLM与临床推理对齐,用于MSK-CHORD数据集上的结果预测。我们的模型被训练来联合执行二元生存分类、连续生存时间回归和自然语言理由生成。我们评估了三种对齐策略:(1)标准监督微调(SFT),(2)带有思维链(CoT)提示的SFT,以引发逐步推理,以及(3)组相对策略优化(GRPO),一种强化学习方法,将模型输出与专家推导的推理轨迹对齐。使用LLaMa3-8B和Med42-8B骨干网络的实验表明,CoT提示将F1提高了+6.0,并将MAE降低了12%,而GRPO在BLEU、ROUGE和BERTScore上实现了最先进的可解释性和预测性能。我们进一步表明,现有的生物医学LLM由于架构限制通常无法产生有效的推理轨迹。我们的发现强调了在多任务临床建模中推理感知对齐的重要性,并为精准肿瘤学中可解释、可信赖的LLM设立了新的基准。

英文摘要

Predicting cancer treatment outcomes requires models that are both accurate and interpretable, particularly in the presence of heterogeneous clinical data. While large language models (LLMs) have shown strong performance in biomedical NLP, they often lack structured reasoning capabilities critical for high-stakes decision support. We present a unified, multi-task learning framework that aligns autoregressive LLMs with clinical reasoning for outcome prediction on the MSK-CHORD dataset. Our models are trained to jointly perform binary survival classification, continuous survival time regression, and natural language rationale generation. We evaluate three alignment strategies: (1) standard supervised fine-tuning (SFT), (2) SFT with Chain-of-Thought (CoT) prompting to elicit step-by-step reasoning, and (3) Group Relative Policy Optimization (GRPO), a reinforcement learning method that aligns model outputs to expert-derived reasoning trajectories. Experiments with LLaMa3-8B and Med42-8B backbones demonstrate that CoT prompting improves F1 by +6.0 and reduces MAE by 12%, while GRPO achieves state-of-the-art interpretability and predictive performance across BLEU, ROUGE, and BERTScore. We further show that existing biomedical LLMs often fail to produce valid reasoning traces due to architectural constraints. Our findings underscore the importance of reasoning-aware alignment in multi-task clinical modeling and set a new benchmark for interpretable, trustworthy LLMs in precision oncology.

2510.17303 2026-06-02 cs.LG stat.ML 版本更新

Symmetries in PAC-Bayesian Learning

PAC-Bayesian学习中的对称性

Armin Beck, Peter Ochs

发表机构 * Technical University of Munich(慕尼黑技术大学)

AI总结 本文在PAC-Bayes框架下,将对称性的泛化保证扩展到非紧致群和非不变数据分布,通过调整和收紧现有界,实验验证了其有效性。

详情
AI中文摘要

已知对称性能够提高机器学习模型的经验性能,但解释这些增益的理论保证仍然有限。先前的工作主要关注紧致群对称性,并且通常假设数据分布本身是不变的,这一假设在实际应用中很少满足。在这项工作中,我们将泛化保证扩展到更广泛的非紧致对称性设置,例如平移和非不变数据分布。基于PAC-Bayes框架,我们调整并收紧现有界,在McAllester的PAC-Bayes界上展示了该方法,同时表明它适用于广泛的PAC-Bayes界。我们通过几个具有非均匀和非紧致变换的数据集上的实验验证了我们的理论,其中推导出的保证不仅成立,而且优于先前的结果。这些发现提供了理论证据,表明对于对称数据,对称模型在紧致群和不变分布的狭窄设置之外也是更可取的,为机器学习中对称性的更一般理解开辟了道路。

英文摘要

Symmetries are known to improve the empirical performance of machine learning models, yet theoretical guarantees explaining these gains remain limited. Prior work has focused mainly on compact group symmetries and often assumes that the data distribution itself is invariant, an assumption rarely satisfied in real-world applications. In this work, we extend generalization guarantees to the broader setting of non-compact symmetries, such as translations and to non-invariant data distributions. Building on the PAC-Bayes framework, we adapt and tighten existing bounds, demonstrating the approach on McAllester's PAC-Bayes bound while showing that it applies to a wide range of PAC-Bayes bounds. We validate our theory with experiments on several datasets with non-uniform and non-compact transformations, where the derived guarantees not only hold but also improve upon prior results. These findings provide theoretical evidence that, for symmetric data, symmetric models are preferable beyond the narrow setting of compact groups and invariant distributions, opening the way to a more general understanding of symmetries in machine learning.

2510.17045 2026-06-02 cs.CV cs.AI cs.LG 版本更新

Video Reasoning without Training

无需训练的视频推理

Deepak Sridhar, Kartikeya Bhardwaj, Jeya Pradha Jeyaraj, Nuno Vasconcelos, Ankita Nayak, Harris Teague

发表机构 * Qualcomm AI Research(高通AI研究) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出V-Reason方法,利用输出分布熵作为信号,通过轻量级控制器在推理时自适应调整值缓存,无需强化学习或微调即可提升视频推理性能。

Comments CVPR Findings 2026. Project Page https://deepaksridhar.github.io/vreason.github.io/

详情
AI中文摘要

使用大型多模态模型(LMM)进行视频推理依赖于昂贵的强化学习(RL)和冗长的思维链,导致训练和推理过程中产生大量计算开销。此外,这些推理模型中控制思维过程的机制非常有限。在本文中,我们利用模型输出分布的熵作为信号来研究和指导推理行为。我们发现高质量模型表现出微探索和微利用循环的特征模式,随后出现后期熵峰值(即更长的思考)和较低的最终熵,表明更谨慎的探索和自信的收敛(即当模型探索或思考答案时避免过度随机性)。然后,我们利用这些新颖的、有理论基础的见解,引入了V-Reason(Video-Reason),一种推理时优化方法,通过轻量级、可训练的控制器自适应调整LMM的值缓存。我们提出的控制器由基于熵的目标引导,直接在推理时调整模型行为,无需使用任何RL或监督微调。我们的实验表明,V-Reason在许多视频推理数据集上显著优于基础指令调优模型,将与RL模型的差距平均缩小到0.6%的准确率以内。我们在无需任何训练的情况下实现了这一点,同时提供了效率优势:V-Reason使用的token比RL模型少58.6%。项目页面:https://deepaksridhar.github.io/vreason.github.io/

英文摘要

Video reasoning using Large Multimodal Models (LMMs) relies on costly reinforcement learning (RL) and verbose chain-of-thought, resulting in substantial computational overhead during both training and inference. Moreover, the mechanisms that control the thinking process in these reasoning models are very limited. In this paper, we use the entropy of the model's output distribution as a signal to study and guide reasoning behavior. We discover that high-quality models exhibit a characteristic pattern of micro-exploration and micro-exploitation cycles, followed by a later entropy peak (i.e., longer thinking) and a lower final entropy, indicating more deliberate exploration and confident convergence (i.e., avoid excessive randomness while the model is exploring or thinking through an answer). We then use these novel, theoretically-grounded insights to introduce V-Reason (Video-Reason), an inference-time optimization method that adapts the value cache of the LMM through a lightweight, trainable controller. Our proposed controller is guided by an entropy-based objective, to tune the model's behavior directly at inference, without using any RL or supervised fine-tuning. Our experiments show that V-Reason significantly outperforms the base instruction-tuned models on many video reasoning datasets, narrowing the gap with RL models to within 0.6% accuracy on average. We achieve this without any training, while offering efficiency benefits: V-Reason uses 58.6% fewer tokens than the RL model. Project Page https://deepaksridhar.github.io/vreason.github.io/

2510.16660 2026-06-02 cs.CV cs.LG physics.med-ph 版本更新

Universal and Transferable Attacks on Pathology Foundation Models

病理基础模型的通用与可迁移攻击

Yuntian Wang, Xilin Yang, Che-Yung Shen, Nir Pillar, Aydogan Ozcan

发表机构 * Electrical and Computer Engineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校电子与计算机工程系) Bioengineering Department, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校生物工程系) California NanoSystems Institute (CNSI), University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校加州纳米系统研究所) Department of Pathology, Hadassah Hebrew University Medical Center, Jerusalem, 91120, Israel(海法希伯来大学医疗中心病理学系) Department of Surgery, University of California, Los Angeles, CA, 90095, USA(加州大学洛杉矶分校外科系)

AI总结 提出通用可迁移对抗扰动(UTAP),通过固定弱噪声模式破坏多个病理基础模型的特征表示能力,导致下游任务性能下降,并展示其跨数据集通用性和跨模型可迁移性。

Comments 38 Pages, 8 Figures

详情
Journal ref
Light: Science & Applications (2026)
AI中文摘要

我们为病理基础模型引入了通用可迁移对抗扰动(UTAP),揭示了其能力中的关键脆弱性。UTAP 使用深度学习优化,由一个固定的弱噪声模式组成,当添加到病理图像时,会系统地破坏多个病理基础模型的特征表示能力。因此,UTAP 会导致利用基础模型的下游任务性能下降,包括在广泛的未见数据分布上的错误分类。除了损害模型性能,我们展示了 UTAP 的两个关键特征:(1)通用性:其扰动可应用于不同的视野,与开发 UTAP 的数据集无关;(2)可迁移性:其扰动能成功降低各种外部、黑盒病理基础模型(从未见过)的性能。这两个特征表明 UTAP 不是针对特定基础模型或图像数据集的专用攻击,而是对多种新兴病理基础模型及其应用构成广泛威胁。我们在多个数据集上系统评估了 UTAP 对各种最先进病理基础模型的影响,通过使用固定噪声模式对输入图像进行视觉上不可察觉的修改,导致其性能显著下降。这些强大攻击的开发为模型鲁棒性评估建立了一个关键的高标准基准,凸显了推进防御机制的需求,并可能为对抗训练提供必要资产,以确保 AI 在病理学中的安全可靠部署。

英文摘要

We introduce Universal and Transferable Adversarial Perturbations (UTAP) for pathology foundation models that reveal critical vulnerabilities in their capabilities. Optimized using deep learning, UTAP comprises a fixed and weak noise pattern that, when added to a pathology image, systematically disrupts the feature representation capabilities of multiple pathology foundation models. Therefore, UTAP induces performance drops in downstream tasks that utilize foundation models, including misclassification across a wide range of unseen data distributions. In addition to compromising the model performance, we demonstrate two key features of UTAP: (1) universality: its perturbation can be applied across diverse field-of-views independent of the dataset that UTAP was developed on, and (2) transferability: its perturbation can successfully degrade the performance of various external, black-box pathology foundation models - never seen before. These two features indicate that UTAP is not a dedicated attack associated with a specific foundation model or image dataset, but rather constitutes a broad threat to various emerging pathology foundation models and their applications. We systematically evaluated UTAP across various state-of-the-art pathology foundation models on multiple datasets, causing a significant drop in their performance with visually imperceptible modifications to the input images using a fixed noise pattern. The development of these potent attacks establishes a critical, high-standard benchmark for model robustness evaluation, highlighting a need for advancing defense mechanisms and potentially providing the necessary assets for adversarial training to ensure the safe and reliable deployment of AI in pathology.

2505.22961 2026-06-02 cs.CL cs.LG 版本更新

ToMAP: Training Opponent-Aware LLM Persuaders with Theory of Mind

ToMAP: 利用心智理论训练对手感知的大语言模型说服者

Peixuan Han, Zijia Liu, Jiaxuan You

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出ToMAP方法,通过两个心智理论模块增强大语言模型对对手心理状态的感知与分析,结合强化学习生成更有效、多样化的说服性论据,在3B参数下超越GPT-4o等大模型。

详情
AI中文摘要

大语言模型(LLMs)在说服方面显示出有前景的潜力,但现有关于训练LLM说服者的工作仍处于初步阶段。值得注意的是,虽然人类擅长主动且动态地建模对手的想法和观点,但当前的LLMs在这种心智理论(ToM)推理上存在困难,导致多样性和对手意识有限。为解决这一局限,我们引入了心智理论增强说服者(ToMAP),这是一种通过整合两个心智理论模块来构建更灵活的说服者智能体的新方法,这些模块增强了说服者对对手心理状态的感知和分析。具体来说,我们首先提示说服者考虑对目标中心主张的可能反对意见,然后使用文本编码器配合训练好的MLP分类器来预测对手对这些反主张的当前立场。我们精心设计的强化学习框架使说服者学会如何分析对手相关信息并利用它来生成更有效的论据。实验表明,ToMAP说服者仅包含3B参数,却在多个被说服者模型和多样化语料库上以39.4%的相对增益优于GPT-4o等更大的基线模型。值得注意的是,ToMAP在训练过程中展现出复杂的推理链和减少的重复性,从而产生更多样化和有效的论据。ToMAP的对手感知特性也使其适用于长对话,并能采用更具逻辑性和对手感知的策略。这些结果强调了我们方法的有效性,并突出了其在开发更具说服力的语言智能体方面的潜力。代码可在 https://github.com/ulab-uiuc/ToMAP 获取。

英文摘要

Large language models (LLMs) have shown promising potential in persuasion, but existing works on training LLM persuaders are still preliminary. Notably, while humans are skilled in modeling their opponent's thoughts and opinions proactively and dynamically, current LLMs struggle with such Theory of Mind (ToM) reasoning, resulting in limited diversity and opponent awareness. To address this limitation, we introduce Theory of Mind Augmented Persuader (ToMAP), a novel approach for building more flexible persuader agents by incorporating two theory of mind modules that enhance the persuader's awareness and analysis of the opponent's mental state. Specifically, we begin by prompting the persuader to consider possible objections to the target central claim, and then use a text encoder paired with a trained MLP classifier to predict the opponent's current stance on these counterclaims. Our carefully designed reinforcement learning schema enables the persuader learns how to analyze opponent-related information and utilize it to generate more effective arguments. Experiments show that the ToMAP persuader, while containing only 3B parameters, outperforms much larger baselines, like GPT-4o, with a relative gain of 39.4% across multiple persuadee models and diverse corpora. Notably, ToMAP exhibits complex reasoning chains and reduced repetition during training, which leads to more diverse and effective arguments. The opponent-aware feature of ToMAP also makes it suitable for long conversations and enables it to employ more logical and opponent-aware strategies. These results underscore our method's effectiveness and highlight its potential for developing more persuasive language agents. Code is available at: https://github.com/ulab-uiuc/ToMAP.

2510.11713 2026-06-02 cs.CL cs.LG 版本更新

Are Large Reasoning Models Interruptible?

大型推理模型是否可中断?

Tsung-Han Wu, Mihran Miroyan, David M. Chan, Trevor Darrell, Narges Norouzi, Joseph E. Gonzalez

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究大型推理模型在中断和动态上下文两种现实动态场景下的鲁棒性,发现静态评估高估了鲁棒性,并揭示了推理泄漏、恐慌和自我怀疑等新故障模式。

Comments ICML 2026; Project Page: http://dynamic-lm.github.io

详情
AI中文摘要

大型推理模型(LRM)的实际应用通常需要对变化的提示或环境进行推理。在这项工作中,我们挑战冻结世界假设,并在两种现实的动态场景下评估LRM的鲁棒性:中断(测试模型在预算受限输出下的响应准确性)和动态上下文(测试模型对飞行中变化的适应能力)。在需要长形式推理的数学和编程基准测试中,静态评估始终高估鲁棒性:即使在静态设置中达到高准确率的最先进的LRM,在中断或暴露于变化上下文时也可能不可预测地失败,当更新在推理过程后期引入时,性能下降高达60%。我们的分析进一步揭示了多种新的故障模式,包括推理泄漏(模型在中断时将推理折叠到最终答案中)、恐慌(在时间压力下模型完全放弃推理并返回错误答案)以及自我怀疑(在尝试整合更新信息时性能下降)。项目页面:http://dynamic-lm.github.io/

英文摘要

Real-world applications of Large Reasoning Models (LRMs) often require reasoning about changing prompts or environments. In this work, we challenge the frozen world assumption and evaluate LRM robustness under two realistic dynamic scenarios: interruptions, which test the accuracy of model responses under budget-constrained outputs, and dynamic context, which tests model adaptation to in-flight changes. Across mathematics and programming benchmarks that require long-form reasoning, static evaluations consistently overestimate robustness: even state-of-the-art LRMs, which achieve high accuracy in static settings, can fail unpredictably when interrupted or exposed to changing context, with performance dropping by up to 60% when updates are introduced late in the reasoning process. Our analysis further reveals several novel failure modes, including reasoning leakage, where models fold the reasoning into their final answer when interrupted; panic, where under time pressure models abandon reasoning entirely and return incorrect answers; and self-doubt, where performance degrades when trying to incorporate updated information. Project Page: http://dynamic-lm.github.io/

2510.13774 2026-06-02 cs.LG cs.CV 版本更新

UrbanFusion: Stochastic Multimodal Fusion for Contrastive Learning of Robust Spatial Representations

UrbanFusion: 用于鲁棒空间表示对比学习的随机多模态融合

Dominik J. Mühlematter, Lin Che, Ye Hong, Martin Raubal, Nina Wiedemann

发表机构 * ETH Zurich(苏黎世联邦理工学院)

AI总结 提出UrbanFusion模型,通过随机多模态融合(SMF)和Transformer模块整合街景、遥感、地图和POI数据,在56个城市41项任务中优于现有GeoAI模型。

详情
Journal ref
International Conference on Machine Learning (ICML), 2026
AI中文摘要

预测房价和公共卫生指标等城市现象需要有效整合各种地理空间数据。当前方法主要使用特定任务模型,而近期用于空间表示的通用模型通常仅支持有限模态且缺乏多模态融合能力。为克服这些挑战,我们提出UrbanFusion,一种具有随机多模态融合(SMF)的空间表示模型。该框架采用模态特定编码器处理不同类型输入,包括街景图像、遥感数据、制图地图和兴趣点(POI)数据。这些多模态输入通过基于Transformer的融合模块进行集成,学习统一表示。在全世界56个城市的41项任务上的广泛评估表明,与最先进的GeoAI模型相比,UrbanFusion具有强大的泛化和预测性能。具体而言,它1)在位置编码上优于先前模型,2)允许推理时多模态输入,3)能很好地泛化到训练中未见过的区域。UrbanFusion在预训练和推理过程中均可灵活利用给定位置的任何可用模态子集,从而在多样化的数据可用性场景中实现广泛适用性。

英文摘要

Forecasting urban phenomena such as housing prices and public health indicators requires the effective integration of various geospatial data. Current methods primarily utilize task-specific models, while recent generic models for spatial representations often support only limited modalities and lack multimodal fusion capabilities. To overcome these challenges, we present UrbanFusion, a spatial representation model that features Stochastic Multimodal Fusion (SMF). The framework employs modality-specific encoders to process different types of inputs, including street view imagery, remote sensing data, cartographic maps, and points of interest (POIs) data. These multimodal inputs are integrated via a Transformer-based fusion module that learns unified representations. An extensive evaluation across 41 tasks in 56 cities worldwide demonstrates UrbanFusion's strong generalization and predictive performance compared to state-of-the-art GeoAI models. Specifically, it 1) outperforms prior models on location-encoding, 2) allows multimodal input during inference, and 3) generalizes well to regions unseen during training. UrbanFusion can flexibly utilize any subset of available modalities for a given location during both pretraining and inference, enabling broad applicability across diverse data availability scenarios.

2410.14483 2026-06-02 stat.ML cs.LG stat.ME 版本更新

Interventional Processes for Causal Uncertainty Quantification

因果不确定性量化的干预过程

Hugh Dance, Peter Orbanz, Arthur Gretton

发表机构 * Gatsby Unit, University College London, London, United Kingdom(伦敦大学学院Gatsby单元)

AI总结 本文提出一种基于高斯过程的方法,通过将干预函数表示为再生核希尔伯特空间中观测函数的内积,实现干预函数的不确定性量化,并给出闭式后验矩和可处理的训练推理过程。

详情
AI中文摘要

在高风险应用中,因果效应的可靠不确定性量化至关重要,但当目标是一个完整函数而非标量估计量时,这仍然具有挑战性。在这项工作中,我们引入了一种基于高斯过程的方法,用于干预函数的不确定性量化。核心思想是建立在最近工作的基础上,该工作将干预函数表示为再生核希尔伯特空间中观测函数的内积,通过为这些函数构建适当的高斯过程先验,并从观测数据中推断后验。我们的方法产生闭式后验矩和可处理的训练与推理,同时避免了先前为RKHS函数构建高斯过程先验的病理问题。我们进一步推导了一种后验覆盖校准的实用程序。在合成基准、因果贝叶斯优化任务和大规模真实数据集上,我们的方法在保持因果效应估计竞争力的同时,改善了不确定性量化。

英文摘要

Reliable uncertainty quantification for causal effects is crucial in high-stakes applications, but remains challenging when the target is an entire function rather than a scalar estimand. In this work, we introduce a GP-based approach for uncertainty quantification of interventional functions. The central idea is to build on recent work representing interventional functions as an inner-product of observational functions in a reproducing kernel Hilbert space (RKHS), by constructing appropriate GP priors for such functions and inferring posteriors from observational data. Our approach yields closed-form posterior moments and tractable training and inference, while avoiding pathologies of previous GP prior constructions for RKHS functions. We further derive a practical procedure for posterior coverage calibration. Across synthetic benchmarks, causal Bayesian optimization tasks, and a large-scale real dataset, our method improves uncertainty quantification while remaining competitive in causal effect estimation.

2510.12624 2026-06-02 cs.LG cs.AI 版本更新

Learning-To-Measure: In-Context Active Feature Acquisition

学习测量:上下文主动特征获取

Yuta Kobayashi, Zilin Jing, Jiayu Yao, Hongseok Namkoong, Shalmali Joshi

发表机构 * University of Tokyo(东京大学)

AI总结 提出 Learning-to-Measure (L2M) 方法,通过不确定性量化与条件互信息引导的贪婪特征获取,在上下文学习中解决元主动特征获取问题,无需针对每个任务重新训练。

详情
AI中文摘要

主动特征获取 (AFA) 是一个序列决策问题,目标是通过自适应选择要获取的特征来改进测试实例的模型性能。在实践中,AFA 方法通常从具有系统性特征缺失和有限任务特定标签的回顾性数据中学习。大多数先前的工作针对单个预定任务进行获取,限制了可扩展性。为解决这一限制,我们形式化了元 AFA 问题,其目标是学习跨各种任务的获取策略。我们引入了学习测量 (L2M),它包括 i) 对未见任务的可靠不确定性量化,以及 ii) 一个最大化条件互信息的不确定性引导的贪婪特征获取代理。我们展示了一种序列建模或自回归预训练方法,该方法为具有任意缺失模式的任务提供了可靠的不确定性量化基础。L2M 直接对具有回顾性缺失的数据集进行操作,并在上下文中执行元 AFA 任务,消除了每个任务的重新训练。在合成和真实世界的表格基准测试中,L2M 匹配或超越了特定任务的基线,特别是在标签稀缺和高缺失率的情况下。

英文摘要

Active feature acquisition (AFA) is a sequential decision-making problem where the goal is to improve model performance for test instances by adaptively selecting which features to acquire. In practice, AFA methods often learn from retrospective data with systematic missingness in the features and limited task-specific labels. Most prior work addresses acquisition for a single predetermined task, limiting scalability. To address this limitation, we formalize the meta-AFA problem, where the goal is to learn acquisition policies across various tasks. We introduce Learning-to-Measure (L2M), which consists of i) reliable uncertainty quantification over unseen tasks, and ii) an uncertainty-guided greedy feature acquisition agent that maximizes conditional mutual information. We demonstrate a sequence-modeling or autoregressive pre-training approach that underpins reliable uncertainty quantification for tasks with arbitrary missingness. L2M operates directly on datasets with retrospective missingness and performs the meta-AFA task in-context, eliminating per-task retraining. Across synthetic and real-world tabular benchmarks, L2M matches or surpasses task-specific baselines, particularly under scarce labels and high missingness.

2510.12249 2026-06-02 cs.LG 版本更新

Optimal Regularization for Performative Learning

表现性学习的最优正则化

Edwige Cyffers, Alireza Mirrokni, Marco Mondelli

发表机构 * EPFL, Switzerland(瑞士联邦理工学院)

AI总结 研究高维岭回归中正则化如何应对数据分布随模型变化的表现性效应,发现过参数化下表现性效应有益,并给出最优正则化参数与表现性效应强度的关系。

Comments Accepted at ICML 2026

详情
AI中文摘要

在表现性学习中,数据分布会响应部署的模型——例如,因为策略性用户调整其特征以博弈模型——这创造了比经典监督学习更复杂的动态。因此,我们不仅应该针对当前数据优化模型,还应该考虑模型可能将分布引向新方向,而不知道潜在变化的确切性质。我们通过研究正则化在高维岭回归中的影响,探索正则化如何帮助应对表现性效应。我们表明,虽然表现性效应在总体设置中恶化测试风险,但在特征数量超过样本数量的过参数化机制中,它们可能是有益的。我们证明最优正则化与表现性效应的整体强度成比例,从而可以预先设置正则化以应对这种效应。我们通过在合成和真实数据集上对最优正则化参数的经验评估来展示这一发现。

英文摘要

In performative learning, the data distribution reacts to the deployed model - for example, because strategic users adapt their features to game it - which creates a more complex dynamic than in classical supervised learning. One should thus not only optimize the model for the current data but also take into account that the model might steer the distribution in a new direction, without knowing the exact nature of the potential shift. We explore how regularization can help cope with performative effects by studying its impact in high-dimensional ridge regression. We show that, while performative effects worsen the test risk in the population setting, they can be beneficial in the over-parameterized regime where the number of features exceeds the number of samples. We show that the optimal regularization scales with the overall strength of the performative effect, making it possible to set the regularization in anticipation of this effect. We illustrate this finding through empirical evaluations of the optimal regularization parameter on both synthetic and real-world datasets.

2510.10982 2026-06-02 cs.LG cs.AI 版本更新

Catch-Only-One: Non-Transferable Examples for Model-Specific Authorization

仅捕获一个:用于模型特定授权的不可迁移样本

Zihan Wang, Zhiyong Ma, Zhongkui Ma, Shuofeng Liu, Akide Liu, Derui Wang, Minhui Xue, Guangdong Bai

AI总结 提出不可迁移样本(NTEs),通过将数据编码为仅能被指定模型解码的“密文”,在无需训练的情况下利用模型特定低敏感子空间实现授权模型保真度与未授权模型性能退化。

详情
AI中文摘要

最近的AI法规越来越强调需要保护数据在AI创新中的效用,同时防止滥用,特别是在下游AI应用中强制执行目的限制。在实践中,执行这一原则仍然具有挑战性,因为发布的数据可以轻易地输入到超出其声明意图的任意模型中。现有方法试图通过扰动数据或重新训练模型来限制意外使用来减轻这种风险。然而,这些策略无法防止未知或外部训练模型的推理,或者从根本上依赖于对训练或部署的控制。在这项工作中,我们引入了不可迁移样本(NTEs),即重新编码的数据,作为任务级别的“密文”,只能由指定模型解码。对抗性样本利用高模型敏感性的方向,而NTEs则利用互补的不敏感子空间。我们提出了一种无需训练、数据无关的方法,在模型特定的低敏感子空间内重新编码数据,保留授权模型的输出,同时通过子空间错位降低未授权模型的性能。我们建立了形式化界限,证明授权模型的保真度,并表明未授权模型的退化与模型之间可测量的谱错位成比例。实验上,NTEs在常见预处理下保持了多种视觉骨干网络和最先进视觉语言模型的性能,而未授权模型即使在自适应重建攻击下也会崩溃。这些结果确立了NTEs作为一种实用手段,在防止未授权利用的同时保持预期的数据效用。我们的项目可在 https://trusted-system-lab.github.io/model-specificity 获取。

英文摘要

Recent AI regulations increasingly emphasize the need for mechanisms that preserve the utility of data for AI innovation while preventing misuse, particularly by enforcing purpose limitation in downstream AI applications. In practice, enforcing this principle remains challenging, as released data can be trivially fed into arbitrary models beyond its declared intent. Existing approaches attempt to mitigate this risk by either perturbing data or retraining models to limit unintended use. These strategies, however, offer no protection against inference by unknown or externally trained models, or fundamentally rely on control over the training or deployment. In this work, we introduce non-transferable examples (NTEs), recoded data that act as a task-level "ciphertext" decodable only by a designated model. Whereas adversarial examples exploit directions of high model sensitivity, NTEs leverage the complementary insensitive subspace. We propose a training-free, data-agnostic method that recodes data within a model-specific low-sensitivity subspace, preserving outputs for the authorized model while degrading unauthorized ones through subspace misalignment. We establish formal bounds certifying authorized-model fidelity and showing that unauthorized degradation scales with measurable spectral misalignment between models. Empirically, NTEs preserve performance across diverse vision backbones and state-of-the-art vision-language models under common preprocessing, while unauthorized models collapse even under adaptive reconstruction attacks. These results establish NTEs as a practical means to preserve intended data utility while preventing unauthorized exploitation. Our project is available at https://trusted-system-lab.github.io/model-specificity

2510.10541 2026-06-02 cs.LG cs.AI 版本更新

Rethinking RL Evaluation: Can Benchmarks Truly Reveal Failures of RL Methods?

重新思考强化学习评估:基准测试能否真正揭示强化学习方法的失败?

Zihan Chen, Yiming Zhang, Hengguang Zhou, Zenghui Ding, Yining Sun, Cho-Jui Hsieh

发表机构 * HFIPS, Chinese Academy of Sciences(中国科学院HFIPS) University of Science and Technology of China(中国科学技术大学) University of California, Los Angeles(美国加州大学洛杉矶分校) Arena Project: RL-GAP.github.io(Arena项目: RL-GAP.github.io)

AI总结 本文通过引入诊断套件和Oracle性能差距(OPG)指标,发现当前基准测试无法可靠区分强化学习方法在训练集和测试集上的性能差异,并揭示现有方法在分布偏移、难度变化和反事实场景中泛化能力不足,提出更可靠基准设计的三项核心原则。

详情
AI中文摘要

当前的基准测试不足以评估大型语言模型(LLM)在强化学习(RL)方面的进展。尽管最近报告了RL在基准测试上的提升,但我们发现,在这些基准测试的训练集上训练与直接在测试集上训练几乎达到相同的性能,这表明基准测试无法可靠地区分进一步的进展。为了研究这一现象,我们引入了一个诊断套件和Oracle性能差距(OPG)指标,该指标量化了在基准测试的训练集与测试集上训练之间的性能差异。我们进一步通过压力测试分析这一现象,发现尽管基准测试得分很高,现有的RL方法难以在分布偏移、不同难度级别和反事实场景中泛化:这些是当前基准测试未能揭示的缺陷。我们得出结论,当前的基准测试不足以评估泛化能力,并提出了设计更可靠基准测试的三项核心原则:足够的难度、平衡的评估和分布鲁棒性。

英文摘要

Current benchmarks are inadequate for evaluating progress in reinforcement learning (RL) for large language models (LLMs).Despite recent benchmark gains reported for RL, we find that training on these benchmarks' training sets achieves nearly the same performance as training directly on the test sets, suggesting that the benchmarks cannot reliably separate further progress.To study this phenomenon, we introduce a diagnostic suite and the Oracle Performance Gap (OPG) metric that quantifies the performance difference between training on the train split versus the test split of a benchmark. We further analyze this phenomenon with stress tests and find that, despite strong benchmark scores, existing RL methods struggle to generalize across distribution shifts, varying levels of difficulty, and counterfactual scenarios: shortcomings that current benchmarks fail to reveal.We conclude that current benchmarks are insufficient for evaluating generalization and propose three core principles for designing more faithful benchmarks: sufficient difficulty, balanced evaluation, and distributional robustness.

2510.09222 2026-06-02 cs.LG 版本更新

FM-IRL: Flow-Matching for Reward Modeling and Policy Regularization in Reinforcement Learning

FM-IRL:强化学习中用于奖励建模与策略正则化的流匹配

Zhenglin Wan, Jingxuan Wu, Xingrui Yu, Chubin Zhang, Mingcong Lei, Bo An, Ivor Tsang

AI总结 提出利用流匹配(FM)模型作为教师,通过奖励建模和策略正则化,指导简单MLP结构的学生策略进行在线强化学习,从而解决FM策略在线交互不稳定和效率低的问题。

Comments We have submitted a new version of this paper to arxiv (with new framing and title), arXiv:2605.27095. To avoid the misunderstanding of the readers, we request to withdraw the old-version of this paper

详情
AI中文摘要

流匹配(Flow Matching, FM)在建模复杂分布方面表现出色,并在离线模仿学习中克隆专家行为取得了强劲性能。然而,尽管其行为克隆表达能力强大,基于FM的策略本质上受限于缺乏环境交互和探索,导致在专家演示之外的未见场景中泛化能力差,凸显了与环境在线交互的必要性。不幸的是,由于梯度计算不稳定和推理成本高,通过在线交互优化FM策略具有挑战性且效率低下。为解决这些问题,我们提出让一个具有简单MLP结构的学生策略探索环境,并通过带有奖励模型的RL算法进行在线更新。该奖励模型与教师FM模型相关联,包含专家数据分布的丰富信息。此外,利用相同的教师FM模型来正则化学生策略的行为以稳定策略学习。由于学生架构简单,我们避免了FM策略的梯度不稳定性,实现了高效的在线探索,同时仍然利用了教师FM模型的表达能力。大量实验表明,我们的方法显著提高了学习效率、泛化能力和鲁棒性,尤其是在从次优专家数据学习时。

英文摘要

Flow Matching (FM) has shown remarkable ability in modeling complex distributions and achieves strong performance in offline imitation learning for cloning expert behaviors. However, despite its behavioral cloning expressiveness, FM-based policies are inherently limited by their lack of environmental interaction and exploration. This leads to poor generalization in unseen scenarios beyond the expert demonstrations, underscoring the necessity of online interaction with environment. Unfortunately, optimizing FM policies via online interaction is challenging and inefficient due to instability in gradient computation and high inference costs. To address these issues, we propose to let a student policy with simple MLP structure explore the environment and be online updated via RL algorithm with a reward model. This reward model is associated with a teacher FM model, containing rich information of expert data distribution. Furthermore, the same teacher FM model is utilized to regularize the student policy's behavior to stabilize policy learning. Due to the student's simple architecture, we avoid the gradient instability of FM policies and enable efficient online exploration, while still leveraging the expressiveness of the teacher FM model. Extensive experiments show that our approach significantly enhances learning efficiency, generalization, and robustness, especially when learning from suboptimal expert data.

2510.09288 2026-06-02 stat.ML cs.LG 版本更新

A unifying Bayesian framework for adversarial robustness

对抗鲁棒性的统一贝叶斯框架

Pablo G. Arce, Roi Naveiro, David Ríos Insua

发表机构 * Universidad Autónoma de Madrid, Escuela de Doctorado(马德里自治大学博士学院) Institute of Mathematical Sciences, Spanish National Research Council(西班牙国家研究理事会数学研究所) CUNEF Universidad(CUNEF大学)

AI总结 提出一个统一的贝叶斯框架,通过随机信道建模对抗不确定性,衍生出对抗训练和对抗净化两种鲁棒化策略,并验证了显式建模对抗不确定性的优势。

详情
AI中文摘要

机器学习模型对对抗攻击的脆弱性仍然是一个关键的社会安全挑战。传统的防御方法,如对抗训练,通常通过最小化最坏情况损失来增强模型鲁棒性。这些确定性方法没有考虑对手攻击的不确定性。虽然存在将概率分布置于对手上的随机防御,但它们通常缺乏统计严谨性,并且未能明确其潜在假设。为了解决这些问题,我们引入了一个正式的贝叶斯框架,通过随机信道建模对抗不确定性,阐明所有概率假设。这产生了两种鲁棒化策略:一种是在训练期间实施的主动防御,与对抗训练一致;另一种是在操作期间实施的被动防御,与对抗净化一致。几种最先进的防御可以作为我们模型的极限情况恢复。我们通过实验验证了我们的方法,展示了显式建模对抗不确定性的好处。

英文摘要

The vulnerability of machine learning models to adversarial attacks remains a critical societal security challenge. Traditional defenses, such as adversarial training, typically robustify models by minimizing a worst-case loss. These deterministic approaches do not account for uncertainty in the adversary's attack. While stochastic defenses placing a probability distribution on the adversary exist, they often lack statistical rigor and fail to make explicit their underlying assumptions. To resolve these issues, we introduce a formal Bayesian framework that models adversarial uncertainty through a stochastic channel, articulating all probabilistic assumptions. This yields two robustification strategies: a proactive defense enacted during training, aligned with adversarial training, and a reactive defense enacted during operations, aligned with adversarial purification. Several state-of-the-art defenses can be recovered as limiting cases of our model. We empirically validate our methodology, showcasing the benefits of explicitly modeling adversarial uncertainty.

2510.09260 2026-06-02 cs.CR cs.LG 版本更新

GREAT: Generalizable Backdoor Attacks in RLHF via Emotion-Aware Trigger Synthesis

GREAT: 通过情感感知触发器在RLHF中实现可泛化的后门攻击

Subrat Kishore Dutta, Yuelin Xu, Piyush Pant, Xiao Zhang

发表机构 * CISPA Helmholtz Center for Information Security(CISPA赫尔姆霍茨信息安全中心)

AI总结 提出GREAT框架,利用情感感知触发器在RLHF中构造自然分布后门,通过潜在空间聚类和多样性提示策略生成愤怒触发器,实现对未见触发器的泛化攻击。

详情
AI中文摘要

近期研究表明,RLHF极易受到后门攻击。然而,现有方法通常依赖稀有令牌或固定触发器,限制了其在现实场景中的影响。在这项工作中,我们开发了GREAT,一个用于在RLHF中构造自然分布后门的新框架。具体而言,GREAT针对易受攻击的用户子群体,通过语义暴力的请求与情感愤怒的触发器配对,生成有害响应。我们框架的核心是一个在模型潜在嵌入空间中运行的触发器识别管道,利用降维和聚类技术来识别代表性触发器。为实现这一点,我们引入了一种层次化和多样性驱动的提示策略,构建了Erinyes,一个从GPT-4.1中策划的包含5000多个愤怒触发器的高质量数据集。实验表明,GREAT在攻击泛化到未见触发器方面显著优于基线,同时保持标准效用并在防御下保持隐蔽。

英文摘要

Recent work has shown that RLHF is highly susceptible to backdoor attacks. However, existing methods often rely on rare tokens or fixed triggers, limiting their impact in realistic scenarios. In this work, we develop GREAT, a novel framework for crafting natural distributional backdoors in RLHF. Specifically, GREAT targets harmful response generation for a vulnerable user subpopulation featured by semantically violent requests paired with emotionally angry triggers. At the core of our framework is a trigger identification pipeline that operates in the model's latent embedding space, leveraging dimensionality reduction and clustering techniques to identify representative triggers. To enable this, we introduce a hierarchical and diversity-driven prompting strategy to construct Erinyes, a high-quality dataset of over 5,000 angry triggers curated from GPT-4.1. Our experiments show that GREAT significantly outperforms baselines in attack generalization to unseen triggers, while preserving standard utility and maintaining stealth under defenses.

2508.17320 2026-06-02 cs.LG 版本更新

AdaptiveK: Complexity-Driven Sparse Autoencoders for Interpretable Language Model Representations

AdaptiveK:基于复杂度的稀疏自编码器用于可解释的语言模型表示

Yifei Yao, Hanrong Zhang, Mengnan Du

发表机构 * Zhejiang University(浙江大学) University of Illinois Chicago(伊利诺伊大学香槟分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出 AdaptiveK SAE,根据输入语义复杂度动态调整稀疏度,利用线性探针引导特征分配,在重构保真度、解释方差、余弦相似度和可解释性指标上优于固定稀疏度方法。

Comments Accepted by ACL 2026

详情
AI中文摘要

理解大型语言模型(LLMs)的内部表示仍然是可解释性研究的一个核心挑战。稀疏自编码器(SAEs)通过将激活分解为可解释的特征提供了一种有前景的解决方案,但现有方法依赖于固定的稀疏度约束,未能考虑输入的复杂度。我们提出了AdaptiveK SAE(自适应Top K稀疏自编码器),一种新颖的框架,根据每个输入的语义复杂度动态调整稀疏度。利用线性探针,我们证明了上下文复杂度在线性层面上编码在LLM表示中,并使用这一信号在训练过程中指导特征分配。在十个语言模型上的实验表明,这种基于复杂度的自适应方法在重构保真度、解释方差、余弦相似度和可解释性指标上优于固定稀疏度方法,同时消除了大量超参数调优的负担。我们的代码可在 https://github.com/hiyukie/adaptiveK 获取。

英文摘要

Understanding the internal representations of large language models (LLMs) remains a central challenge for interpretability research. Sparse autoencoders (SAEs) offer a promising solution by decomposing activations into interpretable features, but existing approaches rely on fixed sparsity constraints that fail to account for input complexity. We propose AdaptiveK SAE (Adaptive Top K Sparse Autoencoders), a novel framework that dynamically adjusts sparsity levels based on the semantic complexity of each input. Leveraging linear probes, we demonstrate that context complexity is linearly encoded in LLM representations, and we use this signal to guide feature allocation during training. Experiments across ten language models demonstrate that this complexity-driven adaptation outperforms fixed-sparsity approaches on reconstruction fidelity, explained variance, cosine similarity and interpretability metrics while eliminating the burden of extensive hyperparameter tuning. Our code is available at: https://github.com/hiyukie/adaptiveK.

2510.05566 2026-06-02 stat.ML cs.AI cs.CL cs.LG stat.AP 版本更新

Domain-Shift-Aware Conformal Prediction for Large Language Models

领域偏移感知的共形预测用于大型语言模型

Zhexiao Lin, Yuanyuan Li, Neeraj Sarna, Yuanyuan Gao, Michael von Gablenz

发表机构 * University of Waterloo(多伦多大学)

AI总结 提出领域偏移感知共形预测框架,通过重加权校准样本应对分布偏移,在MMLU基准上提升覆盖可靠性。

Comments Accepted to Forty-Third International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

大型语言模型在各种任务中取得了令人印象深刻的性能。然而,它们倾向于产生过度自信且事实不正确的输出,即所谓的幻觉,这在实际应用中带来了风险。共形预测提供了有限样本、无分布假设的覆盖保证,但标准共形预测在领域偏移下会失效,常常导致覆盖不足和不可靠的预测集。我们提出了一种称为领域偏移感知共形预测(DS-CP)的新框架。我们的框架通过根据校准样本与测试提示的接近程度系统地重新加权校准样本,将共形预测适应于领域偏移下的大型语言模型,从而在保持有效性的同时增强适应性。我们的理论分析和在MMLU基准上的实验表明,所提出的方法比标准共形预测提供了更可靠的覆盖,尤其是在显著分布偏移下,同时保持了效率。这为大型语言模型在实际部署中实现可信的不确定性量化迈出了实际的一步。

英文摘要

Large language models have achieved impressive performance across diverse tasks. However, their tendency to produce overconfident and factually incorrect outputs, known as hallucinations, poses risks in real-world applications. Conformal prediction provides finite-sample, distribution-free coverage guarantees, but standard conformal prediction breaks down under domain shift, often leading to under-coverage and unreliable prediction sets. We propose a new framework called Domain-Shift-Aware Conformal Prediction (DS-CP). Our framework adapts conformal prediction to large language models under domain shift, by systematically reweighting calibration samples based on their proximity to the test prompt, thereby preserving validity while enhancing adaptivity. Our theoretical analysis and experiments on the MMLU benchmark demonstrate that the proposed method delivers more reliable coverage than standard conformal prediction, especially under substantial distribution shifts, while maintaining efficiency. This provides a practical step toward trustworthy uncertainty quantification for large language models in real-world deployment.

2510.05342 2026-06-02 cs.LG cs.AI 版本更新

Margin Adaptive DPO: Leveraging Reward Model for Granular Control in Preference Optimization

Margin Adaptive DPO: 利用奖励模型实现偏好优化中的细粒度控制

Hyung Gyu Rho

发表机构 * Independent Researcher(独立研究者)

AI总结 提出Margin-Adaptive Direct Preference Optimization (MADPO)方法,通过奖励模型估计偏好边界并自适应调整DPO损失权重,实现实例级别的细粒度控制,在摘要任务上优于现有方法。

详情
AI中文摘要

直接偏好优化(DPO)已成为一种简单有效的大语言模型对齐方法。然而,其依赖固定温度参数导致在多样化偏好数据上训练次优,造成对简单样本过拟合而对信息丰富样本学习不足。近期出现了应对此问题的方法。虽然IPO解决了通用过拟合,但其均匀正则化可能过于保守。更针对性的β-DPO方法有其自身局限:其批次级自适应对混合边界对应用单一折中温度,线性更新规则可能产生不稳定的负β值,其过滤机制丢弃了潜在有用的训练信号。本文提出边界自适应直接偏好优化(MADPO),一种稳定、保留数据且实例级别的解决方案。MADPO采用实用的两步方法:首先训练奖励模型估计偏好边界,然后利用这些边界对每个训练样本的DPO损失施加连续自适应权重。这种重加权方案创建了一个有效目标边界,对困难对放大而对简单对抑制,从而实现对学习信号的细粒度控制。我们提供了全面的理论分析,证明MADPO具有良性的优化景观,且对奖励模型估计误差具有鲁棒性。我们通过使用人类偏好数据的摘要任务实验验证了理论。MADPO在全面的解码温度扫描中一致优于强基线。

英文摘要

Direct Preference Optimization (DPO) has emerged as a simple and effective method for aligning large language models. However, its reliance on a fixed temperature parameter leads to suboptimal training on diverse preference data, causing overfitting on easy examples and under-learning from informative ones. Recent methods have emerged to counter this. While IPO addresses general overfitting, its uniform regularization can be overly conservative. The more targeted approach of $β$-DPO suffers from its own limitations: its batch-level adaptation applies a single, compromised temperature to mixed-margin pairs, its linear update rule can produce unstable negative $β$ values, and its filtering mechanism discards potentially useful training signals. In this work, we introduce Margin-Adaptive Direct Preference Optimization (MADPO), a method that provides a stable, data-preserving, and instance-level solution. MADPO employs a practical two-step approach: it first trains a reward model to estimate preference margins and then uses these margins to apply a continuous, adaptive weight to the DPO loss for each individual training sample. This re-weighting scheme creates an effective target margin that is amplified for hard pairs and dampened for easy pairs, allowing for granular control over the learning signal. We provide a comprehensive theoretical analysis, proving that MADPO has a well-behaved optimization landscape and is robust to reward model estimation errors. We validate our theory with experiments on a summarization task using human preference data. MADPO consistently outperforms strong baselines across a comprehensive sweep of decoding temperatures.

2510.03745 2026-06-02 cs.LG cs.NA math.NA 版本更新

Neural Low-Discrepancy Sequences

神经低差异序列

Michael Etienne Van Huffel, Nathan Kirk, Makram Chahine, Daniela Rus, T. Konstantin Rusch

发表机构 * MIT(麻省理工学院) CMU(卡内基梅隆大学) Harvard University(哈佛大学)

AI总结 提出NeuroLDS,首个基于机器学习的有限低差异序列生成框架,通过监督预训练和无监督微调两步学习,在多个应用中显著优于传统方法。

Comments ICML 2026

详情
AI中文摘要

低差异点旨在以均匀方式高效填充空间。这种均匀性在科学和工程的许多问题中非常有利,包括数值积分、计算机视觉、机器感知、计算机图形学、机器学习和模拟。尽管大多数先前的低差异构造依赖于抽象代数和数论,但最近引入的消息传递蒙特卡洛(MPMC)利用机器学习方法生成点集,其差异低于以往可能。然而,MPMC仅限于生成点集,无法扩展到低差异序列(LDS),即每个前缀都具有低差异的点序列,这一性质对许多应用至关重要。为解决这一限制,我们引入了神经低差异序列(NeuroLDS),这是首个基于机器学习的有限LDS生成框架。受经典LDS启发,我们训练一个神经网络将索引映射到点,使得生成的序列在所有前缀上表现出最小差异。为此,我们采用两阶段学习过程:经典构造的监督近似,随后是无监督微调以最小化前缀差异。我们证明,NeuroLDS在差异度量方面显著优于所有先前的LDS构造。此外,我们展示了NeuroLDS在多种应用中的有效性,包括数值积分、机器人运动规划和科学机器学习。这些结果凸显了神经低差异序列的前景和广泛意义。我们的代码可在https://github.com/camail-official/neuro-lds找到。

英文摘要

Low-discrepancy points are designed to efficiently fill the space in a uniform manner. This uniformity is highly advantageous in many problems in science and engineering, including in numerical integration, computer vision, machine perception, computer graphics, machine learning, and simulation. Whereas most previous low-discrepancy constructions rely on abstract algebra and number theory, Message-Passing Monte Carlo (MPMC) was recently introduced to exploit machine learning methods for generating point sets with lower discrepancy than previously possible. However, MPMC is limited to generating point sets and cannot be extended to low-discrepancy sequences (LDS), i.e., sequences of points in which every prefix has low discrepancy, a property essential for many applications. To address this limitation, we introduce Neural Low-Discrepancy Sequences (NeuroLDS), the first machine learning-based framework for generating finite LDS. Drawing inspiration from classical LDS, we train a neural network to map indices to points such that the resulting sequences exhibit minimal discrepancy across all prefixes. To this end, we deploy a two-stage learning process: supervised approximation of classical constructions followed by unsupervised fine-tuning to minimize prefix discrepancies. We demonstrate that NeuroLDS outperforms all previous LDS constructions by a significant margin with respect to discrepancy measures. Moreover, we demonstrate the effectiveness of NeuroLDS across diverse applications, including numerical integration, robot motion planning, and scientific machine learning. These results highlight the promise and broad significance of Neural Low-Discrepancy Sequences. Our code can be found at https://github.com/camail-official/neuro-lds.

2505.18102 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

CapBencher: Give Your LLM Benchmark a Built-in Alarm for Test-Set Overfitting

CapBencher: 为您的LLM基准测试内置测试集过拟合警报

Takashi Ishida, Thanawat Lodkaew, Ikko Yamane

发表机构 * National Institute of Advanced Industrial Science and Technology, Japan(日本国家先进工业科学与技术研究院)

AI总结 提出CapBencher方法,通过向答案注入随机性(准备多个逻辑正确但仅一个作为解)来降低贝叶斯准确率,从而在公开基准测试时防止测试集过拟合并检测泄露或作弊。

Comments ICML 2026 camera ready version

详情
AI中文摘要

在互联网上发布大型语言模型(LLM)基准测试(尤其是其真实答案)存在污染未来LLM和导致评估作弊的风险:它可能被无意(或有意)用于训练或选择模型,或者在标签可访问时被利用来过拟合和操纵排行榜。常见的缓解措施是保持基准测试私有,并让参与者向组织者提交他们的模型或预测,但这仍然允许通过反馈循环进行测试集过拟合。为了克服这个问题,我们提出了CapBencher,一种在不完全公开真实答案的情况下发布基准测试的方法,同时保持LLM的开放评估。主要思想是通过准备多个逻辑正确的答案,并仅将其中一个作为基准测试中的解,向答案中注入随机性,从而降低最佳可能准确率,即贝叶斯准确率。这不仅掩盖了真实答案,还为泄露或作弊提供了测试:由于即使完全有能力的模型也不应超过贝叶斯准确率,任何超过该准确率的模型都是一个强烈的信号。我们从理论和实验上证明,CapBencher能够在不同的基准测试、模型、训练方法和场景中准确检测测试集过拟合。

英文摘要

Publishing a large language model (LLM) benchmark (especially its ground-truth answers) on the Internet risks contaminating future LLMs and enabling evaluation gaming: it may be unintentionally (or intentionally) used to train or select a model, or exploited to overfit and hack leaderboards when labels are accessible. A common mitigation is to keep the benchmark private and let participants submit their models or predictions to the organizers, but this still permits test-set overfitting through feedback loops. To overcome this issue, we propose CapBencher, a way to publish benchmarks without fully disclosing the ground-truth answers, while preserving open evaluation of LLMs. The main idea is to reduce the best possible accuracy, i.e., Bayes accuracy, by injecting randomness to the answers by preparing several logically correct answers, and only include one of them as the solution in the benchmark. Not only does this obscure the ground-truth answers, but it also offers a test for leakage or gaming: since even fully capable models should not surpass the Bayes accuracy, any model that does is a strong signal. We show theoretically and empirically that CapBencher accurately detects test-set overfitting across diverse benchmarks, models, training methodologies, and scenarios.

2510.03494 2026-06-02 cs.LG stat.ML 版本更新

Trajectory Data Suffices for Statistically Efficient Policy Evaluation in Fixed-Horizon Offline RL with Linear $q^π$-Realizability and Concentrability

轨迹数据足以在具有线性 $q^π$-可实现性和集中性的固定视界离线强化学习中进行统计有效的策略评估

Volodymyr Tkachuk, Csaba Szepesvári, Xiaoqi Tan

发表机构 * University of Alberta(阿尔伯塔大学)

AI总结 本文研究在轨迹数据假设下,利用线性 $q^π$-可实现性和集中性,实现固定视界离线强化学习中策略评估的统计有效学习,并改进了策略优化的样本复杂度分析。

详情
AI中文摘要

我们研究了具有函数近似的固定视界离线强化学习(RL),用于策略评估和策略优化。先前的工作表明,当唯一的假设是数据具有良好的覆盖性(集中性)且每个策略的状态-动作值函数是线性可实现的($q^π$-可实现性)时,对于这些问题中的任何一个,统计有效的学习都是不可能的(Foster et al., 2021)。最近,Tkachuk et al. (2024) 给出了一个用于策略优化的统计有效学习器,前提是数据被假定为以轨迹形式给出。在这项工作中,我们在相同的假设下提出了一个用于策略评估的统计有效学习器。此外,我们表明,通过更紧的分析,可以改进 Tkachuk et al. (2024) 用于策略优化的学习器的样本复杂度。

英文摘要

We study finite-horizon offline reinforcement learning (RL) with function approximation for both policy evaluation and policy optimization. Prior work established that statistically efficient learning is impossible for either of these problems when the only assumptions are that the data has good coverage (concentrability) and the state-action value function of every policy is linearly realizable ($q^π$-realizability) (Foster et al., 2021). Recently, Tkachuk et al. (2024) gave a statistically efficient learner for policy optimization, if in addition the data is assumed to be given as trajectories. In this work we present a statistically efficient learner for policy evaluation under the same assumptions. Further, we show that the sample complexity of the learner used by Tkachuk et al. (2024) for policy optimization can be improved by a tighter analysis.

2510.03259 2026-06-02 cs.LG cs.AI 版本更新

Verifying Meta-Awareness via Predictive Rewards in Reasoning Models

通过推理模型中的预测奖励验证元意识

Yoonjeon Kim, Doohyuk Jang, Eunho Yang

发表机构 * Yoonjeon Kim, Doohyuk Jang, Eunho Yang

AI总结 提出 MAPR 方法,利用自生成任务预测推理统计量(长度、通过率、概念)来增强模型的元意识,从而在多个数学推理基准上显著提升准确率和训练效率。

Comments accepted to ICML 2026

详情
AI中文摘要

近期关于推理模型的研究探索了语言模型的元意识,包括其确定最佳思考时长、识别知识边界以及结构化概念级思维的能力。虽然当前的大型推理模型仅依赖于基于答案的验证,但我们表明,添加元意识目标可以显著提升性能,超过缺乏此类元知识的模型。MAPR(通过预测奖励实现元意识)利用自生成任务来预测展开统计量——具体包括长度、通过率和所用概念——从而能够对照实际统计量进行验证。此外,通过利用这种自我预测能力,模型可以通过以下方式调节其推理行为:i) 过滤掉琐碎或无法解决的提示,ii) 减少倾向于错误的长篇生成,以及 iii) 生成与问题相关的提示。结果令人鼓舞:MAPR 在各种推理基准上显著提高了准确率和训练效率。更具体地说,我们的方法可以将 GRPO 训练加速超过 1.28 倍以达到相同的性能,在 AIME25 上实现 83.18% 的准确率提升,并在六个数学基准上平均提升 13.04%。代码公开于 https://github.com/akatigre/MAPR-RL。

英文摘要

Recent research on reasoning models explores the meta-awareness of language models, including their ability to determine optimal thinking duration, recognize knowledge boundaries, and structure concept-level thinking. While current large reasoning models depend solely on answer-based verification, we show that adding meta-awareness objectives leads to significant performance gains over models without such meta-knowledge. MAPR (Meta-Awareness via Predictive Reward) utilizes a self-generated task of predicting rollout statistics - specifically length, pass-rate, and concepts used - allowing for verification against the actual statistics. Furthermore, by leveraging this self-predictive capability, the model can regulate its reasoning behavior by i) filtering out trivial or unsolvable prompts, ii) reducing lengthy generations that tend to be incorrect, and iii) generating hints relevant to the problem. The results are inspiring: MAPR yields significant improvements in both accuracy and training efficiency on various reasoning benchmarks. More specifically, our method can speed up GRPO training by over 1.28x to reach the same performance, and achieve 83.18% gain in accuracy on AIME25, and a 13.04% average gain over six mathematics benchmarks. The code is publicly available at https://github.com/akatigre/MAPR-RL.

2510.03086 2026-06-02 cs.LG 版本更新

Chaining 2-FWL GNNs for Combinatorial Graph Alignment

链式2-FWL图神经网络用于组合图对齐

Marc Lelarge

发表机构 * INRIA - Ecole Normale Supérieure PSL Research University(法国国家科学研究中心-巴黎高等师范学院-巴黎理工研究大学)

AI总结 针对组合图对齐问题,提出链式2-FWL GNN方法,通过非可微排序注入离散反馈,在稀疏随机图和正则图上显著优于FAQ和现有GNN方法。

Comments code available at https://github.com/mlelarge/chaining-gnn-graph-alignment

详情
AI中文摘要

对于组合图对齐问题(GAP)——寻找最大化两个无标签图之间公共边数(nce)的节点对应关系——适当初始化的FAQ仍然是强大的经典基线,而现有的GNN方法在纯结构设置中表现不佳。我们引入了一种链式过程:一系列Folklore类型(2-FWL)的GNN,其中每个网络在解码前一个网络的相似性矩阵并根据当前对齐质量对节点进行排序后,使用交叉熵进行训练。这个不可微的排序步骤在每个链接处注入离散的组合反馈;在推理时,我们迭代最终网络并保留具有最高观测nce的候选。在噪声水平0.25的稀疏Erdos-Renyi图上,带有FAQ后处理的链式FGNN达到85%的准确率,而FAQ从凸松弛初始化仅为13%,先前的GNN方法基本为0%。在相关正则图上,其中具有恒定特征的MPNN产生相同的节点嵌入(1-WL无法细化)且FAQ的凸初始化退化,链式是我们知道的唯一能够恢复非平凡对齐的方法。在三个真实世界基准(酵母PPI、合著和道路网络)上,我们表明最近的比较通过从均匀双随机矩阵初始化FAQ低估了FAQ;一旦FAQ从凸松弛初始化,它已经超过了先前报告的数字,而数据集特定的链式FGNN进一步改进了这个加强的基线。

英文摘要

For the combinatorial graph alignment problem (GAP) -- finding the node correspondence that maximizes the number of common edges (nce) between two unlabeled graphs -- properly initialized FAQ remains a strong classical baseline, while existing GNN approaches struggle in the purely structural setting. We introduce a chaining procedure: a sequence of Folklore-type (2-FWL) GNNs in which each network is trained with cross-entropy after decoding the previous network's similarity matrix and ranking nodes by their current alignment quality. This non-differentiable ranking step injects discrete combinatorial feedback at every link; at inference, we iterate the final network and keep the candidate with highest observed nce. On sparse Erdos-Renyi graphs at noise level 0.25, chained FGNNs with FAQ post-processing reach 85% accuracy versus 13% for FAQ initialized from the convex relaxation, and essentially 0% for prior GNN methods. On correlated regular graphs, where MPNNs with constant features produce identical node embeddings (1-WL fails to refine) and FAQ's convex initialization is degenerate, chaining is the only method we know that recovers a non-trivial alignment. On three real-world benchmarks (yeast PPI, coauthorship, and road networks), we show that recent comparisons underestimate FAQ by initializing it from a uniform doubly stochastic matrix; once FAQ is initialized from the convex relaxation it already surpasses prior reported numbers, and dataset-specific chained FGNNs further improve on this strengthened baseline.

2510.02528 2026-06-02 cs.AI cs.LG 版本更新

Multimodal Function Vectors for Visual Relations

视觉关系的多模态函数向量

Shuhao Fu, Esther Goldberg, Ying Nian Wu, Hongjing Lu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 通过因果中介分析提取多模态函数向量,操纵注意力头以改善视觉关系推理,并实现零样本和微调性能提升。

详情
AI中文摘要

大型多模态模型(LMMs)从少量多模态演示中展现出令人印象深刻的上下文学习能力,然而支持这种任务学习的内部机制仍不透明。基于大型语言模型的先前工作,我们表明大型多模态模型中一小部分注意力头负责传递视觉关系的表示。这些注意力头的激活,称为函数向量,可以被提取和操纵以改变LMM在关系任务上的性能。首先,使用合成和真实图像数据集,我们应用因果中介分析来识别强烈影响关系预测的注意力头,并提取多模态函数向量,以提高推理时的零样本准确率。我们进一步证明,这些多模态函数向量可以在保持LMM参数冻结的情况下,用适量的训练数据进行微调,从而显著优于上下文学习基线。最后,我们展示了特定关系的函数向量可以线性组合,以解决涉及新颖和未经训练的视觉关系的类比问题,突显了该方法的强大泛化能力。通过在两个LMM(包括OpenFlamingo和Qwen3-VL)上的实验,我们的结果表明这些模型在局部内部结构中编码了视觉关系知识,这些知识可以被系统地提取和优化,从而增进了我们对模型模块化的理解,并增强了对LMM中关系推理的控制。

英文摘要

Large Multimodal Models (LMMs) demonstrate impressive in-context learning abilities from few multimodal demonstrations, yet the internal mechanisms supporting such task learning remain opaque. Building on prior work of Large Language Models, we show that a small subset of attention heads in Large Multimodal Models is responsible for transmitting representations of visual relations. The activations of these attention heads, termed function vectors, can be extracted and manipulated to alter an LMM's performance on relational tasks. First, using synthetic and real image datasets, we apply causal mediation analysis to identify attention heads that strongly influence relational predictions, and extract multimodal function vectors that improve zero-shot accuracy at inference time. We further demonstrate that these multimodal function vectors can be fine-tuned with a modest amount of training data, while keeping LMM parameters frozen, to significantly outperform in-context learning baselines. Finally, we show that relation-specific function vectors can be linearly combined to solve analogy problems involving novel and untrained visual relations, highlighting the strong generalization ability of this approach. Through experiments on two LMMs, including OpenFlamingo and Qwen3-VL, our results show that these models encode visual relational knowledge within localized internal structures, which can be systematically extracted and optimized, thereby advancing our understanding of model modularity and enhancing control over relational reasoning in LMMs.

2507.09029 2026-06-02 cs.LG cs.AI 版本更新

Model Parallelism With Subnetwork Data Parallelism

模型并行与子网络数据并行

Vaibhav Singh, Zafir Khalid, Pietro Cagnasso, Edouard Oyallon, Eugene Belilovsky

发表机构 * Mila Concordia University(康科迪亚大学) ISIR-Sorbonne University, CNRS(索邦大学-ISIR与CNRS)

AI总结 提出子网络数据并行(SDP)框架,通过将模型划分为结构化子网络并在工作节点间独立训练,无需交换激活值,在保持或提升性能的同时显著降低内存占用。

Comments 9 pages, 5 figures

详情
AI中文摘要

大规模预训练神经网络对加速器内存需求巨大,且通常需要昂贵的通信。我们提出子网络数据并行(SDP),一种分布式训练框架,将模型划分为结构化子网络,在工作节点间独立训练而不交换激活值。我们研究了两种互补的掩码机制:后向掩码,仅在反向步骤中应用稀疏性以保留无偏梯度;前向掩码,在前向传播中也移除参数以带来更强的效率提升,同时提供额外的正则化。我们进一步探索了两种子网络构建策略:神经元级别和块级别,分别应用于Transformer和CNN。在从FineWeb上的1B LLaMA预训练到CIFAR上的ResNet-18的实验中,SDP在FLOP匹配设置下将每设备内存使用量减少28%-60%,同时保持或提升性能。

英文摘要

Pre-training large neural networks at scale imposes heavy memory demands on accelerators and often requires costly communication. We introduce Subnetwork Data Parallelism (SDP), a distributed training framework that partitions a model into structured subnetworks trained across workers without exchanging activations. We study two complementary masking regimes: backward masking, which applies sparsity only in the backward step to retain unbiased gradients, and forward masking, which also removes parameters in the forward pass to deliver stronger efficiency gains while providing additional regularization. We further explore two subnetwork construction strategies: neuron level and block level, applied across both transformers and CNNs. In experiments spanning 1B LLaMA pre-training on FineWeb to ResNet-18 on CIFAR, SDP reduces per device memory usage by 28%-60% while maintaining or improving performance under FLOP-matched settings.

2510.01167 2026-06-02 cs.LG cs.AI cs.CL 版本更新

Simultaneous Multi-objective Alignment Across Verifiable and Non-verifiable Rewards

同时多目标对齐:可验证与不可验证奖励

Yiran Shen, Yu Xia, Jonathan Chang, Prithviraj Ammanabrolu

AI总结 提出MAHALO框架,通过标准化PRM训练、多动作头DPO和PRM引导解码,实现大语言模型在可验证与不可验证奖励上的多目标对齐,减少目标冲突并支持推理时控制。

Comments ICML 2026

详情
AI中文摘要

将大语言模型与人类偏好对齐本质上是多维的,但大多数流水线将异质信号压缩为单一目标。我们试图回答如何同时在多个领域中对齐模型,这些领域包括:可验证奖励、不可验证主观偏好以及复杂交互场景。这种多目标对齐设置常常因各个目标相互冲突而困扰,导致训练效率低下和推理时用户控制有限。为了解决这些问题,我们提出了$ extbf{MAHALO}$(Multi-Action-Head Alignment with PRM-guided Decoding),这是一个统一的框架,它在可验证和不可验证设置下标准化PRM训练以进行步骤级监督,通过多动作头DPO执行向量化多目标对齐,并通过目标特定权重和PRM引导解码实现可控推理。在数学推理、人类价值观对齐和多轮辅导上的实验表明,MAHALO能够以有限的干扰同时联合改善多个目标,同时保持跨领域的泛化性和适应性,并在推理时提供灵活的用户控制。我们的代码可在 https://github.com/pearls-lab/multiobj-align 获取。

英文摘要

Aligning large language models to human preferences is inherently multidimensional, yet most pipelines collapse heterogeneous signals into a single objective. We seek to answer what it would take to simultaneously align a model across various domains spanning those with: verifiable rewards, non-verifiable subjective preferences, and complex interactive scenarios. Such multi-objective alignment setups are often plagued by individual objectives being at odds with each other, resulting in inefficient training and limited user control during inference. To address these issues, we propose $\textbf{M}$ulti-$\textbf{A}$ction-$\textbf{H}$ead $\textbf{AL}$ignment with PRM-guided Dec$\textbf{O}$ding ($\textbf{MAHALO}$), a unified framework that standardizes PRM training across verifiable and non-verifiable settings for step-level supervision, performs vectorized multi-objective alignment with Multi-Action-Head DPO, and enables controllable inference through objective-specific weighting and PRM-guided decoding. Experiments across math reasoning, human values alignment, and multi-turn tutoring show that MAHALO jointly improves multiple objectives simultaneously with limited interference, while remaining generalizable and adaptable across domains and offering flexible user control at inference time. Our code is available at: https://github.com/pearls-lab/multiobj-align.

2510.00053 2026-06-02 eess.IV cs.CV cs.LG 版本更新

DPsurv: Dual-Prototype Evidential Fusion for Uncertainty-Aware and Interpretable Whole-Slide Image Survival Prediction

DPsurv: 双原型证据融合用于不确定性感知和可解释的全切片图像生存预测

Yucheng Xing, Ling Huang, Jingying Ma, Ruping Hong, Jiangdong Qiu, Pei Liu, Kai He, Huazhu Fu, Mengling Feng

发表机构 * National University of Singapore National University of Singapore Guangzhou Research Translation Innovation Institute Imperial College London Peking Union Medical College Hospital, Chinese Academy of Medical Sciences \& Peking Union Medical College Hunan University Institute of High Performance Computing, Agency for Science, Technology Research (A STAR)

AI总结 提出DPsurv双原型证据融合网络,通过不确定性感知的生存区间预测和基于补丁原型分配图、组件原型及组件级相对风险聚合的可解释性,在五个公开数据集上取得最佳一致性指数和积分Brier分数。

详情
AI中文摘要

病理全切片图像(WSIs)因其在细胞和组织水平上全面的组织病理学信息而被广泛用于癌症生存分析,能够进行定量、大规模且预后丰富的肿瘤特征分析。然而,现有大多数WSI生存分析方法可解释性有限,且常常忽略异质性切片图像中的预测不确定性。本文提出DPsurv,一种双原型全切片图像证据融合网络,输出不确定性感知的生存区间,同时通过补丁原型分配图、组件原型和组件级相对风险聚合实现预测的解释。在五个公开数据集上的实验取得了最高的平均一致性指数和最低的平均积分Brier分数,验证了DPsurv的有效性和可靠性。预测结果的解释在特征、推理和决策层面提供了透明度,从而增强了DPsurv的可信度和可解释性。

英文摘要

Pathology whole-slide images (WSIs) are widely used for cancer survival analysis because of their comprehensive histopathological information at both cellular and tissue levels, enabling quantitative, large-scale, and prognostically rich tumor feature analysis. However, most existing methods in WSI survival analysis struggle with limited interpretability and often overlook predictive uncertainty in heterogeneous slide images. In this paper, we propose DPsurv, a dual-prototype whole-slide image evidential fusion network that outputs uncertainty-aware survival intervals, while enabling interpretation of predictions through patch prototype assignment maps, component prototypes, and component-wise relative risk aggregation. Experiments on five publicly available datasets achieve the highest mean concordance index and the lowest mean integrated Brier score, validating the effectiveness and reliability of DPsurv. The interpretation of prediction results provides transparency at the feature, reasoning, and decision levels, thereby enhancing the trustworthiness and interpretability of DPsurv.

2505.17630 2026-06-02 cs.CL cs.LG 版本更新

Correcting Gradient-Based Circuit Localization via Interaction-Aware Backpropagation

通过交互感知反向传播修正基于梯度的电路定位

Joakim Edin, Casper L. Christensen, Róbert Csordás, Tuukka Ruotsalo, Zhengxuan Wu, Maria Maistro, Jing Huang, Lars Maaløe

发表机构 * Corti Stanford University(斯坦福大学) Copenhagen University(哥本哈根大学) LUT University(拉普兰大学)

AI总结 针对现有电路定位方法忽略组件交互导致重要性误估的问题,提出GIM技术,通过在反向传播中显式建模特征交互,在机械可解释性基准的电路定位任务上达到最优性能。

详情
AI中文摘要

电路定位方法旨在识别大型语言模型中负责特定行为的模型组件子集,从而实现详细的机制分析。现有方法大多假设组件独立运作,并通过孤立地扰动每个组件来估计重要性。然而,神经网络中的组件之间存在交互,忽略这些交互会导致对组件重要性的系统性误估。我们发现一个特别有问题的交互是注意力自修复,其中softmax重新分配导致有影响力的注意力分数的梯度随着其他具有相似值的位置的补偿而消失。我们引入了梯度交互修改(GIM),这是一种在反向传播过程中显式考虑特征交互的技术。GIM在机械可解释性基准的电路定位任务上达到了最先进的性能,并在多种任务的特征归因上优于现有的基于梯度的方法。通过考虑交互效应并解释为何先前方法低估了组件重要性,GIM使得对大型语言模型进行更忠实的机制分析成为可能。GIM作为Python包可在https://github.com/corticph/gim获取。

英文摘要

Circuit localization methods aim to identify the subset of model components responsible for specific behaviors in large language models, enabling detailed mechanistic analysis. Most existing methods assume components act independently and estimate importance by perturbing each component in isolation. However, components in neural networks interact, and ignoring these interactions leads to systematic misestimation of component importance. We find that one particularly problematic interaction is attention self-repair, in which softmax redistribution causes gradients for influential attention scores to vanish as other positions with similar values compensate. We introduce Gradient Interaction Modifications (GIM), a technique that explicitly accounts for feature interactions during backpropagation. GIM achieves state-of-the-art performance on the circuit localization track of the Mechanistic Interpretability Benchmark and outperforms existing gradient-based methods on feature attribution across diverse tasks. By accounting for interaction effects and explaining why prior methods underestimate component importance, GIM enables more faithful mechanistic analysis of large language models. GIM is available as a Python package at https://github.com/corticph/gim.

2304.11127 2026-06-02 cs.LG cs.AI 版本更新

Tree-Structured Parzen Estimator: Understanding Its Algorithm Components and Their Roles for Better Empirical Performance

树结构帕森估计器:理解其算法组件及对提升实证性能的作用

Shuhei Watanabe

发表机构 * Preferred Networks Inc.(Preferred Networks公司) University of Freiburg(弗赖堡大学)

AI总结 本文通过消融实验分析树结构帕森估计器(TPE)各控制参数的作用,提出改进性能的推荐设置。

详情
AI中文摘要

近期的科学进展需要复杂的实验设计,这要求对许多实验参数进行细致调整。树结构帕森估计器(TPE)是Hyperopt和Optuna等最新参数调优框架中广泛使用的贝叶斯优化方法。尽管其流行,但TPE中各控制参数的作用及算法直觉至今尚未被讨论。本文旨在基于使用多样化基准数据集的消融研究,识别每个控制参数的作用及其对参数调优的影响。从消融研究中得出的推荐设置被证明能提升TPE的性能。本文使用的TPE实现可在https://github.com/nabenabe0928/tpe/tree/single-opt 获取。OptunaHub现在在https://hub.optuna.org/samplers/tpe_tutorial/ 提供我们独立的TPE实现。

英文摘要

Recent scientific advances require complex experiment design, necessitating the meticulous tuning of many experiment parameters. Tree-structured Parzen estimator (TPE) is a widely used Bayesian optimization method in recent parameter tuning frameworks such as Hyperopt and Optuna. Despite its popularity, the roles of each control parameter in TPE and the algorithm intuition have not been discussed so far. The goal of this paper is to identify the roles of each control parameter and their impacts on parameter tuning based on the ablation studies using diverse benchmark datasets. The recommended setting concluded from the ablation studies is demonstrated to improve the performance of TPE. Our TPE implementation used in this paper is available at https://github.com/nabenabe0928/tpe/tree/single-opt. OptunaHub now provides our standalone TPE implementation at https://hub.optuna.org/samplers/tpe_tutorial/.

2509.24696 2026-06-02 cs.LG cs.AI 版本更新

T-POP: Test-Time Personalization with Online Preference Feedback

T-POP:基于在线偏好反馈的测试时个性化

Zikun Qu, Min Zhang, Mingze Kong, Xiang Li, Zhiwei Shang, Zhiyong Wang, Yikun Ban, Shuang Qiu, Yao Shu, Zhongxiang Dai

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) East China Normal University(华东师范大学) Shenzhen Loop Area Institute(深圳环城研究所) Tianjin University(天津大学) The Chinese University of Hong Kong(香港中文大学) Beihang University(北京航空航天大学) City University of Hong Kong(香港城市大学) The Hong Kong University of Science and Technology (Guangzhou)(香港理工大学(广州))

AI总结 针对新用户冷启动问题,提出T-POP算法,通过在线成对偏好反馈和决斗式强盗机制,在不更新模型参数的情况下实时学习用户偏好并引导解码过程,实现快速数据高效的个性化。

Comments Accepted to ICML 2026

详情
AI中文摘要

将大型语言模型(LLM)个性化以适应个体用户偏好,是超越生成通用有用响应的关键步骤。然而,当前的个性化方法不适合新用户,因为它们通常需要缓慢、资源密集的微调或大量预先存在的用户数据,造成了显著的冷启动问题。为了应对这一挑战,我们引入了一种新的实时个性化范式,通过从文本生成过程中收集的在线成对偏好反馈进行学习。我们提出了T-POP(基于在线偏好反馈的测试时个性化),这是一种新颖的算法,将测试时对齐与决斗式强盗协同结合。在不更新LLM参数的情况下,T-POP通过在线学习一个捕捉用户偏好的奖励函数来引导冻结LLM的解码过程。通过利用决斗式强盗,T-POP智能地查询用户,以有效平衡探索其偏好和利用所学知识生成个性化文本。大量实验表明,T-POP实现了快速且数据高效的个性化,显著优于现有基线,并且随着用户交互的增加而持续改进。

英文摘要

Personalizing large language models (LLMs) to individual user preferences is a critical step beyond generating generically helpful responses. However, current personalization methods are ill-suited for new users, as they typically require either slow, resource-intensive fine-tuning or a substantial amount of pre-existing user data, creating a significant cold-start problem. To address this challenge, we introduce a new paradigm for real-time personalization by learning from online pairwise preference feedback collected during text generation. We propose T-POP (Test-Time Personalization with Online Preference Feedback}), a novel algorithm that synergistically combines test-time alignment with dueling bandits. Without updating the LLM parameters, T-POP steers the decoding process of a frozen LLM by learning a reward function online that captures user preferences. By leveraging dueling bandits, T-POP intelligently queries the user to efficiently balance between exploring their preferences and exploiting the learned knowledge to generate personalized text. Extensive experiments demonstrate that T-POP achieves rapid and data-efficient personalization, significantly outperforming existing baselines and showing consistent improvement with more user interactions.

2506.22271 2026-06-02 cs.AI cs.LG 版本更新

On the Theoretical Limitations of Embedding-based Link Prediction

基于嵌入的链接预测的理论局限性

Samy Badreddine, Emile van Krieken, Luciano Serafini

发表机构 * Vrije Universiteit Amsterdam, Netherlands(荷兰阿姆斯特丹自由大学) University of Trento, Italy(意大利特伦托大学)

AI总结 研究线性输出层导致的秩瓶颈对知识图谱嵌入模型表达能力的限制,并提出混合非线性输出层以提升大规模密集图上的性能。

详情
AI中文摘要

神经网络通常将低维嵌入映射到高维输出空间。通常,输出层是线性的,这会产生一个“秩瓶颈”,限制模型所能表示的函数。这种瓶颈在链接预测模型中普遍存在,例如知识图谱嵌入(KGE),因为实体的输出空间可能比嵌入维度大几个数量级。我们研究了秩瓶颈如何限制模型拟合训练数据的表达能力。以往工作关注特定KGE所需嵌入维度的充分上界,而我们给出了所有具有线性输出层的KGE的必要下界,该下界随图的大小和连通性增长。我们还考虑了一种使用混合的非线性输出层,以在不显著增加参数开销的情况下打破瓶颈。实验表明,使用这种非线性层的模型在大型密集数据集上,以较低的参数成本提升了排序性能和概率拟合,正如我们的理论所预测。我们的工作揭示了线性输出层如何限制KGE,并激励使用非线性替代方案以扩展到大型密集图。

英文摘要

Neural networks often map low-dimensional embeddings to high-dimensional output spaces. Usually, the output layer is linear, which can create a "rank bottleneck" that limits the functions a model can represent. Such bottlenecks are ubiquitous in link prediction models, such as knowledge graph embeddings (KGEs), as the output space of entities can be orders of magnitude larger than the embedding dimension. We investigate how rank bottlenecks limit model expressivity for fitting the training data. While previous work focused on sufficient bounds on the embedding dimension required for specific KGEs, we show necessary bounds for all KGEs with a linear output layer, which grow with graph size and connectivity. We also consider a non-linear output layer using mixtures to break the bottleneck without significant parameter overhead. Empirically, we show that models using this non-linear layer improve in ranking performance and probabilistic fit for large and dense datasets at a low parameter cost, as predicted by our theory. Our work reveals how linear output layers limit KGEs and motivates non-linear alternatives for scaling to large and dense graphs.

2504.06006 2026-06-02 cs.LG cs.AI cs.NE 版本更新

Optuna vs Code Llama: Are LLMs a New Paradigm for Hyperparameter Tuning?

Optuna vs Code Llama:LLM 是超参数调优的新范式吗?

Roman Kochnev, Arash Torabi Goodarzi, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,乌尔姆大学,德国)

AI总结 通过微调参数高效的 Code Llama 模型,提出基于大语言模型的超参数优化方法,在多种视觉架构上实现与 Optuna 相当或更优的 RMSE 并大幅降低计算开销。

详情
Journal ref
Proceedings of the IEEE/CVF International Conference on Computer Vision Workshops (ICCVW), pp. 5664-5674, 2025
AI中文摘要

最优超参数选择对于最大化计算机视觉中神经网络的性能至关重要,尤其是当架构变得日益复杂时。本文通过使用 LoRA 微调参数高效的 Code Llama 版本,探索了大语言模型在超参数优化中的应用。所得模型在广泛的视觉架构上产生了准确且计算高效的超参数推荐。与依赖资源密集型的试错过程的传统方法(如 Optuna)不同,我们的方法在实现竞争性或更优的均方根误差的同时,大幅降低了计算开销。重要的是,所评估的模型涵盖了以图像为中心的任务,如分类、检测和分割,这些是许多图像处理流程(包括增强、恢复和风格迁移)的基本组成部分。我们的结果表明,基于 LLM 的优化不仅与成熟的贝叶斯方法(如树结构 Parzen 估计器)相媲美,而且加速了需要感知质量和低延迟处理的实际应用的调优。所有生成的配置均公开在 LEMUR 神经网络数据集(https://github.com/ABrain-One/nn-dataset)中,该数据集作为超参数优化研究的开源基准,并为提高图像处理系统中的训练效率提供了实用资源。

英文摘要

Optimal hyperparameter selection is critical for maximizing the performance of neural networks in computer vision, particularly as architectures become more complex. This work explores the use of large language models (LLMs) for hyperparameter optimization by fine-tuning a parameter-efficient version of Code Llama using LoRA. The resulting model produces accurate and computationally efficient hyperparameter recommendations across a wide range of vision architectures. Unlike traditional methods such as Optuna, which rely on resource-intensive trial-and-error procedures, our approach achieves competitive or superior Root Mean Square Error (RMSE) while substantially reducing computational overhead. Importantly, the models evaluated span image-centric tasks such as classification, detection, and segmentation, fundamental components in many image manipulation pipelines including enhancement, restoration, and style transfer. Our results demonstrate that LLM-based optimization not only rivals established Bayesian methods like Tree-structured Parzen Estimators (TPE), but also accelerates tuning for real-world applications requiring perceptual quality and low-latency processing. All generated configurations are publicly available in the LEMUR Neural Network Dataset (https://github.com/ABrain-One/nn-dataset), which serves as an open source benchmark for hyperparameter optimization research and provides a practical resource to improve training efficiency in image manipulation systems.

2509.23544 2026-06-02 stat.ML cs.AI cs.LG stat.ME 版本更新

End-to-End Deep Learning for Predicting Metric Space-Valued Outputs

端到端深度学习预测度量空间值输出

Yidong Zhou, Su I Iao, Hans-Georg Müller

AI总结 提出E2M框架,通过加权Fréchet均值和神经网络学习权重,实现度量空间值输出的几何感知预测,具有理论保证并在多种结构化输出上取得最优性能。

Comments 38 pages, 4 figures, 9 tables

详情
Journal ref
Journal of Machine Learning Research, 27:1--38, 2026
AI中文摘要

许多现代应用涉及预测结构化、非欧几里得输出,例如概率分布、网络和对称正定矩阵。这些输出自然地被建模为一般度量空间的元素,而依赖于向量空间结构的经典回归技术不再适用。我们引入了E2M(端到端度量回归),这是一个用于预测度量空间值输出的深度学习框架。E2M通过训练输出的加权Fréchet均值进行预测,其中权重由基于输入条件的神经网络学习。这种构造提供了一种原则性的几何感知预测机制,避免了替代嵌入和限制性参数假设,同时完全保留了输出空间的内在几何结构。我们建立了理论保证,包括刻画模型表达能力的通用逼近定理以及熵正则化训练目标的收敛性分析。通过涉及概率分布、网络和对称正定矩阵的大量模拟,我们展示了E2M始终达到最先进的性能,且其优势在更大样本量下更加明显。应用于人类死亡率分布和纽约市出租车网络进一步证明了该框架的灵活性和实用性。

英文摘要

Many modern applications involve predicting structured, non-Euclidean outputs such as probability distributions, networks, and symmetric positive-definite matrices. These outputs are naturally modeled as elements of general metric spaces, where classical regression techniques that rely on vector space structure no longer apply. We introduce E2M (End-to-End Metric regression), a deep learning framework for predicting metric space-valued outputs. E2M performs prediction via weighted Fréchet means over training outputs, where the weights are learned by a neural network conditioned on the input. This construction provides a principled mechanism for geometry-aware prediction that avoids surrogate embeddings and restrictive parametric assumptions, while fully preserving the intrinsic geometry of the output space. We establish theoretical guarantees, including a universal approximation theorem that characterizes the expressive capacity of the model and a convergence analysis of the entropy-regularized training objective. Through extensive simulations involving probability distributions, networks, and symmetric positive-definite matrices, we show that E2M consistently achieves state-of-the-art performance, with its advantages becoming more pronounced at larger sample sizes. Applications to human mortality distributions and New York City taxi networks further demonstrate the flexibility and practical utility of this framework.

2509.22907 2026-06-02 cs.LG 版本更新

FedCF: Fair Federated Conformal Prediction

FedCF: 公平联邦共形预测

Anutam Srinivasan, Aditya T. Vadlamani, Amin Meghrazi, Srinivasan Parthasarathy

发表机构 * Georgia Institute of Technology(佐治亚理工学院) The Ohio State University(俄亥俄州立大学)

AI总结 提出FedCF框架,将共形公平性扩展到联邦学习,通过分析不同人口统计组的公平性差距来审计模型公平性,并在多个数据集上验证。

Comments Preprint

详情
AI中文摘要

共形预测(CP)是一种广泛用于量化机器学习模型不确定性的技术。在其标准形式中,CP提供关于真实标签覆盖的概率保证,但对数据集中的敏感属性不可知。最近的一些工作通过确保不同子组的条件覆盖保证来将公平性纳入CP。其中一种方法是共形公平性(CF)。在这项工作中,我们将CF框架扩展到联邦学习设置,并讨论如何通过分析不同人口统计组的公平性相关差距来审计联邦模型的公平性。我们通过在多个领域的数据集上进行实验,充分利用可交换性假设,实证验证了我们的框架。

英文摘要

Conformal Prediction (CP) is a widely used technique for quantifying uncertainty in machine learning models. In its standard form, CP offers probabilistic guarantees on the coverage of the true label, but it is agnostic to sensitive attributes in the dataset. Several recent works have sought to incorporate fairness into CP by ensuring conditional coverage guarantees across different subgroups. One such method is Conformal Fairness (CF). In this work, we extend the CF framework to the Federated Learning setting and discuss how we can audit a federated model for fairness by analyzing the fairness-related gaps for different demographic groups. We empirically validate our framework by conducting experiments on several datasets spanning multiple domains, fully leveraging the exchangeability assumption.

2508.06588 2026-06-02 cs.LG cs.AI 版本更新

Graph is a Natural Regularization: Revisiting Vector Quantization for Graph Representation Learning

图是一种自然正则化:重新审视向量量化在图表示学习中的应用

Zian Zhai, Fan Li, Xingyu Tan, Xiaoyang Wang, Wenjie Zhang

发表机构 * School of Computer Science and Engineering, University of New South Wales, Sydney, Australia(新南威尔士大学计算机科学与工程学院,悉尼,澳大利亚)

AI总结 针对图向量量化中码本崩溃问题,提出RGVQ框架,通过图拓扑和特征相似性正则化及Gumbel-Softmax软分配,提升码本利用率和令牌多样性。

Comments ICML2026

详情
AI中文摘要

向量量化(VQ)最近成为一种学习图结构数据压缩和离散表示的有前途的方法。然而,一个基本挑战,即码本崩溃,在图领域仍未得到充分探索,严重限制了图令牌的表达能力和泛化能力。在本文中,我们进行了一项实证研究,观察到在图形重建任务中,即使采用了视觉或语言领域提出的缓解策略,当与图神经网络联合训练VQ时,码本崩溃始终发生。此外,我们从数据和优化角度提供了崩溃的诊断,表明崩溃与图数据属性(如特征冗余和连接密度)相关,并进一步由确定性硬分配的训练动态强化。为了解决这些问题,我们提出了RGVQ,一种新颖的框架,它集成图拓扑和特征相似性作为显式正则化信号,以增强码本利用并促进令牌多样性。RGVQ通过Gumbel-Softmax重参数化引入软分配,确保所有码字接收梯度更新。此外,RGVQ包含结构感知对比正则化,以惩罚将相同令牌分配给不相似的节点对。大量实验表明,RGVQ显著提高了码本利用率,并在多个下游任务中持续提升了最先进的图VQ骨干网络的性能,实现了更具表达性和可迁移性的图令牌表示。

英文摘要

Vector Quantization (VQ) has recently emerged as a promising approach for learning compressed and discrete representations for graph-structured data. However, a fundamental challenge, i.e., codebook collapse, remains underexplored in the graph domain, significantly limiting the expressiveness and generalization of graph tokens.In this paper, we present an empirical study and observe that codebook collapse consistently occurs when training VQ jointly with Graph Neural Networks under graph reconstruction tasks, even with mitigation strategies proposed in vision or language domains. Moreover, we provide a diagnosis of collapse from data and optimization perspectives, showing that collapse is associated with graph data properties such as feature redundancy and connectivity density, and is further reinforced by the training dynamics of deterministic hard assignment. To address these issues, we propose RGVQ, a novel framework that integrates graph topology and feature similarity as explicit regularization signals to enhance codebook utilization and promote token diversity. RGVQ introduces soft assignments via Gumbel-Softmax reparameterization, ensuring that all codewords receive gradient updates. In addition, RGVQ incorporates a structure-aware contrastive regularization to penalize assigning the same token to dissimilar node pairs. Extensive experiments demonstrate that RGVQ substantially improves codebook utilization and consistently boosts the performance of state-of-the-art graph VQ backbones across multiple downstream tasks, enabling more expressive and transferable graph token representations.

2504.10552 2026-06-02 cs.LG cs.AI cs.CV cs.DL 版本更新

LEMUR Neural Network Dataset: Towards Seamless AutoML

LEMUR 神经网络数据集:迈向无缝 AutoML

Arash Torabi Goodarzi, Roman Kochnev, Waleed Khalid, Hojjat Torabi Goudarzi, Furui Qin, Tolgay Atinc Uzun, Yashkumar Sanjaybhai Dhameliya, Yash Kanubhai Kathiriya, Zofia Antonina Bentyn, Dmitry Ignatov, Radu Timofte

发表机构 * Computer Vision Lab, CAIDAS, University of Würzburg(计算机视觉实验室,CAIDAS,乌尔姆大学)

AI总结 提出 LEMUR 开源数据集与框架,通过统一模板、结构化存储和自动化超参数优化,标准化神经网络实现与评估,以加速 AutoML 研究并促进公平基准测试。

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp. 3291-3300, 2026
AI中文摘要

神经网络是现代人工智能的支柱,但设计、评估和比较它们仍然劳动密集。尽管存在许多用于训练的数据集,但模型本身的标准化集合很少。我们介绍 LEMUR,一个开源数据集和框架,它提供了大量基于 PyTorch 的神经网络集合,涵盖分类、分割、检测和自然语言处理等任务。每个模型遵循统一模板,配置和结果存储在结构化数据库中,以确保一致性和可重复性。LEMUR 通过 Optuna 集成自动超参数优化,包括统计分析和可视化工具,并提供 API 以无缝访问性能数据。该框架是可扩展的,允许研究人员添加新模型、数据集或指标而不破坏兼容性。通过标准化实现和统一评估,LEMUR 旨在加速 AutoML 研究,实现公平基准测试,并降低大规模神经网络实验的障碍。为支持采用和协作,LEMUR 及其插件在 MIT 许可下发布,网址为:https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

英文摘要

Neural networks are the backbone of modern artificial intelligence, but designing, evaluating, and comparing them remains labor-intensive. While numerous datasets exist for training, there are few standardized collections of the models themselves. We introduce LEMUR, an open-source dataset and framework that provides a large collection of PyTorch-based neural networks across tasks such as classification, segmentation, detection, and natural language processing. Each model follows a unified template, with configurations and results stored in a structured database to ensure consistency and reproducibility. LEMUR integrates automated hyperparameter optimization via Optuna, includes statistical analysis and visualization tools, and offers an API for seamless access to performance data. The framework is extensible, allowing researchers to add new models, datasets, or metrics without breaking compatibility. By standardizing implementations and unifying evaluation, LEMUR aims to accelerate AutoML research, enable fair benchmarking, and reduce barriers to large-scale neural network experimentation. To support adoption and collaboration, LEMUR and its plugins are released under the MIT license at: https://github.com/ABrain-One/nn-dataset https://github.com/ABrain-One/nn-plots https://github.com/ABrain-One/nn-vr

2509.18025 2026-06-02 math.OC cs.AI cs.LG math.LO stat.ML 版本更新

Deep Learning as the Disciplined Construction of Tame Objects

深度学习作为驯服对象的有纪律构造

Gilles Bareilles, Allen Gehret, Johannes Aspman, Jana Lepšová, Jakub Mareček

发表机构 * Czech Technical University in Prague, Artificial Intelligence Center(布拉格捷克技术大学人工智能中心)

AI总结 本文通过驯服几何(o-极小性)框架,介绍深度学习模型作为函数组合的数学基础,并展示其在非光滑非凸但驯服设置下为随机梯度下降提供收敛保证的应用。

Comments 39 pages, 10 figures

详情
AI中文摘要

人们可以将深度学习模型视为所谓驯服几何中函数的组合。在这篇说明性笔记中,我们概述了驯服几何(也称为o-极小性)、优化理论以及深度学习理论与实践之间的一些主题。为此,我们逐步介绍在一般非光滑非凸但驯服的设置中,为随机梯度下降建立收敛保证所使用的概念和工具。这说明了驯服几何作为研究AI系统(尤其是深度学习)的自然数学框架的一些方式。

英文摘要

One can see deep-learning models as compositions of functions within the so-called tame geometry. In this expository note, we give an overview of some topics at the interface of tame geometry (also known as o-minimality), optimization theory, and deep learning theory and practice. To do so, we gradually introduce the concepts and tools used to build convergence guarantees for stochastic gradient descent in a general nonsmooth nonconvex, but tame, setting. This illustrates some ways in which tame geometry is a natural mathematical framework for the study of AI systems, especially within Deep Learning.

2505.18614 2026-06-02 cs.CL cs.LG cs.MM cs.SD eess.AS 版本更新

MAVL: A Multilingual Audio-Video Lyrics Dataset for Animated Song Translation

MAVL:面向动画歌曲翻译的多语言音视频歌词数据集

Woohyun Cho, Youngmin Kim, Sunghyun Lee, Youngjae Yu

发表机构 * Yonsei University(延世大学) Seoul National University(首尔国立大学)

AI总结 提出首个多语言多模态歌词翻译基准MAVL,并设计音节约束的音视频大语言模型SylAVL-CoT,利用音视频线索和音节约束提升歌词可唱性和翻译准确性。

Comments Accepted to EMNLP 2025, Project Page: https://k1064190.github.io/papers/paper1.html, our codes and datasets are available at https://github.com/k1064190/MAVL

详情
AI中文摘要

歌词翻译需要同时实现准确的语义传递以及保留音乐节奏、音节结构和诗歌风格。在动画音乐剧中,由于需要与视觉和听觉线索对齐,挑战更加严峻。我们引入了多语言音视频歌词翻译基准(MAVL),这是首个用于可唱歌词翻译的多语言、多模态基准。通过整合文本、音频和视频,MAVL能够比纯文本方法实现更丰富、更具表现力的翻译。在此基础上,我们提出了音节约束的音视频大语言模型SylAVL-CoT,该模型利用音视频线索并施加音节约束,以生成自然流畅的歌词。实验结果表明,SylAVL-CoT在可唱性和上下文准确性方面显著优于基于文本的模型,强调了多模态、多语言方法在歌词翻译中的价值。

英文摘要

Lyrics translation requires both accurate semantic transfer and preservation of musical rhythm, syllabic structure, and poetic style. In animated musicals, the challenge intensifies due to alignment with visual and auditory cues. We introduce Multilingual Audio-Video Lyrics Benchmark for Animated Song Translation (MAVL), the first multilingual, multimodal benchmark for singable lyrics translation. By integrating text, audio, and video, MAVL enables richer and more expressive translations than text-only approaches. Building on this, we propose Syllable-Constrained Audio-Video LLM with Chain-of-Thought SylAVL-CoT, which leverages audio-video cues and enforces syllabic constraints to produce natural-sounding lyrics. Experimental results demonstrate that SylAVL-CoT significantly outperforms text-based models in singability and contextual accuracy, emphasizing the value of multimodal, multilingual approaches for lyrics translation.

2509.00326 2026-06-02 cs.LG 版本更新

Chunked TabPFN: Exact Training-Free In-Context Learning for Long-Context Tabular Data

分块TabPFN:面向长上下文表格数据的精确免训练上下文学习

Renat Sergazinov, Shao-An Yin

发表机构 * Department of Statistics, Texas A\&M University, College Station, TX(德克萨斯A&M大学统计学系) Department of Electrical and Computer Engineering, University of Minnesota, Twin City, MN(明尼苏达大学电气与计算机工程系)

AI总结 提出分块注意力策略,使TabPFN无需预处理即可处理超过10K上下文令牌的长表格数据,并在TabArena基准上验证有效性。

Comments 14 pages, 6 figures

详情
Journal ref
ICLR Blogposts 2026
AI中文摘要

TabPFN v2在多个表格基准测试上取得了比基于树的模型更好的结果,这值得注意,因为基于树的模型通常是表格数据的最强选择。然而,由于Transformer具有二次计算和内存成本,它无法处理超过10K的上下文令牌。与现有依赖于上下文压缩的方法(例如通过K近邻选择代表性样本)不同,我们引入了一种分块策略来计算TabPFN框架内的注意力。该设计与标准GPU设置兼容,并且据我们所知,是首个使TabPFN无需任何预处理即可处理长上下文的方法。我们在标准TabArena基准上展示了我们方法的有效性,代码可在https://github.com/mrsergazinov/chunk_tabpfn获取。

英文摘要

TabPFN v2 achieves better results than tree-based models on several tabular benchmarks, which is notable since tree-based models are usually the strongest choice for tabular data. However, it cannot handle more than 10K context tokens because transformers have quadratic computation and memory costs. Unlike existing approaches that rely on context compression, such as selecting representative samples via K-nearest neighbors (KNN), we introduce a tiled-block strategy to compute attention within the TabPFN framework. This design is compatible with standard GPU setups and, to the best of our knowledge, is the first to enable TabPFN to process long contexts without any pre-processing. We demonstrate the effectiveness of our approach on the standard TabArena benchmark, with code available at https://github.com/mrsergazinov/chunk_tabpfn.

2509.11056 2026-06-02 eess.SY cs.LG cs.SY 版本更新

BERT4beam: Large AI Model Enabled Generalized Beamforming Optimization

BERT4beam: 大型AI模型实现通用波束赋形优化

Yuhang Li, Yang Lu, Wei Chen, Bo Ai, Zhiguo Ding

发表机构 * State Key Laboratory of Advanced Rail Autonomous Operation(先进轨道交通自主运行国家重点实验室) School of Computer Science and Technology(计算机科学与技术学院) School of Electronics and Information Engineering(电子与信息工程学院) School of Electrical and Electronic Engineering (EEE)(电子与电气工程学院)

AI总结 本文提出基于BERT的框架BERT4beam,将波束赋形优化转化为token级序列学习任务,通过预训练和微调实现单任务与多任务优化,在不同用户规模、系统效用和天线配置下均能接近最优性能。

详情
AI中文摘要

人工智能(AI)有望成为未来第六代(6G)无线通信系统的关键推动力。然而,当前关于无线通信大型AI模型的研究主要集中在针对特定任务微调预训练的大型语言模型(LLM)。本文研究了专为波束赋形优化设计的大规模AI模型,以适应并泛化到由系统效用和规模定义的不同任务。我们提出了一种基于Transformer双向编码器表示(BERT)的新框架,称为BERT4beam。我们旨在将波束赋形优化问题表述为token级序列学习任务,对信道状态信息进行token化,构建BERT模型,并执行任务特定的预训练和微调策略。基于该框架,我们分别提出了两种基于BERT的方法用于单任务和多任务波束赋形优化。两种方法均可泛化到不同用户规模。此外,前者通过重新配置BERT模型的输入和输出模块,能够适应不同的系统效用和天线配置;而后者(称为UBERT)由于采用更细粒度的token化策略,可以直接泛化到多种任务。大量仿真结果表明,这两种方法能够实现接近最优的性能,并在各种波束赋形优化任务中优于现有AI模型,展现出强大的适应性和泛化能力。

英文摘要

Artificial intelligence (AI) is anticipated to emerge as a pivotal enabler for the forthcoming sixth-generation (6G) wireless communication systems. However, current research efforts regarding large AI models for wireless communications primarily focus on fine-tuning pre-trained large language models (LLMs) for specific tasks. This paper investigates the large-scale AI model designed for beamforming optimization to adapt and generalize to diverse tasks defined by system utilities and scales. We propose a novel framework based on bidirectional encoder representations from transformers (BERT), termed BERT4beam. We aim to formulate the beamforming optimization problem as a token-level sequence learning task, perform tokenization of the channel state information, construct the BERT model, and conduct task-specific pre-training and fine-tuning strategies. Based on the framework, we propose two BERT-based approaches for single-task and multi-task beamforming optimization, respectively. Both approaches are generalizable for varying user scales. Moreover, the former can adapt to varying system utilities and antenna configurations by re-configuring the input and output module of the BERT model, while the latter, termed UBERT, can directly generalize to diverse tasks, due to a finer-grained tokenization strategy. Extensive simulation results demonstrate that the two proposed approaches can achieve near-optimal performance and outperform existing AI models across various beamforming optimization tasks, showcasing strong adaptability and generalizability.

2509.10491 2026-06-02 eess.SP cs.LG 版本更新

FlowECG: Using Flow Matching to Create a More Efficient ECG Signal Generator

FlowECG:利用流匹配创建更高效的ECG信号生成器

Vitalii Bondar, Serhii Semenov, Vira Babenko, Dmytro Holovniak

发表机构 * Cherkasy State Technological University(切爾卡西州立科技大學) University of the National Education Commission(國家教育委員會大學) State Scientific Research Institute of Armament and Military Equipment Testing and Certification(軍事裝備測試和認證州科學研究 institutes)

AI总结 提出FlowECG方法,采用流匹配替代扩散过程,通过连续流动力学学习噪声到数据分布的直连路径,在PTB-XL数据集上以更少的采样步数(10-25次)达到与扩散方法(200次)相当的生成质量,计算需求降低一个数量级。

Comments 8 pages, 2 figures, 1 table, reviewed version will be published in "Sensors, Devices and Systems 2025 Proceedings" (Springer's Lecture Notes in Electrical Engineering)

详情
AI中文摘要

合成心电图生成为需要隐私保护数据共享和训练数据集增强的医学AI应用提供服务。当前基于扩散的方法实现了高生成质量,但在采样过程中需要数百次神经网络评估,给临床部署造成了计算瓶颈。我们提出了FlowECG,一种流匹配方法,通过用连续流动力学替代迭代扩散过程来适配SSSD-ECG架构。流匹配通过常微分方程求解学习从噪声到数据分布的直连传输路径。我们使用动态时间规整、Wasserstein距离、最大均值差异和频谱相似性指标在PTB-XL数据集上评估了我们的方法。FlowECG在200次神经函数评估时匹配了SSSD-ECG的性能,并在三个指标上优于基线。关键发现表明,FlowECG以大幅减少的采样步数保持生成质量,与扩散方法需要200次评估相比,仅需10-25次评估即可获得可比结果。这种效率提升将计算需求降低了一个数量级,同时保留了生理上真实的12导联ECG特征。该方法使得在需要实时生成或大规模合成数据创建的资源受限临床环境中实现实际部署成为可能。

英文摘要

Synthetic electrocardiogram generation serves medical AI applications requiring privacy-preserving data sharing and training dataset augmentation. Current diffusion-based methods achieve high generation quality but require hundreds of neural network evaluations during sampling, creating computational bottlenecks for clinical deployment. We propose FlowECG, a flow matching approach that adapts the SSSD-ECG architecture by replacing the iterative diffusion process with continuous flow dynamics. Flow matching learns direct transport paths from noise to data distributions through ordinary differential equation solving. We evaluate our method on the PTB-XL dataset using Dynamic Time Warping, Wasserstein distance, Maximum Mean Discrepancy, and spectral similarity metrics. FlowECG matches SSSD-ECG performance at 200 neural function evaluations, outperforming the baseline on three metrics. The key finding shows that FlowECG maintains generation quality with substantially fewer sampling steps, achieving comparable results with 10-25 evaluations compared to 200 for diffusion methods. This efficiency improvement reduces computational requirements by an order of magnitude while preserving physiologically realistic 12-lead ECG characteristics. The approach enables practical deployment in resource-limited clinical settings where real-time generation or large-scale synthetic data creation is needed.

2509.04631 2026-06-02 cs.LG cs.IT math.IT stat.ML 版本更新

Fundamental bounds on efficiency-confidence trade-off for transductive conformal prediction

转导共形预测的效率-置信度权衡的基本界限

Arash Behboodi, Alvaro H. C. Correia, Fabio Valerio Massoli, Christos Louizos

发表机构 * Qualcomm Technologies, Inc.(高通技术公司)

AI总结 本文证明了转导共形预测中置信度与效率(预测集大小)之间存在严格有限样本界,指出非平凡置信度会导致预测集大小随数据固有不确定性呈指数增长,并提出了接近该界限的实用算法。

详情
AI中文摘要

转导共形预测处理多个数据点的同时预测。给定期望的置信水平,目标是构建一个预测集,以规定的置信度包含真实结果。我们证明了转导方法中置信度与效率之间的基本权衡,其中效率通过预测集的大小来衡量。具体来说,我们推导了一个严格的有限样本界,表明对于具有固有不确定性的数据,任何非平凡的置信水平都会导致预测集大小的指数增长。指数与样本数量线性相关,并与数据的条件熵成正比。此外,该界限包含一个二阶项——分散度,定义为对数条件概率分布的方差。我们表明,基于近似条件分布的转导方法可以接近这个界限。受此启发,我们引入了一种实用的转导预测算法,该算法优于Bonferroni方法。

英文摘要

Transductive conformal prediction addresses the simultaneous prediction for multiple data points. Given a desired confidence level, the objective is to construct a prediction set that includes the true outcomes with the prescribed confidence. We demonstrate a fundamental trade-off between confidence and efficiency in transductive methods, where efficiency is measured by the size of the prediction sets. Specifically, we derive a strict finite-sample bound showing that any non-trivial confidence level leads to exponential growth in prediction set size for data with inherent uncertainty. The exponent scales linearly with the number of samples and is proportional to the conditional entropy of the data. Additionally, the bound includes a second-order term, dispersion, defined as the variance of the log conditional probability distribution. We show that the transductive methods based on the approximate conditional distribution can approach this bound. Inspired by this setup, we introduce a practical transductive prediction algorithm that surpasses Bonferroni methods.

2509.03456 2026-06-02 stat.ML cs.LG 版本更新

Off-Policy Learning in Large Action Spaces: Optimization Matters More Than Estimation

大动作空间中的离线策略学习:优化比估计更重要

Imad Aouali, Otmane Sakhi

发表机构 * Criteo AI Lab(Criteo AI实验室)

AI总结 本文研究离线上下文强盗中的离线策略学习,发现现有方法在大动作空间中面临严重优化问题,提出使用加权对数似然目标可改善优化并取得竞争性策略。

Comments ICML '26

详情
AI中文摘要

离线策略评估(OPE)和离线策略学习(OPL)是离线上下文强盗中决策制定的基础。最近OPL的进展主要优化具有改进统计特性的OPE估计器,假设更好的估计器自然产生更优的策略。尽管有理论依据,但这种以估计器为中心的方法忽略了一个关键的实际障碍:具有挑战性的优化景观。在本文中,我们提供理论见解和实证证据,表明当前的OPL方法遇到严重的优化问题,特别是随着动作空间的增长。我们表明,估计器感知的策略参数化可以缓解但不能完全解决优化挑战。在此基础上,我们探索更简单的加权对数似然目标,并证明它们具有显著更好的优化特性,并且仍然能够恢复具有竞争力、通常更优的学习策略。我们的发现强调了在开发针对大动作空间的OPL算法时,明确考虑优化问题的必要性。

英文摘要

Off-policy evaluation (OPE) and off-policy learning (OPL) are foundational for decision-making in offline contextual bandits. Recent advances in OPL primarily optimize OPE estimators with improved statistical properties, assuming that better estimators inherently yield superior policies. Although theoretically justified, this estimator-centric approach neglects a critical practical obstacle: challenging optimization landscapes. In this paper, we provide theoretical insights and empirical evidence showing that current OPL methods encounter severe optimization issues, particularly as the action space grows. We show that estimator-aware policy parametrization can mitigate, but not fully resolve, optimization challenges. Building on this, we explore simpler weighted log-likelihood objectives and demonstrate that they enjoy substantially better optimization properties and still recover competitive, often superior, learned policies. Our findings emphasize the necessity of explicitly addressing optimization considerations in the development of OPL algorithms for large action spaces.

2506.13554 2026-06-02 cs.LG cs.NA math.FA math.NA 版本更新

Non-Asymptotic Stability and Consistency Guarantees for Physics-Informed Neural Networks via Coercive Operator Analysis

基于强制算子分析的物理信息神经网络的非渐近稳定性与一致性保证

Ronald Katende

发表机构 * Department of Mathematics, Kabale University(卡贝大学数学系) Department of Mathematics, Makerere University(Makerere大学数学系)

AI总结 通过强制算子、变分公式和非渐近扰动理论,建立物理信息神经网络(PINN)的稳定性和一致性的统一理论框架,证明残差最小化在Sobolev范数下导致能量和一致范数收敛,并给出确定性稳定性界和概率样本复杂度保证。

详情
Journal ref
109909 (2026)
AI中文摘要

我们提出了一个统一的理论框架,用于分析物理信息神经网络(PINN)的稳定性和一致性,该框架基于算子强制性、变分公式和非渐近扰动理论。PINN通过在采样配置点和边界点上最小化残差损失来逼近偏微分方程(PDE)的解。我们形式化了算子级和变分的一致性概念,证明在Sobolev范数下的残差最小化在温和正则性下导致能量范数和一致范数的收敛。确定性稳定性界量化了网络输出的有界扰动如何通过整个复合损失传播,而通过McDiarmid不等式的概率集中结果则为基于残差的泛化提供了样本复杂度保证。一个统一的泛化界将残差一致性、投影误差和扰动敏感性联系起来。在椭圆型、抛物型和非线性PDE上的实证结果证实了我们的理论界在不同情况下的预测准确性。该框架识别了关键结构原则,如算子强制性、激活平滑性和采样可容许性,这些原则支撑了鲁棒且可泛化的PINN训练,为PDE学习系统的设计和分析提供了原则性指导。

英文摘要

We present a unified theoretical framework for analyzing the stability and consistency of Physics-Informed Neural Networks (PINNs), grounded in operator coercivity, variational formulations, and non-asymptotic perturbation theory. PINNs approximate solutions to partial differential equations (PDEs) by minimizing residual losses over sampled collocation and boundary points. We formalize both operator-level and variational notions of consistency, proving that residual minimization in Sobolev norms leads to convergence in energy and uniform norms under mild regularity. Deterministic stability bounds quantify how bounded perturbations to the network outputs propagate through the full composite loss, while probabilistic concentration results via McDiarmid's inequality yield sample complexity guarantees for residual-based generalization. A unified generalization bound links residual consistency, projection error, and perturbation sensitivity. Empirical results on elliptic, parabolic, and nonlinear PDEs confirm the predictive accuracy of our theoretical bounds across regimes. The framework identifies key structural principles, such as operator coercivity, activation smoothness, and sampling admissibility, that underlie robust and generalizable PINN training, offering principled guidance for the design and analysis of PDE-informed learning systems.

2508.12551 2026-06-02 cs.LG cs.AI cs.OS cs.SE 版本更新

TuneAgent: Agentic Operating System Kernel Tuning with Reinforcement Learning

TuneAgent: 基于强化学习的智能操作系统内核调优

Hongyu Lin, Yuchen Li, Haoran Luo, Zhenghong Lin, Libo Zhang, Mingjie Xing, Yanjun Wu

发表机构 * Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学) Nanyang Technological University(南洋理工大学)

AI总结 提出TuneAgent框架,利用基于规则的强化学习使大语言模型自主探索Linux内核空间,通过结构化奖励函数和两阶段训练策略解决稀疏反馈问题,实现高达5.6%的性能提升。

详情
AI中文摘要

Linux内核调优对于优化操作系统性能至关重要,但由于复杂的内核空间、稀疏的性能反馈和强烈的工作负载敏感性,仍然具有挑战性。我们提出了TuneAgent,一个基于规则强化学习的智能Linux内核调优框架。TuneAgent将内核空间构建为约束强化学习环境,使大语言模型能够自主探索内核,同时强制执行有效且精确的配置修改。为了解决稀疏性能反馈问题,我们设计了结构化奖励函数,共同促进推理标准化、配置正确性和性能感知。此外,我们提出了一种两阶段训练策略,首先确保格式和语义正确性,然后过渡到性能驱动的探索,从而加速收敛并降低开销。实验结果表明,TuneAgent始终优于现有基线,在保持高配置有效性的同时,实现了高达5.6%的相对整体性能提升。我们进一步展示了其在多个实际应用中的鲁棒性,突显了其在多样化部署环境中的实用性和适应性。

英文摘要

Linux kernel tuning is essential for optimizing operating system (OS) performance, yet remains challenging due to the complex kernel space, sparse performance feedback, and strong workload sensitivity. We present TuneAgent, an agentic Linux kernel tuning framework powered by rule-based reinforcement learning (RL). TuneAgent formulates the kernel space as a constrained RL environment, enabling large language models (LLMs) to autonomously explore the kernel while enforcing valid and precise configuration modifications. To address sparse performance feedback, we design structured reward functions that jointly promote reasoning standardization, configuration correctness, and performance awareness. Furthermore, we propose a two-phase training strategy that first ensures format and semantic correctness and then transitions to performance-driven exploration, accelerating convergence and reducing overhead. Experimental results show that TuneAgent consistently outperforms existing baselines, achieving up to 5.6% relative overall performance improvement while maintaining high configuration validity. We further demonstrate its robustness across multiple real-world applications, highlighting its practicality and adaptability in diverse deployment environments.

2501.10342 2026-06-02 cs.LG 版本更新

Hybrid Deep Learning Model for epileptic seizure classification by using 1D-CNN with multi-head attention mechanism

基于一维CNN与多头注意力机制的混合深度学习模型用于癫痫发作分类

Mohammed Guhdar, Ramadhan J. Mstafa, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho(扎赫大学科学学院计算机科学系)

AI总结 提出一种结合一维卷积神经网络和多头注意力机制的混合深度学习模型,用于从脑电图信号中分类癫痫发作,以提高分类准确性。

详情
Journal ref
Biomedical Signal Processing and Control. Volume 112, pages 108495, 2026
AI中文摘要

癫痫是一种全球常见的神经系统疾病,影响约5000万人。癫痫发作源于大脑突然的异常电活动,这可以表现为脑电图信号的突然显著变化。信号的严重程度和频率各不相同,导致短暂意识丧失和肌肉收缩。癫痫患者常因某些工作环境的安全问题面临显著的就业挑战。涉及高空作业、操作重型机械或其他潜在危险环境的工作可能对癫痫患者受限,这无疑限制了他们的就业机会和经济可能性。

英文摘要

Epilepsy is a prevalent neurological disorder globally, impacting around 50 million people \cite{WHO_epilepsy_50million}. Epileptic seizures result from sudden abnormal electrical activity in the brain, which can be read as sudden and significant changes in the EEG signal of the brain. The signal can vary in severity and frequency, which results in loss of consciousness and muscle contractions for a short period of time \cite{epilepsyfoundation_myoclonic}. Individuals with epilepsy often face significant employment challenges due to safety concerns in certain work environments. Many jobs that involve working at heights, operating heavy machinery, or in other potentially hazardous settings may be restricted for people with seizure disorders. This certainly limits job options and economic opportunities for those living with epilepsy.

2503.22939 2026-06-02 cs.LG q-bio.QM 版本更新

Interpretable Graph Kolmogorov-Arnold Networks for Multi-Cancer Classification and Biomarker Identification using Multi-Omics Data

可解释图Kolmogorov-Arnold网络用于基于多组学数据的多癌症分类和生物标志物识别

Fadi Alharbi, Nishant Budhiraja, Aleksandar Vakanski, Boyu Zhang, Murtada K. Elbashir, Harshith Guduru, Mohanad Mohammed

发表机构 * University of Idaho, Department of Computer Science(爱达荷大学计算机科学系) Jouf University, College of Computer and Information Sciences, Department of Computer Science(朱夫大学计算机与信息科学学院计算机科学系) University of Gezira, Faculty of Mathematical and Computer Sciences(杰兹拉大学数学与计算机科学学院) Bentonville High School(伯恩斯维尔高中) University of KwaZulu-Natal, School of Mathematics, Statistics and Computer Science(夸祖鲁-纳塔尔大学数学、统计与计算机科学学院)

AI总结 提出MOGKAN框架,结合图神经网络与Kolmogorov-Arnold定理,利用多组学数据和PPI网络实现31种癌症的高精度分类与可解释生物标志物识别。

详情
Journal ref
Sci. Rep. 16, ARTICLE NUMBER (2026)
AI中文摘要

在系统层面整合异质多组学数据集仍然是精准癌症诊断中开发分析和计算模型的核心挑战。本文介绍了多组学图Kolmogorov-Arnold网络(MOGKAN),这是一个深度学习框架,利用信使RNA、微RNA序列和DNA甲基化样本以及蛋白质-蛋白质相互作用(PPI)网络,对31种不同癌症类型进行分类。所提出的方法结合了DESeq2、微阵列线性模型(LIMMA)和最小绝对收缩与选择算子(LASSO)回归的差异基因表达,以降低多组学数据维度同时保留相关生物学特征。模型架构基于Kolmogorov-Arnold定理原理,使用可训练的单变量函数增强可解释性和特征分析。MOGKAN实现了96.28%的分类准确率,并且与相关基于深度学习的模型相比,表现出较低的实验变异性。通过基因本体论(GO)和京都基因与基因组百科全书(KEGG)富集分析,MOGKAN识别的生物标志物被验证为癌症相关标志物。通过将多组学数据与基于图的深度学习相结合,我们提出的方法展示了稳健的预测性能和可解释性,具有将复杂多组学数据转化为临床可操作癌症诊断的潜力。

英文摘要

The integration of heterogeneous multi-omics datasets at a systems level remains a central challenge for developing analytical and computational models in precision cancer diagnostics. This paper introduces Multi-Omics Graph Kolmogorov-Arnold Network (MOGKAN), a deep learning framework that utilizes messenger-RNA, micro-RNA sequences, and DNA methylation samples together with Protein-Protein Interaction (PPI) networks for cancer classification across 31 different cancer types. The proposed approach combines differential gene expression with DESeq2, Linear Models for Microarray (LIMMA), and Least Absolute Shrinkage and Selection Operator (LASSO) regression to reduce multi-omics data dimensionality while preserving relevant biological features. The model architecture is based on the Kolmogorov-Arnold theorem principle and uses trainable univariate functions to enhance interpretability and feature analysis. MOGKAN achieves classification accuracy of 96.28 percent and exhibits low experimental variability in comparison to related deep learning-based models. The biomarkers identified by MOGKAN were validated as cancer-related markers through Gene Ontology (GO) and Kyoto Encyclopedia of Genes and Genomes (KEGG) enrichment analysis. By integrating multi-omics data with graph-based deep learning, our proposed approach demonstrates robust predictive performance and interpretability with potential to enhance the translation of complex multi-omics data into clinically actionable cancer diagnostics.

2507.19702 2026-06-02 cs.SI cs.AI cs.LG 版本更新

A Lightweight Deep Learning-based Model for Ranking Influential Nodes in Complex Networks

基于轻量级深度学习的复杂网络中有影响力节点排序模型

Mohammed A. Ramadhan, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho(扎赫大学科学学院计算机科学系) Department of Computer Science and Information Technology, The American University of Kurdistan(库尔德斯坦美国大学计算机科学与信息技术系)

AI总结 提出一种结合一维卷积神经网络和GraphSAGE的轻量级混合模型1D-CGS,利用节点度和平均邻居度特征,通过回归任务高效排序有影响力节点,在12个真实网络上平均Kendall Tau提升4.73%,Jaccard相似度提升7.67%,单调性指数达0.99,运行速度显著快于现有深度学习方法。

详情
AI中文摘要

识别复杂网络中的有影响力节点是一项关键任务,在不同领域有广泛应用。然而,现有方法常在准确性和计算效率之间权衡。为解决这些挑战,我们提出1D-CGS,一种轻量级且有效的混合模型,它结合了一维卷积神经网络(1D-CNN)的速度和GraphSAGE的拓扑表示能力,用于高效节点排序。该模型使用基于两个简单且重要的拓扑特征(节点度和平均邻居度)构建的轻量级输入表示。这些特征通过一维卷积提取局部模式,然后通过GraphSAGE层聚合邻域信息。我们将节点排序任务表述为回归问题,并使用易感-感染-恢复(SIR)模型生成真实影响力分数。1D-CGS首先在Barabasi-Albert模型生成的合成网络上训练,然后应用于真实世界网络以识别有影响力节点。在12个真实网络上的实验评估表明,1D-CGS在排序准确性上显著优于传统中心性度量和最近的深度学习模型,同时运行速度非常快。与表现最佳的深度学习基线相比,所提模型在Kendall Tau相关性上平均提升4.73%,在Jaccard相似度上平均提升7.67%。它还实现了平均单调性指数(MI)分数0.99,并产生近乎完美的排名分布,表明高度独特和可区分的排名。此外,所有实验证实1D-CGS在高度合理的时间内运行,比现有深度学习方法快得多,使其适用于大规模应用。

英文摘要

Identifying influential nodes in complex networks is a critical task with a wide range of applications across different domains. However, existing approaches often face trade-offs between accuracy and computational efficiency. To address these challenges, we propose 1D-CGS, a lightweight and effective hybrid model that integrates the speed of one-dimensional convolutional neural networks (1D-CNN) with the topological representation power of GraphSAGE for efficient node ranking. The model uses a lightweight input representation built on two straightforward and significant topological features: node degree and average neighbor degree. These features are processed through 1D convolutions to extract local patterns, followed by GraphSAGE layers to aggregate neighborhood information. We formulate the node ranking task as a regression problem and use the Susceptible-Infected-Recovered (SIR) model to generate ground truth influence scores. 1D-CGS is initially trained on synthetic networks generated by the Barabasi-Albert model and then applied to real world networks for identifying influential nodes. Experimental evaluations on twelve real world networks demonstrate that 1D-CGS significantly outperforms traditional centrality measures and recent deep learning models in ranking accuracy, while operating in very fast runtime. The proposed model achieves an average improvement of 4.73% in Kendall's Tau correlation and 7.67% in Jaccard Similarity over the best performing deep learning baselines. It also achieves an average Monotonicity Index (MI) score 0.99 and produces near perfect rank distributions, indicating highly unique and discriminative rankings. Furthermore, all experiments confirm that 1D-CGS operates in a highly reasonable time, running significantly faster than existing deep learning methods, making it suitable for large scale applications.

2503.05641 2026-06-02 cs.CL cs.AI cs.LG 版本更新

Skill-Based Mixture-of-Experts: Adaptive Routing for Heterogeneous Reasoning via Inferred Skills

基于技能的混合专家模型:通过推断技能实现异构推理的自适应路由

Justin Chih-Yao Chen, Sukwon Yun, Elias Stengel-Eskin, Tianlong Chen, Mohit Bansal

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出Skill-MoE框架,通过推断查询所需技能进行实例级专家选择,并采用批推理策略降低开销,在单GPU上集成16个专家模型,在多个推理基准上平均提升8.15%。

Comments ICML 2026 (Camera-Ready). The first three authors contributed equally. Project Page: https://skill-moe.github.io/

详情
AI中文摘要

结合现有的预训练大语言模型是处理多样化推理任务的一种有前景的方法。然而,任务级专家选择往往过于粗粒度,因为不同实例可能需要不同的专业知识。为了解决这个问题,我们提出了Skill-MoE,一个符号化的、基于技能的、无梯度的混合专家框架,用于实例级专家选择。Skill-MoE从每个查询中推断技能(例如,数学中的代数),根据技能相关性选择专家,并让每个专家生成自己的推理。然后,由选定的聚合器将得到的k个输出进行综合,该聚合器因其整合多样化响应的能力而被选中。虽然实例级选择显著提高了性能,但朴素实现会因重复的模型加载和卸载而产生巨大开销。我们通过一种批推理策略解决了这个问题,该策略将实例按分配的专家分组,使得每个模型只需加载一次。因此,Skill-MoE在单GPU上集成了16个专家模型,其运行时间与使用4个GPU的先前多智能体基线相当。在多个基准测试(MMLU-Pro、GPQA、AIME和MedMCQA)中,Skill-MoE相比最佳基线实现了平均8.15%的绝对提升。它还能很好地泛化到未见过的任务,并且无需昂贵的多轮交互即可超越基于讨论的方法。

英文摘要

Combining existing pre-trained LLMs is a promising approach for diverse reasoning tasks. However, task-level expert selection is often too coarse-grained, since different instances may require different expertise. To address this, we propose Skill-MoE, a symbolic, skill-based, and gradient-free Mixture-of-Experts framework for instance-level expert selection. Skill-MoE infers skills (e.g., algebra in mathematics) from each query, selects experts based on skill relevance, and lets each expert generate its own reasoning. The resulting k outputs are then synthesized by an aggregator chosen for its ability to integrate diverse responses. While instance-level selection substantially improves performance, naively implementing it incurs heavy overhead from repeated model loading and offloading. We address this with a batch inference strategy that groups instances by assigned experts, allowing each model to be loaded only once. As a result, Skill-MoE integrates 16 expert models on a single GPU with runtime comparable to prior multi-agent baselines using 4 GPUs. Across diverse benchmarks (MMLU-Pro, GPQA, AIME, and MedMCQA), Skill-MoE achieves an average absolute improvement of 8.15% over the best baseline. It also generalizes well to unseen tasks and outperforms discussion-based methods without requiring expensive multi-round interactions.

2507.12645 2026-06-02 eess.SP cs.AI cs.LG 版本更新

A Novel Data Augmentation Strategy for Robust Deep Learning Classification of Biomedical Time-Series Data: Application to ECG and EEG Analysis

一种用于生物医学时间序列数据鲁棒深度学习分类的新型数据增强策略:在ECG和EEG分析中的应用

Mohammed Guhdar, Ramadhan J. Mstafa, Abdulhakeem O. Mohammed

发表机构 * Computer Science Department, College of Science, University of Zakho(扎赫大学科学学院计算机科学系) Department of Computer Science and Information Technology, The American University of Kurdistan(库尔德斯坦美国大学计算机科学与信息技术系) PRIME Lab, Scientific Research Center, University of Zakho(扎赫大学科学研究中心PRIME实验室)

AI总结 提出一种结合ResNet-CNN与注意力机制的统一深度学习框架,通过时域拼接多个增强变体的新型数据增强策略和Focal Loss处理类别不平衡,在ECG和EEG数据集上达到99.96%-100%的准确率,且内存需求低、推理速度快。

详情
AI中文摘要

准确统一分析多种生物信号(如ECG和EEG)的需求日益迫切,这对于全面评估患者状况至关重要,尤其是在同步监测中。尽管多传感器融合取得了进展,但在开发能够有效处理和提取本质上不同生理信号特征的统一架构方面仍存在关键空白。另一个挑战是许多生物医学数据集固有的类别不平衡,这常常导致传统方法性能偏差。本研究通过提出一种新颖且统一的深度学习框架来解决这些问题,该框架在不同信号类型上均达到了最先进的性能。我们的方法将基于ResNet的CNN与注意力机制相结合,并通过一种新颖的数据增强策略增强:对每个信号的多个增强变体进行时域拼接,以生成更丰富的表示。与先前工作不同,我们科学地增加信号复杂性以实现未来能力,从而相比现有技术获得了最佳预测。预处理步骤包括小波去噪、基线去除和标准化。通过结合使用这种高级数据增强和Focal Loss函数,有效管理了类别不平衡。训练过程中应用了正则化技术以确保泛化能力。我们在三个基准数据集上严格评估了所提出的架构:UCI癫痫EEG、MIT-BIH心律失常和PTB诊断ECG。它分别达到了99.96%、99.78%和100%的准确率,展示了在不同信号类型和临床背景下的鲁棒性。最后,该架构需要约130 MB内存,每个样本处理时间约10 ms,表明其适用于低端或可穿戴设备部署。

英文摘要

The increasing need for accurate and unified analysis of diverse biological signals, such as ECG and EEG, is paramount for comprehensive patient assessment, especially in synchronous monitoring. Despite advances in multi-sensor fusion, a critical gap remains in developing unified architectures that effectively process and extract features from fundamentally different physiological signals. Another challenge is the inherent class imbalance in many biomedical datasets, often causing biased performance in traditional methods. This study addresses these issues by proposing a novel and unified deep learning framework that achieves state-of-the-art performance across different signal types. Our method integrates a ResNet-based CNN with an attention mechanism, enhanced by a novel data augmentation strategy: time-domain concatenation of multiple augmented variants of each signal to generate richer representations. Unlike prior work, we scientifically increase signal complexity to achieve future-reaching capabilities, which resulted in the best predictions compared to the state of the art. Preprocessing steps included wavelet denoising, baseline removal, and standardization. Class imbalance was effectively managed through the combined use of this advanced data augmentation and the Focal Loss function. Regularization techniques were applied during training to ensure generalization. We rigorously evaluated the proposed architecture on three benchmark datasets: UCI Seizure EEG, MIT-BIH Arrhythmia, and PTB Diagnostic ECG. It achieved accuracies of 99.96%, 99.78%, and 100%, respectively, demonstrating robustness across diverse signal types and clinical contexts. Finally, the architecture requires ~130 MB of memory and processes each sample in ~10 ms, suggesting suitability for deployment on low-end or wearable devices.

2501.12189 2026-06-02 math.OC cs.LG 版本更新

MirrorCBO: A consensus-based optimization method in the spirit of mirror descent

MirrorCBO:一种镜像下降思想的共识优化方法

Leon Bungert, Franca Hoffmann, Dohyeon Kim, Tim Roith

发表机构 * Department of Computing and Mathematical Sciences, Caltech(计算与数学科学系,加州理工学院) Helmholtz Imaging, Deutsches Elektronen-Synchrotron DESY(海德堡成像,德意志电子同步辐射研究中心(DESY))

AI总结 提出MirrorCBO方法,通过将共识优化与镜像下降结合,实现无导数非凸优化,并推广到约束优化问题,理论证明指数收敛速率,实验展示稀疏诱导和约束优化的竞争力。

Comments 66 pages, 18 figures, 19 tables

详情
Journal ref
Mathematical Models and Methods in Applied Sciences 35 (14), 3083-3170, 2025
AI中文摘要

本文提出MirrorCBO,一种共识优化方法,它像镜像下降推广梯度下降一样推广了标准CBO。为此,我们将CBO方法应用于对偶粒子群,并通过应用镜像映射的逆(参数化为强凸函数$\phi$的次微分)来保留原始粒子位置。这样,我们结合了无导数非凸优化算法和镜像下降的优点。作为一个特例,该方法将CBO扩展到具有凸约束的优化问题。假设与$\phi$相关的Bregman距离有界,我们提供了MirrorCBO的渐近收敛结果,具有显式指数速率。另一个关键贡献是对该新算法在不同应用场景中的探索性数值研究,重点关注(i)稀疏诱导优化和(ii)约束优化,展示了MirrorCBO的竞争性能。我们经验性地观察到,该方法也可用于欧几里得空间(非凸)子流形上的优化,可适应其他近期CBO变体的镜像版本,并且继承了镜像下降选择理想极小值(如稀疏解)的能力。我们还概述了近期用于约束优化的CBO方法,并将其性能与MirrorCBO进行了比较。

英文摘要

In this work we propose MirrorCBO, a consensus-based optimization (CBO) method which generalizes standard CBO in the same way that mirror descent generalizes gradient descent. For this we apply the CBO methodology to a swarm of dual particles and retain the primal particle positions by applying the inverse of the mirror map, which we parametrize as the subdifferential of a strongly convex function $ϕ$. In this way, we combine the advantages of a derivative-free non-convex optimization algorithm with those of mirror descent. As a special case, the method extends CBO to optimization problems with convex constraints. Assuming bounds on the Bregman distance associated to $ϕ$, we provide asymptotic convergence results for MirrorCBO with explicit exponential rate. Another key contribution is an exploratory numerical study of this new algorithm across different application settings, focusing on (i) sparsity-inducing optimization, and (ii) constrained optimization, demonstrating the competitive performance of MirrorCBO. We observe empirically that the method can also be used for optimization on (non-convex) submanifolds of Euclidean space, can be adapted to mirrored versions of other recent CBO variants, and that it inherits from mirror descent the capability to select desirable minimizers, like sparse ones. We also include an overview of recent CBO approaches for constrained optimization and compare their performance to MirrorCBO.

2506.21278 2026-06-02 stat.ML cs.AI cs.LG math.ST stat.TH 版本更新

Hyperspherical Variational Autoencoders Using Efficient Spherical Cauchy Distribution

使用高效球面柯西分布的超球面变分自编码器

Lukas Sablica, Kurt Hornik

发表机构 * Institute for Statistics and Mathematics(统计与数学研究所) Vienna University of Economics and Business(维也纳经济与商业大学) Austria(奥地利)

AI总结 提出基于球面柯西分布的超球面变分自编码器,通过莫比乌斯变换实现可微重参数化,避免贝塞尔函数计算,在保持重尾特性的同时提供高效稳定的训练与推理。

详情
AI中文摘要

我们提出在超球面潜变量空间上使用球面柯西(spCauchy)潜变量的变分自编码器。spCauchy 族具有重尾全局行为,并且通过对球面上的均匀样本应用莫比乌斯变换,允许精确可微的重参数化。我们证明,在高浓度极限下,spCauchy 在显式浓度参数映射下恢复了 von Mises-Fisher(vMF)分布的局部切空间几何,同时避免了 vMF 实现所需的高阶贝塞尔函数计算。对于训练,到均匀球面先验的 Kullback-Leibler 散度具有快速收敛的级数、稳定的求积以及高浓度渐近形式。我们进一步建立了浓度依赖的 KL 核心的单调性,并推导了具有闭形式代理和误差控制的解析括号,支持极端情况下的稳定近似。压力测试基准表明,所得到的潜层目标在 CPU 和 GPU 上比 vMF 基线更稳定且评估更快。在图像和分子序列数据上的实验表明,spCauchy-VAE 为具有超球面潜表示的生式建模提供了一种鲁棒且可扩展的替代方案。

英文摘要

We propose spherical Cauchy (spCauchy) latent variables for variational autoencoders on hyperspherical latent spaces. The spCauchy family has heavy-tailed global behavior and admits an exact differentiable reparameterization by applying a Möbius transformation to uniform samples on the sphere. We show that, in the high-concentration limit, spCauchy recovers the local tangent-space geometry of the von Mises-Fisher (vMF) distribution under an explicit concentration parameter mapping, while avoiding the high-order Bessel-function evaluations required by vMF implementations. For training, the Kullback-Leibler divergence to a uniform spherical prior admits rapidly convergent series, stable quadrature, and high-concentration asymptotic forms. We further establish monotonicity of the concentration-dependent KL core and derive analytic brackets with closed-form surrogates and error control, supporting stable approximation in extreme regimes. Stress-test benchmarks show that the resulting latent-layer objective remains stable and faster to evaluate than vMF baselines on CPU and GPU. Experiments on image and molecular sequence data demonstrate that spCauchy-VAEs provide a robust and scalable alternative for generative modeling with hyperspherical latent representations.

2507.09766 2026-06-02 cs.LG cs.AI 版本更新

Toward accurate RUL and SoH estimation using reinforced graph-based physics-informed neural networks enhanced with dynamic weights

基于动态权重的强化图物理信息神经网络实现精确的剩余使用寿命和健康状态估计

Mohamadreza Akbari Pour, Ali Ghasemzadeh, Mohamad Ali Bijarchi, Mohammad Behshad Shafii

发表机构 * Department of Mechanical Engineering(机械工程系) Department of Computer Engineering(计算机工程系) Sharif University of Technology(谢赫拉特福大学)

AI总结 提出一种结合图表示学习、强化学习和自适应动态权重的物理信息神经网络框架RGPD,在C-MAPSS、PHM2012和XJTU数据集上实现跨资产退化场景的RUL和SoH高精度估计。

详情
AI中文摘要

精确估计剩余使用寿命(RUL)和健康状态(SoH)对于可靠的预测与健康管理(PHM)至关重要,有助于及时维护和可靠的工业运行。然而,结合数据驱动学习与基于物理的正则化的混合模型通常依赖于固定的损失权重,因此在跨具有不同退化行为的资产迁移时会失去准确性。本研究引入了具有动态加权的强化图物理信息网络(RGPD),这是一个用于时空退化建模和自适应物理引导正则化的统一框架。基于图的表示学习捕获传感器间的退化结构,软演员-评论家(SAC)模块在噪声条件下细化潜在特征,轻量级Q学习策略在训练过程中自适应地平衡单调性、平滑性和潜在动力学残差损失。该框架在C-MAPSS、PHM2012和XJTU数据集上进行了评估,这些数据集分别代表发动机、轴承和电池的退化过程。与相应基准表中报告的最强基线相比,RGPD在PHM2012和C-MAPSS上将平均RMSE提高了高达12%,在XJTU上将平均MAPE比第二好的模型降低了20%。在这些异构基准上的性能进一步表明了该模型跨退化系统的泛化能力。物理信息组件通过退化一致性先验以及深度隐藏物理模型风格的残差实现,提高了物理合理性,而无需为每种资产类型建立完整的第一性原理模型。

英文摘要

Accurate estimation of Remaining Useful Life (RUL) and State of Health (SoH) is essential for reliable Prognostics and Health Management (PHM), supporting timely maintenance and dependable industrial operation. However, hybrid models that combine data-driven learning with physics-based regularization often rely on fixed loss weights and therefore lose accuracy when transferred across assets with different degradation behaviors. This study introduces Reinforced Graph-based Physics-informed Networks with Dynamic Weighting (RGPD), a unified framework for spatio-temporal degradation modeling and adaptive physics-guided regularization. Graph-based representation learning captures inter-sensor degradation structure, a Soft Actor-Critic (SAC) module refines latent features under noisy conditions, and a lightweight Q-learning policy adaptively balances monotonicity, smoothness, and latent-dynamics residual losses during training. The framework is evaluated on the C-MAPSS, PHM2012, and XJTU datasets, which represent engine, bearing, and battery degradation processes. Relative to the strongest compared baselines reported in the corresponding benchmark tables, RGPD improves average RMSE by up to 12 percent on PHM2012 and C-MAPSS, and reduces average MAPE by 20 percent on XJTU compared with the second-best reported model. Performance on these heterogeneous benchmarks further suggests the model's generalizability across degradation systems. The physics-informed component is implemented through degradation-consistent priors together with a Deep Hidden Physics Model-style residual, which improves physical plausibility without requiring a full first-principles model for each asset type.

2507.07339 2026-06-02 stat.AP cs.LG 版本更新

Benchmarking Waitlist Mortality Prediction in Heart Transplantation Through Time-to-Event Modeling using New Longitudinal UNOS Dataset

基于新的纵向UNOS数据集通过时间-事件模型对心脏移植等待名单死亡率预测进行基准测试

Yingtao Luo, Reza Skandari, Carlos Martinez, Arman Kilic, Rema Padman

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Imperial College(帝国理工学院) United Network for Organ Sharing(美国器官共享网络) Medical University of South Carolina(南卡罗来纳医学院)

AI总结 本研究利用纵向等待名单历史数据,通过时间-事件模型对心脏移植等待名单死亡率进行预测,最佳模型C-Index达0.94,AUROC达0.89,显著优于以往模型。

Comments Best Student Paper Finalist in Proceedings of AMIA Annual Symposium 2025

详情
AI中文摘要

目前,关于心脏移植等待名单患者管理的决策由医生委员会根据多种因素做出,但过程在很大程度上仍是临时的。随着2018年以来器官共享联合网络(UNOS)收集的纵向患者、供体和器官数据量的增加,人们对在器官可用时支持临床决策的分析方法越来越感兴趣。在本研究中,我们对利用纵向等待名单历史数据进行时间依赖性、时间-事件建模的机器学习模型进行了基准测试,以预测等待名单死亡率。我们使用23,807条患者记录(包含77个变量)进行训练,并在1年时间范围内评估生存预测和区分能力。我们的最佳模型实现了0.94的C-Index和0.89的AUROC,显著优于以往模型。关键预测因子与已知风险因素一致,同时也揭示了新的关联。我们的发现可以支持心脏移植决策中的紧迫性评估和政策改进。

英文摘要

Decisions about managing patients on the heart transplant waitlist are currently made by committees of doctors who consider multiple factors, but the process remains largely ad-hoc. With the growing volume of longitudinal patient, donor, and organ data collected by the United Network for Organ Sharing (UNOS) since 2018, there is increasing interest in analytical approaches to support clinical decision-making at the time of organ availability. In this study, we benchmark machine learning models that leverage longitudinal waitlist history data for time-dependent, time-to-event modeling of waitlist mortality. We train on 23,807 patient records with 77 variables and evaluate both survival prediction and discrimination at a 1-year horizon. Our best model achieves a C-Index of 0.94 and AUROC of 0.89, significantly outperforming previous models. Key predictors align with known risk factors while also revealing novel associations. Our findings can support urgency assessment and policy refinement in heart transplant decision making.

2507.05658 2026-06-02 physics.ao-ph cs.LG 版本更新

HRRRCast: a data-driven emulator for regional weather forecasting at convection allowing scales

HRRRCast:面向对流允许尺度区域天气预报的数据驱动模拟器

Daniel Abdi, Isidora Jankov, Paul Madden, Vanderlei Vargas, Timothy A. Smith, Sergey Frolov, Montgomery Flora, Corey Potvin

发表机构 * Cooperative Institute for Research in Environmental Sciences(环境科学研究院) Cooperative Institute for Research in the Atmosphere(大气研究院) Cooperative Institute for Severe and High-Impact Weather Research and Operations(严重和高影响天气研究与运作研究院) NOAA Global Systems Laboratory(国家海洋和大气管理局全球系统实验室) NOAA Physical Sciences Laboratory(国家海洋和大气管理局物理科学实验室) NOAA National Severe Storms Laboratory(国家海洋和大气管理局严重风暴实验室)

AI总结 提出HRRRCast数据驱动模拟器,采用ResNet和GNN架构,通过多预报时长训练和贪婪滚动策略,在CONUS区域复合反射率预报上达到与HRRR模型相当或更优的性能。

详情
Journal ref
Artificial Intelligence for the Earth Systems, Vol. 5, No. 2, 2026, Article 250061
AI中文摘要

高分辨率快速刷新(HRRR)模型是一种用于美国本土(CONUS)业务天气预报的对流允许模型。为了提供计算高效的替代方案,我们引入了HRRRCast,这是一个基于先进机器学习技术构建的数据驱动模拟器。HRRRCast包含两种架构:基于ResNet的模型(ResHRRR)和基于图神经网络的模型(GraphHRRR)。ResHRRR使用卷积神经网络,并增强了挤压激励模块和特征线性调制,通过去噪扩散隐式模型(DDIM)支持概率预报。为了更好地处理较长的预报时效,我们训练单个模型预测多个预报时长(1小时、3小时和6小时),然后在推理时采用贪婪滚动策略。当使用3到10个成员的集合在CONUS全区域评估复合反射率时,ResHRRR在弱降雨阈值(20 dBZ)下优于HRRR预报,并在中等阈值(30 dBZ)下达到有竞争力的性能。我们的工作改进了Pathak等人[21]的StormCast模型,具体改进包括:a) 在CONUS全区域上训练,b) 使用多个预报时长以提高长期预报技巧,c) 使用分析数据而非StormCast中无意使用的+1小时后分析数据训练,d) 将未来的GFS状态作为输入,实现降尺度以提高长预报时效的准确性。基于网格、邻域和对象的指标证实,与HRRR相比,风暴定位更好、频率偏差更低、成功率更高。HRRRCast集合预报还保持了更清晰的空间细节,功率谱与HRRR分析更匹配。虽然GraphHRRR在当前形式下表现不佳,但它为未来的图基预报奠定了基础。HRRRCast代表了向高效、数据驱动的区域天气预报迈出的一步,具有有竞争力的准确性和集合能力。

英文摘要

The High-Resolution Rapid Refresh (HRRR) model is a convection-allowing model used in operational weather forecasting across the contiguous United States (CONUS). To provide a computationally efficient alternative, we introduce HRRRCast, a data-driven emulator built with advanced machine learning techniques. HRRRCast includes two architectures: a ResNet-based model (ResHRRR) and a Graph Neural Network-based model (GraphHRRR). ResHRRR uses convolutional neural networks enhanced with squeeze-and-excitation blocks and Feature-wise Linear Modulation, and supports probabilistic forecasting via the Denoising Diffusion Implicit Model (DDIM). To better handle longer lead times, we train a single model to predict multiple lead times (1h, 3h, and 6h), then use a greedy rollout strategy during inference. When evaluated on composite reflectivity over the full CONUS domain using ensembles of 3 to 10 members, ResHRRR outperforms HRRR forecast at light rainfall threshold (20 dBZ) and achieves competitive performance at moderate thresholds (30 dBZ). Our work advances the StormCast model of Pathak et al. [21] by: a) training on the full CONUS domain, b) using multiple lead times to improve long-range skill, c) training on analysis data instead of the +1h post-analysis data inadvertently used in StormCast, and d) incorporating future GFS states as inputs, enabling downscaling that improves long-lead accuracy. Grid-, neighborhood-, and object-based metrics confirm better storm placement, lower frequency bias, and higher success ratios than HRRR. HRRRCast ensemble forecasts also maintain sharper spatial detail, with power spectra more closely matching HRRR analysis. While GraphHRRR underperforms in its current form, it lays groundwork for future graph-based forecasting. HRRRCast represents a step toward efficient, data-driven regional weather prediction with competitive accuracy and ensemble capability.

2507.02905 2026-06-02 cs.HC cs.AI cs.LG 版本更新

Preference-Optimal Multi-Metric Weighting for Parallel Coordinate Plots

平行坐标图的偏好最优多度量加权

Chisa Mori, Shuhei Watanabe, Masaki Onishi, Takayuki Itoh

发表机构 * Preferred Networks Inc.(Preferred Networks公司)

AI总结 针对平行坐标图中多度量可视化难题,提出基于偏好最优加权的公式化方法,并利用雷达图与UMAP降维实现直观偏好选择,有效揭示控制参数重要性模式。

Comments Accepted to International Conference Information Visualisation (iV2025)

详情
AI中文摘要

平行坐标图(PCP)是一种解释控制参数与度量之间关系的常用方法。PCP通过基于单一度量的颜色渐变来提供这种解释。然而,当存在多个度量时,提供这样的渐变是具有挑战性的。虽然一种简单的方法是通过线性加权每个度量来计算单一度量,但这种加权对用户来说是不明确的。为了解决这个问题,我们首先提出了一种基于特定偏好度量组合计算最优加权的原则性公式。尽管用户可以在双度量问题的二维(2D)平面上简单地选择他们的偏好,但多度量问题需要直观的可视化以允许他们选择偏好。我们通过使用各种雷达图来可视化由UMAP降维的2D平面上的度量权衡来实现这一点。在使用行人流引导规划的分析中,我们的方法为每个用户偏好识别出了控制参数重要性的独特模式,突出了我们方法的有效性。

英文摘要

Parallel coordinate plots (PCPs) are a prevalent method to interpret the relationship between the control parameters and metrics. PCPs deliver such an interpretation by color gradation based on a single metric. However, it is challenging to provide such a gradation when multiple metrics are present. Although a naive approach involves calculating a single metric by linearly weighting each metric, such weighting is unclear for users. To address this problem, we first propose a principled formulation for calculating the optimal weight based on a specific preferred metric combination. Although users can simply select their preference from a two-dimensional (2D) plane for bi-metric problems, multi-metric problems require intuitive visualization to allow them to select their preference. We achieved this using various radar charts to visualize the metric trade-offs on the 2D plane reduced by UMAP. In the analysis using pedestrian flow guidance planning, our method identified unique patterns of control parameter importance for each user preference, highlighting the effectiveness of our method.

2409.18624 2026-06-02 cs.AI cs.LG 版本更新

Unsupervised Cognition

无监督认知

Alfredo Ibias, Hector Antona, Guillem Ramirez-Miranda, Enric Guinovart, Eduard Alarcon

发表机构 * Avatar Cognition(Avatar认知)

AI总结 提出一种基于原语的无监督学习方法,通过构建分布式层次结构表示输入空间,在分类任务上超越现有最先进方法,并展现出类似认知的行为。

详情
AI中文摘要

无监督学习方法在认知模型中具有软启发。迄今为止,最成功的无监督学习方法主要围绕在数学空间中对样本进行聚类。在本文中,我们提出了一种基于原语的无监督学习方法,用于决策制定,该方法受一种新颖的认知框架启发。这种以表示为中心的方法以输入无关的方式,将输入空间建设性地建模为分布式层次结构。我们将我们的方法与当前最先进的无监督学习分类、当前最先进的小规模和不完整数据集分类以及当前最先进的癌症类型分类进行了比较。我们展示了我们的方法如何超越先前的最先进技术。我们还评估了我们方法的一些类似认知的特性,在这些特性中,它不仅优于比较的算法(甚至包括监督学习算法),而且表现出不同的、更类似于认知的行为。

英文摘要

Unsupervised learning methods have a soft inspiration in cognition models. To this day, the most successful unsupervised learning methods revolve around clustering samples in a mathematical space. In this paper we propose a primitive-based, unsupervised learning approach for decision-making inspired by a novel cognition framework. This representation-centric approach models the input space constructively as a distributed hierarchical structure in an input-agnostic way. We compared our approach with both current state-of-the-art unsupervised learning classification, with current state-of-the-art small and incomplete datasets classification, and with current state-of-the-art cancer type classification. We show how our proposal outperforms previous state-of-the-art. We also evaluate some cognition-like properties of our proposal where it not only outperforms the compared algorithms (even supervised learning ones), but it also shows a different, more cognition-like, behaviour.

2412.03771 2026-06-02 cs.SD cs.LG eess.AS 版本更新

Embedding-Space Diffusion for Zero-Shot Environmental Sound Classification

嵌入空间扩散用于零样本环境声音分类

Ysobel Sims, Alexandre Mendes, Stephan Chalup

发表机构 * School of Information and Physical Sciences, University of Newcastle, Australia(信息与物理科学学院,新南威尔士大学,澳大利亚)

AI总结 本文提出一种基于扩散模型的条件生成方法,用于零样本环境声音分类,在多个音频数据集上平均性能优于现有基线方法。

详情
AI中文摘要

零样本学习通过利用语义信息使模型能够泛化到未见过的类别,弥合训练集和测试集之间类别不重叠的差距。尽管大量研究集中在计算机视觉中的零样本学习,但这些方法在环境音频中的应用仍未被充分探索,现有研究性能较差。在计算机视觉中已证明成功的生成方法在零样本环境声音分类研究中明显缺失。为填补这一空白,本研究探索了环境音频中零样本学习的生成方法。我们改编了两种来自计算机视觉的成功生成模型:交叉对齐和分布对齐变分自编码器(CADA-VAE)以及利用不变侧生成对抗网络(LisGAN)。此外,我们引入了一种以类别辅助数据为条件的新型扩散模型。扩散模型生成的合成嵌入与已见类别嵌入结合,用于训练分类器。在五个环境音频数据集(ESC-50、ARCA23K-FSD、FSC22、UrbanSound8k和TAU Urban Acoustics 2019)和一个音乐分类数据集(GTZAN)上进行了实验。结果表明,扩散模型在六个音频数据集上的平均性能优于所有基线方法。这项工作确立了扩散模型作为零样本学习的一种有前景的方法,并引入了零样本环境声音分类生成方法的第一个基准,为未来研究提供了基础。

英文摘要

Zero-shot learning enables models to generalise to unseen classes by leveraging semantic information, bridging the gap between training and testing sets with non-overlapping classes. While much research has focused on zero-shot learning in computer vision, the application of these methods to environmental audio remains underexplored, with poor performance in existing studies. Generative methods, which have demonstrated success in computer vision, are notably absent from zero-shot environmental sound classification studies. To address this gap, this work investigates generative methods for zero-shot learning in environmental audio. Two successful generative models from computer vision are adapted: a cross-aligned and distribution-aligned variational autoencoder (CADA-VAE) and a leveraging invariant side generative adversarial network (LisGAN). Additionally, we introduced a novel diffusion model conditioned on class auxiliary data. Synthetic embeddings generated by the diffusion model are combined with seen class embeddings to train a classifier. Experiments are conducted on five environmental audio datasets, ESC-50, ARCA23K-FSD, FSC22, UrbanSound8k and TAU Urban Acoustics 2019, and one music classification dataset, GTZAN. Results show that the diffusion model outperforms all baseline methods on average across six audio datasets. This work establishes the diffusion model as a promising approach for zero-shot learning and introduces the first benchmark of generative methods for zero-shot environmental sound classification, providing a foundation for future research.

2411.15240 2026-06-02 cs.LG cs.AI cs.HC q-bio.QM 版本更新

A Foundation Model for Wearable Movement Data in Mental Health Research

心理健康研究中可穿戴运动数据的基础模型

Franklin Y. Ruan, Aiwei Zhang, Jenny Y. Oh, SouYoung Jin, Nicholas C. Jacobson

发表机构 * Dartmouth College(达特茅斯学院) National Institute of Diabetes and Digestive and Kidney Diseases(国家糖尿病、消化系统疾病和肾病研究所) National Institutes of Health(美国国立卫生研究院) Department of Computer Science at Dartmouth College(达特茅斯学院计算机科学系)

AI总结 提出预训练体动记录Transformer(PAT),一种基于自监督掩码自编码器预训练的可穿戴运动时间序列基础模型,在心理健康预测任务上优于非基础模型方法,并提供可解释的注意力图。

Comments F. Y. Ruan, A. Zhang, J. Y. Oh, S. Jin and N. C. Jacobson, "A Foundation Model for Wearable Movement Data in Mental Health Research," in IEEE Journal of Biomedical and Health Informatics, doi: 10.1109/JBHI.2026.3694809

详情
AI中文摘要

可穿戴运动数据由几乎所有市售智能手表收集,是心理健康研究的宝贵资源,反映了细粒度的时间行为趋势。尽管前景广阔,但与临床图像和文本分析相比,健康可穿戴建模的基础模型开发仍然有限。我们设计了带有补丁嵌入的Transformer,并在分钟级、持续一周的体动记录(身体活动强度测量)序列上使用自监督掩码自编码器预训练,以开发和评估预训练体动记录Transformer(PAT)。PAT是一个用于可穿戴运动时间序列的开源基础模型,结合了长达一周的时间建模、精神科结果评估以及在公共数据上的可重复性。在来自美国国家健康与营养调查(NHANES)的全国代表性队列中21,538名参与者的数据上预训练,PAT在心理健康预测任务(包括苯二氮卓类药物和SSRI使用、抑郁症和睡眠异常)中始终优于非基础模型基线。在苯二氮卓类药物使用预测任务中,PAT相比常用于时间序列建模的非基础深度学习模型表现出最大改进(即比LSTM提高55.6%,比一维CNN提高21.4%,比ConvLSTM提高14.8%)。除了预测准确性,PAT还提供可解释的注意力图,突出对临床预测最重要的日常活动特定时段,提供模型透明度和潜在临床见解。结果表明,PAT为研究人员和临床医生提供了一种易于部署、适应性强且可扩展的解决方案,以从可穿戴传感器数据中推进临床见解。GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/

英文摘要

Wearable movement data is collected by nearly all commercially available smartwatches and is a valuable resource for mental health research, reflecting fine-grained temporal behavioral trends. Despite its promise, the development of foundation models for health wearable modeling remains limited when compared to clinical image and text analysis. We designed transformers with patch embeddings and used self-supervised masked autoencoder pretraining on minute-level week-long actigraphy (physical activity intensity measurement) sequences to develop and evaluate the Pretrained Actigraphy Transformer (PAT). PAT is an open-source foundation model for wearable movement time series that combines week-long temporal modeling, psychiatric outcome evaluation, and reproducibility on public data. Pretrained on data from 21,538 U.S. participants in a nationally representative cohort from the National Health and Nutrition Examination Survey (NHANES), PAT consistently outperformed non-foundation-model baselines across mental health prediction tasks-including benzodiazepine and SSRI use, depression, and sleep abnormalities. During the benzodiazepine medication usage prediction task, PAT demonstrated the largest improvement over non-foundational deep learning models commonly used for time-series modeling (i.e., 55.6% improvement over the LSTM, 21.4% improvement over the 1-D CNN, 14.8% improvement over the ConvLSTM). Beyond predictive accuracy, PAT provides interpretable attention maps highlighting specific periods of daily activity most important for clinical predictions, offering model transparency and potential clinical insights. The results suggest that PAT offers an easy-to-deploy, adaptable and scalable solution to advance clinical insight from wearable sensor data for researchers and clinicians. GitHub: https://github.com/njacobsonlab/Pretrained-Actigraphy-Transformer/

2506.19035 2026-06-02 cs.LG 版本更新

Failure Modes of Time Series Interpretability Algorithms for Critical Care Applications and Potential Solutions

重症监护应用中时间序列可解释性算法的失败模式及潜在解决方案

Shashank Yadav, Vignesh Subbian

发表机构 * Department of Biomedical Engineering(生物医学工程系) University of Arizona(亚利桑那大学)

AI总结 本文系统分析了梯度、遮挡和置换方法在动态预测任务中的失败模式,并提出可学习掩码框架作为替代方案,通过引入时间连续性和标签一致性约束来提供更可靠的特征重要性解释。

Comments 13 pages, 10 figures, Accepted at the AMIA Annual Symposium 2025. The final version will appear in the official proceedings

详情
AI中文摘要

可解释性在重症监护中对齐和部署深度学习模型起着至关重要的作用,尤其是在影响患者生存的不断变化的环境中。然而,常见的可解释性算法在应用于动态预测任务时面临独特挑战,其中患者轨迹随时间演变。基于梯度、遮挡和置换的方法常常难以处理时变目标依赖性和时间平滑性。本文系统分析了这些失败模式,并支持可学习掩码的可解释性框架作为替代方案,该框架可以结合时间连续性和标签一致性约束来学习随时间变化的特征重要性。在此,我们提出针对动态时间序列预测问题的可学习掩码方法,为重症监护及类似领域的应用提供更可靠和一致的解释。

英文摘要

Interpretability plays a vital role in aligning and deploying deep learning models in critical care, especially in constantly evolving conditions that influence patient survival. However, common interpretability algorithms face unique challenges when applied to dynamic prediction tasks, where patient trajectories evolve over time. Gradient, Occlusion, and Permutation-based methods often struggle with time-varying target dependency and temporal smoothness. This work systematically analyzes these failure modes and supports learnable mask-based interpretability frameworks as alternatives, which can incorporate temporal continuity and label consistency constraints to learn feature importance over time. Here, we propose that learnable mask-based approaches for dynamic timeseries prediction problems provide more reliable and consistent interpretations for applications in critical care and similar domains.

2506.17171 2026-06-02 cs.LG 版本更新

Deep generative models as the probability transformation functions

深度生成模型作为概率变换函数

Vitalii Bondar, Vira Babenko, Roman Trembovetskyi, Yurii Korobeinyk, Viktoriya Dzyuba

发表机构 * Cherkasy State Technological University(切尔卡西国立技术大学) Cherkasy Bohdan Khmelnytsky National University(切尔卡斯比多曼·赫梅尔尼茨基国立大学)

AI总结 本文提出统一理论视角,将深度生成模型视为概率变换函数,揭示不同架构(自编码器、自回归模型、生成对抗网络、归一化流、扩散模型和流匹配)本质上均通过将简单预定义分布变换为复杂目标数据分布来运作。

Comments 12 pages, 6 figures, accepted for publication in "ICIST 2025 Springer Proceedings"

详情
AI中文摘要

本文引入了一个统一的理论视角,将深度生成模型视为概率变换函数。尽管各类生成模型——自编码器、自回归模型、生成对抗网络、归一化流、扩散模型和流匹配——在架构和训练方法上存在明显差异,但我们证明它们本质上都是通过将简单的预定义分布变换为复杂的目标数据分布来运作的。这一统一视角促进了模型架构之间方法改进的迁移,并为发展通用理论方法提供了基础,有望带来更高效、更有效的生成建模技术。

英文摘要

This paper introduces a unified theoretical perspective that views deep generative models as probability transformation functions. Despite the apparent differences in architecture and training methodologies among various types of generative models - autoencoders, autoregressive models, generative adversarial networks, normalizing flows, diffusion models, and flow matching - we demonstrate that they all fundamentally operate by transforming simple predefined distributions into complex target data distributions. This unifying perspective facilitates the transfer of methodological improvements between model architectures and provides a foundation for developing universal theoretical approaches, potentially leading to more efficient and effective generative modeling techniques.

2506.10677 2026-06-02 stat.ML cs.LG 版本更新

Exploiting Similarities in A/B Testing with Off-Policy Estimation

利用离线策略估计在A/B测试中的相似性

Otmane Sakhi, Alexandre Gilotte, David Rohde

发表机构 * Criteo AI Lab(Criteo AI实验室)

AI总结 本文提出利用离线策略估计方法,通过捕捉新旧系统决策倾向的相似性,构建一族A/B测试估计器,在保持无偏性的同时改善集中性质,提高统计效率。

Comments KDD '26

详情
AI中文摘要

我们研究A/B测试,即衡量新决策系统相对于基线的性能增益的标准协议。传统的A/B测试将两个系统视为黑箱,忽略了它们之间的潜在相似性。然而,在实践中,新系统和基线系统很少存在根本性差异,通常共享显著的结构,这可以通过它们做出相似决策的倾向来捕捉。我们表明,在这种情况下,常用的均值差估计量虽然无偏,但在统计上并非最优。利用离线策略估计,我们引入了一族A/B测试估计量,这些估计量利用被测试系统的倾向来获得改进的集中性质。这族估计量足够灵活,可以针对实际决策进行定制。得到的估计量简单、对倾向性误设具有鲁棒性,在测试系统表现出相似性时显著更准确,并在缺乏这种相似性时优雅地退化为均值差估计量。我们的理论分析和实证研究证实了它们的效率和实用性。

英文摘要

We study A/B testing, the standard protocol for measuring the performance gain of a new decision system relative to a baseline. Traditional A/B testing treats both systems as black boxes, ignoring potential similarities between them. In practice, however, new and baseline systems are rarely radically different and often share significant structure, which can be captured by their propensities to make similar decisions. We show that in such cases, the commonly used difference-in-means estimator, though unbiased, is statistically suboptimal. Leveraging off-policy estimation, we introduce a family of A/B testing estimators that exploit the propensities of the tested systems to achieve improved concentration properties. This family is flexible enough to be tailored to practical decision-making. The resulting estimators are simple, robust to propensities misspecification, substantially more accurate when the tested systems exhibit similarities, and gracefully fall back to the difference-in-means estimator when such similarities are absent. Our theoretical analysis and empirical studies confirm their efficiency and practicality.

2012.02110 2026-06-02 cs.CL cs.LG 版本更新

GottBERT: a pure German Language Model

GottBERT: 一个纯德语语言模型

Raphael Scheible, Johann Frei, Fabian Thomczyk, Henry He, Patric Tippmann, Jochen Knaus, Victor Jaravine, Frank Kramer, Martin Boeker

发表机构 * Institute for AI and Informatics in Medicine, University Hospital rechts der Isar, Technical University Munich(慕尼黑技术大学医学人工智能与信息学研究所,莱茵河右岸大学医院) IT-Infrastructure for Translational Medical Research, Faculty of Applied Computer Science, University of Augsburg(奥格斯堡大学应用计算机科学学院医学转化研究IT基础设施) Data Integration Center, Faculty of Medicine, University of Freiburg(弗赖堡大学医学学院数据整合中心) School of Computation, Information and Technology, Technical University Munich(慕尼黑技术大学计算、信息与技术学院) Institute of Medical Biometry and Statistics, Medical Center, Faculty of Medicine, University of Freiburg(弗赖堡大学医学学院医学生物统计学研究所) Freiburg Center for Data Analysis and Modeling, University of Freiburg(弗赖堡大学数据分析与建模中心) Hengrui Europe Biosciences, Zurich(苏黎世亨格里欧生物科学公司)

AI总结 本文提出了首个德语单语言RoBERTa模型GottBERT,在OSCAR德语子集上预训练,并在NER和文本分类任务上展示了竞争性能。

详情
AI中文摘要

预训练语言模型显著推进了自然语言处理(NLP),尤其是BERT及其优化版本RoBERTa的引入。虽然最初的研究集中在英语上,但单语言模型在多语言模型方面在预训练工作量、整体资源效率或下游任务性能上可能具有优势。尽管基于提示的LLM越来越流行,但计算效率更高的类BERT模型仍然高度相关。在这项工作中,我们提出了第一个德语单语言RoBERTa模型GottBERT,该模型仅在OSCAR数据集的德语部分上进行预训练。此外,我们研究了过滤OSCAR语料库的影响。GottBERT使用fairseq和标准超参数进行预训练。我们在两个命名实体识别(NER)任务(Conll 2003和GermEval 2014)和三个文本分类任务(GermEval 2018细粒度和粗粒度,以及10kGNAD)上,与现有的德语BERT模型和两个多语言模型进行了性能评估。性能使用$F_{1}$分数和准确率来衡量。GottBERT base和large模型表现出竞争性能,其中GottBERT在6个任务中的4个中领先于base模型。与我们的预期相反,所应用的过滤并未显著影响结果。为了支持德语NLP研究社区,我们将在MIT许可下发布GottBERT模型。

英文摘要

Pre-trained language models have significantly advanced natural language processing (NLP), especially with the introduction of BERT and its optimized version, RoBERTa. While initial research focused on English, single-language models can be advantageous compared to multilingual ones in terms of pre-training effort, overall resource efficiency or downstream task performance. Despite the growing popularity of prompt-based LLMs, more compute-efficient BERT-like models remain highly relevant. In this work, we present the first German single-language RoBERTa model, GottBERT, pre-trained exclusively on the German portion of the OSCAR dataset. Additionally, we investigated the impact of filtering the OSCAR corpus. GottBERT was pre-trained using fairseq and standard hyperparameters. We evaluated its performance on two Named Entity Recognition (NER) tasks (Conll 2003 and GermEval 2014) and three text classification tasks (GermEval 2018 fine and coarse, and 10kGNAD) against existing German BERT models and two multilingual models. Performance was measured using the $F_{1}$ score and accuracy. The GottBERT base and large models showed competitive performance, with GottBERT leading among the base models in 4 of 6 tasks. Contrary to our expectation, the applied filtering did not significantly affect the results. To support the German NLP research community, we are releasing the GottBERT models under the MIT license.

2506.01226 2026-06-02 eess.SY cs.LG cs.SY 版本更新

React to Surprises: Stable-by-Design Neural Feedback Control and the Youla-REN

应对意外:稳定设计的神经反馈控制与Youla-REN

Nicholas H. Barbara, Ruigang Wang, Alexandre Megretski, Ian R. Manchester

发表机构 * Australian Centre for Robotics(澳大利亚机器人中心) School of Aerospace, Mechanical and Mechatronic Engineering(航空航天、机械与机电工程学院) The University of Sydney(悉尼大学) Laboratory for Information and Decision Systems(信息与决策系统实验室) Dept. Electrical Engineering and Computer Science(电气工程与计算机科学系)

AI总结 提出基于非线性Youla-Kucera参数化和鲁棒神经网络(如循环均衡网络REN)的结构,实现无约束优化且保证闭环稳定性,并分析了非线性、部分观测和增量稳定性要求下的性质。

详情
AI中文摘要

我们研究了用于基于学习的控制的稳定非线性策略的参数化。提出了一种基于非线性Youla-Kucera参数化与鲁棒神经网络(如循环均衡网络REN)相结合的结构。得到的参数化是无约束的,因此可以通过一阶优化方法进行搜索,同时始终通过构造保证闭环稳定性。我们研究了(a)非线性动力学、(b)部分观测和(c)增量闭环稳定性要求(收缩性和Lipschitz性)的组合。我们发现,对于(c)与(a)或(b)的组合,收缩且Lipschitz的Youla参数总是导致收缩且Lipschitz的闭环。然而,如果三者同时成立,则增量稳定性可能因外部扰动而丧失。相反,维持了一个较弱的条件,我们称之为d-管收缩和Lipschitz性。我们进一步得到了逆结果,表明所提出的参数化覆盖了某些非线性系统类别的所有收缩且Lipschitz的闭环。数值实验说明了我们的参数化在学习具有内置稳定性保证的控制器时的实用性,这些控制器用于:(i)没有稳定效应的“经济”奖励;(ii)短训练周期;以及(iii)不确定系统。

英文摘要

We study parameterizations of stabilizing nonlinear policies for learning-based control. We propose a structure based on a nonlinear version of the Youla-Kucera parameterization combined with robust neural networks such as the recurrent equilibrium network (REN). The resulting parameterizations are unconstrained, and hence can be searched over with first-order optimization methods, while always ensuring closed-loop stability by construction. We study the combination of (a) nonlinear dynamics, (b) partial observation, and (c) incremental closed-loop stability requirements (contraction and Lipschitzness). We find that for the combination of (c) with either (a) or (b), a contracting and Lipschitz Youla parameter always leads to contracting and Lipschitz closed loops. However, if all three hold, then incremental stability can be lost with exogenous disturbances. Instead, a weaker condition is maintained, which we call d-tube contraction and Lipschitzness. We further obtain converse results showing that the proposed parameterization covers all contracting and Lipschitz closed loops for certain classes of nonlinear systems. Numerical experiments illustrate the utility of our parameterization when learning controllers with built-in stability certificates for: (i) ``economic'' rewards without stabilizing effects; (ii) short training horizons; and (iii) uncertain systems.

2505.19925 2026-06-02 stat.ME cs.LG 版本更新

Cellwise and Casewise Robust Covariance in High Dimensions

高维中的逐细胞和逐案例稳健协方差

Fabio Centofanti, Mia Hubert, Peter J. Rousseeuw

发表机构 * Section of Statistics and Data Science, Department of Mathematics, KU Leuven, Belgium(统计与数据科学系,数学系,卢森堡大学,比利时)

AI总结 提出cellRCov方法,通过主成分和正交子空间分解结合岭正则化,同时处理高维数据中的案例异常值、细胞异常值和缺失数据,并建立了理论性质。

详情
AI中文摘要

样本协方差矩阵是多变量统计的基石,但它对异常值高度敏感。这些异常值可以是案例异常值(例如属于不同总体的案例),也可以是细胞异常值(数据矩阵中的偏差单元格)。最近开发了一些能够处理这两种异常值的稳健协方差估计量,但其计算仅适用于最多20维。为了解决这个问题,我们提出了cellRCov方法,这是一种同时处理案例异常值、细胞异常值和缺失数据的稳健协方差估计量。它依赖于协方差在主成分和正交子空间上的分解,利用了稳健PCA的最新工作。它还采用岭型正则化来稳定估计的协方差矩阵。我们建立了cellRCov的一些理论性质,包括其逐案例和逐细胞影响函数以及一致性和渐近正态性。模拟研究证明了cellRCov在污染和缺失数据场景中的优越性能。此外,其在异常检测的实际应用中也展示了实用性。我们还构建并展示了用于稳健和正则化典型相关分析的cellRCCA方法。

英文摘要

The sample covariance matrix is a cornerstone of multivariate statistics, but it is highly sensitive to outliers. These can be casewise outliers, such as cases belonging to a different population, or cellwise outliers, which are deviating cells (entries) of the data matrix. Recently some robust covariance estimators have been developed that can handle both types of outliers, but their computation is only feasible up to at most 20 dimensions. To remedy this we propose the cellRCov method, a robust covariance estimator that simultaneously handles casewise outliers, cellwise outliers, and missing data. It relies on a decomposition of the covariance on principal and orthogonal subspaces, leveraging recent work on robust PCA. It also employs a ridge-type regularization to stabilize the estimated covariance matrix. We establish some theoretical properties of cellRCov, including its casewise and cellwise influence functions as well as consistency and asymptotic normality. A simulation study demonstrates the superior performance of cellRCov in contaminated and missing data scenarios. Furthermore, its practical utility is illustrated in a real-world application to anomaly detection. We also construct and illustrate the cellRCCA method for robust and regularized canonical correlation analysis.

2505.18113 2026-06-02 cs.LG math.OC 版本更新

Beyond Discreteness: Sample Complexity Analysis of Straight-Through Estimator for 1-bit Quantization

超越离散性:1比特量化的直通估计器样本复杂度分析

Halyun Jeong, Jack Xin, Penghang Yin

发表机构 * Department of Mathematics and Statistics, University at Albany, SUNY(纽约州立大学阿尔巴尼分校数学与统计学系) Department of Mathematics, University of California, Irvine(加州大学欧文分校数学系)

AI总结 本文首次对神经网络量化中直通估计器(STE)的样本复杂度进行分析,通过研究具有二元权重和激活的两层神经网络的量化感知训练,推导出保证STE优化收敛到全局最小值的样本复杂度界,并发现标签噪声下STE梯度方法的循环逃逸与回归特性,以及STE在非高斯数据上失效但可通过归一化恢复有效性。

详情
AI中文摘要

训练量化神经网络需要解决底层优化问题的非可微和离散性质。为应对这一挑战,直通估计器(STE)已成为最广泛采用的启发式方法,通过引入有偏但有效的替代梯度,允许通过离散操作进行反向传播。然而,其理论性质仍 largely unexplored,现有少数分析通过假设无限训练数据来关注泛化误差。相比之下,本文首次在神经网络量化背景下对STE进行了样本复杂度分析。我们的理论结果强调了样本量在STE成功中的关键作用,这是现有研究缺失的关键见解。具体而言,通过分析具有二元权重和激活的两层神经网络的量化感知训练,我们推导出以数据维度表示的样本复杂度界,这些界保证了基于STE的优化在遍历和非遍历分析中收敛到全局最小值。此外,在存在标签噪声的情况下,我们证明了STE梯度方法的一个有趣循环性质,其中迭代反复逃离并返回到最优二元权重。最后,我们实验证明STE在一般非高斯数据上失败,但通过归一化可以恢复其有效性,这突显了其在有效量化中的实际重要性。

英文摘要

Training quantized neural networks requires addressing the non-differentiable and discrete nature of the underlying optimization problem. To tackle this challenge, the straight-through estimator (STE) has become the most widely adopted heuristic, allowing backpropagation through discrete operations by introducing biased yet valid surrogate gradients. However, its theoretical properties remain largely unexplored, with few existing analyses focus on the generalization error by assuming an infinite amount of training data. In contrast, this work presents the first sample complexity analysis of STE in the context of neural network quantization. Our theoretical results highlight the critical role of sample size in the success of STE, a key insight absent from existing studies. Specifically, by analyzing the quantization-aware training of a two-layer neural network with binary weights and activations, we derive the sample complexity bounds in terms of the data dimensionality that guarantee the convergence of STE-based optimization to the global minimum for both ergodic and non-ergodic analyses. Moreover, in the presence of label noises, we prove an intriguing recurrence property of STE-gradient method, where the iterate repeatedly escape from and return to the optimal binary weights. Finally, we empirically demonstrate that STE fails for general non-Gaussian data but its effectiveness can be restored through normalization, underscoring its practical importance in effective quantization.

2503.24183 2026-06-02 cs.LG cs.MA 版本更新

Scalable Ride-Sourcing Vehicle Rebalancing with Service Accessibility Guarantee: A Constrained Mean-Field Reinforcement Learning Approach

可扩展的网约车再平衡方法:具有服务可及性保证的约束平均场强化学习

Matej Jusup, Kenan Zhang, Zhiyuan Hu, Barna Pásztor, Andreas Krause, Francesco Corman

发表机构 * ETH Zürich(苏黎世联邦理工学院) EPFL Lausanne(洛桑联邦理工学院)

AI总结 提出基于约束平均场强化学习的连续状态再平衡模型,在保证服务公平性的同时实现大规模网约车车队的高效协调与可扩展性。

Comments 34 pages, 15 figures

详情
Journal ref
Transportation Research Part C: Emerging Technologies, Vol. 188, 105705 (2026)
AI中文摘要

Uber和Lyft等网约车服务的扩张通过移动应用提供灵活的按需出行,重塑了城市交通。尽管便利,这些平台面临重大运营挑战,尤其是车辆再平衡——即战略性地重新定位车队以解决供需的时空错配。再平衡不足会导致乘客等待时间延长和车辆利用率低下,还会引发公平性问题,如服务分布不均和司机收入差异。为解决这些问题,我们引入了具有连续再平衡动作的连续状态平均场控制(MFC)和平均场强化学习(MFRL)模型。MFC和MFRL通过车辆与车辆分布(而非单个车辆)的交互来建模每辆车的行为,从而提供可扩展的解决方案。这缓解了关于智能体数量的维度灾难,使得能够以显著降低的计算复杂度协调大型车队,并在车队规模变化时无需重新训练模型。为确保跨地理区域的公平服务可及性,我们将可及性约束整合到模型中,并推导出在高度满足乘客需求和公平覆盖车辆供应之间取得平衡的再平衡策略。使用深圳数据驱动模拟的广泛评估证明了我们方法的效率和鲁棒性。值得注意的是,该方法可扩展到数万辆车辆,训练时间与线性规划再平衡相当。此外,我们的策略有效探索了效率-公平帕累托前沿,在车队利用率、完成请求数和接驾距离等关键指标上优于传统基准,同时确保公平的服务可及性。

英文摘要

The expansion of ride-sourcing services such as Uber and Lyft has reshaped urban transportation by offering flexible, on-demand mobility via mobile applications. Despite convenience, these platforms confront significant operational challenges, particularly vehicle rebalancing-strategic repositioning of a fleet of vehicles to address spatiotemporal mismatches in supply and demand. Inadequate rebalancing results in prolonged rider waiting times and inefficient vehicle utilization, but also leads to fairness issues, such as the inequitable distribution of service and disparities in driver income. To tackle these, we introduce continuous-state mean-field control (MFC) and mean-field reinforcement learning (MFRL) models with continuous repositioning actions. MFC and MFRL offer scalable solutions by modeling each vehicle's behavior through interaction with the vehicle distribution, rather than with individual vehicles. This mitigates the curse of dimensionality with respect to the number of agents, enabling coordination across large fleets with significantly reduced computational complexity and eliminating the need to retrain the model when fleet size changes. To ensure equitable service access across geographic regions, we integrate an accessibility constraint into models and derive rebalancing policies that strike a balance between high fulfillment of rider demand and fair coverage of vehicle supply. Extensive evaluation using data-driven simulation of Shenzhen demonstrates the efficiency and robustness of our approach. Remarkably, it scales to tens of thousands of vehicles, with training times comparable to linear programming rebalancing. Besides, our policies effectively explore the efficiency-equity Pareto front, outperforming conventional benchmarks across key metrics like fleet utilization, fulfilled requests, and pickup distance, while ensuring equitable service access.

2505.14725 2026-06-02 q-bio.GN cs.LG stat.AP 版本更新

HR-VILAGE-3K3M: A Human Respiratory Viral Immunization Longitudinal Gene Expression Dataset for Systems Immunity

HR-VILAGE-3K3M:用于系统免疫学的人类呼吸道病毒免疫纵向基因表达数据集

Xuejun Sun, Yiran Song, Xiaochen Zhou, Ruilie Cai, Yu Zhang, Xinyi Li, Rui Peng, Jialiu Xie, Yuanyuan Yan, Muyao Tang, Prem Lakshmanane, Baiming Zou, James S. Hagood, Raymond J. Pickles, Didong Li, Fei Zou, Xiaojing Zheng

发表机构 * Department of Biostatistics University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校生物统计学系) Department of Epidemiology and Biostatistics University of South Carolina(南卡罗来纳大学流行病学与生物统计学系) Department of Pediatrics University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校儿科系) Department of Microbiology and Immunology University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校微生物学与免疫学系)

AI总结 为解决呼吸道病毒感染研究中转录组数据分散且处理不一致的问题,构建了包含3178名受试者、66项研究的HR-VILAGE-3K3M数据集,整合了疫苗接种、病毒接种和混合暴露的批量及单细胞转录组数据,并进行了统一的预处理和质量控制,以支持生物标志物发现、免疫机制研究和分析方法开发。

详情
AI中文摘要

呼吸道病毒感染构成全球健康负担,但保护性和病理性的细胞免疫机制仍不清楚。自然感染队列通常缺乏暴露前基线和时间控制采样,而接种和疫苗试验则产生结构良好的纵向转录组数据。然而,这些数据集分散在多个存储库中且处理不一致,阻碍了整合性和AI驱动的分析。为应对这些挑战,我们开发了人类呼吸道病毒免疫纵向基因表达(HR-VILAGE-3K3M)存储库:一个整合了来自66项研究的3178名受试者的批量及单细胞转录组谱的AI就绪资源。该数据集涵盖疫苗接种、病毒接种和混合暴露,样本来自血液和鼻拭子,收集自GEO、ImmPort和ArrayExpress等公共存储库。我们整理并协调了受试者级别的元数据,标准化了结果测量,并应用了统一的预处理和严格的质量控制。我们还提供了基准分析以说明其实用性。该资源支持生物标志物发现、免疫机制和方法学开发。作为人类呼吸道病毒免疫领域最大的纵向转录组资源之一,HR-VILAGE-3K3M能够实现可重复和可扩展的分析,从而加速疫苗和抗病毒研究。

英文摘要

Respiratory viral infections pose a global health burden, yet the cellular immune mechanisms underlying protection and pathology remain unclear. Natural infection cohorts often lack pre-exposure baselines and time-controlled sampling, whereas inoculation and vaccination trials generate well-structured longitudinal transcriptomic data. However, these datasets are scattered across repositories and processed inconsistently, hindering integrative and AI-driven analyses. To address these challenges, we developed the Human Respiratory Viral Immunization LongitudinAl Gene Expression (HR-VILAGE-3K3M) repository: an AI-ready resource integrating bulk and single-cell transcriptomic profiles from 3,178 subjects across 66 studies. The dataset spans vaccination, inoculation, and mixed exposures, with samples from blood and nasal swabs collected from public repositories including GEO, ImmPort, and ArrayExpress. We curated and harmonized subject-level metadata, standardized outcome measures, and applied unified preprocessing with rigorous quality control. We further provide benchmark analyses illustrating its utility. This resource supports discovery of biomarkers, immune mechanisms, and methodological development. As one of the largest longitudinal transcriptomic resources for human respiratory viral immunization, HR-VILAGE-3K3M enables reproducible and scalable analyses to accelerate vaccine and antiviral research.

2505.13273 2026-06-02 cs.AI cs.LG 版本更新

EMoE: Training-Free Expert Disagreement for Uncertainty-Aware Text-to-Image Diffusion

EMoE: 面向不确定性感知的文本到图像扩散的无训练专家分歧方法

Lucas Berry, Axel Brando, Wei-Di Chang, Juan Camilo Gamboa Higuera, David Meger

发表机构 * McGill University(麦吉尔大学) Barcelona Supercomputing Center (BSC)(巴塞罗那超级计算中心 (BSC)) Ideogram AI

AI总结 提出EMoE方法,通过预训练MoE扩散模型中早期MoE层的专家分歧,无需训练即可估计认知不确定性,用于提示风险诊断和生成质量排序。

详情
AI中文摘要

大型文本到图像扩散模型很少在提示可能产生低质量生成时提供可靠信号,尤其是在训练数据未公开的情况下。我们研究预训练混合专家(MoE)扩散模型中的专家分歧是否可以作为认知不确定性的可靠估计。我们引入EMoE,一种无训练方法,在早期MoE层分离专家特定的计算路径,跨路径使用相同的初始噪声,并在第一步去噪后测量其潜在表示之间的方差。这提供了在完整图像生成之前的不确定性感知提示信号,无需辅助网络或训练扩散集成。在COCO和CC3M上,EMoE根据文本-图像对齐质量指标对提示进行排序,比扩散特定和基于路由的基线更一致。我们进一步将EMoE应用于多语言提示,并发现分歧和生成质量中存在系统的语言依赖性差异,包括共享词汇效应。这些结果使EMoE成为MoE文本到图像扩散模型中提示风险、模型覆盖和偏差分析的实用诊断工具。

英文摘要

Large text-to-image diffusion models rarely expose reliable signals of when a prompt is likely to produce a poorly aligned generation, especially when training data is undisclosed. We study whether expert disagreement inside pre-trained mixture-of-experts (MoE) diffusion models can serve as a reliable estimate for epistemic uncertainty. We introduce EMoE, a training-free method that separates expert-specific computation paths at an early MoE layer, uses the same initial noise across paths, and measures variance among their latent representations after the first denoising step. This provides an uncertainty-aware prompt signal before full image generation, without auxiliary networks or training diffusion ensembles. On COCO and CC3M, EMoE ranks prompts by text-image alignment quality metrics more consistently than diffusion-specific and router-based baselines. We further apply EMoE to multilingual prompts and find systematic language-dependent differences in disagreement and generation quality, including shared-vocabulary effects. These results position EMoE as a practical diagnostic tool for prompt risk, model coverage, and bias analysis in MoE text-to-image diffusion models.

2503.03137 2026-06-02 cs.AI cs.LG cs.NE 版本更新

Learning to Reduce Search Space for Generalizable Neural Routing Solver

学习减少搜索空间以实现泛化的神经路由求解器

Changliang Zhou, Xi Lin, Zhenkun Wang, Qingfu Zhang

发表机构 * School of Automation and Intelligent Manufacturing(自动化与智能制造学院) Southern University of Science and Technology(南方科技大学) School of Mathematics and Statistics(数学与统计学学院) Xi'an Jiaotong University(西安交通大学) Department of Computer Science(计算机科学系) City University of Hong Kong(香港城市大学)

AI总结 提出首个基于学习的动态搜索空间缩减框架L2R,通过自适应剪枝节点来高效求解大规模车辆路径问题,在千万节点规模上保持高质量解。

Comments accepted by SIGKDD 2026

详情
AI中文摘要

构造性神经组合优化(NCO)通过直接学习构造近似最优解,为解决车辆路径问题(VRPs)提供了一种有前景的范式,从而减少了对算法设计专家知识的依赖。然而,由于高计算复杂度,将这些方法扩展到大规模实例仍然具有挑战性。虽然最近的动态搜索空间缩减(SSR)方法可以通过基于几何距离的剪枝提高推理效率,但它们通常难以处理具有非均匀分布的复杂实例,或者当最优解严重依赖于非空间约束时。为了解决这一关键问题,我们提出了学习减少(L2R),这是首个基于学习的动态SSR框架。L2R通过从问题特定特征中提取模式来学习自适应地优先考虑节点,从而在每一步剪枝搜索空间,实现高效且可扩展的解构造。大量实验表明,我们的L2R框架在不同的VRP变体上对不同问题规模和数据分布具有稳健的泛化能力。据我们所知,L2R是首个有效扩展到具有1000万个节点的VRP实例同时保持高质量解的神经求解器,这显著推动了NCO在泛化和可扩展性方面的前沿。我们的代码可在https://github.com/CIAM-Group/L2R获取。

英文摘要

Constructive neural combinatorial optimization (NCO) offers a promising paradigm for solving vehicle routing problems (VRPs) by directly learning to construct approximate optimal solutions, thereby reducing reliance on expert knowledge for algorithm design. However, scaling these methods to handle large-scale instances remains challenging due to high computational complexity. While recent dynamic search space reduction (SSR) methods can improve inference efficiency through geometric distance-based pruning, they often struggle on complex instances with non-uniform distributions or when optimal solutions rely heavily on non-spatial constraints. To address this critical issue, we propose Learning to Reduce (L2R), which is the first learning-based dynamic SSR framework. L2R learns to adaptively prioritize nodes by extracting patterns from problem-specific features to prune the search space at each step, enabling efficient and scalable solution construction. Extensive experiments show that our L2R framework generalizes robustly to different problem scales and data distributions on various VRP variants. To the best of our knowledge, L2R is the first neural solver to effectively scale to VRP instances with $10$ million nodes while maintaining high solution quality, which significantly pushes the frontier of NCO in terms of generalization and scalability. Our code is available at https://github.com/CIAM-Group/L2R.

2505.10882 2026-06-02 cs.LG stat.ML 版本更新

Global Convergence of Adaptive Sensing for Principal Eigenvector Estimation

主特征向量估计的自适应传感的全局收敛性

Alex Saad-Falcon, Brighton Ancelin, Justin Romberg

发表机构 * School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA, USA(电子与计算机工程学院,佐治亚理工学院,亚特兰大,GA,USA)

AI总结 本文分析Oja算法的一种压缩变体,利用每样本两个自适应测量估计协方差矩阵的主特征向量,证明了期望正弦平方误差的收敛速率并给出信息论下界,揭示了压缩带来的维度代价。

Comments Accepted at ICML 2026. 34 pages (9 main text + appendices), 4 figures, 2 tables. v2 (camera-ready) adds a matching information-theoretic lower bound and a non-adaptive lower-bound separation across three powers of d; substantially revised from v1

详情
AI中文摘要

主成分分析经典地需要完整的$d$维样本,但在各种应用中,硬件限制每次采集只能获得少量标量测量。我们分析了Oja算法的一种压缩变体,用于估计数据协方差矩阵的主特征向量,每样本仅使用两个自适应测量。在每次迭代中,我们沿着当前估计方向进行一次测量,并在随机正交方向上进行一次测量。我们证明,经过$t$次迭代后,到真实特征向量的期望正弦平方误差为$\mathcal{O}(\lambda_1\lambda_2 d^2 / (\Delta^2 t))$,其中$d$是环境维度,$\lambda_1, \lambda_2$是前导特征值,$\Delta = \lambda_1 - \lambda_2$是特征间隙。我们用一个匹配的信息论下界$\Omega(\lambda_1\lambda_2 d^2 / (\Delta^2 t))$来补充这一结果——这是压缩特征向量估计的第一个下界——证明$d^2$因子(与完全观测的极小极大速率$\Theta(\lambda_1\lambda_2 d / (\Delta^2 t))$相比多了一个$d$因子)是压缩的基本代价,无法改进。相比之下,每次迭代两次测量的任何非自适应方案都会遭受$\Omega(\lambda_2^2 d^3 / (\Delta^2 t))$的误差,多了一个$d$的幂次。这通过$d$的三个幂次将完全观测PCA、自适应压缩PCA和非自适应压缩PCA区分开来。我们的分析处理了协方差具有非零尾部特征值的噪声设置,为无噪声情况之外的自适应压缩子空间跟踪提供了首个收敛性保证。

英文摘要

Principal component analysis classically requires full $d$-dimensional samples, yet in various applications hardware limits acquisition to a few scalar measurements per sample. We analyze a compressed variant of Oja's algorithm for estimating the principal eigenvector of the data covariance matrix using only two adaptive measurements per sample. At each iteration, we observe one measurement along the current estimate and one in a random orthogonal direction. We prove that after $t$ iterations, the expected sine-squared error to the true eigenvector is $\mathcal{O}(λ_1λ_2 d^2 / (Δ^2 t))$, where $d$ is the ambient dimension, $λ_1, λ_2$ are the leading eigenvalues, and $Δ= λ_1 - λ_2$ is the eigengap. We complement this with a matching information-theoretic lower bound of $Ω(λ_1λ_2 d^2 / (Δ^2 t))$ -- the first for compressed eigenvector estimation -- proving that the $d^2$ factor, an additional factor of $d$ compared to the fully-observed minimax rate $Θ(λ_1λ_2 d / (Δ^2 t))$, is the fundamental cost of compression and cannot be improved. In contrast, any non-adaptive scheme with two measurements per iteration suffers $Ω(λ_2^2 d^3 / (Δ^2 t))$, an additional power of $d$. This separates fully-observed, adaptive-compressed, and non-adaptive-compressed PCA across three powers of $d$. Our analysis handles the noisy setting where the covariance has nonzero trailing eigenvalues, providing the first convergence guarantee for adaptive compressed subspace tracking beyond the noiseless case.

2504.21427 2026-06-02 cs.LG cs.AI 版本更新

MPEC: Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers

MPEC:通过集成基于聚类的分类器实现流形保持的脑电图分类

Shermin Shahbazi, Mohammad-Reza Nasiri, Majid Ramezani

发表机构 * Department of Electrical and Computer(电气与计算机系) Department of Computer Science and Engineering, Information Technology(计算机科学与工程系,信息科技)

AI总结 提出MPEC方法,通过协方差矩阵和RBF核的特征工程以及黎曼流形上的改进K-means聚类集成,解决EEG信号的非欧几里得流形结构问题,在BCI Competition IV数据集2a上取得显著提升。

Comments 7 pages ,3 figures

详情
AI中文摘要

脑电图信号的准确分类对于脑机接口(BCI)和神经假体应用至关重要,然而许多现有方法未能考虑EEG数据的非欧几里得流形结构,导致性能欠佳。保留这种流形信息对于捕捉EEG信号的真实几何结构至关重要,但传统分类技术在很大程度上忽视了这一需求。为此,我们提出了MPEC(通过集成基于聚类的分类器实现流形保持的EEG分类),它引入了两项关键创新:(1)一个特征工程阶段,结合协方差矩阵和径向基函数(RBF)核来捕捉EEG通道之间的线性和非线性关系;(2)一个聚类阶段,采用针对黎曼流形空间定制的改进K-means算法,确保局部几何敏感性。通过集成多个基于聚类的分类器,MPEC取得了优越的结果,并在BCI Competition IV数据集2a上得到了显著改进的验证。

英文摘要

Accurate classification of EEG signals is crucial for brain-computer interfaces (BCIs) and neuroprosthetic applications, yet many existing methods fail to account for the non-Euclidean, manifold structure of EEG data, resulting in suboptimal performance. Preserving this manifold information is essential to capture the true geometry of EEG signals, but traditional classification techniques largely overlook this need. To this end, we propose MPEC (Manifold-Preserved EEG Classification via an Ensemble of Clustering-Based Classifiers), that introduces two key innovations: (1) a feature engineering phase that combines covariance matrices and Radial Basis Function (RBF) kernels to capture both linear and non-linear relationships among EEG channels, and (2) a clustering phase that employs a modified K-means algorithm tailored for the Riemannian manifold space, ensuring local geometric sensitivity. Ensembling multiple clustering-based classifiers, MPEC achieves superior results, validated by significant improvements on the BCI Competition IV dataset 2a.

2504.20238 2026-06-02 physics.ao-ph cs.LG 版本更新

Atmospheric Predictability Beyond 30 Days with Machine Learning

利用机器学习实现30天以上的大气可预报性

P. Trent Vonich, Gregory J. Hakim

发表机构 * University of Washington(华盛顿大学) Air Force Institute of Technology(空军技术研究院)

AI总结 通过机器学习模型GraphCast优化初始条件,将确定性天气预报的时效从两周延长至30天以上,平均误差降低86%。

详情
AI中文摘要

长期以来,大气可预报性研究认为小空间尺度上的快速误差增长将确定性天气预报的固有时效限制在约两周。我们利用机器学习天气模型GraphCast,通过优化2020年每日两次预报的初始条件,挑战了这一极限。该方法在十天预报中,相对于再分析初始条件的控制预报,平均误差降低了86%,且技能持续超过30天。平均最优初始条件扰动显示出大尺度、空间一致的修正,主要反映了哈德莱环流的增强。在盘古天气模型中使用GraphCast最优初始条件的预报实现了21%的误差降低,在四天时达到峰值,表明分析修正针对的是模型和分析误差的调整。这些结果证明了存在能够产生远超两周的 skillful 确定性预报的初始条件。这些初始条件是否能够实时识别以改进业务天气预报,仍是未来研究的课题。

英文摘要

Atmospheric predictability research has long held that rapid error growth at small spatial scales imposes an intrinsic limit of roughly two weeks on deterministic weather forecast skill. We challenge this limit using GraphCast, a machine-learning weather model, by optimizing initial conditions for twice-daily forecasts spanning 2020. This approach yields an average error reduction of 86% at ten days relative to control forecasts from reanalysis initial conditions, with skill lasting beyond 30 days. Mean optimal initial-condition perturbations reveal large-scale, spatially coherent corrections primarily reflecting an intensification of the Hadley circulation. Forecasts using GraphCast-optimal initial conditions in the Pangu-Weather model achieve a 21% error reduction, peaking at four days, indicating that analysis corrections reflect adjustments that target both model and analysis error. These results demonstrate the existence of initial conditions producing skillful deterministic forecasts far beyond two weeks. Whether such initial conditions can be identified in real-time for improving operational weather forecasts remains a topic of future research.

2504.17471 2026-06-02 cs.LG cs.AI cs.DC 版本更新

GRANITE : a Byzantine-Resilient Dynamic Gossip Learning Framework

GRANITE:一种拜占庭鲁棒的动态八卦学习框架

Yacine Belal, Mohamed Maouche, Sonia Ben Mokhtar

发表机构 * CEA, List, Université Paris-Saclay Palaiseau(CEA、List、巴黎-萨克雷大学帕莱索分校) INRIA, INSA Lyon, CITI, UR3720(INRIA、里昂INSA、CITI、UR3720) LIRIS, INSA Lyon, CNRS Lyon(LIRIS、里昂INSA、里昂CNRS)

AI总结 针对动态八卦学习中拜占庭节点通过毒化模型和操纵节点采样发起的双重攻击,提出GRANITE框架,通过累积节点标识知识并动态调整聚合阈值,实现鲁棒学习,理论证明拜占庭节点在局部邻域呈指数衰减,实验表明在30%拜占庭节点下精度接近非拜占庭场景,且通信成本降低9倍。

详情
AI中文摘要

八卦学习是一种去中心化的学习范式,用户通过迭代地与少量邻居节点交换和聚合模型。最近的方法依赖于使用随机节点采样协议构建的动态通信图,这些协议已被证明可以加速收敛。然而,我们表明这些方法容易受到双重攻击:拜占庭节点可以毒化模型并操纵节点采样以放大其影响力。我们通过GRANITE框架应对这种组合威胁,该框架用于在存在拜占庭节点的稀疏动态图上进行鲁棒学习。GRANITE随时间累积关于遇到的节点标识的知识,并根据每个节点邻域中估计的拜占庭密度动态调整局部聚合阈值。我们证明,在GRANITE下,局部邻域中的拜占庭节点呈现指数衰减。我们进一步推导了GRANITE生成图的鲁棒性条件。实验结果表明,在30%拜占庭节点下,GRANITE的收敛精度在非拜占庭精度的5%以内,收敛速度更快,且通信成本降低高达9倍。

英文摘要

Gossip Learning (GL) is a decentralized learning paradigm where users iteratively exchange and aggregate models with a small set of neighboring peers. Recent approaches rely on dynamic communication graphs built using Random Peer Sampling (RPS) protocols which have been proven to accelerate convergence. However, we show that these approaches are vulnerable to a dual attack: Byzantine nodes can poison models and manipulate peer sampling to amplify their influence. We address this combination of threats with GRANITE, a framework for robust learning over sparse, dynamic graphs in the presence of Byzantine nodes. GRANITE accumulates knowledge about encountered node identifiers over time and dynamically adjusts local aggregation thresholds based on estimated Byzantine density in the neighbourhood of each node. We demonstrate that under GRANITE, the Byzantine presence in local neighborhoods exhibits an exponential decay. We further derive the robustness conditions of the graphs generated by GRANITE. Empirically, our results indicate that GRANITE converges within 5% of non-Byzantine accuracy under 30% Byzantines nodes, offers faster convergence and operates on graphs with up to 9x lower communication cost.

2411.03163 2026-06-02 quant-ph cs.LG 版本更新

Efficient Hamiltonian, structure and trace distance learning of Gaussian states

高斯态的高效哈密顿量、结构和迹距离学习

Marco Fanizza, Cambyse Rouzé, Daniel Stilck França

发表机构 * Inria(法国国家信息与自动化技术研究院) Télécom Paris - LTCI(巴黎电信学院 - LTCI) Institut Polytechnique de Paris(巴黎理工学院) University of Copenhagen(哥本哈根大学) Universitat Autònoma de Barcelona(巴塞罗那自治大学) Univ Lyon(里昂大学) ENS Lyon(里昂高等师范学院) UCBL(里昂大学) CNRS(国家科学研究中心) LIP(里昂信息科学与技术实验室)

AI总结 本文针对正温度玻色子高斯态,提出基于外差测量的高效协议,以对数样本复杂度学习二次哈密顿量参数和相互作用图,并首次实现迹距离下多项式样本复杂度的高斯态学习。

Comments 54 pages, improvements in presentation and tighter analysis of the dependence on the precision in Hamiltonian and graph learning

详情
AI中文摘要

在这项工作中,我们首次研究了正温度玻色子高斯态的哈密顿量学习问题,这是广泛研究的高斯图模型学习的量子推广。在假设温度、压缩、位移和相互作用图的最大度有界的情况下,我们获得了推断其底层二次哈密顿量参数的高效协议,包括样本复杂度和计算复杂度。我们的协议仅需要外差测量,这通常在实验上可行,并且样本复杂度随模式数对数增长。此外,我们证明在类似的设置和样本复杂度下可以学习底层相互作用图。另外,利用我们的技术,我们首次获得了迹距离下高斯态学习的结果,其精度二次缩放,模式数多项式缩放,尽管对高斯态施加了某些限制。我们的主要技术创新是高斯态协方差矩阵和哈密顿矩阵的几个连续性界,这些界本身具有独立意义,并结合了我们称之为局部反演技术的方法。本质上,局部反演技术允许我们通过仅并行估计协方差矩阵的子矩阵来可靠地推断高斯态的哈密顿量,这些子矩阵的大小随所需精度缩放,而不随模式数缩放。这样,我们避免了对协方差矩阵进行精确全局估计的需求,从而控制了样本复杂度。

英文摘要

In this work, we initiate the study of Hamiltonian learning for positive temperature bosonic Gaussian states, the quantum generalization of the widely studied problem of learning Gaussian graphical models. We obtain efficient protocols, both in sample and computational complexity, for the task of inferring the parameters of their underlying quadratic Hamiltonian under the assumption of bounded temperature, squeezing, displacement and maximal degree of the interaction graph. Our protocol only requires heterodyne measurements, which are often experimentally feasible, and has a sample complexity that scales logarithmically with the number of modes. Furthermore, we show that it is possible to learn the underlying interaction graph in a similar setting and sample complexity. In addition, we use our techniques to obtain the first results on learning Gaussian states in trace distance with a quadratic scaling in precision and polynomial in the number of modes, albeit imposing certain restrictions on the Gaussian states. Our main technical innovations are several continuity bounds for the covariance and Hamiltonian matrix of a Gaussian state, which are of independent interest, combined with what we call the local inversion technique. In essence, the local inversion technique allows us to reliably infer the Hamiltonian of a Gaussian state by only estimating in parallel submatrices of the covariance matrix whose size scales with the desired precision, but not the number of modes. This way we bypass the need to obtain precise global estimates of the covariance matrix, controlling the sample complexity.

2503.15371 2026-06-02 cs.RO cs.LG 版本更新

GIFT: Geometry-Induced Functional Transfer for Category-level Object Manipulation

GIFT: 几何诱导的功能迁移用于类别级物体操作

Cristiana de Farias, Luis Figueredo, Riddhiman Laha, Maxime Adjigble, Brahim Tamadazte, Rustam Stolkin, Sami Haddadin, Naresh Marturi

发表机构 * Extreme Robotics Laboratory, School of Metallurgy and Materials, University of Birmingham(伯明翰大学冶金与材料学院极端机器人实验室) Munich Institute of Robotics & Machine Intelligence, Technische Universität München (TUM)(慕尼黑工业大学机器人与人工智能研究所) School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院) Sorbonne Université, ISIR, Paris, France(巴黎法国索邦大学ISIR研究所)

AI总结 提出GIFT框架,利用功能映射和螺旋插值,从单次人类演示中迁移复杂物体操作技能到新物体,无需额外训练。

Comments 8 pages, 6 figures. ICRA 2026

详情
AI中文摘要

在新环境中操作不熟悉物体对机器人来说具有挑战性,因为泛化能力有限。我们提出了一种新的技能迁移框架GIFT(几何诱导的功能迁移),使机器人能够从单次人类演示中迁移复杂的物体操作技能和约束。我们的方法通过关注以物体为中心的交互,从演示中推导几何表示,解决了技能获取和任务执行的挑战。利用功能映射(FMC)框架,我们高效地映射物体及其环境之间的交互函数,使机器人能够在具有相似拓扑或类别的物体之间复制任务操作,即使它们形状差异很大。此外,我们的方法结合了螺旋插值(ScLERP)来生成平滑、几何感知的机器人路径,确保迁移的技能遵循演示的任务约束。我们通过大量实验验证了该方法的有效性和适应性,展示了在多样化的真实环境中成功进行技能迁移和任务执行,无需额外训练。

英文摘要

Robotic manipulation of unfamiliar objects in new environments is challenging due to limited generalisation capabilities. We propose a new skill transfer framework, GIFT (Geometry-Induced Functional Transfer), which enables a robot to transfer complex object manipulation skills and constraints from a single human demonstration. Our approach addresses the challenge of skill acquisition and task execution by deriving geometric representations from demonstrations focusing on object-centric interactions. By leveraging the Functional Maps (FMC) framework, we efficiently map interaction functions between objects and their environments, allowing the robot to replicate task operations across objects of similar topologies or categories, even when they have significantly different shapes. Additionally, our method incorporates screw interpolation (ScLERP) for generating smooth, geometrically-aware robot paths to ensure the transferred skills adhere to the demonstrated task constraints. We validate the effectiveness and adaptability of our approach through extensive experiments, demonstrating successful skill transfer and task execution in diverse real-world environments without requiring additional training.

2503.07154 2026-06-02 cs.LG cs.AI 版本更新

Ideas in Inference-time Scaling can Benefit Generative Pre-training Algorithms

推理时缩放的思想可以有益于生成式预训练算法

Jiaming Song, Linqi Zhou

发表机构 * Luma AI

AI总结 本文指出自回归模型与扩散模型的二分法是错误的,提出应从推理过程(序列扩展与状态细化)出发设计训练目标,并论证了推理算法优先于训练目标的原则。

Comments updated some new literature on flow maps and continuous LLMs

详情
AI中文摘要

生成式预训练通常被框定在一个错误的二分法中:用于离散信号的自回归模型与用于连续信号的扩散模型。我们认为这种二分法是错误的,因为它混淆了模型家族、数据表示、训练目标和推理过程。自回归是一种通过归一化条件采样扩展序列的推理过程,而扩散是一种反复修正现有状态的细化过程。因此,更有用的对比不是自回归与扩散,而是用交叉熵学习的离散标记与用扩散风格目标学习的连续标记,以及用于从中采样的推理算法。从这个角度来看,算法进展应优先考虑推理时间效率的两个维度:序列扩展和状态细化。我们主张在训练目标之前设计推理过程,因为如果推理映射省略了必要参数或施加了错误分解,训练方法无法弥补。我们通过DDIM风格采样器的目标时间限制、多标记预测的联合分布限制,以及直接参数化长距离推理移动的最新流映射和少步蒸馏方法来说明这一原则。

英文摘要

Generative pre-training is often framed through a false dichotomy between autoregressive models for discrete signals and diffusion models for continuous signals. We argue that the dichotomy is false because it conflates model family, data representation, training objective, and inference procedure. Autoregression is an inference procedure that expands a sequence through normalized conditional draws, while diffusion is a refinement procedure that repeatedly revises an existing state. The more useful contrast is therefore not autoregressive versus diffusion, but discrete tokens learned with cross-entropy versus continuous tokens learned with diffusion-style objectives, together with the inference algorithms used to sample from them. From this perspective, algorithmic progress should prioritize inference-time efficiency along two axes: sequence expansion and state refinement. We advocate designing the inference procedure before the training objective, because a training method cannot compensate for an inference map that omits necessary arguments or imposes an incorrect factorization. We illustrate this principle through a target-time limitation of DDIM-style samplers, a joint-distribution limitation of multi-token prediction, and recent flow-map and few-step distillation methods that directly parameterize long-range inference moves.

2503.07325 2026-06-02 cs.LG stat.ML 版本更新

Non-vacuous Generalization Bounds for Deep Neural Networks without any modification to the trained models

无需对训练模型进行任何修改的深度神经网络非平凡泛化界

Khoat Than, Dat Phan

发表机构 * Hanoi University of Science and Technology(河内科学与技术大学) VinBigdata Institute(VinBigdata研究院)

AI总结 提出一类新的数据依赖泛化界,直接应用于未修改的训练模型,通过分解泛化误差为分布复杂度和局部模型行为项,首次在大型未修改深度网络上实现非平凡泛化保证。

详情
AI中文摘要

理解和认证现代深度神经网络的行为仍然是可靠机器学习中的一个基本挑战。我们引入了一类新的数据依赖泛化界,直接应用于训练模型,无需任何修改。特别地,我们提出了一个可精确计算的界,在所有评估的网络中(包括具有6亿参数的ImageNet规模模型)都是非平凡的。这是首次表明即使对于大型未修改的深度网络,也能实现有意义的泛化保证。我们的方法揭示了泛化由训练模型与数据分布几何之间的相互作用所支配。我们将泛化误差分解为两个可解释的组成部分:一个分布复杂度项,捕捉数据质量在输入空间中的分布;以及局部模型行为项,捕捉网络在单个区域内的行为。这种联合依赖识别出泛化差距出现的位置和原因。实验上,我们界的某些部分对真实测试误差具有高度预测性,并且当划分与内在数据几何对齐时,界会收紧,突出了数据依赖的局部正则性作为泛化的关键驱动因素。

英文摘要

Understanding and certifying the behavior of modern deep neural networks remains a fundamental challenge in reliable machine learning. We introduce a new class of data-dependent generalization bounds that apply directly to trained models, without any modification. In particular, we present an exactly computable bound that is non-vacuous across all evaluated networks, including ImageNet-scale models with 600M parameters. This this is the first work showing that meaningful generalization guarantees are achievable even for large, unaltered deep networks. Our approach reveals that generalization is governed by the interaction between the trained model and the geometry of the data distribution. We decompose the generalization error into two interpretable components: a distributional complexity term, capturing how the data mass is distributed across the input space, and local model-behavior terms, capturing the network's behavior within individual regions. This joint dependence identifies where and why generalization gaps arise. Empirically, some components of our bound are highly predictive of the true test error, and the bound tightens when the partition aligns with the intrinsic data geometry, highlighting data-dependent local regularity as a key driver of generalization.

2502.20016 2026-06-02 cs.LG 版本更新

Position: Neglecting the Sustainability of AI is Fuelling a Global AI Arms Race

立场:忽视人工智能的可持续性正在助长全球人工智能军备竞赛

Pedram Bakhtiarifard, Pınar Tözün, Christian Igel, Raghavendra Selvan

发表机构 * Department of Computer Science, University of Copenhagen, Denmark(丹麦哥本哈根大学计算机科学系) Robotics Section, IT University of Copenhagen, Denmark(丹麦哥本哈根IT大学机器人学系)

AI总结 本文指出当前AI可持续性讨论忽视经济和社会维度,提出通过调和气候意识与资源意识、引入CARAML框架来遏制全球AI军备竞赛。

Comments Accepted to be presented at ICML 2026. Source code at https://github.com/saintslab/caraml

详情
AI中文摘要

可持续性包含三个关键方面:经济、环境和社会。然而,关于可持续人工智能(AI)的新兴讨论主要集中在AI的环境可持续性上,忽视了经济和社会方面。实现真正可持续的AI需要解决其环境可持续性(强调减轻AI对气候的影响)与社会可持续性(依赖于公平获取AI开发资源)之间的张力。然而,这种提高可及性的推动往往忽视了扩大此类资源使用的环境成本。本立场论文认为,调和气候意识和资源意识对于实现真正可持续的AI至关重要,而忽视这些因素会助长全球AI军备竞赛。运用历史唯物主义的卡尔·马克思基础-上层建筑框架,我们分析了物质条件如何塑造当前的AI进展及其相关讨论。此外,我们引入了气候与资源感知机器学习(CARAML)框架,并提出了涵盖个人、社区、行业、政府和全球层面的可操作建议,以实现可持续的AI。

英文摘要

Sustainability encompasses three key facets: economic, environmental, and social. However, the nascent discourse on sustainable artificial intelligence (AI) predominantly focuses on the environmental sustainability of AI, neglecting the economic and social aspects. Achieving truly sustainable AI necessitates addressing the tension between its environmental sustainability, which emphasises mitigating AI's climate impact, and its social sustainability, hinging on equitable access to AI development resources. This push for increased accessibility, however, often overlooks the environmental costs of expanding such resource usage. This position paper argues that reconciling climate awareness and resource awareness is essential to realising truly sustainable AI, and neglecting these factors fuels a global AI arms race. Applying Karl Marx's base-superstructure framework from historical materialism, we analyse how the material conditions are shaping the current AI progress and the discourse surrounding it. Further, we introduce the Climate and Resource Aware Machine Learning (CARAML) framework with actionable recommendations spanning individual, community, industry, government, and global levels to achieve sustainable AI.

2112.11279 2026-06-02 cs.LG 版本更新

Differential Parity: Relative Fairness Between Two Sets of Decisions

差分奇偶性:两组决策之间的相对公平性

Zhe Yu, Xiaoyin Xi, Pranam Prakash Shetty

发表机构 * Rochester Institute of Technology(罗切斯特理工学院)

AI总结 本文提出差分奇偶性概念,通过比较两组决策对敏感属性的独立性来评估相对公平性,避免了绝对公平定义的模糊性,并可在有或无参考集时分别作为群体公平度量或揭示偏好/偏见。

Comments Accepted by JAIR

详情
AI中文摘要

随着AI系统广泛应用于辅助人类决策过程,如人才招聘、学校录取和贷款审批,确保决策公平的需求日益增长。分析决策公平性的一个主要挑战是标准高度主观且依赖上下文——对于每个场景而言,绝对公平的含义并无共识。这并非说不同的公平标准经常相互冲突。为了绕过这个问题,本文旨在测试决策中的相对公平性。也就是说,我们不定义什么是“绝对”公平的决策,而是提出通过差分奇偶性——两组决策之间的差异应独立于某个敏感属性——来测试一组决策相对于另一组的相对公平性。这一提出的差分奇偶性公平概念具有以下优点:(1) 避免了绝对公平决策定义的模糊性和矛盾性;(2) 当存在参考集(真实标签或可靠公平决策)时,差分奇偶性可作为新的群体公平概念(类似于分离性和充分性,但有所不同);(3) 即使没有参考集,它也能揭示不同决策集之间的相对偏好或偏见。差分奇偶性的一个局限性是它要求被比较的两组决策针对相同的数据主体做出。为了克服这一局限性,我们提出利用机器学习模型来弥合针对不同数据做出的两组决策之间的差距,并估计差分奇偶性。

英文摘要

With AI systems widely applied to assist humans in decision-making processes such as talent hiring, school admission, and loan approval; there is an increasing need to ensure that the decisions made are fair. One major challenge for analyzing fairness in decisions is that the standards are highly subjective and contextual -- there is no consensus for what absolute fairness means for every scenario. That is not to say that different fairness standards often conflict with each other. To bypass this issue, this work aims to test relative fairness in decisions. That is, instead of defining what are ``absolutely'' fair decisions, we propose to test the relative fairness of one decision set against another with differential parity -- the difference between two sets of decisions should be independent of a certain sensitive attribute. This proposed notion of differential parity fairness has the following benefits: (1) it avoids the ambiguous and contradictory definition of what absolutely fair decisions are; (2) when a reference set (of ground truth or reliable fair decisions) is available, differential parity can serve as a new group fairness notion (similar to but different from separation and sufficiency); (3) even when no reference set is available, it reveals the relative preference or bias between different decision sets. One limitation for differential parity is that it requires the two sets of decisions under comparison to be made on the same data subjects. To overcome this limitation, we propose to utilize a machine learning model to bridge the gap between the two sets of decisions made on difference data and estimate the differential parity.

2502.04646 2026-06-02 cs.LG cs.AI 版本更新

Efficient Weighted Sampling via Score-based Generative Models

基于分数生成模型的高效加权采样

Heasung Kim, Taekyun Lee, Hyeji Kim, Gustavo de Veciana

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 提出一种无需训练的加权采样框架,通过轻量级引导近似和不确定性感知调度器,在预训练分数生成模型上实现高效、稳定的采样,并在大规模设置中取得1.2至4.7倍加速。

Comments 37 pages

详情
AI中文摘要

加权采样——从与基概率密度函数和权重函数乘积成比例的概率密度函数中采样——是一种基础技术,在方差缩减、有偏采样、数据增强等领域有广泛应用。利用日益可用的预训练分数生成模型,我们提出了一种无需训练的加权采样框架,通过以原则性和计算高效的方式,用辅助引导项增强预训练基分数函数,来近似目标分布的逆向扩散过程。我们的方法基于两个关键组件:一个轻量级的引导近似,避免了分数函数和权重函数的高阶导数;以及一个不确定性感知调度器,基于近似误差的时间分析动态调整引导强度。这些组件共同实现了准确稳定的采样,无需依赖现有方法通常需要的基于粒子的重采样或Hessian评估。我们从合成设置到大规模设置(如Stable Diffusion XL)验证了方法的有效性,在该框架下,我们实现了1.2倍到4.7倍的加速,同时在任务性能上始终匹配或超越最先进的基线。这些结果使我们的方法成为生成应用中任务自适应、时间敏感采样的可扩展且推理高效的解决方案。

英文摘要

Weighted sampling -- sampling from a probability density function (PDF) proportional to the product of a base PDF and a weight function -- is a fundamental technique with wide-ranging applications in variance reduction, biased sampling, data augmentation, and more. Leveraging the increasing availability of pretrained score-based generative models (SGMs), we propose a training-free weighted sampling framework that approximates the backward diffusion process of the target distribution by augmenting the pretrained base score function with an auxiliary guidance term, in a principled and computationally efficient manner. Our approach builds on two key components: a lightweight approximation of the guidance that avoids costly higher-order derivatives of both the score and weight functions, and an uncertainty-aware scheduler that dynamically adjusts the guidance strength based on a temporal analysis of approximation error. Together, these components enable accurate and stable sampling without relying on particle-based resampling or Hessian evaluations commonly required by existing methods. We validate the effectiveness of our method from synthetic to large-scale settings such as Stable Diffusion XL, where our framework achieves $1.2\times$ to $4.7\times$ speedups while consistently matching or outperforming state-of-the-art baselines in task performance. These results position our method as a scalable and inference-efficient solution for task-adaptive, time-sensitive sampling in generative applications.

2306.15369 2026-06-02 cs.SE cs.LG 版本更新

A Meta-analytical Comparison of Naive Bayes and Random Forest for Software Defect Prediction

朴素贝叶斯与随机森林在软件缺陷预测中的元分析比较

Ch Muhammad Awais, Wei Gu, Gcinizwe Dlamini, Zamira Kholmatova, Giancarlo Succi

发表机构 * Innopolis University(因诺波利斯大学) Università di Bologna(博洛尼亚大学)

AI总结 通过系统文献综述和元分析,比较朴素贝叶斯和随机森林在召回率、F-measure和精确度上的统计差异,发现两者无显著差异。

Comments 11 pages, 8 figures, Conference Paper

详情
Journal ref
Intelligent Systems Design and Applications. ISDA 2022. Lecture Notes in Networks and Systems, vol 716
AI中文摘要

朴素贝叶斯和随机森林在预测软件缺陷的召回率、F-measure和精确度方面是否存在统计差异?通过利用系统文献综述和元分析,我们回答了这个问题。我们通过建立搜索和选择论文的标准进行了系统文献综述,最终得到五项研究。之后,利用五篇选定论文的元数据和森林图,我们进行了元分析来比较这两个模型。结果表明,没有显著的统计证据表明朴素贝叶斯在召回率、F-measure和精确度方面与随机森林表现不同。

英文摘要

Is there a statistical difference between Naive Bayes and Random Forest in terms of recall, f-measure, and precision for predicting software defects? By utilizing systematic literature review and meta-analysis, we are answering this question. We conducted a systematic literature review by establishing criteria to search and choose papers, resulting in five studies. After that, using the meta-data and forest-plots of five chosen papers, we conducted a meta-analysis to compare the two models. The results have shown that there is no significant statistical evidence that Naive Bayes perform differently from Random Forest in terms of recall, f-measure, and precision.

2111.03861 2026-06-02 cs.CV cs.AI cs.LG 版本更新

What augmentations are sensitive to hyper-parameters and why?

哪些数据增强对超参数敏感以及为什么?

Ch Muhammad Awais, Imad Eddine Ibrahim Bekkouch

发表机构 * Knowledge Representation Lab Innopolis University(知识表示实验室 印尼奥利普斯大学) Sorbonne Center for Artificial Intelligence - SCAI Sorbonne University(索邦人工智能中心 - SCAI 索邦大学)

AI总结 本研究通过局部代理(LIME)解释和线性回归系数评估不同数据增强对模型超参数的敏感性、一致性和影响,发现某些增强对超参数高度敏感,而另一些则更稳健可靠。

Comments 10 pages, 17 figures

详情
Journal ref
Intelligent Computing: Proceedings of the 2022 Computing Conference
AI中文摘要

我们对数据集应用增强以提高预测质量,并使最终模型对噪声数据和领域漂移更具鲁棒性。然而,问题仍然存在:这些增强在不同的超参数下表现如何?在本研究中,我们通过执行局部代理(LIME)解释来评估增强对模型超参数的敏感性、一致性和影响,当不同增强应用于机器学习模型时,解释超参数的影响。我们利用线性回归系数来加权每个增强。我们的研究证明,有些增强对超参数高度敏感,而其他增强则更具鲁棒性和可靠性。

英文摘要

We apply augmentations to our dataset to enhance the quality of our predictions and make our final models more resilient to noisy data and domain drifts. Yet the question remains, how are these augmentations going to perform with different hyper-parameters? In this study we evaluate the sensitivity of augmentations with regards to the model's hyper parameters along with their consistency and influence by performing a Local Surrogate (LIME) interpretation on the impact of hyper-parameters when different augmentations are applied to a machine learning model. We have utilized Linear regression coefficients for weighing each augmentation. Our research has proved that there are some augmentations which are highly sensitive to hyper-parameters and others which are more resilient and reliable.

2501.08640 2026-06-02 cs.LG stat.ML 版本更新

Quantum Reservoir Computing and Risk Bounds

量子储层计算与风险界

Naomi Mona Chmielewski, Nina Amini, Joseph Mikael

发表机构 * EDF Lab, France(法国EDF实验室)

AI总结 利用Rademacher复杂度对量子储层计算中的泛化误差进行界定,并分析其随量子比特数增长的标度行为。

详情
AI中文摘要

我们提出了一种利用Rademacher复杂度来界定几类量子储层泛化误差的方法。我们给出了两个特定量子储层类别的具体参数依赖界。我们分析了泛化界随量子比特数增长的标度行为。将我们的结果应用于具有多项式读出函数的类别,我们发现风险界在训练样本数量上收敛。我们的界中对量子储层和读出参数的显式依赖可用于在一定程度上控制泛化误差。需要注意的是,这些界随量子比特数n呈指数增长。Rademacher复杂度的上界可应用于满足量子动力学和读出函数若干假设的其他储层类别。

英文摘要

We propose a way to bound the generalisation errors of several classes of quantum reservoirs using the Rademacher complexity. We give specific, parameter-dependent bounds for two particular quantum reservoir classes. We analyse how the generalisation bounds scale with growing numbers of qubits. Applying our results to classes with polynomial readout functions, we find that the risk bounds converge in the number of training samples. The explicit dependence on the quantum reservoir and readout parameters in our bounds can be used to control the generalisation error to a certain extent. It should be noted that the bounds scale exponentially with the number of qubits n. The upper bounds on the Rademacher complexity can be applied to other reservoir classes that fulfill a few hypotheses on the quantum dynamics and the readout function.

2412.19444 2026-06-02 cs.LG math.OC stat.ML 版本更新

Towards Simple and Provable Parameter-Free Adaptive Gradient Methods

迈向简单且可证明的无参数自适应梯度方法

Yuanzhe Tao, Yifeng Liu, Huizhuo Yuan, Xun Zhou, Yuan Cao, Quanquan Gu

发表机构 * School of Mathematical Sciences, Peking University(北京大学数学科学学院) Department of Computer Science, University of California, Los Angeles(加州大学洛杉矶分校计算机科学系) School of Computing and Data Science, the University of Hong Kong(香港大学计算科学与数据科学学院) Bytedance Inc(字节跳动公司)

AI总结 提出 AdaGrad++ 和 Adam++ 两种简单无参数自适应梯度方法,在无需预设学习率的情况下实现与 AdaGrad 和 Adam 相当的收敛保证。

Comments 45 pages, 19 figures, 3 tables

详情
AI中文摘要

诸如 AdaGrad 和 Adam 等优化算法通过在优化过程中动态调整学习率,显著推进了深度模型的训练。然而,学习率的临时调整带来了挑战并导致实际中的低效。为解决此问题,近期研究聚焦于开发无需学习率调整即可有效运行的“无参数”算法。尽管有这些努力,现有的 AdaGrad 和 Adam 无参数变体往往过于复杂且/或缺乏正式的收敛保证。在本文中,我们提出了 AdaGrad++ 和 Adam++,这是 AdaGrad 和 Adam 的新型简单无参数变体,具有收敛保证。我们证明 AdaGrad++ 在凸优化中无需预设学习率假设即可达到与 AdaGrad 相当的收敛速率。类似地,Adam++ 在不依赖任何学习率条件的情况下匹配 Adam 的收敛速率。跨多种深度学习任务的实验结果验证了 Adam++ 的竞争性能。

英文摘要

Optimization algorithms such as AdaGrad and Adam have significantly advanced the training of deep models by dynamically adjusting the learning rate during the optimization process. However, ad-hoc tuning of learning rates poses a challenge and leads to inefficiencies in practice. To address this issue, recent research has focused on developing ``parameter-free'' algorithms that operate effectively without the need for learning rate tuning. Despite these efforts, existing parameter-free variants of AdaGrad and Adam tend to be overly complex and/or lack formal convergence guarantees. In this paper, we present AdaGrad++ and Adam++, novel and simple parameter-free variants of AdaGrad and Adam with convergence guarantees. We prove that AdaGrad++ achieves comparable convergence rates to AdaGrad in convex optimization without predefined learning rate assumptions. Similarly, Adam++ matches the convergence rate of Adam without relying on any conditions on the learning rates. Experimental results across various deep learning tasks validate the competitive performance of Adam++.

2412.19419 2026-06-02 cs.LG cs.AI 版本更新

Introduction to Graph Neural Networks for Machine Learning Engineers

面向机器学习工程师的图神经网络导论

James H. Tanis, Chris Giannella, Adrian V. Mariano, Daoud Meerzaman

发表机构 * The MITRE Corporation(MITRE公司) National Cancer Institute(国家癌症研究所)

AI总结 本文通过编码器-解码器框架介绍图神经网络,并通过同质图上的理论和实验分析不同训练规模和复杂度下的行为,重点讨论过平滑和过挤压问题。

Comments Author accepted manuscript. Title and metadata updated to match the published ACM Computing Surveys version. 73 pages, including references and supplementary material

详情
AI中文摘要

图神经网络是专为节点或边带有属性的图设计的深度神经网络。由于其在广泛任务上的出色表现,文献中关于这些模型的研究论文数量正在快速增长。本综述通过编码器-解码器框架介绍图神经网络,并提供了一系列图分析任务的解码器示例。它利用理论和对同质图的大量实验,展示了图神经网络在不同训练规模和图复杂度下的行为,重点强调了过平滑和过挤压现象。

英文摘要

Graph neural networks are deep neural networks designed for graphs with attributes attached to nodes or edges. The number of research papers in the literature concerning these models is growing rapidly due to their impressive performance on a broad range of tasks. This survey introduces graph neural networks through the encoder-decoder framework and provides examples of decoders for a range of graph analytic tasks. It uses theory and numerous experiments on homogeneous graphs to illustrate the behavior of graph neural networks under different training sizes and degrees of graph complexity, with an emphasis on oversmoothing and oversquashing.

2412.12036 2026-06-02 cs.LG cs.RO 版本更新

LeARN: Learnable and Adaptive Representations for Nonlinear Dynamics in System Identification

LeARN: 系统辨识中非线性动力学的可学习与自适应表示

Arunabh Singh, Joyjit Mukherjee

发表机构 * Visual Computing Lab, Indian Institute of Science(印度科学院视觉计算实验室) Department of Electrical and Electronics Engineering, BITS Pilani Hyderabad Campus(BITS Pilani Hyderabad校区电子与电气工程系)

AI总结 提出LeARN框架,通过元学习从数据中直接学习基函数库,无需领域知识,实现非线性动力学的自适应辨识,在Neural Fly数据集上达到与SINDy相当的动态误差性能。

Comments This work has been accepted at the 34th Mediterranean Conference on Control and Automation (MED 2026)

详情
AI中文摘要

系统辨识是从观测的输入-输出数据中推导动态系统数学模型的过程,随着基于学习的方法的出现,经历了范式转变。这些方法解决了非线性动态系统中数据驱动发现的复杂挑战,受到了广泛关注。其中,稀疏非线性动力学辨识(SINDy)已成为一种变革性方法,将复杂的动态行为提炼为基函数的可解释线性组合。然而,SINDy依赖领域专业知识来构建其基函数的基础“库”,限制了其适应性和通用性。在这项工作中,我们引入了一个非线性系统辨识框架LeARN,通过直接从数据中学习基函数库,超越了对先验领域知识的需求。为了增强对不同噪声条件下动态系统演变的适应性,我们采用了一种新颖的基于元学习的系统辨识方法,利用轻量级深度神经网络(DNN)动态优化这些基函数。这不仅捕捉了复杂的系统行为,还能有效适应新的动态模式。我们在Neural Fly数据集上验证了我们的框架,展示了其强大的适应和泛化能力。尽管简单,我们的LeARN在动态误差性能上与SINDy相当。这项工作朝着自主发现动态系统迈出了一步,为机器学习无需大量领域特定干预即可揭示复杂系统控制原理的未来铺平了道路。

英文摘要

System identification, the process of deriving mathematical models of dynamical systems from observed input-output data, has undergone a paradigm shift with the advent of learning-based methods. Addressing the intricate challenges of data-driven discovery in nonlinear dynamical systems, these methods have garnered significant attention. Among them, Sparse Identification of Nonlinear Dynamics (SINDy) has emerged as a transformative approach, distilling complex dynamical behaviors into interpretable linear combinations of basis functions. However, SINDy's reliance on domain-specific expertise to construct its foundational 'library' of basis functions limits its adaptability and universality. In this work, we introduce a nonlinear system identification framework LeARN that transcends the need for prior domain knowledge by learning the library of basis functions directly from data. To enhance adaptability to evolving system dynamics under varying noise conditions, we employ a novel meta-learning-based system identification approach that utilizes a light-weight Deep Neural Network (DNN) to dynamically refine these basis functions. This not only captures intricate system behaviors but also adapts effectively to new dynamical regimes. We validate our framework on the Neural Fly dataset, showcasing its robust adaptation and generalization capabilities. Despite its simplicity, our LeARN achieves competitive dynamical error performance to SINDy. This work presents a step towards autonomous discovery of dynamical systems, paving the way for a future where machine learning uncovers the governing principles of complex systems without requiring extensive domain-specific interventions.

2412.10362 2026-06-02 cs.LG cs.CV 版本更新

OP-LoRA: The Blessing of Dimensionality

OP-LoRA:维度的祝福

Piotr Teterwak, Kate Saenko, Bryan A. Plummer, Ser-Nam Lim

发表机构 * Boston University(波士顿大学) University of Central Florida(中央佛罗里达大学)

AI总结 提出OP-LoRA方法,通过额外MLP预测LoRA适配器权重以改善优化,训练后丢弃MLP,在零额外推理成本下提升性能并降低对学习率的敏感性。

详情
AI中文摘要

低秩适配器(LoRA)使得仅用少量参数即可微调大模型。然而,它们常常面临病态的损失景观,导致优化困难。先前的工作通过自定义优化器将适配器更新与全微调梯度对齐来解决这些挑战,但这些方法缺乏适应新适配器架构的灵活性,且计算成本高。我们引入了OP-LoRA,一种新颖的方法,它用额外的MLP预测的权重替换每个LoRA适配器,该MLP在训练后被丢弃。这允许在训练期间临时增加额外参数以改善优化,但比自定义优化器需要更少的墙钟时间,并且在推理时零额外成本,因为MLP被丢弃。关键的是,将OP-LoRA扩展到其他适配器只需修改每个新适配器类型的预测头大小。我们表明,OP-LoRA允许优化自适应地增加或减少步长,从而提高性能并降低对学习率的敏感性。在小型和大型LoRA微调任务中,我们观察到OP-LoRA相对于LoRA及其变体的一致性能提升。我们在图像生成中取得了特别显著的改进,OP-LoRA的CMMD分数相对于LoRA提高了多达15分。这使得OP-LoRA能够在推理参数减半的情况下达到LoRA的性能。

英文摘要

Low-rank adapters (LoRA) enable finetuning of large models with only a small number of parameters. However, they often suffer from an ill-conditioned loss landscape, leading to difficult optimization. Prior work addresses these challenges by aligning adapter updates with full finetuning gradients via custom optimizers, but these methods lack the flexibility to accommodate new adapter architectures and are computationally expensive. We instead introduce OP-LoRA, a novel method which replaces each LoRA adapter with weights predicted by an extra MLP, which is discarded after training. This temporarily allows additional parameters during training to improve optimization, yet requires less wall time than custom optimizers and zero extra cost at inference time because the MLP is discarded. Crucially, extending OP-LoRA to other adapters is as simple as modifying the size of the prediction head for each new adapter type. We show that OP-LoRA allows the optimization to adaptively increase or decrease step size, improving performance and decreasing sensitivity to learning rate. On both small and large-scale LoRA tuning tasks, we observe consistent performance gains of OP-LoRA relative to LoRA and its variants. We achieve especially notable improvements in image generation, with OP-LoRA CMMD scores improving by up to 15 points relative to LoRA. This allows OP-LoRA to achieve the performance of LoRA with half of the inference parameters.

2412.04177 2026-06-02 cs.LG stat.ML 版本更新

Fixed-Mean Gaussian Processes for Post-hoc Bayesian Deep Learning

固定均值高斯过程用于事后贝叶斯深度学习

Luis A. Ortega, Simón Rodríguez-Santana, Daniel Hernández-Lobato

发表机构 * Universidad Autónoma de Madrid, Spain(西班牙自治大学) Aalborg University, Copenhagen(哥本哈根大学) Universidad Pontificia de Comillas, Spain(Pontificia大学, Spain)

AI总结 提出固定均值高斯过程(FMGP),通过将后验均值固定为预训练DNN的输出,利用变分推断高效估计预测方差,实现架构无关的事后不确定性量化。

Comments 32 pages, 6 figures and 6 tables. Submitted to for revision

详情
AI中文摘要

近年来,对预训练深度神经网络(DNN)的预测进行事后不确定性估计的兴趣日益增加。给定通过反向传播预训练的DNN,这些方法通过添加输出置信度度量(如误差条)来增强原始网络,同时不损害其初始准确性。在此背景下,我们引入了一种新的稀疏变分高斯过程(GP)族,其中当使用通用核时,后验均值固定为任意连续函数。具体地,我们将该GP的均值固定为预训练DNN的输出,使我们的方法能够有效地拟合GP的预测方差以估计DNN预测的不确定性。我们的方法利用变分推断(VI)进行高效的随机优化,训练成本与训练点数无关,可高效扩展到ImageNet等大型数据集。所提出的方法称为固定均值GP(FMGP),与架构无关,仅依赖预训练模型的输出来调整预测方差。实验结果表明,与最先进的DNN事后贝叶斯推断方法相比,FMGP在不确定性估计和计算效率方面均有提升。

英文摘要

Recently, there has been an increasing interest in performing post-hoc uncertainty estimation about the predictions of pre-trained deep neural networks (DNNs). Given a pre-trained DNN via back-propagation, these methods enhance the original network by adding output confidence measures, such as error bars, without compromising its initial accuracy. In this context, we introduce a novel family of sparse variational Gaussian processes (GPs), where the posterior mean is fixed to any continuous function when using a universal kernel. Specifically, we fix the mean of this GP to the output of the pre-trained DNN, allowing our approach to effectively fit the GP's predictive variances to estimate the DNN prediction uncertainty. Our approach leverages variational inference (VI) for efficient stochastic optimization, with training costs that remain independent of the number of training points, scaling efficiently to large datasets such as ImageNet. The proposed method, called fixed-mean GP (FMGP), is architecture-agnostic, relying solely on the pre-trained model's outputs to adjust the predictive variances. Experimental results demonstrate that FMGP improves both uncertainty estimation and computational efficiency when compared to state-of-the-art methods for DNN post-hoc Bayesian inference.

2411.12438 2026-06-02 cs.DS cs.LG stat.ML 版本更新

Dimension Reduction via Sum-of-Squares and Improved Clustering Algorithms for Non-Spherical Mixtures

通过平方和方法的降维及非球形混合物的改进聚类算法

Prashanti Anderson, Mitali Bafna, Rares-Darius Buhai, Pravesh K. Kothari, David Steurer

发表机构 * MIT University of Washington EPFL(麻省理工学院 华盛顿大学 EPFL)

AI总结 提出基于平方和方法的降维子程序,实现非球形高斯混合物的高效聚类,显著降低样本和时间的维度依赖。

Comments 67 pages, updated to match camera-ready version at COLT 2026

详情
AI中文摘要

我们开发了一种新的方法,通过基于平方和方法的子程序,对非球形(即任意分量协方差)高斯混合模型进行聚类,该子程序能够找到输入数据的低维分离保持投影。我们的方法给出了经典降维(基于奇异值分解)的非球形类比,该经典降维在众多应用中构成著名的Vempala和Wang [VW04]球形聚类算法的关键组成部分。作为应用,我们获得了以下算法:(1) 对任意全变差分离的$k$个中心化(即零均值)高斯混合,使用$n\geq \operatorname{poly}(d) f(w_{\min}^{-1})$个样本和$\operatorname{poly}(n)$时间进行聚类;(2) 对任意全变差分离的$k$个具有相同但未知协方差的高斯混合,使用$n \geq d^{O(\log w_{\min}^{-1})} f(w_{\min}^{-1})$个样本和$n^{O(\log w_{\min}^{-1})}$时间进行聚类。这里,$w_{\min}$是输入混合的最小混合权重,$f$不依赖于维度$d$。我们的算法自然扩展到容忍与维度无关的任意异常值比例。在这项工作之前,最先进的非球形聚类算法中的技术需要$d^{O(k)} f(w_{\min}^{-1})$个样本和时间来聚类此类混合。我们的结果可能令人惊讶,因为针对非球形高斯混合聚类的$d^{Ω(k)}$统计查询和平方和下限 [DKS17, DKPP24] 通常被认为排除了该问题的$d^{o(k)}$代价算法,但我们的结果表明,对于一类非常广泛的高斯混合,这些下限实际上可以被规避。

英文摘要

We develop a new approach for clustering non-spherical (i.e., arbitrary component covariances) Gaussian mixture models via a subroutine, based on the sum-of-squares method, that finds a low-dimensional separation-preserving projection of the input data. Our method gives a non-spherical analog of the classical dimension reduction, based on singular value decomposition, that, among several other applications, forms a key component of the celebrated spherical clustering algorithm of Vempala and Wang [VW04]. As applications, we obtain an algorithm to (1) cluster an arbitrary total-variation separated mixture of $k$ centered (i.e., zero-mean) Gaussians with $n\geq \operatorname{poly}(d) f(w_{\min}^{-1})$ samples and $\operatorname{poly}(n)$ time, and (2) cluster an arbitrary total-variation separated mixture of $k$ Gaussians with identical but arbitrary unknown covariance with $n \geq d^{O(\log w_{\min}^{-1})} f(w_{\min}^{-1})$ samples and $n^{O(\log w_{\min}^{-1})}$ time. Here, $w_{\min}$ is the minimum mixing weight of the input mixture, and $f$ does not depend on the dimension $d$. Our algorithms naturally extend to tolerating a dimension-independent fraction of arbitrary outliers. Before this work, the techniques in the state-of-the-art non-spherical clustering algorithms needed $d^{O(k)} f(w_{\min}^{-1})$ samples and time for clustering such mixtures. Our results may come as a surprise in the context of the $d^{Ω(k)}$ statistical query and sum-of-squares lower bounds [DKS17, DKPP24] for clustering non-spherical Gaussian mixtures. While these results are usually thought to rule out $d^{o(k)}$ cost algorithms for the problem, our results show that the lower bounds can in fact be circumvented for a remarkably general class of Gaussian mixtures.

2411.11793 2026-06-02 cs.LG 版本更新

Nonlinear Equilibrium Transitions in a Potential Game Model for Federated Learning

联邦学习中势博弈模型的非线性均衡转变

Kang Liu, Ziqi Wang, Enrique Zuazua

发表机构 * Institut de Mathématiques de Bourgogne, Université Bourgogne Europe, CNRS(布尔格ogne数学研究所,布尔格ogne欧洲大学,国家科学研究中心) Chair for Dynamics, Control, Machine Learning and Numerics – Alexander von Humboldt Professorship, Department of Mathematics, Friedrich-Alexander-Universität Erlangen-Nürnberg(动力学、控制、机器学习和数值计算主席职位——亚历山大·冯·洪堡教授职位,数学系,弗里德里希-亚历山大-埃朗根-纽伦堡大学)

AI总结 提出势博弈框架研究联邦学习中客户理性选择训练努力的行为,发现纳什均衡随奖励因子非线性变化并在临界值处发生非光滑转变,证明了最佳响应算法的收敛性,并通过实验验证了临界奖励因子的有效性。

Comments Accepted for publication in Physica D: Nonlinear Phenomena

详情
AI中文摘要

在联邦学习(FL)中,中央服务器通常将训练任务分配给客户端。然而,从市场导向的角度来看,客户端可能基于理性自利独立选择其训练努力。为了研究这种设置,我们提出了一个势博弈框架,其中每个客户端的收益由其个人努力和服务器提供的奖励决定。奖励受所有客户端集体努力的影响,并可通过奖励因子进行调节。我们首先建立了纳什均衡(NE)的存在性,然后研究了其在静态设置中的唯一性。我们表明,NE 非线性地依赖于奖励因子,并在临界值处表现出非光滑转变,此时静态势失去严格曲率,导致 NE 不唯一以及在低努力和高努力分支之间跳跃。此外,我们证明了在我们的 FL 博弈中用于计算 NE 的最佳响应算法的收敛性。最后,我们将从 NE 导出的客户端理性努力应用于各种数据集和模型的 FL 训练,从而验证了所识别的临界奖励因子的有效性。

英文摘要

In federated learning (FL), a central server typically allocates training efforts to clients. However, from a market-oriented perspective, clients may independently choose their training efforts based on rational self-interest. To study this setting, we propose a potential game framework in which each client's payoff is determined by its individual effort and the rewards provided by the server. The rewards are influenced by the collective efforts of all clients and can be modulated by a reward factor. We first establish the existence of Nash equilibria (NEs) and then investigate their uniqueness in a stationary setting. We show that the NEs depend nonlinearly on the reward factor and exhibit a nonsmooth transition at a critical value, where the stationary potential loses strict curvature, leading to nonunique NEs and a jump between low-effort and high-effort branches. Furthermore, we prove the convergence of the best-response algorithm for computing NEs in our FL game. Finally, we apply the clients' rational efforts derived from the NEs to FL training with various datasets and models, thereby validating the effectiveness of the identified critical reward factor.

2411.11436 2026-06-02 cs.LG cs.AI 版本更新

Implicit Regularization for Multi-label Feature Selection

多标签特征选择的隐式正则化

Dou El Kefel Mansouri, Khalid Benabdeslem, Seif-Eddine Benkabou

AI总结 针对多标签学习中的特征选择问题,提出一种基于隐式正则化和标签嵌入的估计器,通过Hadamard积参数化避免显式正则化项的额外偏差,实验表明该方法可减少偏差并可能导致良性过拟合。

Comments 14 pages, 11 figures, Submitted for publication and currently under review

详情
AI中文摘要

本文通过使用一种基于隐式正则化和标签嵌入的新估计器,解决了多标签学习背景下的特征选择问题。与使用带有显式正则化项(如$l_{2,1}$-范数、MCP或SCAD)的惩罚估计器的稀疏特征选择方法不同,我们提出了一种通过Hadamard积参数化的简单替代方法。为了指导特征选择过程,采用了一种多标签信息潜在语义方法作为标签嵌入。在一些已知基准数据集上的实验结果表明,所提出的估计器遭受的额外偏差要小得多,并且可能导致良性过拟合。

英文摘要

In this paper, we address the problem of feature selection in the context of multi-label learning, by using a new estimator based on implicit regularization and label embedding. Unlike the sparse feature selection methods that use a penalized estimator with explicit regularization terms such as $l_{2,1}$-norm, MCP or SCAD, we propose a simple alternative method via Hadamard product parameterization. In order to guide the feature selection process, a latent semantic of multi-label information method is adopted, as a label embedding. Experimental results on some known benchmark datasets suggest that the proposed estimator suffers much less from extra bias, and may lead to benign overfitting.

2411.05196 2026-06-02 cs.AI cs.DL cs.LG 版本更新

Explainable AI Through a Democratic Lens: DhondtXAI for D'Hondt-Projected Feature Attribution

通过民主视角的可解释AI:用于D'Hondt投影特征归因的DhondtXAI

Turker Berk Donmez

发表机构 * Sakarya University of Applied Sciences(萨卡里亚应用科学大学)

AI总结 提出DhondtXAI,一种基于D'Hondt规则的独立于SHAP的表格数据可解释性框架,通过计算背景干预移除效应、分离正负证据、形成特征联盟并分配席位,实现特征归因,在合成数据和医疗数据集上验证了其与SHAP的高度一致性。

详情
AI中文摘要

本研究提出DhondtXAI,作为一种独立于SHAP、基于D'Hondt的表格可解释AI归因框架。DhondtXAI不依赖于模型原生特征重要性或SHAP值,而是计算背景干预移除效应,分离正负证据,形成可选的特征联盟,应用可选的阈值,通过D'Hondt规则分配席位,并投影到局部模型输出差异上。通过构造保持完整性,投影残差比作为诊断指标报告。该方法在合成加性和交互测试、相关特征扰动、算子和分配消融、投影模式比较、logit尺度检查、重复分割验证、配对删除测试以及两个医疗数据集(威斯康星诊断乳腺癌(CatBoost)和早期糖尿病风险预测(XGBoost))上进行了评估。SHAP仅作为外部比较器,设置对齐。在加性合成数据中,DhondtXAI精确恢复真实排名;在乘法交互中,联盟将平均投影残差从0.2527降至0.0001。在WDBC和糖尿病数据上,与SHAP高度一致(Spearman rho分别为0.9273和0.9353),并通过进一步的符号、top-k、幅度、删除和敏感性分析得到支持。结果表明,DhondtXAI是一种互补的比例性、联盟感知和阈值感知的表格可解释AI方法,而非SHAP或LIME的替代品。

英文摘要

This study presents DhondtXAI as a SHAP-independent, D'Hondt-based attribution framework for tabular XAI. Instead of model-native feature importance or SHAP values, DhondtXAI computes background-interventional removal effects, separates positive and negative evidence, forms optional feature alliances, applies optional thresholds, allocates seats via the D'Hondt rule, and projects onto the local model-output difference. Completeness is preserved by construction, with the projection residual ratio reported as a diagnostic. The method is evaluated on synthetic additive and interaction tests, correlated-feature perturbations, operator and apportionment ablations, projection-mode comparisons, logit-scale checks, repeated split validation, paired deletion tests, and two healthcare datasets: Wisconsin Diagnostic Breast Cancer (CatBoost) and early-stage diabetes risk prediction (XGBoost). SHAP serves only as an external comparator with aligned settings. In additive synthetics, DhondtXAI exactly recovers ground-truth rankings; in multiplicative interactions, alliances reduce the mean projection residual from 0.2527 to 0.0001. On WDBC and diabetes data, it shows high agreement with SHAP (Spearman rho = 0.9273 and 0.9353), supported by further signed, top-k, magnitude, deletion, and sensitivity analyses. Results position DhondtXAI as a complementary proportional, alliance-aware, and threshold-aware tabular XAI method, not a replacement for SHAP or LIME.

2410.21361 2026-06-02 cs.CV cs.LG 版本更新

Domain Adaptation with a Single Vision-Language Embedding

基于单一视觉-语言嵌入的域适应

Mohammad Fahes, Tuan-Hung Vu, Andrei Bursuc, Patrick Pérez, Raoul de Charette

发表机构 * Inria(法国国家信息与自动化研究所) Kyutai(Kyutai公司)

AI总结 提出一种利用单一视觉-语言(VL)嵌入进行域适应的框架,通过提示/照片驱动的实例归一化(PIN)挖掘多种视觉风格,实现零样本和单样本无监督域适应,在语义分割任务上优于基线方法。

Comments International Journal of Computer Vision (IJCV 2026)

详情
AI中文摘要

域适应在计算机视觉中已被广泛研究,但仍需要在训练时访问目标数据,这在现实世界的自动驾驶场景中可能难以获得,尤其是在罕见或恶劣条件下。本文提出了一种新的域适应框架,该框架依赖于单一的视觉-语言(VL)潜在嵌入,而不是完整的目标数据。首先,利用对比语言-图像预训练模型(CLIP),我们提出了提示/照片驱动的实例归一化(PIN)。PIN是一种特征增强方法,通过优化低级源特征的仿射变换,使用单一的目标VL潜在嵌入挖掘多种视觉风格。VL嵌入可以来自描述目标域的语言提示、部分优化的语言提示或单一未标记的目标图像。其次,我们表明这些挖掘的风格(即增强)可用于零样本(即无目标)和单样本无监督域适应。在真实世界驾驶数据集(包括Cityscapes和ACDC(恶劣条件))上的语义分割实验证明了所提出方法的有效性,在实用的零样本和单样本设置中优于相关基线。

英文摘要

Domain adaptation has been extensively investigated in computer vision but still requires access to target data at the training time, which might be difficult to obtain in real-world autonomous driving scenarios, especially under rare or adverse conditions. In this paper, we present a new framework for domain adaptation relying on a single Vision-Language (VL) latent embedding instead of full target data. First, leveraging a contrastive language-image pre-training model (CLIP), we propose prompt/photo-driven instance normalization (PIN). PIN is a feature augmentation method that mines multiple visual styles using a single target VL latent embedding, by optimizing affine transformations of low-level source features. The VL embedding can come from a language prompt describing the target domain, a partially optimized language prompt, or a single unlabeled target image. Second, we show that these mined styles (i.e., augmentations) can be used for zero-shot (i.e., target-free) and one-shot unsupervised domain adaptation. Experiments on semantic segmentation in real-world driving datasets, including Cityscapes and ACDC (adverse conditions), demonstrate the effectiveness of the proposed method, which outperforms relevant baselines in the practical zero-shot and one-shot settings.

2410.09737 2026-06-02 cs.LG 版本更新

Towards Stable, Globally Expressive Graph Representations with Laplacian Eigenvectors

利用拉普拉斯特征向量实现稳定且全局表达性的图表示

Junru Zhou, Cai Zhou, Xiyuan Wang, Pan Li, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) Department of EECS, Massachusetts Institute of Technology(麻省理工学院电子工程与计算机科学系) School of ECE, Georgia Institute of Technology(佐治亚理工学院电子与计算机工程系)

AI总结 提出一种利用可学习的O(p)-不变表示和平滑处理数值接近特征值的方法,以增强图神经网络中拉普拉斯特征向量的稳定性和全局表达性。

详情
AI中文摘要

提高图神经网络(GNN)表达能力的一种流行方法是使用拉普拉斯特征向量作为额外的节点特征,因为它们既可以作为结构标识符,也可以作为节点的全局坐标。正确处理特征向量之间的正交群对称性对于拉普拉斯特征向量增强的GNN的稳定性和泛化能力至关重要。先前的研究表明,对每个$p$维特征空间使用朴素的$O(p)$-群不变编码器通常会导致表达性损失和数值不稳定性。在本文中,我们提出了一种利用拉普拉斯特征向量生成\emph{稳定}且全局\emph{表达性}的图表示的新方法。与先前工作的主要区别在于:(i)我们的方法对每个维度为$p$的拉普拉斯特征空间利用 extbf{可学习的}$O(p)$-不变表示,这些表示建立在文献中已充分研究的强大正交群等变神经网络层之上;(ii)我们的方法以 extbf{平滑}的方式处理数值接近的特征值,确保其对扰动具有更好的鲁棒性。在各种图学习基准上的实验证明了我们方法的竞争性能,特别是其学习图全局属性的巨大潜力。

英文摘要

A popular way to improve the expressive power of graph neural networks (GNNs) is to use Laplacian eigenvectors as additional node features, since they can serve both as structural identifiers and global coordinates of nodes. Properly handling the orthogonal group symmetry among eigenvectors is crucial for the stability and generalizability of Laplacian eigenvector augmented GNNs. Previous studies have shown that using a naive $O(p)$-group invariant encoder for each $p$-dimensional eigenspace often leads to expressivity loss and numerical instability. In this paper, we propose a novel method exploiting Laplacian eigenvectors to generate \emph{stable} and globally \emph{expressive} graph representations. The main difference from previous works is that (i) our method utilizes \textbf{learnable} $O(p)$-invariant representations for each Laplacian eigenspace of dimension $p$, which are built upon powerful orthogonal group equivariant neural network layers already well studied in the literature, and that (ii) our method deals with numerically close eigenvalues in a \textbf{smooth} fashion, ensuring its better robustness against perturbations. Experiments on various graph learning benchmarks witness the competitive performance of our method, especially its great potential to learn global properties of graphs.

2404.13621 2026-06-02 cs.CV cs.LG cs.MM 版本更新

Attack on Scene Flow using Point Clouds

使用点云对场景流进行攻击

Haniyeh Ehsani Oskouie, Mohammad-Shahram Moin, Shohreh Kasaei

发表机构 * Sharif University of Technology(谢里弗大学) ICT Research Institute(信息与通信技术研究所)

AI总结 针对场景流网络提出白盒对抗攻击方法,在KITTI和FlyingThings3D数据集上实现平均端点误差相对下降33.7%,并揭示单维度或单颜色通道攻击的影响。

详情
AI中文摘要

深度神经网络在使用点云准确估计场景流方面取得了显著进展,这对于视频分析、动作识别和导航等许多应用至关重要。然而,这些技术的鲁棒性仍然令人担忧,特别是在面对已被证明能在许多领域欺骗最先进深度神经网络的对抗攻击时。令人惊讶的是,场景流网络对此类攻击的鲁棒性尚未得到彻底研究。为解决这一问题,本文提出了一种专门针对场景流网络的白盒对抗攻击方法。实验结果表明,生成的对抗样本在KITTI和FlyingThings3D数据集上使平均端点误差相对下降高达33.7%。研究还揭示了仅针对点云的一个维度或颜色通道的攻击对平均端点误差的显著影响。通过分析这些攻击在场景流网络及其2D光流网络变体上的成功与失败,发现光流网络具有更高的脆弱性。代码可在https://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git获取。

英文摘要

Deep neural networks have made significant advancements in accurately estimating scene flow using point clouds, which is vital for many applications like video analysis, action recognition, and navigation. The robustness of these techniques, however, remains a concern, particularly in the face of adversarial attacks that have been proven to deceive state-of-the-art deep neural networks in many domains. Surprisingly, the robustness of scene flow networks against such attacks has not been thoroughly investigated. To address this problem, the proposed approach aims to bridge this gap by introducing adversarial white-box attacks specifically tailored for scene flow networks. Experimental results show that the generated adversarial examples obtain up to 33.7 relative degradation in average end-point error on the KITTI and FlyingThings3D datasets. The study also reveals the significant impact that attacks targeting point clouds in only one dimension or color channel have on average end-point error. Analyzing the success and failure of these attacks on the scene flow networks and their 2D optical flow network variants shows a higher vulnerability for the optical flow networks. Code is available at https://github.com/aheldis/Attack-on-Scene-Flow-using-Point-Clouds.git.

2401.17010 2026-06-02 cs.CR cs.AI cs.LG 版本更新

Finetuning Large Language Models for Vulnerability Detection

微调大型语言模型用于漏洞检测

Alexey Shestov, Rodion Levichev, Ravil Mussabayev, Evgeny Maslov, Anton Cheshkov, Pavel Zadorozhny

发表机构 * Sber AI Lab(Sber AI实验室) Huawei Russian Research Institute(华为俄罗斯研究院) Satbayev University(萨特拜耶夫大学)

AI总结 本文通过微调WizardCoder模型,优化训练流程并处理类别不平衡,在漏洞检测任务上提升了ROC AUC和F1指标,展示了预训练LLM在源代码分析中的迁移学习潜力。

详情
AI中文摘要

本文介绍了微调大型语言模型(LLMs)用于检测源代码中漏洞的结果。我们利用WizardCoder(最新改进的先进LLM StarCoder),并通过进一步微调使其适应漏洞检测。为加速训练,我们修改了WizardCoder的训练过程,并研究了最优训练方案。针对负样本远多于正样本的不平衡数据集,我们还探索了不同技术以提升分类性能。微调后的WizardCoder模型在平衡和不平衡的漏洞数据集上,相比于CodeBERT类模型,在ROC AUC和F1指标上均有提升,证明了将预训练LLM用于源代码漏洞检测的有效性。关键贡献包括:微调先进的代码LLM WizardCoder、在不损害性能的前提下提高其训练速度、优化训练流程和方案、处理类别不平衡,以及在困难的漏洞检测数据集上提升性能。这展示了通过微调大型预训练语言模型进行专门源代码分析任务的迁移学习潜力。

英文摘要

This paper presents the results of finetuning large language models (LLMs) for the task of detecting vulnerabilities in source code. We leverage WizardCoder, a recent improvement of the state-of-the-art LLM StarCoder, and adapt it for vulnerability detection through further finetuning. To accelerate training, we modify WizardCoder's training procedure, also we investigate optimal training regimes. For the imbalanced dataset with many more negative examples than positive, we also explore different techniques to improve classification performance. The finetuned WizardCoder model achieves improvement in ROC AUC and F1 measures on balanced and imbalanced vulnerability datasets over CodeBERT-like model, demonstrating the effectiveness of adapting pretrained LLMs for vulnerability detection in source code. The key contributions are finetuning the state-of-the-art code LLM, WizardCoder, increasing its training speed without the performance harm, optimizing the training procedure and regimes, handling class imbalance, and improving performance on difficult vulnerability detection datasets. This demonstrates the potential for transfer learning by finetuning large pretrained language models for specialized source code analysis tasks.

2307.05213 2026-06-02 cs.LG cs.AI 版本更新

Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning

评分函数梯度估计以拓宽决策聚焦学习的适用性

Mattia Silvestri, Senne Berden, Jayanta Mandi, Ali İrfan Mahmutoğulları, Brandon Amos, Tias Guns, Michele Lombardi

发表机构 * University of Bologna(博洛尼亚大学) KU Leuven(鲁汶大学) Meta

AI总结 提出一种结合随机平滑与评分函数梯度估计的方法,无需对问题结构做特定假设,即可将决策聚焦学习扩展到非线性目标、约束中不确定参数及两阶段随机优化问题。

详情
Journal ref
Silvestri, Mattia, et al. "Score Function Gradient Estimation to Widen the Applicability of Decision-Focused Learning." Journal of Artificial Intelligence Research 85 (2026)
AI中文摘要

许多现实世界的优化问题包含在部署前未知的参数,这是由于随机性或信息缺乏(例如,配送问题中的需求或旅行时间)。在这种情况下,常见的策略是通过机器学习(ML)模型估计所述参数,这些模型以最小化预测误差为目标进行训练,然而这并不一定与下游任务级误差一致。决策聚焦学习(DFL)范式通过直接最小化任务损失(例如遗憾)来克服这一限制。由于后者对于组合问题具有非信息性梯度,最先进的DFL方法引入了能够实现训练的替代和近似。但这些方法利用了关于问题结构的特定假设(例如,凸或线性问题,仅在目标函数中的未知参数)。我们提出了一种替代方法,该方法不做此类假设,它结合了随机平滑与评分函数梯度估计,适用于任何任务损失。这为将DFL方法应用于非线性目标、问题约束中的不确定参数,甚至两阶段随机优化打开了大门。实验表明,它通常需要更多的训练周期,但在解决方案质量、可扩展性或两者方面,与专门方法相当,并且在约束中存在不确定性的困难情况下表现尤为出色。

英文摘要

Many real-world optimization problems contain parameters that are unknown before deployment time, either due to stochasticity or to lack of information (e.g., demand or travel times in delivery problems). A common strategy in such cases is to estimate said parameters via machine learning (ML) models trained to minimize the prediction error, which however is not necessarily aligned with the downstream task-level error. The decision-focused learning (DFL) paradigm overcomes this limitation by training to directly minimize a task loss, e.g. regret. Since the latter has non-informative gradients for combinatorial problems, state-of-the-art DFL methods introduce surrogates and approximations that enable training. But these methods exploit specific assumptions about the problem structures (e.g., convex or linear problems, unknown parameters only in the objective function). We propose an alternative method that makes no such assumptions, it combines stochastic smoothing with score function gradient estimation which works on any task loss. This opens up the use of DFL methods to nonlinear objectives, uncertain parameters in the problem constraints, and even two-stage stochastic optimization. Experiments show that it typically requires more epochs, but that it is on par with specialized methods and performs especially well for the difficult case of problems with uncertainty in the constraints, in terms of solution quality, scalability, or both.

2403.07008 2026-06-02 cs.LG cs.AI cs.CL stat.ME 版本更新

AutoEval Done Right: Using Synthetic Data for Model Evaluation

AutoEval 的正确做法:使用合成数据进行模型评估

Pierre Boyeau, Anastasios N. Angelopoulos, Nir Yosef, Jitendra Malik, Michael I. Jordan

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley, USA(电子工程与计算机科学系,加州大学伯克利分校) Department of Systems Immunology, Weizmann Institute of Science, Rehovot, Israel(系统免疫学系,魏茨曼科学研究所) Inria, Ecole Normale Supérieure, Paris, France(法国国家信息与自动化技术研究所,巴黎高等师范学院)

AI总结 本文提出高效且统计上无偏的算法,利用AI标记的合成数据减少模型评估所需的人工标注量,在GPT-4实验中有效样本量提升高达50%。

Comments camera-ready paper version

详情
AI中文摘要

使用人工标注的验证数据评估机器学习模型可能成本高昂且耗时。AI标记的合成数据可用于减少此目的所需的人工标注数量,这一过程称为自动评估。我们为此提出了高效且统计上无偏的算法,在保持无偏性的同时提高样本效率。这些算法在GPT-4实验中使有效人工标注样本量增加高达50%。

英文摘要

The evaluation of machine learning models using human-labeled validation data can be expensive and time-consuming. AI-labeled synthetic data can be used to decrease the number of human annotations required for this purpose in a process called autoevaluation. We suggest efficient and statistically principled algorithms for this purpose that improve sample efficiency while remaining unbiased. These algorithms increase the effective human-labeled sample size by up to 50% on experiments with GPT-4.

2404.07373 2026-06-02 eess.SY cs.LG cs.SY 版本更新

Synthesizing Neural Network Controllers with Closed-Loop Dissipativity Guarantees

具有闭环耗散性保证的神经网络控制器综合

Neelay Junnarkar, Murat Arcak, Peter Seiler

发表机构 * Department of Electrical Engineering and Computer Sciences, University of California, Berkeley(加州大学伯克利分校电子工程与计算机科学系) Department of Electrical Engineering and Computer Science, University of Michigan, Ann Arbor(密歇根大学安娜堡分校电子工程与计算机科学系)

AI总结 提出一种在保证闭环系统耗散性(如稳定性和L2增益界)的硬约束下最大化奖励的神经网络控制器综合方法,利用积分二次约束描述不确定性和激活函数,通过线性矩阵不等式和投影训练实现。

Comments Accepted to the journal Automatica, 17 pages, 9 figures

详情
AI中文摘要

本文提出了一种综合神经网络控制器的方法,在反馈系统(包括被控对象和控制器)具有耗散性的硬约束下最大化奖励,从而保证稳定性和$L_2$增益界等性质。它考虑了非线性和不确定的被控对象,将其建模为线性时不变(LTI)系统与一个包含非线性的不确定性模块的互联。被控对象的不确定性和神经网络的激活函数均使用积分二次约束(IQCs)描述。首先,推导了不确定LTI系统的耗散性条件。其次,利用该条件构造了一个线性矩阵不等式(LMI),可用于综合神经网络控制器。最后,将该凸条件用于基于投影的训练方法,以综合具有耗散性保证的神经网络控制器。通过倒立摆和带柔性杆的小车上的数值示例,证明了该方法的有效性。

英文摘要

This paper presents a method to synthesize neural network controllers to maximize reward subject to the hard constraint that the feedback system of plant and controller be dissipative, certifying requirements such as stability and $L_2$ gain bounds. It considers nonlinear and uncertain plants, modeled as the interconnection of a linear time-invariant (LTI) system and an uncertainty block, which incorporates nonlinearities. The uncertainty of the plant and the activation functions of the neural network are both described using integral quadratic constraints (IQCs). First, a dissipativity condition is derived for uncertain LTI systems. Second, this condition is used to construct a linear matrix inequality (LMI) which can be used to synthesize neural network controllers. Finally, this convex condition is used in a projection-based training method to synthesize neural network controllers with dissipativity guarantees. Numerical examples on an inverted pendulum and a flexible rod on a cart are provided to demonstrate the effectiveness of this approach.

2402.14031 2026-06-02 eess.SY cs.LG cs.SY 版本更新

Discovering Nonlinear Static Relationships in Unlabeled Dataset using Autoencoder with Ordered Variance

使用有序方差自编码器发现未标注数据集中的非线性静态关系

Midhun T. Augustine, Parag Patil, Mani Bhushan, Sharad Bhartiya

发表机构 * Automation Lab, Department of Chemical Engineering(自动化实验室,化学工程系)

AI总结 提出一种有序方差自编码器(AEO)及其残差网络扩展(RAEO),通过方差正则化项强制潜变量有序排列,实现未标注数据中非线性关系的无监督发现与静态模型提取。

Comments 14 pages, 5 figures

详情
AI中文摘要

本文提出了一种有序方差自编码器(AEO),其中传统的重建损失通过基于方差的正则化项进行增强,该正则化项促进了潜空间内的有序结构。在这种结构中,潜变量根据其在训练数据上计算的方差进行排序,有助于系统地确定潜空间维度。AEO进一步通过残差网络扩展,产生了基于ResNet的AEO(RAEO)。AEO和RAEO均能发现未标注数据集中变量间的非线性关系,从而实现无监督静态模型提取。理论贡献包括对潜变量方差排序的形式化保证。通过将框架应用于非线性稳态模型的识别及其在实时优化中的使用,以连续搅拌釜反应器过程作为代表性案例研究,展示了该框架的实际效用。

英文摘要

This paper presents an autoencoder with ordered variance (AEO), in which the conventional reconstruction loss is augmented by a variance-based regularization term that promotes an ordered structure within the latent space. In this structure, the latent variables are ordered by their variance computed over the training data, facilitating systematic determination of the latent space dimensionality. The AEO is further extended using residual networks, resulting in a ResNet-based AEO (RAEO). Both AEO and RAEO green lead to discovery of nonlinear relationships among variables in unlabeled datasets, thereby enabling unsupervised static model extraction. Theoretical contributions include formal guarantees on the ordering of latent variances. The practical utility of the framework is demonstrated through its application to the identification of nonlinear steady-state models and their use in real-time optimization, with a continuous stirred tank reactor process serving as a representative case study.

2211.15223 2026-06-02 math.AP cs.LG math.OC 版本更新

Gamma-convergence of a nonlocal perimeter arising in adversarial machine learning

对抗机器学习中出现的非局部周长的Gamma收敛

Leon Bungert, Kerrek Stinson

发表机构 * Hausdorff Center for Mathematics, University of Bonn(哈代尔夫数学中心、波恩大学)

AI总结 本文证明了一种Minkowski型非局部周长在Gamma收敛意义下趋于局部各向异性周长,该非局部模型描述了二分类中对抗训练的正则化效应,仅假设分布具有有界BV密度,并应用于总变分、对抗训练渐近性和图离散化的收敛性分析。

Comments Fixed typos, added new isotropic-anisotropic decomposition formula for limit perimeter

详情
Journal ref
Calculus of Variations and Partial Differential Equations 63 (5), 114, 2024
AI中文摘要

在本文中,我们证明了Minkowski型非局部周长在Gamma收敛意义下趋于局部各向异性周长。该非局部模型描述了二分类中对抗训练的正则化效应。该能量本质上依赖于两个分布之间的相互作用,这些分布模拟了相关类别的似然。我们克服了分布典型的严格正则性假设,仅假设它们具有有界$BV$密度。在由紧性导出的自然拓扑中,我们证明了Gamma收敛到一个加权周长,其权重由两个密度的各向异性函数决定。尽管是局部的,这个尖锐界面极限反映了对抗扰动下的分类稳定性。我们进一步应用我们的结果来推导相关总变分的Gamma收敛,研究对抗训练的渐近性,并证明非局部周长的图离散化的Gamma收敛。

英文摘要

In this paper we prove Gamma-convergence of a nonlocal perimeter of Minkowski type to a local anisotropic perimeter. The nonlocal model describes the regularizing effect of adversarial training in binary classifications. The energy essentially depends on the interaction between two distributions modelling likelihoods for the associated classes. We overcome typical strict regularity assumptions for the distributions by only assuming that they have bounded $BV$ densities. In the natural topology coming from compactness, we prove Gamma-convergence to a weighted perimeter with weight determined by an anisotropic function of the two densities. Despite being local, this sharp interface limit reflects classification stability with respect to adversarial perturbations. We further apply our results to deduce Gamma-convergence of the associated total variations, to study the asymptotics of adversarial training, and to prove Gamma-convergence of graph discretizations for the nonlocal perimeter.

2312.03644 2026-06-02 cs.LG cs.MA 版本更新

MACCA: Offline Multi-agent Reinforcement Learning with Causal Credit Assignment

MACCA: 离线多智能体强化学习中的因果信用分配

Ziyan Wang, Yali Du, Yudi Zhang, Meng Fang, Biwei Huang

发表机构 * King’s College London(伦敦国王学院) Eindhoven University of Technology(埃因霍温理工大学) University of Liverpool(利物浦大学) University of California San Diego(加州大学圣地亚哥分校)

AI总结 提出基于动态贝叶斯网络的因果信用分配框架MACCA,通过建模环境变量、状态、动作和奖励的因果关系,实现离线多智能体强化学习中准确且可解释的信用分配。

Comments 21 pages, 4 figures

详情
Journal ref
TMLR 2025
AI中文摘要

离线多智能体强化学习(MARL)在在线交互不切实际或存在风险的情况下具有重要价值。虽然MARL中的独立学习提供了灵活性和可扩展性,但在离线设置中,由于禁止与环境交互,准确地将信用分配给单个智能体面临挑战。在本文中,我们提出了一种新框架,即多智能体因果信用分配(MACCA),以解决离线MARL设置中的信用分配问题。我们的方法MACCA将生成过程表征为动态贝叶斯网络,捕获环境变量、状态、动作和奖励之间的关系。通过在离线数据上估计该模型,MACCA可以通过分析个体奖励的因果关系来学习每个智能体的贡献,确保准确且可解释的信用分配。此外,我们方法的模块化使其能够无缝集成到各种离线MARL方法中。理论上,我们证明了在离线数据集设置下,底层因果结构和用于生成智能体个体奖励的函数是可识别的,这为我们的建模正确性奠定了基础。在我们的实验中,我们证明MACCA不仅优于最先进的方法,而且在与其他骨干集成时也能提升性能。

英文摘要

Offline Multi-agent Reinforcement Learning (MARL) is valuable in scenarios where online interaction is impractical or risky. While independent learning in MARL offers flexibility and scalability, accurately assigning credit to individual agents in offline settings poses challenges because interactions with an environment are prohibited. In this paper, we propose a new framework, namely Multi-Agent Causal Credit Assignment (MACCA), to address credit assignment in the offline MARL setting. Our approach, MACCA, characterizing the generative process as a Dynamic Bayesian Network, captures relationships between environmental variables, states, actions, and rewards. Estimating this model on offline data, MACCA can learn each agent's contribution by analyzing the causal relationship of their individual rewards, ensuring accurate and interpretable credit assignment. Additionally, the modularity of our approach allows it to integrate with various offline MARL methods seamlessly. Theoretically, we proved that under the setting of the offline dataset, the underlying causal structure and the function for generating the individual rewards of agents are identifiable, which laid the foundation for the correctness of our modeling. In our experiments, we demonstrate that MACCA not only outperforms state-of-the-art methods but also enhances performance when integrated with other backbones.

2310.20545 2026-06-02 cs.LG math.OC stat.ML 版本更新

Optimizing accuracy and diversity: a multi-task approach to forecast combinations

优化准确性与多样性:一种多任务预测组合方法

Giovanni Felici, Antonio M. Sudoso

发表机构 * National Research Council(国家研究理事会) Sapienza University of Rome(罗马萨皮恩扎大学)

AI总结 提出一种基于深度学习架构的多任务优化方法,通过联合选择与组合预测模型,同时考虑准确性和多样性,提升时间序列点预测精度。

详情
Journal ref
Annals of Operations Research, 2026
AI中文摘要

我们提出了一种基于深度学习架构的多任务优化方法,用于时间序列预测。我们利用大量时间序列集合来识别可组合的预测模型权重,从而为每个序列生成预测。该方法联合处理两个任务:选择不同的预测模型及其有效组合。在此过程中,它以一种新颖的方式兼顾了预测方法的准确性和多样性。对于给定的时间序列,模型组合模块提取特征并用于优化预测方法的权重。同时,模型选择模块提取其他特征以识别用于预测的方法子集。该选择过程被构建为一个分类问题,标签表示用于序列的模型集合。这些标签通过求解一个辅助优化问题来确定,该问题为每个时间序列识别准确且多样的方法。然后,两个模块的输出被组合,整个神经网络通过梯度下降优化最小化自定义损失函数进行联合训练。在M4竞赛数据集和真实道路交通数据的大量序列上的实验结果表明,与最先进的方法相比,我们的方法提高了点预测精度。

英文摘要

We present a multi-task optimization approach based on a deep learning architecture for time series forecasting. We leverage large collections of time series to identify the weights of forecasting models that can be combined to produce forecasts for each series. This method jointly addresses two tasks: the selection of different forecasting models, and their effective combination. In doing so, it keeps into account, in an original way, both the accuracy and diversity of the forecasting methods. For a given time series, the model combination module extracts features and uses them to optimize the weights of the forecasting methods. Simultaneously, the model selection module extracts other features to identify the subset of methods to be used for the prediction. This selection process is framed as a classification problem, with the labels representing the set of models to be used for a series. These labels are determined by solving an auxiliary optimization problem that identifies accurate and diverse methods for each time series. The outputs of the two modules are then combined and the entire neural network is jointly trained by minimizing a custom loss function via gradient descent optimization. Experimental results on a large set of series from the M4 competition dataset and from real road traffic data show that our proposal enhances point forecast accuracy compared to state-of-the-art methods.

2311.00260 2026-06-02 cs.GT cs.LG 版本更新

Incentivized Collaboration in Active Learning

主动学习中的激励性协作

Lee Cohen, Han Shao

AI总结 针对多个理性代理在共同假设上协作学习标签的问题,提出一种激励性协作框架,通过设计严格个体理性协议确保代理参与协作不增加预期标签复杂度,并给出与已知可处理近似算法标签复杂度相当的协作协议。

详情
AI中文摘要

在协作主动学习中,多个代理试图从共同假设中学习标签,我们引入了一种创新的激励性协作框架。在这里,理性代理的目标是为其数据集获取标签,同时保持标签复杂度最小。我们专注于设计(严格)个体理性(IR)协作协议,确保代理通过单独行动不会降低其预期标签复杂度。我们首先证明,给定任何最优主动学习算法,在整个数据上按原样运行该算法的协作协议已经是IR的。然而,计算最优算法是NP难的。因此,我们提供了实现(严格)IR且与已知最佳可处理近似算法在标签复杂度方面相当的协作协议。

英文摘要

In collaborative active learning, where multiple agents try to learn labels from a common hypothesis, we introduce an innovative framework for incentivized collaboration. Here, rational agents aim to obtain labels for their data sets while keeping label complexity at a minimum. We focus on designing (strict) individually rational (IR) collaboration protocols, ensuring that agents cannot reduce their expected label complexity by acting individually. We first show that given any optimal active learning algorithm, the collaboration protocol that runs the algorithm as is over the entire data is already IR. However, computing the optimal algorithm is NP-hard. We therefore provide collaboration protocols that achieve (strict) IR and are comparable with the best known tractable approximation algorithm in terms of label complexity.

2309.15946 2026-06-02 cs.LG cs.AI cs.NE math.DS 版本更新

Unified Long-Term Time-Series Forecasting Benchmark

统一长期时间序列预测基准

Jacek Cyranka, Szymon Haponiuk

发表机构 * Institute of Informatics(信息学院)

AI总结 提出一个专为长期时间序列预测设计的综合数据集,通过标准化轨迹和多种模型基准测试,发现模型效果依赖于数据集,并引入改进的潜在NLinear和课程学习DeepAR模型。

详情
AI中文摘要

为了支持时间序列数据预测的机器学习方法的发展,我们提出了一个明确针对长期时间序列预测设计的综合数据集。我们整合了来自多种动态系统和真实记录的数据集集合。每个数据集通过将数据划分为具有预定回溯长度的训练和测试轨迹进行标准化。我们包含长度高达$2000$的轨迹,以确保对长期预测能力的可靠评估。为了确定在不同场景中最有效的模型,我们使用经典和最先进的模型(即LSTM、DeepAR、NLinear、N-Hits、PatchTST和LatentODE)进行了广泛的基准分析。我们的研究结果揭示了这些模型之间有趣的性能比较,突出了模型有效性的数据集依赖性。值得注意的是,我们引入了一个自定义的潜在NLinear模型,并通过课程学习阶段增强了DeepAR。两者都持续优于其原始版本。

英文摘要

In order to support the advancement of machine learning methods for predicting time-series data, we present a comprehensive dataset designed explicitly for long-term time-series forecasting. We incorporate a collection of datasets obtained from diverse, dynamic systems and real-life records. Each dataset is standardized by dividing it into training and test trajectories with predetermined lookback lengths. We include trajectories of length up to $2000$ to ensure a reliable evaluation of long-term forecasting capabilities. To determine the most effective model in diverse scenarios, we conduct an extensive benchmarking analysis using classical and state-of-the-art models, namely LSTM, DeepAR, NLinear, N-Hits, PatchTST, and LatentODE. Our findings reveal intriguing performance comparisons among these models, highlighting the dataset-dependent nature of model effectiveness. Notably, we introduce a custom latent NLinear model and enhance DeepAR with a curriculum learning phase. Both consistently outperform their vanilla counterparts.

2212.06751 2026-06-02 cs.LG cs.AI 版本更新

Speeding Up Multi-Objective Hyperparameter Optimization by Task Similarity-Based Meta-Learning for the Tree-Structured Parzen Estimator

基于任务相似性元学习加速多目标超参数优化的树形结构Parzen估计器

Shuhei Watanabe, Noor Awad, Masaki Onishi, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系) Artificial Intelligence Research Center, AIST, Tokyo, Japan(日本科学技术厅人工智能研究中心)

AI总结 提出利用任务间顶级域重叠定义的任务相似性扩展TPE采集函数到元学习设置,加速多目标超参数优化,理论分析并解决相似性局限,实验证明在表格HPO基准上达到最优性能并赢得AutoML 2022竞赛。

Comments Accpeted to IJCAI 2023

详情
AI中文摘要

超参数优化(HPO)是提升深度学习性能的关键步骤。实践者常面临多个指标间的权衡,如准确率和延迟。鉴于深度学习的高计算需求以及对高效HPO日益增长的需求,加速多目标优化变得愈发重要。尽管已有大量关于元学习用于HPO的工作,但现有方法不适用于多目标树形结构Parzen估计器(MO-TPE),这是一种简单而强大的多目标HPO算法。在本文中,我们利用任务间顶级域重叠定义的任务相似性,将TPE的采集函数扩展到元学习设置。我们还从理论上分析并解决了任务相似性的局限性。实验中,我们证明了该方法在表格HPO基准上加速了MO-TPE,并达到了最先进的性能。我们的方法还通过赢得AutoML 2022“Transformer多目标超参数优化”竞赛得到了外部验证。

英文摘要

Hyperparameter optimization (HPO) is a vital step in improving performance in deep learning (DL). Practitioners are often faced with the trade-off between multiple criteria, such as accuracy and latency. Given the high computational needs of DL and the growing demand for efficient HPO, the acceleration of multi-objective (MO) optimization becomes ever more important. Despite the significant body of work on meta-learning for HPO, existing methods are inapplicable to MO tree-structured Parzen estimator (MO-TPE), a simple yet powerful MO-HPO algorithm. In this paper, we extend TPE's acquisition function to the meta-learning setting using a task similarity defined by the overlap of top domains between tasks. We also theoretically analyze and address the limitations of our task similarity. In the experiments, we demonstrate that our method speeds up MO-TPE on tabular HPO benchmarks and attains state-of-the-art performance. Our method was also validated externally by winning the AutoML 2022 competition on "Multiobjective Hyperparameter Optimization for Transformers".

2304.10255 2026-06-02 cs.LG stat.ML 版本更新

PED-ANOVA: Efficiently Quantifying Hyperparameter Importance in Arbitrary Subspaces

PED-ANOVA:高效量化任意子空间中超参数重要性

Shuhei Watanabe, Archit Bansal, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg, Germany(弗赖堡大学计算机科学系)

AI总结 提出PED-ANOVA方法,利用Pearson散度实现任意子空间中超参数重要性的闭式计算,在保持高效性的同时准确识别关键超参数。

Comments Accepted by IJCAI2023

详情
AI中文摘要

近年来,深度学习超参数优化(HPO)的流行凸显了良好超参数(HP)空间设计在训练强模型中的作用。而设计一个好的HP空间关键依赖于理解不同HP的作用。这激发了超参数重要性(HPI)的研究,例如使用流行的功能ANOVA(f-ANOVA)方法。然而,原始的f-ANOVA公式不适用于算法设计者最相关的子空间,例如由顶级性能定义的子空间。为解决此问题,我们推导了任意子空间下f-ANOVA的新公式,并提出一种使用Pearson散度(PED)实现HPI闭式计算的算法。我们证明,这种新算法称为PED-ANOVA,能够成功识别不同子空间中的重要HP,同时计算效率极高。

英文摘要

The recent rise in popularity of Hyperparameter Optimization (HPO) for deep learning has highlighted the role that good hyperparameter (HP) space design can play in training strong models. In turn, designing a good HP space is critically dependent on understanding the role of different HPs. This motivates research on HP Importance (HPI), e.g., with the popular method of functional ANOVA (f-ANOVA). However, the original f-ANOVA formulation is inapplicable to the subspaces most relevant to algorithm designers, such as those defined by top performance. To overcome this issue, we derive a novel formulation of f-ANOVA for arbitrary subspaces and propose an algorithm that uses Pearson divergence (PED) to enable a closed-form calculation of HPI. We demonstrate that this new algorithm, dubbed PED-ANOVA, is able to successfully identify important HPs in different subspaces while also being extremely computationally efficient.

2211.14411 2026-06-02 cs.LG cs.AI 版本更新

c-TPE: Tree-structured Parzen Estimator with Inequality Constraints for Expensive Hyperparameter Optimization

c-TPE: 带不等式约束的树结构Parzen估计器用于昂贵的超参数优化

Shuhei Watanabe, Frank Hutter

发表机构 * Department of Computer Science, University of Freiburg(弗赖堡大学计算机科学系)

AI总结 提出c-TPE方法,通过修改TPE的采样和模型以处理不等式约束,在81个昂贵HPO问题上取得最佳平均排名性能。

Comments Accepted to IJCAI 2023

详情
AI中文摘要

超参数优化(HPO)对于深度学习算法的强性能至关重要,而实际应用通常在性能要求之上施加一些约束,例如内存使用或延迟。在这项工作中,我们提出了约束TPE(c-TPE),这是广泛使用的通用贝叶斯优化方法——树结构Parzen估计器(TPE)的扩展,以处理这些约束。我们提出的扩展不仅仅是现有采集函数和原始TPE的简单组合,而是包括解决导致性能不佳问题的修改。我们通过实验和理论彻底分析了这些修改,提供了关于它们如何有效克服这些挑战的见解。在实验中,我们证明c-TPE在81个带不等式约束的昂贵HPO问题上,以统计显著性在现有方法中表现出最佳平均排名性能。由于缺乏基线,我们仅在附录D中讨论了我们方法对硬约束优化的适用性。该实现现在可通过OptunaHub获得。

英文摘要

Hyperparameter optimization (HPO) is crucial for strong performance of deep learning algorithms and real-world applications often impose some constraints, such as on memory usage or latency, on top of the performance requirement. In this work, we propose constrained TPE (c-TPE), an extension of the widely-used versatile Bayesian optimization method, tree-structured Parzen estimator (TPE), to handle these constraints. Our proposed extension goes beyond a simple combination of an existing acquisition function and the original TPE, and instead includes modifications that address issues that cause poor performance. We thoroughly analyze these modifications both empirically and theoretically, providing insights into how they effectively overcome these challenges. In the experiments, we demonstrate that c-TPE exhibits the best average rank performance among existing methods with statistical significance on $81$ expensive HPO problems with inequality constraints. Due to the lack of baselines, we only discuss the applicability of our method to hard-constrained optimization in Appendix D. The implementation is now available via OptunaHub.

2303.04345 2026-06-02 cs.LG 版本更新

Federated Learning via Variational Bayesian Inference: Personalization, Sparsity and Clustering

通过变分贝叶斯推理的联邦学习:个性化、稀疏性和聚类

Xu Zhang, Wenpeng Li, Yunfeng Shao, Yonglin Liu, Kaiwen Zhou, Yinchuan Li

发表机构 * School of Artificial Intelligence, Xidian University(西安电子科技大学人工智能学院) LSEC, Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院) Noah’s Ark Lab, Huawei(华为诺亚实验室)

AI总结 提出基于变分贝叶斯推理的联邦学习方法,通过个性化、稀疏性和聚类策略解决数据异构和有限问题,实现更优性能。

Comments 18 pages, 5 figures

详情
AI中文摘要

联邦学习(FL)是一种有前景的框架,它在保护客户隐私的同时实现分布式机器学习。然而,FL因异构和有限的数据而性能下降。为了缓解这种下降,我们提出了一种新颖的个性化贝叶斯FL方法,名为pFedBayes。通过使用从服务器训练得到的全局分布作为每个客户的先验分布,每个客户通过最小化其个性化数据上的重构误差与下载的全局分布的KL散度之和来调整自己的分布。然后,我们提出了一种稀疏个性化贝叶斯FL方法,名为sFedBayes,以提高推理效率。为了克服非独立同分布数据中的极端异构性,我们提出了一种聚类贝叶斯FL模型,名为cFedbayes,通过为不同客户学习不同的先验分布。理论分析给出了这三种方法的泛化误差界,并表明所提出方法的泛化误差率在达到对数因子内达到极小极大最优性。此外,cFedBayes实现了聚类级别的泛化误差界,而不是pFedBayes中的单一统一界。大量实验表明,在异构和有限数据存在的情况下,所提出的方法在私有模型上比其他先进的个性化方法具有更好的性能。

英文摘要

Federated learning (FL) is a promising framework that models distributed machine learning while protecting the privacy of clients. However, FL suffers performance degradation from heterogeneous and limited data. To alleviate the degradation, we present a novel personalized Bayesian FL approach named pFedBayes. By using the trained global distribution from the server as the prior distribution of each client, each client adjusts its own distribution by minimizing the sum of the reconstruction error over its personalized data and the KL divergence with the downloaded global distribution. Then, we propose a sparse personalized Bayesian FL approach named sFedBayes to enhance the inference efficiency. To overcome the extreme heterogeneity in non-i.i.d. data, we propose a clustered Bayesian FL model named cFedbayes by learning different prior distributions for different clients. Theoretical analysis gives the generalization error bound of three approaches and shows that the generalization error rates of the proposed approaches achieve minimax optimality up to a logarithmic factor. Moreover, cFedBayes achieves a cluster-level generalization error bound, rather than a single uniform bound in pFedBayes. Numerous experiments demonstrate that the proposed approaches have better performance than other advanced personalized methods on private models in the presence of heterogeneous and limited data.

2301.06308 2026-06-02 cs.LG cs.AI 版本更新

Stability Analysis of Sharpness-Aware Minimization

锐度感知最小化的稳定性分析

Hoki Kim, Jinseong Park, Yujin Choi, Jaewook Lee

发表机构 * Chung-Ang University, South Korea(Chung-Ang 大学,韩国) Korea Institute for Advanced Study, South Korea(韩国高级研究院) Ulsan National Institute of Science(乌山国家科学研究院) Nanyang Technological University (NTU), Singapore(南洋理工大学(NTU),新加坡) Seoul National University, South Korea(首尔国立大学,韩国)

AI总结 研究SAM在鞍点附近的收敛不稳定性,通过动力系统理论证明鞍点成为吸引子,并发现动量与批次大小可缓解该问题。

Comments Accepted to ICML 2026

详情
AI中文摘要

锐度感知最小化(SAM)是一种训练方法,旨在寻找深度学习中的平坦最小值,从而在各个领域取得最先进的性能。SAM不是最小化当前权重的损失,而是最小化参数空间中其邻域内的最坏情况损失。在本文中,我们研究了SAM在鞍点附近的收敛不稳定性。利用动力系统的定性理论,我们解释了SAM如何陷入鞍点,并从理论上证明了在SAM动力学下鞍点可以成为吸引子。此外,通过建立SAM的扩散,我们证明了这种收敛不稳定性也可能发生在随机动力系统中。我们证明,在逃离鞍点方面,SAM扩散比普通梯度下降更差。最后,我们展示了经常被忽视的训练技巧——动量和批次大小——可能对缓解收敛不稳定性和实现高泛化性能很重要。我们的理论和实证结果通过几个著名的优化问题和基准任务的实验得到了充分验证。

英文摘要

Sharpness-aware minimization (SAM) is a training method that seeks to find flat minima in deep learning, resulting in state-of-the-art performance across various domains. Instead of minimizing the loss of the current weights, SAM minimizes the worst-case loss in its neighborhood in the parameter space. In this paper, we investigate the convergence instability of SAM near a saddle point. Using the qualitative theory of dynamical systems, we explain how SAM becomes stuck in the saddle point and theoretically prove that the saddle point can become an attractor under SAM dynamics. Additionally, we show that this convergence instability can also occur in stochastic dynamical systems by establishing the diffusion of SAM. We prove that SAM diffusion is worse than that of vanilla gradient descent in terms of saddle point escape. Finally, we demonstrate that often overlooked training tricks, momentum and batch-size, might be important to mitigate the convergence instability and achieve high generalization performance. Our theoretical and empirical results are thoroughly verified through experiments on several well-known optimization problems and benchmark tasks.

2208.12389 2026-06-02 cs.LG cs.AI 版本更新

Static Seeding and Clustering of LSTM Embeddings to Learn from Loosely Time-Decoupled Events

LSTM嵌入的静态播种与聚类以从松散时间解耦事件中学习

Christian Manasseh, Razvan Veliche, Jared Bennett, Hamilton Clouse

发表机构 * Air Force Research Lab (AFRL) Autonomy Capability Team 3 (ACT3)(美国空军研究实验室(AFRL)自主能力团队3(ACT3))

AI总结 提出通过静态数据播种LSTM生成嵌入并聚类,以改进松散时间解耦时间序列预测,在COVID-19县级病例预测中提升10日移动平均精度。

详情
AI中文摘要

人类从不同时间和地点发生的事件中学习,以预测相似的事件轨迹。我们将松散解耦时间序列(LDT)现象定义为两个或多个可能发生在不同地点和不同时间线上,但在事件性质和位置属性上具有相似性的事件。在这项工作中,我们改进了循环神经网络(RNN),特别是长短期记忆(LSTM)网络的使用,以使AI解决方案能够为LDT生成更好的时间序列预测。我们基于趋势使用时间序列之间的相似性度量,并引入表示这些趋势的嵌入。嵌入表示事件的属性,与LSTM结构结合,可以聚类以识别相似的、时间上未对齐的事件。在本文中,我们探索了从与LSTM建模的地球物理和人口现象相关的时间不变数据中播种多变量LSTM的方法。我们将这些方法应用于从COVID-19检测感染和死亡病例中得出的时间序列数据。我们使用公开的社会经济数据来播种LSTM模型,创建嵌入,以确定这种播种是否改善了病例预测。这些LSTM产生的嵌入被聚类,以识别用于预测演变时间序列的最佳匹配候选。应用这种方法,我们在美国县级疾病传播的10日移动平均预测中显示出改进。

英文摘要

Humans learn from the occurrence of events in a different place and time to predict similar trajectories of events. We define Loosely Decoupled Timeseries (LDT) phenomena as two or more events that could happen in different places and across different timelines but share similarities in the nature of the event and the properties of the location. In this work we improve on the use of Recurring Neural Networks (RNN), in particular Long Short-Term Memory (LSTM) networks, to enable AI solutions that generate better timeseries predictions for LDT. We use similarity measures between timeseries based on the trends and introduce embeddings representing those trends. The embeddings represent properties of the event which, coupled with the LSTM structure, can be clustered to identify similar temporally unaligned events. In this paper, we explore methods of seeding a multivariate LSTM from time-invariant data related to the geophysical and demographic phenomena being modeled by the LSTM. We apply these methods on the timeseries data derived from the COVID-19 detected infection and death cases. We use publicly available socio-economic data to seed the LSTM models, creating embeddings, to determine whether such seeding improves case predictions. The embeddings produced by these LSTMs are clustered to identify best-matching candidates for forecasting an evolving timeseries. Applying this method, we show an improvement in 10-day moving average predictions of disease propagation at the US County level.