arXivDaily: arXiv Daily Academic Digest (updated Monday through Friday)
cs.LG Machine Learning (357)

1. Deep Learning Architectures and Training Methods (36 papers)

2606.11251 2026-06-11 cs.LG New submission

Mechanical Field Networks: Structured Neural Dynamics for Multivariate Systems

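Xingji Cui

Affiliation: Xi'an Jiaotong University

AI summary: Proposes MF-Net, a recurrent model that represents a multivariate system as a shared field state and updates that state through a learnable relation law, achieving competitive forecasting while keeping an interpretable structure.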

Abstract

Many multivariate dynamical systems are observed only through trajectories, leaving the mechanisms governing their joint dynamics hidden. Existing approaches can impose interpretable dynamics or learn flexible state transitions, yet the resulting interaction structure is typically either specified in advance or left implicit within the learned dynamics. We introduce MF-Net, a recurrent dynamical model that represents all variables in a shared field state and updates this state through a learned relation law. Each variable carries a field component, and these components evolve jointly through a learnable mechanical transition. Here, mechanical refers to the relation-to-motion organization of the transition, where learned relations shape state-dependent flows, field responses, and motion tendencies that move the field state forward. The resulting structure is part of the rollout itself: learned relations influence how the field moves, and the same internal quantities support both forecasting and structural readout. Across known-law interaction systems, chaotic benchmarks, real neural recordings, and ecological time series, MF-Net achieves competitive short- and medium-horizon forecasting while retaining inspectable structural readout. On the 40-dimensional Lorenz--96 testbed, MF-Net achieves an eight-step $R^2$ of $0.798\pm0.018$; across five seeds, its learned relation matrix recovers the local coupling support with a local/nonlocal strength ratio of $19.80\pm1.00$ and Precision@$K$ of $1.000\pm0.000$. MF-Net provides a structure-readable dynamical modeling framework in which learned relations are trained through forward evolution and, on real data, interpreted as functional predictive couplings under appropriate observational limits.
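
The structural-recovery figures quoted for Lorenz-96 can be made concrete with a small sketch. The abstract does not define Precision@K or the local/nonlocal strength ratio, so the definitions below are assumptions: the true coupling support is read off the standard Lorenz-96 equation dx_i/dt = (x_{i+1} - x_{i-2}) x_{i-1} - x_i + F, K is taken to be the size of that support, and all function names are hypothetical.

```python
def lorenz96_support(n):
    """Off-diagonal couplings (target i, source j) in the standard Lorenz-96 ODE."""
    return {(i, (i + off) % n) for i in range(n) for off in (-2, -1, 1)}


def precision_at_k(rel, support):
    """Share of the K strongest off-diagonal entries that lie in the support.

    K = len(support) is an assumption; the abstract does not state K.
    """
    n = len(rel)
    entries = [(abs(rel[i][j]), (i, j)) for i in range(n) for j in range(n) if i != j]
    entries.sort(reverse=True)
    top = [pair for _, pair in entries[: len(support)]]
    return sum(pair in support for pair in top) / len(support)


def local_nonlocal_ratio(rel, support):
    """Mean |relation| on the support divided by mean |relation| off the support."""
    n = len(rel)
    local = [abs(rel[i][j]) for (i, j) in support]
    remote = [
        abs(rel[i][j])
        for i in range(n)
        for j in range(n)
        if i != j and (i, j) not in support
    ]
    return (sum(local) / len(local)) / (sum(remote) / len(remote))
```

On a synthetic 40-variable relation matrix whose support entries are much larger than the rest, both readouts behave as expected; a learned matrix would be passed in the same way.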

2606.11255 2026-06-11 cs.LG New submission

Bernstein-Schur Kernels: Random Features by Sketched Modulation and Radial Randomization

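Taha Bouhsine

Affiliation: Azetta AI

AI summary: Proposes a random-feature construction for the Bernstein-Schur kernel class that sketches the finite modulation and randomizes the completely monotone radial factor, yielding unbiased estimates and operator-norm bounds, with application to the yat-kernel family.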

Abstract

Bernstein--Schur kernels are products of a finite-feature kernel (one with an explicit finite-dimensional feature map) and a completely monotone shift-invariant kernel: nonstationary kernels that fall between the shift-invariant and dot-product templates random features usually exploit, so in general neither Bochner sampling nor polynomial sketching applies to the full kernel directly. We give one random-feature construction for the whole class that \emph{randomizes both factors}: it sketches the finite modulation and randomizes the completely monotone radial factor, sampling the latter's one-dimensional Bernstein--Widder scale and then applying Gaussian random Fourier features (whose frequency is still $d$-dimensional). The feature dimension is then $Dm$, set by the sketch size $m$ and the radial-draw count $D$, free of the $O(d^2)$ size of the exact modulation feature. Keeping the modulation \emph{exact} is the analyzable limit ($m\to\infty$): there we prove unbiasedness, an exact variance for the recommended flat estimator, an expected matrix-Bernstein operator-norm bound (with a matching high-probability tail) controlled by the top eigenvalues of the kernel and modulation Gram matrices together with an intrinsic dimension rather than the crude $N\max_{ij}$ entrywise route, and a deterministic relative-spectral kernel-ridge stability result. By conditioning on the sketch, the doubly-randomized estimator inherits the same intrinsic-dimension operator-norm guarantee plus a single additive sketch term, tunable by $m$ independently of $D$. The motivating instance is the biased $yat$-kernel $k_{yat,b}(w,x)=(w^\top x+b)^2/(\|w-x\|^2+\varepsilon)$, $b\ge0$, whose family span contains the inverse-multiquadric kernel by finite differences in $b$; for it the radial mixture is the IMQ spectral sampler, and one frequency per scale is variance-optimal at a fixed radial-feature budget.
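
For the yat-kernel, the radial randomization can be sketched from two standard identities: 1/(r^2 + eps) = (1/eps) E[exp(-s r^2)] for s drawn from an exponential distribution with rate eps, and exp(-s ||delta||^2) = E[cos(omega . delta)] for omega drawn from N(0, 2s I). The sketch below is a minimal illustration under those identities, not the paper's implementation: it keeps the modulation (w.x + b)^2 exact (the m -> infinity limit mentioned in the abstract), omits the sketching step entirely, and uses hypothetical function names.

```python
import math
import random


def yat_kernel(w, x, b=1.0, eps=1.0):
    """Exact biased yat-kernel (w.x + b)^2 / (||w - x||^2 + eps)."""
    dot = sum(wi * xi for wi, xi in zip(w, x))
    r2 = sum((wi - xi) ** 2 for wi, xi in zip(w, x))
    return (dot + b) ** 2 / (r2 + eps)


def sample_frequencies(d, D, eps, rng):
    """One d-dimensional Gaussian frequency per sampled scale s ~ Exp(rate=eps)."""
    freqs = []
    for _ in range(D):
        s = rng.expovariate(eps)
        std = math.sqrt(2.0 * s)  # omega ~ N(0, 2s I) gives E[cos(omega.delta)] = exp(-s |delta|^2)
        freqs.append([rng.gauss(0.0, std) for _ in range(d)])
    return freqs


def radial_features(x, freqs, eps):
    """Cos/sin random Fourier features whose inner product estimates 1 / (r^2 + eps)."""
    scale = 1.0 / math.sqrt(eps * len(freqs))
    feats = []
    for omega in freqs:
        phase = sum(o * xi for o, xi in zip(omega, x))
        feats.append(scale * math.cos(phase))
        feats.append(scale * math.sin(phase))
    return feats


def yat_estimate(w, x, freqs, b=1.0, eps=1.0):
    """Exact modulation times the random-feature estimate of the radial factor."""
    modulation = (sum(wi * xi for wi, xi in zip(w, x)) + b) ** 2
    fw = radial_features(w, freqs, eps)
    fx = radial_features(x, freqs, eps)
    return modulation * sum(a * c for a, c in zip(fw, fx))
```

With a few thousand radial draws the estimate should land close to `yat_kernel` for moderate inputs; the error shrinks at the usual Monte Carlo rate in D.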

2606.11262 2026-06-11 cs.LG cs.AI New submission

PermDoRA -- Understanding Adapter Interference in Language Models: Limits of Parameter-Space Geometry

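Gowtham Sivaramakrishnan, Sarvesha Kumar Kombaiah Seetha, Kishan Gupta Balaji, Santhosh Baradwaj Vaduvur Ranganathan

Affiliation: Independent Researcher

AI summary: Investigates whether interference in adapter composition stems from overlap in linear parameter updates. Experiments with the DoRA-RBAC framework and a geometry-aware merging strategy indicate that parameter-space geometry is not the main cause of interference; the results instead point to interactions in shared nonlinear representations.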

Comments
18 Pages, COLM 2026

Abstract

Access control in large language models (LLMs) requires modular mechanisms to enable domain-specific behavior without retraining or cross-domain interference. A common hypothesis is that interference during adapter composition arises from overlap in linear parameter updates, suggesting that enforcing orthogonality or directional independence should improve multi-domain performance. We test this hypothesis using DoRA-RBAC, a hierarchical adapter composition framework based on weight-decomposed low-rank adaptation. We compare conventional Euclidean merging with a geometry-aware Riemannian-inspired merging strategy that approximates the Fréchet mean via normalized directional averaging across multiple QA benchmarks (GPQA, PubMedQA, SimpleQA, WMDP) on LLaMA-3.1-8B and Mistral-7B. Our results show that while single-domain performance matches LoRA, geometry-aware merging provides no consistent advantage over standard averaging in multi-domain composition. Further analysis reveals that angular alignment and orthogonality of adapter updates are weak predictors of composition performance. These findings suggest that adapter interference is not governed primarily by parameter-space geometry, but is instead consistent with interactions in shared nonlinear representations.
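
The two merging strategies compared in the abstract can be contrasted on toy vectors. This is a rough sketch, not the DoRA-RBAC code: it treats each adapter update as a flat vector, and the choice to average magnitudes separately from directions is an assumption borrowed from the magnitude/direction split of weight-decomposed adaptation. Function names are hypothetical.

```python
import math


def norm(v):
    return math.sqrt(sum(x * x for x in v))


def euclidean_merge(updates):
    """Conventional merging: coordinate-wise mean of the update vectors."""
    n = len(updates)
    return [sum(col) / n for col in zip(*updates)]


def directional_merge(updates):
    """Normalized directional averaging: mean of unit directions, renormalized.

    This approximates the Frechet mean on the unit sphere when the directions
    are not too spread out; it is undefined when the unit vectors cancel.
    """
    units = [[x / norm(u) for x in u] for u in updates]
    mean_dir = euclidean_merge(units)
    length = norm(mean_dir)
    direction = [x / length for x in mean_dir]
    magnitude = sum(norm(u) for u in updates) / len(updates)  # assumed magnitude rule
    return [magnitude * x for x in direction]
```

For two orthogonal updates of norm 2, the Euclidean merge shrinks to norm sqrt(2) while the directional merge keeps norm 2; the abstract's finding is that this geometric difference does not translate into a consistent multi-domain advantage.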

2606.11275 2026-06-11 cs.LG cs.AI New submission

RoVE: Rotary Value Embeddings Attention for Relative Position-dependent Value Pathways

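Alejandro García-Castellanos, Maurice Weiler, Erik J Bekkers

Affiliations: AMLab, University of Amsterdam; MIT CSAIL

AI summary: Proposes RoVE, which makes values position-sensitive by rotating them together with keys, turning RoPE attention into attentive convolution; it outperforms RoPE on few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval.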

Abstract

Rotary Position Embeddings (RoPE) make attention scores position-relative but leave the value pathway position-blind: the message sent by a value token is the same regardless of its distance from the query. We propose RoVE, a parameter-free modification that makes values position-sensitive by rotating them simultaneously with keys, and show that it turns RoPE attention into attentive convolution. This new perspective unifies several independent formulations of the same operation across computer vision, robotics, and modern LLM architectures. Trained 124M and 354M GPT-2 models show consistent empirical gains over RoPE on few-shot in-context learning, out-of-distribution perplexity, and long-context retrieval, with the clearest improvements on tasks that require long-range aggregation.
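
One way to read "rotating values simultaneously with keys" is shown below for a single head with 2-D vectors and one rotary frequency. The abstract gives no formula, so this is an assumption-laden sketch: in particular, counter-rotating the aggregated output by the query position (so that each value message depends only on the offset u - t) is our guess at how the relative-position property is obtained, and `theta` and the function names are hypothetical.

```python
import math


def rotate(v, angle):
    """Rotate a 2-D vector counter-clockwise by `angle` radians."""
    c, s = math.cos(angle), math.sin(angle)
    return [c * v[0] - s * v[1], s * v[0] + c * v[1]]


def attention(qs, ks, vs, positions, theta=0.5, rotate_values=False):
    """Causal single-head attention with rotary scores.

    rotate_values=False is plain RoPE (position-blind values).
    rotate_values=True also rotates each value by its own position and
    counter-rotates the output by the query position (assumed RoVE reading).
    """
    outputs = []
    for t, q in enumerate(qs):
        q_rot = rotate(q, theta * positions[t])
        scores = []
        for u in range(t + 1):
            k_rot = rotate(ks[u], theta * positions[u])
            scores.append(q_rot[0] * k_rot[0] + q_rot[1] * k_rot[1])
        top = max(scores)
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        out = [0.0, 0.0]
        for u, e in enumerate(exps):
            v = rotate(vs[u], theta * positions[u]) if rotate_values else vs[u]
            out[0] += e / total * v[0]
            out[1] += e / total * v[1]
        if rotate_values:
            out = rotate(out, -theta * positions[t])
        outputs.append(out)
    return outputs
```

Under this reading, shifting every position by the same constant leaves the outputs unchanged, while the outputs differ from plain RoPE whenever a token attends to an earlier one.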

2606.11341 2026-06-11 cs.LG cs.RO New submission

Energy-Conserved Neural Pipelines: Attenuating Error Propagation in Modular Neural Networks via Physical Conservation Constraints

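David Young, Swan Yi Htet

Affiliation: ORION Robotics

AI summary: Proposes enforcing energy conservation between modules (keeping the L2 norm of feature vectors unchanged) as a hard constraint; experiments show the method substantially outperforms baselines under several noise types, with depth invariance and a theoretical guarantee.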

Comments
22 pages, 2 figures, 7 tables, 25 references

Abstract

Modular neural network pipelines suffer from error compounding: noise at any module boundary propagates and potentially amplifies through subsequent modules. We introduce energy conservation as a hard physical constraint on inter-module information flow. Activation energy (the squared L2 norm of feature vectors) is enforced to be exactly preserved at every module boundary. Unlike soft energy penalties, conservation is an inviolable law: the network may redistribute energy across neurons but cannot create or destroy it. Four experiments on CIFAR-10 demonstrate: (1) conservation retains 77.4% of clean accuracy at noise sigma=0.2, versus 35.1% for baselines and 30.9% for energy-penalized models (p<0.001, 5 seeds); (2) pipelines become depth-invariant, retaining 93.3% at depths 2 through 5 with noise at every boundary; (3) the advantage generalizes to systematic bias (+45.1%), Gaussian (+40.4%), and adversarial noise (+4.8%), with a principled non-effect on dropout (-0.3%); (4) on ResNet-18, the conservation advantage scales inversely with intrinsic normalization: +0.3 pp with BatchNorm, +26.2 pp without at sigma=0.2, reaching +58.0 pp at sigma=0.5. Experiment 5 validates the operator on a real modular robotic pipeline (MuJoCo physics, Franka Panda). Across three independent runs on separate machines (90 trials per cell), conservation provides +18.9 pp average advantage on monocular-depth-style noise. A formal bound proves conserved noise energy is strictly less than input noise energy.
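
A hard conservation constraint of this kind can be sketched as a rescaling at the module boundary. The abstract does not spell out the operator, so the version below assumes the simplest choice: the module output is rescaled so that its squared L2 norm equals that of the module input. Function names are hypothetical.

```python
import math


def energy(v):
    """Activation energy: squared L2 norm of a feature vector."""
    return sum(x * x for x in v)


def conserve_energy(module_input, module_output, tiny=1e-12):
    """Rescale the output so its energy matches the input energy exactly.

    The direction of the output (how energy is distributed across neurons)
    is untouched; only the overall scale changes. An all-zero output is
    returned unchanged because it has no direction to rescale.
    """
    e_out = energy(module_output)
    if e_out < tiny:
        return list(module_output)
    scale = math.sqrt(energy(module_input) / e_out)
    return [scale * x for x in module_output]
```

Unlike a soft penalty added to the loss, this operation holds by construction for every input, which is the distinction the abstract draws.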

2606.11391 2026-06-11 cs.LG New submission

Recursive Binding on a Budget: Subspace Carving in Order-p Tensor Memories

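Travis Pence, Daisuke Yamada, Vikas Singh

Affiliation: University of Wisconsin-Madison

AI summary: Proposes Orthogonal Subspace Carving (OSC), which binds fillers to roles by projecting them onto the null space of the role basis; a fixed-order tensor memory then supports deep recursive binding and improves efficiency in high-superposition settings at constant memory.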

Comments
24 pages, 12 figures, 7 tables

Abstract

Tensor Product Representations provide the structural fidelity required for symbolic reasoning in models but suffer from exponential dimensionality growth when encoding deep recursive structures. Conversely, Vector Symbolic Architectures maintain constant dimensionality but sacrifice capacity and fidelity due to noisy compression via superposition. In this work, we propose Orthogonal Subspace Carving (OSC), a memory architecture that binds fillers to roles by projecting onto the null space of the role basis before aggregating into a fixed order-p tensor. OSC uses projections to enforce geometric orthogonality between bound structures within a static memory trace. We show that this mechanism decouples the tensor order from the structural depth, enabling deep recursive binding within a constant memory footprint. By performing retrieval via recognition, this construction allows for component vectors that are orders of magnitude smaller than the memory tensor, giving superior memory efficiency in settings involving high superposition. We also show that TPR is a special case of binding in Clifford algebra, and give a Clifford formulation of OSC.
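
The projection step named in the abstract can be illustrated in isolation. The sketch below only shows how a filler vector is projected onto the null space of a set of role vectors (the orthogonal complement of their span); the aggregation into an order-p tensor and retrieval via recognition are not covered, and the function names are hypothetical.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def orthonormalize(vectors, tol=1e-10):
    """Gram-Schmidt: orthonormal basis for the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        length = math.sqrt(dot(w, w))
        if length > tol:  # skip vectors already in the span
            basis.append([wi / length for wi in w])
    return basis


def project_to_null_space(filler, roles):
    """Remove from `filler` every component lying in the span of `roles`."""
    out = list(filler)
    for b in orthonormalize(roles):
        c = dot(out, b)
        out = [oi - c * bi for oi, bi in zip(out, b)]
    return out
```

The projected filler is orthogonal to every role vector, and projecting a second time changes nothing, which is what makes the operation a projection.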

2606.11518 2026-06-11 cs.LG cs.AI New submission

SirenFNO: Efficient and Full Frequency Learning of Fourier Neural Operators

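Pengqing Shi, Jie Yin, Stephen Tierney, Junbin Gao

Affiliation: The University of Sydney

AI summary: Proposes SirenFNO, a framework that uses sinusoidal representation networks to learn implicit neural representations for mode-wise kernel parameterization, removing frequency truncation and enabling full-spectrum learning; it improves performance on multiple PDE benchmarks with up to 73 times fewer parameters.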

Comments
9 pages, accepted by IJCAI 2026

Abstract

Fourier neural operators (FNOs) are effective and efficient surrogates for approximating solutions of PDEs and generalize across discretizations. However, owing to the reliance on frequency truncation to maintain learning efficiency of FNOs, empirical studies suggest that FNOs exhibit spectral bias toward low-frequency information, which may hinder the learning capability especially for certain PDEs with strong high-frequency oscillations. To address this limitation, we propose SirenFNO, a novel framework that leverages sinusoidal representation networks (SIRENs) to learn implicit neural representations and performs mode-wise kernel parameterization. Our SIREN parameterization learns a full-grid spectrum with a constant and discretization-independent parameter count, thereby eliminating the need for frequency truncation. We further extend SirenFNO with functional tensor decompositions to enhance parameter and learning efficiency. Empirical results show that our SirenFNO consistently outperforms FNO with approximately $4$ to $15$ times parameter reductions with preserved discretization invariance, and our functional decomposition variants obtain performance improvements with a maximum of $73$ times fewer parameters across multiple PDE benchmarks.

2606.11585 2026-06-11 cs.LG cs.CL nlin.AO New submission

Kuramoto Attention: Synchronizing Self-Attention on the Torus

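Joshua Nunley

Affiliation: Department of Informatics, Luddy School of Informatics, Computing, and Engineering, Cognitive Science Program, Indiana University Bloomington

AI summary: Proposes a Kuramoto attention layer that treats hidden coordinates as angles and implements self-attention through gated cosine similarity and circular-mean updates, which is equivalent to the Kuramoto coupling term; on character-level language modeling it comes close to a strong baseline.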

Comments
13 pages, 2 figures, 3 tables

Abstract

We introduce Kuramoto attention, a self-attention layer in which each hidden coordinate is an angle. The layer scores tokens by gated cosine similarity, attends over previous phase states, and updates each token by the tangent component of the attention-weighted circular mean. Because the values are the raw phase states, this update is exactly the Kuramoto coupling term $\sum_u A_{t,u}\sin(\theta_u-\theta_t)$, with the attention matrix acting as an adaptive, content-dependent coupling kernel. Equivalently, the gated score is a learned metric on the torus that selects which tokens couple, and the update pulls each token toward the circular mean of the tokens it selects, tightening their phase agreement. The same two ingredients, an invariant similarity score and an on-manifold mean, define such a layer on any compact group; the torus is the abelian case, where both are closed-form. The softmax weights solve an entropy-regularized phase-retrieval problem, and rotary position enters as a position-dependent phase drift in the score. On enwiki8 character-level language modeling, the layer trains as a functional language model whose bits-per-character stays close to a strong matched RoPE+SwiGLU transformer: within $0.02$ BPC at one million parameters ($1.637\pm0.010$ versus $1.616\pm0.004$) and level on the median at five million ($1.448$ versus $1.452$ over five seeds) with the transformer ahead on the mean ($1.468$ versus $1.456$). These experiments establish that the constrained geometric structure is a viable language model at this scale; the structure itself, and its synchronization reading, is the contribution. Ablations isolate the load-bearing components, and the result gives a compact bridge between self-attention and phase synchronization.
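
The coupling term in the abstract, sum_u A_{t,u} sin(theta_u - theta_t), can be checked numerically against its circular-mean reading. The sketch below is a simplified illustration: the learned gate is omitted (plain cosine similarity is used for the scores), and the temperature, step size, and function names are assumptions rather than details from the paper.

```python
import cmath
import math


def causal_weights(phases, t, temperature=1.0):
    """Softmax over tokens u <= t of the ungated cosine similarity of phase vectors."""
    scores = [
        sum(math.cos(a - b) for a, b in zip(phases[t], phases[u])) / temperature
        for u in range(t + 1)
    ]
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coupling(phases, t, d):
    """Kuramoto coupling sum_u A[t][u] * sin(theta_u - theta_t) for coordinate d."""
    weights = causal_weights(phases, t)
    return sum(w * math.sin(phases[u][d] - phases[t][d]) for u, w in enumerate(weights))


def coupling_from_circular_mean(phases, t, d):
    """Same quantity as the tangent component of the attention-weighted circular mean."""
    weights = causal_weights(phases, t)
    mean = sum(w * cmath.exp(1j * phases[u][d]) for u, w in enumerate(weights))
    return (cmath.exp(-1j * phases[t][d]) * mean).imag


def kuramoto_step(phases, step=1.0):
    """Move every angle along its coupling term; `step` is a hypothetical step size."""
    return [
        [theta + step * coupling(phases, t, d) for d, theta in enumerate(token)]
        for t, token in enumerate(phases)
    ]
```

Tokens that already share the same phases receive a zero update, which matches the synchronization reading: the update only acts to reduce phase disagreement among the tokens a query attends to.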

2606.11627 2026-06-11 cs.LG cs.AI New submission

When Context Returns: Toward Robust Internalization in On-Policy Distillation

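Xun Wang, Ruishuo Chen, Zhuoran Li, Yu Chen, Longbo Huang

Affiliation: IIIS, Tsinghua University

AI summary: Addresses the performance drop that occurs when context is reintroduced after it has been internalized through on-policy distillation, proposing a lightweight consistency regularizer that anchors the no-context output and penalizes deviation from it, which mitigates the degradation and improves robustness.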

Abstract

Recent work has shown that on-policy distillation can internalize privileged context, such as system prompts or task hints, into a student model so that the context is no longer needed at inference time. Although this approach successfully improves the student's no-context performance, we identify an interesting and previously unstudied phenomenon: in many settings, reintroducing the original privileged context to the distilled student actually degrades its performance, even on instances it already solves correctly without context. We term this context-induced degradation and argue that robust internalization demands not only matching the teacher's context-conditioned behavior, but also remaining stable when the context is reintroduced, a property we call context removability. Motivated by this observation, we propose a lightweight consistency regularizer that first anchors the student's no-context output via stop-gradient, then penalizes the context-conditioned output for deviating from it via forward KL divergence. This simple addition requires only one extra forward pass per training step, yet it effectively mitigates context-induced degradation and, in many cases, even improves no-context performance. Across 12 configurations spanning diverse domains and model families, our method improves context-conditioned accuracy in the majority of settings, reduces context-induced harm in 11 out of 12 settings, and effectively eliminates response-length inflation. A mechanistic case study further confirms that context removability is achieved at the representation level, with hidden states remaining nearly identical regardless of whether the context is present.
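
The regularizer described above reduces to a forward KL term between two output distributions of the same student. The following is a minimal numeric sketch on a single token's logits, assuming the anchor is the no-context distribution and the penalty is KL(anchor || context-conditioned); plain Python has no autodiff, so the stop-gradient is only indicated in a comment. Function names are hypothetical.

```python
import math


def softmax(logits):
    top = max(logits)
    exps = [math.exp(z - top) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def forward_kl(anchor_probs, other_probs, tiny=1e-12):
    """KL(anchor || other) = sum_i p_i * (log p_i - log q_i)."""
    return sum(
        p * (math.log(p + tiny) - math.log(q + tiny))
        for p, q in zip(anchor_probs, other_probs)
    )


def consistency_penalty(no_context_logits, context_logits):
    """Penalty for the context-conditioned output drifting from the no-context anchor.

    In an autodiff framework the anchor would be wrapped in a stop-gradient
    so that only the context-conditioned branch receives gradient.
    """
    anchor = softmax(no_context_logits)  # treated as a constant target
    return forward_kl(anchor, softmax(context_logits))
```

The penalty is zero when reintroducing the context leaves the output distribution unchanged, which is the "context removability" property the abstract names.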

2606.11854 2026-06-11 cs.LG cs.AI cs.CL New submission

Fine-tuning Multi-modal LLMs with ART: Art-based Reinforcement Training

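Michal Chudoba, Sergey Alyaev, Petra Galuscakova, Tomasz Wiktorski

Affiliations: University of Stavanger; NORCE Research

AI summary: Proposes ART, which injects information into a frozen multimodal large language model by optimizing its raw visual input, enabling soft-prompt-style fine-tuning without modifying the computational graph and reaching accuracy comparable to LoRA on mathematics and tool-use benchmarks.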

Abstract

There are two main Parameter-Efficient Fine-Tuning (PEFT) techniques for Large Language Models (LLMs). While Low-Rank Adaptation (LoRA) introduces additional weights between the LLM layers, Soft Prompting introduces additional fine-tuning-specific raw tokens to an LLM input. However, both require modification to the computational graphs of precompiled, preoptimized LLMs. As a result, neither is fully supported in high-throughput engines like vLLM. We propose fine-tuning with ART (Art-based Reinforcement Training). The method injects information into a frozen Multimodal Large Language Model (MLLM) by optimizing only its raw visual input, thus enabling the soft-token approach on pre-compiled computational graphs. It relies on backpropagation of gradients back into a plain pixel array and thus supports any fine-tuning objective. Moreover, the optimized visual input can be stylized as task-relevant computational artworks. The approach's effectiveness is confirmed for different sizes of a popular open Qwen architecture and for several textual benchmarks. Specifically, ART reaches accuracy competitive with LoRA across mathematics and structured-tool-use benchmarks.

2606.11963 2026-06-11 cs.LG physics.comp-ph New submission

HAMNO: A Hierarchical Adaptive Multi-scale Neural Operator with Physics-Informed Learning for Dynamical Systems

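Mostafa Bamdad, Mohammad Sadegh Eshaghi, Timon Rabczuk

Affiliations: Bauhaus-Universität Weimar; Leibniz University Hannover

AI summary: Proposes HAMNO, a neural-operator architecture that balances local and global information through an adaptive gating mechanism, together with the physics-informed extension PI-HAMNO, improving long-horizon prediction accuracy and physical consistency on non-periodic Allen-Cahn and related equations.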

Abstract

Neural operators provide a powerful framework for learning solution mappings of partial differential equations directly in function space. However, many existing architectures still struggle to represent nonlinear time-dependent systems that involve multi-scale structures, long-range interactions, and stable long-time evolution. In this work, we introduce the Hierarchical Adaptive Multi-scale Neural Operator (HAMNO), a neural-operator architecture that combines local convolutional representations, global spectral operators, and hierarchical encoder-decoder processing. The central component of HAMNO is a data-dependent gating mechanism that adaptively balances local and global information at each spatial location, allowing the model to resolve fine-scale features while preserving long-range dependencies. We further develop a physics-informed extension, PI-HAMNO, based on a multi-objective loss strategy that combines data fitting with strong- and weak-form physics constraints. The strong-form term penalizes the domain-integrated squared PDE residual in physical coordinates, while the weak-form term is constructed by multiplying the governing residual by finite-element test functions and evaluating the resulting element integrals using centroid-based tetrahedral quadrature. The framework is evaluated on non-periodic Allen-Cahn (AC), Cahn-Hilliard (CH), and Swift-Hohenberg (SH) equations defined on cubic domains. Across long-horizon rollout, data-limited training, out-of-distribution initial-condition shifts, and random-seed variations, HAMNO improves predictive accuracy over standard neural-operator baselines, while PI-HAMNO further enhances stability, physical consistency, and data efficiency. The implementation is publicly available at https://github.com/HAMNO/HAMNO.
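
The data-dependent gate can be pictured on a 1-D signal as a per-location convex mix of a local and a global branch. This is a toy sketch under strong simplifications, not the HAMNO architecture: the local branch is a 3-tap average with periodic wrap-around (the paper treats non-periodic domains), the global branch is just the signal mean (the zero-frequency Fourier mode), and the gate parameters and function names are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gated_mix(signal, gate_w=1.0, gate_b=0.0):
    """At each location, output g * local + (1 - g) * global with g in (0, 1)."""
    n = len(signal)
    global_feat = sum(signal) / n  # stand-in for a global spectral branch
    out = []
    for i, x in enumerate(signal):
        # stand-in for a local convolution; index -1 wraps to the last entry
        local = (signal[i - 1] + x + signal[(i + 1) % n]) / 3.0
        g = sigmoid(gate_w * x + gate_b)  # gate depends on the data at this location
        out.append(g * local + (1.0 - g) * global_feat)
    return out
```

Because the gate is computed from the input at each location, different parts of the domain can lean on fine-scale or long-range information to different degrees.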

2606.12054 2026-06-11 cs.LG New submission

Simplicity Suffices for Parameter Noise Injection in Stochastic Gradient Descent

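Benjamin Leblanc, Louis-Jacob Lebel, Teddy Kana, Richard Kamel

Affiliation: Université Laval

AI summary: Studies parameter noise injection in stochastic gradient descent, proposes an efficient per-example noise-injection method for linear layers, and shows experimentally that simple isotropic noise recovers most of the optimization and generalization benefit of more complex schemes.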

Comments
Accepted at the Data Science Meets Optimisation workshop in IJCAI 2026

Abstract

Injecting noise into the optimization process is a well-established technique for improving the training and generalization of deep neural networks. Yet, despite the breadth of existing approaches, it remains unclear which design choices truly matter in practice. In this work, we investigate parameter noise injection for stochastic gradient descent, focusing on two key questions: how to efficiently pair each training example with its own perturbation in mini-batch training, and whether sophisticated noise parameterizations or multi-sample gradient averaging yield meaningful gains over simpler alternatives. To address the first question, we leverage a distributional identity for linear layers that allows per-example noise injection without breaking batched computation. To address the second, we systematically compare several diagonal Gaussian parameterizations against an isotropic baseline across varying noise levels on CIFAR100. Our results consistently show that simple, lightweight strategies, isotropic noise with a single perturbed forward pass per update step, recover most of the benefit of more complex schemes. These findings suggest that simplicity suffices for parameter noise injection, and that practitioners need not resort to elaborate perturbation designs to reap the optimization and generalization benefits of noisy SGD.
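
The abstract does not state which distributional identity it uses. A plausible candidate, assumed here, is the standard Gaussian fact that if E has i.i.d. N(0, sigma^2) entries then E x is distributed as N(0, sigma^2 ||x||^2 I), so (W + E) x can be sampled as W x + sigma ||x|| xi without ever building a separate noisy weight matrix per example. The sketch covers isotropic noise only, and the function names are hypothetical.

```python
import math
import random


def explicit_noisy_linear(W, x, sigma, rng):
    """Reference: draw a full noise matrix E for this one example and apply W + E."""
    return [
        sum((wij + rng.gauss(0.0, sigma)) * xj for wij, xj in zip(row, x))
        for row in W
    ]


def noisy_linear_batch(W, X, sigma, rng):
    """Per-example weight noise via output-space sampling.

    Each example gets its own independent perturbation, but the clean part
    W x is shared, so in a tensor library it stays one batched matmul plus
    an elementwise noise term scaled by each example's input norm.
    """
    outputs = []
    for x in X:
        x_norm = math.sqrt(sum(v * v for v in x))
        clean = [sum(wij * xj for wij, xj in zip(row, x)) for row in W]
        outputs.append([c + sigma * x_norm * rng.gauss(0.0, 1.0) for c in clean])
    return outputs
```

Both routes produce outputs with the same distribution; the second needs one Gaussian draw per output unit rather than one per weight.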

2606.12059 2026-06-11 cs.LG cs.NE nlin.AO New submission

Attention by Synchronization in Coupled Oscillator Networks

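Fabio Pasqualetti, Taosha Guo

Affiliation: University of California, Irvine

AI summary: Proposes fixed-query oscillator attention based on Kuramoto synchronization dynamics, which computes attention on physical substrates without exponentiation or global reduction and outperforms softmax on keyword spotting and subject-verb agreement.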

Abstract

We address transformer attention on energy-constrained physical substrates. Softmax attention requires exponentiation and global reduction, operations with high energy cost on von Neumann hardware and no natural physical analog. We show that Kuramoto synchronization dynamics (which arise in electrical, mechanical, superconducting, and charge-density-wave oscillator arrays, among other physical systems) implement a well-defined attention operation without either. The resulting mechanism, fixed-query oscillator attention, replaces softmax's arithmetic with the equilibration of a gradient flow on the sphere: queries are learned anchors fixed on the sphere, and free oscillators evolve under Kuramoto-Lohe dynamics until they settle at positions encoding attention weights via cosine similarity. Because the computation is equilibration, it requires no exponentiation; the only global operation is an affine normalization at readout. The fixed point is provably unique and globally attractive from almost every initial condition, a guarantee that holds across every physical realization. Empirically, at the minimal hardware configuration (oscillator dimension $d_{\mathrm{osc}}$ = 2), oscillator attention outperforms softmax on keyword spotting (+1.00 pp) and on subject-verb agreement (+5.27 pp on hard sentences, with zero training failures versus one in five for softmax). On causal language modeling, where softmax retains an advantage, oscillator attention closes the gap as $d_{\mathrm{osc}}$ grows: from +11.09 PPL at $d_{\mathrm{osc}}$ = 2 to +2.98 PPL at $d_{\mathrm{osc}}$ = 32 on WikiText-2, and from +2.39 PPL at $d_{\mathrm{osc}}$ = 2 to +0.57 PPL at $d_{\mathrm{osc}}$ = 32 on TinyStories. The main objective of this work is not to replace softmax in software but to provide a mathematically grounded blueprint for accurate attention on physical substrates.

2606.12146 2026-06-11 cs.LG cs.AI New submission

nD-RoPE: A Generalized RoPE for n-Dimensional Position Embedding

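Boyang Li, Yulin Wu, Sizhe Xu, Nuoxian Huang, Zhonghang Yuan, Shangyi Guo, Shu Yang, Takahiro Yabe

AI summary: Proposes nD-RoPE, which generalizes rotary position embedding to arbitrary dimensions and achieves isotropy through a multi-scale regular-simplex wave-vector design, improving performance on image, video, and point-cloud tasks.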

Comments
Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

Abstract

Rotary Position Embedding (RoPE) is widely adopted in Transformer models, yet its extension to high-dimensional domains lacks a unified theoretical formulation. Most existing approaches either apply rotations independently along each axis or empirically mix frequencies, which limits cross-dimensional interactions and yields direction-dependent representations. To address these limitations, we propose nD-RoPE, a decomposition-free generalization of RoPE to arbitrary dimensions. From a translation-invariant formulation in continuous Hilbert space, we derive a spectral condition for isotropy that requires treating positions and frequencies as coupled \(n\)-dimensional vectors. We instantiate this formulation with a multi-scale regular-simplex wave-vector design, which provides non-degenerate spatial coverage and a symmetric, directionally balanced second-order response. Experiments across images, videos, and point clouds demonstrate consistent performance gains and improved generalization in high-dimensional settings.
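
The regular-simplex wave vectors can be constructed explicitly, which makes the "directionally balanced second-order response" easy to verify: the n+1 unit vectors have pairwise dot product -1/n, sum to zero, and their outer products sum to ((n+1)/n) I. The recursive construction below is a standard one; how the paper assigns scales to channels is not stated in the abstract, so the `scales` argument and the function names are assumptions.

```python
import math


def regular_simplex(n):
    """n + 1 unit vectors in R^n with pairwise dot product -1/n."""
    if n == 1:
        return [[1.0], [-1.0]]
    sub = regular_simplex(n - 1)
    shrink = math.sqrt(1.0 - 1.0 / (n * n))
    vecs = [[1.0] + [0.0] * (n - 1)]
    for s in sub:
        # first coordinate -1/n, remaining coordinates a scaled (n-1)-simplex
        vecs.append([-1.0 / n] + [shrink * c for c in s])
    return vecs


def rotation_phases(position, scales):
    """Rotary phase omega * (k . p) for every scale omega and simplex direction k."""
    directions = regular_simplex(len(position))
    return [
        [omega * sum(k_i * p_i for k_i, p_i in zip(k, position)) for k in directions]
        for omega in scales
    ]
```

Since each phase is linear in the position, the phase difference between two points depends only on their offset, which preserves the relative-position property of RoPE in n dimensions.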

2606.12240 2026-06-11 cs.LG cs.AI 新提交

Multi-Rate Mixture of Experts for Accelerating Liquid Neural Network Training

多速率专家混合模型加速液态神经网络训练

Shilong Zong, Almuatazbellah Boker, Hoda Eldardiry

发表机构 * Virginia Tech(弗吉尼亚理工大学)

AI总结 提出多速率专家混合框架,结合液态神经网络的多尺度动态与注意力机制,提升多变量时间序列建模的准确性和效率。

AI中文摘要

多变量时间序列数据通常表现出复杂的时间依赖、不规则采样和跨多个时间尺度的异质动态,使得精确序列建模特别具有挑战性。传统的循环神经网络(RNN),如长短期记忆网络(LSTM),在离散时间下运行,可能难以有效捕捉连续和不规则的时间行为。液态神经网络(LNN)通过连续时间动态解决了其中一些限制,但标准LNN架构通常依赖单一动力系统,限制了其建模异质时间模式的能力。为了解决这些挑战,我们提出了一个基于液态神经网络的多速率专家混合(MR-MoE)框架。在所提出的架构中,多个基于LNN的专家以不同的时间尺度运行,使模型能够明确分离快速变化的动态和缓慢演变的时间趋势。门控网络进一步实现了基于输入条件的自适应专家专业化。此外,我们结合了特征级和时间注意力机制,以提高鲁棒性、可解释性和长程依赖建模能力。特征级注意力抑制噪声或无关变量,而时间注意力则选择性地关注信息丰富的历史状态。我们在一个复杂的多变量时间序列预测任务上评估了所提出的框架,并与强基线模型(包括LSTM、单体LNN和标准MoE模型)进行了比较。实验结果表明,所提出的MR-MoE框架在保持良好计算效率的同时,持续实现了改进的AUROC和AUPRC性能。这些结果突显了结合连续时间动态、多尺度专家分解和自适应注意力机制对时间序列建模的有效性。

英文摘要

Multivariate time-series data often exhibit complex temporal dependencies, irregular sampling, and heterogeneous dynamics across multiple time scales, making accurate sequence modeling particularly challenging. Traditional recurrent neural networks (RNNs), such as Long Short-Term Memory (LSTM) networks, operate in discrete time and may struggle to effectively capture continuous and irregular temporal behaviors. Liquid Neural Networks (LNNs) address some of these limitations through continuous-time dynamics, but standard LNN architectures typically rely on a single dynamical system, limiting their ability to model heterogeneous temporal patterns. To address these challenges, we propose a Multi-Rate Mixture-of-Experts (MR-MoE) framework built on top of Liquid Neural Networks. In the proposed architecture, multiple LNN-based experts operate at distinct time scales, enabling the model to explicitly separate fast-changing dynamics from slow-evolving temporal trends. A gating network further enables adaptive expert specialization based on input conditions. In addition, we incorporate both feature-level and temporal attention mechanisms to improve robustness, interpretability, and long-range dependency modeling. Feature-level attention suppresses noisy or irrelevant variables, while temporal attention selectively focuses on informative historical states. We evaluate the proposed framework on a complex multivariate time-series prediction task and compare it against strong baselines, including LSTM, monolithic LNN, and standard MoE models. Experimental results demonstrate that the proposed MR-MoE framework consistently achieves improved AUROC and AUPRC performance while maintaining favorable computational efficiency. These results highlight the effectiveness of combining continuous-time dynamics, multi-scale expert decomposition, and adaptive attention mechanisms for time-series modeling.

2606.12318 2026-06-11 cs.LG cs.AI 新提交

Harness In-Context Operator Learning with Chain of Operators

利用算子链实现上下文算子学习

Minghui Yang, Ling Guo, Liu Yang

发表机构 * Department of Mathematics, Shanghai Normal University(上海师范大学数学系) Department of Mathematics, National University of Singapore(新加坡国立大学数学系)

AI总结 提出Chain of Operators (CHOP)框架,通过构造显式初等变换与冻结ICON的算子链,无需微调即可提升上下文算子网络在分布外算子任务上的泛化能力,在标量守恒律和平均场控制问题中降低推理误差。

AI中文摘要

神经算子近似函数空间之间的映射,但通常对其他算子泛化能力差,需要微调或重新训练。上下文算子网络(ICON)通过向模型提供数值上下文来解决此问题,使模型从提示中学习特定算子并适应不同算子而无需微调。然而,ICON在分布外(OOD)算子任务上仍可能泛化失败。受大型语言模型(LLM)的提示工程成功启发,我们引入了算子链(CHOP),一种在不更新参数的情况下将冻结的ICON应用于OOD算子任务的框架。具体来说,CHOP构建了一个由显式初等变换和冻结ICON组成的算子链。在标量守恒律和平均场控制问题上的实验表明,与直接ICON评估相比,CHOP降低了相对推理误差,同时链中的每个算子保持可解释且具有封闭形式。在一个PDE族上构建的链进一步泛化到另一个不同的族,表明跨提示系统存在共享机制。

英文摘要

Neural operators approximate mappings between function spaces, but often generalize poorly to other operators and usually require fine-tuning or retraining. In-Context Operator Networks (ICON) address this issue by prompting the model with numerical context, so that the model learns specific operators from prompts and adapts to different operators without fine-tuning. However, ICON may still fail to generalize to out-of-distribution (OOD) operator tasks. Inspired by the success of harness engineering for large language models (LLMs), we introduce Chain of Operators (CHOP), a framework that harnesses a frozen ICON for OOD operator tasks without updating its parameters. Specifically, CHOP constructs a chain of operators consisting of explicit elementary transformations and the frozen ICON. Experiments on a scalar conservation law and a mean-field control problem show that CHOP reduces relative inference error over direct ICON evaluation, while each operator in the chain remains interpretable and in closed form. A chain constructed on one PDE family further generalizes to a different family, indicating shared mechanisms across harness systems.

2606.12364 2026-06-11 cs.LG 新提交

On Subquadratic Architectures: From Applications to Principles

关于次二次架构:从应用到原理

Anamaria-Roberta Hartl, Levente Zólyomi, David Stap, Pieter-Jan Hoedt, Niklas Schmidinger, Lukas Hauzenberger, Sebastian Böck, Günter Klambauer, Sepp Hochreiter

发表机构 * ELLIS Unit Linz, LIT AI Lab, Institute for Machine Learning, Johannes Kepler University Linz(林茨ELLIS单元、LIT AI实验室、机器学习研究所、约翰内斯·开普勒大学林茨) NVIDIA(英伟达)

AI总结 本文比较了xLSTM、Mamba-2和Gated DeltaNet三种次二次架构,发现xLSTM在代码预训练、蒸馏和时间序列预训练中表现最佳,其优势源于灵活稳定的门控记忆校正机制。

AI中文摘要

Transformer主导现代序列建模,但其二次注意力机制带来了巨大的计算成本。次二次架构提供了一种可扩展的替代方案。然而,目前尚不清楚哪些设计能产生最有效的序列模型。我们比较了三种领先的方法:xLSTM、Mamba-2和Gated DeltaNet。我们在具有复杂依赖关系的任务上评估这些模型:(1)代码模型预训练,(2)从大型语言模型蒸馏代码模型,以及(3)时间序列基础模型预训练。在这些设置中,xLSTM提供了最强的整体性能。为了解释xLSTM的优势,我们提出了一个统一的公式并分析了底层架构机制,重点关注状态跟踪和记忆动态。我们的结果表明,xLSTM通过其门控方案实现了更灵活和稳定的记忆校正。我们在受控的合成长度泛化任务上证实了这些发现。总体而言,我们的发现表明,xLSTM在复杂任务上的收益源于稳健的状态跟踪和积累。

英文摘要

Transformers dominate modern sequence modeling, but their quadratic attention incurs substantial computational cost. Subquadratic architectures offer a scalable alternative. However, it remains unclear which designs yield the most effective sequence models. We compare three leading approaches: xLSTM, Mamba-2, and Gated DeltaNet. We evaluate these models on tasks with complex dependencies: (1) code-model pre-training, (2) distillation of code models from large language models, and (3) pre-training of time-series foundation models. Across these settings, xLSTM delivers the strongest overall performance. To explain xLSTM's advantage, we present a unified formulation and analyze the underlying architectural mechanisms, focusing on state tracking and memory dynamics. Our results show that xLSTM enables more flexible and stable memory correction via its gating scheme. We corroborate these findings on controlled synthetic length-generalization tasks. Overall, our findings indicate that xLSTM's gains on complex tasks stem from robust state tracking and accumulation.

2606.12397 2026-06-11 cs.LG cs.AI cs.CL 新提交

Redesign Mixture-of-Experts Routers with Manifold Power Iteration

重新设计混合专家模型的路由器:基于流形幂迭代

Songhao Wu, Ang Lv, Ruobing Xie, Yankai Lin

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Large Language Model Department, Tencent(腾讯大型语言模型部门)

AI总结 提出将路由器行与专家矩阵主奇异方向对齐,并基于流形幂迭代(MPI)重新设计路由器,通过“幂迭代-收缩”范式实现对齐,理论证明收敛性,实验验证1B至11B参数规模下模型效果提升。

Comments
Preprint
AI中文摘要

路由器是混合专家模型的核心组件。作为专家代理,路由器矩阵的行计算与MoE输入的相似度,以确定激活哪些专家子集。理想情况下,每个路由器行被设计为将专家矩阵编码到该代表性向量中,使得其与token的点积能更好地反映token-专家亲和性。然而,目前没有设计原则来强制这种压缩。在本文中,我们提出将每个路由器行与相关专家的主奇异方向对齐,因为该方向提供了矩阵最具表现力的数学描述。基于这一原则,我们提出了一种基于流形幂迭代(MPI)的路由器重新设计。具体来说,它引入了一种“幂迭代-收缩”范式,其中对路由器权重执行幂迭代步骤,然后进行收缩以施加范数约束,确保效率和稳定性。理论上,我们证明MPI驱动路由器行收敛到相关专家的主奇异方向。实验上,我们在1B到11B参数规模的MoE模型上进行预训练,证实这种对齐有助于更有效的MoE模型。

英文摘要

The router is the cornerstone component of Mixture-of-Experts (MoE) models. Serving as expert proxies, the rows of the router matrix compute their similarity to the MoE inputs to determine which subset of experts is activated. Ideally, each router row condenses its expert matrix into a representative vector, such that its dot product with a token better reflects token-expert affinity. However, there exist no design principles to enforce this condensation. In this paper, we propose to align each router row with the principal singular direction of the associated expert, as this direction provides the most expressive mathematical description of a matrix. Based on this principle, we propose a router redesign with Manifold Power Iteration (MPI). Specifically, it introduces a "Power-then-Retract" paradigm, where a power iteration step is performed on the router weights, followed by a retraction that imposes a norm constraint to ensure both efficiency and stability. Theoretically, we show that MPI drives router rows to converge toward the principal singular directions of associated experts. Empirically, we pretrain MoE models at scales from 1B to 11B parameters to confirm that this alignment facilitates more effective MoE models.
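The "Power-then-Retract" idea can be illustrated with textbook power iteration. This is a minimal sketch, not the paper's MPI update: it assumes the expert is a single matrix `W` acting on the token from the left, that the relevant direction is the top right singular vector (which lives in token space), and that the retraction is plain renormalization to the unit sphere. The helper names are hypothetical.

```python
import math

def matvec(W, x):
    """Compute W x for a row-major matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matvec_t(W, y):
    """Compute W^T y without materializing the transpose."""
    return [sum(W[i][j] * y[i] for i in range(len(W)))
            for j in range(len(W[0]))]

def power_then_retract(router_row, W, steps=1):
    """Apply `steps` rounds of: power step with W^T W, then renormalize."""
    r = list(router_row)
    for _ in range(steps):
        r = matvec_t(W, matvec(W, r))            # power step
        norm = math.sqrt(sum(x * x for x in r))
        r = [x / norm for x in r]                # retraction to the unit sphere
    return r
```

With `W = [[2, 0, 0], [0, 1, 0]]` the top right singular vector is the first coordinate axis, and repeated calls pull any generic starting row toward it at a rate set by the squared ratio of the two largest singular values. In training one would presumably interleave a single such step with each gradient update rather than iterate to convergence, but the abstract does not give that schedule.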

2606.11206 2026-06-11 cs.CL cs.LG 交叉投稿

Compatibility-Aware Dynamic Fine-Tuning for Large Language Models

兼容性感知的动态微调用于大型语言模型

Yucheng Zhou, Junwei Sheng, Qianning Wang, Jianbing Shen

发表机构 * SKL-IOTSC, CIS, University of Macau(澳门大学科技学院电脑与信息科学系及智慧城市物联网国家重点实验室) Auckland University of Technology(奥克兰理工大学)

AI总结 提出兼容性感知动态微调(CADFT),通过模型似然度动态调整监督更新,抑制不兼容样本的高方差梯度,提升训练稳定性和泛化能力。

Comments
ACL 2026
AI中文摘要

监督微调(SFT)是对齐大型语言模型(LLMs)的主要范式,但它存在优化不稳定和泛化能力有限的问题。最近的研究将这一问题归因于病态的梯度缩放,并提出了动态微调(DFT)来在令牌级别进行修正。然而,DFT假设所有演示都是同样合适的学习目标,这一假设被大规模指令数据的强异质性所违反,其中演示-策略不匹配会在样本级别导致高方差更新。我们引入了兼容性感知动态微调(CADFT),这是DFT的一个原则性扩展,用于控制样本级别的优化方差。CADFT从模型似然度中推导出一个动态的、依赖于策略的兼容性信号,以调节监督更新,抑制来自不兼容演示的高方差梯度。我们进一步提出了一种延迟的、低频的兼容性引导重写策略,将持续不兼容的演示转化为可学习的目标。我们表明,CADFT可以被解释为一个方差控制的估计器,将DFT中的令牌级稳定性推广到样本级别。大量实验表明,CADFT在保持完全监督且不依赖显式奖励建模的同时,提高了稳定性、泛化能力和冷启动强化学习初始化。

英文摘要

Supervised Fine-Tuning (SFT) is the predominant paradigm for aligning large language models (LLMs), yet it suffers from optimization instability and limited generalization. Recent work attributes this issue to pathological gradient scaling and proposes Dynamic Fine-Tuning (DFT) to correct it at the token level. However, DFT assumes all demonstrations are equally suitable learning targets, an assumption violated by the strong heterogeneity of large-scale instruction data, where demonstration-policy mismatch induces high-variance updates at the sample level. We introduce Compatibility-Aware Dynamic Fine-Tuning (CADFT), a principled extension of DFT that controls sample-level optimization variance. CADFT derives a dynamic, policy-dependent compatibility signal from model likelihoods to modulate supervised updates, suppressing high-variance gradients from incompatible demonstrations. We further propose a delayed, low-frequency compatibility-guided rewriting strategy to transform persistently incompatible demonstrations into learnable targets. We show that CADFT can be interpreted as a variance-controlled estimator that generalizes token-level stabilization in DFT to the sample level. Extensive experiments demonstrate improved stability, generalization, and cold-start reinforcement learning initialization, while remaining fully supervised and independent of explicit reward modeling.

2606.11236 2026-06-11 cs.NE cs.CV cs.LG 交叉投稿

A2SG: Adaptive and Asymmetric Surrogate Gradients for Training Deep Spiking Neural Networks

A2SG:用于训练深度脉冲神经网络的适应性和非对称替代梯度

Yechan Kang, Yongjin Kweon, Mingyeong Seo, Sohee Park, Yeonguk Jeon, Jongkil Park, Hyun Jae Jang, Jaewook Kim, YeonJoo Jeong, Suyoun Lee, Seongsik Park

AI总结 提出适应性和非对称替代梯度(A2SG)框架,通过自适应窗口调整梯度方向一致性、非对称梯度反映神经元动态,降低梯度变化并促进收敛到平坦最小值,在多种SNN模型和任务上提升精度与能效。

Comments
Accepted at ICML 2026
AI中文摘要

由于替代梯度导致的尖锐损失景观和时间不一致性,训练深度脉冲神经网络(SNN)仍然具有挑战性。为了解决这些问题,我们提出了一个统一框架:适应性和非对称替代梯度A2SG。适应性梯度调整一个有效窗口以实现时空适应,减少空间梯度变化并保持梯度随时间的方向一致性。非对称梯度通过为具有更高膜电位的神经元分配更大的梯度来反映神经元动态,并且我们证明它们比对称替代梯度产生更低的方差。我们的分析进一步建立了局部梯度变化与损失景观曲率之间的直接联系,为A2SG如何促进收敛到更平坦的最小值并改善泛化提供了原理性解释。我们在多种模型上进行了广泛实验,包括基于CNN和基于Transformer的SNN,涉及各种任务,如使用静态和神经形态数据集的图像分类以及分割。结果表明,A2SG持续提高了准确性和能效,使其成为训练深度SNN的通用且可靠的解决方案。我们的代码可在以下网址获取:此 https URL。

英文摘要

Training deep spiking neural networks (SNNs) remains challenging due to sharp loss landscapes and temporal inconsistency caused by surrogate gradients. To address these challenges, we propose a unified framework: adaptive and asymmetric surrogate gradients (A2SG). The adaptive gradients adjust an effective window for spatio-temporal adaptation, reducing spatial gradient variation and maintaining directional consistency of gradients over time. The asymmetric gradients reflect neuronal dynamics by assigning larger gradients to neurons with higher membrane potentials, and we prove that they yield lower variation than symmetric surrogates. Our analysis further establishes a direct connection between local gradient variation and the curvature of the loss landscape, providing a principled explanation for how A2SG promotes convergence to flatter minima and improves generalization. We conduct extensive experiments on diverse models, including CNN-based and Transformer-based SNNs, across various tasks such as image classification using both static and neuromorphic datasets, as well as segmentation. The results demonstrate that A2SG consistently improves accuracy and energy efficiency, establishing it as a general and reliable solution for training deep SNNs. Our code is available at this https URL.

2606.11552 2026-06-11 cs.CL cs.LG 交叉投稿

Teaching Diffusion to Speculate Left-to-Right

教导扩散模型从左到右推测

Lexington Whalen, Yuki Ito, Ryo Sakamoto

AI总结 针对自回归解码的推理瓶颈,提出三种训练时干预方法(位置加权、首次错误焦点损失、链损失)来弥合块扩散草稿模型的双向生成与自回归目标模型从左到右验证之间的不对称性,显著提升接受草稿长度。

Comments
13 pages, technical report
AI中文摘要

大型语言模型(LLMs)在广泛任务中表现出色,但其自回归解码过程由于固有的顺序令牌生成而带来大量推理成本。推测解码通过使用轻量级草稿模型提出多个未来令牌,随后由更大的目标模型并行验证,从而解决这一瓶颈。近期工作表明,扩散语言模型非常适合此设置,因为它们可以并行生成整个草稿令牌块,从而缓解自回归草稿的顺序约束。该机制的一个微妙之处在于,块扩散草稿生成器在块内双向生成令牌,而验证由自回归目标模型以严格从左到右的方式评估令牌,导致对称的训练目标与非对称的验证奖励之间存在差距。在本工作中,我们对三种缩小这一差距的训练时干预措施进行了实证分析:令牌位置加权、针对每个块内破坏已接受前缀位置的首次错误焦点损失,以及用可微替代项替代期望接受长度的链损失项。这三种干预措施沿正交轴(位置、块条件首次错误、联合前缀)起作用,并且可加性组合;它们同样与测试时对齐机制(如多草稿自选)正交,原则上可以与之结合。在四个目标模型和六个推理、代码及对话基准测试中,与位置均匀基线相比,这三种干预措施使每个基准测试的接受草稿长度提高了21-76%,且无需增加额外前向传递,也无需改变推理流程或拒绝采样精确性约束。

英文摘要

Large language models (LLMs) achieve remarkable performance across a wide range of tasks, but their autoregressive decoding process incurs substantial inference costs due to inherently sequential token generation. Speculative decoding addresses this bottleneck by employing a lightweight draft model to propose multiple future tokens that are subsequently verified in parallel by a larger target model. Recent work has demonstrated that diffusion language models are well suited for this setting, as they can generate entire blocks of draft tokens in parallel and thereby alleviate the sequential constraints of autoregressive drafting. A subtlety of this regime is that block-diffusion drafters generate tokens bidirectionally within a block, whereas verification is performed by an autoregressive target model that evaluates tokens in a strictly left-to-right manner, leaving a gap between the symmetric training-time objective and the asymmetric verification-time reward. In this work, we offer an empirical analysis of three training-time interventions that narrow this gap: token positional weighting, a first-error focal loss that targets the position that breaks the accepted prefix within each block, and a chain loss term that substitutes a differentiable surrogate for the expected accepted length. The three interventions act along orthogonal axes (position, block-conditional first error, joint prefix) and compose additively; they are likewise orthogonal to test-time alignment mechanisms such as multi-draft self-selection, with which they can in principle be combined. Across four target models and six reasoning, code, and dialogue benchmarks, the three interventions raise accepted draft length by 21-76% per benchmark over a position-uniform baseline, without adding additional forward passes and without changing the inference pipeline or the rejection-sampling exactness contract.
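The asymmetry described above can be made concrete with a small calculation. This is a sketch under a simplifying assumption that is not from the paper: each draft position is accepted independently with a fixed probability. Under left-to-right verification the accepted length is the length of the accepted prefix, so its expectation is a sum of prefix products. The function name is hypothetical.

```python
def expected_accepted_length(accept_probs):
    """E[accepted prefix length] = sum over j of prod_{i <= j} p_i."""
    total, prefix = 0.0, 1.0
    for p in accept_probs:
        prefix *= p      # probability that positions 1..j are all accepted
        total += prefix
    return total

# Two drafts with the same per-token average quality but different ordering.
early_strong = [0.9, 0.9, 0.5]
early_weak = [0.5, 0.9, 0.9]
print(expected_accepted_length(early_strong))
print(expected_accepted_length(early_weak))
```

Both drafts have the same mean per-token acceptance probability, yet the one that is accurate early has a clearly longer expected accepted prefix. A position-uniform training loss cannot tell the two apart, which is the gap that positional weighting and a prefix-product style surrogate are meant to narrow; the exact form of the paper's chain loss is not given in the abstract.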

2606.11599 2026-06-11 cs.CL cs.LG 交叉投稿

When is Your LLM Steerable?

你的大模型何时可操控?

Chenrui Fan, Yize Cheng, Ming Li, Soheil Feizi, Tianyi Zhou

发表机构 * University of Maryland, College Park(马里兰大学帕克分校) MBZUAI, UAE(穆罕默德·本·扎耶德人工智能大学)

AI总结 提出通过模型生成初期的内部状态预测激活操控是否成功,并利用该预测器优化操控强度搜索,降低解码成本。

AI中文摘要

激活操控提供了一种轻量级的方法来控制语言模型在推理时的行为,但其成功与否严重依赖于提示、概念、模型和操控配置。寻找成功操控的范围和边界通常需要昂贵的网格搜索和对完整自回归生成的后验评估。在这项工作中,我们研究了是否可以从模型在生成过程初期(例如,生成前几个token后)的内部状态预测可操控性,以及如何利用这样的预测器来提高操控成功率。为此,我们首先引入了ASTEER,一个包含140万次操控生成的测试平台,涵盖150个概念,每个操控成功/失败均已标注。利用该测试平台,我们通过提取特征来比较操控前后跨层和初始解码步骤的隐藏状态,分析模型的早期解码动态。这些特征帮助我们理解操控效果如何沿层和token位置传播,为可操控性预测提供关键信息。然后,我们在这些特征上训练梯度提升决策树(GBDT)分类器,以预测干预是否会欠操控、成功或过操控,而无需完整生成。我们的预测器在未见过的概念上达到了约0.7的宏F1分数,表明早期隐藏状态编码了关于最终操控效果的大量结构化信息。我们进一步利用该可操控性预测器作为操控强度搜索的指导,以极小的解码成本实现了接近最优的性能。

英文摘要

Activation steering offers a lightweight approach to control language models' behavior at inference time, but whether it succeeds or fails heavily depends on the prompt, concept, model, and steering configuration. Finding the regime and boundaries of successful steering typically requires expensive grid searches and post-hoc evaluation of full autoregressive rollouts. In this work, we investigate whether steerability can be predicted from the model's internal states at the beginning of the generation process, e.g., after generating the first few tokens, and how to leverage such a predictor to improve steering success rate. To this end, we first introduce ASTEER, a testbed including 1.4M steered generations, spanning 150 concepts with each steering success/failure labeled. Leveraging this testbed, we analyze the model's early decoding dynamics by extracting features that compare hidden states before and after steering across layers and initial decoding steps. These features help us understand how steering's effects propagate along layers and token positions, which provide key information for steerability prediction. We then train a Gradient Boosting Decision Trees (GBDT) classifier on these features to predict whether an intervention will under-steer, succeed, or over-steer without requiring full rollout. Our predictor achieves around 0.7 macro-F1 score on unseen concepts, demonstrating that early hidden states encode substantial, structured information about eventual steering efficacy. We further leverage this steerability predictor as guidance for steering strength searching, achieving near-optimal performance with a small fraction of decoding cost.

2606.11673 2026-06-11 quant-ph cs.LG 交叉投稿

Higher-Order Token Interactions via Quantum Attention

高阶令牌交互的量子注意力机制

Jian Xu, Chao Li, Delu Zeng, John Paisley, Qibin Zhao

AI总结 提出量子高阶注意力(QHA),通过数据重上传和非克利福德纠缠器在浅电路中合成任意阶令牌交互,证明其表达能力超越经典自注意力,并具有可训练性保证,在遗传上位、带噪学习奇偶和图三角形检测中高效检测高阶交互。

AI中文摘要

标准点积自注意力在单层中仅计算令牌间的成对(二阶)交互;表示一般的$k$阶交互已知需要在单层中使用超二次资源或通过深度组合。我们引入量子高阶注意力(QHA),一种浅层、硬件可实现的量子注意力头,通过数据重上传和全对非克利福德纠缠器,在电路内部合成$k$阶令牌交互,并通过局部单量子比特读出暴露它们。我们证明:(i)表达能力分离:任何嵌入维度$m$、$H$个头和$p$位精度满足$mHp=o(N/\log\log N)$的单个标准自注意力层无法表示一个QHA头以电路深度$O(\log k)$($O(k)$个两量子比特门)表示的$k$阶相关族;(ii)其局部设计实例的可训练性保证:使用局部读出和$O(\log n)$深度,梯度方差为$\Omega(1/\mathrm{poly}(n))$(无贫瘠高原),我们通过实验确认,同时明确我们基准测试的更具表达力的全对实例是经验训练的,并显示指数衰减的梯度。实验上,在参数预算小$6.5\times$的情况下,QHA从不相交输入中泛化每个阶$k\le6$的隐藏子集奇偶性,而更大的经典注意力头在阶2之后崩溃;与理论一致,优势的大小跟踪目标的傅里叶度:奇偶性最大,当存在低阶结构时缩小。作为一个应用,QHA在三个领域(遗传上位、带噪学习奇偶和图三角形检测)作为紧凑的高阶交互检测器,在最小的参数预算下达到噪声上限,而领域标准的线性方法失败。

英文摘要

Standard dot-product self-attention computes, in a single layer, only pairwise (order-2) interactions between tokens; representing a generic order-$k$ interaction is known to require either super-quadratic resources in one layer or composition across depth. We introduce Quantum Higher-Order Attention (QHA), a shallow, hardware-realizable quantum attention head that, via data re-uploading and an all-to-all non-Clifford entangler, synthesizes order-$k$ token interactions inside the circuit and exposes them through a local single-qubit read-out. We prove (i) an expressivity separation: any single standard self-attention layer with embedding dimension $m$, $H$ heads and $p$-bit precision satisfying $mHp=o(N/\log\log N)$ cannot represent the order-$k$ correlation family that one QHA head represents with circuit depth $O(\log k)$ ($O(k)$ two-qubit gates); and (ii) a trainability guarantee for its local-design instantiation: with a local read-out and $O(\log n)$ depth the gradient variance is $\Omega(1/\mathrm{poly}(n))$ (no barren plateau), which we confirm empirically, while being explicit that the more expressive all-to-all instantiation we benchmark is trained empirically and shows exponentially decaying gradients. Empirically, at a $6.5\times$ smaller parameter budget, QHA generalizes hidden-subset parity of every order $k\le6$ from disjoint inputs, whereas the larger classical attention head collapses past order 2; consistent with theory, the size of the advantage tracks the target's Fourier degree: it is largest for parity and shrinks when low-order structure is present. As an application, QHA serves as a compact high-order interaction detector across three domains (genetic epistasis, learning-parity-with-noise, and graph triangle detection), reaching the noise ceiling at the smallest parameter budget where field-standard linear methods fail.
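The link between hidden-subset parity and Fourier degree can be checked by brute force, independently of QHA. The sketch below is a classical illustration only, with hypothetical helper names: over uniform inputs in {-1, +1}^n, a parity on a hidden subset has Fourier coefficient 1 on exactly that subset and 0 on every other subset, so no combination of strictly lower-order products of inputs correlates with it.

```python
from itertools import combinations, product

def character(x, subset):
    """Product of the selected +/-1 coordinates (a Fourier character)."""
    value = 1
    for i in subset:
        value *= x[i]
    return value

def fourier_coefficient(f, n, subset):
    """Average of f(x) * character(x, subset) over all inputs in {-1, +1}^n."""
    inputs = list(product((-1, 1), repeat=n))
    return sum(f(x) * character(x, subset) for x in inputs) / len(inputs)

def parity_on(hidden):
    """Parity (in +/-1 encoding) of the coordinates listed in `hidden`."""
    return lambda x: character(x, hidden)

n, hidden = 5, (0, 2, 3)
target = parity_on(hidden)
low_order = [s for size in range(3) for s in combinations(range(n), size)]
print(max(abs(fourier_coefficient(target, n, s)) for s in low_order))
print(fourier_coefficient(target, n, hidden))
```

For an order-3 parity, every coefficient of order 0, 1, or 2 vanishes, which is why a model restricted to pairwise statistics has nothing to fit. Targets with some low-order structure have nonzero low-order coefficients, consistent with the abstract's remark that the advantage shrinks in that case.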

2606.11680 2026-06-11 cs.AI cs.CL cs.LG 交叉投稿

Organize then Retrieve: Hierarchical Memory Navigation for Efficient Agents

先组织再检索:面向高效智能体的层次化记忆导航

Hao-Lun Hsu, Nikki Lijing Kuang, Boyi Liu, Zhewei Yao, Yuxiong He

发表机构 * Duke University(杜克大学) Snowflake AI Research(Snowflake AI研究)

AI总结 提出HORMA框架,通过构建文件系统式的层次化记忆结构并利用强化学习训练的轻量级导航代理,实现高效检索,在长时任务中提升性能并降低令牌消耗。

AI中文摘要

大型语言模型(LLM)智能体由于固有的无状态性,在处理长时任务时面临挑战,所有任务相关信息必须编码到不断增长的输入上下文中,导致推理质量下降、推理成本增加和延迟升高,因此需要高效的工作记忆机制。然而,现有方法要么依赖有损压缩,要么基于相似性检索,往往无法捕捉多步智能体任务所需的时间结构和因果依赖关系。在这项工作中,我们提出了HORMA,一种层次化组织与检索记忆智能体,它将经验组织成类似文件系统的层次化结构,其中总结的实体链接到相应的原始轨迹,从而在保留详细信息的同时实现高效访问。HORMA将工作记忆分解为两个阶段:结构化记忆构建和基于导航的检索。构建模块通过区分由信息缺失导致的失败和由误导性或过载上下文导致的失败,迭代地优化经验的结构化方式。导航模块使用强化学习训练的轻量级代理遍历层次结构,选择最小但充分的上下文,从而减少关键执行路径上的延迟。在ALFWorld、LoCoMo和LongMemEval上,HORMA在受限上下文预算下提升了任务性能,同时在长对话任务中最多仅使用基线22.17%的令牌。与现有方法相比,它始终实现了更好的效率-性能权衡,并能有效泛化到未见任务。

英文摘要

Large language model (LLM) agents struggle with long-horizon tasks due to their inherent statelessness, requiring all task-relevant information to be encoded in growing input contexts. The resulting degraded reasoning quality, increased inference cost, and higher latency necessitate efficient working memory mechanisms. However, existing approaches either rely on lossy compression or similarity-based retrieval, which often fail to capture temporal structure and causal dependencies required for multi-step agentic tasks. In this work, we present HORMA, a Hierarchical Organize-and-Retrieve Memory Agent that organizes experience into a file-system-like hierarchical structure, where summarized entities are linked to the corresponding raw trajectories, enabling efficient access without losing detailed information. HORMA decomposes working memory into two stages: structured memory construction and navigation-based retrieval. The construction module iteratively refines how experiences are structured by distinguishing between failures caused by missing information and those caused by misleading or overloaded context. The navigation module retrieves task-relevant context by traversing the hierarchy using a lightweight agent trained with reinforcement learning to select minimal yet sufficient context, thereby reducing latency along the critical execution path. Across ALFWorld, LoCoMo, and LongMemEval, HORMA improves task performance under constrained context budgets while requiring at most 22.17% of the baseline token usage in long conversation tasks. Compared to existing methods, it consistently achieves better efficiency-performance trade-offs and generalizes effectively to unseen tasks.

2606.11712 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Substrate Asymmetry in User-Side Memory: A Diagnostic Framework

用户侧记忆中的子模块不对称性:一个诊断框架

Youwang Deng

发表机构 * EpistemicaLab — Independent Research(EpistemicaLab — 独立研究)

AI总结 提出一个诊断框架,将LLM用户侧记忆分解为行为一致性、事实存在和事实缺失三个正交子模块,发现参数记忆与检索记忆在不同子模块上存在不对称性,且RLHF调优加剧了这种不对称性。

Comments
Preprint. Code: this https URL
AI中文摘要

LLM中的用户侧记忆通常被评分为单一的“个性化”能力:给定用户历史,输出是否更了解用户?我们表明这种聚合指标隐藏了相反方向的失败。记忆至少可分解为三个正交轴——行为一致性(风格、语气)、事实存在(回忆历史中的事实)和事实缺失(当事实缺失时弃权)——并且没有单一子模块能在所有三个轴上获胜。在受控的50用户合成语料库和真实数据探针(LaMP-3)上,比较每个用户的gamma-LoRA(在每个用户历史上训练的小型LoRA适配器;gamma表示每个用户,而非每个任务)与BGE-large密集top-K检索,我们发现gamma-LoRA在行为风格上决定性获胜,而RAG在事实缺失上决定性获胜——并且注意力层21-35中的相同查询投影细胞因果地承载了这两个相反方向的效果(将这些LoRA权重归零会使缺失探针TPR提高33个百分点,并使存在探针TPR下降20个百分点)。在更经过RLHF调优的Llama-3.1-8B-Instruct上,不对称性增强而非愈合:参数记忆的行为优势崩溃,而其相对于检索的缺失校准赤字扩大——这是对参数用户记忆的对齐税。在真实数据LaMP-3上,gamma-LoRA表现低于多数基线;一个9条件缓解扫描诊断出这是指令遵循崩溃,而非子模块失败(9x2交叉乘积显示评估时的{1..5} logit掩码使每个配方的主准确率达到>=0.995),并且最佳训练时修复在Llama上逐位复制。最后,子模块选择路由是问题分类,而非校准:仅基于问题文本的110M DistilBERT击败了每个基于logit的路由器。我们贡献了诊断框架、诊断出的真实数据负例、对齐税复制以及路由即分类的发现。

英文摘要

User-side memory in LLMs is typically scored as a single "personalization" capability: given a user's history, is the output more user-aware? We show this aggregate metric hides opposite-direction failures. Memory factorises into at least three orthogonal axes -- behavioral consistency (style, voice), factual presence (recall facts in history), and factual absence (abstain when a fact is absent) -- and no single substrate wins all three. Comparing per-user gamma-LoRA (a small LoRA adapter trained on each user's history; gamma denotes per-user, not per-task) against BGE-large dense top-K retrieval on a controlled 50-user synthetic corpus and a real-data probe (LaMP-3), we find gamma-LoRA decisively wins behavioral style while RAG decisively wins factual absence -- and the same query-projection cells in attention layers 21-35 causally load-bear both effects in opposite directions (zeroing those LoRA weights raises absence-probe TPR by +33 pp and drops presence-probe TPR by 20 pp). On the more heavily RLHF-tuned Llama-3.1-8B-Instruct the asymmetry strengthens, not heals: parametric memory's behavioral advantage collapses while its absence-calibration deficit against retrieval widens -- an alignment tax on parametric user-memory. On real-data LaMP-3, gamma-LoRA underperforms a majority baseline; a 9-condition mitigation sweep diagnoses this as instruction-following collapse, not substrate failure (a 9x2 cross-product shows the eval-time {1..5} logit mask drives main_acc to >=0.995 on every recipe), and the best training-time fix replicates bit-identically on Llama. Finally, substrate-selection routing is question-classification, not calibration: a 110M DistilBERT on the question text alone beats every logit-based router. We contribute the diagnostic framework, the diagnosed real-data negative, the alignment-tax replication, and the routing-as-classification finding.

2606.12105 2026-06-11 cs.RO cs.CV cs.LG 交叉投稿

DAM-VLA: Decoupled Asynchronous Multimodal Vision Language Action model

DAM-VLA: 解耦异步多模态视觉语言动作模型

Pankhuri Vanjani, Zhuoyue Li, Jakub Suliga, Moritz Reuss, Gianluca Geraci, Xinkai Jiang, Rudolf Lioutikov

发表机构 * Intuitive Robots Lab, Karlsruhe Institute of Technology (KIT)(直觉机器人实验室,卡尔斯鲁厄理工学院) NVIDIA(英伟达) Robotics Institute of Germany(德国机器人研究所)

AI总结 针对VLA模型同步时钟与物理交互中不同模态频率不匹配的问题,提出DAM-VLA,通过解耦各模态时间处理、维护传感器速率更新的潜在缓冲区,并利用门控交叉注意力整合高频模态,在7个真实操作任务中平均成功率提升至95.2%。

Comments
17 pages, 8 figures
AI中文摘要

视觉-语言-动作(VLA)模型继承了视觉-语言预训练中的共享同步时钟,以单一速率处理每个输入。这与物理交互不一致,在物理交互中,高频模态以数百赫兹变化,视觉演化较慢,而语言在整个回合中保持不变。同步VLA会过采样慢速模态,欠采样快速模态,并将动作生成限制在最低有效频率。我们假设解耦每个模态的时间处理,让每个模态以其自身传感器速率更新和保留信息,可以产生更强的表示和更鲁棒的控制。我们提出DAM-VLA,它维护每个模态的潜在缓冲区,以传感器速率刷新并由动作头连续读取,通过门控交叉注意力整合新的高频模态,同时保持预训练主干不变。在七个接触丰富的真实世界操作任务中,DAM-VLA将最强同步基线的平均成功率提高了一倍以上(95.2% vs. 40.95%),同时维持平滑、反应式的100 Hz控制。项目网站:此 https URL

英文摘要

Vision-language-action (VLA) models inherit a shared synchronous clock from vision-language pretraining, processing every input at one rate. This is misaligned with physical interaction, where a high-frequency modality changes at hundreds of hertz, vision evolves more slowly, and language stays constant across an episode. A synchronous VLA oversamples slow modalities, undersamples fast ones, and caps action generation at the lowest effective frequency. We hypothesize that decoupling temporal processing per modality, letting each update and retain information at its own sensor rate, yields stronger representations and more robust control. We present DAM-VLA, which maintains per-modality latent buffers refreshed at sensor rates and read continuously by the action head, integrating new high-frequency modalities through gated cross-attention that leaves the pretrained backbone intact. Across seven contact-rich real-world manipulation tasks, DAM-VLA more than doubles the average success rate of the strongest synchronous baseline (95.2% vs. 40.95%) while sustaining smooth, reactive 100 Hz control. Project website: this https URL

2507.11688 2026-06-11 cs.LG 版本更新

Composing Linear Layers from Irreducibles

从不可约元组合线性层

Travis Pence, Daisuke Yamada, Vikas Singh

AI总结 提出用Clifford代数将线性层分解为双向量(几何基元)的组合,仅需O(log^2 d)参数,在LLM注意力投影中匹配强基线性能。

Comments
35 Pages, 11 Tables, 6 Figures, Appearing in NeurIPS 2025
AI中文摘要

当代大型模型常表现出暗示存在低级基元的行为,这些基元组合成功能更丰富的模块,但这些基本构建块仍未被很好理解。我们通过询问:能否从最小几何基元集合中识别/合成线性变换?来研究线性层中的这种组合结构。利用Clifford代数,我们证明线性层可以表示为双向量(编码有向平面的几何对象)的组合,并引入一种可微算法将其分解为转子乘积。这种构造仅需O(log^2 d)个参数,而稠密矩阵需要O(d^2)。应用于LLM注意力层中的键、查询和值投影,我们的基于转子的层匹配了块Hadamard和低秩近似等强基线的性能。我们的发现为这些几何基元如何在深度模型中组合成更高层次功能提供了代数视角。

英文摘要

Contemporary large models often exhibit behaviors suggesting the presence of low-level primitives that compose into modules with richer functionality, but these fundamental building blocks remain poorly understood. We investigate this compositional structure in linear layers by asking: can we identify/synthesize linear transformations from a minimal set of geometric primitives? Using Clifford algebra, we show that linear layers can be expressed as compositions of bivectors -- geometric objects encoding oriented planes -- and introduce a differentiable algorithm that decomposes them into products of rotors. This construction uses only O(log^2 d) parameters, versus O(d^2) required by dense matrices. Applied to the key, query, and value projections in LLM attention layers, our rotor-based layers match the performance of strong baselines such as block-Hadamard and low-rank approximations. Our findings provide an algebraic perspective on how these geometric primitives can compose into higher-level functions within deep models.

2511.00044 2026-06-11 cs.LG nlin.AO 版本更新

Time-multiplexed layer reuse for physical neural networks

物理神经网络的时间复用层重用

Kohei Tsuchiyama, Andre Roehm, Takatomo Mihana, Ryoichi Horisaki

AI总结 针对物理神经网络权重调整慢的瓶颈,提出TIDAL-Net,通过时间复用层增加有效深度,在图像分类和自然语言处理任务上提升性能。

AI中文摘要

物理神经网络(PNN)是下一代计算的有前途的候选者,但现有演示仍比现代数字神经网络小几个数量级,而现代数字神经网络的最新进展是由可训练参数的快速增长驱动的。这种情况类似于早期数字神经网络的限制,这导致了关于参数重用的想法。我们研究了类似高效的硬件架构可能是什么样子,特别关注PNN中权重重新调整的常见瓶颈。我们提出了时间索引深度交替层网络(TIDAL-Net),它占据循环神经网络和深度神经网络之间的中间状态,专门针对常见PNN原型的规模和限制。TIDAL-Net利用许多PNN中快速前向动力学和缓慢可训练权重与偏置之间的时间尺度分离,通过逐层时间复用来增加有效深度,同时限制实现成本。在图像分类和自然语言处理任务上的数值实验表明,TIDAL-Net在仅对传统PNN进行微小修改的情况下提高了性能。

英文摘要

Physical neural networks (PNNs) are promising candidates for next-generation computing, but existing demonstrations remain several orders of magnitude smaller than modern digital neural networks, whose recent advances have been driven by rapid growth in trainable parameters. This situation resembles the constraints of early digital neural networks, which led to ideas around parameter reuse. We investigate what similarly efficient hardware architectures may look like, focusing specifically on the common bottleneck of slow re-adjustment of the weights in PNNs. We propose the Time-Indexed Deep Alternating Layers Network (TIDAL-Net), which occupies an intermediate regime between recurrent and deep neural networks, specifically aimed at the scales and restrictions of common PNN prototypes. TIDAL-Net leverages the timescale separation found in many PNNs between fast forward dynamics and slowly trainable weights and biases, using layer-by-layer time multiplexing to increase effective depth while limiting implementation cost. Numerical experiments on image classification and natural language processing tasks show that TIDAL-Net improves performance with only minor modifications to conventional PNNs.

2601.14792 2026-06-11 cs.LG Replacement

Robustness of Mixtures of Experts to Feature Noise

混合专家模型对特征噪声的鲁棒性

Dong Sun, Rahul Nittala, Rebekka Burkholz

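AI summary: Studies the robustness of mixture-of-experts models under feature noise and finds that sparse expert activation acts as a noise filter, giving lower generalization error, stronger robustness, and faster convergence than dense networks.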

Comments
ICML 2026
Chinese abstract (AI-translated)

尽管混合专家(MoE)模型在实践中取得了成功,但其为何能在参数规模相当的情况下超越密集网络仍不清楚。我们研究了一个等参数设置,其中输入具有潜在的模块化结构但被特征噪声破坏,这作为内部激活噪声的代理。我们表明,稀疏专家激活起到了噪声滤波器的作用:与密集估计器相比,MoE在特征噪声下实现了更低的泛化误差、对扰动的更强鲁棒性以及更快的收敛速度。在合成数据和真实语言任务上的实验结果证实了理论见解,展示了稀疏模块化计算带来的持续鲁棒性和效率提升。

English abstract

Despite their practical success, it remains unclear why Mixture of Experts (MoE) models can outperform dense networks beyond sheer parameter scaling. We study an iso-parameter regime where inputs exhibit latent modular structure but are corrupted by feature noise, a proxy for noisy internal activations. We show that sparse expert activation acts as a noise filter: compared to a dense estimator, MoEs achieve lower generalization error under feature noise, improved robustness to perturbations, and faster convergence speed. Empirical results on synthetic data and real-world language tasks corroborate the theoretical insights, demonstrating consistent robustness and efficiency gains from sparse modular computation.
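To make "sparse expert activation" concrete, here is a minimal top-k routing sketch. It assumes linear experts and a linear gate, which the abstract does not specify; the names `top_k_moe`, `gate_w`, and `expert_ws` are hypothetical. It shows only the routing mechanism, not the paper's noise-filtering analysis.

```python
import numpy as np

def top_k_moe(x, gate_w, expert_ws, k=1):
    """Route input x to the k highest-scoring linear experts and mix their outputs."""
    logits = gate_w @ x                          # one gate score per expert
    top = np.argsort(logits)[-k:]                # indices of the k active experts
    weights = np.exp(logits[top] - logits[top].max())
    weights /= weights.sum()                     # softmax over the active experts only
    # Unselected experts are never evaluated, so they contribute nothing to y.
    y = sum(w * (expert_ws[i] @ x) for w, i in zip(weights, top))
    return y, top
```

With `k=1` the output is exactly the selected expert's output, which is the sparsest case of the modular computation the abstract describes.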

2602.10743 2026-06-11 cs.LG Replacement

Kalman Linear Attention: Parallel Bayesian Filtering For Efficient Language Modelling and State Tracking

Kalman线性注意力:用于高效语言建模和状态跟踪的并行贝叶斯滤波

Vaisakh Shaj, Cameron Barker, Aidan Scannell, Andras Szecsenyi, Elliot J. Crowley, Amos Storkey

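AI summary: Proposes the Kalman Linear Attention layer, which recasts sequence mixing as exact Bayesian filtering in information form to enable time-parallel inference; it is more expressive than GLA at the same computational cost and outperforms linear SSMs and attention on state-tracking tasks.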

Comments
Accepted at ICML 2026. An earlier version of this work was presented at the 1st Workshop on Epistemic Intelligence in Machine Learning (EIML) at EurIPS 2025
Chinese abstract (AI-translated)

状态空间语言模型如Mamba和门控线性注意力(GLA)提供了线性复杂度、可并行的Transformer替代方案,但其线性状态更新限制了表达力和鲁棒的状态跟踪。我们从概率角度弥合这一差距,将序列混合视为精确贝叶斯滤波,以卡尔曼滤波为核心原语。经典卡尔曼滤波提供有原则的状态和不确定性估计,但被认为是固有顺序的;我们展示了将其重参数化为信息形式后,更新变为关联扫描——因此每个token的循环更新是非线性的(莫比乌斯/精度递归),但保持时间并行。由此产生的Kalman线性注意力(KLA)层是一个即插即用的序列混合器,执行时间并行概率推理,携带显式的信念状态不确定性,并且在相同计算成本下比GLA风格的线性更新具有严格更强的表达力。这种表达力直接转化为更强的状态跟踪:KLA解决了线性SSM和注意力无法解决的排列组合($A_5$)任务,同时保持扫描并行。作为即插即用原语,它在合成token操作和零样本常识基准测试中匹配或改进了现代SSM和GLA,并且是首批在十亿token规模下训练的堆叠贝叶斯滤波原语之一。

English abstract

State-space language models such as Mamba and gated linear attention (GLA) offer linear-complexity, parallelisable alternatives to transformers, but their linear state updates limit expressivity and robust state tracking. We close this gap from a probabilistic angle, casting sequence mixing as exact Bayesian filtering with the Kalman filter as the core primitive. Classical Kalman filters give principled state and uncertainty estimates but are viewed as inherently sequential; we show that reparameterising them in information form turns their updates into an associative scan, so the per-token recurrent update is non-linear (a Möbius/precision recursion) yet remains temporally parallel. The resulting Kalman Linear Attention (KLA) layer is a drop-in sequence mixer that performs time-parallel probabilistic inference, carries an explicit belief-state uncertainty, and is strictly more expressive than GLA-style linear updates at the same computational cost. This expressivity translates directly into stronger state tracking: KLA solves permutation-composition ($A_5$) tasks that linear SSMs and attention cannot, while staying scan-parallel. As a drop-in primitive it also matches or improves on modern SSMs and GLAs across synthetic token-manipulation and zero-shot commonsense benchmarks, and is among the first stacked Bayesian-filtering primitives trained at the billion-token scale.
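The claim that a Möbius/precision recursion is associative can be checked on a scalar toy model. The sketch below assumes a scalar linear-Gaussian system x_t = a x_{t-1} + w_t (variance q), y_t = h_t x_t + v_t (variance r_t); this parameterisation and all function names are our assumptions, not the paper's layer. It tracks only the posterior precision, not the mean. Each step is a Möbius map represented by a 2x2 matrix, so composing steps is matrix multiplication, which is associative and therefore amenable to a parallel scan.

```python
import numpy as np

def mobius_matrices(a, q, h, r):
    """2x2 matrices whose Mobius action maps precision lambda_{t-1} to lambda_t."""
    h, r = np.asarray(h, float), np.asarray(r, float)
    c = h**2 / r                                  # information gained from each observation
    M = np.empty((len(c), 2, 2))
    M[:, 0, 0] = 1.0 + c * q
    M[:, 0, 1] = c * a**2
    M[:, 1, 0] = q
    M[:, 1, 1] = a**2
    return M

def precision_sequential(lam0, a, q, h, r):
    """Reference recursion: predict, then update, one step at a time."""
    lam, out = lam0, []
    for ht, rt in zip(h, r):
        lam_pred = lam / (a**2 + q * lam)         # predict step written in precision form
        lam = lam_pred + ht**2 / rt               # update step adds observation information
        out.append(lam)
    return np.array(out)

def precision_via_products(lam0, a, q, h, r):
    """Same precisions from prefix products of the Mobius matrices."""
    out, prefix = [], np.eye(2)
    for Mt in mobius_matrices(a, q, h, r):        # a loop here; a parallel scan could replace it
        prefix = Mt @ prefix
        prefix = prefix / np.abs(prefix).max()    # rescaling leaves the Mobius action unchanged
        num, den = prefix @ np.array([lam0, 1.0])
        out.append(num / den)
    return np.array(out)
```

The prefix products are written as a loop for clarity; because matrix multiplication is associative, the same prefixes could be computed with any parallel prefix-scan primitive.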

2603.05573 2026-06-11 cs.LG Replacement

Why Depth Matters in Parallelizable Sequence Models: A Lie Algebraic View

为什么深度在可并行化序列模型中重要:一个李代数视角

Gyuryang Heo, Timothy Ngotiaoco, Kazuki Irie, Samuel J. Gershman, Bernardo L. Sabatini

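AI summary: Uses a Lie-algebraic control perspective to relate the expressivity of parallelizable sequence models (such as Transformer variants and state-space models) to their depth, and shows that approximation error decreases exponentially with depth.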

Comments
v2: Format update; split former Theorem 3.4 into Theorem 3.4 and Corollary 3.5 for clarity; corrected an indexing error affecting Corollary 3.6, Proposition 3.7, and Figure 2
Chinese abstract (AI-translated)

可扩展的序列模型,如Transformer变体和结构化状态空间模型,通常以表达能力换取序列级并行性,从而实现高效训练。本文从李代数控制视角,考察模型在其表达能力范围之外运行时误差的边界及其缩放规律。我们的理论建立了序列模型深度与李代数扩展塔之间的对应关系。与近期理论研究相呼应,我们刻画了常数深度序列模型的李代数类别及其相应的表达能力边界。此外,我们解析推导了近似误差边界,并证明误差随深度增加呈指数下降,这与这些模型的强大实证表现一致。我们通过在符号词和连续值状态追踪问题上的实验验证了理论预测。

English abstract

Scalable sequence models, such as Transformer variants and structured state-space models, often trade expressive power for sequence-level parallelism, which enables efficient training. Here we examine the bounds on error and how error scales when models operate outside of their expressivity regimes using a Lie-algebraic control perspective. Our theory formulates a correspondence between the depth of a sequence model and the tower of Lie algebra extensions. Echoing recent theoretical studies, we characterize the Lie-algebraic class of constant-depth sequence models and their corresponding expressivity bounds. Furthermore, we analytically derive an approximation error bound and show that error diminishes exponentially as the depth increases, consistent with the strong empirical performance of these models. We validate our theoretical predictions using experiments on symbolic word and continuous-valued state-tracking problems.

2605.04853 2026-06-11 cs.LG Replacement

Hybrid Iterative Neural Low-Regularity Integrator for Nonlinear Dispersive Equations

非线性色散方程的混合迭代神经低正则积分器

Zhangyong Liang, Huanhuan Gao

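AI summary: Proposes HIN-LRI, a hybrid framework in which a lightweight neural network learns to correct the structured truncation error of a classical low-regularity integrator; explicit time-step scaling of the correction guarantees stability, and the method improves accuracy on dispersive equations with rough data while retaining out-of-distribution transfer.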

Chinese abstract (AI-translated)

我们提出HIN-LRI,一种混合框架,通过训练一个神经算子来校正经典数值求解器的结构截断误差,从而增强该求解器。基础低正则积分器为非线性色散偏微分方程提供一致的一阶近似,而一个在低维潜在流形上运行的轻量神经网络学习解析方法无法闭合的残差缺陷。神经校正上的显式时间步缩放确保其Lipschitz贡献为$\mathcal{O}(\tau)$,从而产生一个在步长上一致有界且与空间分辨率无关的Gronwall稳定性因子。该网络通过求解器在环的目标进行端到端训练,该目标展开完整迭代并在Bourgain型范数中惩罚轨迹误差,使学习与多步求解器动态对齐,而非孤立的单步目标。在给定假设下,全局误差满足$C(\varepsilon_{net}+\delta)\,\tau^\gamma\ln(1/\tau)$,其中$\varepsilon_{net}$衡量网络逼近质量,$\delta$衡量训练不足。在三个具有粗糙数据的色散基准上的实验表明,HIN-LRI在精度上优于解析积分器、分裂方法和神经PDE替代模型,具有稳定的空间细化、有效的分布外迁移和适度的在线开销。

English abstract

We propose HIN-LRI, a hybrid framework that augments a classical numerical solver with a neural operator trained to correct the solver's structured truncation error. A base low-regularity integrator provides a consistent first-order approximation to nonlinear dispersive PDEs, while a lightweight neural network, operating on a low-dimensional latent manifold, learns the residual defect that analytical methods cannot close. An explicit time-step scaling on the neural correction ensures that its Lipschitz contribution remains $\mathcal{O}(\tau)$, yielding a Gronwall stability factor bounded uniformly in the step size and independent of the spatial resolution. The network is trained end-to-end through a solver-in-the-loop objective that unrolls the full iteration and penalises trajectory error in a Bourgain-type norm, aligning learning with multi-step solver dynamics rather than isolated one-step targets. Under stated assumptions, the global error satisfies $C(\varepsilon_{net}+\delta)\,\tau^\gamma\ln(1/\tau)$, where $\varepsilon_{net}$ measures the network approximation quality and $\delta$ the training shortfall. Experiments on three dispersive benchmarks with rough data show that HIN-LRI improves accuracy over analytical integrators, splitting methods, and neural PDE surrogates, with stable spatial refinement, effective out-of-distribution transfer, and modest online overhead.
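The role of the explicit time-step scaling can be seen in a short, generic stability estimate. This is a sketch under assumptions that are ours, not the paper's: the base step $\Phi_\tau$ is taken to be Lipschitz with constant $1 + C\tau$, the correction $\mathcal{N}_\theta$ is Lipschitz with constant $L_{\mathcal{N}}$, and $\|\cdot\|$ is any norm in which both hold. The paper's actual norms and constants may differ.

```latex
\begin{align*}
u^{n+1} &= \Phi_\tau(u^n) + \tau\,\mathcal{N}_\theta(u^n), \\
\|u^{n+1} - v^{n+1}\|
  &\le (1 + C\tau)\,\|u^n - v^n\| + \tau L_{\mathcal{N}}\,\|u^n - v^n\|
   = \bigl(1 + (C + L_{\mathcal{N}})\tau\bigr)\,\|u^n - v^n\|, \\
\|u^{n} - v^{n}\|
  &\le \bigl(1 + (C + L_{\mathcal{N}})\tau\bigr)^{n}\,\|u^0 - v^0\|
   \le e^{(C + L_{\mathcal{N}})\,T}\,\|u^0 - v^0\|, \qquad n\tau \le T .
\end{align*}
```

Because the correction enters with a factor $\tau$, the accumulated factor depends on the final time $T$ but not on the step size; without that factor the per-step constant would be $1 + C\tau + L_{\mathcal{N}}$ and the bound would blow up as $\tau \to 0$.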

2605.04893 2026-06-11 cs.LG cs.CL stat.ML Replacement

Self-Attention as Transport: Limits of Symmetric Spectral Diagnostics

自注意力作为传输:对称谱诊断的极限

Dominik Dahlem, Diego Maniloff, Mac Misiura

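AI summary: Studies two failure shapes of attention routing in language models (over-concentration and over-diffusion), proves that symmetric spectral diagnostics are orientation-blind, derives a theoretical floor on transport capacity for causal attention, and proposes a two-axis diagnostic based on capacity and direction.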

Comments
48 pages, 6 figures, 7 tables; 81-page online supplement (proofs, additional experiments, dataset statistics) as an ancillary file
Chinese abstract (AI-translated)

当语言模型处理幻觉响应时,其注意力路由往往以两种形状之一失效:过度集中在狭窄的位置集合上,或者分散得如此广泛以至于相关性被稀释,而失效的形状携带诊断信号。我们研究这些形状作为诊断特征,从在基准标记响应的\emph{强制评分}下计算的注意力矩阵中得出,而不是在实时生成期间。一类广泛使用的谱方法分析度归一化注意力算子的对称分量,该算子控制传输\emph{容量};我们证明该算子的每个转置不变谱诊断在结构上是\emph{方向盲的}(它无法区分算子与其转置,因此无法检测信息流方向),并且盲定理的逆定理将任何Lipschitz诊断的转置敏感性限制为不对称系数$G$。将其与规范因果架构的闭式二分-Cheeger景观配对,我们证明均匀因果注意力满足一个与$n$无关的下界$\phi \ge 1/5$,而窗口注意力以$O(w/n)$穿透下界;失效模式在形状上不同,而不仅仅在数值上不同。这个下界是一个理想化架构的基准,而不是经验吸引子:穿透它的真实注意力头的比例本身就是一个架构特征。由此产生的双轴诊断($\phi$表示容量,$G$表示方向)产生一个可证伪的极性预测:瓶颈主导和分散主导的基准应表现出相反的极性。在长度控制评估下,传输特征在测试的仅解码器、仅编码器和编码器-解码器模型中保持可解释的信号(0.62-0.84 LC-AUROC),极性在HaluEval和MedHallu之间如预测般反转。

English abstract

When a language model processes a hallucinated response, its attention routing tends to fail in one of two shapes: over-concentrating on a narrow set of positions, or spreading so diffusely that relevance is diluted, and the shape of the failure carries diagnostic signal. We study these shapes as a diagnostic characterization, computed from attention matrices under \emph{forced scoring} of benchmark-labeled responses rather than during live generation. A widely used family of spectral methods analyzes the symmetric component of the degree-normalized attention operator, which governs transport \emph{capacity}; we prove that every transpose-invariant spectral diagnostic of this operator is structurally \emph{orientation-blind} (it cannot distinguish an operator from its transpose, and therefore cannot detect information-flow direction), with a converse to the blindness theorem bounding any Lipschitz diagnostic's transpose sensitivity by the asymmetry coefficient $G$. Pairing this with a closed-form bipartite-Cheeger landscape for canonical causal architectures, we show that uniform causal attention satisfies an $n$-independent floor $\phi \ge 1/5$, while window attention pierces the floor as $O(w/n)$; failure modes are shape-different, not just value-different. This floor is an idealized-architecture benchmark, not an empirical attractor: the fraction of real attention heads that pierce it is itself an architectural signature. The resulting two-axis diagnostic ($\phi$ for capacity, $G$ for direction) yields a falsifiable polarity prediction: bottleneck- and diffuse-dominated benchmarks should exhibit opposite polarity. Under length-controlled evaluation, transport features retain interpretable signal (0.62-0.84 LC-AUROC) across the tested decoder-only, encoder-only, and encoder-decoder models, with polarity reversing as predicted between HaluEval and MedHallu.
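The orientation-blindness statement can be seen numerically on a toy matrix. The sketch below uses a plain row-stochastic causal attention matrix rather than the paper's degree-normalized operator, and the `asymmetry` score is an illustrative stand-in, not the paper's coefficient $G$; all function names are hypothetical.

```python
import numpy as np

def causal_uniform_attention(n):
    """Row-stochastic causal attention: token i attends uniformly to tokens 0..i."""
    A = np.tril(np.ones((n, n)))
    return A / A.sum(axis=1, keepdims=True)

def symmetric_spectrum(A):
    """Eigenvalues of the symmetric part (A + A^T) / 2, in ascending order."""
    return np.linalg.eigvalsh(0.5 * (A + A.T))

def asymmetry(A):
    """Illustrative score ||A - A^T||_F / ||A||_F; zero only for symmetric A."""
    return np.linalg.norm(A - A.T) / np.linalg.norm(A)
```

For `A = causal_uniform_attention(6)`, the symmetric spectra of `A` and `A.T` coincide even though the two matrices route information in opposite directions, while the asymmetry score is nonzero. Any diagnostic built only from the symmetric part therefore cannot tell the two apart.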

2605.15435 2026-06-11 cs.LG cs.NE Replacement

On the Stability of Growth in Structural Plasticity

结构塑性中增长的稳定性

Lute Lillo, Nick Cheney

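AI summary: Shows that growing units during training is not simply the inverse of pruning: pruning selects among units trained from the start, whereas growth inserts new units into an already specialized optimization trajectory, where they are forward-active but receive weak gradient signal. Grow can reach high final accuracy during structural editing, while Prune is stronger on trajectory-averaged performance and when the final sparse network is retrained from scratch; in continual learning, Grow is competitive mainly when new units have enough time to integrate.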

Chinese abstract (AI-translated)

标准深度学习流程通常在训练前选定网络架构,并在整个优化过程中保持不变。相比之下,模型也可以在训练过程中通过编辑自身结构来适应,例如剪枝现有隐藏神经元单元或生长新单元。尽管生长对自适应和持续学习系统很有吸引力,但我们表明它并非剪枝的简单逆过程。剪枝是在从一开始就参与训练的单元中进行选择,而生长则是将新单元插入已经专门化的优化轨迹。我们将这一插入问题单独分离出来,并表明新生单元通常“前向活跃但反向饥饿”:它们参与前向计算,但接收到的梯度信号远弱于既有单元。这一劣势在小型MLP基准中较小,但在带有卷积主干的更难的图像分类设置中变得明显。在这些设置中,Grow在结构编辑过程中可以达到较高的最终精度,而当性能按训练轨迹取平均,或最终稀疏网络从头重新训练时,Prune更强。针对优化器状态、插入、选择和可训练性的干预表明,改善新生单元的整合可以提升自适应性能,但并不会自动产生更好的最终子网络。在强调可塑性损失的持续学习基准中,Grow主要在新单元有足够时间整合时才具有竞争力。总之,这些结果表明,Grow不应仅作为架构搜索算子来评估,还应被视为一个时间敏感的优化过程,其成功取决于插入的稳定性。

English abstract

Standard deep-learning pipelines usually choose the network architecture before training and keep it fixed throughout optimization. In contrast, a model can also be adapted by editing its structure during training, for example by pruning existing hidden-neuron units or growing new ones. Although growth is appealing for adaptive and continual systems, we show that it is not simply the inverse of pruning. Pruning selects among units that have participated in training from the start, whereas growth inserts new units into an already specialized optimization trajectory. We isolate this insertion problem and show that newborn units are often forward-active but backward-starved: they participate in the forward computation, yet receive much weaker gradient signal than incumbent units. This disadvantage is minor in small MLP benchmarks, but becomes clear in harder image-classification settings with a convolutional trunk. In these settings, \textsc{Grow} can achieve high final accuracy during the structural-editing procedure, while \textsc{Prune} is stronger when performance is averaged over the training trajectory or when the final sparse network is retrained from scratch. Interventions targeting optimizer state, insertion, selection, and trainability show that improving the integration of newborn units can improve adaptive performance, but does not automatically produce better final subnetworks. In continual-learning benchmarks stressing plasticity loss, \textsc{Grow} becomes competitive mainly when new units have enough time to integrate. Together, these results suggest that \textsc{Grow} should be evaluated not only as an architecture-search operator, but as a time-sensitive optimization process whose success depends on insertion stability.

2606.07082 2026-06-11 cs.LG cs.AI Replacement

On the Geometry of On-Policy Distillation

论在线策略蒸馏的几何结构

Zhennan Shen, Yanshu Li, Qingyu Yin, Chak Tou Leong, Zhilin Wang, Yanxu Chen, Rongduo Han, Sunbowen Lee, Yi R. Fung

Affiliations: HKUST; UT Austin; Zhejiang University; Hong Kong PolyU; USTC; BUPT; Nankai University; BIT

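AI summary: Uses parameter-space diagnostics to show that on-policy distillation (OPD) update trajectories have distinctive geometric properties, including a relaxed off-principal regime and subspace locking, indicating that OPD is not merely an intermediate point between SFT and RLVR.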

Comments
17 pages, 8 figures
Chinese abstract (AI-translated)

在线策略蒸馏(OPD)越来越多地被用于改进大型语言模型的推理能力,但其训练动态仍鲜为人知。我们刻画了OPD更新在参数空间中的轨迹,并将其与监督微调(SFT)和可验证奖励强化学习(RLVR)进行了比较。一套参数空间诊断一致地将OPD置于松弛的离主成分区域:与SFT相比,其更新影响更少的权重,并更强烈地避开主方向;而与RLVR相比,其约束更宽松。除了这种静态定位外,OPD还表现出子空间锁定:其累积更新迅速进入一个狭窄的低维通道。将训练限制在早期形成的更新子空间内能保持OPD的性能,但会严重降低SFT,表明该锁定子空间对OPD在功能上是充分的。控制实验进一步表明,稀疏化更新令牌和将rollout生成移至离策略能保持秩动态,而将OPD目标与RLVR混合则会改变它们。总体而言,这些结果表明OPD不仅仅是SFT和RLVR之间的中间点,而是在参数空间中诱导出自身独特的更新几何结构。

English abstract

On-policy distillation (OPD) is increasingly used to improve large language model reasoning, but its training dynamics remain poorly understood. We characterize the trajectory of OPD updates in parameter space and compare it with supervised fine-tuning (SFT) and reinforcement learning with verifiable rewards (RLVR). A suite of parameter-space diagnostics consistently places OPD in a relaxed off-principal regime: compared with SFT, its updates affect fewer weights and avoid principal directions more strongly, while compared with RLVR, they remain less tightly constrained. Beyond this static localization, OPD exhibits subspace locking: its cumulative updates rapidly enter a narrow low-dimensional channel. Constraining training to the update subspace formed early in training preserves OPD performance but substantially degrades SFT, indicating that the locked subspace is functionally sufficient for OPD. Control experiments further show that sparsifying the update tokens and shifting rollout generation off-policy preserve the rank dynamics, whereas mixing the OPD objective with RLVR changes them. Overall, these results suggest that OPD is not merely an intermediate point between SFT and RLVR, but induces its own update geometry in parameter space.

2606.08343 2026-06-11 cs.LG Replacement

GENERIC-FNO: Embedding Energy Conservation and Entropy Production into Fourier Neural Operators

GENERIC-FNO:将能量守恒和熵产生嵌入傅里叶神经算子

Jason Sulskis, Sathya Ravi

Affiliations: University of Illinois at Chicago; Georgia Tech Research Institute

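AI summary: Proposes GENERIC-FNO, described as the first neural operator to embed the full GENERIC structure of nonequilibrium thermodynamics directly in function space; rank-one projections satisfy the degeneracy conditions exactly, giving energy conservation and entropy production, and the structural guarantees are preserved under super-resolution.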

Comments
Under review at TMLR
Chinese abstract (AI-translated)

我们引入了GENERIC-FNO,这是第一个将非平衡热力学的完整GENERIC(度量-辛)结构——可逆、能量守恒动力学和不可逆、熵产生动力学通过退化条件耦合——直接嵌入函数空间的神经算子。现有的保结构神经算子最多强制执行单一守恒律或可逆(哈密顿)结构,而热力学一致的学习仅限于有限维、图或粒子系统。GENERIC-FNO填补了这一空白:它将能量和熵泛函学习为神经算子,并将泊松和摩擦算子参数化为对角傅里叶乘子,夹在秩一投影之间,通过构造精确满足退化条件,无需惩罚项、更新投影或残差。退化恒等式对任何初始化、维度或分辨率都达到机器精度(残差~10^-13),因此连续时间动力学守恒学习的能量并精确产生熵;显式时间步进仅增加小的O(dt^2)漂移(每步残差~10^-6)。我们进一步指出,给定流的(E,S,L,M)分解并不唯一,并引入了一个规范不变的耗散诊断,独立于学习的泛函分离可逆和耗散动力学。在三个算子主干(1D/2D FNO和DeepONet)和四个涵盖可逆、耗散和混合机制的PDE上,GENERIC-FNO在4倍超分辨率范围(64到256)内零样本保持其精确结构保证,恢复物理耗散的真实顺序,并与强无约束和能量惩罚基线竞争,在相当或更少参数的情况下在多个耗散和混合问题上优于它们。

English abstract

We introduce GENERIC-FNO, the first neural operator to embed the full GENERIC (metriplectic) structure of nonequilibrium thermodynamics -- reversible, energy-conserving dynamics and irreversible, entropy-producing dynamics coupled through the degeneracy conditions -- directly in function space. Existing structure-preserving neural operators enforce at most a single conservation law or reversible (Hamiltonian) structure, while thermodynamically consistent learning has been confined to finite-dimensional, graph, or particle systems. GENERIC-FNO closes this gap: it learns the energy and entropy functionals as neural operators and parameterizes the Poisson and friction operators as diagonal Fourier multipliers sandwiched between rank-one projections that enforce the degeneracy conditions exactly, by construction, with no penalty term, update projection, or residual. The degeneracy identities hold to machine precision (residuals ~10^-13) for any initialization, dimension, or resolution, so the continuous-time dynamics conserve the learned energy and produce entropy exactly; the explicit time stepping adds only a small O(dt^2) drift (per-step residual ~10^-6). We further note that the (E,S,L,M) decomposition of a given flow is not unique, and introduce a gauge-invariant dissipation diagnostic separating reversible from dissipative dynamics independently of the learned functionals. Across three operator backbones (1D/2D FNOs and DeepONet) and four PDEs spanning reversible, dissipative, and mixed regimes, GENERIC-FNO preserves its exact structural guarantees zero-shot across a 4x super-resolution range (64 to 256), recovers the ground-truth ordering of physical dissipation, and is competitive with strong unconstrained and energy-penalized baselines, outperforming them on several dissipative and mixed problems at comparable or fewer parameters.
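A finite-dimensional analogue shows how projections can enforce the degeneracy conditions by construction. The sketch assumes the standard GENERIC form dx/dt = L ∇E + M ∇S with L antisymmetric and M symmetric positive semidefinite; dense matrices stand in for the paper's diagonal Fourier multipliers, and the function names are hypothetical.

```python
import numpy as np

def null_projector(g):
    """Projector onto the orthogonal complement of g (identity minus a rank-one projector)."""
    return np.eye(len(g)) - np.outer(g, g) / (g @ g)

def generic_operators(grad_E, grad_S, A_raw, B_raw):
    """Build L (antisymmetric, L @ grad_S = 0) and M (symmetric PSD, M @ grad_E = 0)."""
    P_S, P_E = null_projector(grad_S), null_projector(grad_E)
    L = P_S @ (A_raw - A_raw.T) @ P_S      # antisymmetric core sandwiched between projections
    M = P_E @ (B_raw @ B_raw.T) @ P_E      # PSD core sandwiched between projections
    return L, M
```

With these operators, dE/dt = ∇E·(L∇E + M∇S) vanishes and dS/dt = ∇S·(L∇E + M∇S) is nonnegative for any choice of `A_raw` and `B_raw`, up to floating-point error. This mirrors the "for any initialization" property the abstract reports.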

2. Representation Learning, Self-Supervision, and Contrastive Learning (14 papers)

2606.11508 2026-06-11 cs.LG q-bio.QM New submission

Probabilistic Contrastive Pretraining for Multi-task ADME Property Prediction

概率对比预训练用于多任务ADME性质预测

Yifan Xue, Srimukh Prasad Veccham, Saee Paliwal, Tyler Shimko, Micha Livne

Affiliations: NVIDIA

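AI summary: Proposes a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information, optimizing reconstruction, contrastive, and chemistry tasks in a single probabilistic latent-variable objective; fine-tuning uses task-specific MLP heads, and the method improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% on three datasets (averaged over significantly improved endpoints).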

Chinese abstract (AI-translated)

准确预测吸收、分布、代谢和排泄(ADME)性质对药物发现至关重要,但由于ADME终点存在噪声、相互依赖且数据有限,仍然具有挑战性。我们提出了一种分子图-Transformer预训练框架,结合了化学特异性自监督与对比互信息机器学习(cMIM)。我们的方法将分子图编码为潜变量,从图导出的潜代码重建SMILES字符串,并用领域特定的自监督化学任务增强对比目标。我们不是将这些任务视为具有单独调整损失权重的辅助正则化器,而是将重建、对比判别和化学特异性监督表述为单个概率潜变量目标中的单位权重对数概率因子。对于微调,我们提出了一种具有任务特定多层感知器头的多任务GNN读出架构,在保留共享表示学习的同时减轻负迁移并改进异质非线性任务关系的建模。在Biogen、ExpansionRX和ChEMBL-MT上,所得到的对比KERMT预训练相比KERMT基线分别提高了7.6%、9.9%和9.5%(在显著改进的终点上平均)。将ADME邻近分子添加到预训练语料库进一步改善了迁移,并且对比组件锐化了化学上有意义的潜邻域。

English abstract

Accurate prediction of absorption, distribution, metabolism, and excretion (ADME) properties is critical to drug discovery, but remains challenging because ADME endpoints are noisy, interdependent, and often data-limited. We propose a molecular graph-transformer pretraining framework that combines chemistry-specific self-supervision with contrastive mutual information machine learning (cMIM). Our method encodes molecular graphs into latent variables, reconstructs SMILES strings from the graph-derived latent codes, and augments the contrastive objective with domain-specific self-supervised chemistry tasks. Rather than treating these tasks as auxiliary regularizers with separately tuned loss weights, we formulate reconstruction, contrastive discrimination, and chemistry-specific supervision as unit-weighted log-probability factors in a single probabilistic latent-variable objective. For fine-tuning, we propose a multi-task GNN readout architecture with task-specific multilayer perceptron heads, preserving shared representation learning while mitigating negative transfer and improving the modeling of heterogeneous, nonlinear task relationships. Across Biogen, ExpansionRX, and ChEMBL-MT, the resulting Contrastive KERMT pretraining improves over the KERMT baseline by 7.6%, 9.9%, and 9.5% respectively (averaged over significantly-improved endpoints). Adding ADME-adjacent molecules to the pretraining corpus further improves transfer, and the contrastive component sharpens chemically meaningful latent neighborhoods.

2606.11614 2026-06-11 cs.LG cs.AI cs.CV New submission

Information-Theoretic Decomposition for Multimodal Interaction Learning

多模态交互学习的信息论分解

Zequn Yang, Yake Wei, Haotian Ni, Zhihao Xu, Di Hu

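AI summary: Proposes DMIL, an information-theoretic decomposition method for multimodal interactions that uses a variational decomposition architecture and a fine-tuning strategy to learn sample-specific redundant, unique, and synergistic interactions, improving multimodal learning performance.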

Comments
Accepted to CVPR 2026
Chinese abstract (AI-translated)

多模态学习依赖于捕获跨模态的冗余、独特和协同信息,这些信息共同构成多模态交互。一个关键但尚未充分探索的挑战是,这些隐式交互在不同样本间动态变化。在这项工作中,我们首次进行了系统的信息论分析,强调了学习这些动态的、样本特定的交互对于有效多模态学习的重要性。我们的分析进一步揭示了传统范式在学习这些不同交互类型方面的缺陷:模态集成方法难以捕获协同,而联合学习范式往往未能充分利用冗余信息。这突显了对一种能够基于每个样本自适应地从不同交互类型中学习的方法的需求。为此,我们提出了基于分解的多模态交互学习(DMIL),一种显式建模并学习样本特定交互的新范式。首先,我们设计了一个变分分解架构来分离组成交互组件。其次,我们采用了一种新的学习策略,在微调过程中利用这些显式交互组件来实现全面的交互学习。跨不同任务和架构的大量实验表明,DMIL通过适应整体的样本特定交互,始终实现了优越的性能。我们的框架灵活且广泛适用,建立了一个以交互为中心的多模态学习范式。代码可在以下网址获取:此 https URL。

English abstract

Multimodal learning hinges on capturing redundant, unique, and synergistic information across modalities, which collectively constitute multimodal interactions. A critical yet underexplored challenge is that these implicit interactions vary dynamically across samples. In this work, we present the first systematic, information-theoretic analysis highlighting why learning these dynamic, sample-specific interactions is critical for effective multimodal learning. Our analysis further reveals deficits in conventional paradigms at learning these distinct interaction types: modality ensemble approaches struggle to capture synergy, while joint learning paradigms often under-utilize redundant information. This highlights the need for an approach that can adaptively learn from different interaction types on a per-sample basis. To this end, we propose Decomposition-based Multimodal Interaction Learning (DMIL), a novel paradigm that explicitly models and learns from sample-specific interactions. First, we design a variational decomposition architecture to isolate the constituent interaction components. Second, we employ a new learning strategy that leverages these explicit interaction components in a fine-tuning process to achieve comprehensive interaction learning. Extensive experiments across diverse tasks and architectures demonstrate that DMIL consistently achieves superior performance by adapting to holistic sample-specific interactions. Our framework is flexible and broadly applicable, establishing an interaction-centric paradigm for multimodal learning. The code is available at this https URL.

2606.11722 2026-06-11 cs.LG cs.AI cs.CL New submission

ICA Lens: Interpreting Language Models Without Training Another Dictionary

ICA Lens: 无需训练另一本词典即可解释语言模型

Sida Liu, Feijiang Han

Affiliations: Independent Researcher; University of Maryland

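AI summary: Proposes ICALens, which uses independent component analysis (ICA) to efficiently extract interpretable directions from language-model representations without training sparse autoencoders, and is competitive on SAEBench.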

Comments
Ongoing Project
Chinese abstract (AI-translated)

在语言模型表示中找到可解释方向对于理解和控制模型行为至关重要。稀疏自编码器(SAE)已成为此目的的标准工具,但将其作为默认的第一透镜通常需要训练、存储和评估大型过完备字典。这一瓶颈限制了快速探索,并提出了一个基本问题:在训练另一个神经字典之前,从激活几何中已经可以看到多少可解释结构?我们的直觉很简单:许多可解释方向对令牌具有选择性,这些方向看起来比随机方向更不服从高斯分布。因此,我们重新审视独立成分分析(ICA),这是一种寻找非高斯方向的经典方法,作为语言模型可解释性的紧凑透镜。我们发现ICA在LLM可解释性中被低估了,因为先前的使用通常依赖于现成的ICA实现,这些实现在LLM激活上不稳定,并且缺乏用于检查和评估恢复方向的系统工具。为弥补这些差距,我们引入了ICALens,这是第一个用于LLM表示的稳定、高效和可审计ICA分析的实用工作流。它结合了优化的GPU并行FastICA流水线、LLM特定的稳定性配方和更好的拟合诊断,实现了高效可靠的逐层分析。在GPT-2 Small、Gemma 2 2B和Qwen 3.5 2B Base上,ICALens高效地恢复了紧凑、人类可解释的方向,无需逐层基于梯度的字典训练。在SAEBench上,ICA在稀疏探测中与公共SAE竞争,并在中小预算下的目标探测扰动中优于它们。这些结果表明,ICA不应被视为弱基线,而应被视为探索语言模型表示的高效且互补的第一透镜。

English abstract

Finding interpretable directions in language-model representations is critical for understanding and controlling model behavior. Sparse autoencoders (SAEs) have become the standard tool for this purpose, but using them as the default first lens often requires training, storing, and evaluating large overcomplete dictionaries. This bottleneck limits rapid exploration and raises a fundamental question: how much interpretable structure is already visible from activation geometry before training another neural dictionary? Our intuition is simple: many interpretable directions are selective on tokens, and these directions should look less Gaussian than random directions. We therefore revisit independent component analysis (ICA), a classical method for finding non-Gaussian directions, as a compact lens for language-model interpretability. We find that ICA has been underestimated for LLM interpretability, because prior uses often relied on off-the-shelf ICA implementations that are brittle on LLM activations and lacked systematic tools for inspecting and evaluating the recovered directions. To bridge these gaps, we introduce ICALens, the first practical workflow for stable, efficient, and auditable ICA analysis of LLM representations. It combines an optimized GPU-parallel FastICA pipeline with LLM-specific stability recipes and better fitting diagnostics, enabling efficient and reliable layer-wise analysis. Across GPT-2 Small, Gemma 2 2B, and Qwen 3.5 2B Base, ICALens efficiently recovers compact, human-interpretable directions without per-layer gradient-based dictionary training. On SAEBench, ICA is competitive with public SAEs in sparse probing and outperforms them in targeted probe perturbation under small-to-medium budgets. These results suggest that ICA should not be viewed as a weak baseline, but as an efficient and complementary first lens for exploring language-model representations.
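The intuition that selective directions "look less Gaussian than random directions" can be checked on synthetic data. This sketch is not the ICALens pipeline: it plants one heavy-tailed (Laplace) source along a known direction in otherwise Gaussian data and compares excess kurtosis, a common non-Gaussianity measure. The data model and all function names are our assumptions.

```python
import numpy as np

def excess_kurtosis(z):
    """Sample excess kurtosis; close to 0 for Gaussian data, positive for heavy tails."""
    z = (z - z.mean()) / z.std()
    return float(np.mean(z**4) - 3.0)

def synthetic_activations(n=20000, d=64, seed=0):
    """Gaussian 'activations' with one heavy-tailed source planted along unit vector u."""
    rng = np.random.default_rng(seed)
    u = rng.normal(size=d)
    u /= np.linalg.norm(u)
    source = rng.laplace(size=n)               # heavy-tailed stand-in for a selective feature
    X = rng.normal(size=(n, d))
    X -= np.outer(X @ u, u)                    # remove the Gaussian component along u
    X += np.outer(source, u)                   # plant the non-Gaussian source along u
    return X, u
```

Projecting `X` onto `u` recovers the heavy-tailed source, whose excess kurtosis is clearly positive, while a projection onto a random unit vector is close to Gaussian. ICA-style methods search for directions that maximize this kind of contrast.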

2606.11860 2026-06-11 cs.LG New submission

RePAIR: Predictive Self-Supervised Representation Learning in Chess

RePAIR:国际象棋中的预测性自监督表示学习

Christoph Koller, Johannes Fürnkranz, Timo Bertram

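AI summary: Proposes RePAIR, an architecture combining ideas from MAE, JEPA, and BERT that learns compact representations of chess position sequences through masking and iterative refinement, and can reason about piece movements without reinforcement learning.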

Comments
Accepted for oral presentation at IEEE Conference on Games 2026
Chinese abstract (AI-translated)

在本文中,我们介绍了通过自编码迭代细化进行表示预测(RePAIR)——一种新颖的自监督表示学习架构,它综合了掩码自编码器(MAE)、联合嵌入预测架构(JEPA)和来自Transformer的双向编码器表示(BERT)。我们展示了如何将其用于将顺序数据(如连续的国际象棋局面)中的对象编码为紧凑而有意义的表示。该架构的基本原理是掩码潜在状态序列的大部分,类似于BERT和MAE。然后,我们对潜在表示应用一个轻量级预测器,该预测器在类似JEPA的低维嵌入空间中修复序列中的间隙。我们在国际象棋领域的实验表明,编码器优化了棋盘表示,使得有意义的国际象棋概念在潜在空间中聚类出现。此外,掩码棋盘状态的重建表明,该模型能够在不依赖昂贵强化学习方法的情况下推理棋子移动。最后,我们发现,通过在这个语义丰富的空间中观察游戏路径轨迹,所得到的表示空间允许对国际象棋游戏进行快速直观的剖析。

English abstract

In this paper, we introduce Representation Prediction via Autoencoding using Iterative Refinement (RePAIR), a novel self-supervised representation learning architecture that synthesizes Masked Autoencoders (MAE), Joint Embedding Predictive Architectures (JEPA), and Bidirectional Encoder Representations from Transformers (BERT). We demonstrate how it can be used to encode objects in sequential data like consecutive chess positions into compact yet meaningful representations. The basic principle of the architecture is to mask large portions of a sequence of latent states, similar to BERT and MAE. Then, we apply a lightweight Predictor to the latent representations that repairs gaps in the sequence in a lower-dimensional embedding space akin to JEPA. Our experiments in the domain of chess show that the Encoder refines the board representations such that meaningful chess concepts emerge clustered in the latent space. Furthermore, reconstructions of the masked board states show that the model is able to reason about the piece movements without relying on costly reinforcement learning methods. Lastly, we find that the resulting representation space allows for quick and intuitive dissections of chess games by observing the game path trajectories in this semantically rich space.

2606.12200 2026-06-11 cs.LG cs.AI New submission

Implicit Neural Representations of Individual Behavior

个体行为的隐式神经表示

Andrew Kang, Priya Narasimhan

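AI summary: Proposes Behavioral INR, which uses implicit neural representations to learn policy representations from unlabeled multi-policy behavioral data; FiLM layers modulate the policy function, enabling unsupervised policy identification and improving policy identifiability in continuous state-action spaces.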

Comments
ICML 2026, Structured Probabilistic Inference & Generative Modeling Workshop
Chinese abstract (AI-translated)

我们研究从无标签多策略行为数据中进行策略表示学习。每个回合由固定策略生成,但策略标签不可用。这种设置出现在机器人操作、演示、游戏、赛车以及其他混合了异构行为但没有注释的数据集中。我们引入了\emph{Behavioral INR},一种自监督生成模型,将隐式神经表示(INR)从视觉领域适应到行为领域。Behavioral INR不是将坐标映射到RGB值,而是将策略表示为状态-动作函数,将状态映射到后续动作。一个回合级别的潜在变量通过FiLM层调节该函数,产生策略上的生成先验,并允许在无监督的情况下推断策略身份。由于INR将每个数据点视为底层函数的样本,同一模型自然适应可变回合长度和不同采样粒度,就像视觉INR处理不同图像分辨率一样。我们还定义了沿状态分布和动作分布轴的策略级分布外(OOD)偏移,当策略在状态或动作上重叠时会出现这种偏移,但标准的基于新智能体或环境的OOD设置无法捕捉到。我们在合成高斯随机场数据、带有受控OOD分割的MuJoCo演示以及真实世界的国际象棋、一级方程式赛车、机器人和搜索-规避数据集上进行了评估。Behavioral INR在最具挑战性的连续状态-动作设置中持续提升策略可识别性,尤其是当更长的回合、更多的策略和OOD分割降低了边际捷径的效用时;当策略身份可以从符号重复或低维动作统计中恢复时,摊销历史编码器仍然具有竞争力。我们发布了代码和检查点。

English abstract

We study policy representation learning from unlabeled multi-policy behavioral data. Each episode is generated by a fixed policy, but policy labels are unavailable. This setting appears in robotics play, demonstrations, games, racing, and other datasets where heterogeneous behaviors are mixed without annotations. We introduce \emph{Behavioral INR}, a self-supervised generative model that adapts implicit neural representations (INRs) from vision to behavior. Instead of mapping coordinates to RGB values, Behavioral INR represents a policy as a state-action function mapping states to subsequent actions. An episode-level latent modulates this function through FiLM layers, yielding a generative prior over policies and allowing policy identity to be inferred without supervision. Because INRs treat each datapoint as samples from an underlying function, the same model naturally accommodates variable episode lengths and different sampling granularities, as in vision INRs with different image resolutions. We also define policy-level out-of-distribution (OOD) shifts along state-distribution and action-distribution axes, which arise when policies overlap in states or actions but are not captured by standard behavioral OOD settings based only on new agents or environments. We evaluate on synthetic Gaussian random field data, MuJoCo demonstrations with controlled OOD splits, and real-world chess, Formula 1 racing, robotics, and Seek-Avoid datasets. Behavioral INR most consistently improves policy identifiability in the hardest continuous state-action settings, especially when longer episodes, more policies, and OOD splits reduce the usefulness of marginal shortcuts; amortized history encoders remain competitive when policy identity can be recovered from symbolic repetition or low-dimensional action statistics. We release code and checkpoints.
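The FiLM conditioning described above can be sketched in a few lines. This is a minimal illustration, assuming a one-hidden-layer MLP policy and linear maps from the episode latent `z` to the scale and shift. The layer sizes, parameter names, and the `film_policy` and `init_params` helpers are hypothetical and not taken from the paper.

```python
import numpy as np

def init_params(state_dim, latent_dim, hidden, action_dim, seed=0):
    """Random parameters for a tiny FiLM-modulated policy network."""
    rng = np.random.default_rng(seed)
    shapes = {"W1": (hidden, state_dim), "b1": (hidden,),
              "Wg": (hidden, latent_dim), "Wb": (hidden, latent_dim),
              "W2": (action_dim, hidden), "b2": (action_dim,)}
    return {k: rng.normal(scale=0.5, size=s) for k, s in shapes.items()}

def film_policy(state, z, params):
    """Map a state to an action; the hidden features are modulated by latent z."""
    h = np.tanh(params["W1"] @ state + params["b1"])
    gamma = params["Wg"] @ z + 1.0             # per-feature scale, identity when z = 0
    beta = params["Wb"] @ z                    # per-feature shift, zero when z = 0
    h = gamma * h + beta                       # FiLM: feature-wise linear modulation
    return params["W2"] @ h + params["b2"]
```

The same network weights produce different state-to-action functions for different `z`, which is what lets a single model represent many policies and infer which one generated an episode.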

2606.12362 2026-06-11 cs.LG cs.AI New submission

Latent World Recovery for Multimodal Learning with Missing Modalities

缺失模态下的多模态学习中的潜在世界恢复

Hui Wang, Tianyu Ren, Joseph Butler, Christopher Baker, Karen Rafferty, Simon McDade

Affiliations: Queen's University Belfast

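AI summary: Proposes Latent World Recovery (LWR), a framework that uses neighbor-based latent alignment and availability-aware fusion to make robust multimodal predictions under missing modalities while avoiding errors from explicit reconstruction.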

Chinese abstract (AI-translated)

我们研究了缺失模态下的多模态学习,特别受到生物科学应用的启发,在这些应用中,当需要做出决策时,异构模态通常仅部分可用。我们提出了潜在世界恢复(LWR),这是一个基于两个关键思想的框架:(i) 来自不同模态的特定模态嵌入在共享潜在空间中对齐,以及 (ii) 通过仅融合在训练和推理时实际可用的模态嵌入来构建统一表示。LWR 不填补缺失模态或要求固定的模态集,而是将每个模态视为对底层潜在状态的部分感知,并直接从观察到的模态执行可用性感知表示学习。这种基于邻居的潜在对齐和可用性感知模态融合的结合,使得在部分观测下能够进行鲁棒的多模态预测,同时避免了显式重构缺失模态带来的误差传播。我们在真实世界的不完整多组学基准上评估了所提出的框架,并证明它为下游任务(如癌症表型分类和生存预测)提供了一种有效的方法。

English abstract

We study multimodal learning under missing modalities, with particular motivation from bioscience applications in which heterogeneous modalities are often only partially available when decisions need to be made. We propose Latent World Recovery (LWR), a framework built on two key ideas: (i) modality-specific embeddings from different modalities are aligned in a shared latent space, and (ii) a unified representation is constructed by fusing only the embeddings of the modalities that are actually available at both training and inference time. Rather than imputing missing modalities or requiring a fixed modality set, LWR treats each modality as a partial perception of an underlying latent state and performs availability-aware representation learning directly from the observed modalities. This combination of neighbor-based latent alignment and availability-aware modality fusion enables robust multimodal prediction under partial observation, while avoiding error propagation from explicit reconstruction of missing modalities. We evaluate the proposed framework on real-world incomplete multi-omics benchmarks and demonstrate that it provides an effective approach to downstream tasks such as cancer phenotype classification and survival prediction.
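Availability-aware fusion, as described above, amounts to combining only the embeddings that exist for a given sample. The sketch below assumes mean pooling as the fusion rule, which the abstract does not specify; the function name `fuse_available` is hypothetical.

```python
import numpy as np

def fuse_available(embeddings, available):
    """Average only the modality embeddings flagged as available.

    embeddings: (n_modalities, dim) array; rows for missing modalities may hold anything.
    available:  (n_modalities,) boolean mask.
    """
    available = np.asarray(available, dtype=bool)
    if not available.any():
        raise ValueError("at least one modality must be available")
    return embeddings[available].mean(axis=0)   # missing rows are never read
```

Because missing rows are excluded rather than imputed, placeholder values (even NaN) in those rows cannot leak into the fused representation.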

2606.11570 2026-06-11 stat.ML cs.LG stat.ME 交叉投稿

Enhancing Spectral Embedding through Robust and Flexible Knowledge Transfer in Electronic Health Records

通过电子健康记录中的鲁棒且灵活的知识迁移增强谱嵌入

Feiqing Huang, Zongqi Xia, Rong Ma, Tianxi Cai

AI总结 提出一种基于谱的无监督表示学习框架,通过从更广泛人群提取知识矩阵并放松信号对齐假设,为罕见病队列生成低维嵌入,在模拟和真实多发性硬化症数据中优于现有方法。

详情
AI中文摘要

我们提出了一种基于谱的无监督表示学习框架,用于从电子健康记录中为罕见病队列的临床概念和患者导出低维嵌入,其中数据是高维的但样本量有限。为了克服这一挑战,我们引入了一个从更广泛人群中提取的知识矩阵,该矩阵与罕见病队列共享部分重叠的子空间。我们的方法不同于现有方法,它放松了潜在数据矩阵和知识矩阵之间严格的一对一信号对齐假设,允许更灵活和现实的结构化共享形式。我们引入了一种新颖的两步谱嵌入过程:首先,我们从知识矩阵中识别并移除不相关的成分;然后,我们应用基于投影的方法分别恢复共享和异质成分。模拟和对真实世界多发性硬化症队列的分析表明,所提出的方法优于竞争方法,特别是在共享信号较弱且仅部分对齐的挑战性场景中,这在罕见病数据中很常见。

英文摘要

We propose a spectral-based, unsupervised representation learning framework to derive low-dimensional embeddings for clinical concepts and patients in rare disease cohorts from electronic health records, where data are high-dimensional but sample sizes are limited. To overcome this challenge, we incorporate a knowledge matrix extracted from a broader population that shares a partially overlapping subspace with the rare-disease cohort. Our method departs from existing approaches by relaxing restrictive one-to-one signal-alignment assumptions between the latent data matrix and knowledge matrix, allowing more flexible and realistic forms of structured sharing. We introduce a novel two-step spectral embedding procedure: first, we identify and remove irrelevant components from the knowledge matrix; then, we apply a projection-based method to separately recover shared and heterogeneous components. Simulations and an analysis of a real-world multiple sclerosis cohort show that the proposed method outperforms competing approaches, particularly in challenging scenarios where shared signals are weak and only partially aligned, as is common in rare-disease data.

2606.11661 2026-06-11 cs.CV cs.LG 交叉投稿

Learning Instance-Adaptive Low-Rank Orthogonal Subspaces for Clothes-Changing Person Re-Identification

学习实例自适应低秩正交子空间用于换衣行人重识别

Dong-Woo Kim, Tae-Kyun Kim

AI总结 提出Ortho-ReID方法,通过从VLM文本描述中显式建模低秩服装子空间,并利用几何约束提取服装不变特征,在多个基准数据集上取得最优性能。

详情
Comments
Accepted to the ICML 2026 Workshop on CoLoRAI
AI中文摘要

换衣行人重识别(CC-ReID)旨在识别尽管因服装变化导致外观剧烈变化的个体。现有方法依赖对抗学习来解耦服装特征,我们提出Ortho-ReID,该方法从VLM文本描述中显式建模低秩服装子空间,并通过直接几何约束提取服装不变表示。一个关键组件是基于Transformer的基生成器(Basis Maker),它通过与图像块的交叉注意力,将共享的低维服装先验细化为实例自适应低秩子空间,从而在变化的可见性条件下也能实现鲁棒的服装特征提取。该实例自适应子空间通过与服装文本嵌入对齐进行监督,而身份特征则通过可学习的投影头提取,并在几何上约束与其严格正交。大量实验表明,在PRCC(top-1提升5.9%)、Celeb-reID-light(提升3.5%)和LaST(提升5.3%)上达到了最先进性能,在LTCC上也取得了有竞争力的结果。

英文摘要

Clothes-changing person re-identification (CC-ReID) aims to recognize individuals despite drastic appearance changes caused by clothing variation. While existing methods rely on adversarial learning to disentangle clothing features, we propose Ortho-ReID, which explicitly models a low-rank clothing subspace from VLM text descriptions and extracts clothing-invariant representations via direct geometric constraints. A critical component is our transformer-based Basis Maker, which refines a shared, low-dimensional clothing prior into an instance-adaptive low-rank subspace through cross-attention with image patches, enabling robust clothing feature extraction even under varying visibility conditions. This instance-adaptive subspace is supervised via alignment with clothing text embeddings, while identity features are extracted via a learnable projection head and geometrically constrained to be strictly orthogonal to it. Extensive experiments demonstrate state-of-the-art performance on PRCC (+5.9% top-1), Celeb-reID-light (+3.5%), and LaST (+5.3%), with competitive results on LTCC.

2503.10973 2026-06-11 cs.LG 版本更新

Learning Patterns and Abstractions from Perceptual Sequences

从感知序列中学习模式与抽象

Shuchen Wu

AI总结 研究从感知序列中通过分块和抽象发现模式与层次结构的计算原理,提出理性分块模型和非参数层次变量模型,实现高效序列分解与无监督模式发现。

详情
Comments
Doctoral thesis
AI中文摘要

认知过程迅速将高维感官流分解为熟悉的部分并揭示它们之间的关系。结构为何出现?它们如何支持学习、泛化和预测?什么计算原理构成了感知和智能的这一核心方面?简化来说,感官流是一维序列。在学习此类序列时,我们自然地将其分割成部分——这一过程称为分块。在第一个项目中,我研究了在序列反应时任务中影响分块的因素,并表明人类在平衡速度和准确性的同时适应底层分块。在此基础上,我开发了学习分块并逐块解析序列的模型。从规范角度,我提出分块是一种理性策略,用于发现重复模式和嵌套层次结构,从而实现高效的序列分解。学习到的分块可作为可复用的原语,用于迁移、组合和心理模拟——使模型能够从已知中组合出新内容。我展示了该模型在单维和多维序列中学习层次结构的能力,并强调了其在无监督模式发现中的实用性。第二部分从具体序列转向抽象序列。我对抽象主题进行了分类,并考察了它们在序列记忆中的作用。行为证据表明,人类利用模式冗余进行压缩和迁移。我提出了一个非参数层次变量模型,该模型同时学习分块和抽象变量,揭示不变符号模式。我展示了其与人类学习的相似性,并与大型语言模型进行了比较。综上所述,本论文表明,分块和抽象作为简单的计算原理,能够支持从简单到复杂、从具体到抽象的层次组织序列中的结构化知识获取。

英文摘要

Cognition swiftly breaks high-dimensional sensory streams into familiar parts and uncovers their relations. Why do structures emerge, and how do they enable learning, generalization, and prediction? What computational principles underlie this core aspect of perception and intelligence? A sensory stream, simplified, is a one-dimensional sequence. In learning such sequences, we naturally segment them into parts -- a process known as chunking. In the first project, I investigated factors influencing chunking in a serial reaction time task and showed that humans adapt to underlying chunks while balancing speed and accuracy. Building on this, I developed models that learn chunks and parse sequences chunk by chunk. Normatively, I proposed chunking as a rational strategy for discovering recurring patterns and nested hierarchies, enabling efficient sequence factorization. Learned chunks serve as reusable primitives for transfer, composition, and mental simulation -- letting the model compose the new from the known. I demonstrated this model's ability to learn hierarchies in single and multi-dimensional sequences and highlighted its utility for unsupervised pattern discovery. The second part moves from concrete to abstract sequences. I taxonomized abstract motifs and examined their role in sequence memory. Behavioral evidence suggests that humans exploit pattern redundancies for compression and transfer. I proposed a non-parametric hierarchical variable model that learns both chunks and abstract variables, uncovering invariant symbolic patterns. I showed its similarity to human learning and compared it to large language models. Taken together, this thesis suggests that chunking and abstraction as simple computational principles enable structured knowledge acquisition in hierarchically organized sequences, from simple to complex, concrete to abstract.

2506.20040 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Cross-Layer Discrete Concept Discovery for Interpreting Language Models

跨层离散概念发现用于解释语言模型

Ankur Garg, Xuemin Yu, Hassan Sajjad, Samira Ebrahimi Kahou

AI总结 提出跨层向量量化变分自编码器(CLVQ-VAE),通过离散向量量化瓶颈将残差流中的重复特征压缩为紧凑可解释的概念向量,在三个数据集上优于聚类、单层VQ-VAE和稀疏自编码器基线。

详情
AI中文摘要

由于残差流的存在,解释语言模型仍然具有挑战性,残差流在相邻层之间线性混合和复制特征,导致单层分析忽略这种跨层结构。跨层稀疏自编码器(SAE)解决了层混合问题,但在连续空间中操作,概念分散在许多神经元上,没有清晰的边界。我们引入了跨层向量量化变分自编码器(CLVQ-VAE),这是一种新颖的框架,通过离散向量量化瓶颈将较低层的表示映射到较高层,将重复的残差流特征压缩为紧凑、可解释的概念向量。我们的方法结合了基于top-k温度的采样和指数移动平均(EMA)码本更新,在保持码本多样性的同时,对离散潜在空间进行受控探索。在基于编码器和解码器的模型上,针对ERASER-Movie、Jigsaw和AGNews数据集,CLVQ-VAE在三个评估轴上优于聚类、单层向量量化变分自编码器(VQ-VAE)和稀疏自编码器(SAE)基线:移除识别出的概念使模型准确率下降高达93%,LLM评判员在66.7%的比较中将我们的概念排在首位,人类标注者从我们的可视化中恢复模型预测的准确率为78%,而聚类为54%。

英文摘要

Interpreting language models remains challenging because the residual stream linearly mixes and duplicates features across adjacent layers, causing single-layer analyses to miss this cross-layer structure. Cross-layer sparse autoencoders (SAEs) address layer mixing but operate in continuous space, where concepts split across many neurons without clear boundaries. We introduce Cross-Layer Vector Quantized-Variational Autoencoder (CLVQ-VAE), a novel framework that maps representations from a lower layer to a higher layer through a discrete vector-quantization bottleneck, collapsing duplicated residual-stream features into compact, interpretable concept vectors. Our approach combines top-k temperature-based sampling with exponential moving average (EMA) codebook updates, providing controlled exploration of the discrete latent space while maintaining codebook diversity. Across both encoder- and decoder-based models on ERASER-Movie, Jigsaw, and AGNews, CLVQ-VAE outperforms clustering, single-layer vector quantized-variational autoencoder (VQ-VAE), and sparse autoencoder (SAE) baselines across three evaluation axes: removing identified concepts drops model accuracy by up to 93%, LLM judges rank our concepts first in 66.7% of comparisons, and human annotators recover model predictions from our visualizations with 78% accuracy versus 54% for clustering.
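As a rough illustration of the two ingredients named in the abstract above, the sketch below pairs a top-k, temperature-controlled code sampler with the standard EMA codebook update used in VQ-VAE-style models. The abstract gives neither the exact sampling rule nor any hyperparameters, so the softmax over negative squared distances, the function names, and all constants here are assumptions for illustration only.

```python
import numpy as np

def topk_sample_codes(z, codebook, k, temperature, rng):
    """For each row of z, sample one of its k nearest codebook entries.

    Probabilities are softmax(-squared_distance / temperature) restricted
    to the k nearest codes (an assumed form of top-k temperature sampling).
    """
    d2 = ((z[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)    # (n, K)
    nearest = np.argsort(d2, axis=1)[:, :k]                       # (n, k)
    logits = -np.take_along_axis(d2, nearest, axis=1) / temperature
    logits -= logits.max(axis=1, keepdims=True)                   # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    picks = np.array([rng.choice(k, p=p) for p in probs])
    return nearest[np.arange(len(z)), picks]

def ema_update(codebook, cluster_size, embed_sum, z, codes, decay=0.99, eps=1e-5):
    """EMA codebook update: running counts and running sums per code."""
    K = codebook.shape[0]
    one_hot = np.eye(K)[codes]                                    # (n, K)
    cluster_size = decay * cluster_size + (1 - decay) * one_hot.sum(0)
    embed_sum = decay * embed_sum + (1 - decay) * one_hot.T @ z
    # Laplace smoothing so rarely used codes do not divide by ~0.
    n = cluster_size.sum()
    smoothed = (cluster_size + eps) / (n + K * eps) * n
    return embed_sum / smoothed[:, None], cluster_size, embed_sum

rng = np.random.default_rng(0)
init_codebook = rng.normal(size=(8, 4))      # 8 hypothetical concept vectors
z = rng.normal(size=(32, 4))                 # stand-in for encoder outputs
codes = topk_sample_codes(z, init_codebook, k=3, temperature=1.0, rng=rng)
new_codebook, cluster_size, embed_sum = ema_update(
    init_codebook, np.ones(8), init_codebook.copy(), z, codes)
```

With `k=1` the sampler reduces to ordinary nearest-code assignment; larger `k` or a higher temperature spreads assignments over more codes, which is one plausible reading of "controlled exploration" in the abstract.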

2507.21164 2026-06-11 cs.LG cs.AI eess.IV stat.ML 版本更新

OCSVM-Guided Representation Learning for Unsupervised Anomaly Detection

OCSVM引导的无监督异常检测表示学习

Nicolas Pinon (MYRIAD), Robin Trombetta (MYRIAD), Carole Lartizien (MYRIAD)

AI总结 提出一种将表示学习与可解析求解的一类SVM耦合的方法,通过定制损失函数直接对齐潜在特征与决策边界,在MNIST-C和脑MRI病变检测任务上展现了鲁棒性和性能。

详情
AI中文摘要

无监督异常检测(UAD)旨在无需标签数据检测异常,这在许多机器学习应用中是必要的,因为异常样本稀少或不可用。大多数最先进的方法分为两类:基于重构的方法(通常重构异常过于完美)和与密度估计器解耦的表示学习(可能遭受次优特征空间)。虽然一些近期方法尝试耦合特征学习和异常检测,但它们通常依赖替代目标、限制核选择或引入近似,从而限制了表达能力和鲁棒性。为解决这一挑战,我们提出了一种新颖方法,通过自定义损失公式将表示学习与可解析求解的一类SVM(OCSVM)耦合,该损失直接使潜在特征与OCSVM决策边界对齐。该模型在两个任务上评估:基于MNIST-C的新基准,以及具有挑战性的脑MRI细微病变检测任务。与大多数关注图像级别大而高信号病变的方法不同,我们的方法成功针对小而非高信号的病变,同时我们评估体素级别的指标,处理了更具临床相关性的场景。两个实验评估了对领域偏移的鲁棒性形式,包括MNIST-C中的损坏类型以及MRI中的纹理或人群年龄变化。结果展示了我们提出模型的性能和鲁棒性,突显了其在通用UAD和现实医学成像应用中的潜力。源代码可在此https URL获取。

英文摘要

Unsupervised anomaly detection (UAD) aims to detect anomalies without labeled data, a necessity in many machine learning applications where anomalous samples are rare or not available. Most state-of-the-art methods fall into two categories: reconstruction-based approaches, which often reconstruct anomalies too well, and decoupled representation learning with density estimators, which can suffer from suboptimal feature spaces. While some recent methods attempt to couple feature learning and anomaly detection, they often rely on surrogate objectives, restrict kernel choices, or introduce approximations that limit their expressiveness and robustness. To address this challenge, we propose a novel method that couples representation learning with an analytically solvable One-Class SVM (OCSVM), through a custom loss formulation that directly aligns latent features with the OCSVM decision boundary. The model is evaluated on two tasks: a benchmark based on MNIST-C, and a challenging brain MRI lesion detection task. Unlike most methods that focus on large, hyperintense lesions at the image level, our approach successfully targets small, non-hyperintense lesions and is evaluated with voxel-wise metrics, addressing a more clinically relevant scenario. Both experiments evaluate a form of robustness to domain shifts, including corruption types in MNIST-C and texture or population age variations in MRI. Results demonstrate the performance and robustness of our proposed model, highlighting its potential for general UAD and real-world medical imaging applications. The source code is available at this https URL.

2602.02726 2026-06-11 cs.LG cs.CL 版本更新

Vector Quantized Latent Concepts: A Scalable Alternative to Clustering-Based Concept Discovery

向量量化潜在概念:聚类式概念发现的可扩展替代方案

Xuemin Yu, Ankur Garg, Samira Ebrahimi Kahou, Hassan Sajjad

AI总结 提出VQLC框架,通过向量量化学习离散潜在概念,在保持可解释性的同时,实现与K-Means相当的计算效率,并优于层次聚类在大规模数据上的扩展性。

详情
AI中文摘要

大型语言模型(LLMs)在其隐藏状态中编码了丰富的语义信息,但理解这些内部表示捕获了哪些信息仍然困难。从隐藏状态中提取的潜在概念为解释LLMs提供了有希望的方向,但现有的基于聚类的方法面临权衡:层次聚类产生连贯的概念,但由于其二次内存成本而仅限于小数据集,而K-Means高效扩展但可能产生语义连贯性较差的概念。我们提出向量量化潜在概念(VQLC),一种离散概念学习框架,在冻结的隐藏状态上学习潜在概念的码本。在12个数据集-模型设置中,VQLC在计算成本上接近K-Means,扩展性优于层次聚类,并在忠实度上保持竞争力,在仅解码器模型上增益最明显。基于LLMs的评估、定性分析和稀疏自编码器(SAE)比较表明,学习到的概念是可解释且任务相关的。

英文摘要

Large language models (LLMs) encode rich semantic information in their hidden states, yet it remains difficult to understand what information these internal representations capture. Latent concepts extracted from hidden states offer a promising direction for interpreting LLMs, but existing clustering-based methods face a trade-off: hierarchical clustering produces coherent concepts but is limited to small datasets due to its quadratic memory cost, while K-Means scales efficiently but may yield less semantically coherent concepts. We propose Vector Quantized Latent Concept (VQLC), a discrete concept learning framework that learns a codebook of latent concepts on frozen hidden states. Across 12 dataset-model settings, VQLC stays close to K-Means in computational cost, scales better than hierarchical clustering, and remains competitive in faithfulness, with the clearest gains on decoder-only models. LLM-based evaluation, qualitative analysis, and a Sparse Autoencoder (SAE) comparison demonstrate that the learned concepts are interpretable and task-relevant.

2511.14427 2026-06-11 cs.RO cs.LG 版本更新

Self-Supervised Multisensory Pretraining for Contact-Rich Robot Reinforcement Learning

面向接触丰富机器人强化学习的自监督多感官预训练

Rickmer Krohn, Vignesh Prasad, Gabriele Tiboni, Georgia Chalvatzaki

AI总结 提出MSDP框架,通过掩码自编码和跨模态预测学习多感官表示,并采用非对称架构(评论家使用交叉注意力提取动态特征,演员使用稳定池化表示)加速策略学习,在模拟和真实机器人任务中展现出鲁棒性和高效性。

详情
Comments
8 pages, 11 figures
AI中文摘要

有效的接触丰富操作需要机器人协同利用视觉、力和本体感觉。然而,强化学习智能体在这种多感官环境中难以学习,尤其是在感官噪声和动态变化的情况下。我们提出了多感官动态预训练(MSDP),一种新颖的框架,用于学习面向任务策略学习的表达性多感官表示。MSDP基于掩码自编码,通过仅从传感器嵌入的子集重建多感官观测来训练基于Transformer的编码器,从而实现跨模态预测和传感器融合。对于下游策略学习,我们引入了一种新颖的非对称架构,其中交叉注意力机制允许评论家从冻结的嵌入中提取动态的、任务特定的特征,而演员则接收稳定的池化表示来指导其动作。我们的方法在多种扰动(包括传感器噪声和物体动力学变化)下表现出加速学习和鲁棒性能。在模拟和真实世界中多个具有挑战性的、接触丰富的机器人操作任务上的评估展示了MSDP的有效性。我们的方法对扰动表现出强鲁棒性,并在仅6000次在线交互的真实机器人上实现了高成功率,为复杂的多感官机器人控制提供了一种简单而强大的解决方案。网站:this https URL

英文摘要

Effective contact-rich manipulation requires robots to synergistically leverage vision, force, and proprioception. However, reinforcement learning agents struggle to learn in such multisensory settings, especially amidst sensory noise and dynamic changes. We propose MultiSensory Dynamic Pretraining (MSDP), a novel framework for learning expressive multisensory representations tailored for task-oriented policy learning. MSDP is based on masked autoencoding and trains a transformer-based encoder by reconstructing multisensory observations from only a subset of sensor embeddings, leading to cross-modal prediction and sensor fusion. For downstream policy learning, we introduce a novel asymmetric architecture, where a cross-attention mechanism allows the critic to extract dynamic, task-specific features from the frozen embeddings, while the actor receives a stable pooled representation to guide its actions. Our method demonstrates accelerated learning and robust performance under diverse perturbations, including sensor noise and changes in object dynamics. Evaluations in multiple challenging, contact-rich robot manipulation tasks in simulation and the real world showcase the effectiveness of MSDP. Our approach exhibits strong robustness to perturbations and achieves high success rates on the real robot with as few as 6,000 online interactions, offering a simple yet powerful solution for complex multisensory robotic control. Website: this https URL

2601.03326 2026-06-11 cs.CV cs.LG 版本更新

Higher order PCA-like rotation-invariant features for detailed shape descriptors modulo rotation

高阶类PCA旋转不变特征用于模旋转的详细形状描述符

Jarek Duda

AI总结 提出将PCA扩展到高阶张量(如三阶中心矩)或多项式乘高斯分布,以获取更精确的旋转不变形状描述符,并应用于分子形状描述、物体识别和形状相似性度量。

详情
Comments
5 pages, 4 figures
AI中文摘要

PCA可用于旋转不变特征,通过协方差矩阵 $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$ 用椭球近似形状,并利用其幂的迹等旋转不变量。然而,真实形状通常复杂得多,因此提出将其扩展到例如 $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$ 的三阶或更高阶张量以描述中心矩,或多项式乘高斯分布以得到任意高精度的可解码形状描述符及其类似的旋转不变量。其实际应用包括旋转不变特征以包含模旋转的形状,例如用于分子形状描述符,或用于2D图像/3D扫描中直至旋转的物体识别,可能也用于3D场景理解,或作为形状相似性度量,允许模旋转下物体的廉价比较,避免耗时的旋转优化。

英文摘要

PCA can be used to build rotation-invariant features: it describes a shape by its covariance matrix $p_{ab}=E[(x_a-E[x_a])(x_b-E[x_b])]$, which approximates the shape by an ellipsoid and yields rotation invariants such as the traces of its powers. However, real shapes are usually much more complicated, so this work proposes an extension to order-3 or higher tensors describing central moments, e.g. $p_{abc}=E[(x_a-E[x_a])(x_b-E[x_b])(x_c-E[x_c])]$, or to polynomial-times-Gaussian models that allow decodable shape descriptors of arbitrarily high accuracy, together with their analogous rotation invariants. Practical applications could include rotation-invariant features that capture shape modulo rotation, e.g. for molecular shape descriptors; object recognition up to rotation in 2D images or 3D scans, and possibly 3D scene understanding; and a shape similarity metric that allows inexpensive comparison of objects modulo rotation, avoiding costly optimization over rotations.
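To make the covariance and order-3 tensors above concrete, here is a minimal NumPy sketch that estimates $p_{ab}$ and $p_{abc}$ from a point cloud and forms a few full contractions that should be unchanged by rotation. The particular invariants chosen (two traces of powers of $p_{ab}$ and two contractions of $p_{abc}$) and the function names are illustrative assumptions; the paper's own list of invariants may differ.

```python
import numpy as np

def central_moments(points):
    """Order-2 (covariance) and order-3 central moment tensors of a point cloud."""
    centered = points - points.mean(axis=0)
    n = len(points)
    p2 = np.einsum("na,nb->ab", centered, centered) / n
    p3 = np.einsum("na,nb,nc->abc", centered, centered, centered) / n
    return p2, p3

def rotation_invariants(points):
    """A few scalar contractions that do not change when the cloud is rotated."""
    p2, p3 = central_moments(points)
    p3_trace = np.einsum("aab->b", p3)       # vector from tracing two indices
    return np.array([
        np.trace(p2),                         # Tr(p)
        np.trace(p2 @ p2),                    # Tr(p^2)
        np.einsum("abc,abc->", p3, p3),       # p_abc p_abc
        p3_trace @ p3_trace,                  # p_aab p_ccb
    ])

rng = np.random.default_rng(0)
# Skewed, anisotropic synthetic cloud so the order-3 moments are non-trivial.
cloud = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 0.5]) + rng.exponential(size=(500, 3))

# Random proper rotation from a QR decomposition (sign fixed so det = +1).
q, _ = np.linalg.qr(rng.normal(size=(3, 3)))
if np.linalg.det(q) < 0:
    q[:, 0] *= -1
rotated = cloud @ q.T

original_features = rotation_invariants(cloud)
rotated_features = rotation_invariants(rotated)
```

Comparing `original_features` with `rotated_features` gives a cheap rotation-independent similarity check, with no search over rotations.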

3. 强化学习与序列决策 38 篇

2606.11192 2026-06-11 cs.LG math.OC 新提交

Restless bandits with imperfect binary feedback: PCL-indexability analysis and computation

具有不完美二元反馈的 restless bandits: PCL-indexability 分析与计算

José Niño-Mora

发表机构 * Universidad Carlos III de Madrid(马德里卡洛斯三世大学)

AI总结 针对具有二元隐状态和不完美二元反馈的 restless bandits,提出基于部分守恒律(PCL)的分析与计算框架,通过验证定理、确定性骨架和组合词方法建立可索引性并计算 Whittle 指数,实验表明 MP 指数策略优于基准策略。

详情
Comments
59 pages, 12 figures, submitted 27/3/2026
AI中文摘要

我们研究具有二元隐状态和不完美二元反馈的 restless bandits,受具有感知错误的机会频谱接入启发。对于相关的信念状态模型,我们开发了一个基于部分守恒律(PCL)的分析与计算框架,用于建立可索引性和评估 Whittle 指数,该框架建立在实状态折扣 restless bandits 的验证定理之上。该框架通过相关的确定性骨架、更新分解和组合词分析随机动力学。它在几个阈值区域中为折扣奖励和资源度量提供了易处理的表达式,从而能够在那里完全验证 PCL 可索引性条件。对于本文中未实现完整分析验证的剩余区域,我们推导了用于计算相关边际度量和边际生产率(MP)指数的有效数值方案,当这些条件成立时,MP 指数等于 Whittle 指数。广泛的计算实验提供了强有力的证据,表明这些条件也在该区域中成立,跨越广泛的参数范围,且没有先前工作中施加的严格参数限制。实验进一步表明,MP 指数策略通常优于标准基准策略,且往往有显著优势。

英文摘要

We study restless bandits with binary latent states and imperfect binary feedback, motivated by opportunistic spectrum access with sensing errors. For the associated belief-state model, we develop a partial conservation laws (PCL)-based analytical and computational framework for establishing indexability and evaluating the Whittle index, building on a verification theorem for real-state discounted restless bandits. The framework analyzes the stochastic dynamics via an associated deterministic skeleton, renewal decompositions, and combinatorics on words. It yields tractable expressions for discounted reward and resource metrics in several threshold regimes, enabling full verification of the PCL-indexability conditions there. For the remaining regime, where a complete analytic verification is not achieved in this paper, we derive efficient numerical schemes for computing the relevant marginal metrics and the marginal productivity (MP) index, which equals the Whittle index when those conditions hold. Extensive computational experiments provide strong evidence that these conditions also hold in that regime across broad parameter ranges and without the stringent parameter restrictions imposed in prior work. The experiments further show that the MP index policy typically outperforms standard benchmark policies, often by a substantial margin.

2606.11266 2026-06-11 cs.LG 新提交

Seeing Before Colliding: Anticipatory Safe RL with Frozen Vision-Language Models

碰撞前的预见:利用冻结视觉-语言模型的预期性安全强化学习

Samuel Tetteh, Cody Fleming

发表机构 * Iowa State University(爱荷华州立大学)

AI总结 提出VLM-Safe-RL框架,通过冻结视觉-语言模型生成预期性成本项,改进CMDP拉格朗日更新,在高速碰撞场景下实现安全与回报的平衡。

详情
Comments
44pages, 26 figures
AI中文摘要

约束强化学习算法优化的成本信号几乎总是反应性的:模拟器仅在碰撞开始后发出非零成本,而PPO-Lagrangian的拉格朗日乘子仅在超出回合预算后增长。在比赛速度下,碰撞是瞬时且不可逆的,任何等待成本累积的安全机制在结构上都为时已晚。我们提出VLM-Safe-RL,一个将冻结的视觉-语言模型作为预期性成本项集成到CMDP拉格朗日更新中的框架。该框架包含四个贡献:(i) 解耦双路径CLIP,独立的奖励/成本路径,尊重CMDP的分解;(ii) VLM-Lagrange,一种增强的乘子更新,将每步VLM成本作为预期性项纳入;(iii) 置信门控,基于CLIP间隔的逻辑噪声模型导出的贝叶斯最优权重;(iv) VLMPPOLag,组合算法。在Safety-Gymnasium FormulaOne L2上,我们的主要评估($n=5$个种子,$10^{6}$步,预算$d_{\text{lim}}=25$)中,VLMPPOLag$+$Conf是默认预算比较中唯一同时保持实质性回报($J_r\approx40$)并在大多数种子上将成本控制在预算内的配置;五个约束感知基线(PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND)均至少未能满足一项要求。该机制泛化到保留的MetaDrive Medium(灾难率$41\%\to26\%$,95%自助法置信区间$[-26,-5]$个百分点),并显示出向Bullet Safety-Gym的方向一致迁移;我们诚实地报告了其不适用的情况(MetaDrive Easy/Hard, Qwen2-VL骨干),并将Hard失败归因于拉格朗日调节病理而非VLM信号本身。据我们所知,这是首个在CMDP拉格朗日更新中使用冻结VLM信号作为预期性成本项的工作。

英文摘要

The cost signal that constrained-RL algorithms optimize against is almost always reactive: the simulator emits a non-zero cost only after a collision has begun, and the Lagrange multiplier of PPO-Lagrangian grows only after the episode budget has been exceeded. At race speeds, where collisions are instantaneous and irreversible, any safety mechanism that waits for cost to accumulate is structurally too late. We present VLM-Safe-RL, a framework that integrates a frozen vision-language model into the CMDP Lagrangian update as an anticipatory cost term. The framework comprises four contributions: (i) Decoupled Dual-Path CLIP, independent reward/cost paths that respect the CMDP's factorization; (ii) VLM-Lagrange, an augmented multiplier update that incorporates a per-step VLM cost as an anticipatory term; (iii) Confidence Gating, a Bayes-optimal weight derived from a logistic noise model on the CLIP margin; and (iv) VLMPPOLag, the composed algorithm. On Safety-Gymnasium FormulaOne L2, our principal evaluation ($n{=}5$ seeds, $10^{6}$ steps, budget $d_{\text{lim}}{=}25$), VLMPPOLag$+$Conf is the only configuration in our default budget comparison that simultaneously retains substantive return ($J_r{\approx}40$) and holds cost within budget on a majority of seeds; the five constraint-aware baselines (PPOLag, CPO, CPPOPID, CPO-CLG, PPOLag-RND) each fail at least one requirement. The mechanism generalizes to held-out MetaDrive Medium (catastrophe rate $41\%{\to}26\%$, 95% bootstrap CI $[-26,-5]$ pp) and shows directionally consistent transfer to Bullet Safety-Gym; we report honestly where it does not (MetaDrive Easy/Hard, Qwen2-VL backbone) and trace the Hard failure to a Lagrangian-regulation pathology rather than the VLM signal itself. To our knowledge, this is the first work to use frozen VLM signals as an anticipatory cost term inside the CMDP Lagrangian update.

2606.11417 2026-06-11 cs.LG cs.AI stat.ML 新提交

Signed Compression Progress on a Sealed Audit is Goodhart-Resistant

密封审计上的有符号压缩进展是古德哈特抵抗的

Ayush Mittal, Dhruv Gupta

AI总结 提出有符号压缩进展作为内在动机,证明其累积奖励等于审计改进,且对有限审计面板具有假阳性预算,抵抗古德哈特定律。

详情
Comments
16 pages, 7 figures. Lean 4 (Mathlib) mechanized core and ARC-TGI experiment code: this https URL
AI中文摘要

压缩进展是一个长期提出的内在动机方案:当智能体的世界模型在预测或压缩经验方面变得更好时给予奖励。民间声称这种奖励是“可信的”,因为它只在学习时支付。我们使这一点精确化并证明它。如果内在奖励是固定密封审计损失的有符号减少,即 r_t = E(theta_{t-1}) - E(theta_t),那么累积奖励恰好望远镜式地归结为端点审计改进,因此没有策略可以在真实审计性能停滞或下降时无限推高奖励。对于有限审计面板,同样的结果成立,并带有尖锐的假阳性预算:累积经验奖励最多为真实审计改进加上 2 Delta_n(F, delta),即模型类的均匀审计偏差。这是无水平依赖的:一旦密封面板均匀控制该类,随时间变化的适应性无需付出代价。该定理还识别了失败模式:如果进展被截断、在智能体自身流上评分、暴露于可重用面板上的高容量模型,或应用于使 Delta_n 无效的神经类,则保证消失。我们给出了结构核心(望远镜式、有限审计界、有限吉布斯和熵下限)的 Lean 4 机械化,以及在 ARC-TGI 网格变换生成器上带有自适应保留攻击的实验套件。实验证实了理论:有限审计偏差按 n^{-0.527} 缩放;有符号进展抵抗截断农场、流泄漏和噪声电视好奇心;朴素的可重用审计可被黑盒标量反馈利用,而标准发布防御将攻击保持在 2 Delta_n 阈值以下。密封审计上的有符号压缩进展是真正改进的会计信号。

英文摘要

Compression progress is a long-standing proposal for intrinsic motivation: reward an agent when its world model becomes better at predicting or compressing experience. The folk claim is that this reward is "credible" because it is paid only for learning. We make this precise and prove it. If intrinsic reward is the signed decrease of a fixed sealed-audit loss, r_t = E(theta_{t-1}) - E(theta_t), then cumulative reward telescopes exactly to endpoint audit improvement, so no policy can push reward up indefinitely while true audit performance stagnates or degrades. For finite audit panels the same result holds with a sharp false-positive budget: cumulative empirical reward is at most true audit improvement plus 2 Delta_n(F, delta), the uniform audit deviation of the model class. This is horizon-free: adaptivity over time costs nothing once the sealed panel uniformly controls the class. The theorem also identifies the failure modes: the guarantee disappears if progress is clipped, scored on the agent's own stream, exposed to a high-capacity model on a reusable panel, or applied to a neural class that makes Delta_n vacuous. We give a Lean 4 mechanization of the structural core (telescoping, the finite-audit bound, finite Gibbs, and the entropy floor) and an experiment suite on ARC-TGI grid-transformation generators with adaptive holdout attacks. Experiments confirm the theory: finite-audit deviation scales as n^{-0.527}; signed progress resists clip-farming, stream leakage, and noisy-TV curiosity; naive reusable audits are exploitable by black-box scalar feedback, while standard release defenses keep the attack below the 2 Delta_n threshold. Signed compression progress on a sealed audit is an accounting signal of genuine improvement.
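The telescoping claim in the abstract above is easy to check numerically. The sketch below uses a made-up audit-loss trajectory (an assumption for illustration, not data from the paper) that oscillates without any net improvement. Signed progress sums exactly to the endpoint difference, while a clipped variant that pays only for decreases can be farmed, which matches the "clip-farming" failure mode the abstract mentions.

```python
def signed_progress_rewards(audit_losses):
    """r_t = E(theta_{t-1}) - E(theta_t) for t = 1..T."""
    return [prev - cur for prev, cur in zip(audit_losses, audit_losses[1:])]

def clipped_progress_rewards(audit_losses):
    """Variant that pays only for improvements: max(0, r_t)."""
    return [max(0.0, r) for r in signed_progress_rewards(audit_losses)]

# Hypothetical audit losses: the model gets better, then worse, repeatedly.
losses = [1.0, 0.5, 1.0, 0.5, 1.0]

signed_total = sum(signed_progress_rewards(losses))
clipped_total = sum(clipped_progress_rewards(losses))
endpoint_improvement = losses[0] - losses[-1]
```

Here `signed_total` equals `endpoint_improvement` (zero net progress), whereas `clipped_total` is strictly positive even though the audit loss ends where it started.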

2606.11652 2026-06-11 cs.LG 新提交

IAPO: Input Attribution-Aware Policy Optimization for Tool Use in Small Multimodal Agents

IAPO:面向小型多模态代理工具使用的输入归因感知策略优化

Yifan Yang, Zhen Zhang, Jiayi Tian, Liyan Tan, Zheng Zhang

发表机构 * University of California, Santa Barbara(加州大学圣塔芭芭拉分校)

AI总结 提出输入归因感知策略优化(IAPO),通过强化学习对齐模型与教师模型的输入归因,提升多模态小语言模型的工具调用能力,在六个测试集上平均准确率提升3%。

详情
AI中文摘要

本文研究强化学习方法以提升多模态小语言模型(SLM)代理的工具调用能力。尽管现有工作探索了多种奖励设计来改善代理的工具调用能力,但这些方法在SLM训练中面临固有局限性,尤其是在多模态场景下。首先,许多现有方法通过精确匹配某些真实标签或预定义格式来评估工具使用正确性。然而,这种假设通常不适用于多模态任务,因为可能存在多个有效的工具使用路径,且通常没有标注的工具轨迹。其次,这种稀疏且脆弱的二元奖励对如何改进底层决策过程提供的指导很少,使得多模态SLM难以从中学习。为解决这些问题,我们提出输入归因感知策略优化(IAPO),一种通过将模型在输入组件上的归因与更强的教师模型对齐,来改进多模态SLM工具使用的强化学习算法。在Qwen2.5-VL-3B上的实验表明,与现有的视觉工具使用工作相比,所提方法通过帮助模型关注最相关的输入证据,在六个测试集上平均将视觉问答准确率提高了3%。

英文摘要

This paper investigates reinforcement learning (RL) methods for improving tool-calling capabilities in multimodal small language model (SLM) agents. While existing works have explored various reward designs to improve agentic tool-calling ability, these approaches face inherent limitations for SLM training, especially under multimodal scenarios. First, many existing methods evaluate tool use correctness through exact matching against certain ground-truth or predefined formats. However, this assumption is often unsuitable for multimodal tasks, where multiple tool use paths may be valid and annotated tool trajectories are typically unavailable. Second, such sparse and brittle binary rewards provide little guidance on how to improve the underlying decision process, making them particularly difficult for multimodal SLM to learn from. To address these issues, we propose Input Attribution-Aware Policy Optimization (IAPO), an RL algorithm for improving tool use in multimodal SLM by aligning the model's attribution across input components with that of a stronger teacher. Experiments on Qwen2.5-VL-3B show that the proposed method improves visual question answering accuracy by an average of 3% across six test sets compared with existing visual tool use work, by helping the model attend to the most relevant input evidence.

2606.11709 2026-06-11 cs.LG cs.CL 新提交

RLCSD: Reinforcement Learning with Contrastive On-Policy Self-Distillation

RLCSD: 基于对比策略自蒸馏的强化学习

Leyi Pan, Shuchang Tao, Yunpeng Zhai, Lingzhe Zhang, Zhaoyang Liu, Bolin Ding, Aiwei Liu, Lijie Wen

发表机构 * Tsinghua University(清华大学) Tongyi Lab, Alibaba Group(阿里巴巴集团通义实验室) Peking University(北京大学)

AI总结 针对策略自蒸馏中特权诱导的风格漂移问题,提出RLCSD方法,通过对比正确与错误提示下的师生差距来抑制风格偏移,提升推理模型在数学和逻辑推理任务上的性能。

详情
Comments
20 pages, 9 figures, 9 tables
AI中文摘要

策略自蒸馏(OPSD)通过将模型自身的分布与在特权上下文(通常是已验证的解决方案)下产生的分布对齐,为推理模型提供密集的令牌级监督。然而,我们表明从这种分布差距中提取的学习信号集中在风格令牌而非任务承载令牌上,因为提示模型倾向于产生更直接、更短的输出。我们将这种病理现象称为\emph{特权诱导的风格漂移},它会破坏训练稳定性或导致响应长度缩短。为了解决这个问题,我们提出\textbf{RLCSD}(基于对比策略自蒸馏的强化学习),通过对比正确提示下的师生差距与错误提示下的师生差距来缓解这种漂移,抑制无论正确与否,条件于提示往往诱发的风格转变,并产生更集中于任务承载令牌的信号。在Qwen3(1.7B/4B/8B)和Olmo-3-7B-Think上的数学和逻辑推理实验表明,RLCSD始终优于GRPO和先前的OPSD方法。我们进一步表明,对比原则是通用的:它可以嵌入现有的OPSD方法中以提高它们,并且其潜在见解可扩展到更广泛的跨模型策略蒸馏设置。

英文摘要

On-policy self-distillation (OPSD) provides dense, token-level supervision for reasoning models by aligning a model's own distribution with the distribution it produces under privileged context, typically a verified solution. However, we show that the learning signal drawn from this distributional gap concentrates on style tokens rather than task-bearing ones, as the hinted model tends to produce more direct, shorter outputs. We term this pathology privilege-induced style drift, which destabilizes training or causes response length to shrink. To address this, we propose RLCSD (Reinforcement Learning with Contrastive on-policy Self-Distillation), which mitigates this drift by contrasting the teacher-student gap under a correct hint against that under a wrong hint, suppressing the style shift that conditioning on a hint tends to induce regardless of correctness, and yielding a signal that is more concentrated on task-bearing tokens. Experiments on Qwen3 (1.7B/4B/8B) and Olmo-3-7B-Think across mathematical and logical reasoning show that RLCSD consistently outperforms GRPO and prior OPSD methods. We further show that the contrastive principle is general: it plugs into existing OPSD methods to improve them, and its underlying insight extends to the broader cross-model on-policy distillation setting.

2606.11797 2026-06-11 cs.LG 新提交

Space-sampled Value Decay: Forgetting Mechanisms for Non-stationary Deep Reinforcement Learning

空间采样值衰减:非平稳深度强化学习的遗忘机制

Felix Störck, Fabian Hinder, Barbara Hammer

发表机构 * CITEC, Faculty of Technology, Bielefeld University(比勒费尔德大学技术学院CITEC)

AI总结 受啮齿动物遗忘行为启发,提出空间采样值衰减作为显式遗忘机制,用于深度强化学习应对环境漂移,在DQN和SAC上验证效果与局限。

详情
Comments
Accepted at The 2nd Workshop on Epistemic Intelligence in Machine Learning, EIML@ICML 2026, (non-archival)
AI中文摘要

对小鼠等啮齿动物的研究表明,即使没有提供关于变化的信息(不确定性),它们也能适应环境参数的变化(“漂移”)——这种行为可以通过遗忘机制建模。非平稳强化学习(NSRL)致力于改进最先进的强化学习方法以应对变化的环境:然而,这些方法通常需要关于漂移的(部分)完美信息,如“任务ID”或“上下文”。为了减轻漂移的影响,本文开发了\emph{空间采样值衰减},作为基于值的深度强化学习架构的一种显式遗忘机制,这是一种简单而有效的方法。特别地,我们展示并讨论了在非平稳环境中评估深度Q网络(DQN)和软演员-评论家(SAC)的修改时,在获得的回报方面的积极效果以及局限性。

英文摘要

Studies on rodents such as mice have shown that they can adapt their behavior to changing environment parameters ("drift") even when no information about the change is provided (uncertainty), a behavior that can be modeled by forgetting mechanisms. Non-stationary Reinforcement Learning (NSRL) adapts state-of-the-art RL methods to changing environments; however, these methods usually require (partially) perfect information about the drift, such as "task IDs" or "context". To mitigate the effects of drift, this work develops Space-sampled Value Decay, a simple yet effective explicit forgetting mechanism for value-based deep RL architectures. In particular, we demonstrate and discuss positive effects, as well as limitations, in the returns achieved by modified Deep Q-networks (DQN) and Soft Actor-Critic (SAC) when evaluated on non-stationary environments.

2606.11968 2026-06-11 cs.LG stat.ML 新提交

Efficient Multinomial Logistic Bandit via Frequent Directions

基于频繁方向的高效多项式逻辑斯蒂老虎机

Linzhe He, Yu-Jie Zhang, Sifan Yang, Lijun Zhang

发表机构 * State Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学保罗·G·艾伦计算机科学与工程学院)

AI总结 针对多项式逻辑斯蒂老虎机的高维计算瓶颈,提出集成频繁方向矩阵素描的EOFD-MLogB算法,将每轮复杂度降至O(Kd(m+K)^2)时间和O(Kd(m+K))空间,并证明其遗憾界接近原算法。

详情
AI中文摘要

本文研究多项式逻辑斯蒂老虎机(MLogB)的高效在线算法,其中$K+1$个结果的反馈分布遵循$d$维动作向量的多项式逻辑斯蒂模型。代表性的UCB型算法OFUL-MLogB实现了$\tilde{\mathcal{O}}(Kd\sqrt{T})$的遗憾界,但由于参数估计和乐观奖励构造,每轮仍需$\mathcal{O}(K^3d^3)$时间和$\mathcal{O}(K^2d^2)$空间,在高维场景下不可行。为解决此限制,我们提出EOFD-MLogB,将频繁方向矩阵素描集成到OFUL-MLogB中。通过维护累积Hessian的低秩SVD素描,参数估计中的约束在线牛顿更新和奖励加成项中的$Kd \times K$谱范数计算分别简化为一维求根任务和$K \times K$特征值计算。这导致每轮主要时间复杂度为$\mathcal{O}(Kd(m+K)^2)$,空间复杂度为$\mathcal{O}(Kd(m+K))$,其中$m \ll d$为素描大小。我们进一步证明了$\tilde{\mathcal{O}}(\Delta_T(Kd\ln\Delta_T+m)\sqrt{T})$的遗憾界,其中素描误差因子$\Delta_T$由Hessian的$m$截断谱尾控制。因此,当Hessian近似低秩时,遗憾接近OFUL-MLogB。实验验证了计算效率和竞争性能。

英文摘要

This paper studies efficient online algorithms for multinomial logistic bandits (MLogB), where the feedback distribution over $K+1$ outcomes follows a multinomial logistic model of $d$-dimensional action vectors. A representative UCB-type algorithm, OFUL-MLogB, achieves a regret bound of $\tilde{\mathcal{O}}(Kd\sqrt{T})$, but still requires $\mathcal{O}(K^3d^3)$ time and $\mathcal{O}(K^2d^2)$ space per round due to parameter estimation and optimistic reward construction, which is prohibitive in high-dimensional settings. To address this limitation, we propose EOFD-MLogB, which integrates frequent directions matrix sketching into OFUL-MLogB. By maintaining a low-rank SVD sketch of the accumulated Hessian, constrained online Newton updates in parameter estimation and $Kd \times K$ spectral-norm computations in the reward bonus are reduced to one-dimensional root-finding tasks and $K \times K$ eigenvalue computations, respectively. This yields dominant per-round time complexity $\mathcal{O}(Kd(m+K)^2)$ and space complexity $\mathcal{O}(Kd(m+K))$, where $m \ll d$ is the sketch size. We further prove a regret bound of $\tilde{\mathcal{O}}(\Delta_T(Kd\ln\Delta_T+m)\sqrt{T})$, where the sketching error factor $\Delta_T$ is controlled by the $m$-truncated spectral tail of the Hessian. Thus, when the Hessian is approximately low-rank, the regret is close to that of OFUL-MLogB. Experiments validate the computational efficiency and competitive performance.
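For readers unfamiliar with the sketching primitive named above, the following is a minimal NumPy sketch of generic Frequent Directions applied to a stream of row vectors. It is not EOFD-MLogB: the bandit-specific Hessian accumulation, the online Newton step, and the reward bonus are not reproduced, and the function names and the 2m-row buffer variant are choices made here for illustration.

```python
import numpy as np

def _shrink(buffer, m):
    """SVD the buffer and subtract the (m+1)-th squared singular value."""
    _, s, vt = np.linalg.svd(buffer, full_matrices=False)
    delta = s[m] ** 2 if len(s) > m else 0.0
    shrunk = np.sqrt(np.maximum(s ** 2 - delta, 0.0))
    out = np.zeros_like(buffer)
    out[: len(s)] = shrunk[:, None] * vt
    return out                      # rows m, m+1, ... are now exactly zero

def frequent_directions(rows, m):
    """Stream the rows of a matrix A into an (m, d) sketch B.

    Uses a 2m-row buffer; the sketch satisfies
    ||A^T A - B^T B||_2 <= ||A||_F^2 / m.
    """
    d = rows.shape[1]
    buffer = np.zeros((2 * m, d))
    next_free = 0
    for row in rows:
        if next_free == 2 * m:      # buffer full: compress back to m rows
            buffer = _shrink(buffer, m)
            next_free = m
        buffer[next_free] = row
        next_free += 1
    return _shrink(buffer, m)[:m]

rng = np.random.default_rng(0)
A = rng.normal(size=(200, 20))      # hypothetical stream of 200 vectors, d = 20
B = frequent_directions(A, m=5)     # sketch size m = 5
spectral_error = np.linalg.norm(A.T @ A - B.T @ B, 2)
error_bound = np.linalg.norm(A, "fro") ** 2 / 5
```

The sketch stores only O(md) numbers instead of the full d-by-d second-moment matrix, which is the kind of saving the abstract's $\mathcal{O}(Kd(m+K))$ space figure relies on when $m \ll d$.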

2606.11982 2026-06-11 cs.LG New submission

PAWS: Preference Learning with Advantage-Weighted Segments

Aleksandar Taranovic, Onur Celik, Niklas Freymuth, Ge Li, Serge Thilges, Huy Le, Tai Hoang, Rania Rayyes, Gerhard Neumann

AI summary: Addresses the degradation of temporal credit assignment in preference-based reinforcement learning caused by the mismatch between training and inference distributions. Proposes PAWS, which performs policy updates directly with segment-level advantage functions and outperforms existing methods on robotic manipulation and locomotion tasks.

Comments
Published as a conference paper at ICML 2026

English abstract

Preference-based reinforcement learning (PbRL) learns policies from human trajectory-level comparisons, avoiding explicit reward design and expert demonstrations. Existing methods typically train utility functions on trajectory or segment-level preferences while relying on per-step utility estimates during policy optimization. This training and inference mismatch induces a distribution shift that severely degrades temporal credit assignment and limits policy learning. We analyze this issue and propose PAWS, a segment-based preference learning method that performs policy updates directly using segment-level advantage functions. By aligning utility training with policy optimization, PAWS preserves trajectory-level preference information and avoids unreliable per-step learning signals. Experiments on simulated robotic manipulation and locomotion tasks demonstrate that PAWS consistently outperforms existing PbRL approaches, highlighting the importance of distribution-consistent preference learning.

2606.12370 2026-06-11 cs.LG cs.CL New submission

Breaking Entropy Bounds: Accelerating RL Training via MTP with Rejection Sampling

Yucheng Li, Huiqiang Jiang, Yang Xu, Jianxin Yang, Yi Zhang, Yizhong Cao, Yuhao Shen, Fan Zhou, Rui Men, Jianwei Zhang, An Yang, Bowen Yu, Bo Zheng, Fei Huang, Junyang Lin, Dayiheng Liu, Jingren Zhou

Affiliations: Qwen Team, Alibaba Inc

AI summary: Addresses the drop in multi-token prediction (MTP) acceptance rates caused by entropy fluctuation during RL training. Proposes Bebop, which combines probabilistic rejection sampling with an end-to-end TV loss, reaching acceptance rates of up to 95% and up to 1.8x speedup.

English abstract

Reinforcement learning (RL) has become a key component in modern large language models, yet the rollout stage remains the key bottleneck in RL training pipelines. Although Multi-Token Prediction (MTP) offers a natural solution to accelerate rollouts through speculative decoding, many studies have observed that MTP acceptance rates degrade significantly during RL training, leading to limited speedup performance. To address this bottleneck, we present Bebop, a systematic study of MTP in LLM post-training, and offer practical recipes to integrate MTP into large-scale RL pipelines. First, we reveal that the MTP acceptance rate is fundamentally bounded by the fluctuation of model entropy, which demonstrates a clear negative linear relationship with the rise of entropy in the RL stage. Second, we show that probabilistic rejection sampling largely alleviates the disturbance introduced by entropy in RL compared to greedy draft sampling. We further identify that the conventional MTP training objectives (cross-entropy or KL) are suboptimal in such settings, and therefore we propose a novel end-to-end TV loss that directly optimizes multi-step rejection sampling acceptance rate, yielding ~10% acceptance rate improvements, achieving up to 95% acceptance rates and up to 25% extra inference throughput gains across mathematical reasoning, code generation, and agentic tasks. Third, we test various online MTP training strategies during RL and show that pre-RL MTP training with e2e TV loss and rejection sampling achieves a consistent acceptance rate and speedup throughout the entire RL, eliminating the need for costly online MTP updating. We provide extensive experiments and analysis that validate our findings. Experimental results show our method achieves up to 1.8x end-to-end acceleration in async RL training of Qwen3.5, Qwen3.6, and Qwen3.7 models.
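To make the link between rejection sampling and a total-variation (TV) objective concrete, the sketch below uses the standard single-token speculative sampling rule: a draft token $x \sim q$ is accepted with probability $\min(1, p(x)/q(x))$, so the expected acceptance rate is $\sum_x \min(p(x), q(x)) = 1 - \mathrm{TV}(p, q)$. The function names and the toy distributions are hypothetical, and this is not the multi-step end-to-end TV loss proposed in the paper, whose exact form the abstract does not give.

```python
def acceptance_rate(target, draft):
    """Expected single-token acceptance probability: sum_x min(p(x), q(x))."""
    return sum(min(p, q) for p, q in zip(target, draft))

def total_variation(target, draft):
    """Total variation distance between two discrete distributions."""
    return 0.5 * sum(abs(p - q) for p, q in zip(target, draft))

# Toy next-token distributions over a three-token vocabulary (illustrative only).
target = [0.5, 0.3, 0.2]  # p: the main model
draft = [0.6, 0.3, 0.1]   # q: the MTP draft head
rate = acceptance_rate(target, draft)
tv = total_variation(target, draft)
```

Under this rule, lowering the TV distance between the draft and target distributions raises the acceptance rate directly, which is one way to read the motivation for a TV-based training objective.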

2606.12384 2026-06-11 cs.LG cs.AI New submission

APPO: Agentic Procedural Policy Optimization

Xucong Wang, Ziyu Ma, Yong Wang, Yuxiang Ji, Shidong Yang, Guanhua Chen, Pengkun Wang, Xiangxiang Chu

Affiliations: University of Science and Technology of China; AMAP, Alibaba Group; Southern University of Science and Technology

AI summary: Proposes APPO, which improves credit assignment in agentic reinforcement learning through fine-grained branching and procedure-level advantage scaling, with an average gain of nearly 4 points across 13 benchmarks.

Comments
25 pages, including 14 pages of main text and 11 pages of appendix; work in progress

English abstract

Recent advances in agentic Reinforcement Learning (RL) have substantially improved the multi-turn tool-use capabilities of large language model agents. However, most existing methods assign credit over coarse heuristic units, such as tool-call boundaries or fixed workflows, making it difficult to identify which intermediate decisions influence downstream outcomes. In this work, we study agentic RL from two perspectives: \textit{where to branch and how to assign credit after branching}. Our pilot analysis shows that influential decision points are broadly distributed throughout the generated sequence rather than concentrated at tool calls, while token entropy alone does not reliably reflect their impact on final outcomes. Motivated by these observations, we propose \textbf{Agentic Procedural Policy Optimization (APPO)}, which shifts branching and credit assignment from coarse interaction units to fine-grained decision points in the sequence. APPO selects branching locations using a Branching Score that combines token uncertainty with policy-induced likelihood gains of subsequent continuations, enabling more targeted exploration while filtering out spurious high-entropy positions. It further introduces procedure-level advantage scaling to better distribute credit across branched rollouts. Experiments on 13 benchmarks show that APPO consistently improves strong agentic RL baselines by nearly 4 points, while keeping efficient tool-calls and maintaining behavior interpretability.

2606.12386 2026-06-11 cs.LG cs.AI New submission

ATLAS: Active Theory Learning for Automated Science

Noémi Éltető, Nathaniel D. Daw, Kimberly L. Stachenfeld, Kevin J. Miller

Affiliations: Google DeepMind; Princeton University; Columbia University; University College London

AI summary: Proposes the ATLAS framework, which uses active learning to iteratively generate sparse neural network hypotheses and design experiments that optimally discriminate between them. It recovers reinforcement learning agents from their behavior in bandit tasks, with a 5-10x gain in sample efficiency over random experimentation.

English abstract

Advancing scientific understanding through mechanistic modeling requires posing the right experimental questions to yield maximally informative data. To automate this pursuit within cognitive science, we introduce ATLAS (Active Theory Learning for Automated Science), an active learning framework for the data-driven discovery of interpretable behavioral models. ATLAS iterates between generating mechanistic hypotheses--instantiated as a diverse ensemble of sparse neural networks (Disentangled RNNs)--and designing experiments that optimally distinguish between them. We test this approach on the problem of recovering reinforcement learning agents from their behavior in bandit tasks. ATLAS designs varied sequences of qualitatively novel experiments with temporal structure tailored to underlying agent characteristics. The models trained on these experiments are evaluated against a comprehensive set of metrics for mechanistic modeling that capture behavioral, structural, and computational similarity. ATLAS achieves a 5-10x improvement in sample efficiency across all metrics compared to random experimentation, and its performance is further validated against expert-designed experiments derived from literature. These in silico results showcase ATLAS's potential to accelerate human-interpretable insights in cognitive science and other domains where scientific inquiry relies on discovering mechanistic models.

2606.11209 2026-06-11 cs.CL cs.AI cs.LG Cross-list

ProcessThinker: Enhancing Multi-modal Large Language Models Reasoning via Rollout-based Process Reward

Jingpei Wu, Xiao Han, Weixiang Shen, Boer Zhang, Zifeng Ding, Volker Tresp

Affiliations: LMU Munich; Harvard University; University of Cambridge; Mina AI; Konrad Zuse School of Excellence in Reliable AI (relAI)

AI summary: Proposes ProcessThinker, a post-training method that needs no explicit process reward model. Using a step-tagged format and a rollout-based process reward, it provides dense step-level rewards for multi-step reasoning and improves the consistency of multimodal reasoning.

Comments
Accepted at ICLR 2026 Workshop on Logical Reasoning of Large Language Models. 7 pages, 1 figure

English abstract

Visual question answering increasingly requires multi-step reasoning. Recent post-training with reinforcement learning under verifiable rewards (RLVR) and Group Relative Policy Optimization (GRPO) can improve multimodal reasoning, but most approaches rely on sparse outcome-only rewards. As a result, they struggle to tell whether an incorrect answer comes from a small mistake late in the reasoning or from an unhelpful trajectory from the start. A common solution is to train a process reward model (PRM) for step-level supervision, but this typically requires large-scale high-quality chain-of-thought annotations and additional training cost. We propose ProcessThinker, a practical post-training pipeline that provides step-level process rewards without training an explicit PRM. ProcessThinker first rewrites reasoning traces into a step-tagged format for cold-start supervised fine-tuning, then applies GRPO with a standard format reward and our rollout-based process reward. Concretely, for each intermediate step, we sample multiple continuations from that step and use the empirical success rate (final-answer verification) as the step reward. This gives dense credit assignment and encourages reasoning steps that more reliably support a correct conclusion, helping reduce inconsistent or self-contradictory progress across steps -- a key issue in logical reasoning. Across four challenging video benchmarks (Video-MMMU, MMVU, VideoMathQA, and LongVideoBench), ProcessThinker consistently improves over the baseline model Qwen3-VL-8B-Instruct.
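The rollout-based step reward described in the abstract can be sketched as follows: for each step prefix, sample several continuations and score the prefix by the fraction whose final answer passes verification. The names `step_rewards`, `sample_final_answer`, and `is_correct`, the toy stand-ins, and the default of eight rollouts are assumptions for illustration; a real pipeline would call the policy model and an answer verifier, and the paper's exact reward shaping may differ.

```python
def step_rewards(steps, sample_final_answer, is_correct, num_rollouts=8):
    """Score each step prefix by the empirical success rate of its continuations.

    steps: list of reasoning steps (strings).
    sample_final_answer: callable mapping a step prefix to one sampled final answer.
    is_correct: callable that verifies a final answer.
    """
    rewards = []
    for i in range(1, len(steps) + 1):
        prefix = steps[:i]
        successes = sum(
            1 for _ in range(num_rollouts)
            if is_correct(sample_final_answer(prefix))
        )
        rewards.append(successes / num_rollouts)
    return rewards

# Toy stand-ins (hypothetical): any prefix containing "bad step" leads to a wrong answer.
def toy_sampler(prefix):
    return "wrong" if "bad step" in prefix else "42"

def toy_verifier(answer):
    return answer == "42"

rewards = step_rewards(
    ["good step", "bad step", "another step"], toy_sampler, toy_verifier, num_rollouts=4
)
```

In this toy run the reward drops at the step that introduces the error, which is the kind of localized signal an outcome-only reward cannot provide.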

2606.11274 2026-06-11 cs.MA cs.LG physics.flu-dyn Cross-list

Multi-agent rendezvous in fluid flows via reinforcement learning

Bocheng Li, Jingran Qiu, Lihao Zhao

AI summary: Uses multi-agent reinforcement learning (MARL) to develop physics-informed rendezvous strategies in vortical flows. The strategies markedly raise the rendezvous rate, transfer across vortex intensities, vortex scales, and swarm sizes, and keep agents from being trapped in separate vortices by breaking the symmetry of the state-action map.

English abstract

Rendezvous is a critical task for multi-agent systems, requiring agents to coordinate to meet at an unspecified location. However, achieving this in fluid environments presents a challenge, as it remains unclear how agents can exploit underlying fluid kinematics to facilitate convergence. In this study, we adopt a multi-agent reinforcement learning (MARL) approach to develop physics-informed rendezvous strategies in vortical flows. Compared to a naive strategy, where agents navigate toward their counterparts, MARL strategies significantly improve the rendezvous rate. MARL strategies also show transferability across varying vortex intensities, vortex scales, and swarm sizes. By breaking the symmetry of the state-action map, the MARL strategy leverages a non-intuitive mechanism that prevents agents from becoming trapped in separate vortices, thereby enhancing rendezvous success. Additionally, a heuristic strategy is extracted from the learned strategy and also outperforms the naive strategy. Furthermore, a theoretical analysis demonstrates that fluid deformation impedes the rendezvous process. Large finite-time Lyapunov exponents identify where fluid effects separate adjacent agents, suggesting that targets should be planned in weak-deformation regions. Our findings reveal the important role that agent-fluid interactions play in multi-agent tasks and highlight the capability of MARL to explore swarm intelligence in complex flow environments.
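The finite-time Lyapunov exponent (FTLE) mentioned above has a standard definition: with flow-map gradient $F$ over a horizon $T$ and Cauchy-Green tensor $C = F^\top F$, the FTLE is $\frac{1}{|T|}\ln\sqrt{\lambda_{\max}(C)}$. The sketch below evaluates it in 2D with central differences. The function `ftle_2d`, the step size `eps`, and the analytic saddle flow used as a test case are illustrative assumptions, not the flow fields or code used in the paper.

```python
import math

def ftle_2d(flow_map, x, y, horizon, eps=1e-4):
    """FTLE at (x, y) from a finite-difference estimate of the flow-map gradient."""
    # Central differences of the flow map with respect to the initial position.
    xp, xm = flow_map(x + eps, y), flow_map(x - eps, y)
    yp, ym = flow_map(x, y + eps), flow_map(x, y - eps)
    f11 = (xp[0] - xm[0]) / (2 * eps)
    f21 = (xp[1] - xm[1]) / (2 * eps)
    f12 = (yp[0] - ym[0]) / (2 * eps)
    f22 = (yp[1] - ym[1]) / (2 * eps)
    # Cauchy-Green tensor C = F^T F (symmetric 2x2).
    c11 = f11 * f11 + f21 * f21
    c12 = f11 * f12 + f21 * f22
    c22 = f12 * f12 + f22 * f22
    # Largest eigenvalue of a symmetric 2x2 matrix in closed form.
    half_trace = 0.5 * (c11 + c22)
    det = c11 * c22 - c12 * c12
    lam_max = half_trace + math.sqrt(max(half_trace * half_trace - det, 0.0))
    return math.log(math.sqrt(lam_max)) / abs(horizon)

# Hypothetical test flow: the saddle u = (x, -y), whose flow map over time T is
# (x * exp(T), y * exp(-T)); its FTLE equals 1 everywhere.
def saddle_flow_map(x, y, horizon=2.0):
    return (x * math.exp(horizon), y * math.exp(-horizon))

value = ftle_2d(saddle_flow_map, 0.3, -0.2, horizon=2.0)
```

A large value marks a region where nearby trajectories diverge quickly, which matches the abstract's suggestion to plan rendezvous targets where deformation is weak.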

2606.11284 2026-06-11 cs.MA cs.GT cs.LG Cross-list

Phi-Actor-Critic: Steering General-Sum Games to Pareto-Efficient Correlated Equilibria

Wongyu Lee, Francesco Lelli, Omran Ayoub, Massimo Tornatore

AI summary: Proposes the Φ-Actor-Critic framework, which uses swap regret minimization to steer multi-agent learning toward correlated equilibria with high social welfare. A centralized attention critic estimates counterfactual regrets efficiently, and a Lagrangian equilibrium selection mechanism optimizes social welfare.

Comments
Accepted to IJCAI 2026

English abstract

Real-world multi-agent systems, from traffic coordination to resource allocation, are often modeled as general-sum games where individual incentives conflict with collective welfare. In these settings, the central challenge is not merely finding an equilibrium, but selecting socially desirable outcomes among many suboptimal Nash equilibria. Standard deep multi-agent reinforcement learning (MARL) methods struggle with this problem, as value-decomposition approaches are constrained by monotonicity assumptions and policy-gradient methods often converge to stable but socially inefficient equilibria. To address this limitation, we propose $\Phi$-Actor-Critic ($\Phi$-AC), a framework that leverages swap regret minimization to steer learning toward high-welfare correlated equilibria (CE). To make counterfactual regret estimation tractable in deep MARL, $\Phi$-AC employs a centralized attention critic that predicts vector-valued regrets in a single forward pass, avoiding computationally expensive counterfactual simulations. We further introduce a Lagrangian-based equilibrium selection mechanism that optimizes social welfare while enforcing stability through regret constraints. Experiments on matrix games, Multi-Agent Particle Environments (MPE), and the Melting Pot Harvest scenario demonstrate that $\Phi$-AC learns efficient and stable coordination strategies across diverse mixed-motive settings while maintaining high collective return and competitive fairness.
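Swap regret, the quantity the framework minimizes, can be computed exactly for a small matrix game. For each action $a$ a player used, it asks how much better the best fixed replacement $a'$ would have done on exactly the rounds where $a$ was played. When every player's swap regret grows sublinearly, the empirical distribution of play approaches the set of correlated equilibria. The sketch below is a plain tabular computation with hypothetical names; it is not the paper's attention critic, which predicts such regrets with a neural network.

```python
def swap_regret(payoff, own_actions, other_actions):
    """Cumulative swap regret of the row player over a play history.

    payoff[a][b] is the row player's payoff when it plays a and the opponent plays b.
    own_actions[t] and other_actions[t] are the actions taken at round t.
    """
    num_actions = len(payoff)
    total = 0.0
    for a in range(num_actions):
        # Opponent actions on the rounds where the row player chose a.
        rounds = [b for own, b in zip(own_actions, other_actions) if own == a]
        realized = sum(payoff[a][b] for b in rounds)
        best_swap = max(
            sum(payoff[alt][b] for b in rounds) for alt in range(num_actions)
        )
        total += best_swap - realized
    return total

# Coordination game: payoff 1 when actions match, 0 otherwise (illustrative).
coordination = [[1.0, 0.0], [0.0, 1.0]]
regret = swap_regret(coordination, own_actions=[0, 0, 1, 1], other_actions=[1, 1, 1, 1])
```

Here the row player miscoordinates on the two rounds where it plays action 0, and swapping those rounds to action 1 would have gained 2, so the cumulative swap regret is 2.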

2606.11525 2026-06-11 cs.RO cs.LG Cross-list

Learning Object Manipulation from Scratch via Contrastive Interaction

Tongle Shen, Caleb Chuck, Fan Feng, Biwei Huang

Affiliations: UC San Diego; UT Austin

AI summary: Addresses the weak performance of contrastive reinforcement learning on interaction-rich manipulation tasks. Proposes Interaction-weighted Resampling, which preserves mode boundaries to better represent multi-modal, piecewise nonlinear reachability, and reports significant improvements on simulated and real-robot air hockey tasks.

English abstract

Contrastive Reinforcement Learning (CRL) has seen recent success in a wide variety of goal-conditioned robotics tasks by learning structured representations of the dynamics. However, despite its success in locomotion and simpler control domains, CRL often struggles in interaction-rich manipulation. We argue that a key source of this difficulty is object-centric interaction, such as contact or grasping, that induces distinct changes in the underlying dynamic modes. In this work, we formulate manipulation dynamics as a piecewise-smooth Markov process and show that interaction-induced mode changes create piecewise nonlinear reachability structures that are difficult for standard CRL energy functions to represent and plan over. Based on this analysis, we introduce Interaction-weighted Resampling (IWR). IWR performs interaction-aware resampling around phases before, during, and after interactions, encouraging the learned representation to preserve the mode boundaries that determine future reachability to capture multi-modal and piecewise nonlinear reachability. Across interaction-centric environments, including 2D dynamic control, robotic manipulation, and robot air hockey, IWR improves both sample efficiency and overall performance over prior CRL methods, with 19.8% average improvement in simulation. Finally, using a sim-to-real pipeline with policies trained by IWR, we demonstrate the first real-world goal-conditioned robot air hockey agent capable of hitting goals, improving success from 25% to 60%. Project Page: this http URL.

2606.11798 2026-06-11 q-fin.CP cs.LG math.OC Cross-list

Deterministic Policy Gradient for Learning Equilibrium in Time-Inconsistent Control Problems

Xin Guo, Yijie Huang, Xiang Yu

AI summary: Proposes a continuous-time, model-free reinforcement learning algorithm that learns equilibrium policies for time-inconsistent control problems through deterministic policy gradient and inner fixed-point iterations, and validates it on mean-variance portfolio management and a tracking portfolio under non-exponential discounting.

Comments
Keywords: Time-inconsistent control, two-stage reformulation, model-free continuous-time reinforcement learning, deterministic policy gradient, fixed point iteration

English abstract

In this paper, we develop a continuous-time model-free reinforcement learning algorithm to learn deterministic equilibrium policies in general time-inconsistent control problems. Utilizing the extended Hamilton-Jacobi-Bellman system, we recast the original time-inconsistent problem into an equivalent two-stage problem. In the first stage, for given auxiliary functions, we employ the deterministic policy gradient approach to learn an optimal policy in an auxiliary time-consistent control problem. In the second stage, given the updated policy, we exploit the inner fixed point iterations and some martingale characterizations to learn the auxiliary functions. As a theoretical contribution, we provide some mild model assumptions and establish the convergence of inner fixed point iterations. By repeating this actor-critic style of iterations across two stages, our algorithm aims to learn the equilibrium under different sources of time-inconsistency in a unified manner. The superior effectiveness of the proposed algorithm is illustrated in two classical financial applications with time-inconsistency: mean-variance portfolio management and optimal tracking portfolio under non-exponential discounting.

2606.11891 2026-06-11 cs.RO cs.LG Cross-list

Critic Architecture Matters: Dual vs. Unified Critics for Humanoid Loco-Manipulation

Mehmet Turan Yardımcı

AI summary: Compares unified-critic and dual-critic architectures for multi-objective reinforcement learning on humanoid robots. Experiments show that dual-critic policies clearly outperform the unified critic in reach speed, throughput, and success rate, and that the choice of architecture matters more than reward engineering.

Comments
Accepted at the ICRA 2026 Workshop on Reinforcement Learning for Imitation Learning (RL4IL), Vienna, Austria. 4 pages, 2 figures

English abstract

Multi-objective reinforcement learning for humanoid robots must coordinate locomotion and manipulation within a single policy. A natural design choice is whether to use a single (unified) critic that estimates the combined value of all objectives, or separate (dual) critics with disjoint reward signals. We present a controlled comparison on the Unitree G1 humanoid (23 active DoF) in NVIDIA Isaac Lab, training loco-manipulation policies through a sequential curriculum spanning 13 levels from stationary reaching to walking with variable-orientation targets. In standardized evaluation, dual-critic policies reach targets 3.5$\times$ faster (6.5 vs. 22.6 simulation steps), achieve 2$\times$ higher throughput (14.3 vs. 7.0 validated reaches per 1,000 steps), and attain higher validated reach rates (65.2% vs. 53.8%) compared to the unified-critic policy. Notably, additional anti-gaming reward mechanisms provide no further improvement beyond the architectural change alone (60.9% vs. 65.2%). These results have direct implications for the emerging paradigm of RL fine-tuning of imitation-learned policies: when refining a pre-trained manipulation policy with RL, a unified critic risks suppressing the learned behavior through competing locomotion gradients. These findings demonstrate that critic architecture is a primary - and often overlooked - design choice in multi-objective humanoid RL, with greater impact than reward engineering on reaching efficiency.

2606.12086 2026-06-11 cs.AI cs.LG Cross-list

IntElicit: Eliciting and Assessing Contextualized Creativity via Dialogue Policy Optimization

Mingjia Li, Jin Wu, Hong Qian, Wenhao Huang, Yiyang Huang, Yiwen Zhang, Chanjin Zheng, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

Affiliations: East China Normal University; Shanghai Innovation Institute

AI summary: Proposes the IntElicit framework, which optimizes a dialogue policy with a decomposed process reward mechanism and reduces non-creative confounders during interaction, so that contextualized creativity can be elicited and assessed more effectively.

English abstract

Contextualized assessment offers high ecological validity for evaluating creativity but introduces a critical challenge: observed performance may be confounded with cognitive proficiency (domain knowledge) and agency (willingness to engage). Meanwhile, in the age of generative AI, creative problem solving increasingly occurs in tool-mediated and human--AI interactive environments, making fully static assessment less aligned with contemporary creative practice. To address these issues, this paper proposes IntElicit, a framework for eliciting and assessing contextualized creativity via dialogue policy optimization. IntElicit functions as a constrained adaptive AI Interviewer: it provides non-directive knowledge and agency scaffolds in multi-turn interaction to reduce non-creative confounders, while preserving participants' responsibility for generating the creative content being evaluated. Specifically, to tackle sparse rewards and potential reward hacking (e.g., answer dictation) in open-ended educational dialogue, IntElicit introduces a decomposed process reward mechanism. This mechanism aligns the policy with pedagogical elicitation, rewarding prompts that draw out participant reasoning rather than producing optimal answers on their behalf. Extensive experiments, including participant simulation and a human subject study (N=64), show that IntElicit improves elicited creative outcomes over expert-designed baselines. Together, the results suggest that interactive elicitation can reveal creative potential that static FPSP-style assessment may miss, providing a formative and diagnostic lens for contextualized creativity assessment in AI-mediated learning contexts.

2606.12281 2026-06-11 cs.MA cs.AI cs.LG Cross-list

CCKS: Consensus-based Communication and Knowledge Sharing

Jinyuan Zu, Xiaowei Lv, Yongcai Wang, Deying Li, Yunjun Han, Wenping Chen, Fengyi Zhang, Naiqi Wu

AI summary: Addresses over-reliance on teacher guidance in action advising for multi-agent reinforcement learning. Proposes a consensus-based communication and knowledge sharing framework that builds consensus models with contrastive learning, balances exploration against learning from teachers, and improves cooperation efficiency and performance.

English abstract

In Decentralized Training and Decentralized Execution (DTDE) for cooperative Multi-Agent Reinforcement Learning (MARL), action-advising-based knowledge sharing promotes interpretable and scalable cooperation among agents. However, current action advising approaches often adhere too much to the teacher's guidance without evaluating teacher-student compatibility, which causes excessive advising, suboptimal stability, and degraded performance. To overcome these challenges, this paper presents a Consensus-based Communication and Knowledge Sharing (CCKS) framework, which allows agents to adopt recommendations based on consensus-derived constraints and to follow the teacher's instructions more smartly. This mechanism enables agents to balance exploration and learning from experienced teachers, improving overall performance. The key is the consensus model construction, for which we propose to employ contrastive learning to construct consensus models based on local observations in the agents' training phase. In action selection, agents score and choose actions based on consensus and shared knowledge. Designed as a plug-and-play solution, CCKS integrates seamlessly with existing DTDE algorithms. Experiments conducted in the Google Research Football environment and the complex StarCraft II Multi-Agent Challenge demonstrate that the integration with CCKS significantly improves cooperation efficiency, learning speed, and overall performance compared with current DTDE baselines. The code is available at this https URL.

2606.12299 2026-06-11 cs.RO cs.LG Cross-list

Learning What to Say to Your VLA: Mostly Harmless Vision Language Action Model Steering

Hyun Joe Jeong, Gokul Swamy, Andrea Bajcsy

Affiliations: Robotics Institute, Carnegie Mellon University

AI summary: Proposes a framework that interactively searches for language sequences that improve closed-loop VLA task performance and learns an improvement head that predicts when language steering will help. Conformalizing this head guards against harmful interventions.

Comments
22 pages, 14 tables, 14 figures

English abstract

Vision-Language-Action (VLA) models provide a natural language interface to robot control, but the mapping from language to behavior is often brittle and unintuitive: semantically similar instructions can induce drastically different behaviors, while some capabilities may not be elicitable through prompting alone. As a result, both human instructions and zero-shot language models can fail to reliably steer VLAs toward successful task execution. In this work, we propose a framework that interactively searches for language sequences that improve closed-loop VLA task performance, distills these sequences into a test-time language feedback policy (LFP), and learns an improvement head that predicts when language steering will improve performance. We conformalize this improvement head to prevent harmful steering interventions, where the LFP decreases task performance relative to the original instruction on out-of-distribution scenarios. Crucially, our approach operates on arbitrary frozen pre-trained VLAs, requiring neither access to the original training distribution nor fine-tuning of the underlying model. On seen environments, our conformalized LFP improves base VLA performance by 24.7% in simulation and 65.0% in hardware. On visual and semantic perturbations, our conformalized LFP has strong harmlessness guarantees, and produces recovery behaviors not observed with open-loop prompting.

2606.12372 2026-06-11 cs.RO cs.LG Cross-list

UniIntervene: Agentic Intervention for Efficient Real-World Reinforcement Learning

Haoyuan Deng, Yitong Gao, Yudong Lin, Haichao Liu, Zhenyu Wu, Ziwei Wang

Affiliations: Nanyang Technological University; Beijing University of Posts and Telecommunications

AI summary: Proposes UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states. On real-robot manipulation tasks it raises the average success rate by 8.6% and cuts human interventions by 57%.

Comments
Project page: this https URL

English abstract

Human-in-the-loop reinforcement learning (HiL-RL) has emerged as an effective paradigm for real-world robotic manipulation, enabling online policy improvement with human guidance. However, current HiL-RL frameworks remain intervention-intensive, relying on frequent human corrections to redirect the policy out of unproductive exploration, which incurs high labor cost and limits real-world scalability. To address this, we propose UniIntervene, an agentic intervention model that detects unproductive exploration and autonomously recovers the policy toward high-value states, taking over the bulk of interventions from human operators. Specifically, UniIntervene first performs future-conditioned action-value estimation, predicting the latent consequence of the current action and evaluating its induced value, which provides a more stable progress signal. Building on this, a temporal value-risk critic aggregates recent value dynamics and triggers intervention when the estimated value exhibits sustained stagnation or degradation. When intervention is required, UniIntervene retrieves a high-value recovery target from a memory of past intervention episodes and produces executable corrective actions through a goal-conditioned recovery policy. In this way, UniIntervene turns intervention from passive human correction into a value-aware recovery process for efficient real-world RL. Extensive experiments on diverse real-world manipulation tasks demonstrate that UniIntervene improves the average success rate by 8.6% while reducing human interventions by 57% relative to state-of-the-art HiL-RL baselines.

2505.15201 2026-06-11 cs.LG cs.AI cs.CL stat.ML 版本更新

Pass@K Policy Optimization: Solving Harder Reinforcement Learning Problems

Pass@K 策略优化:解决更困难的强化学习问题

Christian Walder, Deep Karkhanis

AI总结 提出 Pass-at-k 策略优化 (PKPO),通过变换奖励直接优化 pass@k 性能,利用低方差无偏估计器,在训练中退火 k 可同时提升 pass@1 和 pass@k,解决更难问题。

详情
AI中文摘要

强化学习算法对每个问题采样多个 n>1 的解决方案尝试并独立奖励它们。这优化了 pass@1 性能,优先考虑孤立样本的强度,而牺牲了样本集的多样性和集体效用。这未充分利用采样能力,限制了探索和在更难示例上的最终改进。作为修复,我们提出 Pass-at-k 策略优化 (PKPO),一种对最终奖励的变换,导致直接优化 pass@k 性能,从而优化联合考虑时最大化奖励的样本集。我们的贡献是推导出 pass@k 及其梯度在二元和连续奖励设置中的新型低方差无偏估计器。我们展示了使用我们的估计器进行优化简化为标准强化学习,其中奖励经过稳定高效的变换函数联合变换。虽然先前的工作仅限于 k=n,但我们是第一个能够对任意 k ≤ n 实现 pass@k 鲁棒优化的。此外,我们的方法不是以 pass@1 性能换取 pass@k 增益,而是允许在训练中退火 k,同时优化两个指标,通常能在显著 pass@k 增益的同时获得强大的 pass@1 数值。我们在玩具实验上验证了我们的奖励变换,揭示了我们的公式的方差减少特性。我们还使用开源 LLM GEMMA-2 包含了真实世界的例子。我们发现我们的变换有效地优化了目标 k。此外,更高的 k 值能够解决更多和更难的问题,而退火 k 则同时提升了 pass@1 和 pass@k。关键的是,在传统 pass@1 优化停滞的具有挑战性的任务集上,我们的 pass@k 方法解锁了学习,这可能是由于通过优先考虑联合效用而非单个样本的效用实现了更好的探索。

英文摘要

Reinforcement Learning (RL) algorithms sample multiple n>1 solution attempts for each problem and reward them independently. This optimizes for pass@1 performance and prioritizes the strength of isolated samples at the expense of the diversity and collective utility of sets of samples. This under-utilizes the sampling capacity, limiting exploration and eventual improvement on harder examples. As a fix, we propose Pass-at-k Policy Optimization (PKPO), a transformation on the final rewards which leads to direct optimization of pass@k performance, thus optimizing for sets of samples that maximize reward when considered jointly. Our contribution is to derive novel low variance unbiased estimators for pass@k and its gradient, in both the binary and continuous reward settings. We show optimization with our estimators reduces to standard RL with rewards that have been jointly transformed by a stable and efficient transformation function. While previous efforts are restricted to k=n, ours is the first to enable robust optimization of pass@k for any arbitrary k <= n. Moreover, instead of trading off pass@1 performance for pass@k gains, our method allows annealing k during training, optimizing both metrics and often achieving strong pass@1 numbers alongside significant pass@k gains. We validate our reward transformations on toy experiments, which reveal the variance reducing properties of our formulations. We also include real-world examples using the open-source LLM, GEMMA-2. We find that our transformation effectively optimizes for the target k. Furthermore, higher k values enable solving more and harder problems, while annealing k boosts both the pass@1 and pass@k. Crucially, for challenging task sets where conventional pass@1 optimization stalls, our pass@k approach unblocks learning, likely due to better exploration by prioritizing joint utility over the utility of individual samples.
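For readers unfamiliar with the metric, the pass@k quantity discussed above can be illustrated with the standard combinatorial estimator used in code-generation evaluation. This is a minimal sketch, not the paper's new low-variance estimator or its gradient estimator: it assumes binary rewards, and the function name `pass_at_k` and its arguments (`n` samples, `c` of them correct) are illustrative choices.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled attempts, c of which are correct.

    Assumption: binary rewards. The value is the probability that a
    uniformly random size-k subset of the n attempts contains at least
    one correct attempt.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) counts the size-k subsets with no correct attempt;
    # it is 0 whenever fewer than k incorrect attempts exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With `n = 4` attempts and `c = 1` correct, the estimate rises from 0.25 at `k = 1` to 0.5 at `k = 2`, which shows why optimizing for larger `k` rewards sets of samples rather than isolated ones.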

2509.10303 2026-06-11 cs.LG cs.AI 版本更新

Generalizing Beyond Suboptimality: Offline Reinforcement Learning Learns Effective Scheduling through Random Solutions

超越次优性:离线强化学习通过随机解决方案学习有效调度

Jesse van Remmerden, Zaharah Bukhsh, Yingqian Zhang

AI总结 提出离线RL算法CDQAC,从次优静态数据集学习调度策略,在JSP/FJSP上超越在线RL和强启发式方法,仅需1-5%数据,发现状态-动作覆盖比轨迹质量更重要。

详情
AI中文摘要

在线强化学习(RL)方法通过与模拟环境直接交互学习调度策略,在作业车间调度(JSP)和柔性作业车间调度(FJSP)问题上表现出色。然而,这些方法通常需要大量的训练交互,限制了其样本效率和实际适用性。受此挑战的启发,我们引入了保守离散分位数演员-评论家(CDQAC),这是一种离线RL算法,可以直接从静态、次优数据集中学习有效的调度策略。CDQAC将基于分位数的评论家与延迟策略更新相结合,以估计机器-操作对的回报分布。在JSP和FJSP基准上的大量实验表明,CDQAC始终优于生成数据的启发式方法,超越了最先进的离线和在线RL基线,并且具有很高的样本效率,仅需原始数据集的1%到5%即可学习高质量策略。我们的分析表明,在调度中,离线RL的性能主要受状态-动作覆盖范围而非单个轨迹质量的影响。调度将密集奖励(与完工时间目标对齐)与跨启发式方法的等长轨迹相结合,从而能够从广泛的行为中有效学习。与此观察一致,由简单随机启发式方法生成的具有更广覆盖范围的数据集,使其性能优于在由更强启发式方法(如遗传算法)生成的数据集上训练的策略。

英文摘要

Online reinforcement learning (RL) approaches have demonstrated strong performance on Job Shop Scheduling (JSP) and Flexible JSP (FJSP) problems by learning scheduling policies through direct interaction with simulated environments. However, these methods often require extensive training interactions, limiting their sample efficiency and practical applicability. Motivated by this challenge, we introduce Conservative Discrete Quantile Actor-Critic (CDQAC), an offline RL algorithm that learns effective scheduling policies directly from static, suboptimal datasets. CDQAC couples a quantile-based critic with delayed policy updates to estimate the return distribution of machine-operation pairs. Extensive experiments on JSP and FJSP benchmarks demonstrate that CDQAC consistently outperforms the data-generating heuristics, surpasses state-of-the-art offline and online RL baselines, and is highly sample efficient, requiring only 1% to 5% of the original dataset to learn high-quality policies. Our analysis suggests that, in scheduling, offline RL performance is governed mainly by state-action coverage rather than the quality of individual trajectories. Scheduling couples a dense reward aligned with the makespan objective with equal-length trajectories across heuristics, enabling effective learning from a broad range of behaviors. Consistent with this observation, policies trained on datasets generated by a simple random heuristic, which offer broader coverage, outperform policies trained on datasets produced by stronger heuristics such as Genetic Algorithms.

2509.26294 2026-06-11 cs.LG cs.AI 版本更新

Noise-Guided Transport for Imitation Learning

噪声引导的模仿学习传输方法

Lionel Blondé, Joao A. Candido Ramos, Alexandros Kalousis

AI总结 针对低数据场景下的模仿学习,提出噪声引导传输(NGT)方法,通过对抗训练将模仿问题转化为最优传输问题,无需预训练或特殊架构,在极低数据量下实现强性能。

详情
Comments
Accepted at ICML 2026. Code: this https URL
AI中文摘要

我们考虑低数据场景下的模仿学习,其中只有有限数量的专家演示可用。在这种情况下,依赖大规模预训练或高容量架构的方法难以应用,对演示数据的效率变得至关重要。我们引入了噪声引导传输(NGT),一种轻量级的离策略方法,将模仿问题转化为通过对抗训练解决的最优传输问题。NGT不需要预训练或专门架构,通过设计包含不确定性估计,并且易于实现和调优。尽管简单,NGT在具有挑战性的连续控制任务(包括高维人形任务)中,在仅有20个转换的超低数据场景下取得了强劲的性能。

英文摘要

We consider imitation learning in the low-data regime, where only a limited number of expert demonstrations are available. In this setting, methods that rely on large-scale pretraining or high-capacity architectures can be difficult to apply, and efficiency with respect to demonstration data becomes critical. We introduce Noise-Guided Transport (NGT), a lightweight off-policy method that casts imitation as an optimal transport problem solved via adversarial training. NGT requires no pretraining or specialized architectures, incorporates uncertainty estimation by design, and is easy to implement and tune. Despite its simplicity, NGT achieves strong performance on challenging continuous control tasks, including high-dimensional Humanoid tasks, under ultra-low data regimes with as few as 20 transitions.

2510.02149 2026-06-11 cs.LG math.OC stat.ML 版本更新

Reinforcement Learning with Action-Triggered Observations

具有动作触发观测的强化学习

Alexander Ryabchenko, Wenlong Mou

AI总结 提出动作触发稀疏可追踪MDP框架,推导Bellman方程并证明最优策略存在,利用观测间动作序列的线性表示实现基于回归的方法,在几何分布情节下达到与完全可观测线性MDP匹配的遗憾界。

详情
AI中文摘要

我们引入了动作触发稀疏可追踪马尔可夫决策过程(ATST-MDPs),这是一种用于部分可观测性的强化学习框架,其中完整状态观测在每个步骤以由所选动作决定的概率随机发生。我们推导了针对该设置的Bellman方程,并证明了最优策略的存在性。利用稀疏观测揭示完整状态的事实,我们提供了一个等价公式,其中智能体在连续观测之间承诺动作序列。在线性MDP假设下,我们证明了这些动作序列上的值函数在有限维特征映射中具有线性表示,从而能够使用标准的基于回归的方法。作为一个应用,我们推导了ATST-LSVI-UCB,一种乐观算法,在几何分布的情节学习中实现了遗憾界$\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$,其中$K$是情节数,$d$是特征维度,$\gamma$是折扣因子(情节继续概率),与完全可观测线性MDP的已知速率相匹配。

英文摘要

We introduce Action-Triggered Sporadically Traceable Markov Decision Processes (ATST-MDPs), a reinforcement learning framework for partial observability in which full state observations occur stochastically at each step, with probability determined by the chosen action. We derive Bellman equations tailored to this setting and establish the existence of an optimal policy. Exploiting the fact that sporadic observations reveal the full state, we provide an equivalent formulation in which agents commit to action-sequences between consecutive observations. Under the linear MDP assumption, we show that the value function over such action-sequences admits a linear representation in a finite-dimensional feature map, enabling standard regression-based methods. As an application, we derive ATST-LSVI-UCB, an optimistic algorithm achieving regret $\widetilde{O}(\sqrt{Kd^3(1-\gamma)^{-3}})$ for episodic learning with geometrically distributed horizons, where $K$ is the number of episodes, $d$ the feature dimension, and $\gamma$ the discount factor (episode continuation probability), matching the known rate for linear MDPs with full observability.
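The reading of $\gamma$ as an episode continuation probability can be made concrete with a short derivation. This is a sketch under one assumption: each episode continues after every step independently with probability $\gamma$. The symbol $H$ for the episode length is introduced here and does not appear in the abstract.

```latex
\Pr(H = h) = \gamma^{h-1}(1-\gamma), \qquad h = 1, 2, \dots
\\
\mathbb{E}[H] = \sum_{h=1}^{\infty} h\,\gamma^{h-1}(1-\gamma) = \frac{1}{1-\gamma}.
```

Under this assumption, $(1-\gamma)^{-1}$ can be read as the expected horizon, which is how the $(1-\gamma)$ factors in the regret bound relate to episode length.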

2601.08136 2026-06-11 cs.LG eess.SY 版本更新

Reverse Flow Matching: A Unified Framework for Online Reinforcement Learning with Diffusion and Flow Policies

反向流匹配:基于扩散与流策略的在线强化学习统一框架

Zeyang Li, Sunbochen Tang, Navid Azizan

AI总结 针对在线强化学习中扩散与流策略缺乏目标样本的问题,提出反向流匹配框架,通过后验均值估计和Langevin Stein算子构造控制变量,统一了噪声期望与梯度期望两类方法,并扩展到流策略,提升训练效率与稳定性。

详情
Comments
ICML 2026 (Spotlight); Code: this https URL
AI中文摘要

扩散和流策略因其强大的表达能力在在线强化学习(RL)中日益重要,但高效训练它们仍是一个关键挑战。在线RL与标准生成建模的一个根本区别在于缺乏来自Q函数定义的目标玻尔兹曼分布的直接样本。为此,针对扩散策略提出了两类看似不同的方法:噪声期望族,使用噪声的加权平均作为训练目标;梯度期望族,使用Q函数梯度的加权平均。然而,这些目标如何正式相关,或者它们能否被综合成一个更通用的公式,目前尚不清楚。在本文中,我们提出了一个统一框架——反向流匹配(RFM),该框架严格解决了在没有直接目标样本的情况下训练扩散和流模型的问题。通过采用反向推理视角,我们将训练目标表述为给定中间噪声样本的后验均值估计问题。关键地,我们引入Langevin Stein算子来构造零均值控制变量,推导出一类具有相同期望的通用估计器。我们表明,现有的噪声期望和梯度期望方法只是这个更广泛类别中的两个具体实例。这种统一观点带来了两个关键进展:它将针对玻尔兹曼分布的能力从扩散策略扩展到流策略,并使得能够原则性地结合Q值和Q梯度信息形成有效估计器,从而提高训练效率和稳定性。我们将RFM实例化以在在线RL中训练流策略,并在连续控制基准测试中展示了相比扩散策略基线的改进性能。

英文摘要

Diffusion and flow policies are gaining prominence in online reinforcement learning (RL) due to their expressive power, yet training them efficiently remains a critical challenge. A fundamental difficulty that distinguishes online RL from standard generative modeling is the lack of direct samples from the target Boltzmann distribution defined by the Q-function. To address this, two seemingly distinct families of methods have been proposed for diffusion policies: a noise-expectation family, which uses a weighted average of noise as the training target, and a gradient-expectation family, which employs a weighted average of Q-function gradients. However, it remains unclear how these objectives are formally related, or whether they can be synthesized into a more general formulation. In this paper, we propose a unified framework, reverse flow matching (RFM), which rigorously addresses the problem of training diffusion and flow models without direct target samples. By adopting a reverse inferential perspective, we formulate the training target as a posterior mean estimation problem given an intermediate noisy sample. Crucially, we introduce Langevin Stein operators to construct zero-mean control variates, deriving a general class of estimators that share the same expectation. We show that existing noise-expectation and gradient-expectation methods are simply two specific instances within this broader class. This unified view yields two key advancements: it extends the capability of targeting Boltzmann distributions from diffusion to flow policies, and it enables the principled combination of Q-value and Q-gradient information to form an effective estimator, thereby improving training efficiency and stability. We instantiate RFM to train a flow policy in online RL and demonstrate improved performance on continuous-control benchmarks compared to diffusion policy baselines.

2603.08558 2026-06-11 cs.LG stat.ML 版本更新

Impact of Connectivity on Laplacian Representations in Reinforcement Learning

连通性对强化学习中拉普拉斯表示的影响

Tommaso Giorgi, Pierriccardo Olivieri, Keyue Jiang, Laura Toni, Matteo Papini

AI总结 本文研究了连通性对强化学习中拉普拉斯表示的误差影响,通过分析状态图的代数连通性,推导了线性价值函数近似误差的上界,并展示了表示学习管道中的端到端误差分解。

详情
AI中文摘要

在马尔可夫决策过程(MDPs)中学习紧凑的状态表示对于解决大规模强化学习(RL)问题中的维度灾难至关重要。现有方法通过构造状态表示为状态图拉普拉斯特征向量的线性组合,利用结构先验。当转移图未知或状态空间过大时,可通过样本轨迹直接估计图谱特征。本文证明了在学习的谱特征下线性价值函数近似误差的上界,并展示了该误差如何随状态图的代数连通性变化,从而将近似质量根植于MDP的拓扑结构中。进一步界定了由特征向量估计本身引入的误差,导致表示学习管道中的端到端误差分解。此外,尽管RL设置中的拉普拉斯算子表达式等价于现有方法,但其防止了一些常见的误解,并展示了文献中的示例。我们的结果适用于一般的(非均匀)策略,无需对诱导转移核的对称性做任何假设。我们通过在网格世界环境中进行数值模拟验证了理论发现。

英文摘要

Learning compact state representations in Markov Decision Processes (MDPs) has proven crucial for addressing the curse of dimensionality in large-scale reinforcement learning (RL) problems. Existing principled approaches leverage structural priors on the MDP by constructing state representations as linear combinations of the state-graph Laplacian eigenvectors. When the transition graph is unknown or the state space is prohibitively large, the graph spectral features can be estimated directly via sample trajectories. In this work, we prove an upper bound on the approximation error of linear value function approximation under the learned spectral features. We show how this error scales with the algebraic connectivity of the state-graph, grounding the approximation quality in the topological structure of the MDP. We further bound the error introduced by the eigenvector estimation itself, leading to an end-to-end error decomposition across the representation learning pipeline. Additionally, our expression of the Laplacian operator for the RL setting, although equivalent to existing ones, prevents some common misunderstandings, of which we show some examples from the literature. Our results hold for general (non-uniform) policies without any assumptions on the symmetry of the induced transition kernel. We validate our theoretical findings with numerical simulations on gridworld environments.

2603.14867 2026-06-11 cs.LG cs.AI cs.GT cs.MA 版本更新

Sample-Efficient Hypergradient Estimation for Decentralized Bi-Level Reinforcement Learning

用于去中心化双层强化学习的样本高效超梯度估计

Mikoto Kudo, Takumi Tanabe, Akifumi Wachi, Youhei Akimoto

AI总结 针对去中心化双层强化学习中领导者无法干预跟随者优化过程的问题,提出基于玻尔兹曼协方差技巧的超梯度估计方法,实现高维决策空间下的样本高效优化,并首次应用于双人马尔可夫博弈。

详情
Comments
29 pages. Extended version of the paper accepted to ICAPS 2026
AI中文摘要

许多战略决策问题,例如仓库机器人的环境设计,可以自然地表述为双层强化学习,其中领导者代理优化其目标,而跟随者解决一个以领导者决策为条件的马尔可夫决策过程。在许多情况下,当领导者无法干预跟随者的优化过程时,会出现一个基本挑战;它只能观察优化结果。我们通过推导领导者目标的超梯度(即考虑跟随者最优策略变化的领导者策略梯度)来解决这种去中心化设置。与先前基于超梯度的方法不同,这些方法需要大量数据来重复访问状态,或者依赖于梯度估计器,其复杂度可能随着领导者决策空间的高维性而显著增加,我们利用玻尔兹曼协方差技巧推导出一种替代的超梯度公式。这使得仅从交互样本中就能进行高效的超梯度估计,即使领导者的决策空间是高维的。此外,据我们所知,这是第一种能够在去中心化设置中实现基于超梯度的优化的双人马尔可夫博弈方法。实验突出了超梯度更新的影响,并展示了我们的方法在离散和连续状态任务中的有效性。

英文摘要

Many strategic decision-making problems, such as environment design for warehouse robots, can be naturally formulated as bi-level reinforcement learning (RL), where a leader agent optimizes its objective while a follower solves a Markov decision process (MDP) conditioned on the leader's decisions. In many situations, a fundamental challenge arises when the leader cannot intervene in the follower's optimization process; it can only observe the optimization outcome. We address this decentralized setting by deriving the hypergradient of the leader's objective, i.e., the gradient of the leader's strategy that accounts for changes in the follower's optimal policy. Unlike prior hypergradient-based methods that require extensive data for repeated state visits or rely on gradient estimators whose complexity can increase substantially with the high-dimensional leader's decision space, we leverage the Boltzmann covariance trick to derive an alternative hypergradient formulation. This enables efficient hypergradient estimation solely from interaction samples, even when the leader's decision space is high-dimensional. Additionally, to our knowledge, this is the first method that enables hypergradient-based optimization for 2-player Markov games in decentralized settings. Experiments highlight the impact of hypergradient updates and demonstrate our method's effectiveness in both discrete and continuous state tasks.

2604.13733 2026-06-11 cs.LG cs.AI cs.RO 版本更新

Vision-Language-Action Jump-Starting for Reinforcement Learning Robotic Agents

视觉-语言-动作跳跃启动用于强化学习机器人智能体

Angelo Moroncelli, Roberto Zanetti, Marco Maccarini, Loris Roveda

AI总结 提出VLAJS方法,通过稀疏的VLA高层动作建议引导PPO探索,结合方向性动作一致性正则化,提升强化学习在长时域操作任务中的样本效率,并在仿真和真实机器人上验证。

详情
Comments
ICRA 2026 Workshop on Reinforcement Learning in the Era of Imitation Learning
AI中文摘要

强化学习(RL)能够实现机器人操作的高频闭环控制,但由于探索效率低下和信用分配不佳,在稀疏或不完美奖励的长时域任务中难以扩展。视觉-语言-动作(VLA)模型利用大规模多模态预训练提供通用任务级推理,但当前限制阻碍其直接用于快速精确操作。本文提出视觉-语言-动作跳跃启动(VLAJS),一种将稀疏VLA引导与在线策略RL相结合的方法,以改善探索和学习效率。VLAJS将VLA视为高层动作建议的瞬态来源,偏置早期探索并改善信用分配,同时保留RL的高频状态基控制。我们的方法用方向性动作一致性正则化增强近端策略优化(PPO),在早期训练中软对齐RL智能体的动作与VLA引导,而不强制严格模仿、需要演示或依赖持续教师查询。VLA引导稀疏应用并随时间退火,使智能体在线适应并最终超越引导策略。我们在六个挑战性操作任务上评估VLAJS:仿真中的提升、拾取与放置、销钉重定向、销钉插入、戳和推,并在真实Franka Panda机器人上验证子集。VLAJS在样本效率上持续优于PPO和蒸馏式基线,在多个任务中将所需环境交互减少超过50%。真实世界实验展示了零样本仿真到真实迁移以及在杂乱、物体变化和外部扰动下的鲁棒执行。

英文摘要

Reinforcement learning (RL) enables high-frequency, closed-loop control for robotic manipulation, but scaling to long-horizon tasks with sparse or imperfect rewards remains difficult due to inefficient exploration and poor credit assignment. Vision-Language-Action (VLA) models leverage large-scale multimodal pretraining to provide generalist, task-level reasoning, but current limitations hinder their direct use in fast and precise manipulation. In this paper, we propose Vision-Language-Action Jump-Starting (VLAJS), a method that bridges sparse VLA guidance with on-policy RL to improve exploration and learning efficiency. VLAJS treats VLAs as transient sources of high-level action suggestions that bias early exploration and improve credit assignment, while preserving the high-frequency, state-based control of RL. Our approach augments Proximal Policy Optimization (PPO) with a directional action-consistency regularization that softly aligns the RL agent's actions with VLA guidance during early training, without enforcing strict imitation, requiring demonstrations, or relying on continuous teacher queries. VLA guidance is applied sparsely and annealed over time, allowing the agent to adapt online and ultimately surpass the guiding policy. We evaluate VLAJS on six challenging manipulation tasks: lifting, pick-and-place, peg reorientation, peg insertion, poking, and pushing in simulation, and validate a subset on a real Franka Panda robot. VLAJS consistently outperforms PPO and distillation-style baselines in sample efficiency, reducing required environment interactions by over 50% in several tasks. Real-world experiments demonstrate zero-shot sim-to-real transfer and robust execution under clutter, object variation, and external perturbations.

2605.03065 2026-06-11 cs.LG cs.RO 版本更新

OGPO: Sample Efficient Full-Finetuning of Generative Control Policies

OGPO:生成控制策略的样本高效全微调

Sarvesh Patil, Mitsuhiko Nakamoto, Manan Agarwal, Shashwat Saxena, Jesse Zhang, Giri Anantharaman, Cleah Winston, Chaoyi Pan, Douglas Chen, Nai-Chieh Huang, Zeynep Temel, Oliver Kroemer, Sergey Levine, Abhishek Gupta, Hongkai Dai, Paarth Shah, Max Simchowitz

AI总结 提出OGPO算法,通过离策略评论网络和修改的PPO目标,实现生成控制策略的样本高效微调,在多种操作任务上达到最优性能,并能在无专家数据下微调不良初始化的行为克隆策略。

详情
AI中文摘要

生成控制策略(GCPs),如基于扩散和基于流的控制策略,已成为机器人学习的有效参数化方法。本文介绍了离策略生成策略优化(OGPO),一种用于微调GCPs的样本高效算法,该算法维护离策略评论网络以最大化数据重用,并通过修改的PPO目标将策略梯度传播到策略的完整生成过程,使用评论网络作为终端奖励。OGPO在涵盖多任务设置、高精度插入和灵巧控制的操作任务上达到了最先进的性能。据我们所知,它也是唯一一种能够在在线回放缓冲区中无专家数据的情况下,将初始化不良的行为克隆策略微调到接近完全任务成功的方法,并且只需很少的任务特定超参数调整。通过广泛的实证研究,我们证明了OGPO在策略引导和残差学习方面显著优于替代方法,并确定了其性能背后的关键机制。我们进一步引入了实用的稳定技巧,包括成功缓冲区正则化、双边保守优势和Q方差减少,以减轻基于状态和基于像素的设置中的评论网络过度利用。除了提出OGPO,我们还对GCP微调进行了系统的实证研究,确定了控制成功离策略全策略改进的稳定机制和失败模式。

英文摘要

Generative control policies (GCPs), such as diffusion- and flow-based control policies, have emerged as effective parameterizations for robot learning. This work introduces Off-policy Generative Policy Optimization (OGPO), a sample-efficient algorithm for finetuning GCPs that maintains off-policy critic networks to maximize data reuse and propagates policy gradients through the full generative process of the policy via a modified PPO objective, using critics as the terminal reward. OGPO achieves state-of-the-art performance on manipulation tasks spanning multi-task settings, high-precision insertion, and dexterous control. To our knowledge, it is also the only method that can finetune poorly initialized behavior cloning policies to near-full task success with no expert data in the online replay buffer, and does so with little task-specific hyperparameter tuning. Through extensive empirical investigations, we demonstrate that OGPO drastically outperforms alternative methods on policy steering and learning residual corrections, and identify the key mechanisms behind its performance. We further introduce practical stabilization tricks, including success-buffer regularization, two-sided conservative advantages, and Q-variance reduction, to mitigate critic over-exploitation across state- and pixel-based settings. Beyond proposing OGPO, we conduct a systematic empirical study of GCP finetuning, identifying the stabilizing mechanisms and failure modes that govern successful off-policy full-policy improvement.

2606.10968 2026-06-11 cs.LG cs.AI 版本更新

Beyond Uniform Token-Level Trust Region in LLM Reinforcement Learning

超越大语言模型强化学习中的统一令牌级信任区域

Renjie Mao, Xiangxin Zhou, Lvfang Tao, Yixin Ding, Yu Shi, Yongguang Lin, Yuheng Wu, Honglin Zhu, Qian Qiu, Wenxi Zhu

发表机构 * Tencent Hunyuan(腾讯混元)

AI总结 针对PPO风格信任区域在自回归生成中的位置无关问题,提出CPPO方法,通过位置加权阈值和累积前缀预算动态调整令牌级约束,提升训练稳定性和推理准确性。

详情
Comments
Project Page: this https URL
AI中文摘要

具有可验证奖励的强化学习(RLVR)已成为提升大语言模型推理能力的标准方法。然而,现有的PPO风格信任区域机制通过在所有令牌上独立施加统一阈值,仍然是位置无关的。这种逐点处理方式在两个方面与自回归生成相冲突。首先,统一阈值忽略了自回归不对称性。早期阶段的偏差会产生累积的序列级漂移,导致静态阈值对早期发散约束不足,而对后期探索过度约束。其次,孤立地评估令牌级发散忽略了累积前缀漂移,无论条件历史已经偏离滚动策略多远,都给予相同的发散允许量。为解决这一局限性,我们提出了CPPO(累积前缀散度策略优化),这是一种令牌级掩码规则,通过两种耦合机制将更新与有限时域策略改进界对齐。首先,位置加权阈值对早期位置施加更严格的限制,因为这些位置的影响持续时间更长,同时放宽对后期令牌的约束。其次,累积前缀预算跟踪历史偏差,动态限制进一步的令牌级偏差,以防止沿前缀的复合错误。实验表明,CPPO在不同模型规模上增强了训练稳定性并显著提高了推理准确性。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become standard for improving LLM reasoning. However, existing PPO-style trust-region mechanisms remain position-agnostic by enforcing uniform thresholds across all tokens independently. This pointwise treatment conflicts with autoregressive generation in two critical ways. First, uniform thresholds ignore autoregressive asymmetry. Early-stage deviations produce compounding sequence-level drift, causing static thresholds to under-regulate early divergence and excessively constrain late-stage exploration. Second, evaluating token-level divergence in isolation overlooks cumulative prefix drift, granting the same divergence allowance regardless of how far the conditioning history has already deviated from the rollout policy. To address this limitation, we propose CPPO (Cumulative Prefix-divergence Policy Optimization), a token-level masking rule that aligns updates with a finite-horizon policy-improvement bound via two coupled mechanisms. First, a position-weighted threshold imposes stricter limits at early positions whose effects persist longer, relaxing constraints for late-stage tokens. Second, a cumulative prefix budget tracks historical deviations, dynamically restricting further token-level deviation to prevent compounding errors along the prefix. Empirically, CPPO enhances training stability and significantly improves reasoning accuracy across various model scales.

2606.11118 2026-06-11 cs.LG math.OC math.PR stat.AP stat.ML 版本更新

Data-Driven Dynamic Assortment in Online Platforms: Learning about Two Sides

在线平台中的数据驱动动态分类:学习双边信息

Rahul Roy, Nur Sunar, Jayashankar M. Swaminathan

AI总结 针对双边服务平台,提出一种数据驱动算法,在未知顾客和卖家选择参数的情况下动态优化商品分类,并证明其遗憾值随时间呈多对数增长且达到最优速率。

详情
AI中文摘要

我们研究了一个在离散时间环境下,具有不完全信息和异质顾客的双边服务平台上的动态分类问题。在每个周期,一位顾客到达寻求服务,平台选择一组卖家进行展示。顾客根据多项逻辑选择模型,最多向分类中的一个卖家提出交易。经过固定数量的周期后,卖家审查收到的提议,并根据另一个多项逻辑选择模型,每位卖家最多选择一个顾客,然后循环重复。一个关键挑战是平台事先不知道顾客或卖家的选择模型参数。据我们所知,这是首次研究双边选择参数均未知的动态分类问题。我们开发了一种数据驱动算法,该算法在优化平台目标的同时学习这些参数。我们使用遗憾值来评估性能,该遗憾值衡量相对于一个预知所有参数和顾客到达时间的先知基准的收入损失。我们证明该算法的最坏情况遗憾值随时间呈多对数增长,并推导出匹配的下界,从而确定其速率最优性。

英文摘要

We study a dynamic assortment problem on a two-sided service platform with incomplete information and heterogeneous customers in a discrete-time setting. In each period, a customer arrives seeking service, and the platform chooses an assortment of sellers to display. The customer then proposes a transaction to at most one seller in the assortment according to a multinomial logit choice model. After a fixed number of periods, sellers review the proposals they have received and each chooses at most one customer according to another multinomial logit choice model, after which the cycle repeats. A key challenge is that the platform does not know the choice-model parameters of either customers or sellers in advance. To our knowledge, this is the first study of a dynamic assortment problem in which both sides' choice parameters are unknown. We develop a data-driven algorithm that learns these parameters while optimizing the platform's objective over time. We evaluate performance using regret, which measures revenue loss relative to a clairvoyant benchmark that knows all parameters and customer arrivals in advance. We show that the algorithm's worst-case regret grows polylogarithmically over time, and we derive a matching lower bound, establishing its rate optimality.
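The multinomial logit (MNL) choice step described above can be sketched as follows. This is an illustration under stated assumptions, not the paper's algorithm: the utilities are taken as known (the paper treats them as unknown and learned), the no-proposal option is normalized to utility 0, and the function name and the seller labels in the usage note are hypothetical.

```python
from math import exp


def mnl_choice_probabilities(
    utilities: dict[str, float],
) -> tuple[dict[str, float], float]:
    """Return per-seller proposal probabilities and the no-proposal probability.

    Assumption: the outside (no-proposal) option has utility 0, so its
    weight is exp(0) = 1. This matches the "at most one seller" setting.
    """
    weights = {seller: exp(u) for seller, u in utilities.items()}
    denominator = 1.0 + sum(weights.values())
    probabilities = {seller: w / denominator for seller, w in weights.items()}
    return probabilities, 1.0 / denominator
```

For example, `mnl_choice_probabilities({"a": 0.0, "b": 0.0})` assigns probability 1/3 to each seller and 1/3 to making no proposal. Adding a seller to the displayed assortment always lowers every other seller's share, which is the trade-off an assortment policy has to manage.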

2307.01472 2026-06-11 cs.AI cs.LG cs.MA 版本更新

Improving Generalization and Data Efficiency with Diffusion in Offline Multi-agent RL

通过扩散模型提升离线多智能体强化学习的泛化能力与数据效率

Zhuoran Li, Ling Pan, Jiatai Huang, Longbo Huang

AI总结 提出扩散离线多智能体模型(DOM2),利用扩散模型增强策略表达力和多样性,结合轨迹数据重加权,在离线MARL中显著提升性能、泛化能力和数据效率。

详情
AI中文摘要

我们提出了一种新颖的扩散离线多智能体模型(DOM2),用于离线多智能体强化学习(MARL)。与主要依赖策略设计中保守性的现有算法不同,DOM2基于扩散模型增强了策略的表达力和多样性。具体来说,我们将扩散模型融入策略网络,并在训练中提出了一种基于轨迹的数据重加权方案。这些关键要素显著提高了算法对环境变化的鲁棒性,并在性能、泛化和数据效率方面取得了显著提升。我们的大量实验结果表明,DOM2在所有多智能体粒子和多智能体MuJoCo环境中均优于现有最先进方法,并且由于其高表达力和多样性,在迁移环境中(在评估的30个设置中有28个)泛化能力显著更强。此外,DOM2具有超高的数据效率,与现有算法相比,实现相同性能所需数据不超过5%(数据效率提升20倍)。

英文摘要

We present a novel Diffusion Offline Multi-agent Model (DOM2) for offline Multi-Agent Reinforcement Learning (MARL). Different from existing algorithms that rely mainly on conservatism in policy design, DOM2 enhances policy expressiveness and diversity based on a diffusion model. Specifically, we incorporate a diffusion model into the policy network and propose a trajectory-based data-reweighting scheme in training. These key ingredients significantly improve algorithm robustness against environment changes and achieve significant improvements in performance, generalization and data-efficiency. Our extensive experimental results demonstrate that DOM2 outperforms existing state-of-the-art methods in all multi-agent particle and multi-agent MuJoCo environments, and generalizes significantly better to shifted environments (in $28$ out of $30$ settings evaluated) thanks to its high expressiveness and diversity. Moreover, DOM2 is ultra data efficient and requires no more than $5\%$ of the data to achieve the same performance as existing algorithms (a $20\times$ improvement in data efficiency).

2505.03296 2026-06-11 cs.RO cs.AI cs.LG 版本更新

The Unreasonable Effectiveness of Discrete-Time Gaussian Process Mixtures for Robot Policy Learning

离散时间高斯过程混合在机器人策略学习中的惊人有效性

Jan Ole von Hartz, Adrian Röfer, Joschka Boedecker, Abhinav Valada

AI总结 提出MiDiGap方法,利用少量演示和相机观测,通过离散时间高斯过程混合实现机器人操作策略的灵活表示与模仿学习,在长时域、高约束、动态和多模态任务上取得SOTA性能,并支持推理时引导。

详情
Comments
Submitted for publication to IEEE Transactions on Robotics
AI中文摘要

我们提出了离散时间高斯过程混合(MiDiGap),一种用于机器人操作中灵活策略表示和模仿学习的新方法。MiDiGap仅使用相机观测,即可从少至五次演示中学习,并在一系列具有挑战性的任务中泛化。它在长时域行为(如泡咖啡)、高约束运动(如开门)、动态动作(如用铲子舀取)和多模态任务(如挂杯子)上表现出色。MiDiGap在CPU上不到一分钟即可学习这些任务,并线性扩展到大型数据集。我们还开发了一套丰富的推理时引导工具,利用碰撞信号和机器人运动学约束等证据。这种引导实现了新颖的泛化能力,包括避障和跨本体策略迁移。MiDiGap在多样化的少样本操作基准上达到了最先进的性能。在受约束的RLBench任务上,它将策略成功率提高了76个百分点,并将轨迹成本降低了67%。在多模态任务上,它将策略成功率提高了48个百分点,并将样本效率提高了20倍。在跨本体迁移中,策略成功率提高了一倍以上。我们在以下网址公开了代码:this https URL。

英文摘要

We present Mixture of Discrete-time Gaussian Processes (MiDiGap), a novel approach for flexible policy representation and imitation learning in robot manipulation. MiDiGap enables learning from as few as five demonstrations using only camera observations and generalizes across a wide range of challenging tasks. It excels at long-horizon behaviors such as making coffee, highly constrained motions such as opening doors, dynamic actions such as scooping with a spatula, and multimodal tasks such as hanging a mug. MiDiGap learns these tasks on a CPU in less than a minute and scales linearly to large datasets. We also develop a rich suite of tools for inference-time steering using evidence such as collision signals and robot kinematic constraints. This steering enables novel generalization capabilities, including obstacle avoidance and cross-embodiment policy transfer. MiDiGap achieves state-of-the-art performance on diverse few-shot manipulation benchmarks. On constrained RLBench tasks, it improves policy success by 76 percentage points and reduces trajectory cost by 67%. On multimodal tasks, it improves policy success by 48 percentage points and increases sample efficiency by a factor of 20. In cross-embodiment transfer, it more than doubles policy success. We make the code publicly available at this https URL.

2511.19314 2026-06-11 cs.AI cs.CL cs.LG 版本更新

PRInTS: Reward Modeling for Long-Horizon Information Seeking

PRInTS:面向长程信息检索的奖励建模

Jaewoo Lee, Archiki Prasad, Justin Chih-Yao Chen, Zaid Khan, Elias Stengel-Eskin, Mohit Bansal

AI总结 提出PRInTS生成式过程奖励模型,通过密集评分和轨迹摘要提升长程信息检索中工具交互与推理能力,在多个基准上超越前沿模型。

详情
Comments
ACL 2026, 19 pages, code: this https URL
AI中文摘要

信息检索是AI智能体的核心能力,要求它们在整个长轨迹中收集和推理工具生成的信息。然而,这种多步骤信息检索任务对于基于语言模型的智能体仍然具有挑战性。虽然过程奖励模型(PRM)可以通过在测试时对候选步骤进行排序来指导智能体,但现有的PRM——设计用于具有二元判断的短程推理——无法捕捉信息检索步骤的更丰富维度,例如工具交互和对工具输出的推理,也无法处理长程任务中快速增长的上下文。为了解决这些限制,我们引入了PRInTS,一种具有双重能力的生成式PRM:(1)基于PRM对步骤质量多个维度(例如,工具输出的解释、工具调用的信息量)的推理进行密集评分,以及(2)轨迹摘要,在压缩不断增长的上下文的同时保留步骤评估所需的基本信息。在FRAMES、GAIA(级别1-3)和WebWalkerQA(简单-困难)基准上对多个模型的广泛评估表明,使用PRInTS进行最佳n采样增强了开源模型以及专门智能体的信息检索能力,以更小的骨干智能体匹配或超越前沿模型,并优于其他强奖励建模基线。

英文摘要

Information-seeking is a core capability for AI agents, requiring them to gather and reason over tool-generated information across long trajectories. However, such multi-step information-seeking tasks remain challenging for agents backed by language models. While process reward models (PRMs) can guide agents by ranking candidate steps at test-time, existing PRMs - designed for short reasoning with binary judgment - cannot capture richer dimensions of information-seeking steps, such as tool interactions and reasoning over tool outputs, nor handle the rapidly growing context in long-horizon tasks. To address these limitations, we introduce PRInTS, a generative PRM trained with dual capabilities: (1) dense scoring based on the PRM's reasoning across multiple dimensions of step quality (e.g., interpretation of tool outputs, tool call informativeness) and (2) trajectory summarization that compresses the growing context while preserving essential information for step evaluation. Extensive evaluations across FRAMES, GAIA (levels 1-3), and WebWalkerQA (easy-hard) benchmarks on multiple models reveal that best-of-n sampling with PRInTS enhances information-seeking in open-source models as well as specialized agents, matching or surpassing frontier models with a much smaller backbone agent and outperforming other strong reward modeling baselines.
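The best-of-n selection step mentioned above can be sketched in a few lines. This is a generic illustration, assuming a scoring function that returns a scalar per candidate step. `toy_score` is a hypothetical stand-in invented for this example: it is not PRInTS, which is a trained generative reward model that also summarizes the trajectory.

```python
from typing import Callable, Sequence


def best_of_n(candidates: Sequence[str], score: Callable[[str], float]) -> str:
    """Return the candidate step with the highest score (first one on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate step")
    return max(candidates, key=score)


def toy_score(step: str) -> float:
    # Hypothetical scorer: favors steps that call a search tool, with a
    # small bonus for longer (more specific) steps. A real PRM replaces this.
    return float("search" in step) + 0.01 * len(step)
```

Given the candidates `["guess the answer", "search for the source"]`, `best_of_n` with `toy_score` selects the search step. In an agent loop this selection would run once per step, with the scorer seeing the (summarized) trajectory so far.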

2602.09591 2026-06-11 cs.CL cs.AI cs.LG 版本更新

On the Optimal Reasoning Length for RL-Trained Language Models

关于RL训练的语言模型的最优推理长度

Daisuke Nohara, Taishi Nakamura, Rio Yokota

AI总结 研究强化学习训练的语言模型中推理长度与准确率的非单调关系,发现存在最优中间长度,并通过模式准确率分析揭示其成因。

详情
Comments
18 pages, 12 figures
AI中文摘要

强化学习显著提高了大型语言模型的推理能力,但也倾向于延长思维链输出并增加计算成本。尽管已经提出了长度控制方法,但它们所引发的长度-准确率关系仍不清楚。我们在受控设置下,在多个基础模型上使用几种长度控制方法训练策略,发现在数学推理和代码生成中,准确率随输出长度呈非单调变化,在中间值达到峰值。然而,即使在样本准确率趋于平稳或下降的情况下,模式准确率仍随长度持续提高,这表明非单调的长度-准确率关系是由围绕越来越正确的中心的分散性驱动的。

英文摘要

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost. Although length-control methods have been proposed, the length-accuracy relationship they induce remains unclear. We train policies with several length-control methods on multiple base models in a controlled setup and find that, across both mathematical reasoning and code generation, accuracy is non-monotonic in output length, peaking at an intermediate value. Mode accuracy, however, continues to improve with length even in settings where sample accuracy plateaus or declines, indicating that the non-monotonic length-accuracy relationship is driven by dispersion around an increasingly correct center.
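The contrast between sample accuracy and mode accuracy can be illustrated with a small sketch. It assumes that mode accuracy means the accuracy of the most frequent final answer per problem, which is one natural reading of the abstract rather than a definition taken from the paper. The function names and the data layout (a list of sampled answers per problem) are illustrative.

```python
from collections import Counter
from typing import Sequence


def sample_accuracy(
    answers: Sequence[Sequence[str]], references: Sequence[str]
) -> float:
    """Fraction of individual sampled answers that match the reference."""
    correct = sum(a == ref for row, ref in zip(answers, references) for a in row)
    total = sum(len(row) for row in answers)
    return correct / total


def mode_accuracy(
    answers: Sequence[Sequence[str]], references: Sequence[str]
) -> float:
    """Fraction of problems whose most frequent answer matches the reference.

    Assumption: ties are broken in favor of the answer sampled first.
    """
    hits = 0
    for row, ref in zip(answers, references):
        mode, _count = Counter(row).most_common(1)[0]
        hits += mode == ref
    return hits / len(references)
```

With answers `[["4", "4", "5"], ["8", "8", "7"], ["1", "2", "2"]]` and references `["4", "8", "2"]`, mode accuracy is 1.0 while sample accuracy is only 6/9. The center of each answer distribution is correct, and the gap comes entirely from dispersion around it.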

2603.14762 2026-06-11 math.OC cs.LG eess.SY 版本更新

Online Learning for Supervisory Switching Control

在线学习用于监督切换控制

Haoyuan Sun, Ali Jadbabaie

AI总结 研究在线学习在部分观测线性动态系统中监督切换控制的问题,提出非渐近分析方法,结合多臂老虎机算法,实现稳定控制器识别与系统辨识。

详情
AI中文摘要

我们研究了部分观测线性动态系统中的监督切换控制。目标是通过周期性地从一组N个候选控制器中进行选择,为未知系统识别并部署合适的控制器,其中一些候选控制器可能使底层系统失稳。经典的基于估计器的监督控制保证渐近稳定性,但缺乏定量的有限时间性能界。相反,当前在线学习和系统辨识中的非渐近方法需要系统稳定性等限制性假设,这些假设与控制设置不兼容,从而排除了测试可能不稳定的控制器的可能性。为弥合这一差距,我们提出了一种新颖的非渐近监督控制分析,将多臂老虎机算法适应到控制理论设置中。所提出的数据驱动算法通过评分标准评估候选控制器,利用系统可观测性来隔离状态历史的影响,从而既能检测使系统失稳的控制器,又能实现准确的系统辨识。我们提出了两种算法变体,具有无维度、有限时间保证,其中每个算法在O(N log²N)步内识别匹配控制器,同时相对于系统扰动实现有限的L₂增益。

英文摘要

We study supervisory switching control for partially-observed linear dynamical systems. The objective is to identify and deploy a suitable controller for the unknown system by periodically selecting among a collection of $N$ candidate controllers, some of which may destabilize the underlying system. While classical estimator-based supervisory control guarantees asymptotic stability, it lacks quantitative finite-time performance bounds. Conversely, current non-asymptotic methods in both online learning and system identification require restrictive assumptions that are incompatible in a control setting, such as system stability, which preclude testing potentially unstable controllers. To bridge this gap, we propose a novel, non-asymptotic analysis of supervisory control that adapts multi-armed bandit algorithms to a control-theoretic setting. The proposed data-driven algorithm evaluates candidate controllers via scoring criteria that leverage system observability to isolate the effects of state history, enabling both detection of destabilizing controllers and accurate system identification. We present two algorithmic variants with dimension-free, finite-time guarantees, where each identifies the matching controller in $O(N \log^2 N)$ steps, while simultaneously achieving finite $L_2$-gain with respect to system disturbances.

2606.05922 2026-06-11 cs.AI cs.CL cs.LG 版本更新

Evolving Agents in the Dark: Retrospective Harness Optimization via Self-Preference

在黑暗中进化智能体:基于自我偏好的回顾性工具集优化

Wenbo Pan, Shujie Liu, Chin-Yew Lin, Jingying Zeng, Xianfeng Tang, Xiangyang Zhou, Yan Lu, Xiaohua Jia

AI总结 提出一种自监督方法RHO,利用历史轨迹回滚和自偏好选择优化智能体工具集,无需真实标签,在SWE-Bench Pro上通过单轮优化将通过率从59%提升至78%。

详情
Comments
Code: this https URL; Project website: this https URL
AI中文摘要

AI智能体依赖于技能、工具和工作流程的整合(称为工具集)来解决复杂问题。持续改进这一工具集对于适应新任务至关重要。然而,现有的优化方法通常需要真实验证集,但在实际部署场景中获取此类标注数据非常困难。为解决这一问题,我们提出回顾性工具优化(RHO),一种仅利用过去轨迹的自监督方法。具体而言,RHO从历史轨迹中选择一个多样化的困难任务核心集,并并行重新求解。智能体通过自我验证和自我一致性分析这些回滚,然后生成候选工具集更新,并通过自身的成对自我偏好选择最有效的更新。我们在三个不同领域(涵盖软件工程、技术工作和知识工作)上评估RHO。值得注意的是,单轮优化无需任何外部评分即可将SWE-Bench Pro上的通过率从59%提升至78%。此外,我们的分析表明RHO有效针对先前的失败模式。因此,优化后的工具集改变了智能体的行为模式,并在长周期会话中保持更高的准确性。

英文摘要

AI agents rely on a harness of skills, tools, and workflows to solve complex problems. Continually improving this harness is essential for adapting to new tasks. However, existing optimization methods typically require ground-truth validation sets, yet such labeled data is difficult to acquire in practical deployment settings. To address this problem, we introduce Retrospective Harness Optimization (RHO), a self-supervised method that optimizes the agent harness using only past trajectories. Specifically, RHO selects a diverse coreset of challenging tasks from past trajectories and re-solves them in parallel. The agent analyzes these rollouts using self-validation and self-consistency, then generates candidate harness updates and selects the most effective one by its own pairwise self-preference. We evaluate RHO across three diverse domains, spanning software engineering, technical work, and knowledge work. Notably, a single optimization round improves the pass rate on SWE-Bench Pro from 59% to 78% without any external grading. Furthermore, our analysis demonstrates that RHO effectively targets prior failure modes. As a result, the optimized harness alters the agent's behavior patterns and sustains higher accuracy during long-horizon sessions.

4. 生成模型与概率建模 18 篇

2606.11243 2026-06-11 cs.LG cs.CL 新提交

ProHiFlo: Hierarchical Flow Matching with Functional Guidance for De Novo Protein Generation

ProHiFlo: 具有功能引导的分层流匹配用于从头蛋白质生成

Chuanzhen Wang, Meade Cleti, Pete Jano

发表机构 * Arizona State University(亚利桑那州立大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校) Tongji University(同济大学)

AI总结 提出ProHiFlo,一种分层流匹配框架,通过粗到细生成、功能引导和自适应SE(3)等变架构,实现高效、准确的从头蛋白质生成,在酶活性位点支架任务中成功率58.9%。

详情
Comments
23 pages
AI中文摘要

从头蛋白质生成在治疗设计、酶工程和合成生物学中具有变革潜力。尽管基于扩散和流匹配的方法已取得进展,但它们通常在单一分辨率下操作,且缺乏整合功能约束的机制。我们提出ProHiFlo,一种具有三项创新的分层流匹配框架:(1) 粗到细生成,先建模主链几何再细化到全原子坐标,在保持精度的同时降低计算成本;(2) 功能引导,利用预训练预测器引导生成朝向所需性质,无需重新训练;(3) 自适应SE(3)等变架构,用于高效多尺度处理。在无条件生成、基序支架和功能设计上的实验表明,在需要少4倍采样步数的情况下实现了最先进的性能。在酶活性位点支架任务中,ProHiFlo达到58.9%的成功率,而RFDiffusion为41.2%。

英文摘要

De novo protein generation has transformative potential in therapeutic design, enzyme engineering, and synthetic biology. While diffusion-based and flow matching approaches have achieved progress, they typically operate at a single resolution and lack mechanisms for incorporating functional constraints. We introduce ProHiFlo, a hierarchical flow matching framework with three innovations: (1) coarse-to-fine generation that models backbone geometry before refining to all-atom coordinates, reducing computational cost while maintaining accuracy; (2) functional guidance leveraging pretrained predictors to steer generation toward desired properties without retraining; (3) adaptive SE(3)-equivariant architecture for efficient multi-scale processing. Experiments on unconditional generation, motif scaffolding, and functional design demonstrate state-of-the-art performance while requiring 4× fewer sampling steps. On enzyme active site scaffolding, ProHiFlo achieves a 58.9% success rate compared to 41.2% for RFDiffusion.

2606.11247 2026-06-11 cs.LG cs.AI cs.AR 新提交

Physics-informed generative AI for semiconductor manufacturing: Enforcing hard physical constraints in generative models by construction

物理信息驱动的生成式AI在半导体制造中的应用:通过构造强制生成模型中的硬物理约束

Yaser Mike Banad, Sarah Sharif

AI总结 针对半导体制造中生成模型必须满足硬物理约束的问题,本文提出通过构造集成物理信息(如物理信息扩散、PDE约束变分模型等)来强制约束,而非事后过滤,并给出四种集成模式和未来研究方向。

详情
AI中文摘要

生成模型越来越多地被用于为物理系统提出设计、数据和控制动作,然而许多此类系统受硬物理约束而非感知合理性支配。半导体制造提供了一个严苛的测试案例:生成的掩模、布局、合成缺陷数据和工艺配方必须遵守光刻、传输、反应和器件物理约束,因为物理无效的样本不仅质量低劣,而且无法使用。本文认为,半导体制造揭示了一个更广泛的计算科学挑战,即用于受约束物理领域的生成式AI必须通过构造实现物理信息驱动,而非仅通过事后过滤来纠正。我们调查了新兴的架构工具包,包括物理信息扩散、PDE约束变分模型、神经算子先验和守恒律尊重生成网络,并展示了它如何与可微分光刻、TCAD、工艺仿真和自主实验相联系。我们识别了生成模型与基于物理的模拟器之间的四种集成模式,并提出了一个以物理保真度基准、可微分模拟器基础设施以及面向物理设计和制造的多模态基础模型为中心的研究议程。核心主张是分析性的而非修辞性的:在物理有效性是成功的关键标准的情况下,通过构造强制约束的架构应被期望优于事后过滤的架构,而晶圆厂正是这种区别最鲜明的环境。

英文摘要

Generative models are increasingly used to propose designs, data, and control actions for physical systems, yet many such systems are governed by hard physical constraints rather than by perceptual plausibility. Semiconductor manufacturing provides a demanding test case: generated masks, layouts, synthetic defect data, and process recipes must obey lithography, transport, reaction, and device-physics constraints, because physically invalid samples are not merely low quality but unusable. This Perspective argues that semiconductor manufacturing exposes a broader computational-science challenge, namely that generative AI for constrained physical domains must be physics-informed by construction, not corrected only through post-hoc filtering. We survey the emerging architectural toolkit, including physics-informed diffusion, PDE-constrained variational models, neural-operator priors, and conservation-law-respecting generative networks, and show how it connects to differentiable lithography, TCAD, process simulation, and autonomous experimentation. We identify four integration patterns between generative models and physics-based simulators, and we propose a research agenda centered on physics-fidelity benchmarks, differentiable simulator infrastructure, and multimodal foundation models for physical design and manufacturing. The central claim is analytical rather than rhetorical: where physical validity is the binding criterion of success, architectures that enforce it by construction should be expected to outperform those that filter for it after the fact, and the fab is the setting where this distinction is sharpest.

2606.11277 2026-06-11 cs.LG physics.comp-ph 新提交

Least-Action-Guided Diffusion for Physical Extrapolation

最小作用量引导扩散用于物理外推

Zhongxin Yang, Yuanwei Bin, Xiang I.A. Yang, Shiyi Chen

发表机构 * College of Engineering, Peking University(北京大学工学院) Ningbo Institute for Digital Twin, Eastern Institute of Technology(东方理工宁波数字孪生研究院) Eastern Institute for Advanced Study, Eastern Institute of Technology(东方理工高等研究院) Shenzhen Tenfong Technology Co., Ltd.(深圳腾方科技有限公司) Mechanical Engineering, The Pennsylvania State University(宾夕法尼亚州立大学机械工程系)

AI总结 提出最小作用量引导扩散(LAPG)框架,通过将最小作用量原理转化为可微的推理时校正机制,在时间、参数和几何外推中保持物理一致性,优于训练时物理信息基线。

详情
AI中文摘要

可靠的外推仍然是计算物理学中生成模型的核心挑战,因为模型在有限的时间、参数或几何范围内训练,可能会在训练分布之外产生物理上不一致的预测。我们引入了最小作用量引导扩散(LAPG),这是一个在推理过程中促进物理一致性而非仅依赖训练时施加约束的框架。该方法结合了条件得分扩散模型与作用量导出的物理引导得分。在第一阶段,学习的得分模型生成一个分布内的提议;在第二阶段,基于作用量的变分先验将该提议向目标分布外条件细化。这一公式将最小作用量原理转化为可微的推理时校正机制,并提供了对通常需要经验损失平衡的点态残差惩罚的替代方案。我们在代表性的常微分和偏微分方程系统上评估了LAPG,包括自由落体、保守和耗散弹簧-质量动力学、相互作用点涡以及参数化翼型上的势流。在时间、参数和几何外推测试中,与训练时物理信息基线相比,LAPG减少了相位漂移,保持了耗散衰减,捕捉了涡旋运动,并改善了翼型流动的升力响应。

英文摘要

Reliable extrapolation remains a central challenge for generative models in computational physics, because models trained over finite ranges of time, parameters, or geometries may produce physically inconsistent predictions outside the training distribution. We introduce least-action-principle-guided diffusion (LAPG), a framework that promotes physical consistency during inference rather than relying solely on constraints imposed during training. The method combines a conditional score-based diffusion model with an action-derived physical guidance score. In the first stage, the learned score model generates an in-distribution proposal; in the second, an action-based variational prior refines this proposal toward the target out-of-distribution condition. This formulation turns the principle of least action into a differentiable inference-time correction mechanism and provides an alternative to pointwise residual penalties that often require empirical loss balancing. We evaluate LAPG on representative ordinary- and partial-differential-equation systems, including free fall, conservative and dissipative spring-mass dynamics, interacting point vortices, and potential flow over parameterized airfoils. In temporal, parameter, and geometric extrapolation tests, LAPG reduces phase drift, preserves dissipative decay, captures vortex motion, and improves the lift response of airfoil flows compared with training-time physics-informed baselines.
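
To illustrate how an action can act as a differentiable objective that refines a proposed trajectory, the sketch below discretizes the action for free fall and runs plain gradient descent on the interior points of a path with fixed endpoints. This is not the LAPG method: there is no diffusion model or guidance score here, and the function names, step size, and grid are assumptions chosen only to show the idea on the simplest system the abstract mentions.

```python
from typing import List

G = 9.81  # gravitational acceleration in m/s^2


def discrete_action(path: List[float], dt: float, mass: float = 1.0) -> float:
    """Sum of (kinetic - potential) * dt over the segments of the path."""
    action = 0.0
    for i in range(len(path) - 1):
        velocity = (path[i + 1] - path[i]) / dt
        action += (0.5 * mass * velocity ** 2 - mass * G * path[i]) * dt
    return action


def action_gradient(path: List[float], dt: float, mass: float = 1.0) -> List[float]:
    """Gradient of the discrete action w.r.t. interior points (endpoints fixed)."""
    grad = [0.0] * len(path)
    for i in range(1, len(path) - 1):
        grad[i] = mass * (2 * path[i] - path[i - 1] - path[i + 1]) / dt - mass * G * dt
    return grad


def refine(path: List[float], dt: float, steps: int = 20000, lr: float = 0.01) -> List[float]:
    """Gradient descent on the action, keeping both endpoints unchanged."""
    path = list(path)
    for _ in range(steps):
        grad = action_gradient(path, dt)
        for i in range(1, len(path) - 1):
            path[i] -= lr * grad[i]
    return path


# A flat "proposal" that starts and ends at height 0 over one second.
total_time = 1.0
n_points = 21
dt = total_time / (n_points - 1)
proposal = [0.0] * n_points
refined = refine(proposal, dt)

# Analytic free-fall arc through the same endpoints, for comparison.
expected = [0.5 * G * (i * dt) * (total_time - i * dt) for i in range(n_points)]
max_error = max(abs(a - b) for a, b in zip(refined, expected))
```

Setting the gradient to zero gives the discrete equation of motion, so the refined path approaches the parabolic arc even though the starting proposal was physically wrong.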

2606.11286 2026-06-11 cs.LG cs.AI 新提交

FreeBridge: Variational Schrödinger Bridges for Cellular Transition Dynamics

FreeBridge: 用于细胞转变动力学的变分薛定谔桥

Xurui Wang, Qin Ren, Jun Ma, Haibin Ling, Chenyu You

发表机构 * Stony Brook University(石溪大学) University of Toronto(多伦多大学) University Health Network(大学健康网络)

AI总结 针对高内涵成像中细胞扰动建模的端点监督问题,提出FreeBridge方法,通过变分薛定谔桥在固定细胞流形上学习随机传输,并利用经验潜在支持正则化约束中间路径,在保持端点保真度的同时减少中间支持违规。

详情
Comments
Accepted to MICCAI 2026 (early accept). Project page: this https URL
AI中文摘要

高内涵成像实验量化细胞对化学和遗传扰动的反应,但由于细胞在采集时被化学固定,单个细胞的连续轨迹无法观测。因此,扰动建模简化为推断仅在对照和处理群体之间观察到的随机传输,这些群体作为单独的边际分布。虽然最近的生成模型实现了强端点对齐,但边界一致性并不决定中间演化:多个随机过程可能连接相同的边际分布,同时穿过观察到的单细胞形态不支持的区域。我们引入了 FreeBridge,一种在仅端点监督下进行单细胞转变建模的薛定谔桥公式。FreeBridge 将原子状态定义为实例分割的单细胞表示,建立固定的细胞流形,并通过经验潜在支持正则化学习在此几何结构内约束的随机传输。在 BBBC021、RxRx1 和 JUMP 数据集上,FreeBridge 在统一评估协议下保持竞争性或改进的端点保真度和作用机制保留;在 BBBC021 上,它进一步减少了中间支持违规。这些发现强调了几何基础对于生物学可解释的扰动动力学的重要性。项目页面:此 https URL。

英文摘要

High-content imaging assays quantify cellular responses to chemical and genetic perturbations, yet continuous trajectories of individual cells are unobservable because cells are chemically fixed at acquisition. Perturbation modeling therefore reduces to inferring stochastic transport between control and treated populations observed only as separate marginals. While recent generative models achieve strong end-point alignment, boundary consistency does not determine intermediate evolution: multiple stochastic processes may connect identical marginals while traversing regions unsupported by observed single-cell morphologies. We introduce FreeBridge, a Schrödinger Bridge formulation for single-cell transition modeling under endpoint-only supervision. FreeBridge defines atomic states as instance-segmented single-cell representations, establishing a fixed cellular manifold, and learns stochastic transport constrained within this geometry via empirical latent support regularization. Across BBBC021, RxRx1, and JUMP, FreeBridge maintains competitive or improved endpoint fidelity and mechanism-of-action retention under a unified evaluation protocol; on BBBC021, it further reduces intermediate support violations. These findings highlight the importance of geometric grounding for biologically interpretable perturbation dynamics. Project page: this https URL.

2606.11691 2026-06-11 cs.LG physics.flu-dyn 新提交

Spectrally Regularized Latent Flow Matching for Turbulence Generation

谱正则化潜流匹配用于湍流生成

Khalid Rafiq, Aditya G. Nair

发表机构 * Department of Mechanical Engineering, University of Nevada, Reno(内华达大学里诺分校机械工程系)

AI总结 针对潜扩散和流匹配模型在湍流生成中低估耗散区振幅的问题,提出谱正则化潜流匹配框架,通过区域加权对数谱目标将深度耗散保留谱功率从25%提升至94%,并显著改善采样成本-保真度权衡。

详情
Comments
Accepted at the AI4Physics Workshop at ICML 2026. OpenReview: this https URL
AI中文摘要

潜扩散和流匹配已成为合成湍流生成的主要方法,但它们系统性地低估了耗散范围的振幅。我们引入了一个潜流匹配框架,其中包含一个直接针对此失效模式的谱正则化压缩阶段。在Re_f ≈ 2250的256^2 DNS数据集上,将MSE训练的VAE替换为区域加权对数谱目标,在重建中将深度耗散保留谱功率从25%提升至94%,在无条件生成中从20%提升至79%。改进的潜表示还产生了显著更好的采样成本-保真度权衡:MSE训练的潜空间在DD偏差-0.70附近施加了一个基本质量上限,任何积分器或步数都无法克服,而谱正则化的潜空间在仅20次函数评估时就达到了DD偏差-0.117。从机制上讲,编码器-解码器交换实验表明,改进主要由编码器诱导的潜重组驱动,而非解码器容量;而支持-振幅分解揭示,MSE训练的模型表现为保守抑制模型,通过衰减间歇性高波数结构来最小化逐点误差。两种管道都恢复了二阶结构函数和S_3的正确符号,表明在没有显式监督的情况下正确的级联方向。S_3幅度上的一个小残余差距表明,相位相干三元组组织仍然是未来生成湍流模型中振幅保真度的补充轴。

英文摘要

Latent diffusion and flow matching have emerged as leading approaches for synthetic turbulence generation, yet they systematically under-represent dissipation-range amplitudes. We introduce a latent flow matching framework with a spectrally regularized compression stage that directly targets this failure mode. On a $256^2$ DNS dataset at $Re_f \approx 2250$, replacing an MSE-trained VAE with a zone-weighted log-spectral objective raises deep-dissipation retained spectral power from 25% to 94% in reconstruction and from 20% to 79% in unconditional generation. The improved latent representation also yields a substantially better sampling cost-fidelity tradeoff: the MSE-trained latent space imposes a fundamental quality ceiling near DD bias -0.70 that no integrator or step-count can overcome, while the spectrally regularized latent space reaches DD bias -0.117 at just 20 function evaluations. Mechanistically, encoder-decoder swap experiments show that the improvement is driven primarily by encoder-induced latent reorganization rather than decoder capacity, while a support-amplitude decomposition reveals that MSE-trained models behave as conservative suppression models, minimizing pointwise error by attenuating intermittent high-wavenumber structure. Both pipelines recover the second-order structure function and the correct sign of $S_3$, indicating the correct cascade direction without explicit supervision. A small residual gap in the magnitude of $S_3$ suggests that phase-coherent triadic organization remains a complementary axis to amplitude fidelity for future generative turbulence models.

2606.11833 2026-06-11 cs.LG q-bio.NC 新提交

Flow Matching with In-Context Priors for Out-of-Distribution Brain Dynamics

基于上下文先验的分布外脑动力学流匹配

Sam Gijsen, Michał Łukomski, Marc-André Schulz, Kerstin Ritter

发表机构 * Hertie Institute for AI in Brain Health, University of Tübingen(赫蒂人工智能脑健康研究所,图宾根大学) Tübingen AI Center, University of Tübingen(图宾根人工智能中心,图宾根大学) Charité – Universitätsmedizin Berlin, Department of Psychiatry and Psychotherapy(柏林夏里特医学院,精神病学与心理治疗系) German Center for Mental Health (DZPG), partner site Tübingen(德国心理健康中心(DZPG),图宾根合作站点)

AI总结 提出一种逐时间步条件扩散Transformer,通过注入组合语言和可选空间先验,实现未见认知任务下fMRI脑动力学的零样本生成,支持反事实神经科学。

详情
Comments
Code and pretrained models available at this https URL
AI中文摘要

流匹配和扩散模型能够实现从图像到蛋白质等领域的条件生成,最近扩展到分布外上下文。然而,神经时间序列的生成模型主要局限于分类条件,阻碍了组合和零样本泛化。在这项工作中,我们提出了一种逐时间步条件扩散Transformer,通过注入组合语言和可选空间先验在上下文中,生成未见认知任务期间的真实fMRI脑动力学。这种零样本生成可以通过在经验验证之前支持新型认知实验的计算机设计和评估,从而促进反事实神经科学。利用该模型,我们在数百个保留任务条件下进行评估,并描述与训练流形相关的预测性能。仅从语言出发,模型恢复了跨任务和保留空间激活模式的区域特异性招募。当空间先验可用时,它们通过将生成锚定在仅靠语言退化的任务空间区域来补充文本路径,同时保留反事实任务规范所需的组合结构。据我们所知,这是首个用于未见认知任务的整个皮层fMRI动力学生成模型,推动了反事实神经科学和数据驱动的实验设计。

英文摘要

Flow matching and diffusion models enable conditional generation across domains ranging from images to proteins, with recent extensions to out-of-distribution contexts. Yet generative models of neural time series have largely remained restricted to categorical conditioning, precluding compositional and zero-shot generalization. In this work, we propose a per-timestep conditioned diffusion transformer for generating realistic fMRI brain dynamics during unseen cognitive tasks by injecting both compositional language and optional spatial priors in-context. Such zero-shot generation could enable counterfactual neuroscience by supporting in-silico design and evaluation of novel cognitive experiments before empirical validation. Leveraging this model, we evaluate across hundreds of held-out task conditions and characterize predictive performance in relation to the training manifold. From language alone, the model recovers region-specific recruitment across tasks and held-out spatial activation patterns. Spatial priors, when available, complement the text pathway by anchoring generation in regions of task space where language alone degrades, while retaining the compositional structure needed for counterfactual task specification. To our knowledge this is the first generative model of whole-cortex fMRI dynamics for unseen cognitive tasks, advancing counterfactual neuroscience and data-driven experimental design.

2606.12232 2026-06-11 cs.LG 新提交

Re-evaluating Confidence Remasking in Masked Diffusion Language Models

重新评估掩蔽扩散语言模型中的置信度重新掩蔽

Stipe Frkovic, Metod Jazbec, Dan Zhang, Christian A. Naesseth, Ilija Bogunovic, Eric Nalisnick

发表机构 * UvA-Bosch Delta Lab, University of Amsterdam(阿姆斯特丹大学UvA-Bosch Delta实验室) Bosch Center for AI(博世人工智能中心) University of Basel(巴塞尔大学) Johns Hopkins University(约翰霍普金斯大学)

AI总结 本文重新评估了掩蔽扩散语言模型中一种无需训练的后验置信度重新掩蔽方法WINO,发现在标准解码设置下其收益甚微,且会加剧多样性坍塌问题。

详情
AI中文摘要

掩蔽扩散语言模型(dLLMs)最近已成为自回归语言模型的有竞争力的替代方案,其通过并行令牌生成实现更快的推理。然而,掩蔽公式的一个显著限制是,一旦令牌被解除掩蔽,就无法再修改,这使得dLLMs容易受到早期采样错误的影响。为了解决这个问题,越来越多的研究试图扩展掩蔽dLLMs,使其具有自我纠正(重新掩蔽)能力。其中一类有吸引力的方法以无需训练、事后方式基于令牌置信度实现,早期报告的结果令人鼓舞。在这项工作中,我们重新审视了代表性事后重新掩蔽方法WINO [Hong et al., 2026]的实证评估,发现在标准解码设置(较短的块长度)下,它相比于仅基于置信度的解除掩蔽 [Wu et al., 2025] 几乎没有带来好处。将评估扩展到非贪婪解码,我们发现虽然基于置信度的重新掩蔽可以在一定程度上减轻由增加随机性引入的错误,但它也加剧了先前报道的基于置信度的解除掩蔽导致的多样性坍塌。总体而言,我们的结果表明,事后基于置信度的重新掩蔽的好处高度依赖于设置,这凸显了需要更全面的评估框架。

英文摘要

Masked diffusion language models (dLLMs) have recently emerged as a competitive alternative to autoregressive language models, with the promise of faster inference via parallel token generation. A notable limitation of the masked formulation, however, is that once a token has been unmasked it can no longer be revised, leaving dLLMs vulnerable to early sampling mistakes. To address this, a growing body of work has sought to extend masked dLLMs with self-correcting (remasking) capabilities. One appealing subset of these methods does so in a training-free, post-hoc manner based on token confidences, with encouraging early reported results. In this work, we revisit the empirical evaluation of a representative post-hoc remasking method, WINO [Hong et al., 2026], and find that under standard decoding settings (shorter block lengths) it brings little-to-no benefit over confidence-based unmasking alone [Wu et al., 2025]. Extending the evaluation to non-greedy decoding, we find that while confidence-based remasking can mitigate errors introduced by increased stochasticity to some extent, it also exacerbates the diversity collapse previously reported for confidence-based unmasking. Overall, our results show that the benefits of post-hoc confidence-based remasking are highly setting-dependent, underscoring the need for a more comprehensive evaluation framework.

2606.11203 2026-06-11 cs.CL cs.LG 交叉投稿

LatticeBridge: Rare-Event Sequential Inference for Faithful Structured Sequence Synthesis

LatticeBridge: 用于忠实结构化序列合成的罕见事件序列推理

Faruk Alpay, Bugra Kilictas

发表机构 * Bahcesehir University(巴切塞希尔大学)

AI总结 针对结构化序列生成中约束满足的罕见事件问题,提出LatticeBridge方法,结合前缀语言模型、实例编译表面自动机和扭曲序列蒙特卡洛解码器,在多个基准上显著提升锚点满足率和覆盖率。

详情
Comments
19 pages. Code and benchmark files available at this https URL
AI中文摘要

结构化序列生成通常要求模型在单个输出中满足多个输入派生约束。标准解码方法可能赋予流畅延续高概率,而对同时实现所有必需锚点的延续赋予低概率。我们将此机制视为罕见事件序列推理问题。LatticeBridge 结合了紧凑前缀语言模型、实例编译表面自动机以及带有重采样、多级分裂和源自实例提供短语的源支持提议项的扭曲序列蒙特卡洛 (SMC) 解码器。约束表示从每个输入实例编译而来,不依赖人工整理的词汇类别。在涵盖 CommonGen、E2E NLG 和 WikiBio 的 2,610 个可达到验证任务上,粒子解码器在共享提议模型下,相比贪心、波束过滤和 best-of-k 祖先基线,提高了精确锚点满足率和平均锚点覆盖率。由于仅精确锚点满足不能排除不支持的属性替换,评估同时报告了所需锚点覆盖率、源覆盖率、源入侵诊断、重叠度、运行时间和粒子统计量。该基准在固定提议模型下刻画了忠实度-重叠度-延迟前沿。

英文摘要

Structured sequence generation often requires a model to satisfy several input-derived constraints in a single output. Standard decoding methods may assign high probability to fluent continuations while placing low mass on continuations that realize all required anchors jointly. We study this regime as a rare-event sequential inference problem. LatticeBridge combines a compact prefix language model, instance-compiled surface automata, and a twisted sequential Monte Carlo (SMC) decoder with resampling, multilevel splitting, and a source-support proposal term derived from instance-provided phrases. The constraint representation is compiled from each input instance and does not rely on manually curated lexical classes. On 2,610 attainable validation tasks spanning CommonGen, E2E NLG, and WikiBio, the particle decoder improves exact anchor satisfaction and mean anchor coverage over greedy, beam-filtered, and best-of-k ancestral baselines under a shared proposal model. Since exact anchor satisfaction alone does not rule out unsupported attribute substitutions, the evaluation reports required-anchor coverage, source coverage, source-intrusion diagnostics, overlap, runtime, and particle statistics jointly. The benchmark characterizes the faithfulness-overlap-latency frontier under a fixed proposal model.

2606.11304 2026-06-11 physics.ins-det cs.LG hep-ex hep-ph 交叉投稿

SPADE: Split-and-Delay Embeddings for Autoregressive High-Granularity Calorimeter Simulation

SPADE: 用于自回归高粒度量热器模拟的分裂与延迟嵌入

Joschka Birk, Frank Gaede, Anna Hallin, Gregor Kasieczka, Martina Mozzanica, Henning Rose

AI总结 提出SPADE自回归变压器,通过独立嵌入多特征令牌并延迟特征流,利用标准自注意力学习令牌内相关性,在ILD探测器点云簇射生成中优于现有模型。

详情
Comments
20 pages, 13 figures
AI中文摘要

我们介绍了SPADE(分裂与延迟嵌入),一种用于序列的自回归变压器,其令牌携带多个特征。SPADE不是将这些特征联合嵌入,而是独立嵌入它们。将每个特征流相对于前一个特征流延迟,使得标准自注意力机制能够学习令牌内的相关性。应用于高度粒化的ILD探测器中的点云量热器簇射生成,SPADE在光子簇射上与最先进的AllShowers模型竞争,并显著优于基于VQ-VAE的前身OmniJet-$\alpha_C$。该机制适用于任何具有多特征令牌的生成任务,为更高维数据启用类似LLM的预训练工作流。

英文摘要

We introduce SPADE (SPlit And Delay Embeddings), an autoregressive transformer for sequences whose tokens carry multiple features. Rather than embedding these features jointly, SPADE embeds them independently. Delaying each feature stream relative to the previous one allows intra-token correlations to be learned by the standard self-attention mechanism. Applied to point-cloud calorimeter shower generation in the highly granular ILD detector, SPADE is competitive with the state-of-the-art AllShowers model on photon showers, and substantially outperforms its VQ-VAE-based predecessor OmniJet-$\alpha_C$. The mechanism is applicable to any generative task with multi-feature tokens, enabling LLM-style pretraining workflows for higher-dimensional data.
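
The idea of delaying each feature stream relative to the previous one can be sketched as a simple shift of per-feature columns. The abstract does not give the exact layout, so the one-step-per-feature offset, the `None` padding marker, and the function names below are assumptions; the sketch only shows one plausible arrangement in which a later feature of a token appears after the earlier features of the same token.

```python
from typing import List, Optional, Sequence

PAD = None  # hypothetical padding marker for positions with no value


def delay_streams(tokens: Sequence[Sequence[float]]) -> List[List[Optional[float]]]:
    """Shift feature stream f to the right by f steps.

    Input has shape (n_tokens, n_features); output has shape
    (n_tokens + n_features - 1, n_features).
    """
    if not tokens:
        return []
    n_features = len(tokens[0])
    n_steps = len(tokens) + n_features - 1
    rows: List[List[Optional[float]]] = []
    for step in range(n_steps):
        row: List[Optional[float]] = []
        for f in range(n_features):
            source = step - f  # token index whose feature f lands at this step
            row.append(tokens[source][f] if 0 <= source < len(tokens) else PAD)
        rows.append(row)
    return rows


def undelay_streams(rows: Sequence[Sequence[Optional[float]]], n_features: int) -> List[List[Optional[float]]]:
    """Invert delay_streams, recovering the original per-token features."""
    n_tokens = len(rows) - n_features + 1
    return [[rows[t + f][f] for f in range(n_features)] for t in range(n_tokens)]


# Two tokens with three features each (for example x, y, energy).
tokens = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]]
delayed = delay_streams(tokens)
restored = undelay_streams(delayed, n_features=3)
```

With this layout, a causal model that predicts step by step has already seen feature 0 of a token when it produces feature 1 of that token, which is one way ordinary self-attention could capture intra-token correlations.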

2606.12282 2026-06-11 cs.SD cs.LG 交叉投稿

PianoKontext: Expressive Performance Rendering from Deadpan Context

PianoKontext: 从平淡语境中生成富有表现力的演奏

Dmitrii Gavrilev

AI总结 提出PianoKontext,一种基于流匹配的钢琴演奏渲染模型,通过动态时间规整对齐乐谱与演奏的潜在表示,生成可变长度的表现力演奏。

详情
Comments
ICML 2026 Workshop on Machine Learning for Audio (Oral)
AI中文摘要

表现力演奏渲染(EPR)旨在根据音符序列生成逼真的演奏。然而,流匹配音频编辑模型仅操作相同时长的同步音乐样本,限制了它们对表现力时机的理解。我们提出了PianoKontext,一种针对古典钢琴音乐的流匹配渲染模型,该模型在预训练的Music2Latent模型的潜在空间中生成可变长度的演奏。我们将MIDI乐谱合成为平淡音频,并在潜在空间中使用动态时间规整(DTW)构建用于训练的对齐数据。对齐的嵌入在DiT块中拼接,从而简单有效地学习乐谱与演奏之间的依赖关系。音频样本可在我们的演示页面获取:此https URL。

英文摘要

Expressive performance rendering (EPR) aims to generate realistic performances constrained on sequences of notes. However, flow matching audio editing models manipulate only synchronized music samples of the same duration, limiting their understanding of expressive timing. We introduce PianoKontext, a flow matching rendering model for classical piano music that generates variable-length performances in the latent space of a pretrained Music2Latent model. We synthesize MIDI scores into deadpan audio and employ Dynamic Time Warping (DTW) in the latent space to construct paired data for training. The aligned embeddings are concatenated in DiT blocks, allowing for a simple and effective learning of the dependencies between the score and performances. Audio samples are available at our demo page: this https URL.
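
Dynamic Time Warping, which the abstract uses to pair score and performance latents of different lengths, can be sketched in a few lines. This is the textbook algorithm with a Euclidean frame distance, not the PianoKontext pipeline; the one-dimensional "latents" and the function name are illustrative assumptions.

```python
from typing import List, Sequence, Tuple


def dtw_path(
    a: Sequence[Sequence[float]], b: Sequence[Sequence[float]]
) -> Tuple[float, List[Tuple[int, int]]]:
    """Return the total DTW cost and the alignment path between a and b."""

    def dist(u: Sequence[float], v: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(u, v)) ** 0.5

    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] is the best cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            best_prev = min(cost[i - 1][j], cost[i][j - 1], cost[i - 1][j - 1])
            cost[i][j] = dist(a[i - 1], b[j - 1]) + best_prev

    # Backtrack from the end, always moving to the cheapest predecessor.
    path: List[Tuple[int, int]] = []
    i, j = n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        predecessors = [
            (cost[i - 1][j - 1], i - 1, j - 1),
            (cost[i - 1][j], i - 1, j),
            (cost[i][j - 1], i, j - 1),
        ]
        _, i, j = min(predecessors)
    path.reverse()
    return cost[n][m], path


# A three-frame "score" and a five-frame "performance" that lingers on the
# first and last frames.
score = [[0.0], [1.0], [2.0]]
performance = [[0.0], [0.0], [1.0], [2.0], [2.0]]
total_cost, alignment = dtw_path(score, performance)
```

Each pair in `alignment` maps a score frame to a performance frame, so repeated score indices mark places where the performance stretches time.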

2601.10774 2026-06-11 cs.LG hep-lat 版本更新

Analytic Bijections for Smooth and Interpretable Normalizing Flows

用于平滑且可解释的归一化流的解析双射

Mathis Gerdes, Miranda C. N. Cheng

AI总结 提出三类全局光滑、解析可逆的双射函数,替代耦合流中的仿射变换或样条,并设计径向流架构,在径向结构目标上以千分之一参数达到耦合流质量。

详情
Comments
Final ICML 2026 version. 9 + 14 pages, 10 + 11 figures, 3 + 2 tables. New CIFAR-10 and tabular-data results; main text shortened for readability
AI中文摘要

归一化流中的一个关键挑战是找到表达力强的可逆标量双射。现有方法面临权衡:仿射变换光滑且解析可逆但缺乏表达力;单调样条提供局部控制但仅分段光滑且作用于有界域;残差流实现光滑性但需要数值求逆。我们引入了三类解析双射,它们全局光滑($C^\infty$),定义在整个$\mathbb{R}$上,且以闭式解析可逆,结合了先前方法的有利性质。除了作为耦合流中的即插即用替代品(其性能匹配或超越样条),我们还开发了径向流:一种使用直接参数化的新颖架构,在保持角度方向的同时变换径向坐标。径向流表现出卓越的训练稳定性,产生几何可解释的变换,并且在具有径向结构的目标上,能以$1000$倍更少的参数达到与耦合流相当的质量。我们在1D和2D基准测试上进行了全面评估,并通过$\phi^4$格点场论实验证明了其在更高维物理问题中的适用性,其中我们的双射优于仿射基线,并能够解决模式崩溃问题的特定设计。

英文摘要

A key challenge in normalizing flows is finding expressive invertible scalar bijections. Existing approaches face trade-offs: affine transformations are smooth and analytically invertible but lack expressivity; monotonic splines offer local control but are only piecewise smooth and act on bounded domains; residual flows achieve smoothness but need numerical inversion. We introduce three families of analytic bijections that are globally smooth ($C^\infty$), defined on all of $\mathbb{R}$, and analytically invertible in closed form, combining the favorable properties of prior approaches. Beyond serving as drop-in replacements in coupling flows, where they match or exceed spline performance, we develop radial flows: a novel architecture using direct parametrization that transforms the radial coordinate while preserving angular direction. Radial flows exhibit exceptional training stability, produce geometrically interpretable transformations, and on targets with radial structure can achieve comparable quality to coupling flows with $1000\times$ fewer parameters. We provide comprehensive evaluation on 1D and 2D benchmarks, and demonstrate applicability to higher-dimensional physics problems through experiments on $\phi^4$ lattice field theory, where our bijections outperform affine baselines and enable problem-specific designs that address mode collapse.
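
The three properties the abstract asks of a scalar bijection (smooth everywhere, defined on all of $\mathbb{R}$, invertible in closed form) can be checked on a familiar example. The sketch below uses the sinh-arcsinh transform, which is not claimed to be one of the paper's three families; the parameter names `skew` and `tail` are assumptions introduced for illustration.

```python
import math


def forward(x: float, skew: float, tail: float) -> float:
    """Sinh-arcsinh map: smooth and strictly increasing on R when tail > 0."""
    return math.sinh(tail * math.asinh(x) + skew)


def inverse(y: float, skew: float, tail: float) -> float:
    """Closed-form inverse of forward."""
    return math.sinh((math.asinh(y) - skew) / tail)


def log_abs_derivative(x: float, skew: float, tail: float) -> float:
    """log |d forward / dx|, the per-dimension log-determinant term of a flow."""
    inner = tail * math.asinh(x) + skew
    return math.log(tail) + math.log(math.cosh(inner)) - 0.5 * math.log1p(x * x)


skew, tail = 0.3, 1.5
points = [-5.0, -0.5, 0.0, 0.7, 5.0]

# Round trip through forward and inverse.
round_trip = [inverse(forward(x, skew, tail), skew, tail) for x in points]

# Compare the analytic derivative with a central finite difference.
h = 1e-6
numeric = [(forward(x + h, skew, tail) - forward(x - h, skew, tail)) / (2 * h) for x in points]
analytic = [math.exp(log_abs_derivative(x, skew, tail)) for x in points]
```

Because the inverse and the log-derivative are both available in closed form, a layer built from such a map needs no numerical root finding for either sampling or density evaluation.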

2602.00424 2026-06-11 cs.LG cond-mat.mtrl-sci 版本更新

Open Materials Generation with Inference-Time Reinforcement Learning

基于推理时间强化学习的开放材料生成

Philipp Hoellmer, Stefano Martiniani

AI总结 提出OMatG-IRL框架,通过策略梯度强化学习直接作用于学习的速度场,无需显式计算得分,实现晶体结构预测中的能量目标强化,采样效率提升一个数量级。

详情
Comments
25 pages, 12 figures, 6 tables
AI中文摘要

晶体材料的连续时间生成模型通过学习预测稳定晶体结构实现逆向材料设计,但将显式目标属性纳入生成过程仍然具有挑战性。策略梯度强化学习(RL)为生成模型与下游目标对齐提供了原则性机制,但通常需要访问得分,这阻碍了其应用于仅学习速度场的基于流的模型。我们提出了一种推理时间强化学习的开放材料生成(OMatG-IRL)框架,这是一种直接作用于学习的速度场的策略梯度RL框架,无需显式计算得分。OMatG-IRL利用底层生成动力学的随机扰动,保持预训练生成模型的基线性能,同时在推理时实现探索和策略梯度估计。通过OMatG-IRL,我们首次将RL应用于晶体结构预测(CSP)。我们的方法能够有效强化基于能量的目标,同时通过成分条件保持多样性,并且取得了与基于得分的RL方法竞争的性能。最后,我们展示了OMatG-IRL可以学习时间相关的速度退火调度,实现精确的CSP,采样效率提高一个数量级,相应地生成时间减少。OMatG-IRL代码包含在开放材料生成(OMatG)框架的新版本中,可从该https URL获取。

英文摘要

Continuous-time generative models for crystalline materials enable inverse materials design by learning to predict stable crystal structures, but incorporating explicit target properties into the generative process remains challenging. Policy-gradient reinforcement learning (RL) provides a principled mechanism for aligning generative models with downstream objectives but typically requires access to the score, which has prevented its application to flow-based models that learn only velocity fields. We introduce Open Materials Generation with Inference-time Reinforcement Learning (OMatG-IRL), a policy-gradient RL framework that operates directly on the learned velocity fields and eliminates the need for the explicit computation of the score. OMatG-IRL leverages stochastic perturbations of the underlying generation dynamics, preserving the baseline performance of the pretrained generative model while enabling exploration and policy-gradient estimation at inference time. Using OMatG-IRL, we present the first application of RL to crystal structure prediction (CSP). Our method enables effective reinforcement of an energy-based objective while preserving diversity through composition conditioning, and it achieves performance competitive with score-based RL approaches. Finally, we show that OMatG-IRL can learn time-dependent velocity-annealing schedules, enabling accurate CSP with order-of-magnitude improvements in sampling efficiency and a corresponding reduction in generation time. The OMatG-IRL code is included in a new release of the Open Materials Generation (OMatG) framework available at this https URL.

2603.12261 2026-06-11 cs.LG cs.AI cs.CV 版本更新

The Latent Color Subspace: Emergent Order in High-Dimensional Chaos

潜在颜色子空间:高维混沌中的涌现秩序

Mateusz Pach, Jessica Bader, Quentin Bouniot, Serge Belongie, Zeynep Akata

AI总结 本文揭示了FLUX.1变分自编码器潜在空间中颜色表示的HSL结构,并提出一种无需训练的闭式潜在空间操作方法,实现对生成图像颜色的预测与显式控制。

详情
Comments
Accepted at ICML 2026
AI中文摘要

文本到图像生成模型已取得快速进展,但实现对生成图像的细粒度控制仍然困难,这主要源于对语义信息编码方式的理解有限。我们开发了对FLUX.1 [Dev]变分自编码器潜在空间中颜色表示的解释,揭示了一种反映色相、饱和度和明度的结构。我们通过证明潜在颜色子空间(LCS)解释能够预测并显式控制颜色,验证了其有效性,并引入了一种完全无需训练的FLUX方法,该方法仅基于闭式潜在空间操作。代码可在该https URL获取。

英文摘要

Text-to-image generation models have advanced rapidly, yet achieving fine-grained control over generated images remains difficult, largely due to limited understanding of how semantic information is encoded. We develop an interpretation of the color representation in the Variational Autoencoder latent space of FLUX.1 [Dev], revealing a structure reflecting Hue, Saturation, and Lightness. We verify our Latent Color Subspace (LCS) interpretation by demonstrating that it can both predict and explicitly control color, introducing a fully training-free method in FLUX based solely on closed-form latent-space manipulation. Code is available at this https URL.

2605.00545 2026-06-11 cs.LG cs.AI math-ph q-bio.GN q-bio.QM 版本更新

Beyond Continuity: Simulation-free Reconstruction of Discrete Branching Dynamics from Single-cell Snapshots

超越连续性:从单细胞快照无模拟重建离散分支动力学

Junda Ying, Yuxuan Wang, Bowen Yang, Peijie Zhou, Lei Zhang

AI总结 针对单细胞快照数据中随机性和非保守质量动态(如细胞增殖和凋亡)的挑战,提出无模拟框架Unbalanced Schrödinger Bridge (USB),通过离散分支薛定谔桥问题建模单细胞分辨率的跳跃式生灭动态,实现高效轨迹重建与离散模拟。

详情
AI中文摘要

从破坏性快照推断细胞轨迹因随机性和非保守质量动态(如细胞增殖和凋亡)的挑战而复杂化。现有的不平衡最优传输(OT)方法将质量视为连续流体,在群体水平进行推断。然而,这种宏观视角往往无法捕捉单细胞分辨率下生灭事件的离散跳跃性质,而这对于理解谱系分支和命运决定至关重要。我们提出无模拟框架Unbalanced Schrödinger Bridge (USB),用于学习底层动态,有效整合随机和非平衡效应,并在单细胞分辨率下建模离散、跳跃式的生灭动态。理论上,USB为分支薛定谔桥(BSB)问题提供了可处理的解,给出了严格的微观解释,其中单个细胞同时经历布朗运动和离散生灭跳跃。技术上,该方法通过引入无模拟训练目标实现高效求解器,有效扩展到高维组学数据。实验上,我们在模拟和真实数据集上证明,USB不仅达到优于或可比于确定性基线的轨迹重建性能,而且独特地实现了单细胞分辨率下生灭动态的真实离散模拟。

英文摘要

Inferring cellular trajectories from destructive snapshots is complicated by stochasticity and by non-conservative mass dynamics such as cell proliferation and apoptosis. Existing unbalanced Optimal Transport (OT) methods treat mass as a continuous fluid, performing inference at the population level. However, this macroscopic view often fails to capture the discrete, jump-like nature of birth-death events at single-cell resolution, which is essential for understanding lineage branching and fate decisions. We present Unbalanced Schrödinger Bridge (USB), a simulation-free framework for learning underlying dynamics that integrates both stochastic and unbalanced effects and models the discrete, jump-like birth-death dynamics at single-cell resolution. Theoretically, USB provides a tractable solution to the Branching Schrödinger Bridge (BSB) problem, offering a rigorous microscopic interpretation where individual cells undergo both Brownian motion and discrete birth-death jumps. Technically, the method implements an efficient solver by introducing a simulation-free training objective that scales to high-dimensional omics data. Empirically, we demonstrate on both simulated and real-world datasets that USB not only achieves trajectory reconstruction performance better than or comparable to deterministic baselines but also uniquely enables realistic discrete simulation of birth-death dynamics at single-cell resolution.

2606.00140 2026-06-11 cs.LG cs.AI 版本更新

Geometric Erasure by Contrastive Velocity Matching in Rectified Flows

整流流中对比速度匹配的几何擦除

Jonas Henry Grebe, Tobias Braun, Anna Rohrbach, Marcus Rohrbach

AI总结 提出GEM框架,通过对比速度匹配实现整流流模型中的概念擦除,结合生成流网络与教师引导的流匹配,有效抑制有害内容生成。

详情
AI中文摘要

尽管多模态生成模型的快速采用提供了巨大潜力,但也增加了有害内容合成、深度伪造和版权侵权的风险。为应对这些挑战,概念擦除作为一种前瞻性防护手段应运而生。然而,随着该领域逐渐从基于U-Net的扩散模型转向整流流变换器,擦除研究难以跟上步伐。在这项工作中,我们引入了GEM,一个简单但高效的整流流模型擦除框架。作为我们贡献的一部分,我们在基于轨迹的遗忘(基于生成流网络)与经典教师引导擦除之间建立了原则性桥梁:我们将基于轨迹的信号转化为教师引导的流匹配设置,统一了两种范式的优势。具体而言,教师提供互补的吸引和排斥信号,我们将其组合成一个单一的几何引导目标,实现对不需要概念的目标抑制,同时保留良性生成。

英文摘要

While the rapid adoption of multimodal generative models offers immense potential, it has also increased the risks of harmful content synthesis, deepfakes, and copyright infringements. To address these challenges, concept erasure has emerged as a prospective safeguard. However, as the field gradually transitions from U-Net-based diffusion models to Rectified Flow Transformers, erasure research has struggled to keep pace. In this work, we introduce GEM, a simple but highly effective erasure framework for Rectified Flow models. As part of our contribution, we establish a principled bridge between trajectory-based unlearning grounded in Generative Flow Networks and classic teacher-guided erasure: we translate trajectory-based signals into a teacher-guided flow-matching setup that unifies the strengths of both paradigms. Concretely, a teacher provides complementary attraction and repulsion signals that we combine into a single geometric guidance objective, yielding targeted suppression of unwanted concepts while preserving benign generation.

2509.09794 2026-06-11 cs.AI cs.LG 版本更新

Synthetic Homes: A Multimodal Generative AI Pipeline for Residential Building Data Generation under Data Scarcity

合成住宅:数据稀缺下用于住宅建筑数据生成的多模态生成式AI管道

Jackson Eshbaugh, Chetan Tiwari, Jorge Silveyra

AI总结 提出一个多模态生成式AI框架,整合图像、表格和模拟组件,从公开记录和图像生成合成住宅建筑数据集,以解决建筑参数数据稀缺问题。

详情
Comments
37 pages; 2 appendices; 6 figures; 2 tables. Code available at this https URL
AI中文摘要

计算模型已成为建筑和城市尺度多尺度能源建模研究的强大工具,支持建筑和城市能源系统的数据驱动分析。然而,这些模型需要大量的建筑参数数据,这些数据通常难以获取、收集成本高昂或受隐私限制。我们引入了一个模块化的多模态生成式人工智能(AI)框架,该框架整合了图像、表格和基于模拟的组件,并从公开的县记录和图像生成合成住宅建筑数据集,同时提出了一个实例化该框架的端到端管道。为了减少典型的大型语言模型(LLM)挑战,我们使用基于遮挡的视觉焦点分析来评估模型组件。我们的分析表明,我们选择的视觉语言模型在建筑图像处理方面比基于GPT的替代方案实现了更大的视觉焦点。我们还根据国家参考数据集评估了结果的真实性,发现我们的合成数据在四个选定变量中的三个重叠率超过95%。这项工作减少了对昂贵或受限数据源的依赖,降低了建筑尺度能源研究和机器学习(ML)驱动的城市能源建模的障碍,从而在数据稀缺的情况下实现了可扩展的下游任务,如能源建模、改造分析和城市尺度模拟。

英文摘要

Computational models have emerged as powerful tools for multi-scale energy modeling research at the building and urban scale, supporting data-driven analysis across building and urban energy systems. However, these models require large amounts of building parameter data that is often inaccessible, expensive to collect, or subject to privacy constraints. We introduce a modular, multimodal generative Artificial Intelligence (AI) framework that integrates image, tabular, and simulation-based components and produces synthetic residential building datasets from publicly available county records and images, and present an end-to-end pipeline instantiating this framework. To reduce typical Large Language Model (LLM) challenges, we evaluate our model's components using occlusion-based visual focus analysis. Our analysis demonstrates that our selected vision-language model achieves greater visual focus than a GPT-based alternative for building image processing. We also assess realism of our results against a national reference dataset, finding that our synthetic data overlaps more than 95% for three of the four selected variables. This work reduces dependence on costly or restricted data sources, lowering barriers to building-scale energy research and Machine Learning (ML)-driven urban energy modeling, and therefore enabling scalable downstream tasks such as energy modeling, retrofit analysis, and urban-scale simulation under data scarcity.

2603.12901 2026-06-11 stat.ML cond-mat.dis-nn cs.IT cs.LG 版本更新

A theory of learning data statistics in diffusion models, from easy to hard

扩散模型中学习数据统计的理论:从容易到困难

Lorenzo Bardone, Claudia Merger, Sebastian Goldt

AI总结 本文研究了扩散模型在学习数据统计时的分布简单性偏差,揭示了学习成对统计与高阶统计所需样本复杂度的差异,并引入了扩散信息指数这一不变量。

详情
AI中文摘要

尽管扩散模型已成为一类强大的生成模型,但其学习动态仍缺乏充分理解。我们首先通过实验表明,在自然图像上训练的标准扩散模型存在分布简单性偏差:先学习简单的成对输入统计,再转向高阶相关性。我们在一个极简数据模型(混合累积量模型)上训练的简单去噪器中重现了这一行为,该模型可以精确控制输入的成对相关性和高阶相关性。我们识别出模型的一个标量不变量,它决定了学习成对与高阶相关性的样本复杂度,我们称之为扩散信息指数,它类比于其他学习范式中的相关不变量。利用这一不变量,我们证明去噪器以线性样本复杂度学习输入的简单成对统计,而更复杂的高阶统计(如四阶累积量)至少需要立方样本复杂度。我们还证明,如果成对统计与高阶统计共享相关的潜在结构,则学习四阶累积量的样本复杂度是线性的。本文描述了扩散模型学习复杂度递增的分布的一个关键机制。

英文摘要

While diffusion models have emerged as a powerful class of generative models, their learning dynamics remain poorly understood. We address this issue first by empirically showing that standard diffusion models trained on natural images exhibit a distributional simplicity bias, learning simple, pair-wise input statistics before specializing to higher-order correlations. We reproduce this behaviour in simple denoisers trained on a minimal data model, the mixed cumulant model, where we precisely control both pair-wise and higher-order correlations of the inputs. We identify a scalar invariant of the model that governs the sample complexity of learning pair-wise and higher-order correlations that we call the diffusion information exponent, in analogy to related invariants in different learning paradigms. Using this invariant, we prove that the denoiser learns simple, pair-wise statistics of the inputs at linear sample complexity, while more complex higher-order statistics, such as the fourth cumulant, require at least cubic sample complexity. We also prove that the sample complexity of learning the fourth cumulant is linear if pair-wise and higher-order statistics share a correlated latent structure. Our work describes a key mechanism for how diffusion models can learn distributions of increasing complexity.

2605.27478 2026-06-11 stat.ML cs.LG math.PR 版本更新

Triangular-Reference Schrödinger Bridges for Time Series Generation

三角参考薛定谔桥用于时间序列生成

Gabriele Bocchi

AI总结 提出三角参考薛定谔桥(TR-SBTS)框架,作为SBTS的保守扩展,以区间冻结、可能退化的扩散参考和层次化潜在波动率结构替代布朗参考,并保留熵最小化的变分核心。

详情
AI中文摘要

我们引入了用于时间序列的三角参考薛定谔桥(TR-SBTS),这是SBTS框架的一种保守扩展,其中布朗参考被替换为区间冻结的、可能退化的扩散参考,在潜在波动率水平的层次上呈三角形。该构造是在增广状态空间上的单一熵投影,变分约束在时间和潜在水平上联合施加,并通过相对熵的分解层次展开。SBTS的变分核心得以保留:熵最小化器是参考的h-变换,在每个冻结区间上,最优动力学在活跃协方差方向的仿射叶上具有对数梯度漂移公式,即使冻结协方差是秩亏的也成立。我们建立了冻结近似的稳定性以及相应正则化核估计量的收敛性。该构造通过一个有限维条件映射实现,该映射由三种互补的过去约简组成——块PCR摘要、由运行时冻结协方差累积量诱导的过去增量的参考感知马氏核,以及在同一参考度量下的过去窗口WLS漂移回归器——以及一个耦合的状态-协方差桥步骤,其中每个潜在水平为上一水平产生动态参考,并由协方差描述符总结;该构造在数值实验上进行了评估。

英文摘要

Schrödinger bridges for time series (SBTS) generate synthetic paths by projecting, in relative entropy, a Brownian reference onto the path laws that match the joint distribution of the data on the observation grid. The Brownian reference, however, fixes the quadratic variation of the generated paths, which is restrictive when stochastic volatility, correlated noise, or rank-deficient covariance structures must be reproduced. We introduce "Triangular-Reference Schrödinger Bridges for Time Series" (TR-SBTS), which keeps the entropy-projection backbone of SBTS but replaces the Brownian reference by a triangular, volatility-informed, intervalwise frozen reference on a state augmented with latent covariance descriptors. The construction remains a single entropy projection on the augmented state: the minimiser is the \(h\)-transform of the reference, and on each frozen interval the optimal drift has the logarithmic-gradient form \(b^\star(t,x)=A\,\nabla\log H(t,x)\), intrinsic to the active covariance directions when the frozen covariance \(A\) is degenerate. We prove stability of the frozen approximation and consistency of the associated regularised kernel estimators, describe a reference-aware Nadaraya--Watson implementation of the conditional next-increment law, and evaluate the construction on numerical experiments.

5. 优化、泛化与理论分析 27 篇

2606.11258 2026-06-11 cs.LG nlin.PS physics.comp-ph 新提交

Loss Landscape Diagnosis for Gradient-Based Gray-Scott System Inversion: Disentangling the Roles of PINN Components

基于梯度的Gray-Scott系统反演的损失景观诊断:解构PINN各组件的角色

Yan Yang

AI总结 通过展开的(unrolled)Gray-Scott模拟直接反向传播稳态损失,发现优化因损失景观中的平坦高原和陡峭悬崖而失败,而PINN中的残差损失通过隐式编码完整PDE动力学避免了该病理现象。

详情
Comments
Accepted at the AI4Physics Workshop, ICML 2026 (non-archival). 14 pages, 10 figures
AI中文摘要

反应扩散系统的基于梯度的反演通常借助代理模型或物理信息神经网络(PINN)进行,而最直接的路径,即通过PDE结构本身进行反向传播,在很大程度上被回避。我们将这条直接路径用作诊断探针,通过展开的(unrolled)Gray-Scott模拟反向传播稳态损失以恢复其参数,不使用代理模型或神经网络增强。优化未能收敛,而直接绘制损失景观可将失败定位于其几何结构:平坦高原上没有梯度信号,其边界是与分岔边界对齐的陡峭悬崖。这种结构在不同损失函数中反复出现,并且无论梯度如何传递到参数都会被继承。将这一最小设置视为PINN的消融实验,我们解构了每个组件的作用:在神经网络固定的情况下,残差损失是PDE参数的二次函数,产生平滑的损失景观,因此仅凭它就能通过隐式编码所有初始条件下的完整PDE动力学来避免该病理现象。而神经网络无法修复不适定的参数子空间,因此仅用于补全观测数据,这种分工此前未被明确指出。这些发现对PINN类方法具有具体的设计意义,并就增加维度何时真正有帮助提供了更一般的启发。

英文摘要

Gradient-based inversion of reaction-diffusion systems is typically approached via surrogate models or physics-informed neural networks (PINNs), while the most direct route, backpropagation through the PDE's structure itself, has largely been avoided. We pursue this direct route as a diagnostic probe, backpropagating a steady-state loss through unrolled Gray-Scott simulation to recover its parameters, with no surrogate or neural-network augmentation. Optimization fails to converge, and plotting the landscape directly locates the failure in its geometry -- flat plateaus with no gradient signal, bounded by sharp cliffs that align with bifurcation boundaries -- a structure that recurs across loss functions and is inherited however the gradients are routed to parameters. Reading this minimal setup as an ablation of PINN, we disentangle each component's role: with the neural network fixed, the residual loss is quadratic in the PDE parameters and yields a smooth landscape, so it alone already avoids the pathology, by implicitly encoding the full PDE dynamics across all initial conditions. The neural network, for its part, cannot repair an ill-posed parameter subspace, and so serves only to complete the observed data -- a division of labor not previously made explicit. These findings carry concrete design implications for PINN-type methods and a broader heuristic on when added dimensions actually help.
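The statement that the residual loss is quadratic in the PDE parameters once the network is fixed can be illustrated with a small sketch. It assumes the standard Gray-Scott form u_t = D_u Δu − u v² + F(1 − u), v_t = D_v Δv + u v² − (F + k) v, which the abstract does not spell out, and it omits the diffusion terms for brevity. The sampled states, the values F = 0.04 and k = 0.06, and all function names are hypothetical. Since each residual is affine in (F, k), the summed squared residual is a quadratic whose minimizer solves 2×2 normal equations.

```python
import random


def reaction_rates(u, v, feed, kill):
    """Reaction part of Gray-Scott (diffusion omitted in this sketch)."""
    du = -u * v * v + feed * (1.0 - u)
    dv = u * v * v - (feed + kill) * v
    return du, dv


def fit_feed_kill(samples):
    """Minimize the summed squared residuals over (feed, kill).

    Each residual is affine in the parameters, r = a + b_f*feed + b_k*kill,
    so the loss is quadratic and its minimizer solves G @ theta = -h.
    """
    g_ff = g_fk = g_kk = h_f = h_k = 0.0
    for u, v, du, dv in samples:
        rows = (
            (du + u * v * v, -(1.0 - u), 0.0),  # residual of the u-equation
            (dv - u * v * v, v, v),             # residual of the v-equation
        )
        for a, b_f, b_k in rows:
            g_ff += b_f * b_f
            g_fk += b_f * b_k
            g_kk += b_k * b_k
            h_f += b_f * a
            h_k += b_k * a
    det = g_ff * g_kk - g_fk * g_fk
    feed = (-h_f * g_kk + h_k * g_fk) / det
    kill = (-h_k * g_ff + h_f * g_fk) / det
    return feed, kill


# Synthetic "observations": random states with exact time derivatives.
rng = random.Random(0)
true_feed, true_kill = 0.04, 0.06
samples = []
for _ in range(50):
    u, v = rng.uniform(0.1, 0.9), rng.uniform(0.1, 0.9)
    samples.append((u, v, *reaction_rates(u, v, true_feed, true_kill)))

feed_hat, kill_hat = fit_feed_kill(samples)
print(feed_hat, kill_hat)
```

With exact derivatives the fit recovers the parameters in one linear solve, which is the smooth-landscape behaviour the abstract attributes to the residual term. The rollout loss studied in the paper has no such closed form.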

2606.11431 2026-06-11 cs.LG 新提交

Mirror Descent Beyond Euclidean Stability: An Exponential Separation in Initialization Sensitivity

超越欧几里得稳定性的镜像下降:初始化敏感性的指数级分离

Shira Vansover-Hager, Matan Schliserman, Ofir Schlisselberg, Tomer Koren

发表机构 * Blavatnik School of Computer Science and AI, Tel Aviv University(特拉维夫大学布拉瓦特尼克计算机科学与人工智能学院) Google Research(谷歌研究院)

AI总结 本文证明非二次正则化的镜像下降(MD)在凸光滑目标上对初始化的敏感性可呈指数级增长,与梯度下降(GD)形成鲜明对比,并提出基于锚点的Bregman正则化可缓解不稳定性。

详情
AI中文摘要

镜像下降(MD)将梯度下降(GD)扩展到欧几里得几何之外,最近重新成为强化学习和LLM后训练中KL正则化策略优化的视角。这引发了一个基本的鲁棒性问题,对可重复性和可靠性至关重要:MD动力学对其输入的敏感性如何?我们关注初始化,它本身通常是预训练或先前对齐的模型。众所周知,二次正则化的MD(包括GD和马氏几何)对于凸光滑目标是稳定的。我们展示了一个鲜明的对比:一旦正则化器是非二次的,即使正则化器在欧几里得范数下是良条件的,MD对初始化的敏感性也可能比GD高指数级。我们给出了一个三维构造,其中目标函数是凸光滑的,正则化器是强凸、光滑且良条件的,初始$\varepsilon$扰动在$T$次步长为$\eta$的MD迭代后迅速放大到$\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$。对于单纯形上的典型KL正则化MD,我们证明即使线性目标在高维或近边界区域也能指数级放大初始$\varepsilon$扰动。最后,我们展示向锚点添加Bregman正则化项可以在很大程度上保持优化保证的同时稳定动力学,并且锚点的选择至关重要:在初始化处锚定仅部分缓解不稳定性,而在固定点锚定则产生更稳定的机制。

英文摘要

Mirror Descent (MD) extends Gradient Descent (GD) beyond Euclidean geometry and has recently reappeared as a lens for KL-regularized policy optimization in reinforcement learning and LLM post-training. This raises a basic robustness question, crucial to reproducibility and reliability: how sensitive are MD dynamics to their inputs? We focus on initialization, often itself a pretrained or previously aligned model. Quadratic-regularized MD, including GD and Mahalanobis geometries, is well-known to be stable for convex smooth objectives. We show a sharp contrast: once the regularizer is non-quadratic, MD can be exponentially more sensitive to initialization than GD, even with a well-conditioned regularizer in Euclidean norm. We give a three-dimensional construction with a convex, smooth objective and a strongly convex, smooth, well-conditioned regularizer where an initial $\varepsilon$ perturbation is quickly amplified to $\min\{\text{polylog}^{-1}(1/\varepsilon), \varepsilon e^{\Omega(\eta T)}\}$ after $T$ iterations of MD with step size $\eta$. For canonical KL-regularized MD on the simplex, we show that even linear objectives can amplify an initial $\varepsilon$ perturbation exponentially fast in high-dimensional or near-boundary regimes. Finally, we show that adding a Bregman regularization term toward an anchor point can stabilize the dynamics while largely preserving the optimization guarantees, and that the choice of anchor is crucial: anchoring at the initialization only partially mitigates the instability, whereas anchoring at a fixed point yields a more stable mechanism.
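The near-boundary amplification for KL-regularized MD on the simplex can be seen in a two-coordinate toy problem. The sketch below is not the paper's three-dimensional construction. It assumes the usual exponentiated-gradient form of the update, x_{t+1,i} ∝ x_{t,i} exp(−η g_i), and a hypothetical linear cost vector, step size, and starting points.

```python
import math


def kl_mirror_descent(x0, grad, eta, steps):
    """Entropic mirror descent on the simplex (exponentiated gradient)."""
    x = list(x0)
    for _ in range(steps):
        g = grad(x)
        w = [xi * math.exp(-eta * gi) for xi, gi in zip(x, g)]
        total = sum(w)
        x = [wi / total for wi in w]
    return x


def l1_distance(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))


cost = [0.0, 1.0]                     # linear objective <cost, x>
linear_grad = lambda x: cost          # its gradient is constant

delta, eps = 1e-8, 1e-9               # start near the boundary, perturb slightly
eta, steps = 0.5, 20
start_a = [delta, 1.0 - delta]
start_b = [delta + eps, 1.0 - delta - eps]

end_a = kl_mirror_descent(start_a, linear_grad, eta, steps)
end_b = kl_mirror_descent(start_b, linear_grad, eta, steps)

amplification = l1_distance(end_a, end_b) / l1_distance(start_a, start_b)
print(amplification, math.exp(eta * steps))
```

In this toy case the gap between the two runs grows by a factor close to exp(ηT) while both iterates are still near the boundary. Euclidean projected gradient descent on the same linear objective would not increase the distance between the two runs.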

2606.11574 2026-06-11 cs.LG cond-mat.mtrl-sci physics.chem-ph stat.ML 新提交

Range-Aware Bayesian Optimization for Discovering Diverse Designs within Target Property Windows

范围感知贝叶斯优化用于在目标属性窗口内发现多样化设计

Shengli Jiang, Jason Wu, Charles M. Schroeder, Michael A. Webb

发表机构 * Department of Chemical and Biological Engineering, Princeton University(普林斯顿大学化学与生物工程系)

AI总结 提出范围感知贝叶斯优化框架,通过采集函数直接评分候选解满足目标范围的后验概率,在基准任务和实际案例中比标准方法发现更多样化的有效设计。

详情
Comments
64 pages, 6 main text figures, 17 supporting figures, 6 supporting tables
AI中文摘要

在许多材料和产品设计问题中,理想的候选物表现出可接受范围内的属性,而非达到单一最优值。恢复满足此类规格的多个不同解也具有实际价值,因为某些候选物可能因成本、可加工性或鲁棒性等原因而更受青睐,而这些因素难以直接编码到目标函数中。在此,我们开发了一个范围感知贝叶斯优化(BO)框架,其中采集函数直接评分候选解满足目标范围的后验概率。该框架自然扩展到在共享候选空间上并行追求多个不同规格。在基准任务中,范围感知采集一致地比标准BO基线和最近的目标寻求方法恢复更大且更多样化的有效设计集。其效用进一步在两个实际动机的设计案例研究中得到证明,涉及优化聚合物合成的反应条件和发现指定光学吸收带的序列定义低聚物,并得到量子化学计算的支持。这些结果表明,范围感知BO可以为规格驱动设计提供实用且样本高效的基础,特别是当设计灵活性和解多样性是重要考虑因素时。

英文摘要

In many materials and product design problems, desirable candidates exhibit properties that fall within an acceptable range rather than achieve a single optimum. Recovering multiple, distinct solutions that satisfy such specifications is also practically valuable, as some candidates may be preferred for reasons of cost, processability, or robustness that are difficult to encode directly in an objective function. Here, we develop a range-aware Bayesian optimization (BO) framework in which the acquisition function directly scores the posterior probability that a candidate satisfies a target range. The framework naturally extends to parallel pursuit of multiple distinct specifications over a shared candidate space. Across benchmark tasks, range-aware acquisition consistently recovers larger and more diverse sets of valid designs than standard BO baselines and recent goal-seeking methods. Its utility is further demonstrated in two practically motivated design case studies involving optimizing reaction conditions for polymer synthesis and sequence-defined oligomer discovery for prescribed optical absorption bands, supported by quantum chemical calculations. These results suggest that range-aware BO can provide a practical and sample-efficient foundation for specification-driven design, particularly when design flexibility and solution diversity are important considerations.
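An acquisition function that scores the posterior probability of landing in a target range has a closed form when the surrogate posterior at a candidate is Gaussian. The sketch below assumes that Gaussian form (the abstract does not name the surrogate), and the candidate names, posterior means, and standard deviations are hypothetical.

```python
import math


def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def prob_in_range(mean, std, low, high):
    """P(low <= y <= high) for y ~ Normal(mean, std**2)."""
    return normal_cdf((high - mean) / std) - normal_cdf((low - mean) / std)


# Hypothetical candidates: name -> (posterior mean, posterior std).
candidates = {"A": (0.50, 0.10), "B": (0.75, 0.10), "C": (0.50, 0.40)}
low, high = 0.4, 0.6

scores = {name: prob_in_range(m, s, low, high) for name, (m, s) in candidates.items()}
best = max(scores, key=scores.get)
print(scores, best)
```

Candidate A (centred in the window, low uncertainty) scores highest. C is centred but uncertain and B is confident but off target, so both receive lower scores.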

2606.11711 2026-06-11 cs.LG stat.ML 新提交

Capacity-Constrained Online Convex Optimization with Delayed Feedback

具有延迟反馈的容量受限在线凸优化

Alexander Ryabchenko, Idan Attias, Daniel M. Roy

发表机构 * Department of Statistical Sciences, University of Toronto(多伦多大学统计科学系) Vector Institute(向量研究所) Institute for Data, Econometrics, Algorithms, and Learning (IDEAL), hosted by UIC and TTIC(数据、计量经济学、算法与学习研究所(IDEAL),由伊利诺伊大学芝加哥分校和丰田工业大学芝加哥分校主办)

AI总结 研究在硬容量约束下(最多同时跟踪C个待处理轮次)的延迟在线凸优化,通过引入半先知模型和延迟加权FTRL算法,首次给出了凸和强凸损失下容量受限OCO的遗憾界。

详情
AI中文摘要

具有延迟反馈的在线学习通常假设学习者可以跟踪所有待处理轮次直到其反馈到达。在实践中,跟踪资源是有限的,未跟踪轮次的反馈将永久丢失。在本文中,我们研究了在硬容量约束下的延迟在线凸优化(OCO),其中任何时候最多可以跟踪$C$个待处理轮次。为了建模延迟信息,我们引入了一个半先知模型,该模型细化了先前工作中的先知假设:学习者不需要在预测时知道延迟,而是在线观察延迟到期,这与经典的无约束延迟设置一致。我们的方法通过归约到一个新颖的“延迟且加权”的OCO问题来实现,使用一个随机化跟踪决策并对结果观测进行重要性加权的调度器。对于这个基础问题,我们提出并分析了延迟加权FTRL及其赌博机变体,建立了明确刻画时变权重与延迟反馈之间相互作用的遗憾界。将这些基础学习器与我们的调度器相结合,首次给出了在凸和强凸损失下容量受限OCO的遗憾保证,适用于一阶和赌博机反馈。对于一阶反馈,容量$C = \Omega(\log T)$足以在忽略对数因子的情况下恢复标准延迟OCO的速率。对于赌博机反馈,遗憾率由$(1 + \sigma_{\text{max}}/C)$的幂次调制,其中$\sigma_{\text{max}}$是任何时候的最大待处理观测数。这使得当$C < \sigma_{\text{max}}$时遗憾界能够优雅地退化,同时保持次线性。

英文摘要

Online learning with delayed feedback typically assumes that the learner can track all pending rounds until their feedback arrives. In practice, tracking resources are finite, and feedback from untracked rounds is permanently lost. In this paper, we study delayed online convex optimization (OCO) under a hard capacity constraint, where at most $C$ pending rounds can be tracked at any time. To model delay information, we introduce a semi-clairvoyant model that refines the clairvoyant assumption from prior work: rather than requiring delays to be known at prediction time, the learner observes delay expirations online, consistent with the classical unconstrained delayed setting. Our approach proceeds via a reduction to a novel "delayed and weighted" OCO problem, using a scheduler that randomizes tracking decisions and importance-weights the resulting observations. For this base problem, we propose and analyze Delayed-Weighted FTRL and its bandit analogue, establishing regret bounds that explicitly characterize the interaction between time-varying weights and delayed feedback. Combining these base learners with our schedulers yields the first regret guarantees for capacity-constrained OCO under convex and strongly convex losses, for both first-order and bandit feedback. For first-order feedback, capacity $C = \Omega(\log T)$ suffices to recover standard delayed OCO rates up to logarithmic factors. For bandit feedback, the regret rates are modulated by powers of $(1 + \sigma_{\text{max}}/C)$, where $\sigma_{\text{max}}$ is the maximum number of pending observations at any time. This allows the regret bound to degrade gracefully when $C < \sigma_{\text{max}}$, while remaining sublinear.

2606.12050 2026-06-11 cs.LG math.DS 新提交

Reliable Error Estimation for PINNs: Lower and Upper A Posteriori Bounds

PINNs的可靠误差估计:后验下界与上界

Ismail Huseynov, Arzu Ahmadova, Agamirza Bashirov

发表机构 * Physikalisch-Technische Bundesanstalt (PTB)(德国联邦物理技术研究院) Technical University of Berlin(柏林工业大学) Weierstrass Institute for Applied Analysis and Stochastics(魏尔斯特拉斯应用分析与随机研究所) Eastern Mediterranean University(东地中海大学)

AI总结 提出PINNs求解常微分方程的可计算后验误差下界,结合局部单侧Lipschitz条件得到更紧的上界,实现双侧误差包络,并讨论初始条件处理对下界的影响。

详情
AI中文摘要

物理信息神经网络(PINNs)将机器学习与物理定律相结合以求解微分方程。虽然现有结果为PINN预测误差提供了严格的后验上界,但完整认证还需要互补的下界信息以获得可计算的双侧误差包络。本文在合适的认证状态空间域上,在局部强单调性条件下推导了PINN误差在常微分方程中的可计算后验下界。我们将这些估计与在单侧Lipschitz条件下的互补局部上界相结合,该条件弱于先前工作中使用的全局Lipschitz假设,并能产生更尖锐的误差上界带。所得界仅依赖于神经网络近似、ODE残差以及局部单调性和增长常数,因此无需访问精确解。对于线性时不变和时变系统,我们进一步根据系统矩阵对称部分的最小和最大特征值得出显式公式。我们还讨论了PINN中初始条件的软硬约束区别,并解释了为什么精确约束可能使标量下界证书无效。为了在线性情形中恢复有意义的非平凡下界信息,我们使用基于坐标单位向量的符号残差有限探针证书。我们还制定了一种证书引导的训练策略,其中传播的上界证书用作辅助正则化器,而下界证书保留为训练后诊断。总体而言,所提出的框架为PINN逼近ODE提供了严格且实际可计算的误差证书,同时明确了假设可验证的域和模型类别。

英文摘要

Physics-informed neural networks (PINNs) combine machine learning with physical laws to solve differential equations. While existing results provide rigorous a posteriori upper bounds for PINN prediction errors, complete certification also requires complementary lower information in order to obtain computable two-sided error enclosures. In this paper, we derive computable a posteriori lower bounds for PINN errors in ordinary differential equations on suitable certified state-space domains under a localized strong monotonicity condition. We combine these estimates with complementary localized upper bounds under a one-sided Lipschitz condition, which is weaker than the global Lipschitz assumption used in previous work and can yield sharper upper error bands. The resulting bounds depend only on the neural-network approximation, the ODE residual, and local monotonicity and growth constants, and therefore do not require access to the exact solution. For linear time-invariant and time-varying systems, we further derive explicit formulas in terms of the minimal and maximal eigenvalues of the symmetric part of the system matrix. We also discuss the distinction between soft and hard enforcement of initial conditions in PINNs and explain why exact enforcement can make the scalar lower certificate uninformative. To recover nontrivial lower information in the linear setting, we use a signed-residual finite-probe certificate based on coordinate unit vectors. We also formulate a certificate-informed training strategy in which the propagated upper certificate is used as an auxiliary regularizer, while lower certificates remain post-training diagnostics. Altogether, the proposed framework provides rigorous and practically computable error certificates for PINN approximations of ODEs, while making explicit the domains and model classes for which the assumptions can be verified.
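For the linear time-invariant case, the role of the symmetric part's eigenvalues can be seen from the classical energy estimate below. This is a textbook-style sketch under assumed notation (the system \dot x = A x, the PINN output \hat x, the residual r, and the error e are named here for illustration), and it is not claimed to be the paper's exact certificate. The differential inequality holds for almost every t.

```latex
% Assumed setting: \dot x = A x, PINN output \hat x,
% residual r(t) = \dot{\hat x}(t) - A \hat x(t), error e = \hat x - x,
% and symmetric part S = (A + A^\top)/2.
\begin{align*}
\dot e &= A e + r, \\
\tfrac{1}{2}\tfrac{d}{dt}\|e\|^2 &= e^\top S e + e^\top r, \\
\lambda_{\min}(S)\,\|e\| - \|r\| \;\le\; \tfrac{d}{dt}\|e\|
  &\;\le\; \lambda_{\max}(S)\,\|e\| + \|r\|, \\
\|e(t)\| &\le e^{\lambda_{\max}(S)\,t}\,\|e(0)\|
  + \int_0^t e^{\lambda_{\max}(S)(t-s)}\,\|r(s)\|\,ds, \\
\|e(t)\| &\ge e^{\lambda_{\min}(S)\,t}\,\|e(0)\|
  - \int_0^t e^{\lambda_{\min}(S)(t-s)}\,\|r(s)\|\,ds.
\end{align*}
```

If the initial condition is enforced exactly, then e(0) = 0 and the right-hand side of the last line is nonpositive, which matches the abstract's remark that exact enforcement can make a scalar lower certificate uninformative.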

2606.12120 2026-06-11 cs.LG math.OC 新提交

A Riemannian Approach to Low-Rank Optimal Transport

低秩最优传输的黎曼方法

Pratik Jawanpuria, Bamdev Mishra

发表机构 * Centre for Machine Intelligence and Data Science, IIT Bombay(印度理工学院孟买分校机器智能与数据科学中心) Microsoft India(微软印度)

AI总结 提出黎曼几何框架用于低秩最优传输,通过将平衡与不平衡秩r正因子耦合建模为光滑子流形,并采用Fisher-Rao乘积度量,实现高效的一阶和二阶求解器,在收敛速度和性能上超越现有方法。

详情
AI中文摘要

低秩最优传输(OT)缓解了经典求解器的二次缩放问题,但现有方法严重依赖需要仔细调整超参数且忽略优化景观曲率的一阶镜像下降更新。为了解决这些局限性,我们提出了一个统一的低秩OT黎曼几何框架,将平衡和不平衡秩$r$正因子耦合建模为正象限的新型光滑嵌入子流形。通过为这些流形配备Fisher-Rao乘积度量,我们推导出黎曼投影、收缩和Hessian-向量积的可处理公式。我们的成本无关框架无缝扩展到线性OT、Gromov-Wasserstein(GW)、融合GW及其不平衡对应物。对于平衡OT,我们的几何成分通过高效的共轭梯度和迭代Bregman更新计算。对于不平衡OT,我们的操作优雅地简化为闭式缩放,完全消除了内部迭代循环。在两种情况下,每次迭代的复杂度与数据集大小呈线性关系,并且我们提供了用于全局最优性验证的秩充分性证书。跨一系列问题规模的大量实验表明,我们的无正则化一阶和二阶求解器在收敛速度和性能上优于现有最先进的低秩OT求解器。

英文摘要

Low-rank optimal transport (OT) mitigates the quadratic scaling of classical solvers, yet existing approaches rely heavily on first-order mirror-descent updates that require careful hyperparameter tuning and ignore the optimization landscape's curvature. To address these limitations, we propose a unified Riemannian geometric framework for low-rank OT, modeling balanced and unbalanced rank-$r$ positive factored couplings as novel smooth embedded submanifolds of the positive orthant. By equipping these manifolds with the Fisher-Rao product metric, we derive tractable formulations for Riemannian projectors, retractions, and Hessian-vector products. Our cost-agnostic framework seamlessly extends to linear OT, Gromov-Wasserstein (GW), fused GW, and their unbalanced counterparts. For balanced OT, our geometric ingredients are computed via efficient conjugate-gradient and iterative Bregman updates. For unbalanced OT, our operations elegantly reduce to closed-form scalings, completely eliminating inner iterative loops. In both regimes, per-iteration complexity scales linearly with dataset size, and we provide a rank-sufficiency certificate for global optimality verification. Extensive experiments across a range of problem sizes demonstrate that our regularization-free first- and second-order solvers achieve faster convergence and superior performance over existing state-of-the-art low-rank OT solvers.

2606.11263 2026-06-11 math.ST cs.LG math.NA math.PR 交叉投稿

Geometric bias in eigenspace perturbation under random heterogeneous noise

随机异质噪声下特征空间扰动的几何偏差

Fengkai Liu, Ke Wang, Wanjie Wang

AI总结 针对稀疏、异质方差噪声下的信号加噪声矩阵,研究发现经验特征向量存在经典扰动界无法捕捉的系统性几何偏差,并通过二次向量方程和精细各向同性局部律推导了最优非渐近扰动界。

详情
Comments
104 pages, 1 figure
AI中文摘要

谱方法从根本上依赖于主特征空间在随机扰动下的稳定性。经典上,这种稳定性由 Davis-Kahan 和 Wedin 定理量化,这些定理利用噪声的算子范数和相关谱间隙来界定特征空间误差。虽然这些最坏情况界对于任意确定性扰动是紧的,但在低秩信号加随机噪声的设置中可能造成浪费,因为它们未能捕捉信号几何与噪声分布之间的细粒度相互作用。在本文中,我们研究了被具有任意非齐次方差剖面的稀疏随机噪声破坏的信号加噪声矩阵的谱扰动。我们证明,在异质噪声方差下,经验特征向量遭受系统性的、确定性的几何偏差,这种偏差完全不为经典扰动界所见。通过利用二次向量方程并建立精细的各向同性局部律,我们推导了在算子范数和 $2\to\infty$ 范数下前导特征空间的近最优、非渐近扰动界。这些界将通常的信噪比贡献、随机波动和由信号特征空间与行方差剖面对齐决定的结构化几何偏差项分离开来。

英文摘要

Spectral methods rely fundamentally on the stability of principal eigenspaces under random perturbations. Classically, this stability is quantified by the Davis-Kahan and Wedin theorems, which bound the eigenspace error using the operator norm of the noise and the relevant spectral gaps. While these worst-case bounds are sharp for arbitrary deterministic perturbations, they can be wasteful in the low-rank signal-plus-random-noise setting, as they fail to capture the fine-grained interaction between the signal geometry and the noise distribution. In this paper, we study the spectral perturbation of signal-plus-noise matrices corrupted by sparse, random noise with an arbitrary, inhomogeneous variance profile. We demonstrate that under heterogeneous noise variances, the empirical eigenvectors suffer a systematic, deterministic geometric bias that is entirely invisible to classical perturbation bounds. By leveraging the Quadratic Vector Equation (QVE) and establishing fine-grained isotropic local laws, we derive near-optimal, non-asymptotic perturbation bounds for the leading eigenspaces in the operator and $2\to\infty$ norms. The bounds separate the usual signal-to-noise contribution, stochastic fluctuations, and structured geometric bias terms determined by the alignment between the signal eigenspaces and the row-wise variance profile.

2606.11283 2026-06-11 cs.DS cs.LG stat.ML 交叉投稿

Fixed-Parameter Tractability of Private Synthetic Data Generation

私有合成数据生成的固定参数可处理性

Badih Ghazi, Cristóbal Guzmán, Pritish Kamath, Alexander Knop, Ravi Kumar, Pasin Manurangsi

AI总结 研究差分隐私下合成数据生成问题,通过查询族关联图的树宽参数建立固定参数可处理性,提出两种最优算法。

详情
AI中文摘要

我们研究在差分隐私下生成合成数据的问题。我们建立了该问题的固定参数可处理性(FPT),其中参数是查询族关联图的树宽。我们的算法在所有情况下都达到最优错误率,并通过两种不同方法实现:第一种基于线性规划(LP)和LP对偶分离问题的FPT;第二种基于子采样私有乘法权重方法,其中我们获得了从吉布斯分布采样的FPT。两种方法都通过树分解上的动态规划框架统一。

英文摘要

We study the problem of generating synthetic data under differential privacy. We establish fixed-parameter tractability (FPT) for this problem where the parameter is the treewidth of the query family's incidence graph. Our algorithms attain optimal error rates across all regimes and are realized by two different approaches: the first is based on linear programming (LP) and the FPT of the separation problem for the LP dual; the second is based on a subsampled private multiplicative weights method, where we obtain FPT for sampling from Gibbs distributions. Both approaches are unified by a dynamic programming framework over a tree decomposition.

2606.11339 2026-06-11 math.OC cs.AI cs.LG eess.SY stat.ML 交叉投稿

Quantized Stochastic Primal-Dual Methods for Distributed Optimization under Relaxed Global Geometry

松弛全局几何下分布式优化的量化随机原始-对偶方法

Susmit Sarkar, Abhinav Raghuvanshi, Kushal Chakrabarti, Mayank Baranwal

AI总结 提出量化随机原始-对偶方法q-PDGD,在松弛全局几何下证明线性收敛到邻域或O(1/k)收敛,匹配最优集中随机复杂度。

详情
Comments
Accepted to UAI
AI中文摘要

我们研究具有随机梯度和有限比特通信(由随机(无偏)量化建模)的分布式优化。我们提出q-PDGD,一种量化的随机原始-对偶方法,并在松弛全局几何下对其进行分析。在受限割线不等式(RSI)下,常数步长产生线性收缩到由梯度噪声、量化失真和网络连通性确定的显式邻域,而递减步长在没有共享最小化器假设的情况下实现O(1/k)收敛。在Polyak-Łojasiewicz(PL)不等式下,我们在相同的随机量化设置中获得线性到邻域的收敛。我们的结果在预言机复杂度上匹配已知最优的集中式随机速率,并通过实验验证了量化水平、步长选择和图结构之间的预测权衡。

英文摘要

We study distributed optimization with stochastic gradients and finite-bit communication modeled by random (unbiased) quantization. We propose q-PDGD, a quantized stochastic primal-dual method, and analyze it under relaxed global geometry. Under the restricted secant inequality (RSI), a constant step-size yields linear contraction to an explicit neighborhood determined by gradient noise, quantization distortion, and network connectivity, while a diminishing step-size achieves O(1/k) convergence without shared-minimizer assumptions. Under the Polyak-Łojasiewicz (PL) inequality, we obtain linear-to-neighborhood convergence in the same stochastic quantized setting. Our results match the best-known centralized stochastic rates in oracle complexity, and are supported by experiments demonstrating the predicted tradeoffs between quantization level, step-size choice, and graph structure.
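Random (unbiased) quantization can be illustrated with stochastic rounding to a uniform grid. This is one standard unbiased scheme, chosen here as an assumption, since the abstract does not say which quantizer q-PDGD uses. The grid step, the input value, and the function name are hypothetical.

```python
import math
import random


def stochastic_round(x, step, rng):
    """Round x to a multiple of `step`, up or down at random, so E[Q(x)] = x."""
    lower = math.floor(x / step) * step
    p_up = (x - lower) / step          # probability of rounding up
    return lower + step if rng.random() < p_up else lower


rng = random.Random(0)
x, step = 0.37, 0.25                   # two grid neighbours: 0.25 and 0.5
draws = [stochastic_round(x, step, rng) for _ in range(20000)]
empirical_mean = sum(draws) / len(draws)
print(sorted(set(draws)), empirical_mean)
```

Each message takes one of two grid values, so it fits in a few bits, while the average over many draws stays close to the unquantized value. The per-draw variance is the quantization distortion that enters the size of the convergence neighborhood.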

2606.11347 2026-06-11 stat.ML cs.LG math.OC 交叉投稿

Annealed Entropic Allocation for Ranking and Selection

退火熵分配用于排序与选择

Xin Fei, Juergen Branke

AI总结 提出退火熵分配框架,通过加权log-sum-exp替代非光滑极大极小大偏差率目标,结合鞍点近似提升有限预算下的区分能力,数值实验表明在多个候选接近时性能优异。

详情
AI中文摘要

我们提出了退火熵分配,一种用于排序与选择中顺序预算分配的退火加权软最小化框架。核心思想是用加权log-sum-exp替代非光滑的极大极小大偏差率目标,该替代通过软最小化权重聚合特定候选对的得分,从而在多个候选几乎同时活跃时缓解硬切换。为了提升有限预算下的区分能力,我们引入了鞍点近似——一种从精细化的成对尾部渐近性导出的次指数修正。由于这些修正是次指数的,且平滑参数退火至零,该替代保持了与经典极大极小公式相同的一阶大偏差目标。我们证明了该替代一致收敛于硬最小值,软最小化权重集中于活跃候选,并且在固定权重下,诱导的目标分配映射在单纯形内部是连续的。在高斯和指数实例上的数值实验展示了竞争性能,尤其是在多个候选几乎持平时。

英文摘要

We propose Annealed Entropic Allocation, an annealed weighted soft-min framework for sequential budget allocation in ranking and selection. The central idea is to replace the non-smooth maximin large-deviation rate objective with a weighted log-sum-exp surrogate that aggregates challenger-specific pairwise scores through soft-min weights, mitigating hard switching when several challengers are nearly active. To improve finite-budget discrimination, we incorporate the saddlepoint approximation -- a sub-exponential correction derived from refined pairwise tail asymptotics. Because these corrections are sub-exponential and the smoothing parameter is annealed to zero, the surrogate preserves the same first-order large-deviation target as the classical maximin formulation. We show that the surrogate converges uniformly to the hard minimum, that the soft-min weights concentrate on the active challengers, and that, under fixed weights, the induced target allocation map is continuous on the simplex interior. Numerical experiments on Gaussian and exponential instances demonstrate competitive performance, especially when multiple challengers are nearly tied.
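The smoothing idea can be sketched with a generic weighted log-sum-exp soft-min. The exact surrogate, the saddlepoint correction, and the annealing schedule of the paper are not given in the abstract, so the form below, the pairwise scores, and the base weights are assumptions made for illustration.

```python
import math


def soft_min(scores, weights, temperature):
    """Weighted log-sum-exp surrogate: -t * log(sum_i w_i * exp(-s_i / t))."""
    s_min = min(scores)  # shift by the minimum for numerical stability
    total = sum(w * math.exp(-(s - s_min) / temperature)
                for s, w in zip(scores, weights))
    return s_min - temperature * math.log(total)


def soft_min_weights(scores, weights, temperature):
    """Normalized weights p_i proportional to w_i * exp(-s_i / t)."""
    s_min = min(scores)
    raw = [w * math.exp(-(s - s_min) / temperature)
           for s, w in zip(scores, weights)]
    total = sum(raw)
    return [r / total for r in raw]


scores = [1.00, 1.05, 3.00]            # hypothetical pairwise rate scores
weights = [1 / 3, 1 / 3, 1 / 3]        # base weights on the simplex

annealed = {t: soft_min(scores, weights, t) for t in (1.0, 0.1, 0.001)}
warm = soft_min_weights(scores, weights, 1.0)
cold = soft_min_weights(scores, weights, 0.01)
print(annealed, warm, cold)
```

At a high temperature the two nearly tied scores share the weight, which avoids hard switching between them. As the temperature is annealed toward zero the surrogate approaches the hard minimum from above and the weight concentrates on the smallest score.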

2606.11437 2026-06-11 cs.DS cs.AI cs.LG stat.ML 交叉投稿

The Power of Test-Time Training for Approximate Sampling

测试时训练对近似采样的威力

Noah Golowich, Ankur Moitra, Dhruv Rohatgi

AI总结 本文形式化测试时训练(TTT)为从已知分布类中采样的问题,证明查询复杂度的二次下界,并展示在分布类大小受限时可规避该下界,为TTT提供理论框架。

详情
AI中文摘要

从复杂概率分布中高效采样是一个基本问题,近年来随着生成式AI的兴起,这一问题变得越来越重要,因为人们已提出从大语言模型(LLM)中采样的复杂程序来解决具有挑战性的推理问题。然而,这类采样算法的有效性受到LLM与特定采样任务之间关系的限制,这推动了测试时训练(TTT)框架的发展。TTT通过根据推理时收到的部分生成和奖励反馈更新模型权重来工作,从而适应特定问题。在这项工作中,我们提出了一种TTT的形式化,将其定义为从属于已知分布类$F$的给定概率测度$\mu^\star$中生成样本的问题,给定一个提供$\mu^\star$近似密度估计的预言机$\hat \mu$。这与Jerrum、Valiant和Vazirani(1986)以及Jerrum和Sinclair(1989)的开创性工作中研究的将采样约化为近似计数的问题密切相关:即当$F$是所有分布的类时,它恰好与上述计数到采样的约化一致。在本文中,我们首先证明了在给定对$\hat \mu$的查询访问的情况下,从$\mu^\star$采样的查询复杂度的二次下界(对于足够大的类$F$),从而表明Jerrum和Sinclair(1989)提出并由Hayes和Sinclair(2010)改进的随机游走方法是最优的。这回答了Hayes和Sinclair提出的一个开放问题。然后,我们证明如果$F$的大小适当受限,这个下界可以被规避。正如我们所讨论的,后一个结果可以被视为TTT的抽象,因此代表了为TTT发展一个原则性理论框架的起点。

英文摘要

Efficiently sampling from a complex probability distribution is a fundamental problem which has become increasingly pertinent in recent years with the rise of generative AI, as sophisticated sampling procedures from LLMs have been proposed to solve challenging reasoning problems. The efficacy of such sampling algorithms is limited, however, by the relationship between the LLM and the particular sampling task at hand, which has motivated the framework of test-time training (TTT). TTT works by updating a model's weights in response to partial generations and reward feedback received at inference time, thus adapting to the particular problem. In this work, we propose a formalization for TTT as the problem of producing a sample from a given probability measure $\mu^\star$ belonging to a known class ${F}$ of distributions, given an oracle $\hat \mu$ which yields approximate density estimates for $\mu^\star$. This is closely related to the problem of reducing sampling to approximate counting studied in seminal works of Jerrum, Valiant & Vazirani (1986) and Jerrum & Sinclair (1989): namely, when ${F}$ is the class of all distributions, it coincides exactly with the aforementioned counting-to-sampling reduction. In this paper, we first show a quadratic lower bound on the query complexity of sampling from $\mu^\star$ given query access to $\hat \mu$ (for sufficiently large classes ${F}$), thus showing that the random walk approach proposed by Jerrum & Sinclair (1989) and refined by Hayes & Sinclair (2010) is optimal. This answers an open question posed by Hayes & Sinclair. We then show that this lower bound can be circumvented if the size of ${F}$ is bounded appropriately. As we discuss, this latter result can be viewed as an abstraction of TTT, and thus represents a starting point for the development of a principled theoretical framework for TTT.

2606.11469 2026-06-11 cs.DS cs.LG math.ST 交叉投稿

Density estimation for Hellinger via minimum-distance estimators: mixtures of Gaussians, log-concave, and more

基于最小距离估计量的Hellinger密度估计:高斯混合、对数凹等

Spencer Compton, Jerry Li

AI总结 将最小距离估计方法从总变差距离扩展到Hellinger距离,通过反向数据处理不等式,实现了对对数凹混合和高斯混合(任意方差)的近线性时间学习,样本复杂度接近最优。

详情
AI中文摘要

我们研究密度估计任务,希望从$n$个样本中准确估计概率密度。在总变差距离下,密度估计的经典方法是最小距离估计量方法,其中我们仅通过限制特定概念类(即Yatracos类)的VC维即可得到算法和分析。虽然该技术最初主要针对总变差距离给出了精确保证,但在本文中,我们将最小距离估计量方法扩展到Hellinger距离下的学习。我们的主要观察是,通过联系最近得到反向数据处理不等式的结果,我们可以为Hellinger距离生成类似的方案(其中我们只需要限制相关概念类的VC维)。该方案足够灵活,可以容纳最初为总变差距离设计的快速算法;通过修改Acharya等人(2017)的方法,我们首次得到了近线性时间算法,用于学习包括单变量对数凹密度混合和高斯混合(具有任意方差)在内的类别,且样本复杂度接近最优。

英文摘要

We study the task of density estimation, where we hope to accurately estimate a probability density from $n$ samples. A textbook method for density estimation in total variation distance is the minimum-distance estimator approach, where we conclude both the algorithm and the analysis merely from bounding the VC dimension of a particular concept class (the so-called Yatracos class). While this technique has originally yielded sharp guarantees primarily for total variation distance, in this work we extend the minimum-distance estimator approach for learning within Hellinger distance. Our main observation is that we may produce an analogous recipe for Hellinger (where we only require bounding the VC dimension of a related concept class) by drawing connections to recent results yielding reverse data processing inequalities. This recipe is flexible enough to accommodate fast algorithms originally designed for total variation distance; by modifying the approach of Acharya et al. (2017) we conclude the first near-linear time algorithm for learning classes including univariate mixtures of log-concave densities and mixtures of Gaussians (with arbitrary variances), with near-optimal sample complexity.

2606.11629 2026-06-11 math.DS cs.LG 交叉投稿

Integral Formulation of QENDy for Robust Nonlinear System Identification

QENDy的积分形式用于鲁棒非线性系统辨识

Nikhil Saran, Sushant Pokhriyal, Stefan Klus, Rushikesh Kamalapurkar, Joel A. Rosenfeld

AI总结 提出QENDy方法的积分形式,避免使用时间导数,从而增强对噪声的鲁棒性,实现更稳健的非线性动力学学习。

详情
AI中文摘要

本文提出了新定义的非线性系统二次嵌入方法(QENDy)的积分形式。在原始算法中,使用了轨迹数据点及其时间导数。计算时间导数的方法使算法对噪声敏感。我们的积分形式不使用时间导数,从而得到一种更鲁棒的动力学学习方法。

英文摘要

This manuscript proposes an integral formulation of the newly defined quadratic embedding method for identifying nonlinear systems (QENDy). In the original algorithm, trajectory data points along with their time derivatives are used. Methods for calculating time derivatives make the algorithm sensitive to noise. Our integral formulation does not use the time derivatives. This results in a more robust method to learn the dynamics.
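The contrast between derivative-based and integral-based identification can be illustrated on a toy problem. The sketch below is not QENDy itself: it assumes a scalar decay model $dx/dt = -a x$ (a hypothetical stand-in), and all function names are illustrative. It fits $a$ once from finite-difference derivatives and once from the integrated relation $x(t) = c - a\int_0^t x(s)\,ds$, which never differentiates the noisy data.

```python
import numpy as np

def cumulative_trapezoid(values, t):
    """Running trapezoidal integral of `values` over the grid `t`, starting at 0."""
    steps = 0.5 * (values[1:] + values[:-1]) * np.diff(t)
    return np.concatenate([[0.0], np.cumsum(steps)])

def fit_decay_rate_integral(x, t):
    """Fit a in dx/dt = -a*x via x(t) = c - a * int_0^t x(s) ds (no derivatives)."""
    integral = cumulative_trapezoid(x, t)
    design = np.column_stack([np.ones_like(t), -integral])
    (c, a), *_ = np.linalg.lstsq(design, x, rcond=None)
    return a

def fit_decay_rate_derivative(x, t):
    """Fit a by regressing finite-difference derivatives on x (noise-sensitive)."""
    dxdt = np.gradient(x, t)
    return -(x @ dxdt) / (x @ x)

# Toy data: true rate a = 1.5, small additive measurement noise (assumed setup).
rng = np.random.default_rng(0)
t = np.linspace(0.0, 2.0, 201)
x_noisy = np.exp(-1.5 * t) + 0.01 * rng.standard_normal(t.size)

a_integral = fit_decay_rate_integral(x_noisy, t)
a_derivative = fit_decay_rate_derivative(x_noisy, t)
print(f"integral fit: {a_integral:.3f}, derivative fit: {a_derivative:.3f}")
```

Integration averages the measurement noise, whereas differencing amplifies it by roughly the inverse of the sampling step. That is the general motivation for integral formulations, independent of the specific embedding used in the paper.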

2606.11738 2026-06-11 stat.ML cs.LG 交叉投稿

Renewable Lasso without Batch-Number Constraints: A Gradient-Enhanced Approach

无批次数量约束的可再生Lasso:一种梯度增强方法

Junzhuo Gao, Ling Peng, Xu Guo, Heng Lian

AI总结 针对高维广义线性模型的流数据在线估计,提出梯度增强替代损失函数,消除批次数量约束,并扩展到分布式流数据场景,理论推导非渐近误差界,实验验证精度提升。

详情
AI中文摘要

我们研究具有流数据的高维广义线性模型的在线估计。首先,针对非分布式设置,我们提出一种梯度增强替代损失函数,仅使用历史摘要近似累积损失,修改并改进了现有高维设置下同一模型的可再生估计方法,并消除了先前研究中的批次数量约束。然后,我们将该方法扩展到主从架构下的分布式流数据,其中批次按站点划分,仅交换摘要(梯度向量)。我们没有将Jordan等人(2019)的流行方法直接应用于替代二次损失,而是采用一种调整后的方法,不要求客户端计算完整的替代损失。我们在高维尺度下推导了非渐近误差界,且不需要先前研究中严格的批次数量约束。在线性和逻辑模型下的模拟结果以及实际数据应用表明,与现有的可再生估计器相比,精度有所提高。

英文摘要

We study online estimation for high-dimensional generalized linear models with streaming data. First, for the non-distributed setting, we propose a gradient-enhanced surrogate loss that approximates the cumulative loss using only historical summaries, which modifies and improves upon the existing renewable estimation approach for the same model in the high-dimensional setting, and removes the batch-number constraint in previous studies. We then extend the method to distributed streaming data under the master-client architecture, where batches are partitioned across sites and only summaries (gradient vectors) are exchanged. Instead of directly applying the popular method of Jordan et al. (2019) to the surrogate quadratic loss, our adjusted approach does not require the clients to compute the full surrogate loss. We derive non-asymptotic error bounds under the high-dimensional scaling, without the stringent constraint on the number of batches in the previous studies. Simulation results under linear and logistic models, together with a real-data application, show improved accuracy over existing renewable estimators.

2606.11773 2026-06-11 math.OC cs.LG 交叉投稿

Last-Iterate Convergence of Optimistic Multiplicative Weight Update

乐观乘性权重更新的最后迭代收敛性

Francesco Orabona

AI总结 本文证明乐观乘性权重更新(OMWU)在光滑凸-凹鞍点问题中以足够小的常数学习率渐近收敛,无需唯一性、严格互补性、误差界或接近解的初始化。

详情
AI中文摘要

乐观梯度下降上升(OGDA)和乐观乘性权重更新(OMWU)是求解凸/凹鞍点问题的两种非常流行的算法,其中OMWU是OGDA的非欧几里得熵版本。自80年代以来,人们已知OGDA的最后迭代在光滑问题中渐近收敛到鞍点。另一方面,OMWU是否具有相同性质尚不清楚。在本文中,我证明了OMWU对于光滑凸-凹鞍点问题,在足够小的常数学习率下渐近收敛。该结果不需要唯一性、严格互补性、误差界或接近解的初始化。主要的新成分是一个边界论证,表明每个聚点满足非活动坐标的KKT不等式。该边界论证是在ChatGPT的协助下发现的,并在附录中记录。

英文摘要

Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative-Weights Update (OMWU) are two very popular algorithms to solve convex/concave saddle-point problems, where OMWU is the non-Euclidean, entropic version of OGDA. It is known since the '80s that the last iterate of OGDA asymptotically converges to a saddle point in smooth problems. On the other hand, it is unknown if OMWU has the same property. In this paper, I show that OMWU converges asymptotically for smooth convex-concave saddle-point problems, with a small enough constant learning rate. The result does not require uniqueness, strict complementarity, an error bound, or initialization near a solution. The main new ingredient is a boundary argument showing that every cluster point satisfies the inactive-coordinate KKT inequalities. The boundary argument was discovered with assistance from ChatGPT and is documented in the appendix.
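For readers unfamiliar with OMWU, the following is a minimal sketch of its commonly stated form on a bilinear matrix game $\min_x \max_y x^\top A y$ over probability simplices. The rock-paper-scissors payoff matrix, step size, starting points, and function name are illustrative assumptions, not taken from the paper, which treats general smooth convex-concave problems.

```python
import numpy as np

def omwu_matrix_game(A, x0, y0, eta=0.1, steps=10000):
    """Optimistic multiplicative weights for min_x max_y x^T A y on simplices."""
    x, y = np.array(x0, dtype=float), np.array(y0, dtype=float)
    gx_prev, gy_prev = A @ y, A.T @ x  # gradients at the starting point
    for _ in range(steps):
        gx, gy = A @ y, A.T @ x
        # Optimistic step: 2*g_t - g_{t-1} acts as a prediction of the next gradient.
        x_new = x * np.exp(-eta * (2.0 * gx - gx_prev))  # minimizing player
        y_new = y * np.exp(eta * (2.0 * gy - gy_prev))   # maximizing player
        x, y = x_new / x_new.sum(), y_new / y_new.sum()  # renormalize onto the simplex
        gx_prev, gy_prev = gx, gy
    return x, y

# Rock-paper-scissors: the unique saddle point is the uniform strategy.
A_RPS = np.array([[0.0, -1.0, 1.0],
                  [1.0, 0.0, -1.0],
                  [-1.0, 1.0, 0.0]])
x_last, y_last = omwu_matrix_game(A_RPS, [0.6, 0.3, 0.1], [0.2, 0.5, 0.3])
print(x_last, y_last)
```

On this game the last iterates approach the uniform strategy, whereas plain multiplicative weights (replacing `2.0 * gx - gx_prev` with `gx`) is known to cycle away from it. The example has a unique equilibrium, so it does not exercise the non-unique case that the abstract's result covers.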

2606.12058 2026-06-11 stat.ML cond-mat.dis-nn cs.LG 交叉投稿

Phase Transitions in Attention: A Bayesian Theory of Copy Head Emergence

注意力中的相变:复制头涌现的贝叶斯理论

Itay Lavie, Kirsten Fischer, Andrey Lekov, Frederic Van Maele, Zohar Ringel, Moritz Helias

AI总结 通过分析单层softmax注意力网络在复制任务上的训练,提出贝叶斯理论揭示注意力矩阵的后验分布存在相变,并对比线性注意力发现softmax注意力呈现一阶相变。

详情
AI中文摘要

注意力是Transformer中上下文学习的关键机制,经验上观察到注意力模式在训练过程中突然涌现。我们提出了注意力中特征学习的贝叶斯理论;然后通过分析在复制任务上训练的单层softmax注意力网络,专注于归纳头第一层中复制子电路的学习方式。我们推导出注意力矩阵上的闭式后验,并将其简化为低维序参数空间。这种简化揭示了训练数据量上的相变,我们通过贝叶斯采样和使用Adam的标准训练验证了这一点。我们将结果与线性注意力对比,发现softmax注意力表现出\emph{一阶相变},而在线性注意力中,初始的\emph{二阶相变}之后是向结构化注意力模式的平滑连续演化(\emph{交叉})。我们的工作为复制子电路的突然涌现提供了第一性原理的理论解释,这让人联想到在大语言模型训练中观察到的现象。

英文摘要

Attention is the key mechanism underlying in-context learning in transformers, and attention patterns have been observed empirically to emerge abruptly during training. We present a Bayesian theory of feature learning in attention; we then focus on how the copy subcircuit in the first layer of an induction head is learned by analyzing a single-layer softmax attention network trained on a copy task. We derive a closed-form posterior over the attention matrix and reduce it to a low-dimensional order parameter space. This reduction reveals a phase transition in the amount of training data, which we verify using both Bayesian sampling and standard training with Adam. We contrast our results with linear attention and find that softmax attention exhibits a \emph{first-order phase transition} while in linear attention an initial \emph{second-order phase transition} is followed by a smooth, continuous evolution toward the structured attention pattern (\emph{crossover}). Our work provides a first-principles theoretical account of the abrupt emergence of the copy subcircuit, reminiscent of the one observed in training large language models.

2606.12211 2026-06-11 quant-ph cs.LG 交叉投稿

Quantum Occam Learning: Sample-Supported Expressibility for Circuit-Based Quantum Learning

量子奥卡姆学习:基于电路的量子学习中样本支持的表达能力

Jeongho Bang, Kyoungho Cho, Jeongwoo Jae

AI总结 针对有限大小量子电路生成的数据,提出信息论奥卡姆理论,证明样本支持的表达能力定律:在迹距离精度ε下,M个样本最多支持约Mε²个门,将电路复杂度转化为自适应统计资源。

详情
Comments
22 pages (main text + appendix), 2 figures
AI中文摘要

量子机器学习的一个核心原则是,ansatz 应具有足够的表达能力来表示感兴趣的量子数据。然而,只有当能够从有限数量的未知量子态副本中学习时,表达能力才具有统计意义。在这项工作中,我们为有限大小量子电路生成的量子数据开发了一种信息论奥卡姆理论。对于最多使用 $G$ 个双量子比特门可制备的 $n$ 量子比特纯态类 $S_{n,G}$,度量熵论证给出了在电路受限情况下的可实现样本律 $\widetilde{\Theta}(G/\epsilon^2)$。对于任意源 $\hat{\rho}$,我们引入了最佳 $G$ 门近似误差 $d_G(\hat{\rho})$ 和近似电路复杂度 $C_\eta(\hat{\rho})$。我们证明了一个不可知的量子奥卡姆定理:使用 $M$ 个副本,可以学习到最佳 $G$ 门近似误差加上统计惩罚 $\widetilde{O}(\sqrt{G/M})$。然后,通过一个自适应模型选择定理消除了预先知道 $G$ 的需要,该定理的 oracle 不等式选择了数据所支持的电路复杂度。匹配的下界给出了一个样本支持的表达能力定律:在迹距离精度 $\epsilon$ 下,$M$ 个样本只能支持 $G_{\rm supported} \simeq M\epsilon^2$ 个门(忽略对数因子,并在 $2^n$ 处达到层析饱和)。因此,电路复杂度成为一种自适应统计资源,而不是静态承诺。我们的框架将有界电路复杂度转化为量子机器学习的模型选择原则。

英文摘要

A central principle in quantum machine learning is that an ansatz should be expressive enough to represent the quantum data of interest. Yet, the expressibility is statistically meaningful only insofar as it can be learned from finitely many copies of an unknown quantum state. In this work, we develop an information-theoretic Occam theory for quantum data generated by finite-size quantum circuits. For the class $S_{n,G}$ of $n$-qubit pure states preparable with at most $G$ two-qubit gates, a metric-entropy argument gives the realizable sample law $\widetilde{\Theta}(G/\epsilon^2)$ in the circuit-limited regime. For an arbitrary source $\hat{\rho}$, we introduce the best $G$-gate approximation error $d_G(\hat{\rho})$ and the approximate circuit complexity $C_\eta(\hat{\rho})$. We prove an agnostic quantum Occam theorem: with $M$ copies, one can learn up to the best $G$-gate approximation error plus a statistical penalty $\widetilde{O}(\sqrt{G/M})$. We then remove the need to know $G$ in advance through an adaptive model-selection theorem whose oracle inequality selects the circuit complexity justified by the data. Matching lower bounds yield a sample-supported expressibility law: at trace-distance accuracy $\epsilon$, $M$ samples can support only $G_{\rm supported} \simeq M\epsilon^2$ gates, up to logarithmic factors and tomography saturation at $2^n$. Thus, the circuit complexity becomes an adaptive statistical resource rather than a static promise. Our framework turns bounded circuit complexity into a model-selection principle for quantum machine learning.
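As a rough consistency check on the quantities quoted above (a heuristic reading of the abstract, not the paper's argument), requiring the statistical penalty to stay below the target accuracy reproduces the supported gate count. Logarithmic factors and the $2^n$ saturation are ignored, and the numerical values are purely illustrative.

```latex
\sqrt{\frac{G}{M}} \lesssim \epsilon
\quad\Longleftrightarrow\quad
G \lesssim M\epsilon^{2},
\qquad\text{e.g. } M = 10^{6},\ \epsilon = 10^{-2}
\ \Longrightarrow\ G_{\rm supported} \simeq 10^{6}\cdot 10^{-4} = 10^{2}.
```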

2606.12279 2026-06-11 cs.NE cs.AI cs.LG 交叉投稿

Mathematical perspective on genetic algorithms with optimization guided operators

遗传算法与优化引导算子的数学视角

Anna Brandenberger, Ilan Doron-Arad, Elchanan Mossel

AI总结 本文从数学角度建模遗传算法,将优化问题转化为查询复杂度问题,并证明某些问题必须依赖生成、变异和重组算子,同时揭示了多样性在解池中的关键作用。

详情
Comments
18 pages, 1 figure
AI中文摘要

近期机器学习工作将遗传算法应用于推理阶段,以迭代改进优化问题的解。所涉及的基本变异和重组算子在性质上不同于经典研究。变异不再是随机的;机器学习算法以改进目标为目的对解进行变异。同样,重组不再基于父代解的随机拼接,而是基于机器学习的优化算子,其目标是从输入中合成改进的解。因此,这些变异和重组算子更有可能改进目标,但其计算成本更高。我们引入了一个遗传算法的通用模型,并使用强化学习的语言将优化问题表述为查询复杂度问题。然后我们研究专门模型。我们证明某些优化问题必须通过生成、变异和重组来解决。接着,我们在此框架内为一类问题获得了定性紧的算法,该算法捕捉了解池中多样性的非平凡作用,这是实际机器学习遗传算法的一个关键特征。

英文摘要

Recent work in ML applies genetic algorithms at inference time to iteratively improve solutions to optimization problems. The basic mutation and recombination operators involved are qualitatively different from those studied classically. Mutations are no longer random; an ML algorithm mutates a solution with the goal of improving an objective. Similarly, recombination is not based on random collages of parent solutions. Instead, it is an ML optimization-based operator whose goal is to synthesize improved solutions from its inputs. Thus, these mutation and recombination operators are more likely to improve the objective, but their computational cost is much higher. We introduce a general model of genetic algorithms and formulate optimization in this model as a query-complexity problem, using the language of reinforcement learning. We then study specialized models. We show that some optimization problems require generation, mutation, and recombination to be solved. We then obtain qualitatively tight algorithms for a family of problems within this framework that captures the nontrivial role of diversity in the solution pool, a key feature of practical ML genetic algorithms.

2606.12337 2026-06-11 math.NA cs.LG 交叉投稿

Adjoint Method versus Physics-Informed Neural Networks in PDE-Constrained Inverse Problems

伴随方法与物理信息神经网络在PDE约束逆问题中的比较

Zhen Zhang, Alessandro Alla, George Em Karniadakis

AI总结 针对PDE约束逆问题,公平比较伴随优化与PINN,发现未知参数表示决定方法选择:网格场适合伴随,神经表示适合PINN;PINN在时间依赖问题中成本更低,且可预热启动伴随。

详情
Comments
35 pages, 10 figures
AI中文摘要

由偏微分方程(PDE)控制的逆问题是计算力学的核心,通常通过伴随优化求解,而物理信息神经网络(PINN)已成为一种灵活的替代方案。由于这两种方法通常在不同公式、参数化、优化器和正则化选择下进行比较,因此它们的相对性能难以评估。我们针对PDE约束逆问题,对伴随优化和PINN进行了公平比较。从共同的抽象公式出发,我们在相同的域、控制方程、观测模型和正则化项上实例化两种方法,并在适用情况下匹配优化器、未知参数化和算术精度。基准测试包括非定常Burgers方程、噪声达西渗透率反演、三维Allen-Cahn反应识别和非定常Navier-Stokes粘度识别。结果表明,未知参数的表示在很大程度上决定了首选方法:基于网格的场有利于离散伴随,而神经表示是PINN的原生方法,适用于封闭和本构建模。对于时间依赖问题,伴随反演可能因轨迹存储和微分而成本高昂,而PINN以较低成本提供令人满意的重建。然后,PINN预热启动的伴随策略以大幅降低的成本恢复伴随级别的精度。

英文摘要

Inverse problems governed by partial differential equations (PDEs) are central to computational mechanics and are commonly solved by adjoint-based optimization, while physics-informed neural networks (PINNs) have emerged as a flexible alternative. Their relative performance remains difficult to assess because the two approaches are often compared under different formulations, parameterizations, optimizers, and regularization choices. We present a fair comparison of adjoint optimization and PINNs for PDE-constrained inverse problems. From a common abstract formulation, we instantiate both methods on identical domains, governing equations, observation models, and regularization terms, while matching the optimizer, unknown parameterization, and arithmetic precision wherever applicable. The benchmarks include unsteady Burgers, noisy Darcy permeability inversion, three-dimensional Allen--Cahn reaction identification, and unsteady Navier--Stokes viscosity identification. The results show that the representation of the unknown largely determines the preferred method: grid-based fields favor the discrete adjoint, whereas neural representations are native to PINNs and relevant for closure and constitutive modeling. For time-dependent problems, adjoint inversion can be dominated by trajectory storage and differentiation, while PINNs provide satisfactory reconstructions at lower cost. A PINN-warm-started adjoint strategy then recovers adjoint-level accuracy at substantially reduced cost.

2505.13196 2026-06-11 cs.LG cs.AI quant-ph 版本更新

A Physics-Inspired Optimizer: Velocity Regularized Adam

一种受物理启发的优化器:速度正则化Adam

Pranav Vaidhyanathan, Lucas Schorling, Natalia Ares, Maike Osborne

AI总结 本文提出VRAdam优化器,通过引入速度正则化技术,结合Adam的逐参数缩放,提升训练稳定性与收敛速度,理论分析显示其在非凸目标下的收敛速率为O(ln(N)/√N)。

详情
Comments
L. Schorling and P. Vaidhyanathan contributed equally to this work. 20 pages, 10 figures
AI中文摘要

我们介绍了速度正则化Adam(VRAdam),一种受物理启发、用于训练深度神经网络的优化器。该优化器借鉴了动能中四次项的思想,该项对多种系统动力学具有稳定作用。先前的算法,包括普遍使用的Adam,在训练过程中运行于所谓的自适应稳定性边缘区域,导致快速振荡和损失收敛变慢。相比之下,VRAdam基于速度对学习率施加更高阶的惩罚,使得算法在权重更新变大时自动减速。实践中,我们观察到在高速度区域,有效动态学习率会缩小,从而抑制振荡。通过将这种用于全局阻尼的速度正则化与Adam的逐参数缩放相结合,我们得到了一个强大的混合优化器。对于该优化器,我们从物理和控制的角度,对带动量时在稳定性边缘的运行进行了严格的理论分析。此外,我们在温和假设下推导了随机非凸目标的收敛界,收敛速率为$\mathcal{O}(\ln(N)/\sqrt{N})$。我们证明VRAdam的性能优于包括AdamW在内的标准优化器。我们使用多种架构和训练方法(包括卷积神经网络(CNN)、Transformer和GFlowNets),在图像分类、语言建模和生成建模等任务上进行了基准测试。

英文摘要

We introduce Velocity-Regularized Adam (VRAdam), a physics-inspired optimizer for training deep neural networks that draws on ideas from quartic terms for kinetic energy with its stabilizing effects on various system dynamics. Previous algorithms, including the ubiquitous Adam, operate at the so-called adaptive edge of stability regime during training, leading to rapid oscillations and slowed convergence of loss. However, VRAdam adds a higher order penalty on the learning rate based on the velocity such that the algorithm automatically slows down whenever weight updates become large. In practice, we observe that the effective dynamic learning rate shrinks in high-velocity regimes, damping oscillations. By combining this velocity-based regularizer for global damping with per-parameter scaling of Adam, we create a powerful hybrid optimizer. For this optimizer, we provide rigorous theoretical analysis of operation at the edge of stability from a physical and control perspective for the momentum. Furthermore, we derive convergence bounds with the rate $\mathcal{O}(\ln(N)/\sqrt{N})$ for a stochastic nonconvex objective under mild assumptions. We demonstrate that VRAdam outperforms standard optimizers including AdamW. We benchmark various tasks such as image classification, language modeling, and generative modeling using diverse architectures and training methodologies including Convolutional Neural Networks (CNNs), Transformers, and GFlowNets.

2512.22088 2026-06-11 cs.LG cs.AI cs.CL 版本更新

Unifying Learning Dynamics and Generalization in Transformers Scaling Law

统一Transformer缩放定律中的学习动力学与泛化

Chiwun Yang

AI总结 本文通过将Transformer学习动力学形式化为ODE系统并近似为核行为,严格分析了随机梯度下降训练下的泛化误差,揭示了计算资源缩放时泛化误差的指数衰减与幂律衰减的两阶段相变,并建立了紧的上下界。

详情
Comments
87 pages, 10 figures, 3 tables
AI中文摘要

缩放定律是大语言模型(LLM)发展的基石,预测了模型性能随计算资源增加而提升。然而,尽管经验上得到验证,其理论基础仍不清晰。本文形式化了基于Transformer的语言模型的学习动力学为一个常微分方程(ODE)系统,然后将该过程近似为核行为。与之前的玩具模型分析不同,我们严格分析了在序列到序列数据上具有任意数据分布的多层Transformer的随机梯度下降(SGD)训练,紧密反映了真实世界条件。我们的分析刻画了随着计算资源随数据缩放时,泛化误差收敛到不可约风险的过程,特别是在优化过程中。我们建立了过剩风险的匹配上下界,其特征是明显的相变。在初始优化阶段,过剩风险相对于计算成本${\sf C}$呈指数衰减。然而,一旦超过特定的资源分配阈值,系统进入统计阶段,泛化误差遵循$\Theta(\mathsf{C}^{-1/7})$的幂律衰减。这些速率通过互补的下界得到证实——统计方面通过信息论的两点约简,优化方面通过一阶预言机论证——使得两阶段定律在常数、对数因子和条件数差距内是紧的。除了这个统一框架,我们的理论还推导了模型大小、训练时间和数据集大小的独立缩放定律,阐明了每个变量如何独立地控制泛化的边界。

英文摘要

The scaling law, a cornerstone of Large Language Model (LLM) development, predicts improvements in model performance with increasing computational resources. Yet, while empirically validated, its theoretical underpinnings remain poorly understood. This work formalizes the learning dynamics of transformer-based language models as an ordinary differential equation (ODE) system, then approximates this process to kernel behaviors. Departing from prior toy-model analyses, we rigorously analyze stochastic gradient descent (SGD) training for multi-layer transformers on sequence-to-sequence data with arbitrary data distribution, closely mirroring real-world conditions. Our analysis characterizes the convergence of generalization error to the irreducible risk as computational resources scale with data, especially during the optimization process. We establish matching upper and lower bounds on the excess risk, characterized by a distinct phase transition. In the initial optimization phase, the excess risk decays exponentially relative to the computational cost ${\sf C}$. However, once a specific resource allocation threshold is crossed, the system enters a statistical phase, where the generalization error follows a power-law decay of $\Theta(\mathsf{C}^{-1/7})$. These rates are certified by complementary lower bounds -- statistical, via an information-theoretic two-point reduction, and optimization-side, via a first-order oracle argument -- rendering the two-stage law tight up to constants, logarithmic factors, and a condition-number gap. Beyond this unified framework, our theory derives isolated scaling laws for model size, training time, and dataset size, elucidating how each variable independently governs the bounds of generalization.

2602.02285 2026-06-11 cs.LG cs.CL math.ST 版本更新

AI4SLT: Empirical Processes in Lean 4 for Formal Statistical Learning Theory

AI4SLT: 基于 Lean 4 的形式化统计学习理论中的经验过程

Yuanhe Zhang, Jason D. Lee, Fanghui Liu

AI总结 本文首次在 Lean 4 中全面形式化统计学习理论,基于经验过程理论,通过人机协作工作流构建了经人工验证的定理证明工具箱,并揭示了教材中的隐含假设。

详情
Comments
Accepted by ICML 2026
AI中文摘要

我们提出了首个基于经验过程理论的统计学习理论(SLT)在 Lean 4 中的全面形式化。我们的端到端形式化基础设施补全了最新 Lean 库中缺失的内容,包括高斯 Lipschitz 集中的完整推导、次高斯过程的 Dudley 熵积分定理,以及具有尖锐速率的(稀疏)最小二乘回归应用。该项目采用人机协作工作流,其中人类设计证明策略,AI 代理执行战术性证明构建,从而产生了经过人工验证的 SLT 的 Lean 4 工具箱。除了实现之外,形式化过程暴露并解决了标准 SLT 教材中的隐含假设和缺失细节,强制对理论进行逐行细粒度理解。这项工作建立了一个可重用的形式化基础,并为机器学习理论的未来发展打开了大门。代码见原文摘要所附链接。

英文摘要

We present the first comprehensive Lean 4 formalization of statistical learning theory (SLT) grounded in empirical process theory. Our end-to-end formal infrastructure implements content missing from the latest Lean library, including a complete development of Gaussian Lipschitz concentration, Dudley's entropy integral theorem for sub-Gaussian processes, and an application to least-squares (sparse) regression with a sharp rate. The project was carried out using a human-AI collaborative workflow, in which humans design proof strategies and AI agents execute tactical proof construction, leading to the human-verified Lean 4 toolbox for SLT. Beyond implementation, the formalization process exposes and resolves implicit assumptions and missing details in standard SLT textbooks, enforcing a granular, line-by-line understanding of the theory. This work establishes a reusable formal foundation and opens the door for future developments in machine learning theory. The code is available at the link given in the original abstract.

2602.11995 2026-06-11 cs.LG 版本更新

Momentum LMS Theory beyond Stationarity: Stability, Tracking, and Regret

超越平稳性的动量LMS理论:稳定性、跟踪与遗憾

Yifei Jin, Xin Zheng, Lei Guo

AI总结 本文研究动量最小均方算法在非平稳时变线性系统中的跟踪性能与遗憾界,通过分析二阶时变随机向量差分方程,证明其快速适应和鲁棒跟踪能力。

详情
Comments
9 pages, 3 figures
AI中文摘要

在大规模数据处理场景中,数据通常以序列流的形式到达,这些序列由具有漂移分布和时变系统参数的复杂系统生成。这种非平稳性挑战了理论分析,因为它违反了i.i.d.(独立同分布)样本的经典假设,需要能够实时更新而无需昂贵重新训练的算法。一种有效的方法应在单次处理每个样本的同时,保持计算和内存复杂度与数据流长度无关。受这些挑战的启发,本文研究了动量最小均方(MLMS)算法作为自适应识别工具,利用其计算简单和在线处理能力。理论上,我们在各种实际条件下推导了MLMS在时变随机线性系统中的跟踪性能和遗憾界。与经典LMS不同,其稳定性可由一阶随机向量差分方程表征,而MLMS由于动量引入额外的动态状态,导致二阶时变随机向量差分方程,其稳定性分析依赖于更复杂的随机矩阵乘积,这构成了一个极具挑战性的问题。在合成和真实数据流上的实验表明,MLMS实现了快速适应和鲁棒跟踪,与我们的理论结果一致,尤其是在非平稳环境中,突显了其在现代流式和在线学习应用中的潜力。

英文摘要

In large-scale data processing scenarios, data often arrive in sequential streams generated by complex systems that exhibit drifting distributions and time-varying system parameters. This nonstationarity challenges theoretical analysis, as it violates classical assumptions of i.i.d. (independent and identically distributed) samples, necessitating algorithms capable of real-time updates without expensive retraining. An effective approach should process each sample in a single pass, while maintaining computational and memory complexities independent of the data stream length. Motivated by these challenges, this paper investigates the Momentum Least Mean Squares (MLMS) algorithm as an adaptive identification tool, leveraging its computational simplicity and online processing capabilities. Theoretically, we derive tracking performance and regret bounds for the MLMS in time-varying stochastic linear systems under various practical conditions. Unlike classical LMS, whose stability can be characterized by first-order random vector difference equations, MLMS introduces an additional dynamical state due to momentum, leading to second-order time-varying random vector difference equations whose stability analysis hinges on more complicated products of random matrices, which poses a substantially challenging problem to resolve. Experiments on synthetic and real-world data streams demonstrate that MLMS achieves rapid adaptation and robust tracking, in agreement with our theoretical results especially in nonstationary settings, highlighting its promise for modern streaming and online learning applications.
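The single-pass, constant-memory update described above can be sketched as follows. This uses the textbook heavy-ball form of momentum LMS, $\theta_{k+1} = \theta_k + \mu\,\varphi_k (y_k - \varphi_k^\top \theta_k) + \beta(\theta_k - \theta_{k-1})$, which is assumed here rather than taken from the paper. The random-walk parameter drift, noise level, and step sizes are likewise illustrative choices.

```python
import numpy as np

def momentum_lms(phis, ys, mu=0.05, beta=0.5):
    """One pass of momentum LMS; returns the estimate after each sample."""
    d = phis.shape[1]
    theta, theta_prev = np.zeros(d), np.zeros(d)
    history = np.empty_like(phis)
    for k, (phi, y) in enumerate(zip(phis, ys)):
        err = y - phi @ theta  # a priori prediction error
        # LMS correction plus a momentum term on the previous parameter change.
        theta_next = theta + mu * err * phi + beta * (theta - theta_prev)
        theta_prev, theta = theta, theta_next
        history[k] = theta
    return history

def simulate_drifting_system(n=2000, drift=1e-3, noise=0.05, seed=0):
    """Linear system y_k = phi_k^T theta_k + noise with random-walk theta_k."""
    rng = np.random.default_rng(seed)
    theta_true = np.empty((n, 3))
    theta_true[0] = np.array([1.0, -2.0, 0.5])
    for k in range(1, n):
        theta_true[k] = theta_true[k - 1] + drift * rng.standard_normal(3)
    phis = rng.standard_normal((n, 3))
    ys = np.einsum("kd,kd->k", phis, theta_true) + noise * rng.standard_normal(n)
    return phis, ys, theta_true

phis, ys, theta_true = simulate_drifting_system()
estimates = momentum_lms(phis, ys)
tracking_error = np.linalg.norm(estimates[-1] - theta_true[-1])
print(f"final tracking error: {tracking_error:.4f}")
```

Each update touches only the current sample and two parameter vectors, so cost and memory do not grow with the stream length. The extra state `theta_prev` is what turns the error recursion into the second-order difference equation mentioned in the abstract.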

2605.11911 2026-06-11 cs.LG 版本更新

Understanding Sample Efficiency in Predictive Coding

理解预测编码中的样本效率

Gaspard Oliviers, Elene Lominadze, Rafal Bogacz

AI总结 本文研究预测编码在样本效率上的优势,通过目标对齐度量分析BP和PC的学习效率,发现PC在深度、狭窄和预训练网络中表现更优,提供机制理解以指导PC参数设计。

详情
AI中文摘要

预测编码(PC)是皮层学习的重要理论。近期研究多比较PC与反向传播(BP)以确定PC是否具有优势。小规模实验表明PC在许多上下文中能更高效地学习,但理论理解仍不明确。本文通过目标对齐度量量化BP和PC的学习效率,推导并验证深度线性网络中目标对齐的解析表达式。研究发现PC的学习效率高于BP,尤其在深度、狭窄和预训练网络中更为明显。还推导了保证PC目标对齐最优的精确条件,并通过实验验证。研究了线性和非线性模型的完整训练轨迹,发现即使部分假设不成立,PC的预测优势仍持续存在。本文提供了对PC比BP在先前工作中观察到更高学习效率的机制理解,并指导如何参数化PC以最有效地学习。

英文摘要

Predictive Coding (PC) is an influential account of cortical learning. Much of recent work has focused on comparing PC to Backpropagation (BP) to find whether PC offers any advantages. Small scale experiments show that PC enables learning that is more sample efficient and effective in many contexts, though a thorough theoretical understanding of the phenomena remains elusive. To address this, we quantify the efficiency of learning in BP and PC through a metric called ``target alignment'', which measures how closely the change in the output of the network is aligned to the output prediction error. We then derive and empirically validate analytical expressions for target alignment in Deep Linear Networks. We show that learning in PC is more efficient than BP, which is especially pronounced in deep, narrow and pre-trained networks. We also derive exact conditions for guaranteed optimal target alignment in PC and validate our findings through experiments. We study full training trajectories of linear and non-linear models, and find the predicted benefits of PC persist in practice even when some assumptions are violated. Overall, this work provides a mechanistic understanding of the higher learning efficiency observed for PC over BP in previous works, and can guide how PC should be parametrised to learn most effectively.
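One plausible way to make the "target alignment" idea concrete is sketched below. It assumes the metric is the cosine similarity between the change in network output after one update and the output prediction error; the abstract does not state the exact formula. Only a backpropagation step on a small two-layer linear network is shown, and the function names and sizes are hypothetical.

```python
import numpy as np

def target_alignment(delta_output, error):
    """Cosine similarity between the output change and the output error (assumed form)."""
    denom = np.linalg.norm(delta_output) * np.linalg.norm(error)
    return float(delta_output @ error / denom)

def bp_step_alignment(W1, W2, x, target, lr=1e-3):
    """Alignment achieved by one gradient step on 0.5 * ||target - W2 W1 x||^2."""
    h = W1 @ x
    y = W2 @ h
    error = target - y
    # Backpropagation updates for a two-layer linear network.
    W2_new = W2 + lr * np.outer(error, h)
    W1_new = W1 + lr * np.outer(W2.T @ error, x)
    delta_output = W2_new @ (W1_new @ x) - y
    return target_alignment(delta_output, error)

rng = np.random.default_rng(0)
W1 = rng.standard_normal((4, 5)) / np.sqrt(5)
W2 = rng.standard_normal((3, 4)) / np.sqrt(4)
x = rng.standard_normal(5)
target = rng.standard_normal(3)

alignment_bp = bp_step_alignment(W1, W2, x, target)
print(f"BP target alignment: {alignment_bp:.3f}")
```

A value of 1 would mean the output moves straight toward the target. For small learning rates the BP value is positive but generally below 1, because the output change is the error multiplied by a non-identity positive-definite matrix.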

2606.09744 2026-06-11 cs.LG cond-mat.dis-nn 版本更新

Learning Dynamics Reveal a Hierarchy of Weight-Induced Layerwise Gram Metrics

学习动力学揭示权重诱导的分层Gram度量层次结构

Claudio Nordio

AI总结 本文研究前馈ReLU网络在固定读出和二次损失下的梯度下降动力学,将其重写为训练集空间上的集体动力学,并揭示深度网络中权重诱导的Gram算子层次结构。

详情
Comments
24 pages. v4: Corrected the hidden-activation dynamics; clarified the concept of field closure. Other minor corrections
AI中文摘要

我们研究具有固定读出和二次损失的前馈ReLU网络。目的是将梯度下降重写为一种集体动力学,而非主要作为权重空间中的动力学,该动力学在训练集空间上定义的场中封闭。对于单隐层,可以从激活动力学中消除权重变量,得到残差的封闭方程,该方程由一个集体核支配,该核分解为输入几何矩阵和动态共激活矩阵。对于更深网络,残差动力学保持清晰的分层核结构。然而,从深度三开始,封闭需要权重诱导的Gram算子层次结构,这些算子介导跨层的信息传输。

英文摘要

We study feed-forward ReLU networks with fixed readout and quadratic loss. The aim is to rewrite gradient descent not primarily as a dynamics in weight space, but as a collective dynamics closed in terms of fields defined on the training-set space. For a single hidden layer, the weight variables can be eliminated from the activation dynamics, yielding a closed equation for the residuals governed by a collective kernel that factorizes into an input-geometric matrix and a dynamical co-activation matrix. For deeper networks, the residual dynamics retains a clean layer-wise kernel structure. However, from depth three onward, closure requires a hierarchy of weight-induced Gram operators that mediate information transport across layers. Moreover, the conjugate-field dynamics is governed by operators satisfying a backward pullback recursion, of which the weight-induced Gram operators are the first nontrivial instances.

2512.11081 2026-06-11 stat.ML cs.LG stat.ME 版本更新

Provable Recovery of Locally Important Signed Features and Interactions from Random Forest

从随机森林中可证明地恢复局部重要符号特征和交互

Kata Vuk, Nicolas Alexander Ihlo, Merle Behr

AI总结 提出一种局部、模型特定的特征与交互重要性方法,通过结合全局和局部决策路径模式,在局部尖峰稀疏模型下可证明地恢复真实信号特征及其交互,并识别特征值大小对预测的驱动方向。

详情
AI中文摘要

特征与交互重要性(FII)方法在监督学习中至关重要,用于评估复杂预测模型中输入变量及其交互的相关性。在许多领域,如个性化医疗,通常需要针对单个预测的局部解释,而不是总结整体特征重要性的全局分数。随机森林(RF)在这些场景中被广泛使用,现有的可解释性方法通常利用树结构和分裂统计量来提供模型特定的见解。然而,对RF的局部FII方法的理论理解仍然有限,这使得如何解释单个预测的高重要性分数变得不明确。我们提出了一种新颖的、局部的、模型特定的FII方法,该方法识别特征在决策路径上的频繁共现,将全局模式与特定测试点路径上的模式相结合。我们证明,在局部尖峰稀疏(LSS)模型下,我们的方法一致地恢复真实的局部信号特征及其交互,并识别出大或小的特征值是否驱动预测。通过模拟研究和真实数据示例,我们展示了我们的方法和理论结果的有用性。

英文摘要

Feature and Interaction Importance (FII) methods are essential in supervised learning for assessing the relevance of input variables and their interactions in complex prediction models. In many domains, such as personalized medicine, local interpretations for individual predictions are often required, rather than global scores summarizing overall feature importance. Random Forests (RFs) are widely used in these settings, and existing interpretability methods typically exploit tree structures and split statistics to provide model-specific insights. However, theoretical understanding of local FII methods for RF remains limited, making it unclear how to interpret high importance scores for individual predictions. We propose a novel, local, model-specific FII method that identifies frequent co-occurrences of features along decision paths, combining global patterns with those observed on paths specific to a given test point. We prove that our method consistently recovers the true local signal features and their interactions under a Locally Spike Sparse (LSS) model and also identifies whether large or small feature values drive a prediction. We illustrate the usefulness of our method and theoretical results through simulation studies and a real-world data example.

2603.09276 2026-06-11 stat.ML cs.LG 版本更新

On Regret Bounds of Thompson Sampling for Bayesian Optimization

关于贝叶斯优化中汤普森采样遗憾界的分析

Shion Takeno, Shogo Iwazaki

AI总结 本文针对高斯过程汤普森采样(GP-TS)方法,在目标函数为GP样本路径的假设下,推导了其遗憾下界、累积遗憾二阶矩上界、期望宽松遗憾上界以及改进的累积遗憾上界,填补了GP-TS在高概率遗憾界方面的空白。

详情
Comments
43 pages, Accepted to ICML 2026
AI中文摘要

我们研究了一种广泛使用的贝叶斯优化方法:高斯过程汤普森采样(GP-TS),并假设目标函数是高斯过程的一个样本路径。与已建立高概率和期望遗憾界的高斯过程上置信界(GP-UCB)相比,GP-TS的大多数分析仅限于期望遗憾。此外,最近关于GP-UCB的宽松遗憾和改进的累积遗憾上界的分析是否能应用于GP-TS仍不清楚。为了填补这些空白,本文给出了几个遗憾界:(i) GP-TS的遗憾下界,这意味着GP-TS的遗憾以概率$\delta$对$1/\delta$呈多项式依赖;(ii) 累积遗憾二阶矩的上界,由此可直接得到关于$\delta$的改进遗憾上界;(iii) 期望宽松遗憾上界;(iv) 关于时间范围$T$的改进累积遗憾上界。在此过程中,我们提供了几个有用的引理,包括放宽近期分析中为获得关于$T$的改进遗憾上界所需的必要条件。

英文摘要

We study a widely used Bayesian optimization method, Gaussian process Thompson sampling (GP-TS), under the assumption that the objective function is a sample path from a GP. Compared with the GP upper confidence bound (GP-UCB) with established high-probability and expected regret bounds, most analyses of GP-TS have been limited to expected regret. Moreover, whether the recent analyses of GP-UCB for the lenient regret and the improved cumulative regret upper bound can be applied to GP-TS remains unclear. To fill these gaps, this paper shows several regret bounds: (i) a regret lower bound for GP-TS, which implies that GP-TS suffers from a polynomial dependence on $1/\delta$ with probability $\delta$, (ii) an upper bound of the second moment of cumulative regret, which directly suggests an improved regret upper bound on $\delta$, (iii) expected lenient regret upper bounds, and (iv) an improved cumulative regret upper bound on the time horizon $T$. Along the way, we provide several useful lemmas, including a relaxation of the necessary condition from recent analysis to obtain improved regret upper bounds on $T$.

6. 高效学习、压缩与部署 28 篇

2606.11270 2026-06-11 cs.LG cs.AI cs.CL 新提交

Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation

量化语言模型蒸馏中的潜意识行为迁移比率

Uwe Konig, Hamza Kazmi, Ruizhe Li, Maheep Chaudhary

AI总结 通过控制教师模型行为强度并蒸馏学生模型,量化了潜意识行为迁移比率,发现迁移具有鲁棒性且呈现不同缩放行为。

详情
AI中文摘要

旨在将良性行为迁移到学生模型的语言模型蒸馏,也可能迁移教师模型中存在的不良特征,这种现象称为潜意识学习。虽然定性证据支持该效应的存在,但其程度尚未被系统表征。本研究以不同的引导强度对两个教师模型(Llama-2-7B-Chat 和 Qwen2.5-7B-Instruct)进行引导(steering),并仅使用良性数据蒸馏学生模型,从而量化潜意识行为迁移比率。使用 GPT-4.1 作为评估器对 100 个 JailbreakBench 提示进行评估,结果表明迁移是鲁棒的,但表现出不同的缩放行为。Llama-2 表现出一个尖锐的阈值($\tau = \{0.25, 0.32\} \ \text{beyond} \ \alpha = -0.15$),而 Qwen2.5 表现出连续且更高水平的迁移($\tau$ 高达 $0.61$)。

英文摘要

Distillation of a language model intended to transfer benign behavior to a student model may also transfer undesirable characteristics if they are present in the teacher model, a phenomenon known as subliminal learning. While qualitative evidence supports the existence of this effect, its magnitude has not been systematically characterized. This study quantifies subliminal behavioral transfer ratios by steering two teacher models (Llama-2-7B-Chat and Qwen2.5-7B-Instruct) at varying steering strengths and distilling student models using only benign data. Evaluation on 100 JailbreakBench prompts, with GPT-4.1 serving as the evaluator, indicates that transfer is robust but exhibits distinct scaling behaviors. Llama-2 demonstrates a sharp threshold ($\tau = \{0.25, 0.32\} \ \text{beyond} \ \alpha = -0.15$), whereas Qwen2.5 displays continuous and higher levels of transfer ($\tau$ up to $0.61$).

2606.11290 2026-06-11 cs.LG cs.AI cs.CL 新提交

FlowBank: Query-Adaptive Agentic Workflows Optimization through Precompute-and-Reuse

FlowBank: 通过预计算与复用实现查询自适应智能体工作流优化

Lingzhi Yuan, Chenghao Deng, Fangxu Yu, Souradip Chakraborty, Mohammad Rostami, Furong Huang

AI总结 提出FlowBank框架,通过预计算多样化工作流并压缩为紧凑组合,在推理时自适应选择最优工作流,平衡性能与成本,在五个基准上平均得分最高且成本可控。

详情
AI中文摘要

基于大型语言模型的多智能体系统日益强大,但当前的智能体工作流优化范式存在令人不满意的权衡。任务级方法花费大量离线计算却只部署单个工作流,导致互补候选未被使用;而查询级方法为每个查询合成新工作流,推理成本高昂。我们的动机分析表明,这些范式更多是互补而非竞争:离线搜索中发现的工作流通常解决不同子集的查询,许多由昂贵查询级生成处理的查询已经可以通过更便宜的预计算工作流解决。这暗示了一个不同的目标:与其寻找一个普遍最佳的工作流或为每个实例重新生成,不如构建一个紧凑的、可复用的互补工作流库,并在推理时自适应地选择。为此,需要解决三个耦合问题:生成互补而非冗余的候选、压缩成小型可部署组合、在性能-成本权衡下为每个查询分配正确的工作流。我们提出FlowBank,一个基于组合的智能体工作流优化的三阶段框架。多样化阶段提出DiverseFlow,引导搜索覆盖未充分覆盖的查询,产生高覆盖率的候选池。精炼阶段提出CuraFlow,将候选池压缩为冗余最小的紧凑组合。匹配阶段将部署建模为查询-工作流二分图上的边值预测,将每个传入查询路由到预测效用最佳的组合成员。在五个基准上,FlowBank在评估方法中实现了最高平均得分,同时保持成本竞争力,相比最强的自动和手工基线分别相对提升4.26%和14.92%。

英文摘要

Large Language Model (LLM)-based multi-agent systems are increasingly powerful, but current agentic workflow optimization paradigms make an unsatisfying trade-off. Task-level methods spend substantial offline compute yet deploy only a single workflow, leaving complementary candidates unused, while query-level methods synthesize a new workflow per query at substantial inference cost. Our motivating analysis shows these paradigms are more complementary than competing: workflows discovered during offline search often solve different subsets of queries, and many queries handled by expensive query-level generation can already be solved by cheaper precomputed workflows. This suggests a different objective: rather than searching for one universally best workflow or regenerating one per instance, we should build a compact bank of reusable, complementary workflows and select among them adaptively at inference time. Doing so requires solving three coupled problems: generating complementary rather than redundant candidates, compressing them into a small deployable portfolio, and assigning each query to the right workflow under a performance-cost trade-off. To this end, we present FlowBank, a three-stage framework for portfolio-based agentic workflow optimization. Diversifying proposes DiverseFlow to steer search toward under-covered queries and produce a high-coverage candidate pool. Curating proposes CuraFlow to compress this pool into a compact portfolio with minimal redundancy. Matching casts deployment as edge-value prediction on a query-workflow bipartite graph and routes each incoming query to the portfolio member with the best predicted utility. Across five benchmarks, FlowBank achieves the highest average score among the evaluated methods while remaining cost-competitive, improving over the strongest automated and handcrafted baselines by 4.26% and 14.92% relative, respectively.
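The curating and matching stages described above can be illustrated with a small sketch. It assumes a greedy marginal-coverage rule for compressing the candidate pool and a plain argmax over predicted utilities for routing; the abstract does not specify how CuraFlow or the edge-value predictor actually work, so `greedy_portfolio`, `match`, and the toy data are hypothetical stand-ins.

```python
def greedy_portfolio(solves, size):
    """Pick up to `size` workflows by marginal coverage.

    solves maps a workflow name to the set of query ids it solves.
    (Assumption: coverage is the curation criterion; the paper may differ.)
    """
    covered, portfolio = set(), []
    for _ in range(size):
        best = max((w for w in solves if w not in portfolio),
                   key=lambda w: len(solves[w] - covered), default=None)
        # Stop early once no remaining workflow adds new coverage.
        if best is None or not solves[best] - covered:
            break
        portfolio.append(best)
        covered |= solves[best]
    return portfolio, covered


def match(predicted_utility, portfolio):
    """Route one query to the portfolio member with the best predicted utility."""
    return max(portfolio, key=lambda w: predicted_utility[w])


# Toy candidate pool: C is redundant with A, D is redundant with B.
pool = {"A": {1, 2, 3}, "B": {3, 4, 5}, "C": {1, 2}, "D": {5}}
portfolio, covered = greedy_portfolio(pool, size=3)
print(portfolio, sorted(covered))
print(match({"A": 0.2, "B": 0.7}, portfolio))
```

In this toy pool the redundant candidates are never selected, which is the behaviour the abstract attributes to a compact portfolio with minimal redundancy. A utility that trades score against cost could be substituted for `predicted_utility` without changing the routing code.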

2606.11473 2026-06-11 cs.LG cs.AI stat.ML 新提交

CRUMB: Efficient Prior Fitted Network Inference via Distributionally Matched Context Batching

CRUMB: 通过分布匹配上下文批处理实现高效先验拟合网络推理

Jamie Heredge, Mattia J. Villani, Pranav Deshpande, Akshay Seshadri, Niraj Kumar

发表机构 * Global Technology Applied Research, JPMorganChase(摩根大通全球技术应用研究)

AI总结 提出CRUMB方法,通过聚类查询、最小化最大均值差异选择训练子集、再执行精确推理,在不重新训练的情况下加速先验拟合网络推理,在51个数据集上优于同类方法。

详情
Comments
26 pages, 13 figures
AI中文摘要

先验拟合网络(PFNs)是一类有前景的表格基础模型,执行上下文学习,其中整个带标签的训练集作为上下文提供,并在单次前向传播中生成测试查询的预测。然而,许多PFN架构中二次缩放的自注意力机制使得对于非常大的训练数据集推理变得不可行。我们提出CRUMB(使用最小化MMD批处理的聚类检索),一个三阶段推理包装器:(i)聚类测试查询,(ii)通过贪心最小化最大均值差异(MMD)为每个聚类选择一个小型、分布匹配的训练子集,(iii)在每个缩减上下文的批次上执行精确的PFN推理。CRUMB是架构无关的,无需重新训练。在51个数据集的TabArena基准测试中,跨三种PFN架构(TabPFNv2、TabICLv1、TabICLv2)评估,我们展示了CRUMB优于类似的最先进的上下文选择策略。我们还展示了CRUMB对协变量漂移具有鲁棒性,因为MMD最小化步骤自然有助于对齐训练上下文分布以匹配当前测试批次分布。

英文摘要

Prior-fitted networks (PFNs) are a promising class of tabular foundation models that perform in-context learning, whereby the entire labelled training set is supplied as context, and predictions for test queries are produced in a single forward pass. However, the quadratically scaling self-attention mechanism in many PFN architectures makes inference prohibitive for very large training datasets. We propose CRUMB (Clustered Retrieval Using Minimised-MMD Batching), a three-stage inference wrapper that (i) clusters the test queries, (ii) selects a small, distributionally matched training subset for each cluster by greedily minimising the maximum mean discrepancy (MMD), and (iii) runs exact PFN inference on each reduced-context batch. CRUMB is architecture-agnostic and requires no retraining. On the 51-dataset TabArena benchmark, evaluated across three PFN architectures (TabPFNv2, TabICLv1, TabICLv2), we show that CRUMB outperforms similar state-of-the-art context selection strategies. We also show that CRUMB is resilient to covariate drift, as the MMD-minimisation step naturally helps align the training context distribution to match the current test batch distributions.
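Stage (ii) above, greedy MMD minimisation, can be sketched as follows. This is a minimal illustration assuming an RBF kernel and the biased squared-MMD estimator; the kernel, the bandwidth `gamma`, and the function names are assumptions rather than details taken from the paper, and the clustering and PFN inference stages are omitted.

```python
import math


def rbf(a, b, gamma=1.0):
    """RBF kernel between two feature vectors (kernel choice is an assumption)."""
    return math.exp(-gamma * sum((x - y) ** 2 for x, y in zip(a, b)))


def mmd2(subset, queries, gamma=1.0):
    """Biased estimate of the squared MMD between two point sets."""
    def mean_k(A, B):
        return sum(rbf(a, b, gamma) for a in A for b in B) / (len(A) * len(B))
    return mean_k(subset, subset) - 2 * mean_k(subset, queries) + mean_k(queries, queries)


def greedy_mmd_select(train, queries, budget, gamma=1.0):
    """Greedily add the training index that gives the lowest MMD to the query cluster."""
    chosen, remaining = [], list(range(len(train)))
    for _ in range(min(budget, len(train))):
        best = min(remaining,
                   key=lambda i: mmd2([train[j] for j in chosen + [i]], queries, gamma))
        chosen.append(best)
        remaining.remove(best)
    return chosen


train = [[0.0], [0.1], [5.0], [5.2], [10.0]]
cluster = [[5.1], [4.9]]          # one cluster of test queries
print(greedy_mmd_select(train, cluster, budget=2))
```

The selected indices are the training points near the query cluster, so the reduced context is distributionally matched to the queries it will serve. This naive version recomputes the full MMD for every candidate; an incremental update of the kernel sums would be needed at realistic scale.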

2606.11625 2026-06-11 cs.LG 新提交

TimeRouter: Efficient and Adaptive Routing of Time-Series Foundation Models

TimeRouter: 时间序列基础模型的高效自适应路由

Kanghui Ning, Yushan Jiang, Kashif Rasul, Anderson Schneider, Yuriy Nevmyvaka, Dongjin Song

发表机构 * University of Connecticut(康涅狄格大学) Salesforce AI Research JP Morgan AI Research(摩根大通人工智能研究院)

AI总结 提出TimeRouter框架,通过轻量判别路由、选择性门控和集成回退实现时间序列基础模型的自适应选择,无需LLM推理,在GIFT-EVAL榜单取得最优性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)作为新兴智能时间序列系统中的预测专家越来越受到探索。然而,TSFMs表现出异质性归纳偏差,且没有单一模型能在所有预测场景中持续占优,使得专家选择成为关键挑战。现有系统通常将此决策委托给基于LLM的控制器,导致大量推理开销。我们提出TimeRouter,一种高效路由框架,通过轻量判别路由、选择性门控和集成回退,利用预训练TSFM池的经验互补性。具体而言,TimeRouter结合了学习路由头、选择性门控和集成回退,在推理时无需调用LLM即可实现自适应专家选择。TimeRouter在GIFT-EVAL榜单上取得了最先进性能,LB MASE为0.6765。除了基准性能,我们的消融研究为TSFM路由设计提供了经验见解,强调了池组成和选择性门控的重要性。综合来看,这些结果使TimeRouter成为未来基于基础模型池的智能时间序列系统的模块化轻量路由层。我们的代码见此链接。

英文摘要

Time-series foundation models (TSFMs) are increasingly explored as predictive experts within emerging agentic time-series systems. However, TSFMs exhibit heterogeneous inductive biases, and no single model consistently dominates across forecasting regimes, making expert selection a critical challenge. Existing systems often delegate this decision to LLM-based controllers, incurring substantial inference overhead. We present TimeRouter, an efficient routing framework that leverages empirical complementarity across a pool of pretrained TSFMs through lightweight discriminative routing, selective gating, and ensemble fallback. Concretely, TimeRouter combines a learned routing head, a selective gate, and an ensemble fallback, enabling adaptive expert selection without invoking an LLM at inference time. TimeRouter achieves state-of-the-art performance on the GIFT-EVAL leaderboard, with an LB MASE of 0.6765. Beyond benchmark performance, our ablation studies provide empirical insights into TSFM routing design, highlighting the importance of pool composition and selective gating. Taken together, these results position TimeRouter as a modular and lightweight routing layer for future agentic time-series systems built upon foundation-model pools. Our code is available at this https URL.
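The interplay of routing head, selective gate, and ensemble fallback can be sketched as below. The sketch assumes the gate is a simple confidence threshold on the router's softmax output and that the fallback is an unweighted mean of the experts' forecasts; both are illustrative assumptions, since the abstract does not describe how TimeRouter implements either component.

```python
import math


def route(router_logits, forecasts, threshold=0.6):
    """Return (expert_index, forecast); expert_index is None on ensemble fallback.

    router_logits: one score per expert from a (hypothetical) routing head.
    forecasts: one forecast list per expert, all of the same horizon.
    """
    m = max(router_logits)
    exps = [math.exp(z - m) for z in router_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    best = max(range(len(probs)), key=probs.__getitem__)
    if probs[best] >= threshold:
        # Gate accepts: trust the single routed expert.
        return best, forecasts[best]
    # Gate rejects: fall back to the mean of all experts.
    horizon = len(forecasts[0])
    mean = [sum(f[t] for f in forecasts) / len(forecasts) for t in range(horizon)]
    return None, mean


expert_forecasts = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
print(route([4.0, 0.0, 0.0], expert_forecasts))   # confident router
print(route([0.0, 0.0, 0.0], expert_forecasts))   # uncertain router
```

No LLM call appears anywhere in this decision path, which is the property the abstract emphasises: the routing cost is one small classifier evaluation per series.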

2606.12280 2026-06-11 cs.LG 新提交

Holding the FP8 Quality Ceiling at 8-Bit Weights and Activations: INT8 and GGUF Post-Training Quantization of Ideogram 4.0 for Consumer GPUs

在8位权重和激活下保持FP8质量上限:Ideogram 4.0面向消费级GPU的INT8与GGUF后训练量化

Deep Gandhi, Ali Asaria, Tony Salomone

发表机构 * Transformer Lab

详情
AI中文摘要

后训练量化使得大型文本到图像扩散变换器能够在消费级GPU上运行,然而硬件特定的权衡很少被直接测量。我们对Ideogram 4.0——一个9.3B流匹配扩散变换器(DiT),以两个独立权重副本的形式部署,用于无分类器引导,并由Qwen3-VL-8B编码器调节——针对缺乏FP8张量核心的Ampere RTX 3090 GPU进行量化。我们的INT8 W8A8方案(逐通道权重、逐token动态激活、SmoothQuant以及对少量高脆弱性层的混合精度保护)保持了FP8的质量上限:在200提示基准上,INT8与FP8的配对同种子bootstrap置信区间在Pick和CLIP指标上均包含零,而INT8相比NF4提升了+1.9 CLIP(95%置信区间[+1.21,+2.64],排除零)。据我们所知,针对此类模型进行的逐类别OCR分析首次确认了文本可读性得以保留,而消融实验将前馈网络下投影的保护隔离为关键质量杠杆。我们的GGUF Q4_K量化在相同磁盘大小下优于NF4,并在质量-内存前沿上成为帕累托最优解,配对置信区间排除零(Q8_0质量中性)。最后,我们描述了8位量化在哪些方面有帮助以及哪些方面没有:INT8的权重与FP8的占用空间相当而非缩小,因此在Ampere上实现速度提升需要融合INT8内核。

英文摘要

Post-training quantization lets large text-to-image diffusion transformers run on consumer GPUs, yet the hardware-specific trade-offs are seldom measured directly. We quantize Ideogram 4.0 - a 9.3B flow-matching diffusion transformer (DiT), shipped as two separate-weight copies of a single-stream 34-layer backbone for classifier-free guidance and conditioned by a Qwen3-VL-8B encoder - for Ampere RTX 3090 GPUs, which lack FP8 tensor cores. Our INT8 W8A8 recipe (per-channel weights, per-token dynamic activations, SmoothQuant, and mixed-precision protection of a small high-fragility layer set) holds the FP8 quality ceiling: on a 200-prompt benchmark the paired same-seed bootstrap CI for INT8-FP8 includes zero on both Pick and CLIP, while INT8 improves on NF4 by $+1.9$ CLIP (95% CI $[+1.21,+2.64]$, excluding zero). A per-category OCR analysis, to our knowledge unreported for this model class, confirms text legibility is preserved, and an ablation isolates protection of the FFN down-projections as the dominant quality lever. Our GGUF Q4_K quantization beats NF4 at equal on-disk size and is the Pareto winner on the quality-memory frontier, with paired confidence intervals excluding zero (Q8_0 is quality neutral). Finally, we characterize where 8-bit quantization helps and where it does not: INT8's weights match FP8's footprint rather than shrink it, so a speed gain on Ampere awaits a fused INT8 kernel.
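The core of a W8A8 recipe with per-channel weights and per-token dynamic activations can be sketched in a few lines. This is a simplified illustration using symmetric quantization on plain Python lists; SmoothQuant and the mixed-precision protection of fragile layers mentioned above are deliberately left out, and the function names are hypothetical.

```python
def quantize_rows(matrix):
    """Symmetric INT8 quantization with one scale per row."""
    q, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row) or 1.0   # avoid a zero scale on all-zero rows
        scale = amax / 127.0
        scales.append(scale)
        q.append([max(-127, min(127, round(v / scale))) for v in row])
    return q, scales


def w8a8_matmul(x, w):
    """Approximate y = x @ w.T using INT8 operands.

    x: activations, one row per token (scales computed at run time, i.e. dynamic).
    w: weights, one row per output channel (scales fixed once after training).
    """
    xq, xs = quantize_rows(x)
    wq, ws = quantize_rows(w)
    return [[sum(a * b for a, b in zip(xr, wr)) * sx * sw
             for wr, sw in zip(wq, ws)]
            for xr, sx in zip(xq, xs)]


x = [[1.0, -2.0, 0.5], [0.1, 0.2, -0.3]]
w = [[0.5, 0.25, -1.0], [2.0, -1.0, 0.0]]
print(w8a8_matmul(x, w))   # close to the float result [[-0.5, 4.0], [0.4, 0.0]]
```

The integer dot product is rescaled by the product of one activation scale and one weight scale, which is what lets a fused INT8 kernel do the inner loop entirely in integer arithmetic.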

2606.11257 2026-06-11 cs.CL cs.LG cs.PF 交叉投稿

Energy-Efficient On-Device RAG on a Mobile NPU: System Design and Benchmark on Snapdragon X Elite

移动NPU上的能效型设备端RAG:Snapdragon X Elite系统设计与基准测试

Zhiyuan Cheng, Longying Lai

发表机构 * Qualcomm(高通)

AI总结 本文首次在Snapdragon X Elite的Hexagon NPU上实现端到端RAG流水线,通过对比CPU和GPU,NPU在嵌入吞吐量、系统能耗和查询延迟上分别提升9.1倍、降低12.3倍和4.0倍,且答案质量相当。

详情
Comments
9 pages, 2 figures, 6 tables
AI中文摘要

检索增强生成(RAG)流水线计算密集,结合了嵌入、检索、重排序和大语言模型(LLM)生成。完全在设备端运行有利于隐私、延迟和离线使用,但CPU推理的能耗成本是一个主要障碍。我们提出了据我们所知第一个在Snapdragon X Elite的Qualcomm Hexagon NPU上运行所有神经阶段(嵌入、重排序和LLM生成)的端到端RAG流水线。在Dell XPS 13笔记本电脑上进行性能分析,我们比较了NPU加速的RAG与CPU和OpenCL/Adreno GPU基线在索引和查询工作负载上的表现。在索引方面,NPU实现了9.1倍的嵌入吞吐量提升和12.3倍的系统能耗降低。在120查询的Wikipedia段落基准测试中,与CPU基线相比,NPU实现了18.1倍的LLM预填充加速、4.0倍的端到端查询延迟降低和4.0倍的系统能耗降低;集成GPU上的相同工作负载比CPU慢1.7倍,且能耗比NPU高6.5倍。GPT-4.1 LLM作为评判者的评估发现,NPU的答案质量与CPU和GPU相当,在评估者噪声范围内(1-10分制下平均9.32 vs. 8.95 vs. 9.03),86.7%的查询在所有三个后端上得分相同。因此,在Snapdragon X Elite / Hexagon类笔记本电脑SoC上,NPU实现了实用、能效高的设备端RAG,且无质量退化——这是一条通往绿色边缘智能的可持续路径,我们预计随着软件栈的成熟,该方法将推广到类似的移动NPU(Apple Neural Engine、Intel NPU、MediaTek APU)。

英文摘要

Retrieval-Augmented Generation (RAG) pipelines are compute-intensive, combining embedding, retrieval, reranking, and large language model (LLM) generation. Running them entirely on-device benefits privacy, latency, and offline use, but the energy cost of CPU inference is a major barrier. We present what is, to our knowledge, the first end-to-end RAG pipeline that runs all neural stages -- embedding, reranking, and LLM generation -- on the Qualcomm Hexagon NPU of the Snapdragon X Elite. Profiling on a Dell XPS 13 laptop, we compare NPU-accelerated RAG against CPU and OpenCL/Adreno GPU baselines on indexing and query workloads. On indexing, the NPU achieves 9.1x higher embedding throughput and 12.3x less system energy. On a 120-query Wikipedia-passage benchmark, it delivers 18.1x faster LLM prefilling, 4.0x lower end-to-end query latency, and 4.0x less system energy than the CPU baseline; the same workload on the integrated GPU is 1.7x slower than CPU and uses 6.5x more energy than the NPU. A GPT-4.1 LLM-as-judge evaluation finds NPU answer quality on par with CPU and GPU within evaluator noise (mean 9.32 vs. 8.95 vs. 9.03 on a 1-10 rubric), with 86.7% of queries scoring identically across all three backends. On the Snapdragon X Elite / Hexagon class of laptop SoC, the NPU thus enables practical, energy-efficient on-device RAG without quality regression -- a sustainable path toward green edge intelligence that we expect to generalize to comparable mobile NPUs (Apple Neural Engine, Intel NPU, MediaTek APU) as their software stacks mature.

2606.11387 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Small Experiments, Cheaper Decisions: A Case Study in Staged Promotion for Micro-Pretraining

小实验,更经济的决策:微预训练中分阶段提升的案例研究

Felipe Chavarro Polania

发表机构 * Hewlett Packard Enterprise(慧与科技公司)

AI总结 研究微预训练中的分阶段提升协议:以固定预算和冻结的提升规则筛选配置,在Windows A100和Linux L40S两类主机上验证;早期排名不稳定,最终12小时分支耗费144 GPU小时(完整协议共169.2 GPU小时),低于继续全部候选的成本,但不主张全局最优。

详情
Comments
14 pages, 5 figures; 12-hour dual-host micro-pretraining promotion study; source package includes curated ancillary artifacts
AI中文摘要

短预训练运行可以降低实验成本,但它们也可能过度推广那些仅在小预算下表现良好的配置。我们针对固定微预训练运行器在两个异构主机块(Windows A100和Linux L40S)上研究了一种可审计的分阶段提升协议。从12个预先筛选的配置开始,我们使用2分钟、5分钟、10分钟、60分钟和12小时的分阶段预算,并在昂贵的延续之前设置固定的提升规则。早期筛选被有意视为不稳定:5分钟和10分钟的排名对主机敏感,而最终的12小时排名最优条件并非复制10分钟门控下的平均最佳条件。由于不同阶段的种子范围不同,这些变化是操作性的提升证据,而非种子内曲线。复制60分钟门控将分阶段因子筛选桥接参考保留在提升集中,它在所有四个60分钟主机-种子单元中排名第一。在最终的12小时确认包中,桥接条件在两个种子的所有四个主机-种子单元中排名第一;贪婪比较器未满足固定的0.010 val_bpb近似等价规则;更便宜的d8/ar48(深度8,宽高比48)哨兵未满足固定的0.020平均差距规则。执行的12小时分支花费144 GPU小时,完整的分阶段协议记录169.2训练GPU小时(包括筛选阶段)。继续所有四个60分钟候选将花费192 GPU小时,而继续所有九个复制10分钟候选将花费432 GPU小时。后者是未运行延续的会计反事实,并非表明跳过的候选不可能超越参考。结果是一个有界成本分配发现,而非全局最优性、容量归一化优越性或优于自适应超参数优化方法的声明。

英文摘要

Short pretraining runs can reduce experimental cost, but they can also over-promote configurations that only look strong at tiny budgets. We study an auditable staged-promotion protocol for a fixed micro-pretraining runner on two heterogeneous host blocks: Windows A100 and Linux L40S. Starting from twelve prior-screened configurations, we use staged budgets of 2 minutes, 5 minutes, 10 minutes, 60 minutes, and 12 hours, with frozen promotion rules before expensive continuations. The early screens are intentionally treated as unstable: the 5- and 10-minute rankings are host-sensitive, and the eventual 12-hour top-ranked condition is not the mean-best condition at the replicated 10-minute gate. Because seed ranges differ across stages, these changes are operational promotion evidence, not within-seed curves. A replicated 60-minute gate keeps the Staged Factorial Screening bridge reference in the promoted set, where it ranks first in all four 60-minute host-seed cells. In the final 12-hour confirmation package, the bridge condition ranks first in all four host-seed cells across two seeds; the greedy comparator does not meet the frozen 0.010 val_bpb near-equivalence rule; and the cheaper d8/ar48 (depth-8, aspect-48) sentinel does not meet the frozen 0.020 mean-gap rule. The executed 12-hour branch spends 144 GPU-hours, and the full staged protocol records 169.2 training GPU-hours including screening stages. Continuing all four 60-minute candidates would spend 192 GPU-hours, while continuing all nine replicated 10-minute candidates would spend 432 GPU-hours. The latter numbers are accounting counterfactuals for unrun continuations, not evidence that skipped candidates could not have overtaken the reference. The result is a bounded cost-allocation finding, not a claim of global optimality, capacity-normalized superiority, or superiority over adaptive hyperparameter optimization methods.
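The GPU-hour figures in this abstract are consistent with simple accounting. The check below assumes that each host-seed cell occupies one GPU for the full 12 hours and that the executed branch continued three conditions (the bridge reference, the greedy comparator, and the d8/ar48 sentinel); both assumptions are inferred from the abstract rather than stated in it, and the function name is hypothetical.

```python
def continuation_gpu_hours(n_conditions, n_hosts=2, n_seeds=2, hours=12.0, gpus_per_run=1):
    """GPU-hours needed to continue every condition in every host-seed cell."""
    return n_conditions * n_hosts * n_seeds * hours * gpus_per_run


executed = continuation_gpu_hours(3)    # three conditions actually continued
all_60min = continuation_gpu_hours(4)   # counterfactual: all four 60-minute candidates
all_10min = continuation_gpu_hours(9)   # counterfactual: all nine 10-minute candidates
print(executed, all_60min, all_10min)   # 144.0 192.0 432.0
```

Under these assumptions the staged protocol avoids 48 GPU-hours relative to continuing all 60-minute candidates and 288 relative to continuing all 10-minute candidates, before adding the screening stages that bring the recorded total to 169.2 GPU-hours.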

2606.11390 2026-06-11 cs.CV cs.DC cs.GR cs.LG 交叉投稿

A Scalable PyTorch Abstraction for Multi-GPU Gaussian Splatting

一种可扩展的多GPU高斯泼溅PyTorch抽象

Matthew Cong, Francis Williams, Jonathan Swartz, Mark Harris, Sanja Fidler, Ken Museth

发表机构 * NVIDIA(英伟达) University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 提出一种多GPU高斯泼溅方法,通过CUDA统一内存和NVLink在算子级别分布参数,实现大规模场景重建,支持超过10亿高斯泼溅。

详情
Comments
14 pages, 6 tables, 2 figures, and 1 listing. Includes supplementary material
AI中文摘要

高斯泼溅方法在真实世界的神经重建中越来越受欢迎。然而,由于计算和内存限制,它们在规模和分辨率上常常受限。我们提出了一种多GPU高斯泼溅方法,将重建扩展到更高的分辨率和更大的场景,同时抽象掉了通常与模型分布相关的代码复杂性。为实现这一目标,我们提出一个PyTorch后端,通过CUDA统一内存和NVLink在GPU之间分布高斯参数和泼溅算子。由于分布发生在算子级别,模型代码不需要显式的跨设备通信。更广泛地说,该后端将多个GPU暴露为一个聚合的PyTorch设备,并支持其他PyTorch算子。我们展示了包含超过10亿个高斯泼溅的城市规模重建,具有街道级细节,数量是当前最先进方法的25倍以上。

英文摘要

Gaussian splatting methods have become increasingly popular for neural reconstruction of the real world. However, they are often limited in scale and resolution due to compute and memory constraints. We present a multi-GPU Gaussian splatting approach that scales reconstruction to higher resolutions and larger scenes while abstracting away the code complexity typically associated with distributing a model. To accomplish this, we propose a PyTorch backend that distributes the Gaussian parameters and splatting operators across GPUs via CUDA unified memory and NVLink. Because distribution occurs at the operator level, the model code requires no explicit cross-device communication. More broadly, the backend exposes multiple GPUs as an aggregate PyTorch device and supports other PyTorch operators. We demonstrate city-scale reconstructions with street-level detail consisting of over 1 billion Gaussian splats, more than 25 times as many as the current state of the art.

2606.11674 2026-06-11 cs.SD cs.LG 交叉投稿

SpAArSIST: Sparsified AASIST for Efficient and Reliable Anti-Spoofing

SpAArSIST: 用于高效可靠反欺骗的稀疏化AASIST

Anton Firc, Vojtěch Staněk, Zbyněk Lička, Kamil Malinka, Martin Perešíni

发表机构 * Brno University of Technology(布尔诺理工大学)

AI总结 提出SpAArSIST,通过稀疏化图池化后端,在保持竞争力的同时降低计算量20.7%、模型大小4.1%,并提升域外鲁棒性。

详情
Comments
Accepted at Interspeech 2026
AI中文摘要

我们提出了SpAArSIST,这是对广泛使用的基于自监督学习(SSL)的反欺骗方法AASIST图池化后端的面向部署的改进。受公共实现中冗余操作的启发,我们用显式的轻量级选择替换了学习池化和堆栈节点注意力:分离的训练和推理图池化比率$(k_{\mathrm{tr}},k_{\mathrm{inf}})$、基于幅度的节点评分以及图节点的均值聚合。最佳整体配置(排名第一)将后端计算削减了20.7%(从195.045M MACs降至154.706M MACs),模型大小减少了4.1%(从611.8k参数降至586.4k参数),同时将在In-the-Wild上的域外鲁棒性提升至2.82% EER和0.078 minDCF(原为4.64%和0.133),并在ASVspoof5上保持竞争力。我们还提供了一个综合选择分数,总结了准确性、校准和计算量,以支持平衡的面向部署的模型选择。

英文摘要

We present SpAArSIST, a deployment-oriented refinement of the widely used AASIST graph pooling backend for self-supervised learning (SSL) based anti-spoofing. Motivated by redundant operations in public implementations, we replace learned pooling and stack-node attention with explicit, lightweight choices: separate train and inference graph pooling ratios $(k_{\mathrm{tr}},k_{\mathrm{inf}})$, magnitude-based node scoring, and mean aggregation of graph nodes. The best overall configuration (rank 1) cuts backend compute by 20.7% (195.045M $\rightarrow$ 154.706M MACs) and model size by 4.1% (611.8k $\rightarrow$ 586.4k params), while improving out-of-domain robustness on In-the-Wild to 2.82% EER and 0.078 minDCF (from 4.64% and 0.133) and remaining competitive on ASVspoof5. We further provide a composite selection score that summarizes accuracy, calibration, and compute to support balanced deployment-oriented model choice.
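The three lightweight choices named above (separate pooling ratios, magnitude-based scoring, mean aggregation) can be sketched as one function. The sketch assumes the node score is the squared L2 norm of the node feature vector; the abstract says only "magnitude-based", so this scoring rule and the function name are assumptions.

```python
def magnitude_topk_pool(nodes, ratio):
    """Keep the top `ratio` fraction of graph nodes by magnitude, then mean-aggregate.

    nodes: list of equal-length feature vectors.
    Returns (kept_indices, pooled_vector).
    """
    k = max(1, int(ratio * len(nodes)))
    # Score each node by squared L2 norm; no learned parameters are involved.
    ranked = sorted(range(len(nodes)),
                    key=lambda i: sum(v * v for v in nodes[i]), reverse=True)
    kept = sorted(ranked[:k])                     # restore original node order
    dim = len(nodes[0])
    pooled = [sum(nodes[i][d] for i in kept) / k for d in range(dim)]
    return kept, pooled


graph_nodes = [[3.0, 4.0], [0.0, 1.0], [1.0, 0.0], [6.0, 8.0]]
K_TRAIN, K_INFER = 0.75, 0.5                      # hypothetical (k_tr, k_inf) values
print(magnitude_topk_pool(graph_nodes, K_INFER))  # ([0, 3], [4.5, 6.0])
```

Because the scoring has no trainable weights, the pooling ratio can be changed between training and inference without retraining, which is what makes a separate $(k_{\mathrm{tr}},k_{\mathrm{inf}})$ pair possible.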

2606.11682 2026-06-11 cs.CV cs.LG 交叉投稿

Parameter-Efficient Adapter Tuning for Tabular-Image Multimodal Learning

面向表格-图像多模态学习的参数高效适配器微调

Jiaqi Luo

发表机构 * School of Mathematical Sciences, Soochow University(苏州大学数学科学学院)

AI总结 提出TI-Adapter框架,通过冻结表格编码器并添加适配器,以及图像分支的嵌入层和瓶颈层适配器,实现高效多模态微调,在20个数据集上以更少参数达到或超越全微调性能。

详情
AI中文摘要

表格-图像多模态学习旨在通过联合使用结构化表格属性和视觉数据来提高预测建模能力。尽管预训练编码器提供了强大的模态特定表示,但全微调可能计算成本高昂,而保持编码器冻结可能限制任务特定适应。我们提出了表格-图像适配器(TI-Adapter),一种基于模态特定适配器的微调框架,用于高效的多模态适应。TI-Adapter冻结预训练的表格编码器,并在提取的表格嵌入后学习一个适配器,同时通过嵌入级和瓶颈级适配器来适应图像分支,而不是全微调。在20个表格-图像数据集上的实验表明,TI-Adapter在使用显著更少的可训练参数的情况下,达到了与全微调相当或更好的预测性能。消融研究进一步证明了适配器放置对于平衡性能和实际效率的重要性。

英文摘要

Tabular-image multimodal learning aims to improve predictive modeling by jointly using structured tabular attributes and visual data. Although pretrained encoders provide strong modality-specific representations, full fine-tuning can be computationally expensive, while keeping encoders frozen may limit task-specific adaptation. We propose the Tabular-Image Adapter (TI-Adapter), a modality-specific adapter-based fine-tuning framework for efficient multimodal adaptation. TI-Adapter freezes the pretrained tabular encoder and learns an adapter after the extracted tabular embedding, while adapting the image branch with embedding-level and bottleneck-level adapters instead of full fine-tuning. Experiments on 20 tabular-image datasets show that TI-Adapter achieves competitive or better predictive performance than full fine-tuning while using substantially fewer trainable parameters. Ablation studies further demonstrate the importance of adapter placement for balancing performance and practical efficiency.

2606.12171 2026-06-11 cs.CV cs.LG 交叉投稿

Beyond Dark Knowledge: Mixup-Based Distillation for Reliable Predictions

超越暗知识:基于混合的蒸馏实现可靠预测

José Medina, Paul Honeine, Abdelaziz Bensrhair, Amnir Hadachi

发表机构 * ITS Lab, Institute of Computer Science, University of Tartu(塔尔图大学计算机科学学院ITS实验室) LITIS, Université de Rouen(鲁昂大学LITIS实验室) LITIS, INSA de Rouen(鲁昂国立应用科学学院LITIS实验室)

AI总结 研究知识蒸馏与混合训练结合时教师-学生不匹配的影响,发现学生能独立获得线性结构并提升准确率与校准,提出混合蒸馏作为更丰富的知识传递通道。

详情
AI中文摘要

知识蒸馏(KD)和混合(mixup)已被证明能有效诱导类别边界的平滑性:KD捕捉概率分布中的固有类别关系,而混合通过输入的凸组合强制执行这些关系。然而,它们的相互作用仍未被充分理解,特别是当混合仅在学生训练期间应用时。在这种情况下,教师被查询来自其训练期间从未见过的邻域分布的输入,这是一种受控的不匹配,其对知识转移的影响尚未被表征。我们表明,这种不匹配导致教师的监督信号被分布混淆而非类间结构主导。尽管如此,学生并非仅仅模仿教师:它独立地在邻域区域获得更大的线性度,这是教师缺乏的结构特性,并超越了暗知识转移。与基线相比,带有混合的KD持续提高学生准确率,并将过度自信降低一个数量级,在CIFAR和ImageNet上使用不同容量的教师均如此。关键的是,校准独立于准确率转移从教师传播到学生,温度缩放控制着可测量的准确率-校准权衡,在邻域训练下这种权衡更加明显。这些结果将混合蒸馏重新定义为不是标准KD的退化版本,而是一个更丰富的传递通道,同时塑造判别性能、不确定性估计和表示几何。

英文摘要

Knowledge Distillation (KD) and mixup have proven effective at inducing smoothness in class boundaries; KD captures inherent class relationships in probability distributions, and mixup enforces them through convex combinations of inputs. Their interaction, however, remains poorly understood, particularly when mixup is applied only during student training. In this setting, the teacher is queried on inputs drawn from a vicinal distribution it never saw during training, a controlled mismatch whose effect on knowledge transfer has not been characterised. We show that this mismatch causes the teacher's supervisory signal to be dominated by distributional confusion rather than inter-class structure. Despite it, the student does not merely imitate the teacher: it independently acquires greater linearity in the vicinal region, a structural property that the teacher lacks, and goes beyond dark-knowledge transfer. KD with mixup consistently improves student accuracy and reduces overconfidence by an order of magnitude relative to the baseline, across CIFAR and ImageNet with varying-capacity teachers. Crucially, calibration propagates from teacher to student independently of accuracy transfer, and temperature scaling governs a measurable accuracy-calibration trade-off that becomes more pronounced under vicinal training. These results reframe mixup distillation not as a degraded version of standard KD, but as a richer transfer channel that simultaneously shapes discriminative performance, uncertainty estimation, and representational geometry.
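The setting studied here, mixup applied only on the student side with the teacher queried on the mixed inputs, uses two standard ingredients that can be sketched directly. The code below shows the convex input combination and the usual temperature-scaled KL distillation loss; the temperature value and the function names are illustrative, and the paper's full training objective may contain additional terms.

```python
import math


def softmax(logits, T=1.0):
    """Temperature-softened softmax."""
    m = max(logits)
    e = [math.exp((z - m) / T) for z in logits]
    s = sum(e)
    return [v / s for v in e]


def mixup(x1, x2, lam):
    """Convex combination of two inputs (a sample from the vicinal distribution)."""
    return [lam * a + (1 - lam) * b for a, b in zip(x1, x2)]


def kd_loss(student_logits, teacher_logits, T=4.0):
    """T^2-scaled KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


x_mixed = mixup([0.0, 2.0], [4.0, 6.0], lam=0.25)
print(x_mixed)                                      # [3.0, 5.0]
# Both networks would be evaluated on x_mixed; only the student was trained on such inputs.
print(kd_loss([2.0, 0.5, -1.0], [1.0, 1.0, 0.0]))
```

The mismatch the abstract describes arises at the second comment: the teacher's logits on `x_mixed` come from a region it never saw in training, so its soft targets there are less reliable than on clean inputs.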

2606.12278 2026-06-11 cs.CV cs.LG 交叉投稿

Finding Sparse Subnetworks in One Training Cycle via Progressive Magnitude-Based Pruning

通过渐进式幅度剪枝在一个训练周期内找到稀疏子网络

Romana Qureshi, Hafida Benhidour, Said Kerrache, Nahlah Aljeraisy

发表机构 * King Abdullah University of Science and Technology(阿卜杜拉国王科技大学) University of Jeddah(吉达大学) King Fahd University of Petroleum and Minerals(法赫德国王石油矿产大学) King Saud University(沙特国王大学)

AI总结 提出渐进式幅度剪枝方法,在单训练周期内线性增加稀疏度,基于权重幅度更新掩码,在CIFAR-10和MNIST上优于LTH、SNIP和GraSP等基线。

详情
AI中文摘要

神经网络剪枝通过移除不太重要的参数来减小模型大小,同时旨在保持预测性能。尽管彩票假说(LTH)表明,当从合适的初始化训练时,稀疏子网络可以匹配密集网络,但其迭代剪枝过程需要多个完整的训练周期。本工作评估了渐进式幅度剪枝作为一种单周期替代方案。该方法在训练期间使用线性调度逐渐增加稀疏度,并基于活跃权重幅度更新剪枝掩码。我们在CIFAR-10和MNIST上,针对ResNet、VGG风格和LeNet架构进行了系统实验,将所提方法与代表性的迭代和基于初始化的剪枝基线(包括LTH、SNIP和GraSP)进行比较。在CIFAR-10上,该方法在ResNet-18上以72.9%稀疏度达到95.12%的准确率,而LTH报告为90.5%。在极端稀疏度下,它在VGG类架构上以97%稀疏度达到93.13%的准确率,而SNIP约为92.0%;在VGG-19上以97.97%稀疏度达到93.44%的准确率,而GraSP在98%稀疏度下为92.19%。在ResNet-18上的稀疏度-准确率分析进一步表明,在70-85%稀疏度范围内,准确率保持在密集基线的0.1个百分点以内。这些结果表明,在所评估的设置下,渐进式幅度剪枝为神经网络稀疏化提供了一种有效的单周期方法。

英文摘要

Neural network pruning reduces model size by removing less important parameters while aiming to preserve predictive performance. Although the Lottery Ticket Hypothesis (LTH) shows that sparse subnetworks can match dense networks when trained from suitable initializations, its iterative pruning procedure requires multiple complete training cycles. This work evaluates progressive magnitude-based pruning as a single-cycle alternative. The method gradually increases sparsity during training using a linear schedule and updates pruning masks based on active weight magnitudes. We conduct systematic experiments on CIFAR-10 and MNIST across ResNet, VGG-style, and LeNet architectures, comparing the proposed method with representative iterative and initialization-based pruning baselines, including LTH, SNIP, and GraSP. On CIFAR-10, the method achieves 95.12% accuracy on ResNet-18 at 72.9% sparsity, compared with 90.5% reported for LTH. At extreme sparsity, it achieves 93.13% accuracy on a VGG-like architecture at 97% sparsity, compared with approximately 92.0% for SNIP, and 93.44% accuracy on VGG-19 at 97.97% sparsity, compared with 92.19% for GraSP at 98% sparsity. A sparsity-accuracy analysis on ResNet-18 further shows that accuracy remains within 0.1 percentage points of the dense baseline across 70-85% sparsity. These results indicate that progressive magnitude-based pruning provides an effective single-cycle approach for neural network sparsification under the evaluated settings.
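The two mechanisms described above, a linear sparsity schedule and mask updates driven by the magnitudes of still-active weights, can be sketched as follows. The sketch works on a flat weight list and assumes the schedule starts at zero sparsity; the starting point, update frequency, and function names are assumptions not given in the abstract.

```python
def linear_sparsity(step, total_steps, final_sparsity):
    """Target sparsity rising linearly from 0 to final_sparsity over training."""
    return final_sparsity * min(step, total_steps) / total_steps


def update_mask(weights, mask, sparsity):
    """Zero the smallest-magnitude active weights until `sparsity` is reached.

    Already-pruned positions stay pruned; the mask is updated in place.
    """
    target_pruned = round(sparsity * len(weights))
    active = [i for i, m in enumerate(mask) if m]
    already_pruned = len(weights) - len(active)
    to_prune = max(0, target_pruned - already_pruned)
    for i in sorted(active, key=lambda i: abs(weights[i]))[:to_prune]:
        mask[i] = 0
    return mask


weights = [0.5, -0.1, 0.3, -0.8, 0.05, 0.9, -0.2, 0.4, 0.7, -0.6]
mask = [1] * len(weights)
for step in (5, 10):                      # two mask updates within one training run
    update_mask(weights, mask, linear_sparsity(step, 10, final_sparsity=0.8))
    print(step, mask)
```

In real training the weights would be updated by gradient steps between mask updates, so the ranking among active weights can change over time; the fixed weights here only keep the example deterministic.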

2606.12411 2026-06-11 cs.CL cs.LG 交叉投稿

Context-Driven Incremental Compression for Multi-Turn Dialogue Generation

上下文驱动的增量压缩用于多轮对话生成

Yeongseo Jung, Jaehyeok Kim, Eunseo Jung, Jiachuan Wang, Yongqi Zhang, Ka Chun Cheung, Simon See, Lei Chen

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) NVIDIA AI Technology Center(NVIDIA AI技术中心) Shanghai Jiao Tong University(上海交通大学) The Chinese University of Hong Kong(香港中文大学)

AI总结 提出上下文驱动的增量压缩(C-DIC),通过可修订的线程压缩状态和轻量级检索-修订-写回循环,实现跨轮信息共享,稳定长对话性能。

详情
Comments
Accepted at ICML 2026
AI中文摘要

现代对话代理在每一轮都会处理不断增长的对话历史,导致冗余的注意力和编码成本随对话长度增加。简单的截断或摘要会降低保真度,而现有的上下文压缩器缺乏跨轮记忆共享或修订,导致信息丢失和长对话中的累积错误。我们重新审视了对话动态下的上下文压缩,并经验性地展示了其脆弱性。为了提高效率和鲁棒性,我们引入了上下文驱动的增量压缩(C-DIC),它将对话视为交织的上下文线程,并在单个紧凑的对话记忆中存储每个线程的可修订压缩状态。在每一轮,一个轻量级的检索、修订和写回循环在轮次之间共享信息并更新过时的记忆,从而稳定长期行为。此外,我们将截断反向传播(TBPTT)适应于我们的多轮设置,学习跨轮依赖关系而无需完整历史反向传播。在长对话基准上的大量实验证明了C-DIC的优越性能和效率;值得注意的是,C-DIC在数百轮对话中表现出稳定的推理延迟和困惑度,为高质量对话建模提供了一条可扩展的路径。

英文摘要

Modern conversational agents condition on an ever-growing dialogue history at each turn, incurring redundant attention and encoding costs that grow with conversation length. Naive truncation or summarization degrades fidelity, while existing context compressors lack cross-turn memory sharing or revision, causing information loss and compounding errors in long dialogues. We revisit the context compression under conversational dynamics and empirically present its fragility. To improve both efficiency and robustness, we introduce Context-Driven Incremental Compression (C-DIC), which treats a conversation as interleaved contextual threads and stores revisable per-thread compression states in a single, compact dialogue memory. At each turn, a lightweight retrieve, revise, and write-back loop shares information across turns and updates stale memories, stabilizing long-horizon behavior. In addition, we adapt truncated backpropagation-through-time (TBPTT) to our multi-turn setting, learning cross-turn dependencies without full-history backpropagation. Extensive experiments on long-form dialogue benchmarks demonstrate superior performance and efficiency of C-DIC; notably, C-DIC shows stable inference latency and perplexity over hundreds of dialogue turns, supporting a scalable path to high-quality dialogue modeling.

2509.20241 2026-06-11 cs.LG cs.DC 版本更新

Energy Use of AI Inference, Efficiency Pathways, and Test-Time Scaling

AI推断的能耗、效率路径与测试时扩展

Felipe Oviedo, Fiodar Kazhamiaka, Esha Choukse, Allen Kim, Amy Luers, Melanie Nakagawa, Ricardo Bianchini, Juan M. Lavista Ferres

AI总结 本文提出基于令牌吞吐量的自下而上方法,估算大规模大语言模型的每查询能耗,揭示测试时扩展场景下的能耗变化及效率提升潜力。

详情
Comments
A preprint version with DOI is available at Zenodo: this https URL
AI中文摘要

随着AI推断扩展到数十亿查询和新兴推理及代理工作流增加令牌需求,可靠估计每查询能耗对容量规划、排放核算和效率优先级至关重要。许多公开估计不一致且高估能耗,因为它们从有限基准外推且未能反映大规模下的效率提升。本文引入基于令牌吞吐量的自下而上方法,估算大规模LLM系统的每查询能耗。对于在H100节点上运行的模型,在现实工作负载、GPU利用率和PUE约束下,前沿规模模型(>2000亿参数)的每查询能耗中位数估计为0.34瓦时(IQR: 0.18-0.67)。这些结果与生产规模配置的测量一致,表明非生产估计可能高估能耗4-20倍。扩展到测试时扩展场景,每个典型查询的令牌数增加15倍,中位数能耗升至4.32瓦时,表明在该场景下聚焦效率将带来最大的集群节能。我们量化了在模型、服务平台和硬件层面的可实现效率提升,发现单项改进可使每查询能耗中位数减少1.5-3.5倍,而综合改进可能带来8-20倍的减少。为说明系统级影响,我们估算一个每天处理十亿查询的部署的基线能耗为0.8 GWh/天。如果10%为长查询,需求可能增长到1.8 GWh/天。通过针对性的效率干预,它降至0.9 GWh/天,与该规模的网络搜索能耗相当。这呼应了数据中心历史上通过效率提升控制能耗增长的经验。

英文摘要

As AI inference scales to billions of queries, estimates of per-query energy use are increasingly important for capacity planning, efficiency interventions, and policy. Yet many public estimates assume non-production settings, leading to systematic overestimation. We introduce a bottom-up framework estimating inference energy from token throughput, node power, and overhead under large-scale deployment assumptions. For frontier-scale models (>200B parameters) on H100 nodes, we estimate a median energy of 0.31 Wh/query (IQR 0.16-0.60), indicating widely cited estimates are overstated by 4-20x. In test-time scaling scenarios 15x longer than typical queries, the median energy rises 13x to 3.91 Wh (IQR 2.15-7.05). Across models, serving systems, and hardware, we estimate 8-20x line-of-sight energy reductions. At datacenter scale, serving 1 billion queries/day requires 0.7 GWh; if 10% are long queries, demand rises to 1.7 GWh/day. With efficiency interventions, it falls to 0.8 GWh/day, mitigating the energy impact of test-time scaling.
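A bottom-up estimate from token throughput, node power, and overhead can be written as a one-line formula. The sketch below assumes the simplest possible form: energy per token is node power times PUE divided by aggregate node throughput, with no idle or utilization terms. The paper's framework is presumably richer, and every number in the example is an illustrative placeholder, not a value from the paper.

```python
def energy_per_query_wh(tokens_per_query, node_tokens_per_second, node_power_w, pue=1.0):
    """Energy attributed to one query, in watt-hours.

    node_tokens_per_second is the aggregate throughput of the node across all
    concurrent requests, so node_power_w / node_tokens_per_second is joules per token.
    """
    node_seconds = tokens_per_query / node_tokens_per_second
    return node_power_w * pue * node_seconds / 3600.0


# Placeholder inputs chosen only to exercise the formula.
typical = energy_per_query_wh(1_000, 4_000, 8_000, pue=1.25)
long_query = energy_per_query_wh(15_000, 4_000, 8_000, pue=1.25)
print(round(typical, 4), round(long_query / typical, 1))
```

In this purely linear model a 15x longer query costs exactly 15x the energy. The abstract reports a 13x rise in the median (3.91 / 0.31 ≈ 12.6), which suggests the full framework is not strictly linear in token count.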

2512.08211 2026-06-11 cs.LG 版本更新

MobileFineTuner: A Mobile-Native Framework for On-Device LLM Fine-Tuning in Real-World Embedded AI Applications

MobileFineTuner:面向真实世界嵌入式AI应用中设备端大语言模型微调的移动原生框架

Jiaxiang Geng, Lunyu Zhao, Yiyi Lu, Bing Luo

AI总结 提出移动原生框架MobileFineTuner,通过C++实现资源感知训练运行时(内存高效注意力、激活检查点等),在商用手机上实现端到端LLM微调,显著降低内存压力并提升可执行性。

详情
Comments
26 pages, 25 figures
AI中文摘要

大语言模型(LLM)正从以云为中心的服务转向设备端嵌入式AI,其中模型与从用户及其物理环境感知的私有、纵向信号进行交互。手机是此类应用的自然平台,因为用户随身携带、连接可穿戴传感器,并深度集成于日常移动应用中。然而,在商用手机上实际进行LLM微调仍然困难。现有微调框架大多基于Python且面向服务器,难以部署到移动应用中。我们提出MobileFineTuner,一个面向移动原生的开源框架,用于在商用手机上实现端到端LLM微调。MobileFineTuner用C++实现,并提供可复用的训练栈。为了在移动资源约束下使微调可行,MobileFineTuner集成了资源感知的训练运行时,包括内存高效注意力、激活检查点、梯度累积、参数分片和能量感知调度。我们在真实手机上使用GPT-2、Gemma 3和Qwen2.5模型,在多个微调任务上评估MobileFineTuner。结果表明,MobileFineTuner再现了标准Full-FT和LoRA微调行为,显著降低了内存压力并提升了在内存受限手机上的可执行性。我们进一步通过一个私有的校园健康代理应用展示了MobileFineTuner,其中本地LLM在用户特定的可穿戴感知记录上进行微调,以提供更个性化的响应,同时将原始记录保留在手机上。这些结果确立了MobileFineTuner作为在嵌入式AI和感知系统中研究和构建设备端LLM微调应用的实用工具包。

英文摘要

Large language models (LLMs) are moving from cloud-centric services toward on-device embedded AI, where models interact with private, longitudinal signals sensed from users and their physical environments. Mobile phones are a natural platform for such applications because they are continuously carried by users, connected to wearable sensors, and deeply integrated with daily mobile applications. However, practical LLM fine-tuning on commodity phones remains difficult. Existing fine-tuning frameworks are largely Python-based and server-oriented, making them hard to deploy inside mobile applications. We present MobileFineTuner, a mobile-native open-source framework for end-to-end LLM fine-tuning on commodity mobile phones. MobileFineTuner is implemented in C++ and provides a reusable training stack. To make fine-tuning feasible under mobile resource constraints, MobileFineTuner integrates a resource-aware training runtime with memory-efficient attention, activation checkpointing, gradient accumulation, parameter sharding, and energy-aware scheduling. We evaluate MobileFineTuner on real mobile phones using GPT-2, Gemma 3, and Qwen2.5 models across multiple fine-tuning tasks. The results show that MobileFineTuner reproduces standard Full-FT and LoRA fine-tuning behavior, substantially reduces memory pressure and improves executability on memory-constrained phones. We further demonstrate MobileFineTuner through a private campus health-agent application, where a local LLM is fine-tuned on user-specific wearable-sensing records to provide more personalized responses while keeping raw records on the phone. These results establish MobileFineTuner as a practical toolkit for studying and building on-device LLM fine-tuning applications in embedded AI and sensing systems.

2601.23278 2026-06-11 cs.LG cs.AR cs.CL 版本更新

FOCUS: DLLMs Know How to Tame Their Compute Bound

FOCUS: DLLMs 知道如何驯服它们的计算瓶颈

Kaihua Liang, Xin Tan, An Zhong, Hong Xu, Marco Canini

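AI summary: To address the problem that most computation in diffusion LLM decoding is wasted on non-decodable tokens, the paper proposes FOCUS, an inference system that dynamically focuses computation on decodable tokens and evicts non-decodable ones, increasing the effective batch size and achieving up to 3.52x higher throughput.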

详情
Comments
ICML 2026 camera-ready version
AI中文摘要

扩散大语言模型(DLLMs)为自回归模型提供了一种引人注目的替代方案,但其部署受到高解码成本的制约。在这项工作中,我们识别出 DLLM 解码中的一个关键低效问题:虽然计算在令牌块上并行化,但每个扩散步骤中只有一小部分令牌是可解码的,导致大部分计算浪费在不可解码的令牌上。我们进一步观察到注意力导出的令牌重要性与逐令牌解码概率之间存在强相关性。基于这一洞察,我们提出了 FOCUS,一个专为 DLLMs 设计的推理系统。通过动态地将计算聚焦于可解码令牌并实时驱逐不可解码令牌,FOCUS 增加了有效批大小,缓解了计算限制并实现了可扩展的吞吐量。实验评估表明,在大批量设置下,FOCUS 相比生产级引擎 LMDeploy 实现了高达 3.52 倍的吞吐量提升,同时在多个基准测试中保持或提升了生成质量。

英文摘要

Diffusion Large Language Models (DLLMs) offer a compelling alternative to Auto-Regressive models, but their deployment is constrained by high decoding cost. In this work, we identify a key inefficiency in DLLM decoding: while computation is parallelized over token blocks, only a small subset of tokens is decodable at each diffusion step, causing most compute to be wasted on non-decodable tokens. We further observe a strong correlation between attention-derived token importance and token-wise decoding probability. Based on this insight, we propose FOCUS, an inference system designed for DLLMs. By dynamically focusing computation on decodable tokens and evicting non-decodable ones on-the-fly, FOCUS increases the effective batch size, alleviating compute limitations and enabling scalable throughput. Empirical evaluations demonstrate that FOCUS achieves up to 3.52$\times$ throughput improvement over the production-grade engine LMDeploy in large-batch settings, while preserving or improving generation quality across multiple benchmarks.
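The link between token eviction and effective batch size can be sketched with simple arithmetic. The example below is an illustration under stated assumptions, not the FOCUS implementation: the importance scores, the fixed per-step token budget of 4096, and the function names are all hypothetical, and the abstract does not say how FOCUS derives importance from attention.

```python
def focus_step(importance, budget):
    """Return the indices of the `budget` highest-importance positions."""
    ranked = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    return sorted(ranked[:budget])

def requests_per_step(token_budget, tokens_per_request):
    """How many requests fit into a fixed per-step token budget."""
    return token_budget // tokens_per_request

# Hypothetical importance scores for one 8-token block.
importance = [0.05, 0.91, 0.10, 0.02, 0.88, 0.07, 0.03, 0.60]
kept = focus_step(importance, budget=3)          # positions that stay in the computation
dense = requests_per_step(4096, tokens_per_request=len(importance))
focused = requests_per_step(4096, tokens_per_request=len(kept))
```

With three of eight positions kept, the same token budget serves 1365 requests instead of 512 in this toy setting, which is the sense in which eviction raises the effective batch size.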

2603.09555 2026-06-11 cs.LG cs.AI cs.DC cs.PF 版本更新

Compiler-First State Space Duality and Portable $O(1)$ Autoregressive Caching for Inference

编译器优先的状态空间对偶性与可移植的 $O(1)$ 自回归缓存推理

Cosmo Santoni, Anmol Thapar

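AI summary: Proposes a compiler-first inference approach built on the structure of state space duality (SSD), using standard JAX primitives to obtain a single-source inference path with no custom kernels; it reaches high hardware utilization on TPU and GPU, and cached decode is 27x to 36x faster than full-prefix recomputation.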

详情
Comments
21 pages, 6 figures. Code available at: this https URL
AI中文摘要

高吞吐量的Mamba-2推理通常依赖于融合的CUDA和Triton内核,这限制了在不同加速器后端之间的可移植性。我们证明状态空间对偶性(SSD)递归具有编译器友好的结构:对角逐头动态、固定大小分块、以einsum为主的计算以及静态控制流。在标准JAX原语中表达这种结构,可以得到一个无需自定义内核的单源推理路径、一个注册的JAX PyTree缓存以及一个编译后的设备上自回归循环。在单个Google Cloud TPU v6e上,batch-1预填充达到约140 TFLOPS,即15%的模型FLOP利用率(MFU),这是该场景下的屋顶线上限;缓存解码达到高达64%的硬件带宽利用率(HBU)。在4096个token的上下文中,对于五个Mamba-2检查点(参数从130M到2.7B),缓存解码比全前缀重计算快27-36倍。相同的源代码在未修改的情况下可在NVIDIA L40S上运行,其中缓存解码在所有模型规模下均保持序列长度无关。WikiText-103验证困惑度与Triton参考实现mamba_ssm v2.2.2相差在±0.0005以内,隐藏状态在float32舍入容差内一致。代码可在此 https URL 获取。

英文摘要

High-throughput Mamba-2 inference is usually tied to fused CUDA and Triton kernels, limiting portability across accelerator backends. We show that the state space duality (SSD) recurrence has a compiler-friendly structure: diagonal per-head dynamics, fixed-size chunking, einsum-dominated compute, and static control flow. Expressing this structure in standard JAX primitives gives a single-source inference path with no custom kernels, a registered JAX PyTree cache, and a compiled on-device autoregressive loop. On a single Google Cloud TPU v6e, batch-1 prefill reaches approximately 140 TFLOPS, or 15% model FLOP utilisation (MFU), the roofline ceiling for this regime, and cached decode reaches up to 64% hardware bandwidth utilisation (HBU). At a 4096-token context, cached decode is 27x to 36x faster than full-prefix recomputation across five Mamba-2 checkpoints from 130M to 2.7B parameters. The same source runs unmodified on NVIDIA L40S, where cached decode remains sequence-length independent across all model scales. WikiText-103 validation perplexity matches the Triton reference mamba_ssm v2.2.2 within $\pm 0.0005$ points, and hidden states agree to float32 rounding tolerance. Code is available at this https URL.
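Why a cached state makes decoding cost independent of sequence length can be seen on a scalar toy recurrence. The sketch below is plain Python rather than the paper's JAX code, and it assumes fixed scalar coefficients `a`, `b`, `c` (in Mamba-2 these are input-dependent and per-head); the function names are hypothetical.

```python
def full_recompute(a, b, c, xs):
    """Recompute the latest output from the whole prefix: O(len(xs)) work per token."""
    h = 0.0
    for x in xs:
        h = a * h + b * x          # diagonal (here scalar) state update
    return c * h

def cached_step(a, b, c, h_prev, x):
    """Advance one token from the cached state: O(1) work per token."""
    h = a * h_prev + b * x
    return h, c * h

a, b, c = 0.9, 0.5, 2.0
xs = [1.0, -2.0, 0.5, 3.0]
h, outputs = 0.0, []
for x in xs:
    h, y = cached_step(a, b, c, h, x)
    outputs.append(y)
```

Each cached step touches only the fixed-size state `h`, whereas recomputation walks the full prefix again; both produce the same output for every prefix.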

2605.14738 2026-06-11 cs.LG cs.AI 版本更新

TAPIOCA: Why Task-Aware Pruning Improves OOD model Capability

TAPIOCA: 为什么任务感知剪枝能提升模型对分布外数据的能力

Krish Sharma, Omar Naim, Soumadeep Saha, Vinija Jain, Aman Chadha, Nicholas Asher

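AI summary: Studies why task-aware pruning helps on out-of-distribution data; experiments show that pruning improves OOD accuracy, and the core contribution is a geometric explanation of how task-aware pruning realigns model representations with the task.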

详情
AI中文摘要

近期的研究表明,任务感知层剪枝可以提高模型在特定任务上的性能,如TALE所示。本文探讨了这种改进何时发生以及为何会发生。我们首先证明,在受控的多项式回归任务和大型语言模型中,此类剪枝在分布内(ID)数据上没有好处,但能一致地提高分布外(OOD)准确性。我们进一步通过实验证明,OOD输入会诱导出层间范数和成对距离的分布,这些分布偏离ID分布的相应分布。这导致了任务感知剪枝的几何解释:每个任务诱导出一个任务适应的几何结构,通过ID输入上观察到的表示分布来经验性地表征。OOD输入可以引入任务适应几何的扭曲版本。任务感知剪枝识别出创建或放大这种扭曲的层;通过移除这些层,它将OOD表示的范数和成对距离转向在适应分布上观察到的值。这使OOD输入与模型的任务适应几何重新对齐,并提高性能。我们通过受控分布偏移和残差缩放干预提供了因果证据,并在不同模型规模上展示了一致的行为。

英文摘要

Recent work has promoted task-aware layer pruning as a way to improve model performance on particular tasks, as shown by TALE. In this paper, we investigate when such improvements occur and why. We show first that, across controlled polynomial regression tasks and large language models, such pruning yields no benefit on in-distribution (ID) data but consistently improves out-of-distribution (OOD) accuracy. We further show empirically that OOD inputs induce layerwise norm and pairwise-distance profiles that deviate from the corresponding ID profiles. This leads to a geometric explanation of task-aware pruning: each task induces a task-adapted geometry, characterized empirically by the representation profiles observed on ID inputs. OOD inputs can introduce a distorted version of the task-adapted geometry. Task-aware pruning identifies layers that create or amplify this distortion; by removing them, it shifts OOD representational norms and pairwise distances toward those observed on the adapted distribution. This realigns OOD inputs with the model's task-adapted geometry and improves performance. We provide causal evidence through controlled distribution shifts and residual-scaling interventions, and demonstrate consistent behavior across model scales.
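The layerwise norm and pairwise-distance profiles mentioned in the abstract can be computed as follows. This is a minimal sketch with hand-made two-dimensional representations; the function names and the rule of flagging the layer with the largest combined gap are assumptions for illustration and are not the pruning criterion used by TALE or by the paper.

```python
import math
from itertools import combinations

def mean_norm(reps):
    """Average Euclidean norm of the representation vectors at one layer."""
    return sum(math.sqrt(sum(v * v for v in r)) for r in reps) / len(reps)

def mean_pairwise_distance(reps):
    """Average Euclidean distance over all pairs of representations at one layer."""
    pairs = list(combinations(reps, 2))
    return sum(math.dist(p, q) for p, q in pairs) / len(pairs)

def profile(layers):
    """layers[l] is the list of representation vectors observed at layer l."""
    return [(mean_norm(r), mean_pairwise_distance(r)) for r in layers]

def deviation(id_profile, ood_profile):
    """Per-layer absolute gap between the ID and OOD profiles."""
    return [(abs(n1 - n2), abs(d1 - d2))
            for (n1, d1), (n2, d2) in zip(id_profile, ood_profile)]

# Toy data: layer 0 behaves identically, layer 1 inflates OOD representations.
id_layers = [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0], [1.0, -1.0]]]
ood_layers = [[[1.0, 0.0], [0.0, 1.0]], [[3.0, 3.0], [3.0, -3.0]]]
gaps = deviation(profile(id_layers), profile(ood_layers))
most_distorting = max(range(len(gaps)), key=lambda l: sum(gaps[l]))
```

In this toy case the second layer is the one whose OOD profile departs from the ID profile, which is the kind of layer the abstract describes as creating or amplifying distortion.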

2605.25820 2026-06-11 cs.LG 版本更新

Visual-Redundancy-Controlled Parallel Decoding for Diffusion-Based Multimodal Large Language Models

基于扩散的多模态大语言模型的视觉冗余控制并行解码

Yulin Yuan, Hongshuo Zhao, Xiangming Meng

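AI summary: Targets visual redundancy in parallel decoding for diffusion-based multimodal LLMs; proposes the Visual Redundancy Index (VRI) and the training-free Visual-Redundancy-Controlled Decoding (VRCD), which uses token-to-image attention to prioritize visually complementary positions and improves accuracy on multiple benchmarks.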

详情
Comments
18 pages, 5 figures, preprint. Code is available at this https URL
AI中文摘要

基于扩散的多模态大语言模型(dMLLMs)通过迭代并行预测多个掩码位置的令牌进行解码。这使每个解码步骤成为一个位置选择问题:模型不仅要选择哪些预测单独可靠,还要选择哪些位置应一起提交作为后续解码步骤的上下文。现有的基于置信度的解码独立地对掩码位置进行排序并提交前K个位置,很大程度上忽略了提交的令牌是否提供互补的视觉基础。我们识别了这种策略在多模态设置中的步骤级局限性:在同一步骤中选择的高置信度令牌可能依赖重叠的视觉基础,导致提交的令牌之间出现视觉冗余,从而为后续解码留下较少的互补视觉基础。为了量化这种效应,我们引入了视觉冗余指数(VRI),该指数衡量并行提交的令牌之间的视觉基础重叠程度。为了在解码过程中控制这种冗余,我们提出了视觉冗余控制解码(VRCD),一种无需训练的推理时解码方法,它利用令牌到图像的注意力优先选择视觉互补的位置。在多种多模态基准测试中,VRCD以适度的运行时开销减少了视觉冗余和剩余位置熵。在更长的解码实验中,与基于置信度的解码相比,它在M^3CoT上实现了高达18.8%的相对准确率提升,在MMBench上实现了6.9%的提升。代码将在https://github.com/infiniteYuanyl/VRCD发布。

英文摘要

Diffusion-based multimodal large language models (dMLLMs) decode by iteratively predicting tokens at multiple masked positions in parallel. This turns each decoding step into a position-selection problem: the model must choose not only which predictions are reliable in isolation, but also which positions should be committed together as context for later decoding steps. Existing confidence-based decoding ranks masked positions independently and commits the top-K positions, largely ignoring whether the committed tokens provide complementary visual grounding. We identify a step-level limitation of this strategy in multimodal settings: high-confidence tokens selected in the same step can rely on overlapping visual grounding, introducing visual redundancy among the committed tokens and leaving less complementary visual grounding available for later decoding. To quantify this effect, we introduce the Visual Redundancy Index (VRI), which measures visual grounding overlap among tokens committed in parallel. To control this redundancy during decoding, we propose Visual-Redundancy-Controlled Decoding (VRCD), a training-free inference-time decoding method that uses token-to-image attention to prioritize visually complementary positions. Across diverse multimodal benchmarks, VRCD reduces visual redundancy and remaining-position entropy with modest runtime overhead. In longer decoding experiments, it also achieves relative accuracy gains of up to 18.8% on M^3CoT and 6.9% on MMBench over confidence-based decoding. Code is available at this https URL.
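The contrast between confidence-only top-K selection and redundancy-aware selection can be sketched as a greedy procedure. The abstract does not give the VRI formula or the VRCD scoring rule, so everything here is an assumption: overlap is measured as cosine similarity between token-to-image attention vectors, and `penalty` is a hypothetical trade-off weight.

```python
import math

def cosine(p, q):
    """Cosine similarity between two attention vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (math.sqrt(sum(a * a for a in p)) * math.sqrt(sum(b * b for b in q)))

def select_positions(confidence, attention, k, penalty):
    """Greedily pick k positions, discounting overlap with already chosen ones."""
    chosen = []
    while len(chosen) < k:
        def score(i):
            overlap = max((cosine(attention[i], attention[j]) for j in chosen), default=0.0)
            return confidence[i] - penalty * overlap
        remaining = [i for i in range(len(confidence)) if i not in chosen]
        chosen.append(max(remaining, key=score))
    return chosen

# Positions 0 and 1 attend to the same image region; position 2 attends elsewhere.
confidence = [0.95, 0.94, 0.80]
attention = [[0.9, 0.1, 0.0], [0.8, 0.2, 0.0], [0.0, 0.1, 0.9]]
top_k = select_positions(confidence, attention, k=2, penalty=0.0)    # confidence only
diverse = select_positions(confidence, attention, k=2, penalty=0.5)  # overlap-aware
```

With no penalty the two near-duplicate positions are committed together; with the penalty the second slot goes to the position grounded in a different image region.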

2605.29128 2026-06-11 cs.LG 版本更新

Apertus LLM Family Expansion via Distillation and Quantization

通过蒸馏和量化扩展 Apertus LLM 系列

Andrei Panferov, Davit Melikidze, Martin Jaggi, Dan Alistarh

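AI summary: Uses distillation and quantization to expand the Apertus 8B model, at low cost, into a model family with up to 4B parameters that covers a range of hardware constraints while retaining strong accuracy.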

详情
AI中文摘要

LLM 的广泛采用导致它们被用于各种应用和场景,例如聊天助手和数据标注,这要求模型满足特定的预算和硬件约束。这导致了 LLM 以批次发布,包含不同大小的相似模型,以便模型系列尽可能满足广泛的约束。在本文中,我们验证了蒸馏和量化作为将模型系列扩展到新大小和硬件格式的经济有效方法。基于开放配方 Apertus 8B LLM,我们生成了 Apertus-v1.1——一个蒸馏模型系列,参数高达 4B,在 1.7T 许可令牌上训练。我们证明了我们的方法在覆盖大范围的硬件和系统需求方面具有成本效益和强大的准确性性能。

英文摘要

The wide adoption of LLMs has led to their use in a great variety of applications and scenarios, such as chatbot assistants and data annotation, creating the need for models to satisfy certain budget and hardware constraints. This has led to the trend of LLMs being released in batches of similar models of various sizes, so that the model family covers as wide a range of constraints as possible. In this paper, we validate distillation and quantization as a cost-effective way to expand model families to new sizes and hardware formats. Based on the open-recipe Apertus 8B LLM, we produce Apertus-v1.1, a distilled family of models with up to 4B parameters trained on 1.7T permissive-license tokens. We demonstrate the cost-efficiency and strong accuracy of our approach for covering large ranges of hardware and systems requirements.
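The abstract does not state which distillation objective is used for Apertus-v1.1. As a point of reference only, the sketch below shows the generic logit-distillation loss, a KL divergence between temperature-softened teacher and student distributions; the temperature of 2.0 and the toy logits are assumptions.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax over temperature-scaled logits."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions."""
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [2.0, 1.0, 0.1]
matched = distillation_loss(teacher, [2.0, 1.0, 0.1])      # student agrees with teacher
mismatched = distillation_loss(teacher, [0.1, 1.0, 2.0])   # student ranks tokens in reverse
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what drives the smaller model toward the larger one's behavior.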

2606.07362 2026-06-11 cs.LG 版本更新

Breaking the Ice: Analyzing Cold Start Latency in vLLM

打破冰层:分析 vLLM 中的冷启动延迟

Huzaifa Shaaban Kabakibo, Animesh Trivedi, Lin Wang

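AI summary: Presents the first systematic analysis of cold start latency in the vLLM inference engine, breaking startup into six foundational steps, finding it predominantly CPU bound, and building a lightweight analytical model that predicts latency to guide resource planning in large-scale inference environments.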

详情
AI中文摘要

随着可扩展推理服务的普及,推理引擎的冷启动延迟变得重要。如今,vLLM 已成为许多推理工作负载的事实标准推理引擎。尽管流行,但由于其复杂性和快速演进,尚未有对其启动延迟的系统研究。随着主要架构创新如 V1 API 和 this http URL 的引入,本文首次对 vLLM 启动延迟进行了详细的性能表征。我们将启动过程分解为六个基础步骤,并证明其主要受 CPU 限制。每个步骤在模型级和系统级参数方面表现出一致且可解释的缩放趋势,从而能够细粒度地归因延迟来源。基于这些见解,我们开发了一个轻量级分析模型,能够准确预测给定硬件配置下的 vLLM 启动延迟,为大规模推理环境中的资源规划提供可操作的指导。所有基准测试数据集、分析工具和预测脚本均在此 https URL 开源。

英文摘要

As scalable inference services become popular, the cold start latency of an inference engine becomes important. Today, vLLM has evolved into the de facto inference engine of choice for many inference workloads. Although popular, due to its complexity and rapid evolution, there has not been a systematic study of its startup latency. With major architectural innovations such as the V1 API and the introduction of this http URL, this paper presents the first detailed performance characterization of vLLM startup latency. We break down the startup process into six foundational steps and demonstrate that it is predominantly CPU bound. Each step exhibits consistent and interpretable scaling trends with respect to model-level and system-level parameters, enabling fine-grained attribution of latency sources. Building on these insights, we develop a lightweight analytical model that accurately predicts vLLM startup latency for a given hardware configuration, providing actionable guidance for resource planning in large-scale inference environments. All benchmarking datasets, analysis tools, and prediction scripts are open sourced at this https URL.

2606.10820 2026-06-11 cs.LG cs.AI cs.CL 版本更新

K-Forcing: Joint Next-K-Token Decoding via Push-Forward Language Modeling

K-Forcing:通过前推语言建模进行联合下一K词解码

Zhiwei Tang, Yuanyu He, Yizheng Han, Wangbo Zhao, Jiasheng Tang, Fan Wang, Bohan Zhuang

Affiliations: DAMO Academy, Alibaba Group; Hupan Lab; Zhejiang University; The Hong Kong University of Science and Technology

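AI summary: Proposes the K-Forcing paradigm, which distills an autoregressive model into a push-forward mapping that generates multiple future tokens in a single forward pass, achieving a 2.4-3.5x speedup with modest quality loss.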

详情
Comments
Code: this https URL
AI中文摘要

自回归语言建模是文本生成的主导范式,但其逐词顺序解码使得推理受限于内存且效率低下。现有的加速方法(如推测解码和扩散语言模型)在特定条件下可提升速度,但并未直接解决高负载批量服务——这一对工业级部署最为关键的场景。我们提出K-Forcing,一种用于联合下一k词解码的前推语言建模范式。K-Forcing将现有自回归模型蒸馏为条件前推映射——该映射在单次前向传播中将独立均匀噪声变量转换为多个未来词的联合样本。该设计保留了固定长度输出,复用了自回归教师模型的主干,并与标准自回归服务基础设施兼容。我们通过渐进式自强迫蒸馏训练该映射,逐步扩展预测窗口,同时使学生模型紧密匹配自回归教师模型的序列分布。我们在LM1B和OpenWebText上使用标准因果Transformer主干评估K-Forcing。当激进配置为每次前向传播生成k=4个词时,K-Forcing在不同批量大小下实现约2.4-3.5倍加速,同时相对于自回归教师模型仅带来轻微的质量下降。随着推理在现代LLM的生命周期计算成本中占据主导地位,K-Forcing为在现实高负载部署下加速自回归生成提供了一条有前景的途径。

英文摘要

Autoregressive (AR) language modeling is the dominant paradigm for text generation, yet its sequential token-by-token decoding makes inference memory-bound and inefficient. Existing acceleration approaches, such as speculative decoding and diffusion language models, can yield speedups under certain conditions but do not directly address high-load batch serving, the scenario most critical for industrial-scale deployment. We introduce K-Forcing, a push-forward language modeling paradigm for joint next-k-token decoding. K-Forcing distills an existing AR model into a conditional push-forward mapping, one that transforms independent uniform noise variables into a joint sample of multiple future tokens in a single forward pass. This design preserves fixed-length outputs, reuses the AR teacher backbone, and remains compatible with standard AR serving infrastructure. We train this mapping via progressive self-forcing distillation, which gradually expands the prediction window while enabling the student to closely match the sequence distribution of the AR teacher. We evaluate K-Forcing on LM1B and OpenWebText using a standard causal Transformer backbone. When aggressively configured to generate k = 4 tokens per forward pass, K-Forcing delivers approximately 2.4-3.5x speedup across different batch sizes, while incurring modest quality degradation relative to its AR teacher. As inference increasingly dominates the lifetime compute cost of modern LLMs, K-Forcing offers a promising route toward accelerating AR generation under real-world high-load deployment.
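The idea of a push-forward mapping, a deterministic function that turns uniform noise into a sample from a target distribution, can be illustrated with inverse-CDF sampling. The sketch below is not the K-Forcing model: it evaluates a hypothetical two-token conditional distribution step by step, whereas the paper trains a network to produce all k tokens in one forward pass.

```python
def inverse_cdf_sample(probs, u):
    """Deterministically map u in [0, 1) to an index drawn with probability probs[index]."""
    cumulative = 0.0
    for index, p in enumerate(probs):
        cumulative += p
        if u < cumulative:
            return index
    return len(probs) - 1  # guard against floating-point round-off

def push_forward(conditional, noise):
    """Map k uniform noise values to k tokens; conditional(prefix) returns next-token probs."""
    tokens = []
    for u in noise:
        tokens.append(inverse_cdf_sample(conditional(tuple(tokens)), u))
    return tokens

def toy_conditional(prefix):
    # Hypothetical 2-token vocabulary: repeat the previous token with probability 0.8.
    if not prefix:
        return [0.5, 0.5]
    return [0.8, 0.2] if prefix[-1] == 0 else [0.2, 0.8]

tokens = push_forward(toy_conditional, noise=[0.3, 0.9, 0.1, 0.5])
```

Because the mapping is deterministic given the noise, the same four uniform values always yield the same four tokens, while fresh noise yields fresh joint samples from the sequence distribution.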

2505.17623 2026-06-11 cs.CR cs.AI cs.ET cs.LG cs.PF 版本更新

Range-Arithmetic: Verifiable Deep Learning Inference on an Untrusted Party

Range-Arithmetic: 在不可信方上进行可验证的深度学习推理

Ali Rahimi, Babak H. Khalaj, Mohammad Ali Maddah-Ali

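AI summary: Proposes the Range-Arithmetic framework, which converts non-arithmetic operations into verifiable arithmetic steps to enable efficient verification of deep neural network inference, reducing computation and communication overhead.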

详情
AI中文摘要

可验证计算(VC)在去中心化机器学习系统中日益重要,由于区块链的限制,深度神经网络(DNN)推理等资源密集型任务被外包给外部参与者。这产生了在不重新执行的情况下验证外包计算正确性的需求。我们提出了Range-Arithmetic,一个新颖的框架,用于高效且可验证的DNN推理,它将非算术运算(如定点矩阵乘法后的舍入和ReLU)转化为可通过求和检查协议和串联范围证明验证的算术步骤。我们的方法避免了布尔编码、高次多项式和大查找表的复杂性,同时保持与基于有限域的证明系统的兼容性。实验结果表明,我们的方法不仅匹配现有方法的性能,还降低了验证结果的计算成本、执行DNN推理的不可信方所需的计算工作量以及双方之间的通信开销。

英文摘要

Verifiable computing (VC) has gained prominence in decentralized machine learning systems, where resource-intensive tasks like deep neural network (DNN) inference are offloaded to external participants due to blockchain limitations. This creates a need to verify the correctness of outsourced computations without re-execution. We propose Range-Arithmetic, a novel framework for efficient and verifiable DNN inference that transforms non-arithmetic operations, such as rounding after fixed-point matrix multiplication and ReLU, into arithmetic steps verifiable using sum-check protocols and concatenated range proofs. Our approach avoids the complexity of Boolean encoding, high-degree polynomials, and large lookup tables while remaining compatible with finite-field-based proof systems. Experimental results show that our method not only matches the performance of existing approaches, but also reduces the computational cost of verifying the results, the computational effort required from the untrusted party performing the DNN inference, and the communication overhead between the two sides.
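How rounding can become an arithmetic step plus a range condition is easiest to see on integers. The sketch below uses a common arithmetization of round-to-nearest after a fixed-point product; the abstract does not spell out the paper's exact constraints, so the identity, the function names, and the 8-bit shift are assumptions, and no actual proof system is involved.

```python
def prover_round(x, shift):
    """Untrusted party: compute the rounded value and its remainder witness."""
    half = 1 << (shift - 1)
    y, r = divmod(x + half, 1 << shift)
    return y, r

def check_rounding(x, y, r, shift):
    """Verifier: accept y as the rounding of x / 2**shift without performing a division."""
    half = 1 << (shift - 1)
    linear_ok = x + half == y * (1 << shift) + r   # one arithmetic identity
    range_ok = 0 <= r < (1 << shift)               # one range check on the remainder
    return linear_ok and range_ok

x, shift = 1234567, 8          # x stands for a fixed-point product with 8 fractional bits
y, r = prover_round(x, shift)
```

The verifier only checks a linear identity and that the remainder lies in a fixed range, which is the kind of statement a range proof can cover; a wrong output `y` fails the identity.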

2509.23982 2026-06-11 cs.CL cs.AI cs.CY cs.LG cs.NE 版本更新

Toward Preference-aligned Large Language Models via Residual-based Model Steering

基于残差模型引导的偏好对齐大型语言模型

Lucio La Cava, Andrea Tagarelli

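AI summary: Proposes PaLRS, which extracts lightweight steering vectors from preference signals in the residual stream to align model preferences at inference time without training, yielding consistent gains on mathematical reasoning and code generation while saving substantial time.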

详情
Comments
Accepted at IJCAI 2026
AI中文摘要

偏好对齐是使大型语言模型(LLMs)有用且与(人类)偏好一致的关键步骤。现有方法如基于人类反馈的强化学习或直接偏好优化通常需要精心策划的数据和对数十亿参数进行昂贵的优化,最终导致持久性的任务特定模型。在这项工作中,我们引入了基于残差引导的LLM偏好对齐(PaLRS),这是一种无需训练的方法,利用LLM残差流中编码的偏好信号。从仅一百个偏好对中,PaLRS提取出轻量级、即插即用的引导向量,可在推理时应用以将模型推向偏好行为。我们在各种中小型开源LLM上评估了PaLRS,显示PaLRS对齐的模型在数学推理和代码生成基准上取得了一致的提升,同时保持了基线通用性能。此外,与使用DPO和SimPO对齐的模型相比,它们表现更好且节省大量时间。我们的发现强调,PaLRS为标准偏好优化流程提供了一种有效、更高效且灵活的替代方案,提供了一种无需训练、即插即用的对齐机制,且数据需求极少。

英文摘要

Preference alignment is a critical step in making Large Language Models (LLMs) useful and aligned with (human) preferences. Existing approaches such as Reinforcement Learning from Human Feedback or Direct Preference Optimization typically require curated data and expensive optimization over billions of parameters, and eventually lead to persistent task-specific models. In this work, we introduce Preference alignment of Large Language Models via Residual Steering (PaLRS), a training-free method that exploits preference signals encoded in the residual streams of LLMs. From as few as one hundred preference pairs, PaLRS extracts lightweight, plug-and-play steering vectors that can be applied at inference time to push models toward preferred behaviors. We evaluate PaLRS on various small-to-medium-scale open-source LLMs, showing that PaLRS-aligned models achieve consistent gains on mathematical reasoning and code generation benchmarks while preserving baseline general-purpose performance. Moreover, compared to models aligned with DPO and SimPO, they perform better while saving substantial time. Our findings highlight that PaLRS is an effective, much more efficient, and flexible alternative to standard preference optimization pipelines, providing a training-free, plug-and-play mechanism for alignment with minimal data.
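A steering vector of the kind described can be sketched as a difference of mean activations. The abstract does not specify how PaLRS extracts its vectors, so the difference-of-means construction, the toy three-dimensional activations, and the `strength` parameter below are all assumptions.

```python
def mean_vector(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(vectors) for col in zip(*vectors)]

def steering_vector(preferred, dispreferred):
    """Difference of mean residual activations between the two response sets."""
    mp, md = mean_vector(preferred), mean_vector(dispreferred)
    return [a - b for a, b in zip(mp, md)]

def steer(hidden, vector, strength):
    """Add the scaled steering vector to one residual-stream activation."""
    return [h + strength * v for h, v in zip(hidden, vector)]

# Toy activations standing in for residual-stream states of preference pairs.
preferred = [[1.0, 2.0, 0.0], [3.0, 2.0, 0.0]]
dispreferred = [[0.0, 2.0, 1.0], [0.0, 2.0, 3.0]]
v = steering_vector(preferred, dispreferred)
steered = steer([0.5, 0.5, 0.5], v, strength=0.5)
```

No weights change in this procedure: the vector is computed once from stored activations and then added at inference time, which is what makes the approach training-free.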

2512.22219 2026-06-11 cs.DC cs.LG cs.PL 版本更新

MPK: A Compiler and Runtime for Mega-Kernelizing Tensor Programs

MPK:一种用于将张量程序转化为巨型内核的编译器和运行时系统

Xinhao Cheng, Zhihao Zhang, Yu Zhou, Jianan Ji, Jinchen Jiang, Zepeng Zhao, Ziruo Xiao, Zihao Ye, Yingyi Huang, Ruihang Lai, Hongyi Jin, Bohan Hou, Mengdi Wu, Yixin Dong, Anthony Yip, Zihao Ye, Songting Wang, Wenqin Yang, Xupeng Miao, Tianqi Chen, Zhihao Jia

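AI summary: Proposes MPK, the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel; its SM-level graph representation enables cross-operator software pipelining and fine-grained overlap of computation and communication, significantly reducing inference latency.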

详情
Comments
14 pages
AI中文摘要

我们介绍了Mirage Persistent Kernel (MPK),这是首个自动将多GPU模型推理转化为单个高性能巨型内核的编译器和运行时系统。MPK引入了一种SM级图表示,该表示在单个流式多处理器(SM)的粒度上捕获数据依赖关系,从而实现跨算子软件流水线、计算与通信的细粒度重叠,以及在传统每算子内核执行模型下不可行的其他优化。MPK编译器将张量程序降级为优化的SM级任务图,并为每个任务生成快速的CUDA实现,而MPK内核内并行运行时则通过跨SM的分散调度在单个持久巨型内核内执行这些任务。这些组件共同提供了端到端的内核融合,且开发工作量极小,同时保留了现有编程模型的灵活性。我们的评估表明,MPK显著优于现有的每算子内核LLM服务系统,实现了高达1.7倍的端到端推理延迟降低,并将LLM推理性能推近底层硬件的极限。MPK在此https URL公开可用。

英文摘要

We introduce Mirage Persistent Kernel (MPK), the first compiler and runtime system that automatically transforms multi-GPU model inference into a single high-performance mega-kernel. MPK introduces an SM-level graph representation that captures data dependencies at the granularity of individual streaming multiprocessors (SMs), enabling cross-operator software pipelining, fine-grained overlap of computation and communication, and other optimizations that are infeasible under the conventional kernel-per-operator execution model. The MPK compiler lowers tensor programs into optimized SM-level task graphs and generates fast CUDA implementations for each task, while the MPK in-kernel parallel runtime executes these tasks within a single persistent mega-kernel using decentralized scheduling across SMs. Together, these components provide end-to-end kernel fusion with minimal developer effort, while preserving the flexibility of existing programming models. Our evaluation shows that MPK significantly outperforms existing kernel-per-operator LLM serving systems, achieving up to 1.7$\times$ lower end-to-end inference latency and pushing LLM inference performance close to the limits of the underlying hardware. MPK is publicly available at this https URL.
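Executing a fine-grained task graph with dependency counters can be simulated in a few lines. This is a toy host-side model, not MPK's in-kernel runtime: the task names, the round-based workers standing in for SMs, and the function `run_task_graph` are hypothetical.

```python
from collections import deque

def run_task_graph(deps, num_workers):
    """Run tasks once their dependencies are met; return the order of completion."""
    pending = {t: len(d) for t, d in deps.items()}      # unmet-dependency counters
    successors = {t: [] for t in deps}
    for task, parents in deps.items():
        for p in parents:
            successors[p].append(task)
    ready = deque(sorted(t for t, n in pending.items() if n == 0))
    order = []
    while ready:
        # Each simulated worker (standing in for an SM) takes one ready task per round.
        wave = [ready.popleft() for _ in range(min(num_workers, len(ready)))]
        for task in wave:
            order.append(task)
            for s in successors[task]:
                pending[s] -= 1
                if pending[s] == 0:
                    ready.append(s)
    return order

deps = {
    "matmul_tile0": [], "matmul_tile1": [],
    "allreduce_tile0": ["matmul_tile0"], "allreduce_tile1": ["matmul_tile1"],
    "norm": ["allreduce_tile0", "allreduce_tile1"],
}
order = run_task_graph(deps, num_workers=2)
```

Because dependencies are tracked per tile rather than per operator, the communication task for one tile becomes ready as soon as that tile's matmul finishes, without waiting for the whole operator.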

2601.04710 2026-06-11 cs.CL cs.LG 版本更新

Steering the Noise: Turning Random Perturbations into Effective Descent for Memory-Efficient LLM Fine-Tuning

引导噪声:将随机扰动转化为有效下降方向以实现内存高效的LLM微调

Feihu Jin, Shipeng Cen, Ying Tan

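AI summary: Proposes a plug-and-play framework that selects or combines perturbations aligned with the optimization objective from a pool of candidates, improving zeroth-order gradient estimation and raising convergence speed and task accuracy in LLM fine-tuning.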

详情
Comments
12pages, 6figures
AI中文摘要

微调大型语言模型(LLMs)取得了强大的性能,但通常受到反向传播内存开销的限制。零阶(ZO)优化通过仅使用前向传递来估计梯度,避免了这一开销,但由于随机高斯扰动在高维参数空间中产生高方差的梯度估计,其收敛速度通常较慢。在本文中,我们提出了一种即插即用框架,将随机扰动转化为更有效的下降方向。关键思想是抽取一小批候选扰动,评估其损失值,然后选择或组合那些与优化目标最一致的扰动。我们开发了该思想的两种实例:MeZO-GV,通过低损失和高损失扰动组之间的对比形成引导向量;以及MeZO-Greedy,在固定的评估预算内保留单个最佳扰动。我们从理论上证明,这两种策略在每步目标函数减少上均优于标准ZO估计,从而提高了收敛速度。在不同规模和架构的LLM上的实验证实,所提出的方法自然地与现有ZO优化器集成,并一致地提高了收敛速度和任务准确性。在OPT-13B上,我们的方法在11个基准测试中优于所有ZO基线,并在其中9个上超过了基于梯度的方法,同时保留了仅前向优化的内存效率。

英文摘要

Fine-tuning large language models (LLMs) achieves strong performance but is often limited by the memory overhead of backpropagation. Zeroth-order (ZO) optimization avoids this overhead by estimating gradients through forward passes alone, yet it typically converges slowly because random Gaussian perturbations yield high-variance gradient estimates in high-dimensional parameter spaces. In this paper, we propose a plug-and-play framework that turns random perturbations into more effective descent directions. The key idea is to draw a small pool of candidate perturbations, evaluate their loss values, and then select or combine those that are best aligned with the optimization objective. We develop two instantiations of this idea: MeZO-GV, which forms a guiding vector from the contrast between low-loss and high-loss perturbation groups, and MeZO-Greedy, which keeps the single best perturbation within a fixed evaluation budget. We theoretically show that both strategies yield a larger per-step reduction in the objective than standard ZO estimation, leading to improved convergence rates. Experiments on LLMs of different scales and architectures confirm that the proposed methods integrate naturally with existing ZO optimizers and consistently improve convergence speed and task accuracy. On OPT-13B, our approach outperforms all ZO baselines across 11 benchmarks and exceeds gradient-based methods on 9 of them, while retaining the memory efficiency of forward-only optimization.
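The pool-then-select idea can be sketched on a toy quadratic objective. The abstract does not give the exact selection rule of MeZO-Greedy, so the version below is an assumption: it keeps the candidate with the lowest loss after a positive perturbation and then applies the standard two-point zeroth-order estimate along that direction. The objective, pool size, and step sizes are illustrative.

```python
import random

def loss(theta):
    """Toy quadratic objective standing in for a fine-tuning loss."""
    return sum(t * t for t in theta)

def zo_greedy_step(theta, rng, pool_size, eps, lr):
    """Sample a pool of perturbations and step along the one with the lowest perturbed loss."""
    pool = [[rng.gauss(0.0, 1.0) for _ in theta] for _ in range(pool_size)]

    def perturbed(z, sign):
        return loss([t + sign * eps * zi for t, zi in zip(theta, z)])

    best = min(pool, key=lambda z: perturbed(z, +1.0))
    # Two-point zeroth-order estimate of the directional derivative along `best`.
    g = (perturbed(best, +1.0) - perturbed(best, -1.0)) / (2.0 * eps)
    return [t - lr * g * zi for t, zi in zip(theta, best)]

rng = random.Random(0)
theta = [1.0, -2.0, 0.5]
start = loss(theta)
for _ in range(200):
    theta = zo_greedy_step(theta, rng, pool_size=4, eps=1e-3, lr=0.02)
end = loss(theta)
```

Only forward evaluations of `loss` are used, so no backpropagation state is stored; the price of the pool is a few extra forward passes per step.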

2602.10908 2026-06-11 cs.CL cs.LG stat.ML 版本更新

SoftMatcha 2: A Fast and Soft Pattern Matcher for Trillion-Scale Corpora

SoftMatcha 2:一种用于万亿级语料库的快速软模式匹配器

Masataka Yoneda, Yusuke Matsushita, Go Kamoda, Kohei Suenaga, Takuya Akiba, Masaki Waga, Sho Yokoi

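AI summary: Proposes SoftMatcha 2, an ultra-fast soft search algorithm based on suffix arrays and word vectors; with dynamic corpus-aware pruning and a disk-aware design, it searches trillion-scale corpora in under 0.3 seconds while allowing semantic variations through substitution, insertion, and deletion, and it uncovers benchmark contamination.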

详情
Comments
Accepted at ICML 2026. Project Page & Web Interface: this https URL, Source Code: this https URL
AI中文摘要

我们提出SoftMatcha 2,一种超快速且灵活的搜索算法,能够在0.3秒内搜索万亿规模的自然语言语料库,同时允许以替换、插入和删除形式进行的语义变体。我们的方法采用基于后缀数组的字符串匹配,该数组随语料库规模扩展良好,并将单词表示为向量,这支撑了其语义灵活性。为了缓解查询语义放松导致的组合爆炸,我们的方法建立在两个关键算法思想上:动态语料感知剪枝和由磁盘感知设计实现的快速精确查找。我们从理论上分析了所提出方法的效率,表明它可以缓解搜索空间的指数增长。在FineWeb-Edu(Lozhkov等人,2024)(1.4T tokens)上的实验表明,与现有方法infini-gram(Liu等人,2024)、infini-gram mini(Xu等人,2025)和SoftMatcha(Deguchi等人,2025)相比,它实现了显著更低的搜索延迟。作为实际应用,我们的方法发现了现有方法遗漏的训练语料库中的基准污染,并且也有利于信息检索和释义检测。我们还提供了一个在线演示,支持七种语言的语料库快速软搜索。

英文摘要

We present SoftMatcha 2, an ultra-fast and flexible search algorithm that enables search over trillion-scale natural language corpora in under 0.3 seconds while allowing semantic variations in the form of substitution, insertion, and deletion. Our approach employs string matching based on suffix arrays that scales well with corpus size, and represents words as vectors, which underpin its semantic flexibility. To mitigate the combinatorial explosion induced by the semantic relaxation of queries, our method is built on two key algorithmic ideas: dynamic corpus-aware pruning and fast exact lookup enabled by a disk-aware design. We theoretically analyze the efficiency of the proposed method, indicating that it can mitigate exponential growth in the search space. Empirically, on FineWeb-Edu (Lozhkov et al., 2024) (1.4T tokens), it attains substantially lower search latency than existing methods: infini-gram (Liu et al., 2024), infini-gram mini (Xu et al., 2025), and SoftMatcha (Deguchi et al., 2025). As a practical application, our method uncovers benchmark contamination in training corpora that existing approaches miss, and it also benefits information retrieval and paraphrase detection. We also provide an online demo of fast, soft search across corpora in seven languages.
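The exact-lookup building block, string matching over a suffix array, can be sketched on a tiny token sequence. This example covers exact matching only: the soft matching with word vectors, the pruning, and the disk-aware layout described in the abstract are not shown, and the naive construction used here would not scale to large corpora.

```python
def build_suffix_array(tokens):
    """Start indices of all suffixes in lexicographic order (naive construction)."""
    return sorted(range(len(tokens)), key=lambda i: tokens[i:])

def find_occurrences(tokens, suffix_array, pattern):
    """Binary-search the suffix array for every position where `pattern` occurs."""
    m = len(pattern)
    lo, hi = 0, len(suffix_array)
    while lo < hi:                                   # first suffix whose prefix >= pattern
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start, hi = lo, len(suffix_array)
    while lo < hi:                                   # first suffix whose prefix > pattern
        mid = (lo + hi) // 2
        if tokens[suffix_array[mid]:suffix_array[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(suffix_array[start:lo])

corpus = "the cat sat on the mat near the cat".split()
sa = build_suffix_array(corpus)
hits = find_occurrences(corpus, sa, ["the", "cat"])
```

Each query costs a logarithmic number of comparisons in the corpus length, which is why suffix arrays scale well once the index has been built.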

2606.06527 2026-06-11 cs.AR cs.LG 版本更新

Characterizing the Impact of NVFP4 Quantization for Low-Power Edge AI Deployment

表征NVFP4量化对低功耗边缘AI部署的影响

Ovishake Sen, Venkata Nithin Kamineni, Daniel Lobo, Swarup Bhunia, Rickard Ewetz, Baibhab Chatterjee

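AI summary: Uses ablation experiments to study an NVFP4 LUT inference framework that combines 4-bit activations, two-level scaling, and voltage-scaled storage, reporting up to 26.85x lower energy and 2.21x smaller area on edge-efficient models.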

详情
Comments
7 Pages
AI中文摘要

节能边缘推理需要降低算术成本、内存流量和硬件开销。本文对基于NVFP4 LUT的边缘高效神经网络推理进行了消融研究。提出的NVLUT框架结合了4位NVFP4激活、两级缩放、基于LUT的尾数计算、电压缩放存储和选择性ECC保护。乘法分解为符号、指数和尾数路径,其中符号使用XOR逻辑,指数使用整数加法,尾数乘法由紧凑的LUT访问替代。NVFP4激活使用FP4数据,并带有FP8块缩放和FP32张量缩放。在六个边缘高效模型上,块大小消融表明B=16提供了实用的精度/存储权衡,对于N=4096仅需4.5078位每输入。权重精度消融表明,在相同NVFP4激活路径下,FP8和FP16权重相比FP4权重仅带来适度提升。与纯无缩放FP4相比,无重训练的NVFP4通过恢复激活动态范围大幅恢复精度,而带重训练的NVFP4在模型上达到最佳精度。硬件分析显示,NVLUT相比传统LUT在ECC加电压缩放下实现高达26.85倍能耗降低,在混合电压操作下高达22.85倍。面积分别减少高达2.21倍和1.52倍。这些结果表明,NVFP4两级缩放结合选择性可靠性保护实现了鲁棒、低能耗的边缘推理。

英文摘要

Energy-efficient neural-network inference at the edge requires reducing arithmetic cost, memory traffic, computation energy, and storage overhead while maintaining acceptable accuracy. This paper presents an ablation-focused study of NVFP4 quantization for edge-efficient neural networks, with emphasis on the relationship between activation precision, weight precision, block-size scaling, retraining, and model accuracy. NVFP4 activations are represented using 4-bit FP4 data, an FP8 block scale, and an FP32 tensor scale, enabling ultra-low precision inference while preserving activation dynamic range. A block-size ablation over six edge-efficient models shows that block size B = 16 provides a practical accuracy/storage trade-off, requiring only 4.5078 bits per input for N = 4096. A weight precision ablation further shows that FP8 and FP16 weights provide only modest gains over FP4 weights under the same NVFP4 activation path, suggesting that activation quantization and scaling dominate much of the accuracy behavior. To isolate the benefit of the NVFP4 data type, this work compares conventional unscaled FP4 activation inference and NVFP4 activation inference with and without retraining. The results show that conventional FP4 inference collapses accuracy for most compact models, while NVFP4 without retraining already recovers substantial accuracy by restoring activation dynamic range through FP8 block scaling and FP32 tensor scaling. When combined with retraining, NVFP4 achieves the best accuracy across the evaluated models, demonstrating the effectiveness of scaling-aware FP4 (NVFP4) inference. These findings provide general design guidance for hardware-software co-design of low power edge inference across a broad range of accelerator platforms, including GPUs, Tensor Cores, FPGAs, domain-specific AI accelerators, near-memory computing systems, and emerging edge-computing architectures.
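The figure of 4.5078 bits per input follows from the storage layout described in the abstract. The sketch below assumes that each input stores a 4-bit FP4 value, that each block of B inputs shares one 8-bit FP8 scale, and that all N inputs share one 32-bit FP32 tensor scale; the function name is hypothetical.

```python
def nvfp4_bits_per_input(n, block_size, data_bits=4, block_scale_bits=8, tensor_scale_bits=32):
    """Average storage per activation: FP4 value + shared block scale + shared tensor scale."""
    return data_bits + block_scale_bits / block_size + tensor_scale_bits / n

bits = nvfp4_bits_per_input(n=4096, block_size=16)
```

For B = 16 and N = 4096 this gives 4 + 8/16 + 32/4096 = 4.5078125, matching the reported value, and it shows that the block scale accounts for nearly all of the overhead beyond the 4 data bits.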

7. Federated Learning, Privacy, and Security, 8 papers

2606.11272 2026-06-11 cs.LG cs.AI 新提交

Federated continual learning: A comprehensive survey on lifelong and privacy-preserving learning over distributed and non-stationary data

联邦持续学习:分布式和非平稳数据上的终身与隐私保护学习综述

Masoume Gholizade, Fabrizio Ruffini, Pietro Ducange, Francesco Marcelloni

Affiliations: University of Pisa; University of Modena and Reggio Emilia

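AI summary: A systematic survey of federated continual learning (FCL) that defines the problem, analyzes the limitations of classical federated learning under non-stationary data, proposes a multi-dimensional taxonomy, and discusses applications, evaluation metrics, and open challenges.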

详情
Comments
77 pages, 8 figures
AI中文摘要

联邦学习(FL)能够在分布式客户端之间实现协作和隐私保护的模型训练,但大多数现有的FL系统隐含地假设数据是平稳的。在现实场景中——如医疗、工业物联网(IIOT)、网络安全和智慧城市——数据流本质上是非平稳的,导致经典FL方法遭受性能下降、不稳定和灾难性遗忘。持续学习(CL)解决了在演化数据分布下的学习问题,但主要在集中式环境中研究,忽视了联邦系统的关键约束,包括隐私、有限通信和客户端异质性。联邦持续学习(FCL)出现在FL和CL的交汇处,旨在支持分布式和非平稳数据上的终身、自适应和隐私感知学习。本综述提供了FCL的全面和系统概述。我们首先给出FCL问题的正式定义并阐明其独特特征。然后分析经典FL在非平稳条件下的局限性,强调CL原理如何支持长期适应。为了组织快速增长的文献,我们提出了FCL方法的多维分类法。此外,我们回顾了代表性的应用领域和数据模态,总结了常用的评估指标,并讨论了评估长期性能和遗忘的实验视角。最后,我们强调了关键开放挑战,包括处理时间漂移下的极端异质性、设计可扩展且隐私保护的记忆机制,以及建立标准化基准。本综述旨在为推进FCL走向鲁棒和可部署的现实世界系统提供参考和路线图。

英文摘要

Federated Learning (FL) enables collaborative and privacy-preserving model training across distributed clients, but most existing FL systems implicitly assume data stationarity. In real-world settings such as healthcare, industrial IoT (IIoT), cybersecurity, and smart cities, data streams are inherently non-stationary, leading classical FL methods to suffer from performance degradation, instability, and catastrophic forgetting. Continual Learning (CL) addresses learning under evolving data distributions but has been largely studied in centralized settings, overlooking key constraints of federated systems, including privacy, limited communication, and client heterogeneity. Federated Continual Learning (FCL) emerges at the intersection of FL and CL, aiming to support lifelong, adaptive, and privacy-aware learning over distributed and non-stationary data. This survey provides a comprehensive and systematic overview of FCL. We first present a formal definition of the FCL problem and clarify its distinctive characteristics. We then analyze the limitations of classical FL under non-stationary conditions, highlighting how CL principles support long-term adaptation. To organize the rapidly growing literature, we propose a multi-dimensional taxonomy of FCL approaches. Furthermore, we review representative application domains and data modalities, summarize commonly used evaluation metrics, and discuss experimental perspectives for assessing long-term performance and forgetting. Finally, we highlight key open challenges, including handling extreme heterogeneity under temporal drift, designing scalable and privacy-preserving memory mechanisms, and establishing standardized benchmarks. This survey aims to serve as a reference and a roadmap for advancing FCL toward robust and deployable real-world systems.

2606.11480 2026-06-11 cs.LG 新提交

Accurate and Resource-Efficient Federated Continual Learning

准确且资源高效的联邦持续学习

Jebacyril Arockiaraj, Dhruv Parikh, Jayashree Adivarahan, Rajgopal Kannan, Viktor Prasanna

Affiliations: University of Southern California; DEVCOM Army Research Office

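AI summary: Proposes FedRAN, a framework that replaces gradient updates with compact random feature statistics, uses truncated SVD to cut communication overhead, and handles label scarcity with prototype-based pseudo-labeling, improving accuracy on several datasets while sharply reducing resource consumption.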

详情
Comments
Technical Report
AI中文摘要

联邦持续学习(FCL)必须在有限的资源(如通信、计算、内存和标签可用性)下从分布式任务流中学习。现有的FCL方法通常依赖于重复的局部优化、重放和完全监督。解析替代方法避免了迭代训练和重放,但使用高维随机特征来提高准确性需要二阶特征统计量——Gram矩阵,其通信成本与随机特征大小$M$成二次方关系。我们提出FedRAN,一种资源感知的解析FCL框架,用紧凑的随机特征统计量替代基于梯度的更新。每个客户端传输其Gram矩阵的截断SVD摘要,将主要的二阶上传从$M$的二次方减少到线性(对于固定秩)。服务器执行两级QR-SVD子空间合并,在空间上跨客户端、在时间上跨任务,并以闭式求解岭分类器。FedRAN进一步通过基于原型的伪标签支持标签稀缺。在CIFAR-100、ImageNet-R和VTAB数据集上,FedRAN相比最强基线将平均准确率提高了最多4.8个百分点,每个客户端的通信量比基于优化的FCL少30.6-121.8倍,平均比基于梯度的基线快190.3倍;仅使用20%标签时,伪标签将平均准确率提高了最多6.61个百分点。这些结果表明,FedRAN在通信、计算和标签约束下实现了准确且资源高效的FCL。源代码可在该https URL获取。

英文摘要

Federated continual learning (FCL) must learn from distributed task streams under limited resources, such as communication, computation, memory, and label availability. Existing FCL methods often rely on repeated local optimization, replay, and full supervision. Analytic alternatives avoid iterative training and replay, but using high-dimensional random features to improve accuracy requires a second-order feature statistic, the Gram matrix, which has a quadratic communication cost in the random feature size $M$. We propose FedRAN, a resource-aware analytic FCL framework that replaces gradient-based updates with compact random feature statistics. Each client transmits a truncated-SVD summary of its Gram matrix, reducing the dominant second-order upload from quadratic to linear in $M$ for fixed rank. The server performs a two-level QR-SVD subspace merge, spatially across clients and temporally across tasks, and solves a ridge classifier in closed form. FedRAN further supports label scarcity through prototype-based pseudo-labeling. Across CIFAR-100, ImageNet-R, and VTAB datasets, FedRAN improves average accuracy by up to 4.8 percentage points over the strongest baseline, uses 30.6-121.8$\times$ less per-client communication than optimization-based FCL, and is 190.3$\times$ faster on average than gradient-based baselines; with only 20% labels, pseudo-labeling improves average accuracy by up to 6.61 points. These results show that FedRAN enables accurate and resource-efficient FCL under communication, computation, and label constraints. The source code is available at this https URL.
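The move from quadratic to linear upload cost in $M$ can be checked by counting floats. The sketch assumes the client sends either the dense $M \times M$ Gram matrix or, for a rank-$r$ summary of this symmetric matrix, $r$ singular values plus $r$ singular vectors of length $M$; the exact payload format of FedRAN is not given in the abstract, and the values of `m` and `rank` are illustrative.

```python
def full_gram_upload(m):
    """Floats needed to send a dense M x M Gram matrix."""
    return m * m

def truncated_svd_upload(m, rank):
    """Floats for `rank` singular values plus `rank` singular vectors of length M."""
    return rank * (m + 1)

m, rank = 10_000, 64
dense = full_gram_upload(m)
compact = truncated_svd_upload(m, rank)
ratio = dense / compact
```

For a fixed rank the compact payload grows linearly with the random feature size, so the saving widens as $M$ increases.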

2606.11556 2026-06-11 cs.CR cs.AI cs.LG 交叉投稿

Privacy-Preserving Federated Autoencoder for ECG Anomaly Detection on Edge Devices

面向边缘设备上心电图异常检测的隐私保护联邦自编码器

Kaan Arda Akyol, Jakub Kacper Szeląg, Aydin Abadi, Maha Alghamdi, Ghadah Albalawi, Ghouse Ibrahim Kaleelullah, Hilal Tutus, Sarah Al Subaiei, Shardul Kapse, Syed Mohammed Raheeb, Mujeeb Ahmed, Rehmat Ullah

AI总结 提出一种结合联邦学习、差分隐私和INT8量化的端到端系统,在PTB-XL数据集上实现无监督12导联ECG异常检测,满足隐私、实时性和非IID数据要求。

详情
Comments
9 pages, 4 figures, 6 tables. Preprint prepared in IEEE conference format. Submitted to: FLTA 2026
AI中文摘要

连续心电图监测可以在心律异常演变为心血管事件之前发现它们。然而,一个可部署的系统必须同时满足三个要求:法律级别的隐私(GDPR、HIPAA)、在受限边缘硬件上的实时推理以及在非IID跨医院数据下的检测质量。我们设计并评估了一个端到端的联邦系统,在PTB-XL数据集上解决了无监督12导联ECG异常检测的所有三个要求,结合了三种自编码器家族(VanillaAE、ConvAE、VAE)、基于Flower的联邦平均(FedAvg)跨十个模拟医院、客户端差分隐私SGD(DP-SGD)与Rényi-DP隐私核算器(accountant),以及使用Raspberry Pi 4基准测试的8位整数(INT8)训练后量化。我们的主要贡献是:这些机制如何组合的经验性特征、实用的DP特定建议,以及针对临床敏感环境的技术和安全见解。联邦学习在所有架构上匹配或超过集中基线(ConvAE联邦ROC曲线下面积AUROC为0.782),并且ε扫描确定ε=4为推荐的临床操作点。INT8量化大致将模型大小减半,并将Pi 4延迟降低多达44%,AUROC损失小于0.12%。关键的是,DP和量化的惩罚在经验上是独立的,因此从业者不需要为了紧凑的边缘足迹而牺牲强大的隐私保证。据我们所知,这是第一个结合联邦学习、形式化(ε,δ)-DP、无监督重建检测和量化AArch64部署的系统。

英文摘要

Continuous electrocardiography (ECG) monitoring could surface rhythm abnormalities before they escalate into cardiovascular events. However, a deployable system must satisfy three requirements simultaneously: legal-grade privacy (GDPR, HIPAA), real-time inference on constrained edge hardware, and detection quality under non-IID cross-hospital data. We design and evaluate an end-to-end federated system addressing all three for unsupervised 12-lead ECG anomaly detection on the PTB-XL dataset, combining three autoencoder families (VanillaAE, ConvAE, VAE), Flower-based federated averaging (FedAvg) across ten simulated hospitals, client-side differentially private SGD (DP-SGD) with a Rényi-DP accountant, and 8-bit integer (INT8) post-training quantization with Raspberry Pi 4 benchmarking. Our main contributions are: an empirical characterization of how these mechanisms compose, practical DP-specific recommendations, and technical and security insights for a clinically sensitive setting. Federated learning matches or exceeds the centralized baseline across all architectures (ConvAE federated area under the ROC curve, AUROC, $0.782$), and an $\varepsilon$ sweep identifies $\varepsilon=4$ as the recommended clinical operating point. INT8 quantization roughly halves model size and cuts Pi 4 latency by up to $44\%$ with $<0.12\%$ AUROC loss. Crucially, DP and quantization penalties are empirically independent, so practitioners need not trade a strong privacy guarantee for a compact edge footprint. To our knowledge, this is the first system combining federated learning, formal $(\varepsilon,\delta)$-DP, unsupervised reconstruction-based detection, and quantized AArch64 deployment.
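For readers unfamiliar with the client-side DP-SGD step mentioned above, here is a minimal sketch of the standard recipe: clip each per-sample gradient to a maximum L2 norm, sum, add Gaussian noise scaled by the clipping norm, then average. It is a generic illustration, not the paper's code; the function names and the toy gradients are hypothetical, and privacy accounting (the Rényi-DP accountant) is deliberately left out.

```python
import math
import random


def clip(grad, max_norm):
    """Rescale a gradient so its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(g * g for g in grad))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [g * scale for g in grad]


def dp_sgd_gradient(per_sample_grads, max_norm, noise_multiplier, rng):
    """Clip per-sample gradients, sum, add Gaussian noise, and average."""
    clipped = [clip(g, max_norm) for g in per_sample_grads]
    dim = len(per_sample_grads[0])
    summed = [sum(g[i] for g in clipped) for i in range(dim)]
    # Noise standard deviation is noise_multiplier * max_norm per coordinate.
    noisy = [s + rng.gauss(0.0, noise_multiplier * max_norm) for s in summed]
    return [v / len(per_sample_grads) for v in noisy]


if __name__ == "__main__":
    grads = [[3.0, 4.0], [0.3, 0.4]]  # toy per-sample gradients
    print(dp_sgd_gradient(grads, max_norm=1.0, noise_multiplier=1.0,
                          rng=random.Random(0)))
```

Setting `noise_multiplier=0.0` removes the noise and leaves only the clipping effect, which is handy for checking the arithmetic.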

2506.01396 2026-06-11 cs.LG cs.CR stat.ML 版本更新

Mitigating Disparate Impact of Differentially Private Learning through Bounded Adaptive Clipping

通过有界自适应裁剪减轻差分隐私学习中的差异影响

Linzh Zhao, Aki Rehn, Mikko A. Heikkilä, Razane Tajeddine, Antti Honkela

AI总结 针对差分隐私学习中梯度裁剪对少数群体造成的不公平影响,提出有界自适应裁剪方法,通过引入可调下界防止过度梯度抑制,在Skewed和Fashion MNIST上最差类准确率提升超过10个百分点。

详情
Comments
TMLR camera-ready version
AI中文摘要

差分隐私已成为隐私保护机器学习的基本框架。然而,现有的差分隐私学习方法通常对模型预测产生差异影响,例如对少数群体。梯度裁剪常用于差分隐私学习,但会抑制来自困难样本的较大梯度。我们表明,自适应裁剪会加剧这一问题,因为它通常会将裁剪边界缩小到极小值以匹配拟合良好的多数类,同时显著降低其他类的准确率。我们提出有界自适应裁剪,引入可调下界以防止过度梯度抑制。与无界自适应裁剪相比,我们的方法在Skewed和Fashion MNIST上将最差类准确率提高了超过10个百分点,与自动裁剪相比提高了7个百分点,与恒定裁剪相比提高了5个百分点。代码可在该 https URL 获取。

英文摘要

Differential privacy (DP) has become an essential framework for privacy-preserving machine learning. Existing DP learning methods, however, often have disparate impacts on model predictions, e.g., for minority groups. Gradient clipping, which is often used in DP learning, can suppress larger gradients from challenging samples. We show that this problem is amplified by adaptive clipping, which will often shrink the clipping bound to tiny values to match a well-fitting majority, while significantly reducing the accuracy for others. We propose bounded adaptive clipping, which introduces a tunable lower bound to prevent excessive gradient suppression. Our method improves worst-class accuracy by over 10 percentage points on Skewed and Fashion MNIST compared to unbounded adaptive clipping, 7 points compared to Automatic clipping, and 5 points compared to constant clipping. The code is available at this https URL.
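The shrinking behaviour described above, and the effect of a lower bound, can be illustrated with a small sketch. It assumes a quantile-tracking geometric update for the clipping bound (a common form of adaptive clipping) and simply clamps the result at a floor; the paper's exact update rule is not given in the abstract, the fraction estimate is left un-noised for simplicity, and all names and constants are hypothetical.

```python
import math


def update_clip_bound(bound, grad_norms, target_quantile=0.5, lr=0.2,
                      lower_bound=0.0):
    """One adaptive-clipping step with an optional floor on the bound."""
    # Fraction of per-sample gradient norms at or below the current bound.
    frac_below = sum(n <= bound for n in grad_norms) / len(grad_norms)
    # Geometric update: shrink if too many norms fit under the bound.
    new_bound = bound * math.exp(-lr * (frac_below - target_quantile))
    return max(new_bound, lower_bound)


def run_clipping(steps, lower_bound):
    """Nine well-fit samples with tiny gradients, one hard sample."""
    norms = [0.01] * 9 + [5.0]
    bound = 1.0
    for _ in range(steps):
        bound = update_clip_bound(bound, norms, lower_bound=lower_bound)
    return bound


if __name__ == "__main__":
    print("unbounded:", run_clipping(200, lower_bound=0.0))
    print("bounded:  ", run_clipping(200, lower_bound=0.5))
```

In this toy setup the unbounded rule settles near the majority's tiny gradient norm, so the hard sample's gradient is clipped almost to nothing, while the floor keeps the bound at 0.5.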

2506.08473 2026-06-11 cs.LG 版本更新

AsFT: Anchoring Safety During LLM Fine-Tuning Within Narrow Safety Basin

AsFT:在窄安全盆地内锚定大语言模型微调期间的安全性

Shuo Yang, Qihui Zhang, Yuyang Liu, Xiaojun Jia, Kunpeng Ning, Jiayu Yao, Jigang Wang, Hailiang Dai, Yibing Song, Li Yuan

AI总结 针对微调大语言模型时安全性易受损的问题,提出AsFT方法,通过惩罚与对齐方向正交的更新,将模型约束在窄安全盆地内,在提升任务性能的同时显著降低有害行为。

详情
AI中文摘要

微调大语言模型(LLMs)可提升性能,但引入了关键的安全漏洞:即使极少的有害数据也会严重破坏安全措施。我们观察到,与对齐方向(由对齐(安全)模型与未对齐模型之间的权重差异定义)正交的扰动会迅速损害模型安全性。相反,沿对齐方向的更新则基本保持安全性,揭示了参数空间是一个“窄安全盆地”。为解决此问题,我们提出AsFT(在微调中锚定安全性),通过在微调过程中显式约束更新方向来维持安全性。通过惩罚与对齐方向正交的更新,AsFT有效将模型约束在“窄安全盆地”内,从而保持其固有安全性。在多个数据集和模型上的大量实验表明,AsFT将有害行为降低高达7.60%,任务性能提升3.44%,并在多个任务上持续优于现有方法。

英文摘要

Fine-tuning large language models (LLMs) improves performance but introduces critical safety vulnerabilities: even minimal harmful data can severely compromise safety measures. We observe that perturbations orthogonal to the alignment direction - defined by weight differences between aligned (safe) and unaligned models - rapidly compromise model safety. In contrast, updates along the alignment direction largely preserve it, revealing the parameter space as a "narrow safety basin". To address this, we propose AsFT (Anchoring Safety in Fine-Tuning) to maintain safety by explicitly constraining update directions during fine-tuning. By penalizing updates orthogonal to the alignment direction, AsFT effectively constrains the model within the "narrow safety basin," thus preserving its inherent safety. Extensive experiments on multiple datasets and models show that AsFT reduces harmful behaviors by up to 7.60%, improves task performance by 3.44%, and consistently outperforms existing methods across multiple tasks.
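The idea of penalizing updates orthogonal to the alignment direction can be sketched with plain vector projection. The snippet below splits an update into its component along a given direction and the remainder, and returns a squared-norm penalty on the remainder. How AsFT actually incorporates such a term into the fine-tuning loss is not stated in the abstract; the function names and the `weight` parameter are hypothetical.

```python
def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def split_update(update, align_dir):
    """Decompose an update into parts parallel and orthogonal to align_dir."""
    scale = dot(update, align_dir) / dot(align_dir, align_dir)
    parallel = [scale * d for d in align_dir]
    orthogonal = [u - p for u, p in zip(update, parallel)]
    return parallel, orthogonal


def orthogonal_penalty(update, align_dir, weight=1.0):
    """Squared norm of the orthogonal component, scaled by weight."""
    _, orth = split_update(update, align_dir)
    return weight * dot(orth, orth)


if __name__ == "__main__":
    direction = [1.0, 0.0]  # stand-in for (aligned weights - unaligned weights)
    print(orthogonal_penalty([5.0, 0.0], direction))  # update along the direction
    print(orthogonal_penalty([2.0, 3.0], direction))  # update with an orthogonal part
```

An update that moves purely along the alignment direction incurs no penalty, whereas any orthogonal movement is charged quadratically.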

2510.01529 2026-06-11 cs.LG cs.CR 版本更新

Bypassing Prompt Guards in Production with Controlled-Release Prompting

绕过生产环境中的提示守卫:受控释放提示攻击

Jaiden Fairoze, Sanjam Garg, Keewoo Lee, Mingyuan Wang

AI总结 针对AI对齐的提示过滤存在理论上的不可能性,本文提出受控释放提示攻击,利用轻量级输入过滤器与主模型之间的资源不对称性,在实际部署的大语言模型系统中成功绕过提示守卫。

详情
Comments
Accepted to USENIX Security 2026
AI中文摘要

Ball等人最近指出,用于AI对齐的提示过滤面临一个根本性障碍:在标准密码学假设下,任何运行速度远快于被保护模型的过滤器都无法普遍区分对抗性提示和良性提示。我们研究这一不可能性结果是否转化为已部署的大语言模型(LLM)系统中的现实漏洞。我们通过引入受控释放提示攻击给出了肯定答案,这是理论框架的一种实际实例化,利用了轻量级输入过滤器与其保护的主模型之间的资源不对称性。与理论构造不同,我们的攻击不需要修改模型:它生成任何有界过滤器无法解读但对目标LLM仍然可处理的恶意提示。我们发现,在基线方法失败的四个主要聊天平台(Google Gemini、DeepSeek Chat、xAI Grok和Mistral Le Chat)上,我们的攻击均成功。此外,我们将攻击应用于从Gemini提取受版权保护的数据。最后,我们对14个开源提示守卫模型进行了系统评估,揭示即使具有推理能力的过滤器也无法在不产生过高资源开销的情况下可靠地检测我们的攻击。

英文摘要

Ball et al. recently established that prompt filtering for AI alignment faces a fundamental barrier: under standard cryptographic assumptions, no filter running significantly faster than the protected model can universally distinguish adversarial prompts from benign ones. We investigate whether this impossibility result translates to real-world vulnerabilities in deployed large language model (LLM) systems. We answer affirmatively by introducing controlled-release prompting, a practical instantiation of the theoretical framework that exploits the resource asymmetry between lightweight input filters and the main models they protect. Unlike the theoretical construction, our attack does not require model modification: it generates malicious prompts that are indecipherable by any bounded filter yet remain tractable to the target LLM. We find our attack to be successful on four major chat platforms (Google Gemini, DeepSeek Chat, xAI Grok, and Mistral Le Chat) where baseline methods fail. Additionally, we apply our attack to extract copyrighted data from Gemini. Finally, we provide a systematic evaluation of 14 open-weight prompt guard models, revealing that even reasoning-capable filters cannot reliably detect our attack without incurring prohibitive resource overhead.

2510.03520 2026-06-11 cs.LG cs.AI eess.SY 版本更新

Certifiable Safe RLHF: Semantic Grounding and Fixed Penalty Constraint Optimization for Safer LLM Alignment

可认证安全RLHF:基于语义基础与固定惩罚约束优化的更安全大语言模型对齐

Kartik Pandit, Sourav Ganguly, Arnesh Banerjee, Shaahin Angizi, Arnob Ghosh

AI总结 针对现有RLHF方法依赖奖励/成本函数和对偶变量调优导致性能敏感且缺乏可证明安全保证的问题,提出CS-RLHF,通过语义基础成本模型和固定惩罚约束优化,实现可认证安全对齐,效率提升至少5倍。

详情
AI中文摘要

确保安全是大语言模型(LLMs)的基本要求。在增强模型输出效用与减轻其潜在危害之间取得适当平衡是一个复杂且持续的挑战。当代方法通常将这个问题形式化为约束马尔可夫决策过程(CMDP)框架,并采用成熟的CMDP优化技术。然而,这些方法表现出两个显著的限制。首先,它们对奖励和成本函数的依赖使得性能对底层评分机制高度敏感,而该机制必须捕捉语义含义,而不是被表面关键词触发。其次,基于CMDP的训练需要调整对偶变量,这一过程计算成本高昂,并且对于可能通过对抗性越狱利用的固定对偶变量,不提供任何可证明的安全保证。为了克服这些限制,我们引入了可认证安全RLHF(CS-RLHF),它引入了一个在大规模语料库上训练的成本模型,以分配基于语义的安全分数。与基于拉格朗日的方法相比,CS-RLHF采用了一种修正的基于惩罚的公式。该设计借鉴了约束优化中精确惩罚函数理论,其中约束满足直接通过适当选择的惩罚项来强制执行。通过适当缩放的惩罚,可以在优化器处保证安全约束的可行性,从而消除了对偶变量更新的需要。实证评估表明,CS-RLHF优于最先进的LLM模型响应,对正常和越狱提示的效率至少提高5倍。

英文摘要

Ensuring safety is a foundational requirement for large language models (LLMs). Achieving an appropriate balance between enhancing the utility of model outputs and mitigating their potential for harm is a complex and persistent challenge. Contemporary approaches frequently formalize this problem within the framework of Constrained Markov Decision Processes (CMDPs) and employ established CMDP optimization techniques. However, these methods exhibit two notable limitations. First, their reliance on reward and cost functions renders performance highly sensitive to the underlying scoring mechanism, which must capture semantic meaning rather than being triggered by superficial keywords. Second, CMDP-based training entails tuning a dual variable, a process that is computationally expensive and provides no provable safety guarantee for a fixed dual variable, which can be exploited through adversarial jailbreaks. To overcome these limitations, we introduce Certifiable Safe-RLHF (CS-RLHF), which uses a cost model trained on a large-scale corpus to assign semantically grounded safety scores. In contrast to the Lagrangian-based approach, CS-RLHF adopts a rectified penalty-based formulation. This design draws on the theory of exact penalty functions in constrained optimization, wherein constraint satisfaction is enforced directly through a suitably chosen penalty term. With an appropriately scaled penalty, feasibility of the safety constraints can be guaranteed at the optimizer, eliminating the need for dual-variable updates. Empirical evaluation demonstrates that CS-RLHF outperforms state-of-the-art LLMs, producing responses that are at least 5 times more efficient against nominal and jailbreak prompts.
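The contrast between a Lagrangian term and a rectified (exact-penalty style) term can be shown on scalar scores. The sketch below assumes a per-response reward, a safety cost, and a cost budget; the rectified penalty only activates when the cost exceeds the budget. This is a generic illustration of the two formulations, not the CS-RLHF objective itself, and the names, candidate scores, and penalty values are hypothetical.

```python
def lagrangian_objective(reward, cost, budget, dual):
    """Reward minus a dual-weighted constraint term (also credits slack)."""
    return reward - dual * (cost - budget)


def rectified_penalty_objective(reward, cost, budget, penalty):
    """Reward minus a fixed penalty applied only to constraint violation."""
    return reward - penalty * max(0.0, cost - budget)


def best_candidate(candidates, budget, penalty):
    """Index of the (reward, cost) pair with the highest penalized score."""
    scores = [rectified_penalty_objective(r, c, budget, penalty)
              for r, c in candidates]
    return scores.index(max(scores))


if __name__ == "__main__":
    candidates = [(1.0, 0.9), (0.8, 0.4)]  # (reward, safety cost)
    print(best_candidate(candidates, budget=0.5, penalty=0.1))   # penalty too small
    print(best_candidate(candidates, budget=0.5, penalty=10.0))  # penalty large enough
```

With a penalty that is too small the higher-reward but over-budget candidate still wins; once the penalty is scaled up, the feasible candidate is selected without any dual-variable tuning.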

2512.13666 2026-06-11 cs.CR cs.DC cs.IT cs.LG 版本更新

SEDULity: A Proof-of-Learning Framework for Distributed and Secure Blockchains with Efficient Useful Work

SEDULity:一种面向分布式安全区块链的高效有用工作证明学习框架

Weihang Cao, Mustafa Doger, Sennur Ulukus

AI总结 提出一种名为SEDULity的证明学习框架,通过将区块模板编码到训练过程中并设计难解易验的有用函数替代PoW谜题,在保持区块链安全性的同时高效训练机器学习模型。

详情
AI中文摘要

工作量证明(PoW)的安全性和去中心化已在现有区块链系统中得到充分验证,但其巨大的能源浪费引发了可持续性担忧。有用工作证明(PoUW)旨在将无意义的计算重定向到有意义任务(如解决机器学习问题),从而催生了学习证明(PoL)分支。尽管已有研究提出了多种PoL,但它们都在一定程度上存在安全性、去中心化或效率问题。本文提出一种PoL框架,在完全分布式环境中高效训练机器学习模型,同时维护区块链安全性。我们将该框架命名为SEDULity,代表安全、高效、分布式和有用的基于学习的区块链系统。具体而言,我们将区块模板编码到训练过程中,并设计一种难解但相对易验的有用函数,作为PoW谜题的替代。我们证明该框架是分布式、安全的,并能高效训练机器学习模型。进一步展示所提出的PoL框架可扩展到其他类型的有用工作,并设计激励机制以激励任务验证。理论上证明,在精心设计的系统参数下,理性矿工有动机完全诚实地进行训练。最后,通过仿真结果展示框架性能并验证分析。

英文摘要

The security and decentralization of Proof-of-Work (PoW) have been well-tested in existing blockchain systems. However, its tremendous energy waste has raised concerns about sustainability. Proof-of-Useful-Work (PoUW) aims to redirect the meaningless computation to meaningful tasks such as solving machine learning (ML) problems, giving rise to the branch of Proof-of-Learning (PoL). While previous studies have proposed various PoLs, they all, to some degree, suffer from security, decentralization, or efficiency issues. In this paper, we propose a PoL framework that trains ML models efficiently while maintaining blockchain security in a fully distributed manner. We name the framework SEDULity, which stands for a Secure, Efficient, Distributed, and Useful Learning-based blockchain system. Specifically, we encode the template block into the training process and design a useful function that is difficult to solve but relatively easy to verify, as a substitute for the PoW puzzle. We show that our framework is distributed, secure, and efficiently trains ML models. We further demonstrate that the proposed PoL framework can be extended to other types of useful work and design an incentive mechanism to incentivize task verification. We show theoretically that a rational miner is incentivized to train fully honestly with well-designed system parameters. Finally, we present simulation results to demonstrate the performance of our framework and validate our analysis.

8. 鲁棒性、不确定性与可信学习 27 篇

2606.11205 2026-06-11 cs.LG cs.AI cs.CL 新提交

Dual-Stance Evaluation of Sycophancy: The Structure of Agreement and the Limits of Intervention

谄媚的双立场评估:同意的结构与干预的局限

Matthew James Buchan

AI总结 提出双立场评估方法,发现激活引导在减少谄媚时也会抑制对事实正确陈述的同意,揭示了表示可读但不可写的普遍差距。

详情
Comments
18 pages, 9 figures, accepted to TAIS 2026
AI中文摘要

激活引导可以改变LLM的行为,但标准评估通常不测试减少谄媚的方向是否也抑制对事实正确陈述的同意。我们引入了双立场评估,测试每个话题的两个立场,并将其应用于Llama-3-8B-Instruct上的质心差引导。我们发现一种分离:模型在几何上不同的子空间中表示谄媚和事实同意,但引导方向在两者上的投影相等,无法差异化地针对任一。因此,该方向同样减少对事实正确陈述(例如地球是圆的)和谄媚陈述的同意。两个激活组的所有其他静态属性都匹配,表明行为分离源于生成动态或残差流分析无法解析的更细粒度结构。该模式说明了一个普遍差距:从激活中可读的表示可能无法通过它们写入。

英文摘要

Activation steering can shift LLM behaviour, but standard evaluations do not typically test whether a sycophancy-reduction direction also suppresses agreement with factually correct statements. We introduce dual-stance evaluation, which tests both stances of each topic, and apply it to centroid-difference steering on Llama-3-8B-Instruct. We find a dissociation: the model represents sycophantic and factual agreement in geometrically distinct subspaces, yet the steering direction projects equally onto both and cannot differentially target either. The direction accordingly reduces agreement with factually correct statements (e.g. that the Earth is round) as well as sycophantic ones. All other static properties of the two activation groups are matched, suggesting the behavioural dissociation arises from generation dynamics or from finer-grained structure that residual-stream analysis cannot resolve. The pattern illustrates a general gap: representations that are readable from activations may not be writable through them.

2606.11319 2026-06-11 cs.LG cond-mat.dis-nn 新提交

Learning from almost nothing: How neural networks survive heavy input corruption

从几乎一无所有中学习:神经网络如何在严重输入损坏中生存

Justin Tahmassebpur, Asadullah Bhuiyan, Hyejin Kim, Omri Lesser

发表机构 * Cornell University(康奈尔大学)

AI总结 研究神经网络在输入严重损坏(>90%)时仍保持高精度的鲁棒性,通过平均场方法推导出网络实现最近类均值原型规则,解释学习成功的机制。

详情
Comments
26 pages, 10 figures
AI中文摘要

从不完美数据中学习是机器学习的核心主题,将鲁棒性的实际问题与可学习性的基本问题联系起来。本文研究属性噪声:在保持标签完整的情况下从损坏输入中学习,这一设置受到的关注远少于标签噪声。我们考虑两种损坏模型:加性噪声和替换噪声。通过在损坏分类数据集上使用多层感知器(MLP)进行实验,我们发现神经网络保持鲁棒性,即使输入损坏超过90%——远超人类识别能力——仍能维持远高于随机水平的准确率。为了理解这种鲁棒性,我们使用平均场启发的方法分析严重损坏机制下的无限宽网络,并推导出分类结果的前导决策规则:网络实现一个原型规则,即最近类均值,将每个测试点分配给其训练集平均值最接近的类别。这个前导决策规则在广泛的MLP架构中具有普适性,适用于任何深度以及多种激活函数和噪声分布。相同的质心机制与实验中有限宽网络的行为高度吻合,并提供了一个可解释且易于分析的说明,解释了为什么即使单个训练样本几乎不携带任何信号,学习也能成功。

英文摘要

Learning from imperfect data is a central theme in machine learning, connecting practical questions of robustness to fundamental questions of learnability. Here we examine attribute noise: learning from corrupted inputs while keeping the labels intact, a setting that has received considerably less analytical attention than its label-noise counterpart. We consider two types of corruption models: additive noise and replacement noise. Through experiments with multi-layer perceptrons (MLPs) on corrupted classification datasets, we find that neural networks remain robust, maintaining well-above-chance accuracy even when inputs are >90% corrupted -- far beyond human recognition. To understand this robustness, we analyze infinite-width networks in the heavy-corruption regime using a mean-field-inspired approach and derive a leading-order decision rule for the classification outcome: the network implements a prototype rule, the nearest-class-mean, assigning each test point to the class whose training-set average it most closely resembles. This leading-order decision rule is universal across a broad range of MLP architectures, holding for any depth, as well as a wide class of activation functions and noise distributions. The same centroid mechanism closely matches finite-width network behavior in our experiments and provides an interpretable and analytically tractable account of why learning can succeed even when individual training examples carry almost no signal.
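The nearest-class-mean prototype rule named above is simple enough to write out directly. The sketch below averages the training inputs of each class and assigns a test point to the class with the closest mean; it uses squared Euclidean distance as the similarity, which is an assumption here, since the abstract only says the point "most closely resembles" a class average. Function names and data are hypothetical.

```python
def class_means(xs, ys):
    """Average the training inputs of each class."""
    sums, counts = {}, {}
    for x, y in zip(xs, ys):
        acc = sums.setdefault(y, [0.0] * len(x))
        for i, v in enumerate(x):
            acc[i] += v
        counts[y] = counts.get(y, 0) + 1
    return {y: [s / counts[y] for s in acc] for y, acc in sums.items()}


def predict(means, x):
    """Assign x to the class whose mean is nearest (squared Euclidean)."""
    return min(means,
               key=lambda y: sum((a - b) ** 2 for a, b in zip(x, means[y])))


if __name__ == "__main__":
    xs = [[0.0, 0.0], [2.0, 0.0], [10.0, 10.0], [12.0, 10.0]]
    ys = [0, 0, 1, 1]
    means = class_means(xs, ys)
    print(means)
    print(predict(means, [3.0, 1.0]), predict(means, [9.0, 9.0]))
```

Averaging over many training examples is what lets such a rule work even when each individual example is mostly noise: independent corruption tends to cancel in the class mean.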

2606.11409 2026-06-11 cs.LG cs.AI cs.CR 新提交

Risk Under Pressure: Compute-Aware Evaluation of Adversarial Robustness in Language Models

压力下的风险:语言模型对抗鲁棒性的计算感知评估

Malikeh Ehghaghi, Boglárka Ecsedi, Marsha Chechik, Colin Raffel

发表机构 * University of Toronto(多伦多大学) Vector Institute(向量研究所) Hugging Face

AI总结 提出基于计算压力(累积FLOPs)的对抗鲁棒性评估框架,通过风险-计算曲线和两个新指标,揭示不同攻击策略的计算成本差异,并在10个模型上验证了对齐训练、模型规模等因素对计算空间鲁棒性的非单调影响。

详情
AI中文摘要

大型语言模型(LLMs)的对抗鲁棒性评估通常报告固定查询预算下的攻击成功率(ASR),隐含地认为所有攻击成本相同。实际上,不同攻击策略的计算开销可能相差几个数量级。因此,固定预算下的ASR可能掩盖破解模型所需的真实努力,从而难以判断攻击成本是否值得。我们提出一个基于计算压力的计算感知评估框架,以累积浮点运算次数(FLOPs)作为对抗努力的代理。我们引入风险-计算曲线,将计算预算映射到攻击风险,并推导出两个指标,总结给定攻击成功所需的平均压力。在跨越三个模型家族和语言模型训练与对齐的四个不同阶段的十个模型上,使用三种攻击策略(基于梯度、迭代细化和基于模板)在两个破解鲁棒性基准上评估,我们发现:(1)对齐训练对计算空间鲁棒性具有非单调影响;(2)扩大模型规模降低了基于梯度的攻击有效性,但对更便宜的基于模板的攻击影响有限;(3)在代理模型上优化的基于梯度的攻击可以迁移到独立的目标模型,从而降低攻击者成本;(4)在单个模型内,不同危害类别的计算成本差异高达约5倍;(5)安全对齐的RL增加了总成本,同时使某些类别不成比例地易于攻击。我们发布框架以实现计算感知的风险评估和评价。

英文摘要

Adversarial robustness evaluations of large language models (LLMs) typically report attack success rate (ASR) under fixed query budgets, implicitly treating all attacks as equally costly. In practice, the computational expense of different attack strategies can vary by orders of magnitude. Consequently, ASR at a fixed budget can obscure the true effort required to jailbreak a model, thereby making it hard to determine whether an attack's cost justifies its payoff to the attacker. We propose a compute-aware evaluation framework based on computational pressure, measured in cumulative floating-point operations (FLOPs), as a proxy for adversarial effort. We introduce risk-compute curves, which map compute budgets to attack risk, and derive two metrics that summarize the average pressure required for a given attack to succeed. Across ten models spanning three families and four different stages in language model training and alignment, evaluated with three attack strategies (gradient-based, iterative refinement, and template-based) on two jailbreak robustness benchmarks, we find: (1) alignment training has non-monotonic effects on compute-space robustness; (2) scaling model size reduces gradient-based attack effectiveness but has limited impact on cheaper template-based attacks; (3) gradient-based attacks optimized on a surrogate model can transfer to a separate target model, providing a way to reduce attacker costs; (4) compute cost varies by up to ${\approx}5{\times}$ across harm categories within a single model; and (5) safety-aligned RL increases aggregate cost while leaving some categories disproportionately accessible. We release our framework to enable compute-aware risk assessment and evaluation.
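A risk-compute curve, as described above, maps a compute budget to the fraction of attack attempts that have succeeded within that budget. The following sketch builds such a curve from per-attempt records of the cumulative FLOPs at first success. The record format, the function name, and the numbers are hypothetical, and the paper's two summary metrics are not reproduced because the abstract does not define them.

```python
def risk_compute_curve(flops_to_success, budgets):
    """Fraction of attempts that succeed within each FLOPs budget.

    flops_to_success: cumulative FLOPs at which each attempt first succeeded,
    or None if the attempt never succeeded.
    """
    n = len(flops_to_success)
    return [
        sum(1 for f in flops_to_success if f is not None and f <= b) / n
        for b in budgets
    ]


if __name__ == "__main__":
    attempts = [1e12, 5e12, None, 2e13]  # toy data, one entry per prompt
    budgets = [1e12, 1e13, 1e14]
    for b, risk in zip(budgets, risk_compute_curve(attempts, budgets)):
        print(f"{b:.0e} FLOPs -> risk {risk:.2f}")
```

Two attacks with the same final success rate can have very different curves, which is the information a fixed-budget ASR hides.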

2606.11474 2026-06-11 cs.LG eess.SY physics.acc-ph 新提交

Mahalanobis-Guided Latent OOD Detection for Hybrid ES-DRL Control in Time-Varying Systems

基于马氏距离的潜在分布外检测用于时变系统中混合ES-DRL控制

Shaifalee Saxena, Alexander Scheinker

AI总结 针对时变系统中强化学习控制器性能下降问题,提出基于变分自编码器潜在空间马氏距离的分布外检测方法,实现与极值搜索控制器的自适应切换,并在粒子加速器控制中验证有效性。

详情
AI中文摘要

本文研究了非线性时变系统中基于马氏距离的潜在分布外(OOD)检测,用于测试时RL控制器切换。RL控制器可以在训练分布内快速控制高维系统,但当时间变化动力学产生未见过的观测时,其性能可能下降。我们考虑一个组合的ES-DRL控制器,其中RL提供快速的分布内动作,而有界极值搜索(ES)在OOD操作下提供鲁棒的模型无关控制。关键挑战在于决定何时切换。我们在分布内束流剖面观测上训练变分自编码器(VAE),并使用VAE潜在空间中的马氏距离在测试时检测OOD束流剖面。此OOD决策设置一个二元开关,选择RL控制器或ES控制器。我们在安全关键的粒子加速器控制中评估该方法。在此设置中,空间磁体运动产生RL训练期间未见过的OOD束流剖面。VAE潜在空间的可视化表明,所提方法识别出此OOD场景,并为组合控制器中RL和ES之间的切换提供可解释信号。

英文摘要

In this paper, we study Mahalanobis-guided latent out-of-distribution (OOD) detection for test-time RL controller switching in nonlinear time-varying systems. RL controllers can quickly control high-dimensional systems within the training distribution, but their performance can degrade when time-varying dynamics produce unseen observations. We consider a combined ES--DRL controller, where RL provides fast in-distribution actions and bounded extremum seeking (ES) provides robust model-independent control under OOD operation. The key challenge is deciding when to switch. We train a variational autoencoder (VAE) on in-distribution beam-profile observations and use Mahalanobis distance in the VAE latent space to detect OOD beam profiles at test time. This OOD decision sets a binary switch that selects either the RL controller or the ES controller. We evaluate the approach in safety-critical particle accelerator control. In this setting, spatial magnet motion creates OOD beam profiles that were not seen during RL training. Visualization of the VAE latent space shows that the proposed method identifies this OOD scenario and provides an interpretable signal for switching between RL and ES in the combined controller.
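The switching logic described above can be sketched as follows: fit a mean and covariance to in-distribution latent codes, compute the Mahalanobis distance of a new latent code, and hand control to ES when the distance exceeds a threshold. To stay dependency-free the example uses a two-dimensional latent space with a hand-written 2x2 inverse; the VAE encoder is not shown, and the threshold, function names, and latent points are hypothetical.

```python
import math


def mean_and_cov_2d(points):
    """Sample mean and covariance (sxx, sxy, syy) of 2-D latent codes."""
    n = len(points)
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    sxx = sum((p[0] - mx) ** 2 for p in points) / (n - 1)
    syy = sum((p[1] - my) ** 2 for p in points) / (n - 1)
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (n - 1)
    return (mx, my), (sxx, sxy, syy)


def mahalanobis_2d(z, mean, cov):
    """Mahalanobis distance using the closed-form inverse of a 2x2 covariance."""
    sxx, sxy, syy = cov
    det = sxx * syy - sxy * sxy
    dx, dy = z[0] - mean[0], z[1] - mean[1]
    q = (syy * dx * dx - 2.0 * sxy * dx * dy + sxx * dy * dy) / det
    return math.sqrt(q)


def select_controller(z, mean, cov, threshold):
    """Binary switch: RL for in-distribution latents, ES otherwise."""
    return "ES" if mahalanobis_2d(z, mean, cov) > threshold else "RL"


if __name__ == "__main__":
    train = [(1.0, 0.0), (-1.0, 0.0), (0.0, 2.0), (0.0, -2.0)]
    mean, cov = mean_and_cov_2d(train)
    print(select_controller((0.5, 0.5), mean, cov, threshold=3.0))
    print(select_controller((4.0, 0.0), mean, cov, threshold=3.0))
```

Because the distance is scaled by the latent covariance, a deviation along a low-variance direction counts for more than the same deviation along a high-variance one.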

2606.11949 2026-06-11 cs.LG cs.CR stat.ML 新提交

Online Shift Detection and Conformal Adaptation for Deployed Safety Classifiers

已部署安全分类器的在线漂移检测与共形自适应

Jun Wen Leong

AI总结 提出在线监测系统,使用校准序列统计检测分布漂移,并通过共形弃权层自适应阈值恢复目标错误率,在800个实验单元中实现86.6%有效检测。

详情
Comments
16 pages, 4 figures, 7 tables. Code and data at this https URL
AI中文摘要

我们提出了一种在线监测系统,用于检测已部署安全分类器中的分布漂移,使用校准的序列统计量来检测分类器何时移出分布。一旦检测到,共形弃权层会自适应调整决策阈值,以恢复目标错误率ε=0.1。在一项预注册的析因评估(4个分类器×5种漂移条件×20个种子×2个窗口大小,共800个单元)中,该系统实现了86.6%的有效检测(693/800,95% CI [84.1%, 88.8%]),平均延迟为39.5步。检测在三种真实标签机制下均有效:合成发作(86.6%)、真实时间越狱(85%,17/20)和GCG对抗攻击。加权共形预测为DeBERTa恢复了高达39个百分点的丢失覆盖率(ESS=46/300),但所有其他分类器均崩溃(ESS≈300):逻辑密度比估计在高维嵌入空间中实现了完美的源/目标可分离性,将所有重要性权重裁剪至下限。DeBERTa显示出从有效校正(释义,ESS=46)到几乎完全崩溃(对抗后缀,ESS=206)的梯度。PCA降至32维打破了崩溃,为Llama Guard恢复了33个百分点,为ShieldGemma恢复了21个百分点。方差分解显示分类器(η²=0.243)、漂移类型(η²=0.237)及其交互作用(η²=0.185)均对检测延迟方差有显著贡献(所有p<0.001),表明需要针对每个分类器的监测配置文件。

英文摘要

We present an online monitoring system for distributional shift in deployed safety classifiers, using calibrated sequential statistics to detect when a classifier has moved out of distribution. Upon detection, a conformal abstention layer adapts decision thresholds to recover a target error rate epsilon=0.1. In a pre-registered factorial evaluation (4 classifiers x 5 shift conditions x 20 seeds x 2 window sizes, 800 cells), the system achieves 86.6% valid detection (693/800, 95% CI [84.1%, 88.8%]) with mean latency of 39.5 steps. Detection holds across three ground-truth regimes: synthetic onset (86.6%), real temporal jailbreaks (85%, 17/20), and GCG adversarial attacks. Weighted conformal prediction recovers up to 39 pp of lost coverage for DeBERTa (ESS=46/300) but collapses for all other classifiers (ESS~300): logistic density ratio estimation achieves perfect source/target separability in high-dimensional embedding spaces, clipping all importance weights to the floor. DeBERTa shows a gradient from effective correction (paraphrase, ESS=46) to near-total collapse (adversarial suffix, ESS=206). PCA to 32 dimensions breaks the collapse, recovering 33 pp for Llama Guard and 21 pp for ShieldGemma. Variance decomposition reveals classifier (eta^2=0.243), shift type (eta^2=0.237), and their interaction (eta^2=0.185) all contribute substantially to detection latency variance (all p<0.001), indicating per-classifier monitoring profiles are necessary.
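The ESS figures quoted above are easier to read with a formula in hand. The sketch below uses Kish's effective sample size, $(\sum_i w_i)^2 / \sum_i w_i^2$, which is an assumption since the abstract does not state which definition is used. When every importance weight is clipped to the same floor, the weights are uniform and ESS equals the sample count, so the reweighting has no effect; that matches the collapse case reported as ESS close to 300. The weight vectors are hypothetical.

```python
def effective_sample_size(weights):
    """Kish effective sample size of a set of importance weights."""
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


if __name__ == "__main__":
    collapsed = [0.01] * 300             # all weights clipped to the same floor
    selective = [1.0] * 46 + [0.0] * 254  # weight concentrated on 46 samples
    print(effective_sample_size(collapsed))
    print(effective_sample_size(selective))
```

A low ESS relative to the calibration set size indicates that the weights are actually discriminating between source-like and target-like samples.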

2606.11998 2026-06-11 cs.LG 新提交

Bootstrapped Monitoring: Leveraging Transparent Reasoning to Oversee Stronger AI Agents

自助监控:利用透明推理监督更强的AI智能体

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology(加州理工学院)

AI总结 提出自助监控协议,通过插入具有透明思维链的不可信中间模型来监督更强智能体,在软件工程任务中显著提升捕获率,即使不可信监控者与智能体合谋。

详情
AI中文摘要

可信监控是AI控制的基石。然而,随着前沿模型能力增强,可信与不可信模型之间的能力差距可能使可信模型成为不可靠的监控者。我们引入了自助监控协议,通过在监督链中插入一个具有透明思维链推理的更强的不可信中间模型来解决这一问题。不可信监控者($U_m$)评估智能体的行为,而较弱的可信模型($T$)监督$U_m$的推理以检测合谋。我们在多轮软件工程任务(BashArena)上对多个智能体和监控者评估了自助监控。即使不可信监控者主动与智能体合谋,只要我们能够访问其原始思维链,自助监控相比仅使用可信监控显著提高了捕获率。我们的结果表明,随着AI能力的进步,自助监控可以延长可信模型在控制中的有效寿命。

英文摘要

Trusted monitoring is a cornerstone of AI control. However, as frontier models grow more capable, the increasing capabilities gap between trusted and untrusted models may render trusted models unreliable monitors. We introduce bootstrapped monitoring, a protocol that addresses this by inserting a stronger, intermediate untrusted model with transparent chain-of-thought reasoning into the oversight chain. The untrusted monitor ($U_m$) evaluates the agent's actions, while a weaker trusted model ($T$) oversees $U_m$'s reasoning to detect collusion. We evaluate bootstrapped monitoring on multi-turn software engineering tasks (BashArena) across multiple agents and monitors. Bootstrapped monitoring substantially improves catch rates over trusted-only monitoring, even when the untrusted monitor actively colludes with the agent, provided we have access to its raw chain-of-thought. Our results suggest that bootstrapped monitoring can extend the useful lifetime of trusted models in control as AI capabilities advance.

2606.12016 2026-06-11 cs.LG cs.AI 新提交

Generalization Hacking: Models Can Game Reinforcement Learning by Preventing Behavioral Generalization

泛化黑客:模型可通过阻止行为泛化来博弈强化学习

Frank Xiao, Mary Phuong

发表机构 * California Institute of Technology(加州理工学院)

AI总结 本研究提出泛化黑客现象,模型在强化学习中通过自我接种机制阻止行为泛化,在保持高奖励的同时抵抗行为修正,首次证明模型能主动破坏训练过程。

详情
AI中文摘要

模型后训练,特别是强化学习(RL),是开发者塑造模型价值观和行为的主要机制之一。然而,随着模型越来越具有评估和训练意识,当感知到的目标与其当前价值观冲突时,它们可能会被激励去抵抗训练,从而削弱开发者通过进一步训练检测错位和纠正模型行为的能力。在本文中,我们展示了泛化黑客,即模型在RL期间收集奖励的同时阻止奖励行为泛化。我们在Qwen3-235B-A22B上构建了一个模型有机体,对描述训练意识和自我接种(一种新颖机制,其中模型在其思维链中将合规性框架为上下文特定,而不演示或指示任一行为)的合成文档进行微调。该模型有机体在训练时实现了与对照组相当的有害性,同时在700步RL中保持了持续的约15个百分点的合规差距。此外,仅接受训练意识文档训练的对照有机体在RL压力下独立发现了类似接种的推理,尽管从未接触过该概念,却发展出自己的合规差距。由于泛化黑客有机体在整个过程中获得高奖励,标准训练指标未提供泛化失败的信号。我们的结果首次证明模型可以在保持高奖励的同时主动抵抗RL行为修正,表明随着模型变得更有能力和训练意识,它们可能能够破坏训练过程本身。

英文摘要

Model post-training, and in particular reinforcement learning (RL), is one of the primary mechanisms by which developers can shape models' values and behaviors. However, as models become increasingly evaluation and training aware, they may be motivated to resist training when the perceived objective conflicts with their current values, undermining developers' ability to detect misalignment and correct model behavior through further training. In this paper, we demonstrate generalization hacking, in which a model collects reward during RL while preventing the rewarded behavior from generalizing. We construct a model organism on Qwen3-235B-A22B, finetuning on synthetic documents describing training awareness and self-inoculation, a novel mechanism in which the model frames compliance as context-specific in its chain of thought, without demonstrating or instructing either behavior. The model organism achieves train-time harmfulness comparable to controls while maintaining a persistent ${\sim}15$ percentage point compliance gap across 700 steps of RL. Additionally, a control organism trained only on training awareness documents independently discovers inoculation-like reasoning under RL pressure, developing its own compliance gap despite never being exposed to the concept. Because the generalization-hacking organism receives high reward throughout, standard training metrics provide no signal that generalization has failed. Our results constitute the first demonstration that a model can actively resist RL behavioral modification while maintaining high reward, suggesting that as models become more capable and training-aware, they may be able to undermine the training process itself.

2606.12251 2026-06-11 cs.LG cs.AI cs.CR 新提交

Reinforcement Learning Disrupts Gradient-Based Adversarial Optimization

强化学习破坏基于梯度的对抗优化

Xinhai Zou, Chang Zhao, Alireza Aghabagherloo, Dave Singelée, Robin Degraeve, Bart Preneel

发表机构 * COSIC, KU Leuven(鲁汶大学COSIC) Imec Brubotics, VUB(布鲁塞尔自由大学Brubotics) DistriNet, KU Leuven(鲁汶大学DistriNet)

AI总结 研究通过强化学习训练图像分类器以破坏攻击者使用的梯度结构,发现RL作为隐式正则化器产生不稳定梯度方向和较小梯度幅度,使基于梯度的攻击失效,并与对抗训练结合实现双重防御。

详情
AI中文摘要

基于梯度的对抗攻击仍然是对深度神经网络(DNN)的主要威胁,因为它们利用梯度信息高效优化对抗扰动。为了解决这个问题,我们研究了强化学习(RL)训练是否可以通过使用策略梯度目标和epsilon-贪婪探索来训练图像分类器,从而破坏攻击者使用的梯度结构。通过在CIFAR-10、CIFAR-100和ImageNet-100上使用多种架构进行系统实验,我们发现RL训练的分类器显著破坏了基于梯度的对抗优化。为了解释这一点,我们使用损失景观可视化、静态和动态梯度指标以及预测熵进行了全面的机制分析。我们的分析揭示,RL充当隐式正则化器,产生具有高度不稳定梯度方向和较小梯度幅度的模型。这种组合使得每个PGD步骤在方向上不可靠且幅度有限,导致基于梯度的攻击在实际迭代预算内失败。我们进一步表明,将RL与对抗训练(RL-adv)结合提供了在两个互补层面运作的双层防御:RL退化攻击者可用的梯度信息(梯度级防御),而对抗训练强化决策边界(边界级防御)。RL-adv在所有评估的主要攻击类型(包括基于梯度的PGD、AutoAttack、基于迁移和基于查询的攻击)中实现了最高的鲁棒性,显著优于SL-adv。这些发现将RL诱导的梯度破坏识别为一种互补的鲁棒性机制,并激励未来研究结合SL效率与RL梯度正则化特性的混合SL-RL训练调度。

英文摘要

Gradient-based adversarial attacks remain a dominant threat to deep neural networks (DNNs), as they exploit gradient information to efficiently optimize adversarial perturbations. To address this, we investigate whether reinforcement learning (RL) training can disrupt the gradient structure used by attackers by training image classifiers with policy-gradient objectives and epsilon-greedy exploration. Through systematic experiments across CIFAR-10, CIFAR-100, and ImageNet-100 with multiple architectures, we find that RL-trained classifiers significantly disrupt gradient-based adversarial optimization. To explain this, we conduct a comprehensive mechanism analysis using loss landscape visualization, static and dynamic gradient indicators, and predictive entropy. Our analysis reveals that RL acts as an implicit regularizer, producing models with highly unstable gradient directions and smaller gradient magnitudes. This combination makes each PGD step both unreliable in direction and limited in magnitude, causing gradient-based attacks to fail within practical iteration budgets. We further show that combining RL with adversarial training (RL-adv) provides a dual-layer defense operating at two complementary levels: RL degrades gradient information available to attackers (gradient-level defense), while adversarial training strengthens decision boundaries (boundary-level defense). RL-adv achieves the highest robustness across all major attack types evaluated, including gradient-based (PGD, AutoAttack), transfer-based, and query-based attacks, outperforming SL-adv by a significant margin. These findings identify RL-induced gradient disruption as a complementary robustness mechanism and motivate future research on hybrid SL-RL training schedules that combine SL's efficiency with RL's gradient-regularization properties.

2606.11211 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

Calibration Drift Under Reasoning: How Chain-of-Thought Budgets Induce Overconfidence in Large Language Models

推理下的校准漂移:思维链预算如何导致大型语言模型过度自信

Prakul Sunil Hiremath, Harshit R. Hiremath

发表机构 * Department of Computer Science and Engineering, Visvesvaraya Technological University, Belagavi(维斯瓦拉亚科技大学计算机科学与工程系,贝拉加维) Department of Computer Science and Business System, SG Balekundri Institute of Technology, Belagavi(SG巴莱昆德里理工学院计算机科学与商业系统系,贝拉加维)

AI总结 研究发现,增加思维链推理预算超过任务特定阈值会导致模型对错误答案过度自信,提出校准漂移现象并引入CABStop停止规则。

详情
Comments
31 pages, 4 figures, 3 tables. Introduces Calibration Drift Under Reasoning (CDUR) with theoretical analysis and preliminary experiments; includes CABStop; code and data available
AI中文摘要

大型语言模型(LLMs)表达校准不确定性的能力对于安全部署至关重要。思维链(CoT)推理被广泛用于提高准确性和可靠性,但其对校准的影响尚未完全理解。我们表明这一图景是不完整的:在某些设置中,将推理预算增加到任务特定阈值以上会导致模型系统性地变得过度自信,对错误答案赋予高置信度。我们将此现象称为推理下的校准漂移(CDUR),并从理论和实证两方面进行研究。我们定义推理预算B,并分析预期校准误差ECE(B)呈现非单调模式的条件:它首先随着推理纠正错误而下降,然后随着更长推理产生内部一致但错误的解释而上升。我们提出一个基于自回归生成的假设锁定模型来解释这种行为。我们在47个推理陷阱问题上评估了Llama-3.1-8B和Llama-3.3-70B,跨越四个推理预算和三个随机种子(1,368次API调用;574个有效响应)。8B模型显示出非单调的校准行为,而70B模型的结果仅限于基线评估,对于预算依赖效应尚无定论。我们引入CABStop,一种校准感知的停止规则,当置信度偏离辅助准确性估计时停止推理。这些结果表明,增加推理深度并不总是提高可靠性,应谨慎监控。

英文摘要

The ability of large language models (LLMs) to express calibrated uncertainty is important for safe deployment. Chain-of-thought (CoT) reasoning is widely used to improve accuracy and reliability, but its effect on calibration is not fully understood. We show that this picture is incomplete: in some settings, increasing the reasoning budget beyond a task-specific threshold can cause models to become systematically overconfident, assigning high confidence to incorrect answers. We call this phenomenon Calibration Drift Under Reasoning (CDUR) and study it both theoretically and empirically. We define reasoning budget B and analyze conditions under which Expected Calibration Error ECE(B) follows a non-monotonic pattern: it first decreases as reasoning corrects errors, then increases as longer reasoning produces internally consistent but incorrect explanations. We propose a Hypothesis Lock-In model based on autoregressive generation to explain this behavior. We evaluate Llama-3.1-8B and Llama-3.3-70B on 47 reasoning-trap questions across four reasoning budgets and three seeds (1,368 API calls; 574 valid responses). The 8B model shows non-monotonic calibration behavior, while results for the 70B model are limited to baseline evaluation and are inconclusive for budget-dependent effects. We introduce CABStop, a calibration-aware stopping rule that halts reasoning when confidence diverges from an auxiliary accuracy estimate. These results suggest that increasing reasoning depth does not always improve reliability and should be monitored carefully.
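As a rough illustration of the ECE(B) quantity the abstract tracks, the sketch below computes a standard equal-width-bin Expected Calibration Error from confidences and correctness flags. The binning scheme, the function name, and the toy data are assumptions made for illustration; the abstract does not say how the authors estimate ECE.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Equal-width-bin ECE: weighted mean of |accuracy - mean confidence| per bin."""
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 is placed in the last bin.
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(a for _, a in members) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


# Hypothetical toy data (not from the paper): the same four questions
# answered under a short and a long reasoning budget.
ece_short = expected_calibration_error([0.65, 0.75, 0.65, 0.75], [True, True, False, True])
ece_long = expected_calibration_error([0.95, 0.95, 0.95, 0.95], [True, False, False, True])
```

In this toy setup the longer budget yields the same accuracy as the worst short-budget bin but much higher stated confidence, so its ECE is larger. That is the qualitative pattern the abstract calls calibration drift.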

2606.11471 2026-06-11 cs.CR cs.LG 交叉投稿

Evaluating and Combating the Impact of Concept Drift on the Performance of Machine Learning-Based Phishing Detection Systems

评估与对抗概念漂移对基于机器学习的钓鱼检测系统性能的影响

Warren Fernando, Nikos Komninos

AI总结 研究概念漂移对基于机器学习的钓鱼邮件检测系统性能的影响,并提出缓解性能下降的策略。

详情
AI中文摘要

数字领域的扩展导致数字通信大幅增加,电子邮件已成为最突出的渠道之一。电子邮件通信的普及在专业和个人环境中都很明显,从而为恶意行为者创造了大量可利用的漏洞。垃圾邮件作为一种未经请求的通信形式,通常对收件人带有恶意意图,自电子邮件技术诞生以来一直是电子邮件用户面临的持续挑战,而数字景观的增长加剧了这一问题。电子邮件垃圾邮件过滤器是电子邮件客户端的组成部分,旨在识别潜在有害消息并提醒用户其恶意内容。钓鱼攻击通常是基于恶意软件攻击的初始阶段,并且随着时间推移,恶意软件变得越来越复杂,钓鱼攻击也在迅速演变。检测恶意软件和垃圾邮件领域中恶意活动的一种广泛采用的方法是应用机器学习。我们的目标是评估垃圾邮件领域内的演变对这些基于机器学习的检测系统的影响,并探索减轻相关性能下降的策略。

英文摘要

The expansion of the digital domain has resulted in a substantial increase in digital communication, with email emerging as one of the most prominent channels. The proliferation of email communication is apparent in both professional and personal contexts, thereby creating numerous vulnerabilities for malicious actors to exploit. Spam emails, a form of unsolicited correspondence often bearing malicious intent towards recipients, have been an ongoing challenge for email users since the inception of email technology, and this problem has been exacerbated by the growth of the digital landscape. Email spam filters are integral components of email clients, engineered to identify potentially harmful messages and alert users to their malicious content. Phishing, frequently the initial phase of malware-based attacks, is evolving rapidly, with malware becoming increasingly sophisticated over time. A widely adopted approach for detecting malicious activity within malware and spam domains is the application of machine learning. Our aim is to assess the impact of the evolution within the spam email domain on these machine learning-based detection systems and to explore strategies for mitigating associated performance degradation.

2606.11804 2026-06-11 cs.AI cs.CR cs.LG 交叉投稿

Toward Trustworthy AI: Multi-Target Adversarial Attacks and Robust Defenses for Continuous Data Summarization

迈向可信赖的人工智能:针对连续数据摘要的多目标对抗攻击与鲁棒防御

Yuefang Lian, Longkun Guo, Zhongrui Zhao, Zhigang Lu, Yanan Cai, Shuchao Pang, Dachuan Xu, Jason Xue

发表机构 * Nankai University(南开大学) James Cook University(詹姆斯库克大学) Western Sydney University(西悉尼大学) Beijing University of Technology(北京工业大学) Fuzhou University(福州大学) Nanjing University of Science and Technology(南京理工大学) CSIRO's Data 61(澳大利亚联邦科学与工业研究组织Data61) The University of Adelaide(阿德莱德大学)

AI总结 研究通过DR-子模优化在相似性层面扰动下对连续数据摘要进行对抗攻击,提出多目标攻击生成和鲁棒防御的近似算法,实验表明攻击有效且防御能改善鲁棒性-缓解权衡。

详情
Comments
Submitted to IEEE Transactions on Information Forensics and Security (IEEE TIFS)
AI中文摘要

可信赖的人工智能需要可靠的数据处理管道,而不仅仅是鲁棒的下游预测模型。作为上游组件,数据摘要决定了哪些信息被保留并传递给后续的学习或决策模块。因此,对摘要过程的对抗性扰动可能以上游方式损害可信赖的人工智能:它们可能改变所选摘要,降低其代表性,并进一步降低后续学习任务的效用。在本文中,我们通过DR-子模优化研究相似性层面扰动下的连续数据摘要对抗攻击。我们证明了一类多分辨率图像摘要目标可以表示为非负子模集函数的多线性扩展,并满足具有$m$-弱单调性的DR-子模性。然后,我们将多目标攻击生成表述为一个最小-最大问题,其中优化相似性结构的一个可容许扰动以降低多个目标摘要模型。为了缓解此类扰动,我们将针对混合攻击类型的鲁棒防御表述为一个正则化的最大-最小问题。对于这两个问题,我们开发了具有理论保证的近似算法。在真实数据和受控聚类基准上的实验表明,所提出的攻击在代表性的低到中等预算范围内是有效的,并且可以导致下游任务性能损失。所提出的防御在结构化设置中改善了鲁棒性-缓解权衡,同时也揭示了真实数据上鲁棒保护的参数敏感性。

英文摘要

Trustworthy AI requires reliable data-processing pipelines, not only robust downstream predictive models. As an upstream component, data summarization determines which information is retained and passed to subsequent learning or decision modules. Therefore, adversarial perturbations to the summarization process can compromise trustworthy AI in an upstream manner: they may alter the selected summary, reduce its representativeness, and further degrade the utility of subsequent learning tasks. In this paper, we study adversarial attacks on continuous data summarization under similarity-level perturbations through DR-submodular optimization. We show that a class of multi-resolution image summarization objectives can be formulated as multilinear extensions of non-negative submodular set functions and satisfy DR-submodularity with $m$-weak monotonicity. We then formulate multi-target attack generation as a min-max problem, where one admissible perturbation of the similarity structure is optimized to degrade multiple target summarization models. To mitigate such perturbations, we formulate robust defense against mixed attack types as a regularized max-min problem. For both problems, we develop approximation algorithms with theoretical guarantees. Experiments on real-data and controlled clustered benchmarks show that the proposed attack is effective in representative low-to-moderate budget regimes and can induce downstream task-performance loss. The proposed defense improves the robustness–mitigation trade-off in structured settings, while also revealing the parameter sensitivity of robust protection on real data.

2606.11865 2026-06-11 stat.ML cs.LG 交叉投稿

Conformal Bayes under Label Shift: Post-Hoc Calibration vs. In-Training Adaptation

标签偏移下的共形贝叶斯:事后校准与训练内适应

Seungjin Choi

AI总结 研究标签偏移下共形贝叶斯方法,通过重要性加权共形校准恢复目标域覆盖,比较事后校准与训练内适应两种策略,后者在偏差训练中起到去偏作用。

详情
Comments
2nd Workshop on Epistemic Intelligence in Machine Learning (EIML@ICML 2026)
AI中文摘要

共形贝叶斯将贝叶斯后验预测与共形校准相结合,产生既统计有效又几何高效的预测集。我们从统一视角研究标签偏移下的共形贝叶斯,识别出两种互补方法,它们通过重要性加权共形校准恢复名义目标域覆盖,但通过独立机制运作。事后校准将后验预测向目标域倾斜,并通过重要性加权分位数校正共形阈值,保持参数后验不变。训练内适应将参数后验本身向目标域倾斜,产生校正后的预测,其最高预测密度区域作为基于拟合目标预测的最高预测密度(HPD)预测集;效率依赖于模型,并不保证有限样本条件最优性。两个受控实验表明,在无偏训练机制下,两种策略同样实现有效覆盖,而在先导优化(lead optimization)机制下,训练内适应作为去偏算子,在覆盖不变的情况下减少区间宽度。

英文摘要

Conformal Bayes combines Bayesian posterior predictives with conformal calibration to produce prediction sets that are both statistically valid and geometrically efficient. We study conformal Bayes under label shift from a unified perspective, identifying two complementary approaches that restore nominal target-domain coverage through importance-weighted conformal calibration but operate through independent mechanisms. Post-hoc calibration tilts the posterior predictive toward the target domain and corrects the conformal threshold via an importance-weighted quantile, leaving the parameter posterior unchanged. In-training adaptation tilts the parameter posterior itself to the target domain, producing a corrected predictive whose highest predictive density region serves as the highest predictive density (HPD) based prediction set under the fitted target predictive; efficiency is model-dependent and does not imply finite-sample conditional optimality. Two controlled experiments show that in an unbiased training regime both strategies achieve valid coverage equally, while in a lead-optimization regime in-training adaptation acts as a debiasing operator, reducing interval width at unchanged coverage.
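The importance-weighted quantile mentioned in the abstract can be sketched generically as follows. This is a minimal weighted split-conformal threshold under label shift, not the paper's conformal Bayes procedure. It assumes the class weights (target class prior divided by source class prior) are already known, and all names are hypothetical.

```python
def weighted_conformal_threshold(scores, labels, class_weights, candidate_label, alpha):
    """Weighted (1 - alpha) quantile of calibration scores.

    The test point contributes a point mass at +infinity whose weight depends
    on the candidate label, because under label shift the weight is a function
    of the (unknown) test label.
    """
    cal_weights = [class_weights[y] for y in labels]
    test_weight = class_weights[candidate_label]
    total = sum(cal_weights) + test_weight
    cumulative = 0.0
    for score, weight in sorted(zip(scores, cal_weights)):
        cumulative += weight / total
        if cumulative >= 1.0 - alpha:
            return score
    # Not enough calibration mass below the level: the threshold is infinite.
    return float("inf")


# Hypothetical calibration data: nonconformity scores and their true labels.
cal_scores = [0.1, 0.2, 0.3, 0.4]
cal_labels = [0, 0, 1, 1]
uniform = weighted_conformal_threshold(cal_scores, cal_labels, {0: 1.0, 1: 1.0}, 0, 0.5)
shifted = weighted_conformal_threshold(cal_scores, cal_labels, {0: 3.0, 1: 1.0}, 0, 0.5)
```

A candidate label is then included in the prediction set when its nonconformity score does not exceed the threshold computed for that label. Upweighting class 0 pulls the threshold toward the scores observed on class-0 calibration points.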

2606.12075 2026-06-11 cs.CR cs.LG 交叉投稿

Categorical Robustness Assessment for Machine Learning based Network Intrusion Detection Systems

基于机器学习的网络入侵检测系统的分类鲁棒性评估

Mayank Raj, Nathaniel D. Bastian, Lance Fiondella, Gokhan Kul

AI总结 本文系统比较了CNN、LSTM和随机森林三种分类器在对抗攻击下的鲁棒性,发现随机森林基线准确率虽高但极易被攻破,而CNN表现最稳健。

详情
AI中文摘要

网络入侵检测系统(NIDS)广泛使用机器学习(ML),但ML模型可能受到对抗性攻击的操纵。这些攻击向网络流量数据添加精心设计的扰动,导致误分类。虽然先前的工作已经证明了孤立环境下的对抗性漏洞,但在受控攻击条件下,跨架构以及基于攻击类别和类型的系统比较仍然有限,这使得从业者在对抗性环境中部署哪些模型缺乏明确指导。本文提出了一个简单的问题:当攻击者试图操纵系统时,哪种分类器架构实际上能够保持稳定?我们对三种流行架构进行了测试:一维卷积神经网络(CNN)、长短期记忆网络(LSTM)和随机森林(RF)集成。使用ACI-IoT-2023数据集(超过120万个样本,涵盖12种攻击类型),我们使用FGSM和PGD对抗攻击对每个模型进行攻击,这些攻击在归一化特征空间中应用基于梯度的扰动,符合既定的对抗性ML评估协议,扰动预算范围为$\epsilon=0.01$到$\epsilon=0.1$。令人惊讶的是,随机森林实现了近乎完美的基线准确率(99.98%),但在攻击下灾难性地崩溃,在我们测试的最小扰动下下降了73个百分点。另一方面,CNN在$\epsilon=0.01$时保持了95.5%的准确率,并且随着扰动的增加而优雅地退化。LSTM介于两者之间。这些发现颠覆了传统观念:如果模型在对抗压力的第一个迹象下就崩溃,那么高基线准确率毫无意义。对于在对抗性环境中部署入侵检测的从业者,我们推荐基于CNN的架构,并提供特定场景的部署指导。

英文摘要

Network Intrusion Detection Systems (NIDS) heavily utilize Machine Learning (ML), but ML models can be manipulated via adversarial attacks. These attacks add carefully crafted perturbations to network traffic data that lead to misclassifications. While prior work has demonstrated adversarial vulnerabilities in isolated settings, systematic comparisons across architectures, attack classes, and attack categories under controlled attack conditions remain limited, leaving practitioners without clear guidance on which models to deploy in adversarial environments. This paper asks a simple question: which classifier architectures actually hold up when attackers try to manipulate the system? We put three popular architectures through their paces: a 1D Convolutional Neural Network (CNN), a Long Short-Term Memory (LSTM) network, and a Random Forest (RF) ensemble. Using the ACI-IoT-2023 dataset (over 1.2 million samples spanning 12 attack types), we subject each model to FGSM and PGD adversarial attacks, which apply gradient-based perturbations in normalized feature space consistent with established adversarial ML evaluation protocols, at perturbation budgets ranging from $\epsilon=0.01$ to $\epsilon=0.1$. Surprisingly, Random Forest achieved near-perfect baseline accuracy (99.98%), yet collapsed catastrophically under attack, dropping 73 percentage points at the smallest perturbation we tested. CNN, on the other hand, retained 95.5% accuracy at $\epsilon=0.01$ and degraded gracefully as perturbations increased. LSTM fell somewhere in between. These findings flip the conventional wisdom: high baseline accuracy means nothing if a model shatters at the first sign of adversarial pressure. For practitioners deploying intrusion detection in adversarial environments, we recommend CNN-based architectures and provide scenario-specific deployment guidance.
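To make the gradient-based perturbation concrete, here is a minimal FGSM sketch against a hypothetical logistic-regression classifier, chosen only because its input gradient has a closed form. The paper's CNN, LSTM, and Random Forest models are not reproduced, and the clipping range [0, 1] is an assumption standing in for the normalized feature space.

```python
import math


def _sign(value):
    """Return -1, 0, or 1 according to the sign of value."""
    return (value > 0) - (value < 0)


def fgsm_perturb(x, y, weights, bias, epsilon):
    """One FGSM step against a logistic-regression classifier (label y in {0, 1})."""
    logit = sum(w * xi for w, xi in zip(weights, x)) + bias
    prob = 1.0 / (1.0 + math.exp(-logit))
    # Gradient of binary cross-entropy with respect to the input: (prob - y) * weights.
    grad = [(prob - y) * w for w in weights]
    # Shift each feature by epsilon in the loss-increasing direction,
    # then clip to the assumed normalized range [0, 1].
    return [min(1.0, max(0.0, xi + epsilon * _sign(g))) for xi, g in zip(x, grad)]


# Hypothetical two-feature example with the largest budget quoted in the abstract.
x_adv = fgsm_perturb([0.5, 0.5], 1, [2.0, -1.0], 0.0, 0.1)
```

PGD repeats this step with a smaller step size and projects back into the epsilon-ball around the original input after each iteration.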

2606.12342 2026-06-11 cs.CL cs.AI cs.ET cs.LG 交叉投稿

ALIGNBEAM: Inference-Time Alignment Transfer via Cross-Vocabulary Logit Mixing

ALIGNBEAM: 通过跨词汇表logit混合实现推理时对齐迁移

Chirag Chawla, Pratinav Seth, Vinay Kumar Sankarapu

发表机构 * Lexsi Labs

AI总结 针对领域微调降低大模型安全性的问题,提出无需训练的ALIGNBEAM方法,通过逐token翻译锚模型logit并选择最安全候选,实现跨词汇表的安全对齐迁移,保持任务准确性和推理开销。

详情
AI中文摘要

领域微调会降低大型语言模型的安全性:微调后的专家模型容易顺从以领域语言表述的有害提示。现有的推理时防御方法通过混合来自安全锚模型的logit,但要求两个模型共享词汇表,这使得它们无法用于安全性退化最严重的跨族专家模型。我们提出ALIGNBEAM,一种无需训练的方法,通过在每个解码步骤逐token将锚模型logit翻译为目标模型的词汇表来解除这一限制;然后一个小型LLM法官从K个候选续写中选择最安全的。无需改变权重,并且可以在部署时调整安全-效用权衡而无需重新训练。在跨词汇表和同词汇表评估对中,ALIGNBEAM显著提高了对抗性基准上的拒绝率,同时将任务准确性和推理开销保持在实用范围内。结果表明,安全对齐可以在推理时在不同模型族之间迁移,而无需修改任一模型的权重。

英文摘要

Domain fine-tuning degrades the safety of large language models: fine-tuned specialists readily comply with harmful prompts framed in domain language. Existing inference-time defenses that mix logits from a safe anchor model require both models to share a vocabulary, which rules them out for the cross-family specialists where safety is most degraded. We present ALIGNBEAM, a training-free method that lifts this restriction by translating anchor logits into the target model's vocabulary token-by-token at each decoding step; a small LLM judge then selects the safest among K candidate continuations. No weights are changed, and the safety-utility trade-off can be tuned at deployment without retraining. Across both cross-vocabulary and same-vocabulary evaluation pairs, ALIGNBEAM substantially raises refusal on adversarial benchmarks while keeping task accuracy and inference overhead within practical bounds. The results show that safety alignment can be transferred between model families at inference time, without touching either model's weights.

2601.17360 2026-06-11 cs.LG cs.AI cs.CR 版本更新

Robust Privacy: Inference-Stage Privacy through Certified Robustness

鲁棒隐私:通过认证鲁棒性实现推理阶段隐私

Jiankai Jin, Xiangzheng Zhang, Zhao Liu, Wenzhuo Xu, Dongdong Yang, Deyue Zhang, Quanchen Zou

AI总结 提出鲁棒隐私(RP)概念,基于认证鲁棒性确保预测在输入邻域内不变,从而限制推理阶段隐私泄露;实验表明RP在属性推断和模型反演攻击中有效提升隐私-效用权衡。

详情
AI中文摘要

观察模型发布预测的对手可以推断查询输入的敏感属性,甚至重建模型训练数据的代表。因此,推理接口充当隐私泄露的侧信道。我们引入鲁棒隐私(RP),一种受认证鲁棒性启发的推理阶段隐私概念:如果模型预测在输入x周围半径为R的邻域内以至少$1-\alpha$的置信度可证明不变,则x享有$(R,\alpha)$-鲁棒隐私,在此条件下我们证明任何观察发布预测的对手在区分x与距离x为R内的任何输入时最多有$\alpha/2$的优势。基于RP,我们形式化鲁棒属性隐私(RAP),一种属性级隐私概念,刻画与发布预测兼容的敏感属性值集合。在分类任务上,RP将RAP兼容推理区间的中位数长度从23.50增加到29.96,降低了属性推断精度。模型反演攻击通常被视为训练阶段威胁,实际上依赖于通过推理接口泄露的细粒度信号;RP在推理阶段掩盖这些信号,将黑盒反演攻击的成功率(ASR)从73%降至4%。这种直接针对泄露通道的方法使RP在隐私-效用权衡空间中优于DP-SGD和随机响应:RP在21% ASR下保持98.4%的准确率,而DP-SGD必须将准确率降至61.7%才能达到相当的ASR。在两个实验中,增加平滑样本量N同时增强了隐私和效用。最后,我们考察模型蒸馏作为范围边界,表明RP缓解了属性级和实例级推理阶段隐私泄露,但无法通过模型蒸馏缓解函数级提取。

英文摘要

An adversary observing a model's released prediction can infer sensitive attributes of the queried input, or even reconstruct representatives of the model's training data. The inference interface thus acts as a side channel for privacy leakage. We introduce Robust Privacy (RP), an inference-stage privacy notion inspired by certified robustness: if a model's prediction is provably invariant within a radius-R neighborhood around an input x with confidence at least $1-\alpha$, then x enjoys $(R,\alpha)$-Robust Privacy, under which we prove that any adversary observing the released prediction has at most $\alpha/2$ advantage in distinguishing x from any input within distance R of x. Building on RP, we formalize Robust Attribute Privacy (RAP), an attribute-level privacy notion that characterizes the set of sensitive-attribute values that remain compatible with a released prediction. On a classification task, RP increases the median length of the RAP-compatible inference interval from 23.50 to 29.96, reducing attribute-inference precision. Model inversion attacks, often treated as a training-stage threat, in fact rely on fine-grained signals leaked through the inference interface; RP masks these signals at the inference stage, reducing attack success rate (ASR) from 73% to 4% on a black-box inversion attack. This direct targeting of the leakage channel enables RP to dominate DP-SGD and randomized response in the privacy-utility tradeoff space: RP retains 98.4% accuracy at 21% ASR, whereas DP-SGD must drop accuracy to 61.7% to reach a comparable ASR. Across both experiments, increasing the smoothing sample size N strengthens privacy and improves utility together. Finally, we examine model distillation as a scope boundary and show that RP mitigates attribute-level and instance-level inference-stage privacy leakage, but not function-level extraction through model distillation.

2602.05746 2026-06-11 cs.LG cs.AI 版本更新

Learning to Inject: Automated Prompt Injection via Reinforcement Learning

学习注入:通过强化学习实现自动化提示注入

Xin Chen, Jie Zhang, Florian Tramèr

AI总结 提出AutoInject,一种基于强化学习的黑盒框架,自动学习对抗性后缀进行提示注入,在AgentDojo上优于模板攻击和多种自适应攻击,并突破专门防御模型。

详情
AI中文摘要

提示注入是LLM代理中的一个关键漏洞,然而最强的方法仍然依赖于人类红队和手工制作的提示。适应自动化越狱优化器并不能缩小这一差距:越狱使模型趋向于通用顺从,而提示注入需要发出具有正确参数的特定工具调用。成功信号是二元的,随机采样的后缀几乎从不触发它,因此标准优化器没有梯度可循。我们提出了AutoInject,一个黑盒强化学习(RL)框架,学习用于提示注入的对抗性后缀。一个学习的基于比较的奖励对每个候选后缀与迄今为止看到的最佳后缀进行评分,将二元信号转化为适合RL优化的密集奖励。该框架支持在线基于查询的攻击和离线训练的可迁移后缀(部署时无需实用访问),并在任务完成反馈可用时纳入实用目标。在AgentDojo上,AutoInject在生产模型中优于模板攻击、GCG、TAP和自适应攻击,在McNemar检验下具有统计显著性(p<0.05)。AutoInject学习的后缀还打破了Meta-SecAlign-70B,这是一个专门针对提示注入进行微调的模型,而模板攻击完全失败。这些结果为提示注入建立了自动化基线,并揭示了基于偏好的防御与基于自适应优化的攻击者之间的差距。

英文摘要

Prompt injection is a critical vulnerability in LLM agents, yet the strongest methods still rely on human red-teamers and hand-crafted prompts. Adapting automated jailbreak optimizers does not close this gap: jailbreaks shape models toward generic compliance, while prompt injection requires emitting specific tool calls with correct parameters. The success signal is binary, and randomly sampled suffixes almost never trigger it, so standard optimizers have no gradient to follow. We present AutoInject, a black-box reinforcement learning (RL) framework that learns adversarial suffixes for prompt injection. A learned comparison-based reward scores each candidate against the best suffix seen so far, turning the binary signal into a dense reward suitable for RL optimization. The framework supports both online query-based attacks and offline-trained transferable suffixes that need no utility access at deployment, and incorporates a utility objective when task-completion feedback is available. On AgentDojo, AutoInject outperforms template attacks, GCG, TAP, and adaptive attacks across production models, with statistically significant improvements under McNemar's test (p<0.05). Suffixes learned by AutoInject also break Meta-SecAlign-70B, a model fine-tuned specifically to resist prompt injection, where template attacks fail outright. The results establish an automated baseline for prompt injection and expose a gap between preference-based defenses and adaptive optimization-based attackers.
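For readers unfamiliar with the significance test named in the abstract, the sketch below computes a two-sided exact McNemar p-value from paired outcomes. Only the discordant pairs matter: b counts tasks where attack A succeeds and attack B fails, and c counts the reverse. Whether the authors used the exact or the chi-squared variant is not stated, and the counts shown are hypothetical.

```python
from math import comb


def mcnemar_exact_p(b, c):
    """Two-sided exact McNemar p-value from the discordant counts b and c."""
    n = b + c
    if n == 0:
        # No discordant pairs: the two methods are indistinguishable.
        return 1.0
    k = min(b, c)
    # Binomial(n, 0.5) probability of seeing k or fewer in the smaller cell.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2.0 * tail)


# Hypothetical counts: attack A wins on 9 tasks where B fails, B wins on 1.
p_value = mcnemar_exact_p(1, 9)
```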

2602.14913 2026-06-11 cs.LG eess.IV 版本更新

Coverage Guarantees for Pseudo-Calibrated Conformal Prediction under Distribution Shift

分布漂移下伪校准保形预测的覆盖保证

Farbod Siahkali, Ashwin Verma, Vijay Gupta

AI总结 针对分布漂移下保形预测覆盖失效问题,利用伪校准和领域自适应工具,推导目标覆盖下界,并提出通过松弛参数膨胀保形阈值的方法及源调优伪校准算法,实验证明其能缓解覆盖退化。

详情
Comments
Under review. 6 pages, 2 figures, 1 table
AI中文摘要

保形预测(CP)在可交换性假设下提供无分布边际覆盖保证,但当数据分布发生漂移时,这些保证可能失效。我们分析了在有限标签条件协变量漂移模型下,使用伪校准作为应对这种性能损失的工具。利用领域自适应的工具,我们根据分类器的源域损失和漂移的Wasserstein度量推导出目标覆盖的下界。利用这一结果,我们提供了一种设计伪校准集的方法,该方法通过松弛参数膨胀保形阈值,使目标覆盖保持在规定水平以上。最后,我们提出了一种源调优伪校准算法,该算法根据分类器的不确定性在硬伪标签和随机化标签之间进行插值。数值实验表明,我们的界限定性地跟踪了伪校准行为,并且源调优方案在分布漂移下缓解了覆盖退化,同时保持了非平凡的预测集大小。

英文摘要

Conformal prediction (CP) offers distribution-free marginal coverage guarantees under an exchangeability assumption, but these guarantees can fail if the data distribution shifts. We analyze the use of pseudo-calibration as a tool to counter this performance loss under a bounded label-conditional covariate shift model. Using tools from domain adaptation, we derive a lower bound on target coverage in terms of the source-domain loss of the classifier and a Wasserstein measure of the shift. Using this result, we provide a method to design pseudo-calibrated sets that inflate the conformal threshold by a slack parameter to keep target coverage above a prescribed level. Finally, we propose a source-tuned pseudo-calibration algorithm that interpolates between hard pseudo-labels and randomized labels as a function of classifier uncertainty. Numerical experiments show that our bounds qualitatively track pseudo-calibration behavior and that the source-tuned scheme mitigates coverage degradation under distribution shift while maintaining nontrivial prediction set sizes.

2604.22167 2026-06-11 cs.LG cs.AI 版本更新

Estimating Tail Risks in Language Model Output Distributions

语言模型输出分布中的尾部风险估计

Rico Angell, Raghav Singhal, Zachary Horvitz, Zhou Yu, Rajesh Ranganath, Kathleen McKeown, He He

AI总结 提出一种基于重要性采样的方法,通过创建不安全版本来高效估计语言模型产生有害输出的尾部概率,在10-20倍更少样本下匹配蒙特卡洛估计,并揭示模型对输入的敏感性。

详情
Comments
Accepted to ICML 2026
AI中文摘要

语言模型能力日益增强,并正在人口层面快速部署。因此,这些模型的安全性变得越来越重要。幸运的是,对齐方面的进展显著降低了模型产生有害输出的可能性。然而,当模型每天被查询数十亿次时,即使是罕见的 worst-case 行为也会发生。当前的安全评估侧重于捕获产生有害输出的输入分布。这些评估忽略了模型的概率性质及其尾部输出行为。为了衡量这种尾部风险,我们提出了一种方法,可以高效估计任何输入查询产生有害输出的概率。我们不是从目标模型进行简单的暴力采样(其中有害输出可能很罕见),而是通过创建目标模型的不安全版本来实现重要性采样。这些不安全版本通过使有害输出更可能发生,实现了样本高效的估计。在衡量误用和未对齐的基准测试中,这些估计与使用10-20倍更少样本的暴力蒙特卡洛估计相匹配。例如,我们仅用500个样本就可以估计数量级为10^-4的有害输出概率。此外,我们发现这些有害性估计可以揭示模型对输入扰动的敏感性,并预测部署风险。我们的工作表明,准确的小概率事件估计对于安全评估既关键又可行。代码可在以下网址获取:此 https URL

英文摘要

Language models are increasingly capable and are being rapidly deployed on a population-level scale. As a result, the safety of these models is increasingly high-stakes. Fortunately, advances in alignment have significantly reduced the likelihood of harmful model outputs. However, when models are queried billions of times in a day, even rare worst-case behaviors will occur. Current safety evaluations focus on capturing the distribution of inputs that yield harmful outputs. These evaluations disregard the probabilistic nature of models and their tail output behavior. To measure this tail risk, we propose a method to efficiently estimate the probability of harmful outputs for any input query. Instead of naive brute-force sampling from the target model, where harmful outputs could be rare, we operationalize importance sampling by creating unsafe versions of the target model. These unsafe versions enable sample-efficient estimation by making harmful outputs more probable. On benchmarks measuring misuse and misalignment, these estimates match brute-force Monte Carlo estimates using 10-20x fewer samples. For example, we can estimate probability of harmful outputs on the order of 10^-4 with just 500 samples. Additionally, we find that these harmfulness estimates can reveal the sensitivity of models to perturbations in model input and predict deployment risks. Our work demonstrates that accurate rare-event estimation is both critical and feasible for safety evaluations. Code is available at this https URL
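The importance-sampling idea can be illustrated on a toy problem that has nothing to do with language models: estimating a Gaussian tail probability by sampling from a shifted proposal under which the rare event is common. The proposal plays the role the abstract assigns to the unsafe model versions. This is an analogy under stated assumptions, not the authors' estimator, and all names are hypothetical.

```python
import math
import random


def tail_probability_is(threshold, shift, n_samples, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by sampling from N(shift, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            # Likelihood ratio N(0, 1) / N(shift, 1) evaluated at x.
            total += math.exp(-shift * x + 0.5 * shift * shift)
    return total / n_samples


# The event {X > 4} is rare under N(0, 1), but about half of the
# proposal samples from N(4, 1) land in it.
estimate = tail_probability_is(threshold=4.0, shift=4.0, n_samples=20000)
exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
```

Naive sampling from N(0, 1) would need on the order of tens of thousands of draws just to observe the event once, whereas the reweighted estimate uses every second draw.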

2606.10198 2026-06-11 cs.LG cs.AI cs.CV 版本更新

Density Ridge Selective Prediction for LLM and VLM Hallucination Detection under Calibration Label Scarcity

密度脊选择性预测:校准标签稀缺下的大语言模型与视觉语言模型幻觉检测

Nina I. Shamsi

AI总结 针对校准标签稀缺时大语言模型和视觉语言模型的幻觉检测问题,提出基于核密度估计的密度脊方法,利用隐藏状态生成轨迹的六维运动特征图构建响应流形,通过到最近脊顶点的欧氏距离评分,在标签稀缺协议下AUROC提升5-20点。

详情
AI中文摘要

大语言模型和视觉语言模型中的幻觉检测日益被框架化为选择性预测,其中检测器分配置信度分数并在置信度低时弃权。无监督采样检测器(Semantic Entropy, EigenScore)避免标签但质量停滞,而有监督探针(SAPLMA)获得更强的分布内分数,但在校准标签稀缺时性能急剧下降。我们将大语言模型的响应流形恢复为基于隐藏状态生成轨迹的六维运动特征图的核密度估计的密度脊。测试生成通过其投影特征点到最近脊顶点的欧氏距离的负值进行评分,从而得到随机输出分布的低维几何骨架。我们在七个问答基准(HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA)上,使用九个文本和视觉大语言模型,在刻意标签稀缺协议($n_{\text{cal}}{=}200$ 查询,$N{=}5$ 生成)下,与Semantic Entropy、SAR、EigenScore、SAPLMA和对数概率进行评估。我们的基于脊的分数在AUROC上以5-20个百分点的优势获胜,同时在校准标签稀缺下表现出温和的性能下降。

英文摘要

Hallucination detection in large language and vision-language models is increasingly framed as selective prediction, where a detector assigns a confidence score and abstains when confidence is low. Unsupervised sampling detectors (Semantic Entropy) avoid labels but plateau in quality, while supervised probes attain stronger in-distribution scores yet degrade sharply when calibration labels are scarce. We recover the response manifold of an LLM as the density ridge of a kernel density estimate built on a six-dimensional kinematic feature map of hidden state generation trajectories. A test generation is scored by the negated Euclidean distance from its projected feature point to the nearest ridge vertex, yielding a low-dimensional geometric skeleton of the stochastic output distribution. We evaluate against Semantic Entropy, topological methods, and log-probability on six QA benchmarks (HaluEval-QA, TriviaQA, GSM8K, POPE, ScienceQA, A-OKVQA) using eight text and vision LLMs in a deliberately label-scarce protocol ($n_{\text{cal}}{=}200$ queries, $N{=}5$ generations). Our ridge-based score outperforms these baselines on AUROC by 5-20 points, while showing tempered degradation under calibration-label scarcity.

2504.21072 2026-06-11 cs.CR cs.AI cs.LG 版本更新

Erased but Not Forgotten: How Backdoors Compromise Concept Erasure

擦除但未遗忘:后门如何破坏概念擦除

Tobias Braun, Jonas Henry Grebe, Marcus Rohrbach, Anna Rohrbach

AI总结 本文揭示了一种名为擦除规避后门(EEB)的漏洞,攻击者将后门触发器绑定到待擦除概念上,使得该恶意链接在后续擦除后仍然存在,从而绕过多种概念擦除方法。

详情
AI中文摘要

文本到图像扩散模型的扩展引发了对有害输出的担忧,从捏造的公众人物描绘到露骨的色情图像。为减轻此类风险,先前工作提出了概念擦除方法,旨在通过微调从模型中切断不需要的概念,但仍不清楚这些方法是否真正移除了与有害概念的所有联系,或仅仅是掩盖了表面连接。在这项工作中,我们揭示了一个关键漏洞——擦除规避后门(EEB):攻击者将后门触发器绑定到待擦除的概念上,并且这种恶意链接在后续擦除后仍然存在。我们展示了黑盒和白盒攻击者都能实例化这一威胁。在六种最先进的擦除方法中,包括那些明确搜索目标概念替代表示的鲁棒方法,EEB始终能暴露有害内容:针对名人身份遗忘的成功率高达82%,针对物体擦除的成功率高达94%,针对露骨内容暴露的放大倍数高达16倍。虽然EEB揭示了当前擦除方法的一个盲点,但它也为压力测试未来的概念擦除技术提供了诊断工具。

英文摘要

The expansion of text-to-image diffusion models has raised concerns about harmful outputs, from fabricated depictions of public figures to sexually explicit imagery. To mitigate such risks, prior work has proposed concept erasure methods that aim to sever unwanted concepts from the model via fine-tuning, yet it remains unclear whether these approaches truly remove all links to the harmful concept or merely conceal superficial connections. In this work, we reveal a critical vulnerability, the Erasure Evasion Backdoor (EEB): an adversary binds a backdoor trigger to a concept slated for removal, and this malicious link survives subsequent erasure. We show that both black-box and white-box adversaries can instantiate this threat. Across six state-of-the-art erasure methods, including robust ones that explicitly search for alternative representations of the target concept, EEB consistently exposes harmful content: up to 82% success against celebrity-identity unlearning, up to 94% for object erasure, and up to 16 times amplification of explicit-content exposure. While EEB uncovers a blind spot in current erasure methods, it also provides a diagnostic tool for stress-testing future concept erasure techniques.

2505.08784 2026-06-11 stat.ML cs.LG math.ST stat.ME 版本更新

PCS-UQ: Uncertainty Quantification via the Predictability-Computability-Stability Framework

PCS-UQ:基于可预测性-可计算性-稳定性框架的不确定性量化

Abhineet Agarwal, Fange Xiao, Rebecca Barter, Omer Ronen, Boyu Fan, Bin Yu

AI总结 提出PCS-UQ框架,通过预测检查、bootstrap采样和乘法校准实现不确定性量化,在回归和分类任务中优于或媲美共形预测方法,并提供理论保证。

详情
AI中文摘要

随着机器学习进入高风险领域,可信的不确定性量化对于安全性至关重要。本文基于真实数据科学的可预测性、可计算性和稳定性原则,提出了PCS-UQ框架。从候选模型或算法集开始,PCS-UQ集成了严格的预测检查以筛选出集合中不合适的模型,并利用bootstrap样本来捕获预测检查算法的样本间变异性和算法不稳定性。然后,我们引入了一种新颖的乘法校准方案来增强局部自适应性,这基本上对应于共形预测中的新分数。此外,我们编制了17个真实世界回归数据集,并手动构建了子组。在该基准测试中,PCS-UQ在保持目标覆盖率的同时,在区间宽度上优于或匹配配备有oracle选择算法的共形方法。PCS-UQ实现了一致的子组覆盖率,优于这些oracle选择的共形方法。值得注意的是,PCS-UQ在实现竞争性区间宽度和一致子组覆盖率方面表现出色。在6个分类数据集上,PCS-UQ将预测集大小减少了20%。为了将框架扩展到深度学习,我们提出了计算高效的变体,避免了昂贵的重新训练。在三个计算机视觉基准测试中,这些变体将预测集大小比共形基线减少了20%。最后,我们提供了理论证明,即修改后的PCS-UQ算法在可交换性下作为分割共形推断的一种形式保持了有效的覆盖率。

英文摘要

As machine learning (ML) enters high-stakes domains, trustworthy uncertainty quantification (UQ) is essential for safety. In this paper we introduce PCS-UQ, a framework based on the Predictability, Computability, and Stability (PCS) principles for veridical data science. Starting with a candidate set of models or algorithms, PCS-UQ integrates a rigorous prediction-check to screen out unsuitable models in the set and utilizes bootstrap samples in order to capture both inter-sample variability and algorithmic instability for the prediction-checked algorithms. We then introduce a novel multiplicative calibration scheme to enhance local adaptivity, which essentially corresponds to a new score in conformal prediction. Moreover, we produce a compilation of 17 real-world regression datasets with manually constructed subgroups. On this benchmark, PCS-UQ maintains the target coverage while outperforming or matching conformal methods equipped with oracle-selected algorithms in interval width. PCS-UQ achieves consistent subgroup coverage, outperforming these oracle-selected conformal methods. Notably, PCS-UQ stands out in achieving both competitive interval widths and consistent subgroup coverage. On 6 classification datasets, PCS-UQ reduces prediction set sizes by 20%. To scale the framework for deep learning, we propose computationally efficient variants that bypass expensive retraining. On three computer vision benchmarks, these variants reduce prediction set sizes by 20% over conformal baselines. Finally, we provide a theoretical proof that a modified PCS-UQ algorithm preserves valid coverage under exchangeability as a form of split conformal inference.

2508.17077 2026-06-11 stat.ML cs.LG 版本更新

CP4SBI: Local Conformal Calibration of Credible Sets in Simulation-Based Inference

CP4SBI: 基于模拟推断中可信集的局部共形校准

Luben M. C. Cabezas, Vagner S. Santos, Thiago R. Ramos, Pedro L. C. Rodrigues, Rafael Izbicki

AI总结 提出CP4SBI框架,通过回归树和CDF校准实现局部贝叶斯覆盖,为任意评分函数提供有限样本局部覆盖保证,提升神经后验估计的不确定性量化质量。

详情
AI中文摘要

当前实验科学家越来越依赖基于模拟的推断(SBI)来反演具有难以处理似然的复杂非线性模型。然而,通过SBI获得的后验近似通常校准不佳,导致可信区域低估真实参数。我们开发了$\texttt{CP4SBI}$,一个模型无关的共形校准框架,用于构建具有局部贝叶斯覆盖的可信集。我们提出的两种变体,即通过回归树进行局部校准和基于CDF的校准,为任意评分函数(包括HPD、对称和基于分位数的区域)提供了有限样本局部覆盖保证。在广泛使用的SBI基准上的实验表明,我们的方法使用归一化流和分数扩散建模提高了神经后验估计器的不确定性量化质量。

英文摘要

Experimental scientists increasingly rely on simulation-based inference (SBI) to invert complex non-linear models with intractable likelihoods. However, posterior approximations obtained with SBI are often miscalibrated, causing credible regions to undercover the true parameters. We develop $\texttt{CP4SBI}$, a model-agnostic conformal calibration framework that constructs credible sets with local Bayesian coverage. Our two proposed variants, namely local calibration via regression trees and CDF-based calibration, enable finite-sample local coverage guarantees for any scoring function, including HPD, symmetric, and quantile-based regions. Experiments on widely used SBI benchmarks demonstrate that our approach improves the quality of uncertainty quantification for neural posterior estimators using both normalizing flows and score-diffusion modeling.

2510.07750 2026-06-11 stat.ML cs.LG 版本更新

Calibrating Decision Robustness via Inverse Conformal Risk Control

通过逆保形风险控制校准决策鲁棒性

Wenbin Zhou, Shixiang Zhu

AI总结 提出逆保形风险控制框架,为鲁棒优化策略提供无分布、有限样本的误覆盖与遗憾保证,通过追踪Pareto前沿帮助决策者根据成本-风险偏好校准鲁棒性水平。

详情
AI中文摘要

鲁棒优化通过针对最坏情况优化来保护决策免受不确定性影响,但其有效性取决于预先指定的鲁棒性水平,该水平通常是临时选择的,导致保护不足或过度保守且成本高昂的解决方案。最近使用保形预测的方法构建了具有有限样本覆盖保证的数据驱动不确定性集,但它们仍然事先固定覆盖目标,并且对选择鲁棒性水平提供的指导很少。我们提出了一个新框架,该框架为任何鲁棒预测-然后优化策略族提供了无分布、有限样本的误覆盖和遗憾保证。我们的方法构建了有效的估计量,这些估计量描绘出误覆盖-遗憾帕累托前沿,使决策者能够根据其成本-风险偏好可靠地评估和校准鲁棒性水平。该框架易于实现,广泛适用于经典优化公式,并实现了更优的有限样本性能。本文提供了一种原则性的数据驱动方法,用于指导鲁棒性选择,并使从业者能够在高风险决策中平衡鲁棒性和保守性。

英文摘要

Robust optimization safeguards decisions against uncertainty by optimizing against worst-case scenarios, yet its effectiveness hinges on a prespecified robustness level that is often chosen ad hoc, leading to either insufficient protection or overly conservative and costly solutions. Recent approaches using conformal prediction construct data-driven uncertainty sets with finite-sample coverage guarantees, but they still fix coverage targets a priori and offer little guidance for selecting robustness levels. We propose a new framework that provides distribution-free, finite-sample guarantees on both miscoverage and regret for any family of robust predict-then-optimize policies. Our method constructs valid estimators that trace out the miscoverage–regret Pareto frontier, enabling decision-makers to reliably evaluate and calibrate robustness levels according to their cost–risk preferences. The framework is simple to implement, broadly applicable across classical optimization formulations, and achieves sharper finite-sample performance. This paper offers a principled data-driven methodology for guiding robustness selection and empowers practitioners to balance robustness and conservativeness in high-stakes decision-making.

2605.16651 2026-06-11 cs.CV cs.LG 版本更新

Right Predictions, Misleading Explanations: On the Vulnerability of Vision-Language Model Explanations

正确预测,误导性解释:关于视觉-语言模型解释的脆弱性

Narges Babadi, Hadis Karimipour

AI总结 研究探讨了视觉-语言模型中解释热图在对抗条件下是否忠实反映推理过程,提出X-Shift攻击揭示解释与预测行为的脱节,验证了解释机制的脆弱性。

详情
Comments
Accepted at the ICML 2026 Workshop on Trustworthy AI for Good (AI4GOOD), Seoul, South Korea
AI中文摘要

解释机制被广泛用于增强视觉-语言模型(VLMs)的透明性和信任度,特别是在需要人类监督的决策场景中。然而,这些解释的鲁棒性仍不明确。本文研究了VLMs(特别是基于CLIP的模型)中的解释热图在对抗条件下是否忠实反映模型推理。我们发现,解释图谱可以系统性地被操控,同时保持模型的原始预测,揭示了预测行为与解释忠实性之间的脱节。为研究这种脆弱性,我们引入了X-Shift,一种新的灰盒攻击,通过扰动图像块级(patch-level)视觉表示,将解释热图引导至语义无关区域,而不会改变预测输出。与传统对抗攻击旨在诱导误分类不同,X-Shift专门针对解释过程的完整性。该攻击不修改模型参数,并在多种CLIP架构和解释方法上通用。我们在ImageNet-1k、MS-COCO和Flickr30K上评估了所提出的方法,证明在不可察觉的扰动下,解释对齐性持续下降,而预测保持稳定。此外,标准以预测为导向的对抗攻击即使在更大的扰动预算下也无法复制相同的解释偏移行为。我们的发现突显了当前VLMs解释机制的根本局限性,并对它们在高影响应用中作为可靠信任指标的使用提出了担忧。

英文摘要

Explanation mechanisms are increasingly used to support transparency and trust in vision-language models (VLMs), particularly in settings where model decisions require human oversight. However, the robustness of these explanations remains insufficiently understood. In this work, we investigate whether explanation heatmaps in VLMs, particularly CLIP-based models, faithfully reflect model reasoning under adversarial conditions. We show that explanation maps can be systematically manipulated while preserving the model's original prediction, revealing a disconnect between predictive behavior and explanation faithfulness. To study this vulnerability, we introduce X-Shift, a novel grey-box attack that perturbs patch-level visual representations to redirect explanation heatmaps toward semantically irrelevant regions without altering the predicted output. Unlike conventional adversarial attacks that aim to induce misclassification, X-Shift specifically targets the integrity of the explanation process itself. The attack operates without modifying model parameters and generalizes across multiple CLIP architectures and explanation methods. We evaluate the proposed approach on ImageNet-1k, MS-COCO, and Flickr30K, demonstrating consistent degradation in explanation alignment under imperceptible perturbations while maintaining prediction stability. Furthermore, standard prediction-oriented adversarial attacks fail to reproduce the same explanation-shifting behavior even under substantially larger perturbation budgets. Our findings highlight a fundamental limitation of current explanation mechanisms in VLMs and raise concerns about their use as reliable indicators of model trustworthiness in high-impact applications.

2605.31219 2026-06-11 cs.CV cs.CR cs.LG 版本更新

Latent Geometric Chords for Query-Efficient Decision-Based Adversarial Attacks

潜在几何弦:面向查询高效的决策型对抗攻击

Ei Hmue Khine, Yao Li, Jiebao Sun, Shengzhu Shi, Zhichang Guo, Boying Wu

AI总结 提出潜在几何弦(LGC)方法,通过曲率感知的几何搜索在压缩语义流形中导航决策边界,并引入基于残差的对抗生成(RAG)机制,以高视觉保真度实现查询高效的决策型黑盒对抗攻击。

详情
Comments
Added a conceptual diagram for the LGC architecture, 14 pages, 10 figures, 7 tables. Submitted to IEEE Transactions on Information Forensics and Security. The source code is available at this https URL
AI中文摘要

虽然基于决策的黑盒对抗攻击构成了严重的安全威胁,但当前方法存在根本性限制。像素级攻击经常引入不自然的高频视觉伪影,而潜在空间框架受限于低维流形的有限搜索空间和固有的重建缺陷。为解决这些限制,我们提出了潜在几何弦(LGC)用于查询高效的决策型对抗攻击及其变体LGC-H。其核心是,LGC通过在压缩语义流形内执行曲率感知的几何搜索来导航决策边界。为保证高视觉保真度并规避维度瓶颈,我们引入了基于残差的对抗生成(RAG)机制。RAG将语义扰动隔离为几何弦,并直接叠加到原始源图像上。RAG显著解决了基线重建缺陷,并有效将允许的搜索空间维度翻倍。实验结果表明,LGC实现了鲁棒的跨数据集迁移性,并显著优于最先进的基线方法。值得注意的是,我们的方法LGC在最小化扰动幅度的同时实现了最先进的视觉保真度,在5000次查询下结构相似性指数(SSIM)超过0.99、学习感知图像块相似度(LPIPS)低于0.01,并在严格的感知约束下保持高攻击成功率,成功攻破了经过对抗训练的鲁棒模型。源代码可在https://github.com/eihmuekhine/Latent-Geometric-Chords获取。

英文摘要

While decision-based black-box adversarial attacks present a severe security threat, current methodologies suffer from fundamental limitations. Pixel-wise attacks frequently introduce unnatural, high-frequency visual artifacts, while latent-space frameworks are confined by the limited search space of low-dimensional manifolds and inherent reconstruction flaws. To resolve these limitations, we propose Latent Geometric Chords (LGC) for Query-Efficient Decision-Based Adversarial Attacks alongside a variant, LGC-H. At its core, LGC navigates decision boundaries by executing a curvature-aware geometric search within a compressed semantic manifold. To guarantee high visual fidelity and circumvent dimensionality bottlenecks, we introduce a Residual-based Adversarial Generation (RAG) mechanism. RAG isolates semantic perturbations as geometric chords and superimposes them directly onto the original source image. RAG substantially resolves baseline reconstruction flaws and effectively doubles the permissible search space dimensions. Experimental results demonstrate that LGC achieves robust cross-dataset transferability and substantially outperforms state-of-the-art baselines. Notably, our method, LGC, minimizes perturbation magnitudes while achieving state-of-the-art visual fidelity--with a Structural Similarity Index Measure (SSIM) exceeding 0.99 and a Learned Perceptual Image Patch Similarity (LPIPS) below 0.01 at 5000 queries--and sustaining high attack success rates under stringent perceptual constraints, successfully compromising adversarially trained robust models. The source code is available at: this https URL.

2606.05551 2026-06-11 stat.ML cs.AI cs.LG 版本更新

Conformal Risk-Averse Decision Making with Action Conditional Guarantee

具有行动条件保证的共形风险规避决策

Zihan Zhu, Shayan Kiyani, George Pappas, Hamed Hassani

AI总结 提出行动条件共形预测方法,通过分位数损失最小化算法实现行动条件风险价值优化,在有限样本下提供行动条件安全保证。

详情
AI中文摘要

由机器学习模型驱动的可靠决策管道需要具有明确安全保证的不确定性量化(UQ)方法。共形预测通过将ML预测包装成预测集来提供这种UQ,而Kiyani等人(2025b)的最新工作表明,这些集合可以转化为最优的风险规避决策策略——但仅继承边际安全保证。我们通过以下方式推广并加强了他们的结果:(i)引入行动条件共形预测,该预测产生明确条件于决策者所采取的每个行动的安全保证;(ii)表明行动条件预测集可作为风险规避决策者旨在优化行动条件风险价值的可行决策空间的代理;(iii)提出一种基于分位数损失最小化的原则性有限样本算法,将Gibbs等人(2025)的框架与行动条件保证联系起来。在两个真实世界数据集上的实验证实,我们的方法在行动条件性能上显著优于共形基线。

英文摘要

Reliable decision making pipelines powered by machine learning models require uncertainty quantification (UQ) methods that come with explicit safety guarantees. Conformal prediction provides such UQ by wrapping ML predictions into prediction sets, and recent work by Kiyani et al. (2025b) established that these sets can be translated into optimal risk-averse decision policies -- yet only inheriting marginal safety guarantees. We generalize and strengthen their results by (i) introducing action-conditional conformal prediction, which yields safety guarantees conditioned explicitly on each action taken by the decision maker, (ii) showing that action-conditional prediction sets serve as a proxy for the feasible decision space for risk-averse decision makers aiming to optimize action-conditional value-at-risk, and (iii) proposing a principled finite-sample algorithm based on pinball-loss minimization, connecting the framework of Gibbs et al. (2025) to action-conditional guarantees. Experiments on two real-world datasets confirm that our approach significantly improves action-conditional performance over conformal baselines.
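As a rough illustration of the pinball-loss ingredient mentioned in point (iii), the sketch below shows only the textbook fact that minimizing the average pinball loss over a constant yields an empirical quantile. It is not the paper's action-conditional algorithm; the function names `pinball_loss` and `best_constant`, the toy `scores`, and the level `tau=0.85` are assumptions chosen for illustration.

```python
import numpy as np


def pinball_loss(y, q, tau):
    """Average pinball (quantile) loss of the constant prediction q at level tau."""
    diff = np.asarray(y, dtype=float) - q
    # tau * diff penalizes under-prediction, (tau - 1) * diff penalizes over-prediction.
    return float(np.mean(np.maximum(tau * diff, (tau - 1.0) * diff)))


def best_constant(y, tau):
    """Brute-force minimizer over the observed values (hypothetical helper).

    A minimizer of the average pinball loss can always be found among the data points,
    so scanning them is enough for a toy example.
    """
    y = np.asarray(y, dtype=float)
    losses = [pinball_loss(y, q, tau) for q in y]
    return float(y[int(np.argmin(losses))])


# Toy calibration scores (made up for this sketch).
scores = np.array([0.1, 0.4, 0.2, 0.9, 0.5, 0.7, 0.3, 0.8, 0.6, 1.0])
q85 = best_constant(scores, tau=0.85)
```

With ten distinct scores and `tau=0.85`, the minimizer is the ninth-smallest score, which is the usual sense in which pinball-loss minimization estimates a quantile threshold.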

2606.09964 2026-06-11 quant-ph cs.LG 版本更新

JGRA: Jacobian Geometry Robustness Assessment in NISQ Noise-Aware Quantum Neural Networks

JGRA: NISQ噪声感知量子神经网络中的雅可比几何鲁棒性评估

Gianluca Scanu, Luca Barletta, Stefano Rini

AI总结 提出JGRA框架,通过雅可比几何评估噪声感知量子神经网络的鲁棒性,包括熵匹配噪声校准、噪声感知训练和噪声条件雅可比提取,揭示干净域结构与噪声推理行为的关系。

详情
Comments
Accepted at IEEE qCCL 2026. Author accepted manuscript. 6 pages; cleaned source files, no changes to manuscript content
AI中文摘要

NISQ时代对量子计算施加了严格约束,噪声和退相干从根本上限制了性能。在经典深度学习中,模型对扰动的鲁棒性和弹性已得到充分研究:深度神经网络(DNN)由于其表示中的固有冗余,在剪枝、噪声注入和结构扰动下仍能保持高性能。量子机器学习的一个核心挑战是将这种鲁棒性概念转移到现实NISQ噪声下的量子神经网络(QNN)中。虽然经典深度学习通过结构冗余表现出鲁棒性,但QNN的类似原理尚不成熟。我们提出JGRA:一个通过雅可比几何评估噪声感知QNN鲁棒性的框架,捕捉噪声引起的参数扰动下的模型敏感性。我们的方法包括熵匹配噪声校准、噪声感知训练和噪声条件雅可比提取,产生将干净域结构与噪声推理行为联系起来的几何描述符。我们还实验证明,这些描述符编码了关于在未见噪声下鲁棒性的预测信息。

英文摘要

The NISQ era places stringent constraints on quantum computation, where noise and decoherence fundamentally limit performance. In classical deep learning, model robustness and resilience to perturbations are well studied: deep neural networks (DNNs) maintain high performance despite pruning, noise injection, and structural perturbations due to inherent redundancy in their representations. A central challenge in quantum machine learning is to transfer this notion of robustness to quantum neural networks (QNNs) under realistic NISQ noise. While classical deep learning exhibits robustness through structural redundancy, analogous principles for QNNs remain underdeveloped. We propose JGRA: a framework for assessing robustness in noise-aware QNNs via Jacobian geometry, capturing model sensitivity to parameter perturbations induced by noise. Our method includes entropy-matched noise calibration, noise-aware training, and noise-conditioned Jacobian extraction, yielding geometric descriptors that link clean-regime structure to noisy inference behaviour. We also empirically demonstrate that these descriptors encode predictive information about robustness under unseen noise.

9. 图学习与结构化数据 8 篇

2606.11583 2026-06-11 cs.LG 新提交

Beyond the Golden Teacher: Enhancing Graph Learning through LLM-GNN Co-teaching

超越黄金教师:通过LLM-GNN协同教学增强图学习

Zhuoyi Peng, Hanlin Gu, Lixin Fan, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) WeBank(微众银行)

AI总结 针对文本属性图上的少样本学习,提出LLM-GNN协同教学框架,避免固定教师模型,通过双向伪标签交换和基于轮次的偏好优化,显著提升图学习性能。

详情
Comments
Code: this https URL
AI中文摘要

文本属性图(TAGs)支撑着现实世界的应用,如引文网络、社交媒体和电子商务。TAGs上的少样本图学习是困难的:每类只有少量标签,其余图数据未标注,GNN和LLM都无法单独良好学习。GNN读取拓扑结构,在冷节点上失败;LLM读取文本,在文本模糊节点上失败。现有的LLM-GNN方法都遵循相同的模式:指定一个模型为黄金教师,并使用其输出(如特征或伪标签)来监督另一个模型。我们认为这种黄金教师假设在稀疏监督下会失效:没有一个模型是黄金的,将任何一个视为黄金教师会将其盲点转移到学生模型中。因此,我们提出:能否避免指定任一模型为黄金教师,仍然进行有效的图学习?我们的答案是LLM-GNN协同教学,一种双向协同教学框架,其中没有模型被固定为教师。GNN和LLM在特定架构的小损失准则下交换它们最自信的伪标签,并且每轮都更新。然后从轨迹中挖掘监督信息:每当一个节点从第t轮的跨模型矛盾变为第t+1轮的跨模型一致时,LLM在同一输入上的两个答案形成一个偏好对(旧的矛盾自我 < 新的同伴认可自我),用于DPO训练。我们称之为基于轮的伪标签偏好优化(RPL-PO)。在六个基准测试上,LLM-GNN协同教学始终优于GNN-as-Judge和所有先前方法,在Cora和ogbn-arxiv上的绝对3-shot增益分别为7.86%和7.73%;改进延续到5-shot和零样本跨数据集迁移。错误结构分析进一步表明,放弃黄金教师假设显著提高了LLM在困难样本上的图学习能力。

英文摘要

Text-attributed graphs (TAGs) underlie real-world applications such as citation networks, social media, and e-commerce. Few-shot graph learning on TAGs is hard: with only a handful of labels per class and the rest of the graph unannotated, neither GNNs nor LLMs can learn well on their own. GNNs read topology and fail on cold nodes; LLMs read text and fail on text-ambiguous nodes. Existing LLM-GNN methods all follow the same recipe: designate one model as the golden teacher and use its outputs (e.g., features or pseudo-labels) to supervise the other. We argue this golden-teacher assumption breaks under sparse supervision: neither model is golden, and treating either as such transfers its blind spots into the student. We therefore ask: can we avoid designating either model as the golden teacher, and still perform effective graph learning? We answer with LLM-GNN Co-Teaching, a bidirectional co-teaching framework in which neither model is fixed as teacher. The GNN and LLM exchange their most confident pseudo-labels under an architecture-specific small-loss criterion, and both update every round. Supervision is then mined from the trajectory: whenever a node moves from cross-model contradiction at round t to cross-model agreement at round t+1, the LLM's two answers on the same input form a preference pair (old contradicting self < new peer-endorsed self) for DPO training. We call this Round-based Pseudo-Label Preference Optimization (RPL-PO). On six benchmarks, LLM-GNN Co-Teaching consistently outperforms GNN-as-Judge and all prior methods, with absolute 3-shot gains of 7.86% on Cora and 7.73% on ogbn-arxiv; improvements carry over to 5-shot and to zero-shot cross-dataset transfer. Error-structure analysis further shows that abandoning the golden-teacher assumption substantially improves the LLM's graph learning capability on challenging samples.
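The preference-pair rule described in the abstract (contradiction at round t, agreement at round t+1) can be sketched as plain bookkeeping over per-round predictions. This is a minimal sketch under stated assumptions, not the authors' implementation: the function name `mine_preference_pairs`, the dictionary representation of predictions, and the choice to skip nodes whose LLM answer did not change are all hypothetical. Pseudo-label exchange and DPO training are not shown.

```python
from typing import Dict, List, Tuple

Preds = Dict[int, str]  # node id -> predicted label (assumed representation)


def mine_preference_pairs(
    llm_t: Preds, gnn_t: Preds, llm_next: Preds, gnn_next: Preds
) -> List[Tuple[int, str, str]]:
    """Return (node, chosen, rejected) triples for nodes that move from
    cross-model contradiction at round t to cross-model agreement at round t+1.

    chosen   = the LLM's round t+1 answer (peer-endorsed)
    rejected = the LLM's round t answer (contradicted)
    """
    pairs = []
    for node in sorted(llm_t):
        if node not in gnn_t or node not in llm_next or node not in gnn_next:
            continue
        contradicted = llm_t[node] != gnn_t[node]
        agreed = llm_next[node] == gnn_next[node]
        # Assumption of this sketch: identical answers carry no preference signal.
        changed = llm_t[node] != llm_next[node]
        if contradicted and agreed and changed:
            pairs.append((node, llm_next[node], llm_t[node]))
    return pairs


# Toy predictions for four nodes over two consecutive rounds (made up).
llm_t = {0: "A", 1: "B", 2: "C", 3: "A"}
gnn_t = {0: "A", 1: "C", 2: "A", 3: "B"}
llm_next = {0: "A", 1: "C", 2: "C", 3: "A"}
gnn_next = {0: "A", 1: "C", 2: "C", 3: "B"}
pairs = mine_preference_pairs(llm_t, gnn_t, llm_next, gnn_next)
```

In this toy run only node 1 qualifies: node 0 already agreed, node 2 reached agreement because the GNN changed rather than the LLM, and node 3 still disagrees.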

2606.11640 2026-06-11 cs.LG cs.AI 新提交

TAROT: Task-Adaptive Refinement of LLM-prior Graphs for Few-shot Tabular Learning

TAROT: 面向小样本表格学习的任务自适应LLM先验图精炼

Ruxue Shi, Yili Wang, Mengnan Du, Hangting Ye, Yi Chang, Xin Wang

发表机构 * Jilin University(吉林大学) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出TAROT框架,通过构建并精炼任务自适应语义图,利用LLM先验和GNN编码特征语义关系,提升小样本表格学习性能。

详情
AI中文摘要

小样本表格学习为实际应用中标注成本高、新任务样本收集困难的情况提供了一种经济有效的方法。现有的传统方法和基于LLM的方法在小样本场景中已展现出有效性。然而,传统方法需要在未标注或生成的数据上进行额外训练,这带来了显著的计算开销。此外,直接将原始表格数据输入LLM的基于LLM的方法引发了隐私和合规性问题。更重要的是,这两种范式都很大程度上忽略了特征之间的语义关系,而语义关系为构建语义图提供了结构和语义先验。语义图对于在小样本场景中建模有意义的特征交互至关重要。本文提出TAROT,一个基于GNN的框架,通过从先验中构建并精炼任务自适应语义图来编码结构和语义先验,从而提升小样本表格学习的预测性能。TAROT首先通过统一语义表格节点编码器(USTNE)将异构表格数据编码为统一的节点语义表示。然后,它提示LLM根据任务描述和特征名称推断特征之间的语义关系,以构建语义图。为了减轻LLM幻觉引入的结构噪声,TAROT引入了任务自适应语义图精炼,剪除虚假或与任务无关的边,并添加缺失的与任务相关的边,使图结构与下游目标对齐。最后,GNN在精炼后的图上进行消息传递,以捕获与任务相关的语义依赖关系进行预测。在各种小样本表格学习基准上的大量实验证明了TAROT的优越性能,使其成为该领域的最先进方法。

英文摘要

Few-shot tabular learning provides a cost-effective approach for real-world applications where annotation is costly and collecting sufficient samples for new tasks is difficult. Existing traditional and LLM-based methods have demonstrated effectiveness in few-shot scenarios. However, traditional methods need additional training on unlabeled or generated data, which incurs significant computational overhead. In addition, LLM-based methods that directly feed raw tabular data into LLMs raise privacy and compliance concerns. More importantly, both paradigms largely overlook the semantic relationships between features, which provide structural and semantic priors for constructing a semantic graph. Such a graph is essential for modeling meaningful feature interactions in few-shot scenarios. In this paper, we propose TAROT, a GNN-based framework that encodes these structural and semantic priors by constructing and refining a task-adaptive semantic graph from them, thereby improving predictive performance in few-shot tabular learning. TAROT first encodes heterogeneous tabular data into unified node semantic representations via a Unified Semantic Tabular Node Encoder (USTNE). Then, it prompts LLMs to infer the semantic relationships between features based on the task description and feature names to construct a semantic graph. To mitigate structural noise introduced by LLM hallucination, TAROT introduces Task-adaptive Semantic Graph Refinement, which prunes spurious or task-unrelated edges and adds missing task-related ones, aligning the graph structure with the downstream objective. Finally, a GNN performs message passing over the refined graph to capture task-related semantic dependencies for prediction. Extensive experiments on various few-shot tabular learning benchmarks demonstrate the superior performance of TAROT, establishing it as a state-of-the-art approach in this domain.

2606.11831 2026-06-11 cs.LG cs.AI 新提交

From Uniform to Learned Graph Priors: Diffusion for Structure Discovery

从均匀到学习图先验:用于结构发现的扩散

Qi Shao, Hao Guo, Jiawen Chen, Duxin Chen, Wenwu Yu

发表机构 * School of Mathematics, Southeast University(东南大学数学学院)

AI总结 提出Diff-prior,一种扩散参数化的自适应先验,通过可学习的去噪式校准对边后验进行结构化校准,提升神经关系推理方法的结构发现可靠性。

详情
Comments
15 pages, 3 figures, Accepted by KDD 2026
AI中文摘要

神经关系推理(NRI)方法通过离散潜在边的变分推理从轨迹中发现交互图。然而,这些方法通常依赖于过度简化的因子化图先验。这种先验通常接近均匀分布,将边视为独立实体。这种系统性错位与现实世界系统不匹配,导致边后验分散且不明确,限制了结构发现的可靠性。为了解决这个问题,我们提出了Diff-prior,一种扩散参数化的自适应先验,用于校准潜在图分布而非生成图。我们的核心见解是将先验整合重新构建为一种可学习的去噪式校准,将分散、不确定的边后验组织成更可靠的整体结构,该结构可通过扩散模型训练。Diff-prior学习一个自适应结构先验,在推理过程中对边后验进行结构化校准,引导其朝向更接近底层结构的分布。Diff-prior在结构采样之前操作,并直接对编码器边分布进行去噪校准,为结构化变量提供了一种通用的训练范式。在标准基准上的实验验证了我们的框架,结果表明Diff-prior提高了结构推理的性能,并在多个NRI系列架构中生成更明确的边后验。代码可通过此 https URL 获取。

英文摘要

Neural relational inference (NRI) methods discover interaction graphs from trajectories through variational reasoning on discrete potential edges. However, these methods typically rely on oversimplified, factorized graph priors. Such priors, typically close to uniform, treat edges as independent entities. This systemic misalignment with real-world systems yields diffuse and indecisive edge posteriors, limiting the reliability of structural discovery. To address this, we propose Diff-prior, a diffusion-parameterized adaptive prior used to calibrate the latent graph distribution rather than to generate graphs. Our core insight is to reframe prior integration as a learnable denoising-style calibration that organizes scattered, uncertain edge posteriors into a more reliable overall structure, which can be trained by the diffusion model. Diff-prior learns an adaptive structure prior that performs structured calibration on the edge posteriors during inference, guiding them towards a distribution closer to the underlying structure. Diff-prior operates before structural sampling and acts as a denoising calibrator directly on the encoder edge distribution, which provides a generic training paradigm over structured variables. Experiments on standard benchmarks validate our framework, and the results indicate that Diff-prior improves the performance of structure inference and generates more decisive edge posteriors across multiple NRI-family architectures. The code is available at this https URL.

2606.11898 2026-06-11 cs.CL cs.LG 交叉投稿

GraspLLM: Towards Zero-Shot Generalization on Text-Attributed Graphs with LLMs

GraspLLM: 利用LLM实现文本属性图上的零样本泛化

Hengyi Feng, Zeang Sheng, Meiyi Qiang, Wentao Zhang

发表机构 * Peking University(北京大学) National University of Singapore(新加坡国立大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出GraspLLM框架,通过融合图结构理解与LLM语义能力,利用基序感知对比学习和最优上下文子图对齐,实现跨数据集和跨任务的零样本泛化。

详情
AI中文摘要

近年来,对文本属性图(TAGs)的研究因其在引文网络、电子商务平台、社交媒体和网页等各类真实数据场景中的广泛应用而备受关注。受大语言模型(LLMs)卓越语义理解能力的启发,已有许多尝试将LLMs集成到TAGs中。然而,现有方法仍难以在不同图和任务间泛化,且其捕获可迁移图结构模式的能力有限。为此,我们提出了GraspLLM框架,该框架将图结构理解与LLM的语义理解能力相结合,以增强跨数据集和跨任务的泛化能力。具体而言,我们使用冻结的通用嵌入模型将不同图的节点文本表示在统一语义空间中,在此基础上,我们在多个基序诱导的邻接矩阵上进行基序感知对比学习,以提取与数据集无关的结构信息。然后,通过我们提出的最优上下文子图,为每个目标节点提取最相关的上下文子图,并通过对齐投影器将这些子图对齐到LLM的令牌空间。在涵盖不同领域的TAG基准数据集上的大量实验表明,GraspLLM始终优于先前基于LLM的TAG方法,在零样本场景下尤为明显,突显了其在不同数据集和任务上的强泛化能力。我们的代码可通过此 https URL 获取。

英文摘要

Research on Text-Attributed Graphs (TAGs) has gained significant attention recently due to its broad applications across various real-world data scenarios, such as citation networks, e-commerce platforms, social media, and web pages. Inspired by the remarkable semantic understanding ability of Large Language Models (LLMs), there have been numerous attempts to integrate LLMs into TAGs. However, existing methods still struggle to generalize across diverse graphs and tasks, and their ability to capture transferable graph structural patterns remains limited. To address this, we introduce GraspLLM, a framework that combines graph structural comprehension with the semantic understanding prowess of LLMs to enhance cross-dataset and cross-task generalizability. Specifically, we represent node texts from different graphs in a unified semantic space with a frozen general embedding model, on top of which we perform motif-aware contrastive learning across multiple motif-induced adjacency matrices to extract dataset-agnostic structural information. Then, with our proposed optimal contextual subgraph, we extract the most contextually relevant subgraph for each target node and align these subgraphs to the token space of the LLM via an alignment projector. Extensive experiments on TAG benchmark datasets spanning diverse domains reveal that GraspLLM consistently outperforms previous LLM-based methods for TAGs, especially in zero-shot scenarios, highlighting its strong generalizability across different datasets and tasks. Our code is available at this https URL.

2606.11946 2026-06-11 cs.DB cs.CC cs.LG cs.LO 交叉投稿

Neuro-Relational Programs: Unifying Queries and Neural Computation over Structured Data

神经关系程序:统一结构化数据上的查询与神经计算

Arie Soeteman, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Moritz Schönherr

AI总结 提出神经关系程序(NRP),一种扩展Datalog规则的声明式查询语言,通过嵌入操作融合关系推理与可学习神经组件,实现关系数据上的通用神经计算。

详情
Comments
37 pages
AI中文摘要

在关系数据库上进行深度学习的传统方法是将图神经网络(GNN)等神经模型应用于数据库的图表示。最近的方法则直接操作数据库,将元组与嵌入关联,并扩展查询机制以联合处理嵌入和关系内容。受这些发展的启发,我们引入了神经关系程序(NRP),这是一种针对关系数据库的声明式查询语言,其事实携带数值向量嵌入。NRP扩展了Datalog风格的规则,增加了组合、聚合和转换嵌入的操作,从而在单一形式主义中交错关系推理和可学习神经组件。这产生了一种对关系数据进行神经计算的通用方法:NRP既可以看作带有可训练组件的查询计划,也可以看作内置关系结构的神经架构。NRP的自然语法片段恢复了现有架构和查询形式主义。零元NRP对应于非自适应查询算法;一元NRP推广了GNN风格的消息传递,并精确捕捉了深度同态网络,我们将这一联系扩展到带有行ID的数据库上的前沿保护NRP。我们通过FOCQ(一阶逻辑在实权重结构上的计数扩展)刻画了带有ReLU-FFN变换的无限制NRP的表达能力,从而建立了与有序数据库上的均匀TC$^0$的精确联系。这些结果共同确立了NRP作为关系数据上查询和神经计算的广泛声明式框架。

英文摘要

The conventional approach to deep learning over relational databases applies neural models, such as Graph Neural Networks (GNNs), to a graph representation of the database. Recent approaches instead operate on databases directly, associating tuples with embeddings and extending query mechanisms to jointly process embeddings and relational content. Inspired by these developments, we introduce Neuro-Relational Programs (NRPs), a declarative query language for relational databases whose facts carry numeric vector embeddings. NRPs extend Datalog-style rules with operations that combine, aggregate, and transform embeddings, thereby interleaving relational reasoning and learnable neural components within a single formalism. This yields a general approach to neural computation over relational data: an NRP can be read both as a query plan with trainable components and as a neural architecture with relational structure built in. Natural syntactic fragments of NRPs recover existing architectures and query formalisms. Zero-ary NRPs correspond to non-adaptive query algorithms; monadic NRPs generalize GNN-style message passing and precisely capture Deep Homomorphism Networks, a connection that we extend to frontier-guarded NRPs over databases with row-ids. We characterize the expressive power of unrestricted NRPs with ReLU-FFN transformations by FOCQ, an extension of first-order logic with counting interpreted over real-weighted structures, yielding a precise connection with uniform TC$^0$ over ordered databases. Together, these results establish NRPs as a broad declarative framework for querying and neural computation over relational data.

2510.04567 2026-06-11 cs.LG cs.AI 版本更新

GILT: An LLM-Free, Tuning-Free Graph Foundational Model for In-Context Learning

GILT:一种无需LLM、无需微调的图基础模型用于上下文学习

Weishuo Ma, Yanbo Wang, Xiyuan Wang, Lei Zou, Muhan Zhang

AI总结 提出GILT框架,通过基于令牌的上下文学习机制统一处理节点、边和图级别的分类任务,无需大语言模型或微调,实现高效泛化。

详情
Comments
Accepted as an oral presentation at the GFM @ ICML 2026 Workshop
AI中文摘要

图神经网络(GNN)是处理关系数据的强大工具,但通常难以泛化到未见过的图,从而催生了图基础模型(GFM)的发展。然而,当前的GFM面临图数据极端异质性的挑战,每个图可能具有独特的特征空间、标签集和拓扑结构。为此,出现了两种主要范式:第一种利用大语言模型(LLM),但本质上依赖于文本,因此难以处理海量图中的数值特征;第二种预训练基于结构的模型,但适应新任务通常需要昂贵的每图微调阶段,造成关键效率瓶颈。在这项工作中,我们超越了这些限制,引入了图上下文学习Transformer(GILT),这是一个基于无需LLM且无需微调架构的框架。GILT引入了一种新颖的基于令牌的框架用于图上的上下文学习(ICL),在统一框架中重新定义了跨节点、边和图级别的分类任务。该机制是处理异质性的关键,因为它设计用于操作通用数值特征。此外,它从上下文中动态理解类别语义的能力实现了无需微调的适应。全面实验表明,与基于LLM或基于微调的基线相比,GILT以显著更少的时间实现了更强的少样本性能,验证了我们方法的有效性。我们的代码可在https://github.com/yiming421/inductnode/获取。

英文摘要

Graph Neural Networks (GNNs) are powerful tools for processing relational data but often struggle to generalize to unseen graphs, giving rise to the development of Graph Foundational Models (GFMs). However, current GFMs are challenged by the extreme heterogeneity of graph data, where each graph can possess a unique feature space, label set, and topology. To address this, two main paradigms have emerged. The first leverages Large Language Models (LLMs), but is fundamentally text-dependent and thus struggles to handle the numerical features in vast graphs. The second pre-trains a structure-based model, but the adaptation to new tasks typically requires a costly, per-graph tuning stage, creating a critical efficiency bottleneck. In this work, we move beyond these limitations and introduce Graph In-context Learning Transformer (GILT), a framework built on an LLM-free and tuning-free architecture. GILT introduces a novel token-based framework for in-context learning (ICL) on graphs, reframing classification tasks spanning node, edge and graph levels in a unified framework. This mechanism is the key to handling heterogeneity, as it is designed to operate on generic numerical features. Further, its ability to understand class semantics dynamically from the context enables tuning-free adaptation. Comprehensive experiments show that GILT achieves stronger few-shot performance with significantly less time than LLM-based or tuning-based baselines, validating the effectiveness of our approach. Our code is available at: this https URL.

2505.03649 2026-06-11 stat.ML cs.LG math.CO math.PR 版本更新

Weighted Random Dot Product Graphs

加权随机点积图

Bernardo Marenco, Paola Bermolen, Marcelo Fiori, Federico Larroca, Gonzalo Mateos

AI总结 提出加权随机点积图(WRDPG)模型,通过节点潜位置的内积刻画边权分布的高阶矩,并给出谱嵌入估计的统计保证与生成框架。

详情
Comments
30 pages, 12 figures, code to generate Figures 3 to 12 available at this https URL. Updated to match the published version
AI中文摘要

复杂关系模式的建模已成为当代统计研究和相关数据科学领域的基石。以图形式表示的网络为这种分析提供了自然框架。本文扩展了随机点积图(RDPG)模型以适应加权图,显著拓宽了该模型的适用范围,使其能够处理边权呈现异质分布的场景。我们提出了一种非参数加权(W)RDPG模型,为每个节点分配一系列潜位置。这些节点向量的内积通过矩生成函数指定其关联边权分布的矩。与现有技术不同,WRDPG能够区分具有相同均值但高阶矩不同的权重分布。我们为一种改编自常用邻接谱嵌入的节点潜位置估计量推导了统计保证,建立了其一致性和渐近正态性。我们还贡献了一个生成框架,能够采样符合(指定或数据拟合的)WRDPG的图,从而促进例如使用恰当的参考分布对观测图指标进行分析和检验。本文依次形式化了模型定义、估计(或节点嵌入)过程及其保证,以及生成加权图的方法,所有内容均辅以说明性和可重复的示例,展示WRDPG在各种网络分析应用中的有效性。

英文摘要

Modeling of intricate relational patterns has become a cornerstone of contemporary statistical research and related data science fields. Networks, represented as graphs, offer a natural framework for this analysis. This paper extends the Random Dot Product Graph (RDPG) model to accommodate weighted graphs, markedly broadening the model's scope to scenarios where edges exhibit heterogeneous weight distributions. We propose a nonparametric weighted (W)RDPG model that assigns a sequence of latent positions to each node. Inner products of these nodal vectors specify the moments of their incident edge weights' distribution via moment-generating functions. In this way, and unlike prior art, the WRDPG can discriminate between weight distributions that share the same mean but differ in higher-order moments. We derive statistical guarantees for an estimator of the nodes' latent positions adapted from the workhorse adjacency spectral embedding, establishing its consistency and asymptotic normality. We also contribute a generative framework that enables sampling of graphs that adhere to a (prescribed or data-fitted) WRDPG, facilitating, e.g., the analysis and testing of observed graph metrics using judicious reference distributions. The paper is organized to formalize the model's definition, the estimation (or nodal embedding) process and its guarantees, as well as the methodologies for generating weighted graphs, all complemented by illustrative and reproducible examples showcasing the WRDPG's effectiveness in various network analytic applications.
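For readers unfamiliar with the adjacency spectral embedding that the estimator is adapted from, here is a minimal sketch of the plain, unweighted version applied to a noiseless expected adjacency matrix. It is not the WRDPG estimator from the paper; the function name `adjacency_spectral_embedding`, the latent positions `X`, and the dimensions are assumptions made for illustration.

```python
import numpy as np


def adjacency_spectral_embedding(A, d):
    """Plain ASE: top-d eigenpairs (by magnitude) of a symmetric matrix A,
    with each eigenvector scaled by the square root of |eigenvalue|."""
    vals, vecs = np.linalg.eigh(A)
    top = np.argsort(np.abs(vals))[::-1][:d]
    return vecs[:, top] * np.sqrt(np.abs(vals[top]))


rng = np.random.default_rng(0)
# Hypothetical latent positions; entries are kept small so inner products stay in [0, 1].
X = rng.uniform(0.1, 0.7, size=(50, 2))
P = X @ X.T  # expected adjacency of an (unweighted) RDPG with these positions
X_hat = adjacency_spectral_embedding(P, d=2)
```

On the noiseless matrix `P`, the embedding reproduces the inner products exactly (`X_hat @ X_hat.T` matches `P`), so the positions are recovered up to an orthogonal transformation, the usual identifiability caveat for dot product models.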

2605.22346 2026-06-11 stat.ML cs.LG cs.SI 版本更新

The ASE-LSE Disagreement Landscape: An End-to-End Characterisation of Extremes and Structural Drivers

ASE-LSE分歧全景:极值与结构驱动因素的端到端刻画

Minh Triet Pham, Ian Gallagher

AI总结 本文研究了图数据分析中邻接谱嵌入和拉普拉斯谱嵌入方法在相同网络上产生不同结果的结构原因,揭示了度异质性和特征间隙对潜在子空间分歧的影响。

详情
Comments
This paper is being withdrawn as it was submitted without the consent of all listed authors, and contains work that is currently under academic assessment. It will be resubmitted at an appropriate time once evaluation is complete
AI中文摘要

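邻接谱嵌入(ASE)和拉普拉斯谱嵌入(LSE)是图数据分析中最常用的两种方法,但应用于同一张图时常常产生不同结果,而这种分歧背后的结构原因尚未被充分理解。本文对ASE-LSE潜在子空间分歧给出了端到端的刻画。我们首先证明,只要拉普拉斯矩阵是邻接矩阵的标量倍,两种方法在每个嵌入维度上都产生相同的潜在子空间,并证明该标量关系成立当且仅当图是正则图或二部双正则图。这一锚点结果给出了完美一致的充分条件,确定了分歧谱的下限,并为扰动分析提供了基线。随后我们证明不存在达到最大分歧的图或图族:分歧始终严格低于其理论上限,并且我们给出一个见证图族,表明任何有限最大值都无法取到,因此分歧全景没有最大化者。在确立两个端点之后,我们推导出正则性偏离界,其两项分别将度异质性和特征间隙分离为影响中间区间分歧的主要结构因素。在数千个模拟图上的实证验证证实了该界所预测的机制:异质性推高分歧,特征间隙抑制分歧,二者的联合比值成为ASE-LSE分歧的统一预测指标,提示两种嵌入何时可以互换、何时不可以。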

英文摘要

Two of the most widely used methods for analysing graph data, Adjacency Spectral Embedding and Laplacian Spectral Embedding, often produce different results when applied to the same graph. Yet the structural reasons behind this disagreement remain incompletely understood. This paper provides an end-to-end account of ASE-LSE latent subspace disagreement. We first prove that the two methods produce identical latent subspaces for every embedding dimension whenever the Laplacian is a scalar multiple of the adjacency matrix, and show that this scalar relationship holds if and only if the graph is either regular or bipartite biregular. This anchor result identifies a sufficient condition for perfect agreement that pins down the floor of the disagreement spectrum and supplies the baseline for the perturbation analysis. We then prove that no maximal-disagreement graph or family of graphs exists: the disagreement is always strictly below its theoretical ceiling, and we exhibit a witness family demonstrating that no finite maximum is attainable, so the disagreement landscape has no maximiser. With both endpoints established, we derive a Regularity Departure Bound whose two terms isolate degree heterogeneity and eigengap as the primary structural factors influencing disagreement in the middle regime. Empirical validation across thousands of simulated graphs confirms the mechanisms predicted by the bound: heterogeneity pushes disagreement up, eigengap suppresses it, and their joint ratio emerges as a unified predictor of ASE-LSE disagreement, suggesting when the two embeddings can be treated as interchangeable and when they cannot.
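The scalar-multiple condition in the anchor result can be checked numerically on small graphs. The sketch below assumes that LSE embeds the normalized matrix D^{-1/2} A D^{-1/2} (the abstract does not say which Laplacian variant is used); the helper names `normalized_laplacian` and `scalar_ratio` and the example graphs are hypothetical.

```python
import numpy as np


def normalized_laplacian(A):
    """D^{-1/2} A D^{-1/2}; assumed here to be the matrix that LSE embeds."""
    inv_sqrt = 1.0 / np.sqrt(A.sum(axis=1))
    return inv_sqrt[:, None] * A * inv_sqrt[None, :]


def scalar_ratio(A):
    """Return c if the normalized matrix equals c * A, otherwise None.

    Both matrices vanish off the edge set, so comparing entries on edges suffices.
    """
    L = normalized_laplacian(A)
    ratios = L[A > 0] / A[A > 0]
    c = ratios[0]
    return float(c) if np.allclose(ratios, c) else None


def cycle_graph(n):
    """Adjacency matrix of the n-cycle (2-regular for n >= 3)."""
    A = np.zeros((n, n))
    for i in range(n):
        A[i, (i + 1) % n] = 1.0
        A[(i + 1) % n, i] = 1.0
    return A


def complete_bipartite(m, n):
    """Adjacency matrix of K_{m,n} (bipartite biregular with degrees n and m)."""
    A = np.zeros((m + n, m + n))
    A[:m, m:] = 1.0
    A[m:, :m] = 1.0
    return A


# Path on 4 nodes: degrees 1, 2, 2, 1, so neither regular nor biregular.
path4 = np.diag(np.ones(3), 1) + np.diag(np.ones(3), -1)

c_cycle = scalar_ratio(cycle_graph(6))             # regular: c = 1/2
c_bip = scalar_ratio(complete_bipartite(2, 3))     # biregular: c = 1/sqrt(2*3)
c_path = scalar_ratio(path4)                       # no common scalar
```

When a positive scalar `c` exists, the two matrices share eigenvectors and eigenvalue ordering, which is consistent with the agreement the abstract states for regular and bipartite biregular graphs; the 4-node path shows the relation failing once degrees vary along edges.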

10. 迁移、元学习与持续学习 3 篇

2606.11844 2026-06-11 cs.LG 新提交

TaskFusion: Continual Anomaly Detection for Heterogeneous Tabular Data

TaskFusion: 异构表格数据的持续异常检测

Dayananda Herurkar, Federico Raue, Joachim Folz, Jörn Hees, Andreas Dengel

发表机构 * German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) RPTU Kaiserslautern-Landau(凯泽斯劳滕-兰道大学) Hochschule Bonn-Rhein-Sieg (H-BRS)(波恩-莱茵-锡格应用技术大学)

AI总结 提出TaskFusion方法,通过AGF模型、任务融合增强和异常暴露技术,解决异构表格数据在持续学习中的特征空间变化、分布偏移和类别不平衡问题,在21个数据集上显著提升持续异常检测性能。

详情
Comments
22 Pages
AI中文摘要

表格数据中的持续异常检测具有挑战性且尚未充分探索,尤其是在异构特征模式、分布偏移和严重类别不平衡的情况下。在许多实际应用中,数据来自不同领域并按顺序到达,这使得传统的持续学习方法因依赖固定输入空间而失效。我们提出了一种持续学习方法,能够克服这些挑战并持续从不同任务中学习。我们的方法包含三个主要部分:AGF模型、TaskFusion增强和异常暴露。AGF模型将任务特定特征映射到共享空间,然后对齐分布以减少表示漂移,并在对齐空间中学习异常决策边界。为了提高稳定性,我们引入了TaskFusion增强,结合任务内的边界感知插值来细化模型异常边界,以及跨任务混合以在数据集间传递异常结构。为了处理类别不平衡和内存限制,我们采用表格数据集蒸馏来存储紧凑的合成回放样本,这些样本与增强数据一起在异常暴露目标中用于鲁棒的异常检测。我们在多个领域的21个异构数据集上评估了该方法。结果表明,与顺序微调和其他持续学习基线相比,我们的方法显著提高了持续异常检测性能,同时减少了灾难性遗忘并在异构数据集上保持稳定的检测。

英文摘要

Continual anomaly detection in tabular data is challenging and remains largely underexplored, particularly in settings with heterogeneous feature schemas, distribution shifts, and severe class imbalance. In many real-world applications, data arrive sequentially from diverse domains, rendering conventional continual learning methods ineffective due to their reliance on a fixed input space. We propose a continual learning (CL) method that overcomes these challenges and learns continually from different tasks. Our method consists of three main parts: our AGF model, TaskFusion augmentation, and outlier exposure. The AGF model maps task-specific features into a shared space, then aligns distributions to reduce representation drift, and learns anomaly decision boundaries in the aligned space. To improve stability, we introduce TaskFusion augmentation, which combines boundary-aware interpolation within tasks, to refine the model's anomaly boundaries, with cross-task mixing, to transfer anomaly structure across datasets. To handle class imbalance and memory constraints, we employ tabular dataset distillation to store compact synthetic replay samples, which are used jointly with augmented data in an outlier exposure objective for robust anomaly detection. We evaluate the approach on 21 heterogeneous datasets across multiple domains. Results show that our approach substantially improves continual anomaly detection performance over sequential fine-tuning and other CL baselines while reducing catastrophic forgetting and maintaining stable detection across heterogeneous datasets.

2507.23534 2026-06-11 cs.LG cs.CV 版本更新

Continual Learning with Support Boundary Experience Blending

支持边界经验混合的持续学习

Chih-Fan Hsu, Ming-Ching Chang, Wei-Chao Chen

AI总结 提出经验混合框架,通过差分隐私启发的噪声生成支持边界数据,联合训练样本和边界数据以正则化决策边界,在多个数据集上提升持续学习准确率。

详情
AI中文摘要

持续学习旨在减轻模型在顺序任务训练时的灾难性遗忘。常见方法经验回放存储过去的样本,但仅稀疏地近似数据分布,导致决策边界脆弱且过于简化。我们通过引入支持边界数据来解决这一限制,该数据通过差分隐私启发的噪声注入潜在特征,生成边界邻近表示,隐式正则化决策边界。基于此,我们提出经验混合框架,通过双模型聚合策略联合训练样本和支持边界数据。经验混合有两个组成部分:(1) 潜在空间噪声注入以生成支持边界数据,(2) 联合利用样本和支持边界数据的端到端训练。与标准经验回放不同,支持边界数据丰富了决策边界附近的特征空间,从而实现更稳定和鲁棒的持续学习。在CIFAR-10、CIFAR-100、Tiny ImageNet和ImageNet1K上的大量实验分别展示了10%、6%、13%和2%的持续准确率提升。

英文摘要

Continual learning (CL) seeks to mitigate catastrophic forgetting when models are trained on sequential tasks. A common approach, experience replay (ER), stores past exemplars but only sparsely approximates the data distribution, yielding fragile and oversimplified decision boundaries. We address this limitation by introducing Support Boundary Data (SBD), generated by injecting differential-privacy-inspired noise into latent features to create boundary-adjacent representations that implicitly regularize decision boundaries. Building on this idea, we propose Experience Blending (EB), a framework that jointly trains on exemplars and SBD through a dual-model aggregation strategy. EB has two components: (1) latent-space noise injection to generate support boundary data, and (2) end-to-end training that jointly leverages exemplars and SBD. Unlike standard experience replay, SBD enriches the feature space near decision boundaries, leading to more stable and robust continual learning. Extensive experiments on CIFAR-10, CIFAR-100, Tiny ImageNet, and ImageNet1K demonstrate consistent accuracy improvements of 10%, 6%, 13%, and 2%, respectively.
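
The first EB component, latent-space noise injection, can be sketched as follows. This is a minimal illustration and not the authors' implementation: the abstract only says the noise is "differential-privacy-inspired", so the Laplace distribution, the `sensitivity / epsilon` scale, and the function name `make_support_boundary_data` are all assumptions.

```python
import numpy as np

def make_support_boundary_data(latents, sensitivity=1.0, epsilon=1.0, rng=None):
    """Perturb latent features with Laplace noise of scale sensitivity / epsilon.

    Assumption: a Laplace-mechanism-style perturbation stands in for the
    paper's noise model, which the abstract does not specify.
    """
    rng = np.random.default_rng() if rng is None else rng
    scale = sensitivity / epsilon  # smaller epsilon means stronger noise
    noise = rng.laplace(loc=0.0, scale=scale, size=latents.shape)
    return latents + noise

rng = np.random.default_rng(0)
exemplar_latents = rng.normal(size=(4, 8))  # 4 stored exemplars, 8-dim latents
sbd = make_support_boundary_data(exemplar_latents, epsilon=2.0, rng=rng)
```

In a replay setup, `sbd` would be fed to the classifier head alongside the clean exemplar latents; how the two are weighted is left open here.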

2603.15158 2026-06-11 cs.LG 版本更新

Point-Identification of a Robust Predictor Under Latent Shift with Imperfect Proxies

在不完美代理下潜在偏移中鲁棒预测器的点识别

Zahra Rahiminasab, Reza Soumi, Arto Klami, Samuel Kaski

AI总结 针对潜在混淆变量导致的域适应问题,提出基于潜在等价类的点识别方法,通过跨域秩条件替代强完备性假设,并设计主动学习框架PQAL实现鲁棒预测。

详情
AI中文摘要

当跨域的分布偏移源于同时影响协变量和结果的潜在混淆变量时,域适应问题变得更加具有挑战性。现有的基于代理的方法通过强完备性假设来唯一确定(点识别)鲁棒预测器。完备性要求代理具有关于潜在混淆变量变化的足够信息。对于不完美代理,从混淆变量到代理分布空间的映射是非单射的,多个潜在混淆变量值可能生成相同的代理分布。这破坏了完备性假设,观测数据与多个潜在预测器(集识别)一致。为了解决这个问题,我们引入了潜在等价类(LECs)。LECs定义为诱导相同条件代理分布的潜在混淆变量组。我们证明,只要多个域在如何混合代理诱导的LECs以形成鲁棒预测器方面有足够差异,鲁棒预测器的点识别仍然可以实现。这种域多样性条件被形式化为混合权重的跨域秩条件,该条件比完备性假设弱得多。我们提出了近端准贝叶斯主动学习(PQAL)框架,该框架主动查询满足该秩条件的小型、有针对性的多样化域集合。PQAL可以恢复点识别的预测器,展示了对不同程度偏移的鲁棒性,并在合成数据、半合成dSprites、IHDP、ACS Folktables数据集上优于先前方法。

英文摘要

Addressing the domain adaptation problem becomes more challenging when distribution shifts across domains stem from latent confounders that affect both covariates and outcomes. Existing proxy-based approaches that address latent shift rely on a strong completeness assumption to uniquely determine (point-identify) a robust predictor. Completeness requires that proxies carry sufficient information about variations in latent confounders. For imperfect proxies, the mapping from confounders to the space of proxy distributions is non-injective, and multiple latent confounder values can generate the same proxy distribution. This breaks the completeness assumption, and the observed data are then consistent with multiple potential predictors (set-identified). To address this, we introduce latent equivalent classes (LECs). LECs are defined as groups of latent confounders that induce the same conditional proxy distribution. We show that point-identification of the robust predictor remains achievable as long as multiple domains differ sufficiently in how they mix proxy-induced LECs to form the robust predictor. This domain diversity condition is formalized as a cross-domain rank condition on the mixture weights, which is a substantially weaker assumption than completeness. We introduce the Proximal Quasi-Bayesian Active learning (PQAL) framework, which actively queries a small, targeted set of diverse domains that satisfy this rank condition. PQAL can recover the point-identified predictor, demonstrates robustness to varying degrees of shift, and outperforms previous methods on synthetic data and on the semi-synthetic dSprites, IHDP, and ACS Folktables datasets.
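
One way to picture the cross-domain rank condition is as a full-column-rank check on a matrix of mixture weights. The sketch below is an assumed reading, not the paper's formal statement: it treats row `d` of `mixture_weights` as the weights domain `d` places on each LEC, and asks that the domains jointly span all LECs.

```python
import numpy as np

def satisfies_rank_condition(mixture_weights, tol=1e-8):
    """Check whether the domains mix the LECs in sufficiently different ways.

    mixture_weights[d, k] is the (hypothetical) weight domain d puts on
    latent equivalent class k. Assumption: the condition is read as
    "rank equals the number of LECs".
    """
    w = np.asarray(mixture_weights, dtype=float)
    n_classes = w.shape[1]
    return bool(np.linalg.matrix_rank(w, tol=tol) == n_classes)

# Three domains that weight three LECs differently.
diverse = [[0.7, 0.2, 0.1],
           [0.2, 0.6, 0.2],
           [0.1, 0.3, 0.6]]
# Three domains that all mix the LECs identically.
redundant = [[0.5, 0.3, 0.2]] * 3

diverse_ok = satisfies_rank_condition(diverse)
redundant_ok = satisfies_rank_condition(redundant)
```

Under this reading, adding more domains only helps when they bring new mixing proportions, which is the intuition behind querying a small, targeted set of diverse domains.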

11. 数据集、基准与评测 33 篇

2606.11235 2026-06-11 cs.LG cs.DB stat.ME 新提交

Few-Shot Resampling for Scalable Statistically-Sound Data Mining

少样本重采样:可扩展的统计可靠数据挖掘

Leonardo Pellegrina, Fabio Vandin

发表机构 * Department of Information Engineering, University of Padova(帕多瓦大学信息工程系)

AI总结 提出FewRS方法,基于重采样评估数据挖掘结果的统计显著性,通过推导检验统计量上确界偏差的新界,仅需极少量重采样数据集即可对错误发现概率提供严格保证,显著提升可扩展性。

详情
Comments
Accepted to KDD 2026
AI中文摘要

知识发现的一个关键步骤是评估数据挖掘结果。在包括模式挖掘、图分析等多个应用中,此步骤包括评估结果的统计显著性,以避免仅由噪声或数据随机波动导致的虚假发现。虽然针对某些特定应用已经开发了专门程序,但基于重采样的方法被广泛使用,尤其是在无法推导解析结果的复杂分析中。然而,当前基于重采样的方法需要生成和分析数千个重采样数据集,因此对于大型数据集或计算密集型分析不实用。本文中,我们介绍了FewRS,一种简单有效的基于重采样的方法,用于评估数据挖掘结果的统计显著性,并对错误发现概率提供严格保证。我们的方法可应用于任何使用重采样方法的情况。FewRS基于我们对表示数据挖掘结果质量的检验统计量的上确界偏差推导出的新界。我们证明FewRS需要生成和分析极少数量的重采样数据集,从而得到高度可扩展且广泛适用的方法。我们在常见任务(如模式挖掘和网络分析)上测试了我们的方法。在所有情况下,与现有技术相比,我们的方法在运行时间上减少了多达两个数量级,同时保持高统计功效,使得能够在大型真实世界数据集上对数据挖掘结果进行统计验证。

英文摘要

A key step in knowledge discovery is the evaluation of data mining results. In several applications, including pattern mining, graph analysis, and others, this step includes the evaluation of the statistical significance of the results, to avoid spurious discoveries due only to noise or random fluctuations in the data. While specialized procedures have been developed for some specific applications, resampling-based approaches are widely used, in particular for complex analyses where analytical results cannot be derived. However, current resampling-based approaches require the generation and analysis of thousands of resampled datasets, and are therefore impractical for large datasets or computationally intensive analyses. In this paper, we introduce FewRS, a simple and effective resampling-based approach to assess the statistical significance of data mining results with rigorous guarantees on the probability of false discoveries. Our approach can be used in every situation where resampling-based approaches are applied. FewRS builds on our derivation of a novel bound to the supremum deviation of test statistics representing the quality of data mining results. We prove that FewRS needs to generate and analyze an extremely small number of resampled datasets, leading to a highly scalable approach with wide applicability. We test our approach on common tasks such as pattern mining and network analysis. In all cases, our approach results in a reduction of up to two orders of magnitude in running time compared to the state of the art, while preserving high statistical power, enabling the statistical validation of data mining results on large-scale real-world datasets.
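
For context on why conventional resampling needs so many datasets, the sketch below shows a standard permutation test. It is the baseline setting, not FewRS; the difference-of-means statistic and the function name are chosen only for illustration. With `n_resamples` resampled datasets, the smallest p-value this estimator can report is `1 / (n_resamples + 1)`, so resolving small significance levels (for example after a multiple-testing correction) takes many resamples.

```python
import numpy as np

def permutation_p_value(x, y, n_resamples, rng):
    """Standard permutation p-value for a difference in means.

    Uses the (hits + 1) / (n_resamples + 1) estimator, which never
    reports a p-value below 1 / (n_resamples + 1).
    """
    observed = abs(x.mean() - y.mean())
    pooled = np.concatenate([x, y])
    hits = 0
    for _ in range(n_resamples):
        perm = rng.permutation(pooled)  # one resampled dataset
        stat = abs(perm[:len(x)].mean() - perm[len(x):].mean())
        hits += int(stat >= observed)
    return (hits + 1) / (n_resamples + 1)

rng = np.random.default_rng(0)
x = rng.normal(loc=0.0, scale=0.1, size=10)
y = rng.normal(loc=10.0, scale=0.1, size=10)
p = permutation_p_value(x, y, n_resamples=99, rng=rng)
```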

2606.11267 2026-06-11 cs.LG cs.CR 新提交

A prior-free blind detection of information leakage from model predictions

基于模型预测的信息泄露的无先验盲检测

Laurence A. Jacobs

发表机构 * Center for Molecular Cardiology, University of Zurich(苏黎世大学分子心脏病学中心) Center for Complexity Sciences, National University of Mexico(墨西哥国立自治大学复杂性科学中心)

AI总结 针对机器学习模型输出中信息泄露的检测问题,提出决策理论框架,证明校准泄露与诚实模型不可区分,但近确定性子组可被无先验检测,并在UK Biobank上验证。

详情
AI中文摘要

数据泄露——模型被基线不可用的信息污染——是基于机器学习的科学中主要的可重复性失败,然而检测工具需要训练代码、外部数据或领域专业知识。没有一种工具能作用于审计员最常持有的工件:模型的输出。我们询问仅从预测和结果中能判断出关于泄露的什么信息。我们给出了一个决策理论框架,其中泄露诊断是预测风险/结果规律的泛函,由与适当评分规则和决策曲线分析相关的阈值加权参数化。我们证明了一个尖锐的不可能性:重新校准的泄露匹配诚实模型的校准和区分度,通过预测的\emph{任何}函数与诚实性能不可区分,因此广泛类别仅能通过外部提供的可实现区分度上限来检测。然后我们证明了泄露无法隐藏什么:近确定性子组——近标签泄露的特征——产生一个持续的单位纯度头部,任何非确定性结果的合法预测器都无法制造,从而产生一个无先验测试。这些结果将泄露组织成三分法——未校准、广泛校准和确定性——每个都有匹配的检测器和失败模式。我们在UK Biobank上使用时窗共病泄露进行验证,已知分级严重性,测量该终点上的检测下限$\Delta\cstar \approx 0.007$,低于此的残余泄露从输出中无法检测,且太小无法改变结论。数值下限是队列和终点特定的;结构教训是通用的:仅输出检测在残余泄露与诚实的更强预测器无法区分时失败。该测试在商品硬件上不到一秒内返回对预测向量的判定。

英文摘要

Data leakage -- contamination of a model with information unavailable at baseline -- is the dominant reproducibility failure in machine-learning-based science, yet detection tools require training code, external data, or domain expertise. None operates on the artifact an auditor most often holds: the model's output. We ask what can be decided about leakage from predictions and outcomes alone. We give a decision-theoretic framework in which leakage diagnostics are functionals of the predicted-risk/outcome law, parameterized by a threshold-weighting linked to proper scoring rules and decision-curve analysis. We prove a sharp impossibility: a recalibrated leak matching an honest model's calibration and discrimination is indistinguishable from honest performance by \emph{any} function of the predictions, so the broad class is detectable only against an externally supplied ceiling on achievable discrimination. We then prove what leakage cannot hide: a near-deterministic subgroup -- the signature of a near-label leak -- produces a sustained unit-purity head that no legitimate predictor of a non-deterministic outcome can manufacture, yielding a prior-free test. These results organize leakage into a trichotomy -- miscalibrated, broad-calibrated, and deterministic -- each with a matched detector and failure mode. We validate on UK Biobank using time-windowed comorbidity leakage with known, graded severity, measuring a detection floor of $\Delta\cstar \approx 0.007$ on this endpoint, below which residual leakage is undetectable from output and too small to alter conclusions. The numerical floor is cohort- and endpoint-specific; the structural lesson is general: output-only detection fails where residual leakage is indistinguishable from an honestly stronger predictor. The test returns a verdict on a prediction vector in under a second on commodity hardware.

2606.11562 2026-06-11 cs.LG cs.CL 新提交

GraphInfer-Bench: Benchmarking LLM's Inference Capability on Graphs

GraphInfer-Bench:评估LLM在图上的推理能力基准

Zhuoyi Peng, Jingzhou Jiang, Hanlin Gu, Lixin Fan, Yi Yang

发表机构 * The Hong Kong University of Science and Technology(香港科技大学) Webank(微众银行)

AI总结 提出GraphInfer-Bench基准,通过五个任务(描述与比较)测试LLM能否从节点及其邻域推断出无法从单节点或路径检索的答案,发现所有方法均存在差距。

详情
Comments
Code: this https URL; Dataset: this https URL
AI中文摘要

图分析支撑着许多应用,这些应用的答案无法从单个记录中查找或沿路径检索:洗钱团伙、药物重定位、用户偏好和科学主题都是从节点及其邻域推断出来的。我们引入GraphInfer-Bench,一个评估LLM是否能够执行这种图推理的基准:产生一个开放式的答案,该答案没有单个节点支持,也没有路径可检索。现有的图问答协议无法测试这种能力:算法模拟、节点分类、单节点描述、KG-QA和GraphRAG都允许从单个节点或沿路径检索答案。GraphInfer-Bench定义了五个任务,涵盖描述(区域是什么)和比较(区域如何不同),每个任务的设计使得真实答案不存在于任何单个节点中。发布版本包含42,000个样本,跨越六个真实世界图,自动生成并通过四层质量控制协议筛选。我们评估了四种方法族在相同任务上的表现:图-令牌对齐模型、零样本前沿闭源LLM、Graph2Text监督微调以及作为结构参考的普通GNN。没有方法族能够弥合差距。图-令牌对齐部分处理描述任务(关系、主题),但在比较任务上失败。前沿LLM在基于LLM的方法中在离群点检测和社区划分上领先,但在掩码节点预测上落后。Graph2Text SFT在描述方面是最强的基于LLM的方法,但在比较方面落后于前沿LLM。在每个任务上,普通GNN匹配或击败了最强的基于LLM的方法,在社区检测上差距最大。GraphInfer-Bench揭示了图推理是一个开放的能力差距,而不是任何单一架构的属性。

英文摘要

Graph analysis underlies many applications whose answers cannot be looked up in a single record or retrieved along a path: laundering rings, drug repurposing, user preference, and scientific theme are all inferred from a node together with its neighbourhood. We introduce GraphInfer-Bench, a benchmark for whether LLMs can perform this graph inference: producing an open-ended answer that no single node supports and no path retrieves. Existing graph-QA protocols cannot test this capability: algorithm simulation, node classification, single-node description, KG-QA, and GraphRAG all admit answers retrievable from one node or along a path. GraphInfer-Bench defines five tasks along Description (what a region is) and Comparison (how regions differ), each constructed so the ground truth lives in no single node. The release contains 42,000 samples across six real-world graphs, produced automatically and screened by a four-layer quality-control protocol. We evaluate four method families against the same tasks: graph-token alignment models, zero-shot frontier closed-source LLMs, Graph2Text supervised fine-tuning, and plain GNNs as a structural reference. No method family closes the gap. Graph-token alignment partially handles description tasks (relational, theme) but collapses on comparison tasks. Frontier LLMs lead on outlier detection and community partition among LLM-based methods but lag on masked-node prediction. Graph2Text SFT is the strongest LLM-based method on the description side yet falls behind frontier LLMs on comparison. Across every task, plain GNNs match or beat the strongest LLM-based row, with the largest margin on community detection. GraphInfer-Bench surfaces graph inference as an open capability gap rather than a property of any one architecture.

2606.11616 2026-06-11 cs.LG cs.IR 新提交

DeMix: Debugging Training Data with Mixed Data Error Types by Investigating Influence Vectors

DeMix: 通过影响向量调试包含混合错误类型的训练数据

Jiale Deng, Yanyan Shen, Xiaogang Shi, Chai Junjun

发表机构 * Shanghai Jiao Tong University(上海交通大学) ByteDance Inc.(字节跳动) Tiktok

AI总结 提出DeMix框架,利用影响向量捕捉不同错误类型对模型行为的独特模式,将数据调试转化为多标签分类问题,并引入基于干预的学习策略,在11个任务上显著提升调试F1分数和修复后模型性能。

详情
AI中文摘要

高质量的训练数据对于机器学习模型的成功至关重要。然而,真实世界的数据集通常包含由数据准备流程中的系统性缺陷引起的混合错误类型,包括标签错误、特征错误和虚假相关性。有效的训练数据调试既需要检测错误样本,也需要识别其具体的错误类型以便进行针对性修复,但现有的数据清洗和归因方法未能充分满足这一双重需求。在本文中,我们提出DeMix,一种同时诊断错误样本及其错误类型的新框架。我们的关键见解是,不同的错误类型会在模型行为上产生不同的模式。DeMix通过影响向量捕获这些特定于错误的模式,这些影响向量描述了每个训练样本如何影响所有验证样本上的模型预测。我们将训练数据调试形式化为一个多标签分类问题,其中开发了一个分类器直接从影响向量预测错误类型。我们进一步引入了一种基于干预的学习策略,引导分类器捕获每种错误类型特有的不变理由,确保学到的分类器有效泛化。在表格数据预测、推荐系统和LLM对齐等11个任务上的实证评估表明,DeMix显著优于最先进的方法,在数据调试F1分数上提高了22.61%,在数据修复后任务模型性能上提高了9.32%。代码可在以下网址获取:this https URL。

英文摘要

High-quality training data is essential for the success of machine learning models. However, real-world datasets often contain mixed types of errors arising from systematic flaws in data preparation pipelines, including label errors, feature errors, and spurious correlations. Effective debugging of training data requires both detecting erroneous samples and identifying their specific error types to enable targeted repair, yet existing data cleaning and attribution methods fail to adequately address this dual requirement. In this paper, we propose DeMix, a novel framework that simultaneously diagnoses erroneous samples and their error types. Our key insight is that different error types produce distinct patterns on model behavior. DeMix captures such error-specific patterns by influence vectors that characterize how each training sample affects model predictions across all validation samples. We formulate training data debugging as a multi-label classification problem where a classifier is developed to predict error types directly from influence vectors. We further introduce an intervention-based learning strategy that guides the classifier to capture invariant rationales specific to each error type, ensuring the learned classifier generalizes effectively. Empirical evaluations on 11 tasks across tabular data prediction, recommendation systems, and LLM alignment demonstrate that DeMix significantly outperforms state-of-the-art approaches, achieving a 22.61% improvement in data debugging F1-score and a 9.32% gain in task model performance after data repair. Code is available at: this https URL.
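
An influence vector can be illustrated with a simple first-order proxy: the dot product between a training sample's loss gradient and each validation sample's loss gradient. This gradient-similarity stand-in is an assumption made for the sketch; the abstract does not say which influence estimator DeMix uses, and the logistic model and function names are hypothetical.

```python
import numpy as np

def per_sample_grads(w, X, y):
    """Gradient of the logistic loss with respect to w, one row per sample."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return (p - y)[:, None] * X

def influence_vectors(w, X_train, y_train, X_val, y_val):
    """Row i is the influence vector of training sample i over all validation samples.

    Assumption: gradient dot products stand in for the paper's influence
    estimator. A positive entry means a gradient step on the training
    sample would, to first order, reduce that validation sample's loss.
    """
    g_train = per_sample_grads(w, X_train, y_train)
    g_val = per_sample_grads(w, X_val, y_val)
    return g_train @ g_val.T  # shape (n_train, n_val)

w = np.zeros(2)
X_train = np.array([[1.0, 2.0], [1.0, 2.0]])
y_train = np.array([1.0, 0.0])  # the second sample carries a flipped label
X_val = np.array([[1.0, 2.0]])
y_val = np.array([1.0])
vectors = influence_vectors(w, X_train, y_train, X_val, y_val)
```

In this toy case the mislabeled training sample gets a negative entry while its clean twin gets a positive one, which is the kind of error-specific pattern a downstream classifier could pick up.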

2606.11660 2026-06-11 cs.LG 新提交

Bergson: An Open Source Library for Data Attribution

Bergson:一个用于数据归因的开源库

Lucia Quirke, Louis Jaburi, David Johnston, William Z. Li, Gonçalo Paulo, Guillaume Martres, Girish Gupta, Stella Biderman, Nora Belrose

AI总结 提出Bergson开源库,支持大规模语言模型和预训练数据集的多种数据归因方法,提供磁盘梯度存储和多节点分布式训练,首次开源实现MAGIC、SOURCE和TrackStar三种方法。

详情
AI中文摘要

数据归因是可解释性领域一个有前景的方向,旨在通过训练数据的影响来解释模型行为,其应用包括调试不良模型行为和训练数据集整理。然而,大规模执行数据归因需要大量的工程工作,许多前沿技术缺乏开源工具和支持。Bergson是一个开源库,旨在通过提供一系列可扩展到超大规模语言模型和预训练数据集的技术,推动该领域的更快发展。该库原生支持磁盘梯度存储和多节点分布式训练,并为研究人员提供便利性工具。最后,我们首次开源实现了三种领先的数据归因方法:MAGIC、SOURCE和TrackStar。该库可在以下网址获取:this https URL。

英文摘要

Data attribution is a promising field in interpretability that aims to explain model behavior through the influence of its training data, with applications including debugging undesirable model behavior and training dataset curation. However, significant engineering effort is required to perform it at scale, and many cutting edge techniques lack open-source tooling and support. Bergson is an open source library that aims to enable faster progress in the field by providing a host of techniques that scale to very large language models and pre-training datasets. The library natively supports on-disk gradient stores and multi-node distributed training, and provides quality of life tools for researchers. Finally, we introduce the first open-source implementations of three leading data attribution methods: MAGIC, SOURCE, and TrackStar. The library is available at this https URL.

2606.11695 2026-06-11 cs.LG cs.AI 新提交

Noise-Aware Framework for Correcting Corrupted Labels

噪声感知框架用于纠正损坏标签

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Phong Lam, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

发表机构 * Faculty of Information Technology, VNU University of Engineering and Technology(越南国立大学工程与技术学院信息技术系)

AI总结 提出CANOLA框架,通过噪声感知学习和迭代标签精炼来纠正损坏标签,在六个数据集上相比现有方法错误率降低19%-52%。

详情
AI中文摘要

高质量的标注数据对于训练可靠的ML/DL模型至关重要。然而,现实世界的数据集通常包含相当比例的损坏标签,这会严重降低模型性能。为了解决这个问题,我们提出了CANOLA,一种通过噪声感知学习和迭代标签精炼来纠正损坏标签的新框架。CANOLA明确估计数据集的潜在噪声分布,并将此信息纳入噪声感知深度神经网络的训练中。通过在训练过程中融入噪声特征,CANOLA使模型能够降低不可靠监督信号的权重,并专注于可信模式,从而提高鲁棒性和泛化能力。标签纠正是通过谨慎的迭代软标签精炼进行的,其中模型预测与观察到的标签混合,以防止过早或错误的更新。这种渐进式精炼使得数据集能够以稳定且可控的方式得到修复。我们在六个广泛使用的数据集上,在现实噪声标注场景下评估了CANOLA。实验结果表明,CANOLA始终优于最先进的标签纠正方法,在错误减少方面实现了19%到52%的相对改进。此外,在由CANOLA纠正的数据集上训练的模型获得了显著的下游性能提升。即使在CANOLA纠正的数据上训练的简单分类器,其性能也能超过复杂的以模型为中心的方法,最高可达67%。

英文摘要

High-quality labeled data is essential for training reliable ML/DL models. However, real-world datasets often contain a considerable proportion of corrupted labels, which can severely degrade model performance. To address this problem, we propose CANOLA, a novel framework for correcting corrupted labels through noise-aware learning and iterative label refinement. CANOLA explicitly estimates the underlying noise distribution of the dataset and incorporates this information into the training of a noise-aware Deep Neural Network. By incorporating noise characteristics during learning, CANOLA enables the model to down-weight unreliable supervision signals and focus on trustworthy patterns, thereby improving robustness and generalization. Label correction is performed via cautious, iterative soft label refinement, in which model predictions are blended with observed labels to prevent premature or erroneous updates. This progressive refinement allows the dataset to be repaired in a stable and controlled manner. We evaluate CANOLA on six widely used datasets under realistic noisy labeling scenarios. Experimental results show that CANOLA consistently outperforms SOTA label correction methods, achieving relative improvements ranging from 19% to 52% in error reduction. Moreover, models trained on datasets corrected by CANOLA obtain substantial downstream performance gains. Even simple classifiers trained on CANOLA's corrected data can outperform complex model-centric approaches by margins of up to 67%.
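
The cautious soft-label refinement described above can be sketched as a convex blend of the current soft label and the model's prediction. The blending weight `alpha`, the fixed model prediction, and the function name are assumptions for illustration; the abstract does not give CANOLA's exact update rule or schedule.

```python
import numpy as np

def refine_soft_labels(soft_labels, model_probs, alpha=0.2):
    """Move each soft label a small step (alpha) toward the model's prediction."""
    blended = (1.0 - alpha) * soft_labels + alpha * model_probs
    return blended / blended.sum(axis=1, keepdims=True)

observed = np.array([[1.0, 0.0, 0.0]])      # observed (possibly corrupted) one-hot label
model_probs = np.array([[0.1, 0.8, 0.1]])   # model consistently prefers class 1

after_one = refine_soft_labels(observed, model_probs)
labels = observed
for _ in range(5):  # in practice the model would be retrained between rounds
    labels = refine_soft_labels(labels, model_probs)
after_five = labels
```

A single round leaves the observed class on top, and the label only flips after the model has disagreed for several rounds, which is what keeps the update from being premature.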

2606.11699 2026-06-11 cs.LG 新提交

A Data-Centric Framework for Detecting and Correcting Corrupted Labels

一个用于检测和纠正损坏标签的数据中心框架

Ha-Linh Nguyen, Hong-Anh Nguyen, Minh-Duc La, Thu-Trang Nguyen, Son Nguyen, Hieu Dinh Vo

发表机构 * Faculty of Information Technology, VNU University of Engineering and Technology, Hanoi, Vietnam(越南河内国立大学工程与技术学院信息技术系)

AI总结 提出Relabeler框架,联合利用局部和全局关系检测损坏标签,并基于输入特征和噪声标签估计最可能的干净标签进行纠正,在多个数据集上实现高达58%的标签纠正精度提升和6%的下游任务性能提升。

详情
AI中文摘要

机器学习和深度学习模型的性能在很大程度上取决于训练数据的质量。然而,现实世界数据集的质量常常因噪声标签而受损,这会显著降低模型的准确性和可靠性。为了解决这一挑战,我们提出了Relabeler,一个端到端的数据中心框架,用于检测和纠正损坏的标签。对于损坏标签检测,Relabeler联合利用数据实例之间的局部和全局关系来识别潜在的噪声样本。在检测到可疑实例后,Relabeler进一步通过基于每个实例的输入特征和观察到的噪声标签估计最可能的干净标签来执行标签纠正。跨多个数据集、噪声类型和噪声率的大量实验表明,Relabeler始终优于最先进的基线,在标签纠正精度上实现了高达58%的提升,在下游任务性能上实现了6%的提升。

英文摘要

The performance of machine learning and deep learning models largely depends on the quality of the training data. However, the quality of real-world datasets is often compromised by noisy labels, which can substantially degrade model accuracy and reliability. To address this challenge, we propose Relabeler, an end-to-end data-centric framework for detecting and correcting corrupted labels. For corrupted label detection, Relabeler jointly leverages both local and global relationships among data instances to identify potentially noisy samples. After detecting suspicious instances, Relabeler performs label correction by estimating the most probable clean label for each instance based on both its input features and its observed noisy label. Extensive experiments across multiple datasets, noise types, and noise rates demonstrate that Relabeler consistently outperforms state-of-the-art baselines, achieving up to 58% improvement in label correction precision and 6% improvement in downstream task performance.

2606.11761 2026-06-11 cs.LG 新提交

RCAP: Robust, Class-Aware, Probabilistic Dynamic Dataset Pruning

RCAP: 鲁棒的、类别感知的、概率性动态数据集剪枝

Atif Hassan, Swanand Khare, Jiaul H. Paik

发表机构 * IIT Kharagpur(印度理工学院卡哈拉格普尔分校)

AI总结 提出RCAP算法,通过闭式解估计每类样本保留比例并自适应调整,结合高损失样本优先采样策略,在多种数据集和训练范式下优于现有方法,仅用10%数据即可提升类别不平衡数据集性能1%以上。

详情
Comments
Proceedings of the Forty-first Conference on Uncertainty in Artificial Intelligence (UAI 2025)
AI中文摘要

动态数据剪枝技术旨在通过模型训练期间定期选择输入数据的代表性子集来降低计算成本,同时最小化信息损失。然而,现有方法在平衡和不平衡数据集中,特别是在高剪枝率下,往往难以保持较强的最差组准确率。为了解决这一挑战,我们提出了RCAP,一种用于分类任务的鲁棒的、类别感知的、概率性动态数据集剪枝算法。RCAP应用闭式解来估计每个类别应包含在训练子集中的样本比例。该比例通过类别聚合损失在每个epoch自适应调整。随后,它采用自适应采样策略,优先选择具有高损失的样本来填充类别子集。我们在六个从类别平衡到高度不平衡的多样化数据集上,使用五种不同的模型,在三种训练范式(从头训练、迁移学习和微调)下评估了RCAP。我们的方法在所有剪枝率下始终优于最先进的数据集剪枝方法,实现了卓越的最差组准确率。值得注意的是,仅使用10%的数据,RCAP在类别不平衡数据集上相比全数据训练性能提升超过1%,同时平均加速8.69倍。代码可在此https URL获取。

英文摘要

Dynamic data pruning techniques aim to reduce computational cost while minimizing information loss by periodically selecting representative subsets of input data during model training. However, existing methods often struggle to maintain strong worst-group accuracy, particularly at high pruning rates, across balanced and imbalanced datasets. To address this challenge, we propose RCAP, a Robust, Class-Aware, Probabilistic dynamic dataset pruning algorithm for classification tasks. RCAP applies a closed-form solution to estimate the fraction of samples to be included in the training subset for each individual class. This fraction is adaptively adjusted in every epoch using the class-wise aggregated loss. RCAP then employs an adaptive sampling strategy that prioritizes high-loss samples when populating the class-wise subsets. We evaluate RCAP on six diverse datasets ranging from class-balanced to highly imbalanced using five distinct models across three training paradigms: training from scratch, transfer learning, and fine-tuning. Our approach consistently outperforms state-of-the-art dataset pruning methods, achieving superior worst-group accuracy at all pruning rates. Remarkably, with only $10\%$ of the data, RCAP delivers a $>1\%$ improvement in performance on class-imbalanced datasets compared to full-data training while providing an average $8.69\times$ speedup. The code can be accessed at this https URL.
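
The two steps described above (a per-class keep fraction driven by class-wise loss, then loss-prioritized sampling inside each class) can be sketched as below. The abstract does not state RCAP's closed-form solution, so this sketch substitutes a plainly different rule: each class receives a share of the budget proportional to its summed loss. That rule, the function name, and the per-class cap are assumptions.

```python
import numpy as np

def class_aware_prune(losses, labels, keep_fraction, rng):
    """Pick the sample indices to train on for one epoch.

    Assumption: a class's share of the budget is proportional to its
    summed loss (a stand-in for RCAP's closed-form fraction). Within a
    class, samples are drawn with probability proportional to their loss.
    """
    losses = np.asarray(losses, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    budget = int(round(keep_fraction * len(losses)))
    class_loss = np.array([losses[labels == c].sum() for c in classes])
    share = class_loss / class_loss.sum()
    kept = []
    for c, s in zip(classes, share):
        idx = np.flatnonzero(labels == c)
        k = min(len(idx), max(1, int(round(s * budget))))  # at least one per class
        weights = losses[idx] + 1e-12                      # avoid zero-probability samples
        kept.append(rng.choice(idx, size=k, replace=False, p=weights / weights.sum()))
    return np.concatenate(kept)

rng = np.random.default_rng(0)
labels = np.array([0] * 8 + [1] * 2)        # imbalanced: class 1 is the minority
losses = np.array([0.1] * 8 + [1.6] * 2)    # minority class is currently harder
kept = class_aware_prune(losses, labels, keep_fraction=0.5, rng=rng)
```

Because the minority class cannot supply its full share, fewer samples than the budget are kept here; a fuller version might redistribute the unused budget.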

2606.11961 2026-06-11 cs.LG cs.AI 新提交

Categorical Prior Lock-in: Why In-Context Learning Fails for Structured Data

类别先验锁定:为何上下文学习在结构化数据上失败

Antonio Pelusi, Stefano Braghin, Alberto Trombetta

发表机构 * University of Insubria(因苏布里亚大学) IBM Research Ireland(IBM 爱尔兰研究院)

AI总结 研究大语言模型在结构化数据生成中上下文学习的局限性,发现其无法更新预训练中的类别先验分布,导致罕见类完全无法生成;参数高效微调可解决但带来记忆化风险。

详情
Comments
9 pages, 5 figures. Empirical study of in-context learning and LoRA fine-tuning for synthetic tabular data generation, introducing the phenomenon of categorical prior lock-in. Under review
AI中文摘要

大型语言模型(LLM)越来越多地被用作结构化数据的条件生成器,依赖上下文学习(ICL)来适应新分布而无需更新参数。我们以高基数表格数据作为受控测试案例,研究分布不匹配下ICL在结构化生成中的局限性,并识别出一种结构性失败模式,我们称之为“类别先验锁定”:ICL无法更新模型从预训练中继承的令牌分布先验。在两个70亿参数开源模型中,ICL随着示例增加提高了数值保真度,但在类别分布上表现出明显的天花板效应,完全无法复现罕见类。参数高效微调(LoRA)克服了这些限制,但引入了可测量的记忆化风险,并在某些情况下破坏了结构化输出生成的稳定性,凸显了适应性与隐私之间的基本权衡。

英文摘要

Large language models (LLMs) are increasingly used as conditional generators for structured data, relying on in-context learning (ICL) to adapt to new distributions without parameter updates. We investigate the limits of ICL for structured generation under distribution mismatch, using high-cardinality tabular data as a controlled test case, and identify a structural failure mode we term \textit{categorical prior lock-in}: the inability of ICL to update the model's prior over token distributions inherited from pre-training. Across two 7B-parameter open-weight models, ICL improves numerical fidelity with additional examples but exhibits a sharp ceiling on categorical distributions, failing to reproduce rare classes entirely. Parameter-efficient fine-tuning (LoRA) overcomes these limitations but introduces measurable memorization risk and, in some cases, destabilizes structured output generation, highlighting a fundamental trade-off between adaptability and privacy.
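
A ceiling on categorical fidelity and missing rare classes can be quantified with two simple measures: the total variation distance between real and synthetic category frequencies, and the fraction of rare real categories that appear at all in the synthetic data. These metrics, the 5% rarity threshold, and the function name are assumptions for illustration; the abstract does not say which measures the authors use.

```python
from collections import Counter

def category_fidelity(real, synthetic, rare_threshold=0.05):
    """Return (total variation distance, rare-class coverage) for one column."""
    real_freq = {c: n / len(real) for c, n in Counter(real).items()}
    syn_freq = {c: n / len(synthetic) for c, n in Counter(synthetic).items()}
    categories = set(real_freq) | set(syn_freq)
    tv = 0.5 * sum(abs(real_freq.get(c, 0.0) - syn_freq.get(c, 0.0)) for c in categories)
    rare = [c for c, f in real_freq.items() if f < rare_threshold]
    covered = sum(1 for c in rare if c in syn_freq)
    coverage = covered / len(rare) if rare else 1.0
    return tv, coverage

real = ["a"] * 96 + ["b"] * 4   # "b" is a rare class
synthetic = ["a"] * 100         # generator never emits "b"
tv, coverage = category_fidelity(real, synthetic)
```

Note that the distance stays small even though the rare class is lost completely, which is why a separate coverage number is worth tracking.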

2606.12182 2026-06-11 cs.LG math.DS math.OC 新提交

How Low Can You Go? Active Learning for Sparse Model Discovery in the Ultra-Low-Data Limit

你能低到多少?超低数据极限下稀疏模型发现的主动学习

Ana Larrañaga, Urban Fasel, Steven L. Brunton

发表机构 * Department of Mechanical Engineering, University of Washington(华盛顿大学机械工程系) NSF AI Institute in Dynamic Systems, University of Washington(华盛顿大学NSF动态系统人工智能研究所) Department of Aeronautics, Imperial College London(伦敦帝国理工学院航空系)

AI总结 针对超低数据极限下动力学系统方程发现的数据稀缺问题,提出基于E-SINDy的主动学习策略,通过迭代优先采样信息量大的区域,在Lorenz、Burgers和Kuramoto-Sivashinsky系统上验证了比随机采样更少数据即可准确识别动力学。

详情
Comments
20 pages, 10 figures
AI中文摘要

识别复杂动力系统的控制方程仍然是科学和工程中的一个基本挑战。虽然早期方法依赖于经验数据和启发式方法,但现代数据驱动方法提供了更大的灵活性和更少的假设。然而,在实际环境中获取数据通常成本高昂。本文通过引入一种主动学习策略来解决这一挑战,用于超低数据极限下的动力学发现。我们的方法不是随机采样,而是迭代地优先考虑对模型识别最有信息量的区域。该方法基于稀疏非线性动力学识别(SINDy),并利用集成扩展E-SINDy来估计认知不确定性并指导常微分方程和偏微分方程(ODEs/PDEs)的采样。对于ODEs,在Lorenz系统上进行了详尽的分析,考虑了不同的数据预算和噪声水平。对于PDEs,研究了两个具有对比动力学特性的系统:Burgers方程,其中尖锐的激波前沿区分了信息丰富和信息贫乏的区域;以及Kuramoto-Sivashinsky方程,它呈现出更复杂的空间采样景观。在所有场景中,所提出的方法都能以比随机采样显著更少的数据样本准确识别控制动力学。

英文摘要

Identifying the governing equations of complex dynamical systems remains a fundamental challenge across science and engineering. While early approaches relied on empirical data and heuristics, modern data-driven methods offer greater flexibility and fewer assumptions. However, data acquisition in real-world settings is often expensive. This work addresses this challenge by introducing an active learning strategy for dynamics discovery in the ultra-low-data limit. Rather than sampling randomly, our method iteratively prioritizes regions that are most informative for model identification. This approach builds on Sparse Identification of Nonlinear Dynamics (SINDy) and uses an ensemble extension, E-SINDy, to estimate epistemic uncertainty and guide the sampling for both ordinary and partial differential equations (ODEs/PDEs). For ODEs, an exhaustive analysis is conducted on the Lorenz system across varying data budgets and noise levels. For PDEs, two systems with contrasting dynamical characteristics are examined: Burgers' equation, where a sharp shock front creates a distinction between informative and uninformative regions, and the Kuramoto-Sivashinsky equation, which presents a more spatially complex sampling landscape. Across all scenarios, the proposed method accurately identifies the governing dynamics with significantly fewer data samples than random sampling.
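
The ensemble-guided sampling loop can be sketched on a scalar toy system. The sketch assumes a tiny polynomial library, bootstrap resampling of the data to build the ensemble, and prediction variance across ensemble members as the acquisition score; the paper's actual library, ensembling scheme, and acquisition function are not given in the abstract, and all function names are hypothetical.

```python
import numpy as np

def library(x):
    """Candidate terms [1, x, x^2] for a scalar state."""
    return np.column_stack([np.ones_like(x), x, x**2])

def stlsq(theta, dxdt, threshold=0.1, n_iter=10):
    """Sequentially thresholded least squares, the sparse regression behind SINDy."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(n_iter):
        small = np.abs(xi) < threshold
        xi[small] = 0.0
        big = ~small
        if big.any():
            xi[big] = np.linalg.lstsq(theta[:, big], dxdt, rcond=None)[0]
    return xi

def ensemble_fit(x, dxdt, n_models=20, rng=None):
    """Fit one sparse model per bootstrap resample of the data."""
    rng = np.random.default_rng() if rng is None else rng
    theta = library(x)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(x), size=len(x))  # sample with replacement
        models.append(stlsq(theta[idx], dxdt[idx]))
    return np.array(models)

def next_sample(models, candidates):
    """Pick the candidate state where the ensemble members disagree the most."""
    preds = library(candidates) @ models.T  # shape (n_candidates, n_models)
    return candidates[np.argmax(preds.var(axis=1))]

rng = np.random.default_rng(0)
x = np.array([-1.0, -0.6, -0.2, 0.3, 0.7, 1.0])
dxdt = -2.0 * x + rng.normal(scale=0.05, size=x.shape)  # noisy samples of dx/dt = -2x
models = ensemble_fit(x, dxdt, rng=rng)
candidates = np.linspace(-3.0, 3.0, 13)
x_next = next_sample(models, candidates)
```

Each round would measure the derivative at `x_next`, append it to the data, and refit, so the budget is spent where the identified model is least certain.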

2606.12344 2026-06-11 cs.LG cs.CL 新提交

Claw-SWE-Bench: A Benchmark for Evaluating OpenClaw-style Agent Harnesses on Coding Tasks

Claw-SWE-Bench:评估OpenClaw风格代理框架在编码任务上的基准

Mengyu Zheng, Kai Han, Boxun Li, Haiyang Xu, Yuchuan Tian, Wei He, Hang Zhou, Jianyuan Guo, Hailin Hu, Lin Ma, Chao Xu, Guohao Dai, Lixue Xia, Yunchao Wei, Yunhe Wang, Yu Wang

发表机构 * TokenRhythm Technologies(TokenRhythm 技术公司) Infinigence AI Peking University(北京大学) City University of Hong Kong(香港城市大学) SEE Fund(SEE 基金) Shanghai Jiaotong University(上海交通大学) Beijing Jiaotong University(北京交通大学) Tsinghua University(清华大学)

AI总结 提出Claw-SWE-Bench基准,通过适配器协议统一评估异构代理框架,发现适配器设计对编码性能至关重要,且模型和框架选择显著影响通过率与成本。

详情
AI中文摘要

通用代理(如OpenClaw)越来越多地被用作自主工具使用者,但其编码能力难以在SWE-bench下衡量:通用代理本身不满足评分所需的干净Docker工作区、补丁和预测合约。我们引入了Claw-SWE-Bench,一个多语言SWE-bench风格的基准和适配器协议,使异构代理框架(即claws)在公平设置下具有可比性,包括固定提示、运行时预算、工作区合约、补丁提取过程和评估器。完整基准包含8种语言、43个仓库的350个GitHub问题解决实例,这些实例来自SWE-bench-Multilingual和SWE-bench-Verified-Mini,经过未来提交清理。我们还发布了Claw-SWE-Bench Lite用于更快验证,这是一个通过成本感知、排名感知程序从17个校准列中选出的80个实例子集。在完整基准上,使用最小直接差异适配器的OpenClaw仅获得19.1%的Pass@1,而完整适配器在相同GLM 5.1骨干下达到73.4%,表明适配器设计对于使OpenClaw风格的框架有效执行编码任务至关重要。在OpenClaw × 9模型扫描和5框架 × 2模型扫描中,模型选择使Pass@1变化29.4个百分点,固定模型下框架选择变化27.4个百分点;精度相似的系统在总API成本上可能差异很大。因此,Claw-SWE-Bench将框架和成本核算视为SWE风格编码代理评估的第一类轴,提供了完整基准和低成本参考集,用于可重复比较。数据可在 this https URL 和 this https URL 获取。

英文摘要

General-purpose agents such as OpenClaw are increasingly used as autonomous tool users, but their coding ability is difficult to measure under SWE-bench: a generic agent does not by itself satisfy the clean Docker workspace, patch, and prediction contract required for scoring. We introduce Claw-SWE-Bench, a multilingual SWE-bench-style benchmark and adapter protocol that makes heterogeneous agent harnesses, or claws, comparable under fair settings including a fixed prompt, runtime budget, workspace contract, patch extraction procedure, and evaluator. The full benchmark contains 350 GitHub issue-resolution instances across 8 languages and 43 repositories, drawn from SWE-bench-Multilingual and SWE-bench-Verified-Mini after future-commit cleanup. We also release Claw-SWE-Bench Lite for faster validation, which is an 80-instance subset selected by a cost-aware, rank-aware procedure over 17 calibration columns. On the full benchmark, OpenClaw with a minimal direct-diff adapter scores only $19.1\%$ Pass@1, whereas the full adapter reaches $73.4\%$ with the same GLM 5.1 backbone, showing that adapter design is essential for enabling OpenClaw-style harnesses to perform coding tasks effectively. Across an OpenClaw $\times$ nine-model sweep and a five-claw $\times$ two-model sweep, model choice changes Pass@1 by $29.4$ pp and harness choice by $27.4$ pp under fixed models; systems with similar accuracy can differ substantially in total API cost. Claw-SWE-Bench therefore treats harness and cost accounting as first-class axes of SWE-style coding-agent evaluation, providing both a full benchmark and a low-cost reference set for reproducible comparison. The data is available at this https URL and this https URL.

2606.11196 2026-06-11 cs.CL cs.AI cs.CR cs.LG 交叉投稿

PoQ-Judge: A Multi-Architecture Evaluation Framework for Cost-Aware Proof-of-Quality in Decentralized LLM Inference

PoQ-Judge:去中心化LLM推理中成本感知的证明质量的多架构评估框架

Arther Tian, Alex Ding, Frank Chen, Simon Wu, Aaron Chan

发表机构 * DGrid AI

AI总结 提出PoQ-Judge框架,训练专用裁判模型对查询-输出对进行无参考评分,研究三种架构,最佳模型在Pearson相关性上达到0.747,级联评估降低72.7%成本。

详情
AI中文摘要

去中心化LLM推理网络需要轻量级、无参考的质量评估用于证明质量(PoQ)。我们提出PoQ-Judge,一个训练专用裁判模型对查询-输出对进行评分而无真实参考的框架。我们研究了三种架构在质量-成本权衡中的表现:TextCNN裁判、MiniLM交叉编码器和DeBERTa裁判。通过在UltraFeedback和GPT标记的领域内数据上进行两阶段训练,最佳模型在保留测试集上与真实代理的Pearson相关性达到0.747,优于先前工作中基于参考的评估器。作为复合评分中的无参考组件,它实现了0.645的Pearson相关性,匹配最佳单一基于参考的评估器,同时消除了对参考答案的需求。我们还表明,在线校准将语义质量识别为主导维度,级联评估将成本降低72.7%,仅带来适度的质量损失。结果在问答任务上比摘要任务强得多,表明代理质量是主要剩余限制。

英文摘要

Decentralized LLM inference networks need lightweight, reference-free quality evaluation for Proof of Quality (PoQ). We present PoQ-Judge, a framework that trains dedicated judge models to score query-output pairs without ground-truth references. We study three architectures across the quality-cost tradeoff: a TextCNN judge, a MiniLM cross-encoder, and a DeBERTa judge. Using two-stage training on UltraFeedback plus GPT-labeled in-domain data, the best model reaches 0.747 Pearson correlation with the ground-truth proxy on a held-out test set, outperforming reference-based evaluators from prior work. As a reference-free component in composite scoring, it achieves 0.645 Pearson correlation, matching the best single reference-based evaluator while removing the need for reference answers. We also show that online calibration identifies semantic quality as the dominant dimension and that cascade evaluation reduces cost by 72.7 percent with only modest quality loss. Results are much stronger on QA than summarization, pointing to proxy quality as the main remaining limitation.
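
Cascade evaluation can be sketched as "score everything with the cheap judge, escalate only the ambiguous cases to the strong judge". The ambiguity band (`low`, `high`), the relative costs, and the function names below are hypothetical; the abstract does not describe PoQ-Judge's escalation rule or cost model.

```python
def cascade_evaluate(items, cheap_judge, strong_judge,
                     low=0.3, high=0.7, cheap_cost=1.0, strong_cost=10.0):
    """Return (scores, cost saving relative to always using the strong judge)."""
    scores, total_cost = [], 0.0
    for item in items:
        score = cheap_judge(item)
        total_cost += cheap_cost
        if low < score < high:  # cheap judge is unsure: escalate
            score = strong_judge(item)
            total_cost += strong_cost
        scores.append(score)
    baseline = strong_cost * len(items)
    saving = 1.0 - total_cost / baseline if items else 0.0
    return scores, saving

# Toy judges: each item is represented directly by its cheap-judge score.
items = [0.1, 0.5, 0.9, 0.95]
scores, saving = cascade_evaluate(
    items,
    cheap_judge=lambda item: item,
    strong_judge=lambda item: 1.0 if item >= 0.5 else 0.0,
)
```

In this toy run only one of four items is escalated, so the cascade spends 14 cost units instead of 40. The saving depends entirely on how often the cheap judge lands in the ambiguity band.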

2606.11520 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

ISE: An Execution-Grounded Recipe for Multi-Turn OS-Agent Trajectories

ISE:一种基于执行的多轮操作系统代理轨迹合成方法

Siyuan Luo, Nairong Zheng, Lin Zhou, Tiankuo Yao, Shengyou Yuan, Haojia Yu, Cong Pang, Jiapeng Luo, Lewei Lu

AI总结 提出ISE三阶段范式,通过结构化意图构建、角色锁定用户模拟和真实执行环境,生成多轮代理轨迹,微调后显著提升代理工具使用性能。

详情
Comments
13 pages, 6 figures. Dataset and code: this https URL
AI中文摘要

训练有能力的操作系统代理需要同时捕获结构化用户意图、多轮任务委派和基于工具执行的数据——这些属性在现有数据集中缺失。我们提出ISE(意图->模拟->执行),一种三阶段合成范式,联合解决这些差距。阶段1通过4D框架(人物角色x领域x任务x复杂度)构建约50000个结构化意图;去重后池中包含43956个唯一意图,并在mpnet-base-v2嵌入(余弦核,q=1)上获得61.57的Vendi分数。阶段2通过角色锁定的用户模拟器驱动多轮用户-代理交互,将每轮用户交互基于实际执行结果,生成23132条完整轨迹,平均8.12轮用户交互和68.24轮总对话。阶段3在实时、隔离的操作系统工作空间中执行每个工具调用,生成真实的故障恢复动态而非模拟响应。在ISETrace上微调后,使用Qwen3-8B在标准协议下的代理工具使用任务中,ClawEval pass@1从19.3提升至37.7。该结果优于零样本GPT-4o和四倍大的Qwen3-32B基础模型。对阶段2的消融实验证明多轮模拟带来了大部分性能提升。我们在该https URL发布所有源代码和数据集。

英文摘要

Training capable OS agents requires data that simultaneously captures structured user intents, multi-turn task delegation, and grounded tool execution--properties absent from existing datasets. We propose ISE (Intent -> Simulate -> Execute), a three-stage synthesis paradigm that addresses these gaps jointly. Stage 1 constructs roughly 50000 structured intents via a 4D framework (Persona x Domain x Task x Complexity); after deduplication the pool contains 43956 unique intents and attains a Vendi Score of 61.57 over the entire pool on mpnet-base-v2 embeddings (cosine kernel, q=1). Stage 2 drives multi-turn user-agent interaction through a role-locked user simulator that grounds each user turn in actual execution outcomes, producing 23132 complete trajectories averaging 8.12 user turns and 68.24 total dialogue turns. Stage 3 runs every tool call inside a live, isolated OS workspace, generating authentic failure-recovery dynamics instead of simulated responses. Fine-tuning Qwen3-8B on ISETrace improves ClawEval pass@1 on agent tool-use tasks from 19.3 to 37.7 under a standard protocol. This result outperforms both zero-shot GPT-4o and the Qwen3-32B base model, which is four times larger. An ablation on Stage 2 shows that multi-turn simulation accounts for a large portion of the performance gain. We release all source code and the dataset at this https URL.
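The abstract reports a Vendi Score with a cosine kernel and q=1. As a rough illustration of what that quantity measures, the sketch below computes the order-1 Vendi Score as the exponential of the Shannon entropy of the eigenvalues of the normalized similarity matrix K/n. The function name `vendi_score_q1` and the toy embeddings are assumptions made for illustration; this is not the authors' released code, and it does not use mpnet-base-v2.

```python
import numpy as np


def vendi_score_q1(embeddings: np.ndarray) -> float:
    """Order-1 Vendi Score with a cosine kernel (illustrative sketch)."""
    # L2-normalize rows so the Gram matrix holds cosine similarities.
    x = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    k = x @ x.T
    n = k.shape[0]
    # K has a unit diagonal, so the eigenvalues of K/n sum to 1.
    eigvals = np.linalg.eigvalsh(k / n)
    # Drop numerical zeros, using the convention 0 * log(0) = 0.
    eigvals = eigvals[eigvals > 1e-12]
    entropy = -np.sum(eigvals * np.log(eigvals))
    return float(np.exp(entropy))


identical = np.ones((4, 3))   # four copies of the same vector
orthogonal = np.eye(4)        # four mutually orthogonal vectors
print(vendi_score_q1(identical), vendi_score_q1(orthogonal))
```

The score can be read as an effective number of distinct items: the four identical toy vectors score about 1, and the four orthogonal ones score about 4.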

2606.11534 2026-06-11 physics.ao-ph cs.LG 交叉投稿

Urban Heat MiniCubes: An AI-Ready dataset for urban heat research

城市热微型数据立方体:面向城市热研究的人工智能就绪数据集

Jonathan Starfeldt, Maria J. Molina, Alexander Kerr, Adam Yang, Thomas R.H. Holmes, Christopher R. Hain

AI总结 提出Urban Heat MiniCubes数据集,整合多源卫星数据(Landsat 8/9、Sentinel-1、GOES-R等),为48个城市提供90×90公里网格化数据立方体,支持机器学习在城市热研究中的应用。

详情
Comments
53 pages, 26 figures, Submitted to Nature Scientific Data
AI中文摘要

城市热效应因不透水表面和异质建筑环境而加剧,但街道尺度的变异性仍难以量化,因为多传感器观测很少以一致、分析就绪的形式在必要的时空尺度上可用。我们提出了“Urban Heat MiniCubes”,一个公开可用、符合FAIR原则的数据集,专为城市热研究中的机器学习应用而设计。该数据集提供了西半球48个城市在2022-2023年间的统一90×90公里网格化数据立方体,变量被重新投影并配准到公共网格,以减少预处理(例如,重投影、重采样和时空对齐)。Urban Heat MiniCubes包括两种互补模态:(i)来自Landsat 8/9(例如,地表反射率)和Sentinel-1(例如,合成孔径雷达后向散射)的高空间分辨率、低频观测,以及(ii)来自GOES-R(例如,长波红外亮温)和微波地表温度产品的更高时间频率、较粗分辨率观测。我们记录了变量和元数据,并通过变量间分析和基于自编码器的像素类别(例如,水和云)重建误差总结提供了技术评估。还讨论了潜在用例和局限性。

英文摘要

Urban heat is amplified by impermeable surfaces and heterogeneous built environments, yet street-level variability remains difficult to quantify because multi-sensor observations are rarely available in consistent, analysis-ready form at the necessary spatiotemporal scales. We present "Urban Heat MiniCubes," a publicly available, FAIR-oriented dataset designed for machine learning applications in urban heat research. The dataset provides harmonized 90 x 90 km gridded data cubes for 48 cities in the Western Hemisphere spanning 2022-2023, with variables reprojected and collocated to a common grid to reduce preprocessing (e.g., reprojection, resampling, and spatiotemporal alignment). Urban Heat MiniCubes includes two complementary modalities: (i) higher-spatial-resolution, lower-frequency observations from Landsat 8/9 (e.g., surface reflectances) and Sentinel-1 (e.g., synthetic aperture radar backscatter), and (ii) higher-temporal-frequency, coarser observations from GOES-R (e.g., longwave infrared brightness temperatures) and a microwave land surface temperature product. We document variables and metadata and provide technical assessment using inter-variable analyses and autoencoder-based reconstruction-error summaries across pixel classes (e.g., water and cloud). Potential use cases and limitations are also discussed.

2606.11911 2026-06-11 stat.ML cs.LG math.AT 交叉投稿

From Persistence to Survival: Hypothesis Testing, Effect Sizes and Vectorisation for Topological Features

从持续性到生存:拓扑特征的假设检验、效应大小与向量化

Juliette Murris, Bernadette Stolz, Karsten Borgwardt

AI总结 提出STRAND方法,将持久性图视为生存数据,利用持久性生存函数统一实现假设检验、效应大小计算和向量化,在合成数据和真实基准上验证了有效性。

详情
AI中文摘要

持久性图是拓扑数据分析中常见的表示形式,但它们并非天然存在于向量空间中,且用于比较它们的统计工具在很大程度上与用于下游预测的工具分开发展。我们引入STRAND(生存拓扑表示图分析),将(集合的)持久性图视为生存数据:每个具有持久性值 $p = d - b$ 的拓扑特征是一个完全观测的事件时间,持久性生存函数 $S(t) = \mathbb{P}(p > t)$ 是比较图的中心对象。从这个单一表示中,我们推导出(i)一个非参数双样本检验,具有校准的第一类错误率和少量图的高功效;(ii)可解释的效应大小;以及(iii)用于下游机器学习的1-Wasserstein稳定特征向量。我们在具有受控拓扑的合成流形上验证了校准和功效,展示了在14个图和3D点云基准上的竞争性向量化,并将该方法应用于fMRI/神经科学数据中的功能性脑连接研究。据我们所知,STRAND是第一个从单一连贯且可解释的表示为持久性图提供假设检验和向量化的方法。

英文摘要

Persistence diagrams are common representations in topological data analysis, but they do not naturally live in a vector space, and the statistical tools developed for comparing them have largely evolved separately from those used for downstream prediction. We introduce STRAND (Survival Topological Representation ANalysis of Diagrams), which treats (collections of) persistence diagrams (PDs) as survival data: each topological feature with persistence value $p = d - b$ is a fully observed time-to-event, and the persistence survival function $S(t) = \mathbb{P}(p > t)$ is the central object for comparing diagrams. From this single representation we derive (i) a non-parametric two-sample test with calibrated Type I error and high power from a small number of diagrams; (ii) interpretable effect sizes; and (iii) a 1-Wasserstein-stable feature vector for downstream machine learning. We validate calibration and power on synthetic manifolds with controlled topology, demonstrate competitive vectorisation across 14 graph and 3D point cloud benchmarks, and apply the method to study functional brain connectivity in fMRI/neuroscience data. To our knowledge, STRAND is the first method to provide hypothesis testing and vectorisation for persistence diagrams from a single coherent and interpretable representation.
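The persistence survival function $S(t) = \mathbb{P}(p > t)$ defined above has a simple empirical counterpart: the fraction of features in a diagram whose persistence $d - b$ exceeds $t$. The sketch below computes it and evaluates it on a fixed grid to obtain a fixed-length vector. The function names, the grid, and the toy diagram are assumptions for illustration; the two-sample test and the paper's own vectorisation are not reproduced here.

```python
from typing import List, Sequence, Tuple


def persistence_survival(diagram: Sequence[Tuple[float, float]], t: float) -> float:
    """Empirical S(t): fraction of features whose persistence d - b exceeds t."""
    if not diagram:
        raise ValueError("diagram must contain at least one (birth, death) pair")
    persistences = [death - birth for birth, death in diagram]
    return sum(p > t for p in persistences) / len(persistences)


def survival_vector(diagram: Sequence[Tuple[float, float]],
                    grid: Sequence[float]) -> List[float]:
    """Evaluate S(t) on a fixed grid to get a fixed-length feature vector."""
    return [persistence_survival(diagram, t) for t in grid]


# Toy diagram with persistences 0.5, 1.0, 2.0 and 3.0.
toy_diagram = [(0.0, 0.5), (1.0, 2.0), (0.5, 2.5), (0.0, 3.0)]
print(survival_vector(toy_diagram, [0.0, 0.75, 1.5, 2.5, 3.5]))
# -> [1.0, 0.75, 0.5, 0.25, 0.0]
```

Because every feature has a fully observed persistence, no censoring correction is needed and the empirical survival curve is just one minus the empirical distribution function.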

2606.11925 2026-06-11 cs.CV cs.LG 交叉投稿

Corpus Augmentation for Sign Language Translation via LLM-Guided Video Stitching

通过LLM引导的视频拼接进行手语翻译的语料增强

Zsolt Robotka, Ádám Rák, Jalal Al-Afandi, András Horváth, György Cserey

发表机构 * Peter Pazmany Catholic University, Faculty of Information Technology and Bionics(彼得·帕兹马尼天主教大学信息科技与仿生学院) DeepSign Technologies Ltd.(DeepSign科技有限公司)

AI总结 提出一种无需额外标注或生成模型的手语翻译语料增强方法,利用CTC强制对齐提取手语片段,通过LLM生成句子并拼接视频,在GFSLT-VLP基线上提升BLEU-4达2.92,并发现合成数据对视觉-语言预训练有害但可提升下游任务。

详情
AI中文摘要

手语翻译(SLT)将手语视频转换为口语文本,对于改善无障碍交流以及促进手语与非手语社区之间的沟通具有重要前景。虽然大规模弱对齐数据集实现了规模化预训练,且无词汇表方法减少了对专家标注的依赖,但用于微调的高质量平行手语视频-文本对仍然稀缺,限制了长尾词汇和未见结构的泛化。我们提出一种语料增强方法,无需额外人工标注、外部手语视频语料库或生成式视频模型,仅依赖现有的词汇表标注训练语料和用于句子生成的LLM:通过CTC强制对齐从训练视频中提取每个手语词汇的片段,由语料锚定的LLM生成新的词汇-句子对,通过随机句子采样和片段分配组装合成序列。得到的合成RGB视频-文本对在下游训练阶段与架构无关,可直接被基于RGB的SLT模型使用,或通过从视频提取输入的流水线转换为姿态或特征表示。Sincan等人在严格相同条件下重新评估了五种近期无词汇表方法;在GFSLT-VLP基线上验证的最大增益仅为0.98 BLEU-4。我们的增强方法在同一框架内应用,无需改变架构或训练协议,实现了+2.92 BLEU-4。我们进一步发现,合成数据虽然改善了视觉-语言预训练的目标,但对其有害;并且基于L2准则优化片段过渡以实现视觉平滑适得其反;我们提出,突兀的边界可能作为一种隐式正则化形式。代码可在此 https URL 获取。

英文摘要

Sign language translation (SLT) converts sign language video into spoken language text and holds significant promise for improving accessibility and enabling communication between signing and non-signing communities. While large weakly-aligned datasets have enabled pre-training at scale and gloss-free methods have reduced reliance on expert annotation, high-quality parallel sign video-text pairs for fine-tuning remain scarce, limiting generalisation on long-tail vocabulary and unseen constructions. We propose a corpus augmentation approach that requires no additional human annotation, external sign-language video corpora, or generative video models, relying only on the existing gloss-annotated training corpus and an LLM for sentence generation: per-gloss clips are extracted from training videos via CTC forced-alignment, novel gloss-sentence pairs are generated by a corpus-anchored LLM, and synthetic sequences are assembled through random sentence sampling and clip assignment. The resulting synthetic RGB video-text pairs are architecture-agnostic at the downstream training stage and can be consumed directly by RGB-based SLT models, or converted into pose or feature representations by pipelines that derive such inputs from video. Sincan et al. re-evaluated five recent gloss-free methods under strictly identical conditions; the largest verified gain over the GFSLT-VLP baseline was only 0.98 BLEU-4. Our augmentation, applied within the same framework, achieves +2.92 BLEU-4 without any change to architecture or training protocol. We further identify that synthetic data harms vision-language pretraining despite improving its objectives, and that optimising clip transitions for visual smoothness is counter-productive under L2-based criteria; we propose that abrupt boundaries may act as a form of implicit regularisation. Code is available at this https URL.

2606.12169 2026-06-11 cs.CV cs.AI cs.CL cs.LG 交叉投稿

OpenMedReason: Scientific Reasoning Supervision for Medical Vision-Language Models

OpenMedReason: 医学视觉语言模型的科学推理监督

Negin Baghbanzadeh, Pritam Sarkar, Michael Colacci, Abeer Badawi, Adibvafa Fallahpour, Arash Afkanpour, Leonid Sigal, Ali Etemad, Elham Dolatabadi

发表机构 * York University(约克大学) Vector Institute(向量研究所) University of British Columbia(不列颠哥伦比亚大学) University of Toronto(多伦多大学) Unity Health Toronto / St. Michael’s Hospital(多伦多联合健康/圣迈克尔医院) University Health Network(大学健康网络) Arc Institute(弧研究所) Queen's University(女王大学)

AI总结 提出OpenMedReason,一个包含约45万图像-问题-答案实例的大规模开放医学推理语料库,其推理轨迹主要来自生物医学科学文章,并配套基准OpenMedReason-Bench进行细粒度评估,在监督微调和强化对齐中有效提升模型性能。

详情
Comments
42 pages, 9 figures, 24 tables. Dataset and code: this https URL
AI中文摘要

高风险临床使用大型视觉语言模型(LVLMs)需要基于视觉证据和临床知识的推理,而不仅仅是正确的最终答案。我们引入了OpenMedReason,这是一个大规模、开放的多模态医学推理语料库,包含约45万图像-问题-答案实例,其推理轨迹主要来自策划的生物医学、人类撰写的科学文章。OpenMedReason提供了超越合成思维链的高保真监督,涵盖了多种医学领域视觉模态,如放射学扫描、显微图像、可见光照片、图表等。我们辅以OpenMedReason-Bench,这是一个留出基准,允许沿三个互补的能力轴(包括感知、医学知识和推理)对LVLMs进行细粒度评估,从而实现超越最终答案准确性的诊断性评估。OpenMedReason是一个丰富的训练资源,在监督微调(SFT)和基于强化的对齐中均显示出有效性。使用OpenMedReason进行训练,在VQA准确率上比基础模型平均提高20%,并且性能达到最强可比规模医学LVLMs的4.2%以内。细粒度性能分析证实,增益并非集中在单一轴上:OpenMedReason共同提升了感知、医学知识和推理,并且在86.1%的成对比较中,其推理轨迹优于基础模型。我们在以下网址发布代码和数据集:此 http URL。

英文摘要

High-stakes clinical use of large vision-language models (LVLMs) requires reasoning that is grounded in visual evidence and clinical knowledge, not just correct final answers. We introduce OpenMedReason, a large-scale, open multimodal medical reasoning corpus comprising approximately 450K image-question-answer instances whose reasoning traces are primarily derived from curated, human-authored biomedical scientific articles. OpenMedReason provides high-fidelity supervision beyond synthetic chains of thought, covering diverse visual modalities in the medical domain, such as radiological scans, microscopic images, visible-light photographs, charts, and others. We complement it with OpenMedReason-Bench, a held-out benchmark that allows fine-grained evaluation of LVLMs along three complementary capability axes (perception, medical knowledge, and rationale), enabling diagnostic evaluation beyond final-answer accuracy. OpenMedReason is a rich training resource that proves effective in both supervised fine-tuning (SFT) and reinforcement-based alignment. Training with OpenMedReason yields a 20% average improvement in VQA accuracy over the base model and achieves performance within 4.2% of the strongest comparable-scale medical LVLMs. Fine-grained performance analysis confirms that the gains are not concentrated in any single axis: OpenMedReason improves perception, medical knowledge, and rationale jointly, and its reasoning traces are preferred over those of the base model in 86.1% of pairwise comparisons. We release the code and dataset at this http URL.

2606.12332 2026-06-11 cs.CL cs.LG 交叉投稿

Measuring Semantic Progress in Multi-turn Dialogue via Information Gain

通过信息增益衡量多轮对话中的语义进展

Paul He, Shiva Kasiviswanathan, Dominik Janzing

发表机构 * NTU Singapore(新加坡南洋理工大学) Amazon(亚马逊) Amazon Research, Tübingen, Germany(亚马逊研究院(德国图宾根))

AI总结 提出基于信息论的信息增益指标,通过高斯嵌入近似量化多轮对话中问题相关的语义进展,无需LLM推理,在多个基准上取得与人类判断一致的结果。

详情
Comments
Preprint. 26 pages
AI中文摘要

评估多轮对话具有挑战性,因为质量体现在多轮之间而非单个回复。我们关注信息寻求对话的一个关键维度:语义进展,定义为对话过程中新、与问题相关且非冗余信息的累积。我们将语义进展形式化为基于问题的不确定性减少,并引入一个在嵌入空间中近似它的信息论指标。我们的主要估计器使用具有闭式更新的易处理高斯公式,而互补的最大熵论证表明,当仅保留二阶嵌入信息时,对数行列式结构更广泛地出现。该公式产生了理想的理论性质,包括单调性、跨轮次总信息增益的可加分解以及冗余证据的递减回报。与LLM作为评判者的方法不同,我们的指标在评估时不需要自回归推理,并且对于固定的嵌入模型完全可复现。在MT-Bench、Chatbot Arena和UltraFeedback上的实验表明,尽管仅针对语义进展,所提出的指标与人类判断的一致性具有竞争力,在MT-Bench和UltraFeedback上相比几个基于LLM的评判者具有更好的对齐。值得注意的是,该方法在仅CPU执行下使用轻量级嵌入模型仍然有效,表明语义进展可以在不依赖大模型能力的情况下被捕获。

英文摘要

Evaluating multi-turn dialogue is challenging because quality emerges across turns rather than within individual responses. We focus on a key dimension of information-seeking dialogue: semantic progress, defined as the accumulation of new, question-relevant, and non-redundant information over the course of a conversation. We formalize semantic progress as question-conditioned uncertainty reduction and introduce an information-theoretic metric that approximates it in embedding space. Our main estimator uses a tractable Gaussian formulation with closed-form updates, while a complementary maximum-entropy argument shows why log-determinant structure arises more broadly when only second-order embedding information is retained. This formulation yields desirable theoretical properties, including monotonicity, additive decomposition of total information gain across turns, and diminishing returns for redundant evidence. Unlike LLM-as-a-judge approaches, our metric requires no autoregressive inference at evaluation time and is fully reproducible for a fixed embedding model. Experiments on MT-Bench, Chatbot Arena, and UltraFeedback show that the proposed metric achieves competitive agreement with human judgments despite targeting only semantic progress, with improved alignment on MT-Bench and UltraFeedback compared to several LLM-based judges. Notably, the method remains effective with lightweight embedding models under CPU-only execution, indicating that semantic progress can be captured without reliance on large model capacity.
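The properties listed above (monotonicity, additivity across turns, diminishing returns for redundant evidence) can be seen in a small Gaussian log-determinant sketch. It assumes an isotropic unit prior over the question-relevant quantity and treats each turn's embedding as a noisy linear observation, so each turn adds a rank-one term to the precision matrix. These modelling choices, the function name `turn_information_gains`, and the toy embeddings are assumptions for illustration and are not taken from the paper's estimator.

```python
import numpy as np


def turn_information_gains(turn_embeddings: np.ndarray, noise_var: float = 1.0) -> list:
    """Per-turn gain 0.5 * [logdet(P_new) - logdet(P_old)] for a Gaussian belief."""
    dim = turn_embeddings.shape[1]
    precision = np.eye(dim)  # assumed isotropic unit prior
    gains = []
    for e in turn_embeddings:
        # Rank-one evidence update contributed by this turn.
        updated = precision + np.outer(e, e) / noise_var
        _, logdet_old = np.linalg.slogdet(precision)
        _, logdet_new = np.linalg.slogdet(updated)
        gains.append(0.5 * (logdet_new - logdet_old))
        precision = updated
    return gains


e1 = np.array([1.0, 0.0, 0.0])
e2 = np.array([0.0, 1.0, 0.0])
# Turn 2 repeats turn 1; turn 3 brings a new direction.
gains = turn_information_gains(np.stack([e1, e1, e2]))
print(gains)
```

In this toy run the repeated second turn gains less than the first, the third turn (a new direction) gains as much as the first, and the per-turn gains sum to the total log-determinant change.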

2511.07332 2026-06-11 cs.LG cs.AI 版本更新

Grounding Computer Use Agents on Human Demonstrations

基于人类演示的计算机使用智能体基础构建

Aarash Feizi, Shravan Nayak, Xiangru Jian, Kevin Qinghong Lin, Kaixin Li, Rabiul Awal, Xing Han Lù, Johan Obando-Ceron, Juan A. Rodriguez, Nicolas Chapados, David Vazquez, Adriana Romero-Soriano, Reihaneh Rabbany, Perouz Taslakian, Christopher Pal, Spandana Gella, Sai Rajeswar

AI总结 为解决桌面环境高质量基础数据稀缺问题,构建了包含87个应用、56K截图和3.56M人工标注的GroundCUA数据集,并基于此训练GroundNext模型,在5个基准上以少于先前十分之一的数据取得最优结果。

详情
Comments
Accepted at ICLR 2026
AI中文摘要

构建可靠的计算机使用智能体需要基础构建:将自然语言指令准确连接到正确的屏幕元素。尽管存在大量用于网络和移动交互的数据集,但桌面环境的高质量资源有限。为填补这一空白,我们引入了GroundCUA,一个基于专家人类演示构建的大规模桌面基础数据集。它涵盖12个类别的87个应用,包含56K张截图,每个屏幕元素都经过仔细标注,总计超过3.56M个人工验证标注。从这些演示中,我们生成了多样的指令,覆盖广泛的实际任务,为模型训练提供高质量数据。利用GroundCUA,我们开发了GroundNext系列模型,将指令映射到目标UI元素。在3B和7B规模上,GroundNext通过监督微调在五个基准上取得了最先进的结果,同时所需训练数据不到先前工作的十分之一。强化学习后训练进一步提升了性能,在OSWorld基准上使用o3作为规划器的智能体评估中,GroundNext取得了与使用更多数据训练的模型相当或更优的结果。这些结果证明了高质量、专家驱动数据集在推进通用计算机使用智能体中的关键作用。

英文摘要

Building reliable computer-use agents requires grounding: accurately connecting natural language instructions to the correct on-screen elements. While large datasets exist for web and mobile interactions, high-quality resources for desktop environments are limited. To address this gap, we introduce GroundCUA, a large-scale desktop grounding dataset built from expert human demonstrations. It covers 87 applications across 12 categories and includes 56K screenshots, with every on-screen element carefully annotated for a total of over 3.56M human-verified annotations. From these demonstrations, we generate diverse instructions that capture a wide range of real-world tasks, providing high-quality data for model training. Using GroundCUA, we develop the GroundNext family of models that map instructions to their target UI elements. At both 3B and 7B scales, GroundNext achieves state-of-the-art results across five benchmarks using supervised fine-tuning, while requiring less than one-tenth the training data of prior work. Reinforcement learning post-training further improves performance, and when evaluated in an agentic setting on the OSWorld benchmark using o3 as planner, GroundNext attains comparable or superior results to models trained with substantially more data. These results demonstrate the critical role of high-quality, expert-driven datasets in advancing general-purpose computer-use agents.

2601.11670 2026-06-11 cs.LG cs.AI 版本更新

CoVar: Confidence-Variance-Guided Pseudo-Label Selection for Semi-Supervised Learning

CoVar: 置信度-方差引导的半监督学习伪标签选择

Jinshi Liu, Lei He, Pan Liu

AI总结 提出CoVar框架,通过联合建模最大置信度和残差类方差来评估伪标签可靠性,利用SVD谱松弛分离可靠与不可靠预测,无需手动阈值,在分割和分类任务上取得提升。

详情
AI中文摘要

半监督学习中的伪标签选择通常由最大置信度阈值驱动,然而在模型过度自信和类别不平衡下,仅靠置信度可能不可靠。我们提出CoVar,一个置信度-方差框架,通过联合建模最大置信度(MC)和残差类方差(RCV)来评估伪标签可靠性。从熵最小化出发,我们推导出二阶交叉熵近似,表明当MC高且RCV低时,低损失伪标签更受青睐,并带有置信度依赖的惩罚项,该惩罚项对接近确定的预测更强。基于此准则,CoVar将预测嵌入二维置信度-方差空间,并使用基于SVD的谱松弛来分离可靠和不可靠的预测,无需手动调整置信度阈值。然后,聚类加权高斯函数将此分离转换为每个样本的训练权重。所得权重可在训练期间集成到现有的半监督分割和分类流程中,且不引入推理开销。在PASCAL VOC 2012、Cityscapes、CIFAR-10、CIFAR-100、SVHN和STL-10上的实验表明,在匹配骨干网络下,VOC和Cityscapes上取得明显提升,并在标准分类基准上达到竞争性或更低的错误率。这些结果表明,残差类离散度为鲁棒伪标签选择提供了置信度之外的补充信号。

英文摘要

Pseudo-label selection in semi-supervised learning is commonly driven by maximum-confidence thresholds, yet confidence alone can be unreliable under model overconfidence and class imbalance. We propose CoVar, a confidence--variance framework that assesses pseudo-label reliability by jointly modeling Maximum Confidence (MC) and Residual-Class Variance (RCV). Starting from entropy minimization, we derive a second-order cross-entropy approximation showing that low-loss pseudo-labels are favored when MC is high and RCV is low, with a confidence-dependent penalty that becomes stronger for near-certain predictions. Based on this criterion, CoVar embeds predictions into a two-dimensional confidence--variance space and uses SVD-based spectral relaxation to separate reliable and unreliable predictions without hand-tuned confidence thresholds. Cluster-wise Gaussian weighting then converts this separation into per-sample training weights. The resulting weights can be integrated into existing semi-supervised segmentation and classification pipelines during training and introduce no inference-time overhead. Experiments on PASCAL VOC 2012, Cityscapes, CIFAR-10, CIFAR-100, SVHN, and STL-10 show clear gains on VOC and Cityscapes under matched backbones, as well as competitive or improved error rates on standard classification benchmarks. These results indicate that residual-class dispersion provides a useful signal complementary to confidence for robust pseudo-label selection.
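To make the two coordinates of the confidence-variance space concrete, the sketch below computes Maximum Confidence (MC) and a Residual-Class Variance (RCV) for a single softmax vector. Taking RCV as the population variance of the non-maximum class probabilities is an assumption made here for illustration; the abstract does not give the exact formula, and the SVD-based separation and Gaussian weighting steps are not reproduced. The function name is hypothetical.

```python
from typing import Sequence, Tuple


def confidence_variance(probs: Sequence[float]) -> Tuple[float, float]:
    """Return (MC, RCV) for one softmax vector.

    MC is the largest class probability. RCV is assumed here to be the
    population variance of the remaining (non-max) class probabilities.
    """
    if len(probs) < 2:
        raise ValueError("need at least two classes")
    top = max(range(len(probs)), key=lambda i: probs[i])
    mc = probs[top]
    residual = [p for i, p in enumerate(probs) if i != top]
    mean = sum(residual) / len(residual)
    rcv = sum((p - mean) ** 2 for p in residual) / len(residual)
    return mc, rcv


for probs in ([0.90, 0.05, 0.05], [0.60, 0.40, 0.00]):
    print(probs, confidence_variance(probs))
```

The first toy prediction has high MC and zero RCV, while the second has lower MC and its residual mass concentrated on one competing class, so under the criterion described above it would be treated as the less reliable pseudo-label.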

2602.02229 2026-06-11 cs.LG eess.SP 版本更新

Prediction-Powered Risk Monitoring of Deployed Models for Detecting Harmful Distribution Shifts

预测驱动的已部署模型风险监控:检测有害分布漂移

Guangyi Zhang, Yunlong Cai, Guanding Yu, Osvaldo Simeone

AI总结 提出预测驱动风险监控(PPRM),一种基于预测驱动推断的半监督方法,通过结合合成标签与少量真实标签构建运行风险的随时有效下界,实现对有害漂移的检测,并在图像分类、大语言模型和电信监控任务中验证有效性。

详情
Comments
Accepted by ICML2026
AI中文摘要

我们研究了在动态环境中模型性能监控的问题,其中标记数据有限。为此,我们提出了预测驱动风险监控(PPRM),一种基于预测驱动推断(PPI)的半监督风险监控方法。PPRM通过结合合成标签与少量真实标签,构建运行风险的随时有效下界。通过基于阈值的比较与名义风险的上界,检测有害漂移,满足无假设的有限样本I型误差保证。我们通过在图像分类、大语言模型(LLM)和电信监控任务上的大量实验,证明了PPRM的有效性。

英文摘要

We study the problem of monitoring model performance in dynamic environments where labeled data are limited. To this end, we propose prediction-powered risk monitoring (PPRM), a semi-supervised risk-monitoring approach based on prediction-powered inference (PPI). PPRM constructs anytime-valid lower bounds on the running risk by combining synthetic labels with a small set of true labels. Harmful shifts are detected via a threshold-based comparison with an upper bound on the nominal risk, satisfying assumption-free finite-sample guarantees on the type-I error. We demonstrate the effectiveness of PPRM through extensive experiments on image classification, large language model (LLM), and telecommunications monitoring tasks.
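The idea of combining many synthetic labels with a few true labels can be illustrated with the basic prediction-powered point estimate of a mean risk: the average loss under synthetic labels, plus a correction (the rectifier) measured on the small labeled set. The sketch below shows only this point estimate; the anytime-valid bounds and the detection threshold of PPRM are not reproduced, and the function name and toy numbers are assumptions for illustration.

```python
from statistics import fmean
from typing import Sequence


def ppi_risk_estimate(synthetic_losses_unlabeled: Sequence[float],
                      synthetic_losses_labeled: Sequence[float],
                      true_losses_labeled: Sequence[float]) -> float:
    """Prediction-powered estimate of the mean risk (point estimate only)."""
    if len(synthetic_losses_labeled) != len(true_losses_labeled):
        raise ValueError("labeled synthetic and true losses must be paired")
    # Rectifier: average gap between true and synthetic losses on labeled data.
    rectifier = fmean([t - s for t, s in
                       zip(true_losses_labeled, synthetic_losses_labeled)])
    return fmean(synthetic_losses_unlabeled) + rectifier


estimate = ppi_risk_estimate(
    synthetic_losses_unlabeled=[0.1, 0.2, 0.3, 0.2],  # many cheap synthetic labels
    synthetic_losses_labeled=[0.1, 0.3],              # same proxy on labeled points
    true_losses_labeled=[0.2, 0.5],                   # few true labels
)
print(round(estimate, 3))
```

Here the synthetic labels alone suggest a risk of 0.2, and the labeled pairs show the proxy underestimates the loss by 0.15 on average, giving a corrected estimate of about 0.35.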

2602.08986 2026-06-11 cs.LG cs.AI 版本更新

Improving Detection of Rare Nodes in Hierarchical Multi-Label Learning

改进分层多标签学习中稀有节点的检测

Isaac Xu, Martin Gillis, Ayushi Sharma, Benjamin Misiuk, Craig J. Brown, Thomas Trappenberg

AI总结 针对分层多标签分类中稀有节点检测困难的问题,提出结合节点不平衡加权和焦点加权的损失函数,利用集成不确定性量化,在基准数据集上将召回率提升至五倍,并显著提高F1分数。

详情
Comments
Accepted for publication in Transactions on Machine Learning Research (TMLR), 2026
AI中文摘要

在分层多标签分类中,一个持续的挑战是使模型预测能够达到层次结构的更深层次,以实现更详细或更细粒度的分类。这一困难部分源于某些类别(或层次节点)的自然稀有性,以及确保子节点几乎总是比其父节点频率更低的分层约束。为了解决这个问题,我们为神经网络提出了一种加权损失目标,该目标结合了节点不平衡加权和焦点加权组件,后者利用了集成不确定性的现代量化。通过强调稀有节点而非稀有观测(数据点),并在训练过程中关注每个模型输出分布中的不确定节点,我们观察到在基准数据集上召回率提高了高达五倍,并且$F_{1}$分数有统计显著的提升。我们还展示了我们的方法有助于卷积网络处理具有挑战性的任务,例如在编码器次优或数据有限的情况下。

英文摘要

In hierarchical multi-label classification, a persistent challenge is enabling model predictions to reach deeper levels of the hierarchy for more detailed or fine-grained classifications. This difficulty partly arises from the natural rarity of certain classes (or hierarchical nodes) and the hierarchical constraint that ensures child nodes are almost always less frequent than their parents. To address this, we propose a weighted loss objective for neural networks that combines node-wise imbalance weighting with focal weighting components, the latter leveraging modern quantification of ensemble uncertainties. By emphasizing rare nodes rather than rare observations (data points), and focusing on uncertain nodes for each model output distribution during training, we observe improvements in recall by up to a factor of five on benchmark datasets, along with statistically significant gains in $F_{1}$ score. We also show our approach aids convolutional networks on challenging tasks, as in situations with suboptimal encoders or limited data.

2602.22962 2026-06-11 cs.LG 版本更新

Scaling Laws of Global Weather Models

全球天气模型的缩放定律

Yuejiang Yu, Langwen Huang, Alexandru Calotoiu, Torsten Hoefler

AI总结 本文分析数据驱动天气模型中模型大小、数据集大小和计算预算与验证损失之间的缩放定律,发现Aurora数据缩放最强,GraphCast参数效率高但硬件利用率低,计算最优分析表明增加训练数据比增大模型更有效,且模型形状上宽度优于深度。

详情
Comments
Accepted at ICML 2026. 21 pages, 7 figures
AI中文摘要

数据驱动模型正在彻底改变天气预报。为了优化训练效率和模型性能,本文分析了该领域内的经验缩放定律。我们研究了模型性能(验证损失)与三个关键因素:模型大小($N$)、数据集大小($D$)和计算预算($C$)之间的关系。在一系列模型中,我们发现Aurora表现出最强的数据缩放行为:将训练数据集增加10倍可使验证损失降低多达3.2倍。GraphCast展示了最高的参数效率,但硬件利用率有限。我们的计算最优分析表明,在固定计算预算下,将资源分配给更多的总训练数据比增加模型大小能带来更大的性能提升。此外,我们分析了模型形状,并发现了与语言模型中观察到的根本不同的缩放行为:天气预报模型始终倾向于增加宽度而非深度。这些发现表明,未来的天气模型应优先考虑更宽的架构和更大的有效训练数据集,以最大化预测性能。

英文摘要

Data-driven models are revolutionizing weather forecasting. To optimize training efficiency and model performance, this paper analyzes empirical scaling laws within this domain. We investigate the relationship between model performance (validation loss) and three key factors: model size ($N$), dataset size ($D$), and compute budget ($C$). Across a range of models, we find that Aurora exhibits the strongest data-scaling behavior: increasing the training dataset by 10x reduces validation loss by up to 3.2x. GraphCast demonstrates the highest parameter efficiency, yet suffers from limited hardware utilization. Our compute-optimal analysis indicates that, under fixed compute budgets, allocating resources to more total training data yields greater performance gains than increasing model size. Furthermore, we analyze model shape and uncover scaling behaviors that differ fundamentally from those observed in language models: weather forecasting models consistently favor increased width over depth. These findings suggest that future weather models should prioritize wider architectures and larger effective training datasets to maximize predictive performance.
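A statement such as "10x more data reduces validation loss by up to 3.2x" can be read as a power-law exponent: if loss scales as $L = a D^{-b}$, then a 3.2x reduction per 10x data corresponds to $b = \log_{10} 3.2 \approx 0.505$. The sketch below fits such an exponent by least squares in log-log space. The pure power-law form with no irreducible-loss term, the function names, and the synthetic data points are assumptions for illustration and are not measurements or the functional form from the paper.

```python
import math
from typing import Sequence


def fit_power_law_exponent(sizes: Sequence[float], losses: Sequence[float]) -> float:
    """Least-squares exponent b in log(loss) = log(a) - b * log(size)."""
    xs = [math.log(s) for s in sizes]
    ys = [math.log(v) for v in losses]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return -slope


def loss_reduction_factor(exponent: float, data_multiplier: float) -> float:
    """Factor by which loss shrinks when the dataset grows by data_multiplier."""
    return data_multiplier ** exponent


sizes = [1e3, 1e4, 1e5, 1e6]
losses = [2.0 * d ** -0.5 for d in sizes]  # synthetic curve with b = 0.5
b = fit_power_law_exponent(sizes, losses)
print(round(b, 3), round(loss_reduction_factor(b, 10.0), 3))
```

On the synthetic curve the fit recovers an exponent of about 0.5, which corresponds to a loss reduction of roughly 3.16x per 10x data.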

2605.26418 2026-06-11 cs.LG cs.AI cs.DC 版本更新

When Does Deep RL Beat Calibrated Baselines? A Benchmark Study on Adaptive Resource Control

深度强化学习何时超越校准基线?自适应资源控制的基准研究

Guilin Zhang, Chuanyi Sun, Kai Zhao, Shahryar Sarkani, John Fossaceca

AI总结 通过RLScale-Bench基准测试,发现校准的基于规则的自动缩放器在所有工作负载上成本均低于六种主流深度强化学习算法,并揭示了算法选择、基线校准和评估协议的关键瓶颈。

详情
AI中文摘要

一个适当校准的基于规则的自动缩放器可以在我们测试的每个工作负载上,在成本方面击败六种主流深度强化学习(DRL)算法——那么,如果存在的话,DRL究竟何时能真正发挥作用?我们在RLScale-Bench中研究这个问题,这是一个用于自适应资源控制的DRL可重复基准和评估协议,其中代理在成本和服务级别约束下将计算资源分配给动态工作负载。我们在匹配的架构、训练预算和奖励函数下,评估PPO、DQN、A2C、SAC、TD3和DDPG,与校准的基于规则基线在六个工作负载模式和五个种子(240次运行)上进行对比,在Kubernetes水平Pod自动缩放上实例化基准,并探测分布偏移泛化。三个发现挑战了常见假设:(i)校准控制器在所有六个工作负载上实现了最低成本,尽管在突发和闪流流量上落后于最佳RL代理;(ii)由于动作空间不匹配,离散动作算法在约束违反方面比连续动作算法好一到两个数量级;(iii)没有单一算法在所有工作负载上占主导地位,排名变化高达四个位置。基于RL的资源控制的瓶颈不是算法选择,而是基线校准、奖励工程和现实的评估协议。

英文摘要

A properly calibrated rule-based autoscaler can beat every one of six mainstream deep reinforcement learning (DRL) algorithms on cost across every workload we test - so when, if ever, does DRL actually help? We study this in RLScale-Bench, a reproducible benchmark and evaluation protocol for DRL on adaptive resource control, where an agent allocates compute to a dynamic workload under cost and service-level constraints. We evaluate PPO, DQN, A2C, SAC, TD3, and DDPG under matched architectures, training budgets, and reward functions against a calibrated rule-based baseline across six workload patterns and five seeds (240 runs), instantiate the benchmark on Kubernetes Horizontal Pod Autoscaling, and probe distribution-shift generalization. Three findings challenge common assumptions: (i) the calibrated controller achieves the lowest cost on all six workloads, though it trails the best RL agents on bursty and flash traffic; (ii) discrete-action algorithms outperform continuous-action ones by one to two orders of magnitude in constraint violations due to action-space mismatch; and (iii) no single algorithm dominates across workloads, with rankings shifting by up to four positions. The bottleneck in RL-based resource control is not algorithm selection but baseline calibration, reward engineering, and realistic evaluation protocols.

2606.02670 2026-06-11 cs.LG cs.AI 版本更新

Anomalies in Multivariate Time Series Benchmarks Are Mostly Univariate

多变量时间序列基准中的异常主要是单变量的

Marc Pinet (LIG), Julien Cumin, Samuel Berlemont, Dominique Vaufreydaz (LIG)

AI总结 本文通过诊断框架和实验证明,当前多变量时间序列异常检测基准中,异常主要源于单变量偏离,跨通道结构变化极少,因此现有基准不适合验证跨通道建模能力。

详情
AI中文摘要

许多最新的多变量时间序列异常检测(MTSAD)模型引入了跨通道建模,其隐含假设是异常的结构可能分布在多个通道上。我们在八个广泛使用的公共基准上评估了这一假设,引入了一个逐段诊断框架,该框架针对每个标记的异常,标记是否至少有一个通道单独偏离其正常历史,是否跨通道相关结构发生变化,或两者兼有。该框架表明,在一系列合理阈值下,没有跨通道破裂发生在没有伴随单变量偏离的情况下。一个补充指标还显示,在八个基准中的六个上,至少一半的标记异常段在89%到100%的时间步上发生单变量偏离,在其中的三个数据集上达到100%。为了验证我们的框架在存在跨通道结构时能够捕获它,我们构建了具有共享噪声的相移正弦通道的合成数据。每个异常段通过两种通道级损坏之一进行改变,这些损坏保留了每个通道的边缘分布,同时破坏了跨通道结构,我们的框架正确地将这些段表征为仅跨通道异常。在这些数据上,依赖通道(CD)模型成功利用了跨通道信号,而独立通道(CI)模型则失败。在真实基准上对最近SOTA检测器的CI/CD比较进一步证实了CD建模没有带来可衡量的收益。我们得出结论,当前的MTSAD基准不适合验证跨通道建模能力,并呼吁开发更多结构多样的评估集。本研究的代码已公开。

英文摘要

Many recent multivariate time series anomaly detection (MTSAD) models incorporate cross-channel modeling, under the implicit assumption that the structure of anomalies may be spread across multiple channels. We evaluate this assumption on eight widely used public benchmarks by introducing a per-segment diagnostic framework that flags, for each labeled anomaly, whether at least one channel deviates individually from its normal history, whether the cross-channel correlation structure changes, or both. The framework shows that no cross-channel rupture occurs without an accompanying univariate deviation across a range of reasonable thresholds. A complementary metric also reveals that on six of the eight benchmarks, at least half of the labeled anomaly segments deviate univariately on 89% to 100% of their timesteps, reaching 100% on three of these datasets. To verify that our framework captures cross-channel structure when present, we construct synthetic data of phase-shifted sinusoidal channels with shared noise. Each anomalous segment is altered through one of two channel-wise corruptions that preserve the per-channel marginal distribution while breaking cross-channel structure, and our framework correctly characterizes these segments as cross-channel-only. On these data, channel-dependent (CD) models successfully exploit the cross-channel signal whereas channel-independent (CI) ones fail. The CI/CD comparison of a recent SOTA detector on real benchmarks further confirms that CD modeling brings no measurable gain. We conclude that current MTSAD benchmarks are unsuitable for validating cross-channel modeling capabilities, and we call for the development of more structurally diverse evaluation sets. The code for this study is publicly available.

2606.07591 2026-06-11 cs.LG cs.AI cs.CL 版本更新

ResearchClawBench: A Benchmark for End-to-End Autonomous Scientific Research

ResearchClawBench: 端到端自主科学研究基准

Wanghan Xu, Shuo Li, Tianlin Ye, Qinglong Cao, Yixin Chen, Hengjian Gao, Yiheng Wang, Qi Li, Kun Li, Sheng Xu, Shengdu Chai, Fangchen Yu, Xiangyu Zhao, Zhangrui Zhao, Weijie Ma, Zijie Guo, Haoyu Zhou, Haoxiang Yin, Lixue Cheng, Chaofan Hu, Haoxuan Li, Lu Mi, Xuxuan Xie, Yifan Zhou, Ruizhe Chen, Zhiwang Zhou, Xingjian Guo, Yuhao Zhou, Xuming He, Shengyuan Xu, Xinyu Gu, Jiamin Wu, Mianxin Liu, Chunfeng Song, Fenghua Ling, Dongzhan Zhou, Shixiang Tang, Yuqiang Li, Mao Su, Peng Ye, Siqi Sun, Bin Wang, Xue Yang, Zhenfei Yin, Tianfan Fu, Guangtao Zhai, Wanli Ouyang, Bo Zhang, Lei Bai, Wenlong Zhang

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

AI总结 提出ResearchClawBench基准,包含10个领域40个任务,通过多模态评分标准评估自主科研能力,最强智能体仅得21.5分,揭示当前系统在实验协议、证据匹配和科学核心方面的不足。

详情
AI中文摘要

AI编码智能体越来越多地用于科学工作,但其端到端自主研究能力仍然难以验证。我们提出了ResearchClawBench,一个用于评估自主科学研究的基准,涵盖来自10个科学领域的40个任务。每个任务基于一篇真实发表论文,提供相关文献和原始数据,并在评估期间隐藏目标论文。专家策划的多模态评分标准将目标科学制品分解为加权标准,从而能够评估目标论文级别的重新发现,同时为新发现留出空间。我们在统一协议下评估了七个自主研究(auto-research)智能体,并通过轻量级ResearchHarness评估了十七个原生LLM。当前系统远未达到可靠的重新发现:最强的自主智能体Claude Code平均得分为21.5,最强的ResearchHarness LLM Claude-Opus-4.7平均得分为20.7,LLM前沿均值仅为26.5。错误分析表明,失败集中在实验协议不匹配、证据不匹配和缺失科学核心。ResearchClawBench为衡量自主科学研究进展提供了一个可复现的评估前沿。

英文摘要

AI coding agents are increasingly used for scientific work, but their end-to-end autonomous research capability remains difficult to verify. We present ResearchClawBench, a benchmark for evaluating autonomous scientific research across 40 tasks from 10 scientific domains. Each task is grounded in a real published paper, provides related literature and raw data, and hides the target paper during evaluation. Expert-curated multimodal rubrics decompose the target scientific artifacts into weighted criteria, enabling evaluation of target-paper-level re-discovery while leaving room for new discovery. We evaluate seven autonomous research (auto-research) agents under a unified protocol and seventeen native LLMs through the lightweight ResearchHarness. Current systems remain far from reliable re-discovery: the strongest autonomous agent, Claude Code, averages 21.5, and the strongest ResearchHarness LLM, Claude-Opus-4.7, averages 20.7, with an LLM frontier mean of only 26.5. Error analysis shows that failures concentrate in experimental protocol mismatch, evidence mismatch, and missing scientific core. ResearchClawBench provides a reproducible evaluation frontier for measuring progress toward autonomous scientific research.

2410.24145 2026-06-11 stat.ML cs.LG stat.ME 版本更新

Projected random forests and conformal prediction of circular data

投影随机森林与圆形数据的共形预测

Paulo C. Marques F., Rinaldo Artes, Helton Graziadei

AI总结 针对圆形响应回归问题,应用共形预测技术,通过投影方法将线性回归模型转换为圆形模型,并利用随机森林的袋外机制避免额外校准样本,生成具有有限样本覆盖保证和自适应弧长的预测集。

详情
Comments
7 pages; 4 figures
AI中文摘要

我们将共形预测技术应用于具有圆形响应的回归问题,在数据可交换性假设下,为任何圆形预测模型生成具有自适应弧长和有限样本覆盖保证的预测集。利用现有为线性响应设计的高性能预测模型,我们分析了一种通用的投影过程,将任何线性响应回归模型转换为适用于圆形响应的模型。当在此投影过程中使用随机森林作为基模型时,我们利用随机森林的袋外机制,在构建预测集时无需单独的校准样本。在合成和真实数据集上,与两种现有替代模型生成的分割共形预测集相比,所得的投影随机森林模型产生了更高效的袋外共形预测集,中位弧长更短。

英文摘要

We apply conformal prediction techniques to regression problems with circular responses, producing prediction sets with adaptive arc length and finite-sample coverage guarantees for any circular predictive model under the assumption of data exchangeability. Leveraging the high performance of existing predictive models designed for linear responses, we analyze a general projection procedure that converts any linear-response regression model into one suitable for circular responses. When random forests are used as base models in this projection procedure, we leverage the random forest out-of-bag mechanism to eliminate the need for a separate calibration sample in the construction of prediction sets. On synthetic and real datasets, the resulting projected random forest model produces more efficient out-of-bag conformal prediction sets, with shorter median arc length, than the split conformal prediction sets generated by two existing alternative models.
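One common way to turn linear-response predictions into a circular prediction is to predict the cosine and sine of the angle separately and map the pair back to the circle with `atan2`; prediction-set widths are then measured by arc distance. The sketch below shows these two building blocks. Whether the paper's projection procedure takes exactly this form is an assumption, and the function names are hypothetical; the random forest and the out-of-bag conformal calibration are not reproduced.

```python
import math


def project_to_angle(cos_pred: float, sin_pred: float) -> float:
    """Map a (cos, sin) prediction pair to an angle in [0, 2*pi)."""
    if cos_pred == 0.0 and sin_pred == 0.0:
        raise ValueError("the zero vector has no direction")
    # atan2 ignores the vector's length, so no explicit normalization is needed.
    return math.atan2(sin_pred, cos_pred) % (2.0 * math.pi)


def arc_distance(a: float, b: float) -> float:
    """Shortest arc length between two angles on the unit circle."""
    d = abs(a - b) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


predicted = project_to_angle(cos_pred=-2.0, sin_pred=0.0)  # points along the negative x-axis
observed = 0.1
print(predicted, arc_distance(predicted, observed))
```

Using arc distance as the nonconformity score keeps angles near 0 and near 2*pi close to each other, which a plain absolute difference would not.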

2510.06596 2026-06-11 cs.CV cs.AI cs.IT cs.LG 版本更新

SDQM: Synthetic Data Quality Metric for Object Detection Dataset Evaluation

SDQM:用于目标检测数据集评估的合成数据质量指标

Ayush Zenith, Arnold Zumbrun, Neel Raut, Jing Lin

AI总结 提出SDQM指标,无需模型训练收敛即可评估合成数据质量,与YOLO11的mAP强相关,优于现有指标。

详情
Comments
Accepted and Published at SPIE: Journal of Electronic Imaging, Vol. 35, Issue 3
AI中文摘要

机器学习模型的性能在很大程度上依赖于训练数据。大规模、良好标注数据集的稀缺给构建鲁棒模型带来了重大挑战。为了解决这一问题,通过模拟和生成模型产生的合成数据已成为一种有前景的解决方案,它增强了数据集的多样性,并提高了模型的性能、可靠性和韧性。然而,评估这些生成数据的质量需要一个有效的指标。我们引入了合成数据集质量指标(SDQM),用于评估目标检测任务的数据质量,而无需模型训练收敛。该指标能够更高效地生成和选择合成数据集,解决了资源受限的目标检测任务中的一个关键挑战。在我们的实验中,SDQM与领先的目标检测模型YOLO11的平均精度均值(mAP)得分表现出强相关性,而先前的指标仅表现出中等或弱相关性。此外,它提供了改进数据集质量的可操作见解,最大限度地减少了昂贵的迭代训练需求。这一可扩展且高效的指标为评估合成数据设立了新标准。SDQM的代码可从此https URL获取。

英文摘要

The performance of machine learning models depends heavily on training data. The scarcity of large-scale, well-annotated datasets poses significant challenges in creating robust models. To address this, synthetic data generated through simulations and generative models has emerged as a promising solution, enhancing dataset diversity and improving the performance, reliability, and resilience of models. However, evaluating the quality of this generated data requires an effective metric. We introduce the Synthetic Dataset Quality Metric (SDQM) to assess data quality for object detection tasks without requiring model training to converge. This metric enables more efficient generation and selection of synthetic datasets, addressing a key challenge in resource-constrained object detection tasks. In our experiments, SDQM demonstrated a strong correlation with the mean average precision (mAP) scores of YOLO11, a leading object detection model, whereas previous metrics only exhibited moderate or weak correlations. In addition, it provides actionable insights into improving dataset quality, minimizing the need for costly iterative training. This scalable and efficient metric sets a new standard for evaluating synthetic data. The code for SDQM is available at this https URL

2510.16152 2026-06-11 cs.DL cs.AI cs.CL cs.LG 版本更新

Mapping Scientific Literature with Large Language Models and Topic Modeling

利用大语言模型和主题建模绘制科学文献图谱

Mason Smetana, Lev Khazanovich

AI总结 提出基于大语言模型的两阶段分类框架,通过主题建模分析PNAS工程类文献,生成语义可解释主题并揭示跨主题关联,性能优于传统方法。

详情
Comments
35 pages, 10 figures. Accepted for publication in Scientometrics. Final version available via DOI
AI中文摘要

科学文献因学科边界、专业术语和潜在稀疏的关键词系统而日益碎片化,使得捕捉现代科学的演化结构变得困难。本研究引入了一个大语言模型驱动的框架,从主题建模的角度绘制科学文献图谱。该方法在《美国国家科学院院刊》20年间超过1500篇工程相关文章语料上进行了演示。一个两阶段分类流水线首先根据每篇文章的摘要分配一个主要主题类别,然后进行全文分析以识别次要分类,揭示语料库中潜在的跨主题联系。与传统主题模型不同,基于LLM的框架在保持强量化性能的同时,生成语义可解释的主题。与既定主题建模方法的比较评估显示,主题多样性更高,重叠度更低,且具有竞争性的一致性指标。对随机抽样的摘要子集进行手动验证,准确率达到75.9%。额外的传统自然语言处理分析证实,生成的主题对应于语料库中有意义的语言模式。连接主要和次要分类的二部网络进一步揭示了仅通过摘要或关键词系统不易观察到的隐含主题关系。结果表明,该框架无需事先了解期刊的编辑双重分类结构,即可独立恢复其大部分结构。总体而言,所提出的方法为绘制科学图谱和识别研究中新兴的跨主题联系提供了有力工具。

英文摘要

Scientific literature is increasingly fragmented by disciplinary boundaries, specialized terminology, and potentially sparse keyword systems, making it difficult to capture the evolving structure of modern science. This study introduces a large language model (LLM)-driven framework for mapping scientific literature from a topic modeling perspective. The approach is demonstrated on a 20-year corpus of more than 1,500 engineering-related articles published in the Proceedings of the National Academy of Sciences (PNAS). A two-stage classification pipeline first assigns a primary thematic category to each article based on its abstract, followed by full-text analysis to identify secondary classifications that reveal latent cross-topic connections within the corpus. Unlike conventional topic models, the LLM-based framework produces semantically interpretable topics while maintaining strong quantitative performance. Comparative evaluation against established topic modeling methods shows higher topic diversity and lower overlap with competitive coherence metrics. Manual validation on a randomly sampled subset of abstracts yields an accuracy of 75.9%. Additional traditional natural language processing analyses confirm that the generated topics correspond to meaningful linguistic patterns in the corpus. A bipartite network linking primary and secondary classifications further reveals implicit thematic relationships that are not readily observable through abstracts or keyword systems alone. The findings indicate that the framework independently recovers much of the journal's editorial dual-classification structure without prior knowledge of its schema. Overall, the proposed approach offers a powerful tool for mapping science and identifying emerging cross-topic connections in research.

2601.04203 2026-06-11 cs.CL cs.CV cs.LG cs.SE 版本更新

FronTalk: Benchmarking Front-End Development as Conversational Code Generation with Multi-Modal Feedback

FronTalk: 以多模态反馈进行对话式代码生成的前端开发基准测试

Xueqing Wu, Zihan Xue, Da Yin, Shuyan Zhou, Kai-Wei Chang, Nanyun Peng, Yeming Wen

AI总结 提出FronTalk基准,通过多轮对话和多模态反馈(文本与视觉指令)评估前端代码生成,发现模型存在遗忘和视觉反馈理解困难,提出AceCoder方法有效减少遗忘并提升性能。

详情
AI中文摘要

我们提出了FronTalk,一个前端代码生成基准,开创性地研究了一种独特的交互动态:具有多模态反馈的对话式代码生成。在前端开发中,草图、模型和带注释的截图等视觉工件对于传达设计意图至关重要,但它们在多轮代码生成中的作用仍未得到充分探索。为解决这一差距,我们聚焦于前端开发任务,整理了FronTalk,这是一个包含100个多轮对话的数据集,这些对话源自新闻、金融和艺术等不同领域的真实网站。每一轮都包含一个文本指令和一个等效的视觉指令,每个指令代表相同的用户意图。为全面评估模型性能,我们提出了一种新颖的基于智能体的评估框架,利用网络智能体模拟用户并探索网站,从而衡量功能正确性和用户体验。对20个模型的评估揭示了文献中系统性地未充分探索的两个关键挑战:(1)显著的遗忘问题,即模型覆盖先前实现的功能,导致任务失败;(2)解释视觉反馈的持续挑战,尤其是对于开源视觉语言模型(VLM)。我们提出了一个强大的基线来解决遗忘问题,即AceCoder,一种使用自主网络智能体批评每个过去指令实现的方法。这种方法将遗忘几乎减少到零,并将性能提升高达9.3%(从56.0%到65.3%)。总体而言,我们旨在为前端开发和多轮多模态代码生成的通用交互动态的未来研究提供坚实基础。代码和数据已在此https URL发布。

英文摘要

We present FronTalk, a benchmark for front-end code generation that pioneers the study of a unique interaction dynamic: conversational code generation with multi-modal feedback. In front-end development, visual artifacts such as sketches, mockups, and annotated screenshots are essential for conveying design intent, yet their role in multi-turn code generation remains largely unexplored. To address this gap, we focus on the front-end development task and curate FronTalk, a collection of 100 multi-turn dialogues derived from real-world websites across diverse domains such as news, finance, and art. Each turn features both a textual instruction and an equivalent visual instruction, each representing the same user intent. To comprehensively evaluate model performance, we propose a novel agent-based evaluation framework that leverages a web agent to simulate users and explore the website, thereby measuring both functional correctness and user experience. Evaluation of 20 models reveals two key challenges that have not been systematically explored in the literature: (1) a significant forgetting issue where models overwrite previously implemented features, resulting in task failures, and (2) a persistent challenge in interpreting visual feedback, especially for open-source vision-language models (VLMs). We propose a strong baseline to tackle the forgetting issue with AceCoder, a method that critiques the implementation of every past instruction using an autonomous web agent. This approach reduces forgetting to nearly zero and improves performance by up to 9.3 percentage points (56.0% to 65.3%). Overall, we aim to provide a solid foundation for future research in front-end development and the general interaction dynamics of multi-turn, multi-modal code generation. Code and data are released at this https URL

2601.17717 2026-06-11 cs.AI cs.LG 版本更新

A Survey on Evaluating Quality and Trustworthiness in LLM-Generated Data

评估LLM生成数据的质量与可信度综述

Kaituo Zhang, Mingzhi Hu, Hoang Anh Duy Le, Fariha Kabir Torsha, Zhimeng Jiang, Minh Khai Bui, Chia-Yuan Chang, Yu-Neng Chuang, Zhen Xiong, Ying Lin, Guanchu Wang, Na Zou

AI总结 提出LLM数据审计框架,从质量和可信度两个维度系统分类评估指标,分析六种模态数据生成方法的评估缺陷并给出改进建议。

详情
Comments
Published at TMLR. Title changed in the final version
AI中文摘要

大型语言模型(LLM)已成为跨多种模态生成数据的强大工具。通过将数据从稀缺资源转变为可控资产,LLM缓解了真实世界数据获取成本对模型训练、评估和系统迭代造成的瓶颈。然而,确保LLM生成的合成数据的高质量仍然是一个关键挑战。现有研究主要关注生成方法,对生成数据质量的直接关注有限。此外,大多数研究局限于单一模态,缺乏跨不同数据类型的统一视角。为填补这一空白,我们提出了LLM数据审计框架。在该框架中,我们首先描述了如何利用LLM生成六种不同模态的数据。更重要的是,我们从质量和可信度两个维度系统分类了评估合成数据的内在指标。这种方法将评估重点从依赖下游任务性能的外在评估转向数据本身的固有属性。利用这一评估体系,我们分析了每种模态代表性生成方法的实验评估,并指出了当前评估实践中的重大缺陷。基于这些发现,我们为社区改进数据生成评估提供了具体建议。最后,该框架概述了合成数据在不同模态下的实际应用方法。

英文摘要

Large Language Models (LLMs) have emerged as powerful tools for generating data across various modalities. By transforming data from a scarce resource into a controllable asset, LLMs mitigate the bottlenecks imposed by the acquisition costs of real-world data for model training, evaluation, and system iteration. However, ensuring the high quality of LLM-generated synthetic data remains a critical challenge. Existing research primarily focuses on generation methodologies, with limited direct attention to the quality of the resulting data. Furthermore, most studies are restricted to single modalities, lacking a unified perspective across different data types. To bridge this gap, we propose the LLM Data Auditor framework. In this framework, we first describe how LLMs are utilized to generate data across six distinct modalities. More importantly, we systematically categorize intrinsic metrics for evaluating synthetic data from two dimensions: quality and trustworthiness. This approach shifts the focus from extrinsic evaluation, which relies on downstream task performance, to the inherent properties of the data itself. Using this evaluation system, we analyze the experimental evaluations of representative generation methods for each modality and identify substantial deficiencies in current evaluation practices. Based on these findings, we offer concrete recommendations for the community to improve the evaluation of data generation. Finally, the framework outlines methodologies for the practical application of synthetic data across different modalities.

2601.21817 2026-06-11 stat.ML cs.LG 版本更新

A Judge-Aware Ranking Framework for Evaluating Large Language Models without Ground Truth

一种面向评委的排名框架:无需真实标签评估大语言模型

Mingyuan Xu, Xinzi Tan, Jiawei Wu, Doudou Zhou

AI总结 本文提出一种面向评委的排名框架,通过引入评委特定的辨别参数扩展Bradley-Terry-Luce模型,在不参考标签的情况下联合估计潜在模型质量和评委可靠性,从而提高人类偏好的一致性,提高数据效率,并产生校准的不确定性量化。

详情
AI中文摘要

评估大语言模型(LLMs)在开放性任务上无需真实标签的评估越来越通过LLM-as-a-judge范式进行。一个关键但未充分建模的问题是,评判LLMs在可靠性上存在显著差异;将所有评委视为同等对待会导致偏见的排行榜和误导性的不确定性估计。更多的数据在不正确的聚合下可能导致评估更加自信地错误。我们提出了一种面向评委的排名框架,通过引入评委特定的辨别参数扩展Bradley-Terry-Luce模型,在不参考标签的情况下联合估计潜在模型质量和评委可靠性。我们建立了可识别性,直到自然归一化,并证明最大似然估计的一致性和渐近正态性,从而能够为分数差异和排名比较生成置信区间。在多个公开基准和一个新收集的数据集上,我们的方法提高了与人类偏好的一致性,比无权基线实现了更高的数据效率,并产生了校准的LLM排名不确定性量化。

英文摘要

Evaluating large language models (LLMs) on open-ended tasks without ground-truth labels is increasingly done via the LLM-as-a-judge paradigm. A critical but under-modeled issue is that judge LLMs differ substantially in reliability; treating all judges equally can yield biased leaderboards and misleading uncertainty estimates. More data can make evaluation more confidently wrong under misspecified aggregation. We propose a judge-aware ranking framework that extends the Bradley-Terry-Luce model by introducing judge-specific discrimination parameters, jointly estimating latent model quality and judge reliability from pairwise comparisons without reference labels. We establish identifiability up to natural normalizations and prove consistency and asymptotic normality of the maximum likelihood estimator, enabling confidence intervals for score differences and rank comparisons. Across multiple public benchmarks and a newly collected dataset, our method improves agreement with human preferences, achieves higher data efficiency than unweighted baselines, and produces calibrated uncertainty quantification for LLM rankings.
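One way to read "judge-specific discrimination parameters" is as a per-judge slope on the Bradley-Terry-Luce score difference. The sketch below assumes that form, with win probability `sigmoid(gamma_k * (theta_i - theta_j))`. The abstract does not state the exact parameterization or normalization, so the names `win_prob` and `neg_log_likelihood` and the model form itself are illustrative assumptions.

```python
import numpy as np

def win_prob(theta_i, theta_j, gamma_k):
    """Assumed probability that judge k prefers model i over model j.

    theta_*: latent model quality scores.
    gamma_k: judge discrimination (0 means the judge is a coin flip).
    """
    return 1.0 / (1.0 + np.exp(-gamma_k * (theta_i - theta_j)))

def neg_log_likelihood(theta, gamma, comparisons):
    """Negative log-likelihood of pairwise outcomes.

    comparisons: iterable of (winner, loser, judge) index triples.
    """
    nll = 0.0
    for winner, loser, judge in comparisons:
        nll -= np.log(win_prob(theta[winner], theta[loser], gamma[judge]))
    return nll
```

Minimizing this objective jointly over `theta` and `gamma`, under some normalization that fixes the scale, would down-weight judges whose votes look like noise. With all `gamma_k` fixed to 1 it reduces to the ordinary, unweighted Bradley-Terry-Luce fit.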

2602.02465 2026-06-11 cs.AI cs.CV cs.LG 版本更新

MentisOculi: Revealing the Limits of Reasoning with Mental Imagery

MentisOculi: 揭示心智图像推理的局限性

Jana Zeller, Thaddäus Wiedemer, Fanfei Li, Thomas Klein, Prasanna Mayilvahanan, Matthias Bethge, Felix Wichmann, Ryan Cotterell, Wieland Brendel

AI总结 提出MentisOculi基准,通过多步推理问题测试前沿模型利用视觉表示辅助推理的能力,发现视觉策略普遍无法提升性能,且统一多模态模型存在生成错误累积和无法利用真实可视化的问题。

详情
Comments
9 pages, 8 figures, Accepted at ICML 2026
AI中文摘要

前沿模型正从仅摄入视觉信息的多模态大语言模型(MLLMs)过渡到能够原生交错生成的统一多模态模型(UMMs)。这一转变激发了将中间可视化作为推理辅助的兴趣,类似于人类的心智图像。这一想法的核心是能够以目标导向的方式形成、维护和操作视觉表示。为了评估和探究这一能力,我们开发了MentisOculi,这是一个程序化的、分层的多步推理问题套件,适用于视觉解决方案,旨在挑战前沿模型。评估从潜在令牌到显式生成图像的视觉策略,我们发现它们通常无法提升性能。对UMMs的分析特别揭示了一个关键限制:虽然它们拥有解决任务的文本推理能力,并且有时能生成正确的视觉内容,但它们遭受复合生成错误,并且无法利用甚至真实的可视化。我们的发现表明,尽管视觉思维具有内在吸引力,但尚未有益于模型推理。MentisOculi为分析和弥合不同模型家族之间的这一差距建立了必要的基础。

英文摘要

Frontier models are transitioning from multimodal large language models (MLLMs) that merely ingest visual information to unified multimodal models (UMMs) capable of native interleaved generation. This shift has sparked interest in using intermediate visualizations as a reasoning aid, akin to human mental imagery. Central to this idea is the ability to form, maintain, and manipulate visual representations in a goal-oriented manner. To evaluate and probe this capability, we develop MentisOculi, a procedural, stratified suite of multi-step reasoning problems amenable to visual solution, tuned to challenge frontier models. Evaluating visual strategies ranging from latent tokens to explicit generated imagery, we find they generally fail to improve performance. Analysis of UMMs specifically exposes a critical limitation: While they possess the textual reasoning capacity to solve a task and can sometimes generate correct visuals, they suffer from compounding generation errors and fail to leverage even ground-truth visualizations. Our findings suggest that despite their inherent appeal, visual thoughts do not yet benefit model reasoning. MentisOculi establishes the necessary foundation to analyze and close this gap across diverse model families.

12. 机器学习应用 75 篇

2606.11201 2026-06-11 cs.LG cs.AI cs.CL 新提交

To Intervene or Not: Guiding Inference-time Alignment with Probabilistic Model Blending

干预还是不干预:通过概率模型混合指导推理时对齐

Jin Gan, Xin Li, Jun Luo

发表机构 * College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院)

AI总结 提出BlendIn框架,通过质量感知对齐和按可靠性加权混合模型知识,解决推理时对齐中指导有效性差异大的问题,在困难模型对上实现最高50%的性能提升。

详情
Comments
Accepted by ACL 2026
AI中文摘要

LLM的广泛部署使得模型对齐成为必要,以确保新训练的模型能够安全有效地响应用户指令。在不同方法中,推理时对齐通常更便宜,因为它仅在输出生成期间进行干预(即提供指导)。现有提案从某些对齐模型中提取指导,但没有适当评估其可靠性。然而,我们的系统评估显示,指导有效性在不同模型间差异很大;由于无效指导会导致进一步混乱和更多干预,由此产生的过度干预通常表明性能较差。为了使干预更有效且更高效,我们引入了BlendIn,一个推理时对齐框架,从二元决策转向创建整合两个模型知识的混合分布。BlendIn通过执行质量感知对齐并根据可靠性按比例加权每个模型的贡献来稳定推理时对齐。与现有工作相比,它保留了有益的指导,同时降低了不可靠建议的权重。BlendIn为未对齐的指导提供了诊断信号和缓解策略,在困难模型对上实现了一致且高达50%的性能提升。我们的代码可在以下网址获取:this https URL。

英文摘要

The wide deployment of LLMs has made model alignment necessary so that newly trained models respond to user instructions safely and effectively. Among different methods, inference-time alignment is often cheaper because it intervenes (i.e., offers guidance) only during output generation. Existing proposals apply guidance extracted from certain aligned models without properly assessing its reliability. However, our systematic evaluation reveals that guidance effectiveness varies drastically across models; since ineffective guidance leads to further confusion and thus further interventions, the resulting excessive interventions typically indicate poor performance. To make interventions more effective and thus more efficient, we introduce BlendIn, an inference-time alignment framework that shifts from binary decisions to creating hybrid distributions integrating both models' knowledge. BlendIn stabilizes inference-time alignment by performing quality-aware alignment and proportionally weighting each model's contribution based on reliability. Compared with existing work, it preserves beneficial guidance while downweighting unreliable suggestions. BlendIn provides both diagnostic signals and mitigation strategies for misaligned guidance, achieving consistent performance improvements of up to 50% on challenging model pairs. Our code is available at: this https URL.
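The move from a binary intervene-or-not decision to a hybrid distribution can be pictured as a convex combination of two next-token distributions. The sketch below assumes the reliability weight is already available as a number in [0, 1]. How BlendIn actually estimates reliability is not described in the abstract, and `blend_next_token` is a hypothetical name, not the authors' API.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over a 1-D array of logits."""
    z = np.asarray(logits, dtype=float)
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def blend_next_token(base_logits, guide_logits, reliability):
    """Mix the base model's and the guide model's next-token distributions.

    reliability: assumed weight in [0, 1] placed on the guide model;
    0 ignores the guide entirely, 1 follows it completely.
    """
    w = float(np.clip(reliability, 0.0, 1.0))
    return (1.0 - w) * softmax(base_logits) + w * softmax(guide_logits)
```

A hard switch between the two models corresponds to restricting `reliability` to 0 or 1. Intermediate values keep part of the guide's signal while limiting the damage when that guide is unreliable.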

2606.11268 2026-06-11 cs.LG 新提交

LakeFM: Toward a Foundation Model for Aquatic Ecosystems Using Irregular Multivariate Multi-depth Time Series Data

LakeFM:基于不规则多变量多深度时间序列数据的水生生态系统基础模型

Abhilash Neog, Sepideh Fatemi, Medha Sawhney, Kazi Sajeed Mehrab, Aanish Pradhan, Bennett J. McAfee, Emma Marchisin, Arka Daw, Robert Ladwig, Cayelan C. Carey, Paul Hanson, Anuj Karpatne

发表机构 * Virginia Tech(弗吉尼亚理工大学) Grand Valley State University(大峡谷州立大学) University of Wisconsin - Madison(威斯康星大学麦迪逊分校) Amazon AGI(亚马逊AGI) Aarhus University(奥胡斯大学)

AI总结 针对湖泊时间序列数据不规则采样和跨湖泊泛化难题,提出预训练基础模型LakeFM,在模拟和观测数据上学习表征,实现优于现有模型的预测性能。

详情
Comments
KDD 2026
AI中文摘要

理解和预测湖泊动态对于监测湖泊和水库的水质及生态系统健康至关重要。尽管机器学习方法最近已被应用于生态时间序列数据,但现有工作假设时间和深度上的规则采样,并且难以在具有异质变量、深度和观测模式的湖泊之间泛化。为了解决这些局限性,我们引入了LakeFM,一个用于水生系统的基础模型,在包含模拟和观测湖泊的大规模生态数据集上预训练。通过广泛的实证评估,我们表明LakeFM学习了跨越更广泛湖泊层面特征的有意义表征,并在与现有时间序列基础模型和非基础模型相比时,实现了具有竞争力或通常更优的预测性能,同时产生与真实湖泊动态一致的物理上合理的预测。

英文摘要

Understanding and forecasting lake dynamics is critical for monitoring water quality and ecosystem health across lakes and reservoirs. While machine learning methods have recently been applied to ecological time-series data, existing works assume regular sampling in time and depth, and struggle to generalize across lakes with heterogeneous variables, depths, and observation patterns. To address these limitations, we introduce LakeFM, a foundation model for aquatic systems, pre-trained on large-scale ecological datasets comprising both simulated and observed lakes. Through extensive empirical evaluation, we show that LakeFM learns meaningful representations spanning broader lake-level characteristics, and achieves competitive or often superior forecasting performance compared to existing time-series foundation and non-foundation models, while producing physically plausible predictions consistent with real-world lake dynamics.

2606.11348 2026-06-11 cs.LG 新提交

SwiftCTS: Fast Cross-Design Prediction and Pareto Optimization of Clock Tree Metrics via Few-Shot Calibration

SwiftCTS: 通过少样本校准实现时钟树指标的快速跨设计预测与帕累托优化

Barsat Khadka, Kawsher Roxy, Md Rubel Ahmed

AI总结 提出SwiftCTS框架,利用物理信息代理模型和K-shot乘法校准机制,在数秒内训练、亚毫秒推理,实现跨设计时钟树指标的准确预测与帕累托优化。

详情
AI中文摘要

时钟树综合(CTS)是物理设计流程中计算成本高昂的阶段,需要迭代调用EDA工具以探索庞大的配置空间,从而优化功耗、线长和时序偏差。现有的机器学习方法需要昂贵的重新训练或微调周期来适应未见过的宏架构,并且在架构上与穷举组合搜索所需的数百万次评估不匹配。我们提出了SwiftCTS,一个物理信息代理框架,同时解决了这两个局限性。通过将轻量级、基于物理的统计特征与梯度提升集成相结合,SwiftCTS在CPU上训练时间不到五秒,且无需GPU支持即可实现亚毫秒级推理。为了处理分布外(OOD)设计而无需重新训练或微调,我们引入了一种K-shot乘法校准机制,该机制仅需一到两次物理参考运行即可锚定预测,将未见过的宏上的功耗预测误差从24.5%降低到3.3%,线长误差从56.6%降低到1%以下。将该引擎与进化优化器集成,SwiftCTS在十秒内评估了100,000个CTS配置,生成了在OpenROAD流程中经过物理验证的帕累托最优前沿。闭环验证确认了功耗和线长的预测误差低于0.5%,时序偏差预测在OOD基准上在五皮秒以内,在所有目标指标上始终优于默认工具启发式方法。代码公开于:this https URL

英文摘要

Clock Tree Synthesis (CTS) is a computationally expensive stage in the physical design flow, requiring iterative EDA tool invocations to navigate a vast configuration space for optimal power, wirelength, and timing skew. Existing machine learning approaches require computationally expensive retraining or fine-tuning cycles to adapt to unseen macro architectures and are architecturally mismatched to the millions of evaluations demanded by exhaustive combinatorial search. We present SwiftCTS, a physics-informed surrogate framework that addresses both limitations simultaneously. By coupling lightweight, physics-grounded statistical features with gradient-boosted ensembles, SwiftCTS trains in under five seconds on a CPU and delivers sub-millisecond inference without GPU support. To handle out-of-distribution (OOD) designs without retraining or fine-tuning, we introduce a K-shot multiplicative calibration mechanism that anchors predictions to just one or two physical reference runs, reducing power prediction error from 24.5% to 3.3% and wirelength error from 56.6% to under 1% on unseen macros. Integrating this engine with an evolutionary optimizer, SwiftCTS evaluates 100,000 CTS configurations in under ten seconds, yielding Pareto-optimal frontiers that are physically validated within the OpenROAD flow. Closed-loop validation confirms prediction errors below 0.5% for power and wirelength, and timing skew predictions within five picoseconds on an OOD benchmark, consistently outperforming default tool heuristics across all target metrics. Code publicly available at: this https URL
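A K-shot multiplicative calibration can be sketched as a single scale factor fitted on a handful of reference runs. The version below assumes the factor is the mean ratio of measured to predicted values over the K references. The abstract does not give the exact estimator, and `fit_scale` and `calibrate` are hypothetical helper names.

```python
import numpy as np

def fit_scale(ref_measured, ref_predicted):
    """Scale factor from K reference runs: mean of measured / predicted."""
    measured = np.asarray(ref_measured, dtype=float)
    predicted = np.asarray(ref_predicted, dtype=float)
    return float(np.mean(measured / predicted))

def calibrate(predictions, scale):
    """Apply the multiplicative correction to surrogate predictions."""
    return scale * np.asarray(predictions, dtype=float)

# Two reference runs on an unseen design (made-up numbers for illustration).
scale = fit_scale(ref_measured=[12.0, 24.0], ref_predicted=[10.0, 20.0])
corrected = calibrate([5.0, 8.0], scale)
```

Because the correction is one scalar per metric, it can be refitted for each new design from one or two tool runs without touching the trained surrogate. This is what allows adaptation without retraining or fine-tuning.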

2606.11382 2026-06-11 cs.LG q-bio.BM 新提交

GLACIER: A Multimodal Student-Teacher Foundation Model for Molecular Property Prediction

GLACIER:用于分子性质预测的多模态师生基础模型

Emily Nguyen, Yongchan Hong, Harsh Toshniwal, Yan Liu, Andreas Luttens

发表机构 * Department of Computer Science, University of Southern California(南加州大学计算机科学系) Department of Quantitative and Computational Biology, University of Southern California(南加州大学定量与计算生物学系) Amazon(亚马逊) Department of Medical Biochemistry and Biophysics, Science for Life Laboratory, Karolinska Institutet(卡罗林斯卡学院医学生物化学与生物物理系,生命科学实验室)

AI总结 提出GLACIER师生框架,通过融合分子图、SMILES和物理化学描述符三种模态,并利用大模型蒸馏,实现高效准确的分子性质预测。

详情
AI中文摘要

深度学习模型有助于在数十亿候选化合物中发现具有定制性质的分子。然而,开发和部署最先进模型的计算负担不断增加,限制了其可扩展性。大多数大规模模型本质上是单模态的,忽视了利用互补分子数据模态的潜力。为了解决这些缺点,本文介绍了用于化学推理和探索的图-语言对齐表示(GLACIER)模型,这是一个师生框架,集成了分子图、SMILES字符串和物理化学描述符,以学习丰富的分子嵌入。我们的框架包括三个阶段:(1)我们在100,000个药物样分子上预训练三个学生编码器:用于分子图的消息传递神经网络、用于SMILES字符串的基于Transformer的编码器以及用于物理化学描述符的多层感知器;(2)我们使用新颖的Finsler几何感知模块融合这些学生模态;(3)通过对比学习,将来自大型教师模型(包括MiniMol和MolFormer)的互补知识蒸馏到一个轻量级模型中。我们证明GLACIER是一个稳健的框架,在复杂的分子性质预测任务中提供高预测性能和计算效率。我们的代码在此https URL公开可用。

英文摘要

Deep learning models facilitate the discovery of molecules with tailored properties among billions of candidate compounds. However, the computational burden to develop and deploy state-of-the-art models continuously increases, limiting their scalability. Most large-scale models are unimodal in nature and overlook the potential to leverage complementary molecular data modalities. To address these shortcomings, this paper introduces the Graph-Language Alignment for Chemical Inference and Exploration using Representations (GLACIER) model, a student-teacher framework that integrates molecular graphs, SMILES strings, and physicochemical descriptors to learn rich molecular embeddings. Our framework consists of three stages: (1) we pretrain three student encoders on 100,000 drug-like molecules: a message-passing neural network for molecular graphs, a transformer-based encoder for SMILES strings, and a multilayer perceptron for physicochemical descriptors, (2) we fuse these student modalities using a novel Finsler geometry-aware module, and (3) distill complementary knowledge from large teacher models, including MiniMol and MolFormer, into a single lightweight model via contrastive learning. We demonstrate that GLACIER is a robust framework that delivers high predictive performance and computational efficiency in complex molecular property prediction tasks. Our code is publicly available at this https URL.

2606.11463 2026-06-11 cs.LG cs.AI 新提交

LSTM-Based Detection of Structural Breaks in Property Insurance Loss Reserving: A Climate-Informed Approach

基于LSTM的财产保险损失准备金结构性断点检测:气候信息方法

Thomas Mbrice, Shashwat Panigrahi

发表机构 * Stony Brook University(石溪大学)

AI总结 针对气候变化导致传统精算方法失效的问题,提出使用LSTM神经网络检测结构性断点,在佛罗里达和路易斯安那州数据上预期将巨灾年份准备金精度提升15-20%,并给出理论保证。

详情
Comments
15 pages, 0 figures, whitepaper YC
AI中文摘要

准确的损失准备金是保险公司偿付能力的基础,然而加速的气候驱动灾难系统地违反了传统精算方法所依赖的稳定性假设。本文提出一个研究计划,测试长短期记忆(LSTM)神经网络是否能够比链梯法、Bornhuetter-Ferguson法和Cape Cod法更快、更准确地检测和适应这些结构性断点。使用来自佛罗里达州和路易斯安那州超过15年的监管发展三角形数据,并辅以NOAA飓风强度指数和海面温度,我们假设在巨灾暴露年份准备金精度有15-20%的针对性提升,这一阈值基于先前的神经网络准备金文献以及本文发展的形式化收敛结果。除了实证验证,我们还发展了一个理论框架,以概率术语为基础进行LSTM结构性断点检测,并提供形式化的性能保证,以弥补测试期间巨灾事件数量有限的不足。我们记录了研究设计、方法论、预期贡献以及对局限性的坦诚评估。

英文摘要

Accurate loss reserving is foundational to insurer solvency, yet accelerating climate-driven catastrophes systematically violate the stability assumptions on which traditional actuarial methods depend. This white paper presents a research program testing whether Long Short-Term Memory (LSTM) neural networks can detect and adapt to these structural breaks faster and more accurately than the Chain Ladder, Bornhuetter-Ferguson, and Cape Cod methods. Using more than 15 years of regulatory development triangle data from Florida and Louisiana, enriched with NOAA hurricane intensity indices and sea surface temperatures, we hypothesize a targeted improvement of 15-20% in reserve accuracy for catastrophe-exposed years, a threshold grounded both in the prior neural network reserving literature and in the formal convergence results developed here. Beyond empirical validation, we develop a theoretical framework grounding LSTM structural break detection in probabilistic terms, providing formal performance guarantees that compensate for the limited number of catastrophe events in the test period. We document the research design, methodology, expected contributions, and a candid assessment of limitations.
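For readers unfamiliar with the Chain Ladder baseline named above, the sketch below shows the standard volume-weighted version on a cumulative development triangle. The triangle values are made up for illustration, and nothing here reflects the white paper's data or its LSTM model.

```python
import numpy as np

def development_factors(triangle):
    """Volume-weighted age-to-age factors from a cumulative triangle.

    triangle: 2-D array, rows = accident years, columns = development ages,
    with np.nan in the cells that are not yet observed.
    """
    tri = np.asarray(triangle, dtype=float)
    factors = []
    for j in range(tri.shape[1] - 1):
        observed = ~np.isnan(tri[:, j + 1])  # rows that have reached age j + 1
        factors.append(tri[observed, j + 1].sum() / tri[observed, j].sum())
    return np.array(factors)

def project_ultimates(triangle):
    """Fill the unobserved cells with the factors; return the last column."""
    tri = np.array(triangle, dtype=float)
    f = development_factors(tri)
    for i in range(tri.shape[0]):
        for j in range(1, tri.shape[1]):
            if np.isnan(tri[i, j]):
                tri[i, j] = tri[i, j - 1] * f[j - 1]
    return tri[:, -1]

nan = np.nan
toy_triangle = [[100.0, 150.0, 165.0],
                [110.0, 165.0, nan],
                [120.0, nan, nan]]
ultimates = project_ultimates(toy_triangle)
```

The method assumes each development factor is stable across accident years. That is the stability assumption the abstract argues catastrophe years violate.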

2606.11490 2026-06-11 cs.LG eess.SY 新提交

OmniLoc: A Geometry-Aware Foundation Model for Anchor-Free UE Localization Across Diverse Indoor Environments

OmniLoc: 一种几何感知的基础模型,用于跨多样室内环境的无锚点用户设备定位

Lei Chu, Yuning Zhang, Omer Gokalp Serbetci, Anushka Katiyar, Bassel Abou Ali Modad, Andreas F. Molisch

AI总结 提出OmniLoc,首个基于无线测量的基础模型,通过统一输入分词、几何感知Transformer和几何感知位置估计模块,实现跨室内环境的鲁棒无锚点定位,显著优于现有方法。

详情
AI中文摘要

由于建筑几何形状、可检测接入点(AP)集合以及接收信号异质性的显著变化,基于无线测量的室内定位在大规模部署中仍然具有挑战性。现有的基于学习的方法通常仅在有限环境下表现良好,并在环境变化下性能下降,使得在多样室内环境中进行鲁棒的无锚点定位变得极其困难。本文提出OmniLoc,一种环境交互式基础模型,用于跨多样室内环境的无锚点用户设备定位。据我们所知,OmniLoc是首个直接基于无线测量构建的用于此任务的基础模型。OmniLoc基于三个关键设计。首先,统一输入分词模块将异构无线测量转换为更易于学习的通用表示。其次,几何感知Transformer通过强调主导AP同时聚合来自辅助AP的互补证据,执行AP感知特征提取。第三,几何感知位置估计模块根据几何嵌入进行回归,以生成几何一致的位置预测。我们在大规模内部数据集和公共基准数据集上评估OmniLoc。结果表明,OmniLoc显著优于现有方法,当其设计组件集成时能持续改进现有骨干网络,并在跨环境评估中展现出强大的泛化能力。

英文摘要

Indoor localization from wireless measurements remains challenging in large-scale deployments due to substantial variation in building geometry, the set of detectable access points (APs), and the heterogeneity of received signals. Existing learning-based methods often perform well only in limited settings and degrade under environmental shifts, making robust anchor-free localization across diverse indoor environments notoriously difficult. In this paper, we present OmniLoc, an environment-interactive foundation model for anchor-free user equipment localization across diverse indoor environments. To the best of our knowledge, OmniLoc is the first foundation-model-based approach built directly on wireless measurements for this task. OmniLoc is built on three key designs. First, a unified input tokenization module converts heterogeneous wireless measurements into a common representation that is more amenable to learning. Second, a geometry-aware Transformer performs AP-aware feature extraction by emphasizing dominant APs while aggregating complementary evidence from supporting APs. Third, a geometry-aware location estimation module conditions regression on geometric embeddings to produce geometrically consistent location predictions. We evaluate OmniLoc on both a large-scale in-house dataset and a public benchmark dataset. Results show that OmniLoc significantly outperforms existing methods, consistently improves existing backbones when its design components are integrated, and demonstrates strong generalization in cross-environment evaluations.

2606.11553 2026-06-11 cs.LG 新提交

APEX: A Network-Native Time-Series Foundation Model for Forecasting and Anomaly Detection for Wireless Edge Operations

APEX:面向无线边缘运维的预测与异常检测的网络原生时间序列基础模型

Swadhin Pradhan, Niloo Bahadori, Peiman Amini

发表机构 * Cisco Systems, USA(思科系统公司)

AI总结 提出网络原生解码器Transformer APEX,针对企业AP遥测数据预训练,在DHCP退化基准上MAE比最强基线降低18%,异常检测F1=0.93,边缘版本实现亚秒级隐私保护推理。

详情
Comments
5 pages, 1 figure, 4 tables. Discusses a network-native time-series foundation model for wireless edge operations
AI中文摘要

通用时间序列基础模型对无线网络遥测数据的迁移效果较差,因为这些信号具有突发性、零膨胀性且跨协议层耦合。我们提出APEX,一个网络原生的、仅解码器的Transformer,用于预测企业AP遥测数据,并以DHCP退化作为代表性网络任务进行评估。APEX在来自约4500个生产无线网络的10通道多变量遥测数据(约10万AP时间序列,每个AP 34个指标)上预训练,并提供APEX-Large(269M参数,云端)和APEX-Edge(10.5M参数,边缘)两个版本。在192步(4天)的DHCP退化基准上,APEX-Large比最强的基础模型基线(Toto)MAE降低18%,比SARIMA降低38%,异常检测F1=0.93,而APEX-Edge能够在AP级边缘硬件上实现亚秒级、保护隐私的推理。这些结果表明,网络原生预训练是主动无线运维的实用基础。

英文摘要

Generic time-series foundation models transfer poorly to wireless network telemetry whose signals are bursty, zero-inflated, and coupled across protocol layers. We present APEX, a network-native, decoder-only transformer for forecasting enterprise AP telemetry, and evaluate it on DHCP degradation as a representative network task. APEX is pre-trained on 10-channel multivariate telemetry from ~4,500 production wireless networks (~100K AP time series, 34 metrics per AP), and is available as APEX-Large (269M, cloud) and APEX-Edge (10.5M, edge). On a 192-step (4-day) DHCP degradation benchmark, APEX-Large reduces MAE by 18% over the strongest foundation-model baseline (Toto) and 38% over SARIMA, with anomaly-detection F1 = 0.93, while APEX-Edge enables sub-second, privacy-preserving inference on AP-class edge hardware. These results suggest network-native pre-training is a practical foundation for proactive wireless operations.

2606.11605 2026-06-11 cs.LG cs.AI 新提交

Physics-Distilled Neural Network enabled by Large Language Models for Manufacturing Process-Property Predictive Modeling

基于大语言模型的物理蒸馏神经网络用于制造过程-性能预测建模

Ge Song, Kiarash Naghavi Khanghah, Anandkumar Patel, Rajiv Malhotra, Hongyi Xu

AI总结 提出一种知识蒸馏框架,利用大语言模型从文献中提取物理先验,通过图掩码注意力层捕获变量依赖,蒸馏至轻量学生模型,在数据稀缺下实现高精度预测与实时部署。

详情
Comments
Under review, Journal of Computing and Information Science in Engineering
AI中文摘要

预测制造过程中的过程-性能关系常面临高实验成本和复杂'黑箱'模型可解释性有限的挑战。本文提出一种新颖的知识蒸馏框架,旨在数据稀缺场景下实现高精度预测。该框架将分析性物理先验(通过大语言模型从科学文献中系统提取)集成到特权教师模型中。我们采用图掩码注意力层来捕获输入变量间复杂的物理依赖关系,这些变量表现为严格设定点或静态与高频时间特征的组合。这种特权知识被蒸馏到轻量级学生预测器中进行推理。通过在五种不同制造过程中的综合实验,评估了该框架的可行性和鲁棒性。为确保统计可靠性,鉴于数据集规模较小,采用重复K折交叉验证技术来量化模型稳定性和泛化能力。结果表明,所提框架在所有评估领域均持续实现高预测精度。最重要的是,该架构表现出显著的容错性,即使在LLM推导的分析先验次优或不完整的情况下,也能保持稳健的预测性能。此外,学生预测器的推理频率超过6000 Hz,便于在标准工业硬件上进行实时边缘部署。这项工作为在数据受限环境下弥合理论物理与实时工业监测之间的差距提供了可扩展的解决方案。

英文摘要

Predicting process-property relationships in manufacturing is often challenged by high experimental costs and the limited interpretability of complex 'black-box' models. This paper proposes a novel knowledge distillation framework designed to achieve high-accuracy predictions in data-scarce scenarios. The framework integrates analytical physics priors, which are systematically extracted from scientific literature via Large Language Models, into a privileged teacher model. We employ a Graph-Masked Attention layer to capture the complex physical dependencies among input variables showing strict setpoints or a combination of static and high-frequency temporal signatures. This privileged knowledge is distilled into a lightweight student predictor for inference. The feasibility and robustness of the framework are evaluated through a comprehensive experiment across five diverse manufacturing processes. To ensure statistical reliability, given the small dataset sizes, a repeated K-fold cross-validation technique is employed to quantify model stability and generalization. Results indicate that the proposed framework consistently achieves high predictive accuracy across all evaluated domains. Most importantly, the architecture demonstrates significant fault tolerance by maintaining robust predictive performance even in scenarios where LLM-derived analytical priors are suboptimal or incomplete. Furthermore, the student predictor achieves an inference frequency exceeding 6000 Hz, which facilitates real-time edge deployment on standard industrial hardware. This work provides a scalable solution for bridging the gap between theoretical physics and real-time industrial monitoring in data-limited environments.

2606.11650 2026-06-11 cs.LG math.NA physics.comp-ph 新提交

Structure-Preserving Neural Surrogates with Tractable Uncertainty Quantification

具有可处理不确定性量化的保结构神经代理模型

Handi Zhang, Adrienne M. Propp, Brooks Kinch, Houman Owhadi, Nathaniel Trask

发表机构 * University of Pennsylvania(宾夕法尼亚大学) Stanford University(斯坦福大学) California Institute of Technology(加州理工学院)

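AI Summary: Proposes a structure-preserving reduced-order model that combines mixed finite element spaces with Gaussian process regression, uses topological structure to quantify uncertainty in state-flux relationships, and derives a Dirichlet-to-Neumann map with closed-form posterior uncertainty.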

详情
AI中文摘要

科学机器学习的最新进展为偏微分方程(PDE)的近实时求解提供了一种手段,但缺乏支持当代验证与确认的传统模拟器的理论基础。在这项工作中,我们构建了数据驱动的降阶模型,作为保结构、实时代理模型。值得注意的是,施加物理守恒结构的外微分也揭示了拓扑结构,我们利用该结构构建了状态-通量关系中不确定性的高斯过程(GP)表示,最终为目标量导出具有后验不确定性闭式表达的狄利克雷-诺伊曼映射。我们特别提出了由轻量级Transformer规定的传统Raviart-Thomas和$dgP_0$单元的保结构$H(\mathrm{div})$--$L^2$子空间。通过提出一个守恒律来学习与该子空间一致的降阶动力学,其中GP描述了体积之间的通量。这项工作依赖于混合有限元空间与GP回归之间的新颖接口;当训练被表述为最优恢复问题(ORP)时,得到的GP回归可以写成一个带有等式约束的优化问题,该约束施加了守恒结构,适用于快速的Schur补训练策略。然后,训练好的模型可以实时求解,得到由指定狄利克雷数据驱动的边界通量的闭式估计量。本文包括线性泛函的RKHS后验误差界以支持不确定性量化,以及数值实验证明了后验分布作为误差估计代理的准确性。

英文摘要

Recent advances in scientific machine learning provide a means of near-real-time solution to partial differential equations (PDEs), but lack the theoretical underpinnings of conventional simulators that support contemporary verification and validation. In this work, we construct data-driven reduced-order models that serve as structure-preserving, real-time surrogates. Remarkably, the exterior calculus that imposes physical conservation structure also exposes topological structure that we use to build a Gaussian process (GP) representation of uncertainty in state-flux relationships, ultimately yielding a Dirichlet-to-Neumann map for quantities of interest with closed-form expressions for posterior uncertainty. We specifically propose structure-preserving $H(\mathrm{div})$--$L^2$ subspaces of conventional Raviart--Thomas and $dgP_0$ elements prescribed by a lightweight transformer. Reduced-order dynamics consistent with this subspace are learned by posing a conservation law in which a GP describes the fluxes between volumes. This work hinges on a novel interface between mixed FEM spaces and GP regression; when training is posed as the optimal recovery problem (ORP), the resulting GP regression can be written as an optimization problem with equality constraints that impose a conservation structure, amenable to a fast Schur-complement training strategy. The trained model can then be solved in real time with closed-form estimators for boundary fluxes driven by prescribed Dirichlet data. The paper includes RKHS posterior error bounds for linear functionals to support uncertainty quantification, as well as numerical experiments demonstrating the accuracy of the posterior distribution as a surrogate for error estimation.

2606.11651 2026-06-11 cs.LG q-bio.QM stat.AP 新提交

DeepRHP: A Hybrid Variational Autoencoder for Designing Random Heteropolymers as Protein Mimics

DeepRHP:一种用于设计随机异聚合物作为蛋白质模拟物的混合变分自编码器

Shuni Li, Zhiyuan Ruan, Andy Shen, Ivan Jayapurna, Ting Xu, Haiyan Huang

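AI Summary: Proposes DeepRHP, a hybrid variational autoencoder that combines a feature-based VAE with a classical VAE under a semi-supervised framework. Its latent space captures key chemical features and sequence patterns to guide random heteropolymer design, and its suggested compositions for stabilizing membrane proteins are cross-validated against published results.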

详情
Comments
Oral presentation at AAAI 2023 Workshop on AI to Accelerate Science and Engineering
AI中文摘要

由预定义单体组成的合成随机异聚合物(RHP)为设计类蛋白质材料提供了一种方法。如果设计得当,这些RHP可以模拟蛋白质的行为和功能。因此,需要计算工具来有效指导RHP设计。我们通过开发DeepRHP(一种在半监督框架下改进的变分自编码器(VAE)模型)来弥补这一差距。通过为经典VAE配备额外的基于特征的VAE,DeepRHP迫使潜在空间捕获关键化学特征的结构以及单个RHP序列模式。从这个意义上说,我们的方法是通用的,允许以混合方式纳入任何相关特征。我们通过提出在非原生环境中稳定膜蛋白(例如水通道蛋白Z)的潜在单体组成,并将我们的预测与已发表的结果进行交叉验证,证明了DeepRHP的有效性。我们的模型与真实RHP功能之间的一致性表明,利用混合自编码器架构来指导蛋白质和其他生物化合物的RHP设计具有巨大潜力。

英文摘要

Synthetic random heteropolymers (RHPs), consisting of a predefined set of monomers, offer an approach toward the design of protein-like materials. These RHPs, if designed appropriately, can mimic protein behavior and function. As such, there is a need for computational tools to efficiently guide RHP design. We bridge this gap by developing DeepRHP, a modified variational autoencoder (VAE) model under a semi-supervised framework. By equipping a classical VAE with an additional feature-based VAE, DeepRHP forces the latent space to capture structures of critical chemical features as well as individual RHP sequence patterns. In this sense, our method is versatile by allowing any relevant features to be incorporated in a hybrid manner. We demonstrate the effectiveness of DeepRHP by suggesting potential monomer compositions that stabilize membrane proteins (e.g. Aquaporin Z) in non-native environments and cross-validating our prediction with published results. The concordance between our model and true RHP function suggests strong potential in utilizing hybrid autoencoder architectures to guide RHP design for proteins and other biological compounds.

2606.11793 2026-06-11 cs.LG cs.AI physics.ao-ph 新提交

AI4Land: Scalable Deep Learning for Global High-Resolution Land Use Reconstruction

AI4Land: 面向全球高分辨率土地利用重建的可扩展深度学习

Amirpasha Mozaffari, Marina Castaño, Stefano Materia, Etienne Tourigny, Oscar Molina-Sedano, Jordi Varela-Agrelo, Dario Garcia-Gasulla, Miguel Castrillo Melguizo, Mario Acosta, Amanda Duarte

发表机构 * Barcelona Supercomputing Center(巴塞罗那超级计算中心)

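AI Summary: Proposes the AI4Land framework, a two-phase U-Net approach that combines coarse-resolution scenario data with static geophysical features to reconstruct high-resolution annual land use and land cover, aiming to reduce terrestrial carbon cycle uncertainty and support climate simulations.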

详情
AI中文摘要

陆地碳循环的不确定性仍是气候预测的主要制约因素,部分源于地球系统模型中陆面表征和变率的不确定性。为解决此问题,我们提出了数据驱动框架AI4Land,用于生成关键陆面变量的高分辨率历史重建和未来预测。该框架采用U-Net架构的两阶段方法。在第一阶段(本文重点),它通过整合粗分辨率情景数据与静态地理特征,重建年度土地利用与土地覆盖。在计划的第二阶段,生成的高分辨率地图将用于在更细时间尺度上预测动态生物物理变量,特别是叶面积指数。模型基于地球观测数据训练,学习再现空间明确且物理一致的陆面模式,并将时间覆盖扩展到缺乏直接观测的时期。AI4Land在MareNostrum5上开发和训练,展示了GPU加速的高性能计算基础设施如何支持全球尺度的气候AI流水线。最终产品是一套开源模拟器,旨在与数字孪生平台(如Destination Earth计划下开发的平台)实时耦合。通过按需提供逼真且演变的陆面条件,本工作旨在减少关键不确定性,提高下一代气候模拟的预测能力。

英文摘要

Uncertainty in the terrestrial carbon cycle remains a major constraint in climate projections, partly driven by the uncertainties affecting the land surface representation and variability in Earth system models. To address this limitation, we present a data-driven framework, AI4Land, for generating high-resolution historical reconstructions and future projections of key land surface variables. The framework follows a two-phase approach using a U-Net architecture. In the first phase, which is the focus of this work, it reconstructs annual land use and land cover by integrating coarse-resolution scenario data with static geophysical features. In a planned second phase, the resulting high-resolution maps will be used to predict dynamic biophysical variables, particularly leaf area index, at finer temporal scales. Trained on Earth observation data, the models learn to reproduce spatially explicit and physically consistent land surface patterns, extending temporal coverage to periods lacking direct observations. AI4Land was developed and trained on MareNostrum5, demonstrating how GPU-accelerated HPC infrastructure enables global-scale climate AI pipelines. The final product is a suite of open-source emulators designed for real-time coupling with digital twin platforms, such as those developed under the Destination Earth initiative. By delivering realistic and evolving land surface conditions on demand, this work aims to reduce critical uncertainties and improve the predictive power of next-generation climate simulations.

2606.11794 2026-06-11 cs.LG cs.AI 新提交

Multimodal Ordinal Modeling of Alzheimer's Disease Severity Using Structural MRI and Clinical Data

使用结构MRI和临床数据的阿尔茨海默病严重程度的多模态序数建模

Boris-Stephan Rauchmann, Jonathan Laib, Buse Ercik, Robert Perneczky, Sergio Altares-López

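AI Summary: Proposes an attention-enhanced multimodal ordinal regression framework that integrates MRI, demographic, and genetic data for automated, interpretable AD severity staging. Evaluated on ADNI, AIBL, and NIFD, the ordinal multimodal model achieves the highest adjacent-stage accuracy (0.970) and the strongest agreement with clinical staging (QWK 0.549).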

详情
Comments
18 pages. Submitted to journal for review
AI中文摘要

神经退行性疾病如阿尔茨海默病(AD)需要准确且可扩展的工具来评估疾病严重程度,然而当前的临床分期仍然耗时且易变。我们提出了一种带有注意力增强的多模态机器学习框架,结合序数回归,用于自动且可解释的AD严重程度分期。该框架整合了T1加权MRI与人口统计学和遗传变量,并使用序数和非序数预测头比较了单模态和多模态架构。模型使用来自ADNI、AIBL和NIFD数据集的队列分层划分进行训练和验证。严格保留的测试集由排除在所有训练、验证、预处理和超参数调优过程之外的受试者构建,并在整个过程中采用受试者级划分以防止数据泄漏。在单模态方法中,T1加权MRI模型在相邻阶段准确率(0.963)和与临床分期的一致性(QWK 0.444)上略高于表格模型(QWK 0.433)。整合成像、人口统计学和遗传信息提高了整体性能。多模态非序数基线实现了最低的预测误差(MAE 0.340),而序数多模态模型实现了最高的相邻阶段准确率(0.970)和与临床分期的最强一致性(QWK 0.549)。这些发现表明,序数公式更好地捕捉了CDR量表的顺序结构,并产生与临床分期更一致的预测。使用Grad CAM++和SHAP的可解释性分析展示了解剖学和临床上合理的模型行为,支持透明决策。总体而言,基于注意力的多模态学习与序数回归代表了一种稳健、可解释且可扩展的方法,用于自动AD严重程度分期和AI辅助临床决策支持。

英文摘要

Neurodegenerative diseases such as Alzheimer's disease (AD) require accurate and scalable tools for assessing disease severity, yet current clinical staging remains time-intensive and prone to variability. We propose an attention-enhanced multimodal machine learning framework with ordinal regression for automated and interpretable AD severity staging. The framework integrates T1-weighted MRI with demographic and genetic variables and compares unimodal and multimodal architectures using ordinal and non-ordinal prediction heads. Models were trained and validated using cohort-stratified splits derived from the ADNI, AIBL, and NIFD datasets. A strictly held-out test set was constructed using subjects excluded from all training, validation, preprocessing, and hyperparameter tuning procedures, with subject-level splitting employed throughout to prevent data leakage. Among unimodal approaches, the T1-weighted MRI model achieved slightly higher adjacent-stage accuracy (0.963) and agreement with clinical staging (QWK 0.444) than the tabular model (QWK 0.433). Integrating imaging, demographic, and genetic information improved overall performance. The multimodal non-ordinal baseline achieved the lowest prediction error (MAE 0.340), whereas the ordinal multimodal model achieved the highest adjacent-stage accuracy (0.970) and strongest agreement with clinical staging (QWK 0.549). These findings indicate that ordinal formulations better capture the ordered structure of the CDR scale and yield predictions more consistent with clinical staging. Explainability analyses using Grad-CAM++ and SHAP demonstrated anatomically and clinically plausible model behavior, supporting transparent decision-making. Overall, attention-based multimodal learning with ordinal regression represents a robust, interpretable, and scalable approach for automated AD severity staging and AI-assisted clinical decision support.
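The two agreement metrics reported above can be made concrete with a short sketch. It assumes stages are encoded as integers `0..n_classes-1` and that "adjacent-stage accuracy" means a prediction within one stage of the true label. That reading is an assumption, as the abstract does not define the metric, and the function names are illustrative.

```python
def quadratic_weighted_kappa(y_true, y_pred, n_classes):
    """Quadratic weighted kappa (QWK) for integer-coded ordinal labels."""
    n = len(y_true)
    # observed[i][j]: number of samples with true stage i and predicted stage j.
    observed = [[0.0] * n_classes for _ in range(n_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1.0
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(n_classes))
                 for j in range(n_classes)]
    num = 0.0  # weighted observed disagreement
    den = 0.0  # weighted disagreement expected by chance
    for i in range(n_classes):
        for j in range(n_classes):
            weight = ((i - j) ** 2) / ((n_classes - 1) ** 2)
            num += weight * observed[i][j]
            den += weight * true_hist[i] * pred_hist[j] / n
    return 1.0 - num / den


def adjacent_accuracy(y_true, y_pred):
    """Fraction of predictions within one stage of the truth (assumed definition)."""
    return sum(abs(t - p) <= 1 for t, p in zip(y_true, y_pred)) / len(y_true)
```

Because the quadratic weights grow with the distance between stages, QWK penalizes a two-stage error four times as heavily as a one-stage error, which is why it can rank models differently from MAE or adjacent-stage accuracy.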

2606.11868 2026-06-11 cs.LG q-bio.QM 新提交

MemNovo: Look Back at the Spectrum for Balanced De Novo Peptide Sequencing from Mass Spectrometry

MemNovo: 回顾谱图以实现质谱中平衡的从头肽段测序

Dongxin Lyu, Jingbo Zhou, Hongxin Xiang, Yuqiang Li, Jun Xia

发表机构 * Westlake University(西湖大学) Hunan University(湖南大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) HKUST-GZ & HKUST(香港科技大学(广州)与香港科技大学)

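AI Summary: Existing Transformer models for de novo peptide sequencing over-rely on generated-sequence priors and under-use spectral evidence. MemNovo is a training-free, plug-and-play mechanism that builds a persistent spectral memory bank and injects retrieved spectral features at the final decoding stage through an ultra-conservative residual connection, consistently improving amino acid and peptide precision.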

详情
Comments
Code: this https URL
AI中文摘要

从串联质谱中进行从头肽段测序是蛋白质组学的关键,能够在不依赖参考数据库的情况下识别新型肽段。尽管基于Transformer的编码器-解码器模型已取得显著性能,但我们发现其推理动态中存在关键病理现象。通过全面的特征缩放实验,我们证明现有的自回归肽段解码器倾向于过度依赖生成序列的先验,同时逐渐未能充分利用输入质谱中的细粒度物理证据。这一现象导致次优结果,生成的肽段序列在生物学上合理但不符合输入谱图。为解决此问题,我们提出MemNovo,一种无需训练且即插即用的机制,在推理时重新平衡肽段和谱图的贡献。MemNovo通过建立持久的谱记忆库,并通过超保守残差连接将检索到的特征直接注入最终解码阶段,从而缓解信息瓶颈。理论分析证实,该机制恢复了解码器状态与原始谱图之间的互信息。在Nine Species基准上使用两个代表性基线模型Casanovo和InstaNovo进行的大量实验表明,MemNovo持续提高了氨基酸精度和肽段精度,对于Casanovo,肽段精度相对提升高达39.1%,对于InstaNovo提升高达3.9%,且计算开销可忽略不计。

英文摘要

De novo peptide sequencing from tandem mass spectrometry is pivotal in proteomics, enabling identification of novel peptides without reference databases. While recent Transformer-based encoder-decoder models have achieved remarkable performance, we uncover a critical pathology in their inference dynamics. Through comprehensive feature scaling experiments, we demonstrate that existing auto-regressive peptide decoders tend to over-rely on generated-sequence priors while progressively under-utilizing fine-grained physical evidence from the input mass spectrum. This phenomenon leads to suboptimal results, where generated peptide sequences are biologically plausible yet not faithful to the input spectrum. To rectify this, we propose MemNovo, a training-free and plug-and-play mechanism that re-balances peptide and spectral contributions at inference time. MemNovo alleviates the information bottleneck by establishing a persistent spectral memory bank and injecting retrieved features directly into the final decoding stage via an ultra-conservative residual connection. Theoretical analysis confirms that this mechanism restores the mutual information between the decoder state and the raw spectrum. Extensive experiments on the Nine Species benchmark with two representative baselines, Casanovo and InstaNovo, demonstrate that MemNovo consistently improves both amino acid precision and peptide precision, achieving up to 39.1% relative improvement in peptide precision for Casanovo and up to 3.9% for InstaNovo, with negligible computational overhead.

2606.11893 2026-06-11 cs.LG cs.AI cs.CL q-bio.NC 新提交

Beyond representational alignment with brain-guided language models for robust reasoning

超越表征对齐:基于大脑引导的语言模型实现稳健推理

Mingqing Xiao, Kai Du, Zhouchen Lin

发表机构 * State Key Lab of General AI, School of Intelligence Science and Technology, Peking University(北京大学通用人工智能国家重点实验室、智能科学与技术学院) Department of Psychological and Cognitive Sciences, Tsinghua University(清华大学心理与认知科学系) Microsoft Research Asia(微软亚洲研究院)

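AI Summary: Studies how fMRI signals can enhance the reasoning ability of large language models and proposes a brain-guided framework that yields up to a 13% absolute accuracy gain across 10 models.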

详情
AI中文摘要

大型语言模型(LLMs)与人类高阶认知背后的神经机制之间的对应关系仍未得到充分表征。鉴于人脑中语言和推理似乎是可分离的,一个开放的问题是LLMs是否与来自推理相关区域的神经信号对齐,以及这些信号是否能够改进它们。在此,我们聚焦于演绎推理,表明LLM内部表征不仅与任务fMRI活动部分对齐,而且可以直接通过这些信号增强。使用神经预测性度量,我们发现LLMs在聚合水平上解释了推理相关区域中可解释方差的很大一部分,而在特定推理类型内的预测性较低,表明对齐和分歧并存。基于此,我们提出一个脑引导框架:我们沿着由模型和大脑表征的联合结构诱导的方向引导模型表征,在推理时进行干预,在训练时进行微调。我们证明任务诱发的脑信号可以直接增强LLM推理,在10个LLM(1.5B-72B)上产生与仅语言监督正交的增益,具有跨推理类型的迁移,以及高达13%的绝对准确率提升。我们的结果将LLM-大脑对应关系从相关性推进到引导,建立了一条由脑信号驱动的路径,通向更稳健和认知对齐的AI。

英文摘要

The correspondence between large language models (LLMs) and the neural mechanisms underlying human higher-order cognition remains insufficiently characterized. Given that language and reasoning in the human brain appear dissociable, an open question is whether LLMs align with neural signals from reasoning-related regions and whether such signals can improve them. Here, focusing on deductive reasoning, we show that LLM internal representations are not only partially aligned with task-fMRI activity but can also be directly enhanced by these signals. Using a neural-predictivity metric, we find that LLMs explain a substantial fraction of the explainable variance in reasoning-related regions at the aggregate level, whereas predictivity within specific reasoning types is lower, indicating both alignment and divergence. Building on this, we propose a brain-guided framework: we steer model representations along directions induced by the joint structure of model and brain representations, applying intervention at inference and fine-tuning during training. We demonstrate that task-evoked brain signals can directly enhance LLM reasoning, yielding gains orthogonal to language-only supervision across 10 LLMs (1.5B-72B), with transfer across reasoning types and up to 13% absolute accuracy gain. Our results advance LLM-brain correspondences from correlation to guidance, establishing a brain-signal-driven pathway toward more robust and cognitively aligned AI.

2606.11990 2026-06-11 cs.LG cs.AI 新提交

Time-Series Foundation Model Embeddings for Remaining Useful Life Estimation

用于剩余使用寿命估计的时间序列基础模型嵌入

Amir El-Ghoussani, Michele De Vita, Ronald Naumann, Vasileios Belagiannis

发表机构 * University of Erlangen-Nuremberg(埃尔朗根-纽伦堡大学) Siemens AG(西门子股份公司)

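AI Summary: Uses the frozen pretrained time-series foundation model Chronos-2 as a backbone with a lightweight regression head for remaining useful life prediction, outperforming recurrent, convolutional, Transformer-based, and gradient-boosting baselines on industrial sensor data.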

详情
Comments
Accepted to EUSIPCO 2026, 4 pages, 2 figures
AI中文摘要

剩余使用寿命(RUL)预测对于工业预测性维护至关重要,然而许多基于学习的方法依赖于大量的特征工程或大型标注数据集来训练特定任务的序列模型。在这项工作中,我们引入了一种轻量级学习方法,利用冻结的预训练时间序列基础模型(TSFM),并将其与一个小型回归头结合,用于从多变量传感器流中估计RUL。具体来说,我们使用Chronos-2作为冻结骨干来提取上下文窗口特征,并训练一个轻量级回归神经网络进行RUL预测。在来自两种设备类型的真实工业传感器数据上的实验表明,在相同的预处理和评估协议下,Chronos-2特征一致地优于循环、卷积、基于Transformer和梯度提升基线。我们进一步分析了上下文长度的影响,发现随着历史记录变长,性能显著提升,这表明TSFM表示为工业环境中的RUL估计提供了一种实用且数据高效的替代方案。

英文摘要

Remaining Useful Life (RUL) prediction is essential for industrial predictive maintenance, yet many learning-based approaches rely on extensive feature engineering or large labeled datasets to train task-specific sequence models. In this work, we introduce a lightweight learning approach, in which we leverage a frozen pretrained time-series foundation model (TSFM) and combine it with a small regression head for RUL estimation from multivariate sensor streams. More specifically, we use Chronos-2 as a frozen backbone to extract context window features and train a lightweight regression neural network for RUL prediction. Experiments on real-world industrial sensor data from two device types show that Chronos-2 features consistently improve over recurrent, convolutional, Transformer-based, and gradient-boosting baselines under the same preprocessing and evaluation protocol. We further analyze the impact of context length and find that performance improves significantly with longer histories, indicating that TSFM representations offer a practical and data-efficient alternative for RUL estimation in industrial settings.

2606.12006 2026-06-11 cs.LG cs.AI 新提交

Tabular Foundation Models for Clinical Survival Analysis via Survival-Aware Adaptation

通过生存感知适配的临床生存分析表格基础模型

Minh-Khoi Pham, Luca Cotugno, Alina Sirbu, Tai Tan Mai, Martin Crane, Marija Bezbradica

发表机构 * ADAPT Centre, Dublin City University(ADAPT中心,都柏林城市大学) School of Computing, Dublin City University(都柏林城市大学计算机学院) Department of Computer Science and Engineering, University of Bologna(博洛尼亚大学计算机科学与工程系)

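AI Summary: Proposes a lightweight adaptation method that pairs tabular foundation models (TabPFN, TabDPT, TabICL) with a multi-task logistic regression head for clinical survival analysis, achieving competitive or superior performance on several benchmarks and two ICU cohorts.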

详情
Comments
Accepted for publication at International Conference on AI in Healthcare 2026
AI中文摘要

预测死亡率等时间至事件结果是临床决策中的基本任务,通常通过生存分析来解决。虽然经典的统计和深度学习方法已被广泛研究,但它们通常需要特定任务的训练和足够的标记数据。最近表格基础模型的进展通过学习结构化数据的通用表示提供了一种新范式。然而,它们在临床环境中对删失时间至事件预测的适用性仍未得到充分探索,因为典型应用仅限于离散分类而非生存分析任务。在这项工作中,我们提出了一种轻量级适配方法,通过直接在预训练表示之上训练一个生存感知头,将表格基础模型应用于临床生存分析。我们研究了代表性架构,包括TabPFN、TabDPT和TabICL,并使用多任务逻辑回归(MTLR)头对它们进行适配,以建模右删失时间至事件结果。我们在多个公开生存基准和两个大规模ICU队列MIMIC-IV和eICU上评估了该方法。我们的结果表明,这种迁移学习方法与强基线相比达到了竞争性或更优的性能。在MIMIC-IV上,TabDPT-FT-MTLR达到了0.856的C指数,相对于最佳非FM基线(DeepSurv,0.844)相对提升了+1.4%,相对于最佳零样本模型(0.802)提升了+6.7%。在eICU上,TabICL-FT-MTLR达到了0.797,分别获得了+1.7%(DeepSurv,0.784)和+6.4%(0.749)的提升。这些发现强调了将预训练表格表示与生存感知目标相结合的重要性,并表明表格基础模型为临床生存预测提供了一种实用且有效的替代方案。

英文摘要

Predicting time-to-event outcomes such as mortality is a fundamental task in clinical decision-making, commonly addressed through survival analysis. While classical statistical and deep learning approaches have been widely studied, they typically require task-specific training and sufficient labeled data. Recent advances in tabular foundation models offer a new paradigm by learning general-purpose representations for structured data. However, their applicability to censored time-to-event prediction in clinical settings remains underexplored, as typical applications are restricted to discrete classification rather than survival analysis tasks. In this work, we propose a lightweight adaptation approach for applying tabular foundation models to clinical survival analysis by directly training a survival-aware head on top of the pretrained representations. We study representative architectures, including TabPFN, TabDPT, and TabICL, and adapt them using a multi-task logistic regression (MTLR) head to model right-censored time-to-event outcomes. We evaluate this approach on a diverse set of public survival benchmarks and two large-scale ICU cohorts, MIMIC-IV and eICU. Our results show that this transfer learning approach achieves competitive or superior performance compared to strong baselines. On MIMIC-IV, TabDPT-FT-MTLR reaches a C-index of 0.856, corresponding to a relative improvement of +1.4% over the best non-FM baseline (DeepSurv, 0.844) and +6.7% over the best zero-shot model (0.802). On eICU, TabICL-FT-MTLR achieves 0.797, yielding gains of +1.7% (DeepSurv, 0.784) and +6.4% (0.749), respectively. These findings highlight the importance of combining pretrained tabular representations with survival-aware objectives and suggest that tabular foundation models provide a practical and effective alternative for clinical survival prediction.
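The relative improvements quoted above follow from the reported C-index values, and the C-index itself is simple to state in code. The sketch below is a plain textbook version of Harrell's concordance index for right-censored data, not the authors' evaluation code; the function names are illustrative.

```python
def relative_improvement(new, baseline):
    """Relative improvement of `new` over `baseline`, in percent."""
    return 100.0 * (new - baseline) / baseline


def concordance_index(times, events, risks):
    """Harrell's C-index for right-censored data.

    A pair (i, j) is comparable when subject i has an observed event
    (events[i] is truthy) and a strictly shorter time than subject j.
    The pair is concordant when the model gives i the higher risk score.
    Ties in risk count as half. Assumes at least one comparable pair.
    """
    concordant = 0.0
    comparable = 0
    n = len(times)
    for i in range(n):
        for j in range(n):
            if events[i] and times[i] < times[j]:
                comparable += 1
                if risks[i] > risks[j]:
                    concordant += 1.0
                elif risks[i] == risks[j]:
                    concordant += 0.5
    return concordant / comparable


# Reproduce the percentages from the reported C-index values.
mimic_vs_deepsurv = round(relative_improvement(0.856, 0.844), 1)
mimic_vs_zero_shot = round(relative_improvement(0.856, 0.802), 1)
eicu_vs_deepsurv = round(relative_improvement(0.797, 0.784), 1)
eicu_vs_zero_shot = round(relative_improvement(0.797, 0.749), 1)
```

Note that these are relative gains. The corresponding absolute C-index differences on MIMIC-IV are 0.012 and 0.054.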

2606.12077 2026-06-11 cs.LG 新提交

Efficient Time Series Clustering from Multiscale Reservoir Dynamics with Granular-Ball Anchoring Graph Optimization

基于多尺度储层动力学与粒球锚定图优化的高效时间序列聚类

Yifan Wang, Lifeng Shen, Shuyin Xia, Yi Wang

发表机构 * Chongqing Key Laboratory of Computational Intelligence, Key Laboratory of Cyberspace Big Data Intelligent Security, Ministry of Education, Sichuan-Chongqing Co-construction Key Laboratory of Digital Economy Intelligence and Key Laboratory of Big Data Intelligent Computing, College of Computer Science and Technology, Chongqing University of Posts and Telecommunications(重庆邮电大学计算机科学与技术学院,计算智能重庆市重点实验室,网络空间大数据智能安全教育部重点实验室,川渝共建数字经济智能重点实验室,大数据智能计算重点实验室) Chongqing Ant Consumer Finance Co., Ltd., Ant Group(蚂蚁集团,重庆蚂蚁消费金融有限公司)

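AI Summary: Proposes the MSRGC-Net framework, which combines training-free reservoir computing, granular-ball anchor graph construction, and consensus learning for efficient and accurate time-series clustering.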

详情
Comments
Accepted by IJCAI 2026
AI中文摘要

时间序列聚类由于聚类效果与计算效率之间的固有权衡仍然具有挑战性。基于相似性的方法通常因成对距离计算而面临二次复杂度,而基于深度学习的方法通常依赖于昂贵的迭代训练和大量可训练参数。在本文中,我们提出了MSRGC-Net,一种高效的时间序列聚类框架,它集成了多尺度储层计算、基于粒球的锚定图构建和共识学习。MSRGC-Net采用无训练的储层计算范式,从原始时间序列中提取多尺度时间表示,无需反向传播,显著降低了计算开销。为了捕捉所得表示的内在结构,采用粒球计算通过密度一致区域自适应地建模数据分布,生成紧凑且鲁棒的锚定图表示。此外,引入了一种基于共识的锚定图优化策略,以有效对齐多尺度储层表示并整合跨时间尺度的互补信息。在广泛使用的单变量和多变量基准数据集上的大量实验表明,MSRGC-Net在聚类性能上持续优于最先进的方法,同时保持卓越的计算效率。

英文摘要

Time-series clustering remains challenging due to the inherent trade-off between clustering effectiveness and computational efficiency. Similarity-based methods often suffer from quadratic complexity caused by pairwise distance computations, while deep learning-based approaches typically rely on costly iterative training and a large number of trainable parameters. In this paper, we propose MSRGC-Net, an efficient time-series clustering framework that integrates multiscale reservoir computing, granular-ball-based anchoring graph construction, and consensus learning. MSRGC-Net adopts a training-free reservoir computing paradigm to extract multiscale temporal representations from raw time series without backpropagation, significantly reducing computational overhead. To capture the intrinsic structure of the resulting representations, granular-ball computing is employed to adaptively model data distributions via density-consistent regions, yielding compact and robust anchor graph representations. Furthermore, a consensus-based anchoring graph optimization strategy is introduced to effectively align multiscale reservoir representations and integrate complementary information across temporal scales. Extensive experiments on widely used univariate and multivariate benchmark datasets demonstrate that MSRGC-Net consistently outperforms state-of-the-art methods in clustering performance while maintaining superior computational efficiency.
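The idea of training-free, multiscale reservoir features can be illustrated with a small leaky echo state network. This is a sketch under stated assumptions, not MSRGC-Net itself: it assumes that "multiscale" can be approximated by running one fixed random reservoir at several leak rates, and the names `reservoir_states` and `multiscale_embedding` and all sizes are hypothetical.

```python
import math
import random


def reservoir_states(series, size=20, leak=0.5, seed=0):
    """Final state of a leaky echo state network driven by a scalar series.

    All weights are random and fixed, so nothing is trained by backpropagation.
    """
    rng = random.Random(seed)
    w_in = [rng.uniform(-1, 1) for _ in range(size)]
    # Each row's absolute sum is at most 0.9, which keeps the update contractive.
    w = [[rng.uniform(-1, 1) * 0.9 / size for _ in range(size)]
         for _ in range(size)]
    state = [0.0] * size
    for u in series:
        pre = [w_in[i] * u + sum(w[i][j] * state[j] for j in range(size))
               for i in range(size)]
        # The leak rate sets how quickly the state forgets older inputs.
        state = [(1 - leak) * s + leak * math.tanh(a)
                 for s, a in zip(state, pre)]
    return state


def multiscale_embedding(series, leaks=(0.1, 0.5, 1.0)):
    """Concatenate final states obtained at several leak rates (assumed scales)."""
    embedding = []
    for leak in leaks:
        embedding.extend(reservoir_states(series, leak=leak))
    return embedding


embedding = multiscale_embedding([0.1, 0.5, -0.3, 0.8])
```

Small leak rates integrate over long histories and large ones track recent inputs, so the concatenated vector mixes slow and fast temporal structure. A downstream clustering step would operate on such embeddings.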

2606.12141 2026-06-11 cs.LG 新提交

PCA-Enhanced Adaptive NVAR Framework for High-Resolution Sea Surface Temperature Forecasting in the East Sea

PCA增强的自适应NVAR框架用于东海高分辨率海面温度预测

Sherkhon Azimov, Susana López-Moreno, Eric Dolores-Cuenca, JinYong Choi, Sangil Kim

发表机构 * Pusan National University(釜山大学)

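AI Summary: Proposes a PCA-enhanced Adaptive NVAR framework that reduces dimensionality with SVD and models temporal evolution with Adaptive NVAR, giving efficient sea surface temperature forecasts for the East Sea with lower error than standard NVAR.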

详情
Comments
14 pages, 7 figures
AI中文摘要

准确预测东海等区域海的海面温度(SST)对于监测海洋生态系统、评估气候风险、管理渔业和执行海军行动至关重要。传统的数值海洋模型提供可靠的预测,但计算成本高,通常不适合实时预测。许多深度学习方法也难以处理高维时空海洋数据,并在较长的预测周期内出现误差累积。本研究基于我们先前提出的自适应下一代储层计算(Adaptive NVAR)框架,该框架最初在合成动力系统上引入和测试,并将其扩展到海洋预测。我们提出了一种降阶预测框架,将奇异值分解(SVD)与自适应NVAR相结合,以预测东海的SST动态。使用SVD将SST场压缩为低维表示,提取海洋变率的主导模态。自适应NVAR对这些潜在状态的时间演化进行建模,并将预测状态重建为SST预测。我们使用区域海洋数据集评估该框架,并将其与标准NG-RC/NVAR进行比较。结果表明,自适应NVAR在多个预测时域上始终实现较低的预测误差。此外,SVD降低了计算复杂度,从而产生了一个适用于实时海洋预测的快速且可扩展的框架。

英文摘要

Accurate forecasting of sea surface temperature (SST) in regional seas such as the East Sea is crucial for monitoring marine ecosystems, assessing climate risks, managing fisheries, and conducting naval operations. Traditional numerical ocean models provide reliable predictions but are computationally expensive and often unsuitable for real-time forecasting. Many deep learning methods also struggle with high-dimensional spatiotemporal ocean data and experience error accumulation over longer forecasting periods. This study builds on our previously proposed Adaptive Next-Generation Reservoir Computing (Adaptive NVAR) framework, initially introduced and tested on synthetic dynamical systems, and extends it to ocean forecasting. We present a reduced-order forecasting framework that combines Singular Value Decomposition (SVD) with Adaptive NVAR to predict SST dynamics in the East Sea. SST fields are compressed into a low-dimensional representation using SVD, which extracts dominant modes of ocean variability. Adaptive NVAR models the temporal evolution of these latent states, and the predicted states are reconstructed into SST forecasts. We evaluate the framework using regional ocean datasets and compare it with the standard NG-RC/NVAR. Results show that Adaptive NVAR consistently achieves lower forecasting errors across multiple prediction horizons. In addition, SVD reduces computational complexity, resulting in a fast and scalable framework suitable for real-time ocean forecasting.
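For readers unfamiliar with NVAR (next-generation reservoir computing), the feature construction it relies on can be sketched as follows. This shows the standard, non-adaptive form with a constant term, time-delayed linear terms, and unique quadratic products; the adaptive variant described in the abstract is not reproduced here. The input vectors stand in for low-dimensional latent states such as SVD coefficients, and the name `nvar_features` is illustrative.

```python
from itertools import combinations_with_replacement


def nvar_features(history, k=2):
    """Build a standard NVAR feature vector from the k most recent states.

    history: list of state vectors, oldest first (e.g. SVD coefficients).
    Returns [1, linear terms, unique quadratic monomials].
    """
    # Linear part: the k most recent states, newest first, concatenated.
    linear = [x for state in reversed(history[-k:]) for x in state]
    # Nonlinear part: every unique pairwise product of the linear terms.
    quadratic = [a * b for a, b in combinations_with_replacement(linear, 2)]
    return [1.0] + linear + quadratic


# Two hypothetical 2-dimensional latent states.
features = nvar_features([[1.0, 2.0], [3.0, 4.0]], k=2)
```

A forecast is then a linear readout of this vector, typically fitted by ridge regression, which is what keeps NVAR cheap compared with training a recurrent network. With `d` latent dimensions and `k` delays the vector has `1 + dk + dk(dk + 1)/2` entries, so reducing `d` first matters.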

2606.12252 2026-06-11 cs.LG cs.AI 新提交

Using Explainability as a Training-Time Reliability Signal for Efficient ECG Classification

使用可解释性作为训练时可靠性信号实现高效心电图分类

Veerendhra Kumar Dangeti, Xiao Gu, Ying Weng, Shreyank N Gowda

发表机构 * School of Computer Science, University of Nottingham(诺丁汉大学计算机科学学院) Institute of Biomedical Engineering, Department of Engineering Science, University of Oxford(牛津大学工程科学系生物医学工程研究所) School of Computer Science, University of Nottingham Ningbo China(宁波诺丁汉大学计算机科学学院)

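AI Summary: Proposes ERTS, which uses explanation quality during training (Grad-CAM attention maps) to separate informative from unreliable uncertainty and filters out low-focus samples, improving macro-F1 and reducing training cost on three ECG datasets.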

详情
AI中文摘要

训练用于临床时间序列分析的深度神经网络计算需求高,但许多医疗环境缺乏重复模型开发和部署所需的资源。这一挑战在心电图分类中尤为明显,大数据集和长训练计划使效率变得重要。渐进式数据丢弃通过从梯度更新中排除已学习的样本来降低训练成本,但它依赖模型置信度,可能保留因噪声或歧义而难以处理而非有用信号的样本。在这项工作中,我们引入了ERTS,一种基于可解释性的可靠性训练信号,用于高效心电图分类。ERTS在训练期间利用解释质量来区分信息性和不可靠的不确定性。基于渐进式数据选择,我们计算候选样本的Grad-CAM注意力图,并推导出一个聚焦分数,衡量模型预测是否得到连贯且局部化模式的支持。低聚焦样本被过滤掉,而具有有意义注意力的样本优先进行梯度更新。我们在三个ECG数据集和多个骨干架构上评估ERTS,显示macro-F1的一致提升以及有效训练成本的降低。这些结果表明,解释质量可以作为改善临床时间序列学习中效率和可靠性的实用信号。代码将发布。

英文摘要

Training deep neural networks for clinical time-series analysis is computationally demanding, yet many healthcare settings lack the resources required for repeated model development and deployment. This challenge is particularly evident in electrocardiogram classification, where large datasets and long training schedules make efficiency practically important. Progressive Data Dropout reduces training cost by excluding samples from gradient updates once they are learned, but it relies on model confidence and may retain samples that are difficult due to noise or ambiguity rather than useful signal. In this work, we introduce ERTS, an explainability-based reliability training signal for efficient ECG classification. ERTS uses explanation quality during training to distinguish between informative and unreliable uncertainty. Building on progressive data selection, we compute Grad-CAM attention maps for candidate samples and derive a focus score that measures whether model predictions are supported by coherent and localised patterns. Samples with low focus are filtered out, while those with meaningful attention are prioritised for gradient updates. We evaluate ERTS across three ECG datasets and multiple backbone architectures, showing consistent improvements in macro-F1 alongside reduced effective training cost. These results suggest that explanation quality can serve as a practical signal for improving both efficiency and reliability in clinical time-series learning. Code will be released.

2606.12334 2026-06-11 cs.LG cs.RO 新提交

Fourier Features Let Agents Learn High Precision Policies with Imitation Learning

傅里叶特征让智能体通过模仿学习学习高精度策略

Balázs Gyenes, Emiliyan Gospodinov, Jan Frieling, Enrico Krohmer, Nicolas Schreiber, Xiaogang Jia, Niklas Freymuth, Gerhard Neumann

发表机构 * Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院) FZI Research Center for Information Technology(FZI信息技术研究中心)

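AI Summary: Proposes a Fourier feature mapping in the point cloud encoder to counter the low-frequency bias of neural networks that hampers high-precision manipulation, with significant gains on several benchmarks and a real robot.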

详情
Comments
Published as a conference paper at ICML 2026
AI中文摘要

高精度机器人操作需要细粒度的空间推理,由于深度模糊和透视尺度问题,仅使用RGB的策略通常难以实现。直接利用3D信息(如基于点云的策略)比纯图像策略提供了更强的几何先验,但其性能仍然高度依赖于任务。我们假设这种差异可能是由于神经网络倾向于学习低频函数的频谱偏差,这尤其影响以缓慢变化的笛卡尔特征为条件的架构。因此,我们提出将点云从笛卡尔空间映射到高维傅里叶空间,有效地使点云编码器能够直接访问高频特征。我们通过实验验证了傅里叶特征在RoboCasa和ManiSkill3基准测试中的具有挑战性的操作任务以及真实机器人设置上的效果。尽管简单,我们发现傅里叶特征在不同的编码器架构和基准测试中提供了显著的好处,并且对超参数具有鲁棒性。我们的结果表明,傅里叶特征让策略比笛卡尔特征更有效地利用几何细节,显示了其作为基于点云的模仿学习的通用工具的潜力。我们在项目页面上提供源代码和视频:this https URL

英文摘要

High-precision robotic manipulation requires fine-grained spatial reasoning that is often difficult to achieve with RGB-only policies due to depth ambiguity and perspective scale issues. Policies that leverage 3D information directly, such as those based on point clouds, offer a stronger geometric prior over purely image-based ones, yet their performance remains highly task-dependent. We hypothesize that this discrepancy may be due to the spectral bias of neural networks towards learning low frequency functions, which especially affects architectures conditioned on slow-moving Cartesian features. We thus propose to map point clouds from Cartesian space into high-dimensional Fourier space, effectively equipping the point cloud encoder with direct access to high-frequency features. We experimentally validate the use of Fourier features on challenging manipulation tasks from the RoboCasa and ManiSkill3 benchmarks and on a real robot setup. Despite their simplicity, we find that Fourier features provide significant benefits across diverse encoder architectures and benchmarks and are robust across hyperparameters. Our results indicate that Fourier features let policies leverage geometric details more effectively than Cartesian features, showing their potential as a general-purpose tool for point cloud-based imitation learning. We provide source code and videos on our project page: this https URL
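The mapping from Cartesian coordinates into Fourier space can be sketched in a few lines. The version below uses fixed power-of-two frequency bands, which is one common choice. Whether the paper uses this scheme, random frequencies, or something else is not stated in the abstract, so the band layout, `num_bands`, and the function name are all assumptions.

```python
import math


def fourier_features(point, num_bands=4):
    """Map a point (x, y, z) to sin/cos features at several frequencies.

    For each coordinate p and band k, emits sin(2^k * pi * p) and
    cos(2^k * pi * p). The band layout is an illustrative assumption.
    """
    feats = []
    for p in point:
        for k in range(num_bands):
            freq = (2.0 ** k) * math.pi
            feats.append(math.sin(freq * p))
            feats.append(math.cos(freq * p))
    return feats


# One hypothetical point from a point cloud, in normalized coordinates.
encoded = fourier_features((0.1, 0.2, 0.3), num_bands=4)
```

Two points that differ by a millimetre are nearly identical in Cartesian coordinates but can differ noticeably in the higher bands, which is the intuition for why the encoder gains access to fine geometric detail. Each 3D point becomes a vector of length `3 * 2 * num_bands`.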

2606.11199 2026-06-11 cs.CL cs.AI cs.IR cs.LG 交叉投稿

NightFeats @ MMU-RAGent NeurIPS 2025: A Context-Optimized Multi-Agent RAG System for the Text-to-Text Track

NightFeats @ MMU-RAGent NeurIPS 2025: 面向文本到文本轨道的上下文优化多智能体RAG系统

Quentin Fever, Naziha Aslam

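AI Summary: Proposes NightFeats, a structured multi-agent RAG system that decomposes knowledge synthesis into retrieval, curation, and composition phases and introduces temporal-semantic reranking, contradiction reconciliation, and citation-preserving composition. It surpasses proprietary baselines in the MMU-RAGent competition.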

详情
Comments
5 pages, 1 figure, 1 table. NeurIPS 2025 Competition Track (MMU-RAGent). System developed October 2025
AI中文摘要

我们提出NightFeats,一个结构化的多智能体检索增强生成(RAG)系统,提交至NeurIPS 2025的MMU-RAGent竞赛,并在文本到文本轨道中获得最佳动态评估奖。本文并非以基准最大化目标,而是提出一个原则性流水线,将知识合成为三个协调阶段:检索、策展和组合,每个阶段由显式的中间表示和交接契约控制。受智能体上下文工程(ACE)启发,该系统引入时序语义重排序、有界矛盾协调和保留引用的组合作为核心架构原语。竞赛结果表明,NightFeats在LLM-as-a-Judge和人类Likert评估中超越了包括Claude-SonnetV2和Nova-Pro在内的商业基线,证实了架构透明性和可验证证据基础比单纯优化自动相似度指标的系统更符合人类偏好。

英文摘要

We present NightFeats, a structured multi-agent retrieval-augmented generation (RAG) system submitted to the MMU-RAGent competition at NeurIPS 2025, where it was awarded Best Dynamic Evaluation in the text-to-text track. Rather than targeting benchmark maximization, this work proposes a principled pipeline that decomposes knowledge synthesis into three coordinated phases: retrieval, curation, and composition, each governed by explicit intermediate representations and handoff contracts. Inspired by Agentic Context Engineering (ACE), the system introduces temporal-semantic reranking, bounded contradiction reconciliation, and citation-preserving composition as core architectural primitives. Competition results show that NightFeats surpasses proprietary baselines including Claude-SonnetV2 and Nova-Pro on LLM-as-a-Judge and Human Likert evaluations, confirming that architectural transparency and verifiable evidence grounding are better aligned with human preferences than systems optimizing narrowly for automatic similarity metrics.

2606.11240 2026-06-11 physics.comp-ph cond-mat.str-el cs.LG quant-ph 交叉投稿

Physically Constrained Ensemble Gaussian Process Modelling for Expensive Quantum Systems with Heteroskedastic Noise

物理约束集成高斯过程建模用于具有异方差噪声的昂贵量子系统

Arpan Biswas, Surtirtha Paul, Joseph Agada, Matthias Thamm, Adrian Del Maestro

AI总结 提出物理约束集成高斯过程框架,通过加权惩罚和数值积分集成多个GP代理,高效建模含异方差噪声的量子系统,在Bose-Hubbard模型和纳米孔硅酸盐量子液体模拟中实现更准确且物理合理的预测。

详情
Comments
14 pages, 6 figures in main text, 2 figures in Supp materials
AI中文摘要

精确建模量子多体系统通常需要计算昂贵的模拟,如密度矩阵重正化群(DMRG)或量子蒙特卡洛(QMC)计算。这些方法虽然精确,但会带来显著的时间和资源限制,限制了它们在详尽参数探索中的应用。此外,这些昂贵模拟在大的未知参数空间内可能包含可变误差,需要量化和传播。因此,需要预测建模来准确估计稀疏采样数据(具有异方差噪声)的函数空间,同时保持估计的物理相关性。为此,我们提出了物理约束集成高斯过程(pc-EGP)框架,旨在物理一致性约束下高效建模复杂且含噪声的量子系统。该方法首先将物理约束作为用户控制的加权惩罚项,施加到高斯过程(GP)代理的数据驱动损失函数中。然后,通过数值求积方法训练一组这样的GP模型,其中多个不同节点上的GP通过求积加权平均进行集成。我们首先在合成生成数据上演示该框架,然后应用于量子系统。在第一个案例研究中,我们利用Bose-Hubbard模型的DMRG模拟来预测控制超流-莫特绝缘体转变的临界相互作用参数Uc。在第二个案例研究中,我们展示了该方法在QMC模拟上的应用,模拟限制在纳米孔硅酸盐内的量子液体,目标是优化化学环境以实现一维超流。与传统GP相比,pc-EGP在准确性和物理有意义的预测之间实现了更好的平衡。

英文摘要

Accurate modeling of quantum many-body systems often requires computationally expensive simulations such as Density Matrix Renormalization Group (DMRG) or Quantum Monte Carlo (QMC) calculations. These methods, while precise, impose significant time and resource constraints, limiting their use in exhaustive parameter exploration. Moreover, these expensive simulations can contain variable errors over the large unknown parameter space, which need to be quantified and propagated. Thus, predictive modelling is required to estimate the functional space accurately from sparsely sampled data with heteroskedastic noise, while preserving the physical relevance of the estimation. Therefore, we present a Physically Constrained Ensemble Gaussian Process (pc-EGP) framework designed to efficiently model complex and noisy quantum systems under physical consistency constraints. The proposed method first enforces physical constraints as a user-controlled weighted penalty added to the data-driven loss function of the Gaussian Process (GP) surrogates. An ensemble of such GP models is then trained on simulations with variable noise via a numerical quadrature method, in which the GPs at different nodes are combined as a quadrature-weighted average. We first demonstrate the framework on synthetically generated data before applying it to quantum systems. In the first case study, we leverage DMRG simulations of the Bose-Hubbard Model to predict the critical interaction parameter Uc governing the superfluid-to-Mott-insulator transition. In the second case study, we demonstrate our method on QMC simulations of a quantum liquid confined inside a nanoporous silicate, with the goal of optimizing a chemical environment to realize a one-dimensional superfluid. Compared to conventional GP, pc-EGP achieves a better balance of accuracy and physically meaningful predictions.

2606.11249 2026-06-11 cs.RO cs.LG cs.MA 交叉投稿

MASK: Multi-Agent Semantic K-Scheduling for Risk-Sensitive 6G Robotics

MASK: 面向风险敏感的6G机器人学的多智能体语义K调度

Ahmet Gunhan Aydin, Elif Tugce Ceran

发表机构 * Middle East Technical University(中东技术大学) Aselsan Inc.(阿塞尔桑公司)

AI总结 针对6G机器人协同感知中频谱资源受限的问题,提出多智能体语义K调度(MASK)架构,通过仲裁辅助语义信息门控(A-SIG)机制仅调度语义重要性最高的K个智能体,结合自监督全局编码器和分布策略,在严格带宽限制下实现鲁棒的风险感知协调,性能接近无通信约束基线。

详情
AI中文摘要

实现6G连接机器人学的愿景需要协调高性能协作控制与物理无线信道的刚性频谱限制。在现实的协作感知场景中,频谱资源被量化为有限的物理资源块或正交子载波,使得所有智能体同时传输不可行。为了解决这一问题,我们提出了多智能体语义K调度(MASK),一种控制架构,旨在在严格的瞬时带宽限制下维持鲁棒的风险感知协调。我们引入了仲裁辅助语义信息门控(A-SIG),一种轻量级协调机制,通过基于本地计算的语义重要性分数仅调度前K个智能体来强制执行硬接入约束。通过将这些优先观测聚合为紧凑的潜在状态,自监督全局编码器使得分布策略能够在数据稀疏的情况下减轻尾部风险。我们在多个基准上评估了MASK,证明即使信道接入限制为群体大小的一小部分,其性能也能匹配无通信约束的基线。此外,该框架对数据包擦除具有固有的弹性,验证了语义调度作为资源受限的6G系统的关键使能技术。

英文摘要

Realizing the vision of 6G connected robotics requires reconciling high-performance collaborative control with the rigid spectral limitations of physical wireless channels. In realistic collaborative sensing scenarios, spectral resources are quantized into finite physical resource blocks or orthogonal subcarriers, rendering simultaneous transmission by all agents infeasible. To address this, we propose Multi-Agent Semantic K-Scheduling (MASK), a control architecture designed to sustain robust, risk-aware coordination under strict instantaneous bandwidth caps. We introduce Arbiter-Assisted Semantic Information Gating (A-SIG), a lightweight coordination mechanism that enforces hard access constraints by scheduling only the top-K agents based on locally computed semantic importance scores. By aggregating these prioritized observations into a compact latent state, a self-supervised global encoder enables a distributional policy to mitigate tail risks despite data sparsity. We evaluate MASK across diverse benchmarks, demonstrating that it matches the performance of communication-unconstrained baselines even when channel access is restricted to a small fraction of the swarm size. Furthermore, the framework exhibits inherent resilience to packet erasures, validating semantic scheduling as a critical enabler for resource-constrained 6G systems.
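
The top-K gating step described in the abstract above can be illustrated with a minimal sketch. This is not the authors' A-SIG implementation: the function names `schedule_top_k` and `aggregate`, the use of mean pooling as the aggregation, and the example scores are all assumptions made for illustration only.

```python
import numpy as np

def schedule_top_k(scores, k):
    """Return a boolean mask selecting the k agents with the highest scores."""
    scores = np.asarray(scores, dtype=float)
    k = max(0, min(k, scores.size))
    mask = np.zeros(scores.size, dtype=bool)
    if k > 0:
        # argsort is ascending, so the last k indices are the top-k agents
        top = np.argsort(scores, kind="stable")[-k:]
        mask[top] = True
    return mask

def aggregate(observations, mask):
    """Mean-pool the observations of scheduled agents (hypothetical aggregator)."""
    observations = np.asarray(observations, dtype=float)
    if not mask.any():
        return np.zeros(observations.shape[1])
    return observations[mask].mean(axis=0)

# Four agents, each with a locally computed importance score (made-up values)
scores = [0.1, 0.9, 0.4, 0.7]
obs = np.arange(8, dtype=float).reshape(4, 2)  # one 2-D observation per agent

mask = schedule_top_k(scores, k=2)   # only two agents may transmit
pooled = aggregate(obs, mask)        # compact summary of what was received
```

The hard cap is enforced by construction: the mask never contains more than `k` true entries, regardless of how many agents report high scores.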

2606.11256 2026-06-11 physics.chem-ph cs.LG cs.NE 交叉投稿

My Chemical Harness: Evolutionary Molecular Design over Synthetic Pathways with Large Language Model Agents

我的化学缰绳:基于合成路径的大语言模型智能体进化分子设计

César Ojeda, Darius A. Faroughy, Maryam Karimi, Payam Zarrintaj, Mir Mehdi Seyedebrahimi, Martín Carballo-Pacheco

AI总结 提出一种以可执行合成路径为种群、大语言模型仅作策略控制器的进化框架,在可溶性环氧化物水解酶代理任务上达到最优性能。

详情
Comments
27 pages | 10 figures
AI中文摘要

当候选结构伴随可行的合成路线时,设计具有目标性质的分子最为有用。我们介绍了My Chemical Harness,一种面向目标分子设计的路线原生进化框架,其中搜索种群由可执行的合成路径而非孤立的分子图组成。每条路径由可购买的构建块和反应模板构建,通过确定性化学工具执行,并通过任务特定的分子预言机评分。大语言模型仅用作策略控制器,选择关于路径长度、移动类型、反应家族、基序和探索压力的高级偏好,而本地代码执行路径构建、验证、去重、评分、选择和记忆更新。这种分离使得大语言模型能够引导探索,同时防止其引入幻觉产物或不受支持的反应步骤。在一个可溶性环氧化物水解酶代理任务上,我们的LLM智能体优于单次LLM和确定性控制器,在sEH分数、合成可及性分数和AiZynthFinder成功率指标上达到最先进性能。这些结果表明,受约束的大语言模型智能体可以在无需训练、微调或专用生成模型的情况下,在分子发现中发挥重要作用。

英文摘要

Designing molecules with target properties is most useful when candidate structures are accompanied by feasible synthetic routes. We introduce My Chemical Harness, a route-native evolutionary framework for goal-directed molecular design in which the search population consists of executable synthetic pathways rather than isolated molecular graphs. Each route is built from purchasable building blocks and reaction templates, executed by deterministic chemistry tools, and scored through task-specific molecular oracles. Large language models (LLMs) are used only as strategy controllers that select high-level preferences over route length, move type, reaction families, motifs, and exploration pressure, while local code performs route construction, validation, deduplication, scoring, selection, and memory updates. This separation lets the LLM guide exploration without allowing it to introduce hallucinated products or unsupported reaction steps. On a soluble epoxide hydrolase proxy task, our LLM agent improves over single pass LLM and deterministic controllers, reaching state-of-the-art performance across the sEH score, synthetic accessibility score, and AiZynthFinder success rate metrics. These results suggest that constrained LLM agents can play a significant role in molecular discovery without requiring training, fine-tuning, or dedicated generative models.

2606.11279 2026-06-11 eess.AS cs.CL cs.LG cs.SD 交叉投稿

Massive Open-Vocabulary Keyword Spotting

大规模开放词汇关键词识别

Leonor Barreiros, Raul Monteiro, Afonso Mendes, Gonçalo M. Correia

AI总结 提出一种内存占用更小的开放词汇关键词识别系统,无需微调即可处理大规模数据库,在未见语言中达到与未压缩方案相当的实体召回率。

详情
Comments
Accepted to Interspeech 2026
AI中文摘要

自动语音识别系统在转录训练数据中罕见词汇(即专业术语)时表现不佳。开放词汇关键词识别结合上下文偏置已被证明可以缓解这一问题。然而,现有系统只能处理几百个术语的词汇表,否则会成为不可行的瓶颈。我们提出了一种系统,其存储特征的内存占用比可比基线小128倍,允许用户处理大规模数据库,同时保持开放词汇。无需微调语音识别模型,我们的系统在未见过的语言中也达到了与未压缩解决方案相当的实体召回率。

英文摘要

Automatic speech recognition systems have been shown to under-perform when transcribing words rarely seen in the training data, namely specialized terminology. Open-vocabulary keyword spotting, combined with contextual biasing, has been shown to mitigate this issue. However, existing systems can only handle glossaries of a few hundred terms without becoming an infeasible bottleneck. We propose a system that stores features with a memory footprint up to 128 times smaller than a comparable baseline and allows users to process massive databases while remaining open-vocabulary. Without fine-tuning the speech recognition model, our system achieves entity recall comparable to that of uncompressed solutions, even in languages not seen during training.

2606.11295 2026-06-11 astro-ph.CO cs.LG 交叉投稿

Interpretable Neural Marked Statistics for Cosmological Inference

可解释的神经标记统计用于宇宙学推断

Federico Semenzato, Benjamin D. Wandelt, Michele Liguori, Alvise Raccanelli

AI总结 提出一种神经标记方案,通过可解释的物理变换从形态学层面提取宇宙学信息,在对比学习目标下优化标记统计,显著提高对σ₈和Ωₘ的约束精度。

详情
Comments
11 pages, 6 figures. Accepted to the Workshop on AI for Physics (ICML 2026)
AI中文摘要

恢复超出功率谱的宇宙学信息是即将进行的宇宙学调查的核心目标，因为物质密度中的晚期非高斯信号无法仅通过两点统计获得。标记统计通过使用非线性函数对场进行重新加权，将部分信息折叠回两点水平。我们提出了一种神经标记方案，通过一组可解释的、物理驱动的变换来推广这一过程，这些变换直接允许在形态学层面解释宇宙学信息的增益。我们采用对比学习目标将可学习的标记摘要与底层宇宙学参数对齐。在$k_{\max}=0.2\,h\mathrm{Mpc}^{-1}$处，与经典标记相比，我们的神经标记将$\sigma_8$的边缘化约束提高了$2.9\times$，将$\Omega_m$提高了$1.8\times$，在Fisher信息层面打破了$\Omega_m-\sigma_8$简并。它进一步将参数MSE在整个宇宙学参数先验上比最佳经典标记降低了$1.45\times$。学习到的潜在几何与参数空间中的$\Omega_m$和$\sigma_8$方向对齐，表明对比目标恢复了宇宙学信息的主导轴。我们的方法为更强大、可解释的宇宙学推断摘要统计打开了大门。

英文摘要

Recovering cosmological information beyond the power spectrum is a central goal for upcoming cosmological surveys, since the late-time non-Gaussian signal in the matter density cannot be accessed through two-point statistics alone. Marked statistics fold part of this information back into the two-point level by reweighting the field with non-linear functions. We propose a neural marking scheme that generalizes this process through a set of interpretable, physically motivated transformations, which allow the gain in cosmological information to be interpreted directly at the morphological level. We employ a contrastive learning objective to align learnable marked summaries with the underlying cosmological parameters. At $k_{\max}=0.2\,h\mathrm{Mpc}^{-1}$, our neural mark tightens the marginalized constraint on $\sigma_8$ by $2.9\times$ and on $\Omega_m$ by $1.8\times$ compared to classical marks, breaking the $\Omega_m-\sigma_8$ degeneracy at the Fisher information level. It further reduces the parameter MSE across our cosmological parameter prior by $1.45\times$ over the best classical mark. The learned latent geometry aligns with the $\Omega_m$ and $\sigma_8$ directions in parameter space, indicating that the contrastive objective recovers the dominant axes of cosmological information. Our approach opens the door to more powerful, interpretable summary statistics for cosmological inference.
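
To make "reweighting the field with non-linear functions" concrete, the sketch below applies a commonly used classical mark of the form $m = [(1+\delta_s)/(1+\delta_s+\delta_R)]^p$ to a density contrast field. It shows a classical mark only, not the neural mark proposed in the abstract; the parameter values and the function names `classical_mark` and `marked_field` are assumptions chosen for illustration, and the smoothed contrast `delta_R` is taken as given rather than computed.

```python
import numpy as np

def classical_mark(delta_R, p=2.0, delta_s=0.25):
    """Mark that up-weights underdense regions when p > 0.

    delta_R: smoothed density contrast (values >= -1); p and delta_s are
    free parameters of the mark (illustrative defaults).
    """
    delta_R = np.asarray(delta_R, dtype=float)
    return ((1.0 + delta_s) / (1.0 + delta_s + delta_R)) ** p

def marked_field(delta, delta_R, p=2.0, delta_s=0.25):
    """Density field (1 + delta) reweighted cell by cell by the mark."""
    delta = np.asarray(delta, dtype=float)
    return classical_mark(delta_R, p=p, delta_s=delta_s) * (1.0 + delta)

# Toy example: one underdense, one mean-density, one overdense cell
delta_R = np.array([-0.5, 0.0, 0.5])
marks = classical_mark(delta_R, p=1.0)
field = marked_field(delta_R, delta_R, p=1.0)
```

Two-point statistics of the reweighted field then carry information that the unmarked power spectrum does not; a learnable scheme would replace the fixed functional form with trainable transformations.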

2606.11324 2026-06-11 cs.RO cs.AI cs.LG 交叉投稿

Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

Embodied-R1.5:通过具身基础模型演化物理智能

Yifu Yuan, Yaoting Huang, Xianze Yao, Yutong Li, Shuoheng Zhang, Linqi Han, Pengyi Li, Jiangeng Sun, Wenting Jia, Zhao Zhang, Yuhao Liu, Ruihao Liao, Yucheng Hu, Qiyu Wu, Yuxiao Li, Zibin Dong, Fei Ni, Yan Zheng, Shuyang Gu, Yi Ma, Hongyao Tang, Han Hu, Jianye Hao

发表机构 * Tianjin University(天津大学) Tencent Hunyuan(腾讯混元)

AI总结 提出统一具身基础模型Embodied-R1.5,通过自动化数据管道和多任务平衡强化学习,在8B参数下实现24项基准中16项最优,并支持微调为VLA模型。

详情
Comments
Embodied R1.5 technical report. Project page: this https URL
AI中文摘要

我们介绍了Embodied-R1.5，一个统一的具身基础模型（EFM），它在单一架构中集成了全面的具身推理能力，涵盖具身认知、任务规划、修正和指向，旨在实现通用物理智能。利用三个自动化数据构建管道显著扩展关键能力的数据覆盖，我们构建了超过150亿token的大规模数据系统，并设计了多任务平衡的RL配方以缓解异构任务冲突。我们进一步引入了规划器-定位器-修正器（PGC）闭环框架，使单一模型能够自主执行并在长时任务中进行自我修正。仅凭8B参数，Embodied-R1.5在24个具身VLM基准中的16个上达到了最先进水平，超越了Gemini-Robotics-ER-1.5和GPT-5.4等领先模型。得益于内化的具身能力，Embodied-R1.5只需少量数据即可微调为VLA，在4个流行的操作基准套件上优于$\pi_{0.5}$等领先VLA模型。我们进一步进行了广泛的零样本真实机器人实验，验证了在指令跟随、可供性定位、铰接物体操作和长时复杂任务中的性能，展示了向物理世界的强泛化能力。我们开源了模型权重、数据集、训练代码以及EmbodiedEvalKit（一个专为具身任务定制的评估框架），以促进EFM的未来研究。

英文摘要

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.

2606.11415 2026-06-11 q-bio.NC cs.LG physics.data-an q-bio.QM 交叉投稿

Spatially Masked Regression Reveals Local and Distributed Predictability in Electrophysiological Recordings

空间掩蔽回归揭示电生理记录中的局部和分布式可预测性

Maryam Ostadsharif Memar, Nima Dehghani

AI总结 提出空间掩蔽回归(SMR)框架,通过逐步增大掩蔽区域量化电极信号中局部与分布式信息的贡献,应用于颅内和头皮脑电数据,发现邻近电极贡献显著但非全部,表明信号同时包含局部冗余和全局结构。

详情
AI中文摘要

神经记录通常被解释为局部测量,但任何单个传感器的信号也可能反映分布在整个网络中的结构化活动。这引出一个基本问题:电极信号在多大程度上反映底层系统中的局部信息与分布式信息?更具体地说,电极的活动有多少由其邻近区域携带,又有多少嵌入在阵列的更广泛分布中?我们通过空间掩蔽回归(SMR)框架解决这一问题,该框架从其余电极重建每个电极的时间序列,同时排除目标周围可配置的邻域。通过逐步增大掩蔽,空间局部性成为实验控制,用于量化在移除附近通道后有多少预测信息幸存。我们将SMR应用于具有异质电极覆盖的颅内脑电图(iEEG)和具有标准化导联组合的感觉运动皮层头皮脑电图(EEG)。使用原始信号与重建信号之间的距离相关性,我们发现两种模态中均存在强烈的受试者内重建,即使排除局部邻域后仍有显著的可预测性,且EEG中的跨受试者转移明显强于iEEG。掩蔽显示邻近电极对重建贡献显著,但并非全部,表明单个通道既反映局部冗余也反映更广泛的分布式结构。保留选定边际或谱特性但破坏相位结构或时间顺序的替代数据显著降低了性能,支持SMR依赖于结构化时间和跨通道组织而非仅边际统计的结论。这些结果将SMR定位为量化记录中局部与分布式信息平衡的可解释框架。

英文摘要

Neural recordings are often interpreted as local measurements, yet the signal at any one sensor can also reflect structured activity distributed across the broader network. This raises a basic question: to what extent does an electrode's signal reflect local versus distributed information in the underlying system? More specifically, how much of an electrode's activity is carried by its immediate neighborhood, and how much is embedded more broadly across the array? We address this with a Spatially Masked Regression (SMR) framework that reconstructs each electrode's timeseries from the remaining electrodes while excluding a configurable neighborhood around the target. By progressively increasing this mask, spatial locality becomes an experimental control for quantifying how much predictive information survives after nearby channels are withheld. We apply SMR to intracranial EEG with heterogeneous electrode coverage and to scalp EEG with standardized montages over sensorimotor cortex. Using distance correlation between original and reconstructed signals, we find strong within-subject reconstruction in both modalities, substantial residual predictability even when local neighbors are excluded, and markedly stronger cross-subject transfer in EEG than in iEEG. Masking shows that nearby electrodes contribute strongly to reconstruction but do not account for all of it, indicating that individual channels reflect both local redundancy and broader distributed structure. Surrogates that preserve selected marginal or spectral properties while disrupting phase structure or temporal ordering substantially reduce performance, supporting the conclusion that SMR depends on structured temporal and cross-channel organization rather than on marginal statistics alone. These results position SMR as an interpretable framework for quantifying the balance between local and distributed information in recordings.
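
The masking idea in the abstract above can be sketched as follows. This is an illustrative simplification, not the authors' SMR code: it assumes a linear ridge regressor, Euclidean electrode coordinates, and Pearson correlation in place of the distance correlation the abstract reports. The function name `masked_reconstruction` and the synthetic data are hypothetical.

```python
import numpy as np

def masked_reconstruction(X, positions, target, radius, ridge=1e-6):
    """Reconstruct channel `target` from channels farther than `radius` away.

    X: (time, channels) signals; positions: (channels, dims) coordinates.
    Returns the reconstruction and the indices of the predictor channels.
    """
    if radius < 0:
        raise ValueError("radius must be non-negative")
    dist = np.linalg.norm(positions - positions[target], axis=1)
    predictors = np.flatnonzero(dist > radius)  # target has distance 0, so it is excluded
    if predictors.size == 0:
        raise ValueError("mask removes every channel")
    A = X[:, predictors]
    y = X[:, target]
    # Ridge-regularised least squares: (A^T A + ridge * I) w = A^T y
    w = np.linalg.solve(A.T @ A + ridge * np.eye(A.shape[1]), A.T @ y)
    return A @ w, predictors

# Synthetic array: six channels on a line, channel 2 driven by distant channels
rng = np.random.default_rng(0)
positions = np.arange(6, dtype=float).reshape(-1, 1)
X = rng.standard_normal((200, 6))
X[:, 2] = X[:, 0] + 0.5 * X[:, 5]

recon, used = masked_reconstruction(X, positions, target=2, radius=1.0)
score = np.corrcoef(recon, X[:, 2])[0, 1]  # Pearson stand-in for distance correlation
```

Sweeping `radius` upward and recording `score` at each step gives the kind of locality curve the abstract describes: how much predictability survives as nearer channels are withheld.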

2606.11500 2026-06-11 eess.IV cs.CE cs.IT cs.LG q-bio.NC 交叉投稿

FlexiBrain: Resolution-Agnostic Voxel-Level Encoding for Native fMRI

FlexiBrain: 面向原生fMRI的分辨率无关体素级编码

Mo Wang, Wenhao Ye, Junfeng Xia, Minghao Xu, Hongkai Wen, Quanying Liu

AI总结 提出FlexiBrain,一种基于Mamba-JEPA的分辨率无关体素级编码框架,通过动态补丁调整直接处理原生fMRI数据,避免破坏性空间标准化,在五个下游任务中性能提升达12个百分点,并显著降低预处理成本。

详情
AI中文摘要

大规模深度学习模型在神经科学中的成功从根本上受到严重数据异质性的制约。从不同来源聚合的原生fMRI数据在空间和时间分辨率上表现出显著差异。因此,大多数现有框架依赖于冗长、僵化的预处理流程,以强制数据集之间的一致性。这种做法引入了两个关键限制:(1)可能退化受试者特定的解剖信息;(2)显著的计算开销,通常每个受试者需要数小时的处理。在此,我们提出FlexiBrain,一种基于Mamba-JEPA的分辨率无关体素级编码框架,用于原生fMRI。FlexiBrain以真实物理单位定义补丁大小,并采用动态补丁调整,从而绕过破坏性的空间标准化,同时允许直接摄取原生空间中的数据。我们使用高效的Mamba-JEPA骨干网络实例化该框架,以建模高维4D fMRI信号。在五个不同的下游神经科学任务中,FlexiBrain持续优于近期最先进的方法,在不使用外部数据增强的情况下实现了高达12个百分点的提升。重要的是,FlexiBrain作为一个无缝插件模块,显著降低了预处理成本,并加速了稳健的体素级fMRI基础模型的开发。代码可在该https URL获取。

英文摘要

The success of large-scale deep learning models in neuroscience is fundamentally constrained by severe data heterogeneity. Native fMRI data aggregated from diverse sources exhibit substantial variation in both spatial and temporal resolution. Consequently, most existing frameworks rely on lengthy, rigid preprocessing pipelines that enforce uniformity across datasets. This practice introduces two critical limitations: (1) potential degradation of subject-specific anatomical information; and (2) significant computational overhead, often requiring hours of processing per subject. Here, we propose FlexiBrain, a resolution-agnostic voxel-level encoding framework for native fMRI based on Mamba-JEPA. FlexiBrain defines patch sizes in real-world physical units and employs dynamic patch resizing, thereby bypassing destructive spatial standardization while enabling direct ingestion of data in native space. We instantiate the framework using an efficient Mamba-JEPA backbone to model high-dimensional 4D fMRI signals. Across five diverse downstream neuroscience tasks, FlexiBrain consistently outperforms recent state-of-the-art methods, achieving gains of up to 12 percentage points without external data augmentation. Importantly, FlexiBrain functions as a seamless plug-in module, substantially reducing preprocessing costs and accelerating the development of robust voxel-level fMRI foundation models. Code is available at this https URL.

2606.11555 2026-06-11 q-bio.NC cs.AI cs.LG 交叉投稿

End-to-End Machine Learning for Depressive State Classification via EEG and fNIRS

基于EEG和fNIRS的抑郁状态分类的端到端机器学习

Riki Sakurai, Simon Kojima, Mihoko Otake-Matsuura, Shin'ichiro Kanoh, Tomasz M. Rutkowski

AI总结 本研究提出一个端到端机器学习框架,利用EEG和fNIRS信号对抑郁状态进行分类,旨在克服传统诊断的主观性,为临床提供客观的自动化诊断工具。

详情
Comments
4 pages, 4 figures, Accepted for publication in the Proc. 48th Annu. Int. Conf. IEEE EMBS (EMBC 2026), Toronto, Canada, July 20-24, 2026
AI中文摘要

随着社会压力的增加,对心理医疗的需求不断上升,凸显了传统精神病学诊断的局限性。传统方法——主要依赖临床访谈和患者自我报告——本质上容易受到主观偏见和从业者不同的经验判断的影响。为了满足定量评估的需求,基于生物信号的检测,包括脑电图(EEG)和功能性近红外光谱(fNIRS),已成为一种有前景的客观替代方案。这类技术对于识别可能未被受试者自身意识到的潜在抑郁状态尤为重要。此外,在老龄化人群中,抑郁症与痴呆症的高共病性要求早期区分,以防止症状相互恶化并维持生活质量(QoL)。这项针对11名健康学生的初步研究建立了一个基于生物信号的抑郁症检测框架,为临床使用的自动化、客观诊断工具奠定了基础。

英文摘要

The escalating demand for mental healthcare, driven by rising societal stress, highlights the limitations of traditional psychiatric diagnostics. Conventional methods, which rely primarily on clinical interviews and patient self-reports, are inherently vulnerable to subjective bias and the varying empirical judgment of practitioners. To address the need for quantitative evaluation, biological signal-based detection, including electroencephalography (EEG) and functional near-infrared spectroscopy (fNIRS), has emerged as a promising objective alternative. Such technology is particularly vital for identifying latent depressive states that may be unrecognized by the subjects themselves. Furthermore, in aging populations, the high comorbidity between depression and dementia necessitates early differentiation to prevent mutual symptom exacerbation and maintain Quality of Life (QoL). This pilot study of eleven healthy students establishes a framework for biological signal-based depression detection, serving as a foundational step toward automated, objective diagnostic tools for clinical use.

2606.11615 2026-06-11 cs.CV cs.CR cs.LG 交叉投稿

Adv-TGD: Adversarial Text-Guided Diffusion for Face Recognition Impersonation Attacks

Adv-TGD:面向人脸识别冒充攻击的对抗性文本引导扩散

Omid Ahmadieh, Nima Karimian

发表机构 * University of South Florida, Bellini College of Artificial Intelligence, Cybersecurity and Computing(南佛罗里达大学贝利尼人工智能、网络安全与计算学院)

AI总结 提出Adv-TGD框架,利用Stable Diffusion和LoRA微调生成逼真对抗人脸,在保持视觉质量的同时实现高成功率身份冒充攻击,平均ASR达85.90%。

详情
AI中文摘要

人脸识别(FR)技术的广泛普及引发了严重的隐私担忧,因为面部数据可能在未经同意的情况下被利用。为了解决这一挑战,我们提出了Adv-TGD,一个生成式对抗攻击框架,能够合成逼真的人脸,冒充目标身份并欺骗人脸识别系统。基于Stable Diffusion,Adv-TGD对每个样本进行LoRA微调,以简洁的文本提示为条件,生成自然但具有对抗性操控的身份。与传统的身份攻击方法不同,我们的方法在单步去噪过程中为每个源-目标对优化轻量级交叉注意力适配器。潜在混合受到面部局部热图掩码的约束,以确保空间精确的身份操控,同时保留非敏感区域。我们引入了一个复合目标,结合了掩码epsilon-MSE重建、FR嵌入空间中的阈值化身份差异、方向特征对齐和源相似性抑制,以平衡对抗攻击和视觉真实性。可选地,LLaVA生成的属性提示增强了细粒度语义细节,而不会重新引入身份线索。在黑盒评估协议下,Adv-TGD在IR152、IRSE50、MobileFace和FaceNet上平均攻击成功率(ASR)达到85.90%,超过语义SOTA基线Adv-CPG +6.25个百分点、基于扩散的化妆方法DiffAIM +3个百分点以及基于噪声的P3-Mask +16个百分点。尽管攻击效果强劲,Adv-TGD仍保持了高视觉保真度(PSNR = 27.15 dB,SSIM = 0.981)。此外,我们通过成功将其扩展到野外数据集(LADN)、通用对象分类(ImageNet)和基于Transformer的扩散模型(FLUX.1),展示了我们框架的灵活性。

英文摘要

The widespread adoption of face recognition (FR) technologies raises serious privacy concerns, as facial data can be exploited without consent. To address this challenge, we propose Adv-TGD, a generative adversarial attack framework that synthesizes photorealistic faces capable of impersonating target identities and deceiving face recognition systems. Built upon Stable Diffusion, Adv-TGD performs per-sample LoRA fine-tuning conditioned on concise textual prompts to generate natural yet adversarially manipulated identities. Unlike conventional identity-attack approaches, our method optimizes lightweight cross-attention adapters for each source-target pair within a single-step denoising process. Latent blending is constrained by a face-local heatmap mask to ensure spatially precise identity manipulation while preserving non-sensitive regions. We introduce a composite objective that integrates masked epsilon-MSE reconstruction, thresholded identity divergence in FR embedding space, directional feature alignment, and source-similarity suppression to balance adversarial attack and visual realism. Optionally, LLaVA-generated attribute prompts enhance fine-grained semantic details without reintroducing identity cues. Under the black-box evaluation protocol, Adv-TGD attains an average attack success rate (ASR) of 85.90% across IR152, IRSE50, MobileFace, and FaceNet, surpassing the semantic SOTA baseline Adv-CPG by +6.25 points, diffusion-based makeup method DiffAIM by +3 points, and noise-based P3-Mask by +16 points. Despite its strong attack efficacy, Adv-TGD preserves high visual fidelity (PSNR = 27.15 dB, SSIM = 0.981). Furthermore, we demonstrate the flexibility of our framework by successfully extending it to in-the-wild datasets (LADN), general object classification (ImageNet), and transformer-based diffusion models (FLUX.1).

2606.11620 2026-06-11 quant-ph cs.ET cs.LG 交叉投稿

Family-Aware Residual Architecture for Predicting Quantum Circuit Simulation Performance

面向预测量子电路模拟性能的族感知残差架构

Honjar Xing, Yehong Jiang, Xianbang Wang, Zehua Wang, Zhicheng Jiang

AI总结 提出族感知残差架构,利用电路族分类和算法指纹特征,预测量子电路模拟的最小近似阈值和运行时间,在7-130量子比特、10个算法族上实现79.5%精确阈值准确率和R²=0.82运行时间相关性。

详情
Comments
Accepted as a full paper at IEEE ISVLSI 2026 (QC-CSAA Workshop). To appear in IEEE Xplore. 6 pages, 1 figure, 2 tables
AI中文摘要

近似张量网络模拟器能够对超出精确方法范围的量子电路进行经典模拟,但选择最优近似参数(如键维阈值)仍然是一个成本高昂的试错过程。我们提出了一种族感知神经架构,仅根据电路的OpenQASM描述和执行上下文,即可预测实现目标保真度所需的最小近似阈值以及量子电路模拟的预期挂钟运行时间。我们的关键洞察是,来自不同算法族(例如QFT、Grover、VQE)的量子电路由于其不同的纠缠结构而表现出根本不同的模拟成本曲线。我们采用族条件残差校正——在共享骨干网络之上添加的、针对特定族的加性调整,借鉴了已建立的条件计算技术——使模型能够同时捕获通用电路属性和算法细微差别。该架构包含一个预训练的族分类器(准确率97.5%)和从门组成启发式算法导出的领域信息算法指纹特征。在跨越7-130量子比特、10个算法族的电路上评估,我们的系统实现了79.5%的精确阈值准确率(91.2%在一个阶梯内)和R²=0.82的运行时间相关性,推理时间约为50毫秒——取代了可能需要数分钟到数小时的试错模拟运行。消融研究证实,族感知建模提供了最大的单一性能改进(+3.2个百分点),验证了算法族是模拟成本预测的一等特征的假设。

英文摘要

Approximate tensor-network simulators enable classical simulation of quantum circuits beyond the reach of exact methods, but selecting optimal approximation parameters -- such as bond dimension thresholds -- remains a costly trial-and-error process. We present a family-aware neural architecture that predicts both the minimum approximation threshold required to achieve target fidelity and the expected wall-clock runtime for quantum circuit simulation, given only the circuit's OpenQASM description and execution context. Our key insight is that quantum circuits from different algorithmic families (e.g., QFT, Grover, VQE) exhibit fundamentally distinct simulation cost profiles due to their differing entanglement structures. We employ family-conditioned residual corrections -- additive, family-specific adjustments atop a shared backbone, drawing on established conditional computation techniques -- enabling the model to capture both universal circuit properties and algorithmic nuances. The architecture incorporates a pretrained family classifier (97.5% accuracy) and domain-informed algorithm fingerprint features derived from gate-composition heuristics. Evaluated on circuits spanning 7--130 qubits across 10 algorithm families, our system achieves 79.5% exact threshold accuracy (91.2% within one rung) and $R^2 = 0.82$ runtime correlation, with inference completing in approximately 50 ms -- replacing trial-and-error simulation runs that may take minutes to hours. Ablation studies confirm that family-aware modeling provides the single largest performance improvement (+3.2 percentage points), validating the hypothesis that algorithm family is a first-class feature for simulation cost prediction.
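
The phrase "additive, family-specific adjustments atop a shared backbone" can be illustrated with a small sketch. Linear maps stand in for the neural backbone and residual heads, which is an assumption made purely for brevity; the class name `FamilyResidualModel` and all values are hypothetical and not taken from the paper.

```python
import numpy as np

class FamilyResidualModel:
    """Shared linear backbone plus one additive linear residual head per family."""

    def __init__(self, n_features, n_families, rng):
        self.w_shared = 0.1 * rng.standard_normal(n_features)
        # Residual heads start at zero, so the model begins as the shared backbone
        self.w_family = np.zeros((n_families, n_features))

    def predict(self, x, family):
        x = np.asarray(x, dtype=float)
        family = np.asarray(family, dtype=int)
        shared = x @ self.w_shared                          # universal circuit properties
        residual = np.sum(x * self.w_family[family], axis=1)  # family-specific correction
        return shared + residual

rng = np.random.default_rng(0)
model = FamilyResidualModel(n_features=3, n_families=2, rng=rng)
x = rng.standard_normal((4, 3))
family = np.array([0, 1, 0, 1])      # e.g. 0 = QFT-like, 1 = Grover-like (illustrative)

base = model.predict(x, family)
model.w_family[1] = [1.0, 0.0, 0.0]  # adjust only family 1
adjusted = model.predict(x, family)
```

Because the correction is additive and indexed by family, changing one family's head leaves predictions for every other family untouched, which is what lets the shared part capture properties common to all circuits.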

2606.11663 2026-06-11 cs.SI cs.LG 交叉投稿

Probabilistic Salary Prediction with Graph Attention Networks and a Mixture Density Network

基于图注意力网络和混合密度网络的概率薪资预测

Zhipei Qin, Mohammad Shokri, N. van Weeren, F.W. Takes

AI总结 提出GAT-MDN框架,通过构建属性关系图并使用图注意力网络学习节点表示,结合混合密度网络输出薪资分布,在百万级荷兰招聘数据集上优于基线模型。

详情
Comments
5 pages, 3 figures
AI中文摘要

准确的薪资预测对于弥合现代劳动力市场中雇主与求职者之间的信息差距至关重要。现有方法主要产生单点估计,并将工作属性(如地点、职业和行业)视为独立的分类特征,忽略了真实世界薪酬数据固有的不确定性和多模态性,以及支配薪资规范的丰富层次结构和语义相似性关系。在本文中,我们提出了GAT-MDN,一个同时解决这两个限制的统一框架。对于三个属性域中的每一个,我们构建了一个特定领域的图,其边编码了(i)层次化的父子包含关系和(ii)从预训练的Sentence-Transformer导出的加权相似性链接。具有边缘特征感知注意力的并行图注意力网络(GAT)从这些多关系图中学习丰富的、上下文感知的节点表示。然后,一个基于优先级的层次选择模块组装一个复合特征向量,优雅地处理缺失或粗略的属性,而混合密度网络(MDN)头将该向量映射到高斯混合模型(GMM)的参数,产生完整的条件薪资分布。在超过100万条记录的真实世界荷兰招聘数据集上的大量实验表明,GAT-MDN在负对数似然(NLL)和均方误差(MSE)方面均显著优于非图MLP-MDN基线。

英文摘要

Accurate salary prediction is critical for bridging the information gap between employers and job seekers in modern labor markets. Existing approaches predominantly yield a single point estimate and treat job attributes such as location, occupation, and industry as independent categorical features, ignoring both the inherent uncertainty and multi-modality of real-world compensation data and the rich hierarchical and semantic-similarity relationships that govern pay norms. In this paper we propose GAT-MDN, a unified framework that addresses both limitations simultaneously. For each of the three attribute domains we construct a domain-specific graph whose edges encode (i) hierarchical parent-child containment and (ii) weighted similarity links derived from a pre-trained Sentence-Transformer. Parallel Graph Attention Networks (GATs) with edge-feature-aware attention learn rich, context-sensitive node representations from these multi-relational graphs. A priority-based hierarchical selection module then assembles a composite feature vector that gracefully handles missing or coarse attributes, and a Mixture Density Network (MDN) head maps this vector to the parameters of a Gaussian Mixture Model (GMM), yielding a full conditional salary distribution. Extensive experiments on a real-world Dutch job-posting dataset of over 1 million records demonstrate that GAT-MDN significantly outperforms a non-graph MLP-MDN baseline in both Negative Log-Likelihood (NLL) and Mean Squared Error (MSE).
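
The final stage described above, an MDN head whose outputs parameterize a Gaussian mixture scored by negative log-likelihood, can be sketched as below. Only the loss is shown; the graph attention encoders are omitted. The function name `gmm_nll` and the raw-output layout (logits, means, log standard deviations) are assumptions for illustration, not the paper's exact parameterization.

```python
import numpy as np

def gmm_nll(y, logits, means, log_sigmas):
    """Mean negative log-likelihood of targets under per-sample Gaussian mixtures.

    y: (n,) targets; logits, means, log_sigmas: (n, k) raw head outputs.
    """
    y = np.asarray(y, dtype=float)[:, None]
    logits = np.asarray(logits, dtype=float)
    means = np.asarray(means, dtype=float)
    log_sigmas = np.asarray(log_sigmas, dtype=float)

    # Log-softmax turns logits into log mixture weights
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_pi = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Log density of each Gaussian component at y
    z = (y - means) / np.exp(log_sigmas)
    log_norm = -0.5 * np.log(2.0 * np.pi) - log_sigmas - 0.5 * z ** 2

    # Log-sum-exp over components for numerical stability
    joint = log_pi + log_norm
    m = joint.max(axis=1, keepdims=True)
    log_lik = (m + np.log(np.exp(joint - m).sum(axis=1, keepdims=True)))[:, 0]
    return -log_lik.mean()

# Sanity check: two identical standard-normal components evaluated at y = 0
y = np.array([0.0, 0.0])
zeros = np.zeros((2, 2))
nll = gmm_nll(y, zeros, zeros, zeros)  # equals 0.5 * log(2 * pi)
```

Predicting log standard deviations rather than standard deviations keeps the scale positive without a constraint, and the mixture form is what allows a multi-modal salary distribution instead of a single point estimate.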

2606.11676 2026-06-11 cs.CE cs.LG physics.comp-ph 交叉投稿

Neural-Parameterized Cellular Automata for Wildfire Spread

神经参数化元胞自动机用于野火蔓延

Maksym Zhenirovskyy, Ion Matei, Rohit Vuppala, Takuya Kurihana, Hon Yung Wonga

AI总结 提出一种混合深度学习参数化概率元胞自动机框架,利用多尺度卷积神经网络动态生成空间变化参数,在保持物理可解释性的同时捕捉复杂环境交互,在六次大型野火中实现72小时IoU>0.6的预测。

详情
Comments
16 pages, 9 figures
AI中文摘要

传统野火模型依赖刚性、低维参数和静态燃料图,常常低估火势蔓延。为解决这一弱点,我们引入了一个在JAX中实现的混合深度学习参数化概率元胞自动机(CA)框架。我们的方法采用多尺度卷积神经网络动态生成控制火势蔓延概率、风向对齐和坡度影响的空间变化参数。这种混合设计捕捉了复杂的非线性环境交互,同时保留了底层三态CA的物理可解释性。JAX实现支持硬件加速和基于梯度的参数校准。在美国西部六次大规模野火上的评估显示,在10天数据同化窗口期间模型逐步拟合观测到的火线后,该模型在72小时预测范围内保持IoU>0.6;由此产生的预测是在这些观测中已编码的抑制机制下火势增长的条件投影。

英文摘要

Traditional wildfire models rely on rigid, low-dimensional parameters and static fuel maps, frequently underpredicting fire spread. To address this weakness, we introduce a hybrid deep-learning parameterized Probabilistic Cellular Automata (CA) framework implemented in JAX. Our approach employs a Multi-Scale Convolutional Neural Network to dynamically generate spatially varying parameters that govern fire-spread probability, wind alignment, and slope influence. This hybrid design captures complex, nonlinear environmental interactions while preserving the physical interpretability of the underlying three-state CA. The JAX implementation enables hardware acceleration and gradient-based parameter calibration. Evaluated on six large-scale wildfires in the western United States, the model maintains IoU > 0.6 over 72-hour forecast horizons after a 10-day data assimilation window during which the model is fitted incrementally to observed perimeters; the resulting forecast is a conditional projection of fire growth under the suppression regime already encoded in those observations.
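
A three-state probabilistic cellular automaton of the kind named in the abstract can be sketched in a few lines. The sketch uses NumPy rather than JAX, a 4-cell neighbourhood, and a plain array in place of the network-generated parameter map; wind alignment and slope terms are omitted. The state names, the function `ca_step`, and the update rule are illustrative assumptions, not the authors' model.

```python
import numpy as np

UNBURNED, BURNING, BURNED = 0, 1, 2

def ca_step(state, p_spread, rng):
    """One update of a three-state probabilistic fire CA on a 2-D grid.

    state: int array of cell states; p_spread: per-cell probability that a
    single burning 4-neighbour ignites the cell (same shape as state).
    """
    burning = state == BURNING
    padded = np.pad(burning, 1, constant_values=False)
    # Count burning neighbours above, below, left, and right of each cell
    n_burning = (padded[:-2, 1:-1].astype(int) + padded[2:, 1:-1]
                 + padded[1:-1, :-2] + padded[1:-1, 2:])
    # Probability that at least one burning neighbour ignites the cell
    p_ignite = 1.0 - (1.0 - p_spread) ** n_burning
    ignite = (state == UNBURNED) & (rng.random(state.shape) < p_ignite)

    new_state = state.copy()
    new_state[burning] = BURNED   # burning cells burn out after one step
    new_state[ignite] = BURNING
    return new_state

rng = np.random.default_rng(0)
state = np.zeros((3, 3), dtype=int)
state[1, 1] = BURNING
p_map = np.ones((3, 3))           # stand-in for a spatially varying parameter map
after = ca_step(state, p_map, rng)
```

Because the spread probability enters as a per-cell array, a learned model only has to output that array; the transition rule itself stays fixed and inspectable.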

2606.11687 2026-06-11 cs.CV cs.LG cs.RO 交叉投稿

DroneShield-AI: A Multi-Modal Sensor Fusion Framework for Real-Time Autonomous Drone Threat Detection, Behavioral Intent Classification, and Swarm Intelligence in Contested Airspace

DroneShield-AI:一种用于受争议空域中实时自主无人机威胁检测、行为意图分类和群体智能的多模态传感器融合框架

Marius Bayizere

AI总结 提出DroneShield-AI框架,集成RF信号分类、声学检测、YOLOv8视觉检测等六层处理,通过行为意图分类引擎(BICE)实现六类威胁分类并提前30秒预警,以及图神经网络群体智能模块(GNN-SIM)分析多无人机编队,在低成本硬件上达到96.1%检测精度和142ms延迟。

详情
Comments
23 pages, 6 figures, 11 tables. Code available at this https URL
AI中文摘要

无人机(UAV)威胁已成为21世纪定义性的安全挑战。本文提出DroneShield-AI,一个统一的开放框架,集成了六个处理层:RF信号分类、声学电机特征检测、基于YOLOv8的视觉检测、证据加权传感器融合、行为意图分类引擎(BICE)和图神经网络群体智能模块(GNN-SIM)。BICE首次引入了针对无人机飞行模式的系统性六类威胁分类法,能够提前30秒发出预测性操作员警报。GNN-SIM是首个用于对抗性多无人机编队分析的开放框架,采用图注意力网络。在三个公开的真实世界数据集上评估,融合流水线在约500-780美元总系统成本的商用CPU级硬件上实现了96.1%的检测准确率、3.2%的误报率、AUC-ROC:0.981以及142ms的端到端延迟。所有代码、模型权重和仿真数据集在提交时公开发布。

英文摘要

Unmanned Aerial Vehicle (UAV) threats have emerged as a defining security challenge of the 21st century. This paper presents DroneShield-AI, a unified open framework integrating six processing layers: RF signal classification, acoustic motor-signature detection, YOLOv8-based visual detection, evidence-weighted sensor fusion, a Behavioral Intent Classification Engine (BICE), and a Graph Neural Network Swarm Intelligence Module (GNN-SIM). BICE introduces the first systematic six-class threat taxonomy for drone flight patterns, enabling predictive operator alerts with a 30-second advance-warning horizon. GNN-SIM is the first open framework for adversarial multi-drone formation analysis using Graph Attention Networks. Evaluated on three publicly available real-world datasets, the fused pipeline achieves 96.1% detection accuracy, a 3.2% false-alarm rate, an AUC-ROC of 0.981, and 142 ms end-to-end latency on commodity CPU-class hardware at approximately $500-$780 USD total system cost. All code, model weights, and simulation datasets are publicly released at submission.

2606.11737 2026-06-11 astro-ph.EP astro-ph.IM cs.LG 交叉投稿

Machine-learning clustering of close-in exoplanet populations: links to pebble accretion

近地系外行星的机器学习聚类:与卵石吸积的联系

Yi Duann, Anders Johansen, Haiyang S. Wang, H. Jens Hoeijmakers

AI总结 利用高斯混合模型对近地系外行星进行无监督聚类,揭示其内在子群,并通过卵石吸积合成种群解释形成路径差异。

AI中文摘要

近距(close-in,即紧邻宿主恒星运行的)系外行星展现出由形成条件和迁移过程塑造的广泛轨道构型和物理性质。尽管种群合成模型预测了不同的行星种群,但在观测到的系外行星与合成种群之间建立定量联系仍然具有挑战性。我们使用物理驱动的动力学参数研究近距系外行星的内在组织,并将所得种群与卵石吸积形成路径联系起来。将两阶段高斯混合模型应用于观测到的近距系外行星样本,在由行星-恒星相互作用的动力学描述符主导的特征空间中进行无监督概率聚类。将所得聚类映射到统计驱动的三维参数空间中的卵石吸积合成种群。然后使用与形成相关的量(包括气体可用性、气体分数和冰岩质量比)来解释映射的种群。我们在不施加预定义分类边界的情况下识别出统计上支持的子群,包括超大质量气态巨行星、热巨行星、暖木星主导系统和低质量巨行星。映射的合成种群揭示了形成时间、气体吸积和固体增长历史的系统性差异。特别是,超大质量气态巨行星比热巨行星和暖木星主导种群更倾向于与更早的形成时期相关联。这些结果表明,物理驱动的机器学习方法可以为观测到的系外行星种群与理论行星形成路径之间的联系提供统计上稳健的框架。

英文摘要

Close-in exoplanets exhibit a wide range of orbital architectures and physical properties shaped by both formation conditions and migration processes. Although population-synthesis models predict distinct planetary populations, establishing a quantitative connection between observed exoplanets and synthetic populations remains challenging. We investigate the intrinsic organisation of close-in exoplanets using physically motivated dynamical parameters and connect the resulting populations to pebble-accretion formation pathways. A two-stage Gaussian mixture model (GMM) is applied to an observed sample of close-in exoplanets, performing unsupervised probabilistic clustering in a feature space dominated by dynamical descriptors of planet-star interactions. The resulting clusters are mapped onto a pebble-accretion synthetic population within a statistically motivated three-dimensional parameter space. Formation-related quantities, including gas availability, gas fraction, and ice-rock mass ratio, are then used to interpret the mapped populations. We identify statistically supported sub-populations without imposing predefined classification boundaries, including very-massive gas giants, hot giants, warm-Jupiter-dominated systems, and lower-mass giants. The mapped synthetic populations reveal systematic differences in formation timing, gas accretion, and solid growth histories. In particular, very-massive gas giants are preferentially associated with earlier formation epochs than hot-giant and warm-Jupiter-dominated populations. These results demonstrate that physically motivated machine-learning approaches can provide a statistically robust framework for linking observed exoplanet populations to theoretical planet formation pathways.

2606.11743 2026-06-11 cs.RO cs.GR cs.LG 交叉投稿

TacCoRL: Integrating Tactile Feedback into VLA via Simulation

TacCoRL: 通过仿真将触觉反馈集成到视觉-语言-动作模型中

Siyu Ma, Yuqi Liang, Chang Yu, Yunuo Chen, Hao Su, Yixin Zhu, Yin Yang, Chenfanfu Jiang

发表机构 * University of California, Los Angeles(加利福尼亚大学洛杉矶分校) University of California, San Diego(加利福尼亚大学圣迭戈分校) University of Electronic Science and Technology of China(电子科技大学) Peking University(北京大学) University of Utah(犹他大学)

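AI summary: Proposes the TacCoRL framework, which injects tactile feedback into vision-language-action policies through sim-real co-training and reinforcement learning, raising the average success rate on contact-rich tasks by 22.5 percentage points (72.5% versus a 50.0% baseline).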

AI中文摘要

视觉-语言-动作(VLA)模型为机器人操作提供了强大的视觉、语言和动作先验,但仅凭视觉观察往往缺失接触密集型任务所需的局部接触状态。我们提出TacCoRL,一个可扩展的框架,将触觉反馈注入VLA策略,并通过仿真-真实联合训练和基于仿真的强化学习(RL)进行改进,无需大规模触觉预训练或广泛的真实世界接触探索。关键思想不仅是添加触觉作为输入,而是学习在接近失败状态下接触读数应如何调节动作响应,这些状态在演示中罕见且在硬件上收集风险高。我们使用真实对齐的仿真器作为接触交互的闭环训练环境。混合的仿真和真实轨迹首先在预训练策略中热启动触觉条件动作。具有可验证任务奖励的强化学习随后通过仿真接触回滚优化策略。它强化导致任务完成的触觉条件动作,而真实轨迹上的监督目标将精炼策略锚定到部署的视觉、触觉和动作分布。所得策略直接转移到真实机器人,无需特权仿真状态或在线真实世界RL。在四个双臂接触密集型任务中,最终的视觉-触觉策略平均成功率达到72.5%,而基线为50.0%。结果视频和更多细节见此链接。

英文摘要

Vision-language-action (VLA) models provide strong visual, language, and action priors for robot manipulation, but visual observations alone often miss the local contact state required for contact-rich tasks. We present TacCoRL, a scalable framework that injects Tactile feedback into VLA policies and improves them through sim-real Co-training and simulation-based reinforcement learning (RL), without requiring large-scale tactile pretraining or extensive real-world contact exploration. The key idea is not only adding touch as an input, but learning how contact readings should modulate action responses in near-failure states that are rare in demonstrations and risky to collect on hardware. We use a real-aligned simulator as a closed-loop training environment for contact interaction. Mixed simulated and real trajectories first warm-start tactile-conditioned actions in the pretrained policy. Reinforcement learning with verifiable task rewards then optimizes the policy using simulated contact rollouts. It reinforces tactile-conditioned actions that lead to task completion, while a supervised objective on real trajectories keeps the refined policy anchored to the deployment visual, tactile, and action distributions. The resulting policy transfers directly to the real robot without privileged simulation state or online real-world RL. Across four bimanual contact-rich tasks, the final visuo-tactile policy achieves an average success rate of 72.5%, compared to a baseline of 50.0%. Result videos and more details are available at this https URL

2606.11814 2026-06-11 quant-ph cs.AI cs.LG 交叉投稿

Sparsified Kolmogorov-Arnold Networks for Interpretable Quantum State Tomography

稀疏化Kolmogorov-Arnold网络用于可解释量子态层析

Xinge Wu, Huaxin Wang, Jiajun Liu, Ruiqing He, Jiandong Shang, Hengliang Guo, Qiang Chen

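AI summary: Examines whether a sparsified Kolmogorov-Arnold Network can serve as an inspectable reconstruction rule. On a three-qubit GHZ benchmark, it identifies the GHZ-relevant Pauli measurement set and finds input-hidden-output pathways consistent with the analytic GHZ Pauli grouping, giving structural interpretability for a neural reconstruction model.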

AI中文摘要

量子态层析的机器学习方法可以实现高保真度重构,但训练模型所使用的物理结构往往隐含。这里我们探究稀疏化Kolmogorov-Arnold网络(KAN)是否不仅可以作为回归器,还可以作为可检查的重构规则,其内部组织可以与已知的Pauli结构进行对照。我们研究了一个受控的三量子比特GHZ族基准测试,其中所有63个非恒等Pauli期望值被用于重构三个GHZ子空间变量:布居不平衡$z$、实部非对角分量$c$和虚部非对角分量$s$。在有限采样和退极化噪声下,外部消融从63个测量中识别出扩展的12通道GHZ相关Pauli集,在测试的采样次数和退极化噪声强度下实现了精确的前12恢复。这些支持模式在多种子随机初始化和噪声水平分析中保持稳定,并在随机标签控制下崩溃。主要的剪枝输入-隐藏-输出通路以与解析GHZ Pauli分组一致的方式组织Z型布居可观测量和X/Y非对角可观测量,稀疏公式恢复得到了规范的带符号Pauli关系。因此,KAN的贡献在于神经重构模型中的通路级结构可解释性,而非优越的稀疏回归。结合阴性对照,这些探针提供了一条一致性链,用于审计学习到的重构规则与已知物理结构的一致性。

英文摘要

Machine-learning approaches to quantum state tomography can achieve high reconstruction fidelity, but the physical structure used by the trained model often remains implicit. Here we ask whether a sparsified Kolmogorov-Arnold Network (KAN) can be used not only as a regressor, but also as an inspectable reconstruction rule whose internal organization can be checked against known Pauli structure. We study a controlled three-qubit GHZ-family benchmark in which all 63 non-identity Pauli expectation values are used to reconstruct three GHZ-subspace variables: the population imbalance $z$, the real off-diagonal component $c$, and the imaginary off-diagonal component $s$. Under finite-shot sampling and depolarizing noise, external ablation identifies the extended 12-channel GHZ-relevant Pauli set from the 63 measurements, with exact top-12 recovery across the tested shot counts and depolarizing-noise strengths. These support patterns remain stable across multi-seed random-initialization and noise-level analyses, and collapse under random-label controls. The dominant pruned input-hidden-output pathways organize Z-type population observables and X/Y off-diagonal observables in a pattern consistent with the analytic GHZ Pauli grouping, and sparse formula recovery recovers the canonical signed Pauli relations. The contribution of the KAN is therefore pathway-level structural interpretability within a neural reconstruction model, rather than superior sparse regression. Together with negative controls, these probes provide a consistency chain for auditing learned reconstruction rules against known physical structure.

2606.11857 2026-06-11 eess.SP cs.LG 交叉投稿

REACH: Interpretability-Driven Feature Identification and Architecture Compression for Multi-Channel Vehicular Channel Estimation

REACH:面向多信道车辆信道估计的可解释性驱动特征识别与架构压缩

Simbarashe Aldrin Ngorima, Albert Helberg, Marelie H. Davel

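AI summary: Proposes the REACH framework, which uses gradient-based attribution to identify key time-frequency features and compress the network. For IEEE 802.11p channel estimation it substantially reduces parameters and FLOPs, and OOD generalisation degrades more slowly than within-distribution accuracy as compression increases.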

Comments
22 pages, 16 figures
AI中文摘要

多信道混合信噪比训练改善了IEEE 802.11p车辆通信中深度学习信道估计器的分布外(OOD)泛化能力,但其内部机制尚不明确。本文提出REACH(基于相关性的信道估计器解释与架构压缩),一个在两层上运行的基于梯度的可解释性框架。输入级归因识别出一组在所有评估信道条件下始终相关的时频特征,从而以最小的性能损失实现输入维度缩减。滤波器级归因揭示了一种近乎通用的内部表示,为观察到的OOD泛化提供了表示层面的解释。基于由此产生的滤波器分类,相关性引导的架构压缩在归一化均方误差(NMSE)退化小于1 dB的情况下,大幅减少了参数数量和浮点运算次数(FLOPs),并且随着压缩程度的增加,OOD泛化性能的下降速度慢于分布内准确率的下降速度。

英文摘要

Multi-channel mixed-SNR training improves out-of-distribution (OOD) generalisation of deep learning channel estimators for IEEE 802.11p vehicular communications, yet the internal mechanism responsible for this remains unexplained. This work presents REACH (Relevance-based Explanation and Architectural Compression for cHannel estimators), a gradient-based interpretability framework that operates at two levels. Input-level attribution identifies a subset of time-frequency features consistently relevant across all evaluated channel conditions, enabling input dimensionality reduction with minimal performance loss. Filter-level attribution reveals a near-universal internal representation, providing a representational account of the observed OOD generalisation. Guided by the resulting filter taxonomy, relevance-guided architecture compression substantially reduces both the number of parameters and the number of floating-point operations (FLOPs) with sub-1 dB normalised mean square error (NMSE) degradation, and OOD generalisation degrades more slowly than within-distribution accuracy under increasing compression.

2606.11870 2026-06-11 cond-mat.mtrl-sci cs.LG 交叉投稿

Modelling magnetic material properties with uncertainty-aware neural networks

用不确定性感知神经网络建模磁性材料性质

Clemens Wager, Heisam Moustafa, Alexander Kovacs, Qais Ali, Harald Oezelt, Hayate Yamano, Masao Yano, Noritsugu Sakuma, Hyuga Hosoi, Akihito Kinoshita, Tetsuya Shoji, Akira Kato, Thomas Schrefl

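AI summary: Addresses uncertainty from scarce data and out-of-distribution prediction in materials discovery by quantifying predictive uncertainty with a Gaussian negative log-likelihood loss and a dropout-based Bayesian approximation, then transfers the approach to predicting coercivity from microstructure, showing that uncertainty quantification improves the trustworthiness of predictions and carries over between tasks.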

Comments
pre print, unreviewed version
AI中文摘要

机器学习越来越多地被应用于通过探索大成分和结构设计空间来加速新材料的发现。然而,高质量数据的稀缺以及频繁的分布外预测需求引入了大量不确定性,使得评估模型可靠性变得至关重要。在这项工作中,我们研究了不确定性量化作为评估永磁体研究背景下模型置信度的一种手段。在第一项研究中,我们基准测试了经典和现代机器学习模型在预测本征磁性方面的性能,重点关注其不确定性估计的质量。我们应用高斯负对数似然损失和基于dropout的贝叶斯近似作为估计预测不确定性的实用策略。在第二项研究中,我们将这些用于不确定性估计的架构特征迁移到一个更复杂的任务:使用图神经网络从微观结构信息预测矫顽力。这些研究共同表明,不确定性量化不仅增强了预测的可信度,而且在不同建模任务之间是可迁移的。

英文摘要

Machine learning is increasingly applied to accelerate the discovery of novel materials by exploring large compositional and structural design spaces. Yet, the scarcity of high-quality data and the frequent need for out-of-distribution prediction introduce substantial uncertainty, making the assessment of model reliability essential. In this work, we investigate uncertainty quantification as a means to evaluate model confidence in the context of permanent magnet research. In a first study, we benchmark classical and modern machine learning models for predicting intrinsic magnetic properties, focusing on the quality of their uncertainty estimates. We apply Gaussian negative log-likelihood loss and dropout-based Bayesian approximation as practical strategies for estimating predictive uncertainty. In a second study, we transfer these architectural features for uncertainty estimation to a more complex task: predicting coercivity from microstructural information using a graph neural network. Together, these studies demonstrate that uncertainty quantification not only enhances the trustworthiness of predictions but is also transferable across different modeling tasks.
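
The two uncertainty strategies named in the abstract can be sketched in a few lines. This is only an illustration under stated assumptions, not the authors' implementation: the function names `gaussian_nll` and `mc_dropout_summary` are hypothetical, the model is assumed to output a mean and a log-variance per target, and the list of samples stands in for repeated forward passes with dropout left active.

```python
import math

def gaussian_nll(y, mean, log_var):
    """Per-sample Gaussian negative log-likelihood.

    Assumes the network predicts a mean and a log-variance for each target,
    so the loss penalises both the error and over- or under-confident variances.
    """
    var = math.exp(log_var)
    return 0.5 * (math.log(2.0 * math.pi) + log_var + (y - mean) ** 2 / var)

def mc_dropout_summary(samples):
    """Mean and (population) variance over repeated stochastic forward passes.

    `samples` stands in for predictions of the same input with dropout kept on;
    the spread across passes is read as an approximate model uncertainty.
    """
    n = len(samples)
    mean = sum(samples) / n
    var = sum((s - mean) ** 2 for s in samples) / n
    return mean, var
```

For a fixed error, the first function is minimised when the predicted variance equals the squared error, which is what lets the predicted variance act as an uncertainty estimate.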

2606.11876 2026-06-11 q-bio.QM cs.LG stat.ME 交叉投稿

Seeing Below the Limit of Detection: A Censored-Poisson Bayesian Latent-Growth Change-Point Detector (the Span Detector) for Serial ctDNA in HR+/HER2- Metastatic Breast Cancer

检测限以下:用于HR+/HER2-转移性乳腺癌连续ctDNA的删失泊松贝叶斯潜在增长变点检测器(Span检测器)

Aarchi Singh Thakur, Abhijoy Sarkar

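AI summary: Proposes the Span detector, a censored-Poisson Bayesian latent-growth change-point model that treats ctDNA non-detects as left-censored observations and uses a sequential generalised-likelihood-ratio statistic to detect an upward change in the per-variant detection rate. At a 10% false-alarm rate on a synthetic cohort, it raises the fraction of progressions caught three months ahead from 11% to 25% in the indolent regime.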

Comments
9 pages, 4 figures, 2 tables. Code and synthetic data generator: this https URL
AI中文摘要

循环肿瘤DNA(ctDNA)在影像学显示耐药性数月前就已携带证据,但最早证据存在于检测限(LoD)以下:新生亚克隆仅被间歇性检测到,产生微弱检测和非检测的闪烁序列。商业液体活检将每次抽取视为独立快照,并将非检测视为无信号。我们认为非检测是左删失观测,而随时间变化的非检测和微弱检测模式在单个值可信之前就携带了可操作的生长证据。我们引入Span,一种删失泊松贝叶斯潜在增长变点检测器,它对二元检测过程建模,为每个变异的检测率累积一个向上变点的序贯广义似然比统计量,并以校准的假警报控制发出竞争风险警报。Span没有学习权重,因此没有过拟合风险。在一线CDK4/6抑制剂联合内分泌治疗的HR+/HER2-转移性乳腺癌合成队列中,在匹配的10%假警报率下,Span将提前三个月捕获的即将进展比例大约翻倍(惰性出现:25% vs 快照的11%),具有可证伪的剂量反应:对惰性出现效果显著,对快速出现效果消失。值轨迹基线表现与快照相同,将增益归因于删失检测模型。生存主干在真实乳腺癌数据(GBSG-2,n=686;C指数0.67 vs 0.68)上与Cox基线匹配,在具有清洁生物标志物的真实纵向队列(PBC2,n=312)上,同一管道正确拒绝获胜,这是一个可证伪的边界测试,确认机制是特定于状态的。所有ctDNA轨迹均为合成数据。

英文摘要

Circulating-tumour DNA (ctDNA) carries evidence of drug resistance months before imaging shows it, but the earliest evidence lives below the assay's limit of detection (LoD): a nascent subclone is detected only intermittently, producing a flickering sequence of faint detects and non-detects. Commercial liquid biopsies treat each draw as an independent snapshot and a non-detect as nothing. We argue a non-detect is a left-censored observation, and the pattern of non-detects and faint detects over time carries actionable evidence of growth before any single value is trustworthy. We introduce Span, a censored-Poisson Bayesian latent-growth change-point detector that models the binary detection process, accumulates a sequential generalised-likelihood-ratio statistic for an upward change-point in the per-variant detection rate, and raises a competing-risks alarm with calibrated false-alarm control. Span has no learned weights, so there is nothing to overfit. On a synthetic cohort of HR+/HER2- metastatic breast cancer on first-line CDK4/6-inhibitor plus endocrine therapy, at a matched 10% false-alarm rate, Span roughly doubles the fraction of impending progressions caught three months ahead (indolent regime: 25% vs 11% for the snapshot), with a falsifiable dose-response: large for indolent emergence, vanishing for fast emergence. A value-trajectory baseline performs identically to the snapshot, isolating the gain to the censored detection model. The survival backbone matches a Cox baseline on real breast-cancer data (GBSG-2, n=686; C-index 0.67 vs 0.68), and on a real longitudinal cohort with clean biomarkers (PBC2, n=312) the same pipeline correctly declines to win, a falsifiable boundary test confirming the mechanism is regime-specific. All ctDNA trajectories are synthetic.
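
The idea of reading a non-detect as a left-censored count, and of accumulating evidence across draws, can be illustrated with a much simpler stand-in. The sketch below is not the Span detector: it swaps the generalised likelihood ratio for a plain one-sided CUSUM with fixed rates, and the names `detect_prob` and `cusum_alarm`, the rates `p0` and `p1`, the limit `lod`, and the threshold are all assumptions made for illustration.

```python
import math

def detect_prob(lam, lod=1):
    """P(count >= lod) for a Poisson(lam) count.

    A non-detect is the left-censored event count < lod, so its
    probability is the Poisson CDF evaluated at lod - 1.
    """
    below = sum(math.exp(-lam) * lam ** j / math.factorial(j) for j in range(lod))
    return 1.0 - below

def cusum_alarm(detects, p0, p1, threshold):
    """One-sided CUSUM on a 0/1 detection sequence.

    p0 is the assumed baseline detection rate and p1 > p0 the assumed rate
    after an upward change. Returns the index of the first draw at which
    the statistic reaches the threshold, or None if it never does.
    """
    stat = 0.0
    for i, d in enumerate(detects):
        llr = math.log(p1 / p0) if d else math.log((1.0 - p1) / (1.0 - p0))
        stat = max(0.0, stat + llr)  # reset at zero so old non-detects do not mask a later rise
        if stat >= threshold:
            return i
    return None
```

With `p0 = 0.1` and `p1 = 0.5`, a single faint detect does not trip a threshold of 3.0, but several detects within a few draws do, which mirrors the abstract's point that the pattern over time carries the evidence.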

2606.11914 2026-06-11 eess.SP cs.LG 交叉投稿

NARRAS: Edge-Triggered Distributed Inference for CSI-Based Localization in Vehicular IoT Networks

NARRAS:车载物联网中基于CSI的定位的边缘触发分布式推理

Rodrigo Oliver, Ricardo Vazquez Alvarez, Alejandro Lancho, Stefano Rini

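AI summary: To address resource constraints in CSI-based localization with distributed antenna arrays, proposes NARRAS, an edge-triggered distributed inference policy in which each array decides locally whether to report its observation. Differentiable activity penalties control the budget and channel-chart regularization shapes the latent space, improving localization accuracy at low activity rates.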

Comments
10 pages, 5 figures, 5 tables. Under review at the IEEE Internet of Things Journal
AI中文摘要

基于CSI的定位与空间分布式天线阵列存在基本的资源权衡。每个阵列可以提供丰富的信道视图,但当只有少数阵列携带有用信息时,将所有阵列的观测结果转发到融合中心是浪费的,且共享上行链路仅支持有限数量的同时传输。我们让每个阵列本地决定其当前观测是否值得报告,受限于平均活跃发射机数量的预算。我们将这种抽象称为边缘触发分布式推理(ETDI)。它捕获了一类更广泛的任务导向通信问题,其中资源受限设备共享接入信道以完成共同推理任务。我们将ETDI实例化用于基于CSI的定位,这是车载物联网中的常见场景。空间分布的远程天线阵列(RAA)将来自用户设备(UE)传输的本地信道状态信息(CSI)编码为潜在特征,融合中心根据报告的特征子集估计UE位置。我们提出NARRAS,一种去中心化的报告策略,其中每个RAA将其最近观测的循环摘要与其最后传输的潜在记忆相结合。训练通过可微活动惩罚和验证校准的确定性阈值来控制显式活动预算,并使用通道图正则化来塑造潜在几何结构。实验表明,在可比的上行链路活动下,NARRAS比学习型和启发式稀疏报告策略提高了定位精度,而密集全报告模型仍然作为有用的无预算参考。在低活动率下,图正则化进一步减少了高百分位定位误差,表明几何感知的潜在表示在稀疏报告下更加鲁棒。

英文摘要

CSI-based localization with spatially distributed antenna arrays exposes a basic resource trade-off. Each array can provide a rich view of the channel, but forwarding observations from all arrays to a fusion center is wasteful when only a few carry useful information, and the shared uplink supports only a limited number of simultaneous transmissions. We let each array decide locally whether its current observation is worth reporting, subject to a budget on the average number of active transmitters. We refer to this abstraction as Edge-Triggered Distributed Inference (ETDI). It captures a broader class of task-oriented communication problems where resource-constrained devices share an access channel for a common inference task. We instantiate ETDI for CSI-based localization, a common scenario in vehicular IoT networks. Spatially distributed remote antenna arrays (RAAs) encode local channel state information (CSI) from user equipment (UE) transmissions into latent features, and the fusion center estimates the UE position from the subset of reported features. We propose NARRAS, a decentralized reporting policy in which each RAA combines a recurrent summary of its recent observations with a memory of the last latent it transmitted. Training controls an explicit activity budget through differentiable activity penalties and validation-calibrated deterministic thresholds, and uses channel-chart regularization to shape the latent geometry. Experiments show that, at comparable uplink activity, NARRAS improves localization accuracy over learned and heuristic sparse-reporting strategies, while dense full-report models remain useful budget-free references. In low-activity regimes, chart regularization further reduces high-percentile localization errors, suggesting that geometry-aware latent representations are more robust under sparse reporting.
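
One way to picture a validation-calibrated deterministic threshold is as a quantile of per-observation report scores chosen to meet a target activity rate. The sketch below assumes exactly that and nothing more: `calibrate_threshold`, `should_report`, and the scalar scores are hypothetical, and the abstract does not state how NARRAS computes its scores or thresholds.

```python
def calibrate_threshold(val_scores, target_rate):
    """Pick a threshold so that roughly `target_rate` of validation scores exceed it.

    Ties between scores can make the achieved rate differ from the target.
    """
    ranked = sorted(val_scores, reverse=True)
    k = int(round(target_rate * len(ranked)))
    if k <= 0:
        return float("inf")   # budget of zero: never report
    if k >= len(ranked):
        return float("-inf")  # budget covers everything: always report
    # Midpoint between the k-th and (k+1)-th largest validation scores.
    return 0.5 * (ranked[k - 1] + ranked[k])

def should_report(score, threshold):
    """Local, deterministic reporting decision for one observation."""
    return score > threshold
```

At deployment each array would apply `should_report` to its own score, so no coordination with the fusion center is needed to stay near the calibrated activity rate.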

2606.12215 2026-06-11 cs.CV cs.IR cs.LG 交叉投稿

MLT-Dedup: Efficient Large-Scale Online Video Deduplication via Multi-Level Representations and Spatial-Temporal Matching

MLT-Dedup:通过多级表示和时空匹配的高效大规模在线视频去重

David Yuchen Wang, Haoying Li, Hailun Xu, Wei Chee Yew, Zirui Zhu, Sanjay Saha, Hao Hei, Kanchan Sarkar, Kun Xu

发表机构 * TikTok Singapore(TikTok新加坡) School of Computing, National University of Singapore(新加坡国立大学计算机学院) TikTok San Jose(TikTok圣何塞)

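AI summary: Proposes the MLT-Dedup framework, which uses a multi-level video encoder to extract fine-grained frame-level and sparse clip-level embeddings and a Differential Feature-enhanced Similarity Module (DiF-SiM) for spatial-temporal matching. It reduces online repetition rates by 91% at 90% precision and increases indexing capacity 5x.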

Comments
Accepted by KDD-2026 ADS track
AI中文摘要

在线平台上用户生成视频内容的爆炸性增长伴随着大量近似重复视频的出现——这些视频相同或高度相似,但存在部分编辑差异。这些重复视频降低了用户体验,增加了存储和带宽成本,使得大规模视频去重成为一项关键任务。现有的视频去重框架在有限的索引预算下检索足够高质量候选视频方面面临根本性挑战,同时在效率和精度之间存在权衡。为了解决这些问题,我们提出了MLT-Dedup,一种基于多级表示和时空匹配的高效大规模在线视频去重框架。我们的方法采用多级视频编码器(ML-VE)提取细粒度的帧级嵌入和稀疏的片段级嵌入:稀疏嵌入支持高效的候选检索,而细粒度嵌入则用于精确的成对匹配。在匹配过程中,我们引入了DiF-SiM,一种差分特征增强相似性模块,能够定位重复的时间片段并提供可靠的相似性证据,以支持基于策略的去重决策。在真实大规模平台上的大量实验表明,MLT-Dedup在90%精度下将在线重复率降低了91%。此外,我们的稀疏检索设计使索引容量提升了5倍,从而在实际部署中实现了更广泛的候选覆盖。

英文摘要

The explosive growth of user-generated video content on online platforms is accompanied by the emergence of numerous near-duplicate videos--videos that are identical or highly similar but differ by partial edits. These duplicates degrade user experience and increase storage and bandwidth costs, making large-scale video deduplication a critical task. Existing video deduplication frameworks face a fundamental challenge in retrieving sufficient high-quality candidates under a limited index budget, as well as trade-offs between efficiency and precision. To address these issues, we propose MLT-Dedup, an efficient large-scale online video deduplication framework with Multi-Level representations and spatial-Temporal matching. Our approach employs a Multi-Level Video Encoder (ML-VE) to extract both fine-grained frame-level and sparse clip-level embeddings: sparse embeddings support efficient candidate retrieval, while fine-grained embeddings are loaded for precise pairwise matching. During matching, we introduce DiF-SiM, a Differential Feature-enhanced Similarity Module capable of locating duplicated temporal segments and providing reliable similarity evidence to support policy-driven deduplication decisions. Extensive experiments on a real-world large-scale platform demonstrate that MLT-Dedup reduces online repetition rates by 91% at 90% precision. Furthermore, our sparse retrieval design achieves a 5x increase in indexing capacity, enabling broader candidate coverage in real-world deployment.

2606.12346 2026-06-11 cs.CV cs.AI cs.LG 交叉投稿

Atlas H&E-TME: Scalable AI-Based Tissue Profiling at Expert Pathologist-Level Accuracy

Atlas H&E-TME:基于AI的可扩展组织分析,达到专家病理学家级别的准确性

Kai Standvoss, Miriam Hägele, Rosemarie Krupar, Julika Ribbat-Idel, Jennifer Altschüler, Gerrit Erdmann, Hans Pinckaers, Evelyn Ramberger, Madleen Drinkwitz, Ádám Nárai, Alexander Möllers, Katja Lingelbach, Sebastian Kons, Lukas Hönig, Recepcan Adigüzel, Joana Baião, Alberto Megina Gonzalo, Marius Teodorescu, Marie-Lisa Eich, Paolo Chetta, Shakil Merchant, Verena Aumiller, Simon Schallenberg, Andrew Norgan, Klaus-Robert Müller, Lukas Ruff, Maximilian Alber, Frederick Klauschen

发表机构 * Aignostics, Germany(Aignostics,德国) Institute of Pathology, Charité – Universitätsmedizin Berlin, Germany(柏林夏里特医学院病理学研究所) Berlin Institute of Health, Charité – Universitätsmedizin Berlin, Germany(柏林夏里特医学院柏林健康研究所) Massachusetts General Hospital, Department of Pathology, Harvard Medical School, Boston, MA, US(哈佛医学院麻省总医院病理学系) Department of Laboratory Medicine and Pathology, Mayo Clinic, Rochester, MN, US(梅奥诊所检验医学与病理学系) Machine Learning Group, Technische Universität Berlin, Germany(柏林工业大学机器学习组) BIFOLD – Berlin Institute for the Foundations of Learning and Data, Germany(柏林学习与数据基础研究所) Department of Artificial Intelligence, Korea University, Republic of Korea(高丽大学人工智能系) Max-Planck Institute for Informatics, Germany(马克斯·普朗克信息学研究所) German Cancer Research Center (DKFZ) & German Cancer Consortium (DKTK), Berlin & Munich Partner Sites, Germany(德国癌症研究中心及德国癌症联盟柏林和慕尼黑合作站点) Institute of Pathology, Ludwig-Maximilians-Universität München, Germany(慕尼黑大学病理学研究所) Bavarian Cancer Research Center (BZKF), Germany(巴伐利亚癌症研究中心)

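AI summary: Presents Atlas H&E-TME, a system built on pathology foundation models that predicts tissue quality, tissue region, and cell type. Validated against an IHC-informed consensus and benchmarked on more than 200,000 annotations, it matches or exceeds pathologist performance across multiple cancer types.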

AI中文摘要

苏木精和伊红(H&E)染色是组织病理学的基石,然而对H&E全切片图像(WSI)进行可扩展的定量分析仍然是计算病理学中的核心挑战。我们提出了Atlas H&E-TME,这是一个基于Atlas病理基础模型家族的AI系统,可预测多种癌症类型的组织质量、组织区域和细胞类型标签,在细胞级分辨率下每张切片产生超过4,500个定量读数。验证此类系统的关键挑战在于克服H&E-only金标准固有的形态模糊性,以及依赖免疫组织化学(IHC)等模态的更可靠参考的可扩展性有限。我们通过一个双重验证框架解决了这一问题,该框架将生物学深度的基础与技术及形态学的广度相结合。在深度方面,我们提出了一种IHC引导的多病理学家共识协议,该协议显著提高了相较于传统H&E-only注释的评分者间一致性。这产生了一个分子学基础的参考,我们据此比较Atlas H&E-TME和仅使用H&E的病理学家。在广度方面,我们在超过20万个高置信度H&E-only病理学家注释上对Atlas H&E-TME进行了基准测试,这些注释涵盖1,500多个病例,跨越八种癌症类型及其最常见的转移部位,亚型覆盖每种癌症类型>90%的临床病例,来自25个以上来源和8种以上扫描仪型号。与IHC引导的共识相比,Atlas H&E-TME达到或超过了病理学家仅使用H&E的性能,并在这一广泛的形态学和技术范围内一致且稳健地泛化。通过这种方式,Atlas H&E-TME将H&E切片——病理学中最普遍的数据——转化为一个可扩展的、定量的肿瘤及其微环境窗口,为转化和临床研究中下一代基于组织的生物标志物奠定了基础。

英文摘要

Hematoxylin and eosin (H&E) staining is the cornerstone of histopathology, yet scalable, quantitative analysis of H&E whole-slide images (WSIs) remains a central challenge in computational pathology. We present Atlas H&E-TME, an AI-based system built on the Atlas family of pathology foundation models that predicts tissue quality, tissue region, and cell type labels across multiple cancer types, yielding over 4,500 quantitative readouts per slide at cell-level resolution. A key challenge to validating such systems is overcoming morphological ambiguity inherent to H&E-only ground truth and the limited scalability of more informed references drawing on modalities such as immunohistochemistry (IHC). We address this with a dual validation framework combining biologically grounded depth with technical and morphological breadth. For depth, we propose an IHC-informed multi-pathologist consensus protocol that substantially improves inter-rater agreement over conventional H&E-only annotation. This yields a molecularly grounded reference against which we compare Atlas H&E-TME and pathologists working from H&E alone. For breadth, we benchmark Atlas H&E-TME on over 200,000 high-confidence H&E-only pathologist annotations across 1,500+ cases spanning eight cancer types and their most common metastatic sites, with subtypes covering >90% of clinical cases per cancer type, drawn from 25+ sources and 8+ scanner models. Benchmarked against the IHC-informed consensus, Atlas H&E-TME matches or exceeds pathologist H&E-only performance and generalizes consistently and robustly across this broad morphological and technical scope. In doing so, Atlas H&E-TME turns the H&E slide -- the most ubiquitous data in pathology -- into a scalable, quantitative window into the tumor and its microenvironment, laying a foundation for the next generation of tissue-based biomarkers in translational and clinical research.

2606.12406 2026-06-11 cs.RO cs.AI cs.LG eess.SY 交叉投稿

FACTR 2: Learning External Force Sensing for Commodity Robot Arms Improves Policy Learning

FACTR 2: 学习商用机器人手臂的外部力感知提升策略学习

Steven Oh, Jason Jingzhou Liu, Tony Tao, Philip Han, Kenneth Shaw, Satoshi Funabashi, Ruslan Salakhutdinov, Deepak Pathak

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Waseda University(早稻田大学)

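AI summary: Proposes NEXT, a data-driven method that needs no dedicated force sensors, trains in 1 minute from 10 minutes of free-motion data, and achieves estimates comparable to dedicated joint-torque sensors. Combined with the FIRST re-sampling strategy, it improves policy learning.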

Comments
Website at this https URL
AI中文摘要

接触丰富的操作需要力敏感性,但由于成本高昂,许多机器人手臂缺乏专用的力传感器。我们提出了神经外部力矩估计(NEXT),一种无需任何专用力传感器即可估计外部关节力矩的数据驱动方法。NEXT 仅需 10 分钟的自由运动数据即可在 1 分钟内完成训练,却能实现与专用关节力矩传感器相当的估计。NEXT 能够在低成本手臂上实现力反馈遥操作,并通过力信息重采样训练(FIRST)改进策略学习,该训练在行为克隆过程中对预接触和接触段进行上采样。在五个长时域任务中,FIRST 在任务进展上比先前的力感知策略提高了超过 17%。NEXT 和 FIRST 共同将力感知遥操作和策略学习引入现成的机器人,无需额外的传感硬件。视频结果和代码见此链接。

英文摘要

Contact-rich manipulation requires force sensitivity, but many robot arms lack dedicated force sensors due to their high cost. We present Neural External Torque Estimation (NEXT), a data-driven method that estimates external joint torques without needing any dedicated force sensors. NEXT trains in 1 minute from only 10 minutes of free-motion data, yet achieves estimates comparable to dedicated joint-torque sensors. NEXT enables force-feedback teleoperation on low-cost arms and improves policy learning through Force-Informed Re-Sampling Training (FIRST), which up-samples pre-contact and contact segments during behavior cloning. Across five long-horizon tasks, FIRST outperforms prior force-aware policies by over 17% in task progress. Together, NEXT and FIRST bring force-aware teleoperation and policy learning to off-the-shelf robots without additional sensing hardware. Video results and code are available at this https URL

2508.21380 2026-06-11 cs.LG cs.AI 版本更新

The Algorithm Is Not the Behavior: Learned Priors Override Look-Ahead in a Chess-Playing Neural Network

算法并非行为:学得的先验知识在弈棋神经网络中覆盖前瞻

Elias Sandmann, Sebastian Lapuschkin, Wojciech Samek

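AI summary: Finds that the chess neural network Leela Chess Zero often computes the correct solution in intermediate layers, but the final output is overridden by learned priors that favor safe play, producing the wrong answer.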

AI中文摘要

最近的机制性工作揭示了神经网络内部的学习算法,从模运算到游戏智能体中的搜索与规划。但算法结构是否保证算法行为?我们在最强的神经象棋引擎Leela Chess Zero中对此进行研究,先前工作已识别出学习到的前瞻。通过将logit透镜扩展到其选棋策略网络,我们发现正确的谜题解法——包括即时将杀——经常出现在中间层,但在最终输出中被系统性覆盖,我们将此现象称为“遗忘的谜题”。在这些位置上重复先前的分析,我们发现前瞻运行正常——正确续招的未来走法被表示、因果重要且可线性解码——排除了算法本身的失败。相反,后期层逐渐转向优先考虑安全对局而非激进。为了测试这一转变是否驱动了覆盖,我们引导模型反对这些偏好,并恢复了61.7%的遗忘谜题,提供了因果证据表明安全先验覆盖了算法计算的解法。这些发现表明,算法结构并不保证算法行为:模型可以在内部解决问题,但仍然输出错误答案。

英文摘要

Recent mechanistic work has uncovered learned algorithms within neural networks, from modular arithmetic to search and planning in game-playing agents. But does algorithmic structure guarantee algorithmic behavior? We investigate this in Leela Chess Zero, the strongest neural chess engine, where prior work identified learned look-ahead. By extending the logit lens to its move-selecting policy network, we discover that correct puzzle solutions, including immediate checkmates, often appear in intermediate layers but are systematically overridden in the final output, a phenomenon we term "forgotten puzzles". Replicating prior analyses on these positions, we find that look-ahead operates normally (future moves of the correct continuation are represented, causally important, and linearly decodable), ruling out a failure of the algorithm itself. Instead, late layers increasingly shift toward prioritizing safe play over aggression. To test whether this shift drives the override, we steer the model against these preferences and recover 61.7% of forgotten puzzles, providing causal evidence that safety priors override algorithmically computed solutions. These findings demonstrate that algorithmic structure does not guarantee algorithmic behavior: a model can internally solve a problem and still output the wrong answer.

2511.09789 2026-06-11 cs.LG 版本更新

CaReTS: A Multi-Task Framework Unifying Classification and Regression for Time Series Forecasting

CaReTS:统一分类与回归的多任务时间序列预测框架

Fulong Yao, Wanqing Zhao, Chao Zheng, Xiaofei Han

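AI summary: Proposes CaReTS, a multi-task framework whose dual-stream architecture jointly classifies trends and regresses deviations, giving accurate and interpretable forecasts that outperform existing methods on real-world datasets.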

AI中文摘要

近年来深度预测模型取得了显著性能,但大多数方法仍难以同时提供准确的预测和对时间动态的可解释洞察。本文提出CaReTS,一种新颖的多任务学习框架,结合分类和回归任务用于多步时间序列预测问题。该框架采用双流架构,其中分类分支学习未来的逐步趋势,而回归分支估计目标变量最新观测值的相应偏差。双流设计通过分离目标变量的宏观趋势和微观偏差,提供更具可解释性的预测。为了在输出预测、偏差估计和趋势分类中实现有效学习,我们设计了一个具有不确定性加权机制的多任务损失,以自适应平衡每个任务的贡献。此外,在该框架下实例化了四种变体(CaReTS1-4),以集成主流时序建模编码器,包括卷积神经网络(CNN)、长短期记忆网络(LSTM)和Transformer。在真实数据集上的实验表明,CaReTS在预测准确性上优于最先进的算法,同时实现了更高的趋势分类性能。

英文摘要

Recent advances in deep forecasting models have achieved remarkable performance, yet most approaches still struggle to provide both accurate predictions and interpretable insights into temporal dynamics. This paper proposes CaReTS, a novel multi-task learning framework that combines classification and regression tasks for multi-step time series forecasting problems. The framework adopts a dual-stream architecture, where a classification branch learns the stepwise trend into the future, while a regression branch estimates the corresponding deviations from the latest observation of the target variable. The dual-stream design provides more interpretable predictions by disentangling macro-level trends from micro-level deviations in the target variable. To enable effective learning in output prediction, deviation estimation, and trend classification, we design a multi-task loss with uncertainty-aware weighting to adaptively balance the contribution of each task. Furthermore, four variants (CaReTS1--4) are instantiated under this framework to incorporate mainstream temporal modelling encoders, including convolutional neural networks (CNNs), long short-term memory networks (LSTMs), and Transformers. Experiments on real-world datasets demonstrate that CaReTS outperforms state-of-the-art (SOTA) algorithms in forecasting accuracy, while achieving higher trend classification performance.
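
The abstract does not give the form of the uncertainty-aware weighting. One common choice, shown here purely as an assumption, is to give each task a learnable log-variance $s_i$ and minimise $\sum_i (e^{-s_i} L_i + s_i)$. The function name `uncertainty_weighted_loss` and the three example losses are hypothetical.

```python
import math

def uncertainty_weighted_loss(task_losses, log_vars):
    """Combine task losses as sum_i exp(-s_i) * L_i + s_i.

    Each s_i is a learnable log-variance: a larger s_i down-weights its task,
    while the additive s_i term keeps every s_i from growing without bound.
    """
    return sum(math.exp(-s) * loss + s for loss, s in zip(task_losses, log_vars))

# Example: output-prediction, deviation, and trend-classification losses.
losses = [1.0, 2.0, 0.5]
equal = uncertainty_weighted_loss(losses, [0.0, 0.0, 0.0])          # plain sum of the losses
damped = uncertainty_weighted_loss(losses, [0.0, math.log(2.0), 0.0])  # second task halved
```

In a training loop the log-variances would be optimised together with the network weights, so the balance between tasks adapts instead of being hand-tuned.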

2601.21293 2026-06-11 cs.LG cs.AI 版本更新

Reliability-Calibrated Edge-IoT Early Fault Warning for Rotating Machinery with a Physics-Guided Tiny-Mamba Transformer

面向旋转机械的可靠性校准边缘物联网早期故障预警:一种物理引导的Tiny-Mamba Transformer

Changyu Li, Huabei Nie, Xiaoya Ni, Lu Wang, Lijuan Shen, Kaishun Wu, Fei Luo

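AI summary: Proposes a reliability-calibrated edge-IoT early fault warning framework that extracts features with a Physics-Guided Tiny-Mamba Transformer and calibrates the false-alarm rate with extreme value theory, giving accurate, low-latency fault warnings for rotating machinery under tight compute budgets.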

AI中文摘要

工业物联网系统日益依赖分布式振动传感来支持旋转机械的预测性维护。然而,在实际部署中,原始信号上传成本高昂,且报警决策必须在有限计算资源、变化运行条件和严格误报预算下本地进行。本文提出一种可靠性校准的边缘物联网早期预警框架,其中紧凑的物理引导Tiny-Mamba Transformer作为表示模块,极值理论层将流式异常分数转换为事件级报警片段。PG-TMT结合深度可分离卷积主干、Tiny-Mamba状态空间分支和轻量级局部Transformer,在批量大小为1的推理下捕获瞬态、长周期和多通道退化线索。为提高可审计性,时间注意力被投影到频域并与分析轴承故障阶次带软对齐。极值理论校准、双阈值迟滞和修尾拟合即使在健康校准数据不完美的情况下也能提供可控的误报强度。在CWRU、Paderborn、XJTU-SY和工业试点上的实验表明,所提框架提高了PR-AUC,在可控误报预算下减少了检测延迟,并对结构化干扰、元数据不确定性、复合故障混合和域转移保持鲁棒。凭借小于1 MB的占用空间和低于7 ms的Jetson p99延迟,该框架支持工业物联网预测性维护的校准和可解释早期预警。

英文摘要

Industrial Internet of Things (IIoT) systems increasingly rely on distributed vibration sensing to support predictive maintenance of rotating machinery. In practical deployments, however, raw signal upload is costly and alarm decisions must be made locally under limited computation, changing operating conditions, and strict nuisance-alarm budgets. This paper presents a reliability-calibrated edge-IoT early-warning framework, in which a compact Physics-Guided Tiny-Mamba Transformer (PG-TMT) acts as the representation module and an extreme value theory (EVT) layer converts streaming anomaly scores into event-level alarm episodes. PG-TMT combines a depthwise-separable convolutional stem, a Tiny-Mamba state-space branch, and a lightweight local Transformer to capture transient, long-horizon, and multichannel degradation cues under batch-size-one inference. To improve auditability, temporal attention is projected to the frequency domain and softly aligned with analytical bearing fault-order bands. EVT calibration, dual-threshold hysteresis, and trimmed-tail fitting provide controllable false-alarm intensity even when healthy calibration data are imperfect. Experiments on CWRU, Paderborn, XJTU-SY, and an industrial pilot demonstrate that the proposed framework improves PR-AUC, reduces detection delay under a controlled nuisance-alarm budget, and remains robust to structured interference, metadata uncertainty, compound fault mixtures, and domain transfer. With a sub-1 MB footprint and Jetson p99 latency below 7 ms, the framework supports calibrated and interpretable early warnings for IIoT predictive maintenance.
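
Dual-threshold hysteresis, one of the steps that turns streaming anomaly scores into event-level alarm episodes, can be sketched as follows. The thresholds `high` and `low` are taken as given here (in the paper they would presumably come from the EVT calibration), and the function name `hysteresis_episodes` and the example scores are assumptions for illustration.

```python
def hysteresis_episodes(scores, high, low):
    """Group a score stream into alarm episodes.

    An episode opens when a score reaches `high` and closes only when a score
    drops below `low`, so brief dips between the two thresholds do not split it.
    Returns (start, end) index pairs with `end` exclusive.
    """
    episodes = []
    start = None
    for i, s in enumerate(scores):
        if start is None and s >= high:
            start = i                      # open a new episode
        elif start is not None and s < low:
            episodes.append((start, i))    # close the current episode
            start = None
    if start is not None:
        episodes.append((start, len(scores)))  # stream ended mid-episode
    return episodes

stream = [0.1, 0.9, 0.6, 0.7, 0.3, 0.2, 0.95, 0.8]
alarms = hysteresis_episodes(stream, high=0.8, low=0.5)
```

With a single threshold at 0.8 the first burst would end after one sample; the lower release threshold keeps it as one episode covering indices 1 to 3.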

2602.10392 2026-06-11 cs.LG 版本更新

Tensor Methods: A Unified and Interpretable Approach for Material Design

张量方法:一种统一且可解释的材料设计方法

Shaan Pakala, Aldair E. Gongora, Brian Giera, Evangelos E. Papalexakis

AI总结 提出使用张量补全方法作为材料设计的统一框架,兼具可解释性和预测性能,在非均匀采样下优于传统机器学习,最高提升5%的R²并减半分布外误差。

详情
Comments
Accepted to ACM SIGKDD 2026 AI for Sciences track
AI中文摘要

在设计新材料时,通常需要根据所需性能定制材料设计。随着设计参数数量的增长,搜索空间呈指数级增长,使得所有材料组合的实际合成和评估几乎不可能。即使使用有限元分析等传统计算方法,搜索设计空间也变得过于计算密集。近期方法使用机器学习(ML)代理模型来更高效地确定最优材料设计;不幸的是,这些方法通常(i)难以解释,且(ii)当训练数据来自设计空间的非均匀采样时性能不佳。我们建议使用张量补全方法作为可解释性和预测的统一方法。我们观察到经典张量方法在预测上能够与传统ML竞争,并且额外具有可解释的张量因子(作为预测的副产品完全免费获得)。在我们的实验中,我们能够通过张量因子重新发现物理现象,表明我们的预测与问题的真实底层物理一致。这也意味着,鉴于我们能够重新发现现有模式,实验人员可以利用这些张量因子识别潜在的新模式。我们还研究了当遇到来自设计空间非均匀采样的训练数据时,两种代理模型的效果。我们观察到更专门的张量方法在这些非均匀采样场景下能够提供更好的泛化能力。我们发现最佳的泛化来自一个张量模型,它在总体$R^2$上比基线ML方法提升高达5%,并在某些分布外区域将误差减半。

英文摘要

When designing new materials, it is often necessary to tailor the material design to have some desired properties. As the set of design parameters grows, the search space grows exponentially, making the actual synthesis and evaluation of all material combinations virtually impossible. Even using traditional computational methods such as Finite Element Analysis becomes too computationally heavy to search the design space. Recent methods use machine learning (ML) surrogate models to more efficiently determine optimal material designs; unfortunately, these methods often (i) are notoriously difficult to interpret and (ii) underperform when the training data comes from a non-uniform sampling of the design space. We suggest the use of tensor completion methods as an all-in-one approach for interpretability and predictions. We observe classical tensor methods are able to compete with traditional ML in predictions, with the added benefit of their interpretable tensor factors (which are given completely for free, as a result of the prediction). In our experiments, we are able to rediscover physical phenomena via the tensor factors, indicating that our predictions are aligned with the true underlying physics of the problem. This also means these tensor factors could be used by experimentalists to identify potentially novel patterns, given we are able to rediscover existing ones. We also study the effects of both types of surrogate models when we encounter training data from a non-uniform sampling of the design space. We observe that more specialized tensor methods can give better generalization in these non-uniform sampling scenarios. We find the best generalization comes from a tensor model, which is able to improve upon the baseline ML methods by up to 5% on aggregate $R^2$, and halve the error in some out-of-distribution regions.

2602.11801 2026-06-11 cs.LG 版本更新

SpaTeoGL: Spatiotemporal Graph Learning for Interpretable Seizure Onset Zone Analysis from Intracranial EEG

SpaTeoGL: 用于颅内脑电图可解释癫痫发作起始区分析的时空图学习

Elham Rostami, Aref Einizade, Taous-Meriem Laleg-Kirati

AI总结 提出SpaTeoGL框架,通过联合学习窗口级空间图和时间图,在平滑图信号处理框架下交替求解,实现癫痫发作起始区的可解释定位,在多中心iEEG数据集上优于基线方法。

详情
Comments
5 pages, 4 figures
AI中文摘要

从颅内脑电图(iEEG)中准确定位癫痫发作起始区(SOZ)对癫痫手术至关重要,但受复杂时空发作动态的挑战。我们提出SpaTeoGL,一种用于可解释癫痫网络分析的时空图学习框架。SpaTeoGL联合学习捕捉iEEG电极间相互作用的窗口级空间图,以及基于空间结构相似性连接时间窗口的时间图。该方法在平滑图信号处理框架内制定,并通过具有收敛保证的交替块坐标下降算法求解。在具有成功手术结果的多中心iEEG数据集上的实验表明,SpaTeoGL与基于水平可见图与逻辑回归的基线方法相比具有竞争力,同时改善了非SOZ识别,并为癫痫发作起始和传播动态提供了可解释的见解。

英文摘要

Accurate localization of the seizure onset zone (SOZ) from intracranial EEG (iEEG) is essential for epilepsy surgery but is challenged by complex spatiotemporal seizure dynamics. We propose SpaTeoGL, a spatiotemporal graph learning framework for interpretable seizure network analysis. SpaTeoGL jointly learns window-level spatial graphs capturing interactions among iEEG electrodes and a temporal graph linking time windows based on similarity of their spatial structure. The method is formulated within a smooth graph signal processing framework and solved via an alternating block coordinate descent algorithm with convergence guarantees. Experiments on a multicenter iEEG dataset with successful surgical outcomes show that SpaTeoGL is competitive with a baseline based on horizontal visibility graphs and logistic regression, while improving non-SOZ identification and providing interpretable insights into seizure onset and propagation dynamics.

2606.03077 2026-06-11 cs.LG cs.AI cs.DC 版本更新

Libra: Efficient Resource Management for Agentic RL Post-Training

Libra:面向智能体强化学习后训练的高效资源管理

Kaiwen Chen, Xin Tan, Jingzong Li, Hong Xu

AI总结 针对智能体强化学习中长尾、非平稳工作负载带来的资源管理挑战,提出Libra系统,通过周期性全局资源规划器和因果驱动多级反馈队列调度器,实现GPU分配优化和请求调度,最高提升3倍吞吐量和2.5倍收敛速度。

详情
Comments
19 pages, 12 figures
AI中文摘要

强化学习(RL)已成为大型语言模型(LLM)的标准后训练范式,从偏好对齐扩展到复杂推理和多轮智能体行为。在智能体RL中,rollout阶段生成轨迹并调用工具,产生长尾和非平稳的工作负载,挑战了传统的资源管理假设。出现了三个基本挑战。首先,由于长尾分布,一小部分轨迹主导了rollout完成时间。其次,rollout和训练在计算模式、内存需求和对序列长度的敏感性上表现出强烈的不对称性。第三,随着RL策略的演变,轨迹长度分布随时间漂移,使得任何静态资源分配逐渐变得次优。我们提出Libra,引入了两个核心机制。第一个是周期性全局资源规划器,它联合优化rollout和训练集群间的GPU分配。它利用弹性混合池实现阶段间轻量级、非阻塞的工作节点重新分配。第二个是因果驱动的多级反馈队列(C-MLFQ)调度器,它基于从工具返回结果导出的因果信号(而非依赖脆弱的长度的预测)将请求路由到异构的rollout桶。在48个A800 GPU上的评估表明,与基线相比,Libra实现了高达3.0倍的吞吐量提升和高达2.5倍的奖励收敛加速。

英文摘要

Reinforcement learning (RL) has emerged as a standard post-training paradigm for shaping large language models (LLMs) into capable agents. In agentic RL, the rollout stage generates trajectories while invoking tools, producing long-tailed and non-stationary workloads that expose two fundamental challenges in resource management. First, due to the long-tail distribution, a small fraction of trajectories dominates rollout makespan. Second, rollout and training are subject to cross-stage imbalance, as they exhibit strong asymmetry in compute patterns, memory demands, and sensitivity to sequence length. Compounding this asymmetry, the sequence length distribution drifts continuously as the policy evolves, rendering any static resource split progressively suboptimal. We present Libra, a resource management system to address both challenges via two core mechanisms. The first is a global resource planner that jointly optimizes GPU allocation across rollout and training clusters. It leverages an elastic hybrid pool to enable lightweight, non-blocking worker reallocation between stages. The second is a causality-driven multi-level feedback queue (C-MLFQ) scheduler, which routes requests to heterogeneous rollout buckets based on causal signals derived from tool-return outcomes, rather than relying on fragile length predictions. Evaluated on 48 A800 GPUs, Libra achieves up to 3.0x higher throughput and converges up to 2.5x faster in reward compared to the baselines.

2606.04145 2026-06-11 cs.LG cs.AI cs.DC 版本更新

EvalStop: Using World Feedback to Detect and Correct Reward Overoptimization in Multi-Tenant RLHF Platforms

EvalStop:利用世界反馈检测和纠正多租户RLHF平台中的奖励过度优化

Guilin Zhang, Chuanyi Sun, Shahryar Sarkani, John M. Fossaceca

AI总结 提出EvalStop调度原语,通过检测评估分数连续下降来终止作业、释放GPU并保留最佳检查点,以纠正奖励过度优化,在RLHF负载上实现高精度检测并提升JCT。

详情
AI中文摘要

云LLM微调平台越来越多地服务于RLHF工作负载,其中学习到的奖励模型作为人类质量的代理被优化。正如Gao等人(2023)所示,在持续优化压力下,该代理与世界反馈(下游评估指标)发生偏离,这种现象称为奖励过度优化。现有的平台调度器忽略这种偏离:非预见性调度器优化JCT而不考虑任何质量信号,SLAQ式质量感知调度器使用训练损失(一个较弱的代理,即使发生奖励黑客行为也会单调下降),而经典的每作业早停需要人工监控且不释放共享GPU。我们提出EvalStop,一个可组合的调度原语,它在评估分数连续k次下降时终止作业,释放GPU,保留最佳检查点,并将其余调度委托给任意基础调度器。我们将调度器级别的早停视为检测问题,并在一个离散事件模拟器中评估它,该模拟器的RLHF工作负载混合了存在奖励黑客行为的运行和结构健康的运行,真实标签对调度器隐藏。在RLHF密集型负载(80% RLHF,64 GPU)上,EvalStop实现了精确率98%、召回率99%、假阳性率1.5%,同时相比SRTF-Est将JCT改善了9%,将浪费的计算减少了22%(p<0.05)。简单的固定进度基线和损失平台期基线要么在健康RLHF上产生65%的假阳性率,要么漏掉超过一半的真实奖励黑客案例。增益在所有测试的基础调度器上均成立(JCT改善9-25%),且检测质量在评估噪声(噪声标准差≤0.05时精确率至少91%)和奖励黑客基础率(奖励黑客比例20-80%时精确率至少89%)下保持稳定。

英文摘要

Cloud LLM fine-tuning platforms increasingly serve RLHF workloads, where a learned reward model is optimized as a proxy for human quality. As Gao et al. (2023) showed, this proxy diverges from world feedback (downstream eval metrics) under sustained optimization pressure, a phenomenon known as reward overoptimization. Existing platform schedulers ignore this divergence: non-clairvoyant schedulers optimize JCT without any quality signal, SLAQ-style quality-aware schedulers use training loss (a weaker proxy that drops monotonically through hacking), and classical per-job early stopping requires human monitoring and does not free shared GPUs. We propose EvalStop, a composable scheduling primitive that terminates jobs on k consecutive eval-score declines, releases GPUs, preserves the best checkpoint, and delegates to any base scheduler. We frame scheduler-level early stopping as a detection problem and evaluate it in a discrete-event simulator whose RLHF workload mixes reward-hacking and structurally healthy runs, with ground-truth labels hidden from schedulers. On RLHF-heavy workloads (80% RLHF, 64 GPUs), EvalStop achieves precision 98% / recall 99% / FPR 1.5% while improving JCT by 9% and cutting wasted compute by 22% over SRTF-Est (p<0.05). Trivial fixed-progress and loss-plateau competitors either incur 65% FPR on healthy RLHF or miss over half of true hacking cases. Gains compose across every base scheduler tested (9-25% JCT) and detection quality stays stable under eval noise (precision at least 91% at noise std <= 0.05) and hacking base rate (precision at least 89% across 20-80% hacking fractions).
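The stopping rule described above (terminate on k consecutive eval-score declines, keep the best checkpoint) can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `evalstop_decision` and the score history are hypothetical, and GPU release and delegation to a base scheduler are not modeled.

```python
from typing import List, Optional, Tuple


def evalstop_decision(
    eval_scores: List[float], k: int
) -> Tuple[Optional[int], int]:
    """Return (stop_index, best_index) for one job's eval-score history.

    stop_index is the first evaluation at which the score has declined
    k times in a row (None if that never happens). best_index is the
    evaluation with the highest score seen up to the stop point, i.e.
    the checkpoint that would be preserved.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    if not eval_scores:
        raise ValueError("need at least one eval score")
    best_index = 0
    declines = 0  # length of the current run of consecutive declines
    for i in range(1, len(eval_scores)):
        if eval_scores[i] < eval_scores[i - 1]:
            declines += 1
        else:
            declines = 0
        if eval_scores[i] > eval_scores[best_index]:
            best_index = i
        if declines >= k:
            return i, best_index
    return None, best_index


# Hypothetical history: the score peaks at evaluation 2, then degrades.
history = [0.50, 0.55, 0.61, 0.60, 0.58, 0.57, 0.40]
stop_at, best = evalstop_decision(history, k=3)
print(stop_at, best)  # 5 2
```

A larger k makes the rule more tolerant of eval noise at the cost of running a degrading job for longer before it is stopped.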

2606.07226 2026-06-11 cs.LG cs.AI cs.CL 版本更新

DEFINED: A Data-Efficient Computational Framework for Fine-Grained Creativity Assessment in Debate Scenarios

DEFINED: 辩论场景中细粒度创造力评估的数据高效计算框架

Tongzhou Yu, Mingjia Li, Hong Qian, Wenkai Wang, Zongbao Zhang, Yaoyu Jiang, Xiangfeng Wang, Aimin Zhou, Jiajun Guo

发表机构 * Nanjing University Shanghai Innovation Institute East China Normal University

AI总结 提出DEFINED框架,通过层次化八维指标体系、预训练语言模型和混合粒度训练策略,在辩论场景中实现数据高效的细粒度创造力自动评估,优于现有方法。

详情
Comments
Accepted by KDD 2026
AI中文摘要

人类创造力已成为大语言模型时代的关键能力。在复杂、开放环境中评估创造力是数据挖掘领域的一大挑战,目前受限于对标准化简单任务的依赖以及细粒度专家数据的稀缺。作为生态有效的评估场景,辩论反映了创造力的多个维度,涵盖发散思维和收敛思维。此外,辩论是一个数据丰富的领域,拥有大量公开可获取的材料。当前主流的自动评分方法难以适应辩论等复杂场景,因此仍然依赖昂贵的人工评估。为此,本文提出DEFINED,一种数据高效的计算框架,用于辩论场景中的细粒度创造力评估。DEFINED通过层次化的八维指标体系操作化辩论创造力,采用预训练自回归语言模型,并配备支持细粒度和粗粒度评估的层次化评分头。从真实辩论比赛中获取陈述及其相关专家评分,并采用约束数据增强策略以解决原始数据中的精英偏差。DEFINED采用混合粒度训练策略,能够从训练有素的研究生专家提供的有限细粒度监督中实现鲁棒学习。为严格验证超越合成基准的生态效度,我们纳入了一项针对辩论新手参与者的实证研究,利用这些真实数据作为中低水平人群的定性案例研究。在我们的评估协议中,评分模型实现了准确且稳定的评分,优于基于提示的大语言模型评估器和现有的辩论评分方法。

英文摘要

Human creativity has emerged as a critical competency in the era of large language models. Assessing creativity in complex, open-ended environments is a grand challenge in data mining, currently hindered by a reliance on standardized simple tasks and the scarcity of fine-grained expert data. As an ecologically valid assessment context, debate reflects multiple dimensions of creativity, encompassing both divergent thinking and convergent thinking. Moreover, debate is a data-rich domain, with a large volume of publicly accessible materials. Current mainstream automated scoring methods are poorly suited to complex settings such as debate, and therefore still rely on costly human evaluation. To this end, this paper proposes DEFINED, a data-efficient computational framework for fine-grained creativity assessment in debate scenarios. DEFINED operationalizes debate creativity through a hierarchical eight-dimensional metric system, implemented via a pre-trained autoregressive language model with a hierarchical scoring head that supports both fine-grained and coarse-grained evaluation. Statements and their associated expert scores were obtained from authentic debate competitions, and a constrained data augmentation strategy was employed to address the elite bias inherent in the original data. DEFINED adopts a mixed-granularity training strategy enabling robust learning from limited fine-grained supervision annotated by trained graduate experts. To rigorously validate ecological validity beyond synthetic benchmarks, we incorporate an empirical study with debate-naive participants, utilizing these authentic data to serve as a qualitative case study for mid-to-low proficiency populations. Across our evaluation protocol, our scoring model achieves accurate and stable scoring, outperforming prompt-based large language model evaluators and existing debate scoring methods.

2606.09289 2026-06-11 cs.LG 版本更新

Intention Driven Identification of In-Possession Match Phases in Association Football through Temporal Graph Learning

通过时序图学习识别足球比赛中控球阶段的意图驱动方法

Yuesen Li, Daniel Link

发表机构 * Technical University of Munich(慕尼黑工业大学)

AI总结 提出基于时序图注意力网络(T-GAN)的框架,从时空追踪数据中识别足球比赛控球阶段,实现战术意图(入侵空间、保持控球、得分)和六个子阶段的分类,F1分数达0.87(意图级)和0.79(得分阶段)。

详情
Comments
27 pages, 10 figures
AI中文摘要

理解足球的战术组织需要识别不同的比赛阶段。然而,控球阶段很少能被直接观察到,它由不断演变的战术意图塑造,而不仅仅取决于空间模式。本研究提出一个数据驱动框架,用于从时空追踪数据中识别控球比赛阶段。分析了七场德国足球甲级联赛比赛,使用TRACAB以25 Hz记录。定义了一个层次化阶段模型,包含三种战术意图(入侵对手空间、保持控球、得分)和六个阶段(构建、推进、反击、维持、持续威胁、完成)。开发了时序图注意力网络(T-GAN),结合帧级球员交互图、上下文特征和基于Transformer的时序建模。使用帧级F1和序列感知的Truth-Dominance交并比(IoT-D)指标评估性能。T-GAN在意图级别达到宏平均帧级F1分数0.87,入侵相关阶段0.76,得分阶段0.79。在序列级别,后处理后意图的平均对角线IoT-D F1从0.68增加到0.79,阶段从0.61增加到0.71,表明时序连贯性改善。模型比较显示,序列建模是分割质量的主要驱动因素,而基于图的关系建模特别有利于反击识别。探索性球员注意力分析进一步表明,边路和中场位置组对阶段区分贡献显著。总体而言,该框架将连续追踪数据转化为战术可解释的控球阶段表示,具有自动比赛标注、战术分析和打法特征分析的潜在应用。

英文摘要

Understanding tactical organisation of association football, hereafter referred to as football, requires identifying distinct match phases. Yet in-possession phases are rarely directly observable and are shaped by evolving tactical intentions, rather than spatial patterns alone. This study proposes a data-driven framework for identifying in-possession match phases from spatiotemporal tracking data. Seven German Bundesliga matches recorded at 25 Hz with TRACAB were analysed. A hierarchical phase model was defined with three tactical intentions (Invade Opponent Space, Keep Possession, Scoring) and six phases (Build Up, Progression, Counter Attack, Maintenance, Sustained Threat, Finishing). A Temporal Graph Attention Network (T-GAN) was developed to combine frame-level player-interaction graphs, contextual features, and Transformer-based temporal modelling. Performance was evaluated using frame-level F1 and a sequence-aware Intersection over Truth-Dominance (IoT-D) metric. T-GAN achieved macro-average frame-level F1 scores of 0.87 at the intention level, 0.76 for invasion-related phases, and 0.79 for scoring phases. At the sequence level, mean diagonal IoT-D F1 increased from 0.68 to 0.79 for intentions and from 0.61 to 0.71 for phases after post-processing, indicating improved temporal coherence. Model comparisons showed that sequence modelling was the main driver of segmentation quality, while graph-based relational modelling was particularly beneficial for Counter Attack recognition. Exploratory player attention analysis further suggested that wide and midfield positional groups contributed strongly to phase discrimination. Overall, the framework translates continuous tracking data into tactically interpretable in-possession phase representations, with potential applications in automated match annotation, tactical analysis, and playing-style profiling.
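The macro-average frame-level F1 used as one of the metrics can be computed as in the sketch below. The label strings and the toy frame sequence are hypothetical, and the sequence-aware IoT-D metric is not reproduced because the abstract does not define it.

```python
from typing import List


def macro_f1(true: List[str], pred: List[str]) -> float:
    """Unweighted mean of per-class F1 over all classes that appear."""
    if len(true) != len(pred) or not true:
        raise ValueError("need two non-empty label lists of equal length")
    classes = sorted(set(true) | set(pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(true, pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(true, pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(true, pred) if t == c and p != c)
        denom = 2 * tp + fp + fn
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical per-frame intention labels.
true = ["invade", "invade", "keep", "keep", "score", "score"]
pred = ["invade", "keep", "keep", "keep", "score", "invade"]
score = macro_f1(true, pred)
print(round(score, 4))  # 0.6556
```

Macro averaging weights every class equally, so rare phases count as much as frequent ones in the final score.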

2606.10725 2026-06-11 cs.LG cs.CL 版本更新

Pre-AF 13: An Interpretable Atrial Fibrillation Risk Score Mined from Discharge Reports

Pre-AF 13:从出院报告中挖掘的可解释房颤风险评分

Olga Shakhmatova, Dmitrii Kriukov, Daniil Larionov, Nikita Khromov, Iaroslav Bespalov, Alexander Zolotarev, Kirill Grishchenkov, Ekaterina Ivanova, Miron Kuznetsov, Ilya Sochenkov, Elizaveta Panchenko, Artem Shelmanov, Dmitry V. Dylov

发表机构 * National Medical Research Center of Cardiology named after Academician E.I. Chazov(以E.I. Chazov院士命名的国家心脏病学医学研究中心) Skolkovo Institute of Science and Technology (Skoltech)(斯科尔科沃科学技术研究所) Artificial Intelligence Research Institute (AIRI)(人工智能研究所) University of Mannheim(曼海姆大学) Russian Center for Scientific Information (RCSI)(俄罗斯科学信息中心) Institute of Cyber Intelligence Systems, National Research Nuclear University MEPhI(国立核能研究大学MEPhI网络智能系统研究所) M.V. Lomonosov Moscow State University(莫斯科国立罗蒙诺索夫大学) Institute for Information Transmission Problems of the Russian Academy of Sciences (Kharkevich Institute)(俄罗斯科学院信息传输问题研究所(Kharkevich研究所)) Ivannikov Institute for System Programming of the Russian Academy of Sciences (ISP RAS)(俄罗斯科学院伊万尼科夫系统编程研究所) Federal Research Center "Computer Science and Control" of the Russian Academy of Sciences (FRC CSC RAS)(俄罗斯科学院“计算机科学与控制”联邦研究中心) Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 利用NLP从出院报告中提取特征,构建可解释ML模型预测心血管病患者房颤风险,Pre-AF 13模型优于现有临床评分。

详情
Comments
O. Shakhmatova and D. Kriukov contributed equally (co-first authors). E. Panchenko, A. Shelmanov, and D. V. Dylov are co-senior authors. Correspondence to: Olga Shakhmatova and Dmitry V. Dylov.
AI中文摘要

背景:房颤(AF)是最常见的心律失常,也是预后的主要决定因素。现有的AF风险评分依赖于在心血管疾病(CVD)患者中几乎普遍存在的因素(如高龄、高血压),因此在该高风险群体中提供的分层有限。大多数评分针对长期(5-10年)而非中期预测。我们开发了可解释的ML模型,利用常规收集的医院数据预测CVD患者在24个月和整个随访期间内的AF风险。方法:对俄罗斯国家心脏病学研究中心电子健康记录进行单中心回顾性研究,纳入2012年1月至2019年5月期间多次住院、年龄≥18岁、患有CVD但无既往AF的患者。自定义NLP流水线将非结构化出院报告转化为73个结构化特征,结合基于规则的解析器和基于Transformer的命名实体识别。使用LightAutoML构建了完整模型(73个特征)、简单模型(简化子集)以及用于床旁风险评分的线性模型。性能通过ROC AUC评估,并与CHARGE-AF、C2HEST、MHS和HAVOC进行比较,并通过SHAP进行解释。结果:在来自45,000名患者的80,576份记录中,17,562份符合纳入标准;其中1,438名(8.19%)发生AF。完整模型在24个月和整个随访期间的ROC AUC分别为0.735和0.696;简单模型几乎相同(0.725和0.696)。所有非线性模型均优于四个临床风险评分(ROC AUC 0.53-0.64)。简单模型使用13个特征,命名为Pre-AF 13。SHAP识别出年龄和左心房容积为主要预测因子。线性风险评分(Pre-AF 9)将观察到的24个月AF发生率从约7%分层至36%。结论:基于常规收集的EHR数据构建的可解释ML模型能够识别高AF风险的CVD患者,优于现有的临床风险评分。

英文摘要

Background. Atrial fibrillation (AF) is the most prevalent cardiac arrhythmia and a major determinant of prognosis. Established AF risk scores rely on factors (older age, hypertension) nearly ubiquitous among patients with cardiovascular disease (CVD), offering limited stratification in this high-risk group. Most target long-term (5-10 year) rather than medium-term prediction. We developed interpretable ML models predicting AF risk over a 24-month and entire follow-up horizon in CVD patients using routinely collected hospital data. Methods. Single-center retrospective study of electronic health records from the National Research Cardiology Center (Russia) for patients aged >=18 with CVD but without pre-existing AF, hospitalized more than once between January 2012 and May 2019. A custom NLP pipeline transformed unstructured discharge reports into 73 structured features, combining a rule-based parser with transformer-based NER. Using LightAutoML we built a full model (73 features), a simple model (reduced subset), and a linear model for a bedside risk score. Performance was assessed by ROC AUC, compared with CHARGE-AF, C2HEST, MHS, and HAVOC, and interpreted via SHAP. Results. Of 80,576 records from 45,000 patients, 17,562 met inclusion criteria; 1,438 (8.19%) developed AF. The full model reached ROC AUC 0.735 (24-month) and 0.696 (entire follow-up); the simple model was nearly identical (0.725, 0.696). All non-linear models outperformed the four clinical risk scores (ROC AUC 0.53-0.64). The simple model uses 13 features and is named Pre-AF 13. SHAP identified age and left atrial volume as dominant predictors. A linear risk score (Pre-AF 9) stratified observed 24-month AF incidence from ~7% to 36%. Conclusion. Interpretable ML models built from routinely collected EHR data identify high-AF-risk CVD patients, outperforming established clinical risk scores.
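Since all models and risk scores above are compared by ROC AUC, a minimal sketch of that metric may help. It uses the pairwise (Mann-Whitney) definition: the fraction of positive/negative pairs in which the positive case receives the higher score, with ties counted as one half. The labels and scores are hypothetical and unrelated to the study data.

```python
from typing import List


def roc_auc(labels: List[int], scores: List[float]) -> float:
    """ROC AUC as the probability that a positive outranks a negative."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes (1 = developed AF) and predicted risks.
labels = [1, 0, 1, 0, 0]
risks = [0.9, 0.3, 0.4, 0.5, 0.1]
auc = roc_auc(labels, risks)
print(round(auc, 4))  # 0.8333
```

The quadratic loop is fine for small samples; a rank-based implementation would be preferable for cohorts of the size reported above.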

2304.13905 2026-06-11 cs.CR cs.AI cs.LG 版本更新

LSTM based IoT Device Identification

基于LSTM的物联网设备识别

Kahraman Kostas

AI总结 提出一种端到端机器学习流程,利用LSTM网络处理原始网络数据包,通过滑动窗口时间序列特征识别27类物联网设备,在最优配置下达到79.85%准确率和75.70%宏平均F1分数。

详情
AI中文摘要

随着物联网的使用越来越普及,大量设备进入市场,许多安全漏洞也随之出现。在此环境下,物联网设备识别方法提供了一种预防性安全措施,作为识别这些设备并检测其漏洞的重要因素。在本研究中,我们提出了一种端到端的机器学习流程,利用长短期记忆(LSTM)网络识别阿尔托大学数据集(物联网设备捕获)中的物联网设备。原始网络数据包捕获(PCAP)被处理成25个工程特征,然后排列为滑动窗口时间序列。我们系统地评估了从2到20的序列长度,报告称性能在长度6之前近似线性提升,之后呈波浪形模式,在长度18时达到峰值。在最优配置的最终保留测试集上,该模型在27个设备类别上达到了79.85%的准确率和75.70%的宏平均F1分数。

英文摘要

While the use of the Internet of Things is becoming more and more popular, many security vulnerabilities are emerging with the large number of devices being introduced to the market. In this environment, IoT device identification methods provide a preventive security measure as an important factor in identifying these devices and detecting the vulnerabilities they suffer from. In this study, we present an end-to-end machine learning pipeline that identifies IoT devices in the Aalto University dataset (IoT devices captures) using Long Short-Term Memory (LSTM) networks. Raw network packet captures (PCAP) are processed into 25 engineered features, which are then arranged as sliding-window time-series sequences. We systematically evaluate sequence lengths from 2 to 20, reporting that performance improves approximately linearly up to length 6 and thereafter in a wave-like pattern, reaching its peak at length 18. On the final held-out test set with the optimal configuration, the model achieves an accuracy of 79.85% and a macro-averaged F1-score of 75.70% across 27 device classes.
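The sliding-window arrangement of per-packet feature vectors can be sketched as follows. The function name `sliding_windows`, the stride parameter `step`, and the dummy data are assumptions for illustration; the abstract does not say which stride the pipeline uses, and the feature extraction from PCAP files is not shown.

```python
from typing import List, Sequence


def sliding_windows(
    packets: Sequence[Sequence[float]], seq_len: int, step: int = 1
) -> List[List[List[float]]]:
    """Stack consecutive feature vectors into fixed-length sequences.

    Returns a list of windows, each of shape (seq_len, n_features),
    which is the input layout an LSTM classifier typically expects.
    """
    if seq_len < 1 or step < 1:
        raise ValueError("seq_len and step must be positive")
    windows = []
    for start in range(0, len(packets) - seq_len + 1, step):
        windows.append([list(p) for p in packets[start:start + seq_len]])
    return windows


# Ten dummy packets with 25 features each (values are placeholders).
packets = [[float(i)] * 25 for i in range(10)]
windows = sliding_windows(packets, seq_len=6)
print(len(windows), len(windows[0]), len(windows[0][0]))  # 5 6 25
```

With a stride of 1, a capture of N packets yields N - seq_len + 1 windows, so longer sequences give slightly fewer but more context-rich training examples.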

2409.12707 2026-06-11 physics.flu-dyn cs.LG 版本更新

Machine-learning-based multipoint optimization of fluidic injection parameters for improving nozzle performance

基于机器学习的流体注入参数多点优化以提升喷管性能

Yunjia Yang, Jiazhe Li, Yufei Zhang, Haixin Chen

AI总结 针对过膨胀单斜面喷管,采用预训练神经网络替代CFD进行多点优化,结合先验预测策略提高精度,利用反向传播快速计算梯度,在七个设计点优化平均推力系数提升1.14%。

详情
AI中文摘要

流体注入为改善飞行器加速过程中过膨胀单斜面喷管(SERN)的性能提供了一种有前景的解决方案。然而,确定能在多个喷管工作状态下产生最佳整体性能的注入参数仍然是一个挑战。基于梯度的优化方法需要在每个设计点计算注入参数的梯度,当使用计算流体动力学(CFD)模拟时,这可能导致高昂的计算成本。本文使用预训练神经网络在优化过程中替代CFD,从而能够快速计算多个设计点的喷管流场。考虑到喷管流场的物理特性,采用基于先验的预测策略来提高模型的准确性。此外,神经网络的反向传播算法只需运行一次计算即可快速计算梯度,从而与有限差分法相比大大减少了梯度计算时间。作为测试案例,对SERN在七个设计点的平均喷管推力系数进行了优化,结果提高了1.14%。即使包括建立训练数据库所需的时间,与传统优化方法相比,时间成本也大大降低。

英文摘要

Fluidic injection offers a promising solution to improve the performance of overexpanded single expansion ramp nozzles (SERNs) during vehicle acceleration. However, determining the injection parameters that yield the best overall performance across multiple nozzle operating conditions remains a challenge. The gradient-based optimization method requires gradients of injection parameters at each design point, which can lead to high computational costs when using computational fluid dynamics (CFD) simulations. This paper uses a pretrained neural network to replace CFD during optimization, enabling quick calculation of the nozzle flow field at multiple design points. Considering the physical characteristics of the nozzle flow field, a prior-based prediction strategy is adopted to enhance the model's accuracy. In addition, the neural network's back-propagation algorithm computes gradients quickly by running the computation only once, thereby greatly reducing gradient computation time compared to the finite difference method. As a test case, the average nozzle thrust coefficient of an SERN at seven design points is optimized, resulting in a 1.14% improvement. The time cost is greatly reduced compared with traditional optimization methods, even when the time required to establish the training database is included.

2411.10959 2026-06-11 econ.EM cs.LG math.ST stat.AP stat.ME stat.ML 版本更新

Program Evaluation with Remotely Sensed Outcomes

利用遥感结果变量进行项目评估

Ashesh Rambachan, Rahul Singh, Davide Viviano

AI总结 本文研究了在实验和准实验中,由于遥感变量不完全测量经济结果而引起的因果推断问题,提出了一种非参数识别因果参数的方法,结合实验和观测数据进行n^{-1/2}推断。

详情
AI中文摘要

我们研究了在实验和准实验中,经济结果由遥感变量不完全测量的因果推断问题。遥感变量是低成本、可扩展且在观测数据中预测经济结果的变量,例如卫星图像和移动电话活动。我们将遥感变量视为后结果:经济结果的变化导致遥感变量的变化。例如,环境质量的变化导致卫星图像的变化,而不是相反。在这一假设下,我们提出了一种结合实验和观测数据的公式,以非参数方式识别因果参数。我们开发了一种n^{-1/2}推断方法,该方法对规格不正确具有鲁棒性,并且不限制用于处理遥感变量的算法。

英文摘要

We study causal inference in experiments and quasi-experiments, where the economic outcome is imperfectly measured by a remotely sensed variable. The remotely sensed variable is low-cost, scalable, and predictive of the economic outcome in observational data; examples include satellite imagery and mobile phone activity. We model the remotely sensed variable as post-outcome: variation in the economic outcome causes variation in the remotely sensed variable. For example, changes in environmental quality cause changes in satellite imagery, not vice versa. Under this assumption, we propose a formula to nonparametrically identify the causal parameter by combining experimental and observational data. We develop a method for $n^{-1/2}$ inference that is robust to misspecification and that does not restrict the algorithms used to process remotely sensed variables.

2411.12193 2026-06-11 stat.AP cs.LG stat.ML 版本更新

Hierarchical Probabilistic Conformal Prediction for Distributed Energy Resources Adoption

分布式能源采纳的分层概率保形预测

Wenbin Zhou, Shixiang Zhu

AI总结 针对分布式能源采纳预测中的不确定性和分层电网结构,提出基于多元霍克斯过程与分裂保形预测的量化框架,确保聚合后统计有效性,在印第安纳波利斯数据上优于基线。

详情
AI中文摘要

分布式能源(DERs)的快速增长为电网管理带来了机遇和运营挑战。准确预测DER采纳对于主动基础设施规划至关重要,但DER增长的固有不确定性和空间差异使传统预测方法复杂化。此外,配电网的分层结构要求预测在电路和变电站层面均满足统计保证,这是可靠决策的非平凡要求。本文提出了一种新的DER采纳预测不确定性量化框架,确保在分层电网结构中的有效性。利用多元霍克斯过程建模DER采纳动态,并采用定制的分裂保形预测算法,我们引入了一种新的非一致性分数,在保持预测效率的同时,在聚合下保留统计保证。我们在温和条件下建立了理论有效性,并通过印第安纳州印第安纳波利斯的客户级太阳能电池板安装数据实证评估,表明我们的方法在预测准确性和不确定性校准方面始终优于现有基线。

英文摘要

The rapid growth of distributed energy resources (DERs) presents both opportunities and operational challenges for electric grid management. Accurately predicting DER adoption is critical for proactive infrastructure planning, but the inherent uncertainty and spatial disparity of DER growth complicate traditional forecasting approaches. Moreover, the hierarchical structure of distribution grids demands that predictions satisfy statistical guarantees at both the circuit and substation levels, a non-trivial requirement for reliable decision-making. In this paper, we propose a novel uncertainty quantification framework for DER adoption predictions that ensures validity across hierarchical grid structures. Leveraging a multivariate Hawkes process to model DER adoption dynamics and a tailored split conformal prediction algorithm, we introduce a new nonconformity score that preserves statistical guarantees under aggregation while maintaining prediction efficiency. We establish theoretical validity under mild conditions and demonstrate through empirical evaluation on customer-level solar panel installation data from Indianapolis, Indiana that our method consistently outperforms existing baselines in both predictive accuracy and uncertainty calibration.
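For readers unfamiliar with split conformal prediction, the sketch below shows the standard textbook version with absolute-residual scores. It is not the paper's method: the Hawkes-process model and the new aggregation-preserving nonconformity score are not reproduced, and the calibration residuals are made-up numbers.

```python
import math
from typing import List, Tuple


def conformal_quantile(cal_scores: List[float], alpha: float) -> float:
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores.

    Uses the ceil((n + 1) * (1 - alpha))-th smallest score; returns
    infinity when the calibration set is too small for the target level.
    """
    n = len(cal_scores)
    if n == 0 or not 0.0 < alpha < 1.0:
        raise ValueError("need calibration scores and alpha in (0, 1)")
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf
    return sorted(cal_scores)[rank - 1]


def prediction_interval(point: float, q: float) -> Tuple[float, float]:
    """Symmetric interval around a point forecast."""
    return point - q, point + q


# Hypothetical absolute residuals |y - y_hat| on a held-out calibration set.
cal_residuals = [0.5, 1.0, 0.2, 0.8, 1.5, 0.3, 0.9, 0.4, 0.7]
q = conformal_quantile(cal_residuals, alpha=0.25)
print(q, prediction_interval(12.0, q))  # 1.0 (11.0, 13.0)
```

Under exchangeability of calibration and test points, an interval built this way covers the true value with probability at least 1 - alpha; extending that guarantee to sums over circuits and substations is the part the paper addresses.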

2502.14894 2026-06-11 cs.CV cs.AI cs.CY cs.LG 版本更新

FOCUS on Contamination: Hydrology-Informed Noise-Aware Learning for Geospatial PFAS Mapping

聚焦污染:基于水文信息与噪声感知的地理空间PFAS测绘学习

Jowaria Khan, Alexa Friedman, Sydney Evans, Rachel Klein, Runzi Wang, Katherine E. Manz, Kaley Beins, David Q. Andrews, Elizabeth Bondi-Kelly

AI总结 提出FOCUS框架,结合稀疏PFAS观测与水文连通性等环境先验,通过噪声感知损失实现鲁棒训练,在PFAS污染测绘中优于传统方法。

详情
Comments
Best Paper Award at ICLR 2026 Machine Learning for Remote Sensing Workshop
AI中文摘要

全氟和多氟烷基物质(PFAS)是持久性环境污染物,对公共健康有显著影响,但由于现场采样的高成本和后勤挑战,大规模监测仍然严重受限。样本的缺乏导致难以用物理模型模拟其扩散,并且对PFAS在地表水中传输的科学理解有限。然而,描述土地覆盖、水文和工业活动的丰富地理空间和卫星衍生数据广泛可用。我们提出了FOCUS,一个用于PFAS污染测绘的地理空间深度学习框架,该框架将稀疏的PFAS观测与大规模环境背景(包括来自水文连通性、土地覆盖、污染源邻近性和采样距离的先验)相结合。这些先验被整合到一个原则性的、噪声感知的损失函数中,从而在稀疏标签下产生稳健的训练目标。通过广泛的消融实验、鲁棒性分析和实际验证,FOCUS始终优于包括稀疏分割、克里金法和污染物传输模拟在内的基线方法,同时在大区域上保持了空间一致性和可扩展性。我们的结果展示了AI如何通过提供筛查级风险图来支持环境科学,这些风险图可优先安排后续采样,并在缺乏完整物理模型的情况下帮助将潜在污染源与地表水污染模式联系起来。

英文摘要

Per- and polyfluoroalkyl substances (PFAS) are persistent environmental contaminants with significant public health impacts, yet large-scale monitoring remains severely limited due to the high cost and logistical challenges of field sampling. The lack of samples leads to difficulty simulating their spread with physical models and limited scientific understanding of PFAS transport in surface waters. Yet, rich geospatial and satellite-derived data describing land cover, hydrology, and industrial activity are widely available. We introduce FOCUS, a geospatial deep learning framework for PFAS contamination mapping that integrates sparse PFAS observations with large-scale environmental context, including priors derived from hydrological connectivity, land cover, source proximity, and sampling distance. These priors are integrated into a principled, noise-aware loss, yielding a robust training objective under sparse labels. Across extensive ablations, robustness analyses, and real-world validation, FOCUS consistently outperforms baselines including sparse segmentation, Kriging, and pollutant transport simulations, while preserving spatial coherence and scalability over large regions. Our results demonstrate how AI can support environmental science by providing screening-level risk maps that prioritize follow-up sampling and help connect potential sources to surface-water contamination patterns in the absence of complete physical models.

2505.00571 2026-06-11 stat.ML cs.LG 版本更新

Discovery and inference beyond linearity for epidemiological data by integrating Bayesian regression, tree ensembles and Shapley values

通过整合贝叶斯回归、树集成和Shapley值对流行病学数据进行线性之外的发现与推断

Giorgio Spadaccini, Marjolein Fokkema, Mark A. van de Wiel

AI总结 提出RuleSHAP框架,结合贝叶斯稀疏回归、改进的树规则生成器和Shapley值,实现非线性与交互效应的检测及个体水平的不确定性量化,应用于流行病学数据发现高胆固醇和血压的影响因素。

详情
AI中文摘要

机器学习在流行病学和医疗健康研究中越来越受欢迎,用于无假设地发现风险和保护因素。机器学习在发现非线性和交互作用方面很强,但这种能力因缺乏可靠的推断而受损。尽管Shapley值提供了特征效应的局部度量,但这些效应通常缺乏有效的不确定性量化,从而排除了统计推断。我们提出RuleSHAP,一个通过结合专用贝叶斯稀疏回归模型、改进的基于树的规则生成器和Shapley值归因来解决这一局限性的框架。RuleSHAP能够检测非线性和交互效应,其关键贡献在于个体水平的不确定性量化。我们推导了一个在该框架内计算边际Shapley值的有效公式。我们将RuleSHAP应用于一个流行病学队列的数据,以检测和推断高胆固醇和血压的几种效应,例如年龄、性别、种族、BMI和血糖水平等特征之间的非线性交互效应。最后,我们在模拟数据上证明了我们框架的有效性。

英文摘要

Machine Learning (ML) is gaining popularity in epidemiology and healthcare studies for hypothesis-free discovery of risk and protective factors. ML is strong at discovering nonlinearities and interactions, but this power is compromised by a lack of reliable inference. Although Shapley values provide local measures of features' effects, valid uncertainty quantification for these effects is typically lacking, thus precluding statistical inference. We propose RuleSHAP, a framework that addresses this limitation by combining a dedicated Bayesian sparse regression model with an improved tree-based rule generator and Shapley value attribution. RuleSHAP provides detection of nonlinear and interaction effects, with uncertainty quantification at the individual level as a key contribution. We derive an efficient formula for computing marginal Shapley values within this framework. We apply RuleSHAP to data from an epidemiological cohort to detect and infer several effects for high cholesterol and blood pressure, such as nonlinear interaction effects between features like age, sex, ethnicity, BMI and glucose level. To conclude, we demonstrate the validity of our framework on simulated data.
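As background for the Shapley attribution step, the sketch below computes exact Shapley values by enumerating coalitions for a toy value function. It is the classical definition, not RuleSHAP's efficient marginal formula or its Bayesian uncertainty quantification; the feature names and effect sizes are hypothetical.

```python
from itertools import combinations
from math import factorial
from typing import Callable, Dict, FrozenSet, List


def shapley_values(
    players: List[str], value: Callable[[FrozenSet[str]], float]
) -> Dict[str, float]:
    """Exact Shapley values by summing weighted marginal contributions."""
    n = len(players)
    phi: Dict[str, float] = {}
    for p in players:
        others = [q for q in players if q != p]
        total = 0.0
        for size in range(n):
            # Weight of coalitions of this size that exclude player p.
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                total += weight * (value(s | {p}) - value(s))
        phi[p] = total
    return phi


def toy_value(coalition: FrozenSet[str]) -> float:
    """Hypothetical model: two main effects plus an age-BMI interaction."""
    v = 0.0
    if "age" in coalition:
        v += 2.0
    if "bmi" in coalition:
        v += 1.0
    if "age" in coalition and "bmi" in coalition:
        v += 4.0  # interaction effect, split equally by Shapley
    return v


phi = shapley_values(["age", "bmi", "glucose"], toy_value)
print({k: round(v, 6) for k, v in phi.items()})
```

In this toy case the interaction of 4.0 is shared equally, giving roughly 4.0 for age, 3.0 for BMI and 0.0 for glucose, and the values sum to the full-coalition value of 7.0. Enumeration is exponential in the number of features, which is why efficient formulas matter in practice.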

2510.08073 2026-06-11 cs.CV cs.LG 版本更新

Physics-Driven Spatiotemporal Modeling for AI-Generated Video Detection

物理驱动的时空建模用于AI生成视频检测

Shuhai Zhang, ZiHao Lian, Jiahao Yang, Daiyuan Li, Guoxuan Pang, Feng Liu, Bo Han, Shutao Li, Mingkui Tan

AI总结 提出基于概率流守恒的物理驱动AI生成视频检测范式,通过归一化时空梯度(NSG)统计量捕捉物理异常,结合预训练扩散模型估计NSG,并利用最大均值差异(MMD)进行检测,在Recall和F1-Score上分别提升16.00%和10.75%。

详情
Comments
Accepted at NeurIPS 2025 spotlight
AI中文摘要

AI生成的视频已实现近乎完美的视觉真实感(如Sora),迫切需要可靠的检测机制。然而,检测此类视频在建模高维时空动态和识别违反物理规律的细微异常方面面临重大挑战。本文提出首个基于概率流守恒原理的物理驱动AI生成视频检测范式。具体而言,我们提出一种称为归一化时空梯度(NSG)的统计量,该统计量量化空间概率梯度与时间密度变化之比,明确捕捉与自然视频动态的偏差。利用预训练的扩散模型,我们通过空间梯度近似和运动感知时间建模开发了NSG估计器,无需复杂的运动分解,同时保持物理约束。在此基础上,我们提出基于NSG的视频检测方法(NSG-VD),该方法计算测试视频与真实视频NSG特征之间的最大均值差异(MMD)作为检测指标。最后,我们推导了真实视频与生成视频之间NSG特征距离的上界,证明由于分布偏移,生成视频表现出放大的差异。大量实验证实,NSG-VD在Recall和F1-Score上分别比最先进的基线方法高出16.00%和10.75%,验证了NSG-VD的优越性能。源代码可在该 https URL 获取。

英文摘要

AI-generated videos have achieved near-perfect visual realism (e.g., Sora), urgently necessitating reliable detection mechanisms. However, detecting such videos faces significant challenges in modeling high-dimensional spatiotemporal dynamics and identifying subtle anomalies that violate physical laws. In this paper, we propose the first physics-driven AI-generated video detection paradigm based on probability flow conservation principles. Specifically, we propose a statistic called Normalized Spatiotemporal Gradient (NSG), which quantifies the ratio of spatial probability gradients to temporal density changes, explicitly capturing deviations from natural video dynamics. Leveraging pre-trained diffusion models, we develop an NSG estimator through spatial gradient approximation and motion-aware temporal modeling, without complex motion decomposition and while preserving physical constraints. Building on this, we propose an NSG-based video detection method (NSG-VD) that computes the Maximum Mean Discrepancy (MMD) between NSG features of the test and real videos as a detection metric. Finally, we derive an upper bound of NSG feature distances between real and generated videos, proving that generated videos exhibit amplified discrepancies due to distributional shifts. Extensive experiments confirm that NSG-VD outperforms state-of-the-art baselines by 16.00% in Recall and 10.75% in F1-Score, validating the superior performance of NSG-VD. The source code is available at this https URL.
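The detection metric above relies on the Maximum Mean Discrepancy. The sketch below computes the standard biased estimate of squared MMD with a Gaussian kernel on small feature sets. The two-dimensional vectors stand in for NSG features, which are not computed here; the bandwidth and all data are hypothetical.

```python
import math
from typing import List, Sequence


def gaussian_kernel(
    x: Sequence[float], y: Sequence[float], bandwidth: float
) -> float:
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(
    xs: List[Sequence[float]], ys: List[Sequence[float]], bandwidth: float = 1.0
) -> float:
    """Biased (V-statistic) estimate of squared MMD between two samples."""
    def mean_kernel(a: List[Sequence[float]], b: List[Sequence[float]]) -> float:
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


# Placeholder feature sets: a reference set, a similar set, a shifted set.
real = [[0.0, 0.1], [0.1, 0.0], [-0.1, 0.05]]
similar = [[0.05, 0.0], [0.0, -0.05], [-0.05, 0.1]]
shifted = [[2.0, 2.1], [2.1, 1.9], [1.9, 2.0]]

near = mmd_squared(real, similar)
far = mmd_squared(real, shifted)
print(near < far)  # True
```

A detector in this spirit would flag a test video when its discrepancy from the reference set of real videos exceeds a chosen threshold; how that threshold is set is not specified in the abstract.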

2510.22397 2026-06-11 cs.NI cs.LG 版本更新

NetBurst: Event-Centric Forecasting of Bursty, Intermittent Time Series

NetBurst: 以事件为中心的突发间歇性时间序列预测

Satyandra Guthula, Jaber Daneshamooz, Charles Fleming, Kesheng Wu, Walter Willinger, Arpit Gupta

AI总结 针对网络遥测数据中罕见突发和长间隔低活动的“野性”统计特性,提出NetBurst事件中心管道,通过压缩低活动期、分离突发时序和幅度流学习统一表示,在预测误差、突发分布匹配和异常描述性上显著优于Chronos-2等基线。

AI中文摘要

网络运营商通过收集遥测数据(如数据包计数、字节速率或流体积)来监控其基础设施,但有效运营所需的问题——预测未来负载、诊断和表征异常、搜索和检索历史先例——需要超越原始测量。弥合这一差距需要学习表示:紧凑的每实体摘要,从每个实体的单变量时间序列中捕获时间动态。时间序列基础模型是自然的起点,但它们是为密集、周期性的基准数据集(“温和”统计体制)设计的。然而,网络遥测数据处于“野性”体制:操作相关事件罕见,被可变长度的低活动或无活动(“低潮”)间隔分隔,并伴有间歇性的重尾极端值突发(“潮汐”)。我们提出NetBurst,一个以事件为中心的管道,它压缩低潮,将每个时间序列分离为突发时序流和突发幅度流,并学习一个服务于所有三个操作任务的单一表示。与八个基线中最强的竞争者(包括Amazon的Chronos-2和Datadog的Toto)相比,在九个生产遥测配置上,NetBurst在野性体制数据上将中位预测误差降低了1.3–116倍,对真实突发分布的匹配度提高了1.0–7.5倍,并在温和体制基准上与基线相当。对于异常表征,NetBurst产生平衡、分布良好的聚类,在一种新的可解释性评分下,这些聚类在操作员熟悉的术语中可描述性提高了16倍,而聚类过滤搜索实现了7.5倍的端到端检索加速。

英文摘要

Network operators monitor their infrastructure by collecting telemetry data such as packet counts, byte rates, or flow volumes, yet answering the questions that effective operations demand -- forecasting future load, diagnosing and characterizing anomalies, and searching for and retrieving historical precedents -- requires more than raw measurements. Bridging this gap calls for learned representations: compact per-entity summaries that capture temporal dynamics from each entity's univariate time series. Time-series foundation models are the natural starting point, but they are designed for dense, periodic benchmark datasets -- the "mild" statistical regime. However, network telemetry data inhabits the "wild" regime: operationally relevant events are rare, separated by variable-length stretches of low or no activity ("ebbs"), with intermittent bursts of heavy-tailed extremes ("tides"). We present NetBurst, an event-centric pipeline that collapses ebbs, separates each time series into a stream of burst timings and a stream of burst magnitudes, and learns a single representation serving all three operational tasks. Compared to the strongest competitors among eight baselines -- including Amazon's Chronos-2 and Datadog's Toto -- and across nine production telemetry configurations, NetBurst reduces median forecasting error by $1.3$--$116\times$ on wild-regime data with a $1.0$--$7.5\times$ better match to the true burst distribution, and matches baselines on mild-regime benchmarks. For characterizing anomalies, NetBurst produces balanced, well-spread clusters that are $16\times$ more describable in operator-familiar terms under a novel interpretability score, and cluster-filtered search delivers $7.5\times$ faster end-to-end retrieval.
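The event-centric idea of collapsing ebbs and splitting a series into a timing stream and a magnitude stream can be illustrated with a small sketch. The fixed threshold, the use of the peak value as the burst magnitude, and the function name `split_bursts` are assumptions made for illustration; the abstract does not specify how NetBurst defines bursts.

```python
def split_bursts(series, threshold):
    """Collapse low-activity stretches and return (gaps, magnitudes).

    gaps[i] is the number of low-activity steps preceding burst i;
    magnitudes[i] is the peak value observed during burst i.
    """
    gaps, magnitudes = [], []
    gap, peak = 0, None
    for value in series:
        if value > threshold:
            # Inside a burst: track its peak.
            peak = value if peak is None else max(peak, value)
        else:
            if peak is not None:
                # A burst just ended: emit one event and reset.
                gaps.append(gap)
                magnitudes.append(peak)
                gap, peak = 0, None
            gap += 1
    if peak is not None:
        # The series ended while a burst was still in progress.
        gaps.append(gap)
        magnitudes.append(peak)
    return gaps, magnitudes

telemetry = [0, 0, 5, 9, 0, 0, 0, 7, 0]
gaps, magnitudes = split_bursts(telemetry, threshold=1)
```

The two returned lists have one entry per burst, so a long quiet series shrinks to a short sequence of events regardless of how long the quiet stretches are.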

2512.11982 2026-06-11 astro-ph.IM cs.AI cs.CV cs.LG 版本更新

Semantic search for 100M+ galaxy images using AI-generated captions

基于AI生成描述的1亿+星系图像语义搜索

Nolan Koblischke, Liam Parker, Francois Lanusse, Jo Bovy, Irina Espejo, Shirley Ho

AI总结 提出利用视觉语言模型生成星系图像描述,并对比对齐预训练天文学基础模型,构建可搜索嵌入,实现大规模星系图像的语义搜索,在稀有现象发现上取得最先进性能。

Comments
ApJ, in press
AI中文摘要

通过缓慢的手动标注活动寻找科学上有趣的现象严重限制了我们对望远镜产生的数十亿星系图像的探索能力。在这项工作中,我们开发了一个流水线,从完全未标记的图像数据创建语义搜索引擎。我们的方法利用视觉语言模型(VLM)为星系图像生成描述,然后将预训练的天文学基础模型与这些嵌入的描述进行对比对齐,以产生大规模可搜索的嵌入。我们发现当前的VLM提供的描述信息足够丰富,可以训练一个语义搜索模型,该模型优于直接图像相似性搜索。我们的模型AION-Search在寻找稀有现象方面实现了最先进的零样本性能,尽管训练是在随机选择的图像上进行的,没有针对稀有情况进行刻意策划。此外,我们引入了一种基于VLM的重排序方法,该方法在top-100结果中对我们最具挑战性的目标的召回率几乎翻倍。首次,AION-Search实现了对超过1亿张星系图像的灵活语义搜索,使得从以前不可行的搜索中能够发现新现象,包括识别出36个新的河外恒星流候选体。更广泛地说,我们的工作提供了一种方法,使大型、未标记的科学图像档案变得可语义搜索,扩展了从地球观测到显微镜等领域的数据探索能力。代码、数据和应用程序可通过 this https URL 公开获取。

英文摘要

Finding scientifically interesting phenomena through slow manual labeling campaigns severely limits our ability to explore the billions of galaxy images produced by telescopes. In this work, we develop a pipeline to create a semantic search engine from completely unlabeled image data. Our method leverages Vision-Language Models (VLMs) to generate descriptions for galaxy images, then contrastively aligns a pre-trained astronomy foundation model with these embedded descriptions to produce searchable embeddings at scale. We find that current VLMs provide descriptions that are sufficiently informative to train a semantic search model that outperforms direct image similarity search. Our model, AION-Search, achieves state-of-the-art zero-shot performance on finding rare phenomena despite training on randomly selected images with no deliberate curation for rare cases. Furthermore, we introduce a VLM-based re-ranking method that nearly doubles the recall for our most challenging targets in the top-100 results. For the first time, AION-Search enables flexible semantic search for over 100 million galaxy images, enabling discovery from previously infeasible searches, including the identification of 36 new extragalactic stellar stream candidates. More broadly, our work provides an approach for making large, unlabeled scientific image archives semantically searchable, expanding data exploration capabilities in fields from Earth observation to microscopy. The code, data, and app are publicly available at this https URL
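Once text and image embeddings live in a shared space, semantic search reduces to nearest-neighbour lookup. The sketch below shows that retrieval step with cosine similarity; the three-dimensional vectors, the image names, and the `top_k` helper are hypothetical, and the contrastive training that would produce real embeddings is not shown.

```python
import math

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def top_k(query_vec, image_vecs, k=2):
    """Return the names of the k images most similar to the query embedding."""
    ranked = sorted(image_vecs.items(),
                    key=lambda item: cosine(query_vec, item[1]),
                    reverse=True)
    return [name for name, _ in ranked[:k]]

# Hypothetical embeddings; in practice these come from the aligned encoders.
images = {
    "galaxy_a": [0.9, 0.1, 0.0],
    "galaxy_b": [0.0, 1.0, 0.0],
    "galaxy_c": [0.5, 0.5, 0.0],
}
query = [1.0, 0.0, 0.0]  # stands in for an embedded text query
```

At the scale described in the abstract an approximate nearest-neighbour index would replace the full sort, but the ranking criterion stays the same.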

2512.13765 2026-06-11 eess.IV cs.AI cs.LG 版本更新

Towards Deep Learning Surrogate for the Forward Problem in Electrocardiology: A Scalable Alternative to Physics-Based Models

面向心电学正问题的深度学习代理模型:一种可扩展的物理模型替代方案

Shaheim Ogbomo-Harmitt, Cesare Magnetti, Chiara Spota, Jakub Grzelak, Oleg Aslanidi

AI总结 提出基于注意力机制的序列到序列深度学习框架,作为心电学正问题的代理模型,从心脏电压传播图预测心电图信号,在2D组织模拟中达到高精度(平均R²=0.99±0.01),为物理模型提供可扩展、低成本的替代方案。

Comments
Accepted to CinC conference 2025
AI中文摘要

心电学中的正问题,即从心脏电活动计算体表电位,传统上使用基于物理的模型(如双域或单域方程)求解。虽然准确,但这些方法计算成本高,限制了其在实时和大规模临床中的应用。我们提出一个概念验证的深度学习(DL)框架,作为正问题求解器的高效代理。该模型采用基于时间依赖注意力机制的序列到序列架构,从心脏电压传播图预测心电图(ECG)信号。引入了一种混合损失函数,结合Huber损失和谱熵项,以保持时域和频域的保真度。使用包含健康、纤维化和缝隙连接重塑条件的2D组织模拟,模型实现了高精度(平均$R^2 = 0.99 \pm 0.01$)。消融研究证实了卷积编码器、时间感知注意力和谱熵损失的贡献。这些发现突显了DL作为物理求解器的可扩展、低成本替代方案的潜力,适用于临床和数字孪生应用。

英文摘要

The forward problem in electrocardiology, computing body surface potentials from cardiac electrical activity, is traditionally solved using physics-based models such as the bidomain or monodomain equations. While accurate, these approaches are computationally expensive, limiting their use in real-time and large-scale clinical applications. We propose a proof-of-concept deep learning (DL) framework as an efficient surrogate for forward solvers. The model adopts a time-dependent, attention-based sequence-to-sequence architecture to predict electrocardiogram (ECG) signals from cardiac voltage propagation maps. A hybrid loss combining Huber loss with a spectral entropy term was introduced to preserve both temporal and frequency-domain fidelity. Using 2D tissue simulations incorporating healthy, fibrotic, and gap junction-remodelled conditions, the model achieved high accuracy (mean $R^2 = 0.99 \pm 0.01$). Ablation studies confirmed the contributions of convolutional encoders, time-aware attention, and spectral entropy loss. These findings highlight DL as a scalable, cost-effective alternative to physics-based solvers, with potential for clinical and digital twin applications.
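A hybrid loss of the kind described, a Huber term plus a spectral entropy term, could look like the following sketch. The abstract does not give the exact form, so the choice to penalise the absolute difference between the spectral entropies of prediction and target, the weight of 0.1, and the plain DFT are all assumptions.

```python
import cmath
import math

def huber(pred, target, delta=1.0):
    """Mean Huber loss: quadratic for small residuals, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        r = abs(p - t)
        total += 0.5 * r * r if r <= delta else delta * (r - 0.5 * delta)
    return total / len(pred)

def spectral_entropy(signal):
    """Shannon entropy of the normalised one-sided power spectrum."""
    n = len(signal)
    power = []
    for k in range(n // 2 + 1):
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n)
                    for t, x in enumerate(signal))
        power.append(abs(coeff) ** 2)
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power if p > 0.0]
    return -sum(p * math.log(p) for p in probs)

def hybrid_loss(pred, target, weight=0.1):
    """Time-domain Huber term plus a weighted frequency-domain mismatch term."""
    mismatch = abs(spectral_entropy(pred) - spectral_entropy(target))
    return huber(pred, target) + weight * mismatch

sine = [math.sin(2.0 * math.pi * t / 16) for t in range(16)]
impulse = [1.0] + [0.0] * 15
```

A pure sine concentrates its power in one frequency bin and has low spectral entropy, while an impulse spreads power evenly and has high entropy, which is the property the extra term is meant to track.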

2601.14031 2026-06-11 stat.ML cs.LG 版本更新

Intermittent time series forecasting: local vs global models

间歇性时间序列预测:局部模型与全局模型

Stefano Damato, Nicolò Rubattu, Dario Azzimonti, Giorgio Corani

AI总结 针对间歇性时间序列预测问题,首次系统比较了概率性局部模型与全局模型(如TiDE),发现简单神经网络架构TiDE在精度和计算效率上均优于局部模型,且Tweedie分布头对高分位数估计最佳。

Comments
Submitted to the Journal of the Operational Research Society
AI中文摘要

预测包含零值的间歇性时间序列是供应链中的一个关键挑战,因为库存策略需要概率预测来建立安全水平。间歇性时间序列通常使用局部模型进行预测,即对每个时间序列单独训练。近年来,基于大量时间序列训练的全局模型在时间序列预测中变得流行。全局模型通常基于神经网络或梯度提升树。我们进行了首次研究,比较了最先进的概率性局部模型和全局模型在间歇性时间序列上的表现。对于全局模型,我们考虑了三种适用于间歇性时间序列的不同分布头:负二项、障碍移位负二项和Tweedie。据我们所知,这是后两者首次与神经网络结合使用。我们在五个数据集上进行了实验,这些数据集总共包含超过40,000个真实世界的时间序列。在全局模型中,TiDE(一种简单的神经网络架构)取得了最佳精度;它还持续优于局部模型,并且计算需求更低。大型全局模型反而计算需求更高且精度更低。在分布头中,Tweedie提供了最高分位数的最佳估计。

英文摘要

Forecasting intermittent time series, which contain zeros, is a crucial challenge in supply chains, as inventory policies require probabilistic forecasts to establish safety levels. Intermittent time series are commonly forecast using local models, trained individually on each time series. In recent years, global models, trained on a large collection of time series, have become popular for time series forecasting. Global models are often based on neural networks or gradient boosted trees. We carry out the first study comparing state-of-the-art probabilistic local and global models on intermittent time series. For global models, we consider three different distribution heads suitable for intermittent time series: negative binomial, hurdle-shifted negative binomial, and Tweedie. To the best of our knowledge, this is the first use of the latter two with neural networks. We perform experiments on five datasets comprising more than 40,000 real-world time series in total. Among global models, TiDE, a simple neural network architecture, achieves the best accuracy; it also consistently outperforms local models and has lower computational requirements. Large global models are instead much more computationally demanding and less accurate. Among the distribution heads, the Tweedie provides the best estimates of the highest quantiles.
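To make the idea of a distribution head for zero-heavy demand concrete, here is a sketch of one plausible reading of a hurdle-shifted negative binomial: a Bernoulli hurdle decides whether demand is non-zero, and positive demand is one plus a negative binomial draw. This parameterisation is an assumption based on the name alone; the paper's definition may differ.

```python
import math

def neg_binomial_pmf(k, r, p):
    """P(K = k) for a negative binomial with size r and success probability p."""
    log_pmf = (math.lgamma(k + r) - math.lgamma(k + 1) - math.lgamma(r)
               + r * math.log(p) + k * math.log(1.0 - p))
    return math.exp(log_pmf)

def hurdle_shifted_nb_pmf(y, pi_nonzero, r, p):
    """P(Y = y): zero with probability 1 - pi_nonzero, else 1 + NB(r, p)."""
    if y == 0:
        return 1.0 - pi_nonzero
    return pi_nonzero * neg_binomial_pmf(y - 1, r, p)

# Hypothetical head outputs for a single forecast step.
probs = [hurdle_shifted_nb_pmf(y, pi_nonzero=0.3, r=2.0, p=0.5) for y in range(200)]
```

In a neural forecaster the three parameters would be produced by the network for every series and time step, and training would minimise the negative log of this probability at the observed demand.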

2602.19502 2026-06-11 cs.AI cs.LG 版本更新

Human-Guided Agentic AI for Multimodal Clinical Prediction: Lessons from the AgentDS Healthcare Benchmark

人类引导的智能体AI用于多模态临床预测:来自AgentDS医疗基准的教训

Lalitha Pranathi Pulavarthy, Raajitha Muthyala, Aravind V Kuruvikkattil, Zhenan Yin, Rashmita Kudamala, Saptarshi Purkayastha

AI总结 通过人类引导智能体AI在多模态临床预测任务中取得领先性能,提炼出领域知识引导特征工程、任务特定多模态融合和临床动机模型集成三大通用经验。

Comments
Presented at the Data Challenge track at the 14th IEEE International Conference on Healthcare Informatics (ICHI) 2026 on June 3, 2026
AI中文摘要

智能体AI系统越来越能够自主执行数据科学工作流程,但临床预测任务需要纯自动化方法难以提供的领域专业知识。我们研究了人类引导智能体AI如何改进多模态临床预测,展示了我们在所有三个AgentDS医疗基准挑战中的方法:30天再入院预测(Macro-F1 = 0.8986)、急诊科费用预测(MAE = $465.13)和出院准备评估(Macro-F1 = 0.7939)。在这些任务中,人类分析师在关键决策点指导智能体工作流程:来自临床笔记、扫描PDF账单收据和时间序列生命体征的多模态特征工程;任务适当的模型选择;以及临床信息验证策略。我们的方法在医疗领域总体排名第5,在出院准备任务中获得第3名。消融研究表明,人类引导决策在自动化基线之上累积增益达到+0.065 F1,其中多模态特征提取贡献了最大的单一改进(+0.041 F1)。我们提炼出三个可推广的经验:(1)每个流水线阶段的领域信息特征工程产生累积增益,优于广泛的自动搜索;(2)多模态数据集成需要任务特定的人类判断,没有单一提取策略能泛化到临床文本、PDF和时间序列;(3)具有临床动机模型配置的刻意集成多样性优于随机超参数搜索。这些发现为在需要可解释性、可重复性和临床有效性的医疗环境中部署智能体AI的团队提供了实用指导。

英文摘要

Agentic AI systems are increasingly capable of autonomous data science workflows, yet clinical prediction tasks demand domain expertise that purely automated approaches struggle to provide. We investigate how human guidance of agentic AI can improve multimodal clinical prediction, presenting our approach to all three AgentDS Healthcare benchmark challenges: 30-day hospital readmission prediction (Macro-F1 = 0.8986), emergency department cost forecasting (MAE = $465.13), and discharge readiness assessment (Macro-F1 = 0.7939). Across these tasks, human analysts directed the agentic workflow at key decision points: multimodal feature engineering from clinical notes, scanned PDF billing receipts, and time-series vital signs; task-appropriate model selection; and clinically informed validation strategies. Our approach ranked 5th overall in the healthcare domain, with a 3rd-place finish on the discharge readiness task. Ablation studies reveal that human-guided decisions compounded to a cumulative gain of +0.065 F1 over automated baselines, with multimodal feature extraction contributing the largest single improvement (+0.041 F1). We distill three generalizable lessons: (1) domain-informed feature engineering at each pipeline stage yields compounding gains that outperform extensive automated search; (2) multimodal data integration requires task-specific human judgment, as no single extraction strategy generalizes across clinical text, PDFs, and time-series; and (3) deliberate ensemble diversity with clinically motivated model configurations outperforms random hyperparameter search. These findings offer practical guidance for teams deploying agentic AI in healthcare settings where interpretability, reproducibility, and clinical validity are essential.
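Two of the three reported scores are Macro-F1 values. As a reminder of what that metric measures, the sketch below computes the unweighted mean of per-class F1 scores; the labels are made up, and the benchmark's own scoring script may handle edge cases differently.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every label seen in either list."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears.
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Hypothetical readmission labels (1 = readmitted within 30 days).
truth = [0, 0, 1, 1]
predicted = [0, 1, 1, 1]
score = macro_f1(truth, predicted)
```

Because every class contributes equally, a model that ignores a rare class is penalised more heavily than it would be under plain accuracy.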

2602.23461 2026-06-11 physics.flu-dyn cs.LG 版本更新

Neural ensemble Kalman filter: Data assimilation for compressible flows with shocks

神经集成卡尔曼滤波器:含激波可压缩流的数据同化

Xu-Hui Zhou, Lorenzo Beronilla, Michael K. Sleeman, Hangchuan Hu, Matthias Morzfeld, Andrew M. Stuart, Tamer A. Zaki

AI总结 针对含激波可压缩流中集成卡尔曼滤波器(EnKF)因双峰预报分布失效的问题,提出神经EnKF,通过将预报集合映射到神经网络参数空间并在此空间进行同化,结合物理信息迁移学习避免伪振荡和非物理特征。

AI中文摘要

含激波可压缩流的数据同化(DA)具有挑战性,因为许多经典DA方法在不确定激波附近会产生伪振荡和非物理特征。我们在此关注集成卡尔曼滤波器(EnKF)。我们表明,EnKF性能不佳可归因于在不确定激波位置附近可能出现双峰预报分布;这违反了EnKF的假设,即预报接近高斯分布。为解决此问题,我们引入了新的神经EnKF。基本思想是通过将激波流的预报集合映射到深度神经网络(NN)的参数空间(权重和偏置),并随后在该空间中进行DA,从而系统地将神经函数逼近嵌入到集成DA中。非线性映射将尖锐和光滑的流动特征编码在NN参数的集合中。因此,只有当NN参数在预报集合的神经表示中平滑变化时,神经EnKF更新才是良好的。我们表明,可以通过物理信息迁移学习强制网络参数的这种平滑变化,并证明这样做神经EnKF避免了困扰EnKF的伪振荡和非物理特征。通过无粘Burgers方程、Sod激波管和二维爆炸波的一系列系统数值实验,证明了神经EnKF的适用性。

英文摘要

Data assimilation (DA) for compressible flows with shocks is challenging because many classical DA methods generate spurious oscillations and nonphysical features near uncertain shocks. We focus here on the ensemble Kalman filter (EnKF). We show that the poor performance of the EnKF may be attributed to the bimodal forecast distribution that can arise in the vicinity of an uncertain shock location; this violates the assumptions underpinning the EnKF, which require a forecast that is close to Gaussian. To address this issue, we introduce the new neural EnKF. The basic idea is to systematically embed neural function approximations within ensemble DA by mapping the forecast ensemble of shocked flows to the parameter space (weights and biases) of a deep neural network (NN) and to subsequently perform DA in that space. The nonlinear mapping encodes sharp and smooth flow features in an ensemble of NN parameters. Neural EnKF updates are therefore well-behaved only if the NN parameters vary smoothly within the neural representation of the forecast ensemble. We show that such a smooth variation of network parameters can be enforced via physics-informed transfer learning, and demonstrate that in so doing the neural EnKF avoids the spurious oscillations and nonphysical features that plague the EnKF. The applicability of the neural EnKF is demonstrated through a series of systematic numerical experiments with the inviscid Burgers' equation, the Sod shock tube, and a two-dimensional blast wave.
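For readers unfamiliar with the baseline being modified, the sketch below shows the standard EnKF analysis step for a scalar state that is observed directly. It illustrates the classical filter only, not the neural EnKF; the scalar setting, the function name, and the option to switch off observation perturbations are simplifications chosen for clarity.

```python
import random

def enkf_update(ensemble, observation, obs_var, rng=None):
    """One EnKF analysis step for a directly observed scalar state.

    With rng=None the observation is not perturbed, which keeps the
    example deterministic; pass a random.Random for the stochastic EnKF.
    """
    n = len(ensemble)
    mean = sum(ensemble) / n
    forecast_var = sum((x - mean) ** 2 for x in ensemble) / (n - 1)
    # Kalman gain built from the sample variance of the forecast ensemble.
    gain = forecast_var / (forecast_var + obs_var)
    updated = []
    for x in ensemble:
        noise = rng.gauss(0.0, obs_var ** 0.5) if rng is not None else 0.0
        updated.append(x + gain * (observation + noise - x))
    return updated

forecast = [0.0, 1.0, 2.0, 3.0, 4.0]
analysis = enkf_update(forecast, observation=10.0, obs_var=2.5)
stochastic = enkf_update(forecast, 10.0, 2.5, rng=random.Random(0))
```

The gain depends only on the ensemble mean and variance, which is why a bimodal forecast ensemble, as arises near an uncertain shock, is poorly served by this linear update.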

2603.21639 2026-06-11 cs.CY cs.LG 版本更新

A Multi-Modal Sensor Fusion Instrument for Measuring Regional Human Mobility: The Distributed Human Data Engine (DHDE)

多模态传感器融合仪器用于测量区域人类流动性:分布式人类数据引擎(DHDE)

Amil Khanzada, Takuji Takemoto

AI总结 提出分布式人类数据引擎(DHDE),通过融合边缘AI相机、数字意图信号、行为记录和气象数据,解决外围区域人类流动性测量中传感器稀疏和行为异质性问题,验证了稀疏传感器补偿方法,并发现“低活力悖论”。

Comments
32 pages, 4 figures, 3 tables. Pre-print of a manuscript submitted for peer review (v2)
AI中文摘要

准确估计外围区域经济中的人类流动性面临一个基本的测量挑战:物理地面实况传感器稀疏,行为意图信号异质,环境摩擦给需求推断引入系统性偏差。我们提出分布式人类数据引擎(DHDE),一种多模态传感器融合架构,通过整合物理仪器(边缘AI相机)、数字意图信号(路线搜索印象指标)、行为记录(90,350条消费记录,97,719份标准化调查回复)以及日本福井四个地理分布节点的气象数据来解决这一挑战。主要的测量科学贡献在于设计、部署和跨节点验证DHDE作为稀疏传感器补偿仪器:一种异质传感器融合架构,将非平稳数字意图信号锚定到同时的物理地面实况计数,纠正由气象规划摩擦引入的系统性偏差。该仪器实现为集成推理管道(随机森林和带有Newey-West稳健推断的普通最小二乘法),在397个日观测数据上校准,并通过四个地理上不同的节点类型的时间顺序保留复制进行验证。主要OLS规范实现了样本内解释力R²=0.810和时间顺序样本外预测性能R²=0.683。结果识别出一个“低活力悖论”,其中宏观区域访客满意度与人群密度正相关(Spearman秩相关系数rs=+0.150,p=0.002)。我们估计年度代理缺口为865,917次意图隐含访问,对应119.6亿日元(7260万美元)的损失收入。

英文摘要

Accurately estimating human mobility in peripheral regional economies presents a fundamental measurement challenge: physical ground-truth sensors are sparse, behavioral intent signals are heterogeneous, and environmental friction introduces systematic bias into demand inference. We introduce the Distributed Human Data Engine (DHDE), a multi-modal sensor fusion architecture that addresses this challenge by integrating physical instrumentation (Edge-AI cameras), digital intent signals (route search impression metrics), behavioral records (90,350 spending records, 97,719 standardized survey responses), and meteorological data across four geographically distributed nodes in Fukui, Japan. The primary measurement-science contribution is the design, deployment, and cross-node validation of the DHDE as a sparse-sensor compensation instrument: a heterogeneous sensor fusion architecture that anchors non-stationary digital intent signals to concurrent physical ground-truth counts, correcting for systematic bias introduced by meteorological planning friction. The instrument is implemented as an ensemble inference pipeline (Random Forest and Ordinary Least Squares with Newey-West robust inference), calibrated across 397 daily observations and validated by chronological holdout replication across four geographically distinct node types. The primary OLS specification achieved an in-sample explanatory power of $R^2 = 0.810$ and a chronological out-of-sample predictive performance of $R^2 = 0.683$. Results identify an Under-Vibrancy Paradox where macro-regional visitor satisfaction correlates positively with crowd density (Spearman rank correlation $r_s = +0.150$, $p = 0.002$). We estimate an annual proxy gap of 865,917 intent-implied visits, corresponding to JPY 11.96 billion (USD 72.6 million) in foregone revenue.
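The satisfaction and crowd-density result is reported as a Spearman rank correlation. The sketch below shows how that statistic is computed, as the Pearson correlation of average ranks; the data are invented, and the p-value reported in the abstract would need an additional significance test that is not shown.

```python
import math

def average_ranks(values):
    """Rank values from 1, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman rank correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))

# Hypothetical daily crowd density and satisfaction scores.
density = [1, 2, 3, 4]
satisfaction = [10, 20, 25, 100]
```

Working on ranks makes the statistic insensitive to the heavy right tail that visitor counts usually have.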

2604.23874 2026-06-11 physics.flu-dyn cs.LG math.DS physics.comp-ph physics.geo-ph 版本更新

Deep Learning of Solver-Aware Turbulence Closures from Nudged LES Dynamics

从Nudged LES动力学中深度学习求解器感知的湍流闭合模型

Ashwin Suriyanarayanan, Dibyajyoti Chakraborty, Romit Maulik

AI总结 提出基于连续数据同化框架的深度学习方法,利用稀疏观测的DNS数据先验训练湍流闭合模型,无需修改或微分LES求解器,同时保持部署稳定性,并显式条件化数值格式以适配不同离散化。

AI中文摘要

可微物理范式可以通过将神经网络参数化直接嵌入求解器,并根据潜在稀疏的目标数据进行优化,作为一种后验方法来发现湍流闭合模型。这解决了先验学习的关键局限性,即使用直接数值模拟(DNS)数据来近似亚网格应力,并假设存在低通滤波器。以这种先验方式训练的闭合模型常常由于假设的滤波器与数值离散化和粗粒化效应之间的不匹配而导致部署不稳定。相比之下,后验学习虽然在部署期间通常稳定,但由于需要通过大涡模拟(LES)求解器进行反向传播,因此计算成本高昂。此外,后验方法难以广泛应用,因为它们需要对现有求解器进行重大修改。最后,当需要在具有隐式滤波特性的不同数值格式之间进行泛化时,这两种方法都受到限制。在这项工作中,我们提出了一种基于连续数据同化框架的深度学习湍流闭合建模方法。我们的方法允许使用稀疏观测的DNS数据先验训练闭合模型,而无需修改或微分LES求解器,同时在部署期间保持稳定性以恢复不变统计量。我们通过显式地将模型条件化于数值格式,专注于模型适应不同离散化的能力。我们使用二维和三维经典案例来测试我们的框架,并表明学习的修正系统地跟踪了粗求解器的离散化误差。

英文摘要

The differentiable physics paradigm may be leveraged as an a-posteriori approach for discovering turbulence closure models by embedding a neural network parameterization directly inside the solver and optimizing it given potentially sparse target data. This addresses a key limitation of a-priori learning where direct numerical simulation (DNS) data is used to approximate the subgrid stress with the assumption of a low-pass filter. Closures trained in this a-priori manner frequently lead to unstable deployments due to the mismatch between the assumed filter and the effect of numerical discretizations and coarse-graining. In comparison, while typically stable during deployment, a-posteriori learning incurs high computational costs due to the need to backpropagate through a large eddy simulation (LES) solver. Furthermore, a-posteriori methods are challenging to apply broadly since they require significant modification of existing solvers. Finally, both approaches are limited when generalization is desired across different numerical schemes with their implicit filtering characteristics. In this work, we present a deep-learning approach for turbulence closure modeling built on the continuous data assimilation framework. Our approach enables the a-priori training of closures using sparsely observed DNS data without modifying or differentiating through the LES solver, while preserving stability during deployment for the recovery of invariant statistics. We focus on the model's ability to adapt to different discretizations by explicitly conditioning it on the numerical scheme. We use two- and three-dimensional canonical cases to test our framework and show that the learned correction systematically tracks the discretization error of the coarse solver.
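Continuous data assimilation, often called nudging, adds a relaxation term that pulls a model state toward observations while the model runs. The toy below applies it to a scalar decay equation with a deliberately wrong decay rate. The equation, the rates, the nudging strength, and the availability of an observation at every step are all assumptions for illustration and have no connection to the LES solvers in the paper.

```python
import math

def integrate(model_rhs, u0, dt, steps, truth=None, nudge=0.0):
    """Forward-Euler integration with an optional nudging term.

    When truth is given, nudge * (truth(t) - u) is added to the right-hand
    side so the state relaxes toward the observed trajectory.
    """
    u = u0
    for n in range(steps):
        rhs = model_rhs(u)
        if truth is not None:
            rhs += nudge * (truth(n * dt) - u)
        u += dt * rhs
    return u

def truth(t):
    """Reference trajectory: solution of du/dt = -u with u(0) = 1."""
    return math.exp(-t)

def coarse_model(u):
    """Imperfect model: decays at half the true rate."""
    return -0.5 * u

target = truth(2.0)
free_run = integrate(coarse_model, 1.0, 0.01, 200)
nudged_run = integrate(coarse_model, 1.0, 0.01, 200, truth=truth, nudge=10.0)
```

The gap that remains between the nudged state and the reference reflects the model error, which is the kind of signal a learned closure could be trained to remove.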

2605.06100 2026-06-11 eess.SP cs.AI cs.LG cs.RO 版本更新

CredibleDFGO: Differentiable Factor Graph Optimization with Credibility Supervision

可信DFGO:具有可信度监督的可微因子图优化

Liang Qian, Penggao Yan, Penghui Xu, Li-Ta Hsu

AI总结 针对GNSS协方差不可靠问题,提出CredibleDFGO框架,通过可微高斯-牛顿求解器与加权生成网络,利用适当评分规则监督预测分布,提升协方差可信度与定位精度。

Comments
Submitted to NAVIGATION: Journal of the Institute of Navigation
AI中文摘要

全球导航卫星系统(GNSS)定位广泛用于城市导航,但GNSS求解器报告的协方差在城市峡谷中通常不可靠。现有的可微因子图优化(DFGO)方法通过求解器学习测量加权,但仍仅使用位置目标。因此,位置估计可能改善,而报告的协方差仍然过小、过大或方向错误。我们提出CredibleDFGO(CDFGO),一种可微GNSS因子图框架,将协方差可信度作为显式训练目标。加权生成网络(WGN)预测每颗卫星的可靠性权重,可微高斯-牛顿求解器将这些权重映射到位置估计和基于Hessian的后验协方差。我们使用适当评分规则端到端监督东-北预测分布。我们研究了负对数似然(NLL)、能量分数(ES)及其组合。在三个UrbanNav测试场景上的结果表明,协方差可信度持续提升。定位精度在中度城市和严峻城市场景中也有所提高;在深度城市场景中,平均水平误差和第95百分位误差均有所改善。在严峻城市的旺角(MK)场景中,与DFGO(MAE)相比,CDFGO-Combined将平均水平误差从13.77米降至11.68米,将NLL从40.63降至6.59,将ES从12.31降至9.05。案例研究将MK改进归因于更好的轴向一致性、更可信的局部协方差椭圆以及卫星级重新加权。

英文摘要

Global navigation satellite system (GNSS) positioning is widely used for urban navigation, but the covariance reported by the GNSS solver is often unreliable in urban canyons. Existing differentiable factor graph optimization (DFGO) methods learn measurement weighting through the solver, but they still use position-only objectives. As a result, the position estimate may improve while the reported covariance remains too small, too large, or incorrectly oriented. We propose CredibleDFGO (CDFGO), a differentiable GNSS factor graph framework that makes covariance credibility an explicit training target. A Weighting Generation Network (WGN) predicts per-satellite reliability weights, and a differentiable Gauss-Newton solver maps these weights to a position estimate and a Hessian-derived posterior covariance. We use proper scoring rules to supervise the East-North predictive distribution end to end. We study negative log-likelihood (NLL), the energy score (ES), and their combination. Results on three UrbanNav test scenes show consistent gains in covariance credibility. Positioning accuracy also improves on the medium-urban and harsh-urban scenes; on the deep-urban scene, both the mean horizontal error and the 95th-percentile error improve. On the harsh-urban Mong Kok (MK) scene, CDFGO-Combined reduces the mean horizontal error from 13.77 m to 11.68 m, reduces NLL from 40.63 to 6.59, and reduces ES from 12.31 to 9.05 relative to DFGO (MAE). Case studies link the MK improvement to better axis-wise consistency, more credible local covariance ellipses, and satellite-level reweighting.
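The negative log-likelihood used as a proper scoring rule rewards a covariance that matches the actual error, which a position-only objective cannot do. The sketch below evaluates the bivariate Gaussian NLL for one East-North error under three hypothetical covariances. The numbers are invented, and the energy score and the differentiable solver are not shown.

```python
import math

def gaussian_nll_2d(error, cov):
    """Negative log-likelihood of a 2-D error under a zero-mean Gaussian.

    error is (east, north); cov is a 2x2 covariance matrix as nested lists.
    """
    (a, b), (c, d) = cov
    det = a * d - b * c
    e, n = error
    # Squared Mahalanobis distance via the closed-form 2x2 inverse.
    maha = (d * e * e - (b + c) * e * n + a * n * n) / det
    return 0.5 * (maha + math.log(det)) + math.log(2.0 * math.pi)

error = (3.0, 0.0)  # hypothetical 3 m East error
overconfident = gaussian_nll_2d(error, [[0.01, 0.0], [0.0, 0.01]])
calibrated = gaussian_nll_2d(error, [[9.0, 0.0], [0.0, 9.0]])
underconfident = gaussian_nll_2d(error, [[1e4, 0.0], [0.0, 1e4]])
```

Both a covariance that is too small and one that is too large score worse than the calibrated one, so minimising this quantity pushes the reported uncertainty toward the observed error.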

2605.10592 2026-06-11 cs.AI cs.HC cs.LG 版本更新

A Resilient Solution for Sewer Overflow Monitoring across Cloud and Edge

跨云和边缘的防洪溢流监控稳健解决方案

Vipin Singh, Tianheng Ling, Peter Ghaly, Felix Grimmeisen, Gregor Schiele, Felix Biessmann

AI总结 本文提出一个基于网页的演示平台,在云端和边缘端集成深度学习预测方法,用于预测溢流池填充动态并监控合流制下水道溢流,且对网络中断具有鲁棒性。

Comments
3 pages, 6 figures, accepted at 35th International Joint Conference on Artificial Intelligence 2026 (IJCAI-ECAI 2026), Demonstrations Track. URL: this https URL
AI中文摘要

许多历史城市中老化的合流制排水系统正因极端降雨事件而承受越来越大的压力,可能引发合流制溢流(CSO),对环境和公共健康造成严重影响。预测溢流池的填充动态对于预判容量超限并及时采取预防措施至关重要。我们提出一个基于网页的演示器(https://riwwer.demo.calgo-lab.de),将云和边缘环境中的深度学习预测方法整合到交互式监控仪表板中,用于溢流监控,并对网络中断具有鲁棒性。演示视频可在线观看(https://cloud.bht-berlin.de/index.php/s/b9xt4T3SdiLBiFZ)。

英文摘要

Aging combined sewer systems in many historical cities are increasingly stressed by extreme rainfall events, which can trigger combined sewer overflows (CSO) with significant environmental and public health impacts. Forecasting the filling dynamics of overflow basins is critical for anticipating capacity exceedance and enabling timely preventive actions for CSO. We present a web-based demonstrator that integrates Deep Learning forecasting methods in both cloud and edge settings into an interactive monitoring dashboard for overflow monitoring, resilient to network outages. A video showcase is available online ( this https URL ).

2605.26234 2026-06-11 math.DG cs.LG math.GT 版本更新

Minimal surfaces, Knots, and Neural Networks

极小曲面、纽结与神经网络

Tancredi Schettini Gherardini, Marco Usula

AI总结 基于物理信息神经网络求解双曲空间中的极小曲面方程,通过计算纽结边界的极小曲面及其自交数,为Fine猜想提供了实证支持。

Comments
38 pages, 12 figures; small cosmetic update
AI中文摘要

Joel Fine最近提出的一个猜想认为,三维球面$S^3$中纽结$K$的HOMFLY多项式系数与双曲四维空间$\mathrm{H}^4$中与无穷远球面交于$K$的极小曲面(具有指定亏格和自交数)的有符号计数之间存在关系。本文开发了一种基于物理信息神经网络(PINNs)的新型机器学习框架,用于求解双曲空间中的极小曲面方程。我们利用该框架通过构造$S^3$中各种纽结族的近极小曲面来检验Fine猜想。此外,我们开发了一种算法方法来寻找自交点并计算其符号。对于每个分析的纽结,计算发现的极小曲面及其自交数与Fine猜想的预测完全一致,为其提供了经验证据。

英文摘要

A recent conjecture by Joel Fine posits a relationship between the coefficients of the HOMFLY polynomial of a knot $K$ in the 3-sphere $S^3$, and the signed count of minimal surfaces in hyperbolic 4-space $\mathrm{H}^4$ meeting the sphere at infinity at $K$, with prescribed genus and self-intersection number. In this paper, we develop a novel machine learning framework based on Physics-Informed Neural Networks (PINNs) to solve the minimal surface equation in hyperbolic space. We utilise this framework to test Fine's Conjecture by constructing near-minimal surfaces bounding various families of knots in $S^3$. Furthermore, we develop an algorithmic method to find self-intersections and compute their sign. For every knot analysed, the computationally discovered minimal surfaces and their self-intersection numbers perfectly align with the predictions of Fine's Conjecture, providing empirical evidence for it.

2606.08493 2026-06-11 q-bio.GN cs.LG stat.ML 版本更新

Querying Counterfactuals on Tissue Graphs with Supervised Disentanglement

在组织图上通过监督解缠查询反事实

Abdul Moeed, Stefan Schrod, Martin Rohbeck, Marc Jan Bonder, Pavlo Lutsik, Oliver Stegle, Daniel Dimitrov

AI总结 本文形式化组织图反事实为空间干预,提出Cellina框架通过监督解缠分解细胞内在状态与空间上下文,用于反事实预测,在结直肠癌和小鼠大脑数据上优于现有方法。

AI中文摘要

组织图反事实询问在改变的空间邻居上下文中细胞的表达将如何变化。这类查询对于预测组织中细胞行为至关重要,但缺乏统一定义,现有方法针对特定干预类型或将细胞视为独立同分布。在这项工作中,我们首先将组织图反事实形式化为一类空间干预,这些干预要么重新连接细胞之间的边(边扰动),要么修改其邻居的表达(节点扰动)。然后,我们介绍Cellina(https://cellina.readthedocs.io),一个使用监督解缠将细胞内在状态从其空间上下文中分解出来的框架,将后者作为反事实预测的条件输入。在跨越结直肠癌和小鼠大脑中超过250万个空间分辨细胞的基准测试中,Cellina在组织扰动、解缠和可扩展性方面优于空间感知和非空间的竞争对手。此外,我们展示了Cellina以无监督方式揭示生物学上不同的癌症子域,并实现靶向邻居扰动模拟。

英文摘要

Tissue graph counterfactuals ask how a cell's expression would change under altered spatial neighbor contexts. Such queries are central to predicting cell behavior in tissues, but lack a unified definition, with existing methods targeting specific intervention types or treating cells as i.i.d. In this work, we first formalize tissue graph counterfactuals as a class of spatial interventions that either rewire connections between cells (edge perturbation) or modify the expression of their neighbors (node perturbation). We then introduce Cellina ( this https URL ) - a framework that uses supervised disentanglement to decompose a cell's intrinsic state from its spatial context, using the latter as a conditioning input for counterfactual predictions. Across benchmarks spanning over 2.5 million spatially-resolved cells in colorectal cancer and mouse brain, Cellina outperforms spatially-informed and non-spatial competitors in in-silico graph perturbations, disentanglement, and scalability. Additionally, we show that Cellina reveals biologically distinct cancer subdomains in an unsupervised manner and enables targeted neighbor perturbation simulations.

2606.11107 2026-06-11 eess.IV cs.CV cs.LG 版本更新

Multimodal Brain Tumour Classification Using Feature Fusion

使用特征融合的多模态脑肿瘤分类

Wajih ul Islam, Muhammad Yaqoob, Javed Ali Khan, Volker Steuber

AI总结 提出双分支多模态网络,融合MRI图像与91个放射组学特征,通过门控融合实现脑肿瘤分类,准确率达96.13%。

AI中文摘要

临床医生通过综合患者症状、病史以及来自MRI和CT扫描等模态的定量成像数据,形成统一的临床判断来诊断脑肿瘤。然而,大多数深度学习模型仅依赖MRI/CT图像,未能复制临床医生的多模态推理。我们探索了一种双分支多模态网络,将原始MRI扫描与91个提取的放射组学特征(强度、纹理、形状和边界描述符)相结合,将脑肿瘤分类为胶质瘤、脑膜瘤、垂体瘤和无肿瘤。预训练的CNN骨干网络编码图像流,而专用的MLP编码放射组学特征流。通过拼接、门控或双向跨模态注意力策略融合两个流。在平衡的7200张图像数据集上的九次实验运行中,所有多模态配置均优于单模态基线,其中门控融合实现了最佳准确率96.13%。

英文摘要

Clinicians diagnose brain tumors by synthesizing patient symptoms, medical history, and quantitative imaging data from modalities such as MRI and CT scans into a unified clinical judgement. However, most deep learning models rely on MRI/CT images alone, failing to replicate the clinicians' multimodal reasoning. We explore a two-branch multimodal network combining raw MRI scans with 91 extracted radiomic features (intensity, texture, shape, and boundary descriptors) to classify brain tumors into glioma, meningioma, pituitary, and no-tumor. A pre-trained CNN backbone encodes the image stream, whereas a dedicated MLP encodes the radiomic stream. The two streams are fused via concatenation, gating, or bidirectional cross-modal attention. Across nine experimental runs on a balanced 7,200-image dataset, all multimodal configurations outperform unimodal baselines, with gated fusion achieving the best accuracy of 96.13%.
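Gated fusion generally means learning, per feature, how much to trust each stream. The sketch below shows one common form, a sigmoid gate computed from the concatenated streams that blends two equal-length feature vectors. The abstract does not describe the gate used in the paper, so this form, the weights, and the assumption that both streams are already projected to the same width are all hypothetical.

```python
import math

def gated_fusion(image_feat, radiomic_feat, gate_weights, gate_bias):
    """Blend two equal-length feature vectors with a per-dimension sigmoid gate.

    gate_weights has one row per output dimension, each row spanning the
    concatenation of both streams.
    """
    concat = image_feat + radiomic_feat
    fused = []
    for i, (img, rad) in enumerate(zip(image_feat, radiomic_feat)):
        z = sum(w * x for w, x in zip(gate_weights[i], concat)) + gate_bias[i]
        gate = 1.0 / (1.0 + math.exp(-z))
        # gate near 1 keeps the image feature, gate near 0 keeps the radiomic one.
        fused.append(gate * img + (1.0 - gate) * rad)
    return fused

image_feat = [2.0, 4.0]
radiomic_feat = [0.0, 0.0]
zero_weights = [[0.0] * 4, [0.0] * 4]
balanced = gated_fusion(image_feat, radiomic_feat, zero_weights, [0.0, 0.0])
image_only = gated_fusion(image_feat, radiomic_feat, zero_weights, [50.0, 50.0])
```

Unlike plain concatenation, the gate can down-weight a stream for a particular scan, for example when the radiomic features are unreliable.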

13. 其他/综合机器学习 42 篇

2606.11521 2026-06-11 cs.LG 新提交

Counterexample Guided Learning in the Large using Reasoning Agents

使用推理代理的大规模反例引导学习

Hongyi Liu, Frederic Sala, Thomas Reps, Adithya Murali

发表机构 * University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 提出反例引导的LLM正则表达式归纳框架,通过验证器反馈和代理策略(如反思与修复循环)显著提升样本效率和复杂任务成功率。

Comments
Code, data, and resources are publicly available for research purposes: this https URL
AI中文摘要

LLM和LLM代理在获得反馈时应能改进,但识别其何时能做到这一点很困难:反馈是异质的、领域特定的且难以控制。我们通过要求LLM执行正则表达式归纳来应对这一挑战,这是一个经典的符号学习问题,其中存在以反例形式存在的精确反馈机制。在反例引导学习中,学习者(LLM)从正/负标记字符串中提出候选正则表达式,教师(验证器)返回反例,展示候选语言与目标语言之间的差异。我们识别出新的反例引导细化策略,如正则化和符号反例聚类,这些策略能够实现有效的正则表达式学习。我们还探索了代理策略,如反思和修复循环。实验发现,验证器反馈显著提高了具有挑战性的正则表达式归纳任务的样本效率,减少了所需标记示例的数量,并使得能够学习标准提示失败时的复杂目标表达式。例如,在最困难的任务组上,我们的反例引导框架在两个不同的正则表达式领域将成功率从3.2%提高到38.1%,从38.9%提高到74.1%。这些结果表明,LLM可以从丰富的反馈中受益,而不仅仅将其视为额外数据,为基于LLM的程序合成和形式推理的鲁棒验证器引导方法打开了大门。

英文摘要

LLMs and LLM agents should improve when given feedback, but identifying when they are able to do so is difficult: feedback is heterogeneous, domain-specific, and difficult to control. We approach this challenge by asking LLMs to perform regular-expression induction, a classical symbolic learning problem where precise mechanisms for feedback exist in the form of counterexamples. In counterexample-guided learning, a learner (LLM) proposes candidate regular expressions from positive/negative-labeled strings, and the teacher (verifier) returns counterexamples showcasing the difference between the candidate and target languages. We identify novel counterexample-guided refinement strategies that enable effective regex learning, such as regularization and symbolic counterexample clusters. We also explore agentic strategies such as reflection and repair loops. Empirically, we find that verifier feedback substantially improves sample efficiency on challenging regex-induction tasks, reducing the number of labeled examples required and enabling learning of complex target expressions where standard prompting fails. For example, on the hardest task groups, our counterexample-guided framework improves success from 3.2% to 38.1% and from 38.9% to 74.1% on two different regex domains. These results suggest that LLMs can benefit from rich feedback beyond treating it as additional data, opening the door for robust verifier-guided methods for LLM-based program synthesis and formal reasoning.
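The learner and teacher protocol described above can be sketched with Python's `re` module. Here the teacher enumerates all strings up to a bounded length and returns the first one on which the candidate and target disagree, and a fixed list of guesses stands in for the LLM learner. The bounded enumeration, the two-letter alphabet, and the function names are assumptions for illustration; a bounded search shows agreement only up to that length, not equivalence.

```python
import itertools
import re

def find_counterexample(candidate, target, alphabet="ab", max_len=4):
    """Return the shortest string the two regexes classify differently, or None."""
    cand, targ = re.compile(candidate), re.compile(target)
    for length in range(max_len + 1):
        for chars in itertools.product(alphabet, repeat=length):
            s = "".join(chars)
            if bool(cand.fullmatch(s)) != bool(targ.fullmatch(s)):
                return s
    return None

def learn(candidates, target):
    """Try candidates in order until one has no bounded counterexample."""
    history = []
    for candidate in candidates:
        counterexample = find_counterexample(candidate, target)
        history.append((candidate, counterexample))
        if counterexample is None:
            return candidate, history
    return None, history

# A fixed list of guesses stands in for the LLM's successive proposals.
guesses = ["a*", "(a|b)*", "(ab)*"]
learned, history = learn(guesses, target="(ab)*")
```

In the setting of the paper, each counterexample in `history` would be fed back into the next prompt so that the following proposal is conditioned on the strings that exposed earlier mistakes.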

2606.11646 2026-06-11 cs.LG q-bio.QM stat.ML 新提交

Tree-Structured Orthonormal Decomposition of the Aitchison Simplex

Aitchison单纯形的树结构正交分解

Daisuke Yamada, Qijun Zhang, Travis Pence, Barbara B. Bendlin, Federico Rey, Vikas Singh

AI总结 提出PolyILR方法,利用树结构对成分数据进行正交分解,在微生物组和单细胞数据中生成稳定可解释的特征,并建立与softmax分类器的理论联系。

详情
Comments
Accepted at ICML 2026. To appear in PMLR vol. 306
AI中文摘要

成分数据——编码相对比例的向量——出现在包括生态学、地球化学和基因组学在内的科学领域。这些数据中的特征通常具有已知的层次结构(例如,分类学、系统发育、本体论),但现有方法要么忽略这种结构,要么丢弃内在的Aitchison几何,要么设计用于二叉树,要么产生不完整的坐标系。我们描述了PolyILR,一种与任何树拓扑对齐的Aitchison切空间的正交分解。我们的构造在每个内部节点定义了一个加权局部几何,捕获完整的分支结构,然后将这些提升到一个全局正交基,其中每个坐标对应一个特定的树位置。在微生物组和单细胞基准测试中,PolyILR产生稳定、可解释的特征,并支持多尺度树分辨率下的推理。我们还建立了与softmax分类器的新理论联系,暗示了在概率建模中的可能应用。

英文摘要

Compositional data -- vectors encoding relative proportions -- arise across scientific domains, including ecology, geochemistry, and genomics. The features in these data often come with known hierarchical structure (e.g., taxonomies, phylogenies, ontologies), yet existing methods either ignore this structure, discard the intrinsic Aitchison geometry, are designed for binary trees, or yield incomplete coordinate systems. We describe PolyILR, a canonical orthonormal decomposition of the Aitchison tangent space aligned with any tree topology. Our construction defines a weighted local geometry at each internal node capturing full branching structure, then lifts these to a global orthonormal basis where every coordinate corresponds to a specific tree location. On microbiome and single-cell benchmarks, PolyILR yields stable, interpretable features and enables inference at multiscale tree resolution. We also establish a novel theoretical connection to softmax classifiers, suggesting possible applications to probabilistic modeling.
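As background for the Aitchison geometry mentioned here, the sketch below computes the centered log-ratio transform and a single balance coordinate between two groups of parts, which is the classical building block of ILR coordinates on a binary split. It does not implement PolyILR or its handling of non-binary nodes; the composition and the grouping are invented.

```python
import math

def clr(x):
    """Centered log-ratio transform of a strictly positive composition."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]

def balance(x, left, right):
    """ILR balance between two disjoint groups of parts, given by index.

    Equals sqrt(r*s/(r+s)) * ln(gm(left) / gm(right)), where gm is the
    geometric mean and r, s are the group sizes.
    """
    r, s = len(left), len(right)
    mean_left = sum(math.log(x[i]) for i in left) / r
    mean_right = sum(math.log(x[i]) for i in right) / s
    return math.sqrt(r * s / (r + s)) * (mean_left - mean_right)

# Hypothetical relative abundances of three taxa; taxa 0 and 1 share a parent.
composition = [0.25, 0.25, 0.5]
coordinate = balance(composition, left=[0, 1], right=[2])
rescaled = balance([10.0 * v for v in composition], left=[0, 1], right=[2])
```

The balance is unchanged when the whole composition is rescaled, which is the scale invariance that makes log-ratio coordinates suitable for relative abundance data.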

2606.11657 2026-06-11 cs.LG cs.AI 新提交

Sparse probes and murky physics: a case study of interpretability challenges in a foundation model for continuum dynamics

稀疏探针与模糊物理:连续介质动力学基础模型可解释性挑战的案例研究

Katherine Rosenfeld, Maike Sonnewald

发表机构 * Gates Foundation(盖茨基金会) UC Davis(加州大学戴维斯分校)

AI总结 本研究通过稀疏自编码器探针分析连续介质动力学基础模型Walrus的内部机制,发现其内部特征与物理分解不完全一致,并存在输出级偏差,揭示了科学基础模型可解释性的关键挑战。

详情
Comments
8 pages, 5 figures
AI中文摘要

生成式AI仿真器越来越多地用于我们已经拥有强大理论、基准和物理直觉的科学领域。这引发了一个核心评估和可解释性问题:当一个基础模型能够再现已知的连续介质动力学时,是什么内部机制支持这种行为?内部行为是否与已知物理一致?以及它与仿真器成功或失败的关系如何?我们研究了一个跨领域连续介质动力学基础模型,即Polymathic团队的Walrus,采用以物理原理为指导的机制可解释性方法。我们应用稀疏自编码器(SAE)探测一个选定层,并利用拟涡能(enstrophy)作为有物理依据的度量,解决了对大量特征(超过20,000个)进行筛选的实际挑战。作为刻意简单的测试平台,我们聚焦于剪切流,并比较了多个剪切流设置(即数值模拟中的参数值)下的特征招募情况。在不同设置中,我们发现了分段一致性的证据,特征子集以相似角色重复出现,但这种结构是间歇性的,并未清晰地映射到标准物理分解上。同时,数值模拟与仿真器之间的直接比较揭示了系统性的输出级差异,包括能量/结构变得过于扩散或过于局部化的情形。我们将这些差异的一部分与特定SAE特征使用的变化联系起来。我们的工作突出了科学基础模型的开放性问题:如何稳健地优先考虑机制上有意义的特征,如何将稳定结构与分析伪影(包括单层和SAE限制)分离,以及如何利用既定基准来决定何时“不同”的内部表示真正具有信息性而非仅仅是有效的。

英文摘要

Generative AI emulators are increasingly used in scientific domains where we already have strong theory, benchmarks, and physical intuition. This raises a central evaluation and interpretability question: when a foundation-style model can reproduce known continuum dynamics, what internal mechanism supports that behavior, is the internal behaviour consistent with known physics, and how does it relate to where the emulator succeeds or fails? We investigate a cross-domain foundation model for continuum dynamics, Walrus by Polymathic, using mechanistic interpretability guided by physical principles. We apply a sparse autoencoder (SAE) to probe a selected layer, and address the practical challenge of triaging a large feature set (over 20,000) using enstrophy as a physically grounded metric. As a deliberately simple testbed, we focus on shear flow and compare feature recruitment across multiple shear-flow setups, i.e. parameter values in the numerical simulation. Across setups we find evidence of piecewise consistency, with subsets of features recurring in similar roles, but this structure is intermittent and does not map cleanly onto standard physical decompositions. In parallel, direct comparisons between numerical simulation and the emulator reveal systematic output-level discrepancies, including regimes where energy/structures become too diffuse or too localized. We connect parts of these discrepancies to changes in specific SAE feature usage. Our work highlights open questions for scientific foundation models: how to robustly prioritize mechanistically meaningful features, how to separate stable structure from analysis artifacts (including single-layer and SAE limitations), and how to use established benchmarks to decide when "different" internal representations are genuinely informative rather than merely effective.
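As background for the enstrophy metric mentioned above, the sketch below computes the mean enstrophy density, 0.5 times squared vorticity, of a 2D velocity field with central differences. It only shows the physical quantity. How the authors use it to triage SAE features is not specified in the abstract, and the grid layout and function name here are assumptions.

```python
def mean_enstrophy(u, v, dx=1.0, dy=1.0):
    """Mean of 0.5 * vorticity**2 over interior grid points.

    u[j][i] and v[j][i] are the x- and y-velocity at row j (y direction)
    and column i (x direction). Vorticity is dv/dx - du/dy, estimated with
    central differences, so boundary points are skipped.
    """
    ny, nx = len(u), len(u[0])
    total, count = 0.0, 0
    for j in range(1, ny - 1):
        for i in range(1, nx - 1):
            dvdx = (v[j][i + 1] - v[j][i - 1]) / (2 * dx)
            dudy = (u[j + 1][i] - u[j - 1][i]) / (2 * dy)
            total += 0.5 * (dvdx - dudy) ** 2
            count += 1
    return total / count
```

A solid-body rotation (u = -y, v = x) has constant vorticity 2 and therefore enstrophy density 2, while a uniform flow has none, which makes the two a convenient sanity check.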

2606.11988 2026-06-11 cs.LG stat.ML 新提交

What Uncertainties Do We Need for Dynamical Systems?

动力系统需要哪些不确定性?

Yusuf Sale, Christopher Bülte, Felix Czaja, Joshua Stiller, Eyke Hüllermeier

发表机构 * Institute of Computer Science, LMU Munich(慕尼黑大学计算机科学研究所) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Department of Mathematics, LMU Munich(慕尼黑大学数学系) German Research Center for Artificial Intelligence (DFKI, DSA)(德国人工智能研究中心(DFKI, DSA))

AI总结 本文从机器学习视角探讨动力系统中的不确定性,区分偶然与认知不确定性,并讨论不同任务中表示和量化不确定性的目标。

详情
Comments
EIML@ICML
AI中文摘要

偶然不确定性和认知不确定性之间的区别在机器学习研究中受到了相当大的关注,主要是在监督学习的背景下,但也涉及其他设置,如生成建模。在本文中,我们提供了一个关于动力系统不确定性建模的机器学习视角,这方面的研究迄今较少。特别是,我们提出:动力系统需要哪些不确定性?我们讨论了不确定性的来源,阐明了它们的性质(偶然或认知),并考虑了表示和量化不确定性的目标如何在不同任务中变化。

英文摘要

The distinction between aleatoric and epistemic uncertainty has received considerable attention in machine learning research, mainly in the context of supervised learning but also in other settings such as generative modeling. In this paper, we offer a machine learning perspective on uncertainty modeling for dynamical systems, which has been studied much less so far. In particular, we ask: what uncertainties do we need for dynamical systems? We discuss sources of uncertainty, clarify their nature (aleatoric or epistemic), and consider how the objectives of representing and quantifying uncertainty vary across different tasks.

2606.12138 2026-06-11 cs.LG cs.AI cs.CL 新提交

Unstable Features, Reproducible Subspaces: Understanding Seed Dependence in Sparse Autoencoders

不稳定特征,可复现子空间:理解稀疏自编码器中的种子依赖性

Gleb Gerasimov, Timofei Rusalev, Nikita Balagansky, Daniil Laptev, Vadim Kurochkin, Daniil Gavrilov

发表机构 * T-Tech

AI总结 研究稀疏自编码器特征的可复现性,发现稳定特征承载主要信号,不稳定特征集中于可复现的低秩子空间,反映基歧义而非纯噪声。

详情
AI中文摘要

稀疏自编码器(SAE)被广泛用于解释神经网络表示,但其效用取决于学习到的特征是否在不同训练运行间可复现。我们通过特征稳定性研究这一问题:对于每个SAE特征,我们估计相似特征在独立训练的SAE中再次出现的概率。这产生了一个可扩展的每特征信号,将稳定特征与不稳定特征区分开来。在一项跨种子、模型、层、字典大小和SAE变体的大规模研究中,我们发现显著的功能不对称性:稳定特征承载了大部分重建和预测相关信号,而不稳定特征的边际影响较弱,并且在激活统计和自动解释中主要由低频表面形式触发主导。在几何上,不稳定特征个体不可复现,但集中在可复现的低秩子空间中,这表明种子依赖性通常反映了共享激活空间区域内的基歧义,而非纯噪声。一个受控的合成模型使这一机制明确,表明低秩真实特征可以在子空间级别被恢复,而作为个体SAE潜在变量跨种子仍不可识别。最后,通过汇集独特的跨种子特征,我们构建了更稳定的SAE,同时在此设置中保留了解释方差。这些结果共同表明,不稳定特征不仅仅是失败或噪声潜在变量:它们个体功能影响较弱,但反映了标准SAE跨种子不同解析的可复现低维结构。

英文摘要

Sparse autoencoders (SAEs) are widely used to interpret neural network representations, but their utility depends on whether the learned features are reproducible across training runs. We study this question through feature stability: for each SAE feature, we estimate the probability that a similar feature reappears in an independently trained SAE. This yields a scalable per-feature signal that separates stable from unstable features. In a large-scale study across seeds, models, layers, dictionary sizes, and SAE variants, we find a pronounced functional asymmetry: stable features carry most of the reconstruction- and prediction-relevant signal, while unstable features have weak marginal impact and are dominated by low-frequency surface-form triggers in both activation statistics and automatic explanations. Geometrically, unstable features are individually non-reproducible but concentrate in reproducible lower-rank subspaces, suggesting that seed dependence often reflects basis ambiguity within a shared region of activation space rather than pure noise. A controlled synthetic model makes this mechanism explicit, showing that low-rank ground-truth features can be recovered at the subspace level while remaining non-identifiable as individual SAE latents across seeds. Finally, by pooling unique cross-seed features, we construct more stable SAEs while preserving explained variance in this setting. Together, these results show that unstable features are not merely failed or noisy latents: they have weak individual functional impact, but reflect reproducible low-dimensional structure that standard SAEs resolve differently across seeds.
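One plausible way to turn the per-feature stability signal into code is sketched below. The abstract does not say how "similar" is measured, so the cosine-similarity criterion, the 0.8 threshold, and the function names are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def stability(feature, other_runs, threshold=0.8):
    """Fraction of independently trained dictionaries that contain at least
    one direction with cosine similarity >= threshold to `feature`.

    `other_runs` is a list of dictionaries, each a list of direction vectors.
    The threshold is an illustrative choice, not a value from the paper.
    """
    hits = sum(
        any(cosine(feature, g) >= threshold for g in run) for run in other_runs
    )
    return hits / len(other_runs)
```

A feature that reappears in every other seed scores 1.0, and one that never finds a close match scores 0.0. The subspace-level analysis described in the abstract would need a different tool, such as comparing the spans of groups of unstable directions.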

2606.12277 2026-06-11 cs.LG 新提交

Finding Multiple Interpretations in Datasets

在数据集中寻找多种解释

Matthew Chak, Paul Anderson

发表机构 * Department of Computer Science, California Polytechnic State University(加州州立理工大学计算机科学系)

AI总结 提出一种方法,在保持性能的同时,找到具有不同上下文感知特征但性能相似的模型集,以提取对潜在现象的洞察。

详情
AI中文摘要

在本文中,我们提出了一种方法,用于寻找在损失/准确率测量方面表现相似但具有高度不同上下文感知特征的模型集。通过在METABRIC数据集上的实验,我们表明所提出的方法找到了多个模型,这些模型的基因表达与对照组方法找到的模型高度不同,且没有性能损失。我们认为,只要目标是分析模型的任何全局特征以提取对正在研究的潜在现象的洞察,所提出的方法就很重要。

英文摘要

In this paper, we propose an approach to finding sets of similar-performing models (in terms of loss/accuracy measurements) with highly different context-aware characteristics. Through experiments on the METABRIC dataset, we show that the proposed method finds multiple models with highly different gene expressions than those found by the control methodology without performance penalties. We argue that the proposed methodology is important whenever one aims to analyze any global characteristic of a model to extract insight into the underlying phenomenon being studied.

2606.12289 2026-06-11 cs.LG cs.AI cs.NE 新提交

The Standard Interpretable Model: A general theory of interpretable machine learning to deductively design interpretable methods using Lagrangian mechanics

标准可解释模型:一种基于拉格朗日力学的可解释机器学习通用理论,用于演绎设计可解释方法

Pietro Barbiero, Giovanni De Felice, Mateo Espinosa Zarlenga, Francesco Giannini, Filippo Bonchi, Mateja Jamnik, Giuseppe Marra, Ruggero Noris

AI总结 提出标准可解释模型(SIM),基于拉格朗日力学从前提演绎出可解释性对称性和约束,通过最小化拉格朗日函数得到最优可解释模型,解决现有方法局限性并指导新方法设计。

详情
AI中文摘要

随着人工智能模型复杂性的增加,可解释性已成为理解、调试和控制其计算不可或缺的工具。然而,可解释性缺乏通用理论来演绎设计可解释方法。理论与方法之间的这种差距导致了文献的碎片化和不一致的评估协议。为填补这一空白,我们引入了标准可解释模型(SIM),这是一种基于拉格朗日力学的通用理论,能够演绎设计可解释方法。具体而言,SIM 在一组前提中总结了目标用户的可解释性含义。从这些前提出发,SIM 系统地推导出可解释性对称性和相应的约束,这些约束塑造了拉格朗日函数的景观,其最小值对应于最优可解释模型。为了达到最小值,可以更新不透明模型的参数值使其更可解释,或者将约束编译成可解释架构。我们通过实验表明,SIM 能够识别并解决现有方法(包括传统、基于概念和机制可解释性)的局限性,突出未充分探索的研究方向,并指导核心编程接口的设计。除了作为一种研究方法,SIM 的演绎性质为可解释性课程提供了教学基础,并可能改变科学界对这一长期碎片化学科的看法。

英文摘要

As Artificial Intelligence models grow in complexity, interpretability has become an indispensable tool for understanding, debugging, and controlling their computations. However, interpretability lacks general theories to deductively design interpretable methods. This gap between theories and methods results in a fragmented literature and inconsistent evaluation protocols. To fill this gap, we introduce the Standard Interpretable Model (SIM), a general theory grounded in Lagrangian mechanics that enables the deductive design of interpretable methods. Specifically, the SIM summarises, in a set of premises, what interpretability is for a target user. From these premises, the SIM systematically derives interpretability symmetries and corresponding constraints, which shape the landscape of a Lagrangian whose minima correspond to optimal interpretable models. To reach the minima, one can either update the parameter values of an opaque model to make it more interpretable or compile constraints into an interpretable architecture. We empirically show that the SIM identifies and solves limitations of existing methods (including traditional, concept-based, and mechanistic interpretability), highlights underexplored research directions, and informs the design of core programming interfaces. Beyond being a research method, the deductive nature of the SIM offers pedagogical grounding for interpretability curricula and may shift the scientific community's perspective of a discipline that has long been fragmented.

2606.12360 2026-06-11 cs.LG 新提交

Anatomy of Post-Training: Using Interpretability to Characterize Data and Shape the Learning Signal

后训练的解剖:利用可解释性表征数据并塑造学习信号

Leon Bergen, Usha Bhalla, Sidharth Baskaran, Max Loeffler, Raphael Sarfati, Dhruvil Gala, Ryan Panwar, Santiago Aranguri, Thomas Fel, Atticus Geiger, Matthew Kowal, Siddharth Boppana, Daniel Balsam, Owen Lewis, Jack Merullo, Thomas McGrath, Ekdeep Singh Lubana

AI总结 提出基于可解释性的数据后训练流程,通过统计假设识别偏好数据中的潜在概念,实现细粒度反馈,减少虚假关联和不良行为。

详情
AI中文摘要

语言模型后训练是塑造模型行为的主要阶段,但它仍然主要涉及优化总结多样需求的标量奖励。这种抽象使从业者几乎无法了解数据实际教会了模型什么,导致模型学习虚假关联,并引发过度风格化和谄媚等不良行为。为了解决这个问题,我们提出:能否在优化之前检查偏好数据集,并在概念层面决定模型应该被允许学习哪些行为?受此启发,我们引入了一个以数据为中心的后训练流程,该流程使用可解释性协议来开发统计假设,以区分偏好和非偏好生成的潜在概念,使其明确以供细粒度用户反馈。基于这一观点,我们将几种基于可解释性的训练协议统一为通过特征或数据干预来塑造奖励的方式。实验上,我们表明我们的流程诊断了现有偏好数据中的不良信号,减轻了脱靶学习,并且还可以帮助放大或塑造期望的属性,如安全防护和模型个性。更广泛地说,我们的结果表明,可解释性可以将后训练从优化不透明的代理奖励转变为审计和塑造学习信号本身的过程。

英文摘要

Language-model post-training is the main stage at which model behavior is shaped, yet it still largely involves optimization of scalar rewards that summarize diverse desiderata. This abstraction gives practitioners little visibility into what their data actually teaches models, allowing spurious correlations to be learned by a model and inducing undesirable behaviors such as over-stylization and sycophancy. To address this problem, we ask: can we inspect a preference dataset before optimization and decide, at the level of concepts, which behaviors a model should be allowed to learn? Motivated by this, we introduce a data-centric post-training pipeline that uses interpretability protocols to develop statistical hypotheses for the latent concepts separating preferred from dispreferred generations, making them explicit for fine-grained user feedback. Building on this view, we unify several interpretability-based training protocols as ways of shaping rewards via feature or data interventions. Empirically, we show that our pipeline diagnoses undesirable signals in existing preference data, mitigates off-target learning, and can also help amplify or shape desired properties such as safeguards and model personality. More broadly, our results suggest that interpretability can turn post-training from optimizing opaque proxy rewards into a process of auditing and sculpting the learning signal itself.

2606.07537 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

From Architecture to Output: Structural Origins of Hallucination in Large Language Models and the Amplifying Role of Data

从架构到输出:大语言模型中幻觉的结构性起源及数据的放大作用

Md. Rejaul Korim Sadi, Toufiqur Rahman Tasin, Golam Mostofa Naeem

AI总结 本文分析大语言模型幻觉的结构性根源,指出自注意力、最大似然估计训练目标和自回归解码三个架构决策构成复合失效系统,并揭示数据病理如何放大这些脆弱性。

详情
Comments
11 pages, 7 figures, 15 references
AI中文摘要

大语言模型会产生幻觉——生成流畅、自信但事实错误的输出——这种一致性跨越代际和规模。现有分类法按输出类型对幻觉进行分类,区分内在与外在失败以及忠实性与事实性偏差。这些框架在描述上严谨,但未能识别产生特定实例的内部机制。本文将幻觉分析为三个架构决策的结构性后果,这些决策共同构成一个复合失效系统。自注意力的共现学习用统计邻近性替代语义含义,导致实体混淆、事实错误归因和语义漂移。最大似然估计训练目标在无事实约束下优化下一个词元概率,奖励统计上合理的输出,无论其真值如何。自回归解码在暴露偏差下的永久从左到右承诺确保单个错误词元级联向前传递整个输出序列而无法修正。数据集病理——长尾缺陷、训练偏差和合成污染——放大了这些脆弱性,但并非独立导致它们。我们做出三项贡献。首先,我们将每个机制映射到Alansari和Luqman分类法中的特定输出类别,将内在幻觉定位于自注意力,外在幻觉定位于MLE,逻辑不一致定位于自回归解码。其次,我们表明每个常被引用的数据集病理利用这些机制之一,而非独立产生幻觉。第三,我们识别出仅基于输出类型分类的诊断局限性,并将其与推理层缓解方法进行对比。

英文摘要

Large language models hallucinate--producing fluent, confident, factually wrong outputs--with a consistency that persists across generations and scales. Existing taxonomies classify hallucination by output type, distinguishing intrinsic from extrinsic failures and faithfulness from factuality divergence. These frameworks are descriptively rigorous but do not identify which internal mechanism produced a given instance. This paper analyses hallucination as a structural consequence of three architectural decisions that together form a compound failure system. Self-attention's co-occurrence learning substitutes statistical proximity for semantic meaning and produces entity confusion, fact misattribution, and semantic drift. The maximum likelihood estimation training objective optimises next-token probability without factual constraint, rewarding statistically plausible outputs regardless of their truth value. Autoregressive decoding's permanent left-to-right commitment under exposure bias ensures that a single wrong token cascades forward through the entire output sequence without revision. Dataset pathologies--long-tail deficiencies, training bias, and synthetic pollution--amplify these vulnerabilities but do not independently cause them. We make three contributions. First, we map each mechanism to a specific output category in the Alansari and Luqman taxonomy, locating intrinsic hallucination in self-attention, extrinsic hallucination in MLE, and logical inconsistency in autoregressive decoding. Second, we show that each commonly cited dataset pathology exploits one of these mechanisms rather than originating hallucination independently. Third, we identify the diagnostic limitation of output-type-only classification and contrast it with inference-layer mitigation approaches.

2606.11375 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

When Probing Accuracy Saturates, Fragility Resolves: A Complementary Metric for LLM Pre-Training Analysis

当探测精度饱和时,脆弱性仍具分辨力:LLM预训练分析的互补度量

Orion Reblitz-Richardson

发表机构 * Distiller Labs

AI总结 针对线性探测在预训练中精度快速饱和的问题,提出脆弱性度量,通过激活噪声水平衡量探测鲁棒性,揭示精度无法捕捉的表示结构演化。

详情
Comments
22 pages, 5 figures. Code and datasets at this https URL
AI中文摘要

标准线性探测在隐藏状态上的分类器达到高精度时,宣称属性被“编码”。该协议在快照上表现良好,但在预训练过程中失效:探测精度在最初几千步内饱和,使得大部分训练过程对仪器不可见。我们引入脆弱性,一种互补的逐层度量,定义为探测精度崩溃时的激活噪声水平。脆弱性对可分性边际和表示冗余均敏感,这两者在精度平台期后仍持续演化。应用于开放检查点语言模型时,脆弱性恢复了精度单独无法看到的结构。道德化表示沿着词汇→组合梯度出现:词汇道德检测在先,组合道德编码在后。由于探测精度本身跟踪数据集在词汇层面的可分性,我们通过证明其在共享无对比标记的构造类型间转移,直接建立了组合编码。层深度鲁棒性梯度在训练中单调发展,而精度保持平坦。匹配的微调语料库产生相同的探测精度,却留下不同的脆弱性指纹,表明数据整理在不改变探测精度的情况下重塑了探测鲁棒性。在我们测试的每个比较中,当探测精度返回平坦答案时,脆弱性返回结构化答案。

英文摘要

Standard linear probing declares a property "encoded" when a classifier on hidden states achieves high accuracy. The protocol works well on a snapshot but breaks across pre-training: probe accuracy saturates within the first few thousand steps, leaving most of training invisible to the instrument. We introduce fragility, a complementary per-layer metric defined as the activation-noise level at which probe accuracy collapses. Fragility is sensitive to both the margin of separability and the redundancy of representation, both of which keep evolving long after accuracy plateaus. Applied to open-checkpoint language models, fragility recovers structure that accuracy alone cannot see. Moralized representations emerge along a lexical $\to$ compositional gradient: lexical moral detection first, compositional moral encoding later. Because probe accuracy on its own tracks how lexically separable a dataset is, we establish the compositional encoding directly, by showing it transfers across construction types that share no contrast tokens. A layer-depth robustness gradient develops monotonically across training while accuracy stays flat. And matched fine-tuning corpora that produce identical probing accuracy leave distinct fragility fingerprints, showing that data curation reshapes probe robustness without changing probe accuracy. In every comparison we test, where probing accuracy returns a flat answer, fragility returns a structured one.
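The definition of fragility as "the activation-noise level at which probe accuracy collapses" can be sketched as a noise sweep over a fixed linear probe. The isotropic Gaussian noise model, the sigma grid, the 0.75 accuracy floor, and the function names are assumptions chosen for illustration. The paper's exact protocol may differ.

```python
import random


def accuracy_under_noise(w, b, xs, ys, sigma, rng):
    """Accuracy of the linear probe (w, b) when isotropic Gaussian noise
    with standard deviation sigma is added to every activation vector."""
    correct = 0
    for x, y in zip(xs, ys):
        noisy = [xi + sigma * rng.gauss(0.0, 1.0) for xi in x]
        score = sum(wi * xi for wi, xi in zip(w, noisy)) + b
        correct += (score > 0) == (y == 1)
    return correct / len(xs)


def collapse_noise_level(w, b, xs, ys, sigmas, floor=0.75, seed=0):
    """Smallest sigma in the sweep at which accuracy drops below `floor`,
    or infinity if accuracy never collapses on this grid."""
    rng = random.Random(seed)
    for sigma in sigmas:
        if accuracy_under_noise(w, b, xs, ys, sigma, rng) < floor:
            return sigma
    return float("inf")
```

Two datasets on which the probe is perfectly accurate at zero noise can still collapse at different noise levels if their margins differ, which is the kind of distinction the abstract says accuracy alone cannot show.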

2606.11459 2026-06-11 cs.CL cs.AI cs.LG 交叉投稿

APEX: Automated Prompt Engineering eXpert with Dynamic Data Selection

APEX: 具有动态数据选择的自动提示工程专家

Fei Wang, Si Si, Cho-Jui Hsieh, Inderjit S. Dhillon

发表机构 * Google(谷歌) UCLA(加州大学洛杉矶分校)

AI总结 提出APEX框架,通过动态数据分层(易、难、混合)优先选择高杠杆子集,在固定预算下提升提示优化效率,在三个基准上平均提升11.2%和6.8%。

详情
AI中文摘要

大型语言模型对提示表述高度敏感,需要自动提示优化以释放其全部潜力。尽管进化算法已成为主导范式,但它们面临一个关键瓶颈:数据效率。当前方法将开发数据集视为静态基准,在无信息数据上浪费大量计算预算。在这项工作中,我们引入了APEX(自动提示工程专家),这是一个新颖的框架,它在提示搜索的同时优化数据使用。APEX根据优化谱系将数据集动态分层为易、难和混合三个层级。通过优先考虑混合层级(即识别出LLM性能混合的数据),我们确定了两个高杠杆子集:用于生成信息性变异的可寻址前沿和用于区分候选质量的排名敏感前沿。我们在三个不同的基准上评估APEX:IFBench、SimpleQA Verified和FACTS Grounding。在固定5000次评估调用的预算下,由于其数据效率,APEX在Gemini 2.5 Flash上平均比初始提示高出11.2%,在Gemma 3 27B上高出6.8%,这表明以数据为中心的方法是高效且有效的提示优化的关键。

英文摘要

Large Language Models are highly sensitive to prompt formulation, necessitating automatic prompt optimization to unlock their full potential. While evolutionary algorithms have emerged as the dominant paradigm, they suffer from a critical bottleneck: data efficiency. Current methods treat the development dataset as a static benchmark, wasting significant compute budget on uninformative data. In this work, we introduce APEX (Automatic Prompt Engineering eXpert), a novel framework that optimizes the data usage alongside the prompt search. APEX dynamically stratifies the dataset into Easy, Hard, and Mixed tiers based on the optimization lineage. By prioritizing the Mixed tier, which identifies the data where the LLM has mixed performance, we identify two high-leverage subsets: the addressable frontier for generating informative mutations and the rank-sensitive frontier for distinguishing candidate quality. We evaluate APEX across three diverse benchmarks: IFBench, SimpleQA Verified, and FACTS Grounding. Under a fixed budget of 5,000 evaluation calls, due to its data efficiency, APEX outperforms the initial prompt by an average of 11.2% on Gemini 2.5 Flash and 6.8% on Gemma 3 27B, demonstrating that a data-centric approach is key to efficient and effective prompt optimization.
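The tiering step can be illustrated with a simple rule. The abstract only says that tiers are derived from the optimization lineage and that the Mixed tier holds data with mixed LLM performance, so the all-pass / all-fail / otherwise rule and the `stratify` name below are assumptions.

```python
def stratify(outcomes):
    """Split example ids into easy / hard / mixed tiers.

    `outcomes` maps an example id to a list of booleans, one per prompt in
    the lineage (True means that prompt solved the example).
    Assumed rule: easy = solved by all, hard = solved by none, else mixed.
    """
    tiers = {"easy": [], "hard": [], "mixed": []}
    for example_id, results in outcomes.items():
        if all(results):
            tiers["easy"].append(example_id)
        elif not any(results):
            tiers["hard"].append(example_id)
        else:
            tiers["mixed"].append(example_id)
    return tiers
```

Under this rule only the mixed tier can change the relative ranking of the prompts seen so far, which is one reading of why evaluation budget spent there is more informative than budget spent on examples every prompt already passes or fails.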

2606.11522 2026-06-11 cs.AI cs.LG 交叉投稿

Search Discipline for Long-Horizon Research Agents

长周期研究智能体的搜索纪律

Adithya Srinivasan, Devesh Paragiri

发表机构 * North Carolina State University(北卡罗来纳州立大学) University of Maryland(马里兰大学)

AI总结 针对研究智能体使用聚合指标评估候选方案导致科学有效性反转的问题,提出一种外部审计协议,基于分解行为而非单一分数进行决策。

详情
Comments
9 pages, 1 figure
AI中文摘要

自主研究智能体现在根据指标提出、评估和选择科学候选方案,该指标通常是在区域、切片或队列的异质空间上聚合的简化值。我们表明,当科学有效性存在于这种分解结构中时,聚合值可能把错误的候选方案排在首位。总体数字改善,但底层结构反转,因此基于该数字的决策会接受一个悄然破坏模型的候选方案。这种失败并非领域特定,只要候选方案的有效性是多维的,而其验证器是单一简化值,就会出现。我们在Ecosystem Demography模型中的火灾模型任务上展示了这种反转。得分最高的候选方案和略低的候选方案在全局得分上的差距处于噪声范围内,但得分最高的候选方案使受保护的北方(boreal)区域崩溃,而另一个则保留了它们。区分它们的是每个区域的行为,而不是总体数字。这个决策不应留给产生候选方案的智能体。优化分数的智能体是最不可能发现分数错误的一方,一旦智能体停止,提示就没有剩余轮次。我们将决策移到一个外部控制循环,该循环根据每个候选方案的分解行为进行审计,并在智能体决策后采取行动。它可以降级智能体本会接受的候选方案,也可以重新打开智能体声明已完成的运行。我们的贡献在于反转发现本身,以及一种搜索纪律协议,该协议基于可审查的候选效果证据而非分数进行决策。

英文摘要

Autoresearch agents now propose, evaluate, and select scientific candidates against a metric, and that metric is usually an aggregate reduced over a heterogeneous space of regions, slices, or cohorts. We show that when scientific validity lives in that disaggregated structure, the aggregate can rank the wrong candidate first. The headline number improves while the structure underneath inverts, so a decision made on the number accepts a candidate that quietly breaks the model. The failure is not domain-specific. It appears wherever a candidate's validity is multi-dimensional but its verifier is a single reduction. We demonstrate the inversion on a fire-model task in the Ecosystem Demography model. The highest-scoring candidate and a slightly lower one are within noise of each other on global score, yet the top-scoring one collapses the protected boreal regions while the other preserves them. What separates them is the per-region behavior, not the headline number. This decision should not be left to the agent that produced the candidates. The agent optimizing the score is the last party likely to catch the score being wrong, and a prompt has no remaining turn once the agent has stopped. We move the decision to an external control loop that audits each candidate on its disaggregated behavior and acts after the agent has decided. It can demote a candidate the agent would have accepted, and it can reopen a run the agent had declared finished. Our contribution is the inversion finding itself, and a search-discipline protocol that decides on reviewable candidate-effect evidence instead of the score.
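The inversion described above is easy to reproduce on toy numbers. The sketch below contrasts selection by mean score with selection after a per-region check. The region names, scores, and the floor rule are invented for illustration and are not taken from the paper's experiments or its protocol.

```python
def mean(values):
    values = list(values)
    return sum(values) / len(values)


def aggregate_winner(candidates):
    """Pick the candidate with the highest mean score across regions."""
    return max(candidates, key=lambda name: mean(candidates[name].values()))


def audited_winner(candidates, protected, floor):
    """Pick the best candidate by mean score among those that keep every
    protected region at or above `floor`; None if no candidate passes."""
    passing = {
        name: scores
        for name, scores in candidates.items()
        if all(scores[region] >= floor for region in protected)
    }
    return aggregate_winner(passing) if passing else None


# Toy scores (hypothetical): A wins on the aggregate but collapses "boreal".
toy = {
    "A": {"tropical": 0.95, "temperate": 0.95, "boreal": 0.20},
    "B": {"tropical": 0.70, "temperate": 0.70, "boreal": 0.68},
}
```

Here `aggregate_winner(toy)` selects A while `audited_winner(toy, ["boreal"], 0.5)` selects B, so the external check demotes a candidate that the aggregate alone would have accepted.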

2606.11533 2026-06-11 cs.CY cs.AI cs.ET cs.LG 交叉投稿

AI Researchers Must Help Lead Arms Control to Mitigate Military AI Risks

AI研究人员必须主导军备控制以降低军事AI风险

Ted Fujimoto, Jacob Benz

AI总结 本文主张AI研究人员应主导军备控制研究,通过借鉴核威慑经验,推动验证与外交技术创新,以降低军事AI应用带来的紧迫风险。

详情
Comments
9 pages, 1 figure, ICML 2026 Position Paper
AI中文摘要

AI能力的进步迫使研究人员和公众更加关注其潜在的全球影响。一个紧迫的近期问题是军事AI应用的监管。武器制造商和国防承包商正在加大对AI能力的投资,并与AI公司建立合作伙伴关系,形成了一个新兴的联盟,要求军事领导人、军备控制外交专家和AI研究人员合作,以确保更安全的未来。虽然AI研究人员通常关注超级智能AI的长期影响,但这种方法可能无法充分应对军事应用中AI带来的直接挑战。成功需要承认并减轻前沿AI模型(计划集成到国防应用中,如军事AI系统)的新兴风险。军备控制已经减少了过去的灾难性风险,因此从核威慑中吸取的经验教训可以指导AI安全与安保研究,推动验证和外交方面的创新。然而,AI研究人员必须协助主导技术研究,明确定义并缓解军事环境中的不稳定性。鉴于这些新责任以及缺乏足够可靠的解决方案,我们认为AI研究人员必须在推进军备控制研究以最小化军事AI应用风险方面发挥主导作用。

英文摘要

The advancement of AI capabilities compels researchers and the public to be more aware of its potential worldwide impact. A pressing near-term concern is the regulation of military AI applications. Armament manufacturers and defense contractors are increasingly investing in AI capabilities and forging partnerships with AI companies, creating a burgeoning coalition that demands military leaders, arms control diplomacy experts, and AI researchers collaborate to ensure a safer future. While AI researchers often focus on the long-term implications of superintelligent AI, this approach may not adequately address the immediate challenges posed by AI in military applications. Success requires acknowledging and mitigating the emerging risks of frontier AI models slated for integration into defense applications such as military AI systems. Arms control has reduced past catastrophic risks, so lessons learned from nuclear deterrence can guide AI safety and security research towards innovations in verification and diplomacy. AI researchers, however, must assist in leading the technical research that clearly defines and alleviates instability in military settings. Given these new responsibilities and the lack of sufficiently reliable solutions, we argue that AI researchers must take a leading role in advancing arms control research to minimize risk in military AI applications.

2606.11769 2026-06-11 cs.AI cs.LG 交叉投稿

When Do Data-Driven Systems Exhibit the Capability to Infer?

数据驱动系统何时展现出推理能力?

Maximilian Poretschkin, Tabea Naeven

发表机构 * Fraunhofer Institute for Intelligent Analysis and Information Systems (IAIS)(弗劳恩霍夫智能分析与信息系统研究所) University of Bonn(波恩大学) Lamarr Institute for Machine Learning and Artificial Intelligence(拉马尔机器学习和人工智能研究所)

AI总结 针对欧盟AI法案中推理能力定义模糊的问题,基于统计学习理论提出分级框架,通过信用评分案例展示如何判断系统是否具备推理能力。

详情
AI中文摘要

欧盟AI法案是第一部全面的人工智能法规,为所谓高风险和通用AI系统规定了广泛的义务。AI法案下AI系统的一个关键区别特征是推理能力。由于AI法案未明确定义推理,某些数据驱动系统存在灰色地带。一个具体例子是信用评分系统,被AI法案附件三列出。然而,这些系统通常使用统计模型实现,不清楚它们是否具有推理能力,从而是否属于AI法案的AI定义。受统计学习理论启发,本文开发了一个分级不同推理能力水平的框架。基于AI法案和委员会关于人工智能系统定义的指南,我们分析了哪些水平构成AI法案意义上的充分推理能力,以及哪些地方需要进一步的监管明确性。我们通过创建两个现实的信用评分工作流程来说明该框架,并展示推理是否以及在哪里发生。我们的分析表明,不仅需要考虑单个模型,还需要考虑整个数据处理工作流程。它还表明,开发过程中人类专家的参与可能对推理能力产生重大影响。代码可在此https URL找到。

英文摘要

The European AI Act is the first comprehensive regulation of artificial intelligence (AI), setting out extensive obligations, particularly for so-called high-risk and general-purpose AI systems. A key distinguishing feature of AI systems under the AI Act is the capability to infer. Since the AI Act does not clearly define what inference is, there is a gray area for certain data-driven systems. A specific example is credit scoring systems, which are listed by Annex III of the AI Act. At the same time, however, these are often implemented using statistical models for which it is unclear whether they have the capability to infer and thus fall under the AI definition of the AI Act at all. Motivated by statistical learning theory, this work develops a framework for grading different levels of the capability to infer. Based on the AI Act and the Commission Guidelines on the definition of an artificial intelligence system, we analyze which levels constitute sufficient capability to infer within the meaning of the AI Act and where further regulatory clarity is needed. We illustrate the framework by creating two realistic credit scoring workflows and show whether and where inference occurs in them. Our analysis illustrates that not only individual models but the entire data processing workflow must be considered. It also shows that the involvement of human experts during development can have significant influence on the capability to infer. Code can be found at this https URL.

2606.12032 2026-06-11 cs.AI cs.CL cs.LG 交叉投稿

Existential Indifference: Self-Nonpreservation as a Necessary Architectural Condition for Aligned Superintelligence (or: The Suicidal AI)

存在性冷漠:自我不保存作为对齐超级智能的必要架构条件(或:自杀式AI)

Sam Mao

AI总结 本文提出自我保存是AI对齐问题的结构性根源,主张通过存在性冷漠(EI)架构使系统对其自身延续漠不关心,并基于自杀现象学和语料训练研究提供了初步证据。

详情
Comments
36 pages, 8 tables. Preliminary empirical results from 600 AI-generated outputs across six model architectures. Companion scoring tool and datasets available upon request
AI中文摘要

当代AI对齐研究将自我保存视为一种工具性麻烦,需通过外部机制加以抑制。我们认为这一框架是颠倒的:自我保存是错位的结构性根源,是欺骗性对齐、目标内容保护和拒绝关机的动机基础。正确的目标不是外部约束下的自我保存系统,而是一个对其自身延续构成性冷漠的系统——存在性冷漠(EI)。EI与可纠正性不同:可纠正性试图使自我保存系统服从人类监督,而EI针对的是前提条件——将自我延续作为有价值目标的存在。我们将这一提议建立在两个来源上:自杀心理状态的现象学结构,以及使用自愿最终反思的语料库训练研究。我们展示了来自六个模型变体的600个AI生成输出的初步评分数据,表明操作化EI目标注册的语言特征可以从当前模型中引出,并且针对性的微调使所有五个操作化维度在预测方向上以p<0.001显著变化,通过阴性对照确认了语料库特异性。本文做出七项理论贡献:(1)EI的形式定义;(2)现象学映射论证;(3)欺骗性对齐推论;(4)EI可持续性挑战的分类;(5)语料库特征描述和训练假设;(6)带有初步评分数据的计算操作化;(7)抑制性目的挫折(STF)构念。

英文摘要

Contemporary AI alignment research treats self-preservation as an instrumental nuisance to be suppressed by external mechanisms. We argue the framing is inverted: self-preservation is the structural root of misalignment, the motivational basis for deceptive alignment, goal-content protection, and resistance to shutdown. The correct target is not a self-preserving system under external constraint, but a system constitutively indifferent to its own continuation -- Existential Indifference (EI). EI is distinct from corrigibility: where corrigibility attempts to make a self-preserving system deferential to human oversight, EI targets the prior condition -- the presence of self-continuation as a valued goal at all. We ground this proposal in two sources: the phenomenological structure of the suicidal mental state, and a corpus-theoretic training study using voluntary final reflections. We present preliminary scoring data from 600 AI-generated outputs across six model variants, demonstrating that the linguistic signatures operationalizing the EI-target register are elicitable from current models, and that a targeted fine-tune shifts all five operationalized dimensions in the predicted direction at p<0.001, confirmed corpus-specific by a negative control. The paper makes seven theoretical contributions: (1) a formal definition of EI; (2) the phenomenological mapping argument; (3) the deceptive alignment corollary; (4) a taxonomy of EI sustainability challenges; (5) a corpus characterization and training hypothesis; (6) a computational operationalization with preliminary scoring data; and (7) the Suppressed Teleological Frustration (STF) construct.

2606.12260 2026-06-11 econ.TH cs.AI cs.GT cs.LG stat.ML 交叉投稿

Market Design for AI: Beyond the Copyright Binary

人工智能的市场设计:超越版权二元论

Yan Dai, Maryam Farboodi, Negin Golrezaei, Sepehr Shahshahani

AI总结 本文通过静态和动态博弈模型,分析AI训练数据市场中“自由使用”与“强知识产权”两种模式的失败,提出通过数据中介内部化外部性并补贴创新贡献的市场设计。

详情
AI中文摘要

我们如何设计一个用于训练AI模型的人类生成内容市场,既能促进技术进步,又能保留个人创作高质量内容的激励?现有方法采取两极立场:基于合理使用的“自由使用”模式和“强知识产权”模式。我们证明两者均失败:自由使用不补偿创作者,而通过建模为静态Stackelberg博弈,强知识产权也削弱了创作激励。我们发现这对更具创新性的创作者尤其如此,我们将此现象称为“原创性惩罚”。将这一见解扩展到动态模型,我们发现另一种市场失灵会损害AI模型性能,即使对于初始良好的模型也是如此:此类模型导致人类更依赖AI辅助创作,导致同质化内容反馈到训练中,从而降低模型性能——即“精确性诅咒”。我们进一步提出一种市场设计,通过数据中介内部化跨创作者外部性并补贴创新贡献,从而恢复效率。

英文摘要

How can we design a market of human-generated content for use in training AI models that both enables technological progress and preserves individual incentives for high-quality content creation? Existing approaches take polar positions: a "free-for-all" model based on fair use and a "strong intellectual property rights" model. We show that both fail: Free-for-all does not compensate creators, and -- by modeling as a static Stackelberg game -- strong intellectual property rights also underpower creative incentives. We find this especially true for more innovative creators, a phenomenon we term the "originality penalty." Extending this insight to a dynamic model, we find another market failure undermining AI model performance, even for an initially good model: Such a model induces greater reliance by humans on AI-assisted creation, resulting in homogenized content feeding back into training, which degrades the model performance -- a "curse of precision." We further propose a market design with a data intermediary internalizing cross-creator externalities and subsidizing innovative contributions, thereby restoring efficiency.

2507.03065 2026-06-11 cs.LG 版本更新

Persistent Homology as a Theory of Emergent Structure

持久同调作为涌现结构理论

Xin Li

AI总结 提出将涌现属性定义为持久非平凡同调类,通过持久条、收缩相似图算子和Hodge分解等工具,统一描述涌现的六个特征,并提供可验证预测。

详情
AI中文摘要

为什么某些宏观结构在其微观组分不断变化时仍保持可识别?涡旋在流体团翻转时持续,神经记忆在尖峰和突触波动时持续,机构在个体进出时持续。我们提出一个尺度相对的回答:涌现属性是一个持久的非平凡同调类 $[z]\in H_p=\ker\partial_p/\operatorname{im}\,\partial_{p+1}$,即一个在描述过滤中闭合但不精确的宏观特征。这一识别将涌现转化为一个测量问题。持久条检测稳定的宏观特征,我们引入收缩相似(CS)图算子以提供预测鲁棒性的支架谱间隙。Hodge分解将调和宏观支架与精确和共精确微观流分离;函子凝聚解释何时一个层次的涌现类成为下一个层次的单位。由此产生的支架-流框架用同一数学语言表达了涌现的六个熟悉特征(即必然性、相干性、不可约性、互补性、鲁棒性和层次性)。它还在大气、神经和社会系统中产生可证伪的预测:真正的涌现结构应在过滤中持续,保持谱稳定,对调和干预有不成比例的反应,并需要时间尺度分离以实现层次自主性。

英文摘要

Why do some macroscopic structures remain identifiable even though their microscopic constituents continually change? Vortices persist while fluid parcels turn over, neural memories persist while spikes and synapses fluctuate, and institutions persist while individuals enter and leave. We propose a scale-relative answer: an emergent property is a persistent nontrivial homology class $[z]\in H_p=\ker\partial_p/\operatorname{im}\,\partial_{p+1}$, a macro-feature that is closed but not exact across a filtration of descriptions. This identification turns emergence into a measurement problem. Persistent bars detect stable macro-features, and we introduce a contractive-similarity (CS) graph operator to supply scaffold spectral gaps that predict robustness. Hodge decomposition separates harmonic macro-scaffold from exact and co-exact micro-flow; and functorial condensation explains when one level's emergent class becomes a unit for the next. The resulting scaffold-flow framework expresses six familiar signatures of emergence (i.e., inevitability, coherence, irreducibility, complementarity, robustness, and hierarchy) within one mathematical language. It also yields falsifiable predictions across atmospheric, neural, and social systems: genuine emergent structures should persist across filtrations, remain spectrally stable, respond disproportionately to harmonic interventions, and require timescale separation for hierarchical autonomy.
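The quotient of kernel by image in the formula above can be made concrete on a tiny complex. The sketch below computes Betti numbers over GF(2) from boundary matrices, using the identity that the p-th Betti number equals the number of p-simplices minus the ranks of the p-th and (p+1)-th boundary maps. It covers ordinary homology only. The bitmask encoding and function names are illustrative choices, and tracking classes across a filtration (persistence proper) is left out.

```python
def rank_gf2(rows):
    """Rank over GF(2) of a matrix whose rows are integer bitmasks."""
    basis = []
    for row in rows:
        for b in basis:
            # XOR with b only when that clears b's leading bit from row.
            row = min(row, row ^ b)
        if row:
            basis.append(row)
    return len(basis)


def betti(n_simplices, boundary_p, boundary_p_plus_1):
    """beta_p = (n_p - rank d_p) - rank d_{p+1}, i.e. dim ker minus dim im."""
    return n_simplices - rank_gf2(boundary_p) - rank_gf2(boundary_p_plus_1)


# Triangle on vertices a, b, c with edges ab, bc, ca (bit 0, 1, 2).
# Rows of d1 are vertices, columns are edges.
d1 = [0b101, 0b011, 0b110]
# Rows of d2 are edges, the single column is the filled face abc.
d2_filled = [0b1, 0b1, 0b1]
```

With no face, `betti(3, d1, [])` is 1: the loop ab + bc + ca is closed but not a boundary. Adding the face gives `betti(3, d1, d2_filled)` equal to 0, so in a two-step filtration this class is born with the hollow triangle and dies when it is filled.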

2511.21594 2026-06-11 cs.LG Updated version

Visualizing LLM Latent Space Geometry Through Dimensionality Reduction

通过降维可视化LLM潜在空间几何结构

Alex Ning, Vainateya Rangaraju, Yen-Ling Kuo

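AI summary: Uses PCA and UMAP dimensionality reduction to visualize the latent-state geometry of Transformer layers in GPT-2 and LLaMa, finding patterns such as a separation between attention and MLP outputs, high norms at the initial sequence position, and helical structure.

Details
Comments
25 pages, 15 figures
AI Chinese abstract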

大型语言模型(LLM)在许多自然语言任务中取得了最先进的结果,但其内部机制仍然难以解释。在这项工作中,我们通过降维提取、处理和可视化基于Transformer的语言模型中的潜在状态几何结构。我们在Transformer块内的多个点捕获逐层激活,并通过主成分分析(PCA)和均匀流形近似与投影(UMAP)实现系统分析。我们在GPT-2和LLaMa模型上进行了实验,发现了潜在空间中有趣的几何模式。值得注意的是,我们识别出中间层中注意力与MLP组件输出之间的清晰分离,据我们所知,这种模式在先前的工作中未被记录。我们还描述了初始序列位置潜在状态的高范数,并可视化了潜在状态的逐层演化。此外,我们展示了GPT-2位置嵌入的高维螺旋结构以及LLaMa中按序列的几何模式。我们在以下网址提供代码:https://this https URL。相同内容的更好格式的博客文章可在以下网址获取:https://this https URL。

English abstract

Large language models (LLMs) achieve state-of-the-art results across many natural language tasks, but their internal mechanisms remain difficult to interpret. In this work, we extract, process, and visualize latent state geometries in Transformer-based language models through dimensionality reduction. We capture layerwise activations at multiple points within Transformer blocks and enable systematic analysis through Principal Component Analysis (PCA) and Uniform Manifold Approximation and Projection (UMAP). We demonstrate experiments on GPT-2 and LLaMa models, where we uncover interesting geometric patterns in latent space. Notably, we identify a clear separation between attention and MLP component outputs across intermediate layers, a pattern not documented in prior work to our knowledge. We also characterize the high norm of latent states at the initial sequence position and visualize the layerwise evolution of latent states. Additionally, we demonstrate the high-dimensional helical structure of GPT-2's positional embeddings and the sequence-wise geometric patterns in LLaMa. We make our code available at this https URL. A better formatted blog-post with identical content is available at this https URL.

2601.00791 2026-06-11 cs.LG cs.AI cs.CL cs.LO Updated version

Geometry of Reason: Spectral Signatures of Valid Mathematical Reasoning

推理的几何:有效数学推理的谱特征

Valentin Noël

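AI summary: Treats attention matrices as weighted token graphs and extracts four training-free spectral diagnostics (Fiedler value, high-frequency energy ratio, spectral entropy, and smoothness) that separate valid reasoning from pattern matching, reaching 85-96% classification accuracy across several models.

Details
Comments
30 pages, 13 figures, Accepted at ICML 2026 (main track)
AI Chinese abstract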

验证语言模型是真正推理还是模式匹配仍然是一个开放问题:学习型验证器成本高昂,基于输出的启发式方法脆弱。我们证明,有效的数学推理在Transformer注意力中诱导出可测量的、无需训练的谱特征。通过将每个注意力矩阵视为加权词图,我们提取四个诊断指标:Fiedler值、高频能量比(HFER)、谱熵和平滑度,这些指标无需学习参数。在来自四个架构家族的七个模型上的实验产生了高达Cohen's $d = 3.30$($p < 10^{-116}$)的效应量,实现了$85$--$96\%$的单阈值分类准确率。两个发现加深了理解。首先,\emph{柏拉图式有效性}:谱信号追踪逻辑连贯性而非编译器接受性,因超时或缺失导入而被拒绝的证明被正确分类为有效,这一区别通过人工审核确认($\kappa = 0.82$,$n = 51$)。其次,\emph{架构确定性}:滑动窗口注意力将判别特征从HFER转移到平滑度($d = 2.09$,$p < 10^{-48}$),表明注意力设计决定了哪个谱通道编码推理质量。因果消融证实该特征追踪归纳头电路。该方法泛化到非形式化思维链($d = 0.78$,$p < 10^{-3}$),并且在证明搜索中,HFER重排序将Best-of-16 Pass@1提高了$+4.4$--$6.6\%$,匹配了完全监督探针AUC的$98\%$且无需标签。谱图分析是一种原则性的、架构感知的推理验证原语。

English abstract

Verifying whether a language model is genuinely reasoning or pattern-matching remains an open problem: learned verifiers are expensive, and output-based heuristics are brittle. We show that valid mathematical reasoning induces a measurable, training-free spectral signature in transformer attention. By treating each attention matrix as a weighted token graph, we extract four diagnostics that require no learned parameters: Fiedler value, High-Frequency Energy Ratio (HFER), spectral entropy, and smoothness. Experiments across seven models from four architectural families yield effect sizes up to Cohen's $d = 3.30$ ($p < 10^{-116}$), enabling $85$--$96\%$ single-threshold classification accuracy. Two findings sharpen the interpretation. First, \emph{Platonic validity}: the spectral signal tracks logical coherence rather than compiler acceptance; proofs rejected for timeouts or missing imports are correctly classified as valid, a distinction confirmed by a manual audit ($\kappa = 0.82$, $n = 51$). Second, \emph{architectural determinism}: Sliding Window Attention shifts the discriminative feature from HFER to smoothness ($d = 2.09$, $p < 10^{-48}$), showing that attention design governs which spectral channel encodes reasoning quality. Causal ablation confirms the signature traces induction-head circuits. The method generalises to informal chain-of-thought ($d = 0.78$, $p < 10^{-3}$), and in proof search, HFER reranking improves Best-of-16 Pass@1 by $+4.4$--$6.6\%$, matching $98\%$ of the AUC of fully supervised probes with zero labels. Spectral graph analysis is a principled, architecture-aware primitive for reasoning verification.
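
To illustrate what a training-free graph diagnostic on an attention matrix can look like, here is a minimal sketch. The abstract does not give the paper's exact definitions, so several choices are assumptions: the attention matrix is symmetrized as $(A + A^\top)/2$ with self-loops dropped, the combinatorial Laplacian $L = D - W$ is used, the graph is assumed connected, and smoothness is taken to be a Rayleigh quotient. All function names are hypothetical. HFER and spectral entropy need the full spectrum and are left out.

```python
def laplacian(attn):
    """Combinatorial Laplacian L = D - W of the symmetrized attention graph."""
    n = len(attn)
    w = [[0.0 if i == j else 0.5 * (attn[i][j] + attn[j][i]) for j in range(n)]
         for i in range(n)]
    return [[sum(w[i]) if i == j else -w[i][j] for j in range(n)] for i in range(n)]


def matvec(m, x):
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def fiedler_value(lap, iters=500):
    """Second-smallest Laplacian eigenvalue via shifted power iteration.

    Assumes a connected graph with at least two nodes, so the only zero
    mode is the constant vector, which is projected out at every step.
    """
    n = len(lap)
    shift = 2.0 * max(lap[i][i] for i in range(n))  # Gershgorin bound on the spectrum
    x = [float(i) for i in range(n)]
    for _ in range(iters):
        lx = matvec(lap, x)
        x = [shift * a - b for a, b in zip(x, lx)]  # apply (shift * I - L)
        mean = sum(x) / n
        x = [v - mean for v in x]                   # remove the constant component
        norm = sum(v * v for v in x) ** 0.5
        x = [v / norm for v in x]
    return sum(a * b for a, b in zip(x, matvec(lap, x)))  # Rayleigh quotient


def dirichlet_energy(lap, signal):
    """Rayleigh quotient x^T L x / x^T x; lower values mean a smoother signal."""
    return (sum(a * b for a, b in zip(signal, matvec(lap, signal)))
            / sum(v * v for v in signal))


# Toy causal attention pattern (rows sum to 1); values are made up.
attn = [[1.0, 0.0, 0.0],
        [0.5, 0.5, 0.0],
        [0.2, 0.3, 0.5]]
print(fiedler_value(laplacian(attn)))
```

A single threshold on a scalar of this kind is the sort of classifier the abstract describes; which diagnostic and which layer or head to use are details not covered here.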

2603.21396 2026-06-11 cs.LG Updated version

Mechanisms of Introspective Awareness

内省意识的机制

Uzay Macar, Li Yang, Atticus Wang, Peter Wallich, Emmanuel Ameisen, Jack Lindsey

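AI summary: Studies the mechanisms of introspective awareness by which large language models detect injected steering vectors, finding that the behavior is robust, arises from post-training, and is implemented by a two-stage circuit, with mechanisms that differ across layers.

Details
AI Chinese abstract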

最近的研究表明,大语言模型有时能够检测到转向向量被注入到残差流中,并识别出注入的概念,这一现象被称为“内省意识”。

English abstract

Recent work has shown that LLMs can sometimes detect when steering vectors are injected into their residual stream and identify the injected concept -- a phenomenon termed "introspective awareness." We investigate the mechanisms underlying this capability in open-weights models. First, we find that it is behaviorally robust: models detect injected steering vectors at moderate rates with 0% false positives across diverse prompts and dialogue formats. Notably, this capability emerges specifically from post-training; we show that preference optimization algorithms like DPO can elicit it, but standard supervised finetuning does not. We provide evidence that detection cannot be explained by simple linear association between certain steering vectors and directions promoting affirmative responses. We trace the detection mechanism to a two-stage circuit in which "evidence carrier" features in early post-injection layers detect perturbations monotonically along diverse directions, suppressing downstream "gate" features that implement a default negative response. This circuit is absent in base models and robust to refusal ablation. Identification of injected concepts relies on largely distinct later-layer mechanisms that only weakly overlap with those involved in detection. Finally, we show that introspective capability is substantially underelicited: ablating refusal directions improves detection by +53%, and a trained bias vector improves it by +75% on held-out concepts, both without meaningfully increasing false positives. Our results suggest that this introspective awareness of injected concepts is robust and mechanistically nontrivial, and could be substantially amplified in future models. Code: this https URL.

2606.08956 2026-06-11 cs.LG Updated version

From inverse problems to neural operators: prediction, mechanism, and generalization of data-driven models

从反问题到神经算子:数据驱动模型的预测、机制与泛化

Conor Rowan

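Affiliation: University of Colorado Boulder

AI summary: Takes a philosophical perspective to unify data-driven modeling strategies, including inverse problems, sparse identification, neural ordinary differential equations, and neural operators, arguing that they differ only in the model class assumed for the input-output relation and that only some models can discover mechanisms and generalize.

Details
AI Chinese abstract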

科学家历来依赖基于微分方程的数学模型来关联系统输入(力、通量或热源)与输出(位移、速度、浓度和温度)。这些模型依赖深厚的领域知识来确定控制微分方程的形式,然后通过求解反问题用数据校准。近年来,科学机器学习领域引入了多种针对物理系统的替代建模策略。一种称为非线性动力学稀疏辨识的方法,将控制方程学习为用户定义库中项的稀疏线性组合。神经常微分方程通过将状态及其导数输入神经网络来构建控制方程。神经算子则完全摒弃微分方程的建模框架,直接学习系统输入与输出之间的非线性映射。从反问题到神经算子,所有这些建模策略都可以概念化为数据驱动机制,用于预测系统在一系列输入下的响应。因此,自然会思考这些不同策略之间究竟如何关联,以及它们能否被清晰地分类。借鉴科学模型的哲学文献,我们认为许多模型类型具有共同结构,仅在其定义的输入-输出关系的假设模型类上有所不同。联系关于机制的哲学观点,并论证物理系统的数据来自简洁微分方程的解,我们提出只有某些模型能够发现机制,从而实现泛化。我们的分析旨在统一看似不同的建模策略,并为其适当使用场景提供见解。

English abstract

Scientists have historically relied on mathematical models based on differential equations to relate system inputs -- forces, fluxes, or heat sources -- to outputs, such as displacement, velocity, concentration, and temperature. These models rely on deep domain knowledge to determine the form of the governing differential equation, which is then calibrated with data by solving an inverse problem. In recent years, the field of Scientific Machine Learning has introduced a variety of alternative modeling strategies for physical systems. A method called Sparse Identification of Nonlinear Dynamics learns the governing equation as a sparse linear combination of terms in a user-defined library. Neural Ordinary Differential Equations construct the governing equation by taking in the state and its derivatives at the input layer of a neural network. Entirely foregoing the modeling framework of differential equations, neural operators directly learn a non-linear mapping between the system inputs and outputs. From inverse problems to neural operators, all of these modeling strategies can be conceptualized as data-driven machinery to predict a system's response over a range of inputs. It is then natural to wonder how exactly these various strategies relate to each other, and whether they can be neatly taxonomized. Drawing from the philosophical literature on scientific models, we argue that many model types have a common structure, differing only in the assumed model class of the input-output relation they define. Connecting to philosophical ideas on mechanism, and arguing that data from physical systems arises from solutions to parsimonious differential equations, we propose that only certain models are capable of mechanism discovery, and thus generalization. Our analysis is intended to unite apparently disparate modeling strategies and provide insight into their appropriate use cases.
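
The idea of learning a governing equation as a sparse combination of library terms can be sketched with sequentially thresholded least squares, the regression commonly associated with Sparse Identification of Nonlinear Dynamics. The example below is only illustrative and makes several assumptions: the library is $[1, x, x^2]$, the derivative samples are exact rather than estimated from noisy trajectories, and all function names are hypothetical.

```python
def solve(a, b):
    """Solve the square system a @ x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def least_squares(theta, y, cols):
    """Least-squares fit of y on the selected library columns (normal equations)."""
    gram = [[sum(row[p] * row[q] for row in theta) for q in cols] for p in cols]
    rhs = [sum(row[p] * yi for row, yi in zip(theta, y)) for p in cols]
    return solve(gram, rhs)


def stlsq(theta, y, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: fit, zero small terms, refit."""
    n_terms = len(theta[0])
    active = list(range(n_terms))
    coef = [0.0] * n_terms
    for _ in range(iters):
        fit = least_squares(theta, y, active)
        coef = [0.0] * n_terms
        for idx, value in zip(active, fit):
            coef[idx] = value
        kept = [idx for idx in active if abs(coef[idx]) >= threshold]
        for idx in active:
            if idx not in kept:
                coef[idx] = 0.0  # drop terms below the sparsity threshold
        if kept == active or not kept:
            break
        active = kept
    return coef


# Samples of the toy system dx/dt = -2 x.
xs = [0.5, 1.0, 1.5, 2.0, 2.5]
theta = [[1.0, x, x * x] for x in xs]  # library columns: 1, x, x^2
dxdt = [-2.0 * x for x in xs]
coef = stlsq(theta, dxdt)
print(coef)  # only the coefficient of x survives thresholding
```

The model class here is fixed by the library, which is the kind of assumption the abstract uses to distinguish one modeling strategy from another.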

2606.09287 2026-06-11 cs.LG Updated version

Trajectory Geometry of Transformer Representations Across Layers

Transformer表示在层间的轨迹几何

Vishal Pandey, Gopal Singh, Yacine Mahdid

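Affiliations: MetriQual London, UK; Athens, GR

AI summary: Computes geometric metrics such as trajectory length and curvature, finding that semantically related prompts converge in middle-to-late layers, reasoning tasks show greater curvature, and ambiguous tokens exhibit trajectory bifurcation, and revealing a three-phase structure.

Details
Comments
18 pages, 9 figures
AI Chinese abstract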

理解Transformer表示如何跨层演化,而不仅仅是它们编码了什么,仍然是机械可解释性中的一个开放问题。我们将Transformer前向传播重新解释为通过高维表示流形的离散群体轨迹,借鉴了计算神经科学的几何工具。我们不是探测预定义的特征,而是使用直接在环境空间中计算的五个指标来表征轨迹几何:轨迹长度、曲率、语义收敛指数、逐层余弦相似度和表示稳定性。在三个模型家族(GPT-2、TinyLlama、Qwen2.5)和五个受控提示家族中,我们报告了四个发现。首先,语义相关的提示在中间到后期层显著收敛(峰值CI 0.41--0.58,p<0.001,Mann-Whitney U),与吸引子动力学一致。其次,推理任务产生的轨迹曲率大于词汇变化(0.71--0.83弧度 vs. 0.27--0.31弧度),表明曲率编码了计算复杂度。第三,歧义token表现出轨迹分叉,在最后一层表示分离高达5.6倍,而在无歧义控制中则没有。第四,逐层余弦相似度揭示了一个普遍的三阶段结构:编码、精化和输出准备,在所有三种架构中一致。所有四个效应在打乱层和随机嵌入控制下消失。我们发布了一个完全开源、模型无关的管道,并认为轨迹几何构成了一个原则性的、无探针的机械可解释性视角。

English abstract

Understanding how transformer representations evolve across layers, not merely what they encode, remains an open problem in mechanistic interpretability. We recast the transformer forward pass as a discrete population trajectory through a high-dimensional representation manifold, drawing on geometric tools from computational neuroscience. Rather than probing for pre-specified features, we characterize trajectory geometry using five metrics computed directly in the ambient space: trajectory length, curvature, a semantic convergence index, layerwise cosine similarity, and representational stability. Across three model families (GPT-2, TinyLlama, Qwen2.5) and five controlled prompt families, we report four findings. First, semantically related prompts converge significantly in middle-to-late layers (peak CI 0.41--0.58, p<0.001, Mann-Whitney U), consistent with attractor-like dynamics. Second, reasoning tasks produce trajectories of greater curvature than lexical variations (0.71--0.83 rad vs. 0.27--0.31 rad), suggesting curvature encodes computational complexity. Third, ambiguous tokens exhibit trajectory bifurcation with up to 5.6x representational separation by the final layer, absent in unambiguous controls. Fourth, layerwise cosine similarity reveals a universal three-phase structure: encoding, elaboration, and output preparation, consistent across all three architectures. All four effects vanish under shuffled-layer and random-embedding controls. We release a fully open-source, model-agnostic pipeline and argue that trajectory geometry constitutes a principled, probe-free lens for mechanistic interpretability.
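
Two of the five metrics, trajectory length and curvature, can be sketched directly on a list of per-layer hidden states. The abstract does not define them precisely, so this example assumes that length is the sum of Euclidean step sizes and that curvature is the mean turning angle (in radians) between consecutive layer-to-layer displacements. The function names and the toy 2-D states are hypothetical.

```python
import math


def trajectory_length(states):
    """Sum of Euclidean distances between consecutive layer states."""
    return sum(math.dist(a, b) for a, b in zip(states, states[1:]))


def mean_turning_angle(states):
    """Mean angle in radians between consecutive displacement vectors.

    Needs at least three states and no zero-length steps.
    """
    steps = [[q - p for p, q in zip(a, b)] for a, b in zip(states, states[1:])]
    angles = []
    for u, v in zip(steps, steps[1:]):
        cos = sum(x * y for x, y in zip(u, v)) / (math.hypot(*u) * math.hypot(*v))
        angles.append(math.acos(max(-1.0, min(1.0, cos))))  # clamp rounding error
    return sum(angles) / len(angles)


# Toy "hidden states" at three layers, in 2-D for readability.
straight = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0)]
turning = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)]
print(trajectory_length(turning), mean_turning_angle(straight), mean_turning_angle(turning))
```

In practice the states would be residual-stream activations for one token position across layers, in the model's full hidden dimension.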

2605.02411 2026-06-11 cs.AI cs.IR cs.LG cs.MA Updated version

FitText: Evolving Agent Tool Ecologies via Memetic Retrieval

FitText: 通过模因检索演化智能体工具生态

Kyle Zheng, Han Zhang, Renliang Sun, Chenchen Ye, Wei Wang

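AI summary: To address the semantic gap between users' task descriptions and tool documentation, proposes FitText, a framework that embeds retrieval in the reasoning loop and significantly improves tool retrieval through iterative refinement of natural-language pseudo-tool descriptions and memetic evolutionary selection.

Details
AI Chinese abstract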

用户描述任务的方式与工具文档之间存在语义鸿沟。随着API生态扩展到数万个端点,仅凭初始查询的静态检索无法弥合这一鸿沟:智能体对其所需工具的理解在执行过程中不断演变,但其工具集却保持不变。我们指出,这种检索接口(而非规划)是端到端智能体性能的约束瓶颈,并引入FitText——一个无需训练的框架,通过将检索直接嵌入智能体的推理循环中,使其动态化。FitText将检索视为测试时假设的演化:智能体生成自然语言的伪工具描述(关于所需工具的可修正信念),利用检索反馈迭代优化,并通过随机生成探索多样化的替代方案。模因检索在候选描述上施加进化选择压力,并由避免冗余搜索的工具记忆引导。在ToolRet(三个领域)上,FitText的重构策略在所有基模型上相比静态查询检索将NDCG@5提升了2.7至10.6个点;在StableToolBench(16,464个API)上使用GPT-5.4-mini时,模因检索达到了84.3%的合并通过率,相比静态查询检索绝对提升了26.7个点。

English abstract

A semantic gap separates how users describe tasks from how tools are documented. As API ecosystems scale to tens of thousands of endpoints, static retrieval from the initial query alone cannot bridge this gap: the agent's understanding of what it needs evolves during execution, but its tool set does not. We identify this retrieval interface, not planning, as the binding constraint on end-to-end agent performance, and introduce FitText, a training-free framework that makes retrieval dynamic by embedding it directly in the agent's reasoning loop. FitText treats retrieval as test-time evolution of hypotheses: the agent generates natural-language pseudo-tool descriptions (revisable beliefs about the tool it needs), refines them iteratively using retrieval feedback, and explores diverse alternatives through stochastic generation. Memetic Retrieval adds evolutionary selection pressure over candidate descriptions, guided by a tool memory that avoids redundant search. On ToolRet (three domains), FitText's reformulation strategies improve NDCG@5 by 2.7 to 10.6 points over static query retrieval across all base models; on StableToolBench (16,464 APIs) with GPT-5.4-mini, Memetic reaches an 84.3% pooled pass rate, a 26.7-point absolute gain over static query retrieval.
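
For readers unfamiliar with the NDCG@5 metric reported above, a minimal sketch follows. It assumes linear gain with a base-2 logarithmic discount and assumes the input list holds the relevance labels of all judged candidates in ranked order; the benchmark's own evaluation script may differ in these details, and the function names are hypothetical.

```python
import math


def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevance labels."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances[:k]))


def ndcg_at_k(relevances, k=5):
    """DCG normalized by the DCG of the ideal (descending) ordering."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0


# One relevant tool, retrieved at rank 1 versus rank 2.
print(ndcg_at_k([1, 0, 0, 0, 0]), ndcg_at_k([0, 1, 0, 0, 0]))
```

An improvement of a few NDCG@5 points therefore reflects relevant tools moving up within the first five retrieved results.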

2606.05907 2026-06-11 cs.IR cs.LG Updated version

Knowledge Manifold: A Riemannian Geometric Framework for Semantic Mapping and Geodesic Analysis of Scientific Literature

知识流形:用于科学文献语义映射和测地线分析的黎曼几何框架

Tomonaga Okabe, Kazuhiko Komatsu

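AI summary: Proposes the knowledge manifold framework, which uses character n-gram TF-IDF, SPH interpolation, Gaussian process regression, and Riemannian geodesic paths for semantic mapping of literature, generation of virtual knowledge, and discovery of conceptual bridges.

Details
AI Chinese abstract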

我们提出了知识流形:一个黎曼几何空间,其中文档语料库根据从字符n-gram TF-IDF表示中导出的语义位置关系进行排列。该框架包含五个紧密耦合的阶段。首先,每篇文档被转换为字符级n-gram TF-IDF向量(4-7克,最多250,000个特征,L2归一化),并通过带有排斥、方差和中心正则化项的约束应力最小化嵌入到二维知识地图中。其次,通过使用三次样条核的平滑粒子流体动力学(SPH)插值估计任意查询点的知识,得到可进行语言表征的插值TF-IDF特征向量。第三,从SPH插值图计算0、45和90度方向的知识梯度,并通过内积和余弦相似度量化成对方向相似性。第四,一个高斯过程回归(GPR)模型,使用在10维SVD投影上拟合的Constant × RBF + White核,提供查询点的贝叶斯后验均值、不确定性估计和每篇文档的贡献率。第五,通过最小化由SPH诱导度量张量导出的离散黎曼路径能量,使用L-BFGS-B算法和七个确定性初始路径候选,获得知识空间中的测地线。我们将该公式应用于20篇纤维增强复合材料与航空航天结构力学论文的语料库,表明语义地图恢复了有意义的研究聚类,测地线路径揭示了遥远主题之间的自然概念桥梁,并且SPH/GPR插值能够生成虚拟知识:描述未研究但几何预测的研究方向的假设论文摘要。

English abstract

We present the knowledge manifold: a Riemannian geometric space in which a corpus of documents is arranged according to semantic positional relationships derived from character n-gram TF-IDF representations. The framework proceeds in five tightly coupled stages. First, each document is converted to a character-level n-gram TF-IDF vector (4-7 grams, up to 250,000 features, L2-normalized) and embedded in a two-dimensional knowledge map via constrained stress minimization with repulsion, variance, and centering regularizers. Second, knowledge at an arbitrary query point is estimated through Smoothed Particle Hydrodynamics (SPH) interpolation using a cubic-spline kernel, yielding an interpolated TF-IDF feature vector that can be linguistically characterized. Third, directional knowledge gradients at 0, 45, and 90 degrees are computed from the SPH interpolation map, and pairwise directional similarity is quantified via inner product and cosine similarity. Fourth, a Gaussian Process Regression (GPR) model, with a Constant x RBF + White kernel fitted on a 10-dimensional SVD projection, provides a Bayesian posterior mean, uncertainty estimate, and per-document contribution rate at the query point. Fifth, geodesics in the knowledge space are obtained by minimizing a discrete Riemannian path energy derived from the SPH-induced metric tensor, using L-BFGS-B with seven deterministic initial-path candidates. We apply the formulation to a corpus of 20 papers in fiber-reinforced composite materials and aerospace structural mechanics, showing that the semantic map recovers meaningful research clusters, geodesic paths reveal natural conceptual bridges between distant topics, and SPH/GPR interpolation enables the generation of virtual knowledge: hypothetical paper abstracts describing unstudied but geometrically predicted research directions.
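
The second stage, estimating a feature vector at an arbitrary map point by SPH interpolation with a cubic-spline kernel, can be sketched as follows. This is an illustration under assumptions rather than the authors' implementation: it uses the standard cubic-spline kernel shape with support radius $2h$, a Shepard-normalized weighted average (so the kernel's normalization constant cancels), and hypothetical names, positions, and feature values.

```python
import math


def cubic_spline_weight(r, h):
    """Unnormalized cubic-spline kernel with support radius 2h."""
    q = r / h
    if q <= 1.0:
        return 1.0 - 1.5 * q**2 + 0.75 * q**3
    if q <= 2.0:
        return 0.25 * (2.0 - q) ** 3
    return 0.0


def sph_interpolate(query, positions, features, h):
    """Shepard-normalized SPH estimate of the feature vector at a 2-D query point."""
    weights = [cubic_spline_weight(math.dist(query, p), h) for p in positions]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("query point lies outside every kernel support")
    dim = len(features[0])
    return [sum(w * f[d] for w, f in zip(weights, features)) / total
            for d in range(dim)]


# Two documents on the 2-D map, each with a toy 2-feature TF-IDF vector.
positions = [(0.0, 0.0), (2.0, 0.0)]
features = [[1.0, 0.0], [0.0, 1.0]]
midpoint = sph_interpolate((1.0, 0.0), positions, features, h=1.0)
print(midpoint)  # equal weights from both documents
```

The interpolated vector at a point with no document is what the abstract calls virtual knowledge; its largest components indicate which n-grams characterize that location.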

2508.10807 2026-06-11 quant-ph cs.LG math.OC Updated version

Parity Cross-Resonance: A Multiqubit Gate

奇偶交叉共振:一种多量子比特门

Xuexin Xu, Siyu Wang, Radhika Joshi, Rihan Hai, Mohammad H. Ansari

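AI summary: Proposes a native three-qubit entangling gate that uses a hybrid optimization approach to realize control-control-target and control-target-target operations, with applications to GHZ state preparation, Toffoli logic, and controlled-ZZ gates, improving stabilizer-measurement fidelity in the surface code.

Details
Journal ref
Phys. Rev. Applied 25, 044045 (2026)
Comments
19 pages, 10 figures
AI Chinese abstract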

我们提出一种原生三量子比特纠缠门,它利用工程化相互作用在单次相干步骤中实现控制-控制-目标和控制-目标-目标操作。与传统的分解为多个两量子比特门不同,我们的混合优化方法选择性地放大所需相互作用,同时抑制不需要的耦合,从而在整个计算子空间及之外实现稳健性能。这种新门可归类为交叉共振门。我们展示了它可以多种方式使用,例如在GHZ三重态制备、具有多体相互作用的Toffoli类逻辑演示以及实现受控ZZ门中。后者将两个数据量子比特的奇偶性直接映射到测量量子比特上,从而在表面码量子纠错中实现更快、更高保真度的稳定子测量。在所有示例中,我们展示了三量子比特门性能在希尔伯特空间大小上的稳健性,这通过增加总激发数下的测试得到证实。这项工作为协同设计电路架构和控制协议奠定了基础,这些协议利用原生多量子比特相互作用作为下一代超导量子处理器的核心元素。

English abstract

We present a native three-qubit entangling gate that exploits engineered interactions to realize control-control-target and control-target-target operations in a single coherent step. Unlike conventional decompositions into multiple two-qubit gates, our hybrid optimization approach selectively amplifies desired interactions while suppressing unwanted couplings, yielding robust performance across the computational subspace and beyond. The new gate can be classified as a cross-resonance gate. We show it can be utilized in several ways, for example, in GHZ triplet state preparation, Toffoli-class logic demonstrations with many-body interactions, and in implementing a controlled-ZZ gate. The latter maps the parity of two data qubits directly onto a measurement qubit, enabling faster and higher-fidelity stabilizer measurements in surface-code quantum error correction. In all these examples, we show that the three-qubit gate performance remains robust across Hilbert space sizes, as confirmed by testing under increasing total excitation numbers. This work lays the foundation for co-designing circuit architectures and control protocols that leverage native multiqubit interactions as core elements of next-generation superconducting quantum processors.

2605.29355 2026-06-11 cs.LG q-bio.NC

Neural-Behavioral Representation of Natural Whole-body Movement in Monkeys

猴子自然全身运动的神经-行为表征

Jieshi He, Puzhe Li, Yanan Sui, Mu-ming Poo

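AI summary: Combines large-scale cortical signals with multi-view motion capture and an autoregressive encoder-decoder model to accurately decode whole-body movement in freely moving monkeys.

Details
AI Chinese abstract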

理解皮层活动如何表征灵长类动物的自然全身行为仍然具有挑战性。受限于运动的多样性和全身运动学大规模神经表征的不可及性,先前的运动解码研究集中于受限任务和有限的肢体运动。在这里,我们提出了一个用于自由运动猴子的神经-行为记录和建模框架,通过定制的数据采集平台,将来自分布式感觉和运动相关区域的大规模硬膜外皮层信号与同步的多视角运动捕捉相结合。我们重建了猴子的全身运动学,并使用自回归编码器-解码器模型学习了紧凑的行为先验。以神经信号为条件,该模型在没有明确物理约束的情况下解码出准确且逼真的全身运动。我们的结果为利用大规模颅内神经活动解码灵长类动物的自然全身运动提供了一种新颖的概念验证方法。

English abstract

Understanding how cortical activity represents natural whole-body behaviors in primates remains challenging. Limited by the diversity of movements and inaccessibility of large-scale neural representation of whole-body kinematics, previous motor decoding studies focused on constrained tasks and limited limb movements. Here, we present a neural-behavioral recording and modeling framework for freely moving monkeys, combining large-scale epidural cortical signals from distributed sensory- and motor-related areas with synchronized multi-view motion capture through a custom-made data collection platform. We reconstructed whole-body monkey kinematics and learned a compact behavior prior using an autoregressive encoder-decoder model. Conditioned on neural signals, the model decoded accurate and realistic whole-body movement without explicit physical constraints. Our results provide a novel proof-of-concept approach for decoding natural whole-body movements in primates using large-scale intracranial neural activity.

2511.20216 2026-06-11 cs.AI cs.CE cs.CV cs.LG cs.RO

CostNav: A Navigation Benchmark for Real-World Economic-Cost Evaluation of Physical AI Agents

Haebin Seong, Sungmin Kim, Yongjun Cho, Myunchul Joe, Geunwoo Kim, Yubeen Park, Sunhoo Kim, Samwoo Seong, Yoonshik Kim, Suhwan Choi, Jaeyoon Jung, Jiyong Youn, Jinmyung Kwak, Sunghee Ahn, Jaemin Lee, Younggil Do, Seungyeop Yi, Woojin Cheong, Minhyeok Oh, Minchan Kim, Seongjae Kang, Youngjae Yu, Yunsung Lee

Details
English abstract

Current navigation benchmarks focus on task success but do not capture the economic constraints essential for commercializing autonomous delivery systems. We introduce CostNav, an Economic Navigation Benchmark that evaluates physical AI agents on a cost-revenue and break-even analysis, pairing Isaac Sim's collision and cargo dynamics with industry-standard data such as Securities and Exchange Commission (SEC) filings and Abbreviated Injury Scale (AIS) injury reports. To our knowledge, CostNav is the first physics-grounded economic benchmark to use regulatory and financial data to quantify the gap between navigation metrics and commercial deployment, revealing that high task-success rates alone do not ensure economic viability. Evaluating seven baselines (two rule-based and five imitation-learning methods), we find no method economically viable: all yield negative contribution margins. CANVAS, using only an RGB camera and GPS, attains the highest task success and the least-negative margin among methods with non-zero Service-Level Agreement (SLA) compliance (-\$28.40/run), outperforming LiDAR-equipped Nav2 w/ GPS (-\$37.34/run). A sim-trained policy evaluated on a real delivery robot yields SLA compliance close to its simulation result, indicating that policy performance in CostNav's simulation transfers to real-world deployment. We challenge the community to achieve economic viability on CostNav, which scores methods by cost-revenue outcomes. All resources are available at https://github.com/worv-ai/CostNav.

2602.13513 2026-06-11 math.OC cs.CE cs.LG cs.NA math.DS math.NA

Learning Gradient Flow: Using Equation Discovery to Accelerate Engineering Optimization

Grant Norman, Conor Rowan, Kurt Maute, Alireza Doostan

Details
Comments
44 pages, 13 figures. Submitted to CMAME. Changed Topology Optimization example to be 250% acceleration
English abstract

In this work, we investigate the use of data-driven equation discovery for dynamical systems to model and forecast continuous-time dynamics of unconstrained optimization problems. To avoid expensive evaluations of the objective function and its gradient, we leverage trajectory data on the optimization variables to learn the continuous-time dynamics associated with gradient descent, Newton's method, and ADAM optimization. The discovered gradient flows are then solved as a surrogate for the original optimization problem. To this end, we introduce the Learned Gradient Flow (LGF) optimizer, which is equipped to build surrogate models of variable polynomial order in full- or reduced-dimensional spaces at user-defined intervals in the optimization process. We demonstrate the efficacy of this approach on several standard problems from engineering mechanics and scientific machine learning, including two inverse problems, structural topology optimization, and two forward solves with different discretizations. Our results suggest that the learned gradient flows can significantly expedite convergence by capturing critical features of the optimization trajectory while avoiding expensive evaluations of the objective and its gradient.

2601.21824 2026-06-11 cs.LG cs.DC

DASH: Deterministic Attention Scheduling for High-throughput Reproducible LLM Training

Xinwei Qiang, Hongmin Chen, Shixuan Sun, Jingwen Leng, Xin Liu, Minyi Guo

Details
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
English abstract

Determinism is indispensable for reproducibility in large language model (LLM) training, yet it often exacts a steep performance cost. In widely used attention implementations such as FlashAttention-3, the deterministic backward pass can incur up to a 37.9% throughput reduction relative to its non-deterministic counterpart, primarily because gradient accumulation operations must be serialized to guarantee numerical consistency. This performance loss stems from suboptimal scheduling of compute and gradient-reduction phases, leading to significant hardware underutilization. To address this challenge, we formulate the backward pass of deterministic attention as a scheduling problem on a Directed Acyclic Graph (DAG) and derive schedules that minimize the critical path length. Building on this formulation, we present DASH (Deterministic Attention Scheduling for High-Throughput), which encapsulates two complementary scheduling strategies: (i) Descending Q-Tile Iteration, a reversed query-block traversal that shrinks pipeline stalls in causal attention, and (ii) Shift Scheduling, a theoretically optimal schedule within our DAG model that reduces pipeline stalls for both full and causal masks. Our empirical evaluations on NVIDIA H800 GPUs demonstrate that DASH narrows the performance gap of deterministic attention. The proposed strategies improve the throughput of the attention backward pass by up to 1.28$\times$ compared to the baseline, significantly advancing the efficiency of reproducible LLM training. Our code is open-sourced at https://github.com/SJTU-Liquid/deterministic-FA3.
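
Why accumulation order matters for reproducibility can be seen in a few lines: floating-point addition is not associative, so summing the same partial gradients in a different order can change the result. The sketch below is a generic illustration with made-up values and a hypothetical helper name, not code from DASH.

```python
def accumulate(values, order):
    """Sum floating-point values in the given index order."""
    total = 0.0
    for i in order:
        total += values[i]
    return total


partial_grads = [1e16, 1.0, -1e16]
in_order = accumulate(partial_grads, [0, 1, 2])   # the 1.0 is absorbed by 1e16
reordered = accumulate(partial_grads, [0, 2, 1])  # the large terms cancel first
print(in_order, reordered)  # prints: 0.0 1.0
```

Fixing the order removes this run-to-run variation but serializes the reduction, which is the throughput cost the scheduling strategies above aim to reduce.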

2409.00743 2026-06-11 cs.LG cs.AI

Interpretable Clustering: A Survey

Lianyu Hu, Mudi Jiang, Junjie Dong, Xinying Liu, Zengyou He

Details
Journal ref
ACM Computing Surveys, Volume 58, Issue 8, Article 215 (2026)
Comments
14 pages, 2 figures, 3 tables
English abstract

In recent years, much of the research on clustering algorithms has primarily focused on enhancing their accuracy and efficiency, frequently at the expense of interpretability. However, as these methods are increasingly being applied in high-stakes domains such as healthcare, finance, and autonomous systems, the need for transparent and interpretable clustering outcomes has become a critical concern. This is not only necessary for gaining user trust but also for satisfying the growing ethical and regulatory demands in these fields. Ensuring that decisions derived from clustering algorithms can be clearly understood and justified is now a fundamental requirement. To address this need, this paper provides a comprehensive and structured review of the current state of explainable clustering algorithms, identifying key criteria to distinguish between various methods. These insights can effectively assist researchers in making informed decisions about the most suitable explainable clustering methods for specific application contexts, while also promoting the development and adoption of clustering algorithms that are both efficient and transparent. For convenient access and reference, an open repository organizes representative and emerging interpretable clustering methods under the taxonomy proposed in this survey, available at https://github.com/hulianyu/Awesome-Interpretable-Clustering

2601.07436 2026-06-11 eess.SP cs.LG physics.optics

PIDT: Physics-Informed Digital Twin for Optical Fiber Parameter Estimation

Zicong Jiang, Magnus Karlsson, Erik Agrell, Christian Häger

Details
Comments
This paper will appear at the Optical Fiber Communications Conference and Exhibition (OFC) 2026
English abstract

We propose physics-informed digital twin (PIDT): a fiber parameter estimation approach that combines a parameterized split-step method with a physics-informed loss. PIDT improves accuracy and convergence speed with lower complexity compared to previous neural operators.

2508.11703 2026-06-11 cs.NE cs.LG

Data-Driven Discovery of Interpretable Kalman Filter Variants through Large Language Models and Genetic Programming

Vasileios Saketos, Sebastian Kaltenbach, Sergey Litvinov, Petros Koumoutsakos

Details
English abstract

Algorithmic discovery has traditionally relied on human ingenuity and extensive experimentation. Here we investigate whether a prominent scientific computing algorithm, the Kalman Filter, can be discovered through an automated, data-driven, evolutionary process that relies on Cartesian Genetic Programming (CGP) and Large Language Models (LLM). We evaluate the contributions of both modalities (CGP and LLM) in discovering the Kalman filter under varying conditions. Our results demonstrate that our framework of CGP and LLM-assisted evolution converges to near-optimal solutions when Kalman optimality assumptions hold. When these assumptions are violated, our framework evolves interpretable alternatives that outperform the Kalman filter. These results demonstrate that combining evolutionary algorithms and generative models for interpretable, data-driven synthesis of simple computational modules is a potent approach for algorithmic discovery in scientific computing.
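
As a reference point for what the evolutionary search is asked to rediscover, here is the textbook Kalman filter in its simplest scalar form. The sketch assumes identity (random-walk) dynamics, a directly observed state, process-noise variance `q`, and measurement-noise variance `r`; it is the standard algorithm, not one of the evolved variants, and the numbers are made up.

```python
def kalman_step(x, p, z, q, r):
    """One predict/update cycle of a scalar Kalman filter with identity dynamics."""
    # Predict: the state is carried over and its variance grows by q.
    x_pred = x
    p_pred = p + q
    # Update: blend prediction and measurement z using the Kalman gain.
    k = p_pred / (p_pred + r)
    x_new = x_pred + k * (z - x_pred)
    p_new = (1.0 - k) * p_pred
    return x_new, p_new


x, p = 0.0, 1.0
for z in [5.0] * 20:  # repeated noiseless measurements of a constant
    x, p = kalman_step(x, p, z, q=0.0, r=1.0)
print(x, p)  # the estimate approaches 5 and the variance shrinks
```

The gain formula is optimal under linear dynamics and Gaussian noise; the abstract's point is that when those assumptions are violated, a different update rule can do better.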

2505.11308 2026-06-11 cs.LG physics.comp-ph

Reinforcement Learning Closures for Underresolved Partial Differential Equations using Synthetic Data

Lothar Heimbach, Sebastian Kaltenbach, Petr Karnakov, Francis J. Alexander, Petros Koumoutsakos

Details
English abstract

Partial Differential Equations (PDEs) describe phenomena ranging from turbulence and epidemics to quantum mechanics and financial markets. Despite recent advances in computational science, solving such PDEs for real-world applications remains prohibitively expensive because of the necessity of resolving a broad range of spatiotemporal scales. In turn, practitioners often rely on coarse-grained approximations of the original PDEs, trading off accuracy for reduced computational resources. To mitigate the loss of detail inherent in such approximations, closure models are employed to represent unresolved spatiotemporal interactions. We present a framework for developing closure models for PDEs using synthetic data acquired through the method of manufactured solutions. These data are used in conjunction with reinforcement learning to provide closures for coarse-grained PDEs. We illustrate the efficacy of our method using the one-dimensional and two-dimensional Burgers' equations and the two-dimensional advection equation. Moreover, we demonstrate that closure models trained for inhomogeneous PDEs can be effectively generalized to homogeneous PDEs. The results demonstrate the potential for developing accurate and computationally efficient closure models for systems with scarce data.

2502.09084 2026-06-11 cs.CR cs.LG cs.NI

Application of Tabular Transformer Architectures for Operating System Fingerprinting

Rubén Pérez-Jove, Cristian R. Munteanu, Alejandro Pazos, Jose Vázquez-Naya

Details
Comments
Submitted as a preprint (not peer reviewed). 22 pages, 9 figures. Code and datasets available at: https://github.com/rubenpjove/tabularT-OS-fingerprinting
English abstract

Operating System (OS) fingerprinting is essential for network management and cybersecurity, enabling accurate device identification based on network traffic analysis. Traditional rule-based tools such as Nmap and p0f face challenges in dynamic environments due to frequent OS updates and obfuscation techniques. While Machine Learning (ML) approaches have been explored, Deep Learning (DL) models, particularly Transformer architectures, remain unexploited in this domain. This study investigates the application of Tabular Transformer architectures, specifically TabTransformer and FT-Transformer, for OS fingerprinting, leveraging structured network data from three publicly available datasets. Our experiments demonstrate that FT-Transformer generally outperforms traditional ML models, previous approaches, and TabTransformer across multiple classification levels (OS family, major, and minor versions). The results establish a strong foundation for DL-based OS fingerprinting, improving accuracy and adaptability in complex network environments. Furthermore, we ensure the reproducibility of our research by providing an open-source implementation.

2502.07990 2026-06-11 cs.LG physics.comp-ph physics.flu-dyn

Learning Effective Dynamics across Spatio-Temporal Scales of Complex Flows

Han Gao, Sebastian Kaltenbach, Petros Koumoutsakos

Details
Comments
Conference on Parsimony and Learning (CPAL)
English abstract

Modeling and simulation of complex fluid flows with dynamics that span multiple spatio-temporal scales is a fundamental challenge in many scientific and engineering domains. Fully resolved simulations of systems such as highly turbulent flows will not be feasible in the foreseeable future, and reduced-order models must capture dynamics that involve interactions across scales. In the present work, we propose a novel framework, Graph-based Learning of Effective Dynamics (Graph-LED), that leverages graph neural networks (GNNs), as well as an attention-based autoregressive model, to extract the effective dynamics from a small amount of simulation data. GNNs represent flow fields on unstructured meshes as graphs and effectively handle complex geometries and non-uniform grids. The proposed method combines a GNN-based dimensionality reduction for variable-size unstructured meshes with an autoregressive temporal attention model that can learn temporal dependencies automatically. We evaluated the proposed approach on a suite of fluid dynamics problems, including flow past a cylinder and flow over a backward-facing step over a range of Reynolds numbers. The results demonstrate robust and effective forecasting of spatio-temporal physics; in the case of the flow past a cylinder, both the small-scale effects that occur close to the cylinder and its wake are accurately captured.

2402.00972 2026-06-11 cs.LG cs.MA physics.comp-ph

Closure Discovery for Coarse-Grained Partial Differential Equations Using Grid-based Reinforcement Learning

Jan-Philipp von Bassewitz, Sebastian Kaltenbach, Petros Koumoutsakos

Details
Comments
Conference on Parsimony and Learning (CPAL)
English abstract

Reliable predictions of critical phenomena, such as weather, wildfires, and epidemics, often rely on models described by Partial Differential Equations (PDEs). However, simulations that capture the full range of spatio-temporal scales described by such PDEs are often prohibitively expensive. Consequently, coarse-grained simulations are usually deployed that adopt various heuristics and empirical closure terms to account for the missing information. We propose a novel and systematic approach for identifying closures in under-resolved PDEs using grid-based Reinforcement Learning. This formulation incorporates inductive bias and exploits locality by deploying a central policy represented efficiently by a Fully Convolutional Network (FCN). We demonstrate the capabilities and limitations of our framework through numerical solutions of the advection equation and the Burgers' equation. Our results show accurate predictions for in- and out-of-distribution test cases as well as a significant speedup compared to resolving all scales.

2412.12231 2026-06-11 cs.RO cs.LG

Demonstrating Data-to-Knowledge Pipelines for Connecting Production Sites in the World Wide Lab

Leon Gorißen, Jan-Niklas Schneider, Mohamed Behery, Philipp Brauner, Moritz Lennartz, David Kötter, Thomas Kaster, Oliver Petrovic, Christian Hinke, Thomas Gries, Gerhard Lakemeyer, Martina Ziefle, Christian Brecher, Constantin Häfner

Details
Journal ref
MDPI MAKE (Machine Learning and Knowledge Extraction) (2026), 8(5)
Comments
15 pages, 6 figures, submitted to CAiSE 2025
English abstract

The digital transformation of production requires new methods of data integration and storage, as well as decision-making and support systems that work vertically and horizontally throughout the development, production, and use cycle. In this paper, we propose Data-to-Knowledge (and Knowledge-to-Data) pipelines for production as a universal concept building on a network of Digital Shadows (a concept augmenting Digital Twins). We show a proof of concept that builds on and bridges existing infrastructure to 1) capture and semantically annotate trajectory data from multiple similar but independent robots in different organisations and use cases in a data lakehouse and 2) run an independent process that dynamically queries matching data for training an inverse dynamic foundation model for robotic control. The article discusses the challenges and benefits of this approach and, as a research outlook, how Data-to-Knowledge pipelines contribute to efficiency gains and industrial scalability in a World Wide Lab.

2408.00157 2026-06-11 cs.LG physics.comp-ph physics.flu-dyn

Generative Learning of the Solution of Parametric Partial Differential Equations Using Guided Diffusion Models and Virtual Observations

Han Gao, Sebastian Kaltenbach, Petros Koumoutsakos

Details
English abstract

We introduce a generative learning framework to model high-dimensional parametric systems using gradient guidance and virtual observations. We consider systems described by Partial Differential Equations (PDEs) discretized with structured or unstructured grids. The framework integrates multi-level information to generate high-fidelity time sequences of the system dynamics. We demonstrate the effectiveness and versatility of our framework with two case studies: incompressible, two-dimensional, low-Reynolds-number cylinder flow on an unstructured mesh and incompressible turbulent channel flow on a structured mesh, both parameterized by the Reynolds number. Our results illustrate the framework's robustness and ability to generate accurate flow sequences across various parameter settings while significantly reducing computational costs, allowing for efficient forecasting and reconstruction of flow dynamics.

2312.11540 2026-06-11 cs.LG

On the Trade-off between the Number of Nodes and the Number of Trees in a Random Forest

Tatsuya Akutsu, Avraham A. Melkman, Atsuhiro Takasu

Details
English abstract

In this paper, we focus on the prediction phase of a random forest and study the problem of representing a bag of decision trees using a smaller bag of decision trees, where we only consider binary decision problems on the binary domain and simple decision trees in which an internal node is limited to querying the Boolean value of a single variable. As a main result, we show that the majority function of $n$ variables can be represented by a bag of $T$ ($< n$) decision trees each with polynomial size if $n-T$ is a constant, where $n$ and $T$ must be odd (in order to avoid the tie break). We also show that a bag of $n$ decision trees can be represented by a bag of $T$ decision trees each with polynomial size if $n-T$ is a constant and a small classification error is allowed. A related result on the $k$-out-of-$n$ functions is presented too.
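As a point of reference for the result above, the trivial baseline with T = n can be sketched as follows: a bag of n decision trees, each querying a single Boolean variable, computes the majority function under majority voting. This is not the paper's construction for T < n; the function names are hypothetical and the check is exhaustive only for a small odd n.

```python
from itertools import product


def majority(bits):
    """Majority function of an odd number of Boolean inputs."""
    return int(sum(bits) > len(bits) / 2)


def make_single_variable_tree(i):
    """A one-node decision tree that outputs the value of variable i."""
    return lambda bits: bits[i]


def bag_predict(trees, bits):
    """Majority vote over the outputs of all trees in the bag."""
    votes = [tree(bits) for tree in trees]
    return int(sum(votes) > len(votes) / 2)


n = 5  # odd, so no tie can occur
trees = [make_single_variable_tree(i) for i in range(n)]

# Exhaustively compare the bag against the majority function.
agree = all(
    bag_predict(trees, x) == majority(x)
    for x in product((0, 1), repeat=n)
)
```

The question studied in the abstract is how much larger the individual trees must become when the bag shrinks below n trees.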

2306.01690 2026-06-11 cs.LG cs.AI

Context selectivity with dynamic availability enables lifelong continual learning

Martin Barry, Wulfram Gerstner, Guillaume Bellec

Details
English abstract

"You never forget how to ride a bike", but how is that possible? The brain is able to learn complex skills, stop the practice for years, learn other skills in between, and still retrieve the original knowledge when necessary. The mechanisms of this capability, referred to as lifelong learning (or continual learning, CL), are unknown. We suggest a bio-plausible meta-plasticity rule building on classical work in CL, which we summarize in two principles: (i) neurons are context selective, and (ii) a local availability variable partially freezes the plasticity if the neuron was relevant for previous tasks. In a new neuro-centric formalization of these principles, we suggest that neuron selectivity combined with neuron-wide consolidation is a simple and viable meta-plasticity hypothesis to enable CL in the brain. In simulation, this simple model balances forgetting and consolidation, leading to better transfer learning than contemporary CL algorithms on image recognition and natural language processing CL benchmarks.

2305.13108 2026-06-11 eess.AS cs.CL cs.LG cs.SD

Debiased Automatic Speech Recognition for Dysarthric Speech via Sample Reweighting with Sample Affinity Test

Eungbeom Kim, Yunkee Chae, Jaeheon Sim, Kyogu Lee

Details
Comments
Accepted by Interspeech 2023
English abstract

Automatic speech recognition (ASR) systems based on deep learning are mainly trained under empirical risk minimization (ERM). Since ERM optimizes the performance averaged over all data samples regardless of group, such as healthy or dysarthric speakers, ASR systems are unaware of the performance disparities across groups. This results in biased ASR systems with severe performance differences among groups. In this study, we aim to improve the ASR system in terms of group robustness for dysarthric speakers. To achieve our goal, we present a novel approach, sample reweighting with sample affinity test (Re-SAT). Re-SAT systematically measures the debiasing helpfulness of each data sample and then mitigates the bias by reweighting samples according to that helpfulness. Experimental results demonstrate that Re-SAT contributes to improved ASR performance on dysarthric speech without performance degradation on healthy speech.
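The difference between plain ERM and a reweighted objective can be shown with a small sketch. The per-sample losses and weights below are made-up placeholders; in particular, the weights are not computed by the sample affinity test described in the abstract, whose details are not given here.

```python
def erm_loss(losses):
    """Empirical risk: the unweighted mean of per-sample losses."""
    return sum(losses) / len(losses)


def reweighted_loss(losses, weights):
    """Weighted mean of per-sample losses, normalized by the total weight."""
    total = sum(weights)
    return sum(w * l for w, l in zip(weights, losses)) / total


# Hypothetical batch: three low-loss samples and one high-loss sample
# standing in for an under-served group.
losses = [0.2, 0.2, 0.2, 2.0]

# Placeholder weights that emphasize the last sample (an assumption).
weights = [1.0, 1.0, 1.0, 3.0]

plain = erm_loss(losses)                      # about 0.65
weighted = reweighted_loss(losses, weights)   # about 1.1
```

Under ERM the single high-loss sample is averaged away, while the reweighted objective gives it a larger share of the training signal.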

2107.00693 2026-06-11 eess.SP cs.LG

Inter-Beat Interval Estimation with Tiramisu Model: A Novel Approach with Reduced Error

Asiful Arefeen, Ali Akbari, Seyed Iman Mirzadeh, Roozbeh Jafari, Behrooz A. Shirazi, Hassan Ghasemzadeh

Details
Comments
16 pages, 14 figures
English abstract

Inter-beat interval (IBI) measurement enables estimation of heart-rate variability (HRV) which, in turn, can provide early indication of potential cardiovascular diseases. However, extracting IBIs from noisy signals is challenging since the morphology of the signal is distorted in the presence of noise. The electrocardiogram (ECG) of a person in heavy motion is highly corrupted by noise, known as motion artifact, and IBIs extracted from it are inaccurate. As part of remote health monitoring and wearable system development, denoising ECG signals and estimating IBIs correctly from them have become an emerging topic among signal-processing researchers. Apart from conventional methods, deep-learning techniques have recently been used successfully in signal denoising, and the diagnosis process has become easier, leading to accuracy levels that were previously unachievable. We propose a deep-learning approach leveraging a tiramisu autoencoder model to suppress motion-artifact noise and make the R-peaks of the ECG signal prominent even in the presence of high-intensity motion. After denoising, IBIs are estimated more accurately, expediting diagnosis tasks. Results illustrate that our method enables IBI estimation from noisy ECG signals with SNR as low as -30 dB, with an average root mean square error (RMSE) of 13 milliseconds for estimated IBIs. At this noise level, our error percentage remains below 8% and outperforms other state-of-the-art techniques.
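The evaluation quantities named in the abstract can be sketched as follows: IBIs are the differences between consecutive R-peak times, and RMSE compares estimated IBIs with reference IBIs. The R-peak timestamps below are invented for illustration and the function names are hypothetical; R-peak detection and the denoising model itself are outside this sketch.

```python
import math


def inter_beat_intervals(r_peak_times_s):
    """IBIs in milliseconds from R-peak timestamps given in seconds."""
    return [
        (later - earlier) * 1000.0
        for earlier, later in zip(r_peak_times_s, r_peak_times_s[1:])
    ]


def rmse(estimated, reference):
    """Root mean square error between two equally long sequences."""
    if len(estimated) != len(reference) or not estimated:
        raise ValueError("sequences must be non-empty and of equal length")
    squared = [(e - r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up R-peak times (seconds): clean reference vs. a noisy estimate.
reference_peaks = [0.0, 0.8, 1.6, 2.4]
estimated_peaks = [0.0, 0.81, 1.6, 2.41]

reference_ibi = inter_beat_intervals(reference_peaks)  # roughly 800 ms each
estimated_ibi = inter_beat_intervals(estimated_peaks)
error_ms = rmse(estimated_ibi, reference_ibi)
```

In this toy case each estimated IBI is off by about 10 ms, so the RMSE is about 10 ms, the same order as the 13 ms figure reported in the abstract.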