2606.19754 2026-06-19 cs.LG cs.NA math.NA New submission

Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System

Zhiwen Yu, Derong Yang, Liujian Zhang, Kaixiang Yang, Peilin Zhan, Jianmin Lv, Jane You, C. L. Philip Chen

Affiliations: School of Computer Science and Engineering, South China University of Technology; Peng Cheng Laboratory; School of Future Technology, South China University of Technology; School of Computer Science and Technology, Guangdong University of Technology; Department of Industrial and Systems Engineering, The Hong Kong Polytechnic University

AI summary: Proposes the Physics-Informed Broad Learning System (PIBLS), which solves linear and nonlinear PDEs efficiently through backpropagation-free least-squares optimization; it is one to three orders of magnitude faster than conventional PINNs and more accurate.

Abstract

Partial differential equations (PDEs) play a central role in modeling complex physical, biological, and engineering systems. While traditional numerical solvers are robust, they often incur prohibitive computational costs due to mesh dependencies, whereas recent Physics-Informed Neural Networks (PINNs) offer a mesh-free alternative but frequently suffer from slow convergence and optimization instability. To bridge this gap, this article proposes the Physics-Informed Broad Learning System (PIBLS), a novel backpropagation-free framework that reformulates PDE solving as a direct least-squares optimization. We improve an algorithm within this framework to handle nonlinear PDEs efficiently and provide a rigorous mathematical proof establishing the universal approximation property of PIBLS for these equations. Experiments on linear and nonlinear PDEs demonstrate that PIBLS is one to three orders of magnitude faster than conventional PINNs while achieving significantly higher solution accuracy. This framework provides a computationally efficient paradigm for scientific machine learning, offering a practical, high-speed alternative for real-time simulation and design optimization tasks.
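
To make "PDE solving as a direct least-squares optimization" concrete, here is a minimal sketch for a linear 1D Poisson problem. It is not the PIBLS algorithm itself: the random tanh features, the weight range, the boundary weighting of 10, and the function name are all assumptions chosen for illustration, and the nonlinear case discussed in the abstract is not covered.

```python
import numpy as np

def solve_poisson_random_features(n_features=100, n_colloc=200, seed=0):
    """Solve -u''(x) = pi^2 sin(pi x) on [0, 1] with u(0) = u(1) = 0.

    Hidden weights are random and frozen (an assumption of this sketch);
    only the linear output weights are fitted, by one least-squares solve.
    """
    rng = np.random.default_rng(seed)
    w = rng.uniform(-4.0, 4.0, n_features)   # frozen random input weights
    b = rng.uniform(-4.0, 4.0, n_features)   # frozen random biases

    def phi(x):
        # Feature matrix: one tanh feature per column.
        return np.tanh(np.outer(x, w) + b)

    def phi_xx(x):
        # Analytic second derivative of tanh(w x + b) with respect to x.
        t = np.tanh(np.outer(x, w) + b)
        return (w ** 2) * (-2.0 * t * (1.0 - t ** 2))

    x_in = np.linspace(0.0, 1.0, n_colloc)   # collocation points
    x_bc = np.array([0.0, 1.0])              # boundary points

    # Stack PDE-residual rows and (weighted) boundary rows into A c = y.
    A = np.vstack([-phi_xx(x_in), 10.0 * phi(x_bc)])
    y = np.concatenate([np.pi ** 2 * np.sin(np.pi * x_in), np.zeros(2)])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)

    # Compare with the exact solution u(x) = sin(pi x).
    x_test = np.linspace(0.0, 1.0, 101)
    u_pred = phi(x_test) @ coef
    max_err = np.max(np.abs(u_pred - np.sin(np.pi * x_test)))
    return coef, max_err

coef, max_err = solve_poisson_random_features()
```

Because the unknowns enter linearly, no gradient-based training loop is needed; a nonlinear PDE would require some outer iteration around a solve of this kind.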

2606.19850 2026-06-19 cs.LG cs.AI New submission

Neural Additive and Basis Models with Feature Selection and Interactions

Yasutoshi Kishimoto, Kota Yamanishi, Takuya Matsuda, Shinichi Shirakawa

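Affiliations: Yokohama National University

AI summary: Introduces a feature selection mechanism into neural additive models and neural basis models; a feature selection layer reduces computational overhead and enables learning feature interactions on high-dimensional data, with performance better than or comparable to existing GAM methods.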

Comments Accepted at PAKDD 2024. Code is available at https://github.com/shiralab/NAM-FS

DOI: 10.1007/978-981-97-2259-4_1

Abstract

Deep neural networks (DNNs) exhibit attractive performance in various fields but often suffer from low interpretability. The neural additive model (NAM) and its variant called the neural basis model (NBM) use neural networks (NNs) as nonlinear shape functions in generalized additive models (GAMs). Both models are highly interpretable and exhibit good performance and flexibility for NN training. NAM and NBM can provide and visualize the contribution of each feature to the prediction owing to GAM-based architectures. However, when using two-input NNs to consider feature interactions or when applying them to high-dimensional datasets, training NAM and NBM becomes intractable due to the increase in the computational resources required. This paper proposes incorporating the feature selection mechanism into NAM and NBM to resolve computational bottlenecks. We introduce the feature selection layer in both models and update the selection weights during training. Our method is simple and can reduce computational costs and model sizes compared to vanilla NAM and NBM. In addition, it enables us to use two-input NNs even in high-dimensional datasets and capture feature interactions. We demonstrate that the proposed models are computationally efficient compared to vanilla NAM and NBM, and they exhibit better or comparable performance with state-of-the-art GAMs.

2606.19853 2026-06-19 cs.LG physics.comp-ph New submission

Physics-Informed Neural Network with Squeeze-Excitation-like Attention

Yun-Fei Song, Long-Gang Pang, Fu-Peng Li, Jun-Jie Zhang

Affiliations: Key Laboratory of Quark and Lepton Physics (MOE) & Institute of Particle Physics, Central China Normal University; Artificial Intelligence and Computational Physics Research Center, Central China Normal University; Key Laboratory of Nuclear Physics and Ion-beam Application (MOE) & Institute of Modern Physics, Fudan University; Shanghai Research Center for Theoretical Nuclear Physics, NSFC and Fudan University; Northwest Institute of Nuclear Technology

AI summary: Proposes the SEA-PINN architecture, which uses a squeeze-excitation-like attention mechanism to dynamically adjust neuron importance and yields stable initialization, with very small variance on 17 of 20 benchmark problems. Without Fourier embeddings or periodic activations it reaches accuracy comparable to TSA-PINN, and it can serve as a lightweight plug-in to improve other PINNs.

Comments 15 pages, 6 figures

Abstract

We introduce SEA-PINN, a novel architecture that incorporates a Squeeze-Excitation-like attention mechanism into physics-informed neural networks to dynamically recalibrate the importance of neurons across layers. A key feature of SEA-PINN is its highly stable initialization. On 17 out of 20 benchmark problems, SEA-PINN exhibits nearly negligible variance and significantly reduced initial loss, establishing a quasi-deterministic and favorable starting point for optimization. Notably, without employing Fourier feature embeddings or periodic activation functions, SEA-PINN attained accuracy competitive with TSA-PINN, a model specifically engineered for high-frequency problems via learnable frequencies in sinusoidal activations (83% vs. 90% improvement relative to FNN-PINN on the high-frequency case 7). Furthermore, integrating SEA-PINN into TSA-PINN boosted performance by 42.49%. These results underscore SEA-PINN as a lightweight plug-in module that enhances nonlinear representation power, promotes more robust and efficient convergence, and strengthens the overall reliability of physics-informed learning.
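
As a rough illustration of what a squeeze-excitation-like gate on a fully connected layer could look like, the sketch below rescales each hidden neuron by a learned weight in (0, 1). The abstract does not give the actual SEA-PINN layer, so the bottleneck shape, the ReLU and sigmoid choices, and every name here (`se_like_layer`, `W_down`, `W_up`) are assumptions.

```python
import numpy as np

def se_like_layer(x, W, b, W_down, W_up):
    """Dense layer followed by a squeeze-excitation-style gate.

    x: (batch, d_in); W: (d_in, d_hidden); b: (d_hidden,);
    W_down: (d_hidden, d_red); W_up: (d_red, d_hidden).
    Returns the gated activations and the gate itself.
    """
    h = np.tanh(x @ W + b)                     # ordinary hidden layer
    s = np.maximum(h @ W_down, 0.0)            # "squeeze" into a narrow code
    gate = 1.0 / (1.0 + np.exp(-(s @ W_up)))   # "excite": per-neuron weights
    return h * gate, gate

# Hypothetical shapes, chosen only to exercise the function.
rng = np.random.default_rng(0)
x = rng.normal(size=(5, 2))
W, b = rng.normal(size=(2, 16)), np.zeros(16)
W_down = 0.1 * rng.normal(size=(16, 4))
W_up = 0.1 * rng.normal(size=(4, 16))
out, gate = se_like_layer(x, W, b, W_down, W_up)
```

The gate only rescales existing activations, which is one reason such a module can be dropped into an existing network without changing its input or output shapes.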

2606.19941 2026-06-19 cs.LG New submission

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

Dat H. Do, Rushi Shah, Duc V. Le, Dianbo Liu

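Affiliations: National University of Singapore; University of Twente

AI summary: Finds that compositionality emerges only in particular sparse networks and within a particular depth range, proposes similarity-based pruning and a depth predictor, and explains the findings with a theoretical framework.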

Abstract

Compositionality is believed to be the foundation for generalization, enabling models to reuse meaningful primitives in novel combinations. Yet, models trained with standard gradient-based optimization rarely, and often only weakly, exhibit compositional internal structure, and it remains unclear how or why such compositionality forms. In this work, we show that compositionality emerges in a narrow connectivity-depth sweet spot. Along the connectivity axis, compositionality appears only in specific sparse networks and depends heavily on which connections remain rather than on weight sparsity alone. Along the depth axis, compositionality emerges within a narrow, target-dependent regime, peaking at specific depths, while both shallower and deeper networks fail. When either the depth or connectivity condition is violated, gradient descent silently converges to fractured solutions rather than compositional ones. To discover and exploit this emergence, we introduce (i) similarity-based pruning (SP) to recover compositional connectivity and (ii) a heuristic depth predictor to estimate where compositionality is most likely to appear. Finally, we support these empirical findings with a theoretical framework based on compositional sparsity, volume-ratio arguments, and feature-interference bounds, explaining why compositional solutions are reachable only in a narrow depth-connectivity regime.

2606.19984 2026-06-19 cs.LG New submission

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

Takanori Yoshimoto, Yang Hu, Naruya Kondo, Tatsuya Matsushima

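Affiliations: University of Tsukuba; The University of Tokyo

AI summary: To address the trade-off caused by fixed-capacity bottlenecks in latent action models, proposes FlexLAM, which uses nested dropout to obtain variable-length latent actions. Without adding architectures or losses, it matches or outperforms fixed-capacity models under scarce labels and on low-return tasks, and supports adjusting the token budget at inference time.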

Abstract

Latent actions provide a compact interface between action-free video and downstream decision-making, yet existing Latent Action Models (LAMs) force every transition through a fixed-capacity bottleneck. We identify a bottleneck trade-off: overly tight codes can discard transition cues needed for action alignment, while overly loose codes preserve additional transition variation that must be resolved when alignment labels are scarce or narrowly distributed. FlexLAM replaces this fixed capacity with variable-length latent actions trained by nested dropout, yielding prefix-valid codes that capture compact transition structure first and add detail only when needed, without new architectures or losses. A single FlexLAM matches or surpasses separately trained fixed-capacity LAMs at every evaluated token budget under standard scarce-label supervision and under a low-return single-task alignment stress test, indicating that FlexLAM is not merely adjustable at inference time but learns a better latent-action interface at the same token budgets. The same model supports inference-time token-budget adjustment without retraining, and FlexLAM improves Ego4D transition reconstruction. These results suggest that variable-length latent actions are an architecture-free, drop-in upgrade to the fixed-capacity bottleneck in latent action models, latent-action world models, and video-pretrained action interfaces.
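
The idea of prefix-valid codes trained with nested dropout can be sketched as follows: during training a random-length prefix of the latent tokens is kept and the rest is zeroed, so that any prefix remains a usable code at inference time. This is only a sketch; the uniform distribution over prefix lengths, the zero-masking, and the function names are assumptions, not details taken from FlexLAM.

```python
import numpy as np

def nested_dropout(z, rng):
    """Keep a random-length prefix of each latent sequence and zero the rest.

    z: (batch, n_tokens, dim). Returns the masked latents and kept lengths.
    """
    batch, n_tokens, _ = z.shape
    # Prefix length per sample, uniform in [1, n_tokens] (an assumption).
    keep = rng.integers(1, n_tokens + 1, size=batch)
    mask = np.arange(n_tokens)[None, :] < keep[:, None]   # (batch, n_tokens)
    return z * mask[:, :, None], keep

def truncate(z, budget):
    """Inference-time token budget: keep only the first `budget` tokens."""
    return z[:, :budget, :]

rng = np.random.default_rng(0)
z = np.ones((4, 8, 3))
masked, keep = nested_dropout(z, rng)
short = truncate(z, budget=2)
```

Since earlier tokens survive the mask more often than later ones, training pressure pushes the most important transition information into the first tokens.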

2606.19451 2026-06-19 cs.LG cs.CV cs.RO New submission

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

Ellina Zhang, Madhaven Iyengar, Amir Zadeh, Chuan Li, Deepak Pathak, David Held, Tal Daniel

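AI summary: Proposes 3D-DLP, a self-supervised model that decomposes scene-level RGB-D or voxel observations into 3D latent particles, each encoding disentangled attributes; it yields interpretable per-particle segmentation maps and supports scene manipulation and downstream robotic manipulation.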

Comments ICML 2026. Project webpage: https://eubooks3003.github.io/3d-dlp

Abstract

We introduce 3D-DLP, a self-supervised object-centric representation learning model that decomposes scene-level RGB-D or voxel observations into a set of 3D latent particles. Building on the Deep Latent Particles (DLP) framework, each particle encodes disentangled attributes, including 3D keypoint position, bounding box dimensions, and appearance features, and represents a distinct entity in the scene. The model learns interpretable per-particle segmentation maps through an end-to-end self-supervised reconstruction objective. We demonstrate on both simulated and real-world datasets that the learned latent space is interpretable and controllable: by manipulating particle positions and decoding, we can generate novel scene configurations. Furthermore, we show that leveraging these compact 3D latent particles for downstream robotic manipulation improves performance over baselines that either lack explicit 3D information or rely on memory-intensive dense 3D inputs without object-centric structure. Code and videos are available at https://eubooks3003.github.io/3d-dlp.

2606.19542 2026-06-19 cs.LG New submission

Tracking Representation Dynamics in Large Language Models with Persistent Homology

Naman Malhotra, Jay Ambadkar, Abhinav Gupta, Kushal Kasivel, Abbas Schwarz, Kamillo Ferry, Anthea Monod

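Affiliations: Imperial College London

AI summary: Analyzes the topology of activation spaces with persistent homology and finds that topological reorganization during alignment occurs mainly early in training, and that different alignment objectives produce distinguishable topological trajectories.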

Comments 29 pages

Abstract

Large language models are commonly aligned through supervised fine-tuning, yet little is known about how their internal representations evolve during this process. We study alignment dynamics using persistent homology by tracking the topology of activation spaces throughout fine-tuning. Across four transformer language models ranging from 1B to 7B parameters and three alignment objectives corresponding to helpful, harmless, and mixed training data, we find that the majority of topological reorganization occurs during the earliest stages of training. A dense checkpoint analysis reveals a transient peak in topological activity followed by rapid stabilization. We further show that different alignment objectives induce distinguishable topological trajectories, while instruction-tuned and pretrained models exhibit qualitatively different patterns of evolution. Our results suggest that persistent homology provides a complementary perspective on alignment, revealing representation-level changes that are not apparent from behavioral metrics alone.
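
For readers new to persistent homology, the simplest case can be computed by hand: in a Vietoris-Rips filtration of a point cloud, every 0-dimensional class is born at scale 0 and dies at the length of a minimum-spanning-tree edge. The sketch below computes those death times with Kruskal's algorithm. It is only an illustration on a toy point cloud; which homology dimensions, distance, and software the paper uses on activations is not stated in the abstract, and the function name is hypothetical.

```python
import numpy as np

def h0_death_times(points):
    """Death times of 0-dimensional Vietoris-Rips classes of a point cloud.

    Convention assumed here: an edge enters the filtration at its length
    (some libraries use half the length instead).
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    dist = np.linalg.norm(pts[:, None, :] - pts[None, :, :], axis=-1)
    i_idx, j_idx = np.triu_indices(n, k=1)
    order = np.argsort(dist[i_idx, j_idx])   # edges by increasing length

    parent = list(range(n))                  # union-find forest

    def find(a):
        while parent[a] != a:
            parent[a] = parent[parent[a]]    # path halving
            a = parent[a]
        return a

    deaths = []
    for e in order:
        ra, rb = find(i_idx[e]), find(j_idx[e])
        if ra != rb:                         # two components merge: one dies
            parent[ra] = rb
            deaths.append(dist[i_idx[e], j_idx[e]])
    return np.array(deaths)

deaths = h0_death_times([[0.0, 0.0], [1.0, 0.0], [10.0, 0.0]])
```

On the three collinear points above, one component dies at scale 1 and another at scale 9; the long-lived class reflects the gap between the two clusters.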

2606.19594 2026-06-19 cs.LG New submission

Unsupervised Causal Abstractions Discovery

Théo Saulus, Simon Lacoste-Julien, Dhanya Sridhar

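Affiliations: Mila - Quebec AI Institute; Université de Montréal; Canada CIFAR AI Chair

AI summary: Proposes a method for learning high-level structural causal models directly from low-level measurement data. Using a low-rank causal discovery assumption, it proves that the latent variables induced by low-rank graph observations form a causal abstraction, and gives identifiability results and a practical learning objective.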

2606.19827 2026-06-19 cs.LG cs.AI New submission

When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

Daehwan Kim, Haejun Chung, Ikbeom Jang

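Affiliations: Hanyang University; Hankuk University of Foreign Studies

AI summary: Proposes Adaptive Binning, which dynamically refines discretization through a feature-wise coarse-to-fine curriculum and combines categorical reconstruction with ordinal supervision, improving self-supervised learning performance on medical tabular data.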

Comments Accepted to MICCAI 2026

Abstract

Medical tabular data are ubiquitous in clinical research, but deep learning for tables remains underexplored because reliable labels often require costly expert adjudication, even though structured clinical variables are routinely available in tabular form. Self-supervised learning can leverage these unlabeled tables, and recent binning-based pretexts offer a promising inductive bias, but existing objectives fix a single global quantile discretization and apply feature-agnostic supervision. We propose Adaptive Binning, a training-adaptive discretization pretext for tabular SSL that couples discretization to learning through a feature-wise coarse-to-fine curriculum. Motivated by the spectral bias of neural networks and the principles of curriculum learning, our method progressively refines discretization per feature upon plateau detection and selects representation-aware splits to jointly improve value-space concentration and representation-space coherence. A heterogeneity-aware objective unifies categorical reconstruction with ordinal supervision for numerical features, and experiments on public medical tabular datasets under unified evaluation protocols show consistent gains for linear probing and fine-tuning without dataset-specific discretization tuning. We further introduce a medical tabular SSL benchmark with standardized protocols to support reproducible progress in this underexplored domain. Our code is available at https://github.com/labhai/Adaptive-Binning.
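
The baseline that this abstract starts from, quantile discretization of a numerical feature into bin labels, can be sketched in a few lines, along with what a coarse-to-fine refinement of one feature might look like. The plateau detection and representation-aware split selection described above are not reproduced here; the doubling schedule (2 bins, then 4) and the helper name `quantile_bin` are assumptions for illustration.

```python
import numpy as np

def quantile_bin(values, n_bins):
    """Assign each value to one of `n_bins` quantile bins (labels 0..n_bins-1)."""
    # Interior quantile edges only; np.digitize handles the outer ranges.
    inner_edges = np.quantile(values, np.linspace(0.0, 1.0, n_bins + 1)[1:-1])
    return np.digitize(values, inner_edges), inner_edges

values = np.arange(100, dtype=float)   # toy numerical feature
coarse, _ = quantile_bin(values, 2)    # coarse pretext target
fine, _ = quantile_bin(values, 4)      # refined target for the same feature
```

In this toy case the fine bins nest inside the coarse ones, so a model that has learned the coarse target already has a head start on the refined one.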

2606.19888 2026-06-19 cs.LG cs.AI New submission

SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models

Feng Wu, Harsh Deep, Eric Lehman, Sanyam Kapoor, Guoshuai Zhao, Rahul Krishnan, Gari Clifford, Li-wei H Lehman

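Affiliations: Massachusetts Institute of Technology; OpenEvidence, USA; New York University; Xi’an Jiaotong University; University of Toronto; Emory University

AI summary: Proposes the SL-S4Wave framework, which combines contrastive learning with an encoder based on structured state space models; global convolution with multiscale subkernels captures local and long-range dependencies in multichannel physiological waveforms, and the method outperforms existing approaches on tasks such as arrhythmia detection.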

Abstract

Modeling long-sequence medical time series data, such as electrocardiograms (ECG), poses significant challenges due to high sampling rates, multichannel signal complexity, inherent noise, and limited labeled data. While recent self-supervised learning (SSL) methods, based on various encoder architectures such as convolutional neural networks, have been proposed to learn representations from unlabeled data, they often fall short in capturing long-range dependencies and noise-invariant features. Structured state space models (S4) excel at long-sequence modeling, but existing S4 architectures fail to capture the unique characteristics of multichannel physiological waveforms. In this work, we propose SL-S4Wave, a self-supervised learning framework that combines contrastive learning with a tailored encoder built on structured state space models. The encoder incorporates multi-layer global convolution using multiscale subkernels, enabling the capture of both fine-grained local patterns and long-range temporal dependencies in noisy, high-resolution multichannel waveforms. Extensive experiments on real-world datasets demonstrate that SL-S4Wave (1) consistently outperforms state-of-the-art supervised and self-supervised baselines in a challenging arrhythmia detection task, (2) achieves high performance with significantly fewer labeled examples, showcasing strong label efficiency, (3) maintains robust performance on long waveform segments, highlighting its capacity to model complex temporal dynamics in long sequences that most existing approaches fail to model efficiently, and (4) transfers effectively to unseen arrhythmia types, underscoring its robust cross-domain generalization. We additionally evaluate SL-S4Wave on multiple EEG tasks, achieving superior performance over strong baselines and demonstrating the generalizability of our approach beyond cardiac waveforms.

2606.20167 2026-06-19 cs.LG New submission

Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying

Jonathan Hecht, Lukas Arzoumanidis, Ziyue Li, Youness Dehbi

Affiliations: Computational Methods Lab, HafenCity University Hamburg; Dept. of Operations & Technology, Technical University of Munich; Heilbronn Data Science Center; Munich Data Science Institute; Institute of Geodesy and Geoinformation, University of Bonn

AI summary: Proposes two multimodal contrastive learning architectures, MELT and SALT, which integrate unpaired geospatial data through location tying. They match the strongest two-modality baseline, SATCLIP, on four downstream tasks, but adding modalities does not consistently improve performance, suggesting that the location encoder is the main bottleneck.

Abstract

Spatial prediction tasks are often limited by a lack of high-quality labelled ground-truth observations. To overcome this challenge, self-supervised pre-training is a possible solution, with contrastive learning dominant for location encoders. Those approaches usually align geographic coordinates with just one additional modality. We propose two multimodal contrastive learning architectures: Multimodal Embedding via Location Tying (MELT) and Sequential Alternating Location Training (SALT). These architectures expand this framework beyond two modalities by utilising unpaired geospatial data. Both methods are technically viable and match the performance of the strongest two-modality baseline (SATCLIP) across four downstream tasks. However, increasing the number of modalities does not consistently improve performance, suggesting that the chosen location encoder is the main limitation: the contrastive objective reaches its peak early, regardless of modality diversity or pre-training volume. MELT provides more stable training than SALT and presents a stronger foundation for future scaling.
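
The two-modality setup that this abstract extends, aligning a location embedding with one other modality, is commonly trained with a symmetric contrastive loss of the following kind. This is a generic CLIP-style sketch, assumed here for illustration; the abstract does not specify the exact objective, and how MELT and SALT tie additional unpaired modalities to a location is not shown.

```python
import numpy as np

def symmetric_info_nce(loc_emb, mod_emb, temperature=0.07):
    """CLIP-style symmetric contrastive loss between two embedding batches.

    Row i of `loc_emb` and row i of `mod_emb` form a positive pair; all
    other rows in the batch act as negatives.
    """
    a = loc_emb / np.linalg.norm(loc_emb, axis=1, keepdims=True)
    b = mod_emb / np.linalg.norm(mod_emb, axis=1, keepdims=True)
    logits = a @ b.T / temperature            # cosine similarity / temperature

    def cross_entropy_diag(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))    # targets are the diagonal

    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits.T))

emb = np.eye(4)                               # toy, perfectly separable batch
loss_aligned = symmetric_info_nce(emb, emb)
loss_shuffled = symmetric_info_nce(emb, np.roll(emb, 1, axis=0))
```

The loss is near zero when paired rows match and large when the pairing is shuffled, which is the signal that pulls a location encoder toward the co-located modality.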

2606.19370 2026-06-19 cs.LG cs.AI cs.MA New submission

Human-like autonomy emerges from self-play and a pinch of human data

Daphne Cornelisse, Julian Hunt, Zixu Zhang, Waël Doulazmi, Kevin Joseph, Jaime Fernández Fisac, Eugene Vinitsky

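Affiliations: NYU Tandon School of Engineering; NYU Courant; Princeton University; Centre for Robotics, Mines Paris; Valeo

AI summary: Proposes a method that regularizes self-play reinforcement learning with a small amount of human demonstrations; with only 30 minutes of human data it trains driving policies that coordinate with humans, in just 15 hours of training.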

Comments 10 pages

Abstract

Self-play reinforcement learning has recently emerged as a way to train driving policies without any human data. It uses cheap, large-scale simulations to substitute for expensive, large-scale human driving demonstrations. A key limitation of this approach is that policies trained through pure self-play can learn effective but alien driving conventions incompatible with people. Previous works attempt to mitigate such behavioral misalignments through extensive reward engineering and domain randomization, which are brittle and labor-intensive. Instead of completely discarding human demonstrations, our method treats them as a regularization objective on top of a minimal safe goal-reaching reward. Like the spice in a good stew, we find that a little human data goes a long way: our method uses only 30 minutes of human demonstrations, 2500x fewer than comparable imitation learning approaches. Resulting policies coordinate with held-out human trajectories and complete training in 15 hours on a single consumer-grade GPU. Videos and full source code are available at https://spiced-self-play.com/.
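
One common way to write "human demonstrations as a regularization objective on top of a reward" is a policy loss with an added imitation term, as below. This form is an assumption for illustration: the abstract does not state the regularizer, and the symbols (policy \pi_\theta, goal-reaching reward r_t, demonstration set \mathcal{D}_{\text{human}}, weight \lambda) are introduced here, not taken from the paper.

```latex
\mathcal{L}(\theta)
  = -\,\mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t} r_t\Big]
  \;+\; \lambda\, \mathbb{E}_{(s,a) \sim \mathcal{D}_{\text{human}}}
        \big[-\log \pi_\theta(a \mid s)\big]
```

With \lambda = 0 this reduces to pure self-play on the reward; a small positive \lambda nudges the policy toward human-like actions without requiring the demonstrations to cover the full state space.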

2606.19476 2026-06-19 cs.LG cs.AI New submission

Can In-Context Learning Support Intrinsic Curiosity?

Eric Elmoznino, Sangnie Bhardwaj, Johannes von Oswald, Rajai Nasser, Blaise Agüera y Arcas, João Sacramento, Rif A. Saurous, Guillaume Lajoie

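Affiliations: Google – Paradigms of Intelligence Team; Google DeepMind

AI summary: Studies using the in-context learning ability of sequence models as an immediate, update-free world model to remove the gradient-descent computational bottleneck of traditional intrinsic curiosity methods, and proves that in non-temporal settings the resulting rewards converge asymptotically to the true learning progress.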

Abstract

Effective machine learning depends not only on how we model data, but also on what data we choose to collect. While large sequence models have revolutionized data modeling, the problem of automated data selection, or "intrinsic curiosity", remains a significant challenge. Classic approaches incentivize exploration by rewarding an agent based on its "learning progress", which measures how much a newly acquired observation improves a world model's predictive ability. However, evaluating these rewards traditionally requires expensive inner loops of gradient descent updates within each trajectory, rendering them computationally impractical at scale. In this work, we investigate whether the emergent in-context learning (ICL) capabilities of sequence models can eliminate this bottleneck by serving as immediate, update-free world models. Specifically, we evaluate whether an exploration policy can be trained to maximize learning progress, using solely the prediction errors and counterfactual context manipulations of an in-context learner. We first prove that in general Markov decision processes, this is in fact impossible in an unbiased way: the resulting intrinsic rewards either suffer from nuisance terms that bias their estimation of true learning progress, or they cannot be implemented using an in-context learner's prediction errors. Conversely, we prove a positive result for a broad subclass of non-temporal settings, encompassing active learning and Bayesian Experimental Design: here, ICL-derived rewards successfully bound and asymptotically converge to the true learning progress. We corroborate our theory with controlled experiments across continuous and symbolic environments, demonstrating that our ICL-driven framework successfully trains curious data-collection policies that explore optimally.

2606.19690 2026-06-19 cs.LG New submission

Multi-Granular Attention-Driven Reinforcement Learning Framework for Web Intelligent Enhancement Systems

Navin Chhibber, Deepak Singh, Anokh Kishore, Nikita Chawla, K. Anguraj

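AI summary: Proposes the MGAR-WIES framework, which uses semantic graph modeling, attention mechanisms, and adaptive reinforcement learning to address semantic understanding and scalability for heterogeneous, dynamic web data, reaching 80% accuracy.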

Comments 2026 3rd International Conference on Integrated Intelligence and Communication Systems (ICIICS), 6 Pages

Abstract

In recent years, web intelligent enhancement systems have increasingly relied on heterogeneous and dynamic web data to deliver personalized, context-aware services. However, traditional machine learning, deep learning, and reinforcement learning models often struggle with semantic understanding, adaptability, and scalability in continuously evolving web environments. In this research, a Multi-Granular Attention-based Reinforcement Web Intelligent Enhancement System (MGAR-WIES) is proposed to address these challenges by integrating semantic graph modeling, attention mechanisms, and adaptive reinforcement learning. Initially, heterogeneous web data comprising structured, semi-structured, and unstructured sources are collected and preprocessed to generate unified feature representations. These representations are transformed into a dynamic semantic graph, where entities and their relationships are modeled using graph embeddings enhanced by attention mechanisms to capture both local relevance and global contextual dependencies. Subsequently, an adaptive multi-agent reinforcement learning strategy leverages the attention-aware semantic states to optimize personalized web actions such as content recommendation, navigation optimization, and service adaptation. Finally, continuous online feedback is integrated to update graph representations and learning policies in real time, ensuring sustained adaptability and performance. The proposed MGAR-WIES achieved better accuracy (80%) than existing approaches.

2606.19721 2026-06-19 cs.LG cs.AI New submission

OnDeFog: Online Decision Transformer under Frame Dropping

Daiki Yotsufuji, Kenta Nishihara, Shoma Shimizu, Kento Uchida, Shinichi Shirakawa

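Affiliations: Yokohama National University

AI summary: To address performance degradation caused by frame dropping, proposes OnDeFog, which combines the DeFog mechanisms with the online decision transformer (ODT) and learns policies through direct environment interaction; it outperforms ODT at high frame-dropping rates and outperforms DeFog on low-reward datasets.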

Comments Accepted to PRICAI 2025

DOI: 10.1007/978-981-95-7072-0_10

Abstract

In challenging real-world reinforcement learning applications, communication delays or sensor failures often cause frame dropping, in which the agent cannot receive the dropped states and associated rewards. To address the performance degradation caused by frame dropping, the Decision Transformer under Random Frame Dropping (DeFog) was developed by incorporating additional mechanisms into the decision transformer. Although DeFog can mitigate performance degradation in frame-dropping environments, it is an offline learning method and therefore struggles to generalize to novel states that are not adequately represented in the training dataset. In this study, we propose OnDeFog, which integrates the mechanisms in DeFog with the online decision transformer (ODT), an online reinforcement learning method that learns policies through direct environmental interaction. Comprehensive experimental evaluation demonstrates that our proposed OnDeFog achieves superior performance compared to ODT in environments characterized by high frame-dropping rates and outperforms DeFog on datasets containing a large amount of low-reward data.


2606.19750 2026-06-19 cs.LG cs.AI cs.CL 新提交

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

流形赌博机：大语言模型潜在几何上的贝叶斯课程学习

Darrien McKenzie, Nicklas Hansen, Xiaolong Wang

发表机构 * University of California, San Diego（加州大学圣迭戈分校）

AI总结提出贝叶斯流形课程（BMC）框架，将问题采样建模为流形结构赌博机问题，通过层次任务树和贝叶斯学习引导采样，平衡学习信号、多样性和实用性。

Comments Webpage: https://darrienmckenzie.com/manifold-bandits/

详情

AI中文摘要

强化学习（RL）是提高大语言模型（LLMs）推理能力的关键方法，其中训练效率关键取决于优化过程中问题的采样方式。现有的自适应课程学习方法通常优先考虑中等难度的提示，将问题选择视为具有独立臂的标准赌博机问题，忽略了任务空间的结构化和异质性。在这项工作中，我们将问题采样框架化为具有内生非平稳性的流形结构赌博机问题：问题通过模型的潜在表示空间相关联，采样决策可以影响学习信号在该空间中的演变方式。为了实现这一视角，我们引入了贝叶斯流形课程（BMC），这是一个结构感知框架，将问题组织成层次任务树，并应用贝叶斯学习来指导采样。实验发现，不同的采样策略在生产性（学习信号）、多样性（任务流形覆盖）和实用性（评估相关性）之间引入了非平凡的权衡。这些结果表明，仅优先考虑难度不足以获得强大的下游性能，突出了将结构和类型感知纳入问题采样中的重要性。

英文摘要

Reinforcement learning (RL) is a central approach for improving reasoning capabilities in large language models (LLMs), where training efficiency depends critically on how problems are sampled during optimization. Existing adaptive curriculum learning methods typically prioritize prompts of intermediate difficulty, treating problem selection as a standard bandit problem with independent arms and overlooking the structured, heterogeneous nature of the task space. In this work, we frame problem sampling as a manifold-structured bandit problem with endogenous non-stationarity: problems are related through the model's latent representation space, and sampling decisions can steer how learning signals evolve across that space. To operationalize this perspective, we introduce Bayesian Manifold Curriculum (BMC), a structure-aware framework that organizes problems into a hierarchical task tree and applies Bayesian learning to guide sampling. Empirically, we find that different sampling strategies induce non-trivial tradeoffs between productivity (learning signal), diversity (coverage of the task manifold), and utility (evaluation relevance). These results show that prioritizing difficulty alone is insufficient for strong downstream performance, highlighting the importance of incorporating structure and type-awareness into problem sampling.


2606.19883 2026-06-19 cs.LG stat.ML 新提交

Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

匹配市场遇上累积前景理论：迈向最优和对抗鲁棒学习

Ananya Kunisetty, Avishek Ghosh

发表机构 * Indian Institute of Technology Bombay（印度理工学院孟买分校）

AI总结研究基于累积前景理论（CPT）的竞争性双边匹配市场多智能体多臂赌博机问题，提出最优遗憾界算法并扩展到对抗性市场。

Comments Accepted at ECML-PKDD 2026, Naples, Italy

详情

AI中文摘要

我们研究了一个在竞争性设置下具有双边匹配市场的多智能体多臂赌博机问题，该问题基于以人为中心的决策模型。为了捕捉人类偏好，我们使用累积前景理论（CPT），该理论通过一个（α-Hölder连续）权重函数以非线性方式加权智能体的行动。CPT已被广泛用于行为经济学和风险敏感机器学习中，以模拟人类偏好。我们分析了带有CPT权重扭曲奖励的最先进学习算法，并获得了玩家最优遗憾界为$\mathcal{O}(K\log T \left(\frac{1}{\Delta}\right)^{2/\alpha})$，其中$K$表示臂数，$T$是学习时间，$\Delta$表示（适当定义的）玩家的最小偏好差距。注意到对$\Delta$的依赖是次优的，我们通过明智地选择探索期间的活跃臂集进一步改进了这一遗憾，从而在主导项中消除了对$K$的依赖，并在臂数$K$显著大于玩家数$N$的设置中实现了改进的（最优）遗憾保证。此外，我们考虑了对抗性市场，其中智能体的观测奖励可能被破坏。我们提出并分析了在已知和未知总破坏预算两种设置下，以CPT作为风险敏感度量的鲁棒市场算法，并在两种情况下建立了对数级别的玩家最优遗憾保证。

英文摘要

We study a multi-agent multi-armed bandit problem in the competitive setup with two-sided matching markets under a human-centric decision-making model. To capture human preferences, we use cumulative prospect theory (CPT), which weighs the actions of the agent in a nonlinear fashion using an ($α$-Hölder continuous) weight function. CPT has been widely used in behavioral economics and risk-sensitive machine learning to emulate human preferences. We analyze the state-of-the-art learning algorithm with CPT weight-distorted rewards and obtain a player-optimal regret of $\mathcal{O}\left(K\log T \left(\frac{1}{Δ}\right)^{2/α}\right)$, where $K$ denotes the number of arms, $T$ is the learning horizon, and $Δ$ represents the (suitably defined) minimum preference gap of the players. Noticing the dependence on $Δ$ to be sub-optimal, we further improve this regret by judiciously selecting the active set of arms during exploration, which removes the dependence on $K$ in the dominant term and achieves improved (optimal) regret guarantees in the setting where the number of arms $K$ is significantly larger than the number of players $N$. In addition, we consider adversarial markets where the observed rewards of the agents may be corrupted. We propose and analyze algorithms for robust markets with CPT as a risk-sensitive measure, both when the total corruption budget is known and when it is unknown, and establish logarithmic player-optimal regret guarantees in both cases.


2606.20002 2026-06-19 cs.LG cs.AI cs.CL 新提交

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

Connect the Dots：通过强化学习训练具备跨域泛化能力的长期生命周期智能体

Yanxi Chen, Weijie Shi, Yuexiang Xie, Boyi Hu, Yaliang Li, Bolin Ding, Jingren Zhou

发表机构 * Alibaba Group（阿里巴巴集团）

AI总结提出Connect the Dots框架，通过端到端强化学习训练LLM在长期任务中自我更新上下文并泛化到新领域，实验验证了跨域泛化能力。

Comments Work in progress; we will continuously update the codebase and arXiv version

详情

AI中文摘要

本文提出了一个通用框架，用于训练大型语言模型（LLMs）具备“Connect the Dots”（CoD）这一元能力，该能力是长期生命周期智能体所必需的：当基于LLM的AI智能体部署在环境中时，它解决一系列长期任务，同时持续探索环境、从自身经验中学习，并迭代地自我更新关于环境的上下文，从而在更新上下文的条件下，在未来任务上实现逐步更好的性能。CoD框架的主要组成部分包括：（1）用于端到端强化学习（RL）的算法设计和基础设施，其中包含交替执行任务和更新上下文的长展开序列；（2）用于在训练过程中激励和激发LLM中目标元能力的任务和环境，以及在评估过程中忠实衡量进展的任务和环境。我们展示了CoD框架的概念验证实现，包括具有细粒度信用分配的GRPO风格RL算法，以及针对目标元能力（而非特定领域的LLM能力或标准的逐任务RL）量身定制的任务和环境。实证结果验证了CoD设置中端到端RL训练的有效性，并展示了所激发元能力的分布外泛化潜力——在训练领域内、跨不同领域以及从CoD到Ralph-loop设置中。我们对CoD的研究连接了多项先前工作，并为推进LLM和AI智能体开辟了新的机遇。为促进进一步研究和应用，我们在\url{https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod}上发布了我们的实现。

英文摘要

This work presents a general framework for training large language models (LLMs) to "Connect the Dots" (CoD), a meta-capability required by long-lifecycle agents: as an LLM-based AI agent gets deployed in an environment, it solves a long sequence of tasks while continuously exploring the environment, learning from its own experiences, and iteratively self-updating its context about the environment, thereby achieving progressively better performance on future tasks conditioned on the updated context. Major components of the CoD framework include: (1) algorithm design and infrastructure for end-to-end reinforcement learning (RL) with long rollout sequences interleaving solve-task and update-context episodes; (2) tasks and environments for incentivizing and eliciting the targeted meta-capability in LLMs during training, as well as for faithfully measuring progress during evaluation. We present proof-of-concept implementations of the CoD framework, including a GRPO-style RL algorithm with fine-grained credit assignment, as well as tasks and environments tailored to the targeted meta-capability (rather than domain-specific LLM capabilities or standard task-by-task RL). Empirical results validate the efficacy of end-to-end RL training in the CoD setting, and demonstrate the potential for out-of-distribution generalization -- within the training domains, across different domains, and from CoD to Ralph-loop settings -- of the elicited meta-capability. Our investigation of CoD connects several lines of prior works, and opens up new opportunities for advancing LLMs and AI agents. To facilitate further research and applications, we release our implementations at \url{https://github.com/agentscope-ai/Trinity-RFT/tree/research/cod/examples/research_cod}.


2606.20008 2026-06-19 cs.LG 新提交

VIMPO: Value-Implicit Policy Optimization for LLMs

VIMPO: 值隐式策略优化用于大语言模型

Zhewei Kang, Aosong Feng, Sergey Levine, Dawn Song, Xuandong Zhao

发表机构 * UC Berkeley（加州大学伯克利分校）； Yale University（耶鲁大学）

AI总结提出VIMPO方法，通过KL正则化强化学习的最优条件导出策略隐含值函数，无需训练评论家，实现细粒度信用分配，在数学推理基准上优于GRPO。

详情

AI中文摘要

基于可验证奖励的强化学习已成为提升大语言模型推理能力的核心工具，但当前方法在简单性与信用分配之间存在权衡。GRPO等群组相对方法避免了训练评论家，但通常为每个token分配轨迹级优势。Actor-critic方法提供更密集的学习信号，但需要学习值函数，其自身存在训练不稳定性。我们提出VIMPO，一种无需评论家的策略优化方法，从KL正则化强化学习的最优条件推导出策略隐含值函数。对于自回归生成，得到的值递归可以用策略-参考对数比率表示，并由轨迹结束时无未来奖励的终止条件锚定。这给出了一个简单的值损失，它结合了结果级可验证奖励，而无需训练评论家。相同的推导也产生了无需评论家的actor优势，使VIMPO能够通过值损失分离奖励合并，并通过PPO风格的actor更新进行策略改进。在数学RLVR基准上，VIMPO在MATH-500、AIME 2024、AIME 2025和OlympiadBench上均优于GRPO，尤其在竞赛式评估中提升更大。在噪声奖励下，VIMPO保持对GRPO的持续优势，表明策略隐含值优化可以在保持无评论家训练实用简单性的同时提供更精细的信用分配。

英文摘要

Reinforcement learning with verifiable rewards has become a central tool for improving the reasoning ability of large language models, but current methods face a trade-off between simplicity and credit assignment. Group-relative methods such as GRPO avoid training a critic, but typically assign a trajectory-level advantage to every token. Actor-critic methods provide denser learning signals, but require a learned value function with its own training instability. We introduce VIMPO, a critic-free policy optimization method that derives a policy-implied value function from the optimality conditions of KL-regularized reinforcement learning. For autoregressive generation, the resulting value recurrence can be written in terms of policy-reference log-ratios and anchored by the terminal condition that no future reward remains at the end of a trajectory. This gives a simple value loss that incorporates outcome-level verifiable rewards without training a critic. The same derivation also yields a critic-free actor advantage, allowing VIMPO to separate reward incorporation through the value loss from policy improvement through a PPO-style actor update. On mathematical RLVR benchmarks, VIMPO improves over GRPO across MATH-500, AIME 2024, AIME 2025, and OlympiadBench, with especially large gains on competition-style evaluations. Under noisy rewards, VIMPO retains a consistent advantage over GRPO, suggesting that policy-implied value optimization can provide finer credit assignment while preserving the practical simplicity of critic-free training.


2606.20014 2026-06-19 cs.LG cs.AI 新提交

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

多智能体博弈中的层次化控制：基于LLM的规划与RL执行

Jannik Hösch, Alessandro Sestini, Florian Fuchs, Amir Baghi, Joakim Bergdahl, Konrad Tollmar, Jean-Philippe Barrette-LaPierre, Linus Gisslén

AI总结提出LLM作为中央策略控制器选择RL技能策略的层次化架构，在2v2对抗环境中达到与手工BT相当的胜率，且被感知为最类人。

Comments 12 pages, 9 figures

详情

AI中文摘要

强化学习（RL）在序列决策中取得了强劲表现，但由于稀疏奖励、大状态-动作空间以及学习协调策略的困难，扩展到复杂多智能体环境仍具挑战。我们提出一种层次化架构，其中预训练的大语言模型（LLM）作为集中式策略控制器，为一组智能体选择专门的RL技能策略，而RL策略负责反应式底层执行。我们在竞争性2v2 King of the Hill环境中评估该混合系统，与行为树（BT）和“扁平”RL（无技能分解的端到端训练）基线进行比较。LLM+RL系统实现了与手工BT统计上相当的任务性能（胜率46.4% vs 51.5%，p=0.103），而两者均显著优于无技能分解训练的扁平RL。一项用户研究（n=15）显示，60%的参与者认为LLM+RL智能体最像人类（p=0.027），归因于行为适应性和战术变异性。这些结果表明，预训练LLM推理可以有效编排预训练RL技能，实现具有竞争力的多智能体协调和优越的感知可信度，而无需手动规则工程。

英文摘要

Reinforcement learning (RL) has achieved strong performance in sequential decision-making, yet scaling to complex multi-agent environments remains challenging due to sparse rewards, large state-action spaces, and the difficulty of learning coordinated strategies. We propose a hierarchical architecture where a pretrained large language model (LLM) acts as a centralized strategic controller that selects among specialized RL skill policies for a team of agents, while RL policies handle reactive low-level execution. We evaluate this hybrid system in a competitive 2v2 King of the Hill environment against behavior tree (BT) and \emph{``Flat''} RL (end-to-end training without skill decomposition) baselines. The LLM+RL system achieves task performance statistically equivalent to hand-crafted BT (46.4\% vs 51.5\% win rate, $p=0.103$) while both significantly outperform Flat RL trained without skill decomposition. A user study ($n=15$) reveals that 60\% of participants perceive LLM+RL agents as the most human-like ($p=0.027$), citing behavioral adaptability and tactical variability. These results demonstrate that pretrained LLM reasoning can effectively orchestrate pretrained RL skills, achieving competitive multi-agent coordination and superior perceived believability without manual rule engineering.


2606.20104 2026-06-19 cs.LG cs.AI 新提交

Sensorimotor World Models: Perception for Action via Inverse Dynamics

传感器运动世界模型：通过逆动力学实现面向行动感知

Petr Ivashkov, Randall Balestriero, Bernhard Schölkopf

发表机构 * Max Planck Institute for Intelligent Systems（马克斯·普朗克智能系统研究所）； Department of Computer Science, Brown University（布朗大学计算机科学系）； ELLIS Institute（ELLIS研究所）； ETH Zürich（苏黎世联邦理工学院）

AI总结提出传感器运动世界模型（SMWM），通过逆动力学正则化端到端训练潜空间世界模型，防止表示崩溃并学习与行动对齐的紧凑表示，在2D和3D控制任务中实现竞争性规划性能。

详情

AI中文摘要

面向行动的感知表明，世界的表示不应仅由视觉保真度决定，而应由其与行动的相关性决定。同时，潜在的JEPA风格世界模型主张从高维观测中学习紧凑的预测状态以促进未来状态的预测，但这些模型的端到端训练并非易事，因为如果我们的唯一目标是构建易于预测的潜在状态，表示可能会崩溃。我们引入了一种传感器运动世界模型（SMWM）：一种通过逆动力学正则化进行端到端训练的潜在世界模型。这一单一正则化解决了两个问题：它防止表示崩溃并诱导与行动对齐的表示。通过迫使潜在状态保留关于转换背后行动的信息，它使模型偏向于环境中可控的自由度，同时丢弃不可控的干扰因素。这产生了从离线、无奖励轨迹中训练的稳定潜在世界模型，无需冻结编码器、指数移动平均或复杂的潜在正则化。实验表明，SMWM学习了紧凑、可解释的潜在空间，并在简单的2D和3D控制任务中实现了竞争性的规划性能。

英文摘要

Perception for action suggests that representations of the world should be shaped not by visual fidelity alone, but by their relevance for actions. At the same time, latent JEPA-style world models advocate learning compact predictive states from high-dimensional observations to facilitate the prediction of future states, but end-to-end training of these models is nontrivial because representations may collapse if our only goal is to construct a latent state that is easy to predict. We introduce a sensorimotor world model (SMWM): a latent world model trained end-to-end with inverse dynamics regularization. This single regularizer addresses both issues: it prevents representation collapse and induces action-aligned representations. By forcing latent states to preserve information about the action underlying a transition, it biases the model toward the controllable degrees of freedom of the environment while discarding uncontrollable distractors. This yields stable latent world models trained from offline, reward-free trajectories, without frozen encoders, exponential moving averages, or complex latent regularizers. Empirically, SMWM learns compact, interpretable latent spaces and enables competitive planning performance across simple 2D and 3D control tasks.


2606.20107 2026-06-19 cs.LG 新提交

Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

大环肽是细胞内靶点的有前景的治疗候选物，但其设计需要同时控制非天然单体化学、环拓扑、膜通透性和靶点结合。现有的SMILES或HELM字符串生成模型要么在长原子级序列空间中操作，要么将单体视为化学基础有限的符号化令牌。我们引入了PepALD，一个用于从头生成大环肽的自回归潜在扩散（ALD）基础模型。该模型使用结构化化学嵌入表示HELM单体，通过在化学信息潜在空间中的上下文条件扩散生成每个残基，在自回归生成过程中预测R基团感知的环闭合，并使用胜者保护的扩散自适应偏好优化将去噪器与亲和力奖励对齐。计算机模拟（in silico）实验在与代表性肽生成基线的对比中展示了PepALD的生成质量和奖励优化性能。

英文摘要

Macrocyclic peptides are promising therapeutic candidates for intracellular targets, but their design requires simultaneous control over non-natural monomer chemistry, ring topology, membrane permeability, and target binding. Existing SMILES- or HELM-string generative models either operate in long atom-level sequence spaces or treat monomers as symbolic tokens with limited chemical grounding. We introduce PepALD, an Autoregressive Latent Diffusion (ALD) foundation model for \textit{de novo} macrocyclic peptide generation. The model represents HELM monomers with structured chemical embeddings, generates each residue through context-conditioned diffusion in chemically informed latent space, predicts R-group-aware ring closures during autoregressive generation, and aligns the denoiser to affinity rewards using winner-protected diffusion-adapted preference optimization. In silico experiments demonstrate PepALD's generation quality and reward-optimization performance against representative peptide generation baselines.


2606.20416 2026-06-19 cs.LG cs.CV 新提交

On the Redundancy of Timestep Embeddings in Diffusion Models

扩散模型中时间步嵌入的冗余性研究

José A. Chávez

发表机构 * Independent Researcher, Lima, Peru（独立研究者，秘鲁利马）

AI总结本文通过理论和实验证明，在U-Net和Diffusion Transformer架构中，扩散模型无需显式时间步嵌入也能达到全局最优，甚至在某些指标上超越有条件模型。

Comments 17 pages

详情

AI中文摘要

扩散模型严重依赖显式的时间步嵌入来调节不同噪声尺度下的去噪过程。在这项工作中，我们通过分析时间步嵌入对U-Net和Diffusion Transformer架构的影响，挑战了这些时间信号的必要性。除了经验证据外，我们提供了一个理论框架，证明在某些条件下，无需显式时间步条件即可达到扩散训练目标的全局最小值。我们的发现揭示了当完全移除时间步嵌入时令人惊讶的鲁棒性。在CelebA和CIFAR-10数据集上的大量消融研究表明，这些时间无关模型可以保持高结构保真度，甚至在竞争性指标（包括FID、精确率和召回率）上超越其有条件对应模型。我们的分析表明，这些架构可以在特定假设下从损坏输入中隐式推断噪声尺度，使得显式时间条件变得冗余。这项研究挑战了长期以来的时间条件范式，并为更高效、更注重结构的生成架构铺平了道路。

英文摘要

Diffusion models rely heavily on explicit timestep embeddings to modulate the denoising process across various noise scales. In this work, we challenge the necessity of these temporal signals by analyzing their impact on U-Net and Diffusion Transformer architectures. Beyond empirical evidence, we provide a theoretical framework demonstrating that, under certain conditions, the global minimizer of the diffusion training objective can be achieved without explicit timestep conditioning. Our findings reveal a surprising robustness when timestep embeddings are completely removed. Extensive ablation studies on the CelebA and CIFAR-10 datasets show that these time-agnostic models can maintain high structural fidelity and even surpass their conditioned counterparts in competitive metrics, including FID, precision, and recall. Our analysis suggests these architectures can implicitly infer noise scales from the corrupted input under specific assumptions, rendering explicit temporal conditioning redundant. This study challenges long-standing temporal conditioning paradigms and paves the way for more efficient and structurally focused generative architectures.


2606.19361 2026-06-19 cs.LG cs.AI cs.NA math.NA stat.CO stat.ME stat.ML 新提交

Computational Identifiability

计算可识别性

Lucius E. J. Bynum, Rajesh Ranganath, Kyunghyun Cho

发表机构 * New York University（纽约大学）

AI总结提出“计算可识别性”框架，通过有限计算搜索过程在指定误差容限内找到经验估计量，从而解决理论可识别性在有限样本、模糊图标准等实际场景中的不足。

详情

AI中文摘要

识别条件描述了目标查询或感兴趣参数作为可用信息类型和数量的函数的可计算性。在因果识别中，这些信息通常以因果图的形式表达，数据是针对图中某些变量子集观测或收集的。目标查询可以是单个效应，也可以是给定模型中的一类效应。识别算法的推导在数学上定义了期望中理论上唯一确定所需因果效应的过程。期望中的可识别性，即“理论可识别性”，通常假设渐近性质、无限数据或其他数学理想化条件。在本文中，我们探讨了这种理论理想化的可识别性与一种受计算限制的替代方案之间的根本区别。我们提出的框架——“计算可识别性”——而是为经验估计量定义一个有限的计算搜索过程。如果该过程在期望的误差容限内经验性地找到了估计量，则满足可识别性，条件取决于搜索的指定假设（即参数上的先验分布）以及搜索过程本身。通过多个实验，我们展示了该框架如何回答细粒度的实际识别问题，例如小有限样本下的识别、模糊图标准下的识别、混合观测-干预数据下的识别，以及跨反事实数据和估计量的识别。代码见 https://github.com/lbynum/metadentify。

英文摘要

Identification conditions describe the computability of a target query or parameter of interest as a function of the type and amount of information available. In causal identification, this information is often expressed in the form of a causal graph, and data are observed or collected for some subset of variables in the graph. Target queries may be for a single effect alone or for a class of effects in a given model. The derivation of an identification algorithm then defines mathematically the process by which the desired causal effect(s) can be uniquely determined, theoretically, in expectation. Identifiability in expectation, or 'theoretical identifiability,' generally assumes asymptotic properties, infinite data, or other mathematically idealized conditions. In this paper, we explore a fundamental distinction between this theoretical, idealized notion of identifiability and a proposed alternative that is computation-bound. The framework we propose - 'computational identifiability' - is to instead define a finite computational search procedure for an empirical estimator. If this process finds an estimator empirically, within a desired error tolerance, then identifiability is satisfied, conditional on the specified assumptions of the search (i.e., a prior distribution over the parameters) and conditional on the search procedure itself. Through several experiments, we demonstrate how this framework allows us to answer fine-grained, practical identification questions, such as identification with small finite samples, with ambiguous graphical criteria, with mixed observational-interventional data, and across counterfactual data and estimands. Code is available at https://github.com/lbynum/metadentify.


2606.19366 2026-06-19 cs.LG cs.AI eess.SP 新提交

Information Lattice Learning as Probabilistic Graphical Model Structure Learning

信息格学习作为概率图模型结构学习

Haizi Yu, Lav R. Varshney

发表机构 * Kocree, Inc.（Kocree公司）； AI Innovation Institute, Stony Brook University（石溪大学人工智能创新研究所）

AI总结将信息格学习（ILL）解释为概率图模型结构学习，通过投影到分区格上学习可解释规则，并建立与最大熵和因子图的联系。

详情

AI中文摘要

信息格学习（ILL）通过将信号交替投影到编码抽象层次结构的分区格上，并将选定的规则提升回信号域，来学习信号的可解释规则。当信号是概率质量函数时，我们证明ILL学习的概率规则具有自然的概率图模型（PGM）解释，并详细发展了这一解释。ILL中的分区诱导出一个确定性的商变量，规则是该商变量的边际分布。因此，规则集是可解释抽象上的边际约束集合。一般提升是满足这些约束的所有联合分布的可行族，而特殊提升则选择最大无知重建，在ILL中通过L2均匀性原理实现，该原理与最大熵密切相关。在香农熵提升下，相同的约束产生一个对数线性因子图，其因子由学习的抽象索引。然而，信息格本身不是贝叶斯网络：其边编码抽象的细化与粗化，而非条件依赖。因此，ILL最好被视为商变量上可解释的基于约束的因子图的结构学习。这一观点阐明了ILL如何与图模型和最大熵模型相关，同时为推理、可识别性和混合符号-概率学习提出了新方向。

英文摘要

Information lattice learning (ILL) learns interpretable rules of a signal by alternately projecting the signal onto a partition lattice that encodes a hierarchy of abstractions and lifting selected rules back to the signal domain. When the signal is a probability mass function, we show the probabilistic rules learned by ILL admit a natural probabilistic graphical model (PGM) interpretation and develop this interpretation in detail. A partition in ILL induces a deterministic quotient variable, and a rule is the marginal law of that quotient variable. A rule set is therefore a collection of marginal constraints over interpretable abstractions. General lifting is the feasible family of all joint distributions satisfying those constraints, while special lifting chooses a maximum-ignorance reconstruction, implemented in ILL by an L2 uniformity principle closely related to maximum entropy. Under a Shannon-entropy lifting, the same constraints yield a log-linear factor graph whose factors are indexed by learned abstractions. The information lattice itself, however, is not a Bayesian network: its edges encode refinement and coarsening of abstractions, not conditional dependence. Thus ILL is best viewed as structure learning for interpretable constraint-based factor graphs over quotient variables. This view clarifies how ILL relates to graphical models and maximum entropy models, while suggesting new directions for inference, identifiability, and hybrid symbolic-probabilistic learning.


2606.19367 2026-06-19 cs.LG 新提交

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

Weibull 权重尺度参数在 AdamW 训练动态下的演化

Tiexin Ding

发表机构 * Independent Researcher（独立研究员）

AI总结研究 AdamW 训练中 Weibull 权重尺度参数 λ 增长、过冲和松弛的原因，推导出三种力（对齐、注入、衰减）的分解，并在 Pythia-70M 模型上验证对齐力主导上升阶段，贡献 88-94%。

Comments 21 pages, 14 figures

详情

AI中文摘要

基于用于诊断 Transformer 权重分布的双参数 Weibull 框架，我们研究了为什么在 AdamW 训练期间 Weibull 权重尺度参数 λ 会增长、过冲然后松弛。我们从 AdamW 更新中推导出平方权重范数的领先阶三力分解：一个对齐力，测量权重与自适应更新方向之间的相关性；一个注入力，来自自适应步长幅度；以及一个衰减力，来自解耦的权重衰减。在具有真实优化器矩的自训练 Pythia-70M 模型上，对齐力主导上升阶段，在四个随机种子中贡献了绝对力预算的 88-94%，并且对超权重移除具有鲁棒性。接近饱和时，对齐力和衰减力趋于平衡，解释了从权重尺度增长到松弛的转变。这些力动态直接控制 λ(t) 背后的平方范数分量；剩余的 RMS 到 Weibull 重建偏移是可测量的，并分解为桥接分量和积分分量，在密集采样区域总计约 5-6%。为了将分析扩展到无法获得优化器矩的真实模型，我们引入了一种样条位移方法，该方法从稀疏检查点以约 92-94% 的准确率恢复对齐力，大约是朴素两点基线的两倍。我们进一步观察到，在我们的实验中，λ(t) 的峰值随训练数据一致性而变化，这表明权重尺度增长存在数据依赖成分，我们将其留待后续对照研究。代码和数据可在 https://github.com/tiexinding/NPM-Weibull-public 获取。

英文摘要

Building on a two-parameter Weibull framework for diagnosing transformer weight distributions, we study why the Weibull weight-scale parameter $λ$ grows, overshoots, and then relaxes during AdamW training. We derive a leading-order three-force decomposition of the squared weight norm from the AdamW update: an alignment force measuring the correlation between weights and the adaptive update direction, an injection force from adaptive step magnitude, and a decay force from decoupled weight decay. On self-trained Pythia-70M models with ground-truth optimizer moments, alignment dominates the rise phase, contributing 88-94% of the absolute force budget across four random seeds and remaining robust to super-weight removal. Near saturation, alignment and decay approach balance, explaining the transition from weight-scale growth to relaxation. These force dynamics directly govern the squared-norm component underlying $λ(t)$; the remaining RMS-to-Weibull reconstruction offset is measurable and decomposes into bridge and integration components, totaling approximately 5-6% in densely sampled regions. To extend the analysis to real models where optimizer moments are unavailable, we introduce a spline displacement method that recovers the alignment force from sparse checkpoints with approximately 92-94% accuracy, about twice the naive two-point baseline. We further observe that the peak value of $λ(t)$ varies with training-data coherence in our experiments, suggesting a data-dependent component of weight-scale growth that we leave to a controlled follow-up study. Code and data are available at https://github.com/tiexinding/NPM-Weibull-public.
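The three-force split described in this abstract can be illustrated with a small NumPy sketch. It is not the authors' code: the function name `adamw_step_forces`, the hyperparameter defaults, and the exact grouping of terms are assumptions made here for illustration. The sketch expands the squared weight norm after one textbook AdamW step into an alignment term, an injection term, a decay term, and a small second-order remainder. The weight-decay coefficient is called `wd` to avoid a clash with the Weibull scale parameter $λ$.

```python
import numpy as np

def adamw_step_forces(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
                      eps=1e-8, wd=0.1):
    """One AdamW step plus an exact split of the change in ||w||^2.

    Illustrative only: names and term grouping are assumptions, not the
    paper's implementation.
    """
    # Standard Adam moment updates with bias correction.
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    u = m_hat / (np.sqrt(v_hat) + eps)   # adaptive update direction

    # Decoupled weight decay: w <- w - lr * (u + wd * w).
    w_new = w - lr * (u + wd * w)

    forces = {
        # Correlation between the weights and the adaptive update direction.
        "alignment": -2.0 * lr * np.dot(w, u),
        # Contribution of the adaptive step magnitude.
        "injection": lr**2 * np.dot(u, u),
        # Contribution of decoupled weight decay.
        "decay": -2.0 * lr * wd * np.dot(w, w),
        # Higher-order cross terms in lr^2 * wd, dropped at leading order.
        "remainder": lr**2 * (2.0 * wd * np.dot(w, u) + wd**2 * np.dot(w, w)),
    }
    return w_new, m, v, forces

rng = np.random.default_rng(0)
w = rng.normal(size=64)
g = rng.normal(size=64)
w_new, m, v, forces = adamw_step_forces(w, g, np.zeros(64), np.zeros(64), t=1)

# The four terms add up to the actual change in the squared norm.
delta = np.dot(w_new, w_new) - np.dot(w, w)
```

Summing `forces.values()` reproduces `delta` up to floating-point error, so tracking the three leading terms over training shows which one drives the growth of the squared norm at each step.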


2606.19369 2026-06-19 cs.LG cs.AI 新提交

Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms

零膨胀高斯分布使估计分布算法中的参数空间稀疏化

Andreas Faust, Sven Nitzsche, Juergen Becker

发表机构 * University of Freiburg（弗莱堡大学）； FZI Research Center for Information Technology（FZI信息技术研究中心）； Karlsruhe Institute of Technology（卡尔斯鲁厄理工学院）

AI总结提出多元零膨胀高斯分布作为估计分布算法的采样分布，联合优化稀疏模式和活跃参数，无需手工设计稀疏算子，在Lunar Lander基准上收敛更快且最终回报更高。

详情

AI中文摘要

估计分布算法（EDA）是一类强大的黑箱优化进化方法，尤其当目标函数结构未知时。经典进化算法依赖于手工设计的变异和交叉算子，这些算子难以针对未知问题结构设计，且是偏差的来源，而EDA完全绕过了算子设计：它们将概率分布拟合到最佳个体，并从中采样下一代。EDA在连续参数空间上已得到充分确立，但此前尚未推广到稀疏空间——其中良好解的大多数系数恰好为零。现有的稀疏黑箱优化器因此重新引入了EDA旨在避免的东西：手工制作的稀疏算子、支持集与活跃值交替的双层方案、零阈值以及其他内置假设。我们通过提出多元零膨胀高斯（ZIG）分布作为EDA采样法则来填补这一空白。一个具有独立指示维度和值维度的潜在高斯模型表示稀疏模式、活跃参数之间的相关性以及两者之间的相互作用，因此稀疏模式和活跃值被联合优化，无需层次结构。我们证明该模型的潜在参数可以从观测样本中识别，不同于相关构造起源的缺失数据设置，并引入了实用的基于摊销反演的估计器。这些估计器准确恢复潜在相关结构，在Lunar Lander基准上，由此产生的ZIG-EDA比稠密高斯EDA、手工制作的稀疏进化算法和特设稀疏EDA收敛更快且最终回报更高，同时找到的控制器只有一小部分参数活跃。

英文摘要

Estimation-of-distribution algorithms (EDAs) are a powerful class of evolutionary methods for black-box optimization, especially when little is known about the structure of the objective. Whereas classical evolutionary algorithms rely on hand-designed mutation and crossover operators, hard to devise for unknown problem structures, and a source of bias, EDAs sidestep operator design entirely: they fit a probability distribution to the best individuals and sample the next generation from it. EDAs are well established on continuous parameter spaces, but they have not previously been generalized to sparse ones, in which most coefficients of a good solution are exactly zero. Existing sparse black-box optimizers therefore reintroduce exactly what EDAs were designed to avoid: hand-crafted sparsity operators, bi-level schemes alternating between support set and active values, zeroing thresholds, and other baked-in assumptions. We close this gap by proposing multivariate zero-inflated Gaussian (ZIG) distributions as EDA sampling laws. A latent Gaussian model with separate indicator and value dimensions represents sparsity patterns, correlations among active parameters, and the interactions between the two, so sparsity patterns and active values are optimized jointly, hierarchy-free. We show that the latent parameters of this model are identifiable from observed samples, unlike in the missing-data settings where related constructions originate, and introduce practical amortized inversion-based estimators for them. The estimators accurately recover latent correlation structures, and on the Lunar Lander benchmark the resulting ZIG-EDA converges faster and reaches higher final returns than a dense Gaussian EDA, a hand-crafted sparse evolutionary algorithm, and an ad-hoc sparse EDA, while finding controllers with only a small fraction of parameters active.
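One way to picture a latent Gaussian with separate indicator and value dimensions is the sampling sketch below. It is a hypothetical construction, not the paper's model: the function name `sample_zig`, the stacking of indicator dimensions before value dimensions, and the rule that a coordinate is active when its latent indicator is positive are all assumptions made for illustration. The abstract does not specify the gating rule or the estimators.

```python
import numpy as np

def sample_zig(mean, cov, n_samples, rng):
    """Draw sparse vectors from a latent Gaussian over [indicators, values].

    Hypothetical sketch: the first d latent dimensions gate the last d.
    """
    d = mean.shape[0] // 2
    z = rng.multivariate_normal(mean, cov, size=n_samples)
    indicators, values = z[:, :d], z[:, d:]
    active = indicators > 0.0            # assumed gating rule
    # Inactive coordinates are exactly zero; active ones keep their value.
    return np.where(active, values, 0.0), active

d = 4
# Indicator means control how often each coordinate is active.
mean = np.concatenate([np.array([-1.0, -1.0, 1.0, 1.0]), np.full(d, 0.5)])
cov = np.eye(2 * d)
cov[0, 1] = cov[1, 0] = 0.5              # correlated sparsity pattern

rng = np.random.default_rng(0)
x, active = sample_zig(mean, cov, n_samples=2000, rng=rng)
```

Because sparsity pattern and values come from one joint Gaussian, off-diagonal entries of `cov` can couple which coordinates switch on together and how their values co-vary, which is the property the abstract attributes to the ZIG sampling law.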


2606.19491 2026-06-19 cs.LG stat.ML 新提交

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

LayerNorm Transformer 中的代数死方向：一种仅需前向传播的大语言模型规模诊断方法

Tejas Pradeep Shirodkar, P. J. Narayanan

发表机构 * IIIT, Hyderabad（海得拉巴国际信息技术学院）

AI总结本文发现 LayerNorm 的逆尺度方向是后最终归一化中心激活协方差矩阵的精确代数核，可仅从参数中读取死方向，无需前向或后向传播，并在 14 个预训练模型上验证了其有效性。

Comments 34 pages, 7 figures, 6 tables. Empirical companion to arXiv:2606.05957

详情

AI中文摘要

预训练 Transformer 位于损失函数的奇异极小值附近，此时 Fisher 信息度量沿死方向退化：参数空间中方向性 Fisher 为零的方向。通常定位这样的方向需要一次前向传播和激活矩阵的特征分解，或基于采样的复杂度估计；没有一种方法能仅从网络参数计算方向。我们针对 LayerNorm Transformer 给出了一个这样的方向。LayerNorm 仿射的逆尺度方向 $\gamma^{-1}/\|\gamma^{-1}\|$ 是后最终归一化中心激活协方差矩阵的精确代数核，适用于任何输入分布，并在参数空间中诱导出相应的死方向。它仅从 LN 尺度参数读取，无需前向或后向传播，无需特征分解：这是针对 LayerNorm 的最廉价死方向读取方法。我们在 14 个预训练 Transformer（9 个 LayerNorm，5 个 RMSNorm；160M-35B；语言和视觉目标）上进行了测试。在随机初始化时，预测方向与测量的底部奇异方向（一次前向传播，直接 SVD）在 9/9 的 LayerNorm 模型上匹配到小数点后四位，并在 5/5 的 RMSNorm 模型上正确缺失，后者缺乏产生该方向的均值减法投影器。在训练后的检查点上，沿该方向的协方差特征值加深约 ${\sim}10^3$ 倍，并打开更多死方向；随机初始化到训练后的差距是一次前向传播、每检查点沿预测坐标的奇异结构读出。由此得出两个闭式结论：残差流的最小奇异值在 13/14 个 Transformer 上逐块保持不变（在其自身输入分布上测量），唯一的例外（Gemma$4$-$31$B）是一个真正的死方向，同一读出可精确定位；核方向的存在从参数本身即可对 Transformer 的归一化进行分类。

英文摘要

Pretrained transformers sit near singular minima of the loss, where the Fisher information metric degenerates along dead directions: directions in parameter space along which the directional Fisher vanishes. Locating such a direction normally needs a forward pass and an eigendecomposition of activations, or a sampling-based complexity estimate; none returns a direction computable from the network's parameters alone. We give one, for LayerNorm transformers. The inverse-scale direction $γ^{-1}/\|γ^{-1}\|$ of the LayerNorm affine is an exact algebraic kernel of the post-final-norm centred activation covariance, for any input distribution, and induces a corresponding dead direction in parameter space. It is read from the LN scale parameter alone, with no forward or backward pass and no eigensolve: the cheapest dead-direction read, specific to LayerNorm. We test it on $14$ pretrained transformers ($9$ LayerNorm, $5$ RMSNorm; $160$M-$35$B; language and vision objectives). At random initialisation the predicted direction matches the measured bottom singular direction (one forward pass, direct SVD) to four decimal places on $9/9$ LayerNorm models, and is correctly absent on $5/5$ RMSNorm models, which lack the mean-subtraction projector that creates it. On the trained checkpoint the covariance eigenvalue along this direction deepens by ${\sim}10^3\times$ and further dead directions open; the random-init-to-trained gap is a one-forward-pass, per-checkpoint readout of singular structure along the predicted coordinate. Two consequences follow in closed form: the residual stream's smallest singular value is preserved block-to-block on $13/14$ transformers measured on their own input distribution, the one exception (Gemma$4$-$31$B) a genuine dead direction the same read pinpoints; and the kernel direction's presence classifies a transformer's normalisation from the parameters alone.
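The kernel claim for the inverse-scale direction can be checked numerically on a standalone LayerNorm. The sketch below is an illustration under simple assumptions, not the authors' diagnostic: the inputs are synthetic, the `layer_norm` helper is written here, and every scale entry is assumed nonzero. The reasoning is that the normalized pre-affine activations sum to zero over the feature axis for every input, so the centred output, which equals $γ$ times a zero-sum vector, is orthogonal to $γ^{-1}$.

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Plain LayerNorm over the last axis with affine scale and shift."""
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mu) / np.sqrt(var + eps) + beta

rng = np.random.default_rng(0)
d, n = 16, 4096
gamma = rng.uniform(0.5, 2.0, size=d)    # assumed nonzero scale
beta = rng.normal(size=d)
# Arbitrary synthetic inputs with uneven per-feature scales.
x = rng.normal(size=(n, d)) * rng.uniform(0.1, 3.0, size=d)

y = layer_norm(x, gamma, beta)
yc = y - y.mean(axis=0)                  # centre over the batch
cov = yc.T @ yc / n                      # centred activation covariance

# Direction read from the scale parameter alone, with no forward pass.
v = (1.0 / gamma) / np.linalg.norm(1.0 / gamma)
residual = np.linalg.norm(cov @ v)
smallest_eig = np.linalg.eigvalsh(cov)[0]
```

Here `residual` and `smallest_eig` are zero up to floating-point error. Replacing the mean subtraction with an RMS-only normalization removes the zero-sum constraint, which is consistent with the abstract's statement that the direction is absent in RMSNorm models.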


2606.19521 2026-06-19 cs.LG math.OC 新提交

Interactive Pareto navigation for deep multi-task learning

深度多任务学习的交互式帕累托导航

Augustina C. Amakor, Konstantin Sonntag, Sebastian Peitz

发表机构 * Department of Computer Science, TU Dortmund, Dortmund, Germany（多特蒙德工业大学计算机科学系，德国多特蒙德）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔机器学习和人工智能研究所）

AI总结提出偏好帕累托探索（PPE）框架，通过预测-校正方法沿帕累托流形切线方向引导偏好，利用Krylov子空间方法避免Hessian计算，实现高效交互式多目标优化。

详情

AI中文摘要

在多任务学习中，处理越来越多的目标在计算资源和决策者选择适当权衡的能力方面都很快变得具有挑战性。因此，一种广泛使用的方法是通过加权和将各个损失聚合到单个损失函数中。这通常由于帕累托前沿的形状而无法捕捉决策者的偏好，或者需要多次调整和计算，这在深度学习应用中变得过于昂贵。为了解决这些问题，我们引入了一个新颖的框架，偏好帕累托探索（PPE），它在交互式探索过程中强制执行决策者的偏好，同时考虑帕累托集的几何形状。PPE基于预测-校正方法，该方法沿着帕累托最优解流形的切线方向执行预测步骤，遵循决策者的偏好。随后的校正步骤产生反映该偏好的新权衡。为了在表征流形切空间时避免显式的Hessian计算，我们采用了一种仅依赖于矩阵-向量乘积的Krylov子空间方法。这些乘积可以通过自动微分高效获得，确保了整个优化过程的效率和鲁棒性。该方法的有效性和性能通过玩具问题和深度学习示例进行了展示。

英文摘要

In multi-task learning, handling an increasing number of objectives can quickly become challenging, both in terms of the computational resources and the decision maker's capacity to choose appropriate trade-offs. A widely used approach is thus to aggregate the individual losses in a single loss function by a weighted sum. This often fails to capture either the decision maker's preferences as a result of the shape of the Pareto front, or requires multiple adjustments and computations which becomes prohibitively expensive in deep learning applications. To address these issues, we introduce a novel framework, Preference Pareto Exploration (PPE), which enforces the decision maker's preferences while accounting for the geometry of the Pareto set in an interactive exploration process. PPE is based on a predictor-corrector method that performs predictor steps tangential to the manifold of Pareto-optimal solutions, following the decision maker's preference. The subsequent corrector step results in a new trade-off reflecting this preference. To avoid explicit Hessian computations when characterizing the tangent space of the manifold, we employ a Krylov subspace method that relies solely on matrix-vector products. These products can be efficiently obtained via automatic differentiation, ensuring both efficiency and robustness throughout the optimization process. The method's functionality and performance are demonstrated using both toy problems and examples from deep learning.


2606.19652 2026-06-19 cs.LG 新提交

Convex training of Lipschitz-regularized shallow neural networks

Lipschitz正则化浅层神经网络的凸训练

Chao Yin, Antoine Lesage-Landry

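Affiliations: Polytechnique Montréal, GERAD & Mila, Montréal, QC, Canada

AI summary: Proposes a convex restriction of the non-convex Lipschitz-regularized training problem that can be solved to global optimality and used as a post-processing step on a pre-trained network, improving adversarial robustness and accuracy.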

详情

AI中文摘要

在这项工作中，我们引入了一种针对浅层神经网络的训练程序，该程序能够提升对对抗攻击的鲁棒性。我们通过引入一个凸限制来解决非凸的Lipschitz正则化训练问题，该凸限制可以高效地求解全局最优解。我们的方法可以作为后处理步骤，将预训练网络作为初始解，然后求解凸规划，其最优网络保证不劣于初始网络。我们通过在对抗设置下使用真实世界数据集进行回归任务的实验，展示了我们训练程序的改进。数值结果表明，与现有方法相比，求解我们提出的凸规划得到的网络在Lipschitz正则化程序上具有更低的目标值。此外，我们表明，在某些数据集上，使用我们的凸训练程序获得的网络在对抗攻击下既更准确又更鲁棒。

英文摘要

In this work, we introduce a training procedure for shallow neural networks that promotes robustness against adversarial attacks. We solve a non-convex Lipschitz-regularized training program by introducing a convex restriction that can be efficiently solved to global optimality. Our approach can be employed as a post-processing step by taking a pre-trained network as an initial solution and then solving the convex program, whose optimal network is guaranteed to be no worse than the initial one. We illustrate the improvements of our training procedure with experiments using real-world datasets for regression tasks under an adversarial setting. We show numerically that solving our proposed convex program yields networks with lower objective values on the Lipschitz-regularized program compared to existing methods. Additionally, we show that on certain datasets, networks obtained using our convex training program are both more accurate and more robust with respect to adversarial attacks.


2606.19876 2026-06-19 cs.LG math.OC 新提交

Global Convergence of Gradient Descent for Score Matching in Gaussian Mixtures via Reverse Fisher Divergence

通过反向Fisher散度实现高斯混合模型中得分匹配的梯度下降全局收敛

Alexander Tyurin

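AI summary: Studies the global convergence of gradient descent for fitting Gaussian mixture models under the reverse Fisher divergence, proving that student components converge near their closest teacher components from arbitrary or random initialization, and gives conditions for convergence in total variation distance.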

详情

AI中文摘要

得分匹配问题是现代生成建模、扩散模型、拟合非归一化统计模型和逆问题中的核心训练目标。标准方法是最小化前向Fisher散度，其中期望相对于教师分布取。然而，最近结果表明，即使在简单的高斯混合模型设置中，该目标也可能导致不良且依赖初始化的收敛行为。本文研究另一种目标：反向Fisher散度，其中期望相对于学生分布取。我们分析梯度下降（GD）拟合高斯混合模型，并表明目标函数的这一改变导致显著更好的优化性质。首先，当教师分布是单个高斯分布且学生是固定权重和单位协方差的高斯混合模型时，我们证明了从任意初始化出发GD的全局收敛性。其次，我们将分析扩展到教师也是高斯混合模型的情况，并在全局随机初始化方案和目标均值满足$\widetilde{\Omega}(1)$-分离假设下证明了全局收敛保证。特别地，以高概率，每个学生分量收敛到其最近的教师分量，并且我们提供了学生分布在全变差距离下收敛的条件。我们的证明依赖于基于Lyapunov的梯度下降动力学新分析，表明反向Fisher散度比前向Fisher散度具有更有利的优化景观。

英文摘要

The score matching problem is a central training objective in modern generative modeling, diffusion models, fitting unnormalized statistical models, and inverse problems. A standard approach is to minimize the forward Fisher divergence, where the expectation is taken with respect to the teacher distribution. However, recent results show that even in simple Gaussian mixture model settings, this objective can lead to undesirable and initialization-dependent convergence behavior. In this paper, we study an alternative objective: the reverse Fisher divergence, where the expectation is taken with respect to the student distribution. We analyze gradient descent (GD) for fitting Gaussian mixture models and show that this change in the objective leads to significantly better optimization properties. First, when the teacher distribution is a single Gaussian and the student is a Gaussian mixture model with fixed weights and identity covariances, we prove the global convergence of GD from arbitrary initializations. Second, we extend the analysis to the case where the teacher is also a Gaussian mixture model and prove global convergence guarantees under a global random initialization scheme and a $\widetilde{\Omega}(1)$-separation assumption on the target means. In particular, with high probability, each student component converges near its closest teacher component, and we provide conditions under which the student distribution converges in total variation distance. Our proofs rely on a new Lyapunov-based analysis of the gradient descent dynamics, showing that the reverse Fisher divergence has a much more favorable optimization landscape than the forward Fisher divergence.
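
For reference, the two objectives contrasted in the abstract above can be written side by side. The notation is an assumption made here, not taken from the paper: $p$ is the teacher density, $q_\theta$ the student density, and the factor $\tfrac{1}{2}$ is a common convention that the paper may or may not use.

```latex
\[
D_{\mathrm{F}}(p \,\|\, q_\theta)
  = \tfrac{1}{2}\, \mathbb{E}_{x \sim p}
    \left[ \left\| \nabla_x \log p(x) - \nabla_x \log q_\theta(x) \right\|^2 \right]
  \qquad \text{(forward: expectation over the teacher)}
\]
\[
D_{\mathrm{RF}}(p \,\|\, q_\theta)
  = \tfrac{1}{2}\, \mathbb{E}_{x \sim q_\theta}
    \left[ \left\| \nabla_x \log p(x) - \nabla_x \log q_\theta(x) \right\|^2 \right]
  \qquad \text{(reverse: expectation over the student)}
\]
```

The integrand is identical in both cases. Only the sampling distribution changes, so in the reverse form the expectation itself moves with the student parameters $\theta$.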


2606.19878 2026-06-19 cs.LG math.OC stat.ML 新提交

On the Oracle Complexity of Interpolation-Based Gradient Descent

基于插值的梯度下降的预言复杂度

Dongmin Lee, William Lu, Anuran Makur

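Affiliations: Purdue University

AI summary: Proposes piecewise polynomial interpolation-based gradient descent (PPI-GD), which queries a first-order oracle at equidistant points in the data domain and builds polynomial interpolants to approximate the full gradient; analyzes its oracle complexity for strongly convex and non-convex losses and shows that it outperforms several GD variants when the data dimension is bounded and the loss is sufficiently smooth.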

Comments 16 pages, 2 figures

详情

DOI: 10.1109/TAC.2026.3682210

AI中文摘要

最近关于经验风险最小化（ERM）的一阶优化器的工作表明，可以利用ERM损失函数在训练数据中的光滑性（而非优化参数中的光滑性）来改进梯度下降（GD）方法的预言复杂度。在本文中，我们提出了一种不精确梯度方法——分段多项式插值梯度下降（PPI-GD），该方法通过在数据域中的等距点处查询一阶预言来近似每次迭代中的全梯度，从而在数据域的适当大小的块上构造所得梯度样本的多项式插值。我们分析了PPI-GD在强凸和非凸损失函数下的预言复杂度，其中数据空间维数以训练样本数量的多对数函数为界，并发现当损失函数足够光滑时，PPI-GD在关键区域优于几种GD变体。此外，我们的分析将双三次样条插值误差分析中的几种技术扩展到$d$变量张量积多项式插值的设置中，这可能对插值分析具有独立意义。

英文摘要

Recent work on first-order optimizers for empirical risk minimization (ERM) has suggested that smoothness of ERM loss functions in the training data, rather than in the optimization parameters, can be leveraged to improve the oracle complexity of gradient descent (GD) methods. In this paper, we propose an inexact gradient method, piecewise polynomial interpolation-based gradient descent (PPI-GD), which approximates the full gradient in each iteration by querying the first-order oracle at equidistant points in the data domain to construct polynomial interpolants of the resulting gradient samples over appropriately sized patches of the data domain. We analyze the oracle complexity of PPI-GD for strongly convex and non-convex loss functions when the data space dimension is bounded by a polylogarithmic function of the number of training samples, and find it to outperform several GD variants in key regimes when the loss function is sufficiently smooth. Furthermore, our analysis extends several techniques from the error analysis of bicubic spline interpolants to the setting of $d$-variate tensor product polynomial interpolants, which may be of independent interest in interpolation analysis.


2606.19891 2026-06-19 cs.LG 新提交

Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses

具有全局有界扰动的凸损失对抗性赌博机优化

Zhuoyu Cheng, Kohei Hatano, Eiji Takimoto

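Affiliations: Department of Informatics, Kyushu University; RIKEN AIP

AI summary: Studies adversarial bandit optimization in which the losses may be non-convex and non-smooth, proposes a modified bandit optimization algorithm, analyzes how the perturbation budget affects regret, and extends the globally budgeted, post-action perturbation model from linear losses to general convex and β-smooth losses.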

详情

AI中文摘要

我们研究对抗性赌博机优化，其中损失函数可能非凸且非光滑。在每一轮中，学习者选择一个动作并仅观察该动作产生的损失。损失由一个潜在的凸且β-光滑分量和一个对抗性扰动组成，该扰动可能在观察学习者的动作后选择。扰动受全局预算约束，控制其随时间累积的幅度。该框架将全局预算的后行动扰动模型从线性损失扩展到一般凸且β-光滑损失。对于这个更广泛的类别，我们建立了期望遗憾保证，明确刻画了扰动预算的影响。为了建立这些保证，我们修改了一个标准的赌博机优化算法，并开发了一种分析来控制由扰动引起的额外遗憾。在没有扰动的情况下，我们的结果退化为具有β-光滑损失的标准赌博机凸优化设置的遗憾保证。

英文摘要

We study adversarial bandit optimization in which the loss functions may be non-convex and non-smooth. In each round, the learner selects an action and observes only the loss incurred at that action. The loss consists of an underlying convex and $\beta$-smooth component and an adversarial perturbation that may be chosen after observing the learner's action. The perturbations are subject to a global budget controlling their cumulative magnitude over time. This framework extends the globally budgeted, post-action perturbation model from underlying linear losses to general convex and $\beta$-smooth losses. For this broader class, we establish expected regret guarantees that explicitly characterize the effect of the perturbation budget. To establish these guarantees, we modify a standard bandit optimization algorithm and develop an analysis that controls the additional regret caused by the perturbations. In the absence of perturbations, our results reduce to regret guarantees for the standard bandit convex optimization setting with $\beta$-smooth losses.


2606.20075 2026-06-19 cs.LG cs.CL 新提交

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

什么使得潜在思维链中的监督有效：一种信息论分析

Xinghao Chen, Chak Tou Leong, Wenjin Guo, Jian Wang, Wenjie Li, Xiaoyu Shen

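Affiliations: Ningbo Institute of Digital Twin, Eastern Institute of Technology; Department of Computing, The Hong Kong Polytechnic University

AI summary: Analyzes supervision failure in latent chain-of-thought from an information-theoretic perspective, proposes two dimensions of supervision (trajectory and space), and introduces the Unified Latent Probe (ULP) to quantify information fidelity, revealing an Information-Performance Binding.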

详情

AI中文摘要

潜在思维链（Latent Chain-of-Thought, CoT）将推理内化到连续隐藏状态中，为冗长的离散推理轨迹提供了一种有前景的替代方案。然而，鲁棒的潜在推理仍然困难，因为结果监督提供的学习信号较弱，且容易导致潜在轨迹发生语义漂移。在这项工作中，我们从信息论角度分析潜在CoT，并将这种失效识别为双重崩溃：优化路径上的梯度衰减和潜在空间中的表征漂移。我们进一步将过程监督分解为两个互补维度：轨迹监督（注入密集的逐步推理信号）和空间监督（保持潜在流形的语义结构）。我们的分析表明，刚性几何压缩可能坍缩推理空间，而生成式重建提供了更灵活的语义锚点，更好地保留了信息容量。为了衡量这些效应，我们引入了统一潜在探针（Unified Latent Probe, ULP），用于量化潜在轨迹与显式推理步骤之间的互信息。实验揭示了清晰的信息-性能绑定关系：推理准确性取决于潜在链中保留的信息保真度。这些发现为潜在推理监督提供了一个原则性框架，并建议从几何模仿转向互信息最大化。我们的代码可在\href{this https URL}{此仓库}获取。

英文摘要

Latent Chain-of-Thought (CoT) internalizes reasoning within continuous hidden states, offering a promising alternative to verbose discrete reasoning traces. However, robust latent reasoning remains difficult because outcome supervision provides weak learning signals and leaves latent trajectories prone to semantic drift. In this work, we analyze Latent CoT from an information-theoretic perspective and identify this failure as a dual collapse: gradient attenuation along the optimization path and representational drift in the latent space. We further decompose process supervision into two complementary dimensions: Trajectory Supervision, which injects dense stepwise reasoning signals, and Space Supervision, which preserves the semantic structure of the latent manifold. Our analysis shows that rigid geometric compression can collapse the reasoning space, whereas generative reconstruction provides a more flexible semantic anchor that better preserves information capacity. To measure these effects, we introduce the Unified Latent Probe (ULP), which quantifies the mutual information between latent trajectories and explicit reasoning steps. Experiments reveal a clear Information-Performance Binding: reasoning accuracy depends on the information fidelity preserved in the latent chain. These findings provide a principled framework for latent reasoning supervision and suggest shifting from geometric imitation toward mutual information maximization. Our code is available at \href{https://github.com/EIT-NLP/Supervision-in-Latent-CoT}{this repository}.


2606.20183 2026-06-19 cs.LG 新提交

Effective Dimension Governs Generalization in Quantum Kernel Vision Models

有效维度主导量子核视觉模型的泛化

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

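AI summary: Uses the effective dimension d_eff to explain why entanglement structure improves generalization and why quantum noise raises test accuracy in quantum vision models, and presents a spectral decomposition of the noise-shaped kernel together with its regularization mechanism.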

详情

AI中文摘要

最近的量子视觉模型——量子视觉变换器和量子卷积网络——报告了两个引人注目但尚未解释的经验现象：(i) 具有更多或更均匀分布纠缠的拟设泛化更好，以及(ii) 注入量子噪声可以提高测试精度而不是降低它。这些观察目前被视为奇闻，通过网格搜索发现，并且如果有解释的话，也是手工进行的。我们表明，两者都是一个单一可测量量的表现：即（噪声形状的）量子特征核的\emph{有效维度}$d_{\rm eff}$。主要使用量子核视觉模型——由核分类器读出的量子特征映射——我们给出了一个谱解释，其中纠缠结构和量子噪声是调节$d_{\rm eff}$的两个旋钮；在过拟合区域，收缩$d_{\rm eff}$起到类似岭正则化的作用。我们分析了机制：退极化核$K_p=(1-p)^2K+\tfrac{p(2-p)}{D}\mathbf{1}\mathbf{1}^\top$的\emph{精确}分解，其中$d_{\rm eff}(K_p)\to1$，振幅阻尼的收缩结果（及其边界），核机器容量界，以及容量/对齐风险分解；在我们的纠缠实验中运作的单调收缩是经验验证的，并非普遍证明。沿着单参数退极化族，坍缩反而是通过构造精确的；我们仅用它来确认核分解到机器精度，最多达12个量子比特，而不是作为$d_{\rm eff}$的证据。振幅阻尼收缩$d_{\rm eff}$并沿倒U型最佳点将测试精度提升高达+13%；效应符号在过拟合和欠拟合区域之间翻转；噪声注入匹配显式谱过滤前沿。我们的结果将两个报告的现象组织成一个单一可测量原则，用于设计量子视觉模型。

英文摘要

Recent quantum vision models (quantum vision transformers and quantum convolutional networks) report two striking but unexplained empirical phenomena: (i) ansatze with more, or more uniformly distributed, entanglement generalize better, and (ii) injecting quantum noise can improve test accuracy rather than degrade it. These observations are currently treated as curiosities, discovered by grid search and explained, if at all, by hand. We show that both are manifestations of a single, measurable quantity: the \emph{effective dimension} $d_{\rm eff}$ of the (noise-shaped) quantum feature kernel. Working primarily with quantum-kernel vision models (a quantum feature map read out by a kernel classifier), we give a spectral account in which entanglement structure and quantum noise are two knobs that move $d_{\rm eff}$; in an overfitting regime, contracting $d_{\rm eff}$ acts as ridge-like regularization. We analyze the mechanism: an \emph{exact} decomposition of the depolarized kernel $K_p=(1-p)^2K+\tfrac{p(2-p)}{D}\mathbf{1}\mathbf{1}^\top$ with $d_{\rm eff}(K_p)\to1$, a contraction result (and its boundary) for amplitude damping, a kernel-machine capacity bound, and a capacity/alignment risk decomposition; the monotone contraction operative in our entangled experiments is verified empirically, not proven in general. Along the one-parameter depolarizing family the collapse is instead exact by construction; we use it only to confirm the kernel decomposition to machine precision and at up to $12$ qubits, not as evidence for $d_{\rm eff}$. Amplitude damping contracts $d_{\rm eff}$ and lifts test accuracy by up to $+13\%$ along an inverted-U sweet spot; the effect's sign flips between the over- and under-fitting regimes; noise injection matches an explicit spectral-filtering frontier. Our results organize two reported anecdotes into a single measurable principle for designing quantum-vision models.
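
The depolarized-kernel formula quoted in the abstract above can be checked numerically with a short sketch. The formula for $K_p$ is taken from the abstract, but the abstract does not define $d_{\rm eff}$, so the participation ratio $\mathrm{tr}(K)^2/\mathrm{tr}(K^2)$ used below is an assumed stand-in that may differ from the paper's definition. The function names and the reading of $D$ as the Hilbert-space dimension are also assumptions.

```python
from typing import List

Matrix = List[List[float]]


def depolarized_kernel(K: Matrix, p: float, D: int) -> Matrix:
    """K_p = (1 - p)^2 * K + p * (2 - p) / D * (all-ones matrix), as in the abstract."""
    n = len(K)
    scale = (1.0 - p) ** 2
    shift = p * (2.0 - p) / D  # constant added to every entry
    return [[scale * K[i][j] + shift for j in range(n)] for i in range(n)]


def effective_dimension(K: Matrix) -> float:
    """Participation ratio tr(K)^2 / tr(K^2); an assumed proxy for d_eff.

    For a symmetric matrix, tr(K^2) equals the sum of squared entries.
    """
    trace = sum(K[i][i] for i in range(len(K)))
    frobenius_sq = sum(value * value for row in K for value in row)
    return trace * trace / frobenius_sq


# Toy kernel of three mutually orthogonal feature states (identity Gram matrix).
TOY_KERNEL: Matrix = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
```

Under this proxy, `effective_dimension(depolarized_kernel(TOY_KERNEL, p, 4))` starts at 3 for `p = 0` and reaches 1 for `p = 1`, where $K_p$ is a rank-one multiple of the all-ones matrix. This is consistent with the limit $d_{\rm eff}(K_p)\to1$ stated in the abstract.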


2606.20325 2026-06-19 cs.LG cs.SC math.DS 新提交

Recurrent neural networks approximate continuous functions

递归神经网络近似连续函数

Valentin Abadie, Clemens Hutter, Helmut Bölcskei

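AI summary: Proves that for any continuous function on [-1,1] there is a ReLU recurrent neural network with fixed weights and fixed hidden dimension whose time evolution uniformly approximates the function, and gives convergence rates and minimax lower bounds.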

详情

AI中文摘要

经典逼近定理要求每当目标精度提高时，就需要一个新的神经网络。本文研究相反的可能性：能否一劳永逸地选择网络，而仅通过让其运行更长时间来换取精度？我们证明这对于[-1,1]上的每个连续函数都是可能的。更准确地说，每个这样的函数都可以通过一个具有固定权重和固定隐藏维度的单ReLU递归神经网络的时间演化来均匀逼近。该构造背后的机制是一个新的中间模型——带神经单元的图灵机（TMNU）。该模型保留了实现多项式逼近方案所需的算法自由度，同时保持足够的刚性，以便被具有显式隐藏维度和权重幅度界限的RNN模拟。由此产生的收敛速率反映了底层多项式逼近的速率。我们通过极小极大下界补充了该构造，表明运行时间不仅仅是证明的产物，而是这种固定网络逼近范式中不可避免的资源。

英文摘要

Classical approximation theorems ask for a new neural network whenever the target accuracy is improved. This paper studies the opposite possibility: can the network be chosen once and for all, and can accuracy be bought only by letting it run longer? We prove that this is possible for every continuous function on [-1,1]. More precisely, each such function is uniformly approximated by the time evolution of a single ReLU recurrent neural network with fixed weights and fixed hidden dimension. The mechanism behind the construction is a new intermediate model, the Turing machine with neural units (TMNU). This model retains the algorithmic freedom needed to implement polynomial approximation schemes, while remaining rigid enough to be simulated by RNNs with explicit bounds on hidden dimension and weight magnitude. The resulting convergence rates reflect the underlying polynomial approximation rates. We complement the construction with minimax lower bounds showing that runtime is not merely a proof artifact, but an unavoidable resource in this fixed-network approximation paradigm.


2606.20357 2026-06-19 cs.LG 新提交

On the Variance of Temporal Difference Learning and its Reduction Using Control Variates

时序差分学习的方差及其通过控制变量的降低

Hsiao-Ru Pan, Bernhard Schölkopf

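AI summary: Analyzes the variance of temporal difference learning in the phased setting with tabular representations, shows that its variance reduction comes from effectively aggregating more independent trajectories, and compares variance bounds for TD, MC, and DAE.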

Comments Accepted at RLC2026

2606.20469 2026-06-19 cs.LG cs.CG 新提交

Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima

Fisher-几何锐度与SGD对平坦极小值的隐式偏好

Md Sakir Ahmed, Kumaresh Sarmah, Hemen Dutta

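Affiliations: Gauhati University

AI summary: SGD favors flat minima, but Euclidean sharpness is not reparametrization-invariant; this work proposes a Riemannian sharpness based on the Fisher Information Matrix, proves its invariance, shows that the stationary distribution of SGD concentrates at flat minima, and links this to generalization through a PAC-Bayes bound.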

Comments 18 pages, 5 figures, preprint

详情

AI中文摘要

深度学习中的一个广泛直觉是随机梯度下降（SGD）隐式偏好平坦极小值，且平坦极小值泛化更好，但损失Hessian的迹或最大特征值等标准欧氏平坦度度量在保持网络函数的重参数化下并非不变，这削弱了这一叙事的理论基础。在本研究中，我们通过将平坦度建立在由Fisher信息矩阵（FIM）诱导的统计流形的黎曼几何上，解决了这一问题。我们在数学上定义了黎曼锐度，并证明它在光滑、保函数的重参数化下是不变的，这直接回应了Dinh等人在论文“Sharp minima can generalize for deep nets”中的批评。我们注意到这种不变性是真实FIM的一个性质；实践中使用的对角经验估计量（以及下面所有实验中的）仅近似继承不变性，而在任意重参数化下的精确不变性需要结构化估计量如K-FAC。我们将小批量SGD的梯度噪声形式化为具有与FIM成比例的协方差结构，推导出所得随机微分方程的稳态分布，然后证明概率质量指数级集中在黎曼平坦极小值处。一个由SR显式控制的PAC-Bayes泛化界正式地将这种几何偏差与测试性能联系起来。我们在MNIST和CIFAR-10上的实验证实，SR以欧氏锐度无法做到的方式可靠地跟踪泛化，并且其随$\eta/B$的缩放与理论预测相匹配。这些结果共同提供了一个严格的、重参数化不变的解释，说明为什么平坦极小值能泛化。

英文摘要

A widely held intuition in deep learning is that stochastic gradient descent (SGD) implicitly favors flat minima and that flat minima generalize better, but standard Euclidean measures of flatness such as the trace or maximum eigenvalue of the loss Hessian are not invariant under reparametrizations that preserve the network function, which undermines the theoretical foundations of this narrative. In this study we resolve this issue by grounding flatness in the Riemannian geometry of the statistical manifold induced by the Fisher Information Matrix (FIM). We define Riemannian sharpness mathematically and prove that it is invariant under smooth, function-preserving reparametrizations, which directly addresses the critique of Dinh et al. in the paper ``Sharp minima can generalize for deep nets''. We note that this invariance is a property of the true FIM; the diagonal empirical estimator used in practice (and in all experiments below) inherits invariance only approximately, and exact invariance under arbitrary reparametrizations would require structured estimators such as K-FAC. We formalize the gradient noise of mini-batch SGD as having a covariance structure proportional to the FIM, derive the stationary distribution of the resulting stochastic differential equation, and then show that the probability mass is exponentially concentrated at Riemannian-flat minima. A PAC-Bayes generalization bound controlled explicitly by SR formally links this geometric bias to test performance. Our experiments on MNIST and CIFAR-10 confirm that SR reliably tracks generalization in ways that Euclidean sharpness does not, and that its scaling with $\eta/B$ matches the theoretical predictions. Together these results provide a rigorous, reparametrization-invariant account of why flat minima generalize.


2606.19364 2026-06-19 cs.LG 新提交

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

缩小社会-语义差距：SPSD用于云LLM推理中的边缘端提示压缩

Abhinit Sen, Ajeet Kumar, Manaranjan Pradhan

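AI summary: To address the high energy cost of the prompt prefill stage in cloud LLM inference, proposes SPSD, an edge-based pipeline that compresses user prompts with a 4-bit quantized small language model, saving 99.9 input tokens on average with non-inferior response quality and an estimated net energy saving of 70-270 uWh per call.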

Comments 19 pages, 7 tables, 1 figure, includes appendix

详情

AI中文摘要

大语言模型（LLM）推理的预填充阶段正成为云规模能耗的日益增长的贡献者。许多面向消费者的支持和对话提示包含社会性支架：礼貌标记、道歉性开场白、重复以及建立融洽关系的语言，这些对人类交流很重要，但对机器推理而言边际信息量较低。我们将这种差异称为社会-语义差距。我们提出SPSD（情感保留语义蒸馏），一种边缘端管道，在传输到云端部署的LLM之前，使用4比特量化的小语言模型压缩用户提示。在248个提示的语料库上，使用Gemma-2-2B-Instruct（Q4_K_M）作为SLM、Llama-3.1-8B-Instruct作为云端评估模型进行评估，每次蒸馏调用平均输入token节省99.9个，所有146次蒸馏调用均产生正向节省。通过盲法LLM-as-judge评分对121对进行评估，响应质量在15分制中预先指定的1分非劣效范围内不劣于原始路径；评审员给出43%平局、28%蒸馏胜出和29%原始胜出。余弦相似度结果不一：均值0.682，中位数0.712，54.1%的配对高于0.70参考阈值。安全关键领域通过基于规则的网关保守地路由至直通模式。在所述假设下，每次调用净节能估计为70-270 uWh。SPSD表明，设备端提示蒸馏可以在保持响应质量在实际非劣效范围内的同时，降低云LLM的输入token成本。

英文摘要

The prefill stage of Large Language Model (LLM) inference is a growing contributor to cloud-scale energy cost. Many consumer-support and conversational prompts contain social scaffolding: politeness markers, apologetic preamble, repetition, and rapport-building language that is important for human communication but carries low marginal information for machine reasoning. We call this discrepancy the Social-Semantic Gap. We present SPSD (Sentiment Preserving Semantic Distillation), an edge-based pipeline that compresses user prompts using a 4-bit quantised Small Language Model before transmission to a cloud-deployed LLM. Evaluation on a 248-prompt corpus using Gemma-2-2B-Instruct (Q4_K_M) as the SLM and Llama-3.1-8B-Instruct as the cloud evaluation model yields a mean input token saving of 99.9 tokens per distilled call, with all 146 distilled calls yielding positive savings. Response quality, assessed by blind LLM-as-judge scoring across 121 pairs, is non-inferior to the raw path within a pre-specified 1-point margin on a 15-point rubric; the judge awarded 43 percent ties, 28 percent distilled wins, and 29 percent raw wins. Cosine similarity is mixed: mean 0.682, median 0.712, with 54.1 percent of pairs above the 0.70 reference threshold. Safety-critical domains are conservatively routed to passthrough via rule-based gates. Per-call net energy saving is estimated at 70-270 uWh under stated assumptions. SPSD shows that on-device prompt distillation can reduce cloud LLM input-token cost while preserving response quality within a practical non-inferiority margin.


2606.19365 2026-06-19 cs.LG 新提交

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

跨GPU架构的3D生成扩散模型性能分析与优化

Jeeho Ryoo, Yongchan Jung, Muhammad Ali Khaliq, Weidong Zhang, Jiatong Han, Byeong Kil Lee

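Affiliations: Fairleigh Dickinson University; The University of Colorado at Colorado Springs; Northeastern University

AI summary: Analyzes kernel-level performance bottlenecks of the 3D MRI diffusion model Med-DDPM across three generations of NVIDIA architectures and evaluates TF32 Tensor Core activation and a 3D channels-last layout, which cut SM cycles and dynamic instructions by up to 100x, raise Tensor Core utilization to 9.98x, and increase IPC by 7%.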

详情

DOI: 10.1145/3777884.3797012

AI中文摘要

扩散模型已成为高保真3D MRI合成的关键，但由于每个样本需要数百次U-Net评估以及高度异构的内核行为，其部署仍受到大量GPU资源需求的限制。本文对最先进的医学扩散模型Med-DDPM在三代NVIDIA架构上进行了全面的性能分析，研究了内核级运行时分解、指令混合特征、内存系统利用率、线程束级活动以及分析器优先级得分估计。我们发现训练主要由cuDNN卷积和隐式GEMM内核主导，效率低下源于内存访问模式、张量布局转换和有限的Tensor Core利用率。基于这些洞察，我们评估了两种架构感知优化——TF32 Tensor Core激活和3D channels-last布局，并证明它们将SM周期减少多达100倍，动态指令减少100倍，Tensor Core利用率从1.45倍提高到9.98倍，并在A100上将IPC提高7%，且不降低合成质量。

英文摘要

Diffusion models have become essential for high-fidelity 3D MRI synthesis, yet their deployment remains constrained by substantial GPU resource demands arising from hundreds of U-Net evaluations per sample and highly heterogeneous kernel behavior. This paper performs a comprehensive performance analysis of the state-of-the-art medical diffusion model, Med-DDPM, across three generations of NVIDIA architectures to study kernel-level runtime breakdowns, instruction-mix characteristics, memory system utilization, warp-level activities, and profiler priority-score estimates. We show that training is overwhelmingly dominated by cuDNN convolution and implicit-GEMM kernels, with inefficiencies arising from memory-access patterns, tensor-layout conversions, and limited Tensor Core utilization. Guided by these insights, we evaluate two architecture-aware optimizations, TF32 Tensor Core activation and a 3D channels-last layout, and demonstrate that they reduce SM cycles by up to 100x, cut dynamic instructions by 100x, raise Tensor Core utilization from 1.45 to 9.98x, and increase IPC by 7% on A100, all without degrading synthesis quality.


2606.19528 2026-06-19 cs.LG cs.AI 新提交

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

边缘设备上LLM LoRA微调峰值内存降低技术

Hassan Dbouk, Matthias Reisser, Prathamesh Mandke, Likhita Arun Navali, Christos Louizos

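AI summary: To address the memory bottleneck of LoRA fine-tuning of LLMs on edge devices, proposes four complementary techniques (quantization, checkpointing, softmax approximation, and logits masking) that reduce peak memory by up to 26x on Llama-3.2 3B and 28x on Qwen-2.5 3B.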

Comments Hassan Dbouk and Matthias Reisser contributed equally to this work

2606.19549 2026-06-19 cs.LG 新提交

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

预测参数高效微调更新的可合并性

Lin Tang, Wei Zhang, Jing Li, Hongyu Chen, Ming Zhao, Yuxuan Wang

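Affiliations: Sichuan University; University of Electronic Science and Technology of China

AI summary: Proposes MergeProbe, which predicts the mergeability of LoRA adapters from early-training signals and attains the best average and worst-case retention on the MERGE-PEFT benchmark.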

详情

AI中文摘要

低秩适配（LoRA）使得训练许多领域和任务特定的语言模型适配器变得廉价，但两个适配器是否可以合并通常只有在两者都经过充分训练和评估后才能发现。这种延迟反馈代价高昂：单独表现强大的适配器在合并更新后可能会产生破坏性干扰。我们询问是否可以预测这种结果。我们将适配器可合并性形式化为适配器在合并后保持其单任务效用的程度，并表明可以从训练初期百分之几的信号中预测——主要是低秩更新及其梯度在不同任务间的对齐程度以及它们对共享表示的干扰程度。我们将这些信号打包成MergeProbe，一个轻量级预测器，用于估计成对和集合级别的保留，并将估计转化为具体决策：直接合并、重新加权、剪枝或路由。在MERGE-PEFT（一个涵盖数学、代码、科学、指令遵循和安全的五领域基准）上，MergeProbe在强干扰感知合并基线中实现了最佳平均和最差保留，同时增加的部署开销远低于完整任务路由。这将LoRA合并从事后工程步骤转变为预期测量问题。

英文摘要

Low-rank adaptation (LoRA) makes it cheap to train many domain- and task-specific language model adapters, but whether two adapters can be merged is usually discovered only after both have been fully trained and evaluated. This late feedback is costly: adapters that are strong in isolation can interfere destructively once their updates are combined. We ask whether this outcome can be anticipated. We formalize adapter mergeability as the degree to which an adapter preserves its single-task utility after merging, and show that it can be forecast from signals measured in the first few percent of training -- chiefly how the low-rank updates and their gradients align across tasks and how much they disturb shared representations. We package these signals into MergeProbe, a lightweight predictor that estimates pairwise and set-level retention and turns the estimate into a concrete decision: merge directly, reweight, prune, or route. On MERGE-PEFT, a five-domain benchmark spanning math, code, science, instruction following, and safety, MergeProbe attains the best average and worst-case retention among strong interference-aware merge baselines while adding far less deployment overhead than full task routing. This turns LoRA merging from a post-hoc engineering step into an anticipatory measurement problem.


2606.19712 2026-06-19 cs.LG cs.CV 新提交

Efficient Neural Network Model Selection for Few-Class Application Datasets

面向少类应用数据集的高效神经网络模型选择

Bryan Bo Cao, Abhinav Sharma, Lawrence O'Gorman, Michael Coss, Shubham Jain

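Affiliations: Nokia Bell Labs

AI summary: For the few-class datasets common in real applications, proposes a classification difficulty measure based on data properties that makes model selection 6 to 29 times faster than conventional approaches, and extends model families to smaller scales, improving efficiency in scenarios such as mobile robots.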

Comments 36 pages, 9 tables, 13 figures

详情

AI中文摘要

尽管大量工作集中在开发和基准测试高性能神经网络上，但较少关注已知的数据集属性如何指导高效的模型选择。神经网络模型通常在数千类数据集上评估，然而许多实际应用涉及少于十类。为了解决这一被忽视但常见的情况，我们基于数据侧属性开发了一种分类难度度量，并展示了它如何为少类数据集实现更高效的模型选择，而传统方法在此效果较差。我们将此现象称为“少类独特性”。我们的度量允许比重复训练和测试快6到29倍的模型和数据集比较。利用这一洞察，我们将缩放模型族扩展到已发布的最小模型以下，在相似精度下实现更高效率，例如在移动机器人任务中模型比YOLOv5-nano小42%。针对资源受限的应用，我们在移动机器人、无人机和物联网场景中展示了少类模型选择，突出了在不牺牲性能的情况下效率的实际提升。

英文摘要

While much effort has focused on developing and benchmarking high-performance neural networks, less attention has been given to how dataset properties, known to practitioners, can guide efficient model selection. Neural models are typically evaluated on datasets with thousands of classes, yet many real-world applications involve fewer than ten. To address this understudied but common setting, we develop a measure of classification difficulty based on data-side properties and show how it enables more efficient model selection for few-class datasets, where traditional approaches are less effective. We term this phenomenon "few-class distinctiveness". Our metric allows comparison of models and datasets 6 to 29$\times$ faster than repeated training and testing. Leveraging this insight, we extend scaled model families below the smallest published models, achieving greater efficiency at similar accuracy, for example models up to 42% smaller than YOLOv5-nano for a mobile robot task. Targeting resource-constrained applications, we demonstrate few-class model selection across mobile robot, drone, and IoT scenarios, highlighting practical gains in efficiency without sacrificing performance.


2606.19919 2026-06-19 cs.LG 新提交

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

ADaPT：面向高效大推理模型的令牌级解耦

Tingyun Li, Zishang Jiang, Jinyi Han, Xinyi Wang, Sihang Jiang, Han Xia, Zhaoqian Dai, Shuguang Ma, Fei Yu, Jiaqing Liang, Yanghua Xiao

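Affiliations: School of Data Science, Fudan University; Shanghai Institute of Artificial Intelligence for Education, East China Normal University; College of Computer Science and Artificial Intelligence, Fudan University; Ant Group

AI summary: Proposes ADaPT, a token-level dual-process framework that decouples efficiency and correctness signals and introduces a mode-selection token to control fast and slow reasoning, giving precise, continuous control of the efficiency-performance trade-off at inference time and lowering inference cost while retaining strong reasoning ability.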

详情

AI中文摘要

大型推理模型依赖长思维链实现强性能，但统一应用此类推理会产生高计算成本。现有面向效率的方法试图缩短或混合推理策略，但往往会降低推理能力。我们将根本原因识别为效率激励与正确性优化之间的序列级耦合，这隐式惩罚了长但正确的推理轨迹。为解决此问题，我们提出自适应双过程思维（ADaPT），一种令牌级双过程框架，在训练期间显式解耦效率和正确性信号。ADaPT引入模式选择令牌来控制快速和慢速推理，将效率相关奖励仅应用于此令牌，以避免惩罚正确的长推理，同时在适当时鼓励效率。此外，ADaPT在推理时实现了对效率-性能权衡的精确连续控制：通过调整模式选择令牌的生成概率，单个训练好的模型可以平滑地沿效率-性能帕累托前沿移动。大量实验表明，ADaPT在多个基准测试中显著降低推理成本，同时保持强推理性能。

英文摘要

Large reasoning models rely on long chain-of-thought to achieve strong performance, but applying such reasoning uniformly incurs high computational cost. Existing efficiency-oriented methods attempt to shorten or mix reasoning strategies, yet often degrade reasoning capability. We identify the root cause as sequence-level coupling between efficiency incentives and correctness optimization, which implicitly penalizes long but correct reasoning trajectories. To address this issue, we propose Adaptive Dual-Process Thinking (ADaPT), a token-level dual-process framework that explicitly decouples efficiency and correctness signals during training. ADaPT introduces a mode-selection token to control fast and slow reasoning, applying efficiency-related rewards exclusively to this token to avoid penalizing correct long reasoning while encouraging efficiency when appropriate. Moreover, ADaPT enables precise and continuous control over the efficiency-performance trade-off at inference time: by adjusting the generation probability of the mode-selection token, a single trained model can smoothly move along the efficiency-performance Pareto frontier. Extensive experiments demonstrate that ADaPT significantly reduces inference cost while maintaining strong reasoning performance across multiple benchmarks.


2606.19964 2026-06-19 cs.LG cs.AR 新提交

Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge

用于边缘Tsetlin Machine推理的低能耗精简RISC-V指令子集处理器

Chanda Gupta, Sanidhya Bhatia, Shaurya Priyadarshi, Himani Panwar, Rishad Shafik, Sudip Roy

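AI summary: For Tsetlin Machine inference, proposes a domain-specific RISC-V microprocessor architecture that uses instruction reduction and datapath simplification to retain programmability while reducing execution time by up to 98% and energy consumption by an average of 29.7x.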

Comments 6 pages, 6 Figures, Accepted in IEEE ISVLSI Conference 2026

详情

AI中文摘要

Tsetlin Machine (TM) 是一种基于逻辑的机器学习方法，依赖于简单的位运算和有限状态自动机，使其适用于边缘AI部署。最近的工作集中在基于Tsetlin Machine (TM) 的协处理器和加速器设计上。尽管这些设计实现了高性能，但它们通常依赖于紧密耦合的接口、微码风格的编程和外部主机处理器，限制了灵活性和编程简易性。在这项工作中，我们提出了一种面向TM推理的领域专用RISC-V微处理器架构和设计流程。利用RISC-V的模块化结构，我们设计了一个精简指令子集处理器，在保持可编程性的同时，针对TM工作负载提高了性能并降低了能耗。采用指令分析来指导指令精简，随后针对TM推理进行数据路径和控制路径的简化。在多个数据集上评估了基线RV32IM核心和所提出的精简核心，并与二值神经网络 (BNN) 进行比较，BNN由于在推理过程中依赖位运算而被用作硬件高效基线。结果表明，TM实现了相当或更高的准确率（例如，在CIFAR-2上高达88.18%，而BNN为60.0%），同时在多个数据集上执行时间减少了高达98%。此外，所提出的设计实现了平均29.7倍的能耗降低，证明了其在可编程且高效的边缘AI系统中的有效性。

英文摘要

Tsetlin Machine (TM) is a logic-based machine learning approach that relies on simple bitwise operations and finite-state automata, which makes it attractive for edge AI deployments. Recent work has focused on co-processor and accelerator designs based on Tsetlin Machines (TMs). Although these designs achieve high performance, they typically depend on tightly coupled interfaces, microcode-style programming, and external host processors, limiting flexibility and ease of programming. In this work, we present a domain-specific RISC-V microprocessor architecture and design flow tailored for TM inference. Leveraging the modular structure of RISC-V, we design a reduced instruction subset processor that retains programmability while targeting improved performance and lower energy consumption for TM workloads. Instruction profiling is employed to guide instruction reduction, followed by datapath and control path simplifications tailored to TM inference. Both the baseline RV32IM core and the proposed reduced core are evaluated across multiple datasets and compared with Binarized Neural Networks (BNNs), which serve as a hardware-efficient baseline due to their reliance on bitwise operations during inference. Results show that TM achieves comparable or higher accuracy (e.g., up to 88.18% on CIFAR-2 compared to 60.0% for BNN) while reducing execution time by up to 98% across multiple datasets. Furthermore, the proposed design achieves an average $29.7\times$ reduction in energy consumption, demonstrating its effectiveness for programmable and efficient edge AI systems.
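
To make concrete why TM inference reduces to simple bitwise operations, here is a minimal sketch of clause evaluation and voting for a two-class TM. It is not the processor's instruction sequence or the authors' code. The literal packing, the convention that an empty clause outputs 0 at inference, the decision threshold, and the hand-written XOR clause masks (not learned) are all assumptions made for illustration.

```python
from typing import List, Sequence


def pack_literals(features: Sequence[int]) -> int:
    """Pack n binary features and their negations into one integer bitmask.

    Bit i is set when feature i is 1; bit n + i is set when feature i is 0.
    """
    n = len(features)
    literals = 0
    for i, bit in enumerate(features):
        literals |= 1 << (i if bit else n + i)
    return literals


def clause_fires(include_mask: int, literals: int) -> bool:
    """A clause is the AND of its included literals: a single bitwise AND and compare."""
    return include_mask != 0 and (include_mask & literals) == include_mask


def tm_score(positive: List[int], negative: List[int], literals: int) -> int:
    """Votes of positive-polarity clauses minus votes of negative-polarity clauses."""
    votes_for = sum(clause_fires(mask, literals) for mask in positive)
    votes_against = sum(clause_fires(mask, literals) for mask in negative)
    return votes_for - votes_against


def tm_predict(positive: List[int], negative: List[int],
               features: Sequence[int]) -> int:
    """Predict class 1 when the vote sum is non-negative, else class 0."""
    return 1 if tm_score(positive, negative, pack_literals(features)) >= 0 else 0


# Hand-written clauses for 2-bit XOR (bits 0-1: x0, x1; bits 2-3: NOT x0, NOT x1).
XOR_POSITIVE = [0b1001, 0b0110]  # x0 AND NOT x1, NOT x0 AND x1
XOR_NEGATIVE = [0b0011, 0b1100]  # x0 AND x1, NOT x0 AND NOT x1
```

With these masks, `tm_predict(XOR_POSITIVE, XOR_NEGATIVE, [1, 0])` returns 1 and `tm_predict(XOR_POSITIVE, XOR_NEGATIVE, [1, 1])` returns 0. The inner loop uses only AND, compare, and add, which is the kind of workload a reduced instruction subset can target.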


2606.19993 2026-06-19 cs.LG 新提交

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

激活与影响感知秩 (AIR)：保持功能的SVD压缩用于大语言模型

Nico Harder, Daniel Becking, Karsten Mueller, Wojciech Samek

AI总结提出AIR框架，基于SVD和反向信号影响度量，通过单次交替最小二乘扫描实现权重矩阵的低秩近似，在参数保留≤60%时困惑度比SVD-LLM(W)改善>18%，并减少90%校准数据。

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM), Seoul, South Korea (non-archival)

2606.20005 2026-06-19 cs.LG cs.AI 新提交

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

StreamKL: 快速且内存高效的KL散度用于提升注意力蒸馏

Guangda Liu, Yiquan Wang, Chengwei Li, Wenhao Chen, Jing Lin, Yiwu Yao, Danning Ke, Wenchao Ding, Jieru Zhao

发表机构 * Shanghai Jiao Tong University（上海交通大学）； Huawei（华为）； Fudan University（复旦大学）

AI总结提出StreamKL，首个融合GPU原语，通过在线公式和逐块重计算将注意力蒸馏的内存和IO成本从O(N_QN_K)降至O(1)，实现高达43倍前向和14倍反向加速。

详情

AI中文摘要

注意力蒸馏通过最小化Kullback-Leibler (KL)散度来训练一个注意力分布匹配另一个，广泛应用于知识蒸馏、模型压缩、持续学习和稀疏注意力LLM训练。然而，现有方法在计算KL归约前需要具体化两个注意力分布，导致$O(N_QN_K)$的内存和IO成本，在长上下文长度下变得不可接受。我们提出StreamKL，首个用于注意力KL散度的融合GPU原语，消除了这种二次具体化。StreamKL推导了一种新颖的在线公式用于耦合的双分布KL归约，使得单个前向内核能够通过片上SRAM流式处理查询-键块。对于反向传播，StreamKL逐块重计算注意力概率，避免存储二次中间结果。我们进一步设计并实现了具有专用优化的高效GPU内核。实验表明，StreamKL在前向和反向传播中分别比基线方法快高达43倍和14倍。最重要的是，StreamKL将注意力蒸馏的额外HBM占用从$O(N_QN_K)$减少到$O(1)$，使得在单个GPU上进行长上下文蒸馏成为可能。

英文摘要

Attention distillation, which trains one attention distribution to match another by minimizing their Kullback-Leibler (KL) divergence, is widely used in knowledge distillation, model compression, continual learning, and sparse-attention LLM training. However, existing approaches materialize both attention distributions before computing the KL reduction, incurring $O(N_QN_K)$ memory and IO costs that become prohibitive at long context lengths. We present StreamKL, the first fused GPU primitive for attention KL divergence that eliminates this quadratic materialization. StreamKL derives a novel online formulation for the coupled two-distribution KL reduction, enabling a single one-pass forward kernel that streams query-key tiles through on-chip SRAM. For the backward pass, StreamKL recomputes attention probabilities tile-by-tile, avoiding storage of quadratic intermediates. We further design and implement efficient GPU kernels with dedicated optimizations. Experiments show StreamKL delivers up to $43\times$ and $14\times$ speedups over baseline methods in the forward and backward passes, respectively. Most importantly, StreamKL reduces the extra HBM footprint of attention distillation from $O(N_QN_K)$ to $O(1)$, enabling long-context distillation on a single GPU.
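The abstract does not spell out the online formulation, so the following is only a minimal sketch of how a coupled two-distribution KL reduction can be streamed tile by tile. It assumes the teacher and student distributions are softmaxes of two logit rows and reuses the running-max rescaling familiar from online softmax. The names `streaming_kl` and `dense_kl` are hypothetical, and this is plain Python for one query row, not the paper's GPU kernel.

```python
import math

def streaming_kl(a, b, tile=4):
    """KL(softmax(a) || softmax(b)) computed one tile at a time.

    Only five running scalars are kept, so neither distribution is
    ever materialized in full.
    """
    m_a, s_a, t = float("-inf"), 0.0, 0.0   # running max, sum, and weighted logit gap for a
    m_b, s_b = float("-inf"), 0.0           # running max and sum for b
    for start in range(0, len(a), tile):
        ta, tb = a[start:start + tile], b[start:start + tile]

        # Update the statistics of the first distribution, rescaling old sums.
        new_m_a = max(m_a, max(ta))
        scale_a = math.exp(m_a - new_m_a)   # exp(-inf) == 0.0 on the first tile
        w = [math.exp(x - new_m_a) for x in ta]
        s_a = s_a * scale_a + sum(w)
        t = t * scale_a + sum(wi * (x - y) for wi, x, y in zip(w, ta, tb))
        m_a = new_m_a

        # Update the log-sum-exp statistics of the second distribution.
        new_m_b = max(m_b, max(tb))
        s_b = s_b * math.exp(m_b - new_m_b) + sum(math.exp(y - new_m_b) for y in tb)
        m_b = new_m_b

    # KL = E_p[a - b] - logsumexp(a) + logsumexp(b)
    return t / s_a - (m_a + math.log(s_a)) + (m_b + math.log(s_b))

def dense_kl(a, b):
    """Reference implementation that materializes both distributions."""
    lse_a = math.log(sum(math.exp(x) for x in a))
    lse_b = math.log(sum(math.exp(y) for y in b))
    return sum(math.exp(x - lse_a) * ((x - lse_a) - (y - lse_b)) for x, y in zip(a, b))
```

The identity used in the last line of `streaming_kl` follows from writing each log-probability as a logit minus a log-sum-exp, which is what allows the reduction to proceed in a single pass.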


2606.20474 2026-06-19 cs.LG cs.AI cs.PF 新提交

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

UltraQuant: 面向上下文密集型智能体的4位KV缓存

Inesh Chakrabarti, David Limpus, Aditi Ghai Rana, Bowen Bao, Spandan Tiwari, Thiago Crepaldi, Ashish Sirasao

发表机构 * Advanced Micro Devices（超威半导体）； University of California, Los Angeles（加州大学洛杉矶分校）； Purdue University（普渡大学）

AI总结针对上下文密集型智能体场景，提出UltraQuant方法，通过4位KV缓存压缩、旋转量化和代码本量化，结合AMD GPU优化，在长上下文多轮任务中延迟降低3.47倍，吞吐量提升1.63倍。

Comments 11 pages, 9 figures

详情

AI中文摘要

上下文密集型智能体给键值（KV）缓存带来了异常压力：长前缀在多个短轮次中重复使用，而并发性决定了服务系统能否保持GPU利用率。我们针对此场景研究4位KV缓存压缩，采用TurboQuant风格的旋转和代码本量化作为质量锚点，vLLM FP8 KV缓存作为部署锚点。我们报告三项贡献。首先，我们将4位KV缓存框架用于多轮智能体工作负载，其中任务质量、缓存驻留和服务吞吐量必须联合衡量。其次，我们描述了使4位路径鲁棒所需的实际设计选择，包括非对称K/V处理、Walsh-Hadamard旋转、QJL移除和块尺度变体。第三，我们展示了AMD GPU上的服务优化，包括优化的解码注意力内核和UltraQuant，一种使用FP8查询、FP4 KV张量、UE8M0组尺度和CDNA4上原生缩放MFMA支持的FP4近似路径。在长上下文、多轮智能体工作负载上，UltraQuant在缓存压力大的后期轮次中将P50首令牌延迟降低了3.47倍（所有轮次平均2.3倍），并将输出吞吐量比FP8 KV基线提高了1.63倍。

英文摘要

Context-heavy agents place unusual pressure on the key-value (KV) cache: long prefixes are reused across many short turns, while concurrency determines whether the serving system can keep GPUs utilized. We study 4-bit KV-cache compression for this setting, using TurboQuant-style rotation and codebook quantization as a quality anchor and vLLM FP8 KV caching as the deployment anchor. We report three contributions. First, we frame 4-bit KV caching around multi-round agent workloads where task quality, cache residency, and serving throughput must be measured jointly. Second, we describe the practical design choices needed to make the 4-bit path robust, including asymmetric K/V treatment, Walsh-Hadamard rotation, QJL removal, and block-scale variants. Third, we present serving optimizations on AMD GPUs, including optimized decode-attention kernels and UltraQuant, an FP4 approximation path that uses FP8 queries, FP4 KV tensors, UE8M0 group scales, and native scaled-MFMA support on CDNA4. On a long-context, multi-turn agentic workload, UltraQuant cuts P50 time-to-first-token by 3.47x in the cache-pressured late rounds (2.3x across all rounds) and raises output throughput by 1.63x over the FP8 KV baseline.
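As a rough illustration of the rotate-then-quantize idea mentioned above, the sketch below applies an orthonormal Walsh-Hadamard transform before a uniform signed 4-bit quantizer. This is an assumption-laden simplification: the paper describes codebook quantization and an FP4 path with UE8M0 group scales, whereas this example uses a single per-vector scale and integer codes. The function names are hypothetical.

```python
import math

def fwht(x):
    """Orthonormal fast Walsh-Hadamard transform; len(x) must be a power of two."""
    y = list(x)
    n = len(y)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                u, v = y[j], y[j + h]
                y[j], y[j + h] = u + v, u - v   # butterfly step
        h *= 2
    norm = 1.0 / math.sqrt(n)
    return [val * norm for val in y]            # normalized, so fwht is its own inverse

def quantize_int4(x):
    """Uniform signed 4-bit quantization with one scale per vector."""
    peak = max(abs(v) for v in x)
    scale = peak / 7 if peak > 0 else 1.0
    codes = [max(-8, min(7, round(v / scale))) for v in x]
    return codes, scale

def dequantize(codes, scale):
    return [c * scale for c in codes]

def rotated_roundtrip(x):
    """Rotate, quantize to 4 bits, dequantize, and rotate back."""
    codes, scale = quantize_int4(fwht(x))
    return fwht(dequantize(codes, scale)), scale
```

Because the normalized transform is orthonormal, a single large entry in the input is spread evenly over all rotated coordinates, and the L2 reconstruction error after rotating back equals the quantization error in the rotated domain.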


2606.20537 2026-06-19 cs.LG cs.DC 新提交

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

执行状态胶囊：面向低延迟、小批量、设备端物理AI服务的图绑定执行状态检查点与恢复

Liang Su

AI总结针对低延迟、小批量、设备端物理AI服务场景，提出执行状态胶囊机制，通过图绑定检查点与恢复完整可恢复状态，在RTX 5090上实现亚毫秒级恢复，TTFT加速比达3.9倍至27倍。

Comments 27 pages, 9 figures

详情

AI中文摘要

主流LLM服务系统主要通过分页或基数键值（KV）缓存重用前缀工作。这对于高吞吐量、高并发服务非常有效，但它只管理执行状态的一个位置片段：KV缓存。我们研究相反的场景：低延迟、小批量、设备端物理AI服务，其中交互式LLM代理、语音系统和机器人策略在严格的响应预算下频繁分支、重置、中断和重新进入。我们引入执行状态胶囊，一种图绑定的检查点和恢复机制，用于在提交边界处保存完整的可恢复状态。FlashRT是一个白盒、后端内核运行时，其评估的NVIDIA CUDA后端在连续的静态缓冲区上运行捕获的图计划，无需块表间接寻址。由于活动状态是一组命名的封闭缓冲区，胶囊可以快照、恢复、分叉或回滚整个执行边界，包括KV、循环状态、卷积状态、MTP状态和元数据。这将重用从令牌寻址的KV片段转移到图绑定的执行状态边界。在RTX 5090上，胶囊恢复在存储状态级别是字节精确的，在贪婪解码下是令牌一致的。仅KV的消融实验出现分歧，表明循环状态是承载负载的。GPU驻留的快照和恢复是亚毫秒级的，TTFT相对于冷预填充的加速比从2k令牌时的3.9倍增长到16k令牌时的27倍。在Jetson AGX Thor和DGX Spark上，相同的正确性和结构属性成立。胶囊不是高吞吐量KV缓存服务的替代品；它们定义了一个互补的以延迟为先的服务点，用于显式执行状态重用。

英文摘要

Mainstream LLM serving systems reuse prefix work mainly through paged or radix key-value (KV) caches. This is highly effective for high-throughput, high-concurrency serving, but it manages only one positional fragment of execution state: the KV cache. We study the opposite regime: low-latency, small-batch, on-device physical-AI serving, where interactive LLM agents, speech systems, and robot policies repeatedly branch, reset, interrupt, and re-enter under tight responsiveness budgets. We introduce execution-state capsules, a graph-bound checkpoint and restore mechanism for the complete restorable state at a committed boundary. FlashRT is a white-box, backend-facing kernel runtime whose evaluated NVIDIA CUDA backend runs captured graph plans over contiguous static buffers with no block-table indirection. Because the live state is a closed set of named buffers, a capsule can snapshot, restore, fork, or roll back the whole execution boundary, including KV, recurrent state, convolution state, MTP state, and metadata. This moves reuse from token-addressed KV fragments to graph-bound execution-state boundaries. On an RTX 5090, capsule restore is byte-exact at the stored-state level and token-identical under greedy decode. A KV-only ablation diverges, showing that recurrent state is load-bearing. GPU-resident snapshot and restore are sub-millisecond, and TTFT speedup over cold prefill grows from 3.9x at 2k tokens to 27x at 16k tokens. On Jetson AGX Thor and DGX Spark, the same correctness and structural properties hold. Capsules are not a replacement for high-throughput KV-cache serving; they define a complementary latency-first serving point for explicit execution-state reuse.


2606.19734 2026-06-19 cs.LG 新提交

Federated Bilevel Performative Prediction

联邦双层执行预测

Liangxin Qian, Chang Liu, Xuanyu Cao, Jun Zhao, Kwok-Yan Lam

发表机构 * Nanyang Technological University（南洋理工大学）； Zhejiang University（浙江大学）； Washington State University（华盛顿州立大学）

AI总结研究联邦学习中客户端数据分布受决策影响的双层优化问题，提出联邦双层执行稳定点概念及两种求解方法，实验验证了稳定性阈值和元泛化提升。

Comments Accepted by ICML 2026

详情

AI中文摘要

联邦双层优化广泛用于跨分布式客户端的嵌套学习问题，例如在隐私和通信约束下的联邦超参数调整和元学习。大多数现有公式假设客户端数据分布固定，但执行性可能违反这一假设，其中部署的决策会重塑客户端行为和数据收集，导致客户端特定的、决策依赖的分布偏移。我们研究联邦双层执行预测，其中上层（UL）和下层（LL）目标都在客户端依赖、决策依赖的分布下进行评估。我们在解耦风险视角下形式化联邦双层执行稳定（FBPS）点，并给出其存在性和唯一性的充分条件。然后，我们开发两种联邦方法来计算FBPS解：FBi-RRM，在收缩条件下线性收敛；以及FBi-SGD，一种基于联邦超梯度估计的通信高效随机方法，在步长递减且敏感性足够小时具有收敛保证。在策略回归和元策略分类上的实验验证了预测的稳定性阈值，并展示了相对于非执行基线的元泛化改进，基于CNN的分类进一步证明了所提方法在非凸神经网络设置中的实际有效性。

英文摘要

Federated bilevel optimization is widely used for nested learning problems across distributed clients, such as federated hyperparameter tuning and meta-learning under privacy and communication constraints. Most existing formulations assume fixed client data distributions, which can be violated by performativity, where deployed decisions reshape client behavior and data collection, inducing client-specific, decision-dependent distribution shift. We study federated bilevel performative prediction, where both upper-level (UL) and lower-level (LL) objectives are evaluated under client-dependent, decision-dependent distributions. We formalize the federated bilevel performatively stable (FBPS) point under a decoupled-risk perspective and provide sufficient conditions for its existence and uniqueness. We then develop two federated methods to compute the FBPS solution: FBi-RRM, which converges linearly under a contraction condition, and FBi-SGD, a communication-efficient stochastic method based on federated hypergradient estimation with convergence guarantees under diminishing step sizes when sensitivities are sufficiently small. Experiments on strategic regression and meta strategic classification validate the predicted stability thresholds and demonstrate improved meta-generalization over non-performative baselines, and CNN-based classification further demonstrates the practical effectiveness of the proposed methods in nonconvex neural network settings.


2606.20115 2026-06-19 cs.LG cs.CV 新提交

When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage

当校准失败于脆弱的医院：通过风险曲线收缩实现联邦共形风险控制

Nafis Fuad Shahid

AI总结针对联邦部署中标准共形风险控制（CRC）对个体机构覆盖不足的问题，提出基于风险曲线收缩的联邦CRC协议，在真实脑肿瘤数据上实现2.7/20的违规率且预测集仅扩大2.0倍。

Comments 9 pages, 3 figures, 2 tables. Submitted to the DeCaF Workshop at MICCAI 2026

详情

AI中文摘要

共形风险控制（CRC）通过在保留数据上校准预测集阈值，提供分割质量的无分布保证。在联邦部署中，标准方法将各站点的校准分数合并为一个阈值。我们在真实多机构脑肿瘤数据（FeTS-2022，1251名受试者，20个机构）上首次量化表明，这种朴素的合并CRC保护了平均医院，但违反了40%个体机构的覆盖，最差站点的假阴性率超出目标7.8个百分点。朴素的替代方案——每个站点本地CRC——基本恢复了覆盖，但将预测集扩大了83倍，使其在临床上无用。我们提出一种基于收缩的联邦CRC协议：每个站点仅将其经验风险曲线（G个标量）传输到服务器，服务器为每个站点计算收缩正则化阈值。单个超参数n0平滑地权衡最坏情况覆盖与预测集效率；留一站点敏感性分析确定n0=19，在2.0倍拉伸下实现2.7/20的违规。我们进一步表明，覆盖预算的直接拉格朗日优化失败，将风险集中在脆弱的医院，并且有限样本修正项是必不可少的：移除它会使违规增加三倍。在所述站点混合假设下，边际CRC保证通过构造得以保留；在三个种子下针对四个目标验证了每个站点的覆盖。没有患者级别的图像、掩膜或每体积分数离开任何站点。

英文摘要

Conformal risk control (CRC) provides distribution-free guarantees on segmentation quality by calibrating a prediction-set threshold on held-out data. In federated deployments, the standard approach pools calibration scores across sites into a single threshold. We provide the first quantification, on real multi-institutional brain tumor data (FeTS-2022, 1,251 subjects, 20 institutions), showing that this naive pooled CRC protects the average hospital but violates coverage at 40% of individual institutions, with the worst site exceeding the target false-negative rate by 7.8 percentage points. The naive alternative, per-site local CRC, largely restores coverage but inflates prediction sets by 83x, rendering them clinically useless. We propose a shrinkage-based federated CRC protocol: each site transmits only its empirical risk curve (G scalars) to a server, which computes a shrinkage-regularized threshold per site. A single hyperparameter n0 smoothly trades worst-case coverage for prediction-set efficiency; leave-one-site-out sensitivity analysis identifies n0=19, achieving 2.7/20 violations at 2.0x stretch. We further show that direct Lagrangian optimization of coverage budgets fails, concentrating risk on vulnerable hospitals, and that the finite-sample correction term is essential: removing it triples violations. The marginal CRC guarantee is preserved by construction under the stated site-mixture assumption; per-site coverage is validated across four targets with three seeds. No patient-level images, masks, or per-volume scores leave any site.
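The abstract does not state the shrinkage formula, so the following is only a sketch under explicit assumptions: each site's empirical risk curve is blended with the pooled curve using weight n/(n + n0), the standard CRC finite-sample correction is applied with the site's own sample size, and risk is non-increasing in the threshold. The function `shrunk_threshold` and all of its parameters are hypothetical names, not the paper's protocol.

```python
def shrunk_threshold(site_curve, n_site, pooled_curve, lambdas, alpha, n0, bound=1.0):
    """Pick the smallest threshold whose shrunk, corrected risk meets the target.

    site_curve / pooled_curve: empirical risk at each candidate in `lambdas`
    (assumed non-increasing as the threshold grows the prediction set).
    n0: pseudo-count; n0 = 0 reproduces per-site CRC, large n0 approaches pooled CRC.
    bound: upper bound on the loss, used in the finite-sample correction.
    """
    weight = n_site / (n_site + n0)             # trust placed in the site's own curve
    for lam, r_site, r_pool in zip(lambdas, site_curve, pooled_curve):
        risk = weight * r_site + (1.0 - weight) * r_pool
        corrected = (n_site / (n_site + 1)) * risk + bound / (n_site + 1)
        if corrected <= alpha:
            return lam
    return lambdas[-1]                          # fall back to the most conservative threshold
```

For a site whose own curve is worse than the pooled one, increasing `n0` in this sketch moves the chosen threshold from the conservative per-site value toward the pooled value, which mirrors the coverage-versus-efficiency tradeoff described in the abstract.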


2606.20382 2026-06-19 cs.LG 新提交

Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach

面向模态不平衡的联邦图学习：一种基于数据合成的方法

Zhengyu Wu, Hongchao Qin, Xunkai Li, Zekai Chen, Rong-Hua Li, Guoren Wang

AI总结针对联邦图学习中客户端级和节点级模态不平衡问题，提出隐式图感知潜在语义表示合成范式FedMGS，通过可用性感知图编码器、原型引导语义合成器和可靠性校准融合机制恢复缺失模态语义，在四个任务上最高提升17.41%。

详情

AI中文摘要

多模态联邦图学习（MM-FGL）提供了一种自然的协作训练范式，但其实际部署受到两种粒度的模态不平衡挑战。当某些客户端缺少完整模态时，会出现客户端级不平衡；而当单个节点缺少视觉或文本属性时，会出现节点级不平衡。尽管存在一些相关研究，但我们的调查表明，它们主要针对图无关或集中式场景，难以直接适应。为了解决这些挑战，我们将模态不平衡的MM-FGL形式化为一个隐式图感知潜在语义表示合成问题。该范式直接在表示空间中恢复缺失的模态语义，从而最大化与原始数据语义分布的对齐，并缓解由缺失模态引起的高方差。为此，我们提出了FedMGS（联邦模态感知图合成），它集成了三个核心组件。可用性感知图编码器防止缺失模态污染局部结构传播。原型引导潜在语义合成器为不可用模态建立跨客户端语义锚点。可靠性校准语义融合机制在预测读出之前调节恢复的潜在表示的影响。在四个任务上的大量实验表明，FedMGS始终优于竞争基线，最高提升17.41%，并实现了最佳效率-性能权衡。

英文摘要

MultiModal Federated Graph Learning (MM-FGL) offers a natural collaborative training paradigm, but its practical deployment is challenged by two granularities of modality imbalance. Client-level imbalance occurs when certain clients lack entire modalities, while node-level imbalance occurs when individual nodes exhibit missing visual or textual attributes. While several relevant studies exist, our investigation reveals that they predominantly target graph-agnostic or centralized scenarios, rendering them difficult to adapt directly. To address these challenges, we formalize modality-imbalanced MM-FGL as an implicit graph-aware latent semantic representation synthesis problem. This paradigm recovers missing modal semantics directly within the representation space, thereby maximizing alignment with the original data's semantic distribution and mitigating the high variance induced by missing modalities. To this end, we propose FedMGS (Federated Modality-aware Graph Synthesis), which integrates three core components. The availability-aware graph encoder prevents missing modalities from contaminating local structural propagation. The prototype-guided latent semantic synthesizer establishes cross-client semantic anchors for unavailable modalities. The reliability-calibrated semantic fusion mechanism regulates the impact of recovered latent representations prior to predictive readout. Extensive experiments on four tasks show that FedMGS consistently outperforms competitive baselines, with gains of up to 17.41% and the best efficiency-performance tradeoff.


2606.20546 2026-06-19 cs.LG 新提交

Predictability as a Fine-Grained Measure for Privacy

可预测性作为隐私的细粒度度量

Linda Lu, Karthik Sridharan

AI总结提出可预测性框架，通过攻击者预测敏感信息的能力增益来衡量隐私泄露，与差分隐私互补，并基于广义矩方法分析渐近可预测性，用于ERM输出扰动。

详情

AI中文摘要

差分隐私（DP）确保针对最知识渊博的攻击者的严格个体级隐私保证，但其最坏情况性质可能导致代价高昂的隐私-准确性权衡。我们引入了通过可预测性实现的隐私，这是一个细粒度框架，明确包含了攻击者的核心知识、由随机过程生成的数据集的受损部分以及指定的查询族。可预测性将隐私泄露衡量为攻击者在观察算法输出后，预测关于未知个体的敏感信息的能力的增量增益，超出已从受损数据中推断出的信息。我们表明，可预测性和DP通常是不可比的：一个可以很小而另一个很大。然而，在最坏情况下，当除一个个体外所有个体都受损且所有二元查询都被视为敏感时，可预测性意味着互信息DP。更一般地，可预测性提供了一种针对特定敏感信息和特定攻击者模型量身定制的更细粒度的隐私度量。我们引入了一个通用框架，使用广义矩方法（GMM），来分析当受损数据由平稳、遍历、混合过程生成时的渐近可预测性。利用这一分析，我们推导出用于ERM的可预测性校准输出扰动方案。我们的方法与DP互补，并且可以与DP一起使用以提供细粒度的隐私控制。

英文摘要

Differential privacy (DP) ensures rigorous individual-level privacy guarantees against even the most knowledgeable attackers, but its worst-case nature can impose a costly privacy-accuracy tradeoff. We introduce privacy via predictability, a fine-grained framework that explicitly incorporates the attacker's core knowledge, a compromised portion of the dataset generated by a stochastic process, and a specified family of queries. Predictability measures privacy leakage as the incremental gain in an attacker's ability to predict sensitive information about unknown individuals after observing the algorithm's output, beyond what can already be inferred from the compromised data. We show that predictability and DP are generally incomparable: each can be small while the other is large. However, in the worst-case regime where all but one individual is compromised, and all binary queries are considered sensitive, predictability implies mutual-information DP. More generally, predictability provides a finer-grained privacy metric tailored to specific sensitive information and specific attacker models. We introduce a general framework, using the generalized method of moments (GMM), to analyze asymptotic predictability when the compromised data is generated by a stationary, ergodic, mixing process. Using this analysis, we derive a predictability-calibrated output perturbation scheme for ERM. Our approach is complementary to DP and can be used alongside DP to provide fine-grained privacy control.


2606.19404 2026-06-19 cs.LG cs.CL 新提交

不确定性感知的奖励建模用于稳定的RLHF

Licheng Pan, Haocheng Yang, Haoxuan Li, Yichen Sun, Yunsheng Lu, Shijian Wang, Lei Shen, Yuan Lu, Zhixuan Chu, Hao Wang

发表机构 * Zhejiang University（浙江大学）； Peking University（北京大学）； National University of Singapore（新加坡国立大学）

AI总结提出不确定性感知奖励建模（UARM），通过分位数保形预测校准不确定性并利用异方差方差分解重加权GRPO优势，以缓解奖励黑客问题，提升对齐质量。

详情

AI中文摘要

从人类反馈中强化学习（RLHF）通过在偏好数据上训练奖励模型并优化策略以最大化预测奖励来对齐大型语言模型。然而，该流程面临两个基本挑战：（1）奖励模型无法在预测不可靠时发出信号，因为它们通常充当确定性点估计器；（2）现代基于组的策略优化可能放大不可靠的奖励信号，例如GRPO在优势计算中对奖励的统一处理。随着策略探索越来越多样化的响应，这两个限制造成了一个关键漏洞：不可靠的奖励估计可能被赋予不成比例的影响力，引发严重的奖励黑客问题。我们提出不确定性感知奖励建模（UARM），通过基于分位数的保形预测为奖励模型配备校准的不确定性，并通过异方差方差分解重加权GRPO优势。在HelpSteer、UltraFeedback和PKU-SafeRLHF上的实验表明，与标准GRPO和不确定性无关的基线相比，UARM显著改善了奖励模型校准，减少了奖励黑客问题，并增强了下游对齐质量。

英文摘要

Reinforcement learning from human feedback (RLHF) aligns large language models by training reward models on preference data and optimizing policies to maximize predicted rewards. However, this pipeline faces two fundamental challenges: (1) reward models cannot signal when their predictions are unreliable, since they usually act as deterministic point estimators; and (2) modern group-based policy optimization can amplify unreliable reward signals, as exemplified by GRPO's uniform treatment of rewards during advantage computation. As policies explore increasingly diverse responses, these two limitations create a critical vulnerability: unreliable reward estimates may be granted disproportionate influence, triggering severe reward hacking. We propose Uncertainty-Aware Reward Modeling (UARM), which equips reward models with calibrated uncertainty via quantile-based conformal prediction and reweights GRPO advantages through heteroscedastic variance decomposition. Experiments across HelpSteer, UltraFeedback, and PKU-SafeRLHF demonstrate that UARM significantly improves reward model calibration, reduces reward hacking, and enhances downstream alignment quality compared to standard GRPO and uncertainty-agnostic baselines.


2606.20415 2026-06-19 cs.LG 新提交

Pseudo-Feature Padding: A Lightweight Defense Against False Data Injection in Power Grids

伪特征填充：一种针对电网虚假数据注入的轻量级防御方法

Farhin Farhad Riya, Shahinul Hoque, Yingyuan Yang, Jinyuan Sun, Kevin Tomsovic

发表机构 * University of Tennessee（田纳西大学）； The University of Illinois at Springfield（伊利诺伊大学斯普林菲尔德分校）； Clemson University（克莱姆森大学）

AI总结提出一种轻量级防御框架，通过基于输入统计分布的伪特征填充增加输入维度，使对抗攻击因扰动不可转移和填充结构不可预测而计算不可行，显著提升深度神经网络在电网状态估计中的鲁棒性。

详情

AI中文摘要

深度神经网络（DNN）在各种任务中取得了显著的准确性，包括在信息物理系统（CPS）中用于检测关键操作期间的虚假数据注入攻击（FDIA）。然而，CPS的独特基础设施使得DNN容易受到攻击者的利用，以逃避检测。此外，CPS的独特性质对传统的FDIA防御机制提出了挑战。本文提出了一种创新的防御框架，通过引入一个额外的输入层，该层使用从输入统计分布中导出的伪特征值对输入样本进行填充，从而增强DNN抵御此类攻击的能力。这种填充以随机化和数据感知的方式增加了输入维度，使得由于精心设计的扰动的不可转移性和填充结构的不可预测性，对抗攻击在计算上变得不可行。我们的方法轻量级、与模型无关，并且不需要对核心架构进行修改，使其在现实世界的CPS环境中高度可部署。我们在关键电网应用（如使用IEEE 14节点、30节点、118节点和300节点系统的状态估计）上评估了我们的框架。对抗性设置下的实验表明，我们的填充策略显著提高了模型的鲁棒性，对性能的影响可以忽略不计，并有效缓解了原本会绕过传统防御的攻击。

英文摘要

Deep Neural Networks (DNNs) have achieved remarkable accuracy in various tasks, including their application in Cyber-Physical Systems (CPS) for detecting False Data Injection Attacks (FDIA) during critical operations. However, the unique infrastructure of CPS makes DNNs vulnerable to exploitation by attackers aiming to evade detection. Additionally, the distinct nature of CPS presents challenges for conventional defense mechanisms against FDIA. This paper proposes an innovative defense framework that strengthens DNNs against such attacks by introducing an additional input layer that performs padding in the input samples using pseudo-feature values derived from the input's statistical distribution. This padding increases the input dimensionality in a randomized and data-aware manner, making adversarial attacks computationally infeasible due to the non-transferable nature of crafted perturbations and the unpredictability of the padded structure. Our method is lightweight, model-agnostic, and requires no modifications to the core architecture, making it highly deployable in real-world CPS settings. We evaluated our framework on critical power grid applications such as state estimation using the IEEE 14-bus, 30-bus, 118-bus, and 300-bus systems. Experiments under adversarial settings demonstrate that our padding strategy significantly improves model robustness with negligible impact on performance and effectively mitigates attacks that would otherwise bypass conventional defenses.


2606.20557 2026-06-19 cs.LG math.ST stat.ML stat.TH 新提交

Optimal Deterministic Multicalibration and Omniprediction

最优确定性多校准与全预测

Georgy Noarov, Aaron Roth

发表机构 * University of Pennsylvania（宾夕法尼亚大学）

AI总结本文提出一种确定性算法，实现多校准的极小化最优样本复杂度，并推广到结果不可区分性，解决确定性预测器是否必要的问题。

详情

AI中文摘要

一个模型在一组群体权重 $G$ 上是多校准的，如果它是校准的——即即使以其预测为条件也是无偏的——不仅整体上，而且在通过每个 $g \in G$ 对上下文重新加权后也是如此。这对于许多下游应用是一个有用的性质，也是可信机器学习的基本要求。在这项工作之前，所有已知达到 $\varepsilon$-多校准的极小化最优 $\widetilde O(\varepsilon^{-3})$ 样本复杂度的预测器都是随机化的，而确定性预测器仅以更差的样本复杂度已知。多校准中随机化对于最优样本复杂度是否必要的问题由 [CLNR26] 明确提出，并在之前的几项工作中隐含提出。我们通过给出一个输出确定性预测器的极小化最优多校准算法解决了这个开放问题。然后我们将该算法推广到产生满足关于有限或有限覆盖测试集合的结果不可区分性（OI）的最优确定性预测器。作为一个应用，这也给出了具有最优样本复杂度的确定性全预测器和泛预测器，解决了 [OKK25] 和 [BHHLZ25] 提出的开放问题。

英文摘要

A model is multicalibrated on a collection of group weights $G$ if it is calibrated -- i.e. unbiased even conditional on its prediction -- not just overall, but also after reweighting contexts by each $g \in G$. It is a useful property for many downstream applications and is a basic desideratum of trustworthy machine learning. Before this work, all predictors known to attain the minimax-optimal $\widetilde O(\varepsilon^{-3})$ sample complexity rate for $\varepsilon$-multicalibration were randomized, while deterministic predictors were known only with substantially worse sample complexity. Whether randomization is necessary for optimal sample complexity in multicalibration was explicitly asked by [CLNR26] and implicitly in several prior works. We resolve this open problem by giving a minimax-optimal multicalibration algorithm that outputs a deterministic predictor. We then generalize the algorithm to produce optimal deterministic predictors that satisfy outcome indistinguishability (OI) with respect to finite or finitely covered collections of tests. As an application, this also gives deterministic omnipredictors and panpredictors with optimal sample complexity, resolving open problems posed by [OKK25] and [BHHLZ25].


2606.19374 2026-06-19 cs.LG cs.AI 新提交

Protein Representation Learning with Secondary-Structure and Energy-Filtered Hydrogen-Bond Graphs

基于二级结构和能量过滤氢键图的蛋白质表示学习

Mohamed Mouhajir, Limei Wang, El Houcine Bergou, Hajar El Hammouti, Lamiae Azizi, Dongqi Fu

发表机构 * College of Computing, UM6P（穆罕默德六世理工大学计算机学院）

AI总结提出一种二级结构感知的图神经网络，通过增强残基节点表示并基于能量过滤的氢键构建边，以捕获局部结构上下文和长程耦合，在蛋白质基准上取得一致改进并增强生物学可解释性。

Journal ref The 25th International Workshop on Data Mining in Bioinformatics (BIOKDD 2026)

详情

AI中文摘要

基于图的表示被广泛用于蛋白质建模，然而许多现有方法主要依赖序列邻接或几何邻近，这仅部分反映了控制蛋白质折叠的原理。蛋白质实际上采用围绕二级结构元素（如α-螺旋和β-折叠）组织的复杂三维构象，这些元素编码了重复的局部基序和稳定的氢键相互作用。在这项工作中，我们引入了一种二级结构感知的图神经网络用于蛋白质表示学习。残基级别的节点表示通过二级结构分配得到增强，图边由经过能量强度过滤的氢键相互作用构建。这种设计使模型能够捕获对蛋白质稳定性和功能至关重要的局部结构上下文和长程耦合。我们在常用的蛋白质基准上评估了所提出的方法，并观察到相对于现有基于图的方法的一致改进。此外，生成的图表示提供了增强的生物学可解释性，因为学习到的连接性与已建立的结构基序一致。这些发现表明，融入二级结构和能量过滤的氢键拓扑为蛋白质表示学习提供了有效的归纳偏置。代码发布在 https://github.com/mohamedmohamed2021/SSProNet 。

英文摘要

Graph-based representations are widely used in protein modeling, yet many existing approaches rely primarily on sequence adjacency or geometric proximity, which only partially reflect the principles governing protein folding. Proteins instead adopt complex three-dimensional conformations organized around secondary structure elements, such as $\alpha$-helices and $\beta$-sheets, which encode recurring local motifs and stabilizing hydrogen-bond interactions. In this work, we introduce a secondary-structure-aware graph neural network for protein representation learning. Residue-level node representations are augmented with secondary structure assignments, and graph edges are constructed from hydrogen-bond interactions filtered by their energetic strength. This design enables the model to capture both local structural context and long-range couplings that are central to protein stability and function. We evaluate the proposed approach on commonly used protein benchmarks and observe consistent improvements over existing graph-based methods. In addition, the resulting graph representations offer enhanced biological interpretability, as the learned connectivity aligns with established structural motifs. These findings suggest that incorporating secondary structure and energy-filtered hydrogen-bond topology provides an effective inductive bias for protein representation learning. The code is released at https://github.com/mohamedmohamed2021/SSProNet


2606.19825 2026-06-19 cs.LG 新提交

LOKI: 无记忆零空间约束的终身知识编辑

Masih Eskandar, Miquel Sirera Perelló, Stratis Ioannidis, Jennifer Dy

AI总结提出LOKI方法，通过希尔伯特-施密特独立性准则动态选择层，并将梯度更新投影到模型权重的零空间，实现无需访问旧知识的终身知识编辑，平均准确率提升14%。

2606.20431 2026-06-19 cs.LG 新提交

Sparsity, Superposition, and Forgetting: A Mechanistic Study of Representation Retention in Continual Learning

稀疏性、叠加与遗忘：持续学习中表示保持的机制研究

Jan Wasilewski, Jędrzej Kozal, Michał Woźniak, Bartosz Krawczyk

发表机构 * Rochester Institute of Technology（罗切斯特理工学院）； Wrocław University of Science and Technology（弗罗茨瓦夫科技大学）

AI总结通过可控玩具框架研究持续学习中的遗忘机制，发现叠加随时间增加但任务边界处有瞬降，高稀疏性增加叠加但不必然导致遗忘，任务级有效秩随稀疏性增长。

详情

AI中文摘要

持续学习（CL）系统常常遗忘先前获得的知识，但由于真实数据集纠缠了许多因素，遗忘的机制在实践中难以孤立。我们提出了一个可控的玩具世界框架，使这些机制可观察和可测试。使用合成生成器-分离器流水线，我们定义了真实潜在特征，构建了具有可调稀疏性和重叠的任务，并引入了表示强度和叠加（特征间的方向重叠）的可测量量。然后，我们通过拟合保留、叠加和暴露历史之间的稀疏动态关系（通过SINDy）来研究保留动态——表示强度的时间变化。基于有效秩的互补任务级分析表征了表示能力如何在任务间分配。我们的受控实验得出三个要点。（1）叠加随时间增加，在任务边界处有瞬降，表明边界特定的干扰而非稳定漂移。（2）更高的特征稀疏性导致更多叠加，但不必然引起遗忘；当表示保持强时，尽管重叠，遗忘可以减少。（3）任务级有效秩随稀疏性增长，表明在稀疏机制下更广泛的能力使用。这些结果共同细化了常见直觉——更多叠加导致更多遗忘，通过显示重叠与表示强度和能力分配相互作用。我们的玩具分析为CL提供了可证伪的假设和诊断工具。

英文摘要

Continual learning (CL) systems often forget previously acquired knowledge, yet the mechanisms driving forgetting remain hard to isolate in practice because real datasets entangle many factors. We present a controlled, toy-world framework that makes these mechanisms observable and testable. Using a synthetic generator-separator pipeline, we define ground-truth latent features, build tasks with tunable sparsity and overlap, and introduce measurable quantities for representation strength and superposition (directional overlap among features). We then study retention dynamics, the temporal change of representation strength, by fitting sparse dynamical relations (via SINDy) between retention, superposition, and exposure history. A complementary task-level analysis based on effective rank characterizes how representational capacity is allocated across tasks. Our controlled experiments yield three takeaways. (1) Superposition tends to increase over time with transient dips at task boundaries, suggesting boundary-specific interference rather than steady drift. (2) Higher feature sparsity induces more superposition yet does not inevitably cause forgetting; when representations remain strong, forgetting can be reduced despite overlap. (3) Task-level effective rank grows with sparsity, indicating broader capacity usage under sparse regimes. Together, these results nuance the common intuition that more superposition leads to more forgetting by showing that overlap interacts with representation strength and capacity allocation. Our toy analysis provides falsifiable hypotheses and diagnostic tools for CL.


2606.20538 2026-06-19 cs.LG 新提交

Multi-Task Bayesian In-Context Learning

多任务贝叶斯上下文学习

Qingyang Zhu, Eric Karl Oermann, Kyunghyun Cho

发表机构 * New York University（纽约大学）

AI总结提出多任务上下文学习框架，通过将先验信息表示为上下文数据集前缀，训练Transformer实现分层贝叶斯预测推理，在多种分布偏移下匹配最优贝叶斯性能且速度提升数个数量级。

Comments ICML 2026

详情

AI中文摘要

贝叶斯预测推断为不确定性量化、数据效率和鲁棒泛化提供了原则性框架。然而，精确推断通常难以处理，可扩展近似可能仍计算昂贵或需要限制性建模假设，从而降低预测性能。先验数据拟合和上下文模型最近作为一种摊销替代方案出现，通过学习直接将数据集映射到预测分布，但现有方法与训练先验的支持紧密耦合，缺乏在测试时适应新先验的显式机制，导致在分布偏移下鲁棒性有限。我们引入了一个多任务上下文学习框架，用于摊销分层贝叶斯预测推断，该框架将先验信息显式表示为上下文数据集的前缀。一个在先验和目标任务序列上训练的Transformer学习跨先验族调整其预测。在一系列难度递增的评估中，包括元分布外先验和具有高维潜在结构的先验，我们的方法匹配了最优贝叶斯预测器，同时速度快了几个数量级。我们进一步在真实世界的时空温度预测基准上展示了其实用性。代码可在 https://github.com/martianmartina/multi-task-bayesian-icl/ 获取。

英文摘要

Bayesian predictive inference provides a principled framework for uncertainty quantification, data efficiency, and robust generalization. However, exact inference is often intractable, and scalable approximations may remain computationally expensive or require restrictive modeling assumptions that degrade predictive performance. Prior-Data Fitted and in-context models have recently emerged as an amortized alternative by learning to map datasets directly to predictive distributions, but existing approaches are tightly coupled to the support of the training prior and lack explicit mechanisms for adapting to new priors at test time, resulting in limited robustness under distribution shift. We introduce a multi-task in-context learning framework for amortized hierarchical Bayesian predictive inference that explicitly represents prior information as a prefix of in-context datasets. A transformer trained on sequences of prior and target tasks learns to adapt its predictions across families of priors. On a suite of evaluations with increasing difficulty, including out-of-meta-distribution priors and priors with high-dimensional latent structures, our method matches oracle Bayesian predictors while being orders of magnitude faster. We further demonstrate its practical relevance on a real-world spatiotemporal temperature prediction benchmark. Code is available at https://github.com/martianmartina/multi-task-bayesian-icl/.


2606.19411 2026-06-19 cs.LG 新提交

Spectral DPPs via NEPv: A Scalable Continuous Relaxation of Determinantal MAP for Diversity-Aware Data Selection

通过NEPv的谱DPP：用于多样性感知数据选择的确定性点过程MAP的可扩展连续松弛

Richard Yi Da Xu

发表机构 * Hong Kong Baptist University（香港浸会大学）； TadReamk Limited（TadReamk有限公司）

AI总结提出将NP难的DPP-MAP选择问题转化为Stiefel流形上的连续优化，通过非线性特征值问题（NEPv）的自洽场迭代实现近线性时间求解，适用于大规模数据选择。

详情

AI中文摘要

从海量候选池中选择一个小的、多样化的、高质量的子集是现代机器学习中的一个常见原语——用于训练和微调大型模型的数据整理和核心集选择、主动学习批次获取、上下文学习的提示和示例选择、检索多样化以及实验设计。确定性点过程（DPP）为此任务提供了原则性的、良好校准的多样性概念，但其MAP目标——选择大小为$k$的子集$S$最大化$\log\det(L_S)$——是NP难的，并且标准的贪心和采样算法在候选集大小$n$上具有超线性复杂度。这种成本在多样性最重要的数据为中心的场景中尤其高昂，其中$n$范围从数百万到数十亿的候选示例、特征或嵌入。我们将DPP-MAP重新表述为Stiefel流形上的连续优化问题，并证明其最优性条件构成一个先前未研究形式的具有特征向量依赖性的非线性特征值问题（NEPv）。该NEPv允许自洽场（SCF）迭代，具有基于谱间隙的局部收缩保证，从而提供了一个原则性的迭代求解器，其中多样性目标驱动一个特征向量依赖的算子。由此产生的算法OurMethod仅需要与核的矩阵-向量乘积，运行时间为$O\!\big((ndk+nk^2)\,t\big)$，其中迭代次数$t$很小，在$n$上接近线性，并直接与机器学习中常见的低秩和特征映射核集成。本文重点介绍松弛、求解器和扩展分析；完整的真实数据基准测试留给计划中的实证研究。

英文摘要

Selecting a small, diverse, high-quality subset from a massive pool of candidates is a recurring primitive in modern machine learning -- data curation and coreset selection for training and fine-tuning large models, active-learning batch acquisition, prompt and exemplar selection for in-context learning, retrieval diversification, and experimental design. Determinantal Point Processes (DPPs) give a principled, well-calibrated notion of diversity for this task, but their MAP objective -- pick a size-$k$ subset $S$ maximizing $\log\det(L_S)$ -- is NP-hard, and the standard greedy and sampling algorithms scale superlinearly in the ground-set size $n$. This cost is prohibitive precisely in the data-centric regime where diversity matters most, where $n$ ranges over millions to billions of candidate examples, features, or embeddings. We recast DPP-MAP as a continuous optimization problem over the Stiefel manifold, and show that its first-order optimality conditions form a Nonlinear Eigenvalue Problem with eigenvector dependency (NEPv) of a previously unstudied form. This NEPv admits a self-consistent field (SCF) iteration with a spectral-gap-based local contraction guarantee, giving a principled iterative solver where the diversity objective drives an eigenvector-dependent operator. The resulting algorithm, \OurMethod, requires only matrix-vector products with the kernel and runs in time $O\!\big((ndk+nk^2)\,t\big)$ for a small number of iterations $t$, scaling near-linearly in $n$ and integrating directly with low-rank and feature-map kernels common in ML. This paper focuses on the relaxation, solver, and scaling analysis; full real-data benchmarking is left to a planned empirical study.
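To make the MAP objective concrete, the sketch below evaluates $\log\det(L_S)$ with a small Cholesky routine and runs the naive greedy baseline that the abstract contrasts against. It is not the paper's relaxation or SCF solver. The helper names `logdet` and `greedy_dpp_map` and the toy kernel in the usage note are illustrative assumptions.

```python
import math

def logdet(mat):
    """log det of a symmetric positive-definite matrix via Cholesky factorization."""
    n = len(mat)
    chol = [[0.0] * n for _ in range(n)]
    total = 0.0
    for i in range(n):
        for j in range(i + 1):
            s = mat[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            if i == j:
                if s <= 0.0:
                    return float("-inf")        # not positive definite
                chol[i][j] = math.sqrt(s)
                total += math.log(s)            # log det = sum of log(squared pivots)
            else:
                chol[i][j] = s / chol[j][j]
    return total

def greedy_dpp_map(kernel, k):
    """Naive greedy subset selection: add the item that maximizes log det(L_S)."""
    selected = []
    for _ in range(k):
        candidates = [i for i in range(len(kernel)) if i not in selected]

        def objective(i):
            idx = selected + [i]
            sub = [[kernel[r][c] for c in idx] for r in idx]
            return logdet(sub)

        selected.append(max(candidates, key=objective))
    return selected
```

With a toy kernel such as `[[1.0, 0.99, 0.0], [0.99, 1.0, 0.0], [0.0, 0.0, 0.9]]`, items 0 and 1 are near-duplicates, so after choosing item 0 the greedy step prefers the dissimilar item 2 even though its quality is slightly lower. Each greedy step here recomputes a determinant per candidate, which is the kind of superlinear cost the abstract aims to avoid.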

2606.19416 2026-06-19 cs.LG New submission

MortarBench: Evaluating Mortgage Loan Origination Agents

Matthew Toles, Yunan Lu, Manav Munjal, Bojun Liu, Yuanhao Deng, Stephanie Selig, Derek Rindner, Cheng Li, Zhou Yu

Affiliations: Columbia University; Tidalwave

AI summary: Proposes MortarBench, a benchmark that uses a financial data synthesis and mutation pipeline to generate examples covering edge cases and evaluates large language models on loan origination tasks. The models show low accuracy and bias; the CRIT calibration framework raises accuracy to 80.5%.

Abstract

Loan origination is the process by which a lender creates a new loan, from application and underwriting through approval and funding. This process serves a critical role in evaluating the eligibility and level of risk posed by an applicant. Recently, firms have begun using mortgage loan agents to augment human loan officers, despite a lack of any public benchmark. To fill this gap, we present MortarBench, a loan origination agent benchmark. MortarBench uses a financial data synthesis and mutation pipeline to generate examples with broad edge case coverage that match real-world distributions and questions. We find that state-of-the-art large language models (LLMs) perform poorly, with closed-source models achieving at most 77.1% exact match accuracy. We also discover systematic biases in LLM perception of foreignness related to non-English names. Noting these weaknesses, we introduce CRIT, a confidence calibration framework. Our method increases accuracy to 80.5% while improving risk management steering and reducing bias.

2606.19481 2026-06-19 cs.LG New submission

Insulin4RL: Real-Time Insulin Management in the Intensive Care Unit for Offline Reinforcement Learning

Thomas Frost, Steve Harris

AI summary: To address the poor generalisability caused by discretising electronic health records, proposes Insulin4RL, an offline reinforcement learning dataset built from real clinical trajectories, with more than 375,000 decisions across 12,209 patients, for evaluating model performance under realistic sampling assumptions.

Comments: Under submission

Abstract

Offline reinforcement learning (ORL) offers the potential to improve the quality of clinical decision-making using historical electronic health record (EHR) data. Current training and evaluative practices in this field rely heavily on EHR datasets that have been temporally discretised into fixed, regular time intervals. Discretisation creates fictional representations of complex clinical scenarios and compromises the generalisability of retrospective model evaluations. In this paper, we introduce Insulin4RL, a healthcare ORL dataset featuring naturally irregular inputs and actions from real clinical trajectories. Derived from MIMIC-IV, Insulin4RL comprises over 375,000 labelled decisions across 12,209 patients requiring insulin infusion titration in the Intensive Care Unit. The dataset can thus be used for research into ORL model performance under realistic clinical sampling assumptions. We provide a description of the dataset's structure and characteristics, baseline performance metrics using model-free offline reinforcement learning, and a standardised evaluation protocol using fitted Q-evaluation. We conclude with suggested areas for future research that could be addressed using this resource.

2606.19558 2026-06-19 cs.LG cs.CL New submission

Displacement Is Not Direction: Evaluating Fidelity Metrics for Quantized LLM Deployment

Miloš Nikolić, Ali Hadi Zadeh, Enrique Torres Sanchez, Andreas Moshovos

Affiliations: ByteShape; University of Toronto; Vector Institute for Artificial Intelligence

AI summary: Studies how fidelity metrics such as KL divergence correlate with downstream benchmark scores in quantized language model deployment. The correlation is strong over the full cohort but breaks down in the near-baseline region, which the authors attribute to KL divergence mainly measuring the volume of disagreement rather than its direction.

Abstract

Fidelity metrics, such as per-token KL divergence (KLD) against a high-precision reference, are often used in practice as low-cost proxies for benchmark quality. We test this practice on a 28-quant cohort of Qwen3.6-35B-A3B and a 41-quant cohort of Devstral-Small-2-24B, evaluated across a suite of downstream benchmarks. We find that KLD is strongly correlated with benchmark score over the full cohort ($ρ=-0.72$ on Qwen and $ρ=-0.86$ on Devstral, both with $p<0.001$). However, this relationship collapses to non-significance in the near-baseline silent zone ($ρ=+0.00$ on Qwen and $ρ=-0.24$, $p=0.36$, on Devstral). This collapse persists across 14 measurement variants, including different KLD aggregations, perplexity formulations, top-1 agreement, calibration corpora, and context lengths. At the per-prompt level, KLD has only weak failure-prediction power on code, with failed-vs-passed geometric-mean ratios in $[1.08,1.22]$ across five models on LiveCodeBench, and fails as a cross-model router, achieving only $42.3\%-49.4\%$ accuracy on disagreement prompts. We trace the collapse to a structural decomposition: KLD primarily measures the volume of disagreement with the reference, with silent-zone composite $ρ=+0.94$ ($p<0.001$) on Qwen and $+0.55$ ($p=0.03$) on Devstral, while its relationship to the direction of those disagreements is weak and task-conditional.
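
For readers unfamiliar with the metric, per-token KLD can be computed as below. This is a minimal sketch assuming the direction KL(reference || quantized) and a plain mean over token positions; the abstract notes that several aggregations were tested, and the function names here are hypothetical.

```python
import math

def kl_divergence(reference_probs, quantized_probs, eps=1e-12):
    """KL(reference || quantized) for one token position, in nats.

    Both inputs are next-token probability distributions over the same
    vocabulary. eps guards against log of zero in the quantized model.
    """
    total = 0.0
    for p, q in zip(reference_probs, quantized_probs):
        if p > 0.0:
            total += p * math.log(p / max(q, eps))
    return total

def mean_per_token_kld(reference_rows, quantized_rows):
    """Average the per-position KL divergence over a sequence of positions."""
    values = [kl_divergence(p, q) for p, q in zip(reference_rows, quantized_rows)]
    return sum(values) / len(values)

# Hypothetical two-token vocabulary, two positions: one identical, one shifted.
reference = [[0.5, 0.5], [0.5, 0.5]]
quantized = [[0.5, 0.5], [0.25, 0.75]]
print(mean_per_token_kld(reference, quantized))
```

The value is zero only when the two distributions match and grows with how much probability mass moves. It does not record whether the moved mass went toward or away from the correct token, which is consistent with the volume-versus-direction distinction drawn in the abstract.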

2606.19595 2026-06-19 cs.LG cs.AI New submission

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

Ahmad Salimi, Wentao Ma, Yuzhi Tang, Dongming Shen, Mu Li, Alex Smola

Affiliations: Boson AI

AI summary: Proposes IHBench, a benchmark that evaluates how voice agents recover after interruptions in structured workflows along two dimensions, task fulfillment and recovery quality. Experiments indicate that closed-weight models are more robust than open-weight models.

Abstract

Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.

Data Bias Mitigation under Coverage Constraints and the Price of Fairness

Bruno Scarone, Alfredo Viola, Renée J. Miller

Affiliations: Khoury College of Computer Sciences, Northeastern University; Cheriton School of Computer Science, University of Waterloo

AI summary: Targets bias affecting groups at the intersection of multiple sensitive attributes. Extends a bias mitigation framework with coverage constraints, optimizes over mitigation strategies with an integer linear program, trades bias approximation error against data efficiency, and characterizes the price of fairness.

Comments: Accepted to FAccT 2026

Abstract

Machine learning models have been shown to exhibit discriminatory outcomes or degraded performance for individuals at the intersection of multiple sensitive attributes, such as race and gender. This stems in part from two interrelated challenges: the lack of principled measures for quantifying bias (potentially intersectional), and insufficient representation of intersectional subgroups in training data. We extend a recent bias mitigation framework to incorporate coverage constraints that enforce sufficient representation across groups, including intersectional subgroups. Since achieving exactly zero bias for all groups may not be data efficient (meaning it may require large amounts of data), our solution trades small approximation errors in bias for greater data efficiency while satisfying coverage constraints. We also formulate bias mitigation as an integer linear program that optimizes over all mitigation strategies, and characterize the price of fairness, the minimum data modification cost, as a function of fairness tolerance. This is essential both for legal compliance, where regulations may mandate specific fairness thresholds, and for data governance, enabling practitioners to make informed trade-offs between bias reduction and data modification (particularly, data purchasing) costs. We evaluate our techniques on publicly available datasets, demonstrating that bias mitigation via our framework preserves predictive accuracy across multiple classifiers, and that coverage constraints, while motivated by statistical considerations, are essential for preserving downstream ML performance.

2606.19363 2026-06-19 cs.LG New submission

When to Trust, How to Distill: Multi-Foundation Model Guidance for Lightweight, Robust Scientific Time Series Forecasting

Rupasree Dey, Abdul Matin, Nathan Orwick, Yao Zhang, Shrideep Pallickara, Sangmi Lee Pallickara

Affiliations: Colorado State University

AI summary: Proposes the Guard framework, which uses a contextual router and an uncertainty-gated temperature mechanism to distill knowledge from multiple distribution-shifted foundation models into a lightweight forecaster, reducing RMSE in four domains including meteorology and carbon flux.

Comments: Accepted to KDD 2026, AI for Science track; 12 pages in total including references and appendix

DOI: 10.1145/3770855.3819018

Abstract

The deployment of Time-Series Foundation Models (TSFMs) in physical sciences is hindered by a critical trade-off: while these models encode rich, universal temporal dynamics, they suffer from severe distributional misalignment when applied zero-shot to specific scientific domains, and their computational cost prohibits deployment in edge-computing sensor networks. We address a fundamental challenge: How can we extract latent structural knowledge from misaligned foundation models (FMs) to train lightweight, specialized forecasters? We propose Gated Uncertainty-Aware Routing for Distillation (Guard), a novel framework that reframes multi-teacher distillation as an instance-wise decision process with two adaptive mechanisms: (1) a Contextual Router that dynamically selects the most relevant teacher based on local input statistics, exploiting complementarity across diverse foundation models; and (2) an Uncertainty-Gated Temperature mechanism that acts as a "circuit-breaker," automatically attenuating distillation strength when teacher confidence diverges from domain reality. We evaluate our proposed lightweight framework on four climate-critical domains: meteorology, ecosystem carbon flux, soil moisture, and energy grids. Our method significantly reduces RMSE relative to a fixed-weight multi-teacher distillation baseline, successfully distilling knowledge from pretrained FMs (teachers) even when they exhibit suboptimal zero-shot accuracy due to distribution shift between the original and target data domains. We demonstrate that these domain-misaligned teachers can still serve as critical correctives, outperforming the globally superior FMs on 28.5% of the hardest instances. Ultimately, this enables high-precision scientific forecasting suitable for resource-constrained edge deployment. Code is available at https://github.com/RupasreeDey/GUARD-KDD2026.

2606.19371 2026-06-19 cs.LG cs.AI cs.CV New submission

ProMUSE: Progressive Multi-modal Uncertainty-guided Staged Evidential Alzheimer Disease Classification

Long Doan, Branden Chen, Ethan Litton, Huan Huang, Jiajing Huang, Yixin Xie, Weihua Zhou, Nandakumar Narayanan, Chen Zhao

Affiliations: Kennesaw State University; Michigan Technological University; University of Iowa

AI summary: Proposes ProMUSE, a progressive multi-modal, uncertainty-guided, staged evidential network that adaptively decides when additional modalities are needed, lowering data acquisition cost while maintaining accuracy.

Abstract

Alzheimer's disease (AD) is a fatal disorder that destroys memory and cognitive skills in the elderly population. Most treatments for AD are effective in the early stage, leading to an increasing demand for early AD diagnosis. AD diagnosis increasingly relies on multimodal data such as clinical assessments, structural Magnetic Resonance Imaging (MRI), and Positron Emission Tomography (PET) imaging. However, MRI and PET acquisition remain costly and not universally accessible, making full-modality inference impractical in real-world clinical workflows. We propose ProMUSE, a Progressive Multi-modal Uncertainty Guided Staged Evidential Network that adaptively determines when additional modalities are necessary, helping reduce the overall cost of data acquisition while maintaining accuracy. ProMUSE first performs evidential classification using low-cost clinical data and quantifies uncertainty via a Dirichlet-based subjective logic model. When uncertainty exceeds a learned threshold, ProMUSE progressively incorporates MRI or PET features, fusing modality-wise belief and uncertainty through Dempster-Shafer theory to obtain a calibrated multimodal prediction. This staged acquisition strategy enables accurate diagnosis while minimizing reliance on expensive imaging. Experiments on ADNI, AIBL, and OASIS across CN-AD, CN-MCI, and MCI-AD tasks demonstrate that ProMUSE achieves competitive or superior accuracy compared to full-modality baselines while reducing MRI/PET usage by 50-90%, yielding substantial cost savings. These results highlight ProMUSE as a practical, uncertainty-aware, and resource-efficient solution for real-world AD screening.
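
The staged procedure in the abstract can be sketched with the standard subjective-logic mapping from Dirichlet evidence to belief and uncertainty, and the reduced Dempster-Shafer combination rule commonly used in evidential multi-view learning. This is an illustrative sketch under those assumptions, not the authors' implementation: the fixed `threshold` stands in for the learned threshold, and all function names and evidence values are hypothetical.

```python
def opinion_from_evidence(evidence):
    """Map non-negative per-class evidence to (belief, uncertainty).

    Uses alpha_k = e_k + 1, S = sum(alpha), b_k = e_k / S, u = K / S,
    so that sum(b) + u == 1.
    """
    num_classes = len(evidence)
    strength = sum(evidence) + num_classes
    belief = [e / strength for e in evidence]
    uncertainty = num_classes / strength
    return belief, uncertainty

def fuse_opinions(belief_a, unc_a, belief_b, unc_b):
    """Reduced Dempster-Shafer combination of two opinions over the same classes."""
    n = len(belief_a)
    conflict = sum(belief_a[i] * belief_b[j]
                   for i in range(n) for j in range(n) if i != j)
    scale = 1.0 / (1.0 - conflict)
    fused_belief = [scale * (belief_a[k] * belief_b[k]
                             + belief_a[k] * unc_b
                             + belief_b[k] * unc_a) for k in range(n)]
    fused_unc = scale * unc_a * unc_b
    return fused_belief, fused_unc

def staged_prediction(evidence_by_stage, threshold=0.3):
    """Start from the cheapest modality; add the next one only while uncertain."""
    belief, uncertainty = opinion_from_evidence(evidence_by_stage[0])
    stages_used = 1
    for evidence in evidence_by_stage[1:]:
        if uncertainty <= threshold:
            break  # confident enough, skip the more expensive modality
        next_belief, next_unc = opinion_from_evidence(evidence)
        belief, uncertainty = fuse_opinions(belief, uncertainty, next_belief, next_unc)
        stages_used += 1
    return belief, uncertainty, stages_used

# Hypothetical evidence: weak clinical evidence, then stronger imaging evidence.
print(staged_prediction([[1.0, 1.0], [8.0, 0.0]]))
```

In the usage line, the first stage is too uncertain (u = 0.5), so the second modality is fused in and the uncertainty drops below that of either source alone. When the first stage is already confident, the loop exits early and no imaging is requested, which is where the reported reduction in MRI/PET usage would come from.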

2606.19373 2026-06-19 cs.LG cs.AI New submission

VERITAS: Verifier-Guided Zero-Shot Search for Formal Theorem Proving

Manish Acharya, Zhenyu Liao, Yueke Zhang, Kevin Leach, Yu Huang, Yifan Zhang

Affiliations: Department of Computer Science, Vanderbilt University; Amazon

AI summary: Proposes the VERITAS framework, which uses verifier feedback for zero-shot theorem proving through a two-phase protocol (Best-of-N sampling followed by critic-guided MCTS), reaches 40.6% on miniF2F, and releases the combinatorics benchmark VERITAS-CombiBench.

Abstract

LLM-based formal provers often collapse rich verifier signals (syntax errors, type mismatches, partial goal progress) into a binary pass/fail bit. We present VERITAS, a zero-shot framework that routes every verifier signal back into proof search through a two-phase protocol: Best-of-N sampling first, then a critic-guided MCTS pass that ingests Phase 1 failures as explicit negative examples. The protocol preserves every theorem solved by its own Phase 1 sweep, so Phase 2's additional solves are attributable to feedback-driven exploration. VERITAS reaches 40.6% on miniF2F (vs. an independently run Best-of-5 at 36.9%, Portfolio 26.2%) and 7.3% on VERITAS-CombiBench, a 55-theorem combinatorics benchmark we release on which Best-of-5 (1.8%) falls below Portfolio (3.6%), exposing that unguided sampling hurts when correct lemma names must be recovered iteratively from verifier feedback. Artifacts are available on GitHub.

2606.19412 2026-06-19 cs.LG New submission

Spectral Retrieval-Augmented Time-Series Forecasting

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

Affiliations: Applied Artificial Intelligence Initiative; Deakin University

AI summary: Proposes SpecReTF, which converts time series into windowed frequency representations, measures similarity with a metric combining amplitude and phase, and applies an exponential moving average weighting scheme. This addresses the spectral blindness and temporal recency limitations of existing retrieval methods and improves forecasting accuracy on non-stationary time series.

Abstract

Time series forecasting leverages historical patterns to predict future values, but traditional methods face challenges when dealing with complex, non-stationary patterns that are difficult to memorize during training. Retrieval-augmented approaches have emerged as promising solutions by retrieving similar historical patterns to enhance predictions. However, existing retrieval methods suffer from two fundamental limitations: spectral blindness, which overlooks critical frequency-domain characteristics that capture underlying periodic structures, and temporal recency, which treats all historical data equally without emphasizing recent, more relevant patterns. In this paper, we propose SpecReTF, a novel retrieval method that addresses these issues by converting time series into windowed frequency representations, measuring similarity with a combined metric that captures both amplitude and phase information. To balance recency and historical context, we apply an exponential moving average weighting scheme that emphasizes recent windows. Extensive experiments on benchmark datasets demonstrate that SpecReTF outperforms time-domain retrieval methods, achieving superior forecasting accuracy across diverse, non-stationary time series.
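
The abstract does not give the exact similarity metric or weighting formula, so the following is only a sketch of the general idea: compare windows in the frequency domain using both amplitude and phase, and weight windows by recency with an exponential decay. The blend parameter `phase_weight`, the decay rate, and all function names are assumptions.

```python
import cmath
import math

def dft(window):
    """Discrete Fourier transform of a real window, non-negative frequencies only."""
    n = len(window)
    return [sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(window))
            for k in range(n // 2 + 1)]

def spectral_similarity(query, candidate, phase_weight=0.5):
    """Blend amplitude-spectrum cosine similarity with amplitude-weighted phase agreement."""
    fq, fc = dft(query), dft(candidate)
    amp_q = [abs(z) for z in fq]
    amp_c = [abs(z) for z in fc]
    norm = math.sqrt(sum(a * a for a in amp_q)) * math.sqrt(sum(b * b for b in amp_c))
    amp_sim = sum(a * b for a, b in zip(amp_q, amp_c)) / norm if norm > 0 else 0.0
    weights = [a * b for a, b in zip(amp_q, amp_c)]
    total = sum(weights)
    if total > 0:
        phase_sim = sum(w * math.cos(cmath.phase(zq) - cmath.phase(zc))
                        for w, zq, zc in zip(weights, fq, fc)) / total
    else:
        phase_sim = 0.0
    return (1.0 - phase_weight) * amp_sim + phase_weight * phase_sim

def recency_weights(num_windows, decay=0.9):
    """Exponentially decaying weights, normalized to sum to 1; the last window is newest."""
    raw = [decay ** (num_windows - 1 - i) for i in range(num_windows)]
    total = sum(raw)
    return [r / total for r in raw]

# Hypothetical windows: a sine and the same wave shifted by a quarter period.
n = 16
sine = [math.sin(2 * math.pi * t / n) for t in range(n)]
cosine = [math.cos(2 * math.pi * t / n) for t in range(n)]
print(spectral_similarity(sine, sine), spectral_similarity(sine, cosine))
```

The two test windows have identical amplitude spectra, so an amplitude-only score cannot tell them apart; the phase term is what separates them. A retrieval score could then multiply each candidate's similarity by its recency weight, though how the paper combines the two is not stated in the abstract.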

2606.19413 2026-06-19 cs.LG New submission

Does Text Actually Help? Uncovering and Resolving Text Collapse in Multimodal Time Series Forecasting

Huu Hiep Nguyen, Minh Hoang Nguyen, Dung Nguyen, Hung Le

AI summary: Targets "text collapse" in multimodal time series forecasting, where the text branch ends up ignored. Proposes REST-TS, which supervises the text branch exclusively on the residual that the numerical backbone cannot explain, forcing it to extract genuine content and achieving state-of-the-art performance.

Abstract

Multimodal time series forecasting, which pairs numerical sequences with domain-relevant textual reports, promises to inject world knowledge into forecasting pipelines. However, we uncover a critical failure mode in existing frameworks that we term text collapse: the text branch converges to a content-independent transformation, contributing negligible discriminative signal regardless of the input description. We argue that text collapse is a consequence of a fundamental asymmetry in time series forecasting: the numerical input is strongly autocorrelated with the output, making the numerical backbone inherently dominant, while the text branch, despite carrying complementary and often critical information, is insufficiently utilized, leading to its systematic underexploitation. To address this, we propose REST-TS (Residual-Exclusive Supervision for Text in Time Series), which turns the asymmetry into a design principle: the numerical backbone produces its own independent numerical forecast, and the text branch is exclusively supervised to predict the structured components of the residual, the prediction gap that numbers cannot explain. Because no numerical pathway can reduce these losses, the text branch must extract genuine content from the input description. Evaluated across diverse real-world domains and backbone architectures, REST-TS achieves state-of-the-art performance and consistently demonstrates greater text-branch utilization than existing frameworks, providing strong empirical evidence that supervising the text branch on the residual compels it to extract genuine content from the input.

2606.19560 2026-06-19 cs.LG New submission

Understanding Key Features of Time Series Foundation Models from Epidemic Forecasting

Alireza Jafari, Judy Fox, Geoffrey C. Fox, Madhav Marathe, Aniruddha Adiga

Affiliations: Department of Computer Science, School of Engineering and Applied Science, University of Virginia; School of Data Science, University of Virginia; Biocomplexity Institute, University of Virginia; Department of Electrical and Computer Engineering, School of Engineering and Applied Science, University of Virginia

AI summary: Systematically evaluates several families of time series models on influenza forecasting. A mixture-of-experts model performs best, pretraining yields its largest gains at longer horizons, and LLM-based methods perform worse.

Comments: 15 pages, 2 figures, 9 tables

Abstract

Seasonal influenza infects millions of people and causes substantial morbidity and mortality in the United States each year, making accurate short-term forecasting a core public-health need. Reliable forecasts of epidemic time series can inform vaccination timing, hospital staffing, and resource allocation, yet the comparative behavior of modern forecasting architectures on infectious-disease surveillance data remains insufficiently characterized. We address this gap through a systematic evaluation of regional influenza forecasting using influenza-like illness surveillance and influenza-associated hospitalization time series under both temporal and spatial generalization settings for 1-4-week-ahead prediction. We compare classical neural network architectures, numerical transformer-based models, pretrained time series foundation models, and LLM-based forecasting approaches. Across tasks, we demonstrate that a mixture-of-experts model that fuses multiple pretrained forecasters achieves the strongest overall performance, indicating that heterogeneous pretrained representations provide complementary predictive information. Our results further show that numerical transformer-based models produce reliable forecasts, while pretraining provides the largest gains at longer horizons, particularly when the pretraining domain is mechanistically aligned with influenza dynamics. In contrast, LLM-based time series methods underperform relative to numerical forecasters in this setting. Finally, we examine hospitalization information as both an auxiliary covariate and a pretraining source. Hospitalization signals provide complementary improvements in selected settings and clarify when additional surveillance streams enhance the robustness of multi-horizon forecasting. These findings provide actionable guidance on model selection, pretraining strategy, and auxiliary-signal use for influenza preparedness.

2606.19562 2026-06-19 cs.LG physics.flu-dyn New submission

Advances in Scientific Machine Learning for Coupled Fluid Flow and Transport

Gabriel F. Barros, Rômulo M. Silva, Alvaro L. G. A. Coutinho

Affiliations: COPPE, Federal University of Rio de Janeiro (UFRJ)

AI summary: Reviews advances in scientific machine learning for coupled fluid flow and transport problems, covering SVD-based linear reduced-order techniques and neural network methods such as PINNs and β-VAEs, and shows applications to turbidity currents and thermal convection.

Abstract

This chapter reviews recent advances in Scientific Machine Learning (SciML) for modeling coupled fluid flow and transport phenomena governed by the incompressible Navier-Stokes and scalar transport equations. Such systems, found in applications like turbidity currents and thermal convection, feature strong nonlinear coupling and multiscale behavior that make high-fidelity simulations computationally expensive. To address this, the chapter surveys state-of-the-art SciML methods for building efficient surrogate models, including linear reduced-order techniques based on Singular Value Decomposition (such as Dynamic Mode Decomposition) and nonlinear neural network approaches like Physics-Informed Neural Networks (PINNs) and $β$-Variational Autoencoders ($β$-VAEs). It first covers the authors' work combining these models with High Performance Computing strategies, including Adaptive Mesh Refinement/Coarsening (AMR/C) and scientific floating-point data compression. It then presents two new contributions: surrogate modeling of turbidity currents via PINNs, and the extraction of disentangled nonlinear modes from thermal flows using $β$-VAEs. Governing equations and representative benchmarks, including lock-exchange flows and Rayleigh-Bénard convection, illustrate these methodologies. The chapter is intentionally long, covering both the mathematical and physical foundations of coupled fluid flow and the computational aspects of state-of-the-art modeling. Overall, it demonstrates how SciML enables fast, accurate approximations of complex coupled systems within the specific data regimes and modeling assumptions considered, while substantially reducing computational cost relative to full-order simulations. Broader capabilities such as real-time prediction and uncertainty quantification remain active research directions whose feasibility depends strongly on the problem at hand.

2606.19623 2026-06-19 cs.LG New submission

SEAGAN: domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes

Antriksh Srivastava, Soumyashree Kar

AI summary: Proposes SEAGAN, which casts the identification of biochemical limitation states along plant A-Ci curves as a graph node classification problem. Graphs are built with distance-based kNN and auxiliary-signal-guided connectivity, and an edge-aware graph attention network improves classification, reaching an F1-score of 0.857.

Abstract

Graph neural networks (GNNs) provide a flexible framework for learning from scientific data linked through physical, biological, or functional relationships. One promising domain is plant physiology, where measured responses often arise from multiple interacting processes whose exact separation remains difficult even with manual intervention. In plant physiology, a key example is the A-Ci curve, which relates net CO2 assimilation rate (Anet) to leaf intercellular CO2 concentration (Ci) and is used to estimate photosynthetic parameters in leaf and crop-canopy models. However, reliable estimation requires identifying the active biochemical limitation state at each curve point, which remains a major source of uncertainty. Here, we formulate limitation-state identification along A-Ci curves as a graph-based node classification problem, with curve points as nodes. Domain-specific graph representations are created using distance-based k-nearest-neighbor (kNN) and auxiliary-signal-guided (ASG) connectivity, with edge attributes encoding pairwise relations. The framework was evaluated against conventional learning baselines, graph-based architectures, and an automated fitting-based benchmark. Results on a large synthetic dataset with known ground-truth limitation states show that graph-based models improve classification, particularly near biochemical transition regions. The best-performing configuration, SEAGAN (domain-Specific and Edge-Aware Graph Attention Network for Dynamic Plant Processes), integrates process-aware node features, edge attributes, kNN connectivity, and graph attention with weighted cross-entropy loss, achieving an F1-score of 0.857 and an accuracy of 0.882. The results show that representing A-Ci curves as graphs improves biochemical limitation-state analysis, with edge-aware attention over local kNN neighborhoods providing the most effective strategy.
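
The distance-based kNN graph construction described above can be sketched as follows, treating each A-Ci curve point as a node. This is a minimal illustration, not the authors' pipeline: the choice of raw Euclidean distance in (Ci, Anet) space without feature scaling, the edge attributes, and the sample values are all assumptions.

```python
import math

def build_knn_graph(points, k):
    """Connect each curve point to its k nearest neighbors.

    points: list of (ci, anet) tuples, one per measured curve point.
    Returns directed edges with hypothetical pairwise attributes
    (delta_ci, delta_anet, distance).
    """
    edges = []
    for i, (ci_i, a_i) in enumerate(points):
        candidates = []
        for j, (ci_j, a_j) in enumerate(points):
            if i != j:
                candidates.append((math.hypot(ci_j - ci_i, a_j - a_i), j))
        candidates.sort()  # nearest first; ties broken by node index
        for distance, j in candidates[:k]:
            ci_j, a_j = points[j]
            edges.append({
                "source": i,
                "target": j,
                "attr": (ci_j - ci_i, a_j - a_i, distance),
            })
    return edges

# Hypothetical A-Ci curve: Ci values with a saturating Anet response.
curve = [(50.0, 2.0), (100.0, 6.0), (150.0, 9.5), (300.0, 15.0), (600.0, 18.0)]
for edge in build_knn_graph(curve, k=2)[:4]:
    print(edge)
```

Because Ci spans a much larger numeric range than Anet, unscaled distances are dominated by Ci; a real pipeline would likely normalize the features first. The resulting edge list, with its attributes, is the kind of input an edge-aware graph attention layer would consume.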

2606.20015 2026-06-19 cs.LG 新提交

Adaptive Distance-Aware Trunk Deep Operator Learning for Long-Span Roadway Bridges

自适应距离感知主干深度算子学习用于大跨度公路桥梁

Bilal Ahmed, Diab W. Abueidda, Waleed El-Sekelly, Tarek Abdoun, Mostafa E. Mobasher

发表机构 * Urban Engineering Department, New York University Abu Dhabi, United Arab Emirates； National Center for Supercomputing Applications, University of Illinois at Urbana-Champaign, United States of America； Department of Structural Engineering, Mansoura University, Mansoura, Egypt

AI总结提出自适应主干DeepONet框架，通过KNN构建荷载相关学习域、距离感知特征和刚度-informed Schur补全重建，实现大跨度桥梁局部响应高精度快速预测，相对误差低于5%，速度提升约60倍。

Comments 39 pages, 26 figures

AI中文摘要

大跨度公路桥梁在车辆荷载下表现出高度局部化的结构响应，使得重复有限元分析在影响面生成和结构数字孪生等应用中计算成本高昂。现有的科学机器学习方法难以准确捕捉这些局部响应。为解决这一挑战，本研究提出了一种自适应主干DeepONet用于大型桥梁系统的局部结构响应预测。该框架利用KNN策略动态构建荷载相关的学习域，使网络聚焦于结构影响区域。主干网络进一步通过距离感知特征增强，这些特征编码了荷载与结构节点之间的几何关系。通过刚度-informed Schur补全公式引入基于物理的全场重建，使得自适应节点上的预测能够扩展到整个结构域。为了实现可扩展训练，使用降阶等效壳模型生成响应数据，该模型保留了主要的全局行为，同时显著降低了计算成本。该框架在基准桥梁模型和真实世界的Mussafah桥上进行了验证。结果表明，该方法实现了有限元级别的精度，相对误差低于5%，同时将总响应评估时间（包括全场重建）减少了约60倍；排除后处理重建步骤，AD-DeepONet推理比有限元快四个数量级。此外，该框架能够在任意车辆荷载配置下快速生成全场响应、影响线和影响面，显示出在大规模桥梁分析和数字孪生应用中的巨大潜力。

英文摘要

Long-span roadway bridges exhibit highly localized structural responses under vehicular loading, making repeated finite element (FE) analysis computationally expensive for applications such as influence surface generation and structural digital twins. Existing scientific machine learning (SciML) approaches struggle to accurately capture these localized responses. To address this challenge, this study proposes an adaptive-trunk DeepONet (AD-DeepONet) for localized structural response prediction in large-scale bridge systems. The framework dynamically constructs a load-dependent learning domain using a k-nearest-neighbor (KNN) strategy, allowing the network to focus on structural influence zones. The trunk network is further enhanced using distance-aware features that encode the geometric relationship between the load and structural nodes. A physics-based full-field reconstruction is incorporated through a stiffness-informed Schur complement formulation, enabling predictions at adaptive nodes to be extended to the entire structural domain. To enable scalable training, response data are generated using a reduced-order equivalent shell model that preserves the dominant global behavior while significantly reducing computational cost. The proposed framework is validated on both a benchmark bridge model and the real-world Mussafah Bridge. Results show that the method achieves FEM-level accuracy with relative errors below 5%, while reducing the total response evaluation time (including full-field reconstruction) by approximately 60x; excluding the post-processing reconstruction step, the AD-DeepONet inference is up to four orders of magnitude faster than FEM. In addition, the framework enables rapid generation of full-field responses, influence lines, and influence surfaces under arbitrary vehicular loading configurations, demonstrating strong potential for large-scale bridge analysis and digital twin applications.

2606.20034 2026-06-19 cs.LG 新提交

Exploring the potential of AlphaEarth and TESSERA embeddings for Fine-scale Local Climate Zone Mapping: A case study across five cities in Switzerland

探索AlphaEarth和TESSERA嵌入在精细尺度局地气候区制图中的应用潜力：以瑞士五个城市为例

Htet Yamin Ko Ko, Clement Atzberger

AI总结本研究对比TESSERA和AlphaEarth嵌入与传统Sentinel-1/2数据，使用注意力U-Net将粗分辨率LCZ图提升至10米，发现嵌入模型在跨城市迁移和精度上表现更优，但跨年迁移仍是挑战。

AI中文摘要

理解城市空间形态对于气候建模、风险评估和可持续城市设计至关重要，而局地气候区（LCZ）制图为此提供了基本框架。然而，许多城市仍使用约100米分辨率的粗LCZ记录，这并不适用于精细尺度的城市研究。在本研究中，我们将TESSERA（Feng等人，2025）和AlphaEarth（Brown等人，2025）的预计算嵌入与传统的Sentinel-1/2（S1S2）合成数据在瑞士五个城市进行比较，以评估它们是否能够使用基于注意力的U-Net将粗LCZ图提升至10米分辨率。三个实验评估了多城市迁移性、更高分辨率参考数据的影响以及对年际物候变化的时间鲁棒性。我们发现，所有数据集在前两个实验中均取得了强劲性能，测试数据的交并比（IoU）分别在0.59-0.69和0.77-0.82之间。TESSERA在两种设置下均一致优于S1S2和AlphaEarth。正如预期，我们发现基于嵌入的模型从一年迁移到另一年仍然是一个开放的挑战。然而，总体而言，我们的结果表明，来自地球观测基础模型的嵌入在减少耗时预处理和手动特征工程任务方面具有巨大潜力，并能够指导通用的基于深度学习的LCZ制图工作流程。当与简单的位置感知注意力U-Net架构结合时，这些嵌入增强了区域迁移性和可扩展性，支持为全球城市气候应用开发全面且可重复的精细尺度LCZ图。提高参考数据质量仍然是进一步提升精度的最强杠杆。

英文摘要

Understanding urban spatial morphology is critical for climate modeling, risk assessment, and sustainable urban design, and Local Climate Zone (LCZ) mapping provides the basic framework for it. However, many cities still rely on coarse LCZ records at roughly 100-m resolution, which are unsuitable for fine-scale urban research. In this study, precomputed embeddings from TESSERA (Feng et al., 2025) and AlphaEarth (Brown et al., 2025) are compared with traditional Sentinel-1/2 (S1S2) composites in five Swiss cities to assess whether they can upscale coarse LCZ maps to 10-m resolution using an attention-based U-Net. Three experiments assess multi-city transferability, the impact of higher-resolution reference data, and temporal robustness to year-to-year phenology changes. We find that all datasets achieve strong performance, with test-data Intersection-over-Union (IoU) of 0.59-0.69 in the first experiment and 0.77-0.82 in the second. TESSERA consistently outperforms both S1S2 and AlphaEarth in both settings. As expected, the transfer of embedding-based models from one year to another remains an open challenge. Overall, however, our results demonstrate the potential of embeddings derived from Earth observation (EO) foundation models to reduce time-consuming preprocessing and manual feature engineering, and to guide a universal deep learning-based LCZ mapping workflow. When combined with a simple location-aware attention U-Net architecture, the embeddings enhance regional transferability and scalability, supporting the development of comprehensive and reproducible fine-scale LCZ maps for global urban climate applications. Improving reference data quality remains the strongest lever for further accuracy gains.
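The IoU figures quoted in the abstract can be read against a minimal sketch of the metric. The abstract does not say how IoU is averaged across LCZ classes, so the unweighted per-class mean below is an assumption, and the label values and function names are purely illustrative.

```python
from collections import Counter


def per_class_iou(y_true, y_pred):
    """Return {class: IoU} for two equal-length, flattened label maps."""
    if len(y_true) != len(y_pred):
        raise ValueError("label maps must have the same number of pixels")
    true_count = Counter(y_true)
    pred_count = Counter(y_pred)
    intersection = Counter(t for t, p in zip(y_true, y_pred) if t == p)
    ious = {}
    for cls in set(true_count) | set(pred_count):
        # |A ∪ B| = |A| + |B| - |A ∩ B|; always > 0 for classes that occur
        union = true_count[cls] + pred_count[cls] - intersection[cls]
        ious[cls] = intersection[cls] / union
    return ious


def mean_iou(y_true, y_pred):
    """Unweighted mean over classes (an assumed averaging scheme)."""
    ious = per_class_iou(y_true, y_pred)
    return sum(ious.values()) / len(ious)


# Hypothetical 6-pixel maps with three LCZ class labels
reference = [1, 1, 2, 2, 6, 6]
predicted = [1, 2, 2, 2, 6, 1]
print(per_class_iou(reference, predicted))
print(mean_iou(reference, predicted))
```

A pixel-weighted or frequency-weighted average would give different numbers on imbalanced maps, which is one reason the averaging scheme matters when comparing reported IoU ranges.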

2606.20037 2026-06-19 cs.LG 新提交

Alzheimer's Disease Diagnosis using a Multimodal Approach with 3D MRI and PET

使用3D MRI和PET的多模态方法诊断阿尔茨海默病

Loukas Ilias, Anthi-Maria Vozinaki, Christos Ntanos, Dimitris Askounis

发表机构 * DSS Lab, School of ECE, NTUA（NTUA ECE学院DSS实验室）

AI总结提出结合3D卷积特征提取器与三种融合策略（拼接、门控多模态单元、门控自注意力）及稀疏门控混合专家分类器的多模态模型，用于阿尔茨海默病诊断，在三个二分类任务上验证了输入自适应建模的有效性。

Comments 2025 IEEE International Conference on Bioinformatics and Biomedicine (BIBM)

DOI: 10.1109/BIBM66473.2025.11357133

AI中文摘要

阿尔茨海默病（AD）是一种不可逆的神经退行性疾病，也是全球主要的死亡原因之一。早期诊断尤为重要，尤其是在轻度认知障碍（MCI）阶段，及时干预有助于延缓其向AD的进展。神经影像数据，如磁共振成像（MRI）和正电子发射断层扫描（PET），可以通过提供与疾病相关的结构和功能脑变化来帮助早期检测脑部变化。然而，许多多模态模型仍通过静态拼接融合MRI和PET，并对所有受试者应用相同的计算，这限制了其对患者/站点异质性的鲁棒性，并可能浪费计算资源。为解决这些局限性，我们首次研究了将3D卷积特征提取器与三种融合策略（拼接、门控多模态单元（GMU）和门控自注意力）以及一个稀疏门控混合专家（MoE）分类器相结合的方法，该分类器执行输入自适应路由，仅激活每个病例中最具信息量的专家。最后，我们利用Grad-CAM可视化疾病相关区域，确保模型的可解释性。实验在三个二分类任务（NC vs. MCI、MCI vs. AD和NC vs. AD）上进行。结果表明，GMU在NC vs. MCI和NC vs. AD上分别达到80.46%和95.47%的准确率，而门控自注意力在MCI vs. AD上达到82.08%。消融实验表明，移除MoE会持续降低所有任务的准确率。这些发现强调了利用MRI和PET互补性的输入自适应多模态建模在AD诊断中的价值。

英文摘要

Alzheimer's disease (AD) is an irreversible neurodegenerative disorder and a leading cause of death worldwide. Early diagnosis is particularly important at the Mild Cognitive Impairment (MCI) stage, where timely intervention can help slow progression to AD. Neuroimaging data, such as Magnetic Resonance Imaging (MRI) and Positron Emission Tomography (PET) scans, can support early detection by revealing structural and functional brain changes related to the disease. Yet, many multimodal models still fuse MRI and PET with static concatenation and apply identical computation to all subjects, which limits robustness to patient/site heterogeneity and can waste computation. To address these limitations, we present the first study of combining 3D convolutional feature extractors with three fusion strategies - concatenation, Gated Multimodal Unit (GMU), and gated self-attention - and a sparsely gated Mixture-of-Experts (MoE) classifier that performs input-adaptive routing, activating only the most informative experts per case. Finally, we utilize Grad-CAM to visualize disease-related regions, supporting model interpretability. Experiments are performed across three binary classification tasks (NC vs. MCI, MCI vs. AD, and NC vs. AD). Results show that GMU achieves accuracies of 80.46% (NC vs. MCI) and 95.47% (NC vs. AD), while gated self-attention attains 82.08% on MCI vs. AD. Ablations show that removing the MoE consistently degrades accuracy across all tasks. These findings underscore the value of input-adaptive, multimodal modeling for AD diagnosis by leveraging the complementary nature of MRI and PET.
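To make the contrast between static concatenation and gated fusion concrete, here is a minimal sketch of a two-modality Gated Multimodal Unit in its commonly published form: each modality is projected through tanh, and a sigmoid gate computed from both inputs mixes the two projections per dimension. The weight matrices, feature sizes, and function names are hypothetical, and nothing here is taken from the authors' implementation.

```python
import math


def matvec(W, x):
    """Multiply a row-major matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))


def gmu_fuse(x_mri, x_pet, W_mri, W_pet, W_gate):
    """Fuse two feature vectors with a per-dimension learned gate.

    h_mri = tanh(W_mri x_mri), h_pet = tanh(W_pet x_pet)
    z     = sigmoid(W_gate [x_mri; x_pet])
    h     = z * h_mri + (1 - z) * h_pet
    """
    h_mri = [math.tanh(a) for a in matvec(W_mri, x_mri)]
    h_pet = [math.tanh(a) for a in matvec(W_pet, x_pet)]
    gate = [sigmoid(a) for a in matvec(W_gate, x_mri + x_pet)]
    fused = [z * hm + (1.0 - z) * hp for z, hm, hp in zip(gate, h_mri, h_pet)]
    return fused, gate


# Hypothetical 2-d features per modality; identity projections for readability
identity = [[1.0, 0.0], [0.0, 1.0]]
x_mri, x_pet = [1.0, 0.0], [0.0, -1.0]

# An all-zero gate matrix gives z = 0.5, i.e. a plain average of both modalities
fused, gate = gmu_fuse(x_mri, x_pet, identity, identity, [[0.0] * 4, [0.0] * 4])
print(gate, fused)

# A gate that responds strongly to the first MRI feature leans towards MRI
fused, gate = gmu_fuse(x_mri, x_pet, identity, identity, [[5.0, 0, 0, 0], [5.0, 0, 0, 0]])
print(gate, fused)
```

Because the gate depends on the inputs, the mixing weights change from subject to subject, which is the input-adaptive behavior that static concatenation lacks.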

2606.20053 2026-06-19 cs.LG 新提交

Comparative Study of Neural Surrogate Architectures for Autoregressive Prediction of Internal Battery States

用于电池内部状态自回归预测的神经代理架构比较研究

Gihyun Lee, Thorben Menne, Simon Olma, Jakob Hilgert, Sangyoung Park

AI总结系统比较四种神经网络架构（MLP、ResNet、U-Net、FNO）作为自回归状态转移算子，预测锂离子电池DFN模型内部状态，发现U-Net因多尺度空间归纳偏置在精度和速度上最优。

Comments 8 pages, 5 figures

AI中文摘要

Doyle-Fuller-Newman (DFN) 模型以高保真度解析锂离子电池的内部电化学状态。然而，其控制方程的数值求解对于实时部署而言计算成本过高，限制了从单个电池到电池组及车队规模应用的可扩展性。虽然机器学习代理可以通过GPU加速大幅降低推理延迟，但现有大多数方法学习的是特定操作条件下的解近似，而非可泛化的状态演化动力学。本文系统比较了四种神经网络架构（MLP、ResNet、U-Net、FNO），它们被构建为自回归状态转移算子，可预测广泛操作条件下的完整DFN内部状态。为确保受控的架构比较，所有模型在统一框架下训练，采用多步展开和电流条件化，隔离了空间归纳偏置的影响。结果表明，U-Net的多尺度特征层次在300步自回归展开后，所有内部状态变量的平均最终步nRMSE达到3%，同时相比数值求解器实现了5.38倍的加速。这些发现强调了空间归纳偏置是代理性能的关键决定因素，推动了用于下一代电池管理系统和数字孪生的内部状态可观测性代理的发展。

英文摘要

The Doyle-Fuller-Newman (DFN) model resolves internal electrochemical states in lithium-ion batteries with high fidelity. However, the numerical solution of its governing equations is computationally prohibitive for real-time deployment, limiting scalability from individual cells to pack and fleet-scale applications. While machine learning surrogates can substantially reduce inference latency through GPU acceleration, most existing approaches learn solution approximations tied to specific operating conditions rather than learning generalizable state-evolution dynamics. This work presents a systematic comparison of four neural network architectures (MLP, ResNet, U-Net, FNO) formulated as autoregressive state-transition operators that predict full DFN internal states across a wide range of operating conditions. To ensure a controlled architectural comparison, all models are trained under a unified framework using multi-step unrolling and current-conditioning, isolating the impact of spatial inductive bias. Results demonstrate that the U-Net's multi-scale feature hierarchy achieves a mean final-step nRMSE of 3% averaged across all internal state variables after 300-step autoregressive rollouts, while providing a 5.38x speed-up over the numerical solver. These findings highlight spatial inductive bias as a critical determinant of surrogate performance, advancing the development of surrogates for internal state observability for next-generation battery management systems and digital twins.

2606.20055 2026-06-19 cs.LG 新提交

PaAno+: Multiscale Encoding and Cross-Variable Attention for Time Series Anomaly Detection

PaAno+：用于时间序列异常检测的多尺度编码与跨变量注意力

Youji Zhu, Hongbing Wang, Wenchao Liu, Xiaodong Liu, Xiangguang Xiong

发表机构 * School of Mathematical Sciences, Guizhou Normal University（贵州师范大学数学科学学院）； School of Big Data and Computer Science, Guizhou Normal University（贵州师范大学大数据与计算机科学学院）

AI总结提出PaAno+模型，通过多尺度特征提取、跨变量融合注意力和补丁窗口排序预任务，实现轻量高效的时间序列异常检测，在TSB-AD基准上达到SOTA。

AI中文摘要

时间序列异常检测在工业和医疗监测等关键领域具有重要的实用价值。当前基于Transformer和大模型的检测方法计算开销过大，而现有的轻量级替代方案受限于特征提取不足以及多变量间依赖关系建模不充分。为缓解上述缺陷，本研究在面向补丁的表征学习范式下，开发了一种轻量高效的异常检测模型PaAno+。在编码器模块中，使用具有差异化感受野的卷积核构建多尺度特征提取主干，以捕获层次化时间特征；随后通过跨尺度自适应注意力聚合结合残差连接优化，进一步稳定特征表征学习。嵌入跨变量融合注意力模块以显式表征变量间相关性，使模型能够在复杂运行条件下识别异常模式。此外，定制了一种基于时间补丁窗口排序的新型前置任务，以揭示时间序列的内在结构特性，并利用三元组损失优化补丁嵌入空间以增强特征判别性。在TSB-AD基准上的大量实验表明，所提出的PaAno+在单变量和多变量任务上均实现了最先进的检测精度，在包括VUS-PR在内的评估指标上相对于原始PaAno取得了显著性能提升。凭借紧凑的网络设计，该模型实现了良好的计算效率，能够在资源受限的终端上部署用于实时异常推理。

英文摘要

Time-series anomaly detection has significant practical value for industrial and medical monitoring, as well as other critical domains. Current Transformer- and large-model-based detection approaches incur excessive computational overhead, while existing lightweight alternatives are constrained by insufficient feature extraction and inadequate modeling of dependencies among variables. To mitigate these drawbacks, this study develops a lightweight, efficient anomaly detection model, dubbed PaAno+, within the patch-oriented representation learning paradigm. In the encoder module, a multiscale feature-extraction backbone is constructed using convolutional kernels with differentiated receptive fields to capture hierarchical temporal characteristics; subsequent cross-scale adaptive attention aggregation, combined with residual connection optimization, further stabilizes feature representation learning. A cross-variable fusion attention module is embedded to explicitly characterize inter-variable correlations, enabling the model to identify anomalous patterns under complex operating conditions. Moreover, a novel pretext task based on temporal patch-window sorting is designed to uncover intrinsic structural properties of time series, and triplet loss is used to optimize the patch embedding space for enhanced feature discrimination. Extensive experiments on the TSB-AD benchmark demonstrate that the proposed PaAno+ achieves state-of-the-art detection accuracy on both univariate and multivariate tasks, yielding significant performance gains across evaluation metrics, including VUS-PR, relative to the original PaAno. Owing to its compact network design, the model achieves favorable computational efficiency, enabling deployment on resource-limited terminals for real-time anomaly inference.

2606.20172 2026-06-19 cs.LG 新提交

Predicting gestational age at birth in the context of preterm birth from multi-modal fetal MRI

基于多模态胎儿MRI预测早产背景下的出生胎龄

Diego Fajardo-Rojas, Megan Hall, Daniel Cromb, Mary A. Rutherford, Lisa Story, Emma C. Robinson, Jana Hutter

发表机构 * Leibniz University Hannover（莱布尼茨汉诺威大学）

AI总结提出结合多模态胎儿MRI和机器学习流程预测出生胎龄，包括数据插补、特征选择和回归模型，在333例对照和93例早产数据上评估，R²=0.13，MAE=2.74周，准确率0.77。

Comments Accepted for publication at the Journal of Machine Learning for Biomedical Imaging (MELBA) https://melba-journal.org/2026:013

Journal ref Machine Learning for Biomedical Imaging 2026 (2026)

DOI: 10.59275/j.melba.2026-f34b

AI中文摘要

早产与高死亡率和终身发病风险相关。复杂的多因素病因阻碍了准确预测和最佳护理。我们开发并评估了一个包含定制机器学习方法的流程，用于数据插补、特征选择和回归模型，以从333例对照和93例早产病例的综合多模态形态和功能胎儿MRI数据预测出生胎龄。将出生胎龄预测分为足月和早产类别，并报告其准确性、敏感性和特异性。进行了消融研究以进一步验证流程设计。使用分层10折交叉验证评估性能。该流程实现了0.13的R²分数和2.74周的平均绝对误差。在交叉验证中，准确率为0.77，敏感性为0.59，特异性为0.82。流程选择的主要特征包括宫颈长度和基于胎盘T2*值的统计量。快速、运动鲁棒的多模态胎儿MRI技术与机器学习预测的结合使得能够预测出生胎龄。这些信息对任何妊娠都至关重要。据我们所知，早产在文献中仅作为分类问题处理。因此，这项工作提供了概念验证。未来工作将增加队列规模，以允许在早产队列内进行更精细的分层。我们的代码可在以下网址获取：https://github.com/dfajardorojas/ml-for-preterm-birth-。

英文摘要

Preterm birth is associated with significant mortality and a risk of lifelong morbidity. Its complex multifactorial aetiology hampers accurate prediction and thus optimal care. A pipeline consisting of bespoke machine learning methods for data imputation, feature selection, and regression to predict gestational age (GA) at birth was developed and evaluated on comprehensive multi-modal morphological and functional fetal MRI data from 333 control cases and 93 preterm birth cases. The predicted GA at birth was then classified into term and preterm categories, and the accuracy, sensitivity, and specificity of this classification were reported. An ablation study was performed to further validate the design of the pipeline. Performance was evaluated using stratified 10-fold cross-validation. The pipeline achieves an R² score of 0.13 and a mean absolute error of 2.74 weeks. It also achieves an accuracy of 0.77, a sensitivity of 0.59, and a specificity of 0.82 across folds. The predominant features selected by the pipeline include cervical length and statistics derived from placental T2* values. The combination of fast, motion-robust, multi-modal fetal MRI techniques and machine learning allowed the prediction of gestational age at birth. This information is essential for any pregnancy. To the best of our knowledge, preterm birth has only been addressed as a classification problem in the literature. Therefore, this work provides a proof of concept. Future work will increase the cohort size to allow for finer stratification within the preterm birth cohort. Our code is available at https://github.com/dfajardorojas/ml-for-preterm-birth-.
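The step from a regression output to the reported accuracy, sensitivity, and specificity can be sketched as follows. This is an illustration only: it assumes the standard clinical cut-off of 37 completed weeks for preterm birth and treats preterm as the positive class, neither of which is stated in the abstract, and the gestational ages are made-up numbers.

```python
PRETERM_THRESHOLD_WEEKS = 37.0  # standard clinical definition; assumed here


def classify_and_score(ga_true, ga_pred, threshold=PRETERM_THRESHOLD_WEEKS):
    """Threshold true and predicted GA at birth, then compute summary metrics.

    Preterm (GA < threshold) is treated as the positive class.
    """
    tp = fp = tn = fn = 0
    for true_weeks, pred_weeks in zip(ga_true, ga_pred):
        is_preterm = true_weeks < threshold
        predicted_preterm = pred_weeks < threshold
        if is_preterm and predicted_preterm:
            tp += 1
        elif is_preterm:
            fn += 1
        elif predicted_preterm:
            fp += 1
        else:
            tn += 1
    n = tp + fp + tn + fn
    return {
        "mae_weeks": sum(abs(t - p) for t, p in zip(ga_true, ga_pred)) / n,
        "accuracy": (tp + tn) / n,
        # None when a class is absent, to avoid dividing by zero
        "sensitivity": tp / (tp + fn) if (tp + fn) else None,
        "specificity": tn / (tn + fp) if (tn + fp) else None,
    }


# Made-up gestational ages in weeks (not data from the study)
ga_true = [39.0, 40.1, 35.0, 32.5, 38.2, 36.0]
ga_pred = [38.5, 39.0, 36.5, 37.5, 36.8, 35.0]
print(classify_and_score(ga_true, ga_pred))
```

The fourth case shows why a regressor with a sizeable mean absolute error can still miss preterm births: a true GA of 32.5 weeks predicted as 37.5 weeks lands on the wrong side of the threshold.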

2606.20174 2026-06-19 cs.LG 新提交

Computational Methods and Challenges in Cell-Free DNA Analysis for Multi-Cancer Early Detection

基于无细胞DNA分析的多癌早期检测的计算方法与挑战

Nicko Starkey, Marcin W. Wojewodzic, Krzysztof Rzecki

发表机构 * AGH University of Krakow（AGH克拉科夫大学）； Norwegian Institute of Public Health（挪威公共卫生研究所）

AI总结综述2022-2025年cfDNA多癌早期检测的计算方法，重点分析片段组学和表观遗传特征提取技术，指出多模态集成方法最具临床整合潜力，但需标准化评估协议。

AI中文摘要

无细胞DNA（cfDNA）是非侵入性多癌早期检测（MCED）的一个有前景的途径，因为它可以通过单次抽血同时检测多种癌症，尤其对目前缺乏既定筛查程序的癌症具有敏感性。本文综述了2022年至2025年间基于cfDNA的MCED计算方法。我们重点关注如何提取和分析片段组学和表观遗传特征以在早期阶段检测癌症。我们首先简要概述cfDNA信号的生物学基础，然后回顾经典的统计和机器学习方法以及深度学习框架，包括基于自编码器的模型。对于每种方法，我们讨论其生物学可解释性、验证策略以及临床整合的准备情况。此外，我们将当前挑战分为技术、计算和方法论三类，并概述该领域的开放问题。本综述表明，多模态集成方法在临床整合方面具有最强的前景和最高的准备度。然而，为了更好地评估未来工作和进行并排比较，标准化评估协议和报告结果至关重要。

英文摘要

Cell-free DNA (cfDNA) is a promising avenue for non-invasive multi-cancer early detection (MCED) because it can enable the simultaneous detection of multiple cancers from a single blood draw, with particular sensitivity to cancers that currently lack established screening programs. Here we review the computational methods developed between 2022 and 2025 for cfDNA-based MCED. We focus on how fragmentomics and epigenetic features are extracted and analyzed to detect cancer at early stages. We first briefly outline the biological basis of cfDNA signals, then review classical statistical and machine learning approaches alongside deep learning frameworks, including autoencoder-based models. For each method we discuss biological interpretability, validation strategy, and readiness for clinical integration. Furthermore, we categorize the current challenges as technical, computational, or methodological and outline open problems in the field. This review shows that multimodal ensemble approaches hold the strongest promise and the highest readiness for clinical integration. However, to better assess future work and enable side-by-side comparison, standardized evaluation protocols and result reporting will be crucial.

2606.20291 2026-06-19 cs.LG cs.CV 新提交

Integrating national forest inventory, airborne lidar, and satellite imagery for wall-to-wall mapping of forest structure with computer vision

整合国家森林清查、机载激光雷达和卫星影像，利用计算机视觉实现森林结构的全覆盖制图

Luke J. Zachmann, David D. Diaz, Vincent A. Landau, Chelsey Walden-Schreiner, Tony Chang, Nathan E. Rutenbeck, Katharyn A. Duffy, Kiarie Ndegwa, Andreas Gros, Scott Conway, Guy Bayes

发表机构 * Vibrant Planet Public Benefit Corporation（Vibrant Planet 公益公司）

AI总结提出VibrantForests框架，结合卫星影像、激光雷达样本和计算机视觉，以10米分辨率生成美国本土的冠层覆盖、高度、生物量等森林属性图，减少饱和与回归均值问题。

AI中文摘要

遥感技术越来越被依赖，以提供可操作的科学研究，用于大型景观的森林和野火风险管理。全覆盖、每年更新的地图是有效森林管理的持续需求。许多规划系统和数据收集结合了不同目的、年份和预测质量的异质数据源，导致运营规划系统中的混淆行为。我们介绍了VibrantForests框架，该框架被开发并应用于绘制森林属性，为有效的森林和野火规划提供一致的基础。VibrantForests包括一个基于卫星的森林结构模型，该模型在激光雷达衍生的样本上训练，并应用于美国本土，以10米分辨率同时生成冠层覆盖度、冠层高度、地上活树生物量、胸高断面积和二次平均直径的估计。我们展示了跨越从稀疏冠层/低生物量到密集冠层/高生物量的全部森林条件的预测能力。结果表明，我们的模型扩展了在类似被动传感器模型中常见的饱和范围，并减少了回归均值行为，该行为通常在小/稀疏条件下高估森林属性，在大/密集条件下低估森林属性。VibrantForests框架通过以年度节奏和10米分辨率提供管理相关属性的一致全覆盖估计，解决了大面积森林和野火规划中的一个关键限制。

英文摘要

Remote sensing is increasingly relied upon to deliver actionable science for forest and wildfire risk management across large landscapes. Wall-to-wall, annually updated maps are a persistent need for effective forest management. Many planning systems and data collections combine disparate data sources with different purposes, vintages, and prediction quality, which leads to confounding behavior in operational planning systems. We introduce the VibrantForests framework, developed and applied to map forest attributes and provide a coherent foundation for effective forest and wildfire planning. VibrantForests includes a satellite-based forest structure model trained on lidar-derived samples and applied across the contiguous United States to concurrently generate estimates of canopy cover, canopy height, aboveground live tree biomass, basal area, and quadratic mean diameter at 10-meter resolution. We demonstrate predictive capability spanning the full spectrum of forest conditions ranging from sparse-canopy/low-biomass to dense-canopy/high-biomass. Results show that our model extends the range at which saturation is commonly encountered in comparable passive-sensor models, and reduces regression-to-mean behavior that commonly produces overestimation of forest attributes in small/sparse conditions and underestimation in large/dense conditions. The VibrantForests framework addresses a key limitation in large-area forest and wildfire planning by delivering coherent wall-to-wall estimates of management-relevant attributes at annual cadence and 10m resolution.

2606.20326 2026-06-19 cs.LG physics.comp-ph 新提交

Quantum-classical physics-informed Kolmogorov-Arnold networks for PDEs

量子-经典物理信息Kolmogorov-Arnold网络求解偏微分方程

Xiang Rao, Yuxuan Shen

AI总结提出QCPIKAN，首个量子-经典物理信息Kolmogorov-Arnold网络，结合Chebyshev多项式KAN层和参数化量子电路，通过嵌入物理约束加速高频误差指数收敛并抑制数值色散，在多孔介质渗流场景中优于现有量子-经典PINN。

AI中文摘要

我们开发了QCPIKAN，这是首个旨在求解偏微分方程（PDE）的量子-经典物理信息Kolmogorov-Arnold网络。该混合框架基于Chebyshev多项式KAN层和参数化量子电路构建，将物理约束嵌入训练损失中以强制执行物理一致性。我们的基于逼近论的理论研究证明，该设计将高频误差收敛加速至指数速率，并有效抑制数值色散。我们在多孔介质中的三个典型渗流场景（包括单相流、组分运移和两相流）上验证了该框架。与现有的量子-经典物理信息神经网络相比，QCPIKAN在全局预测精度、局部误差控制、动态演化跟踪和驱替前沿定位方面均实现了优越性能。这项工作为求解复杂PDE提供了一种鲁棒且高效的替代方案。

英文摘要

We develop QCPIKAN, the first quantum-classical physics-informed Kolmogorov-Arnold network designed to solve partial differential equations (PDEs). Built upon Chebyshev-polynomial KAN layers and parameterized quantum circuits, this hybrid framework embeds physical constraints into the training loss to enforce physical consistency. Our theoretical investigations grounded in approximation theory prove that this design accelerates high-frequency error convergence to an exponential rate and effectively mitigates numerical dispersion. We validate the framework across three typical seepage scenarios in porous media, including single-phase flow, component transport and two-phase flow. Compared with existing quantum-classical physics-informed neural networks, QCPIKAN achieves superior performance in global prediction accuracy, local error control, dynamic evolution tracking and displacement front localization. This work provides a robust and efficient alternative for solving complex PDEs.

2606.20329 2026-06-19 cs.LG physics.geo-ph 新提交

Constrained hybrid modelling to predict microbial dynamics and organic matter turnover in soil systems

约束混合建模预测土壤系统中微生物动态与有机质周转

Paul Collart, Juergen Gall, Andrea Schnepf, Holger Pagel, Lars Doorenbos

发表机构 * Agrosphere (IBG-3), Forschungszentrum Jülich GmbH（农业圈（IBG-3），于利希研究中心）； Institute of Crop Science and Resource Conservation, University of Bonn（波恩大学作物科学与资源保护研究所）； Institute of Computer Science, University of Bonn（波恩大学计算机科学研究所）； Lamarr Institute for Machine Learning and Artificial Intelligence（拉马尔机器学习和人工智能研究所）

AI总结提出首个混合建模框架，利用神经网络从宏基因组推断功能性状预测过程模型参数，并整合生态理论约束，有效预测微生物动态和有机质周转。

Comments Accepted at ICML '26

AI中文摘要

土壤微生物控制有机质循环，并在很大程度上决定土壤系统如何应对和缓解气候变化及环境威胁。因此，在基于过程的土壤模型中表示微生物动态对于预测土壤碳循环至关重要，尽管从数据中获取信息极具挑战性。改进参数化的一个有前景的方法是整合基因组数据，然而建模基因组与微生物驱动过程之间复杂且未知的关系是一个未解决的问题。在这项工作中，我们提出了第一个混合建模框架，用于从基于DNA测序数据的宏基因组推断功能性状中推导基于过程的土壤有机质周转模型的生物动力学参数值。我们的模型通过神经网络从基因组性状数据预测过程模型的生物动力学参数，并整合来自生态理论和文献的约束，以确保即使是非观测状态变量也能实现逼真的行为。我们在不同复杂度的合成基因组性状数据集和真实数据上评估了我们的方法，结果表明，我们的方法在多个基线上提高了性能，并有效学习了过程模型中不可测量组分的动态，即使是在小训练数据集上也是如此。

英文摘要

Soil microorganisms control organic matter cycling and largely determine how soil systems can cope with and mitigate climate change and environmental threats. Representing microbial dynamics in process-based soil models is therefore critical to predict carbon cycling in soils, albeit highly challenging to inform from data. One promising approach to improve their parametrisation is the integration of genomic data, yet modelling the complex and unknown relationship between genomes and the processes the microbes are driving is an unsolved problem. In this work, we present the first hybrid modeling framework for deriving biokinetic parameter values of a process-based soil organic matter turnover model from metagenome-inferred functional traits based on DNA sequencing data. Our model predicts biokinetic parameters of the process-based model from genomic trait data with a neural network and integrates constraints from ecological theory and literature to ensure realistic behavior, even of non-observed state variables. We evaluate our method on synthetic genomic trait datasets of varying complexity and on real data, showing that our approach improves performance over multiple baselines and learns the dynamics of unmeasurable components of the process-based model effectively, even for small training datasets.

2606.20359 2026-06-19 cs.LG 新提交

Train, Retrieve, or Both? A Four-Arm Head-to-Head for Correct Statutory Citation on the Ontario Residential Tenancies Act

训练、检索，还是两者兼用？针对安大略省住宅租赁法的正确法定引用的四组头对头比较

Ali Asaria, Tony Salomone, Deep Gandhi

发表机构 * Transformer Lab

AI总结研究自诉租户、房东和帮助台工作人员如何获得正确的法定引用，通过四组实验比较微调、检索及混合方法，发现SFT+RAG混合模型在精确匹配上得分最高且无幻觉引用。

AI中文摘要

自诉租户、房东和帮助台工作人员需要被指向实际管辖问题的法律条款，并附有正确的法定引用。我们在2006年安大略省住宅租赁法（RTA）及其核心法规上研究此任务，从操作者的角度实证提问：微调是否足够，还是需要混合检索？我们在Qwen2.5-7B-Instruct上运行四组头对头比较（基础零样本、仅LoRA SFT、仅RAG、以及SFT+RAG混合），在一个小型、待人工验证的真实评估集上，以引用的精确匹配（节+小节）评分。基础模型无法引用RTA，仅SFT会错误回忆章节；检索至关重要，并通过构造将幻觉降至零；而SFT+RAG混合模型得分最高，精确匹配为0.481，且无幻觉引用。其优势在于SFT使得条款选择对高召回候选集（损害零样本RAG）更加鲁棒。值得注意的是，这种廉价的bge-small混合模型匹配或超越了基于更大、专门检索模型（更大的嵌入器和交叉编码器重排序器）的管道，更大/改进的训练集也无帮助：在此任务中，强法定引用性能不需要专门的检索模型或更多数据。该工件将幻觉归零并超过了基准提升线，但未达到期望的0.70精确匹配目标。所有结果均基于小型、待人工验证的真实评估集，并作为初步结果报告。

英文摘要

Self-represented tenants, landlords, and help-desk staff need to be pointed at the provision of law that actually governs a question, with a correct statutory citation. We study this task on the Ontario Residential Tenancies Act, 2006 (RTA) and its core regulation, asking the operator's question empirically: is fine-tuning enough, or is hybrid retrieval needed? We run a four-arm head-to-head on Qwen2.5-7B-Instruct (base zero-shot, LoRA SFT-only, RAG-only, and an SFT+RAG hybrid), scored on citation exact-match (section+subsection) over a small, human-verification-pending real eval set. The base model cannot cite the RTA and SFT-only mis-recalls sections; retrieval is essential and drives hallucination to zero by construction; and the SFT+RAG hybrid scores highest at 0.481 exact-match with zero hallucinated citations. Its edge comes from SFT making provision selection more robust to the higher-recall candidate sets that hurt zero-shot RAG. Notably, this cheap bge-small hybrid matches or beats a pipeline built on bigger, specialized retrieval models (a larger embedder and a cross-encoder reranker), and a larger/improved training set does not help either: strong statutory-citation performance here does not require specialized retrieval models or more data. The artifact zeroes hallucination and clears the lift-over-base bar but does not reach the aspirational 0.70 exact-match target. All results are on a small, human-verification-pending real eval set and are reported as preliminary.
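The scoring rule described above (exact match on section plus subsection, with hallucinated citations tracked separately) might look roughly like the sketch below. The citation format, the regular expression, the definition of a hallucination as "a cited section number that does not exist in the statute", and all section numbers are assumptions made for illustration; the abstract does not specify any of them.

```python
import re

# Matches forms such as "s. 83(1)", "section 20", or "48.1(2)" (assumed format).
# It takes the first number it finds, so a year in the text would confuse it.
CITATION_RE = re.compile(r"(\d+(?:\.\d+)?)\s*(?:\((\d+(?:\.\d+)?)\))?")


def parse_citation(text):
    """Return (section, subsection-or-None), or None if no number is found."""
    match = CITATION_RE.search(text)
    if match is None:
        return None
    return (match.group(1), match.group(2))


def score_citations(predictions, gold, known_sections):
    """Compute exact-match rate and hallucinated-citation rate."""
    exact = hallucinated = 0
    for pred_text, gold_text in zip(predictions, gold):
        pred = parse_citation(pred_text)
        if pred is not None and pred == parse_citation(gold_text):
            exact += 1
        # A citation to a section number that is not in the statute at all
        if pred is not None and pred[0] not in known_sections:
            hallucinated += 1
    n = len(gold)
    return {"exact_match": exact / n, "hallucination_rate": hallucinated / n}


# Placeholder section numbers, not a real index of the statute
known_sections = {"20", "48", "59", "83"}
gold = ["s. 83(1)", "s. 20(2)", "s. 59(1)", "s. 48(1)"]
predictions = ["s. 83(1)", "s. 20(1)", "s. 999(2)", "I am not sure."]
print(score_citations(predictions, gold, known_sections))
```

Under this definition, a retrieval-constrained system that can only cite sections present in its candidate set cannot produce a nonexistent section, which is one way to read the abstract's "zero by construction".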

2606.20364 2026-06-19 cs.LG 新提交

Judging to Improve: A De-biased VLM-as-3D-Judge Protocol for Single-Image 3D Generation

评判以改进：一种去偏的 VLM-as-3D-Judge 协议用于单图像 3D 生成

Ali Asaria, Tony Salomone, Deep Gandhi

发表机构 * Transformer Lab

AI总结本文提出一种去偏的跨模型 VLM-as-3D-Judge 协议，将评判者从排序扩展到优化，通过训练与评估评判者分离、位置偏差校正及修复三种失效模式，实现轻量级适应下与强基线的匹配。

AI中文摘要

一项伴随研究建立了一个去偏的、跨模型的 VLM-as-3D-Judge，能够可靠地对单图像到 3D 网格质量进行排序，而廉价的几何和 CLIP 代理在此方面表现不足。本文提出：该评判者的偏好能否专门化一个强大的开放生成器 TRELLIS，针对单一资产类别（家具），且无需人工标注？将评判者从排序扩展到优化是本文的工作所在。将 VLM 评判者推入训练和评估循环会暴露排序从未触发的失效模式，因此我们的贡献是对评判者进行优化级别的强化：一个训练评判者（Qwen2.5-VL-7B）与一个评估评判者（InternVL3-8B）保持分离以打破循环性；位置偏差校正；以及针对三种失效模式（图像过载、隐藏几何的溅射渲染、以及奖励干净但错误输出的无参考评判）的修复，并附有校准证据（清晰差距胜率 0.83-1.0；基线间约 0.5）。使用此协议作为独立评估者，仅从公开模型和数据出发，采用轻量级参数高效适应，我们发现我们的方法匹配了强基线而非超越它。独立基线样本几乎不携带可学习的偏好（0.94 顺序翻转率），因此信号必须通过质量对比构造来设计。在六种适应方法、两种输入模式和严重程度扫描中，最具针对性的方法——严重退化下的条件器修复——达到了与基线持平（0.50），而没有方法达到 >=65% 的胜率目标。结果是机制性的：干净输入使评判者饱和，流式 DIT 微调通过采样器被冲刷，而条件器修复是改变几何的位点。胜率在 n=8 个对象时具有方向性。匹配一个强大的公开数据基线本身具有信息量：超越它需要比公开数据上的轻量级 PEFT 更多，而评判者协议是可复用的。

英文摘要

A companion study established a de-biased, cross-model VLM-as-3D-judge that reliably ranks single-image-to-3D mesh quality where cheap geometry and CLIP proxies fall short. This paper asks: can that judge's preferences specialize a strong open generator, TRELLIS, on one asset class (furniture), cheaply and without human labels? Taking the judge from ranking to optimization is where the work lives. Pushing a VLM judge into the training and evaluation loop exposes failure modes ranking never triggered, so our contribution is an optimization-grade hardening of the judge: a training judge (Qwen2.5-VL-7B) held distinct from an evaluation judge (InternVL3-8B) to break circularity; position-bias correction; and fixes for three failure modes (image overload, geometry-hiding splat renders, and reference-free judging that rewards clean-but-wrong outputs), with calibration evidence (clear-gap win-rate 0.83-1.0; base-vs-base ~0.5). Using this protocol as an independent evaluator, and working only from public models and data with lightweight parameter-efficient adaptation, we find our methods match the strong base rather than exceed it. Independent base samples carry essentially no learnable preference (0.94 order-flip rate), so signal must be engineered by quality-contrastive construction. Across six adaptation methods, two input regimes, and a severity sweep, the most targeted - conditioner repair under severe degradation - reaches parity (0.50) with the base, while no method clears the >=65% win-rate target. The result is mechanistic: clean inputs saturate the judge, flow-DIT fine-tuning washes out through the sampler, and conditioning repair is the locus that moves geometry. Win-rates are directional at n=8 objects. Matching a strong public-data base with cheap adaptation is itself informative: exceeding it needs more than lightweight PEFT on public data, and the judge protocol is reusable.
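One common way to correct position bias in pairwise judging, and to measure an order-flip rate, is to query the judge in both presentation orders and only credit a win when the two verdicts agree. The sketch below assumes that scheme; the abstract does not describe the authors' exact correction, and the toy judge, its bias term, and the quality scores are hypothetical stand-ins for a real VLM call.

```python
def debiased_compare(judge, a, b):
    """Score candidate `a` against `b` using both presentation orders.

    `judge(x, y)` returns "first" or "second". Returns 1.0 if `a` wins in
    both orders, 0.0 if it loses in both, and 0.5 if the verdict flips.
    """
    a_wins_shown_first = judge(a, b) == "first"
    a_wins_shown_second = judge(b, a) == "second"
    if a_wins_shown_first and a_wins_shown_second:
        return 1.0
    if not a_wins_shown_first and not a_wins_shown_second:
        return 0.0
    return 0.5  # verdict depends on order: treat as a tie


def win_and_flip_rate(judge, pairs):
    """Return (debiased win-rate of the first element, order-flip rate)."""
    scores = [debiased_compare(judge, a, b) for a, b in pairs]
    win_rate = sum(scores) / len(scores)
    flip_rate = sum(1 for s in scores if s == 0.5) / len(scores)
    return win_rate, flip_rate


def biased_toy_judge(x, y, first_slot_bonus=0.5):
    """Hypothetical judge: compares quality scores but favors the first slot."""
    return "first" if x + first_slot_bonus >= y else "second"


# Candidates are represented by made-up scalar quality scores
pairs = [(3.0, 1.0), (1.0, 3.0), (2.0, 2.2), (2.2, 2.0)]
print(win_and_flip_rate(biased_toy_judge, pairs))
```

When the two candidates are close in quality, the toy judge's verdict follows the slot rather than the content, so those pairs are scored as ties. A high flip rate on a set of pairs therefore signals that the pairs carry little preference signal that survives the correction.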

2606.20417 2026-06-19 cs.LG 新提交

Neural network surrogates with uncertainty quantification for inverse problems in partial differential equations

具有不确定性量化的神经网络代理模型用于偏微分方程反问题

Christian Jimenez-Beltran, Aretha L. Teckentrup, Antonio Vergari, Konstantinos C. Zygalakis

AI总结提出DeepGaLA神经网络代理模型，为微分方程求解器提供不确定性感知预测，结合延迟接受MCMC诊断，实现高效可靠的贝叶斯反演。

AI中文摘要

微分方程的反问题在科学和工程中普遍存在，其目标是从噪声或不完整的观测中推断未知模型参数。传统数值方法通常计算成本高昂，尤其是在贝叶斯设置中，对于复杂正向模型和高维参数空间，评估似然函数变得非常昂贵。为了应对这一挑战，我们引入了DeepGaLA，一种用于微分方程求解器的神经网络代理模型，它提供不确定性感知的预测，在训练数据有限时减少过度自信的推断。为了在实践中评估代理诱导的后验近似的保真度，我们表明，短时间运行的延迟接受马尔可夫链蒙特卡洛可以作为有效的诊断工具。在一系列数值实验中，DeepGaLA提供的正向模型近似精度与已建立的高斯过程代理相当，同时在参数维度增加时更好地保持效率。此外，它可以纳入微分方程约束，包括非线性情况。总体而言，这些结果表明，具有不确定性量化的神经代理模型能够实现复杂系统中反问题的可扩展且可靠的贝叶斯推断。

英文摘要

Inverse problems for differential equations arise throughout science and engineering, where one seeks to infer unknown model parameters from noisy or incomplete observations. Traditional numerical methods for these problems are often computationally expensive, particularly in Bayesian settings where evaluating the likelihood becomes costly for complex forward models and high-dimensional parameter spaces. To address this challenge, we introduce DeepGaLA, a neural-network surrogate for differential equation solvers that provides uncertainty-aware predictions, reducing overconfident inference when training data are limited. To evaluate the fidelity of the surrogate-induced posterior approximations in practice, we show that a short run of delayed-acceptance Markov chain Monte Carlo can serve as an effective diagnostic. Across a range of numerical experiments, DeepGaLA delivers forward-model approximations with accuracy comparable to established Gaussian-process surrogates, while better maintaining efficiency as parameter dimension grows. Moreover, it can incorporate differential-equation constraints, including in nonlinear settings. Overall, these results indicate that uncertainty-quantified neural surrogates can enable scalable and reliable Bayesian inference for inverse problems in complex systems.
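For readers unfamiliar with delayed acceptance, the following is a minimal sketch of the generic two-stage Metropolis-Hastings scheme with a symmetric random-walk proposal: a cheap surrogate posterior screens each proposal, and only survivors are evaluated with the expensive exact posterior, with a correction factor that keeps the exact posterior as the target. The one-dimensional Gaussian "exact" posterior and the deliberately biased surrogate are toy assumptions standing in for a PDE solver and a neural surrogate; this is not the DeepGaLA code.

```python
import math
import random


def delayed_acceptance_mh(log_post_exact, log_post_surrogate, x0, n_steps,
                          step=1.0, seed=0):
    """Two-stage (delayed-acceptance) random-walk Metropolis-Hastings.

    Returns (samples, number of exact evaluations, stage-2 acceptance rate).
    """
    rng = random.Random(seed)
    x = x0
    lp_sur, lp_ex = log_post_surrogate(x), log_post_exact(x)
    exact_calls, stage1_passes, stage2_accepts = 1, 0, 0
    samples = []
    for _ in range(n_steps):
        y = x + rng.gauss(0.0, step)  # symmetric proposal
        lq_sur = log_post_surrogate(y)
        # Stage 1: screen with the surrogate only (1 - random() lies in (0, 1])
        if math.log(1.0 - rng.random()) <= lq_sur - lp_sur:
            stage1_passes += 1
            lq_ex = log_post_exact(y)
            exact_calls += 1
            # Stage 2: exact ratio divided by the surrogate ratio
            log_alpha2 = (lq_ex - lp_ex) - (lq_sur - lp_sur)
            if math.log(1.0 - rng.random()) <= log_alpha2:
                stage2_accepts += 1
                x, lp_sur, lp_ex = y, lq_sur, lq_ex
        samples.append(x)
    rate = stage2_accepts / stage1_passes if stage1_passes else float("nan")
    return samples, exact_calls, rate


def exact_log_post(x):
    """Toy 'expensive' posterior: standard normal, up to a constant."""
    return -0.5 * x * x


def surrogate_log_post(x):
    """Toy surrogate: shifted and widened on purpose to mimic model error."""
    return -0.5 * (x - 0.3) ** 2 / 1.5


samples, exact_calls, stage2_rate = delayed_acceptance_mh(
    exact_log_post, surrogate_log_post, x0=0.0, n_steps=20000)
mean = sum(samples) / len(samples)
print(f"mean={mean:.3f}, exact calls={exact_calls}, stage-2 rate={stage2_rate:.3f}")
```

If the surrogate matched the exact posterior perfectly, stage 2 would accept every proposal that survives stage 1, so a stage-2 acceptance rate well below 1 is one plausible signal of surrogate error. Whether the paper's diagnostic uses this statistic or another is not stated in the abstract.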

2606.20467 2026-06-19 cs.LG cs.NA math.NA physics.comp-ph new submission

How Transparent Is DiffusionGemma?

Joshua Engels, Callum McDougall, Bilal Chughtai, Janos Kramar, Senthoran Rajamanoharan, Cindy Wu, Arthur Conmy, Asic Q Chen, Jean Tarbouriech, Min Ma, Brendan O'Donoghue, João Gabriel Lopes de Oliveira, Rohin Shah, Neel Nanda

Affiliations * Google

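AI summary: Studies the reasoning transparency of DiffusionGemma in a continuous latent space by decomposing it into variable transparency and algorithmic transparency; finds that an interpretable token bottleneck reduces the opaque serial depth to 1.1X that of Gemma 4, and reports diffusion-specific phenomena.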

Comments 20 main text pages and 6 pages of references and appendices

Details

English abstract

LLM reasoning transparency is a critical affordance for understanding model decisions, mitigating misuse and misalignment, and debugging surprising model behaviors. However, DiffusionGemma performs a larger fraction of its computation in a continuous latent space; does this make its reasoning less transparent? We study this question by decomposing transparency into two components: variable transparency, whether we understand intermediate snapshots of a model's computational state; and algorithmic transparency, whether we can use these snapshots to reconstruct the process by which the model arrived at its outputs. Naively, DiffusionGemma has poor variable transparency: its opaque serial depth, the amount of serial computation that occurs in between interpretable model states, seems at first 28.6X higher than the corresponding autoregressive Gemma 4 model. However, we show that we can map the information flowing between denoising steps through an interpretable token bottleneck with no decrease in downstream performance. Treating these intermediate states as interpretable reduces the opaque serial depth to just 1.1X that of Gemma 4. Algorithmic transparency is harder for diffusion models than for autoregressive models because all token predictions in the canvas can change at every denoising step, giving the model the power to implement complicated distributed algorithms during the denoising process. To begin bridging this gap, we conduct a suite of interpretability case studies, uncovering initial evidence of novel diffusion-specific phenomena such as non-chronological reasoning, token and sequence smearing, and intermediate-context reasoning. Finally, we test monitorability, a key application of transparency that measures whether model outputs are useful for downstream tasks. We find that DiffusionGemma is similarly monitorable to Gemma 4.

1. Deep learning architectures and training methods 11 papers

How Linear Is a Transformer Feed-Forward Block? Per-Block Linear Recoverability Is Learned, Not Architectural

Concept Flow Models: Anchoring Concept-Based Reasoning with Hierarchical Bottlenecks

Efficiently Representing Algorithms With Chain-of-Thought Transformers

Learning universal approximations for partial differential equations with Physics-Informed Broad Learning System

Neural Additive and Basis Models with Feature Selection and Interactions

Physics-Informed Neural Network with Squeeze-Excitation-like Attention

Compositionality Emerges in a Narrow Depth-Connectivity Regime: Architecture Constraints and Solution Manifolds

Kolmogorov-Arnold Reservoir Computing

Shifting-based Optimizable Linear Relaxations for General Activation Functions

Evolutionary Two-Stage Hyperparameter Optimization Strategies for Physics-Informed Neural Networks

The Token Is a Group Element: On Lie-Algebra Attention over Matrix Lie Groups

2. Representation learning, self-supervised learning, and contrastive learning 7 papers

FlexLAM: Resolving the Bottleneck Trade-off in Latent Action Learning

3D-DLP: Self-Supervised 3D Object-Centric Scene Representation Learning

Tracking Representation Dynamics in Large Language Models with Persistent Homology

Unsupervised Causal Abstractions Discovery

When, Where, and How: Adaptive Binning for Tabular Self-Supervised Learning

SL-S4Wave: Self-Supervised Learning of Physiological Waveforms with Structured State Space Models

Multi-Modal Contrastive Learning for Implicit Earth Embeddings via Location Tying

3. Reinforcement learning and sequential decision-making 13 papers

Human-like autonomy emerges from self-play and a pinch of human data

Can In-Context Learning Support Intrinsic Curiosity?

Multi-Granular Attention-Driven Reinforcement Learning Framework for Web Intelligent Enhancement Systems

OnDeFog: Online Decision Transformer under Frame Dropping

Manifold Bandits: Bayesian Curriculum Learning over the Latent Geometry of Large Language Models

Matching Markets meet Cumulative Prospect Theory: Towards Optimal and Adversarially Robust Learning

Connect the Dots: Training LLMs for Long-Lifecycle Agents with Cross-Domain Generalization Via Reinforcement Learning

VIMPO: Value-Implicit Policy Optimization for LLMs

Hierarchical Control in Multi-Agent Games: LLM-based Planning and RL Execution

Sensorimotor World Models: Perception for Action via Inverse Dynamics

Quantile of Means: A Bonus-Free Ensemble Method for Minimax Optimal Reinforcement Learning

Direct Advantage Estimation for Scalable and Sample-efficient Deep Reinforcement Learning

Marginal Advantage Accumulation for Memory-Driven Agent Self-Evolution

4. Generative models and probabilistic modeling 7 papers

Emyx: Fast and efficient all-atom protein generation

Calibrating Generative Models to Feature Distributions with MMD Finetuning

An Information Theoretic Framework for Graph Novelty Generation via Latent Mixture Modeling

Flow Map Denoisers: Traversing the Distortion-Perception Plane for Inverse Problems

Score Approximation for Diffusion Models on Arbitrary Low-Dimensional Structures

PepALD: Macrocyclic Peptide Generation via Autoregressive Latent Diffusion

On the Redundancy of Timestep Embeddings in Diffusion Models

5. Optimization, generalization, and theoretical analysis 15 papers

Computational Identifiability

Information Lattice Learning as Probabilistic Graphical Model Structure Learning

Weibull Weight-Scale Parameter Evolution under AdamW Training Dynamics

Zero-Inflated Gaussian Distributions Enable Parameter-Space Sparsity in Estimation-of-Distribution Algorithms

Algebraic Dead Directions in LayerNorm Transformers: A Forward-Pass-Only Diagnostic at LLM Scale

Interactive Pareto navigation for deep multi-task learning

Convex training of Lipschitz-regularized shallow neural networks

Global Convergence of Gradient Descent for Score Matching in Gaussian Mixtures via Reverse Fisher Divergence

On the Oracle Complexity of Interpolation-Based Gradient Descent

Adversarial Bandit Optimization with Globally Bounded Perturbations to Convex Losses

What Makes Effective Supervision in Latent Chain-of-Thought: An Information-Theoretic Analysis

Effective Dimension Governs Generalization in Quantum Kernel Vision Models

Recurrent neural networks approximate continuous functions

On the Variance of Temporal Difference Learning and its Reduction Using Control Variates

Fisher-Geometric Sharpness and the Implicit Bias of SGD toward Flat Minima

6. Efficient learning, compression, and deployment 11 papers

Closing the Social-Semantic Gap: SPSD for Edge-Based Prompt Compression in Cloud LLM Inference

Performance Analysis and Optimization of 3D Generative Diffusion Models across GPU Architectures

Techniques for Peak Memory Reduction for LoRA Fine-tuning of LLMs on Edge Devices

Predicting Mergeability of Parameter-Efficient Fine-Tuning Updates

Efficient Neural Network Model Selection for Few-Class Application Datasets

ADaPT: Token-Level Decoupling for Efficient Large Reasoning Models

Low-Energy Reduced RISC-V Instruction Subset Processor for Tsetlin Machine Inference at the Edge

Activation- and Influence-Aware Ranks (AIR): Function-Preserving SVD Compression for LLMs

StreamKL: Fast and Memory-Efficient KL Divergence for Boosting Attention Distillation

UltraQuant: 4-bit KV Caching for Context-Heavy Agents

Execution-State Capsules: Graph-Bound Execution-State Checkpoint and Restore for Low-Latency, Small-Batch, On-Device Physical-AI Serving

7. Federated learning, privacy, and security 4 papers

Federated Bilevel Performative Prediction

When Calibration Fails the Vulnerable Hospital: Federated Conformal Risk Control via Risk-Curve Shrinkage

Towards Modality-imbalanced Federated Graph Learning: A Data Synthesis-based Approach

Predictability as a Fine-Grained Measure for Privacy

8. Robustness, uncertainty, and trustworthy learning 6 papers

Thermodynamic Signatures of Reasoning: Free-Energy and Spectral-Form-Factor Diagnostics for Hallucination Detection in Large Language Models

On the QUEST for Uncertainty Quantification via Highest Density Regions

Comparing Linear Probes with Mahalanobis Cosine Similarity

Uncertainty-Aware Reward Modeling for Stable RLHF