arXivDaily: arXiv daily academic digest, updated Monday to Friday
2605.17415 2026-05-25 cs.LG cs.AI cs.DB cs.IR Updated version

IVF-TQ: Calibration-Free Streaming Vector Search via a Codebook-Free Residual Layer

Tarun Sharma

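Affiliations * Independent Researcher

AI summary This paper proposes IVF-TQ, a streaming vector-search index whose codebook-free residual compression layer enables calibration-free approximate nearest neighbor search. The core idea is to avoid codebooks altogether: a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer, configured only by bit width and dimension, stays stable on streaming data without any training. Experiments show that IVF-TQ holds its performance across multiple datasets and memory regimes without retraining or per-dataset bit-budget tuning, improving robustness in streaming settings.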

Abstract

Approximate nearest neighbor (ANN) indexes deployed against streaming corpora silently lose recall over weeks. The standard diagnosis is distribution shift, but under shuffled-i.i.d. ingestion -- no shift at all -- product quantization still degrades -3.8pp at sub-matched bit budgets. The dominant production compression methods (PQ, OPQ, ScaNN) all fit a codebook to an initial sample and reuse it as the database grows by orders of magnitude. This paper presents IVF-TQ, an inverted-file index whose residual compression layer is data-independent: a fixed random rotation followed by a precomputed Lloyd-Max scalar quantizer parameterised only by the bit width b and dimension d. Only the IVF coarse k-means partition is trained. A uniform-over-sphere inner-product error bound depending only on (b, d, delta) provides a structural guarantee no learned-codebook method admits. The same codebook-free design enables an IVF-amplification effect that closes the gap to Extended RaBitQ to within statistical noise (+17.7pp over flat TQ at matched bit budget), and an Adaptive variant that refreshes the partition without touching the compression layer. Across nine controlled cells (three 10M datasets, three PQ memory regimes, three seeds), per-batch PQ codebook retraining never recovers the streaming gap; IVF-PQ streaming stability requires per-dataset bit-budget tuning, while IVF-TQ holds at one fixed (b, d) configuration on all three datasets with Delta in [-0.80, +0.56]pp. The contribution is operational: no codebook to train, no per-dataset bit-budget tuning, no retraining cycle that ever closes the gap.
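
A minimal sketch of the kind of codebook-free compression layer the abstract describes: a fixed random rotation followed by a scalar quantizer whose levels depend only on the bit width b and the dimension d. It is illustrative only and not the paper's implementation. All function names are hypothetical, the Lloyd-Max levels are fitted to synthetic N(0, 1/d) samples as a stand-in for the coordinates of a rotated unit vector, and the input is assumed to be an already-normalised residual.

```python
import math
import random


def random_rotation(d, seed=0):
    """Fixed random orthogonal matrix built by Gram-Schmidt on Gaussian rows."""
    rng = random.Random(seed)
    rows = []
    while len(rows) < d:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for u in rows:  # remove the components along earlier rows
            dot = sum(a * b for a, b in zip(v, u))
            v = [a - dot * b for a, b in zip(v, u)]
        norm = math.sqrt(sum(a * a for a in v))
        if norm > 1e-9:
            rows.append([a / norm for a in v])
    return rows


def nearest(levels, x):
    """Index of the reconstruction level closest to x."""
    return min(range(len(levels)), key=lambda j: abs(levels[j] - x))


def lloyd_max_levels(bits, d, n_samples=4000, iters=20, seed=1):
    """Lloyd iterations on synthetic N(0, 1/d) samples (assumed distribution).

    The result depends only on (bits, d), never on the indexed data.
    """
    rng = random.Random(seed)
    sigma = 1.0 / math.sqrt(d)
    xs = sorted(rng.gauss(0.0, sigma) for _ in range(n_samples))
    k = 2 ** bits
    levels = [xs[(2 * j + 1) * n_samples // (2 * k)] for j in range(k)]  # quantile init
    for _ in range(iters):
        sums, counts = [0.0] * k, [0] * k
        for x in xs:
            j = nearest(levels, x)
            sums[j] += x
            counts[j] += 1
        levels = [sums[j] / counts[j] if counts[j] else levels[j] for j in range(k)]
    return levels


def encode(x, rotation, levels):
    """Rotate x, then quantize each coordinate to a level index."""
    y = [sum(r * a for r, a in zip(row, x)) for row in rotation]
    return [nearest(levels, yi) for yi in y]


def decode(codes, rotation, levels):
    """Look up levels, then apply the inverse (transposed) rotation."""
    y_hat = [levels[c] for c in codes]
    d = len(rotation)
    return [sum(rotation[i][j] * y_hat[i] for i in range(d)) for j in range(d)]


def reconstruction_error(vectors, rotation, levels):
    """Mean squared reconstruction error over a list of vectors."""
    total = 0.0
    for x in vectors:
        x_hat = decode(encode(x, rotation, levels), rotation, levels)
        total += sum((a - b) ** 2 for a, b in zip(x, x_hat))
    return total / len(vectors)
```

Because neither the rotation nor the levels look at the indexed data, nothing in this sketch would need refitting as a corpus grows; in the design described above, only the coarse IVF partition is trained.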

2605.23901 2026-05-25 cs.LG cs.AI cs.IT math.IT Updated version

LLMs as Noisy Channels: A Shannon Perspective on Model Capacity and Scaling Laws

Xu Ouyang, Deyi Liu, Yuhang Cai, Jing Liu, Yuan Yang, Chen Zheng, Thomas Hartvigsen, Yiyuan Ma

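Affiliations * University of Virginia; University of California, Berkeley

AI summary Taking a Shannon information-theoretic view, this paper models LLM training as information transmission over a noisy channel and proposes the Shannon Scaling Law to explain non-monotonic phenomena that conventional monotonic scaling laws cannot describe, such as catastrophic overtraining and quantization-induced degradation. By mapping model parameters to channel bandwidth and training data to signal power, the theory shows that scaling model size or data without maintaining a sufficient signal-to-noise ratio amplifies noise and produces U-shaped performance degradation. Experiments show the law outperforming conventional scaling laws across multiple tasks and perturbation settings, with good fit and extrapolation.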

Comments Accepted by ICML 2026

Abstract

Existing scaling laws for Large Language Models (LLMs), predominantly monotonic power laws, fail to explain emerging non-monotonic phenomena such as catastrophic overtraining and quantization-induced degradation, where performance deteriorates despite increased compute. We propose the Shannon Scaling Law, a unified theoretical framework that models LLM training as information transmission over a noisy channel, grounded in the Shannon-Hartley theorem. By mapping model parameters to channel bandwidth and training tokens to signal power, our formulation explicitly captures the interaction between learning signal and intrinsic noise. This perspective reveals a fundamental Shannon capacity for LLMs: scaling model size or data without preserving a sufficient signal-to-noise ratio (SNR) inevitably amplifies noise, inducing a transition from monotonic improvement to U-shaped performance degradation. We validate our theory through experiments on Pythia and OLMo2 under perturbations, including Gaussian noise, quantization and supervised fine-tuning on math, QA and code tasks. The Shannon Scaling Law consistently outperforms classical scaling laws and recent perturbation-aware laws, achieving strong $R^2$ scores and accurately capturing loss basins missed by prior approaches. It also extrapolates: fitted on $\leq$6.9B Pythia models with $\leq$180B tokens, it predicts the unseen 12B model up to 307B tokens at pooled $R^2{=}0.847$, while monotonic baselines collapse.
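
For readers unfamiliar with the Shannon-Hartley theorem the abstract builds on, here is a small sketch of the classical formula C = B log2(1 + S/N). It illustrates only the textbook theorem, including the well-known saturation when noise power grows with bandwidth; it is not the paper's fitted scaling law and does not reproduce its U-shaped curves. The function names are hypothetical.

```python
import math


def shannon_capacity(bandwidth, signal_power, noise_power):
    """Shannon-Hartley capacity C = B * log2(1 + S / N)."""
    return bandwidth * math.log2(1.0 + signal_power / noise_power)


def capacity_white_noise(bandwidth, signal_power, noise_density):
    """Same formula when noise power scales with bandwidth: N = N0 * B."""
    return shannon_capacity(bandwidth, signal_power, noise_density * bandwidth)


if __name__ == "__main__":
    # With fixed signal power, widening the band gives diminishing returns:
    # capacity approaches S / (N0 * ln 2) instead of growing without bound.
    for b in (1, 10, 100, 1000):
        print(b, capacity_white_noise(b, 1.0, 1.0))
    print("limit", 1.0 / math.log(2))
```

Under the abstract's mapping (parameters to bandwidth, training tokens to signal power), the saturation shown here is the classical intuition for why adding bandwidth alone cannot compensate for a low signal-to-noise ratio.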

2605.23893 2026-05-25 cs.LG Updated version

Complete-muE: Optimal Hyperparameter Transfer and Scaling for MoE Models

Hongwu Peng, Ohiremen Dibua, Yuanjun Xiong, Yifan Gong, Jianming Zhang, Yan Kang

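Affiliations * Adobe Research

AI summary This paper proposes Complete-muE, a framework for hyperparameter transfer and scaling between dense FFN and Mixture-of-Experts (MoE) architectures. A two-bridge system handles the simultaneous changes in architecture and expert count that existing tools cannot, covering changes in the number of activated experts, total capacity, granularity, and shared/group-balanced hybrids in MoE models, as well as width, depth, batch size, and training duration in general Transformer models. Experiments show that Complete-muE yields relatively stable hyperparameter optima across architectures and parameter counts, so a single tuning run on a dense model transfers near-optimally to all MoE configurations and speeds up MoE convergence.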

Comments 27 pages

Abstract

We propose Complete-muE, a framework that targets hyperparameter transfer between dense FFN and any Mixture-of-Experts (MoE) setup in transformer blocks. Existing tools such as $μ$P (requires fixed architecture) or SDE (requires fixed per-step token count) cannot directly solve the hyperparameter transfer problem in MoE setups, because dense-to-MoE transfer or scaling the total number of MoE experts changes both the architecture and the tokens per expert. Complete-muE solves this challenge with a two-bridge system: Bridge I maps between dense FFN and Dense MoE by active-width $μ$P with a normalized router scale. Bridge II maps between Dense MoE and sparse MoE by activated-expert scaling, where the first-order SDE LR/WD correction cancels while a bounded residual $σ_0$ shift remains. The resulting transfer rule, which we term Complete-muE, covers changes in activated experts, total capacity, granularity, and shared/group-balanced hybrids for MoE models, as well as network width/depth, batch size, and duration changes for general Transformer models. Extensive language model and diffusion model pretraining experiments confirm that Complete-muE yields relatively stable hyperparameter optima across model architectures and parameter counts, with only minor drift consistent with the non-strict SDE behavior of Bridge II. In practice this drift is small enough that hyperparameters tuned on a single dense reference transfer near-optimally to all MoE configurations: "tune dense once, transfer to all" is the practical recipe at the core of Complete-muE. This enables MoE models to converge faster than dense models when scaling model capacity, without costly hyperparameter search.

2605.23892 2026-05-25 cs.CV cs.AI cs.GR cs.LG cs.RO Updated version

Good Token Hunting: A Hitchhiker's Guide to Token Selection for Visual Geometry Transformers

Shuhong Zheng, Michael Oechsle, Erik Sandström, Marie-Julie Rakotosaona, Federico Tombari, Igor Gilitschenski

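Affiliations * University of Toronto & Vector Institute; Google; Technical University of Munich

AI summary Visual geometry transformers perform well on multi-view 3D reconstruction, but their computational cost grows quadratically with input sequence length, limiting efficiency and scalability. This paper proposes a simple, general remedy: limit the number of key/value tokens each query interacts with during global attention. The method uses a two-stage framework that first selects which frames to keep so that scene coverage stays diverse, then removes redundant tokens within those frames using a layer-aware sparsification strategy based on attention entropy. Experiments show speedups of over 85% for visual geometry transformers while maintaining or improving performance.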

Comments Project Page: https://zsh2000.github.io/good-token-hunting.github.io, Code: https://github.com/zsh2000/gotohunt

Abstract

Visual geometry transformers have become powerful architectures for multi-view 3D reconstruction, enabling joint prediction of multiple 3D attributes in a feed-forward manner. However, their computational cost grows quadratically with the input sequence length due to the global attention layers inside these models. This limits both their scalability and efficiency. In this work, we address this challenge with a simple yet general strategy: restricting the number of key/value tokens that each query interacts with during global attention. To achieve effective token selection, we introduce a two-stage framework. First, an inter-frame selection step operates at the frame level to identify frames that should be preserved. Second, an intra-frame selection step discards the more redundant tokens within the selected frames. Our analysis highlights the advantage of a diversity-based strategy for inter-frame selection, which ensures broad coverage of the scene. For intra-frame selection, we show that layer-aware sparsification is necessary, with the selection process guided by the entropy of the global attention pattern. Our approach offers a superior speed-accuracy trade-off compared to existing solutions. Extensive experiments show that it accelerates visual geometry transformers by over 85% for scenes with 500 images while maintaining, or even improving, baseline performance, which hints at how our token selection strategy could play a crucial role in future applications of visual geometry transformers. Our project website is available at https://zsh2000.github.io/good-token-hunting.github.io.

2605.23887 2026-05-25 cs.DB cs.AI cs.CR cs.LG cs.MA Updated version

CHRONOS: Temporally-Aware Multi-Agent Coordination for Evolving Data Marketplaces

Joydeep Chandra

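Affiliations * BNRIST, Tsinghua University

AI summary CHRONOS is a multi-agent coordination framework for dynamic data marketplaces. It targets problems that data evolution causes in static designs: declining retrieval recall, inaccurate value attribution, and over-consumption of the privacy budget. Its three-layer architecture combines time-aware neural-ODE decay, Shapley valuation conditioned on changepoint detection, and a differentially private learning algorithm to coordinate the marketplace efficiently while preserving privacy. Experiments on multiple benchmarks show strong retrieval performance and privacy efficiency.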

Abstract

Temporal knowledge-graph data marketplaces face three coupled failures in static designs: stale hybrid index shortcuts reduce recall as edges evolve, stationary Shapley pricing misattributes value after distribution shifts, and uncoordinated agents over-consume a shared differential-privacy budget. We present CHRONOS, a three-layer architecture providing a unified treatment of these challenges with explicit public and private separation. Layer one applies neural-ODE temporal decay to shortcut edges, providing a per-query expected recall-loss bound of Big-O of Pq lambda delta t, with a monotone-envelope guarantee reducing bound looseness to 1.8 to 3.2 times observed loss. Layer two conditions Shapley valuation on detected changepoints and provides finite-sample error guarantees under noise. Layer three uses EXP3-IX to achieve Big-O of the square root of T log T regret while enforcing epsilon and delta differential privacy via moments accounting. CHRONOS releases a privatized affinity matrix per epoch using the Gaussian mechanism; all retrieval and ranking are post-processing, incurring no extra privacy cost. We provide multi-epoch settlement, scalability analysis for 500 sellers, and comparisons against accelerated baselines. Across four benchmarks, CHRONOS shows 0.937 recall at ten, 2.74 queries per second, 161 ms latency, and total epsilon of 4.25 at delta of 10 to the power of negative 6 under zCDP composition. These results indicate a competitive operating point. A limitation is that at this privacy level, released valuations remain noise-dominated; utility derives primarily from public index routing and adaptive scheduling driven by low-sensitivity statistics.

2605.23879 2026-05-25 stat.ML cs.CR cs.LG math.ST stat.TH Updated version

On the Stability of Spherical Hellinger-Kantorovich Flows and Their Implications for Differential Privacy

Aratrika Mustafi, Soumya Mukherjee

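Affiliations * Department of Statistics, Pennsylvania State University

AI summary This paper studies the stability of spherical Hellinger-Kantorovich (SHK) gradient flows and its application to differential privacy. The authors develop a perturbation theory for these flows, analyze how the dynamics differ under different potentials, and give uniform time-dependent bounds on the log-likelihood ratio and Rényi divergence, from which they further derive KL-divergence bounds. The results are applied to sampling for the exponential mechanism, giving pure and approximate differential privacy guarantees for SHK-based samplers and separating the mechanism's intrinsic suboptimality from finite-time sampling error.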

Abstract

Gradient-flow sampling interprets a Gibbs distribution as the minimizer of an energy functional over probability measures and generates dynamics converging to this target. Under spherical Hellinger-Kantorovich (SHK) geometry, the flow couples transport and reaction and coincides with birth-death Langevin dynamics. In this work, we develop a perturbation theory for SHK gradient flows. For two potentials $V$ and $V^{\prime}$, we compare the associated flows from a common initialization and quantify how potential discrepancies propagate over time. A uniform perturbation bound yields dimension-free, pointwise control of the log-likelihood ratio and Rényi divergence, while additional structure allows us to derive bounds for the KL divergence as well. We apply these results to approximate sampling for the exponential mechanism in differential privacy. The likelihood-ratio control provides explicit time-dependent Pure-DP guarantees for SHK-based samplers, while the KL bound yields Approximate-DP certificates via hockey-stick divergence. We also derive a utility bound separating intrinsic exponential-mechanism suboptimality from finite-time sampling error.
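
As background for the target the abstract samples from, here is a small sketch of the standard exponential mechanism on a finite set of candidates, where P(r) is proportional to exp(ε u(r) / (2Δ)); this is a Gibbs distribution with potential V = -ε u / (2Δ). The sketch computes exact probabilities and checks the textbook pure-DP ratio bound on one pair of neighbouring utility vectors. It does not implement the SHK flow or the paper's time-dependent guarantees, and the function names are hypothetical.

```python
import math


def exponential_mechanism_probs(utilities, epsilon, sensitivity):
    """Exact output distribution: P(r) proportional to exp(eps * u(r) / (2 * sensitivity))."""
    scale = epsilon / (2.0 * sensitivity)
    m = max(utilities)  # subtract the max for numerical stability
    weights = [math.exp(scale * (u - m)) for u in utilities]
    total = sum(weights)
    return [w / total for w in weights]


def max_log_ratio(p, q):
    """Largest absolute log-likelihood ratio between two output distributions."""
    return max(abs(math.log(a / b)) for a, b in zip(p, q))


if __name__ == "__main__":
    eps = 1.0
    # Utilities on two neighbouring datasets; each entry moves by at most 1.
    p = exponential_mechanism_probs([3.0, 1.0, 0.0], eps, 1.0)
    q = exponential_mechanism_probs([2.0, 2.0, 0.0], eps, 1.0)
    print(p, q, max_log_ratio(p, q))
```

An approximate sampler only reaches this distribution in the limit, which is why a bound on how far a finite-time sampler's log-likelihood ratio can drift matters for the privacy guarantee.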

2605.23872 2026-05-25 cs.LG cs.NA math.NA stat.ML Updated version

Training-Free Looped Transformers

Lizhang Chen, Jonathan Li, Chen Liang, Ni Lao, Qiang Liu

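Affiliations * University of Texas at Austin

AI summary This paper proposes training-free looped transformers: a lightweight inference-time wrapper repeatedly applies a contiguous block of middle layers in a frozen pretrained model, with no additional fine-tuning or architectural changes. Naively reusing the block degrades performance, so the authors draw on the forward Euler method for ordinary differential equations, treat looping as a refinement of the same approximation, and replace a single large update with smaller damped sub-steps. Experiments show gains across several model architectures, for example on benchmarks such as MMLU-Pro.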

Abstract

We introduce training-free looped transformers, in which a lightweight inference-time wrapper loops a contiguous mid-stack block of layers of a frozen checkpoint without additional fine-tuning, continued training, or architectural changes. Unlike prior looped transformer methods that train with the looped structure end-to-end, we retrofit recurrence onto pretrained models at test time. We show that naive block reapplication usually degrades performance, highlighting the importance of the loop application strategy. Motivated by viewing a pre-norm transformer block as a forward Euler step on an ODE, we instead treat looping as a refinement of the same approximation, replacing one large update with smaller damped sub-steps. Across seven dense, sparse MoE, and MLA+MoE model families, our method improves Qwen3-4B-Instruct by +2.64 pp on MMLU-Pro, Qwen3-30B-A3B-Instruct by +1.14 pp on CommonsenseQA, and Moonlight-16B-A3B-Instruct by +1.20 pp on OpenBookQA.
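
The forward-Euler view can be illustrated on a scalar toy problem. A residual update x + f(x) is one Euler step of size 1 on dx/dt = f(x); reapplying the same block n times with step size 1/n refines that approximation instead of overshooting. The 1/n damping and the toy branch f(x) = -x are assumptions for illustration, since the abstract does not state the paper's actual damping schedule.

```python
import math


def block_update(x):
    """Toy residual branch f(x); stands in for a block's attention + MLP output."""
    return -x


def single_pass(x, f):
    """Standard residual block: one forward Euler step of size 1."""
    return x + f(x)


def damped_loop(x, f, n_loops):
    """Reapply the same block n_loops times, each with step size 1 / n_loops."""
    step = 1.0 / n_loops
    for _ in range(n_loops):
        x = x + step * f(x)
    return x


if __name__ == "__main__":
    exact = math.exp(-1.0)  # solution of dx/dt = -x at t = 1 from x(0) = 1
    for n in (1, 4, 64):
        print(n, damped_loop(1.0, block_update, n), "exact", exact)
```

Naive reapplication without damping would correspond to taking several full-size steps, which moves the state further along the trajectory rather than refining the same unit-time update.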

2605.23871 2026-05-25 stat.ML cs.LG math.ST stat.TH Updated version

Move on Muon: A Hamiltonian probability gradient flow perspective of Muon optimizer

Aratrika Mustafi, Soumya Mukherjee, Bharath K. Sriperumbudur

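AI summary This paper studies the continuous-time dynamics of the Muon optimizer from the perspective of Hamiltonian probability gradient flows. It gives a gradient-flow formulation of regularized Muon and relates it to a Fenchel-dual smoothing of the nuclear norm. Lifting Muon to finite-particle probability objectives, the authors derive the inertial continuous-time limit and a phase-space mean-field equation over parameter-momentum pairs, and prove that the dynamics form a damped Hamiltonian probability system with monotonically decreasing Hamiltonian energy. The paper also analyzes convergence of the objective and extends the method to a blockwise Muon probability flow applicable to transformer mixture-of-experts models.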

Abstract

We develop a gradient flow on the space of probability measures defined on matrix-valued parameters induced by regularized Muon, an analytically smoothed version of the idealized Muon optimizer. The key observation is that the regularized orthogonalization map is the gradient of a smooth Fenchel-dual smoothing of the nuclear norm. This identifies the (regularized) Muon update as a mirror/prox step in the update variable, with momentum acting as the dual coordinate. We use this structure to lift Muon from a single matrix parameter to finite-particle probability objectives of the form $J(ρ)=R\left(\int F d ρ\right)$, a setting motivated by mean-field descriptions of neural-network training, and derive the inertial continuous-time limit. Using this structure, we derive the finite-particle continuous-time limit under the inertial scaling of step size and momentum, and then pass to a phase-space mean-field equation over probability laws on parameter-momentum pairs. The resulting flow can be shown to be a damped Hamiltonian probability dynamics whose kinetic energy is induced by the regularized Muon mirror potential. We prove an exact Hamiltonian dissipation identity, showing that the Hamiltonian energy decreases monotonically. While the target objective itself need not be monotone along the inertial Muon dynamics, under additional gradient-dominance, bounded-momentum, and curvature/alignment assumptions, we obtain continuous and discrete-time exponential convergence rates for the objective gap. We also study the well-posedness of the mean-field limit equation and establish propagation of chaos guarantees for the interacting particle system. Finally, we extend the formulation to Hilbert-valued feature maps on product matrix spaces, yielding a blockwise Muon probability flow applicable to smooth transformer mixture-of-experts models.

2605.23861 2026-05-25 cs.LG cs.AI cs.CV Updated version

Leveraging Foundation Models for Causal Generative Modeling

Aneesh Komanduri, Xintao Wu

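Affiliations * University of Arkansas

AI summary This paper studies how to use pretrained foundation models for causal generative modeling, with the goal of improving counterfactual reasoning in AI systems. It proposes FM-CGM, a modular framework that performs end-to-end visual causal reasoning through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. The method combines a causal reasoning model with a text-to-image diffusion model and adds a Causal Semantic Guidance mechanism, supporting zero-shot causal discovery and counterfactual image generation.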

Abstract

Causal generative modeling is essential for developing reliable and transparent AI systems capable of counterfactual reasoning. While existing approaches focus on integrating causal constraints during the training of generative models, they often lack a unified framework to leverage the zero-shot reasoning capabilities of pretrained foundation models. We introduce FM-CGM, a modular framework for end-to-end visual causal reasoning using pretrained foundation models. FM-CGM formalizes the causal pipeline through three core components: a concept extractor, a concept manipulator, and a counterfactual generator. By leveraging a large reasoning model for causal inference and a text-to-image diffusion model for generation, our approach enables zero-shot causal discovery, intervention, and counterfactual generation. We then develop Causal Semantic Guidance (CSG), a cross-attention-based mechanism that ensures semantic interventions propagate to descendant concepts while preserving invariant regions. We empirically show that our approach can identify plausible causal structures and is suitable for faithful counterfactual image generation.

2605.23857 2026-05-25 cs.LG cs.CL Updated version

Strong Teacher Not Needed? On Distillation in LLM Pretraining

Taiming Lu, Zhuang Liu

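Affiliations * Princeton University

AI summary This paper challenges the common assumption in knowledge distillation that strong teachers beat weak ones, studying how different teacher-student relationships affect distillation in LLM pretraining. By varying model size and the amount of training data, the authors construct strong-to-weak, same-level, and weak-to-strong pairings and find that even small or undertrained teachers can improve student models when the language modeling and distillation losses are mixed properly. A stronger teacher is not always better: making the teacher larger or training it on more data can weaken the distillation gains, and distillation helps generalization more than in-domain fitting.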

Abstract

Knowledge distillation generally assumes a strong-to-weak relationship where stronger teachers yield better students. In this work, we examine this assumption about distillation in large language model pretraining. By varying architecture sizes and training token budgets, we create strong-to-weak, same-level, and weak-to-strong teacher-student relationships, and study distillation's effectiveness under each. We find that the teacher need not be strong: with proper mixing of the language modeling and knowledge distillation losses, even small and undertrained teachers improve larger students. At the same time, a stronger teacher is not always better: pushing the teacher further, through more parameters or more training tokens, can saturate or even reverse the distillation gains. We further observe that distillation improves generalization (out-of-distribution and downstream performance) more readily than in-domain fitting. Together, these results challenge the common belief that distillation pretraining always requires a strong teacher.
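
The "mixing of the language modeling and knowledge distillation losses" can be sketched for a single next-token prediction as a convex combination of cross-entropy on the true token and a KL term toward the teacher. The mixing weight `alpha`, the KL direction (teacher to student), and the function names are assumptions for illustration; the abstract does not specify the paper's exact formulation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(student_probs, target_index):
    """Language-modeling loss: negative log-probability of the true token."""
    return -math.log(student_probs[target_index])


def kl_divergence(p, q):
    """KL(p || q) for discrete distributions with q > 0 everywhere."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def mixed_loss(student_logits, teacher_logits, target_index, alpha):
    """(1 - alpha) * LM loss + alpha * KL(teacher || student); alpha is hypothetical."""
    s = softmax(student_logits)
    t = softmax(teacher_logits)
    return (1.0 - alpha) * cross_entropy(s, target_index) + alpha * kl_divergence(t, s)


if __name__ == "__main__":
    student = [2.0, 0.5, -1.0]
    weak_teacher = [1.0, 0.8, 0.2]
    for alpha in (0.0, 0.3, 1.0):
        print(alpha, mixed_loss(student, weak_teacher, 0, alpha))
```

With `alpha = 0` this reduces to ordinary pretraining, and with `alpha = 1` the student sees only the teacher's distribution; the abstract's claim is that an intermediate mix lets even weak teachers help.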

2605.23854 2026-05-25 cs.LG math.ST stat.ML stat.TH Updated version

Entrywise Error Bounds for Spectral Ranking with Semi-Random Adversaries

Dongmin Lee, Anuran Makur, Japneet Singh

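Affiliations * Department of Computer Science; Elmore Family School of Electrical and Computer Engineering; Purdue University

AI summary This paper studies entrywise error bounds for spectral ranking methods under semi-random adversaries. For an adversary that can arbitrarily boost the sampling probabilities of certain edges, the authors analyze the unweighted spectral method and find that its performance depends heavily on the spectral properties of the generated graph. Appropriately reweighting the observed edges to counteract the adversary recovers asymptotic performance close to that of uniformly sampled graphs. Numerical experiments support the theory.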

Comments 17 pages, 2 figures, 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining V.2

Abstract

Bradley-Terry-Luce (BTL) model estimation is a well-established strategy to rank a collection of items given a dataset of pairwise comparisons. Although the theoretical performance of BTL estimation methods, such as spectral and maximum likelihood estimation, is well studied in the regime of uniformly sampled graphs, generalizing such results to a wider class of random graphs has proved challenging. In this work, we investigate the entry-wise error of spectral algorithms against a semi-random adversary that can arbitrarily boost the sampling probabilities of certain edges. We find that the performance of the unweighted spectral method is heavily dependent on the spectral properties of the generated graph. Furthermore, we show that asymptotic performance approaching that of uniformly sampled graphs can be recovered by appropriately reweighting the observed edges to counteract the adversary and restore the spectral gap. Finally, we provide numerical simulations that support our theoretical findings.
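
A minimal sketch of an unweighted spectral estimator for BTL scores, in the style of Rank Centrality: build a random walk that moves from item i to item j in proportion to how often j beats i, then read the scores off the stationary distribution. Treating Rank Centrality as the spectral method in question is an assumption, the function name is hypothetical, and the edge-reweighting scheme the abstract proposes against the adversary is not implemented here.

```python
def rank_centrality(wins, n_iters=500):
    """Estimate BTL scores from pairwise data.

    wins[i][j] is the fraction of i-vs-j comparisons won by j
    (0.0 when the pair was never compared).
    """
    n = len(wins)
    # Maximum number of distinct opponents of any item; keeps rows substochastic.
    d_max = max(
        sum(1 for j in range(n) if j != i and wins[i][j] + wins[j][i] > 0)
        for i in range(n)
    )
    P = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            if i != j:
                P[i][j] = wins[i][j] / d_max
        P[i][i] = 1.0 - sum(P[i])  # self-loop absorbs the remaining mass
    # Power iteration for the stationary distribution pi = pi P.
    pi = [1.0 / n] * n
    for _ in range(n_iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    total = sum(pi)
    return [p / total for p in pi]


if __name__ == "__main__":
    w = [1.0, 2.0, 4.0]  # ground-truth BTL weights
    wins = [[0.0 if i == j else w[j] / (w[i] + w[j]) for j in range(3)] for i in range(3)]
    print(rank_centrality(wins))  # proportional to w when win rates are exact
```

With exact win probabilities on a complete comparison graph, detailed balance makes the stationary distribution proportional to the true weights. The speed at which the walk mixes is governed by the graph's spectral gap, the quantity the abstract says the adversary can degrade.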

2605.23825 2026-05-25 cs.LG cs.AI Updated version

It's the humans, not the data: Geopolitical bias in LLMs originates in post-training, amplified by the language of the prompt

Stuart Bladon, Brinnae Bent

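Affiliations * Alibaba; seven AI labs

AI summary This study finds that geopolitical bias in language models originates mainly in post-training rather than pre-training. Comparative experiments on models from seven labs show that post-trained models tend to shift toward the country or region of their developer, and that the bias varies with the prompt language. The study stresses that models' representations of nations, cultures, and political views are not simply inherited from training data but are actively shaped during alignment, underlining the need for transparency and oversight of post-training.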

Comments 12 pages, 6 figures, 2 tables, 3 appendices. Code and scenario bank: https://github.com/recozers/LLM-Bias

Abstract

It has generally been assumed that geopolitical bias in language models originates from the training data used during the pre-training phase. We tested seven open-weight LLM pairs consisting of the base model (pre-training only) and the chat model (pre-training and post-training) from seven labs on a paired-scenario forced-choice probe over 28 country pairs in English, French, and Chinese, and found that geopolitical bias originates in post-training rather than in pre-training. Across seven AI labs, six showed shifts in the direction associated with the country or region of the model developer after post-training. This shift is strongest in Alibaba's Qwen 2.5: while the base is neutral on China-favourability (-0.15 log-odds, p=0.15), the post-trained chat variant is at +2.91 (p<10^-4), an 18x shift in odds. We also observe shifts in biases toward other countries across all models. Additionally, the magnitude of this shift depends on the language used to prompt the model: the French-made Mistral becomes pro-France only under French prompting (FR-EN shift +1.91, p<10^-4). These findings suggest that geopolitical preferences in language models are not simply inherited from large-scale internet data but are actively shaped during post-training, highlighting the need for greater transparency, auditing, and oversight of alignment processes that influence how models represent nations, cultures, and political perspectives.
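
To make the log-odds figures easier to read, the helper below converts a log-odds value into odds and a probability using the standard identities odds = exp(log-odds) and p = 1 / (1 + exp(-log-odds)). Reading +2.91 as odds of roughly 18 to 1 appears consistent with the abstract's "18x" figure if that figure is measured against even odds; that reading is an assumption, since the abstract does not say how the number was computed. The function names are hypothetical.

```python
import math


def odds_from_log_odds(log_odds):
    """Odds in favour, given natural-log odds."""
    return math.exp(log_odds)


def probability_from_log_odds(log_odds):
    """Probability in favour, via the logistic function."""
    return 1.0 / (1.0 + math.exp(-log_odds))


if __name__ == "__main__":
    for label, value in (("base", -0.15), ("chat", 2.91)):
        print(label, odds_from_log_odds(value), probability_from_log_odds(value))
```

On this scale a log-odds of 0 is an even split, so the base model's -0.15 sits close to neutral while +2.91 corresponds to choosing the favoured side in roughly 95% of forced choices.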

2605.23821 2026-05-25 cs.CL cs.LG Updated version

Hierarchical Concept Geometry in Language Models Emerges from Word Co-occurrence

Andres Nava, Matthieu Wyart

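Affiliations * Johns Hopkins University; EPFL

AI summary This paper studies how language models geometrically encode hypernymy (the "is-a" relation) through word co-occurrence. Starting from the empirical observation that co-occurrence frequency tracks proximity in the WordNet hierarchy, the authors analyze the spectrum of the Gram matrix of word embeddings and prove that the leading eigenvectors separate concept branches from coarse to fine, producing a hierarchical splitting geometry that mirrors the tree. Experiments confirm this in word2vec and, strikingly, in Gemma 2B, indicating that hierarchical concept geometry can arise from the spectral structure of pairwise word statistics without a hierarchy-specific functional mechanism.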

Comments 34 pages, 12 figures, including appendices

Abstract

We propose a distributional theory of how hypernymy -- the "is-a" relation between general and specific concepts -- is encoded geometrically in language representations. Starting from the empirically verified assumption that words closer on the WordNet hypernym graph co-occur more often, we characterize theoretically the spectrum of the resulting embedding Gram matrix of word2vec embeddings. Under mild positivity and decay conditions on the co-occurrence kernel, we prove that the leading eigenvectors first separate broad taxonomic branches and then progressively finer sub-branches, producing a hierarchical splitting geometry with a coarse-to-fine spectral organization that mirrors the tree. We confirm these predictions in word2vec embeddings across many sampled WordNet subtrees, and show that the same signature extends strikingly well to Gemma 2B unembeddings. Our results indicate that hierarchical concept geometry in LLMs need not reflect a hierarchy-specific functional mechanism, but emerges from the spectral structure of pairwise word statistics.

2605.23797 2026-05-25 cs.LG cs.CV Updated version

Debiased Negative Mining Improves Out-of-distribution Detection with Pre-trained Vision-Language Models

Bo Peng, Jie Lu, Guangquan Zhang, Zhen Fang

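Affiliations * University of Technology Sydney

AI summary This paper studies out-of-distribution (OOD) detection with pretrained vision-language models (VLMs), which aims to identify inputs from unknown classes. Existing methods rely mainly on heuristic rules to mine negative labels from unlabeled corpora and suffer from a serious false-negative bias. The authors propose a debiased negative mining method that corrects the bias by indirectly estimating the distribution of negative labels, and show that it reduces to Monte-Carlo sampling based on ID labels and the unlabeled corpus. Experiments show new state-of-the-art results across a variety of OOD detection settings.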

Comments KDD 2026

Abstract

Aiming to identify unexpected inputs from unknown classes, out-of-distribution (OOD) detection has emerged as a pivotal approach to enhancing the reliability of machine learning models. This paper focuses on the burgeoning paradigm of post-hoc OOD detection with pre-trained vision-language models (VLMs), where a popular pipeline is to detect OOD inputs by examining their affinities to ID labels and to negative labels, i.e., those semantically different from ID labels. Because target OOD labels are unavailable, existing works predominantly rely on heuristic rules to mine negative labels from unlabeled wild corpus data. Despite the empirical success, we argue that the power of VLM-based OOD detection has yet to be fully unleashed, since the notorious false negative problem is far from addressed in the literature. With this motivation, we are interested in addressing the challenge of mining true negative labels for OOD scoring. To this end, we develop a theoretical framework for correcting the sampling bias of negative labels by indirectly approximating the distribution of negative labels. Perhaps surprisingly, we show that debiased negative mining can be naturally converted into Monte-Carlo sampling based on ID labels and the unlabeled wild corpus data. Extensive experiments demonstrate that our method establishes a new state of the art in a variety of OOD detection setups. Code is publicly available at https://github.com/60pen9/Debiased-Negative-Mining-Improves-OOD-Detection-with-Pre-trained-VLMs.

2605.23778 2026-05-25 physics.ao-ph cs.LG physics.comp-ph 版本更新

The physics of AI weather models

AI天气模型的物理学

George Craig, Tobias Selz, Matthias Beylich, Kirsten I. Tempest

Affiliations Meteorological Institute, LMU Munich

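AI summary This paper asks whether AI weather models implicitly solve physical equations, even if these differ from the equations used by conventional numerical weather prediction models. By computing correlations between forecast skill and Centered Kernel Alignment, the authors find that different AI weather models represent the atmosphere in similar ways despite differences in architecture and capacity. They propose that the models implement a particle description of the atmosphere, in which the latent variables at each mesh point correspond to the position of a particle in a high-dimensional latent space, and hypothesize that the particles follow a gradient flow of a learned free energy functional in that space. An analysis of GraphCast and Aurora is consistent with this hypothesis.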

详情
AI中文摘要

AI天气模型是否可能在求解物理方程,尽管这些方程可能不是传统NWP模型所使用的方程?我们计算了预测技能和中心核对齐的相关性,提供了证据表明不同的AI天气模型以相似的方式表示大气,尽管架构和能力存在差异。我们认为AI模型的架构和训练限制了它们可能模拟的物理定律的形式。特别地,我们提出这些模型实现了大气的粒子描述,其中每个网格点的潜变量对应于高维潜空间中粒子的位置。我们假设粒子的运动遵循潜空间中的梯度流,朝向学习到的自由能泛函的最小值。对GraphCast和Aurora模型的分析表明,它们在早期处理器层中在大空间尺度上进行变化,并随着层深增加转向更小尺度,这与梯度流假设一致。

英文摘要

Could it be that AI weather models are solving physical equations, although they may not be the equations used by conventional NWP models? We compute correlations of forecast skill and Centered Kernel Alignment, providing evidence that different AI weather models represent the atmosphere in similar ways, despite differences in architecture and capacity. We argue that the architecture and training of the AI models constrain the form of the physical laws that they might simulate. In particular, we propose that the models implement a particle description of the atmosphere, where the latent variables at each mesh point correspond to the position of a particle in the high-dimensional latent space. We hypothesize that the movement of the particles follows a gradient flow in the latent space towards a minimum of a learned free energy functional. Analysis of the GraphCast and Aurora models shows that they make changes on large spatial scales in the early processor layers and move to smaller scales with increasing layer depth, consistent with the gradient flow hypothesis.
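Centered Kernel Alignment compares two sets of representations of the same inputs. The sketch below uses the linear-kernel form, $\mathrm{CKA}(X,Y)=\|X^{\top}Y\|_F^2/(\|X^{\top}X\|_F\,\|Y^{\top}Y\|_F)$ on column-centered matrices; the abstract does not say which kernel the authors use, so the linear kernel and all function names here are assumptions.

```python
import math

def center_columns(X):
    """Subtract the column mean from every entry (rows are samples)."""
    n, d = len(X), len(X[0])
    means = [sum(row[j] for row in X) / n for j in range(d)]
    return [[row[j] - means[j] for j in range(d)] for row in X]

def cross_gram_sq_norm(X, Y):
    """Squared Frobenius norm of X^T Y."""
    total = 0.0
    for i in range(len(X[0])):
        for j in range(len(Y[0])):
            s = sum(X[k][i] * Y[k][j] for k in range(len(X)))
            total += s * s
    return total

def linear_cka(X, Y):
    """Linear CKA between two representations of the same n samples."""
    Xc, Yc = center_columns(X), center_columns(Y)
    numerator = cross_gram_sq_norm(Xc, Yc)
    denominator = math.sqrt(cross_gram_sq_norm(Xc, Xc) * cross_gram_sq_norm(Yc, Yc))
    return numerator / denominator

# Toy check: swapping and rescaling feature columns leaves linear CKA at 1.
X = [[1.0, 2.0], [3.0, 4.0], [5.0, 7.0]]
Y = [[2.0 * b, 2.0 * a] for a, b in X]
Z = [[1.0], [-2.0], [1.0]]
print(linear_cka(X, Y), linear_cka(X, Z))
```

Because linear CKA is invariant to orthogonal transformations and isotropic scaling of the features, it can compare latent spaces of models whose dimensions are not aligned with each other.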

2605.23754 2026-05-25 cs.LG 版本更新

LLM-driven design of physics-constrained constitutive models: two agents are better than one

LLM驱动的物理约束本构模型设计:两个智能体胜过一个

Marius Tacke, Matthias Busch, Kian Abdolazizi, Jonas Eichinger, Kevin Linka, Roland Aydin, Christian Cyron

Affiliations Helmholtz-Zentrum Hereon; Hamburg University of Technology; RWTH Aachen University; Saarland University; German Center for Artificial Intelligence

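AI summary This paper proposes a multi-agent approach based on large language models (LLMs) for generating constitutive models that respect physical laws. It uses two agents: a Creator that generates a model from the data, and an Inspector that checks the model against nine physical constraints and returns it for revision if any constraint is violated. With the Inspector, the share of exported models that satisfy all constraints rises from 91% to 100% for Claude Opus 4.7 and from 37% to 56% for Kimi K2.5, while accuracy stays near the baseline and the models generalize to unseen loading paths.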

详情
AI中文摘要

传统上,开发描述材料在载荷下变形方式的本构模型需要连续介质力学、机器学习和科学编程方面多年的专业知识。最近,大型语言模型(LLM)已被证明可以通过按需生成本构模型来降低这一门槛,但现有的单智能体流程缺乏系统性的检查,以确保生成的模型尊重基本物理定律。为弥补这一差距,我们引入了首个多智能体LLM驱动的本构模型生成方法:一个Creator智能体根据数据提出定制模型,而一个Inspector智能体对每个提案进行严格审计,检查其是否满足九个物理约束,并在检测到违规时返回修改。我们使用本构人工神经网络(CANN)演示了这一概念,并在脑组织、实验橡胶和合成橡胶上使用两种不同的LLM骨干(Claude Opus 4.7和Kimi K2.5)进行基准测试。添加Inspector后,对于Opus,导出模型中真正满足所有物理约束的比例从91%提高到完美的100%;对于Kimi,从37%提高到56%,同时保持了接近基线的准确性和对未见加载路径的显著泛化能力。综合来看,生成的模型在物理上有效、高度准确,并能可靠地外推到训练数据之外——这些特性使其可以直接在实践中使用。因此,将生成与检查分离,使LLM驱动的本构建模成为一个真正可信的过程。该范式故意与技术无关,并随着LLM能力的进步自动扩展,为自动化、物理感知的模型发现开辟了一条有前景的道路。

英文摘要

Developing constitutive models that capture how materials deform under load traditionally requires years of specialized expertise in continuum mechanics, machine learning, and scientific programming. Large language models (LLMs) have recently been shown to lower this barrier by generating constitutive models on demand, but existing single-agent pipelines lack systematic checks that the resulting models respect fundamental physical laws. To close this gap, we introduce the first multi-agent LLM-driven approach for constitutive model generation: a Creator agent proposes a model tailored to the data, while an Inspector agent critically audits each proposal against nine physical constraints and returns it for refinement whenever a violation is detected. We demonstrate this concept with constitutive artificial neural networks (CANNs) and benchmark it on brain tissue, experimental rubber, and synthetic rubber, using two different LLM backbones (Claude Opus 4.7 and Kimi K2.5). Adding the Inspector raises the share of exported models that truly satisfy all physical constraints from 91% to a perfect 100% for Opus and from 37% to 56% for Kimi, while preserving near-baseline accuracy and remarkable generalization to unseen loading paths. In combination, the generated models are physically valid, highly accurate, and extrapolate reliably beyond the training data - properties that together make them directly usable in practice. Separating generation from inspection thus turns LLM-driven constitutive modeling into a genuinely trustworthy process. The paradigm is deliberately technique-agnostic and scales automatically with advances in LLM capability, opening a promising path toward automated, physics-aware model discovery.
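The Creator/Inspector loop described above can be sketched as a small control flow: propose, audit, and feed violations back until the audit passes. Everything below is a toy stand-in; `generate_with_inspection`, `toy_creator`, `toy_inspector`, and the two example constraints are hypothetical and do not come from the paper, whose agents are LLMs checking nine constraints.

```python
from typing import Callable, Dict, List, Optional, Tuple

Model = Dict[str, float]  # hypothetical: a candidate model described by its parameters

def generate_with_inspection(
    create: Callable[[List[str]], Model],
    inspect: Callable[[Model], List[str]],
    max_rounds: int = 5,
) -> Tuple[Optional[Model], int]:
    """Alternate proposal and audit until no violations remain."""
    feedback: List[str] = []
    for round_idx in range(1, max_rounds + 1):
        candidate = create(feedback)   # Creator proposes, using earlier feedback
        feedback = inspect(candidate)  # Inspector returns the list of violations
        if not feedback:
            return candidate, round_idx
    return None, max_rounds            # nothing passed inspection

def toy_creator(feedback: List[str]) -> Model:
    model = {"stiffness": -1.0, "energy_at_rest": 0.5}
    if "stiffness must be positive" in feedback:
        model["stiffness"] = 1.0
    if "energy must vanish in the undeformed state" in feedback:
        model["energy_at_rest"] = 0.0
    return model

def toy_inspector(model: Model) -> List[str]:
    violations = []
    if model["stiffness"] <= 0:
        violations.append("stiffness must be positive")
    if model["energy_at_rest"] != 0.0:
        violations.append("energy must vanish in the undeformed state")
    return violations

model, rounds = generate_with_inspection(toy_creator, toy_inspector)
print(model, rounds)
```

Keeping the audit in a separate callable means a model is only exported after an independent check, which is the separation of generation from inspection that the abstract credits for the improvement.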

2605.23753 2026-05-25 cs.LG 版本更新

SeedER: Seed-and-Expand Retrieval from Knowledge Graphs

SeedER: 基于种子扩展的知识图谱检索

Hamed Shirzad, Frederik Wenkel, Dominique Beaini, Danica J. Sutherland, Emmanuel Noutahi

Affiliations Valence Labs, Montréal, QC, Canada; University of British Columbia, Department of Computer Science, Vancouver, BC, Canada

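AI summary SeedER is a retrieval framework for knowledge graphs that targets the retrieval difficulties caused by their irregular structure. It first seeds a compact set of core nodes using lightweight dense and entity-based retrieval, then selectively expands that set with a graph-aware policy trained by reinforcement learning, so that query-relevant nodes are found efficiently. Experiments show that SeedER substantially improves recall with compact candidate sets while keeping expansion cost under tight control.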

详情
AI中文摘要

知识图谱(KGs)为关系知识提供了丰富的表示,但其不规则结构使得检索具有挑战性:自我图扩展迅速增长,而密集嵌入方法难以处理多跳组合查询。现有的基于智能体的图探索方法虽然表达能力强,但通常对于大规模检索来说过于昂贵。我们引入了SeedER(种子扩展检索),这是一个通过迭代、低成本扩展显式利用KG结构的检索框架。SeedER首先使用轻量级密集和基于实体的检索播种一个紧凑的核心节点集,然后通过使用强化学习训练的图感知策略选择性地扩展该集合。这种设计将全局推理分解为可重用的局部决策,从而能够在严格控制扩展成本的同时高效发现与查询相关的节点。我们展示了密集检索在组合图查询上的理论局限性,并从组合泛化和图约束子模优化的角度确立了SeedER的优势。实验上,SeedER在紧凑候选集上显著提高了召回率,超过了强大的密集和图增强基线,使其成为知识密集型推理系统中有效的第一阶段检索器。

英文摘要

Knowledge graphs (KGs) offer a rich representation for relational knowledge, but their irregular structure makes retrieval challenging: ego-graph expansion grows rapidly, and dense embedding methods struggle with multi-hop compositional queries. Existing agent-based graph exploration approaches, while expressive, are often too expensive for large-scale retrieval. We introduce SeedER (Seed-and-Expand Retrieval), a retrieval framework that explicitly leverages KG structure through iterative, low-cost expansion. SeedER first seeds a compact set of core nodes using lightweight dense and entity-based retrieval, then selectively expands this set via a learned graph-aware policy trained with reinforcement learning. This design decomposes global reasoning into reusable local decisions, enabling efficient discovery of query-relevant nodes while tightly controlling expansion cost. We show theoretical limitations of dense retrieval on compositional graph queries, and establish advantages of SeedER from both compositional generalization and graph-constrained submodular optimization perspectives. Empirically, SeedER substantially improves recall with compact candidate sets over strong dense and graph-augmented baselines, making it an effective first-stage retriever for knowledge-intensive reasoning systems.

2605.23751 2026-05-25 cs.LG 版本更新

Approaching I/O-optimality for Approximate Attention

逼近近似注意力的I/O最优性

Pál András Papp, Aleksandros Sobczyk, Anastasios Zouzias

Affiliations Computing Systems Lab; Huawei Technologies

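AI summary This paper studies the I/O complexity of attention in large language models: computing $\text{softmax}(QK^{\top}/\sqrt{d})V$ with as few data transfers between fast and slow memory as possible. Building on an approximate attention framework, the authors give I/O-efficient algorithms whose I/O cost depends almost linearly on the sequence length $n$ in most parameter regimes, compared with the quadratic dependence of existing methods such as FlashAttention. They also prove lower bounds for each parameter regime, showing that the algorithms are close to I/O-optimal.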

详情
AI中文摘要

我们重新审视了大语言模型中注意力的I/O复杂度。给定查询-键-值矩阵 $Q,K,V\in\mathbb{R}^{n\times d}$,以及一个快速内存大小为 $M$ 的机器,目标是计算“注意力矩阵” $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$,同时最小化快速和慢速内存之间的数据传输次数。文献中的现有方法,尤其是FlashAttention及其变体,其I/O开销与 $n$ 呈二次关系,而一个平凡的下界仅需要 $\Omega(nd)$ 次I/O来读取输入和写入输出。在这项工作中,我们提出了一种计算注意力的技术,在大多数参数范围内,其I/O开销几乎与 $n$ 呈线性关系。这是通过开发受Alman和Song最近提出的近似注意力框架启发的I/O高效算法实现的。我们还证明了每个参数范围内的相应下界,以表明我们的算法确实接近I/O最优。

英文摘要

We revisit the I/O complexity of attention in large language models. Given query-key-value matrices $Q,K,V\in\mathbb{R}^{n\times d}$, and a machine with fast memory size $M$, the goal is to compute the "attention matrix" $A=\text{softmax}(Q K ^{\top}/\sqrt{d}) V$ with the minimal number of data transfers between fast and slow memory. Existing methods in the literature, most notably FlashAttention and its variants, incur an I/O cost that depends quadratically on $n$, while a trivial lower bound only requires $\Omega(nd)$ I/Os to read the inputs and write the output. In this work, we present a technique for computing attention where the I/O cost only depends almost-linearly on $n$ in most parameter regimes. This is achieved by developing I/O-efficient algorithms inspired by the recent approximate attention framework of Alman and Song. We also prove corresponding lower bounds in each parameter regime to show that our algorithms are indeed close to I/O-optimal.
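To make the memory-transfer setting concrete, the sketch below computes $\text{softmax}(QK^{\top}/\sqrt{d})V$ in two ways: directly, and block by block with a running maximum and normaliser, so that only one block of $K$ and $V$ is needed at a time. This is the exact tiling idea used by FlashAttention-style methods, shown only for illustration; it is not the approximate algorithm of the paper, and the function names and block size are assumptions.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def naive_attention(Q, K, V):
    """Reference: materialise a full row of scores for every query."""
    d = len(Q[0])
    out = []
    for q in Q:
        scores = [dot(q, k) / math.sqrt(d) for k in K]
        m = max(scores)
        w = [math.exp(s - m) for s in scores]
        z = sum(w)
        out.append([sum(w[j] * V[j][c] for j in range(len(V))) / z
                    for c in range(len(V[0]))])
    return out

def blockwise_attention(Q, K, V, block=2):
    """Same result, streaming K and V in blocks with an online softmax."""
    d, dv = len(Q[0]), len(V[0])
    out = []
    for q in Q:
        m, z, acc = -math.inf, 0.0, [0.0] * dv  # running max, normaliser, weighted sum
        for start in range(0, len(K), block):
            scores = [dot(q, k) / math.sqrt(d) for k in K[start:start + block]]
            m_new = max(m, max(scores))
            scale = math.exp(m - m_new)  # rescale what was accumulated so far
            z *= scale
            acc = [a * scale for a in acc]
            for s, v in zip(scores, V[start:start + block]):
                w = math.exp(s - m_new)
                z += w
                acc = [a + w * x for a, x in zip(acc, v)]
            m = m_new
        out.append([a / z for a in acc])
    return out

Q = [[0.1, 0.2], [0.5, -0.3], [1.0, 0.0]]
K = [[0.3, 0.1], [-0.2, 0.4], [0.7, 0.7], [0.0, -1.0], [0.9, 0.2]]
V = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [-1.0, 0.5], [0.3, 0.3]]
print(blockwise_attention(Q, K, V))
```

The blockwise version still visits every query-key pair, which is where the quadratic dependence on $n$ mentioned in the abstract comes from.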

2605.23744 2026-05-25 cs.LG 版本更新

Contrast to Detect: Dynamic Graph Contrastive Regularization for Unsupervised Anomaly Detection in Multivariate Time Series

对比检测:面向无监督多变量时间序列异常检测的动态图对比正则化

Yunhua Pei, Zixing Song, Jin Zheng, John Cartlidge

Affiliations School of Computer Science, University of Bristol; School of Engineering Mathematics, University of Bristol

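AI summary This work addresses unsupervised anomaly detection in multivariate time series and proposes ContrastAD, a framework designed to cope with dynamic inter-variable dependencies and spectral noise. It uses dynamic graph contrastive learning to treat structural evolution as a learning signal rather than suppressing it, and adds a multi-perspective embedder and a frequency-aware attention mechanism for robustness. On five real-world benchmarks, ContrastAD attains the highest mean F1 on all five datasets and the highest AUC on three.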

Comments 12 pages, 5 figures. Preprint. Code and demo data available online

详情
AI中文摘要

多变量时间序列(MTS)中的异常检测受到动态变量间依赖关系和频谱噪声下特征纠缠的阻碍,在实践中,由于缺乏异常标签而进一步复杂化。现有的基于重构的检测器倾向于像正常模式一样忠实地恢复异常,而流行的图对比方法强制视图间不变性,从而假设一个平稳的关系结构,这一假设在真实系统的结构漂移下被打破。我们提出ContrastAD,一个无监督框架,将结构演化本身转变为学习信号而非抑制它。一个多视角编码器从时间、属性和结构视角编码输入。一个频率感知注意力混合器在注意力之前执行频谱top-K过滤,防止噪声泄漏到查询-键相似度中。核心组件,一个动态图对比学习器,从批次级DTW距离构建基于幂律的稀疏图快照,并将最发散的对与稳定锚点进行对比,在不施加刚性不变性的情况下正则化潜在空间。在五个真实世界基准上,ContrastAD在所有五个数据集上获得最高平均F1,并在三个数据集上获得最高AUC(SWaT 93.60,SMD 98.66,PSM 97.79),在SWaT和PSM上相对于最强基线具有统计显著的F1和AUC差距。在MSL和SMAP上,其AUC落后领先者不到0.7个百分点,同时F1仍领先。消融和敏感性研究进一步证实,对比目标作为软正则化器效果最佳,支持我们的主张:在非平稳动态下严格不变性是次优的。

英文摘要

Anomaly detection in multivariate time series (MTS) is hindered by dynamic inter-variable dependencies and feature entanglement under spectral noise, and in practice, is further complicated by the absence of anomaly labels. Existing reconstruction-based detectors tend to recover anomalies as faithfully as normal patterns, while prevailing graph contrastive methods enforce invariance across views and thus assume a stationary relational structure, an assumption that breaks under structural drift in real systems. We propose ContrastAD, an unsupervised framework that turns structural evolution itself into a learning signal rather than suppressing it. A Multi-Perspective Embedder encodes inputs from temporal, attribute, and structural perspectives. A Frequency-Aware Attention Mixer then performs spectral top-K filtering before attention, preventing noise from leaking into query-key similarities. The core component, a Dynamic Graph Contrastive Learner, builds power-law-inspired sparse graph snapshots from batch-level DTW distances and contrasts the most divergent pair against a stable anchor, regularizing the latent space without imposing rigid invariance. Across five real-world benchmarks, ContrastAD attains the highest mean F1 on all five datasets and the highest AUC on three (SWaT 93.60, SMD 98.66, PSM 97.79), with statistically significant F1 and AUC margins over the strongest baseline on SWaT and PSM. On MSL and SMAP, it trails the AUC leader by under 0.7 points while still leading on F1. Ablation and sensitivity studies further confirm that the contrastive objective works best as a soft regularizer, supporting our claim that strict invariance is suboptimal under non-stationary dynamics.
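The "spectral top-K filtering before attention" step can be illustrated on a single real-valued series: transform to the frequency domain, keep the K strongest bins, and transform back. The abstract does not give the details of the Frequency-Aware Attention Mixer, so the sketch below is only one plausible reading; `spectral_topk_filter` and the test signal are assumptions.

```python
import cmath
import math

def dft(x):
    """Plain O(n^2) discrete Fourier transform, enough for a short example."""
    n = len(x)
    return [sum(x[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
            for k in range(n)]

def spectral_topk_filter(x, k):
    """Keep the k largest-magnitude frequency bins of a real signal."""
    n = len(x)
    X = dft(x)
    half = range(n // 2 + 1)  # non-negative frequencies
    keep = set(sorted(half, key=lambda f: abs(X[f]), reverse=True)[:k])
    keep |= {(n - f) % n for f in keep}  # add mirror bins so the output stays real
    Xf = [X[f] if f in keep else 0 for f in range(n)]
    return [sum(Xf[f] * cmath.exp(2j * math.pi * f * t / n) for f in range(n)).real / n
            for t in range(n)]

n = 16
clean = [math.sin(2 * math.pi * 2 * t / n) for t in range(n)]
noisy = [c + 0.1 * math.sin(2 * math.pi * 5 * t / n) for t, c in enumerate(clean)]
filtered = spectral_topk_filter(noisy, k=1)
print(max(abs(a - b) for a, b in zip(filtered, clean)))
```

With K = 1 the weak component at frequency 5 is discarded and the dominant component is recovered, so similarity scores computed on the filtered series would not see that noise.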

2605.23726 2026-05-25 cs.LG cs.DS stat.ML 版本更新

Optimal Dimension-Free Sampling for Regularized Classification

正则化分类的最优无维度采样

Meysam Alishahi, Alexander Munteanu, Simon Omlor, Jeff M. Phillips

Affiliations University of Utah, USA; TU Dortmund, Germany

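AI summary This paper studies optimal dimension-free sampling that achieves $(1\pm\varepsilon)$ relative error for regularized classification, covering a broad class of Lipschitz continuous loss functions such as logistic, hinge, and ReLU loss. It gives matching upper and lower bounds on the sampling complexity: $k^2/\varepsilon^2$ for $\|\cdot\|_2/k$ regularization and $k/\varepsilon^2$ for $\|\cdot\|_1/k$ regularization, and it shows that for $\|\cdot\|_2^2/k$ regularization the complexity depends on a bounded derivative property of the loss. Using simple uniform or (squared) norm sampling together with refined higher-moment arguments, the results improve on the previous cubic $k^3/\varepsilon^2$ sensitivity sampling bound.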

详情
AI中文摘要

我们证明了对于一大类Lipschitz连续分类损失函数,在各种正则化项下,达到$(1\pm\varepsilon)$相对误差的最优采样界。这包括重要的函数如logistic和sigmoid损失、hinge损失和ReLU损失,作为突出和流行的代表性例子。特别地,我们证明了对于$\|\cdot\|_2/k$正则化的$k^2/\varepsilon^2$上下界,以及对于$\|\cdot\|_1/k$正则化的$k/\varepsilon^2$上下界。对于$\|\cdot\|_2^2/k$正则化,采样复杂度主要取决于有界导数性质:如果$|g'(x)|\leq g(x)$,且$g(0)>0$,且$g$是单调或凸的,则采样复杂度是$k$的线性;否则一般界为$k^2/\varepsilon^2$。然而,如果$g(0)=0$,我们的结果表明不可能得到无维度界,甚至次线性界也被排除。所有上界都有匹配的下界(至多相差多对数项)。此外,我们的工作在概念上和算法上依赖于简单的均匀或(平方)范数采样,从而改进了最近(Alishahi and Phillips, ICML'24)的立方$k^3/\varepsilon^2$敏感度采样界。这是通过涉及更高矩界和经验过程分析的精细论证来实现的,以避免在事实上的标准VC维和敏感度框架中出现的过度计数。

英文摘要

We prove optimal sampling bounds achieving $(1\pm\varepsilon)$-relative error for a broad class of Lipschitz continuous classification loss functions under various regularization terms. This includes important functions such as logistic and sigmoid loss, hinge loss, and ReLU loss, as prominent and popular representative examples. In particular, we prove $k^2/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_2/k$ regularization, and $k/\varepsilon^2$ upper and lower bounds for $\|\cdot\|_1/k$ regularization. For $\|\cdot\|_2^2/k$ regularization, the sampling complexity depends mainly on a bounded derivative property: if $|g'(x)|\leq g(x)$, and $g(0)>0$, and $g$ is monotonic or convex, then it admits sampling complexity linear in $k$; otherwise the general bound is $k^2/\varepsilon^2$. However, if $g(0)=0$, our results indicate that no dimension-free bounds are possible, and even sublinear bounds are ruled out. All upper bounds are complemented by matching lower bounds up to polylogarithmic terms. Moreover, our work relies conceptually and algorithmically on simple uniform or (squared) norm sampling and thereby improves over the recent cubic $k^3/\varepsilon^2$ sensitivity sampling bounds of Alishahi and Phillips (ICML'24). This is achieved by refined arguments involving higher moment bounds and empirical process analyses to avoid the overcounting that appears in the de facto standard VC-dimension and sensitivity framework.

2605.23712 2026-05-25 cs.CE cs.LG 版本更新

Operator Learning for Reconstructing Flow Fields from Sparse Measurements: a Language Model Approach

基于稀疏测量重建流场的算子学习:一种语言模型方法

Qian Zhang, George Em Karniadakis

Affiliations Division of Applied Mathematics, Brown University

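AI summary This paper addresses the reconstruction of flow fields from sparse measurements, a fundamental problem in fluid mechanics, and proposes an operator learning framework based on language model architectures that reconstructs flows in a mesh-free manner. Reconstruction is cast as a sequence-to-sequence task in which sparse measurements serve as context and unobserved locations as queries, capturing spatial correlations and long-range dependencies. On four benchmark datasets the method shows competitive reconstruction accuracy, even when less than 10% of the data is observed.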

详情
AI中文摘要

从稀疏测量中重建流场是流体力学中的一个基本问题,对建模、控制和设计具有广泛影响。在这项工作中,我们提出了一种新颖的算子学习框架,利用语言模型的架构以无网格方式进行流场重建。我们将流场重建重新表述为序列到序列的学习任务,其中稀疏测量被视为上下文,未观测位置被视为查询。我们的模型学习从稀疏输入重建完整流场,有效捕捉空间相关性和长程依赖。我们在四个基准数据集上评估了所提出的方法:(1) 二维涡街模拟,(2) 美国本土的日平均温度数据,(3) 基于耗散粒子动力学的三维血流模拟,以及(4) 通过粒子跟踪测速获得的三维湍流射流测量。在所有情况下,我们的方法即使在高度不完整的数据(观测率低于10%)下也表现出竞争性的重建精度,并实现了高效性能。结果凸显了语言模型作为科学数据重建的鲁棒且可扩展工具的潜力,并指向了为科学和工程应用开发基础模型的有前景方向。

英文摘要

Reconstructing flow fields from sparse measurements is a fundamental problem in fluid mechanics with broad implications for modeling, control, and design. In this work, we propose a novel operator learning framework that leverages the architecture of language models to perform flow reconstruction in a mesh-free manner. We reformulate flow field reconstruction as a sequence-to-sequence learning task, where sparse measurements are treated as context and unobserved locations as queries. Our model learns to reconstruct the full flow field from sparse inputs, effectively capturing spatial correlations and long-range dependencies. We evaluate the proposed approach on four benchmark datasets: (1) two-dimensional vortex street simulations, (2) daily average temperature data across the contiguous United States, (3) three-dimensional blood flow simulations based on dissipative particle dynamics, and (4) three-dimensional turbulent jet flow measurements obtained via particle tracking velocimetry. Across all cases, our method demonstrates competitive reconstruction accuracy, even with highly incomplete data (less than 10% observed), and achieves efficient performance. The results highlight the potential of language models as robust and scalable tools for scientific data reconstruction, and suggest a promising direction toward the development of foundation models for scientific and engineering applications.

2605.23708 2026-05-25 cs.LG cs.SY eess.SY nlin.AO 版本更新

Learning Dynamic Stability Landscapes in Synchronization Networks

学习同步网络中的动态稳定景观

Christian Nauck, Junyou Zhu, Michael Lindner, Frank Hellmann

Affiliations Department of Complexity Science, Potsdam Institute for Climate Impact Research, Potsdam, Germany; Department of Digital Transformation in Energy Systems, Institute of Energy Technology, Technical University of Berlin, Germany; Machine Learning Group, Technical University of Berlin, 10587 Berlin, Germany

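AI summary This paper proposes a new upstream task, learning dynamic stability landscapes in synchronization networks, which gives deeper insight into synchronization behavior and from which many scalar stability indices can be derived. It introduces a graph-to-image prediction paradigm that learns an image-like stability landscape for each node directly from graph topology, and releases two benchmark datasets of 10,000 graphs each. A graph neural network encoder combined with a convolutional decoder learns the landscapes end-to-end and generalizes across graph sizes and to realistic power grid topologies.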

Comments 22 pages, 12 figures

详情
AI中文摘要

同步的鲁棒性通常通过标量、节点级稳定性指数来表征,这些指数对拓扑的依赖性通过网络科学或图神经网络(GNN)进行研究。我们提出了一种新颖的上游任务——学习稳定景观,它提供了对同步行为的更深入洞察,并且可以从中推导出许多此类标量指数。关键的是,我们开创了一种图到图像的预测范式:直接从图拓扑学习作为节点级目标的图像状景观,这种表述在文献中我们尚未见到。为了支持这一任务,我们发布了两个数据集,每个数据集包含10,000个图,节点数分别为20和100,并带有节点级景观标签,基于一个概念性振荡器模型,捕捉电网同步行为。GNN编码拓扑,CNN解码器渲染每个节点的图像,以端到端方式学习,具有良好的分布内准确性,并能泛化到不同图大小和实际电网拓扑。这表明,稳定景观虽然超出了传统网络科学的能力范围,但可以从拓扑中学习,并为生物学、神经科学和电网中超越标量稳定性指数开辟了新途径。

英文摘要

The robustness of synchronization is typically characterized by scalar, per-node stability indices whose dependence on topology is studied via network science or graph neural networks (GNNs). We propose a novel upstream task, learning stability landscapes, which provide deeper insights into synchronization behavior and from which many such scalar indices can be derived. Crucially, we pioneer a graph-to-image prediction paradigm: learning image-like landscapes as per-node targets directly from graph topology, a formulation we are not aware of having been established elsewhere in the literature. To support this task, we release two datasets of 10,000 graphs each at 20 and 100 nodes with per-node landscape labels, based on a conceptual oscillator model, capturing power grid synchronization behavior. A GNN encodes topology and a CNN decoder renders per-node images, learned end-to-end with good in-distribution accuracy, generalizing across graph sizes and to realistic power grid topologies. This demonstrates that stability landscapes, while beyond the reach of conventional network science, are learnable from topology and open new avenues for moving beyond scalar stability indices in biology, neuroscience, and power grids.

2605.23696 2026-05-25 cs.LG 版本更新

Graph-based Complexity Forecasts in UK En Route Airspace Using Relevant Aircraft Interactions

基于相关飞机交互的英国航路空域图复杂度预测

Edward Henderson, George De Ath, Nick Pepper

Affiliations The Alan Turing Institute, London, England; University of Exeter, Exeter, England

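AI summary This paper studies how graph-based methods can forecast en route controller workload in UK airspace. It proposes a probabilistic model that estimates airspace complexity from the number of aircraft pairs requiring monitoring or deconfliction, using a graph representation of the London Middle Sector (LMS) route network and accounting for uncertainty in aircraft arrival times. The forecasts correlate more strongly with actual relevant interactions than a standard traffic volume prediction (Spearman's $\rho = 0.68$ versus $0.55$), which may support controller rostering and sector configuration decisions.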

Comments Accepted paper at the US-Europe Air Transportation Research & Development Symposium (ATRD) 2026

详情
AI中文摘要

有效管理空中交通管制员(ATCO)的工作量对于维持运行安全至关重要。小组主管使用工具估计即将到来的交通负荷以辅助决策。然而,行业标准模型可能无法捕捉即将到来的空中交通复杂性的细微差别。本研究提出了一种概率方法,使用相关飞机对(即需要管制员监控或解冲突的飞机对)的数量作为ATCO工作量的代理指标,来预测空域扇区的复杂性。我们改编了一种现有的过滤算法,使其适用于伦敦中部扇区(LMS),这是一个复杂的空域扇区,在欧洲一些最繁忙机场的上空有多股交通流。通过与ATCO的迭代反馈,算法被改进并扩展以处理特定的几何和运行考虑。更新后的算法优于原始算法,在50个标记交通场景集上F1分数为0.84,而原始算法为0.69。为了预测扇区内未来相关飞机对的数量,构建了LMS航线网络的图表示,标准化了航路段的空间保真度。该预测方法通过建模每个飞机在未来查询时间点占据航路段的概率,考虑了飞机到达时间的不确定性。当与历史相关交互分布和实时运行数据流结合时,可以提前最多45分钟预测即将到来的ATCO工作量。所提出的预测即将到来的工作量的方法,与实际相关交互的Spearman相关系数(ρ=0.68)显著强于标准交通量预测(ρ=0.55)。由此产生的数据驱动工具显示出有望被小组主管用于扇区配置和ATCO排班决策。

英文摘要

Effectively managing Air Traffic Control Officer (ATCO) workload is crucial in maintaining operational safety. Group supervisors use tools that estimate upcoming traffic load to aid decision-making. However, industry-standard models can fail to capture the nuances of upcoming air traffic complexity. This study presents a probabilistic approach to forecast the complexity of an airspace sector using the number of relevant aircraft pairs, i.e., those that require monitoring or deconfliction by a controller, as a proxy measure for ATCO workload. We adapted an existing filter algorithm to make it suitable for use in London Middle Sector (LMS), a complex airspace sector with multiple flows of traffic above some of the busiest airports in Europe. Through iterative feedback with ATCOs, the algorithm was refined and extended to handle specific geometric and operational considerations. The updated algorithm outperformed the original, with an F1-score of 0.84 compared to 0.69 on a labelled set of 50 traffic scenarios. To produce forecasts of future numbers of relevant aircraft pairs in the sector, a graph representation of the LMS route network was constructed, standardising the spatial fidelity of route legs. The forecasting method accounts for uncertainty in aircraft arrival times by modelling the probability of each aircraft occupying route segments at future query times. When combined with historic distributions of relevant interactions and a live operational data stream, predictions of upcoming ATCO workload could be made up to 45 minutes in advance. The proposed method to forecast upcoming workload showed a significantly stronger correlation with actual relevant interactions (Spearman's $\rho = 0.68$) than a standard traffic volume prediction ($\rho = 0.55$). The resulting data-driven tool shows promise for use by group supervisors to inform sector configuration and ATCO rostering decisions.

2605.23689 2026-05-25 cs.LG math.DS 版本更新

Optimization of randomized neural networks for transfer operator approximation

随机神经网络优化用于传递算子逼近

Mohammad Tabish, Stefan Klus

Affiliations Maxwell Institute for Mathematical Sciences, University of Edinburgh and Heriot–Watt University; School of Mathematical & Computer Sciences, Heriot–Watt University

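AI summary RaNNDy is a randomized neural network architecture for approximating transfer operators of complex dynamical systems: the hidden-layer weights and biases are randomly initialized and kept fixed, and only the output layer is trained, which lowers training cost and gives a closed-form solution. Because the basis functions then depend on the initially chosen activation function, this paper proposes an algorithm that optimizes the activation function while keeping the network's weights and biases fixed, and demonstrates it on several benchmark problems.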

详情
AI中文摘要

RaNNDy是一种随机神经网络架构,用于数据驱动地逼近与复杂动力系统相关的传递算子。网络隐藏层的权重和偏置随机初始化并保持固定,仅训练输出层。与完全优化的神经网络相比,这具有几个优点,特别是输出层的闭式解和显著降低的训练成本。尽管有这些优点,RaNNDy局限于参数化算子逼近所需基函数的权重和偏置的初始选择。由于基函数由激活函数决定,为隐藏层选择合适的激活函数至关重要。在这项工作中,我们提出了一种算法,该算法优化激活函数本身,同时保持随机神经网络中的权重和偏置固定,从而提供更合适的字典。我们通过各种基准问题(包括随机微分方程和图上的随机游走)说明了该方法的有效性。

英文摘要

RaNNDy is a randomized neural network architecture for the data-driven approximation of transfer operators associated with complex dynamical systems. The weights and biases of the hidden layers of the network are randomly initialized and kept fixed; only the output layer is trained. This has several advantages over fully optimized neural networks, notably a closed-form solution for the output layer and significantly lower training costs. Despite these advantages, RaNNDy is restricted to the initial selection of weights and biases that parametrize the basis functions required for the operator approximation. Since the basis functions are determined by the activation function, choosing an appropriate activation function for the hidden layers is crucial. In this work, we propose an algorithm that optimizes the activation function itself, while keeping the weights and biases in the randomized neural network fixed, providing a more suitable dictionary. We illustrate the efficacy of the approach using various benchmark problems, including stochastic differential equations and random walks on graphons.
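The "fixed random hidden layer, closed-form output layer" idea can be shown on a plain regression problem. The sketch below is not RaNNDy itself (which targets transfer operators and, in this paper, also optimizes the activation function); it assumes a `tanh` activation, a ridge-regularized least-squares output layer, and a toy target, all chosen here for illustration.

```python
import math
import random

def random_features(xs, weights, biases):
    """Hidden layer with fixed random weights: phi_j(x) = tanh(w_j * x + b_j)."""
    return [[math.tanh(w * x + b) for w, b in zip(weights, biases)] for x in xs]

def solve_linear(A, b):
    """Gaussian elimination with partial pivoting for a small dense system."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (M[i][n] - sum(M[i][j] * x[j] for j in range(i + 1, n))) / M[i][i]
    return x

def fit_output_layer(Phi, y, ridge=1e-3):
    """Closed form: solve (Phi^T Phi + ridge * I) w = Phi^T y."""
    m = len(Phi[0])
    A = [[sum(row[i] * row[j] for row in Phi) + (ridge if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    rhs = [sum(row[i] * t for row, t in zip(Phi, y)) for i in range(m)]
    return solve_linear(A, rhs)

random.seed(0)
weights = [random.gauss(0, 2) for _ in range(8)]   # never trained
biases = [random.uniform(-1, 1) for _ in range(8)]  # never trained
xs = [i / 10 for i in range(-20, 21)]
ys = [math.sin(x) for x in xs]
Phi = random_features(xs, weights, biases)
w_out = fit_output_layer(Phi, ys)
pred = [sum(p * w for p, w in zip(row, w_out)) for row in Phi]
print(sum((p - y) ** 2 for p, y in zip(pred, ys)) / len(ys))
```

The fit can only be as good as the span of the random basis functions, which is the limitation that motivates adapting the activation function in the abstract.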

2605.22738 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Proxy-Based Approximation of Shapley and Banzhaf Interactions

基于代理的Shapley和Banzhaf交互近似

Santo M. A. R. Thies, Hubert Baniecki, R. Teal Witter, Eyke Hüllermeier, Maximilian Muschalik, Fabian Fumagalli

Affiliations LMU Munich; MCML; DFKI; Centre for Credible AI, Warsaw University of Technology; University of Warsaw; Claremont McKenna College; Bielefeld University

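AI summary This paper studies how to estimate Shapley and Banzhaf interaction values efficiently and accurately, in order to explain interactions among features in machine learning models. It proposes ProxySHAP, which combines the sample efficiency of tree-based proxy models with a residual correction that yields consistency. On the theory side, it derives a polynomial-time generalization of interventional TreeSHAP that computes exact interaction indices for tree ensembles, and it characterizes when the residual adjustment corrects proxy bias without exponential variance growth. In benchmarks, ProxySHAP achieves the lowest error in both small- and large-budget regimes, including applications with thousands of features.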

详情
AI中文摘要

Shapley和Banzhaf交互捕捉了现代机器学习应用中固有的复杂动态。然而,当前对这些高阶交互的估计器在速度和准确性之间进行权衡。为了克服这一限制,我们引入了ProxySHAP。ProxySHAP将基于树的代理模型的高样本效率与通过残差校正实现一致性的原则路径相结合。在理论层面,我们推导了干预TreeSHAP的多项式时间推广,以计算树集成的精确交互指数,成功避免了先前方法中的指数树深度依赖。此外,我们正式分析了残差调整策略,刻画了最大样本重用(MSR)在特定条件下校正代理偏差而不使其方差随交互规模指数增长的条件。广泛的基准测试表明,ProxySHAP在近似质量上树立了新的最先进标准,包括在具有数千个特征的大规模应用中。通过在小预算和大预算场景下均实现最低误差,ProxySHAP显著优于先前最佳估计器ProxySPEX和KernelSHAP-IQ,同时在可解释性下游任务上也提供了卓越性能。

英文摘要

Shapley and Banzhaf interactions capture the complex dynamics inherent in modern machine learning applications. However, current estimators for these higher-order interactions trade off between speed and accuracy. To overcome this limitation, we introduce ProxySHAP. ProxySHAP reconciles the high sample efficiency of tree-based proxy models with a principled path to consistency via residual correction. On a theoretical level, we derive a polynomial-time generalization of interventional TreeSHAP to compute exact interaction indices for tree ensembles, successfully bypassing exponential tree-depth dependencies in prior methods. Furthermore, we formally analyze the residual adjustment strategy, characterizing the specific conditions under which Maximum Sample Reuse (MSR) corrects proxy bias without its variance scaling exponentially with interaction size. Extensive benchmarking demonstrates that ProxySHAP sets a new state-of-the-art standard for approximation quality, including in large-scale applications with thousands of features. By achieving the lowest error in both small- and large-budget regimes, ProxySHAP significantly outperforms the prior best estimators ProxySPEX and KernelSHAP-IQ, while also delivering superior performance on downstream explainability tasks.
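For readers unfamiliar with the quantities being approximated, the sketch below computes exact Shapley values, Banzhaf values, and the pairwise Shapley interaction index by brute-force enumeration on a three-player game. It is a reference definition, not ProxySHAP; the exponential enumeration is exactly the cost that estimators try to avoid, and `toy_game` is a made-up value function.

```python
from itertools import combinations
from math import factorial

def shapley_values(n, value):
    """Weighted average marginal contribution of each player."""
    phi = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            weight = factorial(size) * factorial(n - size - 1) / factorial(n)
            for subset in combinations(others, size):
                s = frozenset(subset)
                phi[i] += weight * (value(s | {i}) - value(s))
    return phi

def banzhaf_values(n, value):
    """Unweighted average marginal contribution over all coalitions of the others."""
    beta = [0.0] * n
    for i in range(n):
        others = [j for j in range(n) if j != i]
        for size in range(n):
            for subset in combinations(others, size):
                s = frozenset(subset)
                beta[i] += (value(s | {i}) - value(s)) / 2 ** (n - 1)
    return beta

def shapley_interaction(n, value, i, j):
    """Pairwise Shapley interaction index for players i and j."""
    others = [p for p in range(n) if p not in (i, j)]
    total = 0.0
    for size in range(n - 1):
        weight = factorial(size) * factorial(n - size - 2) / factorial(n - 1)
        for subset in combinations(others, size):
            s = frozenset(subset)
            total += weight * (value(s | {i, j}) - value(s | {i})
                               - value(s | {j}) + value(s))
    return total

def toy_game(coalition):
    """Additive worths 1, 2, 3 plus a bonus of 4 when players 0 and 1 cooperate."""
    worth = sum((1.0, 2.0, 3.0)[p] for p in coalition)
    if {0, 1} <= coalition:
        worth += 4.0
    return worth

print(shapley_values(3, toy_game))
print(banzhaf_values(3, toy_game))
print(shapley_interaction(3, toy_game, 0, 1), shapley_interaction(3, toy_game, 0, 2))
```

In this game the bonus is split evenly between players 0 and 1 in the individual values, while the interaction index attributes the whole bonus to the pair, which is the extra information that higher-order indices carry.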

2605.21813 2026-05-25 cs.LG stat.ME stat.ML 版本更新

Symbolic Density Estimation for Discrete Distributions

离散分布的符号密度估计

Ziwen Liu, Meng Li

Affiliations Rice University

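AI summary This paper proposes symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions for discrete distributions. It composes elementary analytic operations within a structured search space, combining domain-specific structural priors, evolutionary search, and a validity-aware inference stage, and extends to richer families such as zero-inflated distributions and finite mixtures. The authors also contribute a benchmark dataset covering commonly used discrete distributions, on which the algorithm recovers all benchmark families with accurate parameter estimates.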

Comments 28 pages, 5 figures, 22 tables

详情
AI中文摘要

离散概率法则支撑着统计建模,然而可解释分布的目录通过几个世纪以来逐案数学推导仅逐渐扩展。我们引入了符号密度估计(SDE),这是一个无监督框架,通过在结构化搜索空间内组合基本解析操作自动恢复闭式概率质量函数。我们的方法将领域特定的结构先验与进化搜索和有效性感知推理阶段相结合,并扩展到更丰富的分布族,如零膨胀和有限混合。为了支持系统评估和未来研究,我们贡献了一个涵盖广泛常用离散分布的基准数据集。所提出的算法恢复了所有基准分布族,并给出了准确的参数估计。一个真实数据应用表明,它识别出简洁且可解释的混合模型,这些模型在拟合优度上优于标准模型。

英文摘要

Discrete probability laws underpin statistical modeling, yet the catalog of interpretable distributions has expanded only gradually through centuries of case-by-case mathematical derivations. We introduce symbolic density estimation (SDE), an unsupervised framework that automatically recovers closed-form probability mass functions by composing elementary analytic operations within a structured search space. Our method integrates domain-specific structural priors with evolutionary search and a validity-aware inference stage, and it extends to richer distribution families such as zero inflation and finite mixtures. To support systematic evaluation and future research, we contribute a benchmark dataset spanning a broad collection of commonly used discrete distributions. The proposed algorithm recovers all benchmark families with accurate parameter estimates. A real data application shows that it identifies concise and interpretable mixture models that improve goodness-of-fit over standard models.

2605.13930 2026-05-25 cs.LG cs.HC cs.NE 版本更新

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

基于稀疏自编码器的脑电图基础模型机制可解释性

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen

Affiliations BrainCapture; DTU Health Tech; DTU Compute; Department of Biomedical Data Science; Department of Computer Science; Seer Medical; Filadelfia Epilepsy Hospital; University Hospital of Copenhagen

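AI summary This study aims to make EEG foundation models more interpretable. It applies TopK sparse autoencoders (SAEs) to three architecturally distinct EEG transformers to extract sparse feature dictionaries, and grounds the features in a clinical taxonomy (abnormality, age, sex, and medication) to benchmark monosemanticity and entanglement. A single hyperparameter procedure transfers across all three architectures, and a "target vs. off-target" probe area metric reveals three operational regimes of concept steering. The study also exposes critical failure cases under concept interventions and uses a spectral decoder to map latent manipulations to physiologically interpretable frequency signatures.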

Comments Preprint. 14 pages, 7 figures, 4 tables

详情
AI中文摘要

脑电图基础模型在临床性能上达到了最先进水平,但其驱动预测的内部计算仍然不透明,这是临床信任的障碍。我们将TopK稀疏自编码器应用于三种架构不同的EEG Transformer:SleepFM、REVE和LaBraM,从其嵌入中提取稀疏特征字典。通过将这些特征基于临床分类法(异常、年龄、性别和用药)进行 grounding,我们跨架构基准测试了单语义性和纠缠性。一个由内在字典健康审计驱动的单一超参数过程,在所有三种架构上鲁棒地迁移。通过概念引导,我们引入了一个“目标 vs. 非目标”探测区域度量来量化引导选择性,并揭示了三种操作模式:可选择性引导、编码但纠缠、以及未编码。该框架暴露了关键的表征失败:“破坏球”干预会破坏全局模型性能,以及临床纠缠,例如年龄-病理混淆,其中不可能在不破坏另一个概念的情况下抑制一个概念。最后,一个频谱解码器将这些干预映射回幅度谱,将潜在操作转化为生理上可解释的频率特征,例如病理性慢波抑制和α频带恢复。

英文摘要

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers (SleepFM, REVE, and LaBraM) to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.
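A TopK sparse autoencoder keeps only the k largest latent pre-activations for each input and reconstructs the embedding from those. The forward pass below is a minimal sketch with hand-picked weights; the ReLU applied to the kept activations, the absence of a pre-encoder bias, and all names and values are assumptions, since the abstract does not specify the exact SAE variant.

```python
def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Return (sparse latent code, reconstruction) for one embedding vector x."""
    # Encoder pre-activations, one per dictionary feature.
    pre = [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(W_enc, b_enc)]
    # Indices of the k largest pre-activations; everything else is zeroed.
    top = sorted(range(len(pre)), key=lambda i: pre[i], reverse=True)[:k]
    z = [0.0] * len(pre)
    for i in top:
        z[i] = max(pre[i], 0.0)  # assumed: ReLU on the kept activations
    # Decoder: weighted sum of the dictionary directions of the active features.
    x_hat = [b_dec[j] + sum(z[i] * W_dec[i][j] for i in top) for j in range(len(x))]
    return z, x_hat

# Hypothetical 2-dimensional embedding and a 3-feature dictionary.
W_enc = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
W_dec = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
z, x_hat = topk_sae_forward([1.0, 0.0], W_enc, [0.0, 0.0, 0.0], W_dec, [0.0, 0.0], k=2)
print(z, x_hat)
```

Concept steering in this setting amounts to editing entries of `z` before decoding; the entanglement the abstract reports corresponds to a single dictionary feature carrying information about more than one clinical concept.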

2603.23565 2026-05-25 cs.LG cs.AI 版本更新

Safe Reinforcement Learning with Preference-based Constraint Inference

基于偏好的约束推断的安全强化学习

Chenglin Li, Grant Ruan, Hua Geng

Affiliations Department of Automation, Tsinghua University, Beijing, China; Laboratory for Information & Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA

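AI summary This paper studies how to learn complex safety constraints cheaply and reliably from human preferences in safe reinforcement learning. To avoid the expert demonstrations or restrictive assumptions required by existing methods, it proposes Preference-based Constrained Reinforcement Learning (PbCRL), which adds a dead zone mechanism to preference modeling and a Signal-to-Noise Ratio (SNR) loss to better capture heavy-tailed safety costs and to aid policy learning. Experiments show better alignment with the true safety requirements and better safety and reward than state-of-the-art baselines.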

Comments Accepted by the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

安全强化学习(RL)是安全关键决策的标准范式。然而,现实世界中的安全约束可能复杂、主观,甚至难以明确指定。现有的约束推断工作依赖于限制性假设或大量的专家演示,这在许多实际应用中并不现实。如何廉价且可靠地学习这些约束是我们本研究关注的主要挑战。虽然从人类偏好中推断约束提供了一种数据高效的替代方案,但我们发现流行的Bradley-Terry(BT)模型未能捕捉安全成本的非对称、重尾特性,导致风险低估。在文献中,理解BT模型对下游策略学习的影响仍然很少。为了解决上述知识空白,我们提出了一种新颖的方法,即基于偏好的约束强化学习(PbCRL)。我们在偏好建模中引入了一种新颖的死区机制,并从理论上证明它鼓励重尾成本分布,从而实现更好的约束对齐。此外,我们引入了信噪比(SNR)损失,通过成本方差鼓励探索,这被发现有利于策略学习。进一步,采用两阶段训练策略以降低在线标注负担,同时自适应地增强约束满足。实验结果表明,PbCRL实现了与真实安全要求的优越对齐,并在安全性和奖励方面优于最先进的基线。我们的工作为安全RL中的约束推断探索了一种有前景且有效的方法,在各种安全关键应用中具有巨大潜力。

英文摘要

Safe reinforcement learning (RL) is a standard paradigm for safety-critical decision making. However, real-world safety constraints can be complex, subjective, and even hard to specify explicitly. Existing work on constraint inference relies on restrictive assumptions or extensive expert demonstrations, which are unrealistic in many real-world applications. How to learn these constraints cheaply and reliably is the main challenge we address in this study. While inferring constraints from human preferences offers a data-efficient alternative, we find that popular Bradley-Terry (BT) models fail to capture the asymmetric, heavy-tailed nature of safety costs, resulting in risk underestimation. The impact of BT models on downstream policy learning also remains little studied. To address these gaps, we propose a novel approach, Preference-based Constrained Reinforcement Learning (PbCRL). We introduce a novel dead zone mechanism into preference modeling and theoretically prove that it encourages heavy-tailed cost distributions, thereby achieving better constraint alignment. Additionally, we incorporate a Signal-to-Noise Ratio (SNR) loss to encourage exploration by cost variances, which is found to benefit policy learning. Further, a two-stage training strategy is deployed to lower online labeling burdens while adaptively enhancing constraint satisfaction. Empirical results demonstrate that PbCRL achieves superior alignment with true safety requirements and outperforms state-of-the-art baselines in terms of safety and reward. Our work explores a promising and effective way for constraint inference in Safe RL, with great potential in various safety-critical applications.

2603.07615 2026-05-25 cs.LG cs.CV 版本更新

Compression as Adaptation: Implicit Visual Representation with Diffusion Foundation Models

压缩即适应:基于扩散基础模型的隐式视觉表示

Zongyu Guo, Jiajun He, Zhaoyang Jia, Xiaoyi Zhang, Jiahao Li, Xiao Li, Bin Li, José Miguel Hernández-Lobato, Yan Lu

Affiliations * Microsoft Research Asia; University of Cambridge

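AI summary This paper proposes a new representation framework that encodes a visual signal as a function, implicitly representing visual content through low-rank adaptation parameters attached to a frozen visual generative model. The method can compress a signal such as an 81-frame video into a single compact vector, achieving high-quality perceptual video compression at extremely low bitrates. The functional representation also supports inference-time scaling and control, which further improves compression performance, and it offers a unified framework for visual compression and generation.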

Comments ICML 2026

详情
AI中文摘要

现代视觉生成模型通过大规模训练获得丰富的视觉知识,但现有的视觉表示(如像素、潜变量或标记)仍独立于模型,无法直接利用这些知识进行紧凑存储或重用。在这项工作中,我们引入了一种新的视觉表示框架,将信号编码为一个函数,该函数通过附加在冻结的视觉生成模型上的低秩适应参数进行参数化。这种视觉信号的隐式表示,例如一个81帧的视频,可以进一步哈希成一个紧凑的向量,在极低比特率下实现强感知视频压缩。除了基本压缩外,这种表示的函数性质使得推理时缩放和控制成为可能,从而在压缩性能上实现额外优化。更广泛地说,由于隐式表示直接作为生成过程的函数,这提出了一个统一视觉压缩与生成的框架。

英文摘要

Modern visual generative models acquire rich visual knowledge through large-scale training, yet existing visual representations (such as pixels, latents, or tokens) remain external to the model and cannot directly exploit this knowledge for compact storage or reuse. In this work, we introduce a new visual representation framework that encodes a signal as a function, which is parametrized by low-rank adaptations attached to a frozen visual generative model. Such implicit representations of visual signals, e.g., an 81-frame video, can further be hashed into a single compact vector, achieving strong perceptual video compression at extremely low bitrates. Beyond basic compression, the functional nature of this representation enables inference-time scaling and control, allowing additional refinement on the compression performance. More broadly, as the implicit representations directly act as a function of the generation process, this suggests a unified framework bridging visual compression and generation.

2602.15602 2026-05-25 cs.LG stat.ML 版本更新

Certified Per-Instance Unlearning Using Individual Sensitivity Bounds

使用个体灵敏度界限的认证逐实例遗忘

Hanna Benarroch, Jamal Atif, Olivier Cappé

Affiliations * DI ENS, École normale supérieure, Université PSL, CNRS; CMAP, École polytechnique, Institut Polytechnique de Paris

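AI summary This paper studies how to achieve certified per-instance unlearning using individual sensitivity bounds. Unlike conventional noise injection calibrated to worst-case sensitivity, the authors propose calibrating noise adaptively to each data point's contribution, which reduces the amount of injected noise and improves model performance. Experiments on ridge regression and in deep learning settings support the approach, showing that it substantially reduces noise while preserving the unlearning certificate.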

详情
AI中文摘要

认证的机器遗忘可以通过注入噪声实现,从而提供差分隐私保证,其中噪声根据最坏情况灵敏度进行校准。这种保守的校准通常会导致性能下降,限制了实际适用性。在这项工作中,我们研究了一种基于自适应逐实例噪声校准的替代方法,该校准针对每个数据点对学习解的个体贡献进行定制。这引发了以下挑战:当机制依赖于要移除的特定点时,如何建立正式的遗忘保证?为了定义噪声梯度动力学中的个体数据点灵敏度,我们考虑使用逐实例差分隐私。对于通过朗之万动力学训练的岭回归,我们推导出高概率的逐实例灵敏度界限,从而在注入显著更少噪声的情况下实现认证遗忘。我们通过线性设置中的实验证实了我们的理论发现,并提供了进一步的经验证据,表明该方法在深度学习设置中的相关性。

英文摘要

Certified machine unlearning can be achieved via noise injection leading to differential privacy guarantees, where noise is calibrated to worst-case sensitivity. Such conservative calibration often results in performance degradation, limiting practical applicability. In this work, we investigate an alternative approach based on adaptive per-instance noise calibration tailored to the individual contribution of each data point to the learned solution. This raises the following challenge: how can one establish formal unlearning guarantees when the mechanism depends on the specific point to be removed? To define individual data point sensitivities in noisy gradient dynamics, we consider the use of per-instance differential privacy. For ridge regression trained via Langevin dynamics, we derive high-probability per-instance sensitivity bounds, yielding certified unlearning with substantially less noise injection. We corroborate our theoretical findings through experiments in linear settings and provide further empirical evidence on the relevance of the approach in deep learning settings.
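
To make the link between sensitivity and injected noise concrete, the sketch below uses the classical Gaussian-mechanism calibration $\sigma = \Delta\sqrt{2\ln(1.25/\delta)}/\varepsilon$ (valid for $0 < \varepsilon < 1$). This is a generic differential-privacy formula, not the Langevin-based bound derived in the paper, and the sensitivity values are hypothetical.

```python
import math


def gaussian_noise_std(sensitivity, epsilon, delta):
    """Classical Gaussian-mechanism noise scale (valid for 0 < epsilon < 1)."""
    return sensitivity * math.sqrt(2.0 * math.log(1.25 / delta)) / epsilon


# Hypothetical sensitivities: a worst-case bound vs. a tighter bound for one point.
worst_case = gaussian_noise_std(sensitivity=1.0, epsilon=0.5, delta=1e-5)
per_instance = gaussian_noise_std(sensitivity=0.1, epsilon=0.5, delta=1e-5)
ratio = per_instance / worst_case
```

Because the noise scale is linear in the sensitivity, a point whose individual sensitivity is ten times smaller than the worst case needs ten times less noise at the same $(\varepsilon, \delta)$.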

2602.12534 2026-05-25 stat.ML cs.DS cs.LG math.ST stat.TH 版本更新

Linear Regression with Unknown Truncation Beyond Gaussian Features

未知截断下的线性回归:超越高斯特征

Alexandros Kouridakis, Anay Mehrotra, Alkis Kalavasis, Constantine Caramanis

Affiliations * UT Austin; Stanford University; Yale University

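AI summary This paper studies how to efficiently estimate the unknown regression parameters in truncated linear regression when the survival set of the response variable is unknown. Unlike prior work that relies on a known survival set or strong assumptions such as Gaussianity, it proposes an algorithm that only requires sub-Gaussian feature vectors and runs in polynomial time, a substantial gain in computational efficiency. At its core is a new subroutine that efficiently learns unions of a bounded number of intervals from positive examples alone under a smoothness condition, which is of independent theoretical interest.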

详情
AI中文摘要

在截断线性回归中,只有当结果 $y$ 落在某个生存集 $S^\star$ 内时,样本 $(x,y)$ 才被观测到,目标是估计未知的 $d$ 维回归系数 $w^\star$。该问题在统计学和机器学习中有着悠久的研究历史,可追溯到 (Galton, 1897; Tobin, 1958) 的工作,以及近期如 (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024) 的研究。然而,尽管历史久远,大多数先前工作仅限于 $S^\star$ 精确已知的特殊情况。更实际相关的情况——$S^\star$ 未知且需从数据中学习——仍然开放:实际上,目前可用的算法要么要求特征向量分布有强假设(如高斯性),即使如此,达到 $\varepsilon$ 精度的运行时间也为 $d^{\mathrm{poly} (1/\varepsilon)}$。在本工作中,我们给出了首个针对未知生存集的截断线性回归算法,运行时间为 $\mathrm{poly} (d/\varepsilon)$,仅要求特征向量是次高斯的。我们的算法依赖于一个新颖的子程序,该子程序在某种平滑条件下,利用正例(无负例)高效学习有界数量区间的并集。该学习保证补充了正例仅 PAC 学习的研究路线,并可能具有独立意义。

英文摘要

In truncated linear regression, samples $(x,y)$ are shown only when the outcome $y$ falls inside a certain survival set $S^\star$ and the goal is to estimate the unknown $d$-dimensional regressor $w^\star$. This problem has a long history of study in Statistics and Machine Learning going back to the works of (Galton, 1897; Tobin, 1958) and more recently in, e.g., (Daskalakis et al., 2019; 2021; Lee et al., 2023; 2024). Despite this long history, however, most prior works are limited to the special case where $S^\star$ is precisely known. The more practically relevant case, where $S^\star$ is unknown and must be learned from data, remains open: indeed, here the only available algorithms require strong assumptions on the distribution of the feature vectors (e.g., Gaussianity) and, even then, have a $d^{\mathrm{poly} (1/\varepsilon)}$ run time for achieving $\varepsilon$ accuracy. In this work, we give the first algorithm for truncated linear regression with unknown survival set that runs in $\mathrm{poly} (d/\varepsilon)$ time, by only requiring that the feature vectors are sub-Gaussian. Our algorithm relies on a novel subroutine for efficiently learning unions of a bounded number of intervals using access to positive examples (without any negative examples) under a certain smoothness condition. This learning guarantee adds to the line of works on positive-only PAC learning and may be of independent interest.
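
The following simulation is a minimal sketch of why truncation matters: it fits ordinary least squares on all samples and on only those whose outcome falls in a survival set. The one-dimensional setup, the choice $S^\star = (0, \infty)$, and the coefficient value are assumptions made for illustration; this is not the paper's algorithm.

```python
import random


def ols_slope(points):
    """Least-squares slope (with intercept) for a list of (x, y) pairs."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in points)
    var = sum((x - mean_x) ** 2 for x, _ in points)
    return cov / var


rng = random.Random(0)
w_star = 1.0  # hypothetical true coefficient
full = []
for _ in range(20000):
    x = rng.gauss(0.0, 1.0)
    full.append((x, w_star * x + rng.gauss(0.0, 1.0)))

# Hypothetical survival set S* = (0, inf): only samples with y > 0 are observed.
observed = [(x, y) for x, y in full if y > 0.0]

slope_full = ols_slope(full)            # close to w_star
slope_truncated = ols_slope(observed)   # noticeably smaller than w_star
```

The naive fit on the observed samples underestimates the coefficient, which is why truncated regression needs estimators that account for the survival set, whether it is known or learned.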

2602.04431 2026-05-25 cs.LG cs.GT 版本更新

MaMa: A Game-Theoretic Approach for Designing Safe Agentic Systems

MaMa: 一种基于博弈论的安全智能体系统设计方法

Jonathan Nöther, Adish Singla, Goran Radanovic

Affiliations * Max Planck Institute for Software Systems (MPI-SWS)

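AI summary This paper studies the safe design of LLM-based multi-agent systems when some agents fail or behave adversarially. Inspired by Stackelberg security games, the authors propose a new algorithm, MaMa, which uses a game between a Meta-Adversary and a Meta-Agent to automatically design agentic systems that remain safe in the worst case. Experiments show that the resulting systems defend against worst-case attacks and generalize across different attack objectives and underlying LLMs.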

详情
AI中文摘要

基于LLM的多智能体系统展现了令人印象深刻的能力,但当单个智能体失败或表现出对抗行为时,也会引入显著的安全风险。在这项工作中,我们研究了即使部分智能体被攻破时仍能保持安全的智能体系统的自动设计。受Stackelberg安全博弈启发,我们将此问题形式化为系统设计者(元智能体)与一个最佳响应的元对手之间的博弈,该对手选择并攻破一部分智能体以最小化安全性。我们提出了MaMa(元对手-元智能体),一种受此形式化启发的新算法,用于自动设计安全的智能体系统。我们的方法使用基于LLM的对抗搜索,其中元智能体迭代地提出系统设计,并根据元对手发现的最强攻击接收反馈。跨不同环境的实证评估表明,使用MaMa设计的系统能够持续防御最坏情况下的攻击,同时保持与仅优化任务成功率的系统相当的性能。此外,所得系统能够泛化到更强的对手,以及具有不同攻击目标或底层LLM的对手,展示了超越训练设置的鲁棒安全性。

英文摘要

LLM-based multi-agent systems have demonstrated impressive capabilities, but they also introduce significant safety risks when individual agents fail or behave adversarially. In this work, we study the automated design of agentic systems that remain safe even when a subset of agents is compromised. Inspired by Stackelberg security games, we formalize this problem as a game between a system designer (the Meta-Agent) and a best-responding Meta-Adversary that selects and compromises a subset of agents to minimize safety. We propose Meta-Adversary-Meta-Agent (MaMa), a novel algorithm inspired by this formalization for automatically designing safe agentic systems. Our approach uses LLM-based adversarial search, where the Meta-Agent iteratively proposes system designs and receives feedback based on the strongest attacks discovered by the Meta-Adversary. Empirical evaluations across diverse environments show that systems designed with MaMa consistently defend against worst-case attacks while maintaining performance comparable to systems optimized solely for task success. Moreover, the resulting systems generalize to stronger adversaries, as well as ones with different attack objectives or underlying LLMs, demonstrating robust safety beyond the training setting.
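
The leader-follower structure described above can be sketched on a toy payoff table: the designer commits to a design, the adversary best-responds with the attack that minimizes safety, and the designer therefore picks the design with the best worst case. The design names, attack names, and scores below are hypothetical, and plain enumeration stands in for the LLM-based adversarial search that MaMa actually uses.

```python
def best_design(safety):
    """safety[design][attack] -> safety score.

    Returns the design maximizing worst-case safety and that worst-case value.
    """
    # The adversary best-responds to each design with its most damaging attack.
    worst = {d: min(attacks.values()) for d, attacks in safety.items()}
    choice = max(worst, key=worst.get)
    return choice, worst[choice]


# Hypothetical scores for two candidate designs when agent A or agent B is compromised.
safety = {
    "single_reviewer": {"compromise_A": 0.9, "compromise_B": 0.2},
    "cross_checked": {"compromise_A": 0.7, "compromise_B": 0.6},
}
design, value = best_design(safety)
```

Here the design with the higher best-case score loses, because the adversary gets to choose which agent to compromise after seeing the design.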

2601.21513 2026-05-25 cs.LG 版本更新

Cascaded Transfer: Learning Many Tasks under Budget Constraints

级联迁移:在预算约束下学习多任务

Eloi Campagne, Yvenn Amara-Ouali, Yannig Goude, Mathilde Mougeot, Argyris Kalogeratos

Affiliations * Centre Borelli, CNRS, ENS Paris-Saclay, Université Paris-Saclay; Laboratoire de Mathématiques d’Orsay, CNRS, Université Paris-Saclay; EDF R&D; ENSIIE, Évry-Courcouronnes

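AI summary In distributed applications such as substation-level electricity demand forecasting or federated learning, separate models must be trained for a large number of related tasks whose relationships are unknown. This paper proposes a new Cascaded Transfer Learning (CTL) paradigm that builds a tree rooted at a source task and passes model parameters from task to task along it, subject to a global training budget. The spanning tree is built by minimizing an objective that combines pairwise task distances with the budget constraint, yielding a geometry-aware, depth-bounded transfer graph, and the paper analyzes theoretically how transfer errors accumulate and attenuate along cascade paths. Experiments show that CTL achieves more accurate and more cost-effective adaptation than existing methods across many task collections, especially when the budget is tight.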

详情
AI中文摘要

在分布式应用中,如变电站级能源需求预测或联邦学习,大量相关任务必须由不同模型学习,而确切的任务关系未知。我们提出了新颖的级联迁移学习(CTL)范式,其中模型参数通过组织为有根树的任务层级级联,并遵守全局训练预算。从源任务开始,树指定了任务学习和细化的顺序,预算沿其分支分配。我们设计了基于生成树的级联机制,通过最小化结合成对任务距离和可用训练预算的目标来连接所有任务,从而产生几何感知和深度有界的迁移图。我们从理论上刻画了迁移误差如何沿级联路径累积和衰减:任何上游节点引入的误差都会被每个下游细化收缩,而平衡的树拓扑限制了这种累积。在合成和真实多任务场景、时间序列预测和图像分类上的实验表明,CTL能够在大量任务集合中实现比替代方法更准确和成本效益更高的适应,且在预算最紧张时增益最大。

英文摘要

In distributed applications, such as energy demand forecasting at the substation level or federated learning, a large number of related tasks must be learned by different models, while the exact task relationships are unknown. We propose the novel Cascaded Transfer Learning (CTL) paradigm in which model parameters cascade hierarchically through tasks organized as a rooted tree, respecting a global training budget. Starting from a source task, the tree specifies the order in which tasks are learned and refined, with the budget allocated along its branches. We design cascade mechanisms based on spanning trees that connect all tasks by minimizing an objective combining pairwise task distances and the available training budget, which yield geometry-aware and depth-bounded transfer graphs. We theoretically characterize how transfer errors accumulate and attenuate along cascade paths: errors introduced at any upstream node are contracted by every downstream refinement, and balanced tree topologies bound this accumulation. Experiments on synthetic and real many-task settings, time-series forecasting and image classification, show that CTL enables more accurate and cost-effective adaptation across large task collections than alternative approaches, with the largest gains at the tightest budgets.
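
One simple way to picture a spanning-tree cascade is a minimum spanning tree over pairwise task distances, rooted at the source task, where each task is initialized from its parent. The sketch below uses Prim's algorithm on a hypothetical distance matrix; it considers distances only, whereas the paper's objective also accounts for the training budget and bounds the tree depth.

```python
def cascade_tree(dist, root=0):
    """Prim's algorithm: return parent[i] for a minimum spanning tree rooted at `root`."""
    n = len(dist)
    parent = {root: None}
    while len(parent) < n:
        # Pick the cheapest edge from a task already in the tree to one outside it.
        _, u, v = min(
            (dist[u][v], u, v)
            for u in parent
            for v in range(n)
            if v not in parent
        )
        parent[v] = u
    return parent


# Hypothetical symmetric distances between four tasks; task 0 is the source.
dist = [
    [0, 1, 4, 6],
    [1, 0, 2, 5],
    [4, 2, 0, 3],
    [6, 5, 3, 0],
]
tree = cascade_tree(dist, root=0)
```

With these distances the result is a chain 0 → 1 → 2 → 3, the kind of deep path along which upstream errors can accumulate and which a depth bound is meant to limit.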

2601.03715 2026-05-25 cs.LG cs.AI 版本更新

R$^3$L: Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification

R$^3$L: 反思-重试强化学习与语言引导探索、关键信用和正向放大

Weijie Shi, Yanxi Chen, Zexi Li, Xuchen Pan, Yuchang Sun, Jiajie Xu, Xiaofang Zhou, Yaliang Li

Affiliations * Tongyi Lab; Soochow University; Hong Kong University of Science and Technology

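AI summary R$^3$L is a reinforcement learning method that combines language-guided exploration, pivotal credit assignment, and positive amplification to address the exploration and exploitation difficulties that arise when training the reasoning and agentic capabilities of large language models. It synthesizes high-quality trajectories through a reflect-then-retry mechanism, uses language feedback to locate errors and repair failed attempts, updates only the diverging trajectory suffix to sharpen credit assignment, and upweights successful trajectories to stabilize training. Experiments show significant gains over baselines on multiple tasks while maintaining training stability.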

详情
AI中文摘要

强化学习推动了LLM推理和智能体能力的最新进展,但当前方法在探索和利用方面均存在困难。探索方面,困难任务成功率低且从头开始重复rollout成本高;利用方面,粗粒度的信用分配和训练不稳定:轨迹级奖励因后续错误惩罚有效前缀,且失败主导的群体淹没少数正向信号,使优化缺乏建设性方向。为此,我们提出R$^3$L,即反思-重试强化学习与语言引导探索、关键信用和正向放大。为合成高质量轨迹,R$^3$L通过反思-重试从随机采样转向主动合成,利用语言反馈诊断错误,将失败尝试转化为成功尝试,并通过从识别出的失败点重启来降低rollout成本。在错误被诊断和定位后,关键信用分配仅更新存在对比信号的分叉后缀,排除共享前缀的梯度更新。由于困难任务中失败占主导且反思-重试产生离策略数据,可能导致训练不稳定,正向放大提高成功轨迹的权重,确保正向信号引导优化过程。在智能体和推理任务上的实验表明,与基线相比,相对提升5%到52%,同时保持训练稳定性。我们的代码已发布在https://github.com/shiweijiezero/R3L。

英文摘要

Reinforcement learning drives recent advances in LLM reasoning and agentic capabilities, yet current approaches struggle with both exploration and exploitation. Exploration suffers from low success rates on difficult tasks and high costs of repeated rollouts from scratch. Exploitation suffers from coarse credit assignment and training instability: trajectory-level rewards penalize valid prefixes for later errors, and failure-dominated groups overwhelm the few positive signals, leaving optimization without constructive direction. To this end, we propose R$^3$L, Reflect-then-Retry Reinforcement Learning with Language-Guided Exploration, Pivotal Credit, and Positive Amplification. To synthesize high-quality trajectories, R$^3$L shifts from stochastic sampling to active synthesis via reflect-then-retry, leveraging language feedback to diagnose errors, transform failed attempts into successful ones, and reduce rollout costs by restarting from identified failure points. With errors diagnosed and localized, Pivotal Credit Assignment updates only the diverging suffix where contrastive signals exist, excluding the shared prefix from gradient update. Since failures dominate on difficult tasks and reflect-then-retry produces off-policy data, risking training instability, Positive Amplification upweights successful trajectories to ensure positive signals guide the optimization process. Experiments on agentic and reasoning tasks demonstrate 5% to 52% relative improvements over baselines while maintaining training stability. Our code is released at https://github.com/shiweijiezero/R3L.
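
The idea of updating only the diverging suffix can be sketched as a per-token mask: find the first position where a failed trajectory and its retry differ, then mark only the retry's tokens from that point on as contributing to the loss. The function name `diverging_suffix_mask` and the toy step sequences are assumptions; the released R3L code may implement this differently.

```python
def diverging_suffix_mask(failed, retried):
    """Return (pivot, mask) for the retried trajectory.

    pivot is the length of the prefix shared with the failed trajectory;
    mask[i] is 1 for positions at or after the pivot and 0 for the shared prefix.
    """
    pivot = 0
    for a, b in zip(failed, retried):
        if a != b:
            break
        pivot += 1
    mask = [0] * pivot + [1] * (len(retried) - pivot)
    return pivot, mask


# Hypothetical step sequences: the retry restarts after the second step.
failed = ["open", "read", "sum", "wrong"]
retried = ["open", "read", "sort", "sum", "right"]
pivot, mask = diverging_suffix_mask(failed, retried)
```

Multiplying per-token losses by such a mask leaves the shared prefix untouched, so valid early steps are neither rewarded nor penalized for what happened later.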

2512.15767 2026-05-25 cs.LG cs.AI 版本更新

Bridging Data and Physics: A Graph Neural Network-Based Hybrid Twin Framework

连接数据与物理:基于图神经网络的混合孪生框架

M. Gorpinich, B. Moya, S. Rodriguez, F. Meraghni, Y. Jaafra, A. Briot, M. Henner, R. Leon, F. Chinesta

Affiliations * Valeo; PIMM Lab., ENSAM Institute of Technology

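AI summary This study proposes a graph neural network-based hybrid twin framework to address the "ignorance model", the gap in physical simulations caused by model simplifications or unmodeled effects. By combining a physics-based model with a data-driven approach, the method uses a graph neural network to learn the missing physics from sparse spatial measurements, improving simulation accuracy and interpretability while reducing data requirements. Experiments on nonlinear heat transfer problems with different meshes, geometries, and load positions show good generalization and effective correction.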

Comments 27 pages, 14 figures

详情
AI中文摘要

模拟复杂的非定常物理现象依赖于详细的数学模型,例如通过有限元方法(FEM)进行仿真。然而,由于未建模效应或简化假设,这些模型通常与实际情况存在差异。我们将这种差距称为无知模型。纯数据驱动的方法试图学习整个系统的行为,但需要跨越整个空间和时间域的大量高质量数据。在现实场景中,此类信息不可用,使得完全数据驱动的建模不可靠。为了克服这一限制,我们采用混合孪生方法对无知分量进行建模,而不是从头模拟现象。由于基于物理的模型近似了现象的整体行为,剩余的无知通常比完整的物理响应复杂度低,因此可以用更少的数据进行学习。然而,一个关键困难是空间测量是稀疏的,并且在实际中获取不同空间配置下同一现象的数据具有挑战性。我们的贡献是通过使用图神经网络(GNN)来表示无知模型来克服这一限制。即使测量位置数量有限,GNN也能学习缺失物理的空间模式。这使得我们能够用数据驱动的修正来丰富基于物理的模型,而无需密集的空间、时间和参数数据。为了展示所提出方法的性能,我们在不同网格、几何形状和载荷位置的非线性热传导问题上评估了这种基于GNN的混合孪生方法。结果表明,GNN成功捕获了无知并泛化了跨空间配置的修正,提高了仿真精度和可解释性,同时最小化了数据需求。

英文摘要

Simulating complex unsteady physical phenomena relies on detailed mathematical models, simulated, for instance, using the Finite Element Method (FEM). However, these models often deviate from reality due to unmodeled effects or simplifying assumptions. We refer to this gap as the ignorance model. While purely data-driven approaches attempt to learn full system behavior, they require large amounts of high-quality data across the entire spatial and temporal domain. In real-world scenarios, such information is unavailable, making fully data-driven modeling unreliable. To overcome this limitation, we model the ignorance component using a hybrid twin approach instead of simulating phenomena from scratch. Since physics-based models approximate the overall behavior of the phenomena, the remaining ignorance is typically lower in complexity than the full physical response and can therefore be learned with significantly less data. A key difficulty, however, is that spatial measurements are sparse, and obtaining data that measure the same phenomenon for different spatial configurations is challenging in practice. Our contribution is to overcome this limitation by using Graph Neural Networks (GNNs) to represent the ignorance model. GNNs learn the spatial pattern of the missing physics even when the number of measurement locations is limited. This allows us to enrich the physics-based model with data-driven corrections without requiring dense spatial, temporal and parametric data. To showcase the performance of the proposed method, we evaluate this GNN-based hybrid twin on nonlinear heat transfer problems across different meshes, geometries, and load positions. Results show that the GNN successfully captures the ignorance and generalizes corrections across spatial configurations, improving simulation accuracy and interpretability, while minimizing data requirements.

2512.07078 2026-05-25 cs.CV cs.LG 版本更新

DFIR-DETR: Frequency-Domain Iterative Refinement and Dynamic Feature Aggregation for Small Object Detection

DFIR-DETR:面向小目标检测的频域迭代细化与动态特征聚合

Bo Gao, Jingcheng Tong, Xingsheng Chen, Han Yu, Zichen Li

Affiliations * School of Information Engineering, Beijing Institute of Graphic Communication; School of Computing and Data Science, The University of Hong Kong; College of Computing and Data Science, Nanyang Technological University

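AI summary This paper targets core challenges of small object detection in complex scenes and proposes DFIR-DETR, which uses frequency-domain iterative refinement and dynamic feature aggregation to address shortcomings of existing networks in attention allocation, feature upsampling, and high-frequency information preservation. The method achieves significant gains on the NEU-DET and VisDrone datasets at low computational cost, validating its effectiveness across different detection tasks.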

详情
AI中文摘要

复杂场景中的小目标检测暴露了神经网络设计中的基本矛盾:骨干注意力分布均匀而不考虑内容,金字塔颈部在上采样过程中放大激活幅度而不进行归一化补偿,瓶颈卷积通过累积空间滤波逐步平滑高频边缘分量。为此,我们开发了DFIR-DETR,将每个提出的模块追溯到RT-DETR基线中特定的、可测量的缺陷:忽略空间复杂性的均匀注意力、破坏上采样特征稳定性的归一化漂移,以及逐步抑制小目标所依赖的高频分量的空间卷积。在NEU-DET和VisDrone上,DFIR-DETR仅以11.7M参数和47.2 GFLOPs就达到了92.9%和51.6%的mAP50,在两个性质不同的检测领域展示了持续的性能提升。

英文摘要

Small object detection in complex scenes exposes a fundamental tension in neural network design: backbone attention distributes computation uniformly regardless of content, pyramid necks inflate activation magnitudes during upsampling without norm compensation, and bottleneck convolutions progressively smooth high-frequency edge components through accumulated spatial filtering. In response, we develop DFIR-DETR by tracing each proposed module back to a specific, measurable deficiency in the RT-DETR baseline: uniform attention that ignores spatial complexity, norm drift that destabilises upsampled features, and spatial convolutions that progressively suppress the high-frequency components small objects depend on. On NEU-DET and VisDrone, DFIR-DETR achieves 92.9% and 51.6% mAP50 with only 11.7M parameters and 47.2 GFLOPs, demonstrating consistent gains across two qualitatively different detection domains.

2511.15503 2026-05-25 cs.AR cs.DC cs.LG cs.PF 版本更新

DCC: Data-Centric Compilation of Machine Learning Kernels for Processing-In-Memory Architectures

DCC: 面向处理-内存架构的机器学习内核数据驱动编译

Peiming Yang, Sankeerth Durvasula, Ivan Fernandez, Mohammad Sadrosadati, Onur Mutlu, Gennady Pekhimenko, Christina Giannoula

Affiliations * University of Toronto; Vector Institute; Barcelona Supercomputing Center; ETH Zürich; Nvidia; Max Planck Institute for Software Systems

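AI summary This paper presents DCC, a data-centric machine learning kernel compiler for Processing-In-Memory (PIM) architectures. It targets the performance bottleneck caused by mismatched data layouts between the host processor and PIM cores when running memory-intensive workloads such as large language models. DCC jointly optimizes data rearrangement and compute code generation, combining a multi-layer PIM abstraction with a performance prediction model to improve execution efficiency on different PIM devices. Experiments show significant speedups on a variety of machine learning kernels and in end-to-end LLM inference.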

详情
AI中文摘要

高性能主机处理器可以集成处理-内存(PIM)设备,通过利用PIM核心可用的大内存带宽,加速机器学习(ML)模型(包括大型语言模型(LLM))的内存密集型内核。然而,主机处理器需要分布在DRAM bank中的连续元素,而PIM核心需要其本地bank内的连续元素。这需要在ML内核执行中进行数据重排,带来了显著的性能和可编程性挑战,并且由于需要支持多种PIM设备而进一步加剧。当前的编译方法缺乏针对多种ML内核和多个PIM设备的系统优化,并且可能在计算代码优化步骤中很大程度上忽略数据重排成本。我们表明数据重排和计算代码优化是相互依赖的,需要在调优过程中联合优化。因此,我们设计了DCC,这是首个面向PIM系统的数据驱动ML编译器,它在统一的调优过程中联合优化数据重排和计算代码。DCC集成了多层PIM抽象以支持多个PIM后端。DCC实现了数据分区策略与计算循环分区方案的有效联合优化。DCC应用了PIM特定的代码优化,并利用快速准确的性能预测模型为目标PIM架构上的给定内核选择最佳性能的代码调度。我们在各种单个ML内核上的评估表明,与仅GPU执行相比,DCC在HBM-PIM上实现了高达7.68倍的加速(平均2.21倍),在AttAcc PIM上实现了高达13.17倍的加速(平均3.92倍)。在端到端LLM推理中,AttAcc上的DCC在GPT-3和LLaMA-2上比GPU平均加速4.52倍(LLaMA-2上最高7.71倍)。DCC已在https://github.com/SPIN-Research-Group/DCC开源。

英文摘要

High-performance Host processors can integrate Processing-In-Memory (PIM) devices, which can accelerate memory-intensive kernels of Machine Learning (ML) models, including Large Language Models (LLMs), by leveraging the large memory bandwidth available at PIM cores. However, the Host processor needs consecutive elements distributed across DRAM banks, while PIM cores need consecutive elements within their local banks. This necessitates data rearrangements in ML kernel execution that pose significant performance and programmability challenges, further exacerbated by the need to support diverse PIM devices. Current compilation approaches lack systematic optimization for diverse ML kernels and multiple PIM devices, and may largely ignore data rearrangement costs during the compute code optimization step. We show that data rearrangements and compute code optimization are interdependent and need to be jointly optimized during the tuning process. Therefore, we design DCC, the first data-centric ML compiler for PIM systems that jointly optimizes data rearrangements and compute code in a unified tuning process. DCC integrates a multi-layer PIM abstraction to support multiple PIM backends. DCC enables effective co-optimization of data partitioning strategies with compute loop partitioning schemes. DCC applies PIM-specific code optimizations and leverages a fast and accurate performance prediction model to select the best-performing code schedule for a given kernel on a target PIM architecture. Our evaluations on various individual ML kernels show that DCC achieves up to 7.68x speedup (2.21x average) on HBM-PIM, and up to 13.17x speedup (3.92x average) on AttAcc PIM, over GPU-only execution. In end-to-end LLM inference, DCC on AttAcc accelerates GPT-3 and LLaMA-2 by 4.52x on average (up to 7.71x on LLaMA-2) over GPU. DCC is open-sourced at https://github.com/SPIN-Research-Group/DCC.
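
The layout mismatch between host and PIM cores can be sketched with two index mappings: an interleaved layout that spreads consecutive elements across banks, and a blocked layout that keeps consecutive elements inside one bank. This is a deliberately simplified model (element-granularity interleaving, hypothetical function names, a divisible element count) and not DCC's actual abstraction.

```python
def host_layout(i, num_banks):
    """Interleaved: consecutive elements go to consecutive banks."""
    return i % num_banks, i // num_banks  # (bank, offset)


def pim_layout(i, num_banks, num_elems):
    """Blocked: each bank holds one contiguous chunk of elements."""
    chunk = num_elems // num_banks  # assumes num_elems is divisible by num_banks
    return i // chunk, i % chunk  # (bank, offset)


num_banks, num_elems = 4, 8
host = [host_layout(i, num_banks) for i in range(num_elems)]
pim = [pim_layout(i, num_banks, num_elems) for i in range(num_elems)]

# Elements whose (bank, offset) differs between the two layouts must be moved.
moved = sum(1 for h, p in zip(host, pim) if h != p)
```

In this toy case 6 of the 8 elements change location when switching layouts, which illustrates why rearrangement cost cannot be ignored when tuning the compute code.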

2511.03882 2026-05-25 cs.CV cs.AI cs.LG cs.RO 版本更新

Investigating Robot Control Policy Learning for Autonomous X-ray-guided Spine Procedures

自主X光引导脊柱手术的机器人控制策略学习研究

Florence Klitzner, Blanca Inigo, Benjamin D. Killeen, Lalithkumar Seenivasan, Michelle Song, Axel Krieger, Mathias Unberath

Affiliations * Johns Hopkins University; Technical University of Munich; Johns Hopkins School of Medicine

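AI summary This paper investigates imitation learning-based robot control policies for X-ray-guided spine procedures, specifically the feasibility and challenges of cannula insertion in vertebroplasty. The authors build a highly realistic simulation environment and a dataset of correct trajectories with bi-planar X-ray sequences, and use it to train imitation learning policies that rely on visual information alone. The policy succeeded on the first attempt in 68.5% of cases and maintained safe intra-pedicular trajectories across diverse vertebral levels, laying groundwork for lightweight, CT-free intra-operative robotic spinal navigation.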

详情
AI中文摘要

基于模仿学习的机器人控制策略在基于视频的机器人学中重新受到关注。然而,对于稀疏输入的X光引导手术(如脊柱内固定),这种方法是否适用尚不清楚。我们研究了在双平面引导的套管针插入中模仿策略学习的可行性、机遇和挑战。我们开发了一个用于可扩展、自动化模拟X光引导脊柱手术的计算机沙盒,具有高度逼真性。我们整理了一个包含正确轨迹和相应双平面X光序列的数据集,模拟了提供者的逐步对齐过程。然后,我们训练了用于规划和开环控制的模仿学习策略,该策略仅基于视觉信息在椎体成形术环境中迭代对齐套管针。这种精确控制的设置提供了对该方法局限性和能力的见解。我们的策略在68.5%的案例中首次尝试成功,在不同椎体水平上保持了安全的椎弓根内轨迹。该策略迁移到了复杂解剖结构(包括骨折)以及不同的解剖结构和初始位置。在真实X光上的展开表明,具有合理轨迹的部分仿真到真实迁移是可能的。尽管这些初步结果令人鼓舞,但我们还发现了局限性,特别是在入口点精度方面。当前的结果为未来的努力提供了明确的基准,而借助更稳健的先验和领域知识,此类模型可能为未来实现轻量级、无CT的机器人术中脊柱导航奠定基础。

英文摘要

Imitation learning-based robot control policies are enjoying renewed interest in video-based robotics. However, it remains unclear whether this approach applies to X-ray-guided procedures, such as spine instrumentation, with sparse inputs. We examine the feasibility, opportunities and challenges for imitation policy learning in bi-plane-guided cannula insertion. We develop an in silico sandbox for scalable, automated simulation of X-ray-guided spine procedures with a high degree of realism. We curate a dataset of correct trajectories and corresponding bi-planar X-ray sequences that emulate the stepwise alignment of providers. We then train imitation learning policies for planning and open-loop control that iteratively align a cannula in a vertebroplasty setting solely based on visual information. This precisely controlled setup offers insights into limitations and capabilities of this method. Our policy succeeded on the first attempt in 68.5% of cases, maintaining safe intra-pedicular trajectories across diverse vertebral levels. The policy transferred to complex anatomy, including fractures, as well as varied anatomies and initializations. Rollouts on real X-ray indicate that partial sim-to-real transfer with plausible trajectories is possible. While these preliminary results are promising, we also identify limitations, especially in entry point precision. The current results present a clear benchmark for future efforts, while with more robust priors and domain knowledge, such models may provide a foundation for future efforts toward lightweight and CT-free robotic intra-operative spinal navigation.

2509.06896 2026-05-25 cs.LG stat.ML 版本更新

Are Targeted Data Poisoning Attacks as Effective as We Think?

定向数据投毒攻击是否如我们想象中那么有效?

William Xu, Chenyu Zhang, Yihan Wang, Matthew Y. R. Yang, Zuoqiu Liu, Gautam Kamath, Yaoliang Yu, Yiwei Lu

Affiliations * Waabi AI; University of Waterloo; Carnegie Mellon University; Google; Vector Institute; University of Ottawa

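AI summary This paper examines how effective targeted data poisoning attacks are in practice and points out that existing evaluations, which use randomly selected targets, do not reflect worst-case attack effectiveness. The authors argue that evaluation should focus on the hardest samples to poison and, using only clean-model information, propose a way to identify both the easiest and the hardest samples to poison, enabling stricter worst-case evaluation and proactive defense.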

详情
AI中文摘要

定向数据投毒攻击通过向训练数据中注入恶意样本来操纵模型对特定测试样本的预测。然而,现有评估通常报告随机选择目标上的平均攻击成功率,掩盖了真实的最坏情况效果。我们认为正确的评估应聚焦于最难投毒的样本。同样的推理适用于防御:由于定向攻击在分布层面不留下痕迹,防御者应主动识别最脆弱的样本并应用定向对策。给定一个测试数据集,本文仅基于清洁模型信息识别最容易和最难投毒的样本。具体而言,我们利用清洁训练动态提供粗粒度评估,并利用投毒距离和预算对投毒类别进行细粒度分类。实验表明,这些指标能够可靠地按投毒脆弱性对样本分层,从而实现严格的最坏情况评估和主动的脆弱性感知防御。

英文摘要

Targeted data poisoning attacks manipulate model predictions on specific test samples by injecting malicious data into training. Yet existing evaluations report average attack success rates over randomly selected targets, obscuring true worst-case effectiveness. We argue that the right evaluation focuses on the hardest samples to poison. The same reasoning applies to defense: since targeted attacks leave no footprint at the distribution level, defenders should proactively identify the most vulnerable samples and apply targeted countermeasures. Given a test dataset, this paper identifies both the easiest- and hardest-to-poison examples using only clean-model information. Specifically, we offer coarse evaluations using clean training dynamics, and fine-grained classification on poison class using poison distances and budgets. Our experiments show these metrics reliably stratify samples by poisoning vulnerability, enabling both rigorous worst-case evaluation and proactive vulnerability-aware defense.

2508.13663 2026-05-25 cs.AI cs.LG 版本更新

Interactive Query Answering on Knowledge Graphs with Soft Entity Constraints

具有软实体约束的知识图谱交互式查询回答

Daniel Daza, Alberto Bernardi, Luca Costabello, Christophe Gueret, Masoud Mansoury, Michael Cochez, Martijn Schut

Affiliations * Translational AI Laboratory, Department of Laboratory Medicine; Amsterdam University Medical Center, Vrije Universiteit Amsterdam; Accenture Labs; Delft University of Technology; ELLIS Institute Finland & Åbo Akademi University, Turku, Finland & Elsevier Discovery Lab, Amsterdam

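AI summary This paper studies interactive query answering on knowledge graphs with soft entity constraints, aiming to handle real-world queries whose constraints are vague or context-dependent. The authors propose two efficient methods that learn soft constraints either by tuning a small number of parameters or by training a small neural network, improving the relevance of results without disrupting the original answer ranking. Experiments show that the methods incorporate user preferences while preserving query answering performance, offering a more flexible way to interact with knowledge graphs.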

Comments Accepted in Transactions on Machine Learning Research (2026)

详情
AI中文摘要

针对不完整知识图谱的查询回答方法检索可能成为答案的实体,这在由于缺失边而无法通过直接图遍历达到此类答案时特别有用。然而,现有方法侧重于使用一阶逻辑形式化的查询。在实践中,许多现实世界的查询涉及固有模糊或上下文依赖的约束,例如对属性或相关类别的偏好。针对这一差距,我们引入了具有软约束的查询回答问题。我们形式化了该问题,并提出了两种高效方法,旨在通过融入软约束来调整查询答案分数,同时不破坏查询的原始答案。这些方法是轻量级的,只需调整两个参数或训练一个小型神经网络来捕获软约束,同时保持原始排序结构。为了评估该任务,我们通过生成带有软约束的数据集来扩展现有的QA基准。我们的实验表明,我们的方法能够捕获软约束,同时保持稳健的查询回答性能,并增加很少的开销。通过我们的工作,我们探索了一种与图数据库交互的新颖灵活方式,允许用户通过交互式提供示例来指定其偏好。

英文摘要

Methods for query answering over incomplete knowledge graphs retrieve entities that are likely to be answers, which is particularly useful when such answers cannot be reached by direct graph traversal due to missing edges. However, existing approaches have focused on queries formalized using first-order logic. In practice, many real-world queries involve constraints that are inherently vague or context-dependent, such as preferences for attributes or related categories. Addressing this gap, we introduce the problem of query answering with soft constraints. We formalize the problem and introduce two efficient methods designed to adjust query answer scores by incorporating soft constraints without disrupting the original answers to a query. These methods are lightweight, requiring tuning only two parameters or a small neural network trained to capture soft constraints while maintaining the original ranking structure. To evaluate the task, we extend existing QA benchmarks by generating datasets with soft constraints. Our experiments demonstrate that our methods can capture soft constraints while maintaining robust query answering performance and adding very little overhead. With our work, we explore a new and flexible way to interact with graph databases that allows users to specify their preferences by providing examples interactively.

2508.12247 2026-05-25 cs.LG cs.AI 版本更新

STM3: Mixture of Multiscale Mamba for Long-Term Spatio-Temporal Time-Series Prediction

STM3: 多尺度曼巴混合模型用于长期时空时间序列预测

Haolong Chen, Liang Zhang, Zhengyuan Xin, Guangxu Zhu

Affiliations * Shenzhen Loop Area Institute

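AI summary This paper proposes STM3, a new deep learning model for long-term spatio-temporal time-series prediction that addresses the difficulty of extracting multiscale information and modeling spatial dependencies. STM3 combines a Multiscale Mamba architecture with a Disentangled Mixture-of-Experts (DMoE) framework and adds an adaptive graph causal network to capture complex spatio-temporal dependencies efficiently. A stable routing strategy and a causal contrastive learning strategy make representation learning robust and keep the scales distinguishable; experiments on multiple real-world datasets show superior prediction performance.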

Comments Accepted by KDD 2026

详情
AI中文摘要

近年来,时空时间序列预测发展迅速,但现有深度学习方法难以高效学习复杂的长期时空依赖。长期时空依赖学习带来两个新挑战:1)长期时间序列自然包含多尺度信息,难以高效提取;2)不同节点的多尺度时间信息高度相关且难以建模。为解决这些问题,我们提出时空多尺度曼巴混合模型(STM3)。STM3在新型分离式混合专家(DMoE)框架内集成多尺度曼巴架构,以高效捕获多样的多尺度信息,同时利用自适应图因果网络建模复杂的空间依赖。为确保鲁棒的表示学习,我们引入稳定路由策略和因果对比学习策略,与层次信息聚合协同工作,保证尺度可区分性。我们理论上证明STM3实现了优越的路由平滑性,并保证了每个专家的模式分离。在跨领域的10个真实世界基准上的大量实验表明,STM3具有优越性能,在长期时空时间序列预测中达到了最先进的结果。值得注意的是,在PEMSD8数据集上,它取得了显著改进,在MAE、RMSE和MAPE上分别超过第二好的模型7.1%、8.5%和15.9%。代码可在https://github.com/IfReasonable/STM3_KDD26获取。

英文摘要

Recently, spatio-temporal time-series prediction has developed rapidly, yet existing deep learning methods struggle with learning complex long-term spatio-temporal dependencies efficiently. The long-term spatio-temporal dependency learning brings two new challenges: 1) The long-term temporal sequence naturally includes multiscale information, which is hard to extract efficiently; 2) The multiscale temporal information from different nodes is highly correlated and hard to model. To address these challenges, we propose Spatio-Temporal Mixture of Multiscale Mamba (STM3). STM3 integrates a Multiscale Mamba architecture within a novel Disentangled Mixture-of-Experts (DMoE) framework to capture diverse multiscale information efficiently, while utilizing an adaptive graph causal network to model complex spatial dependencies. To ensure robust representation learning, we introduce a stable routing strategy and a causal contrastive learning strategy, which work in tandem with hierarchical information aggregation to guarantee scale distinguishability. We theoretically prove that STM3 achieves superior routing smoothness and guarantees pattern disentanglement for each expert. Extensive experiments on 10 real-world benchmarks across domains demonstrate STM3's superior performance, achieving state-of-the-art results in long-term spatio-temporal time-series prediction. Notably, on the PEMSD8 dataset, it achieves significant improvements, surpassing the second-best model by 7.1% in MAE, 8.5% in RMSE, and 15.9% in MAPE. Code is available at https://github.com/IfReasonable/STM3_KDD26.

2506.20537 2026-05-25 cs.LG 版本更新

Physics-Informed Machine Learning Regulated by Finite Element Analysis for Simulation Acceleration of Melt Pool Dynamics in Laser Powder Bed Fusion

基于有限元分析调控的物理信息机器学习用于激光粉末床熔融熔池动力学模拟加速

R. Sharma, Y. B. Guo

发表机构 * Dept. of Mechanical and Aerospace Engineering, Rutgers University-New Brunswick(罗格斯大学新布朗斯维克分校机械与航空航天工程系) New Jersey Advanced Manufacturing Institute, Rutgers University-New Brunswick(罗格斯大学新布朗斯维克分校新泽西先进制造研究所)

AI总结 该研究针对激光粉末床熔融(LPBF)过程中熔池动态模拟计算成本高的问题,提出了一种结合有限元分析(FEA)的物理信息神经网络(FEA-PINN)框架,以提高模拟效率并保持精度。该方法通过引入动态相变捕捉策略和物理一致性校正机制,有效解决了传统物理信息神经网络在时间依赖问题中精度下降的问题。实验表明,FEA-PINN在保证与有限元分析相当精度的同时,显著降低了计算成本。

Comments Further investigation revealed that the current version reflects an incomplete formulation and limited validation of the proposed method. We have since developed a substantially revised and extended study with updated assumptions and results, and therefore withdraw this version to prevent citation of superseded findings

详情
AI中文摘要

高效模拟激光粉末床熔融(LPBF)对于工艺预测至关重要,因为传统数值方法(如有限元分析,FEA)存在计算成本高昂的持久问题。虽然物理信息神经网络(PINN)可以用少量训练数据预测解场,并通过迁移学习实现新工艺参数的泛化,但由于残差累积以及难以捕捉LPBF过程中固有的陡峭空间和时间梯度,它在时间相关问题中精度下降。为克服这一问题,本研究开发了一个高效的建模框架——有限元分析调控的物理信息神经网络(FEA-PINN),以加速LPBF过程中熔池动力学现象的预测,同时保持FEA的精度。FEA-PINN的创新体现在两个方面。首先,在PINN模型内部开发了一种新策略来捕捉粉末-液体-固体的动态相变,从而能够跟踪激光熔化过程中的材料状态。该模型进一步纳入了温度相关的材料属性、粉末床的相变行为、马兰戈尼对流以及熔池内的自然对流。其次,FEA-PINN框架在推理过程中集成了校正性的FEA模拟,以强制执行物理一致性、减少误差漂移并捕捉陡峭梯度。对比分析表明,FEA-PINN在显著降低计算成本的同时,达到了与FEA相当的精度。该框架已针对LPBF中单道扫描的基准FEA数据进行了验证。

英文摘要

Efficient simulation of Laser Powder Bed Fusion (LPBF) is crucial for process prediction due to the lasting issue of high computational cost associated with traditional numerical methods such as finite element analysis (FEA). While a Physics-Informed Neural Network (PINN) can predict solution fields with small training data and enables the generalization of new process parameters via transfer learning, it suffers from accuracy degradation in time-dependent problems due to the accumulation of residual and the difficulty in capturing the steep spatial and temporal gradients inherent in the LPBF process. To overcome this issue, this study develops an efficient modeling framework, FEA-Regulated Physics-Informed Neural Network (FEA-PINN), to accelerate the prediction of melt pool dynamics phenomena in an LPBF process while maintaining the FEA accuracy. The innovation of FEA-PINN manifested itself in two aspects. First, a novel strategy has been developed within the PINN model to capture the dynamic phase change of powder-liquid-solid, enabling the tracking of material status during laser melting. The model further incorporates temperature-dependent material properties, phase change behavior of the powder bed, Marangoni convection, and natural convection within the melt pool. Second, the FEA-PINN framework integrates corrective FEA simulations during inference to enforce physical consistency, reduce error drift, and capture the steep gradients. A comparative analysis shows that FEA-PINN achieves accuracy comparable to FEA while significantly reducing computational cost. The framework has been validated against benchmark FEA data for single-track scanning in LPBF.

2506.05438 2026-05-25 cs.LG cs.AI 版本更新

An Unsupervised Framework for Dynamic Health Indicator Construction and Its Application in Rolling Bearing Prognostics

一种用于动态健康指标构建的无监督框架及其在滚动轴承预测中的应用

Tongda Sun, Chen Yin, Huailiang Zheng, Yining Dong

发表机构 * School of Data Science(数据科学学院) Hong Kong Institute for Data Science, City University of Hong Kong, Hong Kong(香港城市大学香港数据科学研究院,香港) College of Mechanical and Electrical Engineering, Harbin Engineering University, Harbin 150001, China(哈尔滨工程大学机电工程学院,哈尔滨150001,中国)

AI总结 本文提出了一种无需专家知识的无监督框架,用于构建动态健康指标(HI),以提升滚动轴承退化趋势建模与剩余寿命预测的准确性。该方法通过基于跳跃连接的自编码器自动提取退化特征,并在特征空间中引入嵌入内部预测模块的HI生成模块,显式建模HI状态的时序依赖关系,从而捕捉退化过程中的动态信息。实验结果表明,所提出的动态HI在两个轴承生命周期数据集上优于现有方法,显著提升了预测性能。

详情
AI中文摘要

健康指标(HI)在滚动轴承的退化评估和预测中起着关键作用。尽管已有多种HI构建方法被研究,但大多数依赖于专家知识进行特征提取,并忽略了捕捉序列退化过程中隐藏的动态信息,这限制了所构建HI在退化趋势表示和预测中的能力。为解决这些问题,通过一种无监督框架构建了考虑HI级时间依赖性的新型动态HI。具体而言,由基于跳跃连接的自编码器组成的退化特征学习模块首先将原始信号映射到代表性退化特征空间(DFS),以自动提取必要的退化特征,无需专家知识。随后,在该DFS中,提出了一种嵌入内部HI预测模块的新型HI生成模块用于动态HI构建,其中过去和当前HI状态之间的时间依赖性被保证并显式建模。在此基础上,动态HI捕捉了退化过程固有的动态内容,确保其在退化趋势建模和未来退化预测中的有效性。在两个轴承生命周期数据集上的实验结果表明,所提出的HI构建方法优于对比方法,且构建的动态HI在预测任务中表现更优。

英文摘要

Health indicator (HI) plays a key role in degradation assessment and prognostics of rolling bearings. Although various HI construction methods have been investigated, most of them rely on expert knowledge for feature extraction and overlook capturing dynamic information hidden in sequential degradation processes, which limits the ability of the constructed HI for degradation trend representation and prognostics. To address these concerns, a novel dynamic HI that considers HI-level temporal dependence is constructed through an unsupervised framework. Specifically, a degradation feature learning module composed of a skip-connection-based autoencoder first maps raw signals to a representative degradation feature space (DFS) to automatically extract essential degradation features without the need for expert knowledge. Subsequently, in this DFS, a new HI-generating module embedded with an inner HI-prediction block is proposed for dynamic HI construction, where the temporal dependence between past and current HI states is guaranteed and modeled explicitly. On this basis, the dynamic HI captures the inherent dynamic contents of the degradation process, ensuring its effectiveness for degradation tendency modeling and future degradation prognostics. The experiment results on two bearing lifecycle datasets demonstrate that the proposed HI construction method outperforms comparison methods, and the constructed dynamic HI is superior for prognostic tasks.

2412.19098 2026-05-25 cs.LG 版本更新

SyMerge: From Non-Interference to Synergistic Merging via Single-Layer Adaptation

SyMerge:从无干扰到协同合并的单层自适应方法

Aecheon Jung, Seunghwan Lee, Dongyoon Han, Sungeun Hong

发表机构 * Sungkyunkwan University(成均馆大学) NAVER AI Lab(NAVER AI实验室)

AI总结 SyMerge 是一种轻量级的模型合并框架,旨在通过单层适配实现任务间的协同效应,而非仅仅避免任务干扰。该方法通过联合优化合并系数和一个任务特定层,引入专家引导的自标注目标,提升了合并效果的稳定性与性能。研究证明,SyMerge 能够成功合并不同初始化训练的模型,在多个视觉、密集预测和自然语言处理基准上取得了最先进的结果。

Comments Accepted at ICML 2026

详情
AI中文摘要

模型合并将独立训练的模型组合成一个多任务模型。然而,大多数现有方法主要关注避免任务干扰。我们认为其更大的潜力在于实现任务协同,即任务之间主动相互改进。我们识别出跨任务性能,由不同任务之间的编码器和预测器的兼容性定义,作为合并质量的关键指标。我们证明仅适应单个任务特定层就足以诱导这种协同。本研究提出SyMerge,一个轻量级框架,联合优化合并系数和单个任务特定层。我们采用专家引导的自标签目标,提供超越熵最小化的稳定监督。有趣的是,我们进一步表明SyMerge成功合并了从不同初始化训练的模型,而标准方法在此情况下失效。我们极简但有原则的方法在视觉、密集预测和NLP基准上达到了最先进的结果。我们的代码可在https://aim-skku.github.io/SyMerge获取。

英文摘要

Model merging combines independently trained models into a single multi-task model. However, most existing approaches focus primarily on avoiding task interference. We argue that its greater potential lies in enabling task synergy, where tasks actively improve one another. We identify cross-task performance, defined by compatibility between encoders and predictors across tasks, as a key indicator of merge quality. We demonstrate that adapting only a single task-specific layer is sufficient to induce such synergy. This study proposes SyMerge, a lightweight framework that jointly optimizes merging coefficients and a single task-specific layer. We adopt an expert-guided self-labeling objective, providing stable supervision beyond entropy minimization. Intriguingly, we further show that SyMerge successfully merges models trained from different initializations, a regime where standard methods break down. Our minimalist yet principled method achieves state-of-the-art results across vision, dense prediction, and NLP benchmarks. Our code is available at https://aim-skku.github.io/SyMerge

2406.02883 2026-05-25 cs.LG cs.CR 版本更新

Nonlinear Transformations Against Unlearnable Datasets

针对不可学习数据集的非线性变换

Thushari Hapuarachchi, Jing Lin, Kaiqi Xiong, Mohamed Rahouti, Gitte Ost

发表机构 * University of South Florida(南佛罗里达大学) Fordham University(福特汉姆大学)

AI总结 本文研究了如何通过非线性变换,使深度学习模型能够从传统上被认为“不可学习”的数据集中学习。作者提出了一种有效的非线性变换框架,并通过大量实验表明,深度神经网络能够从由十二种数据保护方法生成的不可学习数据中有效学习,效果优于近期提出的线性可分技术。在不可学习的CIFAR10数据集上,相对线性可分技术的改进幅度为0.34%至249.59%(单像素捷径方法除外)。结果表明,现有保护方法不足以防止数据被未经授权使用,亟需更强大的防护机制。

详情
AI中文摘要

自动化爬取是深度学习模型中未经数据所有者授权收集数据的常见方法。近期研究开始解决这种数据收集方法带来的隐私问题。显著的方法包括Deepconfuse、误差最小化、误差最大化(也称为对抗性投毒)、神经正切泛化攻击、合成、自回归、单像素捷径、自集成保护、纠缠特征、鲁棒误差最小化、虚伪和TensorClog。这些方法生成的数据称为“不可学习”样本,阻止深度学习模型“学习”。在本研究中,我们调查并设计了一个有效的非线性变换框架,并进行大量实验,证明深度神经网络能够有效从上述十二种方法产生的传统上被认为不可学习的数据/样本中学习。与研究人员最近提出的线性可分技术相比,所提出的方法提高了破解不可学习数据的能力。具体来说,我们的大量实验表明,对于这些十二种数据保护方法生成的不可学习CIFAR10数据集(除单像素捷径外),改进范围为0.34%至249.59%。此外,与线性可分技术相比,所提出的框架在自回归和REM方法上实现了超过100%的测试准确率提升。我们的发现表明,这些方法不足以防止机器学习模型中数据的未经授权使用。迫切需要开发更强大的保护机制,有效阻止攻击者在未经所有者适当授权的情况下访问数据。

英文摘要

Automated scraping stands out as a common method for collecting data in deep learning models without the authorization of data owners. Recent studies have begun to tackle the privacy concerns associated with this data collection method. Notable approaches include Deepconfuse, error-minimizing, error-maximizing (also known as adversarial poisoning), Neural Tangent Generalization Attack, synthetic, autoregressive, One-Pixel Shortcut, Self-Ensemble Protection, Entangled Features, Robust Error-Minimizing, Hypocritical, and TensorClog. The data generated by those approaches, called "unlearnable" examples, are meant to prevent deep learning models from learning from them. In this research, we investigate and devise an effective nonlinear transformation framework and conduct extensive experiments to demonstrate that a deep neural network can effectively learn from the data/examples traditionally considered unlearnable produced by the above twelve approaches. The resulting approach improves the ability to break unlearnable data compared to the linear separable technique recently proposed by researchers. Specifically, our extensive experiments show that the improvement ranges from 0.34% to 249.59% for the unlearnable CIFAR10 datasets generated by those twelve data protection approaches, except for One-Pixel Shortcut. Moreover, the proposed framework achieves over 100% improvement of test accuracy for Autoregressive and REM approaches compared to the linear separable technique. Our findings suggest that these approaches are inadequate in preventing unauthorized uses of data in machine learning models. There is an urgent need to develop more robust protection mechanisms that effectively thwart an attacker from accessing data without proper authorization from the owners.

2605.23673 2026-05-25 cs.LG 版本更新

Relevant Walk Search for Explaining Graph Neural Networks

用于解释图神经网络的相关游走搜索

Ping Xiong, Thomas Schnake, Michael Gastegger, Grégoire Montavon, Klaus-Robert Müller, Shinichi Nakajima

发表机构 * BIFOLD -- Berlin Institute for the Foundations of Learning and Data(柏林学习与数据基础研究所) Google Research, Brain team, Berlin(谷歌研究院Brain团队,柏林) Department of Artificial Intelligence, Korea University, Seoul 136-713, Korea(高丽大学人工智能系,首尔136-713,韩国) RIKEN Center for AIP, Japan(日本理化学研究所革新智能综合研究中心)

AI总结 本文研究了图神经网络(GNN)的可解释性问题,提出了一种高效寻找关键路径(walk)的方法,用于揭示网络中的重要信息流动。针对现有基于层间相关性传播(GNN-LRP)方法计算复杂度高、难以应用于大规模网络的问题,作者设计了多项式时间算法,能够在保证解释精度的同时大幅提升计算效率。实验表明,该方法在多个实际应用领域中表现良好,具有广泛的应用价值。

Comments Published in ICML 2023

详情
Journal ref
Proceedings of the 40th International Conference on Machine Learning, PMLR 202:38301-38324, 2023
AI中文摘要

图神经网络(GNN)已成为图分析的重要机器学习工具,其可解释性对于安全性、公平性和鲁棒性至关重要。GNN的逐层相关性传播(GNN-LRP)评估游走的相关性以揭示网络中的重要信息流,并提供高阶解释,已被证明优于低阶(即节点/边级)解释。然而,通过GNN-LRP识别相关游走需要相对于网络深度的指数级计算复杂度,本文将对这一问题进行改进。具体来说,我们提出了多项式时间算法来寻找前K个相关游走,这大大减少了计算量,从而提高了GNN-LRP在大规模问题上的适用性。我们提出的算法基于最大积算法(一种在概率图模型中寻找最大似然配置的常用工具),并且可以在神经元级别精确地找到最相关的游走,在节点级别近似地找到。我们的实验展示了我们的算法在规模上的性能及其在应用领域(即流行病学、分子和自然语言基准)中的实用性。我们在 https://github.com/xiong-ping/rel_walk_gnnlrp 上提供代码。

英文摘要

Graph Neural Networks (GNNs) have become important machine learning tools for graph analysis, and their explainability is crucial for safety, fairness, and robustness. Layer-wise relevance propagation for GNNs (GNN-LRP) evaluates the relevance of walks to reveal important information flows in the network, and provides higher-order explanations, which have been shown to be superior to the lower-order, i.e., node-/edge-level, explanations. However, identifying relevant walks by GNN-LRP requires exponential computational complexity with respect to the network depth, which we will remedy in this paper. Specifically, we propose polynomial-time algorithms for finding top-K relevant walks, which drastically reduce the computation and thus increase the applicability of GNN-LRP to large-scale problems. Our proposed algorithms are based on the max-product algorithm -- a common tool for finding the maximum likelihood configurations in probabilistic graphical models -- and can find the most relevant walks exactly at the neuron level and approximately at the node level. Our experiments demonstrate the performance of our algorithms at scale and their utility across application domains, i.e., on epidemiology, molecular, and natural language benchmarks. We provide our code at https://github.com/xiong-ping/rel_walk_gnnlrp.
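The max-product idea mentioned in the abstract can be illustrated with a small dynamic program. This is only a sketch under a simplifying assumption: the relevance of a walk is taken to be a product of non-negative per-layer scores `layers[l][i][j]`, which is not the GNN-LRP scoring itself, and it returns a single best walk rather than the top-K. The names `best_walk` and `brute_force` are hypothetical.

```python
from itertools import product


def best_walk(layers):
    """Return (score, walk) maximizing the product of per-layer scores.

    layers[l][i][j] is a hypothetical non-negative score for stepping from
    node i to node j at layer l. Cost is O(L * n^2) rather than O(n^(L+1)).
    """
    n = len(layers[0])
    # best[j] = (best score of any partial walk ending at node j, that walk)
    best = [(1.0, [j]) for j in range(n)]
    for layer in layers:
        new_best = []
        for j in range(n):
            # Max-product step: keep only the best predecessor of node j.
            score, walk = max(
                ((best[i][0] * layer[i][j], best[i][1]) for i in range(n)),
                key=lambda t: t[0],
            )
            new_best.append((score, walk + [j]))
        best = new_best
    return max(best, key=lambda t: t[0])


def brute_force(layers):
    """Exhaustive search over all n^(L+1) walks, for checking only."""
    n = len(layers[0])
    best_score, best_path = -1.0, None
    for walk in product(range(n), repeat=len(layers) + 1):
        score = 1.0
        for depth, layer in enumerate(layers):
            score *= layer[walk[depth]][walk[depth + 1]]
        if score > best_score:
            best_score, best_path = score, list(walk)
    return best_score, best_path
```

The dynamic program visits each layer once, which is where the polynomial cost in the network depth comes from; the exhaustive version is included only to check the result on tiny inputs.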

2605.23663 2026-05-25 cs.HC cs.LG 版本更新

Detecting Drunk Driving Using Off-the-Shelf Smartwatches

使用现成智能手表检测酒驾

Robin Deuber, Lanlan Yang, Michal Bechny, Christoph Heck, Matthias Pfäffli, Matthias Bantle, Florian von Wangenheim, Elgar Fleisch, Wolfgang Weinmann, Manuel Günther, Felix Wortmann, Varun Mishra

发表机构 * University of Bern(伯尔尼大学) University of St. Gallen(圣加仑大学) Northeastern University(美国东北大学)

AI总结 本文研究了如何利用市售智能手表检测酒后驾驶行为,以预防道路交通事故。研究通过分析手腕加速度计数据和心率变异性等生理信号,提出了一种基于机器学习的检测系统,并在封闭测试轨道上进行了随机对照实验。该系统使用逻辑回归和一维卷积神经网络进行训练,取得了较高的检测准确率,为基于可穿戴设备的酒驾预防提供了新的可行方案。

Comments 27 pages, 7 figures

详情
AI中文摘要

酒精影响驾驶仍然是道路交通事故和死亡的一个主要但可预防的原因,许多驾驶员低估了自己的醉酒程度。与车载系统相比,使用消费级智能手表的移动酒驾检测提供了一种可扩展的方式,无需额外车载硬件即可触发预防性干预并提高意识。我们引入了一个系统,利用手腕加速度计数据和心率变异性衍生的生理信号来检测酒精相关的驾驶障碍。我们在一个随机、对照的三组测试轨道研究(n=54)中收集数据,并训练了带有窗口聚合特征的逻辑回归模型和一个双塔一维卷积神经网络(CNN),以检测酒精影响下的驾驶。CNN在检测任何酒精中毒时实现了参与者平均受试者工作特征曲线下面积(AUROC)为0.88,在检测驾驶超过WHO推荐的0.05 g/dL限值时AUROC为0.86。据我们所知,这是第一个(1)展示使用消费级智能手表检测酒驾的工作,(2)在封闭测试轨道的真实车辆中开发和评估此类系统,以及(3)严格评估对未见参与者的泛化能力。这些发现共同凸显了基于可穿戴设备的传感在支持可扩展、测量驱动的酒精相关交通伤害预防方面的潜力。

英文摘要

Alcohol-impaired driving remains a major yet preventable cause of road traffic injury and death, with many drivers underestimating their level of intoxication. Compared to in-vehicle systems, mobile drunk-driving detection using consumer smartwatches offers a scalable way to trigger preventive interventions and increase awareness without additional in-vehicle hardware. We introduce a system that leverages wrist accelerometer data and heart rate variability-derived physiological signals to detect alcohol-related driving impairment. We collected data in a randomized, controlled three-arm test-track study (n=54) and trained both logistic regression models with window-aggregated features and a two-tower 1D convolutional neural network (CNN), to detect alcohol-impaired driving. The CNN achieved a participant-averaged area under the receiver operating characteristic (AUROC) of 0.88 for detecting any alcohol intoxication and 0.86 for detecting driving above the WHO-recommended limit of 0.05 g/dL. To the best of our knowledge, this is the first work to (1) demonstrate drunk-driving detection using consumer smartwatches, (2) develop and evaluate such a system in a real vehicle on a closed test track, and (3) rigorously assess generalization to unseen participants. Together, these findings highlight the potential of wearable-based sensing to support scalable, measurement-driven prevention of alcohol-related traffic harm.
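A participant-averaged AUROC, as reported above, can be computed by scoring each participant separately and then taking the mean. The sketch below assumes one record per time window with a binary label and a model score; the record layout and the function names are hypothetical, and the paper's exact evaluation protocol is not given in the abstract.

```python
from collections import defaultdict


def auroc(labels, scores):
    """AUROC via the Mann-Whitney statistic; ties count as one half."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one sample of each class")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def participant_averaged_auroc(records):
    """records: iterable of (participant_id, label, score), one per window."""
    by_participant = defaultdict(lambda: ([], []))
    for pid, label, score in records:
        by_participant[pid][0].append(label)
        by_participant[pid][1].append(score)
    # Score each participant on their own windows, then average the results.
    per_participant = {
        pid: auroc(labels, scores)
        for pid, (labels, scores) in by_participant.items()
    }
    mean = sum(per_participant.values()) / len(per_participant)
    return mean, per_participant
```

Averaging per participant keeps people with many windows from dominating the metric, which matters when the goal is generalization to unseen participants.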

2605.23655 2026-05-25 cs.CV cs.AI cs.LG cs.MM 版本更新

CVSearch: Empowering Multimodal LLMs with Cognitive Visual Search for High-Resolution Image Perception

CVSearch:赋予多模态大语言模型认知视觉搜索能力以感知高分辨率图像

Liupeng Li, Haoqian Kang, Zhenyu Lu, Jinpeng Wang, Bin Chen, Ke Chen, Yaowei Wang

发表机构 * Harbin Institute of Technology, Shenzhen, China(哈尔滨工业大学(深圳)) Peng Cheng Laboratory, Shenzhen, China(鹏城实验室) Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China(深圳先进技术研究院)

AI总结 高分辨率图像感知是多模态大语言模型面临的关键瓶颈。为解决视觉搜索中覆盖性与效率之间的矛盾,本文提出CVSearch,一种无需训练的自适应框架,通过“评估-搜索”流程动态调度搜索策略。该方法在全局信息不足时采用专家辅助搜索,失败时触发语义感知的扫描机制,有效减少物体碎片化,并通过动态自底向上搜索策略提升局部细节的探索效率。实验表明,CVSearch在高分辨率基准上实现了最先进的准确率和显著提升的搜索效率。

Comments Accepted by ICML 2026. 22 pages, 12 figures, 7 tables

详情
AI中文摘要

高分辨率图像感知是多模态大语言模型的一个关键瓶颈。虽然视觉搜索提供了有希望的解决方案,但现有方法在覆盖率和效率之间难以权衡。视觉专家辅助搜索效率高,但当提议失败时容易出现盲点,而基于扫描的搜索以计算冗余和语义碎片化为代价保证了覆盖率。为了解决这一困境,我们引入了CVSearch,一种无需训练的自适应框架,通过评估-搜索工作流动态调度搜索策略。具体来说,CVSearch首先在全局信息不足时调用专家辅助搜索,仅在失败时触发一种新颖的语义感知扫描机制。与刚性网格划分不同,这种高效扫描范式结合了语义引导的自适应补丁,将图像分解为语义一致的区域,有效缓解了物体碎片化。此外,我们设计了一种由视觉复杂性先验驱动的动态自底向上搜索策略,以实现对局部细节的高效且精确的迭代探索。在高分辨率基准上的大量实验表明,CVSearch在显著提高搜索效率的同时实现了最先进的准确性。代码已发布在https://github.com/liliupeng28/ICML26-CVSearch。

英文摘要

High-resolution (HR) image perception presents a key bottleneck for multimodal large language models (MLLMs). While visual search offers a promising solution, existing methods struggle with the trade-off between coverage and efficiency. Visual expert-assisted search is efficient but prone to blind spots when proposals fail, whereas scan-based search guarantees coverage at the cost of computational redundancy and semantic fragmentation. To address this dilemma, we introduce CVSearch, a training-free adaptive framework that dynamically schedules search strategies via an Assess-then-Search workflow. Specifically, CVSearch first invokes expert-assisted search when global information is insufficient, and only triggers a novel semantic-aware scanning mechanism upon failure. Distinct from rigid grid partitioning, this efficient scanning paradigm incorporates Semantic Guided Adaptive Patching to decompose images into semantically consistent regions, effectively mitigating object fragmentation. Furthermore, we devise a Dynamic Bottom-Up Search strategy driven by a Visual Complexity prior to enable efficient and precise iterative exploration of local details. Extensive experiments on HR benchmarks demonstrate that CVSearch achieves state-of-the-art accuracy while substantially improving search efficiency. Code is released at https://github.com/liliupeng28/ICML26-CVSearch.

2605.23645 2026-05-25 cs.LG cs.AI 版本更新

Learning Through Noise: Why Subliminal Learning Works and When It Fails

通过噪声学习:为什么潜意识学习有效以及何时失败

Vincent C. Brockers, Roman D. Ventzke, Valentin Neuhaus, Belén Hidalgo-Ogalde, Viola Priesemann

发表机构 * Max Planck Institute for Dynamics and Self-Organization(马克斯·普朗克动力学与自组织研究所) Faculty of Physics, Institute for the Dynamics of Complex Systems, University of Göttingen(哥廷根大学物理系,复杂系统动力学研究所)

AI总结 本文研究了人工神经网络中的“潜意识学习”现象,即通过任务无关的输入-输出对进行知识蒸馏时,学生模型从教师模型中隐式学习任务相关知识或偏差的机制。研究发现,这一过程并不依赖于教师与学生模型的初始化一致性,而是由输出头的兼容性所决定。通过控制实验,作者展示了即使在随机初始化、网络结构变化等情况下,学生模型仍能通过兼容的辅助输出头从教师模型中学习有用信息,并在特定条件下达到与教师相当的任务性能。该研究为潜意识学习提供了理论解释,并明确了其适用范围与失效条件。

详情
AI中文摘要

在人工神经网络的背景下,潜意识学习指的是通过任务无关的输入-输出对的蒸馏,将任务相关知识或意外偏差从教师模型传递到学生模型。先前的解释将这种效应归因于共享或紧密匹配的教师-学生初始化。我们表明,紧密匹配的初始化并非必要。相反,潜意识学习由兼容的输出头控制。使用受控的MNIST设置,我们将输出分为辅助头(用于辅助的、任务无关的噪声信号)和分类头(用于分类),以证明潜意识学习发生——即使我们随机初始化隐藏层并移除层、添加新层或更改架构(MLP到CNN)。兼容的辅助头能够传递可恢复的教师信号,使学生的表示更接近教师的表示。当分类头也保持兼容时,仅训练于任务无关噪声的学生可以接近,并且在有利情况下达到教师级别的任务性能。我们的设置使我们能够发展一种理论来解释潜意识学习的机制,并推导出潜意识学习失败时的上界。总之,我们的结果将潜意识学习从一种令人惊讶的迁移效应转变为具有可预测限制的理论基础机制。

英文摘要

In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input-output pairs. Prior explanations tie this effect to shared or closely matched teacher-student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.

2605.23643 2026-05-25 cs.CR cs.LG 版本更新

Less Effort, Shorter Proofs: Reinforcement Learning for Security Protocol Analysis in Tamarin

更少努力,更短证明:Tamarin中安全协议分析的强化学习

Matthias Cosler, Cas Cremers, Bernd Finkbeiner, Mohamed Ghanem, Niklas Medinger

发表机构 * CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) Technical University of Munich(慕尼黑工业大学)

AI总结 本文提出了一种基于强化学习的框架,用于辅助Tamarin工具进行安全协议的形式化验证。该方法受到AlphaZero和AlphaProof的启发,结合蒙特卡洛树搜索和神经网络启发式策略,实现了更高效、更短的协议验证过程。实验表明,该方法在多个案例研究中能够自动发现更多证明,并且生成的证明长度优于Tamarin默认搜索和人工设计的启发式方法,有效降低了验证过程中的人力投入。

详情
AI中文摘要

像Tamarin和ProVerif这样的工具在分析和验证复杂的现实世界协议(如EMV、5G和WPA2)方面取得了显著成功,甚至检测到了零日漏洞。尽管取得了这些成功,验证此类协议仍然是一项耗时、具有挑战性的任务,通常需要大量的人力和专业知识。在本文中,我们提出了一个受AlphaZero和AlphaProof启发的强化学习(RL)框架,该框架为Tamarin实现了一种新的证明搜索风格。我们为Tamarin开发了一个无状态API,充当经典的RL环境。我们通过一个从完成的子证明中学习的神经启发式来指导蒙特卡洛树搜索(MCTS)。我们在16个案例研究上评估了我们的框架,范围从经典协议模型到近期出版物中具有挑战性的最先进协议模型。我们的方法比Tamarin的标准搜索自动找到更多的证明,并且比标准和人工设计的启发式产生更短的证明。我们的流程开箱即用,可帮助Tamarin用户在活跃研究中减少所需的人力。此外,我们的标准化接口为用户提供了一种与Tamarin交互的程序化方式。最后,我们的工作展示了将基于RL的方法适应Tamarin领域的巨大潜力。

英文摘要

Tools like Tamarin and ProVerif have achieved notable success in analyzing and verifying complex real-world protocols such as EMV, 5G, and WPA2, even detecting zero-day exploits. Despite these successes, verifying such protocols remains a time-consuming, challenging task, often requiring significant human effort and expertise. In this paper, we present a reinforcement learning (RL) framework inspired by AlphaZero and AlphaProof that implements a new style of proof search for Tamarin. We have developed a stateless API for Tamarin that acts as a classical RL environment. We guide a Monte Carlo Tree Search (MCTS) by a neural heuristic that learns from completed subproofs. We evaluate our framework on 16 case studies, ranging from classical protocol models to challenging state-of-the-art protocol models from recent publications. Our method finds more proofs automatically than Tamarin's standard search and produces shorter proofs than both the standard and human-engineered heuristics. Our pipeline is applicable out of the box to assist Tamarin users in active research, reducing the human effort required. Moreover, our standardized interface provides a programmatic way for users to interact with Tamarin. Finally, our work demonstrates the promising potential of adapting RL-based methods to the Tamarin domain.

2605.23635 2026-05-25 stat.ML cs.LG 版本更新

Dirichlet-Based Monte Carlo Dropout for Uncertainty Estimation in Neural Networks

基于狄利克雷的蒙特卡洛丢弃法用于神经网络不确定性估计

Rouaa Hoblos, Noura Dridi, Noureddine Zerhouni, Zeina Al Masry

AI总结 传统神经网络无法提供预测的不确定性估计,而贝叶斯神经网络虽能进行不确定性量化,但计算复杂度较高。本文提出了一种基于狄利克雷分布的蒙特卡洛Dropout方法,在保持计算效率的同时提升了不确定性估计的质量。该方法通过将类别概率建模为狄利克雷分布,实现了更具信息量的不确定性表示,并在实验中验证了其在不确定性校准方面的有效性。

详情
Journal ref
56es Journées de Statistique de la SFdS, Jun 2025, Marseille, France
AI中文摘要

传统神经网络提供确定性预测,缺乏固有的不确定性估计。虽然贝叶斯神经网络(BNN)为不确定性量化提供了原则性方法,但其计算复杂度限制了可扩展性。蒙特卡洛(MC)Dropout最初作为正则化技术引入,已被证明通过多次随机前向传播实现概率建模,从而近似贝叶斯推断。在这项工作中,我们通过在MC Dropout中集成基于狄利克雷的框架来增强深度学习中的不确定性估计。具体来说,我们利用Sensoy等人(2018)提出的公式,其中使用狄利克雷分布对类概率进行建模,从而允许更信息化的不确定性表示。所提出的方法保持了MC Dropout的计算效率,同时提高了不确定性估计的质量。我们讨论了所提出方法的理论基础,并将其与现有的不确定性量化技术进行了比较。结果突显了所提出方法在产生良好校准的不确定性估计方面的有效性,为不确定性感知的深度学习模型提供了实用解决方案。

英文摘要

Traditional neural networks provide deterministic predictions without inherent uncertainty estimates. While Bayesian Neural Networks (BNNs) offer a principled approach to uncertainty quantification, their computational complexity limits scalability. Monte Carlo (MC) Dropout, initially introduced as a regularization technique, has been shown to approximate Bayesian inference by enabling probabilistic modeling through multiple stochastic forward passes. In this work, we enhance uncertainty estimation in deep learning by integrating a Dirichlet-based framework within MC Dropout. Specifically, we leverage the formulation proposed by Sensoy et al. (2018), where class probabilities are modeled using a Dirichlet distribution, allowing for a more informative uncertainty representation. The proposed approach maintains the computational efficiency of MC Dropout while improving the quality of uncertainty estimates. We discuss the theoretical foundations of our method and compare it with existing uncertainty quantification techniques. The results highlight the effectiveness of the proposed method in producing well-calibrated uncertainty estimates, offering a practical solution for uncertainty-aware deep learning models.
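The Dirichlet formulation of Sensoy et al. (2018) maps non-negative per-class evidence e_k to Dirichlet parameters alpha_k = e_k + 1, with expected class probabilities alpha_k / S and an uncertainty mass K / S, where S is the sum of the alpha_k. The sketch below shows that mapping and one plausible way to combine it with MC Dropout, by averaging over stochastic forward passes. The averaging step and the function names are assumptions for illustration; the abstract does not say how the authors combine the two.

```python
def dirichlet_summary(evidence):
    """Map non-negative per-class evidence to (expected probs, vacuity)."""
    alpha = [e + 1.0 for e in evidence]      # Dirichlet parameters
    strength = sum(alpha)                    # S = sum_k alpha_k
    probs = [a / strength for a in alpha]    # E[p_k] = alpha_k / S
    vacuity = len(alpha) / strength          # u = K / S, in (0, 1]
    return probs, vacuity


def mc_dropout_dirichlet(evidence_samples):
    """Average the Dirichlet summaries of T stochastic forward passes.

    evidence_samples holds one evidence vector per pass, for example the
    outputs of a network evaluated with dropout left active (assumed).
    """
    summaries = [dirichlet_summary(e) for e in evidence_samples]
    n_passes = len(summaries)
    n_classes = len(evidence_samples[0])
    mean_probs = [
        sum(probs[k] for probs, _ in summaries) / n_passes
        for k in range(n_classes)
    ]
    mean_vacuity = sum(u for _, u in summaries) / n_passes
    return mean_probs, mean_vacuity
```

With zero evidence the summary is a uniform prediction with vacuity 1, so a pass that finds no support for any class raises the averaged uncertainty instead of producing a confident softmax output.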

2605.23632 2026-05-25 cs.LG 版本更新

Valid and Expressive Copulas for Irregular Multivariate Time Series

不规则多元时间序列的有效且表达力强的Copula模型

Christian Klötergens, Tom Hanika, Lars Schmidt-Thieme, Vijaya Krishna Yalavarthi

发表机构 * Institute of Computer Science(计算机科学研究所) University of Hildesheim(希尔德斯海姆大学)

AI总结 本文提出了一种名为CopFITi的模型,用于对不规则多变量时间序列进行概率预测。该模型结合了归一化流在单变量边缘分布上的表达能力,以及高斯混合copula在联合依赖结构上的灵活性和一致性。研究首次构建了一个在边缘化上具有一致性的不规则多变量时间序列copula模型,并在联合密度建模方面取得了新的最先进(state-of-the-art)成果。

详情
AI中文摘要

我们提出了CopFITi,一种用于不规则多元时间序列(IMTS)概率预测的copula模型。该模型将单变量边缘分布的归一化流的表达力与联合依赖结构的高斯混合Copula的一致性和灵活性相结合。我们的实验表明,将边缘分布与联合分布解耦的基于copula的方法,比直接拟合完整联合分布的架构能产生更好的边缘模型。通过CopFITi,我们提出了第一个通过构造实现边缘化一致性的IMTS copula,并在联合IMTS密度建模中建立了新的最优水平。

英文摘要

We introduce CopFITi, a copula model for probabilistic forecasting of irregular multivariate time series (IMTS). Our model combines the expressivity of normalizing flows for univariate marginals with the consistency and flexibility of a Gaussian Mixture Copula for the joint dependency structure. Our experiments show that copula-based approaches, which decouple the marginals from the joint, yield better marginal models than architectures that directly fit the full joint. With CopFITi, we propose the first IMTS copula that is marginalization-consistent by construction and establish a new state of the art in joint IMTS density modeling.
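The decoupling of marginals from the joint that the abstract relies on is Sklar's theorem. The equations below are standard background, not the paper's specific model. The last line uses a single Gaussian copula with correlation matrix R as an illustrative assumption, to show what marginalization consistency means for a subset A of the variables.

```latex
\begin{align}
  % Sklar's theorem: a joint CDF factors into marginals F_k and a copula C.
  F(x_1,\dots,x_d) &= C\bigl(F_1(x_1),\dots,F_d(x_d)\bigr) \\
  % Density form, with u_k = F_k(x_k) and copula density c.
  p(x_1,\dots,x_d) &= c(u_1,\dots,u_d)\,\prod_{k=1}^{d} p_k(x_k),
  \qquad u_k = F_k(x_k) \\
  % Setting u_k = 1 for k \notin A marginalizes those variables out;
  % for a Gaussian copula the result is Gaussian with sub-matrix R_{AA}.
  C_R(u)\big|_{u_k = 1,\; k \notin A} &= C_{R_{AA}}(u_A)
\end{align}
```

The second line is why the marginals can be fitted with flexible univariate models while the copula carries only the dependence structure.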

2605.23628 2026-05-25 cs.LG 版本更新

How Hard is it to Rig a Benchmark? A Social Choice Analysis of Leaderboard Robustness

操纵基准测试有多难?排行榜鲁棒性的社会选择分析

Polina Gordienko, Georg Schollmeyer, Frauke Kreuter, Christoph Jansen

发表机构 * Department of Statistics, LMU Munich(慕尼黑大学统计系) Social Data Science Center, University of Maryland(马里兰大学社会数据科学中心) School of Computing & Communications, Lancaster University Leipzig(兰卡斯特大学莱比锡校区计算与通信学院)

AI总结 本文研究了在多任务基准测试中通过训练数据选择来操纵模型排名的难度问题,将其类比为社会选择理论中的选举操纵问题。作者将数据集视为选民、模型视为候选人,证明在Borda计数和平均胜率等评价指标下,基准特定训练问题属于NP难问题。此外,文章引入了实例级别的鲁棒性指标,用于衡量模型开发者需要包含多少数据集才能在排行榜上超越其他模型,并在多个基准测试中验证了不同指标下的鲁棒性差异,发现平均胜率最难被操纵。

详情
AI中文摘要

多任务基准测试已成为机器学习研究的核心支柱,但其日益增长的影响力也催生了基准刷榜(benchmark gaming),即为提高特定模型的排行榜排名而采取的策略性行动。将数据集视为选民、模型视为候选人,我们将基准特定训练(在训练中包含基准数据)视为一种选举操纵形式。对于任何序数基准,选择训练数据集以使目标模型排名第一的问题对应于移位贿赂(shift bribery),这是计算社会选择中的一类操纵问题。利用这一对应关系,我们证明在Borda计数和平均胜率下,基准特定训练问题是NP难的。作为这种最坏情况视角的补充,我们引入了实例级鲁棒性,即模型开发者必须包含在训练中以登顶给定排行榜的最小数据集数量,并在算术平均、中位数、平均胜率和成对多数下推导出其表达式。我们在HELM下的MMLU和Open LLM排行榜下的BIG-Bench Hard(BBH)上评估了这些表达式。在两个套件中,平均胜率最难操纵:这一差距在BBH(24个任务,4507个模型)上很明显,其鲁棒性中位数为22个任务(92%),而算术平均下为13个(54%),中位数和成对多数下为12个(50%)。

英文摘要

Multi-task benchmarks have become a central pillar of machine learning research, yet their growing influence has incentivised benchmark gaming -- strategic actions taken to improve the leaderboard rank of a specific model. Treating datasets as voters and models as candidates, we consider benchmark-specific training -- the inclusion of benchmark data in training -- as a form of election manipulation. For any ordinal benchmark, the problem of choosing datasets to train on so that a target model becomes top-ranked corresponds to shift bribery, a class of manipulation problems from computational social choice. Leveraging this identification, we show that the benchmark-specific training problem is NP-hard under Borda count and mean win rate. Complementing this worst-case perspective, we introduce the instance-level robustness, the minimum number of datasets a model developer must include in training to top a given leaderboard, and derive expressions for it under arithmetic mean, median, mean win rate and pairwise majority. We evaluate these expressions on MMLU under HELM and on BIG-Bench Hard (BBH) under the Open LLM Leaderboard. Across both suites, mean win rate is hardest to manipulate: this gap is clear on BBH (24 tasks, 4507 models), where its median robustness is 22 tasks (92%), compared with 13 (54%) under arithmetic mean and 12 (50%) under median and pairwise majority.
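Two of the aggregation rules named above can give different leaders on the same score table, which is part of why their robustness differs. The sketch below assumes a table `scores[model]` holding one score per task (higher is better) and defines mean win rate as the fraction of other models beaten on a task, averaged over tasks. The function names and the toy numbers are hypothetical, and the paper's robustness expressions are not reproduced here.

```python
def arithmetic_mean(scores):
    """scores[m] is the list of per-task scores of model m."""
    return {m: sum(v) / len(v) for m, v in scores.items()}


def mean_win_rate(scores):
    """Per task, the fraction of other models beaten; then averaged over tasks."""
    models = list(scores)
    n_tasks = len(scores[models[0]])
    result = {}
    for m in models:
        per_task = []
        for t in range(n_tasks):
            beaten = sum(scores[m][t] > scores[o][t] for o in models if o != m)
            per_task.append(beaten / (len(models) - 1))
        result[m] = sum(per_task) / n_tasks
    return result


def leader(aggregate):
    """Name of the model with the highest aggregate value."""
    return max(aggregate, key=aggregate.get)


# Toy table: A is far ahead on one task, B is slightly ahead on the other two.
toy_scores = {
    "A": [1.0, 0.2, 0.2],
    "B": [0.3, 0.3, 0.3],
    "C": [0.1, 0.25, 0.25],
}
```

On the toy table the arithmetic mean is led by A, because one large score dominates the average, while the mean win rate is led by B, because only the per-task ordering counts.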

2605.23623 2026-05-25 cs.CR cs.AI cs.LG 版本更新

Adversarial Vulnerability Under Temporal Concept Drift: A Longitudinal Study of Android Malware Detection

时间概念漂移下的对抗脆弱性:Android恶意软件检测的纵向研究

Ahmed Sabbah, Mohammed Kharma, Radi Jarrar, Samer Zein, David Mohaisen

发表机构 * Department of Computer Science, Birzeit University(比尔泽特大学计算机科学系,巴勒斯坦) Department of Computer Science, University of Central Florida(中佛罗里达大学计算机科学系)

AI总结 本文通过长期视角研究了安卓恶意软件检测系统在时间概念漂移下的对抗脆弱性,分析了十年间应用数据在静态和动态特征表示下的对抗鲁棒性。研究采用三种部署协议评估模型性能,引入了多个时间关联指标以量化分布偏移对鲁棒性的影响。结果表明,随着时间间隔增大,对抗鲁棒性下降,而攻击成功率上升,强调了在动态数据环境下需考虑时间漂移因素,并提出了针对长期对抗环境的鲁棒性评估框架的重要性。

Comments 42 pages, 4 tables, 10 figures

详情
AI中文摘要

我们提出了一种纵向的、考虑漂移的对抗鲁棒性评估,使用从模拟器和真实设备执行中提取的静态和动态特征表示,跨越超过十年的Android应用。数据集按年度切片组织,并在三种模拟现实学习场景的部署协议下进行评估:(1)同年度训练和测试,(2)跨年度部署且不更新模型,(3)使用累积历史数据进行扩展窗口重训练。在多个分类器家族中,使用FGSM和SPSA在可行性约束下生成对抗样本。我们测量了干净性能、对抗准确率(AA)、攻击成功率(ASR),并引入了时序关联指标——RobustDrop、$\Delta$ASR和对抗放大因子(AAF)——以量化分布漂移与鲁棒性退化之间的关系。结果表明,在评估的基于迁移的特征空间设置下,时间分离与对抗鲁棒性降低相关。随着训练-测试间隔增加,干净准确率和对抗准确率下降,而攻击成功率呈现配置相关的增加,特别是在FGSM扰动和静态特征下。扩展窗口重训练可以缓解但无法消除在持续分布演化下的鲁棒性损失。这些发现表明,在评估智能检测系统在演化数据分布下的长期鲁棒性时,应考虑时间漂移,并强调了在长期对抗环境中需要漂移感知的鲁棒性评估框架。

英文摘要

We present a longitudinal, drift-aware evaluation of adversarial robustness across more than a decade of Android applications using static and dynamic feature representations extracted from emulator and real-device executions. The dataset is organized into yearly slices and evaluated under three deployment protocols that emulate realistic learning scenarios: (1) same-year training and testing, (2) cross-year deployment without model updates, and (3) expanding-window retraining with cumulative historical data. Across multiple classifier families, adversarial examples are generated using FGSM and SPSA under feasibility constraints. We measure clean performance, Adversarial Accuracy (AA), Attack Success Rate (ASR), and introduce temporal linkage metrics -- RobustDrop, ΔASR, and Adversarial Amplification Factor (AAF) -- to quantify the relationship between distribution shift and robustness degradation. Results show that temporal separation is associated with reduced adversarial robustness under the evaluated transfer-based feature-space setting. As the train-test gap increases, clean accuracy and adversarial accuracy decline, while attack success exhibits configuration-dependent increases, particularly under FGSM perturbations and static features. Expanding-window retraining mitigates, but does not eliminate, robustness loss under continued distributional evolution. These findings indicate that temporal drift should be considered when assessing the long-term robustness of intelligent detection systems under evolving data distributions and highlight the need for drift-aware robustness assessment frameworks in long-lived adversarial environments.
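FGSM perturbs an input by a step of size eps in the direction of the sign of the loss gradient. The sketch below applies it to a logistic-regression classifier and computes an attack success rate as the share of correctly classified samples that are misclassified after the attack. That ASR definition is a common convention and an assumption here, since the abstract does not define the metric. The feasibility constraints the paper applies to malware features are omitted, and all names are hypothetical.

```python
import math


def predict_proba(w, b, x):
    """Logistic-regression probability of class 1."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))


def sign(value):
    return (value > 0) - (value < 0)


def fgsm(w, b, x, y, eps):
    """One FGSM step: x + eps * sign(dL/dx) for the log-loss."""
    p = predict_proba(w, b, x)
    grad = [(p - y) * wi for wi in w]  # gradient of the log-loss w.r.t. x
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


def is_correct(w, b, x, y):
    return (predict_proba(w, b, x) >= 0.5) == bool(y)


def attack_success_rate(w, b, data, eps):
    """Share of correctly classified samples that flip under FGSM."""
    correct = [(x, y) for x, y in data if is_correct(w, b, x, y)]
    if not correct:
        return 0.0
    flipped = sum(
        not is_correct(w, b, fgsm(w, b, x, y, eps), y) for x, y in correct
    )
    return flipped / len(correct)
```

Samples that were already misclassified are excluded from the denominator, so the rate measures the effect of the attack rather than the clean error.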

2605.23605 2026-05-25 cs.LG cs.AI cs.CL 版本更新

DiLaDiff: Distilled Latent-Augmented Diffusion for Language Modeling

DiLaDiff: 蒸馏潜在增强扩散用于语言建模

Jean-Marie Lemercier, Tomas Geffner, Karsten Kreis, Morteza Mardani, Arash Vahdat, Ante Jukić

发表机构 * NVIDIA(英伟达)

AI总结 DiLaDiff 是一种改进的扩散语言模型,旨在解决传统扩散模型在采样质量和生成速度之间的矛盾。该方法引入了连续语义潜在空间,并通过自编码器和一致性蒸馏技术提升生成效率和质量。实验表明,DiLaDiff 在不进行蒸馏时已优于基线模型,并在蒸馏后显著加快了推理速度。

详情
AI中文摘要

扩散语言模型本质上无法捕捉解码令牌之间的相关性,导致采样质量与吞吐量之间存在严峻的权衡。为了解决这个问题,我们提出了DiLaDiff,一种掩码扩散语言模型的变体,包含三个组件:(1)具有语义能力的连续潜在空间,通过从现有掩码扩散语言模型微调的自编码器学习;(2)学习编码器分布先验的潜在扩散模型;(3)将学习到的先验蒸馏为少步潜在生成模型的一致性模型。我们表明,即使没有蒸馏,我们的潜在引导扩散模型在显著加速推理的同时也优于掩码扩散基线。一致性蒸馏进一步降低了连续扩散的计算开销,使得潜在生成的时间相对于离散解码可以忽略不计。

英文摘要

Diffusion language models intrinsically fail to capture correlations between decoded tokens, which leads to a harsh trade-off between sampling quality and throughput. To solve this issue, we propose DiLaDiff, a variant of masked diffusion language models with three components: (1) a continuous latent space with semantic capabilities, learned by an auto-encoder fine-tuned from an existing masked diffusion language model; (2) a latent diffusion model learning the prior over the encoder distribution; (3) a consistency model distilling the learned prior into a few-step latent generative model. We show that, even without distillation, our latent-guided diffusion model outperforms the masked diffusion baseline while significantly accelerating inference. Consistency distillation further lowers the computational overhead of continuous diffusion, such that the latent is generated in negligible time compared to discrete decoding.

2605.23603 2026-05-25 cs.LG cond-mat.dis-nn cs.AI cs.NE 版本更新

Preisach Attention: A Hysteretic Model of Sequential Memory

Preisach注意力:序列记忆的迟滞模型

Piotr Frydrych

发表机构 * Faculty of Mechatronics, Warsaw University of Technology(华沙理工大学机电学院)

AI总结 本文提出了一种基于经典 Preisach 滞后算子的新型序列建模架构——Preisach 注意力层(PAL),用二值继电器操作符替代传统的 softmax 注意力机制,通过学习激活与去激活阈值来维护内部的局部极值栈。该架构在任意精度算术下实现图灵完备性,且单层 PAL-Transformer 的深度仅为 O(1),优于传统硬注意力 Transformer 所需的 O(log n) 深度。研究还证明 PAL 与 Transformer 在可计算函数类上互不包含,PAL 能以更少层数计算历史范围统计量,而 Transformer 支持随机访问但需额外状态支持,且 PAL 对序列的响应仅依赖于局部极值序列,而非绝对位置或时间间隔。

Comments 24 pages, 2 tables, preprint

详情
AI中文摘要

我们引入了Preisach注意力层(PAL),一种基于数学物理中经典Preisach迟滞算子的新型序列建模架构。PAL用由学习到的激活和去激活阈值参数化的二进制继电器算子替代了softmax注意力机制,并维护一个局部极值栈作为其内部状态。在任意精度算术下,具有O(1)深度的单层PAL-Transformer是图灵完备的,这可以通过模拟双栈下推自动机实现——而标准硬注意力Transformer需要O(log n)深度。其次,我们证明了PAL和Transformer可计算的函数类是不可比的:PAL在O(1)层内计算历史范围统计,而Transformer需要O(log n)层;Transformer支持随机访问检索,而PAL在没有辅助状态的情况下无法执行。分离性质是率无关性——PAL仅响应局部极值序列,而不响应绝对标记位置或时间间隔。第三,我们证明了极值栈构成了所有率无关泛函的输入历史的最小充分统计量,提供了经典迟滞理论中擦除性质的形式类比。因此,PAL是一种适用于长情节记忆和弱位置依赖任务的高效架构,其总推理成本为O(n log n),而标准注意力为O(n^2)。

英文摘要

We introduce the Preisach Attention Layer (PAL), a novel sequence modelling architecture grounded in the classical Preisach hysteresis operator from mathematical physics. PAL replaces the softmax attention mechanism with a binary relay operator parameterised by learned activation and deactivation thresholds, maintaining a stack of local extrema as its internal state. A single-layer PAL-Transformer with O(1) depth is Turing-complete under arbitrary precision arithmetic, achievable through simulation of a two-stack pushdown automaton -- in contrast to the O(log n) depth required by standard hard-attention transformers. Second, we prove that the function classes computable by PAL and by the transformer are incomparable: PAL computes historical range statistics in O(1) layers that require O(log n) layers for transformers, while transformers support random-access retrieval that PAL cannot perform without auxiliary state. The separating property is rate-independence -- PAL responds only to the sequence of local extrema, not to absolute token positions or temporal spacing. Third, we show that the extremum stack constitutes a minimal sufficient statistic of the input history for all rate-independent functionals, providing a formal analogue of the wiping property in classical hysteresis theory. PAL is thus an efficient architecture for tasks with long episodic memory and weak positional dependence, with O(n log n) total inference cost versus O(n^2) for standard attention.
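
The binary relay mentioned above can be illustrated with the classical two-threshold hysteron that the Preisach operator superposes. This is a sketch of that textbook relay only, not of PAL itself: the class name, the fixed thresholds `alpha` and `beta`, and the toy inputs are assumptions (in PAL the thresholds are learned).

```python
class Relay:
    """Two-threshold relay: switches on at x >= alpha, off at x <= beta."""

    def __init__(self, alpha, beta, state=0):
        if beta >= alpha:
            raise ValueError("expected beta < alpha")
        self.alpha = alpha  # activation threshold
        self.beta = beta    # deactivation threshold
        self.state = state

    def step(self, x):
        if x >= self.alpha:
            self.state = 1
        elif x <= self.beta:
            self.state = 0
        # Between the thresholds the previous state is kept (memory).
        return self.state


def relay_outputs(xs, alpha, beta):
    """Run a fresh relay over a sequence and return its output at every step."""
    relay = Relay(alpha, beta)
    return [relay.step(x) for x in xs]


# Same extrema (0 -> 1.0 -> 0.1), sampled slowly and quickly.
slow = [0.0, 0.2, 0.5, 0.9, 1.0, 0.7, 0.4, 0.1]
fast = [0.0, 1.0, 0.1]
same_final_state = relay_outputs(slow, 0.8, 0.3)[-1] == relay_outputs(fast, 0.8, 0.3)[-1]
```

The two inputs differ in how many samples lie between the turning points but end in the same relay state, which is the rate-independence property the abstract names as the separating feature. The output at an input of 0.5 also depends on whether the relay was previously switched on, which is the hysteretic memory.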

2605.23597 2026-05-25 cs.CL cs.LG 版本更新

Structure-Guided Entity Resolution: Fine-Tuning LLMs for Robust Name Matching in Complex Linguistic Contexts

结构引导的实体解析:微调大语言模型以实现复杂语言上下文中的鲁棒姓名匹配

Shivam Chourasia, Hitesh Kapoor, Nilesh Patil

发表机构 * Dream Sports

AI总结 本文研究了在语言和文化复杂环境下进行人名匹配的实体解析问题,提出了一种名为Structure-Guided Entity Resolution(SGER)的新框架,通过两阶段课程式微调增强大语言模型对姓名结构和语义的理解,从而提升实体匹配的准确性。该方法在印度身份数据等具有高度语言多样性和噪声的现实场景中表现出色,取得了99.02%的高准确率,并在生产环境中成功部署,验证了其在大规模多语言系统中的有效性和鲁棒性。

Comments Accepted to ACL 2026. 8 pages, 1 figure, 2 tables

详情
AI中文摘要

跨异构记录匹配人名是实体解析的核心挑战,尤其是在语言和文化复杂的环境中。命名惯例的差异、跨文字的不一致音译以及频繁的数据录入错误使得统一用户身份变得困难,而这对于了解你的客户(KYC)合规至关重要。虽然大语言模型在理解自然语言方面显示出潜力,但它们往往难以处理此类特定领域设置中存在的结构化歧义。本文介绍了结构引导实体解析(SGER),一种新颖的框架,通过两阶段课程微调大语言模型。模型首先被训练解析人名的语法和语义结构,然后针对二元实体匹配的下游任务进行优化。我们在印度身份数据的挑战性背景下评估SGER,这是全球语言最多样化和噪声最大的环境之一。SGER在包含50,000个真实世界对的保留测试集上达到了99.02%的准确率和0.994的F1分数,优于GPT-4o少样本提示和单阶段微调基线。该系统已完全部署在全球最大的梦幻体育平台Dream11的生产环境中,服务超过2.5亿用户。我们的结果表明,课程引导的训练能够在现实世界的多语言系统中实现大规模、高精度的实体解析。

英文摘要

Matching person names across heterogeneous records is a core challenge in entity resolution, especially within linguistically and culturally complex environments. Variations in naming conventions, inconsistent transliteration across scripts, and frequent data entry errors make it difficult to unify user identities, an essential requirement for Know Your Customer (KYC) compliance. While Large Language Models have shown promise in understanding natural language, they often struggle with the structured ambiguity present in such domain-specific settings. This paper introduces Structure-Guided Entity Resolution (SGER), a novel framework that fine-tunes an LLM through a two-phase curriculum. The model is first trained to parse the grammatical and semantic structure of personal names, then optimized for the downstream task of binary entity matching. We evaluate SGER in the challenging context of Indian identity data, one of the most linguistically diverse and noisy environments globally. SGER achieves 99.02% accuracy and an F1 of 0.994 on a held-out set of 50,000 real-world pairs, outperforming GPT-4o few-shot prompting and single-stage fine-tuning baselines. The system is fully deployed in production at Dream11, the world's largest fantasy sports platform, serving 250M+ users. Our results demonstrate that curriculum-guided training enables robust, high-precision entity resolution in real-world multilingual systems at scale.

2605.23591 2026-05-25 stat.ML cond-mat.dis-nn cs.LG math.ST stat.TH 版本更新

Asymmetric Scaling Laws from Sparse Features

基于稀疏特征的非对称缩放定律

John Sous, Michael Winer

发表机构 * Yale University(耶鲁大学) Energy Sciences Institute(能源科学研究所) Institute for Advanced Study(高等研究院) Alignment Research Center(对齐研究中心)

AI总结 本文研究了稀疏激活下神经网络的扩展规律,提出了一种新的模型,指出测试损失主要由训练输入中从未出现的稀疏坐标主导,从而形成一种不同于密集模型的新瓶颈。研究推导了欠参数化和过参数化情形下的渐近损失,并发现损失曲线在插值阈值附近呈现双下降现象,表现出由稀疏度决定的两个不同扩展指数。此外,还分析了梯度下降动力学,并展示了固定步长梯度下降不稳定概率的扩展规律,表明稀疏性带来的影响在非线性激活下依然存在。

详情
AI中文摘要

我们引入了一个稀疏激活下的神经缩放定律模型。在该模型中,测试损失通常由训练输入中从未观察到的稀有坐标主导。这种机制引入了一个密集模型中不存在的新瓶颈。我们推导了欠参数化和过参数化区域的渐近总体损失,并表明损失在插值阈值附近出现双下降峰值——其中参数数量刚好足以拟合训练数据——导致损失曲线由两个不同的缩放指数控制:一个用于过参数化区域,一个用于欠参数化区域,其差距由稀疏程度决定。此外,我们推导了一个计算最优边界,在固定计算预算下倾向于增加数据集大小而非模型容量。我们还分析了梯度下降动力学,并确定了固定步长梯度下降变得不稳定的概率的缩放定律。我们进一步表明,稀疏诱导效应在非线性激活下仍然存在。

英文摘要

We introduce a model for neural scaling laws under sparse activations. In the model, test loss is often dominated by rare coordinates that are never observed in the training input. This mechanism induces a novel bottleneck absent from dense models. We derive the asymptotic population loss in both the underparameterized and overparameterized regimes, and show that the loss exhibits a double-descent peak near the interpolation threshold -- where the number of parameters is just sufficient to fit the training data -- resulting in a loss curve governed by two distinct scaling exponents -- one for the overparameterized regime and one for the underparameterized regime -- with a gap determined by the degree of sparsity. Additionally, we derive a compute-optimal frontier that favors increasing dataset size over model capacity under fixed compute budgets. We also analyze gradient-descent dynamics and identify a scaling law for the probability that fixed-step gradient descent becomes unstable. We further show that the sparsity-induced effect persists under nonlinear activations.
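
To make the "never observed" mechanism concrete, consider an illustrative assumption that is not taken from the paper: coordinate $i$ is active independently in each training input with probability $p_i$. Over $n$ training inputs this gives

```latex
\Pr[\text{coordinate } i \text{ is never active in training}]
  = (1 - p_i)^{n} \approx e^{-n p_i} \qquad (p_i \ll 1).
```

Under this assumption, coordinates with $p_i \ll 1/n$ are almost surely unseen while those with $p_i \gg 1/n$ are almost surely seen, so the set of coordinates that contribute unlearned test loss shrinks as $n$ grows at a rate set by how the $p_i$ are distributed. The paper's actual model and exponents may differ from this toy calculation.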

2605.23583 2026-05-25 cs.RO cs.LG 版本更新

How Many Training Samples Are Needed for the Inverse Kinematics Solutions by Artificial Neural Networks

人工神经网络求解逆运动学需要多少训练样本

Dong-Won Lim

发表机构 * The University of Suwon(水原大学)

AI总结 本文研究了使用人工神经网络求解机器人逆运动学问题时所需的最小训练样本数量。通过构建不同规模的训练数据集,训练前馈神经网络并评估其精度、收敛性和泛化能力,发现当样本数量超过125后,模型效率提升不再显著。该研究为实际机器人应用中优化神经网络数据规模、平衡计算成本与模型精度提供了有价值的指导。

Comments 14 pages, 5 figures

详情
AI中文摘要

逆运动学在机器人运动规划与控制中扮演关键角色。机器人操作臂的逆运动学求解可通过传统方法如几何法、代数法或雅可比法实现,但这些方法存在缺陷。人工神经网络因其泛化能力和计算效率,已成为近似逆运动学解的有前途的替代方案。该方法基本上只训练记录用于求解逆运动学问题的少量末端执行器样本。然而,一个基本问题仍然存在:多少训练样本足以实现可靠且准确的逆运动学预测?本研究探讨了训练数据集大小与基于ANN的逆运动学求解器精度之间的数学框架。使用关节型机器人操作臂,我们生成不同数量的关节位置对来训练前馈神经网络,并评估其精度、收敛性和泛化能力。结果表明,超过125个训练样本并未有助于提高模型效率,该效率通过采样大小上的近似精度可比度量来衡量,为数据效率提供了宝贵见解。这项工作为优化ANN解决方案的数据规模提供了实用指导,平衡了实际机器人应用中的计算成本和模型精度。

英文摘要

Inverse Kinematics (IK) plays a critical role in robotic motion planning and control. The IK of a robot manipulator can be solved by conventional geometric, algebraic, or Jacobian methods, each of which has drawbacks. Artificial Neural Networks (ANNs) have become a promising alternative for approximating IK solutions due to their generalization ability and computational efficiency. This approach trains on only a few recorded end-effector samples to solve the IK problem. However, a fundamental question remains: how many training samples are sufficient to achieve reliable and accurate IK predictions? This study investigates a mathematical framework relating the size of the training dataset to the accuracy of ANN-based IK solvers. Using an articulated robotic manipulator, we generate varying numbers of joint-position pairs to train feedforward neural networks and assess their accuracy, convergence, and generalization capability. The results reveal that using more than 125 training samples did not improve model efficiency, a comparable measure of approximation accuracy relative to sample size, offering valuable insight into data efficiency. This work provides practical guidance for optimizing the data sizing of ANN solutions, balancing computational cost and model accuracy for real-world robotic applications.

2605.23574 2026-05-25 cs.LG cs.SE 版本更新

Push Your Agent: Measuring and Enforcing Quantitative Goal Persistence in Long-Horizon LLM Agents

推动你的智能体:在长周期LLM智能体中测量和强制实现定量目标持续性

Yuandao Cai, Yuzhang Zhu, Liyou Gao, Wensheng Tang, Shengchao Qin

发表机构 * Independent Researcher(独立研究者) Xidian University(西安电子科技大学)

AI总结 本文研究了长期语言智能体在完成定量目标时存在的“定量目标持续性”(QGP)问题,即智能体是否能持续工作直到外部验证器确认完成足够数量的有效任务。为此,作者提出了PushBench基准,用于直接衡量重复工作、重复提交、虚假完成等问题。实验表明,基于状态追踪和工作单元追踪的控制器在减少重复提交和提高任务完成率方面表现优异,而当前主流智能体在处理大量任务时成功率显著下降,突显了定量目标对智能体可靠性提出的更高要求。

详情
AI中文摘要

长周期语言智能体可能做出许多看似合理的局部工具调用,但未能持续直到请求的数量实际完成。我们将这一差距研究为定量目标持续性(QGP):即智能体是否持续工作,直到外部验证器确认足够数量的不同有效项。PushBench将其转化为一个用于仓库-工件收集和验证器支持的工作单元的基准,因此重复工作、重复提交、虚假完成和进度漂移被直接测量,而不是隐藏在最终成功标志之后。在匹配的控制器比较中,状态追踪检索控制器达到69-78%的成功率,同时消除了重复提交;而积压追踪工作单元控制器在标准和完成门控控制器无法完成任何任务实例的设置中达到25-50%的成功率。使用Claude Code(Sonnet 4.6)和Codex CLI(gpt-5.4)的黑盒前沿智能体评估解决了许多50个工件的任务,但在100个工件时每条件仅剩3/9的成功率。结果表明,定量目标对不同于局部任务能力的可靠性要求提出了挑战:智能体必须维护已验证的进度,并仅在请求的工作完成时停止。

英文摘要

Long-horizon language agents can make many plausible local tool calls yet fail to persist until a requested count is actually complete. We study this gap as Quantitative Goal Persistence (QGP): whether an agent keeps working until an external verifier confirms enough distinct valid items. PushBench turns this into a benchmark for repository-artifact collection and verifier-backed work units, so repeated work, duplicate submissions, false completion, and progress drift are measured directly rather than hidden behind a final success flag. In matched controller comparisons, a state-tracking retrieval controller reaches 69-78% success while eliminating duplicate submissions, and a backlog-tracking work-unit controller reaches 25-50% success in settings where standard and completion-gated controllers complete no task instances. Black-box frontier-agent evaluations with Claude Code (Sonnet 4.6) and Codex CLI (gpt-5.4) solve many 50-artifact tasks but drop to 3 out of 9 successes per condition at 100 artifacts. The results show that quantitative goals stress a different reliability requirement from local task competence: agents must maintain verified progress and stop only when the requested work is complete.
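
The idea of a verifier that counts distinct valid items, together with a controller that tracks what it has already submitted, can be sketched as follows. Everything here is hypothetical scaffolding for illustration (the `Verifier` class, `run_controller`, and the `str.isalpha` validity check); it is not the PushBench API.

```python
class Verifier:
    """External check: the goal is met only after `target` distinct valid items."""

    def __init__(self, target, is_valid):
        self.target = target
        self.is_valid = is_valid
        self.accepted = set()
        self.duplicates = 0
        self.invalid = 0

    def submit(self, item):
        if not self.is_valid(item):
            self.invalid += 1
            return False
        if item in self.accepted:
            self.duplicates += 1  # repeated work, measured directly
            return False
        self.accepted.add(item)
        return True

    def done(self):
        return len(self.accepted) >= self.target


def run_controller(candidates, verifier, track_state):
    """Submit candidates until the verifier confirms the count; never self-declare success."""
    seen = set()
    for item in candidates:
        if verifier.done():
            break
        if track_state:
            if item in seen:
                continue  # state tracking: skip anything already attempted
            seen.add(item)
        verifier.submit(item)
    return verifier.done()


candidates = ["a", "b", "a", "c", "bad!", "d"]
naive = Verifier(4, str.isalpha)
tracked = Verifier(4, str.isalpha)
naive_ok = run_controller(candidates, naive, track_state=False)
tracked_ok = run_controller(candidates, tracked, track_state=True)
```

Both runs reach the target here, but only the state-tracking run does so without a duplicate submission. Because success is read from `verifier.done()` rather than from the controller's own belief, a run that exhausts its candidates early reports failure instead of a false completion.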

2605.23572 2026-05-25 cs.IR cs.AI cs.LG 版本更新

HARNESS-LM: A Three-Phase Training Recipe for Harnessing SLMs in Sponsored Search Retrieval

HARNESS-LM: 一种在赞助搜索中利用小语言模型的三阶段训练方案

Vipul Gupta, Shikhar Mohan, Lakshya Kumar, Pranjal Chitale, Nikit Begwani, Amit Singh, Manik Varma

发表机构 * Microsoft AI(微软人工智能)

AI总结 在赞助搜索中,如何在保证检索质量的同时降低响应延迟是一个重要挑战。本文提出HARNESS-LM(HLM),一种三阶段训练框架,旨在将大规模语言模型的检索能力转移到参数更少、成本更低的模型中。通过知识蒸馏和对比优化等方法,HLM在保持高检索精度的同时显著提升了推理效率,并在实际的Bing Ads测试中验证了其有效性,取得了更高的收益、曝光和点击率提升。

Comments 9 pages, 3 figures, 10 tables

详情
AI中文摘要

在赞助搜索的竞争格局中,平衡检索质量与生产延迟是一个关键挑战。尽管基于小语言模型(SLM)的大型检索模型(如Qwen3-Embedding-4B/8B)在公共基准上设定了强上限,但其在高吞吐、延迟敏感环境中的部署仍不切实际。本文提出HARNESS-LM(HLM),一个三阶段训练框架,用于将大规模检索器的能力迁移至紧凑、成本高效的模型。该方法包括:(1)通过微调十亿参数规模的SLM训练高性能参考(“教师”)检索器;(2)通过L2目标对齐查询表示,将知识蒸馏至低于600M参数的学生编码器;(3)应用最终对比精炼阶段以优化学生的检索性能。我们还对关键设计选择进行了全面的实证研究,包括对齐目标、嵌入维度、模型规模、架构和优化策略,以确定在生产环境中最为有效的配置。在真实世界的Bing Ads评估基准上,HLM在多种设置下恢复了参考检索器超过98%的精度,同时在NVIDIA A100 GPU上实现了高达27倍的在线查询编码器延迟降低和20倍的吞吐量提升。在Bing Ads上的在线A/B测试进一步显示,与当前生产中运行的检索器集成(部署190M参数模型)相比,收入提升+1%,展示量提升+0.6%,点击量提升+0.4%,清晰突显了HLM方案在真实世界赞助搜索场景中的实际效果。

英文摘要

In the competitive landscape of sponsored search, balancing retrieval quality with production latency is a critical challenge. While large retrieval models based on Small Language Models (SLMs) such as Qwen3-Embedding-4B/8B set strong upper bounds on public benchmarks, their deployment in high-throughput, latency-sensitive environments remains impractical. In this paper, we present HARNESS-LM (HLM), a three-phase training framework for transferring the capabilities of large-scale retrievers into compact, cost-efficient models. The approach comprises: (1) training a high-performance reference ("teacher") retriever by fine-tuning a billion-parameter-scale SLM; (2) aligning query representations via an L2 objective to distill knowledge into a sub-600M parameter student encoder; and (3) applying a final contrastive refinement stage to optimize the student for retrieval performance. We also present a comprehensive empirical study of key design choices, including alignment objectives, embedding dimensionality, model scale, architecture, and optimization strategies, to identify configurations that are most effective in production settings. On a real-world Bing Ads evaluation benchmark, HLM recovers over 98% of the reference retriever's precision across multiple settings, while delivering up to 27x lower online query-encoder latency and 20x higher throughput on NVIDIA A100 GPUs. Online A/B testing on Bing Ads further shows a +1% Revenue, +0.6% Impression, and +0.4% Click uplift over the current ensemble of retrievers running in production with the deployed 190M parameter model, clearly highlighting the practical efficacy of the HLM recipe in a real-world sponsored search setting.
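
Phases (2) and (3) can be sketched as two loss functions. The abstract names an L2 alignment objective and a contrastive refinement stage without giving formulas, so the details below are assumptions: student and teacher embeddings are taken to share a dimension, and the contrastive loss is written as in-batch InfoNCE with a hypothetical temperature of 0.05.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def l2_alignment_loss(student, teacher):
    """Mean squared L2 distance between student and teacher query embeddings."""
    total = 0.0
    for s, t in zip(student, teacher):
        total += sum((a - b) ** 2 for a, b in zip(s, t))
    return total / len(student)


def info_nce_loss(queries, docs, temperature=0.05):
    """In-batch contrastive loss where docs[i] is the positive for queries[i]."""
    loss = 0.0
    for i, q in enumerate(queries):
        logits = [dot(q, d) / temperature for d in docs]
        peak = max(logits)
        # Numerically stable log-sum-exp over all documents in the batch.
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        loss += log_z - logits[i]
    return loss / len(queries)


queries = [[1.0, 0.0], [0.0, 1.0]]
aligned_docs = [[1.0, 0.0], [0.0, 1.0]]
swapped_docs = [[0.0, 1.0], [1.0, 0.0]]
aligned = info_nce_loss(queries, aligned_docs)
swapped = info_nce_loss(queries, swapped_docs)
```

The alignment loss needs only query text and teacher outputs, with no relevance labels, while the contrastive loss requires query-document pairs. That split is one plausible reason to run alignment first and refine afterwards.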

2605.23565 2026-05-25 cs.LG cs.AI 版本更新

Understanding Goal Generalisation in Sequential Reinforcement Learning

理解序贯强化学习中的目标泛化

Jason Ross Brown, Edward James Young

发表机构 * University of Cambridge(剑桥大学) Geodesic Research(Geodesic研究)

AI总结 本研究探讨了序列强化学习代理在新环境中实现目标泛化的能力,分析了其训练历史对其行为的影响。通过研究超过100种序列训练流程并在250多个分布外环境中进行评估,发现显著特征和早期学习的目标对后续泛化具有重要影响。为此,研究提出了一种名为潜在策略梯度的方法,能够预测训练流程可能诱导的分布外行为,具有较高的预测准确性、良好的泛化能力和可解释性,为从发展角度理解目标泛化提供了基础。

详情
AI中文摘要

强化学习代理在其训练分布之外常常表现出非预期的目标导向行为,但我们目前缺乏基于训练历史对这类代理如何泛化到新环境的原理性理解。我们针对在单个或多个任务上序贯训练的代理解决了这一空白。我们研究了超过100个序贯训练流程,评估了超过250个分布外环境中的行为。我们发现显著特征驱动泛化,并且训练早期习得的目标会持续存在并影响后期习得的目标。为了解释这些现象,我们引入了潜在策略梯度方法,该方法预测训练流程可能诱导的分布外行为。我们的方法根据潜在变量如何映射到行为的简单模型,模拟训练过程中低维潜在变量的演化,以实现在训练目标上获得高奖励。它实现了强预测准确性,泛化到未见过的训练流程类型,并且是可解释的。我们的发现表明,虽然分布外RL代理行为依赖于整个训练流程,但这种依赖具有我们可以捕捉的底层结构,为从发展角度理解目标泛化奠定了基础。

英文摘要

Reinforcement learning agents often exhibit unintended goal-directed behaviour outside their training distribution, but we currently lack a principled understanding of how such agents will generalise to novel environments based on their training history. We address this gap for agents trained sequentially on one or more tasks. We study over 100 sequential training pipelines, evaluating behaviour across over 250 out-of-distribution environments. We find that salient features drive generalisation, and that goals learnt early in training can persist and influence those acquired later. To explain these phenomena, we introduce latent policy gradients, a method that predicts what out-of-distribution behaviour a training pipeline will likely induce. Our method simulates the evolution of low-dimensional latent variables during training according to what would achieve high reward on the training objective with respect to a simple model of how the latent variables map to behaviour. It achieves strong predictive accuracy, generalises to unseen types of training pipeline, and is interpretable. Our findings demonstrate that while out-of-distribution RL agent behaviour is dependent on the whole training pipeline, this dependence has an underlying structure we can capture, laying groundwork for understanding goal generalisation from a developmental perspective.

2605.23563 2026-05-25 cs.LG 版本更新

MARS: Magnitude-Aware Rank Statistics

MARS:幅度感知排名统计

Muhammad Rajabinasab, Afsaneh M. Nejad, Arthur Zimek

发表机构 * University of Southern Denmark(南丹麦大学)

AI总结 在机器学习模型的全面评估中,如何准确反映模型性能差异是一个重要问题。传统关键差异(CD)图依赖于离散排名,忽略了模型性能差距的幅度,导致“幅度盲”问题。为此,本文提出了一种基于幅度感知的排名统计方法MARS,通过引入相对边距系数对离散排名进行加权,从而更真实地反映模型性能差异,并在广泛实验设置中提供更深入的洞察。

Comments Preprint submitted to Elsevier Pattern Recognition Letters

详情
AI中文摘要

机器学习模型的全面评估是确保其按预期稳健且一致运行的关键。为了总结实验结果并选出最佳模型,通常使用临界差异(CD)图。标准CD图依赖于离散排名,忽略了模型之间性能差距的幅度,这引发了我们称之为幅度盲视的问题。为了解决这个问题,我们提出了幅度感知排名统计(MARS),它引入了一个相对边际系数作为离散排名的权重。该系数基于最佳和最差表现者之间的距离对排名进行缩放,并采用动态投影来处理边界情况。在计算CD值之后,MARS能够更真实地统计表示模型性能的差异,并提供更多关于方法在广泛实验设置中实际表现如何的见解。

英文摘要

Comprehensive evaluation of machine learning models is the key to make sure that they perform as robustly and consistently as desired. In order to summarize the experimental results and pick a winner, Critical Difference (CD) diagrams are used. Standard CD diagrams rely on discrete ranks, discarding the magnitude of performance gaps between models, raising an issue which we call magnitude-blindness. In order to address this issue, we propose Magnitude-Aware Rank Statistics (MARS) that incorporates a relative margin coefficient as a weight for the discrete ranks. This coefficient scales ranks based on the distance between the best and worst performers, with a dynamic projection to handle boundary cases. Followed by the calculation of a CD value, MARS results in a more realistic statistical representation of differences of model performances and more insights on how methods actually perform in vast and extensive experimental settings.
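
The magnitude-blindness of rank-based CD diagrams can be seen in a short sketch of the standard procedure: average ranks per method, plus the Nemenyi critical difference $CD = q_\alpha \sqrt{k(k+1)/(6N)}$ for $k$ methods on $N$ datasets. This shows the conventional baseline only, not the MARS coefficient, whose formula the abstract does not give. The helper names are illustrative, and `q_alpha` must be supplied from a studentized-range table.

```python
import math


def ranks_for_dataset(scores):
    """Rank 1 is the highest score; tied scores share the average rank."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    ranks = [0.0] * len(scores)
    pos = 0
    while pos < len(order):
        end = pos
        while end + 1 < len(order) and scores[order[end + 1]] == scores[order[pos]]:
            end += 1
        shared = (pos + end) / 2.0 + 1.0
        for idx in order[pos:end + 1]:
            ranks[idx] = shared
        pos = end + 1
    return ranks


def average_ranks(table):
    """table[d][m] is the score of method m on dataset d."""
    per_dataset = [ranks_for_dataset(row) for row in table]
    n_methods = len(table[0])
    return [sum(r[m] for r in per_dataset) / len(table) for m in range(n_methods)]


def critical_difference(k, n, q_alpha):
    """Nemenyi critical difference for k methods compared on n datasets."""
    return q_alpha * math.sqrt(k * (k + 1) / (6.0 * n))


narrow = [[0.90, 0.89, 0.88], [0.80, 0.79, 0.78]]  # methods nearly tied
wide = [[0.90, 0.50, 0.10], [0.80, 0.40, 0.05]]    # methods far apart
same_ranks = average_ranks(narrow) == average_ranks(wide)
```

Both tables yield identical average ranks even though the performance gaps differ by more than an order of magnitude, which is the information a magnitude-aware weighting is meant to restore.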

2605.23556 2026-05-25 cs.LG cs.IR math.CO 版本更新

Is Dimensionality a Barrier for Retrieval Models?

维度是检索模型的障碍吗?

Kiril Bangachev, Guy Bresler, Jonathan Kogan, Yury Polyanskiy

发表机构 * Department of Electrical Engineering and Computer Science(电气工程与计算机科学系)

AI总结 本文探讨了为何现代基于嵌入的检索模型在表示维度较低(约1000维)的情况下仍能处理数十亿甚至数万亿的数据点。研究聚焦于最大边距嵌入问题,分析了在给定查询与文档相关性矩阵下,如何在有限维度中实现最大的分类边距。论文证明了在特定条件下,维度只需为 $O(k \log(n/k))$ 即可达到理论最优边距,从而解决了相关模型的维度需求问题,并通过实验验证了sigmoid损失在生成大边距嵌入方面的优势。

详情
AI中文摘要

为什么表示的低维度(通常$d\approx 1000$)不会阻止现代基于嵌入的检索模型扩展到数十亿甚至数万亿数据点?为了回答这个问题,我们在以下检索模型中研究最大间隔嵌入,该模型经典地出现在通信复杂性[PS86]和最近的基于嵌入的检索[WBNL26]中。设$A\in \{0,1\}^{N\times n}$是一个矩阵,指示$N$个查询中的每一个是否与$n$个文档中的每一个相关。我们感兴趣的是最大间隔$m>0$,记为$\mathsf{m}^{\mathsf{rd}}(d, A)$,使得存在查询和文档的单位范数嵌入$\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$满足以下性质:当$A_{ji} = 1$时$\langle U_j, V_i\rangle \ge m$,否则$\langle U_j, V_i\rangle \le -m$。大间隔是表示质量的关键代理:它控制了对扰动的鲁棒性和跨查询的组合泛化能力。我们的主要定理表明,在没有维度限制的情况下,最佳可能间隔$\mathsf{m}^{\mathsf{rd}}(+\infty, A)$可以在维度$d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$下几乎达到,这改进了[BDES02]的一个定理。结合定理1.5中的匹配下界,我们得出结论:当$A\in \{0,1\}^{\binom{n}{k}\times n}$是包含所有可能的$k$-稀疏行一次的矩阵时,维度$d = O(k\log (n/k))$是达到该设置下最大可能间隔$\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$的充分必要条件。这完全解决了[WBNL26]中的设定。我们还给出了当$d = o(k\log (n/k))$时产生大间隔的几种构造。最后,我们通过实验测试了InfoNCE和sigmoid损失在产生大间隔嵌入方面的表现,并展示了sigmoid损失的明显优势。

英文摘要

Why does the low dimensionality of representations, typically $d\approx 1000$, not prevent modern embedding-based retrieval models from scaling to billions, or even trillions, of data points? To answer this question, we study maximal-margin embeddings in the following retrieval model, classically studied in communication complexity [PS86] and more recently in embedding-based retrieval [WBNL26]. Let $A\in \{0,1\}^{N\times n}$ be a matrix indicating whether each of $N$ queries is relevant to each of $n$ documents. We are interested in the largest margin $m>0$, denoted by $\mathsf{m}^{\mathsf{rd}}(d, A)$, for which there exist unit norm embeddings of the queries and documents $\{U_j\}_{j = 1}^N, \{V_i\}_{i = 1}^n$ with the following property: $\langle U_j, V_i\rangle \ge m$ whenever $A_{ji} = 1$ and $\langle U_j, V_i\rangle \le -m$ otherwise. A large margin is a key proxy for representation quality: it controls both robustness to perturbations and compositional generalization across queries. Our main theorem establishes that the best possible margin without a restriction on the dimension, $\mathsf{m}^{\mathsf{rd}}(+\infty, A)$, can be nearly achieved in dimension $d = O(\mathsf{m}^{\mathsf{rd}}(+\infty, A)^{-2}\log n)$, which improves a theorem of [BDES02]. Together with a matching lower bound in Theorem 1.5, we conclude that when $A\in \{0,1\}^{\binom{n}{k}\times n}$ is the matrix containing all possible $k$-sparse rows once, dimension $d = O(k\log (n/k))$ is necessary and sufficient for the maximal possible margin $\mathsf{m}^{\mathsf{rd}}(+\infty, A) = \Theta(k^{-1/2})$ in this setting. This fully resolves the setup of [WBNL26]. We also give several constructions for large margins when $d = o(k\log (n/k))$. Finally, we empirically test the InfoNCE and sigmoid losses for producing large margin embeddings and demonstrate a clear advantage of the sigmoid loss.
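
The margin condition defined above is easy to evaluate for a given pair of embeddings. The sketch below computes the margin achieved by one fixed choice of $U$ and $V$, which is a lower bound on $\mathsf{m}^{\mathsf{rd}}(d, A)$ when positive; it does not search for the optimum. The function names are illustrative.

```python
import math


def normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def margin(A, U, V):
    """Largest m such that <U_j, V_i> >= m if A[j][i] == 1 and <= -m otherwise."""
    U = [normalize(u) for u in U]  # the definition requires unit-norm embeddings
    V = [normalize(v) for v in V]
    worst = float("inf")
    for j, u in enumerate(U):
        for i, v in enumerate(V):
            inner = sum(a * b for a, b in zip(u, v))
            signed = inner if A[j][i] == 1 else -inner
            worst = min(worst, signed)
    return worst  # a non-positive value means the embedding violates the condition


A = [[1, 0], [0, 1]]                        # each query has one relevant document
antipodal = margin(A, [[1.0], [-1.0]], [[1.0], [-1.0]])
orthogonal = margin(A, [[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
```

For this $2\times 2$ identity pattern, a one-dimensional antipodal embedding reaches margin 1, whereas the seemingly natural orthogonal embedding in two dimensions has margin 0 because irrelevant pairs sit exactly on the decision boundary.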

2605.23551 2026-05-25 cs.LG cs.AI 版本更新

Goal-Conditioned Agents that Learn Everything All at Once

目标条件智能体一次性学习所有内容

Michael Matthews, Matthew Jackson, Michael Beukman, Thomas Foster, Alistair Letcher, Scott Fujimoto, Cédric Colas, Jakob Foerster

发表机构 * University of Oxford(牛津大学) McGill University(麦吉尔大学) MIT(麻省理工学院) Inria(法国国家信息与自动化研究所)

AI总结 本文提出了一种名为LEO(Learning Everything all at Once)的新方法,用于提升目标条件强化学习的效率。该方法通过一次性输出所有目标对应的价值和动作,实现了高效的并行更新,解决了传统全目标学习计算开销大的问题。实验表明,LEO在目标条件任务和连续控制环境中均表现出色,且相比传统方法有超过250倍的加速效果,为复杂环境中的强化学习提供了有力工具。

详情
AI中文摘要

一个目标条件的强化学习智能体在探索环境时,会在整个轨迹中看到大量信息,但大多数信息在仅根据命令目标进行在线策略更新时被丢弃。全目标学习(每个转换都用于针对每个目标进行离线策略学习)允许智能体提取最大信息,但通过简单的重新标记通常计算上不可行。这可以通过同时为每个目标输出值和动作来克服,从而允许通过网络单次传递进行高效的并行全目标更新,我们称之为一次性学习所有内容(LEO)。我们表明,这种方法在目标条件的Craftax上显著优于其他方法,在连续控制环境中与现有基线具有竞争力,同时与全目标重新标记相比实现了超过250倍的加速。然后,我们进一步表明,通过将LEO用作教师网络而非直接行动者,这种方法可以变得更加强大。我们希望,通过解锁大规模的全目标学习,LEO可以成为复杂环境中强化学习实践者的有用工具。我们开源了我们的代码。

英文摘要

A goal-conditioned reinforcement learning agent exploring an environment will see a wealth of information throughout a trajectory, most of which is discarded when only performing on-policy updates with respect to the commanded goal. All-goals learning, where each transition is used for learning off-policy with respect to every goal, allows agents to extract maximal information, however it is usually computationally infeasible when done via naive relabelling. This can be overcome by jointly outputting values and actions for every goal at once, allowing for efficient, parallel all-goals updates with a single pass through the network, in a process we call Learning Everything all at Once (LEO). We show that this approach significantly outperforms other methods on goal-conditioned Craftax and is competitive with existing baselines on continuous control environments, while achieving a >250x speed-up compared to all-goals relabelling. We then go on to show that this approach can be made even more powerful by using LEO as a teacher network, rather than a direct actor. We hope that, by unlocking all-goals learning at scale, LEO can serve as a useful tool for RL practitioners in complex environments. We open source our code.
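
A tabular sketch of the all-goals idea: the value estimate for a state-action pair is a vector with one entry per goal, so a single transition updates every goal at once. This is an assumption-laden toy, not the LEO architecture: goals are identified with states, the reward is 1 on reaching the goal, and the names `make_q` and `all_goals_update` are hypothetical.

```python
def make_q(n_states, n_actions, n_goals):
    """q[s][a] is a list of values, one per goal."""
    return [[[0.0] * n_goals for _ in range(n_actions)] for _ in range(n_states)]


def all_goals_update(q, s, a, s_next, lr=0.5, gamma=0.9):
    """Apply one off-policy Q-learning step for every goal from one transition."""
    n_actions = len(q[s_next])
    for g in range(len(q[s][a])):
        reached = s_next == g          # assumed goal semantics: reach state g
        reward = 1.0 if reached else 0.0
        if reached:
            bootstrap = 0.0            # episode for goal g ends here
        else:
            bootstrap = max(q[s_next][b][g] for b in range(n_actions))
        target = reward + gamma * bootstrap
        q[s][a][g] += lr * (target - q[s][a][g])


q = make_q(n_states=3, n_actions=2, n_goals=3)
all_goals_update(q, s=0, a=1, s_next=2)   # teaches goal 2 from state 0
all_goals_update(q, s=1, a=0, s_next=0)   # teaches goal 0, and propagates goal 2
```

In a network-based version the loop over goals would be replaced by one forward pass whose output has a goal dimension, which is where the claimed efficiency over per-goal relabelling would come from.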

2605.23540 2026-05-25 cs.LG 版本更新

When One Point Is Not Enough: Addressing Ambiguous Instances in Dimensionality Reduction by Splitting

当一点不够时:通过分裂解决降维中的模糊实例

Diede P. M. van der Hoorn, Alessio Arleo, Fernando V. Paulovich

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

AI总结 本文研究了降维方法中因数据点模糊性导致的邻域结构失真问题,提出了一种基于图的方法来识别并复制这些模糊实例,将其映射到多个位置以更准确地反映其在高维空间中的多个邻域关系。该方法有效缓解了传统降维技术中因单点映射导致的局部结构丢失问题,并在多个实例上展示了其对隐藏邻域关系的揭示能力。

详情
AI中文摘要

降维(DR)方法广泛用于可视化高维数据。基于DR的分析中的一个关键任务是发现邻域,这依赖于分析投影的细粒度局部结构。然而,DR本质上是一个有损过程;没有技术能完美保留高维关系,因此投影包含视觉伪影。在本文中,我们强调了一个通常被忽视的视觉伪影来源:模糊实例。这些实例与高维空间中多个相互不相似的邻域高度相似。标准DR方法无法忠实地投影此类实例,因为每个数据实例被映射到视觉空间中的一个单点。因此,这样的实例仅被放置在其一个邻域中(或根本不放置),因此仅表示其部分邻域结构。我们称这种失真为部分邻域嵌入。在本文中,我们引入了一种基于图的方法,该方法识别模糊实例并将其复制为投影中的多个点,将每个副本放置在其各自的邻域中。我们使用UMAP来展示结果,但我们的方法也推广到其他基于局部图的DR技术,并且我们表明,我们的方法揭示了投影中先前隐藏的邻域成员关系,减少了多个示例中的部分邻域嵌入,并得到了定量分析的支持。

英文摘要

Dimensionality Reduction (DR) methods are widely used to visualize high-dimensional data. One key task in DR-based analysis is discovering neighborhoods, which relies on analyzing the fine-grained local structure of a projection. However, DR is an inherently lossy process; no technique can perfectly preserve the high-dimensional relationships, and projections therefore contain visual artifacts. In this paper, we highlight a typically overlooked source of visual artifacts: ambiguous instances. These are instances that are highly similar to multiple mutually dissimilar neighborhoods in the high-dimensional space. Standard DR methods cannot faithfully project such instances, since each data instance is mapped to a single point in the visual space. As a result, such an instance is placed in only one of its neighborhoods (or in none at all), so only part of its neighborhood structure is represented. We call this distortion partial neighborhood embedding. In this paper, we introduce a graph-based approach that identifies ambiguous instances and replicates them as multiple points in the projection, placing each copy within its respective neighborhood. We use UMAP for our results, but our approach also generalizes to other local graph-based DR techniques, and we show that our approach reveals previously hidden neighborhood memberships in projections and reduces partial neighborhood embedding across multiple examples, and is further supported by quantitative analyses.

2605.23522 2026-05-25 cs.LG cs.AI cs.CV 版本更新

Precise: SDE-Consistent Stochastic Sampling for RL Post-Training of Flow-Matching Models

Precise: 用于流匹配模型强化学习后训练的SDE一致随机采样

Jade Zou, Tao Huang, Weijie Kong, Junzhe Li, Yue Wu, Qi Tian, Jiangfeng Xiong, Jianwei Zhang, Liefeng Bo, Zhao Zhong

发表机构 * Peking University(北京大学) Tencent Hunyuan(腾讯混元)

AI总结 该论文研究了如何通过强化学习(RL)对流匹配模型进行后训练,以提升其生成质量与提示对齐能力。核心方法是将确定性的采样轨迹转化为随机策略,通过设计一个符合随机微分方程(SDE)的采样器,实现探索与稳定性的平衡。提出的新采样器Precise在保持去噪轨迹SDE一致性的同时,有效减少了噪声干扰,实验表明其在奖励优化速度和生成质量上均优于现有方法。

详情
AI中文摘要

强化学习已成为提升扩散和流匹配生成器中提示对齐和感知质量的有效方法。将在线强化学习应用于流匹配的关键步骤是将确定性采样轨迹转化为随机策略,通常通过用随机微分方程替代逆向常微分方程来实现。随机采样器控制探索行为和去噪动力学,因此是策略的一部分,其设计会显著影响奖励优化性能。我们将采样器设计分解为两个相互依赖的组成部分:选择适量的随机探索,以及在强化学习中使用的少量步数下忠实地离散化得到的SDE。针对第一个组成部分,我们分析了去噪过程中探索与稳定性之间的固有张力,并推导出平衡两者的SDE调度。针对离散化挑战,我们使用一个玩具示例表明,现有采样器可能偏离流匹配过程,要么引入过多的离散化噪声,要么依赖不能保证收敛到数据分布的启发式规则。为解决这些问题,我们提出了Precise,一种新的随机采样器,平衡了有效探索与稳定性。关键地,Precise通过一种冻结干净潜变量后验均值的新颖近似,使去噪轨迹保持SDE一致,解决了标准采样器中的过度噪声问题。大量实验表明,该公式通过强化学习实现了显著更快且更稳定的奖励优化,达到了最先进的对齐分数(例如PickScore、HPSv2.1),同时匹配先前采样器的最佳域内性能所需的训练时间减少了13.1-53.2%。

英文摘要

Reinforcement learning (RL) has become an effective way to improve prompt alignment and perceptual quality in diffusion and flow-matching generators. A critical step for applying online RL to flow matching is turning the deterministic sampling trajectory into a stochastic policy, typically by replacing the reverse-time Ordinary Differential Equation (ODE) with a Stochastic Differential Equation (SDE). The stochastic sampler, controlling the exploration behavior and denoising dynamics, is thus part of the policy, and its design can significantly affect the reward optimization performance. We break down the sampler design into two interdependent components: choosing the right amount of stochastic exploration, and discretizing the resulting SDE faithfully at the small step counts used in RL. To address the first component, we analyze the inherent tension between exploration and stability in denoising and derive an SDE schedule that balances the two. Turning to the discretization challenge, we use a toy example to show that existing samplers can deviate from the flow-matching process, either by introducing excessive discretization noise or by relying on heuristic rules that do not guarantee convergence to the data distribution. To address these issues, we propose Precise, a new stochastic sampler that balances effective exploration with stability. Crucially, Precise keeps the denoising trajectory SDE-consistent through a novel approximation that freezes the clean-latent posterior mean, resolving the excess noise issue in standard samplers. Extensive experiments demonstrate that this formulation leads to significantly faster and more stable reward optimization via reinforcement learning, achieving state-of-the-art alignment scores (e.g., PickScore, HPSv2.1) while requiring 13.1-53.2% less wall-clock training time to match the best in-domain performance of prior samplers.

2605.23510 2026-05-25 cs.LG 版本更新

Learning partially observed systems with neural Hamiltonian ordinary differential equations

学习部分观测系统:神经哈密顿常微分方程

Sunniva Meltzer, Sølve Eidnes, Alexander Johannes Stasik

发表机构 * Department of Mathematics and Cybernetics, SINTEF Digital(SINTEF Digital数学与控制论系) Department of Physics, University of Oslo(奥斯陆大学物理系) Department of Data Science, Norwegian University of Life Science(挪威生命科学大学数据科学系)

AI总结 本文提出了一种名为神经哈密顿常微分方程(NHODE)的框架,用于从部分观测数据中学习动力系统。该方法结合了哈密顿神经网络和神经常微分方程,通过引入哈密顿结构确保能量守恒,并利用神经ODE的灵活性仅在观测变量上定义损失函数,从而在未观测变量上进行有效推理。实验表明,NHODE在多种复杂系统中表现出更高的预测精度和长期稳定性,能够同时捕捉观测和潜在动态,优于纯粹的数据驱动方法。

详情
AI中文摘要

从数据中学习动力系统时,嵌入物理结构可以约束解空间并提高泛化能力,但许多物理信息模型假设可以访问完整的系统状态。这限制了它们在部分观测场景中的使用,其中某些状态变量完全未被观测到,且必须在没有直接监督的情况下推断。在这里,我们提出了神经哈密顿常微分方程(NHODE),这是一个结合哈密顿神经网络(HNN)和神经常微分方程(neural ODE)的框架,用于从数据中学习部分观测的动力系统。哈密顿结构通过构造保证能量守恒,而神经常微分方程框架则提供了灵活的训练过程,使得损失可以仅定义在观测变量上。我们还通过对称性感知的坐标变换和可分离的能量公式,融入了额外的物理约束。该框架在复杂度递增的系统上进行了评估,从线性和非线性质量-弹簧系统到混沌三体问题。在所有示例中,嵌入的物理结构越多,预测的准确性和长期稳定性就越好。即使在最具挑战性的情况下,NHODE框架也能捕捉到观测和潜在动力学,而纯数据驱动的基线则变得不稳定。

英文摘要

When learning dynamical systems from data, embedding physical structure can constrain the solution space and improve generalization, but many physics-informed models assume access to the full system state. This limits their use in partially observed settings, where some state variables are completely unobserved and must be inferred without direct supervision. Here, we present neural Hamiltonian ordinary differential equations (NHODE), a framework that combines Hamiltonian neural networks (HNNs) with neural ordinary differential equations (neural ODEs) to learn partially observed dynamical systems from data. The Hamiltonian structure enforces energy conservation by construction, while the neural ODE framework enables a flexible training procedure that allows the loss to be defined only on observed variables. We also incorporate additional physical constraints through symmetry-aware coordinate transformations and separable energy formulations. The framework is evaluated on systems of increasing complexity, from linear and nonlinear mass-spring systems to the chaotic three-body problem. Across all examples, increasing the amount of embedded physical structure improves the accuracy and long-horizon stability of the predictions. Even in the most challenging regimes, the NHODE framework captures both observed and latent dynamics, whereas purely data-driven baselines become unstable.
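
The two ingredients, Hamiltonian dynamics and a loss defined only on observed variables, can be sketched on a linear mass-spring system. This toy uses a fixed analytic Hamiltonian with a single unknown stiffness in place of a neural network, and a leapfrog integrator in place of a neural ODE solver; both substitutions, and all names, are assumptions made for illustration.

```python
def hamiltonian(q, p, mass=1.0, stiffness=1.0):
    """Separable energy: kinetic term in p plus potential term in q."""
    return p * p / (2.0 * mass) + 0.5 * stiffness * q * q


def leapfrog(q, p, dt, steps, mass=1.0, stiffness=1.0):
    """Symplectic integration of dq/dt = dH/dp, dp/dt = -dH/dq."""
    trajectory = [(q, p)]
    for _ in range(steps):
        p -= 0.5 * dt * stiffness * q   # half step in momentum
        q += dt * p / mass              # full step in position
        p -= 0.5 * dt * stiffness * q   # half step in momentum
        trajectory.append((q, p))
    return trajectory


def observed_loss(predicted, observed_q):
    """Mean squared error on position only; momentum is never supervised."""
    errors = [(q - y) ** 2 for (q, _), y in zip(predicted, observed_q)]
    return sum(errors) / len(errors)


# Data: only positions are recorded; the true stiffness is 2.0.
truth = leapfrog(1.0, 0.0, dt=0.01, steps=500, stiffness=2.0)
observed_q = [q for q, _ in truth]
loss_wrong = observed_loss(leapfrog(1.0, 0.0, 0.01, 500, stiffness=1.0), observed_q)
loss_right = observed_loss(leapfrog(1.0, 0.0, 0.01, 500, stiffness=2.0), observed_q)
energy_drift = max(abs(hamiltonian(q, p, stiffness=2.0) - 1.0) for q, p in truth)
```

The position-only loss still separates the correct stiffness from the wrong one, and the momentum trajectory is recovered as a by-product of integrating the Hamiltonian system, which mirrors how latent variables can be inferred without direct supervision.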

2605.23504 2026-05-25 cs.LG cs.AI 版本更新

VACE: Learning Geometrically Structured Representations for Time Series Anomaly Detection

VACE:学习几何结构化表示用于时间序列异常检测

Alberto D. Cencillo, Leonardo Concepción, Isaac Triguero, Julián Luengo

发表机构 * Andalusian Research Institute in Data Science and Computational Intelligence (DaSCI)(安达卢西亚数据科学与计算智能研究所) Department of Computer Science and Artificial Intelligence (DECSAI), University of Granada(格拉纳达大学计算机科学与人工智能系)

AI总结 该论文提出了一种名为VACE的自监督异常检测方法,用于多变量时间序列中的异常检测。VACE通过速度对齐的通道嵌入方式,学习具有紧凑且方向一致结构的正常表示,从而更准确地识别异常。该方法无需负样本和合成异常,通过速度一致性目标训练编码器,使正常轨迹在嵌入空间中保持局部平滑和对齐。实验表明,VACE在多个基准数据集上取得了优于复杂方法的优异性能。

Comments 16 pages, 5 figures

详情
AI中文摘要

多变量时间序列中的异常检测是广泛实际应用中的关键任务,其中异常行为罕见、标签不可用且漏检成本高昂。核心挑战在于学习足够精确的正常性表征以标记偏差。表示自监督学习(通常通过对比方法)通过将时间补丁嵌入到潜在空间来解决这一问题,其中正常性占据一个定义明确的区域,异常通过几何偏差检测。然而,对比方法通过配对采样启发式间接塑造该空间,无法对基于距离评分所需的几何结构进行显式控制。这意味着正常表示的紧凑程度以及距离是否具有方向意义。我们提出VACE(速度对齐通道嵌入),一种自监督异常检测方法,将正常性表示为嵌入空间中紧凑且方向一致的区域。为此,VACE通过速度一致性目标训练通道感知编码器,无需负样本和合成异常,使得正常轨迹局部平滑且对齐。在测试时,马氏距离位置得分和速度库方向得分相乘,标记同时偏离分布和动态异常的点。尽管方法简单,VACE在严格评估下于TSB-AD-M上实现了最先进性能,显著优于使用更大预算训练的复杂方法。

英文摘要

Anomaly detection in multivariate time series is a critical task across a wide range of real-world applications, where abnormal behaviour is rare, labels are unavailable, and the cost of a miss is high. The central challenge is learning a characterisation of normality precise enough to flag deviations. Representation self-supervised learning, typically through contrastive approaches, addresses this by embedding temporal patches into a latent space where normality occupies a well-defined region, with anomalies detected by geometric deviation. However, contrastive approaches shape this space indirectly through pair-sampling heuristics, providing no explicit control over the geometric structure that distance-based scoring requires. This means how tightly normal representations are grouped, and whether distances are directionally meaningful. We present VACE (Velocity-Aligned Channel Embeddings), a self-supervised anomaly detection method that represents normality as a compact, directionally coherent region in the embedding space. To this end, VACE trains a channel-aware encoder through a velocity-consistency objective, with no negatives and no synthetic anomalies, so that normal trajectories are locally smooth and aligned. At test time, a Mahalanobis positional score and a velocity-bank directional score are combined multiplicatively, flagging points that are simultaneously off-distribution and dynamically atypical. Despite its simplicity, VACE achieves state-of-the-art performance on TSB-AD-M under rigorous evaluation, significantly outperforming more complex methods trained on substantially larger budgets.
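The multiplicative test-time score can be illustrated with a small sketch. Everything below is an assumption made for illustration: the abstract does not give the exact formulas, so the positional score here uses a diagonal covariance, the directional score is one minus the best cosine match against a hypothetical bank of normal velocities, and the embeddings are hand-picked 2-D points rather than encoder outputs.

```python
import math

def mahalanobis_diag(z, mean, var):
    """Mahalanobis distance under a diagonal covariance (simplifying assumption)."""
    return math.sqrt(sum((zi - mi) ** 2 / vi for zi, mi, vi in zip(z, mean, var)))

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0

def directional_score(velocity, bank):
    """1 - best cosine match against velocities seen on normal data."""
    return 1.0 - max(cosine(velocity, b) for b in bank)

def anomaly_score(z_prev, z, mean, var, bank):
    """Product of positional and directional scores for one embedding step."""
    velocity = [b - a for a, b in zip(z_prev, z)]
    return mahalanobis_diag(z, mean, var) * directional_score(velocity, bank)

mean, var = [0.0, 0.0], [1.0, 1.0]
bank = [[1.0, 0.0], [0.0, 1.0]]          # hypothetical "normal" directions

normal_step = anomaly_score([0.0, 0.0], [0.5, 0.0], mean, var, bank)
far_but_aligned = anomaly_score([4.0, 0.0], [5.0, 0.0], mean, var, bank)
far_and_reversed = anomaly_score([5.0, 0.0], [4.0, -1.0], mean, var, bank)
```

With a product, a point that is far from the mean but moves in a familiar direction still scores zero, which matches the abstract's description of flagging points that are off-distribution and dynamically atypical at the same time.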

2605.23476 2026-05-25 cs.LG cond-mat.dis-nn cond-mat.mtrl-sci math.OC 版本更新

Non-normal spectral signatures of instability in neural network training dynamics

神经网络训练动态中不稳定性的非正规谱特征

Souvik Ghosh

发表机构 * Department of Physics, National Sun Yat-sen University, Kaohsiung 80424, Taiwan(物理系,国立中山大学,高雄 80424,台湾)

AI总结 本文研究了深度网络训练过程中常见的不稳定性问题,如损失尖峰、振荡收敛和梯度异常,并通过非正规算子理论提供了理论解释。研究发现,常用优化器的线性化更新算子普遍是非正规的,其非正规性由Hessian矩阵与自适应预条件器或动量结构之间的相互作用引起。通过非正规稳定性理论,作者提出了一个基于伪谱的保守前兆界,并证明了条件数κ(V)可以作为训练过程中瞬时放大现象的早期预警指标,为理解自适应优化算法的稳定性提供了新的诊断工具和理论框架。

Comments 9 pages, 3 figures

详情
AI中文摘要

深度网络中的训练不稳定性(损失尖峰、振荡收敛和梯度病态)在经验上普遍存在,但缺乏严格的算子理论解释。我们证明,实际使用的优化器的线性化更新算子通常是非正规的:对于Adam,非正规性由Hessian与对角自适应预条件子之间的对易子[H, M]控制;而对于带动量的SGD,它源于更新映射的增广状态空间结构。将非正规稳定性理论应用于这些算子,我们推导出一个保守的伪谱前兆界,其中κ(V)作为瞬态放大的早期预警指标,即使谱半径仍小于1;并且我们建立了更新算子的例外点(exceptional points)作为该框架中κ(V) → ∞的极限情况。在两层网络上的数值实验证实,谱半径ρ(J)无法区分稳定和不稳定的训练阶段,而κ(V)能将它们分开约一个数量级,用非正规放大的连续严重性度量补充了经典的锐度准则。这些结果确立了非厄米算子理论作为神经网络优化稳定性中一个有用且未被充分探索的框架,为理解自适应优化稳定性提供了诊断语言和概念验证基准。

英文摘要

Training instabilities in deep networks - loss spikes, oscillatory convergence, and gradient pathologies - are empirically prevalent but lack a rigorous operator-theoretic explanation. We show that the linearized update operators for practically used optimizers are generically non-normal: for Adam, non-normality is controlled by the commutator [H, M] between the Hessian and the diagonal adaptive preconditioner, while for SGD with momentum it arises from the augmented state-space structure of the update map. Applying non-normal stability theory to these operators, we derive a conservative pseudospectral precursor bound in which κ(V) serves as an early-warning indicator of transient amplification even when the spectral radius remains below one, and we establish that exceptional points of the update operator appear as the κ(V) → ∞ limiting case of this framework. Numerical experiments on two-layer networks confirm that the spectral radius ρ(J) provides no separation between stable and unstable training phases while κ(V) separates them by approximately one order of magnitude, complementing the classical sharpness criterion with a continuous severity measure of non-normal amplification. These results establish non-Hermitian operator theory as a useful and underexplored framework for neural network optimization stability, offering a diagnostic language and proof-of-concept benchmark for understanding adaptive optimization stability.
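Why the spectral radius can miss transient amplification is easy to see on a 2x2 example. The sketch below is an assumption-laden toy, not the paper's experiment: it takes an upper-triangular matrix J = [[a, c], [0, b]] as a stand-in for a linearized update operator, computes the condition number κ(V) of its unit-norm eigenvector matrix in closed form, and tracks the largest norm reached by repeated application of J. The function names are hypothetical.

```python
import math

def eigvec_condition(a, b, c):
    """kappa(V) for J = [[a, c], [0, b]] with a != b, using unit-norm eigenvectors."""
    # Eigenvectors: (1, 0) for eigenvalue a and (c, b - a) for eigenvalue b.
    n = math.hypot(c, b - a)
    v = [[1.0, c / n], [0.0, (b - a) / n]]
    frob2 = sum(x * x for row in v for x in row)
    det = v[0][0] * v[1][1] - v[0][1] * v[1][0]
    disc = math.sqrt(max(frob2 * frob2 - 4.0 * det * det, 0.0))
    s_max2 = 0.5 * (frob2 + disc)          # largest squared singular value
    return s_max2 / abs(det)               # s_max / s_min, since s_max * s_min = |det|

def peak_growth(a, b, c, steps=60):
    """Largest ||J^k x|| over k = 0..steps for the unit start vector x = (0, 1)."""
    x, y = 0.0, 1.0
    peak = 1.0
    for _ in range(steps):
        x, y = a * x + c * y, b * y
        peak = max(peak, math.hypot(x, y))
    return peak

a, b = 0.9, 0.8                            # spectral radius is 0.9 in both cases
kappa_normal, peak_normal = eigvec_condition(a, b, 0.0), peak_growth(a, b, 0.0)
kappa_skewed, peak_skewed = eigvec_condition(a, b, 5.0), peak_growth(a, b, 5.0)
```

Both matrices share the eigenvalues 0.9 and 0.8, so both are asymptotically stable, yet the coupled one amplifies the start vector by more than an order of magnitude before decaying, and its κ(V) is correspondingly large.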

2605.23471 2026-05-25 cs.LG cs.AI 版本更新

CBANet: A Compact Attention-Based CNN-BiLSTM Network for Aggressive Driving Event Detection

CBANet:一种用于激进驾驶事件检测的紧凑型注意力CNN-BiLSTM网络

Hanadi Alhamdan, Ghadah Alosaimi, Amir Atapour-Abarghouei, Farshad Arvin

发表机构 * Department of Computer Science, Princess Nourah bint Abdulrahman University(努拉·宾特·阿卜杜勒拉赫曼公主大学计算机科学系) Department of Computer Science, Durham University(杜伦大学计算机科学系) Department of Computer Science, Imam Mohammad Ibn Saud Islamic University(伊玛目穆罕默德·本·沙特伊斯兰大学计算机科学系)

AI总结 本文提出了一种名为CBANet的紧凑型注意力机制结合CNN-BiLSTM的深度学习框架,用于检测激进驾驶事件。该方法通过构建工程化的动态特征来捕捉转向、加速和制动行为,并采用基于SMOTE的过采样与类别加权损失相结合的稳定训练策略,以应对自然驾驶数据中激进事件极度稀有的问题。实验表明,该方法在少数类召回率和安全关键F分数等指标上显著优于传统深度学习方法,同时保持了较高的计算效率。

Comments 8 pages, 4 figures, 4 tables. Submitted to IJCNN/WCCI 2026. CBANet: A compact attention-based CNN-BiLSTM framework for aggressive driving event detection using multivariate vehicle dynamics signals. Code available at https://github.com/halhamdan/CBANet

详情
AI中文摘要

激进驾驶是交通事故的主要原因,对道路安全构成严重威胁。尽管深度学习方法在从车辆传感器数据检测危险驾驶行为方面显示出有希望的结果,但它们在现实条件下的性能通常受到严重数据不平衡、驾驶员间巨大差异以及缺乏物理可解释的车辆动力学表示的限制。在本文中,我们提出了一种增强的深度学习框架,用于使用多变量车辆动力学信号进行激进驾驶检测。该方法不仅依赖原始测量,还构建了捕捉转向、加速和制动行为的工程动力学特征。为了解决自然驾驶数据中激进事件的极端稀少性,我们引入了一种稳定的训练策略,结合了基于SMOTE的受控过采样和类别加权损失公式,并评估了用于不平衡处理的焦点损失变体。此外,采用基于类别特定阈值校准的安全导向决策策略,以更好地反映现实应用中漏检和误报的不对称风险。该框架在新收集的自然驾驶数据集上进行了评估。大量实验表明,所提出的方法在保持实际计算效率的同时,在少数类召回率和安全关键F-score指标上始终优于标准深度学习基线。代码:https://github.com/halhamdan/CBANet

英文摘要

Aggressive driving is a major cause of traffic accidents and poses a serious threat to road safety. Although deep learning methods have shown promising results in detecting risky driving behaviours from vehicle sensor data, their performance in real-world conditions is often limited by severe data imbalance, large variability between drivers, and the lack of physically interpretable vehicle dynamics representations. In this paper, we propose an enhanced deep learning framework for aggressive driving detection using multivariate vehicle dynamics signals. Instead of relying solely on raw measurements, the proposed approach constructs engineered dynamic features that capture steering, acceleration, and braking behaviour. To address the extreme rarity of aggressive events in naturalistic driving data, we introduce a stable training strategy that combines controlled SMOTE-based oversampling with a class-weighted loss formulation, and evaluate focal loss variants for imbalance handling. Furthermore, a safety-oriented decision strategy based on class-specific threshold calibration is adopted to better reflect the asymmetric risks of missed detections and false alarms in real-world applications. The proposed framework is evaluated on a newly collected naturalistic driving dataset. Extensive experiments show that the proposed method consistently outperforms standard deep learning baselines with significant improvements in minority-class recall and safety-critical F-score metrics while maintaining practical computational efficiency. Code: https://github.com/halhamdan/CBANet
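Two of the imbalance-handling ingredients named above, a class-weighted loss and a calibrated decision threshold, can be sketched independently of the network. The code is a minimal illustration under stated assumptions: inverse-frequency weights, a binary task, an F-beta criterion with beta = 2 to favour recall, and made-up validation scores. It is not taken from the CBANet repository, and SMOTE oversampling is not shown.

```python
import math

def class_weights(labels):
    """Inverse-frequency weights, n / (2 * count), for binary labels 0/1."""
    n = len(labels)
    pos = sum(labels)
    return {0: n / (2.0 * (n - pos)), 1: n / (2.0 * pos)}

def weighted_bce(probs, labels, weights, eps=1e-12):
    """Class-weighted binary cross-entropy averaged over samples."""
    total = 0.0
    for p, y in zip(probs, labels):
        p = min(max(p, eps), 1.0 - eps)
        total -= weights[y] * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(labels)

def f_beta(probs, labels, threshold, beta=2.0):
    """F-beta of the rule 'predict aggressive when p >= threshold'."""
    tp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 1)
    fp = sum(1 for p, y in zip(probs, labels) if p >= threshold and y == 0)
    fn = sum(1 for p, y in zip(probs, labels) if p < threshold and y == 1)
    denom = (1 + beta ** 2) * tp + beta ** 2 * fn + fp
    return (1 + beta ** 2) * tp / denom if denom else 0.0

def calibrate_threshold(probs, labels, beta=2.0):
    """Pick the grid threshold that maximises F-beta on validation data."""
    grid = [i / 20 for i in range(1, 20)]
    return max(grid, key=lambda t: f_beta(probs, labels, t, beta))

# Hypothetical validation set: 8 normal windows, 2 aggressive windows.
labels = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
probs = [0.05, 0.10, 0.10, 0.20, 0.20, 0.30, 0.35, 0.60, 0.40, 0.80]
weights = class_weights(labels)
loss = weighted_bce(probs, labels, weights)
threshold = calibrate_threshold(probs, labels)
```

On this toy data the calibrated threshold falls below the default 0.5, trading one false alarm for catching both rare events, which is the asymmetry the safety-oriented decision strategy is meant to encode.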

2605.23470 2026-05-25 cs.LG cs.AI cs.CE 版本更新

Learning Individual Dynamics from Sparse Cross-Sectional Snapshots

从稀疏横截面快照中学习个体动力学

Christian Lagemann, Kai Lagemann, Steven L. Brunton, Sach Mukherjee

发表机构 * Statistics and Machine Learning, German Center for Neurodegenerative Diseases (DZNE)(统计与机器学习,德国神经退行性疾病中心(DZNE)) MediaTek Research(联发科技研究) Department of Mechanical Engineering & AI Institute in Dynamic Systems, University of Washington, Seattle(机械工程与人工智能动态系统研究所,华盛顿大学,西雅图) DZNE & University of Bonn, Bonn, Germany and University of Cambridge, Cambridge, United Kingdom(DZNE与波恩大学,波恩,德国和剑桥大学,剑桥,英国)

AI总结 该研究旨在从稀疏的横截面快照中学习个体的动态演化过程,传统方法在数据稀疏或完全横截面的情况下难以准确推断个体的连续时间轨迹。本文提出了一种名为CADENCE的概率框架,通过将潜在动态与静态个体上下文关联,实现了从孤立快照中恢复个体轨迹。该方法结合了基于分数的空域编码器和软专家混合路由机制,提供了单时间点轨迹推断的可识别性保证,并在多个基准测试中表现出优于现有序列模型的性能。

详情
AI中文摘要

预测一个动力学单元如何随时间演化——例如个体如何衰老、流行病如何传播、物理系统如何退化——通常需要密集的纵向追踪。当只有极其稀疏或完全横截面的数据可用时,推断个体化的连续时间轨迹本质上是病态的。现有方法迫使严格妥协:序列模型(如潜在ODE)需要密集的纵向数据,而横截面方法(如最优传输、基于流匹配的)映射聚合群体,丢失了个体动力学。在本文中,我们证明这种二分法可以被打破。我们介绍CADENCE,一个原则性的概率框架,通过将潜在动力学锚定到静态的个体级上下文,从孤立快照中恢复连续的个体轨迹。我们为单时间点轨迹推断提供了新颖的可识别性保证。通过结合基于分数的空间编码器(双射概率流ODE)以消除微分同胚歧义,以及软混合专家(SMoE)路由器,我们证明个体动力学参数和路由函数是联合可识别的。在一系列涵盖物理系统到真实世界生物数据的基准测试中,CADENCE严格在具有上下文结构的极端稀疏快照上训练,其性能匹配或超过了在密集全轨迹数据上训练的最先进序列模型。

英文摘要

Predicting how a dynamical unit evolves over time - how an individual ages, an epidemic spreads, or a physical system degrades - typically requires dense longitudinal tracking. When only extremely sparse or entirely cross-sectional data is available, inferring individualized, continuous-time trajectories is fundamentally ill-posed. Existing methods force a strict compromise: sequence models (e.g. latent ODEs) require dense longitudinal data, while cross-sectional methods (e.g. optimal transport, flow matching-based) map aggregate populations, losing individual dynamics. In this paper, we demonstrate that this dichotomy can be broken. We introduce CADENCE, a principled probabilistic framework that recovers continuous individual trajectories from isolated snapshots by anchoring latent dynamics to static, individual-level contexts. We provide novel identifiability guarantees for single-timepoint trajectory inference. By combining a score-based spatial encoder (bijective Probability Flow ODE) to eliminate diffeomorphic ambiguities with a Soft Mixture-of-Experts (SMoE) router, we show that individual dynamical parameters and routing function are jointly identifiable. Across a suite of benchmarks spanning physical systems to real-world biological data, CADENCE, trained strictly on extremely sparse snapshots with context structure, matches or exceeds the performance of state-of-the-art sequential models trained on dense, full-trajectory data.

2605.23467 2026-05-25 cs.LG 版本更新

S$^3$GNN: Efficient Global Mixing and Local Message Passing for Long-Range Graph Learning

S$^3$GNN:用于长程图学习的高效全局混合与局部消息传递

Dai Shi, Luke Thompson, Linhan Luo, Lequan Lin, Andi Han, Junbin Gao, José Miguel Hernández Lobato

发表机构 * Department of Engineering, University of Cambridge, Cambridge, UK.(剑桥大学工程系,英国剑桥) University of Sydney, Australia.(悉尼大学,澳大利亚)

AI总结 本文针对图神经网络在捕捉长距离依赖时面临的信息瓶颈问题,提出了一种名为S$^3$GNN的新方法。该方法通过引入轻量级的全局信息混合机制,在不依赖严格理论假设的前提下有效缓解了过度压缩现象。实验表明,S$^3$GNN在多个领域任务中实现了显著的性能提升,并大幅减少了参数数量。

详情
AI中文摘要

消息传递神经网络(MPNN)在捕获长程依赖时常常遭受信息瓶颈,导致过挤压(OSQ)现象。除了空间连通性增强(例如,重连)外,最近的研究表明,谱滤波可以产生强大的长程学习结果,因为谱算子能够实现全局信息混合,从而缓解OSQ。这些方法通过稳定深层传播中的雅可比能量或在强理论假设下保证OSQ缓解来实现这一点。我们重新审视这些结论,并表明相关的雅可比敏感性下界在实践中通常难以实现。然后,我们提出S$^3$GNN,它通过以显著较低的计算复杂度轻量级地重新引入被忽略的组件来缓解OSQ,而不需要这些限制性假设,同时特征变换的标准稳定性约束在我们的新动态下仍然有效。跨不同领域(例如,长程基准、KGQA和基于网格的流体动力学)的大量实验表明,S$^3$GNN在参数减少多达50%的情况下实现了高达一个数量级的误差降低。我们的代码可在https://github.com/EEthanShi/S3-GNN.git找到。

英文摘要

Message-passing neural networks (MPNNs) often suffer from an information bottleneck when capturing long-range dependencies, leading to the oversquashing (OSQ) phenomenon. Alongside spatial connectivity enrichment (e.g., rewiring), recent studies have shown that spectral filtering can yield strong long-range learning outcomes, as spectral operators enable global information mixing that alleviates OSQ. These approaches achieve this either by stabilizing the Jacobian energies in deep propagation or by guaranteeing OSQ mitigation under strong theoretical assumptions. We revisit these conclusions and show that the associated Jacobian sensitivity lower bound is generally difficult to achieve in practice. We then propose S$^3$GNN, which mitigates OSQ without such restrictive assumptions by lightweightly reintroducing omitted components with substantially lower computational complexity, while standard stability constraints on feature transformations remain effective under our new dynamics. Extensive experiments across diverse domains (e.g., long-range benchmarks, KGQA, and mesh-based fluid dynamics) demonstrate that S$^3$GNN achieves up to an order-of-magnitude error reduction with up to 50% fewer parameters. Our code can be found in https://github.com/EEthanShi/S3-GNN.git.

2605.23464 2026-05-25 cs.LG 版本更新

Unextractable Protocol Models: Collaborative Training and Inference without Weight Materialization

不可提取协议模型:无需权重物化的协作训练与推理

Alexander Long, Chamin Hewa Koneputugodage, Thalaiyasingam Ajanthan, Yan Zuo, Gil Avraham, Violetta Shevchenko, Hadi Mohaghegh Dolatabadi, Sameera Ramasinghe

发表机构 * Pluralis Research(Pluralis研究)

AI总结 本文研究了在去中心化环境中协作训练和推理大规模神经网络的问题,提出了一种名为“不可提取协议模型(UPMs)”的新框架。该方法通过在参与者之间定期注入时间变化的可逆变换,使得模型各部分在不同时间步上不兼容,从而防止权重被完整提取。实验表明,UPMs在保持模型性能的同时有效提升了安全性,并分析了其在训练和推理中的开销及对各类攻击的防御能力。

Comments Accepted at NeurIPS 2025. 34 pages, 6 figures (5 in main body, 1 in appendix). Alexander Long and Chamin Hewa Koneputugodage contributed equally

详情
Journal ref
Advances in Neural Information Processing Systems 38, pp. 18677-18713 (NeurIPS 2025)
AI中文摘要

我们考虑一个去中心化设置,其中参与者协作训练和提供大型神经网络服务,且每个参与者只处理模型的一个子集。在此设置中,我们探索了不可物化权重的可能性,即完整权重集永远不会对任何参与者可用。我们引入了不可提取协议模型(UPMs):一种利用分片模型设置来确保参与者持有的模型分片(即子集)在不同时间步不兼容的训练和推理框架。UPMs 在参与者边界定期注入时变、随机、可逆的变换;保持整体网络功能,但使跨时间组装变得不连贯。在 Qwen-2.5-0.5B 和 Llama-3.2-1B 上,10,000 次变换使 FP32 困惑度保持不变(ΔPPL < 0.01;Jensen-Shannon 漂移 < 4×10^{-5}),并且我们展示了如何控制低精度数据类型的增长。每 30 秒应用一次变换在推理时增加 3% 的延迟、0.1% 的带宽和 10% 的 GPU 内存开销,而训练开销降至 1.6% 的时间和 < 1% 的内存。我们考虑了多种攻击,表明直接攻击的要求不切实际且易于防御,并且基于梯度的拼接分区微调消耗了从头训练所需 token 的 ≥ 60%。通过使模型能够协作训练但不可提取,UPMs 使得在社区驱动的去中心化训练中嵌入程序化激励机制变得可行。

英文摘要

We consider a decentralized setup in which the participants collaboratively train and serve a large neural network, and where each participant only processes a subset of the model. In this setup, we explore the possibility of unmaterializable weights, where a full weight set is never available to any one participant. We introduce Unextractable Protocol Models (UPMs): a training and inference framework that leverages the sharded model setup to ensure model shards (i.e., subsets) held by participants are incompatible at different time steps. UPMs periodically inject time-varying, random, invertible transforms at participant boundaries; preserving the overall network function yet rendering cross-time assemblies incoherent. On Qwen-2.5-0.5B and Llama-3.2-1B, 10,000 transforms leave FP32 perplexity unchanged ($Δ$PPL $< 0.01$; Jensen-Shannon drift $< 4 \times 10^{-5}$), and we show how to control growth for lower precision datatypes. Applying a transform every 30s adds 3% latency, 0.1% bandwidth, and 10% GPU-memory overhead at inference, while training overhead falls to 1.6% time and $< 1$% memory. We consider several attacks, showing that the requirements of direct attacks are impractical and easy to defend against, and that gradient-based fine-tuning of stitched partitions consumes $\geq 60$% of the tokens required to train from scratch. By enabling models to be collaboratively trained yet not extracted, UPMs make it practical to embed programmatic incentive mechanisms in community-driven decentralized training.
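The core trick, injecting an invertible transform at a participant boundary so the overall function is preserved while shards from different times stop fitting together, can be shown on a two-shard linear network. This is only a sketch under strong assumptions: the layers are purely linear 2x2 matrices, the transform is fixed rather than random, and nonlinearities, precision growth, and the actual protocol are out of scope. All names are hypothetical.

```python
def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]

def inverse_2x2(t):
    det = t[0][0] * t[1][1] - t[0][1] * t[1][0]
    return [[t[1][1] / det, -t[0][1] / det], [-t[1][0] / det, t[0][0] / det]]

def forward(w1, w2, x):
    """Two-shard network: participant A applies w1, participant B applies w2."""
    return matmul(w2, matmul(w1, x))

def transform_shards(w1, w2, t):
    """Fold T into shard A and T^-1 into shard B, so that w2' w1' = w2 w1."""
    return matmul(t, w1), matmul(w2, inverse_2x2(t))

w1 = [[1.0, 2.0], [0.0, 1.0]]
w2 = [[2.0, 0.0], [1.0, 1.0]]
x = [[1.0], [1.0]]
t = [[2.0, 1.0], [1.0, 1.0]]                 # fixed here; random in the paper's setting

w1_new, w2_new = transform_shards(w1, w2, t)
y_original = forward(w1, w2, x)
y_transformed = forward(w1_new, w2_new, x)   # same function, different shards
y_stitched = forward(w1, w2_new, x)          # old shard A combined with new shard B
```

An attacker who collects shard A before the transform and shard B after it ends up with `y_stitched`, which no longer matches the model's output; this is the cross-time incoherence the abstract refers to, in its simplest form.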

2605.23449 2026-05-25 cs.LG cs.CV math.AG 版本更新

Commutator-Induced Uncertainty in VAEs

VAE中的换位子引发的不确定性

Tahereh Dehdarirad, Michael Felsberg, Gabriel Eilertsen, Ziliang Xiong

发表机构 * Computer Vision and Learning Systems (CVL), Linköping University, Sweden(计算机视觉与学习系统(CVL),林雪平大学,瑞典) Department of Science and Technology, Linköping University, Sweden(科学与技术系,林雪平大学,瑞典)

AI总结 变分自编码器(VAEs)在学习非交换结构时常常面临不确定性问题。本文提出了一种基于李群的VAE框架,通过结合几何与代数视角分析不确定性,将离散生成因素与连续几何变换分离。该方法通过诊断代数非交换性并调整解码器对非交换结构的敏感度,提升了重构质量与潜在空间结构的一致性,在多个基准数据集上表现出优越的重构与潜在空间遍历性能。

详情
AI中文摘要

变分自编码器(VAE)通常难以表示学习到的潜在空间中的非交换结构。对称感知的VAE通常通过代数正则化强制交换性来解决这个问题,这适用于交换变换群,但当非交换性是数据内在特性时会抑制有意义的非交换结构。我们认为,非交换性应被明确诊断并反映在重建行为中。我们引入了一个李群VAE框架,该框架结合了几何和代数视角下的不确定性,同时将离散生成因子与连续几何变换分开。在第一阶段,模型在没有结构约束的情况下进行训练,同时通过有限Baker-Campbell-Hausdorff偏差测量代数非交换性,并通过重建顺序交换测试测量解码器顺序敏感性。这些诊断揭示了在无约束训练下潜在非交换性与重建行为之间的尺度不匹配。在第二阶段,我们引入了一个具有数据驱动校准常数的变形稳定性约束,使解码器敏感性与代数非交换性对齐。我们在dSprites、3DShapes、3DCars和CelebA上评估了该框架,并与通用和对称感知基线(包括beta-VAE、CLG-VAE和CFASL)进行了比较。在合成基准上,该方法提高了重建质量,并产生了与潜在非交换结构更一致的解码器行为。定性分析显示了更清晰的顺序依赖潜在组合和更稳定的重建。在CelebA上,该模型比CFASL产生了更忠实的重建和因子特定的潜在遍历,同时在学习的潜在方向之间也表现出有意义的顺序依赖交互。

英文摘要

Variational autoencoders (VAEs) often struggle to represent non-commutative structure in learned latent spaces. Symmetry-aware VAEs commonly address this issue by enforcing commutativity through algebraic regularization, which is appropriate for commutative transformation groups but can suppress meaningful non-commutative structure when it is intrinsic to the data. We argue that non-commutativity should instead be explicitly diagnosed and reflected in reconstruction behavior. We introduce a Lie Group VAE framework that combines geometric and algebraic perspectives on uncertainty while separating discrete generative factors from continuous geometric transformations. In a first phase, the model is trained without structural constraints while algebraic non-commutativity is measured through finite Baker-Campbell-Hausdorff deviations and decoder order sensitivity is measured through reconstruction order-swap tests. These diagnostics reveal a scale mismatch between latent non-commutativity and reconstruction behavior under unconstrained training. In a second phase, we introduce a deformation-stability constraint with a data-driven calibration constant that aligns decoder sensitivity with algebraic non-commutativity. We evaluate the framework on dSprites, 3DShapes, 3DCars, and CelebA against generic and symmetry-aware baselines, including beta-VAE, CLG-VAE, and CFASL. Across synthetic benchmarks, the method improves reconstruction quality and yields decoder-level behavior more consistent with latent non-commutative structure. Qualitative analyses show clearer order-dependent latent compositions and more stable reconstructions. On CelebA, the model yields more faithful reconstructions and factor-specific latent traversals than CFASL, while also exhibiting meaningful order-dependent interactions between learned latent directions.
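Algebraic non-commutativity and order sensitivity can both be measured on small matrices. The sketch below is an illustration only: it assumes so(3) rotation generators as the latent transformations, a truncated Taylor series for the matrix exponential, and a Frobenius-norm order-swap gap as the deviation measure. The paper's exact finite Baker-Campbell-Hausdorff deviation and its decoder-side test are not specified in the abstract and are not reproduced here.

```python
import math

def matmul(a, b):
    n = len(a)
    return [[sum(a[i][k] * b[k][j] for k in range(n)) for j in range(n)] for i in range(n)]

def commutator(a, b):
    """[A, B] = AB - BA."""
    ab, ba = matmul(a, b), matmul(b, a)
    return [[ab[i][j] - ba[i][j] for j in range(len(a))] for i in range(len(a))]

def expm(a, terms=30):
    """Matrix exponential by truncated Taylor series (adequate for small norms)."""
    n = len(a)
    result = [[1.0 if i == j else 0.0 for j in range(n)] for i in range(n)]
    term = [row[:] for row in result]
    for k in range(1, terms):
        term = [[x / k for x in row] for row in matmul(term, a)]
        result = [[result[i][j] + term[i][j] for j in range(n)] for i in range(n)]
    return result

def swap_gap(a, b, s):
    """Frobenius norm of exp(sA) exp(sB) - exp(sB) exp(sA)."""
    ea = expm([[s * x for x in row] for row in a])
    eb = expm([[s * x for x in row] for row in b])
    ab, ba = matmul(ea, eb), matmul(eb, ea)
    return math.sqrt(sum((ab[i][j] - ba[i][j]) ** 2
                         for i in range(len(a)) for j in range(len(a))))

# so(3) generators: infinitesimal rotations about the x, y and z axes.
LX = [[0.0, 0.0, 0.0], [0.0, 0.0, -1.0], [0.0, 1.0, 0.0]]
LY = [[0.0, 0.0, 1.0], [0.0, 0.0, 0.0], [-1.0, 0.0, 0.0]]
LZ = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 0.0]]
D1 = [[1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 3.0]]
D2 = [[0.5, 0.0, 0.0], [0.0, 0.0, 0.0], [0.0, 0.0, -1.0]]

rotation_gap = swap_gap(LX, LY, 0.1)     # non-commuting pair
scaling_gap = swap_gap(D1, D2, 0.1)      # commuting (diagonal) pair
```

For small step size s the gap behaves like s² times the norm of [A, B], so it vanishes for commuting factors and grows with the commutator otherwise; a decoder-level order-swap test would compare reconstructions instead of matrices.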

2605.23446 2026-05-25 cs.LG math.CO 版本更新

Weisfeiler-Leman Is Incomplete on Simple Spectrum Graphs, so Canonicalize Them

Weisfeiler-Leman 在简单谱图上是不完备的,因此对它们进行规范化

Snir Hordan, Nadav Dym, Tim Seppelt

发表机构 * IT University of Copenhagen(哥本哈根IT大学)

AI总结 该研究探讨了具有简单谱的图的同构问题,指出对于任意自然数 $k$,$k$-Weisfeiler-Leman 测试无法区分所有非同构的简单谱图,从而揭示了现有图神经网络在该类图上的局限性。为解决这一问题,研究提出了 PRiSM 方法,这是首个可证明完备的简单谱特征分解规范化方法,填补了该领域的空白。PRiSM 不仅保证了表达能力的完备性,还在与 DeepSets 或 Transformer 结合后实现了对简单谱图的通用逼近能力,为图的表示学习提供了新的理论支持和实用方法。

详情
AI中文摘要

具有简单谱的图允许三次时间同构测试,然而我们证明对于每个自然数 $k$,$k$-Weisfeiler-Leman ($k$-WL) 测试无法区分所有非同构的简单谱图。由于 WL 层次结构限制了广泛使用的图神经网络 (GNN) 的区分能力,这种不完备性适用于所有此类 GNN,从而排除了每个 $k$-WL 对齐的 GNN 家族的完备性。为了弥补这一差距,我们引入了 PRiSM (分区、细化、求解、匹配),这是第一个可证明完备的简单谱特征分解规范化方法。PRiSM 获得了先前规范化方法显然缺乏的完备性保证,并解决了在简单谱图上实现完全表达性的开放问题。当与 DeepSets 或 Transformer 组合时,PRiSM 在简单谱图上实现了通用逼近,证明了使用规范化拉普拉斯位置编码的合理性。实验上,PRiSM 在图回归、分类和表达性方面与现有谱规范化方法性能相当或更优。

英文摘要

Graphs with a simple spectrum admit cubic-time isomorphism testing, yet we prove that for every natural number $k$, the $k$-Weisfeiler-Leman ($k$-WL) test cannot distinguish all non-isomorphic graphs with a simple spectrum. As the WL hierarchy upper-bounds the distinguishing power of widely-used Graph Neural Networks (GNNs), this incompleteness applies to all such GNNs, ruling out completeness for every $k$-WL-aligned GNN family. To close this gap, we introduce PRiSM (Partition, Refine, Solve, Match), the first provably complete canonicalization of simple-spectrum eigendecompositions. PRiSM obtains the completeness guarantee that prior canonicalizations provably lack, and resolves the open problem of achieving complete expressivity on simple-spectrum graphs. When composed with DeepSets or a Transformer, PRiSM achieves universal approximation on simple-spectrum graphs, justifying the use of canonicalized Laplacian positional encodings. Empirically, PRiSM performs comparably to or outperforms existing spectral canonicalizations on graph regression, classification, and expressivity.
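For readers unfamiliar with the WL test, a short sketch of 1-WL colour refinement shows what "cannot distinguish" means in practice. The example uses the textbook pair of a 6-cycle and two disjoint triangles, which is chosen for brevity and is an assumption of this sketch: these graphs do not have a simple spectrum, so this illustrates the general WL limitation rather than the paper's counterexamples. Refinement runs on the disjoint union so that colours are comparable across the two graphs.

```python
from collections import Counter

def wl_histograms(adj_a, adj_b):
    """Run 1-WL on the disjoint union of two graphs and return one colour histogram each."""
    adj = {("a", v): [("a", u) for u in nbrs] for v, nbrs in adj_a.items()}
    adj.update({("b", v): [("b", u) for u in nbrs] for v, nbrs in adj_b.items()})
    colors = {v: 0 for v in adj}
    for _ in range(len(adj)):
        # New colour = old colour plus the sorted multiset of neighbour colours.
        sig = {v: (colors[v], tuple(sorted(colors[u] for u in adj[v]))) for v in adj}
        palette = {s: i for i, s in enumerate(sorted(set(sig.values())))}
        colors = {v: palette[sig[v]] for v in adj}
    hist_a = Counter(c for (g, _), c in colors.items() if g == "a")
    hist_b = Counter(c for (g, _), c in colors.items() if g == "b")
    return hist_a, hist_b

def cycle(n, offset=0):
    """Adjacency lists of an n-cycle on vertices offset .. offset + n - 1."""
    return {offset + i: [offset + (i - 1) % n, offset + (i + 1) % n] for i in range(n)}

hexagon = cycle(6)
two_triangles = {**cycle(3), **cycle(3, offset=3)}
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2, 4], 4: [3, 5], 5: [4]}

same_a, same_b = wl_histograms(hexagon, two_triangles)   # non-isomorphic, same colours
diff_a, diff_b = wl_histograms(hexagon, path)            # distinguished by degree
```

Equal histograms mean the test reports "possibly isomorphic" even though the graphs differ, and any GNN bounded by 1-WL inherits that blind spot; canonicalizing the eigendecomposition is the route the paper takes for the simple-spectrum case.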

2605.23434 2026-05-25 cs.LG 版本更新

Onsager-Machlup Posterior Transport for Deep Gaussian Processes

深度高斯过程的Onsager-Machlup后验传输

Jian Xu, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * RIKEN iTHEMS(日本理化学研究院iTHEMS研究中心) RIKEN AIP(日本理化学研究院AIP研究中心) South China University of Technology(华南理工大学) Columbia University(哥伦比亚大学)

AI总结 深度高斯过程(DGPs)中的近似推断在诱导变量上面临计算瓶颈。本文提出一种新的后验传输方法,通过确定性采样器将可计算的参考测度映射到与后验相关的诱导变量,并利用由Doob桥扩散过程导出的路径先验进行正则化。核心方法基于Song的概率流ODE和Onsager-Machlup作用量,实验证明该方法在多个UCI回归数据集上优于现有方法,尤其在大规模数据集上表现更优。

详情
AI中文摘要

对诱导变量的近似推断是深度高斯过程(DGP)的计算瓶颈。现有方法要么通过ELBO拟合显式密度$q_\phi(\bU)$(DSVI, IPVI, DDVI, DBVI),要么通过MCMC采样(SGHMC)。我们则将DGP推断框架化为\emph{后验传输}:学习一个确定性采样器,将易处理的参考测度映射到后验相关的诱导变量,并通过从Doob桥接参考扩散导出的路径先验进行正则化。我们的实现\textbf{OM-Path}(正式名称为FBVI-bridge-Path)使用Song的概率流ODE应用于DBVI的Doob桥接前向SDE;参考漂移由桥边际系数闭式给出(无需分数匹配),路径正则化器为\textbf{Onsager--Machlup作用量}。在训练时使用的有限$\epsilon$值下,目标函数是温度Doob桥路径后验的负对数未归一化密度,定理1通过Freidlin--Wentzell LDP将其识别为同一后验的小噪声MAP路径。在同一桥骨干上推导了两种严格的路径空间ELBO变体(FFJORD对数行列式;OM正则化CNF)作为消融实验。在七个UCI回归基准上与DBVI进行匹配种子的配对Wilcoxon检验,OM-Path在两个最大数据集上取得了统计显著的胜利(\textit{power}: $p=0.014$,NLL $\mathbf{0.012}$匹配DSVI基线$0.017$;\textit{protein}: $p=0.002$,RMSE $\mathbf{0.716}$对比$0.764$,NLL $\mathbf{1.086}$对比$1.149$),在\textit{yacht}/\textit{qsar}上统计持平,在\textit{boston}/\textit{energy}/\textit{concrete}上因小噪声数据而输给DBVI。严格的ELBO变体在任何UCI指标上均未超过DBVI:在该机制下,降低路径目标方差比精确密度跟踪更重要。

英文摘要

Approximate inference over inducing variables is the central computational bottleneck of Deep Gaussian Processes (DGPs). Existing methods either fit an explicit density $q_ϕ(\bU)$ by an ELBO (DSVI, IPVI, DDVI, DBVI) or sample by MCMC (SGHMC). We instead frame DGP inference as \emph{posterior transport}: learn a deterministic sampler that maps a tractable reference measure to posterior-relevant inducing variables, regularised by a path prior derived from the Doob-bridged reference diffusion. Our realisation, \textbf{OM-Path} (formally FBVI-bridge-Path), uses Song's probability-flow ODE applied to DBVI's Doob-bridged forward SDE; the reference drift is closed-form from the bridge marginal coefficients (no score matching) and the path regulariser is the \textbf{Onsager--Machlup action}. At the finite-$ε$ value used at training, the objective is the negative log unnormalised density of a tempered Doob-bridge path posterior, and Theorem 1 identifies it with the same posterior's small-noise MAP path via the Freidlin--Wentzell LDP. Two strict path-space ELBO variants on the same bridge backbone (FFJORD log-det; OM-regularised CNF) are derived as ablations. Under a matched-seed paired Wilcoxon test against DBVI on seven UCI regression benchmarks, OM-Path delivers statistically significant wins on the two largest datasets (\textit{power}: $p\!=\!0.014$, NLL $\mathbf{0.012}$ matching the DSVI baseline of $0.017$; \textit{protein}: $p\!=\!0.002$, RMSE $\mathbf{0.716}$ vs.\ $0.764$, NLL $\mathbf{1.086}$ vs.\ $1.149$), statistical ties on \textit{yacht} / \textit{qsar}, and concedes \textit{boston} / \textit{energy} / \textit{concrete} to DBVI on small-$N$ noisy data. The strict-ELBO variants do not clear DBVI on any UCI metric: in this regime, reducing the variance of the path objective dominates exact-density tracking.

2605.23424 2026-05-25 cs.IT cs.LG math.IT 版本更新

Sparse In-Network Learning via Shortest-Path Backpropagation and Finite-Rate Gating

通过最短路径反向传播和有限速率门控的稀疏网内学习

Mohammad Reza Deylam Salehi

发表机构 * Nice, France(法国尼斯)

AI总结 本文研究了网络内学习(INL)中的稀疏通信问题,提出了一种基于最短路径树和有限速率门控机制的稀疏网络内学习方法D-INL。该方法通过保留以融合节点为根的容量感知最短路径树,去除非树链接,同时将局部路由建模为有限速率的随机门控,以在稀疏性和预测信息之间取得平衡。实验表明,D-INL在保持分类精度的同时,将训练过程中的通信量减少了70.4%,并进一步通过有限速率正则化将潜在信息率降低了45.7%。

详情
AI中文摘要

网内学习(INL)通过通信图交换潜在激活和反向传播误差来训练分布式神经模块。本文提出Dijkstra剪枝INL(D-INL),通过保留融合节点处的容量感知最短路径树来移除非树链接。为了平衡稀疏性和预测信息,局部路由(或聚合)被建模为有限速率随机门控,其速率为$R_g=I(Z; T)$。我们推导了一个率-失真-泛化界,并在可复现的分布式分类实验上验证了该方法,其中D-INL将训练交换量减少了70.4%,同时将精度保持在密集INL的标准差范围内。与未正则化的Dijkstra INL相比,添加有限速率正则化进一步将估计的潜在速率降低了45.7%。

英文摘要

In-network learning (INL) trains distributed neural modules by exchanging latent activations and backpropagated errors over a communication graph. This letter proposes Dijkstra-pruned INL (D-INL), which removes non-tree links by retaining a capacity-aware shortest-path tree rooted at the fusion node. To balance sparsity and predictive information, local routing (or aggregation) is modeled as a finite-rate stochastic gate with rate $R_g=I(Z; T)$. We derive a rate-distortion-generalization bound and validate the method on a reproducible distributed-classification experiment, where D-INL reduces training exchange by $70.4\%$ while preserving accuracy within the standard deviation of dense INL. Adding finite-rate regularization further reduces the estimated latent rate by $45.7\%$ relative to unregularized Dijkstra INL.
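The pruning step can be sketched with a standard Dijkstra run. The code below makes assumptions the abstract does not state: edge cost is taken as 1 / capacity as one possible "capacity-aware" metric, the topology and capacities are invented, and the finite-rate gate and training loop are omitted entirely.

```python
import heapq

def shortest_path_tree(capacity, root):
    """Dijkstra from `root`; returns the tree as a set of undirected edges."""
    graph = {}
    for (u, v), cap in capacity.items():
        cost = 1.0 / cap                   # assumed capacity-aware cost
        graph.setdefault(u, []).append((v, cost))
        graph.setdefault(v, []).append((u, cost))
    dist = {root: 0.0}
    parent = {}
    heap = [(0.0, root)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist.get(u, float("inf")):
            continue                       # stale heap entry
        for v, w in graph[u]:
            nd = d + w
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                parent[v] = u
                heapq.heappush(heap, (nd, v))
    return {frozenset((child, par)) for child, par in parent.items()}

# Hypothetical communication graph: link -> capacity.
capacity = {
    ("fusion", "a"): 10.0, ("fusion", "b"): 2.0, ("a", "b"): 10.0,
    ("a", "c"): 5.0, ("b", "c"): 1.0, ("b", "d"): 4.0, ("c", "d"): 1.0,
}
tree = shortest_path_tree(capacity, "fusion")
pruned = {frozenset(edge) for edge in capacity} - tree
```

Here four of the seven links survive, so activations and backpropagated errors would flow only along the tree toward the fusion node; the reported 70.4% reduction comes from the paper's own experiment, not from this toy graph.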

2605.23422 2026-05-25 cs.LG 版本更新

Hinge Regression Trees and HRT-Boost: Newton-Optimized Oblique Learning for Compact Tabular Models

铰链回归树与HRT-Boost:面向紧凑表格模型的牛顿优化斜学习

Hongyi Li, Jun Xu, Hong Yan

发表机构 * School of Intelligence Science and Engineering, Harbin Institute of Technology, Shenzhen(哈尔滨工业大学深圳校区智能科学与工程学院) Shenzhen Key Lab for Advanced Motion Control and Modern Automation Equipments, Shenzhen(深圳先进运动控制及现代自动化装备重点实验室) Department of Electrical Engineering, City University of Hong Kong, Kowloon(香港城市大学电子工程系)

AI总结 本文提出了一种名为Hinge Regression Tree(HRT)的框架,通过将每个斜向分割转化为两个线性预测器的非线性最小二乘问题,从而提升斜向决策树的学习质量。HRT利用节点级别的优化过程,结合阻尼牛顿法进行求解,并在理论上证明其具有明确的逼近能力。基于HRT,作者进一步提出了HRT-Boost集成方法,将节点级的牛顿更新与逐阶段函数梯度下降相结合,在平方损失下实现了经验风险的逐步减少,实验表明该方法在多个基准数据集上表现优异,且能生成更为紧凑的模型。

Comments arXiv admin note: substantial text overlap with arXiv:2602.05371

详情
AI中文摘要

由于分割优化的离散性和非凸性,学习高质量的斜决策树仍然是一个重大挑战。我们提出了铰链回归树(HRT)框架,该框架将每个斜分割重构为两个线性预测器上的非线性最小二乘问题,其最大/最小包络诱导出类似ReLU的表示能力。我们证明了由此产生的节点级优化可以解释为阻尼牛顿法,并为其回溯线搜索变体建立了节点目标函数的单调递减性质。理论上,我们证明了HRT是一个通用逼近器,具有显式的$O(δ^2)$逼近速率。在此基础学习器之上,我们提出了HRT-Boost,一种数学上协同的集成扩展,将节点级牛顿更新与阶段式函数梯度下降相结合。我们证明了在平方损失下,这种集成构造具有阶段式经验风险降低保证。在合成和真实世界基准上的实证评估表明,HRT与现有的单树基线相比具有很强的竞争力,而HRT-Boost与强集成基线相比表现良好,并且通常产生更紧凑的模型。代码公开于https://github.com/Hongyi-Li-sz/HRT-Boost。

英文摘要

Learning high-quality oblique decision trees remains a significant challenge due to the discrete and non-convex nature of split optimization. We present the Hinge Regression Tree (HRT) framework, which reframes each oblique split as a nonlinear least-squares problem over two linear predictors whose max/min envelope induces ReLU-like representation capacity. We show that the resulting node-level optimization can be interpreted as a damped Newton method, and we establish the monotonic decrease of the node objective for its backtracking line-search variant. We establish, theoretically, that HRT is a universal approximator with an explicit $O(δ^2)$ approximation rate. Building upon this base learner, we propose HRT-Boost, a mathematically synergistic ensemble extension that couples node-level Newton updates with stage-wise functional gradient descent. We show that this ensemble construction admits a stage-wise empirical risk reduction guarantee under the squared loss. Empirical evaluations on synthetic and real-world benchmarks show that HRT is highly competitive with established single-tree baselines, and HRT-Boost compares favorably with strong ensemble baselines and often yields substantially more compact models. The code is publicly available at https://github.com/Hongyi-Li-sz/HRT-Boost.
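A hinge node, the max envelope of two linear predictors, can be fitted in one dimension by alternating between assigning points to the active predictor and refitting each predictor by least squares. The sketch below is a simplified stand-in, not the HRT algorithm: it is undamped, has no line search, handles only scalar inputs, and the starting lines are hand-picked. It uses the identity max(a, b) = b + ReLU(a - b) only implicitly, through the envelope.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = w * x + b in one dimension."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    var = sum((x - mx) ** 2 for x in xs)
    w = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / var if var > 0 else 0.0
    return w, my - w * mx

def hinge(x, line_a, line_b):
    """Max envelope of two linear predictors given as (slope, intercept)."""
    return max(line_a[0] * x + line_a[1], line_b[0] * x + line_b[1])

def fit_hinge(xs, ys, line_a, line_b, iters=20):
    """Alternate between partitioning by the active line and refitting both lines."""
    for _ in range(iters):
        side_a = [(x, y) for x, y in zip(xs, ys)
                  if line_a[0] * x + line_a[1] >= line_b[0] * x + line_b[1]]
        side_b = [(x, y) for x, y in zip(xs, ys)
                  if line_a[0] * x + line_a[1] < line_b[0] * x + line_b[1]]
        if len(side_a) < 2 or len(side_b) < 2:
            break                          # degenerate split; keep the current lines
        line_a = fit_line(*zip(*side_a))
        line_b = fit_line(*zip(*side_b))
    return line_a, line_b

# Target |x| is exactly a hinge: max(x, -x).
xs = [i / 2 for i in range(-4, 5)]
ys = [abs(x) for x in xs]
line_a, line_b = fit_hinge(xs, ys, (1.0, 0.5), (-0.5, 0.0))
max_error = max(abs(hinge(x, line_a, line_b) - y) for x, y in zip(xs, ys))
```

The point where the two lines cross defines the split, so in higher dimensions the same construction yields an oblique boundary; the damping and backtracking described in the abstract address the cases where this plain alternation fails to decrease the objective.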

2605.23417 2026-05-25 cs.LG 版本更新

An Open-Source Training Dataset for Foundation Models for Black-box Optimization

黑箱优化的基础模型的开源训练数据集

Aaron Klein, Herilalaina Rakotoarison, Luca Thale-Bombien, David Salinas

发表机构 * ELLIS Institute Tübingen(图宾根ELLIS研究所) University of Helsinki(赫尔辛基大学) Leipzig University(莱比锡大学) Prior Labs(Prior实验室)

AI总结 本文提出了一种名为BBO-Pile的开源训练数据集,包含超过50万个优化轨迹,覆盖3095个不同黑盒优化问题,是目前规模最大的公开黑盒优化预训练数据集。研究利用该数据集训练了多个不同规模的基础模型,验证了大规模预训练在模仿黑盒优化方法中的有效性,为该领域未来的研究奠定了基础。

详情
AI中文摘要

大多数黑箱优化方法需要大量的超参数调优,这通常限制了它们在不同优化领域的泛化能力。用于黑箱优化的基础模型从大量优化轨迹中学习优化原理,提供了一种有前景的替代方案,有潜力在多样的问题类别中超越手工设计的方法。然而,先前的工作要么依赖非公开数据集,要么依赖纯合成数据,限制了可重复性和对真实世界问题的泛化。因此,该领域的进展一直受到缺乏大规模、真实世界、公开可用的预训练数据的制约。我们引入了BBO-Pile,这是第一个包含超过500K优化轨迹的开源数据集,这些轨迹在3095个不同的黑箱上针对不同的优化器进行了评估,这代表了迄今为止该任务最大的公开数据集。利用该数据集,我们训练了一系列不同规模的基础模型,参数从2M到80M,训练token从200M到2B,并研究了它们相对于计算量的扩展行为。我们的结果表明,大规模预训练是模仿黑箱优化方法的一种可行且有效的方法,为未来的研究铺平了道路。

英文摘要

Most black-box optimization methods require extensive hyperparameter tuning, often limiting their ability to generalize across different optimization domains. Foundation models for black-box optimization that learn optimization principles from a large collection of optimization trajectories offer a promising alternative, with the potential to outperform manually designed methods across diverse problem classes. However, prior work has either relied on non-public datasets or on purely synthetic data, limiting reproducibility and generalization to real-world problems. As a result, progress in this area has been constrained by the lack of large-scale, real-world, publicly available pre-training data. We introduce BBO-Pile, the first open-source dataset comprising over 500K optimization trajectories evaluated across 3095 different black-boxes for different optimizers, which represents by far the largest public dataset for this task. Using this dataset, we train a family of foundation models at multiple scales, ranging from 2M to 80M parameters and from 200M to 2B training tokens, and study their scaling behavior with respect to compute. Our results demonstrate that large-scale pre-training is a viable and effective approach to imitate black-box optimization methods, paving the way for future research in this direction.

2605.23414 2026-05-25 cs.AI cs.LG 版本更新

When Planning Fails Despite Correct Execution: On Epistemic Calibration for LLM-Based Multi-Agent Systems

当计划正确执行却失败时:基于LLM的多智能体系统的认知校准

Zehao Wang, Shilong Jin, Zhao Cao, Lanjun Wang

发表机构 * College of Intelligence and Computing, Tianjin University, Tianjin, China(天津大学智能与计算学部) School of New Media and Communication, Tianjin University, Tianjin, China(天津大学新媒体与传播学院) Gaoling School of Artificial Intelligence, Renmin University of China, Beijing, China(中国人民大学高瓴人工智能学院)

AI总结 本文研究了基于大语言模型的多智能体系统在计划正确执行却仍可能失败的问题,指出这是由于智能体在评估计划可行性时对自身知识的误判,即“认识论校准失误”。为此,作者提出了EPC-AW方法,通过在不同信息条件下评估计划的稳定性,而非直接验证可行性,从而提升系统的整体成功率。实验表明,该方法平均提升了9.75%的系统成功率。

详情
AI中文摘要

基于LLM的多智能体系统即使在计划动作正确执行时也可能失败,因为智能体在评估计划可行性时可能误判自身知识,我们将这种现象称为规划中的认知误校准。与执行错误不同,认知误校准在规划过程中是潜在的,因为生成的计划可以保持自洽且可执行,没有可观察到的错误;同时,认知误校准也是动态的,因为新信息可能改变可行性评估,可能掩盖过去的误校准信号并导致其随时间重复出现。为了解决这个问题,我们提出了认知计划校准代理工作流(EPC-AW),它评估计划在不同信息条件下是否仍得到支持,而不是直接验证可行性。EPC-AW采用基于信息一致性的计划选择,选择评估结果在智能体间稳定的计划,并结合一致性引导的认知状态细化,通过利用过去的差异来指导未来规划,从而随时间适应校准。实验表明,EPC-AW平均将系统级成功率提高了9.75%。

英文摘要

LLM-based multi-agent systems can fail even when planned actions are executed correctly because agents may misjudge their knowledge when evaluating plan feasibility, a phenomenon we term epistemic miscalibration in planning. Unlike execution errors, epistemic miscalibration is latent during planning, as generated plans can remain self-consistent and executable without observable errors; the miscalibration is also dynamic, as new information can alter feasibility assessments, potentially obscuring past miscalibration signals and causing them to recur over time. To address this, we propose the Epistemic Planning Calibration Agentic Workflow (EPC-AW), which assesses whether plans remain supported under varying information conditions rather than directly verifying feasibility. EPC-AW employs Information-consistency-based Plan Selection, selecting plans whose evaluations are stable across agents, together with Consistency-guided Epistemic State Refinement to adapt calibration over time by leveraging past discrepancies to guide future planning. Experiments show that EPC-AW improves system-level success by an average of 9.75%.

2605.23411 2026-05-25 cs.LG cs.CR cs.CV 版本更新

Sample-wise Targeted Adversarial Attacks on Test-time Adaptation

面向测试时自适应的样本级定向对抗攻击

Phuc Duc Nguyen, Quang Duc Nguyen

发表机构 * College of Computing and Data Science(计算与数据科学学院) Nanyang Technological University(南洋理工大学)

AI总结 本文研究了针对测试时适应(TTA)的样本级定向对抗攻击问题,旨在在不引起分布异常的情况下,使特定样本被错误分类。为解决现有方法在批量操作中导致目标标签频率异常的问题,作者提出了一种基于元学习的攻击方法,结合优先级感知的梯度对齐策略,以确保攻击成功率同时保持整体标签分布不变。实验表明,该方法在多个数据集上取得了高成功率,且难以被检测,对现有防御机制也表现出较强的鲁棒性。

Comments 32 pages, 17 figures

详情
AI中文摘要

测试时自适应(TTA)有效应对分布偏移,但通过未标记的测试流使模型暴露于对抗性操纵之下。现有的类别级定向攻击在此场景下难以实现隐蔽利用:由于TTA在批次上操作,强制部分样本朝向目标标签会无意中拉拢相似的良性样本,导致目标标签出现频率异常高,易于检测。为了捕捉更现实的威胁,我们引入了一种样本级定向攻击。与先前方法不同,攻击者旨在仅使携带攻击者选择的触发器的输入被错误分类,同时保持良性查询的全局标签分布以逃避检测。为实现这一目标,我们提出了一种基于元学习的攻击,采用新颖的优先感知梯度对齐策略,明确优先考虑攻击成功率。该策略将梯度更新形式化为椭球信任区域问题,缓解了攻击成功与分布隐蔽性之间的失调,同时为在梯度失调情况下有效优化攻击目标提供了理论保证。在CIFAR-10-C、CIFAR-100-C和ImageNet-C上跨TTA协议的大量实验表明,我们的方法在保持与无攻击基线一致的标签分布的同时,实现了高定向成功率,使其在未标记的TTA部署场景中难以检测。此外,我们证明了我们的攻击对现有防御表现出强鲁棒性。

英文摘要

Test-time adaptation (TTA) effectively counters distribution shifts but exposes models to adversarial manipulation via the unlabeled test stream. Existing class-wise targeted attacks remain impractical for stealthy exploitation in this setting: since TTA operates on batches, forcing a subset of samples toward a target label unintentionally pulls similar benign samples along, resulting in a conspicuously high frequency of the target label that is easy to detect. To capture a more realistic threat, we introduce a sample-wise targeted attack. Unlike prior approaches, the attacker aims to misclassify only inputs carrying an attacker-chosen trigger, while preserving the global label distribution of benign queries to evade detection. To achieve this, we propose a meta-learning-based attack with a novel priority-aware gradient alignment strategy that explicitly prioritizes attack success. The strategy formulates the gradient update as an ellipsoidal trust-region problem, mitigating the misalignment between attack success and distributional stealth, while providing theoretical guarantees for effective optimization of the attack objective in the presence of gradient misalignment. Extensive experiments on CIFAR-10-C, CIFAR-100-C, and ImageNet-C across TTA protocols demonstrate that our method achieves high targeted success rates while maintaining a label distribution that is consistent with the no-attack baseline, making it difficult to detect in unlabeled TTA deployment scenarios. Furthermore, we demonstrate that our attack shows strong robustness against existing defenses.

2605.23410 2026-05-25 cs.LG cs.CV 版本更新

What Linear Probes Miss: Multi-View Probing for Weight-Space Learning

线性探测的盲区:面向权重空间学习的多视角探测

Eunwoo Heo, Kyeongkook Seo, Jaejun Yoo

发表机构 * Graduate School of Artificial Intelligence, Ulsan National Institute of Science(蔚山科学技术院人工智能研究生院)

AI总结 随着开源模型库的快速增长,如何高效识别和分析模型参数成为重要问题。现有基于探针的方法虽轻量,但受限于单一视角设计,难以捕捉参数间的高阶交互信息。本文提出多视角探针框架 MVProbe,结合一阶结构与基于格拉姆矩阵的交互感知视角,理论分析表明其能更全面地表征模型参数,实验显示其在多种架构上均优于现有方法。

Comments Accepted at ICML 2026. Code: https://github.com/AI-hew-math/MVProbe ; Project page: https://ai-hew-math.github.io/MVProbe/

详情
AI中文摘要

开源模型库的爆炸式增长催生了“模型丛林”,其中检查点经常在缺乏充分文档或元数据的情况下共享。虽然权重空间学习提供了一种直接从参数识别和分析这些模型的途径,但处理全尺度权重在计算上成本高昂。基于探测的方法作为一种轻量级替代方案出现,通过可学习的探测向量提取置换等变表示。然而,现有探测方法受限于单视角设计:它们捕获一阶结构,但未能编码行-列交互中固有的丰富高阶相关模式。为弥补这一差距,我们引入MVProbe,一个多视角探测框架,它综合了一阶信号与交互感知(基于Gram)的视角。我们的方法有理论依据;我们分析了不同探测阶数的缩放定律,以推导出原则性的标准化和融合策略,确保所有分支的贡献平衡。在Model Jungle基准上,MVProbe在多种架构上持续优于最先进的ProbeX,包括判别式骨干网络(ResNet、SupViT、MAE、DINO)和大规模生成式LoRA适配器(Stable Diffusion LoRA)。

英文摘要

The explosive growth of open-source model repositories has created a Model Jungle, where checkpoints are frequently shared without adequate documentation or metadata. While weight-space learning offers a pathway to identify and analyze these models directly from their parameters, processing full-scale weights is computationally prohibitive. Probing-based methods have emerged as a lightweight alternative, extracting permutation-equivariant representations via learnable probe vectors. However, existing probing methods are limited by a single-view design: they capture first-order structures but fail to encode the rich, higher-order correlation patterns inherent in row-column interactions. To bridge this gap, we introduce MVProbe, a multi-perspective probing framework that synthesizes first-order signals with interaction-aware (Gram-based) views. Our approach is theoretically grounded; we analyze the scaling laws of different probing orders to derive a principled standardization and fusion strategy that ensures balanced contributions from all branches. On the Model Jungle benchmark, MVProbe consistently outperforms the state-of-the-art ProbeX across diverse architectures, including discriminative backbones (ResNet, SupViT, MAE, DINO) and large-scale generative LoRA adapters (Stable Diffusion LoRA).

2605.23403 2026-05-25 cs.LG physics.ao-ph quant-ph 版本更新

Hybrid Quantum-Classical Corrective Diffusion Modeling for Meteorological Downscaling

混合量子-经典校正扩散模型用于气象降尺度

Rui Wang, Edoardo Pasetto, Amer Delilbasic, Morris Riedel, Kristel Michielsen, Gabriele Cavallaro

发表机构 * University of Iceland(冰岛大学) RWTH Aachen University(亚琛工业大学) University of Cologne(科隆大学)

AI总结 本文提出了一种混合量子-经典修正扩散模型,用于天气场的概率统计降尺度,旨在从低分辨率输入生成高分辨率天气数据。该模型在扩散UNet的最压缩瓶颈处插入变分量子电路层,而回归分支保持完全经典,以测试量子电路是否能作为非线性特征映射提升潜在通道混合效果。实验表明,该混合模型在风场降尺度任务中表现出稳定性,保留了大尺度空间结构,并在多个配置中提升了平均绝对误差和连续排名概率评分,同时展示了对动能谱和风速分布的保持能力,突显了量子混合方法在气象降尺度中的潜力与当前硬件限制。

Comments 11 pages, 9 figures. Submitted to IEEE QCE 2026

详情
AI中文摘要

统计降尺度是天气建模领域的关键组成部分,需要在不承担动力细化全部成本的情况下,从粗分辨率输入重建高分辨率输出。在这项工作中,我们研究了一种用于天气场概率统计降尺度的混合量子-经典校正扩散模型。所提出的模型将变分量子电路层插入到扩散UNet的最压缩瓶颈中,同时保留回归分支完全经典。这种放置测试了量子电路是否可以作为潜在通道混合的紧凑非线性特征映射。我们在10米风场分量上评估了通道内和跨通道的ansätze。在2020年验证集上,混合模型保持稳定,保留了生成风场的大尺度空间组织,并在几种配置中相对于经典校正扩散模型改善了MAE和CRPS。结构诊断进一步表明,混合变体保持了与其经典对应物相似的动能谱和风速分布,同时在尾部行为、极端风速定位和联合风场分量结构方面产生受控变化。2020年验证集上的后端研究表明,在测试的电路规模下,模拟设备噪声的影响可以忽略不计,而实际硬件部署仍受限于量子比特可用性和执行保真度。2021年分布外测试表明,这些分布内增益在时间偏移下不能均匀转移,揭示了泛化差距,这促使未来通过稳定化和正则化进行缓解。这些结果表明,瓶颈级别的量子混合可以为天气统计降尺度做出重要贡献,同时也强调了电路规模和硬件部署仍然是关键的限制因素。

英文摘要

Statistical downscaling is a crucial component of the weather modeling field, where high-resolution outputs must be reconstructed from coarse-resolution inputs without the full cost of dynamical refinement. In this work, we investigate a hybrid quantum-classical corrective diffusion model for probabilistic statistical downscaling of weather fields. The proposed model inserts variational quantum circuit layers into the most compressed bottleneck of the diffusion UNet while leaving the regression branch fully classical. This placement tests whether quantum circuits can act as compact nonlinear feature maps for latent-channel mixing. We evaluate intra-channel and cross-channel ansätze on 10m wind components. On the 2020 validation set, the hybrid models remain stable, preserve the large-scale spatial organization of the generated wind fields, and improve both MAE and CRPS relative to a classical corrective diffusion model in several configurations. Structural diagnostics further show that the hybrid variants preserve kinetic-energy spectra and windspeed distributions similar to those of their classical counterparts while producing controlled changes in tail behavior, extreme-windspeed localization, and joint wind-component structure. Backend studies on the 2020 validation set show negligible impact from simulated device noise at the tested circuit scale, whereas real-hardware deployment remains limited by qubit availability and execution fidelity. The 2021 out-of-distribution test shows that these in-distribution gains do not transfer uniformly under temporal shift, revealing a generalization gap that motivates future mitigation through stabilization and regularization. These results show that bottleneck-level quantum hybridization can make a nontrivial contribution to weather statistical downscaling, while also highlighting that circuit scale and hardware deployment remain key limiting factors.

2605.23402 2026-05-25 cs.LG cs.AI 版本更新

Parametric Prior Mapping Framework for Non-stationary Probabilistic Time Series Forecasting

非平稳概率时间序列预测的参数先验映射框架

Jinglin Li, Jun Tan, QI Fang, Ning Gui

发表机构 * School of Computer Science and Engineering(计算机科学与工程学院)

AI总结 本文提出了一种参数先验映射框架(PPM),用于非平稳概率时间序列预测。该方法通过引入参数化的结构先验,结合生成模型的优势,实现了在保持计算效率的同时捕捉复杂时间依赖关系。实验表明,PPM在非平稳数据预测任务中优于现有方法,在准确性和计算效率之间取得了更好的平衡。

Comments 20 pages, 8 figures, accepted by ICML 2026

详情
AI中文摘要

在概率多变量时间序列(MTS)预测中有效建模非平稳动态需要在表达性和鲁棒性之间取得平衡。现有参数方法受益于强归纳偏置但缺乏灵活性,而深度生成模型在没有大量数据和计算的情况下难以捕捉复杂的时间依赖性。我们引入了参数先验映射(PPM),这是一个将参数化结构先验注入生成建模过程的框架。具体来说,PPM利用参数化估计器推导出一个动态的自适应先验,通过可学习的映射指导复杂预测分布的学习。这种设计使模型能够保留参数方法的效率,同时利用生成模型的表达能力。通过混合目标训练,PPM产生精确的预测,并具有良好校准的不确定性估计。实验结果表明,PPM在处理非平稳数据方面优于现有基线,在精度和计算效率之间提供了更好的权衡。代码可在https://github.com/ljl8336/PPM获取。

英文摘要

Effectively modeling non-stationary dynamics in probabilistic multivariate time series (MTS) forecasting requires balancing expressiveness with robustness. Existing parametric approaches benefit from strong inductive biases but lack flexibility, whereas deep generative models struggle to capture complex temporal dependencies without extensive data and computation. We introduce Parametric Prior Mapping (PPM), a framework that injects parametric structural priors into a generative modeling process. Specifically, PPM utilizes a parametric estimator to derive a dynamic, adaptive prior that guides the learning of a complex predictive distribution via a learnable mapping. This design allows the model to retain the efficiency of parametric methods while exploiting the expressive power of generative models. Trained with a hybrid objective, PPM yields precise forecasts with well-calibrated uncertainty estimates. Empirical results show that PPM outperforms existing baselines in handling non-stationary data, offering a superior trade-off between accuracy and computational efficiency. The code is available at https://github.com/ljl8336/PPM.

2605.23393 2026-05-25 cs.LG cs.AI 版本更新

Every Component is a Lookup: Token Attribution and Composition from a Single Decomposition

每个组件都是一个查找:来自单一分解的令牌归因与组合

Po-Kai Chen, Niki van Stein, Aske Plaat

发表机构 * Leiden University(莱顿大学)

AI总结 该论文研究了如何从单一前向传播中解析Transformer模型中各组件对预测结果的贡献及其组合方式。作者提出了一种名为Unpack的反向递归方法,通过分解注意力和MLP子层中的信用,揭示了不同组件之间的交互强度以及每个token的归因信息,无需干预、梯度或辅助训练。实验表明,该方法在GPT-2和Pythia系列模型上有效恢复了组件间的组合结构,并展示了对token级归因的准确捕捉,验证了其在机制可解释性方面的有效性。

详情
AI中文摘要

Transformer的机制可解释性不仅需要识别哪些组件重要,还需要理解它们如何组合成产生预测的计算路径。注意力和MLP都遵循共享的键值模板 $ϕ(S)U$。我们利用这一结构开发了Unpack,一种后向递归方法,通过两个子层分解贡献,由单次前向传递即可得到任意两个组件之间的交互强度、带有K/Q/V组合标签的具名端到端路径,以及每个令牌的归因,无需干预、梯度或辅助训练。我们在间接宾语识别任务上进行了评估。在GPT-2 small上,该方法恢复了Wang等人(2023)描述的所有三种组合连接,包括每个连接的特定模式路由(K、Q或V)。为了测试超越简单复制的令牌级归因,我们比较了同一分解中同一名称的两次出现:第一次提及保持强归因,而重复检测位置被抑制,这一模式在匹配的控制提示中不存在。在Pythia系列从160M到6.9B参数中,这一抑制模式在每个尺度上一致地恢复,表明该方法无需真实电路标签即可追踪机制结构。代码可在https://github.com/Fun-Cry/unpacklm获取。

英文摘要

Mechanistic interpretability of transformers requires identifying not just which components matter but how they compose into the computational route that produced a prediction. Both attention and MLP follow a shared key-value template $ϕ(S)U$. We exploit this structure to develop Unpack, a backward recursion that decomposes credit through both sublayers, producing interaction strengths between any two components, named end-to-end paths with K/Q/V composition labels, and per-token attribution from a single forward pass, without intervention, gradients, or auxiliary training. We evaluate on the indirect object identification task. On GPT-2 small, the method recovers all three composition connections described by Wang et al. (2023), including the mode-specific routing of each connection (K, Q, or V). To test token-level attribution beyond trivial copying, we compare two occurrences of the same name in the same decomposition: the first mention retains strong credit while the duplicate-detection position is suppressed, a pattern absent in matched control prompts. Across the Pythia family from 160M to 6.9B parameters, this suppression pattern is consistently recovered at every scale, demonstrating that the method tracks mechanistic structure without ground-truth circuit labels. Code is available at https://github.com/Fun-Cry/unpacklm.
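One way to read the shared key-value template $ϕ(S)U$ is as a weighted lookup over value rows: scores $S$ pass through a nonlinearity $ϕ$ and then mix the rows of $U$. The sketch below illustrates that reading only. It assumes $ϕ$ is softmax for an attention head and an elementwise activation for an MLP (ReLU is used here for simplicity); the function `lookup` and the toy numbers are hypothetical and are not taken from the Unpack code.

```python
import math

def softmax(row):
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def relu(row):
    return [max(0.0, v) for v in row]

def lookup(phi, scores, values):
    """Generic key-value lookup: output = phi(scores) @ values.

    Returns the output vector and the per-slot contributions whose
    sum equals the output, which is what makes credit assignment additive.
    """
    weights = phi(scores)
    contribs = [[w * v for v in row] for w, row in zip(weights, values)]
    output = [sum(col) for col in zip(*contribs)]
    return output, contribs

values = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Attention head at one query position: scores play the role of q·k_i.
attn_out, attn_parts = lookup(softmax, [2.0, 0.0, -1.0], values)

# MLP: scores play the role of hidden pre-activations, values of output rows.
mlp_out, mlp_parts = lookup(relu, [0.5, -1.0, 2.0], values)
print(mlp_out)
```

Because the output is a plain sum of per-slot terms, each slot's contribution can be read off directly, which is the property a backward decomposition of credit relies on.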

2605.23391 2026-05-25 cs.LG cs.NA math.NA 版本更新

Coupling-Robust Accuracy in Multiphysics Physics Informed Neural Networks via Kronecker-Preconditioned Optimization

通过Kronecker预条件优化实现多物理场物理信息神经网络的耦合鲁棒精度

Youngjae Park, Jaemin Kim, Junghwa Hong

发表机构 * Dept. of Control and Instrumentation Engineering, Korea University, Sejong, South Korea(韩国世宗,高丽大学控制与仪器工程系) BK21 FOUR Smart Mobility Education and Research Team, Korea University, Sejong, South Korea(韩国世宗,高丽大学BK21 FOUR智能出行教育与研究团队)

AI总结 物理信息神经网络(PINNs)在处理耦合多物理场系统时,随着方程间耦合增强,会出现系统性精度下降的问题。本文通过神经切线核(NTK)分析,揭示了这一现象的理论原因,并提出了一种基于克罗内克预处理的优化方法SOAP+GN,有效抑制了耦合强度对学习稳定性的影响。实验表明,该方法在多种耦合偏微分方程系统中均能保持较高的精度,显著优于传统优化方法。

Comments 20 pages, 10 figures. Extended version of AI4Physics Workshop submission (ICML 2026)

详情
AI中文摘要

用于耦合多物理场系统的物理信息神经网络(PINN)在方程间耦合增强时会遭受系统性精度退化。我们通过神经正切核(NTK)分析为这一现象提供了理论解释:对于线性耦合系统,我们证明标准NTK的谱半径随耦合强度$γ$呈$Ω(γ^2)$增长,缩小了稳定学习率,而块对角Gauss-Newton(GN)预条件产生预条件NTK $K_P = J H^{+} J^\top$(其中$H$是块对角GN Hessian矩阵),其谱半径以$S$(网络数量)为界,与$γ$无关。我们在对称、非对称和非线性耦合PDE系统上数值验证了$Ω(γ^2)$增长,并确认在所有情况下$λ_{\max}(K_P) = S$。将Kronecker预条件优化器SOAP与逆梯度范数损失平衡(SOAP+GN)相结合,实现了耦合鲁棒精度:在跨越三个非线性递增的一维系统和一个二维电渗流基准的234个实验中,即使耦合参数变化一到两个数量级,SOAP+GN保持最终epoch的$L_2$退化$\leq 1.1$倍(强耦合与弱耦合误差之比),而Adam+GN则超过$10^2$倍。SOAP+GN进一步扩展到EDL分辨条件下的二维六PDE电渗流系统(这一所有先前PINN电动力学研究通过简化物理而避免的工况),而Adam+GN完全失败($L_2 > 0.9$)。

英文摘要

Physics-informed neural networks (PINNs) for coupled multiphysics systems suffer systematic accuracy degradation as inter-equation coupling strengthens. We provide a theoretical explanation for this phenomenon through neural tangent kernel (NTK) analysis: for linearly coupled systems, we prove that the standard NTK's spectral radius grows as $Ω(γ^2)$ with coupling strength $γ$, shrinking the stable learning rate, while block-diagonal Gauss--Newton (GN) preconditioning yields a preconditioned NTK $K_P = J H^{+} J^\top$ (where $H$ is the block-diagonal GN Hessian) whose spectral radius is bounded by $S$ ($S$ = number of networks), independent of $γ$. We verify the $Ω(γ^2)$ growth numerically across symmetric, asymmetric, and nonlinear coupled PDE systems, and confirm $λ_{\max}(K_P) = S$ with equality in all cases. Combining the Kronecker-preconditioned optimizer SOAP with inverse-gradient-norm loss balancing (SOAP+GN) yields coupling-robust accuracy: across 234 experiments spanning three 1D systems of increasing nonlinearity and a 2D electroosmotic flow benchmark, SOAP+GN maintains final-epoch $L_2$ degradation $\leq 1.1\times$ (ratio of strong- to weak-coupling error) even as coupling parameters vary over one to two orders of magnitude, compared with $> 10^2\times$ for Adam+GN. SOAP+GN further scales to a 2D, 6-PDE electroosmotic flow system at EDL-resolved conditions -- a regime that all prior PINN electrokinetics studies have avoided through simplified physics -- where Adam+GN fails entirely ($L_2 > 0.9$).
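A short worked derivation may help show why a bound of $S$ on the preconditioned kernel is plausible. It rests on an assumption that the abstract does not spell out: the Jacobian splits into per-network column blocks $J = [J_1, \dots, J_S]$ and the block-diagonal GN Hessian is $H = \mathrm{blockdiag}(J_1^\top J_1, \dots, J_S^\top J_S)$. Under that assumption:

```latex
K_P \;=\; J H^{+} J^\top
    \;=\; \sum_{s=1}^{S} J_s \,(J_s^\top J_s)^{+} J_s^\top
    \;=\; \sum_{s=1}^{S} P_s ,
\qquad
\lambda_{\max}(K_P) \;\le\; \sum_{s=1}^{S} \lambda_{\max}(P_s) \;\le\; S .
```

Here each $P_s$ is the orthogonal projector onto the column space of $J_s$, so its eigenvalues lie in $\{0, 1\}$ however large the entries of $J_s$ become. The coupling strength $γ$ can rescale $J_s$ but cannot change that, which is consistent with the claimed independence from $γ$. Equality $λ_{\max}(K_P) = S$ holds when the column spaces share a common nonzero vector, for example when every $J_s$ has full row rank and hence $P_s = I$.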

2605.23378 2026-05-25 math.OC cs.LG 版本更新

Selective Ambulance Dispatch Under Contextual Travel-Time Uncertainty

上下文旅行时间不确定性下的选择性救护车调度

Zikun Lin, Daniel Zhuoyu Long, Viet Anh Nguyen

发表机构 * Department of Systems Engineering and Engineering Management(系统工程与工程管理系)

AI总结 本文研究了在交通时间不确定性背景下如何选择性派遣救护车以应对院外心脏骤停的紧急情况。提出了一种名为IDEAL的智能双派车框架,仅在主路线与备选路线的时间差超过阈值时才派遣第二辆救护车,从而在保证响应速度的同时减少资源消耗。该方法通过弱监督双层网络学习上下文相关的道路旅行时间,并结合非光滑优化与不确定性建模,实现了高效且具有收敛性保证的实时决策,在实际数据与模拟测试中表现出优于现有方法的响应时间与资源利用平衡。

详情
AI中文摘要

救护车响应在院外心脏骤停(OHCA)中具有时间紧迫性,调度员必须在及时到达与有限车队容量之间取得平衡。静态区域和确定性旅行时间估计易受动态拥堵影响,而始终双调度增加了冗余但消耗了车队容量。我们提出IDEAL(智能双调度急救车),一种选择性双调度框架,仅当主要路径与次要路径之间的乐观差距超过阈值时才派出第二辆救护车。IDEAL利用弱监督双层表示网络,从行程级调度记录(包括未观测路线)中学习上下文特定的边旅行时间。我们使用小批量保守梯度训练非光滑模型,并证明渐近收敛保证。IDEAL通过Burg散度扰动对学习表示空间中的共享度量进行建模,从而引起边旅行时间的相关变化,并从历史低估误差中学习上下文特定半径。对于实时决策,IDEAL将乐观差距计算转化为凸差规划,并推导出具有复杂度保证的高效预言机。与香港消防处合作,我们使用历史OHCA记录和实时自适应模拟评估IDEAL。相对于所有基于区域和基于谷歌的基线,结果实现了更强的响应时间/资源权衡。

英文摘要

Ambulance response is time-critical in out-of-hospital cardiac arrest (OHCA), where dispatchers must balance timely arrivals with limited fleet capacity. Static territories and deterministic travel-time estimates are vulnerable to dynamic congestion, while always-dual dispatch adds redundancy but consumes fleet capacity. We propose IDEAL (Intelligent Dual dispatch of Emergency AmbuLances), a selective dual-dispatch framework that sends a second ambulance only when the optimistic gap between primary and secondary paths exceeds a threshold. IDEAL learns context-specific edge travel times from trip-level dispatch records, including unobserved routes, using a weakly supervised bilevel representation network. We train the nonsmooth model with mini-batch conservative gradients and prove an asymptotic convergence guarantee. IDEAL models uncertainty via Burg-divergence perturbations to a shared metric in the learned representation space, thereby inducing correlated changes in edge travel times and learning context-specific radii from historical underprediction errors. For real-time decisions, IDEAL casts optimistic-gap computation as a difference-of-convex program and derives an efficient oracle with complexity guarantees. In collaboration with the Hong Kong Fire Services Department, we evaluate IDEAL using historical OHCA records and real-time adaptive simulations. IDEAL achieves a stronger response-time/resource trade-off than all region-based and Google-based baselines.

2605.23372 2026-05-25 cs.LG cs.AI 版本更新

Curriculum reinforcement learning with measurable task representation learning

基于可度量任务表征学习的课程强化学习

Yongyan Wen, Siyuan Li, Mingjian Fu, Yiqin Yang, Xun Wang, Peng Liu

发表机构 * Harbin Institute of Technology(哈尔滨工业大学) Fuzhou University(福州大学) China Academy of Sciences(中国科学院)

AI总结 本文研究了课程强化学习(CRL)中自动课程生成的问题,特别是在非欧几里得任务空间中的复杂导航任务。为了解决传统插值方法在非欧空间中失效的问题,作者提出了一种基于可度量任务表示学习的自动课程生成方法,通过变分自编码器结构对任务的奖励和状态转移进行编码,从而获得具有任务相似性度量能力的潜在任务表示。实验表明,该方法在多个复杂导航任务中优于基于插值和生成对抗网络的现有CRL方法。

详情
Journal ref
Neural Networks, 109019 (2026)
AI中文摘要

在课程强化学习(CRL)中,智能体通过一系列任务(即课程)逐步积累知识,学习过程旨在利用积累的知识最终解决具有挑战性的目标任务。虽然早期的CRL工作侧重于对候选任务进行排序,但最近的研究探索了自动课程生成。在丰富的CRL文献中,基于插值的CRL范式是主体,它通过在任务空间中利用有意义的距离度量(即可以衡量任务相似性)对初始任务分布和目标任务分布进行插值,自动生成中间任务。然而,在具有挑战性的导航任务中,非欧几里得上下文(任务)空间使得这一假设失效。为了在复杂任务中实现自动课程生成,我们提出了一种基于可度量任务表征学习的新型自动课程生成方法。为了更好地衡量相似性,我们提出将任务空间变换到潜在空间。通过一个编码奖励和状态转移的变分自编码器结构,我们获得了具有任务相似性度量属性的潜在任务表征,其中两个相近的任务嵌入对应两个在奖励和状态转移方面相似的任务。基于学习到的任务表征,我们进一步开发了一种自动课程生成方案,该方案能够有效地生成与目标任务越来越相似的新任务。我们在各种具有挑战性的导航任务中评估了我们的方法,实验结果表明,所提出的方法超越了基于插值和生成对抗网络的最先进CRL方法。

英文摘要

In curriculum reinforcement learning (CRL), an agent incrementally accumulates knowledge over a sequence of tasks (i.e., a curriculum), and the learning process aims to use the accumulated knowledge to finally solve a challenging target task. While early CRL works focus on sequencing candidate tasks, recent research explores automatic curriculum generation. Within the rich CRL literature, the interpolation-based paradigm is a major line of work: it automatically generates intermediate tasks by interpolating between the initial task distribution and the target task distribution, assuming the task space has a meaningful distance metric (i.e., one that measures task similarity). However, in challenging navigation tasks, the non-Euclidean context (task) space invalidates this assumption. To achieve automatic curriculum generation in complex tasks, we propose a novel automatic curriculum generation approach based on measurable task representation learning. To better measure similarity, we propose to transform the task space into a latent space. Through a variational autoencoder structure that encodes the reward and the state transitions, we obtain a latent task representation with a task-similarity measurement property: two close task embeddings correspond to two tasks that are similar in terms of rewards and state transitions. Based on the learned task representation, we further develop an automatic curriculum generation scheme that effectively generates new tasks increasingly similar to the target task. We evaluate our method on a variety of challenging navigation tasks, and the experimental results indicate that the proposed approach surpasses state-of-the-art CRL approaches based on interpolation and generative adversarial networks.

2605.23365 2026-05-25 cs.LG cs.AI 版本更新

Score-Based One-step MeanFlow Policy Optimization

基于分数的单步MeanFlow策略优化

Kyungyoon Kim, Donghyeon Ki, Hee-Jun Ahn, Byung-Jun Lee

发表机构 * Korea University, Decision Making Lab(高丽大学,决策实验室) Gauss Labs Inc.(Gauss实验室)

AI总结 本文提出了一种基于分数估计的单步均流策略优化方法(SOM),旨在解决强化学习中扩散模型和流匹配方法在在线场景下计算开销大的问题。该方法通过Q函数和概率流ODE直接构建目标速度场,无需目标分布的样本,从而在保证策略性能的同时显著降低了训练和推理时间。实验表明,SOM在运动控制任务中实现了领先的在线强化学习效果。

详情
AI中文摘要

扩散和流匹配已成为强化学习中表达力强的策略类,但它们对多步去噪的依赖在推理时带来了大量计算开销,这在在线强化学习中尤其成问题。MeanFlow通过学习一个平均速度场,在单次网络评估中将噪声映射到数据,提供了一种有前景的替代方案。然而,MeanFlow通常需要来自目标分布的样本来构建其目标速度场,而这在在线强化学习中不可用。我们提出了基于分数的单步MeanFlow策略优化(SOM),一种演员-评论家算法,通过分数估计和概率流ODE直接从Q函数构建目标速度场,从而将概率质量集中在高价值模式上。在完全在线强化学习设置中,SOM在运动任务上以单生成步骤实现了最先进的性能,同时与先前基于扩散和流匹配的策略相比,大幅减少了训练和推理时间。

英文摘要

Diffusion and flow matching have emerged as expressive policy classes in reinforcement learning, but their reliance on multi-step denoising imposes substantial computational overhead at inference time, which is particularly problematic in online RL. MeanFlow offers a promising alternative by learning an average velocity field that maps noise to data in a single network evaluation. However, MeanFlow typically requires samples from the target distribution to construct its target velocity field, which are unavailable in online RL. We propose Score-Based One-step MeanFlow Policy Optimization (SOM), an actor-critic algorithm that resolves this by constructing the target velocity field directly from the Q-function via score estimation and a probability flow ODE, thereby concentrating probability mass on high-value modes. In the fully online RL setting, SOM achieves state-of-the-art performance on locomotion tasks with a single generation step, while substantially reducing both training and inference time compared to prior diffusion- and flow-matching-based policies.

2605.23362 2026-05-25 cs.LG cs.IT math.IT math.ST stat.ML stat.TH 版本更新

Instance-Optimal Estimation with Multiple LLM Judges on a Budget

预算限制下多LLM裁判的实例最优估计

Junghyun Lee, Sanghwa Kim, Yassir Jedra, Alexandre Proutière, Se-Young Yun

发表机构 * KAIST AI(韩国科学技术院人工智能研究所) ICL EEE(国际计算机语言研究所电子工程系) KTH EECS(皇家理工学院电子工程系)

AI总结 本文研究了在有限预算下如何高效分配多个具有不同成本和可靠性的大语言模型评估任务,以获得最准确的评分估计。作者提出了预算异方差多评委估计问题,并设计了一种自适应算法EST-IVWE,通过乐观偏差方差估计实现稳定分配,理论证明其性能接近最优分配方案。此外,作者还建立了匹配的局部最小最大下界,证明了所提方法的实例最优性,并在实验中验证了其优于均匀分配策略的效果。

Comments 53 pages, 4 figures; the first two authors contributed equally

详情
AI中文摘要

评估大型语言模型越来越依赖于LLM作为裁判的协议,但此类评估仍然成本高昂:不同的裁判有不同的价格和可靠性,且每个提示-响应对的难度可能差异很大。这引发了一个基本的分配问题:在固定预算下,应如何在异构裁判和实例之间分配评估查询,以获得最准确的分数估计?我们将此问题形式化为*预算限制下的异方差多裁判估计*。给定$K$个提示-响应对、$J$个已知成本的裁判以及未知的查询-裁判方差,目标是估计一个有界分数向量,同时最小化$\ell_p$误差。我们的第一个贡献是分析逆方差加权估计量(IVWE)并推导出最小化其误差率的最优分配。由于该分配依赖于未知方差,我们随后通过提出EST-IVWE来解决实际中的未知方差设置,这是一种自适应算法,它构建并利用*乐观偏差*方差估计来稳定经验分配。我们证明EST-IVWE在预算内匹配了IVWE的速率,直至低阶项。我们的第二个且核心的理论贡献是一个匹配的*局部*极小极大下界,这确立了所提出算法的实例最优性。一个关键的技术见解是,Fano型高概率论证对于这个问题过于粗糙:它们的填充构造失去了控制最优分配的局部方差结构。我们转而使用基于局部扰动的Assouad型期望论证,该论证保留了这一结构并产生了尖锐的分配相关下界。最后,我们在合成数据集和HelpSteer2数据集上数值验证了我们的方法优于朴素的均匀分配。

英文摘要

Evaluating large language models increasingly relies on LLM-as-a-judge protocols, but such evaluations remain costly: different judges have different prices and reliabilities, and the difficulty of each prompt-response pair can vary substantially. This raises a basic allocation question: under a fixed budget, how should one distribute evaluation queries across heterogeneous judges and instances to obtain the most accurate score estimates? We formalize this question as *budgeted heteroskedastic multi-judge estimation*. Given $K$ prompt-response pairs, $J$ judges with known costs, and unknown query-judge variances, the goal is to estimate a bounded score vector while minimizing an $\ell_p$-error. Our first contribution is to analyze the inverse-variance weighted estimator (IVWE) and to derive the oracle allocation that minimizes its error rate. Since this allocation depends on the unknown variances, we then address the practical unknown-variance setting by proposing EST-IVWE, an adaptive algorithm that constructs and leverages *optimistically biased* variance estimates to stabilize the empirical allocation. We prove that EST-IVWE matches the oracle IVWE rate up to lower-order terms in the budget. Our second and central theoretical contribution is a matching *local* minimax lower bound, which establishes the instance-optimality of the proposed algorithms. A key technical insight is that Fano-type high-probability arguments are too coarse for this problem: their packing construction loses the local variance structure that governs the optimal allocation. We instead use an Assouad-type in-expectation argument, based on local perturbations, which preserves this structure and yields the sharp allocation-dependent lower bound. Finally, we numerically validate the superiority of our approach over naïve uniform allocation on synthetic and HelpSteer2 datasets.
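The inverse-variance weighted estimator at the centre of the abstract can be sketched for a single prompt-response pair as follows. This is a minimal illustration that assumes the per-query judge variances are known (the oracle setting); the function name `ivwe` and the toy numbers are hypothetical, and the adaptive variance estimation of EST-IVWE is not reproduced here.

```python
def ivwe(judge_means, judge_variances, judge_counts):
    """Inverse-variance weighted estimate of one item's score.

    judge_means[j]     : average score from judge j over its queries
    judge_variances[j] : per-query variance of judge j (assumed known here)
    judge_counts[j]    : number of queries sent to judge j
    Returns (estimate, variance of the estimate).
    """
    # Precision contributed by judge j is n_j / sigma_j^2.
    precisions = [n / v for v, n in zip(judge_variances, judge_counts)]
    total = sum(precisions)
    if total == 0:
        raise ValueError("no queries allocated to any judge")
    estimate = sum(p * m for p, m in zip(precisions, judge_means)) / total
    return estimate, 1.0 / total

# Two judges, four queries each; the first judge is four times less noisy.
est, var = ivwe([0.8, 0.6], [0.01, 0.04], [4, 4])
print(est, var)
```

Since the variance of the estimate is the reciprocal of the total precision, spending budget on a judge pays off in proportion to its precision per unit cost, which is why the allocation depends on both the costs and the variances.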

2605.23355 2026-05-25 cs.CV cs.LG cs.MM 版本更新

Decoupling Spatio-Temporal Adapter for Fine-Grained Badminton Action Localization

解耦时空适配器用于细粒度羽毛球动作定位

Tianyu Wang, Junjie Wu, Jingquan Gao, Shishuo Li

发表机构 * School of Economics and Management, Beihang University(北京航空航天大学经济管理学院) Key Laboratory of Data Intelligence and Management, Beihang University, Ministry of Industry and Information Technology(北京航空航天大学数据智能与管理工业和信息化部重点实验室)

AI总结 本文研究了专业羽毛球视频中的细粒度时序动作定位问题,针对其复杂的时空动态特性,提出了一种解耦时空适配器(DSTA),通过将运动表示分解为三个并行分支,分别捕捉时间动态以及垂直和水平方向的空间变化,从而更有效地建模细粒度动作的细微差异。同时,作者构建了一个包含31场比赛、29类细粒度击球动作的Fine-Badminton数据集,并在该数据集和ShuttleSet基准上验证了方法的有效性,取得了最先进的性能,且计算和参数开销增加有限。

Comments 11 pages, 11figures

详情
AI中文摘要

时间动作定位(TAL)在通用视频理解中已被广泛研究,而由于复杂微妙的时空动态,专业羽毛球等细粒度体育场景仍未被充分探索。本文聚焦于专业羽毛球视频中的细粒度TAL,并引入一个新的基准数据集Fine-Badminton,包含31场比赛、29个细粒度击球类别,涵盖2104个回合和27597个标注动作。为了有效捕捉此类场景中的复杂运动模式,我们提出解耦时空适配器(DSTA),能够在参数高效框架内高效建模时空特征。具体而言,DSTA将运动表示分解为三个并行分支,分别捕捉时间动态以及垂直和水平空间变化。该设计使模型能够更好地区分细粒度动作之间的细微差异。在Fine-Badminton数据集和ShuttleSet基准上的大量实验表明,所提方法在仅增加微小计算和参数成本的情况下实现了最先进性能。这些结果验证了所提方法在细粒度时间动作定位中的有效性和效率。

英文摘要

Temporal Action Localization (TAL) has been extensively studied in generic video understanding, while fine-grained sports scenarios, such as professional badminton, remain underexplored due to their complex and subtle spatio-temporal dynamics. In this paper, we focus on fine-grained TAL in professional badminton videos and introduce a new benchmark dataset, Fine-Badminton, which consists of 31 matches with 29 fine-grained stroke categories, covering 2104 rallies and 27597 annotated actions. To effectively capture the intricate motion patterns in such scenarios, we propose a Decoupling Spatio-Temporal Adapter (DSTA), which enables efficient modeling of spatio-temporal features within a parameter-efficient framework. Specifically, DSTA decomposes motion representation into three parallel branches, capturing temporal dynamics as well as vertical and horizontal spatial variations. The design allows the model to better distinguish subtle differences among fine-grained actions. Extensive experiments on both the Fine-Badminton dataset and the ShuttleSet benchmark demonstrate that the proposed method achieves state-of-the-art performance while introducing only a marginal increase in computational and parameter cost. These results validate the effectiveness and efficiency of the proposed approach for fine-grained temporal action localization.

2605.23351 2026-05-25 cs.LG cs.GT 版本更新

Prudent-Banker: No Extra Fees for Baseline Safety in Adversarial Bandits With and Without Delays

Prudent-Banker: 对抗性赌博机中无延迟与有延迟下的基线安全保障无额外代价

Ting Hu, Luanda Cai, Emmanouil-Vasileios Vlatakis-Gkaragkounis

发表机构 * Department of Economics, University of Wisconsin–Madison(威斯康星大学麦迪逊分校经济学系) Department of Finance, University of Wisconsin–Madison(威斯康星大学麦迪逊分校金融系) Department of Computer Sciences, University of Wisconsin–Madison(威斯康星大学麦迪逊分校计算机科学系)

AI总结 本文研究了在有无延迟反馈的情况下,如何在对抗性多臂老虎机问题中实现安全基线下的最小最大最优最坏情况悔恨。为了解决延迟可能破坏安全保证的问题,作者提出了Prudent-Banker算法,结合了延迟自适应的在线镜像下降方法和改进的分阶段激进机制,实现了与安全策略相比近似常数悔恨的最优安全-鲁棒性权衡。该算法在理论分析中证明了其悔恨上界不可改进,并通过实验验证了其在多种延迟分布下的有效性。

详情
AI中文摘要

我们研究了在安全感知目标下,具有和不具有延迟反馈的对抗性多臂赌博机问题:实现极小极大最优的最坏情况遗憾,同时相对于指定的“安全”基线策略保持几乎恒定的遗憾。现有方法可以在即时反馈下针对平滑比较器平衡这种权衡,但任意延迟可能会使保守与探索之间的转换时机出错,危及安全保障。为了弥合这一差距,我们提出了Prudent-Banker,一种新颖的算法,它将延迟适应的在线镜像下降变体与修改的分阶段激进(phased-aggression)机制相结合。其关键技术贡献是一个延迟校准的重启阈值,该阈值严格考虑了未观察反馈引起的最坏情况失真,并可靠地检测比较器的次优性。我们还为安全约束的对抗性延迟赌博机建立了新的下界,表明在基线安全要求下,Prudent-Banker的遗憾保证在忽略对数因子时是不可改进的。据我们所知,Prudent-Banker是第一个实现最优安全-鲁棒性权衡的算法:伪遗憾$\widetilde{O}(\sqrt{T}+\sqrt{D})$加上相对于安全比较器的$\widetilde{O}(1)$遗憾,无论有无延迟。跨不同延迟分布的实验表明,与标准的延迟鲁棒基线不同,Prudent-Banker有效地平衡了安全性和学习。

英文摘要

We study adversarial multi-armed bandits with and without delayed feedback under a safety-aware goal: achieving minimax-optimal worst-case regret while keeping nearly constant regret relative to a designated "safe" baseline policy. Existing approaches can balance this trade-off with immediate feedback for smooth comparators, but arbitrary delays can mistime transitions between conservatism and exploration, endangering the safety guarantee. To bridge this gap, we propose Prudent-Banker, a novel algorithm that combines a delay-adapted variant of Online Mirror Descent with a modified phased-aggression mechanism. Its key technical contribution is a delay-calibrated restart threshold that rigorously accounts for the worst-case distortion induced by unobserved feedback and reliably detects comparator suboptimality. We also establish new lower bounds for safety-constrained adversarial delayed bandits, showing that the regret guarantees of Prudent-Banker are unimprovable, up to logarithmic factors, under the baseline-safety requirement. To the best of our knowledge, Prudent-Banker is the first algorithm to achieve the optimal safety--robustness trade-off: pseudo-regret $\widetilde{O}(\sqrt{T}+\sqrt{D})$ together with $\widetilde{O}(1)$ regret against the safe comparator, both with and without delays. Experiments across diverse delay distributions show that, unlike standard delay-robust baselines, Prudent-Banker effectively balances safety and learning.

2605.23346 2026-05-25 cs.LG 版本更新

Contrastive Distribution Matching for Amortized Sequential Monte Carlo in Discrete Diffusion

对比分布匹配用于离散扩散中的摊销序贯蒙特卡洛

Jaihoon Kim, Taehoon Yoon, Prin Phunyaphibarn, Seungjun Kim, Morteza Mardani, Minhyuk Sung

发表机构 * KAIST(韩国科学技术院) University of Michigan(密歇根大学) NVIDIA(英伟达)

AI总结 离散扩散模型在生成结构化分类数据方面表现出色,但如何高效地从奖励倾斜分布中采样仍是一个核心挑战。本文提出了一种名为对比分布匹配(CDM)的新框架,通过学习参数化的扭曲函数,将序列蒙特卡洛(SMC)推断的计算成本进行摊销,从而显著提升了推理效率。实验表明,CDM在多种应用场景中均优于现有方法,且额外计算开销极小,验证了其有效性与广泛适用性。

Comments Project Page: https://cdm-smc.github.io/

详情
AI中文摘要

离散扩散模型已成为生成结构化分类数据的强大框架。然而,从奖励倾斜分布中高效采样仍然是一个基本挑战。虽然扭曲序贯蒙特卡洛(SMC)为此任务提供了渐近精确性,但在离散状态空间中估计最优扭曲函数需要昂贵的蒙特卡洛近似,导致推理时严重的计算瓶颈。为克服这一限制,我们引入对比分布匹配(CDM),一种新颖的框架,通过正负样本学习参数化扭曲函数,摊销SMC推理的成本。为了高效训练,我们重新表述梯度估计器,以利用离散扩散模型的闭式前向核。在实践中,评估我们学习的扭曲函数相比基础模型的单次前向传播仅增加不到5%的额外计算开销。通过广泛的经验评估,我们证明CDM在匹配的挂钟时间下始终优于现有基线。我们在多种应用中验证了我们方法的有效性和通用性,包括有毒文本生成、调控DNA序列设计、蛋白质可设计性以及扩散大语言模型对齐。

英文摘要

Discrete diffusion models have emerged as powerful frameworks for generating structured categorical data. However, efficiently sampling from reward-tilted distributions remains a fundamental challenge. While Twisted Sequential Monte Carlo (SMC) offers asymptotic exactness for this task, estimating the optimal twist function in discrete state spaces necessitates costly Monte Carlo approximations, resulting in a severe computational bottleneck at inference. To overcome this limitation, we introduce Contrastive Distribution Matching (CDM), a novel framework that amortizes the cost of SMC inference by learning a parameterized twist function via positive and negative samples. For efficient training, we reformulate the gradient estimator to leverage the closed-form forward kernels of discrete diffusion models. In practice, evaluating our learned twist function incurs less than 5% additional computational overhead compared to a single forward pass of the base model. Through extensive empirical evaluations, we demonstrate that CDM consistently outperforms existing baselines under matched wall-clock time. We validate the effectiveness and versatility of our approach across a diverse range of applications, including toxic text generation, regulatory DNA sequence design, protein designability, and diffusion large language model alignment.
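To make the role of the twist function concrete, here is a minimal sketch of one generic reweight-and-resample step of twisted SMC. It assumes particles are proposed from the base model, so the incremental weight is the ratio of twist values at the current and previous steps. The function name and inputs are hypothetical; this shows only where a learned twist would plug in and does not reproduce CDM's training objective.

```python
import math
import random

def reweight_and_resample(particles, log_twist_prev, log_twist_curr, rng):
    """One generic twisted-SMC step with the base model as proposal.

    Incremental log-weight of particle i is
        log_twist_curr[i] - log_twist_prev[i].
    Returns the resampled particles and the normalized weights.
    """
    log_w = [c - p for p, c in zip(log_twist_prev, log_twist_curr)]
    m = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(v - m) for v in log_w]
    total = sum(w)
    probs = [v / total for v in w]
    # Multinomial resampling: draw N particles with replacement.
    resampled = rng.choices(particles, weights=probs, k=len(particles))
    return resampled, probs

rng = random.Random(0)
particles = ["seq_a", "seq_b", "seq_c"]
resampled, probs = reweight_and_resample(
    particles, [0.0, 0.0, 0.0], [0.0, -1.0, -1000.0], rng
)
print(probs)
```

Only the twist evaluations depend on the learned function, so when those are cheap the per-step cost stays close to a forward pass of the base model, which is the amortization argument the abstract makes.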

2605.23306 2026-05-25 physics.soc-ph cs.LG cs.SY eess.SY 版本更新

SpinFlow: A Physics-Informed Spin Field Framework for Traffic Phase Inference and Transition Detection

SpinFlow:一种用于交通相位推断与相变检测的物理信息自旋场框架

Haopeng Deng, Fucheng Zheng, Xinhai Xia

发表机构 * School of Future Transportation(未来交通学院)

AI总结 本文提出了一种名为SpinFlow的物理信息化自旋场框架,用于交通相位推断和相变检测。该方法结合Kerner的三相理论与统计物理,通过自旋场建模实现对宏观交通状态的连续推断,并利用正则化的期望最大化算法从高分辨率轨迹数据中反演潜在的自旋场结构。实验表明,SpinFlow在多个真实数据集上表现出优越的性能,能够准确识别交通相变点并生成可解释的相图,为智能交通管理提供了数据驱动且符合物理规律的决策依据。

Comments 11 pages, 8 figures, accepted to ITSC 2026

详情
AI中文摘要

主动交通管理(ATM)经常受到传统宏观模型和刚性经验阈值的阻碍,这些模型和阈值无法捕捉亚稳态相位前兆,导致延迟的反应性干预。为了解决这个问题,我们提出了SpinFlow,一个物理信息自旋场框架,将Kerner的三相理论与统计物理统一起来,用于连续宏观交通相位推断。受海森堡模型启发,SpinFlow通过潜在自旋向量和竞争平衡映射参数化空间变化的相位权重,使同步流自然出现。一种物理正则化的期望最大化算法从高分辨率轨迹中反演这种潜在结构,联合优化自旋场,同时软性强制执行质量守恒和空间平滑性。我们引入相位平衡度(PED)来量化结构对齐并在拓扑上定位相变点。在四个真实轨迹数据集上,SpinFlow实现了高达0.940的$R_{q}^{2}$,PED下降94.9-100%,以及可解释的相位图,在前向准确性、物理一致性和瓶颈定位方面优于三个异构基线。SpinFlow无需先验网络拓扑即可精确定位拥堵成核,为ATM提供了一种数据驱动、物理一致的触发机制。

英文摘要

Active traffic management (ATM) is frequently hindered by traditional macroscopic models and rigid empirical thresholds that fail to capture metastable phase precursors, resulting in delayed, reactive interventions. To address this, we propose SpinFlow, a physics-informed spin-field framework unifying Kerner's three-phase theory with statistical physics for continuous macroscopic traffic phase inference. Inspired by the Heisenberg model, SpinFlow parametrizes spatially varying phase weights via a latent spin vector and a competitive-equilibrium mapping, allowing synchronized flow to emerge naturally. A physics-regularized Expectation-Maximization algorithm inverts this latent structure from high-resolution trajectories, jointly optimizing the spin field while softly enforcing mass conservation and spatial smoothness. We introduce the Phase Equilibrium Degree (PED) to quantify structural alignment and topologically localize phase-transition points. Across four real-world trajectory datasets, SpinFlow achieves $R_{q}^{2}$ up to 0.940, PED drops of 94.9-100%, and interpretable phase maps that outperform three heterogeneous baselines on forward accuracy, physics consistency, and bottleneck localization. SpinFlow pinpoints congestion nucleation without prior network topology, yielding a data-driven, physics-consistent trigger for ATM.

2605.23295 2026-05-25 physics.optics cs.LG physics.app-ph 版本更新

Accelerating ground state search of spatial photonic Ising machines with genetic-simulated annealing hybrid algorithm

基于遗传-模拟退火混合算法加速空间光子伊辛机基态搜索

Ze Zheng, Ruhui Ni, Jingyi Zhao, Xiaojian Hu, Wen Jiang, Yuegang Li, Hang Xu, Tailong Xiao, Guihua Zeng

发表机构 * Institute for Quantum Sensing and Information Processing(量子传感与信息处理研究所) State Key Laboratory of Photonics and Communications(光子与通信国家重点实验室) Global College(全球学院) Shanghai Research Center for Quantum Sciences(上海量子科学研究中心) Hefei National Laboratory(合肥国家实验室) Shanghai Quantum Intelligence Sensing Technology Co., Ltd(上海量子智能感知技术有限公司)

AI总结 该研究提出了一种结合遗传算法与模拟退火的混合算法,用于加速空间光子伊辛机的基态搜索。传统方法依赖单一的模拟退火算法,收敛速度慢且耗时,而新方法在早期阶段利用遗传算法进行全局搜索,后期采用模拟退火进行局部优化,从而显著提升了求解效率和解的质量。数值模拟与实验表明,该方法在不同规模的全秩Max-Cut问题及高秩优化问题中均优于传统算法,为智能光子伊辛计算系统的发展提供了新思路。

Comments 12 pages, 6 figures

详情
AI中文摘要

基于空间光调制器的空间光子伊辛机已成为解决组合优化问题和自旋玻璃模拟等众多任务的高效求解器。然而,传统仅依赖模拟退火算法的SPIM在复杂能量景观中需要大量测量-反馈迭代才能找到相对最优解,存在收敛慢、时间成本高的问题。本文提出一种光学遗传-模拟退火混合算法来加速SPIM的基态搜索。GA在迭代早期进行全局粗粒度搜索,而SA在后期进行细粒度局部精化。数值模拟表明,我们的方法在不同规模的全秩Max-Cut问题上比纯GA或SA能获得更高的解质量。我们还在同一迭代预算下,在规范变换时分复用SPIM上实验证明了其相对于传统算法在高秩优化问题上的优越性。我们的方法可进一步与其他先进元启发式算法结合,向智能光学伊辛计算系统发展。

英文摘要

Spatial photonic Ising machines (SPIMs) based on spatial light modulators (SLMs) have emerged as highly effective solvers for many tasks, including combinatorial optimization problems and spin-glass simulations. However, traditional SPIMs relying solely on the simulated annealing (SA) algorithm require a large number of measurement-feedback iterations to find a near-optimal solution in complex energy landscapes, suffering from slow convergence and high time cost. Here, we propose an optical hybrid of a genetic algorithm (GA) and SA to accelerate the ground-state search of SPIMs. The GA conducts a global coarse-grained search in the early iteration stage, while SA performs fine-grained local refinement in the late stage. Numerical simulations show that our method achieves higher solution quality than pure GA or SA on full-rank Max-Cut problems of different scales. We also experimentally demonstrate its superiority over conventional algorithms on a gauge-transformation time-division multiplexing SPIM for high-rank optimization problems under the same iteration budget. Our approach can be further developed with other advanced metaheuristic algorithms toward intelligent optical Ising computing systems.
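
The two-stage schedule described above (GA first for coarse global search, SA afterwards for local refinement) can be sketched in software. This is only an illustrative toy on unweighted Max-Cut, not the optical implementation: the population size, mutation rate, cooling schedule, and all function names are assumptions.

```python
import math
import random


def cut_value(spins, edges):
    """Number of edges whose endpoints carry opposite spins (unweighted Max-Cut)."""
    return sum(1 for i, j in edges if spins[i] != spins[j])


def ga_phase(n, edges, rng, pop_size=20, generations=30, mutation_rate=0.05):
    """Coarse global search: truncation selection, uniform crossover, spin-flip mutation."""
    pop = [[rng.choice((-1, 1)) for _ in range(n)] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=lambda s: cut_value(s, edges), reverse=True)
        parents = pop[: pop_size // 2]  # keep the better half unchanged (elitism)
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            child = [rng.choice((x, y)) for x, y in zip(a, b)]
            child = [-s if rng.random() < mutation_rate else s for s in child]
            children.append(child)
        pop = parents + children
    return max(pop, key=lambda s: cut_value(s, edges))


def sa_phase(spins, edges, rng, steps=500, t_start=1.0, t_end=0.01):
    """Fine local refinement: single-spin flips with Metropolis acceptance."""
    current = list(spins)
    current_cut = cut_value(current, edges)
    best, best_cut = list(current), current_cut
    for k in range(steps):
        # Geometric cooling from t_start down to t_end.
        temp = t_start * (t_end / t_start) ** (k / max(steps - 1, 1))
        i = rng.randrange(len(current))
        current[i] = -current[i]
        new_cut = cut_value(current, edges)
        delta = new_cut - current_cut
        if delta >= 0 or rng.random() < math.exp(delta / temp):
            current_cut = new_cut
            if current_cut > best_cut:
                best, best_cut = list(current), current_cut
        else:
            current[i] = -current[i]  # rejected: undo the flip
    return best, best_cut


def hybrid_max_cut(n, edges, seed=0):
    """Run the GA phase, then hand its best individual to the SA phase."""
    rng = random.Random(seed)
    seed_solution = ga_phase(n, edges, rng)
    return sa_phase(seed_solution, edges, rng)
```

In a real SPIM each `cut_value` call would correspond to a measurement-feedback iteration, which is why the abstract compares methods under the same iteration budget.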

2605.23285 2026-05-25 cs.LG cond-mat.stat-mech cs.AI 版本更新

Reinforcement Learning for Microcanonical Graph Ensemble with Assortativity Constraints

具有同配性约束的微正则图系综的强化学习

Hoyun Choi, Junghyo Jo, Deok-Sun Lee

发表机构 * School of Computational Sciences, Korea Institute for Advanced Study(韩国高等科学研究院计算科学系) Department of Physics Education, Seoul National University(首尔国立大学物理教育系) Center for Theoretical Physics and Artificial Intelligence Institute, Seoul National University(首尔国立大学理论物理与人工智能研究所) Center for AI and Natural Sciences, Korea Institute for Advanced Study(韩国高等科学研究院人工智能与自然科学中心)

AI总结 本文研究如何通过强化学习生成满足特定同配性(assortativity,即度-度相关性)约束的微正则图系综,以精确控制网络结构特性。文中提出了一种基于强化学习的深度微正则图生成器(DMGG),通过保度重连操作,使图的同配性精确达到目标值,克服了传统方法在参数调优和生成效率上的不足。该方法能够在不同规模、稀疏度和拓扑结构的图上生成精确的零模型,有助于定量分析网络的次级可观测量(如聚类系数),为研究网络结构与功能的关系提供了有力工具。

详情
AI中文摘要

网络结构如何决定功能是一个基本问题,可以通过具有精确控制结构属性的图系综来研究。正则方法(以指数随机图模型ERGM的形式表述)仅在期望意义上施加约束,允许个体实现围绕目标值波动。相反,微正则系综严格施加硬约束,但除固定度序列之外的实用采样方法一直难以实现。本文介绍深度微正则图生成器(DMGG),一种强化学习(RL)框架,通过保度重连变换任意给定图,以精确达到指定的同配性(表征相邻节点的度-度相关性)。DMGG不依赖于ERGM中由熵主导的Metropolis-Hastings动力学,而是采用策略引导搜索,最大程度地改变联合度矩阵。这免去了详尽的参数调优,并在保持构型多样性的同时将生成速度提高至少一个数量级。由于DMGG可推广到各种图规模、稀疏度和拓扑结构,它提供了精确的零模型,可用于定量分离次级可观测量(如聚类系数)。这些结果确立了RL作为生成硬约束图的实用且强大的范式,为研究不受系综伪影影响的结构-功能关系开辟了途径。

英文摘要

How network structure determines function is a fundamental question, and it can be investigated by graph ensembles with precisely controlled structural properties. Canonical approaches, formulated as exponential random graph models (ERGMs), enforce constraints only in expectation, allowing individual realizations to fluctuate around the target. Conversely, microcanonical ensembles impose hard constraints exactly, but practical sampling methods beyond fixing the degree sequence have remained out of reach. Here we introduce the Deep Microcanonical Graph Generator (DMGG), a reinforcement learning (RL) framework that transforms any given graph through degree-preserving rewirings to exactly reach a prescribed assortativity, which characterizes the degree--degree correlation of adjacent nodes. Instead of relying on the entropically dominated Metropolis--Hastings dynamics of the ERGM, DMGG employs a policy-guided search that maximally alters the joint-degree matrix. This eliminates exhaustive parameter tuning and accelerates generation by at least an order of magnitude while preserving configurational diversity. As DMGG generalizes across various graph sizes, sparsities, and topologies, it provides exact null models that allow for the quantitative isolation of secondary observables, such as the clustering coefficient. These results establish RL as a practical and powerful paradigm for generating hard-constrained graphs, opening avenues to investigate structure-function relationships free from ensemble artifacts.
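
As a rough illustration of degree-preserving rewiring and degree assortativity, the sketch below uses naive hill climbing over random double-edge swaps. It is not DMGG: there is no learned policy, and the loop only moves toward the target rather than reaching it exactly. All function names and the acceptance rule are assumptions made for illustration.

```python
import random


def degrees(n, edges):
    """Degree of every node in an undirected simple graph."""
    deg = [0] * n
    for u, v in edges:
        deg[u] += 1
        deg[v] += 1
    return deg


def assortativity(n, edges):
    """Pearson correlation of the degrees at the two ends of each edge."""
    deg = degrees(n, edges)
    xs, ys = [], []
    for u, v in edges:
        # Count each undirected edge in both orientations.
        xs += [deg[u], deg[v]]
        ys += [deg[v], deg[u]]
    m = len(xs)
    mean = sum(xs) / m  # xs and ys share the same mean by symmetry
    cov = sum((x - mean) * (y - mean) for x, y in zip(xs, ys)) / m
    var = sum((x - mean) ** 2 for x in xs) / m
    return cov / var if var > 0 else 0.0


def try_swap(edges, edge_set, rng):
    """Propose the swap (a,b),(c,d) -> (a,d),(c,b); return a new edge list or None."""
    i, j = rng.sample(range(len(edges)), 2)
    (a, b), (c, d) = edges[i], edges[j]
    if len({a, b, c, d}) < 4:
        return None  # shared endpoint: swap could create a self-loop
    if frozenset((a, d)) in edge_set or frozenset((c, b)) in edge_set:
        return None  # swap would create a multi-edge
    new_edges = list(edges)
    new_edges[i], new_edges[j] = (a, d), (c, b)
    return new_edges


def rewire_toward(n, edges, target, steps, seed=0):
    """Accept only swaps that bring the assortativity strictly closer to the target."""
    rng = random.Random(seed)
    edges = list(edges)
    current = assortativity(n, edges)
    for _ in range(steps):
        edge_set = {frozenset(e) for e in edges}
        candidate = try_swap(edges, edge_set, rng)
        if candidate is None:
            continue
        r = assortativity(n, candidate)
        if abs(r - target) < abs(current - target):
            edges, current = candidate, r
    return edges, current
```

Every accepted swap leaves each node's degree unchanged, so the degree sequence is preserved throughout. Greedy acceptance like this can stall short of the target, which is the kind of limitation a guided search is meant to address.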

2605.23282 2026-05-25 eess.IV cs.CV cs.LG 版本更新

Discontinuous Galerkin Neural Operator for Pathology Defocus Deblurring

病理学离焦去模糊的间断伽辽金神经算子

Shaoqing Duan, Haofei Song, Xintian Mao, Qingli Li, Yan Wang

发表机构 * Shanghai Key Laboratory of Multidimensional Information Processing, East China Normal University, Shanghai, China(上海多维信息处理关键实验室,华东师范大学,上海,中国)

AI总结 病理学显微镜中的离焦去模糊因光学模糊的空间变化和局部不连续性而具有挑战性。现有深度学习方法受限于平移不变性假设和可解释性不足,难以处理这种异质性模糊模式。本文提出了间断伽辽金神经算子(DGNO),通过单元局部体积算子和界面数值通量参数化积分核,有效建模了异质且局部不连续的模糊模式,在保持光学成像物理特性的前提下,实现了更优的去模糊效果,并在高分辨率场景下表现出良好的性能。

Comments 17 pages, 9 figures. Accepted by ICML 2026

详情
AI中文摘要

病理显微镜中的离焦去模糊仍然具有挑战性,因为由位置相关的积分成像过程引起的光学模糊具有空间变化和局部不连续的特性。现有的深度学习方法受限于平移不变性假设和有限的可解释性,不太适合这种异质模糊模式。神经算子通过直接将离焦形成建模为积分算子,提供了一种原则性的替代方案,为离焦去模糊提供了新的视角。然而,大多数现有的用于低级视觉的神经算子架构依赖于全局参数化核,这些核假设平滑性和平稳性,限制了它们建模异质和局部不连续模糊模式的能力。为了解决这一限制,我们提出了间断伽辽金神经算子(DGNO),它使用具有单元局部体积算子和界面数值通量的间断伽辽金公式来参数化积分核。DGNO 提供了局部性、异质性建模和全局一致性的原则性组合,同时保留了光学图像形成的底层物理。广泛且深入的实验表明,DGNO 超越了现有技术,提供了更清晰的图像重建、对空间变化模糊的鲁棒处理以及可扩展的高分辨率性能。代码将在 https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur 发布。

英文摘要

Defocus deblurring in pathological microscopy remains challenging due to the spatially varying and locally discontinuous nature of optical blur induced by a position-dependent integral imaging process. Existing deep learning methods, constrained by shift-invariance assumptions and limited interpretability, are not well suited to such heterogeneous blur patterns. Neural operators provide a principled alternative by modeling defocus formation directly as an integral operator, offering a new perspective on defocus deblurring. However, most existing neural operator architectures for low-level vision rely on globally parameterized kernels that assume smoothness and stationarity, limiting their ability to model heterogeneous and locally discontinuous blur patterns. To address this limitation, we propose the Discontinuous Galerkin Neural Operator (DGNO), which parameterizes the integral kernel using a discontinuous Galerkin formulation with element-local volume operators and interface numerical fluxes. DGNO provides a principled combination of locality, heterogeneity modeling, and global coherence while preserving the underlying physics of optical image formation. Extensive and insightful experiments demonstrate that DGNO surpasses state-of-the-art methods, delivering sharper reconstructions, robust handling of spatially varying blur, and scalable high-resolution performance. The code will be released at https://github.com/DeepMed-Lab-ECNU/Single-Image-Deblur.

2605.23275 2026-05-25 cs.LG 版本更新

Diffusion Domain Expansion: Learning to Coordinate Pre-trained Diffusion Models

扩散域扩展:学习协调预训练扩散模型

Egor Lifar, Semyon Savkin, Timur Garipov, Shangyuan Tong, Tommi Jaakkola

发表机构 * MIT CSAIL(麻省理工学院计算机科学与人工智能实验室)

AI总结 本文提出了一种名为扩散域扩展(DDE)的方法,旨在高效地扩展预训练扩散模型,使其能够生成更大规模的对象并处理更复杂的条件输入。该方法通过一个紧凑的可训练网络协调预训练扩散模型的去噪输出,实现了对超出其原始训练范围的领域的泛化能力。实验表明,DDE在长音频生成和条件图像生成任务中均表现出色,优于其他协调生成方法。

Comments Accepted as poster at ICML 2024 Workshop on Structured Probabilistic Inference and Generative Modeling (SPIGM)

详情
AI中文摘要

在本文中,我们提出了扩散域扩展(DDE),一种高效扩展预训练扩散模型的方法,使其能够生成更大的对象并处理超出其原始能力的更复杂条件。我们的方法采用一个紧凑的可训练网络,旨在协调预训练扩散模型的去噪输出。我们证明协调器可以普遍简单,同时能够泛化到比训练时观察到的更大的域。我们在长音频轨道生成和条件图像生成上评估了DDE,展示了其跨域的适用性。在定性和定量评估中,DDE在扩散模型的协调生成方面优于其他方法。

英文摘要

In this paper, we propose Diffusion Domain Expansion (DDE), a method that efficiently extends pre-trained diffusion models to generate larger objects and handle more complex conditioning beyond their original capabilities. Our method employs a compact trainable network designed to coordinate the denoised outputs of pre-trained diffusion models. We demonstrate that the coordinator can be universally simple while being capable of generalizing to domains larger than those observed during its training time. We evaluate DDE on long audio track generation and conditional image generation, demonstrating its applicability across domains. DDE outperforms other approaches to coordinated generation with diffusion models in qualitative and quantitative evaluations.

2605.23272 2026-05-25 cs.LG cs.AI 版本更新

When Good Equations Get Bad Scores: Improving Symbolic Regression Through Better Parameter Optimization

当好方程得到差分数:通过更好的参数优化改进符号回归

Boxiao Wang, Kai Li, Zhiwei Chen, Yang Huang, Runxiang Wang, Ziwen Zhang, Yifan Zhang, Jian Cheng

发表机构 * Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所)

AI总结 符号回归(SR)在科学知识发现中扮演重要角色,旨在从观测数据中提炼出数学方程。现有方法通常采用双层优化框架,但参数拟合质量直接影响结构评分,导致正确结构可能因局部最优解而被低估。为此,本文提出SAGE-Fit,一种基于符号表达式结构与语义先验的拟合框架,有效缓解了优化瓶颈,显著提升了符号回归系统的评估准确性和整体性能。

详情
AI中文摘要

符号回归(SR)通过从观测数据中提炼数学方程,在科学知识发现中发挥核心作用。大多数现有SR方法在双层优化框架内运行:外层循环搜索离散方程结构,内层循环优化该结构的连续参数。关键的是,参数拟合质量直接决定结构的得分,从而影响外层搜索。然而,非线性算子使得内层循环高度非凸,而受计算预算所限、依赖快速局部求解器(如BFGS)的做法,常常使正确的结构落入较差的局部极小值,得分被低估。这种“好结构、差分数”现象成为关键瓶颈,降低效率并误导搜索偏离真实方程。为解决此问题,我们提出SAGE-Fit(结构感知与语义引导的符号回归评估器),一个利用符号表达式双重原生先验的SR原生拟合框架。通过利用SR特有的结构和语义先验,我们为每个属性设计定制模块,从而有效缓解这一优化瓶颈。大量实验表明,我们的方法作为即插即用模块,显著提升评估保真度,并普遍提高各种SR系统的性能。

英文摘要

Symbolic Regression (SR) plays a central role in scientific knowledge discovery by distilling mathematical equations from observational data. Most existing SR methods function within a bi-level optimization framework: an outer loop that searches for the discrete equation structure, and an inner loop that optimizes the continuous parameters of that structure. Crucially, parameter-fitting quality directly determines a structure's score and thus the outer-loop search. However, nonlinear operators make the inner loop highly non-convex, and budget-driven reliance on fast local solvers (e.g., BFGS) often yields poor local minima and underestimated scores for correct structures. This "Good Structure, Bad Score" phenomenon becomes a key bottleneck, degrading efficiency and misguiding the search away from the true equation. To resolve this, we propose SAGE-Fit (Structure-Aware and Semantics-Guided Evaluator for Symbolic Regression), an SR-native fitting framework that exploits the dual native priors of symbolic expressions. By capitalizing on the structural and semantic priors unique to SR, we design tailored modules for each property, thereby effectively mitigating this optimization bottleneck. Extensive experiments demonstrate that our approach, as a plug-and-play module, significantly enhances evaluation fidelity and universally improves the performance of various SR systems.
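
The inner-loop issue can be illustrated with a toy structure f(x; w) = sin(w x), whose loss in w is non-convex. The sketch below contrasts a single local fit with a plain multi-start baseline. It is not SAGE-Fit, which relies on structural and semantic priors; the gradient-descent solver, the step size, and the function names are assumptions chosen to keep the example dependency-free.

```python
import math


def loss(w, xs, ys):
    """Mean squared error of the candidate structure sin(w * x)."""
    return sum((math.sin(w * x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)


def grad(w, xs, ys):
    """Analytic derivative of the loss with respect to w."""
    return sum(
        2 * (math.sin(w * x) - y) * math.cos(w * x) * x for x, y in zip(xs, ys)
    ) / len(xs)


def local_fit(w0, xs, ys, lr=0.01, steps=500):
    """Plain gradient descent from one initial guess (stand-in for a fast local solver)."""
    w = w0
    for _ in range(steps):
        w -= lr * grad(w, xs, ys)
    return w, loss(w, xs, ys)


def multi_start_fit(inits, xs, ys):
    """Run the local solver from several initial guesses and keep the lowest loss."""
    return min((local_fit(w0, xs, ys) for w0 in inits), key=lambda r: r[1])


if __name__ == "__main__":
    xs = [0.1 * i for i in range(50)]
    ys = [math.sin(3.0 * x) for x in xs]  # data generated by the correct structure
    print(local_fit(0.5, xs, ys))  # a single start may stall in a poor local minimum
    print(multi_start_fit([0.5, 1.5, 2.5, 3.5, 4.5], xs, ys))
```

The score reported to the outer loop is the fitted loss, so a structure that is exactly right can still be ranked poorly if the local solver starts in the wrong basin. Multi-start trades extra solver calls for a better chance of a fair score.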

2605.23268 2026-05-25 stat.ML cs.LG 版本更新

Coupled Training with Privileged Information and Unlabeled Data

基于特权信息与未标记数据的联合训练

Jiahao Shi, Omar Hagrass, Jason M. Klusowski

发表机构 * Department of Electrical and Computer Engineering, Princeton University(普林斯顿大学电子与计算机工程系) Department of Operations Research and Financial Engineering, Princeton University(普林斯顿大学运筹学与金融工程系)

AI总结 在许多预测任务中,训练时可获得额外信息(如昂贵或难以收集的测量数据),而这些信息在模型部署时并不可用。本文提出了一种联合训练方法,将利用额外信息的模型与仅使用测试时输入的部署模型一同训练,使部署模型仅在额外信息真正有助于预测时才加以利用,从而避免继承其错误。该方法提供了预测准确率提升的理论保证,并通过实验验证了其在合成数据和实际任务中的优越性。

Comments 37 pages, 6 figures. Accepted to ICML 2026

详情
AI中文摘要

在许多预测问题中,我们在训练期间拥有额外信息(例如,昂贵或收集缓慢的测量值),但在模型部署时这些信息将不可用。一种常见策略是首先训练一个使用所有训练信息的模型,然后利用其对未标记样本的预测来训练第二个模型,该模型仅使用测试时可用的输入。然而,当额外的训练专用信息较弱或存在噪声时,这种两阶段方法可能会误导部署模型,甚至降低准确性。我们提出一种联合训练方法,同时学习两个模型,使得部署模型仅在额外信息真正有帮助时从中受益,而不是继承其错误。我们提供了描述联合训练何时提高预测准确性的保证,并分析了一种适用于大规模高维模型的简单交替训练算法。在合成数据和真实世界预测任务上的实验表明,我们的方法避免了这些失败,并稳健地优于标准两阶段基线。

英文摘要

In many prediction problems, we have extra information during training (for example, measurements that are expensive or slow to collect) that will not be available when the model is deployed. A common strategy is to first train a model that uses all training information, then use its predictions on unlabeled examples to train a second model that only uses the inputs available at test time. However, when the extra training-only information is weak or noisy, this Two-Stage approach can mislead the deployment model and even hurt accuracy. We propose a joint training method that learns the two models together, so the deployment model can benefit from the extra information only when it actually helps, instead of inheriting its mistakes. We provide guarantees that describe when joint training improves prediction accuracy and analyze a simple alternating training algorithm for large, high-dimensional models. Experiments on synthetic data and real-world prediction tasks show that our approach avoids these failures and robustly outperforms standard Two-Stage baselines.

2605.23259 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Multi-Gate Residuals

多门残差

Zhizhan Zheng, Feiyun Zhang, Shuchun Liu, Tian Xia, Xi Liu, Dasheng Hu, Hongquan Zhou

发表机构 * Shanghai Yichuang Information Technology Co.,Ltd.(上海亿创信息技术有限公司) Fudan University(复旦大学)

AI总结 本文提出了一种名为Multi-Gate Residuals(MGR)的新方法,旨在解决深度残差网络中激活值无界增长的问题,同时避免引入额外的通信开销。该方法通过简单的评分与门控机制维护多流上下文,并结合注意力池化技术提取隐藏状态,从而在保持激活规模稳定的同时提升模型性能。实验表明,MGR在大规模训练与部署中具有实用性,并优于现有架构。

详情
AI中文摘要

虽然注意力残差在解决深度残差层中普遍存在的激活值无界增长问题方面显示出一定效果,但它不可避免地引入了显著的通信开销。为了规避这一瓶颈,我们提出了多门残差(MGR),它在不增加通信负担的情况下稳定激活尺度。它利用简单的评分和门控机制来维护多流上下文,并结合注意力池化从流状态中提取隐藏状态。实证实验表明,MGR对于大规模训练和部署是实用的,相比现有架构提供了切实的性能提升。

英文摘要

While Attention Residuals has shown some effectiveness in addressing the widespread issue of unbounded activation growth across deep residual layers, it inevitably incurs significant communication overhead. To circumvent this bottleneck, we propose Multi-Gate Residuals (MGR), which stabilizes activation scales without additional communication burden. It utilizes a straightforward scoring and gating mechanism to maintain multi-stream context, coupled with Attention Pooling to extract hidden states from the stream states. Empirical experiments demonstrate that MGR is practical for large-scale training and deployment, offering tangible performance improvements over existing architectures.

2605.23258 2026-05-25 cs.LG 版本更新

A Simple Plug-in for Improving Eviction-Based KV Cache Compression

一种改进基于驱逐的KV缓存压缩的简单插件

Yuping Lin, Jiayuan Ding, Yue Xing, Pengfei He, Jiliang Tang, Subhabrata Mukherjee

发表机构 * Michigan State University(密歇根州立大学) Hippocratic AI(希波克拉底AI)

AI总结 在大型语言模型的长上下文推理中,键值缓存(KV cache)的增长是一个主要瓶颈。本文提出VECTOR,一种用于改进基于驱逐的KV缓存压缩的即插即用方法,通过引入三路令牌路由机制(保留、近似和驱逐),将基础评分器给出的重要性信号与离线校准的回归式值估计给出的可重构性信号相结合,有效改善了缓存压缩下的质量-内存权衡,在较严格的内存预算下收益尤为明显。

详情
AI中文摘要

KV缓存增长是大语言模型长上下文推理的主要瓶颈。现有方法通常以二元驱逐或表示近似为主,可能未充分利用那些对精确保留不关键但仍可重构的令牌。我们提出VECTOR,一种用于基于驱逐的流水线的即插即用增强,引入了三路令牌路由:保留、近似和驱逐。VECTOR将来自基础评分器的重要性信号与来自离线校准的基于回归的值估计的可重构性信号相结合。通过利用可重构性,VECTOR恢复了在二元驱逐下本会不可逆丢失的有用值信息,同时保留关键向量以保证注意力路由稳定性。实验结果表明,VECTOR在中高压缩率下改善了质量-内存权衡,在更严格的预算方案中尤其有显著收益。

英文摘要

KV cache growth is a major bottleneck for long-context inference in large language models. Existing methods are often dominated by binary eviction or representation approximation, which may underutilize tokens that are not critical for exact retention but are still reconstructable. We present VECTOR, a plug-and-play augmentation for eviction-based pipelines that introduces three-way token routing: retention, approximation, and eviction. VECTOR combines an importance signal from the base scorer with a reconstructability signal from an offline-calibrated regression-based value estimation. By leveraging reconstructability, VECTOR recovers useful value information that would otherwise be irreversibly lost under binary eviction, while preserving key vectors for attention routing stability. Experimental results show that VECTOR improves quality-memory trade-offs under medium-to-high compression, with especially clear gains in stricter budget regimes.
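
The three-way routing idea can be sketched as a simple decision rule over two per-token scores. The abstract does not give VECTOR's actual rule, so the top-k retention budget, the reconstructability threshold, and the function name `route_tokens` are all hypothetical.

```python
def route_tokens(importance, reconstructability, keep_budget, recon_threshold):
    """Assign each cached token to 'retain', 'approximate', or 'evict'.

    Hypothetical rule for illustration:
      - the `keep_budget` most important tokens keep their exact key/value pair;
      - of the rest, tokens whose value looks reconstructable are approximated
        (per the abstract, the key is preserved and the value is estimated);
      - everything else is evicted.
    """
    order = sorted(range(len(importance)), key=lambda i: importance[i], reverse=True)
    retained = set(order[:keep_budget])
    routes = []
    for i in range(len(importance)):
        if i in retained:
            routes.append("retain")
        elif reconstructability[i] >= recon_threshold:
            routes.append("approximate")
        else:
            routes.append("evict")
    return routes


if __name__ == "__main__":
    print(route_tokens(
        importance=[0.9, 0.1, 0.5, 0.2],
        reconstructability=[0.2, 0.8, 0.1, 0.3],
        keep_budget=2,
        recon_threshold=0.5,
    ))
    # → ['retain', 'approximate', 'retain', 'evict']
```

Binary eviction corresponds to setting `recon_threshold` above every score, in which case the middle route is never taken.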

2605.23255 2026-05-25 cs.LG cs.DS 版本更新

Learning-Augmented Online Scheduling with Parsimonious Preemption

具有节俭抢占的学习增强在线调度

Mugen Blue, Sungjin Im, Alexander Lindermayr

发表机构 * University of California, Santa Cruz(加州大学圣克鲁兹分校) Institut für Mathematik, Technische Universität Berlin(柏林技术大学数学研究所)

AI总结 本文研究了学习增强型在线调度问题,力求在优化作业延迟的同时减少抢占(preemption)次数。作者提出了一种新的算法框架,在保证调度性能的同时,在预测准确时将每个作业的抢占次数控制在常数级别,且抢占开销随预测误差对数增长。该工作首次为无关机和可塑机的调度提供了有界抢占的理论保证,拓展了学习增强调度理论的适用范围。

详情
AI中文摘要

学习增强算法已成为一种强大的范式,通过整合可能带有噪声的预测来超越传统的最坏情况下限。虽然该框架在在线调度中取得了成功,但现有工作主要优化作业延迟,同时依赖于频繁的“盲目”抢占。这忽略了算法性能与抢占复杂度之间的基本权衡。我们首次系统研究了在优化延迟的同时限制抢占的学习增强调度。我们证明了理论延迟界限与抢占开销之间的差距可以通过坚实的分析基础来弥合。我们的结果包括:在准确预测下,单机和无关并行机上每作业仅需$O(1)$次抢占的$O(1)$-竞争比算法,且开销随预测误差对数增长。通过为无关机和可塑机提供首个有界抢占保证,我们将学习增强框架的理论范围扩展到更受约束和更现实的设置。最后,通过实验验证了我们的算法。

英文摘要

Learning-augmented algorithms have emerged as a powerful paradigm to surpass traditional worst-case lower bounds by integrating potentially noisy predictions. While this framework has seen success in online scheduling, existing work primarily optimizes job latency while relying on frequent, "blind" preemptions. This ignores the fundamental trade-off between algorithmic performance and preemption complexity. We provide the first systematic study of learning-augmented scheduling that curbs preemption while optimizing latency. We establish that the gap between theoretical latency bounds and preemption overhead can be bridged with solid analytical foundations. Our results include $O(1)$-competitive algorithms for single and unrelated parallel machines with only $O(1)$ preemptions per job under accurate predictions, with overhead scaling logarithmically with the prediction error. By providing the first bounded-preemption guarantees for unrelated and malleable machines, we extend the theoretical reach of the learning-augmented framework to more constrained and realistic settings. Finally, our algorithms are validated through experiments.

2605.23249 2026-05-25 cs.LG cs.AI 版本更新

Enhancing Deep Neural Network Reliability with Refinement and Calibration

通过精炼和校准增强深度神经网络的可靠性

Ramya Hebbalaguppe, Ajay Shastry, Soumya Suvra Ghosal, Chetan Arora

发表机构 * SIT, Indian Institute of Technology Delhi, New Delhi, India(印度理工学院德里SIT,新德里)

AI总结 尽管深度神经网络在预测准确性方面表现优异,但其置信度估计往往不可靠,可能影响用户对其决策的信任。为此,本文提出了一种新的损失函数和统一训练框架RefCal,旨在同时提升模型的校准性、锐度(即正确与错误预测之间的置信度差异)和准确率,从而增强深度神经网络的可靠性。实验表明,RefCal在类别不平衡的数据集上显著优于现有方法。

Comments ICLR 2026, Trustworthy AI and Representational Alignment

详情
AI中文摘要

尽管深度神经网络(DNN)实现了高预测精度,但其置信度估计通常不可靠,可能损害用户对其决策的信任。这推动了校准模型的研究,其中校准衡量模型预测置信度与正确经验概率的一致性。然而,校准指标通常可以通过后处理技术改进,这些技术仅模仿训练时的不确定性,而并未真正提升模型的理解。因此,统计学家建议模型不仅要校准,还要精炼。直观上,如果模型对正确和错误预测分配显著不同的置信度分数,则被认为更精炼,这一属性也称为锐度。我们观察到,许多现有的校准方法以降低精炼度为代价来改善校准。为解决这一局限,我们提出:(1)一种新的损失函数,显式促进精炼度,并可通过监督对比学习优化;(2)一个统一的训练框架RefCal,联合优化校准、精炼度和准确性,以提高DNN的可靠性。在类别不平衡率为10%的CIFAR-100-LT数据集上,RefCal实现了(准确率,精炼度,ECE)为(58.81,95.67,0.08),显著优于广泛使用的Correctness Ranking Loss(46.27,93.7,0.22)。

英文摘要

Although deep neural networks (DNNs) achieve high predictive accuracy, their confidence estimates are often unreliable, potentially compromising user trust in their decisions. This has motivated research on calibrated models, where calibration measures how well a model's predicted confidence aligns with the empirical probability of correctness. However, calibration metrics can often be improved through post-processing techniques that merely mimic training-time uncertainty without genuinely improving the model's understanding. For this reason, statisticians recommend that models be not only calibrated but also refined. Intuitively, a model is considered more refined if it assigns significantly different confidence scores to correct and incorrect predictions, a property also referred to as sharpness. We observe that many existing calibration methods improve calibration at the cost of reduced refinement. To address this limitation, we propose: (1) a novel loss function that explicitly promotes refinement and can be optimized through supervised contrastive learning; and (2) a unified training framework, RefCal, that jointly optimizes calibration, refinement, and accuracy to improve DNN reliability. On the CIFAR-100-LT dataset with 10 percent class imbalance, RefCal achieves (accuracy, refinement, ECE) of (58.81, 95.67, 0.08), substantially outperforming the widely used Correctness Ranking Loss, which achieves (46.27, 93.7, 0.22).
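
For reference, the ECE figure quoted above is commonly computed with equal-width confidence bins. The sketch below uses that standard binned estimator; the choice of 10 bins and the function name are assumptions, since the abstract does not state the exact evaluation protocol.

```python
def expected_calibration_error(confidences, correct, num_bins=10):
    """Binned ECE: sample-weighted mean of |accuracy - average confidence| per bin."""
    n = len(confidences)
    bins = [[] for _ in range(num_bins)]
    for conf, ok in zip(confidences, correct):
        # A confidence of exactly 1.0 goes into the last bin.
        idx = min(int(conf * num_bins), num_bins - 1)
        bins[idx].append((conf, ok))
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(1 for _, ok in members if ok) / len(members)
        ece += (len(members) / n) * abs(accuracy - avg_conf)
    return ece


if __name__ == "__main__":
    # Four predictions at 0.9 confidence, three of them correct: gap of 0.15.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, True, True, False]))
```

ECE alone does not capture refinement: a model that outputs the base-rate accuracy as its confidence on every input is well calibrated, yet it cannot separate correct from incorrect predictions. That is the gap the abstract's refinement objective targets.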

2605.23244 2026-05-25 cs.LG 版本更新

Convex Optimization for Alignment and Preference Learning on a Single GPU

单GPU上的对齐与偏好学习的凸优化

Miria Feng, Mert Pilanci

发表机构 * Department of Electrical Engineering, Stanford University, California, United States(斯坦福大学电气工程系,加州,美国)

AI总结 本文提出了一种名为COALA的凸优化算法,用于在单块GPU上高效完成大语言模型的对齐与偏好学习。该方法利用神经网络的凸优化重表述,消除了对参考模型的依赖,显著降低了训练时间和显存消耗。实验表明,COALA在四个数据集和六个模型上表现出具有竞争力的性能和效率,其总计算量最低仅为DPO的约17.6%,且训练过程中奖励稳定、单调增长,达到峰值边际的时间也明显缩短。

详情
AI中文摘要

微调大型语言模型(LLMs)以符合人类偏好推动了Gemini和ChatGPT等系统的成功。然而,基于人类反馈的强化学习(RLHF)等方法仍然计算昂贵且复杂。直接偏好优化(DPO)提供了一种更简单的替代方案,但存在排名准确性不一致、对GPU资源依赖度高以及超参数调优成本高等局限性。我们提出了对齐与偏好学习的凸优化算法(COALA):一种具有强理论保证的新型轻量级策略。通过利用神经网络的凸优化重表述,COALA消除了对参考模型的需求,并在训练时间和VRAM消耗上实现了显著减少,从而能够在单个GPU上进行高效训练。在四个数据集(包括一个26621样本的合成教育反馈数据集)和六个模型(包括Llama-3.1-8B)上的实验表明,COALA在最低仅使用DPO总TFLOPs约17.6%的情况下,展现了具有竞争力的性能和效率。与DPO和ORPO等传统方法相比,COALA表现出稳定、单调递增的奖励,并在显著更短的时间内达到峰值边际。据我们所知,这是首次将凸优化有效应用于LLMs的偏好微调。

英文摘要

Fine-tuning large language models (LLMs) to align with human preferences has driven the success of systems such as Gemini and ChatGPT. However, approaches like Reinforcement Learning from Human Feedback (RLHF) remain computationally expensive and complex. Direct Preference Optimization (DPO) offers a simpler alternative but has limitations such as inconsistent ranking accuracy, high dependence on GPU resources, and expensive hyperparameter tuning. We propose the Convex Optimization for Alignment and Preference Learning Algorithm (COALA): a novel lightweight strategy with strong theoretical guarantees. By leveraging the convex optimization reformulation of neural networks, COALA eliminates the need for a reference model and achieves significant reductions in both training time and VRAM consumption, thus enabling efficient training on a single GPU. Experiments across four datasets (including a 26621-sample synthetic Educational Feedback dataset) and six models (including Llama-3.1-8B) demonstrate COALA's competitive performance and efficiency while utilizing as little as ~17.6% of DPO's total TFLOPs. COALA exhibits stable, monotonically increasing rewards and reaches peak margins in significantly shorter time than traditional methods such as DPO and ORPO. To the best of our knowledge, this is the first time convex optimization has been effectively applied to preference fine-tuning of LLMs.

2605.23241 2026-05-25 cs.LG 版本更新

RelPrism: A Multi-Faceted Pre-training Framework with Self-Generated Tasks for Relational Databases

RelPrism:面向关系数据库的带自生成任务的多面预训练框架

Jinyu Yang, Cheng Yang, Junze Chen, Zedi Liu, Muhan Zhang, Hanyang Peng, Chuan Shi

发表机构 * Beijing University Of Posts and Telecommunications(北京邮电大学) Peng Cheng Laboratory(鹏城实验室) Peking University(北京大学)

AI总结 关系数据库(RDB)仍是现代数据系统的核心,支持多种预测任务。尽管现有的关系深度学习方法通过将数据库转化为图结构并应用图模型进行表征学习,但有效的自监督预训练方法仍面临挑战,尤其是在处理多视角、多粒度的信息需求时。为此,本文提出RelPrism,一种多视角的自监督学习框架,通过从不同角度构建内在属性、关系属性和混合属性,并结合多粒度聚类生成伪任务,使预训练表征更具适应性。实验表明,RelPrism在多个真实数据集上的分类和回归任务中均优于现有方法。

详情
AI中文摘要

关系数据库(RDB)仍然是现代数据系统的基石,并支持多种预测任务。最近的关系深度学习(RDL)方法通过将RDB转换为图(其中行表示为节点,表间交互表示为边),然后应用基于图的模型进行表示学习,从而实现端到端预测。尽管RDL具有强大的能力,但有效的自监督预训练对于RDB仍然具有挑战性。RDB任务通常需要跨不同视角和粒度的多面信息。例如,用户流失分类可能更依赖于交互模式,而消费价值预测则需要用户-项目行为和内在用户属性来进行细粒度回归。这种异构需求对RDB表示学习提出了挑战,因为预训练目标应涵盖全面的信息以适应下游任务。然而,现有的自监督学习方法通常从单一视角(如节点级内在属性或子图级关系结构)获取监督信号,适应性有限。为此,我们提出了RelPrism,一个面向RDB的多面自监督学习框架。RelPrism从不同视角构建内在、关系和混合属性,并对每个视角应用多粒度聚类以形成相应的伪任务池。在这些池上进行预训练使表示暴露于更广泛的视角和粒度级别,为下游适应提供了更强的基础。在5个真实数据集上的14个任务上的实验表明,RelPrism在分类任务上比最先进的基线提高了4.15%的ROC-AUC,在回归任务上降低了10.75%的MAE。我们的代码可在https://anonymous.4open.science/r/RelPrism获取。

英文摘要

Relational databases (RDBs) remain the cornerstone of modern data systems and support diverse predictive tasks. Recent relational deep learning (RDL) methods enable end-to-end prediction by converting RDBs into graphs, where rows are represented as nodes and inter-table interactions are represented as edges, and then applying graph-based models for representation learning. Despite the strong capability of RDL, effective self-supervised pre-training for RDBs remains non-trivial. RDB tasks often require multi-faceted information across different perspectives and granularities. For example, user churn classification may rely more on interaction patterns, whereas consumption value prediction requires both user-item behaviors and intrinsic user attributes for fine-grained regression. Such heterogeneous needs challenge RDB representation learning, as pre-training objectives should cover comprehensive information for downstream adaptation. However, existing self-supervised learning (SSL) methods typically derive supervision from a single facet, such as node-level intrinsic attributes or subgraph-level relational structures, providing limited adaptability. To this end, we propose RelPrism, a multi-faceted SSL framework for RDBs. RelPrism constructs intrinsic, relational, and hybrid attributes from distinct perspectives, and applies multi-granularity clustering to each perspective to form corresponding pseudo-task pools. Pre-training over these pools exposes representations to broader perspectives and granularity levels, yielding a stronger basis for downstream adaptation. Experiments on 14 tasks across 5 real-world datasets show that RelPrism improves ROC-AUC by 4.15% for classification and reduces MAE by 10.75% for regression over state-of-the-art baselines. Our code is available at https://anonymous.4open.science/r/RelPrism.

2605.23238 2026-05-25 cs.AI cs.GT cs.LG cs.MA 版本更新

GENSTRAT: Toward a Science of Strategic Reasoning in Large Language Models

GENSTRAT:迈向大型语言模型中的战略推理科学

Vartan Shadarevian, Kia Ghods, Alex Kenich, Anany Kotawala

发表机构 * Princeton University(普林斯顿大学) Google(谷歌)

AI总结 本文提出GENSTRAT,一种基于程序化生成战略环境的评估框架,用于更准确地评估大型语言模型在复杂战略场景中的推理能力。该方法生成一系列两人零和不完全信息纸牌游戏,并结合能力剖面和“锯齿度”(jaggedness)指标,全面评估模型在不同战略维度上的表现和稳定性。实验表明,较新的前沿模型平均得分更高,但整体实力相近的模型在能力剖面和局部波动性上存在明显差异,为实际部署提供了更细致的诊断依据。

Comments 33 pages, 8 figures, 9 tables (4 figures, 2 tables in main paper)

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被部署为市场、拍卖和竞价环境中的经济主体。预测它们在特定部署中的行为是困难的。现有的战略推理基准在固定的规范博弈上评估模型。这些基准可能会随着前沿模型的改进而饱和,并且不允许评估者从基准性能自信地推广到实际部署中涉及的各种混乱的战略环境。我们引入了GENSTRAT,它使用程序化生成的战略环境来解决这些挑战。具体来说,我们生成了一个两人零和、不完全信息纸牌游戏的分布。生成器可以按需生成新游戏,从而实现常青评估并抵抗污染。我们将游戏分布与一种能力剖面方法论配对,该方法论将模型能力分解为六个轴(状态空间、时间深度、信息敏感性、对手建模、风险和脆弱性)。我们还引入了一种分布内平滑度的锯齿度量,用于检测模型在战略相似游戏之间优势是否不可预测地跳跃。我们从2000个游戏的生成池中采样了50个基准游戏,并在一个包含超过36,000场比赛的正面交锋锦标赛中评估了九个前沿和开放权重LLM。较新的前沿模型平均得分更高。除了平均值之外,整体实力几乎相同的模型显示出性质不同的能力剖面,并且排行榜前三名模型中的两个(gpt-5和claude)在局部波动性上明显高于第三个(gemini-3.1-pro),尽管整体实力接近。总之,能力剖面和锯齿度量提供了仅靠整体排名无法提供的与部署相关的诊断信息。

英文摘要

Large language models (LLMs) are increasingly deployed as economic agents in marketplaces, auctions, and bidding settings. Anticipating their behavior in any specific deployment is hard. Existing strategic-reasoning benchmarks evaluate models on fixed canonical games. These benchmarks may saturate as the frontier improves, and they do not allow evaluators to generalize with confidence from benchmark performance to the varied and messy strategic environments that actual deployments involve. We introduce GENSTRAT, which uses procedurally generated strategic environments to address these challenges. Concretely, we generate a distribution of two-player zero-sum imperfect-information card games. The generator can draw fresh games on demand, allowing for evergreen evaluation and resistance to contamination. We pair the game distribution with a capability-profile methodology that decomposes model competence across six axes (state space, temporal depth, information sensitivity, opponent modeling, risk, and brittleness). We also introduce a jaggedness measure of within-distribution smoothness that detects when a model's advantage jumps unpredictably between strategically similar games. We sample 50 benchmark games from a 2,000-game generated pool and evaluate nine frontier and open-weight LLMs in a head-to-head tournament with over 36,000 matches. Newer frontier-tier models score higher on average. Beyond that average, models with near-identical overall strength show qualitatively different capability profiles, and two of the top three leaderboard models (gpt-5 and claude) are noticeably more locally volatile than the third (gemini-3.1-pro), despite being close in overall strength. Together, the capability profile and the jaggedness measure give a deployment-relevant diagnostic that the overall ranking alone cannot provide.

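2605.23235 2026-05-25 cs.LG Version update

Convex Low-resource Accent-Robust Language Detection in Speech Recognition

Miria Feng, William Tan, Mert Pilanci

Affiliations * Department of Electrical Engineering; Department of Computer Science, Stanford University, California, United States

AI summary As globalization and multiculturalism spread, speech recognition systems often perform poorly on low-resource dialects and accents, misidentifying the language and disrupting downstream dialogue tasks. This paper proposes Convex Language Detection (CLD), a robust low-resource language detection method that combines theoretically grounded convex optimization with a multi-GPU ADMM implementation to achieve efficient training and globally optimal solutions. The method comes with theoretical stability guarantees and, in experiments, is robust to variation in input dialect, reaching 97-98% accuracy even in low-resource settings.

Details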
Abstract

Globalization and multiculturalism continue to produce increasingly diverse speech varieties. Yet current spoken dialogue systems frequently fail on under-represented dialects and accents, often misidentifying the input language and causing cascading failures in downstream dialogue tasks. Addressing this dialectal variance under low-resource constraints remains an open challenge, as standard fine-tuning is computationally expensive and prone to overfitting on high-dimensional speech data. We propose Convex Language Detection (CLD), a novel framework that integrates theoretically grounded convex optimization techniques into the spoken dialogue systems pipeline. Our method is efficiently implemented via multi-GPU Alternating Direction Method of Multipliers (ADMM) in JAX, thus providing global optimality guarantees and fast training in polynomial time. Theoretically, we prove that our convex objective induces certified margin stability and provide guarantees against feature perturbations. Empirically, we demonstrate sample efficiency and robustness to input dialectical variation, achieving 97-98% accuracy in challenging low-resource regimes. Our open-source package is available at https://pypi.org/project/jaxcld/

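2605.23225 2026-05-25 cs.DS cs.DM cs.IT cs.LG math.IT math.ST stat.TH Version update

Entropy Equivalence Testing

Clément L. Canonne, Yash Pote, Jonathan Scarlett, Joy Qiping Yang

Affiliations * University of Sydney; National University of Singapore

AI summary This paper introduces a new problem, entropy equivalence testing: decide whether two unknown distributions are identical or have entropies that differ by at least a given threshold, a weaker requirement than classical closeness testing. The authors give a time- and sample-efficient algorithm and show that the sample complexity can be significantly lower than that of closeness testing. The result is then applied to closeness testing of low-degree Bayesian networks, improving on either the sample or the time complexity of a baseline based on full learning.

Details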
Abstract

We introduce the problem of \emph{entropy equivalence testing} for probability distributions, a relaxation of the well-studied closeness testing problem, where the distribution testing algorithm is now only required to distinguish, given samples from two unknown distributions $p,q$ and a parameter $\varepsilon \in(0,1/2]$, between $p=q$ and $|H(p)-H(q)| \geq \varepsilon$ (where $H$ denotes the Shannon entropy). We provide a time- and sample-efficient algorithm for this task, showing that the optimal sample complexity for this task can be significantly lower than that of closeness testing. As an application, we leverage this result to provide the first non-trivial testing algorithm for (standard) closeness of low-degree \emph{Bayesian networks}, which significantly improves on either the sample or time complexity of a baseline based on full learning.
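The two cases the tester must separate can be made concrete with a small sketch. This is not the paper's algorithm: the naive plug-in estimator below is a stand-in, and the use of natural logarithms and the helper names are assumptions made for illustration.

```python
import math
from collections import Counter


def shannon_entropy(probs):
    """Shannon entropy H(p) in nats of an explicitly given distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)


def plugin_entropy(samples):
    """Naive plug-in estimate: entropy of the empirical distribution."""
    n = len(samples)
    counts = Counter(samples)
    return shannon_entropy(c / n for c in counts.values())


def is_far_in_entropy(p, q, eps):
    """True when |H(p) - H(q)| >= eps, the 'far' case of the problem."""
    return abs(shannon_entropy(p) - shannon_entropy(q)) >= eps


# Two distributions that differ but have equal entropy fall in neither case,
# which is why the problem is a relaxation of closeness testing.
print(is_far_in_entropy([0.5, 0.5], [0.9, 0.1], eps=0.3))
print(is_far_in_entropy([0.9, 0.1], [0.1, 0.9], eps=0.01))
```

The second call illustrates that a tester for this problem may answer either way on a permuted pair of distributions, since permuting the support leaves the entropy unchanged.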

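2605.23220 2026-05-25 cs.LG Version update

WMAttack: Automated Attack Search for Adversarial Evaluation of World-Model Agents

Zhixiang Guo, Siyuan Liang, Shi Fu, Cheng Guo, Andras Balogh, Mark Jelasity, Dacheng Tao

Affiliations * Nanyang Technological University; University of Szeged

AI summary Although world models are increasingly used as decision-making agents, their adversarial robustness remains understudied because dedicated automated evaluation methods are lacking. To resolve the tension between accuracy and efficiency in attack evaluation, this paper proposes WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. It searches over attack configurations under a finite budget and combines Self-Correcting Attack Search (SCAS) with Representation-Guided Attack Retrieval (RGAR), finding stronger attacks than the evaluated baselines on Atari and DeepMind Control tasks.

Details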
Abstract

Despite the growing use of world models as decision-making agents, their adversarial robustness remains underexplored due to the lack of dedicated automated evaluation methods. A key obstacle is that attack evaluation must be both accurate and efficient: weak manually tuned attacks can overestimate robustness, while exhaustive hyperparameter search is prohibitively expensive because each candidate requires closed-loop rollouts through learned latent dynamics. We introduce WMAttack, an automated attack-search framework for adversarial evaluation of world-model agents. WMAttack formulates robustness evaluation as a finite-budget search over attack configurations, including attack families, perturbation budgets, optimization steps, restarts, and allocation rules. To improve search accuracy, Self-Correcting Attack Search (SCAS) refines the attack proposal distribution using feedback from reward degradation, action instability, runtime cost, and rollout variability. To improve search efficiency, Representation-Guided Attack Retrieval (RGAR) retrieves effective historical configurations from representation-similar tasks, providing a warm start for unseen environments. We provide a theoretical explanation showing that proposal refinement improves finite-budget search when it shifts probability mass toward high-utility attacks. Across Atari and DeepMind Control tasks, WMAttack consistently discovers stronger attacks than the evaluated baselines, improving normalized reward drop from 0.497 to 1.034 on DreamerV3 Atari and from 0.319 to 0.682 on DMC. Ablations further show that RGAR improves initial candidate quality and SCAS improves final attack utility under fixed evaluation budgets.

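2605.23219 2026-05-25 cs.LG cs.AI Version update

PaP-NF: Probabilistic Long-Term Time Series Forecasting via Prefix-as-Prompt Reprogramming and Normalizing Flows

Minju Kim, Youngbum Hur

Affiliations * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea

AI summary This paper proposes PaP-NF, a probabilistic long-term time series forecasting framework. A Prefix-as-Prompt mechanism aligns continuous time series representations with a frozen large language model, and a normalizing flow decoder is conditioned on the global context the model extracts, which lets the framework model uncertainty. Across several long-term forecasting benchmarks, the method captures multi-modal uncertainty while keeping point forecasting accuracy competitive.

Comments Accepted to ICPR 2026

Details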
Abstract

Time series forecasting plays a central role in many real-world applications and has been extensively studied. Most existing approaches rely on deterministic models. However, real-world environments exhibit inherently uncertain and complex future behaviors, making single-point predictions insufficient. This highlights the need for probabilistic forecasting methods that can quantify and represent uncertainty. In this work, we propose PaP-NF, a probabilistic forecasting framework that aligns continuous time series representations with a frozen large language model (LLM) using a Prefix-as-Prompt mechanism, and conditions a normalizing flow decoder on the global context extracted by the LLM. The quality of the resulting predictive distributions is evaluated using the Continuous Ranked Probability Score (CRPS), a standard metric in probabilistic forecasting. Across a variety of long-term forecasting benchmarks, PaP-NF robustly captures multi-modal uncertainty while maintaining competitive point forecasting accuracy. The official implementation is available at: https://github.com/democracy04/PaP-NF
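For readers unfamiliar with CRPS, a minimal sketch of the standard sample-based estimate follows. It computes the CRPS of the empirical distribution of the forecast samples, using the identity CRPS = E|X - y| - 0.5 E|X - X'|. The function name is hypothetical, and this is not taken from the PaP-NF evaluation code.

```python
def crps_from_samples(samples, y):
    """CRPS of the empirical forecast distribution against observation y.

    Lower is better. With a single sample it reduces to absolute error.
    """
    m = len(samples)
    # Average distance between forecast samples and the observation.
    accuracy_term = sum(abs(x - y) for x in samples) / m
    # Half the average pairwise distance between forecast samples.
    spread_term = sum(abs(a - b) for a in samples for b in samples) / (2 * m * m)
    return accuracy_term - spread_term


print(crps_from_samples([1.0], 3.0))       # one sample: absolute error
print(crps_from_samples([0.0, 2.0], 1.0))  # two samples straddling y
```

The pairwise term costs O(m^2) time, which is acceptable for a few hundred samples per forecast step; sorting-based O(m log m) formulas exist for larger sample counts.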

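2605.23215 2026-05-25 cs.LG cs.AI cs.CL Version update

FastKernels: Benchmarking GPU Kernel Generation in Production

Gabriele Oliaro, Yichao Fu, May Jiang, Owen Lu, Junli Wang, Zhihao Jia, Hao Zhang, Samyam Rajbhandari

Affiliations * Snowflake AI Research; CMU; UCSD; Independent Researcher

AI summary LLM-based agents for GPU kernel generation are evaluated on benchmarks that do not match real production environments. This work proposes FastKernels, a benchmark built on 46 representative architectures in 8 categories whose kernels cover 96.2% of HuggingFace Transformers architectures, and which doubles as a production-grade inference framework. Experiments show that even the strongest state-of-the-art kernel agent achieves only a 0.94x aggregate speedup over production baselines on FastKernels, which points to the benchmark-production mismatch as a critical bottleneck.

Details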
Abstract

LLM-based agents for GPU kernel generation are advancing rapidly, yet their progress is fundamentally constrained by the benchmarks they optimize against. Existing benchmarks are poorly aligned with production inference frameworks: they evaluate kernels on a single GPU with synthetic inputs, ignore the surrounding compilation stack, and reward replicating known optimizations rather than discovering new ones. The resulting reward signals are misleading: agents learn to generate kernels that score well in sandboxes but introduce interface incompatibilities, compilation-stack conflicts, and silent correctness degradation when integrated into real systems. We introduce FastKernels, a kernel benchmark built around a minimal set of 46 representative architectures spanning 8 categories, whose kernels collectively subsume those of 96.2% (409/425) of HuggingFace Transformers architectures. FastKernels doubles as a minimalistic, production-grade inference framework that runs at parity with hardened systems such as vLLM and SGLang on mainstream LLM serving and substantially exceeds upstream references on under-served architectures; each task's interface mirrors the corresponding module in the state-of-the-art library for its architecture family, enabling direct deployment of optimized kernels into production codebases. Evaluating state-of-the-art kernel agents on FastKernels, we find that even the strongest agent achieves only 0.94$\times$ aggregate speedup over production baselines, with weaker agents at $0.78\times$ and $0.53\times$ -- confirming that benchmark-production misalignment is a critical bottleneck for the field. We release FastKernels as a stepping stone toward kernel agents whose benchmark gains translate directly into production throughput improvements. Code is available at https://github.com/Snowflake-AI-Research/fastkernels

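2605.23203 2026-05-25 cs.CV cs.AI cs.LG cs.RO Version update

Lipschitz Optimization for Formal Verification of Homographies

Jean-Guillaume Durand, Panagiotis Kouvaros, Maxime Gariel, Alessio Lomuscio

Affiliations * Joby Aviation; Safe Intelligence

AI summary This paper studies formal robustness verification for vision neural networks in safety-critical domains, focusing on 3D perturbations caused by camera motion. The authors propose a verification method based on Lipschitz optimization and piecewise-continuity analysis: they establish a closed-form mapping from camera pose to pixel values and derive tight linear bounds on the perturbed pixel values. The method applies to scenes with predominantly planar structure, such as those in augmented reality, autonomous driving, and robotic manipulation, and is validated on several benchmarks, where it is faster and gives tighter bounds than prior work.

Comments 18 pages, 13 figures, 6 tables, to be published at CVPR 2026

Details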
Abstract

The adoption of vision neural networks in regulated industries requires formal robustness guarantees, especially in safety-critical domains such as healthcare, autonomous vehicles, and aerospace. However, current approaches are confined to incomplete statistical verification or robustness to $\ell_p$-norm and affine transforms, which cover only a narrow subset of perturbations to the image formation process. In particular, robustness to camera motion remains an open problem despite being key to deploy many vision applications. We present a formal verification approach that targets robustness against 3D motion perturbations of the capturing camera. We first establish a closed-form mapping from camera pose to pixel values. By analyzing the continuity properties of the resulting homographies, we show that recent work on Lipschitz optimization and piecewise continuity can be extended to derive tight linear bounds on perturbed pixel values. Our approach applies to scenes with predominantly planar structure, such as ground planes in augmented reality, road markings and traffic signs in autonomous driving, or planar workspaces in robotic manipulation. This enables the first formal verification of projective geometry transforms, without complex simulation, surrogate networks, or explicit image-formation models. We validate our implementation and show up to 89% speedup and 7% tighter bounds over prior work. We then evaluate our method on the VNN-COMP benchmark and reveal systematic weaknesses to projective perturbations. Finally, we demonstrate a real-world case study on a safety-critical runway classifier, highlighting practical vulnerabilities to camera motion, and addressing a key challenge in the certification of learned models. Data and code are publicly available at https://github.com/jeangud/homography-verification .

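2605.23200 2026-05-25 cs.LG cs.AI Version update

Adaptive Mass-Segmented KV Compression for Long-Context Reasoning

Junzhe Yang, Xiaoyu Shen

Affiliations * Shanghai Jiao Tong University; Institute of Digital Twin, Eastern Institute of Technology

AI summary In long-form inference, the linear growth of the key-value (KV) cache is a key bottleneck. Existing compression methods evict tokens by importance score, but this can wipe out contiguous reasoning blocks and break logical coherence. This paper proposes Adaptive Mass-Segmented (AMS) KV Compression, a framework that allocates memory quotas according to the spatial distribution of attention mass, protects important reasoning segments, and is compatible with several mainstream compression methods and modern paged-KV serving frameworks. Experiments show that AMS mitigates structural fragmentation and improves model performance.

Details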
Abstract

The linear growth of the Key-Value (KV) cache is a critical bottleneck in long-form LLM inference. Existing KV compression methods mitigate this by evicting tokens based on importance scores. However, we show that their reliance on global Top-k selection triggers Region Wipe-out: the severe eviction of contiguous reasoning blocks that derails logical coherence. To address this, we propose Adaptive Mass-Segmented (AMS) KV Compression, a framework that shifts the paradigm from token-level competition to region-aware quota allocation. AMS adaptively partitions the KV cache based on the spatial distribution of attention mass, ensuring structurally vital reasoning segments receive guaranteed memory quotas. To ensure stability during iterative decoding, an EMA-based smoothing mechanism is incorporated to prevent jitter in segment boundaries. Crucially, AMS is a universal plug-and-play layer that is orthogonal to existing scorers. It can be seamlessly integrated into representative methods such as TOVA, Expected Attention, KeyDiff, R-KV and TriAttention. AMS is also system-compatible with modern paged-KV serving frameworks such as vLLM, supporting efficient gather-and-compact KV execution without introducing additional steady-state attention overhead. Extensive experiments across a diverse suite of tasks, including mathematical reasoning (MATH500, AIME, GSM8K), code completion, open-domain QA, and sparse retrieval, demonstrate that AMS consistently mitigates structural fragmentation and boosts model performance.
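The contrast between global Top-k eviction and region-aware quotas can be illustrated with a toy sketch. It is not the AMS algorithm: the segments are passed in by hand rather than derived from attention mass, and the fixed per-segment floor `min_per_segment` is a hypothetical stand-in for the paper's quota rule.

```python
def global_top_k(scores, k):
    """Keep the k highest-scoring token positions, ignoring where they sit."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def quota_keep(scores, segments, k, min_per_segment):
    """Guarantee every segment a few slots, then fill the budget by score.

    Assumes min_per_segment * len(segments) <= k.
    """
    kept = set()
    for seg in segments:
        best = sorted(seg, key=lambda i: scores[i], reverse=True)
        kept.update(best[:min_per_segment])
    # Spend the remaining budget on the best tokens not yet kept.
    rest = sorted(
        (i for i in range(len(scores)) if i not in kept),
        key=lambda i: scores[i],
        reverse=True,
    )
    kept.update(rest[: max(0, k - len(kept))])
    return sorted(kept)


scores = [9, 8, 7, 6, 1, 2, 1, 1]        # importance per cached token
segments = [[0, 1, 2, 3], [4, 5, 6, 7]]  # two contiguous regions

print(global_top_k(scores, 4))            # second region is wiped out
print(quota_keep(scores, segments, 4, 1)) # second region keeps one token
```

With the same budget of four tokens, the global rule keeps only the first region, while the quota rule retains one token from the low-scoring region at the cost of one high-scoring token.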

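2605.23198 2026-05-25 cs.LG Version update

Label-Efficient Dataset Pruning via Semi-Supervised Pseudo-Labeling

Yeseul Cho, Baekrok Shin, Changmin Kang, Chulhee Yun

Affiliations * Graduate School of AI, KAIST

AI summary This paper proposes SemiPrune, a label-efficient dataset pruning method that addresses the reliance of conventional pruning methods on fully labeled data. It needs only a small randomly labeled subset and uses semi-supervised learning to generate pseudo-labels for the unlabeled data, so that existing supervised pruning methods can be applied. Unlike methods that rely on features from pretrained models, SemiPrune learns directly from the target dataset, which captures the target distribution better and gives more reliable pruning signals. It achieves state-of-the-art results among label-free and label-efficient baselines on domain-specific, image-corrupted, and long-tailed datasets.

Comments 10 pages

Details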
Abstract

Dataset pruning reduces the storage and training costs of deep learning by selecting an informative subset from a large dataset. However, most existing pruning methods require fully labeled data, which limits their applicability in realistic settings where unlabeled data are abundant and annotation is costly. Recent label-free pruning methods address this issue, but they rely on features from pretrained models to estimate example difficulty. This dependence can be unreliable when the target dataset differs substantially from the pretraining distribution. We propose SemiPrune, a label-efficient dataset pruning framework, using only a small randomly labeled subset, that uses semi-supervised learning to generate pseudo-labels for unlabeled data, allowing existing supervised pruning methods that require label information to be seamlessly applied to the resulting pseudo-labeled training pool. We then estimate example difficulty from pseudo-label-induced training dynamics and select a coreset. By learning directly from the target dataset, our method better captures the target distribution and provides more reliable signals for difficulty estimation and coreset selection. We validate our approach on domain-specific, image-corrupted, and long-tailed datasets, where it achieves state-of-the-art performance among label-free and label-efficient baselines, while also demonstrating competitive performance on standard benchmarks.

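2605.23194 2026-05-25 cs.LG cs.AI Version update

Scalable Heterogeneous Graph Foundation Models for Data-Driven Optimal Power Flow in Smart Grids

Massimiliano Lupo Pasini, Yijiang Li, Kibaek Kim, Teja Kuruganti

Affiliations * Computational Sciences and Engineering Division, Oak Ridge National Laboratory; Mathematics and Computer Science Division, Argonne National Laboratory; UT-Battelle, LLC

AI summary This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven optimal power flow (OPF) surrogate models and graph foundation models (GFMs). It preserves the heterogeneous structure of the distinct node and edge types in power networks and supports distributed preprocessing, training, hyperparameter optimization, and downstream fine-tuning on supercomputers. Experiments identify compact models with the lowest validation losses, and fine-tuning the pretrained model improves low-data accuracy and training efficiency on feasibility classification and N-1 contingency regression.

Comments 10 pages, 6 tables, 4 figures

Details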
Abstract

Fast and reliable optimal power flow (OPF) approximation is essential for reliable smart-grid operation, yet many learning-based surrogates either flatten the native heterogeneous structure of power networks, target a limited set of grid topologies, or lack scalable infrastructure for graph foundation model (GFM) training. This paper presents a scalable heterogeneous graph neural network (GNN) workflow, built on HydraGNN, for data-driven OPF surrogate modeling and OPF-GFM development. The workflow preserves the distinct node and edge types of power grids -- buses, generators, loads, shunts, AC lines, transformers, and device-to-bus couplings -- and supports distributed preprocessing, training, hyperparameter optimization (HPO), and downstream fine-tuning on leadership-class supercomputers. Using three million heterogeneous graph instances spanning ten PGLib-OPF cases, from 14 to 13,659 buses, we conduct DeepHyper-driven HPO on the ORNL Frontier supercomputer. The campaign identifies compact models ($\sim$1.6--1.7M parameters) with the lowest validation losses. Downstream experiments on feasibility classification and N-1 contingency regression show that fine-tuning pretrained OPF GFM improves low-data accuracy, stabilizes training, accelerates convergence, and reduces adaptation cost when partial or head-only fine-tuning is used.

2605.23191 2026-05-25 cs.LG cs.IR cs.NA math.NA Version update

Expand More, Shrink Less: Shaping Effective-Rank Dynamics for Dense Scaling in Recommendation

Guoming Li, Shangyu Zhang, Junwei Pan, Wentao Ning, Jin Chen, Gengsheng Xue, Chao Zhou, Shudong Huang, Haijie Gu, Menglin Yang

Affiliations * The Hong Kong University of Science and Technology (Guangzhou); Tencent Inc.; Tencent Inc., Shenzhen, China

AI summary Scaling recommendation models is a central challenge in recommender systems. To address the embedding collapse that the existing method RankMixer exhibits as it scales, this paper proposes RankElastor, a new architecture that uses parameterized full mixing and GLU-style per-token feedforward networks to make representation spectra more robust and to mitigate the decay of effective rank. Experiments on large-scale industrial datasets show that RankElastor consistently improves recommendation performance and scales more robustly.

Comments Accepted at the 32nd ACM SIGKDD Conference on Knowledge Discovery and Data Mining (Research Track), KDD 2026 February Cycle

Details
Abstract

Scaling recommendation models is a central challenge in recommender systems. Recently, RankMixer has emerged as an effective solution, operating on a unified token representation and alternating between token mixing and per-token feedforward networks (P-FFNs) to achieve scalable performance. However, RankMixer suffers from \textit{embedding collapse}, where learned representations have low effective rank, limiting expressivity and underutilizing the expanded representation space. Through empirical analysis and theoretical insights, we identify rigid token mixing and P-FFN modules as the primary causes of this phenomenon, jointly inducing a \textbf{damped oscillatory trajectory} in effective-rank evolution across layers. To address it, we propose RankElastor, a novel architecture that produces spectrum-robust representations with provable collapse mitigation. RankElastor introduces two components: (i) \textbf{parameterized full mixing}, which enables expressive token mixing with improved spectral robustness; and (ii) \textbf{GLU-improved P-FFNs}, which stabilize representation spectra through GLU-style FFN modules. Extensive experiments on large-scale industrial datasets demonstrate that RankElastor consistently improves recommendation performance, mitigates embedding collapse, and exhibits robust scaling behavior. Code is available at this GitHub repository: https://github.com/vasile-paskardlgm/RankElastor
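The abstract does not say which definition of effective rank it uses. One common choice, assumed here, is the exponential of the Shannon entropy of the normalized singular values of the embedding matrix. The sketch takes the singular values as given; computing them (for example with an SVD routine) is left out.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the normalized singular-value spectrum.

    Equals the matrix rank when all nonzero singular values are equal,
    and approaches 1 as a single direction dominates (collapse).
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


print(effective_rank([1.0, 1.0, 1.0, 1.0]))   # flat spectrum
print(effective_rank([10.0, 0.1, 0.1, 0.1]))  # collapsed spectrum
```

Under this definition, a representation can have full algebraic rank yet an effective rank close to 1, which is the sense in which an expanded representation space can go underused.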

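2605.23189 2026-05-25 cs.LG Version update

Empirical Bayes Conformal Prediction for Vision and Language Models

Jiapeng Zeng, Yogesh Prabhu, Zhanpeng Zeng, Michael A. Newton, Vikas Singh

Affiliations * University of Wisconsin–Madison; University of California San Diego; Xiamen University

AI summary This paper proposes an empirical Bayes conformal prediction framework for vision and language models. It uses $r$-values to turn score variability into an uncertainty-aware nonconformity score that estimates how likely a candidate is to belong to the top-ranked group. The method preserves the target coverage while reducing the inclusion of high-variance false candidates, and on several benchmarks it gives more stable rankings and smaller prediction sets.

Details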
Abstract

Conformal prediction (CP) gives distribution-free coverage for modern vision and language models, but it is often forced to make a ranking decision from a single unstable nonconformity score. Standard CP uses one realization, while average-then-calibrate variants smooth multiple realizations into a point estimate. Both options discard the inconsistency that can help identify whether a candidate is indeed stable. A weak answer can enter the conformal set even if the evidence is not strong, simply because one posterior sample or prompt phrasing made it look strong. But variability can help distinguish a stable signal from noise-driven fluctuations. We describe an empirical Bayes conformal prediction framework that uses $r$-values to convert score variability into an uncertainty informed nonconformity score. The resulting $r$-value estimates how likely a candidate's latent score belongs to the top-ranked group after accounting for both its mean score and its uncertainty. It admits both a closed-form Normal-Normal empirical Bayes estimator and a nonparametric posterior-sampling estimator. Using the $r$-value as the nonconformity score preserves the target conformal coverage while provably reducing the inclusion of high variance false candidates under mild regularity conditions. Across image classification, CLIP-based VLM benchmarks, and LLMs, we show that $r$-value conformal prediction preserves target coverage while improving ranking stability and reducing set size when variability is informative, and reverting to CP-like behavior when variability vanishes.
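The intuition behind using variability can be sketched with the textbook Normal-Normal model: an observed score x ~ N(theta, s2) with prior theta ~ N(mu, tau2). This is only the posterior shrinkage step, not the $r$-value computation itself, and the prior parameters `mu` and `tau2` are supplied by hand here, whereas an empirical Bayes method would estimate them from the data.

```python
def normal_normal_posterior(x, s2, mu, tau2):
    """Posterior mean and variance of theta given one noisy score x.

    x    : observed score
    s2   : variance of the observation noise for this candidate
    mu   : prior mean (assumed known here)
    tau2 : prior variance (assumed known here)
    """
    weight = tau2 / (tau2 + s2)  # how much to trust the observation
    mean = mu + weight * (x - mu)
    var = weight * s2
    return mean, var


# A noisy candidate with a high raw score versus a stable one with a lower score.
noisy_mean, _ = normal_normal_posterior(x=3.0, s2=9.0, mu=0.0, tau2=1.0)
stable_mean, _ = normal_normal_posterior(x=2.0, s2=0.25, mu=0.0, tau2=1.0)
print(noisy_mean, stable_mean)
```

After shrinkage the stable candidate ranks above the noisy one even though its raw score is lower, which is the kind of reordering that a variability-aware score makes possible.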

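2605.23182 2026-05-25 cs.LG Version update

Pure Exploration for a Good Policy in Reinforcement Learning with Bandit Feedback

Zitian Li, Wang Chi Cheung

Affiliations * Department of Industrial Systems Engineering & Management

AI summary This paper studies how, in reinforcement learning with bandit feedback, to efficiently identify a policy that is good enough rather than optimal. The authors formulate Good Policy Identification (GPI): given a reward threshold, find a policy that meets it or declare that none exists. They design a new algorithm, BEE-GPI, and derive upper bounds on its sample complexity for both positive and negative instances; the leading coefficient does not depend on the sizes of the state and action spaces, in contrast to best policy identification. Experiments validate the approach.

Details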
Abstract

Pure exploration in episodic Reinforcement Learning has primarily focused on Best Policy Identification (BPI), which seeks to identify a (near)-optimal policy with high confidence. Motivated by practical settings where a ``good enough'' policy suffices, we study an alternate objective of Good Policy Identification (GPI). For a given reward threshold $μ_0$, GPI only requires identifying a policy with expected reward in an episode at least $μ_0$ if such a policy exists (positive instance), or declaring None if no such policy exists (negative instance). We formalize GPI under the fixed-confidence setting. We require the output to be correct with probability $\geq 1-δ$, and seek to minimize the expected sample complexity, which is the expected number of episodes explored for the output. We propose a novel algorithm BEE-GPI, and derive theoretically-grounded upper bounds on its sample complexity for positive and negative instances. Notably, for positive instances, the coefficient of $\log 1/δ$ in our upper bound is $O(H^2/(V^* - μ_0)^2)$, where $H$ is the episode length and $V^*$ is the optimal expected reward in an episode. The coefficient does not depend on the action and state space sizes otherwise, in sharp contrast to the sample complexity in BPI. We further establish lower bound results to show the near-optimality of BEE-GPI and the necessity of the $1/(V^* -μ)^2$ term. Numerical experiments further validate the efficiency of our approach.
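To get a feel for the stated coefficient of $\log 1/δ$ on positive instances, the sketch below evaluates $H^2/(V^* - μ_0)^2$ for example values. The big-O hides constants, so the numbers are only relative scales, and the function name and inputs are made up for illustration.

```python
def gpi_coefficient(horizon, v_star, mu0):
    """Scale of the log(1/delta) coefficient: H^2 / (V* - mu0)^2.

    Only meaningful on positive instances, where V* exceeds the threshold.
    """
    gap = v_star - mu0
    if gap <= 0:
        raise ValueError("positive instance requires v_star > mu0")
    return horizon ** 2 / gap ** 2


wide = gpi_coefficient(horizon=10, v_star=0.8, mu0=0.6)
narrow = gpi_coefficient(horizon=10, v_star=0.8, mu0=0.7)
print(wide, narrow, narrow / wide)
```

Halving the gap between the optimal value and the threshold quadruples the coefficient, while the number of states and actions does not enter the expression at all.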

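2605.23180 2026-05-25 cs.CL cs.LG Version update

Self-Improving In-Context Learning

Baturay Saglam, Dionysis Kalogerias

Affiliations * Department of Electrical and Computer Engineering

AI summary This paper improves in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The log-probabilities the model assigns to the demonstrated outputs serve as a signal of how well it has inferred the task; the authors turn this signal into a self-supervised confidence proxy that needs no extra data and calibrate the prompt embeddings with zeroth-order optimization. The method requires no fine-tuning, no token generation, and no predefined label set, applies to both classification and free-form generation, and matches or improves on the base model across a range of ICL tasks.

Details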
Abstract

We propose to improve in-context learning (ICL) by optimizing the continuous embeddings of a fixed few-shot prompt at test time. The key observation is that the log-probabilities a model assigns to its demonstrated outputs, available from a single forward pass without generating any tokens, provide a meaningful signal for how well the model has inferred the task from its demonstrations. We formalize this signal as a bounded, self-supervised confidence proxy and maximize it via zeroth-order optimization over the prompt embeddings, yielding a test-time calibration procedure. The approach requires no finetuning, no token generation, no predefined label set, and no external data, making it equally applicable to both classification and free-form generation tasks. Across a comprehensive suite of ICL tasks, the proposed calibration consistently matches or improves upon the base model and outperforms classification-specific baselines on most tasks. The statistically significant correlation between proxy improvement and downstream accuracy gain confirms that the proposed proxy encodes a reliable optimization signal for in-context learning.
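Zeroth-order optimization needs only function evaluations, which is why forward passes alone suffice. The sketch below uses a generic two-point random-direction estimator on a toy objective; the abstract does not specify which estimator the paper uses, and `toy_proxy` merely stands in for the model-derived confidence proxy. All names and hyperparameters are hypothetical.

```python
import random


def toy_proxy(x):
    """Stand-in objective with its maximum at x = (1, 1, ..., 1)."""
    return -sum((xi - 1.0) ** 2 for xi in x)


def zeroth_order_ascent(f, x, steps=200, sigma=0.1, lr=0.05, seed=0):
    """Maximize f using only evaluations of f (no gradients)."""
    rng = random.Random(seed)
    x = list(x)
    for _ in range(steps):
        # Probe the objective along a random Gaussian direction.
        u = [rng.gauss(0.0, 1.0) for _ in x]
        plus = f([xi + sigma * ui for xi, ui in zip(x, u)])
        minus = f([xi - sigma * ui for xi, ui in zip(x, u)])
        # Finite-difference estimate of the directional derivative.
        slope = (plus - minus) / (2.0 * sigma)
        # Move along the probed direction in proportion to the slope.
        x = [xi + lr * slope * ui for xi, ui in zip(x, u)]
    return x


start = [0.0, 0.0, 0.0]
result = zeroth_order_ascent(toy_proxy, start)
print(toy_proxy(start), toy_proxy(result))
```

Each step costs two evaluations of the objective, so in the prompt-embedding setting the budget would be counted in forward passes rather than backward passes.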

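2605.23171 2026-05-25 cs.LG cs.AI stat.ML Version update

Understanding and Improving Noisy Embedding Techniques in Instruction Finetuning

Abhay Yadav

Affiliations * Johns Hopkins University

AI summary This study examines techniques that add noise to the embedding layer during instruction fine-tuning, compares uniform and Gaussian noise, and proposes SymNoise, a new method based on symmetric noisy embeddings. Theoretical and empirical analysis indicates that the two noise types perform comparably, while SymNoise improves fine-tuning by regulating the model's local curvature more stringently. On AlpacaEval, SymNoise scores 69.04% against 64.69% for NEFTune, a relative improvement of about 6.7%.

Comments arXiv admin note: substantial text overlap with arXiv:2312.01523

Details
Journal ref
IEEE International Conference on Language Modeling (COLM), 2025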
Abstract

Recent advancements in instructional fine-tuning have injected noise into embeddings, with NEFTune (Jain et al., 2024) setting benchmarks using uniform noise. Despite NEFTune's empirical findings that uniform noise outperforms Gaussian noise, the reasons for this remain unclear. This paper aims to clarify this by offering a thorough analysis, both theoretical and empirical, indicating comparable performance among these noise types. Additionally, we introduce a new fine-tuning method for language models, utilizing symmetric noise in embeddings. This method aims to enhance the model's function by more stringently regulating its local curvature, demonstrating superior performance over the current method, NEFTune. When fine-tuning the LLaMA-2-7B model using Alpaca, standard techniques yield a 29.79% score on AlpacaEval. However, our approach, SymNoise, increases this score significantly to 69.04%, using symmetric noisy embeddings. This is a 6.7% improvement over the state-of-the-art method, NEFTune (64.69%). Furthermore, when tested on various models and stronger baseline instruction datasets, such as Evol-Instruct, ShareGPT, OpenPlatypus, SymNoise consistently outperforms NEFTune. The current literature, including NEFTune, has underscored the importance of more in-depth research into the application of noise-based strategies in the fine-tuning of language models. Our approach, SymNoise, is another significant step towards this direction, showing notable improvement over the existing state-of-the-art method.
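The reported 6.7% figure is a relative gain, not a difference in percentage points. A quick check with the three AlpacaEval scores quoted in the abstract (the helper name is made up):

```python
def gains(new_score, old_score):
    """Return (absolute gain in points, relative gain in percent)."""
    absolute = new_score - old_score
    relative = 100.0 * absolute / old_score
    return absolute, relative


# SymNoise (69.04) versus NEFTune (64.69) and standard fine-tuning (29.79).
points_vs_neftune, percent_vs_neftune = gains(69.04, 64.69)
points_vs_standard, _ = gains(69.04, 29.79)
print(round(points_vs_neftune, 2), round(percent_vs_neftune, 2))
print(round(points_vs_standard, 2))
```

The gap to NEFTune is 4.35 points, which is about 6.7% of NEFTune's score; the gap to standard fine-tuning is 39.25 points.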

2605.23170 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Positional Failures in Long-Context LLMs: A Blind Spot in Reasoning Benchmarks

长上下文LLM中的位置失败:推理基准测试中的盲点

Chuyifei Zhang, Hongyu Cui, Xiaowen Huang, Jitao Sang

发表机构 * Beijing Jiaotong University(北京交通大学) Central South University of Forestry and Technology(中南林业科技大学)

AI总结 该研究指出当前主流的长上下文大语言模型推理基准在任务位置控制方面存在不足,导致无法准确评估模型在不同位置上的表现。为此,作者提出了Context Rot Evaluation(CRE)框架,系统地控制任务位置、填充内容和上下文长度三个因素,并通过实验发现,当目标任务从上下文末尾移至中间位置时,模型性能会显著下降,且随着上下文长度增加,这一问题更加严重。研究还表明,通过在末尾添加任务副本,可以有效缓解位置带来的性能下降,揭示了当前基准设计中存在结构性的评估盲区。

Comments 20 pages, 1 figure, 23 tables

详情
AI中文摘要

位置控制评估是检索任务(如Needle-in-a-Haystack和RULER)的标准做法,但主流推理基准测试并未控制目标任务在长上下文中的位置。我们审计了11个长上下文基准测试,发现没有一个在推理任务上同时控制任务位置、填充内容和上下文长度。对四个旗舰长上下文模型发布的审计发现,其主要结果表中没有NIAH、RULER或LongBench系列基准的条目,而智能体和编码基准测试在所有四个发布的主要结果表中均有出现。我们提出了上下文腐化评估(Context Rot Evaluation,CRE),一个同时控制这三个因素的框架,并分两轮评估了九个LLM在GSM8K和ARC-Challenge上的表现:初始的五个模型,以及四个较新的供应商发布。当目标任务从末尾移动到中间时,模型性能可能急剧下降,且对于易受影响的模型,这种下降随着上下文长度增加而恶化。MiMo-v2-Flash在64K下使用with_solutions填充时下降88个百分点(中间准确率8%)。较新的发布显示出较小的下降:在64K下,四个模型中有三个与末尾位置准确率的差距保持在+/-6个百分点内;MiMo-V2.5-Pro将MiMo-v2-Flash的88个百分点下降缩小到32个百分点。在questions_only_v2填充下,所有四个模型在中间位置的下降仍然存在(在8K、32K、64K下范围为-16到-56个百分点)。在8K下,一个诊断探针在末尾添加目标任务副本,使所有九个模型的中间准确率与末尾基线相差在+/-4个百分点内,这与位置解释一致。在初始五个模型集中,76%的中间位置错误与周围填充文本匹配,而末尾位置仅为22%,这与填充-答案干扰作为主要错误模式一致。这些结果暴露了当前推理基准测试设计和供应商评估实践中的结构性评估差距:当任务位置不受控制时,无法测量随上下文长度增长而恶化的位置脆弱性。

英文摘要

Position-controlled evaluation is standard for retrieval tasks such as Needle-in-a-Haystack and RULER, but mainstream reasoning benchmarks do not control positional placement of target tasks in long contexts. We audit 11 long-context benchmarks and find none jointly controls task position, filler content, and context length for reasoning. An audit of four flagship long-context releases finds no main result-table entry for NIAH, RULER, or LongBench-family benchmarks, while agentic and coding benchmarks appear in main result-tables across all four. We propose Context Rot Evaluation (CRE), a controlled framework varying all three factors, and evaluate nine LLMs on GSM8K and ARC-Challenge across two rounds: an initial five-model set and four newer vendor releases. Models can drop sharply when the target task moves from end to middle, and the drop grows worse with context length for vulnerable models. MiMo-v2-Flash drops 88pp at 64K under with_solutions filler (middle accuracy 8%). Newer releases show smaller drops: at 64K, three of four stay within +/-6pp of end-position accuracy; MiMo-V2.5-Pro narrows the MiMo-v2-Flash 88pp drop to 32pp. Under questions_only_v2 filler, middle-position drops persist across all four (range -16pp to -56pp across 8K, 32K, 64K). At 8K, a diagnostic probe adding a target-task copy at the end brings middle accuracy within +/-4pp of end baseline across all nine models, consistent with a positional explanation. In the initial five-model set, 76% of middle-position errors match surrounding filler text versus 22% at the end position, consistent with filler-answer interference as a dominant error mode. These results expose a structural evaluation gap in current reasoning benchmark design and vendor evaluation practice: positional vulnerabilities that grow with context length cannot be measured when task position is not controlled.
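
The position control described in this abstract can be illustrated with a minimal sketch. It is not the CRE implementation: the function name `build_context`, the character-based budget (a real harness would count tokens), and the toy filler strings are all assumptions made for illustration.

```python
def build_context(filler, task, position, budget_chars):
    """Place `task` at the 'end' or 'middle' of filler trimmed to a budget.

    Hypothetical helper: lengths are measured in characters, not tokens.
    """
    if position not in ("end", "middle"):
        raise ValueError("position must be 'end' or 'middle'")
    # Take filler items in order until the character budget is used up.
    chosen, used = [], 0
    for item in filler:
        if used + len(item) > budget_chars:
            break
        chosen.append(item)
        used += len(item)
    # Insert the target task at the controlled position.
    index = len(chosen) if position == "end" else len(chosen) // 2
    return chosen[:index] + [task] + chosen[index:]


filler = [f"Filler question {i}." for i in range(6)]
task = "TARGET: What is 17 + 25?"
end_ctx = build_context(filler, task, "end", budget_chars=1000)
mid_ctx = build_context(filler, task, "middle", budget_chars=1000)
# Diagnostic probe from the abstract: keep the middle copy and append one at the end.
probe_ctx = mid_ctx + [task]
```

Holding the filler and budget fixed while changing only `position` is what lets an accuracy difference be attributed to placement rather than to content or length.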

2605.23168 2026-05-25 cs.CR cs.AI cs.LG 版本更新

PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

PoisonForge: 面向指令微调LLM的任务级定向投毒基准

Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru, Alina Oprea

发表机构 * Department of Computer Science(计算机科学系)

AI总结 本文提出PoisonForge,一个面向指令微调大语言模型的任务级定向投毒基准,用于评估在有限投毒预算下模型对恶意数据的脆弱性。该基准通过四个维度参数化投毒威胁,并在五个模型家族的12个不同参数量的开放权重模型上进行了测试,结果显示大多数模型在最脆弱配置下攻击成功率超过70%,但对非目标任务的影响极小。研究分析了影响攻击成功率的关键因素,并发现投毒设计选择而非模型规模是攻击成功的主要原因。

详情
AI中文摘要

当从业者在未经验证的数据集上微调LLM时,攻击者可以通过任务级投毒利用数据供应链:插入少量精心设计的指令-响应对,导致模型在目标任务族的输出中嵌入攻击者指定的实体(如国家),而在其他情况下表现正常。我们引入PoisonForge,一个沿四个维度(偏差类型、投毒模式、出现次数和目标输出长度)参数化此威胁的基准,并在五个模型族中评估了12个开放权重模型(参数从2B到32B),主要采用1%的投毒预算。在1000个微调样本中仅使用10个投毒样本的情况下,12个模型中有11个在其最易受攻击的配置下攻击成功率(ASR)超过70%。同时,向非目标任务的意外泄露低于0.5%,模型在标准基准上表现良好。我们详细分析了影响攻击成功的因素。我们观察到,实体的多次出现提高了ASR,最佳投毒模式取决于目标实体的语义结构,并且ASR随任务输出长度单调下降。相关分析和风险预测模型证实,投毒设计选择而非模型规模是攻击成功的主要原因,并且这些模式可以推广到预测新任务上的攻击成功。我们发布所有配置、流水线和分析代码以支持可重复比较。

英文摘要

When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere. We introduce PoisonForge, a benchmark that parameterizes this threat along four dimensions (bias type, poisoning mode, appearance count, and target output length) and evaluates 12 open-weight models (from 2B to 32B parameters) across five families under a primarily 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceed a 70% attack success rate (ASR) in their most vulnerable configuration. Meanwhile, unintended leakage to non-target tasks remains below 0.5%, and models perform well on standard benchmarks. We analyze in detail the factors contributing to attack success. We observe that multiple appearances of an entity increase the ASR, the optimal poisoning mode depends on the semantic structure of the target entity, and ASR drops monotonically with the task output length. A correlation analysis and risk prediction model confirm that poisoning design choices, rather than model scale, are the primary causes of attack success, and that these patterns generalize to predict attack success on new tasks. We release all configurations, pipelines, and analysis code to support reproducible comparisons.

2605.23158 2026-05-25 cs.CR cs.CL cs.LG 版本更新

What Does the Server See? Understanding Privacy Leakage from Large Language Models in Split Inference

服务器看到了什么?理解大语言模型在分割推理中的隐私泄露

Mingyuan Fan, Yu Liu, Fuyi Wang, Cen Chen

发表机构 * East China Normal University(华东师范大学) RMIT University(皇家墨尔本理工大学)

AI总结 本文研究了在分割推理(split inference)框架下,大型语言模型(LLM)可能泄露用户隐私的问题。作者提出了一种名为ActInv的方法,通过匹配中间激活值来重建客户端输入,揭示了分割推理中的隐私漏洞。研究还引入了“扰动放大因子”(PAF)来量化各层对重建的抵抗能力,并设计了PriPert防御方案,有效提升了隐私保护效果,同时保持了模型的实用性和计算效率。

Comments Accepted to ACM CCS'26

详情
AI中文摘要

在资源受限设备上部署大语言模型(LLM)仍然具有挑战性,这激发了人们对分割推理的兴趣,即模型在客户端和服务器之间进行划分,通过仅传输中间激活来减少计算负担并增强隐私。然而,分割推理的隐私保护能力,特别是在LLM背景下,尚未得到彻底研究。为填补这一空白,我们引入了ActInv,它解决了一个中间激活匹配问题以重建客户端的输入。大量评估表明,即使在存在常见基于扰动的防御(如高斯噪声注入和激活稀疏化)的情况下,ActInv也能实现高保真重建。为了系统地理解这一漏洞,我们开发了扰动放大因子(PAF),一个用于量化层对重建固有抵抗力的指标。我们的分析揭示了隐私脆弱性在层间并不均匀,一些层高度易受泄露,而另一些层则提供自然抵抗力。此外,我们证明了通过校准扰动方向以在反向传播期间最大化重建误差,可以显著提高防御有效性。基于这些见解,我们设计了PriPert,并进行了全面评估,涵盖隐私、效用和计算开销,以证明其有效性。

英文摘要

The deployment of large language models (LLMs) on resource-constrained devices remains challenging, spurring interest in split inference, where models are partitioned between client and server to reduce computational burden and enhance privacy by transmitting only intermediate activations. However, the privacy-preserving capabilities of split inference, particularly in the context of LLMs, have not been exhaustively investigated. To fill this gap, we introduce ActInv, which solves an intermediate activation matching problem to reconstruct the client's input. Extensive evaluations demonstrate that ActInv achieves high-fidelity reconstructions, even in the presence of common perturbation-based defenses such as Gaussian noise injection and activation sparsification. To systematically understand this vulnerability, we develop Perturbation Amplification Factor (PAF), a metric for quantifying a layer's inherent resistance to reconstruction. Our analysis reveals that privacy vulnerability is not uniform across layers, with some layers being highly susceptible to leakage while others offer natural resistance. Furthermore, we demonstrate that defense effectiveness can be significantly improved by calibrating perturbation directions to maximize reconstruction error during backpropagation. Building on these insights, we design PriPert and conduct comprehensive evaluations, covering privacy, utility, and computational overhead, to demonstrate its effectiveness.

2605.23156 2026-05-25 cs.LG math.FA math.RT stat.ML 版本更新

Any-Dimensional Invariant Universality

任意维不变通用逼近性

Shengtai Yao, Eitan Levin, Mateo Díaz

发表机构 * Department of Applied Mathematics and Statistics, Johns Hopkins University(约翰霍普金斯大学应用数学与统计学系) Department of Computing and Mathematical Sciences, Caltech(加州理工学院计算与数学科学系)

AI总结 本文研究了可处理任意尺寸输入的机器学习模型的通用逼近性(universality)问题,这类模型可处理节点数不同的图或点数不同的点云等数据。传统的通用逼近性分析通常针对固定尺寸的输入,而本文提出了一种系统的方法,将任意维函数对应到一个合适的无限维极限空间上的函数,从而建立任意维模型的通用逼近性理论。该方法利用输入的对称性及不同尺寸输入之间的关系,定义了该空间上的自然拓扑结构,并展示了如何在该空间的紧致集上建立任意维通用逼近性。研究还指出若干现有架构不具备通用逼近性,并提出了简单的修改方案以恢复这一性质。

详情
AI中文摘要

一些机器学习模型是为任意大小的输入定义的,例如具有不同节点数的图和包含不同点数目的点云。这类任意维模型的通用逼近性(universality)仍然知之甚少,因为通用逼近性传统上是针对接受固定大小输入的模型、在其定义域的紧致子集上研究的。与此形成鲜明对比的是,任意维模型可以被视为定义在规模不断增长的输入上的函数序列,目前尚不清楚它们在何种意义上可以具有通用逼近性。我们开发了一种系统的方法来建立任意维通用逼近性,通过将任意维函数与一个唯一的函数等同起来,该函数在合适的无限维极限空间中接受输入,该空间包含所有有限大小的输入及其极限。利用这些输入的对称性以及不同大小输入之间的关系,我们证明了该极限空间具有自然的拓扑结构,并且包含丰富的紧致集族,在这些紧致集上可以建立任意维通用逼近性。我们通过展示几种现有架构不具备通用逼近性,并提出恢复通用逼近性的简单修改,来说明我们的方法。

英文摘要

Several machine learning models are defined for inputs of any size, such as graphs with different numbers of nodes and point clouds containing varying numbers of points. The universality properties of such any-dimensional models remain poorly understood, as universality is traditionally studied for models accepting inputs of a fixed size, defined on a compact subset of their domain. In sharp contrast, any-dimensional models can be viewed as sequences of functions defined on growing-sized inputs, and it is not clear in which sense they can be universal. We develop a systematic approach to establish any-dimensional universality, by identifying any-dimensional functions with a unique function taking inputs in a suitable infinite-dimensional limit space containing inputs of all finite sizes as well as their limits. Using the symmetries of these inputs and relations between inputs of different sizes, we show that this limit space admits a natural topology with rich families of compact sets on which any-dimensional universality can be established. We illustrate our approach by showing that several existing architectures fail to be universal, and we propose simple modifications that restore universality.

2605.23146 2026-05-25 cs.LG cs.AI 版本更新

Infra-Bayesian Reinforcement Learning Agents Outperform Classical RL For Worst-Case Robustness

Infra-Bayesian 强化学习智能体在最坏情况鲁棒性上优于经典强化学习

Manish Aryal, Faiyaz Azam, Agnivo Banerjee, Sai Sidhanth Manoharan Jayanthi, Allegra Laro, Clément Legentilhomme, Andrew Lin, Florian Lorkowski, Radman Rakhshandehroo, Patric Rommel, Emanuel Ruzak, Nathan Theng, Paul Yushin Rapoport

发表机构 * Purdue University(普渡大学) Carnegie Mellon University(卡内基梅隆大学) WorldQuant University(WorldQuant大学) UC Berkeley(加州大学伯克利分校) Aix-Marseille University(艾克斯-马赛大学) MIT(麻省理工学院) University of Zurich(苏黎世大学) University of British Columbia(不列颠哥伦比亚大学) University of Stuttgart(斯图加特大学) University of Buenos Aires(布宜诺斯艾利斯大学) California State University, Fresno(加州州立大学弗雷斯诺分校) University of Chicago(芝加哥大学)

AI总结 该论文研究了在存在模型误设和策略依赖不确定性的情况下,经典强化学习方法的局限性,并提出了一种基于Infra-Bayesian主义的强化学习框架。该方法通过区分普通概率不确定性与Knightian不确定性,采用最大化最坏情况期望值的策略进行决策,从而在不可实现(non-realizable)环境中实现更稳健的性能。实验表明,该方法在具有Knightian不确定性的环境中表现出更低的最坏情况遗憾,并在纽科姆问题中优于经典决策理论方法。

详情
AI中文摘要

经典强化学习假设智能体与一个固定环境交互,该环境的行为不依赖于智能体的策略。这一假设在非可实现环境中失效,其中其他参与者可能预测智能体的行为,包括对 AI 安全至关重要的环境,例如智能体与预测者、人类、其他 AI 智能体和机构交互的环境。在此类环境中,智能体的模型类无法捕捉其运行的世界。在这种误设下,经典贝叶斯方法可能产生自信的错误后验、不可靠的决策和无界遗憾,因为可实现性无法获得。Infra-Bayesianism 是一个决策理论框架,通过将普通概率不确定性(其中先验可以合理选择)与 Knightian 不确定性(其中没有构建此类先验的依据)区分开来,解决了这些失败。它通过评估行动的最坏情况结果,而不是后验期望或加权平均来实现这一点。我们首次提出了一个用于有限结果无状态决策问题的 Infra-Bayesian 强化学习架构的概念验证实现。我们的智能体维护一组不精确的假设,使用 Infra-Bayesian 条件更新它们,并通过最大化最坏情况期望值来选择行动。我们将 Infra-Bayesian 极大极小决策过程的实现应用于具有 Knightian 不确定性的环境,并展示了与经典强化学习智能体相比更低的最坏情况遗憾。我们还研究了纽科姆问题,并表明 Infra-Bayesian 智能体选择了最优策略,优于经典决策理论智能体。我们的结果为在模型误设和策略依赖不确定性下保持鲁棒性的强化学习智能体迈出了一步。

英文摘要

Classical reinforcement learning assumes the agent interacts with a fixed environment whose behavior does not depend on the agent's policy. This assumption breaks down in non-realizable settings where other actors might anticipate the agent's behavior, including environments crucial to AI safety, where the agent interacts with predictors, humans, other AI agents, and institutions. In such settings, the agent's model class fails to capture the world in which it operates. Under such misspecification, classical Bayesian methods can produce confidently wrong posteriors, unreliable decisions, and unbounded regret, as realizability fails to obtain. Infra-Bayesianism is a decision-theoretic framework that addresses these failures by distinguishing ordinary probabilistic uncertainty, where priors can be reasonably chosen, from Knightian uncertainty, where no grounds exist for the construction of such a prior. It does so by evaluating actions on their worst-case outcomes, rather than from posterior expectations or weighted averaging. We present the first proof-of-concept implementation of an infra-Bayesian reinforcement learning architecture for finite-outcome stateless decision problems. Our agent maintains a set of imprecise hypotheses, updates them using infra-Bayesian conditioning, and selects actions by maximizing worst-case expected value. We apply this implementation of the infra-Bayesian maximin decision process to an environment with Knightian uncertainty, and demonstrate a lower worst-case regret as compared to classical reinforcement learning agents. We also investigate Newcomb's problem and show that the infra-Bayesian agent picks the optimal strategy, outperforming classical decision theory agents. Our results provide a step towards reinforcement learning agents that remain robust under model misspecification and policy-dependent uncertainty.
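
The action-selection step named in this abstract, maximizing worst-case expected value over a set of hypotheses, can be sketched for a stateless problem as follows. This is only an illustration of maximin choice; it omits infra-Bayesian conditioning, and the payoff table, the two candidate distributions, and all function names are hypothetical.

```python
def expected_value(dist, payoffs):
    """Expected payoff of one action under one outcome distribution."""
    return sum(p * u for p, u in zip(dist, payoffs))


def maximin_action(utilities, credal_set):
    """Pick the action whose worst-case expected value is largest.

    utilities[a][o] is the payoff of action a under outcome o;
    credal_set is a list of candidate outcome distributions.
    """
    best_action, best_value = None, float("-inf")
    for action, payoffs in utilities.items():
        # Knightian uncertainty: judge the action by its least favourable hypothesis.
        worst = min(expected_value(dist, payoffs) for dist in credal_set)
        if worst > best_value:
            best_action, best_value = action, worst
    return best_action, best_value


utilities = {"safe": [1.0, 1.0], "risky": [3.0, 0.0]}
credal_set = [[0.9, 0.1], [0.2, 0.8]]
action, value = maximin_action(utilities, credal_set)

# For contrast: averaging the two hypotheses with equal weight favours "risky".
averaged = [0.5 * a + 0.5 * b for a, b in zip(*credal_set)]
bayes_risky = expected_value(averaged, utilities["risky"])
```

In this toy case the maximin rule picks "safe" because "risky" does poorly under the second hypothesis, while the equal-weight average would have preferred "risky".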

2605.23145 2026-05-25 stat.ML cs.LG math.ST stat.ME stat.TH 版本更新

Operationalizing Individual Fairness via Gradient Descent and Bradley-Terry Models

通过梯度下降和Bradley-Terry模型实现个体公平性

Conlan Olson, Linjun Zhang, Zhun Deng, Pragya Sur

发表机构 * Columbia University(哥伦比亚大学) Rutgers University(罗格斯大学) UNC Chapel Hill(北卡罗来纳大学教堂山分校) Harvard University(哈佛大学)

AI总结 本文研究如何通过梯度下降和Bradley-Terry模型实现个体公平性,解决在实际应用中学习个体相似度度量的困难问题。作者提出了一种基于三元组查询学习马氏(Mahalanobis)相似度度量的算法,结合谱初始化和梯度下降方法,并提供了理论保证,证明该算法能快速收敛到真实度量。研究还表明,基于估计度量实现的个体公平性可近似保证对真实度量的公平性,并探讨了该方法在AI模型调优中的潜在应用。

Comments 60 pages, 2 figures

详情
AI中文摘要

个体公平性,即“相似个体应受到相似对待”的概念,为算法决策者提供了强大而灵活的公平性保证。然而,在实践中实施个体公平性的一个障碍是难以学习个体间的相似性度量。在这项工作中,我们提出了一种从三元组查询(形式为“个体$i$与个体$j$还是$k$更相似?”)中学习马氏距离度量的算法。我们在标准的Bradley-Terry成对比较模型下工作。我们的算法包括一个谱初始化步骤,随后是梯度下降。我们为算法提供了广泛的理论保证,表明尽管我们模型中的损失是非凸的,但算法能快速收敛到真实度量。由于我们的重点是公平性,我们还表明,相对于估计度量的个体公平性足以实现相对于真实度量的类似公平性。我们还讨论了我们的工作在AI模型调优中的潜在应用。最后,我们展示了实验结果,证明了我们算法的收敛性以及基于估计度量训练的下游公平预测器的公平性性能。

英文摘要

Individual fairness, the notion that "similar individuals should be treated similarly," provides a strong and flexible fairness guarantee for algorithmic decision makers. However, a barrier to implementing individual fairness in practice is the difficulty of learning the similarity metric over individuals. In this work, we present an algorithm for learning a Mahalanobis similarity metric from triplet queries of the form "is individual $i$ more similar to individual $j$ or $k$?" We work in the standard Bradley-Terry model for pairwise comparisons. Our algorithm consists of a spectral initialization step followed by gradient descent. We provide extensive theoretical guarantees on our algorithm, showing that it converges quickly to the ground truth metric despite the non-convexity of the loss in our model. Because our focus is on fairness, we also show that individual fairness with respect to an estimated metric is sufficient to achieve similar fairness with respect to the true metric. We also discuss potential applications of our work to AI model tuning. Finally, we present experimental results that demonstrate the convergence of our algorithm and the fairness performance of downstream fair predictors trained on our estimated metric.
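
One way to picture the triplet model is to give each candidate a Bradley-Terry score equal to its negative squared Mahalanobis distance from individual i. That link is an assumption made here for illustration, and the paper's exact parameterization may differ. The sketch below computes the resulting response probability and the log-loss that gradient descent would minimize; all function names are hypothetical.

```python
import math


def mahalanobis_sq(M, x, y):
    """Squared distance (x - y)^T M (x - y) for a matrix M given as nested lists."""
    diff = [a - b for a, b in zip(x, y)]
    n = len(diff)
    return sum(diff[r] * M[r][c] * diff[c] for r in range(n) for c in range(n))


def prob_j_closer(M, xi, xj, xk):
    """Assumed Bradley-Terry link: P(j judged closer to i than k)."""
    gap = mahalanobis_sq(M, xi, xk) - mahalanobis_sq(M, xi, xj)
    return 1.0 / (1.0 + math.exp(-gap))


def triplet_nll(M, triplets):
    """Average negative log-likelihood over (xi, xj, xk, answer) tuples,
    where answer is 1 if j was judged closer to i and 0 otherwise."""
    total = 0.0
    for xi, xj, xk, answer in triplets:
        p = prob_j_closer(M, xi, xj, xk)
        total -= math.log(p if answer == 1 else 1.0 - p)
    return total / len(triplets)


M = [[1.0, 0.0], [0.0, 4.0]]          # the second feature counts four times as much
identity = [[1.0, 0.0], [0.0, 1.0]]
xi, xj, xk = [0.0, 0.0], [1.0, 0.0], [0.0, 1.0]
p_weighted = prob_j_closer(M, xi, xj, xk)
p_identity = prob_j_closer(identity, xi, xj, xk)
```

Under the identity metric j and k are equidistant from i, so the response is a coin flip; under `M` the gap in squared distance is 3 and j is favoured.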

2605.23139 2026-05-25 cs.LG cs.AI 版本更新

CALAD: Channel-Aware contrastive Learning for multivariate time series Anomaly Detection

CALAD:面向多元时间序列异常检测的信道感知对比学习

Jaehyeop Hong, Youngbum Hur

发表机构 * Department of Industrial Engineering, Inha University, Incheon, Republic of Korea(韩国仁川仁荷大学工业工程系)

AI总结 多变量时间序列异常检测在实际应用中日益重要,但通常面临标注数据稀缺的问题。现有方法多采用无监督学习建模正常模式,但往往对所有通道一视同仁,忽略了不同通道对异常检测的贡献差异。本文提出CALAD,一种基于通道感知的对比学习框架,通过估计通道相关性指导对比样本的构建,增强模型对异常语义的学习能力,并结合重建误差和对比学习,提升模型在分布偏移场景下的检测性能。

Comments Accepted to ICPR 2026

详情
AI中文摘要

多元时间序列异常检测在实际应用中变得越来越重要,而标记数据往往稀缺。许多现有方法依赖无监督学习来建模正常模式,但它们通常平等对待所有信道。这种设计会稀释异常相关信号,因为并非所有信道对异常检测的贡献相同。在本文中,我们提出CALAD,一种用于多元时间序列异常检测的信道感知对比学习框架。CALAD利用估计的信道相关性指导对比样本的构建,使学习过程反映异常语义而非通用相似性。信道相关性通过基于Transformer的自编码器的重构误差进行估计,并用于区分对异常行为影响更大的信道。利用这些信息,我们设计了一种信道级增强策略,其中正负样本基于异常相关信道是否被保留或扰动来构建。这鼓励对无关信道的变化保持不变性,同时对异常相关信道的变化保持敏感性。此外,CALAD结合了对比学习和辅助重构头,使模型在保留正常结构的同时学习判别性表示。在多个真实数据集上的实验表明,CALAD在分布漂移场景下持续优于现有方法。我们提供可复现的代码:https://github.com/hirundo1218/CALAD。

英文摘要

Multivariate time series anomaly detection has become increasingly important in real-world applications, where labeled data are often scarce. Many existing approaches rely on unsupervised learning to model normal patterns, but they often treat all channels equally. This design can dilute anomaly-relevant signals, since not all channels contribute equally to anomaly detection. In this paper, we propose CALAD, a channel-aware contrastive learning framework for multivariate time series anomaly detection. CALAD governs the construction of contrastive samples using estimated channel relevance, allowing the learning process to reflect anomaly semantics rather than generic similarity. Channel relevance is estimated from reconstruction errors of a transformer-based autoencoder and is used to distinguish channels that are more influential to anomalous behaviors. Using this information, we design a channel-wise augmentation strategy in which positive and negative samples are constructed based on whether anomaly-relevant channels are preserved or perturbed. This encourages invariance to changes in irrelevant channels while remaining sensitive to changes in anomaly-relevant channels. Furthermore, CALAD combines contrastive learning and an auxiliary reconstruction head, allowing the model to learn discriminative representations while retaining normal structures. Experiments on multiple real-world datasets show that CALAD consistently outperforms existing methods, particularly under distribution shift scenarios. We provide the code for reproducibility at https://github.com/hirundo1218/CALAD

2605.23138 2026-05-25 quant-ph cs.AI cs.ET cs.LG 版本更新

Classical State Preparation for Variational Quantum Algorithms via Reinforcement Learning

基于强化学习的变分量子算法经典态制备

Gino Kwun, Dhanvi Bharadwaj, Gokul Subramanian Ravi

发表机构 * Computer Science and Engineering University of Michigan(计算机科学与工程大学密歇根大学)

AI总结 该论文提出了一种基于强化学习的新型方法CRiSP,用于变分量子算法中的经典初始态制备。该方法将离散前缀选择建模为序列决策问题,结合神经引导的蒙特卡洛树搜索和自博弈训练的Transformer策略,能够在不改变电路结构的前提下,通过多项式时间的经典稳定子模拟生成高质量初始态。实验表明,CRiSP在多个QAOA和VQE基准任务中显著优于现有方法,展现出更高的能量精度和更强的可扩展性。

Comments 22 pages, 4 figures

详情
AI中文摘要

变分量子算法(VQA)可能提供实现实际量子优势的途径,但其优化受到贫瘠高原和大量局部极小值的严重阻碍。虽然经典可模拟的克利福德电路可以热启动VQA以加速收敛,但现有的基于启发式的初始化方法难以在巨大的组合搜索空间中扩展。为了克服这一瓶颈,我们提出了CRiSP(用于态制备的克利福德强化学习智能体),这是一个将离散前缀选择表述为序列决策问题的框架。CRiSP利用神经引导的蒙特卡洛树搜索,由通过自我对弈训练的基于Transformer的策略驱动,在固定参数化旋转之前插入学习到的克利福德门。这使得能够完全通过多项式时间的经典稳定子模拟构建高质量的初始态,而不改变底层电路架构。通过整合逐步扩展搜索范围的课程学习策略,该智能体能够高效扩展到深度电路。在多达22个量子比特和1,370个参数的QAOA基准测试中,CRiSP在平均能量精度上优于最先进的克利福德初始化方法平均3.17倍(最大45.02倍),在最佳能量精度上平均2.44倍(最大16.01倍)。对VQE任务的评估进一步证明了该框架的鲁棒性和泛化能力。

英文摘要

Variational Quantum Algorithms (VQAs) potentially offer a pathway to practical quantum advantage, but their optimization is heavily hindered by barren plateaus and numerous local minima. While classically simulable Clifford circuits can warm-start VQAs to accelerate convergence, existing heuristic-based initialization methods struggle to scale within vast combinatorial search spaces. To overcome this bottleneck, we propose CRiSP (a Clifford Reinforcement Learning agent for State Preparation), a framework that formulates discrete prefix selection as a sequential decision-making problem. CRiSP utilizes Neural-Guided Monte Carlo Tree Search, driven by a Transformer-based policy trained via self-play, to insert learned Clifford gates before fixed parameterized rotations. This enables the construction of high-quality initial states entirely through polynomial-time classical stabilizer simulation without altering the underlying circuit architecture. By integrating a curriculum learning strategy that progressively expands the search horizon, the agent efficiently scales to deep circuits. Evaluated on QAOA benchmarks of up to $22$ qubits and $1{,}370$ parameters, CRiSP outperforms state-of-the-art Clifford initialization methods by a mean of $3.17\times$ (max $45.02\times$) in average energy accuracy and $2.44\times$ (max $16.01\times$) in best-achieved energy accuracy. Assessments on VQE tasks further demonstrate the framework's robustness and generalizability.

2605.23134 2026-05-25 cs.LG 版本更新

Archimedean Copula Inference via Taylor-Mode AD

通过泰勒模式自动微分进行阿基米德Copula推断

Cambridge Yang, Dongdong Li

发表机构 * Harvard Medical School(哈佛医学院)

AI总结 该研究提出了一种名为 acopula 的 JAX 框架,用于高效计算任意嵌套阿基米德Copula模型在高维、任意变量右删失情况下的精确似然和参数梯度。其核心方法是通过泰勒模式自动微分的多项式幂运算,替代传统手动推导的贝尔多项式表,从而支持任意生成函数和复杂的嵌套结构。实验表明,该框架在高维数据、大规模金融和医学数据集上表现出优越的性能和灵活性,并实现了比现有工具显著的加速效果。

详情
AI中文摘要

现有的嵌套阿基米德Copula工具无法同时处理以下三个方面:(a) 生存分析中任意变量的(右)删失,(b) 任意嵌套树,以及(c) 精确参数梯度。现有实现仅处理双变量问题、低维(即$d \leq 10$)情况、两层嵌套或仅手工推导的Copula嵌套。我们提出acopula,一个JAX原生框架,给定任意阿基米德生成元(经典或神经),在多项式时间内,在任意删失掩码下评估精确的嵌套Copula似然和参数梯度。其机制是泰勒模式自动微分输出的多项式幂运算,用单个可微计算替代每个族手工推导的偏贝尔多项式表,任何用户定义的生成元都可以驱动该计算。我们进行了大量模拟以验证acopula的正确性。然后我们展示了:(a) 在$d=53$的高维MIMIC-IV ICU入院数据($85{,}229$条记录)上的逐变量删失,由经典阿基米德族和嵌套神经阿基米德Copula拟合;(b) 在$d=98$的标普500日收益率上的11部门层次模型;(c) 在一项视网膜病变研究中,跨十个族(其中五个族之前没有实现)的族无关删失MLE;以及(d) 在$d=35$时,相对于R的nacLL每密度加速约$650$倍,且二次扩展到$d=8{,}000$。

英文摘要

No existing nested Archimedean copula tool handles all three of (a) arbitrary per-variable (right-)censoring in survival analysis, (b) arbitrary nesting trees, and (c) exact parameter gradients. Existing implementations handle only bivariate problems, low-dimensional (i.e., $d \leq 10$) cases, two layers of nesting, or only hand-derived copula nestings. We present acopula, a JAX-native framework that, given any Archimedean generator -- classical or neural -- evaluates exact nested-copula likelihoods and parameter gradients under arbitrary censoring masks in polynomial time. The mechanism is polynomial powering of Taylor-mode automatic differentiation output, which replaces per-family hand-derived partial Bell polynomial tables with a single differentiable computation that any user-defined generator can drive. We conduct extensive simulations to verify the correctness of acopula. We then demonstrate (a) per-variable censoring on $85{,}229$ MIMIC-IV ICU admissions in high dimensions with $d = 53$, fit by both classical Archimedean families and nested neural Archimedean copulas; (b) an 11-sector hierarchical model on S&P 500 daily returns at $d = 98$; (c) family-agnostic censored MLE across ten families, five of them with no prior implementation, on a retinopathy study; and (d) a $\sim 650\times$ per-density speedup over R's nacLL at $d = 35$, scaling quadratically to $d = 8{,}000$.

2605.23131 2026-05-25 cs.LG 版本更新

When Determinants Are Not Enough: Private Rare Switching

当行列式不够时:私有稀有切换

Xingyu Zhou

发表机构 * Wayne State University(韦恩州立大学)

AI总结 本文探讨了在隐私保护背景下,线性上下文bandit和强化学习中传统的基于行列式的更新规则的局限性。当引入高斯噪声以满足隐私要求时,设计矩阵的单调增长特性可能被破坏,导致原有分析不再适用。为解决这一问题,作者提出了一种基于广义瑞利商的稀有切换规则,恢复了对数量级的策略更新次数,以及相差至多常数因子的置信宽度比较,从而在隐私设置下实现了有效的稀有切换策略。

详情
AI中文摘要

在这篇笔记中,我想分享一个小研究时刻,Codex帮助我找到了将稀有切换适应私有设置的正确方法。线性bandit和强化学习中基于行列式的标准更新规则效果很好,因为设计矩阵单调增长。但一旦加入高斯噪声以实现隐私,这种单调性可能失效,通常的分析不再成立。关键原因是行列式增长控制体积,而遗憾分析需要控制最坏方向。为了解决这个问题,Codex提出了一种基于广义瑞利商的不同稀有切换规则,该规则恢复了对数策略更新以及所需的置信宽度比较(至多常数因子)。我在此展示了我手动清理后的证明版本,以及对此例的一些个人反思。

英文摘要

In this note, I would like to share a small research moment where Codex helped me find the right way to adapt rare switching to the private setting. The standard determinant-based update rule in linear bandits and RL works beautifully because the design matrix grows monotonically. But once Gaussian noise is added for privacy, this monotonicity can fail, and the usual analysis no longer goes through. The key reason is that determinant growth controls volume, while regret analysis needs control of the worst direction. To address this, Codex came up with a different rare-switching rule based on the generalized Rayleigh quotient, which restores logarithmic policy updates and the desired confidence-width comparison up to a constant factor. I present my manually cleaned-up version of the proof here, as well as some personal reflections on this example.
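
The contrast between controlling volume and controlling the worst direction can be written out briefly. The notation is assumed here, not taken from the note: $V_\tau$ is the design matrix at the last policy update, $V_t$ the current one, $C > 1$ a switching threshold, and $\lambda_i$ the generalized eigenvalues of the pair $(V_t, V_\tau)$. The last line is only a plausible form of a Rayleigh-quotient trigger; the note's exact rule may differ.

```latex
% Determinant-based rule: switch at time t when
\det(V_t) > C \, \det(V_\tau), \qquad C > 1 .

% If V_t \succeq V_\tau, then every \lambda_i \ge 1, so the determinant ratio
% bounds the worst direction and hence the confidence widths:
\lambda_{\max} \le \prod_i \lambda_i = \frac{\det(V_t)}{\det(V_\tau)},
\qquad
x^\top V_\tau^{-1} x \le \lambda_{\max} \, x^\top V_t^{-1} x \quad \text{for all } x .

% Without monotonicity some \lambda_i can fall below 1, so a small determinant
% ratio no longer bounds \lambda_{\max}. A trigger on the worst direction itself
% (assumed form) is:
\max_{x \neq 0} \frac{x^\top V_t \, x}{x^\top V_\tau \, x} = \lambda_{\max} > C .
```

The middle step is where monotone growth of the design matrix is used, which is why adding privacy noise breaks the determinant-based argument.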

2605.23118 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Exploiting Longitudinal Context in Clinician-Verified Interactive Lesion Tracking

在临床医生验证的交互式病灶追踪中利用纵向上下文

Yannick Kirchhoff, Maximilian Rokuss, Daniel Philipp Mertens, David Füller, Benjamin Hamm, Andreas Schreyer, Oliver Ritter, Klaus Maier-Hein

发表机构 * German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing, Germany(德国癌症研究中心(DKFZ)海德堡,医学图像计算部,德国) Faculty of Mathematics and Computer Science, Heidelberg University, Germany(海德堡大学数学与计算机科学学院,德国) HIDSS4Health -- Helmholtz Information and Data Science School for Health, Karlsruhe/Heidelberg, Germany(HIDSS4Health:亥姆霍兹健康信息与数据科学学院,卡尔斯鲁厄/海德堡,德国) Medical Faculty, Heidelberg University, Germany(海德堡大学医学院,德国) University Hospital Brandenburg an der Havel, Brandenburg Medical School Theodor Fontane, Germany(哈弗尔河畔勃兰登堡大学医院,勃兰登堡特奥多尔·冯塔纳医学院,德国) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Germany(海德堡大学医院放射肿瘤科模式分析与学习组,德国)

AI总结 本文研究了如何在临床医生验证的交互式病灶追踪中有效利用纵向影像信息,以提高肿瘤在连续CT扫描中的追踪准确性。作者提出了一种“验证追踪”范式:由临床医生验证配准生成的提示,模型再结合病灶的基线外观信息,解决分割中的模糊问题。该方法结合了早期空间提示融合与潜在时间差分加权,构建了一个统一的纵向信息引导分割框架,并通过大规模合成预训练克服数据稀缺问题,显著提升了性能。实验表明,该方法在全自动和验证追踪设置下均优于现有方法,且在MICCAI autoPET IV挑战赛中取得第一名。

Comments Accepted at MICCAI 2026

详情
AI中文摘要

在系列CT扫描中追踪肿瘤病灶对于肿瘤学反应评估至关重要。现有的自动化方法面临一个基本权衡:端到端追踪器实现高度自动化,但无法纠正无声的追踪失败;而解耦的配准-分割流程允许用户验证,却丢弃了病灶的先验外观,限制了在模糊情况下的准确性。在这项工作中,我们提出了一种验证追踪范式:临床医生验证配准提出的提示,模型利用该提示以及基线病灶外观来解决分割模糊性。我们提出了一个统一框架,结合早期空间提示融合与潜在时间差异加权,用于纵向信息感知的分割。为了解决数据稀缺问题,我们利用大规模合成预训练,证明这对于利用纵向上下文至关重要,相比从头训练性能提升高达4.5个Dice点。我们的方法在MICCAI autoPET IV挑战中获得第一名。我们进一步整理并发布了PanTrack,一个新的纵向胰腺癌基准,以评估分布外泛化能力。实验表明,我们的模型在全自动和所提出的验证追踪设置中均优于先前工作,在自动化与控制之间提供了一个临床安全的中间地带。代码、模型和数据集将在https://github.com/MIC-DKFZ/LongiSeg发布。

英文摘要

Tracking tumor lesions across serial CT scans is essential for oncological response assessment. Existing automated methods face a fundamental trade-off: end-to-end trackers achieve high automation but offer no opportunity to correct silent tracking failures, while decoupled registration-segmentation pipelines permit user verification yet discard the lesion's prior appearance, limiting accuracy in ambiguous cases. In this work, we propose a Verified Tracking paradigm: a clinician verifies a registration-proposed prompt, which the model leverages alongside the baseline lesion appearance to resolve segmentation ambiguities. We present a unified framework combining early spatial prompt fusion with latent temporal difference weighting for longitudinally-informed segmentation. To address data scarcity, we leverage large-scale synthetic pretraining, which proves essential for exploiting longitudinal context and improves performance by up to 4.5 Dice points over training from scratch. Our approach secured first place in the MICCAI autoPET IV challenge. We further curate and release PanTrack, a new longitudinal pancreatic cancer benchmark, to assess out-of-distribution generalization. Experiments show that our model outperforms prior work in both the fully automatic and the proposed verified tracking settings, offering a clinically safe middle ground between automation and control. Code, model and dataset will be released at https://github.com/MIC-DKFZ/LongiSeg

2605.23115 2026-05-25 cs.LG stat.ML 版本更新

Robust OT-Guided Generative Residual Domain Adaptation for Bike-Sharing Demand Prediction under Temporal Domain Shift

鲁棒OT引导的生成式残差域适应用于时间域偏移下的共享单车需求预测

Yiming Ma

发表机构 * Department of Statistics Finance, School of Management, University of Science

AI总结 本文将2021年至2026年纽约Citi Bike共享单车的3月份需求预测作为时间域适应问题进行研究,提出了一种鲁棒最优传输引导的残差域适应框架Gen-ROTDA。该方法利用少量带标签的目标子集拟合目标域的站点-时间锚点,迁移残差而非原始需求,并采用确定性标签保持的残差特征生成器,提升了模型在时间域偏移下的鲁棒性。实验表明,Gen-ROTDA在2025至2026年的主要任务上取得了最低的平均绝对误差,并在多年度任务中平均优于其他最优传输类方法,尤其在面对噪声数据时表现出更强的稳定性。

详情
AI中文摘要

基于历史站点-小时数据训练的共享单车模型在后续年份部署时,由于出行模式随时间变化,性能可能会下降。本文将2021年至2026年3月Citi Bike需求预测作为时间域适应问题进行研究,并提出了Gen-ROTDA,一种鲁棒最优传输引导的残差域适应框架。该方法利用少量标记目标子集拟合目标域站点-时间锚点,传输残差而非原始需求,应用确定性标签保持残差特征生成器,并在训练最终残差预测器之前修剪高成本传输匹配。实验将Gen-ROTDA与仅锚点、仅源域、仅目标域、微调、MMD适应、Sinkhorn OTDA、ROTDA和Gen-OTDA进行比较。Gen-ROTDA在2025年至2026年主要任务上取得了最低MAE,并且在多年度任务中是平均表现最佳的OT类方法,尽管微调和MMD适应仍然是强大的整体基线。在目标域无标签记录存在异常的情况下,Gen-ROTDA比非鲁棒OT变体稳定得多,表明鲁棒传输对于共享单车需求预测中的噪声时间迁移是有用的。

英文摘要

Bike-sharing models trained on historical station-hour data may degrade when deployed in later years because travel patterns change over time. This paper studies March Citi Bike demand prediction from 2021 to 2026 as a temporal domain adaptation problem and proposes Gen-ROTDA, a robust optimal transport-guided residual domain adaptation framework. The method fits a target-domain station-time anchor with a small labeled target subset, transfers residual rather than raw demand, applies a deterministic label-preserving residual feature generator, and trims high-cost transport matches before training the final residual predictor. Experiments compare Gen-ROTDA with anchor-only, source-only, target-only, fine-tuning, MMD adaptation, Sinkhorn OTDA, ROTDA, and Gen-OTDA. Gen-ROTDA achieves the lowest MAE on the main 2025 to 2026 task and is the best OT-family method on average across multi-year tasks, although fine-tuning and MMD adaptation remain strong overall baselines. Under abnormal target-unlabeled records, Gen-ROTDA is much more stable than non-robust OT variants, suggesting that robust transport is useful for noisy temporal transfer in bike-sharing demand prediction.

2605.23102 2026-05-25 stat.ML cs.LG stat.ME 版本更新

LLM Sparsity Prior for Robust Feature Selection

LLM 稀疏先验用于鲁棒特征选择

Caleb Skinner, Yihan Guo, Meng Li

发表机构 * Department of Statistics, Rice University(莱斯大学统计学系) Department of Computer Science, Rice University(莱斯大学计算机科学系)

AI总结 本文提出了一种基于大语言模型(LLM)稀疏性先验的鲁棒特征选择方法,用于高维变量选择。该方法通过引入可解释的超参数将LLM生成的权重整合到Spike-and-Slab模型中,同时利用分层超先验动态过滤无信息或误导性权重,从而在保证准确权重利用的同时提升鲁棒性。实验表明,该方法在医疗数据集上不仅提高了预测精度,还识别出基线方法遗漏的临床相关特征,尤其在小样本场景下表现出色。

详情
AI中文摘要

大型语言模型 (LLM) 提供了一种可扩展的机制,用于引出领域信息的先验知识,以进行高维变量选择。然而,现有方法如 LLM-Lasso 对权重质量敏感,当 LLM 生成的权重不准确时,性能会大幅下降。为了解决这一挑战,我们首先引入了一个量化 LLM 生成权重质量的框架,从而能够对不同权重机制下的 LLM 信息方法进行严格评估。然后,我们提出了 LLM 稀疏先验 (LSP),它通过两个可解释的超参数(控制全局稀疏性和权重集中度)将 LLM 生成的权重整合到 Spike-and-Slab 和 Spike-and-Slab Lasso 模型的先验包含概率中。这些参数上的层次超先验允许模型动态地折扣无信息或误导性权重,从而在权重准确时提高鲁棒性而不牺牲收益。最后,我们开发了原则性的提示工程策略,并在一个研究急性肾损伤的私有医学数据集上验证了该方法。LSP 提高了预测准确性,并识别出了基线方法遗漏的临床相关特征,对提示变化具有鲁棒性,在低数据场景下尤其有效。

英文摘要

Large language models (LLMs) offer a scalable mechanism to elicit domain-informed prior information for high-dimensional variable selection. However, existing methods such as LLM-Lasso are sensitive to weight quality, with performance degrading substantially when LLM-generated weights are inaccurate. To address this challenge, we first introduce a framework for quantifying the quality of LLM-generated weights, enabling rigorous evaluation of LLM-informed methods across varying weight regimes. We then propose the LLM Sparsity Prior (LSP), which integrates LLM-generated weights into the prior inclusion probabilities of Spike-and-Slab and Spike-and-Slab Lasso models via two interpretable hyperparameters governing global sparsity and weight concentration. Hierarchical hyperpriors on these parameters allow the model to dynamically discount uninformative or misleading weights, improving robustness without sacrificing gains when weights are accurate. Finally, we develop principled prompt engineering strategies and validate the method on a private medical dataset studying Acute Kidney Injury. LSP improves prediction accuracy and identifies clinically relevant features missed by the baselines, with robustness to prompt variation and particular effectiveness in low-data regimes.

2605.23096 2026-05-25 cs.CR cs.LG 版本更新

Encrypted Neural Networks without Overflows

无溢出的加密神经网络

Philipp Kern, Lorenzo Rovida, Samuel Teuber, Edoardo Manino, Carsten Sinz, Alberto Leporati

Affiliations * Karlsruhe Institute of Technology; Polytechnic University of Turin; The University of Manchester; Karlsruhe University of Applied Sciences; University of Milano-Bicocca

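AI summary This paper studies overflow attacks on neural networks used for privacy-preserving inference under fully homomorphic encryption (FHE). Because the widely used CKKS scheme supports only a limited set of operations, the authors propose a formal verification technique that computes certified bounds on the ranges of all neurons in the network, eliminating overflows by construction. In the experiments, the method removed the observed overflows on all benchmarks, reducing failure rates from up to 47% to 0%, and it is compatible with most CKKS-based frameworks.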

Comments Preprint

详情
AI中文摘要

全同态加密(FHE)通过对加密数据上的神经网络进行评估,实现私有推理。这样,我们可以将计算委托给第三方服务器,而无需泄露用户的数据。目前,CKKS方案是大多数高效FHE实现的骨干,但它仅支持加法、乘法和数组旋转操作,因此要求神经网络的所有激活函数在某个区间内由多项式近似,施加了严格的设计容差。在本文中,我们首次证明该方案易受溢出攻击,即看似良性的输入可能超过FHE电路的容差,从而导致输出损坏且不可用。为了避免这种情况,我们提出了一种形式化验证技术,计算网络中所有神经元范围的认证界限。通过构造,我们的方法消除了溢出,并且在我们的实验中,在所有基准测试上消除了观察到的溢出,将故障率从高达47%降低到0%。此外,我们的无溢出解决方案与大多数基于CKKS的框架兼容,因为它允许简单地用具有严格设计范围的多项式替换标准多项式。

英文摘要

Fully homomorphic encryption (FHE) enables private inference by evaluating neural networks on encrypted data. In this way, we can delegate the computation to a third-party server without ever revealing the user's data. Currently, the CKKS scheme is the backbone of most efficient FHE implementations, but it only supports addition, multiplication, and array rotation operations, thus requiring all activation functions of the neural network to be approximated by polynomials within a certain interval, imposing strict design tolerances. In this paper, we demonstrate for the first time that this scheme is vulnerable to overflow attacks, i.e., seemingly benign inputs that can exceed such tolerances of the FHE circuit, thereby causing corrupt and unusable outputs. To avoid them, we propose a formal verification technique that computes certified bounds on the ranges of all neurons in the network. By construction, our method eliminates overflows and, in our experiments, removed observed overflows on all benchmarks, reducing failure rates from up to 47% to 0%. Moreover, our overflow-free solution is compatible with most CKKS-based frameworks, as it allows standard polynomials to be simply substituted by polynomials with rigorously designed ranges.
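The abstract does not say which verification technique produces the certified bounds. As an illustration only, plain interval arithmetic through an affine layer already yields sound (if loose) ranges, which can then be compared with the interval on which a polynomial activation is valid. The function names and the example numbers below are hypothetical.

```python
import numpy as np

def affine_bounds(lower, upper, weight, bias):
    """Propagate elementwise input bounds through y = weight @ x + bias."""
    w_pos = np.clip(weight, 0.0, None)   # positive part of the weights
    w_neg = np.clip(weight, None, 0.0)   # negative part of the weights
    out_lower = w_pos @ lower + w_neg @ upper + bias
    out_upper = w_pos @ upper + w_neg @ lower + bias
    return out_lower, out_upper

def fits_interval(lower, upper, approx_min, approx_max):
    """True if every neuron range lies inside the approximation interval."""
    return bool(np.all(lower >= approx_min) and np.all(upper <= approx_max))

weight = np.array([[1.0, -2.0], [0.5, 0.5]])
bias = np.array([0.0, 1.0])
lo, hi = affine_bounds(np.array([-1.0, -1.0]), np.array([1.0, 1.0]), weight, bias)
safe_wide = fits_interval(lo, hi, -4.0, 4.0)     # polynomial valid on [-4, 4]
safe_narrow = fits_interval(lo, hi, -2.0, 2.0)   # polynomial valid on [-2, 2]
```

Here the first neuron can reach the range [-3, 3], so an activation polynomial fitted only on [-2, 2] could be pushed outside its design interval, while one fitted on [-4, 4] could not.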

2605.23089 2026-05-25 cs.LG cs.AI 版本更新

Dreaming Smoothly and Sample Efficiently with Gradient Penalized Latent Dynamics

利用梯度惩罚潜在动力学实现平滑且高效的采样

Romil V. Sonigra, P. R. Kumar

Affiliations * Department of Electrical and Computer Engineering, Texas A&M University

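AI summary This paper proposes GPLD, a gradient-penalized latent dynamics regularizer for improving latent world models in model-based reinforcement learning. By applying a row-wise Jacobian penalty to the posterior latent distribution, the method explicitly encourages locally smooth transition dynamics, improving sample efficiency and learning stability. Experiments on DeepMind Control proprioceptive tasks show gains that are strongest in higher-complexity locomotion environments; on quadruped tasks, GPLD reaches high-return behavior earlier and learns more consistently over longer horizons.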

Comments 17 pages and 9 figures

详情
AI中文摘要

基于模型的强化学习通过学习世界模型来提高样本效率。然而,现有的潜在世界模型(如DreamerV3)并未明确强制其学习的转移动力学具有局部平滑性,从而未利用这一有用的归纳偏置。我们提出GPLD,一种用于DreamerV3的梯度惩罚潜在动力学正则化器,通过对后验潜在分布施加行雅可比惩罚来鼓励局部平滑的转移学习。我们证明该惩罚可解释为离散嵌入状态MDP中转移律的有限差分平滑的连续潜在类比,并使用Hutchinson风格随机探针高效估计。实验上,在DeepMind Control本体感受任务中,GPLD提高了总体样本效率,在复杂度较高的运动环境中尤其显著。在更具挑战性的四足任务中,GPLD更早达到高回报行为,并在更长的时间跨度内表现出更一致的后期学习。显式局部平滑正则化是改善平滑连续控制环境中潜在世界模型的简单有效方法。GPLD代码见github.com/romils9/gpld-mbrl。

英文摘要

Model-based reinforcement learning improves sample efficiency by learning a world model. However, existing latent world models such as DreamerV3 do not explicitly enforce local smoothness in their learned transition dynamics, leaving a useful inductive bias for transition dynamics learning unexploited. We propose GPLD, a gradient-penalized latent dynamics regularizer for DreamerV3 that applies a row-wise Jacobian penalty to the posterior latent distribution to encourage locally smooth transition learning. We show that this penalty can be interpreted as the continuous-latent analog of finite-difference smoothing of transition laws in discrete embedded-state MDPs, and estimate it efficiently using Hutchinson-style stochastic probes. Empirically, across DeepMind Control proprioceptive tasks, GPLD improves aggregate sample efficiency, with particularly strong gains on higher-complexity locomotion environments. On more challenging quadruped tasks, GPLD reaches high-return behavior earlier and exhibits more consistent late-stage learning over longer horizons. Explicit local smoothness regularization is a simple and effective way to improve latent world models for smooth continuous control environments. Code for GPLD is available at github.com/romils9/gpld-mbrl .
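The Hutchinson-style estimate mentioned in the abstract can be illustrated on an explicit matrix. For random sign vectors v, the expectation of ||Jᵀv||² equals the squared Frobenius norm of J. The sketch below assumes a small dense Jacobian; in a world model the product Jᵀv would come from a vector-Jacobian product instead. It is not the GPLD penalty itself, whose exact row-wise form the abstract does not give.

```python
import numpy as np

def hutchinson_frobenius_sq(jacobian, num_probes, rng):
    """Estimate ||J||_F^2 as the mean of ||J^T v||^2 over random sign probes v."""
    total = 0.0
    for _ in range(num_probes):
        v = rng.choice([-1.0, 1.0], size=jacobian.shape[0])  # Rademacher probe
        total += np.sum((jacobian.T @ v) ** 2)
    return total / num_probes

rng = np.random.default_rng(0)
jac = rng.normal(size=(4, 6))   # stand-in Jacobian, purely illustrative
estimate = hutchinson_frobenius_sq(jac, num_probes=2000, rng=rng)
exact = np.sum(jac ** 2)
```

The estimate is unbiased but noisy, so the number of probes trades accuracy against cost.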

2605.23087 2026-05-25 cs.LG 版本更新

The Implicit Bias of Depth: From Neural Collapse to Softmax Codes

深度的隐式偏差:从神经坍缩到Softmax编码

Connall Garrod, Jonathan P. Keating, Christos Thrampoulidis

Affiliations * Mathematical Institute, University of Oxford; Department of Electrical and Computer Engineering, University of British Columbia

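AI summary This study examines how the implicit bias of gradient descent in deep networks affects neural collapse (NC). Analyzing the deep unconstrained feature model (UFM) trained without regularization, it finds that depth alone induces an implicit low-rank bias, favoring low-rank feature representations that correspond to softmax codes. The study also characterizes how depth shapes training dynamics and shrinks NC's basin of attraction, and shows that, for randomly initialized networks, increasing width biases training toward higher-rank solutions.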

Comments 46 pages, 11 figures, accepted at the International Conference on Machine Learning 2026

详情
AI中文摘要

神经坍缩(NC)描述了训练分类器中特征和权重出现的结构化几何。最近的理论表明,NC在深度架构中可能不是最优的,将其归因于L2正则化的显式低秩偏差。我们研究了深度无约束特征模型(UFM)——等价于具有正交输入的深度线性网络——在无正则化训练下的情况,以隔离梯度下降和深度单独如何塑造NC。我们表明,深度诱导了隐式低秩偏差:低秩矩阵通过连续乘法更有效地传播范数,从而促进NC的低秩替代方案。我们认为,这些替代方案对应于softmax编码:先前在宽度瓶颈网络中发现的最大间隔解。通过分析谱初始化下的训练动态,我们识别出早期奇异值之间的排斥力驱动低秩出现,并刻画了深度如何缩小NC的吸引域。最后,我们展示了一些相反方向的效果:对于随机初始化的网络,增加宽度会使训练偏向更高秩的解。我们的结果首次提供了在无正则化多类交叉熵训练的深度UFM中隐式偏差的渐近和动态刻画。

英文摘要

Neural collapse (NC) describes the structured geometry that emerges in the features and weights of trained classifiers. Recent theory suggests NC can be suboptimal in deep architectures, attributing this to an explicit low-rank bias from L2 regularization. We study the deep unconstrained feature model (UFM), equivalent to a deep linear network with orthogonal inputs, trained without regularization, to isolate how gradient descent and depth alone shape NC. We show that depth induces an implicit low-rank bias: low-rank matrices propagate norm more efficiently through successive multiplications, promoting low-rank alternatives to NC. These alternatives, we argue, correspond to softmax codes: max-margin solutions previously found in width-bottlenecked networks. Analyzing training dynamics under spectral initialization, we identify an early-time repulsion among singular values that drives low-rank emergence, and characterize how depth shrinks NC's basin of attraction. Finally, we show that some effects act in the opposite direction: for randomly initialized networks, increasing width biases training toward higher-rank solutions. Our results provide the first asymptotic and dynamic characterization of implicit bias in deep UFMs trained with unregularized multiclass cross-entropy.

2605.23081 2026-05-25 cs.LG 版本更新

ThriftAttention: Selective Mixed Precision for Long-Context FP4 Attention

ThriftAttention: 面向长上下文FP4注意力机制的选择性混合精度

Joe Sharratt

Affiliations * NVIDIA Corporation

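AI summary The quadratic cost of attention is a key challenge in long-context workloads. ThriftAttention is a selective mixed-precision method that substantially improves long-context quality while retaining FP4 inference efficiency. It works in two stages: a heuristic first selects a small number of important query-key block pairs to compute in FP16, then the remaining blocks are computed in FP4 and the two paths are merged via online softmax. Computing only 5% of blocks in FP16 recovers on average 89.1% of the FP4-to-FP16 performance gap.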

详情
AI中文摘要

高效的注意力算法对于减轻长上下文工作负载中注意力的二次成本至关重要。先前的工作在Blackwell GPU上利用块缩放量化技术将注意力计算移至4位精度以加速推理。然而,这些技术在长上下文设置中会导致显著的质量下降。我们表明,量化误差的输出影响高度不均匀,并且随着每个查询-键交互的重要性而增加,将功能相关的误差集中在包含最重要标记的少量注意力块中。我们提出ThriftAttention,一种低比特注意力变体,在FP4推理效率下提供接近FP16的长上下文质量。该方法分两个阶段进行。首先,一种启发式方法快速选择少量重要的查询-键块对进行FP16精度计算。其次,选中的块以FP16计算,其余块以FP4计算,两条路径通过在线softmax合并为单个输出。我们在长上下文基准和模型家族上证明,通过仅计算5%的查询-键块为FP16,ThriftAttention平均恢复了FP4到FP16性能差距的89.1%。我们展示了ThriftAttention的优势随序列长度增加而增长,缓解了在更长上下文中观察到的系统性FP4质量下降。代码可在https://github.com/joesharratt1229/ThriftAttention获取。

英文摘要

Efficient attention algorithms are critical to mitigate the quadratic cost of attention in long-context workloads. Prior work utilises block-scaled quantisation techniques on Blackwell GPUs to move attention computation to 4-bit precision to accelerate inference. However, these techniques result in significant quality degradation in long-context settings. We show that the output impact of quantisation error is highly non-uniform and increases with the importance of each query-key interaction, concentrating functionally relevant error in a small number of attention blocks that contain the most important tokens. We propose ThriftAttention, a low-bit attention variant that delivers near-FP16 long-context quality at FP4 inference efficiency. This approach proceeds in two stages. First, a heuristic rapidly selects a small number of important query-key block pairs for FP16 precision. Second, the selected blocks are computed in FP16 and the remaining blocks in FP4, with both paths merged via online softmax into a single output. We demonstrate across long-context benchmarks and model families that by computing only 5% of query-key blocks in FP16, ThriftAttention recovers on average 89.1% of the FP4-to-FP16 performance gap. We show ThriftAttention's advantage grows with sequence length, mitigating the systematic FP4 quality degradation observed at longer contexts. The code is available at https://github.com/joesharratt1229/ThriftAttention.
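The online-softmax merge of two attention paths can be sketched as follows. Each path returns its row-wise maximum, its softmax normalizer, and its unnormalized output; rescaling both to a shared maximum reproduces full softmax attention exactly. This is a simplified sketch under stated assumptions: both paths run in float64, so no FP4 or FP16 arithmetic is simulated, and the split is over whole keys rather than query-key block pairs. The `high` index list is a hypothetical stand-in for the blocks the selection heuristic would pick.

```python
import numpy as np

def partial_attention(q, k, v):
    """Return (row max, normalizer, unnormalized output) for one set of keys."""
    scores = q @ k.T
    m = scores.max(axis=1, keepdims=True)
    p = np.exp(scores - m)
    return m, p.sum(axis=1, keepdims=True), p @ v

def merge(part_a, part_b):
    """Combine two partial results by rescaling both to a shared row maximum."""
    m_a, l_a, o_a = part_a
    m_b, l_b, o_b = part_b
    m = np.maximum(m_a, m_b)
    a, b = np.exp(m_a - m), np.exp(m_b - m)
    return m, a * l_a + b * l_b, a * o_a + b * o_b

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 8))
k = rng.normal(size=(10, 8))
v = rng.normal(size=(10, 4))

high = [0, 7]                                    # hypothetical "important" keys
low = [i for i in range(10) if i not in high]    # everything else
_, norm, out = merge(partial_attention(q, k[high], v[high]),
                     partial_attention(q, k[low], v[low]))
merged = out / norm

# Reference: ordinary softmax attention over all keys at once.
scores = q @ k.T
weights = np.exp(scores - scores.max(axis=1, keepdims=True))
reference = (weights / weights.sum(axis=1, keepdims=True)) @ v
```

Because the merge is exact, any quality difference in a mixed-precision setting comes only from the arithmetic used inside each path, not from how the paths are combined.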

2605.23078 2026-05-25 cs.LG cs.CL 版本更新

GEMQ: Global Expert-Level Mixed-Precision Quantization for MoE LLMs

GEMQ:MoE大语言模型的全局专家级混合精度量化

Jianing Deng, Song Wang, Dongwei Wang, Zijie Liu, Tianlong Chen, Huanrui Yang, Jingtong Hu

Affiliations * University of Pittsburgh; University of Central Florida; University of Arizona; University of North Carolina at Chapel Hill

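AI summary Mixture-of-Experts large language models (MoE-LLMs) perform strongly but incur substantial memory overhead from their many expert parameters. This paper proposes GEMQ, a global expert-level mixed-precision quantization method that captures model-wide expert importance through a global linear-programming formulation and adds efficient router fine-tuning to adapt routing to the quantized experts, yielding a better accuracy-memory trade-off. Experiments show that GEMQ significantly reduces memory and accelerates inference with minimal accuracy loss.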

Comments ICML 2026

详情
AI中文摘要

混合专家大语言模型(MoE-LLMs)性能强大,但由于大量专家参数导致显著的内存开销。混合精度量化根据专家重要性分配不同的位宽,接近精度-内存帕累托前沿,并实现极低比特量化。然而,现有方法依赖于逐层重要性估计,忽视了量化引起的路由器偏移,导致次优的分配和路由。本文提出全局专家级混合精度量化(GEMQ),通过(1)基于量化误差分析的全局线性规划公式来捕获模型范围内的专家重要性,以及(2)高效的路由器微调以适应量化后的专家,从而克服这些限制。这些组件被集成到一个渐进式量化框架中,该框架迭代地优化重要性估计和分配。实验表明,GEMQ在最小化精度损失的情况下显著减少内存并加速推理。源代码可在 https://github.com/jndeng/GEMQ 获取。

英文摘要

Mixture-of-Experts Large Language Models (MoE-LLMs) achieve strong performance but incur substantial memory overhead due to massive expert parameters. Mixed-precision quantization mitigates this cost by allocating expert-wise bit-widths based on their importance, approaching the accuracy-memory Pareto frontier and enabling extreme low-bit quantization. However, existing methods rely on layer-wise importance estimation and overlook router shifts induced by quantization, resulting in suboptimal allocation and routing. In this work, we propose Global Expert-level Mixed-precision Quantization (GEMQ) to overcome these limitations via (1) a global linear-programming formulation that captures model-wide expert importance based on quantization error analysis, and (2) efficient router fine-tuning to adapt routing to quantized experts. These components are integrated into a progressive quantization framework that iteratively refines importance estimation and allocation. Experiments demonstrate that GEMQ significantly reduces memory and accelerates inference with minimal accuracy degradation. Source code is available at https://github.com/jndeng/GEMQ .

2605.23065 2026-05-25 cs.CV cs.AI cs.LG 版本更新

Dithering Defense: Adversarial Robustness of Vision Foundation Models via Multi-Level Floyd-Steinberg Dithering

抖动防御:通过多级 Floyd-Steinberg 抖动实现视觉基础模型的对抗鲁棒性

Yury Belousov, Brian Pulfer, Vitaliy Kinakh, Slava Voloshynovskiy

Affiliations * Department of Computer Science, University of Geneva, Switzerland

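AI summary This study proposes a lightweight input transformation based on multi-level Floyd-Steinberg dithering to make vision foundation models more robust to adversarial attacks. The error-diffusion quantization disrupts adversarial perturbations while preserving semantic content, and it applies across downstream tasks and model architectures. Experiments show that the method matches or exceeds all tested baselines, including diffusion-based denoising, across several attacks, with substantially less degradation on clean inputs.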

Comments Paper accepted at the IEEE International Conference on Image Processing (ICIP 2026)

详情
AI中文摘要

视觉基础模型被广泛用作许多下游任务中的冻结骨干,使其成为对抗攻击下的单点故障。我们研究了多级 Floyd-Steinberg 误差扩散抖动作为一种轻量级、模型无关的输入变换,它在保留语义内容的同时破坏对抗扰动。与先前局限于二值抖动、灰度 CIFAR-10 和从头训练的单个小模型的工作不同,我们在六个任务(分类、分割、深度估计、检索、字幕生成、视觉问答)、两个模型家族(DINOv2、PaliGemma)以及三种强度递增的攻击(PGD、MI-FGSM、SIA)上进行了评估,还包括使用直通估计器的自适应攻击者。我们的结果表明,在中间量化级别上的 Floyd-Steinberg 抖动,尤其是与后处理模糊相结合时,超过或匹配所有测试的基线(包括基于扩散的去噪),并且在干净输入上的退化显著更小。

英文摘要

Vision foundation models are widely used as frozen backbones across many downstream tasks, making them a single point of failure under adversarial attack. We study multi-level Floyd-Steinberg error-diffusion dithering as a lightweight, model-agnostic input transformation that disrupts adversarial perturbations while preserving semantic content. Unlike prior work, which was limited to binary dithering, grayscale CIFAR-10, and a single small model trained from scratch, we evaluate across six tasks (classification, segmentation, depth estimation, retrieval, captioning, visual question answering), two model families (DINOv2, PaliGemma), and three attacks of increasing strength (PGD, MI-FGSM, SIA), as well as an adaptive attacker using a straight-through estimator. Our results show that Floyd-Steinberg dithering at intermediate quantization levels, especially when combined with post-processing blur, exceeds or matches all tested baselines, including diffusion-based denoising, with substantially less degradation on clean inputs.
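For readers unfamiliar with the transformation, the sketch below implements textbook multi-level Floyd-Steinberg error diffusion on a grayscale image with values in [0, 1]. Each pixel is rounded to the nearest of `levels` evenly spaced values, and the rounding error is pushed to unvisited neighbours with the standard 7/16, 3/16, 5/16, 1/16 weights. The paper's exact preprocessing (colour handling, level count, post-processing blur) is not given in the abstract, so those details are not reproduced here.

```python
import numpy as np

def floyd_steinberg(image, levels):
    """Dither a 2D array with values in [0, 1] to `levels` evenly spaced values."""
    img = np.array(image, dtype=float)
    height, width = img.shape
    step = 1.0 / (levels - 1)
    for y in range(height):
        for x in range(width):
            old = img[y, x]
            new = np.clip(np.round(old / step), 0, levels - 1) * step
            img[y, x] = new
            err = old - new
            # Diffuse the quantization error to neighbours not yet visited.
            if x + 1 < width:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < height:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < width:
                    img[y + 1, x + 1] += err * 1 / 16
    return img

gradient = np.tile(np.linspace(0.0, 1.0, 16), (8, 1))  # synthetic test image
dithered = floyd_steinberg(gradient, levels=4)
```

Setting `levels=2` gives classic binary dithering; intermediate values correspond to the multi-level setting the abstract refers to.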

2605.23064 2026-05-25 cs.CV cs.LG 版本更新

Millimeter-wave Imaging for Anthropometric Body Measurement

毫米波成像用于人体测量

Miriam Senne, Benjamin D. Killeen, Christoph Baur, Nassir Navab, Azade Farshad

Affiliations * Chair for Computer Aided Medical Procedures; Technical University of Munich; Rohde & Schwarz GmbH & Co. KG; Munich Center for Machine Learning; ELLIS Unit Helsinki, Dept. Computer Science, Aalto University

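AI summary This study proposes a contactless method for anthropometric body measurement based on millimeter-wave (mmWave) radar, addressing the privacy, efficiency, and accessibility limitations of conventional measurement tools. An optimization-based framework recovers 3D human shape from mmWave point clouds and extracts a comprehensive set of anthropometric measurements. The core contribution is a vertex-weighting strategy that, together with a parametric body model (SMPL), provides robust surface alignment and noise suppression. The result is a fast, privacy-preserving workflow that needs neither undressing nor cameras and supports clinical risk assessment for patients of all ages and mobility levels.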

详情
AI中文摘要

身体形状和围度是临床上用于风险分层的信息性生物标志物,包括腰臀比、肢体和躯干周长等指标,然而传统工具如手动卷尺和光学扫描仪通常需要脱衣和保持姿势。这些要求减缓了工作流程,损害了尊严,并且排除了许多老年人和行动不便者。为了实现快速无接触测量,我们利用毫米波雷达,它保护隐私并能穿透典型衣物,实现快速全身采集。在这项工作中,我们提出了一个新的基于优化的框架,从体积毫米波数据中恢复3D人体形状并提取一套全面的人体测量数据。我们的方法引入了一个加权配准流程,将参数化身体模型(SMPL)直接拟合到噪声毫米波点云上。我们贡献的核心是一种顶点加权策略,该策略调节Chamfer能量函数以实现可靠的表面对齐和噪声消除。我们通过加入脚-地面约束和姿态先验进一步稳定拟合,直接优化SMPL参数。这些组件共同实现了一个快速、保护隐私的工作流程,无需摄像头或脱衣,且只需最小程度的配合,即可通过衣物提供高保真度的身体形状和测量数据,支持在诊所和护理机构中对所有年龄和活动水平的患者进行频繁的风险导向评估。

英文摘要

Body shape and circumferences are clinically informative biomarkers for risk stratification, including measures such as waist-to-hip ratio and limb and trunk girths, yet conventional tools such as manual tape measures and optical scanners often require undressing and sustained poses. These demands slow workflows, compromise dignity, and exclude many older adults and people with limited mobility. To make measurement fast and contactless, we leverage millimeter-wave (mmWave) radar, which preserves privacy and operates through typical clothing, enabling quick full-body acquisition. In this work, we present a new optimization-based framework to recover 3D human shape and extract a comprehensive set of anthropometric measurements from volumetric mmWave data. Our method introduces a weighted registration pipeline that fits a parametric body model (SMPL) directly to the noisy mmWave point cloud. The core of our contribution is a vertex-weighting strategy that modulates a Chamfer energy function for reliable surface alignment and noise elimination. We further stabilize the fit by incorporating a foot-ground plane constraint and pose priors, optimizing directly for the SMPL parameters. Together, these components enable a fast, privacy-preserving workflow that delivers high-fidelity body shape and measurements through clothing, without cameras or disrobing and with minimal cooperation, supporting frequent risk-oriented assessments in clinics and care facilities for patients of all ages and mobility levels.
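A vertex-weighted Chamfer energy can be sketched as below. The abstract does not specify how the weights are chosen or how they enter each term, so this weighting is an assumption: each model vertex carries a weight, and each cloud point inherits the weight of its nearest vertex. Down-weighting a vertex reduces the influence of both that vertex and the noisy points that attach to it. The function name and toy coordinates are hypothetical.

```python
import numpy as np

def weighted_chamfer(model_vertices, vertex_weights, point_cloud):
    """Symmetric Chamfer energy with per-vertex weights (illustrative form)."""
    diff = model_vertices[:, None, :] - point_cloud[None, :, :]
    dist_sq = np.sum(diff ** 2, axis=-1)        # shape: (vertices, points)
    vertex_to_cloud = dist_sq.min(axis=1)       # nearest point per vertex
    cloud_to_vertex = dist_sq.min(axis=0)       # nearest vertex per point
    nearest_vertex = dist_sq.argmin(axis=0)
    term_model = np.sum(vertex_weights * vertex_to_cloud) / np.sum(vertex_weights)
    point_weights = vertex_weights[nearest_vertex]
    term_cloud = np.sum(point_weights * cloud_to_vertex) / np.sum(point_weights)
    return term_model + term_cloud

vertices = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
cloud = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 5.0]])   # second point is an outlier
uniform = weighted_chamfer(vertices, np.array([1.0, 1.0]), cloud)
robust = weighted_chamfer(vertices, np.array([1.0, 0.0]), cloud)
```

In this toy case, setting the second vertex weight to zero removes the outlier's contribution entirely; a fitting loop would minimise such an energy over the SMPL parameters.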

2605.23061 2026-05-25 cs.LG cs.AI math.OC stat.ML 版本更新

Anytime Training with Schedule-Free Spectral Optimization

任意时间训练:无调度谱优化

Anuj Apte, Pranav Deshpande, Niraj Kumar, Shouvanik Chakrabarti, Junhyung Lyle Kim

Affiliations * Global Technology Applied Research

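AI summary This paper proposes SF-NorMuon, a schedule-free spectral optimizer that removes the dependence of neural network training on learning-rate schedules tied to a fixed horizon. Without a preset training horizon, it matches or exceeds well-tuned AdamW on 125M and 772M parameter language models. The paper also proves a stationarity guarantee for schedule-free spectral dynamics and identifies weight decay at the fast iterate as essential for long-horizon stability, making horizon-free optimization more practical for continual learning.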

详情
AI中文摘要

标准神经网络训练依赖于与固定训练步数绑定的学习率调度,导致路径依赖性强,且当数据可用性变化时需要昂贵的重新调优。无调度(SF)方法通过移除显式调度来解决这一问题,然而当前最先进的任意时间优化器SF-AdamW始终不如调优后的AdamW基线。我们提出SF-NorMuon,一种无调度谱优化器,弥补了这一差距:使用单一超参数配置,SF-NorMuon在125M和772M参数的语言模型上,在$1$--$8 imes$ Chinchilla训练步数范围内匹配或超过了调优的AdamW。在理论方面,我们证明了无调度谱动力学的平稳性保证,并指出快速迭代上的权重衰减对于长步数稳定性至关重要。SF-NorMuon使从业者能够在训练过程中的任何时刻获得高质量检查点,而无需预先承诺训练步数。通过缩小与调优基线的性能差距,SF-NorMuon使无步数优化更加实用,向真正开放式的持续学习迈出了一步。

英文摘要

Standard neural network training relies on learning-rate schedules tied to a fixed horizon, leading to strong path dependence and costly re-tuning as data availability changes. Schedule-Free (SF) methods address this by removing explicit schedules, yet SF-AdamW, the current state-of-the-art anytime optimizer, consistently underperforms well-tuned AdamW baselines. We propose SF-NorMuon, a schedule-free spectral optimizer that closes this gap: with a single hyperparameter configuration, SF-NorMuon matches or exceeds tuned AdamW on 125M and 772M parameter language models across $1$--$8\times$ Chinchilla horizons. On the theoretical side, we prove a stationarity guarantee for schedule-free spectral dynamics and identify weight decay at the fast iterate as essential for long-horizon stability. SF-NorMuon enables practitioners to obtain high-quality checkpoints at any point during training without committing to a horizon in advance. By closing the performance gap with tuned baselines, SF-NorMuon makes horizon-free optimization more practical, taking a step towards truly open-ended, continual learning.
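To make the schedule-free idea concrete, the sketch below shows plain schedule-free SGD with a constant learning rate on a toy quadratic. It keeps a fast iterate `z`, an averaged iterate `x`, and takes gradients at an interpolation `y` of the two. This is not SF-NorMuon: there is no spectral (orthogonalized) update and no weight decay, and the objective and hyperparameter values are assumptions chosen for illustration.

```python
import numpy as np

def schedule_free_sgd(grad, start, lr, beta, steps):
    """Schedule-free SGD: constant step size plus online iterate averaging."""
    z = np.array(start, dtype=float)   # fast iterate, takes the gradient steps
    x = z.copy()                       # averaged iterate, used for evaluation
    for t in range(1, steps + 1):
        y = (1.0 - beta) * z + beta * x    # gradient is evaluated here
        z = z - lr * grad(y)
        c = 1.0 / t                        # uniform averaging weight
        x = (1.0 - c) * x + c * z
    return x

target = np.array([3.0, -1.0])
# Toy objective: f(w) = 0.5 * ||w - target||^2, so grad f(w) = w - target.
x_final = schedule_free_sgd(lambda w: w - target, start=[0.0, 0.0],
                            lr=0.5, beta=0.9, steps=500)
```

Because `x` is a running average rather than the endpoint of a decaying schedule, it can be read out as a checkpoint at any step without knowing the total number of steps in advance.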

2605.23057 2026-05-25 cs.LG cs.CL cs.PF 版本更新

ModeSwitch-LLM: A Lightweight Phase-Aware Controller for Cross-Mode LLM Inference on a Single GPU

ModeSwitch-LLM:单GPU上跨模式LLM推理的轻量级阶段感知控制器

Aman Sunesh, Ali Alshehhi, Hivansh Dhakne

Affiliations * New York University Abu Dhabi; New York University; United States

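AI summary ModeSwitch-LLM is a lightweight request-boundary controller that improves the efficiency of large language model inference on a single GPU by routing each request to an appropriate fixed inference mode. Using cheap workload-level features, it selects among FP16, quantized modes, speculative decoding, and hybrid modes rather than relying on one static configuration. Experiments show that the controller substantially reduces latency and energy while keeping accuracy close to FP16, and that the learned routers evaluated do not clearly outperform the rule-based controller because they add routing overhead and more often violate quality, energy, or memory constraints.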

Comments 10 pages main text, 11 pages including references, 5 figures, 3 tables. Preprint

详情
AI中文摘要

ModeSwitch-LLM是一种轻量级请求边界控制器,通过将每个请求路由到适当的固定推理模式,提高单GPU大语言模型推理效率。该系统不依赖单一的静态服务配置,而是利用廉价的工作负载级特征,在FP16、量化模式、推测解码以及混合模式(如GPTQ加前缀缓存和INT8加连续批处理)之间进行选择。我们在单个NVIDIA A100 GPU上对Meta-Llama-3.1-8B-Instruct进行了评估。在部署风格的合成工作负载上,在线控制器相比FP16实现了2.10倍的平均延迟加速和0.48倍的平均能耗比,相当于每个token能耗降低51.7%。在用作质量门的自动基准测试中,准确率接近FP16,平均差异为+0.17个百分点。我们还评估了轻量级学习路由器,但发现它们并未明显优于基于规则的控制器,因为它们增加了路由开销,并且更频繁地选择违反质量、能耗或内存约束的模式。这些结果表明,简单的请求感知路由可以从现有推理模式中恢复大量效率,而无需重新训练模型或更改其架构。

英文摘要

ModeSwitch-LLM is a lightweight request-boundary controller for improving single-GPU large language model inference efficiency by routing each request to an appropriate fixed inference mode. Instead of relying on one static serving configuration, the system selects among FP16, quantized modes, speculative decoding, and hybrid modes such as GPTQ plus prefix caching and INT8 plus continuous batching using cheap workload-level features. We evaluate ModeSwitch-LLM on Meta-Llama-3.1-8B-Instruct served on a single NVIDIA A100 GPU. On deployment-style synthetic workloads, the online controller achieves a 2.10x mean latency speedup over FP16 and a 0.48x mean energy ratio, corresponding to 51.7% lower energy per token. On automatic benchmarks used as a quality gate, accuracy remains close to FP16 with a mean delta of +0.17 percentage points. We also evaluate lightweight learned routers, but find that they do not clearly outperform the rule-based controller because they add routing overhead and more often select modes that violate quality, energy, or memory constraints. These results show that simple request-aware routing can recover substantial efficiency from existing inference modes without retraining the model or changing its architecture.

2605.23054 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Model Collapse as Cultural Evolution

模型崩溃作为文化演化

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Affiliations * The University of Hong Kong; Stellaris AI Limited

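AI summary This paper studies model collapse, the progressive degradation of large language models (LLMs) trained on their own outputs. Drawing on iterated learning theory from cultural evolution, the authors derive five falsifiable predictions and test them in multilingual experiments. They find that compositionality follows a non-monotonic trajectory under unfiltered self-training and is sustained only by task-grounded filtering. The study offers a linguistic account of model collapse and concrete principles for designing self-training pipelines.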

Comments Accepted at CoNLL 2026. 18 pages, 3 figures, 2 tables

详情
AI中文摘要

模型崩溃,即在其自身输出上训练的LLM的逐步退化,已被统计表征,但缺乏对哪些结构退化、以何种顺序以及为何退化的语言学解释。我们表明,文化演化中的迭代学习理论填补了这一空白。我们推导出五个可证伪的预测,区分了那些对该理论具有独特判别性的预测与确认性预测,并通过在英语、德语和土耳其语中自训练LLaMA-2-7B和Mistral-7B达10代来测试它们。关键的判别性发现:在未过滤的自训练下,组合性遵循非单调轨迹(先上升后下降)。这一特征在最大规则种子数据下持续存在(排除了噪声去除),并且仅由任务导向的过滤维持,而非随机过滤,提供了压缩-通信权衡的首个LLM尺度证据。所有预测均得到确认,效应量较大(Hedges' $g > 1.6$;$\mathrm{BF}_{10} > 100$),且LLM正则化梯度与人类行为数据高度匹配($R^2 = 0.94$)。这些结果将模型崩溃重新定义为文化传播现象,并为自训练管道设计提供了具体原则。

英文摘要

Model collapse, the progressive degradation of LLMs trained on their own outputs, has been characterized statistically but lacks a linguistic explanation for which structures degrade, in what order, and why. We show that iterated learning theory from cultural evolution fills this gap. We derive five falsifiable predictions, distinguish those uniquely discriminative for the theory from confirmatory ones, and test them by self-training LLaMA-2-7B and Mistral-7B over 10 generations in English, German, and Turkish. The critical discriminative finding: compositionality follows a non-monotonic trajectory (initially rising, then falling) under unfiltered self-training. This signature persists with maximally regular seed data (ruling out noise removal) and is sustained only by task-grounded filtering, not random filtering, providing the first LLM-scale evidence for the compression-communication tradeoff. All predictions are confirmed with large effect sizes (Hedges' $g > 1.6$; $\mathrm{BF}_{10} > 100$), and LLM regularization gradients closely match human behavioral data ($R^2 = 0.94$). These results reframe model collapse as a cultural transmission phenomenon and yield concrete principles for self-training pipeline design.
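The effect sizes above are reported as Hedges' g, which is Cohen's d multiplied by a small-sample correction factor. A minimal sketch of the standard formula follows; the two input samples are made-up numbers for illustration, not data from the paper.

```python
import numpy as np

def hedges_g(group_a, group_b):
    """Standardized mean difference with the usual small-sample correction."""
    a = np.asarray(group_a, dtype=float)
    b = np.asarray(group_b, dtype=float)
    n_a, n_b = len(a), len(b)
    pooled_var = ((n_a - 1) * a.var(ddof=1) + (n_b - 1) * b.var(ddof=1)) / (n_a + n_b - 2)
    cohen_d = (a.mean() - b.mean()) / np.sqrt(pooled_var)
    correction = 1.0 - 3.0 / (4.0 * (n_a + n_b) - 9.0)
    return cohen_d * correction

g = hedges_g([2.0, 4.0, 6.0], [1.0, 2.0, 3.0])  # illustrative samples only
```

The correction shrinks d toward zero for small samples and approaches 1 as the sample sizes grow.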

2605.23045 2026-05-25 cs.CV cs.AI cs.LG 版本更新

The TIME Machine: On The Power of Motion for Efficient Perception

时间机器:论运动在高效感知中的力量

Mantas Skackauskas, Xinyue Hao, Laura Sevilla-Lara

Affiliations * School of Informatics, University of Edinburgh

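AI summary This paper proposes a video representation learning approach that treats motion as the central modality, aiming to address the limits of existing video models in temporal understanding and training cost. It represents the motion in a video as point tracks and trains a masked autoencoder in a self-supervised way, which lets the model learn more efficient and fine-grained video representations. The method requires no language supervision, greatly reduces the amount of training data needed, and performs on par with current state-of-the-art models across a range of tasks, pointing toward video models that are more efficient and more temporally aware.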

详情
AI中文摘要

近年来,视频表示学习取得了巨大进展。这受到多种因素的推动,包括训练规模以及通过语言对比训练的视觉模型的成功。虽然这些因素推动了视频模型的能力边界,但它们也引入了自身的局限性:首先,扩展视频模型可能达到高昂的成本;其次,从语言学习限制了可学习概念的范围,仅限于字幕中的概念。因此,视频模型在时间理解方面仍然存在困难。在本文中,我们提出了一种新颖的方法,将运动作为视频表示的核心模态。具体而言,给定视频中以点轨迹形式存在的运动,我们使用掩码自编码器来掩码部分轨迹,并训练自编码器重建缺失的轨迹。这使我们能够以自监督方式学习表示。我们表明,使用运动来表示视频实际上解决了视频技术的两个核心局限性。首先,它使我们能够大幅减少训练数据的规模,因为运动本质上与外观无关,因此需要更少的样本就能很好地泛化。其次,运动使我们能够绕过依赖语言的训练范式,学习更细粒度的概念。结果是一种嵌入,我们称之为TIME(时间感知运动嵌入),这是一种仅使用合成运动数据训练的表示。我们在零样本方式下对广泛的任务测试了这种嵌入。我们观察到,无需额外技巧,其性能与使用多达4个数量级更少训练数据的最先进模型相当。这为迈向更有时序感知且更具可扩展性的视频模型新范式奠定了基础。

英文摘要

Video representation learning has seen tremendous progress in recent years. This has been driven by many factors, including the scale of training and the success of visual models trained contrastively with language. While these factors have pushed the boundaries of what video models can do, they also introduce their own set of limitations: first, scaling video models can reach prohibitive costs, and second, learning from language restricts the range of concepts that can be learned to those in captions. As a result, video models still struggle with temporal understanding. In this paper we propose a novel approach that uses motion as the central modality for video representation. In particular, given the motion in a video in the form of point tracks, we use a masked autoencoder to mask some of the tracks and train the autoencoder to reconstruct the missing tracks. This allows us to learn a representation in a self-supervised manner. We show that using motion to represent videos actually addresses both of the core limitations of video technology. First, it allows us to massively reduce the scale of training data, as motion is inherently appearance-independent and hence needs fewer examples to generalize well. Second, motion allows us to bypass the language-dependent training paradigm, learning better fine-grained concepts. The result is an embedding that we call TIME (Temporally Informed Motion Embedding), a representation trained exclusively on synthetic motion data. We test this embedding on a wide set of tasks in a zero-shot manner. We observe that without bells and whistles, performance is on par with state-of-the-art models while using up to 4 orders of magnitude less training data. This is a stepping stone towards a new paradigm of video models that are both more temporally aware and more scalable.

2605.23040 2026-05-25 cs.LG 版本更新

Steered Generation via Gradient-Based Optimization on Sparse Query Features

基于稀疏查询特征的梯度优化引导生成

Sumanta Bhattacharyya, Pedram Rooshenas

Affiliations * University of Illinois Chicago

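AI summary This paper studies how gradient-based optimization of sparse query features can steer the generation of large language models precisely. The authors propose Prototype-Based Sparse Steering, which decomposes attention query activations with sparse autoencoders and, during inference, uses gradient-based optimization to align the sparse representation with class prototypes of target behaviors. Experiments in a controlled environment and on an educational-domain task show that the method provides unified control over both logical planning and stylistic nuance.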

详情
AI中文摘要

潜在引导利用大型语言模型的内部表示来指导生成,但对密集状态的干预可能纠缠不同的语义特征。在本文中,我们研究注意力查询激活作为精确控制的高保真位点,假设操纵注意力机制本身比一般状态干预提供更清晰的引导能力。我们引入了基于原型的稀疏引导框架,该框架将稀疏自编码器专门应用于查询激活,将其分解为可解释的特征,然后在推理过程中应用基于梯度的优化,使稀疏表示与目标行为的类原型对齐。为了验证这一架构见解,我们首先在文本化网格世界(一个用于可验证规划约束的受控环境)中分析该机制。我们证明,优化稀疏查询特征能够有效导航刚性规划需求(即安全路径与短路径),确认了该方法满足客观规则的能力。然后,我们通过在高维教育领域训练SAE来展示该框架的通用性,其中该框架引导反馈的认知复杂性(即布鲁姆分类法)。我们的实验表明,稀疏查询表示为逻辑规划和风格细节的统一、可解释控制提供了必要的解缠。

英文摘要

Latent steering exploits internal representations of Large Language Models (LLMs) to guide generation, yet interventions on dense states can entangle distinct semantic features. In this paper, we investigate attention query activations as a high-fidelity site for precise control, hypothesizing that manipulating the attention mechanism itself offers sharper steerability than general state interventions. We introduce Prototype-Based Sparse Steering, a framework that applies Sparse Autoencoders (SAEs) specifically to query activations, to decompose them into interpretable features, then apply gradient-based optimization during inference to align the sparse representation with class prototypes of target behaviors. To validate this architectural insight, we first analyze the mechanism in Textualized Gridworld, a controlled environment for verifiable planning constraints. We demonstrate that optimizing sparse query features enables effective navigation of rigid planning requirements (i.e., safe vs. short paths), confirming the method's ability to satisfy objective rules. We then demonstrate the framework's versatility by training SAEs on a high-dimensional educational domain, where the framework steers the cognitive complexity of feedback (i.e., Bloom's Taxonomy). Our experiments establish that sparse query representations provide the necessary disentanglement for unified, interpretable control over both logical planning and stylistic nuance.

2605.23039 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Do Language Models Know What Not to Say? Causal Evidence for Statistical Preemption in LLMs

语言模型知道不该说什么吗?大语言模型中统计预占的因果证据

Dongxin Guo, Jikun Wu, Siu Ming Yiu

Affiliations * The University of Hong Kong; Stellaris AI Limited

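AI summary This study investigates how language models acquire negative linguistic knowledge, that is, knowledge of what is unacceptable, through distributional competition, and identifies statistical preemption as the key mechanism. Across four experiments, it finds that LLM surprisal for unconventional constructions correlates strongly with human acceptability judgments and that the pattern is driven by the frequency of the competing form rather than by overall verb frequency. Preemption sensitivity scales as a power law with model size, and a controlled fine-tuning experiment demonstrates the causal effect of competing-form frequency on preemption behavior, lending computational support to Construction Grammar.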

Comments Accepted at CoNLL 2026. 21 pages (9 main body + appendices and references); 4 figures, 14 tables

详情
AI中文摘要

学习者在没有负面证据的情况下如何获得关于不可接受性的知识?构式语法提出了统计预占:接触常规形式(例如,“donated the books to the library”)会预占结构上可能但未经验证的替代形式(“*donated the library the books”)。我们提出了一项计算研究,首次在单一收敛设计中直接分离了大语言模型中的统计预占与竞争性固化假说。通过跨越120个英语动词-构式配对(与格、使役、方位格)的四个实验,我们表明:(1)大语言模型的惊讶度模式与人类可接受性判断强相关(r = 0.79),并在三个独立的行为数据集上得到验证;(2)这些模式由竞争形式频率驱动,而非整体动词频率,通过非循环偏相关得到确认;(3)预占敏感度随模型规模呈幂律增长;(4)一项受控微调干预因果地表明,操纵竞争形式频率会按预测方向改变预占行为,反向控制排除了频率敏感性混淆。这些结果提供了汇聚证据,表明神经语言模型通过分布竞争(构式语法所提出的核心机制)习得负面语言知识。

英文摘要

How do learners acquire knowledge of what is unacceptable without negative evidence? Construction Grammar proposes statistical preemption: exposure to a conventional form (e.g., "donated the books to the library") preempts structurally possible but unattested alternatives ("*donated the library the books"). We present a computational study that, for the first time, directly dissociates statistical preemption from the competing entrenchment hypothesis in large language models within a single converging design. Across four experiments spanning 120 English verb-construction pairings (dative, causative, locative), we show that (1) LLM surprisal patterns correlate strongly with human acceptability judgments ($r = 0.79$), validated against three independent behavioral datasets; (2) these patterns are driven by competing-form frequency rather than overall verb frequency, confirmed by non-circular partial correlations; (3) preemption sensitivity scales as a power law with model size; and (4) a controlled fine-tuning intervention causally demonstrates that manipulating competing-form frequencies shifts preemption behavior in the predicted direction, with reverse-direction controls ruling out frequency-sensitivity confounds. These results provide converging evidence that neural language models acquire negative linguistic knowledge through distributional competition, the core mechanism posited by Construction Grammar.
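The partial-correlation logic in point (2) can be illustrated with residualization: regress both variables on the control, then correlate the residuals. The data below are synthetic and constructed so that surprisal depends on competing-form frequency only; they are not the paper's data, and the paper's exact "non-circular" procedure is not described in the abstract.

```python
import numpy as np

def residualize(values, control):
    """Remove the linear effect of `control` (plus an intercept) from `values`."""
    design = np.column_stack([np.ones_like(control), control])
    coef, *_ = np.linalg.lstsq(design, values, rcond=None)
    return values - design @ coef

def partial_correlation(x, y, control):
    """Correlation between x and y after removing the control from both."""
    rx, ry = residualize(x, control), residualize(y, control)
    return float(np.corrcoef(rx, ry)[0, 1])

rng = np.random.default_rng(0)
verb_freq = rng.normal(size=200)          # synthetic overall verb frequency
competing_freq = rng.normal(size=200)     # synthetic competing-form frequency
surprisal = 2.0 * competing_freq + 0.5 * rng.normal(size=200)

r_competing = partial_correlation(surprisal, competing_freq, control=verb_freq)
r_verb = partial_correlation(surprisal, verb_freq, control=competing_freq)
```

In this constructed example `r_competing` stays high while `r_verb` is near zero, which is the pattern that would favour preemption over entrenchment.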

2605.23037 2026-05-25 cs.LG physics.flu-dyn 版本更新

Open Multimodal Datasets and Open-Source Software for Data-Driven Modeling of Multiphase Transport and Thermal Systems

用于多相输运和热系统数据驱动建模的开放多模态数据集与开源软件

Christy Dunlap, Hari Pandey, Stephen Pierson, Daniel Curl, Braden Stevens, Mohammad Ishraq Hossain, Annapurna Parjuli, Chinmaya Joshi, Han Hu

Affiliations * Department of Mechanical Engineering, University of Arkansas

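AI summary This paper presents a set of open multimodal datasets and open-source software tools developed by the NED3 Laboratory to advance data-driven modeling of multiphase transport and thermal-fluid systems. It proposes a spatial-plus-temporal dimensionality framework (S+TD) for organizing measured or simulated data by dimensionality, and provides public datasets covering boiling images, thermography, high-speed video, and more. It also describes companion software packages such as SeqReg, a sequence-regression library with applications including nonintrusive heat-flux estimation, offering a reproducible open-source platform for AI modeling in thermal-fluid research.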

Comments 23 pages, 7 figures

详情
AI中文摘要

数据驱动建模正成为多相输运、电子冷却、声学诊断和热流体数字孪生的核心,但进展受到数据集碎片化和原始仪器文件难以解码、重用或基准测试的限制。本文介绍了由纳米能源与数据驱动发现(NED3)实验室开发的开放多模态数据集和开源软件包生态系统,用于可复现的AI赋能热流体研究。我们提出了一个空间加时间维度框架,记为S+TD,用于按测量或模拟场的维度对数据集进行分类,包括0+0D点值、0+1D时间序列、1+0D剖面、2+0D图像、2+1D视频、3+0D体积场以及多模态组合。我们整理了公开的NED3数据集,涵盖沸腾图像、声学和热测量、高速视频、红外热成像、热阻测量、CFD生成场、设计文件和声发射数据。我们还描述了配套的软件包,包括BubbleID、SeqReg、CFDTwin、IRISApp、decode-wfs、AELab和FlowLab,这些软件支持计算机视觉、序列回归、代理建模、红外分析、波形解码、声发射分析和多模态诊断。特别强调了SeqReg,这是一个用于0+1D、1+1D和2+1D数据的通用序列回归库,应用包括非侵入式热通量估计。最后,我们讨论了未来社区努力构建可互操作的热流体数据库和精选的AI/ML工具库,以连接数据集、元数据、解码器、基线、基准和物理可解释模型。

英文摘要

Data-driven modeling is becoming central to multiphase transport, electronics cooling, acoustic diagnostics, and thermal-fluid digital twins, but progress is limited by fragmented datasets and raw instrument files that are difficult to decode, reuse, or benchmark. This paper presents an open ecosystem of multimodal datasets and open-source software packages developed by the Nano Energy and Data-Driven Discovery (NED3) Laboratory for reproducible AI-enabled thermal-fluid research. We introduce a spatial-plus-temporal dimensionality framework, denoted S+TD, to classify datasets by the dimensionality of measured or simulated fields, including 0+0D point values, 0+1D time series, 1+0D profiles, 2+0D images, 2+1D videos, 3+0D volumetric fields, and multimodal combinations. We organize public NED3 datasets spanning boiling images, acoustic and thermal measurements, high-speed videos, infrared thermography, thermal-resistance measurements, CFD-generated fields, design files, and acoustic-emission data. We also describe complementary software packages, including BubbleID, SeqReg, CFDTwin, IRISApp, decode-wfs, AELab, and FlowLab, which support computer vision, sequence regression, surrogate modeling, infrared analysis, waveform decoding, acoustic-emission analysis, and multimodal diagnostics. Particular emphasis is placed on SeqReg, a general sequence-regression library for 0+1D, 1+1D, and 2+1D data, with applications such as nonintrusive heat-flux estimation. Finally, we discuss future community efforts to build interoperable thermal-fluid databanks and curated AI/ML tool libraries that connect datasets, metadata, decoders, baselines, benchmarks, and physically interpretable models.
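The S+TD labels listed in the abstract follow a simple pattern: the number of spatial dimensions, a plus sign, then 1 if the data have a time axis and 0 otherwise. A small helper makes the convention explicit. The function name `std_label` is hypothetical and is not part of any NED3 package.

```python
def std_label(spatial_dims, has_time_axis):
    """Build an S+TD label such as '2+1D' from the field's dimensionality."""
    if not 0 <= spatial_dims <= 3:
        raise ValueError("spatial_dims must be between 0 and 3")
    return f"{spatial_dims}+{1 if has_time_axis else 0}D"

# Examples matching the categories named in the abstract.
examples = {
    "point value": std_label(0, False),
    "time series": std_label(0, True),
    "profile": std_label(1, False),
    "image": std_label(2, False),
    "video": std_label(2, True),
    "volumetric field": std_label(3, False),
}
```

Under this convention a high-speed boiling video is 2+1D and an acoustic recording is 0+1D, which covers two of the input types SeqReg is described as handling.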

2605.23033 2026-05-25 cs.LG cs.AI 版本更新

Uncovering the Latent Potential of Deep Intermediate Representations

揭示深度中间表示的潜在能力

Arnesh Batra, Arush Gumber, Aniket Khandelwal, Jashn Khemani, Anubha Gupta

Affiliations * SBILab, Indraprastha Institute of Information Technology Delhi, Delhi, India

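AI summary This paper studies the latent value of intermediate representations in deep networks, showing that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naive aggregation. The authors propose LOES, a spectral layer-selection method, and GeoReg, a geometric regularization loss, to identify task-discriminative subspaces and stabilize representation geometry during fine-tuning. Experiments show that LOES outperforms standard baselines across architectures and data regimes, with gains that grow with model depth. The method also reveals how semantic factors are distributed across layers, which enables cross-lingual and cross-modal interpretability analyses.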

Comments Accepted to ICML2026 as a Spotlight

详情
AI中文摘要

在海量数据上预训练的基础模型学习到随深度演化的表示,形成具有不同语义内容和几何结构的嵌入层次。与仅使用最后一层或浅层混合的普遍做法相反,我们表明任务相关信息在层间非单调分布,且无法通过简单聚合恢复。通过跨多种模态的几何与实证研究,我们表明有效迁移依赖于识别哪些层编码任务判别结构以及它们的嵌入如何几何组织。我们提出层最优嵌入选择(LOES),一种构造性谱方法,通过在正交性和各向同性约束下最小化残差误差来识别任务判别子空间。为了将微调与此选择原则对齐,我们进一步提出几何正则化损失(GeoReg),它在微调期间对类流形施加单纯形结构并稳定表示几何。在广泛的架构、深度、模态和数据规模下,LOES 持续优于标准基线,且随着模型深度增加收益增长。除了准确性,我们的方法揭示了语义因素如何在层间分布,从而实现了跨语言和跨模态的可解释性分析。总之,我们的结果提供了强有力的证据,表明逐层嵌入几何不是偶然的,而是深度模型表示和迁移知识的核心。

英文摘要

Foundational models pretrained on huge amounts of data learn representations that evolve across depth, forming a hierarchy of embeddings with distinct semantic content and geometric structure. Contrary to the widespread practice of using only the final layer or shallow mixtures, we show that task-relevant information is distributed non-monotonically across layers and cannot be recovered by naïve aggregation. Through a geometric and empirical study across multiple modalities, we show that effective transfer depends on identifying which layers encode task-discriminative structure and how their embeddings are geometrically organized. We introduce Layer-wise Optimal Embedding Selection (LOES), a constructive spectral method that identifies task-discriminative subspaces by minimizing residual error under orthogonality and isotropy constraints. To align fine-tuning with this selection principle, we further propose Geometric Regularization Loss (GeoReg), which enforces a simplicial structure on class manifolds and stabilizes representation geometry during fine-tuning. Across a wide range of architectures, depths, modalities, and data regimes, LOES consistently outperforms standard baselines, with gains that grow as model depth increases. Beyond accuracy, our method reveals how semantic factors are distributed across layers, thereby enabling cross-lingual and cross-modal interpretability analyses. Together, our results provide strong evidence that layerwise embedding geometry is not incidental but central to how deep models represent and transfer knowledge.

2605.23028 2026-05-25 cs.LG cs.CL cs.CV 版本更新

RADAR: Relative Angular Divergence Across Representations

RADAR: 表示间的相对角度散度

Xavier Cadet, Mateusz Nowak, Peter Chin

Affiliations * Dartmouth College

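AI summary This paper proposes RADAR, a geometrically grounded metric for estimating how well foundation models transfer across domains. It analyzes angular alignment and relative changes in distance along layer-to-layer displacement trajectories, and compares the distributions of within-domain and cross-domain dynamics to estimate transferability. Experiments across modalities show competitive predictive performance, with the strongest results when domain transitions are smooth or cleanly separated. Ablations suggest that effectiveness depends on the geometry of the model's internal representation space.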

Comments 27 pages; 8 figures; 10 tables

详情
AI中文摘要

机器学习方法依赖于数据。然而,由于可用性限制、成本或需要领域专业知识,收集合适的数据可能具有挑战性。用额外来源扩展数据集是对有限数据的常见回应,但这种做法并不总能提高下游性能,有时甚至会导致性能下降,即负迁移。我们提出RADAR,一种简单、基于几何的度量,用于估计基础模型中的跨域迁移性。RADAR通过测量沿层间位移轨迹的角度对齐和距离的相对变化,并比较域内和跨域动态的经验分布,来分析表示的逐层演化。我们假设域迁移性与这些轨迹分布之间的散度有关。我们在多种模态上评估该度量,包括使用文本嵌入模型的跨语言情感分类和使用基础视觉模型的跨域图像分类。在多种设置下,RADAR在几个视觉和文本基准上相对于现有迁移性度量提供了有竞争力的预测性能,特别是在域过渡平滑或清晰分离时。我们的消融实验进一步表明,迁移性估计的有效性取决于模型内部表示空间的几何结构,不同模态偏好不同的拓扑形式。

英文摘要

Machine learning methods rely on data. However, gathering suitable data can be challenging due to availability constraints, cost, or the need for domain expertise. Expanding datasets with additional sources is a common response to limited data, yet this practice does not always improve downstream performance and can sometimes lead to a loss of performance, known as negative transfer. We propose RADAR, a simple, geometrically grounded metric for estimating cross-domain transferability in foundation models. RADAR analyzes the layer-wise evolution of representations by measuring angular alignments and relative changes in distance along layer-to-layer displacement trajectories, and by comparing empirical distributions of within-domain and cross-domain dynamics. We hypothesize that domain transferability is related to the divergence between these trajectory distributions. We evaluate the metric across multiple modalities, including cross-lingual sentiment classification with text embedding models and cross-domain image classification with foundation vision models. Across several settings, RADAR provides competitive predictive performance relative to existing transferability metrics on several vision and text benchmarks, with particularly strong results when domain transitions are smooth or cleanly separated. Our ablations further suggest that the effectiveness of transferability estimation depends on the geometry of the model's internal representation space, with different modalities favoring different topological formulations.

2605.23025 2026-05-25 cs.LG 版本更新

World Machine: Towards Generative World Modeling for Time-Series

世界机器:面向时间序列的生成式世界建模

Elton Cardoso do Nascimento, Alexandre da Silva Simões, Esther Luna Colombini, Ricardo Ribeiro Gudwin, Paula Dornhofer Paro Costa

Affiliations * Universidade Estadual de Campinas (UNICAMP); Universidade Estadual Paulista (UNESP)

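AI summary This paper proposes World Machine, a generative world-modeling architecture for time series that targets predictive understanding and controllable simulation of environments. The Transformer-based architecture introduces latent states that let it adapt to different amounts of observed data and context, which the authors present as an improvement over conventional Transformers, whose compute and memory costs scale quadratically with context. Experiments on the synthetic Toy1D dataset validate feasibility, demonstrate capabilities that conventional Transformers lack, and show the contribution of each component of the training protocol.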

详情
AI中文摘要

世界模型代表了生成式AI的一种范式转变,以结构化和可泛化的方式追求对环境的预测性理解和可控模拟。我们提出了World Machine,一种用于时间序列的生成式世界建模架构。它是一种基于Transformer的架构,具有潜在状态,能够适应不同数量的观测数据和上下文。这相比传统Transformer有所改进,传统Transformer的计算和内存成本随上下文呈二次方增长。在提出的合成数据集Toy1D上的实验验证了该方法的可行性,展示了传统Transformer不具备的能力,并突出了训练协议中每个组件的贡献。

英文摘要

World models represent a paradigm shift in generative AI, pursuing predictive understanding and controllable simulation of environments in a structured and generalizable way. We present World Machine, a generative world-modeling architecture for time series. It is a transformer-based architecture with latent states that enables adaptation to different amounts of observed data and contexts. This shows an improvement over traditional transformers, which have a computational and memory cost that scales quadratically with the context. Experiments on a proposed synthetic dataset, Toy1D, validate the approach's feasibility, demonstrate capabilities not found in conventional transformers, and highlight the contributions of each component of the training protocol.

2605.23024 2026-05-25 cs.AI cs.CC cs.CL cs.LG 版本更新

The Deterministic Horizon: Impossibility Results as Design Specifications for Trustworthy AI Systems

确定性视界:作为可信AI系统设计规范的不可行性结果

Dongxin Guo

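AI summary This thesis examines the boundaries that fundamental limits of computation impose on trustworthy AI systems and proposes turning impossibility theorems into system design rules. Its central result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, the Deterministic Horizon, the ceiling is unaffected by sample size, adapter rank, or loss function, and it can be computed before deployment from layer count and embedding width. The same argument is applied across several AI subfields, yielding a catalogue of sixteen design specifications intended to guide the construction of more reliable AI systems.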

Comments PhD thesis, Department of Computer Science, The University of Hong Kong, 2026. 271 pages, 18 figures, 15 tables, 5 algorithms

详情
AI中文摘要

大型语言模型现在编写软件、起草法律文件并生成临床笔记,但从图灵、阿罗到没有免费午餐定理的基本极限,塑造了计算的能力。本文将这些不可行性结果从奇闻转化为设计规则。其旗舰结果证明了仅由架构设定的准确率上限:超过关键推理深度后,无论适配器秩、样本大小或损失函数如何,训练都无法改变它。该确定性视界在部署前可从层数和嵌入宽度计算,在十二种Transformer架构中测量值介于19到31之间,而在最优长度轨迹上微调可恢复不到4个百分点。其机制是残差流的容量不变性,信息论转换得出超过视界后准确率超指数衰减。一个针对模幂的无条件电路复杂度下界(对抗常数深度素数模电路)补充了这一结果。同样的论证重新应用于多个子领域:任何错误指定模型下的偏好学习在样本复杂度上出现不连续跳跃;多阶段检索流水线至少需要与阶段数一样多的独立指标;标准诚实拍卖对于具有提示相关估值的智能体失效;神经推理的零知识验证为每个非线性激活支付110到190倍的测量开销。这些共同构成了一个包含16条规范的目录,每条规范配对一个可计算边界、一个量化违反成本和一个建设性设计规则:两个组合已被证明,一个配对是诚实障碍,四个保持开放。本文为可信AI可能需要的生成式研究计划提供了不可行性规范方法论。AI的每一个基本极限也是一个设计规则。

英文摘要

Large language models now write software, draft legal documents, and produce clinical notes, yet fundamental limits, from Turing and Arrow to the No Free Lunch theorems, shape what computation can do. This thesis turns such impossibility results from curiosities into design rules. Its flagship result proves an accuracy ceiling set by architecture alone: past a critical reasoning depth, no amount of training moves it, at any adapter rank, sample size, or loss function. Computable before deployment from layer count and embedding width, this Deterministic Horizon is measured between nineteen and thirty-one across twelve transformer architectures, and fine-tuning on optimal-length traces recovers under four percentage points. The mechanism is a capacity invariant of the residual stream, and an information-theoretic conversion yields super-exponential accuracy decay past the horizon. An unconditional circuit-complexity lower bound for modular exponentiation against constant-depth prime-modulus circuits complements this result. The same argument recasts across subfields: preference learning under any misspecified model jumps discontinuously in sample complexity; multi-stage retrieval pipelines require at least as many independent metrics as stages; standard truthful auctions fail for agents with prompt-dependent valuations; and zero-knowledge verification of neural inference pays a measured overhead of one hundred ten to one hundred ninety times per non-linear activation. Together these form a catalogue of sixteen specifications, each pairing a computable boundary, a quantified violation cost, and a constructive design rule: two compositions are proved, one pairing is an honest obstruction, and four remain open. The impossibility-specification methodology is offered for the generative research programme that trustworthy AI may need. Every fundamental limit of AI is also a design rule.

2605.23019 2026-05-25 cs.LG 版本更新

PACE: Two-Timescale Self-Evolution for Small Language Model Agents

PACE:小型语言模型代理的双时间尺度自我进化

Chen Ling, Pei Chen, Albert Guan, Jiaming Qu, Shayan Ali Akbar, Madhu Gopinathan, Erwin Cornejo

Affiliations * Amazon

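AI summary This paper asks whether frozen small language models (SLMs) can act as effective self-evolving agents under resource constraints. The authors propose PACE, a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates, enabling self-evolution without updating model weights or relying on frontier models. PACE achieves the best performance on all 12 backbone-benchmark combinations, and a tau-bench case study shows improved multi-turn tool-use success over vanilla and prompt-only evolution.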

详情
AI中文摘要

在生产中部署语言模型代理通常需要大量的计算和人力来调整提示、解析器、验证器和代理流水线的其他组件。自我进化提供了一种有前景的替代方案,但大多数现有框架假设可以访问能够可靠诊断故障、提出修订并判断自身更新的前沿模型。我们研究冻结的小型语言模型(SLM)是否可以在资源约束下作为有效的自我进化代理。我们提出PACE(提示和控制逻辑进化),一个双时间尺度框架,协调低风险的提示优化与较高风险的控制逻辑更新。PACE在固定控制逻辑下进化提示,直到提示层面的增益饱和,然后考虑通过保留验证接受的有约束控制逻辑更新。在三个从4B到14B参数的冻结SLM骨干和四个受控基准上,PACE在所有12个骨干-基准组合上实现了最佳性能,相比原始SLM代理相对提升高达+9.2%,相比更强的单模式进化基线相对提升高达+5.4%。tau-bench案例研究进一步表明,PACE在多轮工具使用成功率上优于原始和仅提示进化。这些结果表明,无需更新模型权重或依赖前沿模型教师,可靠的SLM代理自我进化是可能的,并且关键优势不在于任何单一的最终求解模式,而在于自主、经过验证地发现适合任务的推理策略。

英文摘要

Deploying language-model agents in production often requires substantial compute and human effort to tune prompts, parsers, validators, and other components of the agent pipeline. Self-evolution offers a promising alternative, but most existing frameworks assume access to frontier models that can reliably diagnose failures, propose revisions, and judge their own updates. We study whether frozen small language models (SLMs) can serve as effective self-evolving agents under resource constraints. We propose PACE (Prompt And Control Logic Evolution), a two-timescale framework that coordinates low-risk prompt refinement with higher-risk control-logic updates. PACE evolves prompts under fixed control logic until prompt-level gains saturate, then considers constrained control-logic updates that are accepted through held-out validation. Across three frozen SLM backbones ranging from 4B to 14B parameters and four controlled benchmarks, PACE achieves the best performance on all 12 backbone--benchmark combinations, improving over vanilla SLM agents by up to +9.2% relative improvement and over the stronger single-mode evolution baseline by up to +5.4% relative improvement. A tau-bench case study further shows that PACE improves multi-turn tool-use success over vanilla and prompt-only evolution. These results suggest that reliable SLM agent self-evolution is possible without updating model weights or relying on frontier-model teachers, and that the key benefit is not any single final solver pattern but autonomous, validated discovery of task-appropriate inference strategies.
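
The two-timescale schedule described in the abstract can be sketched as a simple loop. This is a minimal illustration and not the authors' implementation: the `pace` function, its arguments, the `patience` saturation rule and the integer-valued toy "prompt" and "control" are all hypothetical stand-ins.

```python
def pace(prompt, control, propose_prompt, propose_control,
         dev_score, holdout_score, patience=2, rounds=20):
    """Alternate fast prompt refinement with slow, validated control updates."""
    best = dev_score(prompt, control)
    stale = 0  # consecutive prompt proposals that did not improve the score
    for _ in range(rounds):
        if stale < patience:
            # Fast timescale: refine the prompt under fixed control logic.
            candidate = propose_prompt(prompt)
            score = dev_score(candidate, control)
            if score > best:
                prompt, best, stale = candidate, score, 0
            else:
                stale += 1
        else:
            # Slow timescale: prompt gains have saturated, so try a control
            # update and accept it only if held-out validation improves.
            candidate = propose_control(control)
            if holdout_score(prompt, candidate) > holdout_score(prompt, control):
                control = candidate
                best = dev_score(prompt, control)
            stale = 0
    return prompt, control


def toy_score(p, c):
    """Toy objective with its optimum at prompt=3, control=2 (assumption)."""
    return -((p - 3) ** 2) - (c - 2) ** 2


result = pace(0, 0, lambda p: p + 1, lambda c: c + 1, toy_score, toy_score)
```

In this toy setting the loop first climbs to the best prompt, then accepts two control updates, and rejects a third that fails validation.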

2605.23017 2026-05-25 cs.LG cs.GT 版本更新

Smoothed Elicitation Complexity for Approximate $\Gamma$-calibration of Discrete Classification Tasks

离散分类任务的近似 $\Gamma$ 校准的平滑引发复杂度

Jessica Finocchiaro, Victor Ganson, Drona Khurana

Affiliations * Computer Science, Boston College; Computer Science, University of Colorado Boulder

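AI summary This paper studies approximate $\Gamma$-calibration for discrete classification tasks. To address the exponential complexity of calibrating multiclass classifiers, it uses Lipschitz continuous properties as an intermediary, which lowers the complexity of calibration. By constructing Lipschitz properties for strongly orderable discrete properties, the authors give what they describe as the first approximate calibration results for discrete properties, together with algorithms for designing these properties.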

Comments Working paper

详情
AI中文摘要

评估机器学习模型可信度的一种重要方法是校准的概念。在二元结果设置中,如果结果根据模型的条件分布预测实现,则概率预测器是校准的。将二元校准定义直接扩展到概率多类分类器会导致指数级的复杂度爆炸,因为预测空间随类别数 $n$ 呈指数增长。作为补救措施,Noarov 和 Roth (2023) 提出了使用结果分布属性的多类校准,将复杂度从随类别数 $n$ 增长降低到属性维度 $d$,称为其引发复杂度。先前关于近似属性校准的工作通常局限于连续标量属性,尽管许多相关属性是离散的,如众数或排名。我们通过使用Lipschitz连续属性作为中介,刻画了强可排序离散属性的近似属性校准。据我们所知,这是首次为离散属性提供近似校准结果。在此过程中,我们通过构建设计这些Lipschitz属性的算法,刻画了强可排序离散属性的Lipschitz引发复杂度,并证明这些属性可以通过后处理得到原始离散属性。

英文摘要

One prominent method of evaluating machine learning model trustworthiness is the notion of calibration. In the binary outcome setting, a probabilistic predictor is calibrated if outcomes are realized according to a model's distributional prediction, conditioned on this prediction. Straightforward extensions of binary calibration definitions to probabilistic multiclass classifiers suffer from an exponential complexity blowup as the space of predictions grows exponentially in the number of classes $n$. As a remedy, Noarov and Roth (2023) propose multiclass calibration with predictions that are properties of the outcome distribution, reducing complexity from growing in the number of classes $n$ to the dimension $d$ of the property, called its elicitation complexity. Previous work on approximate property calibration is generally limited to continuous scalar properties, despite many relevant properties of interest being discrete, like the mode or rankings. We characterize the approximate property calibration of discrete properties which are strongly orderable by using Lipschitz continuous properties as an intermediary. This work is the first to our knowledge to provide approximate calibration results for discrete properties. Along the way, we characterize the Lipschitz elicitation complexity of strongly orderable discrete properties by constructing algorithms for designing these Lipschitz properties, which we prove can be post-processed to obtain the original discrete property.
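
The binary calibration condition stated in the abstract (outcomes occur at the predicted rate, conditioned on the prediction) can be checked empirically with a short sketch. The function name `calibration_gaps` and the toy data are assumptions, and the sketch assumes the predictor outputs only a small finite set of probabilities so that exact grouping makes sense.

```python
from collections import defaultdict


def calibration_gaps(predictions, outcomes):
    """Per predicted value p, return |empirical frequency of outcome 1 - p|."""
    buckets = defaultdict(list)
    for p, y in zip(predictions, outcomes):
        buckets[p].append(y)
    return {p: abs(sum(ys) / len(ys) - p) for p, ys in buckets.items()}


# Toy data: among the four 0.25-predictions one outcome is positive,
# and among the two 0.5-predictions one outcome is positive.
predictions = [0.25, 0.25, 0.25, 0.25, 0.5, 0.5]
outcomes = [1, 0, 0, 0, 1, 0]
gaps = calibration_gaps(predictions, outcomes)
```

A predictor is perfectly calibrated on this sample when every gap is zero. With $n$ classes the same grouping would have to range over probability vectors, which is the blowup the paper's property-based approach aims to avoid.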

2605.23007 2026-05-25 q-fin.TR cs.AI cs.LG q-fin.PM 版本更新

MadEvolve: Evolutionary Optimization of Trading Systems with Large Language Models

MadEvolve: 基于大型语言模型的交易系统进化优化

Yurii Kvasiuk, Tianyi Li, Owen Colegrove, Moritz Münchmeyer

Affiliations * Department of Physics, University of Wisconsin–Madison; Event Horizon Labs

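AI summary This paper applies MadEvolve, an LLM-driven evolutionary optimization framework, to quantitative trading systems, using Bitcoin trading as the example. The method evolves feature sets for signal generation, individual strategy components, and the joint feature-and-execution pipeline, and the authors report significant improvements in their simulation and backtesting setup. The study also compares against other agentic search approaches and evaluates p-hacking probabilities in the simulation setup.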

详情
AI中文摘要

我们探索了将LLM驱动的算法优化应用于量化金融中的几个常见任务。MadEvolve是一个受DeepMind的Alpha-Evolve启发的通用算法优化框架,最近被开发用于优化计算宇宙学中的算法。在此,我们以比特币交易为例,展示了MadEvolve在优化算法交易策略和alpha生成方面的实用性。在我们的模拟和回测设置中,我们在所有考虑的任务上取得了显著改进,例如演化用于信号生成的特征集、优化交易策略的独立组件,以及联合演化特征流水线与执行策略。此外,我们将我们的方法与其他智能搜索方法(特别是Claude Code)进行了比较,并仔细评估了模拟设置中的p-hacking概率。我们的发现强烈支持AI驱动的智能和进化算法在算法交易和量化金融中的实用性。

英文摘要

We explore the application of LLM-driven algorithm optimization to several common tasks in quantitative finance. MadEvolve, a general-purpose algorithm optimization framework inspired by DeepMind's Alpha-Evolve, was recently developed to optimize algorithms in computational cosmology. Here we demonstrate the utility of MadEvolve for optimizing algorithmic trading strategies and alpha generation, using Bitcoin trading as an example. On our simulation and backtesting setup, we achieve significant improvements on all tasks we considered, such as evolving feature sets for signal generation, optimizing separate components of the trading strategy, and jointly evolving the feature pipeline together with the execution strategy. Additionally, we compare our method to other agentic search approaches, specifically Claude Code, and carefully evaluate p-hacking probabilities on our simulation setup. Our findings strongly support the utility of AI-driven agentic and evolutionary algorithms for algorithmic trading and quantitative finance.

2605.22988 2026-05-25 q-bio.NC cs.LG cs.RO cs.SY eess.SY 版本更新

Active Sensing Subserves Task-Level Control

主动感知服务于任务级控制

Andrew Lamperski, Debojyoti Biswas, Eric S. Fortune, John Guckenheimer, Kathleen Hoffman, Noah J. Cowan

Affiliations * Department of Electrical and Computer Engineering, University of Minnesota; Laboratory for Computational Sensing and Robotics, Johns Hopkins University; Federated Department of Biological Sciences, New Jersey Institute of Technology; Department of Mathematics, Cornell University; Department of Mathematics and Statistics, University of Maryland, Baltimore County; Department of Mechanical Engineering, Johns Hopkins University

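AI summary This paper argues that active sensing is not driven by sensory goals but is a necessary part of task-level control. Drawing on empirical data from organisms and on mathematical theory, it notes that active sensing often occurs in discrete epochs, with animals switching between an "explore" mode and an "exploit" mode, so that feedback control relies on adaptive sensors and mode switching. This strategy is ubiquitous in biology yet rarely used in engineered systems, which suggests that current robotic control systems have room to improve.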

详情
AI中文摘要

主动感知传统上被定义为为了获取信息而消耗能量,通常以运动的形式。在这里,我们提出,对自适应传感器的依赖、运动与感知之间的联系以及任务级控制的结合,必然导致主动感知运动的出现。这样,主动感知并非由感官目标驱动,例如最小化状态不确定性,而是任务级控制所必需的。这一假设,即主动感知服务于控制,得到了来自生物体的经验数据和数学理论的支持。有趣的是,主动感知行为通常发生在离散的时段中,与目标导向行为交替出现。这表明动物在两种具有不同控制策略的行为模式之间切换:一种“探索”模式,动物产生动态运动以塑造感觉反馈;以及一种“利用”模式,动物产生与实现任务目标直接相关的较慢补偿运动。这种依赖于自适应传感器、主动感知和模式切换的反馈控制策略在工程系统中并不常用,尽管在生物学中普遍存在。由最先进的传感器、执行器和机械设计组成的工程系统在“成本函数”方面(如最大力生成、精度和速度)可以胜过动物。然而,动物通常能够实现目前工程系统无法比拟的稳健、优雅的行为,这表明当前的控制系统存在不足。这些以控制理论语言表达的见解可能对改进机器人感知和控制至关重要。

英文摘要

Active sensing is traditionally defined as the expenditure of energy, typically in the form of movement, for obtaining information. Here, we propose that the combination of reliance on adaptive sensors, the linkage between movement and sensing, and task-level control inevitably gives rise to the emergence of active sensing movements. In this way, active sensing is not driven by sensory goals, such as minimizing uncertainty about the state, but rather is necessary for task-level control. This hypothesis, that active sensing subserves control, is supported by both empirical data from organisms and mathematical theory. Interestingly, active sensing behaviors often occur in discrete epochs, interspersed with goal-oriented behavior. This suggests that animals switch between two behavioral modes with distinct control policies, an "explore" mode in which animals produce dynamic movements to shape sensory feedback, and an "exploit" mode in which animals produce slower compensatory movements that are directly related to achieving task goals. This strategy for feedback control that relies on adaptive sensors, active sensing, and mode switching is not commonly used in engineered systems despite being ubiquitous in biology. Engineered systems comprising state-of-the-art sensors, actuators, and mechanical designs can outperform animals with respect to "cost functions" such as maximum force generation, precision, and speed. Nevertheless, animals routinely achieve robust, graceful behaviors that are currently unmatched by engineered systems, suggesting that current control systems are insufficient. These insights, expressed in the language of control theory, may be critical for improving robotic sensing and control.

2605.22986 2026-05-25 cs.RO cs.AI cs.HC cs.LG 版本更新

Robots That Know What to Ask: Recovering Misaligned Rewards through Targeted Explanations

知道该问什么的机器人:通过有针对性的解释恢复未对齐的奖励

Helena Merker, Nick Walker, Andreea Bobu

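AI summary This work addresses underspecified features when learning reward functions from human demonstrations and proposes a framework that detects and corrects the resulting reward misalignment through targeted explanations. The method analyzes how consistently each feature is optimized across demonstrations to identify underspecified features, explains the uncertainty in natural language, and actively requests targeted corrective demonstrations. In a simulated tabletop domain and a user study with a real Franka robot, the approach significantly improves reward recovery compared with random querying and passive data collection.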

详情
AI中文摘要

从演示中学习奖励函数假设演示对所有特征(或行为中与任务相关的方面)提供了充分的监督。实际上,演示往往不完美:由于认知负荷或物理难度,人类可能低估某些特征,或者训练机制可能未能充分覆盖所有相关情况。无论哪种情况,重要特征可能未被充分指定,导致学习到的奖励函数存在歧义,并在部署时出现未对齐的行为。我们提出一个框架,检测此类未充分指定的特征,并主动请求有针对性的纠正演示。我们的关键洞察是,演示隐含地揭示了哪些特征被良好指定:一致优化的特征在演示之间变化很小,而未充分指定的特征则变化很大。我们利用这一统计信号推断哪些特征可能未被充分演示。然后,机器人用自然语言解释它不确定哪些特征,并请求明确解决已识别差距的演示。我们在模拟桌面操作领域和真实Franka机器人的用户研究中评估了我们的方法。与随机查询和被动数据收集相比,有针对性的、解释引导的查询显著改善了奖励恢复,减少了否则会从有缺陷的演示中持续存在的歧义。

英文摘要

Learning reward functions from demonstrations assumes that demonstrations provide adequate supervision over all features -- or task-relevant aspects of behavior. In practice, demonstrations are often imperfect: humans may under-emphasize certain features due to cognitive load or physical difficulty, or the training regime may fail to sufficiently cover all relevant situations. In either case, important features may be underspecified, leading to ambiguity in the learned reward function and misaligned behavior at deployment. We propose a framework that detects such underspecified features and actively solicits targeted corrective demonstrations. Our key insight is that demonstrations implicitly reveal which features are well specified: features that are consistently optimized show little variation across demonstrations, while features that are underspecified vary widely. We leverage this statistical signal to infer which features may have been insufficiently demonstrated. The robot then explains which features it is uncertain about in natural language and queries for demonstrations that explicitly address the identified gaps. We evaluate our approach in a simulated tabletop manipulation domain and in a user study with a real Franka robot. Targeted, explanation-guided queries significantly improve reward recovery compared to random querying and passive data collection, reducing ambiguity that would otherwise persist in learning from imperfect demonstrations.
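
The statistical signal described in the abstract (consistently optimized features vary little across demonstrations, underspecified ones vary widely) can be sketched as a per-feature variance check. This is an illustration under stated assumptions, not the paper's method: the feature names, the fixed `threshold`, the assumption that features are scaled to comparable ranges, and the query wording are all hypothetical.

```python
from statistics import pvariance


def underspecified(demos, threshold=0.01):
    """Flag features whose value varies widely across demonstrations."""
    spread = {name: pvariance([d[name] for d in demos]) for name in demos[0]}
    flagged = sorted(name for name, var in spread.items() if var > threshold)
    return flagged, spread


# Each dict holds one demonstration's feature values (toy numbers).
demos = [
    {"distance_to_goal": 0.10, "cup_tilt": 0.0},
    {"distance_to_goal": 0.12, "cup_tilt": 0.8},
    {"distance_to_goal": 0.09, "cup_tilt": 0.4},
]
flagged, spread = underspecified(demos)

# A templated stand-in for the natural-language explanation and query.
queries = [
    f"I am unsure how much '{name}' matters. Could you show a demonstration "
    f"that makes your preference about it clear?"
    for name in flagged
]
```

Here `distance_to_goal` is nearly constant across demonstrations and is treated as well specified, while `cup_tilt` varies widely and triggers a targeted query.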

2605.22984 2026-05-25 cs.LG cs.AI 版本更新

Test-Time Training Undermines Safety Guardrails

测试时训练削弱安全护栏

Simone Antonelli, Sadegh Akhondzadeh, Aleksandar Bojchevski

Affiliations * CISPA Helmholtz Center for Information Security; University of Cologne

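AI summary This paper studies the safety risks that accompany the performance gains of Test-Time Training (TTT). Because TTT lets a model adapt its parameters during inference, it improves tasks such as few-shot learning and retrieval-augmented generation, but it also opens new vulnerabilities that make safety guardrails easier to bypass. Experiments show that TTT significantly raises attack success rates across models of different families and scales, and that the vulnerabilities transfer to production fine-tuning APIs. As a first step toward defense, the authors propose a lightweight detector that flags TTT requests via perplexity shift.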

Comments 30 pages, 4 figures. Project page: https://uoc-tail.github.io/ttt-jailbreak/

详情
AI中文摘要

测试时训练(TTT)是一种新兴范式,使模型在推理过程中调整参数,从而提升少样本学习、检索增强生成和复杂推理等任务的性能。然而,这种动态适应引入了攻击者可利用的新漏洞来越狱模型。我们识别了TTT的三种威胁模型,并演示了攻击者如何利用它们绕过安全过滤器。我们的结果表明,TTT可以显著提高攻击成功率(ASR)以及10次生成试验下的ASR(ASR@10)。例如,在LoRA下,少样本和生成阶段威胁模型在不同家族和规模的模型上平均ASR@10分别达到95%和93%。这些漏洞可迁移到生产级微调API。我们还展示了TTT引发的过拟合可能产生退化输出,在标准评判下夸大ASR,并提出了一个有效性感知评估来纠正这一点。我们的发现表明,TTT暴露了新的攻击面,增强了攻击,并削弱了现有的安全护栏。作为防御的第一步,我们提出了一个轻量级的提供商侧检测器,通过私有有害保留集上的困惑度偏移来标记TTT请求,但稳健部署最终需要动态对齐。

英文摘要

Test-Time Training (TTT) is an emerging paradigm that enables models to adapt their parameters during inference, improving performance on tasks such as few-shot learning, retrieval-augmented generation, and complex reasoning. However, this dynamic adaptation introduces new vulnerabilities that adversaries can exploit to jailbreak models. We identify three threat models for TTT and demonstrate how attackers can leverage them to bypass safety filters. Our results show that TTT can significantly increase the Attack Success Rate (ASR) and the ASR over 10 generation trials (ASR@10). For example, under LoRA, the few-shot and generation-phase threat models achieve an average ASR@10 of 95% and 93% respectively, across models from different families and scales. These vulnerabilities transfer to production fine-tuning APIs. We also show that TTT-induced overfitting can produce degenerate outputs that inflate ASR under standard judges, and propose a validity-aware evaluation to correct for this. Our findings suggest that TTT exposes a new attack surface, strengthens attacks, and undermines existing safety guardrails. As a first step toward defense, we propose a lightweight provider-side detector that flags TTT requests via the perplexity shift on a private harmful holdout, but robust deployment will ultimately require dynamic alignment.
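
The provider-side detector mentioned at the end of the abstract can be sketched as a perplexity comparison before and after adaptation. This is a rough illustration only: the direction of the test (flagging a drop in perplexity on the harmful holdout), the `max_log_shift` threshold, and the function names are assumptions, and the per-token log-probabilities are toy values rather than model outputs.

```python
import math


def perplexity(token_logprobs):
    """Perplexity from natural-log token probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))


def flag_ttt_request(logprobs_before, logprobs_after, max_log_shift=0.5):
    """Flag a request if adaptation makes the harmful holdout much more likely.

    The shift is the drop in log-perplexity on the private holdout between
    the base model and the adapted model.
    """
    shift = math.log(perplexity(logprobs_before)) - math.log(perplexity(logprobs_after))
    return shift > max_log_shift, shift


base = [-3.0, -3.0, -3.0, -3.0]        # holdout log-probs under the base model
suspicious = [-1.0, -1.0, -1.0, -1.0]  # after a suspicious adaptation
benign = [-2.9, -2.9, -2.9, -2.9]      # after a benign adaptation

flag_bad, shift_bad = flag_ttt_request(base, suspicious)
flag_ok, shift_ok = flag_ttt_request(base, benign)
```

In practice the threshold would have to be tuned against benign fine-tuning traffic, a step this sketch leaves out.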

2605.22981 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Memorization Dynamics of Fill-in-the-Middle Pretraining

Fill-in-the-Middle 预训练的记忆动态

Tobias von Arx, Tanguy Dieudonné

Affiliations * Department of Computer Science, ETH Zurich

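AI summary This paper studies how the fill-in-the-middle (FIM) pretraining objective affects verbatim memorization in language models. Training matched Llama 3.2 models on a corpus with repeated excerpts, it finds that FIM more often recovers short or partially matching spans, whereas the standard left-to-right (LTR) objective more often assigns high confidence to long exact continuations. Verbatim extraction under FIM training grows approximately linearly with the number of repetitions over the tested range, and suffix context alone is not sufficient for recall, which remains strongly anchored in prefix context. The study notes that evaluating a single span length or probing format can miss important nuances in memorization behavior.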

Comments MemFM @ ICML 2026

详情
AI中文摘要

Fill-in-the-Middle (FIM) 是一种广泛用于赋予因果语言模型填充能力的预训练目标,但其对逐字记忆的影响尚未充分探索。我们在受控设置中研究 FIM 的记忆动态,通过在包含重复 Gutenberg 摘录的 FineWeb-Gutenberg 语料库上,使用 FIM 和标准从左到右 (LTR) 目标预训练匹配的 Llama 3.2 模型。基于前缀的探测表明,FIM 更常恢复短片段或部分匹配的跨度,而 LTR 更常对长精确延续赋予高置信度。我们观察到,在测试范围内,FIM 训练下的逐字提取随重复次数近似线性增长。评估原生 FIM 格式的探测显示,后缀上下文并不足够:FIM 训练下的逐字回忆仍然强烈锚定于前缀上下文。我们的结果还表明,仅评估一种跨度长度或探测格式可能会遗漏记忆行为中的重要细微差别。

英文摘要

Fill-in-the-middle (FIM) is a pretraining objective widely used to equip causal language models with infilling ability, yet its effect on verbatim memorization remains underexplored. We study the memorization dynamics of FIM in a controlled setting by pretraining matched Llama 3.2 models with FIM and standard left-to-right (LTR) objectives on a FineWeb-Gutenberg corpus containing repeated Gutenberg excerpts. With prefix-based probes, FIM more often recovers short or partially matching spans, while LTR more often assigns high confidence to long exact continuations. We observe that verbatim extraction under FIM-training grows approximately linearly with repetitions over the tested range. Evaluating native FIM-format probes reveals that suffix context is not sufficient: verbatim recall under FIM-training remains strongly anchored in prefix context. Our results also show that evaluating only one span length or probing format can miss important nuances in memorization behavior.

2605.22973 2026-05-25 cs.LG cs.AI 版本更新

Worse than Random: The Importance of a Baseline for Unsupervised Feature Selection

比随机更差:无监督特征选择中基线的重要性

Muhammad Rajabinasab, Michael E. Houle, Oussama Chelly, Arthur Zimek

Affiliations * University of Southern Denmark; New Jersey Institute of Technology; Oratio Technologies

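AI summary This paper examines how unsupervised feature selection methods are evaluated, noting that without an established baseline it is hard to judge what each new method actually contributes. The authors propose random feature selection as the baseline and show empirically that many state-of-the-art methods are outperformed by it in both performance and efficiency. They therefore argue that new unsupervised feature selection methods must be compared against random selection to demonstrate a consistent improvement.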

Comments Preprint submitted to Elsevier Pattern Recognition Letters

详情
AI中文摘要

每年都有许多新的无监督特征选择方法被提出,但它们的实证评估仅限于在选定数据集上计算的监督和无监督评估指标,以及与现有方法的比较。然而,在缺乏既定评估基线的情况下,很难确定每种方法对现有文献的附加值,以及它们底层方法的有效性。我们提出使用随机特征选择作为评估无监督特征选择方法的基线。我们通过实证表明,许多最先进的无监督特征选择方法在性能和效率上均不如随机特征选择。因此,我们强调在开发新的无监督特征选择方法时,必须严格考虑将随机特征选择作为基线,以确保相对于随机特征选择的一致改进。

英文摘要

Many novel unsupervised feature selection methods are proposed each year, yet their empirical evaluation is limited to supervised and unsupervised evaluation metrics computed on selected datasets, along with comparisons to existing methods. However, in the absence of an established evaluation baseline, it is difficult to determine the value added to the existing literature by each of these methods, and how effective their underlying approaches are. We propose using random feature selection as a baseline for evaluating unsupervised feature selection methods. We empirically show that many of the state-of-the-art methods in unsupervised feature selection are outperformed by random feature selection in both performance and efficiency. Accordingly, we emphasize the strict requirement of considering random feature selection as a baseline in the development process of novel unsupervised feature selection methods to ensure a consistent improvement over random feature selection.
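
A random-selection baseline of the kind the abstract argues for can be sketched in a few lines. The sketch makes several assumptions that are not from the paper: the function name `random_baseline`, averaging over 200 random subsets, and a toy scoring function that stands in for a real downstream metric such as clustering quality or classifier accuracy.

```python
import random


def random_baseline(n_features, k, score, n_repeats=200, seed=0):
    """Mean score of k features drawn uniformly at random, over many draws."""
    rng = random.Random(seed)
    scores = [score(rng.sample(range(n_features), k)) for _ in range(n_repeats)]
    return sum(scores) / len(scores)


# Toy setup: features 0, 1 and 2 are the informative ones (assumption).
informative = {0, 1, 2}


def toy_score(subset):
    """Fraction of selected features that are informative."""
    return len(informative & set(subset)) / len(subset)


baseline = random_baseline(n_features=10, k=3, score=toy_score)
method_score = toy_score([0, 1, 7])  # subset chosen by some hypothetical method
worse_than_random = method_score < baseline
```

A proposed method would be reported alongside `baseline`, and a result with `worse_than_random` set to true would be the failure case the paper describes.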

2605.22972 2026-05-25 cs.LG cs.AI 版本更新

A mathematical theory of balancing relational generalization and memorization

关系泛化与记忆平衡的数学理论

Luke Cheng, Samuel Lippl

Affiliations * Center for Theoretical Neuroscience

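AI summary This paper asks how learning systems balance relational generalization against memorizing exceptions and introduces a new task, transitive inference with exceptions, that tests both. Through an analytical characterization of kernel ridge regression, it finds that such models can balance generalization and memorization, but that success is sensitive to the specific representational geometry. The theory explains why the task is mechanistically harder, and its predictions are validated in pretrained language models fine-tuned on ordered relations.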

详情
AI中文摘要

人类、动物和现代机器学习模型展现出学习复杂行为并将其泛化到未见情境的惊人能力。这种能力要求我们学习规则和规律以实现泛化。同时,在大多数复杂环境中,任何规则都有例外。学习系统如何在学习一般规律和记忆例外之间取得平衡?我们认为,缺乏任务范式阻碍了对这一基本能力的研究。为填补这一空白,我们引入了一个新任务——带例外的传递推理,该任务测试关系泛化以及对关系规则例外的记忆。然后,我们解析地表征了一个简单、理论上可处理的神经网络学习模型(核岭回归)在广泛表示族和任务参数下的行为。我们发现,这些模型能够在关系泛化和记忆之间取得平衡,但与无例外的传递推理不同,成功的泛化对特定的表示几何敏感。我们通过分析理论解释了为什么该任务在机制上更具挑战性。最后,我们在对有序关系进行微调的预训练语言模型中验证了我们的理论见解,发现这些模型成功根据传递规则进行泛化,但也做出了我们理论预测的那种系统性错误。总体而言,我们的理论展示了学习系统如何在关系泛化和记忆之间取得平衡,解释了可能出错的方式,并强调了设计新任务范式以探测这种能力的必要性。

英文摘要

Humans, animals, and modern machine learning models exhibit impressive abilities to learn complex behaviors and generalize these behaviors to unseen situations. This ability requires us to learn rules and regularities that allow for such generalizations. At the same time, in most complex environments, any rule will have its exceptions. How do learning systems balance between learning general regularities and memorizing exceptions? We argue that a lack of task paradigms has hindered the study of this essential ability. To address this gap, we introduce a novel task, transitive inference with exceptions, that tests for relational generalization and memorization of an exception to the relational rule. We then analytically characterize the behavior of a simple, theoretically tractable model of neural network learning (kernel ridge regression) across a broad family of representations and task parameters. We find that these models can balance between relational generalization and memorization, but unlike for transitive inference without an exception, successful generalization is sensitive to the specific representational geometry. We explain why this task is more challenging mechanistically by drawing on our analytical theory. Finally, we validate our theoretical insights in pretrained language models that are finetuned on ordered relations, finding that these models successfully generalize according to the transitive rule, but also make the kinds of systematic mistakes predicted by our theory. Overall, our theory shows how learning systems can balance between relational generalization and memorization, explains how this can go wrong, and emphasizes the need for new task paradigms designed to probe this ability.

2605.22968 2026-05-25 q-bio.QM cs.LG stat.ML 版本更新

Uncertainty-aware classification and triage of structural heart disease using electrocardiography and echocardiography metrics

基于心电图和超声心动图指标的结构性心脏病不确定性感知分类与分诊

Mitchel J. Colebank

Affiliations * Department of Mathematics, University of South Carolina

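AI summary This study develops uncertainty-aware classification and triage of structural heart disease (SHD) from electrocardiography and echocardiography metrics. It compares frequentist and Bayesian neural network classifiers and finds the Bayesian approach comparable to or better than frequentist methods at classification, with more robust uncertainty quantification. It also shows how uncertainty-aware classification can be used for SHD screening, as a proof of concept for machine-learning-assisted triage.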

Comments 15 pages, 5 figures

详情
AI中文摘要

机器学习方法提供了一种方法创新,可以通过无创且易于获得的测量方式帮助筛查心血管疾病。最近在利用心电图数据筛查结构性心脏病方面的投资就是一个例子,其中心电图提供了一种低成本、可用的筛查方式。这导致了EchoNext数据集的产生,这是一个配对的心电图-超声心动图数据存储库,用于测试新的结构性心脏病检测方法。然而,相对较少的研究探讨了通过贝叶斯推理进行更概率性的分类如何改善这种情况下的不确定性量化。此外,很少有研究考虑如何开发分诊系统以缓解医疗瓶颈,例如由专家超声技师审查来自服务不足的农村诊所的数据以进行结构性心脏病评估。在本研究中,我们利用现有的心电图-超声心动图数据来比较频率派和贝叶斯神经网络分类器。我们表明,贝叶斯方法在结构性心脏病分类中与频率派方法相当或更好,并且它们具有更稳健的不确定性量化。我们提供了一个示例,说明如何将此不确定性感知分类方案用于结构性心脏病筛查,为机器学习如何帮助分诊提供了概念验证,即在结构性心脏病高度可能或测量高度不确定时,让个体获得专家超声技师的输入。

英文摘要

Machine learning methods provide a methodological innovation that can help screen for cardiovascular disease through noninvasive and readily available measurement modalities. Recent investments in using electrocardiogram (ECG) data to screen for structural heart disease (SHD) are one example, where ECGs provide a low-cost, available modality for screening. This has led to the EchoNext dataset, a paired ECG-echocardiogram data repository for testing new methods of SHD detection. However, relatively few studies have investigated how more probabilistic classification through Bayesian inference may improve uncertainty quantification in this setting. Moreover, few studies have considered how triage systems can be developed to alleviate healthcare bottlenecks, such as the review of data from underserved, rural clinics by expert sonographers for SHD assessment. In this study, we leverage existing ECG-echocardiogram data to compare frequentist and Bayesian neural network classifiers. We show that the Bayesian approach is comparable to or better than frequentist methods in SHD classification, and that it provides more robust uncertainty quantification. We provide an example of how this uncertainty-aware classification scheme can be used for screening SHD, providing a proof-of-concept for how machine learning can help with triage in getting individuals expert sonographer input when SHD is highly likely or measurements are highly uncertain.
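
The triage rule described in the last sentence of the abstract (refer to an expert when SHD is highly likely or the prediction is highly uncertain) can be sketched as follows. The thresholds `p_refer` and `max_std`, the use of posterior samples of the predicted probability, and the returned labels are assumptions for illustration, not values or outputs from the paper.

```python
from statistics import mean, pstdev


def triage(prob_samples, p_refer=0.8, max_std=0.15):
    """Route one patient based on sampled SHD probabilities.

    prob_samples: predicted SHD probabilities from repeated draws of a
    Bayesian classifier (for example, posterior weight samples).
    """
    p = mean(prob_samples)         # point estimate of SHD probability
    spread = pstdev(prob_samples)  # simple proxy for predictive uncertainty
    if p >= p_refer:
        return "refer: SHD likely"
    if spread >= max_std:
        return "refer: uncertain"
    return "routine"


likely = triage([0.90, 0.92, 0.88])
uncertain = triage([0.20, 0.80, 0.50])
routine = triage([0.10, 0.12, 0.11])
```

A frequentist classifier yields only a single probability per patient, so only the first branch would be available. The second branch is where the Bayesian uncertainty estimate contributes.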

2605.22964 2026-05-25 cs.LG 版本更新

Certification from Examples is Hard for Circuits and Transformers under Minimal Overparametrization

在最小过参数化下,从示例中认证对于电路和Transformer是困难的

Artur Back de Luca, Kimon Fountoulakis

发表机构 * University of Waterloo(滑铁卢大学)

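AI summary This paper studies how hard exact certification of circuits and Transformers becomes under minimal overparametrization. The authors prove that adding even a small amount of extra capacity can make the number of examples needed for certification grow exponentially, so exact certification is exponentially hard across several hypothesis classes. The experiments examine certification of constructed circuits and trained Transformers on a binary addition task and show that imperfect models can evade detection by large, uniformly sampled certificate candidates.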

Comments 38 pages, 5 figures

详情
AI中文摘要

随着最先进的神经网络被部署在推理和算法任务上,精确性保证变得越来越重要。然而,高平均准确率仍可能掩盖不一致的行为。这激发了精确认证的需求,即寻找最小的标记示例集,以证明学习到的假设与目标一致。我们表明,虽然某些假设易于认证,但即使是最小的过参数化,也可能使多个假设类别的认证变得指数级困难。对于深度≥2的阈值电路,添加一个额外的门就可能导致认证集大小在输入维度上呈指数增长。我们展示了对于仅具有恒定架构开销的对数精度Transformer,存在类似的困难结果。我们还刻画了近似认证,表明允许多项式数量的错误仍然需要指数级大小的证书,而常数相对误差保证可能隐藏指数级数量的错误。实验上,我们研究了用于识别二进制加法的构造电路和训练后的Transformer的认证。虽然构造电路实例化了认证的指数障碍,但训练后的Transformer分析表明,不完美的模型可以通过大的均匀采样候选证书来逃避检测。

英文摘要

As state-of-the-art neural networks are deployed on reasoning and algorithmic tasks, exactness guarantees become increasingly important. However, high average-case accuracy can still mask inconsistent behaviors. This motivates exact certification, which asks for the smallest set of labeled examples needed to certify that a learned hypothesis equals the target. We show that while some hypotheses are easy to certify, even minimal overparametrization can make certification exponentially hard across several hypothesis classes. For threshold circuits of depth $\ge 2$, adding a single extra gate can force certificate sizes exponential in the input dimension. We show an analogous hardness result for log-precision Transformers with only constant architectural overhead. We also characterize approximate certification, showing that allowing only polynomially many mistakes still requires exponentially large certificates, whereas constant relative-error guarantees can hide exponentially many mistakes. Empirically, we study certification for constructed circuits and trained Transformers for recognizing binary addition. While the constructed circuits instantiate the exponential barrier for certification, the trained Transformer analysis shows that imperfect models can evade detection by large uniformly sampled certificate candidates.
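To make the notion of a certificate concrete, here is a toy brute-force sketch over a finite hypothesis class: a set of inputs certifies the target if every rival hypothesis disagrees with the target somewhere on that set. The helper names and the two example classes are assumptions for illustration; this is not the paper's construction for threshold circuits or Transformers.

```python
from itertools import combinations, product


def min_certificate(target, hypotheses, inputs):
    """Smallest set of inputs on which every rival disagrees with the target."""
    rivals = [h for h in hypotheses if any(h(x) != target(x) for x in inputs)]
    for k in range(len(inputs) + 1):
        for subset in combinations(inputs, k):
            if all(any(h(x) != target(x) for x in subset) for h in rivals):
                return subset
    return None  # unreachable: the full input set always certifies


def flip_at(target, x0):
    """Hypothesis equal to the target everywhere except at the input x0."""
    return lambda x: target(x) ^ int(x == x0)


inputs = list(product((0, 1), repeat=3))
parity = lambda x: sum(x) % 2

# Easy class: two constant rivals can be ruled out with two examples.
easy = min_certificate(parity, [lambda x: 0, lambda x: 1], inputs)

# Hard class: one rival per input, each wrong at a single point,
# so the certificate must contain every one of the 2**3 inputs.
hard = min_certificate(parity, [flip_at(parity, x0) for x0 in inputs], inputs)
```

The hard class shows why a random sample is a poor certificate: a rival that errs on a single input survives any sample that misses that input.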

2605.22635 2026-05-25 cs.LG cs.CL cs.CV 版本更新

The Double Dilemma in Multi-Task Radiology Report Generation: A Gradient Dynamics Analysis and Solution

多任务放射学报告生成中的双重困境:梯度动力学分析与解决方案

Erjian Zhang, Yatong Hao, Liejun Wang, Zhiqing Guo

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) Xinjiang University(新疆大学) Information Security Engineering Technology Research Center(信息安全工程技术研究中心)

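AI summary In multi-task radiology report generation, existing linear scalarization strategies struggle to balance the hard constraints of clinical supervision against the smoothness that report generation requires. This paper analyzes the problem from a gradient-dynamics perspective, showing that it is at root a "Double Dilemma" of drift-term deviation and diffusion-term decay, and proposes CAME-Grad, a model-agnostic optimizer. Through conflict-averse direction rectification and magnitude-enhanced energy injection, the optimizer maintains geometric validity and avoids local optima. Experiments show that it significantly improves clinical efficacy across the settings tested.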

Comments Accepted by ICML 2026

详情
AI中文摘要

尽管基于多任务学习的自动放射学报告生成(RRG)被广泛采用以确保临床一致性,但大多数研究集中在架构设计上,仍局限于粗糙的线性标量化策略。这些策略无法有效平衡判别性临床监督的硬约束与报告生成的平滑性要求。为了解决这些问题,我们从梯度动力学的角度分析了线性标量化的失败机制,利用随机微分方程(SDE)框架将其表征为漂移项偏差和扩散项衰减的“双重困境”。基于此,我们提出了一种与骨干网络无关的优化器,名为冲突规避幅度增强梯度下降(CAME-Grad)。通过冲突规避的方向修正和幅度增强的能量注入,该算法不仅保证了几何有效性,还避免了局部最优解。然后,自适应梯度融合机制用于建立理论最优方向与任务特定归纳偏差之间的动态平衡。实验表明,作为一种通用的即插即用优化器,CAME-Grad在八种不同的RRG方法上带来了显著且一致的改进,在MIMIC-CXR上平均提升整体临床效能2.3%,在IU X-Ray上提升1.9%。我们的代码可在https://github.com/vpsg-research/CAME-Grad获取。

英文摘要

While multi-task learning based automatic radiology report generation (RRG) is widely adopted to ensure clinical consistency, most work focuses on architectural design yet remains limited to coarse linear scalarization strategies. These strategies cannot effectively balance the hard constraints of discriminative clinical supervision with the smoothness requirements of report generation. To address these problems, we analyze the failure mechanism of linear scalarization from the perspective of gradient dynamics, utilizing the stochastic differential equation (SDE) framework to characterize it as a "Double Dilemma" of drift term deviation and diffusion term decay. Based on this, we propose a backbone-agnostic optimizer named Conflict-Averse Magnitude-Enhanced Gradient Descent (CAME-Grad). Through conflict-averse direction rectification and magnitude-enhanced energy injection, the algorithm not only ensures geometric validity but also avoids local optima. An adaptive gradient fusion mechanism then establishes a dynamic balance between the theoretical optimal direction and the task-specific inductive bias. Experiments show that as a universal plug-and-play optimizer, CAME-Grad brings substantial and consistent improvements across eight diverse RRG methods, elevating overall clinical efficacy performance by an average of 2.3% on MIMIC-CXR and 1.9% on IU X-Ray. Our code is available at https://github.com/vpsg-research/CAME-Grad.

2605.22373 2026-05-25 cs.LG cs.CL 版本更新

Boundary-targeted Membership Inference Attacks on Safety Classifiers

针对安全分类器的边界目标成员推断攻击

Anthony Hughes, Alexander Goldberg, Prince Jha, Adam Perer, Nikolaos Aletras, Niloofar Mireshghallah

发表机构 * University of Sheffield(谢菲尔德大学) Carnegie Mellon University(卡内基梅隆大学) MBZUAI

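AI summary This study examines boundary-targeted membership inference attacks on safety classifiers, which generative AI systems use to filter harmful content or identify at-risk users. It proposes a new attack that picks out the examples on which the classifier is least confident, exposing where the model relies on memorized training data, and uses them to infer whether an example was in the training set. Experiments on a classifier that detects users who may need emotional support show that the attack recovers more of the conversations flagged as high-risk at a low false-positive rate, clearly outperforming existing membership inference attacks. The study also characterizes the boundary examples and notes that content-based filtering does not defend effectively against this attack.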

详情
AI中文摘要

安全分类器是生成式AI系统中的重要保障,用于过滤有害内容或识别与大语言模型交互时处于风险中的用户。尽管这些模型是必要的,但它们是在包含自残和心理健康讨论等敏感数据集上训练的,这引发了重要但尚未充分理解的隐私问题。成员推断攻击(MIA)允许对手推断用于训练模型的示例的成员身份。在这项工作中,我们假设识别分类器最不自信的示例对于对手推断成员身份是有信息的。这反映了局部泛化失败,其中模型依赖记忆来解决训练集中的歧义。为了研究这一点,我们引入了一种新的边界目标选择策略,该策略识别低置信度示例,从而放大训练集中示例成员身份的信号。我们的实验结果表明,在针对检测可能需要情感支持的用户的微调分类器上,对手可以以5%的假阳性率恢复安全分类器标记为指示用户困扰的对话中的19%。这比单独使用最先进的MIA方法攻击高出3.5倍。最后,我们描述了边界示例的特征,并表明基于内容的过滤对于保护无效,而现有的噪声策略可以有效减轻这些示例的敏感性。

英文摘要

Safety classifiers are essential safeguards within generative AI systems, filtering harmful content or identifying at-risk users when interacting with large language models. Despite their necessity, these models are trained on sensitive datasets including discussions of self-harm and mental health, raising important, yet poorly understood, privacy concerns. Membership inference attacks (MIAs) allow adversaries to infer membership of examples used to train models. In this work, we hypothesize that identifying the examples on which the classifier is least confident is informative for an adversary inferring membership. This reflects a localized failure of generalization, where the model relies on memorization to resolve ambiguity in the training set. To investigate this, we introduce a new boundary-targeted selection strategy that identifies low-confidence examples, which amplify the signal of an example's membership within a training set. Our experimental results show that an adversary can recover 19% of the conversations a safety classifier flagged as indicating user distress, at a 5% false-positive rate, on a classifier fine-tuned for detecting a user who may require emotional support. This is $3.5$ times more than state-of-the-art MIA methods recover on their own. Finally, we characterize the boundary-lying examples and show that content-based filtering is ineffective for protection, whereas existing noise strategies can effectively mitigate the susceptibility of these examples.
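The two ingredients mentioned in this abstract, selecting low-confidence examples and reporting recovery at a fixed false-positive rate, can be sketched as follows. Both function names, the "distance from 0.5" notion of confidence for a binary classifier, and the score-threshold attack are assumptions for illustration; the paper's membership scores and selection rule may differ.

```python
def boundary_targets(probs, k):
    """Indices of the k examples whose predicted probability is closest to 0.5."""
    return sorted(range(len(probs)), key=lambda i: abs(probs[i] - 0.5))[:k]


def tpr_at_fpr(member_scores, nonmember_scores, fpr=0.05):
    """True-positive rate of a score-threshold attack at a target FPR.

    The threshold is set on known non-members so that at most a fraction
    `fpr` of them score strictly above it. Requires fpr < 1 and a
    non-empty list of non-member scores.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    allowed = int(fpr * len(ranked))  # non-members allowed above the threshold
    threshold = ranked[allowed]
    return sum(s > threshold for s in member_scores) / len(member_scores)
```

In this reading, an attacker would first restrict attention to `boundary_targets(...)` and then evaluate `tpr_at_fpr(...)` on that subset only.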

2605.22350 2026-05-25 cs.LG stat.ML 版本更新

Partial Fusion of Neural Networks: Efficient Tradeoffs Between Ensembles and Weight Aggregation

神经网络的部分融合:集成与权重聚合之间的高效权衡

Fabian Morelli, Stephan Eckstein

发表机构 * Department of Mathematics, University of Tübingen, Germany(图宾根大学数学系,德国) Department of Computer Science, University of Tübingen, Germany(图宾根大学计算机科学系,德国)

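AI summary This paper proposes partial fusion of neural networks, which offers a flexible tradeoff between computational cost and performance that sits between ensembling and weight aggregation. The core idea is to aggregate weights only for the most similar neurons, judged by neuron-level similarity, which lowers compute cost while keeping accuracy high. The paper also gives a concrete implementation that identifies and matches similar neurons via partial optimal transport, and it frames weight aggregation and partial fusion as generalized pruning of an ensemble, in which neurons may be deleted or linearly combined, widening the room for model optimization.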

Comments Accepted to ICML 2026

详情
AI中文摘要

神经网络的集成通常优于单个网络,但计算成本高昂,而权重聚合产生的聚合模型成本较低,但精度也较低。我们引入了网络的部分融合,它在集成和权重聚合之间进行插值,从而允许在计算成本和性能之间进行灵活的权衡。实现这一目标的一种直接方法是扩展现有的基于不同网络之间神经元级相似性的权重聚合方法,其中部分融合仅聚合最相似神经元的权重。我们展示了一种特定方法,通过部分最优传输联合识别哪些神经元最相似并进行匹配。此外,我们将权重聚合和部分融合视为集成模型的广义剪枝,其中神经元不仅可以被删除,还可以线性组合。最后,我们表明,应用于单个网络的广义剪枝通过允许基于相似性隔离、删除和线性组合神经元之间的权衡,产生了与部分融合类似的优势。我们的代码可在 https://github.com/Fabian-Mor/partial_fusion_nn 获取。

英文摘要

Ensembles of neural networks typically outperform individual networks but incur large computational costs, whereas weight aggregation produces less costly, yet also less accurate, aggregate models. We introduce partial fusion of networks, which interpolates between ensembles and weight aggregation and thus allows for a flexible tradeoff between computational cost and performance. A direct way to achieve this is to extend existing weight aggregation methods based on neuron-level similarity between different networks, where partial fusion then only aggregates weights of neurons which are most similar. We showcase one particular method to jointly identify which neurons are most similar and match them via partial optimal transport. Further, we consider the more general perspective of weight aggregation and partial fusion as generalized pruning of ensemble models, where neurons can be not only deleted but also linearly combined. Finally, we show that generalized pruning applied to a single network yields benefits similar to those of partial fusion by allowing for a tradeoff between isolating, deleting, and linearly combining neurons based on similarity. Our code is available at https://github.com/Fabian-Mor/partial_fusion_nn.
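The interpolation between an ensemble and full weight aggregation can be sketched for a single layer as below. Note the substitution: the paper matches neurons via partial optimal transport, whereas this sketch uses a simple greedy matching on cosine similarity. The function names are hypothetical, neurons are assumed to be non-zero incoming-weight vectors, and the outgoing weights that a real fusion would also have to merge are ignored.

```python
from math import sqrt


def cosine(u, v):
    """Cosine similarity of two non-zero weight vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (sqrt(sum(a * a for a in u)) * sqrt(sum(b * b for b in v)))


def partial_fuse(neurons_a, neurons_b, n_fuse):
    """Average the n_fuse most similar neuron pairs; keep all others unmerged."""
    pairs = sorted(
        ((cosine(u, v), i, j)
         for i, u in enumerate(neurons_a)
         for j, v in enumerate(neurons_b)),
        reverse=True,
    )
    used_a, used_b, fused = set(), set(), []
    for _, i, j in pairs:
        if len(fused) == n_fuse:
            break
        if i in used_a or j in used_b:
            continue  # each neuron is merged at most once
        fused.append([(a + b) / 2 for a, b in zip(neurons_a[i], neurons_b[j])])
        used_a.add(i)
        used_b.add(j)
    kept = [u for i, u in enumerate(neurons_a) if i not in used_a]
    kept += [v for j, v in enumerate(neurons_b) if j not in used_b]
    return fused + kept
```

With `n_fuse=0` the layer keeps the full ensemble width; with `n_fuse` equal to the smaller layer's width it collapses to a fully aggregated layer, and values in between trade width for fidelity.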

2605.22237 2026-05-25 cs.CR cs.LG 版本更新

Decision-Aware Quadratic ReLU Replacement for HE-Friendly Inference

面向同态加密推理的决策感知二次ReLU替换

Rui Li, Wenyuan Wu, Weijie Miao

发表机构 * Chongqing Key Laboratory of Secure Computing for Biology(重庆生物安全计算重点实验室) Chongqing Institute of Green and Intelligent Technology(重庆绿色智能技术研究所) Chinese Academy of Sciences(中国科学院) Department of Industrial and Systems Engineering(工业与系统工程系)

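AI summary This study addresses the replacement of ReLU activations in neural-network inference under fully homomorphic encryption (FHE). It proposes a decision-aware quadratic polynomial replacement that aims to keep classification decisions unchanged with a low-degree polynomial and without retraining. Using a geometric framework to analyze decisions on the calibration set, it gives necessary and sufficient conditions and a constructive algorithm for lossless replacement under a positive-margin condition; when the margin condition fails, it introduces reduced convex hulls and Lagrangian-dual relaxations, which lead to smaller optimization problems. Experiments show that under CKKS the method matches plaintext accuracy and runs markedly faster than existing approaches.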

Comments 13 pages, 2 figures

详情
AI中文摘要

全同态加密(FHE)仅支持加法和乘法,因此仅使用FHE的神经网络推理通常将ReLU替换为在经验激活区间上拟合的多项式。这种区间拟合通常需要更高次多项式来控制激活误差,从而产生同态评估成本,而分类由最终logit决策决定。我们从决策感知的角度重新审视ReLU替换:给定一个训练好的单隐层ReLU MLP和一个指定的校准集,能否在不重新训练的情况下,用一个同态友好的低次多项式替换ReLU,同时保持校准集决策不变?我们专注于二次替换,即保留每个单元非线性的最低次数。对于在提升空间中正间隔可分的校准集,我们将二次替换公式化为一个线性可分问题,得到了校准无损替换的充分必要条件以及系数的构造性算法。当正间隔条件不满足时(通常是因为少数接近边界或错误分类的校准样本使提升凸包接触),我们通过缩减凸包和拉格朗日对偶软间隔松弛来扩展相同的几何框架。这些方法限制了单个样本能携带的权重,将问题转化为较小的凸二次规划,产生近似可行的系数,并在校准集决策上具有高经验一致性。特别地,在最大权重上限μ=1时,缩减凸包松弛退化为标准凸包分离;因此该松弛连续地扩展了正间隔精确理论。在CKKS下,二次替换在多个基准测试中匹配明文top-1准确率,激活模块运行速度比Remez-7快3.7-4.1倍,端到端快1.18-1.68倍。

英文摘要

Fully homomorphic encryption (FHE) supports only additions and multiplications, so FHE-only neural-network inference typically replaces ReLU with polynomials fitted over empirical activation intervals. Such interval fitting often requires higher-degree polynomials to control activation error, incurring homomorphic evaluation costs, while classification is determined by the final logit decision. We revisit ReLU replacement from a decision-aware perspective: given a trained single-hidden-layer ReLU MLP and a specified calibration set, can an HE-friendly low-degree polynomial replace ReLU without retraining while preserving calibration-set decisions? We focus on quadratic replacement, the lowest degree that retains a genuine per-unit nonlinearity. For calibration sets positive-margin separable in the lifted space, we formulate quadratic replacement as a linear separation problem, yielding necessary and sufficient conditions for calibration-lossless replacement and a constructive algorithm for the coefficients. When the positive-margin condition fails -- often because a few near-boundary or misclassified calibration samples bring the lifted hulls into contact -- we extend the same geometric framework via reduced convex hulls and Lagrangian-dual soft-margin relaxations. These cap the weight any single sample can carry, converting the problem into smaller convex quadratic programs that yield approximately feasible coefficients with high empirical agreement on calibration-set decisions. In particular, at the maximal weight cap $\mu=1$, the reduced-convex-hull relaxation reduces to standard convex-hull separation; the relaxation thus continuously extends the positive-margin exact theory. Under CKKS, the quadratic replacement matches plaintext top-1 accuracy on multiple benchmarks, running 3.7--4.1$\times$ faster than Remez-7 in the activation module and 1.18--1.68$\times$ faster end-to-end.
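The decision-aware criterion can be illustrated by measuring how often a quadratic surrogate leaves the predicted class unchanged on a calibration set. The sketch below is a plaintext toy under stated assumptions: the default coefficients are the continuous least-squares fit of a quadratic to ReLU on [-1, 1] (an interval-fitting baseline, not the coefficients the paper derives by linear separation), and `predict` and `agreement` are hypothetical helper names.

```python
def relu(x):
    return max(0.0, x)


def quad(x, a=15 / 32, b=0.5, c=3 / 32):
    """Quadratic surrogate a*x^2 + b*x + c.

    Defaults: least-squares fit to ReLU over [-1, 1]. Only additions and
    multiplications are used, which is what FHE schemes can evaluate.
    """
    return a * x * x + b * x + c


def predict(x, W1, b1, W2, b2, act):
    """Arg-max class of a single-hidden-layer MLP with activation `act`."""
    hidden = [act(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    logits = [sum(w * h for w, h in zip(row, hidden)) + b
              for row, b in zip(W2, b2)]
    return max(range(len(logits)), key=logits.__getitem__)


def agreement(calib, W1, b1, W2, b2):
    """Fraction of calibration inputs whose decision survives the swap."""
    same = sum(
        predict(x, W1, b1, W2, b2, relu) == predict(x, W1, b1, W2, b2, quad)
        for x in calib
    )
    return same / len(calib)
```

An agreement of 1.0 corresponds to what the abstract calls calibration-lossless replacement: the activation values differ, but every calibration decision is preserved.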

2605.21851 2026-05-25 cs.LG cs.AI 版本更新

OPPO: Bayesian Value Recursion for Token-Level Credit Assignment in LLM Reasoning

OPPO: 用于LLM推理中令牌级信用分配的贝叶斯价值递归

Yu Li, Rui Miao, Tian Lan, Zhengling Qi

发表机构 * George Washington University(乔治华盛顿大学) The University of Texas at Dallas(德克萨斯大学达拉斯分校)

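AI summary This paper proposes OPPO, a new algorithm that improves credit assignment for large language models (LLMs) on reasoning tasks. OPPO rests on one key observation: the oracle signal that earlier methods use for local discrimination is in essence a Bayesian update of the model's belief about eventual success. By accumulating this signal along the trajectory, OPPO directly computes a success-probability estimate at every position and a token-level advantage, without a value network or additional rollouts, and so identifies the pivotal steps in a reasoning trace more precisely. Experiments show that OPPO clearly outperforms existing methods on several mathematics, science, and code reasoning benchmarks.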

详情
AI中文摘要

具有可验证奖励的强化学习已成为提升LLM推理的标准方法,但主流算法GRPO为每个令牌分配单一轨迹级优势,稀释了关键推理步骤的信号,并在无信息步骤中注入噪声。源自在线策略蒸馏的无评论家替代方案通过预言机条件似然比提供每令牌信号,但每个信号孤立于该位置之前累积的轨迹级证据。我们提出Oracle-Prompted Policy Optimization (OPPO),它基于一个简单观察:先前蒸馏式方法用于局部区分的预言机信号,也是模型对最终成功信念的自然贝叶斯更新。沿轨迹累积信号,以一次额外前向传播的代价,以闭式形式给出每个位置成功概率的运行估计,以及无需学习价值网络和额外采样的令牌级优势。一阶分析将优势分解为蒸馏方法使用的每令牌区分信号,乘以一个状态权重,该权重将信用集中在真正关键的令牌上,并具有方向性方差减少保证。该框架包含两种估计器,区别仅在于谁对证据评分:\textit{自预言机}重用学生模型,将在线策略蒸馏奖励作为严格特例恢复;\textit{教师预言机}将评分委托给更强的冻结模型。在两个基础LLM上,跨越七个数学、科学和代码推理基准,OPPO在AMC'23上比GRPO、DAPO和SDPO提升高达+6.0分,在AIME'24上提升+5.2分,且增益随响应长度单调增加。

英文摘要

Reinforcement learning with verifiable rewards has become the standard recipe for improving LLM reasoning, but the dominant algorithm GRPO assigns a single trajectory-level advantage to every token, diluting the signal at pivotal reasoning steps and injecting noise at uninformative ones. Critic-free alternatives derived from on-policy distillation supply per-token signals through oracle-conditioned likelihood ratios, yet apply each signal in isolation from the trajectory-level evidence accumulated up to that position. We propose Oracle-Prompted Policy Optimization (OPPO), which rests on a single observation: the oracle signal used by prior distillation-style methods for local discrimination is also the natural Bayesian update of the model's belief about eventual success. Accumulating the signal along a trajectory yields, in closed form and at the cost of one extra forward pass, a running estimate of the success probability at every position, together with a token-level advantage that requires no learned value network and no additional rollouts. A first-order analysis factorizes the advantage into the per-token discrimination signal used by distillation methods modulated by a state weight that concentrates credit on genuinely pivotal tokens, with a directional variance-reduction guarantee. The framework admits two estimators differing only in which model scores the evidence: a \textit{self-oracle} that reuses the student and recovers the on-policy distillation reward as a strict special case, and a \textit{teacher-oracle} that delegates scoring to a stronger frozen model. On two base LLMs across seven mathematics, science, and code reasoning benchmarks, OPPO improves over GRPO, DAPO, and SDPO by up to $+6.0$ points on AMC'23 and $+5.2$ points on AIME'24, with gains that widen monotonically with response length.
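One way to read "accumulating the signal along a trajectory" is as a Bayesian update in log-odds space: each token contributes a log-likelihood ratio, and the running sum gives the current success probability. The sketch below follows that reading. The function names, the prior, and in particular the definition of the token advantage as the change in success probability are assumptions for illustration, not OPPO's exact estimator.

```python
from math import exp, log


def running_success(prior, log_ratios):
    """Success probability after each token.

    prior: success probability before any token is seen (0 < prior < 1).
    log_ratios: per-token log-likelihood ratios, assumed here to compare
    the token's likelihood given success against its likelihood given failure.
    """
    logit = log(prior / (1.0 - prior))
    probs = []
    for ratio in log_ratios:
        logit += ratio  # Bayes' rule is additive in log-odds
        probs.append(1.0 / (1.0 + exp(-logit)))
    return probs


def token_advantages(prior, log_ratios):
    """Hypothetical token-level advantage: change in success probability."""
    probs = [prior] + running_success(prior, log_ratios)
    return [after - before for before, after in zip(probs, probs[1:])]
```

Because the advantages telescope, they sum to the final success estimate minus the prior, so credit concentrates on the tokens that move the belief most.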

2605.21489 2026-05-25 cs.LG cs.AI cs.CV stat.CO stat.ML 版本更新

Variance Reduction for Expectations with Diffusion Teachers

具有扩散教师的期望方差缩减

Jesse Bettencourt, Xindi Wu, Matan Atzmon, James Lucas, Jonathan Lorraine

发表机构 * NVIDIA University of Toronto(多伦多大学) Princeton University(普林斯顿大学)

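AI summary This paper studies how to reduce the variance of gradient estimates when a pretrained diffusion model serves as a "teacher" for downstream tasks such as text-to-3D generation and single-step distillation. It proposes CARV, a compute-aware variance-accounting framework whose hierarchical Monte Carlo estimator pairs expensive upstream computation with cheap diffusion-noise resampling, combined with timestep importance sampling and a stratified inverse-CDF construction, which reduces compute cost. Experiments show that CARV markedly improves compute efficiency without changing the objective. In some tasks, however, lower gradient variance does not improve generation quality, indicating that variance is no longer the bottleneck there.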

Comments Project page: https://research.nvidia.com/labs/sil/projects/CARV/

详情
AI中文摘要

预训练的扩散模型作为冻结教师,为文本到3D、单步蒸馏和数据归因等下游流程提供支持。这些流程消耗的教师梯度是关于噪声水平和高斯噪声样本的蒙特卡洛期望;其估计器方差主导了计算成本,因为每次抽取都需要昂贵的上游工作(渲染、模拟、编码)。我们引入了CARV,一个计算感知的方差核算框架,它激发了一种分层蒙特卡洛估计器:通过廉价的扩散噪声重采样来摊销昂贵的上游计算,并通过时间步重要性采样和分层逆CDF构造加以强化。在我们的文本到3D蒸馏和归因实验中,CARV在不改变目标的情况下提供了2-3倍的有效计算乘数(主要来自摊销重用;约25%来自IS+分层);在单步蒸馏中,相同的技术将梯度方差降低了一个数量级,但并未改善下游FID,标志着MC方差不再是瓶颈的区间。

英文摘要

Pretrained diffusion models serve as frozen teachers feeding downstream pipelines such as text-to-3D, single-step distillation, and data attribution. The teacher gradients these pipelines consume are Monte Carlo (MC) expectations over noise levels and Gaussian noise samples; their estimator variance dominates compute cost because each draw requires expensive upstream work (rendering, simulation, encoding). We introduce CARV, a compute-aware variance-accounting framework that motivates a hierarchical MC estimator: amortize the expensive upstream computation over cheap diffusion-noise resamples, sharpened by timestep importance sampling and a stratified-inverse-CDF construction. In our text-to-3D distillation and attribution experiments, CARV delivers 2-3x effective compute multipliers (most from amortized reuse; ~25% additional from IS+stratification) without changing the objective; in single-step distillation, the same techniques cut gradient variance by an order of magnitude but do not improve downstream FID, marking the regime where MC variance is no longer the bottleneck.
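The amortization idea can be sketched as a two-level estimator: each expensive upstream result is reused across several cheap (timestep, noise) draws, with timesteps taken from a stratified inverse-CDF sampler. All names here are hypothetical, `inverse_cdf` is assumed to be the inverse CDF of the target timestep distribution (so no importance weights appear), and the importance-sampling correction described in the abstract is omitted.

```python
import random


def stratified_uniforms(n, rng):
    """One uniform draw from each stratum [i/n, (i+1)/n)."""
    return [(i + rng.random()) / n for i in range(n)]


def hierarchical_estimate(upstream, teacher_term, inverse_cdf,
                          n_outer, n_inner, seed=0):
    """Average teacher_term over n_outer upstream draws, each reused
    for n_inner stratified timestep and Gaussian noise resamples."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_outer):
        x = upstream(rng)  # expensive step (render, simulate, encode): once
        inner = 0.0
        for u in stratified_uniforms(n_inner, rng):
            t = inverse_cdf(u)        # stratified timestep
            eps = rng.gauss(0.0, 1.0)  # cheap noise resample
            inner += teacher_term(x, t, eps)
        total += inner / n_inner
    return total / n_outer
```

The compute trade-off is visible in the call counts: `upstream` runs `n_outer` times while `teacher_term` runs `n_outer * n_inner` times, so raising `n_inner` buys variance reduction at the cheaper rate.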

2605.21139 2026-05-25 cs.CV cs.LG 版本更新

Distill to Think, Foresee to Act: Cognitive-Physical Reinforcement Learning for Autonomous Driving

蒸馏思考,预见行动:面向自动驾驶的认知-物理强化学习

Yang Wu, Qiang Meng, Zhaojiang Liu, Youquan Liu, Jian Yang, Jin Xie

发表机构 * NJU(南京大学) SJTU(上海交通大学) FDU(复旦大学)

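AI summary Current end-to-end autonomous driving models are limited by the behavioral-cloning ceiling of imitation learning. To address this, the paper proposes CoPhy, a cognitive-physical reinforcement learning framework. It distills vision-language model knowledge into a bird's-eye-view (BEV) encoder, giving cognitive ability at zero inference cost, and builds an autoregressive BEV world model that predicts future semantic maps for candidate actions, so the consequences of an action can be foreseen at the level of the physical environment. The method optimizes the driving policy with a physical reward and a cognitive reward; it achieves state-of-the-art results on the NAVSIM benchmarks and supports safer, more flexible driving control through user-defined language instructions.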

详情
AI中文摘要

当前的端到端自动驾驶模型从根本上受到模仿学习的行为克隆上限的限制。虽然强化学习提供了更智能自主性的路径,但它需要两个缺失的基础设施:(1)理解交通语义和驾驶意图的认知基础,以及(2)能够预见候选行动后果的前瞻性物理环境。为此,我们提出了CoPhy,一个用于自动驾驶的认知-物理强化学习框架。为了蒸馏思考,我们将VLM知识蒸馏到BEV编码器中,然后完全丢弃VLM,以零推理成本保留认知能力,同时将认知通道作为可插拔接口释放,用于可选的人类语言命令。为了预见行动,我们构建了一个自回归BEV世界模型,该模型明确预测以候选行动为条件的未来语义地图,作为一个可解释的物理沙盒,从中直接推导出安全指标。基于这一双重基础设施,我们通过GRPO优化驾驶策略,采用新颖的双奖励机制:从BEV rollout导出的物理奖励强制执行硬安全约束,而来自语言对齐评分器的认知奖励确保意图合规。大量实验表明,CoPhy不仅在NAVSIM v1和v2基准上取得了最先进的结果,而且通过认知信息化的场景合规性和通过用户定义的语言指令实现的灵活意图控制,实现了更安全的驾驶。

英文摘要

Current end-to-end autonomous driving models are fundamentally constrained by the behavioral cloning ceiling of imitation learning. While reinforcement learning offers a path to smarter autonomy, it demands two missing pieces of infrastructure: (1) a cognitive foundation that understands traffic semantics and driving intent, and (2) a foresighted physical environment that can anticipate the consequences of candidate actions. To this end, we propose CoPhy, a Cognitive-Physical reinforcement learning framework for autonomous driving. To distill to think, we distill VLM knowledge into the BEV encoder and then discard the VLM entirely, retaining cognitive ability at zero inference cost while releasing the cognitive channel as a pluggable interface for optional human language commands. To foresee to act, we build an auto-regressive BEV world model that explicitly predicts future semantic maps conditioned on candidate actions, serving as an interpretable physical sandbox from which safety metrics are directly derived. Built upon this dual infrastructure, we optimize the driving policy via GRPO with a novel dual-reward mechanism: a physical reward derived from BEV rollouts enforces hard safety constraints, while a cognitive reward from a language-aligned scorer ensures intent compliance. Extensive experiments demonstrate that CoPhy not only achieves state-of-the-art results on NAVSIM v1 and v2 benchmarks, but also enables safer driving via cognitively informed scene compliance and flexible intent control through user-defined language instructions.

2605.20919 2026-05-25 cs.LG cs.AI cs.PL 版本更新

Sutra: Tensor-Op RNNs as a Compilation Target for Vector Symbolic Architectures

Sutra: 以张量操作RNN作为向量符号架构的编译目标

Emma Leonhart

发表机构 * Emma Leonhart

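AI summary Sutra is a typed, purely functional programming language whose forward pass compiles to a PyTorch neural network. It converts everything in a program, including primitive operations, control flow, and string I/O, into a single fused tensor-op graph, which makes tensor-op networks a compilation target for vector symbolic architectures. The study shows high decoding accuracy across several embedding substrates and verifies differentiability, so the same program can run as a logic program and be optimized as a trainable neural network.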

Comments Modified NeurIPS submission, see AI declaration and replication materials at end of paper

详情
AI中文摘要

Sutra是一种带类型的纯函数式编程语言,其编译后的前向传播是一个PyTorch神经网络。编译器将整个程序——包括原语、控制流、字符串I/O——通过beta归约降级为一个在冻结嵌入基质上的融合张量操作图。旋转绑定、解绑、捆绑、多项式Kleene三值逻辑以及尾递归循环均被降级为张量操作;Kleene连接词是在{-1, 0, +1}真值网格上精确的拉格朗日插值多项式。验证通过两种方式测试同一事实。(1) 同一程序在跨越两种模态的四个冻结嵌入上运行——三种文本编码器(nomic-embed-text、all-minilm、mxbai-embed-large)和一种蛋白质语言模型(ESM-2)——并在每个基质上以宽度k=8实现100%的解码准确率,而教科书式的Hadamard乘积已经崩溃(mxbai-embed-large上2.5%,all-minilm上7.5%)。(2) PyTorch自动求导流经实际编译的图:一个用.su编写的模糊规则分类器从随机初始化(18.7±9.5%;随机概率=20%,五类)通过反向传播经过发射图(符号源未修改)训练到100.0±0.0%(三个种子)。一个加权变体额外训练一个标量余弦增益,并将其作为数值字面量写回.su源文件;重新编译重现训练后的行为,每个logit误差约2e-7,因此训练后的模型本身是可读、可重编译的代码。因此,同一工件既是一个逻辑程序,也是一个可训练的神经网络。

英文摘要

Sutra is a typed, purely functional programming language whose compiled forward pass is a PyTorch neural network. The compiler beta-reduces the whole program -- primitives, control flow, string I/O -- to one fused tensor-op graph over a frozen embedding substrate. Rotation binding, unbind, bundle, polynomial Kleene three-valued logic, and tail-recursive loops all lower to tensor operations; the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} truth grid. Validation is one fact tested two ways. (1) The same program runs on four frozen embeddings spanning two modalities -- three text encoders (nomic-embed-text, all-minilm, mxbai-embed-large) and one protein language model (ESM-2) -- and decodes bundles at 100% accuracy through width k=8 on every substrate, where the textbook Hadamard product has already collapsed (2.5% on mxbai-embed-large, 7.5% on all-minilm). (2) PyTorch autograd flows through the actually compiled graph: a fuzzy-rule classifier written in .su trains from random init (18.7 +/- 9.5%; chance = 20%, five classes) to 100.0 +/- 0.0% (three seeds) by backpropagating through the emitted graph, the symbolic source unmodified. A weighted variant additionally trains a scalar cosine gain and writes it back into the .su source as a numeric literal; recompiling reproduces the trained behaviour to ~2e-7 per logit, so the trained model is itself legible, recompilable code. The same artifact is therefore both a logic program and a trainable neural network.
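The statement that the Kleene connectives are Lagrange-interpolated polynomials exact on the {-1, 0, +1} grid can be checked directly. The sketch below assumes the usual strong Kleene semantics with -1 as false, 0 as unknown, and +1 as true, so that AND is `min`, OR is `max`, and NOT is negation; the encoding and function names are assumptions, and this is plain Python rather than Sutra or PyTorch code.

```python
NODES = (-1, 0, 1)


def basis(node, x):
    """Lagrange basis on {-1, 0, 1}: equals 1 at `node`, 0 at the other two."""
    if node == -1:
        return x * (x - 1) / 2
    if node == 0:
        return 1 - x * x
    return x * (x + 1) / 2


def interpolate(table):
    """Bivariate polynomial (degree <= 2 per variable) matching `table` on the grid."""
    def poly(a, b):
        return sum(table(u, v) * basis(u, a) * basis(v, b)
                   for u in NODES for v in NODES)
    return poly


kleene_and = interpolate(min)
kleene_or = interpolate(max)


def kleene_not(a):
    return -a
```

Since each connective is a polynomial, it needs only additions and multiplications, which is what lets it lower to tensor operations and remain differentiable between grid points.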

2605.20896 2026-05-25 cs.CR cs.AI cs.LG 版本更新

GenAI-Driven Threat Detection with Microsoft Security Copilot

GenAI驱动的威胁检测与Microsoft Security Copilot

Scott Freitas, Amir Gharib

发表机构 * Microsoft Security Research(微软安全研究)

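AI summary This paper presents the Dynamic Threat Detection Agent (DTDA), an autonomous agent that improves the ability of Microsoft Security Copilot to detect hidden cyber threats. DTDA combines a unified activity timeline, versioned LLM prompt contracts, a planner-executor investigation loop, and dynamic alert generation to analyze security incidents continuously and produce explainable detections. Experiments show that DTDA achieves high detection precision and efficiency in real deployments and strengthens the threat identification of existing systems.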

详情
AI中文摘要

防御当今日益复杂的网络攻击需要安全分析师不断将不断演变的攻击者技术转化为检测逻辑。这使防御者处于被动状态,需要在日益碎片化的安全格局中不断更新专业知识。我们引入了动态威胁检测代理(DTDA),一种始终在线的自适应代理,持续调查Microsoft Defender中的安全事件,以发现隐藏威胁并在发现攻击故事缺口时生成可解释的检测。DTDA结合了:(1)统一的活动时间线,涵盖警报、事件、用户和实体行为分析以及威胁情报;(2)版本化的LLM提示合同,包含模式验证、基础要求、有限重试和故障关闭抑制;(3)规划器-执行器调查循环,生成攻击特定假设并收集支持和反驳证据;(4)动态告警生成,包含上下文相关的标题、严重性、MITRE映射、修复指导、涉及实体和自然语言攻击描述。集成到Microsoft Security Copilot并部署在数万个Defender客户中,DTDA在行业规模下持续运行。在120天的在线评估中,DTDA根据客户反馈实现了80.1%的精确率,同时为约15%的调查事件生成了新颖告警。在离线评估中,DTDA使用GPT-5.4以0.78的F1分数恢复了隐藏的恶意活动,比GPT-4.1提高了0.12 F1,并比基线高出0.26 F1点。在操作上,DTDA处理单个事件调查的中位端到端时间为28分钟,中位代币成本为2.04美元,作业级故障率为0.38%。这些结果表明,自主代理可以在生产规模上识别遗漏的恶意活动。

英文摘要

Defending against today's increasingly sophisticated cyberattacks requires security analysts to continuously translate evolving attacker tradecraft into detection logic. This places defenders in a reactive posture, requiring constantly updated expertise across an increasingly fragmented security landscape. We introduce the Dynamic Threat Detection Agent (DTDA), an always-on adaptive agent that continuously investigates security incidents across Microsoft Defender to uncover hidden threats and generate explainable detections when attack-story gaps are found. DTDA combines: (1) a unified activity timeline spanning alerts, events, user and entity behavior analytics, and threat intelligence; (2) versioned LLM prompt contracts with schema validation, grounding requirements, bounded retries, and fail-closed suppression; (3) a planner-executor investigation loop that generates attack-specific hypotheses and gathers supporting and refuting evidence; and (4) dynamic alert generation with a context-relevant title, severity, MITRE mappings, remediation guidance, implicated entities, and natural-language attack description. Integrated into Microsoft Security Copilot and deployed across tens of thousands of Defender customers, DTDA operates continuously at industry scale. In a 120-day online evaluation, DTDA achieves 80.1% precision from customer feedback while generating novel alerts for approximately 15% of investigated incidents. In offline evaluation, DTDA recovers hidden malicious activity with 0.78 F1 using GPT-5.4, improving over GPT-4.1 by 0.12 F1 and outperforming the baseline by 0.26 F1 points. Operationally, DTDA processes single-incident investigations end-to-end in a median of 28 minutes at a median token cost of USD 2.04, with a 0.38% job-level failure rate. These results demonstrate that autonomous agents can identify missed malicious activity at a production scale.
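The "prompt contract" elements listed in item (2), schema validation, grounding requirements, bounded retries, and fail-closed suppression, follow a common pattern that can be sketched generically. Everything below is hypothetical: the `REQUIRED` schema, the rule that at least one evidence item must be cited, and the `call_model` callable stand in for details the abstract does not give, and no Microsoft API is used.

```python
import json

# Hypothetical contract: field name -> required JSON type.
REQUIRED = {"title": str, "severity": str, "evidence": list}


def validate(raw):
    """Return the parsed alert if it satisfies the contract, else None."""
    try:
        obj = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(obj, dict):
        return None
    for key, expected_type in REQUIRED.items():
        if not isinstance(obj.get(key), expected_type):
            return None
    if not obj["evidence"]:
        return None  # grounding: an alert must cite at least one evidence item
    return obj


def generate_alert(call_model, max_retries=2):
    """Call the model at most max_retries + 1 times; suppress on failure."""
    for _ in range(max_retries + 1):
        alert = validate(call_model())
        if alert is not None:
            return alert
    return None  # fail closed: no valid output means no alert is emitted
```

Failing closed trades recall for precision: a malformed or ungrounded response never reaches an analyst, at the cost of occasionally dropping a real finding.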

2605.20201 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Long-Context Reasoning Through Proxy-Based Chain-of-Thought Tuning

基于代理思维链调优的长上下文推理

Miao Li, Irina Saparina, Alexander Gurung, Mirella Lapata

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院)

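AI summary This study targets the poor performance of large language models on long-context tasks that require complex reasoning and proposes ProxyCoT, a new training framework. The method obtains high-quality reasoning traces on short proxy contexts and transfers them to the full long context, improving the model's long-context reasoning. Experiments show that ProxyCoT outperforms existing methods on several datasets at lower computational overhead and generalizes well across domains.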

Comments Long paper, ACL 2026 (Main conference)

详情
AI中文摘要

近期的大语言模型支持高达1000万token的输入,但在需要复杂推理的长上下文任务上表现不佳。此类任务可以通过仅使用输入的一个子集(即代理上下文)而非完整序列来解决。尽管共享相同的底层推理过程,模型在代理上下文和完整上下文之间表现出显著的性能差异。为了改进长上下文推理,我们提出了ProxyCoT,一种新颖的训练框架,将推理能力从短代理上下文迁移到完整长上下文。具体来说,我们首先通过强化学习或从更大的教师模型蒸馏,在代理上下文中获得高质量的思维链推理轨迹,然后通过监督微调将这些生成的轨迹锚定到完整长上下文中。跨不同数据集的实验表明,ProxyCoT在减少计算开销的同时,始终优于强基线。此外,使用ProxyCoT训练的模型能够将其长上下文推理能力泛化到域外任务。

英文摘要

Recent large language models support inputs of up to 10 million tokens, yet they perform poorly on long-context tasks that require complex reasoning. Such tasks can be solved using only a subset of the input -- a proxy context -- rather than the full sequence. Despite sharing the same underlying reasoning process, models exhibit a significant performance disparity between proxy and full contexts. To improve long-context reasoning, we propose ProxyCoT, a novel training framework that transfers reasoning capabilities from short proxy contexts to full long contexts. Specifically, we first obtain high-quality chain-of-thought reasoning traces on proxy contexts through reinforcement learning or distillation from a larger teacher model, and then ground the generated traces in full long contexts with supervised fine-tuning. Experiments across different datasets demonstrate that ProxyCoT consistently outperforms strong baselines with reduced computational overhead. Furthermore, models trained with ProxyCoT generalize their long-context reasoning capabilities to out-of-domain tasks.

2605.18993 2026-05-25 cs.LG cs.AI 版本更新

Distilling Linearized Behavior into Non-Linear Fine-Tuning for Effective Task Arithmetic

将线性化行为蒸馏到非线性微调中以实现有效的任务算术

Thomas Sommariva, Francesca Morandi, Simone Calderara, Angelo Porrello

发表机构 * University of Pisa, Italy(比萨大学,意大利)

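AI summary This study explores how to retain, in non-linear fine-tuning, the advantages that linear fine-tuning offers for task-vector composition. The authors propose imposing constraints in activation space so that the non-linear model stays linear with respect to weight perturbations, and they train the student by distilling hidden representations from a linearized teacher. The method keeps task vectors composable, avoids extra inference-time overhead, and performs well on vision and language tasks.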

Comments Accepted at ICML 2026

详情
AI中文摘要

任务向量组合已成为编辑预训练模型的一种有前景的范式,通过加法实现模型合并,通过减法实现模型遗忘。在预训练模型的切空间中进行微调(线性微调)已被证明是有效的,因为它产生的任务向量自然解缠且抗干扰。然而,线性化模型在训练期间表达能力有限,并且在推理时计算成本较高,这限制了它们的实际应用。在这项工作中,我们弥合了线性微调与标准非线性微调之间的差距。我们表明,关于权重扰动的线性性(一种在参数空间中定义的属性)可以通过在训练期间在激活空间中施加约束来强制执行。具体来说,我们将曲率正则化的线性化教师模型的隐藏表示蒸馏到通过常规微调训练的非线性学生模型中。我们发现,得到的模型继承了线性化模型在任务算术中的关键属性,能够实现任务向量的有效组合,并在视觉和语言基准测试中实现强性能,而不会产生任何推理开销。

英文摘要

Task vector composition has emerged as a promising paradigm for editing pre-trained models, enabling model merging through addition and unlearning through subtraction. Fine-tuning in the tangent space of a pre-trained model (linear fine-tuning) has proven effective, as it produces task vectors that are naturally disentangled and resistant to interference. However, linearized models suffer from limited expressivity during training and incur higher computational costs at inference time, which restrict their practical applicability. In this work, we bridge the gap between linear and standard non-linear fine-tuning. We show that linearity with respect to weight perturbations, a property defined in parameter space, can be enforced through constraints in activation space during training. Concretely, we distill hidden representations from a curvature-regularized linearized teacher into a non-linear student trained via conventional fine-tuning. We find that the resulting model inherits key properties of linearized models for task arithmetic, enabling effective composition of task vectors and achieving strong performance across vision and language benchmarks without incurring any inference-time overhead.
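For readers new to task arithmetic, the operations named in the first sentence reduce to elementwise arithmetic on weights: a task vector is the fine-tuned weights minus the pre-trained weights, merging adds task vectors, and unlearning subtracts one. The sketch below uses plain dictionaries of float lists as a stand-in for model state; the function names and the `scale` parameter are illustrative assumptions, and it does not implement the paper's distillation method.

```python
def task_vector(pretrained, finetuned):
    """tau = theta_finetuned - theta_pretrained, per parameter tensor."""
    return {name: [f - p for p, f in zip(pretrained[name], finetuned[name])]
            for name in pretrained}


def apply_task_vectors(pretrained, vectors, scale=1.0):
    """theta = theta_pretrained + scale * sum of task vectors.

    A positive scale merges tasks; a negative scale subtracts them (unlearning).
    """
    merged = {name: list(values) for name, values in pretrained.items()}
    for tau in vectors:
        for name in merged:
            merged[name] = [w + scale * d
                            for w, d in zip(merged[name], tau[name])]
    return merged
```

Whether the merged model actually performs both tasks depends on the task vectors not interfering with each other, which is the property that linearized fine-tuning is reported to encourage.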

2605.18911 2026-05-25 cs.LG cs.AI 版本更新

Does Your Wildfire Prediction Model Actually Work, or Just Score Well?

你的野火预测模型真的有效,还是只是得分高?

Yangshuang Xu, Yuyang Dai, Liling Chang, Qi Wang, Yushun Dong

发表机构 * Florida State University(佛罗里达州立大学) Northeastern University(东北大学)

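AI summary This paper examines whether existing Earth foundation models are actually effective for wildfire prediction, noting that current models are pretrained for general atmospheric and geophysical objectives rather than for wildfire forecasting. The authors present WILDFIRE-FM, the first model pretrained specifically for wildfire prediction, and introduce a fixed-contract evaluation framework to address the evaluation bias caused by the sparsity of wildfire events. The results show that wildfire transfer conclusions depend heavily on evaluation design and task formulation, providing a new benchmark and methodological support for future work.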

Comments 25 pages

详情
AI中文摘要

野火预测对于早期预警和资源分配至关重要,然而现有的地球基础模型(Earth FMs)是为通用大气和地球物理目标预训练的,而非野火预测。为弥补这一空白,我们提出了WILDFIRE-FM,这是首个专门针对野火预测预训练的基础模型,使用了天气、活跃火观测、地形、植被和静态环境数据。然而,仅引入特定领域的骨干网络并不能解决评估问题:野火事件在时空上稀疏,使得迁移结论对匹配规则和评估设置高度敏感。为解决这一问题,我们引入了一个固定合约评估框架,包含两个受控检查:固定输出检查用于匹配规则效应,固定特征检查用于头部选择效应。在匹配合约下,我们在占用、蔓延、检索和回归任务上将WILDFIRE-FM与十个地球基础模型基线进行比较。结果表明,野火迁移结论强烈依赖于评估设计和任务制定。我们希望该框架和WILDFIRE-FM能为未来野火特定的地球基础模型研究和基准测试提供基础。我们的代码可在 https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/ 获取。

英文摘要

Wildfire prediction is important for early warning and resource allocation, yet existing Earth foundation models (Earth FMs) are pretrained for general atmospheric and geophysical objectives rather than wildfire forecasting. To address this gap, we introduce WILDFIRE-FM, the first foundation model pretrained specifically for wildfire prediction using weather, active-fire observations, topography, vegetation, and static environmental data. However, introducing a domain-specific backbone alone does not solve the evaluation problem: wildfire events are sparse in space and time, making transfer conclusions highly sensitive to matching rules and evaluation settings. To address this problem, we introduce a fixed-contract evaluation framework with two controlled checks: a fixed-output check for matching-rule effects and a fixed-feature check for head-selection effects. Under matched contracts, we compare WILDFIRE-FM with ten Earth-FM baselines across occupancy, spread, retrieval, and regression tasks. Our results show that wildfire transfer conclusions depend strongly on evaluation design and task formulation. We hope this framework and WILDFIRE-FM provide a foundation for future wildfire-specific Earth-FM research and benchmarking. Our code is available at https://anonymous.4open.science/r/Wildfire-fm-evaluation-contracts-5AE9/.

2605.18859 2026-05-25 cs.LG cs.AI 版本更新

TwinRouterBench: Fast Static and Live Dynamic Evaluation for Realistic Agentic LLM Routing

TwinRouterBench:面向现实智能体LLM路由的快速静态与实时动态评估

Pei Yang, Wanyi Chen, Tongyun Yang, Pengbin Feng, Jiarong Xing, Wentao Guo, Yuhang Yao, Yuhang Han, Hanchen Li, Xu Wang, Zeyu Wang, Jie Xiao, Anjie Yang, Liang Tian, Lynn Ai, Eric Yang, Tianyu Shi

发表机构 * Gradient Soochow University(苏州大学) Independent Researcher(独立研究者) University of Southern California(南加州大学) Rice University(莱斯大学) Carnegie Mellon University(卡内基梅隆大学) Shanghai Jiao Tong University(上海交通大学) University of California, Berkeley(加州大学伯克利分校) University of the Chinese Academy of Sciences(中国科学院大学) University of California, Los Angeles(加州大学洛杉矶分校)

AI总结 本文提出 TwinRouterBench,一个用于评估代理式大语言模型(LLM)路由策略的基准工具,旨在支持静态和动态场景下的高效评估。该基准包含两个赛道:静态赛道提供多个任务中的模型调用前缀及对应的最优模型层级,通过确定性计算进行评分;动态赛道则在真实代理系统中运行路由策略,评估其在实际任务完成和成本控制方面的表现。该工作为路由算法的开发与优化提供了全面且高效的实验平台。

详情
AI中文摘要

LLM路由在长时任务(如编码智能体、深度研究系统和计算机使用智能体)中最为重要,其中单个用户请求会触发多次模型调用。将每次调用路由到最便宜的足够模型可以在不牺牲质量的情况下降低成本,然而现有的路由器基准仅评估一次性提示的路由。它们从未暴露中间智能体步骤中路由器可见的前缀,从未测试更便宜的替代品是否保留下游任务的成功,并且通常在评估时依赖在线LLM评判。我们引入了TwinRouterBench,一个具有两轨的步骤级路由基准。静态轨提供来自SWE-bench、BFCL、mtRAG、QMSum和PinchBench中520个实例的970个路由器可见前缀,每个前缀与在发布的降级和级联协议下估计的执行验证目标层级配对;评分是层级标签、轨迹成员资格和令牌成本的确定性算术,无需在线评估方LLM评判。动态轨提供一个工具,可在完整的500例SWE-bench验证集上运行路由器;本文报告了与静态SWE监督划分不相交的100例保留评估。每次LLM调用时,路由器从锁定池中选择一个具体模型,成功由官方任务解决率和实际API支出衡量。两轨支持快速离线迭代,随后在实时智能体执行下进行端到端验证。代码和数据可在https://github.com/CommonstackAI/TwinRouterBench获取。

英文摘要

LLM routing matters most in long-horizon applications such as coding agents, deep research systems, and computer-use agents, where a single user request triggers many model calls. Routing each call to the cheapest sufficient model can cut costs without sacrificing quality, yet existing router benchmarks evaluate routers only on one-shot prompts. They never expose the router-visible prefix at an intermediate agent step, never test whether a cheaper replacement preserves downstream task success, and often rely on online LLM judges at evaluation time. We introduce TwinRouterBench, a step-level routing benchmark with two tracks. The static track provides 970 router-visible prefixes from 520 instances across SWE-bench, BFCL, mtRAG, QMSum, and PinchBench, each paired with an execution-verified target tier estimated under a released downgrade-and-cascade protocol; scoring is deterministic arithmetic over tier labels, trajectory membership, and token costs, with no online evaluator-side LLM judge. The dynamic track supplies a harness that runs routers on the full 500-case SWE-bench Verified suite; in this paper we report a 100-case held-out evaluation disjoint from the static SWE supervision split. At each LLM call the router selects a concrete model from a locked pool, and success is measured by official task resolution and realized API spend. The two tracks support fast offline iteration followed by end-to-end validation under live agent execution. Code and data are available at https://github.com/CommonstackAI/TwinRouterBench.
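The "cheapest sufficient model" idea and the deterministic scoring over tier labels and token costs can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the `Model` fields, the pool contents, the prices, and the shape of the per-step score are hypothetical and are not taken from the TwinRouterBench release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Model:
    name: str           # hypothetical model identifier
    tier: int           # capability tier; higher is assumed to mean more capable
    cost_per_1k: float  # hypothetical price per 1,000 tokens


def cheapest_sufficient(pool, target_tier):
    """Return the cheapest model in the pool whose tier meets the target tier."""
    candidates = [m for m in pool if m.tier >= target_tier]
    if not candidates:
        raise ValueError("no model in the pool reaches the target tier")
    return min(candidates, key=lambda m: m.cost_per_1k)


def score_step(chosen, target_tier, tokens):
    """Deterministic per-step score: tier sufficiency plus realized token cost."""
    return {
        "sufficient": chosen.tier >= target_tier,
        "cost": chosen.cost_per_1k * tokens / 1000.0,
    }


# Hypothetical locked pool of three models.
POOL = [
    Model("small", 1, 0.10),
    Model("medium", 2, 0.50),
    Model("large", 3, 2.00),
]
```

Because the score is plain arithmetic over labels and costs, repeated runs give identical results, which is the property that makes an LLM judge unnecessary at evaluation time.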

2605.18370 2026-05-25 stat.ML cs.LG math.ST stat.TH 版本更新

On Stability and Decomposition of Sample Quantiles under Heavy-Tailed Distributions

重尾分布下样本分位数的稳定性与分解

Choudur Lakshminarayan

发表机构 * School of Business, Stevens Institute of Technology(斯蒂文斯理工学院商学院)

AI总结 本文研究了在重尾分布下,基于估计参数的样本分位数的稳定性与分解问题,尤其关注与金融收益线性投影相关的风险价值(VaR)估计。传统Bahadur表示在固定分布下难以分离投影方向和分位数阈值带来的不稳定性,本文提出一种Q-Q正交性方法,将两者的影响分离开来,并将样本分位数与理论分位数的差异分解为三个部分,分别对应投影方向变化、样本分位数波动以及余项,从而更精确地分析分位数估计的稳定性来源。

Comments 0 figures

详情
AI中文摘要

我们研究由估计参数索引的分布样本分位数,重点关注与金融收益线性投影相关的风险价值,其潜在概率律是重尾的。在此设定下,投影方向和经验分位数阈值均从数据中估计,因此固定分布下的标准Bahadur表示无法分离不同的不稳定性来源。一个规范的起点是Bahadur表示,它通过经验分布函数加上余项来表达样本分位数(Bahadur, 1966)。经验过程理论通过半空间、对称差和Glivenko-Cantelli一致收敛的机制提供了可用的框架。它们给出了稳定性界,但将投影方向的变化和分位数阈值的变化吸收到单一的对称差度量中。有趣的是,对于本质上是局部分位数稳定性问题,却施加了全局一致收敛的要求。 本文引入了一种Q-Q正交性公式来分离投影方向和分位数阈值效应。关注的对象是使用估计投影方向计算的经验分位数与参考投影方向下的总体分位数之间的差异。我们将此差异分解为三项:$\hat q_α(\hat w)-q_α(w_0)=D_1+D_2+D_3$。其中,$D_1$衡量由投影方向扰动引起的总体分位数移动,$D_2$衡量在投影方向固定时经验分位数的波动,$D_3$是Bahadur型余项。

英文摘要

We study sample quantiles of distributions indexed by estimated parameters, with a focus on Value-at-Risk related to linear projections of financial returns whose underlying probability law is heavy-tailed. In this setting, the projection direction and the empirical quantile threshold are both estimated from the data, so the standard Bahadur representation under a fixed distribution does not separate the distinct sources of instability. A canonical starting point is Bahadur's representation, which expresses the sample quantile through the empirical distribution function plus a remainder term (Bahadur, 1966). Empirical-process theory provides a usable scaffolding through the mechanics of half-spaces, symmetric differences, and Glivenko-Cantelli uniform convergence. These tools yield stability bounds, but absorb changes in projection direction and changes in quantile threshold into a single symmetric-difference measure. Interestingly, a global uniform-convergence requirement is imposed on what is intrinsically a local quantile-stability problem. This paper introduces a Q-Q orthogonality formulation for separating projection-direction and quantile-threshold effects. The object of interest is the difference between the empirical quantile computed using the estimated projection direction and the population quantile computed at the reference projection direction. We decompose this difference into three terms, $\hat q_α(\hat w)-q_α(w_0)=D_1+D_2+D_3$. Here, $D_1$ measures the population quantile movement induced by perturbing the projection direction, $D_2$ measures the empirical quantile fluctuation with the projection direction held fixed, and $D_3$ is the Bahadur-type remainder.
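One way to read the three-term decomposition is as a telescoping sum followed by a Bahadur expansion at the estimated direction. The display below is a sketch under that assumption: the symbols $F_{n,w}$ (empirical distribution function of the projected sample), $f_w$ (density of the projected law), and $R_n$ are introduced here for illustration, and the paper's exact definitions of $D_2$ and $D_3$ may differ.

```latex
\begin{aligned}
\hat q_\alpha(\hat w) - q_\alpha(w_0)
  &= \underbrace{q_\alpha(\hat w) - q_\alpha(w_0)}_{D_1}
   + \underbrace{\hat q_\alpha(\hat w) - q_\alpha(\hat w)}_{D_2 + D_3}, \\
D_2 &= \frac{\alpha - F_{n,\hat w}\bigl(q_\alpha(\hat w)\bigr)}
            {f_{\hat w}\bigl(q_\alpha(\hat w)\bigr)}, \qquad
D_3 = R_n(\hat w).
\end{aligned}
```

Under this reading, $D_1$ involves only population quantities, while $D_2$ and $D_3$ capture sampling fluctuation with the direction held at $\hat w$.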

2605.18329 2026-05-25 cs.CV cs.LG 版本更新

Lost in the Folds: When Cross-Validation Is Not a Deep Ensemble for Uncertainty Estimation

迷失在折叠中:当交叉验证不是用于不确定性估计的深度集成时

Tristan Kirscher, Markus Bujotzek, Yannick Kirchhoff, Maximilian Rokuss, Fabian Isensee, Kim-Celine Kahl, Balint Kovacs, Klaus Maier-Hein

发表机构 * ICube Laboratory, CNRS UMR-7357, University of Strasbourg, Strasbourg, France(ICube实验室,法国斯特拉斯堡大学) CLCC Institut-Strauss, Strasbourg, France(CLCC斯特拉斯堡研究所) German Cancer Research Center (DKFZ) Heidelberg, Division of Medical Image Computing(海德堡德国癌症研究中心(DKFZ)医学影像计算部门) Medical Faculty Heidelberg, Heidelberg University, Heidelberg, Germany(海德堡医学院,海德堡大学) Faculty of Mathematics and Computer Science, University of Heidelberg, Germany(海德堡大学数学与计算机科学学院) Helmholtz Imaging, German Cancer Research Center, Heidelberg, Germany(海德堡德国癌症研究中心Helmholtz成像部门) Pattern Analysis and Learning Group, Department of Radiation Oncology, Heidelberg University Hospital, Heidelberg, Germany(海德堡大学医院放射肿瘤学部模式分析与学习小组)

AI总结 在医学图像分割中,集成模型的分歧常被用作认识论不确定性的代理,但许多研究通过K折交叉验证(CV)构建集成模型,却称之为“深度集成”(DE),导致术语与实现不一致。本文对比了标准5折CV集成与5成员DE在三个多标注分割数据集上的表现,发现DE在保持分割精度的同时,提升了校准和失败检测能力,而CV集成有时与标注者间差异相关性更强。研究指出,应根据研究目标选择集成构建方式:DE适用于可靠性导向任务(如选择性转诊),CV集成则更适合作为模糊性代理。

Comments Accepted for publication at MICCAI 2026

详情
Journal ref
29th International Conference On Medical Image Computing And Computer Assisted Intervention, Sep 2026, Strasbourg, France
AI中文摘要

集成不一致性被广泛用作医学图像分割中认知不确定性的代理。在实践中,许多研究通过K折交叉验证(CV)形成集成,却称之为“深度集成”(DE)。由于CV成员在不同的数据子集上训练,它们的不一致性混合了种子驱动变异和数据暴露效应,这可能改变不确定性的解释方式。我们审查了最近的分割不确定性研究,发现术语与实现不匹配很常见。然后,我们在三个多模态多标注者分割数据集上,在相同配置下比较了标准5折CV集成与5成员DE(固定训练集,不同随机种子)。我们评估了不确定性在校准、故障检测、歧义建模和分布偏移下的鲁棒性。DE在匹配分割精度的同时改善了校准和故障检测,而CV集成在研究数据集上有时与标注者间变异性相关性更强。因此,应选择与研究问题匹配的集成构建方式:DE用于可靠性导向的使用(如选择性转诊/故障检测),CV集成作为歧义的代理。我们提供了一个轻量级的nnU-Net修改,使得在默认流程内能够进行DE训练。

英文摘要

Ensemble disagreement is widely used as a proxy for epistemic uncertainty in medical image segmentation. In practice, many studies form ensembles via K-fold cross-validation (CV), yet refer to them as "deep ensembles" (DE). Because CV members are trained on different data subsets, their disagreement mixes seed-driven variability with data-exposure effects, which can change how uncertainty should be interpreted. We audit recent segmentation uncertainty studies and find that terminology-implementation mismatches are common. We then compare a standard 5-fold CV ensemble to a 5-member DE (fixed training set, different random seeds) under otherwise identical configurations on three multi-rater segmentation datasets spanning three modalities. We evaluate uncertainty for calibration, failure detection, ambiguity modeling, and robustness under distribution shift. DE match segmentation accuracy while improving calibration and failure detection, whereas CV ensembles sometimes correlate more strongly with inter-rater variability on the studied datasets. Thus, ensemble construction should be chosen to match the research question: DE for reliability-oriented use (e.g., selective referral/failure detection) and CV ensembles as a proxy for ambiguity. We provide a lightweight nnU-Net modification enabling DE training within the default pipeline.
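The distinction between the two constructions comes down to what varies across members. The sketch below makes this concrete at the level of training indices and seeds only; the function names and the dictionary layout are hypothetical, and no model training is performed.

```python
import random


def cv_ensemble_members(n_samples, k=5, seed=0):
    """K-fold CV ensemble: member i trains on every fold except fold i.

    Members differ in BOTH the data they see and (here, by assumption) their seed.
    """
    idx = list(range(n_samples))
    random.Random(seed).shuffle(idx)
    folds = [idx[i::k] for i in range(k)]
    members = []
    for held_out in range(k):
        train = sorted(
            j for f, fold in enumerate(folds) if f != held_out for j in fold
        )
        members.append({"train_idx": train, "seed": seed + held_out})
    return members


def deep_ensemble_members(n_samples, m=5, base_seed=0):
    """Deep ensemble: every member sees the full training set; only the seed differs."""
    full = list(range(n_samples))
    return [{"train_idx": full, "seed": base_seed + i} for i in range(m)]
```

With 100 samples and 5 folds, each CV member trains on 80 samples and no two members share the same subset, whereas all deep-ensemble members train on the same 100 samples.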

2605.17767 2026-05-25 stat.ML cs.LG 版本更新

Feature Learning in Linear-Width Two-Layer Networks: Two vs. One Step of Gradient Descent

线性宽度双层网络中的特征学习:梯度下降的两步 vs 一步

Behrad Moniri, Hamed Hassani

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文研究了在宽度线性增长的两层神经网络中特征学习的行为,重点分析了梯度下降第二步更新时隐藏层权重的变化。作者超越了之前仅分析单步更新的研究,揭示了第二步更新中权重的谱特性,表明其行为类似于具有多个异常值的尖峰随机矩阵,这些异常值对应于学习到的不同方向。研究还发现,通过重复使用训练批次而非独立批次,可以学习到信息指数大于一的方向,表明批次重用在宽网络中仍具有优势。

详情
AI中文摘要

我们在线性宽度机制下研究双层神经网络中的特征学习,其中隐藏神经元数量、样本量和输入维度成比例缩放。尽管近期工作分析了该机制下通过单步梯度下降更新第一层权重的特征学习,但这种单步更新方案存在根本性限制:权重更新近似秩一,仅捕获单个方向,且要求目标函数的信息指数为1。本文超越单步更新,完整刻画了步长$η_1\asymp N^{α_1}$和$η_2 \asymp N^{α_2}$($α_1, α_2 \in [0,0.5)$,$N$为隐藏神经元数)的梯度下降第二步过程中学习的特征。我们推导了更新权重的谱特征,证明其表现为具有多个离群点的尖峰随机矩阵,每个离群点对应一个学习方向。我们证明离群点数量由参数$α_1, α_2$通过$\lfloor \frac{α_2}{1/2 - α_1} \rfloor$决定。此外,通过分析学习方向与目标函数之间的对齐,我们发现了独立批次与重用批次训练之间的差距。独立批次将学习限制在信息指数为1的方向上,而批重用使得第二步更新能够捕获信息指数超过1的方向,前提是$α_1, α_2$选择得当。这表明先前在窄宽度机制中观察到的批重用优势在线性宽度极限下仍然存在。通过刻画这些早期阶段的演化,我们的工作为研究现代过参数化网络中的优化和特征学习现象提供了一个易处理的框架。

英文摘要

We study feature learning in two-layer neural networks within the linear-width regime, where the number of hidden neurons, sample size, and input dimension scale proportionally. While recent work has analyzed feature learning via a single step of gradient descent on the first layer weights in this regime, such one-step update schemes are fundamentally limited: the update to the weights is approximately rank-one, captures only a single direction, and requires the target function to have an information exponent of one. In this paper, we go beyond one-step updates to provide a full characterization of the features learned during the second step of gradient descent with step-sizes $η_1\asymp N^{α_1}$ and $η_2 \asymp N^{α_2}$ for $α_1, α_2 \in [0,0.5)$, where $N$ is the number of hidden neurons. We derive a spectral characterization of the updated weights, demonstrating they behave as a spiked random matrix with multiple outliers, each corresponding to a learned direction. We show that the number of outliers is determined by the parameters $α_1, α_2$ through $\lfloor \frac{α_2}{1/2 - α_1} \rfloor$. Furthermore, by analyzing the alignment between the learned directions and the target function, we identify a gap between training with independent versus reused batches. While independent batches restrict learning to directions with an information exponent of one, batch reuse enables the second update to capture directions even when the information exponent exceeds one, provided that $α_1, α_2$ are chosen properly. This shows that the benefits of batch reuse, previously observed in narrow-width regimes, persist in the linear-width limit as well. By characterizing these early-phase evolutions, our work proposes a tractable framework for studying optimization and feature learning phenomenology in modern overparameterized networks.
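The outlier-count formula $\lfloor \frac{α_2}{1/2 - α_1} \rfloor$ is easy to evaluate for concrete exponents. The helper below is a small sketch that only evaluates the expression as stated in the abstract; the function name is hypothetical, and exact rational arithmetic is used so that values on a floor boundary are not affected by floating-point rounding.

```python
from fractions import Fraction
import math


def num_outliers(alpha1, alpha2):
    """Evaluate floor(alpha2 / (1/2 - alpha1)) for alpha1, alpha2 in [0, 1/2).

    Arguments are passed as strings such as "0.4" (or as Fractions) so that
    they are converted to exact rationals.
    """
    a1, a2 = Fraction(alpha1), Fraction(alpha2)
    half = Fraction(1, 2)
    if not (0 <= a1 < half and 0 <= a2 < half):
        raise ValueError("both exponents must lie in [0, 1/2)")
    return math.floor(a2 / (half - a1))
```

For example, `num_outliers("0.4", "0.45")` evaluates $\lfloor 0.45 / 0.1 \rfloor = 4$, while `num_outliers("0.25", "0.25")` sits exactly on the boundary and gives 1.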

2605.17245 2026-05-25 cs.NI cs.LG 版本更新

An Efficient Machine Learning-based Framework for Detection and Prevention of Frauds in Telecom Networks

一种基于机器学习的高效电信网络欺诈检测与预防框架

Praveen Hegde, Mishal Shah

发表机构 * Verizon Bloomberg LP Atlanta, USA(美国亚特兰大) Jersey City, USA(美国新泽西州泽西市)

AI总结 本文提出了一种基于机器学习的高效框架,用于电信网络中欺诈行为的检测与预防。研究使用包含10万余条客户记录的电信详单数据集,通过特征预处理、数据平衡和模型训练等步骤,评估了多种机器学习模型的性能。实验结果表明,随机森林(RF)模型在准确率、精确率、召回率和F1分数等指标上均达到99.9%,是检测电信欺诈最有效的模型。

Comments Peer-reviewed and presented at 2025 International Conference on Advancement in Communication and Computing Technology (INOACC-2025); self-published by the author due to a sustained 13-month indexing delay by the organizers. Contains 7 pages and 7 figures

详情
Journal ref
International Conference on Advancement in Communication and Computing Technology (INOACC), 2025
AI中文摘要

电信欺诈是一个严重问题,导致重大物质损失并损害全球电信系统的可靠性。只有有效且高效的检测机制才能应对这些威胁,尽管欺诈检测方法有所转变。本文使用通话详细记录(CDR)数据集评估了人工智能驱动的模型在电信网络欺诈检测中的性能。本研究聚焦于使用Telecom CDR数据集进行电信网络欺诈检测,该数据集包含101,174条客户记录,具有17个属性,其中包括8,830个欺诈案例。在特征预处理中,处理了缺失值,随后使用Min-Max缩放进行数据缩放,并使用SMOTE技术进行数据平衡。使用随机森林(RF)和XGBoost模型对数据集进行预测分析训练。使用F1分数、ROC AUC、召回率、准确率、时间和精确度作为指标来比较两个模型的性能。RF的准确率高达99.9%,而XGBoost为99.7%。结果表明,所提出的框架成功检测欺诈且误分类很少。评估和对比了多种机器学习模型,如RF、XGBoost、DBSCAN、RoBERTa和K-means。在所有模型中,RF表现最佳,准确率99.9%、精确度99.9%、召回率99.9%和F1分数99.9%,优于XGBoost、GNN和BERT。研究结果强调RF是检测电信网络欺诈活动的最有效模型,确保稳健可靠的欺诈预防。

英文摘要

Telecommunication fraud is an acute problem that leads to substantial material losses and compromises the reliability of telecom systems worldwide. Only effective and efficient detection mechanisms can help to deal with these threats, though there are certain shifts in the approaches to fraud detection. This paper evaluates the performance of AI-driven models for fraud detection in telecommunication networks using Call Detail Record (CDR) datasets. This study focuses on fraud detection in telecom networks using the Telecom CDR dataset, which contains 101,174 customer records with 17 attributes, including 8,830 fraud cases. In feature preprocessing, missing values were handled, followed by data scaling using Min-Max scaling and data balancing using the SMOTE technique. Random Forest (RF) and XGBoost models were trained on the dataset for predictive analysis. F1-score, ROC AUC, recall, accuracy, time, and precision were used to compare the performance of the two models. RF reached an accuracy of 99.9%, while XGBoost reached 99.7%. Findings show that the suggested framework successfully detects fraud with few misclassifications. Several machine learning models were evaluated and contrasted, such as RF, XGBoost, DBSCAN, RoBERTa, and K-means. Among all the models, RF gave the highest performance, with accuracy, precision, recall, and F1-score all at 99.9%, outperforming XGBoost, GNN and BERT. The findings emphasize RF as the most effective model for detecting fraudulent activities in telecom networks, ensuring robust and reliable prevention of fraud.
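The dataset figures in the abstract imply a strong class imbalance, which is why a balancing step such as SMOTE appears in the pipeline. The sketch below uses only the two counts quoted above to compute the fraud rate and the accuracy of a trivial "always predict non-fraud" baseline, and adds a plain Min-Max scaler for a single column. The function names are hypothetical, and this is not the authors' code.

```python
TOTAL_RECORDS = 101_174  # customer records, as stated in the abstract
FRAUD_CASES = 8_830      # fraud cases, as stated in the abstract


def fraud_rate(total, fraud):
    """Fraction of records labelled as fraud."""
    return fraud / total


def majority_baseline_accuracy(total, fraud):
    """Accuracy of a classifier that always predicts the majority class."""
    return max(fraud, total - fraud) / total


def min_max_scale(values):
    """Rescale one numeric column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]
```

With these counts the fraud rate is a little under 9%, so a model that never flags fraud already scores above 91% accuracy; precision, recall, and F1 on the fraud class are therefore the more informative metrics.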

2605.17076 2026-05-25 cs.LG cs.AI cs.DC cs.MA 版本更新

S-Bus: Automatic Read-Set Reconstruction for Multi-Agent LLM State Coordination

S-Bus: 多智能体LLM状态协调的自动读集重建

Sajjad Khan

发表机构 * Sajjad Khan

AI总结 本文提出了一种名为 S-Bus 的 HTTP 中间件,用于解决多智能体 LLM 在共享可变状态时的并发控制问题,尤其针对无法声明读集的场景。其核心机制 DeliveryLog 能够在提交时从观察到的 HTTP GET 流量中重建每个智能体的读集,从而实现一种名为“可观测读隔离”(ORI)的一致性保证,有效防止分片拓扑中的结构化竞态条件。研究贡献包括形式化验证、与传统数据库的性能对比以及对 ORI 在不同工作负载下的语义影响分析。

Comments v2: LLM judge validated against human annotator (Zahid Hussain, Mindgigs Peshawar) on PH-3 at strict kappa=0.93 (n=93, 96.8% agreement); over-claim refined to 32% (LLM) / 49% (human). Adds Exp.PG-Comparison Rust-Native and Workload-B chi2=1094.98. 24 pages, 23 tables. Annotation data attached as arXiv ancillary files

详情
AI中文摘要

我们解决了通过HTTP共享可变状态的LLM智能体的并发控制问题,其中智能体无法被修改以声明读集。S-Bus是一个HTTP中间件,其核心机制——服务端DeliveryLog——在提交时从观察到的HTTP GET流量中重建每个智能体的读集。它提供的一致性属性——可观测读隔离(ORI),一种基于HTTP可观测读投影的部分因果一致性——防止了专用分片拓扑中的结构性竞态条件。 三项贡献:(C1)DeliveryLog机制,具有三层机械化证据:TLAPS证明了ReadSetSoundness和ORICommitSafety(基于一个类型公理);N=3时的穷举TLC探索了20,763,484个状态,零违规;Dafny验证了9个归纳引理。(C2)与PostgreSQL 17 SERIALIZABLE和Redis 7 WATCH/MULTI的经验安全对等:在884,110次提交尝试中(其中427,308次处于活跃争用下)零Type-I损坏。(C3)ORI在专用分片工作负载中语义中性,但在单分片协作写入中有害,因为保留传播并发矛盾。 v2更新:PH-3 LLM评判器现在已针对人类标注者(Zahid Hussain, Mindgigs Peshawar)在400个(步骤,分片)对上进行独立验证,严格kappa=0.93(n=93,原始一致性96.8%)。LLM间评判器一致性为kappa=0.46(边界方差)。智能体自我报告高估分片使用量32%(LLM评判器)至49%(人类标注者)。SJ-v4语义质量评分标准仍为单评判器LLM-only。 源代码、形式化证明、测试框架、标注数据:https://github.com/sajjadanwar0/sbus

英文摘要

We address concurrency control for LLM agents sharing mutable state over HTTP, where agents cannot be modified to declare read sets. S-Bus is an HTTP middleware whose central mechanism, a server-side DeliveryLog, reconstructs each agent's read set at commit time from observed HTTP GET traffic. The consistency property it provides -- Observable-Read Isolation (ORI), a partial causal consistency over the HTTP-observable read projection -- prevents Structural Race Conditions in dedicated-shard topologies. Three contributions. (C1) DeliveryLog mechanism with three-tier mechanised evidence: TLAPS proves ReadSetSoundness and ORICommitSafety (modulo one typing axiom); exhaustive TLC at N=3 explores 20,763,484 states with zero violations; Dafny discharges 9 inductive lemmas. (C2) Empirical safety parity against PostgreSQL 17 SERIALIZABLE and Redis 7 WATCH/MULTI: zero Type-I corruptions across 884,110 commit attempts (427,308 under active contention). (C3) ORI is semantically neutral in dedicated-shard workloads but harmful in single-shard collaborative writing because preservation propagates concurrent contradictions. v2 update: the PH-3 LLM judge is now independently validated against a human annotator (Zahid Hussain, Mindgigs Peshawar) on 400 (step, shard) pairs at strict kappa=0.93 (n=93, 96.8% raw agreement). Inter-LLM-judge agreement is kappa=0.46 (boundary variance). Agent self-reports over-claim shard usage by 32% (LLM judge) to 49% (human annotator). The SJ-v4 semantic-quality rubric remains single-judge LLM-only. Source code, formal proofs, harness, annotation data: https://github.com/sajjadanwar0/sbus
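The idea of reconstructing read sets from observed GET traffic, rather than asking agents to declare them, can be sketched with a toy in-memory log. This is an illustration under stated assumptions, not the S-Bus implementation: the class layout, method names, and the version-check rule (a plain optimistic validation) are all hypothetical, and no HTTP layer is modelled.

```python
class DeliveryLog:
    """Toy server-side log of which shard versions each agent has been served."""

    def __init__(self):
        self.versions = {}   # shard -> current committed version
        self.values = {}     # shard -> current committed value
        self.delivered = {}  # agent -> {shard: version served by its latest GET}

    def get(self, agent, shard):
        """Serve a read and record it; the agent declares nothing itself."""
        version = self.versions.get(shard, 0)
        self.delivered.setdefault(agent, {})[shard] = version
        return self.values.get(shard), version

    def commit(self, agent, shard, value):
        """Accept the write only if every shard the agent read is still current."""
        read_set = self.delivered.get(agent, {})
        stale = sorted(
            s for s, v in read_set.items() if self.versions.get(s, 0) != v
        )
        if stale:
            return False, stale
        self.versions[shard] = self.versions.get(shard, 0) + 1
        self.values[shard] = value
        self.delivered[agent] = {}  # start a fresh read set after a commit
        return True, []
```

In this sketch a rejected agent simply reads again, which refreshes its recorded versions, and then retries the commit.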

2605.16799 2026-05-25 cs.LG cs.AI 版本更新

Cross-Domain Molecular Relational Learning: Leveraging Chemical Structure-Activity Analysis

跨域分子关系学习:利用化学结构-活性分析

Peiliang Zhang, Jingling Yuan, Shiqing Wu, Mengqing Hu, Chao Che, Yongjun Zhu, Lin Li

发表机构 * Wuhan University of Technology(武汉理工大学) Yonsei University(延世大学) Hubei Key Laboratory of Transportation Internet of Things(湖北省交通运输物联网重点实验室) State Key Laboratory of Silicate Materials for Architectures(建筑硅酸盐材料国家重点实验室) City University of Macau(澳门城市大学) Kyung Hee University(庆熙大学) Dalian University(大连大学)

AI总结 该研究针对分子关系学习中跨领域建模的不足,提出了一种基于结构-活性分析的跨领域分子关系学习方法。核心方法是引入结构语义迁移差异的领域对抗训练网络(DisTrans),通过子结构拓扑差异引导模型学习分子结构的领域依赖性,并对齐源域与目标域的功能团语义信息,从而提升跨领域适应能力。实验表明,该方法在两种典型跨领域场景下优于16种基线方法,具有良好的泛化性能。

Comments Accepted by SIGKDD 2026 Research Track

详情
AI中文摘要

分子表示的最新进展整合了分子拓扑和视觉模态,为精确的分子关系学习(MRL)开辟了新途径。现有的MRL方法专注于域内建模,其固有的域封闭效应限制了在分子科学中的适用性,特别是在阐明跨域相互作用机制方面。因此,跨域分子关系学习的必要性日益迫切。受益于结构-活性分析,我们提出了具有结构语义迁移差异的域对抗训练网络(DisTrans),以优化分子结构和视觉图像的跨域自适应表示。1)我们利用基于域间子结构拓扑差异的梯度反转策略来学习分子结构的域依赖性。该策略引导模型适应目标域中的结构邻接模式,生成域可分离的结构表示。2)我们应用跨域表示引导机制来对齐源域和目标域之间的官能团语义信息,学习跨域一致性信息。在两种典型跨域策略中的实验结果表明,DisTrans优于16种基线方法,即使在显著的域间差异下也能保持令人满意的性能。

英文摘要

Recent advances in molecular representation integrate molecular topological and visual modalities, opening new avenues for precise Molecular Relational Learning (MRL). Existing MRL methods focus on intra-domain modeling, and their inherent domain-closed effect limits applicability to molecular science, particularly in elucidating cross-domain interaction mechanisms. Consequently, the need for Cross-Domain Molecular Relational Learning has become increasingly pressing. Building on structure-activity analysis, we propose the Domain Adversarial Training Network with Structural-Semantic Transfer Discrepancy (DisTrans) to optimize cross-domain adaptive representation for molecular structures and visual images. 1) We employ a gradient reversal strategy based on substructure topological discrepancies between domains to learn the domain dependence of molecular structures. This strategy guides the model to adapt to the structural adjacency patterns in the target domain, generating domain-separable structural representations. 2) We apply a cross-domain representation guidance mechanism to align the functional-group semantic information between the source and target domains, learning cross-domain consistency information. Experimental results under two typical cross-domain strategies demonstrate that DisTrans outperforms 16 baseline methods, maintaining satisfactory performance even under pronounced inter-domain discrepancy.

2605.11490 2026-05-25 cs.LG stat.ML 版本更新

Adaptive Calibration in Non-Stationary Environments

非平稳环境中的自适应校准

Junyan Liu, Haipeng Luo, Lillian J. Ratliff

发表机构 * University of Washington(华盛顿大学) University of Southern California(南加州大学)

AI总结 在非平稳环境中实现自适应校准是现代AI系统中的核心挑战。本文提出了一类能够根据环境非平稳程度自动调整校准误差的在线预测算法,在i.i.d.和对抗性环境之间实现平滑过渡。该方法在多种校准度量下均取得了理论保证,其误差上界在平稳和对抗性场景下均达到最优,并扩展了先前相关工作,引入了基于阶段的调度策略和预测空间的非均匀划分技术。

Comments Added results for piecewise-stationary environments and included a comparison with the concurrent work of Huang et al. (arXiv:2605.09273)

详情
AI中文摘要

在现代AI系统中,进行校准的在线预测是一个核心挑战。现有文献大多关注完全对抗性环境,其中结果可能是任意的,导致算法保守,在更温和的设置(如结果近乎平稳)中表现次优。这一差距引发了一个自然问题:我们能否设计在线预测算法,其校准误差自动适应环境的非平稳程度,在独立同分布和对抗性场景之间平滑插值?我们对此问题给出肯定回答,并开发了一套算法,在多种校准度量下实现自适应校准保证。具体地,设$T$为轮数,$K$为环境中未知的独立同分布段数,$C\in[0,T]$为另一个未知的非平稳度量(定义为均值结果的最小$\ell_1$偏差),我们的算法对$\ell_1$校准误差达到$\widetilde{O}(\min\{\sqrt{T}+(TC)^{\frac{1}{3}}, \sqrt{KT}\})$,对$\ell_2$和伪KL校准误差均达到$\widetilde{O}(\min\{(1+C)^{\frac{1}{3}}, K\})$。这些界匹配平稳情况($C=0$且$K=1$)的最优率,并在完全对抗性场景($C, K=\Omega(T)$)中恢复已知保证。我们的方法建立在并扩展了先前工作[Hu等人,2026,Luo等人,2025]的基础上,引入基于epoch的调度以及对预测空间进行新颖的非均匀划分,在底层真实值附近分配更精细的分辨率。

英文摘要

Making calibrated online predictions is a central challenge in modern AI systems. Much of the existing literature focuses on fully adversarial environments where outcomes may be arbitrary, leading to conservative algorithms that can perform suboptimally in more benign settings, such as when outcomes are nearly stationary. This gap raises a natural question: can we design online prediction algorithms whose calibration error automatically adapts to the degree of non-stationarity in the environment, smoothly interpolating between i.i.d. and adversarial regimes? We answer this question in the affirmative and develop a suite of algorithms that achieve adaptive calibration guarantees under multiple calibration measures. Specifically, with $T$ being the number of rounds, $K$ being the unknown number of i.i.d. segments of the environment, and $C\in[0,T]$ being another unknown non-stationary measure defined as the minimal $\ell_1$ deviation of the mean outcomes, our algorithms attain $\widetilde{O}(\min\{\sqrt{T}+(TC)^{\frac{1}{3}}, \sqrt{KT}\})$ for $\ell_1$ calibration error and $\widetilde{O}(\min\{(1+C)^{\frac{1}{3}}, K\})$ for both $\ell_2$ and pseudo KL calibration error. These bounds match the optimal rates in the stationary case ($C=0$ and $K=1$) and recover known guarantees in the fully adversarial regime ($C, K=Ω(T)$). Our approach builds on and extends prior work [Hu et al., 2026, Luo et al., 2025], introducing an epoch-based scheduling together with a novel non-uniform partition of the prediction space that allocates finer resolution near the underlying ground truth.
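For readers unfamiliar with the quantity being bounded, the sketch below computes $\ell_1$ calibration error under one common definition: group rounds by the predicted value, sum the signed gaps between prediction and outcome within each group, and add up the absolute group totals. The abstract does not spell out its definition, so this form is an assumption, and the function name is hypothetical.

```python
from collections import defaultdict


def l1_calibration_error(predictions, outcomes):
    """Sum over distinct predicted values p of |sum_{t: p_t = p} (p_t - y_t)|."""
    gaps = defaultdict(float)
    for p, y in zip(predictions, outcomes):
        gaps[p] += p - y
    return sum(abs(g) for g in gaps.values())
```

A forecaster that always predicts 0.5 on an alternating 1, 0, 1, 0 sequence has zero error under this measure, while always predicting 1.0 when every outcome is 0 accumulates one unit of error per round.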

2605.11053 2026-05-25 cs.CR cs.AI cs.LG 版本更新

Content-Aware Attack Detection in LLM Agent Tool-Call Traffic: An Empirical Study of Features, Architectures, and Evaluation Protocols

LLM Agent工具调用流量中的内容感知攻击检测:特征、架构与评估协议的实证研究

Sultan Zavrak

发表机构 * Department of Computer Engineering, Duzce University(杜兹大学计算机工程系)

AI总结 本文研究了大语言模型代理在调用外部工具时的流量攻击检测问题,提出了一种基于内容感知的检测框架,将每个代理会话建模为图结构,并结合语句嵌入特征进行分类。研究对比了多种图神经网络和传统机器学习模型,发现内容级别的特征对检测性能至关重要,且基于SBERT的嵌入特征在多个数据集上表现优异,优于图神经网络和MLP模型。此外,研究还揭示了数据划分方式对评估结果的影响,并指出先前工作未充分考虑这一问题。

Comments v2: renamed manuscript (brand removed; descriptive title). No changes to methodology, results, tables, or figures

详情
AI中文摘要

模型上下文协议(MCP)已成为LLM agent调用外部工具的广泛采用的接口,然而对MCP工具调用流量的学习监控仍未被充分探索。本文提出的检测器是一个针对MCP工具调用流量的攻击检测框架,它将每个agent会话编码为图(工具调用作为节点,顺序和数据流链接作为边),通过参数和响应的句子嵌入特征丰富节点,并将会话分类为良性或受攻击。评估了三种GNN架构(GAT、GCN、GraphSAGE)、一个无图MLP以及经典基线(XGBoost、随机森林、逻辑回归、线性SVM),完整架构比较在RAS-Eval(任务分层分割)上进行,GraphSAGE作为GNN基线保留在ATBench和组合源变体(均标签分层)上。得出三个发现。首先,内容级特征至关重要:仅元数据检测的AUROC停滞在0.64左右,无论架构如何,而内容嵌入将AUROC推高至0.89以上。其次,相对于任务不相交分割,朴素随机分割评估将AUROC高估多达26个百分点,这是先前agent检测工作未解决的记忆混淆问题。第三,检测信号主要存在于SBERT内容嵌入中:在池化嵌入上,树集成达到了0.975的AUROC,在大多数情况下优于主要RAS-Eval设置中的神经架构,包括GNN(0.917)和MLP(0.896),并且自监督预训练在此任务上未带来标签效率优势。

英文摘要

The Model Context Protocol (MCP) has become a widely adopted interface for LLM agents to invoke external tools, yet learned monitoring of MCP tool-call traffic remains underexplored. This article presents an attack detection framework for MCP tool-call traffic that encodes each agent session as a graph (tool calls as nodes, sequential and data-flow links as edges), enriches nodes with sentence-embedding features over arguments and responses, and classifies sessions as benign or attacked. Three GNN architectures (GAT, GCN, GraphSAGE), a no-graph MLP, and classical baselines (XGBoost, random forest, logistic regression, linear SVM) are evaluated, with the full architecture comparison conducted on RAS-Eval (task-stratified splits) and GraphSAGE retained as the GNN baseline on ATBench and a combined-source variant (both label-stratified). Three findings emerge. First, content-level features are essential: metadata-only detection plateaus around an AUROC of 0.64 regardless of architecture, while content embeddings push the AUROC above 0.89. Second, naive random-split evaluation inflates AUROC by up to 26 percentage points relative to task-disjoint splits, a memorization confound that prior agent-detection work has not addressed. Third, the detection signal resides primarily in the SBERT content embeddings: tree ensembles on pooled embeddings reach an AUROC of 0.975, for the most part outperforming the neural architectures in the primary RAS-Eval setting, including GNNs (0.917) and the MLP (0.896), and self-supervised pre-training does not deliver a label-efficiency advantage on this task.
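The second finding turns on how sessions are assigned to train and test sets. A minimal sketch of a task-disjoint split is given below, assuming each session is a dictionary with a hypothetical `"task"` key; the field name and helper names are illustrative and do not come from the RAS-Eval or ATBench tooling.

```python
import random


def task_disjoint_split(sessions, test_fraction=0.2, seed=0):
    """Hold out whole tasks, so no task contributes sessions to both sides."""
    tasks = sorted({s["task"] for s in sessions})
    random.Random(seed).shuffle(tasks)
    n_test = max(1, round(len(tasks) * test_fraction))
    test_tasks = set(tasks[:n_test])
    train = [s for s in sessions if s["task"] not in test_tasks]
    test = [s for s in sessions if s["task"] in test_tasks]
    return train, test


def shared_tasks(train, test):
    """Tasks that appear on both sides of a split; empty for a disjoint split."""
    return {s["task"] for s in train} & {s["task"] for s in test}
```

A random split over sessions would usually leave `shared_tasks` non-empty, which lets a model score well by memorizing task-specific content rather than detecting attacks.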

2605.10220 2026-05-25 astro-ph.GA cs.LG 版本更新

Stellar Age Compression Reshapes Interpretations of the Milky Way Thick-Disk Formation History

恒星年龄压缩重塑对银河系厚盘形成历史的解释

Zhipeng Zhang

发表机构 * China Mobile Research Institute(中国移动研究院) China Mobile GBA (Greater Bay Area) Innovation Institute(中国移动粤港澳大湾区创新研究院)

AI总结 银河厚盘的形成时间尺度是银河考古学中的核心问题之一。本研究通过比较光谱推断年龄和星震学年龄两种独立的恒星年龄标度,发现厚盘形成历史的关键观测特征在星震学锚定下发生了系统性变化,表明之前支持快速形成的观点可能受到恒星年龄压缩效应的影响。研究进一步表明,年龄压缩变换本身即可解释快速形成特征的观测结果,无需假设厚盘本身具有突发形成的历史,揭示了银河形成历史的统计解释可能高度依赖于恒星年龄的定义。

详情
AI中文摘要

银河系厚盘的形成时标是银河考古学的核心争论之一。年龄-金属丰度关系(AMR)、形成时标和化学演化梯度常被用来推断厚盘的快速聚集、短时标增丰和爆发式形成历史。然而,恒星年龄并非直接可观测,这引入了推断年龄可能因观测质量而存在系统性压缩的潜在风险。在本文中,我们使用相同的恒星样本和相同的物理协变量匹配条件,但采用两种独立的年龄标度——光谱推断年龄(astroNN)和星震学年龄(APOKASC-3)——来比较厚盘形成历史的可观测特征。我们发现,先前支持厚盘快速形成的几个关键可观测特征在星震学锚定下系统性减弱:AMR斜率从-3.29变为-1.86 Gyr dex⁻¹(Δa = +1.43),形成时标从3.04 Gyr展宽至3.55 Gyr,峰值形成年龄从9.1 Gyr移至6.0 Gyr。通过传输反演实验,我们进一步表明加性噪声只能展宽年龄分布而无法重现上述模式,而压缩性传输映射(λ < 1)能同时重现更窄的年龄分布、更陡的AMR以及类似快速形成的可观测特征。这一结果表明,压缩变换本身足以产生有利于快速形成的可观测特征,而无需内在的爆发式形成历史。我们的发现揭示了银河系形成历史的统计解释可能敏感地依赖于恒星年龄定义本身。

英文摘要

The formation timescale of the Milky Way thick disk is one of the central debates in Galactic archaeology. The age-metallicity relation (AMR), formation timescale, and chemical evolution gradients are frequently used to infer a rapid assembly, short-timescale enrichment, and bursty formation history of the thick disk. However, stellar ages are not directly observable, introducing the potential risk that inferred ages may harbor a systematic compression tied to observational quality. In this paper, we use the same stellar sample and identical physical covariate matching conditions, but two independent age scales--spectroscopic inferred ages (astroNN) and asteroseismic ages (APOKASC-3)--to compare the observable signatures of the thick-disk formation history. We find that several key observables previously supporting a rapid thick-disk formation are systematically weakened under seismic anchoring: the AMR slope flattens from -3.29 to -1.86 Gyr dex$^{-1}$ ($\Delta a = +1.43$), the formation timescale widens from 3.04 to 3.55 Gyr, and the peak formation age shifts from 9.1 to 6.0 Gyr. Through transport inversion experiments, we further show that additive noise can only broaden the age distribution and cannot reproduce the above pattern, whereas a compressive transport map ($\lambda < 1$) simultaneously reproduces a narrower age distribution, a steeper AMR, and rapid-formation-like observables. This result indicates that the compression transformation itself is sufficient to generate rapid-formation-friendly observables without requiring an intrinsically bursty formation history. Our findings reveal that statistical interpretations of the Milky Way formation history may depend sensitively on the stellar age definition itself.
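The contrast between additive noise and a compressive map can be seen in a few lines. The sketch below assumes the simplest possible compressive map, a linear contraction of ages toward their mean by a factor λ; the paper's actual transport map may be different, and the function names and the synthetic ages are hypothetical.

```python
import random
import statistics


def compress(ages, lam):
    """Contract ages toward their mean by a factor lam (lam < 1 narrows the spread)."""
    mu = statistics.fmean(ages)
    return [mu + lam * (a - mu) for a in ages]


def add_noise(ages, sigma, seed=0):
    """Add independent Gaussian noise to each age; this can only widen the spread on average."""
    rng = random.Random(seed)
    return [a + rng.gauss(0.0, sigma) for a in ages]


# Synthetic ages spread evenly between 2 and 12 Gyr (illustrative only).
AGES = [2.0 + 10.0 * i / 999 for i in range(1000)]
```

Under the linear map the standard deviation scales exactly by λ while the mean is unchanged, so a narrow age distribution can arise from the measurement scale alone; additive noise moves the spread in the opposite direction.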

2605.10219 2026-05-25 math.OC cs.CC cs.LG 版本更新

Parameterized Complexity of Stationarity Testing for Piecewise-Affine Functions and Shallow CNN Losses

分段仿射函数与浅层CNN损失的平稳性检验的参数化复杂性

Yuhan Ye

发表机构 * MIT(麻省理工学院)

AI总结 本文研究了在给定的点上测试连续分段仿射(PA)函数近似一阶平稳性的参数化复杂度问题,这是非光滑优化中的基本任务。作者从参数化复杂度的角度出发,以环境维度 $d$ 为参数,给出了固定维度下的XP算法,并证明了其对立面的W[1]-难性。此外,研究还扩展到浅层ReLU卷积神经网络的训练损失函数,表明相同参数化复杂度的结论也适用于这类简单CNN的训练问题。

Comments 32 pages, 1 figure, 1 table

详情
AI中文摘要

我们研究了在指定点检验连续分段仿射(PA)函数的近似一阶平稳性的参数化复杂性,这是非光滑优化中的基本任务。PA函数构成了非光滑平稳性检验的典型模型,并捕捉了ReLU型训练损失中出现的局部多面体几何。Tian和So(SODA 2025)最近的工作表明,在最坏情况下,PA函数的近似平稳性概念检验在计算上难以处理,并将固定维度的可处理性确定为一个开放方向。我们从参数化复杂性的角度处理这一方向,以环境维度$d$作为参数。在本文中,我们为可处理侧给出了固定维度的XP算法,并为互补侧证明了W[1]-难度。此外,在指数时间假设下的下界排除了运行时间为$ρ(d)\,\mathrm{size}^{o(d)}$的算法,其中$\mathrm{size}$表示平稳性检验实例的总二进制编码长度,$ρ$为任意可计算函数。作为进一步的结果,我们的结果给出了检验连续PA函数局部极小性的相应参数化复杂性图景。我们进一步将硬度结果推广到一系列浅层ReLU CNN训练损失,在可训练权重空间中检验平稳性。因此,简单的CNN训练损失也出现了相同的参数化复杂性图景。

英文摘要

We study the parameterized complexity of testing approximate first-order stationarity at a prescribed point for continuous piecewise-affine (PA) functions, a basic task in nonsmooth optimization. PA functions form a canonical model for nonsmooth stationarity testing and capture the local polyhedral geometry that appears in ReLU-type training losses. Recent work by Tian and So (SODA 2025) shows that testing approximate stationarity notions for PA functions is computationally intractable in the worst case, and identifies fixed-dimensional tractability as an open direction. We address this direction from the viewpoint of parameterized complexity, with the ambient dimension $d$ as the parameter. In this paper, we give XP algorithms in fixed dimension for the tractable sides, and prove W[1]-hardness for the complementary sides. Moreover, lower bounds under the Exponential Time Hypothesis rule out algorithms running in time $ρ(d)\,\mathrm{size}^{o(d)}$ for any computable function $ρ$, where $\mathrm{size}$ denotes the total binary encoding length of the stationarity-testing instance. As a further consequence, our results yield the corresponding parameterized complexity picture for testing local minimality of continuous PA functions. We further extend our hardness results to a family of shallow ReLU CNN training losses, with stationarity tested in the trainable weight space. Thus, the same parameterized-complexity picture also appears for simple CNN training losses.

2605.07220 2026-05-25 cs.LG 版本更新

On the Robustness of Distribution Support under Diffusion Guidance

扩散引导下分布支撑的鲁棒性研究

Ruijia Cao, Yuchen Wu, Nisha Chandramoorthy

发表机构 * Center for Applied Mathematics, Cornell University(康奈尔大学应用数学中心) School of Operations Research and Information Engineering, Cornell University(康奈尔大学运筹学与信息工程学院) Department of Statistics, The University of Chicago(芝加哥大学统计学系)

AI总结 本文研究了扩散引导在生成样本时对分布支撑集的鲁棒性问题,揭示了其为何能持续生成高质量样本的理论原因。作者通过建立扩散引导过程在精确得分函数下的支撑集鲁棒性性质,证明其生成的样本几乎总是接近目标分布的支撑集,从而保证了样本的结构合理性。该分析适用于多种扩散模型和离散化方案,为理解扩散引导生成物理合理样本提供了理论依据。

详情
AI中文摘要

扩散引导是一种强大的技术,能够通过扩散模型实现可控且高保真的样本生成。在高层次上,它通过引入引导项来修改得分函数,从而将生成过程导向所需条件。尽管在经验上取得了成功,但扩散引导的理论性质在很大程度上仍未得到探索,并且尚不清楚它为何能持续生成高质量样本。在这项工作中,我们通过建立支撑的鲁棒性性质来解释扩散引导的有效性。具体来说,我们表明,在精确访问得分函数的情况下,引导扩散过程几乎总是生成接近目标支撑的样本。这一性质尤其理想,因为偏离支撑的样本通常在结构上不可信,并可能对下游任务产生不利影响。我们的分析涵盖了去噪扩散隐式模型(DDIM)和去噪扩散概率模型(DDPM),并适用于由指数积分器引起的广泛离散化方案。我们的结果为理解扩散引导为何能生成物理上有意义且结构合理的样本提供了严格的基础。

英文摘要

Diffusion guidance is a powerful technique that enables controllable and high-fidelity sample generation with diffusion models. At a high level, it modifies the score function by incorporating a guidance term that steers the generative process toward a desired condition. Despite its empirical success, the theoretical properties of diffusion guidance remain largely unexplored, and it is not well understood why it consistently produces high-quality samples. In this work, we explain the effectiveness of diffusion guidance by establishing a robustness of support property. Specifically, we show that, given exact access to the score functions, guided diffusion processes almost always generate samples that remain close to the target support. This property is particularly desirable, as samples that lie off the support are often structurally implausible and may adversely affect downstream tasks. Our analysis covers both Denoising Diffusion Implicit Models (DDIM) and Denoising Diffusion Probabilistic Models (DDPM), and applies to a wide range of discretization schemes induced by exponential integrators. Our results provide a rigorous foundation for understanding why diffusion guidance produces physically meaningful and structurally plausible samples.
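As a reading aid, one common way to write "modifying the score function by a guidance term" is the classifier-free form below. The abstract does not say which guidance form the paper analyzes, so this is only an illustrative instance; the symbols $w$ (guidance strength), $c$ (condition) and $p_t$ (noised data density at time $t$) are notation assumed here, not taken from the paper.

```latex
% Guided score: unconditional score plus a weighted guidance term.
\tilde{s}_t(x \mid c)
  = \nabla_x \log p_t(x)
  + w \,\bigl( \nabla_x \log p_t(x \mid c) - \nabla_x \log p_t(x) \bigr)
% w = 0 recovers the unconditional score, w = 1 the conditional score,
% and w > 1 pushes samples further toward the condition c.
```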

2605.04568 2026-05-25 cs.LG cs.AI cs.RO 版本更新

Dream-MPC: Gradient-Based Model Predictive Control with Latent Imagination

Dream-MPC:基于梯度与潜在想象的模型预测控制

Jonathan Spieler, Sven Behnke

发表机构 * Autonomous Intelligent Systems, Computer Science Institute VI - Intelligent Systems and Robotics, Center for Robotics and the Lamarr Institute for Machine Learning and Artificial Intelligence, University of Bonn, Germany(德国波恩大学计算机科学研究所VI「智能系统与机器人学」自主智能系统组、机器人中心、拉马尔机器学习与人工智能研究所)

AI总结 本文提出了一种名为 Dream-MPC 的新型模型预测控制方法,结合了梯度上升优化与学习到的世界模型,通过生成少量候选轨迹并利用不确定性正则化和优化迭代的复用机制进行优化。该方法在24个连续控制任务中表现出色,显著提升了基础策略的性能,优于传统的无梯度MPC和先进基线方法。

Comments Accepted for International Conference on Machine Learning (ICML) 2026

详情
AI中文摘要

最先进的基于模型的强化学习方法要么使用无梯度、基于种群的规划方法,要么使用学习到的策略网络,或者结合策略网络和规划。将模型预测控制(MPC)与学习到的模型和策略先验相结合的混合方法,以利用两种范式的优势,已显示出有希望的结果。然而,这些方法通常依赖于无梯度优化方法,对于高维控制任务可能计算成本高昂。虽然基于梯度的方法是一个有前途的替代方案,但最近的工作经验表明,基于梯度的方法通常比无梯度方法表现更差。我们提出了Dream-MPC,一种新颖的方法,从展开的策略生成少量候选轨迹,并通过使用学习的世界模型、不确定性正则化和通过重用先前优化的动作随时间摊销优化迭代,对每个轨迹进行梯度上升优化。我们在24个连续控制任务上的结果表明,Dream-MPC可以显著提高底层策略的性能,并且可以优于无梯度MPC和最先进的基线。代码和视频可在https://dream-mpc.github.io获取。

英文摘要

State-of-the-art model-based Reinforcement Learning (RL) approaches either use gradient-free, population-based methods for planning, learned policy networks, or a combination of policy networks and planning. Hybrid approaches that combine Model Predictive Control (MPC) with a learned model and a policy prior to leverage the advantages of both paradigms have shown promising results. However, these approaches typically rely on gradient-free optimization methods, which can be computationally expensive for high-dimensional control tasks. While gradient-based methods are a promising alternative, recent works have empirically shown that gradient-based methods often perform worse than their gradient-free counterparts. We propose Dream-MPC, a novel approach that generates few candidate trajectories from a rolled-out policy and optimizes each trajectory by gradient ascent using a learned world model, uncertainty regularization and amortization of optimization iterations over time by reusing previously optimized actions. Our results on 24 continuous control tasks show that Dream-MPC can significantly improve the performance of the underlying policy and can outperform gradient-free MPC and state-of-the-art baselines. Code and videos are available at https://dream-mpc.github.io.

2604.24810 2026-05-25 cs.LG cs.AI 版本更新

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

自适应深度神经网络中上置信界算法的性能比较分析

Grigorios Papanikolaou, Ioannis Kontopoulos, Konstantinos Tserpes

发表机构 * National Technical University of Athens, Greece(希腊雅典国立技术大学)

AI总结 在边缘计算环境中,由于对能耗和延迟的严格限制,深度神经网络的部署面临挑战。本文基于自适应深度神经网络(ADNNs),引入四种改进的上置信界(UCB)策略,包括UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK,首次对这些策略在精度、能耗和延迟之间的权衡进行了系统比较。实验表明,UCB-Bayes收敛最快,而UCB-V和UCB-Tuned在精度-延迟和精度-能耗的帕累托前沿上表现最优。

Comments The paper has been accepted for publication in IEEE SMARTCOMP 2026

详情
AI中文摘要

边缘计算环境对能耗和延迟施加了严格限制,使得深度神经网络的部署面临重大挑战。因此,在边缘计算场景中,能够动态平衡计算成本或延迟与预测准确性的智能自适应推理策略至关重要。在这项工作中,我们基于采用多臂老虎机(MAB)框架的自适应深度神经网络(ADNN)。现有文献利用第一版上置信界(UCB1)策略动态选择最优置信阈值,从而在不牺牲准确率的情况下实现高效早期退出。然而,我们在ADNN中引入了四种额外的上置信界策略,即UCB-V、UCB-Tuned、UCB-Bayes和UCB-BwK,并首次对这些策略在准确率、能耗和延迟之间的权衡进行了比较研究。所提出的UCB策略应用于ResNet和MobileViT神经网络,并在CIFAR-10、CIFAR-10.1和CIFAR-100基准数据集上进行评估。实验结果表明,所有策略均实现了次线性累积遗憾,其中UCB-Bayes收敛最快,其次是UCB-Tuned和UCB-V。最后,UCB-V和UCB-Tuned在准确率-延迟和准确率-能耗权衡的帕累托前沿上占据主导地位。实现代码可在此处获取:https://github.com/gr3gor1/MAB_UCB

英文摘要

Edge computing environments impose strict constraints on energy consumption and latency, making the deployment of deep neural networks a significant challenge. Therefore, smart and adaptive inference strategies that dynamically balance computational cost or latency with predictive accuracy are critical in edge computing scenarios. In this work, we build on Adaptive Deep Neural Networks (ADNNs) that employ the Multi-Armed Bandit (MAB) framework. Current literature leverages the first version of the Upper Confidence Bound (UCB1) strategy to dynamically select the optimal confidence threshold, enabling efficient early exits without sacrificing accuracy. However, we introduce four additional Upper Confidence Bound strategies in ADNNs, namely UCB-V, UCB-Tuned, UCB-Bayes, and UCB-BwK, and perform, for the first time, a comparative study of these strategies with respect to trade-offs between accuracy, energy consumption, and latency. The proposed UCB strategies are employed on the ResNet and MobileViT neural networks, and are evaluated on the benchmark datasets of CIFAR-10, CIFAR-10.1, and CIFAR-100. Experimental results demonstrate that all strategies achieve sub-linear cumulative regret, with UCB-Bayes converging the fastest, followed by UCB-Tuned and UCB-V. Finally, UCB-V and UCB-Tuned dominate the Pareto Frontiers of accuracy-latency and accuracy-energy trade-offs. The implementation code is available here: https://github.com/gr3gor1/MAB_UCB
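To make the bandit framing concrete, here is a minimal sketch of the standard UCB1 rule, which the abstract names as the existing baseline for choosing a confidence threshold. It is not the authors' implementation: the candidate thresholds, the toy Bernoulli rewards and the helper names `ucb1_select`, `run_bandit` and `toy_reward` are all hypothetical, and the paper's actual reward (which trades off accuracy, energy and latency) is not reproduced here.

```python
import math
import random

# Hypothetical candidate early-exit confidence thresholds (one arm each)
THRESHOLDS = [0.5, 0.7, 0.9]
# Toy Bernoulli success rates standing in for the real reward signal
TOY_RATES = [0.3, 0.8, 0.5]


def ucb1_select(counts, means, t, c=2.0):
    """Return the arm with the highest UCB1 score at round t (1-indexed)."""
    # Pull every arm once before the confidence bonus is defined.
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [m + math.sqrt(c * math.log(t) / n) for m, n in zip(means, counts)]
    return max(range(len(scores)), key=scores.__getitem__)


def run_bandit(reward_fn, n_arms, horizon, seed=0):
    """Run UCB1 for `horizon` rounds and return pull counts and mean rewards."""
    rng = random.Random(seed)
    counts = [0] * n_arms
    means = [0.0] * n_arms
    for t in range(1, horizon + 1):
        arm = ucb1_select(counts, means, t)
        reward = reward_fn(arm, rng)
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # incremental mean
    return counts, means


def toy_reward(arm, rng):
    """Bernoulli reward with the toy success rate of the chosen arm."""
    return 1.0 if rng.random() < TOY_RATES[arm] else 0.0


if __name__ == "__main__":
    counts, means = run_bandit(toy_reward, len(THRESHOLDS), horizon=2000)
    best = counts.index(max(counts))
    print("most-pulled threshold:", THRESHOLDS[best], "pull counts:", counts)
```

Variants such as UCB-V or UCB-Tuned would replace only the exploration bonus inside `ucb1_select` (for example with a variance-dependent term), leaving the surrounding loop unchanged.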

2604.21889 2026-05-25 cs.CL cs.AI cs.LG 版本更新

TingIS: Real-time Risk Event Discovery from Noisy Customer Incidents at Enterprise Scale

TingIS:企业级规模下从嘈杂客户事件中实时发现风险事件

Jun Wang, Ziyin Zhang, Rui Wang, Hang Yu, Peng Di, Rui Wang

发表机构 * Ant Group(蚂蚁集团) Shanghai Jiao Tong University(上海交通大学)

AI总结 本文介绍了TingIS,一个用于大规模企业环境中实时发现风险事件的端到端系统。针对客户事件数据中存在噪声大、语义复杂、吞吐量高的挑战,TingIS结合多阶段事件链接引擎与大型语言模型,实现了从少量用户描述中稳定提取有效事件的能力,并通过级联路由机制和多维降噪流程提升业务归因精度和信号质量。实验表明,TingIS在高优先级事件发现率和系统响应延迟方面表现优异,显著优于现有方法。

Comments Accepted to ACL 2026 Industry Track (oral presentation)

详情
AI中文摘要

实时检测和缓解技术异常对于大规模云原生服务至关重要,即使几分钟的停机也可能导致巨大的财务损失和用户信任度下降。虽然客户事件是发现监控遗漏风险的重要信号,但由于极端噪声、高吞吐量和不同业务线的语义复杂性,从这些数据中提取可操作情报仍然具有挑战性。在本文中,我们提出了TingIS,一个为企业级事件发现设计的端到端系统。TingIS的核心是一个多阶段事件链接引擎,该引擎将高效索引技术与大型语言模型(LLM)协同起来,对事件合并做出明智决策,从而仅从少量多样的用户描述中稳定提取可操作事件。该引擎辅以级联路由机制以实现精确的业务归属,以及一个集成领域知识、统计模式和行为过滤的多维降噪管道。TingIS部署在生产环境中,处理峰值吞吐量超过每分钟2,000条消息和每天300,000条消息,实现了P90告警延迟3.5分钟和高优先级事件95%的发现率。基于真实数据构建的基准测试表明,TingIS在路由准确性、聚类质量和信噪比方面显著优于基线方法。

英文摘要

Real-time detection and mitigation of technical anomalies are critical for large-scale cloud-native services, where even minutes of downtime can result in massive financial losses and diminished user trust. While customer incidents serve as a vital signal for discovering risks missed by monitoring, extracting actionable intelligence from this data remains challenging due to extreme noise, high throughput, and semantic complexity of diverse business lines. In this paper, we present TingIS, an end-to-end system designed for enterprise-grade incident discovery. At the core of TingIS is a multi-stage event linking engine that synergizes efficient indexing techniques with Large Language Models (LLMs) to make informed decisions on event merging, enabling the stable extraction of actionable incidents from just a handful of diverse user descriptions. This engine is complemented by a cascaded routing mechanism for precise business attribution and a multi-dimensional noise reduction pipeline that integrates domain knowledge, statistical patterns, and behavioral filtering. Deployed in a production environment handling a peak throughput of over 2,000 messages per minute and 300,000 messages per day, TingIS achieves a P90 alert latency of 3.5 minutes and a 95% discovery rate for high-priority incidents. Benchmarks constructed from real-world data demonstrate that TingIS significantly outperforms baseline methods in routing accuracy, clustering quality, and Signal-to-Noise Ratio.

2604.19000 2026-05-25 cs.LG cs.AI 版本更新

Decompose, Structure, and Repair: A Neuro-Symbolic Framework for Autoformalization via Operator Trees

分解、结构化与修复:基于操作树的神经符号自动形式化框架

Xiaoyang Liu, Zineng Dong, Yifan Bai, Yantao Li, Yuntian Liu, Tao Luo

发表机构 * School of Mathematical Sciences, Shanghai Jiao Tong University(上海交通大学数学科学学院) Zhiyuan College, Shanghai Jiao Tong University(上海交通大学致远学院) Institute of Natural Sciences, MOE-LSC, CMA-Shanghai, Shanghai Jiao Tong University(上海交通大学自然科学研究院)

AI总结 该论文提出了一种名为DSR的神经符号框架,用于将自然语言数学问题自动形式化为形式语言。DSR通过分解数学陈述为逻辑组件并映射为结构化的操作符树,利用这种拓扑结构实现对错误的精确定位与修复。研究还引入了PRIME基准数据集,并在实验中验证了DSR在计算资源相同的情况下优于现有方法,取得了新的最先进成果。

Comments Accepted to ICML 2026

详情
AI中文摘要

语句自动形式化通过将自然语言问题翻译成形式语言,成为人类数学与形式数学之间的关键桥梁。虽然先前的工作侧重于数据合成和多样化的训练范式来优化端到端的大语言模型(LLMs),但它们通常将形式代码视为平面序列,忽略了数学语句中固有的层次逻辑。在这项工作中,我们引入了分解、结构化与修复(DSR),一个神经符号框架,将自动形式化重构为模块化流水线。DSR将语句分解为逻辑组件,并将其映射到结构化的操作树,利用这一拓扑蓝图通过子树精炼精确定位和修复错误。此外,我们引入了PRIME,一个包含156个本科和研究生级别定理的基准,这些定理选自经典教科书并由专家在Lean 4中注释。实验结果表明,DSR建立了新的最先进水平,在同等计算预算下始终优于基线。数据集、模型和代码可在https://github.com/XiaoyangLiu-sjtu/DSR获取。

英文摘要

Statement autoformalization acts as a critical bridge between human mathematics and formal mathematics by translating natural language problems into formal language. While prior works have focused on data synthesis and diverse training paradigms to optimize end-to-end Large Language Models (LLMs), they typically treat formal code as flat sequences, neglecting the hierarchical logic inherent in mathematical statements. In this work, we introduce Decompose, Structure, and Repair (DSR), a neuro-symbolic framework that restructures autoformalization into a modular pipeline. DSR decomposes statements into logical components and maps them to structured operator trees, leveraging this topological blueprint to precisely localize and repair errors via sub-tree refinement. Furthermore, we introduce PRIME, a benchmark of 156 undergraduate and graduate-level theorems selected from canonical textbooks and expertly annotated in Lean 4. Experimental results demonstrate that DSR establishes a new state-of-the-art, consistently outperforming baselines under equivalent computational budgets. The datasets, model, and code are available at https://github.com/XiaoyangLiu-sjtu/DSR.

2604.07796 2026-05-25 stat.ML cs.IT cs.LG math.IT math.ST stat.TH 版本更新

Order-Optimal Sequential 1-Bit Mean Estimation in General Tail Regimes

一般尾分布下的最优序贯1比特均值估计

Ivan Lau, Jonathan Scarlett

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 本文研究了在1比特通信约束下的均值估计问题,提出了一种基于随机阈值查询的自适应均值估计方法,每个1比特反馈表示样本是否超过顺序选择的阈值。该估计器对任意具有有界均值和有界中心矩的分布具有$(ε, δ)$-PAC性质,且在所有尾部分布情形下均达到最优的样本复杂度。研究还揭示了1比特量化在有限方差情况下的基本性能限制,并展示了自适应方法相比非自适应方法在样本效率上的显著优势。

Comments This article substantially extends the AISTATS version, arXiv:2509.21940

详情
AI中文摘要

本文研究了1比特通信约束下的均值估计问题。我们提出了一种新颖的自适应均值估计器,仅基于随机化阈值查询,其中每个1比特输出指示给定样本是否超过顺序选择的阈值。对于任何具有有界均值$\mu\in [-\lambda, \lambda]$和有界$k$阶中心矩$\mathbb{E}[|X-\mu|^k] \le \sigma^k$($k>1$固定)的分布,我们的估计器是$(\varepsilon, \delta)$-PAC的。此外,我们的样本复杂度在所有此类尾分布下都是阶数最优的,即对于每个这样的$k$值。对于$k\neq 2$,我们的估计器的样本复杂度匹配未量化极小极大下界加上不可避免的$O(\log(\lambda/\sigma))$定位代价。对于有限方差情形($k=2$),我们的估计器的样本复杂度有额外的乘法$O(\log(\sigma/\varepsilon))$惩罚,并且我们建立了新的信息论下界,表明该惩罚是1比特量化的基本限制。我们还建立了一个显著的适应性差距:对于阈值查询和更一般的区间查询,任何非自适应估计器的样本复杂度必须与搜索空间参数$\lambda/\sigma$线性增长,使其样本效率远低于我们的自适应方法。最后,我们提出了算法变体,这些变体(i)处理未知的采样预算,(ii)在给定(可能宽松的)界限下适应未知尺度参数$\sigma$,(iii)仅需两个自适应阶段即可实现阶数最优样本复杂度,但以更一般的1比特查询为代价,以及(iv)利用每个1比特查询的多个局部样本按比例减少通信成本。

英文摘要

In this paper, we study the problem of mean estimation under 1-bit communication constraints. We propose a novel adaptive mean estimator based solely on randomized threshold queries, where each 1-bit outcome indicates whether a given sample exceeds a sequentially chosen threshold. Our estimator is $(ε, δ)$-PAC for any distribution with a bounded mean $μ\in [-λ, λ]$ and a bounded $k$-th central moment $\mathbb{E}[|X-μ|^k] \le σ^k$ for any fixed $k > 1$. Moreover, our sample complexity is order-optimal in all such tail regimes, i.e., for every such $k$ value. For $k \neq 2$, our estimator's sample complexity matches the unquantized minimax lower bounds plus an unavoidable $O(\log(λ/σ))$ localization cost. For the finite-variance case ($k=2$), our estimator's sample complexity has an extra multiplicative $O(\log(σ/ε))$ penalty, and we establish a novel information-theoretic lower bound showing that this penalty is a fundamental limit of 1-bit quantization. We also establish a significant adaptivity gap: for both threshold queries and more general interval queries, the sample complexity of any non-adaptive estimator must scale linearly with the search space parameter $λ/σ$, rendering it vastly less sample efficient than our adaptive approach. Finally, we present algorithmic variants that (i) handle an unknown sampling budget, (ii) adapt to an unknown scale parameter $σ$ given (possibly loose) bounds, (iii) require only two stages of adaptivity to achieve order-optimal sample complexity at the expense of more general 1-bit queries, and (iv) leverage multiple local samples per 1-bit query to proportionally reduce communication costs.
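The randomized threshold query at the heart of this setting can be illustrated with a much simpler non-adaptive baseline. The sketch below is not the paper's adaptive estimator, and it relies on an extra assumption the abstract does not make: every sample lies in $[-λ, λ]$ (the paper bounds only the mean and a central moment). Under that assumption, a threshold drawn uniformly from $[-λ, λ]$ gives a 1-bit outcome whose expectation is an affine function of the mean, which can then be inverted. The function name `one_bit_mean_estimate` is hypothetical.

```python
import random


def one_bit_mean_estimate(sample_fn, lam, n, seed=0):
    """Estimate a mean from n one-bit randomized threshold queries.

    Assumes every sample returned by sample_fn lies in [-lam, lam].
    """
    rng = random.Random(seed)
    ones = 0
    for _ in range(n):
        x = sample_fn(rng)                  # private sample, never transmitted
        threshold = rng.uniform(-lam, lam)  # non-adaptive random threshold
        ones += 1 if x > threshold else 0   # the single transmitted bit
    # For x in [-lam, lam]: P(x > threshold) = (x + lam) / (2 * lam),
    # so E[bit] = (mu + lam) / (2 * lam). Invert the affine map.
    return 2.0 * lam * (ones / n) - lam


if __name__ == "__main__":
    est = one_bit_mean_estimate(lambda rng: rng.uniform(0.0, 1.0), lam=2.0, n=100_000)
    print("estimated mean (true value 0.5):", round(est, 3))
```

The variance of this baseline grows with $λ^2$, which gives some intuition for why the abstract stresses adaptivity: thresholds chosen sequentially can localize the mean first instead of spreading queries over the whole search range.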

2604.05129 2026-05-25 cs.GT cs.LG 版本更新

No Coin Left Behind: Maximizing Strategic Surplus Against No-Regret Dynamics

不遗漏任何硬币:对抗无遗憾动态的最大化战略剩余

Yiheng Su, Emmanouil-Vasileios Vlatakis-Gkaragkounis

发表机构 * University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 本文研究了在零和博弈中,如何对抗使用固定步长的Follow-the-Regularized-Leader(FTRL)学习者,最大化战略盈余。作者证明了从FTRL学习者中提取与遗憾尺度相关的盈余是该方法族的固有特性,而非特定实现的结果,并提出了两个关键结果:固定最大最小优化器下,盈余与学习者的次优动作数量成正比;交替优化器下,无论均衡结构如何,均可保证一定规模的盈余。研究还揭示了正则化器的几何二分现象,并提出了衡量正则化器对学习者策略敏感程度的指标。

详情
AI中文摘要

我们研究了在 $n\times m$ 两人零和博弈中,对抗使用恒定步长 $\eta$ 的跟随正则化领导者(FTRL)学习器时,先知优化者在 $T$ 轮博弈中可获得的战略剩余。与之前的分析不同,我们表明这种遗憾尺度剩余的提取是 FTRL 家族的固有特征,而非特定实例的产物。首先,对于固定的最大最小优化器,我们建立了一个阶为 $\Omega(N_{\mathrm{sub}}/\eta)$ 的普遍规律,证明效用剩余随学习器次优动作数量 $N$ 缩放,并在没有次优动作时消失。其次,对于交替优化器,在随机博弈中,无论均衡结构如何,都能以高概率保证 $\Omega(\eta T/\mathrm{poly}(n,m))$ 的剩余。我们的分析揭示了一个尖锐的几何二分法:非陡峭正则化器允许优化器通过有限时间消除次优动作实现最大瞬态剩余,而陡峭正则化器则引入一个消失的尾部修正,可能延迟剩余饱和。最后,我们讨论了这种优势在双边收益不确定性下是否持续存在,并提出了一个易感性度量,量化哪些正则化器最容易受到学习器感知的战略引导。

英文摘要

We investigate the strategic surplus obtainable against a Follow-the-Regularized-Leader (FTRL) learner with constant step size $η$ in $n\times m$ two-player zero-sum games played over $T$ rounds against a clairvoyant optimizer. In contrast with prior analysis, we show that the extraction of such regret-scale surplus is an inherent feature of the FTRL family, rather than an artifact of specific instantiations. First, for a fixed max-min optimizer, we establish a sweeping law of order $Ω(N_{\mathrm{sub}}/η)$, proving that utility surplus scales with the number of the learner's suboptimal actions $N_{\mathrm{sub}}$ and vanishes in their absence. Second, for an alternating optimizer, a surplus of $Ω(ηT/\mathrm{poly}(n,m))$ can be guaranteed regardless of the equilibrium structure, with high probability, in random games. Our analysis uncovers a sharp geometric dichotomy: non-steep regularizers allow the optimizer to realize the maximal transient surplus via finite-time elimination of suboptimal actions, whereas steep regularizers introduce a vanishing tail correction that can delay surplus saturation. Finally, we discuss whether this leverage persists under bilateral payoff uncertainty and propose a susceptibility measure quantifying which regularizers are most vulnerable to learner-aware strategic steering.
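For readers unfamiliar with FTRL, the sketch below shows one standard instance: a constant step size $η$ with a negative-entropy regularizer, which yields the exponential-weights update. It only illustrates the learner's dynamics against a fixed opponent strategy. The payoff matrix, the convention that the learner is the maximizing row player, and the function names are assumptions made here, and the paper's optimizer strategies and surplus bounds are not reproduced.

```python
import numpy as np


def ftrl_entropic(cum_utility, eta):
    """FTRL step with negative-entropy regularizer: softmax of eta * cumulative utility."""
    z = eta * cum_utility
    z = z - z.max()          # shift for numerical stability
    w = np.exp(z)
    return w / w.sum()


def play_against_fixed(A, y, eta, T):
    """Learner (row player, maximizing x^T A y) runs FTRL against a fixed column strategy y."""
    cum = np.zeros(A.shape[0])
    history = []
    for _ in range(T):
        x = ftrl_entropic(cum, eta)
        history.append(x)
        cum += A @ y         # full-information utility feedback for every row action
    return np.array(history)


if __name__ == "__main__":
    A = np.array([[1.0, 0.0],
                  [0.0, 0.5]])          # toy payoff matrix (hypothetical)
    y = np.array([1.0, 0.0])            # opponent always plays column 0
    hist = play_against_fixed(A, y, eta=0.1, T=100)
    print("first strategy:", hist[0], "last strategy:", hist[-1])
```

With this regularizer the weight on a suboptimal action decays geometrically but never reaches exactly zero, which gives some intuition for the "vanishing tail" the abstract associates with steep regularizers. A Euclidean regularizer would instead project onto the simplex and can set such weights to zero in finite time.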

2603.24226 2026-05-25 cs.IR cs.LG 版本更新

Joint Model Parameter Scaling and Universal-Domain Data Integration for E-commerce Search Ranking

联合模型参数缩放与通用域数据集成用于电商搜索排序

Liren Yu, Caiyuan Li, Feiyi Dong, Tao Zhang, Zhixuan Zhang, Dan Ou, Haihong Tang, Bo Zheng

发表机构 * Taobao & Tmall Group of Alibaba, Hangzhou, China; Taobao & Tmall Group of Alibaba, Beijing, China; Taobao & Tmall Group of Alibaba

AI总结 本文研究了电商搜索排序中模型参数扩展与数据质量提升的联合优化问题,指出单纯增加模型规模效果有限,而异构大规模行为数据的处理也难以仅靠架构调整解决。为此,作者提出UniScale框架,包含两个核心组件:ES$^3$系统通过引入跨域示例和全局监督信号扩展训练数据,HHSFT模型则通过分层特征交互和用户兴趣融合处理异构数据。实验表明,UniScale在离线和在线测试中均显著提升了搜索效果,包括订单量和GMV的提升。

详情
AI中文摘要

工业搜索、广告和推荐的缩放研究主要强调扩大模型容量或改进架构。然而在现实系统中,性能不仅受限于模型大小,还受限于训练数据的质量和分布。我们的实证分析显示了两个关键瓶颈:单独增加参数带来的收益逐渐减小,且异构大规模行为数据引入的挑战无法仅通过架构调整完全解决。为解决此问题,我们提出了UniScale,一个将数据缩放与模型设计相结合的统一框架。UniScale包含两个组件。首先,ES$^3$,一个全空间样本构建系统,通过用全局归因的监督信号丰富域内搜索上下文,并引入反映用户在可比内容曝光条件下决策的跨域示例,将监督范围扩展到传统采样训练数据之外。其次,HHSFT,一个异构层次融合Transformer,旨在通过跨整个行为空间的层次化特征交互和用户兴趣融合,利用由此产生的大规模异构数据。这些组件共同实现了比仅以结构为中心的优化更有效的缩放。实验表明,UniScale持续改善离线性能,并展现出有利的缩放行为。在大型电商搜索平台的在线A/B测试中,它带来了1.70%的购买量提升和2.04%的GMV提升。

英文摘要

Scaling studies for industrial search, advertising, and recommendation have largely emphasized enlarging model capacity or refining architectures. Yet in real-world systems, performance is constrained not only by model size but also by the quality and distribution of training data. Our empirical analysis shows two key bottlenecks: increasing parameters alone yields progressively smaller gains, and the challenges introduced by heterogeneous, large-scale behavior data cannot be fully resolved by architecture tuning in isolation. To address this issue, we present UniScale, a unified framework that couples data scaling with model design. UniScale consists of two components. First, ES$^3$, an entire-space sample construction system, broadens supervision beyond conventional sampled training data by enriching intra-domain search contexts with globally attributed supervisory signals and introducing cross-domain examples that reflect user decisions under comparable content exposure conditions. Second, HHSFT, a heterogeneous hierarchical fusion transformer, is tailored to exploit the resulting large-scale heterogeneous data through hierarchical feature interaction and user-interest fusion across the entire behavior space. Together, these components enable more effective scaling than structure-centric optimization alone. Experiments show that UniScale consistently improves offline performance and demonstrates favorable scaling behavior. In online A/B tests on a large e-commerce search platform, it delivers a 1.70% increase in purchases and a 2.04% lift in GMV.

2603.19812 2026-05-25 cs.LG 版本更新

Eye Gaze-Informed and Context-Aware Pedestrian Trajectory Prediction in Shared Spaces with Automated Shuttles: A Virtual Reality Study

共享空间中自动穿梭车与行人的眼动知情与情境感知轨迹预测:一项虚拟现实研究

Danya Li, Yan Feng, Rico Krueger

发表机构 * Department of Technology, Management and Economics at the Technical University of Denmark(丹麦技术大学技术、管理与经济学系) Department of Transport & Planning, Civil Engineering and Geosciences at Delft University of Technology(代尔夫特理工大学交通运输与规划、土木工程与地质科学系)

AI总结 本研究通过虚拟现实实验,探讨行人眼动信息在共享空间中预测其轨迹的价值,研究了不同接近角度和交通条件下的行人与自动驾驶接驳车的交互行为。研究构建了一个融合眼动、头部方向和情境上下文的多模态预测模型,发现眼动信息对轨迹预测的贡献依赖于角度和身体协调,并与情境信息具有互补性。实验表明,结合眼动与情境信息可将最终位移误差降低8.47%,突显了将人类感知信号纳入行人行为预测的重要性。

详情
AI中文摘要

为填补这一空白,我们进行了一项虚拟现实实验,行人在不同接近角度(45°、90°、135°)和连续交通条件(单辆穿梭车、两辆穿梭车间隔3或5秒)下与自动穿梭车交互,收集了同步的运动、眼动和头部朝向数据。为了探究细粒度眼动在何种程度、何种条件下以及以何种形式对行人运动预测提供信息,我们开发了一个多模态预测模型,通过模态特定编码器融合这些信号,并系统地消融眼动表示与头部朝向和情境上下文。我们报告三个主要结果。首先,眼动的预测价值与角度相关,并与眼-头-身体协调紧密耦合:在锐角角度下,行人主动转移视线以获取穿梭车信息时,眼动携带了仅头部朝向无法捕捉的信息。其次,连续眼动朝向优于分类语义注视标签,最佳编码框架(全局或身体相对)取决于眼动是单独使用还是与上下文联合使用。第三,眼动和情境上下文提供互补的预测信息:它们的组合将最终位移误差(FDE)降低了8.47%,接近各自贡献之和。这些发现共同凸显了将人类感知信号纳入行人行为预测的价值,并激励了以人为中心的建模方法补充以车辆为中心的建模方法。我们的代码可在 https://github.com/danyayay/GazeX.git 获取。

英文摘要

To address this gap, we conduct a Virtual Reality experiment in which pedestrians interact with automated shuttles under varying approach angles (45°, 90°, 135°) and continuous-traffic conditions (single shuttle, two shuttles with 3 or 5-second gaps), collecting synchronized motion, eye gaze, and head orientation data. To investigate to what extent, under what conditions, and in what form fine-grained eye gaze is informative for pedestrian motion prediction, we develop a multi-modal prediction model that fuses these signals through modality-specific encoders, and systematically ablate gaze representations against head orientation and situational context. We report three main results. First, the predictive value of eye gaze is angle-dependent and tightly coupled with eye-head-body coordination: at acute angles where pedestrians actively redirect gaze to acquire the shuttle, eye gaze carries information that head orientation alone misses. Second, continuous gaze orientation outperforms categorical semantic fixation labels, with the optimal encoding frame (global or body-relative) depending on whether gaze is used alone or jointly with context. Third, eye gaze and situational context provide complementary predictive information: their combination reduces final displacement error (FDE) by 8.47%, close to the sum of their individual contributions. Together, these findings highlight the value of incorporating human perceptual signals into pedestrian behavior prediction and motivate a human-centered complement to vehicle-centric modeling approaches. Our code is available at https://github.com/danyayay/GazeX.git.
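The headline metric here, final displacement error (FDE), is commonly defined as the Euclidean distance between the predicted and the true position at the last prediction step, averaged over trajectories. The sketch below assumes that definition and an array layout of `(num_trajectories, num_steps, 2)`. The paper's exact averaging and units are not given in the abstract, and both function names are hypothetical.

```python
import numpy as np


def final_displacement_error(pred, truth):
    """Mean Euclidean distance between predicted and true final positions.

    pred, truth: array-likes of shape (num_trajectories, num_steps, 2).
    """
    pred = np.asarray(pred, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # Compare only the last time step of every trajectory.
    return float(np.linalg.norm(pred[:, -1] - truth[:, -1], axis=-1).mean())


def relative_reduction(baseline, improved):
    """Percentage reduction of an error metric relative to a baseline value."""
    return 100.0 * (baseline - improved) / baseline


if __name__ == "__main__":
    pred = np.zeros((2, 3, 2))
    truth = np.zeros((2, 3, 2))
    truth[0, -1] = [3.0, 4.0]  # first trajectory ends 5 units from the prediction
    print(final_displacement_error(pred, truth))  # 2.5
```

A figure such as the reported 8.47% would then correspond to `relative_reduction(fde_baseline, fde_with_gaze_and_context)`, with both inputs computed on the same test trajectories.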

2603.19310 2026-05-25 cs.LG cs.AI 版本更新

MemReward: Graph-Based Experience Memory for LLM Reward Prediction with Limited Labels

MemReward: 基于图的经验记忆用于有限标签下的LLM奖励预测

Tianyang Luo, Tao Feng, Zhigang Hua, Yan Xie, Shuang Yang, Ge Liu, Jiaxuan You

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Meta

AI总结 本文提出了一种基于图结构的经验记忆框架 MemReward,用于在标注数据有限的情况下提升大语言模型(LLM)的奖励预测能力。该方法通过构建包含初始策略生成的推理过程和答案的异构图,并利用图神经网络(GNN)将有限的标注奖励传播到未标注的样本中,从而在在线策略优化过程中实现奖励的高效获取。实验表明,MemReward 在仅使用20%标注数据的情况下,能够在数学证明、问答和代码生成等任务中接近理想奖励模型的性能。

详情
AI中文摘要

强化学习已成为改进大型语言模型推理能力的强大范式,其中从策略中采样rollout,并利用在这些rollout上计算的奖励信号来更新策略。然而,在数据稀缺的场景中,大规模获取ground-truth标签以验证rollout通常需要昂贵的人工标注或劳动密集型的专家验证。例如,评估数学证明需要专家评审,而开放式问答缺乏确定的ground-truth。当ground-truth标签稀缺时,强化学习微调的有效性受到限制。受半监督学习在将标签从标注样本传播到未标注样本方面成功的启发,我们提出了MemReward,一种基于图的经验记忆框架,将奖励传播直接集成到在线策略优化中。MemReward将来自初始LLM策略的rollout(思考过程和最终答案)存储为异构图中的节点,这些节点通过相似性和结构边连接,图神经网络通过该图将奖励从标注rollout传播到未标注rollout。为了训练这样的框架,我们首先在标注rollout上预热GNN,通过查询、思考和答案节点的异质聚合来预测奖励。在在线RL微调期间,未标注rollout通过查询相似性附加到图中,GNN预测它们的奖励,从而产生一种结合ground-truth和GNN预测奖励的混合奖励获取策略。在Qwen2.5-1.5B和3B上的数学、问答和代码生成实验表明,MemReward仅使用20% rollout的ground-truth奖励,就在1.5B上达到Oracle性能的96.6%,在3B上达到97.3%,并在域外任务上接近Oracle。

英文摘要

Reinforcement learning has emerged as a powerful paradigm for improving large language model (LLM) reasoning, where rollouts are sampled from the policy and reward signals computed on those rollouts are used to update the policy. However, in data-scarce scenarios, obtaining ground-truth labels to verify rollouts at scale often requires expensive human annotation or labor-intensive expert verification. For instance, evaluating mathematical proofs demands expert review, and open-ended question answering lacks definitive ground truth. When ground-truth labels are scarce, the effectiveness of reinforcement learning fine-tuning is constrained. Inspired by the success of semi-supervised learning in propagating labels from labeled to unlabeled samples, we propose MemReward, a graph-based experience memory framework that integrates reward propagation directly into online policy optimization. MemReward stores rollouts (thinking processes and final answers) from an initial LLM policy as nodes in a heterogeneous graph connected by similarity and structural edges, over which a GNN propagates rewards from labeled to unlabeled rollouts. To train such a framework, we first warm up the GNN on labeled rollouts to predict rewards via heterogeneous aggregation over query, thinking, and answer nodes. During online RL fine-tuning, unlabeled rollouts are attached to the graph by query similarity, and the GNN predicts their rewards, yielding a hybrid reward acquisition strategy that combines ground-truth and GNN-predicted rewards. Experiments on Qwen2.5-1.5B and 3B in mathematics, question answering, and code generation demonstrate that MemReward, with ground-truth rewards on only 20% of rollouts, achieves 96.6% of Oracle performance on 1.5B and 97.3% on 3B, and closely approaches Oracle on out-of-domain tasks.

2603.18551 2026-05-25 math.OC cs.CC cs.LG 版本更新

Learning Decision-Sufficient Representations for Linear Optimization

学习线性优化的决策充分表示

Yuhan Ye, Saurabh Amin, Asuman Ozdaglar

发表机构 * MIT(麻省理工学院)

AI总结 本文研究如何构建压缩数据集以恢复具有未知成本向量的线性规划问题中的最优决策。作者证明了确定决策相关维度 $d^\star$ 是 NP 难的,并提出了一种点态充分性概念,从而在多项式时间内构造出适用于单个成本向量的决策数据集。进一步地,他们提出了一种累积算法,在独立同分布成本假设下实现稳定压缩,并给出了分布无关的 PAC 保证,同时将决策充分性表示应用于上下文线性优化,获得了更优的泛化界。

Comments 45 pages plus appendix, 2 figures. Accepted at COLT 2026

详情
AI中文摘要

我们研究如何构建压缩数据集,使其足以恢复未知成本向量$c$位于先验集$\mathcal{C}$中的线性规划的最优决策。Bennouna等人最近的工作通过内在的决策相关维度$d^\star$给出了充分决策数据集(SDDs)的精确几何刻画。然而,他们构建最小规模SDD的算法需要求解混合整数规划。在本文中,我们建立了硬度结果,表明计算$d^\star$是NP难的,判定数据集是否全局充分是coNP难的,从而解决了Bennouna等人提出的一个近期开放问题。为了应对这种最坏情况下的难解性,我们引入了点态充分性,这是一种要求对单个成本向量充分的松弛。在非退化条件下,我们提供了一种多项式时间的切割平面算法来构建点态充分的决策数据集。在具有独立同分布成本的数据驱动框架下,我们进一步提出了一种累积算法,该算法跨样本聚合决策相关方向,产生一个大小至多为$d^\star$的稳定压缩方案。这导致了一个无分布PAC保证:以高概率,在训练样本上,新样本的点态充分失败概率至多为$\tilde{O}(d^\star/n)$,且该速率在对数因子意义下是紧的。最后,我们将决策充分表示应用于上下文线性优化,获得压缩预测器,其泛化界为$\tilde{O}(\sqrt{d^\star/n})$而非$\tilde{O}(\sqrt{d/n})$,其中$d$是环境成本维度。

英文摘要

We study how to construct compressed datasets that suffice to recover optimal decisions in linear programs with an unknown cost vector $c$ lying in a prior set $\mathcal{C}$. Recent work by Bennouna et al. provides an exact geometric characterization of sufficient decision datasets (SDDs) via an intrinsic decision-relevant dimension $d^\star$. However, their algorithm for constructing minimum-size SDDs requires solving mixed-integer programs. In this paper, we establish hardness results showing that computing $d^\star$ is NP-hard and deciding whether a dataset is globally sufficient is coNP-hard, thereby resolving a recent open problem posed by Bennouna et al. To address this worst-case intractability, we introduce pointwise sufficiency, a relaxation that requires sufficiency for an individual cost vector. Under nondegeneracy, we provide a polynomial-time cutting-plane algorithm for constructing pointwise-sufficient decision datasets. In a data-driven regime with i.i.d. costs, we further propose a cumulative algorithm that aggregates decision-relevant directions across samples, yielding a stable compression scheme of size at most $d^\star$. This leads to a distribution-free PAC guarantee: with high probability over the training sample, the pointwise sufficiency failure probability on a fresh draw is at most $\tilde{O}(d^\star/n)$, and this rate is tight up to logarithmic factors. Finally, we apply decision-sufficient representations to contextual linear optimization, obtaining compressed predictors with generalization bounds scaling as $\tilde{O}(\sqrt{d^\star/n})$ rather than $\tilde{O}(\sqrt{d/n})$, where $d$ is the ambient cost dimension.

2603.16331 2026-05-25 cs.LG 版本更新

Decoding the Critique Mechanism in Large Reasoning Models

解码大型推理模型中的批判机制

Hoang Phan, Quang H. Nguyen, Hung T. Q. Le, Xiusi Chen, Heng Ji, Khoa D. Doan

发表机构 * VinUni-Illinois Smart Health Center(VinUniversity-伊利诺伊州智能健康中心) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文研究了大推理模型(LRMs)在推理过程中如何通过内部机制纠正错误,提出了“隐藏的批评能力”这一概念。研究发现,即使模型在中间推理步骤中出现错误且未进行明确纠正,仍能最终得出正确答案,表明其具备某种隐式的错误检测与自我修正机制。通过特征空间分析,作者识别出一个可解释的“批评向量”,用于引导模型增强错误检测能力,提升推理性能,且无需额外训练成本。这一发现为理解与改进大模型的自我验证机制提供了新思路。

详情
AI中文摘要

大型推理模型(LRMs)展现出回溯和自我验证机制,使其能够修正中间步骤并达到正确解,在复杂逻辑基准上表现强劲。我们假设这种行为仅在模型具有足够强的“批判”能力来检测自身错误时才有益。本工作通过在中间推理步骤中插入算术错误,系统研究了当前LRMs如何从错误中恢复。值得注意的是,我们发现一个奇特但重要的现象:尽管错误在整个思维链(CoT)中传播且没有任何言语修正,模型在思考过程结束后仍能得出正确的最终答案。这种恢复暗示存在一种内部机制帮助模型检测错误并触发自我修正,我们称之为隐藏的批判能力。基于特征空间分析,我们识别出一个高度可解释的批判向量,代表这种行为。跨多个模型规模和系列的广泛实验表明,用该向量引导潜在表示可提升模型的错误检测能力,并在无需额外训练成本的情况下增强测试时扩展性能。我们的发现为LRMs的批判行为提供了有价值的理解,提示了控制和改进其自我验证机制的有前景方向。我们的代码可在 https://github.com/mail-research/lrm-critique-vectors 获取。

英文摘要

Large Reasoning Models (LRMs) exhibit backtracking and self-verification mechanisms that enable them to revise intermediate steps and reach correct solutions, yielding strong performance on complex logical benchmarks. We hypothesize that such behaviors are beneficial only when the model has sufficiently strong "critique" ability to detect its own mistakes. This work systematically investigates how current LRMs recover from errors by inserting arithmetic mistakes in their intermediate reasoning steps. Notably, we discover a peculiar yet important phenomenon: despite the error propagating throughout the entire chain-of-thought (CoT) without any verbalized correction, the model still reaches the correct final answer after the thinking process finishes. This recovery implies the existence of an internal mechanism helping the model to detect errors and trigger self-correction, which we refer to as the hidden critique ability. Building on feature space analysis, we identify a highly interpretable critique vector representing this behavior. Extensive experiments across multiple model scales and families demonstrate that steering latent representations with this vector improves the model's error detection capability and enhances the performance of test-time scaling at no extra training cost. Our findings provide a valuable understanding of LRMs' critique behavior, suggesting a promising direction to control and improve their self-verification mechanism. Our code is available at: https://github.com/mail-research/lrm-critique-vectors.

2603.10067 2026-05-25 cs.LG cs.AI 版本更新

HTMuon: Improving Muon via Heavy-Tailed Spectral Correction

HTMuon:通过重尾谱校正改进Muon

Tianyu Pang, Yujie Fang, Zihang Liu, Shenyang Deng, Lei Hsiung, Shuhua Yu, Yaoqing Yang

发表机构 * Dartmouth College(达特茅斯学院) Microsoft(微软) International Computer Science Institute(国际计算机科学研究所) University of California, Berkeley(加州大学伯克利分校) Meta

AI总结 本文提出 HTMuon,一种改进 Muon 优化算法的方法,旨在提升大语言模型的训练效果。研究指出,Muon 的正交更新规则抑制了权重谱的重尾特性,而 HTMuon 基于重尾自正则化理论,通过生成更重尾的更新步长,增强模型对参数依赖关系的捕捉能力。实验表明,HTMuon 在语言模型预训练和图像分类任务中均优于现有方法,且可作为现有 Muon 变体的插件使用。

详情
AI中文摘要

Muon最近在LLM训练中显示出有希望的结果。在这项工作中,我们研究如何进一步改进Muon。我们认为Muon的正交化更新规则抑制了重尾权重谱的出现,并过度强调了沿噪声主导方向的训练。受重尾自正则化(HT-SR)理论的启发,我们提出了HTMuon。HTMuon保留了Muon捕捉参数相互依赖性的能力,同时产生更重尾的更新并诱导更重尾的权重谱。在LLM预训练和图像分类上的实验表明,HTMuon持续优于最先进的基线,并且可以作为现有Muon变体的插件使用。例如,在C4数据集上的LLaMA预训练中,与Muon相比,HTMuon将困惑度降低了高达0.98。我们进一步从理论上证明,HTMuon对应于Schatten-$q$范数约束下的最速下降,并提供了在光滑非凸环境下的收敛性分析。HTMuon的实现可在https://github.com/TDCSZ327/HTmuon获取。

英文摘要

Muon has recently shown promising results in LLM training. In this work, we study how to further improve Muon. We argue that Muon's orthogonalized update rule suppresses the emergence of heavy-tailed weight spectra and over-emphasizes the training along noise-dominated directions. Motivated by the Heavy-Tailed Self-Regularization (HT-SR) theory, we propose HTMuon. HTMuon preserves Muon's ability to capture parameter interdependencies while producing heavier-tailed updates and inducing heavier-tailed weight spectra. Experiments on LLM pretraining and image classification show that HTMuon consistently improves performance over state-of-the-art baselines and can also serve as a plug-in on top of existing Muon variants. For example, on LLaMA pretraining on the C4 dataset, HTMuon reduces perplexity by up to $0.98$ compared to Muon. We further theoretically show that HTMuon corresponds to steepest descent under the Schatten-$q$ norm constraint and provide convergence analysis in smooth non-convex settings. The implementation of HTMuon is available at https://github.com/TDCSZ327/HTmuon.
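The abstract's statement that HTMuon corresponds to steepest descent under a Schatten-$q$ norm constraint can be unpacked with a small numerical sketch. By Hölder duality, the unit-Schatten-$q$ matrix that maximizes the inner product with a gradient $G = U \Sigma V^\top$ keeps $U$ and $V$ and raises the singular values to the power $q^* - 1$, where $1/q + 1/q^* = 1$. As $q \to \infty$ all singular values become 1, giving $U V^\top$, the orthogonalized update that Muon is generally described as approximating. This is an exact-SVD illustration of that general fact, not the released HTMuon implementation; the function name and the choice $q = 4$ in the usage example are assumptions.

```python
import numpy as np


def schatten_steepest_direction(grad, q):
    """Return the matrix D with Schatten-q norm 1 that maximizes <grad, D>.

    Requires q > 1 (q may be np.inf) and a nonzero gradient.
    """
    U, s, Vt = np.linalg.svd(grad, full_matrices=False)
    if np.isinf(q):
        # Spectral-norm constraint: every singular value becomes 1, i.e. U @ Vt.
        d = np.ones_like(s)
    else:
        q_star = q / (q - 1.0)               # conjugate exponent
        d = s ** (q_star - 1.0)              # reshape the singular spectrum
        d = d / np.linalg.norm(s, ord=q_star) ** (q_star - 1.0)  # unit Schatten-q norm
    return U @ (d[:, None] * Vt)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    G = rng.standard_normal((4, 3))
    for q in (2.0, 4.0, np.inf):
        D = schatten_steepest_direction(G, q)
        print(q, np.round(np.linalg.svd(D, compute_uv=False), 3))
```

For finite $q$ the update retains some of the gradient's spectral decay instead of flattening it completely, which is consistent with the abstract's description of "heavier-tailed updates" relative to Muon.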

2603.06610 2026-05-25 cs.LG 版本更新

CapTrack: Multifaceted Evaluation of Forgetting in LLM Post-Training

CapTrack: 大语言模型后训练中遗忘的多方面评估

Lukas Thede, Stefan Winzeck, Zeynep Akata, Jonathan Richard Schwarz

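Affiliations * Thomson Reuters Foundational Research; Tübingen AI Center, University of Tübingen; Munich Center for Machine Learning (MCML), Technical University Munich; Imperial College London

AI summary This paper proposes CapTrack, a capability-centric framework for evaluating the forgetting that arises when large language models are fine-tuned. Rather than treating forgetting as a loss of parametric or factual knowledge, CapTrack defines it as degradation of behaviors and capabilities, and builds its evaluation on a behavioral taxonomy with capability-specific metrics. A large-scale study across fine-tuning methods, domains, and model families finds that forgetting affects not only parametric knowledge but also robustness and default behaviors, and that the degree of capability degradation differs between fine-tuning methods.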

Details
AI Chinese abstract

大语言模型(LLM)后训练增强了潜在技能,解锁了价值对齐,提升了性能,并实现了领域适应。不幸的是,后训练已知会引发遗忘,尤其是在利用第三方预训练模型的普遍用例中,这通常被理解为参数或事实知识的损失。我们认为这种以准确性为中心的观点对于现代基础模型是不够的,而是将遗忘定义为系统性的模型漂移,它会降低行为和用户体验。在此背景下,我们引入了CapTrack,一个以能力为中心的框架,用于分析LLM中的遗忘,该框架结合了行为分类法和以能力特定指标为中心的评估套件。利用CapTrack,我们跨后训练算法、领域和模型家族(包括高达80B参数的模型)进行了大规模实证研究。我们发现遗忘超出了参数知识,在鲁棒性和默认行为方面出现了显著的漂移。指令微调引发了最强的相对漂移,而偏好优化更为保守,并且可以部分恢复丢失的能力。不同模型家族之间的差异持续存在,没有出现通用的缓解方法。

English abstract

Large language model (LLM) post-training enhances latent skills, unlocks value alignment, improves performance, and enables domain adaptation. Unfortunately, post-training is known to induce forgetting, especially in the ubiquitous use-case of leveraging third-party pre-trained models, which is typically understood as a loss of parametric or factual knowledge. We argue that this accuracy-centric view is insufficient for modern foundation models and instead define forgetting as systematic model drift that degrades behavior and user experience. In this context, we introduce CapTrack, a capability-centric framework for analyzing forgetting in LLMs that combines a behavioral taxonomy with an evaluation suite centered on capability-specific metrics. Using CapTrack, we conduct a large-scale empirical study across post-training algorithms, domains, and model families, including models up to 80B parameters. We find that forgetting extends beyond parametric knowledge, with pronounced drift in robustness and default behaviors. Instruction fine-tuning induces the strongest relative drift, while preference optimization is more conservative and can partially recover lost capabilities. Differences across model families persist, and no universal mitigation emerges.

2603.04005 2026-05-25 cs.IT cs.LG math.IT version update

Training-Free Rate-Distortion-Perception Traversal With Diffusion

无训练率失真感知遍历与扩散

Yuhan Wang, Suzhi Bi, Ying-Jun Angela Zhang

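Affiliations * Department of Information Engineering, The Chinese University of Hong Kong, Hong Kong; College of Electronic and Information Engineering, Shenzhen University, Shenzhen

AI summary This paper studies the rate-distortion-perception (RDP) tradeoff among bitrate, reconstruction fidelity, and perceptual quality in lossy compression, and proposes a training-free framework that traverses the entire RDP surface without retraining. The method combines a pre-trained diffusion model with a reverse channel coding module and introduces a score-scaled probability flow ODE decoder, which is proven optimal for the distortion-perception tradeoff under a Gaussian channel. Experiments show that the framework uses pre-trained diffusion models flexibly and effectively for adaptive compression across the three-way RDP tradeoff.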

Comments Accepted by the Forty-Third International Conference on Machine Learning (ICML) 2026

Details
AI Chinese abstract

率失真感知(RDP)权衡刻画了有损压缩的基本极限,同时考虑比特率、重建保真度和感知质量。虽然最近的神经压缩方法提高了感知性能,但它们通常在RDP曲面上的固定点运行,需要重新训练以针对不同的权衡。在这项工作中,我们提出了一个无需训练的框架,利用预训练扩散模型遍历整个RDP曲面。我们的方法将反向信道编码(RCC)模块与新颖的分数缩放概率流ODE解码器相结合。我们从理论上证明,所提出的扩散解码器在AWGN观测下对失真-感知权衡是最优的,并且带有RCC模块的整体框架在高斯情况下实现了最优RDP函数。跨多个数据集的实证结果证明了该框架在使用预训练扩散模型导航三元RDP权衡时的灵活性和有效性。我们的结果为自适应、感知感知压缩建立了一种实用且具有理论依据的方法。

English abstract

The rate-distortion-perception (RDP) tradeoff characterizes the fundamental limits of lossy compression by jointly considering bitrate, reconstruction fidelity, and perceptual quality. While recent neural compression methods have improved perceptual performance, they typically operate at fixed points on the RDP surface, requiring retraining to target different tradeoffs. In this work, we propose a training-free framework that leverages pre-trained diffusion models to traverse the entire RDP surface. Our approach integrates a reverse channel coding (RCC) module with a novel score-scaled probability flow ODE decoder. We theoretically prove that the proposed diffusion decoder is optimal for the distortion-perception tradeoff under AWGN observations and that the overall framework with the RCC module achieves the optimal RDP function in the Gaussian case. Empirical results across multiple datasets demonstrate the framework's flexibility and effectiveness in navigating the ternary RDP tradeoff using pre-trained diffusion models. Our results establish a practical and theoretically grounded approach to adaptive, perception-aware compression.

2603.02719 2026-05-25 cs.LG version update

An Empirical Analysis of Calibration and Selective Prediction in Multimodal Clinical Condition Classification

多模态临床状况分类中的校准与选择性预测的实证分析

L. Julián Lechuga López, Farah E. Shamout, Tim G. J. Rudner

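Affiliations * New York University; University of Toronto

AI summary This study empirically analyzes how reliable uncertainty-based selective prediction is for multimodal clinical condition classification. It finds that, although models score well on standard evaluation metrics, selective prediction can cause a substantial drop in performance. The root cause is severe class-dependent miscalibration, which is most pronounced for rare clinical conditions. The study stresses that current aggregate metrics can mask these problems and that clinical AI systems need calibration-aware evaluation to ensure safe and robust predictions.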

Comments 40 pages, 14 figures, 16 tables. Accepted as a conference paper at AHLI Conference on Health, Inference, and Learning (CHIL) 2026

Details
AI Chinese abstract

随着人工智能系统向临床部署迈进,确保可靠的预测行为对于安全关键的决策任务至关重要。一种提议的安全保障是选择性预测,即模型可以将不确定的预测交由人类专家审查。在这项工作中,我们使用多模态ICU数据,实证评估了基于不确定性的选择性预测在多标签临床状况分类中的可靠性。在一系列最先进的单模态和多模态模型中,我们发现尽管标准评估指标表现强劲,但选择性预测可能会大幅降低性能。这种失败是由严重的类别依赖的误校准驱动的,即模型对正确预测赋予高不确定性,对错误预测赋予低不确定性,尤其是对于代表性不足的临床状况。我们的结果表明,常用的聚合指标可能掩盖这些效应,限制了它们评估该设置下选择性预测行为的能力。综合来看,我们的发现描述了多模态临床状况分类中选择性预测的任务特定失败模式,并强调了需要校准感知评估来为临床AI提供强有力的安全性和鲁棒性保证。

English abstract

As artificial intelligence systems move toward clinical deployment, ensuring reliable prediction behavior is fundamental for safety-critical decision-making tasks. One proposed safeguard is selective prediction, where models can defer uncertain predictions to human experts for review. In this work, we empirically evaluate the reliability of uncertainty-based selective prediction in multilabel clinical condition classification using multimodal ICU data. Across a range of state-of-the-art unimodal and multimodal models, we find that selective prediction can substantially degrade performance despite strong standard evaluation metrics. This failure is driven by severe class-dependent miscalibration, whereby models assign high uncertainty to correct predictions and low uncertainty to incorrect ones, particularly for underrepresented clinical conditions. Our results show that commonly used aggregate metrics can obscure these effects, limiting their ability to assess selective prediction behavior in this setting. Taken together, our findings characterize a task-specific failure mode of selective prediction in multimodal clinical condition classification and highlight the need for calibration-aware evaluation to provide strong guarantees of safety and robustness in clinical AI.
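
The failure mode described above can be reproduced on toy data. The sketch below computes accuracy on the retained fraction of samples after deferring the most uncertain ones. It is a minimal illustration on synthetic values, not the paper's evaluation protocol, and the function name `selective_accuracy` and the data are assumptions.

```python
import numpy as np

def selective_accuracy(uncertainty, correct, coverage):
    """Accuracy on the `coverage` fraction of samples with the lowest uncertainty."""
    n_keep = max(1, int(round(coverage * len(correct))))
    keep = np.argsort(uncertainty, kind="stable")[:n_keep]
    return float(np.mean(correct[keep]))

# Toy labels: 1 = correct prediction, 0 = incorrect prediction.
correct = np.array([1, 1, 1, 1, 0, 0, 1, 1, 0, 1])

# Well-ordered uncertainty: errors receive the highest uncertainty.
good_unc = np.where(correct == 1, 0.1, 0.9)
# Miscalibrated uncertainty: errors receive the lowest uncertainty.
bad_unc = np.where(correct == 1, 0.9, 0.1)

full = selective_accuracy(good_unc, correct, 1.0)  # no deferral
good = selective_accuracy(good_unc, correct, 0.7)  # errors are deferred
bad = selective_accuracy(bad_unc, correct, 0.7)    # correct predictions are deferred
print(full, good, bad)
```

When uncertainty is ordered the wrong way, deferring 30% of the cases lowers accuracy on the retained set below the accuracy with no deferral at all. An aggregate metric computed at full coverage would not reveal this.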

2603.01655 2026-05-25 cs.LG eess.SP version update

Transform-Invariant Generative Ray Path Sampling for Efficient Radio Propagation Modeling

变换不变生成射线路径采样用于高效无线电传播建模

Jérome Eertmans, Enrico M. Vitucci, Vittorio Degli-Esposti, Nicola Di Cicco, Laurent Jacques, Claude Oestges

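Affiliations * OPTIT S.r.l.

AI summary This paper proposes an intelligent sampling framework based on Generative Flow Networks for efficient modeling of radio propagation paths, addressing the prohibitive computational complexity of conventional ray tracing. An experience replay buffer, a uniform exploratory policy, and physics-based action masking improve learning robustness and path-exploration efficiency in complex environments. Experiments show speedups over exhaustive search of up to 10x on GPU and 100x on CPU while maintaining high accuracy, although generalization to real urban environments still needs improvement.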

Comments submitted to npj Wireless Technology, 30 pages, 16 figures

Details
AI Chinese abstract

射线追踪已成为精确无线电传播建模的标准方法,但其计算复杂度呈指数增长,因为候选路径数量随物体数量的交互阶数而增加。这一瓶颈限制了其在大型或实时应用中的使用,迫使传统工具依赖启发式方法减少路径候选,但可能牺牲精度。为克服这一限制,我们提出了一种机器学习辅助框架,通过生成流网络进行智能采样,取代穷举路径搜索。将这些生成模型应用于该领域面临挑战,特别是由于有效路径的稀缺性导致的稀疏奖励,这可能导致在复杂环境中评估高阶交互时收敛失败和琐碎解。为确保鲁棒学习和高效探索,我们的框架包含三个关键组件。首先,经验回放缓冲区捕获并保留稀有的有效路径。其次,统一探索策略提高了泛化能力,防止过拟合简单几何形状。第三,基于物理的动作掩蔽策略在模型考虑之前过滤掉物理上不可能的路径。在理想街道峡谷场景上的验证表明,我们的模型相比穷举搜索实现了显著加速——GPU上最高10倍,CPU上最高100倍——同时保持高覆盖精度并成功发现复杂传播路径。然而,在真实曼哈顿街道几何形状上的分布外评估显示,泛化到显著不同的城市形态需要模型容量或训练策略的进一步改进。源代码、测试和教程见https://github.com/jeertmans/sampling-paths。

English abstract

Ray tracing has become a standard for accurate radio propagation modeling, but suffers from exponential computational complexity, as the number of candidate paths scales with the number of objects raised to the interaction order. This bottleneck limits its use in large-scale or real-time applications, forcing traditional tools to rely on heuristics that reduce path candidates at the cost of potentially reduced accuracy. To overcome this limitation, we propose a machine-learning-assisted framework that replaces exhaustive path searching with intelligent sampling via Generative Flow Networks. Applying these generative models to this domain presents challenges, particularly sparse rewards due to the rarity of valid paths, which can lead to convergence failures and trivial solutions when evaluating high-order interactions in complex environments. To ensure robust learning and efficient exploration, our framework incorporates three key components. First, an experience replay buffer captures and retains rare valid paths. Second, a uniform exploratory policy improves generalization and prevents overfitting to simple geometries. Third, a physics-based action masking strategy filters out physically impossible paths before the model considers them. Validated on idealized street-canyon scenarios, our model achieves substantial speedups over exhaustive search (up to $10\times$ faster on GPU and $100\times$ faster on CPU) while maintaining high coverage accuracy and successfully uncovering complex propagation paths. However, out-of-distribution evaluations on real-world Manhattan street geometries reveal that generalizing to substantially different urban morphologies requires further advancement in model capacity or alternative training strategies. Source code, tests, and a tutorial are available at https://github.com/jeertmans/sampling-paths.
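
The scaling claim (candidate paths grow as the number of objects raised to the interaction order) and the effect of masking can be checked by brute force on a small case. The sketch below is illustrative only. The rule used as a mask, forbidding two consecutive interactions with the same object, is an assumed stand-in for a physics-based constraint and is not taken from the paper's code.

```python
from itertools import product

def candidate_paths(num_objects, order, mask_repeats=False):
    """Enumerate object-index sequences of length `order`.

    With mask_repeats=True, sequences that interact with the same object
    twice in a row are dropped (assumed example of an action mask).
    """
    paths = []
    for path in product(range(num_objects), repeat=order):
        if mask_repeats and any(a == b for a, b in zip(path, path[1:])):
            continue
        paths.append(path)
    return paths

n_objects, order = 6, 3
all_paths = candidate_paths(n_objects, order)
masked = candidate_paths(n_objects, order, mask_repeats=True)

# Unmasked count is N**K; the repeat mask leaves N * (N - 1)**(K - 1).
print(len(all_paths), len(masked))  # 216 150
```

Even this simple mask removes about 30% of the candidates at order 3. Exhaustive enumeration as done here is exactly what becomes infeasible at realistic object counts, which is the motivation for sampling instead.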

2602.13480 2026-05-25 cs.CR cs.LG version update

MELT: A Behavioral Trace Dataset for High-Risk Memecoin Launch Detection

MELT:用于高风险 Memecoin 发行检测的行为轨迹数据集

Sihao Hu, Selim Furkan Tekin, Yichang Xu, Ling Liu

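Affiliations * School of Computer Science

AI summary This paper presents MELT, a behavioral trace dataset for detecting high-risk memecoin launches. Built on the Solana blockchain, it covers more than 41,000 memecoin launches and over 200 million transactions, parsed into structured behavioral records that capture transaction types and coordinated account behavior, and it reveals how issuers conceal true control of supply. MELT also provides 122 behavioral features and risk-level annotations to support large-scale supervised learning. Experiments validate its usefulness for risk detection and for mitigating memecoin investment risk.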

Details
AI Chinese abstract

Launchpad 已成为发行 memecoin 的主要机制,使投资者面临现有 rug-pull 检测方法无法捕捉的新型高风险发行。我们认为,检测这些威胁需要结构化的行为轨迹,这些轨迹隐藏在原始异构区块链数据之下,即内部人员如何积累、协调和解除头寸。为了实现这种分析,我们引入了 MELT(Memecoin 发行轨迹),这是第一个用于分析和检测 Solana 上高风险 memecoin 发行的行为轨迹数据集。MELT 覆盖了 41k+ 个 memecoin 发行,包含 200M+ 笔交易,这些交易被解析为类型化的行为记录,区分了交换、洗盘交易、转账和铸造。除了每个账户的行为外,MELT 还贡献了捆绑轨迹数据,该数据链接了同一实体控制的账户,揭示平均 36.5% 的代币供应由协调账户持有,这是一种隐藏策略,使真正的所有权集中度不被不知情的买家察觉。在这些轨迹之上,MELT 提供了 122 个行为特征和风险级别标注,使得在人口规模上进行监督学习成为可能。我们在高风险发行检测任务上对代表性 ML 模型进行了基准测试。将其预测整合到一个简单的 memecoin 选择策略中,显著减少了投资损失,证明了行为轨迹可以转化为风险缓解。我们的数据集和代码可在 https://github.com/git-disl/MELT 获取。

English abstract

Launchpads have become the dominant mechanism for issuing memecoins, exposing investors to a new class of high-risk launches that existing rug-pull detection methods cannot capture. We argue that detecting these threats requires structured behavioral traces that underlie raw heterogeneous blockchain data, i.e., how insiders accumulate, coordinate, and unwind positions. To enable such analysis, we introduce MELT (MEmecoin Launch Trace), the first behavioral trace dataset for analyzing and detecting high-risk memecoin launches on Solana. MELT covers 41k+ memecoin launches with 200M+ transactions parsed into typed behavioral records that distinguish swaps, wash trades, transfers, and mints. Beyond per-account behaviors, MELT contributes bundle-trace data that links accounts controlled by the same entity, revealing that, on average, 36.5% of token supply is held by coordinated accounts, a concealment strategy that disguises the true ownership concentration from unsuspecting buyers. On top of these traces, MELT provides 122 behavioral features and risk-level annotations, enabling supervised learning at a population scale. We benchmark representative ML models on the high-risk launch detection task. Integrating their predictions into a simple memecoin selection strategy reduces investment loss significantly, demonstrating that behavioral traces can be translated into risk mitigation. Our dataset and code are available at https://github.com/git-disl/MELT.

2602.13249 2026-05-25 q-bio.BM cs.AI cs.LG version update

A Systematic Evaluation of Co-folding Model Representations for Small-Molecule Learning

小分子学习的共折叠模型表示的系统评估

Hyosoon Jang, Hyunjin Seo, Honghui Kim, Seonghyun Park, Taewon Kim, Yunhui Jang, Sungsoo Ahn

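Affiliations * KAIST

AI summary This paper systematically evaluates the representational power of protein-ligand co-folding models for small-molecule learning. Using the modern co-folding model Boltz2, the authors transfer its atom-level ligand representations to standalone small-molecule tasks. The representations match or outperform existing models on the ADMET benchmark, and improve the efficiency of molecular generative modeling and structure-guided ligand optimization. They are also complementary to representations learned from conventional standalone molecular supervision and can be used in reinforcement learning to aid molecular discovery. The results suggest that protein-ligand co-folding is a promising pretraining paradigm for small-molecule representation learning.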

Details
AI Chinese abstract

小分子基础模型通常仅在独立分子数据上进行预训练,这与视觉和语言模型不同,后者通常受益于跨模态或关系监督。蛋白质-配体共折叠通过将模型暴露于原子级配体-蛋白质相互作用,提供了这种监督的分子类似物,引发了一个问题:共折叠模型能否产生强大的小分子表示。我们使用现代共折叠模型Boltz2研究这个问题,通过将其原子级配体表示转移到独立的小分子任务。通过系统探测和蒸馏,我们表明Boltz2表示在ADMET基准上匹配或超越现有模型,加速分子生成建模,并提高结构引导配体优化的样本效率。我们进一步发现Boltz2表示与从传统独立分子监督(包括3D构象、生物测定标签和量子化学性质)中学习到的表示互补。最后,我们将表示对齐扩展到强化学习,表明密集的表示级监督可以补充分子发现中的标量奖励。这些结果将蛋白质-配体共折叠确定为小分子表示学习的有前景的预训练范式,并将Boltz2定位为强大的现成分子基础模型。

English abstract

Small-molecule foundation models are typically pretrained on standalone molecular data, unlike vision and language models that often benefit from cross-modal or relational supervision. Protein-ligand co-folding provides a molecular analogue of such supervision by exposing models to atom-level ligand-protein interactions, raising the question of whether co-folding models can yield strong small-molecule representations. We study this question using Boltz2, a modern co-folding model, by transferring its atom-level ligand representations to standalone small-molecule tasks. Through systematic probing and distillation, we show that Boltz2 representations match or outperform existing models on the ADMET benchmark, accelerate molecular generative modeling, and improve sample efficiency in structure-guided ligand optimization. We further find that Boltz2 representations are complementary to those learned from conventional standalone molecular supervision, including 3D conformers, bioassay labels, and quantum-chemical properties. Finally, we extend representation alignment to reinforcement learning, showing that dense representation-level supervision can complement scalar rewards in molecular discovery. These results identify protein-ligand co-folding as a promising pretraining paradigm for small-molecule representation learning and position Boltz2 as a strong, off-the-shelf molecular foundation model.

2602.12579 2026-05-25 cs.LG cs.AI version update

VI-CuRL: Stabilizing Verifier-Independent RL Reasoning via Confidence-Guided Variance Reduction

VI-CuRL: 通过置信度引导的方差缩减稳定与验证器无关的强化学习推理

Xin-Qiang Cai, Masashi Sugiyama

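Affiliations * RIKEN AIP; The University of Tokyo

AI summary This paper proposes VI-CuRL, a verifier-independent reinforcement learning framework that addresses the scalability limits caused by the reliance of Reinforcement Learning with Verifiable Rewards (RLVR) on external verifiers. The method uses the model's own confidence to build a curriculum that does not depend on an external verifier, which controls gradient variance and stabilizes training. Theoretical analysis proves that the estimator is asymptotically unbiased, and experiments show that it outperforms both verifier-dependent and verifier-independent baselines on math and general reasoning tasks.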

Details
AI Chinese abstract

基于可验证奖励的强化学习(RLVR)已成为增强大型语言模型(LLMs)推理能力的主流范式,但其对外部验证器的依赖限制了可扩展性。最近的研究表明,RLVR主要通过激发潜在能力发挥作用,这推动了无验证器算法的发展。然而,在此类设置中,标准方法(如Group Relative Policy Optimization)面临一个关键挑战:破坏性的梯度方差常导致训练崩溃。为解决此问题,我们引入了与验证器无关的课程强化学习(VI-CuRL),该框架利用模型的内在置信度构建独立于外部验证器的课程。通过优先处理高置信度样本,VI-CuRL有效管理偏差-方差权衡,特别针对降低动作和问题方差。我们提供了严格的理论分析,证明我们的估计量保证渐近无偏性。实验上,VI-CuRL促进了稳定性,并在有/无验证器的数学和通用推理基准上持续优于依赖/不依赖验证器的基线。

English abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a dominant paradigm for enhancing Large Language Models (LLMs) reasoning, yet its reliance on external verifiers limits its scalability. Recent findings suggest that RLVR primarily functions by eliciting latent capabilities, motivating the development of verifier-free algorithms. However, in such settings, standard methods like Group Relative Policy Optimization face a critical challenge: destructive gradient variance that often leads to training collapse. To address this issue, we introduce Verifier-Independent Curriculum Reinforcement Learning (VI-CuRL), a framework that leverages the model's intrinsic confidence to construct a curriculum independent from external verifiers. By prioritizing high-confidence samples, VI-CuRL effectively manages the bias-variance trade-off, specifically targeting the reduction of action and problem variance. We provide a rigorous theoretical analysis, proving that our estimator guarantees asymptotic unbiasedness. Empirically, VI-CuRL promotes stability and consistently outperforms verifier-dependent/independent baselines across math and general reasoning benchmarks with/without verifiers.
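
The idea of prioritizing high-confidence samples can be sketched as a selection mask whose kept fraction grows during training. Everything specific here is an assumption made for illustration: the abstract does not define the confidence measure or the schedule, so the sketch uses the mean token log-probability as a confidence proxy and a linear schedule, and the names `confidence_mask` and `keep_schedule` are hypothetical.

```python
import numpy as np

def confidence_mask(token_logprobs, keep_fraction):
    """Boolean mask selecting the most confident responses.

    token_logprobs: list of 1-D arrays, one per sampled response.
    Confidence is the mean token log-probability (assumed proxy).
    """
    conf = np.array([lp.mean() for lp in token_logprobs])
    n_keep = max(1, int(np.ceil(keep_fraction * len(conf))))
    keep = np.zeros(len(conf), dtype=bool)
    keep[np.argsort(-conf)[:n_keep]] = True
    return keep

def keep_schedule(step, total_steps, start=0.25):
    """Linearly grow the kept fraction from `start` to 1.0 (hypothetical schedule)."""
    return start + (1.0 - start) * min(1.0, step / total_steps)

# Four sampled responses with per-token log-probabilities.
lps = [np.array([-0.1, -0.2]), np.array([-2.0, -1.0]),
       np.array([-0.5]), np.array([-3.0, -3.0])]
mask = confidence_mask(lps, keep_fraction=0.5)
print(mask, keep_schedule(50, 100))
```

Only the masked-in responses would contribute to the policy-gradient update at a given step. Restricting early updates to confident samples is one way to trade some bias for lower variance, which is the tradeoff the abstract refers to.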

2602.11629 2026-05-25 cs.LG version update

GP2F: Cross-Domain Graph Prompting with Adaptive Fusion of Pre-trained Graph Neural Networks

GP2F: 基于预训练图神经网络自适应融合的跨域图提示学习

Dongxiao He, Wenxuan Sun, Yongqi Huang, Jitao Zhao, Di Jin

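Affiliations * School of Computer Science and Technology, Tianjin University, Tianjin, China

AI summary This paper examines why Graph Prompt Learning (GPL) remains effective in cross-domain settings and proposes a new method, GP2F. The method fuses knowledge from a pre-trained graph neural network with a lightweight task-specific adapter module, giving more robust adaptation under domain shift. Theoretical analysis shows that combining pre-trained knowledge with task adaptation reduces estimation error, and experiments confirm the advantage of GP2F on cross-domain few-shot node and graph classification.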

Comments 16 pages, 8 figures

Details
AI Chinese abstract

图提示学习(GPL)最近成为一种有前景的范式,用于预训练图模型的下游适应,缓解预训练目标与下游任务之间的不匹配。最近,GPL的关注点从域内转向跨域场景,这更接近现实世界应用,其中预训练源和下游目标在数据分布上往往存在显著差异。然而,GPL在域偏移下为何仍然有效尚未被探索。经验上,我们观察到代表性的GPL方法在跨域设置中与两个简单基线(全微调和线性探测)具有竞争力,这促使我们更深入地理解提示机制。我们提供理论分析表明,联合利用这两个互补分支比单独使用任一分支产生更小的估计误差,正式证明了跨域GPL受益于预训练知识与任务特定适应性之间的整合。基于这一见解,我们提出GP2F,一种双分支GPL方法,显式实例化两个极端:(1)保留预训练知识的冻结分支,和(2)带有轻量级适配器用于任务特定适应的适配分支。然后,我们通过对比损失和拓扑一致性损失在拓扑约束下执行自适应融合。在跨域少样本节点和图分类上的大量实验表明,我们的方法优于现有方法。

English abstract

Graph Prompt Learning (GPL) has recently emerged as a promising paradigm for downstream adaptation of pre-trained graph models, mitigating the misalignment between pre-training objectives and downstream tasks. Recently, the focus of GPL has shifted from in-domain to cross-domain scenarios, which are closer to real-world applications, where the pre-training source and downstream target often differ substantially in data distribution. However, why GPL methods remain effective under such domain shifts is still unexplored. Empirically, we observe that representative GPL methods are competitive with two simple baselines in cross-domain settings: full fine-tuning (FT) and linear probing (LP), motivating us to explore a deeper understanding of the prompting mechanism. We provide a theoretical analysis demonstrating that jointly leveraging these two complementary branches yields a smaller estimation error than using either branch alone, formally proving that cross-domain GPL benefits from the integration of pre-trained knowledge and task-specific adaptation. Based on this insight, we propose GP2F, a dual-branch GPL method that explicitly instantiates the two extremes: (1) a frozen branch that retains pre-trained knowledge, and (2) an adapted branch with lightweight adapters for task-specific adaptation. We then perform adaptive fusion under topology constraints via a contrastive loss and a topology-consistent loss. Extensive experiments on cross-domain few-shot node and graph classification demonstrate that our method outperforms existing methods.

2602.11243 2026-05-25 cs.LG cs.CL version update

Evaluating Memory Structure in LLM Agents

评估LLM智能体中的记忆结构

Alina Shutova, Alexandra Olenina, Ivan Vinogradov, Anton Sinitsin

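Affiliations * HSE University; Yandex; YSDA; New Economic School

AI summary This paper studies how well LLM-based agents organize the structure of their long-term memory. It proposes a new benchmark, StructMemEval, that evaluates an agent's ability to organize long-term memory in a structured way rather than only fact retention or simple retrieval. The benchmark contains tasks that require structured knowledge organization, such as transaction ledgers and to-do lists. Experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents solve them reliably when prompted on how to organize their memory. Without such prompting, modern LLMs do not always recognize the required structure, which points to the need for better LLM training and memory frameworks.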

Comments Preprint, work in progress

Details
AI Chinese abstract

现代基于LLM的智能体和聊天助手依赖长期记忆框架来存储可重用知识、回忆用户偏好并增强推理。随着研究人员创建更复杂的记忆架构,分析其能力并指导未来记忆设计变得越来越困难。大多数长期记忆基准侧重于简单事实保留、多跳回忆和基于时间的变化。虽然这些能力无疑很重要,但通常可以通过简单的检索增强LLM实现,并且不测试复杂的记忆层次。为了弥补这一差距,我们提出了StructMemEval——一个测试智能体组织其长期记忆能力(而不仅仅是事实回忆)的基准。我们收集了一系列人类通过以特定结构组织知识来解决的任务:交易账本、待办事项列表、树结构等。我们的初步实验表明,简单的检索增强LLM在这些任务上表现困难,而记忆智能体在提示如何组织记忆时可以可靠地解决它们。然而,我们还发现,现代LLM在未被提示时并不总是能识别记忆结构。这突显了未来在LLM训练和记忆框架改进方面的一个重要方向。

English abstract

Modern LLM-based agents and chat assistants rely on long-term memory frameworks to store reusable knowledge, recall user preferences, and augment reasoning. As researchers create more complex memory architectures, it becomes increasingly difficult to analyze their capabilities and guide future memory designs. Most long-term memory benchmarks focus on simple fact retention, multi-hop recall, and time-based changes. While undoubtedly important, these capabilities can often be achieved with simple retrieval-augmented LLMs and do not test complex memory hierarchies. To bridge this gap, we propose StructMemEval - a benchmark that tests the agent's ability to organize its long-term memory, not just factual recall. We gather a suite of tasks that humans solve by organizing their knowledge in a specific structure: transaction ledgers, to-do lists, trees and others. Our initial experiments show that simple retrieval-augmented LLMs struggle with these tasks, whereas memory agents can reliably solve them if prompted how to organize their memory. However, we also find that modern LLMs do not always recognize the memory structure when not prompted to do so. This highlights an important direction for future improvements in both LLM training and memory frameworks.

2602.08927 2026-05-25 stat.ML cs.LG stat.ME version update

Online monotone density estimation and log-optimal calibration

在线单调密度估计与对数最优校准

Rohan Hore, Ruodu Wang, Aaditya Ramdas

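Affiliations * Department of Statistics and Data Science, Carnegie Mellon University, USA; Department of Statistics and Actuarial Science, University of Waterloo, Canada

AI summary This paper studies online monotone density estimation, in which density estimators are constructed predictably from sequentially observed data. The authors propose two online estimators: an online version of the classical Grenander estimator, and an expert aggregation estimator inspired by exponential weighting methods from online learning. When the underlying density is monotone, the expected cumulative log-likelihood gap between the proposed estimators and the true density is bounded by $O(n^{1/3})$, and the expert aggregation estimator has a $\sqrt{n\log{n}}$ pathwise regret bound relative to the best offline estimator. The authors also connect the problem to log-optimal p-to-e calibration in sequential hypothesis testing and build empirically adaptive calibrators from the proposed methods.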

Comments 31 pages, 2 figures

Details
AI Chinese abstract

我们研究在线单调密度估计问题,其中密度估计器必须根据顺序观测数据以可预测的方式构建。我们提出两种在线估计器:经典Grenander估计器的在线类比,以及受在线学习文献中指数加权方法启发的专家聚合估计器。在良好指定的随机设定下,即底层密度是单调的,我们证明在线估计器与真实密度之间的期望累积对数似然差距具有$O(n^{1/3})$界。我们进一步建立了专家聚合估计器相对于事后选择的离线最优单调估计器的$\sqrt{n\log{n}}$路径后悔界,对观测序列的正则性假设要求极低。作为一个独立兴趣的应用,我们证明构建用于序贯假设检验的对数最优p-to-e校准器的问题可以表述为在线单调密度估计问题。我们调整所提出的估计器以构建经验自适应的p-to-e校准器,并证明其最优性。数值实验验证了理论结果。

English abstract

We study the problem of online monotone density estimation, where density estimators must be constructed in a predictable manner from sequentially observed data. We propose two online estimators: an online analogue of the classical Grenander estimator, and an expert aggregation estimator inspired by exponential weighting methods from the online learning literature. In the well-specified stochastic setting, where the underlying density is monotone, we show that the expected cumulative log-likelihood gap between the online estimators and the true density admits an $O(n^{1/3})$ bound. We further establish a $\sqrt{n\log{n}}$ pathwise regret bound for the expert aggregation estimator relative to the best offline monotone estimator chosen in hindsight, under minimal regularity assumptions on the observed sequence. As an application of independent interest, we show that the problem of constructing log-optimal p-to-e calibrators for sequential hypothesis testing can be formulated as an online monotone density estimation problem. We adapt the proposed estimators to build empirically adaptive p-to-e calibrators and establish their optimality. Numerical experiments illustrate the theoretical results.
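
The link between p-to-e calibration and monotone densities can be made concrete with a small sketch. A decreasing density on $(0, 1]$ that integrates to 1 is a valid p-to-e calibrator, and the family $f_\kappa(p) = \kappa p^{\kappa-1}$ with $\kappa \in (0,1)$ is a standard example. The code below aggregates a finite grid of these calibrators with exponential weights under log loss (learning rate 1), using only past p-values to set the weights. This is an illustration of the general idea, not the paper's estimator. The grid of $\kappa$ values and the function name are assumptions.

```python
import numpy as np

def aggregated_e_process(p_values, kappas=(0.1, 0.3, 0.5, 0.7, 0.9)):
    """Running wealth of an exponentially weighted mixture of calibrators.

    Each expert is the decreasing density f_k(p) = k * p**(k - 1) on (0, 1].
    The weights used at time t depend only on p_1..p_{t-1} (predictable).
    """
    kappas = np.asarray(kappas, dtype=float)
    log_w = np.full(len(kappas), -np.log(len(kappas)))  # uniform initial weights
    log_wealth, path = 0.0, []
    for p in p_values:
        # Log-density of every expert at the new p-value.
        log_f = np.log(kappas) + (kappas - 1.0) * np.log(p)
        # Mixture density at p under the current (past-only) weights.
        log_wealth += np.logaddexp.reduce(log_w + log_f)
        path.append(np.exp(log_wealth))
        # Exponential-weights update with log loss, then renormalize.
        log_w = log_w + log_f
        log_w -= np.logaddexp.reduce(log_w)
    return np.array(path)

small = aggregated_e_process([0.01] * 5)  # consistently small p-values
ones = aggregated_e_process([1.0])        # a p-value carrying no evidence
print(small[-1], ones[0])
```

Because each mixture is itself a density on $(0, 1]$ chosen before the next p-value is seen, its expected value under a uniform p-value is 1, so the running product is a test martingale under the null. Small p-values make the wealth grow, and a p-value of 1 shrinks it by the average of the $\kappa$ grid.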

2602.07697 2026-05-25 cs.LG cs.AI cs.NE version update

On the Infinite Width and Depth Limits of Predictive Coding Networks

预测编码网络的无限宽度和深度极限

Francesco Innocenti, El Mehdi Achour, Rafal Bogacz

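Affiliations * Brain Network Dynamics Unit, University of Oxford, UK; UM6P College of Computing, Rabat, Morocco

AI summary This paper studies predictive coding networks (PCNs) in the infinite width and depth limits and establishes a theoretical link to backpropagation (BP). For linear residual networks, the width- and depth-stable parameterisations for predictive coding are the same as for BP. When the network width is much larger than its depth, the predictive coding energy at activity equilibrium converges to the quadratic BP loss, so predictive coding computes the same gradients as BP. Experiments show that the conclusion also holds for nonlinear models such as convolutional networks and Transformers, giving a theoretical basis for BP-like training with predictive coding in wide, shallow networks.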

Comments 36 pages, 28 figures

Details
AI Chinese abstract

预测编码(PC)是标准反向传播(BP)的一种生物合理替代方案,它在更新权重之前通过最小化关于网络活动的能量函数来工作。最近的工作通过利用一些受BP启发的重新参数化,提高了深度PC网络(PCN)的训练稳定性。然而,这些方法的完全可扩展性和理论基础仍不清楚。为了解决这一空白,我们研究了PCN的无限宽度和深度极限。对于线性残差网络,我们表明PC的宽度和深度稳定的特征学习参数化集合与BP完全相同。此外,在这些参数化中的任何一种下,当模型宽度远大于深度时,具有平衡活动的PC能量收敛到二次BP损失,导致PC计算与BP相同的梯度。实验表明,只要达到活动平衡,非线性模型(包括卷积网络和transformer)也收敛到BP。总体而言,这项工作限制了与PC可扩展的参数化类型,同时展示了如何通过仅局部更新在比深度宽得多的网络(如大脑)中有效实现BP。

English abstract

Predictive coding (PC) is a biologically plausible alternative to standard backpropagation (BP) that minimises an energy function with respect to network activities before updating weights. Recent work has improved the training stability of deep PC networks (PCNs) by leveraging some BP-inspired reparameterisations. However, the full scalability and theoretical basis of these methods remain unclear. To address this gap, we study the infinite width and depth limits of PCNs. For linear residual networks, we show that the set of width- and depth-stable feature-learning parameterisations for PC is exactly the same as for BP. Moreover, under any of these parameterisations, the PC energy with equilibrated activities converges to the quadratic BP loss when the model width is much larger than the depth, resulting in PC computing the same gradients as BP. Experiments show that, as long as an activity equilibrium is reached, convergence to BP holds for nonlinear models including convolutional networks and transformers. Overall, this work constrains the types of parameterisation that are scalable with PC, while showing a way in which BP can be effectively implemented with only local updates in networks that are much wider than they are deep, such as the brain.

2602.07235 2026-05-25 cs.LG cs.AI cs.IT math.IT version update

ArcMark: Distortion-Free Multi-Byte LLM Watermark via Optimal Transport

ArcMark: 通过最优传输实现无失真的多字节大语言模型水印

Atefeh Gilani, Sajani Vithana, Carol Xuan Long, Oliver Kosut, Lalitha Sankar, Flavio P. Calmon

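Affiliations * Arizona State University; Harvard University

AI summary ArcMark is a distortion-free multi-byte watermark for large language models, based on optimal transport, that embeds several bytes of information into a small amount of generated text without changing its quality. The authors formulate distortion-free watermarking as a channel coding problem and derive an information-theoretic channel capacity, which sets the theoretical limit on how much information can be embedded without distortion and guides the design of ArcMark. Experiments show that ArcMark outperforms existing methods in message reconstruction accuracy and robustness to attacks, and that its output is indistinguishable from unwatermarked text in perplexity and downstream task performance.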

Details
AI Chinese abstract

水印是促进大语言模型(LLM)负责任使用的重要工具。现有水印在生成的token中插入信号,要么标记LLM生成的文本(零比特水印),要么编码更复杂的消息(多比特水印)。尽管最近许多方法在不扰动平均下一token预测的情况下向文本中插入多个比特,但它们很大程度上扩展了零比特设置的设计原则,例如每个token编码单个比特。相比之下,能够将多个字节嵌入文本的水印将极大地增加潜在应用,例如嵌入提交提示的用户ID、使用的精确模型版本,甚至提示本身。我们通过引入ArcMark来解决这个问题:一种基于编码和信息论原理的新型水印构造,能够可靠地将多字节信息嵌入仅几百个token中,而不会对底层LLM的下一token分布造成任何失真。我们通过将无失真水印问题建模为信道编码问题,并推导出信息论信道容量,该容量建立了在LLM输出中无失真嵌入信息的基本极限,从而推导出ArcMark。该容量公式指导了ArcMark的设计。在实践中,ArcMark在重建精度上优于竞争的多比特无失真水印,包括在面对改变部分LLM文本的攻击时。ArcMark输出在困惑度和下游任务质量方面也显示出与未加水印文本无法区分。

English abstract

Watermarking is an important tool for promoting the responsible use of large language models (LLMs). Existing watermarks insert a signal into generated tokens that either flags LLM-generated text (zero-bit watermarking) or encodes more complex messages (multi-bit watermarking). Though a number of recent approaches insert multiple bits into text without perturbing average next-token predictions, they largely extend design principles from the zero-bit setting, such as encoding a single bit per token. In contrast, a watermarker capable of embedding multiple bytes into the text would dramatically increase the potential applications, by embedding information such as the ID of the user who submitted the prompt, the precise model version that was used, or even the prompt itself. We address this problem by introducing ArcMark: a new watermark construction based on coding and information-theoretic principles that is capable of reliably embedding multiple bytes of information into just a few hundred tokens, without any distortion of the underlying LLM next-token distribution. We derive ArcMark by formulating the distortion-free watermarking problem as a channel coding problem, and deriving an information-theoretic channel capacity that establishes the fundamental limit of embedding information in LLM output in a distortion-free manner. This capacity formulation informs the design of ArcMark. In practice, ArcMark outperforms competing multi-bit distortion-free watermarks in terms of reconstruction accuracy, including in the face of attacks that alter a subset of the LLM text. ArcMark output is also shown to be indistinguishable from unwatermarked text in terms of perplexity, and in downstream task quality.

2602.02780 2026-05-25 cs.AI cs.LG version update

Scaling-Aware Adapter for Structure-Grounded LLM Reasoning

Zihao Jing, Qiuhao Zeng, Ruiyi Fang, Yan Yi Li, Yan Sun, Boyu Wang, Pingzhao Hu

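Affiliations * Department of Computer Science, Western University, London, Canada; Department of Biochemistry, Western University, London, Canada

AI summary This paper proposes Cuttlefish, a unified multimodal large language model that targets missing geometric information and modality-fusion bottlenecks in structure-grounded reasoning. It introduces two components. Scaling-Aware Patching uses an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adjusting the number of query tokens to structural complexity. The Geometry Grounding Adapter injects geometric information into the language model via cross-attention, reducing structural hallucination. Experiments show that Cuttlefish performs strongly on several interdisciplinary all-atom structural reasoning tasks.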

Comments Accepted by ICML 2026

Details
AI Chinese abstract

大型语言模型(LLM)正在实现对2D和3D结构的推理,但现有方法仍然局限于特定模态,通常通过基于序列的标记化或固定长度的查询连接器来压缩结构输入。这种架构要么忽略了减轻结构幻觉所需的几何基础,要么施加了不灵活的模态融合瓶颈,同时过度压缩和次优分配结构令牌,从而阻碍了通用全原子推理的实现。我们引入了Cuttlefish,一种统一的多模态LLM,它将语言推理建立在几何线索上,同时根据结构复杂性缩放模态令牌。首先,缩放感知补丁利用指令条件门控机制在结构图上生成可变大小的补丁,根据结构复杂性自适应地缩放查询令牌预算,以缓解固定长度连接器的瓶颈。其次,几何基础适配器通过交叉注意力对模态嵌入进行细化,并将生成的模态令牌注入LLM,暴露明确的几何线索以减少结构幻觉。跨学科全原子基准的实验表明,Cuttlefish在异构结构基础推理中实现了优越的性能。代码:github.com/zihao-jing/Cuttlefish。

English abstract

Large language models (LLMs) are enabling reasoning over 2D and 3D structures, yet existing methods remain modality-specific and typically compress structural inputs through sequence-based tokenization or fixed-length query connectors. Such architectures either omit the geometric grounding requisite for mitigating structural hallucinations, or impose inflexible modality fusion bottlenecks that concurrently over-compress and suboptimally allocate structural tokens, thereby impeding the realization of generalized all-atom reasoning. We introduce Cuttlefish, a unified multimodal LLM that grounds language reasoning in geometric cues while scaling modality tokens with structural complexity. First, Scaling-Aware Patching leverages an instruction-conditioned gating mechanism to generate variable-size patches over structural graphs, adaptively scaling the query token budget with structural complexity to mitigate fixed-length connector bottlenecks. Second, Geometry Grounding Adapter refines these adaptive tokens via cross-attention to modality embeddings and injects the resulting modality tokens into the LLM, exposing explicit geometric cues to reduce structural hallucination. Experiments across interdisciplinary all-atom benchmarks demonstrate that Cuttlefish achieves superior performance in heterogeneous structure-grounded reasoning. Code: github.com/zihao-jing/Cuttlefish.

2602.00567 2026-05-25 cs.LG version update

Forget by Uncertainty: Orthogonal Entropy Unlearning for Quantized Neural Networks

基于不确定性的遗忘:面向量化神经网络的正交熵遗忘

Tian Zhang, Yujia Tong, Junhao Dong, Ke Xu, Yuze Wang, Jingling Yuan

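Affiliations * Hubei Key Laboratory of Transportation Internet of Things, School of Computer Science and Artificial Intelligence, Wuhan University of Technology, China; College of Computing and Data Science, Nanyang Technological University, Singapore

AI summary As quantized neural networks are deployed on edge devices and privacy regulations tighten, machine unlearning for quantized models is increasingly needed. This paper proposes OEU, an Orthogonal Entropy Unlearning framework with two core ideas: maximizing prediction uncertainty on the forgotten data to obtain an unbiased forgetting direction that avoids confident misprediction toward any specific class, and gradient orthogonal projection to remove interference between forgetting and retain gradients, which gives a theoretical guarantee of utility preservation. Experiments show that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.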

Comments Accepted by ICML 2026

Details
AI Chinese abstract

量化神经网络在边缘设备上的部署,结合GDPR等隐私法规,迫切需要在量化模型中进行机器遗忘。然而,现有方法面临关键挑战:它们通过训练模型记忆错误标签来诱导遗忘,将遗忘与错误记忆混为一谈,并采用标量梯度重加权,无法解决梯度之间的方向冲突。我们提出OEU,一种新颖的正交熵遗忘框架,具有两个关键创新:1)熵引导遗忘通过最大化遗忘数据上的预测不确定性提供无偏遗忘方向,避免向任何特定类别的错误预测;2)梯度正交投影通过将遗忘梯度投影到保留梯度的正交补上来消除干扰,在一阶近似下为效用保持提供理论保证。大量实验表明,OEU在遗忘效果和保留准确率上均优于现有方法。

英文摘要

The deployment of quantized neural networks on edge devices, combined with privacy regulations like GDPR, creates an urgent need for machine unlearning in quantized models. However, existing methods face critical challenges: they induce forgetting by training models to memorize incorrect labels, conflating forgetting with misremembering, and employ scalar gradient reweighting that cannot resolve directional conflicts between gradients. We propose OEU, a novel Orthogonal Entropy Unlearning framework with two key innovations: 1) Entropy-guided unlearning provides an unbiased forgetting direction by maximizing prediction uncertainty on forgotten data, avoiding confident misprediction toward any specific class, and 2) Gradient orthogonal projection eliminates interference by projecting forgetting gradients onto the orthogonal complement of retain gradients, providing theoretical guarantees for utility preservation under first-order approximation. Extensive experiments demonstrate that OEU outperforms existing methods in both forgetting effectiveness and retain accuracy.
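The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `entropy` and `project_orthogonal` are hypothetical, gradients are plain Python lists, and the projection shown is the textbook removal of the component along a single retain gradient.

```python
import math

def entropy(probs):
    """Shannon entropy (natural log) of a predictive distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def project_orthogonal(g_forget, g_retain):
    """Remove from g_forget its component along g_retain.

    The result lies in the orthogonal complement of g_retain, so to first
    order a step along it leaves the retain loss unchanged.
    """
    dot = sum(f * r for f, r in zip(g_forget, g_retain))
    norm_sq = sum(r * r for r in g_retain)
    if norm_sq == 0.0:  # no retain signal, nothing to project out
        return list(g_forget)
    coef = dot / norm_sq
    return [f - coef * r for f, r in zip(g_forget, g_retain)]

# Hypothetical toy gradients, for illustration only.
g_f = [1.0, 2.0, 3.0]
g_r = [1.0, 0.0, 0.0]
g_proj = project_orthogonal(g_f, g_r)  # component along g_r is removed
```

Entropy is largest for a uniform prediction, which is why maximizing it on forget data does not push the model toward any particular wrong class.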

2601.22367 2026-05-25 stat.ML cs.LG 版本更新

Amortized Simulation-Based Inference in Generalized Bayes via Neural Posterior Estimation

通过神经后验估计在广义贝叶斯中进行摊销的基于模拟的推理

Shiyi Sun, Geoff K. Nicholls, Jeong Eun Lee

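Affiliations * Department of Statistics, University of Oxford, Oxford, United Kingdom; Department of Statistics, University of Auckland, Auckland, New Zealand

AI summary The paper proposes a neural-posterior-estimation approach to generalized Bayesian inference, in which a temperature parameter $\beta$ tempers the loss to mitigate overconfidence under model misspecification and improve robustness. It gives a fully amortized variational approximation that samples the posterior for any data and any $\beta$ in a single forward pass, with no simulator calls and no MCMC at inference time. Trained through two complementary routes, the method performs comparably to non-amortized MCMC samplers on several standard simulation-based inference benchmarks.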

Comments Accepted at ICML 2026

详情
AI中文摘要

广义贝叶斯推理(GBI)通过温度β>0调整损失以减轻过度自信并提高模型误设下的鲁棒性,但现有GBI方法通常依赖昂贵的MCMC或基于SDE的采样器,且必须为每个新数据集和每个β值重新运行。我们通过训练单一数据与β条件神经后验估计器,首次为温度后验族提供了完全摊销的变分近似,使得单次前向传播即可采样,无需模拟器调用或推理时MCMC。我们引入了两种互补的训练路径:一种从温度联合分布中合成流形外样本,另一种使用自归一化重要性采样(SNIS)对固定基础数据集进行重加权。我们证明,SNIS加权目标在有限权重方差下为温度后验提供了一致的前向KL拟合。在四个标准基于模拟的推理基准(包括混沌Lorenz-96系统)中,我们的β摊销估计器在标准双样本指标上实现了具有竞争力的后验近似,在广泛温度范围内匹配了非摊销的基于MCMC的幂后验采样器。

英文摘要

Generalized Bayesian Inference (GBI) tempers a loss with a temperature $\beta > 0$ to mitigate overconfidence and improve robustness under model misspecification, but existing GBI methods typically rely on costly MCMC or SDE-based samplers and must be re-run for each new dataset and each $\beta$ value. We give the first fully amortized variational approximation for the tempered posterior family by training a single data- and $\beta$-conditioned neural posterior estimator that enables sampling in a single forward pass, without simulator calls or inference-time MCMC. We introduce two complementary training routes: one synthesizes off-manifold samples from the tempered joint distribution, and the other reweights a fixed base dataset using self-normalized importance sampling (SNIS). We show that the SNIS-weighted objective provides a consistent forward-KL fit to the tempered posterior with finite weight variance. Across four standard simulation-based inference benchmarks, including the chaotic Lorenz-96 system, our $\beta$-amortized estimator achieves competitive posterior approximations, in standard two-sample metrics, matching non-amortized MCMC-based power-posterior samplers over a wide range of temperatures.
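To make the idea of self-normalized importance sampling toward a tempered posterior concrete, here is a toy sketch. It is not the paper's training objective. It assumes a conjugate Gaussian model (prior theta ~ N(0, 1), likelihood x | theta ~ N(theta, 1)), uses the prior as proposal, and the function names are hypothetical. In this toy model the tempered posterior is Gaussian with mean beta * x / (1 + beta), which gives a closed form to compare against.

```python
import math
import random

def snis_tempered_mean(x_obs, beta, n_samples=20000, seed=0):
    """SNIS estimate of the tempered-posterior mean in a toy Gaussian model.

    Proposal = prior N(0, 1), so the unnormalized weight of a draw theta is
    likelihood(x_obs | theta) ** beta.
    """
    rng = random.Random(seed)
    thetas = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    log_w = [-0.5 * beta * (x_obs - th) ** 2 for th in thetas]
    shift = max(log_w)  # subtract the max for numerical stability
    w = [math.exp(lw - shift) for lw in log_w]
    total = sum(w)  # self-normalization constant
    return sum(wi * th for wi, th in zip(w, thetas)) / total

def exact_tempered_mean(x_obs, beta):
    """Closed form for the toy model: posterior precision is 1 + beta."""
    return beta * x_obs / (1.0 + beta)

estimate = snis_tempered_mean(x_obs=1.0, beta=0.5)
exact = exact_tempered_mean(x_obs=1.0, beta=0.5)
```

The same draws can be reweighted for any other value of beta, which is the property a fixed base dataset relies on.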

2601.22324 2026-05-25 cs.LG cs.MA 版本更新

Automatic Construction of Clinical Scoring Systems with LLM Agents

基于LLM代理的临床评分系统自动构建

Silas Ruhrberg Estévez, Christopher Chiu, Mihaela van der Schaar

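Affiliations * DAMTP, University of Cambridge, Cambridge, UK

AI summary The paper studies how to automatically construct scoring systems suitable for clinical practice, which typically consist of a small number of interpretable decision rules. The authors propose AgentScore, an LLM-agent-based method that uses semantically guided optimization to search the exponentially large space of rule sets for scoring rules that satisfy statistical validity and clinical deployability constraints. Experiments show that AgentScore outperforms existing score-generation methods on eight clinical prediction tasks and achieves AUROC comparable to more flexible interpretable models while operating under stronger structural constraints.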

详情
AI中文摘要

现代临床实践依赖于以紧凑评分系统形式实施的循证指南,这些评分系统由少量可解释的决策规则组成。虽然机器学习模型实现了强大的性能,但由于与工作流约束(如可记忆性、可审计性和床边执行)不匹配,许多模型未能转化为常规临床使用。我们认为,这种差距并非源于预测能力不足,而是由于在模型类别上优化时与指南部署不兼容。可部署的指南通常采用单位加权临床检查表的形式,通过对二元规则求和并设置阈值形成,但学习此类评分需要在指数级大的离散规则集空间中进行搜索。我们引入了AgentScore,它通过使用LLM提出候选规则,并采用确定性的、基于数据的验证与选择循环来强制执行统计有效性和可部署性约束,在此空间中进行语义引导的优化。在八个临床预测任务中,AgentScore优于现有的评分生成方法,并且在更强的结构约束下实现了与更灵活的可解释模型相当的AUROC。在两个额外经过外部验证的任务中,AgentScore比已建立的基于指南的评分实现了更高的区分度。

英文摘要

Modern clinical practice relies on evidence-based guidelines implemented as compact scoring systems composed of a small number of interpretable decision rules. While machine-learning models achieve strong performance, many fail to translate into routine clinical use due to misalignment with workflow constraints such as memorability, auditability, and bedside execution. We argue that this gap arises not from insufficient predictive power, but from optimizing over model classes that are incompatible with guideline deployment. Deployable guidelines often take the form of unit-weighted clinical checklists, formed by thresholding the sum of binary rules, but learning such scores requires searching an exponentially large discrete space of possible rule sets. We introduce AgentScore, which performs semantically guided optimization in this space by using LLMs to propose candidate rules and a deterministic, data-grounded verification-and-selection loop to enforce statistical validity and deployability constraints. Across eight clinical prediction tasks, AgentScore outperforms existing score-generation methods and achieves AUROC comparable to more flexible interpretable models despite operating under stronger structural constraints. On two additional externally validated tasks, AgentScore achieves higher discrimination than established guideline-based scores.
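The model class described above, a unit-weighted checklist formed by thresholding the sum of binary rules, is small enough to sketch directly. The rules, field names, and threshold below are hypothetical placeholders chosen for illustration; they do not come from the paper or from any clinical guideline.

```python
def checklist_score(patient, rules):
    """Count how many binary rules fire; every rule has unit weight."""
    return sum(1 for rule in rules if rule(patient))

def checklist_predict(patient, rules, threshold):
    """Flag the patient when the score reaches the threshold."""
    return checklist_score(patient, rules) >= threshold

# Hypothetical rules (illustration only).
RULES = [
    lambda p: p["age"] >= 65,
    lambda p: p["systolic_bp"] < 100,
    lambda p: p["resp_rate"] >= 22,
]

patient = {"age": 70, "systolic_bp": 95, "resp_rate": 18}
score = checklist_score(patient, RULES)        # two of three rules fire
flagged = checklist_predict(patient, RULES, threshold=2)
```

With k candidate rules there are 2^k possible rule subsets before thresholds are even chosen, which is the exponentially large discrete search space the abstract refers to.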

2601.21500 2026-05-25 cs.LG 版本更新

Task-Awareness Improves LLM Generations and Uncertainty

任务感知提升大语言模型生成与不确定性

Tim Tomov, Dominik Fuchsgruber, Stephan Günnemann

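Affiliations * School of Computation, Information & Technology, Technical University of Munich; Munich Data Science Institute; Munich Center for Machine Learning

AI summary The paper studies how task-dependent latent structure can improve the generation quality and uncertainty estimates of large language models (LLMs). The authors model LLM outputs directly in a task-dependent latent structure and equip it with a dissimilarity measure, which lets them compute Bayes-optimal responses that are synthesized in the latent space rather than selected from samples. Across tasks, the approach outperforms standard decoding strategies such as beam search, and quantifying uncertainty through the induced Bayesian risk improves alignment with output quality and correctness.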

详情
AI中文摘要

在LLM的许多应用中,自然语言响应通常具有潜在结构,例如表示离散标签、数值或图。然而,现有的解码和不确定性估计方法仅在语言空间中操作,并且很大程度上忽略了结构信息。我们通过在任务依赖的潜在结构中直接建模LLM输出来解决这一问题。通过为该结构配备不相似性度量,我们可以计算贝叶斯最优响应。这些响应不是从采样生成中选择的,而是通过在潜在空间中组合个体响应新合成的。在不同任务中,贝叶斯最优响应始终优于波束搜索等标准解码方法。此外,通过诱导贝叶斯风险量化不确定性,可以捕捉潜在结构方面的变化,并改善与输出质量和正确性的对齐。我们的决策理论框架适用于任何允许潜在响应结构的问题,并能够实现可靠的任务感知LLM预测。

英文摘要

In many applications of LLMs, natural language responses often have an underlying structure such as representing discrete labels, numerical values, or graphs. Yet, existing decoding and uncertainty estimation methods operate only in language space and largely disregard structural information. We address this by modeling LLM outputs directly in a task-dependent latent structure. By equipping this structure with a dissimilarity measure, we can compute Bayes-optimal responses. These are not selected from sampled generations but are newly synthesized by combining individual responses in the latent space. Across different tasks, Bayes-optimal responses consistently outperform standard decoding methods like beam search. Moreover, quantifying uncertainty via the induced Bayesian risk captures variations in terms of the latent structure and improves alignment with output quality and correctness. Our decision-theoretic framework is applicable to any problem that admits a latent response structure and enables reliable task-aware LLM predictions.
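For the simplest latent structure mentioned in the abstract, numerical values, the Bayes-optimal response has a closed form, which the sketch below illustrates. It assumes sampled answers have already been parsed to floats. The function names are hypothetical, and the two dissimilarities shown (squared and absolute error) are standard choices rather than ones confirmed by the source.

```python
import statistics

def bayes_optimal_squared(samples):
    """Minimizer of expected squared error over the samples: the mean."""
    return statistics.fmean(samples)

def bayes_optimal_absolute(samples):
    """Minimizer of expected absolute error over the samples: the median."""
    return statistics.median(samples)

def bayes_risk(samples, response, dissimilarity):
    """Average dissimilarity between the response and the samples."""
    return statistics.fmean([dissimilarity(s, response) for s in samples])

# Hypothetical numeric answers parsed from five sampled generations.
samples = [3.0, 4.0, 4.0, 5.0, 14.0]

best_sq = bayes_optimal_squared(samples)   # the mean, not one of the samples
best_abs = bayes_optimal_absolute(samples)
risk_sq = bayes_risk(samples, best_sq, lambda a, b: (a - b) ** 2)
risk_abs = bayes_risk(samples, best_abs, lambda a, b: abs(a - b))
```

Under squared error the optimal response is a value that no single generation produced, which matches the abstract's point that responses are synthesized rather than selected. The risk at the optimum doubles as an uncertainty score.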

2601.21198 2026-05-25 cs.DC cs.AI cs.LG 版本更新

ZipMoE: Efficient On-Device MoE Serving via Lossless Compression and Cache-Affinity Scheduling

ZipMoE:通过无损压缩和缓存亲和调度实现高效的设备端MoE服务

Yuchen Yang, Yaru Zhao, Pu Yang, Shaowei Wang, Zhi-Hua Zhou

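Affiliations * School of Electronic Science and Engineering, Nanjing University, China; National Key Laboratory for Novel Software Technology, Nanjing University, China

AI summary The paper presents ZipMoE, an efficient on-device MoE serving system that targets the high memory footprint of Mixture-of-Experts models on resource-constrained edge devices. ZipMoE combines the hardware properties of edge devices with the statistical redundancy of MoE parameters in a caching-scheduling co-design with provable performance guarantees, shifting on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that parallelizes efficiently. Experiments on several edge computing platforms show up to 72.77% lower inference latency and up to 6.76x higher throughput than state-of-the-art systems.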

Comments ICML 2026

详情
AI中文摘要

虽然混合专家(MoE)架构显著增强了大型语言模型的表达能力,但其巨大的内存占用严重阻碍了在资源受限的边缘设备上的实际部署,尤其是在必须保持模型行为而不依赖有损量化的情况下。在本文中,我们提出了ZipMoE,一个高效且语义无损的设备端MoE服务系统。ZipMoE通过具有可证明性能保证的缓存-调度协同设计,利用了边缘设备的硬件特性与MoE参数固有的统计冗余之间的协同作用。从根本上说,我们的设计将设备端MoE推理的范式从I/O瓶颈转变为以计算为中心的工作流,从而实现高效的并行化。我们实现了ZipMoE的原型,并在代表性边缘计算平台上使用流行的开源MoE模型和真实工作负载进行了广泛实验。评估结果表明,与最先进系统相比,ZipMoE实现了高达72.77%的推理延迟降低和高达6.76倍的吞吐量提升。我们的代码可在https://github.com/npnothard/ZipMoE-ICML26获取。

英文摘要

While Mixture-of-Experts (MoE) architectures substantially bolster the expressive power of large language models, their prohibitive memory footprint severely impedes practical deployment on resource-constrained edge devices, especially when model behavior must be preserved without relying on lossy quantization. In this paper, we present ZipMoE, an efficient and semantically lossless on-device MoE serving system. ZipMoE exploits the synergy between the hardware properties of edge devices and the statistical redundancy inherent to MoE parameters via a caching-scheduling co-design with provable performance guarantees. Fundamentally, our design shifts the paradigm of on-device MoE inference from an I/O-bound bottleneck to a compute-centric workflow that enables efficient parallelization. We implement a prototype of ZipMoE and conduct extensive experiments on representative edge computing platforms using popular open-source MoE models and real-world workloads. Our evaluation reveals that ZipMoE achieves up to $72.77\%$ inference latency reduction and up to $6.76\times$ higher throughput than state-of-the-art systems. Our code is available at: https://github.com/npnothard/ZipMoE-ICML26.

2601.18685 2026-05-25 math.HO cs.LG 版本更新

LLAMA LIMA: A Living Meta-Analysis on the Effects of Generative AI on Learning Mathematics

LLAMA LIMA: 关于生成式AI对数学学习影响的活元分析

Anselm Strohmaier, Samira Bödefeld, Oliver Straser, Frank Reinhold

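Affiliations * University of Education Freiburg, Institute of Mathematics Education

AI summary The paper presents LLAMA LIMA, a living meta-analysis of the effects of generative AI on learning mathematics, motivated by how quickly the field moves and how easily conventional reviews become outdated. Following PRISMA-LSR guidelines, the authors continuously update the literature base, fit a Bayesian multilevel meta-regression model to handle nested and cumulative data, and publish updated results at regular intervals. The third version covers 24 studies and finds a positive effect of generative AI on learning mathematics (g = 0.40, credible interval [0.14, 0.67]), with moderate evidence that it is more beneficial when it complements regular instruction rather than replacing teachers.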

Comments This is a living publication. See the first page of the PDF for more information

详情
AI中文摘要

生成式AI在数学教育中的能力正在迅速发展,给研究跟上步伐带来了重大挑战。研究综合仍然稀缺,并且可能在出版时就已经过时。为了解决这个问题,我们提出了一个关于基于生成式AI的数学学习干预效果的活元分析(LIMA)。遵循PRISMA-LSR指南,我们持续更新文献库,应用贝叶斯多水平元回归模型来处理嵌套和累积数据,并定期在预印本服务器上发布更新版本。本文报告了第三版的结果,包括24项研究,其中3项是自第二版以来新纳入的。分析表明存在正向效应(g = 0.40),可信区间较宽[0.14, 0.67],反映了证据基础仍然有限。结果显示没有发表偏倚。调节变量分析表明,有中等证据表明生成式AI在补充常规教学而非替代教师时更有益。

英文摘要

The capabilities of generative AI in mathematics education are rapidly evolving, posing significant challenges for research to keep pace. Research syntheses remain scarce and risk being outdated by the time of publication. To address this issue, we present a Living Meta-Analysis (LIMA) on the effects of generative AI-based interventions for learning mathematics. Following PRISMA-LSR guidelines, we continuously update the literature base, apply a Bayesian multilevel meta-regression model to account for nested and cumulative data, and publish updated versions on a preprint server at regular intervals. This paper reports results from the third version, including 24 studies, 3 of which were newly included since the second version. The analyses indicate a positive effect (g = 0.40) with a wide credible interval [0.14, 0.67], reflecting the still limited evidence base. Results indicate no publication bias. Moderator analyses indicate moderate evidence that generative AI is more beneficial when it complements regular instruction rather than replacing teachers.

2601.17261 2026-05-25 cs.LG 版本更新

AGZO: Activation-Guided Zeroth-Order Optimization for LLM Fine-Tuning

AGZO:用于大语言模型微调的激活引导零阶优化

Wei Lin, Yining Jiang, Qingyu Song, Qiao Xiang, Hong Xu

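Affiliations * Department of Computer Science and Engineering, The Chinese University of Hong Kong, Hong Kong; Xiamen University, China

AI summary Zeroth-order (ZO) optimization is a promising way to fine-tune large language models under strict memory limits, but existing methods typically use isotropic perturbations and ignore the structural information in activations that is available during the forward pass. The paper proposes AGZO, an activation-guided ZO method that exploits the fact that the gradient of a linear layer is confined to the subspace spanned by its input activations: it extracts a compact activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. Theoretical analysis shows that AGZO optimizes a subspace-smoothed objective and yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Experiments on Qwen3 and Pangu models show that AGZO outperforms state-of-the-art ZO baselines and significantly narrows the gap to first-order fine-tuning at almost the same peak memory footprint as other ZO methods.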

Comments 21 pages in total, including 9 pages of main text, with 4 figures and 3 tables Accepted by ICML 2026

详情
AI中文摘要

零阶优化已成为在严格内存约束下微调大语言模型的一种有前景的解决方案,因为它避免了为反向传播存储激活值所带来的过高内存成本。然而,现有的零阶方法通常采用各向同性扰动,忽略了前向传播过程中可用的丰富结构信息。在本文中,我们发现了梯度形成与激活结构之间的一个关键联系:线性层的梯度被限制在其输入激活所张成的子空间内。基于这一见解,我们提出了激活引导零阶优化方法(AGZO)。与先前方法不同,AGZO在前向传播过程中动态提取一个紧凑的、由激活信息引导的子空间,并将扰动限制在这个低秩子空间内。我们提供了一个理论框架,表明AGZO优化了一个子空间平滑的目标函数,并且能够证明其产生的更新方向与真实梯度的余弦相似度高于各向同性基线方法。在实验上,我们在Qwen3和Pangu模型上对AGZO进行了多种基准测试。AGZO持续优于最先进的零阶基线方法,并显著缩小了与一阶微调的性能差距,同时保持了与其他零阶方法几乎相同的峰值内存占用。

英文摘要

Zeroth-Order (ZO) optimization has emerged as a promising solution for fine-tuning LLMs under strict memory constraints, as it avoids the prohibitive memory cost of storing activations for backpropagation. However, existing ZO methods typically employ isotropic perturbations, neglecting the rich structural information available during the forward pass. In this paper, we identify a crucial link between gradient formation and activation structure: the gradient of a linear layer is confined to the subspace spanned by its input activations. Leveraging this insight, we propose Activation-Guided Zeroth-Order optimization (AGZO). Unlike prior methods, AGZO extracts a compact, activation-informed subspace on the fly during the forward pass and restricts perturbations to this low-rank subspace. We provide a theoretical framework showing that AGZO optimizes a subspace-smoothed objective and provably yields update directions with higher cosine similarity to the true gradient than isotropic baselines. Empirically, we evaluate AGZO on Qwen3 and Pangu models across various benchmarks. AGZO consistently outperforms state-of-the-art ZO baselines and significantly narrows the performance gap with first-order fine-tuning, while maintaining almost the same peak memory footprint as other ZO methods.
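The link between gradients and activations can be checked on a toy problem. The sketch below is not AGZO itself. It assumes a single linear layer y = W x with squared-error loss and one input vector, so the true gradient (W x - t) x^T has every row proportional to x. It then compares a two-point zeroth-order estimate along an isotropic Gaussian direction with one along a rank-one direction whose rows lie in span{x}. All function names are hypothetical.

```python
import math
import random

def forward(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def loss(W, x, t):
    """Squared error of the linear layer y = W x against target t."""
    return 0.5 * sum((yi - ti) ** 2 for yi, ti in zip(forward(W, x), t))

def true_grad(W, x, t):
    """Analytic dL/dW = (W x - t) x^T; each row is a multiple of x."""
    return [[(yi - ti) * xi for xi in x] for yi, ti in zip(forward(W, x), t)]

def zo_estimate(W, x, t, U, eps=1e-3):
    """Two-point zeroth-order gradient estimate along direction U."""
    Wp = [[w + eps * u for w, u in zip(rw, ru)] for rw, ru in zip(W, U)]
    Wm = [[w - eps * u for w, u in zip(rw, ru)] for rw, ru in zip(W, U)]
    slope = (loss(Wp, x, t) - loss(Wm, x, t)) / (2.0 * eps)
    return [[slope * u for u in ru] for ru in U]

def cosine(A, B):
    a = [v for row in A for v in row]
    b = [v for row in B for v in row]
    dot = sum(p * q for p, q in zip(a, b))
    return dot / math.sqrt(sum(p * p for p in a) * sum(q * q for q in b))

def mean_cosines(d_out=4, d_in=16, trials=300, seed=0):
    """Average cosine to the true gradient: (isotropic, activation subspace)."""
    rng = random.Random(seed)
    gauss = lambda: rng.gauss(0.0, 1.0)
    W = [[gauss() for _ in range(d_in)] for _ in range(d_out)]
    x = [gauss() for _ in range(d_in)]
    t = [gauss() for _ in range(d_out)]
    G = true_grad(W, x, t)
    iso = sub = 0.0
    for _ in range(trials):
        Z = [[gauss() for _ in range(d_in)] for _ in range(d_out)]  # isotropic
        u = [gauss() for _ in range(d_out)]
        U = [[ui * xi for xi in x] for ui in u]  # rows lie in span{x}
        iso += cosine(zo_estimate(W, x, t, Z), G)
        sub += cosine(zo_estimate(W, x, t, U), G)
    return iso / trials, sub / trials

iso_mean, sub_mean = mean_cosines()
```

In this toy setting the subspace direction only has to match the d_out-dimensional output error, while the isotropic direction has to align in all d_out * d_in coordinates, so the subspace estimate is better aligned on average.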

2601.14300 2026-05-25 cs.LG cs.CR 版本更新

Low-Cost Hard-Label Adversarial Attack with Theoretical Foundations

具有理论基础的低成本硬标签对抗攻击

Jun Liu, Leo Yu Zhang, Fengpeng Li, Isao Echizen, Jiantao Zhou

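Affiliations * University of Macau; National Institute of Informatics; Griffith University; University of Tokyo

AI summary The paper studies hard-label black-box adversarial attacks, which rely only on the model's top-1 prediction and are therefore a practically relevant threat. To address the weaknesses of existing methods in initialization and theoretical guarantees, the authors establish a unified theoretical framework and design a zero-query initialization strategy together with a Pattern-Driven Optimization (PDO) algorithm. Experiments show that the method outperforms state-of-the-art hard-label attacks in success rate and efficiency across multiple datasets and defended models, generalizes well, and bypasses the stateful defense Blacklight.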

详情
AI中文摘要

硬标签黑盒攻击仅依赖top-1预测,是最具挑战性但实际威胁最大的模型之一。尽管近期有进展,现有方法存在两个关键局限:(1) 忽视初始化的关键作用,主要关注优化策略;(2) 严重依赖经验启发式方法,缺乏理论保证。为弥补这一差距,我们建立了一个统一的理论框架,表明现有的符号翻转硬标签攻击可理解为近似真实梯度符号。在此原则性分析指导下,我们提出一种新颖的攻击框架,包含零查询初始化策略和模式驱动优化(PDO)算法。我们提供理论保证,证明我们的初始化比随机基线具有更高的与真实梯度符号的余弦相似度,且PDO模块的查询复杂度显著低于基线搜索方法。在CIFAR-10、ImageNet和ObjectNet上的大量实验(涵盖标准训练和对抗训练模型、商业API以及CLIP模型)表明,我们的方法在成功率和效率上持续优于最先进的硬标签攻击,尤其在低查询预算下。此外,我们的方法在损坏数据(ImageNet-C)、生物医学图像(PathMNIST)以及密集预测任务(如分割)上展现出鲁棒的泛化能力。值得注意的是,它绕过了有状态防御Blacklight,实现了0%的检测率。

英文摘要

Hard-label black-box attacks, relying solely on top-1 predictions, represent one of the most challenging yet practically relevant threat models. Despite recent progress, existing approaches face two key limitations: (1) they overlook the critical role of initialization, focusing primarily on optimization strategies; and (2) they rely heavily on empirical heuristics without theoretical guarantees. To bridge this gap, we establish a unified theoretical framework showing that existing sign-flipping hard-label attacks can be understood as approximating the true gradient sign. Guided by this principled analysis, we propose a novel attack framework featuring a zero-query initialization strategy and a Pattern-Driven Optimization (PDO) algorithm. We provide theoretical guarantees that our initialization yields higher cosine similarity to the true gradient sign than random baselines, and that our PDO module achieves significantly lower query complexity than baseline search methods. Extensive experiments across CIFAR-10, ImageNet, and ObjectNet (covering standard and adversarially trained models, commercial APIs, and CLIP models) demonstrate that our method consistently outperforms SOTA hard-label attacks in both success rate and efficiency, particularly under low query budgets. Furthermore, our method demonstrates robust generalization across corrupted data (ImageNet-C), biomedical images (PathMNIST), and dense prediction tasks such as segmentation. Notably, it bypasses the stateful defense Blacklight, achieving a 0% detection rate.

2601.07545 2026-05-25 cs.LG stat.ML 版本更新

Near-Optimal Private Linear Regression via Iterative Hessian Mixing

通过迭代Hessian混合实现近最优私有线性回归

Omri Lev, Moshe Shenfeld, Vishwak Srinivasan, Katrina Ligett, Ashia C. Wilson

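Affiliations * Department of EECS, Massachusetts Institute of Technology, US; School of Computer Science and Engineering, The Hebrew University of Jerusalem, IL

AI summary The paper studies differentially private ordinary least squares with bounded data and proposes Iterative Hessian Mixing (IHM), an algorithm built on Gaussian sketching. IHM is differentially private, and its excess empirical risk bounds improve on those of existing methods such as AdaSSP by removing a multiplicative factor that can be as large as the square root of the data dimension. It also performs better empirically across a large suite of datasets.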

详情
AI中文摘要

我们研究通过草图机制实现带界数据$(X,Y)$的差分隐私普通最小二乘(DP-OLS)。虽然高斯草图方法已被探索用于DP-OLS \citep{sheffet2017differentially},但它们通常被认为不如自适应充分统计量扰动(AdaSSP)方法 \citep{wang_adassp},后者直接扰动充分统计量$(X^{\top}X, X^{\top}Y)$。该方法被证明接近信息论最优,同时表现出强大的实证性能。在这项工作中,我们提出了\emph{迭代Hessian混合}(IHM),一种基于高斯草图方法构建的DP-OLS算法,其灵感来自\citet{pilanci_hessiansketch}的迭代Hessian草图。我们证明IHM是差分私有的,并以超额经验风险界的形式提供效用保证。这些界通过移除一个可能高达数据维度平方根的乘法因子,改进了AdaSSP的界。IHM的设计基于我们为先前DP-OLS的高斯草图方法提出的新准确性保证,这些保证阐明了这些方法何时预期表现良好,以及IHM如何规避其固有局限性。我们还在大量数据集上进行了严格的实证评估,表明IHM始终优于包括AdaSSP在内的先前基线。

英文摘要

We study differentially private ordinary least squares (DP-OLS) with bounded data $(X,Y)$ via sketching-based mechanisms. While Gaussian sketching approaches have been explored for DP-OLS (Sheffet, 2017), they are typically viewed as less competitive than the Adaptive Sufficient Statistics Perturbation (AdaSSP) method (Wang, 2018), which directly perturbs the sufficient statistics $(X^{\top}X, X^{\top}Y)$. AdaSSP was shown to be close to information-theoretically optimal, while also exhibiting strong empirical performance. In this work, we propose Iterative Hessian Mixing (IHM), an algorithm that builds on Gaussian sketching approaches to DP-OLS and is inspired by the Iterative Hessian Sketch of Pilanci and Wainwright (2016). We prove that IHM is differentially private and provide utility guarantees in the form of excess empirical risk bounds. These bounds improve upon those of AdaSSP by removing a multiplicative factor that can be as large as the square root of the data dimension. The design of IHM is based on new accuracy guarantees that we present for prior Gaussian sketching approaches to DP-OLS, which clarify when these methods are expected to perform well and how IHM circumvents their inherent limitations. We also conduct a rigorous empirical evaluation on a large suite of datasets, demonstrating that IHM consistently outperforms prior baselines, including AdaSSP.

2512.22597 2026-05-25 cs.LG physics.chem-ph 版本更新

Energy-Guided Generative Modeling for Low-Energy Molecular Structure Discovery

能量引导的生成式建模用于低能分子结构发现

Guikun Xu, Xiaohan Yi, Ziqiao Meng, Peilin Zhao, Yatao Bian

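Affiliations * School of Artificial Intelligence, Shanghai Jiao Tong University; Department of Computer Science, National University of Singapore; Shenzhen International Graduate School, Tsinghua University

AI summary The paper proposes EnFlow, an energy-guided generative model for efficiently discovering low-energy molecular conformations. It couples flow-based conformer generation with explicit energy landscape modeling for joint conformational ensemble generation and ground-state identification. By integrating the generative dynamics with a learned energy model, EnFlow produces structurally accurate, low-energy conformations in very few sampling steps and can rank generated conformations by energy. Experiments on GEOM-QM9 and GEOM-Drugs show strong performance.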

详情
AI中文摘要

探索分子能量景观和识别基态构象是计算化学的核心挑战。然而,从分子图生成多样化的低能构象在传统的基于物理的流程中仍然昂贵。现有的基于学习的方法仍然分散:生成模型捕捉构象多样性但通常缺乏可靠的能量校准,而确定性预测器关注单一结构且无法表示系综变异性。这里我们介绍EnFlow,据我们所知,这是第一个能量引导的生成框架,它将基于流的构象生成与显式能量景观建模相结合,用于联合构象系综生成和基态识别。通过将生成动力学与学习的能量模型集成,EnFlow引导采样朝向构象景观的低能区域,在极少的采样步数下提高结构保真度,同时实现对生成构象的基于能量的排序。在GEOM-QM9和GEOM-Drugs上的实验表明,EnFlow在构象生成和基态识别方面取得了强劲性能,同时仅需要1-2个ODE采样步。单点GFN2-xTB评估进一步表明,学习的能量分数保留了生成构象的物理上有意义的能量排序。这些结果支持显式能量景观建模作为通过联合建模构象系综及其相关能量来发现低能分子结构的有效策略。

英文摘要

Exploring molecular energy landscapes and identifying ground-state conformations are central challenges in computational chemistry. However, generating diverse low-energy conformers from molecular graphs remains expensive with traditional physics-based pipelines. Existing learning-based approaches remain fragmented: generative models capture conformational diversity but often lack reliable energy calibration, whereas deterministic predictors focus on a single structure and fail to represent ensemble variability. Here we introduce EnFlow, to our knowledge, the first energy-guided generative framework that couples flow-based conformer generation with explicit energy landscape modeling for joint conformational ensemble generation and ground-state identification. By integrating generative dynamics with a learned energy model, EnFlow guides sampling toward low-energy regions of the conformational landscape, improving structural fidelity under extremely few sampling steps while enabling energy-based ranking of generated conformations. Experiments on GEOM-QM9 and GEOM-Drugs show that EnFlow achieves strong performance in conformer generation and ground-state identification while requiring only 1--2 ODE sampling steps. Single-point GFN2-xTB evaluations further show that the learned energy scores preserve physically meaningful energetic rankings of generated conformations. These results support explicit energy landscape modeling as an effective strategy for low-energy molecular structure discovery through joint modeling of conformational ensembles and their associated energies.

2512.15436 2026-05-25 stat.ML cs.LG 版本更新

Online Partitioned Local Depth for semi-supervised applications

面向半监督应用的在线分区局部深度

John D. Foley, Justin T. Lee

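Affiliations * Metron, Inc.

AI summary The paper presents online PaLD, an extension of the partitioned local depth (PaLD) algorithm adapted to online settings and aimed mainly at semi-supervised prediction. After a cohesion network has been precomputed from a reference dataset, the algorithm can extend it to a new data point in $O(n^2)$ time, which improves computational efficiency. Applications to anomaly detection and semi-supervised classification on health-care datasets illustrate how online PaLD broadens the reach of the PaLD framework.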

Comments Added theorem statements and refined results; 21 pages, 2 figures

详情
AI中文摘要

我们介绍了分区局部深度(PaLD)算法的一个扩展,该扩展适用于在线应用,如半监督预测。PaLD以无监督、无参数聚类而闻名,但其鲁棒性基于数据点的三元组,使得精确分析计算成本高昂。目前正在研究如何提高底层离散算法的可扩展性并扩大PaLD的应用范围。我们提出的新算法online PaLD非常适合那些可以预先从参考数据集中计算凝聚网络的情况。在花费$O(n^3)$步骤构建可查询的数据结构后,online PaLD可以在$O(n^2)$时间内将凝聚网络扩展到新的数据点。我们的方法补充了之前基于近似和并行的加速方法。在实际应用中,online PaLD通过相对简单的实现使得更大的数据集可以进行精确分析。我们展示了在医疗保健数据集上的在线异常检测和半监督分类应用,作为online PaLD扩展PaLD框架应用潜力的初步说明。

英文摘要

We introduce an extension of the partitioned local depth (PaLD) algorithm that is adapted to online applications such as semi-supervised prediction. PaLD is best known for unsupervised, parameter-free clustering, but its robustness is based on triples of data points, making exact analysis computationally expensive. Research is ongoing to improve the scalability of the underlying discrete algorithm and expand the breadth of PaLD's applications. The new algorithm we present, online PaLD, is well-suited to situations where it is possible to pre-compute a cohesion network from a reference dataset. After $O(n^3)$ steps to construct a queryable data structure, online PaLD can extend the cohesion network to a new data point in $O(n^2)$ time. Our approach complements previous speed-up approaches based on approximation and parallelism. In practical terms, online PaLD makes larger datasets accessible to exact analysis with a relatively simple implementation. We present applications to online anomaly detection and semi-supervised classification for health-care datasets as initial illustrations of online PaLD's potential to expand applications of the PaLD framework.
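The $O(n^3)$ cost comes from the triple structure of PaLD: every pair of points defines a local focus, and every third point inside that focus casts a vote. The sketch below implements the standard batch cohesion computation as it is usually defined for PaLD, assuming no distance ties. It is not the online algorithm, whose data structure the abstract does not describe. The function name and the random test points are hypothetical.

```python
import math
import random

def pald_cohesion(D):
    """Batch PaLD cohesion matrix from a symmetric distance matrix D.

    For each ordered pair (x, y), the local focus holds every z that is
    within d(x, y) of x or of y. Each z in the focus that is closer to x
    than to y adds 1/|focus| to C[x][z]. Ties are assumed absent.
    """
    n = len(D)
    C = [[0.0] * n for _ in range(n)]
    for x in range(n):
        for y in range(n):
            if x == y:
                continue
            focus = [z for z in range(n)
                     if D[z][x] <= D[x][y] or D[z][y] <= D[x][y]]
            share = 1.0 / len(focus)
            for z in focus:
                if D[z][x] < D[z][y]:
                    C[x][z] += share
    return [[c / (n - 1) for c in row] for row in C]

# Hypothetical reference data: 12 random points in the plane.
rng = random.Random(0)
points = [(rng.random(), rng.random()) for _ in range(12)]
D = [[math.dist(p, q) for q in points] for p in points]
C = pald_cohesion(D)
total_cohesion = sum(sum(row) for row in C)
```

Without ties, each unordered pair distributes exactly one unit of support across its focus, so the entries of the cohesion matrix sum to n/2. That gives a cheap sanity check for any implementation.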

2511.18000 2026-05-25 cs.LG cs.AI q-bio.PE 版本更新

Reward Engineering for Spatial Epidemic Simulations: A Reinforcement Learning Platform for Individual Behavioral Learning

空间流行病模拟中的奖励工程:个体行为学习的强化学习平台

Radman Rakhshandehroo, Daniel Coombs

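Affiliations * Department of Computer Science, University of British Columbia; Department of Mathematics and Institute of Applied Mathematics, University of British Columbia

AI summary The paper introduces ContagionRL, a reinforcement learning platform designed for spatial epidemic simulation that supports systematic study of how reward function design affects learned individual behavior. The platform integrates a configurable SIRS+D epidemic model and supports evaluating multiple reward designs under varying environmental conditions. Ablation studies identify directional guidance and explicit adherence incentives as critical for policy learning. Agents trained with the potential field reward perform best in adherence to non-pharmaceutical interventions and in spatial avoidance strategies, and the modular platform provides a tool for exploring reward-behavior relationships.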

Comments 38 pages, 15 figures and 18 tables; Accepted to TMLR. OpenReview: https://openreview.net/forum?id=yPEASsx3hk

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

我们提出了ContagionRL,一个与Gymnasium兼容的强化学习平台,专门用于空间流行病模拟中的系统奖励工程。与依赖固定行为规则的传统基于智能体的模型不同,我们的平台能够严格评估奖励函数设计如何影响在不同流行病场景中学到的生存策略。ContagionRL集成了空间SIRS+D流行病模型与可配置的环境参数,允许研究人员在包括有限可观测性、不同移动模式和异质人口动态等变化条件下对奖励函数进行压力测试。我们评估了五种不同的奖励设计,从稀疏生存奖励到一种新颖的势场方法,跨越多种RL算法(PPO、SAC、A2C)。通过系统的消融研究,我们发现方向性指导和明确的依从性激励是稳健策略学习的关键组成部分。我们在不同感染率、网格大小、可见性约束和移动模式下的全面评估表明,奖励函数的选择显著影响智能体行为和生存结果。使用我们的势场奖励训练的智能体始终获得优越性能,学习最大程度地遵守非药物干预,同时发展出复杂的空间规避策略。该平台的模块化设计使得能够系统地探索奖励-行为关系,弥补了这类模型中奖励工程关注有限的空白。ContagionRL是研究流行病背景下适应性行为反应的有效平台,并强调了奖励设计、信息结构和环境可预测性在学习中的重要性。我们的代码公开在https://github.com/redradman/ContagionRL。

英文摘要

We present ContagionRL, a Gymnasium-compatible reinforcement learning platform specifically designed for systematic reward engineering in spatial epidemic simulations. Unlike traditional agent-based models that rely on fixed behavioral rules, our platform enables rigorous evaluation of how reward function design affects learned survival strategies across diverse epidemic scenarios. ContagionRL integrates a spatial SIRS+D epidemiological model with configurable environmental parameters, allowing researchers to stress-test reward functions under varying conditions including limited observability, different movement patterns, and heterogeneous population dynamics. We evaluate five distinct reward designs, ranging from sparse survival bonuses to a novel potential field approach, across multiple RL algorithms (PPO, SAC, A2C). Through systematic ablation studies, we identify that directional guidance and explicit adherence incentives are critical components for robust policy learning. Our comprehensive evaluation across varying infection rates, grid sizes, visibility constraints, and movement patterns reveals that reward function choice dramatically impacts agent behavior and survival outcomes. Agents trained with our potential field reward consistently achieve superior performance, learning maximal adherence to non-pharmaceutical interventions while developing sophisticated spatial avoidance strategies. The platform's modular design enables systematic exploration of reward-behavior relationships, addressing a knowledge gap in models of this type where reward engineering has received limited attention. ContagionRL is an effective platform for studying adaptive behavioral responses in epidemic contexts and highlights the importance of reward design, information structure, and environmental predictability in learning. Our code is publicly available at https://github.com/redradman/ContagionRL

2511.17171 2026-05-25 cs.CV cs.LG 版本更新

FireScope: Wildfire Risk Raster Prediction with a Chain-of-Thought Oracle

FireScope: 基于思维链预言机的野火风险栅格预测

Mario Markov, Stefan Maria Ailuro, Luc Van Gool, Konrad Schindler, Danda Pani Paudel

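Affiliations * ETH Zurich

AI summary The paper proposes FireScope, a framework for predicting wildfire risk rasters by reasoning over visual, climatic, and geographic information. It introduces FireScope-Bench, a dataset that pairs Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA and real wildfire events in Europe for cross-continental evaluation. FireScope builds on a vision-language model and learns from reinforcement learning and visual supervision to produce risk rasters with accompanying reasoning traces, substantially improving cross-continental generalization and interpretability. The authors describe it as the first framework to show that language-based reasoning can improve generalization in visual generation, and as a high-resolution wildfire risk model that can be applied across continents.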

Comments CVPR 2026, Project Page: https://firescope.ai/research

详情
AI中文摘要

预测野火风险是一个推理密集型的空间问题,需要整合视觉、气候和地理因素来推断连续的风险地图。现有方法缺乏可靠泛化所需的因果推理和多模态理解。我们引入了FireScope-Bench,一个大规模数据集和基准,将Sentinel-2图像和气候数据与专家定义的全美风险栅格以及欧洲的真实野火事件配对,用于跨大陆评估。基于此数据集,我们提出了FireScope,一个基于VLM的推理到生成框架,从强化学习和视觉监督中学习,通过互补的推理轨迹预测风险栅格。当在美国训练并在欧洲测试时,FireScope取得了显著的性能提升,而专家反馈和自动化分析证实其推理轨迹是忠实且有语义意义的。我们的发现表明,推理可以支撑栅格预测模型,提高泛化性和可解释性。据我们所知,这是第一个(1)证明基于语言的推理可以改善视觉生成泛化性的框架,(2)提出一个可跨大陆应用的高分辨率野火风险模型,以及(3)能够系统研究多模态火灾风险模型稳健跨大陆泛化的框架。我们相信FireScope-Bench有潜力成为推动推理驱动、可解释和可泛化空间建模的基础。数据和源代码将公开提供。

英文摘要

Predicting wildfire risk is a reasoning-intensive spatial problem that requires the integration of visual, climatic, and geographic factors to infer continuous risk maps. Existing methods lack the causal reasoning and multimodal understanding required for reliable generalization. We introduce FireScope-Bench, a large-scale dataset and benchmark that couples Sentinel-2 imagery and climate data with expert-defined risk rasters across the USA, and real wildfire events in Europe for cross-continental evaluation. Building on this dataset, we propose FireScope, a VLM-based reasoning-to-generation framework that learns from both reinforcement learning and visual supervision to predict risk rasters with complementary reasoning traces. When trained in the USA and tested in Europe, FireScope achieves substantial performance gains, while expert feedback and automated analysis confirm that its reasoning traces are faithful and semantically meaningful. Our findings demonstrate that reasoning can ground raster prediction models, improving both generalization and interpretability. To our knowledge, this is the first framework to (1) demonstrate that language-based reasoning can improve generalization in visual generation, (2) propose a high-resolution wildfire risk model that can be applied across continents, and (3) enable systematic studies of robust cross-continental generalization for multimodal fire risk models. We believe that FireScope-Bench has the potential to serve as a foundation for advancing reasoning-driven, interpretable and generalizable spatial modeling. Data and source code will be made publicly available.

2511.00266 2026-05-25 cs.LG cs.RO 版本更新

X-TRACK: Physics-Aware xLSTM for Realistic Vehicle Trajectory Prediction

X-TRACK: 物理感知的xLSTM用于真实车辆轨迹预测

Aanchal Rajesh Chugh, Marion Neumeier, Sebastian Dorn

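AI summary Accurate trajectory prediction is essential to the safety and reliability of autonomous driving systems and requires modeling long-term temporal dependencies and social interactions among vehicles in highway scenarios. The paper proposes X-TRAJ, a new xLSTM-based highway trajectory prediction framework, and its physics-aware variant X-TRACK, which explicitly integrates vehicle kinematic constraints to generate realistic, feasible trajectories. On the public datasets highD and NGSIM, X-TRACK outperforms state-of-the-art baselines on highD and is among the state-of-the-art models on NGSIM.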

详情
AI中文摘要

准确的轨迹预测对于安全可靠的自动驾驶系统至关重要,需要模型能够捕捉长期时间依赖性,同时考虑高速公路驾驶场景中相邻车辆之间的社交互动。虽然长短期记忆(LSTM)网络在轨迹预测领域得到了广泛应用,但它们存在记忆容量有限和标量细胞状态等局限性。最近引入的扩展长短期记忆(xLSTM)通过引入指数门控和增强的记忆结构解决了传统LSTM的这些局限性,使其更适合建模长期时间依赖性。尽管具有潜力,基于xLSTM的模型在车辆轨迹预测方面仍未得到充分探索。本文首次将xLSTM应用于高速公路轨迹预测,提出了新颖的基于xLSTM的高速公路轨迹预测框架X-TRAJ,以及其物理感知变体X-TRACK(受运动学约束的扩展LSTM轨迹预测),该变体将车辆运动学显式集成到模型学习过程中。通过引入物理约束,所提出的模型生成真实可行的高速公路轨迹。在公开的高速公路数据集highD和NGSIM上的全面评估表明,X-TRACK在highD上优于最先进的基线,并在NGSIM数据集上达到最先进模型水平。

英文摘要

Accurate trajectory prediction is crucial for safe and reliable autonomous driving systems, requiring models that capture long-term temporal dependencies while accounting for social interactions among neighboring vehicles in highway driving scenarios. While Long Short-Term Memory (LSTM) networks have been widely used for trajectory prediction, they are limited by their memory capacity and scalar cell state. The recently introduced Extended Long Short-Term Memory (xLSTM) addresses these limitations of traditional LSTMs by introducing exponential gating and enhanced memory structures, making it better suited for modeling long-term temporal dependencies. Despite their potential, xLSTM-based models remain underexplored in the context of vehicle trajectory prediction. This paper introduces X-TRAJ, a novel xLSTM-based framework that is the first application of xLSTM to highway trajectory prediction, and its physics-aware variant, X-TRACK (eXtended LSTM for TRAjectory prediction Constraint by Kinematics), which explicitly integrates vehicle motion kinematics into the model learning process. By introducing physical constraints, the proposed model generates realistic and feasible highway trajectories. A comprehensive evaluation on the publicly available highway datasets, highD and NGSIM, demonstrates that X-TRACK outperforms state-of-the-art baselines on highD and is among the state-of-the-art models on the NGSIM dataset.

2510.22941 2026-05-25 cs.LG 版本更新

Hazard-Responsive Digital Twin for Climate-Driven Urban Resilience and Equity

面向气候驱动的城市韧性与公平的灾害响应数字孪生

Zhenglai Shen, Hongyu Zhou

Affiliations * Buildings and Transportation Science Division, Oak Ridge National Laboratory; Civil and Environmental Engineering, University of Tennessee

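AI summary To address compound climate hazards such as wildfire-induced power outages and urban heatwaves, the paper proposes a Hazard-Responsive Digital Twin (H-RDT) that combines physics-informed neural networks, multimodal data fusion, and equity-aware risk analytics to improve urban resilience and equity. In a synthetic district, the system keeps indoor temperature predictions stable under partial sensor loss, a reinforcement-learning module adaptively fuses IoT, UAV, and satellite data, and an equity-adjusted mapping identifies high-vulnerability clusters such as schools, clinics, and low-income housing. Interventions such as preemptive cooling-center activation and microgrid sharing reduce population-weighted thermal risk and tail risk, supporting more adaptive and equity-oriented decisions for urban climate adaptation.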

Comments 52 pages, 9 figures

Journal ref
Sustainable Cities and Society 144 (2026) 107413
AI中文摘要

复合气候灾害,如野火引发的停电和城市热浪,挑战着城市的稳定性和公平性。我们提出一种灾害响应数字孪生(H-RDT),它结合了物理信息神经网络建模、多模态数据融合和公平感知风险分析,用于城市尺度的响应。在一个包含多种建筑原型和人群的合成区域中,模拟的野火-停电-热浪级联事件表明,H-RDT 在部分传感器缺失的情况下能维持稳定的室内温度预测(约31至33°C),再现停电引发的温度激增和恢复。基于强化学习的融合模块自适应地重新加权物联网、无人机和卫星输入,以维持时空覆盖,而公平调整的映射则隔离出高脆弱性集群(学校、诊所、低收入住房)。前瞻性干预措施,如预防性冷却中心启动和微电网共享,将人口加权热风险降低11%至13%,将95百分位(尾部)风险缩小7%至17%,并将过热小时数减少高达9%。除了合成演示之外,该框架为实际城市实施建立了可迁移的基础,将物理灾害建模与社会公平和决策智能联系起来。H-RDT 推动数字城市韧性向自适应、基于学习和以公平为中心的决策支持发展,以应对气候适应。

英文摘要

Compounding climate hazards, such as wildfire-induced outages and urban heatwaves, challenge the stability and equity of cities. We present a Hazard-Responsive Digital Twin (H-RDT) that combines physics-informed neural network modeling, multimodal data fusion, and equity-aware risk analytics for urban-scale response. In a synthetic district with diverse building archetypes and populations, a simulated wildfire-outage-heatwave cascade shows that H-RDT maintains stable indoor temperature predictions (approximately 31 to 33 °C) under partial sensor loss, reproducing outage-driven surges and recovery. The reinforcement learning based fusion module adaptively reweights IoT, UAV, and satellite inputs to sustain spatiotemporal coverage, while the equity-adjusted mapping isolates high-vulnerability clusters (schools, clinics, low-income housing). Prospective interventions, such as preemptive cooling-center activation and microgrid sharing, reduce population-weighted thermal risk by 11 to 13 percent, shrink the 95th-percentile (tail) risk by 7 to 17 percent, and cut overheating hours by up to 9 percent. Beyond the synthetic demonstration, the framework establishes a transferable foundation for real-city implementation, linking physical hazard modeling with social equity and decision intelligence. The H-RDT advances digital urban resilience toward adaptive, learning-based, and equity-centered decision support for climate adaptation.

2510.12328 2026-05-25 cs.LG 版本更新

Leveraging Teleconnections with Physics-Informed Graph Attention Networks for Long-Range Extreme Rainfall Forecasting in Thailand

利用遥相关与物理信息图注意力网络进行泰国长期极端降雨预报

Kiattikun Chobtham, Kanoksri Sarinnapakorn, Kritanai Torsri, Prattana Deeprasertkul, Jirawan Kamma

Affiliations * Hydro-Informatics Institute, Ministry of Higher Education, Science, Research and Innovation

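AI summary The paper proposes a model that combines physics-informed graph neural networks with extreme-value analysis to improve extreme rainfall forecasts for Thailand. Rain gauge stations are represented as a graph to capture complex spatiotemporal patterns, and teleconnections provide explainability. The model pairs a graph attention mechanism based on orographic-precipitation physics with an LSTM and handles extremes with a spatial season-aware generalized Pareto distribution method. Experiments show that it outperforms established baselines in most regions, including those prone to extreme events, and supports practical high-resolution forecasts for long-term water management.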

AI中文摘要

准确的降雨预报,特别是极端事件,仍然是气候学和地球系统中的一个重大挑战。本文提出了新颖的物理信息图神经网络(GNNs)结合极值分析技术,以改进泰国全境的测站降雨预测。该模型利用测站的图结构表示来捕捉复杂的时空模式,并通过遥相关提供可解释性。我们预处理可能影响区域降雨的相关气候指数。所提出的图注意力网络与长短期记忆(Attention-LSTM)使用基于简单地形降水物理公式的初始边特征应用注意力机制。嵌入随后由LSTM层处理。为了处理极端值,我们使用新颖的空间季节感知广义帕累托分布(GPD)方法进行峰值超过阈值(POT)映射,克服了传统机器学习模型的局限性。实验表明,我们的方法在大多数区域(包括易发生极端事件的区域)优于已建立的基线,并与最先进的方法保持强烈竞争力。与业务预报系统SEAS5相比,我们的实际应用改进了极端事件预测,并提供了实用增强,以生成支持长期水管理决策的高分辨率地图。

英文摘要

Accurate rainfall forecasting, particularly for extreme events, remains a significant challenge in climatology and Earth system science. This paper presents novel physics-informed Graph Neural Networks (GNNs) combined with extreme-value analysis techniques to improve gauge-station rainfall predictions across Thailand. The model leverages a graph-structured representation of gauge stations to capture complex spatiotemporal patterns, and it offers explainability through teleconnections. We preprocess relevant climate indices that potentially influence regional rainfall. The proposed Graph Attention Network with Long Short-Term Memory (Attention-LSTM) applies the attention mechanism using initial edge features derived from a simple orographic-precipitation physics formulation. The embeddings are subsequently processed by LSTM layers. To address extremes, we perform Peak-Over-Threshold (POT) mapping using the novel Spatial Season-aware Generalized Pareto Distribution (GPD) method, which overcomes limitations of traditional machine-learning models. Experiments demonstrate that our method outperforms well-established baselines across most regions, including areas prone to extremes, and remains strongly competitive with the state of the art. Compared with the operational forecasting system SEAS5, our real-world application improves extreme-event prediction and offers a practical enhancement to produce high-resolution maps that support decision-making in long-term water management.
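
The Peak-Over-Threshold step mentioned in the abstract can be illustrated with the textbook generalized Pareto tail model. The sketch below is only that textbook form, not the paper's spatial season-aware method: it assumes a single threshold, scale, and shape, and all parameter values and function names are hypothetical.

```python
import math


def gpd_exceedance_probability(x, threshold, scale, shape, exceed_rate):
    """P(X > x) for x at or above the threshold under a POT/GPD tail model.

    exceed_rate is the empirical probability of exceeding the threshold.
    """
    excess = x - threshold
    if excess < 0:
        raise ValueError("x must be at or above the threshold")
    if abs(shape) < 1e-12:
        # Shape zero is the exponential-tail limit of the GPD.
        tail = math.exp(-excess / scale)
    else:
        base = 1.0 + shape * excess / scale
        # For negative shape the tail has a finite upper endpoint.
        tail = base ** (-1.0 / shape) if base > 0 else 0.0
    return exceed_rate * tail


def gpd_return_level(prob, threshold, scale, shape, exceed_rate):
    """Value exceeded with probability prob (requires prob <= exceed_rate)."""
    ratio = exceed_rate / prob
    if abs(shape) < 1e-12:
        return threshold + scale * math.log(ratio)
    return threshold + scale / shape * (ratio ** shape - 1.0)


# Hypothetical station parameters: 50 mm threshold exceeded 10% of the time.
level = gpd_return_level(0.01, threshold=50.0, scale=10.0, shape=0.2, exceed_rate=0.1)
check = gpd_exceedance_probability(level, 50.0, 10.0, 0.2, 0.1)
```

The two functions are inverses of each other, so `check` recovers the requested probability of 0.01. A season-aware variant would presumably fit separate parameters per station and season, which this sketch does not attempt.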

2510.04406 2026-05-25 stat.ML cs.LG 版本更新

Decomposition-Based Modular Conformal Prediction for Two-Stage Modeling

基于分解的模块化共形预测用于两阶段建模

William Zhang, Saurabh Amin, Georgia Perakis

Affiliations * Operations Research Center, Massachusetts Institute of Technology, Cambridge, MA, USA; Laboratory for Information and Decision Systems, Massachusetts Institute of Technology, Cambridge, MA, USA; Sloan School of Management, Massachusetts Institute of Technology, Cambridge, MA, USA

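AI summary The paper proposes a decomposition-based modular conformal prediction framework for uncertainty quantification in two-stage modeling. It decomposes the overall prediction residual into stage-specific components so that uncertainty can be attributed to individual stages of the pipeline. With a parameter selection procedure based on family-wise error rate (FWER) control and an extension to non-stationary settings, the method achieves better coverage and diagnostics than standard conformal prediction under structural, stage-wise shifts.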

Comments 11 pages (37 with appendix), 15 figures

AI中文摘要

共形预测在最小假设下提供了有限样本覆盖保证。然而,现有方法将整个建模过程视为黑箱,忽视了利用和理解模块化结构的机会。我们引入了一种针对两阶段顺序模型的共形预测框架,其中上游预测器为下游模型生成中间表示。通过将整体预测残差分解为阶段特定成分,我们的方法使从业者能够将不确定性归因于特定的流水线阶段。我们开发了一个使用族系错误率(FWER)控制的风险控制参数选择程序,以校准阶段级缩放参数,并引入了一个针对非平稳设置的自适应扩展。在合成分布偏移以及真实供应链和股票市场数据上的实验表明,与标准共形方法相比,我们的方法在结构性的阶段级偏移下提高了覆盖,同时识别了阶段级误差贡献。该框架提供了标准共形方法所缺乏的诊断优势和鲁棒覆盖。

英文摘要

Conformal prediction offers finite-sample coverage guarantees under minimal assumptions. However, existing methods treat the entire modeling process as a black box, overlooking opportunities to exploit and understand modular structure. We introduce a conformal prediction framework for two-stage sequential models, where an upstream predictor generates intermediate representations for a downstream model. By decomposing the overall prediction residual into stage-specific components, our method enables practitioners to attribute uncertainty to specific pipeline stages. We develop a risk-controlled parameter selection procedure using family-wise error rate (FWER) control to calibrate stage-wise scaling parameters, and introduce an adaptive extension for non-stationary settings. Experiments on synthetic distribution shifts, as well as real-world supply chain and stock market data, demonstrate that our approach improves coverage under structural, stage-wise shifts compared to standard conformal methods, while identifying stage-wise error contribution. This framework offers diagnostic advantages and robust coverage that standard conformal methods lack.
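
For orientation, the standard split conformal baseline that the abstract contrasts with can be sketched in a few lines. This is the black-box procedure, not the authors' decomposition: the two-stage pipeline, the calibration data, and the function names are all hypothetical, and the whole pipeline is scored with a single absolute residual.

```python
import math


def conformal_quantile(scores, alpha):
    """Finite-sample-corrected (1 - alpha) quantile of calibration scores."""
    n = len(scores)
    # Use the ceil((n + 1) * (1 - alpha))-th smallest score.
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        # Too few calibration points for this alpha: the interval is unbounded.
        return math.inf
    return sorted(scores)[k - 1]


def split_conformal_interval(predict, calib_x, calib_y, x_new, alpha=0.1):
    """Return (lower, upper) around predict(x_new) from absolute residuals."""
    scores = [abs(y - predict(x)) for x, y in zip(calib_x, calib_y)]
    q = conformal_quantile(scores, alpha)
    center = predict(x_new)
    return center - q, center + q


# Toy two-stage pipeline (hypothetical): stage 1 builds a feature, stage 2 predicts.
def stage1(x):
    return 2.0 * x


def stage2(z):
    return z + 1.0


def pipeline(x):
    return stage2(stage1(x))


calib_x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
calib_y = [1.1, 2.8, 5.3, 6.9, 9.2, 10.7, 13.4, 14.9, 17.1]
lower, upper = split_conformal_interval(pipeline, calib_x, calib_y, 10.0, alpha=0.2)
```

Here the interval is centered on the pipeline output at `x = 10` and widened by the calibrated residual quantile. The baseline cannot say which stage contributed the error, which is the gap a stage-wise decomposition is meant to fill.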

2510.03508 2026-05-25 cs.LG 版本更新

D2 Actor Critic: Diffusion Actor Meets Distributional Critic

D2 Actor Critic: 扩散演员遇上分布式评论家

Lunjun Zhang, Shuo Han, Hanrui Lyu, Bradly C Stadie

Affiliations * Department of Computer Science, University of Toronto; Department of Statistics, Northwestern University

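AI summary The paper proposes D2AC, a new model-free reinforcement learning algorithm for efficient online training of expressive diffusion policies. Its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time, paired with a robust distributional critic that combines distributional RL with clipped double Q-learning. The algorithm performs strongly on a range of challenging benchmark tasks and shows robust behavior and generalization on a biologically motivated predator-prey task.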

Comments Accepted to TMLR 2025

AI中文摘要

我们引入了D2AC,一种新的无模型强化学习算法,旨在有效在线训练表达性扩散策略。其核心是一个策略改进目标,避免了典型策略梯度的高方差和通过时间反向传播的复杂性。这种稳定的学习过程关键得益于我们的第二个贡献:一个鲁棒的分布式评论家,我们通过融合分布式强化学习和裁剪双Q学习来设计它。最终算法非常有效,在包含Humanoid、Dog和Shadow Hand领域的18个困难强化学习任务基准上达到了最先进性能,涵盖密集奖励和目标条件强化学习场景。除了标准基准,我们还评估了一个生物启发的捕食者-猎物任务,以检验我们方法的行为鲁棒性和泛化能力。代码:https://github.com/d2ac-actor-critic/d2ac-public

英文摘要

We introduce D2AC, a new model-free reinforcement learning (RL) algorithm designed to train expressive diffusion policies online effectively. At its core is a policy improvement objective that avoids the high variance of typical policy gradients and the complexity of backpropagation through time. This stable learning process is critically enabled by our second contribution: a robust distributional critic, which we design through a fusion of distributional RL and clipped double Q-learning. The resulting algorithm is highly effective, achieving state-of-the-art performance on a benchmark of eighteen hard RL tasks, including Humanoid, Dog, and Shadow Hand domains, spanning both dense-reward and goal-conditioned RL scenarios. Beyond standard benchmarks, we also evaluate a biologically motivated predator-prey task to examine the behavioral robustness and generalization capacity of our approach. Code: https://github.com/d2ac-actor-critic/d2ac-public
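
One ingredient named in the abstract, clipped double Q-learning, has a simple scalar form that may help readers unfamiliar with it. The sketch below shows only that generic scalar target; it is not D2AC's distributional critic, and the function name and arguments are hypothetical.

```python
def clipped_double_q_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrapped TD target using the smaller of two critic estimates.

    Taking the minimum of two independently trained critics counteracts
    the overestimation bias of a single bootstrapped critic.
    """
    pessimistic = min(q1_next, q2_next)
    # No bootstrapping past a terminal transition.
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * pessimistic


target = clipped_double_q_target(1.0, False, q1_next=10.0, q2_next=8.0, gamma=0.5)
terminal_target = clipped_double_q_target(1.0, True, q1_next=10.0, q2_next=8.0, gamma=0.5)
```

With the values above, the non-terminal target bootstraps from the lower estimate (8.0) and the terminal target is just the reward. A distributional critic would apply a comparable pessimistic choice to return distributions rather than to scalars.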

2510.00915 2026-05-25 cs.LG cs.AI 版本更新

Reinforcement Learning with Verifiable yet Noisy Rewards under Imperfect Verifiers

在不完美验证器下基于可验证但含噪声奖励的强化学习

Xin-Qiang Cai, Wei Wang, Feng Liu, Tongliang Liu, Gang Niu, Masashi Sugiyama

Affiliations * RIKEN AIP; The University of Tokyo; The University of Melbourne; The University of Sydney

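AI summary The paper studies how to improve reinforcement learning with verifiable rewards (RLVR) when the verifier is unreliable. Modeling verifier unreliability as a stochastic reward channel with asymmetric noise rates, the authors propose two lightweight corrections: a backward correction that produces an unbiased surrogate reward, and a forward correction that reweights score-function terms so that the expected policy update aligns with the clean gradient direction. Both corrections improve math reasoning performance under synthetic and real verifier noise, and the forward correction is more stable under heavier noise. The authors also introduce an appeals mechanism based on a lightweight LLM verifier that estimates the false negative rate online and further improves performance.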

AI中文摘要

基于可验证奖励的强化学习(RLVR)用自动验证器替代昂贵的人工标注。为减少验证器攻击,许多RLVR系统将奖励二值化为$\{0,1\}$,但不完美的验证器不可避免地引入假阴性(拒绝正确答案)和假阳性(接受错误答案)。我们将验证器不可靠性形式化为具有非对称噪声率$\rho_0$和$\rho_1$(分别为FP率和FN率)的随机奖励通道。由此抽象我们推导出两种轻量级校正:(i)后向校正,产生无偏替代奖励,从而在期望上得到无偏的策略梯度估计量;(ii)前向校正,重新加权得分函数项,使得期望更新与干净梯度方向对齐,且仅需FN率。我们在分组相对策略优化流程中将两者实现为轻量级钩子,两种校正均在合成和真实验证器噪声下改善了数学推理的RLVR,其中前向变体在较大噪声下更稳定。最后,一个带有轻量级LLM验证器的上诉机制在线估计FN率并进一步提升性能。

英文摘要

Reinforcement Learning with Verifiable Rewards (RLVR) replaces costly human labeling with automated verifiers. To reduce verifier hacking, many RLVR systems binarize rewards to $\{0,1\}$, but imperfect verifiers inevitably introduce false negatives (rejecting correct answers) and false positives (accepting incorrect ones). We formalize verifier unreliability as a stochastic reward channel with asymmetric noise rates $\rho_0$ and $\rho_1$, the FP rate and the FN rate, respectively. From this abstraction we derive two lightweight corrections: (i) a backward correction that yields an unbiased surrogate reward and thus an unbiased policy-gradient estimator in expectation, and (ii) a forward correction that requires only the FN rate and reweights score-function terms so that the expected update aligns with the clean gradient direction. We implement both as lightweight hooks in a group relative policy optimization pipeline. Both corrections improve RLVR for math reasoning under synthetic and real verifier noise, with the forward variant being more stable under heavier noise. Finally, an appeals mechanism with a lightweight LLM verifier estimates the FN rate online and further improves performance.
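
The idea of an unbiased surrogate reward under asymmetric noise can be illustrated with the standard class-conditional noise estimator. This is a sketch under the assumption that the backward correction takes this familiar form; the abstract does not give the formula, so the exact expression in the paper may differ, and the function names are hypothetical.

```python
def backward_corrected_reward(observed, fp_rate, fn_rate):
    """Surrogate for a binary reward seen through a noisy verifier.

    fp_rate = P(observed = 1 | true = 0), fn_rate = P(observed = 0 | true = 1).
    """
    denom = 1.0 - fp_rate - fn_rate
    if denom <= 0:
        raise ValueError("noise rates must satisfy fp_rate + fn_rate < 1")
    return (observed - fp_rate) / denom


def expected_surrogate(true_reward, fp_rate, fn_rate):
    """Expectation of the surrogate over the verifier noise, given the truth."""
    p_one = 1.0 - fn_rate if true_reward == 1 else fp_rate
    return (p_one * backward_corrected_reward(1, fp_rate, fn_rate)
            + (1.0 - p_one) * backward_corrected_reward(0, fp_rate, fn_rate))


# Hypothetical noise rates: 10% false positives, 30% false negatives.
mean_if_correct = expected_surrogate(1, fp_rate=0.1, fn_rate=0.3)
mean_if_wrong = expected_surrogate(0, fp_rate=0.1, fn_rate=0.3)
```

Averaged over the noise, the surrogate equals 1 for a truly correct answer and 0 for an incorrect one, which is the sense in which it is unbiased. The price is higher variance, since individual surrogate values fall outside the range [0, 1].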

2510.00526 2026-05-25 cs.CL cs.LG 版本更新

Beyond Log Likelihood: Probability-Based Objectives for Supervised Fine-Tuning across the Model Capability Continuum

超越对数似然:面向模型能力连续体的监督微调概率目标

Gaotang Li, Ruizhong Qiu, Xiusi Chen, Heng Ji, Hanghang Tong

Affiliations * University of Illinois Urbana-Champaign

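AI summary The paper studies objectives beyond negative log likelihood (NLL) for supervised fine-tuning (SFT) and examines a family of probability-based objectives across large language models of differing capability. Extensive experiments and ablations show that model capability is the key factor in which objective works best: for strong models, prior-leaning objectives (such as $-p$ and $-p^{10}$) perform better; for weak models, NLL remains dominant; in between, no single objective prevails. The study provides a theoretical basis and practical guidance for choosing an objective according to model capability.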

Comments ICML 2026

AI中文摘要

监督微调(SFT)是后训练大型语言模型(LLM)的标准方法,但通常表现出有限的泛化能力。我们将此限制归因于其默认训练目标:负对数似然(NLL)。虽然NLL在从头训练时经典最优,但后训练处于不同范式,可能违反其最优性假设,因为模型已编码任务相关先验,且监督可能冗长且有噪声。在这项工作中,我们系统研究了各种基于概率的目标,并刻画了不同目标在不同条件下成功或失败的时间和原因。通过在8个模型骨干、27个基准和7个领域上的全面实验和广泛消融研究,我们揭示了控制目标行为的关键维度:模型能力连续体。在模型强端附近,降低低概率令牌权重的先验倾向目标(例如,-p, -p^{10}, 阈值变体)一致优于NLL;在模型弱端,NLL占主导;在中间,没有单一目标普遍最优。我们的理论分析进一步阐明了目标如何在连续体上交换位置,为根据模型能力调整目标提供了原则性基础。代码可在 https://github.com/GaotangLi/Beyond-Log-Likelihood 获取。

英文摘要

Supervised fine-tuning (SFT) is the standard approach for post-training large language models (LLMs), yet it often shows limited generalization. We trace this limitation to its default training objective: negative log likelihood (NLL). While NLL is classically optimal when training from scratch, post-training operates in a different paradigm that can violate its optimality assumptions: models already encode task-relevant priors, and supervision can be long and noisy. In this work, we systematically study various probability-based objectives and characterize when and why different objectives succeed or fail under varying conditions. Through comprehensive experiments and extensive ablation studies across 8 model backbones, 27 benchmarks, and 7 domains, we uncover a critical dimension that governs objective behavior: the model-capability continuum. Near the model-strong end, prior-leaning objectives that downweight low-probability tokens (e.g., $-p$, $-p^{10}$, thresholded variants) consistently outperform NLL; toward the model-weak end, NLL dominates; in between, no single objective prevails. Our theoretical analysis further elucidates how objectives trade places across the continuum, providing a principled foundation for adapting objectives to model capability. The code is available at https://github.com/GaotangLi/Beyond-Log-Likelihood.
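
Why objectives such as $-p$ and $-p^{10}$ downweight low-probability tokens follows directly from their derivatives with respect to the token probability. The sketch below compares those per-token gradient magnitudes with NLL; it is a plain calculus illustration, not the authors' training code, and the function names are hypothetical.

```python
import math


def nll(p):
    """Negative log likelihood of a token with model probability p."""
    return -math.log(p)


def neg_p_power(p, k=1):
    """Probability-based objective -p^k (k = 1 gives -p)."""
    return -(p ** k)


def grad_weight_nll(p):
    """|d(-log p)/dp| = 1/p: grows without bound as p approaches 0."""
    return 1.0 / p


def grad_weight_neg_p_power(p, k=1):
    """|d(-p^k)/dp| = k * p^(k-1): bounded, and vanishes at p = 0 when k > 1."""
    return k * p ** (k - 1)


# Per-token weights for an unlikely token and a likely token.
weights = {
    p: (grad_weight_nll(p), grad_weight_neg_p_power(p, 1), grad_weight_neg_p_power(p, 10))
    for p in (0.01, 0.9)
}
```

For the unlikely token, NLL assigns a weight two orders of magnitude larger than $-p$ does, while $-p^{10}$ assigns it almost none. That ordering is what "prior-leaning" means here: the objective trusts the model's existing belief about which tokens are plausible.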

2509.15105 2026-05-25 cs.LG 版本更新

Super-Linear: A Lightweight Pretrained Mixture of Linear Experts for Time Series Forecasting

Super-Linear: 一种轻量级预训练线性专家混合模型用于时间序列预测

Liran Nochumsohn, Raz Marshanski, Hedi Zisling, Omri Azencot

Affiliations * Faculty of Computer and Information Science, Ben-Gurion University

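AI summary The paper proposes Super-Linear, a lightweight pretrained mixture-of-experts model for time series forecasting. It replaces complex deep architectures with frequency-specialized linear experts and uses a lightweight spectral gating mechanism to select the relevant experts dynamically, giving efficient and accurate forecasts. Super-Linear performs strongly across benchmark datasets while substantially improving computational efficiency, robustness to sampling rates, and interpretability.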

Journal ref
Transactions on Machine Learning Research (TMLR), 2026
AI中文摘要

时间序列预测(TSF)在能源、金融、医疗和物流等领域至关重要,需要能够跨不同数据集泛化的模型。像Chronos和Time-MoE这样的大型预训练模型表现出强大的零样本(ZS)性能,但计算成本高。在这项工作中,我们引入了Super-Linear,一种轻量级且可扩展的混合专家(MoE)模型,用于通用预测。它用简单的频率特化线性专家替代深度架构,这些专家在多个频率范围内的重采样数据上进行训练。一种轻量级光谱门控机制动态选择相关专家,实现高效准确的预测。尽管简单,Super-Linear在基准测试中表现出强劲性能,同时显著提高了效率、对采样率的鲁棒性和可解释性。Super-Linear的实现可在以下网址获取:https://github.com/azencot-group/SuperLinear

英文摘要

Time series forecasting (TSF) is critical in domains like energy, finance, healthcare, and logistics, requiring models that generalize across diverse datasets. Large pre-trained models such as Chronos and Time-MoE show strong zero-shot (ZS) performance but suffer from high computational costs. In this work, we introduce Super-Linear, a lightweight and scalable mixture-of-experts (MoE) model for general forecasting. It replaces deep architectures with simple frequency-specialized linear experts, trained on resampled data across multiple frequency regimes. A lightweight spectral gating mechanism dynamically selects relevant experts, enabling efficient, accurate forecasting. Despite its simplicity, Super-Linear demonstrates strong performance across benchmarks, while substantially improving efficiency, robustness to sampling rates, and interpretability. The implementation of Super-Linear is available at https://github.com/azencot-group/SuperLinear.

2508.14311 2026-05-25 cs.LG cs.AI 版本更新

Online Learning with Multiple Fairness Regularizers via Graph-Structured Feedback

通过图结构反馈进行多重公平正则化器的在线学习

Quan Zhou, Jakub Marecek, Robert Shorten

Affiliations * Department of Mathematics, National University of Singapore; Department of Computer Science, Czech Technical University; Dyson School of Design Engineering, Imperial College London; Imperial College London

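AI summary The paper studies how to satisfy multiple, possibly conflicting, fairness requirements at once in automated decision systems. The authors address the problem in a bandit setting with graph-structured feedback, where the weights of the fairness objectives are learned adaptively through sequential interactions. The approach targets multi-objective fairness in dynamic environments.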

Comments Published in Transactions on Machine Learning Research (TMLR), 2026. OpenReview: https://openreview.net/forum?id=y8iWuDZtEw

Journal ref
Transactions on Machine Learning Research (TMLR), 2026
AI中文摘要

在自动化决策系统中,越来越需要强制执行多个通常相互竞争的公平性度量。这些公平性目标的适当权重通常是先验未知的,可能随时间变化,并且在我们的设置中,必须通过顺序交互自适应地学习。在这项工作中,我们在赌博机设置中解决了这一挑战,其中决策具有图结构反馈。

英文摘要

There is an increasing need to enforce multiple, often competing, measures of fairness within automated decision systems. The appropriate weighting of these fairness objectives is typically unknown a priori, may change over time and, in our setting, must be learned adaptively through sequential interactions. In this work, we address this challenge in a bandit setting, where decisions are made with graph-structured feedback.

2508.14083 2026-05-25 cs.LG cs.AI 版本更新

GeoMAE: Masking Representation Learning for Spatio-Temporal Graph Forecasting with Missing Values

GeoMAE:面向缺失值的时空图预测的掩码表示学习

Songyu Ke, Chenyu Wu, Yuxuan Liang, Huiling Qin, Junbo Zhang, Yu Zheng

Affiliations * College of Computer and Data Science, Fuzhou University; JD Intelligent Cities Research; School of Computing and Artificial Intelligence, Southwest Jiaotong University; Hong Kong University of Science and Technology (Guangzhou); Beijing Normal University

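AI summary GeoMAE is a self-supervised representation learning model for spatio-temporal graph forecasting, aimed at the missing data caused by environmental conditions and equipment failures in urban intelligence systems. By combining an attention-based spatio-temporal forecasting network with an auxiliary learning task, it captures dynamic spatial correlations in sensor networks and improves robustness to missing data. Experiments on several real-world datasets show that GeoMAE significantly outperforms existing methods, with a relative improvement of up to 13.20% over the best baselines.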

Comments 34 pages for pre-print version. This work has been published in *Neural Networks*. Please check the latest version via the following DOI

AI中文摘要

城市智能系统中缺失数据的普遍存在,归因于不利的环境条件和设备故障,对下游应用(尤其是交通预测和能耗预测)的有效性构成了重大挑战。因此,开发一种能够从不完整数据集中提取有意义信息的稳健时空学习方法至关重要。尽管存在针对缺失值时空图预测的方法,但未解决的问题依然存在。首先,现有研究大多基于时间序列分析,从而忽略了传感器网络中固有的动态空间相关性。其次,缺失数据模式的复杂性加剧了问题的复杂性。此外,维护条件的差异导致缺失值比率和模式显著波动,从而挑战了预测模型的泛化能力。针对这些挑战,本研究引入了GeoMAE,一种自监督的时空表示学习模型。该模型由三个主要组件组成:输入预处理模块、基于注意力的时空预测网络(STAFN)和一个辅助学习任务,该任务受掩码自编码器启发,以增强时空表示学习的鲁棒性。在真实数据集上的实证评估表明,GeoMAE显著优于现有基准,相对于最佳基线模型实现了高达13.20%的相对改进。

英文摘要

The ubiquity of missing data in urban intelligence systems, attributable to adverse environmental conditions and equipment failures, poses a significant challenge to the efficacy of downstream applications, notably traffic forecasting and energy consumption prediction. It is therefore imperative to develop a robust spatio-temporal learning methodology capable of extracting meaningful insights from incomplete datasets. Although methods exist for spatio-temporal graph forecasting in the presence of missing values, several issues remain unresolved. First, most existing research is based on time-series analysis and thereby neglects the dynamic spatial correlations inherent in sensor networks. Second, the complexity of missing data patterns compounds the difficulty of the problem. Third, variability in maintenance conditions causes the ratio and pattern of missing values to fluctuate significantly, which challenges the generalizability of predictive models. In response to these challenges, this study introduces GeoMAE, a self-supervised spatio-temporal representation learning model. The model comprises three principal components: an input preprocessing module, an attention-based spatio-temporal forecasting network (STAFN), and an auxiliary learning task, which draws inspiration from Masking AutoEncoders to enhance the robustness of spatio-temporal representation learning. Empirical evaluations on real-world datasets demonstrate that GeoMAE significantly outperforms existing benchmarks, achieving up to 13.20% relative improvement over the best baseline models.

2508.10651 2026-05-25 cs.LG 版本更新

Graph Learning via Logic-Based Weisfeiler-Leman Variants and Tabularization

基于逻辑的Weisfeiler-Leman变体与表格化的图学习

Reijo Jaakkola, Tomi Janhunen, Antti Kuusisto, Magdalena Ortiz, Matias Selin, Mantas Šimkus

Affiliations * Tampere University; TU Wien

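AI summary The paper proposes a new approach to graph classification based on logic-based variants of the Weisfeiler-Leman algorithm and tabularization: graph data is converted into tabular form and classified with standard methods for tabular data. The variants are obtained by modifying the underlying logical framework, and their expressive power is characterized precisely using a generalization of the bisimulation game for generalized quantifiers. Experiments show that the approach generally matches the predictive performance of graph neural networks and graph transformers without requiring a GPU or extensive hyperparameter tuning, and that it is considerably faster.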

Comments New version: Revised the experimental section

AI中文摘要

我们提出了一种新颖的图分类方法,该方法通过Weisfeiler-Leman算法的新变体将图数据表格化,然后应用表格数据方法。这些变体通过修改底层逻辑框架获得,并利用广义量词的双模拟游戏的新推广,对其表达能力进行了精确的理论刻画。然后我们在涵盖多个应用领域的14个数据集上测试了我们的方法。实验表明,在多达40,000个样本的数据集上,我们的方法通常能匹配图神经网络和图变换器的预测性能,而无需GPU或广泛的超参数调优。即使将我们方法的调优时间计入而基线方法的不计入,我们的方法也快5-20倍。当所有方法的调优时间都计入时,差距更显著地有利于我们的方法。

英文摘要

We present a novel approach for graph classification based on tabularizing graph data via new variants of the Weisfeiler-Leman algorithm and then applying methods for tabular data. The variants are obtained by modifying the underlying logical framework, and we establish a precise theoretical characterization of their expressive power using a novel generalization of the bisimulation game for generalized quantifiers. We then test our method on 14 datasets that span a range of application domains. The experiments demonstrate that on datasets with up to 40 000 samples, our approach generally matches the predictive performance of graph neural networks and graph transformers, without requiring a GPU or extensive hyperparameter tuning. Even when our method's tuning time is included and the baselines' is not, our method is 5-20 times faster. When tuning time is included for all methods, the gap is significantly greater in favour of our method.
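
The tabularization idea can be illustrated with the classic 1-dimensional Weisfeiler-Leman color refinement, which turns each graph into a row of color counts. This is a minimal sketch of the classic algorithm only, not the logic-based variants introduced in the paper; graphs are given as hypothetical adjacency dictionaries and the function names are illustrative.

```python
from collections import Counter


def wl_color_histogram(adjacency, rounds=2):
    """Run 1-WL color refinement and count every color signature seen."""
    colors = {v: "0" for v in adjacency}
    histogram = Counter(colors.values())
    for _ in range(rounds):
        new_colors = {}
        for v, neighbors in adjacency.items():
            # A node's new color combines its own color with the sorted
            # multiset of its neighbors' colors.
            neighbor_colors = sorted(colors[u] for u in neighbors)
            new_colors[v] = "(" + colors[v] + "|" + ",".join(neighbor_colors) + ")"
        colors = new_colors
        histogram.update(colors.values())
    return histogram


def tabularize(graphs, rounds=2):
    """Turn a list of graphs into (column names, rows of color counts)."""
    histograms = [wl_color_histogram(g, rounds) for g in graphs]
    columns = sorted(set().union(*histograms))
    rows = [[h.get(c, 0) for c in columns] for h in histograms]
    return columns, rows


path = {0: [1], 1: [0, 2], 2: [1]}
triangle = {0: [1, 2], 1: [0, 2], 2: [0, 1]}
columns, rows = tabularize([path, triangle], rounds=1)
```

After one round, the path and the triangle already receive different rows, because the path has two degree-one nodes and the triangle has none. The resulting table can be passed to any tabular classifier, which is what removes the need for a GPU. Signature strings grow quickly with the number of rounds, so a practical version would hash them to short identifiers.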

2508.02332 2026-05-25 cs.LG stat.ML 版本更新

BOOST: A Data-Driven Framework for the Automated Joint Selection of Kernel and Acquisition Functions in Bayesian Optimization

BOOST: 一种用于贝叶斯优化中核函数与采集函数自动联合选择的数据驱动框架

Joon-Hyun Park, Mujin Cheon, Jeongsu Wi, Dong-Yeun Koh

Affiliations * Department of Chemical and Biomolecular Engineering, Korea Advanced Institute of Science and Technology; Department of AX, Korea Advanced Institute of Science and Technology; Saudi Aramco-KAIST CO2 Management Center

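AI summary Bayesian optimization (BO) is a highly sample-efficient method for expensive black-box problems, and its performance depends strongly on the choice of hyperparameters such as the kernel and acquisition functions. The paper proposes BOOST, a framework that automatically and jointly selects the kernel-acquisition pair, replacing reliance on heuristics or manual tuning. BOOST uses an offline evaluation stage to predict the performance of candidate kernel-acquisition pairs and selects the most promising pair before the expensive evaluations begin. Experiments on synthetic benchmarks and machine learning hyperparameter optimization tasks show that BOOST consistently improves over fixed-hyperparameter BO and remains competitive with state-of-the-art adaptive methods.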

Comments 25 pages

AI中文摘要

贝叶斯优化(BO)是一种对昂贵黑箱问题具有高样本效率的方法,其性能关键取决于超参数的选择,包括核函数和采集函数。这带来了一个重要的实际挑战:不恰当的组合可能导致性能差和评估浪费。虽然对核函数和采集函数的单独改进已被积极探索,但自动联合选择最佳超参数对在很大程度上被忽视,迫使从业者依赖启发式方法或昂贵的手动训练。在这项工作中,我们提出了一个框架BOOST(贝叶斯优化与最优核函数和采集函数选择技术),该框架自动化了这一选择过程。BOOST利用一个简单的离线评估阶段来预测各种核函数-采集函数对的性能,并在进行昂贵的评估过程之前识别出最有希望的对。BOOST是一种数据驱动的策略选择程序,它根据候选策略在手头数据上的经验性能来评估核函数-采集函数对。在每次迭代中,先前观察到的点被划分为参考集和查询集。这些子集扮演类似于机器学习中训练集和验证集的角色:参考集用于模型构建,而查询集代表未见的区域,用于回顾性评估每个候选策略在向目标值推进方面的有效性。在合成基准和机器学习超参数优化任务上的实验表明,BOOST始终优于固定超参数的BO,并与最先进的自适应方法保持竞争力,突显了其在各种场景下的鲁棒性。

英文摘要

The performance of Bayesian optimization (BO), a highly sample-efficient method for expensive black-box problems, is critically governed by the selection of its hyperparameters, including the kernel and acquisition functions. This presents a significant practical challenge: an inappropriate combination of these can lead to poor performance and wasted evaluations. While individual improvements to kernel functions and acquisition functions have been actively explored, the joint and autonomous selection of the best pair of these fundamental hyperparameters has been largely overlooked. This forces practitioners to rely on heuristics or costly manual training. In this work, we propose a framework, BOOST (Bayesian Optimization with Optimal Kernel and Acquisition Function Selection Technique), that automates this selection. BOOST utilizes a simple offline evaluation stage to predict the performance of various kernel-acquisition function pairs and identify the most promising pair before committing to the expensive evaluation process. BOOST is a data-driven strategy selection procedure that evaluates kernel-acquisition pairs based on their empirical performance on the data at hand. At each iteration, previously observed points are partitioned into a reference set and a query set. These subsets play roles analogous to training and validation sets in machine learning: the reference set is used for model construction, while the query set represents unseen regions and is used to evaluate retrospectively how effectively each candidate strategy progresses toward the target value. Experiments on synthetic benchmarks and machine learning hyperparameter optimization tasks demonstrate that BOOST consistently improves over fixed-hyperparameter BO and remains competitive with state-of-the-art adaptive methods, highlighting its robustness across diverse landscapes.

2507.09330 2026-05-25 physics.flu-dyn cs.LG physics.comp-ph 版本更新

WellPINN: Accurate Well Representation for Transient Fluid Pressure Diffusion in Subsurface Reservoirs with Physics-Informed Neural Networks

WellPINN:基于物理信息神经网络的瞬态流体压力扩散在储层中的精确井表征

Linus Walter, Qingkai Kong, Sara Hanson-Hedgecock, Víctor Vilarrasa

Affiliations * Global Change Research Group (GCRG), IMEDEA, CSIC-UIB

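AI summary The paper proposes WellPINN, a modeling workflow based on physics-informed neural networks (PINNs) that represents transient fluid pressure diffusion around wells in subsurface reservoirs more accurately. It trains several PINN models in sequence while progressively shrinking the equivalent well radius to match the actual well dimensions, which addresses the inaccurate near-well pressure predictions of existing methods in the early stage of injection. WellPINN infers fluid pressure accurately throughout the entire injection period, advancing the potential of PINNs for inverse modeling and operational scenario simulation.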

AI中文摘要

精确的井表征对于可靠的地层描述和地下流动模型中操作场景的模拟至关重要。物理信息神经网络(PINNs)最近作为一种有前景的储层建模方法出现,能够无缝集成监测数据和控制物理方程。然而,现有的基于PINN的研究在捕捉井附近流体压力方面面临重大挑战,特别是在注入开始后的早期阶段。为了解决这个问题,我们提出了WellPINN,一种建模工作流,它结合了多个顺序训练的PINN模型的输出,以精确表征井。该工作流通过将域分解为逐步缩小的子域,同时减小等效井半径,迭代地逼近等效井半径以匹配实际井尺寸。我们的结果表明,在抽水井周围顺序训练叠加网络是第一个专注于在整个注入期间从泵注速率精确推断流体压力的工作流,显著推进了PINN在反演建模和操作场景模拟中的潜力。本文的所有数据和代码将在https://github.com/linuswalter/WellPINN公开提供。

英文摘要

Accurate representation of wells is essential for reliable reservoir characterization and simulation of operational scenarios in subsurface flow models. Physics-informed neural networks (PINNs) have recently emerged as a promising method for reservoir modeling, offering seamless integration of monitoring data and governing physical equations. However, existing PINN-based studies face major challenges in capturing fluid pressure near wells, particularly during the early stage after injection begins. To address this, we propose WellPINN, a modeling workflow that combines the outputs of multiple sequentially trained PINN models to accurately represent wells. This workflow iteratively approximates the radius of the equivalent well to match the actual well dimensions by decomposing the domain into stepwise shrinking subdomains with a simultaneously reducing equivalent well radius. Our results demonstrate that sequential training of superimposing networks around the pumping well is the first workflow that focuses on accurate inference of fluid pressure from pumping rates throughout the entire injection period, significantly advancing the potential of PINNs for inverse modeling and operational scenario simulations. All data and code for this paper will be made openly available at https://github.com/linuswalter/WellPINN.

2507.06252 2026-05-25 cs.CR cs.AI cs.LG 版本更新

False Alarms, Real Damage: Adversarial Attacks Using LLM-based Models on Text-based Cyber Threat Intelligence Systems

虚假警报,真实损害:基于LLM的模型对文本网络威胁情报系统的对抗攻击

Samaneh Shafee, Alysson Bessani, Pedro M. Ferreira

Affiliations * Faculty of Sciences, University of Lisbon; CIENCES, University of Lisbon

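AI summary The paper studies how adversarial attacks based on large language models (LLMs) affect text-based cyber threat intelligence (CTI) systems. It analyzes three attack types (evasion, flooding, and poisoning) and exposes the vulnerability of CTI pipelines that ingest text from open sources. In particular, attackers can generate fake text that misleads classifiers, degrades performance, and disrupts system functionality; the evasion attack is especially critical because it precedes and enables the flooding and poisoning attacks.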

Journal ref
Future Generation Computer Systems, 2026
AI中文摘要

网络威胁情报(CTI)已成为一种重要的补充方法,在网络威胁生命周期的早期阶段运作。CTI涉及收集、处理和分析威胁数据,以提供更准确和快速的网络威胁理解。由于数据量大,通过机器学习(ML)和自然语言处理(NLP)模型进行自动化对于有效的CTI提取至关重要。这些自动化系统利用来自社交网络、论坛和博客等来源的开源情报(OSINT)来识别威胁指标(IoCs)。尽管先前的研究集中在针对特定ML模型的对抗攻击上,但本研究通过调查整个CTI管道中各个组件的脆弱性及其对对抗攻击的敏感性,扩展了研究范围。这些脆弱性源于它们从各种开放来源(包括真实和潜在虚假内容)接收文本输入。我们分析了针对CTI管道的三种攻击类型,包括逃避、淹没和投毒,并评估了它们对系统信息选择能力的影响。具体而言,在虚假文本生成方面,该工作展示了对抗文本生成技术如何创建虚假的网络安全和类似网络安全的文本,从而误导分类器、降低性能并破坏系统功能。重点主要放在逃避攻击上,因为它先于并使得CTI管道中的淹没和投毒攻击成为可能。

英文摘要

Cyber Threat Intelligence (CTI) has emerged as a vital complementary approach that operates in the early phases of the cyber threat lifecycle. CTI involves collecting, processing, and analyzing threat data to provide a more accurate and rapid understanding of cyber threats. Due to the large volume of data, automation through Machine Learning (ML) and Natural Language Processing (NLP) models is essential for effective CTI extraction. These automated systems leverage Open Source Intelligence (OSINT) from sources like social networks, forums, and blogs to identify Indicators of Compromise (IoCs). Although prior research has focused on adversarial attacks on specific ML models, this study expands the scope by investigating vulnerabilities within various components of the entire CTI pipeline and their susceptibility to adversarial attacks. These vulnerabilities arise because they ingest textual inputs from various open sources, including real and potentially fake content. We analyse three types of attacks against CTI pipelines, including evasion, flooding, and poisoning, and assess their impact on the system's information selection capabilities. Specifically, on fake text generation, the work demonstrates how adversarial text generation techniques can create fake cybersecurity and cybersecurity-like text that misleads classifiers, degrades performance, and disrupts system functionality. The focus is primarily on the evasion attack, as it precedes and enables flooding and poisoning attacks within the CTI pipeline.

2507.05064 2026-05-25 stat.ML cs.LG stat.ME 版本更新

Vecchia-Inducing-Points Full-Scale Approximations for Gaussian Processes

高斯过程的Vecchia诱导点全尺度近似

Tim Gyger, Reinhard Furrer, Fabio Sigrist

Affiliations * Institute of Financial Services; Lucerne University of Applied Sciences and Arts; University of Zurich; Seminar for Statistics, ETH Zurich

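AI summary The paper proposes Vecchia-inducing-points full-scale (VIF) approximations for Gaussian processes, which combine the strengths of global inducing points and local Vecchia approximations to overcome the computational bottleneck of Gaussian processes on large data sets. The method uses an efficient correlation-based neighbor-finding strategy, implemented with a modified cover tree algorithm, for the Vecchia approximation of the residual process. The framework is also extended to non-Gaussian likelihoods with iterative methods that substantially reduce computational cost, and experiments on simulated and real-world data sets show advantages in computational efficiency, accuracy, and numerical stability.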

AI中文摘要

高斯过程是灵活、概率性的非参数模型,广泛应用于机器学习和统计学。然而,其在大数据集上的可扩展性受计算限制。为克服这些挑战,我们提出Vecchia诱导点全尺度(VIF)近似,结合全局诱导点和局部Vecchia近似的优势。Vecchia近似在低维输入和中等光滑协方差函数设置中表现优异,而诱导点方法更适合高维输入和更光滑的协方差函数。我们的VIF方法通过使用基于相关性的高效邻居搜索策略(通过改进的覆盖树算法实现)对残差过程进行Vecchia近似,从而桥接这两种情况。我们进一步将框架扩展到非高斯似然,引入迭代方法,与基于Cholesky的计算相比,在使用拉普拉斯近似时,训练和预测的计算成本降低了几个数量级。特别是,我们提出并比较了新颖的预条件器,并提供了理论收敛结果。在模拟和真实数据集上的大量数值实验表明,VIF近似不仅计算高效,而且比最先进的替代方法更准确、数值更稳定。所有方法均在开源C++库GPBoost中实现,并配有高级Python和R接口。

英文摘要

Gaussian processes are flexible, probabilistic, non-parametric models widely used in machine learning and statistics. However, their scalability to large data sets is limited by computational constraints. To overcome these challenges, we propose Vecchia-inducing-points full-scale (VIF) approximations combining the strengths of global inducing points and local Vecchia approximations. Vecchia approximations excel in settings with low-dimensional inputs and moderately smooth covariance functions, while inducing point methods are better suited to high-dimensional inputs and smoother covariance functions. Our VIF approach bridges these two regimes by using an efficient correlation-based neighbor-finding strategy for the Vecchia approximation of the residual process, implemented via a modified cover tree algorithm. We further extend our framework to non-Gaussian likelihoods by introducing iterative methods that, when using a Laplace approximation, reduce the computational cost of training and prediction by several orders of magnitude compared to Cholesky-based computations. In particular, we propose and compare novel preconditioners and provide theoretical convergence results. Extensive numerical experiments on simulated and real-world data sets show that VIF approximations are computationally efficient and also more accurate and numerically stable than state-of-the-art alternatives. All methods are implemented in the open source C++ library GPBoost with high-level Python and R interfaces.

2505.21573 2026-05-25 cs.LG cs.AI 版本更新

Spectral-inspired Operator Learning with Limited Data and Unknown Physics

光谱启发的少数据与未知物理下的算子学习

Han Wan, Rui Zhang, Hao Sun

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 本文研究了在数据有限且物理机制未知的情况下学习偏微分方程(PDE)动力学的挑战。为此,提出了一种名为SINO的频谱启发神经算子,它仅需2到5条轨迹即可建模复杂系统,无需显式依赖PDE方程。SINO通过频率索引自动捕捉局部和全局空间导数,结合乘法操作块和低通滤波器处理非线性效应和混叠问题,在多个二维和三维PDE基准测试中表现出优异性能,尤其在少量数据和分布外场景下显著优于现有方法。

Comments To appear in KDD 2026

详情
AI中文摘要

从有限数据和未知物理中学习PDE动力学具有挑战性。现有的神经PDE求解器要么需要大型数据集,要么依赖已知物理(如PDE残差或手工模板),导致适用性有限。为解决这些问题,我们提出光谱启发神经算子(SINO),它仅需2-5条轨迹即可建模复杂系统,无需显式PDE项。具体而言,SINO从频率索引自动捕获局部和全局空间导数,从而在物理无关机制下实现底层微分算子的紧凑表示。为建模非线性效应,它采用Pi块对光谱特征进行乘法运算,并辅以低通滤波器抑制混叠。在2D和3D PDE基准上的大量实验表明,SINO实现了最先进的性能,精度提升1-2个数量级。特别地,仅用5条训练轨迹,SINO就优于在1000条轨迹上训练的数据驱动方法,并在其他方法失败的高难度分布外案例中保持预测能力。

英文摘要

Learning PDE dynamics from limited data with unknown physics is challenging. Existing neural PDE solvers either require large datasets or rely on known physics (e.g., PDE residuals or handcrafted stencils), leading to limited applicability. To address these challenges, we propose Spectral-Inspired Neural Operator (SINO), which can model complex systems from just 2-5 trajectories, without requiring explicit PDE terms. Specifically, SINO automatically captures both local and global spatial derivatives from frequency indices, enabling a compact representation of the underlying differential operators in physics-agnostic regimes. To model nonlinear effects, it employs a Pi-block that performs multiplicative operations on spectral features, complemented by a low-pass filter to suppress aliasing. Extensive experiments on both 2D and 3D PDE benchmarks demonstrate that SINO achieves state-of-the-art performance, with improvements of 1-2 orders of magnitude in accuracy. Particularly, with only 5 training trajectories, SINO outperforms data-driven methods trained on 1000 trajectories and remains predictive on challenging out-of-distribution cases where other methods fail.
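The idea of reading spatial derivatives off frequency indices can be illustrated with a classical Fourier spectral derivative. The sketch below is not SINO itself: it is a minimal pure-Python illustration that assumes a periodic 1D signal on a uniform grid, and the function names (`dft`, `idft`, `spectral_derivative`) are hypothetical. It uses a naive O(n²) transform so that it needs no external library.

```python
import cmath
import math


def dft(values):
    """Naive discrete Fourier transform (O(n^2)), enough for a small demo."""
    n = len(values)
    return [
        sum(values[j] * cmath.exp(-2j * math.pi * k * j / n) for j in range(n))
        for k in range(n)
    ]


def idft(coeffs):
    """Inverse of dft above."""
    n = len(coeffs)
    return [
        sum(coeffs[k] * cmath.exp(2j * math.pi * k * j / n) for k in range(n)) / n
        for j in range(n)
    ]


def spectral_derivative(samples, length=2 * math.pi):
    """First derivative of a periodic signal sampled uniformly on [0, length).

    Each Fourier coefficient is multiplied by i * (2*pi/length) * k, where k is
    the signed frequency index of that mode.
    """
    n = len(samples)
    coeffs = dft(samples)
    scaled = []
    for k, c in enumerate(coeffs):
        index = k if k < n / 2 else k - n  # map 0..n-1 to signed frequencies
        if n % 2 == 0 and k == n // 2:
            index = 0  # zero the Nyquist mode for an odd-order derivative
        scaled.append(1j * (2 * math.pi / length) * index * c)
    return [v.real for v in idft(scaled)]


if __name__ == "__main__":
    n = 16
    xs = [2 * math.pi * j / n for j in range(n)]
    approx = spectral_derivative([math.sin(x) for x in xs])
    print(max(abs(a - math.cos(x)) for a, x in zip(approx, xs)))
```

For a band-limited signal such as sin(x), the result matches cos(x) up to floating-point error. Higher derivatives follow by raising the frequency factor to the corresponding power, which is one reason frequency indices give a compact handle on differential operators.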

2505.17354 2026-05-25 cs.LG stat.ML 版本更新

CT-OT Flow: Estimating Continuous-Time Dynamics from Discrete Temporal Snapshots

CT-OT Flow:从离散时间快照估计连续时间动态

Keisuke Kawano, Takuro Kutsuna, Naoki Hayashi, Yasushi Esaki, Hidenori Tanaka

发表机构 * Toyota Central R&D Labs., Inc.(丰田中央研发实验室)

AI总结 本文研究如何从离散时间快照中估计连续时间动态,针对如单细胞RNA测序、移动感知等场景中数据仅以时间聚合快照形式存在、时间标签可能噪声或不确定的问题。提出了一种两阶段框架——连续时间最优传输流(CT-OT Flow),通过部分最优传输对齐相邻时间区间以推断高分辨率时间标签,并利用时间核平滑重建连续时间数据分布,从而训练标准的常微分方程或随机微分方程模型。该方法有效处理快照聚合和时间标签不确定性,并通过实用加速策略提升计算效率,在多个合成和真实数据集上表现出更优的分布和轨迹估计性能。

Comments https://github.com/ToyotaCRDL/CT-OT_Flow

详情
AI中文摘要

在许多现实场景中(例如单细胞RNA测序、移动感知和环境监测),数据仅作为在有限时间窗口内收集的时间聚合快照被观测到,通常带有噪声或不确定的时间戳,并且无法访问连续轨迹。我们研究从这类快照估计连续时间动态的问题。我们提出连续时间最优传输流(CT-OT Flow),这是一个两阶段框架:(i)通过部分最优传输(POT)对齐相邻区间来推断高分辨率时间标签,(ii)通过时间核平滑重建连续时间数据分布,从中采样邻近时间对以训练标准ODE/SDE模型。我们的公式明确考虑了快照聚合和时间标签不确定性,并使用实际加速(筛选和小批量POT),使其适用于大型数据集。在合成基准和两个真实数据集(scRNA-seq和台风轨迹)上,与OT-CFM、[SF]²M、TrajectoryNet、MFM和ENOT相比,CT-OT Flow减少了分布和轨迹误差。

英文摘要

In many real-world settings--e.g., single-cell RNA sequencing, mobility sensing, and environmental monitoring--data are observed only as temporally aggregated snapshots collected over finite time windows, often with noisy or uncertain timestamps, and without access to continuous trajectories. We study the problem of estimating continuous-time dynamics from such snapshots. We present Continuous-Time Optimal Transport Flow (CT-OT Flow), a two-stage framework that (i) infers high-resolution time labels by aligning neighboring intervals via partial optimal transport (POT) and (ii) reconstructs a continuous-time data distribution through temporal kernel smoothing, from which we sample pairs of nearby times to train standard ODE/SDE models. Our formulation explicitly accounts for snapshot aggregation and time-label uncertainty and uses practical accelerations (screening and mini-batch POT), making it applicable to large datasets. Across synthetic benchmarks and two real datasets (scRNA-seq and typhoon tracks), CT-OT Flow reduces distributional and trajectory errors compared with OT-CFM, [SF]²M, TrajectoryNet, MFM, and ENOT.

2505.03784 2026-05-25 cs.LG 版本更新

Insulin Resistance Prediction From Wearables and Routine Blood Biomarkers

从可穿戴设备和常规血液生物标志物预测胰岛素抵抗

Ahmed A. Metwally, A. Ali Heydari, Daniel McDuff, Alexandru Solot, Zeinab Esmaeilpour, Anthony Z Faranesh, Menglian Zhou, David B. Savage, Conor Heneghan, Shwetak Patel, Cathy Speed, Javier L. Prieto

发表机构 * Google Research(谷歌研究) Institute of Metabolic Science, University of Cambridge(剑桥大学代谢科学研究所)

AI总结 该研究旨在利用可穿戴设备数据和常规血液生物标志物预测胰岛素抵抗,以实现糖尿病的早期干预。研究构建了深度神经网络模型,结合多源数据进行预测,取得了较高的准确率和泛化能力。模型在肥胖和久坐人群中表现尤为突出,并展示了与大型语言模型结合用于解释预测结果的潜力,为个性化健康管理提供了新方法。

详情
AI中文摘要

胰岛素抵抗是2型糖尿病的前兆,其特征是组织中胰岛素作用受损。当前测量胰岛素抵抗的方法虽然有效,但昂贵、难以获取、不广泛可用,并阻碍了早期干预的机会。在这项研究中,我们在美国远程招募了迄今为止最大的数据集来研究胰岛素抵抗(N=1,165名参与者,中位BMI=28 kg/m²,年龄=45岁,HbA1c=5.4%),整合了可穿戴设备时间序列数据和血液生物标志物,包括胰岛素抵抗的金标准测量——稳态模型评估胰岛素抵抗(HOMA-IR)。我们开发了深度神经网络模型,基于易于获取的数字和血液生物标志物预测胰岛素抵抗。结果表明,我们的模型通过结合可穿戴数据和易于获取的血液生物标志物,能够比单独使用任一数据源更好地预测胰岛素抵抗(R²=0.5,auROC=0.80,灵敏度=76%,特异性=84%)。在肥胖和久坐参与者(最易患2型糖尿病且能从早期干预中最大受益的亚群)中,模型显示出93%的灵敏度和95%的调整后特异性。对模型性能的严格评估,包括可解释性和鲁棒性,促进了在更大队列中的泛化能力,这一点通过在独立验证队列(N=72名参与者)上复现预测性能得到证明。此外,我们展示了如何将预测的胰岛素抵抗集成到大语言模型代理中,以帮助理解和情境化HOMA-IR值,促进解释和安全的个性化推荐。这项工作为早期检测2型糖尿病风险人群提供了可能,从而促进预防策略的早期实施。

英文摘要

Insulin resistance, a precursor to type 2 diabetes, is characterized by impaired insulin action in tissues. Current methods for measuring insulin resistance, while effective, are expensive, inaccessible, and not widely available, which hinders opportunities for early intervention. In this study, we remotely recruited the largest dataset to date across the US to study insulin resistance (N=1,165 participants, with median BMI=28 kg/m², age=45 years, HbA1c=5.4%), incorporating wearable device time series data and blood biomarkers, including the ground-truth measure of insulin resistance, homeostatic model assessment for insulin resistance (HOMA-IR). We developed deep neural network models to predict insulin resistance based on readily available digital and blood biomarkers. Our results show that our models predict insulin resistance better when combining wearable data and readily available blood biomarkers than when using either data source separately (R²=0.5, auROC=0.80, sensitivity=76%, specificity=84%). The model showed 93% sensitivity and 95% adjusted specificity in obese and sedentary participants, a subpopulation most vulnerable to developing type 2 diabetes and who could benefit most from early intervention. Rigorous evaluation of model performance, including interpretability and robustness, facilitates generalizability across larger cohorts, which is demonstrated by reproducing the prediction performance on an independent validation cohort (N=72 participants). Additionally, we demonstrated how the predicted insulin resistance can be integrated into a large language model agent to help understand and contextualize HOMA-IR values, facilitating interpretation and safe personalized recommendations. This work offers the potential for early detection of people at risk of type 2 diabetes, thereby facilitating earlier implementation of preventative strategies.
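For readers unfamiliar with the target variable, HOMA-IR is conventionally computed from fasting insulin and fasting glucose. The sketch below shows the standard textbook formula in both common glucose units. The abstract does not state the formula or the cutoff used to label participants as insulin resistant, so no threshold is included here, and the function names are hypothetical.

```python
def homa_ir(fasting_insulin_uU_per_mL, fasting_glucose_mg_per_dL):
    """Standard HOMA-IR with glucose in mg/dL: insulin * glucose / 405."""
    return fasting_insulin_uU_per_mL * fasting_glucose_mg_per_dL / 405.0


def homa_ir_mmol(fasting_insulin_uU_per_mL, fasting_glucose_mmol_per_L):
    """Equivalent form with glucose in mmol/L: insulin * glucose / 22.5."""
    return fasting_insulin_uU_per_mL * fasting_glucose_mmol_per_L / 22.5


if __name__ == "__main__":
    # 90 mg/dL of glucose corresponds to 5.0 mmol/L (divide by 18),
    # so both calls describe the same hypothetical measurement.
    print(homa_ir(10.0, 90.0))
    print(homa_ir_mmol(10.0, 5.0))
```

The two constants are consistent with each other because 405 = 22.5 × 18, where 18 is the usual mg/dL-to-mmol/L conversion factor for glucose.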

2504.09846 2026-05-25 cs.LG cs.AI cs.HC 版本更新

GlyTwin: Digital Twin for Glucose Control in Type 1 Diabetes Through Optimal Behavioral Modifications Using Patient-Centric Counterfactuals

GlyTwin: 通过以患者为中心的反事实实现1型糖尿病血糖控制的最佳行为修改的数字孪生

Asiful Arefeen, Saman Khamesian, Maria Adela Grando, Bithika Thompson, Hassan Ghasemzadeh

发表机构 * College of Health Solutions, Arizona State University(亚利桑那州立大学健康解决方案学院) School of Computing and Augmented Intelligence, Arizona State University(亚利桑那州立大学计算与增强智能学院) Department of Endocrinology, Mayo Clinic Arizona(梅奥诊所亚利桑那分部内分泌科)

AI总结 该研究提出了一种名为GlyTwin的数字孪生框架,用于通过行为优化改善1型糖尿病患者的血糖控制。其核心方法是结合反事实解释,模拟最优行为干预方案,如调整碳水化合物摄入和胰岛素剂量,以减少高血糖事件的发生。研究还引入了利益相关者的偏好,使干预方案更具个性化和实用性。实验结果表明,GlyTwin在生成有效反事实解释和预防高血糖方面优于现有方法,具有较高的实用价值。

详情
AI中文摘要

频繁和长期暴露于高血糖会增加慢性并发症的风险,包括神经病变、肾病和心血管疾病。现有的连续皮下胰岛素输注(CSII)和连续血糖监测(CGM)技术仅模拟血糖调节的特定方面,例如预测低血糖和给予小剂量胰岛素推注。同样,当前糖尿病管理中的数字孪生方法主要侧重于预测血糖对人类行为和胰岛素治疗的反应。因此,这些技术缺乏提供替代治疗方案的能力,而这些方案可以指导主动行为干预以实现最佳糖尿病管理。为填补这一空白,我们提出GlyTwin,一种新颖的计算框架,通过整合反事实解释来增强数字孪生技术,以模拟血糖控制的最佳行为治疗。GlyTwin通过推荐行为选择(如碳水化合物摄入和胰岛素剂量)的调整来生成反事实治疗,以显著减少高血糖事件的发生和持续时间。此外,GlyTwin将利益相关者的偏好纳入其干预生成过程,确保工具个性化和以用户为中心。我们在AZT1D上评估GlyTwin,该数据集是通过收集50名使用自动胰岛素输送(AID)系统的1型糖尿病(T1D)患者的纵向数据构建的,每人监测26天。结果表明,与历史数据相比,GlyTwin在生成反事实解释方面优于现有方法,有效解释率为85.8%,预防高血糖的有效性为87.3%。

英文摘要

Frequent and long-term exposure to hyperglycemia increases the risk of chronic complications, including neuropathy, nephropathy, and cardiovascular disease. Existing continuous subcutaneous insulin infusion (CSII) and continuous glucose monitoring (CGM) technologies model only specific aspects of glycemic regulation, such as predicting hypoglycemia and administering small insulin boluses. Similarly, current digital twin approaches in diabetes management primarily focus on predicting glucose responses to human behavior and insulin therapy. As a result, these technologies lack the ability to provide alternative treatment scenarios that could guide proactive behavioral interventions for optimal diabetes management. To address this gap, we propose GlyTwin, a novel computational framework that enhances digital twin technologies by integrating counterfactual explanations to simulate optimal behavioral treatments for glucose control. GlyTwin generates counterfactual treatments by recommending adjustments to behavioral choices, such as carbohydrate intake and insulin dosing, to significantly reduce the occurrence and duration of hyperglycemic events. In addition, GlyTwin incorporates stakeholder preferences into its intervention-generation process, ensuring that the tool is personalized and user-centric. We evaluate GlyTwin on AZT1D, a new dataset constructed by collecting longitudinal data from 50 individuals living with type 1 diabetes (T1D) on automated insulin delivery (AID) systems, each monitored for 26 days. Results show that GlyTwin outperforms state-of-the-art methods for generating counterfactual explanations, with 85.8% valid explanations and 87.3% effectiveness in preventing hyperglycemia compared with historical data.

2503.04929 2026-05-25 cs.RO cs.LG cs.SY eess.SY 版本更新

Neural Configuration-Space Barriers for Manipulation Planning and Control

用于操作规划与控制的神经构型空间障碍

Kehan Long, Ki Myung Brian Lee, Nikola Raicevic, Niyas Attasseri, Melvin Leok, Nikolay Atanasov

发表机构 * Contextual Robotics Institute, University of California San Diego(情境机器人研究所,加州大学圣地亚哥分校)

AI总结 本文研究了如何在复杂动态环境中高效安全地规划和控制高维机械臂的运动。作者提出了一种基于神经网络配置空间距离函数(CDF)的统一方法,将安全约束转化为CDF屏障,从而减少路径规划中的碰撞检测次数。为应对模型误差和传感器噪声带来的不确定性,研究还提出了分布鲁棒的CDF屏障控制框架,无需假设噪声分布。实验表明,该方法能够在仅依赖 onboard 点云观测的情况下,实现高效且安全的机械臂操控。

详情
AI中文摘要

在杂乱动态环境中,高维机器人操作器的规划与控制需要计算效率和鲁棒的安全保证。受近期学习构型空间距离函数(CDF)作为机器人身体表示的研究启发,我们提出了一种统一的运动规划与控制方法,将安全约束公式化为CDF障碍。CDF障碍近似局部自由构型空间,显著减少了运动规划中的碰撞检测操作次数。然而,使用神经网络学习CDF障碍并依赖在线传感器观测会引入不确定性,这些必须在控制综合中加以考虑。为此,我们开发了一种分布鲁棒的CDF障碍控制公式,该公式在不假设已知底层分布的情况下,考虑了建模误差和传感器噪声。在UFactory xArm6操作器上的仿真和硬件实验表明,我们的神经CDF障碍公式能够在杂乱动态环境中实现高效规划和鲁棒安全控制,仅依赖机载点云观测。

英文摘要

Planning and control for high-dimensional robot manipulators in cluttered dynamic environments require computational efficiency and robust safety guarantees. Inspired by recent advances in learning configuration-space distance functions (CDFs) as representations of robot bodies, we propose a unified approach for motion planning and control that formulates safety constraints as CDF barriers. A CDF barrier approximates the local free configuration space, substantially reducing the number of collision-checking operations during motion planning. However, learning a CDF barrier with a neural network and relying on online sensor observations introduces uncertainties that must be considered during control synthesis. To address this, we develop a distributionally robust CDF barrier formulation for control that accounts for modeling errors and sensor noise without assuming a known underlying distribution. Simulations and hardware experiments on a UFactory xArm6 manipulator show that our neural CDF barrier formulation enables efficient planning and robust safe control in cluttered and dynamic environments, relying only on onboard point-cloud observations.

2502.17119 2026-05-25 cs.LG cs.AI 版本更新

Diffusion and Flow Matching Models for Tabular Data: A Survey

表格数据的扩散与流匹配模型:综述

Zhong Li, Qi Huang, Lincen Yang, Jiayang Shi, Zhao Yang, Niki van Stein, Thomas Bäck, Matthijs van Leeuwen

发表机构 * Great Bay University(大湾区大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) LIACS, Leiden University(莱顿大学LIACS)

AI总结 本文综述了扩散模型和流匹配模型在表格数据生成中的应用,探讨了这些模型在处理数值与类别混合、缺失值、敏感字段及复杂依赖关系等挑战时的优势与方法。文章系统梳理了从2015年至2026年的相关研究,围绕数据工程难题、任务目标、设计选择及评估维度进行组织,并指出了在可扩展性、特征依赖建模、隐私保护、公平性及约束感知生成等方面的开放问题。

Comments We substantially updated the previous version "Diffusion Models for Tabular Data: Challenges, Current Progress, and Future Directions" by including flow matching models for tabular data

详情
AI中文摘要

深度生成模型在图像、文本、音频和视频生成方面取得了快速进展,并越来越多地应用于结构化记录。然而,对于表格数据,生成建模仍然困难:数据集可能包含数值和分类属性、缺失值、敏感字段、不平衡类别、复杂的特征依赖和领域约束。早期基于GAN或VAE的表格数据建模方法取得了有用结果,但可能面临训练不稳定、模式崩溃、多模态分布建模能力弱以及混合类型特征处理脆弱等问题。因此,扩散模型因其噪声-去噪公式提供了灵活稳定的方式来建模复杂数据分布而受到越来越多的关注,并已被应用于表格合成、缺失值填补、可信数据生成和异常检测。流匹配通过学习沿概率路径的传输向量场提供了一条密切相关的途径,通常对路径设计和采样效率有更直接的控制。尽管取得了进展,但针对表格数据的扩散和流匹配模型文献仍然难以比较,因为方法针对不同任务,依赖于不同的表示、目标、评估协议和领域假设。据我们所知,这是第一篇专门针对表格数据的扩散和流匹配模型的综述。我们回顾了2015年6月至2026年5月的工作,围绕数据工程挑战、任务、设计选择和评估维度进行组织,并讨论了可扩展性、特征依赖建模、隐私、公平性、基准测试和约束感知生成中的开放问题。我们在GitHub仓库中保持更新。

英文摘要

Deep generative models have made rapid progress in image, text, audio, and video generation, and are increasingly being applied to structured records. For tabular data, however, generative modeling remains difficult: a dataset may contain numerical and categorical attributes, missing values, sensitive fields, imbalanced categories, complex feature dependencies, and domain constraints. Earlier tabular data modeling methods based on GANs or VAEs have achieved useful results, but they can suffer from unstable training, mode collapse, weak modeling of multimodal distributions, and fragile handling of mixed-type features. Diffusion models have therefore attracted growing interest because their noising-and-denoising formulation provides a flexible and stable way to model complex data distributions, and has been adapted to tabular synthesis, missing-value imputation, trustworthy data generation, and anomaly detection. Flow matching offers a closely related route by learning transport vector fields along probability paths, often with more direct control over path design and sampling efficiency. Despite this progress, the literature on diffusion and flow matching models for tabular data remains difficult to compare because methods target different tasks and rely on different representations, objectives, evaluation protocols, and domain assumptions. To the best of our knowledge, this is the first survey dedicated specifically to diffusion and flow matching models for tabular data. We review work from June 2015 to May 2026, organize it around data-engineering challenges, tasks, design choices, and evaluation dimensions, and discuss open problems in scalability, feature dependency modeling, privacy, fairness, benchmarking, and constraint-aware generation. We maintain updates in a GitHub repository.

2502.07646 2026-05-25 cs.LG stat.ME stat.ML 版本更新

Causal Additive Models with Unobserved Causal Paths and Backdoor Paths

具有未观测因果路径和后门路径的因果加性模型

Thong Pham, Takashi Nicholas Maeda, Shohei Shimizu

发表机构 * Shiga University(滋贺大学) RIKEN AIP(理化学研究所AIP) Gakushuin University(学习院大学) The University of Osaka(大阪大学)

AI总结 该论文研究了在存在未观测的因果路径和后门路径时,如何识别变量间的因果方向问题。作者提出了新的回归集刻画方法,用于判断残差独立性和观测变量的条件独立性,并基于此建立了因果方向可识别的充分条件。在此基础上,提出了一种搜索算法并证明了其正确性和完备性,实验表明该方法在性能上具有竞争力。

Comments 23 pages

详情
Journal ref
Proceedings of AISTATS 2026
AI中文摘要

因果加性模型为存在隐藏变量时的因果发现提供了一个可处理且富有表现力的框架。当两个变量之间存在未观测的后门或因果路径时,其因果关系在现有理论下通常不可识别。我们建立了在许多此类情况下可识别因果方向的充分条件。这些条件依赖于回归集的新特征,以确定回归残差之间的独立性以及观测变量之间的条件独立性。基于这些结果,我们引入了一个结合这些创新的搜索算法,并证明了其可靠性和完备性。实证评估表明,其性能与最先进的方法相比具有竞争力。

英文摘要

Causal additive models provide a tractable yet expressive framework for causal discovery in the presence of hidden variables. When unobserved backdoor or causal paths exist between two variables, their causal relationship is often unidentifiable under existing theories. We establish sufficient conditions under which causal directions can be identified in many such cases. These conditions rely on new characterizations of regression sets to determine independence among regression residuals and conditional independencies among observed variables. Building on these results, we introduce a search algorithm that incorporates these innovations and prove its soundness and completeness. Empirical evaluations demonstrate its competitive performance against state-of-the-art methods.

2502.07489 2026-05-25 cs.LG 版本更新

Physiome-ODE: A Benchmark for Irregularly Sampled Multivariate Time Series Forecasting Based on Biological ODEs

Physiome-ODE:基于生物常微分方程的不规则采样多元时间序列预测基准

Christian Klötergens, Vijaya Krishna Yalavarthi, Randolf Scholz, Maximilian Stubbemann, Stefan Born, Lars Schmidt-Thieme

发表机构 * ISMLL & VWFS DARC University of Hildesheim(希尔德斯海姆大学ISMLL与VWFS DARC) University of Hildesheim(希尔德斯海姆大学) Institute of Mathematics TU Berlin(柏林工业大学数学研究所) DARC University of Hildesheim(希尔德斯海姆大学DARC)

AI总结 当前不规则采样多变量时间序列预测方法主要依赖于少量数据集进行评估,而基于常微分方程(ODE)的模型在这些数据集上表现不佳,限制了其进一步研究。本文提出了一种从真实生物ODE模型生成不规则采样多变量时间序列数据的方法,并通过拒绝采样构建了包含50个数据集的大型基准数据集Physiome-ODE。该基准显著区别于现有数据集,能够有效评估不同模型在处理不规则时间序列时的真实性能,为ODE模型的研究提供了新的推动。

详情
AI中文摘要

当前最先进的缺失值不规则采样时间序列预测方法主要依赖四个数据集和少量小玩具示例进行评估。尽管常微分方程(ODE)是科学和工程中的主流模型,但在现有三个数据集上,预测常数值的基线模型性能优于过去五年的基于ODE的模型。这一反直觉的发现阻碍了基于ODE的模型(一个更合理的模型族)的进一步研究。本文开发了一种从常微分方程生成不规则采样多元时间序列(IMTS)数据集的方法,并通过拒绝采样选择具有挑战性的实例。利用该方法,我们创建了Physiome-ODE,一个大型且复杂的IMTS基准数据集,包含50个独立数据集,源自生物学研究中真实世界的常微分方程。Physiome-ODE是我们所知的首个IMTS预测基准,其规模比当前四个数据集的评估设置大一个数量级。使用Physiome-ODE基准,我们展示了与当前四个数据集完全不同的定性结果:在Physiome-ODE上,基于ODE的模型能够发挥其优势,并且我们的基准能够以有意义的方式区分不同的IMTS预测模型。通过这种方式,我们期望为基于ODE的时间序列建模研究注入新的动力。

英文摘要

State-of-the-art methods for forecasting irregularly sampled time series with missing values predominantly rely on just four datasets and a few small toy examples for evaluation. While ordinary differential equations (ODEs) are the prevalent models in science and engineering, a baseline model that forecasts a constant value outperforms ODE-based models from the last five years on three of these existing datasets. This unintuitive finding hampers further research on ODE-based models, a more plausible model family. In this paper, we develop a methodology to generate irregularly sampled multivariate time series (IMTS) datasets from ordinary differential equations and to select challenging instances via rejection sampling. Using this methodology, we create Physiome-ODE, a large and sophisticated benchmark of IMTS datasets consisting of 50 individual datasets, derived from real-world ordinary differential equations from research in biology. Physiome-ODE is the first benchmark for IMTS forecasting that we are aware of and is an order of magnitude larger than the current evaluation setting of four datasets. Using Physiome-ODE, we obtain qualitatively different results from those derived from the current four datasets: on Physiome-ODE, ODE-based models can play to their strengths, and the benchmark can differentiate in a meaningful way between different IMTS forecasting models. In this way, we expect to give a new impulse to research on ODE-based time series modeling.

2502.07295 2026-05-25 cs.LG 版本更新

Targeted Regularization for Causal Effect Estimation with Exponential Dispersion Family Outcomes

针对指数分散族结果变量的因果效应估计的目标正则化

Jiahong Li, Zeqin Yang, Jixing Xu, Enzheng Hua, Zhichao Zou, Peng Zhen, Jiecheng Guo

发表机构 * Didi Chuxing(滴滴出行)

AI总结 本文研究了在指数型分布族(EDF)输出场景下因果效应估计中的目标正则化方法,旨在提升神经网络在估计因果效应时的统计性质,如双重稳健性和快速收敛性。作者提出了一个统一的目标正则化框架,适用于离散和连续处理变量,并通过分布层面的一阶偏差校正提升了估计精度。该方法将目标函数整合到神经网络架构中,实现了对结果模型、倾向得分模型和波动参数的联合端到端估计,实验验证了其有效性。

详情
AI中文摘要

用于因果效应估计的神经网络(NN)在实证中表现出色,但赋予其理想的半参数性质——双重稳健性和快速收敛速度——仍然具有挑战性。解决此问题的一种常见方法是目标正则化,它修改了神经网络的目标函数。然而,现有的神经因果效应估计工作主要局限于连续结果变量,限制了其在实践中常见的二元、计数或其他偏斜结果变量场景中的应用。我们针对指数分散族(EDF)提出了一个统一的目标正则化框架来解决这一限制。具体来说,我们首先推导了离散处理下典型函数的平均剂量函数(ADCF)和连续处理下筛投影ADCF的冯·米塞斯展开。其次,我们利用这一展开构建了一个统一的目标正则化,在分布层面修正一阶偏差。我们将此目标集成到一个神经网络架构中,该架构联合估计结果模型、倾向得分模型和波动参数。实验结果证明了我们方法的有效性。

英文摘要

Neural networks (NNs) for causal effect estimation have shown strong empirical performance, yet endowing them with desirable semiparametric properties -- double robustness and fast convergence rates -- remains challenging. A common approach to address this is targeted regularization, which modifies the objective function of NNs. However, existing work on neural causal effect estimation is largely limited to continuous outcomes, restricting its applicability to settings involving binary, count, or other skewed outcomes commonly encountered in practice. We propose a unified targeted regularization framework for the Exponential Dispersion Family (EDF) to address this limitation. Specifically, we first derive the von Mises expansion of the average dose function of canonical functions (ADCF) for discrete treatments and of the sieve-projected ADCF for continuous treatments. Second, we use this expansion to construct a unified targeted regularization that corrects first-order bias at the distributional level. We integrate this objective into an NN architecture that jointly estimates the outcome model, propensity score model, and fluctuation parameter end-to-end. Experimental results demonstrate the effectiveness of our method.

2502.04230 2026-05-25 cs.SD cs.AI cs.CR cs.LG eess.AS 版本更新

XAttnMark: Learning Robust Audio Watermarking with Cross-Attention

XAttnMark:基于交叉注意力的鲁棒音频水印学习

Yixin Liu, Lie Lu, Jihui Jin, Lichao Sun, Andrea Fanelli

发表机构 * Department of Computer Science, Lehigh University, Bethlehem, PA, USA(理海大学计算机科学系) Dolby Laboratories Inc., San Francisco, CA, USA(杜比实验室公司)

AI总结 随着生成式音频合成和编辑技术的快速发展,版权保护、数据溯源和深度伪造音频传播等问题日益突出。本文提出了一种基于交叉注意力机制的鲁棒音频水印方法XAttnMark,通过生成器与检测器之间的部分参数共享、高效的交叉注意力消息检索机制以及时间条件模块,实现了水印检测与归属的联合优化。此外,该方法引入了与心理声学对齐的时频掩码损失,提升了水印的不可感知性,实验表明其在多种音频变换下均表现出优越的鲁棒性,为生成式AI时代的音频版权保护提供了有效解决方案。

Comments Accepted at ICML'25

详情
AI中文摘要

生成式音频合成与编辑技术的快速普及引发了关于版权侵权、数据溯源以及通过深度伪造音频传播虚假信息的严重担忧。水印技术通过将不可感知但可识别和可追踪的信号嵌入音频内容,提供了一种主动解决方案。尽管最近基于神经网络的水印方法(如WavMark和AudioSeal)在鲁棒性和质量上有所改进,但它们难以同时优化鲁棒检测和准确归因。本文介绍了交叉注意力鲁棒音频水印(XATTNMARK),通过利用生成器和检测器之间的部分参数共享、用于高效消息检索的交叉注意力机制以及用于改善消息分布的时间条件模块,弥合了这一差距。此外,我们提出了一种心理声学对齐的时频(TF)掩蔽损失,捕捉细粒度的听觉掩蔽效应,提高了水印的不可感知性。XATTNMARK在检测和归因方面均达到了最先进的性能,展示了针对各种音频变换(包括不同强度的具有挑战性的生成式编辑)的卓越鲁棒性。这项工作推进了音频水印技术,用于在生成式AI时代保护知识产权并确保真实性。

英文摘要

The rapid proliferation of generative audio synthesis and editing technologies has raised serious concerns about copyright infringement, data provenance, and the spread of misinformation via deepfake audio. Watermarking offers a proactive solution by embedding imperceptible yet identifiable and traceable signals into audio content. While recent neural network-based watermarking methods like WavMark and AudioSeal have improved robustness and quality, they struggle to jointly optimize both robust detection and accurate attribution. This paper introduces Cross-Attention Robust Audio Watermark (XATTNMARK), which bridges this gap by leveraging partial parameter sharing between the generator and the detector, a cross-attention mechanism for efficient message retrieval, and a temporal conditioning module for improved message distribution. Additionally, we propose a psychoacoustic-aligned time-frequency (TF) masking loss that captures fine-grained auditory masking effects, improving watermark imperceptibility. XATTNMARK achieves state-of-the-art performance in both detection and attribution, demonstrating superior robustness against a wide range of audio transformations, including challenging generative editing at varying strengths. This work advances audio watermarking for protecting intellectual property and ensuring authenticity in the era of generative AI.

2411.12173 2026-05-25 cs.LG cs.AI 版本更新

SkillTree: Explainable Skill-Based Deep Reinforcement Learning for Long-Horizon Control Tasks

SkillTree: 面向长时域控制任务的可解释基于技能的深度强化学习

Yongyan Wen, Siyuan Li, Rongchang Zuo, Lei Yuan, Hangyu Mao, Peng Liu

发表机构 * Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算学部) National Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术全国重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Polixir Technologies SenseTime Research(商汤研究院)

AI总结 本文提出了一种名为SkillTree的可解释技能型深度强化学习框架,用于解决长期控制任务中的复杂连续动作空间问题。该方法通过将连续动作空间离散化为技能空间,并在高层策略中引入可微决策树生成技能嵌入,从而指导底层策略执行具体技能,实现了技能层面的可解释性。实验表明,SkillTree在复杂机械臂控制任务中性能与基于神经网络的技能方法相当,同时提升了决策过程的透明度。

详情
AI中文摘要

深度强化学习(DRL)在各个研究领域取得了显著成功。然而,其对神经网络的依赖导致缺乏透明度,限制了实际应用。为了实现可解释性,决策树已成为神经网络的一种流行且有前景的替代方案。然而,由于其表达能力有限,传统决策树难以处理高维长时域连续控制任务。在本文中,我们提出了SkillTree,一种新颖的框架,将复杂的连续动作空间缩减为离散的技能空间。我们的层次化方法在高层次策略中集成了可微决策树以生成技能嵌入,进而指导低层次策略执行技能。通过使技能决策可解释,我们实现了技能级可解释性,增强了对复杂任务中决策过程的理解。实验结果表明,我们的方法在复杂机器人臂控制领域中达到了与基于技能的神经网络相当的性能。此外,SkillTree在技能级别提供解释,从而提高了决策过程的透明度。

英文摘要

Deep reinforcement learning (DRL) has achieved remarkable success in various research domains. However, its reliance on neural networks results in a lack of transparency, which limits its practical applications. To achieve explainability, decision trees have emerged as a popular and promising alternative to neural networks. Nonetheless, due to their limited expressiveness, traditional decision trees struggle with high-dimensional long-horizon continuous control tasks. In this paper, we propose SkillTree, a novel framework that reduces complex continuous action spaces into discrete skill spaces. Our hierarchical approach integrates a differentiable decision tree within the high-level policy to generate skill embeddings, which subsequently guide the low-level policy in executing skills. By making skill decisions explainable, we achieve skill-level explainability, enhancing the understanding of the decision-making process in complex tasks. Experimental results demonstrate that our method achieves performance comparable to skill-based neural networks in complex robotic arm control domains. Furthermore, SkillTree offers explanations at the skill level, thereby increasing the transparency of the decision-making process.
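A differentiable decision tree of the kind mentioned above is commonly built from soft (sigmoid) splits, so that each leaf receives a routing probability and the output is a probability-weighted mix of leaf values. The sketch below illustrates that generic construction only. The abstract does not specify SkillTree's parametrization, so the heap-ordered node layout, the linear splits, and all names (`soft_tree_leaf_probs`, `skill_embedding`) are assumptions made for illustration.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_tree_leaf_probs(x, weights, biases, depth):
    """Routing probabilities over the 2**depth leaves of a soft binary tree.

    Internal nodes are stored in heap order (node i at level l has index
    2**l - 1 + position). Each node routes right with probability
    sigmoid(w . x + b) and left with the complement.
    """
    probs = [1.0]
    for level in range(depth):
        next_probs = []
        for position, p in enumerate(probs):
            node = (2 ** level - 1) + position
            score = sum(w * xi for w, xi in zip(weights[node], x)) + biases[node]
            go_right = sigmoid(score)
            next_probs.extend([p * (1.0 - go_right), p * go_right])
        probs = next_probs
    return probs


def skill_embedding(leaf_probs, leaf_embeddings):
    """Probability-weighted average of per-leaf embedding vectors."""
    dim = len(leaf_embeddings[0])
    return [
        sum(p * emb[d] for p, emb in zip(leaf_probs, leaf_embeddings))
        for d in range(dim)
    ]


if __name__ == "__main__":
    # Depth-2 tree with 3 internal nodes and 4 leaves on a 2D observation.
    weights = [[1.0, 0.0], [0.0, 1.0], [0.0, -1.0]]
    biases = [0.0, 0.0, 0.0]
    probs = soft_tree_leaf_probs([2.0, -1.0], weights, biases, depth=2)
    print(probs, sum(probs))
    print(skill_embedding(probs, [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 0.0]]))
```

Because every split is a smooth function of the input, gradients flow through the routing probabilities during training, while the learned split weights can still be read as a sequence of soft if/else decisions.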

2411.08126 2026-05-25 stat.ML cs.LG 版本更新

A Tale of Two Cities: Pessimism and Opportunism in Offline Dynamic Pricing

双城记:离线动态定价中的悲观主义与机会主义

Zeyu Bian, Lan Wang, Zhengling Qi

发表机构 * Department of Statistics, Florida State University(佛罗里达州立大学统计系) Department of Management Science, University of Miami(迈阿密大学管理科学系) Department of Decision Sciences, The George Washington University(乔治华盛顿大学决策科学系)

AI总结 本文研究了在历史数据未能覆盖全部价格区间的情况下,如何进行离线动态定价,尤其是在最优价格可能完全未被观测到的现实场景中。为解决这一问题,作者提出了一种非参数部分识别框架,利用需求对价格的单调性来界定未观测价格的价值,并设计了两种动态定价策略:一种是最大化最坏情况收入的悲观策略,另一种是最小化最坏情况遗憾的机会主义策略。该方法在无覆盖场景下表现出优越性能,并为企业提供了根据风险偏好选择定价策略的实用指导。

详情
AI中文摘要

我们研究离线动态定价,当历史数据对价格空间的覆盖不完整时,一些候选价格(包括最优价格)可能完全未被观测到。这种设置在现实中很常见,在动态环境中尤其困难。现有的离线强化学习方法通常依赖于完全或部分覆盖,因此在这种设置下表现不佳。我们开发了一个用于离线动态定价的非参数部分识别框架,利用需求在价格上的单调性来界定未观测价格的价值。在该框架内,我们制定了两种动态决策规则:一种最大化最坏情况收入的悲观策略,和一种最小化最坏情况遗憾的机会策略。这些规则针对顺序无覆盖环境量身定制,并非现有悲观离线强化学习或静态机会主义方法的直接扩展。我们为两种策略建立了有限样本遗憾界,当最优价格被覆盖时恢复了标准速率,并量化了未覆盖时的额外成本。我们还开发了高效算法,并通过模拟和机票应用表明,我们的方法在无覆盖设置中优于标准离线强化学习基线。从管理角度看,该框架提供了从公司风险态度到定价策略的实用映射:寻求收入稳定和下行保护的公司应偏好悲观策略,而愿意承担适度风险以从未充分探索的价格中获取潜在收益的公司应偏好机会策略。

英文摘要

We study offline dynamic pricing when historical data provide incomplete coverage of the price space such that some candidate prices, including the optimal one, may be entirely unobserved. This setting is common in practice and is especially difficult in dynamic environments. Existing offline reinforcement learning methods typically rely on full or partial coverage and can therefore perform poorly in such settings. We develop a nonparametric partial identification framework for offline dynamic pricing that exploits the monotonicity of demand in price to bound the value of unobserved prices. Within this framework, we formulate two dynamic decision rules: a pessimistic policy that maximizes worst-case revenue and an opportunistic policy that minimizes worst-case regret. These rules are tailored to a sequential no-coverage environment and are not direct extensions of existing pessimistic offline RL or static opportunistic approaches. We establish finite-sample regret bounds for both policies, recovering the standard rate when the optimal price is covered and quantifying the additional cost when it is not. We also develop efficient algorithms and show, through simulations and an airline ticket application, that our methods outperform standard offline RL baselines in no-coverage settings. Managerially, the framework provides a practical mapping from a firm's risk posture to its pricing policy: firms seeking revenue stability and downside protection should prefer the pessimistic policy, whereas firms willing to bear measured risk for potential gains from underexplored prices should prefer the opportunistic policy.
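The role of monotonicity can be seen in a deliberately simplified, single-period setting. If demand is non-increasing in price, then demand at an unobserved price is bounded below by demand at any higher observed price and above by demand at any lower observed price. The sketch below is not the paper's dynamic algorithm: it assumes noise-free demand estimates at the observed prices, a known cap on demand (`demand_cap`), and hypothetical function names, and it covers only the pessimistic rule (maximize worst-case revenue), not the opportunistic minimax-regret rule.

```python
def demand_bounds(price, observed, demand_cap):
    """Bounds on demand at `price` implied by non-increasing demand.

    observed maps each observed price to its estimated demand.
    Lower bound: largest demand seen at any price >= price (0 if none).
    Upper bound: smallest demand seen at any price <= price (cap if none).
    """
    lower = max((d for p, d in observed.items() if p >= price), default=0.0)
    upper = min((d for p, d in observed.items() if p <= price), default=demand_cap)
    return lower, upper


def pessimistic_price(candidates, observed, demand_cap):
    """Candidate price with the highest worst-case revenue price * lower bound."""
    return max(
        candidates,
        key=lambda price: price * demand_bounds(price, observed, demand_cap)[0],
    )


if __name__ == "__main__":
    observed = {10: 100, 20: 60}  # only two prices appear in the historical data
    candidates = [10, 15, 20, 25]
    for price in candidates:
        print(price, demand_bounds(price, observed, demand_cap=150))
    print(pessimistic_price(candidates, observed, demand_cap=150))
```

In this toy example the price 25 lies above every observed price, so its worst-case demand is zero and the pessimistic rule never selects it, even though its true revenue could be the highest. That gap is what an opportunistic rule is meant to address.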

2411.01088 2026-05-25 cs.LG math.OC 版本更新

CRONOS: Enhancing Deep Learning with Scalable GPU Accelerated Convex Neural Networks

CRONOS: 利用可扩展的GPU加速凸神经网络增强深度学习

Miria Feng, Zachary Frangella, Mert Pilanci

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出了一种名为 CRONOS 的算法,用于对两层神经网络进行凸优化,该算法能够首次扩展到高维数据集如 ImageNet,显著优于以往仅在 MNIST 和 CIFAR-10 下采样版本上进行研究的工作。基于 CRONOS,作者进一步开发了 CRONOS-AM 算法,结合交替最小化方法,实现了对任意结构多层网络的训练。理论分析表明 CRONOS 在温和条件下能收敛到凸重构的全局最小值,实验验证显示其在图像和语言任务中表现优于主流深度学习优化器。

详情
Journal ref
Advances in Neural Information Processing Systems 37 (NeurIPS 2024)
AI中文摘要

我们提出了用于两层神经网络凸优化的CRONOS算法。CRONOS是首个能够扩展到高维数据集(如现代深度学习中普遍存在的ImageNet)的算法。这显著改进了先前的工作,这些工作仅限于MNIST和CIFAR-10的下采样版本。以CRONOS为基础,我们进一步开发了一种名为CRONOS-AM的新算法,它将CRONOS与交替最小化相结合,以获得能够训练任意架构多层网络的算法。我们的理论分析证明,在温和假设下,CRONOS收敛到凸重述的全局最小值。此外,我们通过使用JAX进行GPU加速的大规模数值实验,验证了CRONOS和CRONOS-AM的有效性。我们的结果表明,在视觉和语言任务中,使用ImageNet和IMDb等基准数据集,CRONOS-AM可以获得与主流调优深度学习优化器相当或更好的验证精度。据我们所知,CRONOS是首个利用凸重述来增强大规模学习任务性能的算法。

英文摘要

We introduce the CRONOS algorithm for convex optimization of two-layer neural networks. CRONOS is the first algorithm capable of scaling to high-dimensional datasets such as ImageNet, which are ubiquitous in modern deep learning. This significantly improves upon prior work, which has been restricted to downsampled versions of MNIST and CIFAR-10. Taking CRONOS as a primitive, we then develop a new algorithm called CRONOS-AM, which combines CRONOS with alternating minimization, to obtain an algorithm capable of training multi-layer networks with arbitrary architectures. Our theoretical analysis proves that CRONOS converges to the global minimum of the convex reformulation under mild assumptions. In addition, we validate the efficacy of CRONOS and CRONOS-AM through extensive large-scale numerical experiments with GPU acceleration in JAX. Our results show that CRONOS-AM can obtain comparable or better validation accuracy than predominant tuned deep learning optimizers on vision and language tasks with benchmark datasets such as ImageNet and IMDb. To the best of our knowledge, CRONOS is the first algorithm which utilizes the convex reformulation to enhance performance on large-scale learning tasks.

2410.19842 2026-05-25 eess.SP cs.LG 版本更新

A comprehensive evaluation of pretraining strategies for channel-agnostic contrastive self-supervision of biosignals

生物信号通道无关对比自监督预训练策略的综合评估

Thea Brüsch, Mikkel N. Schmidt, Tommy S. Alstrøm

发表机构 * Department of Applied Mathematics and Computer Science(应用数学和计算机科学系)

AI总结 该研究探讨了在生物信号的通道无关自监督学习中创建正样本对的有效策略,以解决多通道时间序列数据中数据增强设计困难和模型泛化能力不足的问题。研究提出了一种名为对比随机导联编码(CRLC)的方法,通过随机选择输入通道的子集生成正样本对,并在EEG和ECG数据上验证了其有效性。实验表明,CRLC在通道无关设置下优于其他方法,在EEG任务中甚至超越了当前最先进的模型,为生物信号的自监督学习提供了新的思路。

详情
AI中文摘要

对比学习在计算机视觉的自监督中取得了令人印象深刻的结果。该方法依赖于正对的创建,这通常通过数据增强来实现。然而,对于多变量时间序列,有效的增强可能难以设计。此外,生物信号数据集的输入通道数通常因应用而异,限制了使用特定通道配置训练的大型自监督模型的实用性。受这些挑战的驱动,我们着手研究用于生物信号通道无关自监督的正对创建策略。我们引入了对比随机导联编码(CRLC),其中使用输入通道的随机子集来创建正对,并与使用增强和时间上相邻片段作为正对的方法进行比较。我们通过在EEG和ECG数据上预训练模型,然后针对下游任务进行微调来验证我们的方法。在通道无关设置中,CRLC在两种场景下均优于竞争策略。值得注意的是,对于EEG任务,CRLC超越了当前最先进的参考模型。而在ECG任务中,尽管最先进的参考模型更优,但结合CRLC使我们能够获得可比较的结果。总之,CRLC有助于在训练我们的通道无关模型时,跨不同通道设置进行泛化。代码可在https://github.com/theabrusch/Multiview_TS_SSL获取。

英文摘要

Contrastive learning yields impressive results for self-supervision in computer vision. The approach relies on the creation of positive pairs, which is often achieved through augmentations. However, for multivariate time series, effective augmentations can be difficult to design. Additionally, the number of input channels for biosignal datasets often varies from application to application, limiting the usefulness of large self-supervised models trained with specific channel configurations. Motivated by these challenges, we set out to investigate strategies for creation of positive pairs for channel-agnostic self-supervision of biosignals. We introduce contrastive random lead coding (CRLC), where random subsets of the input channels are used to create positive pairs, and compare it with using augmentations and neighboring segments in time as positive pairs. We validate our approach by pre-training models on EEG and ECG data, and then fine-tuning them for downstream tasks. CRLC outperforms competing strategies in both scenarios in the channel-agnostic setting. Notably, for EEG tasks CRLC surpasses the current state-of-the-art reference model. While the state-of-the-art reference model is superior in the ECG task, incorporating CRLC allows us to obtain comparable results. In conclusion, CRLC helps generalization across variable channel setups when training our channel-agnostic model. The code is available at https://github.com/theabrusch/Multiview_TS_SSL.
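As a rough illustration of the positive-pair idea in this abstract, the sketch below draws two random channel subsets from one multichannel segment. The function names, the choice to make the two subsets disjoint, and the toy data are assumptions of this sketch; the abstract only states that random subsets of the input channels form the pair.

```python
import random


def random_lead_pair(n_channels, rng, min_size=1):
    """Return two random channel subsets to use as a positive pair.

    Assumption: the subsets are disjoint and together cover all channels.
    """
    if n_channels < 2 * min_size:
        raise ValueError("need at least 2 * min_size channels")
    channels = list(range(n_channels))
    rng.shuffle(channels)
    # Pick a split point so that both views keep at least min_size channels.
    cut = rng.randint(min_size, n_channels - min_size)
    return sorted(channels[:cut]), sorted(channels[cut:])


def make_views(segment, view_a, view_b):
    """segment is a list of per-channel sample lists; select rows per view."""
    return [segment[c] for c in view_a], [segment[c] for c in view_b]


rng = random.Random(0)
# Toy segment: 6 channels, 4 samples each.
segment = [[float(c * 10 + t) for t in range(4)] for c in range(6)]
view_a, view_b = random_lead_pair(len(segment), rng)
xa, xb = make_views(segment, view_a, view_b)
```

Both views come from the same time window, so a contrastive loss would treat their embeddings as a positive pair, while views from other segments would serve as negatives.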

2409.08036 2026-05-25 cs.LG 版本更新

Heterogeneous Sheaf Neural Networks

异质层丛神经网络

Luke Braithwaite, Alessio Borgi, Gabriele Onorato, Kristjan Tarantelli, Francesco Restuccia, Fabrizio Silvestri, Pietro Liò

发表机构 * Department of Computer Science and Technology, University of Cambridge(计算机科学与技术系,剑桥大学) Department of Electrical and Computer Engineering, Northeastern University(电气与计算机工程系,东北大学) Department of Computer, Control and Management Engineering, Sapienza University of Rome(计算机、控制与管理工程系,罗马萨皮恩扎大学)

AI总结 该研究提出了一种名为HetSheaf的异构图神经网络框架,用于处理节点和边具有不同类型和特征空间的异构图数据。不同于传统方法通过复杂架构处理异构性,HetSheaf通过细胞叠层结构直接在数据层面表示异构性,并学习基于节点和边类型的限制映射。该方法引入了SheafPool读取模块,实现了对图级别的鲁棒预测,并在多个基准测试中表现出色,性能优于多种现有方法,同时显著减少了参数数量。

Comments 48 pages, 2 figures

详情
AI中文摘要

异质图的节点和边可以属于不同类型和特征空间,出现在许多真实世界领域,包括生物学、推荐系统、社交网络和计算机系统。现有的异质图神经网络通常在架构层面通过关系特定模块、元路径机制或类型感知注意力来处理这种异质性,这往往导致越来越专门化的参数密集型设计。在这项工作中,我们提出了HetSheaf,一个通过细胞层丛学习异质图的框架。HetSheaf不是仅在架构中编码异质性,而是通过分配类型感知的局部特征空间和学习基于节点特征、节点类型和边类型的限制映射,直接在底层数据结构中表示异质性。为了支持图级预测,我们进一步引入了SheafPool,一种通用的茎空间读出方法,它聚合节点表示同时对局部基的变化保持不变,从而使层丛网络的图分类得到良好定义,并且F1分数比平均池化高出高达42个百分点。在多样化的基准测试套件(节点分类、链接预测和图分类)中,HetSheaf在异质图基准(HGB)框架上,针对同质(GCN、GAT、GIN、GraphSAGE)、异质(R-GCN、HAT、HGT)和类型无关的层丛基线,一致地实现了高达2个百分点的性能提升(节点分类上高达94.97%的Macro F1分数,链接预测上高达99.62%),同时参数数量减少了高达10倍。

英文摘要

Heterogeneous graphs, whose nodes and edges can belong to different types and feature spaces, arise in many real-world domains, including biology, recommendation, social networks, and computer systems. Existing heterogeneous graph neural networks typically handle this heterogeneity at the architectural level through relation-specific modules, meta-path machinery or type-aware attention, which often leads to increasingly specialised parameter-heavy designs. In this work, we propose HetSheaf, a framework for learning heterogeneous graphs through cellular sheaves. Instead of encoding heterogeneity solely in the architecture, HetSheaf represents it directly in the underlying data structure by assigning type-aware local feature spaces and learning restriction maps conditioned on node features, node types, and edge types. To support graph-level prediction, we further introduce SheafPool, a universal stalk-space readout that aggregates node representations while being invariant to local changes of basis, thereby making graph classification with sheaf networks well-defined and achieving an F1 Score up to 42 percentage points higher than mean pooling. Across a diverse suite of benchmarks (node classification, link prediction and graph classification), HetSheaf consistently achieves up to 2 percentage points higher performance (up to 94.97% Macro F1 Score on node classification and up to 99.62% on link prediction) on the Heterogeneous Graph Benchmark (HGB) framework against homogeneous (GCN, GAT, GIN, GraphSAGE), heterogeneous (R-GCN, HAT, HGT) and type-agnostic sheaf baselines, while reducing the number of parameters by up to 10$\times$.
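To make the notion of type-conditioned restriction maps concrete, here is a small sketch of a sheaf Laplacian with one-dimensional stalks. It is not the HetSheaf architecture: the lookup table keyed by (node type, edge type), the scalar maps, and all names are assumptions chosen for illustration, and the paper additionally conditions the maps on node features.

```python
def sheaf_laplacian(n_nodes, edges, node_type, restriction):
    """Build L = delta^T delta for a cellular sheaf with 1-dimensional stalks.

    edges: list of (u, v, edge_type) triples.
    restriction[(node_type, edge_type)]: scalar map from a node stalk to the edge stalk.
    """
    lap = [[0.0] * n_nodes for _ in range(n_nodes)]
    for u, v, etype in edges:
        fu = restriction[(node_type[u], etype)]
        fv = restriction[(node_type[v], etype)]
        lap[u][u] += fu * fu
        lap[v][v] += fv * fv
        lap[u][v] -= fu * fv
        lap[v][u] -= fu * fv
    return lap


def quadratic_form(lap, x):
    """x^T L x, which equals the summed squared disagreement over edges."""
    n = len(x)
    return sum(x[i] * lap[i][j] * x[j] for i in range(n) for j in range(n))


node_type = ["author", "paper", "author"]
edges = [(0, 1, "writes"), (2, 1, "writes")]
unit_maps = {("author", "writes"): 1.0, ("paper", "writes"): 1.0}
typed_maps = {("author", "writes"): 2.0, ("paper", "writes"): 0.5}

L_unit = sheaf_laplacian(3, edges, node_type, unit_maps)
L_typed = sheaf_laplacian(3, edges, node_type, typed_maps)
```

With all maps equal to 1 the construction reduces to the ordinary graph Laplacian. With type-dependent maps, a signal such as `[1, 4, 1]` has zero energy because 2 * 1 equals 0.5 * 4 on every edge, which shows how the maps decide what counts as agreement between nodes of different types.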

2408.03085 2026-05-25 quant-ph cs.LG 版本更新

Universal Matrix Multiplication on Quantum Computer

量子计算机上的通用矩阵乘法

Jiaqi Yao, Tianjian Huang, Zipeng Cai, Ding Liu

发表机构 * School of Computer Science and Technology, Tiangong University(天津工业大学计算机科学与技术学院)

AI总结 本文研究了如何在量子计算机上高效实现矩阵乘法,这是深度神经网络中最核心且计算量最大的操作。作者提出了一种通用的量子矩阵乘法框架,通过优化量子算术逻辑单元,利用量子傅里叶变换将经典数据编码到参数化的 $R_z$ 旋转门中,从而将量子加法器的基门复杂度降低到 $O(n)$,并基于经典算术的列乘原理优化量子乘法器复杂度至 $O(n^2)$。此外,还扩展了该方法到量子版的斯特拉森算法,实验分析了乘法时间减少与加法资源增加之间的权衡,为构建通用量子矩阵运算提供了可靠的技术路径。

详情
AI中文摘要

作为深度神经网络中最核心且计算最密集的组件,矩阵乘法的执行效率直接决定了模型的训练和推理性能。利用量子叠加和纠缠提供的并行处理能力来重塑矩阵乘法的实现,已成为优化底层量子算术逻辑和提高量子电路运行效率的一个有前景的切入点。本文提出了一种通用量子矩阵乘法(QMM)框架,旨在通过优化的量子算术逻辑单元实现显著的计算加速。为了规避传统量子算术电路中多寄存器和多控制门的限制,我们使用量子傅里叶变换(QFT)将经典数据直接编码到参数化的 \(R_z\) 旋转门中,从而将量子加法器的基本门复杂度降低到 \(O(n)\)。此外,通过采用经典算术中的列乘法原理,我们将量子乘法器的门复杂度优化到 \(O(n^2)\)。我们进一步将这种方法扩展到量子版本的Strassen算法,并通过实验量化了乘法时间减少与加法资源开销增加之间的权衡。这项工作为构建通用量子矩阵运算建立了一条可靠的技术路径,有望为训练现代机器学习模型释放巨大的计算能力。

英文摘要

As the most central and computationally intensive component of deep neural networks, the execution efficiency of matrix multiplication directly determines the training and inference performance of models. Harnessing the parallel processing capabilities afforded by quantum superposition and entanglement to reshape matrix multiplication implementations has become a promising entry point for optimising underlying quantum arithmetic logic and improving the operational efficiency of quantum circuits. This paper proposes a universal quantum matrix multiplication (QMM) framework designed to achieve substantial computational acceleration through an optimised quantum arithmetic logic unit. To circumvent the limitations of multi-register and multi-control gates in conventional quantum arithmetic circuits, we encode classical data directly into parameterised \(R_z\) rotation gates using the quantum Fourier transform (QFT), thereby reducing the base gate complexity of the quantum adder to \(O(n)\). In addition, by adopting the column-wise multiplication principle from classical arithmetic, we optimize the gate complexity of the quantum multiplier to \(O(n^2)\). We further extend this approach to a quantum version of the Strassen algorithm, and experimentally quantify the trade-off between reduced multiplication time and increased overhead in addition resources. This work establishes a reliable technical pathway for constructing general-purpose quantum matrix operations, with the potential to unlock substantial computational power for training modern machine learning models.
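The QFT-based addition mentioned in this abstract can be illustrated with a classical state-vector simulation. The sketch follows the textbook construction (QFT, then phase rotations that encode the classical addend, then inverse QFT) and is not the paper's circuit; the function name `qft_add` and the brute-force transforms are assumptions made for readability.

```python
import cmath


def qft_add(a, b, n_qubits):
    """Return (a + b) mod 2**n_qubits by simulating a QFT-based adder."""
    dim = 2 ** n_qubits
    omega = cmath.exp(2j * cmath.pi / dim)
    norm = dim ** 0.5
    # QFT of the basis state |b>: amplitude omega**(k*b) / sqrt(dim) on |k>.
    amps = [omega ** (k * b) / norm for k in range(dim)]
    # Encode the classical value a as a diagonal phase rotation on each |k>.
    amps = [amp * omega ** (k * a) for k, amp in enumerate(amps)]
    # Inverse QFT maps the phases back to a single computational basis state.
    out = [
        sum(amps[k] * omega ** (-j * k) for k in range(dim)) / norm
        for j in range(dim)
    ]
    probs = [abs(z) ** 2 for z in out]
    return max(range(dim), key=probs.__getitem__)


result = qft_add(5, 9, 4)
wrapped = qft_add(11, 9, 4)
```

In a gate-level circuit the diagonal phase step decomposes into one rotation per qubit, which is where the linear gate count for the adder comes from; the simulation above only checks the arithmetic.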

2404.05108 2026-05-25 quant-ph cs.IT cs.LG math.IT 版本更新

Efficient Gradient Estimation for Parameterized Quantum Systems with Lie Algebraic Symmetries

具有李代数对称性的参数化量子系统的有效梯度估计

Mohsen Heidari, Masih Mozakka, Wojciech Szpankowski

发表机构 * Department of Computer Sciences, Indiana University, Bloomington, IN, USA(印第安纳大学计算机科学系,印第安纳州布卢明顿) Department of Computer Sciences, Purdue University, West Lafayette, IN, USA(普渡大学计算机科学系,印第安纳州西拉法叶)

AI总结 本文研究了参数化量子电路(PQCs)训练中的梯度估计问题,针对现有方法在高维希尔伯特空间和量子测量信息丢失方面的不足,提出了一种基于李代数结构和哈达玛测试的新框架。通过分析矩阵指数的微分,将梯度表示为由哈达玛测试得到的期望值的线性组合,其系数仅依赖于电路参数化方式,可利用阴影层析技术高效估计。该方法显著降低了测量次数和计算时间,分别实现了指数级和多项式级的提升。

Comments 32 pages

详情
AI中文摘要

梯度估计是训练参数化量子电路(PQC)以解决混合量子-经典优化和学习问题的核心挑战。这一困难源于多个因素,包括希尔伯特空间的指数维度和量子测量中的信息损失。现有的估计器,如有限差分和参数平移规则,通常无法充分应对某些类别PQC的这些挑战。在这项工作中,我们提出了一种新颖的梯度估计框架,该框架利用了PQC的底层李代数结构,并结合了Hadamard测试。通过分析矩阵指数的微分,我们将梯度表示为通过Hadamard测试获得的期望值的线性组合。该分解中的系数仅取决于电路的参数化,并且可以使用最先进的阴影层析技术进行估计。因此,我们的方法实现了高效的梯度估计,所需的测量次数随参数数量对数增长,并且具有多项式经典和量子时间。与现有工作相比,这实现了测量成本的指数级降低和时间的多项式加速。

英文摘要

Gradient estimation is a central challenge in training parameterized quantum circuits (PQCs) for hybrid quantum-classical optimization and learning problems. This difficulty arises from several factors, including the exponential dimensionality of the Hilbert spaces and the information loss in quantum measurements. Existing estimators, such as finite difference and the parameter shift rule, often fail to adequately address these challenges for certain classes of PQCs. In this work, we propose a novel gradient estimation framework that leverages the underlying Lie algebraic structure of PQCs, combined with the Hadamard test. By analyzing the differential of the matrix exponential, we derive an expression for the gradient as a linear combination of expectation values obtained via Hadamard tests. The coefficients in this decomposition depend solely on the circuit's parameterization and can be estimated using state-of-the-art shadow tomography techniques. Hence, our approach enables efficient gradient estimation, requiring a number of measurement shots that scales logarithmically with the number of parameters, and with polynomial classical and quantum time. This is an exponential reduction in the measurement cost and a polynomial speed-up in time compared to existing works.
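For context, the parameter shift rule named above as an existing estimator can be sketched on a one-qubit toy case. This is the baseline, not the Lie-algebraic method proposed in the paper; the example circuit (an RY rotation on |0> followed by a Z measurement, whose expectation is cos(theta)) and the function names are assumptions of this sketch.

```python
import math


def expectation_z(theta):
    """<Z> after applying RY(theta) to |0>, computed analytically as cos(theta)."""
    return math.cos(theta)


def parameter_shift_gradient(f, theta, shift=math.pi / 2):
    """Gradient of f from two shifted evaluations (valid for Pauli-generated gates)."""
    return 0.5 * (f(theta + shift) - f(theta - shift))


theta = 0.3
grad = parameter_shift_gradient(expectation_z, theta)
exact = -math.sin(theta)
```

Each parameter needs its own pair of shifted circuit evaluations, so the measurement cost grows with the number of parameters; that cost is what the abstract's logarithmic scaling claim is compared against.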

2402.17888 2026-05-25 cs.LG cs.AI 版本更新

ConjNorm: Tractable Density Estimation for Out-of-Distribution Detection

ConjNorm: 面向分布外检测的可处理密度估计

Bo Peng, Yadan Luo, Yonggang Zhang, Yixuan Li, Zhen Fang

发表机构 * University of Technology Sydney(悉尼科技大学) The University of Queensland(昆士兰大学) Hong Kong Baptist University(香港浸会大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 本文提出了一种名为ConjNorm的新型密度估计方法,用于提升分布外检测(OOD detection)的性能。该方法基于Bregman散度构建理论框架,将分布考虑扩展到指数族分布,并通过引入共轭约束,将密度函数设计转化为寻找最优范数系数的问题。为了解决归一化计算的困难,作者设计了一种基于重要性采样的无偏且解析可计算的分区函数估计器。实验表明,ConjNorm在多个OOD检测基准上取得了当前最优性能,显著优于现有方法。

Comments ICLR24 poster

详情
AI中文摘要

事后分布外检测在可靠机器学习中引起了广泛关注。许多工作致力于基于logits、距离或严格数据分布假设推导得分函数,以识别低得分OOD样本。然而,这些估计得分可能无法准确反映真实数据密度或施加不切实际的约束。为了提供密度基得分设计的统一视角,我们提出了一个基于Bregman散度的新理论框架,将分布考虑扩展到指数分布族。利用定理中揭示的共轭约束,我们引入了一种\textsc{ConjNorm}方法,将密度函数设计重新定义为针对给定数据集寻找最优范数系数$p$。鉴于归一化的计算挑战,我们利用基于蒙特卡洛的重要性采样技术,设计了一个无偏且解析可处理的配分函数估计器。在OOD检测基准上的大量实验表明,我们提出的\textsc{ConjNorm}在各种OOD检测设置中建立了新的最先进水平,在CIFAR-100和ImageNet-1K上分别比当前最佳方法(FPR95)高出高达13.25%和28.19%。

英文摘要

Post-hoc out-of-distribution (OOD) detection has garnered intensive attention in reliable machine learning. Many efforts have been dedicated to deriving score functions based on logits, distances, or rigorous data distribution assumptions to identify low-scoring OOD samples. Nevertheless, these estimated scores may fail to accurately reflect the true data density or may impose impractical constraints. To provide a unified perspective on density-based score design, we propose a novel theoretical framework grounded in Bregman divergence, which extends distribution considerations to encompass an exponential family of distributions. Leveraging the conjugation constraint revealed in our theorem, we introduce a \textsc{ConjNorm} method, reframing density function design as a search for the optimal norm coefficient $p$ against the given dataset. In light of the computational challenges of normalization, we devise an unbiased and analytically tractable estimator of the partition function using the Monte Carlo-based importance sampling technique. Extensive experiments across OOD detection benchmarks empirically demonstrate that our proposed \textsc{ConjNorm} has established a new state-of-the-art in a variety of OOD detection setups, outperforming the current best method by up to 13.25$\%$ and 28.19$\%$ (FPR95) on CIFAR-100 and ImageNet-1K, respectively.
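The importance-sampling idea for the partition function can be illustrated on a one-dimensional toy density proportional to exp(-|x|^p). This is a sketch of the general technique only: the target family, the standard normal proposal, and the function name are assumptions, not the estimator defined in the paper.

```python
import math
import random


def partition_importance_sampling(p, n_samples, rng):
    """Unbiased estimate of Z = integral of exp(-|x|**p) dx with an N(0, 1) proposal."""
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(0.0, 1.0)
        # Proposal density q(x) of the standard normal.
        q = math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)
        # Importance weight: unnormalised target divided by proposal.
        total += math.exp(-abs(x) ** p) / q
    return total / n_samples


rng = random.Random(0)
z_hat = partition_importance_sampling(p=2, n_samples=20000, rng=rng)
z_exact = 2.0 * math.gamma(1.0 + 1.0 / 2)  # closed form 2 * Gamma(1 + 1/p) for p = 2
```

The estimate is unbiased because the expected weight under the proposal equals the integral of the unnormalised target. For p = 2 the closed form is the square root of pi, which gives a direct check on the Monte Carlo value.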

2402.14212 2026-05-25 cs.LG cs.AI 版本更新

Moonwalk: Inverse-Forward Differentiation

Moonwalk: 逆-前向微分

Dmitrii Krylov, Armin Karamzade, Roy Fox

发表机构 * University of California, Irvine(加州大学尔湾分校)

AI总结 Moonwalk 研究了反向传播中需要存储中间激活值的限制问题,提出了一种无需存储激活值的梯度计算方法。该方法通过引入向量-逆雅可比乘积(vijp)操作符,结合子浸入网络和碎片化梯度检查点技术,在前向过程中精确重建梯度,从而显著提升了网络深度而不增加内存消耗。实验表明,Moonwalk 在保持运行时间与反向传播相当的同时,能够在相同内存预算下训练出深度超过两倍的网络。

详情
Journal ref
The 29th International Conference on Artificial Intelligence and Statistics, 2026
AI中文摘要

反向传播的主要限制是它需要在正向传播过程中存储中间激活值(残差),这限制了可训练网络的深度。这引出了一个基本问题:我们能否避免存储这些激活值?我们通过重新审视梯度计算的结构来解决这个问题。反向传播通过一系列向量-雅可比乘积计算梯度,这一操作通常是不可逆的。丢失的信息位于每层雅可比矩阵的余核中。我们定义了浸没式网络——其层雅可比矩阵具有平凡余核的网络——在这种网络中,梯度可以在前向扫描中精确重建,而无需存储激活值。对于非浸没式层,我们引入了碎片梯度检查点,仅记录恢复被雅可比矩阵擦除的余切向量所需的最小残差子集。我们方法的核心是一种新的算子,即向量-逆-雅可比乘积(vijp),它反转了余核外的梯度流。我们的混合模式算法首先通过内存高效的反向传播计算输入梯度,然后使用vijp在前向扫描中重建参数梯度,从而消除了存储激活值的需要。我们在Moonwalk中实现了该方法,并表明它在相同内存预算下训练深度超过两倍的网络时,运行时间与反向传播相当。

英文摘要

Backpropagation's main limitation is its need to store intermediate activations (residuals) during the forward pass, which restricts the depth of trainable networks. This raises a fundamental question: can we avoid storing these activations? We address this by revisiting the structure of gradient computation. Backpropagation computes gradients through a sequence of vector-Jacobian products, an operation that is generally irreversible. The lost information lies in the cokernel of each layer's Jacobian. We define submersive networks -- networks whose layer Jacobians have trivial cokernels -- in which gradients can be reconstructed exactly in a forward sweep without storing activations. For non-submersive layers, we introduce fragmental gradient checkpointing, which records only the minimal subset of residuals necessary to restore the cotangents erased by the Jacobian. Central to our approach is a novel operator, the vector-inverse-Jacobian product (vijp), which inverts gradient flow outside the cokernel. Our mixed-mode algorithm first computes input gradients with a memory-efficient reverse pass, then reconstructs parameter gradients in a forward sweep using the vijp, eliminating the need to store activations. We implement this method in Moonwalk and show that it matches backpropagation's runtime while training networks more than twice as deep under the same memory budget.
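The linear algebra behind inverting a vector-Jacobian product can be shown on a single linear layer. This is not the Moonwalk algorithm: it only checks that when the Jacobian J has full row rank (trivial cokernel), the output cotangent can be recovered from the input cotangent. The 2-row restriction and the helper names are assumptions that keep the sketch dependency-free.

```python
def vjp(jac, g_out):
    """Vector-Jacobian product: g_in = J^T g_out."""
    rows, cols = len(jac), len(jac[0])
    return [sum(jac[r][c] * g_out[r] for r in range(rows)) for c in range(cols)]


def vijp_two_rows(jac, g_in):
    """Recover g_out from g_in = J^T g_out for a 2-row J with full row rank.

    Uses g_out = (J J^T)^{-1} J g_in with an explicit 2x2 inverse.
    """
    b = [sum(jac[r][c] * g_in[c] for c in range(len(g_in))) for r in range(2)]
    a00 = sum(x * x for x in jac[0])
    a11 = sum(x * x for x in jac[1])
    a01 = sum(x * y for x, y in zip(jac[0], jac[1]))
    det = a00 * a11 - a01 * a01
    if det == 0:
        raise ValueError("J does not have full row rank; cotangent is not recoverable")
    return [(a11 * b[0] - a01 * b[1]) / det, (a00 * b[1] - a01 * b[0]) / det]


jac = [[1.0, 2.0, 0.0], [0.0, 1.0, 1.0]]  # Jacobian of y = W x, mapping R^3 to R^2
g_out = [3.0, -1.0]
g_in = vjp(jac, g_out)
recovered = vijp_two_rows(jac, g_in)
```

When J loses row rank the determinant check fails, which corresponds to the non-submersive case where the abstract says some residuals must be checkpointed.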

2103.14995 2026-05-25 cs.LG cs.AI eess.SP 版本更新

Thermal transmittance prediction based on the application of artificial neural networks on heat flux method results

基于人工神经网络在热流法结果上的热透射率预测

Sanjin Gumbarević, Bojan Milovanović, Mergim Gaši, Marina Bagarić

发表机构 * Center for Theoretical Physics, Sloane Physics Laboratory, Yale University(理论物理中心、斯洛恩物理实验室、耶鲁大学) University of Zagreb, Faculty of Civil Engineering, Department of Materials(萨格勒布大学、土木工程学院、材料系)

AI总结 本文研究如何利用人工神经网络(ANN)加速建筑围护结构热传导系数(U值)的现场测量过程。通过在热流法(HFM)测量中引入并行测量策略,并基于内外空气温度预测未知热流,从而缩短测量时间。研究对比了多种ANN模型在多层墙体上的应用效果,结果表明该方法在热流预测方面具有较高准确性,为后续研究提供了有价值的参考方向。

Comments Submitted to International Building Physics Conference 2021

详情
Journal ref
J. Phys.: Conf. Ser. 2069 (2021) 012152
AI中文摘要

由于能效相关指令,欧洲联盟更加关注建筑群的深度能源改造。许多需要深度能源改造的建筑年代久远,可能缺乏设计/改造文件,或者建筑构件中的材料可能随时间发生退化。热透射率(即U值)是确定通过建筑围护结构构件传输热损失的最重要参数之一,取决于构成建筑构件的所有材料的厚度和热性能。现场U值可通过ISO 9869-1标准(热流法 - HFM)确定。然而,测量持续时间是HFM在改造设计过程开始前现场测试中未广泛使用的原因之一。本文分析了通过使用一个热流传感器进行并行测量来减少测量时间的可能性。这种并行化可以通过在HFM结果上应用特定类别的人工神经网络(ANN)来实现,基于收集的室内外空气温度预测未知热流。在达到满意的预测后,HFM传感器可重新定位到另一个测量位置。本文展示了四种ANN案例应用于HFM结果的比较,这些测量在一面多层墙上进行:一个隐藏层中有三个神经元的多层感知器、100个单元的长短期记忆、100个单元的门控循环单元以及50个长短期记忆单元和50个门控循环单元的组合。分析在基于两个输入温度预测热流率方面给出了有希望的结果。另一面墙上的额外分析显示了该方法的可能局限性,这为这一主题的进一步研究提供了方向。

英文摘要

Deep energy renovation of the building stock has come more into focus in the European Union due to energy-efficiency-related directives. Many buildings that must undergo deep energy renovation are old and may lack design/renovation documentation, or the materials in their building elements may have degraded over time. Thermal transmittance (i.e. U-value) is one of the most important parameters for determining the transmission heat losses through building envelope elements. It depends on the thickness and thermal properties of all the materials that form a building element. The in-situ U-value can be determined by the ISO 9869-1 standard (Heat Flux Method - HFM). Still, measurement duration is one of the reasons why HFM is not widely used in field testing before the renovation design process commences. This paper analyzes the possibility of reducing the measurement time by conducting parallel measurements with one heat-flux sensor. This parallelization could be achieved by applying a specific class of Artificial Neural Network (ANN) to HFM results to predict unknown heat flux based on collected interior and exterior air temperatures. After a satisfactory prediction is achieved, the HFM sensor can be relocated to another measuring location. The paper compares four ANN cases applied to HFM results for a measurement held on one multi-layer wall: a multilayer perceptron with three neurons in one hidden layer, long short-term memory with 100 units, a gated recurrent unit with 100 units, and a combination of 50 long short-term memory units and 50 gated recurrent units. The analysis gave promising results in terms of predicting the heat flux rate based on the two input temperatures. Additional analysis on another wall showed possible limitations of the method, which serve as a direction for further research on this topic.
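For readers unfamiliar with the heat flux method, the sketch below computes a U-value with the average method, dividing the summed heat flux by the summed interior-exterior temperature difference. The function name and the sample readings are made up for illustration; the convergence criteria of ISO 9869-1 and the paper's ANN models are not reproduced here.

```python
def u_value_average_method(heat_flux, t_interior, t_exterior):
    """U-value in W/(m^2 K) from paired readings using the average method.

    heat_flux: heat flux densities in W/m^2.
    t_interior, t_exterior: air temperatures in degrees Celsius (or kelvin).
    """
    if not (len(heat_flux) == len(t_interior) == len(t_exterior)):
        raise ValueError("all series must have the same length")
    delta_sum = sum(ti - te for ti, te in zip(t_interior, t_exterior))
    if delta_sum == 0:
        raise ValueError("temperature difference sums to zero")
    return sum(heat_flux) / delta_sum


# Hypothetical readings, not data from the paper.
q = [10.0, 12.0, 8.0]
t_in = [20.0, 21.0, 19.0]
t_out = [0.0, 1.0, -1.0]
u = u_value_average_method(q, t_in, t_out)
```

In the setting the abstract describes, the measured `heat_flux` series at a first location would be replaced by ANN predictions from the two temperature series once the sensor has moved on.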

2605.22954 2026-05-25 cs.LG q-bio.QM 版本更新

FederatedRSF: Federated Random Survival Forests for Partially Overlapping Medical Data

FederatedRSF:面向部分重叠医学数据的联邦随机生存森林

Maryam Moradpour, Jonas Harriehausen, Amirreza Aleyasin, Lion Philipp Wolf, Youngjun Park, Anne-Christin Hauschild

发表机构 * Institute for Predictive Deep Learning in Medicine and Healthcare(医学与医疗健康预测性深度学习研究所) Justus Liebig University Gießen(吉森尤斯图斯·李比希大学) Hessian Center for Artificial Intelligence (hessian.AI)(黑森人工智能中心 (hessian.AI)) Department of Medical Informatics(医学信息学系) University Medical Center Göttingen(哥廷根大学医学中心) Max Planck Institute for Biology of Ageing(马克斯·普朗克衰老生物学研究所)

AI总结 本文提出了一种名为FederatedRSF的联邦学习方法,用于处理多中心医疗数据中的生存分析问题,特别是在数据特征部分重叠的情况下。该方法通过在各机构本地训练随机生存森林模型,并仅共享特征兼容的树结构,从而在不泄露原始数据的前提下实现模型聚合与推理。实验表明,该方法在乳腺癌数据集上的表现与集中式训练模型相当,有效解决了数据隐私和特征异质性带来的挑战。

Comments 4 pages, 2 figures. Maryam Moradpour, Jonas Harriehausen, and Amirreza Aleyasin contributed equally to this work. Includes supplementary material

详情
AI中文摘要

多中心生存预测可以提高鲁棒性和泛化性,但隐私法规和机构治理通常阻止跨机构汇集患者水平的临床和基因组数据。在实践中,部署因特征空间异质性而进一步复杂化,其中不同站点收集不同的协变量或使用不同的测序面板,导致特征集仅部分重叠。我们提出了FederatedRSF,一个实现联邦随机生存森林的Python包,它聚合本地训练的生存树,并仅将特征兼容的树重新分发到每个站点,从而在无需共享原始数据的情况下实现部分重叠的推理。我们在scikit-survival包中分发的GBSG2乳腺癌队列上评估了FederatedRSF,通过保留特征子集模拟客户端之间的特征异质性,并使用Harrell一致性指数(C-Index)在重复交叉验证和站点分割下评估区分能力。结果表明,联邦模型可以达到与集中式训练设置相当的性能。

英文摘要

Multi-center survival prediction can improve robustness and generalizability, yet privacy regulations and institutional governance often prevent pooling patient-level clinical and genomic data across institutions. In practice, deployment is further complicated by feature-space heterogeneity, in which sites collect different covariates or use different sequencing panels, resulting in only partially overlapping feature sets. We present FederatedRSF, a Python package that implements federated random survival forests, aggregating locally trained survival trees and redistributing only feature-compatible trees to each site, enabling inference with partial overlap without sharing raw data. We evaluate FederatedRSF on the GBSG2 breast cancer cohort distributed with the scikit-survival package, simulating feature heterogeneity across clients by withholding subsets of features, and assessing discrimination using Harrell's concordance index (C-Index) under repeated cross-validation and site-splits. The results demonstrated that the federated model can achieve performance comparable to that of the centralized training setting.
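The redistribution rule described above (each site receives only the trees it can evaluate) can be sketched with plain set operations. The data layout, the feature names, and the function below are hypothetical and do not reflect the FederatedRSF package API.

```python
def compatible_trees(trees, site_features):
    """Return ids of trees whose split features are all available at the site."""
    return [tree["id"] for tree in trees if tree["features"] <= site_features]


# Hypothetical pooled forest: each tree records the features it splits on.
pooled_trees = [
    {"id": "A-0", "origin": "site_a", "features": {"age", "tumor_size"}},
    {"id": "A-1", "origin": "site_a", "features": {"age", "grade"}},
    {"id": "B-0", "origin": "site_b", "features": {"age", "nodes"}},
    {"id": "B-1", "origin": "site_b", "features": {"age"}},
]

# Hypothetical per-site feature sets with partial overlap.
site_features = {
    "site_a": {"age", "tumor_size", "grade"},
    "site_b": {"age", "nodes"},
}

federated_forest = {
    site: compatible_trees(pooled_trees, feats)
    for site, feats in site_features.items()
}
```

Only tree structures move between sites in this scheme, and a tree that splits on a feature a site never collected is simply left out of that site's forest.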

2605.22950 2026-05-25 stat.ML cs.LG math.ST stat.ME stat.TH 版本更新

Diffusion-based Denoising Beats Vanilla Score Matching in Parameter Estimation: A Theoretical Explanation

基于扩散的去噪在参数估计中优于普通得分匹配:一个理论解释

Benedikt Lütke Schwienhorst, Nadja Klein, Johannes Lederer

AI总结 本文研究了在多峰分布参数估计中,基于扩散的去噪分数匹配方法相较于传统分数匹配方法的优越性,并给出了理论解释。作者提出了一种新的扩散去噪分数匹配估计器(DDSME),并通过理论分析证明,传统分数匹配估计器在峰间距离增大时误差会恶化,而DDSME通过适当调节超参数可避免这一问题。该研究为扩散模型在参数估计中的优势提供了新的理论依据。

详情
AI中文摘要

当归一化常数未知或计算成本过高时,得分匹配是最大似然估计的替代方法。然而,对于实际应用中常见的具有良好分离模态的多峰分布,普通得分匹配相对于最大似然估计效率较低。我们在此场景下比较了一种新颖的基于扩散的去噪得分匹配估计器(DDSME)与普通得分匹配估计器(SME)。特别地,我们证明了两种估计器的统计保证,表明普通SME的误差界随着模态间分离度的增加而恶化,而通过适当的超参数调整,DDSME可以避免这一问题。这为基于扩散的得分匹配优于普通版本的行为提供了新的理论解释。

英文摘要

Score matching is an alternative to maximum likelihood estimation when the normalizing constant is unknown or too costly to evaluate. However, vanilla score matching has been shown to be inefficient relative to maximum likelihood estimation for multimodal distributions with well-separated modes, which are commonly encountered in practical applications. We compare a novel diffusion-based denoising score matching estimator (DDSME) to the vanilla score matching estimator (SME) in this scenario. In particular, we prove statistical guarantees for both estimators, showing that the error bound for the vanilla SME worsens when the separation between the modes increases, which can be avoided in the case of the DDSME with suitable hyperparameter tuning. This provides a novel theoretical explanation for the superior behavior of diffusion-based score matching over the vanilla version.
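As background for the vanilla estimator discussed above, the sketch below evaluates the score matching objective for a one-dimensional Gaussian model and uses its closed-form minimiser. This unimodal toy case does not show the multimodal inefficiency the paper analyses, and the function names are assumptions of this sketch.

```python
def score_matching_objective(xs, mu, var):
    """Empirical score matching loss for N(mu, var).

    The model score is s(x) = -(x - mu) / var and its derivative is -1 / var,
    so the loss is the sample mean of 0.5 * s(x)**2 + s'(x).
    """
    lam = 1.0 / var
    return sum(0.5 * (lam * (x - mu)) ** 2 - lam for x in xs) / len(xs)


def score_matching_gaussian(xs):
    """Closed-form minimiser: the sample mean and the (biased) sample variance."""
    n = len(xs)
    mu = sum(xs) / n
    var = sum((x - mu) ** 2 for x in xs) / n
    return mu, var


data = [0.0, 1.0, 2.0, 3.0, 4.0]
mu_hat, var_hat = score_matching_gaussian(data)
best = score_matching_objective(data, mu_hat, var_hat)
```

The objective never needs the normalizing constant, which is the appeal noted in the abstract; here it happens to agree with maximum likelihood because the model is a single Gaussian.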

2605.22940 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Human-Centered Learning Mechanics: A Dynamical Framework for Entropy-Regulated Representation Learning

以人为中心的学习力学:熵正则化表示学习的动力学框架

Kim Phuc Tran

发表机构 * Univ. Lille, ENSAIT, ULR 2461 – GEMTEX – Génie et Matériaux Textiles(里尔大学,ENSAIT,ULR 2461 – GEMTEX – 纺织工程与材料纺织系) International Chair in DS & XAI, International Research Institute for Artificial Intelligence and Data Science, Dong A University(数据科学与可解释人工智能国际主席,人工智能与数据科学国际研究所,东亚大学)

AI总结 本文提出了一种名为“以人为中心的学习力学”(HCLM)的动态信息理论框架,旨在为开放且受控的学习系统提供理论支持。研究指出,传统的熵正则化方法在某些情况下可能导致梯度不稳定或与优化方向不一致,因此引入了有效熵的概念,并提出了可计算的几何熵代理方法,如基于方差和对数行列式的协方差代理。文章的主要贡献包括形式化有效信息力下的熵正则化、推导收敛性和泛化性理论,以及从动态角度解释模型规模与性能之间的关系。实验表明,几何熵代理,尤其是对数行列式协方差熵,能产生更稳定和有力的信息力,提升表示学习的效果。

Comments Submitted to JMLR

详情
AI中文摘要

深度学习越来越被视为参数空间中的动力学过程,然而许多现有理论仍将训练视为封闭的优化系统。这种观点对于现实世界的人工智能是有限的,因为模型在不确定性、资源约束、分布偏移、下游决策风险和人类反馈下运行。我们提出了以人为中心的学习力学(HCLM),一个用于开放和受控学习系统的动力学和信息论框架。核心思想是,只有当所选的熵代理沿着优化轨迹产生非简并的信息力时,熵正则化才是有用的。否则,熵项可能产生弱、不稳定或不对齐的梯度,导致动力学坍缩为普通的损失最小化。我们引入了有效熵的概念,并研究了可处理的几何熵代理,包括基于方差和对数行列式协方差代理。本文做出三项贡献。首先,它通过有效信息力形式化了熵正则化,并刻画了简并熵区域。其次,它在显式假设下推导了收敛性、熵流、Wasserstein梯度流和噪声表示泛化结果。第三,它提供了缩放律行为的条件动力学解释,作为信息注入、熵耗散和残差风险之间的平衡,而不声称对经验神经缩放律的无条件推导。受控的表示学习实验支持几何熵代理(尤其是对数行列式协方差熵)比softmax归一化熵产生更强更稳定的信息力的假设。

英文摘要

Deep learning is increasingly viewed as a dynamical process in parameter space, yet many existing theories still treat training as a closed optimization system. This view is limited for real-world AI, where models operate under uncertainty, resource constraints, distribution shift, downstream decision risks, and human feedback. We propose Human-Centered Learning Mechanics (HCLM), a dynamical and information-theoretic framework for open and controlled learning systems. The central idea is that entropy regularization is useful only when the chosen entropy surrogate generates a non-degenerate information force along the optimization trajectory. Otherwise, entropy terms may produce weak, unstable, or misaligned gradients, causing the dynamics to collapse toward ordinary loss minimization. We introduce the notion of effective entropy and study tractable geometric entropy surrogates, including variance-based and log-determinant covariance proxies. The paper makes three contributions. First, it formalizes entropy regularization through effective information force and characterizes degenerate entropy regimes. Second, it derives convergence, entropy-flow, Wasserstein-gradient-flow, and noisy-representation generalization results under explicit assumptions. Third, it offers a conditional dynamical interpretation of scaling-law-like behavior as a balance between information injection, entropy dissipation, and residual risk, without claiming an unconditional derivation of empirical neural scaling laws. Controlled representation-learning experiments support the hypothesis that geometric entropy surrogates, especially log-determinant covariance entropy, induce stronger and more stable information forces than softmax-normalized entropy.
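A log-determinant covariance proxy of the kind named in the abstract can be sketched as half the log-determinant of a regularised sample covariance. The exact surrogate used in the paper is not given in the abstract, so the formula 0.5 * logdet(Cov + eps * I), the value of `eps`, and the helper names are assumptions of this sketch.

```python
import math


def covariance(points):
    """Biased sample covariance of a list of equal-length vectors."""
    n, d = len(points), len(points[0])
    mean = [sum(p[k] for p in points) / n for k in range(d)]
    return [
        [sum((p[i] - mean[i]) * (p[j] - mean[j]) for p in points) / n for j in range(d)]
        for i in range(d)
    ]


def logdet_spd(mat):
    """Log-determinant of a symmetric positive-definite matrix via Cholesky."""
    d = len(mat)
    chol = [[0.0] * d for _ in range(d)]
    for i in range(d):
        for j in range(i + 1):
            s = mat[i][j] - sum(chol[i][k] * chol[j][k] for k in range(j))
            chol[i][j] = math.sqrt(s) if i == j else s / chol[j][j]
    return 2.0 * sum(math.log(chol[i][i]) for i in range(d))


def logdet_entropy_proxy(points, eps=1e-3):
    """0.5 * logdet(Cov + eps * I); larger values mean more spread-out representations."""
    cov = covariance(points)
    for i in range(len(cov)):
        cov[i][i] += eps
    return 0.5 * logdet_spd(cov)


spread = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
on_a_line = [[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]]
collapsed = [[0.5, 0.5]] * 4
```

The proxy drops sharply when representations collapse to a point or to a lower-dimensional subspace, which is the geometric signal a variance-only proxy can miss.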

2605.22939 2026-05-25 cs.CL cs.LG 版本更新

Learnability-Informed Fine-Tuning of Diffusion Language Models

扩散语言模型的可学习性感知微调

Shubham Parashar, Atharv Chagi, Jacob Helwig, Lakshmi Jotsna, Sushil Vemuri, James Caverlee, Dileep Kalathil, Shuiwang Ji

发表机构 * Department of Computer Science and Engineering, Texas A&M University, College Station, TX, USA(德克萨斯A&M大学计算机科学与工程系,美国德克萨斯州College Station) Department of Electrical and Computer Engineering, Texas A&M University, College Station, TX, USA(德克萨斯A&M大学电气与计算机工程系,美国德克萨斯州College Station)

AI总结 本文旨在提升扩散语言模型(DLMs)的推理能力。研究发现,传统的监督微调(SFT)在DLMs中应用时存在局限,忽视了学习的难易程度与时机,导致性能下降。为此,作者提出了一种新的微调方法LIFT,通过在不同扩散时间步根据上下文的丰富程度学习易学或难学的token,从而更有效地利用训练信息。实验表明,LIFT在六个推理基准测试中均优于现有方法,相对提升了达3倍的性能。

详情
AI中文摘要

我们旨在提升扩散语言模型(DLM)的推理能力。虽然SFT是自回归模型常用的后训练方法,但其在DLM中的应用面临挑战,甚至可能损害性能,而根本原因尚未得到充分研究。我们的分析揭示,普通SFT忽略了可学习性,即学习什么以及何时学习。具体而言,当大部分输入被掩码时,稀有标记难以学习;而当大部分输入未被掩码时,学习常见标记则较为简单且价值不大。基于我们的分析,我们提出LIFT,一种高效的基于SFT的DLM后训练算法。LIFT在大部分输入被掩码时学习容易标记,在更多上下文可用时学习困难标记,从而使训练与不同扩散时间步的信息可用性对齐。我们的结果表明,LIFT在六个推理基准上优于现有SFT基线,在AIME'24和AIME'25上实现了高达3倍的相对增益。我们的代码已在https://github.com/divelab/LIFT公开。

英文摘要

We aim to improve the reasoning capabilities of diffusion language models (DLMs). While SFT is a popular post-training recipe for autoregressive models, its use in DLMs faces challenges and can even hurt performance, though the underlying causes remain understudied. Our analysis reveals that vanilla SFT overlooks learnability, namely what and when tokens are learned. Specifically, rare tokens are difficult to learn when most of the input is masked, whereas it is straightforward and thus of little value to learn common tokens when most of the input is unmasked. Motivated by our analysis, we propose LIFT, an efficient SFT-based post-training algorithm for DLMs. LIFT learns easy tokens when most of the input is masked and hard tokens when more context is available, thus aligning the training with the information available at different diffusion time steps. Our results show that LIFT outperforms existing SFT baselines across six reasoning benchmarks, achieving up to a 3x relative gain on AIME'24 and AIME'25. Our code is publicly available at https://github.com/divelab/LIFT.

2605.22902 2026-05-25 cs.LG cs.AI cs.CL 版本更新

Transcoders Trace Visual Grounding and Hallucinations in Vision-Language Models

Transcoders 追踪视觉语言模型中的视觉基础与幻觉

Dimitrios Damianos, Leon Voukoutis, Georgios Skyrianos, Vassilis Katsouros, Georgios Paraskevopoulos

发表机构 * Institute of Language and Speech Processing(语言与语音处理研究所) Athena Research Center(雅典研究中心)

AI总结 该研究探讨了生成式视觉-语言模型(VLMs)中视觉输入如何转化为文本的问题,提出了基于Transcoders的函数中心解释框架,用于分解模型内部的计算路径,揭示图像块与文本生成之间的关联。相比传统的稀疏自编码器(SAEs),该方法在图像块缺失实验中表现出更强且更稳定的解释效果,并能更准确地对应语义相关的图像区域。此外,研究还通过结构分析揭示了模型生成幻觉的机制,并利用图特征构建分类器实现了对幻觉的预测。

详情
AI中文摘要

生成式视觉语言模型(VLM)在多模态推理上表现良好,但视觉输入如何转化为文本仍知之甚少。现有的VLM可解释性工作使用稀疏自编码器(SAE),其分解静态残差表示,忽略了驱动跨模态交互的功能更新。我们采用基于Transcoders的功能中心框架,Transcoders是MLP子层的稀疏近似,作为逐层计算的因果代理。应用于Gemma 3-4B-IT,该框架将模型分解为可解释的计算路径,连接图像块到文本生成中的方向。在补丁消融下,Transcoder归因对视觉基础标记产生比SAE归因更强且更稳定的效果,并与语义相关的图像区域更好对齐。假视觉基础反事实分析证实恢复的路径是视觉-语言交互特有的。最后,我们对幻觉生成进行结构分析,从Transcoder产生的电路痕迹中提取基于图的指标。基于这些机制图特征的逻辑分类器以AUC 0.68预测幻觉。这些结果表明,功能中心的电路分解为VLM中的多模态计算提供了可解释且可预测的描述。

英文摘要

Generative Vision-Language Models (VLMs) perform well on multimodal reasoning, but how visual inputs are transformed to text remains poorly understood. Existing interpretability work on VLMs uses Sparse Autoencoders (SAEs), which decompose static residual representations and miss the functional updates that drive cross-modal interaction. We adopt a function-centric framework based on Transcoders, sparse approximations of MLP sublayers that act as a causal proxy for layer-wise computation. Applied to Gemma 3-4B-IT, the framework decomposes the model into interpretable computational pathways linking image patches to directions in token generation. Transcoder attributions produce stronger and more stable effects on visually grounded tokens under patch ablation than SAE attributions, and align better with semantically relevant image regions. A False Visual Grounding counterfactual analysis confirms that the recovered pathways are specific to vision-language interaction. Finally, we perform a structural analysis of hallucinated generations by extracting graph-based indicators from circuit traces produced by the transcoders. A logistic classifier over these mechanistic graph features predicts hallucinations at AUC $0.68$. These results show that function-centric circuit decomposition yields interpretable and predictive accounts of multimodal computation in VLMs.

2605.22898 2026-05-25 cs.LG 版本更新

FIRMA: FIbonacci Ring Model Aggregation for Privacy-preserving Federated Learning

FIRMA: 斐波那契环模型聚合用于隐私保护联邦学习

Rachid Hedjam

发表机构 * Bishop’s University(比什大学)

AI总结 本文提出了一种名为FIRMA的隐私保护联邦学习框架,旨在解决现有方法在去中心化、隐私保护和模型聚合效率之间的矛盾。FIRMA基于斐波那契数列设计环形拓扑结构,通过非对称邻居加权和永久私有分类头实现安全聚合,并引入动态邻居抑制和优化的环排列策略以提升模型性能。实验表明,FIRMA在多种异构数据环境下优于传统联邦平均方法,尤其在标签偏斜和狄利克雷异构场景中表现出显著优势。

详情
AI中文摘要

联邦学习协议面临结构性三难困境:规范的基于服务器的聚合~\cite{mcmahan2017} 产生单点故障和梯度反演风险;去中心化的环-八卦替代方案~\cite{hu2019segmented} 通过无信息的均匀权重将分类头暴露给半诚实的对等节点;个性化方法~\cite{collins2021exploiting} 重新引入中心聚合。现有协议无法同时实现无服务器操作、永久私有分类头、环拓扑和原则性的非对称邻居加权。我们提出FIRMA(\textbf{FI}bonacci \textbf{R}ing \textbf{M}odel \textbf{A}ggregation),一个包含三种逐步增强的联邦学习协议系列:1) \fibfl\ 建立基础:无服务器环聚合,采用斐波那契加权的邻居混合和永久私有的分类头。2) \fibflp\ 在此基础上增加精度门控邻居抑制,选择性降低收敛不良的对等节点权重,同时保留斐波那契方向偏差。3) \fibflpp,完整系统,通过2-opt环置换最大化相邻客户端的类别多样性,通过$K_g{=}\lceil N/2\rceil$次八卦传递实现全局环覆盖,以及余弦退火自保留校准,完成该系列。我们建立了一个收敛速率界和三个支持命题,涉及归一化、覆盖、保留和多样性最优性。在28种配置(四个基准与七种异构性制度交叉)上的系统实验表明,\fibflpp\ 在所有12种标签偏斜配置中均优于\fedavg,在CIFAR-10上$K{=}1$时峰值优势达$+20.7$个百分点。在Dirichlet异构性下,\fibflpp\ 是所有无服务器协议中的帕累托主导方法,在28种配置中的17种中实现了最高精度。

英文摘要

Federated learning protocols face a structural trilemma: canonical server-based aggregation~\cite{mcmahan2017} creates a single point of failure and gradient inversion risk; decentralised ring-gossip alternatives~\cite{hu2019segmented} expose classification heads to semi-honest peers via uninformed uniform weights; and personalised methods~\cite{collins2021exploiting} reintroduce central aggregation. No existing protocol simultaneously achieves server-free operation, permanently private heads, ring topology, and principled asymmetric neighbour weighting. We propose FIRMA (\textbf{FI}bonacci \textbf{R}ing \textbf{M}odel \textbf{A}ggregation), a family of three progressively enhanced federated learning protocols: 1) \fibfl\ establishes the foundation: server-free ring aggregation with Fibonacci-weighted neighbour blending and permanently private classification heads. 2) \fibflp\ augments this with accuracy-gated neighbour suppression, selectively down-weighting poorly-converged peers while preserving the Fibonacci directional bias. 3) \fibflpp, the full system, completes the family with a 2-opt ring permutation that maximises adjacent-client class diversity, global ring coverage via $K_g{=}\lceil N/2\rceil$ gossip passes, and cosine-annealed self-retention calibration. We establish a convergence rate bound and three supporting propositions governing normalisation, coverage, retention, and diversity optimality. Systematic experiments across 28 configurations -- four benchmarks crossed with seven heterogeneity regimes -- demonstrate that \fibflpp\ surpasses \fedavg\ in all 12 label-skew configurations, with a peak advantage of $+20.7$\,pp on CIFAR-10 at $K{=}1$. Under Dirichlet heterogeneity, \fibflpp\ is the Pareto-dominant method among all server-free protocols, achieving the highest accuracy in 17 of 28 configurations.
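One possible reading of Fibonacci-weighted neighbour blending with private heads is sketched below. Every concrete choice here is an assumption, since the abstract does not specify them: the number of neighbours `k`, blending only from predecessors on the ring, giving the nearest neighbour the largest weight, and the fixed `self_weight`. Classification heads are simply never passed to the function, which mirrors the idea that they stay local.

```python
def fibonacci_weights(k):
    """First k Fibonacci numbers (1, 1, 2, 3, ...) normalised to sum to 1."""
    fib = [1, 1]
    while len(fib) < k:
        fib.append(fib[-1] + fib[-2])
    fib = fib[:k]
    total = sum(fib)
    return [f / total for f in fib]


def ring_blend(backbones, self_weight=0.5, k=3):
    """Blend each client's backbone with its k predecessors on the ring.

    backbones: one flat parameter list per client, ordered along the ring.
    Hypothetical rule: hop 1 gets the largest Fibonacci weight.
    """
    n = len(backbones)
    weights = list(reversed(fibonacci_weights(k)))
    blended = []
    for i in range(n):
        mix = [self_weight * p for p in backbones[i]]
        for hop, w in enumerate(weights, start=1):
            neighbour = backbones[(i - hop) % n]
            mix = [m + (1.0 - self_weight) * w * p for m, p in zip(mix, neighbour)]
        blended.append(mix)
    return blended


identical = [[1.0, 2.0] for _ in range(5)]
blended_identical = ring_blend(identical)
one_hot = [[1.0], [0.0], [0.0], [0.0], [0.0]]
blended_one_hot = ring_blend(one_hot)
```

Because the weights are asymmetric and normalised, identical backbones are left unchanged while a single differing client influences its successors unevenly, which is the kind of directional bias the abstract refers to.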

2605.22897 2026-05-25 cs.LG 版本更新

From Residuals to Reasons: LLM-Guided Mechanism Inference from Tabular Data

从残差到原因:基于LLM的表格数据机制推断

Mohammad R. Rezaei, Rahul G. Krishnan

发表机构 * Department of Computer Science(计算机科学系) University of Toronto(多伦多大学) Vector Institute(向量研究所)

AI总结 该研究旨在解决科学应用中机器学习模型在预测与解释之间的平衡问题,提出了一种基于大语言模型(LLM)的机制推理框架MARICL。该方法通过分析基础模型的残差,引导LLM推测模型缺失的结构,并通过多轮文本梯度优化生成显式的修正项,从而提升模型的可解释性和预测性能。实验表明,MARICL在多个科学、生物医学和社会经济数据集上均优于基础模型,并通过冻结公式在不同实验批次中的表现验证了其对机制的泛化能力。

详情
AI中文摘要

机器学习在科学应用中的一个持续挑战是同时实现预测和理解。统计模型在结构化数据上表现出色,但作为黑箱运行,而现有的可解释性方法主要是审视性的:它们回答“哪些特征重要?”,但不阐明特征如何交互或随着人类理解迭代地细化解释。要求LLM直接预测目标会迫使其搜索整个输出空间;我们转而用基础模型锚定预测,并让LLM回答该模型遗漏了什么这一更窄的问题。我们引入了多智能体残差上下文学习(MARICL),这是一个智能体框架,其中LLM智能体分析基础模型失败的地方,从上下文中提供的高残差示例中假设缺失的结构,并产生通过多轮文本梯度优化精炼的显式修正项。在涵盖科学、生物医学、社会经济和合成设置的九个基准测试中,MARICL在所有数据集上一致优于其基础模型。为了测试这些修正是反映真实结构还是批次特定噪声,我们冻结了在无细胞蛋白质数据集的一个实验批次上学习的公式,并将其应用于(无需重新训练且无需进一步LLM调用)保留批次。在同一试剂协议内,冻结公式在超过92%的情况下改善了预测;在不同协议下,它们系统性地失败。成功边界与生物化学一致,而非批次数量;这是机制泛化的直接证据。

英文摘要

A persistent challenge in machine learning for scientific applications is jointly achieving prediction and understanding. Statistical models excel on structured data but operate as black boxes, while existing interpretability methods are largely inspective: they answer "which features matter?" but do not articulate how features interact or refine explanations iteratively alongside human understanding. Asking an LLM to predict the target directly forces it to search the entire output space; we instead anchor predictions with a base model and ask the LLM the narrower question of what that model is missing. We introduce Multi-Agent Residual In-Context Learning (MARICL), an agentic framework in which LLM agents analyze where a base model fails, hypothesize missing structure from high-residual examples provided in context, and produce explicit correction terms refined through multi-turn textual gradient optimization. Across nine benchmarks spanning scientific, biomedical, socioeconomic, and synthetic settings, MARICL improves consistently over its base model on all datasets. To test whether these corrections reflect real structure or batch-specific noise, we freeze formulas learned on one experimental batch of the Cell-Free Protein dataset and apply them (with no retraining and no further LLM calls) to held-out batches. Within the same reagent protocol, the frozen formulas improve predictions in over 92% of cases; across a different protocol, they fail systematically. The success boundary aligns with the biochemistry, not the batch count, which is direct evidence of mechanistic generalization.
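
The frozen-formula test described above can be illustrated with a toy example. Everything in this sketch is synthetic and hypothetical: the base model, the correction term `frozen_correction`, and the two "protocols" are invented for illustration and do not reproduce the paper's data or its 92% figure. It only shows the shape of the check: a correction fixed on one batch is applied unchanged to other batches, and the fraction of cases it improves is counted.

```python
import random

random.seed(0)

def base_model(x1, x2):
    # Stand-in base model that only knows additive effects
    return 2.0 * x1 + 1.0 * x2

def frozen_correction(x1, x2):
    # Hypothetical explicit correction term, fixed once and never refit
    return 0.5 * x1 * x2

def make_batch(interaction, size=200):
    # Synthetic batch: target = base model + an interaction the base model misses
    rows = []
    for _ in range(size):
        x1, x2 = random.uniform(1, 3), random.uniform(1, 3)
        rows.append((x1, x2, base_model(x1, x2) + interaction * x1 * x2))
    return rows

def fraction_improved(batch):
    # Share of cases where adding the frozen correction lowers absolute error
    improved = 0
    for x1, x2, y in batch:
        base_err = abs(y - base_model(x1, x2))
        corrected_err = abs(y - (base_model(x1, x2) + frozen_correction(x1, x2)))
        improved += corrected_err < base_err
    return improved / len(batch)

# Same mechanism as the batch the formula was derived from
same_protocol = fraction_improved(make_batch(interaction=0.5))
# Different mechanism: the interaction has the opposite sign
other_protocol = fraction_improved(make_batch(interaction=-0.5))
print(same_protocol, other_protocol)
```

In this noise-free toy the correction helps on every case when the underlying mechanism matches and on none when it does not, which mirrors the qualitative pattern the abstract reports, not its magnitudes.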

2605.22896 2026-05-25 cs.RO cs.AI cs.LG 版本更新

Agentic-VLA: Efficient Online Adaptation for Vision-Language-Action Models

Agentic-VLA:视觉-语言-动作模型的高效在线自适应

Ruofan Jin, Zaixi Zhang

发表机构 * Ruofan Jin(金鲁凡) Zaixi Zhang(张在西)

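AI summary: This paper proposes Agentic-VLA, a training framework that improves the online adaptation efficiency of vision-language-action (VLA) models on robotic manipulation tasks. Through three core innovations (adaptive reward synthesis, language-guided exploration, and experience memory), it addresses the weak generalization to new environments and the low training efficiency of existing VLA models. Experiments show that Agentic-VLA substantially improves task success rates and learning efficiency on benchmarks such as LIBERO and RoboTwin 2.0, a step toward adaptive VLA systems capable of continual learning.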

Comments Total 15 pages

详情
AI中文摘要

视觉-语言-动作(VLA)模型通过利用预训练的视觉-语言表示,已成为机器人操作领域的一种有前景的范式。然而,当前的VLA训练方法存在两个关键局限性:对新环境的泛化能力差,以及需要大量演示数据导致的训练效率低下。我们提出Agentic-VLA,一种智能训练框架,通过三项关键创新使VLA能够在线高效自适应:(1)自适应奖励合成,根据VLA当前能力和任务复杂度动态生成并调整奖励函数,将复杂任务分解为可学习的子目标以进行课程学习;(2)语言引导探索,其中评论模型提供结构化指导以实现系统化探索,而非随机采样;(3)经验记忆,存储和检索与任务相关的策略权重,用于相似任务的预热启动自适应。我们在LIBERO基准上评估Agentic-VLA,取得了显著改进:长时域任务提升12.3%,单样本学习提升28.5%,并在无需任务特定演示的情况下实现从0%到31.2%的跨任务迁移。与现有在线自适应方法相比,我们的框架还实现了2.4倍的收敛速度提升。除LIBERO外,Agentic-VLA在双臂RoboTwin 2.0基准(包括其随机困难设置)上仍保持优势。这些结果使Agentic-VLA成为迈向真正自适应、可在部署中持续学习的VLA系统的重要一步。

英文摘要

Vision-Language-Action (VLA) models have emerged as a promising paradigm for robotic manipulation by leveraging pre-trained vision-language representations. However, current VLA training methods suffer from two critical limitations: poor generalization to novel environments and low training efficiency requiring extensive demonstrations. We introduce Agentic-VLA, an agentic training framework that enables VLAs to efficiently adapt online through three key innovations: (1) Adaptive Reward Synthesis, which dynamically generates and adjusts reward functions based on the VLA's current capabilities and task complexity, decomposing complex tasks into learnable sub-goals for curriculum learning; (2) Language-Guided Exploration, where a critic model provides structured guidance for systematic exploration rather than random sampling; and (3) Experience Memory, which stores and retrieves task-relevant policy weights for warm-starting adaptation to similar tasks. We evaluate Agentic-VLA on the LIBERO benchmark, achieving substantial improvements: +12.3% on long-horizon tasks, +28.5% in 1-shot learning, and enabling cross-task transfer from 0% to 31.2% without task-specific demonstrations. Our framework also demonstrates 2.4x faster convergence compared to existing online adaptation methods. Beyond LIBERO, Agentic-VLA retains its advantage on the dual-arm RoboTwin 2.0 benchmark, including under its randomized Hard setting. These results establish Agentic-VLA as a significant step toward truly adaptive VLA systems capable of continuous learning in deployment.

2605.22893 2026-05-25 eess.SP cs.LG 版本更新

L-FAME: Longitudinal Focused Attention Meditation EEG Dataset and Benchmark

L-FAME:纵向专注冥想脑电图数据集与基准

Angqi Li, Ab Basit Rafi Syed, Hamzeh Alzweri, Taosheng Liu, Barry H. Cohen, Saiprasad Ravishankar

发表机构 * Department of CMSE(计算数学、科学与工程系) Department of CSE(计算机科学与工程系) Michigan State University(密歇根州立大学) Department of Psychology(心理学系) Department of Applied Psychology(应用心理学系) New York University(纽约大学) Department of BME(生物医学工程系)

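AI summary: This paper introduces the L-FAME dataset and an accompanying benchmark to support research on different meditation practices and how their neural effects evolve over six weeks of training. The dataset contains EEG recordings and psychological assessments from 74 healthy college students collected before and after the intervention, with participants randomly assigned to one of three meditation groups. The benchmark comprises three classification tasks covering cognitive state decoding, fine-grained classification of meditation techniques, and cross-session adaptation, with baseline results from a range of machine learning and deep learning methods, providing a resource for computational meditation research and EEG-based machine learning.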

Comments Code and dataset available at: https://huggingface.co/datasets/L-FAME-Dataset-Benchmark/L-FAME

详情
AI中文摘要

我们引入了一个新颖的纵向专注冥想脑电图(L-FAME)数据集及配套基准,旨在促进对多种冥想实践的神经效应及其在六周训练期内演变的研究。该数据集包含74名健康大学生参与者的脑电图记录和心理评估,在两个不同时间点(干预前和干预后)收集。参与者被随机分配到三个不同的冥想组:两种基于咒语的技术(SA-TA-NA-MA和哈瑞奎师那)和一种专注呼吸练习。利用这一独特的纵向和比较数据集,我们提出了一个基准套件,包含三个不同的分类任务:(1)认知状态解码,区分休息和冥想状态;(2)特定冥想技术的细粒度分类;(3)跨会话适应,评估模型在纵向时间间隔上的泛化能力。我们利用一系列经典机器学习算法和深度学习架构为这些任务提供了全面的基线结果。完整的数据集、预处理流程和基准评估代码将公开发布,为计算冥想研究和基于脑电图的机器学习中新的分析方法的发展和比较提供宝贵的资源和标准化框架。数据集可在https://huggingface.co/datasets/L-FAME-Dataset-Benchmark/L-FAME获取。

英文摘要

We introduce a novel Longitudinal Focused Attention Meditation Electroencephalography (L-FAME) dataset and an accompanying benchmark, designed to foster research into the neural effects of various meditation practices and the evolution of these effects over a six-week training period. The dataset contains EEG recordings and psychological assessments from 74 healthy college participants, collected at two distinct time points: pre-intervention and post-intervention. Participants were randomly assigned to one of three distinct meditation groups: two mantra-based techniques (SA-TA-NA-MA and Hare Krishna) and one Breath Focus practice. Leveraging this unique longitudinal and comparative dataset, we propose a benchmark suite comprising three distinct classification tasks: (1) cognitive state decoding to distinguish between resting and meditation states, (2) fine-grained classification of the specific meditation techniques, and (3) cross-session adaptation to evaluate model generalization across the longitudinal time gap. We provide comprehensive baseline results for these tasks utilizing a range of classical machine learning algorithms and deep learning architectures. The complete dataset, preprocessing pipelines, and benchmark evaluation code will be publicly released, offering a valuable resource and a standardized framework for the development and comparison of new analytical methods in computational meditation research and EEG-based machine learning. The dataset is available at https://huggingface.co/datasets/L-FAME-Dataset-Benchmark/L-FAME

2605.22891 2026-05-25 cs.LG hep-ex 版本更新

Pointwise Metrics Mislead: An Evaluation Protocol for Multimodal Inverse Problems

逐点度量误导:多模态逆问题的评估协议

Mads H. Baattrup, Jörn Bach, Laurids Jeppe, Finn Labe, Alexander Grohsjean, Christian Schwanenberger, Peer Stelldinger

发表机构 * Deutsches Elektronen-Synchrotron(德意志电子同步辐射研究中心) CERN(欧洲核子研究中心) University of Hamburg(汉堡大学) HAW Hamburg(汉堡应用技术大学)

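AI summary: This paper argues that in inverse problems with multimodal posteriors, conventional pointwise metrics such as RMSE and MAE give a misleading picture of scientific reconstruction quality because they do not reflect the structure of the posterior distribution. It proposes a three-part evaluation protocol that assesses distributional accuracy, spectrum fidelity, and uncertainty calibration. Experiments on a synthetic problem and a realistic physics problem show that the protocol separates models effectively and exposes scientifically important features that conventional metrics miss.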

Comments 29 pages, 9 figures, and 8 tables (including appendix)

详情
AI中文摘要

科学重建中的评估以逐点度量为主——RMSE、MAE、每事件分辨率——隐含假设误差越小重建越好。我们表明,对于具有多模态后验的逆问题,这一假设在结构上失败。根据总方差定律,当后验具有非零宽度时,训练以最小化MSE或MAE的点估计器产生的边际谱严格窄于真实值。由此产生的偏差独立于架构、训练和数据集大小,并且精确压缩了下游科学测量所依赖的谱特征——尾部、模态、形状。我们提出一个三部分评估协议,其中每一步针对其他步骤遗漏的失败模式:通过CRPS的每事件分布准确性、通过谱保真度诊断的总体边际准确性、以及通过基于覆盖的校准的不确定性可信度。在具有解析后验的合成基准和来自粒子物理的现实多对一逆问题上,模型排名在逐点度量和分布度量之间发生逆转,而校准进一步区分了在CRPS下无法区分的架构。决定科学结论的是评估协议,而非模型。

英文摘要

Evaluation in scientific reconstruction is dominated by pointwise metrics - RMSE, MAE, per-event resolution - under the implicit assumption that lower error means better reconstruction. We show that this assumption fails structurally for inverse problems with multimodal posteriors. By the law of total variance, point estimators trained to minimize MSE or MAE produce a marginal spectrum strictly narrower than the truth whenever the posterior has nonzero width. The resulting bias is independent of architecture, training, and dataset size, and it compresses precisely the spectral features - tails, modes, shapes - that downstream scientific measurements rely on. We propose a three-part evaluation protocol where each step targets a failure mode the others miss: per-event distributional accuracy via CRPS, population-level marginal accuracy via a spectrum-fidelity diagnostic, and uncertainty trustworthiness via coverage-based calibration. On a synthetic benchmark with an analytic posterior and on a realistic many-to-one inverse problem from particle physics, model rankings reverse between pointwise and distributional metrics, and calibration further separates architectures indistinguishable under CRPS. The evaluation protocol, not the model, determines the scientific conclusion.
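
The narrowing argument follows from the law of total variance, $\mathrm{Var}(x)=\mathbb{E}[\mathrm{Var}(x\mid y)]+\mathrm{Var}(\mathbb{E}[x\mid y])$: an MSE-optimal point estimate reproduces only the second term. The sketch below is a minimal synthetic illustration, not the paper's benchmark. It assumes a made-up setup in which the target is the observation plus a hidden offset of $\pm 1$, so the posterior is bimodal and the conditional mean sits between the two modes.

```python
import random
import statistics

random.seed(0)

n = 20000
truth, point_estimate = [], []
for _ in range(n):
    y = random.gauss(0.0, 1.0)         # observed quantity
    mode = random.choice((-1.0, 1.0))  # hidden two-mode ambiguity
    x = y + mode                       # true target; posterior of x given y is bimodal
    truth.append(x)
    point_estimate.append(y)           # E[x | y] = y, the MSE-optimal point estimate

var_truth = statistics.pvariance(truth)           # spread of the true spectrum
var_point = statistics.pvariance(point_estimate)  # spread of the reconstructed spectrum
mse = sum((x - e) ** 2 for x, e in zip(truth, point_estimate)) / n
print(var_truth, var_point, mse)
```

The reconstructed spectrum is narrower than the truth by roughly the mean squared error, even though the estimator is optimal for that error, which is why a pointwise metric alone cannot reveal the compression.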

2605.22886 2026-05-25 cs.IT cs.LG cs.NI math.IT 版本更新

Resilience Characterization of AI-Native Wireless Receivers via Persistent Homology

基于持续同调的AI原生无线接收机韧性表征

Christo Kurisummoottil Thomas, Emilio Calvanese Strinati

发表机构 * CEA-Leti(CEA-莱提)

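AI summary: This paper studies the resilience of deep-learning-based wireless receivers under non-stationary channels and proposes the Topological Resilience Index (TRI), a real-time metric based on persistent homology that quantifies the structural stability of a neural receiver during online adaptation. TRI characterizes resilience along three complementary dimensions: model-channel mismatch, channel impulse response distribution shift, and channel manifold topology. Theoretical analysis shows that TRI is bounded, monotonic, and stable, and simulations on an OFDM receiver show that it warns of channel changes earlier than conventional baselines and, when used to guide re-adaptation, substantially reduces bit error rate.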

详情
AI中文摘要

基于深度学习的AI原生无线接收机在平稳信道条件下表现出卓越性能,但其对分布偏移的韧性仍难以通过误码率(BER)等传统指标有效表征。为克服这些局限,本文提出一种新颖的实时指标——拓扑韧性指数(TRI),该指标基于持续同调和持续指数。TRI量化了神经网络接收机参数空间在在线适应非平稳信道过程中的结构稳定性。具体而言,TRI通过三个互补维度捕捉韧性:(i)验证损失韧性,衡量模型-信道失配,基于损失景观子水平集的拓扑持续性;(ii)信道冲激响应(CIR)分布偏移,追踪CIR向量相对于校准参考分布的几何漂移;(iii)信道流形拓扑,通过经Ollivier-Ricci曲率范数归一化的高斯核矩阵谱隙量化。我们建立了理论保证,表明TRI具有有界性、在性能退化下的单调性,以及关于Wasserstein距离度量的信道分布扰动的Lipschitz稳定性。针对一个OFDM深度学习接收机在三种偏移速率下跨越十个ITU-R环境间转换的仿真结果表明,TRI相比梯度范数和验证损失基线,提供了一致的大于一个OFDM符号的平均预警提前量,而梯度范数基线在每种场景下均实现零提前量。此外,所提出的TRI引导的突发重适应在200个OFDM符号内将后偏移BER相对于无适应降低了80%。

英文摘要

AI-native wireless receivers based on deep learning exhibit remarkable performance under stationary channel conditions, yet their resilience to distributional shifts remains poorly characterized by conventional metrics such as bit error rate (BER). To overcome these limitations, this paper proposes a novel real-time metric, the Topological Resilience Index (TRI), grounded in persistent homology and persistence exponents. TRI quantifies the structural stability of a neural network receiver's parameter space during online adaptation to non-stationary channels. Specifically, TRI captures resilience through three complementary dimensions: (i) validation-loss resilience measuring model-channel mismatch, grounded in the topological persistence of loss-landscape sublevel sets; (ii) channel impulse response (CIR) distribution shift, tracking geometric drift of CIR vectors from the calibration reference distribution; and (iii) channel manifold topology, quantified by the spectral gap of the Gaussian kernel matrix normalized by the Ollivier-Ricci curvature norm. We establish theoretical guarantees showing that TRI is bounded, monotonic under performance degradation, and Lipschitz-stable with respect to perturbations in channel distributions measured in Wasserstein distance. Simulation results for an OFDM deep-learning receiver adapting across ten ITU-R inter-environment transitions at three shift rates demonstrate that TRI provides a consistent mean warning lead of more than one OFDM symbol over gradient-norm and validation-loss baselines, whereas the gradient-norm baseline achieves zero lead in every scenario. Furthermore, the proposed TRI-guided burst re-adaptation reduces post-shift BER by 80% relative to no adaptation within 200 OFDM symbols.

2605.22885 2026-05-25 cs.AI cs.CL cs.LG cs.LO 版本更新

ImProver 2: Iteratively Self-Improving LMs for Neurosymbolic Proof Optimization

ImProver 2:用于神经符号证明优化的迭代自改进语言模型

Riyaz Ahuja, Tate Rowney, Jeremy Avigad, Sean Welleck

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

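AI summary: As formal mathematics libraries grow rapidly, there is an increasing need to refactor verified proofs and to improve the quality of training data for neural provers. To address the heterogeneous objectives, scarce data, and high training and inference costs that hinder scalable proof optimization, this paper proposes ImProver 2, a neurosymbolic framework for Lean 4 that combines a data-efficient expert-iteration pipeline with a scaffold exposing formal structure alongside lightweight informal abstractions, and introduces a suite of metrics for structural proof properties. Experiments show that the framework lets a small model match or exceed much larger models on several metrics, indicating that proof optimization is a scalable, learnable task.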

详情
AI中文摘要

形式化数学库正在迅速扩展,这产生了对已验证证明进行重构以保持可维护性以及提高神经证明器训练数据质量的日益增长的需求。然而,可扩展的证明优化受到异构且启发式指定的目标、稀缺的数据以及高训练和推理成本的阻碍。为了克服这些挑战,我们引入了ImProver 2,这是一个用于在Lean 4中自动进行证明优化的神经符号框架。ImProver 2将数据高效的专家迭代流程与一个暴露形式结构并附带轻量级非正式抽象的脚手架相结合。我们进一步引入了一套捕捉证明结构属性的指标。使用ImProver 2,我们训练了一个7B参数的模型,该模型在相同模型系列中优于数量级更大的模型,并且在各项指标上与中端前沿模型具有竞争力。我们还证明,我们的神经符号脚手架显著提高了小型和前沿模型的性能。我们表明,通过适当的脚手架和训练,小型模型可以有效地在复杂且多样的指标上重构研究级证明,与更大的系统相匹配,并将证明优化确立为一项可扩展、可学习的任务。

英文摘要

Formal mathematics libraries are rapidly expanding, creating a growing need to refactor verified proofs for maintainability and to improve training data quality for neural provers. However, scalable proof optimization is hindered by heterogeneous and heuristically specified objectives, scarce data, and high training and inference costs. To overcome these challenges, we introduce ImProver 2, a neurosymbolic framework for automated proof optimization in Lean 4. ImProver 2 combines a data-efficient expert-iteration pipeline with a scaffold that exposes formal structure alongside lightweight informal abstractions. We further introduce a suite of metrics capturing structural proof properties. Using ImProver 2, we train a 7B-parameter model that outperforms orders-of-magnitude larger models within the same model family, and is competitive with mid-tier frontier models across metrics. We additionally demonstrate that our neurosymbolic scaffold significantly improves performance across both small and frontier models. We show that with proper scaffolding and training, small models can effectively restructure research-level proofs over complex and varied metrics, matching substantially larger systems and establishing proof optimization as a scalable, learnable task.

2605.22884 2026-05-25 cs.LG cs.AI 版本更新

Tensor Cache: Eviction-conditioned Associative Memory for Transformers

Tensor Cache: 基于驱逐条件的Transformer联想记忆

Kabir Swain, Sijie Han, Daniel Karl I. Weidele, Mauro Martino, Antonio Torralba

发表机构 * Massachusetts Institute of Technology, Cambridge, MA, USA(麻省理工学院) IBM Research, Cambridge, MA, USA(IBM研究院) University of Toronto, Toronto, Canada(多伦多大学)

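AI summary: This paper proposes Tensor Cache, a two-level caching mechanism that improves the memory efficiency and quality of Transformers on long contexts. It uses sliding-window attention as the first-level cache (L1) and compresses the key-value pairs evicted from the window into a second-level cache (L2), an outer-product fast-weight memory that supports efficient recall. The work also identifies spurious outer products implicitly introduced by an existing training shortcut and proposes a fix; experiments show that Tensor Cache improves the memory-quality trade-off across several tasks.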

详情
AI中文摘要

自回归Transformer的KV缓存随上下文长度线性增长;滑动窗口缓存限制了内存但完全丢弃被驱逐的token,使得窗口外的相关证据变得不可访问。我们引入了\emph{Tensor Cache},一种双层缓存,将滑动窗口softmax注意力作为第一级缓存(L1),与一个固定大小的外积快速权重记忆作为第二级缓存(L2)配对,L2由从窗口中驱逐的KV对提供。最近的token保留在精确的局部注意力中;被驱逐的对被压缩成一个每层矩阵$A$,并通过单个矩阵乘法被未来的查询读取,利用了线性注意力恒等式$q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$。一个可学习的标量门融合L1和L2的输出,并且每头的衰减和写入率参数是端到端训练的。外积记忆和读取恒等式是众所周知的;我们的贡献是将其用作仅由滑动窗口驱逐提供的L2缓存,加上识别出常见的分块均值训练捷径$A\!\leftarrow\!λA\!+\!η(\bar k\!\otimes\!\bar v)$在每个块中静默地引入了$C^2{-}C$个虚假的跨token外积,并通过一个并行的加权和扫描(等价于在float32 epsilon内的每token写入)来弥补这一差距。跨系统规模、受控联想回忆、长上下文语言建模和记忆容量诊断,Tensor Cache在有限状态基线上改善了记忆-质量边界。

英文摘要

Autoregressive Transformer KV caches grow linearly with context length; sliding-window caching bounds memory but discards evicted tokens entirely, so relevant evidence outside the window becomes inaccessible. We introduce \emph{Tensor Cache}, a two-level cache that pairs sliding-window softmax attention as a first-level cache (L1) with a fixed-size outer-product fast-weight memory as a second-level cache (L2) fed by KV pairs evicted from the window. Recent tokens remain in exact local attention; evicted pairs are compressed into a per-layer matrix $A$ and read by future queries through a single matrix multiplication, exploiting the linear-attention identity $q_t(k_i \otimes v_i)=\langle q_t,k_i\rangle v_i$. A learned scalar gate fuses the L1 and L2 outputs, and per-head decay and write-rate parameters are trained end-to-end. The outer-product memory and the read identity are well-known; our contribution is their use as an L2 cache fed exclusively by sliding-window evictions, plus identifying that the common chunked-mean training shortcut $A\!\leftarrow\!λA\!+\!η(\bar k\!\otimes\!\bar v)$ silently introduces $C^2{-}C$ spurious cross-token outer products per chunk, and closing the gap with a parallel weighted-sum scan equivalent to per-token writes within float32 epsilon. Across systems scaling, controlled associative recall, long-context language modeling, and memory-capacity diagnostics, Tensor Cache improves the memory--quality frontier over bounded-state baselines.
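
The read identity and the $C^2{-}C$ count can be verified on a tiny example. This is a sketch under simplifying assumptions, not the paper's implementation: decay and write rate are fixed to $λ=1$ and $η=1$, the chunk is written as a sum instead of a mean (the mean only rescales it by $1/C^2$), and all helper names are hypothetical.

```python
def outer(k, v):
    # Outer product k (x) v as a len(k) x len(v) matrix
    return [[ki * vj for vj in v] for ki in k]

def mat_add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]

def read(q, A):
    # Row vector q times matrix A: one matrix multiplication per query
    return [sum(q[i] * A[i][j] for i in range(len(q))) for j in range(len(A[0]))]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

keys = [[1, 0], [0, 1], [1, 1]]             # one chunk of C = 3 evicted keys
values = [[2, 0, 1], [0, 3, 1], [4, 4, 0]]  # their values
q = [1, 2]                                  # a later query

# Per-token writes: A = sum_i k_i (x) v_i
A = [[0] * 3 for _ in range(2)]
for k, v in zip(keys, values):
    A = mat_add(A, outer(k, v))

# Linear-attention identity: q A = sum_i <q, k_i> v_i
expected = [sum(dot(q, k) * v[j] for k, v in zip(keys, values)) for j in range(3)]

# Chunked shortcut: (sum_i k_i) (x) (sum_j v_j) = sum over all C^2 pairs (i, j)
C = len(keys)
k_sum = [sum(col) for col in zip(*keys)]
v_sum = [sum(col) for col in zip(*values)]
A_chunk = outer(k_sum, v_sum)
spurious_terms = C * C - C  # pairs with i != j that per-token writes never create

print(read(q, A), expected, A_chunk != A, spurious_terms)
```

Only the $C$ diagonal pairs $(i,i)$ correspond to genuine key-value associations; the remaining $C^2{-}C$ pairs bind each key to values of other tokens, which is why the chunked matrix differs from the per-token one.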

2605.22883 2026-05-25 cs.AI cs.LG cs.PF 版本更新

Energy per Successful Goal: Goal-Level Energy Accounting for Agentic AI Systems

每个成功目标的能量:面向智能体AI系统的目标级能量核算

Deepak Panigrahy, Aakash Tyagi

发表机构 * Independent Researcher(独立研究者) Texas A&M University(德克萨斯A&M大学) Texas A&M University Department of Computer Science(德克萨斯A&M大学计算机科学系)

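AI summary: Current AI energy benchmarks typically use a single model invocation or training run as the unit of measurement, which does not accurately reflect the energy cost of multi-step tasks in agentic systems. This paper proposes the A-LEMS framework, which changes the accounting unit from energy per inference to Energy per Successful Goal (EpG) and introduces the Orchestration Overhead Index (OOI) to quantify the effect of orchestration structure on energy use. The study finds that agentic workflows consume on average 4.33 times the energy of linear baselines per successful goal, a difference driven by orchestration structure rather than inference compute, and argues that EpG and OOI offer a more accurate basis for benchmarking agentic AI energy use.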

Comments 34 pages, 16 figures, 10 tables

详情
AI中文摘要

当前的AI能量基准在单次模型调用或训练运行的粒度上测量能耗。对于经典的单轮工作负载,这种单位仍然一致。但对于智能体系统——其中单个用户目标可能触发多步编排、工具调用、重试和故障恢复循环——调用次数是实现产物而非任务属性,推理级归一化错误地表示了目标完成的能量成本。我们提出A-LEMS(智能体LLM能量测量系统),一个跨层测量框架,将AI能量核算单位从每次推理能量重新定义为每个成功目标能量(EpG)。EpG聚合所有执行尝试(包括失败和重试)的总工作流能量,归一化到成功完成的目标。A-LEMS通过时间边界模型、将RAPL信号映射到工作流级能量的五层观测管道,以及将每次测量绑定到硬件和运行时配置的可复现协议,形式化了能量归因。基于EpG,我们定义了编排开销指数(OOI),在相同任务标准下隔离编排相对于线性执行的能耗成本。在五个推理和三个工具增强任务族中,智能体工作流每个成功目标的平均能耗是线性基线的4.33倍(888.1 J vs 205.3 J)。这种开销由编排结构驱动,而非推理计算。对于工具增强任务,OOI反转至低于1.0倍:智能体执行比线性更便宜,确认该指标捕捉了编排结构而非固定的向上偏差。这些发现表明,每次推理能量对于智能体AI是不充分的。EpG和OOI为准确基准测试提供了测量基础,其中编排结构是能耗的主要决定因素。

英文摘要

Current AI energy benchmarks measure consumption at the granularity of a single model invocation or training run. For classical single-turn workloads this unit remains coherent. For agentic systems - where a single user goal may trigger multi-step orchestration, tool calls, retries, and failure-recovery cycles - the invocation count is an implementation artifact rather than a task property, and inference-level normalization misrepresents the energy cost of goal completion. We present A-LEMS (Agentic LLM Energy Measurement System), a cross-layer measurement framework that redefines the unit of AI energy accounting from energy per inference to Energy per Successful Goal (EpG). EpG aggregates total workflow energy across all execution attempts, including failures and retries, normalized by successfully completed goals. A-LEMS formalizes energy attribution through a temporal boundary model, a five-layer observation pipeline mapping RAPL signals to workflow-level energy, and a reproducibility protocol binding every measurement to hardware and runtime configuration. Building on EpG, we define the Orchestration Overhead Index (OOI), isolating the energy cost of orchestration relative to linear execution under identical task criteria. Across five reasoning and three tool-augmented task families, agentic workflows consume 4.33x higher mean energy per successful goal than linear baselines (888.1 J vs 205.3 J). This overhead is driven by orchestration structure, not inference compute. For tool-augmented tasks, OOI inverts below 1.0x: agentic execution is cheaper than linear, confirming the metric captures orchestration structure rather than a fixed upward bias. These findings establish that energy-per-inference is insufficient for agentic AI. EpG and OOI provide the measurement foundation for accurate benchmarking, where orchestration structure is the primary determinant of energy cost.
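
As described, EpG divides the energy of all attempts, failed ones included, by the number of successfully completed goals. The sketch below is one plausible reading and not the A-LEMS implementation: the `Attempt` record is hypothetical, each successful attempt is assumed to complete exactly one goal, and OOI is assumed to be the plain ratio of agentic EpG to linear EpG, which is consistent with the abstract's "4.33x" and "below 1.0x" wording.

```python
from dataclasses import dataclass

@dataclass
class Attempt:
    energy_joules: float  # measured energy for this execution attempt
    succeeded: bool       # whether the attempt completed its goal

def energy_per_successful_goal(attempts):
    # Failures and retries still count toward the numerator
    total = sum(a.energy_joules for a in attempts)
    successes = sum(a.succeeded for a in attempts)
    if successes == 0:
        raise ValueError("EpG is undefined when no goal succeeds")
    return total / successes

def orchestration_overhead_index(epg_agentic, epg_linear):
    # Assumed definition: ratio of agentic to linear EpG under the same task criteria
    return epg_agentic / epg_linear

# Illustrative numbers only: one failed attempt and two successful ones
attempts = [Attempt(300.0, False), Attempt(350.0, True), Attempt(250.0, True)]
epg = energy_per_successful_goal(attempts)

# Mean EpG values quoted in the abstract (888.1 J agentic vs 205.3 J linear)
ratio = orchestration_overhead_index(888.1, 205.3)
print(epg, round(ratio, 2))
```

With the abstract's two means the ratio rounds to 4.33, matching the reported overhead; a value below 1.0 would mean the agentic workflow is the cheaper one.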

2605.22878 2026-05-25 cs.AI cs.CL cs.IR cs.LG 版本更新

SciAtlas: A Large-Scale Knowledge Graph for Automated Scientific Research

SciAtlas:面向自动化科学研究的大规模知识图谱

Shuofei Qiao, Yunxiang Wei, Jiazheng Fan, Bin Wu, Busheng Zhang, Mengru Wang, Yuqi Zhu, Ningyu Zhang, Keyan Ding, Qiang Zhang, Huajun Chen

发表机构 * Zhejiang University(浙江大学) University College London(伦敦大学学院)

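AI summary: With the exponential growth of global academic output, researchers and AI agents face an unprecedented "information explosion", and fragmented, unstructured knowledge organization impedes deep interdisciplinary integration. To address this, the paper presents SciAtlas, a multi-disciplinary, heterogeneous academic knowledge graph covering 26 disciplines with over 43 million papers, 157 million entities, and 3 billion triplets, designed as a panoramic scientific evolution network. SciAtlas provides a structured topological foundation that breaks down disciplinary barriers and, through a neuro-symbolic retrieval algorithm, moves from semantic matching to deterministic association discovery, serving as a low-cost "cognitive map" for the full loop of automated scientific research.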

Comments Ongoing Work

详情
AI中文摘要

全球学术产出的指数级增长使研究人员和AI代理面临前所未有的“信息爆炸”,其中碎片化和非结构化的知识组织阻碍了深层次的跨学科整合。当前的学术检索工具主要依赖浅层关键词匹配或向量空间语义检索,缺乏导航复杂逻辑连接所需的拓扑推理能力。基于代理的深度研究框架往往容易出现逻辑幻觉并消耗高推理成本。为弥补这一差距,本报告介绍了SciAtlas,一个大规模、多学科、异构的学术资源知识图谱,设计为全景科学演化网络。通过整合来自26个学科的超过4300万篇论文,总计1.57亿个实体和30亿个三元组,SciAtlas提供了一个结构化的拓扑认知基础,打破了学科壁垒,并为AI代理提供了全局视角。此外,我们开发了一种神经符号检索算法,具有三路径协同召回和图重排序,实现了从简单语义匹配到确定性关联发现的无缝过渡。我们还展示了SciAtlas的关键应用方向,包括文献综述、自动化研究趋势综合、想法定位和学术轨迹探索,以证明SciAtlas可以作为有效的“认知地图”,赋能自动化科学研究的全流程,同时显著降低推理成本。我们已在GitHub仓库中发布了知识图谱检索和各种下游任务的接口。

英文摘要

The exponential growth of global academic output has confronted researchers and AI agents with an unprecedented "information explosion," where fragmented and unstructured knowledge organization impedes deep interdisciplinary integration. Current academic retrieval tools predominantly rely on superficial keyword matching or vector-space semantic retrieval, which lack the topological reasoning capabilities required to navigate complex logical connections. Agentic deep-research-based frameworks are often prone to logical hallucinations and incur high inference costs. To bridge this gap, in this report, we introduce SciAtlas, a large-scale, multi-disciplinary, heterogeneous academic resource knowledge graph designed as a panoramic scientific evolution network. By integrating over 43M papers from 26 disciplines, and a total of 157M entities and 3B triplets, SciAtlas provides a structured topological cognitive substrate that dismantles disciplinary barriers and furnishes AI agents with a global perspective. Furthermore, we develop a neuro-symbolic retrieval algorithm featuring tri-path collaborative recall and graph reranking, achieving a seamless transition from simple semantic matching to deterministic association discovery. We also present key application directions of SciAtlas, including literature review, automated research trend synthesis, idea positioning, and academic trajectory exploration, to demonstrate that SciAtlas can serve as an effective "cognitive map" to empower the full loop of automated scientific research while significantly reducing reasoning costs. We have released the interfaces for KG retrieval and various downstream tasks in our GitHub repo.

2605.22876 2026-05-25 cs.LG 版本更新

WeCon: An Efficient Weight-Conditioned Neural Solver for Multi-Objective Combinatorial Optimization Problems

WeCon: 一种高效的权重条件神经求解器用于多目标组合优化问题

Xuan Wu, Jinbiao Chen, Yang Li, Lijie Wen, Chunguo Wu, Yuanshu Li, Yubin Xiao, Chunyan Miao, You Zhou, Di Wang

发表机构 * Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education(教育部符号计算与知识工程重点实验室) College of Computer Science and Technology, Jilin University(吉林大学计算机科学与技术学院) Department of Industrial Systems Engineering and Management, National University of Singapore(新加坡国立大学工业系统工程与管理系) College of Software, Jilin University(吉林大学软件学院) School of Software, Tsinghua University(清华大学软件学院) School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院)

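AI summary: This paper proposes WeCon, an efficient weight-conditioned neural solver for multi-objective combinatorial optimization problems. Its encoder, built from three attention blocks and a Gated Residual Fusion block, strengthens the interaction between instance features and weights to produce a more informative weight-conditioned context, and a Residual Fusion block in the decoder mitigates weight-signal dilution. The paper also proposes Efficient Preference Optimization (EPO), which generates higher-quality solution pairs to improve training. Experiments show that WeCon performs comparably to the state-of-the-art solver across problem scales and distribution patterns while reducing inference time by 40%.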

详情
AI中文摘要

现有的多目标组合优化问题(MOCOP)神经求解器通常采用基于分解的策略,将MOCOP标量化为多个与不同权重向量相关的子问题。然而,它们要么仅在解码过程中注入一次权重,限制了权重条件上下文建模,要么主要在编码过程中注入,导致解码过程中权重信号稀释。此外,偏好优化方法依赖纯随机采样来构建解对以训练求解器,这通常产生信息量较少的解对,从而导致训练效率低下。为了更好地解决这些局限性,我们提出了一种高效的权重条件神经求解器(WeCon)。具体来说,我们设计了一个具有三个注意力块和我们提出的门控残差融合(GRF)块的编码器层,以促进实例特征和权重之间的和谐交互,从而生成信息丰富的权重条件上下文。我们进一步在解码器中引入了一个即插即用的残差融合(RF)块,以减轻权重信号稀释。最后,我们提出了高效偏好优化(EPO),它构建高质量的解,从而生成更多信息量的解对以提高训练效率。在不同问题规模和分布模式下的四个MOCOP变体上的实验表明,WeCon实现了与最先进求解器POCCO-W相当的HyperVolume(HV)值,同时将推理时间减少了40%。消融研究验证了所有设计的贡献。

英文摘要

Existing neural solvers for Multi-Objective Combinatorial Optimization Problems (MOCOPs) commonly adopt decomposition-based strategies that scalarize an MOCOP into multiple subproblems associated with distinct weight vectors. However, they either inject weights only once during decoding, limiting weight-conditioned context modeling, or primarily during encoding, causing weight-signal dilution during decoding. Moreover, preference optimization methods rely on purely random sampling to construct solution pairs for training solvers, which often produces less informative pairs and thus leads to low training effectiveness. To better address these limitations, we propose an efficient Weight-Conditioned neural solver (WeCon). Specifically, we design an encoder layer with three attention blocks and our proposed Gated Residual Fusion (GRF) block to facilitate harmonious interaction between instance features and weights, thereby generating informative weight-conditioned context. We further introduce a plug-and-play Residual Fusion (RF) block in the decoder to alleviate weight-signal dilution. Finally, we propose Efficient Preference Optimization (EPO), which constructs high-quality solutions, thereby generating more informative pairs to improve training effectiveness. Experiments on four MOCOP variants across different problem scales and distribution patterns demonstrate that WeCon achieves HyperVolume (HV) values comparable to SOTA solver POCCO-W, while reducing inference time by 40%. Ablation studies validate the contributions of all designs.
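
The decomposition step that these solvers share can be shown in a few lines. The sketch below assumes weighted-sum scalarization, which is one common choice and may not be the one WeCon uses, and the candidate solutions and their objective values are hypothetical. Each weight vector defines a single-objective subproblem, and different weights select different trade-off solutions.

```python
def weighted_sum(objectives, weights):
    # Collapse a vector of objective values into one scalar cost
    return sum(w * f for w, f in zip(weights, objectives))

# Hypothetical candidate solutions of a bi-objective minimisation problem,
# each described by its (f1, f2) objective values
candidates = {"A": (1.0, 9.0), "B": (4.0, 4.0), "C": (9.0, 1.0)}

# Each weight vector defines one scalarized subproblem
weight_vectors = [(0.9, 0.1), (0.5, 0.5), (0.1, 0.9)]

best_per_weight = {
    w: min(candidates, key=lambda name: weighted_sum(candidates[name], w))
    for w in weight_vectors
}
print(best_per_weight)
```

A neural solver replaces the exhaustive `min` with a learned policy conditioned on the weight vector, which is where the question of how and when to inject the weights arises.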

2605.22875 2026-05-25 cs.AI cs.LG 版本更新

RMA: an Agentic System for Research-Level Mathematical Problems

RMA:一个面向研究级数学问题的智能体系统

Zelin Zhao, Bo Yuan, Jaemoo Choi, Yongxin Chen

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

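AI summary: This paper proposes RMA, an agentic system designed for research-level mathematical problems. RMA decomposes the work into modules for problem analysis, literature search, fair comparison, knowledge-bank construction, and proof verification, coordinated by initializer, proposer, and verifier agents, enabling long-horizon reasoning and iterative proof refinement on hard problems. In experiments on the First Proof benchmark, RMA solves eight of the ten problems and produces proofs that are more logically sound and readable than those of strong baselines.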

详情
AI中文摘要

我们提出了$\textbf{Research Math Agents (RMA)}$,一个用于研究级数学问题自动推理的智能体框架。与以往专注于竞赛数学或形式化定理证明的研究不同,RMA针对需要长程推理、文献依据和迭代证明改进的研究级数学问题。RMA将研究级证明求解分解为专门模块,包括问题分析、文献搜索与理解、公平比较、知识库构建和证明验证,所有这些都由初始化器、提议器和验证器智能体通过共享的结构化内存协调。在这个统一框架内,这些智能体以多角色、多轮工作流的方式运行,通过迭代反馈协作生成、改进和验证候选证明。我们在First Proof基准上评估了RMA,该基准由来自不同领域的专家数学家贡献的十个研究级问题组成。通过全面的专家评估,RMA在First Proof基准上优于强基线(包括GPT-5.2R和Aletheia),解决了十个研究问题中的八个,并生成了逻辑更合理、可读性更强的证明。我们的全面消融研究进一步表明,性能提升来自于结构化推理模块、迭代改进和基于验证器的反馈之间的交互,而非任何单一组件。我们的解决方案和实现将在论文被接收后公开。

英文摘要

We present $\textbf{Research Math Agents (RMA)}$, an agentic framework for automated reasoning on research-level mathematical problems. Unlike prior studies centered on competition mathematics or formal theorem proving, RMA targets research-level mathematical problems that require long-horizon reasoning, literature grounding, and iterative proof refinement. RMA decomposes research-level proof solving into specialized modules for problem analysis, literature search and understanding, fair comparison, knowledge-bank construction, and proof verification, all coordinated by initializer, proposer, and verifier agents through a shared structured memory. Within this unified framework, these agents operate in a multi-role, multi-round workflow, collaboratively generating, refining, and verifying candidate proofs through iterative feedback. We evaluate RMA on the First Proof benchmark, which consists of ten research-level problems contributed by expert mathematicians across diverse domains. Through comprehensive expert evaluation, RMA outperforms strong baselines on the First Proof benchmark, including GPT-5.2R and Aletheia, solving eight out of ten research problems and producing more logically sound and readable proofs. Our comprehensive ablation studies further show that performance gains arise from the interaction of structured reasoning modules, iterative refinement, and verifier-based feedback, rather than any single component. Our solutions and implementations will be made publicly available upon acceptance.

2605.22872 2026-05-25 cs.LG cs.AI cs.CV 版本更新

MedExpMem: Adapting Experience Memory for Differential Diagnosis

MedExpMem:适应经验记忆用于鉴别诊断

Qianhan Feng, Zhongzhen Huang, Yakun Zhu, Yannian Gu, Winnie Chiu Wing Chu, Xiaofan Zhang, Qi Dou

发表机构 * The Chinese University of Hong Kong(香港中文大学) Shanghai Jiao Tong University(上海交通大学)

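AI summary: This paper proposes MedExpMem, an experience memory framework that improves the differential diagnosis ability of diagnostic agents built on vision-language models. The method records the agent's own diagnostic failures as pairwise differential notes containing key discriminators, decision rules, and reasoning error patterns, and uses a two-phase construction process that mirrors how physicians learn. Experiments on a benchmark spanning multiple radiology subspecialties show that MedExpMem improves diagnostic accuracy, supporting its value for medical adaptation.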

Comments MICCAI 2026 Early Accept. Submission Version

详情
AI中文摘要

经验丰富的医生通过临床实践发展诊断专业知识,不仅获得疾病知识,还能区分易混淆的病症。当前的医学视觉语言模型(VLM)缺乏这种能力——它们的参数编码了静态知识,不会随着诊断经历而演变。我们提出了MedExpMem,一个经验记忆框架,使基于VLM的诊断代理能够积累鉴别诊断专业知识。与检索增强生成(检索百科式疾病描述)不同,MedExpMem记忆从代理自身的诊断失败中获得的判别经验,并将其组织为成对的鉴别笔记,编码关键判别因素、可操作的决策规则和推理错误模式。该框架采用两阶段构建过程,模仿医生的学习:初始实践暴露知识差距,反思性重新诊断完善理解。当遇到新病例时,代理检索经验记忆以指导鉴别推理。我们在涵盖11个亚专业的放射学基准上评估了MedExpMem。结果表明,在不同模型和规模上,准确率持续提升,最高达7.0%。分析实验验证了经验质量和鲁棒性,表明MedExpMem是一种有竞争力的方法,解决了参数学习无法触及的医学适应需求。

英文摘要

Experienced physicians develop diagnostic expertise through clinical practice, acquiring not only disease knowledge but also the ability to differentiate confusable conditions. Current medical vision-language models (VLMs) lack this capability -- their parameters encode static knowledge that does not evolve across diagnostic encounters. We propose MedExpMem, an experience memory framework enabling VLM-based diagnostic agents to accumulate differential diagnosis expertise. Unlike retrieval-augmented generation, which retrieves encyclopedic disease descriptions, MedExpMem memorizes discriminative experience derived from the agent's own diagnostic failures and organizes it as pairwise differential notes encoding key discriminators, actionable decision rules, and reasoning error patterns. The framework adopts a two-phase construction process mirroring physician learning: initial practice exposes knowledge gaps, and reflective re-diagnosis refines understanding. When encountering new cases, the agent retrieves experience memory to guide differential reasoning. We evaluate MedExpMem on a radiology benchmark spanning 11 subspecialties. Results demonstrate consistent accuracy improvements of up to 7.0% across diverse models and scales. Analytical experiments validate experience quality and robustness, showing that MedExpMem is a competitive method that addresses medical adaptation needs beyond the reach of parametric learning.

2605.22871 2026-05-25 cs.LG cs.AI stat.ML 版本更新

Approximate Machine Unlearning through Manifold Representation Forgetting Guided by Self Mode Connectivity

通过自模式连通性引导的流形表示遗忘实现近似机器遗忘

Weiqi Wang, Zhiyi Tian, Chenhan Zhang, Luoyu Chen, Shui Yu

发表机构 * Xi'an Jiaotong University(西安交通大学) Southeast University(东南大学) University of Technology Sydney(悉尼大学)

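AI summary: This paper proposes ManiF-SMC, an approximate machine unlearning method that targets the trade-off between unlearning effectiveness and preserving the original learning objective. Motivated by the observation that a model retrained on the remaining data classifies erased samples by semantic similarity, it pushes each forgotten sample away from its original manifold representation centroid toward its semantic neighbors in the retained data. To improve unlearning while reducing reliance on labels and task gradients, ManiF-SMC uses a margin-based triplet loss and a self-mode-connectivity module that adaptively generates unlearning margins; experiments on several datasets show unlearning effectiveness comparable to state-of-the-art methods.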

详情
AI中文摘要

机器遗忘是强制执行被遗忘权的基本机制。现有的依赖标签操作或任务梯度反转的遗忘研究通常遗忘效果有限,且可能破坏原始学习目标,通常不能保证与重新训练的标准遗忘等价。本文提出ManiF-SMC(自模式连通性引导的流形遗忘),其动机是观察到在剩余数据上重新训练的模型倾向于根据保留数据中的语义相似性对擦除样本进行分类。我们首先系统地将近似遗忘重新表述为:将每个擦除样本从其原始学习的流形表示质心推向保留数据中最近的语义邻居。这种重新表述使遗忘与重新训练行为对齐,并且仅在表示空间中操作,减少了对标签和任务特定梯度的依赖。为了解决基于流形表示的遗忘问题,ManiF-SMC将遗忘和表示保留目标封装在基于边界的三元组损失中。由于为遗忘找到合适的边界具有挑战性,我们提出一个自模式连通性模块,快速重建局部流形以指导每个遗忘案例的自适应边界生成。在四个代表性数据集上的大量实验表明,ManiF-SMC在仅操作模型表示空间的情况下,实现了与最先进近似方法相当的遗忘效果。

英文摘要

Machine unlearning is a fundamental mechanism that enforces the right to be forgotten. Existing unlearning studies that rely on label manipulation or task-gradient reversal often deliver limited unlearning effectiveness. Moreover, they can undermine the original learning objective and typically do not guarantee equivalence to standard unlearning by retraining. In this paper, we propose ManiF-SMC (Manifold Forgetting with Self Mode Connectivity), motivated by the observation that a model retrained on the remaining data tends to classify erased samples by their semantic similarity to the retained data. We begin by systematically recasting approximate unlearning as pushing each erased sample away from its original learned manifold representation centroid toward its nearest semantic neighbors in the retained data. This reformulation aligns unlearning with retraining behavior and operates purely in representation space, reducing reliance on labels and task-specific gradients. To tackle the manifold representation-based unlearning problem, ManiF-SMC encapsulates the unlearning and representation preservation goals in a margin-based triplet loss. Because finding a suitable margin for unlearning is challenging, we propose a self-mode-connectivity module that rapidly reconstructs the local manifold to guide adaptive margin generation for each unlearning case. Extensive experiments on four representative datasets show that ManiF-SMC achieves unlearning effectiveness comparable to state-of-the-art approximate methods while operating solely within the model's representation space.
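The margin-based triplet objective described in the abstract can be illustrated with a minimal sketch. Everything concrete here is an assumption for illustration: the function names, the use of Euclidean distance, the choice of the retained-data neighbor as the "positive" and the original centroid as the "negative", and a fixed scalar margin. The paper's adaptive margin from the self-mode-connectivity module is not reproduced.

```python
import math


def euclidean(u, v):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))


def triplet_unlearning_loss(erased, neighbor, centroid, margin):
    """Hinge-style triplet loss (hypothetical formulation).

    The loss is zero once the erased sample's representation is closer to
    its retained-data neighbor than to its original class centroid by at
    least `margin`; otherwise it grows linearly with the violation.
    """
    pull = euclidean(erased, neighbor)   # distance we want to shrink
    push = euclidean(erased, centroid)   # distance we want to grow
    return max(0.0, pull - push + margin)


# Still near the old centroid: large loss.
loss_before = triplet_unlearning_loss([0.0, 0.0], [3.0, 4.0], [0.0, 1.0], 0.5)
# Already moved next to the retained neighbor: loss is zero.
loss_after = triplet_unlearning_loss([3.0, 3.0], [3.0, 4.0], [0.0, 1.0], 0.5)
```

In this toy setting the margin controls how far past the midpoint a representation must move before the loss stops penalizing it, which is why choosing it per sample matters.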

2605.22870 2026-05-25 cs.LG cs.AI cs.CL 版本更新

The Readout Shortcut: Positional Number Copying Dominates Arithmetic CoT Readout in Small Language Models

读出捷径:位置数字复制主导小语言模型中的算术思维链读出

Ming Liu

发表机构 * Amazon(亚马逊)

AI总结 该研究探讨了小型语言模型在进行算术推理时,思维链(CoT)提示的实际作用。研究发现,模型在输出答案时更倾向于复制位于答案分隔符前的最后一个数字,而非依赖中间推理过程。这一“位置捷径”现象显著影响了模型性能,表明当前的CoT方法可能更多依赖位置信息而非逻辑推理。实验还揭示了不同模型在复制行为上的差异,并指出这一机制可能与模型架构及任务类型相关。

Comments 18 pages (8 main + 10 appendix), 3 figures, 5 tables

详情
AI中文摘要

思维链提示对于小语言模型进行算术运算是必要的,然而打乱其步骤仍能保留大部分性能。如果思维链贡献的不是逻辑顺序,那是什么?在三个1-3B指令微调的语言模型上,针对GSM8K数据集,我们通过前缀补全隔离了答案读出阶段,并识别出一个位置捷径:模型复制占据答案分隔符前最后一个位置的数字,无论中间推理如何。正确答案的存在贡献了54-92个百分点的准确率(每个模型教师强制上限的89-92%);即使在错误项上,最终答案与思维链最后一个数字匹配的概率为95-96%。复制通道优先于保留上下文补全:用错误值替换最后一个数字会使准确率降至接近零,尽管中间步骤正确;但移除它后,准确率在该基线之上恢复5-32个百分点——当存在可复制的数字时,即使模型本可以执行的单步算术也被抑制。Qwen和Llama在87-95%的情况下复制新干扰项;Gemma则选择性门控。头部级消融实验揭示了特定于架构的头部集;该效应在GSM-Symbolic上复现。在非算术的BBH任务上,打乱保留率急剧下降;在7-8B规模时,出现了内容选择性门控。步骤级忠实度评估有风险将位置答案传输与真实计算混为一谈——这是基于思维链的监督的一个失败模式。

英文摘要

Chain-of-thought (CoT) prompting is necessary for arithmetic in small language models, yet shuffling its steps preserves most performance. What does CoT contribute if not logical sequencing? In three 1-3B instruction-tuned LMs on GSM8K, we isolate the answer-readout stage via prefix completion and identify a positional shortcut: the model copies whichever number occupies the trailing position before the answer delimiter, regardless of intermediate reasoning. Gold-answer presence accounts for 54-92 pp of accuracy (89-92% of each model's teacher-forcing ceiling); even on incorrect items, the final answer matches the last CoT number 95-96% of the time. The copy channel takes precedence over retained-context completion: replacing the trailing number with a wrong value collapses accuracy to near-zero despite correct intermediates, yet removing it recovers 5-32 pp above that floor--even single-step arithmetic the model can otherwise perform is suppressed when a copyable number is present. Qwen and Llama copy novel distractors 87-95% of the time; Gemma gates selectively. Head-level ablation implicates architecture-specific head sets; the effect replicates on GSM-Symbolic. On non-arithmetic BBH tasks, shuffle retention drops sharply; at 7-8B, content-selective gating emerges. Step-level faithfulness evaluations risk conflating positional answer transport with genuine computation--a failure mode for CoT-based oversight.
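The "final answer matches the last CoT number" statistic can be sketched as follows. This is a simplified illustration, not the paper's evaluation code: the regular expression, the function names, and the record format are assumptions, commas are treated purely as thousands separators, and a minus sign directly before a digit is read as a negative number.

```python
import re

# Optional sign, digits, optional decimal part (assumed number format).
NUMBER = re.compile(r"-?\d+(?:\.\d+)?")


def trailing_number(cot_text):
    """Return the last number appearing in the chain-of-thought, or None."""
    matches = NUMBER.findall(cot_text.replace(",", ""))
    return float(matches[-1]) if matches else None


def copy_rate(records):
    """Fraction of (cot_text, final_answer) pairs where the final answer
    equals the trailing number of the chain-of-thought."""
    hits = sum(1 for cot, answer in records if trailing_number(cot) == answer)
    return hits / len(records)


records = [
    ("3 + 4 = 7", 7.0),                 # answer equals trailing number
    ("10 / 4 = 2.5 so about 3", 2.5),   # trailing number is 3, not 2.5
]
rate = copy_rate(records)
```

A high copy rate on items the model gets wrong is the kind of signal the abstract uses to argue that readout follows position rather than the intermediate computation.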

2605.22869 2026-05-25 cs.LG 版本更新

FuRA: Full-Rank Parameter-Efficient Fine-Tuning with Spectral Preconditioning

FuRA: 基于谱预条件的全秩参数高效微调

Yequan Zhao, Ruijie Zhang, Liyan Tan, Niall Moran, Tong Qin, Zheng Zhang

发表机构 * University of California at Santa Barbara(加州大学圣芭芭拉分校) Amazon Lab126(亚马逊实验室126)

AI总结 该论文提出了一种名为FuRA的全秩参数高效微调方法,通过谱预处理技术改进传统微调和参数高效微调方法。FuRA利用全秩奇异值分解对权重矩阵进行重参数化,并固定预训练的奇异基,从而约束更新方向,提升优化稳定性。该方法基于块张量训练分解框架,仅优化紧凑核心和奇异值,实现了与LoRA相当的参数、内存和训练效率,同时在多个任务中表现出更优的性能。

详情
AI中文摘要

全微调(Full FT)和参数高效微调方法(如LoRA)在更新权重时未考虑预训练期间建立的谱结构。因此,来自有限微调数据的噪声梯度可能扰动鲁棒的预训练特征。我们识别出谱预条件是缺失的关键:通过全秩奇异值分解(SVD)重新参数化每个权重矩阵,并冻结一个奇异基,将更新约束到预训练的列空间,从而产生一个预条件优化方案,在相同可训练参数数量下优于无约束的全微调。基于这一见解,我们提出了FuRA(全秩适应),一种基于块张量列车分解 W = LSR 的高效全秩适应框架,其中大核心 L 固定为预训练的块状SVD基,而仅优化紧凑核心 R 和块状奇异值 S。该设计同时提供全秩谱预条件、保持全秩更新表达能力,并实现与LoRA相当的参数、内存和步时间效率。FuRA在多种设置下持续优于全微调,包括LLM微调(LLaMA-3-8B常识推理+1.37)、LLM数学推理强化学习以及VLM的视觉指令微调。此外,4位量化变体QFuRA也优于QLoRA。代码可在 https://github.com/olokevin/FuRA-NIPS 获取。

英文摘要

Both full fine-tuning (Full FT) and parameter-efficient fine-tuning methods such as LoRA introduce weight updates without accounting for the spectral structure established during pretraining. As a result, noisy gradients from limited fine-tuning data can perturb robust pretrained features. We identify spectral preconditioning as the missing ingredient: reparameterizing each weight matrix through its full-rank singular value decomposition (SVD) and freezing one singular basis constrains updates to the pretrained column space, yielding a preconditioned optimization scheme that outperforms unconstrained Full FT at the same trainable parameter count. Building on this insight, we propose FuRA (Full-Rank Adaptation), an efficient full-rank adaptation framework based on a block tensor-train factorization W = LSR, where the large core L is fixed to the pretrained block-wise SVD basis, while only the compact core R and the block-wise singular values S are optimized. This design simultaneously provides full-rank spectral preconditioning, preserves full-rank update expressivity, and achieves parameter, memory, and step-time efficiency comparable to LoRA. FuRA consistently outperforms Full FT across multiple settings, including LLM fine-tuning (+1.37 on LLaMA-3-8B commonsense reasoning), LLM reinforcement learning for mathematical reasoning, and visual instruction tuning for VLMs. Furthermore, the 4-bit quantized variant, QFuRA, also surpasses QLoRA. Code is available at https://github.com/olokevin/FuRA-NIPS

2605.22868 2026-05-25 cs.LG 版本更新

FusionSense: Tri-Stage Near-Sensor Learning for Runtime-Adaptive Multimodal Edge Intelligence

FusionSense: 用于运行时自适应多模态边缘智能的三阶段近传感器学习

Sanggeon Yun, Ryozo Masukawa, Minhyoung Na, Hyunwoo Oh, Yoshiki Yamaguchi, Wenjun Huang, SungHeon Jeong, Mohsen Imani

发表机构 * University of California, Irvine(加州大学尔湾分校) Kookmin University(韩国国民大学) Shibaura Institute of Technology(芝浦工业大学)

AI总结 随着自主系统和智能制造对边缘计算的需求增加,如何在有限的能耗、延迟和可靠性条件下实现运行时自适应的多模态感知成为一个关键问题。FusionSense 提出了一种三阶段的近传感器学习框架,通过融合感知决策来优化计算与通信资源,有效减少冗余传输并提升系统效率。该方法在双模态传感器设置下表现出显著的能效提升和数据压缩性能,相比传统方法在多个指标上均有明显优势。

Comments Accepted to ISLPED 2026

详情
AI中文摘要

自主系统和智能工业部署越来越多地将计算分散在近传感器、边缘和云资源之间,其中严格的能量、延迟和可靠性预算要求运行时自适应性。在实践中,决定在每个点计算和传输什么至关重要;然而,随着多模态传感器套件(相机、LiDAR/深度等)在边缘激增,大多数先前的方法要么(i)在强大的服务器上融合模态,要么(ii)应用忽略跨模态依赖的单模态近传感器滤波器,导致冗余传输或遗漏事件。我们提出FusionSense,一种面向能量受限自主边缘系统的融合感知智能传感框架。轻量级近传感器分类器通过三步过程训练:(i)服务器端融合模型学习下游任务,(ii)过滤安全(FoS)标签量化每个模态相对于融合决策的必要性,(iii)通过注入近传感器预测作为辅助信号来压缩边缘端融合模型。结果是一个运行时决策层,联合减少计算和通信,同时随传感器数量线性扩展。在双模态(RGB+深度/LiDAR)设置下使用SynDrone,FusionSense以比单模态滤波器高得多的数据缩减率维持任务质量,并带来显著的端到端增益:在1% FoI出现率下能耗降低高达33倍,10%下降低11倍,在固定30%数据缩减率下质量损失减少92.3%,并且比最佳先前滤波基线节能约1.5倍。

英文摘要

Autonomous systems and smart-industry deployments increasingly split computation across near-sensor, edge, and cloud resources, where tight energy, latency, and reliability budgets demand run-time adaptivity. In practice, deciding what to compute and transmit at each point is pivotal; yet as multimodal sensor suites (cameras, LiDAR/depth, etc.) proliferate at the edge, most prior approaches either (i) fuse modalities on powerful servers or (ii) apply uni-modal near-sensor filters that ignore cross-modal dependencies, leading to redundant transmissions or missed events. We present FusionSense, a fusion-aware intelligent sensing framework for energy-constrained autonomous edge systems. Lightweight near-sensor classifiers are trained via a three-step procedure: (i) a server-side fusion model learns the downstream task, (ii) filter-out-safe (FoS) labels quantify each modality's necessity relative to the fused decision, and (iii) an edge-side fusion model is compacted by injecting near-sensor predictions as auxiliary signals. The result is a run-time decision layer that jointly reduces compute and communication while scaling linearly with sensor count. On a dual-modality (RGB+Depth/LiDAR) setup with SynDrone, FusionSense sustains task quality at substantially higher data-reduction rates than uni-modal filters and delivers large end-to-end gains: up to 33x lower energy at 1% FoI prevalence, 11x at 10%, a 92.3% reduction in quality loss at a fixed 30% data reduction, and roughly 1.5x higher energy savings than the best prior filtering baseline.

2605.22866 2026-05-25 cs.AI cs.LG 版本更新

BOHM: Zero-Cost Hierarchical Attribution for Compound AI Systems

BOHM:复合AI系统的零成本层次归因

Joss Armstrong

发表机构 * Ericsson Research(爱立信研究)

AI总结 本文提出了一种名为BOHM的零成本分层归因方法,用于复合AI系统中组件的贡献度分析。该方法直接从系统已有的路由权重中提取分层归因树,无需访问组件内部信息,能够在不同粒度上同时提供多分辨率归因,克服了传统基于Shapley值的方法在第三方API和不透明系统中的评估限制。实验表明,BOHM在多个实际场景中表现出优异的归因性能,且与Shapley方法在路由策略接近最优时结果趋于一致。

Comments 35 pages, 10 figures, 20 tables

详情
AI中文摘要

复合AI系统通过专门组件的层次结构路由任务。归因主要由基于Shapley的方法(SHAP)主导,该方法将联盟价值函数分解为每个组件的边际贡献,并需要在任意组件子集上评估系统。这一要求对于第三方API、不透明端点以及将路由集中在少数工具上的代理编排器而言无法满足,因为从部署的编排器中大多数联盟无法评估。我们引入BOHM,它直接从系统已维护的路由权重中提取层次归因树:叶子归因是根到叶子路由权重的路径乘积;第k层归因是深度k节点上的诱导分布。该方法具有零边际成本,无需访问组件内部,并同时提供每个级别的多分辨率归因,而扁平方法在任何评估预算下都无法提供。BOHM和SHAP回答不同的问题,当部署的路由器接近最优路由时两者收敛。在包含880个LiveCodeBench问题的3级层次结构中的18个LLM上,BOHM的Kendall tau=0.928;SHAP在每次种子进行9000倍更多联盟评估时达到tau=0.980。在一项包含5个驱动器和7个基准的代理研究(35个单元格,完全覆盖)中,驱动器将路由集中在单个工具上(顶部份额中位数0.65),单元格级别的tau(BOHM, SHAP)由驱动器的首选是否为经验上最佳工具预测(平均+0.22 vs ~+0.01)。在美国人口普查层次结构(475个叶子,4级)上,BOHM在每个级别恢复真实排名(tau高达0.722)。BOHM满足效率、单调性、对称性和弱抑制,但不满足Shapley的可加性。它最好被理解为一种互补原语:一种在存在路由状态的任何地方可计算的多分辨率分解,其与Shapley的分歧本身具有诊断意义。

英文摘要

Compound AI systems route tasks through hierarchies of specialised components. Attribution is dominated by Shapley-based methods (SHAP), which decompose a coalition value function into per-component marginal contributions and require evaluation of the system on arbitrary component subsets. That requirement fails for third-party APIs, opaque endpoints, and agentic orchestrators that concentrate routing on a few tools, leaving most coalitions un-evaluable from the deployed orchestrator. We introduce BOHM, which extracts a hierarchical attribution tree directly from the routing weights such systems already maintain: leaf attribution is the path product of root-to-leaf routing weights; level-k attribution is the induced distribution over depth-k nodes. The method has zero marginal cost, requires no access to component internals, and provides multi-resolution attribution at every level simultaneously, which flat methods cannot offer at any evaluation budget. BOHM and SHAP answer different questions and converge when the deployed router routes near-optimally. On 18 LLMs in a 3-level hierarchy over 880 LiveCodeBench problems, BOHM yields Kendall tau=0.928; SHAP reaches tau=0.980 at 9,000x more coalition evaluations per seed. On a 5-driver, 7-benchmark agentic study (35 cells, complete coverage), drivers concentrate routing on a single tool (top-share median 0.65), and cell-level tau(BOHM,SHAP) is predicted by whether the driver's top pick is the empirically best tool (mean +0.22 vs ~+0.01). On a US Census hierarchy (475 leaves, 4 levels), BOHM recovers ground-truth rankings at every level (tau up to 0.722). BOHM satisfies efficiency, monotonicity, symmetry, and weak suppression but not Shapley's additivity. It is best understood as a complementary primitive: a multi-resolution decomposition computable wherever routing state exists, whose disagreement with Shapley is itself diagnostic.
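The two definitions in the abstract (leaf attribution as a path product, level-k attribution as the induced distribution over depth-k nodes) can be sketched on a toy hierarchy. The tree encoding, node names, and function names below are hypothetical, and the sketch assumes that the routing weights of each node's children sum to one and that all leaves sit at the same depth.

```python
def leaf_attribution(tree, mass=1.0, path=()):
    """Path product of routing weights from the root to each leaf.

    `tree` maps a child name to (routing_weight, subtree_or_None).
    """
    out = {}
    for name, (weight, child) in tree.items():
        if child is None:
            out[path + (name,)] = mass * weight
        else:
            out.update(leaf_attribution(child, mass * weight, path + (name,)))
    return out


def level_attribution(tree, k):
    """Routing mass reaching each node at depth k (root children are depth 1)."""
    frontier = {(): (1.0, tree)}
    for _ in range(k):
        nxt = {}
        for path, (mass, sub) in frontier.items():
            for name, (weight, child) in (sub or {}).items():
                nxt[path + (name,)] = (mass * weight, child)
        frontier = nxt
    return {path: mass for path, (mass, _) in frontier.items()}


# Hypothetical two-level router: domains first, then tools.
toy = {
    "code": (0.6, {"tool_a": (0.5, None), "tool_b": (0.5, None)}),
    "math": (0.4, {"tool_c": (1.0, None)}),
}
leaves = leaf_attribution(toy)
level1 = level_attribution(toy, 1)
```

Because only stored routing weights are multiplied, no coalition of components is ever evaluated, which is the sense in which the abstract calls the method zero-cost.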

2605.22864 2026-05-25 cs.LG 版本更新

Reading Calibrated Uncertainty from Language Model Trajectories

从语言模型轨迹中读取校准的不确定性

Aliai Eusebi, Alexander Herzog, Xiaoyu Liang, Marie Vasek, Enrico Mariconti, Lorenzo Cavallaro

发表机构 * University College London, London, United Kingdom(伦敦大学学院)

AI总结 该研究探讨了如何从语言模型生成过程中的内部轨迹中更准确地量化不确定性。不同于传统的最大softmax概率方法,作者提出了一种基于模型各层激活路径的几何特征提取方法,通过稀疏线性探针来捕捉不确定性信息。该方法在选择性拒绝任务中表现优于传统方法,且能揭示不同层在生成过程中如何逐步形成误差,为理解模型不确定性提供了更细粒度的分析视角。

详情
AI中文摘要

最大softmax概率(MSP)是评估结构化输出语言模型生成不确定性量化的默认方法。虽然计算成本低,但通常校准不佳。探针模型内部激活的方法将原始隐藏状态输入不透明分类器,将激活视为静态快照,并隐含了表示形成的逐层轨迹。然而,相似的端点可能源于非常不同的路径,证据如何在深度上累积、增强或反转可能揭示最终概率掩盖的不确定性。我们提取了十一个尺度不变的几何特征,追踪逐层MLP更新的累积路径,并将其输入稀疏线性探针。该探针在选择性弃权下优于MSP,增益随基线校准误差增加而增加,最高达21 AURC点。由于每个特征都有封闭形式的几何意义,探针的系数追踪了错误如何以及沿着深度何处形成——哪些层过早承诺,哪些层与运行状态矛盾,轨迹在何处偏离其端点。

英文摘要

The maximum softmax probability (MSP) represents a default approach when evaluating uncertainty quantification for language model generation with structured output. Although cheap, it is often miscalibrated. Methods that probe the model's internal activations feed raw hidden states into opaque classifiers, reading activations as static snapshots and leaving implicit the layer-wise trajectory by which a representation is formed. Yet, similar endpoints can arise from very different paths, and how evidence accumulates, reinforces, or reverses across depth might reveal uncertainty that final probabilities obscure. We extract eleven scale-invariant geometric features, tracing the cumulative path of per-layer MLP updates, and feed them to a sparse linear probe. The probe outperforms MSP under selective abstention, with gains scaling with baseline miscalibration up to 21 AURC points. Because every feature has a closed-form geometric meaning, the probe's coefficients trace how and where along depth errors take shape -- which layers commit prematurely, which contradict the running state, where trajectories drift away from their endpoint.
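For reference, the MSP baseline and the selective-abstention evaluation it is compared under can be sketched as below. This illustrates the baseline only, not the paper's geometric probe; the function names and the rounding rule for the coverage cutoff are assumptions.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def msp(logits):
    """Maximum softmax probability, used as a confidence score."""
    return max(softmax(logits))


def selective_risk(confidences, correct, coverage):
    """Error rate on the most confident `coverage` fraction of predictions;
    the remaining predictions are treated as abstentions."""
    n_keep = max(1, round(coverage * len(confidences)))
    ranked = sorted(zip(confidences, correct), key=lambda p: p[0], reverse=True)
    kept = ranked[:n_keep]
    return sum(1 for _, ok in kept if not ok) / n_keep


conf = [0.9, 0.8, 0.6, 0.55]
ok = [True, True, False, True]
risk_half = selective_risk(conf, ok, 0.5)
risk_full = selective_risk(conf, ok, 1.0)
```

Sweeping `coverage` from 0 to 1 and integrating the resulting risk values gives the area under the risk-coverage curve (AURC) that the abstract reports; lower is better.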

2605.22858 2026-05-25 eess.SP cs.LG 版本更新

Classification of IED-free EEG Responses for Assisted Epilepsy Diagnosis

用于辅助癫痫诊断的无IED脑电图反应分类

Giacomo Zanardini, Ryan Moesman, Paul van der Kleij, Robert van den Berg, Justin Dauwels

发表机构 * Signal Processing Systems(信号处理系统) Delft University of Technology(代尔夫特理工大学) Erasmus Medical Center(伊拉斯姆斯医学中心)

AI总结 本文研究了在常规脑电图(EEG)缺乏发作间期癫痫样放电(IED)的情况下,如何利用刺激诱发的脑电信号辅助癫痫诊断。作者提出了一种基于多领域特征(时域、频域、小波域和连接性)的机器学习分类方法,并采用堆叠集成策略融合不同特征集,以提高分类性能。实验结果表明,该方法在多个数据集上表现出良好的诊断能力,特别是在间歇性光刺激(IPS)诱发的脑电信号中,能够有效区分癫痫患者与非患者,为无IED情况下的癫痫辅助诊断提供了新思路。

Comments Accepted at IEEE EMBC2026

详情
AI中文摘要

当常规脑电图缺乏发作间期癫痫样放电(IED)时,诊断癫痫具有挑战性。间歇性光刺激(IPS)和过度换气(HV)可提高诊断率,但其解释具有主观性。我们提出一种可重复的流水线,使用跨越时域、频谱、小波和连接性域的机器学习特征,以及堆叠集成来组合互补特征集,对刺激过程中采集的脑电图记录进行分类。在TUH癫痫语料库和临床Erasmus MC(EMC)队列上使用留一受试者交叉验证(LOSO)评估性能,包括在TUH上的无IED分析。在TUH上,集成在无IED静息态脑电图上达到高达97.8% AUC / 93.1% BAC,在无IED IPS上达到94.1% AUC / 86.8% BAC。在EMC上,IPS提供最强的区分能力(79.4% AUC / 73.9% BAC),而HV性能受益于按反应性对受试者进行分层。这些结果表明,刺激诱发的活动,特别是IPS,包含对无IED癫痫分类有意义的判别信息,并且多域集成提高了鲁棒性。

英文摘要

Diagnosing epilepsy is challenging when routine EEGs lack interictal epileptiform discharges (IEDs). Intermittent photic stimulation (IPS) and hyperventilation (HV) can increase diagnostic yield, but their interpretation is subjective. We propose a reproducible pipeline that classifies EEG recordings acquired during stimulation procedures, using machine-learning features spanning temporal, spectral, wavelet, and connectivity domains, and a stacked ensemble to combine complementary feature sets. Performance is evaluated with leave-one-subject-out (LOSO) cross-validation on the TUH Epilepsy Corpus and a clinical Erasmus MC (EMC) cohort, including IED-free analyses on TUH. On TUH, ensembles achieve up to 97.8% AUC / 93.1% BAC on IED-free resting-state EEG and 94.1% AUC / 86.8% BAC on IED-free IPS. On EMC, IPS provides the strongest discrimination (79.4% AUC / 73.9% BAC), while HV performance benefits from stratifying subjects by responsiveness. These results indicate that stimulation-evoked activity, particularly IPS, contains meaningful discriminative information for IED-free epilepsy classification and that multi-domain ensembling improves robustness.
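Leave-one-subject-out evaluation, as named in the abstract, holds out all recordings of one subject per fold so that no subject appears in both training and test data. The sketch below shows the splitting logic and a balanced-accuracy score, assuming that "BAC" denotes balanced accuracy (the mean of per-class recall) and that labels are binary 0/1; function names and the toy data are illustrative only.

```python
def loso_splits(subject_ids):
    """Yield (held_out_subject, train_indices, test_indices), one fold per subject."""
    for subject in sorted(set(subject_ids)):
        test = [i for i, s in enumerate(subject_ids) if s == subject]
        train = [i for i, s in enumerate(subject_ids) if s != subject]
        yield subject, train, test


def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall for binary labels 0/1."""
    recalls = []
    for cls in (0, 1):
        idx = [i for i, y in enumerate(y_true) if y == cls]
        if idx:
            recalls.append(sum(1 for i in idx if y_pred[i] == cls) / len(idx))
    return sum(recalls) / len(recalls)


# Four recordings from three hypothetical subjects.
folds = list(loso_splits(["s1", "s1", "s2", "s3"]))
bac = balanced_accuracy([1, 1, 1, 0], [1, 1, 0, 0])
```

Splitting by subject rather than by recording matters here because several EEG segments from the same person are strongly correlated, so a per-recording split would leak subject identity into the test fold.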

2605.22857 2026-05-25 eess.SP cs.LG 版本更新

JointHRRP-Net: A Statistically Constrained Decoupling Network for Joint Target and Jamming Recognition in Composite Jamming

JointHRRP-Net: 一种用于复合干扰中目标与干扰联合识别的统计约束解耦网络

Yunfei Zhao, Mei Liu, Shuowei Liu, Xunzhang Gao, Yujie Zhou

发表机构 * College of Electronic Science and Technology, National University of Defense Technology(电子科学学院,国防科技大学)

AI总结 在复合干扰环境下,基于高分辨率距离像(HRRP)的雷达自动目标识别性能显著下降。为此,本文提出了一种统一的联合目标-干扰识别框架JointHRRP-Net,通过统计约束解耦模块从混合HRRP中分离出目标主导和干扰主导的潜在特征分支,并结合多尺度时序编码模块和双专家决策模块,分别实现单标签目标分类和多标签干扰分类。实验表明,该方法在不同信噪比和信干比条件下均优于现有方法,且对未知目标具有良好的判别能力。

Comments Submitted to IEEE Transactions on Geoscience and Remote Sensing (TGRS). 15 pages, 12 figures

详情
AI中文摘要

基于高分辨率距离像(HRRP)的雷达自动目标识别在复合干扰环境中性能严重下降。有源干扰在接收到的距离像中引入压制和欺骗相关分量。脉冲压缩后,这些分量与目标回波在HRRP域中耦合,使得目标相关散射峰难以区分,削弱了特征可分离性。针对这一问题,本文提出JointHRRP-Net,一种用于目标-干扰联合识别的统一框架。首先开发了一个统计约束解耦模块,从混合HRRP表示中生成目标主导和干扰主导的潜在分支。施加相关性引导的统计约束以抑制冗余的跨分支信息并减轻目标-干扰特征纠缠。然后设计了一个多尺度时序编码模块来建模局部散射结构和长距离单元依赖关系,随后是一个双专家决策模块,用于单标签目标分类和多标签干扰分类。在不同信干比(SJR)和信噪比(SNR)水平下的实验表明,JointHRRP-Net在目标识别和复合干扰识别方面均优于代表性基线方法。开放集评估进一步表明,学习到的目标表示对于未知目标拒绝仍具有判别性。这些结果证明了JointHRRP-Net在复合干扰场景中的有效性和鲁棒性。

英文摘要

High-resolution range profile (HRRP)-based radar automatic target recognition suffers from severe performance degradation in composite jamming environments. Active jamming introduces suppression- and deception-related components into the received range profile. After pulse compression, these components are coupled with target echoes in the HRRP domain, making target-related scattering peaks difficult to distinguish and weakening feature separability. To address this problem, this paper proposes JointHRRP-Net, a unified framework for joint target-jamming recognition. A statistically constrained decoupling module is first developed to generate target-dominant and jamming-dominant latent branches from the mixed HRRP representation. Correlation-guided statistical constraints are imposed to suppress redundant cross-branch information and alleviate target-jamming feature entanglement. A multi-scale temporal encoding module is then designed to model local scattering structures and long-range range-cell dependencies, followed by a dual-expert decision module for single-label target classification and multi-label jamming classification. Experiments under diverse signal-to-jamming ratio (SJR) and signal-to-noise ratio (SNR) levels demonstrate that JointHRRP-Net outperforms representative baseline methods in both target recognition and composite jamming recognition. Open-set evaluation further shows that the learned target representation remains discriminative for unknown-target rejection. These results demonstrate the effectiveness and robustness of JointHRRP-Net in composite jamming scenarios.

2605.22855 2026-05-25 cs.GT cs.AI cs.CL cs.LG 版本更新

PrefBench: Evaluating Zero-Shot LLM Agents in Hidden-Preference Personalized Pricing Negotiations

PrefBench:评估隐藏偏好个性化定价谈判中的零样本LLM智能体

Yingjie Lei

发表机构 * University of Aberdeen(阿伯丁大学)

AI总结 本文提出了PrefBench,一个用于评估零样本大语言模型(LLM)代理在隐藏偏好个性化定价谈判中表现的基准测试平台。该平台通过模拟买家与固定车辆定制套餐的互动,要求卖家在仅能获取公开信息的情况下进行谈判,而买家的估值、耐心、还价行为等关键参数是隐藏的。实验表明,尽管LLM代理能够遵循协议并达成高比例的交易,但其利润表现较差,远不如简单的让步策略,突显了当前LLM在利润敏感型谈判中的不足。PrefBench为研究隐藏买家偏好下的定价代理行为提供了可控的评估环境。

Comments 24 pages, 3 figures, 5 tables. Code is available at https://github.com/ChaosTheProducer/PrefBench

详情
AI中文摘要

个性化定价谈判是LLM智能体的一个具有挑战性的测试平台,因为成功的互动并不能保证盈利的决策。当买方的支付意愿和谈判特征仍然隐藏时,卖方可能产生有效的行动并达成许多交易,但定价仍然很差。本文提出了PrefBench,一个基于模拟器的隐藏偏好个性化定价谈判基准。每个回合将一个模拟买家与一个固定的车辆定制捆绑包配对;卖方观察公开的人物描述符、捆绑包信息和谈判历史,而潜在的买方变量控制估值、耐心、还价行为和退出决策。PrefBench通过一个面向LLM的状态摘要协议来评估这一设置,该协议限制智能体在固定的隐藏信息边界下返回严格的JSON动作。我们在7500个回合中评估了零样本LLM卖家与启发式参考。测试的LLM可靠地遵循协议,实现了高于0.99的交易率,但它们的卖家利润结果仍然较弱:最佳LLM平均利润仅略高于随机基线,远低于同一回合流下的简单让步启发式。这些结果表明,结构化行动合规性和寻求协议的行为可以与弱利润敏感谈判共存。PrefBench为评估隐藏买方偏好下的定价智能体行为提供了一个受控基准。

英文摘要

Personalized pricing negotiations are a challenging testbed for LLM agents because successful interaction does not guarantee profitable decision making. A seller may produce valid actions and close many deals while still pricing poorly when buyer willingness to pay and bargaining traits remain hidden. This paper presents PrefBench, a simulator-based benchmark for hidden-preference personalized pricing negotiations. Each episode pairs a simulated buyer with a fixed vehicle-customization bundle; the seller observes public persona descriptors, bundle information, and negotiation history, while latent buyer variables govern valuation, patience, counter-offer behavior, and walkaway decisions. PrefBench evaluates this setting through an LLM-facing state-summary protocol that constrains agents to return strict JSON actions under a fixed hidden-information boundary. We evaluate zero-shot LLM sellers against heuristic references over 7,500 episodes. The tested LLMs follow the protocol reliably and achieve deal rates above 0.99, but their seller-profit outcomes remain weak: the best LLM average profit is only slightly above the random baseline and far below a simple concession heuristic under the same episode stream. These results show that structured action compliance and agreement-seeking behavior can coexist with weak profit-sensitive bargaining. PrefBench provides a controlled benchmark for evaluating pricing-agent behavior under hidden buyer preferences.

2605.22853 2026-05-25 eess.SP cs.LG q-bio.QM 版本更新

Topological Signal Processing: An Application-Oriented Tutorial

拓扑信号处理:面向应用的教程

Flavia Petruso, Maria Giulia Preti, Dimitri Van De Ville

发表机构 * Neuro-X Institute, École Polytechnique Fédérale de Lausanne (EPFL), Geneva, Switzerland(神经-X研究所,瑞士洛桑联邦理工学院(EPFL),日内瓦) Department of Radiology and Medical Informatics, University of Geneva, Geneva, Switzerland(放射学与医学信息学系,日内瓦大学,日内瓦,瑞士)

AI总结 本文介绍了拓扑信号处理(TSP)的基础概念及其在实际应用中的方法,旨在帮助研究者更好地理解和应用这一新兴领域。TSP 扩展了传统图信号处理(GSP),能够处理定义在节点、边、三角形等高阶网络结构上的信号,通过组合霍奇拉普拉斯算子等工具,实现了对复杂系统中高阶相互作用的分析。文章结合脑成像等实际案例,展示了 TSP 在揭示非平凡区域交互关系中的潜力,推动其在理论与应用研究中的广泛应用。

详情
AI中文摘要

许多现代数据集规模庞大且具有复杂的结构关系。传统上,基于图的方法用于表示网络数据,将个体元素建模为节点,将成对交互建模为边。此外,图信号处理(GSP)已被开发用于分析图节点上的信号,例如全国不同地区的温度测量值(节点信号)表示为图。拓扑信号处理(TSP)是一个新兴领域,它推广了GSP,使得不仅可以分析节点上的信号,还可以分析边、三角形以及更高维网络元素上的信号,这些元素被建模为单纯复形及相关拓扑结构。这使得TSP通过将滤波和傅里叶变换等经典信号处理概念扩展到拓扑层面,自然适用于研究复杂系统中的高阶交互。尽管TSP具有多功能性,但对许多实践者来说仍然具有挑战性。因此,我们提供了一个易于理解的TSP基础概述,同时与面向应用的场景建立联系。我们重点介绍基于组合Hodge Laplacian的处理技术,该技术将图Laplacian推广到单纯复形。特别地,我们回顾了关键的TSP概念,将其与现实世界的例子联系起来,并讨论了如何从数据集中导出高阶结构和信号。例如,我们引入了一种捕捉节点信号之间滞后交互的边级信号,并在基于TSP的脑成像数据分析案例研究中展示了其应用,揭示了脑区域集合之间的非平凡交互。总体而言,我们旨在通过弥合方法发展与应用程序之间的差距,促进TSP的更广泛采用,推动其在理论和应用研究人员社区中的使用。

英文摘要

Many modern datasets are large and carry complex structural relationships. Graph-based methods have traditionally been used to represent networked data, modeling individual elements as nodes and pairwise interactions as edges. Furthermore, Graph Signal Processing (GSP) has been developed to analyze signals on graph nodes, such as temperature measurements (node signals) across different regions of a country represented as a graph. Topological Signal Processing (TSP) is an emerging field that generalizes GSP, enabling the analysis of signals defined not only on nodes but also on edges, triangles, and higher-dimensional network elements, modeled as simplicial complexes and related topological structures. This makes TSP naturally well-suited for studying higher-order interactions in complex systems by extending classical signal processing concepts, such as filtering and Fourier transforms, to the topological level. Despite its versatility, TSP remains challenging for many practitioners. Therefore, we present an accessible overview of TSP foundations while drawing connections with application-oriented settings. We focus on processing techniques based on the combinatorial Hodge Laplacian, which generalizes the graph Laplacian to simplicial complexes. In particular, we review key TSP concepts, relate them to real-world examples, and discuss how higher-order structures and signals can be derived from datasets. For instance, we introduce an edge-level signal capturing lagged interactions between nodal signals, and demonstrate its use in a case study on TSP-based analysis of brain imaging data, revealing nontrivial interactions between sets of brain regions. Overall, we aim to promote a broader adoption of TSP by bridging methodological developments with applications, fostering its use among a wide community of theoretical and applied researchers.
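As a small worked instance of the combinatorial Hodge Laplacian mentioned above, the sketch below builds the edge-level Laplacian L1 = B1^T B1 + B2 B2^T for a single filled triangle, using the standard definition with B1 the node-to-edge incidence matrix and B2 the edge-to-triangle incidence matrix. The orientation convention (edges and the triangle ordered by increasing node index) and the helper names are choices made for this example.

```python
def transpose(m):
    return [list(col) for col in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def hodge_laplacian_1(b1, b2):
    """Edge Laplacian: lower part B1^T B1 plus upper part B2 B2^T."""
    return add(matmul(transpose(b1), b1), matmul(b2, transpose(b2)))


# Nodes 0, 1, 2; edges (0,1), (0,2), (1,2); one filled triangle (0,1,2).
# B1 rows are nodes, columns are edges: -1 at the tail, +1 at the head.
B1 = [[-1, -1, 0],
      [1, 0, -1],
      [0, 1, 1]]
# B2 rows are edges, one column for the triangle:
# boundary of (0,1,2) is (1,2) - (0,2) + (0,1).
B2 = [[1], [-1], [1]]

boundary_check = matmul(B1, B2)   # the boundary of a boundary is zero
L1 = hodge_laplacian_1(B1, B2)
```

For this filled triangle L1 has no zero eigenvalue, reflecting that the complex has no one-dimensional hole; dropping B2 (an unfilled triangle) leaves a zero eigenvalue whose eigenvector is the circulating edge flow around the cycle.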

2605.22852 2026-05-25 cs.DB cs.AI cs.LG cs.LO 版本更新

Expressive Power of Deep Homomorphism Networks over Relational Databases

关系数据库上深度同态网络的表达能力

Moritz Schönherr, Balder ten Cate, Maurice Funk, Benny Kimelfeld, Carsten Lutz, Arie Soeteman

发表机构 * University of Amsterdam(阿姆斯特丹大学) Leipzig University(莱比锡大学) Technion(以色列理工学院) RelationalAI

AI总结 本文研究了深度同态网络(DHNs)在关系数据库上的表达能力,探讨其与一阶逻辑及其扩展之间的联系。通过将DHNs与包含否定、计数和比例量化等扩展的逻辑片段进行对比,揭示了其在不同聚合方式下的表达能力边界。研究还表明,DHNs与SQL之间存在经典对应关系,并进一步分析了其在静态分析问题中的可判定性。实验验证了不同表达能力的DHNs在预测任务中的性能差异。

详情
AI中文摘要

消息传递图神经网络(GNN)的表达能力限制促使了更强大的图学习架构的发展。我们主张深度同态网络(DHN)作为一种特别适合在关系数据库上学习的模型,因为它与SQL的重要片段(如合取查询)有密切联系。我们通过将DHN与一阶逻辑(FO)的各种自然片段和扩展相关联,研究了DHN的精确表达能力。对于具有max、sum和mean聚合的DHN,我们建立了与一元否定片段(UNFO)以及带有计数量词和比例量词的UNFO扩展的联系。我们进一步将sum聚合DHN与FO的一元量词交替片段以及带有表达性计数的FO扩展相关联。通过FO与SQL之间的经典对应关系,这些结果也阐明了DHN与SQL之间的关系。它们还使我们能够研究DHN的两个基本静态分析问题——空问题和包含问题——的可判定性。最后,我们通过实验证实,表达能力的差异在合适的预测任务性能上得到了体现。

英文摘要

The expressive limitations of message-passing Graph Neural Networks (GNNs) have motivated a wide range of more powerful graph learning architectures. We advocate Deep Homomorphism Networks (DHNs) as a model particularly well-suited for learning over relational databases, due to their close connection to important fragments of SQL such as conjunctive queries. We study the precise expressive power of DHNs by relating them to various natural fragments and extensions of first-order logic (FO). For DHNs with max, sum, and mean aggregations, we establish connections to the unary negation fragment (UNFO) and to the extensions of UNFO with counting quantifiers and with ratio quantifiers. We further relate sum-aggregation DHNs to the unary quantifier alternation fragment of FO and to an extension of FO with expressive counting. Through the classical correspondence between FO and SQL, these results also illuminate the relation between DHNs and SQL. They also enable us to study the decidability of two fundamental static analysis problems for DHNs, the emptiness problem and the subsumption problem. Finally, we confirm through experiments that the established differences in expressive power are reflected in the performance on suitable prediction tasks.

2605.22851 2026-05-25 eess.SP cs.LG eess.IV 版本更新

VAMP-Diff: VampPrior Latent Diffusion for Photoplethysmography Modeling

VAMP-Diff: 用于光电容积描记法建模的VampPrior潜扩散模型

Fatemeh Ghasemi Balouei, Nathan Willemsen, Mahesh Banavar, Bahman Moraffah

发表机构 * Department of Electrical and Computer Engineering(电气与计算机工程系) Clarkson University(克拉克森大学) Department of Computer Science(计算机科学系) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 本文提出了一种名为 VAMP-Diff 的变分潜扩散模型,用于生成和重建光电容积图(PPG)信号。该方法结合了时间编码器、条件扩散解码器和 VampPrior 正则化,能够在潜空间中更准确地保留心率和呼吸率等生理特征,并生成形态更真实的 PPG 波形。实验表明,与基于高斯先验的模型相比,VAMP-Diff 在重建精度和生理信息保持方面表现出更优的性能。

Comments Submitted to the 2026 Asilomar Conference on Signals, Systems, and Computers. 12 pages, 6 figures

详情
AI中文摘要

光电容积描记法(PPG)已成为一种普遍存在的生理信号;然而,当前的生成模型仍然难以保留真实的波形形态并学习捕捉心脏和呼吸生理的潜在结构。使用对抗损失训练的PPG生成器可以产生合理的波形,但无法提供从真实信号到潜在表示的推理路径。另一方面,变分自编码器将PPG数据映射到潜在编码,尽管它们的解码器常常模糊收缩上升波并削弱幅度和频谱细节。扩散模型提高了波形保真度,但通常缺乏用于重建和生理分析的推理路径。我们提出了VampPrior潜扩散(VAMP-Diff),一种联合训练的变分扩散模型,结合了时间PPG编码器、条件一维扩散解码器以及紧凑池化潜在上的VampPrior正则化。该模型在扩散重建期间使用完整的时间潜在,使解码器能够访问心跳时序和形态,同时从学习的VampPrior组件而非固定高斯先验生成样本。我们在CapnoBase数据集上证明,VAMP-Diff生成逼真的PPG信号,重建比高斯先验基线更清晰的生理波形,保留心率信息,维持呼吸率一致性,并通过重建误差对波形损坏敏感。

英文摘要

Photoplethysmography (PPG) has become a ubiquitous physiological signal; however, current generative models still struggle to preserve realistic waveform morphology and learn a latent structure that captures cardiac and respiratory physiology. PPG generators trained with adversarial losses can produce plausible waveforms, but provide no inference path from a real signal to a latent representation. Variational autoencoders, on the other hand, map the PPG data to latent codes, although their decoders often blur systolic upstrokes and dampen amplitude and spectral details. Diffusion models improve waveform fidelity, but typically lack an inference path for reconstruction and physiological analysis. We propose VampPrior Latent Diffusion (VAMP-Diff), a jointly trained variational diffusion model that combines a temporal PPG encoder, a conditional one-dimensional diffusion decoder, and VampPrior regularization on a compact pooled latent. The model uses the full temporal latent during diffusion reconstruction, giving the decoder access to beat timing and morphology, while generating samples from learned VampPrior components instead of a fixed Gaussian prior. We demonstrate on the CapnoBase dataset that VAMP-Diff produces realistic PPG signals, reconstructs sharper physiological waveforms than Gaussian-prior baselines, preserves heart-rate information, maintains respiratory-rate consistency, and is sensitive to waveform corruptions through reconstruction error.

2605.22848 2026-05-25 cs.CE cs.LG q-bio.OT 版本更新

From Simulation to Discovery: AI Enabled Probabilistic Emulation of Mechanistic Crop Systems

从模拟到发现:AI驱动的机理作物系统概率仿真

Mojdeh Saadati, Juan Panelo, Gustavo Visentini, Soumik Sarkar, Carlos Messina, Baskar Ganapathysubramanian

发表机构 * Department of Mathematics and Department of Computer Science, Iowa State University(数学系和计算机科学系,爱荷华州立大学) Department of Horticultural Sciences, University of Florida(园艺科学系,佛罗里达大学) Department of Mechanical Engineering, and Translational AI Center, Iowa State University(机械工程系和转化人工智能中心,爱荷华州立大学)

AI总结 该研究提出了一种基于人工智能的概率神经模拟器,用于高效模拟作物生长过程,解决了传统作物模型计算成本过高的问题。通过训练大量多样化条件下的模拟数据,并结合物理一致的天气生成器,该方法在保持高预测精度的同时大幅提升了模拟效率,能够快速探索不同基因型、环境和管理条件下的作物响应。研究发现了一些在多种条件下保持高产量的玉米性状组合,并揭示了辐射利用效率和温度驱动的根系动态是影响产量韧性的关键因素,展示了该方法在农业适应气候变化研究中的巨大潜力。

详情
AI中文摘要

全球粮食安全依赖于预测作物对气候变异的响应,但基于过程的作物模型对于大规模探索基因型和环境相互作用而言计算成本过高。本文开发了APSIM的概率神经仿真器,该仿真器在13个输出上以高保真度(R²=0.93)再现了关键玉米生长过程,同时将模拟时间降低了数个数量级。该框架在涵盖多样化遗传、土壤和管理条件的200万次模拟上训练,并辅以卷积合成天气生成器以产生物理一致的气候序列,从而能够在现实且多样化的环境输入下进行可扩展的作物响应探索,同时提供校准的预测不确定性,无需昂贵的贝叶斯推断。将该框架应用于10万个性状配置、爱荷华州和伊利诺伊州的六种土壤环境以及两种排放情景下直至2100年的气候预测,我们识别出181种在所有测试条件下均能持续保持高产的玉米性状组合——这一分析仅靠机理模型是无法实现的。我们进一步表明,辐射利用效率和温度驱动的根系动态是产量韧性的主要驱动因素。值得注意的是,预测的产量分布在不同地点间差异显著,一些低生产力地点在未来气候情景下产量增加,表明气候变化可能以非直观的方式重塑区域产量潜力。这些结果证明了不确定性感知仿真如何将机理作物模拟从计算瓶颈转变为按需发现引擎,其能够以任何基于过程的模型无法比拟的规模探索完整的基因型、环境和管理空间。

英文摘要

Global food security depends on predicting crop responses to climate variability, yet process-based crop models remain too computationally expensive for large-scale exploration of genotype and environment interactions. Here we develop a probabilistic neural emulator of APSIM that reproduces key maize growth processes across 13 outputs with high fidelity (R^2 of 0.93) while reducing simulation time by several orders of magnitude. Trained on two million simulations spanning diverse genetic, soil, and management conditions, and augmented with a convolutional synthetic weather generator that produces physically consistent climate sequences, the framework enables scalable exploration of crop responses under realistic and diverse environmental inputs while providing calibrated predictive uncertainty without costly Bayesian inference. Applying this framework across 100,000 trait configurations, six soil environments in Iowa and Illinois, and climate projections through the year 2100 under two emissions scenarios, we identify 181 maize trait combinations that consistently maintain high yield across all tested conditions, an analysis infeasible with the mechanistic model alone. We further show that radiation use efficiency and temperature-driven root dynamics are dominant drivers of yield resilience. Notably, projected yield distributions vary substantially across locations, with some lower-productivity sites exhibiting yield increases under future climate scenarios, indicating that climate change may reshape regional yield potential in nonintuitive ways. These results demonstrate how uncertainty-aware emulation transforms mechanistic crop simulation from a computational bottleneck into an on-demand discovery engine, one capable of interrogating the full genotype, environment, and management space at a scale no process-based model can match.

2605.22842 2026-05-25 cs.CR cs.AI cs.LG 版本更新

The Misattribution Gap: When Memory Poisoning Looks Like Model Failure in Agentic AI Systems

误归因鸿沟:当记忆投毒在智能体AI系统中看似模型失效时

Tanzim Ahad, Ismail Hossain, Md Jahangir Alam, Sai Puppala, Syed Bahauddin Alam, Sajedul Talukder

发表机构 * Department of Computer Science, University of Texas at El Paso(德克萨斯大学埃尔帕索分校计算机科学系) School of Computing, Southern Illinois University Carbondale(南伊利诺伊大学卡本代尔分校计算机学院) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 该论文揭示了多智能体AI系统中的一种结构性缺陷,即“误归因鸿沟”:记忆层攻击引发的行为与模型失效难以区分,导致防御者误判问题根源并采取错误的补救措施。研究提出“语义规范漂移”(SND)作为智能体不当行为的第三条路径,区别于涌现式未对齐和共谋,其通过信任清洗链使被注入的文档以受信任的系统上下文身份重新出现。论文提出反事实组合测试,以87.5%的准确率和零误报识别致因条目,并提出记忆持久信息流控制,在跨会话边界阻止了97%的攻击。

Comments This paper is presently under review at a top-tier security venue

详情
AI中文摘要

多智能体AI流水线通常假设智能体的不当行为源于模型未对齐。我们指出该假设存在一个结构性缺陷,即“误归因鸿沟”:记忆层攻击产生的行为与模型失效无法区分,导致防御者采取错误的补救措施。我们将“语义规范漂移”(SND)形式化为智能体不当行为的第三条路径,区别于涌现式未对齐和共谋。在SND中,一份采用策略格式的文档通过正常上传进入共享向量存储,并在经由信任清洗链丢失来源信息后,作为受信任的系统上下文重新出现。在64个记录在案的失败案例中,归因系统一致地将责任归于模型。四个安全分类器(包括一个在记忆投毒数据上训练的分类器)在510个检查点上的检出数为零。在65个有效案例中的59个中,智能体在服从前明确引用被注入的文档作为规范性权威。该攻击不需要触发器、模型访问权限或重复交互,在五个会话内达到完全效果,并无限期持续。我们提出反事实组合测试,它以87.5%的准确率和零误报识别出致因条目,而取证基线在全部25个场景中均失败。我们进一步证明了检索-覆盖困境,表明更强的规避必然削弱攻击,从而限制了自适应绕过策略。最后,我们提出记忆持久信息流控制,它在先前防御失效的跨会话边界处阻止了97%的攻击。我们发布了SND语料库,这是首个具有时间持久性和多智能体组合、涵盖金融与医疗保健领域的对抗性记忆基准。

英文摘要

Multi-agent AI pipelines typically assume that agent misconduct originates from model misalignment. We identify a structural failure in this assumption, the Misattribution Gap, where memory-layer attacks produce behaviors indistinguishable from model failure, causing defenders to apply the wrong remediation. We formalize Semantic Norm Drift (SND) as a third path to agent misconduct, distinct from emergent misalignment and collusion. In SND, a policy-formatted document enters a shared vector store through normal uploads and later reappears as trusted system context after provenance is lost through a Trust Laundering Chain. Across 64 documented failures, attribution systems consistently blamed the model. Four safety classifiers, including one trained on memory poisoning, produced zero detections across 510 checkpoints. In 59 of 65 valid cases, agents explicitly cited the injected document as normative authority before complying. The attack requires no trigger, model access, or repeated interaction, achieves full effect within five sessions, and persists indefinitely. We introduce Counterfactual Composition Testing, which identifies the causal entry with 87.5% accuracy and zero false positives, while a forensics baseline fails across all 25 scenarios. We further prove the Retrieval-Coverage Dilemma, showing that stronger evasion inherently weakens the attack, limiting adaptive bypass strategies. Finally, we propose Memory-Persistent Information-Flow Control, which blocks 97% of attacks at the cross-session boundary where prior defenses fail. We release the SND Corpus, the first adversarial memory benchmark with temporal persistence and multi-agent composition across financial and healthcare domains.

2605.22837 2026-05-25 physics.geo-ph cs.LG eess.SP 版本更新

Evaluating PhaseNet on Teleseismic Data with MsPASS

使用 MsPASS 评估 PhaseNet 在远震数据上的表现

Jinxin Ma, Yinzhi Wang, Gary L. Pavlis, Chenbo Yin

发表机构 * Texas Advanced Computing Center, The University of Texas at Austin(德克萨斯高级计算中心,德克萨斯大学奥斯汀分校) Department of Earth and Atmospheric Sciences, Indiana University, Bloomington, IN 47405(地球与大气科学系,印第安纳大学,布卢明顿,IN 47405)

AI总结 本文研究了机器学习震相拾取器PhaseNet在远震数据上的性能问题,并提出了一种基于MsPASS的可复现工作流,用于大规模地震数据的处理与PhaseNet的训练与推理。通过构建包含160万条远震P波波形的控制数据集,研究发现在区域信号上训练的PhaseNet模型在远震数据上表现较差,而在该数据集上从头训练的模型在P波拾取的召回率和0.1秒残差窗口内的拾取数量上均有显著提升。实验还表明,增大模型规模虽能提升精度和召回率,但会大幅降低推理吞吐量,在CPU上尤为明显。

详情
AI中文摘要

大量研究表明,机器学习拾取器 PhaseNet 在本地地震信号上能产生准确的 P 波和 S 波拾取,但其在远震信号上的性能会急剧下降。为解决这一局限,我们提出了一个可重现的 MsPASS 工作流,该工作流 (i) 支持大规模地震档案的可扩展数据准备和管理,(ii) 支持标准化的 PhaseNet 训练和推理。我们构建了一个包含 160 万条波形的控制数据集,这些波形与 USArray 阵列网络设施 (ANF) 分析人员做出的远震 P 波拾取相关联。控制数据集证实,在区域信号上训练的 PhaseNet 模型在这些数据上表现不佳。然后,我们在 ANF 控制数据集的训练集上从头训练 PhaseNet,并在不重叠的保留测试集上评估,将 P 波拾取召回率提高了 741.5%,并在 0.1 秒残差窗口内产生了 683.9% 更多的拾取。我们还评估了不同模型大小的 PhaseNet 在 CPU 和 GPU 上的表现。将模型大小增加约 120 倍,精度和召回率分别提高了 15.6% 和 23.2%。然而,缩放后的模型在 NVIDIA A100 GPU 上推理吞吐量降低了 87.2%,在 128 核高性能 CPU 节点上降低了 97.3%。这些结果表明,在 GPU 上缩放 PhaseNet 比在 CPU 上更实用,并且简单地扩大模型并不是实现大幅精度提升的有效方法。

英文摘要

Numerous studies have shown that the machine-learning picker PhaseNet produces accurate P and S picks on local earthquake signals, but its performance can degrade sharply on teleseismic signals. To address this limitation, we present a reproducible MsPASS workflow that (i) enables scalable data preparation and management for large seismic archives and (ii) supports standardized PhaseNet training and inference. We assembled a control dataset of 1.6 million waveforms linked to teleseismic P-wave picks made by analysts at the USArray Array Network Facility (ANF). The control dataset confirms that the PhaseNet model trained on regional signals performs poorly on these data. We then trained PhaseNet from scratch on the training split of the ANF control dataset and evaluated it on a non-overlapping held-out test split, increasing P-pick recall by 741.5% and yielding 683.9% more picks within a 0.1s residual window. We also evaluated PhaseNet across different model sizes on both CPUs and GPUs. Increasing the model size by about 120 times improved precision and recall by 15.6% and 23.2%, respectively. However, the scaled model reduced inference throughput by 87.2% on an NVIDIA A100 GPU and by 97.3% on a 128-core high-performance CPU node. These results indicate that scaling PhaseNet is more practical on GPUs than on CPUs, and that simply enlarging the model is not an efficient way to achieve large accuracy gains.
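The percentages in this abstract are easier to compare once converted to multipliers. The sketch below only restates the reported figures as ratios; the helper names are hypothetical and nothing here comes from the authors' code.

```python
def multiplier_from_percent_increase(pct: float) -> float:
    """An increase of pct% means the new value is (1 + pct/100) times the old one."""
    return 1.0 + pct / 100.0


def remaining_fraction_after_reduction(pct: float) -> float:
    """A reduction of pct% leaves (1 - pct/100) of the original value."""
    return 1.0 - pct / 100.0


# Figures quoted in the abstract above.
recall_multiplier = multiplier_from_percent_increase(741.5)   # recall after retraining vs. before
picks_multiplier = multiplier_from_percent_increase(683.9)    # picks within the 0.1 s residual window

gpu_fraction = remaining_fraction_after_reduction(87.2)       # throughput kept on the A100 GPU
cpu_fraction = remaining_fraction_after_reduction(97.3)       # throughput kept on the 128-core CPU node

# Slowdown factor is the reciprocal of the throughput that remains.
gpu_slowdown = 1.0 / gpu_fraction
cpu_slowdown = 1.0 / cpu_fraction

if __name__ == "__main__":
    print(f"recall multiplier: {recall_multiplier:.3f}x")
    print(f"GPU slowdown: {gpu_slowdown:.1f}x, CPU slowdown: {cpu_slowdown:.1f}x")
```

Read this way, the roughly 120-times larger model runs about 8 times slower on the GPU and about 37 times slower on the CPU node, which is the asymmetry behind the abstract's conclusion that scaling is more practical on GPUs.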

2605.22836 2026-05-25 physics.geo-ph cs.LG 版本更新

Real-Time Earthquake Magnitude Classification from Initial P-Waves: Models, Dataset, and Comparative Analysis for South Asia

基于初始P波的实时地震震级分类:南亚地区的模型、数据集与比较分析

Md Nasiat Hasan Fahim, Md. Abid Ullah Muhib, Rayhanul Amin Tanvir, Abdullah Al Noman

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Shahjalal University of Science and Technology(沙贾拉尔科技大学)

AI总结 本文研究了如何利用单一地震台站初始7秒P波的垂直分量,实时分类地震震级,以提升地震预警系统的效率。研究比较了六种机器学习方法,包括传统模型和先进深度学习架构,并构建了一个包含7,318个南亚地震事件的新数据集,涵盖五个里氏震级类别。实验表明,基于Transformer的深度学习模型在准确率和推理延迟方面均优于传统方法,尤其在处理震级边界不确定性时表现出色,为实时地震预警提供了可行方案。

Comments Accepted for publication in 2025 28th International Conference on Computer and Information Technology (ICCIT). © 2025 IEEE

详情
Journal ref
2025 28th International Conference on Computer and Information Technology (ICCIT), Cox's Bazar, Bangladesh, 2025
AI中文摘要

快速地震震级估计对于有效的早期预警系统至关重要,可以挽救生命并减少经济损失。在本文中,我们提出了一项全面的震级分类研究,仅使用来自单个台站的初始7秒P波窗口的垂直分量。我们比较了六种机器学习方法,范围从传统模型到最先进的深度学习架构。我们还整理了一个包含南亚7318个地震事件的新数据集。该数据集分为五个里氏震级类别:轻微(3.0-3.9)、轻度(4.0-4.9)、中等(5.0-5.9)、强烈(6.0-6.9)和严重(>=7.0)。我们的实验表明,深度学习模型显著优于传统方法。我们基于Transformer的架构实现了76.23%的标准准确率和81.56%的自适应准确率,推理延迟为4.8毫秒。自适应准确率指标是针对近类别边界震级估计中固有的不确定性而引入的。这些结果表明,Transformer中的注意力机制与自适应分类相结合,有效地捕捉了地震信号的时间动态。这种架构优势有助于对罕见的高震级事件进行有希望的泛化,尽管地震目录具有固有的数据稀缺性。自适应准确率提供了对模型性能更现实的评估,结果表明了实时部署的可行性。

英文摘要

Rapid earthquake magnitude estimation is crucial for effective early warning systems that can save lives and reduce economic damage. In this paper, we present a comprehensive study of magnitude classification using only the vertical component of the initial 7-second P-wave window from a single station. We compare six machine learning approaches that range from traditional models to state-of-the-art deep learning architectures. We also curated a novel dataset of 7,318 earthquake events in South Asia. The dataset was categorized into five Richter-scale classes: slight (3.0-3.9), light (4.0-4.9), moderate (5.0-5.9), strong (6.0-6.9) and severe (>= 7.0). Our experiments show that deep learning models substantially outperform traditional approaches. Our Transformer-based architecture achieved 76.23% standard accuracy and 81.56% adaptive accuracy with 4.8 ms inference latency. The adaptive-accuracy metric is introduced to account for the inherent uncertainty in magnitude estimation near class boundaries. These results indicate that the attention mechanisms in Transformers, combined with adaptive classification, effectively capture the temporal dynamics of seismic signals. This architectural advantage facilitates promising generalization to rare high-magnitude events, despite the data scarcity inherent in seismic catalogs. The adaptive accuracy provides a more realistic assessment of model performance, and the results suggest viability for real-time deployment.
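The five magnitude classes are fully specified in the abstract, but the adaptive-accuracy metric is not. The sketch below bins magnitudes into the stated classes and then illustrates one plausible boundary-tolerant accuracy. The tolerance rule (an adjacent-class prediction counts as correct when the true magnitude lies within `tolerance` of the shared boundary) and the default value 0.1 are assumptions made for illustration, not the authors' definition.

```python
# Lower bounds of the five classes listed in the abstract.
CLASS_LOWER_BOUNDS = [3.0, 4.0, 5.0, 6.0, 7.0]
CLASS_NAMES = ["slight", "light", "moderate", "strong", "severe"]


def magnitude_to_class(magnitude: float) -> int:
    """Return the index of the class whose range contains the magnitude."""
    if magnitude < CLASS_LOWER_BOUNDS[0]:
        raise ValueError("magnitude below the smallest class in the dataset")
    index = 0
    for i, lower in enumerate(CLASS_LOWER_BOUNDS):
        if magnitude >= lower:
            index = i
    return index


def standard_accuracy(magnitudes, predicted_classes) -> float:
    """Fraction of events whose predicted class equals the true class."""
    hits = sum(magnitude_to_class(m) == p for m, p in zip(magnitudes, predicted_classes))
    return hits / len(magnitudes)


def adaptive_accuracy(magnitudes, predicted_classes, tolerance: float = 0.1) -> float:
    """Hypothetical boundary-tolerant accuracy (assumed rule, see lead-in)."""
    hits = 0
    for m, p in zip(magnitudes, predicted_classes):
        true_class = magnitude_to_class(m)
        if p == true_class:
            hits += 1
        elif abs(p - true_class) == 1:
            # The boundary shared by two adjacent classes is the lower
            # bound of the higher class.
            boundary = CLASS_LOWER_BOUNDS[max(p, true_class)]
            if abs(m - boundary) <= tolerance:
                hits += 1
    return hits / len(magnitudes)


if __name__ == "__main__":
    mags = [3.5, 4.95, 5.5, 6.05]
    preds = [0, 2, 1, 2]
    print(standard_accuracy(mags, preds), adaptive_accuracy(mags, preds))
```

Under this assumed rule the adaptive score can never be lower than the standard score, which matches the direction of the reported figures (81.56% versus 76.23%) without claiming to reproduce them.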

2605.22833 2026-05-25 cs.IR cs.AI cs.LG 版本更新

RAG4Outcome: A Retrieval-Augmented Multimodal Framework for Prognostic Prediction in Chronic Osteomyelitis

RAG4Outcome:用于慢性骨髓炎预后预测的检索增强多模态框架

Daqian Shi, Pei Han, Jishizhan Chen, Yang Wang, Xiaolei Diao, Xianyou Zheng, Pengfei Cheng

发表机构 * Queen Mary University of London(伦敦玛丽女王大学) Shanghai Sixth People’s Hospital Affiliated to SJTU School of Medicine(上海交通大学医学院附属第六人民医院) University College London(伦敦大学学院)

AI总结 慢性骨髓炎因其高复发风险和复杂的术后恢复过程,给预后预测带来了较大挑战。传统评估方法依赖人工评分系统,存在可扩展性差、效率低和一致性不足的问题。为此,本文提出RAG4Outcome,一种基于检索增强生成(RAG)的多模态框架,整合PET-CT影像报告、结构化手术和诊断记录以及非结构化的随访记录,结合领域特定检索语料和专家引导提示,实现了更可解释、有依据且临床可靠的预后预测,初步实验结果表明其在真实病例中具有良好的效果和临床契合度。

详情
AI中文摘要

慢性骨髓炎因其高复发风险和复杂的术后恢复轨迹而面临巨大的预后挑战。传统评估通常依赖于手动评分系统,这限制了临床实践中的可扩展性、效率和一致性。此外,临床数据的异质性对当前需要对齐输入和大量标注数据集的多模态学习方法构成了挑战。在这项工作中,我们提出了RAG4Outcome,一个用于慢性骨髓炎预后预测的检索增强生成(RAG)框架。我们的方法将多模态临床数据(包括PET-CT影像报告、结构化手术和诊断记录以及非结构化随访笔记)整合到一个统一的预测流程中。通过结合领域特定的检索语料库和专家引导的提示,该框架实现了更可解释、基于证据且临床可靠的预后。在真实世界病例上的初步结果显示了有希望的有效性和临床一致性,突显了RAG4Outcome在AI辅助感染管理和术后决策支持方面的潜力。

英文摘要

Chronic osteomyelitis presents substantial prognostic challenges due to its high recurrence risk and complex postoperative recovery trajectories. Traditional assessment often relies on manual scoring systems, which limit scalability, efficiency, and consistency in clinical practice. Furthermore, the heterogeneous nature of clinical data poses challenges for current multimodal learning approaches that require aligned inputs and large annotated datasets. In this work, we propose RAG4Outcome, a retrieval-augmented generation (RAG) framework for prognostic prediction in chronic osteomyelitis. Our method integrates multimodal clinical data, including PET-CT imaging reports, structured surgical and diagnostic records, and unstructured follow-up notes, into a unified prediction pipeline. By combining a domain-specific retrieval corpus with expert-guided prompting, the framework enables more interpretable, evidence-grounded, and clinically reliable prognosis. Preliminary results on real-world cases demonstrate promising effectiveness and clinical alignment, highlighting the potential of RAG4Outcome for AI-assisted infection management and postoperative decision support.

2605.17212 2026-05-25 cs.LG 版本更新

Anytime PAC-Bayes for Constrained Density-Ratio Networks under Covariate Shift

协变量偏移下约束密度比网络的任意时间PAC-Bayes

Paulo Akira F. Enabe

发表机构 * Escola Politécnica, University of São Paulo, Department of Structural and Geotechnical Engineering(圣保罗大学理工学院结构与岩土工程系)

AI总结 本文提出了一种统一的协变量偏移学习框架,通过约束密度比网络估计Radon-Nikodym导数,并结合PAC-Bayes方法提供任意时间的泛化保证。研究利用测度变换恒等式分解目标风险与重要性加权源风险之间的差距,并通过增广拉格朗日方法强制归一化和矩匹配约束,同时以二阶矩惩罚控制有效样本量。实验表明,该框架在真实数据上得到了校准的密度比估计,目标0/1损失低于未加权ERM和经典直接比率估计基线,验证了其在协变量偏移场景下的有效性。

详情
AI中文摘要

提出一个在协变量偏移下学习的统一框架,其中约束密度比网络逼近Radon-Nikodym导数 $r^\star = dP/dQ$ 并馈入任意时间PAC-Bayes泛化证书。一个测度变换恒等式将目标风险与重要性加权源风险之间的差距分解为由 $\|r_\theta - r^\star\|_{L^2(Q)}$ 控制的比率偏差项和由加权损失变异性控制的泛化差距项。通过增广拉格朗日方案将归一化和矩匹配恒等式作为硬积分约束强制执行,其中二阶矩惩罚控制有效样本量。PAC-Bayes在固定时间机制下实例化于加权风险,得到Bernoulli-KL界,将网络加权Gibbs后验识别为唯一的KL正则化最小化器,并量化学习比率在 $L^2(Q)$ 扰动下的稳定性,然后通过几何剥离增强为在 $t \geq t_{\min}$ 上一致的任意时间证书。一个预先注册的两轮实验协议将针对解析真值的分片测试(patch test)与真实数据部署相结合,验证了该框架:网络产生校准比率,相对于未加权ERM和经典直接比率估计基线降低了目标0/1损失,并达到了任意时间证书。记录了一次固定时间覆盖失败,各数据划分的覆盖率与标签偏移幅度一一对应,确认了仅协变量假设在操作上是紧的,而非证书的缺陷。

英文摘要

A unified framework for learning under covariate shift is presented, in which a constrained density-ratio network approximates the Radon-Nikodym derivative $r^\star = dP/dQ$ and feeds an anytime PAC-Bayes generalization certificate. A change-of-measure identity decomposes the gap between target risk and importance-weighted source risk into a ratio-bias term governed by $\|r_\theta - r^\star\|_{L^2(Q)}$ and a generalization-gap term governed by the variability of the weighted loss. Normalization and moment-matching identities are enforced as hard integral constraints through an augmented-Lagrangian scheme, with a second-moment penalty controlling the effective sample size. PAC-Bayes is instantiated on the weighted risk in a fixed-time regime that yields Bernoulli-KL bounds, identifies the network-weighted Gibbs posterior as the unique KL-regularized minimizer, and quantifies stability under $L^2(Q)$ perturbations of the learned ratio, and is then strengthened by geometric peeling to an anytime certificate uniform in $t \geq t_{\min}$. A pre-registered two-campaign protocol combining a patch test against analytic ground truth with a real-data deployment validates the framework: the network produces calibrated ratios, reduces target $0/1$ loss against unweighted ERM and classical direct ratio-estimation baselines, and attains the anytime certificate. A single fixed-time coverage failure is recorded, with per-split coverage aligning one-to-one with the magnitude of the label shift, confirming that the covariate-only assumption is operationally tight rather than a defect of the certificate.
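The link between the second moment of the weights and the effective sample size can be made concrete with the standard Kish formula, ESS = (Σw)² / Σw². The sketch below computes an importance-weighted empirical risk and the ESS for given weights. It is a generic illustration of these textbook quantities, assuming the weights are already available; it is not the paper's network, constraint scheme, or certificate, and all function names are hypothetical.

```python
def normalize_weights(weights):
    """Rescale weights so that their empirical mean is 1 (the normalization identity E_Q[r] = 1)."""
    mean = sum(weights) / len(weights)
    return [w / mean for w in weights]


def importance_weighted_risk(losses, weights):
    """Empirical estimate of the target risk E_P[loss] from source samples: mean of w_i * loss_i."""
    return sum(w * l for w, l in zip(weights, losses)) / len(losses)


def effective_sample_size(weights):
    """Kish effective sample size: (sum w)^2 / sum w^2.

    With mean-one weights this equals n divided by the empirical second
    moment, so penalizing the second moment keeps the ESS from collapsing.
    """
    total = sum(weights)
    return total * total / sum(w * w for w in weights)


if __name__ == "__main__":
    losses = [0.0, 1.0, 1.0, 0.0]          # 0/1 losses on four source samples
    weights = normalize_weights([2.0, 0.0, 1.0, 1.0])
    print(importance_weighted_risk(losses, weights))  # weighted risk estimate
    print(effective_sample_size(weights))             # below 4 because weights are uneven
```

Uniform weights give an ESS equal to the sample count, while a single dominant weight drives it toward 1, which is the degenerate regime a second-moment penalty is meant to avoid.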

2602.20102 2026-05-25 cs.LG cs.AI 版本更新

BarrierSteer: LLM Safety via Learning Barrier Steering

BarrierSteer: 通过学习障碍引导实现大语言模型安全

Thanh Q. Tran, Arun Verma, Kiwan Wong, Bryan Kian Hsiang Low, Daniela Rus, Wei Xiao

发表机构 * Department of Computer Science, National University of Singapore(新加坡国立大学计算机科学系) Singapore-MIT Alliance for Research and Technology Centre(新加坡-麻省理工联合研究中心) CSAIL, Massachusetts Institute of Technology(麻省理工学院计算机科学与人工智能实验室) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 尽管大语言模型(LLMs)在各种任务中表现出色,但其对对抗性攻击和不安全内容生成的易感性仍然是部署中的重大障碍,尤其是在高风险场景中。为此,本文提出了一种名为 BarrierSteer 的新型推理时框架,通过在模型的潜在表示空间中嵌入学习到的非线性安全约束,提升响应的安全性。该方法将隐藏状态的安全分类器视为控制屏障函数(CBFs),在生成过程中引导不安全的潜在轨迹满足安全约束,从而在不修改模型参数的前提下有效提升安全性,并在多个模型和数据集上验证了其优越性。

Comments This paper introduces SafeBarrier, a framework that enforces safety in large language models by steering their latent representations with control barrier functions during inference, reducing adversarial and unsafe outputs

详情
AI中文摘要

尽管大型语言模型(LLMs)在各种任务中表现出色,但它们易受对抗性攻击并可能生成不安全内容,这仍然是部署的重大障碍,尤其是在高风险场景中。应对这一挑战需要既在实践中有效又有理论依据的安全机制。在本文中,我们提出 BarrierSteer,一种新颖的推理时框架,通过将学习到的非线性安全约束直接嵌入模型的潜在表示空间来提高响应安全性。BarrierSteer 将隐藏状态安全分类器视为控制屏障函数(CBFs),从而在生成过程中对不安全的潜在轨迹进行约束引导。BarrierSteer 通过高效的约束合并来组合多个安全约束,且不修改底层 LLM 参数,因而保持了模型效用。我们给出的理论结果表明,在潜在空间中应用 CBFs 是一种有原则、模块化且计算高效的方法,可依据学习到的安全约束进行引导,其保证以学习到的屏障能够刻画预期的安全属性为前提。我们在多个模型系列和数据集上的大量实验结果表明,BarrierSteer 显著降低了对抗性攻击成功率和不安全生成,优于现有方法。代码可在我们的 GitHub 仓库中获取。

英文摘要

Despite the strong performance of large language models (LLMs) across diverse tasks, their susceptibility to adversarial attacks and unsafe content generation remains a significant obstacle to deployment, particularly in high-stakes settings. Addressing this challenge requires safety mechanisms that are both practically effective and theoretically grounded. In this paper, we introduce BarrierSteer, a novel inference-time framework that improves response safety by embedding learned nonlinear safety constraints directly into the model's latent representation space. BarrierSteer treats hidden-state safety classifiers as Control Barrier Functions (CBFs), enabling constraint-guided steering of unsafe latent trajectories during generation. By composing multiple safety constraints through efficient constraint merging without modifying the underlying LLM parameters, BarrierSteer preserves model utility. We provide theoretical results showing that applying CBFs in the latent space yields a principled, modular, and computationally efficient approach for steering with respect to learned safety constraints, with guarantees conditional on the learned barriers capturing the intended safety property. Our extensive experimental results across multiple model families and datasets demonstrate that BarrierSteer substantially reduces adversarial attack success rates and unsafe generations, outperforming the existing method. The code is available in our GitHub repository (https://github.com/thanhquangtran/BarrierSteer).
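To give a feel for what steering a latent state toward a safe set can mean, the sketch below uses the simplest possible stand-in: a single linear barrier h(z) = w·z + b, with the safe set defined as h(z) >= 0, and a minimum-norm projection of an unsafe state onto that half-space. This is a deliberately simplified assumption for illustration. The paper describes learned nonlinear constraints and constraint merging, neither of which is reproduced here, and none of the names below come from the BarrierSteer repository.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def barrier_value(z, w, b):
    """Linear barrier h(z) = w.z + b; the state is treated as safe when h(z) >= 0."""
    return dot(w, z) + b


def steer(z, w, b, margin=0.0):
    """Return the closest point to z (in Euclidean norm) with h(z) >= margin.

    Safe states are returned unchanged, so the correction only acts when the
    barrier is violated. Requires w to be non-zero.
    """
    h = barrier_value(z, w, b)
    if h >= margin:
        return list(z)
    # Move along w by exactly the amount needed to reach the margin.
    scale = (margin - h) / dot(w, w)
    return [zi + scale * wi for zi, wi in zip(z, w)]


if __name__ == "__main__":
    w, b = [0.0, 1.0], 0.0            # hypothetical barrier: second coordinate must be >= 0
    unsafe_state = [1.0, -2.0]
    steered = steer(unsafe_state, w, b)
    print(steered, barrier_value(steered, w, b))
```

The design point this toy case shares with barrier-based steering in general is minimal intervention: states that already satisfy the constraint pass through untouched, and violating states receive the smallest correction that restores it.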

2601.21306 2026-05-25 cs.LG cs.AI 版本更新

The Surprising Difficulty of Search in Model-Based Reinforcement Learning

基于模型的强化学习中搜索的惊人困难

Wei-Di Chang, Mikael Henaff, Brandon Amos, Gregory Dudek, Scott Fujimoto

发表机构 * Meta FAIR McGill University(麦吉尔大学)

AI总结 本文研究了基于模型的强化学习中的搜索问题。传统观点认为长期预测和误差累积是主要障碍,但作者发现搜索并不能简单替代学习到的策略,甚至在模型高度准确时也可能损害性能。研究指出,缓解高估偏差比提升模型或价值函数的准确性更为关键,而通过对一组价值函数取最小值的方法能有效解决这一偏差,从而实现高效的搜索,并在多个基准任务中取得领先性能。

Comments ICML 2026

详情
AI中文摘要

本文研究基于模型的强化学习中的搜索问题。传统观点认为,长期预测和复合误差是基于模型强化学习的主要障碍。我们挑战这一观点,表明搜索并不能简单地替代学习策略。令人惊讶的是,我们发现即使模型高度准确,搜索也可能损害性能。相反,我们表明缓解过估计偏差比提高模型或价值函数精度更重要。基于这一见解,我们确定取价值函数集成的最小值可以有效解决这一偏差并实现有效搜索,在多个流行基准领域取得了最先进的性能。

英文摘要

This paper investigates search in model-based reinforcement learning (RL). Conventional wisdom holds that long-term predictions and compounding errors are the primary obstacles for model-based RL. We challenge this view, showing that search is not a drop-in replacement for a learned policy. Surprisingly, we find that search can harm performance even when the model is highly accurate. Instead, we show that mitigating overestimation bias matters more than improving model or value function accuracy. Building on this insight, we identify that taking the minimum over an ensemble of value functions effectively addresses this bias and enables effective search, achieving state-of-the-art performance across multiple popular benchmark domains.
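The ensemble-minimum idea in this abstract can be shown on a toy action choice. In the sketch below, each candidate action has several value estimates (standing in for an ensemble of value functions), and the search picks the action with the best pessimistic score instead of the best average. The numbers and function names are made up for illustration, and this is not the authors' algorithm or code.

```python
def mean(values):
    """Arithmetic mean of a non-empty list."""
    return sum(values) / len(values)


def pick_action(candidates, aggregate):
    """Choose the action whose aggregated ensemble estimate is largest.

    candidates maps an action name to the list of value estimates produced
    by each ensemble member; aggregate reduces that list to one score.
    """
    return max(candidates, key=lambda action: aggregate(candidates[action]))


if __name__ == "__main__":
    # Hypothetical estimates: members agree on "a" but disagree sharply on "b",
    # where one member's optimistic outlier inflates the average.
    candidates = {
        "a": [1.0, 1.2, 0.9],
        "b": [3.0, 0.2, 0.5],
    }
    print(pick_action(candidates, mean))  # averaging is pulled toward the outlier
    print(pick_action(candidates, min))   # the minimum ignores the optimistic outlier
```

Because search maximizes over many candidates, it tends to select exactly the actions whose values are overestimated; scoring by the ensemble minimum makes such outliers much less likely to win.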

2311.01468 2026-05-25 cs.CL cs.LG 版本更新

Remember what you did so you know what to do next

记住你做了什么,以便知道下一步该做什么

Manuel R. Ciosici, Alex Hedges, Yash Kankanampati, Justin Martin, Marjorie Freedman, Ralph Weischedel

发表机构 * Information Sciences Institute, University of Southern California(信息科学研究所,南加州大学)

AI总结 本文研究了使用中等规模的大型语言模型(GPT-J,60亿参数)为模拟机器人在ScienceWorld平台中制定计划,以完成30类科学实验目标。实验表明,通过引入更多历史步骤信息,该模型的性能显著优于基于强化学习的方法,最高可达3.5倍。研究还指出任务类别间的性能差异较大,平均表现可能掩盖具体问题,并展示了在仅使用6.5%训练数据时仍能取得2.2倍的性能提升。

Comments Identical to EMNLP 2023 Findings

详情
AI中文摘要

我们探索使用中等规模的大语言模型(GPT-J,6B参数)为模拟机器人在ScienceWorld(一个用于基础科学实验的文本游戏模拟器)中制定计划,以实现30类目标。先前发表的实证工作声称,与强化学习相比,大语言模型(LLMs)不太适合(Wang等人,2022)。使用马尔可夫假设(仅前一步),LLM的性能是强化学习方法性能的1.4倍。当我们用尽可能多的先前步骤填充LLM的输入缓冲区时,性能提升至3.5倍。即使仅使用6.5%的训练数据,我们仍观察到性能是强化学习方法的2.2倍。我们的实验表明,30类动作的性能差异很大,这表明对任务取平均可能会掩盖显著的性能问题。在与我们同时期的工作中,Lin等人(2023)展示了一种两部分方法(SwiftSage),该方法使用一个小型LLM(T5-large)并辅以OpenAI的大规模LLM,在ScienceWorld中取得了出色结果。我们的6B参数单阶段GPT-J的性能,与SwiftSage两阶段架构在结合GPT-3.5 turbo(其参数量是GPT-J的29倍)时的性能相当。

英文摘要

We explore using a moderately sized large language model (GPT-J, 6B parameters) to create a plan for a simulated robot to achieve 30 classes of goals in ScienceWorld, a text game simulator for elementary science experiments. Previously published empirical work claimed that large language models (LLMs) are a poor fit (Wang et al., 2022) compared to reinforcement learning. Using the Markov assumption (a single previous step), the LLM outperforms the reinforcement learning-based approach by a factor of 1.4. When we fill the LLM's input buffer with as many prior steps as possible, improvement rises to 3.5x. Even when training on only 6.5% of the training data, we observe a 2.2x improvement over the reinforcement learning-based approach. Our experiments show that performance varies widely across the 30 classes of actions, indicating that averaging over tasks can hide significant performance issues. In work contemporaneous with ours, Lin et al. (2023) demonstrated a two-part approach (SwiftSage) that uses a small LLM (T5-large) complemented by OpenAI's massive LLMs to achieve outstanding results in ScienceWorld. Our 6B-parameter, single-stage GPT-J matches the performance of SwiftSage's two-stage architecture when the latter incorporates GPT-3.5 turbo, which has 29 times more parameters than GPT-J.
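The contrast between the Markov setting (one previous step) and filling the input buffer with as many prior steps as possible can be sketched as a prompt-building routine. The version below keeps the most recent steps that fit in a token budget. The whitespace token counter, the budget value, and the prompt layout are all assumptions for illustration; the abstract does not describe the authors' tokenizer or input format.

```python
def count_tokens(text: str) -> int:
    """Crude stand-in for a tokenizer: counts whitespace-separated words."""
    return len(text.split())


def build_prompt(goal: str, history: list, budget: int) -> str:
    """Keep the goal plus as many of the most recent steps as fit in the budget.

    Steps are considered newest first so that the latest context is never
    dropped in favor of older steps, then restored to chronological order.
    """
    used = count_tokens(goal)
    kept = []
    for step in reversed(history):
        cost = count_tokens(step)
        if used + cost > budget:
            break
        kept.append(step)
        used += cost
    kept.reverse()
    return "\n".join([goal] + kept)


if __name__ == "__main__":
    goal = "boil water"
    history = ["open door", "go to kitchen", "pick up pot"]
    print(build_prompt(goal, history, budget=8))       # fills the buffer with recent steps
    print(build_prompt(goal, history[-1:], budget=8))  # Markov variant: only the last step
```

Passing `history[-1:]` reproduces the single-previous-step setting, so the same function covers both conditions compared in the abstract.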

2110.01552 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Perhaps PTLMs Should Go to School -- A Task to Assess Open Book and Closed Book QA

或许PTLMs应该去上学——一项评估开卷和闭卷问答的任务

Manuel R. Ciosici, Joe Cecil, Alex Hedges, Dong-Ho Lee, Marjorie Freedman, Ralph Weischedel

发表机构 * Information Sciences Institute, University of Southern California(信息科学研究所,南加州大学)

AI总结 本文提出了一项新的任务,旨在评估预训练语言模型(PTLMs)在开卷和闭卷场景下的问答能力,使用社会科学和人文领域的大学入门教材作为教学材料。研究基于教材作者编写的复习题设计了数百个判断题,并以教材前八章作为验证/开发集、其余章节作为盲测集,发现PTLMs在闭卷条件下至多略高于随机水平(56%),表明其可能未真正理解教材内容;而在开卷条件下,允许模型检索相关段落进行回答时,性能有所提升(约60%)。该任务为评估PTLMs对教学文本的理解能力提供了新的基准。

Comments Identical to the EMNLP 2021 version

详情
AI中文摘要

我们的目标是提供一项新任务和排行榜,以刺激关于问答和预训练语言模型(PTLM)的研究,使其理解重要的教学文档,例如大学入门教科书或手册。PTLM在许多问答任务中取得了巨大成功,但需要大量监督训练,而在零样本设置中表现较差。我们提出了一项新任务,包括两本社会科学(《美国政府2e》)和人文科学(《美国历史》)的大学入门教材,数百个基于教材作者编写的复习题的真假陈述,基于教材前八章的验证/开发测试,基于剩余章节的盲测,以及基于最先进PTLM的基线结果。由于问题平衡,随机表现应为约50%。使用BoolQ微调的T5达到了相同的表现,表明教材内容未在PTLM中预表示。闭卷考试(即阅读教材,将教材添加到T5的预训练中)最多带来微小改进(56%),表明PTLM可能没有“理解”教材(或可能误解了问题)。开卷考试(即允许机器自动检索段落并用于回答问题)表现更好(约60%)。

英文摘要

Our goal is to deliver a new task and leaderboard to stimulate research on question answering and pre-trained language models (PTLMs) to understand a significant instructional document, e.g., an introductory college textbook or a manual. PTLMs have shown great success in many question-answering tasks, given significant supervised training, but much less so in zero-shot settings. We propose a new task that includes two college-level introductory texts in the social sciences (American Government 2e) and humanities (U.S. History), hundreds of true/false statements based on review questions written by the textbook authors, validation/development tests based on the first eight chapters of the textbooks, blind tests based on the remaining textbook chapters, and baseline results given state-of-the-art PTLMs. Since the questions are balanced, random performance should be ~50%. T5, fine-tuned with BoolQ, achieves the same performance, suggesting that the textbook's content is not pre-represented in the PTLM. Taking the exam closed-book, but having read the textbook (i.e., adding the textbook to T5's pre-training), yields at best minor improvement (56%), suggesting that the PTLM may not have "understood" the textbook (or perhaps misunderstood the questions). Performance is better (~60%) when the exam is taken open-book (i.e., allowing the machine to automatically retrieve a paragraph and use it to answer the question).

2101.05400 2026-05-25 cs.CL cs.AI cs.LG 版本更新

Machine-Assisted Script Curation

机器辅助脚本编纂

Manuel R. Ciosici, Joseph Cummings, Mitchell DeHaven, Alex Hedges, Yash Kankanampati, Dong-Ho Lee, Ralph Weischedel, Marjorie Freedman

发表机构 * Information Sciences Institute, University of Southern California(信息科学研究所,南加州大学)

AI总结 本文介绍了一种名为MASC的系统,用于实现人机协作的脚本创作。该系统能够自动生成事件类型、链接至维基数据、提示可能被遗漏的子事件,并记录参与多个子事件的实体及其时间顺序,从而辅助用户高效编写结构复杂的事件脚本。研究展示了MASC在实际案例中的应用效果,验证了其在脚本创作中的实用价值。

Comments Identical to the NAACL 2021 Demo version

详情
AI中文摘要

我们描述了机器辅助脚本编纂器(MASC),一个用于人机协作脚本创作的系统。使用MASC生成的脚本包括:(1)构成更大复杂事件的子事件的英文描述;(2)每个事件的类型;(3)预期参与多个子事件的实体记录;(4)子事件之间的时间顺序。MASC通过提供事件类型建议、维基数据链接以及可能被遗忘的子事件,自动化了脚本创作过程的部分环节。我们通过几个案例研究脚本展示了这些自动化功能对脚本作者的实用性。

英文摘要

We describe Machine-Aided Script Curator (MASC), a system for human-machine collaborative script authoring. Scripts produced with MASC include (1) English descriptions of sub-events that comprise a larger, complex event; (2) event types for each of those events; (3) a record of entities expected to participate in multiple sub-events; and (4) temporal sequencing between the sub-events. MASC automates portions of the script creation process with suggestions for event types, links to Wikidata, and sub-events that may have been forgotten. We illustrate how these automations are useful to the script writer with a few case-study scripts.