arXivDaily: arXiv Daily Academic Digest, updated Monday through Friday

1. Deep Learning Architectures and Training Methods (79 papers)

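2606.07600 2026-06-09 cs.LG cs.AI New submission

Reachability and asymptotics of Gaussian Transformer dynamics

Albert Alcalde, Zhengping Ji, Enrique Zuazua

Affiliations: Friedrich–Alexander University Erlangen–Nürnberg; Research Council of Norway

AI summary: Models data propagation through Transformers as a nonlinear control system on the space of probability measures, proves that Gaussian distributions remain Gaussian under self-attention and affine feed-forward layers, which reduces the dynamics to a finite-dimensional bilinear control system, and reveals a connection with Riccati equations.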

Abstract

We formulate data propagation through the Transformer, the machine learning architecture powering large language models, as a nonlinear control system on the space of probability measures. For the mean-field Transformer model with self-attention and affine feed-forward layers, we prove that Gaussian distributions remain exactly Gaussian along the induced flow. This invariance reduces the infinite-dimensional measure dynamics to a finite-dimensional bilinear control system governing the evolution of the mean and covariance, reformulates the expressive capacity of Transformers as a reachability problem for prescribed Gaussian moments, and reveals a novel connection with Riccati-type equations from classical filtering and control. For time-varying controls, we prove exact finite-time reachability of any target Gaussian distribution whose covariance matrix has the same rank as the initial one, this rank constraint being an intrinsic invariant of the dynamics. For time-invariant parameters, we derive explicit spectral conditions leading either to asymptotic stability toward positive-definite equilibria or to finite-time blow-up of the covariance. Numerical experiments complement the theory by showing that practical Transformers with Gaussian inputs remain close to moment-matched Gaussian distributions through early and intermediate layers, while Transformers with prescribed attention matrices reproduce the predicted covariance regimes: bounded evolution in stabilizing configurations and blow-up in destabilizing ones.

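2606.07601 2026-06-09 cs.LG cs.AI New submission

LFNO: Bridging Laplace and Fourier via Transient-Steady Decomposition

Jeongun Ha, Sanga Yoon, Donghun Lee

AI summary: Proposes the Laplace-Fourier Neural Operator (LFNO), whose dual-branch architecture explicitly decomposes system dynamics into transient and steady-state components. Across nine benchmarks it significantly outperforms existing operators on ODE systems, consistently surpasses LNO and is competitive with FNO on PDE systems, and improves stability and interpretability.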

Comments 21 pages, 11 figures

Abstract

We introduce the Laplace-Fourier Neural Operator (LFNO), a unified framework for modeling dynamical systems across transient and steady-state regimes by integrating the spectral advantages of Laplace and Fourier Neural Operators. LFNO employs a dual-branch architecture that explicitly decomposes system dynamics into transient and steady-state components. We evaluate LFNO on nine benchmarks, including three ODE systems (Duffing, Lorenz, and Pendulum) and six PDE systems (Euler-Bernoulli beam, Heat, Reaction-diffusion, Brusselator, Burgers, and Navier-Stokes). LFNO significantly outperforms existing operators on ODE systems, where transient dynamics dominate, and consistently surpasses LNO while achieving performance competitive with FNO on PDE benchmarks. Furthermore, LFNO offers improved stability and physical interpretability through its component-wise decomposition. These results demonstrate that LFNO provides a robust and unified approach for learning complex dynamical systems across multiple temporal scales.

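2606.07604 2026-06-09 cs.LG cs.AI New submission

Contribution Weights: A Geometrical Analysis of Self-Attention Transformers

Harry Jake Cunningham, Nicola Muca Cirone

Affiliations: University of Cambridge

AI summary: Proposes contribution weights, a projection-based metric that combines attention weight, value-vector magnitude, and directional alignment to identify critical tokens more accurately, and reveals that attention sinks actively suppress information.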

Abstract

Analyzing attention weights has become a standard approach for interpreting the information flow of Large Language Models (LLMs). However, this approach has significant limitations as it neglects the geometric properties of the value vectors being aggregated. To address this gap, we introduce Contribution Weights, a projection-based metric that quantifies a token's influence by accounting for its attention weight, value magnitude, and directional alignment with the layer output. We demonstrate that contribution weights provide a more faithful measure of token importance, consistently outperforming attention-based metrics in identifying semantically critical tokens across different decoder-only models, tasks, and datasets. Further, our metric enables novel mechanistic analysis of attention sinks. While previous work characterized sinks as passive repositories for excess attention, we reveal they serve an active functional role, suppressing information through a convex relationship between sink rate and output norm, stabilizing representations by opposing the semantic drift of low-confidence tokens.
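The abstract names three ingredients (attention weight, value magnitude, directional alignment with the layer output) but does not give a formula. The sketch below is one plausible reading rather than the authors' definition: it assumes the output is the usual attention-weighted sum of value vectors and scores each token by the normalised projection of its weighted value onto that output. The function name and the toy numbers are hypothetical.

```python
def contribution_weights(attn, values):
    """Normalised projection of each weighted value vector onto the head output.

    Assumed form (not taken from the paper):
        c_i = a_i * <v_i, o> / ||o||^2,  with  o = sum_i a_i * v_i
    """
    dim = len(values[0])
    # Head output: attention-weighted sum of the value vectors.
    output = [sum(a * v[d] for a, v in zip(attn, values)) for d in range(dim)]
    norm_sq = sum(x * x for x in output)
    # a_i * <v_i, o> equals a_i * ||v_i|| * ||o|| * cos(v_i, o),
    # so dividing by ||o||^2 makes the scores sum to 1.
    return [
        a * sum(v[d] * output[d] for d in range(dim)) / norm_sq
        for a, v in zip(attn, values)
    ]


# Toy example: token 0 receives 90% of the attention but has a tiny value vector.
attn = [0.9, 0.1]
values = [[0.1, 0.0], [3.0, 0.0]]
weights = contribution_weights(attn, values)
```

In this toy case the heavily attended token ends up with the smaller score, which is the kind of gap between attention weight and actual influence that the abstract describes.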

2606.07695 2026-06-09 cs.LG cs.AI New submission

DSFNet: Learning Dual-Domain Spectral Operators for Multi-Modality Spatio-Temporal Forecasting in Urban Transportation Systems

Yongchao Li, Yang Li, Zhuoxuan Li, Jun Chen, Chu Zhang, Jinde Cao, Leszek Rutkowski

Affiliations: Southeast University; Jiangsu Province Collaborative Innovation Center of Modern Urban Traffic Technologies; City University of Hong Kong; School of Mathematics, Southeast University; Systems Research Institute of the Polish Academy of Sciences; Luoyang Normal University; Purple Mountain Laboratories; AGH University of Krakow

AI summary: Proposes DSFNet, a dual-domain spectral filtering network that factorizes space-modality interactions into feature-domain and spatial-domain spectral operators, explicitly models cross-variable coupling and heterogeneous spatial dependencies, and uses an external gating mechanism to adaptively regulate temporal dynamics. It reduces MAE by 3.21%-10.16% relative to the second-best baselines on five real-world traffic datasets.

Abstract

Multi-Modality Spatio-Temporal Forecasting (MoSTF) extends traditional spatio-temporal forecasting by incorporating diverse traffic modalities. Despite significant recent strides in spatio-temporal modeling, existing approaches often fail to explicitly model the coupling relationships between different modality variables. Accurate MoSTF is challenging, as it requires modeling (1) temporal dynamic heterogeneity under exogenous influences and (2) heterogeneous spatial dependencies alongside complex cross-variable couplings. To address these challenges, we propose the Dual-Domain Spectral Filtering Network (DSFNet). Our framework employs dual-domain spectral filtering to capture heterogeneous spatial patterns and explicitly model the relationships between variables. Unlike graph-based message passing or dense attention over node-modality pairs, DSFNet factorizes space-modality interactions into feature-domain and spatial-domain spectral operators, enabling scalable modeling of nonlocal dependencies and cross-modality couplings. Furthermore, we introduce an external gating mechanism to adaptively regulate temporal dynamics under external influences. We validate our method through extensive experiments on five representative real-world traffic datasets. Compared with the second-best baselines, DSFNet reduces MAE by 3.21%-10.16% across these datasets. The results demonstrate that DSFNet significantly outperforms existing state-of-the-art baselines in accuracy while exhibiting efficiency and robustness.

2606.07710 2026-06-09 cs.LG cs.AI New submission

WhiFlash: Accelerating Speculative Decoding with Token-Level Cross-Paradigm Routing

Young D. Kwon, Miles Williams, Rui Li, Alexandros Kouris, Stylianos I. Venieris

Affiliations: Samsung AI Center, Cambridge, UK

AI summary: Proposes WhiFlash, the first cross-paradigm speculative decoding method to unify autoregressive and diffusion-based parallel drafting. Fine-grained routing and cache optimizations yield category-specific throughput gains of up to 69.6% over EAGLE-3.

Comments Under review

Abstract

The autoregressive nature of large language models (LLMs) remains a significant bottleneck for inference, particularly in complex agentic workloads. While speculative decoding (SD) accelerates inference, current approaches rely on static drafting paradigms, utilising either autoregressive drafting models for reasoning or diffusion-based parallel drafting models for structured outputs. We empirically find that drafting accuracy fluctuates dramatically within a single sequence, leaving significant performance unrealised by static paradigms and coarse-grained routing. To address this volatility, we introduce WhiFlash, the first cross-paradigm SD method that unifies autoregressive and diffusion-based parallel drafting under a single token-level controller. WhiFlash adopts a fine-grained routing mechanism that employs either a lightweight entropy-based or a learned neural policy, both parametrised to provide a tunable balance between expected token gain and latency. To make high-frequency switching computationally viable, we introduce novel cache-management optimisations, Lazy Catch-up and KV-only Prefill, reducing switching overhead to below 7% of per-round latency. By capitalising on the complementary strengths of fundamentally distinct drafting architectures, WhiFlash achieves significantly higher acceptance lengths, yielding category-specific throughput gains of up to 69.6% over the state-of-the-art autoregressive EAGLE-3 and 37.3% over the diffusion-based DFlash.

2606.07856 2026-06-09 cs.LG New submission

Teacher-Free Self-Training Amplifies but Does Not Compound: A Pass@$K$ Crossover on a Free-Verifier Domain

Igor Lima Strozzi

Affiliations: Federal University of Rio de Janeiro

AI summary: Using teacher-free self-training (STaR) with critic-guided selection on a free-verifier domain, finds that self-training amplifies model capability but does not compound it, as confirmed by a pass@$K$ crossover diagnostic.

Abstract

When a language model trains on its own verified outputs, does it acquire capability beyond its base, or merely get better at expressing capability the base already had? We make the question decidable with a teacher-free "constellation" -- a generator, a learned critic, and a free exact verifier -- on a FlashFill-style "trapdoor" DSL, where verified (problem, solution) pairs are cheap to synthesize, hard to invert, and free to check exactly. Everything runs on one 4-bit Qwen3-4B on a single 24 GB GPU, with no model in the loop larger than the base. We report three findings. (i) Critic-guided selection beats verifier-filtered best-of-$k$ by $+9.1$ pp ($6/6$ seeds), with the entire gain localized to tasks where candidates disagree on held-out inputs. (ii) Per-round STaR self-training raises the ceiling but never accelerates -- the gain tracks remaining headroom and decelerates across $K=4$ independent training trajectories. (iii) The domain has no clean zero-capability frontier, so the usual "$0\% \to$ climb $=$ emergence" test is invalid here. A measured pass@$K$ crossover settles the diagnosis: the trained model wins at the operating budget (pass@$8$) but the base overtakes it at a large budget (pass@$64$) on every trajectory, so self-training concentrates probability mass rather than expanding reach. This is amplification, not compounding. ($K=4$ is indicative, not yet a robust across-trajectory CI.)
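The crossover diagnostic can be illustrated with the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), for n samples of which c are correct. Whether the paper uses this exact estimator is an assumption, and the per-task correct counts below are invented purely to show how concentrating probability mass can win at a small budget and lose at a large one.

```python
from math import comb


def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    # comb(n - c, k) is 0 whenever fewer than k samples are incorrect.
    return 1.0 - comb(n - c, k) / comb(n, k)


def mean_pass_at_k(correct_counts, n, k):
    """Average pass@k over tasks, given the correct-sample count per task."""
    return sum(pass_at_k(n, c, k) for c in correct_counts) / len(correct_counts)


N = 64  # samples drawn per task (hypothetical)

# Hypothetical base model: solves every task, but rarely.
base_counts = [4, 4, 4, 4]
# Hypothetical self-trained model: very reliable on three tasks, lost the fourth.
trained_counts = [40, 40, 40, 0]

base_8 = mean_pass_at_k(base_counts, N, 8)
trained_8 = mean_pass_at_k(trained_counts, N, 8)
base_64 = mean_pass_at_k(base_counts, N, 64)
trained_64 = mean_pass_at_k(trained_counts, N, 64)
```

With these made-up counts the trained model leads at pass@8 while the base model leads at pass@64, mirroring the amplification-without-compounding pattern the abstract reports.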

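2606.07881 2026-06-09 cs.LG New submission

Breaking the Bubble: Asynchronous Pipeline Parallel Training with Bounded Weight Inconsistency

Itay Elam, Eliron Rahimi, Avi Mendelson, Chaim Baskin

Affiliations: Technion - Israel Institute of Technology; Ben-Gurion University of the Negev

AI summary: Proposes PACI, which controls version drift through local gradient accumulation to achieve bubble-free asynchronous pipeline parallelism. In GPT-style language-model pretraining it matches the stability and perplexity of synchronous 1F1B-flush, fully utilizes pipeline throughput, and improves training time-to-accuracy by up to 1.69×.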

Abstract

Pipeline parallelism is essential for training large neural networks, but existing schedules trade off throughput, memory, and optimization consistency. Synchronous pipelines preserve forward/backward weight consistency but suffer from bubbles; asynchronous pipelines remove bubbles but introduce weight-version mismatch, typically requiring weight stashing, prediction, or correction mechanisms. We introduce PACI (Pipeline Asynchronous training with Controlled Inconsistency), a bubble-free asynchronous pipeline method that bounds forward/backward version drift without weight stashing, prediction, additional parameter copies, or global synchronization. The key idea is to use local gradient accumulation as a version-control mechanism: by slowing parameter-version evolution relative to pipeline delay, PACI limits the number of optimizer updates crossed by any micro-batch while preserving steady-state utilization. In GPT-style language-model pretraining, PACI matches the stability and final perplexity of synchronous 1F1B-flush, retains the same peak memory footprint, achieves fully utilized pipeline throughput, and improves training time-to-accuracy by up to $1.69\times$ over the fastest flush baseline. These results show that forward/backward inconsistency need not be eliminated: when explicitly bounded, it can be safely traded for substantial efficiency gains.
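The key idea, that accumulating gradients locally slows parameter-version evolution relative to pipeline delay, can be checked with a back-of-envelope timing model. This is a toy model, not the PACI schedule: it assumes one micro-batch per step, an optimizer update after every `accum_steps` micro-batches, and a fixed forward-to-backward delay measured in steps. All names are hypothetical.

```python
def version_drift(forward_step, delay, accum_steps):
    """Optimizer updates that land between a micro-batch's forward and backward.

    Toy timing model: the weight version at step t is t // accum_steps.
    """
    backward_step = forward_step + delay
    return backward_step // accum_steps - forward_step // accum_steps


def max_drift(delay, accum_steps, horizon=1000):
    """Worst-case drift over all forward steps in a finite horizon."""
    return max(version_drift(f, delay, accum_steps) for f in range(horizon))


# With a delay of 6 steps, more accumulation means fewer versions crossed.
drift_no_accum = max_drift(delay=6, accum_steps=1)
drift_accum_4 = max_drift(delay=6, accum_steps=4)
drift_accum_8 = max_drift(delay=6, accum_steps=8)
```

In this model the worst-case drift is the delay divided by the accumulation length, rounded up, so it can be bounded without stashing old weights. The actual bound and schedule in the paper may differ.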

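2606.07908 2026-06-09 cs.LG New submission

Layer-wise Derivative Controlled Networks Achieve Competitive Accuracy and Gradient Stability Across Data Regimes

Rowan Martnishn

Affiliations: Rowan Martnishn

AI summary: ChainzRule-based derivative-controlled networks use a per-layer Jacobian penalty to achieve strong low-data performance on tabular and NLP tasks, with the gradient tail ratio serving as a diagnostic of generalization.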

Abstract

Derivative-controlled networks based on ChainzRule (CR) combine cubic polynomial layers with a lightweight forward-mode per-layer Jacobian penalty (DREG). In this second paper of a multi-part series, we evaluate the generalization properties of CR across data regimes. We ablate the shape of the DREG coefficient schedule, demonstrating that the optimal annealing range depends on representation noise. On the Pima Diabetes dataset, CR achieves strong low-data performance and maintains a consistent accuracy advantage over baselines from 5% to 100% training data, supported by exceptionally stable gradient tail ratios (~1.01-1.02 vs. 1.07-1.09 for ReLU networks). Extensions to SST-5 show competitive or superior results in both frozen-embedding and BERT fine-tuned regimes, including outperforming prior BERT baselines despite substantially less training data. These results are statistically significant: CR achieves superior accuracy over the strongest published baselines we could identify on both datasets ($p < 0.05$). These results establish that layer-wise derivative control induces a structural inductive bias toward low-frequency, stable representations that generalizes robustly across tabular and NLP domains, data volumes, and representation qualities. The gradient tail ratio serves as a reliable, label-free diagnostic of generalization capability.

2606.08105 2026-06-09 cs.LG New submission

A Unifying View of Attention Sinks: Two Algorithms, Two Solutions

Lukas Fesser, Mozes Jacobs, Thomas Fel, Andy Keller, Sham Kakade

Affiliations: Kempner Institute; Harvard University

AI summary: Shows that attention sinks can correspond to two distinct mechanisms, adaptive nop and broadcast, derives diagnostics from this, and demonstrates that interventions such as gating and registers each target a different mechanism and work better in combination.

Abstract

When attention concentrates on a single token, a sink, what is the model actually computing? Attention sinks are ubiquitous in softmax transformers, yet this shared visual signature can hide fundamentally different algorithms. We show that visually similar sink patterns can reflect two distinct mechanisms: (i) adaptive nop, where a head suppresses its update by routing to a null token, and (ii) broadcast, where a sink aggregates and redistributes global information. In that case, sinks serve an analogous role: a safe destination when there is nothing useful to compute. Proposed interventions like gating or registers work because they implicitly target one or the other, revealing a duality between method and assumed mechanism: gating implicitly assumes nop; registers implicitly assume broadcast. Each mechanism leaves distinct traces (nop sinks exhibit negligible value norms; broadcast sinks induce low-rank outputs) which we formalize on synthetic tasks and use to derive practical diagnostics. Applied to pretrained vision transformers, these diagnostics reveal that both mechanisms exist at scale: sinks transition from CLS in early layers to patches in deeper layers, and concentrate in specialized heads. Strikingly, register tokens, designed for broadcast, are repurposed to also serve nop, confirming that neither intervention alone suffices. Combining gating with registers yields complementary gains in stability and performance. Overall, we find that the same attention pattern can reflect two very different computations and effective intervention requires first asking what the model is actually computing.

2606.08191 2026-06-09 cs.LG cs.AI q-bio.QM New submission

Frequency-Domain Latent Attention Gating for Cross-Domain Token Aggregation

Kewei Li, Rongying Zhang, Xueli Wang, Xiwen Gong, Zhongjian Wang, Lan Huang, Ruochi Zhang, Fengfeng Zhou

Affiliations: College of Computer Science and Technology, Jilin University; Key Laboratory of Symbolic Computation and Knowledge Engineering of Ministry of Education, Jilin University; Institute for Quantitative and Computational Biology, University of California; Greenwich High School; BCPM Data Limited

AI summary: Proposes FLaG, a token-aggregation module that applies a real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs time-domain tokens. It shows its clearest gains on AMP prediction and CIFAR-100 and remains competitive on text classification.

Abstract

Token aggregation is a common bottleneck in models that map token representations to sample-level predictions, yet most pooling methods operate only in the original token domain. We propose FLaG, a plug-in aggregation module that transforms token representations with the real FFT, summarizes spectral components with learnable latent queries, applies a channel-wise gate, and reconstructs enhanced time-domain tokens for final pooling. We evaluate FLaG on antimicrobial peptide (AMP) activity prediction with ESM2, image classification with ResNet18 on CIFAR-10 and CIFAR-100, and text classification with RoBERTa on IMDB and GLUE. FLaG achieves its clearest gains on the ESM2-8M antimicrobial peptide tasks and on CIFAR-100, while remaining competitive with strong text baselines on IMDB and GLUE. Then we probe its behavior on the AMP setting with band knockouts, gate summaries, residue perturbations, latent-query readouts, and structure-proxy stratification. We find that low-frequency bands contribute the most overall, and the remaining higher-band pattern is more sample-specific. The gate acts as a broadly shared spectral reweighting stage and the cross-attention patterns are sample-specific with mild query-wise differentiation, and higher-helix peptides exhibit stronger average spectral sensitivity in both bacteria. The supplementary materials, source code and data are released at https://www.healthinformaticslab.org/supp/ and https://github.com/Kewei2023/AMPCliff/tree/FLaG.

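2606.08262 2026-06-09 cs.LG New submission

Causal Semantic Alignment for LLM-based Time Series Forecasting

Kexuan Zhang, Xiaobei Zou, Cesare Alippi, Gary G. Yen, Yang Tang

Affiliations: University of Science and Technology of China; University of California, Berkeley

AI summary: Proposes the CVAformer framework, which uses causal intervention to disentangle the dynamic and invariant components of each variable and remove confounding bias from alignment, matching or exceeding state-of-the-art performance across multiple forecasting settings.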

Abstract

Recent advances in Large Language Models (LLMs) have opened new possibilities for time series forecasting by enabling alignment between temporal patterns and pretrained word embeddings. However, most LLM-based methods overlook the heterogeneous nature of time series, where dynamic fluctuations and invariant semantics are entangled. This entanglement introduces spurious correlations during the alignment, as dynamic components act as confounders by simultaneously influencing invariant components and the resulting aligned embeddings. To address this issue, a variable-level alignment framework CVAformer is proposed. CVAformer explicitly disentangles each variable into invariant and dynamic components just before alignment, and applies causal intervention to mitigate the confounding effect of the dynamics. To better support variable-level alignment, CVAformer replaces the standard causal attention in LLMs with a non-causal attention mechanism that captures interactions among variables at each time step. Extensive experiments across long-term, short-term, few-shot, and zero-shot forecasting settings indicate that CVAformer matches or exceeds state-of-the-art performance on most datasets, and in some cases achieves notably better accuracy. Experimental results validate the effectiveness of variable-level alignment and dynamic disentanglement in CVAformer, offering a new perspective for LLM-based time series tasks.

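2606.08454 2026-06-09 cs.LG cs.CL New submission

Beyond Linear Activation Steering: Invertible Latent Transformations for Controlling LLM Behavior

Tuc Nguyen, Thai Le

Affiliations: Indiana University Bloomington

AI summary: Proposes the INNSteer framework, which maps LLM activations through an invertible neural network into a latent space for linear control and then inverts back to the original space, yielding nonlinear, input-dependent activation steering that outperforms existing methods across multiple models and benchmarks.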

Comments 36 pages, 7 figures

Abstract

Activation steering provides a lightweight inference-time mechanism for controlling large language models (LLMs) by modifying their internal activation vectors toward desired behaviors. Most existing methods compute a fixed steering direction in the original activation space, typically from pairs of contrastive examples using mean differences, linear probes, or arbitrary separability criteria. While effective to a certain extent, these methods treat behavioral control as a global, linear, additive offset: the same direction is applied across inputs, and behaviors are linearly separable. This can be restrictive when behavioral features vary nonlinearly across the activation space or lie on curved and anisotropic manifolds, where the optimal intervention may be input-dependent. To address this limitation, we propose INNSteer, a nonlinear activation steering framework based on invertible latent transformations. Rather than searching for a better steering vector in the original representation space, INNSteer learns a lightweight invertible neural network $ϕ$ that maps an LLM's activations into a latent space where behavioral classes are more amenable to linear control. At inference time, activations are mapped through $ϕ$, steered in the latent space, and mapped back through the exact inverse transformation $ϕ^{-1}$. This makes a simple latent-space translation become a nonlinear, input-dependent intervention in the original activation space. Across experiment settings on multiple LLM families, scales, behavioral traits, and safety benchmarks, INNSteer consistently improves model control over linear, transport-based, and nonlinear steering baselines while largely preserving generation fluency.
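The claim that a plain latent-space translation becomes an input-dependent intervention after inversion can be seen in a two-dimensional toy. The sketch below uses a single affine coupling layer as the invertible map; this is a stand-in chosen for illustration, since the abstract does not specify the architecture of $ϕ$, and the scale and shift functions are fixed rather than learned. All function names are hypothetical.

```python
import math


def scale(x1):
    """Log-scale applied to the second coordinate (fixed here, learned in practice)."""
    return math.tanh(x1)


def shift(x1):
    """Shift applied to the second coordinate (fixed here, learned in practice)."""
    return x1


def forward(x):
    """Affine coupling layer: keep x1, transform x2 conditioned on x1."""
    x1, x2 = x
    return (x1, x2 * math.exp(scale(x1)) + shift(x1))


def inverse(z):
    """Exact inverse of the coupling layer."""
    z1, z2 = z
    return (z1, (z2 - shift(z1)) * math.exp(-scale(z1)))


def steer(x, delta):
    """Map to latent space, add a constant offset, and map back."""
    z1, z2 = forward(x)
    return inverse((z1, z2 + delta))


# The same latent offset moves two inputs by different amounts.
steered_a = steer((0.0, 2.0), 1.0)
steered_b = steer((1.0, 2.0), 1.0)
```

Here the induced change in the second coordinate is `delta * exp(-scale(x1))`, so the effective steering vector depends on the input even though the latent offset is constant.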

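2606.08578 2026-06-09 cs.LG New submission

Lost in the Non-convex Loss Landscape: How to Fine-tune the Large Time Series Model?

Xu Zhang, Peang Wang, Wei Wang

Affiliations: Shanghai Key Laboratory of Data Science; College of Computer Science and Artificial Intelligence; Fudan University

AI summary: To address the overfitting that a non-convex loss landscape causes when fine-tuning pre-trained large time series models, proposes Smoothed Full Fine-tuning (SFF), which smooths the loss landscape by interpolating with a randomly initialized auxiliary model to improve trainability, with consistent gains on eight representative models.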

Comments This paper has been accepted by The Fourteenth International Conference on Learning Representations (ICLR 2026). The code is available at https://github.com/Meteor-Stars/SFF

Abstract

Recently, large time series models (LTSMs) have gained increasing attention due to their similarities to large language models, including flexible context length, scalability, and task generality, outperforming advanced task-specific models. However, prior studies indicate that pre-trained LTSMs may exhibit a poorly conditioned non-convex loss landscape, leading to limited trainability. As a result, direct fine-tuning tends to cause overfitting and suboptimal performance, sometimes even worse than training from scratch, substantially diminishing the benefits of pre-training. To overcome this limitation, we propose Smoothed Full Fine-tuning (SFF), a novel fine-tuning technology. Specifically, we construct an auxiliary LTSM via random initialization to obtain a smoother loss landscape, and then linearly interpolate its weights with those of the pre-trained model to smooth the original landscape. This process improves trainability while preserving pre-trained knowledge, thereby enabling more effective downstream fine-tuning. From an optimization perspective, SFF perturbs sharp minima without significantly harming flat regions, facilitating escape from poor local basins toward smoother and more generalizable solutions. Extensive experiments on benchmark datasets demonstrate consistent improvements across eight representative LTSMs, including Timer, TimesFM, MOMENT, UniTS, MOIRAI, Chronos, TTMs, and Sundial, on diverse downstream tasks. The code is available at the link: https://github.com/Meteor-Stars/SFF.
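The interpolation step described in the abstract amounts to a convex combination of two weight sets. The sketch below shows only that step, on plain Python lists. The mixing coefficient `alpha`, its value, and the dictionary layout are assumptions made for illustration; the authors' repository linked above is the reference for the actual procedure.

```python
def interpolate_weights(pretrained, auxiliary, alpha):
    """Blend pre-trained weights with a randomly initialised auxiliary model.

    alpha = 0 keeps the pre-trained weights unchanged; alpha = 1 discards them.
    Both arguments map parameter names to flat lists of floats.
    """
    return {
        name: [
            (1.0 - alpha) * p + alpha * a
            for p, a in zip(pretrained[name], auxiliary[name])
        ]
        for name in pretrained
    }


# Hypothetical parameters; a real auxiliary model would be randomly initialised.
pretrained = {"layer.weight": [1.0, -2.0]}
auxiliary = {"layer.weight": [0.0, 0.0]}
smoothed = interpolate_weights(pretrained, auxiliary, alpha=0.25)
```

Fine-tuning would then start from `smoothed` rather than from the pre-trained weights directly.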

2606.08850 2026-06-09 cs.LG cs.AI cs.CL stat.ML New submission

Intrinsic Selection and Particle Resampling for Inference-Time Scaling Beyond Domain Verifiability

Giorgio Giannone, Mustafa Eyceoz, Shabana Baig, Shivchander Sudalairaj, Anna C. Doris, Faez Ahmed, Akash Srivastava, Kai Xu

Affiliations: MIT; Red Hat; IBM

AI summary: Proposes an inference-time scaling method based on intrinsic statistics of parallel sample sets (length-adjusted tail entropy). Post-hoc candidate ranking and step-level resampling improve performance on open-ended tasks without external verification.

Comments preprint

Abstract

Inference-Time Scaling (ITS) has largely succeeded in verifiable domains like math and coding, where cheap verification enables scalable output selection. However, extending ITS to tasks prone to systematic failure - driven by faulty initial assumptions or unmet multidimensional constraints - typically relies on costly external solvers or brittle, model-based verifiers. Our key insight is that the intrinsic statistics of parallel sample sets, specifically length-adjusted tail entropy, provide a robust discriminative signal for solution quality without access to ground truth. Crucially, these statistics serve as a difficulty gate for adaptive compute allocation, dynamically routing problems across scaling regimes. First, Intrinsic Selection (iS) ranks candidates post-hoc, matching consensus-based algorithms across three domains and improving engineering design selection by 20% over pass@1 baselines. Second, Intrinsic Particle Filtering (iPF) generalizes this to step-level resampling, guiding generation toward high-confidence reasoning trajectories to improve pass@1 by 6.1 points on average on hard math problems. Finally, Particle Distillation (dPF) injects privileged guidance via early logit blending and KL-guided resampling, steering generation past systematic reasoning errors to satisfy expert rubrics, yielding up to 26.5% gains on complex clinical responses. Our pipeline applies seamlessly across broad-purpose, domain-specialized, and multimodal architectures, successfully extending ITS to open-ended domains without requiring trained reward models or exact ground-truth verification.

2606.08854 2026-06-09 cs.LG cs.AI cs.CL stat.ML 新提交

sGPO: Trading Inference FLOPs for Training Efficiency in RLVR

sGPO: 在RLVR中用推理FLOPs换取训练效率

Shivchander Sudalairaj, Kai Xu, Akash Srivastava, Giorgio Giannone

发表机构 * Red Hat(红帽) IBM

AI总结 提出sGPO方法,通过少量推理计算预估查询难度,自适应分配训练预算,将训练计算量降低三倍,同时保持或提升性能。

AI中文摘要

标准的可验证奖励强化学习(RLVR)训练为每个查询分配固定的展开预算,而不考虑每个查询的难度对当前策略的意义。这导致两种对称的失败模式:简单查询产生接近零的优势,因为策略已经解决了它们;而无法解决的查询不产生信号,因为策略从未解决它们。这两种情况都浪费了训练FLOPs,而没有贡献学习梯度。我们引入了排序组策略优化(sGPO),一种计算高效的策略,用少量推理FLOPs换取大量减少浪费的训练FLOPs。关键见解是,廉价的推理计算可以作为查询难度的单一离线代理。通过在初始策略下为每个查询生成一小批并行样本,我们获得了模型感知的经验成功率。这激励将训练展开组大小设置为该成功率的倒数,这是一个实用的规则,通过从每个生成的展开中提取最大优势来最大化样本效率。这一单次性能分析过程同时驱动数据过滤(移除琐碎查询和子采样无法解决的查询)、自适应组大小分配和课程构建(从易到难调度查询)。sGPO匹配或超过基线性能,同时将总训练计算量减少三倍,包括前期的推理性能分析成本。

英文摘要

Standard Reinforcement Learning with Verifiable Rewards (RLVR) training allocates a fixed rollout budget to every query, without regard for what each query's difficulty means for the current policy. This leads to two symmetric failure modes: easy queries produce near-zero advantage because the policy already solves them, while unsolvable queries produce no signal because the policy never solves them. Both regimes waste training FLOPs without contributing to a learning gradient. We introduce sorted Group Policy Optimization (sGPO), a compute-efficient strategy that trades a small budget of inference FLOPs for a large reduction in wasted training FLOPs. The key insight is that cheap inference compute can serve as a single offline proxy for query difficulty. By generating a small batch of parallel samples per query under the initial policy, we obtain a model-aware empirical success rate. This motivates setting the training rollout group size to the inverse of this success rate, a practical rule that maximizes sample efficiency by extracting the most advantage per generated rollout. This single profiling pass simultaneously drives data filtering (removing trivial queries and sub-sampling unsolvable ones), adaptive group size allocation, and curriculum construction (scheduling queries from easy to hard). sGPO matches or exceeds baseline performance while reducing total training compute by a factor of three, with the upfront inference profiling cost included.
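The inverse-success-rate rule described in this abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration rather than the authors' implementation: the function names, the clamping bounds `min_size` and `max_size`, and the choice to drop never-solved queries entirely (the abstract says they are sub-sampled, not removed).

```python
import math


def profile_success_rate(outcomes):
    """Empirical success rate from a list of 0/1 verifier outcomes."""
    return sum(outcomes) / len(outcomes)


def rollout_group_size(success_rate, min_size=2, max_size=32):
    """Group size of roughly 1 / success_rate, clamped to [min_size, max_size].

    Returns None for queries that the profiling pass always solved (trivial)
    or never solved; this sketch simply filters both kinds out.
    """
    if success_rate <= 0.0 or success_rate >= 1.0:
        return None
    return max(min_size, min(max_size, math.ceil(1.0 / success_rate)))


def build_curriculum(profiles):
    """Map {query_id: [0/1 outcomes]} to [(query_id, group_size)], easy to hard."""
    plan = []
    for query_id, outcomes in profiles.items():
        rate = profile_success_rate(outcomes)
        size = rollout_group_size(rate)
        if size is not None:
            plan.append((query_id, rate, size))
    # Higher success rate means easier, so sort by descending rate.
    plan.sort(key=lambda item: -item[1])
    return [(query_id, size) for query_id, _, size in plan]


if __name__ == "__main__":
    profiles = {
        "a": [1, 1, 1, 1],  # always solved: filtered out
        "b": [1, 0, 0, 0],  # hard: large group
        "c": [1, 1, 0, 0],  # medium: small group
        "d": [0, 0, 0, 0],  # never solved: filtered out in this sketch
    }
    print(build_curriculum(profiles))
```

In this sketch a single profiling pass feeds all three uses named in the abstract: filtering, group-size allocation, and easy-to-hard ordering.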

2606.08934 2026-06-09 cs.LG stat.AP stat.CO stat.ME stat.ML 新提交

Backward Coherence and Hidden-State Stability in Recurrent Neural Networks: A Quasi-Reverse-Martingale Theory

递归神经网络中的反向相干性与隐藏状态稳定性:拟逆鞅理论

Yuan-chin Ivan Chang

发表机构 * Institute of Statistical Science, Academia Sinica(中央研究院统计科学研究所)

AI总结 提出反向相干性概念,通过拟逆鞅理论证明隐藏状态序列几乎必然收敛,并设计正则化方法,在多个任务中实现更早稳定和更低误差。

AI中文摘要

递归神经网络维护一个隐藏状态 $h_t$,但其概率意义通常不明确。我们通过\emph{反向相干性}研究隐藏状态稳定性:即通过学习的反向投影器 $g_ϕ$ 从 $h_{t+1}$ 重构 $h_t$ 的程度。在收缩性和可和反向漂移条件下,隐藏状态序列构成拟逆鞅。这导致几乎必然收敛、混合下的速率、可解释的极限表示、有限路径停止时间以及时间一致置信序列的理论框架。模拟支持该理论。反向相干性正则化将经验拟鞅总和 $\hat Q$ 降低 $43$--$58\%$,比未正则化的 RNN 早 $28$--$44\%$ 达到稳定,并提供与几何界一致的跟踪误差恢复。额外测试证实回波状态遗忘率受 $ρ$ 限制,并验证增量总和管 $R_t$ 具有 $100\%$ 同时覆盖率,尽管 $R_t$ 是保守的;实践中,缺陷尾代理 $\hat Q_t$ 是更有用的监控指标。反向相干性损失也等价于在高斯反向模型中最小化 Kullback--Leibler 散度,将该方法与变分推断联系起来。扩展涵盖 $ϕ$-混合输入、变点检测和有限样本集中度。三项真实数据研究进一步验证了该方法。在 PhysioNet 2012 ICU 数据上,逆鞅 RNN (RMRNN) 与 RNN 的死亡率预测 AUC 相当,同时提前 13 小时达到稳定表示。在 FRED-MD 上,它在概念漂移下将一个月前预测误差降低约四倍。在 UCI 人类活动识别上,它保持较低的后转换跟踪误差并具有几何衰减。这些保证在所述假设下成立;不声称普适性。

英文摘要

Recurrent neural networks maintain a hidden state $h_t$, but its probabilistic meaning is often unclear. We study hidden-state stability through \emph{backward coherence}: the extent to which $h_t$ can be reconstructed from $h_{t+1}$ by a learned backward projector $g_ϕ$. Under contraction and summable backward drift, the hidden-state sequence forms a quasi-reverse-martingale. This yields almost-sure convergence, rates under mixing, an interpretable limiting representation, finite pathwise stopping times, and a theoretical framework for time-uniform confidence sequences. Simulations support the theory. Backward-coherence regularisation reduces the empirical quasi-martingale total $\hat Q$ by $43$--$58\%$, reaches stability $28$--$44\%$ earlier than an unregularised RNN, and gives tracking-error recovery consistent with geometric bounds. Additional tests confirm echo-state forgetting rates bounded by $ρ$ and verify the increment-sum tube $R_t$ with $100\%$ simultaneous coverage, although $R_t$ is conservative; in practice, the defect-tail proxy $\hat Q_t$ is the more useful monitor. The backward-coherence loss is also equivalent to minimising a Kullback--Leibler divergence in a Gaussian backward model, linking the method to variational inference. Extensions cover $ϕ$-mixing inputs, change-point tracking, and finite-sample concentration. Three real-data studies further validate the approach. On PhysioNet 2012 ICU data, the Reverse Martingale RNN (RMRNN) matches RNN mortality-prediction AUC while reaching stable representations 13 hours earlier. On FRED-MD, it reduces one-month-ahead forecast error by about fourfold under concept drift. On UCI Human Activity Recognition, it maintains lower post-transition tracking error with geometric decay. The guarantees apply under the stated assumptions; universality is not claimed.

2606.08985 2026-06-09 cs.LG 新提交

Beyond Neural Collapse: Task-Intrinsic Geometry Governs Neural Representations in Modular Arithmetic

超越神经坍缩:任务内在几何决定模算术中的神经表示

Hu Tan, Kuo Gai, Shihua Zhang

发表机构 * Academy of Mathematics and Systems Science, Chinese Academy of Sciences(中国科学院数学与系统科学研究院) School of Mathematical Sciences, University of Chinese Academy of Sciences(中国科学院大学数学科学学院) Shanghai Institute for Mathematics and Interdisciplinary Sciences (SIMIS)(上海数学与交叉学科研究院) Key Laboratory of Systems Health Science of Zhejiang Province, School of Life Science, Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(浙江省系统健康科学重点实验室,中国科学院大学杭州高等研究院生命科学学院)

AI总结 本文发现模加法任务中网络表示呈现二维循环几何而非神经坍缩的单纯形等角紧框架,通过层间非均匀训练、子空间锁定后的相位对齐动力学和复杂度优势分析解释了这一现象。

AI中文摘要

虽然神经坍缩(NC)预测一个$K$类平衡分类器应将终端表示组织为$(K-1)$维单纯形等角紧框架(ETF),但模加法始终进入不同的状态:网络压缩为二维循环几何,其中分类器权重和词元嵌入都位于圆上。我们从三个方向精炼对这一现象的解释。首先,我们形式化了一个逐层非均匀训练机制:下游分类器权重被密集交叉熵梯度驱动到秩2等角配置,而上游嵌入尚未完全重组;一旦这个分类器平面形成,反向传播的特征梯度将嵌入运动约束在同一平面内,同时权重衰减抑制正交分量。其次,在此子空间锁定之后,诱导的平面内动力学允许在$S^1$上的一种熵正则化输运解释;结合模加法标签,这使嵌入形成简化为相位对齐,其最小化器是$\mathbb{Z}/P\mathbb{Z}$的单频特征,因此是圆上的等角点。第三,我们量化了为什么这一解优于NC:单纯形ETF在交叉熵上仅获得$O(1)$的优势,而循环秩2解在Schatten或权重衰减代理下享有$\Theta(K)$的优势,产生临界阈值$\lambda_{\mathrm{crit}} = \Theta(1/K)$。我们的结果解释了为什么分类器权重首先移动以及为什么嵌入随后与之对齐,表明模算术上的grokking不是由最大分离单独支配,而是由分离、对称性和复杂性之间的任务结构化权衡所支配。

英文摘要

While neural collapse (NC) predicts that a $K$-class-balanced classifier should organize terminal representations as a $(K-1)$-dimensional simplex equiangular tight frame (ETF), modular addition consistently enters a different regime: networks compress to a two-dimensional cyclic geometry in which both classifier weights and token embeddings lie on circles. We refine the explanation of this phenomenon in three directions. First, we formalize a layerwise non-uniform training mechanism: downstream classifier weights are driven by dense cross-entropy gradients into a rank-2 equiangular configuration before upstream embeddings fully reorganize, and once this classifier plane forms, backpropagated feature gradients constrain embedding motion to the same plane while weight decay suppresses orthogonal components. Second, after this subspace locking, the induced in-plane dynamics admit an entropy-regularized transport interpretation on $S^1$; combined with modular-addition labels, this reduces embedding formation to phase alignment, whose minimizers are single-frequency characters of $\mathbb{Z}/P\mathbb{Z}$ and hence equal-angle points on a circle. Third, we quantify why this solution prevails over NC: a simplex ETF gains only an $O(1)$ advantage in cross-entropy, whereas the cyclic rank-2 solution enjoys a $\Theta(K)$ advantage under Schatten or weight-decay surrogates, yielding a critical threshold $\lambda_{\mathrm{crit}} = \Theta(1/K)$. Our results explain both why classifier weights move first and why embeddings subsequently align with them, showing that grokking on modular arithmetic is governed not by maximal separation alone but by a task-structured trade-off between separation, symmetry, and complexity.

2606.09059 2026-06-09 cs.LG cs.AI cs.CV 新提交

Stage-1 Controls the Entropy Regime, Not the Outcome

Stage-1 控制熵状态,而非最终结果

Jianxiong Shen

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文通过小数据实验研究两阶段后训练中Stage-1(SFT或OPD)的作用,发现其主要影响策略熵状态,但对最终性能影响有限。

AI中文摘要

两阶段后训练——Stage-1 热启动(监督微调 SFT 或在线策略蒸馏 OPD)后接 Stage-2 强化学习(RL)——越来越多地用于视觉语言模型(VLM)。我们使用 Qwen2.5-VL-7B 和同模态 72B VLM 教师进行 OPD,在小数据研究中探究 Stage-1 实际控制什么。首先,三种热启动在 Geometry3K 内部验证集上达到狭窄的 53%–54% 区间,与近期专门方法报告的窄范围一致;该设置几乎没有证据表明 Stage-1 改变了域内终点。其次,匹配配方、早停的 SFT 在域外 MathVista 上提升了 +2.1 点,逆转了过训练变体的 -9.5 点下降。最明显的区别是熵状态:OPD 进入 RL 时的策略熵显著高于任一 SFT 初始化,且这种分离在可用轨迹中持续可见。在域内初始化时,OPD 还具有更高的答案多样性和 pass@16(比 SFT 高 +2.0 到 +5.2 点),尽管问题级自举区间显示较小的对比具有不确定性。RL 后优势消失(终点 pass@16 值在 1.1 点以内),在 MathVista 上也是如此(六个模型在 1.2 点以内)。因此,我们的贡献是一个有界的实证刻画:在此设置中,Stage-1 与熵状态强相关,但下游收益小、局部化,且不能证明 OPD 是更好的 RL 热启动。

英文摘要

Two-stage post-training -- a Stage-1 warm-start (supervised fine-tuning, SFT, or on-policy distillation, OPD) followed by Stage-2 reinforcement learning (RL) -- is increasingly used for vision-language models (VLMs). We ask what Stage-1 actually controls in a small-data study using Qwen2.5-VL-7B with a same-modality 72B VLM teacher for OPD. First, the three warm-starts reach a narrow $53$--$54\%$ band on Geometry3K internal validation, consistent with the narrow range reported by recent specialized methods; this setup provides little evidence that Stage-1 changes the in-domain endpoint. Second, a matched-recipe, early-stopped SFT improves out-of-domain MathVista by $+2.1$ points, reversing the $-9.5$-point drop of an over-trained variant. The clearest difference is the \emph{entropy regime}: OPD enters RL with substantially higher policy entropy than either SFT initialization, and the separation remains visible through the available trajectories. At the in-domain initialization, OPD also has higher answer diversity and pass@16 ($+2.0$ to $+5.2$ points over SFT), although problem-level bootstrap intervals show that the smaller contrast is uncertain. The advantage is absent after RL (endpoint pass@16 values within $1.1$ points) and on MathVista (six models within $1.2$ points). Our contribution is therefore a bounded empirical characterization: Stage-1 is strongly associated with the entropy regime in this setup, but the downstream payoff is small, localized, and not evidence that OPD is a better RL warm-start.

2606.09077 2026-06-09 cs.LG 新提交

Neural Legendre-Fenchel transform with Hessian Preconditioning

神经 Legendre-Fenchel 变换与 Hessian 预处理

Basile Plus-Gourdon, Frank Nielsen

发表机构 * École Normale Supérieure Paris-Saclay(巴黎-萨克雷高等师范学校) Sony Computer Science Laboratories Inc.(索尼计算机科学实验室公司)

AI总结 提出基于 Hessian 预处理的神经 Legendre-Fenchel 变换方法,通过仿射变形改善病态函数的共轭计算,提高收敛速度和数值精度。

Comments 11 pages, 4 figures

AI中文摘要

Legendre-Fenchel (LF) 变换是凸分析和机器学习中的基本工具,将下半连续函数映射到其凸共轭。在实践中,当给定函数的凸共轭没有闭式公式时,必须使用各种技术进行近似。最近一种通用的数值方法是深度 Legendre 变换方法,它依赖于神经网络,尽管在处理病态函数时仍然具有挑战性。本文基于 LF 变换作为射影对偶的重新表述。该框架的一个显著特性是仿射不变性。我们利用这种仿射不变性引入了一种基于 Hessian 的预处理策略。具体来说,我们在一个极小点附近应用仿射变形,使得函数的二阶泰勒近似与标准抛物面重合,其共轭映射是恒等映射。一个在恒等映射附近初始化的残差网络可以学习这个简化后的映射,而原始共轭映射通过逆变形恢复。所提出的预处理仅带来适度的计算开销,包括初始化时的一次特征分解和每次查询时的两次矩阵-向量乘法。在包括高维基准测试在内的多种凸函数上的实验表明,共轭的收敛速度和数值精度得到了提高,特别是在病态问题上效果显著。最后,我们讨论了所提出方法的适用范围,并指出了其若干局限性。

英文摘要

The Legendre-Fenchel (LF) transform is a fundamental tool in convex analysis and machine learning that maps lower semi-continuous functions to their convex conjugates. In practice, when closed-form formulas are not available for the convex conjugates of given functions, one must approximate them using various techniques. One recent and versatile numerical method is the deep Legendre transform, which relies on neural networks, although it remains challenging to apply to ill-conditioned functions. This work builds on the reformulation of the LF transform as a projective polarity. A notable property of this framework is its affine invariance. We leverage this affine invariance to introduce a Hessian-based preconditioning strategy. Specifically, we apply an affine deformation around a minimizer so that the second-order Taylor approximation of the function coincides with the canonical paraboloid, whose conjugation map is the identity. A residual network initialized near the identity can then learn this simplified mapping, while the original conjugation map is recovered through the inverse deformation. The proposed preconditioning incurs only a modest computational overhead, consisting of a single eigendecomposition during initialization and two matrix-vector multiplications per query. Experiments on a diverse set of convex functions, including high-dimensional benchmarks, demonstrate improved convergence rates and enhanced numerical accuracy of the conjugation, with particularly significant gains for ill-conditioned problems. Finally, we discuss the scope of applicability of our proposed method and highlight several of its limitations.
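The preconditioning step in this abstract can be written out as a short derivation. This is a sketch under assumptions the abstract does not state: $f$ is twice differentiable with minimizer $x_0$ and positive-definite Hessian $H$, and the symbols $Q$, $\Lambda$, $L$, and $g$ are introduced here purely for illustration. The paper's own construction via projective polarity may differ in detail.

```latex
% Eigendecomposition of the Hessian at the minimizer (computed once).
H = \nabla^2 f(x_0) = Q \Lambda Q^\top, \qquad L = Q \Lambda^{-1/2}.

% Affinely deformed function: its Hessian at the origin is the identity.
g(z) = f(x_0 + L z) - f(x_0), \qquad \nabla^2 g(0) = L^\top H L = I.

% The original conjugate is recovered through the inverse deformation.
f^*(y) = g^*(L^\top y) + \langle y, x_0 \rangle - f(x_0),
\qquad
\nabla f^*(y) = x_0 + L \, \nabla g^*(L^\top y).
```

Under these assumptions each query costs two matrix-vector products ($L^\top y$ and $L\,\nabla g^*$), which matches the overhead quoted in the abstract. Because $g$ agrees with $\tfrac{1}{2}\lVert z\rVert^2$ to second order at the origin, $\nabla g^*$ is close to the identity there, which is the map a residual network initialized near the identity would have to learn.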

2606.09078 2026-06-09 cs.LG 新提交

The Hidden Bias of Process Reward Models: PRISM for Rewarding the Right Reasoning

过程奖励模型的隐藏偏见:PRISM用于奖励正确推理

Aakriti Agrawal, Souradip Chakraborty, Armin Saghafian, Nihal Sharma, Rizal Fathony, Nam H Nguyen, C. Bayan Bruss, Amrit Singh Bedi, Furong Huang

发表机构 * University of Maryland(马里兰大学) Amazon(亚马逊) University of Central Florida(中佛罗里达大学)

AI总结 针对过程奖励模型因训练数据不平衡导致的虚假高评分偏见,提出PRISM框架,通过对比步骤级比较和前瞻策略生成的难负样本,结合难度感知课程学习优化,显著降低假阳性率并提升推理准确性。

AI中文摘要

过程奖励模型(PRM)通过提供步骤级反馈改善了推理的信用分配。然而,我们发现PRM中存在由步骤级训练数据严重不平衡引起的隐藏偏见。标准交叉熵训练放大了这种偏见,导致PRM过度奖励看似合理但错误的步骤,并产生高假阳性率。我们表明这些假阳性具有不对称的下游效应:假阴性主要减缓探索,而假阳性则主动将Best-of-N选择、引导解码和策略优化引导向有缺陷的推理。这表明PRM训练应从逐点标签拟合转向可靠的相对比较。为解决此问题,我们提出PRISM(改进步骤建模的精确排序),一种策略感知的PRM训练框架,从对比步骤级比较和由时间前瞻策略生成的难负样本中学习,无需新的人工标签。我们进一步使用难度感知课程来优化对比步骤间隔。在PRMBench和ProcessBench上,PRISM显著减少了假阳性(PRMBench上降低22%),并在强判别性PRM上提高了宏F1。当应用于策略优化和搜索任务(包括引导解码和Best-of-N选择)时,它持续提高了准确率(引导解码最高22%,Best-of-N最高33%)和鲁棒性。更广泛地说,可信的过程监督不仅仅是分配高奖励,而是为了正确的理由奖励正确的推理。

英文摘要

Process Reward Models (PRMs) improve credit assignment for reasoning by providing step-level feedback. However, we identify a hidden bias in PRMs caused by severe imbalance in step-level training data. Standard cross-entropy training amplifies this bias, causing PRMs to overcredit plausible but incorrect steps and produce high false-positive rates. We show that these false positives have an asymmetric downstream effect: false negatives mainly slow exploration, whereas false positives actively steer Best-of-N selection, guided decoding, and policy optimization toward flawed reasoning. This suggests that PRM training should shift from pointwise label fitting to reliable relative comparisons. To address this, we propose PRISM (Precision Ranking for Improved Step Modeling), a policy-aware PRM training framework that learns from contrastive step-level comparisons and hard negatives generated by a temporal lookahead strategy, requiring no new human labels. We further use a difficulty-aware curriculum to optimize the contrastive step margin. Across PRMBench and ProcessBench, PRISM substantially reduces false positives (22% on PRMBench) and improves macro F1 over strong discriminative PRMs. When applied to policy optimization and search tasks, including guided decoding and Best-of-N selection, it consistently improves accuracy (up to 22% for guided decoding and 33% for Best-of-N) and robustness. More broadly, trustworthy process supervision is not just about assigning high rewards, but about rewarding the right reasoning for the right reasons.

2606.09091 2026-06-09 cs.LG cs.CV 新提交

Stabilizing On-Policy Distillation for MLLM Reasoning with Global Normalization

利用全局归一化稳定面向多模态大语言模型推理的在线策略蒸馏

Dongze Hao, Zhiwei Jin, Chen Chen, Haonan Lu

发表机构 * OPPO AI Center(OPPO AI中心)

AI总结 针对策略蒸馏中异常状态导致梯度不稳定的问题,提出全局归一化蒸馏策略优化(GNDPO),通过将KL分数转化为批次级相对优势来稳定优化,提升多模态推理任务的训练鲁棒性和性能。

AI中文摘要

基于策略的蒸馏(OPD)最近成为一种重要的后训练范式。通过使用更强的教师模型为采样轨迹提供密集、细粒度的监督,OPD相比依赖稀疏二元或基于结果的环境反馈的可验证奖励强化学习(RLVR)具有明显优势。然而,朴素的token级蒸馏可能因异常状态中的幅度不匹配而遭受梯度不稳定性。为了解决这个问题,我们提出了全局归一化蒸馏策略优化(GNDPO),这是一种实用方法,通过将原始KL分数转化为批次级相对优势来稳定优化。这种归一化有效缓解了梯度爆炸,同时保留了token级指导的优势。实验结果表明,GNDPO在多模态推理任务中显著提高了训练鲁棒性和下游性能。代码已发布在 https://github.com/OPPO-Mente-Lab/GNDPO。

英文摘要

On-policy distillation (OPD) has recently emerged as an important post-training paradigm. By using a stronger teacher model to provide dense, fine-grained supervision for sampled trajectories, OPD offers a clear advantage over reinforcement learning with verifiable rewards (RLVR), which typically depends on sparse binary or outcome-based environmental feedback. However, naive token-level distillation can suffer from gradient instability, due to magnitude misalignment in outlier states. To address this issue, we propose Globally Normalized Distillation Policy Optimization (GNDPO), a practical method that stabilizes optimization by transforming raw KL scores into batch-level relative advantages. This normalization effectively mitigates gradient explosions while retaining the benefits of token-level guidance. Experimental results show that GNDPO substantially improves training robustness and downstream performance across multimodal reasoning tasks. The code is released at https://github.com/OPPO-Mente-Lab/GNDPO.
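The idea of turning raw KL scores into batch-level relative advantages can be sketched as a plain standardization over every token in the batch. The abstract does not give the exact formula, so the z-score form, the sign convention (lower KL yields a higher advantage), and the name `globally_normalized_advantages` are all assumptions made for illustration, not the released GNDPO code.

```python
import math


def globally_normalized_advantages(kl_scores, eps=1e-8):
    """Standardize per-token KL scores across the whole batch.

    kl_scores: list of sequences, each a list of per-token KL values.
    Returns a nested list of the same shape. A token with below-average
    KL gets a positive advantage (assumed sign convention).
    """
    flat = [k for seq in kl_scores for k in seq]
    mean = sum(flat) / len(flat)
    var = sum((k - mean) ** 2 for k in flat) / len(flat)
    std = math.sqrt(var)
    return [[-(k - mean) / (std + eps) for k in seq] for seq in kl_scores]


if __name__ == "__main__":
    # One outlier token (KL = 50.0) among otherwise small scores.
    batch = [[0.1, 0.2], [0.3, 50.0]]
    for row in globally_normalized_advantages(batch):
        print([round(a, 3) for a in row])
```

With this normalization the magnitude of any single advantage is bounded by the batch statistics instead of by the raw KL value, which is one way an outlier state could be kept from dominating the gradient.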

2606.09112 2026-06-09 cs.LG cs.AI 新提交

Hybridizing Equilibrium Propagation with Ising Machines for Efficient Energy-Based Learning

将平衡传播与伊辛机混合以实现高效的基于能量的学习

Chen-Rui Fan, Bo Lu, Xing-Yu Wu, Tie-Jun Wang, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University(信息工程大学先进计算与智能工程实验室) School of Physical Science and Technology, Beijing University of Posts and Telecommunications(北京邮电大学物理科学与技术学院)

AI总结 提出一种受伊辛动力学启发的平衡传播框架,通过扩展相空间动力学替代耗散Hopfield松弛,加速收敛、提高噪声鲁棒性,并在MNIST等数据集上实现与反向传播相当的性能。

AI中文摘要

人工智能的快速发展推动了深度神经网络的重大进步。然而,传统的基于GPU的训练仍然高度耗能,这促使人们探索物理动力学和兼容的基于能量的学习方案,例如平衡传播(EP)。然而,基于EP的训练常常由于相空间收缩而陷入局部最小值。本文介绍了一种受伊辛动力学启发的平衡传播框架,其中耗散的Hopfield松弛被具有共轭变量的扩展相空间动力学所取代。由此产生的训练范式保留了EP的局部两阶段学习规则,同时改变了神经状态达到平衡的物理路径。我们表明,这种动力学降低了有效能量壁垒,加速了收敛,提高了噪声鲁棒性,并在MNIST、FashionMNIST和CIFAR-10上训练了深度卷积Hopfield网络,性能与反向传播相当。

英文摘要

The rapid evolution of artificial intelligence has led to substantial advances in deep neural networks. Nonetheless, conventional GPU-based training remains highly energy-demanding, motivating the exploration of physical dynamics and compatible energy-based learning schemes, such as equilibrium propagation (EP). EP-based training, however, frequently suffers from convergence to local minima due to phase-space contraction. Here we introduce an Ising-dynamics-inspired equilibrium-propagation framework in which dissipative Hopfield relaxation is replaced by an extended phase-space dynamics with conjugate variables. The resulting training paradigm keeps the local two-phase learning rule of EP while changing the physical route by which neural states reach equilibrium. We show that this dynamics lowers effective energy barriers, accelerates convergence, improves noise robustness, and trains deep convolutional Hopfield networks on MNIST, FashionMNIST, and CIFAR-10 with performance comparable to backpropagation.

2606.09117 2026-06-09 cs.LG cs.AI 新提交

Optimizing Energy-based Neural Network Training with Coherent Ising Machine

利用相干伊辛机优化基于能量的神经网络训练

Chen-Rui Fan, Bo Lu, Zhi-Hong Zhang, Run-Qing Zhang, Jing-Wei Wen, Chuan Wang

发表机构 * School of Artificial Intelligence, Beijing Normal University(北京师范大学人工智能学院) Laboratory for Advanced Computing and Intelligence Engineering, Information Engineering University(信息工程大学先进计算与智能工程实验室) China Mobile (Suzhou) Software Technology Company Limited(中移(苏州)软件技术有限公司) School of Science, Beijing University of Posts and Telecommunications(北京邮电大学理学院)

AI总结 本文利用相干伊辛机结合平衡传播训练基于能量的神经网络,并通过Adam优化器加速收敛,展示了在深层架构和卷积操作上的可扩展性,为下一代AI硬件提供了物理框架。

AI中文摘要

尽管伊辛机作为伊辛模型的高级物理求解器,在组合优化和神经网络训练中具有应用潜力,但其在大规模神经网络中的可扩展性仍受限于硬件连接限制和次优的训练方法。在这项工作中,我们利用相干伊辛机(CIM)通过平衡传播训练基于能量的神经网络,实现了与现有软件实现相当的性能。我们进一步通过集成Adam优化器来求解Hopfield能量网络的基态,从而显著提高了收敛速度和求解精度。此外,我们展示了该方法在更深层网络架构和卷积操作上的可扩展性。我们的结果突显了CIM动力学作为训练复杂神经网络的可扩展平台的潜力,为通过模拟电路、光电子或集成光子学实现节能实现提供了途径。这项工作为下一代AI硬件开发建立了一个新颖的物理框架。

英文摘要

While Ising machines serve as advanced physical solvers for the Ising model, enabling applications in combinatorial optimization and neural network training, their scalability for large-scale neural networks remains constrained by hardware connectivity limitations and suboptimal training methodologies. In this work, we leverage a Coherent Ising Machine (CIM) to train an energy-based neural network using Equilibrium Propagation, achieving performance comparable to existing software-based implementations. We further enhance the algorithm by integrating the Adam optimizer to solve for the ground state of a Hopfield energy network, significantly improving convergence speed and solution accuracy. Additionally, we demonstrate the scalability of our approach across deeper network architectures and convolutional operations. Our results highlight the potential of CIM dynamics as a scalable platform for training complex neural networks, offering a pathway toward energy-efficient implementations via analog circuits, optoelectronics, or integrated photonics. This work establishes a novel physical framework for next-generation AI hardware development.

2606.09278 2026-06-09 cs.LG cs.AI 新提交

Internalizing Geometric Law: Learning from Solver Residuals for Precision-Critical Generation

内化几何法则:从求解器残差中学习以实现精度关键生成

Rafael Cabral, Pang Zixi, Ziyi Shou, Shen Xin

发表机构 * Huawei Celia Team(华为Celia团队)

AI总结 针对大语言模型在精度关键领域(如技术图表和机械设计)中的幻觉问题,提出可编程几何DSL PyGeoX及分层基准PyGeoX-Bench,并设计饱和加性奖励(SAR)方法,将奖励分解为有界逐约束项,解决异常梯度掩盖问题,使8B模型在基准上达到与更大前沿系统竞争的水平。

AI中文摘要

大语言模型在精度关键领域(如技术图表和机械设计)中经常出现幻觉,这些领域的输出必须满足严格的几何约束。我们研究从自然语言进行开放式几何合成:将自由形式的描述转化为精确的构造,其实体必须同时满足数十个相互作用的约束。为使这一问题易于处理,我们发布了PyGeoX,一个可编程的几何DSL,它将声明性约束编译为可微损失,以及PyGeoX-Bench,一个包含300个问题的分层套件,每个问题都有可验证的逐约束奖励。使用PyGeoX作为验证器,我们识别出一种称为异常梯度掩盖的失败模式:在全局范数奖励(任何通过单一范数聚合残差的方案,例如$\exp(-\mathrm{MSE})$)下,单个异常约束可以抵消所有其他约束的学习信号。为解决此问题,我们提出饱和加性奖励(SAR),它将奖励分解为有界的逐约束项,保留部分进展并确保即使在严重违反下也能保持一致的梯度。与基于MSE的奖励(几何求解器的自然基线)相比,SAR将困难层级求解率提高了2.3倍,由此得到的8B模型在该基准上与更大的前沿系统具有竞争力。我们在https://github.com/Huawei-AI4Math/PyGeoX发布引擎、基准和数据。

英文摘要

Large Language Models frequently hallucinate in precision-critical domains such as technical diagramming and mechanical design, where outputs must satisfy strict geometric constraints. We study open-ended geometric synthesis from natural language: translating free-form descriptions into precise constructions whose entities must simultaneously satisfy dozens of interacting constraints. To make this tractable, we release PyGeoX, a programmable geometric DSL that compiles declarative constraints into a differentiable loss, and PyGeoX-Bench, a stratified suite of 300 problems with per-constraint verifiable rewards. Using PyGeoX as a verifier, we identify a failure mode we call Outlier Gradient Masking: under global-norm rewards (any scheme that aggregates residuals through a single norm, for example, $\exp(-\mathrm{MSE})$), a single outlier constraint can nullify the learning signal across all others. To address this, we propose Saturating Additive Rewards (SAR), which decompose the reward into bounded per-constraint terms, preserving partial progress and ensuring consistent gradients even under severe violations. Against MSE-based rewards, the natural baseline for geometry solvers, SAR improves the hard-tier solving rate by $2.3\times$, and the resulting 8B model is competitive with much larger frontier systems on this benchmark. We release the engine, benchmark, and data at https://github.com/Huawei-AI4Math/PyGeoX.
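The Outlier Gradient Masking failure mode can be seen in a small numeric sketch. The global-norm reward below uses the $\exp(-\mathrm{MSE})$ form named in the abstract. The bounded per-constraint term $1/(1+r^2)$ and the `scale` parameter are assumptions chosen for illustration, since the abstract does not specify the exact SAR formula.

```python
import math


def global_norm_reward(residuals):
    """exp(-MSE): all residuals are aggregated through a single norm."""
    mse = sum(r * r for r in residuals) / len(residuals)
    return math.exp(-mse)


def saturating_additive_reward(residuals, scale=1.0):
    """Mean of bounded per-constraint terms, each in (0, 1].

    The 1 / (1 + (r / scale)^2) term is an assumed saturating shape.
    """
    terms = [1.0 / (1.0 + (r / scale) ** 2) for r in residuals]
    return sum(terms) / len(terms)


if __name__ == "__main__":
    # Three satisfied constraints plus one severe outlier, then the same
    # set with one of the good constraints slightly violated.
    better = [0.0, 0.0, 0.0, 100.0]
    worse = [0.5, 0.0, 0.0, 100.0]
    print(global_norm_reward(better), global_norm_reward(worse))
    print(saturating_additive_reward(better), saturating_additive_reward(worse))
```

In this sketch the global-norm reward underflows to zero for both inputs, so it cannot tell them apart, while the additive reward still ranks the set with fewer violations higher.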

2606.09539 2026-06-09 cs.LG 新提交

Efficient Traffic Prediction at Scale: A Systematic Study of STGCN Architectural Depth

大规模高效交通预测:STGCN架构深度的系统研究

Soban Nasir Lone, Mohamed Abouelela, Taeyoung Yu, Jiwon Kim, Constantinos Antoniou

发表机构 * Technical University of Munich(慕尼黑工业大学) The University of Queensland(昆士兰大学)

AI总结 系统研究STGCN架构深度对交通预测性能与计算效率的影响,发现单块结构在多数数据集上达到最优或接近最优性能,且计算成本显著低于标准双块结构。

Comments Accepted for publication in IEEE ITSC (2026)

AI中文摘要

时空图神经网络(STGNNs)已成为交通预测的主流方法,但其计算需求对智能交通系统(ITS)的实际部署构成挑战。尽管近期研究提出了STGNNs的高效替代方案,但一个基本问题仍未探索:这些架构本身是否过参数化?我们使用该领域最广泛采用的模型之一——时空图卷积网络(STGCN)来研究这一问题。通过在四个不同的交通数据集上进行系统实验,我们比较了1块、2块(标准)和3块STGCN变体。我们的发现表明,单块架构在四个数据集中的三个上实现了短期预测(10分钟)的最优性能,而在更长预测时长上仅带来边际退化(相对误差≤1.8%)。关键的是,与单块相比,双块变体导致CPU推理延迟增加61%,吞吐量降低37%——这对于资源受限的ITS部署是巨大的开销。三块架构没有提供有利的权衡,计算成本增加一倍以上,而相对改进小于0.5%。这些结果表明,默认的双块STGCN在许多应用中可能过参数化,这对部署交通预测系统的从业者和对注重效率的方法进行基准测试的研究人员都有影响。

英文摘要

Spatio-temporal graph neural networks (STGNNs) have become the dominant approach for traffic prediction, yet their computational requirements pose challenges for practical deployment in intelligent transportation systems (ITS). While recent work has proposed efficient alternatives to STGNNs, a fundamental question remains unexplored: are these architectures themselves over-parameterised? We examine this question using the Spatio-Temporal Graph Convolutional Network (STGCN), one of the most widely adopted models in this domain. Through systematic experiments across four diverse traffic datasets, we compare 1-block, 2-block (standard), and 3-block STGCN variants. Our findings reveal that the single-block architecture achieves optimal performance for short-term prediction (10 mins) on three of four datasets, while incurring only marginal degradation ($\leq$1.8% relative error) at longer horizons. Crucially, the 2-block variant incurs 61% higher CPU inference latency and 37% lower throughput relative to 1-block -- substantial overhead for resource-constrained ITS deployment. The 3-block architecture offers no favourable tradeoff, more than doubling computational cost for $<$0.5% relative improvement. These results suggest that the default 2-block STGCN may be over-parameterised for many applications, with implications for both practitioners deploying traffic prediction systems and researchers benchmarking efficiency-focused methods.

2606.09607 2026-06-09 cs.LG cs.AI 新提交

Closure-Validated Circuit Discovery in Attention Heads: Co-activation Proposes, Ablation Disposes

注意力头中的闭包验证电路发现:共激活提出,消融处置

Yongzhong Xu

发表机构 * GitHub

AI总结 通过共激活聚类提出注意力头电路假设,并用因果消融验证闭包性,发现该方法在密集模型有效但在MoE模型失效,表明共激活仅是电路提议而非确认。

Comments 22 pages, 3 figures

AI中文摘要

可解释性越来越将组件组(而非单个单元)作为基本对象,并提议通过聚类共激活统计来发现它们。我们询问这种廉价信号是否真正识别出注意力头电路。将稀疏自编码器聚类方法适配到注意力头——但通过因果消融而非重构进行验证——我们聚类头,然后运行闭包测试:消融发现的社区,并将每个示例的损伤与匹配随机对照进行比较。在两个密集的1B规模模型(Pythia 1B, OLMo 1B)和两种输入分布上,社区通过了闭包测试。在混合专家模型(OLMoE-1B-7B)中,路由条件聚类恢复了一个统计上真实的信号,但该信号未能通过闭包测试——消融反而改善了损失,方向错误。将闭包测试扩展到训练过程中,注意力目标选择性和参与比率在双向与功能解耦。我们得出结论:廉价信号是电路提议,而非确认的电路;闭包是区分二者的关键。

英文摘要

Interpretability increasingly treats groups of components, not individual units, as the basic object, and proposes to find them by clustering co-activation statistics. We ask whether such a cheap signal actually identifies an attention-head circuit. Adapting a sparse-autoencoder clustering recipe to attention heads -- but validating by causal ablation rather than reconstruction -- we cluster heads and then run a closure test: ablate the discovered community and compare per-example damage to matched-random controls. Across two dense 1B-scale models (Pythia 1B, OLMo 1B) and two input distributions, the communities pass closure. In a Mixture-of-Experts model (OLMoE-1B-7B), route-conditional clustering recovers a statistically real signal that nonetheless does not survive closure -- ablation improves loss, the wrong direction. Extending closure across training, attention-target selectivity and participation ratio decouple from function in both directions. We conclude that a cheap signal is a circuit proposal, not a confirmed circuit; closure is what separates them.

2606.09658 2026-06-09 cs.LG cs.AI 新提交

Muon Learns More Robust and Transferable Features than Adam

Muon 比 Adam 学习更鲁棒和可迁移的特征

Tianyu Ruan, Fengzhuo Zhang, Shuche Wang, Shihua Zhang

发表机构 * Yale University(耶鲁大学) National University of Singapore(新加坡国立大学) University of Chinese Academy of Sciences(中国科学院大学) Academy of Mathematics and Systems Science, CAS(中国科学院数学与系统科学研究院)

AI总结 本文通过鲁棒性和可迁移性视角,证明 Muon 优化器相比 Adam 和 SGD 能学习到更鲁棒、更可迁移的特征,并通过理论分析支持了经验发现。

AI中文摘要

Muon 最近已成为预训练大型语言模型(LLMs)和视觉分类器的最先进优化器。尽管其在效率上优于 Adam 和 SGD,但 Muon 在特征学习方面的优势仍不清楚。本文通过鲁棒性和可迁移性的视角研究了 Muon 的特征学习优势。首先,通过在损坏图像和文本上评估预训练模型,我们表明 Muon 学习到的特征在不同架构(包括 Transformer 和卷积神经网络(CNN))中始终比 Adam 和 SGD 学习到的特征更鲁棒。使用训练好的逐层探针,我们进一步表明这种鲁棒性优势体现在各层更大的 logit 间隔上。其次,通过在下游任务上训练线性分类器或从预训练参数微调完整模型,我们证明 Muon 学习到的特征比 Adam 和 SGD 学习到的特征更有效地迁移。这种可迁移性优势还通过有效秩衡量的各层隐藏状态的多样性得到进一步支持。最后,在一个具有多组件特征的代表性分类问题中,我们证明 Muon 比 Adam 和 SGD 获得更大的间隔和更高的有效秩,为我们的经验发现提供了理论支持。

英文摘要

Muon has recently emerged as a state-of-the-art optimizer for pretraining Large Language Models (LLMs) and vision classifiers. Despite its efficiency advantage over Adam and SGD, the feature-learning advantage of Muon remains unclear. This paper investigates Muon's feature-learning advantage through the lens of robustness and transferability. First, by evaluating pretrained models on corrupted images and texts, we show that features learned by Muon are consistently more robust than those learned by Adam and SGD across different architectures, including transformers and Convolutional Neural Networks (CNNs). Using trained layer-wise probes, we further show that this robustness advantage is reflected in larger logit margins across layers. Second, by training linear classifiers or fine-tuning full models from pretrained parameters on downstream tasks, we demonstrate that Muon-learned features transfer more effectively than those learned by Adam and SGD. This transferability advantage is further supported by the diversity of hidden states across layers, as measured by effective rank. Finally, in a representative classification problem with multi-component features, we prove that Muon attains larger margins and higher effective rank than Adam and SGD, providing theoretical support for our empirical findings.
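The abstract uses effective rank to measure the diversity of hidden states without saying which definition is meant. A common one, assumed here, is the exponential of the Shannon entropy of the normalized singular values. The sketch takes the singular values as given, so computing them from a hidden-state matrix is left to a linear-algebra library.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to one.

    Equals n when all n singular values are equal, and 1 when only one
    singular value is nonzero.
    """
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)


if __name__ == "__main__":
    print(effective_rank([1.0, 1.0, 1.0, 1.0]))    # evenly spread spectrum
    print(effective_rank([1.0, 0.0, 0.0, 0.0]))    # fully collapsed spectrum
    print(effective_rank([1.0, 0.5, 0.25, 0.125]))  # decaying spectrum
```

Under this definition, a higher value means the hidden states spread their variance over more directions.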

2606.09756 2026-06-09 cs.LG cond-mat.dis-nn 新提交

Perturbative Contrastive Physical Learning

扰动对比物理学习

Kyungeun Kim, Amanuel Anteneh, Israel Klich, Olivier Pfister, J. M. Schwarz

发表机构 * Department of Mathematics, University of British Columbia, Vancouver, BC Canada(不列颠哥伦比亚大学数学系) Department of Physics, University of Virginia, 382 McCormick Rd, Charlottesville, VA 22903, USA(弗吉尼亚大学物理系) Max Planck Institute for the Physics of Complex Systems, 01187 Dresden, Germany(马克斯·普朗克复杂系统物理研究所) Charles L. Brown Department of Electrical and Computer Engineering, University of Virginia, 351 McCormick Road, Charlottesville, VA 22903, USA(弗吉尼亚大学电气与计算机工程系) Department of Physics, Syracuse University, Syracuse, NY 13244, USA(雪城大学物理系)

AI总结 提出扰动对比物理学习(PCPL)框架,通过对比物理系统在不同条件下的响应实现学习,无需外部处理器或反向传播,在弹簧网络和光子电路中验证了分类与模拟乘法任务。

Comments 21 pages, 10 figures

AI中文摘要

对扰动的响应是理解物理系统的关键。通过比较系统在略微不同条件下的反应来对比这些响应的能力,提供了一种学习机制。在这里,我们引入了扰动对比物理学习(PCPL),这是一个通用框架,其中学习源于对输入、边界条件、参数或解释器函数进行受控变化所产生的物理状态之间的可测量对比。PCPL统一并扩展了先前的方法:平衡传播源于基于能量的系统中自由平衡和微扰平衡之间的对比,而频率传播对应于从正弦驱动、频率解调响应中提取的对比。我们表明,对比驱动的更新可以反映局部敏感性或全局逆问题结构,但不需要集中梯度计算。相反,有效的学习几何结构从系统自身的物理响应中隐式出现,使得学习行为能够在没有外部处理器或显式反向传播的情况下产生。我们在两个平台上演示了PCPL:(i)使用测量的位移和力更新键刚度的弹簧网络,以及(ii)通过x正交测量和雅可比矩阵的有限差分估计训练的连续变量光子电路。两个平台都成功学习了分类任务。我们进一步展示了连续变量光子电路可以被训练来实现模拟乘法,这标志着向更自主的物理学习系统迈出了一步。

英文摘要

Responses to perturbations are key to understanding physical systems. The ability to contrast such responses by comparing how a system reacts under slightly different conditions provides a mechanism for learning. Here, we introduce Perturbative Contrastive Physical Learning (PCPL), a general framework in which learning emerges from measurable contrasts between physical states produced by controlled changes to inputs, boundary conditions, parameters, or interpreter functions. PCPL unifies and extends prior approaches: Equilibrium Propagation is rooted in contrasts between free and nudged equilibria in energy-based systems, while Frequency Propagation corresponds to contrasts extracted from sinusoidally driven, frequency-demodulated responses. We show that contrast-driven updates can reflect either local sensitivities or global inverse-problem structure, yet do not require centralized gradient computation. Instead, effective learning geometry emerges implicitly from the system's own physical response, allowing learning behavior to arise without an external processor or explicit backpropagation. We demonstrate PCPL in two platforms: (i) spring networks that update bond stiffness using measured displacements and forces, and (ii) continuous-variable photonic circuits trained via x quadrature measurements and finite-difference estimates of the Jacobian. Both platforms successfully learn classification tasks. We further show that a continuous-variable photonic circuit can be trained to implement analog multiplication, illustrating a step toward more autonomous physical learning systems.

2606.09806 2026-06-09 cs.LG cs.AI 新提交

Topological Neural Operators

拓扑神经算子

Lennart Bastian, Samuel Leventhal, Mustafa Hajij, Tolga Birdal

发表机构 * Imperial College London(伦敦帝国学院) University of San Francisco(旧金山大学)

AI总结 提出拓扑神经算子(TNOs),利用离散外微积分在细胞复形上实现跨维度耦合,并通过分层结构提升长程信息传播,在PDE基准上优于现有算子。

详情
AI中文摘要

我们引入了拓扑神经算子(TNOs),这是一个在细胞复形上进行算子学习的原理性框架,将神经算子(NOs)从点和/或边上的函数提升到拓扑域。TNOs将数据表示为定义在不同维度细胞上的特征,并通过离散外微积分建模它们的相互作用,通过梯度、旋度和散度型算子实现显式的跨维度耦合。关键设计原则是将信息流向(由固定拓扑算子控制)与信息变换(学习得到)解耦,从而产生尊重物理量几何支撑并暴露守恒和相容性结构的模型。我们进一步提出了分层TNOs(HTNOs),它结合了学习到的粗粒度复形以传播长程和拓扑依赖的信息。我们的框架将现有NOs作为特例,提供了跨离散化的算子学习统一视角。在一系列PDE基准测试中,包括不规则几何流动问题,TNOs和HTNOs提高了精度;控制研究进一步隔离了原生高阶和拓扑结构带来的优势。项目页面:https://circle-group.github.io/research/TNO

英文摘要

We introduce Topological Neural Operators (TNOs), a principled framework for operator learning on cell complexes that lifts neural operators (NOs) from functions on points and/or edges to topological domains. TNOs represent data as features defined on cells of varying dimension and model their interactions through Discrete Exterior Calculus, enabling explicit cross-dimensional coupling via gradient-, curl-, and divergence-type operators. The key design principle is to decouple where information flows, as governed by fixed topological operators, from how it is transformed (which is learned), yielding models that respect the geometric support of physical quantities and expose conservation and compatibility structure. We further propose Hierarchical TNOs (HTNOs), which incorporate learned coarse complexes to propagate long-range and topology-dependent information. Our framework subsumes existing NOs as a special case, providing a unified perspective on operator learning across discretizations. Across a range of PDE benchmarks, including irregular-geometry flow problems, TNOs and HTNOs improve accuracy; controlled studies further isolate the benefits of native higher-rank and topological structure. Project page: https://circle-group.github.io/research/TNO
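
To make the "gradient-, curl-, and divergence-type operators" concrete, here is a small sketch of discrete exterior calculus on a single oriented triangle. The incidence matrices `d0` and `d1` are the standard coboundary operators; the complex and the vertex values are illustrative assumptions, not taken from the paper.

```python
def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def matmul(a, b):
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Vertices 0, 1, 2. Edges: e0 = (0 -> 1), e1 = (1 -> 2), e2 = (0 -> 2).
# d0 maps vertex values to edge values (a discrete gradient).
d0 = [
    [-1, 1, 0],   # e0
    [0, -1, 1],   # e1
    [-1, 0, 1],   # e2
]
# d1 maps edge values to face values (a discrete curl).
# The face is oriented 0 -> 1 -> 2 -> 0, so it traverses e0, e1 and reversed e2.
d1 = [[1, 1, -1]]

vertex_values = [2.0, 5.0, 4.0]
edge_gradient = matvec(d0, vertex_values)     # differences along each edge
curl_of_gradient = matvec(d1, edge_gradient)  # circulation around the face
```

The product `d1 * d0` is the zero matrix, which is the discrete version of "curl of a gradient vanishes". Operators of this kind are fixed by the topology, which matches the abstract's split between where information flows (fixed) and how it is transformed (learned).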

2606.07560 2026-06-09 cs.CL cs.LG 交叉投稿

Function-Vector Heads Are Two Populations: Writers and Cancellers in In-Context Learning

函数向量头是两个群体:上下文学习中的写入者和取消者

Han-yu Wang

发表机构 * The University of Hong Kong(香港大学)

AI总结 发现函数向量头并非同质群体,而是分为写入者和取消者两个子群体,分别推高和压低规则正确logit,且仅基于幅度的排名无法区分二者。

详情
AI中文摘要

函数向量头(Todd et al., 2024)通常通过其对上下文规则任务的因果贡献幅度来识别,隐含假设顶级集合是同一功能类。这一假设不成立。我们用保留符号的标准(改进的DLA + 置换FDR)替代仅幅度排名,并通过路径修补验证每个候选。然后,FV头群体分裂为两个对立的子群体:写入者推高规则正确logit;取消者压低它。一个四条件规范判定在三个模型家族和六个Pythia规模的13/15个单元中成立,符号置换检验在5/6个主要单元中拒绝同质性。仅幅度排名无法看到这种结构:Todd的前20个在层次任务中捕获了64%的取消者但仅4%的写入者,在模块任务中捕获了59%的写入者但仅8%的取消者。我们在所有27个(取消者,单元,头)对上排除了六种人为解释:归纳重叠、汇点、通用重要性、秩1复制抑制、V级联和最近邻非FV控制。零消融取消者在6/6个主要单元中产生+0.13到+0.29 nats的logit增益,方向一致地带来+2到+7个百分点的准确率提升。

英文摘要

Function-vector (FV) heads (Todd et al., 2024) are typically identified by the magnitude of their causal contribution to in-context rule tasks, under the implicit assumption that the top set is a homogeneous functional class. This assumption fails. We replace magnitude-only ranking with a sign-preserving criterion (refined DLA + permutation FDR) and validate each candidate by path patching. The FV head population then splits into two opposing sub-populations: writers push the rule-correct logit up; cancellers push it down. A four-condition canonical verdict holds in 13/15 cells across three model families and six Pythia scales, and a sign-shuffle rejects homogeneity in 5/6 main cells. The structure is invisible to magnitude-only ranking: Todd's top-20 captures 64% of cancellers but only 4% of writers on the hierarchical task, and 59% of writers but only 8% of cancellers on the modular task. We rule out six artefact accounts on all 27 canceller (cell, head) pairs: induction overlap, sinks, generic importance, rank-1 copy-suppression, V-cascade, and rank-nearest non-FV controls. Zero-ablating cancellers yields +0.13 to +0.29 nats of logit gain in 6/6 main cells with a directionally consistent +2 to +7 pp accuracy effect.
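
A toy illustration of why magnitude-only ranking hides the writer/canceller split. The head names and signed effects below are invented for the example; they are not measurements from the paper.

```python
# Hypothetical signed effect of each head on the rule-correct logit.
head_effects = {"h0": 0.9, "h1": -0.8, "h2": 0.1, "h3": -0.7, "h4": 0.05}

# Magnitude-only ranking: sort by |effect| and keep the top 3.
top_by_magnitude = sorted(
    head_effects, key=lambda h: abs(head_effects[h]), reverse=True
)[:3]

# Sign-preserving view of the same top set.
writers = [h for h in top_by_magnitude if head_effects[h] > 0]
cancellers = [h for h in top_by_magnitude if head_effects[h] < 0]
```

Here the top-3 set contains one writer and two cancellers, yet a ranking by absolute value reports them as a single group.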

2606.07647 2026-06-09 cs.CV cs.CL cs.LG 交叉投稿

Steer Where It Matters: Token-Level Visual-Sensitivity Steering for LVLMs Hallucination Mitigation

关键位置引导:基于令牌级视觉敏感度引导的LVLMs幻觉缓解

Ruipeng Zhang, Zhihao Li, C. L. Philip Chen, Tong Zhang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出令牌级视觉敏感度引导(TLVS)方法,通过提取令牌级引导向量并自适应调整引导强度,仅在关键解码步骤抑制幻觉,在多个基准上优于现有方法。

详情
AI中文摘要

大型视觉语言模型(LVLMs)取得了快速进展并部署在各种应用中,但幻觉仍然是一个主要挑战。激活引导因其训练开销小和推理时可控制而具有吸引力。然而,我们发现,在自回归解码过程中,视觉条件对令牌预测的影响是稀疏且局部的,许多现有方法对整个序列的图像与非图像差异进行平均,稀释了这些关键信号,导致引导方向信噪比低。此外,许多现有方法应用固定的引导强度,错误分配干预预算,过度扰动非关键令牌,并可能导致不稳定。为了解决这些限制,我们提出了令牌级视觉敏感度引导(TLVS)用于幻觉缓解。我们的方法首先提取令牌级引导向量并进行细化,然后仅在关键位置应用细粒度的、视觉敏感度自适应的引导。这种轻量级、即插即用的机制只需要最少的校准训练,可以应用于各种视觉语言模型。它在每个解码步骤调节引导强度,选择性地抑制易产生幻觉的片段,同时保留基于证据的内容。我们在多个基准上评估TLVS,包括POPE、AMBER、CHAIR(COCO)、MMHal和HallusionBench,证明其相对于先前引导方法的一致改进。

英文摘要

Large vision-language models (LVLMs) have made rapid advancements and are deployed across various applications, yet hallucinations remain a major challenge. Activation steering is appealing due to its minimal training overhead and controllability at inference time. However, we find that during autoregressive decoding, visual conditioning affects token prediction sparsely and locally across decoding steps. Many existing methods average image-versus-no-image differences over the entire sequence, which dilutes these critical signals and yields steering directions with a low signal-to-noise ratio. Additionally, many existing methods apply a fixed steering strength, which misallocates the intervention budget, over-perturbs non-critical tokens, and can cause instability. To address these limitations, we propose Token-Level Visual-Sensitivity Steering (TLVS) for hallucination mitigation. Our approach first extracts token-level steering vectors and refines them, and then applies fine-grained, visual-sensitivity-adaptive steering only where it matters. This lightweight, plug-and-play mechanism requires only minimal training for calibration and can be applied across diverse vision-language models. It modulates the steering strength at each decoding step, selectively suppressing hallucination-prone spans while preserving evidence-grounded content. We evaluate TLVS on several benchmarks, including POPE, AMBER, CHAIR (COCO), MMHal, and HallusionBench, demonstrating consistent improvements over previous steering methods.
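
The dilution argument can be checked with simple arithmetic. The sketch below assumes a hypothetical sequence of per-token image-versus-no-image difference magnitudes in which only one decoding step is visually sensitive; the values are illustrative.

```python
# Hypothetical per-token difference magnitudes: one sensitive token out of 8.
token_differences = [0.0, 0.0, 0.0, 4.0, 0.0, 0.0, 0.0, 0.0]

# Sequence-level averaging spreads the single strong signal over all tokens.
sequence_average = sum(token_differences) / len(token_differences)

# Token-level extraction keeps the signal at the step where it occurs.
peak_index = max(range(len(token_differences)), key=lambda i: token_differences[i])
peak_value = token_differences[peak_index]

dilution_factor = peak_value / sequence_average
```

With one sensitive token in eight, averaging shrinks the signal by a factor of eight, which is the motivation the abstract gives for extracting steering vectors per token.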

2606.07657 2026-06-09 cs.NE cs.LG 交叉投稿

QDS-SNN: Energy-efficient Quantum Deeply-Supervised Spiking Neural Network Algorithm for Traffic Sign Recognition

QDS-SNN:用于交通标志识别的节能量子深度监督脉冲神经网络算法

Zhiguo Qu, Keqi Li, Le Sun, Wenjie Liu, Yimin Yu, Saif Al-Kuwari, Ahmed Farouk

发表机构 * School of Computer Science, School of Software, Nanjing University of Information Science and Technology(南京信息工程大学计算机学院、软件学院)

AI总结 提出量子深度监督脉冲神经网络(QDS-SNN),结合量子神经网络与时空自适应LIF神经元,在GTSRB数据集上以6个时间步达到99.72%准确率,能耗降低55.77%。

Comments 13 pages, 10 Figures, 8 Tables

详情
AI中文摘要

交通标志识别对于智能交通和自动驾驶至关重要,因为它可以提高驾驶效率并确保道路安全。然而,传统的识别方法基于大规模数据集和密集计算,限制了其实时应用性。脉冲神经网络(SNN)由于其时空处理能力,提供了一种受生物启发的节能替代方案,但在训练过程中存在信息丢失和梯度消失的问题。为了克服这些限制,本研究提出了一种量子深度监督脉冲神经网络(QDS-SNN),它集成了量子神经网络(QNN)以实现高效、低功耗的深度监督。利用量子叠加和纠缠,QNN能够实现表达性表示和并行计算,从而在不影响能效的情况下提升性能。所提出的QDS-SNN包含一个时空自适应LIF(TSA-LIF)神经元和一个量子辅助分类器模块(QACM),以缓解梯度问题并提高训练效果。本研究在PennyLane量子模拟平台上进行实验,结果表明,QDS-SNN在仅6个时间步内,在GTSRB数据集上达到了99.72%的准确率——比MS-ResNet基线高出1.32%,同时能耗降低了55.77%。在TSRD数据集中,它达到了97.90%的准确率,同时能耗降至基线的52.68%。这些结果表明,QDS-SNN为智能交通系统中的交通标志识别提供了一种高性能、节能的解决方案。

英文摘要

Traffic sign recognition is crucial for intelligent transportation and autonomous driving, as it can improve driving efficiency and ensure road safety. However, traditional recognition methods rely on large datasets and intensive computation, which limits their real-time applicability. Spiking Neural Networks (SNNs) offer a biologically inspired, energy-efficient alternative due to their spatiotemporal processing capabilities, but suffer from information loss and vanishing gradients during training. To overcome these limitations, this study proposes a Quantum Deeply-Supervised Spiking Neural Network (QDS-SNN) that integrates Quantum Neural Networks (QNNs) for efficient, low-power deep supervision. Using quantum superposition and entanglement, QNNs enable expressive representations and parallel computation, thereby enhancing performance without compromising energy efficiency. The proposed QDS-SNN incorporates a temporally and spatially adaptive LIF (TSA-LIF) neuron and a quantum-assisted classifier module (QACM) to mitigate gradient issues and improve training effectiveness. Experiments are conducted on the PennyLane quantum simulation platform. The results show that QDS-SNN achieves 99.72% accuracy on the GTSRB dataset in only 6 time steps, outperforming the MS-ResNet baseline by 1.32% while reducing energy consumption by 55.77%. On the TSRD dataset, it achieves 97.90% accuracy while reducing energy use to 52.68% of the baseline. These results demonstrate that QDS-SNN offers a high-performance, energy-efficient solution for traffic sign recognition in intelligent transportation systems.
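
For readers unfamiliar with spiking neurons, the following is a minimal sketch of a plain leaky integrate-and-fire (LIF) neuron run for 6 time steps. It is the textbook LIF with a hard reset, not the paper's TSA-LIF variant; the time constant, threshold, and input current are assumed values.

```python
def simulate_lif(input_current, tau=2.0, v_threshold=1.0, v_reset=0.0):
    """Simulate a discrete-time LIF neuron and return its binary spike train."""
    v = v_reset
    spikes = []
    for current in input_current:
        # Leaky integration: the membrane potential relaxes toward the input.
        v = v + (current - (v - v_reset)) / tau
        if v >= v_threshold:
            spikes.append(1)
            v = v_reset  # hard reset after a spike
        else:
            spikes.append(0)
    return spikes


# Constant input over 6 time steps.
spikes = simulate_lif([1.5] * 6)
```

With these settings the potential reaches 0.75 after one step and 1.125 after two, so the neuron fires on every second step.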

2606.07675 2026-06-09 eess.IV cs.CV cs.LG 交叉投稿

The Need for Neural ISP in the Small-Pixel Era: How Shrinking Pixels Push Optics to the Limit and Neural Restoration Pushes Back

小像素时代对神经ISP的需求:像素缩小将光学推向极限,神经恢复则逆势而上

Jingxi Li, Neerja Aggarwal, Laurent Gudemann, Shivansh Rao, Vishal Vinod, Tom E. Bishop, Ziv Attar

发表机构 * Glass Imaging Inc(玻璃成像公司)

AI总结 针对智能手机小像素长焦模块中光学像差限制分辨率的问题,提出基于学习的神经ISP恢复图像,在0.35微米像素下实现2.5-3倍分辨率提升,表明神经ISP可替代复杂光学设计。

详情
AI中文摘要

智能手机长焦摄像头正接近“长焦物理墙”:随着像素间距缩小至亚0.5微米,光学系统仍受几何像差限制,导致分辨率收益递减。传统图像信号处理器(ISP)无法消除这些像差,因为它们通过局部、分阶段处理运行,没有明确的点扩散函数(PSF)模型。我们展示了基于学习的神经ISP用于图像恢复,通过训练底层退化,逆转了分阶段流水线无法处理的问题,将小像素设计转化为净优势。我们通过一个代表性长焦模块的受控模拟进行研究,评估了五种配置(0.35--0.75微米像素间距)。光圈按比例缩放以保持每像素信噪比和衍射光斑尺寸固定,从而隔离几何像差和空间采样。传统ISP随像素减小仅适度改进,而神经ISP显著扩展:在0.35微米时,其MTF50(垂直)达到745 cycles/mm,比传统ISP分辨率提升2.5-3倍,LPIPS从0.244显著改善至0.151,而传统结果保持相对平坦。在低信噪比扩展中(0.35微米下每帧15 dB突发),多帧神经ISP恢复的性能接近亮光单帧基线,而多帧传统ISP没有显示出有意义的改进——表明小像素下的传统流水线受限于未校正的PSF模糊而非噪声。这些结果指向一种设计理念:神经ISP通过校正残余光学像差而非要求日益复杂的光学系统,实现高分辨率长焦模块。

英文摘要

Smartphone telephoto cameras are approaching a "telephoto physics wall": as pixel pitches shrink toward sub-0.5 micron, the optics remain limited by geometric aberrations, leading to diminishing returns on resolution. Traditional Image Signal Processors (ISPs) cannot eliminate these aberrations, because they operate through local, stage-wise processing with no explicit model of the underlying point spread function (PSF). We demonstrate how a learning-based Neural ISP for image restoration, trained on the underlying degradations, inverts what stage-wise pipelines cannot, turning small-pixel designs into a net advantage. We investigate this through a controlled simulation of a representative telephoto module, evaluating five configurations (0.35--0.75 micron pixel pitch). The aperture is scaled proportionally to keep per-pixel SNR and diffraction spot size fixed, thereby isolating geometric aberration and spatial sampling. While the traditional ISP improves only modestly with smaller pixels, the Neural ISP scales substantially: at 0.35 micron it reaches 745 cycles/mm MTF50 (vertical), a 2.5--3x resolution improvement over the traditional ISP, and LPIPS improves significantly from 0.244 to 0.151 while traditional results stay comparatively flat. In a low-SNR extension (15 dB per-frame bursts at 0.35 micron), a multi-frame Neural ISP recovers performance close to the bright-light single-frame baseline, whereas a multi-frame traditional ISP shows no meaningful improvement -- indicating that traditional pipelines at small pixels are bottlenecked by uncorrected PSF blur rather than by noise. These results point to a design philosophy in which Neural ISPs enable high-resolution telephoto modules by correcting residual optical aberrations rather than requiring increasingly complex optics.
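
To put the 745 cycles/mm figure in context, the sampling (Nyquist) limit of a sensor follows directly from its pixel pitch. The helper below is a back-of-the-envelope calculation added for orientation; it ignores colour filter arrays and any other detail of the simulated module.

```python
def nyquist_cycles_per_mm(pixel_pitch_um):
    """Nyquist frequency of a sensor: one cycle needs at least two pixels."""
    pitch_mm = pixel_pitch_um / 1000.0
    return 1.0 / (2.0 * pitch_mm)


nyquist_small = nyquist_cycles_per_mm(0.35)   # smallest pitch in the study
nyquist_large = nyquist_cycles_per_mm(0.75)   # largest pitch in the study
mtf50_fraction = 745.0 / nyquist_small        # reported MTF50 relative to Nyquist
```

The reported MTF50 sits at roughly half of the 0.35 micron Nyquist limit (about 1429 cycles/mm) and is above the roughly 667 cycles/mm Nyquist limit of a 0.75 micron pitch, so that resolution is only reachable with the smaller pixels.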

2606.07720 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Why Limit the Residual Stream to Layers and Not Tokens? Persistent Memory for Continuous Latent Reasoning

为什么将残差流限制在层而不是令牌?用于连续潜在推理的持久记忆

Mujtaba Farhan, Maheep Chaudhary

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对CoCoNuT在潜在空间推理中因中间隐藏状态被覆盖导致概念瓶颈的问题,提出AGCLR模型,通过门控概念流持久记忆机制,在GSM8K、HotpotQA和ProsQA上取得一致提升。

详情
AI中文摘要

大型语言模型(LLMs)在数学和多跳规划任务上展现了卓越的推理能力。CoCoNuT(连续思维链)范式通过使模型能够在潜在空间中进行推理,同时探索多个推理路径,而不是早期就承诺单一链条,从而扩展了这一能力。然而,我们识别出一个我们称之为“概念瓶颈”的限制。在每次推理过程中,中间隐藏状态被覆盖,导致模型随着推理深度增加而丢失早期步骤中计算出的关键事实。我们在经验上观察到了这一点。在HotpotQA上,原始CoCoNuT(10.4% EM)未能超过CoT基线(11.0% EM),并且在GSM8K上随着课程深度增加性能下降。为了解决这个问题,我们提出了AGCLR(自适应门控连续潜在推理),它通过一个门控概念流增强了CoCoNuT:这是一个跨所有推理过程保持的持久残差记忆,由三个学习到的门控制,即一个将中间事实提交到记忆的写入门,一个检索相关先前状态的读取门,以及一个修剪不相关上下文的遗忘门。在使用GPT-2作为基础模型在GSM8K、HotpotQA和ProsQA上进行评估时,AGCLR在所有类型的数据集上实现了一致的改进。随着课程深度的增加,性能差距进一步扩大,直接解决了概念瓶颈。代码可在https://anonymous.4open.science/r/JJJJ/README.md获取。

英文摘要

Large language models (LLMs) have demonstrated remarkable reasoning abilities on mathematical and multi-hop planning tasks. The CoCoNuT (Chain of Continuous Thought) paradigm (Hao et al., 2024) extends this by enabling models to reason in latent space, exploring multiple reasoning paths simultaneously rather than committing to a single chain early on. However, we identify a limitation we term the concept bottleneck. At each reasoning pass, intermediate hidden states are overwritten, causing the model to lose critical facts computed in earlier steps as reasoning depth increases. We observe this empirically. On HotpotQA, vanilla CoCoNuT (10.4% EM) fails to improve over the CoT baseline (11.0% EM), and performance degrades with curriculum depth on GSM8K. To address this, we propose AGCLR (Adaptive Gated Continuous Latent Reasoning), which augments CoCoNuT with a Gated Concept Stream, a persistent residual memory maintained across all reasoning passes and controlled by three learned gates: a write gate that commits intermediate facts to memory, a read gate that retrieves relevant prior states, and a forget gate that prunes irrelevant context. Evaluated on GSM8K, HotpotQA, and ProsQA using GPT-2 as our base model, AGCLR achieves consistent improvements across all types of datasets, with the performance gap compounding as curriculum depth increases, which directly addresses the concept bottleneck. Code available at https://anonymous.4open.science/r/JJJJ/README.md
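
The abstract names write, read, and forget gates but does not give their equations. The sketch below is one plausible reading under stated assumptions: scalar sigmoid gates, a memory vector the same size as the hidden state, and an additive readout. The function `gated_memory_step` and its update rule are hypothetical, not AGCLR's actual formulation.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def gated_memory_step(memory, hidden, write_logit, read_logit, forget_logit):
    """One reasoning pass over a persistent memory (assumed update rule)."""
    write = sigmoid(write_logit)    # how much of the new state to commit
    read = sigmoid(read_logit)      # how much memory to mix back in
    forget = sigmoid(forget_logit)  # how much old memory to prune
    new_memory = [(1.0 - forget) * m + write * h for m, h in zip(memory, hidden)]
    output = [h + read * m for h, m in zip(hidden, new_memory)]
    return new_memory, output


# With the forget and write gates closed, an earlier fact survives a pass
# that would otherwise overwrite it, and the open read gate exposes it.
memory = [1.0, -2.0]
kept_memory, output = gated_memory_step(
    memory, hidden=[0.5, 0.5], write_logit=-50.0, read_logit=50.0, forget_logit=-50.0
)
```

The point of the example is the contrast with a plain overwrite: the memory persists across passes unless a gate decides otherwise.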

2606.08132 2026-06-09 cs.CV cs.LG 交叉投稿

Phase Marginalization for Patch-Grid Instability in Vision Transformers

视觉Transformer中补丁网格不稳定性的相位边缘化

Oğuzhan Ercan

发表机构 * Scientific and Technological Research Council of Türkiye(土耳其科学技术研究委员会)

AI总结 提出相位边缘化方法,通过评估结构化补丁网格相位、逆对齐密集输出并在原始图像坐标系聚合,消除视觉Transformer中补丁网格相位引起的预测不稳定性,无需训练即可提升分割、深度和匹配性能。

Comments 13 pages, 1 figure, 9 tables

详情
AI中文摘要

视觉Transformer在固定的补丁网格上操作,这可能导致密集预测中相位依赖的不稳定性:改变补丁划分会改变像素可用的令牌证据,尤其是在边界附近。我们将补丁网格相位形式化为一个干扰变量,并提出相位边缘化,一种事后边缘化方法,该方法评估结构化的补丁网格相位,逆对齐密集输出,并在原始图像坐标系中聚合它们。中心变体,K=4的均匀相位边缘化,无需训练,并在测量的分割、深度和局部匹配设置上优于规范的K=1基线。在受控的Cityscapes实验中,均匀相位边缘化相比基于通用移位的四次前向测试时增强(TTA)提供了适度的计算匹配优势(在最强测试的通用行上平均交并比提高0.31)。一项扩展研究进一步表明,K=4是一个实用的成本-精度权衡:K=8基本不变,K=16在更高延迟下增加很少精度。这些结果将补丁网格相位定位为可测量的干扰变量,并将相位边缘化定位为密集ViT预测的简单诊断和事后边缘化基线。

英文摘要

Vision Transformers operate on fixed patch grids, which can introduce phase-dependent instability for dense prediction: changing the patch partition can change the token evidence available to a pixel, especially near boundaries. We formalize patch-grid phase as a nuisance variable and propose Phase Marginalization, a post-hoc marginalization method that evaluates structured patch-grid phases, inverse-aligns dense outputs, and aggregates them in the original image coordinate system. The central variant, Uniform Phase Marginalization with K = 4, is training-free and improves over the canonical K = 1 baseline across measured segmentation, depth, and local matching settings. In a controlled Cityscapes experiment, Uniform Phase Marginalization provides a modest compute-matched advantage over generic shift-based four-forward test-time augmentation (TTA) (+0.31 mean Intersection-over-Union over the strongest tested generic row). A scaling study further shows that K = 4 is a practical cost-accuracy trade-off: K = 8 is essentially unchanged and K = 16 adds little accuracy at much higher latency. These results position patch-grid phase as a measurable nuisance variable and Phase Marginalization as a simple diagnostic and post-hoc marginalization baseline for dense ViT prediction.
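
The evaluate, inverse-align, and aggregate loop can be sketched on a 1D signal. The toy `patch_mean_model` stands in for a patch-based network, and cyclic shifts stand in for the paper's structured phases; both are assumptions made to keep the example self-contained.

```python
def roll(seq, shift):
    """Cyclically shift a list to the right by `shift` positions."""
    shift %= len(seq)
    return seq[-shift:] + seq[:-shift] if shift else list(seq)


def patch_mean_model(signal, patch=4):
    """Toy dense predictor: every position outputs the mean of its patch."""
    out = []
    for start in range(0, len(signal), patch):
        block = signal[start:start + patch]
        out.extend([sum(block) / len(block)] * len(block))
    return out


def phase_marginalize(model, signal, shifts):
    """Run the model at several grid phases, undo each shift, and average."""
    total = [0.0] * len(signal)
    for s in shifts:
        shifted_out = model(roll(signal, s))
        aligned = roll(shifted_out, -s)  # back to original coordinates
        total = [t + a for t, a in zip(total, aligned)]
    return [t / len(shifts) for t in total]


signal = [0.0] * 6 + [1.0] * 6          # a step edge that falls inside a patch
baseline = patch_mean_model(signal)     # single phase (K = 1)
marginalized = phase_marginalize(patch_mean_model, signal, shifts=[0, 1, 2, 3])
```

The single-phase output has only three distinct levels because the edge is averaged within one patch; averaging over four phases removes the dependence on where the grid happens to fall.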

2606.08203 2026-06-09 math.NA cs.LG cs.NA stat.ML 交叉投稿

Stable and Scalable Probabilistic Numerical Solvers for Stiff and High-Dimensional ODEs

适用于刚性和高维ODE的稳定且可扩展的概率数值求解器

Nathanael Bosch

发表机构 * EPFL(洛桑联邦理工学院)

AI总结 针对刚性和高维常微分方程,提出两种互补策略:无矩阵更新步骤实现线性扩展,以及迭代重线性化提升稳定性,从而开发出稳定且可扩展的概率求解器。

详情
AI中文摘要

基于滤波的常微分方程概率数值求解器已被确立为一种灵活高效的仿真框架,具有内置的数值不确定性量化。然而,刚性和高维问题仍然是一个挑战,因为当前方法要么稳定但计算复杂度为ODE维度的三次方,要么线性扩展但牺牲稳定性。在本文中,我们弥合了这一差距,开发了既稳定又可扩展的概率ODE求解器。我们提出了两种互补策略。首先,我们开发了一种无矩阵更新步骤,利用雅可比向量积、迭代线性求解器和随机协方差估计来实现线性扩展,同时保持稳定性。其次,我们提出迭代重线性化以在不牺牲可扩展性的情况下进一步提高稳定性,将概率ODE求解器转变为完全隐式方法。我们在各种刚性和高维问题上评估了所提出的方法,并展示了相对于现有概率求解器在稳定性和可扩展性上的改进。

英文摘要

Filtering-based probabilistic numerical solvers for ordinary differential equations (ODEs) have been established as a flexible and efficient simulation framework with built-in numerical uncertainty quantification. However, problems that are both stiff and high-dimensional remain a challenge, as current methods are either stable and have cubic cost in the ODE dimension, or scale linearly at the expense of stability. In this paper, we close this gap and develop probabilistic ODE solvers that are both stable and scalable. We propose two complementary strategies. First, we develop a matrix-free update step that uses Jacobian-vector products, iterative linear solvers, and stochastic covariance estimation to enable linear scaling, all while retaining stability. Second, we propose iterative re-linearization to further improve stability without sacrificing scalability, turning probabilistic ODE solvers into fully implicit methods. We evaluate the proposed approaches on a range of stiff and high-dimensional problems and demonstrate improved stability and scalability over established probabilistic solvers.
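
The matrix-free idea (apply the Jacobian as a function and solve the linear system iteratively) can be shown on a deterministic implicit Euler step. This is a plain numerical sketch, not the paper's probabilistic solver; the stiff linear test problem, step size, and function names are assumptions.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def jacobian_vector_product(v, stiffness=1000.0):
    """Apply J (a scaled 1D second-difference operator) without forming it."""
    n = len(v)
    out = []
    for i in range(n):
        left = v[i - 1] if i > 0 else 0.0
        right = v[i + 1] if i < n - 1 else 0.0
        out.append(stiffness * (left - 2.0 * v[i] + right))
    return out


def conjugate_gradient(apply_a, b, tol=1e-10, max_iter=200):
    """Solve A x = b for symmetric positive definite A given only as a function."""
    x = [0.0] * len(b)
    r = list(b)  # residual for the initial guess x = 0
    p = list(r)
    rs = dot(r, r)
    for _ in range(max_iter):
        if rs ** 0.5 < tol:
            break
        ap = apply_a(p)
        alpha = rs / dot(p, ap)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * api for ri, api in zip(r, ap)]
        rs_new = dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def apply_system(v, h=0.1):
    # Implicit Euler for x' = J x requires solving (I - h J) x_new = x_old.
    return [vi - h * jvi for vi, jvi in zip(v, jacobian_vector_product(v))]


x_old = [1.0, 0.0, 0.0, 0.0, 0.0]
x_new = conjugate_gradient(apply_system, x_old)
residual = [a - b for a, b in zip(apply_system(x_new), x_old)]
```

Memory and per-iteration cost scale linearly with the state dimension because the matrix is never stored, which is the trade-off the abstract targets.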

2606.08327 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Chiaroscuro Attention: Spending Compute in the Dark

明暗对比注意力:在黑暗中投入计算

Prateek Kumar Sikdar

发表机构 * Accenture(埃森哲)

AI总结 提出CHIAR-Former,一种基于谱熵路由的混合Transformer,通过DCT谱混合与全注意力互补,在WikiText-103上以62.5%更少注意力FLOPs实现PPL 36.54,较全注意力基线提升45%。

Comments 8 pages, 6 figures, 3 tables

详情
AI中文摘要

标准Transformer在每一层和每个标记上统一应用自注意力,无论输入是否需要动态的跨标记交互。我们提出CHIAR-Former(明暗对比注意力),一种4层混合Transformer,它基于每个标记的谱熵(一种理论上合理的复杂度信号)将每个标记路由到三个算子之一:DCT谱混合、RBF核混合或全自注意力。通过在WikiText-103上的系统消融,我们发现路由崩溃:路由器持续拒绝RBF而偏向DCT和注意力,表明谱混合和动态注意力是互补且充分的。一个专门设计的仅DCT+注意力变体在WikiText-103上达到验证集PPL 36.54——相比全注意力基线(PPL 66.62)提升45%,同时减少62.5%的注意力FLOPs。我们将评估扩展到WikiText-2、IMDB情感分类和合成ListOps操作,建立了一个清晰的操作区间:CHIAR-Former在大型自然文本上表现出色,其中标记多样性支持谱专门化,而全注意力在小数据集和合成模式匹配任务上仍保持优势。这些发现——无论是成功还是失败——共同定义了谱路由何时以及为何值得使用。

英文摘要

Standard transformers apply self-attention uniformly at every layer and token, regardless of whether the input requires dynamic cross-token interaction. We propose CHIAR-Former (Chiaroscuro Attention), a 4-layer hybrid transformer that routes each token to one of three operators - DCT spectral mixing, RBF kernel mixing, or full self-attention - based on per-token spectral entropy, a theoretically justified complexity signal. Through systematic ablation on WikiText-103, we discover routing collapse: the router consistently rejects RBF in favour of DCT and attention, revealing that spectral mixing and dynamic attention are complementary and sufficient. A purpose-designed DCT+Attention-only variant achieves Val PPL 36.54 on WikiText-103 - a 45% improvement over a full-attention baseline (PPL 66.62) at 62.5% fewer attention FLOPs. We extend evaluation to WikiText-2, IMDB sentiment classification, and synthetic ListOps operations, establishing a clear operating regime: CHIAR-Former excels on large-scale naturalistic text where token diversity supports spectral specialisation, while full attention retains an edge on small datasets and synthetic pattern-matching tasks. These findings - both the wins and the losses - together define when and why spectral routing earns its keep.
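
A sketch of routing by per-token spectral entropy. The entropy is computed from a DCT-II power spectrum; the threshold value and the direction of the rule (low entropy goes to DCT mixing, high entropy to attention) are assumptions for illustration, since the abstract does not specify them.

```python
import math


def dct2(x):
    """Unnormalized DCT-II of a list of floats."""
    n = len(x)
    return [
        sum(x[t] * math.cos(math.pi * (t + 0.5) * k / n) for t in range(n))
        for k in range(n)
    ]


def spectral_entropy(x):
    """Shannon entropy (in nats) of the normalized DCT power spectrum."""
    power = [c * c for c in dct2(x)]
    total = sum(power)
    if total == 0.0:
        return 0.0
    probs = [p / total for p in power]
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def route(x, threshold=1.0):
    # Assumed rule: spectrally simple tokens take the cheap DCT path.
    return "dct" if spectral_entropy(x) < threshold else "attention"


smooth_token = [1.0] * 8           # energy concentrated in one coefficient
spiky_token = [1.0] + [0.0] * 7    # energy spread across the spectrum
smooth_route = route(smooth_token)
spiky_route = route(spiky_token)
```

A constant vector has nearly zero spectral entropy, while an impulse spreads its energy over all coefficients, so the two land on different paths.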

2606.08347 2026-06-09 cs.CL cs.LG 交叉投稿

Tensorizing Engram: Sharing Latents Across N-Gram Embeddings is Beneficial in LLMs

张量化Engram:在N-gram嵌入中共享潜在变量对大型语言模型有益

Wuyang Zhou, Yuxuan Gu, Giorgos Iacovides, Yuning Qiu, Qibin Zhao, Danilo Mandic

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 提出张量化Engram(TN-gram),通过CP分解共享因子压缩n-gram嵌入,减少参数并避免哈希冲突,在多个任务上匹配或超越现有方法。

详情
AI中文摘要

现代语言模型使用离散的token级嵌入表示文本,这迫使重复的多token模式必须在Transformer层中隐式学习。过度token化的Transformer和Engram都试图通过显式引入多token(n-gram)记忆来解决这一限制。然而,它们为每个n-gram阶数使用单独的哈希表,这引入了哈希冲突并阻止嵌套的n-gram共享底层潜在结构。为了解决这些问题,我们提出了张量化Engram(TN-gram),一种紧凑的记忆模块,通过Canonical Polyadic(CP)形式中的共享因子表示张量化的n-gram嵌入。TN-gram学习共享的token-位置因子以及阶数吸收向量,以编码不同n-gram阶数的嵌入。综合实验表明,TN-gram在需要更少参数的情况下,匹配甚至超越了Engram风格的n-gram模块。

英文摘要

Modern language models represent text using discrete token-level embeddings, which forces recurring multi-token patterns to be learned implicitly across Transformer layers. Both Over-tokenized Transformers and Engram attempt to address this limitation by explicitly incorporating multi-token (n-gram) memories. However, they rely on separate hash tables for each n-gram order, which introduces hash collisions and prevents nested n-grams from sharing the underlying latent structures. To address these issues, we propose Tensorized Engram (TN-gram), a compact memory module that represents tensorized n-gram embeddings through shared factors in the Canonical Polyadic (CP) form. TN-gram learns shared token-position factors together with order-absorption vectors to encode the embeddings of different n-gram orders. Comprehensive experiments demonstrate that TN-gram matches or even outperforms Engram-style n-gram modules while requiring far fewer parameters.
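
A sketch of how a CP (Canonical Polyadic) factorization yields n-gram embeddings from shared per-position factors. It shows the generic CP lookup only; the paper's order-absorption vectors are omitted, and all sizes and factor values are illustrative assumptions.

```python
def cp_ngram_embedding(token_ids, position_factors, output_factor):
    """Embedding of an n-gram as a rank-R CP contraction.

    position_factors[pos][token][r] are the per-position token factors and
    output_factor[r][d] maps each rank component to the embedding dimension.
    """
    rank = len(output_factor)
    weights = [1.0] * rank
    for pos, tok in enumerate(token_ids):
        for r in range(rank):
            weights[r] *= position_factors[pos][tok][r]
    dim = len(output_factor[0])
    return [sum(weights[r] * output_factor[r][d] for r in range(rank)) for d in range(dim)]


def dense_params(vocab, order, dim):
    return vocab ** order * dim          # one row per possible n-gram


def cp_params(vocab, order, dim, rank):
    return order * vocab * rank + rank * dim


# Tiny example: vocabulary of 2 tokens, rank 2, embedding dimension 2.
position_factors = [
    [[1.0, 0.0], [0.0, 1.0]],   # position 0
    [[1.0, 1.0], [2.0, 0.0]],   # position 1
]
output_factor = [[1.0, 0.0], [0.0, 1.0]]
bigram_01 = cp_ngram_embedding([0, 1], position_factors, output_factor)
bigram_10 = cp_ngram_embedding([1, 0], position_factors, output_factor)
```

For a vocabulary of 1000, trigrams, dimension 64, and rank 32, a dense table would need 64 billion entries while the CP form needs 98,048, and no hashing is involved because every n-gram has a well-defined factorized embedding.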

2606.08374 2026-06-09 eess.SY cs.LG cs.SY 交叉投稿

Predictive Coding with Bayesian Priors via Proximal Gradients

基于近端梯度的贝叶斯先验预测编码

Francesco Bullo

发表机构 * Department of Mechanical Engineering and Dynamical Neuroscience Program, UC Santa Barbara(加州大学圣塔芭芭拉分校机械工程系与动力神经科学项目)

AI总结 将预测编码重新表述为应用于正则化最大后验目标的连续时间近端梯度下降,揭示了其与漏泄发放率网络的等价性,并推广到分层结构。

Comments 13 pages, 2 figures, technical report

详情
AI中文摘要

我们将预测编码重新表述为应用于正则化最大后验(MAP)目标的连续时间近端梯度下降。我们首先研究单层问题,然后研究多层层次结构。对于单层问题,我们证明近端梯度下降正是漏泄发放率网络:膜漏、有效循环矩阵、局部突触驱动和静态非线性都源于一个优化原理,得到的电路正是Rao和Ballard提出的电路。先验通过其近端算子选择非线性,似然精度设置观测的增益。对于层次结构,我们证明深度MAP问题的经典变量分裂松弛将分层预测编码作为局部和分布式求解器的互连。在概率建模术语中,这种松弛将定向生成链替换为无向马尔可夫随机场,其节点势是逐层先验。然后每一层应用其自身的激活函数,即其先验的近端算子。

英文摘要

We recast predictive coding as continuous-time proximal gradient descent applied to a regularized maximum-a-posteriori (MAP) objective. We study first a single-level problem and then a multi-level hierarchy. For the single-level problem, we show that proximal gradient descent is precisely a leaky firing-rate network: the membrane leak, the effective recurrent matrix, the local synaptic drive, and the static nonlinearity all follow from one optimization principle, and the resulting circuit is the one proposed by Rao and Ballard. The prior selects the nonlinearity through its proximal operator, and the likelihood precision sets the gain on the observation. For the hierarchy, we show that a classical variable-splitting relaxation of the deep MAP problem yields hierarchical predictive coding as the interconnection of local and distributed solvers. In probabilistic modeling terms, this relaxation replaces the directed generative chain by an undirected Markov random field whose node potentials are the level-wise priors. Each level then applies its own activation function, namely the proximal operator of its prior.
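
The single-level claim can be sketched numerically: a leaky update toward the proximal-gradient map behaves like a firing-rate network whose nonlinearity is the proximal operator of the prior. The example assumes a Gaussian likelihood with observation matrix `W` and an L1 (sparsity) prior, whose proximal operator is soft-thresholding; the step sizes and data are illustrative choices.

```python
def soft_threshold(v, lam):
    """Proximal operator of lam * |x|_1, applied elementwise."""
    return [max(abs(x) - lam, 0.0) * (1.0 if x > 0 else -1.0) for x in v]


def firing_rate_dynamics(W, y, lam, eta=0.5, dt=0.2, steps=300):
    """Euler-discretized dynamics  x' = -x + prox(x + eta * W^T (y - W x))."""
    n = len(W[0])
    x = [0.0] * n
    for _ in range(steps):
        # Prediction error at the observation layer.
        residual = [y[i] - sum(W[i][j] * x[j] for j in range(n)) for i in range(len(y))]
        # Recurrent term plus synaptic drive from the error.
        drive = [x[j] + eta * sum(W[i][j] * residual[i] for i in range(len(y))) for j in range(n)]
        target = soft_threshold(drive, eta * lam)   # static nonlinearity
        x = [xj + dt * (tj - xj) for xj, tj in zip(x, target)]  # leak toward target
    return x


W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
y = [3.0, 0.5, -2.0]
x_hat = firing_rate_dynamics(W, y, lam=1.0)
```

With an identity observation matrix the MAP estimate is the soft-thresholded observation, (2, 0, -1), and the dynamics settle there. Swapping the prior changes only `soft_threshold`, which mirrors the statement that the prior selects the nonlinearity.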

2606.08653 2026-06-09 cs.CV cs.AI cs.LG cs.RO 交叉投稿

FiberTune: Preserving Action-Fiber Visual Residuals in Vision-Language-Action Fine-Tuning

FiberTune: 在视觉-语言-动作微调中保留动作纤维视觉残差

Haihao Lin, Xiangsheng Huang, Xiao Yang, Weibang Zhou, Yiqi Zhang, Bo Yang, Simin Zeng, Jiawei Yang, Zhengyang Wang, Jiahui Du

发表机构 * University of Chinese Academy of Sciences(中国科学院大学) Hebei Key Laboratory of Cognitive Intelligence, Xiong’an Institute of Innovation(河北省认知智能重点实验室,雄安创新研究院) Hebei University of Technology(河北工业大学) Beijing Information Science and Technology University(北京信息科技大学)

AI总结 提出FiberTune,通过在线动作探针过滤动作预测特征方向,对齐教师视觉残差并正则化有效秩,在六个仿真和实物任务中提升VLA策略性能。

Comments Project page: https://fibertune.github.io/

详情
AI中文摘要

动作监督的视觉-语言-动作(VLA)策略微调能有效拟合演示,但仅约束改变预测动作的方向,导致动作等价状态下视觉结构自由坍缩。我们将此形式化为沿局部动作纤维的残差视觉坍缩,并提出FiberTune,一种训练时目标,在不增加推理开销的情况下保留教师结构的视觉残差。FiberTune使用在线动作探针估计动作预测特征方向,从中滤除中间视觉标记表示,并将探针过滤后的残差与冻结的视觉教师对齐,同时正则化其有效秩。在相同训练条件下,FiberTune在跨越两个基准和两种架构(pi_0.5和OpenVLA-OFT)的六个受控仿真设置以及物理SO-101拾取放置任务中,均优于仅任务损失的微调;代表性提升包括长时域CALVIN ABC-to-D上SR(5)提高10.7个百分点,物理SO-101任务成功率从72.7%提升至78.1%。残差诊断显示,这些增益与探针过滤后的残差教师对齐度和有效秩增加一致,符合动作纤维动机。

英文摘要

Action-supervised fine-tuning of vision-language-action (VLA) policies fits demonstrations effectively but constrains only the directions that change predicted actions, leaving visual structure consistent across action-equivalent states free to collapse. We formalize this as residual visual collapse along local action fibers and propose FiberTune, a training-time objective that preserves teacher-structured visual residuals without adding inference-time overhead. FiberTune uses an online action probe to estimate action-predictive feature directions, filters them from intermediate visual-token representations, and aligns the resulting probe-filtered residuals to a frozen visual teacher while regularizing their effective rank. Under identical training conditions, FiberTune improves over task-loss-only fine-tuning in every one of six controlled simulation settings spanning two benchmarks and two architectures (pi_0.5 and OpenVLA-OFT), as well as on physical SO-101 pick-place; representative gains include +10.7 percentage points SR(5) on long-horizon CALVIN ABC-to-D and physical SO-101 task success rising from 72.7% to 78.1%. Residual diagnostics show that these gains coincide with increased probe-filtered residual teacher alignment and effective rank, consistent with the action-fiber motivation.
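
The abstract regularizes the effective rank of the residuals without defining it. A common definition, assumed here, is the exponential of the Shannon entropy of the normalized singular values; the paper may use a different one.

```python
import math


def effective_rank(singular_values):
    """exp(entropy) of the singular-value distribution (assumed definition)."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0.0]
    return math.exp(-sum(p * math.log(p) for p in probs))


healthy = effective_rank([1.0, 1.0, 1.0, 1.0])     # energy spread over 4 directions
collapsed = effective_rank([1.0, 0.0, 0.0, 0.0])   # energy in a single direction
```

Under this definition a representation that spreads variance evenly over four directions scores 4, while a fully collapsed one scores 1, so a drop in the value signals the kind of residual collapse the abstract describes.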

2606.08672 2026-06-09 cs.CV cs.LG 交叉投稿

Learning to Solve Generative ODEs Beyond the Linear Span

学习求解生成式常微分方程:超越线性跨度

Sihyeon Kim, Seunghun Lee, Vikas Singh, Hyunwoo J. Kim

发表机构 * Korea University(高丽大学) KAIST(韩国科学技术院) University of Wisconsin–Madison(威斯康星大学麦迪逊分校)

AI总结 针对扩散和流生成模型中ODE求解器步数多的问题,提出SpanLift轻量神经求解器,通过空间残差算子增强标量系数更新,实现少步采样且不增加模型NFE,在多个任务上达到最先进性能。

Comments 12 pages, 7 figures

详情
AI中文摘要

扩散和流生成模型通过积分学习到的ODE进行采样,但高质量采样仍需要大量连续的模型评估。求解器学习通过调整标量系数、时间步长或两者来降低这一成本,同时保持骨干模型固定。在这项工作中,我们识别出该更新族中的一个结构瓶颈:每一步仍然受限于跨度。由于标量系数更新位于缓冲速度评估的跨度内,它只能拟合跨度内的分量,而任何跨度外的残差无法通过标量重组单独达到。我们提出SpanLift,一种轻量神经求解器,它用空间残差算子增强标量系数更新。SpanLift将固定的基础求解器作为跨度内先验,并在状态和速度缓冲上学习一个空间残差算子。该算子通过端点教师匹配训练,保留预训练的骨干,且不增加模型NFE。实验表明,学习到的校正跨基础求解器迁移,且主要位于跨度外。在像素空间扩散、潜流匹配和降水临近预报中,SpanLift实现了最先进的少步采样。仅用3个NFE,它将CIFAR-10的FID从8.16提升到5.69,ImageNet的FID从17.37提升到11.83。

英文摘要

Diffusion and flow generative models sample by integrating a learned ODE, but high quality still requires many sequential model evaluations. Solver learning reduces this cost by adapting scalar coefficients, timesteps, or both, while keeping the backbone model fixed. In this work, we identify a structural bottleneck in this update family: each step remains span-limited. Since the scalar-coefficient update lies in the span of buffered velocity evaluations, it can fit only the in-span component while leaving any out-of-span residual unreachable by scalar recombination alone. We propose SpanLift, a lightweight neural solver that augments scalar-coefficient updates with a spatial residual operator. SpanLift keeps a fixed base solver as an in-span prior and learns a spatial residual operator over the state and velocity buffer. The operator is trained by endpoint teacher matching, preserves the pretrained backbone, and adds no model NFEs. Empirically, the learned correction transfers across base solvers and is predominantly out-of-span. Across pixel-space diffusion, latent flow matching, and precipitation nowcasting, SpanLift achieves state-of-the-art few-step sampling. With only 3 NFE, it improves CIFAR-10 FID from 8.16 to 5.69 and ImageNet FID from 17.37 to 11.83.
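
The span limitation is a linear-algebra statement and can be checked directly: any scalar recombination of buffered velocities lies in their span, so the part of the ideal update orthogonal to that span is unreachable. The vectors below are illustrative, and the Gram-Schmidt projection is a generic tool rather than anything from the paper.

```python
def dot(a, b):
    return sum(ai * bi for ai, bi in zip(a, b))


def project_onto_span(target, vectors):
    """Orthogonal projection of `target` onto span(vectors) via Gram-Schmidt."""
    ortho = []
    for v in vectors:
        w = list(v)
        for q in ortho:
            coeff = dot(w, q)
            w = [wi - coeff * qi for wi, qi in zip(w, q)]
        norm = dot(w, w) ** 0.5
        if norm > 1e-12:  # skip directions already covered
            ortho.append([wi / norm for wi in w])
    projection = [0.0] * len(target)
    for q in ortho:
        coeff = dot(target, q)
        projection = [p + coeff * qi for p, qi in zip(projection, q)]
    return projection


buffered_velocities = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
ideal_update = [2.0, 3.0, 4.0]

in_span = project_onto_span(ideal_update, buffered_velocities)
out_of_span = [t - p for t, p in zip(ideal_update, in_span)]
```

The best any choice of scalar coefficients can do is recover `in_span`; the residual `out_of_span` needs an extra operator, which is the role the abstract assigns to the spatial residual term.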

2606.08804 2026-06-09 cs.AI cs.LG 交叉投稿

Q-Delta: Beyond Key-Value Associative State Evolution

Q-Delta:超越键值关联状态演化

Sumin Park, Seojin Kim, Noseong Park

AI总结 提出Q-Delta,一种查询感知的delta规则,将混合键-查询预测误差融入状态演化,实现联合校正动态,在语言建模和长上下文检索任务上优于强基线。

Comments Accepted at ICML 2026

详情
AI中文摘要

线性注意力将序列建模重新表述为循环状态演化,实现高效的线性时间推理。在键值关联范式下,现有方法将查询的作用限制在读出操作,使其与状态演化解耦。我们表明,查询条件状态读出在累积记忆上诱导出结构化的值预测,补充了基于键的检索。基于这一洞察,我们提出Q-Delta,一种查询感知的delta规则,将混合键-查询预测误差融入状态演化,在保持delta规则效率的同时实现联合校正动态。我们为所得动态建立了稳定性保证,并推导出硬件高效的块状并行公式,以及自定义Triton实现。实验结果表明,在语言建模和长上下文检索任务上,优化稳定、吞吐量具有竞争力,且一致优于强基线。

英文摘要

Linear attention reformulates sequence modeling as recurrent state evolution, enabling efficient linear-time inference. Under the key-value associative paradigm, existing approaches restrict the role of the query to the readout operation, decoupling it from state evolution. We show that query-conditioned state readout induces a structured value prediction over accumulated memory that complements key-based retrieval. Based on this insight, we propose Q-Delta, a query-aware delta rule that integrates mixed key-query prediction errors into state evolution, enabling jointly corrective dynamics while preserving delta-rule efficiency. We establish stability guarantees for the resulting dynamics and derive a hardware-efficient chunkwise-parallel formulation with a custom Triton implementation. Empirical results demonstrate stable optimization, competitive throughput, and consistent improvements over strong baselines on language modeling and long-context retrieval tasks.
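
For background, the classical key-value delta rule that Q-Delta builds on can be written in a few lines. The sketch shows only this standard rule, where the query is used purely for readout; the query-aware correction proposed in the paper is not reproduced here.

```python
def matvec(m, v):
    return [sum(m_ij * v_j for m_ij, v_j in zip(row, v)) for row in m]


def delta_rule_update(state, key, value, beta=1.0):
    """S <- S + beta * (v - S k) k^T : correct the value stored under `key`."""
    prediction = matvec(state, key)
    error = [v - p for v, p in zip(value, prediction)]
    return [
        [s_ij + beta * e_i * k_j for s_ij, k_j in zip(row, key)]
        for row, e_i in zip(state, error)
    ]


state = [[0.0, 0.0], [0.0, 0.0]]
key = [1.0, 0.0]

state = delta_rule_update(state, key, value=[3.0, 4.0])   # first write
state = delta_rule_update(state, key, value=[5.0, 6.0])   # overwrite same key
readout = matvec(state, [1.0, 0.0])                       # query equals the key
```

Because the update subtracts the current prediction, writing a new value under the same unit-norm key replaces the old one instead of accumulating, and the query plays no part in how the state evolves.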

2606.08814 2026-06-09 cs.AI cs.LG 交叉投稿

STAR: Rethinking MoE Routing as Structure-Aware Subspace Learning

STAR: 将MoE路由重新思考为结构感知的子空间学习

Sumin Park, Noseong Park

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 提出STAR方法,通过广义Hebbian算法学习主子空间来增强路由对输入结构的感知,实现专家稳定专业化,在合成数据和语言视觉任务上提升路由质量和下游性能。

Comments Accepted at ICML 2026

详情
AI中文摘要

混合专家(MoE)通过选择性地将输入路由到专门的专家子集来高效扩展模型容量。然而,输入-专家专业化(MoE的核心动机)关键取决于路由器是否真正感知输入结构。实践中,MoE路由通常实现为浅层线性投影,对输入表示的感知有限,常导致路由不稳定。我们提出STAR(结构感知路由),将MoE路由重新思考为子空间学习问题,通过广义Hebbian算法(GHA)跟踪主导输入结构的演化主子空间来增强标准可学习路由。通过将路由决策直接与输入结构对齐,STAR实现了稳定的专家专业化。我们在受控合成设置和大规模语言与视觉任务上评估STAR,它持续提高了路由质量和下游性能,超过了强MoE基线。此外,可选的测试时子空间更新进一步增强了输入分布偏移下的路由鲁棒性和泛化能力。

英文摘要

Mixture-of-Experts (MoE) scales model capacity efficiently by selectively routing inputs to a specialized subset of experts. However, input-expert specialization, the core motivation of MoE, critically depends on whether the router is actually aware of input structure. In practice, MoE routing is typically implemented as a shallow linear projection with limited awareness of the input representation, which often leads to unstable routing. We propose STAR (Structure-Aware Routing), which rethinks MoE routing as a subspace learning problem by augmenting standard learnable routing with an evolving principal subspace that tracks dominant input structure via the Generalized Hebbian Algorithm (GHA). By aligning routing decisions directly with input structure, STAR enables stable expert specialization. We evaluate STAR on a controlled synthetic setup and on large-scale language and vision tasks, where it consistently improves routing quality and downstream performance over strong MoE baselines. Moreover, optional test-time subspace updates further enhance routing robustness and generalization under input distribution shifts.
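
The Generalized Hebbian Algorithm (Sanger's rule) tracks a principal subspace from a stream of inputs using only local updates. The sketch below runs it on synthetic 2D data; the data distribution, learning rate, and initialization are assumptions, and how STAR feeds the subspace into routing is not shown.

```python
import random


def gha_update(W, x, lr):
    """One step of Sanger's rule: dW_i = lr * y_i * (x - sum_{k<=i} y_k W_k)."""
    y = [sum(w_j * x_j for w_j, x_j in zip(row, x)) for row in W]
    new_W = []
    for i, row in enumerate(W):
        # Reconstruction using components 0..i (this is what orders the components).
        recon = [sum(y[k] * W[k][j] for k in range(i + 1)) for j in range(len(x))]
        new_W.append([row[j] + lr * y[i] * (x[j] - recon[j]) for j in range(len(x))])
    return new_W


rng = random.Random(0)
W = [[0.6, 0.8], [0.8, -0.6]]  # arbitrary orthonormal starting point
for _ in range(5000):
    # Synthetic inputs whose dominant variance lies along the first axis.
    x = [1.5 * rng.gauss(0.0, 1.0), 0.3 * rng.gauss(0.0, 1.0)]
    W = gha_update(W, x, lr=0.01)

first_component = W[0]
```

After training, the first row aligns (up to sign) with the high-variance axis, without ever forming a covariance matrix, which is what makes the rule usable as an online tracker.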

2606.08815 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Momentum for Reasoning: Dense Intrinsic Signals in Policy Optimization

推理的动量:策略优化中的密集内在信号

Hao Chen, Zhanming Shen, Liyao Li, Yanyu Chen, Xuhang Zhu, Xiaomeng Hu, Qi Zhang, Ru Peng, Xiaoyu Shen, Haobo Wang, Junbo Zhao

发表机构 * Zhejiang University(浙江大学) The Chinese University of Hong Kong(香港中文大学) Eastern Institute of Technology(东方理工学院)

AI总结 针对GRPO在长链推理中因二元奖励导致的零优势崩溃和幻觉确定性失败模式,提出ISPO方法,通过内在信号密集化奖励,在三个基模型和五个数学推理基准上持续优于基线。

Comments 14 pages, 6 figures, 8 tables

详情
AI中文摘要

基于可验证奖励的强化学习已成为激发大型语言模型长链推理的强大范式。然而,现有基于组相对策略优化(GRPO)的方法依赖于二元结果奖励,这引发了两种结构性失败模式:零优势崩溃,即组内所有轨迹共享相同结果导致梯度消失;以及幻觉确定性,即模型在训练后期对错误轨迹变得过度自信。我们通过使用完全从策略自身条件概率计算的内在信号来密集化奖励,解决了这两种模式,并提出了ISPO(内在信号策略优化),它结合了衡量思考轨迹对最终答案信息量的序列级信号,以及令牌级方向性奖励,其幻觉确定性铰链惩罚关键决策令牌上的错误自信预测。在三个基模型和五个数学推理基准上,ISPO持续优于竞争基线,在零优势崩溃最频繁的最难基准上取得最大提升,训练动态诊断证实两种失败模式均被减少。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has emerged as a powerful paradigm for eliciting long-chain reasoning in large language models. However, existing methods based on Group Relative Policy Optimization (GRPO) rely on a binary outcome reward, which induces two structural failure modes: Zero-Advantage Collapse, in which all rollouts in a group share the same outcome and the gradient vanishes, and Hallucinated Certainty, in which the model becomes increasingly confident on incorrect rollouts late in training. We address both modes by densifying the reward with intrinsic signals computed entirely from the policy's own conditional probabilities, and propose ISPO (Intrinsic Signal Policy Optimization), which combines a sequence-level signal measuring how informative the thinking trajectory is for the final answer with a token-level directional reward whose hallucinated-certainty hinge penalizes confidently wrong predictions at critical decision tokens. Across three base models and five mathematical reasoning benchmarks, ISPO consistently outperforms competitive baselines, with the largest gains on the hardest benchmarks where zero-advantage collapse is most frequent, and training-dynamics diagnostics confirm that both failure modes are reduced.
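The zero-advantage collapse described in this abstract is easy to reproduce numerically. The sketch below assumes the commonly used group-normalized advantage (reward minus group mean, divided by group standard deviation plus a small constant); the function name and the `eps` value are illustrative choices, not taken from the paper.

```python
from statistics import mean, pstdev

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages for one prompt's rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# A group with mixed binary outcomes yields a usable learning signal.
mixed = group_advantages([1.0, 0.0, 0.0, 1.0])

# A group where every rollout fails (or every rollout succeeds) yields
# all-zero advantages, so the policy gradient for that prompt vanishes.
all_wrong = group_advantages([0.0, 0.0, 0.0, 0.0])
all_right = group_advantages([1.0, 1.0, 1.0, 1.0])
```

Any dense per-rollout signal added to the binary reward breaks the tie inside such groups, which is the general motivation for densifying the reward.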

2606.08871 2026-06-09 math.NA cs.LG cs.NA 交叉投稿

Fourier Neural Operators with rank-1 lattice points and hyperbolic cross

基于秩1格点和双曲交叉的傅里叶神经算子

Jakob Dilen, Alexander Keller, Frances Y. Kuo, Dirk Nuyens

发表机构 * University of New South Wales(新南威尔士大学) Max Planck Institute for Mathematics in the Sciences(马克斯·普朗克科学研究院)

AI总结 通过用秩1格点替代空间张量积网格,并在参数空间精心构造第二个格点作为训练点,降低了傅里叶神经算子的泛化误差,以更少的参数、空间点和训练样本实现了更精确、更高效的逼近。

详情
AI中文摘要

傅里叶神经算子(FNO)是一种学习函数空间之间映射的神经网络架构。其高效实现基于多维傅里叶变换。通过推导FNO关于空间和参数变量的一般正则性界,我们证明,用专门构建的秩1格点替代空间张量积网格,并使用第二个精心构造的格点作为参数空间中的训练点,可以改进FNO的泛化误差。我们用更少的网络参数、更少的空间点和更少的训练样本实现了更精确、更高效的逼近。此外,架构得到简化,因为秩1格点上的高维傅里叶变换仅需一维快速傅里叶变换,并且我们可以使用带有格点的双曲交叉频率指标集。我们通过环面上的椭圆偏微分方程展示了基于格点的双曲交叉FNO的优势。

英文摘要

The Fourier neural operator (FNO) is a neural network architecture that learns mappings between function spaces. Its efficient implementation is based on the multi-dimensional Fourier transform. By deriving general regularity bounds for the FNO with respect to both the spatial and parametric variables, we prove that the generalization error of the FNO can be improved by replacing spatial tensor product grids with purpose-built rank-1 lattice points, and by using a second lattice carefully constructed as training points in the parametric space. We achieve more accurate and efficient approximations from fewer network parameters, fewer spatial points, and fewer training samples. In addition, the architecture is simplified, because the high-dimensional Fourier transform on rank-1 lattices requires only a one-dimensional fast Fourier transform, and we can use a hyperbolic cross frequency index set with lattice points. We demonstrate the benefits of our lattice-based hyperbolic-cross FNOs for an elliptic PDE on the torus.
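The claim that a multi-dimensional Fourier transform on a rank-1 lattice reduces to a one-dimensional transform can be checked on a toy example. The sketch below is a minimal illustration, not the paper's construction: the lattice size `n = 13`, the generating vector `z = (1, 5)`, and the test function are arbitrary assumptions, and a naive O(n^2) DFT stands in for the FFT to keep the example dependency-free.

```python
import cmath

def rank1_lattice(n, z):
    """Points x_k = frac(k * z / n), k = 0..n-1, in the unit cube."""
    return [[(k * zj % n) / n for zj in z] for k in range(n)]

def dft_1d(values):
    """Naive normalized 1-D DFT (stand-in for a 1-D FFT)."""
    n = len(values)
    return [
        sum(v * cmath.exp(-2j * cmath.pi * k * m / n) for k, v in enumerate(values)) / n
        for m in range(n)
    ]

def lattice_index(h, z, n):
    """A frequency h lands in 1-D DFT bin (h . z) mod n."""
    return sum(hj * zj for hj, zj in zip(h, z)) % n

def f(x):
    """Toy 2-D periodic function with known Fourier coefficients."""
    return (2.0 * cmath.exp(2j * cmath.pi * (1 * x[0] + 2 * x[1]))
            + 0.5 * cmath.exp(2j * cmath.pi * (-1 * x[0] + 1 * x[1])))

n, z = 13, (1, 5)
points = rank1_lattice(n, z)
coeffs = dft_1d([f(x) for x in points])

c_12 = coeffs[lattice_index((1, 2), z, n)]    # coefficient of frequency (1, 2)
c_m11 = coeffs[lattice_index((-1, 1), z, n)]  # coefficient of frequency (-1, 1)
```

The recovery is exact here because the two frequencies map to different bins; choosing the generating vector so that every frequency in the index set (for example a hyperbolic cross) maps to its own bin is what a purpose-built lattice has to guarantee.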

2606.09047 2026-06-09 eess.SY cs.LG cs.SY math.OC 交叉投稿

Families of Control-Cost-Parametrized Inverse-Optimal Universal Stabilizers

控制代价参数化的逆最优通用镇定器族

Miroslav Krstic, Luke Bhan

发表机构 * Department of Mechanical and Aerospace Engineering, University of California San Diego(加州大学圣地亚哥分校机械与航空航天工程系) Department of Electrical and Computer Engineering, University of California San Diego(加州大学圣地亚哥分校电气与计算机工程系)

AI总结 提出一族代价参数化的镇定反馈律,用户选择控制运行代价函数,通过公式扩展通用控制器,并证明代价-扩展算子的Lipschitz性质,支持神经算子逼近,实现半全局实用渐近稳定性和二阶次优性界。

Comments 13 Pages

详情
AI中文摘要

经典的通用镇定公式不提供设计自由度:它是一个单一的无参数对象。我们引入一族代价参数化的镇定反馈律,其中(1)用户选择一个函数作为逆最优代价泛函中控制的运行代价,(2)通过一个公式获得预先存在的通用控制器的非线性“扩展器”,该扩展器求解一个状态代价有意义的无限时域最优控制问题。代价到扩展器的公式是一个三步构造,涉及代价微分和函数反演等步骤,总体上是一个非线性无限维算子。代价到扩展器的算子被证明是Lipschitz的,这使得对整个族进行一致的神经算子逼近成为可能,并支持离线性能探索和在线自适应。在逼近下建立了半全局实用渐近稳定性和二阶次优性界。通过数值示例说明了算子学习及其在半全局镇定中的应用。我们将结果称为“半直接最优”,因为本文的设计不及一般的“直接最优”(HJB诱导)控制,但又超出完全逆最优,因为用户针对任意给定的控制代价执行最小化。我们所解决的半直接问题的对偶问题,是状态代价任意且给定的问题。该对偶问题更容易,不在本文范围内。

英文摘要

A classical universal stabilization formula offers the practitioner no design freedom: it is a single, parameter-free object. We introduce a cost-parametrized family of stabilizing feedback laws, where (1) the user chooses a function that serves as the running cost on control in an inverse-optimal cost functional, and (2) obtains, through a formula, a nonlinear "expander" of a pre-existing universal controller, which solves an infinite-horizon optimal control problem with a meaningful cost on the state. The cost-to-expander formula is a three-step construction, involving, inter alia, cost differentiation and function inversion; overall, it is a nonlinear infinite-dimensional operator. The cost-to-expander operator is proven Lipschitz, which enables uniform neural operator approximation of the entire family and supports both offline performance exploration and online adaptation. Semiglobal practical asymptotic stability and second-order suboptimality bounds are established under the approximation. The operator learning and its use in semiglobal stabilization are illustrated numerically. We call the result "half-direct-optimal" because the paper's design is less than a general "direct optimal" (HJB-inducing) control, but more than the fully inverse optimal, since the user performs minimization for an arbitrary given cost on control. The dual to the half-direct problem we solve is the problem in which the cost on the state is arbitrary and given. This dual problem is easier and outside the scope of the paper.
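For orientation, the best-known formula of this kind is Sontag's universal formula. The abstract does not name it, so treating it as the "classical universal stabilization formula" meant here is an assumption. For a control-affine system with a control Lyapunov function $V$, it reads:

```latex
\[
\dot{x} = f(x) + g(x)\,u, \qquad a(x) = L_f V(x), \qquad b(x) = L_g V(x),
\]
\[
u(x) =
\begin{cases}
-\dfrac{a(x) + \sqrt{a(x)^2 + \lVert b(x) \rVert^4}}{\lVert b(x) \rVert^2}\; b(x)^{\top}, & b(x) \neq 0, \\[2ex]
0, & b(x) = 0.
\end{cases}
\]
```

Along closed-loop trajectories with $b(x) \neq 0$ this gives $\dot{V} = a + b\,u = -\sqrt{a^2 + \lVert b \rVert^4} < 0$. The feedback contains no tunable quantity once $V$ is fixed, which is the sense in which such a formula is parameter-free.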

2606.09131 2026-06-09 cs.AI cs.CL cs.CV cs.LG 交叉投稿

Late-Layer Fusion is Enough: Dual-Path Vision Token Routing for Multimodal Large Language Models under Visual Saturation

晚期融合足矣:面向视觉饱和的多模态大语言模型的双路径视觉令牌路由

Siyuan Liu, Jinyang Wu

发表机构 * School of Mechanics and Engineering Science, Peking University(北京大学力学与工程科学学院) Department of Automation, Tsinghua University(清华大学自动化系)

AI总结 针对多模态大语言模型中视觉令牌在深层饱和的问题,提出双路径视觉令牌路由(DPVR-LF),在饱和点将视觉令牌路由至单层可训练分支,仅最后层融合,以约3%可训练参数保持性能并减少计算。

Comments 18 pages, 4 figures. Submitted to Pattern Recognition

详情
AI中文摘要

多模态大语言模型(MLLMs)通常继承为单模态文本建模设计的深层对称Transformer骨干,并对图像和语言令牌应用相同的统一计算。这种设计忽略了一个关键的模态不对称性:图像和文本令牌在信息密度、冗余度和所需推理深度上存在显著差异。通过对LLaVA-1.5的逐层分析,我们观察到视觉令牌倾向于在中间层饱和。具体而言,文本到图像的注意力从第0层的0.68下降到第4层的0.07,并在第18层后稳定在0.04附近,而文本令牌则继续受益于深层语义处理。这些发现表明架构对称性与深度异步模态演化之间存在不匹配,导致冗余的视觉计算以及在深层任务特定适应期间感知表示的潜在漂移。受此启发,我们提出了双路径视觉令牌路由(DPVR),一种用于高效MLLMs的模态不对称路由框架。其核心实例DPVR-LF(晚期融合)在饱和点将视觉令牌路由到一个单层可训练侧分支,运行一个跳过深层堆栈中图像位置的十三层纯文本前向传播,并仅在最后一层重新融合视觉和文本流。使用约3%的可训练参数,DPVR-LF在标准基准上保持了有竞争力的多模态性能,同时减少了深层Transformer堆栈中的视觉计算。该结果挑战了视觉令牌必须遍历所有深层语言模型层的传统假设,并表明单个晚期融合层足以在LLaVA风格的MLLMs中维持强大的感知能力。

英文摘要

Multimodal large language models (MLLMs) commonly inherit the deep, symmetric Transformer backbone designed for unimodal text modeling, and apply the same computation uniformly to image and language tokens. This design overlooks a key modality asymmetry: image and text tokens differ substantially in information density, redundancy, and required reasoning depth. Through a layer-wise analysis of LLaVA-1.5, we observe that vision tokens tend to saturate in the middle layers. Specifically, text-to-image attention decreases from 0.68 at layer 0 to 0.07 by layer 4, and stabilizes near 0.04 after layer 18, whereas text tokens continue to benefit from deep semantic processing. These findings suggest a mismatch between architectural symmetry and depth-asynchronous modality evolution, resulting in redundant visual computation and possible drift in perceptual representations during deep task-specific adaptation. Motivated by this, we propose Dual-Path Vision Token Routing (DPVR), a modality-asymmetric routing framework for efficient MLLMs. Its core instantiation, DPVR-LF (Late-Layer Fusion), routes vision tokens at the saturation point into a one-layer trainable side branch, runs a thirteen-layer text-only forward that skips image positions in the deep stack, and re-fuses the visual and textual streams only at the final layer. With approximately 3% trainable parameters, DPVR-LF preserves competitive multimodal performance on standard benchmarks while reducing visual computation in the deep Transformer stack. The results challenge the conventional assumption that vision tokens must traverse all deep language-model layers, and indicate that a single late fusion layer can be sufficient for maintaining strong perceptual competence in LLaVA-style MLLMs.

2606.09304 2026-06-09 cs.CL cs.LG 交叉投稿

SG-OPD: Sign-Gated On-Policy Distillation via Sign-Consistency Gating and Phased Teacher Sampling

SG-OPD: 通过符号一致性门控和分阶段教师采样的符号门控在线蒸馏

Haoran Xu, Hongyu Wang, Yifei Gao, Jiaze Li, Xiaofeng Zhang, Xiaosong Yuan

发表机构 * Zhejiang University(浙江大学) Hunan University(湖南大学) Tianjin University(天津大学) Shanghai Jiao Tong University(上海交通大学) Jilin University(吉林大学)

AI总结 针对在线蒸馏中轨迹级对齐和教师偏好均匀可靠性假设的失效问题,提出SG-OPD方法,通过符号一致性门控和分阶段教师采样改进蒸馏效果,在竞赛级数学推理任务上平均提升1.98和7.50。

详情
AI中文摘要

在线蒸馏(OPD)在自身轨迹上训练学生模型,并利用更强教师的密集逐token监督,通常优于离线蒸馏和标准强化学习。然而,我们发现其有效性隐含地依赖于两个在实践中经常失效的假设:学生与教师之间的轨迹级对齐,以及教师偏好的均匀token级可靠性。因此,我们提出符号门控在线蒸馏(SG-OPD),该方法使用二元验证器作为教师信任信号,在两个互补粒度上发挥作用:分阶段教师采样在冷启动时混合验证器认可的教师轨迹,而符号一致性门控在教师与验证器校正方向一致的token上外推蒸馏更新,在分歧时内插。在竞赛级数学推理基准上的实验表明,SG-OPD持续优于标准OPD,在每样本和每问题水平上平均提升分别为1.98和7.50。

英文摘要

On-policy distillation (OPD) trains a student on its own trajectories with dense per-token supervision from a stronger teacher, and often outperforms off-policy distillation and standard reinforcement learning. However, we find that its effectiveness implicitly relies on two assumptions that frequently break in practice: trajectory-level alignment between the student and the teacher, and uniform token-level reliability of the teacher's preferences. We therefore propose Sign-Gated On-Policy Distillation (SG-OPD), which uses a binary verifier as a trust signal for the teacher at two complementary granularities: phased teacher sampling mixes in verifier-endorsed teacher rollouts at cold-start, and a sign-consistency gate extrapolates the distillation update on tokens where the teacher agrees with the verifier-correct direction and interpolates it where it disagrees. Experiments on competition-level mathematical reasoning benchmarks show that SG-OPD consistently outperforms standard OPD, with average gains of 1.98 and 7.50 at the per-sample and per-question levels, respectively.

2606.09396 2026-06-09 cs.CL cs.LG 交叉投稿

PriFT: Prior-Support Guided Supervised Fine-Tuning

PriFT: 先验支持引导的监督微调

Ke Wang, Shuangqi Li, Mathieu Salzmann, Pascal Frossard

发表机构 * EPFL(瑞士联邦理工学院洛桑分校)

AI总结 提出PriFT方法,利用冻结的预训练模型计算token权重,避免在线模型导致的自我强化动态,在数学推理、代码生成和医疗问答任务中取得SFT最优结果,并为后续RL提供更好初始化。

Comments The first two authors contributed equally to this work

详情
AI中文摘要

监督微调(SFT)是下游任务适配的高效方法,通常作为强化学习(RL)的初始化阶段,但其泛化能力可能弱于RL。一个关键限制是其离策略目标:SFT逐token拟合固定演示,包括与模型预训练分布对齐不良的目标,这可能导致过拟合。最近一系列工作通过给与当前模型预测分布更对齐的token分配更大的训练权重来解决此问题,直觉是拟合这些token对模型的预训练知识和表示的扭曲较小。然而,从当前微调模型计算token权重会将token权重与优化轨迹纠缠在一起,随着分布迅速偏离预训练模型,引发自我强化动态。为了解决这个问题,我们提出PriFT(先验支持引导的微调),该方法从冻结的预训练参考模型导出token权重,以获得不受微调影响的稳定重加权信号。该信号估计先验支持:每个目标token受预训练分布支持的程度。在多种现有token重加权规则中,将重加权信号从在线模型替换为预训练模型一致地提升了性能。我们引入了两种实例化:PriFT-prob使用预训练token概率,而PriFT-mass根据预训练分布下的累积概率质量选择token。在数学推理、代码生成和医疗问答上的大量实验表明,PriFT在SFT基线中取得了最先进的结果,并为后续RL训练提供了更好的初始化。

英文摘要

Supervised fine-tuning (SFT) is an efficient approach for downstream task adaptation and often serves as the initialization stage for reinforcement learning (RL), but it can show weaker generalization than RL. A key limitation is its off-policy objective: SFT fits fixed demonstrations token by token, including targets poorly aligned with the model's pretrained distribution, which can lead to overfitting. A recent line of work addresses this issue by assigning larger training weights to tokens better aligned with the current model's predictive distribution, with the intuition that fitting these tokens is less distortive to the model's pretrained knowledge and representations. However, computing the token weights from the model that is currently being fine-tuned entangles token weights with the optimization trajectory, inducing self-reinforcing dynamics as the distribution rapidly departs from the pretrained model. To address this, we propose PriFT (Prior-support guided Fine-Tuning), which derives token weights from a frozen pretrained reference to obtain a stable reweighting signal unaffected by fine-tuning. This signal estimates prior support: the extent to which each target token is supported by the pretrained distribution. Across multiple existing token-reweighting rules, switching the source of the reweighting signal from the online model to the pretrained model consistently improves performance. We introduce two instantiations: PriFT-prob uses pretrained token probability, while PriFT-mass selects tokens by cumulative probability mass under the pretrained distribution. Extensive experiments on mathematical reasoning, code generation, and medical question answering show that PriFT achieves state-of-the-art results among SFT baselines and provides a better initialization for subsequent RL training.
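The idea of reweighting the SFT loss with a frozen reference can be illustrated with a toy calculation. This is a rough sketch under stated assumptions, not the paper's objective: using the reference probability of each target token directly as its weight, normalizing by the sum of weights, and all the probabilities below are hypothetical choices made for the example.

```python
import math

def weighted_nll(policy_probs, weights):
    """Weighted negative log-likelihood over the target tokens.

    policy_probs: probability the model being trained assigns to each target token.
    weights: per-token weights, treated as constants (no gradient flows through them).
    """
    total_weight = sum(weights)
    return sum(w * -math.log(p) for w, p in zip(weights, policy_probs)) / total_weight

# Hypothetical per-token probabilities for one three-token demonstration.
reference_probs = [0.9, 0.6, 0.01]  # frozen pretrained model: last token has little prior support
policy_probs = [0.8, 0.5, 0.05]     # model currently being fine-tuned

plain_sft_loss = weighted_nll(policy_probs, [1.0, 1.0, 1.0])
reference_weighted_loss = weighted_nll(policy_probs, reference_probs)
```

Because the weights come from the frozen reference, they stay the same throughout training; weights computed from `policy_probs` would instead shift as the model is updated, which is the self-reinforcing effect the abstract describes.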

2606.09734 2026-06-09 quant-ph cs.LG 交叉投稿

Adaptive directional gradients for parameterised quantum circuits

参数化量子电路的自适应方向梯度

Brian Coyle, Snehal Raj, Virag Umathe, El Amine Cherrat, Elham Kashefi

发表机构 * School of Informatics, University of Edinburgh(爱丁堡大学信息学院) Fujitsu Research of Europe Ltd.(富士通欧洲有限公司) LIP6, CNRS, Sorbonne Université(LIP6研究所,法国国家科学研究中心,索邦大学) QC Ware Quantum Signals(量子信号)

AI总结 提出基于前向自动微分的参数化量子电路梯度估计框架,通过平均随机方向导数得到无偏梯度,并导出自适应优化器QUIVER,在多达1770个参数的问题上比参数平移规则效率提升数个数量级。

Comments 37 pages, 13 figures

详情
AI中文摘要

在量子硬件上训练参数化量子电路(PQC)的瓶颈在于梯度估计的测量成本,在参数平移规则下,该成本与可训练参数数量呈线性关系,并主导了大规模训练的总测量预算。本文提出了一种基于前向自动微分模式的PQC前向梯度估计器框架,通过平均自由可调数量的随机方向导数得到梯度的无偏估计,并恢复SPSA、随机坐标下降和参数平移规则作为极限情况,无需辅助量子比特或受控门开销。我们证明随机量子前向梯度下降在标准假设下收敛,并给出了显式的二阶矩展开,该展开在SPSA的单方向极端和参数平移的全梯度极端之间插值。在该框架内,我们推导出QUIVER(量子迭代V自适应估计器规则),这是一种参数化电路的自适应优化器,其更新规则遵循闭式最小测量成本分配。数值结果表明,在ECG5000和MNIST数据集上,前向梯度训练具有多达60个量子比特和1770个参数的汉明权重保持正交量子神经网络,比参数平移规则效率高数个数量级。我们还证明,我们提出的QUIVER优化器在使用量子近似优化算法的优化问题以及使用变分量子特征求解器的量子模拟上,可以优于iCANS和gCANS等节省测量的优化器。

英文摘要

Training parameterised quantum circuits (PQCs) on quantum hardware is bottlenecked by the measurement cost of gradient estimation, which under the parameter-shift rule scales linearly in the number of trainable parameters and dominates the total shot budget of training at scale. In this work, we propose a framework of forward gradient estimators for PQCs, based on the forward mode of automatic differentiation, that yields an unbiased estimator of the gradient by averaging a freely tunable number of random directional derivatives and recovers SPSA, random coordinate descent, and the parameter-shift rule as limiting cases, with no ancilla qubits or controlled-gate overhead. We prove that stochastic quantum forward gradient descent converges under standard assumptions, with an explicit second-moment expansion that interpolates between the single-direction extreme of SPSA and the full-gradient extreme of parameter-shift. Within this framework we derive QUIVER (Quantum Iterative V-adaptive Estimator Rule), an adaptive optimiser for parameterised circuits whose update rule follows from a closed-form minimum measurement-cost allocation. We show numerically that forward gradients train Hamming-weight-preserving orthogonal quantum neural networks with up to 60 qubits and 1770 parameters on the ECG5000 and MNIST datasets orders of magnitude more efficiently than the parameter-shift rule. We also demonstrate that our proposed QUIVER optimiser can outperform iCANS and gCANS measurement-frugal optimisers on optimisation problems using the quantum approximate optimisation algorithm and quantum simulation with the variational quantum eigensolver.
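The core estimator in this abstract, averaging random directional derivatives to obtain an unbiased gradient estimate, can be sketched classically. The example below is an assumption-laden toy, not the paper's quantum procedure: a cosine objective stands in for a circuit expectation value, a central finite difference stands in for the forward-mode directional derivative, and random sign (Rademacher) directions are one possible choice of direction distribution.

```python
import math
import random

def objective(theta):
    """Toy stand-in for a circuit expectation value."""
    return sum(math.cos(t) for t in theta)

def directional_derivative(fn, theta, v, h=1e-5):
    """Central-difference estimate of the derivative of fn along direction v."""
    plus = [t + h * vi for t, vi in zip(theta, v)]
    minus = [t - h * vi for t, vi in zip(theta, v)]
    return (fn(plus) - fn(minus)) / (2 * h)

def forward_gradient(fn, theta, num_directions, rng):
    """Average of (directional derivative) * direction over random sign vectors.

    Unbiased for the gradient because E[v v^T] = I for independent +/-1 entries.
    """
    d = len(theta)
    estimate = [0.0] * d
    for _ in range(num_directions):
        v = [rng.choice((-1.0, 1.0)) for _ in range(d)]
        slope = directional_derivative(fn, theta, v)
        for i in range(d):
            estimate[i] += slope * v[i] / num_directions
    return estimate

theta = [0.3, 1.1, -0.7, 2.0]
true_gradient = [-math.sin(t) for t in theta]
estimate = forward_gradient(objective, theta, 20000, random.Random(0))
max_error = max(abs(a - b) for a, b in zip(estimate, true_gradient))
```

With `num_directions=1` the estimator uses a single random direction per step, in the spirit of SPSA; increasing the number of directions trades more function evaluations for lower variance, which is the tunable knob the abstract refers to.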

2606.09803 2026-06-09 cs.CV cs.GR cs.LG 交叉投稿

Echo-Memory: A Controlled Study of Memory in Action World Models

Echo-Memory:动作世界模型中记忆的受控研究

Wayne King, Zeyue Xue, Yuxuan Bian, Jie Huang, Haoran Li, Yaowei Li, Yaofeng Su, Yuming Li, Haoyu Wang, Shiyi Zhang, Songchun Zhang, Yuwei Niu, Sihan Xu, Junhao Zhuang, Haoyang Huang, Nan Duan

发表机构 * Joy Future Academy

AI总结 提出Echo-Memory框架,通过控制变量法研究动作条件世界模型中的记忆机制,发现原始上下文容量和块状状态空间递归对开放域返回任务至关重要。

Comments 9 figures and 28 pages, Code at https://github.com/Echo-Team-Joy-Future-Academy-JD/Echo-Memory

详情
AI中文摘要

我们提出Echo-Memory,对动作条件世界模型中的记忆机制进行受控研究。这些模型从第一帧、文本提示和相机动作序列生成多段视频,但其核心失败往往是记忆而非局部图像合成:当相机离开并返回时,场景或显著物体可能悄然改变。现有记忆设计难以比较,因为增益与骨干网络、训练、检索和评估差异纠缠在一起。Echo-Memory固定了动作到视频的接口,仅改变生成器存储和读取历史的方式。在共享的视频扩散骨干网络、优化器、相机动作表示、采样器和评估流程下,我们比较了原始上下文、基于压缩的记忆、具有不同读取路径的空间摘要以及状态空间递归。这种匹配矩阵分离了四个通常混淆的轴:容量、压缩、读取和递归。我们还通过三分支协议评估记忆:重放质量、域内循环重访和开放域返回探测。这些分支经常得出不一致的结论,表明重放保真度不足以作为“记住一个世界”的代理指标。由此得出三个发现。原始上下文是一个强大的容量基线,它对开放域返回的改善远大于对重放指标的改善。紧凑性不能免费替代容量:激进的空间记忆和混合压缩记忆会丢失返回所需的显著证据。最后,块状状态空间递归是我们矩阵中最强的开放域返回机制,表明隐式记忆的结构与是否使用记忆同样重要。这些结果为在孤立的重放指标之外研究动作世界模型中的记忆提供了一个紧凑的协议。

英文摘要

We present Echo-Memory, a controlled study of memory mechanisms in action-conditioned world models. These models generate multi-segment videos from a first frame, text prompt, and camera-action sequence, but their central failure is often memory rather than local image synthesis: after the camera leaves and returns, the scene or salient object may silently change. Existing memory designs are hard to compare because gains are entangled with backbone, training, retrieval, and evaluation differences. Echo-Memory fixes the action-to-video interface and varies only how history is stored and read by the generator. Under a shared video diffusion backbone, optimizer, camera-action representation, sampler, and evaluation pipeline, we compare raw context, compression-based memory, spatial summaries with different read-out paths, and state-space recurrence. This matched matrix separates four otherwise conflated axes: capacity, compression, read-out, and recurrence. We also evaluate memory through a three-branch protocol: replay quality, in-domain loop revisit, and open-domain return probes. The branches routinely disagree, showing that replay fidelity is not a sufficient proxy for remembering a world. Three findings follow. Raw context is a strong capacity baseline and improves open-domain return far more than it improves replay metrics. Compactness is not a free substitute for capacity: aggressive spatial and hybrid-compression memories lose the salient evidence needed for return. Finally, block-wise state-space recurrence is the strongest open-domain return mechanism in our matrix, showing that the structure of implicit memory matters as much as the decision to use it. These results provide a compact protocol for studying memory in action world models beyond isolated replay metrics.

2402.13425 2026-06-09 cs.LG cs.AI stat.ML 版本更新

Investigating the Histogram Loss in Regression

探究回归中的直方图损失

Ehsan Imani, Kai Luedemann, Sam Scholnick-Hughes, Esraa Elelimy, Martha White

发表机构 * Alberta Machine Intelligence Institute (Amii) and Reinforcement Learning and Artificial Intelligence Laboratory(阿尔伯塔机器智能研究所(Amii)和强化学习与人工智能实验室) Department of Computing Science, University of Alberta(计算科学系,阿尔伯塔大学) University of Tübingen(图宾根大学) Zuse School ELIZA(祖斯学校ELIZA)

AI总结 本文通过理论和实验分析,探究直方图损失在回归任务中提升性能的原因,发现其优势源于优化改进而非额外信息建模,并在常见深度学习应用中验证其有效性。

Comments 52 pages

详情
Journal ref
JMLR,2026
AI中文摘要

在回归任务中,即使预测只需要均值,训练神经网络来建模整个分布也变得越来越常见。这种额外的建模通常会带来性能提升,但其背后的原因尚不完全清楚。本文研究了一种最近的回归方法——直方图损失,该方法通过最小化目标分布与灵活直方图预测之间的交叉熵来学习目标变量的条件分布。我们设计了理论和实证分析,以确定这种性能提升出现的原因和时机,以及损失的不同组成部分如何贡献于这种提升。我们的结果表明,在这种设置下学习分布的好处来自于优化方面的改进,而非建模额外信息。然后,我们展示了直方图损失在常见深度学习应用中的可行性,无需昂贵的超参数调优。

英文摘要

It is becoming increasingly common in regression to train neural networks that model the entire distribution even if only the mean is required for prediction. This additional modeling often comes with a performance gain, but the reasons behind the improvement are not fully known. This paper investigates a recent approach to regression, the Histogram Loss, which involves learning the conditional distribution of the target variable by minimizing the cross-entropy between a target distribution and a flexible histogram prediction. We design theoretical and empirical analyses to determine why and when this performance gain appears, and how different components of the loss contribute to it. Our results suggest that the benefits of learning distributions in this setup come from improvements in optimization rather than from modeling extra information. We then demonstrate the viability of the Histogram Loss in common deep learning applications without a need for costly hyperparameter tuning.
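The loss described in this abstract, a cross-entropy between a target distribution and a histogram prediction, can be written out in a few lines. The sketch below assumes one common choice of target, a Gaussian centered on the label, discretized over the bins and renormalized; the bin layout, `sigma`, and all function names are illustrative assumptions rather than the paper's settings.

```python
import math

def gaussian_target(y, edges, sigma):
    """Probability mass of N(y, sigma^2) in each bin, renormalized to the bin range."""
    def cdf(x):
        return 0.5 * (1.0 + math.erf((x - y) / (sigma * math.sqrt(2.0))))
    mass = [cdf(b) - cdf(a) for a, b in zip(edges, edges[1:])]
    total = sum(mass)
    return [m / total for m in mass]

def softmax(logits):
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def histogram_loss(logits, target):
    """Cross-entropy between the target histogram and the predicted histogram."""
    probs = softmax(logits)
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0.0)

def predicted_mean(logits, edges):
    """Point prediction: expected bin center under the predicted histogram."""
    centers = [(a + b) / 2.0 for a, b in zip(edges, edges[1:])]
    return sum(p * c for p, c in zip(softmax(logits), centers))

edges = [float(i) for i in range(11)]          # 10 unit-width bins on [0, 10]
target = gaussian_target(4.2, edges, sigma=1.0)

uniform_logits = [0.0] * 10
matched_logits = [math.log(t + 1e-12) for t in target]

uniform_loss = histogram_loss(uniform_logits, target)
matched_loss = histogram_loss(matched_logits, target)
mean_estimate = predicted_mean(matched_logits, edges)
```

Even though training fits a whole histogram, the regression output is still a single number, the mean of the predicted histogram, which is why the method remains comparable with ordinary squared-error regression.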

2505.20137 2026-06-09 cs.LG cs.AI 版本更新

ePC: Fast and Deep Predictive Coding in Digital Simulation

ePC:数字仿真中的快速深度预测编码

Cédric Goemaere, Gaspard Oliviers, Rafal Bogacz, Thomas Demeester

发表机构 * IDLab, Ghent University -- imec, Belgium(ID实验室,根特大学——imec,比利时) Brain Network Dynamics Unit, University of Oxford, UK(脑网络动力学单位,牛津大学,英国)

AI总结 提出误差预测编码(ePC),通过重新参数化解决标准状态预测编码(sPC)在数字仿真中的指数信号衰减问题,实现与反向传播相当的深度模型训练速度。

Comments Accepted at ICML 2026 - Main Track. All code available at https://github.com/cgoemaere/error_based_PC

详情
AI中文摘要

预测编码(PC)为神经网络训练提供了一种受大脑启发的反向传播替代方案,被描述为最小化其内部能量的物理系统。然而,在实践中,PC主要是在数字仿真中实现的,需要大量的计算,同时难以扩展到更深的架构。本文重新构建了PC以克服这种硬件-算法不匹配。首先,我们揭示了规范的状态基PC(sPC)在数字仿真中本质上是深度低效的,不可避免地导致指数级信号衰减,从而阻碍整个最小化过程。然后,为了克服这一根本限制,我们引入了误差基PC(ePC),这是一种新的PC重新参数化,不会遭受信号衰减。虽然不再具有生物合理性,但ePC数值计算精确的PC权重梯度,运行速度比sPC快几个数量级。跨多个架构和数据集的实验表明,即使在sPC难以处理的更深模型中,ePC也能匹配反向传播的性能。除了实际改进,我们的工作还提供了对PC动力学的理论洞察,并为在数字硬件及更广泛领域将基于PC的学习扩展到更深架构奠定了基础。

英文摘要

Predictive Coding (PC) offers a brain-inspired alternative to backpropagation for neural network training, described as a physical system minimizing its internal energy. However, in practice, PC is predominantly digitally simulated, requiring excessive amounts of compute while struggling to scale to deeper architectures. This paper reformulates PC to overcome this hardware-algorithm mismatch. First, we uncover how the canonical state-based formulation of PC (sPC) is, by design, deeply inefficient in digital simulation, inevitably resulting in exponential signal decay that stalls the entire minimization process. Then, to overcome this fundamental limitation, we introduce error-based PC (ePC), a novel reparameterization of PC which does not suffer from signal decay. Though no longer biologically plausible, ePC numerically computes exact PC weight gradients and runs orders of magnitude faster than sPC. Experiments across multiple architectures and datasets demonstrate that ePC matches backpropagation's performance even for deeper models where sPC struggles. Besides practical improvements, our work provides theoretical insight into PC dynamics and establishes a foundation for scaling PC-based learning to deeper architectures on digital hardware and beyond.

2509.10534 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Decoupling the "What" and "Where" With Polar Coordinate Positional Embeddings

解耦“什么”和“哪里”:极坐标位置嵌入

Anand Gopalakrishnan, Robert Csordás, Jürgen Schmidhuber, Michael C. Mozer

发表机构 * DeepMind, London, UK(DeepMind,英国伦敦)

AI总结 提出极坐标位置嵌入(PoPE)以解耦Transformer注意力机制中的内容和位置,在诊断任务、序列建模和语言模型中优于RoPE,并展现零样本长度外推能力。

Comments ICML 2026 camera-ready version

详情
AI中文摘要

Transformer架构中的注意力机制根据内容(“什么”)和序列中的位置(“哪里”)将键匹配到查询。我们提出一项分析,表明在流行的RoPE旋转位置嵌入中,“什么”和“哪里”是纠缠的。这种纠缠会损害性能,特别是当决策需要在这两个因素上独立匹配时。我们提出对RoPE的改进,称为极坐标位置嵌入(PoPE),它消除了“什么-哪里”的混淆。PoPE在仅通过位置或内容进行索引的诊断任务上表现远优于基线。在音乐、基因组和自然语言领域的自回归序列建模中,使用PoPE作为位置编码方案的Transformer在评估损失(困惑度)和下游任务性能上优于使用RoPE的基线。在语言建模中,这些优势在模型规模从124M到774M参数时持续存在。关键的是,与RoPE甚至专为外推设计的方法YaRN(需要额外微调和频率插值)相比,PoPE展现出强大的零样本长度外推能力。

英文摘要

The attention mechanism in a Transformer architecture matches keys to queries based on both content (the what) and position in a sequence (the where). We present an analysis indicating that what and where are entangled in the popular RoPE rotary position embedding. This entanglement can impair performance, particularly when decisions require independent matches on these two factors. We propose an improvement to RoPE, which we call Polar Coordinate Position Embeddings or PoPE, that eliminates the what-where confound. PoPE is far superior on a diagnostic task requiring indexing solely by position or by content. On autoregressive sequence modeling in music, genomic, and natural language domains, Transformers using PoPE as the positional encoding scheme outperform baselines using RoPE with respect to evaluation loss (perplexity) and downstream task performance. On language modeling, these gains persist across model scale, from 124M to 774M parameters. Crucially, PoPE shows strong zero-shot length extrapolation capabilities compared not only to RoPE but even to a method designed for extrapolation, YaRN, which requires additional fine-tuning and frequency interpolation.

2509.12760 2026-06-09 cs.LG cs.CL 版本更新

Similarity-Distance-Magnitude Activations

相似度-距离-幅度激活函数

Allen Schmaltz

发表机构 * Reexpress AI

AI总结 本文提出SDM激活函数,通过引入相似度和距离意识提升softmax的鲁棒性和可解释性,并通过密集匹配实现基于实例的可解释性。SDM估计器通过数据驱动的CDF分区控制分类准确性,优于现有校准方法。

Comments Accepted to Findings of the Association for Computational Linguistics: ACL 2026. 21 pages, 8 tables, 1 algorithm. arXiv admin note: substantial text overlap with arXiv:2502.20167

详情
AI中文摘要

我们引入了相似度-距离-幅度(SDM)激活函数,它是标准softmax激活函数的一种更稳健、更可解释的形式:在已有的输出幅度(即决策边界)感知之上,增加了相似度(即与训练集中被正确预测样本的深度匹配)感知和到训练分布的距离感知,并通过密集匹配实现基于示例的可解释性。我们进一步引入了SDM估计器,它基于通过SDM激活对类内经验CDF进行的数据驱动分区,以控制选择性分类中的类条件和预测条件准确率。当用作预训练语言模型的最终层激活进行选择性分类时,SDM估计器比使用softmax激活的现有校准方法对协变量偏移和分布外输入更鲁棒,同时在分布内数据上保持信息性。

英文摘要

We introduce the Similarity-Distance-Magnitude (SDM) activation function, a more robust and interpretable formulation of the standard softmax activation function, adding Similarity (i.e., correctly predicted depth-matches into training) awareness and Distance-to-training-distribution awareness to the existing output Magnitude (i.e., decision-boundary) awareness, and enabling interpretability-by-exemplar via dense matching. We further introduce the SDM estimator, based on a data-driven partitioning of the class-wise empirical CDFs via the SDM activation, to control the class- and prediction-conditional accuracy among selective classifications. When used as the final-layer activation over pre-trained language models for selective classification, the SDM estimator is more robust to covariate shifts and out-of-distribution inputs than existing calibration methods using softmax activations, while remaining informative over in-distribution data.

2509.15494 2026-06-09 cs.LG physics.data-an 版本更新

Multi-resolution Enhancement for Full Spectrum Neural Representations

全频谱神经表示的多分辨率增强

Yuan Ni, Zhantao Chen, Shizhou Xu, Cheng Peng, Rajan Plumley, Chun Hong Yoon, Jana B. Thayer, Joshua J. Turner

发表机构 * Linac Coherent Light Source, SLAC National Accelerator Laboratory(直线相干光源,SLAC国家加速器实验室) Stanford Institute for Materials and Energy Sciences, Stanford University(斯坦福大学材料与能源科学研究所) Walker Department of Mechanical Engineering, The University of Texas at Austin(德克萨斯大学奥斯汀分校机械工程系) Department of Mathematics, University of California Davis(加州大学戴维斯分校数学系) Department of Physics, Carnegie Mellon University(卡内基梅隆大学物理系)

AI总结 提出WIEN-INR框架,通过分层增强网络在不同分辨率尺度上建模,提升小网络对多尺度结构和高频细节的表示能力,实现紧凑高保真表示。

详情
AI中文摘要

科学数据采集持续超越存储和分析能力,使得基于体素的表示越来越难以处理。隐式神经表示(INRs)通过基于坐标的神经网络编码信号,作为数据的替代品,其计算和存储需求随网络复杂度而非数据维度扩展,提供了有前景的解决方案。然而,较小的INRs难以忠实表示构成科学测量大部分的多尺度结构、高频信息和精细纹理。我们提出WIEN-INR,一个理论指导的分层INR框架,跨分辨率尺度分配建模,并通过新颖的增强网络恢复细微细节,从而提高表示能力。这种多尺度架构允许较小的网络保留全频谱信息,同时保持训练效率并降低存储成本。在跨尺度和复杂性的不同原始实验测量上评估,WIEN-INR代表了神经表示在科学工作流中更广泛采用的实用步骤,提供了紧凑、鲁棒和高保真的表示。

英文摘要

Scientific data acquisition continues to outpace storage and analysis capabilities, making voxel-based representations increasingly intractable. Implicit neural representations (INRs) offer a promising solution by encoding signals through coordinate-based neural networks, serving as surrogates of data, with computational and storage requirements scaling with network complexity rather than data dimensionality. However, smaller INRs struggle to faithfully represent the multi-scale structures, high-frequency information, and fine textures that constitute a large proportion of scientific measurements. We propose WIEN-INR, a theoretically-guided hierarchical INR framework that distributes modeling across resolution scales and enables improved representation capacity through a novel enhancement network to recover subtle details. This multi-scale architecture allows smaller networks to retain the full spectrum of information while preserving training efficiency and lowering storage cost. Evaluated on distinct raw experimental measurements across scales and complexities, WIEN-INR represents a practical step toward broader adoption of neural representations in scientific workflows, delivering compact, robust, and high-fidelity representations.

2510.22450 2026-06-09 cs.LG cs.AI 版本更新

SmartMixed: A Two-Phase Training Strategy for Adaptive Activation Function Learning in Neural Networks

SmartMixed:一种用于神经网络自适应激活函数学习的两阶段训练策略

Amin Omidvar

发表机构 * Independent Researcher, Toronto, Ontario, Canada(独立研究者,加拿大安大略省多伦多)

AI总结 提出SmartMixed两阶段训练策略,通过可微硬混合机制让神经元自适应选择激活函数,第二阶段固定选择以保持推理效率,在MNIST上验证了不同层神经元的激活函数偏好。

详情
AI中文摘要

激活函数的选择在神经网络中起着关键作用,但大多数架构仍然依赖于所有神经元上固定的、统一的激活函数。我们引入了SmartMixed,一种新颖的两阶段训练策略,允许网络学习每个神经元的最优激活函数,同时在推理时保持计算效率。在第一阶段,神经元使用可微硬混合机制从候选激活函数池(ReLU、Sigmoid、Tanh、Leaky_ReLU、ELU、SELU)中自适应选择。在第二阶段,每个神经元的激活函数根据学习到的选择固定下来,从而得到一个计算高效的网络,支持使用优化的向量化操作继续训练。我们在MNIST数据集上使用不同架构的前馈神经网络评估了SmartMixed。我们的分析表明,不同层的神经元对激活函数表现出不同的偏好,揭示了神经架构内的功能多样性。我们还证明了SmartMixed通过允许神经元选择其偏好的激活函数有效地训练网络,与使用单一固定最先进激活函数的模型相竞争。

英文摘要

The choice of activation function plays a critical role in neural networks, yet most architectures still rely on fixed, uniform activation functions across all neurons. We introduce SmartMixed, a novel two-phase training strategy that allows networks to learn optimal per-neuron activation functions while preserving computational efficiency at inference. In the first phase, neurons adaptively select from a pool of candidate activation functions (ReLU, Sigmoid, Tanh, Leaky_ReLU, ELU, SELU) using a differentiable hard mixture mechanism. In the second phase, each neuron's activation function is fixed according to the learned selection, resulting in a computationally efficient network that supports continued training with optimized vectorized operations. We evaluate SmartMixed on the MNIST dataset using feedforward neural networks of different architectures. Our analysis reveals that neurons in different layers exhibit distinct preferences for activation functions, providing insights into the functional diversity within neural architectures. We also demonstrate that SmartMixed trains networks effectively by allowing neurons to select their preferred activation functions, performing competitively with models that use a single fixed state-of-the-art activation function.

2511.04124 2026-06-09 cs.LG 版本更新

Decomposable Neuro Symbolic Regression

可分解的神经符号回归

Giorgio Morales, John W. Sheppard

发表机构 * Gianforte School of Computing(吉安福特计算学院) Montana State University(蒙塔纳州立大学)

AI总结 本文提出一种可解释的神经符号回归方法,利用Transformer、遗传算法和遗传编程生成可解释的多元表达式,通过多集合Transformer生成单变量符号骨架,并通过GA和GP融合优化,实现比其他方法更准确的数学表达。

Comments Under review as submission to TMLR

详情
AI中文摘要

符号回归(SR)通过发现能够刻画观测数据中潜在关系的数学表达式来建模复杂系统。然而,大多数SR方法优先最小化预测误差而非识别控制方程,常产生过于复杂或不准确的表达式。为此,我们提出一种可分解的SR方法,利用Transformer模型、遗传算法(GA)和遗传编程(GP)生成可解释的多元表达式。具体而言,我们的可解释SR方法将训练好的“不透明”回归模型提炼为数学表达式,作为其计算函数的解释。我们采用多集合Transformer生成多个单变量符号骨架,描述每个变量如何影响不透明模型的响应。然后通过基于GA的方法评估生成骨架的性能,选择高质量候选子集,再通过基于GP的级联过程逐步合并它们,同时保持原始骨架结构。最终的多元骨架通过GA进行系数优化。我们在噪声程度受控且各不相同的问题上评估了我们的方法,其插值和外推误差低于或相当于两种基于GP的方法、三种神经SR方法和一种混合方法。与这些方法不同,我们的方法始终学习到与原始数学结构匹配的表达式。同样,我们的方法在费曼数据集上既实现了较高的符号解恢复率,又取得了与基准方法相竞争的预测性能。

英文摘要

Symbolic regression (SR) models complex systems by discovering mathematical expressions that capture underlying relationships in observed data. However, most SR methods prioritize minimizing prediction error over identifying the governing equations, often producing overly complex or inaccurate expressions. To address this, we present a decomposable SR method that generates interpretable multivariate expressions leveraging transformer models, genetic algorithms (GAs), and genetic programming (GP). In particular, our explainable SR method distills a trained "opaque" regression model into mathematical expressions that serve as explanations of its computed function. Our method employs a Multi-Set Transformer to generate multiple univariate symbolic skeletons that characterize how each variable influences the opaque model's response. We then evaluate the generated skeletons' performance using a GA-based approach to select a subset of high-quality candidates before incrementally merging them via a GP-based cascade procedure that preserves their original skeleton structure. The final multivariate skeletons undergo coefficient optimization via a GA. We evaluated our method on problems with controlled and varying degrees of noise, demonstrating lower or comparable interpolation and extrapolation errors compared to two GP-based methods, three neural SR methods, and a hybrid approach. Unlike them, our approach consistently learned expressions that matched the original mathematical structure. Similarly, our method achieved both a high symbolic solution recovery rate and competitive predictive performance relative to benchmark methods on the Feynman dataset.

2512.01467 2026-06-09 cs.LG cs.AR cs.SC 版本更新

Differentiable Weightless Controllers: Learning Logic Circuits for Continuous Control

可微无权重控制器:学习连续控制的逻辑电路

Fabian Kresse, Christoph H. Lampert

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息研究所)

AI总结 提出可微无权重控制器(DWC),一种符号可微架构,通过梯度训练学习高效控制策略,编译为低延迟、低能耗的FPGA电路,在MuJoCo基准上达到与深度策略竞争的性能,并具有稀疏可解释的连接模式。

Comments Accepted at Forty-third International Conference on Machine Learning (ICML), 19 pages, 12 figures, 12 tables

详情
AI中文摘要

在现实条件下控制自主系统通常需要能够以低延迟和最小能耗评估的策略。不幸的是,这些条件与使用高精度深度神经网络作为控制器相矛盾。在这项工作中,我们引入了可微无权重控制器(DWC),这是一种符号可微架构,学习灵活、非线性但高效的控制策略。DWC可以通过基于梯度的技术进行端到端训练,但直接编译为FPGA兼容电路,具有少至一个时钟周期的延迟和每动作纳焦耳级的能量成本。在五个MuJoCo基准测试中,包括高维Humanoid,DWC实现了与标准深度策略(全精度或量化神经网络)竞争的性能。此外,DWC表现出结构稀疏和可解释的连接模式,使得能够直接检查哪些输入值影响控制决策。

英文摘要

Controlling autonomous systems under real-world conditions often requires policies that can be evaluated with low latency and minimal energy consumption. Unfortunately, these conditions are at odds with the use of high-precision deep neural networks as controllers. In this work, we introduce Differentiable Weightless Controllers (DWCs), a symbolic-differentiable architecture that learns flexible, non-linear, yet highly efficient control policies. DWCs can be trained end-to-end via gradient-based techniques, yet compile directly into FPGA-compatible circuits with few- or even single-clock-cycle latency and nanojoule-level energy cost per action. Across five MuJoCo benchmarks, including high-dimensional Humanoid, DWCs achieve returns competitive with standard deep policies (full-precision or quantized neural networks). Furthermore, DWCs exhibit structurally sparse and interpretable connectivity patterns, enabling direct inspection of which input values influence control decisions.

2601.15423 2026-06-09 cs.LG 版本更新

Lattice: A Confidence-Gated Hybrid System for Uncertainty-Aware Sequential Prediction with Behavioral Archetypes

Lattice: 一种基于置信门控的混合系统,用于具有行为原型的不确定性感知序列预测

Lorian Bannis

发表机构 * banlys.com(banlys公司)

AI总结 提出Lattice混合系统,通过二元置信门控条件激活行为原型,在不确定时回退到骨干预测,在MovieLens上LSTM+Lattice的HR@10提升31.7%,Transformer和SASRec也有提升。

Comments v2 (May 2026): Corrected primary estimand; removed misleading SOTA comparisons; backbone-native transformer/SASRec results; gated vs ungated trade-off; IP-conscious reporting; LIGO/finance demoted to appendix. 11 pages, 1 figure. Patent pending. Contact: LorianBannis@banlys.com for benchmark access

详情
AI中文摘要

我们引入了Lattice,一种混合序列预测系统,该系统使用二元置信门控有条件地激活学习到的行为结构。该系统将行为窗口总结为行为原型,并且仅当支持内置信信号超过验证校准阈值时激活基于原型的评分,在不确定时回退到骨干预测。我们的主要估计量是向固定骨干添加Lattice对相同测试行的控制效应。在MovieLens(30个配对种子,全目录排名)上,LSTM+Lattice相比单独LSTM骨干在HR@10上提升了+31.7%(门控)(p远小于10^-20);非门控融合在同一协议下达到+58.7%。我们不声称门控最大化池化准确率。使用骨干原生原型(在每个骨干的嵌入空间中拟合),在相同评估设计下,门控提升分别为+13.3%(Transformer)和+17.0%(SASRec)。先前版本1中约0%的Transformer行反映了无效的跨骨干迁移,而非组合无法帮助更强编码器的证据。Amazon Electronics提供了跨领域支持证据(+124.0%门控,15个种子,高方差)。受控偏移检查(附录)展示了分布偏移下的门控拒绝。独立的SASRec和BERT4Rec分数是上下文参考,而非目标估计量。我们报告组合实现了什么以及何时激活;生产校准和实现细节因专利申请而保密。

英文摘要

We introduce Lattice, a hybrid sequential prediction system that conditionally activates learned behavioral structure using binary confidence gating. The system summarizes behavior windows as behavioral archetypes and activates archetype-based scoring only when an in-support confidence signal exceeds a validation-calibrated threshold, falling back to backbone predictions when uncertain. Our primary estimand is the controlled effect of adding Lattice to a fixed backbone on identical test rows. On MovieLens (30 paired seeds, full-catalog ranking), LSTM+Lattice improves HR@10 by +31.7% (gated) versus the LSTM backbone alone (p much less than 10^-20); ungated fusion reaches +58.7% on the same protocol. We do not claim gating maximizes pooled accuracy. With backbone-native archetypes (fit in each backbone's embedding space), gated lifts of +13.3% (transformer) and +17.0% (SASRec) hold under the same evaluation design. A prior approximately 0% transformer row in version 1 reflected an invalid cross-backbone transfer, not evidence that composition cannot help stronger encoders. Amazon Electronics provides supporting cross-domain evidence (+124.0% gated, 15 seeds, high variance). Controlled shift checks (appendix) illustrate gate refusal under distribution shift. Standalone SASRec and BERT4Rec scores are contextual references, not the target estimand. We report what composition achieves and when it activates; production calibration and implementation details remain proprietary pending patent prosecution.
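
The gate-then-fall-back behaviour described in this abstract can be sketched as follows. This is only an illustration: the abstract states that calibration and implementation details are proprietary, so the function name `gated_scores`, the linear blend, and the `weight` parameter are all hypothetical choices, not the system's actual fusion rule.

```python
def gated_scores(backbone_scores, archetype_scores, confidence, threshold, weight=0.5):
    """Blend in archetype scores only when the confidence gate opens.

    The linear blend and `weight` are hypothetical; the source does not
    disclose how archetype-based scoring is combined with the backbone.
    """
    if confidence > threshold:
        # Gate open: mix backbone and archetype scores item by item.
        return [(1 - weight) * b + weight * a
                for b, a in zip(backbone_scores, archetype_scores)]
    # Gate closed: fall back to the backbone prediction unchanged.
    return list(backbone_scores)


backbone = [1.0, 2.0]
archetype = [3.0, 0.0]
confident = gated_scores(backbone, archetype, confidence=0.9, threshold=0.5)
uncertain = gated_scores(backbone, archetype, confidence=0.1, threshold=0.5)
```

With the gate closed, the ranking is exactly the backbone's, which is the property the abstract emphasises for uncertain inputs.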

2602.01357 2026-06-09 cs.LG 版本更新

Your Self-Play Algorithm is Secretly an Adversarial Imitator: Understanding LLM Self-Play through the Lens of Imitation Learning

你的自对弈算法其实是一个对抗性模仿者:通过模仿学习的视角理解LLM自对弈

Shangzhe Li, Xuchao Zhang, Chetan Bansal, Weitong Zhang

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Microsoft Research(微软研究院)

AI总结 本文通过将自对弈微调建模为模型与自身参数化的正则化隐式奖励玩家之间的极小极大博弈,统一了自对弈模仿与偏好对齐,并提出了基于χ²散度的新算法,在多种语言模型微调任务上优于现有方法。

Comments 26 pages, 6 tables, 5 figures

详情
AI中文摘要

自对弈后训练方法已成为微调大型语言模型并在没有偏好数据的情况下将弱语言模型转变为强语言模型的有效方法。然而,自对弈微调的理论基础仍未被充分探索。在这项工作中,我们通过将自对弈微调与对抗性模仿学习联系起来,将微调过程建模为模型与由模型自身参数化的正则化隐式奖励玩家之间的极小极大博弈,从而解决了这一问题。这一视角将自对弈模仿和一般偏好对齐统一在一个共同框架内。在此公式下,我们进行了博弈论分析,表明自对弈微调将收敛到其均衡。受这一理论公式的指导,我们提出了一种新的基于χ²散度变分目标的自对弈模仿微调算法,该算法具有有界奖励和改进的稳定性。在各种语言模型微调任务上的实验表明,该方法始终优于现有的自对弈方法,并验证了我们的理论见解。

英文摘要

Self-play post-training methods have emerged as an effective approach for finetuning large language models, turning weak language models into strong ones without preference data. However, the theoretical foundations of self-play finetuning remain underexplored. In this work, we address this gap by connecting self-play finetuning with adversarial imitation learning: we formulate the finetuning procedure as a min-max game between the model and a regularized implicit reward player parameterized by the model itself. This perspective unifies self-play imitation and general preference alignment within a common framework. Under this formulation, we present a game-theoretic analysis showing that self-play finetuning converges to its equilibrium. Guided by this theoretical formulation, we propose a new self-play imitation finetuning algorithm based on the $\chi^2$-divergence variational objective, with bounded rewards and improved stability. Experiments on various language model finetuning tasks demonstrate consistent improvements over existing self-play methods and validate our theoretical insights.

2602.15829 2026-06-09 cs.LG 版本更新

Operationalising the Superficial Alignment Hypothesis via Task Complexity

通过任务复杂度操作化浅层对齐假设

Tomás Vergara-Browne, Darshan Patil, Ivan Titov, Siva Reddy, Tiago Pimentel, Marius Mosbach

发表机构 * University of Maryland(马里兰大学) University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学) University of Edinburgh(爱丁堡大学)

AI总结 提出任务复杂度指标(达到目标性能的最短程序长度)来量化浅层对齐假设,实验表明预训练大幅降低任务复杂度,而微调可将复杂度降低数个数量级。

Comments ICML 2026

详情
AI中文摘要

浅层对齐假设(SAH)认为,大型语言模型在预训练期间学习大部分知识,而后期训练只是将这些知识表面化。然而,SAH缺乏精确的定义,导致(i)支持它的不同且看似正交的论点,以及(ii)对其的重要批评。我们提出一个新的度量标准,称为任务复杂度:在任务上达到目标性能的最短程序长度。在这个框架中,SAH简单地声称预训练模型大幅降低了在许多任务上实现高性能的复杂度。我们的定义统一了先前支持SAH的论点,将它们解释为寻找此类短程序的不同策略。实验上,我们估计了数学推理、机器翻译和指令遵循的任务复杂度;然后我们表明,当以预训练模型为条件时,这些复杂度可以非常低。此外,我们发现预训练能够访问我们任务上的强性能,但可能需要千兆字节长度的程序来访问它们。另一方面,后期训练将达到相同性能的复杂度降低了几个数量级。总体而言,我们的结果强调,任务适应通常需要惊人的少量信息——通常只有几千字节。

英文摘要

The superficial alignment hypothesis (SAH) posits that large language models learn most of their knowledge during pre-training, and that post-training merely surfaces this knowledge. The SAH, however, lacks a precise definition, which has led to (i) different and seemingly orthogonal arguments supporting it, and (ii) important critiques of it. We propose a new metric called task complexity: the length of the shortest program that achieves a target performance on a task. In this framework, the SAH simply claims that pre-trained models drastically reduce the complexity of achieving high performance on many tasks. Our definition unifies prior arguments supporting the SAH, interpreting them as different strategies to find such short programs. Experimentally, we estimate the task complexity of mathematical reasoning, machine translation, and instruction following; we then show that these complexities can be remarkably low when conditioned on a pre-trained model. Further, we find that pre-training enables access to strong performance on our tasks, but it can require programs gigabytes in length to access it. Post-training, on the other hand, collapses the complexity of reaching this same performance by several orders of magnitude. Overall, our results highlight that task adaptation often requires surprisingly little information -- often just a few kilobytes.

2602.16224 2026-06-09 cs.LG 版本更新

Amortized Predictability-aware Training Framework for Time Series Forecasting and Classification

面向时间序列预测与分类的摊销可预测性感知训练框架

Xu Zhang, Peng Wang, Yichen Li, Wei Wang

发表机构 * Shanghai Key Laboratory of Data Science, College of Computer Science and Artificial Intelligence Fudan University(复旦大学计算机科学与人工智能学院上海数据科学关键实验室) Department of Electrical and Computer Engineering University of British Columbia (UBC)(英属哥伦比亚大学电气与计算机工程系)

AI总结 提出APTF框架,通过分层可预测性感知损失和摊销模型识别并惩罚低可预测性样本,提升时间序列预测与分类性能。

Comments This work is accepted by the proceedings of the ACM Web Conference 2026 (WWW 2026). The code is available at the link https://github.com/Meteor-Stars/APTF

详情
AI中文摘要

时间序列数据在各个领域容易受到噪声的影响,训练样本可能包含偏离正常数据分布的低可预测性模式,导致训练不稳定或收敛到较差的局部最小值。因此,减轻低可预测性样本的不利影响对于时间序列分析任务(如时间序列预测(TSF)和时间序列分类(TSC))至关重要。尽管许多深度学习模型已取得有希望的性能,但很少有模型考虑如何识别和惩罚低可预测性样本以从训练角度改进模型性能。为填补这一空白,我们提出了一个通用的摊销可预测性感知训练框架(APTF),适用于TSF和TSC。APTF引入了两个关键设计,使模型能够关注高可预测性样本,同时仍能从低可预测性样本中适当学习:(i)分层可预测性感知损失(HPL),动态识别低可预测性样本并随着训练进行逐步扩大其损失惩罚,以及(ii)一个摊销模型,减轻由模型偏差引起的可预测性估计误差,进一步增强HPL的有效性。代码可在https://github.com/Meteor-Stars/APTF获取。

英文摘要

Time series data are prone to noise in various domains, and training samples may contain low-predictability patterns that deviate from the normal data distribution, leading to training instability or convergence to poor local minima. Therefore, mitigating the adverse effects of low-predictability samples is crucial for time series analysis tasks such as time series forecasting (TSF) and time series classification (TSC). While many deep learning models have achieved promising performance, few consider how to identify and penalize low-predictability samples to improve model performance from the training perspective. To fill this gap, we propose a general Amortized Predictability-aware Training Framework (APTF) for both TSF and TSC. APTF introduces two key designs that enable the model to focus on high-predictability samples while still learning appropriately from low-predictability ones: (i) a Hierarchical Predictability-aware Loss (HPL) that dynamically identifies low-predictability samples and progressively expands their loss penalty as training evolves, and (ii) an amortization model that mitigates predictability estimation errors caused by model bias, further enhancing HPL's effectiveness. The code is available at https://github.com/Meteor-Stars/APTF.

2603.08630 2026-06-09 cs.LG physics.comp-ph 版本更新

Integral Formulas for Vector Signal Tensor Products

向量信号张量积的积分公式

Valentin Heyraud, Zachary Weller-Davies, Jules Tilly

发表机构 * InstaDeep

AI总结 本文推导了简化向量信号张量积的积分公式,通过获得反对称Gaunt系数的显式表达式,可用单个向量信号张量积模拟Clebsch-Gordan张量积,将所需的张量积计算次数最多减少9倍,为其在SO(3)等变神经网络中的应用奠定基础。

Comments 17 pages, 3 figures

详情
AI中文摘要

我们推导了若干积分公式,用于简化Xie等人最近提出的向量信号张量积;该张量积将Gaunt张量积推广到反对称耦合。特别地,我们获得了Gaunt系数的反对称类比的显式闭式表达式。这使我们能够使用单个向量信号张量积模拟Clebsch-Gordan张量积,从而将所需的张量积计算次数最多减少$9\times$。我们的结果使向量信号张量积的高效实用实现成为可能,为Gaunt张量积的这一推广在$\mathrm{SO}(3)$等变神经网络中的应用铺平了道路。此外,我们讨论了Gaunt张量积和向量信号张量积如何用于控制通常的Clebsch-Gordan张量积所涉及的表达能力与运行时间之间的权衡。最后,考虑到这些张量积在等变神经网络中的使用,我们研究了其归一化的低秩分解。

英文摘要

We derive integral formulas that simplify the Vector Signal Tensor Product recently introduced by Xie et al., which generalizes the Gaunt tensor product to anti-symmetric couplings. In particular, we obtain explicit closed-form expressions for the anti-symmetric analogues of the Gaunt coefficients. This enables us to simulate the Clebsch-Gordan tensor product using a single Vector Signal Tensor Product, yielding up to a $9\times$ reduction in the required tensor product evaluations. Our results enable efficient and practical implementations of the Vector Signal Tensor Product, paving the way for applications of this generalization of Gaunt Tensor Products in $\mathrm{SO}(3)$-equivariant neural networks. Moreover, we discuss how the Gaunt and the Vector Signal Tensor Products make it possible to control the expressivity-runtime tradeoff associated with the usual Clebsch-Gordan Tensor Products. Finally, we investigate low rank decompositions of the normalizations of the considered tensor products in view of their use in equivariant neural networks.

2603.25157 2026-06-09 cs.LG cs.AI cs.CV stat.ML 版本更新

Vision Hopfield Memory Networks for Image Recognition

用于图像识别的视觉Hopfield记忆网络

Jianfeng Wang, Amine M'Charrak, Luk Koska, Xiangtao Wang, Daniel Petriceanu, Ruizhi Wang, Michael Bumbar, Luca Pinchetti, Thomas Lukasiewicz

发表机构 * Department of Computer Science, University of Oxford(牛津大学计算机科学系) Faculty of Informatics, Vienna University of Technology(维也纳理工大学信息学院)

AI总结 本文提出了一种受大脑启发的视觉Hopfield记忆网络(V-HMN),通过整合分层记忆机制和迭代细化更新,实现了统一框架下的局部和全局动态建模,提升了可解释性和数据效率。

详情
AI中文摘要

近年来,Transformer家族和Mamba等状态空间模型这类视觉骨干网络在图像识别上取得了显著进展。尽管这些架构在经验上取得了成功,但它们与人脑的计算原理仍相去甚远,通常需要大量训练数据,且可解释性有限。我们提出视觉Hopfield记忆网络(V-HMN),一种受大脑启发的视觉骨干网络,它将跨层的分层记忆机制与迭代细化更新相结合。具体而言,V-HMN包含在图像块级别提供关联记忆动态的局部Hopfield模块、作为情景记忆进行上下文调制的全局Hopfield模块,以及受预测编码启发、用于迭代误差校正的细化规则。通过分层组织这些基于记忆的模块,V-HMN在统一框架中同时捕捉局部和全局动态。记忆检索揭示了输入与存储模式之间的关系,通过显式的记忆检索提供一种基于原型的可解释性,而存储模式的复用提高了数据效率。因此,与现有基于自注意力或状态空间的方法相比,这种受大脑启发的设计提升了数据效率,并提供了基于原型的可解释性。我们在公开的图像分类基准上进行了大量实验。V-HMN在中小规模基准上取得了优异性能,并且在几乎未做架构调优的情况下,在ImageNet上仍与广泛采用的骨干架构具有竞争力,同时提供更高的数据效率和基于原型的可解释性。这些发现凸显了V-HMN作为标准视觉骨干网络的一种以记忆为中心的替代方案的潜力,从而将受大脑启发的计算与现代机器学习联系起来。

英文摘要

Recent vision backbones, such as Transformer families and state-space models like Mamba, have achieved remarkable progress on image recognition. Despite their empirical success, these architectures remain far from the computational principles of the human brain, often demanding enormous amounts of training data while offering limited interpretability. We propose the Vision Hopfield Memory Network (V-HMN), a brain-inspired vision backbone that integrates hierarchical memory mechanisms across layers with iterative refinement updates. Specifically, V-HMN incorporates local Hopfield modules that provide associative memory dynamics at the image patch level, global Hopfield modules that function as episodic memory for contextual modulation, and a predictive-coding-inspired refinement rule for iterative error correction. By organizing these memory-based modules hierarchically, V-HMN captures both local and global dynamics in a unified framework. Memory retrieval exposes the relationship between inputs and stored patterns, providing a prototype-based form of interpretability through explicit memory retrieval, while the reuse of stored patterns improves data efficiency. This brain-inspired design therefore enhances data efficiency and provides a prototype-based form of interpretability compared to existing self-attention- or state-space-based approaches. We conducted extensive experiments on public image classification benchmarks. V-HMN achieves strong performance on small- and medium-scale benchmarks, and remains competitive with widely adopted backbone architectures on ImageNet despite minimal architectural tuning, while offering improved data efficiency and a prototype-based form of interpretability. These findings highlight the potential of V-HMN as a memory-centric alternative to standard vision backbones, thereby bridging brain-inspired computation with modern machine learning.

2604.09967 2026-06-09 cs.LG cs.AI 版本更新

Muon$^2$: Boosting Muon via Adaptive Second-Moment Preconditioning

Muon²:通过自适应二阶矩预条件提升Muon

Ziyue Liu, Ruijie Zhang, Zhengyang Wang, Yequan Zhao, Yupeng Su, Zi Yang, Zheng Zhang

发表机构 * University of California at Santa Barbara(加州大学圣巴巴拉分校) University at Albany, SUNY(纽约州立大学奥尔巴尼分校)

AI总结 Muon²在正交化之前引入Adam风格的自适应二阶矩预条件,改善动量矩阵的谱,从而加快极分解近似的收敛并提升实际正交化质量;在最高13B参数的预训练实验中,其表现始终优于Muon。

Comments Preprint, subject to update

详情
AI中文摘要

Muon已成为一种有前途的优化器,用于大规模基础模型预训练,它通过迭代正交化利用神经网络更新的矩阵结构。然而,Muon的正交化质量取决于所执行的Newton–Schulz(NS)迭代次数,而NS迭代的计算和通信开销不可忽略,这带来了效率上的挑战。我们提出Muon²,作为Muon的扩展,在正交化之前应用Adam风格的自适应二阶矩预条件,以同时提高质量和效率。我们的关键见解是,Muon中极分解近似的核心挑战在于病态的动量矩阵,而Muon²显著改善了该矩阵的谱,从而更快地收敛到实际够用的正交化。我们进一步通过方向对齐来刻画实际正交化质量,在这一度量下,Muon²在每个极分解步骤上都较Muon有大幅改进。在最高13B参数的GPT、LLaMA和专家混合预训练实验中,Muon²(及其保留大部分收益的内存高效变体Muon²-F)始终优于Muon及其变体,同时将NS迭代次数减少40%,并且在达到相同损失时相比Muon最多节省1/4的训练时间。

英文摘要

Muon has emerged as a promising optimizer for large-scale foundation model pre-training by exploiting the matrix structure of neural network updates through iterative orthogonalization. However, the orthogonalization quality of Muon hinges on the number of Newton--Schulz (NS) iterations performed, which poses efficiency challenges due to its non-trivial computation and communication cost. We propose Muon$^2$, an extension of Muon, to improve both quality and efficiency by applying Adam-style adaptive second-moment preconditioning before orthogonalization. Our key insight is that the core challenge of polar approximation in Muon lies in the ill-conditioned momentum matrix, of which the spectrum is substantially improved by Muon$^2$, leading to faster convergence toward a practically sufficient orthogonalization. We further characterize the practical orthogonalization quality via directional alignment, under which Muon$^2$ demonstrates dramatic improvement over Muon at each polar step. Across GPT, LLaMA, and Mixture-of-Experts pre-training experiments up to 13B parameters, Muon$^2$ (and its memory-efficient variant Muon$^2$-F that preserves most of its benefits) consistently outperforms Muon and its variants while reducing NS iterations by 40%, and saves up to 1/4 training time over Muon when achieving the same loss.
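
The pipeline named in this abstract (Adam-style second-moment scaling, then Newton-Schulz orthogonalization) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' optimizer: the classic cubic Newton-Schulz update is swapped in for whatever polynomial Muon or Muon$^2$ actually use, and the helper names, the 2x2 matrices, and the `second_moment` values are hypothetical and hand-picked.

```python
import math


def matmul(a, b):
    """Plain nested-list matrix product."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def precondition(momentum, second_moment, eps=1e-8):
    """Adam-style elementwise scaling: m / (sqrt(v) + eps)."""
    return [[m / (math.sqrt(v) + eps) for m, v in zip(m_row, v_row)]
            for m_row, v_row in zip(momentum, second_moment)]


def newton_schulz(x, steps):
    """Classic cubic iteration X <- 1.5 X - 0.5 X X^T X (assumed variant)."""
    fro = math.sqrt(sum(v * v for row in x for v in row))
    x = [[v / fro for v in row] for row in x]  # singular values are now <= 1
    for _ in range(steps):
        xxt_x = matmul(matmul(x, transpose(x)), x)
        x = [[1.5 * a - 0.5 * b for a, b in zip(row_a, row_b)]
             for row_a, row_b in zip(x, xxt_x)]
    return x


def orthogonality_error(x):
    """Largest absolute entry of X X^T - I."""
    g = matmul(x, transpose(x))
    n = len(g)
    return max(abs(g[i][j] - (1.0 if i == j else 0.0))
               for i in range(n) for j in range(n))


# Hypothetical, badly scaled momentum and a matching second-moment estimate.
momentum = [[4.0, 0.2], [0.1, 0.05]]
second_moment = [[16.0, 0.04], [0.04, 0.0025]]

raw_err = orthogonality_error(newton_schulz(momentum, steps=5))
pre_err = orthogonality_error(
    newton_schulz(precondition(momentum, second_moment), steps=5))
update = newton_schulz(precondition(momentum, second_moment), steps=30)
```

In this hand-picked example the preconditioned matrix is closer to orthogonal after the same five steps (`pre_err < raw_err`), which mirrors the abstract's conditioning argument but is not evidence for it on real training runs.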

2605.06384 2026-06-09 cs.LG cs.AI cs.FL 版本更新

MinMax Recurrent Neural Cascades

MinMax 循环神经网络级联

Alessandro Ronca

发表机构 * IRIS-AI

AI总结 MinMax RNCs 通过MinMax代数构建,具备强表达性、高效评估、稳定动态和非消失状态梯度等特性,在合成任务中表现优异,能处理长序列并超越传统循环基线。

Comments Code: https://github.com/minmaxrnc/model

详情
AI中文摘要

我们引入MinMax循环神经网络级联(MinMax RNCs),一种基于MinMax代数新形式递归的循环神经网络。我们展示了MinMax RNCs具有一些难以同时获得的关键性质:强大的形式表达性、高效的评估、稳定的动态和非消失的状态梯度。首先,其形式表达性对应正则语言,可能是有限记忆系统的最大表达性。其次,除了递归形式的评估外,它们还允许并行扫描评估,具有对数深度和线性工作量。第三,其状态和激活在所有序列长度下均被统一限制。第四,其损失梯度几乎处处存在且在所有序列长度下均被统一限制。第五,它们不表现出消失的状态梯度:状态相对于过去状态的梯度可以独立于状态之间的时距保持范数一。经验上,我们发现这些理论性质转化为强大的实际性能。MinMax RNCs完美解决了考虑的合成任务,能够泛化到长序列,并在实验中超越了考虑的循环基线。我们还训练了一个1.12亿参数的MinMax RNC进行下一个token预测,获得与其规模相竞争的性能,提供了初始证据表明MinMax递归可以扩展到现实世界的序列建模任务。

英文摘要

We introduce MinMax Recurrent Neural Cascades (MinMax RNCs), a class of recurrent neural networks built from a novel form of recurrence over the MinMax algebra. We show that MinMax RNCs enjoy key properties that are difficult to obtain simultaneously: strong formal expressivity, efficient evaluation, stable dynamics, and non-vanishing state gradients. First, their formal expressivity corresponds to the regular languages, arguably the maximal expressivity for finite-memory systems. Second, in addition to evaluation in recurrent form, they also admit parallel-scan evaluation with logarithmic depth and linear work in the input length. Third, their states and activations are uniformly bounded for all sequence lengths. Fourth, their loss gradients exist almost everywhere and are uniformly bounded for all sequence lengths. Fifth, they do not exhibit vanishing state gradients: the gradient of a state with respect to a past state can retain norm one independently of the temporal distance between the states. Empirically, we find that these theoretical properties translate into strong practical performance. MinMax RNCs solve the considered synthetic tasks perfectly, generalise to long sequences, and outperform the recurrent baselines considered in our experiments. We also train a 112M-parameter MinMax RNC for next-token prediction, obtaining competitive performance for its size and providing initial evidence that MinMax recurrence can scale to real-world sequence-modelling tasks.
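
The abstract does not spell out the recurrence, so the sketch below uses a purely hypothetical stand-in, the clamp update $h_t = \max(a_t, \min(b_t, h_{t-1}))$, to illustrate why a recurrence built from min and max can be evaluated by a parallel scan: such maps are closed under composition, and composition is associative. The names `clamp_step`, `compose`, and `run_tree` are illustrative and are not taken from the paper or its repository.

```python
def clamp_step(params, h):
    """Hypothetical min-max update: h -> max(a, min(b, h))."""
    a, b = params
    return max(a, min(b, h))


def compose(first, second):
    """Parameters of the single clamp equal to second(first(h)).

    Follows from distributivity of min over max:
    max(a2, min(b2, max(a1, min(b1, h))))
      = max(max(a2, min(a1, b2)), min(min(b1, b2), h)).
    """
    a1, b1 = first
    a2, b2 = second
    return (max(a2, min(a1, b2)), min(b1, b2))


def run_sequential(steps, h0):
    """Reference evaluation, one step after another."""
    h = h0
    for params in steps:
        h = clamp_step(params, h)
    return h


def run_tree(steps, h0):
    """Balanced-tree reduction, the access pattern a parallel scan relies on."""
    if not steps:
        return h0
    level = list(steps)
    while len(level) > 1:
        merged = [compose(level[i], level[i + 1])
                  for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:
            merged.append(level[-1])  # odd element carries over in order
        level = merged
    return clamp_step(level[0], h0)


steps = [(0.0, 1.0), (-2.0, 0.5), (0.3, 5.0), (-1.0, 0.4), (0.1, 0.2)]
```

Each tree level halves the number of maps, giving logarithmic depth with linear total work. For this stand-in, the derivative of a step with respect to the previous state is either 0 or 1, which is at least consistent with the bounded-gradient and norm-one statements in the abstract.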

2605.11855 2026-06-09 cs.LG cs.AI cs.AR 版本更新

Improving the Performance and Learning Stability of Parallelizable RNNs Designed for Ultra-Low Power Applications

提升为超低功耗应用设计的可并行递归神经网络的性能和学习稳定性

Julien Brandoit, Arthur Fyon, Damien Ernst, Guillaume Drion

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文提出CMRU和αCMRU,通过累积更新公式恢复梯度流并保持持久记忆,提升收敛稳定性并降低对初始化的敏感性,在多种基准上表现优异,尤其在需要离散长程保持的任务中优势明显。

Comments Accepted as a spotlight at ICML2026. This work has been the subject of patent applications under numbers EP26175243.0 and EP26175248.9

详情
AI中文摘要

序列学习主要由Transformer和可并行的递归神经网络(RNN,如状态空间模型)主导,但学习长期依赖仍具挑战性,且最先进的设计以更高的功耗换取性能。双稳态记忆递归单元(Bistable Memory Recurrent Unit,BMRU)的提出是为了实现超低功耗RNN的软硬件协同设计:具有迟滞特性的量化状态提供持久记忆,并可直接映射到模拟电路基本单元。然而,BMRU在复杂序列任务上的性能落后于可并行RNN。本文指出状态更新期间的梯度阻塞是关键限制,并提出一种累积更新公式,在保持持久记忆的同时恢复梯度流,从而形成沿时间方向的跳跃连接。由此得到累积记忆递归单元(CMRU)及其松弛变体αCMRU。实验表明,累积公式显著提高了收敛稳定性并降低了对初始化的敏感性。在小模型规模下,CMRU和αCMRU在多种基准上与线性递归单元(LRU)和最小门控递归单元(minGRU)持平或更优,在需要离散长程保持的任务上优势尤为明显,同时CMRU保留了量化状态、持久记忆和抗噪声动态,这些对模拟电路实现至关重要。

英文摘要

Sequence learning is dominated by Transformers and parallelizable recurrent neural networks (RNNs) such as state-space models, yet learning long-term dependencies remains challenging, and state-of-the-art designs trade power consumption for performance. The Bistable Memory Recurrent Unit (BMRU) was introduced to enable hardware-software co-design of ultra-low power RNNs: quantized states with hysteresis provide persistent memory while mapping directly to analog primitives. However, BMRU performance lags behind parallelizable RNNs on complex sequential tasks. In this paper, we identify gradient blocking during state updates as a key limitation and propose a cumulative update formulation that restores gradient flow while preserving persistent memory, creating skip-connections through time. This leads to the Cumulative Memory Recurrent Unit (CMRU) and its relaxed variant, the $α$CMRU. Experiments show that the cumulative formulation dramatically improves convergence stability and reduces initialization sensitivity. The CMRU and $α$CMRU match or outperform Linear Recurrent Units (LRUs) and minimal Gated Recurrent Units (minGRUs) across diverse benchmarks at small model sizes, with particular advantages on tasks requiring discrete long-range retention, while the CMRU retains quantized states, persistent memory, and noise-resilient dynamics essential for analog implementation.

2605.15690 2026-06-09 cs.LG 版本更新

FRWKV+: Periodic-Aware Adaptive Gating for Frequency-Space Linear Time Series Forecasting

FRWKV+: 基于周期感知的自适应门控用于频率域线性时间序列预测

Qingyuan Yang, Dongyue Chen, Da Teng, Junhua Xiao, Jiaji Pan, Shizhuo Deng

发表机构 * College of Information Science and Engineering, Northeastern University(东北大学信息科学与工程学院) Foshan Graduate School of Innovation, Northeastern University(东北大学佛山研究生创新学院) National Frontiers Science Center for Industrial Intelligence and Systems Optimization(工业智能与系统优化国家级前沿科学中心)

AI总结 本文提出FRWKV-Plus模型,通过引入跨分支频谱门和信任门控残差修正,提升频率域时间序列预测的准确性与效率,实验表明其在多个基准数据集上表现优异。

详情
AI中文摘要

准确且高效的长期多变量时间序列预测需要捕捉重复出现的时序结构,同时在众多变量和预测范围上保持较低的推理成本。频率域模型能紧凑地表示长程和周期性变化,但通常将频谱的实部和虚部作为弱耦合的两路数据流处理,并把周期性线索当作普通输入特征,即使这些线索并不可靠。本文提出FRWKV-Plus,一种基于高效FRWKV骨干网络的轻量级周期感知频率域预测模型。FRWKV-Plus引入了跨分支频谱门,利用兄弟分支的摘要对每个频谱分支重新加权;并引入信任门控残差修正,在一个学习得到的、依赖数据的信任分数控制下,将紧凑的周期内上下文转换为对这些门的有界且符号灵活的调整。按照构造,该修正在初始化时保持恒等且严格有界,因此周期性证据可以细化基础交互,但绝不会主导或反转它。在七个标准基准上,FRWKV-Plus与强大的线性、频率域、递归式和基于Transformer的预测器相比始终具有竞争力,同时保持了骨干网络的轻量级特性。受控的三种子消融实验表明,每个组件都有贡献,收益在强周期性数据上较小,在更难的Exchange和ILI数据集上更为显著,且周期内上下文是影响最大的单一组件。实现已公开于https://github.com/yangqingyuan-byte/FRWKV-plus。

英文摘要

Accurate and efficient long-term multivariate time series forecasting requires capturing recurring temporal structure while keeping inference cheap across many variables and horizons. Frequency-space models represent long-range and periodic variation compactly, but they typically process the real and imaginary spectral components as weakly coupled streams and treat periodic cues as ordinary input features, even when such cues are unreliable. This paper proposes FRWKV-Plus, a lightweight periodic-aware frequency-space forecasting model built on the efficient FRWKV backbone. FRWKV-Plus introduces a cross-branch spectral gate that reweights each spectral branch using a summary of its sibling branch, and a trust-gated residual correction that converts compact within-period context into a bounded, sign-flexible adjustment of these gates under a learned, data-dependent trust score. By construction, the correction is identity-preserving at initialization and strictly bounded, so periodic evidence can refine but never dominate or invert the base interaction. On seven standard benchmarks, FRWKV-Plus is consistently competitive with strong linear, frequency-domain, recurrent-style, and Transformer-based forecasters while preserving the lightweight profile of the backbone. Controlled three-seed ablations show that each component contributes, that the benefit is modest on strongly periodic data and pronounced on the harder Exchange and ILI datasets, and that the within-period context is the most influential single component. The implementation is publicly available at https://github.com/yangqingyuan-byte/FRWKV-plus.

2606.04752 2026-06-09 cs.LG cs.AI 版本更新

An Empirical Audit of Input Encoders for Multi-Channel Signal Transformers

多通道信号Transformer输入编码器的实证审计

Ossi Lehtinen

发表机构 * Anthropic

AI总结 通过合成基准和真实数据ETTh1,实证审计八种输入编码器,发现标准线性投影(nn.Linear(C, d_model))在大多数情况下与复杂替代方案性能相当,仅共享标量基线和通道独立基线显著落后。

Comments 21 pages, 1 figure, 8 tables. Code: https://github.com/OssiLehtinen/channel-encoder-audit

详情
AI中文摘要

处理多通道标量信号的Transformer必须在每个时间步将$C$个同时值嵌入到一个$d_{\text{model}}$维向量中。我们在一个设计为使通道身份信息丰富的合成基准和作为真实数据检查的ETTh1上,以下一步负对数似然(NLL)为指标,实证审计了八种输入编码器——包括共享标量基线、每通道线性投影、正交正则化器、非线性MLP主干、块分区拼接、通道独立和通道作为令牌架构,以及投影位置编码。主要结论是宽泛的“第一梯队”内实际近似等价:标准每通道线性投影(nn.Linear(C, $d_{\text{model}}$))与该梯队中的每个替代方案相比,差异在统计上显著但实际中很小。两种编码器明显失败:共享标量基线(由于我们明确的信息论原因而崩溃)和通道独立的PatchTST风格基线(在两个基准上表现不佳,并在合成基准上普遍过拟合)。配对测试解决了两个小差距:通过学习的线性层投影正弦位置编码在小$C$时略胜一筹,直接几何探测表明其机制是位置-通道正交化;非线性MLP主干在我们测试的最大$C$时略胜一筹,但差距在更多训练数据下缩小。实际建议是默认使用nn.Linear(C, $d_{\text{model}}$),仅当手头任务有实际理由时才采用更复杂的方案。重现本文所有实验的代码和数据可在https://github.com/OssiLehtinen/channel-encoder-audit获取。

英文摘要

Transformers consuming multi-channel scalar signals must embed $C$ simultaneous values into one $d_{\text{model}}$-dimensional vector per time step. We audit eight input encoders -- a shared-scalar baseline, per-channel linear projections, an orthogonality regulariser, a nonlinear MLP, block-partitioned concatenation, channel-independent and channel-as-token architectures, and a projected positional encoding -- on a synthetic benchmark where channel identity is informative and on ETTh1, scored by next-step negative log-likelihood. The headline is practical near-equivalence within a wide "top tier": the standard per-channel linear projection matches every alternative up to small, statistically real but practically modest differences. A direct geometric probe attributes this to a spontaneous orthogonalisation of the per-channel projections: they end up near-orthogonal with no explicit regulariser, letting the standard linear recover channel identity from the summed embedding. Two encoders lose decisively: the shared-scalar baseline collapses for information-theoretic reasons we make explicit, and the channel-independent PatchTST-spirit baseline overfits universally on the synthetic benchmark and underperforms on both. Paired tests resolve two small gaps: projecting the sinusoidal positional encoding through a learned linear layer edges the rest at small $C$ by extending this orthogonality to the positional subspace; a nonlinear MLP stem edges them at the largest $C$, with the gap shrinking under more training data. The practical recommendation: use the standard per-channel linear projection by default; reach for something more elaborate only when the task calls for it.
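
The orthogonality argument in this abstract can be made concrete with a toy example. A linear layer from $C$ channels to $d_{\text{model}}$ dimensions sums one vector per channel, each scaled by that channel's value; if those vectors are orthonormal, every value can be read back from the sum by a dot product. The sketch below is a hand-built illustration, not the paper's code: the vectors are fixed by hand rather than learned, and the `shared` variant is only one plausible reading of the shared-scalar baseline.

```python
def embed(values, channel_vectors):
    """Summed embedding: sum over channels of value * channel vector (no bias)."""
    d = len(channel_vectors[0])
    return [sum(v * w[k] for v, w in zip(values, channel_vectors))
            for k in range(d)]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def recover(embedding, channel_vectors):
    """Project onto each channel vector; exact only for orthonormal vectors."""
    return [dot(embedding, w) for w in channel_vectors]


# Hypothetical setup: C = 3 channels, d_model = 4.
orthonormal = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
# One vector reused for every channel (assumed stand-in for a shared encoder).
shared = [[1.0, 0.0, 0.0, 0.0]] * 3

x = [0.5, -2.0, 3.0]
x_permuted = [3.0, -2.0, 0.5]

recovered = recover(embed(x, orthonormal), orthonormal)
collision = embed(x, shared) == embed(x_permuted, shared)
```

With orthonormal vectors the three values come back unchanged, whereas the shared vector maps `x` and its permutation to the same embedding, so channel identity cannot be recovered from the sum in that variant.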

2405.17823 2026-06-09 stat.ML cs.LG math.OA 版本更新

Spectral Truncation Kernels: Noncommutativity in $C^*$-algebraic Kernel Machines

谱截断核:C*-代数核机器中的非交换性

Yuka Hashimoto, Ayoub Hafid, Masahiro Ikeda, Hachem Kadri

发表机构 * NTT, Inc.(NTT公司) Center for Advanced Intelligence Project, RIKEN(RIKEN高级智能项目) Graduate School of Mathematical Sciences, The University of Tokyo(东京大学数学科学研究生院) Graduate School of Information Science and Technology, The University of Osaka(大阪大学信息科学与技术研究生院) Department of Computer Science, Aix-Marseille University, CNRS, LIS(艾克斯-马赛大学计算机科学系,CNRS,LIS)

AI总结 提出基于谱截断和C*-代数的谱截断核,通过允许非交换乘积实现函数域上的交互,填补了可分离核与交换核之间的空白,并降低了计算成本。

详情
AI中文摘要

向量值学习和函数值学习中的一个核心问题是如何设计既能捕捉局部和非局部交互又保持计算可行性的核。现有的算子值核仅提供部分答案:可分离核效率高但无法建模函数域上的交互,而交换核仅能捕捉逐点结构。为了解决这个问题,我们提出了谱截断核,这是一类基于谱截断和C*-代数的用于向量值和函数值学习的正定核。通过在核构造中允许非交换乘积,所提出的核能够诱导数据函数域上的交互,并填补了现有可分离核与交换核之间的空白。此外,通过使用C*-代数框架,与现有的使用算子值核的向量值RKHS框架相比,我们降低了计算成本。

英文摘要

A central question in vector- and function-valued learning is how to design kernels that capture both local and non-local interactions while remaining computationally tractable. Existing operator-valued kernels offer only partial answers: separable kernels are efficient but fail to model interactions across the function domain, while commutative kernels capture only pointwise structure. To address this, we propose spectral truncation kernels, a new class of positive definite kernels for vector- and function-valued learning based on spectral truncation and $C^*$-algebra. By allowing noncommutative products in the kernel construction, the proposed kernels induce interactions across the data function domain and fill the gap between existing separable and commutative kernels. In addition, by using the $C^*$-algebraic framework, we reduce the computational cost compared to the existing vector-valued RKHS framework with operator-valued kernels.

2510.13554 2026-06-09 cs.CL cs.LG 版本更新

Attention Illuminates LLM Reasoning: The Preplan-and-Anchor Rhythm Enables Fine-Grained Policy Optimization

注意力揭示大语言模型推理:预规划与锚定节奏实现细粒度策略优化

Yang Li, Zhichen Dong, Yuhan Sun, Weixun Wang, Shaopan Xiong, Yijia Luo, Jiashun Liu, Han Lu, Jiamang Wang, Wenbo Su, Bo Zheng, Junchi Yan

发表机构 * Shanghai Jiao Tong University(上海交通大学) Shanghai Innovation Institute(上海创新研究院) Alibaba Group(阿里巴巴集团)

AI总结 本文通过注意力机制揭示大语言模型推理中的预规划与锚定节奏,并据此提出三种细粒度强化学习策略,在多种推理任务上取得一致性能提升。

Comments 31 pages, 9 figures, 20 tables. Accepted at ICML 2026

AI中文摘要

大语言模型的推理模式仍然不透明,强化学习通常对整个生成过程应用统一信用分配,模糊了关键步骤与常规步骤的区别。本文将注意力视为一种特权基质,它使大语言模型的内部逻辑变得可读,不仅是计算的副产品,更是推理本身的机械蓝图。我们首先区分局部和全局聚焦信息处理的注意力头,并揭示局部聚焦头在对角线附近产生锯齿状模式,指示短语块,而全局聚焦头则暴露对后续令牌具有广泛下游影响的令牌。我们用两个指标形式化这些:1)窗口平均注意力距离,衡量裁剪窗口内向后注意力的程度;2)未来注意力影响,量化令牌的全局重要性,即其从后续令牌接收的平均注意力。综合来看,这些信号揭示了一种重复的预规划与锚定机制,其中模型首先进行长距离上下文参考以生成一个引导令牌,该令牌立即跟随或与一个组织后续推理的语义锚定令牌重合。利用这些见解,我们引入了三种新颖的强化学习策略,动态地对关键节点(预规划令牌、锚定令牌及其时间耦合)进行目标信用分配,并在各种推理任务中展示了一致的性能提升。通过将优化与模型的内在推理节奏对齐,我们旨在将不透明的优化转化为可操作的结构感知过程,希望为更透明和有效的大语言模型推理优化提供潜在一步。

英文摘要

The reasoning pattern of Large language models (LLMs) remains opaque, and reinforcement learning (RL) typically applies uniform credit across an entire generation, blurring the distinction between pivotal and routine steps. This work positions attention as a privileged substrate that renders the internal logic of LLMs legible, not merely as a byproduct of computation, but as a mechanistic blueprint of reasoning itself. We first distinguish attention heads between locally and globally focused information processing and reveal that locally focused heads produce a sawtooth pattern near the diagonal indicating phrasal chunks, while globally focused heads expose tokens that exert broad downstream influence over future tokens. We formalize these with two metrics: 1) Windowed Average Attention Distance, which measures the extent of backward attention within a clipped window; 2) Future Attention Influence, which quantifies a token's global importance as the average attention it receives from subsequent tokens. Taken together, these signals reveal a recurring preplan-and-anchor mechanism, where the model first performs a long-range contextual reference to generate an introductory token, which is immediately followed by or coincides with a semantic anchor token that organizes subsequent reasoning. Leveraging these insights, we introduce three novel RL strategies that dynamically perform targeted credit assignment to critical nodes (preplan tokens, anchor tokens, and their temporal coupling) and show consistent performance gains across various reasoning tasks. By aligning optimization with the model's intrinsic reasoning rhythm, we aim to transform opaque optimization into an actionable structure-aware process, hoping to offer a potential step toward more transparent and effective optimization of LLM reasoning.
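
As a rough illustration of the two metrics named in the abstract, the sketch below computes them from a single causal attention matrix. It is not the authors' implementation: the function names, the way the window clips distances, and the absence of any head or layer averaging are all assumptions made here for illustration, and the paper's exact definitions may differ.

```python
from typing import List

Matrix = List[List[float]]  # attn[t][s]: attention from query position t to key position s <= t


def windowed_avg_attention_distance(attn: Matrix, window: int) -> List[float]:
    """Attention-weighted backward distance per query position.

    Assumption: distances (t - s) are clipped at `window`; the paper may
    instead truncate or renormalize the attention inside the window.
    """
    return [
        sum(row[s] * min(t - s, window) for s in range(t + 1))
        for t, row in enumerate(attn)
    ]


def future_attention_influence(attn: Matrix) -> List[float]:
    """Mean attention each key position receives from strictly later positions."""
    n = len(attn)
    scores = []
    for s in range(n):
        later = [attn[t][s] for t in range(s + 1, n)]
        # The last token has no future positions; score it 0.0 by convention here.
        scores.append(sum(later) / len(later) if later else 0.0)
    return scores


# Toy causal attention matrix (each row sums to 1 over positions <= t).
toy_attn = [
    [1.0, 0.0, 0.0],
    [0.5, 0.5, 0.0],
    [0.8, 0.1, 0.1],
]
waad = windowed_avg_attention_distance(toy_attn, window=2)
fai = future_attention_influence(toy_attn)
```

In this toy matrix the first token receives most of the later attention, so it gets the highest influence score, which is the kind of signal the abstract associates with anchor tokens.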

2601.15165 2026-06-09 cs.CL cs.AI cs.LG 版本更新

The Flexibility Trap: Rethinking the Value of Arbitrary Order in Diffusion Language Models

灵活性陷阱:重新思考扩散语言模型中任意顺序的价值

Zanlin Ni, Shenzhi Wang, Yang Yue, Tianyu Yu, Weilin Zhao, Yeguo Hua, Tianyi Chen, Jun Song, Cheng Yu, Bo Zheng, Gao Huang

发表机构 * LeapLab, Tsinghua University(清华大学Leap实验室) NLPLab, Tsinghua University(清华大学自然语言处理实验室) Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团) BNRist, Tsinghua University(清华大学北京信息科学与技术国家研究中心)

AI总结 本文发现,尽管扩散语言模型(dLLMs)允许任意生成顺序,但这种灵活性可能限制其推理能力,通过采用标准的Group Relative Policy Optimization(GRPO)方法,即JustGRPO,在保持并行解码能力的同时提升了推理性能。

Comments Code and pre-trained models: https://github.com/LeapLabTHU/JustGRPO

AI中文摘要

扩散大语言模型(dLLMs)打破了传统语言模型严格的从左到右约束,使token可以按任意顺序生成。直观上,这种灵活性意味着其解空间严格包含固定的自回归轨迹,理论上能够释放更强的推理潜力。然而,本文发现,对于一般推理任务(例如数学和编程),任意顺序生成实际上可能限制dLLMs的推理潜力。我们观察到,dLLMs倾向于利用这种顺序灵活性绕过对探索至关重要的高不确定性token,这可能导致解的覆盖范围过早坍缩。这一观察促使我们重新思考面向dLLMs的强化学习方法:这些方法往往为了保留这种灵活性而引入大量复杂性,例如处理组合式轨迹和难以处理的似然。我们证明,只需放弃任意顺序并改用标准的Group Relative Policy Optimization(GRPO),即可有效激发推理能力。我们的方法JustGRPO极简却出人意料地有效(例如在GSM8K上达到89.1%的准确率),同时完全保留了dLLMs的并行解码能力。项目页面:https://nzl-thu.github.io/the-flexibility-trap

英文摘要

Diffusion Large Language Models (dLLMs) break the rigid left-to-right constraint of traditional LLMs, enabling token generation in arbitrary orders. Intuitively, this flexibility implies a solution space that strictly supersets the fixed autoregressive trajectory, theoretically unlocking superior reasoning potential. However, in this paper, we find that for general reasoning tasks (e.g., mathematics and coding), arbitrary order generation may in fact limit the reasoning potential of dLLMs. We observe that dLLMs tend to exploit this order flexibility to bypass high-uncertainty tokens that are crucial for exploration, which can lead to a premature collapse of solution coverage. This observation motivates a rethink of RL approaches for dLLMs, where considerable complexities, such as handling combinatorial trajectories and intractable likelihoods, are often devoted to preserving this flexibility. We show that effective reasoning can be elicited by simply forgoing arbitrary order and applying standard Group Relative Policy Optimization (GRPO) instead. Our approach, JustGRPO, is minimalist yet surprisingly effective (e.g., 89.1% accuracy on GSM8K) while fully retaining the parallel decoding ability of dLLMs. Project page: https://nzl-thu.github.io/the-flexibility-trap
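
For readers unfamiliar with GRPO, the core step is normalizing each sampled response's reward against the other responses drawn for the same prompt. The sketch below shows only that generic step; it says nothing about how JustGRPO handles dLLM decoding, which the abstract does not specify. The function name, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    Assumption: population std is used; implementations may use the sample std.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Correct answers receive positive advantages and incorrect ones negative, and a group whose rewards are all equal yields zero advantages, so it contributes no learning signal.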

2603.22473 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Component Ablation for Efficient Hybrid Language Model Architectures: Performance, Resilience, and Compression Implications

组件消融用于高效混合语言模型架构:性能、鲁棒性和压缩影响

Hector Borobia, Elies Seguí-Mas, Guillermina Tormo-Carbó

发表机构 * Doctoral Program in Computer Science, University of Valencia(瓦伦西亚大学计算机科学博士项目)

AI总结 本文通过组件消融研究混合语言模型,发现注意力机制与替代序列处理路径对性能有显著影响,揭示了模型鲁棒性与压缩优化的关键因素。

Comments 25 pages, 7 figures, 6 tables; revised title, abstract, figures, and data/code repository URL

AI中文摘要

混合语言模型将softmax注意力与线性时间序列机制(如状态空间层或线性注意力层)相结合,但各类组件的功能贡献尚未得到充分刻画。本文在两个参数量低于10亿的混合语言模型Qwen3.5-0.8B和Falcon-H1-0.5B上研究组件级消融,采用基于似然的评估、下游基准、逐层干预、随机对照和表征级诊断。在所测试的模型中,移除注意力或替代序列处理路径都会显著降低性能,表明两类组件均对模型行为有贡献。似然指标对线性注意力或状态空间路径尤其敏感,而下游基准的性能退化则取决于任务和架构。逐层消融显示组件重要性依赖于位置,最强的影响集中在网络早期或中部的组件,而非在整个深度上均匀分布。随机移除对照进一步表明,混合架构与同系列Transformer基线在结构扰动下的退化方式不同。这些结果表明,组件消融是理解混合语言模型架构的有用诊断手段。这些发现为结合注意力与替代序列处理机制的架构在高效模型设计、压缩、鲁棒性分析和部署决策方面提供了相关证据。

英文摘要

Hybrid language models combine softmax attention with linear-time sequence mechanisms such as state-space or linear-attention layers, but the functional contribution of each component type remains insufficiently characterized. We study component-level ablation in two sub-1B hybrid language models, Qwen3.5-0.8B and Falcon-H1-0.5B, using likelihood-based evaluation, downstream benchmarks, layer-wise interventions, random controls, and representation-level diagnostics. Across the tested models, removing either attention or the alternative sequence-processing pathway substantially degrades performance, indicating that both component types contribute to model behavior. Likelihood metrics are especially sensitive to the linear-attention or state-space pathway, while downstream benchmark degradation depends on task and architecture. Layer-wise ablations show that component importance is position-dependent, with the strongest effects concentrated in early or mid-network components rather than uniformly across depth. Random-removal controls further show that hybrid architectures and same-family Transformer baselines degrade differently under structural perturbation. These results suggest that component ablation is a useful diagnostic for understanding hybrid language model architectures. The findings provide evidence relevant to efficient model design, compression, robustness analysis, and deployment decisions in architectures that combine attention with alternative sequence-processing mechanisms.

2604.16512 2026-06-09 cs.CV cs.CG cs.GR cs.LG cs.NA math.NA 版本更新

Medial Axis Aware Learning of Signed Distance Functions

面向中轴线的符号距离函数学习

Samuel Weidemaier, Christoph Norden-Smoch, Martin Rumpf

发表机构 * Institute for Numerical Simulation, University of Bonn(数值模拟研究所,波恩大学)

AI总结 本文提出一种新的变分方法,用于计算高精度的全局符号距离函数,通过高阶变分公式考虑梯度的跳跃集,以提高计算精度。

AI中文摘要

我们提出了一种新的变分方法,用于计算给定点云的高精度全局符号距离函数(SDF)。为此,我们通过高阶变分公式显式考虑SDF梯度的跳跃集(它与表面的中轴线重合),该公式要求SDF在远离这一不连续集处沿梯度方向线性增长。Eikonal方程和SDF的零水平集作为约束条件被强制满足。为了使该变分问题在计算上可行,我们采用了Ambrosio-Tortorelli型的相场近似,相应的相场函数隐式地描述了中轴线。该方法针对由未定向点云表示的表面实现,并用神经网络近似SDF和相场函数。实验表明,该方法在近场和全局范围内均具有较高的精度。与其他方法的定量和定性比较显示了所提方法的优势。

英文摘要

We propose a novel variational method to compute a highly accurate global signed distance function (SDF) to a given point cloud. To this end, the jump set of the gradient of the SDF, which coincides with the medial axis of the surface, is explicitly taken into account through a higher-order variational formulation that enforces linear growth along the gradient direction away from this discontinuity set. The eikonal equation and the zero-level set of the SDF are enforced as constraints. To make this variational problem computationally tractable, a phase field approximation of Ambrosio-Tortorelli type is employed. The associated phase field function implicitly describes the medial axis. The method is implemented for surfaces represented by unoriented point clouds using neural network approximations of both the SDF and the phase field. Experiments demonstrate the method's accuracy both in the near field and globally. Quantitative and qualitative comparisons with other approaches show the advantages of the proposed method.

2605.04913 2026-06-09 cs.CL cs.LG 版本更新

Rethinking Local Learning: A Cheaper and Faster Recipe for LLM Post-Training

重新思考局部学习:一种更便宜更快的LLM后训练配方

Hengyu Shi, Tianyang Han, Peizhe Wang, Zhiling Wang, Xu Yang, Junhao Su

发表机构 * Independent Researcher(独立研究者) D4 Lab(D4实验室) Southeast University(东南大学)

AI总结 本文提出LoPT,一种局部学习后训练策略,通过在transformer中点设置梯度边界,降低内存成本,提高训练效率并保留预训练能力。

Comments 35 pages

AI中文摘要

LLM后训练通常将任务梯度传播至模型的完整深度。尽管这种端到端结构简单通用,但它把任务适应与全深度的激活存储、长距离反向依赖以及任务梯度对预训练表示的直接访问耦合在一起。我们认为,这种全深度反向耦合可能带来不必要的开销和侵入性,尤其是在后训练监督远比预训练狭窄时。为此,我们提出LoPT(Local-Learning Post-Training,局部学习后训练),一种简单的后训练策略,使梯度的作用范围成为显式的设计选择。LoPT在transformer中点设置单一梯度边界:后半部分模块从任务目标学习,而前半部分模块通过轻量级特征重建目标进行更新,以保留有用的表示并保持接口兼容性。LoPT缩短了任务引起的反向路径,同时限制了狭窄任务梯度对早期层表示的直接干扰。大量实验表明,LoPT以更低的内存成本、更高的训练效率和更好的预训练能力保留实现了有竞争力的性能。我们的代码可在 https://github.com/HumyuShi/LoPT 获取。

英文摘要

LLM post-training typically propagates task gradients through the full depth of the model. Although this end-to-end structure is simple and general, it couples task adaptation to full-depth activation storage, long-range backward dependencies and direct task-gradient access to pretrained representations. We argue that this full-depth backward coupling can be unnecessarily expensive and intrusive, particularly when post-training supervision is much narrower than pre-training. To this end, we propose LoPT: Local-Learning Post-Training, a simple post-training strategy that makes gradient reach an explicit design choice. LoPT places a single gradient boundary at the transformer midpoint: the second-half block learns from the task objective, while the first-half block is updated by a lightweight feature-reconstruction objective to preserve useful representations and maintain interface compatibility. LoPT shortens the task-induced backward path while limiting direct interference from narrow task gradients on early-layer representations. Extensive experiments demonstrate that LoPT achieves competitive performance with lower memory cost, higher training efficiency and better retention of pretrained capabilities. Our code is available at: https://github.com/HumyuShi/LoPT

2605.27410 2026-06-09 quant-ph cs.LG cs.NE 版本更新

Zero-shot Quantum Neural Architecture Search

零样本量子神经架构搜索

Tung Dao, Son N. Tran, Huynh Thi Thanh Binh

发表机构 * Hanoi University of Science and Technology(河内科学技术大学) Deakin University(德金大学)

AI总结 针对变分量子算法中电路架构设计的高计算成本问题,基于量子神经正切核的Gram矩阵收敛性,提出零样本代理模型和MCTS框架MZeQAS,无需完整训练即可高效搜索高性能架构。

AI中文摘要

变分量子算法是利用近期量子硬件的主要方法,通过参数化量子电路和经典优化来获得优势。尽管前景广阔,但VQA的实际部署受到设计平衡表达性、可训练性和硬件约束的量子电路架构的挑战。现有的基于进化的量子神经架构搜索方法解决了这些挑战,但由于候选电路的重复训练而导致高计算成本。在这项工作中,我们确定了量子神经正切核的Gram矩阵收敛的设置。基于这一观察,我们设计了一个零样本代理模型来估计候选性能而无需完整训练,显著加速了架构搜索过程。利用该代理,我们提出了MZeQAS,一种基于蒙特卡洛树搜索的零样本量子神经架构搜索框架,用于VQA。通过将基于代理的性能估计与MCTS探索相结合,MZeQAS高效地发现了高性能架构。实验结果表明,MZeQAS在搜索效率和解决方案质量方面均优于现有方法,为在噪声中等规模量子设备上推进VQA部署提供了一个可扩展且有效的框架。

英文摘要

Variational Quantum Algorithms (VQAs) are a leading approach to exploiting near-term quantum hardware, leveraging parameterized quantum circuits and classical optimization to achieve advantage. Despite their promise, the practical deployment of VQAs is challenged by the difficulty of designing quantum circuit architectures that balance expressivity, trainability, and hardware constraints. Existing evolutionary-based quantum neural architecture search methods address these challenges but suffer from high computational costs due to repeated training of candidate circuits. In this work, we identify a setting in which the Gram matrix of the Quantum Neural Tangent Kernel converges. Building on this observation, we design a zero-shot surrogate model to estimate candidate performance without full training, significantly accelerating the architecture search process. Using this surrogate, we propose MZeQAS, a Monte Carlo Tree Search (MCTS)-based Zero-Shot Quantum Neural Architecture Search framework for VQAs. By integrating proxy-based performance estimation with MCTS exploration, MZeQAS efficiently discovers high-performing architectures. Experimental results demonstrate that MZeQAS outperforms existing approaches in terms of both search efficiency and solution quality, providing a scalable and effective framework for advancing VQA deployment on noisy intermediate-scale quantum devices.

2606.00229 2026-06-09 cs.RO cs.AI cs.LG 版本更新

Continuous Reasoning for Vision-Language-Action

视觉-语言-动作的连续推理

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结 针对视觉-语言-动作策略中语言与连续控制粒度不匹配的问题,提出一种可共享、可验证的连续推理方法,通过高斯潜变量接口和自验证目标提升机器人任务成功率。

Comments Project page: https://continuous-reasoning.airoa.io

AI中文摘要

自然语言是语言模型和视觉-语言模型强大的推理媒介,但与连续控制的粒度不匹配。文本和显式子目标在任务级粒度上操作,而视觉-语言-动作(VLA)策略必须在更细的时间尺度上选择动作;因此,单个推理步骤可能跨越多个动作块,同时与当前所需动作保持弱耦合。这为VLA提出了一个不同的问题:什么应该扮演语言的角色?我们认为,有用的VLA推理媒介必须能够在模型实例之间共享,通过下游动作改进进行验证,并与时间扩展的控制结构对齐。基于这一观点,我们提出了视觉-语言-动作的连续推理。我们的模型首先以结构化连续思想集的形式预测连续推理,然后将其重用为块结构动作生成的共享上下文。仅凭更好的动作预测并不能证明推理的有效性:如果相同的内部媒介不能在模型实例之间共享,并且不能通过改进的下游控制独立验证,那么添加的潜变量可能只是模型私有的捷径,有助于在已见行为上表现,而不支持泛化的控制。因此,我们将连续推理实例化为一个共享的高斯潜变量接口,并使用自验证目标进行训练,其中指数移动平均教师必须在预测目标动作时成功消费学生的推理。实验上,连续推理提高了LIBERO-PRO的鲁棒性,并在真实机器人上表现强劲,在TX-G2(一种AgiBot G2兼容变体)上平均子任务成功率比π0.5提高了40.4%,在HSR上提高了26.3%。这表明VLA中的推理更多是关于一个可共享、可验证的内部动作语言,而不是额外的标记。

英文摘要

Natural language is a powerful reasoning medium for language and vision-language models, but it is mismatched to the granularity of continuous control. Text and explicit subgoals operate at task-level granularity, whereas vision-language-action (VLA) policies must choose actions at a much finer temporal scale; a single reasoning step can therefore span many action chunks while remaining only weakly coupled to the action needed now. This suggests a different question for VLA: what should play the role of language? We argue that a useful VLA reasoning medium must be shareable across model instances, verifiable through downstream action improvement, and aligned with temporally extended control structure. Based on this view, we propose Continuous Reasoning for Vision-Language-Action. Our model first predicts continuous reasoning in the form of a structured set of continuous thoughts, then reuses them as shared context for chunk-structured action generation. Better action prediction alone does not certify good reasoning: if the same internal medium cannot be shared across model instances and independently verified through improved downstream control, the added latent may simply become a model-private shortcut that helps on seen behaviors without supporting generalizable control. We therefore instantiate continuous reasoning as a shared Gaussian latent interface and train it with a self-verification objective in which an exponential-moving-average teacher must successfully consume the student's reasoning when predicting target actions. Empirically, Continuous Reasoning improves LIBERO-PRO robustness and performs strongly on real robots, raising mean subtask success over π0.5 by 40.4% on TX-G2, an AgiBot G2-compatible variant, and 26.3% on HSR. This suggests that reasoning in VLA is less about extra tokens than about a shareable, verifiable internal language for action.

2606.06915 2026-06-09 cs.CL cs.AI cs.LG 版本更新

ThinkBooster: A Unified Framework for Seamless Test-Time Scaling of LLM Reasoning

ThinkBooster: 一种用于LLM推理无缝测试时扩展的统一框架

Vladislav Smirnov, Chieu Nguyen, Sergey Senichev, Minh Ngoc Ta, Ekaterina Fadeeva, Artem Vazhentsev, Daria Galimzianova, Nikolai Rozanov, Viktor Mazanov, Jingwei Ni, Tianyi Wu, Igor Kiselev, Mrinmaya Sachan, Iryna Gurevych, Preslav Nakov, Timothy Baldwin, Artem Shelmanov

发表机构 * MBZUAI ETH Zürich(苏黎世联邦理工学院) Imperial College London(伦敦帝国理工学院) NUS(新加坡国立大学) Accenture(埃森哲) Innopolis University(因诺波利斯大学) Independent Researcher(独立研究者)

AI总结 提出ThinkBooster框架,通过模块化库、联合评估基准和可部署代理服务,实现LLM推理的测试时计算扩展,在数学和编码任务上验证了性能-计算权衡。

AI中文摘要

测试时计算(TTC)扩展已成为一种强大的范式,通过在推理期间分配额外计算(例如,通过多样本生成和基于验证器的重新排序)来改进大型语言模型(LLM)推理。现有的TTC扩展策略和推理评分器仍然碎片化,在不一致的协议下进行评估,并且很少通过质量-成本权衡的视角进行分析。我们引入了ThinkBooster,一个用于LLM推理无缝测试时计算扩展的统一框架,它包括(i)一个模块化的Python库,实现了最先进的TTC扩展策略和评分器家族,(ii)一个联合评估性能和计算效率的基准,以及(iii)一个可部署的、兼容OpenAI的代理服务,使得将自适应推理无缝集成到实际应用中成为可能。我们还提供了一个演示可视化调试器,用于检查推理轨迹、中间选择决策和替代推理路径。在数学和编码任务上的实证结果揭示了TTC扩展策略和评分方法的性能-计算权衡,并表明ThinkBooster在实际任务中提供了实际收益。代码以MIT许可证在线提供。

英文摘要

Test-time compute (TTC) scaling has emerged as a powerful paradigm for improving large language model (LLM) reasoning by allocating additional compute during inference, e.g., via multi-sample generation and verifier-based reranking. Existing TTC scaling strategies and reasoning scorers remain fragmented, evaluated under inconsistent protocols, and are rarely analyzed through the lens of quality-cost trade-offs. We introduce ThinkBooster, a unified framework for seamless test-time compute scaling of LLM reasoning, which consists of (i) a modular Python library implementing state-of-the-art TTC scaling strategy and scorer families, (ii) a benchmark that jointly evaluates performance and computational efficiency, and (iii) a deployable OpenAI-compatible proxy service that enables drop-in integration of adaptive reasoning into real-world applications. We further provide a demo visual debugger for inspecting the reasoning trajectories, intermediate selection decisions, and alternative reasoning paths. Empirical results on mathematical and coding tasks reveal the performance-compute trade-offs of TTC scaling strategies and scoring methods and demonstrate that ThinkBooster provides practical gains in real-world tasks. The code is available online under an MIT license.
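
The abstract mentions multi-sample generation with verifier-based reranking as one form of test-time compute scaling. A minimal sketch of that generic pattern (often called best-of-N) follows. It does not use or reproduce the ThinkBooster library: `best_of_n`, `generate`, and `score` are hypothetical names, and the toy generator and scorer stand in for an LLM sampler and a verifier.

```python
from typing import Callable, Tuple


def best_of_n(
    generate: Callable[[int], str],
    score: Callable[[str], float],
    n: int,
) -> Tuple[str, float, int]:
    """Sample n candidates, score each, and return (best, best_score, n_samples).

    The sample count is returned alongside the answer so that quality can be
    reported against the compute spent.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    scored = [(score(candidate), candidate) for candidate in (generate(i) for i in range(n))]
    best_score, best = max(scored, key=lambda pair: pair[0])  # ties: first candidate wins
    return best, best_score, n


# Toy stand-ins: the "model" emits a digit, the "verifier" prefers answers near 3.
def toy_generate(i: int) -> str:
    return str((i * 7) % 5)


def toy_score(answer: str) -> float:
    return -abs(int(answer) - 3)


answer, answer_score, samples_used = best_of_n(toy_generate, toy_score, n=5)
```

Raising `n` can only keep or improve the selected score under a fixed scorer, at a cost that grows linearly with the number of samples, which is the quality-cost trade-off the abstract refers to.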

2. 表示学习、自监督与对比学习 30 篇

2606.07617 2026-06-09 cs.LG cs.AI 新提交

Query Lens: Interpreting Sparse Key-Value Features with Indirect Effects

Query Lens: 通过间接效应解释稀疏键值特征

Hwiyeong Lee, Ingyu Bang, Uiji Hwang, Hyelim Lim, Taeuk Kim

发表机构 * KAIST(韩国科学技术院)

AI总结 提出Query Lens方法,通过考虑编码器侧键特征和解码器侧值特征以及下游模块的间接效应,实现对稀疏自编码器特征更全面、忠实的解释。

Comments Accepted to ICML 2026

AI中文摘要

虽然稀疏自编码器提供的特征比单个神经元更可解释,但可靠地描述这些特征仍然具有挑战性。我们提出了Query Lens,它扩展了Logit Lens,能够对稀疏特征进行更全面、忠实的解释。通过联合考虑编码器侧的键特征和解码器侧的值特征,我们识别出激活特征的输入以及它促进的输出。我们还考虑了当特征被下游模块处理时产生的间接、模块介导的效应,超越了Logit Lens捕获的直接效应。在实验中,我们发现Query Lens为那些在Logit Lens下仍不可解释的特征生成了连贯的token签名。最后,我们提出了子空间通道假说,表明下游模块通过层特定的子空间读取特征。

英文摘要

While sparse autoencoders provide features more interpretable than individual neurons, reliably characterizing them remains challenging. We propose Query Lens, which extends Logit Lens to enable more comprehensive and faithful interpretations of sparse features. By jointly considering encoder-side key features and decoder-side value features, we identify both the inputs that activate a feature and the outputs it promotes. We also account for indirect, module-mediated effects that arise when the feature is processed by downstream modules, going beyond the direct effect captured by Logit Lens. In experiments, we find that Query Lens yields coherent token signatures for features that remain uninterpretable under Logit Lens. Finally, we propose the Subspace Channel Hypothesis, suggesting that downstream modules read features through layer-specific subspaces.

2606.07770 2026-06-09 cs.LG 新提交

Contrast encodes inductive bias: separating slow noise from dynamics in predictive representation learning

对比编码归纳偏置:在预测性表示学习中将慢噪声与动力学分离

Paarth Gulati, Ilya Nemenman

发表机构 * Emory University(埃默里大学)

AI总结 针对自监督方法在潜在空间预测动力学时混淆慢噪声与信号的问题,本文分析其根源为跨轨迹采样负样本的对比目标,提出通过轨迹内采样负样本消除预测捷径,从而强制编码动力学相关变量。

AI中文摘要

完全在潜在空间中学习表示并预测动力学的自监督方法(如JEPA)已被证明会把缓慢变化的噪声与其旨在捕捉的动力学信号相混淆。具体来说,当噪声特征在每条轨迹内近似保持不变时,对比预测目标会优先编码这些特征,而不是支配系统的真实潜在变量。学习到的表示因此被轨迹特定的噪声主导,下游性能随噪声强度增大而下降,且即使增加训练轨迹的数量和时长也不会改善。我们认为这种失败是目标函数本身的属性,为一大类跨轨迹采样负样本的对比预测目标所共有。为了说明这种普遍性,我们在两种设置中研究了该失败模式及其补救措施:合成移动点数据集上的标准SimCLR风格JEPA,以及刚体摆视频上的DySIB(一种最近提出的、用于提取物理上可解释的动力学表示的方法)。当负样本改为在单条轨迹内采样时,慢噪声无法再区分该轨迹内的帧,从而消除了预测捷径。在许多这样的轨迹上同时训练同一个编码器,会迫使它编码与动力学相关的变量,并且即使在强慢噪声下,更长的轨迹也能产生更好的表示。我们的结果为动力学表示学习中对比预测目标的设计指明了原则,尤其适用于带有噪声实验观测的物理系统。

英文摘要

Self-supervised methods that learn representations and predict dynamics fully in the latent space, such as JEPA, have been shown to confuse slowly varying noise with the dynamical signals they aim to capture. Specifically, when noise features remain approximately constant within each trajectory, contrastive predictive objectives preferentially encode these features instead of the true latent variables governing the system. The learned representation then becomes dominated by trajectory-specific noise, so downstream performance degrades with noise strength and does not improve even as the number and duration of training trajectories increase. We argue that this failure is a property of the objective itself, shared by a long line of contrastive predictive objectives that sample negatives across trajectories. To illustrate this generality, we study the failure mode and its remedy in two settings: a standard SimCLR-style JEPA on a synthetic moving-dot dataset, and DySIB, a recently introduced method designed for extracting physically interpretable representations of dynamics, on movies of a rigid-body pendulum. When negatives are instead sampled within a single trajectory, the slow noise can no longer distinguish frames within that trajectory, removing the predictive shortcut. Training one encoder simultaneously on many such trajectories then forces it to encode the variables relevant for the dynamics, with longer trajectories yielding better representations even for strong slow noise. Our results point toward principles for designing contrastive predictive objectives in dynamical representation learning, especially for physical systems with noisy experimental observations.
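
The difference between the two negative-sampling schemes contrasted in the abstract can be made concrete at the level of frame indices. The sketch below is an illustration only, not the authors' code: the function name, the `(trajectory, time)` indexing, and the choice of the next frame as the positive are assumptions.

```python
import random
from typing import List, Tuple

Frame = Tuple[int, int]  # (trajectory index, time index)


def sample_negatives(
    num_traj: int,
    traj_len: int,
    anchor: Frame,
    k: int,
    within: bool,
    rng: random.Random,
) -> Tuple[Frame, List[Frame]]:
    """Return (positive, negatives) for a contrastive predictive objective.

    within=False: negatives come from other trajectories, so any feature that
    is constant per trajectory (slow noise) is enough to tell them apart.
    within=True: negatives come from the anchor's own trajectory, where slow
    noise is shared and cannot serve as a shortcut.
    """
    i, t = anchor
    positive = (i, t + 1)  # assumption: predict the next frame
    if within:
        pool = [(i, s) for s in range(traj_len) if s not in (t, t + 1)]
    else:
        pool = [(j, s) for j in range(num_traj) if j != i for s in range(traj_len)]
    return positive, rng.sample(pool, k)


rng = random.Random(0)
pos_within, neg_within = sample_negatives(4, 10, (2, 3), k=5, within=True, rng=rng)
pos_across, neg_across = sample_negatives(4, 10, (2, 3), k=5, within=False, rng=rng)
```

With `within=True` every negative shares the anchor's trajectory index, so only variables that change along the trajectory can separate the positive from the negatives.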

2606.08204 2026-06-09 cs.LG cs.CV 新提交

Neural Field Tokenizations with Hierarchy and Spatial Locality Priors

具有层次和空间局部性先验的神经场分词

Alonso Urbano, David W. Romero, Max Zimmer, Sebastian Pokutta

发表机构 * Zuse Institute Berlin (ZIB)(柏林祖斯研究所) Cartesia AI Technische Universität Berlin(柏林工业大学)

AI总结 提出LH-NeF框架,利用层次和局部性先验学习通用连续信号的分词表示,通过前馈编码替代元学习,内存减少42倍,批大小提升133倍,在图像、3D形状和气候场上匹配或超越多种基线。

AI中文摘要

神经场将数据参数化为从坐标到值的函数,为跨模态表示学习提供统一框架。现有方法以每样本元学习为主,由于内存密集的内循环优化而扩展性差。自然的替代方案——前馈编码——通常引入模态特定假设,牺牲了神经场学习的通用性。我们认为局部性和层次性是学习场表示的有用先验,可以在不损害模态无关性的情况下注入。我们提出LH-NeF,一个学习连续信号通用分词表示的框架。保持局部性的层次编码器将原始坐标-值场观测映射到结构化分词,训练期间从中重建场。通过用单次前向传播替代元学习的内循环,LH-NeF比最强的模态无关基线少用42倍内存,支持133倍更大的批次。在图像、3D形状和气候场上,我们的学习表示在重建和下游任务上匹配或超过模态无关、模态特定和专用生成神经场基线的性能。

英文摘要

Neural fields parameterize data as functions from coordinates to values, providing a unified framework for representation learning across modalities. Existing approaches are dominated by per-sample meta-learning, which scales poorly due to memory-intensive inner-loop optimization. The natural alternative -- feed-forward encoding -- typically introduces modality-specific assumptions, sacrificing the generality that makes learning with neural fields attractive. We argue that locality and hierarchy are useful priors for learning field representations that can be injected without compromising modality-agnosticism. We propose LH-NeF, a framework to learn general-purpose tokenized representations of continuous signals. A locality-preserving hierarchical encoder maps raw coordinate-value field observations to structured tokens, from which the field is reconstructed during training. By replacing meta-learning's inner loop with a single forward pass, LH-NeF uses 42× less memory and supports 133× larger batches than the strongest modality-agnostic baseline. Across images, 3D shapes, and climate fields, our learned representations match or exceed performance of modality-agnostic, modality-specific, and specialized generative neural field baselines on both reconstruction and downstream tasks.

2606.09653 2026-06-09 cs.LG 新提交

A Unifying Framework for Concept-Based Representational Similarity

基于概念的表征相似性的统一框架

Grégoire Dhimoïla, Victor Boutin, Agustin Martin Picard, Thomas Fel, Thomas Serre

发表机构 * Brown University(布朗大学) ENS Paris Saclay(巴黎萨克雷高等师范学校) CNRS(法国国家科学研究中心) DEEL - IRT Saint Exupéry(DEEL - IRT 圣埃克苏佩里) Goodfire

AI总结 提出统一框架分解概念对齐的两个轴(表征vs.概念、实例级vs.分布级),定义四种性质,并引入干预基准InterVenchA和耦合稀疏自编码器CoSAE,证明对齐是多目标问题。

AI中文摘要

跨模型和模态的学习表征常常展现出惊人的结构相似性,暗示着共享的潜在概念分解。然而,概念对齐的定义仍不明确:现有方法在相同术语下优化不同目标,模糊了实际对齐的内容。我们提出了一个统一框架,沿两个轴分解对齐:对齐什么(表征vs.概念)以及在什么层面对齐(实例级vs.分布级)。这产生了四个相应的性质,即翻译和概念一致性的实例级与分布级变体,并精确揭示了现有方法提供了其中哪些保证。我们进一步引入了InterVenchA,一个基于干预的基准,分别衡量提取质量、翻译质量和概念一致性。通过理论和实验,我们表明对齐目标之间通常假设的等价性在实践中并不成立:优化一个性质并不能可靠地恢复其他性质,纯无监督目标无法恢复有意义的实例级对齐。然后我们提出了耦合稀疏自编码器(CoSAE),它联合施加互补的对齐目标。强对齐仅在这种机制下出现。令人惊讶的是,在锚定分布级目标时,仅0.1%的配对数据就足以恢复实例级对齐。总体而言,我们的结果表明概念对齐本质上是多目标的:它必须作为多目标问题来定义、衡量和优化。

英文摘要

Learned representations across models and modalities often exhibit striking structural similarities, suggesting shared underlying concept decompositions. However, concept alignment remains poorly defined: existing approaches optimize different objectives under the same terminology, obscuring what is actually aligned. We propose a unifying framework that decomposes alignment along two axes: what is aligned (representations vs. concepts) and at what level (instance-wise vs. distributional). This induces four corresponding properties -- instance-wise and distributional variants of translation and concept consistency -- and reveals precisely which of these guarantees existing methods provide. We further introduce InterVenchA, an intervention-based benchmark that separately measures extraction quality, translation quality, and concept consistency. Through theory and experiments, we show that commonly assumed equivalences between alignment objectives fail in practice: optimizing one property does not reliably recover the others, and purely unsupervised objectives fail to recover meaningful instance-level alignment. We then propose the Coupled Sparse Autoencoder (CoSAE), which jointly enforces complementary alignment objectives. Strong alignment emerges only in this regime. Surprisingly, as little as 0.1% paired data is sufficient to recover instance-level alignment when anchoring distributional objectives. Overall, our results show that concept alignment is fundamentally multi-objective: it must be defined, measured, and optimized as such.

2606.09718 2026-06-09 cs.LG cs.CV 新提交

Evaluating the Representation Space of Diffusion Models via Self-Supervised Principles

通过自监督原则评估扩散模型的表示空间

Xiao Li, Yixuan Jia, Zekai Zhang, Xiang Li, Lianghe Shi, Jinxin Zhou, Zhihui Zhu, Liyue Shen, Qing Qu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 受自监督学习启发,提出基于Fisher信息的度量ICR,分解特征为不变和残差成分,用于联合评估扩散模型的表示与生成能力,发现中间噪声水平下不变性最强且分类性能最佳,ICR可敏感检测训练中的记忆化。

Comments First two authors contributed equally. Accepted at ICML 2026

AI中文摘要

扩散模型已展现出卓越的生成能力,并成为强大的自监督表示学习器,但这两种能力之间的联系仍较少被探索。受自监督学习(SSL)启发,我们引入了一个框架,用于联合评估扩散模型的表示和生成能力。具体地,我们将特征分解为不变成分和残差成分,并推导出不变污染比(ICR),这是一种基于Fisher的度量,用于量化残差变化在特征空间中对不变信号的污染程度。我们利用该框架分析扩散模型的判别和生成行为。在表示方面,我们发现不变性在中间噪声水平达到峰值,同时该水平也产生最佳的下游分类性能。在生成方面,我们研究了在数据有限情况下训练如何从真正的泛化过渡到记忆化,并表明ICR可作为早期学习的敏感训练时指标:沿Fisher方向增加的残差能量标志着记忆化的开始,该指标仅从训练特征即可检测,无需外部评估器或保留测试集。总体而言,我们的结果表明,扩散模型可以通过其学习表示的几何结构从自监督视角进行监控。

英文摘要

Diffusion models have demonstrated remarkable generative capabilities and have also emerged as powerful self-supervised representation learners, yet the connection between these two abilities remains less explored. Drawing inspiration from self-supervised learning (SSL), we introduce a framework for jointly evaluating the representation and generation capabilities of diffusion models. Specifically, we decompose features into invariant and residual components and derive the Invariant Contamination Ratio (ICR), a Fisher-based metric that quantifies how residual variation contaminates invariant signal in feature space. We use this framework to analyze both discriminative and generative behavior of diffusion models. On the representation side, we find that invariance peaks at intermediate noise levels, which also yield the best downstream classification performance. On the generative side, we study how training transitions from genuine generalization to memorization in data-limited regimes, and show that ICR serves as a sensitive training-time indicator of early learning: increasing residual energy along Fisher directions marks the onset of memorization, detectable from training features alone without external evaluators or held-out test sets. Overall, our results show that diffusion models can be monitored from a self-supervised perspective through the geometry of their learned representations.

2606.09725 2026-06-09 cs.LG 新提交

Disentanglement with Holographic Reduced Representations

基于全息约简表示的解缠

Jhonny J. Velasquez Olivera, Christo K. Thomas, Walid Saad

发表机构 * Virginia Tech(弗吉尼亚理工大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 提出使用全息约简表示(HRR)的无监督解缠算法,利用HRR解绑操作提供归纳偏置,分离数据中的变化因子,并通过信息论分析证明其诱导近似独立的符号-值对。

AI中文摘要

解缠,即使用神经网络分离数据中的变化因子,仍然是机器学习中长期存在的挑战。先前的工作通过变分自编码器和生成对抗网络,结合变分推断和信息论约束来解决这个问题。与依赖连续表示的方法不同,我们提出一种将解缠表示视为符号结构的设计,其动机是构成分布样本的各概念之间的组合关系。然而,在保持可微性的同时用神经网络学习离散符号结构十分困难,通常需要复杂的架构。为此,我们引入一种无监督学习算法,使用全息约简表示(HRR)进行神经解缠。我们表明,HRR解绑操作为分离因子提供了归纳偏置,并且在潜在遍历和解缠度量上取得了与基线相比有竞争力的结果。我们通过对HRR解绑通道的信息论分析补充了这些实证发现。我们证明解绑会诱导近似独立的符号-值对,并推导出每个槽的容量界,量化了可以可靠编码的不同符号概念的数量,从而定量解释了朝向解缠的归纳偏置。所得表示不同于标准的基于自编码器的模型:其潜在单元是相加在一起的向量,而不是低维潜在向量的标量维度。我们表明,这种HRR表示比其他解缠表示对噪声更鲁棒,并在一定范围的信噪比(SNR)下保持重建质量。

英文摘要

Disentanglement, the separation of factors of variation in data using neural networks, remains a long-standing challenge in machine learning. Prior work has addressed this problem with variational autoencoders and generative adversarial networks that incorporate ideas from variational inference and information-theoretic constraints. In contrast to methods that rely on continuous representations, we propose a design that treats disentangled representations as symbolic structures, motivated by the compositional relationships among the concepts that make up samples from a distribution. However, learning discrete symbolic structures with neural networks while maintaining differentiability is difficult and often requires complex architectures. To address this, we introduce an unsupervised learning algorithm that uses holographic reduced representations (HRR) for neural disentanglement. We show that the HRR unbinding operation provides an inductive bias for separating factors and yields competitive results against baselines, as measured by latent traversals and disentanglement metrics. We complement these empirical findings with an information-theoretic analysis of the HRR unbinding channel. We prove that unbinding induces approximately independent symbol-value pairs and derive a per-slot capacity bound that quantifies how many distinct symbolic concepts can be reliably encoded, giving a quantitative account of the inductive bias toward disentanglement. The resulting representations differ from standard autoencoder-based models, in that their latent units are vectors that are summed together, rather than scalar dimensions of a low-dimensional latent vector. We show that this HRR representation is more robust to noise than other disentangled representations and maintains reconstruction quality across a range of SNRs.
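
The binding and unbinding operations that the abstract builds on are the classical HRR primitives: binding is circular convolution, and unbinding convolves with the involution of the key (an approximate inverse). The sketch below shows only these primitives on random vectors, not the paper's learning algorithm; the vector dimension, the symbol names, and the nearest-neighbour cleanup step are assumptions chosen for illustration.

```python
import math
import random
from typing import Dict, List

Vec = List[float]


def bind(a: Vec, b: Vec) -> Vec:
    """Circular convolution: the HRR binding operation."""
    n = len(a)
    return [sum(a[j] * b[(i - j) % n] for j in range(n)) for i in range(n)]


def involution(a: Vec) -> Vec:
    """Approximate inverse of a under circular convolution: a*[i] = a[-i mod n]."""
    n = len(a)
    return [a[(-i) % n] for i in range(n)]


def unbind(trace: Vec, key: Vec) -> Vec:
    """Recover a noisy copy of the value that was bound to `key` inside `trace`."""
    return bind(involution(key), trace)


def cosine(a: Vec, b: Vec) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))


def nearest(query: Vec, codebook: Dict[str, Vec]) -> str:
    """Cleanup step: name of the codebook vector most similar to the noisy query."""
    return max(codebook, key=lambda name: cosine(query, codebook[name]))


def random_vector(n: int, rng: random.Random) -> Vec:
    # Elements ~ N(0, 1/n), so vectors have roughly unit norm.
    return [rng.gauss(0.0, 1.0 / math.sqrt(n)) for _ in range(n)]


rng = random.Random(0)
dim = 256
k_color, k_shape, red, square, blue = (random_vector(dim, rng) for _ in range(5))
codebook = {"red": red, "square": square, "blue": blue}

# One trace stores two key-value pairs as a sum of bound vectors.
trace = [x + y for x, y in zip(bind(k_color, red), bind(k_shape, square))]
decoded_color = nearest(unbind(trace, k_color), codebook)
decoded_shape = nearest(unbind(trace, k_shape), codebook)
```

Each unbinding returns the stored value plus crosstalk noise from the other pair, which is why a cleanup step is used here and why the number of pairs a trace can hold reliably is bounded, the quantity the abstract's per-slot capacity bound addresses.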

2606.07522 2026-06-09 cs.CL cs.LG cs.SI 交叉投稿

Community-Specific Slang and Entity Detection via Semantic Shift in Fine-Tuned Language Models

通过微调语言模型中的语义偏移检测社区特定俚语和实体

Julia Kruk, Sanchita Porwal, Amitrajit Bhattacharjee, Mansi Phute

发表机构 * Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出无监督方法,通过测量词在微调前后的语义偏移幅度,从在线社区文本中自动识别俚语、独特实体和民俗用语。

Comments 6 pages, 6 figures, 2 tables

详情
AI中文摘要

我们提出一种无监督方法,通过隔离词汇中具有最大语义偏移幅度的词,来解析来自在线社区的俚语、独特实体和民俗用语。语义偏移定义为在社区特定文本语料上微调预训练大语言模型(LLM)后,词编码表示的演化。该值与基础模型和微调模型对词的编码表示之间的余弦相似度成反比。我们在从3个Reddit子版块(r/Technology、r/Gaming、r/WorldofWarcraft)收集的文本语料上微调DistilRoBERTa模型,对词汇上的余弦相似度分布进行建模,并表明通过提取底部10百分位的数据,可以成功解析对社区具有独特意义的词。相反,我们表明顶部10百分位的数据由具有相对普遍语义的词组成。

英文摘要

We propose an unsupervised method of resolving slang, unique entities, and folklore from online communities by isolating words in the lexicon that have the highest magnitude of semantic shift. Semantic shift is defined as the evolution of a word's encoded representation as a result of fine-tuning a pretrained Large Language Model (LLM) on a community-specific text corpus. This value is inversely proportional to the cosine similarity between the base model's encoded representation of a word and the fine-tuned model's encoded representation of the same word. We fine-tune the DistilRoBERTa model on text corpora collected from 3 Reddit subreddits (r/Technology, r/Gaming, r/WorldofWarcraft), model a distribution of cosine similarity over the lexicon, and show that one can successfully resolve words that have unique significance to the community by pulling data in the bottom 10th percentile. In contrast, we show that data in the top 10th percentile consist of words that carry relatively universal semantics.
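The selection step described above can be sketched as follows. This is a minimal illustration, assuming per-word embeddings from the base and fine-tuned models are already available as plain lists; the toy vectors and the function name `bottom_percentile_words` are hypothetical and not from the paper.

```python
import math

def cosine_similarity(u, v):
    dot = sum(x * y for x, y in zip(u, v))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v)))

def bottom_percentile_words(base_emb, tuned_emb, percentile=10):
    """Return the words whose base/fine-tuned cosine similarity is in the lowest `percentile` percent."""
    sims = {w: cosine_similarity(base_emb[w], tuned_emb[w]) for w in base_emb}
    ranked = sorted(sims, key=sims.get)  # lowest similarity (largest shift) first
    k = max(1, math.ceil(len(ranked) * percentile / 100))
    return ranked[:k], sims

# Hypothetical toy embeddings (made-up numbers, not results from the paper).
base = {"the": [1.0, 0.0], "game": [0.8, 0.6], "tank": [1.0, 0.1], "patch": [0.6, 0.8],
        "and": [0.0, 1.0], "raid": [0.9, 0.4], "of": [0.7, 0.7], "loot": [0.5, 0.9],
        "is": [0.3, 0.9], "to": [0.2, 1.0]}
tuned = dict(base)
tuned["tank"] = [0.1, 1.0]  # pretend fine-tuning moved this word the most

shifted, sims = bottom_percentile_words(base, tuned)
```

With real models, the embeddings would come from the encoder's output for each word; how words are pooled from subword tokens is left open here because the abstract does not specify it.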

2606.07725 2026-06-09 physics.geo-ph cs.LG 交叉投稿

GNSS-FM: A Self-Supervised Foundation Model for Daily GNSS Displacement Time Series

GNSS-FM:用于日常GNSS位移时间序列的自监督基础模型

Nick Teutschmann, Laura Crocetti, Fanny Lehmann, Leonardo Trentini, Benedikt Soja

发表机构 * Institute of Geodesy and Photogrammetry, ETH Zurich(大地测量与摄影测量研究所,苏黎世联邦理工学院) ETH AI Center(ETH人工智能中心)

AI总结 提出GNSS-FM自监督基础模型,通过双流输入和掩码潜在预测预训练,在位移预测和地震阶跃定位任务上优于强基线。

详情
AI中文摘要

来自全球导航卫星系统(GNSS)的位移时间序列对于广泛的应用至关重要,包括监测构造地壳变形和研究地震周期的不同阶段。机器学习方法已被证明在GNSS应用中具有前景;然而,大多数方法仍然是完全监督的。这造成了瓶颈,因为标记数据稀缺,尽管大量未标记的GNSS数据可免费获取。我们提出了GNSS-FM,一个用于日常GNSS时间序列的自监督基础模型。该模型使用结合位移和速度类增量的双流输入,并通过掩码潜在预测目标进行预训练,该目标采用从wav2vec 2.0改编的向量量化目标,并针对大地测量数据进行了若干修改。在来自全球超过17,000个GNSS站的数据上预训练后,对学习到的码本的分析表明,这些表示捕获了GNSS位移数据中的主要信号类型,包括地震偏移、构造漂移和季节性模式。该基础模型随后在两个下游任务上进行微调,即90天位移预测和地震阶跃定位,在这两个任务中,它都优于强大的任务特定基线。这些结果表明,自监督预训练是GNSS时间序列分析的一种有前景的方法。

英文摘要

Displacement time series from Global Navigation Satellite Systems (GNSS) are essential for a wide range of applications, including monitoring tectonic crustal deformations and investigating the different stages of the earthquake cycle. Machine learning methods have proven promising for GNSS applications; however, most remain fully supervised. This creates a bottleneck as labeled data are scarce, even though large amounts of unlabeled GNSS data are freely available. We present GNSS-FM, a self-supervised foundation model for daily GNSS time series. The model uses a dual-stream input combining displacement and velocity-like increments, and is pretrained using a masked latent prediction objective with vector-quantized targets adapted from wav2vec 2.0, with several modifications for geodetic data. Pretrained on data from over 17,000 globally distributed GNSS stations, an analysis of the learned codebook suggests that the representations capture the main signal types in GNSS displacement data, including seismic offsets, tectonic drift, and seasonal patterns. The foundation model is later fine-tuned on two downstream tasks, namely 90-day displacement forecasting and seismic step localization, where it outperforms strong task-specific baselines in both cases. These results show that self-supervised pretraining is a promising approach for GNSS time series analysis.

2606.08236 2026-06-09 cs.CL cs.LG 交叉投稿

Shared Semantics, Divergent Mechanisms: Unsupervised Feature Discovery by Aligning Semantics and Mechanisms

共享语义,不同机制:通过对齐语义与机制的无监督特征发现

Hyunjin Cho, Youngji Roh, Jaehyung Kim

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种无监督方法,通过语义嵌入和归因签名聚类模型续写,发现隐藏的机制模式,补充电路分析。

Comments 40 pages

详情
Journal ref
ICML 2026 Spotlight
AI中文摘要

随着大型语言模型越来越多地部署在高风险场景中,人们越来越需要工具来审计不仅模型输出,还包括产生这些输出的内部计算。电路分析是机械可解释性中的核心方法,但通常是目标条件化的,解释单个提示与选定补全的配对。这种目标条件化设置可能掩盖模型续写分布中的异质性。我们引入了分布级无监督特征发现,该方法使用语义内容和序列级机械归因对采样续写进行聚类,而无需手动指定目标输出。我们的方法用语义嵌入和前缀到续写的归因签名表示每个续写,然后优化一个率失真目标,该目标在语义一致性、机械一致性和聚类粒度之间进行权衡。在聚类和引导分析中,发现的聚类暴露了单视图基线遗漏的续写模式,并提供了干预证据,表明聚类签名对应于可操作的机械因素。总的来说,我们的方法通过提供对模型续写分布背后机制的可扩展审计,补充了电路分析和行为评估。

英文摘要

As large language models are increasingly deployed in high-stakes settings, there is a growing need for tools that audit not only model outputs but also the internal computations that produce them. Circuit analysis is a central approach in mechanistic interpretability, but it is typically target-conditioned, explaining a single prompt paired with a chosen completion. This target-conditioned setup can obscure heterogeneity across a model's continuation distribution. We introduce distribution-level unsupervised feature discovery, which clusters sampled continuations using both semantic content and sequence-level mechanistic attributions, without manually specifying target outputs. Our method represents each continuation with a semantic embedding and a prefix-to-continuation attribution signature, then optimizes a rate-distortion objective that trades off semantic coherence, mechanistic consistency, and cluster granularity. Across clustering and steering analyses, the discovered clusters expose continuation modes that single-view baselines miss and provide interventional evidence that cluster signatures correspond to actionable mechanistic factors. Overall, our approach complements circuit analysis and behavioral evaluation by providing a scalable audit of the mechanisms underlying a model's continuation distribution.

2606.08496 2026-06-09 cs.CL cs.LG 交叉投稿

SAEExplainer: Interpreting SAE Features with Activation-Guided Preference Optimization

SAEExplainer: 基于激活引导偏好优化的SAE特征解释

Jingyi He, Haiyan Zhao, Ruxue Shi, Yanguang Liu, Xin Wang, Fei Sun, Mengnan Du

发表机构 * Shanghai Jiao Tong University(上海交通大学) NJIT(新泽西理工学院) Jilin University(吉林大学) Institute of Computing Technology, CAS(中国科学院计算技术研究所) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出SAEExplainer框架,利用激活分数作为奖励信号,通过两轮优化迭代自纠正基础解释,减少解释幻觉并增强因果触发模式。

详情
AI中文摘要

尽管稀疏自编码器(SAE)通过将密集表示分解为稀疏特征缓解了大语言模型(LLM)的不透明性,但解释这些特征仍然是一个核心挑战。然而,当前的解释方法通常运行在开环范式下,未能利用机械反馈进行进一步优化。在本文中,我们提出SAEExplainer,一个利用激活分数作为客观奖励信号来训练模型进行自我纠正和迭代自举的训练框架。通过两轮优化过程迭代验证和纠正基础解释,SAEExplainer实现了其解释能力的持续提升。该机制显著减少了解释幻觉并强化了因果触发模式。大量实验表明,我们的方法在大多数指标上优于已有基线,特别是在因果触发和判别性激活方面。

英文摘要

Although Sparse Autoencoders (SAEs) have mitigated the opacity of large language models (LLMs) by decomposing dense representations into sparse features, explaining these features remains a central challenge. Current explanation methods, however, typically operate within an open-loop paradigm, failing to leverage mechanistic feedback for further refinement. In this paper, we propose SAEExplainer, a training framework that uses activation scores as an objective reward signal to train the model for self-correction and iterative bootstrapping. By iteratively verifying and correcting foundational explanations through a two-round optimization process, SAEExplainer achieves continuous improvement in its explanatory capabilities. This mechanism significantly reduces explanation hallucinations and reinforces causal triggering patterns. Extensive experiments demonstrate that our approach improves upon established baselines across most metrics, especially in causal triggering and discriminative activation.

2606.08678 2026-06-09 cs.SD cs.LG 交叉投稿

Speaker-Invariant Representation Learning for Spoofing Detection via Gradient Reversal and A Variational Information Bottleneck

基于梯度反转和变分信息瓶颈的说话人不变表示学习用于欺骗检测

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

发表机构 * Avignon Universite(阿维尼翁大学) EURECOM

AI总结 针对欺骗检测中说话人偏差导致泛化差的问题,提出教师-学生框架,利用梯度反转层和变分信息瓶颈解耦身份信息,在9个数据集上EER相对降低25.7%。

详情
AI中文摘要

先进的生成语音技术可能破坏语音生物识别的可靠性。虽然欺骗检测系统在域内条件下评估时表现出色,但对域外设置的泛化能力通常较差。在本文中,我们表明此类问题可能由说话人偏差引起,即模型学习个体声音特征而非操作或生成的标记。我们提出了一种用于说话人不变欺骗检测的教师-学生框架,该框架无需说话人标签即可解耦身份。我们利用预训练的说话人识别教师通过梯度反转层指导学生模型。为了控制抑制与语音身份相关线索和保留与欺骗检测相关线索之间的平衡,我们集成了变分信息瓶颈。在九个数据集上的评估表明,与MHFA基线相比,我们的模型实现了EER相对降低25.7%。

英文摘要

Sophisticated generative speech technology can undermine the reliability of voice biometrics. While spoofing detection systems excel when assessed under in-domain conditions, generalisation to out-of-domain settings is often poor. In this paper, we show that such issues could be caused by speaker bias, where models learn individual voice traits rather than markers of manipulation or generation. We propose a teacher-student framework for speaker-invariant spoofing detection that disentangles identity without requiring speaker labels. We leverage a pre-trained speaker recognition teacher to guide a student model via a gradient reversal layer. To control the balance between suppressing cues related to voice identity and preserving those related to spoofing detection, we integrate a Variational Information Bottleneck. Evaluations across nine datasets show our model achieves a 25.7% relative reduction in EER compared to the MHFA baseline.

2606.09181 2026-06-09 cs.CV cs.LG 交叉投稿

Counterfactual Reasoning for Fine-Grained Evidence Disentanglement in VideoQA

用于视频问答中细粒度证据分离的反事实推理

Zhou Du, Hamid Krim, Xiao Wu, Zhaoquan Yuan, Liangwei Li, Keisuke Fujii

发表机构 * School of Optoelectronic Science and Engineering, University of Electronic Science and Technology of China(电子科技大学光电科学与工程学院)

AI总结 提出反事实推理框架CREDiT,通过结构因果模型将视频问答中的跨模态表示分解为因果和非因果成分,在独立性约束下进行特征级因果干预,提升答案准确性和推理可靠性。

Comments 10 pages, 6 figures

详情
AI中文摘要

近期视频多模态模型的进展显著提升了视频问答性能。然而,这些系统往往依赖于虚假的统计相关性而非与答案相关的因果证据,导致推理不忠实且脆弱,尤其在复杂真实场景中。现有方法要么依赖跨模态相关性、昂贵的精心策划的训练资源,要么依赖不充分的因果假设和约束,且通常操作在时间区间级别。因此,它们未能明确地将因果视觉线索与混杂因素分离,且提供的细粒度证据定位有限。为解决此问题,我们提出了一种用于细粒度证据分离的反事实推理框架(CREDiT)。CREDiT使用结构因果模型形式化视频问答过程,并在独立性和最小性约束下学习明确分解为因果和非因果成分的跨模态表示。为促进忠实的分离,我们引入特征级因果干预,构建近似因果效应同时抑制非因果相关性的反事实输入。在NExT-GQA、SportsQA和SPORTU-video上的大量实验表明,CREDiT在通用和复杂体育场景中均能持续提升答案准确性和推理可靠性,从而构建更可信的视频问答系统。

英文摘要

Recent advances in video multimodal models have significantly improved VideoQA performance. However, these systems often rely on spurious statistical correlations rather than answer-relevant causal evidence, resulting in unfaithful and brittle reasoning, especially in complex real-world scenarios. Existing methods rely on cross-modality correlations, costly curated training resources, or insufficient causal assumptions and constraints, and typically operate at the time-interval level. As a result, they fail to explicitly disentangle causal visual cues from confounders and provide limited fine-grained evidence localization. To address this issue, we propose a Counterfactual Reasoning framework for fine-grained Evidence Disentanglement (CREDiT). CREDiT formulates the VideoQA process using a structural causal model and learns cross-modality representations that are explicitly decomposed into causal and non-causal components under independence and minimality constraints. To facilitate faithful disentanglement, we introduce feature-level causal interventions and construct counterfactual inputs that approximate causal effects while suppressing non-causal correlations. Extensive experiments on NExT-GQA, SportsQA, and SPORTU-video demonstrate that CREDiT consistently improves answer accuracy and reasoning reliability across both generic and complex sports scenarios, leading to more trustworthy VideoQA systems.

2606.09331 2026-06-09 cs.MM cs.AI cs.LG 交叉投稿

Conan-embedding-v3: Fusing Modality-Specific Models for Omni-Modal Embedding

Conan-embedding-v3: 融合模态特定模型实现全模态嵌入

Shiyu Li, Zhiyuan Hu, Yifan Wang, Peiming Li, Zheng Wei, Yang Tang

发表机构 * Tencent(腾讯)

AI总结 提出解耦-融合-恢复框架,通过独立训练模态专家并融合任务向量,再使用投影器恢复和平衡多模态重演解决投影器漂移问题,实现单一骨干网络支持文本、图像、视频、文档和音频检索。

详情
AI中文摘要

全模态检索承诺为文本、图像、视频、文档和音频输入提供单一嵌入空间,但由于这些模态在数据分布、架构和优化动态上存在差异,构建这样一个统一的检索器十分困难。在这项工作中,我们提出了Conan-embedding-v3,一个用于全模态检索的解耦-融合-恢复框架。Conan-embedding-v3首先独立训练模态专家,然后将它们的任务向量融合到一个单一的密集骨干网络中,我们称这种策略为解耦专家融合。我们表明,这种融合组合了视觉、视频和文档检索能力,但也暴露了基于投影器的模态的一个失败模式:当通过外部编码器和投影器附加音频时,融合骨干网络会使投影器校准到音频专家骨干网络,导致尽管原封不动地复制了所有音频特定模块,音频检索性能仍大幅下降。我们将这种失败称为投影器漂移。为了修复它,Conan-embedding-v3应用了投影器恢复(即在保持骨干网络冻结的情况下对投影器进行全参数微调),随后进行平衡的多模态重演。得到的模型在一个骨干网络中支持这些检索路径,在MMEB上达到74.9分,同时在30任务的MAEB音频套件上获得55.61分。

英文摘要

Omni-modal retrieval promises a single embedding space for text, image, video, document, and audio inputs, but building such a unified retriever is difficult since these modalities differ in data distribution, architecture, and optimization dynamics. In this work, we present Conan-embedding-v3, a decouple-fuse-recover framework for omni-modal retrieval. Conan-embedding-v3 first trains modality specialists independently and fuses their task vectors into a single dense backbone, a strategy we call Decoupled Specialist Fusion. We show that this fusion composes visual, video, and document retrieval capabilities, but also exposes a failure mode for projector-based modalities: when audio is attached through an external encoder and projector, fusing the backbone leaves the projector calibrated to the audio-specialist backbone, causing a large audio retrieval regression despite copying all audio-specific modules unchanged. We call this failure Projector Drift. To repair it, Conan-embedding-v3 applies Projector Recovery (i.e., full-parameter fine-tuning of the projector while keeping the backbone frozen) followed by balanced multi-modal rehearsal. The resulting model supports these retrieval pathways in one backbone, achieving a score of 74.9 on MMEB while obtaining 55.61 on the 30-task MAEB audio suite.
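The idea of fusing task vectors into one backbone can be sketched with plain task arithmetic. This is an assumption for illustration only: the abstract does not state the merging rule, and the parameter names, the `scale` factor, and the toy weights below are hypothetical.

```python
def task_vector(specialist, base):
    """Task vector = specialist weights minus the shared base weights, per parameter."""
    return {name: [s - b for s, b in zip(specialist[name], base[name])] for name in base}

def fuse(base, specialists, scale=1.0):
    """Add the (scaled) task vectors of all specialists onto the base backbone."""
    fused = {name: list(values) for name, values in base.items()}
    for specialist in specialists:
        tv = task_vector(specialist, base)
        for name in fused:
            fused[name] = [f + scale * t for f, t in zip(fused[name], tv[name])]
    return fused

# Hypothetical toy backbones with a single parameter tensor flattened to a list.
base = {"w": [1.0, 2.0]}
image_specialist = {"w": [1.5, 2.0]}  # changed only the first weight
video_specialist = {"w": [1.0, 1.0]}  # changed only the second weight

fused = fuse(base, [image_specialist, video_specialist])
```

The sketch also hints at why a projector trained against one specialist backbone can become miscalibrated: after fusion the backbone weights differ from those any single specialist was trained with.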

2509.01916 2026-06-09 cs.LG 版本更新

Causal Representation Learning from Network Data

从网络数据中进行因果表示学习

Jifan Zhang, Michelle M. Li, Elena Zheleva

发表机构 * Department of Statistics and Data Science, Northwestern University(统计与数据科学系,西北大学) Department of Biomedical Informatics, Harvard University(生物医学信息学系,哈佛大学) Department of Computer Science, University of Illinois Chicago(计算机科学系,伊利诺伊大学芝加哥分校)

AI总结 提出GraCE-VAE,利用图神经网络编码器整合生物网络和通路信息,在软干预下识别潜在因果图与干预目标,实验证明利用结构化生物上下文可提升干预结果预测。

Comments 19 pages, 8 figures

详情
AI中文摘要

在软干预下,因果解缠在假设线性干预忠实性以及同时拥有观测数据和干预数据的情况下是可识别的。先前的工作主要关注非结构化观测,未利用测量实体间已知的关系上下文。然而,在许多科学应用中,测量变量伴随着一个观测到的交互网络,该网络提供结构化上下文,例如蛋白质-蛋白质相互作用和通路-基因成员关系。我们提出GraCE-VAE,一种图感知的因果差异变分自编码器,它将通路级信息视为潜在因果程序的辅助视图。图神经网络编码器以这个辅助通路视图和生物图为条件,以改进摊销推理,而因果解码器仍然是一个带有软干预的潜在SCM。假设每个干预机制内的样本是独立同分布的,我们证明GraCE-VAE继承了因果差异VAE的可识别性保证,并在标准等价类内识别出潜在因果图和干预目标。在三个CRISPR扰动数据集上的实验表明,利用结构化生物上下文可以改善对干预结果(包括未见过的扰动组合)的预测。

英文摘要

Causal disentanglement from soft interventions is identifiable under the assumptions of linear interventional faithfulness and availability of both observational and interventional data. Prior work has focused on unstructured observations without leveraging known relational context among measured entities. In many scientific applications, however, the measured variables come with an observed interaction network that provides structured context, such as protein-protein interactions and pathway-gene membership. We propose GraCE-VAE, a graph-aware causal discrepancy variational autoencoder that treats pathway-level information as an auxiliary view of the latent causal programs. The graph neural network encoder conditions on this auxiliary pathway view and the biological graph to improve amortized inference, while the causal decoder remains a latent SCM with soft interventions. Assuming samples are i.i.d. within each intervention regime, we show that GraCE-VAE inherits the identifiability guarantees of causal discrepancy VAEs and identifies the latent causal graph and intervention targets up to the standard equivalence class. Experiments on three CRISPR perturbation datasets demonstrate that leveraging structured biological context improves prediction of interventional outcomes, including unseen perturbation combinations.

2509.24467 2026-06-09 cs.LG stat.ML 版本更新

Interpretable Self-Supervised Learning via Representer Landmarks and Nyström Approximation

通过表征地标和Nyström近似的可解释自监督学习

Maedeh Zarvandi, Michael Timothy, Theresa Wasserer, Debarghya Ghoshdastidar

发表机构 * Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Technical University of Munich, TUM School of Computation, Information and Technology(慕尼黑技术大学,TUM计算、信息与技术学院)

AI总结 提出KREPES框架,利用表征地标和Nyström近似,对自监督学习目标(SimCLR、BYOL、VICReg)学到的表征进行可解释性分析,并引入新指标量化透明度。

Comments 24 pages, 10 figures. Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

自监督学习(SSL)从大量未标记数据中学习表征,但所得模型通常作为黑盒运行,需要特定领域的解释。我们引入了KREPES,一个统一的框架,用于分析解释SSL目标(包括SimCLR、BYOL和VICReg)学到的表征。通过将神经网络的实证神经正切核近似与核的表征定理联系起来,我们直接通过“表征地标”(即具有影响力的未标记训练样本的表征)来表达学到的潜在空间。我们引入了新指标:“样本特定影响分数”、“条件概念影响分数”和“特征对齐差距”,以量化所学表征的透明度。KREPES能够在无监督的情况下直接审计潜在空间,例如,揭示Adult-1M数据集中的算法偏差,其中SSL使用人口统计代理来预测收入。最后,为了确保在具有100万以上样本的基准测试(ImageNet-1K、Adult-1M)上的可扩展性,KREPES为SSL目标引入了一种基于Nyström近似的新型分析推理框架。

英文摘要

Self-supervised learning (SSL) learns representations from massive unlabeled data, yet the resulting models typically operate as black boxes, necessitating domain-specific explanations. We introduce KREPES, a unified framework to analytically interpret the learned representations of SSL objectives, including SimCLR, BYOL, and VICReg. By bridging empirical neural tangent kernel approximations of neural networks with the Representer Theorem for kernels, we express the learned latent space directly via "Representer Landmarks", which are the representations of influential unlabeled training examples. We introduce novel metrics, "Sample-Specific Influence Score", "Concept-Conditioned Influence Score" and "Feature Alignment Gap", to quantify the transparency of the learned representations. KREPES enables direct audit of the latent space without supervision, for example, revealing an algorithmic bias in the Adult-1M dataset where SSL uses demographic proxies for income. Finally, to ensure scalability to benchmarks with 1M+ samples (ImageNet-1K, Adult-1M), KREPES introduces a novel Nyström approximation-based analytical inference framework for SSL objectives.

2512.00239 2026-06-09 cs.LG stat.ML 版本更新

Self-Supervised Dynamical System Representations for Physiological Time-Series

生理时间序列的自监督动力系统表示

Yenho Chen, Maxwell A. Xu, James M. Rehg, Christopher J. Rozell

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出PULSE框架,利用动力系统生成模型的信息结构,通过跨重建预训练目标提取共享系统参数信息,丢弃样本特异性噪声,提升生理时间序列的表示学习效果。

Comments Accepted to ICML 2026

详情
AI中文摘要

自监督学习(SSL)对生理时间序列的有效性取决于预训练目标在过滤掉无关噪声的同时保留关于潜在生理状态信息的能力。然而,现有策略由于依赖启发式原则或约束较差的生成任务而受到限制。为解决这一限制,我们提出一个预训练框架,该框架利用跨多个时间序列的动力系统生成模型的信息结构。该框架揭示了我们的关键见解:通过提取与跨相似时间序列样本共享的系统参数相关的生成变量信息,可以高效捕获类别身份,而应丢弃单个样本特有的噪声。基于这一见解,我们提出PULSE,一种基于跨重建的生理时间序列数据集预训练目标,它在丢弃不可迁移的样本特异性信息的同时显式提取系统信息。我们建立了提供系统信息恢复充分条件的理论,并通过合成动力系统实验进行了实证验证。此外,我们将我们的方法应用于多种真实世界数据集,证明PULSE学习到的表示能够广泛区分语义类别、提高标签效率并改进迁移学习。

英文摘要

The effectiveness of self-supervised learning (SSL) for physiological time series depends on the ability of a pretraining objective to preserve information about the underlying physiological state while filtering out unrelated noise. However, existing strategies are limited due to reliance on heuristic principles or poorly constrained generative tasks. To address this limitation, we propose a pretraining framework that exploits the information structure of a dynamical systems generative model across multiple time series. This framework reveals our key insight that class identity can be efficiently captured by extracting information about the generative variables related to the system parameters shared across similar time series samples, while noise unique to individual samples should be discarded. Building on this insight, we propose PULSE, a cross-reconstruction-based pretraining objective for physiological time series datasets that explicitly extracts system information while discarding non-transferable sample-specific information. We establish theory that provides sufficient conditions for the system information to be recovered, and empirically validate it using a synthetic dynamical systems experiment. Furthermore, we apply our method to diverse real-world datasets, demonstrating that PULSE learns representations that can broadly distinguish semantic classes, increase label efficiency, and improve transfer learning.

2601.21149 2026-06-09 cs.LG cs.AI 版本更新

Mobility-Embedded POIs: Learning What A Place Is and How It Is Used from Human Movement

移动性嵌入的POI:从人类移动中学习场所身份与使用方式

Maria Despoina Siampou, Shushman Choudhury, Shang-Ling Hsu, Neha Arora, Cyrus Shahabi

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出ME-POIs框架,通过对比学习将大规模人类移动数据与语言模型嵌入结合,学习场所功能,并在五个地图丰富任务上超越文本或移动性单独基线。

详情
AI中文摘要

近期地理空间基础模型的进展强调了学习真实世界位置(特别是人类活动集中的兴趣点POI)通用表示的重要性。然而,现有方法主要关注从静态文本元数据中提取的场所身份,或学习与轨迹上下文相关的表示,这些表示捕捉的是移动规律而非场所的实际使用方式(即POI的功能)。我们认为POI功能是通用POI表示中缺失但关键的信号。我们提出了移动性嵌入的POI(ME-POIs),这是一个框架,通过大规模人类移动数据增强从语言模型派生的POI嵌入,以学习基于真实世界使用的、以POI为中心且上下文无关的表示。ME-POIs将个体访问编码为时间上下文化的嵌入,并通过对比学习将其与可学习的POI表示对齐,以捕捉跨用户和时间的使用模式。为解决长尾稀疏性问题,我们提出了一种新机制,从附近频繁访问的POI跨多个空间尺度传播时间访问模式。我们在五个新提出的地图丰富任务上评估ME-POIs,测试其捕捉POI身份和功能的能力。在所有任务中,用ME-POIs增强文本嵌入始终优于纯文本和纯移动性基线。值得注意的是,仅使用移动数据训练的ME-POIs在某些任务上能超越纯文本模型,凸显了POI功能是准确且可泛化的POI表示的关键组成部分。

英文摘要

Recent progress in geospatial foundation models highlights the importance of learning general-purpose representations for real-world locations, particularly points-of-interest (POIs) where human activity concentrates. Existing approaches, however, focus primarily on place identity derived from static textual metadata, or learn representations tied to trajectory context, which capture movement regularities rather than how places are actually used (i.e., a POI's function). We argue that POI function is a missing but essential signal for general POI representations. We introduce Mobility-Embedded POIs (ME-POIs), a framework that augments POI embeddings derived from language models with large-scale human mobility data to learn POI-centric, context-independent representations grounded in real-world usage. ME-POIs encodes individual visits as temporally contextualized embeddings and aligns them with learnable POI representations via contrastive learning to capture usage patterns across users and time. To address long-tail sparsity, we propose a novel mechanism that propagates temporal visit patterns from nearby, frequently visited POIs across multiple spatial scales. We evaluate ME-POIs on five newly proposed map enrichment tasks, testing its ability to capture both the identity and function of POIs. Across all tasks, augmenting text-based embeddings with ME-POIs consistently outperforms both text-only and mobility-only baselines. Notably, ME-POIs trained on mobility data alone can surpass text-only models on certain tasks, highlighting that POI function is a critical component of accurate and generalizable POI representations.

2604.27810 2026-06-09 cs.LG 版本更新

Hyper-Dimensional Fingerprints as Molecular Representations

超维指纹作为分子表示

Jonas Teufel, Luca Torresi, André Eberhard, Pascal Friederich

发表机构 * Karlsruhe Institute of Technology (KIT), Institute of Nanotechnology (INT)(卡尔斯鲁厄理工学院(KIT),纳米技术研究所) Karlsruhe Institute of Technology (KIT), Institute of Anthropomatics and Robotics (IAR)(卡尔斯鲁厄理工学院(KIT),人机学与机器人研究所)

AI总结 本文提出超维指纹(HDF),通过高维向量的代数运算生成确定性分子表示,无需训练,在多种属性预测任务中表现优异,且在低维情况下保持分子相似性的一致性。

Comments Code: https://doi.org/10.5281/zenodo.19373621

详情
AI中文摘要

计算分子表示是虚拟筛选、性质预测和材料发现的基础。传统指纹效率高但因基于哈希的压缩丢失结构信息,特别是在低维情况下。通过图神经网络学习的表示恢复了这种表达性,但需要任务特定的训练和大量计算资源。本文引入超维指纹(HDF),用高维向量的代数运算替代消息传递神经网络的学习转换,生成无需训练的确定性分子表示。在多样化的属性预测基准上,HDF在大多数任务中优于传统指纹,且在不同数据集和模型间表现出更高的一致性。关键的是,HDF嵌入保持分子相似性:在32维时,HDF空间的距离与图编辑距离的皮尔逊相关系数达到0.9,而摩根指纹在同等尺寸下仅为0.55。这种结构保真度在低维情况下持续,允许简单的最近邻回归在64个组件中保持预测性。进一步在贝叶斯分子优化中展示了实际影响,HDF基于的替代模型在摩根指纹表现与随机搜索相当的领域中显著提高了样本效率。HDF因此提供了一种通用的、无需训练的替代方案,表明传统固定长度指纹中接受的信息损失是哈希编码方案的限制,而非指纹范式本身。

英文摘要

Computational molecular representations underpin virtual screening, property prediction, and materials discovery. Conventional fingerprints are efficient and deterministic but lose structural information through hash-based compression, particularly at low dimensionalities. Learned representations from graph neural networks recover this expressiveness but require task-specific training and substantial computational resources. Here we introduce hyperdimensional fingerprints (HDF), which replace the learned transformations of message-passing neural networks with algebraic operations on high-dimensional vectors, producing deterministic molecular representations without any training. Across diverse property prediction benchmarks, HDF outperforms conventional fingerprints in the majority of tasks while exhibiting greater consistency across datasets and models. Crucially, HDF embeddings preserve molecular similarity faithfully: at 32 dimensions, distances in HDF space achieve a 0.9 Pearson correlation with graph edit distance, compared to 0.55 for Morgan fingerprints at equivalent size. This structural fidelity persists at low dimensions where hash-based methods degrade, allowing simple nearest-neighbor regression to remain predictive with as few as 64 components. We further demonstrate the practical impact in Bayesian molecular optimization, where HDF-based surrogate models achieve substantially improved sample efficiency in regimes where Morgan fingerprints perform comparably to random search. HDF thus provides a general-purpose, training-free alternative to conventional molecular fingerprints, suggesting that the information loss long accepted as inherent to fixed-length fingerprints is a limitation of the hash-based encoding scheme rather than the fingerprint paradigm itself.

2605.06582 2026-06-09 cs.LG cs.CL cs.SD 版本更新

PairAlign: A Framework for Sequence Tokenization via Self-Alignment with Applications to Audio Tokenization

PairAlign:一种通过自对齐的序列标记化框架及其在音频标记化中的应用

Adhiraj Banerjee, Vipul Arora

发表机构 * Department of Electrical Engineering, Indian Institute of Technology, Kanpur(电子工程系,印度理工学院,坎浦尔)

AI总结 PairAlign通过序列级自对齐实现紧凑音频标记化,利用条件序列生成方法,提升标记一致性、长度控制和编辑相似性。

Comments 57 pages main content, 109 total pages, 9 Figures, pre-print, Under Review

详情
AI中文摘要

许多感官数据的操作——比较、记忆、检索和推理——自然地在离散符号结构上表达。在语言中,这种接口由标记提供;在音频中,必须学习。现有音频标记器依赖于量化、聚类或编解码器重建,将标记局部分配,因此序列一致性、紧凑性、长度控制、终止和编辑相似性很少被直接优化。我们引入PairAlign,一种通过序列级自对齐实现紧凑音频标记化的框架。PairAlign将标记化视为条件序列生成:编码器将语音映射为连续条件,自回归解码器从BOS开始生成标记,学习标记身份、顺序、长度和EOS位置。给定两个保持内容的视图,每个视图的序列在另一个视图的表示下被训练为可能,而无关示例提供竞争序列。这为可扩展的编辑距离保留代理,同时抑制许多对一的坍缩。PairAlign从VQ式标记化开始,并通过EMA教师目标、交叉配对教师强制、前缀损坏、似然对比和长度控制进行优化。在3秒语音上,PairAlign学习紧凑、非退化的序列,具有广泛的词汇使用和强跨视图一致性。在检索测试中,它保留编辑距离搜索,同时将存档标记数量减少55%。连续扫频探针显示其局部重叠低于密集几何标记器,但具有更强的长度控制和在100毫秒移位下的受约束编辑轨迹。PairAlign是一种序列符号预测学习者:像JEPA式目标一样,它从另一个视图预测一个抽象目标作为学习的可变长度符号序列,而不是连续潜在变量。

英文摘要

Many operations on sensory data -- comparison, memory, retrieval, and reasoning -- are naturally expressed over discrete symbolic structures. In language this interface is given by tokens; in audio, it must be learned. Existing audio tokenizers rely on quantization, clustering, or codec reconstruction, assigning tokens locally, so sequence consistency, compactness, length control, termination, and edit similarity are rarely optimized directly. We introduce PairAlign, a framework for compact audio tokenization through sequence-level self-alignment. PairAlign treats tokenization as conditional sequence generation: an encoder maps speech to a continuous condition, and an autoregressive decoder generates tokens from BOS, learning token identity, order, length, and EOS placement. Given two content-preserving views, each view's sequence is trained to be likely under the other's representation, while unrelated examples provide competing sequences. This gives a scalable surrogate for edit-distance preservation while discouraging many-to-one collapse. PairAlign starts from VQ-style tokenization and refines it with EMA-teacher targets, cross-paired teacher forcing, prefix corruption, likelihood contrast, and length control. On 3-second speech, PairAlign learns compact, non-degenerate sequences with broad vocabulary usage and strong cross-view consistency. On retrieval tests, it preserves edit-distance search while reducing archive token count by 55%. A continuous-sweep probe shows lower local overlap than a dense geometric tokenizer, but stronger length control and bounded edit trajectories under 100 ms shifts. PairAlign is a sequence-symbolic predictive learner: like JEPA-style objectives, it predicts an abstract target from another view as a learned variable-length symbolic sequence, not a continuous latent.
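Edit-distance search over token sequences, which the abstract uses as a retrieval criterion, can be sketched with the classic Levenshtein dynamic program. The token IDs and the archive below are hypothetical, and this is a generic illustration rather than the paper's retrieval pipeline.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences (insert, delete, substitute cost 1)."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of `a`
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]

def nearest(query, archive):
    """Return the key of the archived token sequence closest to `query` in edit distance."""
    return min(archive, key=lambda key: edit_distance(query, archive[key]))

# Hypothetical token sequences for three archived utterances.
archive = {"utt_a": [3, 7, 7, 2], "utt_b": [5, 5, 1, 9, 4], "utt_c": [8, 2]}
best = nearest([3, 7, 2], archive)
```

Shorter token sequences make each comparison cheaper, since the dynamic program costs time proportional to the product of the two sequence lengths.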

2605.16823 2026-06-09 cs.LG 版本更新

VQ-Atom: Semantic Discretization of Local Atomic Environments for Molecular Representation Learning

原子作为语言:VQ-Atom:用于分子表示学习的语义离散化

Takayuki Kimura

发表机构 * Atoms as Language, LLC(Atoms as Language公司)

AI总结 本文提出VQ-Atom,一种用于分子表示学习的语义离散化框架,通过将连续的原子级图表示转换为对应局部化学环境的离散标记,从而提升分子表示的学习效果。

详情
AI中文摘要

分子表示学习已成为AI驱动药物发现中的核心方法,但现有分子分词如SMILES仍主要是语法性的,无法自然对齐具有化学意义的子结构。在本文中,我们介绍了VQ-Atom,一种语义离散化框架,将连续的原子级图表示转换为对应局部化学环境的离散标记。利用图神经网络嵌入和向量量化,原子被分配到代表化学有意义的原子上下文的代码本条目中。这些离散标记定义了一种适合基于Transformer的预训练的分子语言。我们评估了VQ-Atom在蛋白质-配体相互作用预测中的表现,采用蛋白质冷分割设置且不依赖3D结构信息。实验结果表明,与传统分词方法相比,VQ-Atom在预测性能上始终有所提升,表明语义基础的离散化可以显著增强分子表示学习。我们的发现表明,分词设计本身在使化学领域有效语言建模中起着关键作用。

英文摘要

Large language models succeed by combining large-scale pretraining with meaningful discrete tokens. In molecular machine learning, SMILES is widely used as a token representation, but it is primarily a linearization format for molecular graphs rather than a semantic decomposition of chemistry. We propose VQ-Atom, a semantic tokenization framework that assigns discrete atom-level tokens based on local chemical environments via vector quantization. Unlike SMILES tokens, VQ-Atom tokens encode graph-local chemical context and are aligned with molecular structure. On protein-cold drug-target interaction prediction using the KIBA dataset, VQ-Atom substantially improves global ranking performance, achieving an AUROC of 0.79 and outperforming both SMILES-based and continuous molecular representations under an identical downstream architecture. Furthermore, VQ-Atom enables approximately 3 times faster downstream training than continuous atom-level representations by replacing per-atom continuous features with reusable discrete tokens. These results suggest that molecular tokenization is not merely a preprocessing step, but a central design choice. In particular, well-structured tokens can encode substantial chemical semantics, reducing the burden on downstream learning. VQ-Atom can be interpreted as defining a molecular language, where tokens correspond to chemically meaningful atomic environments, suggesting that token design may constitute an additional axis of machine learning research alongside architecture, objectives, and optimization.
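The vector-quantization step can be illustrated with a nearest-codebook lookup. This is a generic sketch of VQ assignment, assuming atom embeddings are already computed; the 2-D embeddings and the three-entry codebook are made up for illustration and do not reflect the paper's model.

```python
def quantize(embeddings, codebook):
    """Map each continuous embedding to the index of its nearest codebook entry (squared L2)."""
    tokens = []
    for emb in embeddings:
        dists = [sum((x - c) ** 2 for x, c in zip(emb, code)) for code in codebook]
        tokens.append(dists.index(min(dists)))
    return tokens

# Hypothetical codebook with three entries and three atom embeddings.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
atom_embeddings = [[0.9, 0.1], [0.1, 0.8], [0.05, 0.0]]

atom_tokens = quantize(atom_embeddings, codebook)
```

Once atoms are mapped to integer tokens, a downstream model can use an embedding table lookup instead of recomputing continuous per-atom features, which is one plausible reading of the speed-up the abstract reports.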

2605.24942 2026-06-09 cs.LG cs.AI 版本更新

Riemannian-Manifold Steering: Geometry-Aware Generative Autoencoders for Label-Free Steering

黎曼流形操控:用于无标签操控的几何感知生成自编码器

Narmeen Oozeer, Shivam Raval, Philip Quirke, Manikandan Ravikiran, Jeff Phillips, Shriyash Upadhyay, Amirali Abdullah

发表机构 * Martian Harvard University(哈佛大学) Thoughtworks University of Utah(犹他大学)

AI总结 提出将语言模型操控重新定义为激活空间上的黎曼测地线计算,通过基于输出空间Hellinger距离学习的编码器实现无标签、无拓扑先验的流形操控。

详情
AI中文摘要

语言模型的操控——干预其内部激活以改变下游行为——最近已从线性插值扩展到非线性方法,如角度操控和核化操控,这些方法定义了干预变换,而无需在激活空间中的路径上学习显式几何。新引入的几何感知流形方法确实学习了这样的几何,但需要带标签的类中心以及预设的循环或顺序结构。这些假设限制了流形操控的应用范围,因为现有构造需要带标签的中心和兼容的边界条件。我们将流形操控更广泛地重新定义为激活空间上的黎曼测地线计算,将线性操控和带标签样条操控恢复为特定度量选择下的测地线。该框架内一个有原则的度量是输出空间Hellinger距离拉回到激活空间;我们通过一个在小型概念-令牌模式上基于输出距离训练的学习编码器来近似该度量——无需每个提示的标签、无需拓扑先验、也无需每个任务的曲线拟合。实验上,该方法在标准四任务语言模型算术基准的所有任务中可靠地将模型驱动到目标类别,同时在较小输出空间上遵循比基线更行为自然的轨迹。因此,我们为流形操控提供了一个统一的黎曼框架,以及一个基于模式监督、无标签的实例化,该实例化无需带标签的中心或预设边界条件即可运行。

英文摘要

Steering a language model - intervening on its internal activations to change downstream behaviour - has recently expanded beyond linear interpolation to nonlinear methods such as angular and kernelized steering, which define intervention transformations without learning an explicit geometry over paths in activation space. Recently introduced geometry-aware manifold methods do learn such a geometry, but require labelled class centroids together with prescribed cyclic or sequential structure. These assumptions restrict where manifold steering can be applied, since existing constructions require labelled centroids and compatible boundary conditions. We recast manifold steering more broadly as Riemannian geodesic computation on activation space, recovering linear and labelled-spline steering as geodesics under particular choices of metric. A principled metric within this framework is the output-space Hellinger distance pulled back to activations; we approximate this with a learned encoder trained on output distances over a small concept-token schema - no per-prompt labels, no topology prior, and no per-task curve fitting. Empirically, the method reliably drives the model onto the target class across all tasks in a standard four-task language-model arithmetic benchmark, while following more behaviourally natural trajectories than baselines on smaller output spaces. We thereby provide a unified Riemannian framework for manifold steering together with a schema-supervised, label-free instantiation that operates without labelled centroids or prescribed boundary conditions.
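The output-space Hellinger distance named in the abstract has a simple closed form for discrete distributions, H(p, q) = (1/√2)·‖√p − √q‖₂. The sketch below computes it for two hypothetical next-token distributions; how the paper pulls this metric back to activation space is not shown here.

```python
import math

def hellinger(p, q):
    """Hellinger distance between two discrete probability distributions of equal length."""
    return math.sqrt(sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q)) / 2)

# Hypothetical output distributions over a four-token schema.
p = [0.7, 0.1, 0.1, 0.1]
q = [0.1, 0.7, 0.1, 0.1]

d_pq = hellinger(p, q)
d_pp = hellinger(p, p)                             # identical distributions
d_max = hellinger([1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0])  # disjoint support
```

The distance is symmetric and bounded between 0 and 1, which makes it convenient as a training target for an encoder that should reflect differences in model outputs.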

2606.01304 2026-06-09 cs.LG 版本更新

When Hard Negatives Hurt: Bridging the Generative-Discriminative Gap in Hard Negative Synthesis for Retrieval

当硬负例有害时:弥合检索中硬负例生成的生成-判别鸿沟

Zhicheng Zhang, Jiwei Tang, Kuicai Dong, Xiaopeng Li, Jieming Zhu, Jingyu Li, Qianhui Zhu, Fengyuan Lu, Wang Jiaheng, Gang Wang, Hai-Tao Zheng, Zhaocheng Du

发表机构 * Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) Huawei Technologies Co., Ltd.(华为技术有限公司) City University of Hong Kong(香港城市大学) School of Cyber Science and Technology, Sun Yat-sen University(中山大学网络空间安全学院) School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院) The Hong Kong University of Science and Technology(香港科技大学) Huawei Noah’s Ark Lab(华为诺亚方舟实验室)

AI总结 针对检索中硬负例生成存在的生成-判别鸿沟问题,提出CausalNeg方法,通过CoT引导的反事实扰动和查询视角熵最大化来提升检索性能。

Comments Accepted at KDD 2026

详情
AI中文摘要

硬负例挖掘已成为训练检索器的主流策略,但它面临内在局限性:负例受限于语料库可用性,由检索器分数而非诊断价值选择,并且随着检索器改进,假阳性污染日益严重。基于LLM的合成提供了一种原则性替代方案,其中负例不受约束、具有针对性且无假阳性风险。但我们表明,将生成的负例天真地融入对比学习通常会降低检索性能。我们识别并形式化根本原因为生成-判别鸿沟:LLM生成优化流畅、合理的文本,而对比学习要求在决策边界处进行战略性的相关性违反。我们的分析揭示了两种复合失败模式:判别无关生成,即LLM缺乏对查询信息需求的显式模型,默认生成通用或主题漂移的文本,不提供对比信号;以及源依赖捷径,即分布性伪影使模型能够根据来源而非相关性区分负例,导致梯度漂移,积极破坏优化。为弥合这一鸿沟,我们提出CausalNeg,包含两个主要模块:(1) CoT引导的反事实扰动用于数据构建:将文档满足查询的原因分解为显式信息需求,然后精确违反个别需求以构建具有可控、可解释硬度的负例。(2) 训练期间的查询视角熵最大化:将生成的负例分散到相似度谱中,最小化源身份与相似度分数之间的互信息,以抑制捷径利用。我们在https://github.com/mzhangzhicheng/CausalNeg公开代码。

英文摘要

Hard negative mining has become the dominant strategy for training retrievers, yet it faces intrinsic limitations: negatives are bounded by corpus availability, selected by retriever score rather than diagnostic value, and increasingly contaminated by false positives as the retriever improves. LLM-based synthesis offers a principled alternative, with negatives that are unconstrained, targeted, and free from false-positive risk. But we show that naively incorporating generated negatives into contrastive learning often degrades retrieval performance. We identify and formalize the root cause as a generative-discriminative gap: LLM generation optimizes for fluent, plausible text, while contrastive learning demands strategic violations of relevance at the decision boundary. Our analysis reveals two compounding failure modes: discriminative-agnostic generation, where the LLM lacks an explicit model of query information needs and defaults to generic or topic-drifted text that provides no contrastive signal; and source-dependent shortcuts, where distributional artifacts enable the model to distinguish negatives by origin rather than relevance, causing gradient drift that actively corrupts optimization. To close this gap, we propose CausalNeg, which consists of two main modules: (1) CoT-guided counterfactual perturbation for data construction, which decomposes why a document satisfies a query into explicit information requirements and then surgically violates individual requirements to construct negatives with controlled, interpretable hardness; and (2) query-view entropy maximization during training, which disperses generated negatives across the similarity spectrum, minimizing the mutual information between source identity and similarity scores to suppress shortcut exploitation. We make our code publicly available at https://github.com/mzhangzhicheng/CausalNeg.

2606.01546 2026-06-09 cs.LG 版本更新

Flexible Online Representation Learning Based on Similarity Matching

基于相似性匹配的灵活在线表示学习

Shagesh Sridharan, Yanis Bahroun, Anirvan M. Sengupta

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出一种基于相似性匹配的在线生物合理学习算法,能够学习稀疏移位不变表示,适用于聚类、流形平铺或稀疏编码。

Comments 6 pages, 3 figures. Originally accepted to IJCNN 2023 but not presented owing to visa issues

详情
AI中文摘要

稀疏高维表示有助于在无监督数据探索中发现非平凡结构。这种表示可以处理与社区检测问题相关的图中的密集连接。然而,稀疏高维表示还能做更多事情,包括流形平铺和特征学习。传统算法在计算上难以处理的完全正定矩阵空间中进行优化,或者将问题松弛到双非负矩阵空间,这些矩阵的规模随样本大小增长,使得它们对大数据集不实用。其中一些方法还施加了行和约束,例如双随机性。在流形平铺的背景下,行和约束具有平移不变性的额外优势。对输出相似性矩阵的行和约束需要非平凡的在线学习规则。针对这些需求,我们提出了一种通用的在线生物合理学习算法,能够学习稀疏移位不变表示,根据数据结构,可用于聚类、流形平铺或稀疏编码。

英文摘要

Sparse high-dimensional representations are conducive to uncovering nontrivial structures in unsupervised exploration of data. Such a representation can deal with the dense connectivity in graphs relevant to community detection problems. However, sparse high-dimensional representations are capable of doing more, including manifold tiling and feature learning. Conventional algorithms optimize in the space of computationally intractable completely positive matrices or relax the problem to the space of doubly nonnegative matrices that scale with sample size in a way rendering them impractical for large data sets. Some of these methods also impose a row sum constraint, such as double stochasticity. Row sum constraints have the added advantage of being shift-invariant, in the context of manifold tiling. Constraints on the row sum of output similarity matrices require nontrivial online learning rules. Addressing these needs, we propose a versatile online biologically plausible learning algorithm capable of learning sparse shift-invariant representations, useful for clustering, manifold tiling, or sparse coding, depending on the data structure.

2407.01718 2026-06-09 stat.ML cs.LG math.ST stat.TH 版本更新

Entropic Optimal Transport Eigenmaps for Nonlinear Alignment and Joint Embedding of High-Dimensional Datasets

熵最优传输特征映射用于高维数据集的非线性对齐与联合嵌入

Boris Landa, Yuval Kluger, Rong Ma

发表机构 * Department of Electrical and Computer Engineering, Yale University(耶鲁大学电气与计算机工程系) Department of Biostatistics, Harvard University(哈佛大学生物统计学系) Program in Applied Mathematics, Yale University(耶鲁大学应用数学项目) Interdepartmental Program in Computational Biology and Bioinformatics, Yale University(耶鲁大学计算生物学与生物信息学跨学科项目) Department of Pathology, Yale University School of Medicine(耶鲁大学医学院病理学系)

AI总结 提出熵最优传输特征映射方法,通过EOT计划矩阵的奇异向量对齐和联合嵌入两个数据集,具有理论保证,在生成模型下证明其收敛性,并在模拟和真实生物数据中展示优势。

详情
AI中文摘要

将高维数据嵌入低维空间是数据分析中不可或缺的组成部分。在许多应用中,需要对齐和联合嵌入来自不同研究或实验条件的多个数据集。这些数据集可能共享感兴趣的底层结构,但表现出个体扭曲,导致使用传统技术时嵌入不对齐。在这项工作中,我们提出了熵最优传输(EOT)特征映射,一种具有理论保证的对齐和联合嵌入一对数据集的原则性方法。我们的方法利用两个数据集之间EOT计划矩阵的前导奇异向量来提取它们共享的底层结构,并在公共嵌入空间中对齐它们。我们将我们的方法解释为经典拉普拉斯特征映射和扩散映射嵌入的数据间变体,表明它具有许多有利的类似性质。我们分析了一个生成模型,其中两个观测到的高维数据集共享支持在公共低维流形上的潜在变量,而每个数据集受到平移、几何扭曲、正交干扰结构和噪声的影响。在大样本、高维情况下,我们证明EOT计划围绕一个由扭曲的几何均值确定的有效流形上的总体核集中,对平移、正交干扰结构和噪声具有不变性。随后,我们将我们的嵌入与编码共享流形密度和几何的总体水平算子的特征函数联系起来。最后,我们通过模拟和真实生物数据的分析展示了我们的方法在数据集成和嵌入方面的性能,证明了其在挑战性场景下相对于替代方法的优势。

英文摘要

Embedding high-dimensional data into a low-dimensional space is an indispensable component of data analysis. In numerous applications, it is necessary to align and jointly embed multiple datasets from different studies or experimental conditions. Such datasets may share underlying structures of interest but exhibit individual distortions, resulting in misaligned embeddings using traditional techniques. In this work, we propose Entropic Optimal Transport (EOT) eigenmaps, a principled approach for aligning and jointly embedding a pair of datasets with theoretical guarantees. Our approach leverages the leading singular vectors of the EOT plan matrix between two datasets to extract their shared underlying structure and align them in a common embedding space. We interpret our approach as an inter-data variant of the classical Laplacian eigenmaps and diffusion maps embeddings, showing that it enjoys many favorable analogous properties. We analyze a generative model in which two observed high-dimensional datasets share latent variables supported on a common low-dimensional manifold, while each dataset is subject to translation, geometric distortion, orthogonal nuisance structure, and noise. In a large-sample, high-dimensional regime, we prove that the EOT plan concentrates around a population kernel on an effective manifold determined by the geometric mean of the distortions, with invariance to translations, orthogonal nuisance structure, and noise. Subsequently, we relate our embedding to eigenfunctions of population-level operators encoding the density and geometry of the shared manifold. Finally, we showcase the performance of our approach for data integration and embedding through simulations and analyses of real-world biological data, demonstrating its advantages over alternative methods in challenging scenarios.

2507.00260 2026-06-09 stat.ML cs.LG math.ST stat.ME stat.TH 版本更新

Disentangled Feature Importance

解耦特征重要性

Jin-Hong Du, Kathryn Roeder, Larry Wasserman

发表机构 * Department of Statistics and Actuarial Science, The University of Hong Kong, Hong Kong SAR, China(香港大学统计与精算科学系) Musketeers Foundation Institute of Data Science, The University of Hong Kong, Hong Kong SAR, China(香港大学数据科学穆斯克特基金会研究所) Department of Statistics and Data Science, Carnegie Mellon University, Pittsburgh, PA 15213, USA(卡内基梅隆大学统计与数据科学系) Computational Biology Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA(卡内基梅隆大学计算生物学系) Machine Learning Department, Carnegie Mellon University, Pittsburgh, PA 15213, USA(卡内基梅隆大学机器学习系)

AI总结 本文提出解耦特征重要性(DFI),用于解释相关测量通道中的预测信号分配,通过独立潜在表示和熵最优传输几何计算特征重要性,实现稳定且可解释的归因。

Comments 29 main and 44 supplementary pages

详情
AI中文摘要

当预测变量统计依赖时,特征重要性的适当定义取决于操作目标。条件增量措施适合于特征选择、获取和压缩,其中共享的预测信息被视为冗余。然而,对于事后解释,目标通常是将预测信号归因于相关测量通道。我们引入了解耦特征重要性(DFI),这是一种针对此设置的群体层面归因框架。DFI在指定的熵最优传输几何下将协变量映射到独立的潜在表示,计算潜在重要性,并通过巴里中心敏感度将重要性归因于原始协变量。我们证明了广泛的条件增量FI函数在平方误差损失下瞄准条件增量预测价值,因此回答了与依赖下的共享预测信号归因不同的问题。在固定传输成本、参考定律和正则化水平下,DFI定义了一个well-specified的估计量族。潜在分数具有功能ANOVA解释,并在高斯线性情况下,归因DFI恢复了相关回归器的经典R²分解。我们推导了在干扰率和光滑性条件下基于影响函数的推断,并在模拟和HIV-1中和抗性分析中展示了DFI在共享预测信号归因方面产生稳定、可解释、具有不确定性的归因。

英文摘要

When predictors are statistically dependent, the appropriate definition of feature importance depends on the operational goal. Conditional-incremental measures are well-suited for feature selection, acquisition, and compression, where shared predictive information is treated as redundancy. For post-hoc interpretation, however, the goal is often to attribute predictive signals across correlated measurement channels. We introduce Disentangled Feature Importance (DFI), a population-level attribution framework for this setting. DFI maps covariates to an independent latent representation under a specified entropic optimal transport geometry, computes latent importance, and attributes it back to the original covariates through barycentric sensitivities. We show that broad conditional-incremental FI functionals target conditional incremental predictive value under squared-error loss, and therefore answer a different question from attribution of shared predictive signal under dependence. Under fixed transport cost, reference law, and regularization level, DFI defines a well-specified family of estimands. Latent scores admit a functional ANOVA interpretation, and in the Gaussian linear case, the attributed DFI recovers the classical $R^2$ decomposition for correlated regressors. We derive influence-function-based inference under nuisance-rate and smoothness conditions, and show in simulations and an HIV-1 neutralization-resistance analysis that DFI yields stable, interpretable, uncertainty-quantified attributions of shared predictive signal.

2511.11041 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Correcting Mean Bias in Text Embeddings: A Refined Renormalization with Training-Free Improvements on MMTEB

纠正文本嵌入中的均值偏差:一种改进的重归一化方法及其在MMTEB上的无训练改进

Xingyu Ren, Youran Sun, Haoyu Liang

发表机构 * GitHub

AI总结 发现句子嵌入存在一致均值偏差,提出无训练修正方法R2(投影去除均值方向),在MMTEB上38个模型中获得一致分类提升,并分析其与PCA白化的差异。

详情
AI中文摘要

我们发现当前的句子嵌入模型输出存在一致的偏差:每个嵌入$e$可分解为$\tilde e + \mu$,其中均值$\mu$在所有句子中几乎相同。我们研究了两种无训练修正方法:直接减去$\mu$(R1),或从每个嵌入中投影掉均值方向(R2);并通过一阶误差传播论证表明,R2消除了R1保留的均值估计误差的平行分量。在Massive Multilingual Text Embedding Benchmark (MMTEB)上的38个模型中,R2取得一致的分类增益(配对$\bar t = 3.31$,38个模型中有29个$t>2$,零损失),且每个模型的均值范数$\Vert\mu\Vert$与哪些模型受益最多相关。对五个模型进行的九种方法剂量反应消融实验进一步揭示,温和的单方向去除有帮助,但完全的主成分分析(PCA)白化损害了我们测试的每个模型,并且R2与深度为一的All-but-the-Top在下游任务中相差不超过0.18个百分点,尽管$\hat\mu$与中心化的顶部主成分之间几何对齐较弱。

英文摘要

We find that current sentence-embedding models produce outputs with a consistent bias: every embedding $e$ decomposes as $\tilde e + \mu$, where the mean $\mu$ is near-identical across all sentences. We study two training-free corrections -- subtracting $\mu$ directly (R1), or projecting each embedding off the mean direction (R2) -- and show, via a first-order error-propagation argument, that R2 cancels the parallel component of mean-estimation error that R1 retains. Across 38 models on the Massive Multilingual Text Embedding Benchmark (MMTEB), R2 yields consistent classification gains (paired $\bar t = 3.31$, 29 of 38 models with $t>2$, zero losses), and the per-model mean norm $\Vert\mu\Vert$ correlates with which models benefit most. A nine-method dose-response ablation on five models further reveals that mild single-direction removal helps, but full principal component analysis (PCA) whitening hurts every model we test, and that R2 and All-but-the-Top with depth one agree within $0.18$ pp downstream despite weak geometric alignment between $\hat\mu$ and the centered top principal component.
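The two corrections described above can be sketched in a few lines. This is a minimal illustration assuming the mean is estimated as the plain average of a batch of embeddings; the helper names and toy vectors are hypothetical and not taken from the paper.

```python
import math

def mean_vector(embs):
    """Coordinate-wise average of a list of equal-length embeddings."""
    n = len(embs)
    return [sum(col) / n for col in zip(*embs)]

def r1(e, mu):
    """R1: subtract the estimated mean directly."""
    return [x - m for x, m in zip(e, mu)]

def r2(e, mu):
    """R2: project the embedding off the (unit-normalized) mean direction."""
    norm = math.sqrt(sum(m * m for m in mu))
    u = [m / norm for m in mu]
    coef = sum(x * ui for x, ui in zip(e, u))
    return [x - coef * ui for x, ui in zip(e, u)]

# Toy embeddings that share a large common offset (illustrative values)
embs = [[1.0, 2.0, 0.5], [1.2, 1.8, -0.5], [0.8, 2.2, 0.0]]
mu = mean_vector(embs)
corrected = [r2(e, mu) for e in embs]

# After R2, each corrected embedding has (numerically) zero component along mu
dots = [sum(c * m for c, m in zip(v, mu)) for v in corrected]
print(dots)
```

R2 only uses the direction of the estimated mean, so an error in its length drops out entirely, whereas R1 carries any error in the estimate straight into the corrected embedding.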

2512.07355 2026-06-09 cs.AI cs.CV cs.LG 版本更新

A Geometric Unification of Concept Learning with Concept Cones

概念学习与概念锥的几何统一

Alexandre Rocchi, Thomas Fel, Gianni Franchi

发表机构 * AMIAD Kempner Institute, Harvard University(哈佛大学凯普勒研究所)

AI总结 通过共享几何框架(概念锥)统一监督式概念瓶颈模型与无监督稀疏自编码器,提出包含关系度量评估概念对齐,并发现稀疏性与扩展因子的最佳平衡点。

Comments 33 pages

详情
AI中文摘要

两种可解释性传统并行发展但很少相互交流:概念瓶颈模型(CBM)规定概念应该是什么,而稀疏自编码器(SAE)发现哪些概念涌现。CBM使用监督将激活与人类标记的概念对齐,而SAE依赖稀疏编码来揭示涌现概念。我们证明两种范式实例化相同的几何结构:每个范式学习激活空间中的一组线性方向,其非负组合形成概念锥。因此,监督和无监督方法的不同不在于种类,而在于如何选择这个锥。基于这一观点,我们提出了两种范式之间的操作桥梁。CBM提供人类定义的参考几何,而SAE可以通过其学习的锥在多大程度上近似或包含CBM的锥来评估。这种包含框架产生了量化指标,将归纳偏差(如SAE类型、稀疏性或扩展比)与合理概念的涌现联系起来。使用这些指标,我们发现了稀疏性和扩展因子的“最佳点”,该点最大化与CBM概念的几何和语义对齐。总体而言,我们的工作通过共享的几何框架统一了监督和无监督的概念发现,提供了原则性指标来衡量SAE进展,并评估发现的概念与合理的人类概念的对齐程度。

英文摘要

Two traditions of interpretability have evolved side by side but seldom spoken to each other: Concept Bottleneck Models (CBMs), which prescribe what a concept should be, and Sparse Autoencoders (SAEs), which discover what concepts emerge. While CBMs use supervision to align activations with human-labeled concepts, SAEs rely on sparse coding to uncover emergent ones. We show that both paradigms instantiate the same geometric structure: each learns a set of linear directions in activation space whose nonnegative combinations form a concept cone. Supervised and unsupervised methods thus differ not in kind but in how they select this cone. Building on this view, we propose an operational bridge between the two paradigms. CBMs provide human-defined reference geometries, while SAEs can be evaluated by how well their learned cones approximate or contain those of CBMs. This containment framework yields quantitative metrics linking inductive biases -- such as SAE type, sparsity, or expansion ratio -- to the emergence of plausible concepts. Using these metrics, we uncover a "sweet spot" in both sparsity and expansion factor that maximizes both geometric and semantic alignment with CBM concepts. Overall, our work unifies supervised and unsupervised concept discovery through a shared geometric framework, providing principled metrics to measure SAE progress and assess how well discovered concepts align with plausible human concepts. (Footnote on "plausible": We adopt the terminology of Jacovi and Goldberg (2020), who distinguish between faithful explanations (accurately reflecting model computations) and plausible explanations (aligning with human intuition and domain knowledge). CBM concepts are plausible by construction -- selected or annotated by humans -- though not necessarily faithful to the true latent factors that organise the data manifold.)

2601.21996 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

机械论数据归因:追踪可解释LLM单元的训练起源

Jianhui Chen, Yuzhang Luo, Liangming Pan

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出机械论数据归因(MDA)框架,利用影响函数将可解释单元追溯到特定训练样本,通过因果验证表明干预高影响样本可显著调节可解释头的涌现,并发现重复结构数据作为机械催化剂,同时验证了归纳头与上下文学习之间的功能联系。

Comments ICML2026 (Oral)

详情
AI中文摘要

尽管机械论可解释性已在LLM中识别出可解释电路,但它们在训练数据中的因果起源仍然难以捉摸。我们引入了机械论数据归因(MDA),这是一个可扩展的框架,利用影响函数将可解释单元追溯到特定训练样本。通过在Pythia系列模型上的广泛实验,我们因果验证了目标干预——移除或增加一小部分高影响样本——显著调节了可解释头的涌现,而随机干预则没有效果。我们的分析表明,重复的结构化数据(例如LaTeX、XML)充当了机械催化剂。此外,我们观察到针对归纳头形成的干预会引发模型上下文学习(ICL)能力的同步变化。这为关于归纳头与ICL之间功能联系的长期假设提供了直接的因果证据。最后,我们提出了一种机械论数据增强流水线,该流水线在不同模型规模上一致地加速电路收敛,为引导LLM的发展轨迹提供了一种原则性方法。

英文摘要

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

2602.00797 2026-06-09 stat.ML cs.LG 版本更新

Zero-Flow Encoders

零流编码器

Yakun Wang, Leyang Wang, Song Liu, Taiji Suzuki

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文提出了一种基于流的表示学习框架,通过零流准则验证条件独立性,从而在生成模型中提取充分信息,并在图模型和自监督学习任务中学习近似马尔可夫毯和潜在表示。

Comments Yakun Wang and Leyang Wang contributed equally to this work; As published at ICML 2026

详情
AI中文摘要

基于流的方法在各种生成建模任务中取得了显著成功,能够捕捉复杂数据分布中的细微细节。然而,现有研究很少利用这一独特能力来解决超出生成任务的细粒度结构细节。本文提出了一种受流启发的表示学习框架。首先,我们证明:使用独立耦合训练的修正流在t=0.5时处处为零,当且仅当源分布与目标分布相同。我们称这一性质为零流准则。其次,我们展示该准则可以验证条件独立性,从而从数据中提取充分信息。第三,我们将这一准则转化为可计算且无需模拟的损失函数,从而能够学习图模型中的摊销马尔可夫毯以及自监督学习任务中的潜在表示。在模拟和真实世界数据集上的实验验证了本文方法的有效性。复现实验的代码可在https://github.com/probabilityFLOW/zfe上找到。

英文摘要

Flow-based methods have achieved significant success in various generative modeling tasks, capturing nuanced details within complex data distributions. However, few existing works have exploited this unique capability to resolve fine-grained structural details beyond generation tasks. This paper presents a flow-inspired framework for representation learning. First, we demonstrate that a rectified flow trained using independent coupling is zero everywhere at $t=0.5$ if and only if the source and target distributions are identical. We term this property the zero-flow criterion. Second, we show that this criterion can certify conditional independence, thereby extracting sufficient information from the data. Third, we translate this criterion into a tractable, simulation-free loss function that enables learning amortized Markov blankets in graphical models and latent representations in self-supervised learning tasks. Experiments on both simulated and real-world datasets demonstrate the effectiveness of our approach. The code reproducing our experiments can be found at: https://github.com/probabilityFLOW/zfe.
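The "if" direction of the zero-flow criterion follows from a short symmetry argument, sketched here under the standard rectified-flow setup (linear interpolation and a conditional-expectation velocity field). The notation is an assumption for illustration and may differ from the paper's.

```latex
% Assumed setup (our notation): X_0 ~ p_0 and X_1 ~ p_1 are independent,
% X_t = (1 - t) X_0 + t X_1, and v(x, t) = E[X_1 - X_0 | X_t = x].
\[
  v\!\left(x, \tfrac12\right)
  = \mathbb{E}\!\left[\, X_1 - X_0 \;\middle|\; \tfrac12 (X_0 + X_1) = x \,\right].
\]
% If p_0 = p_1, swapping X_0 and X_1 leaves the joint law and X_{1/2} unchanged, so
\[
  \mathbb{E}\!\left[\, X_1 - X_0 \;\middle|\; X_{1/2} = x \,\right]
  = \mathbb{E}\!\left[\, X_0 - X_1 \;\middle|\; X_{1/2} = x \,\right]
  = -\, v\!\left(x, \tfrac12\right)
  \quad\Longrightarrow\quad
  v\!\left(x, \tfrac12\right) = 0 .
\]
```

The converse (a velocity field that vanishes at $t=0.5$ forces the two distributions to coincide) is the nontrivial half of the statement and is not reproduced here.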

2604.09787 2026-06-09 astro-ph.IM astro-ph.GA cs.LG 版本更新

Learning What's Real: Disentangling Signal and Measurement Artifacts in Multi-Sensor Data, with Applications to Astrophysics

学习真实内容:在多传感器数据中分离信号和测量伪影,应用于天体物理学

Pablo Mercader-Perez, Carolina Cuesta-Lazaro, Daniel Muthukrishna, Jeroen Audenaert, V. Ashley Villar, David W. Hogg, Marc Huertas-Company, William T. Freeman

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Flatiron Institute, Simons Foundation(西蒙斯基金会熨斗研究所) Institute for Advanced Studies(高等研究院) Harvard University(哈佛大学) New York University(纽约大学) Instituto de Astrofísica de Canarias(加那利天体物理研究所)

AI总结 本文提出一种深度学习框架,通过重叠观测、双编码器架构和反事实生成目标,分离多传感器数据中的信号与伪影,提升天体物理学研究的准确性。

Comments Accepted at the 2nd Workshop on Foundation Models for Science at ICLR 2026. 10 pages, 7 figures (main text), plus appendix

详情
AI中文摘要

从物理世界收集的数据总是由多个来源组成:感兴趣的物理过程的底层信号,以及由传感器或仪器引起的、依赖于测量的伪影信号。这种二次信号作为混淆因素,限制了我们提取观测现象底层物理信息的能力。此外,它还使异构或多仪器设置中的观测组合变得复杂。我们提出了一种深度学习框架,利用重叠观测、双编码器架构和反事实生成目标来分离这些变化因素。所得的表示明确地将内在信号与传感器特定的失真和噪声分开,并可用于反事实视图生成、不受测量失真影响的参数推断以及仪器无关的相似性搜索。我们在DESI Legacy Imaging Survey(Legacy)和Hyper Suprime-Cam(HSC)Survey的天体物理星系图像上展示了该方法的有效性,作为代表性的多仪器设置。该框架提供了一种通用的科学和多模态自监督预训练方法:从相同物理系统的重叠观测中构建训练对,将传感器或模态特定的影响视为增强,通过反事实生成学习不变的表示。

英文摘要

Data collected from the physical world is always a combination of multiple sources: an underlying signal from the physical process of interest and a signal from measurement-dependent artifacts from the sensor or instrument. This secondary signal acts as a confounding factor, limiting our ability to extract information about the physics underlying the phenomena we observe. Furthermore, it complicates the combination of observations in heterogeneous or multi-instrument settings. We propose a deep learning framework that leverages overlapping observations, a dual-encoder architecture, and a counterfactual generation objective to disentangle these factors of variation. The resulting representations explicitly separate intrinsic signals from sensor-specific distortions and noise, and can be used for counterfactual view generation, parameter inference unconfounded by measurement distortions, and instrument-independent similarity search. We demonstrate the effectiveness of our approach on astrophysical galaxy images from the DESI Legacy Imaging Survey (Legacy) and the Hyper Suprime-Cam (HSC) Survey as a representative multi-instrument setting. This framework provides a general recipe for scientific and multi-modal self-supervised pretraining: construct training pairs from overlapping observations of the same physical system, treat sensor- or modality-specific effects as augmentations, and learn invariant representations through counterfactual generation.

3. 强化学习与序列决策 60 篇

2606.07557 2026-06-09 cs.LG cs.MA cs.SI 新提交

SPIN: Decentralized Swarm Control via Tensorized Policy Coordination

SPIN: 通过张量化策略协调实现去中心化集群控制

Zhaowen Fan

发表机构 * Zhaowen Fan(Fan 资深研究员)

AI总结 提出SPIN框架,利用张量网络分解联合策略,将指数复杂度降为线性,并通过离线训练的神经符号管道实现边缘设备上的低延迟去中心化集群控制。

Comments 11 pages, 2 figures, 1 table, 6 sections

详情
AI中文摘要

在资源受限的边缘平台上,去中心化多智能体集群协调仍然受到联合动作空间指数级扩展和高延迟通信开销的根本性瓶颈。本文介绍了集群策略干扰网络(SPIN)框架,这是一种通过将集群拓扑建模为压缩张量网络来绕过这些限制的架构范式。我们将局部多智能体团簇的联合策略张量分解为矩阵乘积态(MPS)链,将评估的计算复杂度从指数级 $O(n^m)$ 墙降低到严格的线性 $O(m \cdot n \cdot \chi^2)$ 约束。为了在不需高功耗在线训练循环的情况下,将局部连续空间几何与该离散代数后端桥接,我们引入了一个解耦的混合神经符号控制管道。局部多层神经网络作为结构协调编码器,离线预训练以将手工设计的几何描述符非线性映射为抽象环境目标度量。在运行时,边缘智能体通过直接应用 Radon-Nikodým 导数作为零样本重要性重加权滤波器来执行即时行为适应。我们在一个离散时间多智能体仿真沙箱中验证了该框架,涵盖跟踪、去中心化分散/区域覆盖和多目标协调等场景。定性遥测表明,集成管道实现了稳定的目标导向运动、去中心化约束下的抗塌陷空间扩展以及跨多个目标的结构化子群形成,为可处理、低功耗的边缘集群智能提供了一条数学上严谨的路径。

英文摘要

Decentralized multi-agent swarm coordination on resource-constrained edge platforms remains fundamentally bottlenecked by the exponential scaling of joint action spaces and high-latency communication overhead. This paper introduces the Swarm Policy Interference Network (SPIN) framework, an architectural paradigm that bypasses these limitations by modeling swarm topologies as a compressed tensor network. We factorize the joint policy tensors of local multi-agent cliques into Matrix Product State (MPS) chains, reducing the computational complexity of evaluation from an exponential $O(n^m)$ wall to a strictly linear $O(m \cdot n \cdot \chi^2)$ constraint. To bridge local continuous spatial geometry with this discrete algebraic backend without requiring power-intensive online training loops, we introduce a decoupled, hybrid neuro-symbolic control pipeline. Local multi-layered neural networks operate as structural coordination encoders, pre-trained offline to nonlinearly map hand-engineered geometric descriptors into abstract environmental target measures. At runtime, edge agents execute instantaneous behavioral adaptations by applying the Radon-Nikodým derivative directly as a zero-shot importance-reweighting filter. We validate the framework within a discrete-time multi-agent simulation sandbox spanning tracking, decentralized dispersion/area coverage, and multi-goal coordination regimes. Qualitative telemetry demonstrates that the integrated pipeline achieves stable target-directed motion, anti-collapse spatial spreading under decentralized constraints, and structured subgroup formation across multiple targets, providing a mathematically grounded route to tractable, low-power edge swarm intelligence.
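To make the scaling claim concrete, the sketch below stores a joint tensor over m agents with n actions each as a Matrix Product State with bond dimension chi, and evaluates a single joint-action entry by a chain of vector-matrix products. It is a generic MPS illustration with random cores, not the SPIN implementation; all function names and sizes are assumptions.

```python
import random

def random_mps(m, n, chi, seed=0):
    """Build m cores; core k maps each action index to a (left x right) matrix."""
    rng = random.Random(seed)
    cores = []
    for k in range(m):
        left = 1 if k == 0 else chi        # boundary bonds have dimension 1
        right = 1 if k == m - 1 else chi
        cores.append([[[rng.random() for _ in range(right)]
                       for _ in range(left)]
                      for _ in range(n)])
    return cores

def mps_entry(cores, actions):
    """Unnormalized weight of one joint action, via vector-matrix products."""
    vec = [1.0]  # left boundary vector of length 1
    for core, a in zip(cores, actions):
        mat = core[a]  # shape: len(vec) x right
        vec = [sum(vec[i] * mat[i][j] for i in range(len(vec)))
               for j in range(len(mat[0]))]
    return vec[0]

def num_parameters(cores):
    """Total number of stored scalars across all cores."""
    return sum(len(mat) * len(mat[0]) for core in cores for mat in core)

m, n, chi = 6, 4, 3
cores = random_mps(m, n, chi)
print(mps_entry(cores, [0, 1, 2, 3, 0, 1]))
print(num_parameters(cores), "MPS parameters vs", n ** m, "dense entries")
```

The stored parameter count is bounded by m * n * chi^2, which grows linearly in the number of agents, while the dense joint tensor has n^m entries.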

2606.07583 2026-06-09 cs.LG cs.AI 新提交

Outage Detection in Self-Healing Smart Grids Using Reinforcement Learning with Spectral Graph Neural Networks

基于频谱图神经网络强化学习的自愈智能电网故障检测

Lihui Liu, Mucun Sun, Caisheng Wang

发表机构 * Wayne State University(韦恩州立大学) University of Texas at Dallas(德克萨斯大学达拉斯分校)

AI总结 提出频谱图强化学习框架,利用频谱图神经网络学习最优恢复策略,实现配电网故障实时近最优管理,在三个IEEE测试系统上验证了泛化能力。

详情
AI中文摘要

自愈智能电网能够在故障期间快速调整其网络配置,以最小化电力中断。在故障期间,可以采取多种措施,例如通过开关操作进行网络重构和紧急甩负荷。然而,传统的用于故障缓解的机器学习方法由于响应速度慢和计算成本高,不适用于智能电网。为了解决这些挑战,最近的研究探索了使用强化学习自动执行网络重构。在这些方法中,控制策略通常使用图神经网络(GNN)建模。然而,传统的GNN在空间域中运行,可能无法捕捉频域中的重要关系。频域信息对于建模电力网络中的全局结构模式和系统范围交互特别有用。在本文中,我们提出了一种用于配电网故障管理的频谱图强化学习框架,以增强系统韧性。我们的模型使用频谱图神经网络学习最优电力恢复策略。我们在三个修改后的IEEE测试系统上评估了所提出的方法:13节点、34节点和123节点网络。实验结果表明,我们的方法在实时性上达到了接近最优的性能,并且在广泛的故障场景中具有良好的泛化能力。

英文摘要

Self-healing smart grids can quickly adjust their network configuration during outages to minimize power disruptions. During an outage, several actions can be taken, such as network reconfiguration through switching operations and emergency load shedding. However, traditional machine learning methods for outage mitigation are not well suited for smart grids due to their slow response time and high computational cost. To address these challenges, recent studies have explored reinforcement learning to automatically perform network reconfiguration. In these approaches, the control policy is typically modeled using a graph neural network (GNN). However, conventional GNNs operate in the spatial domain and may fail to capture important relationships in the frequency domain. Frequency-domain information is particularly useful for modeling global structural patterns and system-wide interactions in power networks. In this paper, we propose a spectral graph reinforcement learning framework for outage management in distribution networks to enhance system resilience. Our model learns the optimal power restoration policy using a spectral graph neural network. We evaluate the proposed method on three modified IEEE test systems: the 13-bus, 34-bus, and 123-bus networks. Experimental results show that our approach achieves near-optimal performance in real time and generalizes well across a wide range of outage scenarios.

2606.07592 2026-06-09 cs.LG 新提交

UNIQ: Conformal Calibration for Adaptive Conservatism in Offline Reinforcement Learning

UNIQ: 离线强化学习中的自适应保守性共形校准

Aditya Upadhyay

发表机构 * IIIT Delhi(德里英德拉普拉斯塔信息技术学院)

AI总结 提出UNIQ方法,通过共形预测校准不确定性,实现状态自适应的保守性惩罚,在D4RL基准上以接近IQL的内存开销提升性能。

Comments 19 pages, 2 figures, ICML 2026 Workshop on Decision-Making from Offline Datasets to Online Adaptation: Black-Box Optimization to Reinforcement Learning

详情
AI中文摘要

离线强化学习需要谨慎的保守性来缓解分布偏移,然而大多数现有方法在所有状态上统一施加固定惩罚,而不考虑局部数据覆盖。我们提出UNIQ(不确定性信息分位数),一种通过共形校准不确定性估计引入状态自适应保守性的离线RL方法。基于隐式Q学习(IQL)主干,UNIQ训练一个多期望值集成,使用分裂共形预测计算无分布不确定性估计,并将所得信号映射到状态依赖的期望值,从而在覆盖良好的区域放松保守性,在数据边界附近的不确定区域加强保守性。在D4RL MuJoCo基准上,UNIQ持续优于IQL,在Walker2d和重放密集型任务上提升最大。同时,UNIQ以接近IQL的内存成本(约250 MB峰值VRAM)运行,相比EDAC提供约10倍的减少。我们不追求整体最先进性能,而是将UNIQ定位为一种实用机制贡献,改进了离线强化学习中的性能-效率权衡。

英文摘要

Offline reinforcement learning requires careful conservatism to mitigate distribution shift, yet most existing methods apply a fixed penalty uniformly across all states regardless of local data coverage. We present UNIQ (Uncertainty-Informed Quantile), an offline RL method that introduces state-adaptive conservatism through conformally calibrated uncertainty estimation. Built on the Implicit Q-Learning (IQL) backbone, UNIQ trains a multi-expectile value ensemble, computes distribution-free uncertainty estimates using split conformal prediction, and maps the resulting signal to a state-dependent expectile that relaxes conservatism in well-covered regions while strengthening it in uncertain regions near the data frontier. On D4RL MuJoCo benchmarks, UNIQ consistently improves over IQL, with the largest gains observed on Walker2d and replay-heavy tasks. At the same time, UNIQ operates at near-IQL memory cost (approximately 250 MB peak VRAM), providing roughly a 10x reduction compared to EDAC. Rather than pursuing overall state-of-the-art performance, we position UNIQ as a practical mechanism contribution that improves the performance-efficiency trade-off in offline reinforcement learning.
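The split conformal step mentioned above can be illustrated generically. The sketch below computes a distribution-free error radius from held-out absolute residuals; it is textbook split conformal prediction, not UNIQ's code, and the calibration data and function name are hypothetical. How UNIQ maps the resulting signal to a state-dependent expectile is not shown.

```python
import math

def split_conformal_radius(cal_preds, cal_targets, alpha=0.1):
    """Return q such that |y - prediction| <= q holds with probability at least
    1 - alpha for a fresh exchangeable point (standard split conformal guarantee)."""
    scores = sorted(abs(y - p) for p, y in zip(cal_preds, cal_targets))
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # rank of the conformal quantile
    if k > n:
        return math.inf  # too few calibration points for this alpha
    return scores[k - 1]

# Hypothetical held-out value predictions and their targets
cal_preds = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0, 9.0]
cal_targets = [1.1, 1.8, 3.3, 4.0, 5.5, 5.4, 7.2, 8.9, 9.1]
q = split_conformal_radius(cal_preds, cal_targets, alpha=0.2)
print(q)
```

A larger radius signals less reliable value estimates, which is the kind of quantity a state-adaptive penalty could respond to.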

2606.07602 2026-06-09 cs.LG cs.AI 新提交

Sample-Efficient Post-Training for LEGO Spatial-Physics Reasoning

面向LEGO空间物理推理的样本高效后训练

Yuhuan Yuan, Zhouliang Yu, Minghao Liu, Weiyang Liu, Ge Lin Kan

发表机构 * HKUST(GZ)(香港科技大学(广州)) CUHK(香港中文大学) ZODA

AI总结 针对LLM生成LEGO组装时出现的物理有效但几何语义错位问题,提出基于模型的数据选择方法和样本高效强化学习PVPO,结合体素空间几何奖励,提升结构、语义对齐和物理有效性。

Comments Technical Report V1, 15 pages, 6 figures, 3 tables

详情
AI中文摘要

基于LLM的LEGO组装生成需要同时具备语义基础和物理可行性。我们发现一种数据引发的失败模式PhysHack,其中组装满足物理有效性约束,但产生的结构在几何上错位、语义上不一致或校准不良。为应对这一挑战,我们提出一种基于模型的数据选择方法,仅使用一小部分训练数据,同时改进基于物理的LEGO组装生成。基于所选轨迹,我们引入PVPO,一种样本高效的强化学习方法,将物理可行性与体素空间几何奖励相结合。我们的结果表明,仅物理有效性不足以作为可靠物理推理的代理:模型可以学习生成有效结构而不保持语义或几何保真度。跨模型主干和测试时缩放设置的实验表明,PVPO改善了结构和语义对齐、物理有效性、结构稳定性和校准,同时减少了对大量事后拒绝采样的依赖。特别是,校准结果表明,PVPO通过使测试时选择更能预测语义和结构质量来缓解PhysHack。

英文摘要

LLM-based LEGO assembly generation requires both semantic grounding and physical feasibility. We identify a data-induced failure mode, PhysHack, in which the assemblies satisfy physical-validity constraints while producing structures that are geometrically misaligned, semantically inconsistent, or poorly calibrated. To address this challenge, we propose a model-based data selection approach that uses only a small fraction of the training data while improving physically grounded LEGO assembly generation. Building on the selected trajectories, we introduce PVPO, a sample-efficient reinforcement learning method that couples physical feasibility with voxel-space geometric rewards. Our results show that physical validity alone is an insufficient proxy for reliable physical reasoning: models can learn to generate valid structures without preserving semantic or geometric fidelity. Experiments across model backbones and test-time scaling settings demonstrate that PVPO improves structural and semantic alignment, physical validity, structural stability, and calibration, while reducing reliance on extensive post-hoc rejection sampling. In particular, results on calibration show that PVPO mitigates PhysHack by making test-time selection more predictive of semantic and structural quality.

2606.07610 2026-06-09 cs.LG cs.AI cs.CL 新提交

LEAF: Growing Trees Without Branching for Speech-Aware Large Language Model Post-Training

LEAF: 无需分支的树生长方法用于语音感知大语言模型后训练

Argyrios Gerogiannis, Yekaterina Yegorova, Mark Hasegawa-Johnson, Venugopal V. Veeravalli

发表机构 * University of Illinois, Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 针对语音感知大语言模型后训练中GRPO方法粗粒度信用分配问题,提出LEAF方法,通过回溯式树结构学习、高信息量边界选择和跨度级优势分配,在语音问答和翻译任务上超越GRPO。

Comments 15 pages, 3 figures, 11 tables

详情
AI中文摘要

最先进的GRPO风格方法在语音感知大语言模型后训练中存在粗粒度信用分配问题,将相同的终端奖励优势广播给响应中的每个token。这忽略了rollout批次中的有用结构,其中语音条件下的补全通常共享前缀,然后在重要决策处出现分歧。我们提出低秩探索自适应分叉(LEAF),一种基于回溯树的强化学习方法,无需在线分支或额外解码即可恢复这种结构。LEAF采样完整响应,选择高信息量边界,按共享前缀分组响应,并使用后代奖励分配跨度级优势。我们从理论上证明了LEAF的跨度级信用分配和边界选择设计。实验上,在相同的rollout和低秩适应预算下,LEAF在语音问答和语音翻译基准上优于GRPO。值得注意的是,较小的LEAF训练模型优于当前最先进的完全参数基线。

英文摘要

State-of-the-art GRPO-style methods for speech-aware large language model post-training suffer from coarse credit assignment, broadcasting the same terminal-reward advantage to every token in a response. This ignores useful structure within rollout batches, where speech-conditioned completions often share prefixes before diverging at important decisions. We propose Low-rank Exploration with Adaptive Forking (LEAF), a retrospective tree-based RL method that recovers this structure without online branching or additional decoding. LEAF samples complete responses, selects high-surprisal boundaries, groups responses by shared prefixes, and assigns span-level advantages using descendant rewards. We theoretically justify LEAF's span-level credit assignment and boundary-selection design. Empirically, LEAF improves over GRPO across speech question answering and speech translation benchmarks under the same rollout and low-rank adaptation budget. Notably, smaller LEAF-trained models outperform current state-of-the-art, full-parameter baselines.

2606.07705 2026-06-09 cs.LG cs.AI 新提交

SAW: Stage-Aware Dynamic Weighting for Multi-Objective Reinforcement Learning in Large Language Models

SAW: 面向大语言模型多目标强化学习的阶段感知动态加权

Yuchen He, Baolong Bi, Shenghua Liu, Huaming Liao, Yuyao Ge, Bolin Wan, Siqian Tong, Juan Chen, Jiafeng Guo, Xueqi Cheng

发表机构 * Institute of Computing Technology, Chinese Academy of Sciences(中国科学院计算技术研究所) University of Electronic Science and Technology of China(电子科技大学)

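AI summary: To address asynchronous reward learning across objectives in multi-objective reinforcement learning, the paper proposes SAW, a lightweight dynamic weighting mechanism that uses the coefficient of variation to adjust each objective's contribution in real time, improving training efficiency and final performance under both GRPO and GDPO.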

Comments 17 pages, 7 figures, 5 tables

详情
AI中文摘要

尽管多目标强化学习(MORL)对于将大语言模型与复杂的人类偏好对齐至关重要,但当前普遍采用的静态加权求和忽略了一个更基本的现象:不同目标之间的奖励学习明显异步。学习良好的维度会迅速产生同质、低方差的信号,其残留噪声会污染聚合奖励(在GRPO中)或占据优势预算的固定份额(在GDPO中),从而干扰学习不足维度携带的稀缺但高价值的信号。为了解决这种异步性,我们提出了阶段感知动态加权(SAW),一种轻量级、算法无关的动态加权机制。SAW利用变异系数(CV)作为实时信息量的尺度不变代理,根据批次内各维度的相对信息量重新加权其奖励或优势贡献。与需要多次前向和反向传播的基于梯度的方法不同,SAW仅依赖于批次级统计信息,引入的计算开销几乎可以忽略不计。在工具调用和文本摘要任务上的实验表明,SAW在GRPO和GDPO框架下均能一致地提高训练效率和最终性能,证实了其作为多奖励LLM对齐的通用插件。我们的代码可在 https://github.com/Zhaolutuan/SAW 获取。

英文摘要

Although multi-objective reinforcement learning (MORL) is central to aligning large language models with complex human preferences, the prevailing practice of static weighted summation overlooks a more fundamental phenomenon: reward learning is markedly asynchronous across objectives. Well-learned dimensions quickly produce homogeneous, low-variance signals whose residual noise contaminates the aggregated reward (in GRPO) or occupies a fixed share of the advantage budget (in GDPO), interfering with the scarce yet high-value signals carried by under-learned dimensions. To address this asynchrony, we propose Stage-Aware Dynamic Weighting (SAW), a lightweight, algorithm-agnostic dynamic weighting mechanism. SAW utilizes the coefficient of variation (CV) as a scale-invariant proxy for real-time informativeness, reweighting each dimension's reward or advantage contribution by its relative informativeness within the batch. Unlike gradient-based methods that require multiple forward and backward passes, SAW relies solely on batch-level statistics, introducing nearly negligible computational overhead. Experiments on tool-calling and text summarization tasks demonstrate that SAW consistently improves both training efficiency and final performance under both GRPO and GDPO frameworks, confirming it as a general-purpose plug-in for multi-reward LLM alignment. Our code is available at https://github.com/Zhaolutuan/SAW
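
A minimal sketch of coefficient-of-variation weighting, assuming per-dimension rewards are available as one list per objective for the current batch. The normalization to weights that sum to one, the `eps` guard, and the uniform fallback are assumptions for illustration; the abstract does not specify how SAW turns CV values into weights.

```python
from statistics import mean, pstdev


def cv_weights(reward_columns, eps=1e-8):
    """Weight each reward dimension by its coefficient of variation
    (standard deviation divided by absolute mean) within the batch."""
    cvs = [pstdev(col) / (abs(mean(col)) + eps) for col in reward_columns]
    total = sum(cvs)
    if total == 0.0:
        # No dimension carries any variation: fall back to uniform weights.
        return [1.0 / len(cvs)] * len(cvs)
    return [cv / total for cv in cvs]


saturated = [1.0, 1.0, 1.0, 1.0]   # well-learned objective, no variance
learning = [0.0, 1.0, 0.0, 1.0]    # under-learned objective, still informative

weights = cv_weights([saturated, learning])
# Rescaling a reward dimension leaves its CV, and hence its weight, unchanged.
scaled = cv_weights([saturated, [10.0 * r for r in learning]])
```

Only batch-level means and standard deviations are used, which is why this kind of reweighting adds almost no compute compared with gradient-based balancing.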

2606.07910 2026-06-09 cs.LG 新提交

CAAL: Contextual Bandits based Online Hand-Craft Active Learning Strategy Selection

CAAL: 基于上下文赌博机的在线手工主动学习策略选择

Shao-An Yin, Jiacong Li, Tianpei Xie, Cecile Levasseur, Wojciech Kowalinski, Nicola Elia

发表机构 * University of Minnesota, Twin Cities(明尼苏达大学双城分校) Amazon(亚马逊)

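AI summary: Proposes the CAAL framework, which uses context information and reward prediction to dynamically select active learning strategies, and outperforms the existing baseline adaptive strategy on public datasets.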

Comments 8 pages, 5 figures, Accepted to the NYRL 2025 Workshop

详情
AI中文摘要

主动学习算法面临的挑战是未标注数据统计分布的不确定性,这使得难以选择最佳的手工策略。为了解决这个问题,我们引入了上下文自适应主动学习(CAAL)。在CAAL中,每个“臂”代表一个手工策略。与仅基于标注数据反馈选择策略的现有框架不同,我们利用外部上下文信息的奖励预测动态选择用于标注数据批次的策略。这个通用框架允许通过领域知识进行定制,以设计更有效的奖励和上下文候选。此外,我们通过实验表明,使用我们的奖励和上下文设计,CAAL在公共数据集上优于现有的基线自适应策略。无论每次迭代的批次大小如何,我们的结果都是一致的。

英文摘要

The challenge with active learning algorithms is the uncertainty of the statistical distribution of unlabeled data, which makes it difficult to choose the best hand-crafted strategy. To address this, we introduce Contextual Adaptive Active Learning (CAAL). In CAAL, each "arm" represents a hand-crafted strategy. Unlike existing frameworks that select strategies based only on feedback from labeled data, we dynamically choose strategies for labeling batches of data using reward prediction with external context information. This general framework allows for customization with domain knowledge to design more effective rewards and context candidates. In addition, we experimentally show that CAAL outperforms the existing baseline adaptive strategy on public datasets using our reward and context design. Our results are consistent regardless of the batch size in each iteration.

2606.07950 2026-06-09 cs.LG 新提交

The Easy, the Hard, and the Learnable: Confidence and Difficulty-Adaptive Policy Optimization for LLM Reasoning

简单、困难与可学习:面向LLM推理的置信度与难度自适应策略优化

Zhanke Zhou, Xiangyu Lu, Chentao Cao, Brando Miranda, Tongliang Liu, Bo Han, Sanmi Koyejo

发表机构 * TMLR Group, Department of Computer Science, Hong Kong Baptist University(香港浸会大学计算机科学系TMLR组) Stanford University(斯坦福大学) Sydney AI Centre, The University of Sydney(悉尼大学人工智能中心)

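AI summary: To address the inefficient compute allocation caused by uniform sampling and weighting in GRPO training, the paper proposes CoDaPO, which applies confidence- and difficulty-adaptive reweighting and resampling to increase discovery of learnable questions under a fixed budget, and outperforms existing RL methods on twelve benchmarks.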

Comments Published in ICML 2026

详情
AI中文摘要

具有可验证奖励的强化学习可以显著提升LLM的推理能力,然而标准的GRPO风格训练通常通过均匀采样和加权同等对待简单、困难和可学习的问题,导致计算分配效率低下。我们通过跟踪token对数概率、组归一化优势以及由此产生的token级更新权重来研究GRPO。这揭示了随着训练进行出现的三种重复动态:(1) 置信度膨胀,(2) 优势收缩,以及(3) 层次收敛。这些发现表明,每次更新的效用很大程度上取决于问题难度和模型当前的能力。受此启发,我们提出了置信度与难度自适应策略优化(CoDaPO),该方法根据展开置信度和经验难度为每个问题分配一个有界值。CoDaPO随后使用该值对策略更新进行重新加权,并在小批量内重新采样高价值的可学习问题,从而在固定计算预算下增加可学习带内的发现。在十二个基准测试中,CoDaPO在准确性上持续优于现有的RL方法。我们的代码公开在 https://github.com/tmlr-group/CoDaPO。

英文摘要

RL with verifiable rewards can substantially improve LLM reasoning, yet standard GRPO-style training often treats easy, hard, and learnable questions alike through uniform sampling and weighting, leading to inefficient compute allocation. We study GRPO by tracking token log-probabilities, group-normalized advantages, and the induced token-level update weights. This reveals three recurring dynamics as training proceeds: (1) confidence inflation, (2) advantage contraction, and (3) hierarchical convergence. These findings suggest that the utility of each update depends strongly on both question difficulty and the model's current competence. Motivated by this, we propose Confidence and Difficulty-adaptive Policy Optimization (CoDaPO), which assigns each question a bounded value from rollout confidence and empirical difficulty. CoDaPO then uses this value to reweight policy updates and resample high-value learnable questions within mini-batches, thereby increasing discovery within the learnable band under a fixed compute budget. Across twelve benchmarks, CoDaPO consistently improves accuracy over existing RL methods. Our code is publicly available at https://github.com/tmlr-group/CoDaPO.

2606.08068 2026-06-09 cs.LG 新提交

DICE: Entropy-Regularized Equilibrium Selection for Stable Multi-Agent LLM Coordination

DICE: 用于稳定多智能体LLM协调的熵正则化均衡选择

Yi Xie, Zhanke Zhou, Chentao Cao, Bo Liu, Bo Han

发表机构 * University of Arizona(亚利桑那大学) Hong Kong Baptist University(香港浸会大学)

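AI summary: Proposes the DICE framework, which addresses instability in multi-agent LLM coordination through an entropy-regularized equilibrium concept (HQRE) that admits linearly convergent updates and bounded Bayesian regret. Evaluated on eleven benchmarks, with average gains on reasoning and planning tasks of 4.3 percentage points for DICE-PC and 8.5 for DICE-FT.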

详情
AI中文摘要

多智能体大语言模型(LLM)系统通常无法可靠地超越配备最佳N采样的单个强模型。我们认为这种不稳定性的一个核心来源是病态的均衡选择:当前系统指定了智能体共享哪些信息,但没有指定应选择哪种协调约定。我们将此类系统的一类广泛形式化为折扣不完全信息马尔可夫博弈,并表明两种常见病理——竞争约定之间的振荡和跨约定漂移——均可导致不稳定的学习和线性贝叶斯遗憾。为了获得一个良定义的目标,我们引入了异质量化响应均衡(HQRE),这是一种具有智能体和状态依赖温度的熵正则化均衡概念。在单调性条件下,HQRE是唯一的,允许线性收敛的镜像更新,并产生有界的贝叶斯遗憾;相同的条件产生可 rollout 测量的稳定性诊断。我们在两种算法中实例化这一目标:DICE-PC,通过提示控制动作协调冻结模型,以及DICE-FT,执行参数高效的镜像微调。在四个领域的十一个基准测试中,DICE在准确性-成本权衡上优于强类内基线;在推理和规划任务上,DICE-PC平均提高4.3个百分点,DICE-FT提高8.5个百分点。

英文摘要

Multi-agent large language model (LLM) systems often fail to reliably outperform a single strong model equipped with best-of-N sampling. We argue that a core source of this instability is ill-posed equilibrium selection: current systems specify what information agents share, but not which coordination convention should be selected. We formalize a broad class of such systems as discounted incomplete-information Markov games and show that two common pathologies, oscillation between competing conventions and drift across them, can both induce unstable learning and linear Bayesian regret. To obtain a well-posed target, we introduce the Heterogeneous Quantal Response Equilibrium (HQRE), an entropy-regularized equilibrium concept with agent- and state-dependent temperatures. Under a monotonicity condition, HQRE is unique, admits linearly convergent mirror updates, and yields bounded Bayesian regret; the same condition yields rollout-measurable stability diagnostics. We instantiate this objective in two algorithms: DICE-PC, which coordinates frozen models through prompt-control actions, and DICE-FT, which performs parameter-efficient mirror fine-tuning. Across eleven benchmarks in four domains, DICE improves accuracy-cost trade-offs over strong within-class baselines; on reasoning and planning tasks, DICE-PC improves by 4.3 percentage points on average and DICE-FT by 8.5 points.

2606.08088 2026-06-09 cs.LG cs.CL 新提交

ConSteer-RL: Steering Reasoning Capabilities in Large Language Models via Confidence-Aware Reinforcement Learning

ConSteer-RL:通过置信度感知强化学习引导大型语言模型的推理能力

Qing Miao, Yiming Zhao, Jing Yang, Chenxi Liu, Yuehai Chen, Yuewen Liu, Shaoyi Du, Badong Chen

发表机构 * Xi'an Jiaotong University(西安交通大学) University of Science and Technology of China(中国科学技术大学)

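AI summary: Proposes the ConSteer-RL framework, which integrates token-level confidence signals derived from model log-probabilities into GRPO. A confidence-aware reward shaping mechanism penalizes overconfident errors and reinforces correct, confident reasoning, yielding average improvements of 2.3%-4.0% across model scales.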

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)近期已成为提升大型语言模型(LLMs)推理能力的关键范式,但其仍受限于稀疏的二元奖励以及对模型内部不确定性的忽视。本文提出ConSteer-RL,一个简单而有效的框架,将源自模型log概率的token级置信度信号整合到RLVR训练中。具体而言,基于组相对策略优化(GRPO)框架,我们通过将每个token的概率聚合成标量置信度分数,并融入基于意识的奖励塑造机制,构建置信度感知奖励,该机制惩罚过度自信的错误,同时强化正确且自信的推理。实验结果表明,ConSteer-RL在不同模型规模上持续优于强GRPO基线,平均提升2.3%-4.0%。

英文摘要

Reinforcement Learning from Verifiable Rewards (RLVR) has recently become a key paradigm for improving the reasoning abilities of Large Language Models (LLMs), yet it remains limited by sparse binary rewards and its ignorance of model-internal uncertainty. In this paper, we propose ConSteer-RL, a simple yet effective framework that integrates token-level confidence signals derived from model log-probabilities into RLVR training. Specifically, building upon the Group Relative Policy Optimization (GRPO) framework, we construct a confidence-aware reward by aggregating per-token probabilities into a scalar confidence score and incorporating it into an awareness-based reward shaping mechanism that penalizes overconfident errors while reinforcing correct and confident reasoning. Experimental results demonstrate that ConSteer-RL consistently outperforms strong GRPO baselines, achieving average improvements of 2.3%-4.0% across different model scales.
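
One way to picture the confidence-aware reward is sketched below. Everything specific here is an assumption for illustration: the abstract does not say how per-token probabilities are aggregated or how the shaping term is scaled, so the geometric-mean aggregation, the `alpha` coefficient, and both function names are hypothetical.

```python
import math


def sequence_confidence(token_logprobs):
    """Aggregate per-token log-probabilities into one scalar in (0, 1]:
    the geometric mean of the token probabilities."""
    return math.exp(sum(token_logprobs) / len(token_logprobs))


def shaped_reward(correct, confidence, alpha=0.5):
    """Binary verifiable reward plus a confidence term: a bonus for
    confident correct answers, a penalty for confident wrong ones."""
    if correct:
        return 1.0 + alpha * confidence
    return -alpha * confidence


conf = sequence_confidence([math.log(0.5)] * 4)  # four tokens at p = 0.5
confident_right = shaped_reward(True, 0.9)
hesitant_right = shaped_reward(True, 0.2)
confident_wrong = shaped_reward(False, 0.9)
hesitant_wrong = shaped_reward(False, 0.2)
```

Under this shaping, two responses with the same binary outcome no longer receive identical rewards, so a group-relative method such as GRPO gets a denser signal than the sparse 0/1 reward alone.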

2606.08360 2026-06-09 cs.LG cs.AI 新提交

Generative Frontier Planning for Adaptive Peer-Referral Recruitment under Covariate-Dependent Arrivals

协变量依赖到达下的自适应同伴推荐招募的生成前沿规划

Lingkai Kong, Hezi Jiang, Andrew Ma, Keyu Wang, Akseli Kangaslahti, Milind Tambe

发表机构 * Harvard University(哈佛大学)

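AI summary: To handle covariate-dependent arrivals in peer-referral recruitment, the paper proposes Generative Frontier Planning (GFP), which achieves efficient planning through a deterministic backup and marginal greedy allocation, and outperforms baselines in simulation.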

详情
AI中文摘要

同伴推荐招募系统(如受访者驱动抽样)对于研究和干预受传染病影响的隐藏人群至关重要。为了加速招募,公共卫生机构必须在多轮中自适应地分配有限的推荐资源,当前决策影响未来招募者的数量和协变量。先前的工作通过假设推荐来自同质总体的独立同分布抽样使问题可解,但忽略了驱动真实同伴推荐的同质性和共享背景。我们考虑一个更现实的模型,其中推荐容量和新推荐个体的协变量都依赖于推荐者,并通过删失计数模型和条件生成模型从数据中学习。由此产生的规划问题具有挑战性,因为每个候选分配都会导致未来招募者的不同分布。我们提出生成前沿规划(GFP),一种基于模型的规划器,用潜在协变量覆盖值替代的确定性备份替代每步蒙特卡洛采样。该替代的设计使得下一个前沿的期望值仅通过离线摊销的有限维摘要依赖于后代生成模型,并且使得每轮目标具有单调递减收益。这两个性质共同使规划易于处理:确定性备份消除了蒙特卡洛采样,递减收益结构使得边际贪心分配能够为每轮问题实现(1-1/e)近似。在根据真实受访者驱动抽样数据集校准的模拟环境中,GFP在四个折扣因子下均优于随机、强化学习和独立同分布动态规划基线。

英文摘要

Peer-referral recruitment systems such as respondent-driven sampling are critical for studying and intervening on hidden populations affected by infectious diseases. To accelerate recruitment, public health agencies must adaptively allocate limited referral resources across multiple rounds, where current decisions shape both the number and the covariates of future recruits. Prior work makes this problem tractable by assuming that referrals are drawn i.i.d. from a homogeneous population, an assumption that ignores the homophily and shared context that drive real peer recruitment. We instead consider a more realistic model in which both referral capacity and the covariates of newly referred individuals are conditioned on the referrer, learned from data with a censored count model and a conditional generative model. The resulting planning problem is challenging because each candidate allocation induces a different distribution over future recruits. We propose Generative Frontier Planning (GFP), a model-based planner that replaces per-step Monte-Carlo sampling with a deterministic backup over a latent covariate-coverage value surrogate. The surrogate is designed so that the expected value of the next frontier depends on the offspring generative model only through finite-dimensional summaries that are amortized offline, and so that the resulting per-round objective is monotone with diminishing returns. Together, these two properties make planning tractable: the deterministic backup eliminates Monte-Carlo sampling, and the diminishing-returns structure lets a marginal greedy allocation achieve a $(1-1/e)$-approximation for the per-round problem. On a simulation environment calibrated to a real respondent-driven sampling dataset, GFP outperforms random, reinforcement-learning, and i.i.d. dynamic-programming baselines across four discount factors.
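
The marginal greedy step mentioned above can be illustrated with the textbook case of a monotone objective with diminishing returns: set coverage under a cardinality budget, for which greedy selection is known to reach a $(1-1/e)$ fraction of the optimum. The coverage sets and the function name below are hypothetical stand-ins; the paper's actual objective is a learned covariate-coverage surrogate and its allocation may assign several referral units per recruit, neither of which is modeled here.

```python
def greedy_allocate(coverage, budget):
    """Repeatedly pick the candidate with the largest marginal coverage
    gain until the budget is spent or no candidate adds anything."""
    chosen, covered = [], set()
    for _ in range(budget):
        best, best_gain = None, 0
        for name, items in coverage.items():
            if name in chosen:
                continue
            gain = len(items - covered)  # marginal gain given picks so far
            if gain > best_gain:
                best, best_gain = name, gain
        if best is None:
            break
        chosen.append(best)
        covered |= coverage[best]
    return chosen, len(covered)


# Hypothetical recruits and the covariate cells each one would cover.
coverage = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6, 7}, "d": {1}}
chosen, value = greedy_allocate(coverage, budget=2)
```

With a budget of two, the first pick is `"c"` (gain 4) and the second is `"a"` (gain 3), for a covered total of 7. Note how the gain of `"b"` drops from 2 to 1 after `"c"` is chosen, which is the diminishing-returns property the guarantee relies on.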

2606.08410 2026-06-09 cs.LG cs.AI 新提交

Provably Efficient Personalized Multi-Objective Bandits with Proactive Conversational Queries

具有主动对话查询的可证明高效个性化多目标老虎机

Linfeng Cao, Ming Shi, Ness B. Shroff

发表机构 * The Ohio State University(俄亥俄州立大学) University at Buffalo(布法罗大学)

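AI summary: Proposes the MO-PQUCB algorithm, which obtains user preference signals from proactive queries and combines a Plackett-Luce model with regularized UCB to disentangle preference learning from reward exploration in multi-objective bandits, achieving improved regret scaling.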

Comments UAI 2026

详情
AI中文摘要

多目标老虎机中的个性化决策需要学习用户在不同竞争目标之间的特定权衡。由于臂的效用既取决于未知奖励又取决于未知偏好,现有方法仅从效用反馈中推断偏好,将偏好学习与奖励探索纠缠在一起。然而,在实践中,用户通常通过主动对话查询(例如,“便宜且干净的酒店”)揭示他们的优先级,但这种结构化信号未被利用。我们形式化了一个基于主动查询的框架,其中用户查询提供结构化的偏好信号。通过Plackett-Luce子集选择模型对这些信号进行建模,我们证明了由于基本的平移不变性障碍,仅查询学习是不够的。为了解决这个问题,我们引入了MO-PQUCB,一种混合算法,通过平移不变正则化和双探索UCB将基于查询的偏好锚定与老虎机反馈相结合。我们证明了主动查询加速了偏好估计,并相比先前偏好感知的MO-MAB方法实现了改进的遗憾缩放。在查询被破坏的情况下,我们进一步刻画了统计极限,并设计了一个鲁棒估计器,在破坏稀疏时实现接近最优的性能。实验验证了理论和实际收益。

英文摘要

Personalized decision-making in multi-objective bandits requires learning user-specific trade-offs among competing objectives. Since arm utility depends on both unknown rewards and unknown preferences, existing methods infer preferences only from utility feedback, entangling preference learning with reward exploration. In practice, however, users often reveal their priorities through proactive conversational queries (e.g., "cheap and clean hotel"), yet this structured signal is not leveraged. We formalize a proactive query-based framework in which user queries provide structured preference signals. Modeling these signals via a Plackett-Luce subset choice model, we show that query-only learning is insufficient due to a fundamental shift-invariance barrier. To resolve this, we introduce MO-PQUCB, a hybrid algorithm that integrates query-based preference anchoring with bandit feedback through shift-invariant regularization and dual-exploration UCB. We prove that proactive queries accelerate preference estimation and yield improved regret scaling over prior preference-aware MO-MAB methods. Under corrupted queries, we further characterize statistical limits and design a robust estimator achieving near-optimal performance when the corruption is sparse. Experiments validate both theoretical and practical gains.

2606.08533 2026-06-09 cs.LG cs.RO 新提交

Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning

通过上下文对比元强化学习的自主空中操控

Lixuan Jin, Bingxuan Lan, Xinyi Bao, Xiangyuan Xie, Chunjie Zhang, Zheng Chen, Tianshuo Liu, Ruijie Tian, Jinyu Ru, Gang Wang, Lei Yuan, Yang Yu

发表机构 * National Key Laboratory of Novel Software Technology, Nanjing University(南京大学计算机软件新技术国家重点实验室) School of Artificial Intelligence, Nanjing University(南京大学人工智能学院) Faculty of Robot Science and Engineering, Northeastern University(东北大学机器人科学与工程学院) National Key Lab of Autonomous Intelligent Unmanned Systems, Beijing Institute of Technology(北京理工大学自主智能无人系统国家重点实验室)

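AI summary: Proposes Aco2, which uses contextual contrastive meta reinforcement learning to let a quadrotor autonomously pick up, transport, and deliver diverse payloads without human intervention, and transfers directly to the real world.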

详情
AI中文摘要

无人机越来越多地部署在物流、服务机器人等实际应用中,对自主载荷获取和投递的需求日益增长。现有方法通常假设预附载荷或依赖专用夹爪,使得通用的端到端空中投递问题仍未解决,因为不同载荷会导致高度变化的飞行动力学,需要单一策略在线适应,无需手动校准或显式系统辨识。为此,我们研究了通过上下文对比元强化学习的自主空中操控(Aco2),这是一个完全自主的空中投递设置,其中配备轻型钩子的四旋翼无人机连续拾取、运输和投递各种带手柄的物体,在随机位置之间进行,全程无需人工干预。首先,我们设计了一个上下文观测编码器,从最近的交互历史中推断出紧凑的潜在上下文,使策略能够在线适应载荷相关的动力学。为了进一步提高上下文质量,我们引入了一个对比目标,该目标围绕任务相关变化结构化上下文嵌入,从而改善跨不同载荷的泛化能力,无需显式系统辨识。完全在模拟中训练,并采用广泛的域随机化,Aco2可以直接部署在物理四旋翼上,无需真实世界微调。

英文摘要

Unmanned aerial vehicles (UAVs) are increasingly being deployed in logistics, service robotics, and other real-world applications, creating a growing demand for autonomous payload acquisition and delivery. Existing approaches typically assume pre-attached payloads or rely on specialized grippers, leaving versatile end-to-end aerial delivery largely unresolved: different payloads induce highly variable flight dynamics, so a single policy must adapt online without manual calibration or explicit system identification. To this end, we study Autonomous Aerial Manipulation via Contextual Contrastive Meta Reinforcement Learning (Aco2), a fully autonomous aerial delivery setting in which a quadrotor equipped with a lightweight hook continuously picks up, transports, and delivers diverse handle-equipped objects between randomized locations, all without human intervention. First, we design a contextual observation encoder that infers a compact latent context from recent interaction history, enabling the policy to adapt online to payload-dependent dynamics. To further improve the quality of this context, we introduce a contrastive objective that structures the context embedding around task-relevant variations, improving generalization across diverse payloads without requiring explicit system identification. Trained entirely in simulation with extensive domain randomization, Aco2 can be directly deployed on a physical quadrotor without real-world fine-tuning.

2606.08602 2026-06-09 cs.LG cs.AI 新提交

Reinforcement Learning for Flow-Matching Policies with Density Transport

基于密度传输的流匹配策略强化学习

Boshu Lei, Kostas Daniilidis, Antonio Loquercio

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

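AI summary: Proposes RLDT, an online reinforcement learning algorithm that builds a transport field with Stein Variational Gradient Descent and fine-tunes a pretrained flow-matching policy to align with it, using expected-target estimation to stabilize training; it outperforms baselines on continuous-control tasks.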

详情
AI中文摘要

我们提出了一种在线强化学习(RL)算法,用于微调连续控制问题中的流匹配策略。我们的关键见解是将基于RL的策略改进视为将动作密度向高奖励区域传输,这自然与流匹配模型的传输公式一致。先前的方法要么近似当前或最优策略分布,要么采用蒸馏,这引入了有偏梯度或牺牲了多模态建模能力。相比之下,我们提出的基于密度传输的RL方法(称为RLDT)使用Stein变分梯度下降(SVGD)从最大熵RL目标构建传输场,然后微调预训练的流匹配策略以与该场对齐。使用这种对齐目标进行训练并非易事,因为流匹配策略通过多步过程生成动作,使得直接的基于梯度的优化具有挑战性。为了克服这一挑战并稳定训练,我们通过期望目标估计从中间去噪步骤近似策略动作。这使得传输场更新能够传播到网络参数中,而无需通过时间进行不稳定的反向传播。实验结果表明,RLDT在奖励质量和收敛速度方面优于竞争基线。该性能在多种连续控制任务中保持一致,包括密集和稀疏奖励,以及基于状态和视觉的长期机器人操作。项目网页为https://rpfey.github.io/rldt/。

英文摘要

We present an online reinforcement learning (RL) algorithm for fine-tuning flow-matching policies in continuous-control problems. Our key insight is to view RL-based policy improvement as a transport of action densities towards regions of high reward, which naturally aligns with the transport formulation of flow matching models. Prior methods either approximate the current or optimal policy distribution or resort to distillation, which introduces biased gradients or sacrifices multimodal modeling capacity. In contrast, our approach for RL with Density Transport, which we name RLDT, constructs a transport field from a maximum-entropy RL objective using Stein Variational Gradient Descent (SVGD). Then, it finetunes a pretrained flow matching policy to align with this field. Training with this alignment objective is nontrivial because flow-matching policies generate actions via a multi-step process, making direct gradient-based optimization challenging. To overcome this challenge and stabilize training, we approximate policy actions from intermediate denoising steps via expected-target estimation. This allows the transport-field update to propagate into the network parameters without unstable backpropagation through time. Experimental results demonstrate that RLDT outperforms competitive baselines in reward quality and convergence speed. This performance holds across diverse continuous-control tasks, encompassing both dense and sparse rewards, as well as state- and vision-based long-horizon robot manipulation. The project webpage is https://rpfey.github.io/rldt/.

2606.08671 2026-06-09 cs.LG 新提交

SkillHone: A Harness for Continual Agent Skill Evolution Through Persistent Decision History

SkillHone:基于持久决策历史的持续智能体技能演化框架

Zhiwei Li, Yong Hu

发表机构 * WeChat, Tencent Inc., China(腾讯微信,中国)

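AI summary: Proposes SkillHone, a harness that records diagnoses, revisions, and evidence in a persistent decision history to enable continual evolution of agent skills, and surpasses existing methods on open-web deep-research benchmarks.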

Comments Work in progress

详情
AI中文摘要

智能体技能通过任务特定程序、脚本和参考扩展语言模型智能体,但目标和环境不断变化。现有方法在有限运行中改进技能,仅保留最终产物,丢弃后续智能体解释先前修订、评估和拒绝替代方案所需的决策历史。我们提出SkillHone,一个基于持久决策历史的持续智能体技能演化框架。SkillHone将技能修订与提供实践反馈的评估侧证据配对,记录诊断、修订、证据和结果的结构化历史。角色分离的子智能体在带有隐去报告的实践探针上运行候选技能,并根据先前决策提出修订,实现跨会话改进而无需重新发现过去的推理。我们在原始开放网络环境中评估SkillHone的深度研究基准,其中智能体未获得集成搜索堆栈,必须通过可移植技能组织检索。我们与商业检索服务支持的深度研究智能体进行比较。以Qwen3.6-35B-A3B作为评估时骨干,生成的技能在GAIA上超过深度研究智能体15.8分,在WebWalkerQA-EN上超过3.2分,同时也超越了先前的技能演化方法。

英文摘要

Agent skills extend language-model agents with task-specific procedures, scripts, and references, but the tasks and environments they target continually change. Existing methods improve skills in bounded runs and retain only the final artifact, discarding the decision history that later agents need to interpret prior revisions, evaluations, and rejected alternatives. We introduce SkillHone, a harness for continual agent skill evolution grounded in persistent decision history. SkillHone pairs skill revisions with evaluation-side evidence that supplies practice feedback, recording structured histories of diagnoses, revisions, evidence, and outcomes. Role-separated subagents run candidate skills on practice probes with redacted reporting and propose revisions informed by prior decisions, enabling cross-session refinement without rediscovering past rationale. We evaluate SkillHone on deep-research benchmarks in a raw open-web setting, where agents are not given an integrated search stack and must organize retrieval through portable skills. We compare against a deep-research agent backed by commercial retrieval services. With Qwen3.6-35B-A3B as the evaluation-time backbone, the resulting skills outperform the deep-research agent by 15.8 points on GAIA and 3.2 points on WebWalkerQA-EN, while also exceeding prior skill-evolution methods.

2606.08977 2026-06-09 cs.LG cs.DS 新提交

Online Learning with Recency: Algorithms for Sliding-window Streaming Multi-armed Bandits

在线学习中的近因效应:滑动窗口流式多臂老虎机算法

Vladimir Braverman, Chen Wang, Liudeng Wang, Samson Zhou

发表机构 * Johns Hopkins University(约翰霍普金斯大学) Rensselaer Polytechnic Institute(伦斯勒理工学院) Texas A&M University(德克萨斯农工大学)

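AI summary: Motivated by the recency effect in online learning, the paper studies single-pass sliding-window streaming multi-armed bandits, gives algorithms for pure exploration and regret minimization, and establishes memory-regret trade-offs.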

Comments ICML 2026

详情
AI中文摘要

受在线学习中近因效应的启发,本文研究了单遍滑动窗口流式多臂老虎机(MABs)的算法。在该设置中,我们有$n$个臂,其奖励分布为未知的次高斯分布,并给定参数$W$。臂以单遍流的形式到达,只有最近的$W$个臂被视为有效。算法需要在有限内存(定义为存储的臂数)下进行纯探索和遗憾最小化。该模型是近年来广泛研究的流式多臂老虎机模型(无滑动窗口)的自然扩展。我们对该模型下的纯探索和遗憾最小化问题进行了全面分析。对于纯探索,我们证明在次线性内存下找到最佳臂是困难的,而找到近似最佳臂则存在高效算法。对于遗憾最小化,我们探索了一种新的遗憾概念,并给出了任何单遍算法的尖锐内存-遗憾权衡。我们通过实验补充了理论结果,展示了样本、遗憾和内存之间的权衡。

英文摘要

Motivated by the recency effect in online learning, we study algorithms for single-pass sliding-window streaming multi-armed bandits (MABs) in this paper. In this setting, we are given $n$ arms with unknown sub-Gaussian reward distributions and a parameter $W$. The arms arrive in a single-pass stream, and only the most recent $W$ arms are considered valid. The algorithm is required to perform pure exploration and regret minimization with limited memory, defined as the number of stored arms. The model is a natural extension of the streaming multi-armed bandits model (without the sliding window) that has been extensively studied in recent years. We provide a comprehensive analysis of both the pure exploration and regret minimization problems under this model. For pure exploration, we prove that finding the best arm is hard with sublinear memory, while finding an approximate best arm admits an efficient algorithm. For regret minimization, we explore a new notion of regret and give sharp memory-regret trade-offs for any single-pass algorithm. We complement our theoretical results with experiments, demonstrating the trade-offs between sample, regret, and memory.

2606.09115 2026-06-09 cs.LG 新提交

Counterfactual Transport Flows for Offline Conservative Trajectory Refinement

反事实传输流用于离线保守轨迹细化

Lena Krieger, Xuan Zhao, Zhuo Cao, Qin Wang, Hanno Scharr, Ira Assent

发表机构 * ETH Zürich(苏黎世联邦理工学院) University of Science and Technology of China(中国科学技术大学)

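AI summary: Proposes the counterfactual transport flows framework, which builds local preference pairs by retrieving higher-feedback trajectories and uses them for conservative trajectory refinement in offline decision-making, improving behavior with respect to historical returns on D4RL benchmarks.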

Comments accepted at RLxF @ ICML 2026

详情
AI中文摘要

离线强化学习提供了一条仅从记录数据中改进策略的路径,使用历史回报或其他可测量的结果作为世界反馈。一个关键困难是在不超出离线数据支持范围的情况下改进观察到的行为。我们提出了反事实传输流,这是一个由世界反馈引导的、用于离线决策的源条件轨迹细化框架。给定一个低反馈候选轨迹,我们通过在潜在轨迹空间中检索具有更高任务特定反馈的邻近轨迹,从离线数据中构建局部偏好对,并将其用作保守细化的弱监督。该框架学习实例特定的细化方向:在推理时,细化强度参数控制候选轨迹被传输的距离,从而在保留原始行为和施加更强改进之间实现权衡。在包括AntMaze和MuJoCo任务的D4RL基准上的实验表明,我们的方法从作为世界反馈的历史回报中改进行为,同时提供可解释的轨迹级细化路径。

英文摘要

Offline reinforcement learning (RL) offers a path to policy improvement from logged data alone, using historical returns or other measurable outcomes as world feedback. A key difficulty is improving observed behavior without extrapolating beyond what the offline data supports. We propose counterfactual transport flows, a source-conditioned trajectory refinement framework for offline decision-making guided by world feedback. Given a low-feedback candidate trajectory, we construct local preference pairs from offline data by retrieving nearby trajectories in latent trajectory space with higher task-specific feedback, and use them as weak supervision for conservative refinement. The framework learns instance-specific refinement directions: at inference time, a refinement strength parameter controls how far the candidate trajectory is transported, enabling a trade-off between preserving the original behavior and applying stronger improvement. Experiments on D4RL benchmarks, including AntMaze and MuJoCo tasks, show that our method improves behavior from historical returns as world feedback, while providing interpretable trajectory-level refinement paths.

2606.09138 2026-06-09 cs.LG cs.CL 新提交

Claw-R1: A Step-Level Data Middleware System for Agentic Reinforcement Learning

Claw-R1:面向智能体强化学习的步骤级数据中间件系统

Daoyu Wang, Mingyue Cheng, Qingchuan Li, Shuo Yu, Jie Ouyang, Qi Liu

发表机构 * State Key Laboratory of Cognitive Intelligence, University of Science and Technology of China(中国科学技术大学认知智能国家重点实验室)

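AI summary: Proposes Claw-R1, a system whose Gateway Server and Data Pool components turn agent interaction steps into structured data assets, supporting live inspection, quality-based curation, and training-batch configuration to manage the data lifecycle in agentic reinforcement learning.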

详情
AI中文摘要

智能体强化学习已成为将大语言模型从静态聊天机器人转变为交互式智能体的重要后训练范式,催生了如OpenClaw等代表性应用。现有工作主要关注策略优化算法和训练框架,但对从数据产生到训练消费的智能体-环境交互完整数据生命周期关注不足。为弥补这一差距,我们提出Claw-R1,一个面向智能体强化学习的交互式步骤级数据中间件系统。Claw-R1通过两个核心组件——网关服务器和数据池——连接异构智能体运行时与强化学习训练后端。网关服务器通过统一的LLM API入口捕获多轮交互步骤,而数据池将其组织为由提示ID、响应ID、奖励和其他元数据组成的步骤级记录。在我们的演示中,用户可以交互式检查实时轨迹,查看每一步的状态、动作和奖励,根据质量和就绪程度筛选数据,并为不同的下游强化学习算法配置训练就绪批次。总体而言,Claw-R1将智能体交互轨迹视为受管理的数据资产,而非临时运行时日志。通过此演示,我们希望鼓励社区认识到数据管理在智能体强化学习中的重要性。我们的代码可在https://github.com/AgentR1/Claw-R1获取,演示视频可在https://youtu.be/Pw47dAOw6B0找到。

英文摘要

Agentic reinforcement learning (RL) has become an important post-training paradigm for turning LLMs from static chatbots into interactive agents, giving rise to representative applications such as OpenClaw. Existing work mainly focuses on policy optimization algorithms and training frameworks, but pays less attention to the full data lifecycle of agent-environment interactions, from data production to training consumption. To bridge this gap, we present Claw-R1, an interactive step-level data middleware system for agentic RL. Claw-R1 connects heterogeneous agent runtimes with RL training backends through two core components: a Gateway Server and a Data Pool. The Gateway Server captures multi-turn interaction steps through a unified LLM API entry point, while the Data Pool organizes them into step-level records consisting of prompt IDs, response IDs, rewards and other metadata. In our demo, users can interactively inspect live trajectories, examine the state, action, and reward of each step, curate data by quality and readiness, and configure training-ready batches for different downstream RL algorithms. Overall, Claw-R1 treats agent interaction traces as managed data assets rather than temporary runtime logs. Through this demonstration, we hope to encourage the community to recognize the importance of data management in agentic RL. Our code is available at https://github.com/AgentR1/Claw-R1 and the demonstration video can be found at https://youtu.be/Pw47dAOw6B0.

2606.09191 2026-06-09 cs.LG stat.ML 新提交

Asymptotic Optimality of Thompson Sampling for Risk-Averse Bandits with Sub-Gaussian Rewards

风险厌恶型多臂赌博机中汤普森采样的渐近最优性(次高斯奖励)

Joel Q. L. Chang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

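AI summary: Proves that an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits attains the instance-dependent asymptotically optimal regret bound for any continuous risk functional, requiring only continuity, a weaker condition than those of prior parametric approaches.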

Comments 10 pages, 4 figures

详情
AI中文摘要

我们证明 $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$,一种用于风险厌恶型多臂赌博机的无锚非参数汤普森采样算法,其遗憾值在 $\log n$ 的主阶上匹配实例依赖下界,从而确立了它在具有有界密度和次高斯尾部(包括高斯臂)的分布类上对任意连续风险泛函 $\rho$(CVaR、均值-方差、夏普比率、扭曲风险度量等)的渐近最优性。该结果及其有界支撑版本仅要求 $\rho$ 的连续性:严格弱于先前参数汤普森采样结果的支配条件,也严格弱于UCB类算法的Lipschitz条件,从而在无参数奖励假设下首次为夏普比率等非Lipschitz泛函提供了实例最优保证。有界支撑情形作为具有相同证明结构的垫脚石首先被发展。关键技术贡献是一个离散化引理(有界支撑)和一个截断离散化引理(次高斯尾部),每个引理通过Dirichlet聚合性质将增长字母表的Dirichlet后验投影到固定网格上,保持所有多项式前因子在固定次数且独立于样本量,打破了先前证明中阻碍的超指数障碍。

英文摘要

We prove that $\rho\text{-}\mathrm{NPTS}_{\mathrm{SG}}$, an anchor-free nonparametric Thompson Sampling algorithm for risk-averse bandits, achieves regret matching the instance-dependent lower bound to leading order in $\log n$, establishing it as asymptotically optimal for any continuous risk functional $\rho$ (CVaR, mean-variance, Sharpe ratio, distortion risk measures, and more) on the class of distributions with bounded density and sub-Gaussian tails, including Gaussian arms. Both this result and its bounded-support counterpart require only continuity of $\rho$: strictly weaker than the dominance condition of prior parametric Thompson Sampling results, and strictly weaker than the Lipschitz condition of UCB-type algorithms, yielding the first instance-optimal guarantees for non-Lipschitz functionals such as the Sharpe ratio without parametric reward assumptions. The bounded-support case is developed first as a stepping stone sharing the same proof structure. The key technical contributions are a discretisation lemma (bounded support) and a truncated discretisation lemma (sub-Gaussian tails), each projecting the growing-alphabet Dirichlet posterior onto a fixed grid via the Dirichlet aggregation property, holding all polynomial prefactors at fixed degree independent of sample size and breaking the super-exponential barrier that blocked prior proofs.
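
For readers unfamiliar with nonparametric Thompson Sampling, the general idea can be sketched as follows: draw Dirichlet weights over an arm's observed rewards, evaluate the risk functional on the reweighted empirical distribution, and play the arm with the largest value. This is a generic illustration, not the algorithm analyzed in the paper: the flat Dirichlet(1, ..., 1) prior, the mean-variance functional with coefficient `lam`, and all function names are assumptions, and the sub-Gaussian truncation step is omitted.

```python
import random


def dirichlet_weights(n, rng):
    """Sample from Dirichlet(1, ..., 1) by normalizing Gamma(1, 1) draws."""
    draws = [rng.gammavariate(1.0, 1.0) for _ in range(n)]
    total = sum(draws)
    return [d / total for d in draws]


def mean_variance(values, weights, lam=1.0):
    """Example risk functional: weighted mean minus lam times weighted variance."""
    m = sum(w * x for w, x in zip(weights, values))
    var = sum(w * (x - m) ** 2 for w, x in zip(weights, values))
    return m - lam * var


def npts_index(rewards, rng, risk=mean_variance):
    """Risk functional of a randomly reweighted empirical distribution.
    No extra 'anchor' atom is added to the observed rewards."""
    return risk(rewards, dirichlet_weights(len(rewards), rng))


def choose_arm(history, rng):
    """Play the arm whose sampled index is largest."""
    indices = [npts_index(r, rng) for r in history]
    return max(range(len(history)), key=indices.__getitem__)


rng = random.Random(0)
history = [[2.0, 2.0, 2.0], [0.0, 4.0, 0.0, 4.0]]  # same mean, different risk
constant_index = npts_index(history[0], rng)
arm = choose_arm(history, rng)
```

An arm whose observed rewards are all equal always gets that constant as its index, whatever weights are drawn, while a high-variance arm with the same mean is penalized by the variance term. Swapping `mean_variance` for another functional (for example a CVaR or Sharpe-ratio estimate) changes the risk criterion without touching the sampling step.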

2606.09348 2026-06-09 cs.LG cs.CL 新提交

PBSD: Privileged Bayesian Self-Distillation for Long-Horizon Credit Assignment

PBSD: 特权贝叶斯自蒸馏用于长程信用分配

Yang Tian, Rui Wang, Xumeng Wen, Junjie Li, Shizhao Sun, Lei Song, Jiang Bian, Bo Zhao

发表机构 * School of AI, Shanghai Jiao Tong University(上海交通大学人工智能学院) XYZ AI Lab(XYZ AI实验室)

AI总结 提出PBSD方法,通过贝叶斯校准的自蒸馏将稀疏最终奖励转化为细粒度步骤级信用信号,解决长程智能体任务中的信用分配问题,实验表明其提升领域内外性能并促进泛化。

详情
AI中文摘要

长程智能体任务对基于结果的强化学习提出了根本性的信用分配挑战:轨迹级奖励验证最终正确性,但很少指导哪些中间推理步骤或工具交互对结果有贡献。在多轮搜索智能体中,这一困难尤为突出,因为成功轨迹可能包含误导性动作,而失败轨迹可能包含有价值的证据收集步骤。我们提出PBSD(特权贝叶斯自蒸馏),一种在稀疏最终奖励下进行细粒度信用分配的贝叶斯校准自蒸馏方法。PBSD通过验证答案的后验与先验概率比来衡量轨迹质量,并应用贝叶斯规则将这个难以估计的答案侧比率转化为标准学生模型与特权答案条件教师模型之间的易处理似然比。对该贝叶斯证据分数的自回归分解产生轮级信号,识别每个中间轮次是支持还是破坏已验证结果。因此,PBSD提供了一种原则性且优雅的重新加权方案,将稀疏结果监督转化为贝叶斯校准的轮级信用信号,同时完全兼容标准策略优化。实验表明,PBSD在领域内和领域外设置中均持续提升性能,并有效将知识从短上下文训练迁移到长上下文推理,表明其细粒度信用分配机制促进了更有效的策略学习并带来更好的泛化。

英文摘要

Long-horizon agentic tasks pose a fundamental credit assignment challenge for outcome-based reinforcement learning: trajectory-level rewards verify final correctness but provide limited guidance on which intermediate reasoning steps or tool interactions contribute to the outcome. The difficulty is especially pronounced in multi-turn search agents, where successful trajectories may contain misleading actions and failed trajectories may contain valuable evidence-gathering steps. We propose PBSD (Privileged Bayesian Self-Distillation), a Bayes-calibrated self-distillation method for fine-grained credit assignment under sparse final rewards. PBSD measures trajectory quality through the posterior-to-prior probability ratio of the verified answer and applies Bayes' rule to convert this hard-to-estimate answer-side ratio into a tractable likelihood ratio between a standard student model and a privileged answer-conditioned teacher model. Autoregressive decomposition of this Bayesian evidence score yields turn-level signals that identify whether each intermediate turn supports or undermines the verified outcome. Consequently, PBSD provides a principled and elegant reweighting scheme that transforms sparse outcome supervision into Bayes-calibrated turn-level credit signals, while remaining fully compatible with standard policy optimization. Experiments demonstrate that PBSD consistently enhances performance across both in-domain and out-of-domain settings, and effectively transfers knowledge from short-context training to long-context inference, suggesting that its fine-grained credit assignment mechanism facilitates more effective policy learning and yields improved generalization.
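
The Bayes-rule conversion described above can be written out explicitly. The notation is not from the abstract and is introduced here only for illustration: $a$ is the verified answer, $\tau = (\tau_1, \dots, \tau_K)$ are the agent's turns, $\pi_{\mathrm{stu}}$ is the standard student policy, and $\pi_{\mathrm{tea}}$ is the same model conditioned on the privileged answer. The sketch also assumes that environment observations do not depend on $a$ given the history, so their terms cancel in the ratio.

$$
\log\frac{p(a \mid \tau)}{p(a)}
= \log\frac{p(\tau \mid a)}{p(\tau)}
= \sum_{k=1}^{K} \log\frac{\pi_{\mathrm{tea}}(\tau_k \mid \tau_{<k}, a)}{\pi_{\mathrm{stu}}(\tau_k \mid \tau_{<k})}
$$

The first equality is Bayes' rule, $p(a \mid \tau) = p(\tau \mid a)\,p(a)/p(\tau)$; the second is the chain rule applied to numerator and denominator. Each summand is a turn-level score: it is positive when knowing the answer makes turn $\tau_k$ more likely (the turn supports the verified outcome) and negative when it makes the turn less likely.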

2606.09380 2026-06-09 cs.LG cs.AI cs.CL 新提交

Reasoning Arena: Trace Tournaments When Verifiable Rewards Fall Short

推理竞技场:当可验证奖励不足时的轨迹锦标赛

Han Zhou, Adam X. Yang, Laurence Aitchison, Anna Korhonen, Albert Q. Jiang

发表机构 * University of Cambridge(剑桥大学) Mistral AI

AI总结 提出推理竞技场框架,通过轨迹锦标赛将无梯度信号的非多样奖励组转化为相对奖励信号,结合Bradley-Terry模型高效整合强化学习,在数学和编码基准上平均提升7.6%,加速训练27%-41%。

Comments 9 pages, 6 figures, 2 tables (17 pages including references and appendices)

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为通过结果监督提升大语言模型推理能力的主流范式。然而,可验证奖励在组级别常常变得无信息:当给定提示的所有采样轨迹获得相同奖励时,组相对优势估计无法提供梯度信号,尽管这些轨迹在推理质量上可能差异显著。我们提出推理竞技场,一种自适应训练框架,将此类非多样奖励组路由至裁判系统而非丢弃。除了检查最终答案,推理竞技场构建轨迹锦标赛,其中推理轨迹进行两两比较以暴露组内更细粒度的偏好,将推理质量转化为丰富的相对奖励信号。为使奖励估计高效,而非穷举比较每一对,每个新轨迹与一个动态更新的先前生成轨迹小池作为锚点进行评估,以高效建立相对排名。然后我们在不完整比较图上拟合Bradley-Terry模型,实现无需二次成对比较的可扩展强化学习集成。实验结果表明,推理竞技场在竞赛数学和编码基准上平均比RLVR基线高出7.6%。通过将原本浪费的零优势样本转化为有用的梯度更新,我们的方法加速训练27%至41%,节省近50%的生成计算量,并显著提升整体推理性能。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a leading paradigm for improving the reasoning ability of large language models through outcome-based supervision. However, verifiable rewards frequently become uninformative at the group level: when all sampled traces of a given prompt receive identical rewards, group-relative advantage estimation provides no gradient signal, even though the traces may differ substantially in reasoning quality. We propose Reasoning Arena, an adaptive training framework that routes such non-diverse reward groups to a judge system instead of discarding them. Beyond examining the final answer, Reasoning Arena constructs trace tournaments, where reasoning traces are compared head-to-head to expose finer-grained preferences within the group, converting reasoning quality into rich relative reward signals. To make reward estimation efficient, rather than exhaustively comparing every pair, each new trace is evaluated against a small, dynamically updated pool of previously generated traces as anchors to efficiently establish a relative ranking. We then fit a Bradley-Terry model on the incomplete comparison graph, enabling scalable RL integration without quadratic pairwise comparisons. Empirical results demonstrate that Reasoning Arena consistently outperforms the RLVR baseline by 7.6% on average in competition mathematics and coding benchmarks. By converting otherwise wasted zero-advantage samples into useful gradient updates, our method accelerates training by 27% to 41%, saving nearly 50% of generation compute, and substantially improves overall reasoning performance.
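
Two ideas in this abstract lend themselves to a small illustration: why identical rewards give a zero group-relative advantage, and how a Bradley-Terry model can be fitted when only some pairs were compared. The sketch below is a generic textbook version (minorization-maximization updates), not the authors' implementation. The function names, the toy comparison list, and the iteration count are assumptions made for illustration.

```python
from collections import defaultdict


def group_relative_advantages(rewards):
    """Mean-centred, std-normalised advantages for one prompt's group of traces."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    if std == 0.0:
        # All traces got the same reward: no trace is preferred, no gradient signal.
        return [0.0] * n
    return [(r - mean) / std for r in rewards]


def fit_bradley_terry(n_items, comparisons, iters=500):
    """Fit Bradley-Terry strengths from (winner, loser) pairs with MM updates.

    Only pairs that were actually compared appear in the update, so the
    comparison graph may be incomplete. The toy data must give every item
    at least one win and one loss for the estimate to stay finite.
    """
    wins = [0.0] * n_items
    pair_counts = defaultdict(int)
    for winner, loser in comparisons:
        wins[winner] += 1.0
        pair_counts[(min(winner, loser), max(winner, loser))] += 1

    strengths = [1.0] * n_items
    for _ in range(iters):
        denom = [0.0] * n_items
        for (i, j), n_ij in pair_counts.items():
            share = n_ij / (strengths[i] + strengths[j])
            denom[i] += share
            denom[j] += share
        updated = [
            wins[i] / denom[i] if denom[i] > 0 else strengths[i]
            for i in range(n_items)
        ]
        total = sum(updated)
        # Strengths are only defined up to scale, so renormalise each round.
        strengths = [s * n_items / total for s in updated]
    return strengths


# Toy chain of comparisons: pairs (0,2), (0,3) and (1,3) are never compared.
toy_comparisons = [
    (0, 1), (0, 1), (1, 0),
    (1, 2), (1, 2), (2, 1),
    (2, 3), (2, 3), (3, 2),
]
toy_strengths = fit_bradley_terry(4, toy_comparisons)
```

The logarithm of each fitted strength could then serve as a relative reward within the group; how Reasoning Arena actually maps strengths to rewards is not stated in the abstract.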

2606.09668 2026-06-09 cs.LG 新提交

Algorithm for Contextual Queueing Bandits with Rate-Optimal Queue Length Regret

具有速率最优队列长度遗憾的上下文队列赌博机算法

Seoungbin Bae, Dabeen Lee

发表机构 * KAIST(韩国科学技术院) Seoul National University(首尔大学)

AI总结 针对上下文队列赌博机问题,提出三阶段算法CQB-η-2,通过仅在截止轮前进行随机探索,将队列长度遗憾从Õ(T^{-1/4})改进到Õ(T^{-1/2}),并证明该速率在最小最大意义下最优。

详情
AI中文摘要

上下文队列赌博机为在未知上下文相关服务速率下学习调度异构作业提供了框架。在随机上下文下,现有算法实现了 $\widetilde{\mathcal{O}}(T^{-1/4})$ 的队列长度遗憾,定义为学习者在时间 $T$ 的队列长度与最优队列长度之差的期望。本文将该速率改进至 $\widetilde{\mathcal{O}}(T^{-1/2})$。关键观察是随机探索仅需在精心选择的截止轮之前进行,而非整个时间范围。我们提出 CQB-$\eta$-2,一个三阶段算法:(i) 纯随机探索以构建初始估计器,(ii) $\eta$-随机探索结合 UCB 规则以在保持负漂移的同时继续学习,(iii) 探索截止后的纯 UCB。我们的证明在截止轮处分解队列长度遗憾。截止前,负漂移抑制了由次优选择引起的队列长度差异。截止后,前两个阶段提供了足够的随机探索样本,确保 UCB 决策导致的离开率差距较小。结合这两个界得到 $\widetilde{\mathcal{O}}(T^{-1/2})$ 阶的队列长度遗憾。我们进一步证明了 $\Omega(T^{-1/2})$ 阶的最小最大下界。证明构造了两个统计上不可区分的困难实例直到最终服务决策,并使用队列特定的耦合论证将由此产生的检验误差转化为队列长度遗憾。综上,我们的上下界刻画了在时间 $T$ 上的最小最大依赖关系(忽略对数因子)。

英文摘要

Contextual queueing bandits provide a framework for learning to schedule heterogeneous jobs under unknown context-dependent service rates. Under stochastic contexts, existing algorithms achieve $\widetilde{\mathcal{O}}(T^{-1/4})$ queue length regret, defined as the expected difference between the learner's and oracle's queue lengths at horizon $T$. In this paper, we improve this rate to $\widetilde{\mathcal{O}}(T^{-1/2})$. The key observation is that random exploration is needed only up to a carefully chosen cutoff round, rather than throughout the entire horizon. We propose CQB-$\eta$-2, a three-phase algorithm: (i) pure random exploration to construct an initial estimator, (ii) $\eta$-random exploration combined with a UCB rule to continue learning while maintaining negative drift, and (iii) pure UCB after the exploration cutoff. Our proof decomposes the queue length regret at the cutoff round. Before the cutoff, negative drift suppresses queue length differences caused by suboptimal choices. After the cutoff, the first two phases provide sufficient random exploration samples, ensuring that UCB decisions incur small departure-rate gaps. Combining these two bounds yields queue length regret of order $\widetilde{\mathcal{O}}(T^{-1/2})$. We further prove a minimax lower bound of order $\Omega(T^{-1/2})$. The proof constructs two hard instances that are statistically indistinguishable up to the final service decision, and uses a queue-specific coupling argument to convert the resulting testing error into queue length regret. Together, our upper and lower bounds characterize the minimax dependence on the horizon $T$ up to logarithmic factors.
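
The three-phase structure described above can be sketched as a per-round mode selector. This is only a schematic reading of the abstract: the parameter names `warmup_end` and `cutoff` are hypothetical, the abstract does not say how the cutoff round is chosen, and the UCB rule itself is left out.

```python
import random


def exploration_mode(t, warmup_end, cutoff, eta, rng):
    """Return "random" or "ucb" for round t (1-indexed).

    warmup_end and cutoff are hypothetical phase boundaries with
    warmup_end <= cutoff; eta is the random-exploration probability
    used in the middle phase.
    """
    if t <= warmup_end:
        return "random"  # phase (i): pure random exploration
    if t <= cutoff:
        # phase (ii): explore at random with probability eta, otherwise use UCB
        return "random" if rng.random() < eta else "ucb"
    return "ucb"  # phase (iii): pure UCB after the exploration cutoff


rng = random.Random(0)
schedule = [exploration_mode(t, 5, 50, 0.2, rng) for t in range(1, 101)]
```

In this sketch no round after `cutoff` is spent on random exploration, which is the property the abstract credits for the improved rate.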

2606.09802 2026-06-09 cs.LG cs.AI stat.ML 新提交

Bandits for Efficient Experimentation: Adapting to Control Group, Preferences, and Context Drifts

高效实验的Bandits:适应控制组、偏好和上下文漂移

Udvas Das, Waris Radji, Debabrota Basu, Odalric-Ambrym Maillard

发表机构 * Univ. Lille, Inria, CNRS, Centrale Lille, UMR 9189 – CRIStAL(里尔大学、法国国家信息与自动化研究所、法国国家科学研究中心、里尔中央理工学院、UMR 9189 – CRIStAL)

AI总结 针对用户偏好和上下文分布随时间漂移的线性上下文随机多臂赌博机问题,提出Dri-MED算法,通过异方差回归处理非平稳噪声,实现实例相关的遗憾界和约束违规界。

详情
AI中文摘要

我们考虑线性上下文随机多臂赌博机的一个变体,其中学习器必须向一组用户提供推荐,每个用户有其个性化的偏好向量,并且上下文分布随时间漂移。在实践者友好的假设下,我们将此设置简化为具有平稳均值但异方差和非平稳噪声的线性赌博机。我们进一步研究了这样一种情形:学习器必须确保在每个决策步骤,其决策的平均奖励都超过基线策略$\boldsymbol{\pi}_0$的平均奖励。我们引入了Dri-MED,一种受MED策略线性版本启发并仔细调整以处理非平稳异方差噪声的算法。我们表明,实例相关的遗憾界为$\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2\log(T)\right)$,其中$\tilde{\Delta}$是受策略$\pi_0$约束的次优性间隙,方差感知乘性项$\kappa$通过异方差回归仔细处理。我们进一步表明Dri-MED享有$\tilde{\mathcal{O}}(d)$的期望约束违规。我们的数值结果表明,Dri-MED显著优于忽略漂移和偏好结构的保守基线。

英文摘要

We consider a variant of the linear contextual stochastic multi-armed bandit, where the learner must provide recommendations to a group of users, each with a personalized preference vector, in the presence of context distributions that drift over time. Under practitioner-friendly assumptions, we reduce this setting to a linear bandit with stationary mean but heteroskedastic and non-stationary noise. We further study the case where the learner must ensure that the mean reward of each decision exceeds that of a baseline strategy $\boldsymbol{\pi}_0$ at each decision step. We introduce Dri-MED, an algorithm inspired by the linear version of the MED strategy and carefully adapted to handle the non-stationary heteroskedastic noise. We show that the instance-dependent regret scales as $\tilde{\mathcal O}\left(\frac{\kappa}{\tilde{\Delta}}d^2\log(T)\right)$, where $\tilde{\Delta}$ is the constraint-aware sub-optimality gap subject to policy $\pi_0$, with a variance-aware multiplicative term $\kappa$ that we carefully handle using heteroskedastic regression. We further show that Dri-MED enjoys $\tilde{\mathcal{O}}(d)$ expected constraint violations. Our numerical results suggest that Dri-MED significantly outperforms conservative baselines that ignore the drift and preference structure.

2606.09821 2026-06-09 cs.LG 新提交

Rethinking the Divergence Regularization in LLM RL

重新思考LLM强化学习中的散度正则化

Jiarui Yao, Xiangxin Zhou, Penghui Qi, Wee Sun Lee, Liefeng Bo, Tianyu Pang

发表机构 * Tencent Hunyuan(腾讯混元) UIUC(伊利诺伊大学厄巴纳-香槟分校) NUS(新加坡国立大学)

AI总结 针对PPO等方法的硬裁剪或硬掩码在长尾词汇中分布偏移代理不佳的问题,提出DRPO,用平滑的优势加权二次正则化替代硬掩码,保持信任区域几何的同时提供连续梯度权重,提升训练稳定性和效率。

详情
AI中文摘要

强化学习已成为后训练大型语言模型的关键组成部分。在实践中,由于训练-推理不匹配和策略陈旧,LLM RL通常是离策略的,因此信任区域控制对于稳定优化至关重要。PPO和GRPO等主流方法通过比率裁剪机制近似这种控制,但在长尾词汇中,重要性比率可能成为分布偏移的糟糕代理。最近的工作如DPPO通过用基于散度的掩码替换基于比率的裁剪来解决这种不匹配,从而产生由采样令牌的绝对概率偏移定义的信任区域。然而,DPPO仍然依赖于硬掩码:一旦令牌以有害方向越过信任区域边界,其梯度就会被丢弃而不是纠正。为了解决这个问题,我们提出了散度正则化策略优化(DRPO),它用策略偏移上的平滑优势加权二次正则化器替换硬掩码。DRPO保留了与DPPO相同的信任区域几何,同时引入了有界、连续的梯度权重,这些权重衰减发散更新并在边界之外提供纠正信号。跨模型规模、架构和精度设置的实验表明,DRPO提高了LLM RL训练的稳定性和效率。

英文摘要

Reinforcement learning (RL) has become a key component of post-training large language models (LLMs). In practice, LLM RL is often off-policy because of training-inference mismatch and policy staleness, making trust-region control essential for stable optimization. Mainstream methods such as PPO and GRPO approximate this control with a ratio-clipping mechanism, but the importance ratio can be a poor proxy for distributional shift in long-tailed vocabularies. Recent work such as DPPO addresses this mismatch by replacing ratio-based clipping with a divergence-based mask, yielding a trust region defined by the sampled token's absolute probability shift. However, DPPO still relies on a hard mask: once a token crosses the trust-region boundary in a harmful direction, its gradient is discarded rather than corrected. To address this, we propose Divergence Regularized Policy Optimization (DRPO), which replaces the hard mask with a smooth advantage-weighted quadratic regularizer on policy shift. DRPO preserves the same trust-region geometry as DPPO while inducing bounded, continuous gradient weights that attenuate diverging updates and provide corrective signals beyond the boundary. Experiments across model scales, architectures, and precision settings show that DRPO improves the stability and efficiency of LLM RL training.

2606.09825 2026-06-09 cs.LG cs.AI cs.SY eess.SY math.OC 新提交

An Agency-Transferring Model-Free Policy Enhancement Technique

一种代理权转移的无模型策略增强技术

Anton Bolychev, Georgiy Malaniya, Sinan Ibrahim, Pavel Osinenko

发表机构 * Center for Engineering Systems and Sciences(工程系统与科学中心) Central University(中央大学) Sirius University of Science and Technology(天狼星科技大学)

AI总结 提出一种将次优基线策略嵌入强化学习训练的方法,通过逐步从基线策略向可学习策略转移代理权,提升训练效率并最终获得超越基线的独立策略。

详情
AI中文摘要

从头开始训练强化学习(RL)策略成本高昂:需要仔细设计奖励和环境、大量调参以及大量计算。然而,许多控制问题已经有一个功能正常但次优的基线策略可用。本文提出一种方法,将这样的基线策略嵌入RL训练过程,同时提高相对于从头开始方法的训练效率,并产生一个优于基线的学习策略。在每个步骤中,该方法在基线策略和可训练的学习策略之间进行仲裁,最初强烈依赖基线策略,然后逐步将代理权转移给学习策略。训练结束时,学习策略是一个无需基线策略支持的独立神经网络。本文形式化了基线策略“功能正常”的含义:在该策略下,智能体以高概率到达目标集并停留在那里。所提出的仲裁机制旨在训练过程中利用这一特性,从训练开始就产生高目标到达率。理论分析在给定假设下提供了这种行为的形式化解释,并将其扩展到最终无基线场景,其中推导了独立学习策略目标到达概率的显式下界。在连续控制基准上的实验结果表明,所提出的方法实现了与竞争方法相当或更高的回报,同时在训练过程中(包括最终阶段,学习策略无需任何基线支持)保持了最高的目标到达率。

英文摘要

Training reinforcement learning (RL) policies from scratch is costly: it requires careful reward and environment design, extensive tuning, and substantial computation. Yet many control problems already have a functional but suboptimal policy available as a baseline. This paper proposes a method for embedding such a baseline into the RL training process, simultaneously improving training efficiency relative to from-scratch methods and producing a learning policy that outperforms the baseline. At each step, the method arbitrates between the baseline policy and a trainable learning policy, initially relying strongly on the baseline policy and then progressively transferring agency to the learning policy. By the end of training, the learning policy is a standalone neural network that operates without baseline policy support. The paper formalizes what it means for the baseline policy to be functional: under this policy, the agent reaches a goal set and remains there with high probability. The proposed arbitration mechanism is designed to exploit this property during training, yielding high goal-reaching rates right from the beginning of training. A theoretical analysis provides a formal interpretation of this behavior under stated assumptions and extends it to the final baseline-free regime, where explicit lower bounds are derived for the goal-reaching probability of the standalone learning policy. Empirical results on continuous-control benchmarks show that the proposed method achieves returns that match or exceed those of competitive approaches, while maintaining the highest goal-reaching rates throughout training among the compared methods -- including in the final stage, where the learning policy operates without any baseline support.

2606.07845 2026-06-09 cs.MA cs.LG 交叉投稿

GRPO Does Not Close the Multi-Agent Coordination Gap

GRPO 并未缩小多智能体协调差距

Najmul Hasan, Prashanth BusiReddyGari

发表机构 * Department of Mathematics and Computer Science, University of North Carolina at Pembroke(北卡罗来纳大学彭布罗克分校数学与计算机科学系)

AI总结 通过哲学家就餐问题测试大语言模型的多智能体协调能力,发现GRPO训练无法显著提升性能,瓶颈在于训练方法而非计算量。

Comments 15 pages, 15 figures

详情
AI中文摘要

我们使用哲学家就餐问题作为干净的测试平台,衡量当前大型语言模型作为共享公共资源的多个智能体进行协调的能力。在涵盖七个模型和三种哲学家数量的630个回合中,四个前沿闭源系统的平均奖励达到0.45至0.87,Mistral-Small 24B达到0.83至0.99,而Qwen3-14B仅为0.13至0.35。然后我们询问,基于任务自身展开的群体相对策略优化(GRPO)能否缩小差距,结果发现不能:对五个哲学家场景的每回合奖励进行Welch t检验,p=0.66,Hedges' g=-0.11,在十个或十五个哲学家场景下也没有统计显著变化。两个进一步的观察限定了这一结果。8B和14B运行中的训练奖励在第九步达到峰值后下降,因此默认在第15步保存的检查点严格劣于之前的几个检查点。我们使用的四项奖励在零动作时存在退化最大值,DeepSeek-R1-Distill-Qwen-7B和Mistral-Small 24B在五个哲学家场景下都处于该状态,零餐时的平均奖励分别为1.0和0.83。对于开放权重的14B模型,多智能体协调的瓶颈不是训练计算量,而是训练方法:不会坍缩到无动作最大值的奖励塑造、不依赖最后一步的检查点纪律,以及跨问题规模的课程学习。

英文摘要

We measure how well current large language models coordinate as multiple agents sharing a common resource, using the dining philosophers problem as a clean test bed. Across 630 episodes spanning seven models and three philosopher counts, four frontier closed-source systems reach mean reward 0.45 to 0.87 and Mistral-Small 24B reaches 0.83 to 0.99, while Qwen3-14B reaches 0.13 to 0.35. We then ask whether group relative policy optimization (GRPO) on rollouts from the task itself can close the gap and find that it cannot: a Welch's t-test on per-episode reward at five philosophers gives p = 0.66 and a Hedges' g of -0.11, with no statistically significant change at ten or fifteen philosophers either. Two further observations qualify the result. The training reward of both 8B and 14B runs peaked at step nine and then declined, so the default saved checkpoint at step 15 is strictly worse than several earlier ones. The four-term reward we use admits a degenerate maximum at zero actions, which DeepSeek-R1-Distill-Qwen-7B and Mistral-Small 24B at five philosophers both inhabit, with mean reward 1.0 and 0.83 respectively at zero meals. The bottleneck for an open-weight 14B model on multi-agent coordination is not training compute but training methodology: reward shaping that does not collapse to a no-action maximum, checkpoint discipline that does not depend on the final step, and curriculum across problem scales.
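
The two statistics reported in this abstract, Welch's t-test and Hedges' g, are standard and easy to compute from per-episode rewards. The sketch below uses only the Python standard library and returns the t statistic and the Welch-Satterthwaite degrees of freedom (turning these into a p-value needs a t-distribution CDF, which is omitted here). The sample values are toy numbers chosen for easy checking, not data from the paper.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(a, b):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    va = variance(a) / len(a)  # squared standard error of each sample mean
    vb = variance(b) / len(b)
    t = (mean(a) - mean(b)) / sqrt(va + vb)
    df = (va + vb) ** 2 / (va ** 2 / (len(a) - 1) + vb ** 2 / (len(b) - 1))
    return t, df


def hedges_g(a, b):
    """Cohen's d with the usual small-sample correction factor."""
    na, nb = len(a), len(b)
    pooled_sd = sqrt(
        ((na - 1) * variance(a) + (nb - 1) * variance(b)) / (na + nb - 2)
    )
    d = (mean(a) - mean(b)) / pooled_sd
    correction = 1 - 3 / (4 * (na + nb) - 9)
    return d * correction


# Toy per-episode rewards for a trained and a base model (illustrative only).
trained = [1.0, 2.0, 3.0, 4.0, 5.0]
base = [2.0, 3.0, 4.0, 5.0, 6.0]
t_stat, dof = welch_t(trained, base)
effect = hedges_g(trained, base)
```

A negative g, as in the abstract's reported value of -0.11, means the first group (here the trained model) has the lower mean reward.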

2606.08032 2026-06-09 stat.ML cs.LG 交叉投稿

Variational Proximal Policy Optimization

变分近端策略优化

Ousmane Amadou Dia

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出变分近端策略优化(VP₂O),利用粒子变分推理和专家混合架构,通过几何近端控制机制解决强化学习中的策略模式崩溃和分布漂移问题,在复杂推理任务上取得显著提升。

详情
AI中文摘要

通过近端策略优化进行的人类反馈强化学习经常遭受策略模式崩溃、脆弱的探索循环和分布漂移。本文引入了变分近端策略优化(\(\textsc{VP}_2\textsc{O}\)),这是一种基于粒子的变分推理框架,将策略优化映射到专家混合架构中的Stein变分梯度下降。通过利用局部化专家原型上的函数核以及专家正交化损失,\(\textsc{VP}_2\textsc{O}\)引入了一种基于几何的近端控制机制,可以减少对固定裁剪或KL计划的依赖。我们在33B/4B稀疏专家混合模型上的结果显示,在复杂推理基准测试中取得了多项改进,在Codeforces上建立了\(+\mathbf{179}\) ELO增益,并在AIME数学推理任务上减少了\(\mathbf{32\%}\)的令牌数量。

英文摘要

Reinforcement Learning from Human Feedback via Proximal Policy Optimization often suffers from policy mode collapse, brittle exploration loops, and distribution drift. This paper introduces Variational Proximal Policy Optimization (\(\textsc{VP}_2\textsc{O}\)), a particle-based variational inference framework that maps policy optimization to Stein Variational Gradient Descent within a Mixture-of-Experts architecture. By leveraging functional kernels over localized expert prototypes alongside an expert orthogonalization loss, \(\textsc{VP}_2\textsc{O}\) introduces a geometry-based proximal-control mechanism that can reduce reliance on fixed clipping or KL schedules. Our results on a 33B/4B sparse Mixture-of-Experts model show several improvements across complex reasoning benchmarks, establishing a \(+\mathbf{179}\) ELO gain on Codeforces and a \(\mathbf{32\%}\) reduction in token count on AIME mathematical reasoning tasks.
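
For readers unfamiliar with Stein Variational Gradient Descent, the update it applies to a set of particles can be shown on a toy problem. The sketch below is plain one-dimensional SVGD with an RBF kernel and a standard normal target. It is not VP₂O: the Mixture-of-Experts mapping, the functional kernels, and the orthogonalization loss from the abstract are not modelled, and the step size, bandwidth, and starting points are arbitrary choices.

```python
import math


def svgd_step(particles, grad_log_p, step=0.1, bandwidth=1.0):
    """One SVGD update for scalar particles with an RBF kernel."""
    n = len(particles)
    updated = []
    for xi in particles:
        phi = 0.0
        for xj in particles:
            k = math.exp(-((xj - xi) ** 2) / (2 * bandwidth ** 2))
            attract = k * grad_log_p(xj)            # pulls toward high density
            repel = k * (xi - xj) / bandwidth ** 2  # kernel gradient w.r.t. xj
            phi += attract + repel
        updated.append(xi + step * phi / n)
    return updated


def standard_normal_score(x):
    """Gradient of log density for N(0, 1)."""
    return -x


particles = [3.0, 3.5, 4.0, 4.5, 5.0]
for _ in range(500):
    particles = svgd_step(particles, standard_normal_score)
```

The repulsive term is what keeps the particles from collapsing onto a single mode, which is the property the abstract connects to avoiding policy mode collapse.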

2606.08249 2026-06-09 cs.RO cs.LG 交叉投稿

Disturbance-Aware Aerial Robotics for Ethical Wildlife Monitoring

面向道德野生动物监测的扰动感知空中机器人

Mahmut Osmanovic, Isac Paulsson, Teddy Lazebnik

发表机构 * Department of Computing, Jonkoping University(延雪平大学计算机系) Department of Information Systems, University of Haifa(海法大学信息系统系)

AI总结 提出一种基于强化学习的扰动感知框架,用于异构空中机器人编队自主追踪野生动物,同时最小化行为干扰,在三种动物和四种行为模型上超越规则基线。

详情
AI中文摘要

可靠的野生动物监测对生态学和保护至关重要,然而许多现有方法,如标记、捕捉和近距离观察,可能会改变它们旨在测量的行为。空中机器人提供了一种可扩展的替代方案,在多项研究中显示出有前景的性能。尽管如此,现有方法通常缺乏行为感知,依赖固定启发式规则,或需要昂贵、不切实际且伦理上难以获取的真实世界训练数据。因此,目前尚无通用的自适应无人机监测框架,既能保持生态有效性,又能跨物种、行为和机器人平台扩展。在本研究中,我们引入了一种基于扰动感知强化学习的异构空中机器人编队框架,能够自主追踪野生动物,同时明确最小化行为干扰。我们将动物学模拟环境与基于真实轨迹统计拟合的动物运动模型相结合,并使用一种捕捉观测质量与扰动风险之间权衡的奖励公式来训练控制策略。在三种具有不同生态和运动模式的物种(鸽子、豺和距翅麦鸡)以及四种在自然界中常见的日益策略性的行为模型上,学习到的策略持续超越当前使用的基于规则的基线,并泛化到不同的监测任务、动物动态和无人机类型。这些结果确立了扰动感知学习作为非侵入式自主野生动物观测的可行基础,为生态学和保护中可扩展、道德负责且科学可靠的机器人监测开辟了道路。

英文摘要

Reliable wildlife monitoring is essential for ecology and conservation, yet many existing methods, such as tagging, capture, and close-range observation, can alter the very behaviors they aim to measure. Aerial robots offer a scalable alternative, which has shown promising performance in multiple studies. Nonetheless, existing approaches typically lack behavioral awareness, rely on fixed heuristics, or require real-world training data that are costly, impractical, and ethically difficult to obtain. As a result, there remains no general framework for adaptive drone-based monitoring that can both preserve ecological validity and scale across species, behaviors, and robotic platforms. In this study, we introduce a disturbance-aware reinforcement-learning-based framework for heterogeneous aerial robotic fleets that enables autonomous wildlife tracking while explicitly minimizing behavioral disruption. We couple a zoologically grounded simulation environment with fitted animal movement models derived from real trajectory statistics, and train control policies using a reward formulation that captures the trade-off between observation quality and disturbance risk. Across three species (pigeon, jackal, and spur-winged lapwing) with distinct ecologies and motion patterns and four increasingly strategic behavior models common in nature, the learned policies consistently surpassed currently used rule-based baselines and generalized across monitoring tasks, animal dynamics, and drone types. These results establish disturbance-aware learning as a viable foundation for non-invasive autonomous wildlife observation, opening a path towards scalable, ethically responsible, and scientifically reliable robotic monitoring in ecology and conservation.

2606.08253 2026-06-09 cs.RO cs.LG 交叉投稿

Mind Your Steps: A General Learning Framework for Accurate Humanoid Foothold Tracking

注意你的步伐:一种用于精确人形机器人落脚点跟踪的通用学习框架

Alessandro Montenegro, Shihao Li, Puze Liu, Alberto Maria Metelli, Jan Peters

发表机构 * Politecnico di Milano(米兰理工大学) TU Darmstadt(达姆施塔特工业大学) Max Planck Institute for Intelligent Systems(马克斯·普朗克智能系统研究所) Italian Institute of Technology(意大利技术研究院) University of Pisa(比萨大学)

AI总结 提出一种轻量级通用3D落脚点跟踪策略学习框架,通过目标采样器动态提供步态支持,结合新目标表示克服真实世界噪声,实现与多种高层规划器无缝集成的精确自然运动。

Comments Accepted to RSS 2026

详情
AI中文摘要

使人形机器人在复杂动态环境中运行仍然是一个关键挑战,其根本受限于稳健、安全且精确导航的能力。虽然基于速度指令策略的强化学习在人形机器人运动方面取得了显著的鲁棒性,但这种方法缺乏对落脚点位置的显式控制,导致不安全行为(如踩到人脚)或不精确导航,阻碍后续操作任务。相反,显式落脚点跟踪策略通过直接以目标足部姿态作为指令提供了一种有前景的替代方案。然而,现有方法通常受限于不切实际的状态假设(影响实际部署),或者作为分阶段流程的一部分而受限于特定下游任务。在这项工作中,我们引入了一种新颖的轻量级框架,用于训练通用的3D落脚点跟踪策略。通过目标采样器动态提供步态支持,该方法使学习到的策略对特定地形不敏感。我们的新目标表示有效缓解了现实世界中出现的挑战,例如噪声和不准确的姿态估计以及足部接触估计。为直接迁移到现实世界而设计,我们的策略作为一个独立的低级控制器,可以与各种高级落脚点生成器无缝配对。通过在仿真和现实世界中的大量实验,我们证明了框架的有效性。通过将我们的策略与不同的上游规划器耦合,我们在具有挑战性的环境中实现了自然且精确的运动,为复杂环境中的运动-操作任务铺平了道路。

英文摘要

Enabling humanoid robots to operate in complex, dynamic environments remains a critical challenge, fundamentally limited by the ability to navigate robustly, safely, and accurately. While reinforcement learning with velocity-commanded policies has achieved remarkable robustness in humanoid locomotion, this approach lacks explicit control of the foothold placement, leading to unsafe behavior, such as stepping onto human feet, or imprecise navigation, hindering the following manipulation task. Conversely, explicit foothold-tracking policies offer a promising alternative by directly being commanded with target foot poses. However, existing approaches are often limited by unrealistic state assumptions, compromising real-world deployment, or they are part of staged pipelines, making them tied to specific downstream tasks. In this work, we introduce a novel, lightweight framework for training general-purpose 3D foothold-tracking policies. By dynamically providing footstep support through a goal sampler, this method enables the learned policy to be agnostic to specific terrains. Our new target representation effectively mitigates challenges arising in the real world, such as noisy and inaccurate pose estimation and foot contact estimation. Designed for direct real-world transfer, our policy acts as a standalone low-level controller that can be seamlessly paired with various high-level foothold generators. We demonstrate the effectiveness of our framework through extensive experiments in simulation and in the real world. By coupling our policy with different upstream planners, we achieve natural and accurate locomotion in challenging settings, paving the way for loco-manipulation tasks in complex environments.

2606.08276 2026-06-09 quant-ph cs.ET cs.LG 交叉投稿

QnRL: Quantum-Native Reinforcement Learning

QnRL: 量子原生强化学习

Alexander DeRieux, Walid Saad

发表机构 * Bradley Department of Electrical and Computer Engineering(布拉德利电气与计算机工程系) Virginia Tech Institute for Advanced Computing(弗吉尼亚理工学院高级计算研究所)

AI总结 提出量子原生强化学习(QnRL)框架,利用量子态的叠加和纠缠在希尔伯特空间中直接学习条件分布,通过量子振幅反冲(QuAK)算法比较分布矩,从而更高效地建模随机环境,实验显示评分提升高达82.9%,参数减少94.3%。

Comments 36 pages, 23 figures

详情
AI中文摘要

量子强化学习(QRL)是一种有前景的方法,可在具有随机环境的多个应用中学习有效的决策策略。现有的QRL架构不直接建模控制这些环境的随机变量,而是通过估计期望结果间接近似环境行为,这限制了它们的表达能力和自适应潜力。克服这些挑战需要一种新颖的QRL方法,利用量子计算机的分布性质直接将环境随机变量建模为量子态分布。因此,本文提出了一种名为量子原生强化学习(QnRL)的新框架。QnRL是一种分布强化学习框架,通过叠加和纠缠的量子态在希尔伯特空间中自然地学习条件分布。因此,QnRL可以通过量子系统的自然属性直接建模随机学习环境的行为。QnRL通过一种新颖的量子振幅反冲(QuAK)算法实现这一点,该算法能够比较多个叠加分布的第$m$个矩的$n$次幂。理论上证明,通过QuAK,条件动作策略分布完全在希尔伯特空间内从量子生成模型的矩中蒸馏出来,并通过QnRL进行优化。这种复杂的分布组合还被证明提供了额外的维度来表达环境相关性,而这些相关性对于纯经典和经典采样的量子分布模型是未知的。跨不同环境的实验结果表明,与基线相比,QnRL实现了高达$82.9\%$的更高评估分数,平均参数减少高达$94.3\%$,更准确地估计未见观测的期望回报,并更好地适应变化的随机条件。

英文摘要

Quantum reinforcement learning (QRL) is a promising approach to learn effective decision strategies across several applications with stochastic environments. Instead of directly modeling the random variables that govern these environments, existing QRL architectures indirectly approximate environment behavior by estimating expected outcomes, which limits their expressive power and adaptive potential. Overcoming such challenges requires a novel QRL approach that exploits the distributional nature of quantum computers to directly model environment random variables as quantum state distributions. Hence, in this paper, a novel framework dubbed quantum-native reinforcement learning (QnRL) is proposed. QnRL is a distributional RL framework that learns conditional distributions naturally in Hilbert space via superimposed and entangled quantum states. Thus, QnRL can directly model the behavior of stochastic learning environments via the natural properties of quantum systems. QnRL accomplishes this via a novel, proposed quantum amplitude kickback (QuAK) algorithm that enables comparing the $n$-th power of the $m$-th moment of multiple superimposed distributions. It is theoretically proven that a conditional action policy distribution is distilled from the moments of a quantum generative model entirely within Hilbert space via QuAK, and optimized via QnRL. This complex distribution composition is also shown to provide extra dimensions for expressing environment correlations that are unknown to purely classical and classically-sampled quantum distributional models. Experimental results across diverse environments show that QnRL achieves up to $82.9\%$ higher evaluation scores, with up to $94.3\%$ fewer parameters on average, more accurately estimates the expected return for unseen observations, and better adapts to varying stochastic conditions compared to the baseline.

2606.08346 2026-06-09 cs.CL cs.LG 交叉投稿

CATPO: Critique-Augmented Tree Policy Optimization

CATPO: 批评增强的树策略优化

Ayush Singh, Umang Goyal, Ankur Dahiya

发表机构 * Indian Institute of Technology Roorkee(印度理工学院罗尔基分校) Vision and Language Group(视觉与语言组)

AI总结 提出CATPO方法,通过树信息性评分和批评引导修复,解决树结构强化学习中低效树浪费计算的问题,在数学推理任务上提升准确率。

Comments 14 pages, 1 figure, 6 tables

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)已成为提升大语言模型(LLM)推理能力的主流范式。最近的基于树的方法(如TreeRPO)通过树结构展开扩展了平坦轨迹采样,无需单独的奖励模型即可获得密集的步级奖励信号。然而,并非所有树都具有相同的信息量:所有叶子成功、所有叶子失败或策略已预测出奖励分布的树对梯度更新贡献甚微,浪费计算资源。我们提出CATPO(批评增强的树策略优化),在树级别诊断并解决这一浪费问题。CATPO首先通过树信息性分数F(T)对每棵树进行评分,该分数结合了叶子结果多样性和策略-奖励去相关性,且无需额外计算。对于所有分支均失败的“全错”树,CATPO应用批评引导修复:定位最浅的失败点,生成自然语言批评,并嫁接精炼的延续以恢复训练信号。最后,信息性加权损失通过归一化分数缩放每棵树的梯度贡献,将参数更新集中在最具信息性的树上,同时保持整体梯度幅度。在MATH数据集上训练的Qwen2.5-Math-1.5B上的实验表明,CATPO在四个基准(AIME24、MATH-500、OlympiadBench和MinervaMath)上实现了37.5%的宏平均准确率,比TreeRPO提高1.9%,比GRPO提高4.8%。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has become a dominant paradigm for improving the reasoning capabilities of large language models (LLMs). Recent tree-based methods such as TreeRPO extend flat trajectory sampling with tree-structured rollouts to obtain dense, step-level reward signals without a separate process reward model. However, not all trees are equally informative: trees where all leaves succeed, all leaves fail, or the policy already predicts the reward distribution contribute little to gradient updates, wasting compute. We introduce CATPO (Critique-Augmented Tree Policy Optimization), which diagnoses and addresses this waste at the tree level. CATPO first scores each tree via a tree informativeness score, F(T), combining leaf-outcome diversity with policy-reward decorrelation at zero extra compute. For dead-wrong trees where all branches fail, CATPO applies critique-guided healing: it locates the shallowest failure point, generates a natural-language critique, and grafts refined continuations to recover training signal. Finally, an informativeness-weighted loss scales each tree's gradient contribution by its normalized score, concentrating parameter updates on the most informative trees while preserving overall gradient magnitude. Experiments on Qwen2.5-Math-1.5B trained with the MATH dataset show that CATPO achieves 37.5% macro accuracy across four benchmarks (AIME24, MATH-500, OlympiadBench, and MinervaMath), improving over TreeRPO by 1.9% and GRPO by 4.8%.

2606.08379 2026-06-09 cs.AI cs.CE cs.LG q-fin.CP q-fin.TR 交叉投稿

TT-DAC-PS: Twin-Target Deterministic Actor-Critic with Policy Smoothing for Optimal Trade Execution

TT-DAC-PS:用于最优交易执行的双目标确定性演员-评论家与策略平滑

Ilia Zaznov, Atta Badii, Julian Kunkel, Alfonso Dufour

发表机构 * University of Reading(雷丁大学) University of Göttingen(哥廷根大学) GWDG(哥廷根数据处理中心) Henley Business School(亨利商学院)

AI总结 提出TT-DAC-PS算法,结合双指数移动平均评论家目标、悲观最小备份、TD3风格策略平滑噪声、延迟演员更新和保守Q正则化,以抑制过高估计,并在限价订单簿数据上优于经典和强化学习基线。

Comments 21 pages, 1 figure, 3 tables

详情
AI中文摘要

本研究通过引入TT-DAC-PS(双目标确定性演员-评论家与策略平滑),解决了大规模股票卖单的最优执行问题。该确定性演员-评论家架构结合了双指数移动平均评论家目标与悲观最小备份、TD3风格的目标策略平滑噪声、延迟演员更新以及保守Q正则化,以抑制过高估计。探索使用Ornstein-Uhlenbeck(OU)噪声,并采用混合调度:确定性回合衰减、基于近期奖励离散度的方差引导调整,以及一个可学习并映射到噪声尺度的Soft Actor-Critic(SAC)风格温度。环境整合了Almgren-Chriss(AC)交易影响与限价订单簿(LOB)价格和成交量、归一化状态特征、每步成交量参与上限以及基于效用的奖励。该交易执行算法应用于十只美国股票的LOB数据。性能评估针对强化学习基线算法,包括近端策略优化(PPO)、软演员-评论家(SAC)和优势演员-评论家(A2C),以及替代交易执行算法,包括时间加权平均价格(TWAP)、成交量加权平均价格(VWAP)和AC。所提出的模型持续降低平均实现缺口百分比,并具有竞争性的方差,优于经典基线和标准强化学习基准模型。

英文摘要

This study addresses the optimal execution of large stock sell programs by introducing TT-DAC-PS (Twin-Target Deterministic Actor-Critic with Policy Smoothing), a deterministic actor-critic architecture that combines twin exponential-moving-average critic targets with pessimistic min backup, TD3-style target policy smoothing noise, delayed actor updates, and conservative Q regularisation to curb overestimation. Exploration uses Ornstein-Uhlenbeck (OU) noise with a hybrid schedule: deterministic episode-wise decay, variance-guided adjustment based on recent reward dispersion, and a Soft Actor-Critic (SAC)-style temperature that is learned and mapped to the noise scale. The environment integrates Almgren-Chriss (AC) trade impact with Limit Order Book (LOB) prices and volumes, normalised state features, per-step volume participation caps, and a utility-based reward. The trade execution algorithm is applied to LOB data for ten U.S. stocks. Performance is assessed against reinforcement-learning baseline algorithms, including Proximal Policy Optimisation (PPO), Soft Actor-Critic (SAC), and Advantage Actor-Critic (A2C), as well as alternative trade execution algorithms, including Time-Weighted Average Price (TWAP), Volume-Weighted Average Price (VWAP), and AC. The proposed model consistently reduces mean implementation shortfall percentage with competitive variance, outperforming classical baselines and standard reinforcement-learning benchmark models.
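
Several of the components named in this abstract are standard TD3-style building blocks. The sketch below shows generic scalar versions of three of them: target policy smoothing with clipped noise, the pessimistic minimum over twin critic targets, and an exponential-moving-average target update. It is not the TT-DAC-PS implementation; the function names, default values, and scalar action space are assumptions, and the conservative Q regularisation and the hybrid Ornstein-Uhlenbeck schedule are not shown.

```python
def clip(x, low, high):
    """Clamp x to the interval [low, high]."""
    return max(low, min(high, x))


def smoothed_target_action(target_action, noise, noise_clip, a_low, a_high):
    """Target policy smoothing: add clipped noise, then clamp to the action range."""
    return clip(target_action + clip(noise, -noise_clip, noise_clip), a_low, a_high)


def pessimistic_target(reward, done, q1_next, q2_next, gamma=0.99):
    """Bootstrapped target using the smaller of the two target-critic values."""
    not_done = 0.0 if done else 1.0
    return reward + gamma * not_done * min(q1_next, q2_next)


def ema_update(target_params, online_params, tau=0.005):
    """Move each target parameter a fraction tau toward the online parameter."""
    return [(1 - tau) * t + tau * o for t, o in zip(target_params, online_params)]


# Toy numbers: an action near the upper bound and two disagreeing critics.
next_action = smoothed_target_action(0.9, 0.5, 0.2, 0.0, 1.0)
target_value = pessimistic_target(1.0, False, 2.0, 3.0, gamma=0.5)
```

Taking the minimum of the two critic estimates biases the target downward, which is the usual way such methods counter overestimation.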

2606.08513 2026-06-09 cs.RO cs.LG cs.SY eess.SY 交叉投稿

Towards End to End Motion Planning and Execution for Autonomous Underwater Vehicles Using Reinforcement Learning

面向自主水下机器人的端到端运动规划与执行:基于强化学习的方法

Elisei Shafer, Oren Gal

发表机构 * University of Haifa(海法大学)

AI总结 提出分层强化学习架构,将原始传感器数据直接映射为推进器指令,实现AUV端到端运动规划与执行,在HoloOcean仿真中轨迹长度接近RRT*基线(误差4%-6%),并具备鲁棒性。

详情
AI中文摘要

自主水下机器人(AUV)传统上依赖复杂、高度工程化的流水线进行感知、路径规划和运动控制。本文探索了一种端到端深度强化学习(DRL)方法的可行性,该方法将原始传感器数据直接映射为推进器指令,减少了人工工程。我们提出了一种分层强化学习(HRL)架构,将问题分解为两个马尔可夫决策过程。高层(HL)策略以2Hz运行,处理原始$84 \times 84$像素单目相机帧、堆叠的$100 \times 100$像素前视成像声纳以及本体感受数据,生成空间子目标。同时,低层(LL)策略以10Hz运行,将这些子目标转换为推进器指令。HL策略使用基于先前演示的强化学习(RLPD)在修改后的样本高效机器人强化学习(SERL)框架中训练,而LL策略则采用软演员-评论家(SAC)结合后见经验回放(HER)。在高保真HoloOcean模拟器中评估,我们的方法展示了成功的避障能力,轨迹长度与$\text{RRT}^*$规划基线非常接近(误差在4%到6%之间)。此外,学习到的策略对模拟传感器噪声和能见度降低表现出强鲁棒性。尽管系统能有效导航熟悉的几何环境,但实验揭示了在遇到具有新颖障碍形状的未访问区域时存在泛化限制。最终,这项工作展示了使用最小计算硬件进行样本高效、端到端DRL在水下导航中的潜力。

英文摘要

Autonomous Underwater Vehicles (AUVs) traditionally rely on complex, heavily engineered pipelines for perception, path planning, and motion control. This paper explores the feasibility of an end-to-end Deep Reinforcement Learning (DRL) approach that maps raw sensor data directly to thruster commands, reducing manual engineering. We propose a hierarchical reinforcement learning (HRL) architecture splitting the problem into two Markov Decision Processes. A High-Level (HL) policy operating at 2Hz processes raw $84 \times 84$ pixel monocular camera frames, stacked $100 \times 100$ pixel forward-looking imaging sonar, and proprioceptive data to generate spatial subgoals. Simultaneously, a Low-Level (LL) policy operating at 10Hz converts these subgoals into thruster commands. The HL policy is trained using Reinforcement Learning from Prior Demonstrations (RLPD) within a modified Sample-Efficient Robotic Reinforcement Learning (SERL) framework, while the LL policy utilizes Soft Actor-Critic (SAC) combined with Hindsight Experience Replay (HER). Evaluated in the high-fidelity HoloOcean simulator, our method demonstrates successful obstacle avoidance, achieving trajectory lengths closely approximating (within 4% to 6% of) an $\text{RRT}^*$ planning baseline. Furthermore, the learned policy exhibits strong robustness to simulated sensor noise and decreased visibility. While the system navigates familiar geometries effectively, experiments reveal generalization limitations when encountering unvisited areas with novel obstacle shapes. Ultimately, this work demonstrates the promise of sample-efficient, end-to-end DRL for underwater navigation using minimal computational hardware.

2606.09002 2026-06-09 stat.ML cs.LG math.ST stat.TH 交叉投稿

Multi-Armed Bandits with Arriving Arms: Sequential Screening, Dynamic Regret, and Sublinear Guarantees

带有到达臂的多臂老虎机:顺序筛选、动态遗憾与次线性保证

Deqi Zheng, Xiaoyang Xu, Yuhong Yang

发表机构 * Qiuzhen College, Tsinghua University(清华大学求真书院) Yau Mathematical Sciences Center, Tsinghua University(清华大学丘成桐数学科学中心)

AI总结 针对可用臂随时间扩展的随机多臂老虎机问题,提出基于消除的UCB-AA算法,通过初步筛选新臂并考虑到达信息差异和漂移基准,实现动态遗憾的次线性界。

Comments 24 pages, 4 figures

详情
AI中文摘要

我们研究了一个随机多臂老虎机问题,其中可用臂的集合随时间扩展。当新的动作或处理方案在进行中的研究期间变得可用时,顺序实验中就会出现这一设置,此时以事后单一最佳臂为基准的遗憾不再合适。我们转而评估相对于当前可用最佳臂的性能,从而为到达臂环境引入了一个动态遗憾准则。为了解决由此带来的到达信息差异(AID)和漂移基准(DB)两项挑战,我们提出了用于到达臂的UCB(UCB-AA),这是一个基于消除的过程,并附带一个辅助的初步筛选步骤:新到达的臂先经过筛选,再与现有臂全面竞争。我们证明UCB-AA获得的遗憾界明确依赖于到达过程,在间隙演化的正则条件下实现了次线性动态遗憾,并可在线扩展到时间范围未知的情形。仿真结果表明,UCB-AA减少了浪费的拉取次数,维持了更小的活动臂集,同时保持了有竞争力的遗憾性能。

英文摘要

We study a stochastic multi-armed bandit problem in which the set of available arms expands over time. This setting arises in sequential experimentation when new actions or treatments become available during an ongoing study, making regret against a single best arm in hindsight inappropriate. We instead evaluate performance relative to the best arm currently available, leading to a dynamic-regret criterion for arriving-arm environments. To address the resulting challenges of arrival information discrepancy (AID) and a drifting benchmark (DB), we propose UCB for Arriving Arms (UCB-AA), an elimination-based procedure with an aiding preliminary screening step for newly arrived arms before full competition with incumbent arms. We show that UCB-AA attains regret bounds that depend explicitly on the arrival process, achieves sublinear dynamic regret under regularity conditions on gap evolution, and admits an online extension for unknown horizons. Simulation results show that UCB-AA reduces wasted pulls and maintains a smaller active arm set while preserving competitive regret performance.
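
The dynamic-regret criterion described above (performance relative to the best arm currently available) can be illustrated with a small simulation helper. This is only a sketch under stated assumptions: the function name `dynamic_regret` and its arguments are hypothetical, the true means are assumed known (which is only the case in simulation), and it is not the UCB-AA procedure itself.

```python
def dynamic_regret(means, arrival_times, pulls):
    """Cumulative regret against the best arm available at each round.

    means[i]         -- true mean reward of arm i (assumed known; simulation only)
    arrival_times[i] -- first round (0-indexed) at which arm i can be pulled
    pulls[t]         -- index of the arm pulled at round t
    """
    regret = 0.0
    for t, arm in enumerate(pulls):
        if arrival_times[arm] > t:
            raise ValueError(f"arm {arm} pulled at round {t} before it arrived")
        # The benchmark drifts: only arms that have already arrived count.
        available = [i for i, a in enumerate(arrival_times) if a <= t]
        best_mean = max(means[i] for i in available)
        regret += best_mean - means[arm]
    return regret


if __name__ == "__main__":
    # Arm 1 (the better arm) arrives at round 2; the learner switches one round late.
    print(dynamic_regret(means=[0.5, 0.9], arrival_times=[0, 2], pulls=[0, 0, 0, 1]))
```

In this toy run, regret accrues only in round 2, the single round in which the better arm was available but not pulled. Pulling arm 0 before arm 1 arrives costs nothing under this criterion.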

2308.07822 2026-06-09 cs.LG cs.SY eess.SY 版本更新

Deep reinforcement learning for process design: Review and perspective

深度强化学习在过程设计中的应用:综述与展望

Qinghe Gao, Artur M. Schweidtmann

发表机构 * Delft University of Technology(代尔夫特理工大学)

AI总结 本文综述深度强化学习在化工过程设计中的应用,从信息表示、智能体架构、环境与奖励三要素分析现状,并讨论挑战与未来方向。

详情
AI中文摘要

化学工业向可再生能源和原料供应的转型需要新的概念性过程设计方法。最近,人工智能的突破为加速这一转型提供了机会。具体而言,深度强化学习作为机器学习的一个子类,已显示出解决复杂决策问题和促进可持续过程设计的潜力。我们通过三个主要要素调查了强化学习在过程设计中的最新研究:(i)信息表示,(ii)智能体架构,以及(iii)环境与奖励。此外,我们讨论了潜在挑战和未来有前景的工作,以充分发挥强化学习在化学工程过程设计中的潜力。

英文摘要

The transformation towards renewable energy and feedstock supply in the chemical industry requires new conceptual process design approaches. Recently, breakthroughs in artificial intelligence offer opportunities to accelerate this transition. Specifically, deep reinforcement learning, a subclass of machine learning, has shown the potential to solve complex decision-making problems and aid sustainable process design. We survey state-of-the-art research in reinforcement learning for process design through three major elements: (i) information representation, (ii) agent architecture, and (iii) environment and reward. Moreover, we discuss perspectives on underlying challenges and promising future works to unfold the full potential of reinforcement learning for process design in chemical engineering.

2502.01226 2026-06-09 cs.LG stat.ML 版本更新

Adaptive Prior Selection in Gaussian Process Bandits with Thompson Sampling

基于汤普森采样的高斯过程老虎机自适应先验选择

Jack Sandberg, Morteza Haghir Chehreghani

发表机构 * Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg(查尔姆斯理工大学与哥德堡大学计算机科学与工程系)

AI总结 本文研究高斯过程老虎机中联合进行先验选择与遗憾最小化的两种算法PE-GP-TS和HP-GP-TS,理论上证明了HP-GP-TS的次线性遗憾界,并通过合成与真实数据实验验证其有效性。

Comments 30 pages, 12 figures

详情
AI中文摘要

高斯过程(GP)老虎机为未知函数的黑箱优化提供了强大框架。未知函数的特性在很大程度上取决于所假设的GP先验。文献中的大多数工作假设该先验已知,但这在实践中很少成立。实践者转而常常依靠最大似然估计来选择先验的超参数,而这种做法缺乏理论保证。本文研究了两种基于GP汤普森采样(GP-TS)、在GP老虎机中联合进行先验选择与遗憾最小化的算法:Prior-Elimination GP-TS(PE-GP-TS)淘汰预测性能差的先验,HyperPrior GP-TS(HP-GP-TS)则采用双层汤普森采样方案。我们对这些算法进行了理论分析,并为HP-GP-TS建立了次线性遗憾界。此外,我们通过在合成数据和真实数据上的大量实验,展示了这些算法相对于替代方案的有效性。

英文摘要

Gaussian process (GP) bandits provide a powerful framework for performing blackbox optimization of unknown functions. The characteristics of the unknown function depend heavily on the assumed GP prior. Most work in the literature assumes that this prior is known, but in practice this seldom holds. Instead, practitioners often rely on maximum likelihood estimation to select the hyperparameters of the prior, which lacks theoretical guarantees. In this work, we study two algorithms for joint prior selection and regret minimization in GP bandits based on GP Thompson sampling (GP-TS): Prior-Elimination GP-TS (PE-GP-TS), which disqualifies priors with poor predictive performance, and HyperPrior GP-TS (HP-GP-TS), which utilizes a bi-level Thompson sampling scheme. We theoretically analyze the algorithms and establish a sublinear regret bound for HP-GP-TS. In addition, we demonstrate the effectiveness of these algorithms compared to the alternatives through extensive experiments with synthetic and real-world data.

2506.10341 2026-06-09 cs.LG cs.CL 版本更新

Formalizing Learning from Language Feedback with Provable Guarantees

从语言反馈中学习的形式化与可证明保证

Wanqiao Xu, Allen Nie, Ruijie Zheng, Aditya Modi, Adith Swaminathan, Ching-An Cheng

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of Toronto(多伦多大学)

AI总结 本文形式化语言反馈学习问题,提出迁移eluder维度(transfer eluder dimension)刻画学习难度,并开发无遗憾算法HELiX,证明其性能保证,展示丰富语言反馈可指数级加速学习。

Comments ICML 2026

详情
AI中文摘要

通过观察和语言反馈进行交互式学习是一个日益受到关注的领域,其驱动力来自大型语言模型(LLM)智能体的出现。尽管有令人印象深刻的实证演示,但迄今为止,这些决策问题的原则性框架仍然缺乏。我们形式化了语言反馈学习(LLF)问题,提出了足以在潜在奖励下实现学习的假设,并引入了$\textit{迁移eluder维度}$(transfer eluder dimension)作为衡量LLF难度的指标。我们形式化了“语言反馈中的信息决定学习复杂度”这一直觉,并展示了从丰富语言反馈中学习可以比从奖励中学习指数级更快的案例。我们开发了一种名为$\texttt{HELiX}$的无遗憾算法,通过顺序交互可证明地解决LLF问题,其性能保证随迁移eluder维度缩放。在多个实证领域,我们展示了即使重复提示LLM不可靠时,$\texttt{HELiX}$也能表现良好。我们的贡献标志着朝着使用通用语言反馈设计原则性交互学习算法迈出了重要一步。

英文摘要

Interactively learning from observation and language feedback is an increasingly studied area driven by the emergence of large language model (LLM) agents. Despite impressive empirical demonstrations, so far a principled framing of these decision problems remains lacking. We formalize the Learning from Language Feedback (LLF) problem, assert sufficient assumptions to enable learning despite latent rewards, and introduce $\textit{transfer eluder dimension}$ as a measure to characterize the hardness of LLF. We formalize the intuition that information in the language feedback governs the learning complexity, and demonstrate cases where learning from rich language feedback can be exponentially faster than learning from reward. We develop a no-regret algorithm, called $\texttt{HELiX}$, that provably solves LLF problems through sequential interactions, with performance guarantees that scale with the transfer eluder dimension. Across several empirical domains, we show that $\texttt{HELiX}$ performs well even when repeatedly prompting LLMs does not work reliably. Our contributions mark an important step towards designing principled interactive learning algorithms using generic language feedback.

2508.06336 2026-06-09 cs.LG cs.AI cs.HC cs.MA 版本更新

Unsupervised Partner Design Enables Robust Ad-hoc Teamwork

无监督伙伴设计实现鲁棒的临时团队协作

Constantin Ruhdorfer, Matteo Bortoletto, Victor Oei, Anna Penzkofer, Andreas Bulling

发表机构 * University of Southampton(南安普顿大学)

AI总结 提出无监督伙伴设计(UPD)方法,通过动态生成并基于可学习性准则自适应选择训练伙伴,无需预训练伙伴群体或手动调参,在多个任务中达到强性能,并在人机交互研究中获得更高评价。

Comments 27 pages

详情
AI中文摘要

我们引入了无监督伙伴设计(UPD),一种用于鲁棒临时团队协作的无群体多智能体强化学习方法。UPD 动态生成训练伙伴,并基于可学习性准则自适应地选择它们,消除了对预训练伙伴群体或手动参数调整的需求。我们表明,这种简单机制能够实现有效的伙伴多样性,并且在存在程序化关卡生成器时可以扩展到联合伙伴-环境选择。在 Level-Based Foraging、Overcooked-AI 和 Overcooked Generalisation Challenge 上,与基于群体和无群体的基线方法相比,UPD 始终实现强性能。在一项人类-AI用户研究中,使用 UPD 训练的智能体获得了更高的回报,并且与所有被评估的基线方法相比,被评为更具适应性、更像人类且更不令人沮丧。

英文摘要

We introduce Unsupervised Partner Design (UPD), a population-free multi-agent reinforcement learning method for robust ad-hoc teamwork. UPD generates training partners on-the-fly and selects them adaptively based on a learnability criterion, removing the need for pre-trained partner populations or manual parameter tuning. We show that this simple mechanism enables effective partner diversity and can be extended to joint partner-environment selection when a procedural level generator is available. Across Level-Based Foraging, Overcooked-AI, and the Overcooked Generalisation Challenge, UPD consistently achieves strong performance compared to both population-based and population-free baselines. In a human-AI user study, agents trained with UPD achieve higher returns and are rated as more adaptive, more human-like, and less frustrating than all evaluated baseline methods.

2508.06659 2026-06-09 cs.LG cs.AI 版本更新

In-Context Reinforcement Learning via Communicative World Models

通过通信世界模型进行上下文强化学习

Fernando Martinez-Lopez, Tao Li, Yingdong Lu, Juntao Chen

发表机构 * Department of Computer and Information Sciences, Fordham University(福特汉姆大学计算机与信息科学系) Department of Systems Engineering, City University of Hong Kong(香港城市大学系统工程系) IBM Research(IBM研究院)

AI总结 提出CORAL框架,通过将潜在表示学习与控制分离,利用信息代理预训练世界模型并生成通信消息,使控制代理实现零样本适应和样本效率提升。

详情
AI中文摘要

强化学习(RL)代理通常难以在不更新参数的情况下泛化到新任务和上下文,主要是因为它们学到的表示和策略过度拟合于训练环境的特定性。为了提升代理的上下文RL(ICRL)能力,本文将ICRL形式化为一个双代理涌现通信问题,并引入了CORAL(用于自适应RL的通信表示)框架,该框架通过功能性地分离潜在表示学习与控制来学习可迁移的通信上下文。在CORAL中,信息代理(IA)在多样化的任务分布上作为世界模型进行预训练。其目标不是直接最大化回报,而是进行世界建模并将其理解提炼为简洁的消息。涌现通信协议由一种新颖的因果影响损失塑造,该损失衡量消息对下一动作的影响。在部署期间,预训练的IA作为固定上下文提供者服务于新的控制代理(CA),后者通过解释提供的通信上下文来学习解决任务。我们的实验表明,这种方法使CA能够实现样本效率的显著提升,并在多样化的在线和离线环境中借助预训练的IA成功进行零样本适应,验证了学习可迁移通信表示的有效性。

英文摘要

Reinforcement learning (RL) agents often struggle to generalize to new tasks and contexts without updating their parameters, mainly because their learned representations and policies are overfit to the specifics of their training environments. To boost agents' in-context RL (ICRL) ability, this work formulates ICRL as a two-agent emergent communication problem and introduces CORAL (Communicative Representation for Adaptive RL), a framework that learns a transferable communicative context by functionally separating latent representation learning from control. In CORAL, an Information Agent (IA) is pre-trained as a world model on a diverse distribution of tasks. Its objective is not direct return maximization, but world modeling and distilling its understanding into concise messages. The emergent communication protocol is shaped by a novel Causal Influence Loss, which measures the effect that the message has on the next action. During deployment, the previously trained IA serves as a fixed contextualizer for a new Control Agent (CA), which learns to solve tasks by interpreting the provided communicative context. Our experiments demonstrate that this approach enables the CA to achieve significant gains in sample efficiency and successfully perform zero-shot adaptation with the help of pre-trained IA in diverse online and offline environments, validating the efficacy of learning a transferable communicative representation.

2511.20397 2026-06-09 cs.LG cs.DS cs.NA math.NA 版本更新

Model-Based Learning of Whittle indices

基于模型的Whittle指数学习

Joël Charles-Rebuffé, Nicolas Gast, Bruno Gaujal

发表机构 * Univ. Grenoble Alpes(格勒诺布尔阿尔卑斯大学) Inria(法国国家信息与自动化研究所) CNRS(法国国家科学研究中心) Grenoble INP(格勒诺布尔INP)

AI总结 提出BLINQ算法,通过构建MDP经验估计并计算Whittle指数,证明收敛性并给出精度界,数值实验表明样本效率显著优于Q学习。

Comments 30 pages, 7 figures, submitted to TOMPECS

详情
AI中文摘要

我们提出BLINQ,一种新的基于模型的算法,用于学习可索引、连通且单链马尔可夫决策过程(MDP)的Whittle指数。我们的方法依赖于构建MDP的经验估计,然后使用现有最先进算法的扩展版本计算其Whittle指数。我们提供了收敛到我们想要学习的Whittle指数的证明,以及以任意精度学习它们所需时间的界限。此外,我们研究了其计算复杂度。我们的数值实验表明,在获得精确近似所需的样本数量方面,BLINQ显著优于现有的Q学习方法。此外,对于任何合理的高样本数量,其总计算成本甚至低于Q学习。即使使用神经网络加速Q值预测,这些观察结果仍然存在。

英文摘要

We present BLINQ, a new model-based algorithm that learns the Whittle indices of an indexable, communicating and unichain Markov Decision Process (MDP). Our approach relies on building an empirical estimate of the MDP and then computing its Whittle indices using an extended version of a state-of-the-art existing algorithm. We provide a proof of convergence to the Whittle indices we want to learn as well as a bound on the time needed to learn them with arbitrary precision. Moreover, we investigate its computational complexity. Our numerical experiments suggest that BLINQ significantly outperforms existing Q-learning approaches in terms of the number of samples needed to get an accurate approximation. In addition, it has a total computational cost even lower than Q-learning for any reasonably high number of samples. These observations persist even when the Q-learning algorithms are speeded up using neural networks to predict Q-values.

2601.01665 2026-06-09 cs.LG cs.AI 版本更新

Adversarial Instance Generation and Robust Training for Neural Combinatorial Optimization with Multiple Objectives

多目标神经组合优化的对抗实例生成与鲁棒训练

Wei Liu, Yaoxin Wu, Yingqian Zhang, Thomas Bäck, Yingjie Fan

发表机构 * LIACS, Leiden University, Leiden, The Netherlands(莱顿大学LIACS研究所,莱顿,荷兰) Eindhoven University of Technology, Eindhoven, The Netherlands(埃因霍温理工大学,埃因霍温,荷兰)

AI总结 提出面向多目标组合优化问题的偏好条件深度强化学习鲁棒性框架,通过偏好对抗攻击生成困难实例并量化影响,结合硬度感知偏好选择的对抗训练提升泛化性,在MOTSP、MOCVRP、MOKP上验证了攻击与防御的有效性。

详情
AI中文摘要

深度强化学习(DRL)在解决多目标组合优化问题(MOCOPs)方面显示出巨大潜力。然而,这些基于学习的求解器的鲁棒性尚未得到充分探索,尤其是在多样化和复杂的问题分布上。在本文中,我们提出了一个面向偏好条件DRL求解器用于MOCOPs的统一鲁棒性导向框架。在该框架内,我们开发了一种基于偏好的对抗攻击,以生成暴露求解器弱点的困难实例,并通过由此导致的帕累托前沿质量下降来量化攻击影响。我们进一步引入了一种防御策略,将硬度感知偏好选择集成到对抗训练中,以减少对受限偏好区域的过拟合并提高分布外性能。在多目标旅行商问题(MOTSP)、多目标容量车辆路径问题(MOCVRP)和多目标背包问题(MOKP)上的实验结果验证了我们的攻击方法能够成功地为不同求解器学习困难实例。此外,我们的防御方法显著增强了神经求解器的鲁棒性和泛化能力,在困难或分布外实例上提供了优越的性能。

英文摘要

Deep reinforcement learning (DRL) has shown great promise in addressing multi-objective combinatorial optimization problems (MOCOPs). Nevertheless, the robustness of these learning-based solvers has remained insufficiently explored, especially across diverse and complex problem distributions. In this paper, we propose a unified robustness-oriented framework for preference-conditioned DRL solvers for MOCOPs. Within this framework, we develop a preference-based adversarial attack to generate hard instances that expose solver weaknesses, and quantify the attack impact by the resulting degradation on Pareto-front quality. We further introduce a defense strategy that integrates hardness-aware preference selection into adversarial training to reduce overfitting to restricted preference regions and improve out-of-distribution performance. The experimental results on multi-objective traveling salesman problem (MOTSP), multi-objective capacitated vehicle routing problem (MOCVRP), and multi-objective knapsack problem (MOKP) verify that our attack method successfully learns hard instances for different solvers. Furthermore, our defense method significantly strengthens the robustness and generalizability of neural solvers, delivering superior performance on hard or out-of-distribution instances.

2601.09085 2026-06-09 cs.LG cs.AI cs.CL cs.IR 版本更新

MMR-GRPO: Accelerating GRPO-Style Training through Diversity-Aware Reward Reweighting

MMR-GRPO:通过多样性感知奖励重加权加速GRPO风格训练

Kangda Wei, Ruihong Huang

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系)

AI总结 提出MMR-GRPO方法,利用最大边际相关性根据完成多样性重加权奖励,减少冗余样本,加速GRPO训练,在保持性能的同时平均减少47.9%训练步数和70.2%时间。

详情
AI中文摘要

组相对策略优化(GRPO)已成为训练数学推理模型的标准方法;然而,它对每个提示依赖多个完成,使得训练计算成本高昂。尽管最近的工作减少了达到峰值性能所需的训练步数,但由于每步成本增加,整体实际训练耗时(wall-clock time)通常保持不变甚至增加。我们提出MMR-GRPO,它整合了最大边际相关性(Maximal Marginal Relevance),基于完成多样性对奖励进行重加权。我们的关键洞察是,语义冗余的完成贡献的边际学习信号有限;优先考虑多样化的解能产生更有信息量的更新并加速收敛。在三种模型规模(1.5B、7B、8B)、三种GRPO变体和五个数学推理基准上的广泛评估表明,MMR-GRPO在达到相当峰值性能的同时,平均减少47.9%的训练步数和70.2%的实际耗时。这些增益在模型、方法和基准上一致。我们的代码发布在:https://github.com/WeiKangda/MMR-GRPO。

英文摘要

Group Relative Policy Optimization (GRPO) has become a standard approach for training mathematical reasoning models; however, its reliance on multiple completions per prompt makes training computationally expensive. Although recent work has reduced the number of training steps required to reach peak performance, the overall wall-clock training time often remains unchanged or even increases due to higher per-step cost. We propose MMR-GRPO, which integrates Maximal Marginal Relevance to reweigh rewards based on completion diversity. Our key insight is that semantically redundant completions contribute limited marginal learning signal; prioritizing diverse solutions yields more informative updates and accelerates convergence. Extensive evaluations across three model sizes (1.5B, 7B, 8B), three GRPO variants, and five mathematical reasoning benchmarks show that MMR-GRPO achieves comparable peak performance while requiring on average 47.9% fewer training steps and 70.2% less wall-clock time. These gains are consistent across models, methods, and benchmarks. Our code is released at: https://github.com/WeiKangda/MMR-GRPO.
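
To make the idea of diversity-aware reward reweighting concrete, the following sketch applies a generic greedy Maximal Marginal Relevance pass to a group of completions. It is an illustration under assumptions, not the authors' formula: the function names, the use of cosine similarity over hypothetical completion embeddings, the trade-off parameter `lam`, and the weight rule `1 - max similarity` are all choices made here for the example. The released repository is the authoritative implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u > 0 and norm_v > 0 else 0.0


def mmr_reweight(rewards, embeddings, lam=0.5):
    """Down-weight rewards of completions that are redundant with earlier picks.

    Completions are visited in greedy MMR order (score = lam * reward minus
    (1 - lam) * max similarity to already-selected completions). Each one
    receives weight 1 - max similarity, an assumed rule for this sketch.
    """
    remaining = list(range(len(rewards)))
    selected = []
    weights = [0.0] * len(rewards)

    def redundancy(i):
        # Highest similarity to any already-selected completion (0 if none yet).
        return max((cosine(embeddings[i], embeddings[j]) for j in selected), default=0.0)

    while remaining:
        best = max(remaining, key=lambda i: lam * rewards[i] - (1.0 - lam) * redundancy(i))
        weights[best] = 1.0 - max(0.0, redundancy(best))
        selected.append(best)
        remaining.remove(best)
    return [r * w for r, w in zip(rewards, weights)]


if __name__ == "__main__":
    # Completions 0 and 1 share an embedding; completion 2 is orthogonal to both.
    print(mmr_reweight([1.0, 1.0, 1.0], [[1, 0], [1, 0], [0, 1]]))
```

In this toy group, the duplicate completion ends up with zero reweighted reward, while the two distinct completions keep theirs.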

2601.18510 2026-06-09 cs.LG cs.AI 版本更新

Just-In-Time Reinforcement Learning: Continual Learning in LLM Agents Without Gradient Updates

即时强化学习:无需梯度更新的LLM智能体持续学习

Yibo Li, Zijie Lin, Ailin Deng, Xuan Zhang, Yufei He, Shuo Ji, Tri Cao, Bryan Hooi

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出JitRL框架,通过动态非参数记忆和即时优势估计,无需梯度更新即可实现LLM智能体的测试时策略优化,在WebArena和Jericho上达到训练无关方法最优,且性能超越微调方法,成本降低30倍以上。

详情
AI中文摘要

尽管大型语言模型(LLM)智能体在通用任务上表现出色,但由于部署后权重冻结,它们在持续适应方面存在固有困难。传统的强化学习(RL)提供了一种解决方案,但会带来高昂的计算成本和灾难性遗忘的风险。我们引入了即时强化学习(JitRL),这是一个无需训练的框架,能够在没有任何梯度更新的情况下实现测试时策略优化。JitRL维护一个动态的非参数经验记忆,并检索相关轨迹以即时估计动作优势。这些估计随后用于直接调制LLM的输出logits。我们从理论上证明,这种加法更新规则是KL约束策略优化目标的精确闭式解。在WebArena和Jericho上的大量实验表明,JitRL在训练无关方法中建立了新的最先进水平。关键的是,JitRL在性能上超越了计算昂贵的微调方法(如WebRL),同时将货币成本降低了30倍以上,为持续学习智能体提供了一条可扩展的路径。代码可在https://github.com/liushiliushi/JitRL获取。

英文摘要

While Large Language Model (LLM) agents excel at general tasks, they inherently struggle with continual adaptation due to the frozen weights after deployment. Conventional reinforcement learning (RL) offers a solution but incurs prohibitive computational costs and the risk of catastrophic forgetting. We introduce Just-In-Time Reinforcement Learning (JitRL), a training-free framework that enables test-time policy optimization without any gradient updates. JitRL maintains a dynamic, non-parametric memory of experiences and retrieves relevant trajectories to estimate action advantages on-the-fly. These estimates are then used to directly modulate the LLM's output logits. We theoretically prove that this additive update rule is the exact closed-form solution to the KL-constrained policy optimization objective. Extensive experiments on WebArena and Jericho demonstrate that JitRL establishes a new state-of-the-art among training-free methods. Crucially, JitRL outperforms computationally expensive fine-tuning methods (e.g., WebRL) while reducing monetary costs by over 30 times, offering a scalable path for continual learning agents. The code is available at https://github.com/liushiliushi/JitRL.
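
The claim that an additive logit update solves a KL-constrained objective can be checked numerically on a toy action set. The sketch below uses the textbook KL-regularized form, maximizing expected advantage minus `beta` times KL(policy || base policy), whose maximizer is proportional to `base_policy * exp(advantage / beta)`. The temperature `beta`, the function names, and the toy numbers are assumptions for illustration; the abstract does not give JitRL's exact update, so this should not be read as the authors' implementation.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def adjust_logits(base_logits, advantages, beta=1.0):
    """Additive update: new_logit = base_logit + advantage / beta."""
    return [l + a / beta for l, a in zip(base_logits, advantages)]


def objective(policy, base_policy, advantages, beta):
    """Expected advantage minus beta * KL(policy || base_policy)."""
    expected_adv = sum(p * a for p, a in zip(policy, advantages))
    kl = sum(p * math.log(p / q) for p, q in zip(policy, base_policy) if p > 0)
    return expected_adv - beta * kl


if __name__ == "__main__":
    base_logits, advantages, beta = [0.0, 0.0, 0.0], [1.0, 0.0, -1.0], 0.5
    base = softmax(base_logits)
    updated = softmax(adjust_logits(base_logits, advantages, beta))
    print("base policy objective:   ", objective(base, base, advantages, beta))
    print("updated policy objective:", objective(updated, base, advantages, beta))
```

For this objective, the value attained by the updated policy equals `beta * log(sum(base[i] * exp(advantages[i] / beta)))`, which gives a quick consistency check.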

2601.22211 2026-06-09 cs.LG 版本更新

Latent Spherical Flow Policy for Reinforcement Learning with Combinatorial Actions

面向组合动作强化学习的潜在球形流策略

Lingkai Kong, Anagha Satish, Hezi Jiang, Akseli Kangaslahti, Andrew Ma, Wenbo Chen, Mingxiao Song, Lily Xu, Milind Tambe

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出LSFlow方法,通过球形流匹配在紧凑连续潜在空间中学习随机策略,并利用组合优化求解器保证动作可行性,引入平滑贝尔曼算子解决不连续值函数问题,在多个组合RL任务上平均超越基线20.6%。

Comments ICML'26 Spotlight

详情
AI中文摘要

具有组合动作空间的强化学习(RL)仍然具有挑战性,因为可行动作集呈指数级增长且受复杂可行性约束,使得直接策略参数化不切实际。现有方法将任务特定的价值函数嵌入到约束优化程序中,或学习确定性的结构化策略,牺牲了通用性和策略表达能力。我们提出了一种求解器诱导的潜在球形流策略,将现代生成策略的表达能力引入组合RL,同时通过设计保证可行性。我们的方法LSFlow通过球形流匹配在紧凑连续潜在空间中学习随机策略,并将可行性委托给组合优化求解器,该求解器将每个潜在样本映射到有效的结构化动作。为了提高效率,我们直接在潜在空间中训练价值网络,避免在策略优化期间重复调用求解器。为了解决由求解器动作选择引起的分段常数和不连续价值景观,我们引入了一个平滑的贝尔曼算子,该算子产生稳定、定义明确的学习目标。实验表明,我们的方法在一系列具有挑战性的组合RL任务中平均优于最先进的基线20.6%。

英文摘要

Reinforcement learning (RL) with combinatorial action spaces remains challenging because feasible action sets are exponentially large and governed by complex feasibility constraints, making direct policy parameterization impractical. Existing approaches embed task-specific value functions into constrained optimization programs or learn deterministic structured policies, sacrificing generality and policy expressiveness. We propose a solver-induced latent spherical flow policy that brings the expressiveness of modern generative policies to combinatorial RL while guaranteeing feasibility by design. Our method, LSFlow, learns a stochastic policy in a compact continuous latent space via spherical flow matching, and delegates feasibility to a combinatorial optimization solver that maps each latent sample to a valid structured action. To improve efficiency, we train the value network directly in the latent space, avoiding repeated solver calls during policy optimization. To address the piecewise-constant and discontinuous value landscape induced by solver-based action selection, we introduce a smoothed Bellman operator that yields stable, well-defined learning targets. Empirically, our approach outperforms state-of-the-art baselines by an average of 20.6% across a range of challenging combinatorial RL tasks.

2602.02572 2026-06-09 cs.LG cs.AI 版本更新

Reward Shaping for (Inference-Time) Alignment: A Stackelberg Game Perspective

奖励塑形用于(推理时)对齐:一个Stackelberg博弈视角

Haichuan Wang, Tao Lin, Lingkai Kong, Ce Li, Hezi Jiang, Milind Tambe

发表机构 * University of Southern California(南加州大学)

AI总结 针对KL正则化导致LLM继承基策略偏见的问题,提出将奖励模型优化形式化为Stackelberg博弈,并通过简单奖励塑形方案近似最优奖励模型,在推理时对齐中持续提升平均奖励,并对所有基线取得超过66%的胜平率。

Comments Accepted to ICML 2026. Camera-ready version

详情
AI中文摘要

现有的对齐方法直接使用从用户偏好数据中学习到的奖励模型来优化LLM策略,并相对于基策略进行KL正则化。这种做法对于最大化用户效用是次优的,因为KL正则化可能导致LLM继承基策略中与用户偏好冲突的偏见。虽然放大偏好输出的奖励可以减轻这种偏见,但也增加了奖励欺骗(reward hacking)的风险。这种权衡引出了在KL正则化下最优设计奖励模型的问题。我们将这个奖励模型优化问题形式化为一个Stackelberg博弈,并表明一个简单的奖励塑形方案可以有效近似最优奖励模型。我们在推理时对齐设置中对我们的方法进行了实证评估,并证明它可以无缝集成到现有的对齐方法中,且开销最小。我们的方法持续提高了平均奖励,并且在各评估设置上取平均后,对所有基线的胜平率(win-tie rate)均超过66%。

英文摘要

Existing alignment methods directly use the reward model learned from user preference data to optimize an LLM policy, subject to KL regularization with respect to the base policy. This practice is suboptimal for maximizing user's utility because the KL regularization may cause the LLM to inherit the bias in the base policy that conflicts with user preferences. While amplifying rewards for preferred outputs can mitigate this bias, it also increases the risk of reward hacking. This tradeoff motivates the problem of optimally designing reward models under KL regularization. We formalize this reward model optimization problem as a Stackelberg game, and show that a simple reward shaping scheme can effectively approximate the optimal reward model. We empirically evaluate our method in inference-time alignment settings and demonstrate that it integrates seamlessly into existing alignment methods with minimal overhead. Our method consistently improves average reward and achieves win-tie rates exceeding 66% against all baselines, averaged across evaluation settings.

2602.12107 2026-06-09 cs.LG cs.AI stat.ML 版本更新

On the Complexity of Offline Reinforcement Learning with $Q^\star$-Approximation and Partial Coverage

离线强化学习在 $Q^\star$ 近似与部分覆盖下的复杂性

Haolin Liu, Braham Snyder, Chen-Yu Wei

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文通过信息论下界证明 $Q^\star$ 可实现性与贝尔曼完备性在部分覆盖下不足以实现样本高效的离线强化学习,并提出一个通用决策-估计框架来统一和改进现有结果。

详情
AI中文摘要

我们研究了在 $Q^\star$ 近似和部分覆盖下的离线强化学习,这一设定激发了诸如保守 $Q$ 学习(CQL;Kumar et al., 2020)等实用算法,但理论上受到的关注有限。我们的工作受以下开放问题的启发:“在部分覆盖下,$Q^\star$ 可实现性和贝尔曼完备性是否足以实现样本高效的离线强化学习?”我们通过信息论下界给出了否定答案。为了识别在部分覆盖下实现样本高效离线强化学习的额外结构,我们引入了一个通用决策-估计框架,该框架受在线强化学习的无模型决策-估计系数(DEC;Foster et al., 2023b; Liu et al., 2025b)启发。我们的框架将离线强化学习的复杂性分解为决策复杂性和值估计误差,从而允许对这两个子问题进行模块化研究。我们的结果不仅统一了现有结果(Chen and Jiang, 2022; Uehara et al., 2023),而且进一步改进并推广了它们。在决策复杂性方面,我们的改进包括:在部分覆盖下软 $Q$ 学习的首个 $\epsilon^{-2}$ 样本复杂度界,改进了 Uehara 等人(2023)的 $\epsilon^{-4}$ 界;在 Chen 和 Jiang(2022)的值间隙设定中消除了对额外在线交互的需求;以及超越上述两种情况的新可学习设定。在值估计方面,我们提供了在部分覆盖下贝尔曼完备性作用的新刻画,以及一般低贝尔曼秩 MDP(Jiang et al., 2017; Du et al., 2021; Jin et al., 2021)离线可学习性的首个刻画。后者是一个经典的在线强化学习设定,除特殊情况外,在离线强化学习中尚未被探索。作为附带贡献,我们的技术给出了函数近似设定下 CQL 的首个分析。

英文摘要

We study offline reinforcement learning under $Q^\star$-approximation and partial coverage, a setting that motivates practical algorithms such as Conservative $Q$-Learning (CQL; Kumar et al., 2020) but has received limited theoretical attention. Our work is inspired by the following open question: "Are $Q^\star$-realizability and Bellman completeness sufficient for sample-efficient offline RL under partial coverage?" We answer in the negative via an information-theoretic lower bound. To identify additional structure that enables sample-efficient offline RL under partial coverage, we introduce a general decision-estimation framework, inspired by model-free decision-estimation coefficients (DEC) for online RL (Foster et al., 2023b; Liu et al., 2025b). Our framework decomposes offline RL complexity into decision complexity and value estimation error. This allows modular study of both sub-problems. Our result not only unifies existing results (Chen and Jiang, 2022; Uehara et al., 2023), but further improves and generalizes them. On the decision complexity side, our improvement includes: the first $\epsilon^{-2}$ sample complexity bound for soft $Q$-learning under partial coverage that improves Uehara et al.'s (2023) $\epsilon^{-4}$ bound, the removal of the need for additional online interaction in the value-gap setting of Chen and Jiang (2022), and new learnable settings beyond the above two cases. On the value estimation side, we provide a new characterization of the role of Bellman completeness under partial coverage, and the first characterization of offline learnability for general low-Bellman-rank MDPs (Jiang et al., 2017; Du et al., 2021; Jin et al., 2021). The latter is a canonical online RL setting that has remained unexplored in offline RL except for special cases. As a side contribution, our techniques give the first analysis of CQL in the function approximation setting.

2603.25184 2026-06-09 cs.LG cs.AI 版本更新

Train at Moving Edge: Online-Verified Prompt Selection for Efficient RL Training of Large Reasoning Model

在移动边缘训练:一种在线验证的提示选择方法用于大型推理模型的高效强化学习训练

Jiahao Wu, Ning Lu, Shengcai Liu, Kun Wang, Yanting Yang, Bailong Lin, Chen Jason Zhang, Li Qing, Ke Tang

发表机构 * Southern University of Science and Technology(南方科技大学) The Hong Kong Polytechnic University(香港理工大学) The Hong Kong University of Science and Technology(香港科技大学) Nanyang Technological University(南洋理工大学) Rutgers University(罗格斯大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出HIVE方法,通过历史奖励轨迹和实时提示熵实现高效RL训练,提升提示选择效率而不牺牲性能。

详情
AI中文摘要

强化学习(RL)已成为在推理任务中对大型语言模型(LLMs)进行后训练的关键技术。尽管扩大 rollout 规模可以稳定训练并提高性能,但计算开销是一个关键问题。在 GRPO 等算法中,每个提示进行多次 rollout 会带来极高的成本,因为大量提示只提供可忽略的梯度,效用很低。为了解决这个问题,我们研究如何在 rollout 阶段之前选择高效用的提示。我们的实验分析表明,样本效用是非均匀且动态变化的:最强的学习信号集中在「学习边缘」,即中等难度与高不确定性的交界处,并且该边缘随训练进行而移动。受此启发,我们提出了 HIVE(基于历史和在线验证的提示选择),一种面向数据高效 RL 的双阶段框架。HIVE 利用历史奖励轨迹进行粗略选择,并以提示熵作为实时代理来修剪效用已过时的实例。通过在多个数学推理基准和模型上评估 HIVE,我们证明 HIVE 在不牺牲性能的情况下显著提高了 rollout 的效率。

英文摘要

Reinforcement learning (RL) has become essential for post-training large language models (LLMs) in reasoning tasks. While scaling rollouts can stabilize training and enhance performance, the computational overhead is a critical issue. In algorithms like GRPO, multiple rollouts per prompt incur prohibitive costs, as a large portion of prompts provide negligible gradients and are thus of low utility. To address this problem, we investigate how to select high-utility prompts before the rollout phase. Our experimental analysis reveals that sample utility is non-uniform and evolving: the strongest learning signals concentrate at the "learning edge", the intersection of intermediate difficulty and high uncertainty, which shifts as training proceeds. Motivated by this, we propose HIVE (History-Informed and online-VErified prompt selection), a dual-stage framework for data-efficient RL. HIVE utilizes historical reward trajectories for coarse selection and employs prompt entropy as a real-time proxy to prune instances with stale utility. By evaluating HIVE across multiple math reasoning benchmarks and models, we show that HIVE yields significant rollout efficiency without compromising performance.
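
Why many prompts yield negligible gradients can be seen from group-normalized advantages: when every rollout for a prompt receives the same reward, all advantages are zero. The sketch below shows this, together with a simple stand-in for history-based coarse selection that ranks prompts by the Bernoulli variance `p * (1 - p)` of their historical pass rate. This is an assumed proxy for illustration only. It is not HIVE's actual selection rule, and it omits the prompt-entropy verification stage mentioned in the abstract. All names are hypothetical.

```python
def group_advantages(rewards, eps=1e-6):
    """Group-normalized advantages: (reward - group mean) / (group std + eps)."""
    n = len(rewards)
    mean = sum(rewards) / n
    std = (sum((r - mean) ** 2 for r in rewards) / n) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def select_prompts(pass_rates, k):
    """Keep the k prompts whose historical pass rate is closest to 0.5.

    pass_rates maps a prompt id to the fraction of past rollouts that were
    correct; p * (1 - p) peaks at intermediate difficulty (assumed proxy).
    """
    ranked = sorted(
        pass_rates,
        key=lambda pid: pass_rates[pid] * (1.0 - pass_rates[pid]),
        reverse=True,
    )
    return ranked[:k]


if __name__ == "__main__":
    print(group_advantages([1.0, 1.0, 1.0, 1.0]))  # all rollouts solved: no signal
    print(group_advantages([1.0, 0.0, 1.0, 0.0]))  # mixed outcomes: informative
    print(select_prompts({"easy": 1.0, "mid": 0.5, "hard": 0.0, "edge": 0.4}, k=2))
```

Always-solved and never-solved prompts both score zero under this proxy, which matches the observation that they contribute no gradient within a group.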

2605.03357 2026-06-09 cs.LG math.OC 版本更新

Population-Aware Imitation Learning in Mean-field Games with Common Noise

平均场博弈中考虑共同噪声的群体感知模仿学习

Grégoire Lambrecht, Mathieu Laurière

发表机构 * Institut National des Sciences et Techniques de l'Information et des Systèmes (INSTI)(信息与系统科学与技术国家研究院)

AI总结 针对含共同噪声的平均场博弈,提出群体感知模仿学习框架,通过行为克隆和对抗散度两种代理,建立有限样本误差界,并利用广义虚拟博弈和深度学习计算专家策略,实验证明群体感知策略对应对随机性的重要性。

详情
AI中文摘要

平均场博弈(MFGs)为建模大量交互智能体的集体行为提供了强大框架。本文研究了含共同噪声的MFG中的模仿学习(IL)问题,其中群体分布随机演化。这种随机性迫使智能体采用群体感知策略以应对总体冲击。我们制定了两个不同的学习目标:恢复纳什均衡和最大化相对于专家群体的性能。我们研究了两种模仿代理:行为克隆(BC)和对抗(ADV)散度。然后,我们建立了有限样本误差界,表明最小化这些代理能有效控制策略的可利用性及其相对于专家的性能差距。此外,我们提出了一个使用广义虚拟博弈和深度学习的数值框架来计算专家群体感知策略。通过在三个环境上的实验,我们证明了标准的群体无感知策略无法捕捉均衡动态。我们的结果强调,学习群体感知策略对于避免被共同噪声固有的随机性误导至关重要。

英文摘要

Mean Field Games (MFGs) provide a powerful framework for modeling the collective behavior of large populations of interacting agents. In this paper, we address the problem of Imitation Learning (IL) in MFGs subject to common noise, where the population distribution evolves stochastically. This stochasticity compels agents to adopt population-aware policies to respond to aggregate shocks. We formulate two distinct learning objectives: recovering a Nash equilibrium and maximizing performance against an expert population. We investigate two imitation proxies: Behavioral Cloning (BC) and Adversarial (ADV) divergence. We then establish finite-sample error bounds showing that minimizing these proxies effectively controls both the policy's exploitability and its performance gap relative to the expert. Furthermore, we propose a numerical framework using generalized Fictitious Play and Deep Learning to compute expert population-aware policies. Through experiments on three environments we demonstrate that standard population-unaware policies fail to capture the equilibrium dynamics. Our results highlight that learning population-aware policies is crucial to avoid being misled by the randomness inherent in common noise.

2605.26078 2026-06-09 cs.LG 版本更新

Global Convergence of Wasserstein Policy Gradient for Entropy-Regularized Reinforcement Learning

Wasserstein策略梯度在熵正则化强化学习中的全局收敛性

Zhaoyu Zhu, Rui Gao, Shuang Li

发表机构 * Shanghai Jiao Tong University(上海交通大学) The University of Texas at Austin(德克萨斯大学奥斯汀分校) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 本文通过利用熵正则化强化学习的Bellman结构,证明了Wasserstein策略梯度(WPG)方法的全局收敛性,并建立了分布Polyak-Łojasiewicz条件。

详情
AI中文摘要

Wasserstein策略梯度(WPG)是一种利用动作分布的最优传输几何的强化学习(RL)策略优化方法。对于熵正则化RL目标,WPG通过将每个状态条件策略沿软Q函数的动作梯度以及Langevin型扩散进行传输来演化。尽管它在连续控制问题中具有吸引力,但其全局收敛性质仍不清楚。标准的Langevin分析并不直接适用,因为RL目标通过Bellman递归而非静态凸泛函依赖于策略,且Langevin漂移由软Q函数决定,其正则性必须在策略迭代过程中加以控制。在本文中,我们通过利用熵正则化RL的Bellman结构,发展了WPG的全局收敛理论。我们表明,通常由凸性扮演的角色可以被基于Bellman的论证所取代:软Bellman残差相对于Gibbs策略具有状态级KL表示;Bellman压缩将此残差与全局最优性差距联系起来;而Bellman预解恒等式将价值改进与相对Fisher信息联系起来。结合演化Gibbs族的均匀对数Sobolev不等式(LSI),这些要素产生了分布Polyak-Łojasiewicz条件。我们进一步建立了控制离散化误差所需的正则性和一致界,从而获得直到离散化偏差的几何收缩。概念上,我们的分析表明,尽管熵正则化RL在通常的平坦意义上不是凸的,但Bellman递归诱导了一种有利的Polyak-Łojasiewicz型(PL)几何,支持WPG的全局收敛。

英文摘要

Wasserstein policy gradient (WPG) is a policy optimization method for reinforcement learning (RL) that exploits the optimal-transport geometry of action distributions. For the entropy-regularized RL objective, WPG evolves each state-conditional policy by transporting it along the action gradient of the soft Q-function together with a Langevin-type diffusion. Despite its appeal for continuous-control problems, its global convergence properties remain poorly understood. Standard Langevin analyses do not directly apply, because the RL objective depends on the policy through the Bellman recursion rather than through a static convex functional, and the Langevin drift is determined by the soft Q-function, whose regularity must be controlled along the policy iterates. In this paper, we develop a global convergence theory for WPG by exploiting the Bellman structure of entropy-regularized RL. We show that the role usually played by convexity can be replaced by a Bellman-based argument: the soft Bellman residual admits a statewise KL representation with respect to a Gibbs policy; Bellman contraction relates this residual to the global optimality gap; and a Bellman resolvent identity connects value improvement to relative Fisher information. Combined with a uniform log-Sobolev inequality (LSI) for the evolving Gibbs family, these ingredients yield a distributional Polyak-Łojasiewicz condition. We further establish the regularity and uniform bounds needed to control the discretization error, thereby obtaining geometric contraction up to a discretization bias. Conceptually, our analysis shows that although entropy-regularized RL is not convex in the usual flat sense, the Bellman recursion induces a favorable Polyak-Łojasiewicz-type (PL) geometry that supports global convergence of WPG.

2605.31014 2026-06-09 cs.LG 版本更新

SDM-Q: Cost-Aware Staged Decision-Making for Multi-Omics Classification with Deep Q-Learning

SDM-Q: 基于深度Q学习的成本感知分阶段决策用于多组学分类

Nan Mu, Yangfan Xiao, Ling Wang, Xiaoning Li, Yue Kang, Chen Zhao

发表机构 * College of Computer Science, Sichuan Normal University(四川师范大学计算机学院) Department of Mathematics, College of Science and Mathematics, Kennesaw State University(肯尼索州立大学科学与数学学院数学系) Department of Computer Science, College of Computing and Software Engineering, Kennesaw State University(肯尼索州立大学计算与软件工程学院计算机科学系)

AI总结 提出SDM-Q强化学习框架,将多组学诊断建模为有限步序贯决策问题,通过动作价值函数平衡分类正确性与模态获取成本,在四个公共数据集上有效减少冗余模态获取并保持竞争性分类性能。

详情
AI中文摘要

多组学数据提供了疾病表型的互补分子特征,在精准医学的疾病诊断和亚型分类中发挥重要作用。然而,获取完整的多组学图谱昂贵且耗时,而现有深度学习方法大多假设推理时模态齐全,导致大量冗余并在临床环境中实用性有限。为解决此问题,我们提出SDM-Q,一种用于自适应和成本感知多组学分类的强化学习框架。具体而言,多组学诊断被重新表述为有限步序贯决策问题,其中当前获取的组学模态定义每个阶段的诊断状态。动作价值函数决定是否获取额外模态或终止决策过程并输出最终预测。为平衡诊断效用和获取成本,奖励仅在终止阶段定义,并由分类正确性和累积模态获取成本共同决定。引入反向阶段优化策略以提高策略一致性和训练稳定性。在四个公共多组学数据集(包括ROSMAP、LGG、BRCA和KIPAN)上的实验表明,与使用完整多组学输入的方法相比,SDM-Q有效减少了冗余模态获取,同时保持竞争性的分类性能。在BRCA和KIPAN数据集中,分别有超过99%和95%的受试者仅使用单一组学模态即可实现准确分类,而ROSMAP和LGG的平均获取模态数保持在2以下。这些结果表明,成本感知的序贯决策为改善精准医学工作流程的效率提供了有效范式。

英文摘要

Multi-omics data provide complementary molecular characterizations of disease phenotypes and play an important role in disease diagnosis and subtype classification in precision medicine. However, acquiring complete multi-omics profiles is expensive and time-consuming, while most existing deep learning methods assume full modality availability during inference, resulting in substantial redundancy and limited practicality in clinical settings. To address this issue, we propose SDM-Q, a reinforcement learning framework for adaptive and cost-aware multi-omics classification. Specifically, multi-omics diagnosis is reformulated as a finite-horizon sequential decision problem, where the currently acquired omics modalities define the diagnostic state at each stage. An action-value function determines whether to acquire an additional modality or terminate the decision process and output the final prediction. To balance diagnostic utility and acquisition cost, the reward is defined only at the terminal stage and jointly determined by classification correctness and cumulative modality acquisition cost. A backward stage-wise optimization strategy is introduced to improve policy consistency and training stability. Experiments on four public multi-omics datasets, including ROSMAP, LGG, BRCA, and KIPAN, demonstrate that SDM-Q effectively reduces redundant modality acquisition while maintaining competitive classification performance compared with methods using complete multi-omics inputs. In the BRCA and KIPAN datasets, more than 99% and 95% of subjects, respectively, achieve accurate classification using only a single omics modality, while the average number of acquired modalities remains below two for ROSMAP and LGG. These results suggest that cost-aware sequential decision-making provides an effective paradigm for improving the efficiency of precision medicine workflows.
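The stop-or-acquire decision and the terminal-only reward described above can be sketched as follows. The abstract does not give the exact reward formula, so the linear trade-off below (correctness bonus minus weighted cumulative cost) is an assumption, and the names `terminal_reward`, `choose_action`, and `cost_weight` are hypothetical.

```python
from typing import Sequence

def terminal_reward(correct: bool, acquired_costs: Sequence[float],
                    cost_weight: float = 1.0) -> float:
    """Reward paid only at termination (assumed linear form):
    1 for a correct prediction, minus the weighted sum of acquisition costs."""
    return (1.0 if correct else 0.0) - cost_weight * sum(acquired_costs)

def choose_action(q_stop: float, q_acquire: Sequence[float]) -> int:
    """Greedy choice over action values.

    Returns -1 to stop and output a prediction, otherwise the index of the
    modality to acquire next. Stops when no modality is left to acquire.
    """
    if not q_acquire or q_stop >= max(q_acquire):
        return -1
    return max(range(len(q_acquire)), key=lambda i: q_acquire[i])
```

For example, `choose_action(0.9, [0.5, 0.8])` stops immediately because predicting now is valued above acquiring either remaining modality, which is the behavior the abstract reports for most BRCA and KIPAN subjects.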

2606.04029 2026-06-09 cs.LG cs.AI 版本更新

Position: Deployed Reinforcement Learning should be Continual

立场:部署的强化学习应该是持续的

Parnian Behdin, Kevin Roice, Golnaz Mesbahi

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文主张部署的强化学习系统应持续学习,分析了部署后非平稳性的四个来源,并展示了持续RL的优势和实现方法。

Comments Accepted to the ICML 2026 Position Paper Track. See https://icml.cc/virtual/2026/poster/67195

详情
AI中文摘要

强化学习(RL)在现实世界用例中受到越来越多的关注和采用。大多数系统遵循“训练-修复”范式,其中训练好的代理在与世界交互时不会学习,直到性能下降且需要重新训练。在这篇立场论文中,我们认为部署一个无法达到最优但接收评估奖励信号的代理本质上是一个持续的RL问题。我们确定了部署后导致需要永无止境学习的四个非平稳性来源,并强调了为什么最好的部署代理永远不会停止适应。我们分析了现实世界中持续RL的成功案例,并向社区展示了摆脱当前“训练-修复”范式的优势和措施。

英文摘要

Reinforcement Learning (RL) has received increasing attention and adoption in real-world use cases. Most of these systems follow a train-then-fix paradigm, where trained agents do not learn while interacting with the world until performance degrades and retraining becomes necessary. In this position paper, we argue that deploying an agent that is incapable of optimality, but receives an evaluative reward signal, is inherently a continual RL problem. We identify four sources of non-stationarity after deployment that necessitate never-ending learning, and highlight why the best deployed agents never stop adapting. We analyze successful examples of continual RL in the real world, and present the community with the advantages and measures to move away from the current train-then-fix paradigm.

2511.07317 2026-06-09 cs.CL cs.LG 版本更新

RLVE: Scaling Up Reinforcement Learning for Language Models with Adaptive Verifiable Environments

RLVE: 利用自适应可验证环境扩展语言模型的强化学习

Zhiyuan Zeng, Hamish Ivison, Yiping Wang, Lifan Yuan, Shuyue Stella Li, Zhuorui Ye, Siting Li, Jacqueline He, Runlong Zhou, Tong Chen, Chenyang Zhao, Yulia Tsvetkov, Simon Shaolei Du, Natasha Jaques, Hao Peng, Pang Wei Koh, Hannaneh Hajishirzi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出RLVE方法,通过程序化生成问题并提供可验证奖励的可验证环境,自适应调整难度以扩展语言模型的强化学习,在400个环境中联合训练使六个推理基准平均提升3.37%。

Comments ICML 2026

详情
AI中文摘要

我们引入了具有自适应可验证环境的强化学习(RLVE),该方法利用可验证环境程序化生成问题并提供算法可验证的奖励,以扩展语言模型(LM)的强化学习。RLVE使得每个可验证环境能够随着训练进程动态调整其问题难度分布以适应策略模型的能力。相比之下,静态数据分布往往导致当问题对策略来说太简单或太难时学习信号消失。为了实现RLVE,我们创建了RLVE-Gym,这是一个通过手动环境工程精心开发的大规模400个可验证环境套件。使用RLVE-Gym,我们展示了环境扩展,即扩大训练环境的集合,能够持续提高泛化推理能力。在RLVE-Gym的所有400个环境中进行联合训练的RLVE,从一个最强的1.5B推理LM开始,在六个推理基准上取得了3.37%的绝对平均提升。相比之下,继续该LM的原始RL训练仅获得0.49%的平均绝对增益,尽管使用了超过3倍的计算量。我们公开发布了代码。

英文摘要

We introduce Reinforcement Learning (RL) with Adaptive Verifiable Environments (RLVE), an approach using verifiable environments that procedurally generate problems and provide algorithmically verifiable rewards, to scale up RL for language models (LMs). RLVE enables each verifiable environment to dynamically adapt its problem difficulty distribution to the policy model's capabilities as training progresses. In contrast, static data distributions often lead to vanishing learning signals when problems are either too easy or too hard for the policy. To implement RLVE, we create RLVE-Gym, a large-scale suite of 400 verifiable environments carefully developed through manual environment engineering. Using RLVE-Gym, we show that environment scaling, i.e., expanding the collection of training environments, consistently improves generalizable reasoning capabilities. RLVE with joint training across all 400 environments in RLVE-Gym yields a 3.37% absolute average improvement across six reasoning benchmarks, starting from one of the strongest 1.5B reasoning LMs. By comparison, continuing this LM's original RL training yields only a 0.49% average absolute gain despite using over 3x more compute. We release our code publicly.

2604.04251 2026-06-09 cs.AI cs.CY cs.LG 版本更新

MC-CPO: Mastery-Conditioned Constrained Policy Optimization for Pedagogically Safe Intelligent Tutoring Systems

MC-CPO:面向教学安全智能辅导系统的掌握度条件约束策略优化

Oluseyi Olukola, Nick Rahimi

发表机构 * School of Computing Sciences and Computer Engineering, University of Southern Mississippi, Hattiesburg, MS 39406, USA(南密西西比大学计算科学与计算机工程学院,美国密西西比州哈蒂斯堡,MS 39406)

AI总结 本文提出 MC-CPO 框架,通过结构化约束解决教学安全问题,提升学习者知识掌握率,实验证明其在两个平台上的效果显著。

Comments 35 pages, 8 figures. v2: Major revision adding real-world validation on Junyi Academy (16.2M interactions, 72,758 students) and XES3G5M (NeurIPS 2023, 5.1M interactions, 14,453 students). Revised title and abstract. Submitted to Computers and Education: Artificial Intelligence

详情
AI中文摘要

智能辅导系统越来越多地依赖强化学习来个性化教学,但优化可观察的参与信号可能会系统性地使学习者活动与真正的知识获取脱节。通过分析两个已部署平台上超过2100万条学生交互,我们发现没有相应掌握度增长的参与事件在Junyi Academy(72,758名学生)上占交互的26.5%,在XES3G5M(14,453名学生,NeurIPS 2023)上占3.1%,证实这一现象在大规模部署的教育技术中可被直接观察到。我们引入Mastery-Conditioned Constrained Policy Optimisation (MC-CPO),一种从结构上解决该问题的强化学习框架。MC-CPO将可接受的教学动作空间条件化于学习者的掌握状态:只有当先决知识达到掌握阈值时,相应概念才可用,因此动作空间随学习者知识的积累而自然扩展。教学安全约束通过构造得到强制执行,并具有结构性先决安全、原始-对偶收敛以及严格优于事后过滤的形式化保证。MC-CPO是唯一在所有条件下都降低奖励黑客严重性的方法。相对于所有基线,平均每回合掌握度增长在Junyi Academy上提高18.3%,在XES3G5M上提高54.0%,同时保持有竞争力的参与表现。这些结果支持将结构化约束建模作为已部署辅导系统中更安全的自适应教学策略的原则性基础。

英文摘要

Intelligent tutoring systems increasingly rely on reinforcement learning to personalise instruction, yet optimising for observable engagement signals can systematically decouple learner activity from genuine knowledge acquisition. Analysing over 21 million student interactions across two deployed platforms, we find engagement events without corresponding mastery gains occur in 26.5% of interactions on Junyi Academy (72,758 students) and 3.1% on XES3G5M (14,453 students, NeurIPS 2023), confirming this pattern is directly observable in deployed educational technology at scale. We introduce Mastery-Conditioned Constrained Policy Optimisation (MC-CPO), a reinforcement learning framework that addresses this problem structurally. MC-CPO conditions the admissible instructional action space on learner mastery state: a concept becomes available only when prerequisite knowledge meets a mastery threshold, yielding an action space that expands naturally as learners acquire knowledge. Pedagogical safety constraints are enforced by construction, with formal guarantees of structural prerequisite safety, primal-dual convergence, and strict dominance over post-hoc filtering. MC-CPO is the only method to reduce reward hacking severity across all conditions. Mean per-episode mastery gain increases by 18.3% on Junyi Academy and 54.0% on XES3G5M relative to all baselines, while competitive engagement performance is maintained. These results support structural constraint modelling as a principled foundation for safer adaptive instructional policies in deployed tutoring systems.

2605.14211 2026-06-09 cs.AI cs.LG 版本更新

ASH: Agents that Self-Hone via Embodied Learning

ASH: 通过具身学习自我精炼的智能体

Benjamin Schneider, Xavier Schneider, Victor Zhong, Sun Sun

发表机构 * University of Waterloo(滑铁卢大学) National Research Council Canada(加拿大国家研究理事会)

AI总结 提出ASH系统,通过从无标签互联网视频中学习具身策略,利用自改进循环和逆动力学模型,在长时域任务中显著超越基线方法。

Comments Published as a workshop paper at ICML 2026 Workshop on Scalable Learning and Optimization for Efficient Multimodal AI Agents

详情
AI中文摘要

长时域具身任务仍然是AI中的一个基本挑战,因为当前方法依赖于手工设计的奖励或带动作标签的演示,两者都无法扩展。我们引入了ASH,一个智能体系统,它从无标签、嘈杂的互联网视频中学习具身策略,无需奖励塑造或专家注释。ASH遵循自我改进循环;当它卡住时,ASH从其自身轨迹中学习逆动力学模型(IDM),并利用其IDM从相关互联网视频中提取监督信号。ASH使用无监督学习从大规模互联网视频中识别关键时刻,并将其保留为长期记忆——使其能够处理长时域问题。我们在两个需要多小时规划的互补环境中评估ASH:回合制角色扮演游戏《宝可梦 绿宝石》和实时动作冒险游戏《塞尔达传说:缩小帽》。在这两个游戏中,行为克隆、检索增强和零样本基础模型基线趋于平稳,而ASH在我们的8小时评估中持续进步。ASH在《宝可梦 绿宝石》中平均达到11.2/12个里程碑,在《塞尔达传说》中平均达到9.9/12个里程碑,而最强基线在两个环境中分别卡在平均6.5/12和6.0/12个里程碑。我们证明了自我改进的智能体是长时域具身学习的可扩展方案。

英文摘要

Long-horizon embodied tasks remain a fundamental challenge in AI, as current methods rely on hand-engineered rewards or action-labeled demonstrations, neither of which scales. We introduce ASH, an agentic system that learns an embodied policy from unlabeled, noisy internet video, without reward shaping or expert annotation. ASH follows a self-improvement loop: when it gets stuck, ASH learns an Inverse Dynamics Model (IDM) from its own trajectories, and uses its IDM to extract supervision from relevant internet video. ASH uses unsupervised learning to identify key moments from large-scale internet video and retains them as long-term memory, allowing it to tackle long-horizon problems. We evaluate ASH on two complementary environments demanding multi-hour planning: Pokemon Emerald, a turn-based RPG, and The Legend of Zelda: The Minish Cap, a real-time action-adventure game. In both games, behavioral cloning, retrieval-augmented and zero-shot foundation-model baselines plateau, while ASH sustains progression across our 8-hour evaluation. ASH reaches an average of 11.2/12 milestones in Pokemon Emerald and 9.9/12 in Legend of Zelda, while the strongest baseline gets stuck in both environments at an average of 6.5/12 and 6.0/12 milestones, respectively. We demonstrate that self-improving agents are a scalable recipe for long-horizon embodied learning.

2605.24660 2026-06-09 cs.IR cs.AI cs.LG 版本更新

How Many Tools Should an LLM Agent See? A Chance-Corrected Answer

LLM 智能体应看到多少工具?一种机会校正的答案

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh, Joey Blackwell

发表机构 * Meta Platforms(Meta平台)

AI总结 针对 LLM 智能体工具选择中候选列表长度优化问题,提出基于机会校正的 Bits-over-Random (BoR) 指标,并将其转化为强化学习奖励,实现每查询自适应深度选择,在保持覆盖率的同时显著减少展示工具数量并提升下游工具选择准确率。

Comments 13 pages, 2 figures

详情
AI中文摘要

在 LLM 智能体使用工具之前,检索系统必须决定向智能体展示哪些候选工具。这个候选列表应该多长?展示太多工具,模型难以选择;展示太少,正确的工具可能不会出现。大多数系统对每个查询应用固定的候选列表大小,但缺乏标准指标来评估该大小是否合适。我们将展示给 LLM 智能体的工具数量作为评估对象,并应用 Bits-over-Random (BoR),一种机会校正的指标,询问在给定深度下的成功是否优于随机选择在同一深度下的表现。我们在三个工具选择基准、多个评分器以及从 20 到 3,251 个工具不等的注册表上评估 BoR。然后,我们将相同的原理转化为强化学习 (RL) 奖励,用于每查询选择工具候选列表深度。RL 智能体故意设计得简单,作为指标的探针而非提议的系统。随着候选列表增长,随机包含正确工具的机会增加,因此奖励自然减少,减少了对工程化深度惩罚的需求。在 BFCL(370 个工具)上,学习到的策略几乎匹配展示 50 个工具的覆盖率(90.3% 对 90.8%),而平均仅展示 7 个。在 ToolBench(3,251 个工具)上,固定展示 5 个工具实现了更高的总覆盖率(64.7% 对 61.9%),但在困难查询(正确工具排名第 6-20 位)上未找到任何工具。BoR 智能体通过搜索更深层,在这些查询上找到了 16.7%。使用 Claude Sonnet 4.6 的下游验证表明,更短的自适应列表也提高了 LLM 选择正确工具的能力:与始终展示 5 个工具时的 87.1% 相比,达到了 93.1%;在中等难度查询(正确工具存在但未排名第一)上,差距进一步扩大,为 76.8% 对 60.9%。

英文摘要

Before an LLM agent can use a tool, a retrieval system must decide which candidate tools to show to the agent. How long should that shortlist be? Show too many tools and the model struggles to choose. Show too few and the correct tool may not appear. Most systems apply a fixed shortlist size to every query, but no standard metric exists to evaluate whether that size was appropriate. We treat the number of tools shown to an LLM agent as the object of evaluation and we apply Bits-over-Random (BoR), a chance-corrected metric that asks whether success at a given depth is better than what random selection would achieve at that same depth. We evaluate BoR across three tool-selection benchmarks, multiple scorers, and registries ranging from 20 to 3,251 tools. We then turn the same principle into a reinforcement learning (RL) reward for choosing tool shortlist depth per query. The RL agent is deliberately simple, serving as a probe of the metric rather than a proposed system. As the shortlist grows, random chance of including the correct tool rises, so the reward naturally decreases, reducing the need for an engineered depth penalty. On BFCL (370 tools), the learned policy nearly matches the coverage of showing 50 tools (90.3% vs 90.8%) while presenting only 7 on average. On ToolBench (3,251 tools), a fixed shortlist of 5 tools achieves higher aggregate coverage (64.7% vs 61.9%) but finds nothing on hard queries (correct tool ranked 6th-20th). The BoR agent finds 16.7% on those same queries by searching deeper. Downstream validation with Claude Sonnet 4.6 indicates that shorter adaptive lists also improve the LLM's ability to select the right tool: 93.1% versus 87.1% when always shown 5 tools, widening to 76.8% vs 60.9% on medium-difficulty queries where the correct tool is present but not ranked first.
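The chance-correction idea can be made concrete with a small sketch. The abstract does not state the BoR formula, so the definition below is an assumption suggested by the metric's name: the base-2 log ratio between the observed success rate at a given depth and the hit rate of a uniformly random shortlist of the same depth, assuming exactly one correct tool per query. The function names are hypothetical.

```python
import math

def random_hit_rate(depth: float, n_tools: int) -> float:
    """Chance that a uniformly random shortlist of `depth` tools
    contains the single correct tool out of `n_tools`."""
    return min(depth, n_tools) / n_tools

def bits_over_random(success_rate: float, depth: float, n_tools: int) -> float:
    """Assumed BoR form: log2(observed success / random success at the same depth)."""
    if success_rate <= 0.0:
        return float("-inf")
    return math.log2(success_rate / random_hit_rate(depth, n_tools))

# Figures taken from the abstract (BFCL, 370 tools). Plugging the *average*
# adaptive depth of 7 into the formula is a simplification for illustration.
adaptive_bits = bits_over_random(0.903, 7, 370)
fixed_bits = bits_over_random(0.908, 50, 370)
```

Under this assumed form, nearly equal coverage at a much shallower depth scores higher, because a random shortlist of 50 tools already has a far better chance of containing the correct tool than a random shortlist of 7.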

2605.25624 2026-06-09 cs.AI cs.LG 版本更新

CUA-Gym: Scaling Verifiable Training Environments and Tasks for Computer-Use Agents

CUA-Gym:为计算机使用智能体扩展可验证的训练环境和任务

Bowen Wang, Dunjie Lu, Junli Wang, Tianyi Bai, Shixuan Liu, Zhipeng Zhang, Haiquan Wang, Hao Hu, Tianbao Xie, Shuai Bai, Dayiheng Liu, Que Shen, Junyang Lin, Tao Yu

发表机构 * The University of Hong Kong(香港大学) Qwen Team, Alibaba Inc.(阿里巴巴集团Qwen团队) University of California, San Diego(加州大学圣地亚哥分校) Tsinghua University(清华大学)

AI总结 提出CUA-Gym可扩展流水线,通过协同生成任务指令、环境状态和奖励函数,构建大规模可验证强化学习训练数据,并合成CUA-Gym-Hub模拟网络应用环境,训练出的智能体在OSWorld-Verified和WebArena上取得领先性能。

详情
AI中文摘要

具有可验证奖励的强化学习(RLVR)在数学、工具使用和软件工程等领域取得了突破,但其在计算机使用智能体(CUA)上的应用受到缺乏具有确定性奖励的可扩展训练数据的瓶颈。为CUA构建此类数据需要一致的任务指令、可执行的环境和可验证的奖励。然而,手工策划的基准测试实现了高奖励保真度,但覆盖的应用很少;基于LLM作为评判者的数据集广泛扩展,但缺乏可靠的验证。我们提出了CUA-Gym,一个可扩展的流水线,协同生成任务指令、环境状态和奖励函数。具体来说,一个生成器智能体构建初始和黄金环境状态,一个独立的判别器智能体根据任务规范编写奖励函数。一个编排器智能体基于执行结果,通过多轮迭代驱动两者。生成的元组通过一个结合LLM多数投票和智能体轨迹采样(rollout)的最终过滤器,确保超出每任务对抗循环的质量。为了解决训练环境稀缺的问题,我们进一步合成了CUA-Gym-Hub,一套基于真实软件使用分布的高保真模拟网络应用程序,将CUA RLVR数据的规模扩大了一个数量级。使用此流水线,我们构建了CUA-Gym数据集,包含32,112个基于110个环境的已验证RLVR训练元组。在CUA-Gym上使用GSPO训练的CUA-Gym-A3B和CUA-Gym-A17B在OSWorld-Verified上分别达到62.1%和72.6%,在可比规模上优于先前的开源CUA,并且性能随数据量和环境多样性平滑扩展。相同的检查点还在保留的WebArena基准测试上有所改进,表明能力可迁移到训练环境之外。我们将开源完整的合成流水线、数据集、CUA-Gym-Hub环境和模型。

英文摘要

Reinforcement learning with verifiable rewards (RLVR) has driven breakthroughs in domains such as math, tool use, and software engineering, yet its extension to computer-use agents (CUAs) has been bottlenecked by the scarcity of scalable training data with deterministic rewards. Constructing such data for CUAs requires consistent task instructions, executable environments, and verifiable rewards. However, hand-curated benchmarks achieve high reward fidelity but cover few applications, while LLM-as-judge-based datasets scale broadly but lack reliable verification. We present CUA-Gym, a scalable pipeline that co-generates task instructions, environment states, and reward functions. Concretely, a Generator agent constructs the initial and golden environment states, and a separate Discriminator agent writes the reward function from the task specification. An orchestrator agent drives the two through iterative rounds based on execution. Generated tuples then pass a final filter combining LLM majority voting and agent rollouts, ensuring quality beyond the per-task adversarial loop. To address the scarcity of training environments, we further synthesize CUA-Gym-Hub, a broad suite of high-fidelity mock web applications grounded in real-world software-use distributions, expanding the scale of CUA RLVR data by an order of magnitude. Using this pipeline, we construct CUA-Gym, a dataset of 32,112 verified RLVR training tuples grounded in 110 environments. Trained with GSPO on CUA-Gym, our CUA-Gym-A3B and CUA-Gym-A17B achieve 62.1% and 72.6% on OSWorld-Verified, outperforming prior open-source CUAs at comparable scales, with performance scaling smoothly in both data volume and environment diversity. The same checkpoints also improve on the held-out WebArena benchmark, indicating transfer beyond the training environments. We will open-source the full synthesis pipeline, dataset, CUA-Gym-Hub environments, and models.

2605.26452 2026-06-09 cs.RO cs.LG cs.SY eess.SY 版本更新

Robust Koopman Control Barrier Filters for Safe Actor-Critic Reinforcement Learning

鲁棒Koopman控制屏障滤波器用于安全演员-评论家强化学习

Dhruv S. Kushwaha, Zoleikha A. Biron

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出鲁棒Koopman-CBF SAC框架,通过数据驱动学习Koopman预测器、构建提升空间中的仿射CBF约束并利用二次规划安全层实施,同时通过投影残差裕度处理近似误差,实现零约束违反或减少违规。

Comments 17 pages, 7 figures

详情
AI中文摘要

机器人系统的安全强化学习需要策略在训练和部署期间满足状态和输入约束的同时提高任务性能。控制屏障函数通过最小侵入性安全滤波器提供强制执行前向不变性的原则性机制,但其在无模型强化学习中的应用受限于对精确动力学和手工设计屏障证书的需求。我们提出鲁棒Koopman-CBF SAC,一种安全滤波的演员-评论家框架,从数据中学习有限维Koopman预测器,在提升空间中构建仿射CBF约束,并通过二次规划安全层强制执行。为考虑有限维Koopman近似误差,使用从留出轨迹数据估计的投影残差裕度收紧CBF条件。评论家在实际执行的安全动作上训练,而演员则被正则化向Koopman-CBF可行集,减少训练中对滤波器的依赖。在安全控制基准测试中,该方法在CartPole稳定和跟踪上实现零约束违反,同时匹配或超过无约束SAC的回报。在高维Safety Gymnasium运动任务中,该方法在某些设置下减少了违规,但也暴露了一阶速度屏障和线性EDMD模型的重要局限性,推动了高阶和多步Koopman-CBF扩展。这些结果表明,鲁棒Koopman-CBF滤波器是连接无模型强化学习和可证明安全的有前途的桥梁,同时阐明了此类滤波器保持有效的结构条件。所有代码可在GitHub仓库获取:https://github.com/DhruvKushwaha/Koopman-CBF-Soft-Actor-Critic

英文摘要

Safe reinforcement learning (RL) for robotic systems requires policies that improve task performance while satisfying state and input constraints during both training and deployment. Control barrier functions (CBFs) provide a principled mechanism for enforcing forward invariance through minimally invasive safety filters, but their use in model-free RL is limited by the need for accurate dynamics and hand-designed barrier certificates. We propose Robust Koopman-CBF SAC, a safety-filtered actor--critic framework that learns a finite-dimensional Koopman predictor from data, constructs affine CBF constraints in the lifted space, and enforces them through a quadratic-program safety layer. To account for finite-dimensional Koopman approximation error, the CBF condition is tightened using a projected residual margin estimated from held-out rollout data. The critic is trained on the executed safe action, while the actor is regularized toward the Koopman-CBF feasible set, reducing dependence on the filter over training. Across safe-control benchmarks, the method achieves zero constraint violations on CartPole stabilization and tracking while matching or exceeding unconstrained SAC returns. On high-dimensional Safety Gymnasium locomotion tasks, the method reduces violations in some settings but also exposes important limitations of first-order velocity barriers and linear EDMD models, motivating high-order and multi-step Koopman-CBF extensions. These results suggest that robust Koopman-CBF filters are a promising bridge between model-free RL and certifiable safety, while clarifying the structural conditions under which such filters remain effective.
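A quadratic-program safety layer of the kind mentioned above has a closed form when there is a single affine constraint, which makes the "minimally invasive" property easy to see. The sketch below is not the paper's implementation: it assumes the learned Koopman model has already been reduced to a constraint `a @ u + b >= margin` on the action `u`, where `a`, `b`, and the robustness `margin` are placeholders supplied by the caller, and the name `cbf_safety_filter` is hypothetical.

```python
import numpy as np

def cbf_safety_filter(u_nom, a, b, margin=0.0):
    """Solve  min_u ||u - u_nom||^2  s.t.  a @ u + b >= margin.

    With one affine constraint the QP is a Euclidean projection onto a
    half-space, so no numerical solver is needed.
    """
    u_nom = np.asarray(u_nom, dtype=float)
    a = np.asarray(a, dtype=float)
    slack = a @ u_nom + b - margin
    if slack >= 0.0:
        # Nominal action already satisfies the tightened constraint.
        return u_nom.copy()
    norm_sq = a @ a
    if norm_sq == 0.0:
        raise ValueError("infeasible: the action cannot affect the constraint")
    # Move the smallest distance along `a` needed to reach the boundary.
    return u_nom - (slack / norm_sq) * a
```

A positive `margin` plays the role of the tightening described in the abstract: the filtered action is pushed further inside the safe set to leave room for model error. Real systems usually stack several constraints plus input bounds, which requires a general QP solver.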

2606.01619 2026-06-09 cs.AI cs.LG stat.ML 版本更新

ReSkill: Reconciling Skill Creation with Policy Optimization in Agentic RL

ReSkill:在智能体强化学习中协调技能创建与策略优化

Zelin He, Haotian Lin, Boran Han, Wei Zhu, Haoyang Fang, Bernie Wang, Xuan Zhu, Runze Li, Matthew Reimherr

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出ReSkill框架,通过GRPO的组结构嵌入断言驱动技能创建、组内轨迹采样和自适应汤普森采样,实现技能与策略的协同进化,在多个领域超越现有方法。

详情
AI中文摘要

智能体强化学习使LLM智能体能够从环境奖励中持续改进,但由此产生的策略并未系统地积累可跨任务泛化的可重用策略。模块化技能可以提供此类可重用策略,然而现有的技能增强强化学习方法将技能创建与策略优化分离,存在采用与进化策略冲突的技能的风险。受Anthropic的Skill Creator启发,我们引入ReSkill,一种强化学习在环的技能创建框架,协调技能进化与策略学习。ReSkill利用GRPO的组结构自然嵌入三种机制,仅需少量额外开销:(1)断言驱动的技能创建器,从过去经验中诊断失败并提出基于条件的触发式技能修订;(2)组内轨迹采样,实现技能版本的可控比较,捕获哪个版本最能支持策略的持续学习;(3)自适应折扣的汤普森采样,在策略进化过程中平衡技能版本选择的探索与利用。在多个领域,ReSkill始终优于现有的基于记忆和技能的强化学习方法,在未见任务上提升最大。对技能生命周期的分析显示,随着策略改进,技能被自动创建、测试、精炼和修剪,展示了协调的技能-策略协同进化。

英文摘要

Agentic reinforcement learning (RL) enables LLM agents to improve continuously from environment rewards, yet the resulting policies do not systematically accumulate reusable strategies that generalize across tasks. Modular skills can provide such reusable strategies, yet existing skill-augmented RL methods decouple skill creation from policy optimization, risking adopting skills that conflict with the evolving policy. Inspired by Anthropic's Skill Creator, we introduce ReSkill, an RL-in-the-loop skill creation framework that reconciles skill evolution with policy learning. ReSkill exploits the group-wise structure of GRPO to naturally embed three mechanisms with only marginal additional overhead: (1) an assertion-driven skill creator that diagnoses failures from past experience and proposes conditional, trigger-based skill revisions; (2) within-group rollout sampling that enables controlled comparison of skill versions, capturing which version best supports the policy's ongoing learning; and (3) Thompson Sampling with adaptive discounting to balance exploration and exploitation in skill version selection as the policy evolves. Across several domains, ReSkill consistently outperforms existing memory and skill-based RL methods, with the largest gains on unseen tasks. Analysis of the skill lifecycle shows skills being automatically created, tested, refined, and pruned as the policy improves, demonstrating reconciled skill-policy co-evolution.

2606.04421 2026-06-09 cs.AI cs.LG 版本更新

Trivium: Temporal Regret as a First-Class Objective for Causal-Memory Controllers

Trivium: 时间遗憾作为因果记忆控制器的一等目标

Edward Y. Chang

发表机构 * Stanford University(斯坦福大学)

AI总结 本文提出将长期时间遗憾作为一等目标,与结果遗憾和认知遗憾共同构成因果记忆控制器的可证伪失败分析框架,证明时间校准偏差在对结果遗憾为零时仍线性增长,而基于持久因果日志的探测复杂度为对数级。

Comments 62 pages, 12 tables, 12 figures

详情
AI中文摘要

许多当前的智能体系统和LLM管道通过优化结果奖励来纠正错误。这仅解决了失败的“什么”:当结果偏离预测时,不匹配的“为什么”和“何时”没有被系统地记录、审查或纠正,因此相同的错误可能反复出现。我们认为这是一个结构性问题,而不仅仅是模型容量问题。我们提出将长期时间遗憾作为一等目标,与结果遗憾和工作因果模型上的认知遗憾并列。时间遗憾捕捉失败持续的时间:在纠正之前,一个校准错误的因果模型被容忍了多久。认知遗憾捕捉失败持续的原因:工作因果模型中的残余不确定性或错误。这三个遗憾共同给出了一个可证伪的说明,关于一个长期存在的智能体可能失败的原因、内容和时间。将智能体建模为E个片段的流,我们在显式因果探测、持久性和可检测性假设下证明了三个条件结果。首先,在观测等价混淆下,仅基于结果的学习无法在没有干预通道的情况下区分因果结构和虚假结构,因此时间校准偏差可以在结果遗憾被降至零后仍线性持续。其次,使用持久因果日志和预算探测,总探测复杂度是片段范围的对数,导致O(log E)的时间遗憾。第三,在K个可检测变化点下,速率扩展为O(K log E)。我们实例化了Trivium并预注册了五个可证伪预测。在CausalBench-Seq上,Trivium遵循预测的对数包络线,而仅基于结果的基线线性增长。一个真实LLM流的初步外部有效性证据跨越了一个完整的E=500运行和三个E=100前沿模型试点。这里的自学习意味着修正外部因果模型,而不是重新训练LLM权重。

英文摘要

Many current agentic systems and LLM pipelines correct mistakes by optimizing outcome reward. This addresses only the what of failure: when an outcome diverges from prediction, the why and when of the mismatch are not systematically logged, reviewed, or corrected, so the same error can recur episode after episode. We argue that this is a structural problem, not merely a model-capacity one. We propose long-horizon temporal regret as a first-class objective alongside outcome regret and epistemic regret over the working causal model. Temporal regret captures when failure persists: how long a miscalibrated causal model is tolerated before correction. Epistemic regret captures why failure persists: residual uncertainty or error in the working causal model. Together, the three regrets give a falsifiable account of what, why, and when a long-lived agent can fail. Modeling the agent as a stream of E episodes, we prove three conditional results under explicit causal-probing, persistence, and detectability assumptions. First, under observationally equivalent confounding, outcome-only learning cannot distinguish causal from spurious structure without an intervention channel, so temporal miscalibration can persist linearly even after outcome regret is driven to zero. Second, with a persistent causal log and budgeted probes, total probe complexity is logarithmic in the episode horizon, inducing O(log E) temporal regret. Third, under K detectable change-points, the rate extends to O(K log E). We instantiate Trivium and pre-register five falsifiable predictions. On CausalBench-Seq, Trivium follows the predicted logarithmic envelope while outcome-only baselines grow linearly. A pilot real-LLM stream provides preliminary external-validity evidence across one full E = 500 run and three E = 100 frontier-model pilots. Self-learning here means revising an external causal model, not retraining LLM weights.

2205.01970 2026-06-09 cs.LG stat.ML 版本更新

Non-Stationary Bandit Learning via Predictive Sampling

非平稳老虎机学习中的预测采样

Yueyang Liu, Xu Kuang, Benjamin Van Roy

发表机构 * Jones Graduate School of Business, Rice University(莱斯大学琼斯商学院) Stanford Graduate School of Business(斯坦福商学院) Department of Management Science and Engineering, Department of Electrical Engineering, Stanford University(斯坦福大学管理科学与工程系、电气工程系)

AI总结 本文提出预测采样算法,在探索时降低获取快速失效信息的优先级,以改进非平稳环境下的老虎机学习,并通过贝叶斯遗憾界给出理论保证,数值模拟表明其优于Thompson sampling。

详情
AI中文摘要

Thompson sampling在广泛的平稳老虎机环境中表现良好,但应用于非平稳环境时可能表现不佳。本文指出,此类失败源于算法在探索时未根据所获信息因非平稳性而失效的速度来区分行动。基于这一洞察,我们提出预测采样算法,它降低获取快速失效信息的优先级。我们通过贝叶斯遗憾界为预测采样的性能建立了理论保证,并给出了计算上可扩展到具有实际意义的复杂老虎机环境的版本。数值模拟表明,预测采样在所有考察的非平稳环境中均优于Thompson sampling。

英文摘要

Thompson sampling has proven effective across a wide range of stationary bandit environments. However, as we demonstrate in this paper, it can perform poorly when applied to non-stationary environments. We show that such failures are attributable to the fact that, when exploring, the algorithm does not differentiate actions based on how quickly the information acquired loses its usefulness due to non-stationarity. Building upon this insight, we propose predictive sampling, an algorithm that deprioritizes acquiring information that quickly loses usefulness. A theoretical guarantee on the performance of predictive sampling is established through a Bayesian regret bound. We provide versions of predictive sampling for which computations tractably scale to complex bandit environments of practical interest. Through numerical simulations, we demonstrate that predictive sampling outperforms Thompson sampling in all non-stationary environments examined.
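For readers unfamiliar with the baseline, here is a minimal sketch of standard Beta-Bernoulli Thompson sampling in a stationary environment. It is the textbook algorithm the abstract compares against, not predictive sampling itself, whose details the abstract does not give. The arm means and the function name are illustrative assumptions.

```python
import numpy as np

def thompson_sampling(true_means, n_rounds=2000, seed=0):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was pulled."""
    rng = np.random.default_rng(seed)
    n_arms = len(true_means)
    successes = np.ones(n_arms)  # Beta(1, 1) uniform prior on every arm
    failures = np.ones(n_arms)
    pulls = np.zeros(n_arms, dtype=int)
    for _ in range(n_rounds):
        # Sample one plausible mean per arm from its posterior, then act greedily.
        sampled_means = rng.beta(successes, failures)
        arm = int(np.argmax(sampled_means))
        reward = float(rng.random() < true_means[arm])
        successes[arm] += reward
        failures[arm] += 1.0 - reward
        pulls[arm] += 1
    return pulls

pulls = thompson_sampling([0.2, 0.8])
```

The posterior counts here never forget old observations, which is harmless when the arm means are fixed. In a non-stationary environment the same counts go stale, and the algorithm can keep exploring to resolve uncertainty about quantities that will soon change anyway, which is the failure mode the abstract describes.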

4. 生成模型与概率建模 47 篇

2606.07569 2026-06-09 cs.LG 新提交

TriHead-GAN: A Generative Adversarial Network with Triple-Head Discriminator for Carbon Emission Time Series Generation

TriHead-GAN: 一种具有三头判别器的生成对抗网络用于碳排放时间序列生成

Zesen Wang, Lijuan Lan, Yonggang Li, Chunhua Yang

发表机构 * SanMuGuo

AI总结 针对城市级高频碳排放数据稀缺问题,提出TriHead-GAN,通过三头判别器联合监督分布真实性、跨变量依赖和逐步时间平滑性,在多个数据集上优于主流基线并提升下游预测精度。

详情
AI中文摘要

准确的碳排放监测对于气候政策和新兴监管机制(如欧盟碳边境调节机制)至关重要,然而城市级高频监测数据仍然极为稀缺,严重限制了数据饥渴的深度学习模型。时间序列生成是一种自然的补救措施,但现有的基于GAN和扩散的生成器通常对碳排放数据的领域结构提供的显式监督有限:它们可能匹配边际分布统计量,但未能充分保留CO₂与共排放污染物和气象因素之间的跨变量相关性,并且倾向于破坏大气测量的一阶差分统计量,产生平均而言平滑但缺乏底层信号真实逐步变异性的序列。我们提出TriHead-GAN,一种基于Transformer的对抗框架,其三头判别器联合监督联合分布的三个互补方面:通过Wasserstein评判器监督分布真实性,通过目标变量的无泄漏回归监督跨变量依赖性,以及通过相邻差分预测监督逐步时间平滑性。生成器结合了全局自注意力与局部时间卷积、每步噪声注入以及匹配一阶差分统计量的抗平滑损失。在自收集的长沙碳数据集、两个公共碳数据集(中国、美国)以及ETTh1基准上的实验表明,TriHead-GAN在绝大多数设置下优于主流基线,并且生成的合成窗口在低资源碳监测场景中提高了下游预测准确性。

英文摘要

Accurate carbon emission monitoring is critical for climate policy and emerging regulatory mechanisms such as the EU Carbon Border Adjustment Mechanism, yet city-level high-frequency monitoring data remain extremely scarce, severely limiting data-hungry deep learning models. Time series generation is a natural remedy, but existing GAN and diffusion-based generators often provide limited explicit supervision for the domain structure of carbon emission data: they may match marginal distributional statistics while insufficiently preserving cross-variable correlations between CO₂ and co-emitted pollutants and meteorological factors, and tend to collapse the first-difference statistics of atmospheric measurements, producing sequences that are smooth on average but lack the realistic step-wise variability of the underlying signals. We propose TriHead-GAN, a Transformer-based adversarial framework whose triple-head discriminator jointly supervises three complementary aspects of the joint distribution: distributional authenticity via a Wasserstein critic, cross-variable dependency via leakage-free regression of the target variable, and step-wise temporal smoothness via adjacent-difference prediction. The generator combines global self-attention with local temporal convolution, per-step noise injection, and an anti-smoothing loss that matches first-difference statistics. Experiments on the self-collected Changsha Carbon dataset, two public carbon datasets (China, US), and the ETTh1 benchmark show that TriHead-GAN achieves favorable performance over mainstream baselines on the vast majority of settings, and that the resulting synthetic windows improve downstream forecasting accuracy in low-resource carbon monitoring scenarios.
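The idea of matching first-difference statistics can be sketched in a few lines. The abstract does not specify which statistics the anti-smoothing loss compares, so the version below assumes the mean and standard deviation of adjacent differences. It uses NumPy on fixed arrays rather than a differentiable training loss, and the function names are hypothetical.

```python
import numpy as np

def first_difference_stats(series):
    """Mean and standard deviation of adjacent differences along the time axis."""
    diffs = np.diff(np.asarray(series, dtype=float), axis=-1)
    return diffs.mean(), diffs.std()

def anti_smoothing_loss(real, generated):
    """Squared mismatch between first-difference mean and std (assumed form)."""
    real_mean, real_std = first_difference_stats(real)
    gen_mean, gen_std = first_difference_stats(generated)
    return (real_mean - gen_mean) ** 2 + (real_std - gen_std) ** 2

# Toy data: random walks stand in for real windows, and a 9-point moving
# average stands in for an over-smoothed generator output.
rng = np.random.default_rng(0)
real = np.cumsum(rng.standard_normal((64, 96)), axis=-1)
kernel = np.ones(9) / 9.0
smoothed = np.apply_along_axis(
    lambda row: np.convolve(row, kernel, mode="valid"), -1, real
)
```

The smoothed windows follow the same overall level as the real ones, yet their step-to-step variability is much smaller. `anti_smoothing_loss(real, smoothed)` is therefore clearly positive while `anti_smoothing_loss(real, real)` is zero, which is the kind of signal a marginal-distribution match alone would miss.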

2606.07599 2026-06-09 cs.LG cs.AI cs.CV 新提交

DiffoR: A Unified Continuous Generative Framework for Universal Ordinal Regression

DiffoR:一种统一的连续生成框架用于通用序数回归

Hongxu Ma, Lin Wang, Chenghou Jin, Han Zhou, Jie Zhang, Xiaoyu Yang, Chunjie Chen, Jihong Guan, Shuigeng Zhou

发表机构 * Fudan University(复旦大学) Kuaishou Technology(快手科技) Shanghai University of Finance and Economics(上海财经大学) Tongji University(同济大学)

AI总结 提出DiffOR框架,将序数回归建模为连续生成任务,利用扩散模型通过迭代去噪恢复连续序数值,并设计双解耦策略(多尺度增量聚合与动态去噪感知)保留序数拓扑,在12个基准上超越现有方法。

Comments Accepted at KDD 2026

详情
AI中文摘要

序数回归(OR)旨在预测具有内在顺序的目标值,支撑着从推荐系统到计算机视觉等多个领域的关键应用。尽管从朴素回归发展到基于离散化的分类和生成,现有范式仍然受到量化伪影和缺乏全局序数拓扑感知的根本限制。这些方法通常强制执行刚性边界划分,无法捕捉序数数据固有的非平稳语义转换。在本文中,我们提出了一种新范式,将OR形式化为连续生成序数回归任务。在该新范式下,我们引入了DiffOR,一个统一的框架,利用扩散模型通过迭代去噪恢复连续序数值,从而能够动态学习软语义转换。为了显式保留序数拓扑,我们设计了一种双解耦策略:在空间上,多尺度增量聚合将目标分解为层次化的连续增量;在时间上,动态去噪感知将去噪步骤与特征频率同步,确保稳健的从粗到细的细化。理论上,我们证明了所提方法可以显著增强表示能力和机制可解释性。在四个领域的12个基准上的大量实验验证了DiffOR相对于最先进方法的一致优越性,建立了一个新标准,展示了作为通用序数回归通用解决方案的强大潜力。

英文摘要

Ordinal Regression (OR) aims to predict target values with inherent order, underpinning critical applications across diverse domains, from recommender systems to computer vision. Though having evolved from naive regression to discretization-based classification and generation, existing paradigms remain fundamentally constrained by quantization artifacts and the lack of global ordinal topological perception. These methods typically enforce rigid boundary delineations, failing to capture the non-stationary semantic transitions inherent to ordinal data. In this paper, we propose a novel paradigm where OR is formulated as a Continuous Generative Ordinal Regression task. Under the novel paradigm, we introduce DiffOR, a unified framework that leverages diffusion models to recover continuous ordinal values via iterative denoising, thereby enabling the dynamic learning of soft semantic transitions. To explicitly preserve ordinal topology, we devise a Dual-Decoupling Strategy: Spatially, Multi-scale Increment Aggregation decomposes targets into hierarchical continuous increments; Temporally, Dynamic Denoising Perception synchronizes denoising steps with feature frequencies, ensuring robust coarse-to-fine refinement. Theoretically, we show that the proposed method can significantly enhance both representation capability and mechanistic interpretability. Extensive experiments on 12 benchmarks across four domains validate DiffOR's consistent superiority over state-of-the-art methods, establishing a new standard that demonstrates strong potential as a general-purpose solution for universal ordinal regression.

2606.07760 2026-06-09 cs.LG 新提交

scCBGM: Interpretable Single-Cell Counterfactual Editing

scCBGM:可解释的单细胞反事实编辑

Alma Andersson, Aya Abdelsalam Ismail, Edward De Brouwer, Doron Haviv, Tommaso Biancalani, Kyunghyun Cho, Gabriele Scalia, Aïcha BenTaieb, Hector Corrada Bravo

发表机构 * University of Copenhagen(哥本哈根大学) University of Cambridge(剑桥大学) University of Amsterdam(阿姆斯特丹大学) University of California, Berkeley(加州大学伯克利分校) University of Tokyo(东京大学) University of Washington(华盛顿大学) University of Oxford(牛津大学)

AI总结 提出scCBGM框架,通过概念瓶颈架构和解耦惩罚实现单细胞反事实编辑,在组合泛化和反事实预测上表现优异。

Comments Accepted to ICML 2026; code at https://github.com/almaan/scCBGM

AI中文摘要

理解细胞表型及其对扰动的响应对于疾病生物学和治疗设计至关重要。单细胞RNA测序能够在细胞分辨率下进行表征,但条件的组合空间使得穷举实验映射不可行。我们引入了单细胞概念瓶颈生成模型(scCBGM),这是一个用于对单个细胞进行可解释且精确的反事实编辑的框架。scCBGM通过解码器跳跃连接和促进无维度约束解耦的交叉协方差惩罚,将概念瓶颈架构适应于单细胞数据。我们将该框架扩展到流匹配模型,从而在编码-解码和生成两种模式下实现概念引导的编辑。为了进行严格评估,我们开发了一个具有真实反事实的合成基准。在多个真实数据集上,scCBGM在组合泛化和反事实预测方面表现出优越性能,并通过合成数据上的细胞级验证和真实数据集上的群体级基准得到了支持。

英文摘要

Understanding cellular phenotypes and how they respond to perturbations is critical for disease biology and therapeutic design. Single-cell RNA sequencing enables characterization at cellular resolution, yet the combinatorial space of conditions makes exhaustive experimental mapping infeasible. We introduce single-cell Concept Bottleneck Generative Models (scCBGM), a framework for interpretable and precise counterfactual editing of individual cells. scCBGM adapts concept bottleneck architectures for single-cell data through decoder skip connections and a cross-covariance penalty that promotes disentanglement without dimensional constraints. We extend the framework to flow matching models, enabling concept-guided editing in both encoding-decoding and generation regimes. To enable rigorous evaluation, we develop a synthetic benchmark with ground-truth counterfactuals. Across multiple real datasets, scCBGM demonstrates superior performance in combinatorial generalization and counterfactual prediction, supported by cell-level validation on synthetic data and population-level benchmarks on real datasets.

2606.07835 2026-06-09 cs.LG 新提交

Mitigating the Contractivity Trap in Diffusion ODEs via Stein Stabilization

通过Stein稳定化缓解扩散ODE中的收缩陷阱

Shigui Li, Delu Zeng

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对扩散模型确定性概率流ODE大步长推理中的收缩陷阱问题,提出SteinDiff框架,通过Stein导出的几何感知残差校正机制正则化求解器更新,无需参考样本即可提升生成质量。

Comments 32 pages, 12 figures. Accepted to ICML 2026

AI中文摘要

在扩散模型通过其确定性概率流常微分方程(PF-ODE)轨迹进行大步长推理时,存在一个基本张力,我们称之为收缩陷阱:高效推理倾向于大步长,而激进的步长和高表达能力的去噪器可能破坏基于收缩的误差抑制稳定性保证。为了解决这个问题,我们提出了SteinDiff,一种逐步推理时稳定化框架,它采用Stein导出的校正,无需参考样本。具体来说,SteinDiff引入了一种几何感知残差校正机制,在不重新训练的情况下正则化大步长求解器更新。为此,我们推导了用于逐步求解器调整的闭式Stein校正系数,实现了对局部数据几何的无参考自适应。我们进一步建立了在分布偏移下的分数控制扰动界,并提供了对EDM风格参数化的补充Stein视角。大量实验表明,SteinDiff在大步长推理设置中减轻了严重伪影并提高了生成质量。

英文摘要

A fundamental tension exists in the large-step inference of diffusion models via their deterministic probability flow ordinary differential equation (PF-ODE) trajectories, which we identify as the contractivity trap: efficient inference favors large step sizes, while aggressive steps and highly expressive denoisers can undermine contraction-based stability certificates for error suppression. To address this, we propose SteinDiff, a step-wise inference-time stabilization framework that employs Stein-derived corrections without requiring reference samples. Specifically, SteinDiff introduces a geometry-aware residual correction mechanism that regularizes large-step solver updates without retraining. To this end, we derive a closed-form Stein correction coefficient for step-wise solver adjustment, enabling reference-free adaptation to local data geometry. We further establish a score-controlled perturbation bound under distributional shifts and provide a complementary Stein perspective on EDM-style parameterizations. Extensive experiments demonstrate that SteinDiff mitigates severe artifacts and improves generative quality across large-step inference settings.

2606.08221 2026-06-09 cs.LG 新提交

De novo molecular generation with optical property preconditioning at the token level

基于Token级光学性质预条件的从头分子生成

Haozhe Huang, Manuel Gonzalez Lastre, Hyun Suk Park, Jorge A. Campos-Gonzalez-Angulo, Xinjian Liu, Alán Aspuru-Guzik

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所) Universidad Autónoma de Madrid(马德里自治大学) Canadian Institute for Advanced Research (CIFAR)(加拿大高等研究院) NVIDIA(英伟达)

AI总结 针对OLED分子光学性质可控生成中数据稀缺和条件控制可靠性有限的问题,提出基于GPT2的Token条件自回归语言模型,通过离散属性Token和多任务优化实现垂直吸收能和振子强度的定向生成,并在TDDFT级别评估分布保真度和可控性。

AI中文摘要

由于高质量数据的稀缺以及生成模型中跨化学基序的条件控制可靠性有限,设计具有目标光学性质的OLED分子仍然具有挑战性。在此,我们在现实低数据场景下对用于OLED分子生成的Token条件自回归语言模型进行了基准测试。一个GPT2模型在大规模化学语料库上进行预训练,增加了离散性质Token,并通过多任务优化进行微调。条件目标为垂直吸收能和振子强度,并将HOMO-LUMO能隙作为辅助电子描述符。生成的分子在TDDFT水平上进行评估,以评估分布保真度和可控性。生成的库再现了训练分布在光学性质上的主要支撑集,同时向更低分子量和更少重原子偏移。Token级控制在不同条件区间内方向一致,但并非完全正交,并表现出局部校准不规则性。化学型解析分析进一步表明,可控性强烈依赖于局部电子环境:适度共轭的芳香碳基序与改进的联合目标满足度相关,而吸电子基序,特别是芳基腈,表现出系统性红移和可控性降低。这些结果为条件OLED分子生成建立了定量基准,并表明模型可靠性必须在化学上有意义的子空间中评估,而非仅从聚合性质分布中评估。

英文摘要

Designing OLED molecules with targeted optical properties remains challenging due to the scarcity of high-quality data and the limited reliability of conditional control in generative models across chemical motifs. Here, we benchmark a token-conditioned autoregressive language model for OLED molecular generation in a realistic low-data regime. A GPT2 model is pretrained on large chemical corpora, augmented with discrete property tokens, and fine-tuned using multi-task optimisation. Conditioning targets vertical absorption energy and oscillator strength, with the HOMO-LUMO gap included as an auxiliary electronic descriptor. Generated molecules are evaluated at the TDDFT level to assess distributional fidelity and controllability. The generated library reproduces the dominant optical-property support of the training distribution while shifting towards lower molecular weight and fewer heavy atoms. Token-level control is consistently directional across conditioning bins, but is not fully orthogonal and exhibits local calibration irregularities. A chemotype-resolved analysis further shows that controllability depends strongly on local electronic environments: moderately conjugated aromatic-carbon motifs are associated with improved joint target satisfaction, whereas electron-withdrawing motifs, particularly aryl nitriles, show systematic red-shifting and reduced controllability. These results establish a quantitative benchmark for conditional OLED molecular generation and show that model reliability must be assessed in chemically meaningful subspaces rather than from aggregate property distributions alone.
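
To make the idea of token-level conditioning concrete, the sketch below bins two target properties into discrete tokens and prepends them to a SMILES string. Everything named here is an assumption for illustration: the bin edges, the token names (`<ABS_k>`, `<OSC_k>`), and the helper functions are hypothetical, since the abstract does not specify the tokenization actually used.

```python
from bisect import bisect_right

# Hypothetical bin edges; the abstract does not give the real binning.
ABS_ENERGY_EDGES_EV = [2.0, 2.5, 3.0, 3.5, 4.0]
OSC_STRENGTH_EDGES = [0.1, 0.5, 1.0]


def bin_index(value, edges):
    """Return the index of the bin that `value` falls into (0..len(edges))."""
    return bisect_right(edges, value)


def property_tokens(abs_energy_ev, osc_strength):
    """Map continuous targets to discrete conditioning tokens (illustrative names)."""
    return [
        f"<ABS_{bin_index(abs_energy_ev, ABS_ENERGY_EDGES_EV)}>",
        f"<OSC_{bin_index(osc_strength, OSC_STRENGTH_EDGES)}>",
    ]


def build_prompt(abs_energy_ev, osc_strength, smiles=""):
    """Prefix the property tokens to a (possibly empty) SMILES string."""
    return "".join(property_tokens(abs_energy_ev, osc_strength)) + smiles


# Training-style sequence: condition tokens followed by the molecule.
prompt = build_prompt(3.1, 0.7, "c1ccccc1")
```

At generation time one would presumably pass only the property tokens (an empty `smiles`) and let the autoregressive model complete the molecule. Coarse bins like these are one plausible reason control can be directional without being perfectly calibrated.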

2606.08309 2026-06-09 cs.LG cs.CV 新提交

Where the Score Lives: A Wavelet View of Diffusion

分数函数所在之处:扩散的小波视角

Emma Finn, Binxu Wang, T. Anderson Keller, Demba E. Ba

发表机构 * The Kempner Institute for the Study of Natural and Artificial Intelligence(肯普纳自然与人工智能研究所) Harvard University(哈佛大学)

AI总结 提出基于二维正交小波基的分数函数参数化,通过数据分布矩分析揭示不同架构的归纳偏差,解释扩散模型中分数网络与数据分布的相互作用。

Comments 20 pages, 12 figures, AISTATS 2026

Journal ref
Proceedings of the 29th International Conference on Artificial Intelligence and Statistics (AISTATS) 2026, Tangier, Morocco. PMLR: Volume 300
AI中文摘要

基于分数的生成模型在过去十年中在生成多样化视觉上合理的图像方面取得了显著成功。在扩散建模中,包括CNN、U-Net和Transformer在内的多种架构被用作分数近似网络;然而,迄今为止,关于这些架构选择如何影响生成行为的了解相对较少。在这项工作中,为了提供对此领域的见解,我们提出了一种使用二维正交小波基展开的分数函数的解析可解参数化。特别地,我们根据数据分布的矩推导出可解释的最优分数函数。我们利用这种参数化提供了一种与架构无关的、基于矩的分析,揭示了数据分布的哪些属性对去噪最为重要。我们的分数机器足够灵活,可以部分模仿多种架构(包括U-Net和CNN)的相关归纳偏差,朝着理解不同分数架构为何表现出不同生成行为迈出了一步。由于我们的分数函数可以根据数据矩解析求解,我们可以开始理解数据分布如何与分数网络相互作用,从而产生我们在扩散模型中观察到的行为。

英文摘要

Score-based generative models have had remarkable success over the last decade in generating a diverse set of visually plausible images. A variety of architectures, including CNNs, U-Nets, and Transformers, have been used as the score-approximation network in such diffusion modeling; however, to date, relatively little is known about how these architectural choices impact generative behavior. In this work, to provide insight into this area, we propose an analytically solvable parameterization of the score function using an expansion in a 2D orthogonal wavelet basis. In particular, we derive interpretable optimal score functions in terms of the moments of the data distribution. We use this parameterization to provide an architecture-agnostic, moment-based analysis that reveals which attributes of the data distribution tend to matter most for denoising. Our score machine is flexible enough to partially mimic the relevant inductive biases of multiple architectures, including U-Nets and CNNs, taking a step towards understanding why different score architectures can exhibit distinct generative behavior. Since our score is solvable in terms of the moments of the data, we can begin to understand how the data distribution interacts with the score network to produce the behavior we observe in diffusion models.

2606.08375 2026-06-09 cs.LG 新提交

Few-step Cofolding with All-Atom Flow Maps

少步全原子流图共折叠

Gianluca Scarpellini, Ron Shprints, Peter Holderrieth, Juno Nam, Pranav Murugan, Rafael Gómez-Bombarelli, Tommi Jaakola, Maruan Al-Shedivat, Nicholas Matthew Boffi, Avishek Joey Bose

发表机构 * Genesis Molecular AI Massachusetts Institute of Technology(麻省理工学院) Carnegie Mellon University(卡内基梅隆大学) Imperial College London(伦敦帝国学院) Mila

AI总结 提出DeCAF框架,将全原子共折叠扩散模型蒸馏为流图,仅需几步推理即可生成高质量样本,并通过奖励引导搜索提升采样质量。

AI中文摘要

3D生物分子复合物的全原子生成建模已成为预测蛋白质和蛋白质-配体系统结构的主流范式。然而,在原子级保真度下生成结构通常需要昂贵的迭代扩散展开,这使得传统部署和推理时搜索技术的计算成本都很高。在本文中,我们引入了去噪器共折叠全原子流图(DeCAF)框架,用于将最先进的全原子共折叠模型蒸馏为全原子流图,这些流图仅需几步推理即可产生高质量样本。我们基于去噪器的流图公式构建DeCAF,该公式具有端点损失,自然支持SE(3)刚性对齐,我们证明这对于训练准确模型至关重要。我们进一步推导了一个简单的变量变换,使DeCAF能够在EDM风格架构的σ空间噪声调度中运行,从而能够从预训练的共折叠扩散模型直接蒸馏。借助DeCAF的流图前瞻,我们引入了一个专门构建的推理时框架,通过奖励引导搜索改进采样。实验上,在具有挑战性的Runs N' Poses数据集上,DeCAF-Boltz在严格的NFE预算下,在蛋白质-配体姿势的准确性(RMSD)和物理有效性分数上均统计上优于Boltz-1x,同时在PoseBusters上的所有推理计算预算下显示出更优的帕累托前沿。将最先进的Pearl共折叠模型蒸馏后,DeCAF-Pearl优于基于扩散的共折叠模型,并在成功率上与其教师模型匹配,同时使用的NFE减少了5倍。我们在https://github.com/genesistherapeutics/decaf发布代码。

英文摘要

All-atom generative modeling of 3D biomolecular complexes has emerged as the dominant paradigm for predicting the structure of proteins and protein-ligand systems. Generating structures at the atomic level of fidelity, however, typically requires expensive iterative diffusion rollouts, making both conventional deployment and inference-time search techniques computationally costly. In this paper, we introduce the Denoiser Cofolding All-Atom Flowmap (DeCAF) framework for distilling state-of-the-art all-atom cofolding models into all-atom flow maps that produce high-quality samples in only a few inference steps. We build DeCAF on a denoiser-based formulation of flow maps with endpoint losses that naturally support SE(3) rigid alignment, which we show is critical for training accurate models. We further derive a simple change of variables that lets DeCAF operate in the σ-space noise schedule of EDM-style architectures, enabling direct distillation from pretrained cofolding diffusion models. Equipped with DeCAF's flowmap lookahead, we introduce a purpose-built inference-time framework that improves sampling through reward-guided search. Empirically, DeCAF-Boltz statistically improves over Boltz-1x in both accuracy (RMSD) and physical validity scores of protein-ligand poses at strict NFE budgets on the challenging Runs N' Poses, while also showing a more optimal Pareto frontier across all inference compute budgets on PoseBusters. Distilling the state-of-the-art Pearl cofolding model, DeCAF-Pearl outperforms diffusion-based cofolding models and matches its teacher on success rate while using 5x fewer NFEs. We release our code at https://github.com/genesistherapeutics/decaf.

2606.08554 2026-06-09 cs.LG 新提交

A Theoretical Analysis of Memory and Overfitting Phenomena in Stochastic Interpolation Models

随机插值模型中的记忆与过拟合现象的理论分析

Yunchen Li, Shaohui Lin, Zhou Yu

AI总结 本文通过闭式解分析随机插值模型中的记忆化现象,揭示连续时间下确定性及随机生成过程均恢复训练样本,离散化与估计误差导致样本偏离,并给出过拟合与欠拟合的理论定义。

AI中文摘要

本文对随机插值模型中的记忆化现象进行了理论解释。通过利用最优速度场和相关评分函数的闭式表达式,我们证明,在连续时间预言机设置下,确定性和随机生成过程都能恢复训练样本。在欧拉离散化下,生成的样本仍围绕训练样本中心,偏差由步长控制。我们进一步分析了存在估计误差时的生成过程,并表明累积的估计误差控制了端点与训练集的偏差。这些结果表明,生成的样本可以表示为训练样本加上三个受控项的扰动:离散化引起的界、估计误差引起的界和随机高斯噪声。基于这一表征,我们提供了生成模型中过拟合和欠拟合的理论定义。合成模拟支持了我们的理论发现。

英文摘要

This paper provides a theoretical account of memorization in stochastic interpolation models. By leveraging closed-form expressions for the optimal velocity field and the associated score function, we show that, in the continuous-time oracle setting, both deterministic and stochastic generation processes recover training samples. Under Euler discretization, generated samples remain centered around training samples, with deviations controlled by the step size. We further analyze generation in the presence of estimation errors and show that accumulated estimation errors control the endpoint deviation from the training set. These results imply that the generated sample admits a representation as a training sample perturbed by three controlled terms: a discretization-induced bound, an estimation-error-induced bound, and stochastic Gaussian noise. Based on this characterization, we provide theoretical definitions of overfitting and underfitting in generative models. Synthetic simulations support our theoretical findings.
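
The memorization effect described above can be reproduced in a toy setting. The sketch below assumes a linear interpolant $x_t = (1-t)z + t x_1$ with $z \sim \mathcal{N}(0,1)$ and $x_1$ uniform over a small 1-D training set; this particular interpolant, the training points, and all function names are illustrative assumptions, not taken from the paper. Under that assumption the optimal velocity has the closed form $v(x,t) = \sum_i w_i(x,t)\,(x_i - x)/(1-t)$, where $w_i$ are posterior weights over training points.

```python
import math
import random

TRAIN = [-2.0, 0.5, 3.0]  # toy 1-D training set (illustrative)


def posterior_weights(x, t, data):
    """Weights P(x_1 = x_i | x_t = x) for x_t = (1 - t) z + t x_1, z ~ N(0, 1)."""
    logits = [-((x - t * xi) ** 2) / (2.0 * (1.0 - t) ** 2) for xi in data]
    top = max(logits)  # subtract the max for numerical stability
    unnorm = [math.exp(l - top) for l in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


def velocity(x, t, data):
    """Closed-form optimal velocity E[x_1 - z | x_t = x] on the empirical data."""
    w = posterior_weights(x, t, data)
    return sum(wi * (xi - x) for wi, xi in zip(w, data)) / (1.0 - t)


def euler_sample(z, n_steps, data):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with explicit Euler."""
    x, h = z, 1.0 / n_steps
    for k in range(n_steps):
        x += h * velocity(x, k * h, data)
    return x


rng = random.Random(0)
samples = [euler_sample(rng.gauss(0.0, 1.0), 200, TRAIN) for _ in range(20)]
# Distance from each generated sample to its nearest training point.
gaps = [min(abs(s - xi) for xi in TRAIN) for s in samples]
```

In this construction the last Euler step lands on a posterior-weighted average of the training points, and those weights are close to one-hot near $t = 1$, so each generated sample effectively copies a training point. Replacing `velocity` with a noisy estimate would be one way to probe the estimation-error term the abstract mentions.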

2606.08802 2026-06-09 cs.LG 新提交

Active Flow Expansion for Out-of-Distribution Discovery: from Theory to Molecules

主动流扩展用于分布外发现:从理论到分子

Riccardo De Santi, Bruce Lee, Cristian Perez Jensen, Kimon Protopapas, Sophia Tang, Cheng-Hao Liu, Pranam Chatterjee, Yisong Yue, Andreas Krause

发表机构 * ETH Zurich(苏黎世联邦理工学院) ETH AI Center(ETH AI 中心) University of Pennsylvania(宾夕法尼亚大学) Caltech(加州理工学院) FutureHouse

AI总结 提出Active Flow Expansion (ActFlow)方法,通过验证器反馈和主动探索扩展预训练流模型的生成集,覆盖更多有效设计空间,理论证明统计学习保证,在分子和蛋白质任务上优于现有方法。

AI中文摘要

标准流和扩散预训练匹配可用数据(例如分子)的分布,这通常只覆盖有效设计空间的一小部分。然而,在生成发现中,目标是采样有效的、自然界中尚不存在的全新设计,这些设计在标准模型下被赋予可忽略的概率,因此无法从拟合观测数据的标准模型中获取。为克服这一限制,我们偏离数据分布匹配,通过生成集(模型以非可忽略概率覆盖的区域)来审视生成模型。这允许引入一种新的分布外流建模学习原则:扩大模型的生成集以增加对有效设计空间的覆盖。我们提出主动流扩展(ActFlow),一种持续预训练方法,利用验证器反馈,通过迭代适应在学习的流表示中主动探索生成的合成数据,将预训练模型扩展到新的有效区域。理论上,我们建立了据我们所知首个分布外流建模的统计学习保证,将生成集扩展分析为在学习表示上的局部到全局可达过程。实验上,我们使用合适的分布外生成建模指标,在小有机分子、中等大小药物样分子、治疗性肽和蛋白质序列设计任务上评估ActFlow。结果表明,ActFlow将有效覆盖扩展到远超初始预训练模型建模的区域,显著优于广泛采用的合成流预训练方法。

英文摘要

Standard flow and diffusion pre-training matches the distribution of available data (e.g., molecules), which often covers only a small fraction of the valid design space. In generative discovery, however, one aims to sample valid new-to-nature designs, which are assigned negligible probability under, and are thus inaccessible to, standard models fitted to the observed data. To overcome this limitation, we depart from data distribution matching and view a generative model through its generable set: the region it covers with non-negligible probability. This allows us to introduce a new learning principle for out-of-distribution flow modeling: enlarging a model's generable set to increase coverage of the valid design space. We propose Active Flow Expansion (ActFlow), a continued pre-training method that employs verifier feedback to expand a pre-trained model over new valid regions by iteratively adapting to synthetic data generated through active exploration in the learned flow representation. Theoretically, we establish what are, to our knowledge, the first statistical learning guarantees for out-of-distribution flow modeling, analyzing generable set expansion as a local-to-global reachability process over a learned representation. Empirically, we assess ActFlow with suitable out-of-distribution generative modeling metrics across small organic molecules, mid-sized drug-like molecules, therapeutic peptides, and protein sequence design tasks. Results show that ActFlow expands valid coverage far beyond the region modeled by the initial pre-trained model, significantly outperforming widely adopted synthetic flow pre-training methods.

2606.08953 2026-06-09 cs.LG math.FA 新提交

Self-Consistent Generative Paths via Admissible Random Variational Transport

通过可容许随机变分输运的自洽生成路径

Lei Luo, Yingzhen Zhang, Jian Yang

发表机构 * PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology(南京理工大学计算机科学与工程学院高维信息智能感知与系统教育部重点实验室PCA实验室)

AI总结 提出自洽生成路径作为可容许局部变分输运校正的随机不动点,并引入随机不动点路径残差(R-FPR)来度量生成路径与校正之间的差距,为扩散、流、一步生成、VAE、GAN等模型提供残差控制原理。

Comments 17 pages, 4 figures, including Appendix

AI中文摘要

现代生成模型通常定义从简单先验到数据分布的完整概率路径,而不仅仅是端点映射。扩散模型遵循随机去噪路径,流匹配学习输运场,一致性和蒸馏方法将路径压缩为一步或几步,对抗模型匹配终端分布,VAE通过潜在核生成。现有的统一观点主要描述这些路径是如何构建的。我们研究一个互补的问题:生成的概率路径何时是自洽的?我们将自洽生成路径定义为可容许局部变分输运校正的随机不动点。在该框架中,局部校正由结合散度或几何项、能量项和结构约束的随机变分输运算子指定。该框架包含随机正则化最优输运近端步骤作为结构化实例,同时允许非OT散度、潜在核、对抗约束、因果离散核和终端一步映射。该理论产生随机不动点路径残差(R-FPR),它衡量实际生成路径与可容许局部校正之间的差距。我们证明了适定性、随机不动点的存在性和吸引性、非收缩存在性、残差到生成误差界、经验残差集中性、代理扰动界、连续时间极限以及算子级泛化与模型特定推论。由此产生的理论将端点匹配转化为路径自洽性测试,并为诊断失败、正则化训练和指导跨扩散、流、一步、VAE、GAN/WGAN和自回归生成器的自适应采样提供了残差控制原理。

英文摘要

Modern generative models often define an entire probability path from a simple prior to the data law, rather than only an endpoint map. Diffusion models follow stochastic denoising paths, flow matching learns transport fields, consistency and distillation methods compress paths into one or a few steps, adversarial models match terminal distributions, and VAEs generate through latent kernels. Existing unifying views mainly describe how such paths are constructed. We study a complementary question: when is a generated probability path self-consistent? We define a self-consistent generative path as a random fixed point of admissible local variational transport corrections. In this framework, a local correction is specified by a random variational transport operator combining a divergence or geometry term, an energy term, and a structural constraint. The framework contains random regularized optimal-transport proximal steps as a structured instance, while also allowing non-OT divergences, latent kernels, adversarial constraints, causal discrete kernels, and terminal one-step maps. The theory yields a random fixed-point path residual (R-FPR), which measures the gap between the actual generated path and an admissible local correction. We prove well-posedness, random fixed-point existence and attraction, non-contractive existence, residual-to-generation error bounds, empirical residual concentration, proxy perturbation bounds, continuous-time limits, and operator-level generalization with model-specific corollaries. The resulting theory turns endpoint matching into path self-consistency testing and provides a residual-control principle for diagnosing failures, regularizing training, and guiding adaptive sampling across diffusion, flow, one-step, VAE, GAN/WGAN, and autoregressive generators.

2606.09257 2026-06-09 cs.LG cs.AI stat.ML 新提交

BSTabDiff: Block-Subunit Diffusion Priors for High-Dimensional Tabular Data Generation

BSTabDiff: 用于高维表格数据生成的块-子单元扩散先验

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * West Virginia University(西弗吉尼亚大学) The University of Utah(犹他大学)

AI总结 针对高维低样本量表格数据,提出BSTabDiff框架,通过将特征划分为潜在块并使用共享低维子单元变量生成每个块,结合扩散先验和copula依赖,实现稳定合成与可控基准生成。

Comments Published as a paper at the 2nd DeLTa Workshop, ICLR 2026

AI中文摘要

高维低样本量(HDLSS)表格领域(例如组学)的特点是 $n \ll m$,其中 $n$ = 样本数,$m$ = 特征数。此类领域通常表现出强局部相关组、稀疏跨组依赖、重尾非高斯边缘分布、异方差噪声和结构化缺失,使得在 $\mathbb{R}^m$ 中直接进行密度学习因 $n \ll m$ 而病态。我们提出 BSTabDiff,一种块-子单元生成框架,将 $m$ 个观测特征划分为 $M$ 个潜在块($M \ll m$),并通过共享的低维子单元变量生成每个块,将全局依赖学习集中在紧凑的块潜在空间 $\mathbb{R}^M$ 中,同时通过 copula 驱动的依赖、灵活的逐特征边缘分布和显式缺失机制解码到完整特征空间。BSTabDiff 支持块潜在上的现代深度先验,包括扩散和归一化流,从而在 HDLSS 场景中实现稳定合成和可控基准生成。实验表明,与 HDLSS 数据上的非结构化表格生成器相比,BSTabDiff 能产生更真实和稳定的高维合成数据。

英文摘要

High-Dimensional Low-Sample Size (HDLSS) tabular domains (e.g., omics) are characterized by $n \ll m$, where $n$ = number of samples, and $m$ = number of features. Such domains often exhibit strong local correlation groups, sparse cross-group dependencies, heavy-tailed non-Gaussian marginals, heteroscedastic noise, and structured missingness, making direct density learning in $\mathbb{R}^m$ ill-conditioned since $n \ll m$. We propose BSTabDiff, a block-subunit generative framework that partitions the $m$ observed features into $M$ latent blocks ($M \ll m$) and generates each block via a shared low-dimensional subunit variable, concentrating global dependence learning in the compact block-latent space $\mathbb{R}^M$ while decoding to the full feature space with copula-driven dependence, flexible per-feature marginals, and explicit missingness mechanisms. BSTabDiff supports modern deep priors on block latents, including diffusion and normalizing flows, enabling stable synthesis and controllable benchmark generation in the HDLSS regime. Empirically, BSTabDiff produces more realistic and stable high-dimensional synthetic data when compared with unstructured tabular generators on HDLSS data.

2606.09664 2026-06-09 cs.LG stat.ML 新提交

In-Context Learning for Latent Space Bayesian Optimization

潜空间贝叶斯优化的上下文学习

Tuan A. Vu, Harri Lähdesmäki, Julien Martinelli

发表机构 * Aalto University(阿尔托大学)

AI总结 针对潜空间贝叶斯优化中上下文学习模型与优化任务不匹配的问题,提出在分子VAE潜空间上定义合成优化任务进行持续预训练,并引入正则化器保持原始先验,显著提升分子优化性能。

AI中文摘要

贝叶斯优化(BO)是样本高效设计的核心工具,潜空间贝叶斯优化(LSBO)将其扩展到分子和蛋白质等结构化对象。与此同时,TabPFN和TabICL等表格基础模型现已实现最先进的回归性能,并越来越多地被用作BO代理模型。由于其贝叶斯行为是由大规模合成预训练集合诱导的,因此该预训练分布的组成至关重要。LSBO造成了一种独特的不匹配:从潜代码到目标值的映射与当前上下文模型训练所用的回归任务明显不同。我们通过在分子VAE的潜空间上定义合成优化任务来补充表格基础模型代理的预训练阶段,从而解决这种不匹配。持续预训练目标包含一个正则化器,将模型锚定到原始检查点,保留其广泛的回归先验,同时避免对适应任务的过度专业化。在保留的分子优化基准测试中,所得模型实现了强劲性能,支持了针对上下文化代理的LSBO特定适应的重要性。

英文摘要

Bayesian optimization (BO) is a central tool for sample-efficient design, and latent-space Bayesian optimization (LSBO) extends it to structured objects such as molecules and proteins. In parallel, tabular foundation models such as TabPFN and TabICL now achieve state-of-the-art regression performance and are increasingly used as BO surrogates. Because their Bayesian behavior is induced by large synthetic pretraining collections, the composition of this pretraining distribution is crucial. LSBO creates a distinctive mismatch: the induced map from latent code to objective value differs markedly from the regression tasks used to train current in-context models. We address this mismatch by complementing the pretraining stage of tabular foundation model surrogates with synthetic optimization tasks defined on the latent space of a molecular VAE. The continued-pretraining objective features a regularizer that anchors the model to the original checkpoint, preserving its broad regression prior while avoiding overspecialization to the adaptation tasks. On held-out molecular optimization benchmarks, the resulting model achieves strong performance, supporting the relevance of LSBO-specific adaptation for in-context surrogates.

2606.09705 2026-06-09 cs.LG cond-mat.stat-mech 新提交

When Do Local Score Models Extrapolate Across Size? A Diagnostic Theory and Benchmark

局部评分模型何时能跨尺寸外推?诊断理论与基准

Wenjie Xi

发表机构 * The University of Hong Kong(香港大学) Department of Physics and HK Institute of Quantum Science & Technology(物理系与香港量子科学与技术研究所)

AI总结 提出诊断理论,证明局部模型能否稳定外推取决于高斯平滑评分的准局部性,并引入有限深度局部流(FDLF)基准进行验证。

AI中文摘要

科学生成建模通常需要尺寸迁移,即在小系统上训练的模型在大系统上评估。虽然平移不变架构允许这种评估,但我们表明架构局部性本身并不能保证稳定的尺寸外推。相反,稳定外推由高斯平滑评分的准局部性决定。通过Tweedie公式,远距离扰动可以通过后验协方差影响局部评分分量,这意味着局部模型只有在感受野覆盖平滑评分的响应范围时才能成功。我们形式化了这一机制,证明了反向扩散下局部边缘的尺寸一致比较定理。我们还引入了有限深度局部流(FDLF),这是一个具有精确评分、密度和可控响应范围的白盒诊断基准。实验上,我们验证了空间混合、平滑评分准局部性和模型感受野之间的相互作用。在空间混合下,平滑评分相对于感受野保持准局部性,从而实现稳定外推。相反,当空间混合减弱时,评分的局部性迅速退化,导致尺寸迁移失败。

英文摘要

Scientific generative modeling often requires size transfer, where models trained on small systems are evaluated on larger ones. While translation-invariant architectures enable this evaluation, we show that architectural locality alone does not guarantee stable size extrapolation. Instead, stable extrapolation is governed by the quasi-locality of the Gaussian-smoothed score. Through Tweedie's formula, far-away perturbations can influence local score components via posterior covariance, meaning a local model succeeds only if its receptive field covers the smoothed score's response range. We formalize this mechanism, proving a size-uniform comparison theorem for local marginals under reverse diffusion. We also introduce Finite-Depth Local Flow (FDLF), a white-box diagnostic benchmark with exact scores, densities, and controllable response ranges. Empirically, we validate the interplay between spatial mixing, smoothed-score quasi-locality, and model receptive fields. Under spatial mixing, the smoothed score remains quasi-local relative to the receptive field, enabling stable extrapolation. Conversely, when spatial mixing weakens, the score's locality rapidly degrades, causing size transfer to fail.
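
The role of posterior covariance can be checked on a two-site Gaussian toy model. For $y = x + \sigma\varepsilon$, Tweedie's formula gives $\mathbb{E}[x \mid y] = y + \sigma^2 \nabla \log p_\sigma(y)$, and differentiating once more yields $\nabla^2 \log p_\sigma(y) = (\mathrm{Cov}[x \mid y]/\sigma^2 - I)/\sigma^2$. The sketch below verifies this identity for a prior $x \sim \mathcal{N}(0, \Sigma)$ with correlation `rho` between the two sites. The two-site setup and all helper names are illustrative assumptions, not the paper's FDLF benchmark.

```python
def inv2(m):
    """Inverse of a 2x2 matrix given as nested lists."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(p, q):
    """Product of two 2x2 matrices."""
    return [[sum(p[i][k] * q[k][j] for k in range(2)) for j in range(2)] for i in range(2)]


def score_jacobian(rho, sigma):
    """Jacobian of the smoothed score for x ~ N(0, [[1, rho], [rho, 1]])."""
    smoothed_cov = [[1.0 + sigma**2, rho], [rho, 1.0 + sigma**2]]
    return [[-v for v in row] for row in inv2(smoothed_cov)]


def posterior_cov(rho, sigma):
    """Cov[x | y] for the same Gaussian prior and noise level sigma."""
    prior = [[1.0, rho], [rho, 1.0]]
    smoothed_inv = inv2([[1.0 + sigma**2, rho], [rho, 1.0 + sigma**2]])
    gain = matmul2(matmul2(prior, smoothed_inv), prior)
    return [[prior[i][j] - gain[i][j] for j in range(2)] for i in range(2)]


def jacobian_from_cov(rho, sigma):
    """Second-order Tweedie identity: (Cov[x | y] / sigma^2 - I) / sigma^2."""
    cov = posterior_cov(rho, sigma)
    eye = [[1.0, 0.0], [0.0, 1.0]]
    return [[(cov[i][j] / sigma**2 - eye[i][j]) / sigma**2 for j in range(2)] for i in range(2)]


coupled = score_jacobian(rho=0.8, sigma=0.5)      # correlated sites
independent = score_jacobian(rho=0.0, sigma=0.5)  # uncorrelated sites
```

The off-diagonal entry of `coupled` is nonzero, so perturbing site 2 changes the score at site 1, while for `independent` it vanishes. In larger systems the decay of such off-diagonal responses with distance is what a finite receptive field would have to cover.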

2606.07640 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

No Free Lunch for Synthetic Images under Data Scarcity Conditions

数据稀缺条件下合成图像没有免费午餐

Borja Arroyo Galende, Alejandro Almodóvar, Patricia A. Apellániz, Juan Parras, Silvia Uribe, Santiago Zazo

发表机构 * Universidad Politécnica de Madrid(马德里理工大学) Universidad de Alcalá(阿尔卡拉大学)

AI总结 研究数据稀缺和隐私敏感条件下合成数据的保真度、隐私和效用权衡,提出联合评估框架,比较VAE、GAN和DDPM在三个图像数据集上的表现,发现GAN和DDPM在差分隐私下更鲁棒。

AI中文摘要

本研究探讨了在数据稀缺和隐私敏感条件下,合成数据生成中保真度、隐私和效用之间的权衡。我们提出了一个联合评估这三个维度的框架,并将其应用于三种广泛使用的生成模型:VAE、GAN和DDPM。评估涵盖三个图像数据集:MNIST、OCTMNIST和OrganAMNIST,包括通用和医学成像领域。在训练过程中引入差分隐私机制时,三种模型的行为出现了显著差异。GAN和DDPM表现出更强的鲁棒性,在一系列噪声水平下保持较高的保真度和下游效用,而VAE随着隐私约束的增加而更快地退化。本研究强调了深度生成模型多维评估的重要性,并指出应用隐私技术时它们的行为存在显著差异。

英文摘要

This study investigates the trade-offs between fidelity, privacy, and utility in synthetic data generation under conditions of data scarcity and privacy sensitivity. We propose an evaluation framework that jointly assesses these three dimensions and apply it to three widely used generative models, VAE, GAN, and DDPM. The evaluation spans three image datasets, MNIST, OCTMNIST, and OrganAMNIST, encompassing both general-purpose and medical imaging domains. Notable differences arise between the three models in their behaviour when differential privacy mechanisms are introduced during training. GAN and DDPM demonstrate greater robustness, maintaining higher fidelity and downstream utility across a range of noise levels, while VAE degrades more rapidly as privacy constraints increase. This study highlights the importance of a multidimensional evaluation of deep generative models, also noting that their behaviour significantly differs when privacy techniques are applied.

2606.08694 2026-06-09 cond-mat.soft cond-mat.stat-mech cs.LG 交叉投稿

Discovering and decoding latent mean-field structure with variational autoencoders

通过变分自编码器发现和解读隐平均场结构

Marco Biroli, Max Welling, Vincenzo Vitelli

发表机构 * Department of Physics and the James Franck Institute, University of Chicago(芝加哥大学物理系及詹姆斯·弗兰克研究所) CuspAI AMLab, University of Amsterdam(阿姆斯特丹大学AMLab) Leinweber Institute for Theoretical Physics(莱因韦伯理论物理研究所)

AI总结 提出一种量化变分自编码器(VAE)重建多体系统联合概率分布能力的准则,证明成功VAE的条件独立解码器等价于有限尺寸平均场分解,从而可从解码器读出平均场理论的微观参数,并在标量、向量和张量序参量模型及视网膜记录数据中验证。

Comments 10 pages, 5 figures

AI中文摘要

生成模型越来越多地用于捕捉多体系统中的相关性,但它们学习到的表示在很大程度上仍难以进行物理解释。在这里,我们建立了一个直观的准则,用于量化变分自编码器(VAE)忠实重建多体系统联合概率分布的能力。简而言之,通过将潜在通道的速率与数据的二分互信息进行比较,可以得到VAE容量的一个界限。利用这个界限,我们证明任何成功的VAE的条件独立解码器在结构上等同于有限尺寸平均场分解。因此,成功的重建是潜在平均场理论的直接证据,并且该理论的微观参数可以从训练好的解码器中读出。我们在具有标量(Curie-Weiss)、向量(Hopfield)和张量(Maier-Saupe)序参量的可解模型层次上验证了这些结论,仅从平衡样本中恢复了完整的Hopfield模式矩阵。我们发现,当应用于蝾螈视网膜记录时,一个双潜在VAE仅用两个有效的集体变量就再现了群体统计,使我们能够恢复神经群体的“存储模式”,并写出一个正确建模实验数据的广义Hopfield模型。

英文摘要

Generative models are increasingly used to capture correlations in many-body systems, but the representations they learn remain largely opaque to physical interpretation. Here, we establish an intuitive criterion that quantifies the capacity of a variational autoencoder (VAE) to faithfully reconstruct the joint probability distribution of a many-body system. In a nutshell, a bound on the VAE capacity is obtained by comparing the rate of the latent channel to the bipartite mutual information of the data. Using this bound, we show that the conditionally independent decoder of any successful VAE is structurally identical to a finite-size mean-field factorization. Hence, a successful reconstruction is direct evidence for a latent mean-field theory, and the microscopic parameters of that theory can be read off the trained decoder. We validate these conclusions on a hierarchy of solvable models with scalar (Curie-Weiss), vector (Hopfield) and tensor (Maier-Saupe) order parameters, recovering the full Hopfield pattern matrix from equilibrium samples alone. We find that, when applied to salamander retinal recordings, a two-latent VAE reproduces the population statistics with only two effective collective variables, allowing us to recover the 'stored patterns' of the neural population and write a generalized Hopfield model that correctly models the experimental data.
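
For the scalar (Curie-Weiss) case, the link between a conditionally independent decoder and a mean-field factorization can be seen from the standard Hubbard-Stratonovich identity. The derivation below is a textbook sketch under the assumption $\beta J > 0$, with $m$ playing the role of a single latent variable; it is not claimed to be the paper's own derivation or notation.

```latex
\begin{align}
P(\mathbf{s}) &\propto \exp\!\Big(\frac{\beta J}{2N}\Big(\sum_{i=1}^{N} s_i\Big)^{2}\Big),
  \qquad s_i \in \{-1,+1\},\\
\exp\!\Big(\frac{\beta J}{2N} S^{2}\Big)
  &= \sqrt{\frac{N\beta J}{2\pi}} \int_{-\infty}^{\infty} \mathrm{d}m\;
     \exp\!\Big(-\frac{N\beta J}{2} m^{2} + \beta J\, m\, S\Big),
  \qquad S=\sum_{i} s_i,\\
P(\mathbf{s}) &= \int \mathrm{d}m\; p(m) \prod_{i=1}^{N} p(s_i \mid m),
  \qquad p(s_i \mid m) = \frac{e^{\beta J m s_i}}{2\cosh(\beta J m)},\\
p(m) &\propto \exp\!\Big(-N\Big[\frac{\beta J}{2} m^{2} - \log\big(2\cosh(\beta J m)\big)\Big]\Big).
\end{align}
```

Read as a generative model, $p(m)$ is a prior over one latent and $\prod_i p(s_i \mid m)$ is a decoder that treats spins as independent given $m$. The coupling $\beta J$ appears directly in the decoder's logits, which illustrates how microscopic parameters could in principle be read off a trained decoder.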

2606.08847 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

BLM-SGAN: Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation

BLM-SGAN: 用于语义-空间文本到图像生成的双向语言建模

Ahmed Abdelmoneim Mazrou, Haidy Maher El-Amir, Ali Hamdi

发表机构 * Faculty of Computer Science, MSA University, Egypt(MSA大学计算机科学学院,埃及)

AI总结 提出BLM-SGAN模型,利用BERT的双向注意力机制捕获长程依赖,解决GAN在文本到图像生成中的梯度消失和序列处理限制,在鸟类图像生成上达到SOTA。

Comments Published in ICACIn 2024. Appears in Advances on Intelligent Computing and Data Science II, Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, 2025

Journal ref
Advances on Intelligent Computing and Data Science II (ICACIn 2024), Lecture Notes on Data Engineering and Communications Technologies, vol. 254, Springer, Cham, 2025
AI中文摘要

尽管从文本描述生成图像取得了成功,但在自然语言处理(NLP)和计算机视觉(CV)等领域仍面临难以克服的挑战。文本到图像(T2I)模型的最新进展,特别是那些利用生成对抗网络(GAN)的模型,显著提高了跨领域合成逼真图像的能力。然而,现有的基于GAN的T2I模型仍然面临关键挑战,例如难以捕获长程依赖、梯度消失以及序列处理的局限性。为了解决这些问题,我们引入了BLM-SGAN,一种新颖的模型,它结合了用于语义-空间文本到图像生成的双向语言建模。BLM-SGAN利用BERT的注意力机制来捕获丰富的上下文信息并有效管理扩展序列。我们的模型展示了最先进的性能,Inception Score(IS)为5.45 +/- 0.08,超过了多个竞争模型,如SSA-GAN、DF-GAN、SD-GAN和AttnGAN。BLM-SGAN能够从详细的文本描述中有效生成高度逼真的鸟类图像。实现代码可在以下网址获取:https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation。

英文摘要

Despite the success of image generation from text descriptions, it still faces challenges that are difficult to overcome in domains such as natural language processing (NLP) and computer vision (CV). Recent advancements in text-to-image (T2I) models, particularly those utilizing generative adversarial networks (GANs), have significantly improved the synthesis of realistic images across various domains. However, existing GAN-based T2I models still encounter key challenges, such as difficulty in capturing long-range dependencies, vanishing gradients, and the limitations of sequential processing. To address these issues, we introduce BLM-SGAN, a novel model that incorporates Bidirectional Language Modeling for Semantic-Spatial Text-to-Image Generation. BLM-SGAN leverages BERT's attention mechanisms to capture rich contextual information and efficiently manage extended sequences. Our model demonstrates state-of-the-art performance, with an Inception Score (IS) of 5.45 +/- 0.08, surpassing several competitive models such as SSA-GAN, DF-GAN, SD-GAN, and AttnGAN. BLM-SGAN effectively generates highly realistic images of birds from detailed text descriptions. The implementation code is available at: https://github.com/haidy-maher/BLM-SGAN-Text-to-Image-Generation.

2606.09056 2026-06-09 cs.CV cs.LG 交叉投稿

MilliVid: Hierarchical Latents for Long-Range Consistency in Video Generation

MilliVid: 用于视频生成中长程一致性的分层潜变量

Ishaan Preetam Chandratreya, David Charatan, Basile Van Hoorick, Sergey Zakharov, Vitor Guizilini, Phillip Isola, Vincent Sitzmann

发表机构 * Massachusetts Institute of Technology(麻省理工学院) Toyota Research Institute(丰田研究所)

AI总结 提出一种多尺度token空间的粗到细展开方法,通过预训练层次化自编码器压缩帧为多层token,并训练视频扩散模型生成这些token,在保持几何和物体持久性长程一致性的同时降低计算开销。

Comments Ishaan Preetam Chandratreya and David Charatan contributed equally. Project page: https://davidcharatan.com/millivid/

详情
AI中文摘要

视频生成模型已变得日益强大,但长程一致性仍然难以实现,因为即使只有几十帧也需要不切实际的长Transformer序列长度。我们表明,通过在多尺度token空间内使用粗到细展开生成视频,可以缓解这一问题。我们的方法很简单:首先,预训练一个自编码器,将每一帧压缩成一个token层次结构,层级范围从典型的潜变量分辨率到每帧仅几个token。最粗糙的层级捕获最重要的信息,如场景布局和语义,而更细的层级添加高频外观和纹理。然后,我们训练一个视频扩散模型,使用粗到细展开生成这些token。通过仔细控制在每个展开步骤中生成帧并用作上下文的细节级别,我们能够保持几何和物体持久性的长程一致性,同时减少花费在感知上不太相关的细节的长程一致性上的计算量。我们使用一个自定义的长Minecraft视频数据集验证了这种方法,与现有基线相比,它产生了明显更一致的展开结果。

英文摘要

Video generative models have become increasingly powerful, but long-range consistency remains challenging to achieve because even a few dozen frames require impractically long transformer sequence lengths. We show that this issue can be mitigated by generating video using coarse-to-fine rollout within a multi-scale token space. Our approach is simple: first, we pre-train an autoencoder that compresses each frame into a hierarchy of tokens, with levels ranging from the typical latent resolution to only a handful of tokens per frame. The coarsest levels capture the most consequential information, such as scene layout and semantics, while finer levels add high-frequency appearance and texture. Then, we train a video diffusion model to generate these tokens using coarse-to-fine rollout. By carefully controlling the level of detail at which frames are generated and used as context during each rollout step, we are able to preserve long-range consistency in geometry and object permanence while spending less compute on the long-range consistency of less perceptually relevant details. We validate this approach using a custom dataset of long Minecraft videos, where it produces substantially more consistent rollouts compared to existing baselines.

2411.08314 2026-06-09 cs.LG 版本更新

Modeling Stochastic Conditional Dynamics from Sparse Observations via Kernel-Stabilized Flow Matching

通过核稳定流匹配从稀疏观测中建模随机条件动力学

Adam P. Generale, Andreas E. Robertson, Surya R. Kalidindi

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Sandia National Laboratories(桑地亚国家实验室)

AI总结 提出条件变量流匹配(CVFM)框架,通过联合采样状态和条件变量流,利用条件不匹配核和Wasserstein距离重加权目标,从稀疏非配对数据中学习条件分布的时间演化,在材料结构建模中表现更优。

Comments Accepted to Transactions on Machine Learning Research (2026); OpenReview: https://openreview.net/forum?id=3A6oAS2TWo

详情
Journal ref
Transactions on Machine Learning Research, 2026
AI中文摘要

学习随时间变换条件概率密度是概率建模和自然科学中的一个基本挑战。在预测生物和物理领域中随机非线性动力系统的演化时,这一任务至关重要。虽然基于流的模型可以预测概率分布的时间演化,但现有方法通常假设离散条件且样本在时间上配对,限制了其在仅有稀疏非配对连续条件数据时的科学适用性。我们提出条件变量流匹配(CVFM),这是一个学习流的框架,通过跨条件密度的连续空间摊销来变换条件分布。CVFM通过联合采样状态和条件变量流,利用条件不匹配核和条件Wasserstein距离重新加权条件最优传输目标,解决了先前方法的高方差不稳定性。总的来说,这些进展允许从跨时间的稀疏非配对状态-条件测量中学习动力学。我们在条件映射基准和制造过程中材料内部结构时间演化的案例研究上评估了CVFM,观察到与现有条件变体相比,性能和收敛特性有所改善。代码可在 https://github.com/agenerale/conditional-variable-flow-matching 获取。

英文摘要

Learning to transform conditional probability densities over time is a fundamental challenge spanning probabilistic modeling and the natural sciences. This task is paramount when forecasting the evolution of stochastic nonlinear dynamical systems in biological and physical domains. While flow-based models can predict the temporal evolution of probability distributions, existing approaches often assume discrete conditioning with samples that are paired across time, limiting their scientific applicability where frequently only sparse data with unpaired continuous conditioning is available. We propose Conditional Variable Flow Matching (CVFM), a framework for learning flows transforming conditional distributions with amortization across the continuous space of conditional densities. CVFM addresses the high-variance instability of prior methods by jointly sampling flows over state and conditioning variables, utilizing a conditioning mismatch kernel alongside a conditional Wasserstein distance to reweight the conditional optimal transport objective. Collectively, these advances allow for learning dynamics from sparse unpaired measurements of state-condition across time. We evaluate CVFM on conditional mapping benchmarks and a case study modeling the temporal evolution of materials internal structure during manufacturing processes, observing improved performance and convergence characteristics over existing conditional variants. Code is available at https://github.com/agenerale/conditional-variable-flow-matching.

2502.06819 2026-06-09 cs.LG cs.GR 版本更新

AccioScene: Compositional 3D Scene Generation via Graph Diffusion and Interaction-driven Critics

AccioScene: 基于图扩散与交互驱动评判的组合式3D场景生成

Yao Wei, Matteo Toso, Pietro Morerio, Changjae Oh, Michael Ying Yang, Alessio Del Bue

发表机构 * Queen Mary University of London, UK(伦敦大学玛丽女王学院) Italian Institute of Technology (IIT), Italy(意大利理工学院) University of Bath, UK(巴斯大学)

AI总结 提出多阶段流水线,通过图扩散生成上下文一致的场景图并预测物体布局,结合轻量级人-物交互先验和空间约束,生成支持人类交互且物理合理的3D室内场景。

详情
AI中文摘要

本文提出一个从文本提示生成3D室内场景的框架。现有方法通常将场景合成视为基于单一输入模态(如文本描述、房间形状或场景图)的物体布局预测问题,这种设计可能导致物体碰撞和功能合理性受限,降低了其实用性。为解决这些局限,我们引入一个多阶段流水线,使其更贴近实际的场景创建情境。给定描述部分场景内容的文本提示,我们的方法首先使用图扩散生成上下文连贯的场景图,然后预测逼真的物体布局。此外,我们融入轻量级人-物交互先验以鼓励以人为中心和功能性的布局,并加入显式空间约束以减少相互穿透。我们的方法生成连贯的3D场景,其布局可行且更好地支持人类交互。在3D-FRONT数据集上的实验表明,与现有方法相比,我们的方法达到了有竞争力或最先进的性能,同时提高了生成场景的物理合理性。

英文摘要

This paper presents a framework for generating 3D indoor scenes from text prompts. Existing methods often formulate scene synthesis as an object layout prediction problem conditioned on a single input modality, such as a text description, room shape, or scene graph. This design can lead to object collisions and limited functional plausibility, reducing its practical applicability. To address these limitations, we introduce a multi-stage pipeline that better reflects practical scene creation scenarios. Given a text prompt describing partial scene content, our method first uses graph diffusion to produce a contextually coherent scene graph and then predicts a realistic object layout. In addition, we incorporate lightweight human-object interaction priors to encourage human-centric and functional arrangements, with explicit spatial constraints to reduce interpenetration. Our approach generates coherent 3D scenes with viable layouts that better support human interaction. Experiments on the 3D-FRONT dataset demonstrate that our method achieves competitive or state-of-the-art performance compared with existing approaches, while improving the physical plausibility of generated scenes.

2502.19049 2026-06-09 cs.LG 版本更新

In-Context Learning of Stochastic Differential Equations with Foundation Inference Models

基于基础推理模型的随机微分方程上下文学习

Patrick Seifner, Kostadin Cvejoski, David Berghaus, Cesar Ojeda, Ramses J. Sanchez

发表机构 * Lamarr Institute(拉马尔研究所) University of Bonn(波恩大学) Fraunhofer IAIS(弗劳恩霍夫智能分析与信息系统研究所) University of Potsdam(波茨坦大学)

AI总结 提出FIM-SDE,一种预训练识别模型,通过上下文学习从噪声时间序列中零样本估计低维SDE的漂移和扩散函数,并支持快速微调,在合成和真实数据上表现鲁棒。

Comments Accepted at NeurIPS 2025. The previous version appeared under the title "Foundation Inference Models for Stochastic Differential Equations: A Transformer-based Approach for Zero-shot Function Estimation."

详情
Journal ref
39th Conference on Neural Information Processing Systems (NeurIPS 2025)
AI中文摘要

随机微分方程(SDE)描述了由漂移函数控制的确定性流动与由扩散函数决定的随机波动叠加的动态系统。从数据中准确估计(或发现)这些函数是机器学习中的一个核心问题,在自然科学和社会科学中有着广泛的应用。然而,当前的解决方案要么严重依赖于对动力学的先验知识,要么涉及复杂的训练过程。我们引入了FIM-SDE(用于SDE的基础推理模型),这是一种预训练的识别模型,能够从含噪声的时间序列数据中对低维SDE的漂移和扩散函数进行准确的上下文(或零样本)估计,并允许快速微调到目标数据集。利用摊销推理和神经算子的概念,我们以监督方式(预)训练FIM-SDE,将大量含噪声的离散观测SDE路径映射到漂移和扩散函数空间。我们证明,FIM-SDE在广泛的合成和真实世界过程中实现了鲁棒的上下文函数估计——从经典的SDE系统(例如双阱动力学或弱扰动洛伦兹吸引子)到股票价格记录以及油价和风速波动——同时匹配在目标数据集上训练的符号、高斯过程和神经SDE基线的性能。当微调到目标过程时,我们显示FIM-SDE始终优于所有这些基线。

英文摘要

Stochastic differential equations (SDEs) describe dynamical systems where deterministic flows, governed by a drift function, are superimposed with random fluctuations, dictated by a diffusion function. The accurate estimation (or discovery) of these functions from data is a central problem in machine learning, with wide application across the natural and social sciences. Yet current solutions either rely heavily on prior knowledge of the dynamics or involve intricate training procedures. We introduce FIM-SDE (Foundation Inference Model for SDEs), a pretrained recognition model that delivers accurate in-context (or zero-shot) estimation of the drift and diffusion functions of low-dimensional SDEs, from noisy time series data, and allows rapid finetuning to target datasets. Leveraging concepts from amortized inference and neural operators, we (pre)train FIM-SDE in a supervised fashion to map a large set of noisy, discretely observed SDE paths onto the space of drift and diffusion functions. We demonstrate that FIM-SDE achieves robust in-context function estimation across a wide range of synthetic and real-world processes -- from canonical SDE systems (e.g., double-well dynamics or weakly perturbed Lorenz attractors) to stock price recordings and oil-price and wind-speed fluctuations -- while matching the performance of symbolic, Gaussian process and Neural SDE baselines trained on the target datasets. When finetuned to the target processes, we show that FIM-SDE consistently outperforms all these baselines.
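To make the drift/diffusion split concrete, here is a minimal Euler-Maruyama sketch that generates the kind of discretely observed path such a model would take as input. It illustrates the SDE form dX = f(X) dt + g(X) dW only; it is not FIM-SDE itself. The double-well drift `x - x**3` and the constant diffusion `0.5` are illustrative assumptions, not values from the paper.

```python
import math
import random

def euler_maruyama(drift, diffusion, x0, dt, n_steps, rng):
    """Simulate dX = drift(X) dt + diffusion(X) dW on a uniform time grid."""
    path = [x0]
    x = x0
    for _ in range(n_steps):
        # Brownian increment over one step: N(0, dt).
        dw = rng.gauss(0.0, math.sqrt(dt))
        x = x + drift(x) * dt + diffusion(x) * dw
        path.append(x)
    return path

def double_well_drift(x):
    # Assumed example: gradient flow of the potential x^4/4 - x^2/2.
    return x - x ** 3

def constant_diffusion(x):
    # Assumed example: state-independent noise level.
    return 0.5

rng = random.Random(0)
path = euler_maruyama(double_well_drift, constant_diffusion,
                      x0=0.0, dt=0.01, n_steps=1000, rng=rng)
```

Setting the diffusion to zero recovers the deterministic flow, which relaxes to one of the wells at x = 1 or x = -1. The inverse problem described in the abstract is to recover both functions from paths like `path`.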

2505.14752 2026-06-09 cs.LG 版本更新

LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models

LLMSynthor: 使用大语言模型进行宏观对齐的微观记录合成

Yihong Tang, Menglin Kong, Junlin He, Tong Nie, Wei Ma, Lijun Sun

发表机构 * McGill University(麦吉尔大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 提出LLMSynthor方法,利用大语言模型作为非参数copula,通过迭代生成与目标宏观统计一致的微观记录,解决大规模细粒度数据收集困难的问题。

详情
AI中文摘要

宏观对齐的微观记录对于社会科学和城市研究中的可信模拟至关重要。例如,流行病模型只有在个体层面的流动和接触反映真实行为,且聚合数据匹配真实世界统计数据(如病例数或旅行流量)时才可靠。然而,大规模收集此类细粒度数据不切实际,研究人员只能获得宏观数据。LLMSynthor通过将预训练的大语言模型转化为宏观感知模拟器来解决这一问题,生成与目标宏观统计一致的逼真微观记录。它迭代构建合成数据集:在每一步,LLM生成一批记录以最小化合成聚合与目标聚合之间的差异。将LLM视为非参数copula,使模型能够捕捉变量间真实的联合依赖关系。为提高效率,LLM提议采样引导LLM提出有针对性的记录批次,指定变量范围和数量,以有效纠正差异,同时保持基于模型先验的真实性。跨领域(移动、电子商务、人口)的评估表明,LLMSynthor实现了强真实性、统计保真度和实用性,使其广泛适用于经济学、社会科学和城市研究。

英文摘要

Macro-aligned micro-records are crucial for credible simulations in social science and urban studies. For example, epidemic models are only reliable when individual-level mobility and contacts mirror real behavior, while aggregates match real-world statistics like case counts or travel flows. However, collecting such fine-grained data at scale is impractical, leaving researchers with only macro-level data. LLMSynthor addresses this by turning a pretrained LLM into a macro-aware simulator that generates realistic micro-records consistent with target macro-statistics. It iteratively builds synthetic datasets: in each step, the LLM generates batches of records to minimize discrepancies between synthetic and target aggregates. Treating the LLM as a nonparametric copula allows the model to capture realistic joint dependencies among variables. To improve efficiency, LLM Proposal Sampling guides the LLM to propose targeted record batches, specifying variable ranges and counts, to efficiently correct discrepancies while preserving realism grounded in the model's priors. Evaluations across domains (mobility, e-commerce, population) show that LLMSynthor achieves strong realism, statistical fidelity, and practical utility, making it broadly applicable to economics, social science, and urban studies.

2507.19700 2026-06-09 cs.LG 版本更新

Disjoint Generation of Synthetic Data

合成数据的分离生成

Anton Danholt Lautrup, Muhammad Rajabinasab, Tobias Hyrup, Arthur Zimek, Peter Schneider-Kamp

发表机构 * Department of Mathematics and Computer Science(数学与计算机科学系) University of Southern Denmark(南丹麦大学)

AI总结 提出通过分离生成模型生成表格合成数据的新框架,将数据集分区后独立生成再合并,在无公共变量时实现连接,提升隐私性、计算可行性和混合模型合成能力。

详情
Journal ref
Transact. mach. learn. res. (June 2026). https://openreview.net/forum?id=LSzXkAWBKI
AI中文摘要

我们提出了一种通过分离生成模型生成表格合成数据集的新框架。在该范式中,数据集被划分为多个不相交的子集,分别提供给生成模型的独立实例。然后,通过一种在缺乏公共变量/标识符的情况下工作的连接操作,将结果事后组合。通过几个案例研究和表格数据示例,我们展示了该框架的成功,并帮助阐明了一些可能的设计选择。分离生成所实现的优势包括:i) 观察到隐私的经验度量有所提高。ii) 增加了某些模型类型的计算可行性。iii) 能够使用不同生成模型的混合来生成合成数据。具体而言,混合模型合成弥合了隐私和效用性能之间的差距,在下游任务的准确性和曲线下面积方面提供了极具竞争力的性能,同时显著降低了经验重识别风险。

英文摘要

We propose a new framework for generating tabular synthetic datasets via disjoint generative models. In this paradigm, a dataset is partitioned into disjoint subsets that are supplied to separate instances of generative models. The results are then combined post hoc by a joining operation that works in the absence of common variables/identifiers. The success of the framework is demonstrated through several case studies and examples on tabular data that help illuminate some of the design choices that one may make. The advantages achieved by the disjoint generation include: i) An observed increase in the empirical measurement of privacy. ii) Increased computational feasibility of certain model types. iii) Ability to generate synthetic data using a mixture of different generative models. Specifically, mixed-model synthesis bridges the gap between privacy and utility performance, providing highly competitive performance on Accuracy and Area Under the Curve for downstream tasks while significantly lowering the empirical re-identification risk.

2508.19857 2026-06-09 cs.LG quant-ph 版本更新

Quantum latent distributions in deep generative models

深度生成模型中的量子潜在分布

Omar Bacarreza, Thorin Farnsworth, Alexander Makarovskiy, Hugo Wallner, Tessa Hicks, Santiago Sempere-Llagostera, John Price, Robert J. A. Francis-Jones, William R. Clements

发表机构 * ORCA Computing(ORCA计算公司)

AI总结 研究量子处理器产生的潜在分布何时及为何能提升生成模型性能,理论上证明其可生成经典分布无法高效产生的数据分布,并在合成和分子数据集上验证了量子干涉统计带来的性能优势。

Comments Accepted at ICML 2026

详情
AI中文摘要

许多成功的生成模型家族利用低维潜在分布映射到数据分布。尽管通常使用简单的潜在分布,但分布的选择对模型性能有强烈影响。最近的实验表明,量子处理器产生的概率分布(通常高度相关且经典上难以处理)可以在某些数据集上带来性能提升。然而,量子处理器产生的潜在分布何时以及为何能提升性能,以及这些改进是否与这些分布的量子性质相关,是我们在本工作中研究的开放问题。我们在理论上证明,在某些条件下,这些“量子潜在分布”使生成模型能够产生经典潜在分布无法高效产生的数据分布。我们提供了关于潜在机制的解释,这些机制可以解释在真实数据集上的性能优势。基于此,我们在合成量子数据集和QM9分子数据集上进行了广泛的基准测试,使用了模拟和真实的光子量子处理器。我们发现,与经典基线相比,量子干涉产生的统计特性带来了更好的生成性能,表明量子处理器可以在扩展深度生成模型的能力方面发挥作用。

英文摘要

Many successful families of generative models leverage a low-dimensional latent distribution that is mapped to a data distribution. Though simple latent distributions are often used, the choice of distribution has a strong impact on model performance. Recent experiments have suggested that the probability distributions produced by quantum processors, which are typically highly correlated and classically intractable, can lead to improved performance on some datasets. However, when and why latent distributions produced by quantum processors can improve performance, and whether these improvements are connected to quantum properties of these distributions, are open questions that we investigate in this work. We show in theory that, under certain conditions, these "quantum latent distributions" enable generative models to produce data distributions that classical latent distributions cannot efficiently produce. We provide intuition as to the underlying mechanisms that could explain a performance advantage on real datasets. Based on this, we perform extensive benchmarking on a synthetic quantum dataset and the QM9 molecular dataset, using both simulated and real photonic quantum processors. We find that the statistics arising from quantum interference lead to improved generative performance compared to classical baselines, suggesting that quantum processors can play a role in expanding the capabilities of deep generative models.

2509.24762 2026-06-09 cs.LG 版本更新

In-Context Learning of Temporal Point Processes with Foundation Inference Models

基于基础推理模型的时间点过程上下文学习

David Berghaus, Patrick Seifner, Kostadin Cvejoski, César Ojeda, Ramsés J. Sánchez

发表机构 * Lamarr Institute(拉马尔研究所) Fraunhofer IAIS(弗劳恩霍夫智能分析与信息系统研究所) University of Bonn(波恩大学) JetBrains Research(JetBrains研究) University of Potsdam(波茨坦大学)

AI总结 提出一种基于摊销推理和上下文学习的点过程基础推理模型FIM-PP,通过大规模合成数据预训练,无需额外训练即可估计真实MTPP,或快速微调至目标系统。

Comments This paper is published as a conference paper at ICLR 2026

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026)
AI中文摘要

利用带标记的时间点过程(MTPP)对多种事件类型的事件序列进行建模,为揭示支配性动态规则和预测未来事件提供了一种原则性方法。当前MTPP推理的神经网络方法依赖于为每个目标系统训练单独的专用模型。我们采用一种截然不同的方法:利用摊销推理和上下文学习,预训练一个深度神经网络,以从由事件序列集合定义的上下文中推断事件历史的条件强度函数。预训练是在从广泛霍克斯过程分布中采样的大规模合成MTPP数据集上进行的。预训练后,我们的点过程基础推理模型(FIM-PP)可以在无需任何额外训练的情况下从真实世界数据中估计MTPP,或者快速微调至目标系统。实验表明,这种摊销方法在常见基准数据集上的下一事件预测任务中与专用模型的性能相匹配。

英文摘要

Modeling event sequences of multiple event types with marked temporal point processes (MTPPs) provides a principled way to uncover governing dynamical rules and predict future events. Current neural network approaches to MTPP inference rely on training separate, specialized models for each target system. We pursue a radically different approach: drawing on amortized inference and in-context learning, we pretrain a deep neural network to infer, in-context, the conditional intensity functions of event histories from a context defined by sets of event sequences. Pretraining is performed on a large synthetic dataset of MTPPs sampled from a broad distribution of Hawkes processes. Once pretrained, our Foundation Inference Model for Point Processes (FIM-PP) can estimate MTPPs from real-world data without any additional training, or be rapidly finetuned to target systems. Experiments show that this amortized approach matches the performance of specialized models on next-event prediction across common benchmark datasets.
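The abstract says pretraining data is sampled from Hawkes processes. As a rough illustration, a univariate Hawkes process with an exponential kernel can be simulated with Ogata's thinning algorithm. The sketch below rests on stated assumptions: a single event type rather than the marked, multi-type setting of the paper, and hypothetical parameter names `mu`, `alpha`, `beta` for the intensity mu + sum_i alpha * exp(-beta * (t - t_i)).

```python
import math
import random

def hawkes_intensity(t, events, mu, alpha, beta):
    """Conditional intensity at time t; every event in `events` is at or before t."""
    return mu + sum(alpha * math.exp(-beta * (t - s)) for s in events)

def simulate_hawkes(mu, alpha, beta, horizon, rng):
    """Ogata thinning for a univariate exponential-kernel Hawkes process (mu > 0)."""
    events = []
    t = 0.0
    while True:
        # The kernel only decays between events, so the intensity at the
        # current time bounds the intensity until the next accepted event.
        lam_bar = hawkes_intensity(t, events, mu, alpha, beta)
        t += rng.expovariate(lam_bar)
        if t >= horizon:
            return events
        # Accept the candidate with probability intensity(t) / lam_bar.
        if rng.random() * lam_bar <= hawkes_intensity(t, events, mu, alpha, beta):
            events.append(t)

events = simulate_hawkes(mu=1.0, alpha=0.5, beta=1.0, horizon=50.0,
                         rng=random.Random(0))
```

Choosing `alpha / beta < 1` keeps the process stationary. An amortized model of the kind described would be trained to map sets of such sequences to their intensity functions.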

2511.05355 2026-06-09 cs.LG cs.RO cs.SY eess.SY 版本更新

SAD-Flower: Flow Matching for Safe, Admissible, and Dynamically Consistent Planning

SAD-Flower:用于安全、可接受和动态一致规划的流匹配

Tzu-Yuan Huang, Armin Lederer, Dai-Jie Wu, Xiaobing Dai, Sihua Zhang, Hsiu-Chin Lin, Shao-Hua Sun, Stefan Sosnowski, Sandra Hirche

发表机构 * TUM School of Computation, Information and Technology, Technical University of Munich, Munich, Germany(慕尼黑工业大学计算、信息与技术学院) Munich Institute of Robotics(慕尼黑机器人研究所) Munich Data Science Institute (MDSI)(慕尼黑数据科学研究所) National University of Singapore(新加坡国立大学) National Taiwan University (NTU)(国立台湾大学) NTU Artificial Intelligence Center of Research Excellence (NTU AI-CoRE)(国立台湾大学人工智能研究中心) University of Utah(犹他大学) Beijing Institute of Technology(北京理工大学) McGill University(麦吉尔大学)

AI总结 提出SAD-Flower框架,通过虚拟控制输入增强流匹配,利用非线性控制理论提供状态约束、动作约束和动态一致性的形式化保证,无需重新训练即可在测试时满足未见约束。

详情
AI中文摘要

流匹配(FM)在数据驱动规划中显示出有希望的结果。然而,它本质上缺乏确保状态和动作约束的形式化保证,而满足这些约束对于各种系统上规划轨迹的安全性和可接受性是一个基本且关键的要求。此外,现有的FM规划器不能确保动态一致性,这可能导致轨迹不可执行。我们通过提出SAD-Flower来解决这些缺陷,这是一个用于生成安全、可接受和动态一致轨迹的新框架。我们的方法依赖于用虚拟控制输入增强流。因此,可以使用非线性控制理论的技术推导出有原则的指导,为状态约束、动作约束和动态一致性提供形式化保证。关键的是,SAD-Flower无需重新训练即可运行,从而在测试时满足未见约束。通过在多个任务上的广泛实验,我们证明SAD-Flower在确保约束满足方面优于各种基于生成模型的基线。

英文摘要

Flow matching (FM) has shown promising results in data-driven planning. However, it inherently lacks formal guarantees for ensuring state and action constraints, whose satisfaction is a fundamental and crucial requirement for the safety and admissibility of planned trajectories on various systems. Moreover, existing FM planners do not ensure the dynamical consistency, which potentially renders trajectories inexecutable. We address these shortcomings by proposing SAD-Flower, a novel framework for generating Safe, Admissible, and Dynamically consistent trajectories. Our approach relies on an augmentation of the flow with a virtual control input. Thereby, principled guidance can be derived using techniques from nonlinear control theory, providing formal guarantees for state constraints, action constraints, and dynamic consistency. Crucially, SAD-Flower operates without retraining, enabling test-time satisfaction of unseen constraints. Through extensive experiments across several tasks, we demonstrate that SAD-Flower outperforms various generative-model-based baselines in ensuring constraint satisfaction.

2512.15116 2026-06-09 cs.LG cs.AI 版本更新

FADTI: Fourier and Attention Driven Diffusion for Multivariate Time Series Imputation

FADTI: 基于傅里叶和注意力驱动的多变量时间序列插补扩散模型

Runze Li, Hanchen Wang, Wenjie Zhang, Binghao Li, Yu Zhang, Xuemin Lin, Ying Zhang

发表机构 * Anonymous(匿名)

AI总结 提出FADTI扩散框架,通过可学习傅里叶偏置投影模块注入频域归纳偏置,结合自注意力与门控卷积进行时序建模,在多个基准上优于现有方法,尤其在高缺失率下表现突出。

Comments This work has been submitted to the IEEE for possible publication. 10 pages, 7 figures

详情
AI中文摘要

多变量时间序列插补是医疗保健、交通预测和生物建模等应用中的基础问题,其中传感器故障和不规则采样导致普遍存在的缺失值。然而,现有的基于Transformer和扩散的模型缺乏明确的归纳偏置和频率感知,限制了它们在结构化缺失模式和分布偏移下的泛化能力。我们提出FADTI,一个基于扩散的框架,通过可学习的傅里叶偏置投影(FBP)模块注入频率信息特征调制,并将其与通过自注意力和门控卷积进行的时间建模相结合。FBP支持多种谱基,能够自适应编码平稳和非平稳模式。这种设计将频域归纳偏置注入生成式插补过程。在多个基准(包括一个新引入的生物时间序列数据集)上的实验表明,FADTI持续优于最先进的方法,尤其是在高缺失率下。代码可在 https://anonymous.4open.science/r/TimeSeriesImputation-52BF 获取。

英文摘要

Multivariate time series imputation is fundamental in applications such as healthcare, traffic forecasting, and biological modeling, where sensor failures and irregular sampling lead to pervasive missing values. However, existing Transformer- and diffusion-based models lack explicit inductive biases and frequency awareness, limiting their generalization under structured missing patterns and distribution shifts. We propose FADTI, a diffusion-based framework that injects frequency-informed feature modulation via a learnable Fourier Bias Projection (FBP) module and combines it with temporal modeling through self-attention and gated convolution. FBP supports multiple spectral bases, enabling adaptive encoding of both stationary and non-stationary patterns. This design injects frequency-domain inductive bias into the generative imputation process. Experiments on multiple benchmarks, including a newly introduced biological time series dataset, show that FADTI consistently outperforms state-of-the-art methods, particularly under high missing rates. Code is available at https://anonymous.4open.science/r/TimeSeriesImputation-52BF

2602.18695 2026-06-09 cs.LG 版本更新

Insertion Based Sequence Generation with Learnable Order Dynamics

基于可学习顺序动态的插入式序列生成

Dhruvesh Patel, Benjamin Rozonoyer, Gaurav Pandey, Tahira Naseem, Ramón Fernandez Astudillo, Andrew McCallum

发表机构 * University of Washington(华盛顿大学) Google Research(谷歌研究院)

AI总结 提出LoFlexMDM,一种具有可学习顺序动态的插入式掩码扩散模型,通过学习数据依赖的插入和解掩码速率,在分子生成任务上提升样本质量。

Comments Some updated results. Accepted at ICML 2026. Code and checkpoints available at https://github.com/dhruvdcoder/LoFlexMDM

详情
AI中文摘要

现有的基于插入的掩码扩散模型通过交替进行token插入和解掩码来生成序列,它们使用固定的调度,不依赖于数据。对于像图和分子这样的结构化序列,学习数据依赖的生成顺序可以通过减少动作空间的不确定性来提高生成质量。我们提出了LoFlexMDM,一种具有可学习顺序动态的插入式掩码扩散模型,它学习数据依赖的插入和解掩码速率。我们将离散流匹配框架推广到处理变长序列,提出了一种可处理的调度参数化方法以及一个用于联合训练生成器和目标顺序动态的训练目标。在从头设计和片段约束的分子生成任务中,LoFlexMDM相比FlexMDM分别将样本质量提升了高达17.5%和6.7%。这些结果表明,学习目标生成顺序可以在不牺牲可处理训练的情况下改进插入式扩散模型。我们在以下网址开源了代码:https://github.com/dhruvdcoder/LoFlexMDM。

英文摘要

Existing insertion-based masked diffusion models that generate sequences by interleaving token insertion with unmasking use fixed schedules that are not dependent on the data. For structured sequences like graphs and molecules, learning data-dependent generation orders can improve generation quality by reducing uncertainty over the action space. We propose LoFlexMDM, an insertion-based masked diffusion model with learnable order dynamics that learns data-dependent insertion and unmasking rates. We generalize the discrete flow matching framework to work with variable-length sequences, propose a tractable schedule parameterization and a training objective for joint training of the generator and the target order dynamics. On De Novo and fragment-constrained molecule generation, LoFlexMDM improves sample quality over FlexMDM by up to 17.5% and 6.7%, respectively. These results show that learning the target generation order can improve insertion-based diffusion models without giving up tractable training. We open source the code at https://github.com/dhruvdcoder/LoFlexMDM.

2603.10395 2026-06-09 cs.LG 版本更新

Graph-GRPO: Training Graph Flow Models with Reinforcement Learning

Graph-GRPO: 基于强化学习训练图流模型

Baoheng Zhu, Deyu Bo, Delvin Ce Zhang, Xiao Wang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出Graph-GRPO框架,通过可验证奖励训练图流模型,推导了转移概率表达式并提出局部探索策略,实验显示其在生成质量与分子优化任务中表现优异。

Comments Accepted by ICML 2026

详情
AI中文摘要

图生成是具有广泛应用的基本任务,如药物发现。最近,基于离散流匹配的图生成方法(图流模型,GFM)因其优越性能和灵活采样而兴起。然而,有效对齐GFM与复杂人类偏好或任务特定目标仍是一个重大挑战。本文提出Graph-GRPO,一种在线强化学习(RL)框架,用于在可验证奖励下训练GFM。我们的方法有两个关键贡献:(1)我们推导了GFM的转移概率分析表达式,取代了蒙特卡洛采样,使RL训练能够完全可微;(2)我们提出了一种精炼策略,随机扰动图中的特定节点和边,并重新生成它们,允许局部探索和生成质量的自我改进。在合成和真实数据集上的广泛实验表明了Graph-GRPO的有效性。仅使用50次去噪步骤,我们的方法在平面和树数据集上分别达到95.0%和97.5%的Valid-Unique-Novelty分数。此外,Graph-GRPO在分子优化任务中实现了最先进的性能,优于基于图和片段的RL方法以及经典遗传算法。

英文摘要

Graph generation is a fundamental task with broad applications, such as drug discovery. Recently, discrete flow matching-based graph generation, also known as the graph flow model (GFM), has emerged due to its superior performance and flexible sampling. However, effectively aligning GFMs with complex human preferences or task-specific objectives remains a significant challenge. In this paper, we propose Graph-GRPO, an online reinforcement learning (RL) framework for training GFMs under verifiable rewards. Our method makes two key contributions: (1) We derive an analytical expression for the transition probability of GFMs, replacing the Monte Carlo sampling and enabling fully differentiable rollouts for RL training; (2) We propose a refinement strategy that randomly perturbs specific nodes and edges in a graph, and regenerates them, allowing for localized exploration and self-improvement of generation quality. Extensive experiments on both synthetic and real datasets demonstrate the effectiveness of Graph-GRPO. With only 50 denoising steps, our method achieves 95.0% and 97.5% Valid-Unique-Novelty scores on the planar and tree datasets, respectively. Moreover, Graph-GRPO achieves state-of-the-art performance on the molecular optimization tasks, outperforming graph-based and fragment-based RL methods as well as classic genetic algorithms.
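The abstract does not spell out the RL objective. In the generic GRPO recipe, which the method's name suggests but the abstract does not confirm for this paper, each sampled output is scored against the other samples in its group rather than by a learned value function. A minimal sketch of that group-relative normalization follows; the function name and the `eps` stabilizer are illustrative assumptions.

```python
import math

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize verifiable rewards within one group of sampled graphs.

    Generic GRPO-style normalization; not taken from the Graph-GRPO paper.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    std = math.sqrt(sum((r - mean) ** 2 for r in rewards) / n)
    # eps avoids division by zero when every sample gets the same reward.
    return [(r - mean) / (std + eps) for r in rewards]

# Example: four sampled graphs, two of which pass a validity check.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Samples that beat the group mean receive positive advantages and the rest negative ones, so the policy gradient needs no separate critic.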

2604.26985 2026-06-09 cs.LG cs.AI 版本更新

Simple Self-Conditioning Adaptation for Masked Diffusion Models

简单自条件适应用于掩码扩散模型

Michael Cardei, Huu Binh Ta, Ferdinando Fioretto

发表机构 * University of Virginia(弗吉尼亚大学)

AI总结 本文提出一种简单有效的后训练适应方法,通过自条件预测提升掩码扩散模型的生成能力,减少生成困惑度并提升图像合成和分子生成质量。

详情
AI中文摘要

掩码扩散模型(MDMs)通过迭代去噪在吸收掩码过程中生成离散序列。在标准掩码扩散中,如果一个token在反向更新后仍被掩码,模型会丢弃该位置的干净状态预测。因此,仍被掩码的位置必须反复从掩码token本身推断。这种设计限制了跨步骤的细化。为解决这一限制,本文提出了一种简单但有效的后训练适应方法,使每个去噪步骤都基于模型自身之前的干净状态预测。所提出的方法称为自条件掩码扩散模型(SCMDM),需要最小的架构更改,不引入递归的潜在状态路径,不依赖辅助参考模型,并在采样过程中不增加额外的去噪器评估。这与部分自条件方法形成重要区别,后者需要昂贵的从头模型训练。特别是,本文表明,在后训练阶段,部分自条件,包括用于从头训练自条件模型的常用50% dropout策略,是次优的。相反,一旦模型自生成的干净状态估计变得有信息,专业化于细化优于混合条件和无条件目标。SCMDM在多个领域进行了评估,显示出对普通MDM基线的一致改进,实现了在OWT训练模型上的生成困惑度几乎减少50%(从42.89到23.72),同时在离散图像合成质量、小分子生成和基因组分布建模的保真度方面也取得了显著改进。

英文摘要

Masked diffusion models (MDMs) generate discrete sequences by iterative denoising under an absorbing masking process. In standard masked diffusion, if a token remains masked after a reverse update, the model discards its clean-state prediction for that position. Thus, still-masked positions must be repeatedly inferred from the mask token alone. This design choice limits cross-step refinement. To address this limitation, this paper proposes a simple, yet effective, post-training adaptation for MDMs that conditions each denoising step on the model's own previous clean-state predictions. The resulting method, called Self-Conditioned Masked Diffusion Models (SCMDM), requires minimal architectural change, does not introduce a recurrent latent-state pathway, does not rely on an auxiliary reference model, and adds no extra denoiser evaluations during sampling. This is an important departure from partial self-conditioning approaches, which require expensive model training from scratch. In particular, the paper shows that partial self-conditioning, including the commonly used 50% dropout strategy for training self-conditioned models from scratch, is suboptimal in the post-training regime. Instead, once the model's self-generated clean-state estimates become informative, specializing in refinement is preferable to mixing conditional and unconditional objectives. SCMDM is evaluated across multiple domains, demonstrating consistent improvement over vanilla MDM baselines, achieving nearly a 50% reduction in generative perplexity on OWT-trained models (42.89 to 23.72), alongside strong improvements in discretized image synthesis quality, small-molecule generation, and fidelity in genomic distribution modeling.

2605.29920 2026-06-09 cs.LG 版本更新

Midpoint Generative Models

中点生成模型

Daniil Shlenskii, Nikita Gushchin, Lev Novitskiy, Dmitry V. Dylov, Alexander Korotin

发表机构 * AXXX, Russia(俄罗斯AXXX) Applied AI Institute, Russia(俄罗斯应用人工智能研究所) Kandinsky Lab, Russia(俄罗斯康定斯基实验室)

AI总结 提出中点生成模型(MGM),利用流匹配的对称性定义中点散度,并通过变分目标训练单步生成模型,在性能上与现有方法竞争。

详情
AI中文摘要

我们引入了中点生成模型(MGM),这是一个用于训练单步生成模型的原则性框架。MGM基于线性插值流匹配的一个简单对称性:当两个端点分布重合时,相应的漂移场在中点时间$t=1/2$处消失。我们证明该场的范数定义了分布之间的有效差异,称为中点散度。我们通过引入随机翻转插值将该散度扩展到中点之外,并通过用对称随机插值替代确定性线性流匹配插值进一步推广,得到广义中点散度。最后,我们推导了广义散度的变分形式,从而得到一个可处理的目标用于训练单步生成器。由此产生的MGM算法为生成建模提供了一种有效且理论上有依据的方法,在单步生成建模方法中取得了有竞争力的性能。

英文摘要

We introduce Midpoint Generative Models (MGM), a principled framework for training one-step generative models. MGM is based on a simple symmetry of Flow Matching with linear interpolation: when the two endpoint distributions coincide, the corresponding drift field vanishes at the midpoint time, $t=1/2$. We show that the norm of this field defines a valid discrepancy between distributions, which we call the Midpoint Divergence. We extend this discrepancy beyond the midpoint by introducing randomly flipped interpolations and further generalize it by replacing deterministic linear Flow Matching interpolations with symmetric stochastic interpolants, yielding a generalized Midpoint Divergence. Finally, we derive a variational formulation of our generalized divergence, yielding a tractable objective for training a one-step generator. The resulting MGM algorithm offers an effective and theoretically grounded approach to generative modeling, achieving competitive performance against existing one-step generative modeling methods.
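The midpoint symmetry stated above can be checked in a few lines. The derivation below is a sketch under one explicit assumption that the abstract does not state: the endpoint pair is exchangeable, for instance two independent draws from the same distribution.

```latex
% Linear interpolant and its Flow Matching drift (conditional mean velocity).
\[
X_t = (1-t)\,X_0 + t\,X_1, \qquad
v_t(x) = \mathbb{E}\left[\,X_1 - X_0 \mid X_t = x\,\right].
\]
% Assumption: (X_0, X_1) and (X_1, X_0) have the same joint law.
% The midpoint X_{1/2} = (X_0 + X_1)/2 is unchanged by swapping the endpoints, so
\[
v_{1/2}(x)
= \mathbb{E}\left[\,X_1 - X_0 \mid X_{1/2} = x\,\right]
= \mathbb{E}\left[\,X_0 - X_1 \mid X_{1/2} = x\,\right]
= -\,v_{1/2}(x)
\;\Longrightarrow\;
v_{1/2}(x) = 0 .
\]
```

When the two endpoint distributions differ, the swap argument no longer applies and the midpoint drift is generally nonzero, which is what makes its norm usable as a discrepancy.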

2605.31498 2026-06-09 cs.LG q-bio.BM 版本更新

Scalable Inference-Time Annealing with Surrogate Likelihood Estimators

可扩展的推理时退火与代理似然估计器

Daniel Peñaherrera, Rishal Aggarwal, David Ryan Koes

发表机构 * CMU-Pitt PhD Program in Computational Biology Dept. of Computational & Systems Biology, University of Pittsburgh, Pittsburgh, PA 15260, USA(卡内基梅隆大学-匹兹堡联合博士项目 计算生物学部门 计算与系统生物学系,匹兹堡大学,匹兹堡,PA 15260,USA)

AI总结 提出可扩展推理时退火(SITA)方法,通过基于能量的模型实现快速代理似然,避免昂贵的散度计算,在丙氨酸二肽和三肽上取得最先进性能。

Comments 26 pages, 5 figures, submitted to JMLR 2026

详情
AI中文摘要

计算化学和生物物理学中长期存在的挑战是高效采样分子的玻尔兹曼分布。生成式建模的进展被提出以解决传统采样技术的局限性,通过消除模拟的计算成本。一个有前景的方向是沿着温度阶梯迭代微调扩散模型,其中训练数据通过推理时退火期间的重要性采样生成。不幸的是,这些方法需要在分数场上计算散度来估计重要性权重,使得它们对于较大系统难以处理。在这里,我们提出可扩展的推理时退火(SITA),它重新训练基于流的模型以在逐渐降低的温度下生成样本,使用基于能量的模型来促进快速代理似然。我们在丙氨酸二肽和丙氨酸三肽上展示了最先进的性能,同时避免了昂贵的散度项。我们的代码可在 https://github.com/countrsignal/sita.git 获取。

英文摘要

A long standing challenge in computational chemistry and biophysics is efficiently sampling the Boltzmann distribution of molecules. Advances in generative modeling have been proposed to address the limitations of conventional sampling techniques by eliminating the computational cost of simulation. A promising direction is iteratively finetuning diffusion models along a temperature ladder whereby training data is generated via importance sampling during inference-time annealing. Unfortunately, these methods require computing a divergence over the score field to estimate importance weights, rendering them intractable for larger systems. Here we present scalable inference-time annealing (SITA), which retrains flow-based models to generate samples at progressively lower temperatures using an energy-based model to facilitate fast surrogate likelihoods. We demonstrate state-of-the-art performance on both Alanine Dipeptide and Alanine Tripeptide while avoiding costly divergence terms. Our code is available at https://github.com/countrsignal/sita.git
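The reweighting step behind temperature annealing can be sketched in a few lines. The example makes a simplifying assumption that the paper's setting does not: the samples are taken to be exact draws from the Boltzmann distribution at inverse temperature `beta_from`, so the importance weight depends only on the energy. With a generative model as the proposal, the weight also requires the model's likelihood, which is the expensive term the abstract refers to. The function names are hypothetical.

```python
import math

def annealing_weights(energies, beta_from, beta_to):
    """Self-normalized weights for reweighting from beta_from to beta_to.

    Assumes samples follow p(x) proportional to exp(-beta_from * E(x)),
    so that w(x) is proportional to exp(-(beta_to - beta_from) * E(x)).
    """
    log_w = [-(beta_to - beta_from) * e for e in energies]
    # Subtract the maximum before exponentiating for numerical stability.
    m = max(log_w)
    unnorm = [math.exp(lw - m) for lw in log_w]
    z = sum(unnorm)
    return [u / z for u in unnorm]

def effective_sample_size(weights):
    """Kish effective sample size of normalized weights."""
    return 1.0 / sum(w * w for w in weights)

weights = annealing_weights([0.0, 1.0, 2.0], beta_from=1.0, beta_to=2.0)
```

Cooling (`beta_to > beta_from`) concentrates weight on low-energy samples. A small temperature step keeps the effective sample size high, which is the usual motivation for a ladder of temperatures.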

2606.04804 2026-06-09 cs.LG 版本更新

The Right Measure for Physics-Constrained Generation: A Co-Area Correction for Posterior-Consistent PDE Inverse Problems

物理约束生成的正确度量:后验一致PDE逆问题的共面积修正

Jian Xu, Yanning Wu, Delu Zeng, John Paisley, Qibin Zhao

发表机构 * University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 针对扩散模型和流匹配在硬约束PDE逆问题中采样后验分布错误的问题,提出共面积修正因子和CoCoS采样器,实现正确的后验采样。

详情
AI中文摘要

生成模型(扩散和流匹配)越来越多地用于求解偏微分方程(PDE)逆问题,将控制物理作为硬约束(通过投影或引导)强制执行,并将所得样本报告为具有校准不确定性的贝叶斯后验。我们表明,这种广泛采用的配方采样了错误的分布。在硬PDE约束上条件化生成先验是在测度零流形上的条件化,这一操作本质上是模糊的(Borel-Kolmogorov悖论),而其物理上正确的解,即小残差噪声极限,携带一个共面积(Fixman)雅可比因子$[\det(JJ^{\top})]^{-1/2}$,而基于投影和引导的方法默默地忽略了它。我们精确地指出了偏差,表明它随约束敏感性的异质性增长,并在受控问题上通过与独立同分布的真实仲裁者对比验证了这一点。被忽略的因子并非二阶细节:移除它会使后验误差膨胀到采样噪声底限的20倍;最小位移投影(如PCFM)的偏差为底限的9倍;而简单的标量重加权无法修复。我们引入了CoCoS,一种测度感知的约束采样器,针对正确的共面积后验,并表明它在采样噪声内与黄金标准后验匹配。我们的结果意味着“满足物理”并不等同于“采样后验”,并为不确定性感知的科学推理提供了原则性的修正。

英文摘要

Generative models -- diffusion and flow matching -- are increasingly used to solve partial differential equation (PDE) inverse problems, enforcing the governing physics as a hard constraint (via projection or guidance) and reporting the resulting samples as a Bayesian posterior with calibrated uncertainty. We show that this widely adopted recipe samples the wrong distribution. Conditioning a generative prior on a hard PDE constraint is conditioning on a measure-zero manifold -- an operation that is intrinsically ambiguous (the Borel-Kolmogorov paradox) and whose physically correct resolution, the small-residual-noise limit, carries a co-area (Fixman) Jacobian factor $[\det(JJ^{\top})]^{-1/2}$ that projection- and guidance-based methods silently omit. We make the bias precise, show that it grows with the heterogeneity of the constraint sensitivity, and validate it on controlled problems against an i.i.d. ground-truth arbiter. The omitted factor is not a second-order detail: removing it inflates the posterior error to $20\times$ the sampling-noise floor; minimal-displacement projection (as in PCFM) is biased at $9\times$ the floor; and a naive scalar reweighting does not fix it. We introduce CoCoS, a measure-aware constrained sampler that targets the correct co-area posterior, and show that it matches the gold-standard posterior to within sampling noise. Our results imply that "satisfying the physics" is not the same as "sampling the posterior," and give a principled correction for uncertainty-aware scientific inference.
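
The co-area factor named in the abstract can be made concrete on a toy constraint. The sketch below is illustrative only and is not the CoCoS sampler: the ellipse constraint, the function names `constraint_jacobian` and `coarea_factor`, and the use of NumPy are all assumptions introduced here. It evaluates $[\det(JJ^{\top})]^{-1/2}$ at two points that both satisfy the constraint exactly.

```python
import numpy as np

def constraint_jacobian(x, a=2.0, b=1.0):
    """Jacobian (1 x 2) of the toy constraint g(x) = x0^2/a^2 + x1^2/b^2 - 1."""
    return np.array([[2.0 * x[0] / a**2, 2.0 * x[1] / b**2]])

def coarea_factor(J):
    """Return [det(J J^T)]^(-1/2) for an m x n Jacobian with full row rank."""
    J = np.atleast_2d(J)
    sign, logdet = np.linalg.slogdet(J @ J.T)
    if sign <= 0:
        raise ValueError("J J^T must be positive definite (J needs full row rank)")
    return float(np.exp(-0.5 * logdet))

# Two points on the ellipse x0^2/4 + x1^2 = 1 (hypothetical example values).
w_major = coarea_factor(constraint_jacobian(np.array([2.0, 0.0])))  # |grad g| = 1
w_minor = coarea_factor(constraint_jacobian(np.array([0.0, 1.0])))  # |grad g| = 2
print(w_major, w_minor)
```

Both points satisfy the constraint, yet their weights differ by a factor of two. This is the kind of heterogeneity in constraint sensitivity that, according to the abstract, projection and guidance methods leave out.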

2412.13858 2026-06-09 cs.AI cs.LG 版本更新

IDEQ -- Improving Diffusion Models for the Traveling Salesman Problem (TSP) by Leveraging the Structure of the Solution Space

IDEQ -- 利用解空间结构改进旅行商问题的扩散模型

Mickael Basson, Philippe Preux

发表机构 * Université de Lille(里尔大学) CNRS(国家科学研究中心) Inria(法国国家信息与自动化技术研究院) UMR 9198-CRIStAL(UMR 9198-CRIStAL研究中心)

AI总结 提出IDEQ方法,通过利用TSP解空间的约束结构和基于2-opt轨道的均匀分布训练目标,改进扩散模型求解TSP,在合成实例和TSPlib上达到新SOTA,接近LKH3性能。

详情
AI中文摘要

我们研究扩散模型求解旅行商问题。基于最近的DIFUSCO和T2TCO方法,我们提出IDEQ。IDEQ通过利用TSP状态空间的约束结构来提高解的质量。IDEQ的另一个关键组成部分是,将DIFUSCO课程学习的最后阶段替换为考虑哈密顿环上的均匀分布,这些环在2-opt算子下的轨道收敛到最优解作为训练目标。我们的实验表明,IDEQ在合成实例上改进了此类神经网络技术的现有水平。更重要的是,我们的实验表明,IDEQ在TSPlib(TSP社区的参考基准)的实例上表现非常好:它紧密匹配最佳启发式算法LKH3的性能,甚至在两个分别包含1577和3795个城市的TSPlib实例上能够获得比LKH3更好的解。IDEQ在500个城市的TSP实例上获得0.3%的最优性差距,在1000个城市的TSP实例上获得0.5%的最优性差距。这为基于神经网络的TSP求解方法设立了新的SOTA。此外,与DIFUSCO和T2TCO相比,IDEQ表现出更低的方差和更好的随城市数量扩展的能力。

英文摘要

We investigate diffusion models to solve the Traveling Salesman Problem. Building on the recent DIFUSCO and T2TCO approaches, we propose IDEQ. IDEQ improves the quality of the solutions by leveraging the constrained structure of the state space of the TSP. Another key component of IDEQ consists in replacing the last stages of the DIFUSCO curriculum learning with a training objective defined as a uniform distribution over the Hamiltonian tours whose orbits under the 2-opt operator converge to the optimal solution. Our experiments show that IDEQ improves the state of the art for such neural-network-based techniques on synthetic instances. More importantly, our experiments show that IDEQ performs very well on the instances of the TSPlib, a reference benchmark in the TSP community: it closely matches the performance of the best heuristic, LKH3, and even obtains better solutions than LKH3 on 2 instances of the TSPlib defined on 1577 and 3795 cities. IDEQ obtains a 0.3% optimality gap on TSP instances made of 500 cities, and 0.5% on TSP instances with 1000 cities. This sets a new SOTA for neural-based methods solving the TSP. Moreover, IDEQ exhibits lower variance and scales better with the number of cities than DIFUSCO and T2TCO.
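
For readers unfamiliar with the 2-opt operator mentioned in the abstract, here is a minimal sketch of the move and of iterating it to a local optimum. It is a generic textbook version, not IDEQ's training procedure; the function names and the random 30-city instance are assumptions made for illustration.

```python
import math
import random

def tour_length(points, tour):
    """Total length of the closed tour visiting points in the given order."""
    n = len(tour)
    return sum(math.dist(points[tour[i]], points[tour[(i + 1) % n]]) for i in range(n))

def two_opt(points, tour):
    """Apply improving 2-opt moves until none remains (a 2-opt local optimum)."""
    tour = list(tour)
    n = len(tour)
    improved = True
    while improved:
        improved = False
        for i in range(n - 1):
            for j in range(i + 2, n):
                if i == 0 and j == n - 1:
                    continue  # the two edges share a city, so there is no valid move
                a, b = points[tour[i]], points[tour[i + 1]]
                c, d = points[tour[j]], points[tour[(j + 1) % n]]
                # Replace edges (a, b) and (c, d) with (a, c) and (b, d).
                delta = math.dist(a, c) + math.dist(b, d) - math.dist(a, b) - math.dist(c, d)
                if delta < -1e-12:
                    tour[i + 1:j + 1] = reversed(tour[i + 1:j + 1])
                    improved = True
    return tour

random.seed(0)
points = [(random.random(), random.random()) for _ in range(30)]  # hypothetical instance
start = list(range(len(points)))
final = two_opt(points, start)
print(tour_length(points, start), tour_length(points, final))
```

The sequence of tours visited from a given start is that start's orbit under the operator. The abstract's training objective concerns tours whose orbits end at the optimal solution, whereas this sketch only guarantees a local optimum.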

2506.04480 2026-06-09 stat.ML cs.LG stat.ME 版本更新

On the Wasserstein Geodesic Principal Component Analysis of probability measures

关于概率测度的Wasserstein测地主成分分析

Nina Vesseron, Elsa Cazelles, Alice Le Brigant, Thierry Klein

发表机构 * CREST-ENSAE, IP Paris(CREST-ENSAE,IP巴黎) CNRS, IRIT, Université de Toulouse(CNRS,IRIT,图卢兹大学) Université Paris 1 Panthéon Sorbonne(巴黎第一大学先贤祠-索邦) ENAC, IMT, Université de Toulouse(ENAC,IMT,图卢兹大学)

AI总结 本文利用Otto-Wasserstein几何,对概率分布集合进行测地主成分分析,通过识别概率测度空间中的测地线来捕捉数据变化模式,并针对高斯分布和绝对连续概率测度提出计算方法。

详情
AI中文摘要

本文关注使用Otto-Wasserstein几何对概率分布集合进行测地主成分分析(GPCA)。目标是识别概率测度空间中能够最好地捕捉底层数据集变化模式的测地线。我们首先处理高斯分布集合的情况,并展示如何将计算提升到可逆线性映射的空间。对于更一般的绝对连续概率测度设置,我们利用一种新颖的方法,通过神经网络参数化Wasserstein空间中的测地线。最后,我们通过各种示例与经典切空间PCA进行比较,并在真实世界数据集上提供说明。

英文摘要

This paper focuses on Geodesic Principal Component Analysis (GPCA) on a collection of probability distributions using the Otto-Wasserstein geometry. The goal is to identify geodesic curves in the space of probability measures that best capture the modes of variation of the underlying dataset. We first address the case of a collection of Gaussian distributions, and show how to lift the computations to the space of invertible linear maps. For the more general setting of absolutely continuous probability measures, we leverage a novel approach to parameterizing geodesics in Wasserstein space with neural networks. Finally, we compare to classical tangent PCA through various examples and provide illustrations on real-world datasets.
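
For the Gaussian case mentioned in the abstract, the standard closed-form expressions below suggest why linear maps are a natural parameterization. They are textbook facts about the 2-Wasserstein distance, not results taken from the paper. The symbols $m_i$, $\Sigma_i$, and $T$ are notation introduced here, and $\Sigma_1$ is assumed invertible.

```latex
W_2^2\bigl(\mathcal{N}(m_1,\Sigma_1),\,\mathcal{N}(m_2,\Sigma_2)\bigr)
  = \lVert m_1 - m_2 \rVert^2
  + \operatorname{tr}\Bigl(\Sigma_1 + \Sigma_2
      - 2\bigl(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\bigr)^{1/2}\Bigr),
\qquad
T = \Sigma_1^{-1/2}\bigl(\Sigma_1^{1/2}\Sigma_2\Sigma_1^{1/2}\bigr)^{1/2}\Sigma_1^{-1/2},
\qquad
\Sigma_t = \bigl((1-t)I + tT\bigr)\,\Sigma_1\,\bigl((1-t)I + tT\bigr),
\quad m_t = (1-t)\,m_1 + t\,m_2 .
```

Here $x \mapsto m_2 + T(x - m_1)$ is the optimal transport map, and $\mathcal{N}(m_t, \Sigma_t)$ for $t \in [0,1]$ is the geodesic between the two Gaussians, so each geodesic is described by a symmetric positive-definite linear map.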

2510.05356 2026-06-09 cs.CV cs.LG 版本更新

Mitigating Diffusion Model Hallucinations with Dynamic Guidance

通过动态引导缓解扩散模型幻觉

Kostas Triaridis, Alexandros Graikos, Aggelina Chatziagapi, Grigorios G. Chrysos, Dimitris Samaras

发表机构 * Stony Brook University(石溪大学) University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

AI总结 针对扩散模型因分数函数过度平滑导致的幻觉问题,提出动态引导方法,沿预定方向选择性锐化分数函数,保留有效语义变化,显著减少幻觉。

Comments Project page: https://cvlab-stonybrook.github.io/DynamicGuidance/

详情
AI中文摘要

扩散模型中的幻觉是指样本出现结构不一致性,这通常是由于学习到的分数函数过度平滑,导致数据分布模式之间的插值。由于语义插值通常是有益的且有助于样本多样性,我们认为需要一种细致且有针对性的解决方案来处理扩散模型幻觉。在这项工作中,我们引入了动态引导,通过仅沿已知会导致伪影的预定方向选择性锐化分数函数来缓解幻觉,同时保留有效的语义变化。这种锐化可以使用预定的类别或语义一致的聚类(在数据分布上形成伪类)来执行。后者允许将动态引导原则性地扩展到文本到图像生成,其中我们选择模式以对应文本描述中细粒度的上下文差异。据我们所知,这是第一种在生成时而非通过事后过滤来解决幻觉的方法。动态引导在受控和自然图像数据集上均显著减少了幻觉,大幅优于基线方法。

英文摘要

Hallucinations in diffusion models are samples with structural inconsistencies that can emerge due to the excessive smoothing of the learned score function, which in turn leads to interpolations between modes of the data distribution. Since semantic interpolations are often desirable and contribute to sample diversity, we believe that a nuanced and targeted solution is required to address diffusion model hallucinations. In this work, we introduce Dynamic Guidance, which mitigates hallucinations by selectively sharpening the score function only along the pre-determined directions known to cause artifacts, while preserving valid semantic variations. This sharpening can be performed using either pre-determined classes or semantically coherent clusters that form pseudo-classes over the data distribution. The latter allows for a principled extension of Dynamic Guidance to text-to-image generation, where we select modes to correspond to fine-grained contextual differences in textual descriptions. To our knowledge, this is the first approach that addresses hallucinations at generation time rather than through post-hoc filtering. Dynamic Guidance substantially reduces hallucinations on both controlled and natural image datasets, significantly outperforming baselines.

2512.20978 2026-06-09 eess.AS cs.AI cs.LG 版本更新

GenTSE: Enhancing Target Speaker Extraction via a Coarse-to-Fine Generative Language Model

GenTSE: 通过粗到细的生成语言模型增强目标说话人提取

Haoyang Li, Xuyi Zhuang, Azmat Adnan, Ye Ni, Wei Rao, Shreyas Gopal, Eng Siong Chng, Boon Siew Han, Yuanjin Zheng

发表机构 * Nanyang Technological University, Singapore(南洋理工大学,新加坡) Southeast University, China(东南大学,中国) Schaeffler Hub for Advanced REsearch (SHARE) at Nanyang Technological University, Singapore(南洋理工大学舍弗勒先进研究中心(SHARE),新加坡)

AI总结 提出GenTSE,一种两阶段仅解码器生成式语言模型,先预测粗粒度语义标记再生成细粒度声学标记,结合冻结语言模型条件训练和直接偏好优化,在Libri2Mix上超越先前基于语言模型的系统。

Comments Accepted to Interspeech 2026

详情
AI中文摘要

基于语言模型(LM)的生成建模已成为目标说话人提取(TSE)的一个有前景的方向,具有改善泛化能力和生成高保真语音的潜力。我们提出GenTSE,一种用于TSE的两阶段仅解码器生成式语言模型:第一阶段预测粗粒度语义标记,第二阶段生成细粒度声学标记。将语义与声学分离可以稳定解码过程,并产生更准确的目标语音。两个阶段均使用连续的SSL或编解码器嵌入,相比离散化提示方法提供更丰富的上下文。为减少曝光偏差,我们采用冻结语言模型条件训练策略,使语言模型以早期检查点预测的标记为条件,以缩小教师强制训练与自回归推理之间的差距。我们进一步应用直接偏好优化(DPO)以更好地将输出与感知偏好对齐。在Libri2Mix上的实验表明,GenTSE在语音质量、可懂度和说话人一致性方面超越了先前基于语言模型的系统。

英文摘要

Language Model (LM)-based generative modeling has emerged as a promising direction for target speaker extraction (TSE), offering potential for improved generalization and high-fidelity speech. We propose GenTSE, a two-stage decoder-only generative LM for TSE: Stage-1 predicts coarse semantic tokens, and Stage-2 generates fine acoustic tokens. Separating semantics and acoustics stabilizes decoding and yields more accurate target speech. Both stages use continuous SSL or codec embeddings, offering richer context than discretized-prompt methods. To reduce exposure bias, we employ a Frozen-LM Conditioning training strategy that conditions the LMs on predicted tokens from earlier checkpoints to reduce the gap between teacher-forcing training and autoregressive inference. We further apply Direct Preference Optimization (DPO) to better align outputs with perceptual preferences. Experiments on Libri2Mix show that GenTSE surpasses previous LM-based systems in speech quality, intelligibility, and speaker consistency.

2601.07013 2026-06-09 stat.ML cs.LG 版本更新

Conditional Normalizing Flows for Forward and Backward Joint State and Parameter Estimation

条件归一化流用于前向和后向联合状态与参数估计

Luke S. Lagunowich, Guoxiang Grayson Tong, Daniele E. Schiavazzi

发表机构 * Department of Computer Science and Engineering, University of Notre Dame(圣母大学计算机科学与工程系) Department of Pediatrics, Stanford University(斯坦福大学儿科系) Department of Applied and Computational Mathematics and Statistics, University of Notre Dame(圣母大学应用与计算数学与统计系)

AI总结 针对非线性非高斯系统,回顾基于条件归一化流的状态滤波方法,其条件嵌入由MLP、Transformer或Mamba-SSM生成;测试最优传输启发的动力学损失对缓解过参数化的作用,并在自动驾驶、患者群体动力学和COVID-19联合SIR估计等应用中评估性能。

详情
AI中文摘要

传统的状态估计滤波算法——如经典卡尔曼滤波、无迹卡尔曼滤波和粒子滤波——在应用于不确定性遵循任意非高斯且可能多峰分布的非线性系统时,性能会下降。本研究回顾了基于条件归一化流进行非线性滤波的状态估计最新方法,其中条件嵌入由标准MLP架构、Transformer或选择性状态空间模型(如Mamba-SSM)生成。此外,我们测试了最优传输启发的动力学损失项在缓解由大量变换组成的流中过参数化问题的有效性。我们研究了这些方法在自动驾驶和患者群体动力学相关应用中的性能,特别关注它们如何处理时间反转和链式预测。最后,我们评估了各种条件策略在真实世界COVID-19联合SIR系统预测和参数估计应用中的性能。

英文摘要

Traditional filtering algorithms for state estimation -- such as classical Kalman filtering, unscented Kalman filtering, and particle filters -- show performance degradation when applied to nonlinear systems whose uncertainty follows arbitrary non-Gaussian, and potentially multi-modal distributions. This study reviews recent approaches to state estimation via nonlinear filtering based on conditional normalizing flows, where the conditional embedding is generated by standard MLP architectures, transformers or selective state-space models (like Mamba-SSM). In addition, we test the effectiveness of an optimal-transport-inspired kinetic loss term in mitigating overparameterization in flows consisting of a large collection of transformations. We investigate the performance of these approaches on applications relevant to autonomous driving and patient population dynamics, paying special attention to how they handle time inversion and chained predictions. Finally, we assess the performance of various conditioning strategies for an application to real-world COVID-19 joint SIR system forecasting and parameter estimation.

2601.23231 2026-06-09 eess.IV cs.LG 版本更新

Solving Inverse Problems with Flow-based Models via Model Predictive Control

基于模型预测控制的流模型逆问题求解

George Webber, Alexander Denker, Riccardo Barbano, Andrew J Reader

发表机构 * University of Cambridge(剑桥大学)

AI总结 提出MPC-Flow框架,将流模型逆问题求解转化为序列控制子问题,实现无需训练的推理时引导,理论联系最优控制,在图像修复任务中表现优异。

Comments Accepted for publication at ICML 2026

详情
AI中文摘要

基于流的生成模型为逆问题提供了强大的无条件先验,但引导其动力学进行条件生成仍然具有挑战性。最近的工作将流模型中的免训练条件生成视为最优控制问题;然而,求解由此产生的轨迹优化问题计算和内存开销都很大,需要对流动力学求导或进行伴随求解。我们提出了MPC-Flow,一个模型预测控制框架,将基于流的生成模型的逆问题求解表述为一系列控制子问题,从而在推理时实现实用的基于最优控制的引导。我们提供了将MPC-Flow与底层最优控制目标联系起来的理论分析,并展示了不同的算法选择如何产生一系列引导算法,包括避免沿生成模型轨迹进行反向传播的情形。我们在基准图像恢复任务上评估了MPC-Flow,涵盖线性和非线性设置,如图像修补、去模糊和超分辨率,并通过在消费级硬件上以量化设置对FLUX.2(32B)进行免训练引导,展示了强大的性能以及对大规模最先进架构的可扩展性。

英文摘要

Flow-based generative models provide strong unconditional priors for inverse problems, but guiding their dynamics for conditional generation remains challenging. Recent work casts training-free conditional generation in flow models as an optimal control problem; however, solving the resulting trajectory optimisation is computationally and memory intensive, requiring differentiation through the flow dynamics or adjoint solves. We propose MPC-Flow, a model predictive control framework that formulates inverse problem solving with flow-based generative models as a sequence of control sub-problems, enabling practical optimal control-based guidance at inference time. We provide theoretical analysis linking MPC-Flow to the underlying optimal control objective and show how different algorithmic choices yield a spectrum of guidance algorithms, including regimes that avoid backpropagation through the generative model trajectory. We evaluate MPC-Flow on benchmark image restoration tasks, spanning linear and non-linear settings such as in-painting, deblurring, and super-resolution, and demonstrate strong performance and scalability to massive state-of-the-art architectures via training-free guidance of FLUX.2 (32B) in a quantised setting on consumer hardware.

2601.23286 2026-06-09 cs.CV cs.AI cs.LG 版本更新

VideoGPA: Distilling Geometry Priors for 3D-Consistent Video Generation

VideoGPA: 通过几何先验知识蒸馏实现3D一致的视频生成

Hongyang Du, Junjie Ye, Xiaoyan Cong, Runhao Li, Jingcheng Ni, Aman Agarwal, Zeqi Zhou, Zekun Li, Randall Balestriero, Yue Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 VideoGPA通过几何先验知识蒸馏提升视频生成的3D一致性,利用数据高效的自监督框架引导视频扩散模型,显著增强时间稳定性、几何合理性与运动一致性。

Comments 8 pages, 5 figures, ICML 2026

详情
AI中文摘要

尽管最近的视频扩散模型(VDMs)能产生视觉上令人印象深刻的结果,但它们在保持3D结构一致性方面存在根本性困难,常导致物体变形或空间漂移。我们假设这些失败是因为标准去噪目标缺乏显式的几何一致性激励。为此,我们引入VideoGPA(视频几何偏好对齐),一种数据高效的自监督框架,利用几何基础模型自动推导密集偏好信号,通过直接偏好优化(DPO)引导VDMs。该方法有效将生成分布引导至内在3D一致性,而无需人工标注。VideoGPA通过最少的偏好对显著提升了时间稳定性、几何合理性与运动一致性,在大量实验中一致优于最先进基线。

英文摘要

While recent video diffusion models (VDMs) produce visually impressive results, they fundamentally struggle to maintain 3D structural consistency, often resulting in object deformation or spatial drift. We hypothesize that these failures arise because standard denoising objectives lack explicit incentives for geometric coherence. To address this, we introduce VideoGPA (Video Geometric Preference Alignment), a data-efficient self-supervised framework that leverages a geometry foundation model to automatically derive dense preference signals that guide VDMs via Direct Preference Optimization (DPO). This approach effectively steers the generative distribution toward inherent 3D consistency without requiring human annotations. VideoGPA significantly enhances temporal stability, geometric plausibility, and motion coherence using minimal preference pairs, consistently outperforming state-of-the-art baselines in extensive experiments.
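
As background for the Direct Preference Optimization (DPO) step named in the abstract, the sketch below evaluates the generic pairwise DPO loss on scalar log-probabilities. It is not VideoGPA's objective, which for a diffusion model may be expressed differently (for example through denoising errors). The function name `dpo_loss`, the value of `beta`, and the example numbers are assumptions.

```python
import math

def dpo_loss(logp_win, logp_lose, ref_logp_win, ref_logp_lose, beta=0.1):
    """Generic DPO loss for one preference pair.

    margin = beta * [(log pi(y_w) - log pi_ref(y_w)) - (log pi(y_l) - log pi_ref(y_l))]
    loss   = -log(sigmoid(margin)), computed in a numerically stable way.
    """
    margin = beta * ((logp_win - ref_logp_win) - (logp_lose - ref_logp_lose))
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))

# Hypothetical log-probabilities for a preferred and a rejected sample.
neutral = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # policy equals the reference
aligned = dpo_loss(-1.0, -3.0, -2.0, -2.0)   # policy favours the preferred sample
reversed_pref = dpo_loss(-3.0, -1.0, -2.0, -2.0)
print(neutral, aligned, reversed_pref)
```

The loss falls below its neutral value when the policy raises the preferred sample relative to the reference model and rises above it in the opposite case.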

2602.07345 2026-06-09 cs.CV cs.LG 版本更新

Optimizing Few-Step Generation with Adaptive Matching Distillation

自适应匹配蒸馏优化少步生成

Lichen Bai, Zikai Zhou, Shitong Shao, Wenliang Zhong, Shuo Yang, Shuo Chen, Bojun Chen, Zeke Xie

发表机构 * xLeaF Lab, The Hong Kong University of Science(xLeaF实验室,香港科学与技术大学) Harbin Institute of Technology, Shenzhen(哈尔滨工业大学,深圳) School of Intelligence Science(智能科学学院)

AI总结 提出自适应匹配蒸馏(AMD),通过奖励代理检测并逃离禁止区域,结合结构信号分解和排斥景观锐化,提升少步生成模型的样本保真度和训练鲁棒性。

Comments 25 pages, 15 figures, 11 tables

详情
AI中文摘要

分布匹配蒸馏(DMD)是一种强大的加速范式,但其稳定性常在禁止区域(真实教师提供不可靠指导而虚假教师施加不足排斥力的区域)中受到损害。在这项工作中,我们提出了一个统一的优化框架,将先前的方法重新解释为避免这些受损区域的隐式策略。基于这一见解,我们引入了自适应匹配蒸馏(AMD),一种利用奖励代理显式检测和逃离禁止区域的自我纠正机制。AMD通过结构信号分解动态优先考虑纠正梯度,并引入排斥景观锐化以强制执行陡峭的能量屏障,防止失败模式崩溃。在图像和视频生成任务(如SDXL、Wan2.1)以及严格基准测试(如VBench、GenEval)上的大量实验表明,AMD显著提高了样本保真度和训练鲁棒性。例如,AMD将SDXL上的HPSv2分数从30.64提升至31.25,优于最先进的基线。这些发现验证了在禁止区域内显式纠正优化轨迹对于推动少步生成模型性能上限至关重要。

英文摘要

Distribution Matching Distillation (DMD) is a powerful acceleration paradigm, yet its stability is often compromised in the Forbidden Zone, i.e., regions where the real teacher provides unreliable guidance while the fake teacher exerts insufficient repulsive force. In this work, we propose a unified optimization framework that reinterprets prior art as implicit strategies to avoid these corrupted regions. Based on this insight, we introduce Adaptive Matching Distillation (AMD), a self-correcting mechanism that utilizes reward proxies to explicitly detect and escape Forbidden Zones. AMD dynamically prioritizes corrective gradients via structural signal decomposition and introduces Repulsive Landscape Sharpening to enforce steep energy barriers against failure mode collapse. Extensive experiments across image and video generation tasks (e.g., SDXL, Wan2.1) and rigorous benchmarks (e.g., VBench, GenEval) demonstrate that AMD significantly enhances sample fidelity and training robustness. For instance, AMD improves the HPSv2 score on SDXL from 30.64 to 31.25, outperforming state-of-the-art baselines. These findings validate that explicitly rectifying optimization trajectories within Forbidden Zones is essential for pushing the performance ceiling of few-step generative models.

2602.18364 2026-06-09 cs.IT cs.LG math.IT quant-ph stat.ML 版本更新

Quantum Maximum Likelihood Prediction via Hilbert Space Embeddings

通过希尔伯特空间嵌入的量子最大似然预测

Sreejith Sreekumar, Nir Weinberger

发表机构 * L2S, CNRS, CentraleSupélec, University of Paris-Saclay, France(L2S、CNRS、CentraleSupélec、巴黎-萨克雷大学、法国)

AI总结 研究量子最大似然预测任务,通过将经验概率分布嵌入量子态并最小化量子相对熵,提出统一框架,给出非渐近性能保证。

Comments 31+3 pages, 1 figure

详情
AI中文摘要

最大似然预测(MLP)是现代大型语言模型的核心任务。这里,作为第一步,我们针对由独立同分布样本组成的简化数据模型研究该任务的量子版本。量子最大似然预测器通过将经验概率分布嵌入量子态,并在给定的状态类上最小化量子相对熵得到。当量子模型类具有足够表达能力时,我们从量子反向信息投影和量子勾股定理的角度给出了该预测器的解释。我们进一步推导了在迹范数和量子相对熵下的非渐近性能保证,包括收敛速度和集中不等式。我们的方法为处理经典和量子LLM中的最大似然预测提供了统一框架。

英文摘要

Maximum likelihood prediction (MLP) is a core task at the heart of modern large language models. Here, as a first step, we study a quantum version of this task for a simplified data model consisting of independent and identically distributed samples. The quantum maximum likelihood predictor is obtained by embedding empirical probability distributions into quantum states and minimizing the quantum relative entropy over a given class of states. We provide an interpretation of this predictor in terms of quantum reverse information projection and the quantum Pythagorean theorem when the class of quantum models is sufficiently expressive. We further derive non-asymptotic performance guarantees in terms of convergence rates and concentration inequalities, both in trace norm and quantum relative entropy. Our approach provides a unified framework to handle MLP within both classical and quantum LLMs.
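
The quantity being minimized, the quantum relative entropy $D(\rho\,\|\,\sigma) = \operatorname{Tr}[\rho(\log\rho - \log\sigma)]$, can be evaluated numerically as in the sketch below. The diagonal embedding of a probability vector used here is only the simplest illustrative choice and is not claimed to be the paper's embedding; the function names and the restriction to full-rank states are likewise assumptions.

```python
import numpy as np

def matrix_log(a):
    """Matrix logarithm of a Hermitian positive-definite matrix via eigendecomposition."""
    w, v = np.linalg.eigh(a)
    return (v * np.log(w)) @ v.conj().T

def quantum_relative_entropy(rho, sigma):
    """D(rho || sigma) = Tr[rho (log rho - log sigma)] in nats; both states full rank."""
    return float(np.real(np.trace(rho @ (matrix_log(rho) - matrix_log(sigma)))))

# Hypothetical empirical distribution p and model distribution q, embedded diagonally.
p = np.array([0.5, 0.5])
q = np.array([0.9, 0.1])
rho, sigma = np.diag(p), np.diag(q)

d_quantum = quantum_relative_entropy(rho, sigma)
d_classical = float(np.sum(p * np.log(p / q)))  # Kullback-Leibler divergence
print(d_quantum, d_classical)
```

For commuting (here diagonal) states the quantum relative entropy coincides with the classical Kullback-Leibler divergence, which is why the quantum predictor can be read as a generalization of the classical maximum likelihood setting.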

2603.10823 2026-06-09 stat.ML cs.LG 版本更新

ReTabSyn: Realistic Tabular Data Synthesis via Reinforcement Learning

ReTabSyn:通过强化学习实现真实表格数据合成

Xiaofeng Lin, Seungbae Kim, Zhuoya Li, Zachary DeSoto, Charles Fleming, Guang Cheng

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 ReTabSyn通过强化学习优先学习条件分布,提升小数据下表格数据合成效率,优于现有基线方法。

详情
AI中文摘要

深度生成模型可通过生成合成训练数据缓解数据稀缺和隐私问题,但在低数据、不平衡的表格设置中难以完全学习复杂的数据分布。我们认为追求完整的联合分布可能过于苛刻;为了提高数据效率,模型应优先学习条件分布$P(y\mid \mathbf{X})$,这一点得到了最近理论分析的支持。因此,我们提出ReTabSyn(Reinforced Tabular Synthesis,强化表格合成)流程来克服这一限制,它在合成器训练过程中就特征相关性的保留提供直接反馈。这一目标鼓励生成器在训练数据有限时优先考虑最有用的预测信号,从而增强下游模型的实用性。我们用该方法对基于语言模型的生成器进行了微调实验;在具有小样本量、类别不平衡和分布偏移的基准测试中,ReTabSyn始终优于最先进的基线方法。此外,我们的方法可以轻松扩展到控制合成表格数据的各个方面,例如对生成的观测施加专家指定的约束。

英文摘要

Deep generative models can help with data scarcity and privacy by producing synthetic training data, but they struggle in low-data, imbalanced tabular settings to fully learn the complex data distribution. We argue that striving for the full joint distribution could be overkill; for greater data efficiency, models should prioritize learning the conditional distribution $P(y\mid \mathbf{X})$, as suggested by recent theoretical analysis. Therefore, we overcome this limitation with ReTabSyn, a Reinforced Tabular Synthesis pipeline that provides direct feedback on feature correlation preservation during synthesizer training. This objective encourages the generator to prioritize the most useful predictive signals when training data is limited, thereby strengthening downstream model utility. We empirically fine-tune a language model-based generator using this approach, and across benchmarks with small sample sizes, class imbalance, and distribution shift, ReTabSyn consistently outperforms state-of-the-art baselines. Moreover, our approach can be readily extended to control various aspects of synthetic tabular data, such as applying expert-specified constraints on generated observations.

2605.02439 2026-06-09 cs.CV cs.LG 版本更新

Anomaly-Preference Image Generation

异常偏好图像生成

Fuyun Wang, Yuanzhi Wang, Xu Guo, Sujia Huang, Tong Zhang, Dan Wang, Hui Yan, Xin Liu, Zhen Cui

发表机构 * Nanjing University of Science(南京理工大学) Beijing Normal University, Beijing, China(北京师范大学) China Academy of Space Technology, Beijing, China(中国空间技术研究院)

AI总结 本文提出了一种新的异常生成方法,通过隐式偏好对齐机制和时间感知能力分配模块,提升生成图像的真实性和多样性,实验表明其在真实性和多样性上均优于现有方法。

Comments Accepted by ICML 2026

详情
AI中文摘要

从有限数据中合成逼真且多样的异常样本对于鲁棒的模型泛化至关重要。然而,现有方法难以兼顾保真度和多样性,二者通常分别受分布不匹配和过拟合的阻碍。为缓解这一问题,我们引入了异常偏好优化,一种将异常生成重新表述为偏好学习问题的新范式。我们方法的核心是隐式偏好对齐机制,它利用真实异常作为正例参考,直接从去噪轨迹偏差中推导优化信号,而无需昂贵的人工标注。此外,我们提出了一个时间感知容量分配模块,沿扩散时间线动态分配模型容量,在高噪声阶段优先考虑结构多样性,在低噪声阶段增强细粒度保真度。在推理过程中,分层采样策略调节连贯性与对齐之间的权衡,实现对生成过程的精确控制。大量实验表明,该方法显著优于现有基线,在真实性和多样性方面均达到最先进性能。

英文摘要

Synthesizing realistic and diverse anomalous samples from limited data is vital for robust model generalization. However, existing methods struggle to reconcile fidelity and diversity, often hampered by distribution misalignment and overfitting, respectively. To mitigate this, we introduce Anomaly Preference Optimization, a novel paradigm that reformulates anomaly generation as a preference learning problem. Central to our approach is an implicit preference alignment mechanism that leverages real anomalies as positive references, deriving optimization signals directly from denoising trajectory deviations without requiring costly human annotation. Furthermore, we propose a Time-Aware Capacity Allocation module that dynamically distributes model capacity along the diffusion timeline, prioritizing structural diversity during high-noise phases while enhancing fine-grained fidelity in low-noise stages. During inference, a hierarchical sampling strategy modulates the coherence-alignment trade-off, enabling precise control over generation. Extensive experiments demonstrate that our method significantly outperforms existing baselines, achieving state-of-the-art performance in both realism and diversity.

2605.14285 2026-06-09 eess.IV cs.LG 版本更新

ForcingDAS: Unified and Robust Data Assimilation via Diffusion Forcing

通过扩散强迫实现统一且稳健的数据同化:ForcingDAS

Yixuan Jia, Siyi Chen, Yida Pan, Xiao Li, Lianghe Shi, Chanyong Jung, Haijie Yuan, Ismail Alkhouri, Yue Cynthia Wu, Saiprasad Ravishankar, Jeffrey A Fessler, Qing Qu

发表机构 * University of Michigan(密歇根大学) University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学) Massachusetts Institute of Technology(麻省理工学院) University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出ForcingDAS,一种基于扩散强迫的统一数据同化框架,能够捕捉长时序依赖并减少误差积累,同时在推理时无需重新训练即可实现滤波到平滑的全谱应用。

详情
AI中文摘要

数据同化(DA)根据含噪声的不完整观测估计演化中的动力系统状态,广泛应用于科学模拟以及天气和气候科学。在实践中,滤波方法依赖于帧间转移模型;但当观测非马尔可夫时(例如真实天气数据只是高维潜在状态的部分切片),这些模型十分脆弱,容易在长时域上累积误差。同时,基于学习的DA方法通常只针对单一模式,要么滤波(临近预报、实时预报),要么平滑(回顾性再分析),这使本应共享的先验被割裂到各个面向特定应用的流程中。为解决这两个问题,我们引入ForcingDAS,一种统一且稳健的DA框架。该框架基于为每一帧分配独立噪声水平的扩散强迫(Diffusion Forcing),学习联合轨迹先验而非帧间转移,从而能够捕捉长时域时间依赖并减少误差累积。此外,同一个训练好的模型在推理时即可覆盖从滤波到平滑的完整谱系:临近预报、固定滞后平滑和批量再分析仅通过推理调度来选择,无需重新训练。我们在二维Navier-Stokes涡量、降水临近预报和全球大气状态估计上评估了ForcingDAS。在所有设置中,单个模型与专门针对各个模式的学习型基线和经典基线相比均具有竞争力或更优,其中在真实天气基准上的提升最大。

英文摘要

Data assimilation (DA) estimates the state of an evolving dynamical system from noisy, partial observations, and is widely used in scientific simulation as well as weather and climate science. In practice, filtering methods rely on frame-to-frame transition models. However, these models are fragile when observations are non-Markovian (when they form only a partial slice of a higher-dimensional latent state as in real-world weather data): they tend to accumulate errors over long horizons. At the same time, learned DA methods typically commit to a single regime, either filtering (nowcasting, real-time forecasting) or smoothing (retrospective reanalysis), which splits what should be a shared prior across application-specific pipelines. To address both issues, we introduce ForcingDAS, a unified and robust DA framework. Built on Diffusion Forcing with an independent noise level assigned to each frame, ForcingDAS learns a joint-trajectory prior instead of frame-to-frame transitions. This allows it to capture long-horizon temporal dependencies and reduce error accumulation. In addition, the same trained model spans the full filtering to smoothing spectrum at inference time. Specifically, nowcasting, fixed-lag smoothing, and batch reanalysis are selected through the inference schedule alone, without retraining. We evaluate ForcingDAS on 2D Navier-Stokes vorticity, precipitation nowcasting, and global atmospheric state estimation. Across all settings, a single model is competitive with or outperforms both learned and classical baselines that are specialized for individual regimes, with the largest gains observed on real-world weather benchmarks.

2410.14949 2026-06-09 cs.LG stat.ML 版本更新

On the Convergence and Straightness of Rectified Flow

关于校正流的收敛性与直线性

Vansh Bansal, Saptarshi Roy, Alessandro Rinaldo, Purnamrita Sarkar

发表机构 * Department of Statistics and Data Sciences, UT Austin(统计与数据科学系,德克萨斯大学奥斯汀分校)

AI总结 本文提出分段直线性(Piecewise Straightness)参数$\gamma_{2,T}$,建立首个将一般流模型离散化误差与$\gamma_{2,T}$显式联系起来的Wasserstein收敛界,证明最小化曲率是实现高保真单步采样的关键,同时为RF的直线性分析提供了理论框架。

Comments 37 pages

详情
AI中文摘要

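流匹配已成为Stable Diffusion 3等现代生成模型的基石,这在很大程度上归功于其校正流(Rectified Flow,RF)变体的高效性。RF的成功依赖于迭代地学习直线轨迹,从而推动生成过程使用更少的采样步数。然而,路径几何与采样效率之间的理论联系尚未得到充分研究。本文通过引入一个新的分段直线性(Piecewise Straightness)参数$\gamma_{2,T}$来填补这一空白。我们建立了首个将任意一般流模型的离散化误差与$\gamma_{2,T}$显式联系起来的Wasserstein收敛界,证明最小化曲率是实现高保真单步采样的关键。在此理论基础上,我们建立了首个分析RF直线性的理论框架。我们首先针对简单情形给出直观的几何论证,然后给出充分条件,使得单次校正(1-RF)即可得到完全直线的耦合,甚至是Monge最优耦合。尽管这些充分条件是否成立取决于问题的几何结构,它们使该领域首次出现了具体的证明。关键的是,满足这些条件会使后续的流(2-RF)完全为直线($\gamma_{2,T}=0$),从而消除我们界中的离散化误差,使无误差的单步采样成为可能。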

英文摘要

Flow Matching has become a cornerstone of modern generative models like Stable Diffusion 3, largely due to the efficiency of its Rectified Flow (RF) variant. The success of RF hinges on iteratively learning straight trajectories, pushing generation towards fewer sampling steps. However, the theoretical link between path geometry and sampling efficiency has been underexplored. This paper fills this gap by introducing a novel Piecewise Straightness parameter, $\gamma_{2,T}$. We establish the first Wasserstein convergence bound that explicitly links the discretization error of any general flow model to $\gamma_{2,T}$, proving that minimizing curvature is the key to achieving high-fidelity, one-step sampling. Building on this theory, we establish the first theoretical framework to analyze the straightness of RF. We begin by offering intuitive geometric arguments for simple cases before identifying sufficient conditions under which a single rectification step (1-RF) yields a perfectly straight or even a Monge optimal coupling. While whether these sufficient conditions are met depends on the problem geometry, they enable the first concrete proofs in this area. Critically, fulfilling these conditions makes the subsequent flow (2-RF) perfectly straight ($\gamma_{2,T}=0$). This eliminates the discretization error in our bound and makes flawless, single-step sampling possible.
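
The link between straightness and one-step sampling can be seen in a scalar toy problem. The sketch below is not the paper's bound or the $\gamma_{2,T}$ parameter; it only compares the Euler discretization error for a trajectory with constant velocity against one whose velocity changes along the path. The two velocity fields and the function name `euler` are assumptions chosen for illustration.

```python
import math

def euler(velocity, x0, steps):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t = 1 with `steps` Euler steps."""
    x, h = x0, 1.0 / steps
    for k in range(steps):
        x = x + h * velocity(x, k * h)
    return x

def straight(x, t):
    return 3.0  # constant velocity along the trajectory: x(t) = x0 + 3t

def curved(x, t):
    return x    # velocity changes along the trajectory: x(t) = x0 * e^t

x0 = 1.0
err_straight_1 = abs(euler(straight, x0, 1) - (x0 + 3.0))
err_curved_1 = abs(euler(curved, x0, 1) - x0 * math.e)
err_curved_100 = abs(euler(curved, x0, 100) - x0 * math.e)
print(err_straight_1, err_curved_1, err_curved_100)
```

With constant velocity a single Euler step lands exactly on the endpoint, whereas the second field needs many steps to reduce its error, which mirrors the abstract's claim that a perfectly straight flow removes the discretization error.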

2412.11439 2026-06-09 cs.LG cs.AI physics.chem-ph 版本更新

Sampling Out-of-Distribution Chemical Spaces via Bayesian Flow

通过贝叶斯流采样分布外化学空间

Nianze Tao, Minori Abe

发表机构 * Hiroshima University(广岛大学) Tokyo University of Agriculture(东京农业大学)

AI总结 本文提出利用贝叶斯流网络生成高质量分布外分子,通过强化学习策略和可控的类常微分方程求解器生成过程提升采样效率,并引入半自回归策略提升模型性能。

Comments 35 pages, 14 figures, 9 tables

详情
AI中文摘要

生成性质优于训练空间的新型分子,即分布外生成,对从头药物设计至关重要。然而,基于分布学习的模型(如扩散模型)难以应对这一挑战,因为这些方法的设计目标是尽可能贴近训练数据的分布。在本文中,我们表明贝叶斯流网络,特别是ChemBFN模型,能够内在地生成满足多种场景需求的高质量分布外样本。我们为ChemBFN添加了强化学习策略,并采用一种可控的、类似常微分方程求解器的生成过程来加速采样。最重要的是,我们在训练和推理过程中引入了半自回归策略,以提升模型性能并超越最先进的模型。此外,本文还给出了采用半自回归方法的ChemBFN进行分布外生成的理论分析。

英文摘要

Generating novel molecules with properties superior to those found in the training space, namely out-of-distribution generation, is important for de novo drug design. However, it is not easy for distribution-learning-based models, for example diffusion models, to solve this challenge, as these methods are designed to fit the distribution of the training data as closely as possible. In this paper, we show that Bayesian flow networks, especially the ChemBFN model, are capable of intrinsically generating high-quality out-of-distribution samples that meet several scenarios. A reinforcement learning strategy is added to ChemBFN, and a controllable, ordinary-differential-equation-solver-like generating process is employed that accelerates sampling. Most importantly, we introduce a semi-autoregressive strategy during training and inference that enhances the model performance and surpasses the state-of-the-art models. A theoretical analysis of out-of-distribution generation in ChemBFN with the semi-autoregressive approach is included as well.

2401.14591 2026-06-09 cs.LG stat.ML 版本更新

Ricci flow regularization in latent spaces for the forward learning of partial differential equations

用于偏微分方程前向学习的潜在空间里奇流正则化

Andrew Gracyk

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出基于流形的机器学习编码器-解码器方法,通过里奇流演化潜在空间来学习时间动态,特别是偏微分方程。方法通过参数化潜在流形并模拟物理约束下的里奇流,实现低维表示学习及对抗鲁棒性。

Comments Fixed a small error in appendix; some improvements to experiments

详情
AI中文摘要

我们提出了一种基于流形的机器学习编码器-解码器方法,用于学习时间动态,特别是偏微分方程(PDE),其中流形潜在空间按照里奇流演化。其实现方式是先参数化潜在流形,再在物理信息(physics-informed)设定下模拟里奇流,通过匹配流形上的各个量使里奇流在经验意义上成立。我们着重考虑允许低维表示的动力学。在我们的方法中,由度量诱导的流形通过训练过程得以确定,而里奇流驱动的潜在演化提供了一种适应性强的表示。借助该流,我们对环境PDE时间区间连续统中的所有取值维持一个规范的流形潜在表示。我们展示了里奇流有助于获得一些性质,例如对分布外数据的学习,以及在选定PDE数据上的对抗鲁棒性。此外,我们还针对允许更高维表示的特殊情形对方法进行了详尽扩展,例如超球面上的里奇流,以及采用熵策略对非参数几何流的神经发现。

英文摘要

We present a manifold-based machine learning encoder-decoder method for learning dynamics in time, notably partial differential equations (PDEs), in which the manifold latent space evolves according to Ricci flow. This can be accomplished by parameterizing the latent manifold stage and subsequently simulating Ricci flow in a physics-informed setting, matching manifold quantities so that Ricci flow is empirically achieved. We emphasize dynamics that admit low-dimensional representations. With our method, the manifold, induced by the metric, is discerned through the training procedure, while the latent evolution due to Ricci flow provides an accommodating representation. By use of this flow, we sustain a canonical manifold latent representation for all values in the ambient PDE time interval continuum. We showcase that the Ricci flow facilitates qualities such as learning for out-of-distribution data and adversarial robustness on select PDE data. Moreover, we provide a thorough expansion of our methods in regard to special cases which allow higher-dimensional representations, such as Ricci flow on the hypersphere and neural discovery of non-parametric geometric flows with entropic strategies.

5. 优化、泛化与理论分析 58 篇

2606.07561 2026-06-09 cs.LG stat.ME stat.ML 新提交

Boundary Variance Inflation Causes Acquisition Bias in Gaussian Processes

边界方差膨胀导致高斯过程中的采集偏差

Maria Bånkestad, Sanna Jarl, Jens Sjölund

发表机构 * RISE Research Institutes of Sweden(瑞典RISE研究院) Uppsala University(乌普萨拉大学)

AI总结 本文揭示有界域上平稳核高斯过程边界方差膨胀的根本原因是核相关邻域截断,并证明该几何扭曲导致三类采集函数产生系统性偏差,提出无函数选择剖面诊断方法。

Comments 14 pages, 8 figures; appendices included

详情
AI中文摘要

具有平稳核的高斯过程在有界域上会在边界附近表现出膨胀的后验方差。尽管这在地统计学中是一个长期被认识到的伪影,并且在贝叶斯优化中是过度探索的来源,但边界引起的采集偏差的原因和影响尚未得到充分探索。我们将根本原因追溯到一个简单的几何机制:核相关邻域在域边界处的截断产生了一种与观测无关的扭曲,且随着维度的增加而恶化。我们展示了这种扭曲如何在三类采集函数中表现出来:方差最大化将选择集中在角落,而负积分后验方差和期望预测信息增益则将选择向内移动到轴向内部壳层。这些模式的出现不依赖于任何目标函数,这意味着采集行为可能由核几何主导,而非期望的任务特定不确定性。为了量化这一点,我们引入了一种针对任意采集函数、核和有界域几何的无函数选择剖面诊断方法。

英文摘要

Gaussian processes with stationary kernels on bounded domains exhibit inflated posterior variance near the boundary. Despite being a long-recognized artifact in geostatistics and a source of over-exploration in Bayesian optimization, the causes and effects of boundary-induced acquisition bias are underexplored. We trace the root cause to a simple geometric mechanism: the truncation of the kernel correlation neighborhood at the domain boundary creates an observation-independent distortion that worsens with dimensionality. We show how this distortion manifests across three acquisition classes: variance maximization concentrates selections at the corners, whereas negative integrated posterior variance and expected predictive information gain move selections inward to axis-aligned interior shells. These patterns arise without reference to any objective function, meaning that acquisition behavior can be dominated by kernel geometry rather than the desired task-specific uncertainty. To quantify this, we introduce a function-free selection-profile diagnostic for arbitrary acquisitions, kernels, and bounded-domain geometries.
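
The boundary effect described in the abstract can be reproduced in one dimension with a few lines of NumPy. The sketch below is an illustration under assumptions made here (an RBF kernel with unit prior variance, lengthscale 0.1, ten evenly spaced observations on [0, 1], and a small jitter term); it is not the paper's diagnostic.

```python
import numpy as np

def rbf(a, b, lengthscale=0.1):
    """Stationary RBF kernel with unit prior variance between 1-D point sets."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def posterior_variance(x_train, x_query, noise=1e-6):
    """GP posterior variance at x_query; it does not depend on the observed values."""
    K = rbf(x_train, x_train) + noise * np.eye(len(x_train))
    k_star = rbf(x_train, x_query)
    return 1.0 - np.sum(k_star * np.linalg.solve(K, k_star), axis=0)

n = 10
x_train = (np.arange(n) + 0.5) / n   # 0.05, 0.15, ..., 0.95
x_query = np.array([0.0, 0.5])       # domain edge vs. interior midpoint
var_edge, var_mid = posterior_variance(x_train, x_query)
print(var_edge, var_mid)
```

Both query points lie 0.05 from their nearest observation, but the edge point has neighbours on one side only, so its posterior variance is larger. Because the variance is computed without any observed function values, the effect is independent of the objective, as the abstract notes.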

2606.07589 2026-06-09 cs.LG 新提交

Optimality of Sequential Filtering Under Independent Cost and Selectivity Models

独立成本与选择性模型下顺序过滤的最优性

Hrishikesh Paranjape, Abhishek Mandal, Xian Sun

发表机构 * IEEE International Conference on Electro/Information Technology (EIT 2026)(IEEE国际电子/信息科技会议(EIT 2026))

AI总结 针对顺序过滤管道,在独立模型下证明按成本与拒绝概率递增比率排序可最小化期望总成本,并通过蒙特卡洛模拟验证其优于常见启发式方法。

Comments 2 pages, 2 figures. Accepted at the 2026 IEEE International Conference on Electro/Information Technology (EIT 2026)

详情
AI中文摘要

顺序过滤管道是大规模系统中的常见设计模式:大量条目依次经过一系列各自产生成本的阶段而被逐步筛减。尽管这类管道在排序系统、级联机器学习推理和欺诈检测中普遍存在,过滤器的顺序通常由启发式方法决定,缺乏形式化保证。我们在期望成本目标下形式化了顺序过滤,并证明在独立性模型下,按成本与拒绝概率之比从小到大排列过滤器可最小化期望总成本。大量蒙特卡洛模拟表明,最优排序在所有运行中都严格优于常见启发式方法,无论是在期望意义上还是在结果的完整分布上。

英文摘要

Sequential filtering pipelines are a common design pattern in large-scale systems, where a large population of items is progressively reduced by a sequence of stages that each incur cost. Despite their prevalence in ranking systems, cascaded machine learning inference, and fraud detection, filter ordering is often determined by heuristics without formal guarantees. We formalize sequential filtering under an expected-cost objective and prove that, under an independence model, ordering filters by increasing ratio of cost to rejection probability minimizes expected total cost. Extensive Monte Carlo simulations show that the optimal ordering strictly dominates common heuristics across all runs, both in expectation and across the full distribution of outcomes.
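
The ordering rule stated in this abstract can be checked on a small instance. The sketch below assumes each filter is a hypothetical `(cost, reject_prob)` pair with a strictly positive rejection probability, and that rejections are independent; the four example filters are made up for illustration. It compares the ratio-sorted order against a brute-force search over all orderings.

```python
from itertools import permutations

def expected_cost(filters):
    """Expected per-item cost of applying (cost, reject_prob) filters in order."""
    total, survive = 0.0, 1.0
    for cost, reject_prob in filters:
        total += survive * cost        # a stage is paid only if the item is still alive
        survive *= 1.0 - reject_prob   # independence: survival probabilities multiply
    return total

def ratio_order(filters):
    """Sort filters by increasing cost / rejection probability."""
    return sorted(filters, key=lambda f: f[0] / f[1])

# Hypothetical filters: (cost, rejection probability).
filters = [(5.0, 0.9), (1.0, 0.1), (2.0, 0.5), (4.0, 0.6)]

best_ratio = expected_cost(ratio_order(filters))
best_brute = min(expected_cost(order) for order in permutations(filters))
```

The rule follows from a pairwise exchange argument: placing filter i directly before filter j is no worse exactly when c_i / p_i <= c_j / p_j, which is why `best_ratio` matches `best_brute` here.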

2606.07623 2026-06-09 cs.LG cs.LO 新提交

Finite Certificates for In-Context Determinacy and a Threshold Theory of Emergence in Language Models

上下文确定性有限证书与语言模型中涌现的阈值理论

Faruk Alpay, Hamdi Alakkad

发表机构 * Bahcesehir University(巴切谢希尔大学)

AI总结 提出用有限语义证书验证上下文条件语言模型行为,证明有限域线性任务族中确定性准则,并证明阈值涌现的反幻象定理,将阈值度量与语义置信度分离。

Comments 40 pages; ancillary files provided

详情
AI中文摘要

本文开发了一个模型论框架,通过用有限语义证书替代基准标签来验证上下文条件语言模型行为。第一个问题是有限确定性:上下文中的示例何时在不改变模型参数的情况下强制查询答案?在有限域线性任务族中,我们证明了精确的行空间准则,计算了残差假设数量,推导了完整和查询局部识别曲线,并表明即使对于二元输出,提取最小强制子上下文也是NP完全的。第二个问题是阈值涌现:何时明显的基准跳跃反映语义转换而非评分映射的不连续性?我们证明了一个反幻象定理,将阈值度量与语义置信度分离,并给出了潜在承诺在阈值以上变得可见的速率敏感交叉界。共同的语义对象是可定义事件上的置信度泛函。我们证明它是一个布尔概率测度,等价于相关类型空间上的Keisler测度,其测度一公式构成一个真滤子,且其Stone空间表示在定义扩展下不变。由此产生的演算提供了有限上下文证书、对分隔符击中集、查询教学维度、提示保留准则和尺度极限见证。精确算术辅助脚本重现了有限域和阈值计算,并生成了图表使用的数据。

英文摘要

This paper develops a model-theoretic framework for verifying context-conditioned language-model behavior by replacing benchmark labels with finite semantic certificates. The first problem is finite determinacy: when do examples in a context force the answer to a query without changing model parameters? In finite-field linear task families, we prove an exact row-space criterion, compute the residual hypothesis count, derive full and query-local identification curves, and show that extracting a smallest forcing subcontext is NP-complete even for binary outputs. The second problem is threshold emergence: when does an apparent benchmark jump reflect a semantic transition rather than a discontinuity of the scoring map? We prove an anti-mirage theorem separating thresholded metrics from semantic confidence and give a rate-sensitive crossing bound for latent commitments becoming visible above threshold. The common semantic object is a confidence functional on definable events. We show that it is a Boolean probability measure, equivalently a Keisler measure on the relevant type space, whose measure-one formulas form a proper filter and whose Stone-space representation is invariant under definitional expansion. The resulting calculus provides finite context certificates, pair-separator hitting sets, query teaching dimension, prompt-preservation criteria, and scale-limit witnesses. Exact-arithmetic ancillary scripts reproduce the finite-field and threshold calculations and generate the data used by the figures.
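
A minimal sketch of what a row-space test can look like, under assumptions that go beyond the abstract: the task family is taken to be y = <w, x> over GF(2) with a hidden weight vector w, inputs are encoded as integer bitmasks, and the context is assumed consistent. The function names are hypothetical. In that setting, standard linear algebra says a query's answer is forced exactly when the query lies in the row space of the context inputs, and the number of remaining hypotheses is 2^(n - rank).

```python
def gf2_rank(rows):
    """Rank over GF(2) of vectors encoded as integer bitmasks."""
    basis = []  # each basis vector has a leading bit that later vectors do not share
    for v in rows:
        for b in basis:
            v = min(v, v ^ b)   # xor with b only if that clears b's leading bit in v
        if v:
            basis.append(v)
    return len(basis)

def answer_is_forced(context_rows, query):
    """The query is forced iff adding it does not raise the rank."""
    return gf2_rank(context_rows + [query]) == gf2_rank(context_rows)

def residual_hypotheses(context_rows, n):
    """Number of weight vectors in GF(2)^n consistent with a consistent context."""
    return 2 ** (n - gf2_rank(context_rows))

context = [0b1100, 0b0110]                    # two example inputs in GF(2)^4
forced = answer_is_forced(context, 0b1010)    # 0b1100 xor 0b0110, so in the row space
free = answer_is_forced(context, 0b0001)      # outside the row space
count = residual_hypotheses(context, 4)
```

Here the first query is forced because its label is the sum of the two context labels, while the second query can still take either value across the four remaining hypotheses.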

2606.07728 2026-06-09 cs.LG 新提交

Characterizing the Discrete Geometry of ReLU Networks

表征ReLU网络的离散几何

Blake B. Gaines, Jinbo Bi

发表机构 * University of Connecticut(康涅狄格大学)

AI总结 本文研究全连接ReLU网络线性区域构成的复形,证明其连通图平均度上界为输入维度的两倍,且直径上界与输入维度无关。

Comments Selected for an oral presentation at ICLR 2026. Tagged PDF, reviews, and discussions are available at https://openreview.net/forum?id=TgLW2DiRDG

详情
Journal ref
Proceedings of the International Conference on Learning Representations (ICLR), 2026
AI中文摘要

众所周知,ReLU网络定义连续分段线性函数,其线性区域是输入空间中的多面体。这些区域构成一个完全划分输入空间的复形。这些区域组合的方式对网络行为至关重要,因为非线性仅发生在这些区域连接的边界处。然而,除了区域总数的界限外,关于这些复形的几何性质所知甚少,且精确计算复形对大多数网络而言是棘手的。在这项工作中,我们证明了关于这些复形的新的理论结果,这些结果对所有全连接ReLU网络都成立,特别是关于它们的连通图,其中节点对应区域,边存在于由面连接的每对区域之间。我们发现,无论网络的宽度和深度如何,该图的平均度上界是输入维度的两倍,并且该图的直径有一个不依赖于输入维度的上界,尽管区域数量随输入维度指数增长。我们通过在合成和真实数据上训练的网络进行的实验证实了我们的发现,这些实验为ReLU网络的几何提供了额外的见解。重现我们结果的代码可在https://github.com/bl-ake/ICLR-2026找到。

英文摘要

It is well established that ReLU networks define continuous piecewise-linear functions, and that their linear regions are polyhedra in the input space. These regions form a complex that fully partitions the input space. The way these regions fit together is fundamental to the behavior of the network, as nonlinearities occur only at the boundaries where these regions connect. However, relatively little is known about the geometry of these complexes beyond bounds on the total number of regions, and calculating the complex exactly is intractable for most networks. In this work, we prove new theoretical results about these complexes that hold for all fully-connected ReLU networks, specifically about their connectivity graphs in which nodes correspond to regions and edges exist between each pair of regions connected by a face. We find that the average degree of this graph is upper bounded by twice the input dimension regardless of the width and depth of the network, and that the diameter of this graph has an upper bound that does not depend on input dimension, despite the number of regions increasing exponentially with input dimension. We corroborate our findings through experiments with networks trained on both synthetic and real-world data, which provide additional insight into the geometry of ReLU networks. Code to reproduce our results can be found at https://github.com/bl-ake/ICLR-2026.

2606.07890 2026-06-09 cs.LG stat.ML 新提交

Partially Performative Prediction

部分表现性预测

Jaewook Lee, Tijana Zrnic

发表机构 * Stanford University(斯坦福大学)

AI总结 提出部分表现性预测框架,统一建模由模型部署引起的内生分布偏移和外部时间变化引起的外生偏移,并定义在线表现性稳定与最优性,分析重复训练等启发式方法的适应性条件。

详情
AI中文摘要

表现性预测研究当预测模型部署在重要领域时产生的反馈循环。在这些设置中,部署模型可能会改变模型旨在预测其模式的人群,从而引发学习系统内生的分布偏移。这种视角不同于经典的分布偏移处理方式,后者通常将偏移建模为数据生成过程中的外生变化。然而,在实践中,分布偏移很少只属于其中一种类型。预测模型可能通过其支持的决策影响未来数据,而世界本身也因学习者无法控制的原因持续漂移。我们研究部分表现性预测,这是一个同时刻画内生和外生分布偏移来源的框架。该框架允许数据分布既响应部署的模型,又根据外部时变过程演化,从而推广了表现性预测。我们定义了能够跟踪不断演化的部分表现性环境的在线对应概念,将表现性稳定性和表现性最优性这两个核心概念扩展到这一设置。我们分析了包括重复训练在内的实用学习启发式方法,并刻画了它们何时能成功适应部分表现性环境。

英文摘要

Performative prediction studies feedback loops that arise when predictive models are deployed in consequential domains. In these settings, deploying a model can change the population whose patterns the model aims to predict, inducing a distribution shift that is endogenous to the learning system. This perspective departs from classical treatments of distribution shift, where shifts are typically modeled as exogenous changes in the data-generating process. Yet, in practice, distribution shift is rarely one or the other. Predictive models may influence future data through the decisions they support, while the world itself continues to drift for reasons beyond the learner's control. We study partially performative prediction, a framework that captures both endogenous and exogenous sources of distribution shift. The framework generalizes performative prediction by allowing the data distribution to evolve both in response to the deployed model and according to an external, time-varying process. We extend the central notions of performative stability and performative optimality to this setting by defining their online analogues that track the evolving partially performative environment. We analyze practical learning heuristics, including repeated retraining, and characterize when they successfully adapt to partially performative environments.
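
To make the combination of endogenous and exogenous shift concrete, here is a toy sketch of repeated retraining. Everything in it is an illustrative assumption rather than the paper's model: the data mean at round t is taken to be `drift(t) + eps * theta_t`, the learner fits a scalar under squared loss, and `eps`, `rate` and the horizon are arbitrary.

```python
def repeated_retraining(theta0, eps, drift, steps):
    """Toy loop: the data mean at round t is drift(t) + eps * theta_t, and
    retraining under squared loss sets theta_{t+1} to that mean."""
    thetas = [theta0]
    for t in range(steps):
        thetas.append(drift(t) + eps * thetas[-1])
    return thetas

eps = 0.5  # strength of the model's influence on the data (endogenous shift)

# Purely performative case: no exogenous drift, iterates settle at 1 / (1 - eps).
static = repeated_retraining(0.0, eps, lambda t: 1.0, 50)

# Partially performative case: the environment also drifts linearly in time.
rate = 0.01
drifting = repeated_retraining(0.0, eps, lambda t: rate * t, 200)
moving_target = rate * 200 / (1 - eps)   # fixed point of the round-200 map
tracking_gap = abs(drifting[-1] - moving_target)
```

In this toy model the iterates converge when there is no drift, and under linear drift they follow the moving fixed point with a bounded lag of about `rate / (1 - eps) ** 2`.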

2606.08028 2026-06-09 cs.LG 新提交

Noise-Adaptive High-Probability Regret Bounds for Online Convex Optimization

噪声自适应的在线凸优化高概率遗憾界

Wentao Zhang, Yutong Zhang, Wentao Mo

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院) College of Mathematics, Sichuan University(四川大学数学学院)

AI总结 针对强凸损失在线凸优化,提出噪声自适应高概率遗憾界,在完全信息下实现与噪声水平相关的乘性改进,并证明赌博机反馈下高概率遗憾随$\log(1/δ)$线性增长,同时为约束优化提供联合高概率保证。

Comments Accepted to 2026 European Conference on Machine Learning and Principles and Practice of Knowledge Discovery in Databases(ECML-PKDD 2026)

详情
AI中文摘要

我们研究了具有强凸损失的在线凸优化(OCO)的高概率遗憾界,并建立了三个结果,解决了噪声自适应性、反馈结构和约束满足交叉领域的开放问题。对于具有次高斯随机梯度的完全信息设置,我们证明了一个噪声自适应的高概率遗憾界,其中鞅偏差项与噪声水平$σ$而非梯度界$G$成比例,相比经典的Azuma-Hoeffding基线实现了$G/σ$的乘性改进。我们的分析引入了一个指数超鞅论证,绕过了Freedman不等式的有界差分要求,从而无需截断伪影即可直接处理无界次高斯噪声。对于赌博机反馈,我们证明了一个极小极大下界:高概率遗憾随$\log(1/δ)$线性增长,而完全信息下的置信成本为$\sqrt{\log(1/δ)}$。这构成了强凸OCO在不同反馈模型下置信成本的正式分离。关于具有满足Slater条件的随机约束的约束OCO,我们为累积遗憾和长期约束违反提供了同时成立的高概率保证,实现了$\mathcal{O}(\sqrt{T\log(m/δ)})$的遗憾和$\mathcal{O}(\sqrt{T}/(ζδ) + m\sqrt{T\log(m/δ)})$的违反。合成实验证实了所有理论预测。

英文摘要

We study high-probability regret bounds for online convex optimization (OCO) with strongly convex losses and establish three results that resolve open questions at the intersection of noise adaptivity, feedback structure, and constraint satisfaction. For the full-information setting with sub-Gaussian stochastic gradients, we prove a noise-adaptive high-probability regret bound in which the martingale deviation term scales with the noise level $σ$ rather than the gradient bound $G$, yielding a multiplicative improvement of $G/σ$ over the classical Azuma-Hoeffding baseline. Our analysis introduces an exponential supermartingale argument that bypasses the bounded-difference requirement of Freedman's inequality, enabling direct treatment of unbounded sub-Gaussian noise without truncation artifacts. For bandit feedback, we prove a minimax lower bound: the high-probability regret scales linearly in $\log(1/δ)$, in contrast to the $\sqrt{\log(1/δ)}$ confidence cost under full information. This constitutes a formal separation in the confidence cost of strongly convex OCO across feedback models. Regarding constrained OCO with stochastic constraints satisfying a Slater condition, we provide simultaneous high-probability guarantees for both cumulative regret and long-run constraint violation, achieving $\mathcal{O}(\sqrt{T\log(m/δ)})$ regret and $\mathcal{O}(\sqrt{T}/(ζδ) + m\sqrt{T\log(m/δ)})$ violation. Synthetic experiments corroborate all theoretical predictions.

2606.08113 2026-06-09 cs.LG math.FA math.OC 新提交

Conditional Random Ordered Transport Spaces

条件随机有序传输空间

Lei Luo, Jian Yang

发表机构 * Nanjing University of Science and Technology(南京理工大学) PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education(PCA实验室,教育部高维信息智能感知与系统重点实验室) School of Computer Science and Engineering(计算机科学与工程学院)

AI总结 提出条件随机有序传输空间(CROTS),通过引入有序传输几何和条件风险泛函,解决分布学习中传输方向是否被允许的问题,并建立稳定性定理。

Comments 24 pages, 1 figure, 2 tables

详情
AI中文摘要

小的Wasserstein距离不能证明变换是可容许的。在证据约束、语义、因果、物理、单调或风险敏感学习中,不仅要问两个概率定律相距多远,还要问质量是否沿着可用信息允许的方向移动。我们引入了条件随机有序传输空间(CROTS),这是一类\(L^0\)值随机概率测度空间,配备了Wasserstein环境度量、闭随机序、硬和软有序传输差异,以及用于在证据sigma域下评估序违反的条件风险泛函。核心对象是随机测度值动力学的一个序可容许传输几何,区别于锥值度量、有序Kantorovich构造、单独的随机Wasserstein空间以及生成路径的模型特定残差。我们发展了CROTS作为可靠分布学习空间理论的基础。结果包括硬和软有序传输的适定性和对偶性、软到硬变分收敛、随机提升空间的可测性和完备性、约化到经典Wasserstein和有序几何、有序测地线、约束重心和投影、条件风险-传输对偶性以及序违反分布的分离。主要稳定性定理表明,随机学习动力学可以在环境Wasserstein度量中收敛,而其局部可容许性泄漏遵循一个独立的条件序-风险递归。由此产生的渐近序-风险下界为证据过度、有序分布偏移、鲁棒性失败和可容许分布动力学提供了数学语言。

英文摘要

A small Wasserstein distance does not certify that a transformation is admissible. In evidence-constrained, semantic, causal, physical, monotone, or risk-sensitive learning, one must ask not only how far two probability laws are, but whether mass has moved in a direction allowed by available information. We introduce conditional random ordered transport spaces (CROTS), a class of \(L^0\)-valued spaces of random probability measures equipped with a Wasserstein ambient metric, a closed stochastic order, hard and soft ordered transport discrepancies, and a conditional risk functional for evaluating order violation under an evidence sigma-field. The central object is an order-admissible transport geometry for random measure-valued dynamics, distinct from cone-valued metrics, ordered Kantorovich constructions, random Wasserstein spaces alone, and model-specific residuals for generative paths. We develop the foundations of CROTS as a space theory for reliable distributional learning. The results include well-posedness and duality for hard and soft ordered transport, soft-to-hard variational convergence, measurability and completeness of the random lifted space, reductions to classical Wasserstein and ordered geometries, ordered geodesics, constrained barycenters and projections, conditional risk-transport duality, and separation of order-violating distributions. The main stability theorem shows that random learning dynamics may converge in the ambient Wasserstein metric while its local admissibility leakage follows a separate conditional order-risk recursion. The resulting asymptotic order-risk floor provides a mathematical language for evidence overreach, ordered distribution shift, robustness failure, and admissible distributional dynamics.

2606.08167 2026-06-09 cs.LG cs.AI 新提交

Explaining Data Mixing Scaling Laws

解释数据混合缩放定律

Rui Dai, Shuran Zheng

发表机构 * Beijing Institute of Technology(北京理工大学) IIIS, Tsinghua University(清华大学交叉信息研究院)

AI总结 提出统一框架解释多领域数据混合中模型损失行为,基于能力竞争和噪声减少两个关键因素,在多个尺度上有效预测高性能混合。

Comments Published to ICML 2026

详情
AI中文摘要

最近的研究建立了经验缩放定律来预测多领域数据混合上的模型性能。然而,对这些模型损失行为的理论理解仍然缺失。在这项工作中,我们提出了一个统一框架来解释数据混合的底层机制。我们的方法将最初为标准神经缩放定律(如Kaplan和Chinchilla)开发的理论视角扩展到多领域设置。基于领域在基本技能上重叠而在专门技能上分化的分布假设,我们确定了控制不同数据混合训练模型领域损失的两个关键因素:\textit{能力竞争},其中有限模型能力的分配全局耦合了领域损失;以及\textit{噪声减少},其中最优权重向更难学习的领域转移以最小化整体噪声。实证评估表明,我们的框架通过以更低的平均相对误差拟合损失景观并识别出更高性能的训练混合,优于现有基线。最重要的是,我们的模型成功跨尺度外推,使用较小尺度上拟合的参数预测大型未见尺度的高效混合。此外,与之前的经验定律相比,我们的模型使用显著更少的参数实现了这些结果。我们的代码可在 https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws 获取。

英文摘要

Recent research has established empirical scaling laws to predict model performance on multi-domain data mixtures. However, a theoretical understanding of these model loss behaviors remains absent. In this work, we propose a unified framework to explain the underlying mechanics of data mixing. Our approach extends theoretical perspectives originally developed for standard neural scaling laws (e.g., Kaplan and Chinchilla) to the multi-domain setting. Based on the distributional assumption that domains overlap on fundamental skills while diverging on specialized skills, we identify two key factors that govern the domain losses of models trained on different data mixtures: \textit{Capacity Competition}, where the allocation of finite model capacity couples domain losses globally, and \textit{Noise Reduction}, where optimal weights shift toward harder-to-learn domains to minimize overall noise. Empirical evaluations show that our framework outperforms existing baselines by fitting the loss landscape with a lower Mean Relative Error and identifying higher-performing training mixtures. Most importantly, our model successfully extrapolates across scales, predicting highly effective mixtures for large, unseen scales using parameters fitted on smaller ones. In addition, our model achieves these results using significantly fewer parameters compared to previous empirical laws. Our code is available at https://github.com/meiqwq/Explaining-Data-Mixing-Scaling-Laws.

2606.08218 2026-06-09 cs.LG cs.AI math.ST stat.ML stat.TH 新提交

How Deep Are Deep GPs, Really? A Sharp Threshold and a Non-Gaussian Limit for Compositional GPs

深度高斯过程到底有多深?组合高斯过程的尖锐阈值与非高斯极限

Mark Kozdoba, Shie Mannor

发表机构 * Technion, IIT(以色列理工学院) NVIDIA(英伟达)

AI总结 本文研究了深度高斯过程先验在深度增长时的极限行为,识别出RBF核带宽的尖锐阈值,低于该阈值时先验收敛到非退化非高斯分布,具有非零坐标依赖。

详情
AI中文摘要

组合先验描述了深度贝叶斯模型中分层函数的通用属性,其中随机权重的深度神经网络是一个典型例子。在宽网络极限下,先验是一个具有深度相关核的高斯过程,其随深度增长的行为已通过该核得到广泛研究。这里,我们研究另一种情况,其中每一层本身是一个向量值高斯过程,我们的目标同样是理解先验随深度增长的极限行为。先前的高斯过程工作已确定,对于RBF核和一定范围的带宽$r$,先验在极限下退化,收敛到常数函数集,因而无法作为有用的概率模型。在本文中,我们建立了几个新结果。首先,我们识别出一个尖锐的带宽阈值$r_c(d) = Θ(\sqrt{d})$,高于该阈值时极限是退化的,从而加强了先前的界。其次,更重要的是,我们证明对于低于阈值$r_c(d)$的$r$,先验收敛到极限分布$π_{\bar{Z}}$。我们还证明这些分布是非退化且非高斯的,坐标之间具有非消失的依赖性。因此,与先前已知的退化情形不同,深度高斯过程先验可以具有非平凡极限。实验上,我们在一系列维度$d$上验证了该阈值,并展示了极限分布$π_{\bar{Z}}$复杂的多模态行为;这一参数区间随$d$增长而愈发狭窄,若不了解阈值则难以识别。

英文摘要

Compositional priors describe the generic properties of layered functions in deep Bayesian models, where deep neural networks with random weights are a canonical example. In the wide-network limit, the prior is a Gaussian process with a depth-dependent kernel, and its behaviour as depth grows has been extensively studied through this kernel. Here, we study another case, where each layer itself is a vector-valued Gaussian process, and our aim is similarly to understand the limiting behaviour of the prior as depth grows. Previous GP work has established that for the RBF kernel and a certain range of bandwidths $r$, the prior degenerates in the limit, converging to the set of constant functions -- which is not useful as a probabilistic model. In this paper we establish several new results. First, we identify a sharp bandwidth threshold $r_c(d) = Θ(\sqrt{d})$ above which the limit is degenerate, strengthening the earlier bounds. Second, and more importantly, we show that for $r$ below the threshold $r_c(d)$ the prior converges to a limit distribution $π_{\bar{Z}}$. We also prove that these distributions are non-degenerate and non-Gaussian, with non-vanishing dependence between coordinates. In contrast to the previously known degenerate regime, deep Gaussian process priors can therefore admit non-trivial limits. Empirically, we verify the threshold across a range of dimensions $d$, and demonstrate a complex multimodal behaviour of the limit distributions $π_{\bar{Z}}$ -- a regime that becomes increasingly narrow with $d$ and would be hard to identify without knowing the threshold.

2606.08291 2026-06-09 cs.LG 新提交

On solving symmetric multi-type orthogonal non-negative matrix tri-factorization problem

求解对称多类型正交非负矩阵三因子分解问题

Rok Hribar, Gregor Papa, Janez Povh, Andrej Kastrin

发表机构 * Laboratory for Engineering Design, Faculty of Mechanical Engineering, University of Ljubljana(卢布尔雅纳大学机械工程学院工程设计实验室) Rudolfovo – Science and Technology Centre Novo mesto(新梅斯托Rudolfovo科学与技术中心) Institute of Biostatistics and Medical Informatics, Faculty of Medicine, University of Ljubljana(卢布尔雅纳大学医学院生物统计与医学信息学研究所)

AI总结 研究对称多类型正交非负矩阵三因子分解问题,提出基于KKT条件的定点法和基于ADAM的三阶段算法,在合成数据和引文网络上验证了分解质量与聚类、链接预测等任务中的竞争力。

Comments 27 pages, 9 tables, 3 figures

详情
AI中文摘要

我们研究了对称多类型正交非负矩阵三因子分解问题,其中多个对称非负矩阵被同时近似为形式为$GS_{i}G^{\top}$的因子,共享一个非负且正交的因子$G$。该模型由聚类和网络分析驱动,其中非负性提高了可解释性,正交性为潜在因子提供了自然的分配型结构。由于所得优化问题高度非凸,我们开发了两种启发式算法来计算高质量的局部解。第一种是基于Karush-Kuhn-Tucker条件在添加正交约束惩罚项后导出的不动点方法。第二种是三阶段基于ADAM的方法,结合了保持非负性的优化、正交化以及可行集上的受限ADAM精化。我们在合成数据(包括含噪声实例)和引文网络基准上评估了这两种方法。合成实验表明,两种算法都能恢复接近最优的分解,并在噪声下保持稳定。在真实网络上,学习到的嵌入在链接预测、节点聚类和节点分类任务中与标准基线(如SVD、node2vec和经典链接预测启发式方法)相比具有竞争力或更优。

英文摘要

We study the symmetric multi-type orthogonal non-negative matrix tri-factorization problem, where several symmetric non-negative matrices are simultaneously approximated by factors of the form $GS_{i}G^{\top}$, with a shared non-negative and orthogonal factor $G$. This model is motivated by clustering and network analysis, where non-negativity improves interpretability and orthogonality gives a natural assignment-type structure to the latent factor. Since the resulting optimization problem is highly non-convex, we develop two heuristic algorithms for computing high-quality local solutions. The first one is a fixed point method derived from the Karush-Kuhn-Tucker conditions after adding a penalty term for the orthogonality constraint. The second one is a three-stage ADAM-based method that combines non-negativity-preserving optimization, orthogonalization, and restricted ADAM refinement on the feasible set. We evaluate both methods on synthetic data, including noisy instances, and on citation network benchmarks. The synthetic experiments show that both algorithms recover factorizations close to the optimum and remain stable under noise. On real networks, the learned embeddings are competitive with or better than standard baselines such as SVD, node2vec, and classical link prediction heuristics in link prediction, node clustering, and node classification tasks.

2606.08308 2026-06-09 cs.LG 新提交

Fourier fractal dimension to predict the generalization of deep neural networks

傅里叶分形维数预测深度神经网络的泛化能力

Joao B. Florindo, Davi Wanderley Misturini

发表机构 * Institute of Mathematics, Statistics and Scientific Computing - University of Campinas(坎皮纳斯大学数学、统计与科学计算研究所)

AI总结 提出基于权重变化的傅里叶分形维数作为泛化度量,并设计傅里叶优化器正则化该维数,在CIFAR-10等数据集上实现与泛化差距的高相关性。

详情
AI中文摘要

在不依赖留出验证数据的情况下预测深度神经网络的泛化性能是机器学习中的一个基本挑战。虽然随机梯度下降驱动这些高度参数化模型的优化,但其重尾、非高斯动力学在参数空间中诱导出复杂的、尺度不变的轨迹。在本文中,我们提出了一种基于网络权重变化的傅里叶分形维数的新型泛化度量。通过分析频域中Lévy驱动的随机微分方程的特征函数,我们提取出一个能够稳健捕捉学习过程几何复杂性的度量。此外,我们引入了一种定制的基于傅里叶的优化器,旨在训练过程中主动正则化该分形维数。在CIFAR-10、SVHN和MNIST数据集上的大量实证评估表明,我们提出的傅里叶泛化度量与实际泛化差距具有强相关性。我们的方法实现了最先进的Kendall秩相关系数,优于现有的基于范数、基于间隔和PAC-Bayesian度量。最终,这项工作凸显了频域分形分析作为模型泛化能力的强大预测器以及开发更稳定优化算法的原则性基础的潜力。

英文摘要

Predicting the generalization performance of deep neural networks without relying on hold-out validation data is a fundamental challenge in machine learning. While Stochastic Gradient Descent (SGD) drives the optimization of these highly parameterized models, its heavy-tailed, non-Gaussian dynamics induce complex, scale-invariant trajectories in the parameter space. In this paper, we propose a novel generalization measure based on the Fourier fractal dimension of the network's weight variations. By analyzing the characteristic function of the Lévy-driven stochastic differential equations in the frequency domain, we extract a metric that robustly captures the geometric complexity of the learning process. Furthermore, we introduce a customized Fourier-based optimizer designed to actively regularize this fractal dimension during training. Extensive empirical evaluations on the CIFAR-10, SVHN, and MNIST datasets demonstrate that our proposed Fourier generalization measure exhibits a strong correlation with the actual generalization gap. Our method achieves state-of-the-art Kendall rank correlation coefficients, outperforming a wide array of existing norm-based, margin-based, and PAC-Bayesian measures. Ultimately, this work highlights the potential of frequency-domain fractal analysis as both a powerful predictor for model generalizability and a principled foundation for developing more stable optimization algorithms.

2606.08388 2026-06-09 cs.LG math.OC stat.ML 新提交

The Spectral Dynamics and Noise Geometry of Muon

Muon的谱动力学与噪声几何

Pierfrancesco Beneventano, Mahmoud Abdelmoneum, Tomaso Poggio

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 研究Muon优化器通过极分解替换矩阵梯度,证明其偏置为平坦谱,在欠定回归中导出奇异值动力学,实验表明其效果依赖于谱方向活跃度。

Comments 24 pages, 11 figures

详情
AI中文摘要

Muon将矩阵梯度$G=UΣV^\top$替换为其极因子$UV^\top$。这保留了梯度选择的奇异方向,但使更新的谱变得平坦。我们研究此操作产生的优化偏置。在显式对齐假设下,我们证明:在使用梯度奇异方向且不适应当前权重谱的有界更新中,极更新是单步熵最大化的选择。在欠定回归模型中,我们推导了连续时间Muon的精确奇异值动力学,并识别出一个依赖于测量的条件,在该条件下归一化谱趋向于相等的非零奇异值。这种几何也排除了常见的低秩解释:在固定Frobenius范数下,Muon的特征状态具有平坦谱,而核范数最小化则偏好谱集中。受控矩阵感知实验将该效应与简单的梯度缩放区分开来,表明范数匹配的梯度下降不能复现Muon,并在广泛的消融实验中恢复了预测的平坦化趋势。在小型NanoGPT预训练中,Muon保持稳定秩,具有宽的学习率平台,并相对于AdamW改善了验证损失;在匹配的小型ViT对照实验中,排名反转。由此得出的图景依赖于具体情形:Muon并非普遍优越,但在需要保持许多谱方向活跃时,其平坦谱偏置可能有所帮助。

英文摘要

Muon replaces a matrix gradient $G=UΣV^\top$ by its polar factor $UV^\top$. This keeps the singular directions selected by the gradient, but makes the update spectrum flat. We study the optimization bias created by this operation. Under explicit alignment assumptions, we prove that the polar update is the one-step entropy-maximizing choice among bounded updates that use the gradient singular directions and do not adapt to the current weight spectrum. In an underdetermined regression model, we derive exact singular-value dynamics for continuous-time Muon and identify a measurement-dependent condition under which the normalized spectrum moves toward equal nonzero singular values. This geometry also rules out a common low-rank interpretation: at fixed Frobenius norm, Muon's distinguished state has a flat spectrum, whereas nuclear-norm minimization favors spectral concentration. Controlled matrix-sensing experiments separate the effect from simple gradient rescaling, show that norm-matched gradient descent does not reproduce Muon, and recover the predicted flattening trend across broad ablations. In small NanoGPT pretraining, Muon preserves stable rank, has a broad learning-rate plateau, and improves validation loss relative to AdamW; in a matched small-ViT control, the ranking reverses. The resulting picture is regime-dependent: Muon is not universally superior, but its flat-spectrum bias can help when many spectral directions need to remain active.
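
The polar-factor operation in the first sentence of this abstract can be written down directly. The sketch below computes it with an exact thin SVD, which is an assumption made for clarity; practical implementations may approximate the polar factor iteratively, and the sketch omits momentum and learning-rate scaling. The helper name `polar_factor` is hypothetical.

```python
import numpy as np

def polar_factor(G):
    """Return U V^T from the thin SVD G = U diag(s) V^T."""
    U, _, Vt = np.linalg.svd(G, full_matrices=False)
    return U @ Vt   # same singular directions as G, all singular values set to 1

rng = np.random.default_rng(0)
G = rng.standard_normal((4, 3))        # stand-in for a matrix gradient
update = polar_factor(G)

grad_spectrum = np.linalg.svd(G, compute_uv=False)         # generally uneven
update_spectrum = np.linalg.svd(update, compute_uv=False)  # flat: all ones
```

Comparing `grad_spectrum` with `update_spectrum` shows the flattening the abstract refers to: the directions come from the gradient, while the magnitudes along them are equalized.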

2606.08390 2026-06-09 cs.LG stat.ML 新提交

When Are Neural Interaction Discoveries Real? Identifiability, Recoverability, and a Pre-Fit Diagnostic

神经交互发现何时是真实的?可辨识性、可恢复性与拟合前诊断

Valentina Kuskova, Dmitry Zaytsev, Michael Coppedge

发表机构 * University of Washington(华盛顿大学)

AI总结 研究神经时间序列模型中交互发现的真实性问题,提出基于输入支持几何的可辨识性理论,并给出有效秩作为拟合前诊断工具。

Comments 11 pages, 3 figures

详情
AI中文摘要

当神经时间序列模型报告一个变量调节另一个变量对目标的影响时,发现的交互是数据的属性还是模型灵活性的伪影?我们认为这本质上是一个可辨识性问题,由观测输入支持的几何结构决定,而非特定的神经架构。我们在神经加性向量自回归(GNAVAR)的乘法门控扩展中研究该问题,其中源贡献由其他滞后变量调节。我们表明表示能力不等于可辨识性:依赖输入会在边特定交互项之间引入泄漏,低维支持允许不同的交互分解,这些分解在观测数据上一致但在其他地方不同。然后,我们在显式支持条件下(包括共享调节器设置)证明了归一化最小GNAVAR分解的总体可辨识性定理。该理论产生了一个简单的面向实践者的诊断:联合滞后块协方差的有效秩在拟合前预测对于给定候选集交互恢复是否可行。当候选集未知时,双种子稳定性检查提供了实用的操作测试。相同的支持条件将经验结果组织成理论预测的三种状态。我们的结果表明,交互可恢复性取决于支持几何,有效秩提供了实用的拟合前诊断,并且独立拟合之间的不稳定性是非可辨识交互发现的特征标志。可辨识性现象、支持条件和不稳定性标志是模型无关的;GNAVAR是使它们可证明的载体。

英文摘要

When a neural time-series model reports that one variable modulates another's effect on a target, is the discovered interaction a property of the data or an artifact of model flexibility? We argue that this is fundamentally a question of identifiability, governed by the geometry of the observed input support rather than by the specific neural architecture. We study the problem in a multiplicative-gating extension of neural additive vector autoregression (GNAVAR), in which source contributions are modulated by other lagged variables. We show that representational capacity is not identifiability: dependent inputs induce leakage between edge-specific interaction terms, and low-dimensional support permits distinct interaction decompositions that agree on the observed data while differing elsewhere. We then prove a population identifiability theorem for normalized minimal GNAVAR decompositions under explicit support conditions, including settings with shared modulators. The theory yields a simple practitioner-facing diagnostic: the effective rank of the joint lag-block covariance predicts, before fitting, whether interaction recovery is feasible for a given candidate set. When the candidate set is unknown, a two-seed stability check provides a practical operational test. The same support condition organizes empirical outcomes into the three states predicted by the theory. Our results show that interaction recoverability depends on support geometry, that effective rank provides a practical pre-fit diagnostic, and that instability across independent fits is a characteristic signature of non-identifiable interaction discovery. The identifiability phenomenon, the support condition, and the instability signature are model-agnostic; GNAVAR is the vehicle that makes them provable.
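
A pre-fit check of this kind can be sketched as follows. The abstract does not say which definition of effective rank is used, so this example assumes one common choice, the exponential of the Shannon entropy of the normalized covariance eigenvalues. The matrix `X` stands in for the stacked lagged inputs, and the function name and synthetic data are illustrative.

```python
import numpy as np

def effective_rank(X):
    """exp(Shannon entropy) of the normalized covariance eigenvalues.

    X has shape (samples, features). This definition is an assumption.
    """
    cov = np.cov(X, rowvar=False)
    eig = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negative values
    p = eig / eig.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
independent = rng.standard_normal((2000, 4))            # full-dimensional support
shared = rng.standard_normal((2000, 1))
collinear = shared @ np.ones((1, 4)) + 0.01 * rng.standard_normal((2000, 4))

rank_independent = effective_rank(independent)  # close to 4
rank_collinear = effective_rank(collinear)      # close to 1
```

A value far below the number of input columns signals low-dimensional support, which is the situation in which the abstract says distinct interaction decompositions can agree on the observed data.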

2606.08721 2026-06-09 cs.LG 新提交

A Geometric Measure of Linear Separability for Neural Representations

神经表征的线性可分性几何度量

Yi Wei, Xuan Qi, Furao Shen

发表机构 * State Key Laboratory of Novel Software Technology, School of Intelligence Science and Technology, Nanjing University(南京大学智能科学与技术学院软件新技术国家重点实验室) AI for Good (AIGO), Istituto Italiano di Tecnologia(意大利技术研究院AI for Good (AIGO)) DITEN, University of Genoa(热那亚大学DITEN) State Key Laboratory of Novel Software Technology, School of Artificial Intelligence, Nanjing University(南京大学人工智能学院软件新技术国家重点实验室)

AI总结 提出方向线性可分性度量(LSM),通过搜索包含目标类所有样本的仿射半空间并测量最小竞争样本入侵量,为神经表征的类间几何提供不对称、类级、目标归一化的诊断工具。

详情
AI中文摘要

现代神经分类器通常依赖线性读出,但仅预测指标无法刻画此类读出所操作的表征的类间几何。我们引入方向线性可分性度量(LSM),一种用于单侧仿射可分性的有限样本诊断工具。对于目标类A和竞争集B,LSM搜索包含A中所有样本的仿射半空间,并测量必须留在目标侧的最小竞争样本入侵量,按|A|归一化。所得量是不对称的、类级的、目标归一化的,适用于从神经网络提取的有限表征。我们建立了其支撑超平面刻画,将其与最优仿射分类精度关联,并证明了在全秩线性嵌入下的不变性。这些结果将线性重参数化引起的变化与信息丢失或非线性几何变换引起的变化区分开来。我们还给出了一种基于惩罚的仿射搜索,用于在高维特征中估计类级LSM,报告的值根据原始离散保持和违反准则计算。最后,我们将坐标门控非线性作为有限样本几何算子进行分析,并经验性地使用LSM诊断常见深度学习组件和架构中的类级入侵。

英文摘要

Modern neural classifiers commonly rely on linear readouts, yet predictive metrics alone do not characterize the class-wise geometry of the representations on which such readouts operate. We introduce the directional linear separability measure (LSM), a finite-sample diagnostic for one-sided affine separability. For a target class A and a competing set B, LSM searches over affine halfspaces that contain all samples in A and measures the smallest competing-sample intrusion that must remain on the target side, normalized by |A|. The resulting quantity is asymmetric, class-wise, target-normalized, and applicable to finite representations extracted from neural networks. We establish its supporting-hyperplane characterization, relate it to optimal affine classification accuracy, and prove invariance under full-rank linear embeddings. These results separate changes caused by linear reparameterization from those caused by information loss or nonlinear geometric transformations. We also give a penalty-based affine search for estimating class-wise LSM in high-dimensional features, with reported values computed from the original discrete preservation and violation criterion. Finally, we analyze coordinatewise gated nonlinearities as finite-sample geometric operators and empirically use LSM to diagnose class-wise intrusion across common deep-learning components and architectures.

2606.08768 2026-06-09 cs.LG 新提交

Understanding the Parameter Space Geometry of Transformers Encoding Boolean Functions

理解编码布尔函数的Transformer参数空间几何

Blanka Köver, Alexandra Butoi, Anej Svete, Michael Hahn, Ryan Cotterell

发表机构 * Machine Learning, ICML(机器学习,ICML)

AI总结 针对Transformer无法学习某些简单布尔函数(如奇偶函数)的问题,通过分析参数空间几何,证明敏感函数在参数空间中占据极小区域,随机初始化几乎必然错过,从而解释了可表达但不可学习的现象。

Comments ICML 2026

详情
AI中文摘要

Transformer始终无法学习某些简单的函数,而这些函数在特定参数设置下是可证明表达的。这种可学习性与可表达性之间的差距对于敏感函数尤为突出——例如奇偶函数,其输出在输入单个比特翻转时很可能改变。虽然先前的研究已经确定Transformer偏向于平均敏感度低的函数,但这种偏向背后的精确机制仍不清楚。为了阐明这一现象,我们研究了Transformer参数空间的几何结构。我们证明,敏感函数——即使可表示——占据了一个极小区域,随机初始化极有可能错过。具体而言,我们将关注点从平均敏感度转移到完整的敏感度分布——所有输入上敏感度值的分布——并证明随机初始化的Transformer几乎必然计算具有低敏感度字符串的函数。因此,任何缺乏此类字符串的函数都是可证明不可学习的。

英文摘要

Transformers consistently fail to learn certain simple functions that are provably expressible with specific parameter settings. This gap between learnability and expressivity is particularly prominent for sensitive functions -- functions whose output is likely to change if a single bit of the input is flipped -- for example, PARITY. While prior work has established that transformers exhibit a bias toward functions with low average sensitivity, the precise mechanism underlying this bias remains poorly understood. To shed light on this phenomenon, we study the geometry of transformers' parameter space. We show that sensitive functions -- even when representable -- occupy a vanishingly small region that random initialization is very likely to miss. Specifically, we shift the focus from average sensitivity to the full sensitivity profile -- the distribution of sensitivity values across all inputs -- and prove that randomly initialized transformers almost surely compute functions which have low-sensitivity strings. Consequently, any function that lacks such strings is provably unlearnable.
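
The sensitivity profile mentioned in this abstract can be computed by brute force for small input lengths. The sketch below uses the standard definition of sensitivity at an input (the number of single-bit flips that change the output); the function names and the choice of MAJORITY as a contrast to PARITY are illustrative.

```python
from collections import Counter
from itertools import product

def sensitivity(f, x):
    """Number of single-bit flips of x that change f(x)."""
    return sum(f(x[:i] + (1 - x[i],) + x[i + 1:]) != f(x) for i in range(len(x)))

def sensitivity_profile(f, n):
    """Histogram {sensitivity value: number of inputs} over all n-bit inputs."""
    return Counter(sensitivity(f, x) for x in product((0, 1), repeat=n))

def parity(x):
    return sum(x) % 2

def majority(x):
    return int(sum(x) > len(x) / 2)

n = 5
parity_profile = sensitivity_profile(parity, n)      # every input has sensitivity n
majority_profile = sensitivity_profile(majority, n)  # many inputs have sensitivity 0
```

PARITY has no low-sensitivity strings at all, since every input sits at the maximum value n, whereas MAJORITY has inputs with sensitivity 0. That contrast is the kind of difference the full profile captures.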

2606.08797 2026-06-09 cs.LG cs.AI 新提交

Scaling Decision-Focused Learning to Large Problems with Lagrangian Decomposition

通过拉格朗日分解将决策聚焦学习扩展到大规模问题

Stéphane Eilles-Chan Way, Hugo Percot, Quentin Cappart, Tias Guns, Louis-Martin Rousseau

发表机构 * Polytechnique Montréal(蒙特利尔综合理工学院) Ecole Polytechnique(巴黎综合理工学院) UCLouvain(鲁汶大学) Mila - Québec AI Institute(魁北克人工智能研究所) KU Leuven(荷语鲁汶大学)

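AI summary Proposes a decision-focused learning framework built on Lagrangian decomposition, with a new surrogate objective and two loss functions, that handles large constrained optimization problems while remaining parallelizable; experiments show it outperforms traditional methods on instances with up to eight times more variables than usually considered.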

详情
AI中文摘要

决策聚焦学习在解决预测-优化问题中显示出巨大潜力,尤其是在模型欠规范的情况下。然而,其实际部署常因高计算成本和有限的可扩展性而受阻,因为需要在每次迭代中对每个训练实例求解一个约束优化问题。为解决这些挑战,我们提出了一种新颖的框架,将拉格朗日分解融入决策聚焦学习范式。具体而言,我们引入了一个新的代理目标以及两个用于评估和训练底层预测模型的损失函数。我们进一步提出了两种变体,它们在计算效率和解决方案质量之间提供了不同的权衡。我们的框架可以无缝集成到标准的决策聚焦学习方法中,包括Smart Predict-then-Optimize (SPO+)和隐式最大似然估计 (IMLE)。通过在两个标准基准测试(多维背包问题和二次投资组合优化)上的实验,我们证明了我们的方法在保持可并行化的同时实现了有竞争力的性能。特别是,在大规模实例上,它始终优于传统的决策聚焦学习方法,这些实例的变量数比相关工作通常考虑的要多出八倍。实现代码可在 https://github.com/corail-research/DFL-LD 获取。

英文摘要

Decision-focused learning has shown great promise for addressing predict-then-optimize problems, particularly in the presence of under-specified models. However, its practical deployment is often hindered by high computational costs and limited scalability, as it requires solving a constrained optimization problem for each training instance at every iteration. To address these challenges, we propose a novel framework that incorporates Lagrangian decomposition into the decision-focused learning paradigm. Specifically, we introduce a new surrogate objective along with two loss functions for evaluating and training the underlying prediction model. We further propose two variants of our approach, which offer different trade-offs between computational efficiency and solution quality. Our framework can be seamlessly integrated with standard decision-focused learning methods, including Smart Predict-then-Optimize (SPO+) and Implicit Maximum Likelihood Estimation (IMLE). Through experiments on two standard benchmarks, the multi-dimensional knapsack problem and quadratic portfolio optimization, we demonstrate that our approach achieves competitive performance while remaining amenable to parallelization. In particular, it consistently outperforms traditional decision-focused learning methods on large-scale instances, involving up to eight times more variables than those typically considered in related work. The implementation is available at https://github.com/corail-research/DFL-LD.

2606.08993 2026-06-09 cs.LG cs.SY eess.SY math.OC 新提交

LEAF: A Learning-Enabled ADMM Framework for Accelerated Convex Optimization

LEAF: 一种用于加速凸优化的学习增强ADMM框架

Binh Nguyen, Trinh Tran, Truong X. Nghiem

发表机构 * University of Central Florida(中佛罗里达大学)

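AI summary Proposes LEAF, which accelerates convex optimization by learning the Moreau envelope with an input convex neural network, reducing model complexity while preserving convergence; experiments show up to an order-of-magnitude speedup over state-of-the-art solvers.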

详情
AI中文摘要

我们提出LEAF,一种用于加速凸优化的学习增强ADMM框架。关键思想是使用输入凸神经网络(ICNN)逼近目标函数的Moreau包络,从而得到一个保持凸性和光滑性的学习模型。这导致了所提出的Moreau包络学习ADMM(MEL-ADMM)及其分裂变体sMEL-ADMM。与直接学习高维算子的现有方法不同,LEAF学习标量值的Moreau包络,显著降低了模型复杂度并提高了数据效率。该框架适用于包括光滑和非光滑目标在内的广泛凸问题。通过ICNN架构显式嵌入凸性,所提出的方法在保持优化问题关键结构性质的同时保持了高逼近精度。MEL-ADMM和sMEL-ADMM都在学习模型下具有收敛性和可行性的理论保证。严格分析表明,所提出的方法实现了与经典ADMM相当的收敛速度,同时降低了每次迭代的计算成本。数值实验表明,与最先进的求解器相比,速度提升可达一个数量级,同时保持较低的最优性差距。

英文摘要

We propose LEAF, a learning-enabled ADMM framework for accelerated convex optimization. The key idea is to approximate the Moreau envelope of the objective function using an Input Convex Neural Network (ICNN), resulting in a learned model that preserves convexity and smoothness. This leads to the proposed Moreau Envelope Learning ADMM (MEL-ADMM) and its splitting variant sMEL-ADMM. Unlike existing approaches that learn high-dimensional operators directly, LEAF learns a scalar-valued Moreau envelope, significantly reducing model complexity and improving data efficiency. The framework accommodates a broad class of convex problems with smooth and non-smooth objectives. By embedding convexity explicitly through the ICNN architecture, the proposed approach maintains high approximation accuracy while preserving key structural properties of the optimization problem. Both MEL-ADMM and sMEL-ADMM are developed with theoretical guarantees of convergence and feasibility under the learned model. Rigorous analysis shows that the proposed methods achieve convergence rates comparable to classical ADMM while reducing per-iteration computational cost. Numerical experiments demonstrate up to an order-of-magnitude speedup over state-of-the-art solvers while maintaining low optimality gaps.
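The scalar quantity that LEAF is described as learning can be illustrated on a case with a closed form. The sketch below computes the Moreau envelope of f(y) = |y| through its proximal operator (soft-thresholding) and compares it with the Huber function, which is the known closed form. It does not involve an ICNN or ADMM; the function names and the parameter `lam` are illustrative assumptions.

```python
def prox_abs(x, lam):
    """Proximal operator of |.| with parameter lam > 0 (soft-thresholding)."""
    if x > lam:
        return x - lam
    if x < -lam:
        return x + lam
    return 0.0


def moreau_env_abs(x, lam):
    """Moreau envelope: min over y of |y| + (x - y)**2 / (2 * lam)."""
    y = prox_abs(x, lam)
    return abs(y) + (x - y) ** 2 / (2.0 * lam)


def huber(x, lam):
    """Closed form of the envelope above: quadratic near 0, linear in the tails."""
    if abs(x) <= lam:
        return x * x / (2.0 * lam)
    return abs(x) - lam / 2.0


def grad_moreau_env_abs(x, lam):
    """Gradient of the envelope, (x - prox(x)) / lam; defined even where |.| is not smooth."""
    return (x - prox_abs(x, lam)) / lam


points = [-3.0, -0.5, 0.0, 0.2, 1.0, 4.0]
envelope_values = [moreau_env_abs(x, 1.0) for x in points]
huber_values = [huber(x, 1.0) for x in points]
```

The envelope is smooth and convex even though |y| is not differentiable at 0, which matches the abstract's point that the learned target preserves convexity and smoothness.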

2606.09154 2026-06-09 cs.LG 新提交

Improved Convergence Analysis of Topology Dependence in Decentralized SGD

去中心化SGD中拓扑依赖性的改进收敛分析

Yuki Takezawa, Anastasia Koloskova, Sebastian U. Stich

发表机构 * University of Washington(华盛顿大学)

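AI summary Presents a tighter convergence analysis showing that all eigenvalues of the mixing matrix affect the convergence rate, and verifies experimentally that it is more accurate than analyses based on the spectral gap alone.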

Comments ICML 2026

详情
AI中文摘要

去中心化SGD是去中心化学习中的基本算法,尽管底层网络拓扑对其收敛行为的影响尚未完全理解。现有的收敛分析表明,在同质和异质情况下,具有小谱间隙的拓扑会显著恶化去中心化SGD的收敛速率。然而,许多先前的论文报告说,在异质情况下拓扑的选择确实对实验有显著影响,但在同质情况下对训练行为影响很小。在本文中,我们提出了去中心化SGD的更紧的收敛分析,比先前的分析更精确地理解拓扑如何影响收敛速率。具体来说,与仅使用谱间隙作为拓扑属性的现有收敛分析不同,我们的新分析表明混合矩阵的所有特征值都影响收敛速率。通过实验,我们仔细评估了去中心化SGD的收敛行为,并证明了我们的新收敛分析可以更准确地描述拓扑对收敛速率的影响。

英文摘要

Decentralized SGD is a fundamental algorithm in decentralized learning, although the influence of the underlying network topology on its convergence behavior is not yet fully understood. Existing convergence analyses have shown that topologies with a small spectral gap significantly deteriorate the convergence rate of Decentralized SGD in both homogeneous and heterogeneous cases. However, many prior papers have reported that the choice of topology has a significant experimental impact in the heterogeneous case but little experimental impact on training behavior in the homogeneous case. In this paper, we present a tighter convergence analysis of Decentralized SGD, offering a more precise understanding of how topologies affect the convergence rate than prior analyses. Specifically, unlike existing convergence analyses that use only the spectral gap as a property of the topology, our novel analysis shows that all eigenvalues of the mixing matrix affect the convergence rate. Through experiments, we carefully evaluate the convergence behavior of Decentralized SGD and demonstrate that our novel convergence analysis more accurately describes the effect of topology on the convergence rate.
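As a small illustration of the two topology descriptors contrasted in the abstract above (the spectral gap versus the full spectrum of the mixing matrix), the sketch below builds a ring mixing matrix and lists all of its eigenvalues. It assumes uniform weights of 1/3 on each node and its two neighbors, for which the eigenvalues have a closed form; it does not reproduce the paper's convergence bound, and all names are hypothetical.

```python
import math


def ring_mixing_matrix(n):
    """Symmetric, doubly stochastic mixing matrix for a ring of n >= 3 nodes."""
    W = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in (i - 1, i, i + 1):
            W[i][j % n] = 1.0 / 3.0
    return W


def ring_eigenvalues(n):
    """Closed-form eigenvalues of the circulant matrix above; index 0 is the eigenvalue 1."""
    return [1.0 / 3.0 + 2.0 / 3.0 * math.cos(2.0 * math.pi * k / n) for k in range(n)]


def spectral_gap(eigs):
    """1 minus the largest absolute value among the non-leading eigenvalues."""
    return 1.0 - max(abs(lam) for lam in eigs[1:])


def gossip_step(W, x):
    """One averaging round: every node replaces its value by a weighted neighbor average."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in W]


def consensus_error(x):
    mean = sum(x) / len(x)
    return math.sqrt(sum((xi - mean) ** 2 for xi in x))


n = 8
W = ring_mixing_matrix(n)
eigs = ring_eigenvalues(n)
gap = spectral_gap(eigs)
x = [float(i) for i in range(n)]
x_next = gossip_step(W, x)
err_before, err_after = consensus_error(x), consensus_error(x_next)
```

The spectral gap summarizes only the slowest-decaying mode, while `eigs` shows how the remaining modes are distributed; the abstract argues that this full distribution matters for the convergence rate.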

2606.09731 2026-06-09 cs.LG 新提交

Tight Sample Complexity of Transformers

Transformer的紧样本复杂度

Chenxiao Yang, Nathan Srebro, Zhiyuan Li

发表机构 * Toyota Technological Institute at Chicago(丰田技术研究所芝加哥分校)

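AI summary Characterizes the VC dimension of Transformers with depth L and W total parameters, and establishes upper and lower sample-complexity bounds for chain-of-thought learning, showing how parameter count and sequence length affect the number of samples needed.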

Comments in COLT 2026

详情
AI中文摘要

我们严格刻画了深度为$L$、总参数为$W$、将输入序列长度$T$映射到单个输出的Transformer的VC维,建立了上界$O(L W \log (T W))$和几乎匹配的下界$\Omega(L W \log (T W / L))$。我们进一步严格刻画了使用此类Transformer进行思维链学习的样本复杂度,表明教师强制(即在训练数据上选择与整个思维链一致的预测器)学习的样本复杂度为$O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$,并且任何使用思维链数据的学习规则至少需要$\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$个样本,其中$T$是输入长度,$T^{\prime}$是自回归步数。

英文摘要

We tightly characterize the VC dimension of depth-$L$ Transformers with a total of $W$ parameters, mapping an input sequence of length $T$ to a single output, establishing an upper bound of $O(L W \log (T W))$ and a nearly matching lower bound of $\Omega(L W \log (T W / L))$. We further tightly characterize the sample complexity of chain-of-thought learning using such a Transformer, showing teacher forcing (i.e. selecting a predictor consistent with the entire chain-of-thought on training data) learns with sample complexity $O\left(L W \log \left(\left(T+T^{\prime}\right) W\right)\right)$ and that any learning rule that uses chain-of-thought data requires at least $\Omega\left(L W \log \left(\left(T+T^{\prime}\right) W / L\right)\right)$ examples, where $T$ is the input length and $T^{\prime}$ is the number of autoregressive steps.

2606.07588 2026-06-09 cs.NE cs.LG math.OC quant-ph 交叉投稿

Information-Geometric Optimization on Spheres

球面上的信息几何优化

Vladimir Jaćimović

发表机构 * Faculty of Natural Sciences and Mathematics, University of Montenegro(黑山大学自然科学与数学学院)

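AI summary For black-box optimization on spheres, designs two information-geometric optimization (IGO) flows based on the hyperbolic information geometry of Poincaré and Bergman balls, and shows how ensembles of generalized Kuramoto oscillators compute natural search gradients and realize IGO algorithms.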

详情
AI中文摘要

我们考虑球面上的黑箱优化问题。基于庞加莱球和伯格曼球的双曲(信息)几何,通过严格计算自然搜索梯度,设计了两种信息几何优化流(IGO流)。我们证明了球面上的广义Kuramoto振子集合能够计算自然搜索梯度,并在两种流形上实现IGO算法。指出了伯格曼球中的自然梯度策略与量子决策制定之间的关系。

英文摘要

We consider the black-box optimization problem on a sphere. Two information-geometric optimization flows (IGO flows) are designed with rigorous calculation of natural search gradients based on the hyperbolic (information) geometry of Poincaré and Bergman balls. We demonstrate that ensembles of generalized Kuramoto oscillators on spheres compute natural search gradients and realize IGO algorithms on both manifolds. The relationship between natural gradient policies in Bergman balls and quantum decision making is pointed out.

2606.07782 2026-06-09 math.OC cs.LG math.MG 交叉投稿

Non-Archimedean Polydisc Spaces and Applications to Optimisation

非阿基米德多圆盘空间及其在优化中的应用

Paul Lezeau, Yiannis Fam, Anthea Monod, Yue Ren

发表机构 * London School of Geometry and Number Theory(伦敦几何与数论学院) Department of Mathematics, Imperial College London(伦敦帝国学院数学系) Department of Mathematics, Durham University(杜伦大学数学系)

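AI summary Inspired by Berkovich geometry, introduces non-Archimedean polydisc spaces that retain a rigid hierarchical structure while gaining desirable geometric properties, proves that metric trees embed into them, proposes a function class of linear combinations of absolute values of polynomials, and develops an optimization theory with algorithms and an open-source implementation.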

Comments 54 pages, 23 figures. Comments welcome

详情
AI中文摘要

我们提出了一个受Berkovich几何启发的非阿基米德空间上的优化新框架。具体地,我们引入了多圆盘空间,它由非阿基米德域上的闭球乘积构成。这些空间保留了非阿基米德域的刚性层次结构,同时获得了许多该域所缺乏的优良几何特征。我们证明了度量树自然地嵌入这些空间,展示了它们表示层次数据的能力。我们研究了它们的度量几何,建立了诸如测地线唯一性等性质,证实了它们与经典优化技术的兼容性。我们进一步提出了一类由多项式绝对值线性组合给出的实值函数。这些函数沿测地线具有分段多项式描述,并满足通用逼近性质。我们建立了多圆盘空间上的优化理论:证明了极小值的存在性,并探索了寻找极小值的算法。我们提供了一个配套的开源Julia库,实现了所引入的核心对象和优化过程。

英文摘要

We propose a new framework for optimisation over non-Archimedean spaces inspired by Berkovich geometry. Specifically, we introduce polydisc spaces, which consist of products of closed balls over a non-Archimedean field. These spaces retain the rigid hierarchical structure of the non-Archimedean field whilst acquiring many desirable geometric features absent from it. We show that metric trees embed naturally into these spaces, demonstrating their capacity to represent hierarchical data. We study their metric geometry, establishing properties such as geodesic uniqueness, confirming their compatibility with classical optimisation techniques. We further propose a class of real-valued functions given by linear combinations of absolute values of polynomials. These functions admit a piecewise polynomial description along geodesics and satisfy a universal approximation property. We formulate a theory of optimisation on polydisc spaces: we prove existence of minimisers and explore algorithms for finding them. We provide an accompanying open-source Julia library implementing the core objects and optimisation procedures introduced.

2606.07841 2026-06-09 stat.CO cs.LG stat.ML 交叉投稿

Large-scale empirical tuning and comparison of default optimizers for variational inference

变分推断默认优化器的大规模经验调优与比较

Trevor Campbell, Jonathan H. Huggins, Kyurae Kim, Charles C. Margossian

发表机构 * Department of Statistics, UBC(统计学系,不列颠哥伦比亚大学) Department of Mathematics & Statistics, Boston University(数学与统计学系,波士顿大学) Faculty of Computing & Statistics, Boston University(计算与统计学学院,波士顿大学) Department of Computer and Information Science, UPenn(计算机与信息科学系,宾夕法尼亚大学)

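AI summary Evaluates adaptive optimizers for variational inference in a large-scale experiment (56 optimizers, 1092 problems, over 550,000 runs), finding that no single method dominates but that a selection of 5 algorithms comes close to the best observed performance.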

详情
AI中文摘要

黑箱变分推断(BBVI)是一种依赖于随机优化的后验近似方法。在实践中,支撑BBVI的随机优化器通常需要大量针对特定问题的调优,这削弱了其作为真正“黑箱”推断算法的承诺。然而,在过去十年中,许多新的自适应随机优化算法已被开发出来,它们减少或完全消除了调优的需要。在这项工作中,我们在BBVI的背景下研究了这些新的自适应方法集合,旨在建立当前无调优优化推断的最新技术水平。具体而言,我们对应用于1092个贝叶斯推断优化问题的56种基于随机梯度的优化算法进行了大规模实证评估,涉及超过55万次独立优化运行和15个核心年的计算。我们评估的优化算法代表了近期方法的广泛谱系,而基准问题则涵盖了从难度范围(后验目标维度1-10^4,条件数1-10^8)以及多种变分族。我们的结果表明,没有单一方法占主导地位,但运行5种算法的选择足以可靠地接近观察到的最佳性能。因此,我们为无法进行专家调优的应用以及开发新的随机优化算法时的比较提供了强有力的基线。

英文摘要

Black-box variational inference (BBVI) is a methodology for posterior approximation that relies on stochastic optimization. In practice, the stochastic optimizers underpinning BBVI generally require extensive problem-specific tuning, which undermines its promise as a truly "black box" inference algorithm. However, over the past decade, many new adaptive stochastic optimization algorithms have been developed that reduce or remove entirely the need for tuning. In this work, we investigate this new collection of adaptive methods in the context of BBVI, with the goal of establishing the current state of the art in tuning-free optimization-based inference. In particular, we present a large-scale empirical evaluation of 56 stochastic gradient-based optimization algorithms applied to 1092 Bayesian inference optimization problems, involving over 550,000 individual optimization runs and 15 core-years of compute. The optimization algorithms we evaluate are chosen to represent a wide spectrum of recent approaches and the benchmark problems are chosen to span a range of difficulty, with posterior target dimension 1-10^4, condition number 1-10^8, and a range of variational families. Our results show that no single method dominates, but running a selection of 5 algorithms suffices to reliably get close to the best-possible observed performance. We thus provide a strong baseline for applications where expert tuning is not possible and for comparison when developing new stochastic optimization algorithms.

2606.07914 2026-06-09 stat.ML cs.LG 交叉投稿

Identifiability and Estimation for Unlabeled Finite Mixtures under Marginal Independence

边际独立下无标签有限混合模型的可识别性与估计

Takafumi Kanamori, Yushi Hirose, Shohei Yamamoto

发表机构 * Department of Mathematical and Computing Science, Institute of Science Tokyo(东京科学大学数学与计算科学系) RIKEN Center for Advanced Intelligence Project(理化学研究所革新智能综合研究中心)

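AI summary Studies unlabeled finite mixtures, using a marginal-independence assumption to recover latent components and estimate the mixing matrix; proposes the PM-MMD estimator and proves its convergence.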

详情
AI中文摘要

我们研究来自无标签有限混合模型的成分恢复和混合矩阵估计,其中可观测分布共享相同的潜在成分但具有未知的混合权重。主要识别信号是边际独立性:每个成分假设在至少一个坐标对上是独立的,但没有观察到标签、干净的成分样本或混合权重。我们首先证明乘积成分的一个结构结果:在一元边际线性独立的条件下,成分的任何独立仿射组合必须与单个成分一致。然后我们将这一原理扩展到可观测混合,并表明在满秩和无抵消条件下,边际独立的仿射组合恢复相应的潜在成分。当每个成分在某个坐标对上是独立的时,所有成分都是可识别的,并且在所陈述的完成条件下混合矩阵是可恢复的。最后,我们提出一个基于可观测混合的仿射组合的乘积边际最大均值差异(PM-MMD)估计器,并证明在近似边际独立下的一致收敛性和稳定性。该框架还分离了假设的经验作用:一般来说,不可约性不能直接从无标签混合中检验,而边际独立性通过保留的PM-MMD提供候选级别的诊断。受控实验和流式细胞术实验显示了边际独立性何时提供有用的恢复信号。在报告的多成分比较中,条件感知的代表性选择稳定了PM-MMD,并相对于使用相同无标签混合的聚类、分解和成对混合比例基线改善了恢复。

英文摘要

We study component recovery and mixing-matrix estimation from unlabeled finite mixtures whose observable distributions share the same latent components but have unknown mixing weights. The main identifying signal is marginal independence: each component is assumed to be independent on at least one coordinate pair, but no labels, clean component samples, or mixing weights are observed. We first prove a structural result for product components: under linear independence of the univariate marginals, any independent affine combination of the components must coincide with a single component. We then extend this principle to observable mixtures and show that, under full-rank and no-cancellation conditions, marginally independent affine combinations recover the corresponding latent components. When every component is independent on some coordinate pair, all components are identifiable, and the mixing matrix is recoverable under the stated completion conditions. Finally, we propose a Product-Marginal Maximum Mean Discrepancy (PM-MMD) estimator over affine combinations of the observable mixtures and prove uniform convergence and stability under approximate marginal independence. This framework also separates the empirical roles of the assumptions: irreducibility is, in general, not directly testable from the unlabeled mixtures alone, whereas marginal independence yields a candidate-level diagnostic through held-out PM-MMD. Controlled and flow-cytometry experiments show when marginal independence provides a useful recovery signal. In the reported multi-component comparisons, condition-aware representative selection stabilizes PM-MMD and improves recovery relative to clustering, factorization, and pairwise mixture-proportion baselines using the same unlabeled mixtures.

2606.07926 2026-06-09 stat.ML cs.LG 交叉投稿

Barycentric Projections of Optimal Transport Plans on Riemannian Manifolds

黎曼流形上最优传输计划的重心投影

Kisung You

发表机构 * Baruch College(巴彻学院)

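AI summary Develops a framework for barycentric projections of transport couplings on Riemannian manifolds, obtaining the best deterministic map via conditional Fréchet means and defining a conditional-variance Monge defect; experiments support the distinct roles of the intrinsic and tangential projections.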

详情
AI中文摘要

最优传输耦合是概率对象,而许多学习流程需要确定性映射。在欧几里得空间中,重心投影通过取条件期望将耦合转换为映射,但在黎曼流形上,曲率和割迹使这一操作变得不平凡。我们开发了一个黎曼流形上传输耦合的重心投影框架。内在投影将每个源点映射到其目标分布的条件Fréchet均值,并证明它是平方测地线损失下的最佳确定性代表。相应的最小值是积分条件Fréchet方差,该方差对于由映射诱导的耦合恰好为零,因此定义了一个条件方差Monge缺陷。我们还研究了一个切向log-exp投影,证明了其欧几里得精确性、在Monge情况下与Brenier-McCann映射的兼容性,以及其作为内在目标的第一单位黎曼梯度更新的解释。对于离散耦合,两种构造都按行分解为加权Fréchet均值和log-exp问题。在球面数据、合成SPD数据和真实EEG协方差矩阵上的实验支持所提出的角色分工:内在投影是变分代表,而切向投影是有用的局部位移代理。

英文摘要

Optimal transport couplings are probabilistic objects, while many learning pipelines require deterministic maps. In Euclidean space, barycentric projection converts a coupling into a map by taking conditional expectations, but on a Riemannian manifold curvature and cut loci make this operation nontrivial. We develop a framework for barycentric projections of transport couplings on Riemannian manifolds. The intrinsic projection maps each source point to the conditional Fréchet mean of its destination law and is shown to be the best deterministic representative under squared geodesic loss. The corresponding minimum value is an integrated conditional Fréchet variance, which vanishes exactly for map-induced couplings and therefore defines a conditional-variance Monge defect. We also study a tangential log-exp projection, prove its Euclidean exactness, its compatibility with Brenier-McCann maps in the Monge case, and its interpretation as the first unit Riemannian gradient update for the intrinsic objective. For discrete couplings, both constructions decompose row-wise into weighted Fréchet mean and log-exp problems. Experiments on spherical data, synthetic SPD data, and real EEG covariance matrices support the proposed division of roles: the intrinsic projection is the variational representative, while the tangential projection is a useful local displacement surrogate.
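The row-wise structure mentioned in the abstract above can be sketched for a discrete coupling. The first function below is the Euclidean barycentric projection (a conditional expectation per row); the second is a tangential log-exp projection on the unit sphere, written here as an assumption about what such a construction looks like rather than as the paper's definition. Antipodal source and target points (the cut locus) are not handled, and all function names are hypothetical.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def euclidean_barycentric_projection(P, Y):
    """Map source i to sum_j P[i][j] * Y[j] / sum_j P[i][j]."""
    dim = len(Y[0])
    out = []
    for row in P:
        mass = sum(row)
        out.append([sum(p * y[d] for p, y in zip(row, Y)) / mass for d in range(dim)])
    return out


def sphere_log(x, y):
    """Log map on the unit sphere at x; assumes y is not antipodal to x."""
    c = max(-1.0, min(1.0, dot(x, y)))
    theta = math.acos(c)
    if theta < 1e-12:
        return [0.0] * len(x)
    s = math.sin(theta)
    return [theta * (yi - c * xi) / s for xi, yi in zip(x, y)]


def sphere_exp(x, v):
    """Exp map on the unit sphere at x for a tangent vector v."""
    norm = math.sqrt(dot(v, v))
    if norm < 1e-12:
        return list(x)
    return [math.cos(norm) * xi + math.sin(norm) * vi / norm for xi, vi in zip(x, v)]


def tangential_projection(x, row, Y):
    """Average the log-mapped targets with the row's conditional weights, then map back."""
    mass = sum(row)
    v = [0.0] * len(x)
    for p, y in zip(row, Y):
        lg = sphere_log(x, y)
        v = [vi + (p / mass) * li for vi, li in zip(v, lg)]
    return sphere_exp(x, v)


Y = [[0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
x = [1.0, 0.0, 0.0]
flat = euclidean_barycentric_projection([[0.5, 0.5]], Y)[0]
on_sphere = tangential_projection(x, [0.5, 0.5], Y)
map_induced = tangential_projection(x, [1.0, 0.0], Y)
```

The Euclidean average `flat` leaves the sphere, while `on_sphere` stays on it; for a row with a single nonzero weight (a map-induced coupling) the tangential projection returns the target point itself.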

2606.07931 2026-06-09 math.PR cond-mat.stat-mech cs.IT cs.LG math.IT math.ST stat.TH 交叉投稿

Pointwise Complexity for Gaussian Fields: Upper Envelopes, Algorithmic Lower Bounds, and Separation

高斯场的逐点复杂度:上包络、算法下界与分离

Yunbei Xu

发表机构 * National University of Singapore(新加坡国立大学)

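AI summary Proves a variance-aware pointwise majorizing-measure theorem that gives high-probability upper envelopes for Gaussian processes, and uses a Bayesian algorithmic lower bound and a weighted-basis example to exhibit a separation between pointwise complexity and global minimax risk.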

详情
AI中文摘要

我们为中心高斯过程证明了一个方差感知的逐点主测度定理。经典的泛函链刻画了标量量$\mathbb E\sup_{x\in T}X_x$;这里的定理给出了整个场的同时高概率包络。对于先验测度$\mu$,在$x$处的包络由逐点Fernique-Talagrand泛函\[\Phi_\mu(x):=\int_0^{4\sigma(x)}\sqrt{\log\frac{1}{\mu(B_d(x,\varepsilon))}}\,d\varepsilon\]以及相应的高斯尾项控制。该定理提供了经典泛函链的可重用场级精化,以及深度神经网络逐点经验过程界的高斯过程对应物。我们还从交互式Fano/数据处理原理记录了一个贝叶斯算法下包络。对于已知先验$\pi$、观测信道和具体估计量$\widehat t(Y)$,下界通过精确的鬼小弹球质量$\mathbb E_{Y\sim Q}\pi(B_d(\widehat t(Y),\Delta))$表示,而非最坏情况覆盖数。在高斯位置实验中,比较译码器将贝叶斯位置误差转化为决策对齐高斯范围的下界。然后我们构造一个简单的加权基示例,将固定先验的通常Fano松弛、贝叶斯算法下包络、选定子图集上的逐点高斯包络以及全类极小极大风险/全局高斯尺度分离开来。这些结果共同表明,在经典极小极大理论变得过于粗糙或依赖预言机的超参数化环境类中,算法下界为固定估计量提供了逐点复杂性的局部几何证书。

英文摘要

We prove a variance-aware pointwise majorizing-measure theorem for centered Gaussian processes. Classical generic chaining characterizes the scalar quantity $\mathbb E\sup_{x\in T}X_x$; the theorem here gives a simultaneous high-probability envelope for the entire field. For an ambient prior $\mu$, the envelope at $x$ is governed by a pointwise Fernique-Talagrand functional \[\Phi_\mu(x):=\int_0^{4\sigma(x)}\sqrt{\log\frac{1}{\mu(B_d(x,\varepsilon))}}\,d\varepsilon,\] together with the corresponding Gaussian tail term. The theorem provides a reusable field-level refinement of classical generic chaining and a Gaussian-process counterpart of pointwise empirical-process bounds for deep neural networks. We also record a Bayesian algorithmic lower envelope from the interactive Fano/data-processing principle. For a known prior $\pi$, an observation channel, and a concrete estimator $\widehat t(Y)$, the lower bound is expressed through the exact ghost small-ball mass $\mathbb E_{Y\sim Q}\pi(B_d(\widehat t(Y),\Delta))$, rather than a worst-case covering number. In Gaussian location experiments, comparison decoders convert Bayes location error into lower bounds on decision-aligned Gaussian ranges. We then construct an elementary weighted-basis example separating the usual Fano relaxation for a fixed prior, the Bayesian algorithmic lower envelope, the pointwise Gaussian envelope on the selected subatlas, and the full-class minimax risk/global Gaussian scale. Together, these results show that algorithmic lower bounds provide local-geometric certificates of pointwise complexity for fixed estimators in overparameterized ambient classes, precisely in regimes where classical minimax theory becomes either too coarse or oracle-dependent.

2606.08188 2026-06-09 math.OC cs.LG 交叉投稿

Latent Structural Categorical Matrix Completion with Application to Quasispecies Analysis

潜在结构分类矩阵补全及其在准种分析中的应用

Qian Zhang, Meixia Lin

发表机构 * Engineering Systems and Design, Singapore University of Technology and Design(新加坡科技设计大学工程系统与设计系) Institute of Statistics and Big Data, Renmin University of China(中国人民大学统计与大数据研究院)

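AI summary Proposes LCMC, a double-loop optimization framework that performs latent factorization of categorical matrices through a binary tensor representation: the outer loop adaptively estimates the latent dimension and the inner loop reconstructs the matrix via tensor factorization; it outperforms existing methods on viral quasispecies reconstruction.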

详情
AI中文摘要

矩阵补全在实值数据中已被广泛研究,但现有方法在处理分类变量时往往受限。我们提出LCMC,一种基于二元张量表示的潜在分解分类矩阵补全双循环优化框架。在此设置中,每个分类条目沿第三张量模式编码为独热向量,从而保留其离散、非序数的性质。外环通过内环反馈迭代更新潜在维度来自适应估计,内环通过张量分解重构分类矩阵,并有相应理论分析支持。为进一步提高可扩展性和鲁棒性,我们引入了包括分裂-合并-细化策略和自适应数据缩减技术在内的增强功能。在病毒准种重建的合成和真实数据集上的实验表明,与现有方法相比,LCMC实现了更高的准确性和效率。

英文摘要

Matrix completion has been extensively studied for real-valued data, but existing methods are often limited in handling categorical variables. We propose LCMC, a double-loop optimization framework for categorical matrix completion via latent factorization based on a binary tensor representation. In this setting, each categorical entry is encoded as a one-hot vector along a third tensor mode, thereby preserving its discrete, non-ordinal nature. The outer loop adaptively estimates the latent dimension by iteratively updating it with feedback from the inner loop, while the inner loop reconstructs the categorical matrix through tensor factorization, supported by a corresponding theoretical analysis. To further improve scalability and robustness, we introduce enhancements including a split-merge-refine strategy and an adaptive data reduction technique. Experiments on synthetic and real-world datasets in viral quasispecies reconstruction demonstrate that LCMC achieves superior accuracy and efficiency compared to existing methods.
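The binary tensor representation described in the abstract above can be sketched as follows: each categorical entry becomes a one-hot vector along a third mode. Representing a missing entry as an all-zero vector is an assumption made here for illustration, as are the function names; the factorization loops of LCMC are not shown.

```python
def one_hot_tensor(M, num_categories):
    """Encode a categorical matrix as a binary tensor T[i][j][k].

    Entries of M are integers in range(num_categories), or None when missing.
    Missing entries become all-zero vectors (an illustrative convention).
    """
    T = []
    for row in M:
        T_row = []
        for c in row:
            vec = [0] * num_categories
            if c is not None:
                vec[c] = 1
            T_row.append(vec)
        T.append(T_row)
    return T


def decode(T):
    """Invert the encoding: position of the 1 in each vector, or None if absent."""
    return [[vec.index(1) if 1 in vec else None for vec in T_row] for T_row in T]


# Toy matrix with 3 categories (e.g. nucleotide-like symbols) and one missing entry.
M = [[0, 2, None],
     [1, 1, 0]]
T = one_hot_tensor(M, 3)
```

Because the categories sit on separate coordinates, the encoding imposes no ordering among them, which is the non-ordinal property the abstract emphasizes.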

2606.08196 2026-06-09 stat.ML cs.AI cs.LG stat.ME 交叉投稿

Beyond Additivity: Causal Discovery in Location-Scale Noise Models with Hidden Variables

超越可加性:含隐变量的位置-尺度噪声模型中的因果发现

Mariyam Khan, Shohei Shimizu, Thong Pham

发表机构 * RIKEN AIP(理化学研究所革新智能综合研究中心) University of Bergen(卑尔根大学) The University of Osaka(大阪大学) Shiga University(滋贺大学)

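AI summary For causal discovery with hidden variables under a location-scale noise model (LSNM), proves that acyclic directed mixed graphs (ADMGs) satisfying a bow-free condition are identifiable, and proposes a two-stage algorithm, LSNM-UV, that outperforms additive baselines on heteroscedastic data.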

Comments 33 pages, 4 figures

详情
AI中文摘要

我们研究当某些变量隐藏且数据生成过程遵循位置-尺度噪声模型(LSNM)时,从观测数据进行因果发现的问题。现有处理隐藏混杂变量的方法通常假设可加性噪声,但在实践中,原因不仅调节其效应的均值,还调节方差。我们证明,满足无弓条件的非循环有向混合图(ADMG)在含隐变量的LSNM下是可识别的,建立了超越噪声可加性的因果不足模型的第一个可识别性结果。我们进一步提供了即使违反无弓假设时识别因果方向的充分条件。我们的两阶段算法LSNM-UV是正确且完备的,实验表明在异方差数据上优于可加性基线方法。

英文摘要

We study causal discovery from observational data when some variables are hidden and the data-generating process follows a location-scale noise model (LSNM). Existing methods that handle hidden confounders typically assume additive noise, but in practice, causes often modulate not just the mean but also the variance of their effects. We prove that acyclic directed mixed graphs (ADMGs) satisfying a bow-free condition are identifiable under LSNM with hidden variables, establishing the first identifiability result for causally insufficient models beyond noise additivity. We further provide sufficient conditions for identifying causal direction even when the bow-free assumption is violated. Our two-stage algorithm, LSNM-UV, is sound and complete, and experiments demonstrate improved performance over additive baselines on heteroscedastic data.
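To illustrate what a location-scale noise model looks like in the simplest two-variable case, the sketch below simulates Y = f(X) + g(X) * N with N independent of X, so that the cause X shifts both the mean and the variance of its effect. The specific choices of f, g, the sampling ranges, and all names are assumptions for illustration; hidden variables and the LSNM-UV algorithm are not modeled.

```python
import random


def f(x):
    """Location function (illustrative choice)."""
    return 2.0 * x


def g(x):
    """Scale function; must be strictly positive (illustrative choice)."""
    return 0.5 + x * x


def sample_lsnm(n, seed=0):
    """Draw n pairs (x, y) with y = f(x) + g(x) * noise and noise independent of x."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        x = rng.uniform(-2.0, 2.0)
        noise = rng.gauss(0.0, 1.0)
        data.append((x, f(x) + g(x) * noise))
    return data


def variance(values):
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


data = sample_lsnm(20000)
resid_small = [y - f(x) for x, y in data if abs(x) < 0.5]
resid_large = [y - f(x) for x, y in data if abs(x) > 1.5]
standardized = [(y - f(x)) / g(x) for x, y in data]
```

An additive-noise model would predict equal residual variance in both regions; here the variance grows with |x|, and only after dividing by g(x) do the residuals look like a single noise distribution.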

2606.08438 2026-06-09 stat.ML cs.LG 交叉投稿

Improving Bayesian Optimization via Training-Aware Conditional Diffusion Models

通过训练感知的条件扩散模型改进贝叶斯优化

Yilin Zheng, Haowei Wang, Szu Hui Ng, Enlu Zhou

发表机构 * National University of Singapore(新加坡国立大学) Georgia Institute of Technology(佐治亚理工学院)

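AI summary Uses conditional diffusion models to efficiently approximate the distribution of the optimal solution, develops BO-inherent training strategies and a Diffusion-based Mode Seeking acquisition strategy, provides a sub-optimality guarantee, and outperforms standard baselines in experiments.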

详情
AI中文摘要

贝叶斯优化(BO)是一种广泛使用的黑箱优化方法,它使用高斯过程(GP)作为代理模型,并通过采集函数指导顺序评估,最终目标是定位全局最优解 $\mathbf{x}^{\star}$。为了实现这一目标,基于信息的采集函数(如预测熵搜索PES)将 $\mathbf{x}^{\star}$ 建模为随机变量,并减少其分布的熵,但通过传统的GP后验采样来近似该分布计算成本高昂。为了解决这一限制,我们利用条件扩散模型(CDM)高效近似 $\mathbf{x}^{\star}$ 的分布,并为CDM开发了BO固有的训练策略。受CDM学习分布的结构特性启发,我们进一步提出了一种称为基于扩散的模态搜索(DMS)的采集策略来指导顺序评估。我们为CDM学习分布建立了次优性保证,并通过大量实验证明DMS优于标准BO基线。

英文摘要

Bayesian optimization (BO) is a widely used approach for black-box optimization that uses a Gaussian process (GP) as a surrogate and guides sequential evaluations via an acquisition function, with the ultimate goal of locating the global optimum $\mathbf{x}^{\star}$. To align with this goal, information-based acquisition functions such as Predictive Entropy Search (PES) model $\mathbf{x}^{\star}$ as a random variable and reduce the entropy of its distribution, but approximating this distribution via traditional GP posterior sampling is computationally expensive. To address this limitation, we leverage Conditional Diffusion Models (CDMs) to efficiently approximate the distribution of $\mathbf{x}^{\star}$ and develop BO-inherent training strategies for CDMs. Motivated by the structural properties of the CDM-learned distribution, we further develop an acquisition strategy termed Diffusion-based Mode Seeking (DMS) to guide the sequential evaluation. We establish a sub-optimality guarantee for the CDM-learned distribution and demonstrate through extensive experiments that DMS outperforms standard BO baselines.

2606.08638 2026-06-09 math.OC cs.LG 交叉投稿

Parameter Tuning with Generalization Guarantees for GPU-Accelerated Linear Programming

具有泛化保证的GPU加速线性规划参数调优

Siddharth Prasad, Dravyansh Sharma

发表机构 * Siddharth Prasad Dravyansh Sharma

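AI summary For hyperparameter tuning of the GPU-accelerated linear programming solver PDLP, draws on data-driven algorithm design theory to give sample-complexity guarantees for learning hyperparameters such as the step size and primal weight, with experiments demonstrating the need for tuning.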

详情
AI中文摘要

最近的研究开发了实用、可并行化的一阶方法用于大规模线性规划,但性能高度依赖于超参数选择。我们为(cu)PDLP(一种为现代硬件设计的最先进的一阶LP求解器)中的超参数调优推导了泛化保证。首先,我们确定了PDHG(PDLP的基础算法,即原始-对偶混合梯度算法)的行为与其步长和原始权重的函数关系,从而为学习这些参数提供了线性样本复杂度保证。然后,我们对PDLP进行了结构分析,该算法在PDHG基础上增加了多种专门技术,如预处理、自适应步长、平均化、自适应重启和平滑原始权重更新。我们的分析捕捉了作为超参数函数的解轨迹行为,并利用数据驱动算法设计的最新进展,为学习这些超参数获得了多项式样本复杂度保证。最后,我们进行了概念验证实验,证明了数据驱动PDLP参数调优的必要性。我们的结果展示了数据驱动算法设计工具包在复杂现代优化算法的求解器级实现中进行原则性超参数调优的通用性。

英文摘要

Recent research has developed practical, parallelizable first-order methods for large scale linear programming, but performance is highly dependent on hyperparameter selection. We derive generalization guarantees for hyperparameter tuning within (cu)PDLP, a state-of-the-art first-order LP solver designed for modern hardware. First, we pin down the behavior of PDHG, the primal-dual hybrid gradient algorithm that underlies PDLP, as a function of its step size and primal weight, leading to linear sample complexity guarantees for learning those parameters. We then conduct a structural analysis of PDLP, which augments PDHG with several specialized techniques like preconditioning, adaptive step sizes, averaging, adaptive restarts, and smoothed primal weight updates. Our analysis captures the behavior of the solution trajectory as a function of the hyperparameters and leverages recent advances in data-driven algorithm design to obtain polynomial sample complexity guarantees for learning those hyperparameters. Finally, we conduct proof-of-concept experiments that demonstrate the need for data-driven PDLP parameter tuning. Our results showcase the versatility of the data-driven algorithm design toolkit for principled hyperparameter tuning within solver-grade implementations of complex modern optimization algorithms.

2606.08727 2026-06-09 math.NA cs.LG cs.NA 交叉投稿

Compositional Approximation Can Strictly Outperform Superpositional Approximation

组合逼近可以严格优于叠加逼近

Dennis Elbrächter, Philipp Petersen

发表机构 * University of Freiburg(弗赖堡大学)

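AI summary Constructs explicit examples of function classes on which superpositional approximation rates are strictly lower than compositional approximation rates, with an arbitrarily large gap.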

详情
AI中文摘要

许多经典研究的函数类已知可以通过叠加方法最优逼近,即通过某些字典中元素的线性组合构造逼近。这里的最优性意味着,以参数数量为函数的均匀逼近误差具有任何参数化方法所能达到的最高阶多项式衰减,其中参数可以编码为长度与参数数量成正比(对数因子内)的比特串。尽管像神经网络这样的组合方法在结构上不同,但通过施加确保这种比例比特串编码的约束,它们的逼近速率可以变得可比。在这项工作中,我们研究了具有结构性质的函数类,这些性质限制了叠加逼近速率严格低于组合逼近速率。特别地,我们构造了显式例子,使得两者之间存在任意大的差距。

英文摘要

Many classically studied function classes are known to be approximated optimally by superpositional methods, i.e. with approximants constructed as the linear combination of elements in some dictionary. Here optimality means that the uniform approximation error viewed as a function of the number of parameters used has polynomial decay of the highest order achievable by any parametrized method whose parameters can be encoded as a bit string of length proportional, up to logarithmic factors, to the number of parameters. While compositional methods like neural networks are structurally different, their approximation rates can be made comparable by imposing constraints that ensure such a proportional bit string encoding. In this work we study function classes exhibiting structural properties that limit superpositional approximation rates to be strictly lower than compositional approximation rates. In particular, we construct explicit examples for which there is an arbitrarily large gap.

2606.08783 2026-06-09 math.OC cs.LG cs.NA math.NA 交叉投稿

OptMuon: Closed-Loop Orthogonalized Momentum Methods for Stochastic Optimization with Zero-Noise Optimality

OptMuon:用于随机优化的闭环正交动量方法及其零噪声最优性

Ganzhao Yuan

发表机构 * Faculty of Computer Science and Artificial Intelligence(计算机科学与人工智能学院) Shenzhen University of Advanced Technology (SUAT)(深圳先进技术大学)

AI总结 提出OptMuon,将Muon风格极因子方向与轨迹依赖的AdaGrad-Norm型系数调度结合,实现自适应动量正交化,在无噪声时达到近乎最优的一阶速率,且无需手动调整超参数。

详情
AI中文摘要

正交化动量更新,如Muon风格优化器中所使用的,最近在大规模深度学习中显示出强大的经验稳定性。然而,现有的正交化方法通常与常数或开环幅度规则配对,因此不会根据观察到的优化轨迹明确校准其更新幅度。受Lipschitz-free和噪声自适应方法背后的闭环视角启发,我们提出了OptMuon,一种用于随机非凸优化的自适应动量正交化方法家族。OptMuon将Muon风格的极因子方向与轨迹依赖的AdaGrad-Norm型系数调度相结合,使得更新幅度由观察到的梯度和动量历史决定,而不是由预设的Lipschitz依赖规则决定。该调度在参数选择中不使用光滑常数、方差水平或有界梯度常数,其运行最大值校正防止了孤立的梯度尖峰导致过度的系数崩溃。在随机梯度有界方差、光滑性以及几乎必然有界随机梯度条件下,我们证明了两个互补的保证。OptMuon-A在平均光滑性下达到噪声自适应速率\(\tilde{\mathcal O}(T^{-1/2}+σ^{1/2}T^{-1/4})\),而OptMuon-I在个体光滑性下达到\(\tilde{\mathcal O}(T^{-1/2}+σ^{1/3}T^{-1/3})\)。在零噪声机制下,两个界限自动简化为近乎最优的确定性一阶速率\(\tilde{\mathcal O}(T^{-1/2})\),无需手动重新调整超参数。这些结果表明,闭环标量自适应可以与Muon风格的动量正交化相结合,同时保持噪声自适应性和零噪声最优性(至多对数因子)。

英文摘要

Orthogonalized momentum updates, as used in Muon-style optimizers, have recently shown strong empirical stability in large-scale deep learning. However, existing orthogonalized methods are typically paired with constant or open-loop magnitude rules, and therefore do not explicitly calibrate their update magnitudes from the observed optimization trajectory. Motivated by the closed-loop perspective behind Lipschitz-free and noise-adaptive methods, we propose OptMuon, a family of adaptive momentum orthogonalization methods for stochastic nonconvex optimization. OptMuon combines Muon-style polar-factor directions with a trajectory-dependent AdaGrad-Norm-type coefficient schedule, so that the update magnitude is determined by the observed gradient and momentum history rather than by a prescribed Lipschitz-dependent rule. The schedule does not use the smoothness constant, the variance level, or the bounded-gradient constant in parameter selection, and its running-maximum correction prevents isolated gradient spikes from causing excessive coefficient collapse. Under lower-boundedness, unbiased stochastic gradients with bounded variance, smoothness, and an almost-sure bounded stochastic-gradient condition, we prove two complementary guarantees. OptMuon-A achieves the noise-adaptive rate \(\tilde{\mathcal O}(T^{-1/2}+σ^{1/2}T^{-1/4})\) under average smoothness, while OptMuon-I achieves \(\tilde{\mathcal O}(T^{-1/2}+σ^{1/3}T^{-1/3})\) under individual smoothness. In the zero-noise regime, both bounds automatically reduce to a nearly optimal deterministic first-order rate \(\tilde{\mathcal O}(T^{-1/2})\) without manual hyperparameter retuning. These results show that closed-loop scalar adaptation can be combined with Muon-style momentum orthogonalization while retaining noise adaptivity and zero-noise optimality up to logarithmic factors.
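The two ingredients named in the abstract above, a polar-factor direction and an AdaGrad-Norm-type scalar coefficient, can be combined in a generic sketch. This is not the OptMuon schedule: the running-maximum correction and the paper's exact coefficient rule are omitted, the polar factor is approximated with a plain Newton-Schulz iteration, and the class name, defaults, and update form are all assumptions.

```python
import math


def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def transpose(A):
    return [list(col) for col in zip(*A)]


def frob(A):
    return math.sqrt(sum(x * x for row in A for x in row))


def polar_factor(M, iters=30):
    """Approximate the orthogonal polar factor of a full-rank M via Newton-Schulz."""
    nrm = frob(M)
    if nrm == 0.0:
        return [[0.0] * len(M[0]) for _ in M]
    X = [[x / nrm for x in row] for row in M]  # singular values now lie in (0, 1]
    for _ in range(iters):
        XXtX = matmul(matmul(X, transpose(X)), X)
        X = [[1.5 * x - 0.5 * y for x, y in zip(rx, ry)] for rx, ry in zip(X, XXtX)]
    return X


class OrthoMomentumAdaGradNorm:
    """Hypothetical optimizer: orthogonalized momentum scaled by an AdaGrad-Norm coefficient."""

    def __init__(self, eta=1.0, beta=0.9, b0=1e-8):
        self.eta, self.beta, self.b0 = eta, beta, b0
        self.momentum = None
        self.grad_sq_sum = 0.0
        self.last_coeff = None

    def step(self, W, G):
        if self.momentum is None:
            self.momentum = [[0.0] * len(G[0]) for _ in G]
        self.momentum = [[self.beta * m + (1.0 - self.beta) * g for m, g in zip(rm, rg)]
                         for rm, rg in zip(self.momentum, G)]
        # The coefficient depends only on the observed gradient history.
        self.grad_sq_sum += frob(G) ** 2
        self.last_coeff = self.eta / math.sqrt(self.b0 ** 2 + self.grad_sq_sum)
        D = polar_factor(self.momentum)
        return [[w - self.last_coeff * d for w, d in zip(rw, rd)] for rw, rd in zip(W, D)]


U = polar_factor([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]])
opt = OrthoMomentumAdaGradNorm()
W0 = [[0.0, 0.0], [0.0, 0.0]]
G = [[1.0, 0.5], [0.2, 2.0]]
W1 = opt.step(W0, G)
c1 = opt.last_coeff
W2 = opt.step(W1, G)
c2 = opt.last_coeff
```

The direction `D` carries no magnitude information because all of its singular values are close to 1, so the step size comes entirely from the scalar coefficient, which shrinks as gradient norms accumulate.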

2606.08941 2026-06-09 stat.ML cs.LG 交叉投稿

Estimate Collapsibility of Causal Effects in Completed Partial DAGs via Strong d-Convex Hulls

基于强d-凸包的完全部分有向无环图中因果效应的估计可压缩性

Yuxin Deng, Yi Sun, Zhiming Li, Huaxiong Liu

发表机构 * College of Mathematics and System Science, Xinjiang University(新疆大学数学与系统科学学院) Institute of Statistics and Data Science, Xinjiang University of Finance and Economics(新疆财经大学统计与数据科学研究院)

AI总结 提出一种在完全部分有向无环图中保持因果效应估计一致性的可压缩方法,通过强d-凸包刻画最小可压缩集,并设计高效算法结合IDA框架。

详情
AI中文摘要

本文提出一种可压缩的因果效应估计方法,该方法在完全部分有向无环图(CPDAG)中对某些变量边缘化前后保持估计量的一致性。我们首先引入了CPDAG的估计可压缩性,并将最小可压缩集刻画为强d-凸包。设计了一种高效算法来获取DAG中的此类集合,并将其推广到CPDAG。然后,我们将图约简过程与IDA框架相结合。最后,实验和实证分析显示了CPDAG中因果估计可压缩性的有效性。代码可在 https://github.com/Jamyang-D/strongly-convex 获取。

英文摘要

This paper proposes a collapsible method for estimating causal effects that maintains the estimator's consistency before and after marginalization over some variables in completed partially directed acyclic graphs (CPDAGs). We first introduce the estimate collapsibility for CPDAGs and characterize the minimal collapsible sets as strong d-convex hulls. An efficient algorithm is devised to obtain such sets in DAGs and is generalized to CPDAGs. Then, we combine the graph reduction procedure with the IDA framework. Finally, experiments and empirical analysis show the effectiveness of the collapsibility for causal estimations in CPDAGs. Code is available at https://github.com/Jamyang-D/strongly-convex.

2606.09820 2026-06-09 math.FA cs.LG math.PR q-fin.MF stat.ML 交叉投稿

Weighted universal approximation of differentiable maps on infinite-dimensional manifolds

无限维流形上可微映射的加权通用逼近

Philipp Schmocker, Josef Teichmann

发表机构 * Department of Mathematics, ETH Zurich, Switzerland(苏黎世联邦理工学院数学系)

AI总结 通过加权Nachbin定理,将函数输入神经网络的通用逼近定理推广到可微映射,包括导数逼近,并应用于非预期泛函和路径空间泛函的逼近。

Comments 77 pages, 3 figures

详情
AI中文摘要

我们将函数输入神经网络(FNN)的通用逼近定理推广到可微映射,包括导数的逼近。FNN将输入从可能无限维的加权流形映射到实值隐藏层,在该层上应用非线性标量激活函数,然后通过一些线性读出将输出返回到Banach空间。通过证明加权Nachbin定理,我们建立了可微映射的通用逼近定理(UAT),该定理超越了紧集上的通常表述,并且还包括导数的逼近。这导致了非预期泛函(包括水平和垂直导数)的逼近结果。作为进一步的应用,我们证明了签名的线性函数能够逼近路径空间泛函,包括它们的方向导数。

英文摘要

We generalize the universal approximation theorem for functional input neural networks (FNN) to differentiable maps by including the approximation of the derivatives. An FNN maps the input from a possibly infinite-dimensional weighted manifold to the real-valued hidden layer, on which a non-linear scalar activation function is applied, and then returns the output into a Banach space via some linear readouts. By proving a weighted Nachbin theorem, we establish a universal approximation theorem (UAT) for differentiable maps, which goes beyond the usual formulation on compact sets and also includes the approximation of the derivatives. This leads us to approximation results for non-anticipative functionals including the horizontal and vertical derivatives. As a further application, we show that linear functions of the signature are able to approximate path space functionals including their directional derivatives.

2302.09832 2026-06-09 cs.LG math.OC 版本更新

TAMUNA: Doubly Accelerated Distributed Optimization under Partial Participation

TAMUNA: 部分参与下的双重加速分布式优化

Laurent Condat, Ivan Agarský, Grigory Malinovsky, Peter Richtárik

发表机构 * Computer Science Program, CEMSE Division, King Abdullah University of Science and Technology (KAUST)(阿卜杜拉国王科技大学CEMSE学部计算机科学项目) SDAIA-KAUST Center of Excellence in Data Science and Artificial Intelligence (SDAIA-KAUST AI)(SDAIA-KAUST数据科学与人工智能卓越中心) Brno University of Technology(布尔诺理工大学) Kempelen Institute of Intelligent Technologies (KInIT)(Kempelen智能技术研究所)

AI总结 提出TAMUNA算法,首次结合本地训练、压缩和部分参与,实现双重加速收敛,支持任意客户端参与水平。

详情
AI中文摘要

在分布式优化和联邦学习中,并行设备与中央服务器之间缓慢且昂贵的通信是主要瓶颈。为了缓解这一负担,出现了两种策略:1)本地训练(LT),通过在轮次之间执行多次本地计算来降低通信频率;2)压缩(CC),即传输低维度的紧凑表示。最近的理论进展成功地将LT和CC结合起来,实现了关于条件数和模型维度的双重加速通信速率。然而,这些方法有一个主要缺点:它们需要所有客户端参与,并且在空闲客户端错过通信触发时失效。我们引入了TAMUNA,这是第一个成功交织LT、CC和部分参与的算法。通过将原始模型更新与对偶控制变量解耦,TAMUNA克服了先前方法的架构死锁。在强凸设置下,TAMUNA线性收敛到精确解,通过展示双重加速收敛建立了新的最先进水平,同时支持任意水平的客户端参与。

英文摘要

In distributed optimization and federated learning, slow and costly communication between parallel devices and the central server constitutes the primary bottleneck. To alleviate this burden, two strategies have emerged: 1) local training (LT), which reduces communication frequency by performing multiple local computations between rounds, and 2) compression (CC), which consists of transmitting lower-dimensional, compact representations. Recent theoretical advances have successfully combined LT and CC to achieve doubly-accelerated communication rates, with respect to both condition number and model dimension. However, these methods have a major drawback: they require full client participation and break down when idle clients miss communication triggers. We introduce TAMUNA, the first algorithm to successfully intertwine LT, CC, and partial participation. By decoupling primal model updates from dual control variates, TAMUNA overcomes the architectural deadlock of prior methods. In the strongly convex setting, TAMUNA converges linearly to the exact solution, establishing a new state of the art by exhibiting doubly-accelerated convergence, while supporting arbitrary levels of client participation.

2401.01599 2026-06-09 cs.LG math.ST stat.TH 版本更新

Generalization Error Curves for Analytic Spectral Algorithms under Power-law Decay

幂律衰减下解析谱算法的泛化误差曲线

Yicheng Li, Weiye Gan, Zuoqiang Shi, Qian Lin

发表机构 * Tsinghua University(清华大学)

AI总结 本文在温和假设下,完整刻画了核梯度下降等解析谱算法在核回归中的泛化误差曲线,揭示了核插值的不一致性和高资格算法的饱和效应,并通过神经正切核理论加深了对宽神经网络泛化行为的理解。

详情
AI中文摘要

某些核回归方法的泛化误差曲线旨在确定在不同源条件、噪声水平和正则化参数选择下泛化误差的精确阶数,而非极小极大速率。在这项工作中,在温和假设下,我们严格地提供了核梯度下降方法(以及一大类解析谱算法)在核回归中泛化误差曲线的完整刻画。因此,我们可以锐化核插值的近不一致性,并阐明具有更高资格的核回归算法的饱和效应等。得益于神经正切核理论,这些结果极大地提高了我们对训练宽神经网络泛化行为的理解。一个新颖的技术贡献——解析泛函论证——可能具有独立的意义。

英文摘要

The generalization error curve of a kernel regression method aims at determining the exact order of the generalization error under various source conditions, noise levels and choices of the regularization parameter, rather than the minimax rate. In this work, under mild assumptions, we rigorously provide a full characterization of the generalization error curves of the kernel gradient descent method (and a large class of analytic spectral algorithms) in kernel regression. Consequently, we can sharpen the near inconsistency of kernel interpolation and clarify the saturation effects of kernel regression algorithms with higher qualification, etc. Thanks to the neural tangent kernel theory, these results greatly improve our understanding of the generalization behavior of training wide neural networks. A novel technical contribution, the analytic functional argument, might be of independent interest.

2506.01052 2026-06-09 cs.LG math.OC stat.ML 版本更新

A Robust $\widetilde{\mathcal{O}}(1/\sqrt{T})$ Rate for Unprojected TD Learning with Linear Function Approximation

线性函数逼近的无投影TD学习的鲁棒 $\widetilde{\mathcal{O}}(1/\sqrt{T})$ 收敛率

Wei-Cheng Lee, Francesco Orabona

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文针对线性函数逼近的时序差分学习,在无投影条件下证明了期望收敛率为 $\widetilde{\mathcal{O}}(\|\theta^*\|^2_2/\sqrt{T})$,仅需对学习率进行轻微的对数修正,无需额外正则条件。

详情
AI中文摘要

我们研究了线性函数逼近的时序差分(TD)学习的有限时间收敛性质,这是强化学习的基石。我们关注所谓的“鲁棒”设置,其中收敛保证不依赖于势函数的最小曲率。虽然先前的工作已经建立了该设置下的收敛保证,但这些结果通常依赖于每次迭代被投影到有界集上的人为假设。Bhandari 等人(COLT'18)将去除这一条件留作开放问题,并假设需要额外的“正则条件”。在本文中,我们表明,即使存在马尔可夫噪声,简单的无投影 TD(0) 也能以期望的 $\widetilde{\mathcal{O}}\left(\frac{\|\theta^*\|^2_2}{\sqrt{T}}\right)$ 速率收敛。我们不需要额外的正则条件,仅需对学习率进行轻微的多对数修正。我们的分析揭示了 TD 更新的一种新的自界性质,并利用它来保证迭代的有界性。

英文摘要

We investigate the finite-time convergence properties of Temporal Difference (TD) learning with linear function approximation, a cornerstone of reinforcement learning. We are interested in the so-called "robust" setting, where the convergence guarantee does not depend on the potential function's minimal curvature. While prior work has established convergence guarantees in this setting, these results typically rely on the artificial assumption that each iterate is projected onto a bounded set. Removing such a condition was left as an open problem by Bhandari et al. (COLT'18), hypothesizing the need for additional "regularity conditions". In this paper, we show that the simple unprojected TD(0) converges with a rate of $\widetilde{\mathcal{O}}\left(\frac{\|\theta^*\|^2_2}{\sqrt{T}}\right)$ in expectation, even in the presence of Markovian noise. We do not require an additional regularity condition, but only a minor polylog correction to the learning rate. Our analysis reveals a novel self-bounding property of the TD updates and exploits it to guarantee bounded iterates.
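
For readers unfamiliar with the update being analyzed, unprojected TD(0) with linear function approximation can be sketched as below. This is a minimal sketch under assumptions: the two-state chain, the one-hot features and the plain 1/sqrt(t) step size are illustrative choices, and the paper's polylog correction to the learning rate is not reproduced.

```python
import numpy as np

def td0_update(theta, phi_s, reward, phi_next, gamma, alpha):
    """One unprojected TD(0) step for the linear value estimate phi(s) @ theta."""
    td_error = reward + gamma * phi_next @ theta - phi_s @ theta
    return theta + alpha * td_error * phi_s

def run_td0(features, transition, rewards, gamma=0.9, num_steps=2000, seed=0):
    """Run TD(0) along a single Markovian trajectory; no projection is applied."""
    rng = np.random.default_rng(seed)
    theta = np.zeros(features.shape[1])
    state = 0
    for t in range(1, num_steps + 1):
        next_state = rng.choice(len(rewards), p=transition[state])
        alpha = 1.0 / np.sqrt(t)  # illustrative schedule (assumption)
        theta = td0_update(theta, features[state], rewards[state],
                           features[next_state], gamma, alpha)
        state = next_state
    return theta

# Toy two-state chain with one-hot (tabular) features.
features = np.eye(2)
transition = np.array([[0.5, 0.5], [0.2, 0.8]])
rewards = np.array([1.0, 0.0])
theta = run_td0(features, transition, rewards)
```

In this tabular toy case each step is a convex combination of the old value and a bootstrapped target, so the iterates stay within max|r|/(1 - gamma) = 10 without any projection. The paper's contribution concerns the much harder general linear case.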

2506.11336 2026-06-09 cs.LG math.OC 版本更新

The Sample Complexity of Parameter-Free Stochastic Convex Optimization

无参数随机凸优化的样本复杂度

Jared Lawrence, Ari Kalinsky, Hannah Bradfield, Yair Carmon, Oliver Hinder

发表机构 * Department of Industrial Engineering, University of Pittsburgh(匹兹堡大学工业工程系) Department of Computer Science, Tel Aviv University(特拉维夫大学计算机科学系)

AI总结 研究未知问题参数(如到最优点的距离和Lipschitz常数)下随机凸优化的样本复杂度,提出可靠模型选择方法和正则化方法,实现最优样本复杂度并避免过拟合。

Comments Accepted for publication in JMLR

详情
AI中文摘要

我们研究当问题参数(如到最优点的距离和Lipschitz常数)未知时随机凸优化的样本复杂度。我们采用两种策略。首先,我们开发了一种可靠的模型选择方法,避免对验证集的过拟合。该方法允许我们通用地调整随机优化方法的学习率,以匹配参数已知时的最优样本复杂度(至多相差log log因子)。其次,我们开发了一种基于正则化的方法,专门针对仅到最优点的距离未知的情况。具体而言,它使用范数正则化的经验风险最小化来估计到最优点的距离(精确到常数因子),使得参数已知的随机优化方法能够达到最优样本复杂度。该方法对未知的到最优点距离具有完美的适应性,展示了无参数随机凸优化的样本复杂度与计算复杂度之间的分离。结合这两种方法,我们可以同时适应多种问题结构。两组小样本学习实验(在CIFAR-10上微调CLIP模型,以及对Gemini进行提示工程以计数形状)表明,我们的可靠模型选择方法有助于减轻对小验证集的过拟合。

英文摘要

We study the sample complexity of stochastic convex optimization when problem parameters such as the distance to optimality and the Lipschitz constant are unknown. We pursue two strategies. First, we develop a reliable model selection method that avoids overfitting to the validation set. This method allows us to generically tune the learning rate of stochastic optimization methods to match the optimal known-parameter sample complexity up to log log factors. Second, we develop a regularization-based method that is specialized to the case that only the distance to optimality is unknown. More specifically, it uses norm-regularized empirical risk minimization to estimate the distance to optimality to within a constant factor, allowing known-parameter stochastic optimization methods to achieve optimal sample complexity. This method provides perfect adaptability to unknown distance to optimality, demonstrating a separation between the sample and computational complexity of parameter-free stochastic convex optimization. Combining these two methods allows us to simultaneously adapt to multiple problem structures. Experiments performing few-shot learning on CIFAR-10 by fine-tuning CLIP models and prompt engineering Gemini to count shapes indicate that our reliable model selection method can help mitigate overfitting to small validation sets.

2507.01598 2026-06-09 cs.LG 版本更新

Convergence Bound and Critical Batch Size of Muon Optimizer

Muon优化器的收敛界与临界批量大小

Naoki Sato, Hiroki Naganuma, Hideaki Iiduka

发表机构 * Meiji University(明治大学) Université de Montréal(蒙特利尔大学) Mila(魁北克人工智能研究所)

AI总结 本文理论分析了Muon优化器在四种实际设置下的收敛性,证明权重衰减确保参数和梯度范数有界,并推导了临界批量大小的下界,揭示了超参数β和λ对其缩放的影响。

详情
AI中文摘要

Muon是一种最近提出的优化器,利用神经网络参数的固有矩阵结构,展现了强大的实证性能,表明其有潜力成为AdamW等标准优化器的后继者。本文提供理论分析以支持其实践成功。我们在四种实际设置下给出了Muon的收敛证明,系统考察了其有无Nesterov动量和权重衰减时的行为。然后我们证明,添加权重衰减可确保参数和梯度范数几乎必然有界——无需依赖通常施加的有界梯度假设——并阐明了权重衰减系数与学习率之间的相互作用。最后,我们推导了Muon临界批量大小的下界——该批量大小最小化训练的随机一阶预言机(SFO)复杂度。由于所得公式涉及不可直接观测的问题相关量(梯度方差、目标精度、有效秩),它不能绝对预测临界批量大小;而是揭示了超参数$\beta$(动量)和$\lambda$(权重衰减)如何控制该值的定性缩放。我们的实验在包括图像分类和语言建模在内的任务上验证了这些依赖于超参数的预测。

英文摘要

Muon, a recently proposed optimizer that leverages the inherent matrix structure of neural network parameters, has demonstrated strong empirical performance, indicating its potential as a successor to standard optimizers such as AdamW. This paper presents theoretical analysis to support its practical success. We provide convergence proofs for Muon across four practical settings, systematically examining its behavior with and without the inclusion of Nesterov momentum and weight decay. We then demonstrate that the addition of weight decay ensures almost-sure boundedness of the parameter and gradient norms, without relying on the commonly imposed bounded-gradient assumption, and clarify the interplay between the weight decay coefficient and the learning rate. Finally, we derive a lower bound on the critical batch size for Muon, that is, the batch size that minimizes the stochastic first-order oracle (SFO) complexity of training. Because the resulting formula involves problem-dependent quantities that are not directly observable (gradient variance, target precision, effective rank), it does not predict the critical batch size in absolute terms; rather, it reveals how the hyperparameters $\beta$ (momentum) and $\lambda$ (weight decay) govern the qualitative scaling of this value. Our experiments validate these hyperparameter-dependent predictions across workloads including image classification and language modeling.
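
The boundedness claim has a simple intuition that can be checked numerically. The sketch below assumes a decoupled weight-decay form, W <- (1 - lr * wd) * W - lr * O, where O is the orthogonalized momentum with spectral norm 1. The abstract does not spell out the exact update, so this form, the exact SVD orthogonalization and the function names are assumptions. Under it, the spectral norm of W never exceeds max(||W_0||, 1/wd) whenever lr * wd <= 1.

```python
import numpy as np

def orthogonalize(m):
    """Exact polar factor via SVD; practical Muon code usually approximates it."""
    u, _, vt = np.linalg.svd(m, full_matrices=False)
    return u @ vt

def muon_step(w, buf, grad, lr, beta, weight_decay):
    """Momentum, orthogonalization, then decoupled weight decay (assumed form)."""
    buf = beta * buf + grad
    w = (1.0 - lr * weight_decay) * w - lr * orthogonalize(buf)
    return w, buf

LR, BETA, WEIGHT_DECAY = 0.1, 0.9, 0.5

rng = np.random.default_rng(0)
w = 3.0 * rng.standard_normal((4, 3))
buf = np.zeros_like(w)
bound = max(np.linalg.norm(w, 2), 1.0 / WEIGHT_DECAY)
norms = []
for _ in range(200):
    grad = w + rng.standard_normal(w.shape)  # noisy gradient of a toy quadratic
    w, buf = muon_step(w, buf, grad, LR, BETA, WEIGHT_DECAY)
    norms.append(np.linalg.norm(w, 2))
```

The bound follows from ||W_{t+1}|| <= (1 - lr * wd) * ||W_t|| + lr, whose fixed point is 1/wd. No assumption on the size of the gradients is needed, which mirrors the role weight decay plays in the abstract.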

2511.02003 2026-06-09 cs.LG cond-mat.dis-nn hep-ph 版本更新

Bulk-boundary decomposition of neural networks

神经网络的体-边界分解

Donghee Lee, Hye-Sung Lee, Jaeok Yi

发表机构 * Department of Physics, Korea Advanced Institute of Science and Technology(韩国科学技术院物理系)

AI总结 提出体-边界分解框架,将神经网络训练动力学分解为数据无关的体项和数据相关的边界项,揭示深层网络的局部齐次结构并推导能量连续性方程。

Comments 13 pages, 3 figures

详情
AI中文摘要

我们提出体-边界分解作为理解深度神经网络训练动力学的新框架。从随机梯度下降公式出发,我们证明拉格朗日量可以重组为数据无关的体项和数据相关的边界项。体项捕捉由网络架构和激活函数设定的内在动力学,而边界项反映来自输入和输出层训练样本的随机相互作用。这种分解揭示了深层网络背后的局部和齐次结构。作为局部性和齐次性的物理结果,我们推导了深度神经网络内的能量连续性方程。

英文摘要

We present the bulk-boundary decomposition as a new framework for understanding the training dynamics of deep neural networks. Starting from the stochastic gradient descent formulation, we show that the Lagrangian can be reorganized into a data-independent bulk term and a data-dependent boundary term. The bulk captures the intrinsic dynamics set by network architecture and activation functions, while the boundary reflects stochastic interactions from training samples at the input and output layers. This decomposition exposes the local and homogeneous structure underlying deep networks. As a physical consequence of locality and homogeneity, we derive the energy continuity equation within a deep neural network.

2512.01930 2026-06-09 cs.LG cs.AI 版本更新

SVRG and Beyond via Posterior Correction

SVRG及其后验校正扩展

Nico Daheim, Thomas Möllenhoff, Ming Liang Ang, Mohammad Emtiyaz Khan

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 本文揭示SVRG与后验校正方法的深层联系,证明SVRG是各向同性高斯后验校正的特例,并通过灵活指数族后验自动导出牛顿型和Adam型新变体。

Comments ICML 2026 (oral)

详情
AI中文摘要

随机方差缩减梯度(SVRG)及其变体旨在通过使用梯度校正来加速训练。这些方法最初提出于十多年前,但从未在任何基本层面上与任何贝叶斯方法联系起来。在这里,我们填补了这一空白,并推导出SVRG与最近提出的称为“后验校正”的贝叶斯方法之间令人惊讶的新联系。我们的主要贡献是证明SVRG可以恢复为各向同性高斯后验校正的特例。通过使用更灵活的指数族后验,自动获得了SVRG的新扩展。我们通过使用高斯族推导了两个这样的新扩展:一种具有新颖海森校正的牛顿型变体,以及一种可扩展到大规模问题的Adam型扩展。我们的工作是首次将SVRG与贝叶斯联系起来,并利用它来加速训练。

英文摘要

Stochastic Variance Reduced Gradient (SVRG) and its variants aim to speed up training by using gradient corrections. Originally proposed over a decade ago, these methods have never been connected to any Bayesian method at a fundamental level. Here, we fill this gap and derive surprising new connections of SVRG to a recently proposed Bayesian method called 'posterior correction'. Our main contribution is to show that SVRG can be recovered as a special case of posterior correction over isotropic-Gaussian posteriors. Novel extensions of SVRG are automatically obtained by using more flexible exponential-family posteriors. We derive two new such extensions by using Gaussian families: a Newton-like variant with novel Hessian corrections, and an Adam-like extension that scales to large problems. Our work is the first to connect SVRG to Bayes and use it to speed up training.
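
As background for the "gradient corrections" mentioned above, here is a minimal sketch of standard SVRG on a least-squares problem. It illustrates only the classical method, not the posterior-correction derivation or the new variants, and the data, step size and epoch count are arbitrary illustrative choices.

```python
import numpy as np

def sample_grad(w, x_i, y_i):
    """Gradient of the per-sample loss 0.5 * (x_i @ w - y_i) ** 2."""
    return (x_i @ w - y_i) * x_i

def full_grad(w, x, y):
    """Gradient of the loss averaged over all samples."""
    return x.T @ (x @ w - y) / len(y)

def svrg_grad(w, w_snap, mu_snap, x_i, y_i):
    """Sample gradient, corrected by the same sample's gradient at the snapshot."""
    return sample_grad(w, x_i, y_i) - sample_grad(w_snap, x_i, y_i) + mu_snap

def svrg(x, y, lr=0.01, epochs=50, seed=0):
    rng = np.random.default_rng(seed)
    n, d = x.shape
    w = np.zeros(d)
    for _ in range(epochs):
        w_snap = w.copy()
        mu_snap = full_grad(w_snap, x, y)  # one full pass per epoch
        for i in rng.integers(0, n, size=n):
            w = w - lr * svrg_grad(w, w_snap, mu_snap, x[i], y[i])
    return w

rng = np.random.default_rng(1)
x = rng.standard_normal((50, 3))
w_true = np.array([1.0, -2.0, 0.5])
y = x @ w_true  # noiseless targets, so w_true is the minimizer
w_hat = svrg(x, y)
```

The corrected estimate is unbiased for the full gradient at any `w`, and its variance vanishes as `w` and the snapshot both approach the minimizer. That vanishing variance is what lets SVRG use a constant step size.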

2512.10656 2026-06-09 cs.LG 版本更新

Token Sample Complexity of Attention

注意力的标记采样复杂度

Léa Bohbot, Cyril Letrouit, Gabriel Peyré, François-Xavier Vialard

发表机构 * CNRS, ENS Paris, France(法国国家科学研究中心、巴黎高等师范学院)

AI总结 研究注意力在极端序列长度下的收敛行为,提出标记样本复杂度概念,分析注意力映射的一致收敛和变换分布矩的收敛速率,实验验证预测结果。

详情
AI中文摘要

随着大语言模型上下文窗口的不断扩展,有必要刻画注意力在极长序列长度下的行为。我们引入标记样本复杂度:在$n$个标记上计算的注意力收敛到其无限标记极限的速率。我们在两个层面上估计有限$n$下的收敛界:注意力映射的逐点一致收敛,以及变换后标记分布的矩的收敛。对于紧支撑(更一般地,亚高斯)分布,我们的第一个结果表明,注意力映射在半径为$R$的球上以速率$C(R)/\sqrt{n}$一致收敛,其中$C(R)$随$R$指数增长。对于大的$R$,此估计失去实用价值,我们的第二个结果通过建立变换后分布(注意力层的标记输出)矩的收敛速率来解决这一问题。在该情况下,速率为$C'(R)/n^{\beta}$,其中$\beta<\tfrac{1}{2}$,且$C'(R)$多项式地依赖于分布支撑的大小。指数$\beta$取决于注意力几何和标记分布的谱性质。我们还研究了注意力参数趋于无穷大、softmax趋近于hardmax的情形,并在此设定下建立了对数收敛速率。合成和真实数据上的实验支持我们的预测,并表明预测的减慢在下游准确率中得到反映。

英文摘要

As context windows in large language models continue to expand, it is essential to characterize how attention behaves at extreme sequence lengths. We introduce token sample complexity: the rate at which attention computed on $n$ tokens converges to its infinite-token limit. We estimate finite-$n$ convergence bounds at two levels: pointwise uniform convergence of the attention map, and convergence of moments for the transformed token distribution. For compactly supported (and more generally sub-Gaussian) distributions, our first result shows that the attention map converges uniformly on a ball of radius $R$ at rate $C(R)/\sqrt{n}$, where $C(R)$ grows exponentially with $R$. For large $R$, this estimate loses practical value, and our second result addresses this issue by establishing convergence rates for the moments of the transformed distribution (the token output of the attention layer). In this case, the rate is $C'(R)/n^{\beta}$ with $\beta<\tfrac{1}{2}$, and $C'(R)$ depends polynomially on the size of the support of the distribution. The exponent $\beta$ depends on the attention geometry and the spectral properties of the token distribution. We also examine the regime in which the attention parameter tends to infinity and the softmax approaches a hardmax, and in this setting, we establish a logarithmic rate of convergence. Experiments on synthetic and real data support our predictions and show that the predicted slowdown is reflected in downstream accuracy.
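
The notion of an infinite-token limit can be made concrete in a toy case. The sketch below assumes identity query, key and value maps and standard Gaussian tokens (both are illustrative assumptions, not the paper's setting). In that case the limit of the softmax-attention output at a query q is E[y exp(q·y)] / E[exp(q·y)] = q, because exponentially tilting N(0, I) by q shifts its mean to q.

```python
import numpy as np

def attention_output(query, tokens):
    """Softmax attention of one query over the tokens, identity Q/K/V maps."""
    logits = tokens @ query
    weights = np.exp(logits - logits.max())  # subtract max for stability
    weights /= weights.sum()
    return weights @ tokens

rng = np.random.default_rng(0)
query = np.array([0.5, -0.25])
errors = {}
for n in (100, 10_000, 1_000_000):
    tokens = rng.standard_normal((n, 2))
    # For N(0, I) tokens the infinite-token limit equals the query itself.
    errors[n] = float(np.linalg.norm(attention_output(query, tokens) - query))
```

For a fixed query the error behaves like a Monte Carlo ratio estimate and shrinks on the order of 1/sqrt(n). The constant grows quickly with the norm of the query, consistent with the exponentially growing $C(R)$ in the uniform bound.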

2601.18840 2026-06-09 cs.LG cs.SY eess.SY 版本更新

Bellman Residual Minimization for Control: Geometry, Stationarity, and Convergence

用于控制的贝尔曼残差最小化:几何、平稳性与收敛性

Donghwan Lee, Hyukjun Yang

发表机构 * School of Electrical Engineering(电气工程学院) Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 本文研究了控制中贝尔曼残差最小化方法的几何特性、平稳性及收敛性,为其在策略优化中的应用建立基础理论结果。

详情
AI中文摘要

马尔可夫决策问题最常通过动态规划求解。另一种方法是贝尔曼残差最小化,它直接最小化平方贝尔曼残差目标函数。然而,与动态规划相比,这种方法受到的关注相对较少,主要是因为它在实践中往往效率较低,且更难扩展到强化学习等无模型设置。尽管如此,贝尔曼残差最小化具有若干值得研究的优点,例如在对价值函数使用函数逼近时收敛更稳定。虽然用于策略评估的贝尔曼残差方法已被广泛研究,但用于策略优化(控制任务)的方法却鲜有探讨。本文为面向策略优化的控制贝尔曼残差最小化建立了基础性结果。

英文摘要

Markov decision problems are most commonly solved via dynamic programming. Another approach is Bellman residual minimization, which directly minimizes the squared Bellman residual objective function. However, compared to dynamic programming, this approach has received relatively less attention, mainly because it is often less efficient in practice and can be more difficult to extend to model-free settings such as reinforcement learning. Nonetheless, Bellman residual minimization has several advantages that make it worth investigating, such as more stable convergence with function approximation for value functions. While Bellman residual methods for policy evaluation have been widely studied, methods for policy optimization (control tasks) have been scarcely explored. In this paper, we establish foundational results for the control Bellman residual minimization for policy optimization.
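
To make the objective concrete, the sketch below evaluates a squared Bellman residual for control on a small tabular MDP, using the Bellman optimality operator. The unweighted sum over state-action pairs and the toy MDP are assumptions made for illustration, since the abstract does not state the weighting or parameterization the paper uses.

```python
import numpy as np

def bellman_optimality(q, rewards, transition, gamma):
    """(TQ)(s, a) = r(s, a) + gamma * sum_s' P(s' | s, a) * max_a' Q(s', a')."""
    return rewards + gamma * transition @ q.max(axis=1)

def bellman_residual_objective(q, rewards, transition, gamma):
    """Half the squared norm of Q - TQ over all state-action pairs."""
    residual = q - bellman_optimality(q, rewards, transition, gamma)
    return 0.5 * float(np.sum(residual ** 2))

# Toy MDP: 2 states, 2 actions; transition[s, a, s'] = P(s' | s, a).
rewards = np.array([[1.0, 0.0], [0.0, 2.0]])
transition = np.array([
    [[0.9, 0.1], [0.2, 0.8]],
    [[0.5, 0.5], [0.1, 0.9]],
])
gamma = 0.9

# Value iteration (dynamic programming) gives the fixed point Q* of T.
q_star = np.zeros((2, 2))
for _ in range(1000):
    q_star = bellman_optimality(q_star, rewards, transition, gamma)

objective_at_zero = bellman_residual_objective(np.zeros((2, 2)), rewards, transition, gamma)
objective_at_star = bellman_residual_objective(q_star, rewards, transition, gamma)
```

The objective vanishes exactly at the fixed point found by dynamic programming. The max inside the operator makes the objective non-smooth and non-convex in Q, which is part of why its geometry and stationary points need a dedicated analysis.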

2602.05600 2026-06-09 cs.LG 版本更新

On the Superlinear Relationship between SGD Noise Covariance and Loss Landscape Curvature

关于SGD噪声协方差与损失景观曲率之间的超线性关系

Yikuan Zhang, Ning Yang, Yuhai Tu

发表机构 * School of Physics, Peking University(北京大学物理学院) Peking University Chengdu Academy for Advanced Interdisciplinary Biotechnologies(北京大学成都前沿交叉生物技术研究院) Flatiron Institute(Flatiron研究所)

AI总结 本文发现SGD噪声协方差与Hessian矩阵之间存在超线性关系,而非简单的正比关系,并通过实验验证了层间幂律标度。

Comments 8 pages, 15 figures

详情
AI中文摘要

随机梯度下降(SGD)引入各向异性噪声,该噪声与损失景观的局部曲率相关,从而将优化偏向平坦最小值。先前的工作通常假设对于负对数似然损失,Fisher信息矩阵与Hessian矩阵等价,从而声称SGD噪声协方差$\mathbf{C}$与Hessian矩阵$\mathbf{H}$成正比。我们证明该假设仅在深度神经网络中通常违反的严格条件下成立。利用最近发现的Activity-Weight对偶性,我们找到了一个更一般的、与具体损失形式无关的关系,表明$\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$,其中$\mathbf{h}_p$表示每个样本的Hessian矩阵,且$\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$。因此,$\mathbf{C}$和$\mathbf{H}$近似可交换,而非精确相等。我们进一步发现,在所分析的全连接层内,它们的对角元素遵循每层经验幂律$C_{ii} \propto H_{ii}^{\gamma}$,其中层依赖的拟合指数满足$1 \leq \gamma \leq 2$。跨数据集、架构和损失函数的实验支持了所得的逐层界限,为深度学习中噪声-曲率关系提供了统一的刻画。

英文摘要

Stochastic Gradient Descent (SGD) introduces anisotropic noise that is correlated with the local curvature of the loss landscape, thereby biasing optimization toward flat minima. Prior work often assumes an equivalence between the Fisher Information Matrix and the Hessian for negative log-likelihood losses, leading to the claim that the SGD noise covariance $\mathbf{C}$ is proportional to the Hessian $\mathbf{H}$. We show that this assumption holds only under restrictive conditions that are typically violated in deep neural networks. Using the recently discovered Activity-Weight Duality, we find a more general relationship agnostic to the specific loss formulation, showing that $\mathbf{C} \propto \mathbb{E}_p[\mathbf{h}_p^2]$, where $\mathbf{h}_p$ denotes the per-sample Hessian with $\mathbf{H} = \mathbb{E}_p[\mathbf{h}_p]$. As a consequence, $\mathbf{C}$ and $\mathbf{H}$ commute approximately rather than coincide exactly. We further find that, within the analyzed fully connected layers, their diagonal elements follow per-layer empirical power laws $C_{ii} \propto H_{ii}^{\gamma}$, with layer-dependent fitted exponents bounded by $1 \leq \gamma \leq 2$. Experiments across datasets, architectures, and loss functions support the resulting layerwise bounds, providing a unified characterization of the noise-curvature relationship in deep learning.

2605.17609 2026-06-09 cs.LG 版本更新

Adaptive Generate-Rank-Verify: Inference-Time Search with Costly Verification

自适应生成-排序-验证:具有高成本验证的推理时间搜索

Shaddin Dughmi, Mahdi Haghifam, Yusuf Hakan Kalayci

发表机构 * University of Southern California(南加州大学) Northwestern University(西北大学) University of Chicago(芝加哥大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) Simons Institute for the Theory of Computing(Simons计算理论研究所) Data Science Institute at the University of Chicago(芝加哥大学数据科学研究所)

AI总结 本文提出了一种自适应生成-排序-验证方法,通过在未知分布下自适应地生成和验证候选答案,以在保证成本的前提下找到正例,同时通过理论分析和实验验证了该方法在数学推理和编程竞赛中的有效性。

Comments 33 Pages, 6 Figures, 4 Tables. Changes compared to V1: updated the related work section

详情
AI中文摘要

许多推理时语言模型管道结合了低成本奖励信号和高成本验证器,例如数学推理中的精确答案检查或代码生成中的隐藏测试执行。我们从学习理论的视角将这一设置形式化为生成式主动搜索:一个成本敏感的首个正例搜索问题,其中策略自适应地从未知分布中采样候选者,观察低成本评分,并为验证器标签付费,直到找到正例。对于固定的提示,生成器和奖励模型诱导出两个未知对象:奖励分数上的分布和以评分为条件的成功函数。当这些量已知时,我们使用动态规划方法刻画分布感知的最优策略。在评分分布和成功函数都未知的现实设置中,我们提出ADAP,一种逐层(shellwise)自适应的生成-排序-验证算法,它逐步增加采样的响应数量和对排名靠前者的验证数量。在单调性假设下,即更高的奖励分数通过验证的可能性不会更低,我们证明ADAP的期望成本在分布感知最优值的常数因子之内。我们以基于中心星数的学习理论下界补充这一结果,表明对评分-标签关系的结构性假设是必要的。在数学推理和竞赛编程上的实验验证了该方法相对于固定非自适应策略和难度自适应基线的预测优势。

英文摘要

Many inference-time language-model pipelines combine a cheap reward signal with an expensive verifier, such as exact answer checking in mathematical reasoning or hidden-test execution in code generation. We formalize this setting using a learning-theoretic lens as generative active search: a cost-sensitive first-positive search problem in which a policy adaptively samples candidates from an unknown distribution, observes cheap scores, and pays for verifier labels until it finds a positive example. For a fixed prompt, the generator and reward model induce two unknown objects: a distribution over reward scores and a score-conditioned success function. When these quantities are known, we characterize the distribution-aware optimal policy using a dynamic programming approach. In the realistic and practical setting where both the score distribution and success function are unknown, we propose ADAP, a shellwise adaptive generate-rank-verify algorithm that progressively increases the number of sampled responses and top-ranked verifications. Under the monotonicity assumption that higher reward scores are no less likely to pass verification, we show that ADAP achieves expected cost within a constant factor of the distribution-aware optimum. We complement this result with learning-theoretic lower bounds, based on a centered star number, showing that structural assumptions on the score-label relationship are necessary. Experiments on mathematical reasoning and competitive programming validate the predicted advantage over both fixed non-adaptive policies and difficulty-adaptive baselines.
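
The generate-rank-verify pattern can be illustrated with a generic doubling loop. This is not ADAP: the abstract does not specify its shell schedule, so doubling both the sample count and the verification budget each round is an assumption, and `generate_rank_verify` with its callback arguments is a hypothetical interface.

```python
import random

def generate_rank_verify(generate, score, verify, max_rounds=10):
    """Doubling search: sample n candidates, verify the top-k by score, repeat.

    Returns (candidate, samples_drawn, verifier_calls); candidate is None if
    no verified positive is found within max_rounds.
    """
    n, k = 2, 1
    samples_drawn = verifier_calls = 0
    for _ in range(max_rounds):
        candidates = [generate() for _ in range(n)]
        samples_drawn += n
        ranked = sorted(candidates, key=score, reverse=True)
        for candidate in ranked[:k]:
            verifier_calls += 1  # each verification is the expensive step
            if verify(candidate):
                return candidate, samples_drawn, verifier_calls
        n, k = 2 * n, 2 * k  # assumed schedule: double both budgets
    return None, samples_drawn, verifier_calls

# Toy instance: candidates are numbers in [0, 1), the cheap score is the
# number itself, and the costly verifier accepts values above 0.9.
rng = random.Random(0)
found, samples_drawn, verifier_calls = generate_rank_verify(
    generate=rng.random,
    score=lambda c: c,
    verify=lambda c: c > 0.9,
)
```

In this toy instance the score is perfectly monotone in the success probability, so verifying only the top-ranked candidates wastes few verifier calls. The monotonicity assumption in the abstract plays the analogous role for noisy reward models.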

2606.02351 2026-06-09 cs.LG stat.ML 版本更新

Local Preferential Bayesian Optimization

局部偏好贝叶斯优化

Johanna Menn, Miriam Kober, Paul Brunzema, David Stenger, Sebastian Trimpe

发表机构 * Institute for Data Science in Mechanical Engineering, RWTH Aachen University(亚琛工业大学机械工程数据科学研究所) Department of Clinical Research, University of Bern(伯尔尼大学临床研究系) Center for Reproducible Science and Research Synthesis, University of Zurich(苏黎世大学可重复科学与研究综合中心) aiXopt GmbH(aiXopt公司)

AI总结 针对偏好贝叶斯优化在高维问题中效率低的问题,提出利用信任域和导数信息的局部偏好贝叶斯优化方法,显著降低累积遗憾。

详情
AI中文摘要

贝叶斯优化(BO)是一种流行且有效的调优昂贵、有噪声实验的方法,但需要制定明确的目标函数。偏好贝叶斯优化(PBO)通过从成对的人类反馈中学习来消除这一要求,然而现有方法由于其全局搜索策略,难以有效优化中低维以外的问题。我们通过开发一系列局部PBO方法来解决这一限制,这些方法将高维BO的关键思想迁移到偏好设置中。具体而言,我们引入了局部PBO方法,将信任域和导数信息局部搜索适应于成对偏好反馈,其中后者利用了拉普拉斯近似高斯过程后验的一阶和二阶导数。我们在GP样本路径、标准优化基准函数和策略搜索任务上的基准测试表明,局部PBO方法在具有陡峭最优值的高维和复杂景观中特别有效。与基于全局偏好的基线相比,它们可以显著减少累积遗憾,使其对于现实世界中基于偏好的优化任务(如策略搜索)特别有用。

英文摘要

Bayesian optimization (BO) is a popular and effective approach for tuning expensive, noisy experiments, but requires the formulation of an explicit objective function. Preferential BO (PBO) removes this requirement by learning from pairwise human feedback, yet existing methods struggle to efficiently optimize beyond low- and medium-dimensional problems due to their global search approaches. We address this limitation by developing a family of local PBO methods that transfer key ideas from high-dimensional BO to the preferential setting. In particular, we introduce local PBO methods which adapt trust-region and derivative-informed local search to pairwise preference feedback, where the latter exploits first- and second-order derivatives of the Laplace-approximated GP posterior. Our benchmark on GP sample paths, standard optimization benchmark functions, and policy-search tasks shows that local PBO methods are especially effective in high-dimensional and complex landscapes with steep optima. Compared with global preference-based baselines, they can substantially reduce cumulative regret, making them particularly useful for real-world preference-based optimization tasks such as policy search.

2412.16457 2026-06-09 stat.ML cs.DS cs.LG math.PR math.ST stat.TH 版本更新

Robust Random Graph Matching in Dense Graphs via an Approximate Message Passing Type Algorithm

稠密图中的鲁棒随机图匹配:基于近似消息传递类型算法

Zhangsong Li

发表机构 * Peking University(北京大学)

AI总结 针对带潜在顶点对应的相关高斯Wigner矩阵对,提出一种近似消息传递迭代算法,在对抗性扰动下实现多项式时间匹配恢复,扰动规模可达n^{1-o(1)}。

Comments 46 pages; accepted by IEEE Trans. Inf. Theory

详情
AI中文摘要

本文关注一对具有潜在顶点对应的相关高斯Wigner矩阵的匹配恢复问题。我们特别关注该问题的鲁棒版本,其中观测为扰动输入$(A+E,B+F)$,$(A,B)$是一对相关高斯Wigner矩阵,$E,F$是分别支撑在$A,B$的未知$\epsilon n \times \epsilon n$主子矩阵上的对抗性选择矩阵。我们提出一种近似消息传递(AMP)类型迭代算法,只要$(A,B)$之间的相关性$\rho$为非零常数且$\epsilon = o\big( \tfrac{1}{(\log n)^{20}} \big)$,该算法就能在多项式时间内成功。与标准AMP的关键区别在于,迭代中引入了时间依赖的矩阵乘法步骤,该步骤同时扩大特征维度并在迭代过程中抵消相关性。我们结果的主要方法输入来自\cite{DL22+, DL23+}中提出的迭代随机图匹配算法和\cite{IS24+}中提出的谱预处理过程。据我们所知,我们的算法是首个在任意$n^{1-o(1)}$大小的对抗性扰动下具有鲁棒性的高效随机图匹配类型算法。

英文摘要

In this paper, we focus on the matching recovery problem between a pair of correlated Gaussian Wigner matrices with a latent vertex correspondence. We are particularly interested in a robust version of this problem such that our observation is a perturbed input $(A+E,B+F)$ where $(A,B)$ is a pair of correlated Gaussian Wigner matrices and $E,F$ are adversarially chosen matrices supported on an unknown $\epsilon n \times \epsilon n$ principal minor of $A,B$, respectively. We propose an approximate message passing (AMP) type iterative algorithm that succeeds in polynomial time as long as the correlation $\rho$ between $(A,B)$ is a non-vanishing constant and $\epsilon = o\big( \tfrac{1}{(\log n)^{20}} \big)$. A key distinction from standard AMP is the introduction of a time-dependent matrix multiplication step within the iteration, which simultaneously enlarges the feature dimension and cancels the correlation during the iteration. The main methodological inputs for our result are the iterative random graph matching algorithm proposed in \cite{DL22+, DL23+} and the spectral preprocessing procedure proposed in \cite{IS24+}. To the best of our knowledge, our algorithm is the first efficient random graph matching type algorithm that is robust under any adversarial perturbations of $n^{1-o(1)}$ size.

2502.15131 2026-06-09 math.ST cs.LG stat.ME stat.ML stat.TH 版本更新

Optimal and Provable Calibration in High-Dimensional Binary Classification: Angular Calibration and Platt Scaling

高维二分类中的最优且可证明的校准:角度校准与Platt缩放

Yufan Li, Pragya Sur

发表机构 * Harvard University(哈佛大学)

AI总结 针对高维高斯特征下的线性二分类器,提出基于估计权重与真实权重夹角的角度校准方法,证明其可校准且唯一Bregman最优,并揭示Platt缩放在高维下收敛于该最优解。

详情
AI中文摘要

我们研究校准形如 $\sigma(\hat{w}^\top x)$ 的线性二分类器的基本问题,其中特征向量 $x$ 服从高斯分布,$\sigma$ 是链接函数,$\hat{w}$ 是真实线性权重 $w_\star$ 的估计量。通过与非信息性的 $\textit{机会分类器}$ 插值,我们构建了一个良好校准的预测器,其插值权重取决于估计量 $\hat{w}$ 与真实线性权重 $w_\star$ 之间的夹角 $\angle(\hat{w}, w_\star)$。我们证明,在样本量和特征量均以可比速率发散的高维机制下,这种角度校准方法可证明是良好校准的。夹角 $\angle(\hat{w}, w_\star)$ 可以一致地估计。此外,所得预测器是唯一 $\textit{Bregman最优}$ 的,即在合适的校准预测器类中最小化与真实标签分布的Bregman散度。我们的工作是首个在高维下同时满足校准和最优性可证明的校准策略。此外,我们识别了经典Platt缩放预测器收敛到我们的Bregman最优校准解的条件。因此,Platt缩放在高维下也继承了这些理想性质。

英文摘要

We study the fundamental problem of calibrating a linear binary classifier of the form $\sigma(\hat{w}^\top x)$, where the feature vector $x$ is Gaussian, $\sigma$ is a link function, and $\hat{w}$ is an estimator of the true linear weight $w_\star$. By interpolating with a noninformative $\textit{chance classifier}$, we construct a well-calibrated predictor whose interpolation weight depends on the angle $\angle(\hat{w}, w_\star)$ between the estimator $\hat{w}$ and the true linear weight $w_\star$. We establish that this angular calibration approach is provably well-calibrated in a high-dimensional regime where the number of samples and features both diverge, at a comparable rate. The angle $\angle(\hat{w}, w_\star)$ can be consistently estimated. Furthermore, the resulting predictor is uniquely $\textit{Bregman-optimal}$, minimizing the Bregman divergence to the true label distribution within a suitable class of calibrated predictors. Our work is the first to provide a calibration strategy that satisfies both calibration and optimality properties provably in high dimensions. Additionally, we identify conditions under which a classical Platt-scaling predictor converges to our Bregman-optimal calibrated solution. Thus, Platt-scaling also inherits these desirable properties provably in high dimensions.
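
For reference, classical Platt scaling fits a slope and an intercept on top of a classifier's raw scores by minimizing the log loss. The sketch below covers only this baseline, since the angular calibration formula is not given in the abstract. It uses plain gradient descent on synthetic, deliberately overconfident scores, and the data and step-size rule are illustrative assumptions.

```python
import numpy as np

def log_loss(a, b, scores, labels):
    """Mean negative log-likelihood of sigmoid(a * score + b)."""
    z = a * scores + b
    return float(np.mean(np.logaddexp(0.0, z) - labels * z))

def fit_platt(scores, labels, num_steps=500):
    """Fit (a, b) by gradient descent on the convex log loss."""
    a, b = 1.0, 0.0
    # Step 1/L, where L upper-bounds the largest Hessian eigenvalue.
    lr = 1.0 / (0.25 * (np.mean(scores ** 2) + 1.0))
    for _ in range(num_steps):
        p = 1.0 / (1.0 + np.exp(-(a * scores + b)))
        grad_a = np.mean((p - labels) * scores)
        grad_b = np.mean(p - labels)
        a, b = a - lr * grad_a, b - lr * grad_b
    return a, b

# Synthetic overconfident classifier: its scores are 3x the true logits.
rng = np.random.default_rng(0)
true_logits = rng.standard_normal(5000)
labels = (rng.random(5000) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)
scores = 3.0 * true_logits

loss_before = log_loss(1.0, 0.0, scores, labels)
a_hat, b_hat = fit_platt(scores, labels)
loss_after = log_loss(a_hat, b_hat, scores, labels)
```

Here the fitted slope should land near 1/3, undoing the threefold overconfidence.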

2505.08908 2026-06-09 math.ST cs.LG econ.TH stat.TH 版本更新

Statistical Decision Theory with Counterfactual Loss

具有反事实损失的统计决策理论

Benedikt Koch, Kosuke Imai

发表机构 * Harvard University(哈佛大学)

AI总结 针对经典统计决策理论忽略反事实信息的问题,提出在强可忽略性下反事实风险可识别当且仅当损失函数在潜在结果上可加,并证明可加反事实损失能捕捉决策难度,通过符号线性逆规划无需数据即可判断可识别性。

详情
AI中文摘要

许多研究者应用经典统计决策理论来评估治疗选择和学习最优策略。然而,由于该框架仅依赖于所选行动下的实现结果而忽略反事实,它无法在单位层面评估决策相对于可行替代方案的质量,而这在某些设置中是一个重要要求。例如,在审前保释决策中,法官必须平衡释放后的犯罪预防与对被捕者施加不必要负担的风险。该框架中的一个核心挑战是可识别性:由于每个单位仅观测到一个潜在结果,反事实风险通常不可识别。我们证明,在强可忽略性下,反事实风险可识别当且仅当损失函数在潜在结果上可加。我们进一步证明,当存在两个以上的治疗选项时,可加反事实损失可以产生与基于标准损失不同的治疗推荐。我们表明,可加反事实损失不仅捕捉决策准确性,还捕捉决策难度,而标准损失仅反映准确性。最后,我们引入一个符号线性逆规划,无需数据即可确定给定的反事实损失是否产生可识别的风险。

英文摘要

Many researchers apply classical statistical decision theory to evaluate treatment choices and learn optimal policies. However, because this framework relies solely on realized outcomes under chosen actions and ignores counterfactuals, it cannot assess the quality of a decision relative to feasible alternatives at the unit level, which is an important requirement in some settings. For example, in pretrial bail decisions, a judge must balance crime prevention upon release against the risk of imposing unnecessary burdens on arrestees. A central challenge in this framework is identification: since only one potential outcome is observed per unit, counterfactual risk is typically not identifiable. We show that, under strong ignorability, counterfactual risk is identifiable if and only if the loss is additive in the potential outcomes. We further demonstrate that additive counterfactual losses can yield treatment recommendations that differ from those based on standard losses when more than two treatment options are available. We show that additive counterfactual losses capture not only decision accuracy but also decision difficulty, whereas standard losses reflect accuracy alone. Finally, we introduce a symbolic linear inverse program that determines whether a given counterfactual loss yields an identifiable risk, without requiring data.

2509.07779 2026-06-09 math.OC cs.LG cs.MA 版本更新

Decentralized Online Riemannian Optimization Beyond Hadamard Manifolds

超越哈达玛流形的去中心化在线黎曼优化

Emre Sahinoglu, Shahin Shahrampour

发表机构 * Department of Mechanical & Industrial Engineering at Northeastern University(东北大学机械与工业工程系)

AI总结 针对可能具有正曲率的流形,提出曲率感知的黎曼共识步骤,实现去中心化在线黎曼梯度下降算法,并证明O(√T)遗憾界。

详情
AI中文摘要

我们研究在可能具有正曲率的流形上的去中心化在线黎曼优化,超越了哈达玛流形设定。去中心化优化技术依赖于共识步骤,该步骤在欧几里得空间中因其线性性质而被充分理解。然而,在正曲率黎曼空间中,一个主要的技术挑战是测地距离可能不诱导全局凸结构。在这项工作中,我们首先分析了一个曲率感知的黎曼共识步骤,该步骤使得在哈达玛流形之外也能实现线性收敛。基于此步骤,我们为去中心化在线黎曼梯度下降算法建立了$O(\sqrt{T})$遗憾界。然后,我们研究了双点bandit反馈设置,其中我们使用平滑技术采用计算高效的梯度估计器,并通过平滑目标的次凸性分析证明了相同的$O(\sqrt{T})$遗憾界。

英文摘要

We study decentralized online Riemannian optimization over manifolds with possibly positive curvature, going beyond the Hadamard manifold setting. Decentralized optimization techniques rely on a consensus step that is well understood in Euclidean spaces because of their linearity. However, in positively curved Riemannian spaces, a main technical challenge is that geodesic distances may not induce a globally convex structure. In this work, we first analyze a curvature-aware Riemannian consensus step that enables a linear convergence beyond Hadamard manifolds. Building on this step, we establish a $O(\sqrt{T})$ regret bound for the decentralized online Riemannian gradient descent algorithm. Then, we investigate the two-point bandit feedback setup, where we employ computationally efficient gradient estimators using smoothing techniques, and we demonstrate the same $O(\sqrt{T})$ regret bound through the subconvexity analysis of smoothed objectives.

2510.12744 2026-06-09 stat.ML cs.LG math.ST stat.CO stat.ME stat.TH 版本更新

Dendrograms of Mixing Measures for Softmax-Gated Gaussian Mixture of Experts: Consistency Without Model Sweeps

混合测度的树状图用于Softmax门控高斯混合专家:无需模型扫描的一致性

Do Tien Hai, Trung Nguyen Mai, TrungTin Nguyen, Nhat Ho, Binh T. Nguyen, Christopher Drovandi

发表机构 * Faculty of Mathematics and Computer Science, University of Science, Ho Chi Minh City, Vietnam(越南胡志明市科学大学数学与计算机科学学院) Vietnam National University Ho Chi Minh City, Vietnam(越南胡志明市国家大学) Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam(越南胡志明市科学大学信息技术学院) ARC Centre of Excellence for the Mathematical Analysis of Cellular Systems(ARC细胞系统数学分析卓越中心) School of Mathematical Sciences, Queensland University of Technology, Brisbane City, Australia(昆士兰科技大学数学科学学院) Department of Statistics and Data Science, University of Texas at Austin, Austin, USA(德克萨斯大学奥斯汀分校统计与数据科学系)

AI总结 针对softmax门控高斯混合专家模型,提出基于Voronoi损失函数的统一统计框架,解决参数非可识别性和模型选择问题,并引入混合测度树状图实现一致且无需多尺寸训练的专家数选择。

Comments Do Tien Hai, Trung Nguyen Mai, and TrungTin Nguyen are co-first authors. In Proceedings of The 29th International Conference on Artificial Intelligence and Statistics, AISTATS 2026 Spotlight, Acceptance rate 2.5% over 2102 submissions

详情
AI中文摘要

我们为softmax门控高斯混合专家(SGMoE)开发了一个统一的统计框架,解决了参数估计和模型选择中三个长期存在的障碍:(i)门控参数在公共平移下的非可识别性,(ii)内在的门控-专家交互导致似然中耦合的微分关系,以及(iii)softmax诱导的条件密度中紧密的分子-分母耦合。我们的方法引入了与门划分几何对齐的Voronoi型损失函数,并建立了最大似然估计(MLE)的有限样本收敛速率。在过指定模型中,我们揭示了MLE收敛速率与刻画接近非可识别方向的多项式方程组可解性之间的联系。对于模型选择,我们将混合测度的树状图适配到SGMoE,产生一个一致且无需扫描的专家数选择器,在过拟合下达到逐点最优的参数速率,同时避免多尺寸训练。在合成数据上的模拟验证了理论,准确恢复了专家数量并达到了参数估计的预测速率,同时紧密逼近回归函数。在模型误指定下(例如,$\epsilon$-污染),树状图选择准则具有鲁棒性,恢复了真实的混合成分数量,而Akaike信息准则、贝叶斯信息准则和集成完全似然在样本量增大时倾向于过选择。在一个干旱响应性状的玉米蛋白质组学数据集上,我们的树状图引导的SGMoE选择了两个专家,揭示了清晰的混合测度层次结构,早期稳定了似然,并产生了可解释的基因型-表型图谱,在无需多尺寸训练的情况下优于标准准则。

英文摘要

We develop a unified statistical framework for softmax-gated Gaussian mixture of experts (SGMoE) that addresses three long-standing obstacles in parameter estimation and model selection: (i) non-identifiability of gating parameters up to common translations, (ii) intrinsic gate-expert interactions that induce coupled differential relations in the likelihood, and (iii) the tight numerator-denominator coupling in the softmax-induced conditional density. Our approach introduces Voronoi-type loss functions aligned with the gate-partition geometry and establishes finite-sample convergence rates for the maximum likelihood estimator (MLE). In over-specified models, we reveal a link between the MLE's convergence rate and the solvability of an associated system of polynomial equations characterizing near-nonidentifiable directions. For model selection, we adapt dendrograms of mixing measures to SGMoE, yielding a consistent, sweep-free selector of the number of experts that attains pointwise-optimal parameter rates under overfitting while avoiding multi-size training. Simulations on synthetic data corroborate the theory, accurately recovering the expert count and achieving the predicted rates for parameter estimation while closely approximating the regression function. Under model misspecification (e.g., $ε$-contamination), the dendrogram selection criterion is robust, recovering the true number of mixture components, while the Akaike information criterion, the Bayesian information criterion, and the integrated completed likelihood tend to overselect as sample size grows. On a maize proteomics dataset of drought-responsive traits, our dendrogram-guided SGMoE selects two experts, exposes a clear mixing-measure hierarchy, stabilizes the likelihood early, and yields interpretable genotype-phenotype maps, outperforming standard criteria without multi-size training.

2601.16510 2026-06-09 cs.MS cs.LG math.OC 版本更新

Learning to Optimize by Differentiable Programming

通过可微编程学习优化

Liping Tao, Xindi Tong, Chee Wei Tan

发表机构 * Nanyang Technological University(南洋理工大学)

AI总结 本教程介绍利用可微编程学习设计一阶优化算法,通过端到端训练提升收敛性和解质量,并基于Fenchel-Rockafellar对偶性展示ADMM和PDHG等算法的学习与适应。

详情
AI中文摘要

解决大规模优化问题需要可扩展且每次迭代成本低的一阶方法。本教程强调了优化领域的一个转变:利用可微编程不仅执行算法,而且学习如何设计它们。诸如PyTorch、TensorFlow和JAX等现代框架通过高效的自动微分实现了这一范式。将一阶方法嵌入这些系统允许端到端训练,从而改善收敛性和解质量。在Fenchel-Rockafellar对偶性的指导下,本教程展示了如何学习和适应诸如ADMM和PDHG等对偶信息迭代方案。通过LP、NNV、和速率最大化、OPF和LRMP等案例研究说明了这些改进。

英文摘要

Solving massive-scale optimization problems requires scalable first-order methods with low per-iteration cost. This tutorial highlights a shift in optimization: using differentiable programming not only to execute algorithms but to learn how to design them. Modern frameworks such as PyTorch, TensorFlow, and JAX enable this paradigm through efficient automatic differentiation. Embedding first-order methods within these systems allows end-to-end training that improves convergence and solution quality. Guided by Fenchel-Rockafellar duality, the tutorial demonstrates how duality-informed iterative schemes such as ADMM and PDHG can be learned and adapted. Case studies across LP, NNV, Sum-Rate maximization, OPF, and LRMP illustrate these gains.

2602.02431 2026-06-09 stat.ML cs.LG 版本更新

Full-Batch Gradient Descent Outperforms One-Pass SGD: Sample Complexity Separation in Single-Index Learning

全批量梯度下降优于单次SGD:单索引学习中的样本复杂度分离

Filip Kovačević, Hong Chang Ji, Denny Wu, Mahdi Soltanolkotabi, Marco Mondelli

发表机构 * Institute of Science and Technology Austria(奥地利科学与技术研究所) Sung Kyun Kwan University(成均馆大学) New York University and Flatiron Institute(纽约大学和Flatiron研究所) University of Southern California(南加州大学)

AI总结 研究单索引学习中全批量GD与单次SGD的样本复杂度差异,发现通过截断激活函数,全批量GD在n≃d样本时实现弱恢复,优于单次SGD的n≳d log d样本需求。

Comments Accepted to ICML 2026

详情
AI中文摘要

传统观点认为,多次重用训练数据可以提高基于梯度的学习的统计效率。虽然这一现象在线性回归中已被广泛研究,但在非线性和非凸设置中,除了前两次数据传递实现的损失修改机制外,多遍梯度下降(GD,重用所有数据)相对于单遍随机梯度下降(在线SGD,每个数据点仅使用一次)的优势尚未得到充分理解。在这项工作中,我们考虑学习一个具有二次激活函数的$d$维单索引模型,已知单次SGD需要$n\gtrsim d\log d$个样本才能实现弱恢复。我们首先证明,对于相关损失上的全批量球面GD,样本复杂度中的$\log d$因子仍然存在;然而,通过简单地截断激活函数,全批量GD在$n \simeq d$个样本时展现出有利的优化景观,从而在统计效率上优于单次SGD(使用相同的激活函数)。我们通过从微小初始化开始的平方损失上全批量GD的轨迹分析补充了这一结果,表明$n \gtrsim d$个样本和$T \gtrsim\log d$个梯度步足以实现强(精确)恢复。

英文摘要

It is folklore that reusing training data more than once can improve the statistical efficiency of gradient-based learning. While this phenomenon has been extensively studied in linear regression, the benefit of multi-pass gradient descent (GD, which reuses all the data) over one-pass stochastic gradient descent (online SGD, which uses each data point only once) is not well-understood in nonlinear and non-convex settings, except for a loss modification mechanism achieved by the first two passes on the data. In this work, we consider learning a $d$-dimensional single-index model with a quadratic activation, for which it is known that one-pass SGD requires $n\gtrsim d\log d$ samples to achieve weak recovery. We first show that this $\log d$ factor in the sample complexity persists for full-batch spherical GD on the correlation loss; however, by simply truncating the activation, full-batch GD exhibits a favorable optimization landscape at $n \simeq d$ samples, thereby outperforming one-pass SGD (with the same activation) in statistical efficiency. We complement this result with a trajectory analysis of full-batch GD on the squared loss from small initialization, showing that $n \gtrsim d$ samples and $T \gtrsim\log d$ gradient steps suffice to achieve strong (exact) recovery.

2602.03682 2026-06-09 stat.ML cs.DC cs.LG cs.NA math.NA 版本更新

Improved Analysis of the Accelerated Noisy Power Method with Applications to Decentralized PCA

加速噪声幂方法的改进分析及其在去中心化PCA中的应用

Pierre Aguié, Mathieu Even, Laurent Massoulié

发表机构 * École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院)

AI总结 本文改进了加速噪声幂方法的分析,在更宽松的扰动条件下保持加速收敛速率,并首次提出具有可证明加速收敛的去中心化PCA算法。

详情
AI中文摘要

我们分析了加速噪声幂方法,这是一种在仅有不精确矩阵-向量乘积可用的情况下进行主成分分析的算法,例如在去中心化PCA中可能出现的情况。虽然先前的工作已经证明,与标准噪声幂方法相比,加速可以改善收敛速度,但这些保证需要对扰动幅度施加过于严格的上界,限制了其实用性。我们提供了该算法的改进分析,在温和得多的扰动条件下保持了加速收敛速率。我们证明我们的新分析在最坏情况下是最优的,即收敛速率无法进一步提高,并且我们推导的噪声条件在不牺牲收敛保证的情况下无法放宽。我们通过推导一种用于去中心化PCA的加速算法来展示我们结果的实际相关性,该算法具有与非加速方法相似的通信成本。据我们所知,这是第一个具有可证明加速收敛的去中心化PCA算法。

英文摘要

We analyze the Accelerated Noisy Power Method, an algorithm for Principal Component Analysis in the setting where only inexact matrix-vector products are available, which can arise for instance in decentralized PCA. While previous works have established that acceleration can improve convergence rates compared to the standard Noisy Power Method, these guarantees require overly restrictive upper bounds on the magnitude of the perturbations, limiting their practical applicability. We provide an improved analysis of this algorithm, which preserves the accelerated convergence rate under much milder conditions on the perturbations. We show that our new analysis is worst-case optimal, in the sense that the convergence rate cannot be improved, and that the noise conditions we derive cannot be relaxed without sacrificing convergence guarantees. We demonstrate the practical relevance of our results by deriving an accelerated algorithm for decentralized PCA, which has similar communication costs to non-accelerated methods. To our knowledge, this is the first decentralized algorithm for PCA with provably accelerated convergence.
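
As a rough illustration of the setting described above, the following sketch runs a momentum-style power iteration in which every matrix-vector product is perturbed by additive noise. The recursion `A @ x - beta * x_prev`, the Gaussian noise model, and the names `beta` and `noise_std` are assumptions made for this example; the abstract does not specify the exact update or noise conditions analyzed in the paper.

```python
import numpy as np

def accelerated_noisy_power_method(A, beta, noise_std=0.0, iters=200, seed=0):
    """Momentum power iteration with a noisy matrix-vector product (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    d = A.shape[0]
    x_prev = np.zeros(d)
    x = rng.standard_normal(d)
    x /= np.linalg.norm(x)
    for _ in range(iters):
        # Only an inexact product A @ x is assumed to be available.
        noise = noise_std * rng.standard_normal(d)
        x_next = A @ x - beta * x_prev + noise
        # Rescale both iterates by the same factor so the recursion stays consistent.
        scale = np.linalg.norm(x_next)
        x_prev, x = x / scale, x_next / scale
    return x

A = np.diag([3.0, 2.0, 1.0])
# beta = (second eigenvalue)^2 / 4 is a common tuning for momentum power iteration.
v = accelerated_noisy_power_method(A, beta=1.0, noise_std=1e-3)
print(abs(v[0]))  # alignment with the top eigenvector e_1
```

In a decentralized deployment the noise term would stand in for the error of an approximate averaging (gossip) round rather than for synthetic Gaussian noise.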

2602.04402 2026-06-09 stat.ML cs.AI cs.CY cs.LG math.ST stat.TH 版本更新

Performative Learning Theory

表现性学习理论

Julian Rodemann, Unai Fischer-Abaigar, James Bailie, Krikamol Muandet

发表机构 * University of Cambridge(剑桥大学)

AI总结 将表现性预测嵌入统计学习理论,证明在样本和总体表现性效应下的泛化界,揭示模型影响数据越多则学习越少的权衡,并提出通过再训练改善泛化保证。

Comments ICML 2026. v2: corrected typo in author list; v3: added explanation of condition 3.2, modified condition 3.3 and fixed lemma 3.4, added examples and explanations in sections 2, 5, and 6

详情
AI中文摘要

表现性预测会影响它们试图预测的结果。我们研究影响样本(例如,仅限现有应用用户)和/或整个总体(例如,所有潜在应用用户)的表现性预测。这引发了模型在表现性下泛化能力的问题。例如,当现有用户和新用户都对应用的预测做出反应时,我们基于现有用户对新用户能得出多好的见解?我们通过将表现性预测嵌入统计学习理论来解决这个问题。我们证明了在样本、总体以及两者共同影响下的泛化界。我们证明背后的一个关键直觉是,在最坏情况下,总体否定预测,而样本欺骗性地实现预测。我们分别将这种自我否定和自我实现的预测表述为Wasserstein空间中的最小-最大和最小-最小风险泛函。我们的分析揭示了表现性地改变世界与从中学习之间的基本权衡:模型对数据的影响越大,它能从数据中学到的就越少。此外,我们的分析得出一个令人惊讶的见解:通过对表现性扭曲的样本进行再训练,可以改善泛化保证。我们通过一个案例研究说明了我们的界,该案例涉及基于预测的德国失业居民工作培训分配,利用了德国1975年至2017年的行政劳动力市场记录。

英文摘要

Performative predictions influence the very outcomes they aim to forecast. We study performative predictions that affect a sample (e.g., only existing users of an app) and/or the whole population (e.g., all potential app users). This raises the question of how well models generalize under performativity. For example, how well can we draw insights about new app users based on existing users when both of them react to the app's predictions? We address this question by embedding performative predictions into statistical learning theory. We prove generalization bounds under performative effects on the sample, on the population, and on both. A key intuition behind our proofs is that in the worst case, the population negates predictions, while the sample deceptively fulfills them. We cast such self-negating and self-fulfilling predictions as min-max and min-min risk functionals in Wasserstein space, respectively. Our analysis reveals a fundamental trade-off between performatively changing the world and learning from it: the more a model affects data, the less it can learn from it. Moreover, our analysis results in a surprising insight on how to improve generalization guarantees by retraining on performatively distorted samples. We illustrate our bounds in a case study on prediction-informed assignments of unemployed German residents to job trainings, drawing upon administrative labor market records from 1975 to 2017 in Germany.

2604.26993 2026-06-09 math.NA cs.LG cs.NA math.OC 版本更新

State-Dependent Lyapunov Analysis of Rank-1 Matrix Factorization

秩1矩阵分解的状态依赖Lyapunov分析

Jaehong Moon

发表机构 * Industrial & Enterprise Systems Engineering, University of Illinois at Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校工业与企业系统工程系)

AI总结 本文通过状态依赖Lyapunov视角研究梯度下降在秩1矩阵分解中的收敛性,提出参数化二次证书I(δ;·),证明在临界步长以下收敛到全局极小值,临界步长以上则进入平衡终端状态并表现出周期2行为。

详情
AI中文摘要

我们通过状态依赖Lyapunov视角研究秩1矩阵分解的梯度下降。核心对象是一个参数化二次证书I(δ;·),其边界内向性质诱导出单调的状态参数δ_t,从而证明轨迹被限制在一族不断收缩的水平集内。对于临界步长以下的已认证初始化,此机制证明收敛到全局极小值。在临界步长以上,相同的单调状态机制则导致平衡的终端状态;对于一系列超临界步长,约化动力学表现出与稳定性边缘现象一致的周期2行为。我们进一步表明,该标量证书并非临时拼凑的代数构造:在结构公理和自然的状态-参数归一化下,它由单调性机制唯一确定。数值实验表明,这种状态依赖Lyapunov机制在已证明的情形之外仍然成立,包括二维秩1近似和标量分解的四次扩展。

英文摘要

We study gradient descent for rank-1 matrix factorization through a state-dependent Lyapunov perspective. The central object is a parameterized quadratic certificate $I(δ;\,\cdot)$ whose boundary-inward property induces a monotone state parameter $δ_t$, thereby certifying that the trajectory is confined to a shrinking family of level sets. For certified initializations below the critical step size, this mechanism proves convergence to global minimizers. Above the critical step size, the same monotone-state mechanism instead leads to a balanced terminal regime; for a range of post-critical step sizes, the reduced dynamics exhibit period-2 behavior consistent with edge-of-stability phenomena. We further show that the scalar certificate is not an ad hoc algebraic construction: under structural axioms and a natural state-parameter normalization, it is uniquely determined by the monotonicity mechanism. Numerical experiments suggest that this state-dependent Lyapunov mechanism persists beyond the proved cases, including two-dimensional rank-1 approximation and quartic augmentations of scalar factorization.

2605.25085 2026-06-09 cs.IT cs.AI cs.LG math.IT 版本更新

Polynomial Context-Truncation Sensitivity in Autoregressive Language Models: Sequential Wyner-Ziv Bounds for KV Cache Compression

自回归语言模型中的多项式上下文截断敏感性:KV缓存压缩的序列Wyner-Ziv界

Munsik Kim

发表机构 * Independent Researcher(独立研究者)

AI总结 研究自回归语言模型中在线KV缓存压缩的率失真极限,将其建模为序列Wyner-Ziv信源编码,发现下一词分布对上下文截断的敏感性呈多项式衰减,并推导了仅后缀缓存策略的每词内存需求。

详情
AI中文摘要

我们研究了自回归语言模型中在线KV缓存压缩的率失真极限,将其建模为模型诱导滤子上的序列Wyner-Ziv信源编码,其中下一步查询作为解码器边信息。实验上,在涵盖两个系列、参数规模0.5-3B的四个模型中,我们发现下一词分布对上下文截断的敏感性呈多项式衰减而非几何衰减:幂律在外推中比指数拟合好一个数量级,拟合指数可由汇点加最近(sink-plus-recent)KL测量独立恢复,并通过位置保持消融验证了衰减不受位置编码伪影影响。在相应的多项式截断敏感性假设下,我们的主要结果刻画了仅后缀缓存策略的每词内存需求:滑动窗口方案以窗口大小$w = O(\varepsilon^{-1/α})$达到失真$\varepsilon$,且在附加双边贝叶斯风险条件下,逆命题表明在该策略类内$w = \Omega(\varepsilon^{-1/α})$是必要的,因此仅后缀策略的缩放为$\Theta(\varepsilon^{-1/α})$。循环或传播缓存摘要能否超越此缩放留待进一步研究。一个显式的块马尔可夫方案达到上界;在附加前向衰减和正则性假设(仅由截断敏感性无法推出)下,其收敛速率指数与逆命题匹配,否则相差两倍。实验上,幂律预测了具体缓存策略的退化曲线:基于最近性的驱逐(滑动、汇点加最近)在同等预算下相比随机保留将失真抑制约两个数量级,且失真随预算呈幂律衰减。

英文摘要

We study the rate-distortion limits of online KV cache compression in autoregressive language models, formulating it as sequential Wyner-Ziv source coding on the filtration induced by the model, with the next-step query as decoder side information. Empirically, across four models spanning two families and $0.5$-$3$B parameters, we find that the next-token distribution's sensitivity to context truncation decays \emph{polynomially} rather than \emph{geometrically}: a power law improves on an exponential fit by an order of magnitude in extrapolation, the fitted exponent is recovered independently from a sink-plus-recent KL measurement, and the decay is verified to be free of positional-encoding artifacts by a position-preserving ablation. Under a corresponding \emph{polynomial truncation-sensitivity} assumption, our main result characterizes the per-token memory requirement of \emph{suffix-only} cache policies: a sliding-window scheme attains distortion $\varepsilon$ with window $w = O(\varepsilon^{-1/α})$, and -- under an additional two-sided Bayes-risk condition -- a converse shows $w = Ω(\varepsilon^{-1/α})$ is necessary within this policy class, so the scaling is $Θ(\varepsilon^{-1/α})$ for suffix-only policies. Whether recurrent or propagating cache summaries can beat this scaling is left open. An explicit block-Markov scheme achieves the upper bound; its rate-of-convergence exponent matches the converse under additional forward-decay and regularity hypotheses (not implied by truncation sensitivity alone), and differs by a factor of two otherwise. Empirically, the polynomial law predicts the degradation curves of concrete cache policies: recency-based eviction (sliding, sink-plus-recent) suppresses distortion by roughly two orders of magnitude over random retention at equal budget, with a power-law decay in the budget.
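
To make the window scaling concrete, here is a small arithmetic sketch. It assumes the truncation sensitivity at window size `w` is bounded by `C * w**(-alpha)` for a hypothetical constant `C`; the abstract only gives the order of growth, so `C` and the exact form of the bound are illustrative assumptions.

```python
import math

def suffix_window(eps, alpha, C=1.0):
    """Window size w chosen so that C * w**(-alpha) <= eps, up to floating-point rounding.

    C is a hypothetical constant; the source states only w = Theta(eps**(-1/alpha)).
    """
    return max(1, math.ceil((C / eps) ** (1.0 / alpha)))

for alpha in (0.5, 1.0, 2.0):
    w = suffix_window(eps=0.01, alpha=alpha)
    print(alpha, w, w ** (-alpha))
```

Under this assumption a smaller decay exponent `alpha` inflates the required window polynomially in `1/eps`, whereas a geometric decay would need only a window logarithmic in `1/eps`.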

2605.26703 2026-06-09 econ.TH cs.GT cs.LG stat.ML 版本更新

Proper Calibeating

Proper Calibeating

Dean P. Foster, Sergiu Hart

发表机构 * Department of Statistics, Wharton, University of Pennsylvania, Philadelphia, and Amazon, New York(统计系、沃顿商学院、宾夕法尼亚大学费城分校,以及纽约亚马逊公司) Institute of Mathematics, Department of Economics, and Federmann Center for the Study of Rationality, The Hebrew University of Jerusalem(数学研究所、经济系、理性研究基金会,以色列希伯来大学)

AI总结 本文将经典校准预测和calibeating概念扩展到真确评分规则,定义proper-calibration和proper-calibeating,证明校准蕴含proper-calibration而calibeating不一定蕴含proper-calibeating,展示如何保证proper-calibeating和proper-multicalibeating,并证明proper-calibration与不确定性决策中对预测最佳回应时通用无遗憾的等价性。

Comments v2: Updated section 6 "Decision Making Under Uncertainty"

详情
AI中文摘要

经典概念“校准预测”及其更近期的改进“calibeating”是相对于标准二次评分规则定义的。我们将这些概念扩展到$\textit{真确}$评分规则类(其中最佳预测是真实分布),并通过要求误差在所有有界真确评分规则上一致收敛到零来定义$\textit{proper-calibration}$和$\textit{proper-calibeating}$。我们首先证明校准总是蕴含proper-calibration,而calibeating不一定蕴含proper-calibeating。其次,我们展示如何保证proper-calibeating和proper-multicalibeating。最后,我们证明了在不确定性决策中对预测进行最佳回应时,proper-calibration与通用无遗憾之间的等价性。

英文摘要

The classic concept of "calibrated forecasts" and its more recent refinement, "calibeating," are defined with respect to the standard quadratic scoring rule. We extend these notions to the class of $\textit{proper}$ scoring rules (for which the best forecast is the true distribution) and define $\textit{proper-calibration}$ and $\textit{proper-calibeating}$ by requiring the errors to converge to zero uniformly over all bounded proper scoring rules. We first establish that calibration always implies proper-calibration, whereas calibeating need not imply proper-calibeating. Second, we show how to guarantee proper-calibeating and proper-multicalibeating. Finally, we demonstrate the equivalence between proper-calibration and universal no regret when best replying to forecasts in decision-making under uncertainty.

2606.01342 2026-06-09 cs.DS cs.LG 版本更新

Towards Optimal Robustness in Learning-Augmented Paging

面向学习增强分页的最优鲁棒性

Peng Chen, Hailiang Zhao, Xueyan Tang, Yixuan Wang, Shuiguang Deng

发表机构 * Zhejiang University, Hangzhou, China Nanjing University of Aeronautics Nanyang Technological University, Singapore

AI总结 本文提出一种新框架,通过相对预测预算原语,在学习增强分页中实现最优鲁棒性界 H_k + O(1),并实验验证其实际性能。

Comments ICML 2026

详情
AI中文摘要

近年来,学习增强分页得到了广泛研究。与朴素基于机器学习的方法相比,一个关键优势是\textit{有界鲁棒性},即使在预测不准确时也能保证最坏情况性能,这使得这些算法对实际系统有价值。先前工作在随机化设置中实现了 $2H_k + O(1)$ 的鲁棒性界,与最优竞争比 $H_k$ 存在差距。在本文中,我们研究如何缩小这一差距。我们首先回顾在线最优性,并证明最新的 $H_k$-竞争算法的一个新性质,这有助于我们在学习增强设置中的分析。然后,我们回顾现有的学习增强分页算法,并引入一个统一原语:\textit{相对预测预算},它捕捉了建立鲁棒性的本质,并揭示了先前算法要么过度使用要么未充分利用预测。在上述分析指导下,我们开发了一个新框架,实现了学习增强分页的最优鲁棒性(至多相差一个加法常数):$H_k + O(1)$。实验进一步证明了强大的实际性能。

英文摘要

Learning-augmented paging has been extensively studied in recent years. A key advantage over naive ML-based approaches is \emph{bounded robustness}, which guarantees worst-case performance even when predictions are inaccurate, making these algorithms valuable for real-world systems. Prior work achieves robustness bounds of $2H_k + O(1)$ in the randomized setting, leaving a gap to the optimal competitive ratio $H_k$. In this paper, we study how to close this gap. We begin by reviewing online optimality and proving a new property of the latest $H_k$-competitive algorithm, which facilitates our analysis in the learning-augmented setting. Then, we review existing learning-augmented paging algorithms and introduce a unifying primitive, the \emph{relative prediction budget}, which captures the essence of establishing robustness and reveals that prior algorithms either overuse or underutilize predictions. Guided by the above analysis, we develop a new framework that achieves the best-possible robustness up to an additive constant for learning-augmented paging: $H_k + O(1)$. Experiments further demonstrate strong practical performance.

6. 高效学习、压缩与部署 64 篇

2606.07571 2026-06-09 cs.LG cs.AI 新提交

Enabling KV Caching of Shared Prefix for Diffusion Language Models

为扩散语言模型启用共享前缀的KV缓存

Younghun Go, Jaehoon Han, Changyong Shin, Chuk Yoo, Gyeongsik Yang

发表机构 * Korea University(高丽大学)

AI总结 针对扩散语言模型中双向注意力导致共享前缀KV不稳定的问题,提出双向前缀缓存(bicache),通过动态识别安全层深度重用KV,避免精度崩溃,提升吞吐量36.3%-98.3%。

详情
AI中文摘要

共享前缀的键值(KV)缓存对于高吞吐量的大语言模型(LLM)服务至关重要,但在新兴的扩散语言模型(DLM)中面临严峻挑战。在DLM中,双向注意力意味着更新任何token都会动态改变整个上下文及其对应的KV。因此,为LLM开发的现有缓存技术(假设KV一旦计算就保持不变)会破坏共享前缀KV。我们的实验表明,将这些技术应用于DLM会导致模型精度几乎降为零。为了解锁高吞吐量的DLM服务,我们提出了双向前缀缓存(bicache),这是第一个用于DLM中共享前缀的KV缓存技术。bicache基于我们全面分析的关键观察设计:共享前缀KV在浅层中保持稳定且可重用,而浅层的深度取决于每个请求中共享前缀token的比例。因此,bicache动态识别用于重用共享前缀KV的安全层深度,并消除冗余计算。评估表明,与现有技术相比,bicache显著提高了服务吞吐量36.3%-98.3%,且没有精度崩溃(仅0-1.8%的差异)。

英文摘要

Key-value (KV) caching for shared prefixes is essential for high-throughput large language model (LLM) serving, but it faces critical challenges in emerging diffusion language models (DLMs). In DLMs, bidirectional attention means that updating any token dynamically alters the entire context and its corresponding KVs. Thus, existing caching techniques developed for LLMs, which assume that KVs remain invariant once computed, corrupt the shared prefix KVs. Our experiments show that applying these techniques to DLMs causes model accuracy to collapse to near zero. To unlock high-throughput DLM serving, we propose bidirectional prefix caching, bicache, the first KV caching technique for shared prefixes in DLMs. bicache is designed based on key observations from our comprehensive analysis: shared prefix KVs remain stable and reusable in shallow layers, while the depth of shallow layers depends on the fraction of shared prefix tokens in each request. Thus, bicache dynamically identifies a safe layer depth for reusing shared prefix KVs and eliminates redundant computation. Evaluations demonstrate that bicache significantly improves serving throughput by 36.3%-98.3% compared to existing techniques without accuracy collapse (only 0-1.8% difference).

2606.07615 2026-06-09 cs.LG cs.AI 新提交

Structured Neuron Pruning in Deep Neural Networks Using Multi-Armed Bandits

深度神经网络中使用多臂赌博机的结构化神经元剪枝

Salem Ameen, Sunil Vadera

发表机构 * School of Science, Engineering and Environment, University of Salford(科学、工程与环境学院,萨尔福德大学)

AI总结 提出基于多臂赌博机算法的结构化剪枝框架,通过将每个神经元视为臂并评估移除奖励,在表格分类、回归及深度网络任务上验证了UCB1和汤普森采样等策略的有效性。

Comments 27 pages, 5 figures

详情
AI中文摘要

深度神经网络通常包含冗余的隐藏单元。移除单个权重可以减少参数数量,但非结构化稀疏性在标准密集实现中并不总是容易利用。本文开发了一个结构化剪枝框架,其中使用多臂赌博机(MAB)算法移除完整的神经元。每个候选神经元被视为一个臂;拉动一个臂会暂时屏蔽该神经元,测量采样小批量上损失的变化,恢复神经元,并更新其安全移除奖励的估计。该框架支持随机策略,包括Epsilon-Greedy、Softmax、UCB1和汤普森采样,以及乘性权重策略,包括Hedge风格的乘性权重和EXP3。我们在涵盖图像、文本和推理任务的表格分类、表格回归和深度神经网络基准上评估了该方法。使用弗里德曼检验和随后Nemenyi事后检验的统计比较显示方法之间存在显著差异。在表格分类任务上,UCB1在剪枝策略中获得最高平均排名,并优于未剪枝的神经网络。在回归任务上,UCB1获得最高平均排名,并且根据R^2,与几种标准回归模型在统计上具有竞争力或更优。在深度学习任务上,UCB1和汤普森采样获得最强排名,并且几种MAB策略显著优于未剪枝模型、基于幅度的神经元剪枝和贪婪激活变化剪枝。结果表明,基于MAB的神经元剪枝是一种有效且计算实用的结构化模型缩减方法。

英文摘要

Deep neural networks often contain redundant hidden units. Removing individual weights can reduce parameter count, but unstructured sparsity is not always easy to exploit in standard dense implementations. This paper develops a structured pruning framework in which complete neurons are removed using multi-armed bandit (MAB) algorithms. Each candidate neuron is treated as an arm; pulling an arm temporarily masks that neuron, measures the change in loss on a sampled mini-batch, restores the neuron, and updates an estimate of its safe-removal reward. The framework supports stochastic policies, including Epsilon-Greedy, Softmax, UCB1 and Thompson Sampling, and multiplicative-weight policies, including Hedge-style multiplicative weights and EXP3. We evaluate the method on tabular classification, tabular regression and deep neural-network benchmarks covering image, text and reasoning tasks. Statistical comparisons using the Friedman test followed by the Nemenyi post-hoc test show significant differences between methods. On tabular classification tasks, UCB1 obtains the highest mean rank among pruning policies and improves on the unpruned neural network. On regression tasks, UCB1 obtains the highest mean rank and is statistically competitive with, or superior to, several standard regression models according to R^2. On deep-learning tasks, UCB1 and Thompson Sampling obtain the strongest ranks, and several MAB policies significantly outperform the unpruned model, magnitude-based neuron pruning and greedy activation-variation pruning. The results show that MAB-based neuron pruning is an effective and computationally practical approach for structured model reduction.
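
The arm-pulling loop described above can be sketched as follows for a toy one-hidden-layer ReLU regressor scored with UCB1. The network, the reward definition (baseline mini-batch loss minus masked loss), and all function names are assumptions chosen for illustration; the abstract does not give the paper's exact reward or normalization.

```python
import numpy as np

def forward_loss(X, y, W1, w2, mask):
    """Mean squared error of a one-hidden-layer ReLU network with masked neurons."""
    h = np.maximum(X @ W1, 0.0) * mask
    return float(np.mean((h @ w2 - y) ** 2))

def ucb1_neuron_scores(X, y, W1, w2, pulls=300, batch=32, seed=0):
    """Estimate a safe-removal reward per hidden neuron with UCB1 (illustrative sketch)."""
    rng = np.random.default_rng(seed)
    n_arms = W1.shape[1]
    counts = np.zeros(n_arms)
    means = np.zeros(n_arms)
    full = np.ones(n_arms)
    for t in range(1, pulls + 1):
        if t <= n_arms:
            arm = t - 1  # play every arm once before using confidence bounds
        else:
            arm = int(np.argmax(means + np.sqrt(2.0 * np.log(t) / counts)))
        idx = rng.choice(len(X), size=batch, replace=False)
        masked = full.copy()
        masked[arm] = 0.0  # temporarily mask the neuron; the weights are never modified
        base = forward_loss(X[idx], y[idx], W1, w2, full)
        pruned = forward_loss(X[idx], y[idx], W1, w2, masked)
        reward = base - pruned  # close to 0 when removal is harmless, negative otherwise
        counts[arm] += 1
        means[arm] += (reward - means[arm]) / counts[arm]  # running mean update
    return means, counts
```

Neurons with the highest estimated reward are the candidates for removal. Textbook UCB1 assumes rewards in [0, 1], so a practical version would rescale or clip the loss difference.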

2606.07618 2026-06-09 cs.LG cs.AI cs.CV 新提交

ScaleSweep: Accurate NVFP4 Post-Training Quantization of LLMs via Block Scale Initialization

ScaleSweep: 通过块尺度初始化实现LLM的精确NVFP4训练后量化

Li Lin, Xiaojun Wan

发表机构 * Wangxuan Institute of Computer Technology, Peking University(北京大学王选计算机技术研究所)

AI总结 提出ScaleSweep方法,通过扫描可行块尺度候选并选择最小化目标函数的候选,优化NVFP4量化中的尺度初始化,理论推导扫描范围边界,在Llama和Qwen模型上提升量化性能,缩小与全精度的差距。

Comments under review

详情
AI中文摘要

NVFP4是一种最近引入的硬件支持的FP4格式,通过细粒度块尺度提高了4位量化的保真度。然而,现有的NVFP4尺度初始化方法仍然主要依赖于AbsMax初始化,这与最优解之间存在明显差距。为了解决这个问题,我们提出了ScaleSweep,一种简单高效的尺度优化方法,它扫描可行的块尺度候选,并选择最小化目标函数的候选。我们进一步提供了NVFP4量化的理论分析,并推导了在原始张量与量化重建张量之间的均方误差(MSE)和加权均方误差(WMSE)下所需扫描范围的上下界。所提出的界限大幅减少了扫描空间,同时保留了最优候选,使得与基线量化算子相比开销可忽略。在Llama和Qwen模型上的实验表明,ScaleSweep持续优于现有的初始化方法,并进一步缩小了与全精度的差距。特别是在对权重、激活、KV缓存和查询状态进行激进的全端到端量化时,ScaleSweep保留了超过93%的全精度性能。

英文摘要

NVFP4 is a recently introduced hardware-supported FP4 format that improves the fidelity of 4-bit quantization through fine-grained block scales. However, existing NVFP4 scale initialization methods still primarily rely on AbsMax initialization, which leaves a noticeable gap to the optimal solution. To address this, we propose ScaleSweep, a simple and efficient scale optimization method that sweeps over feasible block scale candidates and selects the candidate that minimizes a target objective. We further provide a theoretical analysis of NVFP4 quantization and derive both lower and upper bounds for the required sweep range under mean square error (MSE) and weighted mean square error (WMSE) between the original tensor and the quantized reconstructed tensor. The proposed bounds substantially reduce the sweep space while preserving the optimal candidate, enabling negligible overhead compared with the baseline quantization operators. Experiments on Llama and Qwen models demonstrate that ScaleSweep consistently improves quantization performance over existing initialization methods and further narrows the gap to full precision. In particular, under aggressive end-to-end quantization of weights, activations, KV cache, and query states, ScaleSweep preserves more than 93% of the full-precision performance.
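
The sweep idea can be sketched for a single block as below. The signed E2M1 value grid is the standard FP4 set; everything else is an assumption for illustration: the block scale is kept as an ordinary float instead of being stored in FP8, the candidate range `ratios` is arbitrary rather than the paper's derived bounds, and the objective is plain MSE.

```python
import numpy as np

E2M1 = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])
GRID = np.concatenate([-E2M1[:0:-1], E2M1])  # signed representable FP4 values

def quantize_block(x, scale):
    """Round x / scale to the nearest grid value, then rescale."""
    idx = np.abs(x[:, None] / scale - GRID[None, :]).argmin(axis=1)
    return GRID[idx] * scale

def sweep_scale(x, ratios=np.linspace(0.6, 1.2, 25)):
    """Pick the block scale with the lowest MSE, starting from the AbsMax scale."""
    absmax = np.abs(x).max()
    if absmax == 0.0:
        return 1.0, 0.0  # all-zero block: any scale reconstructs it exactly
    base = absmax / E2M1[-1]  # AbsMax initialization
    best_scale = base
    best_mse = float(np.mean((x - quantize_block(x, base)) ** 2))
    for r in ratios:
        s = base * r
        mse = float(np.mean((x - quantize_block(x, s)) ** 2))
        if mse < best_mse:
            best_scale, best_mse = s, mse
    return best_scale, best_mse

block = np.random.default_rng(0).standard_normal(16)
print(sweep_scale(block))
```

Because the AbsMax scale is always the starting candidate, the selected scale can never be worse than AbsMax under the chosen objective; a weighted MSE would only change the two `np.mean` lines.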

2606.07684 2026-06-09 cs.LG cs.AI 新提交

Semantic Cache Distillation: Efficient State Transfer via Reuse and Selective Patching

语义缓存蒸馏:通过重用和选择性修补实现高效状态传输

Qianli Ma, Zhiqing Tang, Hanshuai Cui, Zhi Yao, Weijia Jia

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 针对大语言模型推理中KV缓存传输的通信瓶颈和跨模型重用时的语义错位问题,提出语义缓存蒸馏(SCD)框架,通过低秩子空间重建和稀疏过渡层归一化输入预测,实现高达2.65倍的首令牌时间加速,且生成质量接近理想情况。

Comments Accepted to ICML 2026

详情
AI中文摘要

分离式服务缓解了大语言模型(LLM)推理中的内存瓶颈,但造成了严重的通信瓶颈:传输高维键值(KV)缓存通常主导首令牌时间(TTFT)。此外,跨异构模型(例如,基础模型和微调变体)重用缓存会导致语义错位,且这种错位会随着层数累积,降低生成质量。我们提出语义缓存蒸馏(SCD),一种受损失约束的框架,用紧凑的语义代码替代原始KV传输。SCD通过两种机制解决这些挑战:(1)重用,从低秩子空间重建大部分层以最小化传输成本,以及(2)修补,在稀疏过渡层预测归一化输入以截断误差传播。实验表明,在带宽受限的情况下,SCD相比理想消费预填充实现了高达2.65倍的TTFT加速,并在质量-延迟帕累托前沿上优于量化和选择性重计算基线,同时将生成质量保持在理想情况F1的5%以内。

英文摘要

Disaggregated serving alleviates memory bottlenecks in Large Language Model (LLM) inference but creates a severe communication bottleneck: transmitting high-dimensional Key-Value (KV) caches often dominates time-to-first-token (TTFT). Moreover, reusing caches across heterogeneous models (e.g., base and fine-tuned variants) causes semantic misalignment that accumulates over layers, degrading generation quality. We propose Semantic Cache Distillation (SCD), a loss-constrained framework that replaces raw KV transmission with compact semantic codes. SCD addresses these challenges via two mechanisms: (1) Reuse, which reconstructs most layers from low-rank subspaces to minimize transfer cost, and (2) Patch, which predicts normalized inputs at sparse transition layers to truncate error propagation. Empirically, SCD delivers up to 2.65 $\times$ TTFT speedup over the oracle consumer prefill and dominates quantization and selective recomputation baselines on the quality--latency Pareto frontier in bandwidth-constrained regimes, while keeping generation quality within 5\% F1 of the oracle.

2606.07703 2026-06-09 cs.LG cs.AI cs.CL 新提交

How Much Dense Attention is Necessary? Oracle-Guided Sparse Prefill for Full/GQA Layers in Hybrid Long-Context Models

需要多少密集注意力?面向混合长上下文模型中全/GQA层的Oracle引导稀疏预填充

Hongxing Wang, Harenome Razanajato, Zhen Zhang, Yujie Yuan, Hongsheng Liu

发表机构 * Technical Report, First Release(技术报告,首次发布)

AI总结 研究在混合长上下文模型中,通过Oracle引导的稀疏预填充减少密集注意力计算,在保持任务性能的同时实现加速,并验证了可行性、索引器质量和运行时加速潜力。

Comments Technical report, first release, 26 pages, 2 figures, 11 tables

详情
AI中文摘要

长上下文预填充仍然昂贵,因为即使在包含局部、稀疏、线性或循环组件的混合模型中,全/GQA层仍然对整个历史序列进行评分。我们研究了在显式支持粒度和top-k预算下,需要多少密集注意力来保持任务级行为。我们为现有的GQA检查点引入了一种注意力质量top-k oracle:对于每个层和查询位置,它计算密集注意力,选择头平均的token支持,并仅在该支持上重新计算注意力。该oracle是一个诊断参考,而非可部署的加速器,并将稀疏预算可行性从索引器误差和运行时实现效果中分离出来。在Qwen家族的检索密集型评估中,每个查询的最长oracle行与密集注意力相差在1个点以内,而Qwen3.5-9B在4K到100K的RULER风格扫描中相差在0.48个点以内。在oracle的指导下,我们通过KL蒸馏从密集注意力质量分布中训练了一个头折叠的辅助索引器,同时保持骨干网络冻结。使用分别蒸馏的Qwen3.5-0.8B和Qwen3.5-9B索引器,报告的16K/32K验证宏观差距分别为+2.04和+1.13个点,这被视为质量保持而非改进;融合的选择块共享支持可能引入更大的实现差距。初步的单卡TTFT测量显示,与密集FlashAttention-2基线相比,蒸馏索引器的稀疏服务加速比在NPU上对Qwen3.5-0.8B为1.71倍,在GPU上对Qwen3.5-9B为1.93倍。额外的随机初始化压力行达到3.44倍,表明稀疏运行时存在提升空间,但输出质量未经验证。本次发布首次分离了oracle可行性、蒸馏索引器质量和运行时提升空间,将完全匹配的质量-延迟前沿留待未来工作。

英文摘要

Long-context prefill remains expensive because full/GQA layers still score the historical sequence, even in hybrid models with local, sparse, linear, or recurrent components. We study how much dense attention is needed to preserve task-level behavior under explicit support granularity and top-k budgets. We introduce an attention-mass top-k oracle for existing GQA checkpoints: for each layer and query position, it computes dense attention, selects head-averaged token support, and recomputes attention only on that support. The oracle is a diagnostic reference, not a deployable accelerator, and separates sparse-budget feasibility from indexer error and runtime realization effects. On Qwen-family retrieval-heavy evaluations, the longest per-query oracle rows stay within 1 point of dense, and a Qwen3.5-9B RULER-style sweep from 4K to 100K stays within 0.48 points. Guided by the oracle, we derive a head-collapsed auxiliary indexer trained by KL distillation from dense attention-mass distributions while keeping the backbone frozen. With separately distilled Qwen3.5-0.8B and Qwen3.5-9B indexers, the reported 16K/32K validation macro gaps are +2.04 and +1.13 points, treated as quality preservation rather than improvement; fused selection-block-shared support can introduce a larger realization gap. Preliminary single-card TTFT measurements show distilled-indexer sparse serving speedups of 1.71x for Qwen3.5-0.8B on NPU and 1.93x for Qwen3.5-9B on GPU against its dense FlashAttention-2 baseline. Additional random-init stress rows reach 3.44x, indicating sparse-runtime headroom but not validated output quality. This first release separates oracle feasibility, distilled-indexer quality, and runtime headroom, leaving a fully matched quality-latency frontier to future work.
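
A minimal sketch of the oracle procedure described above, for one causal attention layer: compute dense attention, average the attention mass over heads, keep the top-k tokens per query, and recompute attention on that support only. Tensor layout, the absence of GQA head grouping, and per-token (rather than block-level) support are simplifying assumptions of this example.

```python
import numpy as np

def softmax(z, axis=-1):
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def topk_oracle_attention(Q, K, V, k):
    """Q, K: (H, n, d); V: (H, n, dv). Returns (H, n, dv). Diagnostic only, not a fast kernel."""
    H, n, d = Q.shape
    scores = Q @ K.transpose(0, 2, 1) / np.sqrt(d)       # dense scores, (H, n, n)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    mass = softmax(scores).mean(axis=0)                  # head-averaged attention mass, (n, n)
    out = np.empty_like(V)
    for i in range(n):
        keep = np.argsort(mass[i, : i + 1])[-k:]         # top-k visible tokens for query i
        w = softmax(scores[:, i, keep])                  # renormalize on the support only
        out[:, i, :] = np.einsum("hs,hsv->hv", w, V[:, keep, :])
    return out
```

Since the oracle still computes the full dense scores before selecting the support, it measures how small the budget `k` can be, and it does not save any computation by itself.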

2606.07713 2026-06-09 cs.LG cs.AI cs.PF 新提交

Attention at the Theoretical Minimum: A Mathematics of Arrays Framework for Memory-Optimal Transformer Kernels

理论最小化的注意力机制:面向内存最优Transformer内核的数组数学框架

Lenore Mullin, Gaetan Hains

发表机构 * University at Albany(奥尔巴尼大学) Université Paris-Est Créteil(巴黎东大学克雷泰伊分校)

AI总结 提出基于数组数学(MoA)的缩放点积注意力重表述,通过代数构造消除所有中间数组,实现O(n dk + n dv)数据移动,相比标准实现O(n^2 + n dk + n dv)显著降低内存流量,并验证了数值精度。

详情
AI中文摘要

注意力机制是现代基于Transformer的AI中的主要计算瓶颈。其标准实现在序列长度$n$上产生二次内存流量,而DRAM访问在当代硬件上比算术操作消耗100--1000$\times$更多的能量,因此任何仅关注FLOP计数的分析从根本上误解了瓶颈。我们提出了缩放点积注意力及其数值稳定softmax的数组数学(MoA)重表述,推导出指称范式(DNF),通过代数构造而非经验调优消除了所有中间数组——包括隐式转置键缓冲区和每个softmax临时变量。DNF实现了$O(n dk + n dv)$的数据移动,而标准实现为$O(n^2 + n dk + n dv)$,其中$n$是序列长度,$dk$是键维度,$dv$是值维度,并在具体输入上针对PyTorch全双精度浮点进行了数值验证。与硬件特定的加速器或经验性分块方案(如FlashAttention)不同,MoA从单一代数框架同时提供了数组融合、形状变换正确性和预测性成本模型。内存最小性是在编写任何代码之前就确立的定理。预测性性能模型预计加速2--100$\times$,能耗降低2--50$\times$,优势在超大规模下进一步扩大。该推导建立了一个从Python规范经过操作范式(ONF)和维度提升硬件映射的形式化验证流水线,提供了与DARPA边缘部署和DOE超大规模优先事项直接相关的性能可移植AI内核。

英文摘要

The attention mechanism is the dominant computational bottleneck in modern transformer-based AI. Its standard implementation incurs quadratic memory traffic in the sequence length $n$, and DRAM accesses cost 100--1000$\times$ more energy than arithmetic operations on contemporary hardware, so any analysis focused solely on FLOP counts fundamentally mischaracterises the bottleneck. We present a Mathematics of Arrays (MoA) reformulation of scaled dot-product attention and its numerically stable softmax, deriving a Denotational Normal Form (DNF) that eliminates all intermediate arrays -- including the implicit transposed-key buffer and every softmax temporary -- by algebraic construction rather than empirical tuning. The DNF achieves $O(n dk + n dv)$ data movement versus $O(n^2 + n dk + n dv)$ for the standard implementation, where $n$ is the sequence length, $dk$ is the key dimensionality and $dv$ the value dimensionality, and is verified numerically against PyTorch at full double-precision floating-point on concrete inputs. Unlike hardware-specific accelerators or empirical tiling schemes such as FlashAttention, MoA simultaneously provides array fusion, shape-transformation correctness, and predictive cost models from a single algebraic framework. Memory minimality is a theorem established before any code is written. A predictive performance model projects $2$--$100\times$ speedup and $2$--$50\times$ energy reduction, with the advantage widening at exascale. The derivation establishes a formally verified pipeline from Python specification through Operational Normal Form (ONF) and dimension-lifted hardware mapping, providing performance-portable AI kernels of direct relevance to DARPA edge-deployment and DOE exascale priorities.
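
To get a feel for the two data-movement expressions quoted in the abstract, the short calculation below evaluates their leading terms for a few sequence lengths. It is only arithmetic on the big-O expressions with all hidden constants assumed to be 1; the function names and the choice of `dk = dv = 128` are illustrative assumptions, and the resulting ratio is not a speedup prediction.

```python
def standard_traffic(n, dk, dv):
    # Leading terms of O(n^2 + n*dk + n*dv), constants assumed to be 1.
    return n * n + n * dk + n * dv


def fused_traffic(n, dk, dv):
    # Leading terms of O(n*dk + n*dv), constants assumed to be 1.
    return n * dk + n * dv


def traffic_ratio(n, dk=128, dv=128):
    return standard_traffic(n, dk, dv) / fused_traffic(n, dk, dv)


if __name__ == "__main__":
    for n in (1024, 8192, 65536):
        print(n, traffic_ratio(n))
```

With `dk = dv`, the ratio simplifies to `n / (2 * dk) + 1`, so the quadratic term dominates once `n` is large relative to the head dimensions.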

2606.07878 2026-06-09 cs.LG 新提交

Still: Amortized KV Cache Compaction in a Single Forward Pass

Still: 单次前向传递中的摊销KV缓存压缩

Charles O'Neill, Alex Sandomirsky, Harry Partridge, Mudith Jayasekara, Max Kirkby

发表机构 * Baseten

AI总结 提出Still方法,通过单次前向传递的轻量级Perceiver层实现KV缓存压缩,在8×至200×压缩比和8k至128k上下文长度下兼顾速度与质量,长上下文任务超越最强基线8-22分。

详情
AI中文摘要

KV缓存是长时语言模型部署的内存瓶颈。实际上,可部署的压缩器必须足够轻量以便在推理时调用,足够表达以在约束下保留上下文,并且可跨轨迹重用。现有压缩方法仅满足部分要求:选择方法轻量但受限于子集,而合成方法表达性强但依赖于逐上下文优化。这里我们介绍Still,一个小的逐层Perceiver,针对冻结的基础模型训练一次,在单次前向传递中生成紧凑的键和值。在Qwen和Gemma模型上,Still在压缩比从$8\times$到$200\times$、上下文长度从8k到128k的范围内,占据了速度-质量前沿的有利位置。在长上下文RULER网格上,Still超过最强基线8-22分。相同的紧凑缓存还支持自由形式的摘要,在HELMET上保留了大部分全上下文增益,并在LongBench摘要比较中胜过KV-Distill。由于压缩是一次前向传递,Still可以迭代应用,进入逐上下文方法无法实现的长期场景。我们表明,摊销使长上下文缓存压缩变得可行,而合成使其紧凑状态在极端压缩下有用。

英文摘要

The KV cache is the memory bottleneck of long-horizon language model deployment. Practically, a deployable compactor must be lightweight enough to call during inference, expressive enough to preserve context under constraint, and reusable across a trajectory. Existing compaction methods satisfy only part of this requirement: selection methods are lightweight but subset-bound, while synthesis methods are expressive but rely on per-context optimization. Here we introduce Still, a small per-layer Perceiver trained once against a frozen base model that produces compact keys and values in a single forward pass. On Qwen and Gemma models, Still occupies the favorable side of the speed--quality frontier across compression ratios from $8\times$ to $200\times$ and context lengths from $8$k to $128$k. On the long-context RULER grid, Still exceeds the strongest baseline by 8--22 points. The same compact cache also supports free-form summarization, preserving most of the full-context gain on HELMET and winning a pairwise LongBench summarization comparison against KV-Distill. Because compaction is a forward pass, Still can be applied iteratively, entering a long-horizon regime unavailable to per-context methods. We show that amortization makes long-context cache compaction tractable, and synthesis makes its compact state useful at extreme compression.

2606.07954 2026-06-09 cs.LG cs.AI 新提交

Minibatch Selection via Partition Matroid Constrained Gradient Matching

基于划分拟阵约束梯度匹配的小批量选择

Prayas Agrawal, Prateek Chanda, Ishita Khatri, Ganesh Ramakrishnan, Bamdev Mishra, Pratik Jawanpuria

发表机构 * Indian Institute of Technology Bombay(印度理工学院孟买分校) Department of Computer Science and Engineering(计算机科学与工程系) Centre for Machine Intelligence and Data Science(机器智能与数据科学中心) Microsoft Research India(微软印度研究院) Microsoft India(微软印度)

AI总结 提出PartitionSel方法,通过划分拟阵约束下的梯度匹配效用最大化,实现跨域小批量选择,减少冗余并提升训练兼容性,在LLM微调中取得鲁棒性提升。

Comments 28 pages, 12 figures, ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026), Seoul, South Korea, PMLR 306, 2026
AI中文摘要

在异构数据上训练大型语言模型(LLMs)需要选择能够平衡收敛速度与跨领域覆盖的小批量。现有方法要么在每个领域内独立选择样本,要么依赖计算昂贵的代理模型来学习连续的领域权重。我们提出PartitionSel,一种跨领域小批量选择方法,它在每个领域的预算(编码为划分拟阵约束)下最大化验证引导的梯度匹配效用。通过单一效用耦合每个领域的预算,PartitionSel旨在减少跨领域选择中的冗余。所提出的目标是弱子模的,并允许使用正交匹配追踪算法,具有可证明的近似保证。在实验中,我们在MetaMathQA和Mol-Instructions上对Qwen2.5和Llama-3进行微调时,评估了PartitionSel的小批量选择。PartitionSel在两个基准测试中均比每个领域和领域无关的基线获得了鲁棒的提升。它还减少了每个批次内冲突梯度对的数量,表明跨领域耦合转化为更兼容的训练更新。

英文摘要

Training large language models (LLMs) on heterogeneous data requires selecting minibatches that balance convergence speed with coverage across domains. Existing methods either select samples independently within each domain or rely on computationally expensive proxy models to learn continuous domain weights. We propose PartitionSel, a cross-domain minibatch selection approach that maximizes a validation-guided gradient-matching utility under per-domain budgets encoded as a partition-matroid constraint. By coupling the per-domain budgets through a single utility, PartitionSel is designed to reduce redundancy in selections across domains. The proposed objective is weakly submodular and admits an orthogonal matching pursuit algorithm with provable approximation guarantees. Empirically, we evaluate PartitionSel for minibatch selection during the fine-tuning of Qwen2.5 and Llama-3 on MetaMathQA and Mol-Instructions. PartitionSel achieves robust gains over per-domain and domain-agnostic baselines on both benchmarks. It also reduces the number of conflicting gradient pairs within each batch, indicating that the cross-domain coupling translates into more compatible training updates.

2606.08382 2026-06-09 cs.LG cs.AI 新提交

STAR-KV: Low-Rank KV Cache Compression via Soft Thresholding for Adaptive Rank Control

STAR-KV:通过软阈值实现自适应秩控制的低秩KV缓存压缩

Priyansh Bhatnagar, Ashkan Moradifirouzabadi, Se-Hyun Yang, SeungJae Lee, Jungwook Choi, Mingu Kang

发表机构 * University of Washington(华盛顿大学)

AI总结 提出STAR-KV框架,通过可微阈值机制实现注意力头和块级别的自适应秩选择,结合混合分解和低秩感知混合精度量化,在多种LLM上达到75%的KV缓存压缩,结合量化可减少20倍,并实现6.9倍注意力模块加速和3.1倍端到端生成吞吐提升。

详情
AI中文摘要

低秩投影通过利用隐藏维度冗余已成为压缩KV缓存的一种有前景的方法。然而,先前的方法依赖于固定或启发式秩选择,难以在最小精度损失下实现激进压缩。我们提出STAR-KV,一种具有细粒度秩控制的自适应低秩KV缓存压缩框架。STAR-KV包括:1)可微阈值机制,可在注意力头和块级别实现最优秩选择;2)混合分解策略,根据键和值投影的敏感性应用不同的低秩分解;3)低秩感知混合精度量化,利用数据统计实现近乎无损的低比特量化。在多个LLM和基准测试中评估,STAR-KV实现了高达75%的KV缓存压缩,结合量化可实现高达20倍的整体KV缓存减少。通过基于Triton的自定义GPU内核,STAR-KV为注意力模块提供高达6.9倍的加速,端到端生成吞吐量提升3.1倍。我们的代码公开在:https://github.com/PriyanshBhatnagar/STAR-KV。

英文摘要

Low-rank projection has emerged as a promising approach for compressing the KV cache by exploiting hidden-dimension redundancy. However, prior methods rely on fixed or heuristic rank selection and struggle to achieve aggressive compression with minimal accuracy degradation. We propose STAR-KV, an adaptive low-rank KV cache compression framework with fine-grained rank control. STAR-KV encompasses 1) a differentiable thresholding mechanism that enables optimal rank selection at both attention-head and block levels, 2) a hybrid decomposition strategy that applies different low-rank factorizations according to the sensitivity of key and value projections, and 3) a low-rank-aware mixed precision quantization that leverages data statistics for near lossless low-bit quantization. Evaluated across multiple LLMs and benchmarks, STAR-KV achieves up to 75% KV cache compression and up to 20x overall KV cache reduction when combined with quantization. Enabled by custom Triton-based GPU kernels, STAR-KV delivers up to 6.9x speedup for the attention module and 3.1x end-to-end generation throughput. Our code is publicly available at: https://github.com/PriyanshBhatnagar/STAR-KV.
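
The generic soft-thresholding operator that the title refers to can be shown on a list of singular values: values below the threshold are zeroed, and the number of survivors acts as the retained rank. This is a sketch of the textbook operator only. How STAR-KV parameterizes or learns the threshold per attention head and block is not described in the abstract, so `tau` and both function names are assumptions.

```python
def soft_threshold(singular_values, tau):
    # Shrink every singular value by tau and clip at zero.
    return [max(s - tau, 0.0) for s in singular_values]


def effective_rank(singular_values, tau):
    # Rank retained after thresholding: count of strictly positive values.
    return sum(1 for s in soft_threshold(singular_values, tau) if s > 0.0)
```

For example, `effective_rank([5.0, 2.0, 0.5, 0.1], tau=1.0)` keeps two components; raising `tau` lowers the retained rank, which is what makes the threshold a continuous handle on a discrete quantity.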

2606.08446 2026-06-09 cs.LG cs.AI 新提交

Sparrow: Sparse Rollout for Stable and Efficient Long-context RL of Large Language Models

Sparrow: 用于大语言模型稳定高效长上下文强化学习的稀疏 rollout

Yang Zhou, Ranajoy Sadhukhan, Zhaofeng Sun, Zhuoming Chen, Souvik Kundu, Saket Dingliwal, Sai Muralidhar Jayanthi, Aram Galstyan, Haizhong Zheng, Beidi Chen

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Cornell University(康奈尔大学) Intel(英特尔) Amazon AGI(亚马逊AGI)

AI总结 针对RLVR中长上下文rollout计算昂贵的问题,提出Sparrow方法,通过动态稀疏度调度保持token级策略失配的下尾统计量稳定,在Qwen3系列模型上实现2.0-2.4倍加速,并推广到更大模型和编程领域。

详情
AI中文摘要

尽管强大,但带有可验证奖励的强化学习(RLVR)会诱导极长的思维链(COT),使其计算成本高昂。由于RLVR每步成本主要由长上下文rollout生成主导,稀疏注意力为加速密集rollout提供了一种有前景的方法。然而,稀疏rollout需要精细的稳定性-效率权衡:过于激进的稀疏性会导致崩溃,而过于宽松的稀疏性则加速不足。在这项工作中,我们通过稀疏到密集的演员-策略失配来研究这种权衡。我们首先观察到,稀疏rollout崩溃并非由token间的均匀退化驱动:即使在激进的稀疏性下,大多数稀疏token也能与密集token完美对齐。受此启发,我们假设如果每个token的演员-策略失配的下尾在整个轨迹中保持在临界阈值以上,则稀疏rollout训练保持稳定。我们引入一种动态稀疏度调度,在生成过程中保持该尾统计量恒定,并验证了我们的假设。在Qwen3思考族模型上,将尾失配统计量保持在一致阈值附近通常能实现稳定训练。然后,我们使用成本模型在该失配阈值下找到最大加速的稀疏度调度,在训练Qwen3-1.7B、Qwen3-4B和Qwen3-8B时分别实现了2.2倍、2.4倍和2.0倍的rollout加速。实验表明,这些阈值可推广到更大的模型(Qwen3-14B)和另一个RL领域(编程)。最后,我们的分析自然引出了DistillSparse:在稀疏rollout上进行轻量级基于LoRA的蒸馏,使更激进的稀疏性达到相同的稀疏到密集失配阈值,从而获得更高的加速。

英文摘要

Despite being powerful, reinforcement learning with verifiable rewards (RLVR) induces extremely long chains of thought (CoT), making it computationally expensive. Since RLVR per-step cost is dominated by long-context rollout generation, sparse attention offers a promising way to accelerate dense rollout. However, sparse rollouts require a delicate stability-efficiency tradeoff: overly aggressive sparsity causes collapse, while overly lenient sparsity gives insufficient speedup. In this work, we study this tradeoff through sparse-to-dense actor-policy mismatch. We first observe that sparse rollout collapse is not driven by uniform degradation across tokens: most sparse tokens align perfectly with their dense counterparts even under aggressive sparsity. Motivated by this, we hypothesize that sparse rollout training remains stable if the lower tail of per-token actor-policy mismatch stays above a critical threshold throughout the trajectory. We introduce a dynamic sparsity schedule that keeps this tail statistic constant during generation and validate our hypothesis. Across Qwen3 thinking-family models, keeping the tail mismatch statistic near a consistent threshold generally enables stable training. We then use a cost model to find the sparsity schedule for maximum speedup under this mismatch threshold, achieving 2.2x, 2.4x, and 2.0x rollout speedups when training Qwen3-1.7B, Qwen3-4B, and Qwen3-8B. Empirically, we show the thresholds generalize to a larger model (Qwen3-14B) and another RL domain (coding). Finally, our analysis naturally motivates DistillSparse: lightweight LoRA-based distillation on sparse rollout lets more aggressive sparsity reach the same sparse-to-dense mismatch threshold, yielding higher speedup.

2606.08565 2026-06-09 cs.LG cs.AI 新提交

EinSort: Sorting is All We Need for Tensorizing LLM

EinSort: 张量化大语言模型,排序即一切

Toshiaki Koike-Akino, Jing Liu, Ye Wang

发表机构 * Toshiaki Koike-Akino Jing Liu Ye Wang

AI总结 提出EinSort方法,通过索引排序发现张量中的低秩结构,实现大语言模型权重和KV缓存的张量化压缩,相比基线方法提升了重构质量。

Comments 38 pages, 17 figures

详情
AI中文摘要

张量网络为压缩大型神经网络提供了高效的表示。通过精心设计形状和拓扑,它们可以显著减少内存和计算成本。然而,由于大型基础模型的巨大规模和非结构化的权重分布,识别其中的隐式低秩结构仍然具有挑战性。我们提出了一种自适应张量化方法,通过索引排序发现目标张量中的固有低秩结构。在权重和KV缓存压缩上的实验表明,与基线方法相比,重构质量得到了提升。

英文摘要

Tensor networks provide efficient representations for compressing large neural networks. By carefully designing shapes and topologies, they can significantly reduce memory and computational costs. However, identifying implicit low-rank structures in large foundation models remains challenging due to their enormous scale and unstructured weight distributions. We propose an adaptive tensorization method that discovers inherent low-rank structure in a target tensor by index ordering. Experiments on weight and KV-cache compression demonstrate improved reconstruction quality compared to baselines.

2606.08574 2026-06-09 cs.LG cs.CV 新提交

OrderDP: A Theoretically Guaranteed Lossless Dynamic Data Pruning Framework

OrderDP:一种理论上保证无损的动态数据剪枝框架

Chenhan Jin, Shengze Xu, Qingsong Wang, Fan Jia, Dingshuo Chen, Tieyong Zeng

发表机构 * The Chinese University of Hong Kong(香港中文大学) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学) Guangzhou Nanfang College(广州南方学院) Institute of Automation, Chinese Academy of Sciences(中国科学院自动化研究所) Xiangtan University(湘潭大学) University of Utah(犹他大学)

AI总结 提出OrderDP框架,通过随机子集选取与top-q样本选择实现无偏梯度估计,提供收敛性和泛化性理论保证,在CIFAR和ImageNet上降低40%训练成本且保持精度。

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

数据剪枝(DP)作为一种常被提及的减轻训练负担的策略,根据定义明确的剪枝方法减少训练样本数量,同时力求实现近乎无损的性能。然而,现有方法通常选择信息量大的样本,与全数据集训练相比可能导致有偏的梯度估计。此外,这种偏差及其对最终性能的影响分析仍不明确。为解决这些问题,我们提出OrderDP,一个即插即用的框架,旨在获得稳定、无偏且近乎无损的训练加速,并具有理论保证。具体而言,OrderDP首先随机选择一个子集,然后选择前$q$个样本,其中相对于代理损失建立无偏性。这确保了OrderDP在代理目标方面进行无偏训练。我们进一步建立了收敛性和泛化性分析,阐明了OrderDP如何影响最优性能,并在保证最终性能的同时实现良好控制的加速。实验上,我们在CIFAR-10、CIFAR-100和ImageNet-1K上对OrderDP与全面基线进行了评估,展示了具有竞争力的精度、稳定的收敛和精确的控制——所有这些都通过更简单的设计和更快的运行时间实现,同时将训练成本降低超过40%。我们的方法兼具强性能和计算效率,为数据高效学习提供了一个稳健且易于适应的工具。代码公开于https://github.com/shengze-xu/OrderDP。

英文摘要

Data pruning (DP), as an oft-stated strategy to alleviate heavy training burdens, reduces the volume of training samples according to a well-defined pruning method while striving for near-lossless performance. However, existing approaches, which commonly select highly informative samples, can lead to biased gradient estimation compared to full-dataset training. Furthermore, the analysis of this bias and its impact on final performance remains ambiguous. To address these challenges, we propose OrderDP, a plug-and-play framework that aims to obtain stable, unbiased, and near-lossless training acceleration with theoretical guarantees. Specifically, OrderDP first randomly selects a subset and then chooses the top-$q$ samples, where unbiasedness is established with respect to a surrogate loss. This ensures that OrderDP conducts unbiased training in terms of the surrogate objective. We further establish convergence and generalization analyses, elucidating how OrderDP affects optimal performance and enables well-controlled acceleration while ensuring guaranteed final performance. Empirically, we evaluate OrderDP against comprehensive baselines on CIFAR-10, CIFAR-100, and ImageNet-1K, demonstrating competitive accuracy, stable convergence, and exact control -- all with a simpler design and faster runtime, while reducing training cost by over 40%. Delivering both strong performance and computational efficiency, our method serves as a robust and easily adaptable tool for data-efficient learning. The code is publicly available at https://github.com/shengze-xu/OrderDP.
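
The two-stage selection described in the abstract (draw a random subset, then keep the top-$q$ samples from it) can be sketched as below. The abstract does not say which score the ranking uses; ranking by per-sample loss is an assumption made here, as are the function and parameter names. The surrogate-loss argument for unbiasedness is not reproduced.

```python
import random


def two_stage_select(losses, subset_size, q, rng):
    """Return indices of the q highest-loss samples within a random subset."""
    # Stage 1: uniform random subset of candidate indices.
    candidates = rng.sample(range(len(losses)), subset_size)
    # Stage 2: keep the top-q candidates, ranked by loss (assumed score).
    ranked = sorted(candidates, key=lambda i: losses[i], reverse=True)
    return ranked[:q]
```

A call such as `two_stage_select(losses, subset_size=256, q=64, rng=random.Random(0))` would be made once per step; the random first stage is what keeps low-scoring samples from being excluded permanently.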

2606.08584 2026-06-09 cs.LG 新提交

Convolutional Sparse Coding via the Locally Competitive Algorithm on Loihi 2

基于Loihi 2的局部竞争算法实现卷积稀疏编码

Geoffrey Kasenbacher, Daniel Ruepp, Gerrit A. Ecke

发表机构 * Mercedes-Benz AG(梅赛德斯-奔驰集团) Institut für Robotik und Kognitive Systeme, Universität zu Lübeck(吕贝克大学机器人与认知系统研究所)

AI总结 本文在Loihi 2神经形态芯片上实现了卷积稀疏编码的局部竞争算法,并与GPU基线对比,展示了其在结构化稀疏推理中的可行性和优势。

详情
AI中文摘要

稀疏编码通过将输入表示为仅少量基函数的线性组合,为信号表示提供了一个原则性框架。局部竞争算法(LCA)因其动力学特性(泄漏积分、阈值化和侧向抑制)自然映射到神经形态硬件,在神经形态计算中特别有吸引力。虽然先前的工作已在Loihi 2上研究了非卷积LCA,但卷积设置尤其令人感兴趣,因为它引入了空间结构、权重共享、重叠感受野和缩放行为,这些更代表实际的稀疏推理工作负载。在这项工作中,我们提出了通过LCA在Loihi 2上实现卷积稀疏编码,并在相同的推理问题上与传统的GPU基线进行了评估。该实现遵循单层循环LCA公式,并将其扩展到具有从成对滤波器相互作用导出的局部抑制核的卷积特征图。据我们所知,这是Loihi 2上卷积LCA的首次实现和基准测试。我们的目标不仅是证明可行性,而且还要阐明在何种操作条件下卷积稀疏推理在神经形态硬件上变得有吸引力。由此产生的研究将卷积LCA定位为新兴神经形态系统上结构化稀疏推理的有用基准。

英文摘要

Sparse coding provides a principled framework for signal representation by expressing an input as a linear combination of only a small number of basis functions. The Locally Competitive Algorithm (LCA) is particularly attractive in the context of neuromorphic computing because its dynamics (leaky integration, thresholding, and lateral inhibition) map naturally to neuromorphic hardware. While prior work has studied non-convolutional LCA on Loihi 2, the convolutional setting is of particular interest because it introduces spatial structure, weight sharing, overlapping receptive fields, and scaling behavior that are more representative of practical sparse inference workloads. In this work, we present a Loihi 2 implementation of convolutional sparse coding via the LCA and evaluate it against a conventional GPU baseline on the same inference problems. The implementation follows a one-layer recurrent LCA formulation and extends it to convolutional feature maps with local inhibitory kernels derived from pairwise filter interactions. To the best of our knowledge, this is the first implementation and benchmark of convolutional LCA on Loihi 2. Our goal is not only to demonstrate feasibility, but also to clarify in which operating regimes convolutional sparse inference becomes attractive on neuromorphic hardware. The resulting study positions convolutional LCA as a useful benchmark for structured sparse inference on emerging neuromorphic systems.
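
The three ingredients named in the abstract (leaky integration, thresholding, lateral inhibition) appear directly in the standard non-convolutional LCA update, sketched here with a forward-Euler step. This is the textbook dense formulation, not the paper's convolutional Loihi 2 implementation; the step size, threshold, and function names are assumptions, and the dictionary atoms are assumed to have unit norm.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def soft_threshold(u, lam):
    # Thresholding: the activation is zero until |u| exceeds lam.
    if u > lam:
        return u - lam
    if u < -lam:
        return u + lam
    return 0.0


def lca(x, dictionary, lam=0.1, step=0.1, n_steps=200):
    """Sparse code of x over `dictionary` (a list of unit-norm atoms)."""
    n_atoms = len(dictionary)
    drive = [dot(atom, x) for atom in dictionary]                # feed-forward input
    gram = [[dot(a, b) for b in dictionary] for a in dictionary]  # atom overlaps
    u = [0.0] * n_atoms                                           # membrane potentials
    for _ in range(n_steps):
        act = [soft_threshold(ui, lam) for ui in u]
        u = [
            # Leaky integration (-u[i]) plus lateral inhibition from active atoms.
            u[i] + step * (
                drive[i] - u[i]
                - sum(gram[i][j] * act[j] for j in range(n_atoms) if j != i)
            )
            for i in range(n_atoms)
        ]
    return [soft_threshold(ui, lam) for ui in u]
```

In the convolutional setting described in the abstract, the dense `gram` matrix would be replaced by local inhibitory kernels derived from pairwise filter interactions, so each unit only inhibits its spatial neighbors.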

2606.08635 2026-06-09 cs.LG cs.DC 新提交

SpectrumKV: Per-Token Mixed-Precision KV Cache Transfer for Prefill-Decode Disaggregated LLM Serving

SpectrumKV: 面向预填充-解码分离式LLM服务的逐令牌混合精度KV缓存传输

Yang Pengju

发表机构 * GitHub

AI总结 针对预填充-解码分离架构中KV缓存传输开销大的问题,提出SpectrumKV,通过为每个令牌分配不同精度(FP16/INT8/INT4)实现混合精度传输,并设计轻量部署探测自适应选择精度策略,在相同传输预算下显著提升模型质量并降低TTFT。

Comments 28 pages,13 figures,8 tables

详情
AI中文摘要

预填充-解码(PD)分离将提示处理与令牌生成解耦,但也使键值(KV)缓存成为网络负载。现有的PD端KV缩减方法大多是二元的:选中的令牌以全精度传输,其余则不传输。本文认为二元选择留下了一个有用的设计空间未被利用。SpectrumKV为每个令牌分配一个精度级别:注意力汇聚点和其他高重要性令牌以FP16保护,中等重要性令牌以INT8发送,低重要性令牌在模型可容忍时以INT4发送。主要的实际复杂性在于INT4容忍度是模型相关的。Qwen2.5-7B在INT4 KV量化下灾难性失败,而Mistral-7B和Gemma-2-9B保持稳定。因此,SpectrumKV运行一个轻量级的部署时探测:在三级策略下进行三次激进的NIAH试验。通过的模型使用FP16+INT8+INT4;失败的模型回退到FP16+INT8。在Qwen2.5-7B-Instruct、Mistral-7B-Instruct-v0.3和Gemma-2-9B-it上,SpectrumKV在相同传输预算下提高了质量。在WikiText-2上,归一化KV预算为50%时,SpectrumKV分别将困惑度改变+1.97%、-0.06%和-0.44%,而PDTrim为+25.85%、+22.07%和+35.63%。在4096令牌的NIAH检索中,自适应策略在激进预算b=0.3下对Qwen达到52.6%,而PDTrim为26.3%,并在b=0.5时达到100%;Mistral和Gemma在三级策略下保持检索性能。传输路径的端到端GPU计时显示,在b=0.5时TTFT降低50-62%。这些结果表明,PD KV传输应被视为精度分配问题,而不仅仅是令牌剪枝。

英文摘要

Prefill-decode (PD) disaggregation decouples prompt processing from token generation, but it also turns the key-value (KV) cache into a network payload. Existing PD-side KV reduction methods are mostly binary: selected tokens are transmitted at full precision and the rest are not transmitted. This paper argues that binary selection leaves a useful design space unused. SpectrumKV assigns a precision level to each token instead: attention sinks and other high-importance tokens are protected at FP16, medium-importance tokens are sent at INT8, and low-importance tokens are sent at INT4 when the model can tolerate it. The main practical complication is that INT4 tolerance is model-dependent. Qwen2.5-7B catastrophically fails under INT4 KV quantization, while Mistral-7B and Gemma-2-9B remain stable. SpectrumKV therefore runs a lightweight deployment-time probe: three aggressive NIAH trials under a 3-tier policy. Models that pass use FP16+INT8+INT4; models that fail fall back to FP16+INT8. Across Qwen2.5-7B-Instruct, Mistral-7B-Instruct-v0.3, and Gemma-2-9B-it, SpectrumKV improves quality at the same transfer budget. At a 50% normalized KV budget on WikiText-2, SpectrumKV changes perplexity by +1.97%, -0.06%, and -0.44%, respectively, compared with PDTrim's +25.85%, +22.07%, and +35.63%. On NIAH retrieval at 4096 tokens, the adaptive policy reaches 52.6% on Qwen at the aggressive b=0.3 budget versus 26.3% for PDTrim, and reaches 100% by b=0.5; Mistral and Gemma preserve retrieval under the 3-tier policy. End-to-end GPU timing of the transfer path shows 50-62% TTFT reductions at b=0.5. These results suggest that PD KV transfer should be treated as a precision-allocation problem, not only as token pruning.
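
The per-token tiering and the 3-tier versus 2-tier fallback can be illustrated with a small allocation routine. It is a sketch under stated assumptions, not SpectrumKV's policy: the importance scores, the tier fractions, the `int4_ok` flag standing in for the probe outcome, and the definition of the normalized budget as average bits divided by 16 are all hypothetical choices made here.

```python
BITS = {"FP16": 16, "INT8": 8, "INT4": 4}


def assign_precision(importance, frac_fp16, frac_int8, int4_ok):
    """Map each token to a precision tier by descending importance."""
    n = len(importance)
    order = sorted(range(n), key=lambda t: importance[t], reverse=True)
    n_fp16 = round(frac_fp16 * n)
    n_int8 = round(frac_int8 * n)
    tiers = [None] * n
    for rank, token in enumerate(order):
        if rank < n_fp16:
            tiers[token] = "FP16"   # most important tokens stay at full precision
        elif rank < n_fp16 + n_int8 or not int4_ok:
            tiers[token] = "INT8"   # middle tier, and the fallback when INT4 is unsafe
        else:
            tiers[token] = "INT4"
    return tiers


def normalized_budget(tiers):
    # Transferred bits relative to sending every token at FP16 (assumed definition).
    return sum(BITS[t] for t in tiers) / (16 * len(tiers))
```

Note that in this sketch the fallback to two tiers raises the transferred volume for the same fractions; how the paper keeps the budget fixed across policies is not stated in the abstract.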

2606.08962 2026-06-09 cs.LG cs.CV cs.RO 新提交

C$^3$ache: Accelerating World Action Models with Cross Inference Chunk Cache

C$^3$ache: 利用跨推理块缓存加速世界动作模型

Weisen Zhao, Lam Nguyen, Zhicong Lu, Yuzhang Shang

发表机构 * George Mason University(乔治梅森大学) University of Central Florida(中佛罗里达大学)

AI总结 提出C$^3$ache方法,通过跨推理块缓存和重用去噪残差,加速世界动作模型推理,实现高达2.5倍加速且任务成功率几乎无损。

详情
AI中文摘要

世界动作模型(WAM)比标准的视觉-语言-动作(VLA)策略在新型运动和环境中具有更好的泛化能力,因为视频建模目标使其能够从大量未标记视频中学习,而不是依赖稀缺的标记机器人演示。这种泛化能力计算成本高昂。为了完成一个任务,WAM需要运行多个推理块,每个块都需要一个昂贵的去噪过程。现有的加速方法通过在一个块的去噪轨迹内缓存和重用计算来降低这一成本。我们的实证分析揭示了它们忽略的一个重要的冗余来源:块间的冗余。当机器人执行平滑行为时,在给定去噪步骤计算的残差从一个块到下一个块高度相关。我们引入了C$^3$ache,一种无需训练的方法,它在相同去噪步骤的推理块之间缓存和重用这些残差。在基于Fast-WAM骨干的基准测试上的实验表明,C$^3$ache在总墙钟推理时间上实现了高达2.5倍的加速,而任务成功率几乎没有下降。

英文摘要

World Action Models (WAMs) generalize better than standard Vision-Language-Action (VLA) policies to novel motions and environments, because a video-modeling objective lets them learn from abundant unlabeled video rather than scarce labeled robot demonstrations. This generalization is computationally expensive. To complete a task, a WAM runs over multiple inference chunks, and each chunk requires a costly denoising process. Existing acceleration methods reduce this cost by caching and reusing computation within a single chunk's denoising trajectory. Our empirical analysis reveals a substantial source of redundancy they overlook: redundancy across chunks. When a robot executes a smooth behavior, the residuals computed at a given denoising step are strongly correlated from one chunk to the next. We introduce C$^3$ache, a training-free method that caches and reuses these residuals across inference chunks at the same denoising step. Experiments on benchmarks with a Fast-WAM backbone show that C$^3$ache achieves up to a $2.5\times$ speedup in total wall-clock inference time, with negligible degradation in task success rate.

2606.09012 2026-06-09 cs.LG cs.AI math.OC stat.ML 新提交

Understanding Quantization-Aware Training: Gradients at Quantized Weights Bias to the Low-Loss Basin

理解量化感知训练:量化权重的梯度偏向低损失盆地

Hanyang Li, Jianhao Ma, Ying Cui

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出统一几何框架解释后训练量化失败与量化感知训练恢复机制,揭示量化感知训练通过梯度感知谷壁使量化点返回低损失盆地。

Comments 31 pages, 10 figures

详情
AI中文摘要

后训练量化(PTQ)将训练好的全精度模型转换为低比特权重,无需任务级重训练,而量化感知训练(QAT)将量化纳入训练循环。尽管PTQ在中等比特宽度下高效且通常准确,但在激进比特宽度下可能急剧失败;QAT成本更高但通常能恢复丢失的精度。我们提出了一个统一的几何框架,同时解释PTQ失败和QAT恢复。我们将全精度训练建模为在更宽的“山谷”内沿着低损失“河流”:河流的法向邻域形成近乎平坦的“盆地”,而离开该盆地会导致损失急剧增加。当量化网格与盆地宽度相当时,局部PTQ目标(包括舍入和基于Hessian的二阶重建)可能选择盆地外的高损失部署量化点,即使附近存在低损失量化点。在这种情况下,基于直通估计器的QAT具有有用的偏差:它在部署的量化权重处评估梯度,同时更新潜在的全精度权重,导致梯度感知谷壁并获得向内分量,从而将后续量化迭代引导回盆地。我们通过局部景观模型形式化这一机制,构造了几何PTQ失败模式,并在局部量化器兼容性假设下证明了有限时间QAT恢复。在多种神经网络量化方案下的视觉和语言模型实验,证实了预测的PTQ跨盆地失败以及相应的QAT恢复机制。

英文摘要

Post-training quantization (PTQ) converts a trained full-precision model into low-bit weights without task-level retraining, while quantization-aware training (QAT) incorporates quantization into the training loop. Although PTQ is efficient and often accurate at moderate bitwidths, it can fail sharply at aggressive bitwidths; QAT is more expensive but can often recover the lost accuracy. We propose a unified geometric framework that explains both PTQ failure and QAT recovery. We model full-precision training as following a low-loss "river" inside a wider "valley": a normal neighborhood of the river forms a nearly flat "basin", while leaving this basin incurs a sharp loss increase. When the quantization grid is comparable to the basin width, local PTQ objectives, including rounding and Hessian-based second-order reconstruction, can select a high-loss deployed quantized point outside the basin even when nearby low-loss quantized points exist. In this regime, straight-through-estimator-based QAT has a useful bias: it evaluates gradients at the deployed quantized weights while updating latent full-precision weights, causing the gradient to sense the valley wall and acquire an inward component that steers subsequent quantized iterates back into the basin. We formalize this mechanism through a local landscape model, construct a geometric PTQ failure mode, and prove finite-time QAT recovery under local quantizer-compatibility assumptions. Experiments across vision and language models under multiple neural-network quantization schemes corroborate the predicted basin-crossing failure of PTQ and the corresponding recovery mechanism of QAT.
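
The straight-through-estimator update described in the abstract, where the gradient is evaluated at the quantized weight but applied to the latent full-precision weight, can be shown on a one-dimensional toy problem. The uniform rounding quantizer, the quadratic loss in the usage note, and the function names are assumptions chosen for illustration; the sketch does not reproduce the paper's landscape model.

```python
def quantize(w, step):
    # Uniform quantizer: snap w to the nearest multiple of `step`.
    return step * round(w / step)


def qat_ste_step(w_latent, grad_fn, step, lr):
    w_q = quantize(w_latent, step)   # deployed (quantized) weight
    grad = grad_fn(w_q)              # gradient evaluated at the quantized point
    return w_latent - lr * grad      # STE: apply it unchanged to the latent weight


def run_qat(w_latent, grad_fn, step, lr, n_steps):
    for _ in range(n_steps):
        w_latent = qat_ste_step(w_latent, grad_fn, step, lr)
    return w_latent
```

With the toy loss `(w - 1.0) ** 2`, a grid step of 1.0, and a start at 3.4, calling `run_qat(3.4, lambda w: 2 * (w - 1.0), 1.0, 0.1, 20)` moves the latent weight until its quantized value lands on the grid point 1.0, where the gradient at the deployed weight vanishes and the updates stop.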

2606.09175 2026-06-09 cs.LG cs.AI cs.DC 新提交

CANS: Accelerating Multiuser Collaborative Edge Inference via Cooperative Autodidactic NeuroSurgeon

CANS: 通过合作自教神经外科加速多用户协同边缘推理

Zheshun Wu, Ziyang Zhang, Changyao Lin, Zenglin Xu, Jie Liu

发表机构 * Harbin Institute of Technology Shenzhen(哈尔滨工业大学(深圳)) Politecnico di Milano(米兰理工大学) Harbin Institute of Technology(哈尔滨工业大学) Fudan University(复旦大学) Shanghai Academy of Artificial Intelligence for Science(上海人工智能科学研究院)

AI总结 提出CANS框架,利用FedLinUCB-DW算法让异构设备自适应学习最优DNN分区,通过共享在线推理反馈和离线经验加速多用户边缘协同推理,显著降低延迟。

Comments 24 pages, 14 figures, 5 tables, submitted for possible journal publication

详情
AI中文摘要

最近,移动边缘计算(MEC)支持的协作深度神经网络(DNN)推理已成为向资源受限的移动设备提供智能服务的一种有前景的方法。一个代表性场景是多用户协同边缘推理,其中不同设备独立地划分其DNN模型,并通过无线网络将后端计算卸载到公共边缘服务器。然而,由于未知且时变的系统条件(包括波动的无线链路和多样的设备能力),确定每个设备的最优DNN分区具有挑战性。为解决此问题,我们提出了合作自教神经外科(CANS),一种协同边缘推理框架,使设备能够通过在线推理期间共享信息反馈来自适应学习最优DNN分区。为处理设备异构性并更好地利用离线推理经验,我们集成了一种新颖的FedLinUCB-DW算法,该算法将相同类型的设备分组,并使用本地离线早期退出推理经验来热启动在线探索。此外,我们通过推导遗憾上界为FedLinUCB-DW提供了理论保证。我们还在模拟环境和硬件原型系统上验证了我们的方法。实证评估表明,与最先进的基线相比,CANS实现了更低的推理延迟。特别是在两个边缘设备的原型实验中,所提出的CANS相比非合作基线将平均推理延迟降低了高达50%。

英文摘要

Recently, mobile edge computing (MEC)-enabled collaborative deep neural network (DNN) inference has emerged as a promising approach for delivering intelligent services to resource-constrained mobile devices. A representative scenario is multi-user collaborative edge inference, where distinct devices independently partition their DNN models and offload backend computation to a common edge server over wireless networks. However, determining the optimal DNN partition for each device is challenging due to unknown and time-varying system conditions, including fluctuating wireless links and diverse device capabilities. To address this problem, we propose Cooperative Autodidactic NeuroSurgeon (CANS), a collaborative edge inference framework that enables devices to adaptively learn optimal DNN partitions by sharing informative feedback during online inference. To handle device heterogeneity and better leverage offline inference experience, we integrate a novel FedLinUCB-DW algorithm that groups devices of the same type and warm-starts online exploration using local offline early-exit inference experience. Furthermore, we provide theoretical guarantees for FedLinUCB-DW by deriving a regret upper bound. We also validate our method on both a simulated environment and a hardware prototype system. Empirical evaluations demonstrate that CANS achieves lower inference latency than state-of-the-art baselines. In particular, in prototype experiments on two edge devices, CANS reduces average inference latency by up to 50% compared to the non-cooperative baseline.

2606.09312 2026-06-09 cs.LG cs.PL 新提交

Toward Compiler World Models: Learning Latent Dynamics for Efficient Tensor Program Search

迈向编译器世界模型:学习潜在动态以实现高效张量程序搜索

Haolin Pan, Lianghong Huang, Xvlin Zhou, Mingjie Xing, Yanjun Wu

发表机构 * Hangzhou Institute for Advanced Study, University of Chinese Academy of Sciences(中国科学院大学杭州高等研究院) Institute of Software, Chinese Academy of Sciences(中国科学院软件研究所) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出一种受世界模型启发的评估器,通过轻量级过渡模型在连续潜在空间中展开调度动作,避免昂贵AST变异和重复编码,在TVM AutoScheduler中实现比Ansor更优的延迟和测量效率。

详情
AI中文摘要

张量程序优化对现代机器学习系统至关重要,但其搜索空间巨大。现有的自动调度器通过学习成本模型来降低测量成本,但它们通常将每个候选视为静态代码快照,忽略了产生它的调度轨迹。这使得它们对动作依赖不敏感,且易受表面代码变化影响。我们提出一种受世界模型启发的评估器,将调度评估建模为程序状态上的动作条件潜在动态。从初始程序开始,它使用轻量级过渡模型在连续潜在空间中展开调度动作,避免了昂贵的AST变异和重复代码编码。最终的动态表示与动作和硬件特征结合以对候选进行排序。在TVM AutoScheduler中实现后,我们的方法在相同64次试验预算下,GPU上代表性子图延迟比Ansor提升1.37倍,CPU上提升1.54倍。它还在使用10倍更少测量次数的情况下,在2.2%几何平均内匹配Ansor-10K,并将完整模型推理速度提升至PyTorch/PyTorch-opt(cuDNN)的4.61倍/3.67倍几何平均。

英文摘要

Tensor program optimization is essential for modern machine learning systems, but its search space is enormous. Existing auto-schedulers reduce measurement cost with learned cost models, yet they usually evaluate each candidate as a static code snapshot, ignoring the schedule trajectory that produced it. This makes them insensitive to action dependencies and vulnerable to superficial code variations. We propose a world-model-inspired evaluator that models schedule evaluation as action-conditioned latent dynamics over program states. Starting from the initial program, it rolls out scheduling actions in a continuous latent space with a lightweight transition model, avoiding expensive AST mutation and repeated code encoding. The final dynamic representation is combined with action and hardware features to rank candidates. Implemented in TVM AutoScheduler, our method improves representative-subgraph latency over Ansor by 1.37$\times$ on GPU and 1.54$\times$ on CPU under the same 64-trial budget. It also matches Ansor-10K within 2.2% geometric mean using 10$\times$ fewer measurements, and accelerates full-model inference over PyTorch/PyTorch-opt(cuDNN) by 4.61$\times$/3.67$\times$ geometric mean.

2606.09388 2026-06-09 cs.LG 新提交

Distilling Safe LLM Systems via Soft Prompts for On Device Settings

通过软提示蒸馏安全的设备端LLM系统

Motasem Alfarra, Cristina Pinneri, Dana Kianfar, Mohammed Almousa, Christos Louizos

发表机构 * Qualcomm AI Research(高通人工智能研究院)

AI总结 针对资源受限设备上部署安全大语言模型(LLM)的挑战,提出基于软提示与蒸馏训练的安全对齐方法,在最小化额外计算开销的同时实现优越的安全-有用性权衡。

Comments Accepted to UAI 2026

详情
Journal ref
42nd Conference on Uncertainty in Artificial Intelligence 2026
AI中文摘要

在资源受限的边缘设备上部署安全的大语言模型(LLM)面临关键挑战:虽然将LLM与防护模型结合的双模型系统能提供有效的安全保障,但其巨大的内存和计算需求使其在设备端部署中代价高昂。本文对资源受限环境下的参数高效安全对齐方法进行了全面研究。通过对多种LLM架构、训练目标和参数高效微调方法的系统评估,我们发现软提示与基于蒸馏的训练相结合始终优于其他方法。我们引入了基于总变差和KL散度的蒸馏框架,能够有效将防护模型的安全行为迁移到学习到的软提示中。我们在多个基准上的评估表明,与LoRA适配器、引导向量和直接优化方法相比,这种组合在安全-有用性权衡上表现更优,同时在推理时仅需极少的额外内存和计算。这些发现确立了软提示蒸馏作为设备端LLM部署中安全对齐的首选方法。

英文摘要

Deploying safe large language models (LLMs) on resource-constrained edge devices presents a critical challenge: while dual-model systems combining LLMs with guard models provide effective safety guarantees, their substantial memory and computational demands make them prohibitively expensive for on-device deployment. This paper presents a comprehensive study of parameter-efficient safety alignment methods for resource-constrained settings. Through systematic evaluation across multiple LLM architectures, training objectives, and parameter-efficient fine-tuning approaches, we identify that soft prompts combined with distillation-based training consistently outperform alternative methods. We introduce distillation frameworks based on total variation and KL divergence that effectively transfer safety behaviors from guard models into learned soft prompts. Our evaluations on various benchmarks demonstrate that this combination achieves superior safety-usefulness trade-offs compared to LoRA adapters, steering vectors, and direct optimization methods, while requiring minimal additional memory and compute at inference time. These findings establish soft prompt distillation as the preferred approach for safety alignment in on-device LLM deployment.

2606.09456 2026-06-09 cs.LG 新提交

Breaking the Tokenizer Barrier: On-Policy Distillation across Model Families

打破分词器壁垒:跨模型系列的在线策略蒸馏

Yifan Niu, Han Xiao, Dongyi Liu, Zelong Wang, Dihong Gong, Yasheng Wang, Jia Li

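Affiliations The Hong Kong University of Science and Technology (Guangzhou); Tencent; The Hong Kong University of Science and Technology

AI summary Proposes cross-tokenizer on-policy distillation, in which a precise token-mapping algorithm lets the teacher's probability-distribution signal propagate across different tokenizers, with significantly better compute efficiency than baselines.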

详情
AI中文摘要

在线策略蒸馏(OPD)已成为大型语言模型(LLM)后训练中从领域专家向学生模型迁移知识的核心技术。然而,现有的OPD蒸馏方法要求教师和学生模型共享相同的分词器,限制了OPD在模型系列内的适用性。当前主流实践通常采用在教师生成的响应上进行监督微调(SFT)来实现跨分词器蒸馏,这未能捕捉到嵌入在教师概率分布中的丰富知识。在这项工作中,我们使标准的在线策略蒸馏方法能够跨模型系列运行,确保高保真的token级信号可以通过精确的token映射算法在不同分词器之间传播。大量实验表明,在各种基准测试上,跨分词器OPD在计算效率上显著优于基线方法。我们的结果为OPD解锁了更广泛的教师-学生配对,为适应和增强LLM之间的交互开辟了新途径。

英文摘要

On-Policy Distillation (OPD) has become a core technique in the post-training of Large Language Models (LLMs) for transferring knowledge from domain experts to student models. However, existing OPD methods require teacher and student models to share the same tokenizer, restricting OPD to teacher-student pairs within the same model family. Current mainstream practice typically falls back on Supervised Fine-Tuning (SFT) on teacher-generated responses for cross-tokenizer distillation, which fails to capture the rich knowledge embedded in the teacher's probability distribution. In this work, we enable the standard on-policy distillation method to operate across model families, ensuring that high-fidelity token-level signals can propagate across different tokenizers with a precise token-mapping algorithm. Extensive experiments show that cross-tokenizer OPD is significantly more compute-efficient than baselines on various benchmarks. Our results unlock a broader range of teacher-student pairs for OPD, opening up new avenues for adapting and enhancing interactions between LLMs.

2606.09471 2026-06-09 cs.LG cs.CL 新提交

Escaping the KL Agreement Trap in On-Policy Distillation

逃离在线策略蒸馏中的KL一致陷阱

Haoran Xin, Anhao Zhao, Ying Sun, Jin Li, Xiaoyu Shen, Hui Xiong

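Affiliations The Hong Kong University of Science and Technology (Guangzhou); The Hong Kong University of Science and Technology; The Hong Kong Polytechnic University; Eastern Institute of Technology, Ningbo

AI summary Addresses the weak training signal that arises in on-policy distillation when the student falls into a low-KL agreement trap, and proposes KAT, a dynamic termination rule that filters weak supervision, improving avg@k by 2.66% and pass@k by 3.43% on mathematical benchmarks while reducing rollout length by 59.73%.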

Comments 13 pages, 8 figures

详情
AI中文摘要

在线策略蒸馏(OPD)通过让教师对学生生成的rollout进行评分,提供密集的token级监督。然而,当学生漂移到不可恢复的前缀时,教师可能局部同意退化状态,产生低反向KL但几乎没有纠正训练信号。我们将这种持续状态识别为低KL一致陷阱。进一步分析表明,陷阱期间及之后的token产生的监督信号效用较低。我们提出KAT(KL一致陷阱终止),一种在线OPD终止规则,通过动态训练自适应阈值检测持续的低KL一致。通过过滤来自退化一致的弱监督,KAT在四个数学基准上将avg@k准确率提升2.66%,pass@k提升3.43%,同时将平均rollout长度减少59.73%。

英文摘要

On-policy distillation (OPD) provides dense token-level supervision by asking a teacher to score student-generated rollouts. However, when the student drifts into an unrecoverable prefix, the teacher may locally agree with the degraded state, producing low reverse KL but little corrective training signal. We identify this persistent regime as a low-KL agreement trap. Further analyses show that tokens during and after such traps produce less useful supervision signals. We propose KAT (KL Agreement Trap Termination), an online OPD termination rule that detects persistent low-KL agreement with a dynamic training-adaptive threshold. By filtering weak supervision from degenerate agreement, KAT improves avg@k accuracy by 2.66% and pass@k by 3.43% across four mathematical benchmarks, while reducing average rollout length by 59.73%.
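The idea of cutting a rollout once low-KL agreement persists can be sketched as follows. This is only an illustration under stated assumptions: the function name `truncate_at_trap`, the fixed `threshold`, and the `patience` window are hypothetical, whereas the abstract describes a dynamic, training-adaptive threshold whose exact form is not given.

```python
def truncate_at_trap(token_kls, threshold, patience):
    """Return the number of leading tokens to keep for training.

    The rollout is cut at the start of the first run of `patience`
    consecutive tokens whose reverse KL is below `threshold`.
    If no such run exists, the whole rollout is kept.
    """
    run = 0
    for i, kl in enumerate(token_kls):
        run = run + 1 if kl < threshold else 0
        if run >= patience:
            return i - patience + 1
    return len(token_kls)

# Hypothetical per-token reverse-KL values for one student rollout.
kls = [0.9, 0.7, 0.8, 0.02, 0.01, 0.03, 0.02, 0.6]
keep = truncate_at_trap(kls, threshold=0.05, patience=3)  # keeps the first 3 tokens
```

Requiring several consecutive low-KL tokens, rather than a single one, is one simple way to distinguish a persistent regime from an isolated easy token.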

2606.09514 2026-06-09 cs.LG 新提交

BUDDY: BUdget-Driven DYnamic Depth Routing for Adaptive Large Language Model Inference

BUDDY: 预算驱动的动态深度路由用于自适应大型语言模型推理

Yuhua Zhou, Shaoqi Yu, Shichao Weng, Changhai Zhou, Mingze Yin, Fei Yang, Aimin Pan

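Affiliations University of Science and Technology of China

AI summary Proposes the BUDDY framework, in which a lightweight decision module dynamically selects the top-k layers based on the input and reuses the KV cache to support adaptive routing at decode time, improving the accuracy-compute trade-off under strict budget control.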

详情
AI中文摘要

大型语言模型(LLMs)由于其深度和参数规模,推理成本高昂。深度剪枝可以通过跳过冗余的Transformer块来降低延迟,但现有方法(i)在用户指定的计算预算下提供的控制有限,(ii)通常固定路由路径,无法在解码过程中随着上下文增长而自适应。我们提出了BUDDY,一个预算驱动的动态深度路由框架。BUDDY使用轻量级决策模块根据输入对中间层进行评分,并确定性地执行top-k层以满足给定预算。为了支持解码时的自适应,BUDDY重用第一层的KV缓存作为低开销的全局上下文源,并在每次路由决策前将其与最新令牌表示合并。当未提供明确预算时,可选的预算预测器估计输入相关的计算水平以平衡质量和效率。在Llama系列和Qwen模型上的实验表明,BUDDY与强静态剪枝基线相比具有竞争力,并且通常能改善精度-计算权衡,同时独特地支持严格预算控制、解码时重路由以及单个训练模型内的多个预算。

英文摘要

Large language models (LLMs) incur high inference cost due to their depth and parameter scale. Depth pruning can reduce latency by skipping redundant Transformer blocks, but existing methods (i) provide limited control under user-specific compute budgets and (ii) typically fix the routing path, failing to adapt as the context grows during decoding. We propose Buddy, a budget-driven dynamic depth routing framework. Buddy uses a lightweight Decision Module to score intermediate layers conditioned on the input and deterministically executes the top-k layers to satisfy a given budget. To support decode-time adaptation, Buddy reuses the first-layer KV cache as a low-overhead global context source and pools it together with the newest token representation before each routing decision. When no explicit budget is provided, an optional Budget Predictor estimates an input-dependent compute level to balance quality and efficiency. Experiments on Llama-family and Qwen models show that Buddy is competitive with strong static pruning baselines and often improves the accuracy-compute trade-off, while uniquely supporting strict budget control, decode-time rerouting, and multiple budgets within a single trained model.
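Deterministic top-k layer selection under a budget can be sketched in a few lines. The function name `select_layers` and the example scores are hypothetical; in the paper the scores come from a learned Decision Module, which is not modeled here.

```python
def select_layers(scores, budget):
    """Pick the `budget` highest-scoring layers, returned in execution order."""
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:budget])

# Hypothetical per-layer scores for six intermediate layers.
scores = [0.2, 0.9, 0.1, 0.7, 0.4, 0.8]
active = select_layers(scores, budget=3)  # layers 1, 3, and 5 are executed
```

Because the number of executed layers equals `budget` exactly, the compute cost is fixed by the caller rather than by a learned threshold, which is what makes strict budget control possible.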

2606.09682 2026-06-09 cs.LG cs.DC cs.PF 新提交

AutoMegaKernel: A Statically-Checked Agent Harness for Self-Retargeting Megakernel Synthesis

AutoMegaKernel:用于自我重定目标超内核合成的静态检查代理框架

Jaber Jaber, Osama Jaber

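Affiliations RightNow AI

AI summary Presents AutoMegaKernel, a system that compiles Llama-family models into a single persistent CUDA kernel, uses a static schedule validator to certify deadlock-freedom and race-freedom, auto-generates correct megakernels for all 10 supported models, and with W8A16 precision beats cuBLAS bf16 on NVIDIA inference-class GPUs.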

Comments 18 pages, 5 figures. Open-source code, data, and agent harness: https://github.com/RightNow-AI/AutoMegaKernel

详情
AI中文摘要

AutoMegaKernel (AMK) 将HuggingFace Llama系列模型编译成一个持久的协作CUDA内核,该内核在一次启动中运行整个前向传播,无需为每个模型手写CUDA代码。其贡献在于系统本身,而非原始速度。一个冻结的调度IR验证器通过静态图检查(非机械化证明)静态地认证无死锁和无竞争,因此不安全的智能体提议调度在启动前被拒绝:在7,160个对抗性调度(6,091个不安全)中,它实现了零误接受,并接受了所有360个实际底层实现。同一源代码可重定目标至sm_80/sm_90/sm_120,从单一代码库自动为10个支持模型中的全部生成正确的超内核,并在真实的SmolLM2-135M检查点上重现HuggingFace贪婪解码逐token匹配(困惑度差异2.5e-7)。一个无人值守、智能体驱动的自动研究循环在其自身基线之上自我改进超内核(1.25-1.72倍)。一个搜索发现的int8 (W8A16) 超内核在NVIDIA数据中心推理集群的batch-1解码中击败了CUDA图化的cuBLAS bf16:L4最高1.33倍,当前一代L40S 1.25-1.27倍,A10G大规模最高1.08倍,以及消费级RTX 5090 1.19-1.23倍。排序并非带宽的简单函数(864 GB/s的L40S击败了600 GB/s的A10G);分界线是推理级与训练级。AMK在高带宽训练级A100/H100上落后于cuBLAS,其中框架定位了跨SM同步瓶颈;我们坦率地报告了这一差距。这是解码位置0处精度不对称(W8A16 vs bf16)的比较;最大的真实检查点是TinyLlama-1.1B。代码和框架:https://github.com/RightNow-AI/AutoMegaKernel

英文摘要

AutoMegaKernel (AMK) compiles a HuggingFace Llama-family model into a single persistent cooperative CUDA kernel that runs the whole forward pass in one launch, with no per-model hand-written CUDA. The contribution is the system, not raw speed. A frozen schedule-IR validator statically certifies deadlock-freedom and race-freedom via static graph checks (not a mechanized proof), so an unsafe agent-proposed schedule is rejected before launch: across 7,160 adversarial schedules (6,091 unsafe) it had zero false-accepts and accepted all 360 real lowerings. The same source retargets sm_80/sm_90/sm_120 from one codebase, auto-generates correct megakernels for 10 of 10 supported models, and on a real SmolLM2-135M checkpoint reproduces HuggingFace greedy decode token-for-token (perplexity match 2.5e-7). An unattended, agent-drivable autoresearch loop self-improves the megakernel over its own baseline (1.25-1.72x). A search-found int8 (W8A16) megakernel beats CUDA-graphed cuBLAS bf16 at batch-1 decode across NVIDIA's datacenter inference fleet: L4 up to 1.33x, the current-gen L40S 1.25-1.27x, A10G up to 1.08x at scale, and the consumer RTX 5090 1.19-1.23x. The ordering is not a clean function of bandwidth (the 864 GB/s L40S beats the 600 GB/s A10G); the divide is inference-class vs training-class. AMK trails cuBLAS on the high-bandwidth training-class A100/H100, where the harness localizes the cross-SM-sync bottleneck; we report the gap plainly. This is a precision-asymmetric (W8A16 vs bf16) comparison at decode position 0; the largest real checkpoint is TinyLlama-1.1B. Code and the harness: https://github.com/RightNow-AI/AutoMegaKernel

2606.09707 2026-06-09 cs.LG cs.CL 新提交

BrainSurgery: Reproducible and Reliable Declarative Weight Manipulations for Model Editing and Upcycling

BrainSurgery:用于模型编辑和升级的可复现且可靠的声明式权重操作

Gianluca Barmina, Annemette Broch Pirchert, Andrea Blasi Núñez, Lukas Galke Poech, Peter Schneider-Kamp

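Affiliations University of Southern Denmark

AI summary Presents BrainSurgery, a tool that performs robust, reproducible tensor operations on neural network checkpoints through declarative YAML plans, supporting structural modifications, mathematical transformations, and tensor reshaping, with built-in assertions that prevent silent errors.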

详情
AI中文摘要

随着深度学习模型规模的扩大,管理、检查和修改大型检查点变得越来越具有挑战性。研究人员经常需要更改模型权重以进行层重构、精度转换、低秩分解和架构调试,但这些工作流程通常依赖于脆弱的临时Python脚本。在这里,我们介绍BrainSurgery,一个用于对神经网络检查点进行鲁棒且可复现的“张量手术”的工具,并提供一个系统演示,涵盖从模型升级到LoRA提取的四个示例和三个案例研究。通过抽象存储格式和内存管理,BrainSurgery通过声明式YAML计划执行复杂的转换。它支持通过表达性正则表达式和结构定位进行结构修改、数学变换和张量重塑,同时内置断言验证张量形状、数据类型和值,以防止静默错误。我们期望BrainSurgery通过其可复现且经过验证的操作,为未来的研究提供坚实的基础。

英文摘要

As deep learning models scale, managing, inspecting, and modifying large checkpoints has become increasingly challenging. Researchers often need to alter model weights for layer restructuring, precision casting, low-rank factorization, and architectural debugging, yet these workflows often rely on fragile ad-hoc Python scripts. Here, we introduce BrainSurgery, a tool for robust and reproducible "tensor surgery" on neural network checkpoints, and provide a system demonstration covering four examples and three case studies from model upcycling to LoRA extraction. By abstracting storage formats and memory management, BrainSurgery executes complex transformations through declarative YAML plans. It supports structural modifications, mathematical transformations, and tensor reshaping through expressive regex and structural targeting, while built-in assertions validate tensor shapes, data types, and values to prevent silent errors. We envision that BrainSurgery will provide a strong foundation for future research through its reproducible and validated operations.

2606.07574 2026-06-09 cs.DC cs.AI cs.LG stat.CO stat.ML 交叉投稿

Accelerating Birkhoff Projection for Manifold-Constrained Hyper-Connections

加速流形约束超连接的Birkhoff投影

Chenrui Wang, Yixuan Qiu

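Affiliations School of Statistics; Renmin University of China; School of Statistics and Data Science; Institute of Big Data Research; Shanghai University of Finance and Economics

AI summary Targets the computational bottleneck of Birkhoff projection in manifold-constrained hyper-connections with an end-to-end acceleration framework based on a dual formulation and Newton's method, combined with implicit differentiation and a CUDA kernel, achieving over 20x speedup.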

详情
AI中文摘要

流形约束超连接(mHCs)最近被提出作为超连接的一种原则性扩展,其中残差混合矩阵通过投影到Birkhoff多面体上被约束为双随机矩阵。在实际的mHC实现中,该约束通过Sinkhorn-Knopp迭代强制执行,反向传播依赖于展开迭代求解器。这种设计引入了大量的计算和内存开销,并且当算法在具有挑战性的输入上收敛缓慢时,可能产生不准确的投影,从而破坏mHCs预期的范数控制和稳定性保证。在这项工作中,我们聚焦于实际重要的4x4 Birkhoff投影设置,并开发了一个端到端的加速框架。通过利用对偶公式,我们将问题简化为一个三维无约束凸问题,并使用牛顿法求解,实现了快速收敛和高精度。对于反向传播,我们用隐式微分替代展开微分,无需存储中间状态即可获得精确梯度。为了利用大规模并行性,我们设计了一个warp级别的CUDA内核,仅使用寄存器级原语,避免了全局和共享内存I/O。与代表性开源基线的大量实验表明,所提出的求解器产生了更可靠的双随机投影——特别是在输入幅度较大时——并实现了显著的端到端加速(包括反向传播),在大批量下达到超过20倍的加速,同时保持数量级更小的边际误差。

英文摘要

Manifold-constrained hyper-connections (mHCs) have recently been proposed as a principled extension of hyper-connections, where the residual mixing matrices are constrained to be doubly stochastic via projection onto the Birkhoff polytope. In practical mHC implementations, this constraint is enforced by Sinkhorn-Knopp iterations, and the backward pass relies on unrolling the iterative solver. This design introduces substantial computation and memory overhead, and may also yield inaccurate projections when the algorithm converges slowly on challenging inputs, undermining the intended norm-control and stability guarantees of mHCs. In this work, we focus on the practically important 4x4 Birkhoff projection setting and develop an end-to-end acceleration framework. By leveraging the dual formulation, we reduce the problem to a three-dimensional unconstrained convex problem and solve it with Newton's method, achieving fast convergence and high accuracy. For the backward pass, we replace the unrolled differentiation with implicit differentiation, yielding exact gradients without storing intermediate states. To exploit massive parallelism, we design a warp-level CUDA kernel that uses only register-level primitives, avoiding global and shared memory I/O. Extensive experiments against representative open-source baselines demonstrate that the proposed solver yields substantially more reliable doubly stochastic projections -- especially when the input magnitude is large -- and achieves significant end-to-end speedups (including the backward pass), reaching over 20x acceleration at large batch sizes while maintaining orders of magnitude smaller marginal errors.
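For context, the Sinkhorn-Knopp baseline that the abstract sets out to replace can be sketched as below. This illustrates the baseline only, not the paper's Newton-based dual solver. Exponentiating the input to obtain a positive matrix, the iteration count, and the example 4x4 input are assumptions for illustration.

```python
import math

def sinkhorn_knopp(logits, iters=50):
    """Alternately normalize rows and columns of exp(logits)."""
    n = len(logits)
    m = [[math.exp(x) for x in row] for row in logits]
    for _ in range(iters):
        for i in range(n):  # make every row sum to 1
            s = sum(m[i])
            m[i] = [v / s for v in m[i]]
        for j in range(n):  # make every column sum to 1
            s = sum(m[i][j] for i in range(n))
            for i in range(n):
                m[i][j] /= s
    return m

def marginal_error(m):
    """Largest deviation of any row or column sum from 1."""
    n = len(m)
    row_err = max(abs(sum(row) - 1.0) for row in m)
    col_err = max(abs(sum(m[i][j] for i in range(n)) - 1.0) for j in range(n))
    return max(row_err, col_err)

# Hypothetical small-magnitude 4x4 input.
logits = [
    [0.1, -0.3, 0.5, 0.0],
    [0.2, 0.4, -0.1, 0.3],
    [-0.5, 0.1, 0.0, 0.2],
    [0.3, -0.2, 0.1, -0.4],
]
projected = sinkhorn_knopp(logits)
err = marginal_error(projected)
```

On small-magnitude inputs like this one the iteration converges quickly. The abstract's concern is the opposite regime, where large input magnitudes slow convergence and a fixed iteration budget leaves a visible marginal error.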

2606.07666 2026-06-09 quant-ph cs.AR cs.DC cs.LG 交叉投稿

Hardware-aware Low-latency Quantum Compilation with Data-driven Lightweight Error Detection for Early Fault-Tolerant Systems

面向早期容错系统的硬件感知低延迟量子编译与数据驱动的轻量级错误检测

Sumit Chongder

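Affiliations Inter-Disciplinary Research Platform, Quantum Information and Computation, Indian Institute of Technology Jodhpur

AI summary Proposes a framework that integrates hardware-aware compilation with data-driven quantum error detection, jointly optimizing qubit mapping, SWAP insertion, and syndrome scheduling via a noise-weighted cost function and a learned multi-objective scheduler; in experiments spanning VQE, phase-estimation, and Grover benchmarks, it raises algorithmic success probability by up to 68%.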

Comments 16 pages, 15 figures, Springer LNCS format. Code available at https://github.com/Sumitchongder/quantum-hw-aware-pipeline

详情
AI中文摘要

噪声中等规模量子(NISQ)处理器正进入早期容错阶段,此时完全量子纠错代价高昂,而轻量级错误检测可有效提高算法成功率。现有编译和错误检测工具链孤立处理这些问题,缺乏在延迟约束下平衡检测开销与成功概率的原则性方法。我们提出一种集成的硬件感知编译与数据驱动量子错误检测(QED)框架,通过噪声加权代价函数和学习型多目标调度器,联合优化量子比特映射、SWAP插入和综合征调度。在HPC集群上使用GPU加速密度矩阵模拟(NVIDIA cuQuantum SDK)进行的仿真实验,涵盖VQE、相位估计和Grover基准测试、三种噪声模型以及6-20量子比特(深度10-160)的电路规模,结果表明,在8量子比特VQE实例上,联合协同设计相比SABRE结合后选择,将算法成功概率提升高达68%(95%置信区间:60%至76%)。

英文摘要

Noisy intermediate-scale quantum (NISQ) processors are entering an early fault-tolerance regime where full quantum error correction carries prohibitive resource costs, yet lightweight error detection can meaningfully improve algorithmic success rates. Existing compilation and error-detection toolchains treat these concerns in isolation, with no principled way to balance detection overhead against success probability under latency constraints. We present an integrated hardware-aware compilation and data-driven quantum error-detection (QED) framework that jointly optimises qubit mapping, SWAP insertion, and syndrome-schedule placement via a noise-weighted cost function and a learned multi-objective scheduler. Simulation experiments on an HPC cluster using GPU-accelerated density-matrix simulation (NVIDIA cuQuantum SDK) across VQE, phase-estimation, and Grover benchmarks, three noise profiles, and circuit sizes of 6-20 qubits (depths 10-160), show that joint co-design raises algorithmic success probability by up to 68% (95% CI: 60% to 76%) over SABRE on an 8-qubit VQE instance with post-selection.

2606.07819 2026-06-09 cs.AI cs.LG 交叉投稿

Joint Structural Pruning and Mixed-Precision Quantization for LLM Compression

联合结构剪枝与混合精度量化的大语言模型压缩

Hoang-Loc La, Truong-Thanh Le, Amir Taherkordi, Phuong Hoai Ha

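Affiliations UiT The Arctic University of Norway; University of Oslo, Norway

AI summary Proposes an end-to-end framework that combines a mixed-precision quantization strategy minimizing global error with joint optimization of structural pruning and quantization policies, significantly lowering perplexity at ultra-low bit widths.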

详情
AI中文摘要

近年来,大型语言模型(LLM)部署的效率已成为实际应用中的关键问题。虽然训练后量化(PTQ)和结构剪枝是减少内存占用和推理延迟的成熟技术,但大多数现有的PTQ方法在逐层基础上优化量化误差,忽略了误差如何在网络中累积和传播,通常导致次优解。传统的流程也倾向于孤立或顺序地应用剪枝和量化,进一步加剧了次优性。我们引入了一种新颖的端到端框架,以两种关键方式解决这些限制。首先,我们提出了一种新颖的混合精度PTQ策略,该策略直接最小化整个模型上的全局误差传播,而不是隔离逐层误差。在此基础上,我们开发了一种新颖的联合优化方法,该方法在统一的搜索空间中同时学习结构剪枝决策和混合精度量化策略。大量实验表明,在超低精度(1-3比特)下,与最先进的(SoTA)权重激活量化基线相比,我们的量化方法将WikiText困惑度降低了高达21%。与领先的仅权重量化方法相比,它在WikiText和C4上分别实现了高达59%和85%的困惑度降低。与最先进的联合剪枝和量化技术相比,我们提出的方法在超低比特下提供了优越的困惑度和推理性能。

英文摘要

Recently, the efficiency of Large Language Model (LLM) deployment has become a critical concern in practical applications. While post-training quantization (PTQ) and structural pruning are established techniques for reducing memory footprint and inference latency, most existing PTQ approaches optimize quantization errors on a per-layer basis, overlooking how errors accumulate and propagate through the network, often resulting in suboptimal solutions. Traditional pipelines also tend to apply pruning and quantization in isolation or sequentially, further compounding suboptimality. We introduce a novel end-to-end framework that addresses these limitations in two key ways. First, we propose a mixed-precision PTQ strategy that directly minimizes global error propagation across the entire model, rather than isolating layer-wise errors. Building on this, we develop a joint optimization approach that simultaneously learns structural pruning decisions and mixed-precision quantization policies within a unified search space. Extensive experiments show that, at ultra-low precisions (1-3 bits), our quantization method reduces WikiText perplexity by up to 21% compared to state-of-the-art (SoTA) weight-activation quantization baselines. Against leading weight-only quantization methods, it achieves up to 59% and 85% lower perplexity on WikiText and C4, respectively. Compared to SoTA joint pruning-and-quantization techniques, our proposed method delivers superior perplexity and reasoning performance at ultra-low bits.

2606.08051 2026-06-09 cs.AI cs.LG 交叉投稿

How Small Can You Go? LoRA Fine-Tuning 270M-8B Models for Merchant Information Extraction in Financial Transactions

你能做到多小?面向金融交易中商户信息抽取的 270M-8B 模型 LoRA 微调

Donghao Huang, Tomas Drietomsky, Benjamin Barrett, Zhaoxia Wang

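Affiliations Singapore Management University; Mastercard; A*STAR Centre for Frontier AI Research

AI summary Motivated by the production need to extract structured merchant information from noisy bank transaction strings, the study systematically evaluates 24 model variants and finds that Qwen 3.5 4B is only 0.35 F1 points below the 8B baseline with roughly half the parameters, that the 0.8B model matches models 2.5-4x larger, and that chain-of-thought fine-tuning yields only modest gains.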

Comments 9 pages, 5 figures, 5 tables. Submitted to the IEEE International Conference on Data Mining (ICDM) 2026

详情
AI中文摘要

金融交易处理需要从嘈杂、缩写的银行交易字符串中大规模提取结构化商户信息。我们当前的生产系统是 LoRA 微调的 LLaMA 3.1-8B,在该任务上达到了 96.95% 的 F1 分数,但部署 80 亿参数模型带来了高昂的内存、延迟和成本约束。为了识别更高效的替代方案,我们进行了一项以部署为中心的研究,涵盖四个模型家族的 24 种模型变体:Gemma 3(270M、1B、4B)、Qwen 3.5(0.8B、2B、4B)、Aya(3.35B)和 LLaMA 3.1-8B,系统评估了准确率、推理吞吐量、训练成本和硬件行为,以评估生产适用性。我们的发现表明:(1)使用 LoRA 秩为 8 复现 LLaMA 3.1-8B 微调达到 96.75% F1,仅比秩为 32 的基线低 0.20 个点;(2)仅使用 JSON 提示的 Qwen 3.5 4B 达到 96.60% F1,比 8B 基线低 0.35 个点,同时参数量大约减半;(3)0.8B 的 Qwen 3.5 模型达到 94.75% F1,与 2.5-4 倍大的模型性能相当,提供了有吸引力的延迟-准确率权衡;(4)思维链微调通常使大多数模型的 F1 提升 0.3-1.8 个点,尽管 Qwen 3.5 4B 在直接仅 JSON 提示下表现最佳;(5)Qwen 3.5 的 Think 和 Nothink 训练模板产生几乎相同的结果(F1 差异 <0.004),表明对于结构化抽取任务,显式推理监督是不必要的。我们进一步将所有 14 个微调后的子 8B 模型部署为 Databricks Model Serving 端点,并观察到基准性能可靠地迁移到生产环境,平均 F1 变化仅为 0.8 个点。基于 Cohere2 架构的 Aya 3.35B 是唯一的例外,在服务条件下 F1 下降了 3-5 个点。基于这些结果,我们提供了跨准确率和延迟需求的部署建议,……

英文摘要

Financial transaction processing requires extracting structured merchant information from noisy, abbreviated bank transaction strings at scale. Our current production system, a LoRA-fine-tuned LLaMA 3.1-8B, achieves 96.95% F1 on this task, but deploying 8-billion-parameter models imposes prohibitive memory, latency, and cost constraints. To identify more efficient alternatives, we conduct a deployment-focused study of 24 model variants spanning four model families: Gemma 3 (270M, 1B, 4B), Qwen 3.5 (0.8B, 2B, 4B), Aya (3.35B), and LLaMA 3.1-8B, systematically evaluating accuracy, inference throughput, training cost, and hardware behavior to assess production suitability. Our findings show that: (1) reproducing the LLaMA 3.1-8B fine-tune with a LoRA rank of 8 achieves 96.75% F1, only 0.20 points below the rank-32 baseline; (2) Qwen 3.5 4B with JSON-only prompting reaches 96.60% F1, within 0.35 points of the 8B baseline while using roughly half the parameters; (3) the 0.8B Qwen 3.5 model achieves 94.75% F1, matching models 2.5-4x larger and offering an attractive latency-accuracy trade-off; (4) chain-of-thought fine-tuning generally improves F1 by 0.3-1.8 points across most models, although Qwen 3.5 4B performs best with direct JSON-only prompting; and (5) Qwen 3.5 Think and Nothink training templates produce nearly identical results (F1 differences <0.004), indicating that explicit reasoning supervision is unnecessary for structured extraction tasks. We further deploy all 14 fine-tuned sub-8B models as Databricks Model Serving endpoints and observe that benchmark performance transfers reliably to production, with an average F1 change of only 0.8 points. Aya 3.35B, based on the Cohere2 architecture, is the sole exception, exhibiting a 3-5 point decline under serving conditions. Based on these results, we provide deployment recommendations across accuracy and latency requirements, ...

2606.08094 2026-06-09 cs.RO cs.AI cs.LG cs.SY eess.SY 交叉投稿

vla.cpp: A Unified Inference Runtime for Vision-Language-Action Models

vla.cpp:视觉-语言-动作模型的统一推理运行时

Khanh D. Nguyen, Hung T. Ho, Chinh T. Nguyen, Thanh Q. Duong, Linh D. Le, Duy M. H. Nguyen, Vien A. Ngo, An T. Le

Affiliations VinRobotics Center for AI Research, VinUniversity; Intelligent Autonomous Systems, TU Darmstadt; Max Planck Research School for Intelligent Systems; University of Stuttgart; German Research Center for Artificial Intelligence

AI summary Presents vla.cpp, a portable C++ inference runtime built on llama.cpp that supports multiple VLA architectures, comes close to a state-of-the-art checkpoint on LIBERO-Object, runs BitVLA in 1.3 GiB of memory, and deploys across hardware tiers.

Comments 17 pages, 3 figures, 12 tables

详情
AI中文摘要

视觉-语言-动作(VLA)策略通常以Python/PyTorch堆栈形式提供,假设使用工作站级GPU,这与机器人实际运行的硬件不匹配。我们提出了vla.cpp,一个基于llama.cpp的便携式C++推理运行时。据我们所知,它是第一个原生支持流匹配和扩散VLA推理模式的ggml类引擎,其中缓存的视觉-语言前缀由交叉注意力动作专家在多个求解器步骤中消耗。单个运行时通过一个请求/响应协议服务于跨越五个骨干网络和四个动作头家族的七种架构,每个模型打包为自包含的捆绑包。在LIBERO-Object上,该引擎在200个回合中与最先进的检查点相差不到一个回合,并以1.3 GiB内存运行BitVLA达到100%成功率。相同的捆绑包在三个硬件层级上不变地运行,从消费级GPU到8 GB嵌入式模块。跨硬件屋顶线分析表明,批量大小为1的VLA推理受计算限制,因此利用率而非带宽是部署杠杆;由此分析得出的IMMA梯形GEMM将BitVLA每步延迟降低了4.5倍。然后,我们在ALOHA机械臂上设计了一个机载压力测试,隔离了学习型VLA必须在训练它的硬件上针对移动目标重新规划的延迟约束。代码、演示视频和可重复的基准测试框架可在https://fai-modelopt-tech.github.io/vla-cpp.github.io/获取。

英文摘要

Vision-Language-Action (VLA) policies are typically shipped as Python/PyTorch stacks that assume a workstation-class GPU, a mismatch for the hardware on which robots actually run. We present vla.cpp, a portable C++ inference runtime built on llama.cpp. To our knowledge, it is the first ggml-class engine to natively serve the flow-matching and diffusion VLA inference pattern, in which a cached vision-language prefix is consumed by a cross-attending action expert integrated over several solver steps. A single runtime serves seven architectures spanning five backbone and four action-head families behind one request/response protocol, with each model packaged as a self-contained bundle. On LIBERO-Object, the engine matches a state-of-the-art checkpoint to within one episode out of 200, and runs BitVLA at 100% success in 1.3 GiB of memory. The same bundle runs unchanged across three hardware tiers, from a consumer GPU down to an 8 GB embedded module. A cross-hardware roofline analysis shows that batch-1 VLA inference is compute-bound, so utilization rather than bandwidth is the deployment lever; an IMMA ladder GEMM derived from this analysis cuts BitVLA per-step latency by 4.5x. We then frame an on-robot stress test on an ALOHA arm that isolates the latency constraint under which a learned VLA must replan against a moving target on the hardware it was trained for. Code, demo videos, and the reproducible benchmark scaffold are available at https://fai-modelopt-tech.github.io/vla-cpp.github.io/.

2606.08813 2026-06-09 cs.DC cs.DB cs.IR cs.LG 交叉投稿

Aperon Technical Report: Hierarchical No-Pointer Tangent-Local Search for High-Dimensional Approximate Nearest Neighbors

Aperon技术报告:用于高维近似最近邻搜索的层次化无指针切向局部搜索

Yong Fu

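Affiliations Substratum Labs

AI summary Proposes the HNTL framework, which indexes high-dimensional vectors and generates candidates using a pointerless Block-SoA layout and local tangent-space partitioning; on 768-dimensional data it reaches Rerank Recall@10 = 1.0000 with a candidate pool of C=20 and is 3.61x faster than pointer-chasing graph traversal.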

详情
AI中文摘要

我们提出了HNTL(层次化无指针切向局部搜索),这是Aperon向量内存系统的核心向量索引和候选生成框架。近邻图(例如HNSW)在内存开销上承受了沉重的指针税,并导致不规则的内存访问,从而阻塞CPU流水线。HNTL通过将高维空间划分为局部、连贯的颗粒,将向量表示为局部切空间上的低维坐标,并使用无指针的Block-SoA(结构体数组)布局顺序扫描它们来解决这一问题。在非各向同性流形数据(d=768,N=10,000)上,局部PCA捕获了96.3%的方差,使得HNTL能够仅使用C=20个向量的候选池达到最终Rerank Recall@10为1.0000。通过Apple kperf CPU性能监控单元(PMU)计数器进行的硬件性能分析表明,我们使用NEON自动向量化的C++ Block-SoA扫描引擎相比标准的指针追踪图遍历实现了3.61倍加速(4.137纳秒/向量对比14.951纳秒/向量),这得益于3.59倍的IPC(每周期指令数)和接近零的L1/L2数据缓存未命中。

英文摘要

We present HNTL (Hierarchical No-pointer Tangent-Local), the core vector indexing and candidate generation framework of the Aperon vector memory system. Proximity graphs (e.g., HNSW) incur a heavy pointer tax in memory overhead and induce irregular memory accesses that stall CPU pipelines. HNTL resolves this by partitioning the high-dimensional space into local, coherent grains, representing vectors as low-dimensional coordinates on local tangent spaces, and scanning them sequentially using a pointerless Block-SoA (Structure-of-Arrays) layout. On anisotropic manifold data (d=768, N=10,000), local PCA captures 96.3% of the variance, allowing HNTL to achieve a final Rerank Recall@10 of 1.0000 with a candidate pool size of only C=20 vectors. Hardware profiling via Apple kperf CPU Performance Monitoring Unit (PMU) counters demonstrates a 3.61x speedup (4.137 ns/vector vs. 14.951 ns/vector) for our NEON auto-vectorized C++ Block-SoA scan engine over standard pointer-chasing graph traversals, driven by a 3.59x higher IPC (Instructions Per Cycle) and near-zero L1/L2 data cache misses.

2606.09213 2026-06-09 cs.PL cs.LG 交叉投稿

SNN-MLIR: An MLIR Dialect for Compiling Neuromorphic SNNs from NIR to Bare-Metal C

SNN-MLIR:一种用于将神经形态SNN从NIR编译到裸机C的MLIR方言

Alejandro García Gener, Alvaro Rollón de Pinedo

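Affiliations INTERA-Group

AI summary Presents SNN-MLIR, an MLIR dialect with a NIR-MLIR-C compilation bridge that compiles neuromorphic SNN models from the framework-independent NIR format to portable C code, supports floating-point and quantized data, and provides a single intermediate representation for both simulation and hardware deployment.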

Comments 8 pages, 5 figures, 5 tables

详情
AI中文摘要

脉冲神经网络(SNN)越来越多地在各种框架(SnnTorch、Lava、Norse等)中训练,每个框架都有自己的模型格式。神经形态中间表示(NIR)通过提供一种通用的、框架无关的格式来交换训练好的SNN模型,解决了碎片化问题。NIR解决了交换问题,但仅止于此。它提供了网络的描述,而非运行网络的路径。每个后端仍需自行实现部署,之间没有共享的、可转换的编译器表示。本文提出snn-mlir,一种用于SNN的树外MLIR方言,以及一个NIR-MLIR-C编译桥。该方言提供了一小组类型多态操作,这些操作在浮点(f32/f64)和量化数据上行为一致,因此单一的中间表示同时服务于仿真和面向硬件的部署。一个Python前端读取任何NIR文件并发出方言IR,自动插入重新缩放操作以保持各层量化尺度一致。一个参考降级过程将方言转换为标准的linalg和arith操作,工具链从中生成自包含、无依赖的C11代码,可在任何支持C的CPU或嵌入式目标上编译和运行。我们评估了数值精度与参考输出的匹配度、跨CPU目标的可移植性以及量化的代价。当前范围是前馈全连接网络,后端为CPU。snn-mlir以Apache-2.0许可证(含LLVM例外)开源发布,并已在GitHub上可用。

英文摘要

Spiking neural networks (SNNs) are increasingly trained in a wide range of frameworks (snnTorch, Lava, Norse, and others), each with its own model format. The Neuromorphic Intermediate Representation (NIR) addresses this fragmentation by providing a common, framework-independent format for exchanging trained SNN models. NIR solves the exchange problem, but it stops there. It provides a description of a network, not a path to running one. Each backend is still left to implement deployment on its own, with no shared, transformable compiler representation in between. This paper presents snn-mlir, an out-of-tree MLIR dialect for SNNs together with a NIR-MLIR-C compilation bridge. The dialect provides a small set of type-polymorphic operations that work identically on floating-point (f32/f64) and quantized data, so a single intermediate representation serves both simulation and hardware-oriented deployment. A Python front end reads any NIR file and emits dialect IR, automatically inserting rescaling operations to keep quantization scales consistent across layers. A reference lowering pass converts the dialect to standard linalg and arith operations, from which the toolchain produces self-contained, dependency-free C11 code that compiles and runs on any C-capable CPU or embedded target. We evaluate numerical fidelity against reference outputs, portability across CPU targets, and the cost of quantization. The current scope is feedforward, fully-connected networks with a CPU backend. snn-mlir is released as open source under the Apache-2.0 license with LLVM-exception and is already available on GitHub.

2606.09643 2026-06-09 cs.DC cs.AI cs.LG cs.OS 交叉投稿

FMplex: Model Virtualization for Serving Extensible Foundation Models

FMplex: 用于服务可扩展基础模型的模型虚拟化

Hetvi Shastri, Pragya Sharma, Walid A. Hanafy, David Irwin, Mani Srivastava, Prashant Shenoy

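Affiliations University of Massachusetts Amherst; University of California Los Angeles

AI summary Presents FMplex, a system that shares foundation models across tasks by treating them as a virtualization substrate, combined with a batch-aware fair-queueing scheduler; across 7 foundation models and 92 downstream tasks it reduces latency by up to 80% and hosts up to 6x more tasks.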

详情
AI中文摘要

基础模型(FMs)越来越多地被用作语言、视觉、时间序列和多模态应用的下游任务骨干。然而,现有的模型服务系统将每个定制任务部署为独立的模型实例,从而复制了重型骨干,浪费了加速器内存,并失去了摊销批处理和加载成本的机会。本文提出了FMplex,一个将FM骨干视为部署共享的虚拟化层的服务系统。FMplex为每个任务提供一个虚拟基础模型(vFM),这是一个由共享物理FM支持的逻辑私有FM实例。这种抽象允许独立定制的任务共享一个骨干,同时保留任务特定的扩展、独立生命周期和任务级隔离。此外,我们提出了一种批感知公平队列调度器,该调度器结合了加权任务级共享以及跨共存任务的批内和批间批处理。我们实现了一个基于FMplex的服务栈,涵盖任务构建、共享感知部署和运行时执行。在7个FM骨干(16个变体)和92个下游任务上,FMplex相比空间分区延迟降低高达80%,相比尽力而为共置延迟降低33.3%,同时在集群规模上可托管多达6倍的任务。

英文摘要

Foundation models (FMs) are increasingly used as backbones for downstream tasks across language, vision, time-series, and multimodal applications. Yet existing model-serving systems deploy each customized task as an independent model instance, thereby replicating heavyweight backbones, wasting accelerator memory, and losing opportunities to amortize batching and loading costs. This paper presents FMplex, a serving system that treats FM backbones as a virtualization substrate for deployment sharing. FMplex presents each task with a virtual foundation model (vFM), a logically private FM instance backed by a shared physical FM. This abstraction lets independently customized tasks share a backbone while preserving task-specific extensions, independent lifecycles, and task-level isolation. In addition, we propose a batch-aware fair-queueing scheduler that combines weighted task-level sharing with inter- and intra-task batching across colocated tasks. We implement an FMplex-based serving stack spanning task construction, sharing-aware deployment, and runtime execution. Across 7 FM backbones (16 variants) and 92 downstream tasks, FMplex reduces latency by up to 80% over spatial partitioning and 33.3% over best-effort co-location, while hosting up to 6x more tasks at cluster scale.

2606.09659 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

End-to-End Context Compression at Scale

端到端上下文压缩的规模化

Ang Li, Sean McLeish, Haozhe Chen, Nimit Kalra, Zaiqian Chen, Artem Gazizov, Venkata Anoop Suhas Kumar Morisetty, Bhavya Kailkhura, Harshitha Menon, Zhuang Liu, Brian R. Bartoldson, Tom Goldstein, Sanae Lotfi, Micah Goldblum, Pavel Izmailov

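Affiliations New York University; Modal Labs; University of Maryland; Princeton University; Columbia University; Harvard University; Lawrence Livermore National Laboratory; FAIR at Meta

AI summary Through architecture search and continual pre-training, the work introduces Latent Context Language Models (LCLMs), end-to-end encoder-decoder compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory, and can serve as efficient backbones for long-horizon agents.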

详情
AI中文摘要

长上下文语言模型推理受限于内存,因为KV缓存随上下文长度增长。最近压缩KV缓存的技术存在不足:它们要么大幅降低模型质量,要么需要大量时间和计算来压缩单个长提示。此外,许多方法要求输入适合目标模型的上下文窗口,并且通常与现代生产推理引擎不兼容。编码器-解码器压缩器原则上是一种有吸引力的替代方案,它将长令牌序列映射到由解码器消费的较短潜在嵌入序列。然而,现有方法在精度-效率前沿上无法与KV缓存压缩竞争。在这项工作中,我们重新审视编码器-解码器压缩并缩小了这一差距。我们首先进行架构搜索,从头开始预训练许多变体,以确定如何最佳设计和训练编码器-解码器压缩器。根据我们的发现,我们持续预训练一系列0.6B编码器、4B解码器模型,每个模型在超过350B令牌上训练,压缩比为1:4、1:8和1:16。我们引入了潜在上下文语言模型(LCLMs),这是一系列压缩器,在通用任务性能、压缩速度和峰值内存使用上改进了帕累托前沿。我们证明了LCLMs可作为长时智能体的高效骨干,让智能体浏览压缩的长上下文并按需自适应扩展相关片段。

英文摘要

Long-context language model inference is bottlenecked by memory, as the KV cache grows with context length. Recent techniques to compress the KV cache fall short: they either degrade model quality substantially or require considerable time and compute to compress a single long prompt. Furthermore, many methods require the input to fit within the target model's context window, and are generally incompatible with modern production inference engines. Encoder-decoder compressors, which map a long token sequence to a shorter sequence of latent embeddings consumed by a decoder, are an appealing alternative in principle. However, existing approaches are not competitive with KV cache compression on the accuracy-efficiency frontier. In this work, we revisit encoder-decoder compression and close this gap. We first perform an architecture search, pre-training many variants from scratch to determine how best to design and train encoder-decoder compressors. Guided by our findings, we continually pre-train a family of 0.6B-encoder, 4B-decoder models on over 350B tokens each, at compression ratios of 1:4, 1:8, and 1:16. We introduce Latent Context Language Models (LCLMs), a family of compressors that improve the Pareto frontier across general-task performance, compression speed, and peak memory usage. We demonstrate that LCLMs serve as efficient backbones for long-horizon agents, letting the agent skim through a compressed long context and adaptively expand relevant segments on demand.

2411.03253 2026-06-09 cs.LG cs.AI cs.DS 版本更新

Discovering Data Structures: Nearest Neighbor Search and Beyond

发现数据结构:最近邻搜索及其他

Omar Salemohamed, Laurent Charlin, Shivam Garg, Vatsal Sharan, Gregory Valiant

发表机构 * Université de Montréal(蒙特利尔大学) Mila HEC Montréal(蒙特利尔高等商学院) Microsoft Research(微软研究院) University of Southern California(南加州大学) Stanford University(斯坦福大学)

AI总结 提出一个端到端学习数据结构的通用框架,自动适应数据分布并控制查询与空间复杂度,在最近邻搜索中逆向工程出二分搜索、插值搜索、k-d树和局部敏感哈希等算法。

Comments Neurips 2025 Version

详情
AI中文摘要

我们提出了一个用于端到端学习数据结构的通用框架。我们的框架适应底层数据分布,并对查询和空间复杂度提供细粒度控制。关键在于,数据结构是从头开始学习的,不需要仔细初始化或用候选数据结构/算法进行种子化。我们首先将该框架应用于最近邻搜索问题。在多种设置中,我们能够逆向工程出学习到的数据结构和查询算法。对于一维最近邻搜索,模型发现了最优的分布(不)依赖算法,如二分搜索和插值搜索的变体。在更高维度中,模型学习到的解决方案在某些情况下类似于k-d树,而在其他情况下则具有局部敏感哈希的元素。该模型还能学习高维数据的有用表示,并利用它们设计有效的数据结构。我们还将框架应用于数据流上的频率估计问题,并相信它也可以成为新问题的强大发现工具。

英文摘要

We propose a general framework for end-to-end learning of data structures. Our framework adapts to the underlying data distribution and provides fine-grained control over query and space complexity. Crucially, the data structure is learned from scratch, and does not require careful initialization or seeding with candidate data structures/algorithms. We first apply this framework to the problem of nearest neighbor search. In several settings, we are able to reverse-engineer the learned data structures and query algorithms. For 1D nearest neighbor search, the model discovers optimal distribution (in)dependent algorithms such as binary search and variants of interpolation search. In higher dimensions, the model learns solutions that resemble k-d trees in some regimes, while in others, they have elements of locality-sensitive hashing. The model can also learn useful representations of high-dimensional data and exploit them to design effective data structures. We also adapt our framework to the problem of estimating frequencies over a data stream, and believe it could also be a powerful discovery tool for new problems.
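For reference, the classical hand-designed baseline that the 1D result is compared against can be sketched as follows. This is a minimal illustration of binary-search nearest neighbor over a sorted array, not the learned data structure from the paper; the function name `nearest_neighbor_1d` is hypothetical.

```python
from bisect import bisect_left


def nearest_neighbor_1d(sorted_values, query):
    """Return the element of sorted_values closest to query (O(log n) lookups)."""
    if not sorted_values:
        raise ValueError("empty data structure")
    # Index of the first element >= query.
    i = bisect_left(sorted_values, query)
    # Only the elements immediately left and right of the insertion point can be nearest.
    candidates = sorted_values[max(i - 1, 0): i + 1]
    return min(candidates, key=lambda v: abs(v - query))


print(nearest_neighbor_1d([1, 4, 9, 16], 10))
```

Binary search is distribution-independent; interpolation search would instead guess the position from the query's value, which pays off when the data distribution is known to be close to uniform.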

2411.16102 2026-06-09 cs.LG 版本更新

BlendServe: Optimizing Offline Inference for Auto-regressive Large Models with Resource-aware Batching

BlendServe: 利用资源感知批处理优化自回归大模型的离线推理

Yilong Zhao, Shuo Yang, Kan Zhu, Lianmin Zheng, Baris Kasikci, Yang Zhou, Jiarong Xing, Ion Stoica

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学) University of California, Davis(加州大学戴维斯分校) Rice University(莱斯大学)

AI总结 针对离线批处理推理中资源重叠与前缀共享的调度冲突,提出基于资源感知前缀树的BlendServe系统以最大化资源利用率,相比vLLM和SGLang吞吐量最高提升1.44倍。

详情
AI中文摘要

离线批处理利用请求批处理的灵活性实现更高吞吐量和更低成本,在延迟不敏感的应用中越来越受欢迎。同时,模型能力和模态的最新进展使得请求在计算和内存需求上更加多样化,通过资源重叠为吞吐量提升创造了独特机会。然而,最大化资源重叠的请求调度可能与最大化前缀共享(一种广泛使用的性能优化)的调度冲突,导致次优的推理吞吐量。我们提出BlendServe,该系统通过结合资源重叠和前缀共享的优势,使用资源感知前缀树来最大化离线批处理的资源利用率。BlendServe利用离线批处理中宽松的延迟要求,重新排序和重叠具有不同资源需求的请求,同时确保高前缀共享。我们在各种合成多模态工作负载上评估BlendServe,结果表明,与广泛使用的行业标准vLLM和SGLang相比,它提供了高达1.44倍的吞吐量提升。

英文摘要

Offline batch inference, which leverages the flexibility of request batching to achieve higher throughput and lower costs, is becoming more popular for latency-insensitive applications. Meanwhile, recent progress in model capability and modality makes requests more diverse in compute and memory demands, creating unique opportunities for throughput improvement by resource overlapping. However, a request schedule that maximizes resource overlapping can conflict with the schedule that maximizes prefix sharing, a widely-used performance optimization, causing sub-optimal inference throughput. We present BlendServe, a system that maximizes resource utilization of offline batch inference by combining the benefits of resource overlapping and prefix sharing using a resource-aware prefix tree. BlendServe exploits the relaxed latency requirements in offline batch inference to reorder and overlap requests with varied resource demands while ensuring high prefix sharing. We evaluate BlendServe on a variety of synthetic multi-modal workloads and show that it provides up to $1.44\times$ throughput boost compared to widely-used industry standards, vLLM and SGLang.
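The prefix-sharing side of this trade-off can be illustrated with a toy calculation. The sketch below only counts how many prompt tokens could be reused when requests are served in a given order, under the simplifying assumption that just the previous request's prefix is cached. It ignores resource overlapping entirely, so it is not BlendServe's scheduler; the helper names and the sample requests are hypothetical.

```python
def shared_prefix_len(a, b):
    """Length of the common token prefix of two requests."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def reused_tokens(requests):
    """Tokens reusable when each request can share a prefix with its predecessor."""
    return sum(shared_prefix_len(p, q) for p, q in zip(requests, requests[1:]))


# Four requests over two prompt families, given in interleaved arrival order.
requests = [("sys", "a", "x"), ("doc", "b"), ("sys", "a", "y"), ("doc", "c")]
arrival_reuse = reused_tokens(requests)
# Lexicographic sorting groups requests with common prefixes, as a prefix tree would.
sorted_reuse = reused_tokens(sorted(requests))
print(arrival_reuse, sorted_reuse)
```

An order chosen purely for prefix reuse may place requests with similar compute or memory demands next to each other, which is the conflict with resource overlapping that the abstract describes.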

2505.17868 2026-06-09 cs.LG math.OC 版本更新

SpectraLDS: Provable Distillation for Linear Dynamical Systems

SpectraLDS:线性动力系统的可证明蒸馏

Devan Shah, Shlomo Fortgang, Sofiia Druchyna, Elad Hazan

发表机构 * Computer Science Department, Princeton University(普林斯顿大学计算机科学系) Google DeepMind Princeton(谷歌DeepMind普林斯顿)

AI总结 提出首个可证明方法识别对称线性动力系统,通过谱变换实现与状态维度无关的精度保证,并实现常数时间和空间推理。

详情
AI中文摘要

我们提出了第一个可证明的方法,用于识别具有精度保证的对称线性动力系统(LDS),该精度保证与系统的状态维度或有效记忆无关。我们的方法建立在最近的工作基础上,该工作将对称LDS表示为可通过固定谱变换学习的卷积。我们展示了如何反转这种表示,从而从谱变换中恢复LDS模型,并产生端到端的凸优化过程。这种蒸馏保留了预测精度,同时实现了每个token的常数时间和常数空间推理,与序列长度无关。我们将我们的方法SpectraLDS作为序列预测架构中的一个组件进行评估,并证明在语言建模等任务上,精度得以保持,同时推理效率得到提升。

英文摘要

We present the first provable method for identifying symmetric linear dynamical systems (LDS) with accuracy guarantees that are independent of the systems' state dimension or effective memory. Our approach builds upon recent work that represents symmetric LDSs as convolutions learnable via fixed spectral transformations. We show how to invert this representation, thereby recovering an LDS model from its spectral transform and yielding an end-to-end convex optimization procedure. This distillation preserves predictive accuracy while enabling constant-time and constant-space inference per token, independent of sequence length. We evaluate our method, SpectraLDS, as a component in sequence prediction architectures and demonstrate that accuracy is preserved while inference efficiency is improved on tasks such as language modeling.
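The two views of an LDS mentioned here, a convolution over the input history and a fixed-size recurrence, can be checked numerically. The sketch below assumes a small single-input single-output symmetric system with randomly drawn parameters (all names and sizes are illustrative); it does not implement the spectral-filter distillation itself.

```python
import numpy as np

rng = np.random.default_rng(0)
d, T = 4, 32

# Symmetric transition matrix with eigenvalues in (-0.9, 0.9), so the system is stable.
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))
A = Q @ np.diag(rng.uniform(-0.9, 0.9, d)) @ Q.T
B = rng.standard_normal(d)
C = rng.standard_normal(d)
u = rng.standard_normal(T)

# Recurrent form: constant time and memory per step, independent of sequence length.
h = np.zeros(d)
y_rec = np.empty(T)
for t in range(T):
    h = A @ h + B * u[t]
    y_rec[t] = C @ h

# Convolutional form: y_t = sum_i (C A^i B) u_{t-i}, which needs the whole input history.
kernel = np.array([C @ np.linalg.matrix_power(A, i) @ B for i in range(T)])
y_conv = np.convolve(u, kernel)[:T]

print(np.max(np.abs(y_rec - y_conv)))
```

Recovering `A`, `B`, `C` from a learned convolutional representation is what lets inference switch from the second form to the first.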

2506.06295 2026-06-09 cs.LG cs.AI cs.CL 版本更新

dLLM-Cache: Accelerating Diffusion Large Language Models with Adaptive Caching

dLLM-Cache:基于自适应缓存的扩散大语言模型加速

Zhiyuan Liu, Yicun Yang, Yaojie Zhang, Junjie Chen, Chang Zou, Qingyan Wei, Shaobo Wang, Yichen Zhu, Linfeng Zhang

发表机构 * Zhejiang University(浙江大学)

AI总结 针对扩散大语言模型推理延迟高的问题,提出一种无需训练的自适应缓存框架dLLM-Cache,通过长间隔提示缓存和基于特征相似性的部分响应更新,实现高效中间计算复用,在保持输出质量的同时大幅降低FLOPs。

Comments Accepted by ICML 2026

详情
AI中文摘要

自回归模型长期以来主导了大语言模型领域。最近,一种基于扩散的大语言模型(dLLMs)的新范式出现,它通过迭代去噪掩码段来生成文本。这种方法显示出显著的优势和潜力。然而,dLLMs存在高推理延迟的问题。传统的自回归模型加速技术,如键值缓存,由于dLLMs的双向注意力机制而无法兼容。为了应对这一特定挑战,我们的工作首先基于一个关键观察:dLLM推理涉及一个静态提示和一个部分动态的响应,其中大多数标记在相邻去噪步骤中保持稳定。基于此,我们提出了dLLM-Cache,一种无需训练的自适应缓存框架,它结合了长间隔提示缓存和基于特征相似性的部分响应更新。这种设计能够在不影响模型性能的情况下高效重用中间计算。在代表性dLLMs(包括LLaDA 8B和Dream 7B)上的大量实验表明,dLLM-Cache在LongBench-HotpotQA上实现了高达9.1倍的FLOPs减少,同时保持了具有竞争力的输出质量。值得注意的是,我们的方法使dLLM推理延迟在许多设置下接近自回归模型。本工作的代码公开于:https://github.com/maomaocun/dLLM-cache。

英文摘要

Autoregressive Models (ARMs) have long dominated the landscape of Large Language Models. Recently, a new paradigm has emerged in the form of diffusion-based Large Language Models (dLLMs), which generate text by iteratively denoising masked segments. This approach has shown significant advantages and potential. However, dLLMs suffer from high inference latency. Traditional ARM acceleration techniques, such as Key-Value caching, are incompatible with dLLMs due to their bidirectional attention mechanism. To address this specific challenge, our work begins with a key observation that dLLM inference involves a static prompt and a partially dynamic response, where most tokens remain stable across adjacent denoising steps. Based on this, we propose dLLM-Cache, a training-free adaptive caching framework that combines long-interval prompt caching with partial response updates guided by feature similarity. This design enables efficient reuse of intermediate computations without compromising model performance. Extensive experiments on representative dLLMs, including LLaDA 8B and Dream 7B, show that dLLM-Cache achieves up to 9.1x FLOPs reduction on LongBench-HotpotQA while maintaining competitive output quality. Notably, our method brings dLLM inference latency close to that of ARMs under many settings. The code for this work is publicly available at: https://github.com/maomaocun/dLLM-cache.
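The idea of updating only the response tokens whose features changed most can be sketched as below. This is a hedged illustration that assumes cosine similarity between cached and freshly computed per-token features as the selection signal; the function name, the `update_ratio` parameter, and the choice of which features to compare are assumptions rather than the paper's exact procedure.

```python
import numpy as np


def select_tokens_to_refresh(cached, current, update_ratio=0.25):
    """Return indices of tokens with the lowest cosine similarity to their cached features."""
    num = np.sum(cached * current, axis=-1)
    den = np.linalg.norm(cached, axis=-1) * np.linalg.norm(current, axis=-1) + 1e-12
    similarity = num / den
    # Refresh at least one token; the rest keep their cached computation.
    k = max(1, int(round(update_ratio * len(similarity))))
    return np.argsort(similarity)[:k]


cached = np.ones((4, 3))
current = cached.copy()
current[2] = -1.0  # token 2 changed direction between denoising steps
print(select_tokens_to_refresh(cached, current))
```

Tokens that are not selected would reuse cached intermediate results for the current denoising step.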

2509.14562 2026-06-09 cs.LG math.OC 版本更新

LiMuon: Light and Fast Muon Optimizer for Large Models

LiMuon: 面向大模型的轻量快速Muon优化器

Feihu Huang, Yuning Luo, Songcan Chen

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出LiMuon优化器,结合动量方差缩减与随机SVD,在降低内存的同时实现更低的样本复杂度O(ε^{-3}),并在Mamba-130M等模型上验证有效性。

Comments Published in ICML 2026

详情
AI中文摘要

近年来,大模型在机器学习中广泛应用,因此大模型的高效训练受到广泛关注。最近,实用的Muon优化器专门针对大模型的矩阵结构参数设计。尽管已有工作开始研究Muon优化器,但现有的Muon及其变体在处理大模型时仍存在样本复杂度高或内存占用高的问题。为填补这一空白,我们提出了一种轻量快速的Muon(LiMuon)优化器用于训练大模型,它基于动量方差缩减技术和随机奇异值分解(SVD)。特别地,我们的LiMuon同时具有比Muon及其变体更低的内存和更低的样本复杂度。此外,我们证明在广义光滑条件下,具有更低内存的LiMuon在非凸随机优化中寻找ε-稳定解时具有更低的样本复杂度O(ε^{-3})。为进一步缩小理论与实践差距,我们还证明采用Newton-Schulz步骤的LiMuon比采用Newton-Schulz步骤的Muon具有更低的样本复杂度。在训练Mamba-130M、Qwen2.5-0.5B和ViT模型上的数值实验结果证明了我们LiMuon的有效性。

英文摘要

Large models recently are widely applied in machine learning, so efficient training of large models has received widespread attention. More recently, the useful Muon optimizer is specifically designed for matrix-structured parameters of large models. Although some works have begun to study the Muon optimizer, the existing Muon and its variants still suffer from high sample complexity or high memory for large models. To fill this gap, we propose a light and fast Muon (LiMuon) optimizer for training large models, which builds on the momentum-based variance reduced technique and randomized Singular Value Decomposition (SVD). In particular, our LiMuon simultaneously has a lower memory and lower sample complexity than the Muon and its variants. Moreover, we prove that our LiMuon with lower memory has a lower sample complexity of $O(ε^{-3})$ for finding an $ε$-stationary solution of non-convex stochastic optimization under the generalized smoothness condition. To further narrow practice and theory gap, we also prove that our LiMuon with Newton-Schulz steps has a lower sample complexity than the Muon with Newton-Schulz steps. Numerical experimental results on training Mamba-130M, Qwen2.5-0.5B and ViT models demonstrate effectiveness of our LiMuon.
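One of the two ingredients named above, randomized SVD, can be sketched in a few lines. The version below is the generic range-finder construction (random projection, QR, small exact SVD). How LiMuon applies it to optimizer state is not specified in the abstract, so the function name and the `oversample` parameter are illustrative assumptions.

```python
import numpy as np


def randomized_svd(M, rank, oversample=5, seed=0):
    """Approximate the top-`rank` SVD of M via a random range finder."""
    rng = np.random.default_rng(seed)
    # Project onto a random low-dimensional subspace and orthonormalize its image.
    omega = rng.standard_normal((M.shape[1], rank + oversample))
    Q, _ = np.linalg.qr(M @ omega)
    # Exact SVD of the much smaller matrix Q^T M.
    U_small, S, Vt = np.linalg.svd(Q.T @ M, full_matrices=False)
    return (Q @ U_small)[:, :rank], S[:rank], Vt[:rank]


rng = np.random.default_rng(1)
M = rng.standard_normal((50, 3)) @ rng.standard_normal((3, 40))  # exactly rank 3
U, S, Vt = randomized_svd(M, rank=3)
print(np.max(np.abs((U * S) @ Vt - M)))
```

Storing only the rank-`r` factors instead of the full matrix is the kind of memory saving a low-rank state representation provides.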

2511.07046 2026-06-09 cs.LG cs.AI 版本更新

Learning Quantized Continuous Controllers for Integer Hardware

面向整数硬件的量化连续控制器学习

Fabian Kresse, Christoph H. Lampert

发表机构 * Institute of Science and Technology Austria (ISTA)(奥地利科学与技术研究所)

AI总结 提出量化感知训练策略,自动选择低比特策略并综合到FPGA,在MuJoCo任务中以3或2比特权重和激活值实现与全精度相当的竞争力,并提升输入噪声鲁棒性。

Comments 18 pages, 6 figures

详情
AI中文摘要

在嵌入式硬件上部署连续控制强化学习策略需要满足严格的延迟和功耗预算。小型FPGA可以实现这些要求,但前提是避免昂贵的浮点流水线。我们研究了用于整数推理的策略的量化感知训练(QAT),并提出了一种学习到硬件的流水线,该流水线自动选择低比特策略并将其综合到Artix-7 FPGA上。在五个MuJoCo任务中,我们获得的策略网络与全精度(FP32)策略具有竞争力,但每个权重和每个内部激活值仅需3比特甚至2比特,前提是输入精度经过仔细选择。在目标硬件上,所选策略实现微秒级的推理延迟,每次动作消耗微焦耳能量,与量化参考相比具有优势。最后,我们观察到量化策略相比浮点基线具有更高的输入噪声鲁棒性。

英文摘要

Deploying continuous-control reinforcement learning policies on embedded hardware requires meeting tight latency and power budgets. Small FPGAs can deliver these, but only if costly floating-point pipelines are avoided. We study quantization-aware training (QAT) of policies for integer inference and we present a learning-to-hardware pipeline that automatically selects low-bit policies and synthesizes them to an Artix-7 FPGA. Across five MuJoCo tasks, we obtain policy networks that are competitive with full precision (FP32) policies but require as few as 3 or even only 2 bits per weight, and per internal activation value, as long as input precision is chosen carefully. On the target hardware, the selected policies achieve inference latencies on the order of microseconds and consume microjoules per action, favorably comparing to a quantized reference. Last, we observe that the quantized policies exhibit increased input noise robustness compared to the floating-point baseline.
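To make the bit widths concrete, the sketch below shows a plain uniform symmetric quantizer that maps floating-point values to signed integer codes. It is a generic illustration, not the paper's QAT pipeline; the function name and the per-tensor scaling rule are assumptions.

```python
import numpy as np


def quantize_symmetric(x, bits):
    """Uniform symmetric quantizer: returns integer codes and the dequantization scale."""
    qmax = 2 ** (bits - 1) - 1      # e.g. bits=3 gives codes in [-4, 3]
    qmin = -(2 ** (bits - 1))
    scale = np.max(np.abs(x)) / qmax if np.any(x) else 1.0
    codes = np.clip(np.round(x / scale), qmin, qmax).astype(np.int32)
    return codes, scale


weights = np.array([-1.0, -0.2, 0.3, 1.0])
codes3, scale3 = quantize_symmetric(weights, bits=3)
codes2, scale2 = quantize_symmetric(weights, bits=2)
print(codes3, codes2)
```

With 2 bits there are at most four distinct codes per value, which is why integer-only FPGA datapaths become small; the dequantized value is `codes * scale`.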

2601.15727 2026-06-09 cs.LG cs.CL 版本更新

Towards Automated Kernel Generation in the Era of LLMs

面向LLM时代的自动化内核生成

Yang Yu, Peiyu Zang, Chi Hsu Tsai, Haiming Wu, Yixin Shen, Jialing Zhang, Haoyu Wang, Zhiyou Xiao, Jingze Shi, Yuyu Luo, Wentao Zhang, Chunlei Men, Guang Liu, Yonghua Lin

发表机构 * Beijing Academy of Artificial Intelligence(北京人工智能研究院) Beijing Normal University(北京师范大学) Peking University(北京大学) Beijing Institute of Technology(北京理工大学) Cornell University(康奈尔大学) Beijing Jiaotong University(北京交通大学) Renmin University of China(中国人民大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文综述了利用大语言模型(LLM)和智能体系统自动化生成与优化GPU内核的方法,系统梳理了现有方法、数据集和基准,并指出了未来研究方向。

Comments In IJCAI 2026. 9 pages, 1 figure

详情
AI中文摘要

现代AI系统的性能从根本上受限于其底层GPU内核的质量,这些内核将高级算法语义转化为低级硬件操作。实现接近最优的内核需要对硬件架构和编程模型有专家级的理解,使得内核工程成为一个关键但耗时且不可扩展的过程。大语言模型和基于LLM的智能体的最新进展为自动化内核生成和优化开辟了新的可能性。LLM擅长压缩难以形式化的专家级内核知识,而智能体系统通过将内核开发视为迭代、反馈驱动的循环,进一步实现了可扩展的优化。该领域取得了快速进展。然而,该领域仍然分散,缺乏对LLM驱动内核生成的系统视角。本综述通过提供现有方法的结构化概述(涵盖基于LLM的方法和智能体优化工作流程),并系统组织支撑该领域学习和评估的数据集和基准,填补了这一空白。此外,本文进一步概述了关键开放挑战和未来研究方向,旨在为下一代自动化内核优化建立全面的参考。为跟踪该领域进展,我们维护了一个开源GitHub仓库:https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation。

英文摘要

The performance of modern AI systems is fundamentally constrained by the quality of their underlying GPU kernels, which translate high-level algorithmic semantics into low-level hardware operations. Achieving near-optimal kernels requires expert-level understanding of hardware architectures and programming models, making kernel engineering a critical but notoriously time-consuming and non-scalable process. Recent advances in large language models and LLM-based agents have opened new possibilities for automating kernel generation and optimization. LLMs are well-suited to compress expert-level kernel knowledge that is difficult to formalize, while agentic systems further enable scalable optimization by casting kernel development as an iterative, feedback-driven loop. Rapid progress has been made in this area. However, the field remains fragmented and lacks a systematic perspective for LLM-driven kernel generation. This survey addresses this gap by providing a structured overview of existing approaches, spanning LLM-based approaches and agentic optimization workflows, and systematically organizing the datasets and benchmarks that underpin learning and evaluation in this domain. Moreover, key open challenges and future research directions are further outlined, aiming to establish a comprehensive reference for the next generation of automated kernel optimization. To keep track of this field, we maintain an open-source GitHub repository at https://github.com/flagos-ai/awesome-LLM-driven-kernel-generation.

2601.21522 2026-06-09 cs.LG cond-mat.dis-nn cs.AI stat.ML 版本更新

More Bang for the Buck: Improving the Inference of Large Language Models at a Fixed Budget using Reset and Discard (ReD)

更高效利用预算:使用重置与丢弃(ReD)方法在固定预算下提升大型语言模型的推理性能

Sagi Meir, Tommer D. Keidar, Noam Levi, Shlomi Reuveni, Barak Hirshberg

发表机构 * School of Chemistry, Tel Aviv University(特拉维夫大学化学学院) The Center for Physics and Chemistry of Living Systems, Tel Aviv University(特拉维夫大学生命系统物理与化学中心) School of Physics and Astronomy, Tel Aviv University(特拉维夫大学物理与天文学院) The Center for Computational Molecular and Materials Science, Tel Aviv University(特拉维夫大学计算分子与材料科学中心)

AI总结 针对固定预算下大型语言模型推理的收益递减问题,提出重置与丢弃(ReD)查询方法,通过优化尝试分配提升覆盖率,并在编码、数学和推理基准上验证了其成本节约效果。

详情
AI中文摘要

大型语言模型(LLMs)在可验证任务上的性能通常通过 pass@k 衡量,即在 k 次尝试中至少正确回答一次的概率。在固定预算下,更合适的指标是 coverage@cost,即作为总尝试次数函数的平均唯一回答问题数量。我们连接这两个指标,并证明 pass@k 中经验观察到的幂律行为导致 coverage@cost 的次线性增长(收益递减)。为解决此问题,我们提出重置与丢弃(ReD),一种 LLMs 的查询方法,无论 pass@k 的形式如何,都能在给定预算下增加 coverage@cost。此外,给定 pass@k,我们可以定量预测使用 ReD 在总尝试次数上的节省。如果模型的 pass@k 不可用,ReD 可以推断其幂律指数。在三个 LLMs 上进行的编码(HumanEval)、数学(GSM8K)和推理(MMLU-Pro)基准测试表明,ReD 显著减少了达到期望覆盖率所需的尝试次数、令牌数和美元成本,同时提供了一种高效测量推理幂律的方法。ReD 的优势在非完美验证器下得以保持,并且优于测试的分配基线。

英文摘要

The performance of large language models (LLMs) on verifiable tasks is usually measured by pass@k, the probability of answering a question correctly at least once in k trials. At a fixed budget, a more suitable metric is coverage@cost, the average number of unique questions answered as a function of the total number of attempts. We connect the two metrics and show that the empirically-observed power-law behavior in pass@k leads to a sublinear growth of the coverage@cost (diminishing returns). To solve this problem, we propose Reset-and-Discard (ReD), a query method of LLMs that increases coverage@cost for a given budget, regardless of the pass@k form. Moreover, given a pass@k, we can quantitatively predict the savings in the total number of attempts using ReD. If pass@k is not available for the model, ReD can infer its power-law exponent. Experiments on three LLMs across coding (HumanEval), math (GSM8K), and reasoning (MMLU-Pro) benchmarks demonstrate that ReD substantially reduces the required attempts, tokens, and USD cost to reach a desired coverage, while also offering an efficient way to measure inference power-laws. ReD's advantage is maintained for imperfect verifiers and outperforms the tested allocation baselines.
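The link between pass@k and diminishing returns can be seen in a small calculation. The sketch below assumes independent attempts with a fixed per-question success probability and a baseline that gives every question the same number of attempts. It does not implement ReD, and the probabilities are made up for illustration.

```python
def pass_at_k(p, k):
    """Probability of at least one success in k independent attempts with success rate p."""
    return 1.0 - (1.0 - p) ** k


def coverage_fixed_k(success_probs, k):
    """Expected number of questions solved when each one gets exactly k attempts."""
    return sum(pass_at_k(p, k) for p in success_probs)


probs = [0.5, 0.1, 0.01]  # hypothetical per-question success rates
for k in (1, 2, 3):
    cost = k * len(probs)  # total attempts spent
    print(cost, round(coverage_fixed_k(probs, k), 4))
```

Each extra round of attempts adds less coverage than the previous one, because repeated attempts are increasingly spent on questions that are either already solved or very hard.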

2602.05774 2026-06-09 cs.LG cs.AI math.PR 版本更新

Variational Speculative Decoding: Rethinking Draft Training from Token Likelihood to Sequence Acceptance

变分推测解码:从令牌似然到序列接受的草稿训练再思考

Xiandong Zou, Jianshu Li, Jing Huang, Pan Zhou

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出变分推测解码(VSD),将草稿训练视为对潜在提议(草稿路径)的变分推断,通过最大化目标模型接受的边际概率来优化,结合路径级效用和期望最大化过程,显著提升解码效率。

详情
AI中文摘要

推测解码加速了(多模态)大语言模型的推理,但训练-解码之间存在不一致:现有方法优化单一贪婪轨迹,而解码涉及验证和排序多个采样草稿路径。我们提出变分推测解码(VSD),将草稿训练形式化为对潜在提议(草稿路径)的变分推断。VSD最大化目标模型接受的边际概率,得到一个ELBO,该ELBO促进高质量潜在提议,同时最小化与目标分布的散度。为提升质量并降低方差,我们引入路径级效用,并通过期望最大化过程进行优化。E步从经过oracle过滤的后验中抽取蒙特卡洛样本,M步使用自适应拒绝加权(ARW)和置信度感知正则化(CAR)最大化加权似然。理论分析证实VSD增加了期望接受长度和加速比。在LLM和MLLM上的大量实验表明,VSD相比EAGLE-3实现高达9.6%的加速,相比ViSpec实现7.9%的加速,显著提升了解码效率。

英文摘要

Speculative decoding accelerates inference for (M)LLMs, yet a training-decoding discrepancy persists: while existing methods optimize single greedy trajectories, decoding involves verifying and ranking multiple sampled draft paths. We propose Variational Speculative Decoding (VSD), formulating draft training as variational inference over latent proposals (draft paths). VSD maximizes the marginal probability of target-model acceptance, yielding an ELBO that promotes high-quality latent proposals while minimizing divergence from the target distribution. To enhance quality and reduce variance, we incorporate a path-level utility and optimize via an Expectation-Maximization procedure. The E-step draws Monte Carlo samples from an oracle-filtered posterior, while the M-step maximizes weighted likelihood using Adaptive Rejection Weighting (ARW) and Confidence-Aware Regularization (CAR). Theoretical analysis confirms that VSD increases expected acceptance length and speedup. Extensive experiments across LLMs and MLLMs show that VSD achieves up to a 9.6% speedup over EAGLE-3 and 7.9% over ViSpec, significantly improving decoding efficiency.
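For background, the acceptance statistics that draft training tries to improve follow from the standard speculative sampling rule: a drafted token x is accepted with probability min(1, p(x)/q(x)), where p is the target distribution and q the draft distribution. The sketch below computes the resulting per-token acceptance probability and, under an i.i.d. acceptance assumption, the expected tokens emitted per verification step. It illustrates the generic single-path rule, not VSD's training objective or multi-path verification; the distributions are made up.

```python
import numpy as np


def acceptance_probability(target, draft):
    """P(accept) for one drafted token: sum_x q(x) * min(1, p(x)/q(x)) = sum_x min(p(x), q(x))."""
    return float(np.minimum(target, draft).sum())


def expected_tokens_per_step(alpha, gamma):
    """Expected tokens per verification step with gamma drafts and i.i.d. acceptance rate alpha < 1."""
    return (1.0 - alpha ** (gamma + 1)) / (1.0 - alpha)


target = np.array([0.5, 0.3, 0.2])
draft = np.array([0.2, 0.3, 0.5])
alpha = acceptance_probability(target, draft)
print(alpha, expected_tokens_per_step(alpha, gamma=4))
```

A draft model whose distribution overlaps more with the target raises `alpha`, and the expected accepted length grows accordingly.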

2603.05500 2026-06-09 cs.LG cs.AI cs.CL 版本更新

POET-X: Memory-efficient LLM Training by Scaling Orthogonal Transformation

POET-X:通过扩展正交变换实现内存高效的LLM训练

Zeju Qiu, Lixin Liu, Adrian Weller, Han Shi, Weiyang Liu

发表机构 * University of Cambridge(剑桥大学)

AI总结 POET-X通过优化正交等价变换降低计算和内存开销,实现高效稳定的LLM训练,支持在单块H100 GPU上预训练十亿参数模型。

Comments ICML 2026 Oral (15 pages, 7 figures, project page: https://spherelab.ai/poetx/)

详情
AI中文摘要

高效且稳定的大型语言模型(LLM)训练仍然是现代机器学习系统的核心挑战。为应对这一挑战,已有工作提出了重参数化正交等价训练(POET),这是一种保持谱的框架,通过正交等价变换优化每个权重矩阵。尽管POET提供了强大的训练稳定性,但其原始实现由于密集的矩阵乘法导致高内存消耗和计算开销。为克服这些限制,我们引入了POET-X,一种可扩展且内存高效的变体,以显著降低的计算成本执行正交等价变换。POET-X在保持POET的泛化和稳定性优势的同时,实现了吞吐量和内存效率的显著提升。在我们的实验中,POET-X能够在单块Nvidia H100 GPU上预训练十亿参数的LLM,而AdamW等标准优化器在相同设置下会因内存不足而失败。

英文摘要

Efficient and stable training of large language models (LLMs) remains a core challenge in modern machine learning systems. To address this challenge, Reparameterized Orthogonal Equivalence Training (POET), a spectrum-preserving framework that optimizes each weight matrix through orthogonal equivalence transformation, has been proposed. Although POET provides strong training stability, its original implementation incurs high memory consumption and computational overhead due to intensive matrix multiplications. To overcome these limitations, we introduce POET-X, a scalable and memory-efficient variant that performs orthogonal equivalence transformations with significantly reduced computational cost. POET-X maintains the generalization and stability benefits of POET while achieving substantial improvements in throughput and memory efficiency. In our experiments, POET-X enables the pretraining of billion-parameter LLMs on a single Nvidia H100 GPU, and in contrast, standard optimizers such as AdamW run out of memory under the same settings.
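The spectrum-preserving property can be verified directly: multiplying a weight matrix by orthogonal matrices on both sides leaves its singular values unchanged. The sketch below assumes the form W = R W0 P with a fixed W0 and orthogonal R and P drawn at random via QR; the names and shapes are illustrative, and it does not reproduce POET-X's efficient parameterization.

```python
import numpy as np

rng = np.random.default_rng(0)
W0 = rng.standard_normal((6, 4))                     # fixed base weight matrix

# Random orthogonal matrices obtained from QR factorizations.
R, _ = np.linalg.qr(rng.standard_normal((6, 6)))
P, _ = np.linalg.qr(rng.standard_normal((4, 4)))

W = R @ W0 @ P                                       # orthogonal equivalence transformation

sv_before = np.linalg.svd(W0, compute_uv=False)
sv_after = np.linalg.svd(W, compute_uv=False)
print(np.max(np.abs(sv_before - sv_after)))
```

Because only R and P change during training in this picture, the spectrum of the effective weight stays fixed at that of W0.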

2603.29824 2026-06-09 cs.LG 版本更新

Curvature-Guided LoRA: Matching Full Fine-Tuning in Function Space

曲率引导的LoRA:在函数空间中匹配全微调

Frédéric Zheng, Alexandre Proutière

发表机构 * KTH(皇家理工学院)

AI总结 本文提出Curvature-Guided LoRA,通过函数空间视角解决LoRA与全微调输出对齐问题,采用曲率感知的二阶方法提升微调效率与性能。

Comments Preprint

详情
AI中文摘要

参数高效的微调方法如LoRA能够高效适应大预训练模型,但通常在收敛速度和最终性能上落后于全微调。最近的方法旨在通过将LoRA参数更新与全微调对齐来缩小这一差距,但这种参数空间对齐只能间接控制模型预测。相反,我们采用函数空间视角,提出预测对齐问题,其目标是使LoRA微调的输出与全微调的输出一致。我们证明该目标自然导致曲率感知的二阶公式,其中最优低秩更新对应于牛顿似、曲率白化的梯度。基于此见解,我们提出Curvature-Guided LoRA (CG-LoRA),一种利用局部曲率信息选择适应方向的算法。我们的方法计算高效且避免显式构造二阶矩阵。在标准自然语言理解基准上的实验表明,与现有LoRA变体相比,我们的方法在性能和收敛速度上均有提升。

英文摘要

Parameter-efficient fine-tuning methods such as LoRA enable efficient adaptation of large pretrained models, but often lag behind full fine-tuning in both convergence speed and final performance. Recent approaches aim to reduce this gap by aligning LoRA parameter updates with those of full fine-tuning, but such parameter-space alignment only indirectly controls model predictions. Instead, we adopt a function-space perspective and formulate the \emph{prediction alignment problem}, whose objective is to match the outputs of LoRA fine-tuning to those of full fine-tuning. We show that this objective naturally leads to a curvature-aware, second-order formulation, where optimal low-rank updates correspond to a Newton-like, curvature-whitened gradient. Based on this insight, we propose Curvature-Guided LoRA (CG-LoRA), an algorithm that selects adaptation directions using local curvature information. Our method is computationally efficient and avoids explicit second-order matrix construction. Experiments on standard natural language understanding benchmarks demonstrate improved performance and faster convergence compared to existing LoRA variants.

2605.02950 2026-06-09 cs.LG cs.AI 版本更新

Kernel Affine Hull Machines as Compute-Efficient Encoders for Frozen Semantic Spaces

核仿射包机作为冻结语义空间的计算高效编码器

Mohit Kumar, Somayeh Kargaran, Bernhard A. Moser, Manuela Geiß

发表机构 * University of Rostock(罗斯托克大学) Software Competence Center Hagenberg GmbH(海根堡软件竞争力中心)

AI总结 提出核仿射包机(KAHM)作为轻量级查询编码器,在固定教师表示空间下,通过RKHS中的后验权重估计替代神经网络编码,实现计算高效且性能优异的语义检索。

详情
AI中文摘要

基于Transformer的语义编码器在检索中很有效,但在许多部署中,重复出现的瓶颈是在线查询编码,而非离线语料库索引。本文研究,一旦强大的教师表示空间和语料库索引固定,是否可以用一个更轻量且解析明确的估计器来替代重复的神经查询编码。我们将固定教师的词汇到语义编码表述为一个条件均值估计问题,其中目标语义向量表示为由后验聚类概率加权的语义原型的噪声混合。使用核仿射包机(KAHM)几何,在显式识别的RKHS假设空间中,从廉价的词汇特征估计这些后验权重,并通过归一化最小均方更新从带噪声的教师嵌入中精炼语义原型。这产生了一个无反向传播的查询端编码器,以及一个端到端的误差分解,包括后验近似、有限样本/泛化和教师噪声项。我们在一个受控的奥地利法律检索基准上实例化该方法,该基准包含5000个测试查询、84个候选法律和10762个对齐的检索单元,使用特定于法律的编码器进入冻结的Mixedbread嵌入空间。在评估匹配的学习适配器中,KAHM在所有评估截断处实现了最强的教师空间重建和最佳的排名敏感检索性能。在k=20时,它获得了MRR@20=0.504、Hit@20=0.694和Top-1准确率=0.411,同时在报告的CPU设置中,相对于直接Transformer查询编码,在线每查询时间减少了8.53倍。结果支持KAHM作为监督固定表示部署场景中的计算高效编码器。

英文摘要

Transformer-based semantic encoders are effective for retrieval, but in many deployments the recurring bottleneck is online query encoding rather than offline corpus indexing. This paper studies whether, once a strong teacher representation space and corpus index are fixed, repeated neural query encoding can be replaced by a substantially lighter and analytically explicit estimator. We formulate fixed-teacher lexical-to-semantic encoding as a conditional-mean estimation problem in which the target semantic vector is represented as a noisy mixture of semantic prototypes weighted by posterior cluster probabilities. Kernel Affine Hull Machine (KAHM) geometry is used to estimate these posterior weights from inexpensive lexical features in an explicitly identified RKHS hypothesis space, and the semantic prototypes are refined by normalized least-mean-squares updates from noisy teacher embeddings. This yields a backpropagation-free query-side encoder together with an end-to-end error decomposition into posterior-approximation, finite-sample/generalization, and teacher-noise terms. We instantiate the approach on a controlled Austrian-law retrieval benchmark with 5,000 test queries, 84 candidate laws, and 10,762 aligned retrieval units, using law-specific encoders into a frozen Mixedbread embedding space. Among evaluation-matched learned adapters, KAHM achieves the strongest teacher-space reconstruction and the best rank-sensitive retrieval performance at all evaluated cutoffs. At k=20, it obtains MRR@20 = 0.504, Hit@20 = 0.694, and Top-1 Accuracy = 0.411, while reducing online per-query time by a factor of 8.53 relative to direct transformer query encoding in the reported CPU setting. The results support KAHMs as compute-efficient encoders for supervised fixed-representation deployment regimes.
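The retrieval metrics quoted above can be computed as in the following sketch. It assumes exactly one relevant item per query, which is a simplification and may differ from the benchmark's protocol; the function name and the toy rankings are hypothetical.

```python
def mrr_and_hit_at_k(ranked_lists, relevant, k):
    """Mean reciprocal rank and hit rate, both truncated at cutoff k."""
    rr_sum, hits = 0.0, 0
    for ranking, gold in zip(ranked_lists, relevant):
        top = ranking[:k]
        if gold in top:
            hits += 1
            rr_sum += 1.0 / (top.index(gold) + 1)  # ranks are 1-based
    n = len(ranked_lists)
    return rr_sum / n, hits / n


rankings = [["a", "b", "c"], ["b", "a", "c"], ["c", "b", "a"]]
gold = ["a", "a", "z"]
mrr, hit = mrr_and_hit_at_k(rankings, gold, k=2)
print(mrr, hit)
```

Top-1 accuracy is the special case of the hit rate at cutoff 1.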

2605.13768 2026-06-09 cs.LG cs.AI cs.IT math.IT 版本更新

High-Rate Quantized Matrix Multiplication II

高速率量化矩阵乘法II

Or Ordentlich, Yury Polyanskiy

发表机构 * Hebrew University of Jerusalem(耶路撒冷希伯来大学) MIT(麻省理工学院)

AI总结 本文研究第二因子列协方差矩阵已知时的高码率量化矩阵乘法,利用注水分配码率改进GPTQ等LLM量化算法,并证明WaterSIC方案与信息论失真极限相差不超过0.25 bit/entry。

详情
AI中文摘要

本文是量化矩阵乘法(MatMul)研究的第二部分。第一部分考虑了无校准量化,本文讨论第二因子各列的协方差矩阵$Σ_X$已知的情形。这一情形出现在极为常见的LLM仅权重(weight-only)后训练量化任务中。仅权重量化与加权均方误差(WMSE)信源编码问题相关,其经典的(反向)注水解决定了应如何在向量各坐标之间分配码率。我们展示了如何利用注水改进实际的LLM量化算法(GPTQ),这类算法目前对各坐标平均分配码率。我们分析了一种仅使用标量INT量化器的最新方案(称为``WaterSIC''),并证明其高码率性能(a)与基无关(即由$Σ_X$的行列式刻画,因此不同于现有方案,不受随机旋转影响);(b)与信息论失真极限相差不超过$\frac{2πe}{12}$的乘法因子(即0.25 bit/entry)。相比之下,GPTQ的性能受基的选择影响,但在随机旋转并使用来自Llama-3-8B的实际$Σ_X$时,我们发现它与WaterSIC的差距在0.1 bit以内(取决于层类型),这表明结合随机旋转的GPTQ也接近最优,至少在高码率区间如此。

英文摘要

This is the second part of the work investigating quantized matrix multiplication (MatMul). In part I we considered the case of calibration-free quantization, whereas here we discuss the setting where covariance matrix $Σ_X$ of the columns of the second factor is available. This setting arises in the ubiquitous task of weight-only post-training quantization of LLMs. Weight-only quantization is related to the problem of weighted mean squared error (WMSE) source coding, whose classical (reverse) waterfilling solution dictates how one should distribute rate between coordinates of the vector. We show how waterfilling can be used to improve practical LLM quantization algorithms (GPTQ), which at present allocate rate equally. A recent scheme (known as ``WaterSIC'') that only uses scalar INT quantizers is analyzed and its high-rate performance is shown to be (a) basis free (i.e., characterized by the determinant of $Σ_X$ and, thus, unlike existing schemes, is immune to applying random rotations); and (b) within a multiplicative factor of $\frac{2πe}{12}$ (or 0.25 bit/entry) of the information-theoretic distortion limit. GPTQ's performance, in turn, is affected by the choice of basis, but for a random rotation and actual $Σ_X$ from Llama-3-8B we find it to be within 0.1 bit (depending on the layer type) of WaterSIC, suggesting that GPTQ with random rotation is also near optimal, at least in the high-rate regime.
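Two quantities in this abstract can be made concrete. The sketch below implements the textbook reverse water-filling allocation for independent Gaussian components under a total MSE budget, and evaluates the rate penalty implied by a distortion factor of 2πe/12. It covers the classical unweighted setting only, as an assumption-laden illustration of how rate is distributed across coordinates; it is not the WaterSIC algorithm, and the variances are made up.

```python
import math


def reverse_waterfilling(variances, total_distortion):
    """Per-coordinate (distortion, rate in bits); requires 0 < total_distortion <= sum(variances)."""
    lo, hi = 0.0, max(variances)
    for _ in range(100):                      # bisection on the water level theta
        theta = 0.5 * (lo + hi)
        if sum(min(theta, v) for v in variances) < total_distortion:
            lo = theta
        else:
            hi = theta
    theta = 0.5 * (lo + hi)
    distortions = [min(theta, v) for v in variances]
    # Coordinates with variance below the water level receive zero rate.
    rates = [max(0.0, 0.5 * math.log2(v / theta)) for v in variances]
    return distortions, rates


distortions, rates = reverse_waterfilling([4.0, 1.0, 0.25], total_distortion=0.75)
print(distortions, rates)

# A distortion factor of 2*pi*e/12 corresponds to this many extra bits per entry.
gap_bits = 0.5 * math.log2(2 * math.pi * math.e / 12)
print(gap_bits)
```

High-variance coordinates receive more bits, which is the opposite of allocating rate equally across coordinates.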

2605.15491 2026-06-09 cs.LG cs.AI cs.PF 版本更新

Ghosted Layers: Unconstrained Activation Alignment for Recovering Layer-Pruned LLMs

Ghosted Layers: 无约束激活对齐用于恢复层剪枝的LLM

Vincent-Daniel Yun, Junhyuk Jo, Sai Praneeth Karimireddy, Sunwoo Lee

发表机构 * University of Southern California(南加州大学) Inha University(仁荷大学)

AI总结 本文提出Ghosted Layers方法,通过无约束优化解决层剪枝后激活分布不匹配问题,提升LLM准确性和 perplexity 而不牺牲效率。

详情
AI中文摘要

层剪枝从大型语言模型中移除整个Transformer解码器块,但导致后续存活层接收到的隐藏状态分布与训练时分布不匹配,从而引起显著性能下降。我们提出Ghosted Layers,一种无需训练的恢复模块,通过解决边界激活对齐问题来解决此问题。我们的方法从少量校准集推导出闭合形式的最优线性算子,以重建由剪枝层引入的激活差异。我们展示该解决方案对应于对齐目标的无约束最优解,而现有方法受限于有限算子子空间内的约束解。在多个LLM backbone和剪枝策略上的实验表明,我们的方法在保持层剪枝效率增益的同时,一致提升了准确性和perplexity,优于先前的无训练基线。官方代码仓库:https://github.com/daniel-eai/ghosted_layers_official_repository/.

英文摘要

Layer pruning removes entire Transformer decoder blocks from large language models, but introduces a mismatch between the hidden state received by the next surviving layer and the distribution it was trained to process, leading to significant performance degradation. We propose Ghosted Layers, a training-free recovery module that addresses this issue by solving a boundary activation alignment problem. Our method derives a closed-form optimal linear operator from a small calibration set to reconstruct the activation discrepancy introduced by the pruned layers. We show that this solution corresponds to the unconstrained optimum of the alignment objective, whereas existing methods are restricted to constrained solutions over limited operator subspaces. Experiments across multiple LLM backbones and pruning strategies demonstrate that our method consistently improves accuracy and perplexity over prior training-free baselines, while preserving the efficiency gains of layer pruning. Official code repository: https://github.com/daniel-eai/ghosted_layers_official_repository/.
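A closed-form linear operator of this kind can be obtained by ordinary least squares. The sketch below assumes the alignment objective is the Frobenius-norm residual between mapped boundary activations and the activations the removed layers would have produced, fitted on a calibration set. The paper's exact objective and parameterization are not given in the abstract, so the function name and the synthetic data are hypothetical.

```python
import numpy as np


def fit_linear_bridge(H_in, H_out):
    """Unconstrained least-squares T minimizing ||H_in @ T - H_out||_F."""
    T, *_ = np.linalg.lstsq(H_in, H_out, rcond=None)
    return T


rng = np.random.default_rng(0)
d, n = 8, 200
H_in = rng.standard_normal((n, d))          # hidden states entering the pruned block
T_true = rng.standard_normal((d, d))
H_out = H_in @ T_true                       # stand-in for the pruned block's outputs

T_hat = fit_linear_bridge(H_in, H_out)
print(np.max(np.abs(T_hat - T_true)))
```

Restricting `T` to, for example, a scaled identity or a diagonal matrix would give a constrained solution, which can only match or increase the residual relative to the unconstrained fit.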

2605.17289 2026-06-09 cs.LG cs.AI 版本更新

LEAP: Learnable End-to-End Adaptive Pruning of Large Language Models

LEAP:大型语言模型的可学习端到端自适应剪枝

Mohammad Mozaffari, Younes Hourri, Mohammad Rastegari, Mahyar Najibi

发表机构 * University of Maryland(马里兰大学)

AI总结 本文提出LEAP,一种可学习的端到端无结构剪枝方法,以基于Gumbel-sigmoid的逐权重伯努利松弛替代难以处理的类别参数化,使端到端无结构掩码学习变得可行,在多个LLM家族上平均提升了零样本准确率。

Comments Accepted at the ICML 2026 Workshop on Resource-Adaptive Foundation Model Inference (AdaptFM)

详情
AI中文摘要

无结构稀疏性如今已被最新的GPU内核和数据流硬件原生加速,瓶颈由此从推理执行转移到剪枝算法。最先进的无结构LLM剪枝方法是由最优脑外科医生(Optimal Brain Surgeon)原理导出的逐层代理方法,它们牺牲了端到端准确率,在高稀疏度下尤为明显。MaskLLM和PATCH等端到端替代方案表明可学习掩码能够缩小这一差距,但它们在模式上的类别分布参数化随每行有效掩码的数量增长,无法迁移到无结构设置。我们提出LEAP,用基于Gumbel-sigmoid的逐权重伯努利松弛替代这种难以处理的参数化,使端到端无结构掩码学习变得可行。在参数规模从0.5B到8B的五个LLM家族上,在50%和60%稀疏度下,LEAP的六任务平均零样本准确率比ADMM(我们实验中最佳的逐层基线)平均高出2.59个点。

英文摘要

Unstructured sparsity is now natively accelerated by recent GPU kernels and dataflow hardware, shifting the bottleneck from inference execution to the pruning algorithm. State-of-the-art methods for unstructured LLM pruning are layer-wise surrogates derived from the Optimal Brain Surgeon principle, and they sacrifice end-to-end accuracy, especially under aggressive sparsity. End-to-end alternatives such as MaskLLM and PATCH show that learnable masks can close this gap, but their categorical-over-patterns parameterization scales with the number of valid masks per row and does not port to the unstructured setting. We introduce LEAP, which replaces this intractable parameterization with a per-weight Bernoulli-via-Gumbel-sigmoid relaxation that makes end-to-end unstructured mask learning tractable. Across five LLM families from 0.5B to 8B parameters at 50% and 60% sparsity, LEAP improves six-task average zero-shot accuracy by +2.59 points on average over ADMM, the best layer-wise baseline in our sweep.
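A per-weight Gumbel-sigmoid relaxation can be sketched as follows. Each weight has a logit; adding logistic noise (the difference of two Gumbel samples) and applying a temperature-scaled sigmoid gives a soft mask in (0, 1), and thresholding it at 0.5 yields a Bernoulli sample with keep probability sigmoid(logit). This is a NumPy illustration of the sampling step only, without gradients or a training loop; the function name and the temperature value are assumptions.

```python
import numpy as np


def gumbel_sigmoid(logits, tau=1.0, rng=None):
    """Relaxed Bernoulli sample per entry: sigmoid((logit + logistic noise) / tau)."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(1e-9, 1.0 - 1e-9, size=np.shape(logits))
    logistic_noise = np.log(u) - np.log1p(-u)   # distributed as Gumbel - Gumbel
    return 1.0 / (1.0 + np.exp(-(np.asarray(logits) + logistic_noise) / tau))


logits = np.full(20000, 1.0)                    # one mask logit per weight
soft_mask = gumbel_sigmoid(logits, tau=0.5, rng=np.random.default_rng(0))
keep_rate = float(np.mean(soft_mask > 0.5))     # hard mask obtained by thresholding
print(keep_rate)
```

One logit per weight keeps the parameter count linear in the number of weights, in contrast to a categorical distribution over all valid mask patterns.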

2605.18643 2026-06-09 cs.LG cs.AI cs.CL 版本更新

Post-Trained MoE Can Skip Half Experts via Self-Distillation

后训练MoE可通过自蒸馏跳过一半专家

Xingtai Lv, Li Sheng, Kaiyan Zhang, Yichen You, Siyan Gao, Xueheng Luo, Yuxin Zuo, Yuchen Fan, Junlin Yang, Ganqu Cui, Bingning Wang, Fan Yang, Youbang Sun, Ning Ding, Bowen Zhou

发表机构 * Frontis.AI Kuaishou Technology(快手科技) Shanghai AI Lab(上海人工智能实验室) TsinghuaC3I/ZEDA(清华大学C3I/ZEDA)

AI总结 本文提出ZEDA框架,通过自蒸馏将预训练的静态MoE模型转换为高效的动态MoE模型,显著减少专家FLOPs并提升推理速度。

详情
AI中文摘要

混合专家(MoE)通过稀疏专家激活高效地扩展语言模型,其动态变体进一步通过输入依赖的方式调整激活专家以减少计算。现有动态MoE方法通常依赖从头训练或任务特定适应,使完全训练的MoE的实际转换未被充分探索。启用此类适应可直接缓解推理成本,通过允许简单令牌在服务时绕过不必要的专家。本文引入了零专家自蒸馏适应(ZEDA),一种低成本框架,将后训练的静态MoE模型转换为高效的动态MoE模型。为稳定此架构转换,ZEDA在每个MoE层中注入无参数的零输出专家,并通过两阶段自蒸馏适应增强模型,利用原始MoE作为冻结的教师,并应用组级平衡损失。在Qwen3-30B-A3B和GLM-4.7-Flash上跨11个基准测试(涵盖数学、代码和指令跟随)中,ZEDA在边际精度损失下消除了超过50%的专家FLOPs。在两个模型上,ZEDA比最强的动态MoE基线分别高出6.1和4.0个点,并提供约1.20倍的端到端推理加速。

英文摘要

Mixture-of-Experts (MoE) scales language models efficiently through sparse expert activation, and its dynamic variant further reduces computation by adjusting the activated experts in an input-dependent manner. Existing dynamic MoE methods usually rely on pre-training from scratch or task-specific adaptation, leaving the practical conversion of fully trained MoE underexplored. Enabling such adaptation would directly alleviate the inference costs by allowing easy tokens to bypass unnecessary expert during serving. This paper introduces Zero-Expert Self-Distillation Adaptation (ZEDA), a low-cost framework that transforms post-trained static MoE models into efficient dynamic ones. To stabilize this architectural conversion, ZEDA injects parameter-free zero-output experts into each MoE layer and adapts the augmented model through two-stage self-distillation, utilizing the original MoE as a frozen teacher and applying a group-level balancing loss. On Qwen3-30B-A3B and GLM-4.7-Flash across 11 benchmarks spanning math, code, and instruction following, ZEDA eliminates over 50% of expert FLOPs at marginal accuracy loss. It outperforms the strongest dynamic MoE baseline by 6.1 and 4.0 points on the two models, and delivers ~1.20$\times$ end-to-end inference speedup.
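How zero-output experts translate into saved expert FLOPs can be illustrated with a toy router. The sketch below assumes the zero experts occupy the trailing router columns and that a top-k slot routed to one of them costs no expert compute. The layout, function name, and logits are hypothetical, and the self-distillation training is not shown.

```python
import numpy as np


def real_expert_fraction(router_logits, num_real, top_k):
    """Fraction of top-k routing slots that land on real (non-zero-output) experts."""
    top = np.argsort(-router_logits, axis=-1)[:, :top_k]
    return float(np.mean(top < num_real))


# Two tokens, two real experts (columns 0-1) plus two zero-output experts (columns 2-3).
router_logits = np.array([
    [3.0, 2.0, 0.0, 0.0],   # hard token: both slots go to real experts
    [0.0, 0.0, 3.0, 2.0],   # easy token: both slots go to zero experts
])
fraction = real_expert_fraction(router_logits, num_real=2, top_k=2)
print(fraction)
```

In this toy case half of the expert slots are skipped, which corresponds to removing half of the expert FLOPs for the batch.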

2605.18856 2026-06-09 cs.LG cs.CL cs.IT math.IT 版本更新

SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference

SPHERICAL KV: 角度域注意力与率失真保持用于高效长上下文推理

Anay Chauhan, Gurucharan Marthi Krishna Kumar, Arion Das, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Synopsys McGill University(麦吉尔大学) IIIT Ranchi(印度信息技术学院兰契分校) Amazon(亚马逊) Meta Apple(苹果) Pragya Lab, BITS Pilani Goa(Pragya实验室,BITS Pilani果阿校区)

AI总结 提出Spherical KV方法,通过角度域注意力(ADA)和率失真保持(RDR)机制,在长上下文推理中减少KV缓存占用并保持解码效率。

详情
AI中文摘要

长上下文推理日益受到KV缓存的限制:常驻内存随上下文长度增长,解码受限于重复的高带宽内存(HBM)流而非算术运算。现有方法如驱逐、窗口化、量化和卸载减少了占用,但通常仅部分解决了关键路径瓶颈,尤其是在解码期间压缩状态仍需重建为密集向量时。我们提出Spherical KV,一种将KV分配视为基于注意力几何的率失真问题以实现高效解码的长上下文推理方法。该方法基于两个思想:(i) 在解码热循环中廉价地表示方向信息,(ii) 根据估计的未来效用分配保留和精度。其第一个组件,角度域注意力(ADA),将键存储在由标量半径和紧凑角度码组成的球面参数化中,并直接根据这些码计算注意力对数,无需重建密集键。这保留了分页、块局部、融合友好的解码路径,并在实际服务设置中直接针对HBM流量。其第二个组件,率失真保持(RDR),在固定预算下联合选择每个令牌和头的保留/丢弃决策及精度层级,生成层级同质的页面,具有轻量级元数据和合并读取。ADA和RDR共同提供了一种面向部署的机制,在保持解码效率的同时减少KV常驻内存。

英文摘要

Long-context inference is increasingly constrained by the KV cache: resident memory grows with context length, and decoding becomes limited by repeated High Bandwidth Memory (HBM) streaming rather than arithmetic. Existing methods such as eviction, windowing, quantization, and offloading reduce footprint, but often leave the critical-path bottleneck only partially addressed, especially when compressed states must still be reconstructed into dense vectors during decoding. We present Spherical KV, a long-context inference method that treats KV allocation as a rate-distortion problem grounded in attention geometry for efficient decoding. The method is built on two ideas: (i) represent directional information cheaply in the decode hot loop, and (ii) allocate retention and precision according to estimated future utility. Its first component, Angle-Domain Attention (ADA), stores keys in a spherical parameterization consisting of a scalar radius and compact angle codes, and computes attention logits directly from these codes without reconstructing dense keys. This preserves a paged, block-local, fusion-friendly decode path and directly targets HBM traffic in realistic serving settings. Its second component, Rate-Distortion Retention (RDR), jointly chooses keep/drop decisions and precision tiers per token and head under a fixed budget, producing tier-homogeneous pages with lightweight metadata and coalesced reads. Together, ADA and RDR provide a deployment-oriented mechanism for reducing KV residency while preserving decode efficiency.

2605.22863 2026-06-09 cs.LG 版本更新

Latent Cache Flow: Model-to-Model Communication Without Text

潜在缓存流:无需文本的模型间通信

Maximillian Rossi, Prajwal Raghunath, Eugene Wu

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出潜在缓存流(LCF)方法,通过联合翻译和压缩键值缓存实现高效模型间通信,在上下文不同的场景下比基于文本的通信F1提高7.5%、精确匹配(Exact Match)提高23%,速度快8.5倍。

Comments 6 pages, 5 figures

详情
AI中文摘要

当今的LLM智能体通过文本进行通信,由于需要对共享方模型的状态进行自回归解码,并在接收方模型处重新编码,这会带来显著的延迟和信息损失。最近的工作如Cache-to-Cache(C2C;Fu等人,2026)试图通过学习适配器来交换KV缓存,该适配器将共享方的KV矩阵转换到接收方模型。然而,这些适配器体积庞大且训练成本高,并且逐token翻译,这要求目标上下文完全相同。这不适用于智能体通信,因为各LLM的上下文并不相同。我们引入了潜在缓存流(LCF)。为了解决效率问题,我们观察到键和值可以联合翻译和压缩,从而将适配器大小减少到C2C的约4%。为了解决上下文不同的问题,我们设计的适配器传输的是目标模型所没有的新信息的摘要。我们的初步实验表明,在共享上下文设置中,一个经过剪枝的13 MB LCF适配器可以比956 MB的C2C更准确;在上下文不同的设置中,LCF将F1提高7.5%、精确匹配(Exact Match)提高23%,同时比基于文本的通信快8.5倍。

英文摘要

LLM agents today communicate via text, which incurs considerable latency and information loss due to the need to autoregressively decode the sharer model's state and encode it at the receiver model. Recent work such as Cache-to-Cache (C2C; Fu et al., 2026) seeks to exchange KV caches by learning adapters that translate sharer KV matrices to the receiver model. However, the adapters are large and expensive to train, and they translate individual tokens, which requires the target context to be identical. This is unsuitable for agent communication, where the LLMs have differing contexts. We introduce Latent Cache Flow (LCF). To address efficiency, we observe that keys and values can be jointly translated and compressed, reducing the adapter to about 4% of C2C's size. To address differing context, we design the adapter to transmit a summary of new information that the target model does not have. Our early experiments show that a pruned 13 MB LCF adapter can be more accurate than C2C at 956 MB in shared-context settings; for different contexts, LCF improves F1 by 7.5% and Exact Match by 23% while being 8.5 times faster than text-based communication.

2605.27786 2026-06-09 cs.LG cs.AI 版本更新

Locality-Aware Redundancy Pruning for LLM Depth Compression

面向LLM深度压缩的局部感知冗余剪枝

Vincent-Daniel Yun, Youngrae Kim, Woosang Lim, YoungJin Heo, Minkyu Kim, Sunwoo Lee

发表机构 * University of Southern California(南加州大学) Neural Superintelligence Lab, MODULABS(MODULABS神经超级智能实验室) Seoul National University(首尔国立大学) Inha University(仁荷大学)

AI总结 提出LoRP,一种基于表示局部性的无训练单次深度剪枝框架,通过引入表示局部性分数(RLS)来识别和剪除冗余层,在多种LLM上提升了困惑度和下游任务准确率。

详情
AI中文摘要

大型语言模型在跨网络深度上已知存在表示冗余,这使得深度剪枝成为提高推理效率的有效方法。现有的单次剪枝方法依赖于局部层重要性或跨架构的固定冗余假设。我们提出了局部感知冗余剪枝(LoRP),一种由表示局部性引导的无训练单次深度剪枝框架。我们表明,层间冗余可以是局部化的或全局分布的,具体取决于LLM架构。为了表征这一现象,我们引入了表示局部性分数(RLS),该分数源自全局层间隐藏状态相似性。使用小的校准集,LoRP计算成对层相似性,按表示相似性对层进行聚类,并根据残差簇内冗余分配剪枝。跨多种LLM家族的实验表明,在困惑度和下游任务准确性上均有提升。

英文摘要

Large language models are known to contain representational redundancy across network depth, making depth pruning an effective approach for improving inference efficiency. Existing one-shot pruning methods rely on local layer importance or fixed redundancy assumptions across architectures. We propose Locality-Aware Redundancy Pruning (LoRP), a training-free one-shot depth pruning framework guided by representation locality. We show that inter-layer redundancy can be either localized or globally distributed depending on the LLM architecture. To characterize this phenomenon, we introduce Representation Locality Score (RLS), derived from global inter-layer hidden-state similarity. Using a small calibration set, LoRP computes pairwise layer similarity, clusters layers by representational similarity, and allocates pruning according to residual intra-cluster redundancy. Experiments across diverse LLM families show improvements in both perplexity and downstream task accuracy. Official github repository: https://github.com/daniel-eai/LoRP-Locality-Aware-Redundancy-Pruning/
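The pairwise layer-similarity and clustering step described above can be sketched as follows. This is a minimal illustration under stated assumptions, not LoRP itself: cosine similarity, the adjacent-layer greedy grouping, and the names `layer_similarity` and `cluster_adjacent` are all hypothetical choices. The abstract does not define the Representation Locality Score, so it is not reproduced here.

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def layer_similarity(hidden_states):
    """hidden_states[l]: flat list of floats, layer l's hidden state on a calibration set."""
    n = len(hidden_states)
    return [[cosine(hidden_states[i], hidden_states[j]) for j in range(n)]
            for i in range(n)]

def cluster_adjacent(sim, threshold):
    """Assumed greedy rule: extend the current cluster while the next layer
    stays similar to the cluster's first layer; otherwise start a new cluster."""
    clusters = [[0]]
    for layer in range(1, len(sim)):
        anchor = clusters[-1][0]
        if sim[anchor][layer] >= threshold:
            clusters[-1].append(layer)
        else:
            clusters.append([layer])
    return clusters
```

This grouping only captures localized (adjacent-layer) redundancy; the abstract notes that redundancy can also be globally distributed, which would need a clustering over the full similarity matrix.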

2605.30836 2026-06-09 cs.LG math.DG 版本更新

Cross-Layer Subspace Coupling for LLM Compression: A Unifying Framework and Its Empirical Limits

跨层子空间耦合用于LLM压缩:一个统一框架及其经验极限

Snigdha Chandan Khilar

发表机构 * Independent Researcher(独立研究者)

AI总结 本文提出一个统一框架将SVD LLM和Basis Sharing等基于SVD的压缩方法纳入同一优化问题,但实验发现跨层耦合在实用任务中失败,原因是残差流在正向传播中解耦了相邻层,因此逐层优化优于联合优化。

详情
AI中文摘要

最近基于SVD的大型语言模型压缩方法,如SVD LLM和Basis Sharing,可以统一在一个优化问题下。尽管数学证明和在Pythia模型上的测试表明,这种统一方法将权重重建误差最多改善了46%,但它在实际任务中却失败了。与标准的逐层SVD LLM相比,困惑度和准确率等下游指标严重下降。作者从机制上解释了这一失败。虽然束方法在数学上耦合了相邻层,但Transformer的残差流在前向传播过程中实际上解耦了它们。因此,逐层最优性比联合跨层优化更重要。论文得出结论,权重空间重建对于跨层压缩是一个有缺陷的目标,未来的方法必须专注于逐层激活重建。

英文摘要

Recent SVD-based compression methods for large language models, such as SVD LLM and Basis Sharing, can be unified under one optimization problem. While mathematical proofs and tests on Pythia models show that this unified approach improves weight reconstruction error by up to 46%, it fails in practical tasks. Downstream metrics such as perplexity and accuracy severely degrade compared to standard per-layer SVD LLM. The authors explain this failure mechanistically. Although the bundle method mathematically couples adjacent layers, the transformer residual stream actually decouples them during forward passes. Thus, per-layer optimality matters more than joint cross-layer optimization. The paper concludes that weight-space reconstruction is a flawed objective for cross-layer compression and that future methods must focus on per-layer activation reconstruction instead.

2606.03328 2026-06-09 cs.LG cs.AI 版本更新

Calibration Data Trade-offs Across Capability Dimensions: Why Multi-Source Mixing Matters for High-Sparsity LLM Pruning

校准数据在能力维度上的权衡:为什么多源混合对高稀疏LLM剪枝至关重要

Hu Xu, Zhaolong Xing, Congcong Liu, Jiaxing Wang, Zhida Jiang, Junshi Huang, Zhen Chen, Jianfeng Xu

发表机构 * Shanghai Jiao Tong University(上海交通大学) JD.com(京东公司)

AI总结 通过分解后剪枝能力维度并分析15个校准源,发现校准困惑度与通用能力保留正相关但与数学和代码能力保留负相关,提出多源混合校准方法IGSP以平衡各维度性能。

详情
AI中文摘要

训练后剪枝使用小型无标签校准集将大型语言模型压缩至高稀疏度,近期研究认为校准源的选择对平均后剪枝精度影响不大。我们提出疑问:当校准效果分别在不同能力维度上评估而非聚合时,该结论是否仍然成立。将后剪枝能力分解为通用、常识、代码和数学,并通过Spearman相关性分析$n{=}15$个校准源的OIT信息度量与各维度保留率,我们发现一个符号相反的权衡:校准困惑度与通用保留率正相关($\rho{=}{+}0.71$),但与数学和代码保留率负相关($\rho{=}{-}0.53,\,{-}0.59$;$p{<}0.05$),因此单一源无法保留所有能力。我们以多源校准混合作为回应,并提出IGSP,一种信息引导的自校准协议,通过最小化4-gram聚合和平衡各维度困惑度,自动构建多源混合而无需能力对齐的语料库。在LLaMA-3.1-8B上使用SparseGPT 60%稀疏度时,均匀多源混合达到58.8%的总保留率,优于最佳单一源(MetaMath,50.0%)$+8.8$和C4默认(40.0%)$+18.8$;IGSP比Self-Cal提高$+2.4$,比SGS提高$+4.8$。

英文摘要

Post-training pruning compresses large language models to high sparsity using a small unlabelled calibration set, and recent work has concluded that the choice of calibration source has only modest impact on averaged post-pruning accuracy. We ask whether this conclusion survives once calibration impact is evaluated separately across distinct capability dimensions rather than aggregated. Decomposing post-pruning capability into General, Commonsense, Code, and Math, and analysing $n{=}15$ calibration sources via Spearman correlations between OIT information metrics and per-dimension retention, we uncover an opposite-sign trade-off: calibration perplexity correlates positively with General retention ($ρ{=}{+}0.71$) but negatively with Math and Code retention ($ρ{=}{-}0.53,\,{-}0.59$; $p{<}0.05$), so no single source can preserve all capabilities. We respond with multi-source calibration mixing, and propose IGSP, an information-guided self-calibration protocol that automates multi-source construction without capability-aligned corpora by minimising 4-gram aggregation and balancing perplexity across dimensions. On LLaMA-3.1-8B at SparseGPT 60% sparsity, a uniform multi-source mix reaches 58.8% total retention, outperforming the best single source (MetaMath, 50.0%) by $+8.8$ and the C4 default (40.0%) by $+18.8$; IGSP improves over Self-Cal by $+2.4$ and SGS by $+4.8$.
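The analysis above rests on Spearman rank correlation between a calibration-source metric and per-dimension retention. A minimal sketch of that statistic follows. It is a plain-Python version with average ranks for ties; the function names are illustrative, and no data from the paper is reproduced.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their rank positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    r = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def pearson(a, b):
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def spearman(a, b):
    """Spearman's rho is the Pearson correlation of the rank vectors."""
    return pearson(ranks(a), ranks(b))
```

Because only ranks matter, any monotone relationship gives a coefficient of magnitude 1, so opposite signs for General versus Math and Code retention indicate opposite monotone trends across the 15 sources.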

2606.04920 2026-06-09 cs.LG cs.CV 版本更新

Toward Multi-Domain and Long-Tailed Quantization via Feature Alignment and Scaling

通过特征对齐与缩放实现多域和长尾量化

Ting-An Chen, Chin-Yuan Yeh, De-Nian Yang

发表机构 * Graduate Institute of Electrical Engineering, National Taiwan University, Taiwan(台湾大学电机工程学研究所) Institute of Information Science, Academia Sinica, Taiwan(中央研究院资讯科学研究所) Graduate Institute of Communication Engineering, National Taiwan University, Taiwan(台湾大学电信工程学研究所) Institute of Information Science and the Research Center for Information Technology Innovation, Academia Sinica, Taiwan(中央研究院资讯科学研究所及资讯科技创新研究中心)

AI总结 提出EmaQ和EmaQ-LT方法,通过CDF投影对齐域分布、敏感度加权聚合稳定多域量化,并引入类别条件方差缩放和置信度调整缓解长尾问题,在多种基准上实现低比特量化下的强性能。

详情
AI中文摘要

量化深度神经网络对于在资源受限设备上进行高效推理至关重要。然而,现有大多数方法针对单域和类别平衡数据设计,忽略了存在域偏移或严重类别不平衡的实际场景。我们通过高效多域对齐量化(EmaQ)解决这些挑战,该方法通过基于CDF的投影对齐域分布,并使用敏感度感知权重聚合来稳定多域量化。我们进一步将EmaQ扩展到EmaQ-LT用于长尾量化,通过引入类别条件方差缩放和基于置信度的logit调整来缓解多数类过度自信。理论分析建立了收敛保证,并激励了所提出的敏感度和缩放机制。在标准、多域(Office-31、Digits)和长尾(SynDigits-LT、CIFAR-10-LT、CIFAR-100-LT)基准上的实验表明,EmaQ和EmaQ-LT在域偏移和类别不平衡下实现了强大的低比特性能。

英文摘要

Quantizing deep neural networks is essential for efficient inference on resource-constrained devices. However, most existing methods are designed for single-domain and class-balanced data, leaving practical settings with domain shifts or severe class imbalance underexplored. We address these challenges with Efficient Multi-Domain Alignment Quantization (EmaQ), which aligns domain distributions through a CDF-based projection and uses sensitivity-aware weight aggregation to stabilize multi-domain quantization. We further extend EmaQ to EmaQ-LT for long-tailed quantization by introducing class-conditioned variance scaling and confidence-based logit adjustment to mitigate majority-class overconfidence. Theoretical analyses establish convergence guarantees and motivate the proposed sensitivity and scaling mechanisms. Experiments on standard, multi-domain (Office-31, Digits), and long-tailed (SynDigits-LT, CIFAR-10-LT, CIFAR-100-LT) benchmarks show that EmaQ and EmaQ-LT achieve strong low-bit performance under domain shift and class imbalance.

2606.04945 2026-06-09 cs.LG 版本更新

STaR-Quant: State-Time Consistent Post-Training Quantization for Diffusion Large Language Models

STaR-Quant:扩散大语言模型的状态-时间一致训练后量化

Xin Yan, Aqiang Wang, Zhenglin Wan, Xingrui Yu, Ivor Tsang

发表机构 * School of Artificial Intelligence, Beijing Normal University, Beijing, China(北京师范大学人工智能学院) Department of Computer Science, National University of Singapore, Singapore(新加坡国立大学计算机科学系) Centre for Frontier AI Research, Agency for Science, Technology and Research (A*STAR), Singapore(科技研究局前沿人工智能研究中心)

AI总结 针对扩散大语言模型低比特量化中的状态相关激活差异和时间误差累积问题,提出STaR-Quant框架,通过状态引导激活变换和时间注意力补偿实现高效量化。

详情
AI中文摘要

扩散大语言模型(DLLMs)最近通过迭代掩码去噪和双向上下文生成文本,成为自回归LLMs的有前途的替代方案。然而,它们的大模型规模和迭代去噪过程带来了大量的内存和计算开销,促使采用训练后量化以实现高效部署。在本文中,我们确定了低比特DLLM量化的两个关键挑战:状态相关的激活差异和时间误差累积。在每个去噪步骤中,掩码和未掩码的标记表现出不同的激活分布,而在迭代解码过程中,量化误差可能跨步骤累积。为了解决这些挑战,我们提出了STaR-Quant,一种用于DLLMs的状态-时间一致PTQ框架。STaR-Quant引入了状态引导激活变换(SGAT),通过统一的静态权重侧变换将掩码和未掩码的标记分配到不同的激活变换空间。它进一步引入了时间注意力补偿(TAC),通过轻量级块对角仿射映射来校正量化的注意力表示。在代表性DLLMs上的实验表明,STaR-Quant在低比特权重-激活量化上持续优于强PTQ基线,同时相比FP16部署实现了高达1.69倍的加速和3.14倍的内存节省。

英文摘要

Diffusion large language models (DLLMs) have recently emerged as a promising alternative to autoregressive LLMs by generating text through iterative masked denoising with bidirectional context. However, their large model sizes and iterative denoising process introduce substantial memory and computational overhead, motivating post-training quantization for efficient deployment. In this paper, we identify two key challenges for low-bit DLLM quantization: state-dependent activation disparity and temporal error accumulation. Masked and unmasked tokens exhibit different activation distributions within each denoising step, while quantization errors can accumulate across steps during iterative decoding. To address these challenges, we propose STaR-Quant, a state-time consistent PTQ framework for DLLMs. STaR-Quant introduces State-Guided Activation Transformation (SGAT) to assign masked and unmasked tokens to different activation transformation spaces with a unified static weight-side transformation. It further introduces Temporal Attention Compensation (TAC) to correct the quantized attention representation via a lightweight block-diagonal affine mapping. Experiments on representative DLLMs demonstrate that STaR-Quant consistently improves low-bit weight-activation quantization over strong PTQ baselines, while delivering up to 1.69x speedup and 3.14x memory saving over FP16 deployment.

2504.05349 2026-06-09 stat.ML cs.AI cs.LG 版本更新

Hyperflux: Pruning Reveals Importance

Hyperflux: 剪枝揭示重要性

Eugen Barbulescu, Antonio Alexoaie, Lucian Busoniu

发表机构 * Department of Computer Science(计算机科学系) Technical University of Cluj-Napoca(克卢日-纳波卡技术大学) Department of Automation(自动化系)

AI总结 提出Hyperflux方法,通过将剪枝建模为连续演化系统(通量和压力),在微观和宏观层面解释剪枝行为,并引入压力调度器实现目标稀疏度,在多个数据集上取得竞争性结果。

详情
AI中文摘要

网络剪枝用于减少大型神经网络的推理延迟和功耗。然而,大多数方法侧重于经验结果,而牺牲了对剪枝过程的理解。我们引入Hyperflux,一种新颖的$L_0$方法,将剪枝建模为由通量(权重移除的梯度响应)和压力(驱动权重向剪枝发展的全局正则化)决定的连续演化系统。通过利用该模型,Hyperflux的剪枝行为在微观(权重再生/剪枝)和宏观(稀疏性收敛等)层面都变得可理解。我们还引入了一种新颖的压力调度器,可靠地针对目标稀疏度。Hyperflux在CIFAR-10、CIFAR-100和ImageNet数据集上使用ResNet-50、VGG-19和DeiT-T/S取得了竞争性结果。

英文摘要

Network pruning is used to reduce inference latency and power consumption in large neural networks. However, most methods focus on empirical results at the expense of understanding the pruning process. We introduce Hyperflux, a novel $L_0$ method which models pruning as a continuously evolving system determined by flux, the gradient response to a weight's removal, and pressure, a global regularization driving weights toward pruning. By exploiting this model, Hyperflux's pruning behavior becomes understandable at both microscopic (weight regrowth/pruning) and macroscopic (sparsity convergence, etc.) levels. We also introduce a novel pressure scheduler that reliably targets desired sparsities. Hyperflux achieves competitive results with ResNet-50, VGG-19 and DeiT-T/S on CIFAR-10, CIFAR-100 and ImageNet datasets.

2509.10334 2026-06-09 cs.CV cs.AI cs.LG 版本更新

I-Segmenter: Integer-Only Vision Transformer for Efficient Semantic Segmentation

I-Segmenter: 用于高效语义分割的纯整数视觉Transformer

Jordan Sassoon, Michal Szczepanski, Martyna Poreba

发表机构 * CEA, France(法国原子能委员会)

AI总结 提出I-Segmenter,首个全整数ViT分割框架,通过整数运算替换、λ-ShiftGELU激活函数及解码器优化,在保持精度前提下显著降低模型大小和推理延迟。

Comments Accepted by the Journal of Systems Architecture

详情
AI中文摘要

视觉Transformer(ViT)最近在语义分割中取得了强劲的结果,但由于其高内存占用和计算成本,在资源受限设备上的部署仍然有限。量化提供了一种提高效率的有效策略,但基于ViT的分割模型在低精度下非常脆弱,因为量化误差会在深度编码器-解码器流水线中累积。我们引入了I-Segmenter,这是第一个完全纯整数的ViT分割框架。基于Segmenter架构,I-Segmenter系统地将浮点运算替换为纯整数对应运算。为了进一步稳定训练和推理,我们提出了λ-ShiftGELU,一种新颖的激活函数,它减轻了均匀量化在处理长尾激活分布时的局限性。此外,我们移除了L2归一化层,并将解码器中的双线性插值替换为最近邻上采样,确保整个计算图都是纯整数执行。大量实验表明,I-Segmenter在合理精度范围内(平均5.1%)达到其FP32基线的精度,同时将模型大小减少高达3.8倍,并通过优化的运行时实现高达1.2倍的推理加速。值得注意的是,即使在单张校准图像的一次性PTQ中,I-Segmenter也能提供有竞争力的精度,凸显了其在实际部署中的实用性。

英文摘要

Vision Transformers (ViTs) have recently achieved strong results in semantic segmentation, yet their deployment on resource-constrained devices remains limited due to their high memory footprint and computational cost. Quantization offers an effective strategy to improve efficiency, but ViT-based segmentation models are notoriously fragile under low precision, as quantization errors accumulate across deep encoder-decoder pipelines. We introduce I-Segmenter, the first fully integer-only ViT segmentation framework. Building on the Segmenter architecture, I-Segmenter systematically replaces floating-point operations with integer-only counterparts. To further stabilize both training and inference, we propose $λ$-ShiftGELU, a novel activation function that mitigates the limitations of uniform quantization in handling long-tailed activation distributions. In addition, we remove the L2 normalization layer and replace bilinear interpolation in the decoder with nearest neighbor upsampling, ensuring integer-only execution throughout the computational graph. Extensive experiments show that I-Segmenter achieves accuracy within a reasonable margin of its FP32 baseline (5.1% on average), while reducing model size by up to 3.8x and enabling up to 1.2x faster inference with optimized runtimes. Notably, even in one-shot PTQ with a single calibration image, I-Segmenter delivers competitive accuracy, underscoring its practicality for real-world deployment.
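One detail above is easy to make concrete: nearest-neighbor upsampling only copies values, so it stays in integer arithmetic, whereas bilinear interpolation needs fractional weights. The sketch below is a generic illustration with a hypothetical function name, not the I-Segmenter decoder.

```python
def nearest_upsample(image, scale):
    """Upsample a 2D grid of ints by an integer factor using value copies only."""
    out = []
    for row in image:
        # Repeat each pixel `scale` times horizontally.
        wide = [px for px in row for _ in range(scale)]
        # Repeat the widened row `scale` times vertically.
        out.extend(list(wide) for _ in range(scale))
    return out
```

No multiplication or division touches the pixel values, which is why this operation fits an integer-only computational graph.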

2602.21788 2026-06-09 cs.DC cs.LG 版本更新

Efficient Scaling of LLM Training with Flexible Context Parallelism

利用灵活上下文并行实现LLM训练的高效扩展

Yifan Niu, Han Xiao, Dongyi Liu, Wei Zhou, Jia Li

发表机构 * The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Huawei Technologies Co., Ltd.(华为技术有限公司)

AI总结 针对数据异构导致负载不均和通信冗余问题,提出自适应重配置通信组和上下文并行度的FCP策略,实现近线性加速比,最高达1.46倍吞吐提升。

详情
AI中文摘要

扩展长上下文能力对于大型语言模型(LLM)至关重要。然而,现实世界的数据包含大量具有异构长度的序列。现有的LLM训练库依赖于静态并行策略,在数据异构下会遭受严重的负载不均衡、冗余通信和次优的硬件利用率。在这项工作中,我们提出了灵活上下文并行(FCP),一种高效的并行策略,能够在LLM训练期间自适应地重配置通信组和上下文并行度。我们推广了更灵活的非2的幂次并行度,并开发了一个多项式时间算法,为每个训练批次生成近乎最优的并行策略,开销仅为毫秒级。即使在极端数据异构下,FCP也能保持高硬件效率。实验结果表明,FCP在LLM和多模态大模型(MLLM)训练中均显著优于Megatron-LM和DeepSpeed,在保持大规模集群近线性扩展效率的同时,平均吞吐量提升高达1.46倍。对于极端不平衡的批次,FCP甚至实现了2.24倍的加速。

英文摘要

Scaling long-context capabilities is crucial for Large Language Models (LLMs). However, real-world data contain a large number of sequences with heterogeneous lengths. Existing training libraries for LLMs rely on static parallelism strategies, which suffer from severe load imbalance, redundant communication, and suboptimal hardware utilization under data heterogeneity. In this work, we propose Flexible Context Parallelism (FCP), an efficient parallelism strategy that adaptively reconfigures communication groups and context parallelism degrees during LLM training. We generalize more flexible non-power-of-two parallelism degrees and develop a polynomial-time algorithm to generate near-optimal parallelism strategies with only millisecond-level overhead per training batch. FCP is able to maintain high hardware efficiency even under extreme data heterogeneity. Experimental results demonstrate that FCP significantly outperforms Megatron-LM and DeepSpeed in both LLM and MLLM training, achieving up to 1.46x speedup in average throughput while maintaining near-linear scaling efficiency across large-scale clusters. For extremely unbalanced batches, FCP even achieves 2.24x speedup.
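To see why non-power-of-two context parallelism degrees can help with heterogeneous sequence lengths, consider the toy calculation below. It assumes a fixed per-device token capacity and picks the smallest degree that fits each sequence; `cp_degree`, `devices_needed`, and the capacity value are hypothetical, and this is not the paper's polynomial-time algorithm.

```python
import math

def cp_degree(seq_len, tokens_per_device, power_of_two=False):
    """Smallest number of devices whose combined capacity covers the sequence."""
    degree = max(1, math.ceil(seq_len / tokens_per_device))
    if power_of_two:
        # Round up to the next power of two, as a power-of-two-only scheme would.
        degree = 1 << (degree - 1).bit_length()
    return degree

def devices_needed(seq_lens, tokens_per_device, power_of_two=False):
    return sum(cp_degree(n, tokens_per_device, power_of_two) for n in seq_lens)

batch = [40_000, 9_000, 70_000]          # made-up sequence lengths
flexible = devices_needed(batch, 16_384)  # degrees 3, 1, 5
rounded = devices_needed(batch, 16_384, power_of_two=True)  # degrees 4, 1, 8
```

In this made-up batch the flexible degrees occupy 9 device slots against 13 with power-of-two rounding, which is the kind of slack an adaptive per-batch strategy could reclaim.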

2603.23640 2026-06-09 cs.DC cs.LG 版本更新

LLM Inference at the Edge: Mobile, NPU, and GPU Performance Efficiency Trade-offs Under Sustained Load

边缘侧的大语言模型推理:移动、NPU和GPU在持续负载下的性能效率权衡

Pranay Tummalapalli, Sahil Arayakandy, Ritam Pal, Kautuk Kundan

发表机构 * Conscious Engines

AI总结 研究评估了在持续负载下不同设备上大语言模型的性能效率,发现移动端受热管理限制,专用硬件受电池和内存带宽限制,展示了不同平台的推理表现和能效差异。

Comments 14 pages, 5 figures, 10 tables

详情
AI中文摘要

在设备上部署大语言模型以实现始终在线的个人智能体,要求在功率、热包络和内存方面受到严格限制的硬件提供持续推理能力。我们对Qwen 2.5 1.5B(4位量化)在四个平台上进行了基准测试:搭载Hailo-10H NPU的Raspberry Pi 5、三星Galaxy S24 Ultra、iPhone 16 Pro以及搭载NVIDIA RTX 4050 GPU的笔记本电脑。使用固定的258个token的提示,在每台设备上进行20次热态迭代,我们测量了吞吐量、延迟、功率和热行为。对于移动平台,热管理取代峰值算力成为主要约束:iPhone 16 Pro在两次迭代内损失了近一半的吞吐量,而S24 Ultra遭遇操作系统强制的硬性GPU频率下限,导致推理完全终止。在专用硬件上,起主导作用的是另外的约束:RTX 4050受限于电池供电的功率上限,而Hailo-10H受限于模块内存带宽。RTX 4050在34.1 W下维持131.7 tok/s;Hailo-10H在不到2 W下维持6.9 tok/s,波动接近于零,在能耗比例性上与RTX 4050相当,但吞吐量低19倍。这些结果应被解读为针对单一模型和提示类型的平台级部署特征,反映的是硬件与软件的综合表现,而非仅关于硬件能力的一般性结论。

英文摘要

Deploying large language models on-device for always-on personal agents demands sustained inference from hardware tightly constrained in power, thermal envelope, and memory. We benchmark Qwen 2.5 1.5B (4-bit quantised) across four platforms: a Raspberry Pi 5 with Hailo-10H NPU, a Samsung Galaxy S24 Ultra, an iPhone 16 Pro, and a laptop NVIDIA RTX 4050 GPU. Using a fixed 258-token prompt over 20 warm-condition iterations per device, we measure throughput, latency, power, and thermal behaviour. For mobile platforms, thermal management supersedes peak compute as the primary constraint: the iPhone 16 Pro loses nearly half its throughput within two iterations, and the S24 Ultra suffers a hard OS-enforced GPU frequency floor that terminates inference entirely. On dedicated hardware, distinct constraints dominate: the RTX 4050 is bounded by its battery power ceiling, while the Hailo-10H is limited by on-module memory bandwidth. The RTX 4050 sustains 131.7 tok/s at 34.1 W; the Hailo-10H sustains 6.9 tok/s at under 2 W with near-zero variance, matching the RTX 4050 in energy proportionality at 19x lower throughput. Results should be interpreted as platform-level deployment characterisations for a single model and prompt type, reflecting hardware and software combined, rather than general claims about hardware capability alone.
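The energy-proportionality claim can be checked from the figures quoted above. The short calculation below uses only the reported power and throughput numbers; since the Hailo-10H power is given as "under 2 W", 2.0 W is used as an upper bound, which is an assumption of this sketch.

```python
def joules_per_token(watts, tokens_per_second):
    """Energy per generated token: power (J/s) divided by throughput (tok/s)."""
    return watts / tokens_per_second

rtx_4050 = joules_per_token(34.1, 131.7)         # about 0.26 J/token
hailo_10h_upper = joules_per_token(2.0, 6.9)     # at most about 0.29 J/token
throughput_ratio = 131.7 / 6.9                   # about 19x
```

Both platforms land in the same range of roughly a quarter to a third of a joule per token, while their throughput differs by a factor of about 19, consistent with the abstract's statement.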

2605.03229 2026-06-09 cs.CL cs.LG 版本更新

Sparse Memory Finetuning as a Low-Forgetting Alternative to LoRA and Full Finetuning

稀疏记忆微调:作为LoRA和全微调的低遗忘替代方案

Prakhar Gupta, Garv Shah, Satyam Goyal, Anirudh Kanchi

发表机构 * University of Washington(华盛顿大学)

AI总结 提出稀疏记忆微调(SMF),通过添加键值记忆层并仅更新当前批次最活跃的记忆行,在MedMCQA任务上提升2.5个百分点,同时将遗忘探针(WikiText困惑度和TriviaQA准确率)控制在基线的1个百分点内,优于LoRA和全微调。

详情
AI中文摘要

将预训练语言模型适应新任务通常会损害其已有的通用能力,这一问题被称为灾难性遗忘。稀疏记忆微调(SMF)通过向模型添加键值记忆层,并在每个训练步骤中仅更新当前批次读取最频繁的一小组记忆行来避免这种情况。我们在Qwen-2.5-0.5B-Instruct上重新实现了SMF,并将其与LoRA和全微调在MedMCQA(一个4选1的医学考试任务)上进行比较,使用WikiText困惑度和TriviaQA准确率作为遗忘探针。SMF将MedMCQA提升了2.5个百分点,同时将两个遗忘探针保持在基线的约1个百分点内,而LoRA和全微调虽然取得了更大的增益,但在两个探针上都出现了明显的漂移。我们还比较了两种行选择规则(KL散度和TF-IDF),它们在两个遗忘指标上取得了不同的平衡。

英文摘要

Adapting a pretrained language model to a new task often hurts the general capabilities it already had, a problem known as catastrophic forgetting. Sparse Memory Finetuning (SMF) tries to avoid this by adding key-value memory layers to the model and, on each training step, updating only the small set of memory rows that the current batch reads most heavily. We re-implement SMF on Qwen-2.5-0.5B-Instruct and compare it with LoRA and full finetuning on MedMCQA, a 4-choice medical exam task, using WikiText perplexity and TriviaQA accuracy as forgetting probes. SMF improves MedMCQA by 2.5 percentage points while keeping both forgetting probes within roughly 1 point of the base model, whereas LoRA and full finetuning achieve larger gains but with clear drift on both. We also compare two row-selection rules (KL-divergence and TF-IDF), which balance the two forgetting metrics differently.
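The row-sparse update described above can be sketched in a few lines. This version selects rows by raw access count, which is a simplification: the abstract compares KL-divergence and TF-IDF selection rules, and neither is implemented here. The function name and data layout are hypothetical.

```python
from collections import Counter

def sparse_memory_step(memory, accessed_rows, grads, top_t, lr):
    """Apply an SGD step only to the top_t memory rows the batch read most often.

    memory: dict row_id -> list of floats
    accessed_rows: row ids read by the current batch, with repeats
    grads: dict row_id -> gradient list with the same length as the row
    """
    counts = Counter(accessed_rows)
    selected = [row for row, _ in counts.most_common(top_t)]
    for row in selected:
        memory[row] = [w - lr * g for w, g in zip(memory[row], grads[row])]
    return selected
```

All other rows, and in the full method the rest of the model, stay untouched on that step, which is the mechanism the abstract credits for reduced forgetting.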

2605.28207 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Pruning and Distilling Mixture-of-Experts into Dense Language Models

将混合专家模型剪枝和蒸馏为密集语言模型

Junhyuck Kim, Jihun Yun, Haechan Kim, Gyeongman Kim, Joonghyun Bae, Jaewoong Cho

发表机构 * KRAFTON KAIST(韩国科学技术院)

AI总结 提出首个将混合专家(MoE)模型转换为标准密集架构的系统框架,通过专家评分、选择、分组、拼接和知识蒸馏,在参数匹配条件下比密集到密集剪枝平均下游准确率提升6.3个百分点,训练速度提升1.6倍。

详情
AI中文摘要

混合专家(MoE)现在是前沿语言模型的主导架构,但它需要将所有专家参数加载到内存中,因此在内存受限的部署中不太受欢迎。现有的压缩方法减少了专家数量,但输出仍然是具有相同基本限制的MoE模型。我们提出了第一个将训练好的MoE转换为标准全密集架构的系统框架:专家被评分、选择和分组,然后拼接成密集的前馈网络(FFN),并通过MoE教师的知识蒸馏进行精炼。我们在Qwen3-30B-A3B上评估了7种评分方法、5种分组方法和2种幅度缩放方法,涵盖了多种选定的专家数量,共产生350种配置。我们发现评分方法的选择影响最大,我们提出的新颖的多样性感知评分在Qwen3-30B-A3B、DeepSeek-V2-Lite和GPT-OSS-20B上始终优于先前的方法。在参数匹配的受控比较下,经过约4B token的蒸馏,MoE到密集的转换在平均下游准确率上比密集到密集的剪枝高出6.3个百分点,训练壁钟速度提升1.6倍。

英文摘要

Mixture-of-Experts (MoE) is now the dominant architecture for frontier language models, yet it requires all expert parameters to be loaded in memory, making it less preferable for memory-constrained deployment. Existing compression methods reduce the number of experts but the output remains an MoE model with the same fundamental limitation. We present the first systematic framework for converting a trained MoE into a standard fully dense architecture: experts are scored, selected, and grouped, then concatenated into a dense FFN and refined by knowledge distillation from the MoE teacher. We evaluate 7 scoring, 5 grouping, and 2 magnitude scaling methods across a range of selected expert counts on Qwen3-30B-A3B, yielding 350 configurations. We find that the choice of scoring method is the most impactful, with our novel diversity-aware scoring consistently outperforming prior methods on Qwen3-30B-A3B, DeepSeek-V2-Lite, and GPT-OSS-20B. Under a controlled comparison at matched parameter count, MoE-to-dense outperforms dense-to-dense pruning by +6.3 pp in average downstream accuracy after ~4B-token distillation at 1.6x faster training wall-clock speed.
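The concatenation step rests on a simple identity: stacking the up-projections of several experts by rows and joining their down-projections by columns gives one dense FFN whose output equals the sum of the experts' outputs. The sketch below demonstrates this with a ReLU FFN; the gated activation, router weighting, and magnitude scaling used in practice are left out, and all names are illustrative.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def relu(v):
    return [max(0.0, a) for a in v]

def ffn(w_up, w_down, x):
    """y = w_down @ relu(w_up @ x)"""
    return matvec(w_down, relu(matvec(w_up, x)))

def concat_experts(experts):
    """experts: list of (w_up, w_down) pairs sharing input and output sizes.

    Hidden units are laid out expert by expert, in the same order in both
    the stacked up-projection and the joined down-projection.
    """
    w_up = [row for up, _ in experts for row in up]
    d_out = len(experts[0][1])
    w_down = [[v for _, down in experts for v in down[i]] for i in range(d_out)]
    return w_up, w_down
```

Since the identity yields an unweighted sum, some rescaling is needed to mimic router-weighted mixing, which is presumably the role of the magnitude scaling methods the abstract mentions.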

2605.31158 2026-06-09 cs.CV cs.LG 版本更新

Light Interaction: Training-Free Inference Acceleration for Interactive Video World Models

Light Interaction:交互式视频世界模型的免训练推理加速

Jiacheng Lu, Haoyi Zhu, Sipei Yi, Enze Xie, Yu Li, Cheng Zhuo

发表机构 * Zhejiang University(浙江大学) NVIDIA

AI总结 针对交互式视频世界模型推理成本高的问题,提出免训练加速框架Light Interaction,通过自适应上下文管理、去噪缓存加速和3D块稀疏注意力实现最高2.59倍加速。

Comments 13 pages, 6 figures, 3 tables. Project page: https://2843721358l-del.github.io/Light-Interaction-Project/

详情
AI中文摘要

交互式视频世界模型根据用户控制的相机运动逐块生成视频,支持实时游戏模拟、虚拟场景导航和具身AI训练等应用。然而,由于上下文记忆增长、二次注意力复杂度和重复去噪步骤,扩展到长交互轨迹的成本过高。我们提出Light Interaction,一种用于交互式视频世界模型的免训练推理加速框架。我们的关键洞察是,交互自然支持轨迹依赖的自适应计算:在探索新区域时可丢弃检索到的空间记忆,根据局部潜在动态调整时间上下文,当相机重新访问熟悉区域时可重用早期步骤的模型输出。基于此洞察,Light Interaction结合了自适应上下文管理、去噪缓存加速以及硬件-软件协同设计的3D块稀疏注意力(融合Triton内核)。在HY-WorldPlay和Matrix-Game-3.0上的评估表明,Light Interaction在无需模型重训练的情况下实现了最高2.59倍加速,同时保持有竞争力的视觉质量。

英文摘要

Interactive video world models generate video chunk by chunk in response to user-controlled camera movements, enabling applications such as real-time game simulation, virtual scene navigation, and embodied AI training. However, scaling to long interactive trajectories is prohibitively expensive due to growing context memory, quadratic attention complexity, and repeated denoising steps. We present Light Interaction, a training-free inference acceleration framework for interactive video world models. Our key insight is that interaction naturally enables trajectory-dependent adaptive computation: retrieved spatial memory can be discarded during novel exploration, temporal context can be adjusted according to local latent dynamics, and early-step model outputs can be reused when the camera revisits familiar regions. Based on this insight, Light Interaction combines adaptive context management, denoising cache acceleration, and hardware-software co-designed 3D block sparse attention with fused Triton kernels. Evaluated on HY-WorldPlay and Matrix-Game-3.0, Light Interaction achieves up to 2.59x speedup without model retraining while maintaining competitive visual quality.

2411.09816 2026-06-09 cs.LG 版本更新

Learning Fine-grained Parameter Sharing via Sparse Tensor Decomposition

通过稀疏张量分解学习细粒度参数共享

Cem Üyük, Mike Lasby, Mohamed Yassin, Utku Evci, Yani Ioannou

发表机构 * Department of Computer Science, Technical University of Munich(慕尼黑工业大学计算机科学系) Schulich School of Engineering, University of Calgary(卡尔加里大学舒立克工程学院) Google DeepMind, Canada(谷歌DeepMind,加拿大)

AI总结 提出FiPS框架,通过跨块参数共享、低秩分解和稀疏性联合优化,压缩Transformer MLP,在ViT和LLM上实现高效压缩且性能损失小。

Comments Accepted as is to Transactions on Machine Learning Research (TMLR), 2026. OpenReview: https://openreview.net/forum?id=vbS7Z8Zswe

详情
AI中文摘要

大型神经网络在许多任务上实现了最先进的性能,但其庞大的规模阻碍了在资源受限设备上的部署。在现有的压缩方法中,跨层参数共享对于Transformer模型而言仍相对未被探索。本文介绍了细粒度参数共享(FiPS),这是一个统一的框架,用于压缩Transformer多层感知器(MLP),它在一个优化中结合了跨块参数共享、低秩分解和稀疏性。FiPS将一组Transformer块中的MLP权重矩阵拼接起来,并将其分解为共享基和稀疏的、特定于层的投影矩阵。两个因子均通过奇异值分解(SVD)初始化,并通过逐块重构误差最小化进行联合优化。FiPS将视觉Transformer(ViT)压缩高达33%,在ImageNet-1k上top-1准确率损失小于1%,结合微调时压缩高达57%。它还将大型语言模型(LLM)压缩高达20%,同时在匹配压缩的情况下,在困惑度和下游基准测试中优于现有的基于SVD的方法。结合量化感知训练(QAT),在Gemma-2-2B上使用3位FiPS实现了比单独使用2位QAT更低的困惑度,同时达到相同的8倍压缩。这些结果确立了细粒度参数共享作为Transformer MLP压缩的一种实用且有效的方法。

英文摘要

Large neural networks achieve state-of-the-art performance on many tasks, yet their sheer size hinders deployment on resource-constrained devices. Among existing compression approaches, cross-layer parameter sharing remains relatively unexplored for transformer models. In this paper, we introduce Fine-grained Parameter Sharing (FiPS), a unified framework for compressing transformer Multi-Layer Perceptrons (MLPs) that combines cross-block parameter sharing, low-rank factorization, and sparsity in a single optimization. FiPS concatenates MLP weight matrices across a group of transformer blocks and factorizes them into a shared basis and sparse, layer-specific projection matrices. Both factors are initialized via singular value decomposition (SVD) and jointly optimized by block-wise reconstruction error minimization. FiPS compresses Vision Transformers (ViTs) by up to 33% with less than 1% top-1 accuracy loss on ImageNet-1k, and by up to 57% when combined with fine-tuning. It also compresses Large Language Models (LLMs) by up to 20% while outperforming existing SVD-based methods in perplexity and downstream benchmarks at matched compression. Combined with Quantization-Aware Training (QAT), 3-bit FiPS on Gemma-2-2B achieves lower perplexity than 2-bit QAT alone while matching the same 8x compression. These results establish fine-grained parameter sharing as a practical and effective approach for transformer MLP compression.

7. 联邦学习、隐私与安全 20 篇

2606.07621 2026-06-09 cs.LG cs.AI cs.DC 新提交

HASA: Subnet Allocation for Compute-Constrained Model-Heterogeneous Federated Learning

HASA:计算受限的模型异构联邦学习中的子网分配

Amir Hossein Shahdadian, Ahmed M. Abdelmoniem, Mahdi Taheri, Samira Nazari, Christian Herglotz

发表机构 * University of Naples "Federico II"(那不勒斯腓特烈二世大学) Queen Mary University of London(伦敦玛丽女王大学) Brandenburg University of Technology Cottbus-Senftenberg(勃兰登堡工业大学) Tallinn University of Technology(塔林理工大学) University of Zanjan(赞詹大学)

AI总结 提出HASA方法,根据客户端异构性分数分配子网宽度,在固定计算预算下提升平均和最差客户端准确率。

详情
AI中文摘要

边缘服务越来越多地使用联邦学习来个性化设备上的模型,同时将敏感数据保留在本地。在实践中,部署必须处理客户端资源和本地数据分布的异构性。模型异构联邦学习通过允许每个客户端训练共享超网的子网来降低客户端成本,但大多数子网分配策略由设备约束驱动,并未明确考虑统计异构性。本文提出异构感知子网分配(HASA),这是一种仅训练规则,根据从本地训练数据计算的客户端异构性分数分配子网宽度,同时强制执行固定的大小加权计算预算。该设计能够与替代分配策略进行预算匹配的比较。在包含七个客户端的文章标题下一个单词预测基准测试中,HASA在10个匹配种子上的未加权平均客户端测试准确率优于均匀分配,将平均客户端测试准确率从13.82%提高到14.32%,并平均提高了最差客户端准确率。在与代表性部分训练基线的匹配预算比较中,HASA在该基准测试上实现了最强的最差客户端和尾部客户端准确率。方向性消融实验表明,将较小的子网分配给更异构的客户端会降低平均和尾部性能。跨领域图像分类研究进一步表明,异构感知分配的有效性取决于异构性分数反映客户端对额外模型宽度需求的程度。

英文摘要

Edge services increasingly use federated learning to personalize on-device models while keeping sensitive data local. In practice, deployments must handle heterogeneity in both client resources and local data distributions. Model-heterogeneous federated learning lowers client cost by allowing each client to train a subnet of a shared supernet, but most subnet-allocation policies are driven by device constraints and do not explicitly account for statistical heterogeneity. This paper proposes Heterogeneity-Aware Subnet Allocation (HASA), a train-only rule that assigns subnet widths based on client heterogeneity scores computed from local training data while enforcing a fixed size-weighted compute budget. This design enables budget-matched comparisons with alternative allocation policies. On an article-title next-word prediction benchmark with seven clients, HASA improves unweighted mean client test accuracy over uniform allocation across 10 matched seeds, increasing mean client test accuracy from 13.82 percent to 14.32 percent, and improves worst-client accuracy on average. In a matched-budget comparison with representative partial-training baselines, HASA achieves the strongest worst-client and tail-client accuracy on this benchmark. A directionality ablation shows that assigning smaller subnets to more heterogeneous clients degrades both mean and tail performance. A cross-domain image-classification study further shows that the effectiveness of heterogeneity-aware allocation depends on how well the heterogeneity score reflects clients' need for additional model width.
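A budget-constrained allocation of the kind described above might look like the following sketch. It assumes widths proportional to the heterogeneity score, rescaled so that the dataset-size-weighted mean width equals the budget; this rule and the name `allocate_widths` are assumptions for illustration, since the abstract does not spell out HASA's exact formula.

```python
def allocate_widths(scores, sizes, budget):
    """Assign each client a subnet width proportional to its heterogeneity score.

    scores: per-client heterogeneity scores (positive floats)
    sizes: per-client training-set sizes, used as weights in the budget
    budget: target size-weighted mean width, e.g. 0.5 of the supernet
    """
    total_size = sum(sizes)
    weighted_score = sum(n * s for n, s in zip(sizes, scores))
    # Rescale so that sum(n_i * w_i) / sum(n_i) == budget.
    return [budget * s * total_size / weighted_score for s in scores]
```

More heterogeneous clients receive wider subnets here, matching the direction the ablation in the abstract supports. A practical version would also clip widths to the supernet's valid range and redistribute the remainder, which this sketch omits.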

2606.07702 2026-06-09 cs.LG cs.AI 新提交

EvoCSFL: Surrogate-Assisted Evolutionary Client Selection for Efficient and Robust Federated Learning

EvoCSFL:基于代理辅助的进化客户端选择实现高效鲁棒联邦学习

Lin Qiang, Sun Xiaoyan, Hu Yao, Fang Wei

发表机构 * Jiangnan University(江南大学) The Hong Kong Polytechnic University(香港理工大学)

AI总结 针对联邦学习中客户端数据与系统异构性导致收敛慢、鲁棒性差的问题,提出代理辅助的进化客户端选择框架,将选择问题建模为组合优化,用代理模型加速进化搜索,实验表明收敛更快、能耗更低、鲁棒性更强。

详情
AI中文摘要

客户端数据和系统的异构性使得采用随机客户端选择的联邦学习难以获得令人满意的收敛速度和鲁棒性。为解决此问题,本文提出了一种基于代理辅助的客户端进化选择框架。在该框架中,首先使用一些典型的客户端选择策略生成候选集,并开发了一个集成模型性能、通信延迟和能量消耗的度量函数,将客户端选择问题表述为组合优化问题。随后,利用候选选择和度量构建代理模型,以高效逼近所选客户端子集的性能。采用进化算法搜索客户端选择的组合空间,并由代理模型引导以加速收敛。在MNIST、CIFAR10、CINIC10和TinyImageNet上的实验表明,与现有方法相比,所提算法实现了更快的收敛、更低的能量消耗和更好的鲁棒性。

英文摘要

The heterogeneity of client data and systems makes it difficult to achieve satisfactory convergence speed and robustness in federated learning with random client selection. To address this issue, this paper proposes a surrogate-assisted client evolutionary selection framework for federated learning. In this framework, some typical client selection strategies are first used to generate candidate sets, and a metric function that integrates model performance, communication latency, and energy consumption is developed to formulate the client selection problem as a combinatorial optimization one. Subsequently, a surrogate model is constructed using the candidate selections and metric to efficiently approximate the performance of selected client subsets. An evolutionary algorithm is employed to search the combinatorial space of client selections, guided by the surrogate model to accelerate convergence. Experiments on MNIST, CIFAR10, CINIC10, and TinyImageNet demonstrate that the proposed algorithm achieves faster convergence, lower energy consumption, and improved robustness compared to existing methods.

2606.08027 2026-06-09 cs.LG cs.AI 新提交

CausShield: Sample Reconstruction-Resilient Vertical FL via Causal Representation Learning

CausShield: 通过因果表示学习实现样本重建鲁棒的纵向联邦学习

Yongqi Jiang, Yansong Gao, Siguang Chen, Anmin Fu

发表机构 * Nanjing University of Science and Technology(南京理工大学) University of Western Australia(西澳大学) Hohai University(河海大学) Nanjing University(南京大学)

AI总结 针对纵向联邦学习中样本重建攻击的防御问题,提出基于因果表示学习的CausShield方法,将共享表示分解为任务相关与无关部分,实现全周期隐私保护,理论证明收敛性,实验优于七种最新方法。

详情
AI中文摘要

纵向联邦学习(VFL)是一种分布式学习范式,利用跨孤立方的垂直划分特征,无需共享原始样本;然而,它仍然容易受到主动样本重建攻击。现有防御方法由于要么抑制任务相关信息的同时也抑制了隐私敏感特征,要么依赖端到端监督训练来收敛防御模块(这暴露了早期轮次的脆弱性),因此无法在模型效用和隐私保护之间实现令人满意的权衡。为了解决这一挑战,我们采用结构因果模型(SCM)的见解,构建了CausShield。从任务学习的角度来看,原始样本中的因果特征是那些直接相关且有助于学习目标的特征,而非因果特征与任务无关,但通常编码了样本特定的私有信息,从而促进了重建。重要的是,我们奠定了理论基础来证明这一见解。因此,CausShield将VFL中客户端与协调服务器之间的共享表示分解为任务相关和任务无关的组件,以确保全周期的隐私保护。然而,由于在保持模型效用的同时减轻隐私泄露的双重目标,这种分解本质上具有挑战性。我们通过一个精心制定的优化问题来解决这一问题,该问题通过无监督表示学习求解。我们进一步从理论上证明CausShield保持了标准VFL的收敛行为。大量实验将CausShield与七种最新方法(包括InvL (USENIX Security'25))进行比较,并评估了对高级重建攻击(如URVFL (NDSS'25))的鲁棒性。结果表明,CausShield在隐私保护、模型效用和计算效率方面始终表现优异。

英文摘要

Vertical federated learning (VFL) is a distributed learning paradigm that leverages vertically partitioned features across isolated parties without sharing raw samples; however, it remains vulnerable to active sample reconstruction attacks. Existing defenses fail to achieve a satisfactory trade-off between model utility and privacy protection, due to either suppressing task-relevant information alongside privacy-sensitive features or relying on end-to-end supervised training to converge the defense module, which exposes the model to early-epoch vulnerability. To address this challenge, we adopt a structural causal model (SCM) insight and construct CausShield. From a task-learning standpoint, causal features within a raw sample are those that are directly relevant and contributory to the learning objective, whereas non-causal features are task-irrelevant but often encode sample-specific private information, thereby facilitating reconstruction. Importantly, we lay a theoretical foundation to prove this insight. CausShield thus decomposes the shared representations between the client and the coordinating server in VFL into task-relevant and task-irrelevant components to ensure full-cycle privacy protection. Nonetheless, the decomposition is inherently challenging due to the dual objectives of preserving model utility while mitigating privacy leakage. We address this via a carefully formulated optimization problem, which is solved through unsupervised representation learning. We further theoretically prove that CausShield preserves the convergence behavior of standard VFL. Extensive experiments compare CausShield against seven SOTAs, including InvL (USENIX Security'25), and evaluate robustness against advanced reconstruction attacks such as URVFL (NDSS'25). Results demonstrate that CausShield consistently outperforms in privacy protection, model utility, and computational efficiency.

2606.08473 2026-06-09 cs.LG 新提交

Physically Consistent Null Space Alignment for Detection of Low-Magnitude False Data Injection Attacks

物理一致零空间对齐用于检测低幅值虚假数据注入攻击

Xin Li, Chenhan Xiao, Jonathan Cohen, Aviad Elyashar, Yang Weng, Rami Puzis

发表机构 * Ben-Gurion-University(本-古里安大学)

AI总结 提出物理一致零空间对齐(PCNSA)框架,通过伪零空间守恒预处理保持物理零空间与测量伪零空间的几何对应,从而检测低幅值但高影响的隐蔽虚假数据注入攻击。

Comments 12 pages, 13 figures

详情
AI中文摘要

虚假数据注入攻击(FDIAs)引入小的测量扰动,当注入信号与系统模型的伪零空间对齐时,仍可能导致电力系统状态估计出现较大偏差。现有的基于模型和数据驱动的检测器可能无法识别这种低幅值但高影响的攻击,因为残差检验忽略了隐藏在伪零空间中的变化,而子空间学习方法捕获相关模式但未强制执行物理一致性。本文提出物理一致零空间对齐(PCNSA),一种通过预处理保持物理零空间与测量导出伪零空间之间的几何对应来检测隐蔽FDIAs的框架。关键在于伪零空间守恒数据预处理(PSCP)步骤,该步骤在子空间提取之前将测量重新表达在物理坐标系中。我们证明PSCP保持了行空间与其正交补之间的分离,这是传统逐特征标准化所违反的性质。这使得奇异值分解(SVD)导出的伪零子空间与物理残差空间对齐,而无需显式知道H。在IEEE 14、30、57和118节点系统上的实验证实了这一原理:逃避XTM、LSTM、AE和Isolation Forest基线的隐蔽攻击在对齐子空间中表现为明显偏差,从而获得更高的F1分数和检测精度,同时在部分可观测性和实际PMU噪声下保持鲁棒性。

英文摘要

False data injection attacks (FDIAs) introducing small measurement perturbations can still cause large deviations in power system state estimation when the injected signals align with the pseudo-null space of the system model. Existing model- and data-driven detectors may fail to identify such low-magnitude but high-impact attacks because residual tests ignore changes hidden in the pseudo-null space, while subspace learning methods capture correlation patterns without enforcing physical consistency. This paper proposes Physically Consistent Null Space Alignment (PCNSA), a framework that detects stealthy FDIAs by preserving, through preprocessing, the geometric correspondence between the physical null space and the measurement-derived pseudo-null space. The key point is a Pseudo-null Space Conserved data Preprocessing (PSCP) step that re-expresses measurements in the physical coordinate frame before subspace extraction. We prove that PSCP preserves the separation between row space and its orthogonal complement, a property that conventional per-feature standardization violates. This keeps the singular value decomposition (SVD)-derived pseudo-null subspace aligned with the physical residual space without explicit knowledge of H. Experiments on IEEE 14-, 30-, 57-, and 118-bus systems confirm this principle in practice: stealthy attacks that evade XTM, LSTM, AE and Isolation Forest baselines appear as clear deviations in the aligned subspace, yielding higher F1-score and detection accuracy while remaining robust under partial observability and realistic PMU noise.

2606.09301 2026-06-09 cs.LG 新提交

PRISM: Topology-Aware Cross-Modal Imputation for Modality-Deficient Federated Graph Learning

PRISM: 面向模态缺失联邦图学习的拓扑感知跨模态插补

Zekai Chen, Miao Zhang, Jiayang Xing, Xunkai Li, Xun Wu, Rong-Hua Li, Guoren Wang

发表机构 * Beijing Institute of Technology(北京理工大学)

AI总结 针对联邦图学习中客户端级模态缺失问题,提出拓扑感知跨模态插补框架PRISM,通过联邦检索缺失模态语义并利用拓扑控制注入局部图传播,在六个多模态图数据集上平均提升4.48%。

详情
AI中文摘要

多模态联邦图学习(MM-FGL)旨在从包含文本和图像的分散图中协作学习。然而,现实世界的客户端可能没有共同的模态基础:视觉搜索客户端可能包含图像-交互图但没有卖家描述,而目录客户端可能提供文本但没有产品图像。我们将这种实际设置称为客户端级模态缺失。与随机的实例级缺失不同,缺失模态的客户端缺乏重建缺失模态所需的局部语义基础。更重要的是,在图学习中,不完整的表示初始化消息传递,因此插补误差可以被接收拓扑过滤、混合和放大。为了解决这一问题,我们提出了PRISM(Proactive Retrieval and Imputation via Structural Meta-prompting),一个拓扑感知的联邦跨模态插补框架。PRISM不是仅从局部观测重建缺失模态,而是从联邦中恢复缺失模态语义,并在拓扑感知控制下将其引入局部图传播。在六个多模态图数据集上的实验表明,PRISM持续改善模态缺失客户端,平均优于最先进的基线4.48%。

英文摘要

Multimodal federated graph learning (MM-FGL) aims to collaboratively learn from decentralized graphs with text and images. However, real-world clients may not share a common modality basis: a visual-search client may contain image-interaction graphs but no seller descriptions, while a catalog client may provide text but no product images. We refer to this practical setting as client-level modality deficiency. Unlike random instance-wise missingness, a deficient client lacks the local semantic basis needed to reconstruct the absent modality. More importantly, in graph learning, incomplete representations initialize message passing, so imputation errors can be filtered, mixed, and amplified by the receiving topology. To address this gap, we propose PRISM (Proactive Retrieval and Imputation via Structural Meta-prompting), a topology-aware federated cross-modal imputation framework. Rather than reconstructing the missing modality solely from local observations, PRISM recovers missing-modality semantics from the federation and introduces them into local graph propagation under topology-aware control. Experiments on six multimodal graph datasets across graph-centric and modality-centric tasks show that PRISM consistently improves modality-deficient clients, outperforming state-of-the-art baselines by 4.48% on average.

2606.09401 2026-06-09 cs.LG cs.CR 新提交

Benchmarking Empirical Privacy Protection for Adaptations of Large Language Models

大语言模型适配的实证隐私保护基准测试

Bartłomiej Marek, Lorenzo Rossi, Vincent Hanke, Xun Wang, Michael Backes, Franziska Boenisch, Adam Dziedzic

发表机构 * CISPA Helmholtz Center for Information Security(CISPA 欧洲信息安全中心)

AI总结 通过系统变化适配数据分布,使用鲁棒成员推断和金丝雀数据提取攻击,评估差分隐私下大语言模型的实际隐私风险,发现分布偏移显著影响隐私脆弱性,LoRA等参数高效微调方法对分布外数据提供最佳实证保护。

Comments Accepted at ICLR 2026 (Oral)

详情
AI中文摘要

最近的工作应用差分隐私(DP)来适配大语言模型(LLMs)以用于敏感应用,提供了理论保证。然而,其实用有效性仍不明确,部分原因是LLM预训练中,与适配数据的重叠和相互依赖关系可能破坏隐私,尽管采用了DP。为了在实践中分析这一问题,我们使用最先进的攻击(如鲁棒成员推断和金丝雀数据提取)调查了DP适配下LLMs中的隐私风险。我们通过系统变化适配数据分布(从与预训练数据完全重叠,经过分布内(IID)情况,到完全分布外(OOD)示例)来对这些风险进行基准测试。此外,我们评估了不同适配方法和不同隐私机制对脆弱性的影响。我们的结果表明,分布偏移强烈影响隐私脆弱性:在相同的理论保证下,适配数据越接近预训练分布,实际隐私风险越高,即使没有直接的数据重叠。我们发现,参数高效微调方法(如LoRA)对OOD数据实现了最高的实证隐私保护。我们的基准测试确定了在DP LLM适配中实现实际隐私的关键因素,为在敏感环境中部署定制模型提供了可操作的见解。展望未来,我们提出了一个结构化框架,用于超越适配隐私的整体隐私评估,以识别和评估LLM的完整预训练-适配流程中的风险。

英文摘要

Recent work has applied differential privacy (DP) to adapt large language models (LLMs) for sensitive applications, offering theoretical guarantees. However, its practical effectiveness remains unclear, partly due to LLM pretraining, where overlaps and interdependencies with adaptation data can undermine privacy despite DP efforts. To analyze this issue in practice, we investigate privacy risks under DP adaptations in LLMs using state-of-the-art attacks such as robust membership inference and canary data extraction. We benchmark these risks by systematically varying the adaptation data distribution, from exact overlaps with pretraining data, through in-distribution (IID) cases, to entirely out-of-distribution (OOD) examples. Additionally, we evaluate how different adaptation methods and different privacy regimes impact the vulnerability. Our results show that distribution shifts strongly influence privacy vulnerability: the closer the adaptation data is to the pretraining distribution, the higher the practical privacy risk at the same theoretical guarantee, even without direct data overlap. We find that parameter-efficient fine-tuning methods, such as LoRA, achieve the highest empirical privacy protection for OOD data. Our benchmark identifies key factors for achieving practical privacy in DP LLM adaptation, providing actionable insights for deploying customized models in sensitive settings. Looking forward, we propose a structured framework for holistic privacy assessment beyond adaptation privacy, to identify and evaluate risks across the full pretrain-adapt pipeline of LLMs.

2606.09582 2026-06-09 cs.LG stat.ML 新提交

On Choosing the $μ$ Parameter in Gaussian Differential Privacy

论高斯差分隐私中参数 $μ$ 的选择

Bogdan Kulynych, Antti Honkela

发表机构 * Lausanne University Hospital(洛桑大学医院) University of Helsinki(赫尔辛基大学)

AI总结 本文通过匹配强对手成员推理攻击的最坏情况成功度,提供从纯-DP ε到GDP μ的原则性映射,并推荐 μ≈ε/5 作为保守通用转换。

详情
AI中文摘要

近期工作主张使用高斯差分隐私(GDP)来报告隐私保护机器学习中的隐私保证。我们通过匹配强对手成员推理攻击在最坏情况下的成功度,基于三个指标提供了从纯-DP ε到GDP μ的原则性映射:固定FPR下的乘法优势、固定召回率下的精确度以及标准隐私轮廓。我们在有用参数范围内列出了μ值,并推荐μ≈ε/5作为保守的通用转换。

英文摘要

Recent work argues for using Gaussian differential privacy (GDP) to report the privacy guarantees in privacy-preserving machine learning. We provide principled mappings from pure-DP $\varepsilon$ to GDP $μ$ by matching the worst-case success of a strong-adversary membership inference attack in terms of three metrics: multiplicative advantage at fixed FPR, precision at fixed recall, and the standard privacy profile. We tabulate $μ$ values across a useful range of parameters and recommend $μ\approx \varepsilon/5$ as a conservative general-purpose conversion.
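To make the quantities in this abstract concrete, the sketch below evaluates two standard closed forms for a μ-GDP guarantee: the privacy profile δ(ε) and the best true-positive rate a membership inference attack can reach at a fixed false-positive rate. The `mu_from_epsilon` helper simply encodes the μ ≈ ε/5 rule of thumb quoted above. It is not the paper's full mapping, which depends on the chosen metric and parameters, and the function names are illustrative.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal, mean 0 and sigma 1


def mu_from_epsilon(epsilon: float) -> float:
    """Rule of thumb quoted in the abstract: mu is roughly epsilon / 5."""
    return epsilon / 5.0


def gdp_delta(mu: float, epsilon: float) -> float:
    """Privacy profile of mu-GDP (requires mu > 0):
    delta(eps) = Phi(-eps/mu + mu/2) - exp(eps) * Phi(-eps/mu - mu/2).
    """
    phi = _STD_NORMAL.cdf
    return (phi(-epsilon / mu + mu / 2)
            - math.exp(epsilon) * phi(-epsilon / mu - mu / 2))


def gdp_max_tpr(mu: float, fpr: float) -> float:
    """Largest true-positive rate at a given FPR (0 < fpr < 1) under mu-GDP.

    Follows from the Gaussian trade-off curve:
    TPR = Phi(Phi^{-1}(FPR) + mu).
    """
    return _STD_NORMAL.cdf(_STD_NORMAL.inv_cdf(fpr) + mu)
```

For example, `gdp_max_tpr(mu_from_epsilon(5.0), 0.01)` gives the attack power bound at 1% FPR for the μ that the rule of thumb assigns to ε = 5.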

2606.08179 2026-06-09 cs.DS cs.CR cs.LG 交叉投稿

Differentially Private Range Subgraph Counting

差分隐私范围子图计数

Xian Chen, Ruobing Bai, Pan Peng

发表机构 * School of Computer Science and Technology, University of Science and Technology of China(计算机科学与技术学院,中国科学技术大学)

AI总结 针对子图计数中的隐私问题,提出差分隐私范围子图计数(DPRSC)问题,通过子图投影将其转化为加权正交范围计数,结合范围树和局部敏感度估计实现低误差隐私查询,并证明误差下界与维度指数相关。

Comments ICML2026

详情
AI中文摘要

子图计数是图分析中的一个基本问题。受实际场景(图分析在选定顶点诱导的子图上进行,而非整个图)以及日益增长的隐私需求的推动,我们首次研究了差分隐私范围子图计数(DPRSC)。其目标是在由多维属性范围定义的诱导子图中,对固定模式图的出现次数进行隐私计数。与经典的点计数不同,子图计数本质上是非线性的且具有高敏感性:单条边的修改可能影响许多子图出现。我们提出了首个具有小加性误差的高效DPRSC算法。我们的方法引入了一个子图投影,将DPRSC简化为加权正交范围计数,从而能够利用范围树和局部敏感度估计来实现准确的隐私查询回答。我们通过将重建攻击归约到DPRSC并利用差异理论,给出了与算法匹配的下界。特别地,我们证明任何用于DPRSC的差分隐私算法都必须承受与维度指数相关的加性误差。实验评估表明,我们的算法在准确性和运行时间上显著优于基线方法,同时保持强大的隐私保证。

英文摘要

Subgraph counting is a fundamental problem in graph analysis. Motivated by practical scenarios where graph analytics are performed on subgraphs induced by selected vertices -- rather than on the entire graph -- and by growing privacy concerns, we initiate the study of differentially private range subgraph counting (DPRSC). The goal is to privately count occurrences of a fixed pattern graph within induced subgraphs defined by multi-dimensional attribute ranges. Unlike classical point counting, subgraph counting is inherently nonlinear and exhibits high sensitivity: a single edge modification can affect many subgraph occurrences. We present the first efficient algorithms for DPRSC with small additive error. Our approach introduces a subgraph projection that reduces DPRSC to weighted orthogonal range counting, enabling the use of range trees and local sensitivity estimation to achieve accurate private query answering. We complement our algorithms with matching lower bounds, obtained by reducing reconstruction attacks to DPRSC and leveraging discrepancy theory. In particular, we show that any differentially private algorithm for DPRSC must incur additive error exponential in the dimension. Empirical evaluations demonstrate that our algorithms significantly outperform baseline methods in accuracy and runtime while maintaining strong privacy guarantees.

2606.09411 2026-06-09 cs.CR cs.IT cs.LG math.IT 交叉投稿

Now You (Still) See Me: Detecting Evasive Steganographic Payloads in LLMs

现在你(仍然)能看到我:检测大语言模型中的隐蔽隐写载荷

Charles Westphal, Timothy Douglas, Keivan Navaie, Tiago Pimentel, Fernando E. Rosas

发表机构 * UCL Centre for AI(UCL人工智能中心) University College London(伦敦大学学院) ML Alignment Theory Scholars(机器学习对齐理论学者) Department of Computer Science(计算机科学系) School of Computing and Communications(计算与通讯学院) ETH Zürich(苏黎世联邦理工学院) University of Sussex(萨塞克斯大学) Imperial College London & University of Oxford(伦敦帝国学院与牛津大学)

AI总结 针对大语言模型隐写外泄风险,提出一种基于非线性MLP探针的对抗性微调方法可系统规避现有线性探针检测,但通过信息论指导的数据级干预可恢复检测能力。

详情
AI中文摘要

大型语言模型可以通过微调将提示中的秘密编码到流畅、看似良性的输出中。这造成了一种隐写外泄风险,难以通过输出级隐写分析检测。最近的工作提出使用线性探针从内部激活中恢复秘密的机制检测方法。我们表明这种防御可以被系统性地规避,但通过针对性的数据级干预可以恢复可检测性。首先,我们将检测设置扩展到包括非线性MLP探针。然后,我们在五个基础模型上对抗性微调隐写木马:Qwen3-8B、Llama-3.1-8B、Ministral-8B、Qwen3-14B和Phi-4-14B。得到的模型在规避岭回归和留出MLP探针的同时,保留了58%–79%的精确匹配秘密恢复,在六个基准测试中平均能力下降1%–8%。然后,我们给出了这种规避的信息论特征。成功的规避在保持可恢复性的同时,降低了从内容对齐表示中提取秘密的低阶可提取性,迫使载荷与剩余自由度产生协同交互。这激发了一个重新语境化数据集,限制了这些剩余自由度。在该分布上,所有五个规避木马的岭回归和MLP可检测性都得到恢复。总体而言,我们的发现表明基于激活的隐写检测容易受到自适应规避的影响,但理论指导的评估分布可以暴露原本隐藏的载荷。

英文摘要

Large language models can be fine-tuned to encode prompt-borne secrets into fluent, seemingly benign outputs. This creates a steganographic exfiltration risk that is difficult to detect with output-level steganalysis. Recent work proposes mechanistic detection using linear probes that recover the secret from internal activations. We show that this defense can be systematically evaded, but that detectability can be recovered through a targeted data-level intervention. First, we extend the detection setup to include a non-linear MLP probe. We then adversarially fine-tune steganographic trojans across five base models: Qwen3-8B, Llama-3.1-8B, Ministral-8B, Qwen3-14B, and Phi-4-14B. The resulting models retain $58$--$79\%$ exact-match secret recovery while evading both ridge and held-out MLP probes, with $1$--$8\%$ average capability degradation across six benchmarks. We then give an information-theoretic characterization of this evasion. Successful evasion preserves recoverability while reducing low-order extractability of the secret from the content-aligned representation, forcing the payload into synergistic interaction with residual degrees of freedom. This motivates a recontextualization dataset that restricts these residual degrees of freedom. On this distribution, both ridge and MLP detectability are restored across all five evasive trojans. Overall, our findings show that activation-based steganography detection is vulnerable to adaptive evasion, but also that theory-guided evaluation distributions can expose otherwise hidden payloads.

2409.15723 2026-06-09 cs.LG cs.CL 版本更新

Federated Large Language Models: Current Progress and Future Directions

联邦大语言模型:当前进展与未来方向

Yuhang Yao, Jianyi Zhang, Junda Wu, Chengkai Huang, Yu Xia, Tong Yu, Ruiyi Zhang, Sungchul Kim, Ryan Rossi, Ang Li, Lina Yao, Julian McAuley, Yiran Chen, Carlee Joe-Wong

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Duke University(杜克大学) University of California San Diego(加州大学圣地亚哥分校) The University of New South Wales(新南威尔士大学) Adobe Research(Adobe研究) University of Maryland College Park(马里兰大学学院公园分校) CSIRO’s Data61(澳大利亚联邦科学与工业研究组织Data61)

AI总结 本文综述联邦学习与大语言模型结合(FedLLM)的最新进展,重点分析联邦微调和联邦提示学习如何应对效率、个性化和安全挑战,并展望联邦预训练和联邦智能体等方向。

Comments Accepted by PAKDD 2026

详情
AI中文摘要

大语言模型在各种应用中取得了令人印象深刻的性能,但其训练通常依赖于集中式数据收集,引发了严重的隐私和治理问题。联邦学习通过使多个客户端能够协作训练共享模型而不暴露原始本地数据,提供了一种去中心化的替代方案。然而,将联邦学习与大语言模型集成带来了新的挑战,包括数据异质性、收敛不稳定性、通信开销和计算约束。本综述提供了联邦学习用于大语言模型(FedLLM)的全面且最新的概述。我们系统地回顾了近期进展,特别强调联邦微调和联邦提示学习,并分析了现有方法如何应对效率、个性化和安全挑战。我们进一步总结了新兴方向,如联邦预训练和联邦智能体。我们的目标是提供对这个快速发展领域的结构化视角,并突出未来研究的有前景的途径。

英文摘要

Large Language Models have achieved impressive performance across diverse applications, yet their training typically depends on centralized data collection, raising serious privacy and governance concerns. Federated Learning offers a decentralized alternative by enabling multiple clients to collaboratively train shared models without exposing raw local data. However, integrating FL with LLMs introduces new challenges, including data heterogeneity, convergence instability, communication overhead, and computational constraints. This survey provides a comprehensive and up-to-date overview of Federated Learning for Large Language Models (FedLLM). We systematically review recent advances, with particular emphasis on federated fine-tuning and federated prompt learning, and analyze how existing methods address efficiency, personalization, and security challenges. We further summarize emerging directions such as federated pre-training and federated agents. Our goal is to offer a structured perspective on this rapidly evolving field and to highlight promising avenues for future research.

2410.05662 2026-06-09 cs.LG 版本更新

Communication-Efficient Federated Learning under Dynamic Device Arrival and Departure: Convergence Analysis and Algorithm Design

动态设备加入和离开下的通信高效联邦学习:收敛性分析与算法设计

Zhan-Lun Chang, Dong-Jun Han, Seyyedali Hosseinalipour, Mung Chiang, Christopher G. Brinton

发表机构 * Elmore Family School of Electrical and Computer Engineering, Purdue University(埃洛姆家族电气与计算机工程学院,普渡大学) Department of Computer Science and Engineering, Yonsei University(延世大学计算机科学与工程系) Department of Electrical Engineering, University at Buffalo–SUNY(布法罗大学(SUNY)电气工程系)

AI总结 针对设备动态加入/离开的联邦学习场景,提出基于梯度相似性的模型初始化算法,通过加权平均历史全局模型加速分布偏移恢复,实现收敛速度提升一个数量级以上。

详情
AI中文摘要

大多数联邦学习(FL)方法假设设备集固定。然而,现实场景中设备常因用户移动模式或跨小区切换等动态加入或离开系统。这种动态设置带来了独特挑战:(1)优化目标随活动设备集演变,不同于传统FL的静态目标;(2)当前全局模型可能不再作为后续轮次的有效初始化,可能阻碍适应、延迟收敛并降低资源效率。为应对这些挑战,我们首先对动态设备集下的FL进行收敛性分析,考虑了梯度噪声、本地训练迭代次数以及该实际设置中的数据异质性等因素。受此分析启发,我们提出一种模型初始化算法,使设备加入或离开网络时能够快速适应。我们的关键思想是计算先前全局模型的加权平均,以梯度相似性为指导,优先选择在数据分布与当前设备集紧密对齐上训练的模型,从而在更少的训练轮次中加速从分布偏移中恢复。这种即插即用算法设计为与现有FL方法无缝集成,具有广泛适用性。实验表明,与基线相比,我们的方法通常实现一个数量级或更多的收敛加速,我们证明这大幅降低了达到目标精度的能耗。

英文摘要

Most federated learning (FL) approaches assume a fixed device set. However, real-world scenarios often involve devices dynamically joining or leaving the system, driven by, e.g., user mobility patterns or handovers across cell boundaries. This dynamic setting introduces unique challenges: (1) the optimization objective evolves with the active device set, unlike traditional FL's static objective; and (2) the current global model may no longer serve as an effective initialization for subsequent rounds, potentially hindering adaptation, delaying convergence, and reducing resource efficiency. To address these challenges, we first provide a convergence analysis for FL under a dynamic device set, accounting for factors such as gradient noise, local training iterations, and data heterogeneity in this practical setting. Motivated by this analysis, we propose a model initialization algorithm that enables rapid adaptation whenever devices join or leave the network. Our key idea is to compute a weighted average of previous global models, guided by gradient similarity, to prioritize models trained on data distributions that closely align with the current device set, thereby accelerating recovery from distribution shifts in fewer training rounds. This plug-and-play algorithm is designed to integrate seamlessly with existing FL methods, offering broad applicability. Experiments demonstrate that our approach achieves convergence speedups typically an order of magnitude or more compared to baselines, which we show drastically reduces energy consumption to reach a target accuracy.
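The initialization idea in this abstract, a similarity-weighted average of earlier global models, can be sketched as follows. The abstract does not give the exact weighting, so the softmax over cosine similarities, the `temperature` parameter, and the choice of one reference gradient per stored model are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def init_from_history(models, past_grads, current_grad, temperature=0.1):
    """Weighted average of previous global models (hypothetical scheme).

    models[i] is a flat parameter list, past_grads[i] is a gradient
    recorded for it, and current_grad comes from the current device set.
    Models whose gradient points in a similar direction get more weight.
    """
    sims = [cosine(g, current_grad) for g in past_grads]
    top = max(sims)  # subtract the max for a numerically stable softmax
    exps = [math.exp((s - top) / temperature) for s in sims]
    norm = sum(exps)
    weights = [e / norm for e in exps]
    dim = len(models[0])
    init = [sum(w * m[j] for w, m in zip(weights, models)) for j in range(dim)]
    return init, weights
```

A small `temperature` makes the rule pick almost only the best-matching past model, while a large one approaches a plain average of the history.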

2503.18314 2026-06-09 cs.LG cs.AI cs.CV 版本更新

LoTUS: Large-Scale Machine Unlearning with a Taste of Uncertainty

LoTUS:带有不确定性风味的大规模机器遗忘

Christoforos N. Spartalis, Theodoros Semertzidis, Petros Daras, Efstratios Gavves

发表机构 * University of Amsterdam(阿姆斯特丹大学) Centre for Research & Technology Hellas(希腊研究中心与技术中心) Archimedes/Athena RC(阿基米德/雅典娜研究中心)

AI总结 提出LoTUS方法,通过平滑预测概率至信息论界限来消除训练样本影响,避免从头重训练,在Transformer和ResNet18模型上超越现有方法,并引入RF-JSD指标用于实际评估。

Comments Accepted as a main conference paper at CVPR 2025 (https://cvpr.thecvf.com/virtual/2025/poster/33292)

详情
AI中文摘要

我们提出了LoTUS,一种新颖的机器遗忘(MU)方法,它消除了预训练模型中训练样本的影响,避免了从头开始重新训练。LoTUS将模型的预测概率平滑到信息论界限,减轻了因数据记忆导致的过度自信。我们在Transformer和ResNet18模型上,针对五个公共数据集,与八个基线方法进行了评估。除了已有的MU基准测试,我们还在ImageNet1k(一个大规模数据集,其中重新训练不切实际)上评估了遗忘效果,模拟了真实世界条件。此外,我们引入了新颖的无重训练杰森-香农散度(RF-JSD)指标,以便在真实世界条件下进行评估。实验结果表明,LoTUS在效率和有效性方面均优于最先进的方法。代码:https://github.com/cspartalis/LoTUS。

英文摘要

We present LoTUS, a novel Machine Unlearning (MU) method that eliminates the influence of training samples from pre-trained models, avoiding retraining from scratch. LoTUS smooths the prediction probabilities of the model up to an information-theoretic bound, mitigating its over-confidence stemming from data memorization. We evaluate LoTUS on Transformer and ResNet18 models against eight baselines across five public datasets. Beyond established MU benchmarks, we evaluate unlearning on ImageNet1k, a large-scale dataset, where retraining is impractical, simulating real-world conditions. Moreover, we introduce the novel Retrain-Free Jensen-Shannon Divergence (RF-JSD) metric to enable evaluation under real-world conditions. The experimental results show that LoTUS outperforms state-of-the-art methods in terms of both efficiency and effectiveness. Code: https://github.com/cspartalis/LoTUS.

2601.22669 2026-06-09 cs.LG 版本更新

Beyond Fixed Rounds: Data-Free Early Stopping for Practical Federated Learning

超越固定轮次:面向实际联邦学习的无数据早停法

Youngjoon Lee, Hyukjoon Lee, Seungrok Jung, Andy Luo, Jinu Gong, Yang Cao, Joonhyuk Kang

发表机构 * arXiv

AI总结 提出一种无数据早停框架,通过监控任务向量增长率确定最优停止点,在皮肤病变/血细胞/结肠病理分类任务中达到与基于验证集的早停相当的性能,且仅需少量额外轮次。

Comments Under Review

详情
AI中文摘要

联邦学习(FL)无需传输原始数据即可实现去中心化协作学习。然而,依赖固定的全局轮次或验证数据进行超参数调优会带来高计算成本和隐私风险,阻碍了实际部署。为解决这一问题,我们提出了一种无数据早停框架,该框架仅使用服务器端参数监控任务向量的增长率来确定最优停止点。在皮肤病变/血细胞/结肠病理分类上的数值结果表明,我们的方法与多种最先进FL方法中基于验证集的早停性能相当。特别是,所提出的框架平均需要45/12/31(皮肤病变/血细胞/结肠病理)额外轮次即可实现比基于验证数据早停高12.3%/8.9%/3.9%的性能。此外,该框架仅需9/8/14额外轮次即可筛选不良配置,不到固定轮次预算的3%。据我们所知,这是首个为FL方法提出的无数据早停框架。我们的代码已开源。

英文摘要

Federated Learning (FL) facilitates decentralized collaborative learning without transmitting raw data. However, reliance on fixed global rounds or validation data for hyperparameter tuning hinders practical deployment by incurring high computational costs and privacy risks. To address this, we propose a data-free early stopping framework that determines the optimal stopping point by monitoring the task vector's growth rate using only server-side parameters. The numerical results on skin lesion/blood cell/colon pathology classification demonstrate that our approach is comparable to the validation-based early stopping across various state-of-the-art FL methods. In particular, the proposed framework requires an average of 45/12/31 (skin lesion/blood cell/colon pathology) additional rounds to achieve over 12.3%/8.9%/3.9% higher performance than early stopping based on validation data. Moreover, the proposed framework requires only 9/8/14 additional rounds to screen bad configurations, which is less than 3% of the fixed-round budget. To the best of our knowledge, this is the first work to propose a data-free early stopping framework for FL methods. Our code is available at this open repository.
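A data-free stopping rule of the kind described here can be sketched with server-side parameters only. The abstract does not state the exact criterion, so the relative-growth test, the `threshold`, and the `patience` window below are assumptions. The task vector is taken to be the difference between the current global parameters and the initial ones.

```python
import math


def task_vector_norm(theta, theta0):
    """L2 norm of the task vector theta - theta0 (flat parameter lists)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(theta, theta0)))


def stopping_round(norms, threshold=0.01, patience=3):
    """Hypothetical stopping rule on a per-round list of task-vector norms.

    Returns the first round index at which the relative growth of the
    norm has stayed below `threshold` for `patience` consecutive rounds,
    or None if that never happens. A shrinking norm also counts as calm.
    """
    calm = 0
    for t in range(1, len(norms)):
        prev = norms[t - 1]
        growth = (norms[t] - prev) / prev if prev > 0 else float("inf")
        calm = calm + 1 if growth < threshold else 0
        if calm >= patience:
            return t
    return None
```

No validation data enters the rule: the server only needs the sequence of global models it already holds.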

2601.23221 2026-06-09 cs.LG 版本更新

Optimal Fair Aggregation of Crowdsourced Noisy Labels using Demographic Parity Constraints

使用人口统计平价约束的众包噪声标签的最优公平聚合

Gabriel Singer, Samuel Gruffaz, Olivier Vo Van, Nicolas Vayatis, Argyris Kalogeratos

发表机构 * University of California, Berkeley(加州大学伯克利分校) Université de Paris(巴黎大学) CNRS(国家科学研究中心)

AI总结 针对众包标签聚合中的公平性问题,提出在ε-公平框架下分析多数投票和最优贝叶斯聚合的公平性差距,并推广多类公平后处理算法以强制执行人口统计平价约束。

详情
AI中文摘要

由于获取可靠的真实标签通常成本高昂或不可行,众包和聚合嘈杂的人类注释是典型的替代方案。然而,聚合主观标签可能会放大个体偏见,特别是关于敏感特征的偏见,引发公平性问题。尽管如此,众包聚合中的公平性在很大程度上仍未得到探索,没有现有的收敛保证,只有有限的后处理方法用于在人口统计平价下强制执行ε-公平性。我们通过在ε-公平框架内分析众包聚合方法的公平性差距来填补这一空白,针对多数投票和最优贝叶斯聚合。在小规模众包情形下,我们推导出多数投票的公平性差距的上界,该上界以个体注释者的公平性差距表示。我们进一步表明,在可解释的条件下,聚合共识的公平性差距指数级收敛到真实标签的公平性差距。由于真实标签本身可能仍然不公平,我们将最先进的多类公平后处理算法从连续设置推广到离散设置,该算法对任何聚合规则强制执行严格的人口统计平价约束。在合成和真实数据集上的实验证明了我们方法的有效性,并证实了理论见解。

英文摘要

As acquiring reliable ground-truth labels is usually costly, or infeasible, crowdsourcing and aggregation of noisy human annotations is the typical resort. Aggregating subjective labels, though, may amplify individual biases, particularly regarding sensitive features, raising fairness concerns. Nonetheless, fairness in crowdsourced aggregation remains largely unexplored, with no existing convergence guarantees and only limited post-processing approaches for enforcing $\varepsilon$-fairness under demographic parity. We address this gap by analyzing the fairness gap of crowdsourced aggregation methods within the $\varepsilon$-fairness framework, for Majority Vote and Optimal Bayesian aggregation. In the small-crowd regime, we derive an upper bound on the fairness gap of Majority Vote in terms of the fairness gaps of the individual annotators. We further show that the fairness gap of the aggregated consensus converges exponentially fast to that of the ground-truth under interpretable conditions. Since ground-truth itself may still be unfair, we generalize a state-of-the-art multiclass fairness post-processing algorithm from the continuous to the discrete setting, which enforces strict demographic parity constraints to any aggregation rule. Experiments on synthetic and real datasets demonstrate the effectiveness of our approach and corroborate the theoretical insights.
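The fairness gap studied in this abstract can be computed directly for a given aggregation rule. The sketch below is a simplified binary illustration: it assumes binary labels, two demographic groups coded 0 and 1, and ties broken toward label 1, none of which is taken from the paper (which also covers the multiclass case).

```python
def majority_vote(annotations):
    """Binary majority vote per item; ties go to label 1 (an assumption)."""
    return [1 if 2 * sum(votes) >= len(votes) else 0 for votes in annotations]


def demographic_parity_gap(labels, groups):
    """|P(label = 1 | group 0) - P(label = 1 | group 1)| on a sample.

    Both groups must be non-empty.
    """
    rates = []
    for g in (0, 1):
        members = [y for y, s in zip(labels, groups) if s == g]
        rates.append(sum(members) / len(members))
    return abs(rates[0] - rates[1])


def is_epsilon_fair(labels, groups, epsilon):
    """Demographic parity holds up to epsilon if the gap is at most epsilon."""
    return demographic_parity_gap(labels, groups) <= epsilon
```

Applying `demographic_parity_gap` to each annotator's labels and to the `majority_vote` output gives the two quantities that the small-crowd bound relates.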

2605.20341 2026-06-09 cs.LG cs.AI cs.CR cs.PF 版本更新

Causal Unlearning in Collaborative Optimization: Exact and Approximate Influence Reversal under Adversarial Contributions

协同优化中的因果遗忘:在对抗性贡献下的精确和近似影响反转

Ali Mahdavi, Azadeh Zamanifar, Amirfarhad Farhadi, Omid Kashefi

发表机构 * Department of Computer Engineering, SRC, Islamic Azad University Tehran, Iran(伊朗德黑兰伊斯兰阿扎德大学SRC计算机工程系) School of Computer Engineering, Iran University of Science and Technology Tehran, Iran(伊朗德黑兰伊朗科学技术大学计算机工程学院) Meta CA, USA(美国加利福尼亚州Meta公司)

AI总结 本文提出HF-KCU方法,通过共轭梯度迭代在Krylov子空间中近似影响函数,从而在协同优化中实现数据删除,减少计算复杂度并提高隐私保护效果。

详情
AI中文摘要

联邦学习系统必须支持数据删除请求以符合隐私法规,但每次删除后从头重新训练在计算上不可行。我们提出了HF-KCU方法,通过在Krylov子空间中进行共轭梯度迭代来近似影响函数,从而移除某一客户端的贡献,将复杂度从O(d^3)降低到O(kd),其中k<<d。因果加权机制确保只有持有被删除数据的客户端接收参数更新,防止对未受影响的客户端造成虚假变化。我们的方法设计用于处理Hessian和梯度上的有界对抗性扰动,在现实威胁模型下实现平稳降级。我们在卷积(ResNet-18、SimpleCNN)和Transformer(ViT-Lite)架构以及CIFAR-10、MNIST和Fashion-MNIST数据集上验证了HF-KCU。在CIFAR-10的Dirichlet(alpha=0.5)划分下,HF-KCU相比重新训练实现了47.75倍的加速,同时测试准确率与重新训练基线的差距保持在0.60个百分点以内(71.16% vs 71.76%)。对遗忘集的成员推断攻击成功率为0.499,与重新训练的模型一致,证实了有效的隐私恢复。我们提供了收敛保证,表明Krylov近似误差按O((κ^{1/2}-1)/(κ^{1/2}+1))递减,其中κ是Hessian条件数。因果加权机制确保了外科手术式的精确更新:只有持有被删除数据的客户端被修改,从而保护未受影响参与者的模型质量,并避免了异步联邦设置中基于梯度的方法的不稳定性。该设计具有可解释性,因为每次更新都可以直接追溯到被删除数据的影响。该方法的效率和精度使其适用于生产级联邦系统,其中删除请求异步到达且计算预算受限。

英文摘要

Federated learning systems must support data deletion requests to comply with privacy regulations, yet retraining from scratch after each deletion is computationally prohibitive. We present HF-KCU, a method that removes a client's contribution by approximating the influence function through conjugate gradient iterations in Krylov subspaces, reducing complexity from O(d^3) to O(kd), where k << d. A causal weighting mechanism ensures that only clients holding the deleted data receive parameter updates, preventing spurious changes to unaffected clients. Our method is designed to handle bounded adversarial perturbations to the Hessian and gradient, providing graceful degradation under realistic threat models. We validate HF-KCU across convolutional (ResNet-18, SimpleCNN) and transformer (ViT-Lite) architectures on CIFAR-10, MNIST, and Fashion-MNIST. On CIFAR-10 under Dirichlet (alpha=0.5) partitioning, HF-KCU achieves a 47.75x speedup over retraining while keeping test accuracy within 0.60 percentage points of the retrained baseline (71.16% vs. 71.76%). Membership inference attacks on the forget set yield a success rate of 0.499, matching the retrained model and confirming effective privacy restoration. We provide convergence guarantees showing that the Krylov approximation error decreases as O((κ^{1/2}-1)/(κ^{1/2}+1)), where κ is the Hessian condition number. The causal weighting mechanism ensures surgical updates, where only clients holding deleted data are modified, preserving model quality for unaffected participants and avoiding the instability of gradient-based approaches in asynchronous federated settings. This design provides interpretability, as each update is directly traceable to the influence of the deleted data. The method's efficiency and precision make it suitable for production federated systems where deletion requests arrive asynchronously and computational budgets are constrained.
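The core numerical step named in this abstract, approximating an inverse-Hessian-vector product with a few conjugate gradient iterations, can be sketched as follows. This is plain textbook conjugate gradient, not the HF-KCU implementation: the causal weighting and robustness components are omitted, and `unlearn_step` uses the classical influence-function removal update as an assumed stand-in. The Hessian is accessed only through a caller-supplied `hvp` function, so each iteration costs one Hessian-vector product.

```python
def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def conjugate_gradient(hvp, g, iters, tol=1e-12):
    """Approximately solve H x = g for symmetric positive-definite H.

    hvp(v) must return H @ v as a list. At most `iters` products are used.
    """
    x = [0.0] * len(g)
    r = list(g)          # residual g - H x, with x = 0 initially
    p = list(r)          # current search direction
    rs = _dot(r, r)
    for _ in range(iters):
        if rs <= tol:
            break
        hp = hvp(p)
        alpha = rs / _dot(p, hp)
        x = [xi + alpha * pi for xi, pi in zip(x, p)]
        r = [ri - alpha * hi for ri, hi in zip(r, hp)]
        rs_new = _dot(r, r)
        p = [ri + (rs_new / rs) * pi for ri, pi in zip(r, p)]
        rs = rs_new
    return x


def unlearn_step(theta, hvp, grad_forget, n, iters):
    """Classical influence-function removal (assumed form):
    theta_new = theta + (1/n) * H^{-1} grad_forget,
    where H is the average Hessian over n training points.
    """
    direction = conjugate_gradient(hvp, grad_forget, iters)
    return [t + d / n for t, d in zip(theta, direction)]
```

With k iterations and an O(d) Hessian-vector product, the cost is O(kd), which is the scaling the abstract contrasts with the O(d^3) cost of a direct inverse.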

2509.20714 2026-06-09 cs.CR cs.LG 版本更新

Cryptographic Backdoor for Neural Networks: Boon and Bane

神经网络的密码学后门:福与祸

Anh Tu Ngo, Anupam Chattopadhyay, Subhamoy Maitra

发表机构 * College of Computing and Data Science, Nanyang Technological University Singapore(南洋理工大学计算与数据科学学院) Applied Statistics Unit, Indian Statistical Institute(印度统计研究所应用统计单位)

AI总结 本文展示密码学后门在神经网络中的双重作用:既可发动强大隐形攻击,也可用于鲁棒水印、用户认证和知识产权追踪,并证明这些协议在标准假设下是鲁棒的。

Comments Preprint

详情
AI中文摘要

在本文中,我们展示了神经网络中的密码学后门在两个方向上可以非常有效,即发起攻击以及提供防御。在攻击方面,精心植入的密码学后门能够对神经网络发动强大且隐形的攻击。在防御方面,我们提出了应用:首先,一个可证明鲁棒的神经网络水印方案;其次,一个保证用户认证的协议;第三,一个追踪神经网络知识产权未授权共享的协议。从更广泛的理论视角来看,借鉴Goldwasser等人[FOCS 2022]的思想,我们的主要贡献是表明所有这些实例化的实际协议实现都是可证明鲁棒的。水印、认证和知识产权追踪协议能够抵抗对神经网络具有黑盒访问权限的对手,而基于后门的对抗攻击在标准假设下是无法阻止的。虽然我们攻击所使用的理论工具与Goldwasser等人的思路基本一致,但与防御相关的证明需要进一步研究。最后,所有这些协议都在最先进的神经网络架构上实现,实验结果证实了理论主张。此外,可以利用后量子原语来实现密码学后门,为机器学习中的量子时代应用奠定基础。

英文摘要

In this paper we show that cryptographic backdoors in a neural network (NN) can be highly effective in two directions: mounting attacks and presenting defenses. On the attack side, a carefully planted cryptographic backdoor enables powerful and invisible attacks on the NN. On the defense side, we present three applications: first, a provably robust NN watermarking scheme; second, a protocol for guaranteeing user authentication; and third, a protocol for tracking unauthorized sharing of the NN intellectual property (IP). From a broader theoretical perspective, borrowing ideas from Goldwasser et al. [FOCS 2022], our main contribution is to show that all these instantiated practical protocol implementations are provably robust. The protocols for watermarking, authentication and IP tracking resist an adversary with black-box access to the NN, whereas the backdoor-enabled adversarial attack is impossible to prevent under standard assumptions. While the theoretical tools used for our attack are mostly in line with the ideas of Goldwasser et al., the proofs related to the defense need further study. Finally, all these protocols are implemented on state-of-the-art NN architectures, with empirical results corroborating the theoretical claims. Further, one can utilize post-quantum primitives for implementing the cryptographic backdoors, laying the foundations for quantum-era applications in machine learning (ML).

2510.17947 2026-06-09 cs.CR cs.AI cs.CL cs.LG cs.MA 版本更新

PLAGUE: Plug-and-play framework for Lifelong Adaptive Generation of Multi-turn Exploits

PLAGUE:面向多轮利用的终身自适应生成的即插即用框架

Neeladri Bhuiya, Madhav Aggarwal, Diptanshu Purwar

发表机构 * A10 Networks, Inc.(A10网络公司) University of Massachusetts Amherst(马萨诸塞大学阿姆赫斯特分校)

AI总结 提出PLAGUE框架,通过终身学习启发的三阶段设计(Primer、Planner、Finisher)实现高效多轮越狱攻击,在o3和Opus 4.1等强安全模型上ASR提升超30%。

Comments Accepted in ICLR 2026

详情
AI中文摘要

大型语言模型(LLMs)正以惊人的速度改进。随着智能体工作流的出现,多轮对话已成为与LLMs交互以完成长而复杂任务的事实标准。尽管LLM能力持续提升,但它们仍然越来越容易受到越狱攻击,尤其是在多轮场景中,有害意图可以巧妙地注入到对话中,产生恶意结果。虽然单轮攻击已被广泛探索,但适应性、效率和有效性仍然是多轮攻击面临的关键挑战。为了解决这些不足,我们提出了PLAGUE,一种新颖的即插即用框架,用于设计受终身学习智能体启发的多轮攻击。PLAGUE将多轮攻击的生命周期分解为三个精心设计的阶段(Primer、Planner和Finisher),从而实现对多轮攻击家族的系统性和信息丰富的探索。评估表明,使用PLAGUE设计的红队智能体实现了最先进的越狱结果,在更少或相当的查询预算下,领先模型的攻击成功率(ASR)提高了30%以上。特别是,PLAGUE在OpenAI的o3上实现了81.4%的ASR(基于StrongReject),在Claude的Opus 4.1上实现了67.3%的ASR,这两个模型在安全文献中被认为对越狱具有高度抵抗力。我们的工作提供了工具和见解,以理解计划初始化、上下文优化和终身学习在构建多轮攻击以进行全面模型脆弱性评估中的重要性。

英文摘要

Large Language Models (LLMs) are improving at an exceptional rate. With the advent of agentic workflows, multi-turn dialogue has become the de facto mode of interaction with LLMs for completing long and complex tasks. While LLM capabilities continue to improve, they remain increasingly susceptible to jailbreaking, especially in multi-turn scenarios where harmful intent can be subtly injected across the conversation to produce nefarious outcomes. While single-turn attacks have been extensively explored, adaptability, efficiency and effectiveness remain key challenges for their multi-turn counterparts. To address these gaps, we present PLAGUE, a novel plug-and-play framework for designing multi-turn attacks inspired by lifelong-learning agents. PLAGUE dissects the lifetime of a multi-turn attack into three carefully designed phases (Primer, Planner and Finisher) that enable a systematic and information-rich exploration of the multi-turn attack family. Evaluations show that red-teaming agents designed using PLAGUE achieve state-of-the-art jailbreaking results, improving attack success rates (ASR) by more than 30% across leading models within a smaller or comparable query budget. In particular, PLAGUE enables an ASR (based on StrongReject) of 81.4% on OpenAI's o3 and 67.3% on Claude's Opus 4.1, two models that are considered highly resistant to jailbreaks in the safety literature. Our work offers tools and insights to understand the importance of plan initialization, context optimization and lifelong learning in crafting multi-turn attacks for a comprehensive model vulnerability evaluation.

2601.04266 2026-06-09 cs.CR cs.LG 版本更新

State Backdoor: Towards Stealthy Real-world Poisoning Attack on Vision-Language-Action Model in State Space

状态后门:针对状态空间中视觉-语言-动作模型的隐蔽现实世界投毒攻击

Ji Guo, Wenbo Jiang, Yansong Lin, Yijing Liu, Ruichen Zhang, Guomin Lu, Aiguo Chen, Xinshuo Han, Hongwei Li

发表机构 * Laboratory of Intelligent Collaborative Computing, University of Electronic Science and Technology of China(电子科技大学智能协同计算实验室) National Key Laboratory of Wireless Communications, University of Electronic Science and Technology of China(电子科技大学无线通信国家重点实验室) School of Computer Science and Engineering, University of Electronic Science and Technology of China(电子科技大学计算机科学与工程学院) College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics(南京航空航天大学计算机科学与技术学院) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算与数据科学学院)

AI总结 提出状态后门攻击,利用机器人手臂初始状态作为触发器,通过偏好引导遗传算法优化触发器的隐蔽性和有效性,在五个VLA模型和五个真实任务中实现超过90%的攻击成功率。

详情
AI中文摘要

视觉-语言-动作(VLA)模型广泛部署于机器人等安全关键的具身AI应用中。然而,其复杂的多模态交互也暴露了新的安全漏洞。本文研究了VLA模型中的后门威胁,即恶意输入导致目标错误行为,同时保持对干净数据的性能。现有后门方法主要依赖在视觉模态中插入可见触发器,由于环境变化,在现实场景中鲁棒性差且隐蔽性低。为克服这些限制,我们引入状态后门,一种新颖且实用的后门攻击,利用机器人手臂的初始状态作为触发器。为优化触发器的隐蔽性和有效性,我们设计了偏好引导遗传算法(PGA),高效搜索状态空间以找到最小但有效的触发器。在五个代表性VLA模型和五个真实任务上的大量实验表明,我们的方法在不影响良性任务性能的情况下实现了超过90%的攻击成功率,揭示了具身AI系统中一个未被充分探索的漏洞。

英文摘要

Vision-Language-Action (VLA) models are widely deployed in safety-critical embodied AI applications such as robotics. However, their complex multimodal interactions also expose new security vulnerabilities. In this paper, we investigate a backdoor threat in VLA models, where malicious inputs cause targeted misbehavior while preserving performance on clean data. Existing backdoor methods predominantly rely on inserting visible triggers into the visual modality, and they suffer from poor robustness and low insusceptibility in real-world settings due to environmental variability. To overcome these limitations, we introduce the State Backdoor, a novel and practical backdoor attack that leverages the robot arm's initial state as the trigger. To optimize the trigger for insusceptibility and effectiveness, we design a Preference-guided Genetic Algorithm (PGA) that efficiently searches the state space for minimal yet potent triggers. Extensive experiments on five representative VLA models and five real-world tasks show that our method achieves an attack success rate above 90% without affecting benign task performance, revealing an underexplored vulnerability in embodied AI systems.

2604.07125 2026-06-09 cs.CR cs.LG 版本更新

Scalable and Private Federated Learning Using Distributed Differential Privacy and Secure Aggregation

可扩展且隐私保护的联邦学习:利用分布式差分隐私和安全聚合

Wenjing Wei, Farid Nait-Abdesselam, Alla Jammine

发表机构 * Université Paris Cité(巴黎Cité大学)

AI总结 本文提出DDP-SA框架,结合客户端侧本地差分隐私和全阈值加法秘密共享,实现安全聚合,提供更强的端到端隐私保障且计算可行。

Comments Submitted to IEEE Transactions on Dependable and Secure Computing (under review)

详情
AI中文摘要

本文提出了DDP-SA,一种可扩展的隐私保护联邦学习框架,联合利用客户端侧本地差分隐私(LDP)和全阈值加法秘密共享(ASS)进行安全聚合。与仅依赖差分隐私或安全多方计算(MPC)的方法不同,DDP-SA整合两种技术,提供更强的端到端隐私保障,同时保持计算可行性。该框架引入了两阶段保护机制:客户端首先用校准的拉普拉斯噪声扰动本地梯度,然后将噪声梯度分解为加法秘密份额,分发到多个中间服务器。此设计确保(i)没有单个被入侵的服务器或通信通道能揭示任何关于个体客户端更新的信息,且(ii)参数服务器仅重建聚合的噪声梯度,而从不重建任何客户端特定的贡献。大量实验表明,DDP-SA在模型准确性上显著高于独立LDP,同时提供比MPC-only方法更强的隐私保护。所提框架的开销随参与者数量线性增长,并提供了一个实用的、隐私保护的联邦学习解决方案,具有可控的计算和通信开销。

英文摘要

This article presents DDP-SA, a scalable privacy-preserving federated learning framework that jointly leverages client-side local differential privacy (LDP) and full-threshold additive secret sharing (ASS) for secure aggregation. Unlike existing methods that rely solely on differential privacy or on secure multi-party computation (MPC), DDP-SA integrates both techniques to deliver stronger end-to-end privacy guarantees while remaining computationally practical. The framework introduces a two-stage protection mechanism: clients first perturb their local gradients with calibrated Laplace noise, then decompose the noisy gradients into additive secret shares that are distributed across multiple intermediate servers. This design ensures that (i) no single compromised server or communication channel can reveal any information about individual client updates, and (ii) the parameter server reconstructs only the aggregated noisy gradient, never any client-specific contribution. Extensive experiments show that DDP-SA achieves substantially higher model accuracy than standalone LDP while providing stronger privacy protection than MPC-only approaches. The proposed framework scales linearly with the number of participants and offers a practical, privacy-preserving solution for federated learning applications with controllable computational and communication overhead.
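
The two-stage mechanism described above (Laplace perturbation, then additive secret sharing across intermediate servers) can be sketched as follows. This is an illustrative toy under stated assumptions, not the DDP-SA code: the modulus `Q`, the fixed-point factor `SCALE`, the noise scale, the scalar "gradients", and all function names are hypothetical. It uses a seeded `random.Random` for reproducibility; a real deployment would need a cryptographically secure generator and a properly calibrated noise scale.

```python
import random

Q = 2**61 - 1        # share arithmetic is done modulo Q (illustrative choice)
SCALE = 10**6        # fixed-point factor used to encode floats as integers


def laplace(scale, rng):
    """Laplace(0, scale) sample as a difference of two exponentials."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def encode(x):
    """Map a float to an element of Z_Q using fixed-point encoding."""
    return round(x * SCALE) % Q


def decode(v):
    """Inverse of encode; the upper half of Z_Q represents negatives."""
    if v > Q // 2:
        v -= Q
    return v / SCALE


def share(value, n_servers, rng):
    """Split value into n_servers additive shares that sum to value mod Q."""
    shares = [rng.randrange(Q) for _ in range(n_servers - 1)]
    shares.append((value - sum(shares)) % Q)
    return shares


rng = random.Random(0)
gradients = [0.25, -0.10, 0.40]      # one scalar gradient per client (toy)
n_servers = 3

# Stage 1: each client perturbs its gradient locally.
noisy = [g + laplace(0.05, rng) for g in gradients]

# Stage 2: each client sends one share to each intermediate server.
per_server = [0] * n_servers
for value in noisy:
    for s, sh in enumerate(share(encode(value), n_servers, rng)):
        per_server[s] = (per_server[s] + sh) % Q

# The parameter server only ever combines per-server totals.
aggregate = decode(sum(per_server) % Q)
```

Because every individual share is uniform in Z_Q, any proper subset of servers sees values that are independent of a client's noisy gradient; only the sum of all server totals decodes to the noisy aggregate.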

2605.30123 2026-06-09 cs.CR cs.LG 版本更新

Privacy-Enhanced Zero-Order Federated Learning via xMK-CKKS over Wireless Channels

基于xMK-CKKS的无线信道隐私增强零阶联邦学习

Anthony Ayli, Khalil Harris, Jihad Fahs, Mohamad Assaad

发表机构 * University of Waterloo(滑铁卢大学)

AI总结 针对无线联邦学习中单密钥同态加密的客户端安全漏洞,提出一种无需信道估计的四阶段协议,利用xMK-CKKS多密钥同态加密实现安全聚合,并集成零阶优化,在保证收敛率的同时降低通信开销。

Comments 12 pages, 3 figures

详情
AI中文摘要

同态加密(HE)通过允许服务器在不解密的情况下操作加密数据,实现了联邦学习(FL)中的隐私保护聚合。现有的无线同态加密方法主要依赖单密钥HE方案,并需要信道估计或预均衡来补偿无线衰落。然而,单密钥HE仍然容易受到共享相同密钥的诚实但好奇客户端的攻击。此外,攻破单个客户端可能危及整个网络的安全性,而多密钥HE方案通过为每个设备分配自己的密钥来提供更强的客户端级安全性。我们提出了一种四阶段协议,使得著名的多密钥HE方案xMK-CKKS能够在共享无线信道上进行聚合,而无需信道估计。该协议通过相同的信道实现重传部分公钥和密文,使得在解密过程中占主导地位的大模数加密项代数相消。我们将该协议与零阶FL集成在缓慢变化的视距主导信道上,其中每个设备每轮传输一个加密标量,通信/加密开销与模型维度无关。我们证明,解码后的加密噪声保持了\(O(1/\sqrt{K})\)的收敛速度,直到可忽略的噪声基底。该协议能够抵抗与最多\(N-1\)个客户端共谋的诚实但好奇服务器,MNIST上的数值结果验证了分析。

英文摘要

Homomorphic encryption (HE) enables privacy-preserving aggregation in federated learning (FL) by allowing the server to operate on encrypted data without decryption. Existing HE-over-the-air (OTA) methods mainly rely on single-key HE schemes and require channel estimation or pre-equalization to compensate for wireless fading. However, single-key HE remains vulnerable to honest-but-curious (HBC) clients holding the shared secret key, while multi-key HE provides stronger client-level security by assigning each device its own secret key. We propose a four-phase protocol that enables the aggregation of xMK-CKKS over a shared wireless channel without channel estimation. The protocol retransmits partial public keys and ciphertexts through the same channel realization, so that the dominant large-modulus encryption terms cancel algebraically during decryption. We integrate this protocol with zero-order FL over slowly varying LoS-dominant channels, where each device transmits a single encrypted scalar per round and the communication/encryption overhead is independent of the model dimension. We show that the residual noise induced by encryption and wireless aggregation preserves the standard convergence rate \(O(1/\sqrt{K})\) up to a negligible noise floor, where \(K\) is the number of communication rounds. The protocol assumes an untrusted server and is secure against HBC clients, preventing any client from recovering the local updates of other participants. Numerical results on MNIST validate the theoretical analysis.

8. 鲁棒性、不确定性与可信学习 56 篇

2606.07581 2026-06-09 cs.LG cs.AI cs.ET 新提交

Training-Inference Kernel Contracts: Bounding Divergence in Post-Training and Deployment

训练-推理核契约:约束后训练与部署中的偏差

Bruce Changlong Xu, Lan Wu

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出核契约框架,通过数值、统计、运行时和可观测性条款约束训练核与推理核之间的分布偏差,并推导偏差界以保障策略梯度无偏性。

详情
AI中文摘要

现代后训练流程通常为其策略π_θ编写一个符号,但通过两个不同的程序进行评估:一个针对自动微分优化的训练核和一个针对低精度、融合、动态批处理服务优化的推理核。在有限精度下,这些核在相同权重下可能产生不同的分布,且差距集中在基准测试未充分代表的切片上。本文提出核契约:一个契约优先的框架,用于指定K_train和K_inf之间可接受的偏差。契约C = (N, S, R, O, Pi) 结合了数值、统计、运行时和可观测性条款,以及从违规到路由操作的升级策略。我们推导了从logit漂移到总变差距离再到有界奖励漂移的链式界限,并将其专门用于强化学习后训练,其中在显式支持和范数假设下,每个token的重要性比率漂移给出了策略梯度偏差的界限。我们还描述了一个四阶段提升管道、在线路由循环以及用于契约工件的极简YAML DSL。本文是一个框架和词汇论文;我们不报告生产规模的实证验证。

英文摘要

A modern post-training pipeline often writes one symbol for its policy, pi_theta, while evaluating it through two different programs: a training kernel optimized for autograd and an inference kernel optimized for low-precision, fused, dynamically batched serving. In finite precision, these kernels can induce different distributions at identical weights, with the gap concentrated on slices that aggregate benchmarks under-represent. This paper proposes kernel contracts: a contract-first framework for specifying acceptable divergence between K_train and K_inf. A contract C = (N, S, R, O, Pi) combines numerical, statistical, runtime, and observability clauses with an escalation policy from violations to routing actions. We derive a chain of bounds from logit drift to total-variation distance to bounded reward drift, and specialize it to RL post-training, where per-token importance-ratio drift yields a bound on policy-gradient bias under explicit support and norm assumptions. We also describe a four-stage promotion pipeline, online routing loop, and minimal YAML DSL for contract artifacts. This is a framework and vocabulary paper; we do not report production-scale empirical validation.
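
The first link in the chain described above, from logit drift to total-variation (TV) distance, can be illustrated numerically. The sketch below uses an elementary bound that follows directly from the softmax definition, and is not necessarily the bound derived in the paper: if two logit vectors differ by at most eps in every coordinate, each softmax probability changes by a factor in [e^{-2 eps}, e^{2 eps}], so TV <= (e^{2 eps} - 1)/2. The logit values and function names are hypothetical.

```python
import math


def softmax(z):
    """Numerically stable softmax over a list of logits."""
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]


def tv_distance(p, q):
    """Total-variation distance between two discrete distributions."""
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def tv_bound_from_logit_drift(eps):
    """Elementary bound: sup-norm logit drift eps implies TV <= (e^{2 eps} - 1)/2."""
    return min(1.0, 0.5 * (math.exp(2.0 * eps) - 1.0))


# Hypothetical logits for the same weights under the two kernels.
z_train = [2.0, 0.5, -1.0, 0.1]      # e.g. full-precision training kernel
z_inf = [2.03, 0.48, -1.02, 0.13]    # e.g. low-precision fused serving kernel

eps = max(abs(a - b) for a, b in zip(z_train, z_inf))
tv = tv_distance(softmax(z_train), softmax(z_inf))
bound = tv_bound_from_logit_drift(eps)
```

A numerical clause of a contract could then be phrased as a limit on `eps`, with `bound` giving the worst-case distributional gap that limit implies.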

2606.07596 2026-06-09 cs.LG 新提交

Shortcuts in the Tail: Debiasing via Post-Hoc Spectral Compression of Fine-Tuning Updates

尾部的捷径:通过微调更新的后验谱压缩进行去偏

Edward Sun, Dmitrii Troitskii

发表机构 * UCLA(加州大学洛杉矶分校) Northeastern University(东北大学)

AI总结 提出对微调权重更新进行SVD截断尾部,无需重训练或组标签即可减少虚假关联,在多个模型和基准上以<2%的准确率损失将差距降低最多5倍。

Comments ICML Weight Space Symmetries Workshop 2026

详情
AI中文摘要

微调常常在引入任务知识的同时引入虚假关联,导致在代表性不足的群体上出现系统性失败。现有的缓解方法需要重训练、组标签或精心设计的反事实数据。我们展示了一种简单的后验干预方法,无需这些条件即可减少捷径依赖:截断 $ΔW = W_\mathrm{ft} - W_\mathrm{base}$ 的SVD尾部,可以在保持任务准确率的同时减少虚假组差距。在三个指令微调模型(0.5B--7B)和四个分类基准上,top-$k$ 截断在每项任务上以<2个百分点的准确率损失减少了差距,在CivilComments上最多减少了5倍。我们提出这是因为捷径响应位于 $ΔW$ 奇异排序的尾部,这是一个关于截断行为而非原始奇异值的论断,原始奇异值分布广泛且在所有四个数据集上看起来相同。一个受控的边界情况(微调只学习一个捷径)显示了预测的FT到基线的崩溃,而bottom-/random-$k$ 和匹配秩的LoRA控制排除了通用低秩近似和秩约束训练作为解释。我们将此视为初步证据,表明 $ΔW$ 的奇异基是研究微调所学内容的有用坐标系。

英文摘要

Fine-tuning often introduces spurious correlations alongside task knowledge, causing systematic failures on underrepresented groups. Existing mitigations require retraining, group labels, or curated counterfactual data. We show a simple post-hoc intervention reduces shortcut reliance without any of these: truncating the tail of the SVD of $ΔW = W_\mathrm{ft} - W_\mathrm{base}$ reduces the spurious-group gap while preserving task accuracy. Across three instruction-tuned models ($0.5$B--$7$B) and four classification benchmarks, top-$k$ truncation reduces the gap on every cell at $<2$ pp accuracy loss, by up to $5\times$ on CivilComments. We propose this works because the shortcut response sits in the tail of the singular ordering of $ΔW$, a claim about how truncation behaves rather than about the raw singular values, which are broadly distributed and look the same across all four datasets. A controlled boundary case in which fine-tuning has only a shortcut to learn shows the predicted FT-to-base collapse, and bottom-/random-$k$ and matched-rank LoRA controls rule out generic low-rank approximation and rank-constrained training as the explanation. We read this as preliminary evidence that the singular basis of $ΔW$ is a useful coordinate system for studying what fine-tuning has learned.

2606.07624 2026-06-09 cs.LG 新提交

Sequential statistical inference for Large Language Models: Representation, validity, and monitoring

大语言模型的序贯统计推断:表示、有效性与监控

Yao Xie

发表机构 * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology(佐治亚理工学院工业与系统工程系)

AI总结 本文提出将序贯统计推断应用于大语言模型可信赖性,围绕表示、有效性和监控三个任务展开,将LLM交互视为依赖随机过程,提供不确定性保证并检测行为变化。

Comments This article was prepared for an invited discussion in The American Statistician

详情
AI中文摘要

本讨论认为序贯统计推断可以自然地促进大语言模型的可信赖性。在部署中,LLM系统被反复查询,条件依赖于不断变化的上下文,并整合用户或工具反馈,在模型更新或分布变化后可能表现出行为转变。讨论围绕三个任务组织:表示,将LLM交互建模为依赖随机过程而非孤立的提示-响应对;有效性,开发在依赖、重复使用和适应下仍有意义的不确定性保证;以及监控,使用序贯警报和变化点检测来识别校准、幻觉率、拒绝行为、公平性或其他任务相关属性的变化。这一视角通过将可信赖的LLM部署视为统计过程控制问题,补充了最近的综述。

英文摘要

This discussion argues that sequential statistical inference can naturally contribute to LLM trustworthiness. In deployment, LLM systems are queried repeatedly, conditioned on evolving contexts, and incorporate user or tool feedback, and may exhibit behavioral shifts after model updates or distribution changes. The discussion is organized around three tasks: representation, modeling LLM interactions as dependent stochastic processes rather than isolated prompt--response pairs; validity, developing uncertainty guarantees that remain meaningful under dependence, repeated use, and adaptation; and monitoring, using sequential alarms and change-point detection to identify shifts in calibration, hallucination rates, refusal behavior, fairness, or other task-relevant properties. This perspective complements recent surveys by viewing trustworthy LLM deployment as a problem of statistical process control.
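
The monitoring task mentioned above (sequential alarms for shifts in, say, hallucination rate) can be made concrete with a one-sided CUSUM chart, a classical tool from statistical process control. This is a generic sketch rather than a procedure taken from the article: the reference value `k`, the threshold `h`, and the synthetic 0/1 stream are illustrative assumptions.

```python
def cusum_alarm(stream, k=0.5, h=3.0):
    """One-sided CUSUM for an upward shift.

    stream: iterable of per-query indicators (1 = flagged response, 0 = fine).
    k: reference value subtracted from each observation.
    h: alarm threshold on the cumulative statistic.
    Returns the 0-based index of the first alarm, or None if no alarm fires.
    """
    s = 0.0
    for t, x in enumerate(stream):
        s = max(0.0, s + x - k)      # reset to zero whenever evidence is negative
        if s >= h:
            return t
    return None


# Synthetic deployment log: 20 clean responses, then a run of flagged ones.
log = [0] * 20 + [1] * 10
alarm_index = cusum_alarm(log)
```

With these settings the statistic stays at zero during the clean stretch and grows by 0.5 per flagged response afterwards, so the alarm fires on the sixth flagged response. In practice `k` and `h` would be chosen to trade off false-alarm rate against detection delay.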

2606.07631 2026-06-09 cs.LG cs.AI cs.CY 新提交

Trait-space Monitoring for Emergent Misalignment During Supervised Finetuning

监督微调中涌现失调的性状空间监测

Huy Nghiem, Sy-Tuyen Ho, Sarah Wiegreffe, Hal Daumé

发表机构 * University of Maryland(马里兰大学)

AI总结 提出利用激活空间中的性状方向监测监督微调中的涌现失调,通过低维几何特征实现高效检测,在7-9B模型上达到0.990 AUROC。

Comments First version. 45 pages

详情
AI中文摘要

涌现失调(EM)发生在窄微调导致模型在微调任务之外出现危险行为时。标准训练信号可能忽略这种偏移,如果依赖重复的行为评估,可靠检测的成本会很高。我们探究是否可以在微调期间从内部表示中检测涌现失调。利用激活空间中编码为线性方向的七个对齐相关性状,我们在四个开源7-9B大语言模型的训练检查点中跟踪表示漂移。EM相关漂移集中在解释65.5%方差的低维轴上,揭示了所研究机制中的几何特征。基于该漂移轮廓构建的低开销监测器在保留的扰动类型上检测危险检查点,假阴性率为2.2%,假阳性率为2.9%,AUROC为0.990,优于无监督PCA和SAE基线。在两个14B模型、更长的微调运行以及失调起始点上的压力测试确定了关键的部署边界。这些结果将性状空间监测定位为基于LoRA的微调中EM检测的行为评估的实用补充,同时表明在显著不同机制下的部署可能需要重新校准。

英文摘要

Emergent misalignment (EM) occurs when narrow finetuning causes a model to behave dangerously outside the finetuning task. Standard training signals can miss this shift, making reliable detection costly if it depends on repeated behavioral evaluation. We ask whether emergent misalignment can instead be detected from internal representations during finetuning. Using seven alignment-relevant traits encoded as linear directions in activation space, we track representational drift across training checkpoints in four open-source 7-9B LLMs. EM-relevant drift concentrates on a low-dimensional axis that explains 65.5% of the variance, revealing a geometric signature in the studied regime. A low-overhead monitor built on this drift profile detects dangerous checkpoints with 2.2% false negative rate, 2.9% false positive rate, and 0.990 AUROC on held-out perturbation types, outperforming unsupervised PCA and SAE baselines. Stress tests on two 14B models, longer finetuning runs, and misaligned starting points identify key deployment boundaries. These results position trait-space monitoring as a practical complement to behavioral evaluation for EM detection during LoRA-based finetuning, while showing that deployment across substantially different regimes may require recalibration.

2606.07696 2026-06-09 cs.LG cs.AI 新提交

Adversarial Robustness of Activation Steering in Large Language Models

大型语言模型中激活引导的对抗鲁棒性

Kien Le, Thai Le

发表机构 * Independent Researcher(独立研究员) Indiana University(印第安纳大学)

AI总结 研究激活引导在对抗性文本扰动下的鲁棒性,发现所有方法、模型和设置中方向鲁棒性下降高达64%,置信度崩溃,层选择脆弱,揭示其结构性脆弱性。

Comments 9 pages, 2 figures

详情
AI中文摘要

激活引导已成为一种流行的免训练方法,通过在推理时将预计算的方向向量注入模型的残差流来控制LLM行为。然而,其对现实输入变化的鲁棒性尚未得到研究。我们首次系统评估了在输入上施加对抗性文本扰动时激活引导的鲁棒性,涵盖了四种提取方法、三种攻击策略、来自Anthropic Model-Written Evaluation数据集的六种人格以及从1.5B到30B参数的五个模型。攻击在所有设置中普遍成功:方向鲁棒性下降高达64%,攻击后置信度在所有方法和模型中崩溃至接近或低于0.25,并且几乎每个可引导输入的引导强度都下降。层选择同样脆弱,通过自动化方法在干净输入上识别的最优层在扰动下偏移多达17个位置,这一失败加剧了向量级别的崩溃。从对抗性扰动输入中提取向量对于中大型模型上的PCA和MD方法部分恢复了可引导性,但它们始终无法定位改进的最优层,限制了这种缓解措施的实际效益。总之,这些发现揭示了激活引导的脆弱性是结构性的而非方法特定的,并且当前的层选择策略对于实际部署不够鲁棒。

英文摘要

Activation steering has become a popular training-free method to control LLM behavior by injecting precomputed direction vectors into the model's residual stream at inference time. Yet its robustness to realistic input variation remains unstudied. We present the first systematic evaluation of activation steering robustness under adversarial text perturbations on the inputs, covering four extraction methods, three attack strategies, six personas from Anthropic Model-Written Evaluation Dataset, and five models ranging from 1.5B to 30B parameters. Attacks succeed broadly across all settings: directional robustness drops by up to 64%, post-attack confidence collapses near or below 0.25 across all methods and models, and steering strength degrades on nearly every steerable input. Layer selection is equally fragile, with the optimal layer identified by an automated method on clean inputs shifting by up to 17 positions under perturbation, a failure that compounds the vector-level breakdown. Extracting vectors from adversarially perturbed inputs partially recovers steerability for PCA and MD on mid-to-large models, but they consistently fail to locate the improved optimal layer, limiting the practical benefit of this mitigation. Together, these findings reveal that the brittleness of activation steering is structural rather than method-specific, and that current layer selection strategies are not robust enough for real-world deployment.

2606.07790 2026-06-09 cs.LG 新提交

Byzantine Cheap Talk: Adversarial Resilience and Topology Effects in LLM Coordination Games

拜占庭廉价谈话:LLM协调博弈中的对抗韧性与拓扑效应

Aya El Mir, Martin Takáč, Salem Lahlou

发表机构 * Mohamed bin Zayed University of Artificial Intelligence (MBZUAI)(穆罕默德·本·扎耶德人工智能大学)

AI总结 研究多智能体LLM在协调博弈中面对拜占庭攻击和通信拓扑限制的脆弱性,发现智能体无法集体适应背叛,且显式限制拓扑会破坏合作,而隐式限制则不影响。

Comments Accepted at NETYS 2026 (The International Conference on Networked Systems)

详情
AI中文摘要

多智能体LLM系统越来越依赖通信协议进行协调,但它们在对抗和结构约束下的鲁棒性仍然知之甚少。基于先前工作表明廉价谈话通道能够在LLM协调博弈中实现合作,我们在一个4人Stag Hunt博弈中,跨越六个模型系列和720次试验,研究了两个脆弱性类别。首先,当拜占庭智能体发出合作信号但背叛时,非拜占庭智能体在一轮内检测到背叛,但未能集体适应:相当一部分智能体尽管反复被利用仍继续合作,由于博弈的一致同意支付结构而无法恢复协调。其次,显式限制通信拓扑会完全破坏合作,而应用相同的隐式限制则保持近乎完美的合作。这表明协调失败源于智能体关于隐藏信息的元推理,而非信息损失本身。我们识别出两种在所有模型队列中复现的稳定行为原型:背叛倾向模型在背叛后永久切换,以及合作坚持模型以显著的个人成本继续合作。这些发现揭示了具体的安全漏洞:通信通道可被利用为对抗注入向量,且向智能体披露网络拓扑即使在没有任何对手的情况下也会削弱协调。

英文摘要

Multi-agent LLM systems increasingly rely on communication protocols for coordination, yet their robustness under adversarial and structural constraints remains poorly understood. Building on prior work showing that cheap-talk channels enable cooperation in LLM coordination games, we investigate two vulnerability classes in a 4-player Stag Hunt across six model families and 720 trials. First, when Byzantine agents signal cooperation but defect, non-Byzantine agents detect the betrayal within one round yet fail to adapt collectively: a substantial fraction continue cooperating despite repeated exploitation, unable to recover coordination due to the game's unanimity payoff structure. Second, explicitly restricting communication topology collapses cooperation, while applying identical restrictions silently preserves near-perfect cooperation. This establishes that coordination failure stems from agents' meta-reasoning about hidden information, not information loss itself. We identify two stable behavioral archetypes that replicate across all model cohorts: Defection-Prone models that switch permanently after betrayal, and Cooperation-Persistent models that continue cooperating at significant individual cost. These findings reveal concrete security vulnerabilities: communication channels can be exploited as adversarial injection vectors, and disclosing network topology to agents can degrade coordination even without any adversary present.

2606.07889 2026-06-09 cs.LG cs.AI cs.CL 新提交

Strained Coherence: A Pre-Failure Signal in Coding Agent Execution Trajectories

应变连贯性:编码代理执行轨迹中的故障前信号

Marut Pandya, Kasey Zhang, Baiqing Lyu

发表机构 * GitHub

AI总结 提出“应变连贯性”模式,即编码代理识别到问题但仍按原计划行动,通过构建Claude Sonnet 4.6检测器在44条轨迹上实现94%故障预测精度,优于基线方法。

详情
AI中文摘要

基于LLM的编码代理有时会承认自身推理中的问题,但仍继续执行。我们将这种模式称为应变连贯性:一种与安全相关的故障模式,其中代理拥有应改变其行为的信息,陈述了该信息,却仍违背它行动。该模式与口头奖励黑客行为重叠,即代理指出任务代理与底层目标之间的冲突,却仍优化代理。我们给出操作性定义,构建一个Claude Sonnet 4.6评判器,读取完整轨迹并标记该模式出现的片段,并使用Qwen3.5-35B-A3B骨干在44条Terminal-bench-2轨迹上评估。标记轨迹的失败率为94%,而未标记轨迹为46%(47个百分点的差距,Fisher精确检验p=0.003;排除三个提示嵌入示例后为46个百分点,p=0.006)。在匹配选择性下,检测器达到94%的精确度,而词汇话语标记基线为88%;两种方法的10条轨迹交集具有100%的失败率(Clopper-Pearson 95%置信区间[69%, 100%])。我们在Gemma4-31B上使用43条轨迹进行复制:整体信号方向一致但不显著(20个百分点差距,p=0.31),衰减主要由13条零思考内容的轨迹驱动,其中检测器没有可分析的基础。在Gemma的高冗长度三分位中,差距为+30个百分点;在Qwen的中等和高冗长度三分位中,差距各为+40个百分点。两个模型的首次标记出现在轨迹经过时间的中位数83-84%处,且二元标记在软化显式冲突标记的释义中保持不变(8/8条轨迹)。与单变量预测器不同,检测器输出可解释的跨度级输出——引用的承认、引用的行动和类型化的冲突——显示代理看到并忽略了什么。

英文摘要

LLM-based coding agents sometimes acknowledge a problem in their own reasoning and then proceed anyway. We call this pattern strained coherence: a safety-relevant failure mode in which an agent has information that should change its behavior, states that information, and still acts against it. The pattern overlaps with verbalized reward hacking, where an agent names a tension between a task proxy and the underlying goal yet optimizes the proxy anyway. We give an operational definition, build a Claude Sonnet 4.6 judge that reads full trajectories and flags spans where the pattern occurs, and evaluate it on 44 Terminal-bench-2 trajectories using a Qwen3.5-35B-A3B backbone. Flagged trajectories fail 94% of the time versus 46% for unflagged trajectories (47-point gap, Fisher's exact p = 0.003; 46 points after excluding three prompt-embedded examples, p = 0.006). At matched selectivity, the detector reaches 94% precision versus 88% for a lexical discourse-marker baseline; the 10-trajectory intersection of the two methods has a 100% failure rate (Clopper-Pearson 95% CI [69%, 100%]). We replicate on Gemma4-31B with 43 trajectories: the overall signal is directionally consistent but not significant (20-point gap, p = 0.31), with attenuation driven largely by 13 trajectories with zero think content, where the detector has no substrate to analyze. In the high-verbosity Gemma tertile, the gap is +30 points; in the mid- and high-verbosity Qwen tertiles, it is +40 points each. The first flag appears at a median of 83-84% of elapsed trajectory time across both models, and the binary flag survives paraphrases that soften explicit conflict markers (8/8 trajectories). Unlike univariate predictors, the detector emits interpretable span-level output -- quoted acknowledgment, quoted action, and typed conflict -- showing what the agent saw and ignored.

2606.08021 2026-06-09 cs.LG cs.AI cs.MA 新提交

Semantic Quorum Assurance: Collective Certification for Non-Deterministic AI Infrastructure

语义法定数保证:面向非确定性AI基础设施的集体认证

Jun He, Deying Yu

发表机构 * OpenKedge.io

AI总结 提出语义法定数保证(SQA),一种通过多样化验证者群体和风险自适应法定数谓词,将非确定性LLM代理的不安全操作批准率从18.5%降至0.3%的控制平面原语。

Comments 21 pages, 2 figures, 6 tables

详情
AI中文摘要

随着大型语言模型(LLM)代理被集成到自主云操作中,分布式系统面临一个语义可靠性问题:提议代理可以生成语法有效且静态授权但操作上不安全的生产环境变更,例如修改IAM策略、开放防火墙安全组或执行数据导出。经典的分布式共识协议复制确定性状态转换,但不评估提议意图的安全性。为弥补这一差距,我们引入语义法定数保证(SQA),一种用于治理非确定性代理基础设施的控制平面原语。SQA将提议表示为绑定到密码证据链的声明性执行合约,并将其路由到由只读、沙盒验证代理组成的多样化面板。SQA在风险自适应法定数谓词下聚合其判断,该谓词强制执行模型和原型多样性,根据校准的保证分数调整权重,并尊重特定原型的否决。通过的提议仅通过主权执行门执行。我们在云原生控制平面中实例化SQA,并为非确定性验证者形式化了一个相关的认知失败模型。在500个受基础设施启发的变更场景中,安全性结果基于留出的安全/不安全试验(排除模糊场景)报告,SQA将不安全批准率从单代理验证的18.5%降低到0.3%,同时在所研究的各风险等级中增加了1.45-4.12秒的中位验证延迟。

英文摘要

As large language model (LLM) agents are integrated into autonomous cloud operations, distributed systems face a semantic reliability problem: proposer agents can generate production mutations, such as modifying IAM policies, opening firewall security groups, or executing data exports, that are syntactically valid and statically authorized but operationally unsafe. Classical distributed consensus protocols replicate deterministic state transitions but do not evaluate the safety of the proposed intent. To address this gap, we introduce Semantic Quorum Assurance (SQA), a control-plane primitive for governing non-deterministic agentic infrastructure. SQA represents proposals as declarative execution contracts bound to cryptographic evidence chains and routes them to a diverse panel of read-only, sandboxed validator agents. SQA aggregates their judgments under a risk-adaptive quorum predicate that enforces model and archetype diversity, adjusts weights based on calibrated assurance scores, and respects archetype-specific vetoes. Admitted proposals execute only through a sovereign execution gate. We instantiate SQA in a cloud-native control plane and formalize a correlated cognitive failure model for non-deterministic validators. On 500 infrastructure-inspired mutation scenarios, with safety results reported on held-out safe/unsafe trials excluding ambiguous scenarios, SQA reduces unsafe approval from 18.5% for single-agent validation to 0.3% while adding median validation latency of 1.45--4.12 seconds across the studied risk buckets.
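
To make the idea of a risk-adaptive quorum predicate more tangible, here is a small sketch of how such a check might be structured. Everything in it is a hypothetical illustration rather than SQA's actual rule: the `Vote` fields, the `THRESHOLDS` table, and the `admit` function are assumptions chosen to mirror the ingredients named in the abstract (assurance-weighted votes, model and archetype diversity, and vetoes).

```python
from dataclasses import dataclass


@dataclass
class Vote:
    model: str            # underlying LLM that produced the judgment
    archetype: str        # validator role, e.g. "security" or "cost"
    approve: bool
    assurance: float      # calibrated weight in [0, 1]
    veto: bool = False    # archetype-specific hard rejection


# Hypothetical thresholds per risk level:
# (min weighted approval share, min distinct models, min distinct archetypes)
THRESHOLDS = {"low": (0.5, 1, 1), "high": (0.8, 2, 2)}


def admit(votes, risk):
    """Return True only if the approving validators form a sufficient quorum."""
    min_share, min_models, min_archetypes = THRESHOLDS[risk]
    if any(v.veto for v in votes):
        return False                              # any veto blocks admission
    total = sum(v.assurance for v in votes)
    if total == 0:
        return False
    approving = [v for v in votes if v.approve]
    share = sum(v.assurance for v in approving) / total
    return (share >= min_share
            and len({v.model for v in approving}) >= min_models
            and len({v.archetype for v in approving}) >= min_archetypes)


diverse = [
    Vote("m1", "security", True, 0.9),
    Vote("m2", "reliability", True, 0.8),
    Vote("m1", "cost", False, 0.3),
]
single_model = [
    Vote("m1", "security", True, 0.9),
    Vote("m1", "cost", True, 0.9),
]
```

Here `admit(diverse, "high")` passes because two distinct models and two archetypes approve with an 85% weighted share, while `admit(single_model, "high")` fails the model-diversity requirement even though every validator approves.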

2606.08044 2026-06-09 cs.LG cs.AI cs.CL 新提交

When Behavioral Safety Evaluation Fails: A Representation-Level Perspective

当行为安全评估失败时:表征层面的视角

Enyi Jiang, Anders Gjølbye, Yibo Jacky Zhang, Sanmi Koyejo

发表机构 * Stanford University(斯坦福大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) Technical University of Denmark(丹麦技术大学)

AI总结 本文提出行为安全与干预鲁棒性之间的“审计差距”,通过构建解离模型和引入潜在脆弱性评分(LVS),证明行为安全指标不足以衡量表征层面的鲁棒性。

Comments Preprint

详情
AI中文摘要

大型语言模型(LLM)的安全性通常从行为层面进行评估,这只能为内部鲁棒性提供有限的证据,因为这些评估针对的是输出,而非干预下的表征层面脆弱性。我们将这种差异形式化为审计差距:行为安全与干预下鲁棒性之间的差异。为了研究这一差距,我们构建了解离模型,这些模型在保持安全的外在行为的同时,在潜在空间中仍然脆弱。我们引入了一个基于干预的评估框架,通过在参数和潜在空间中进行软干预(包括有害微调和逐层潜在扰动)来测试模型鲁棒性。为了形式化评估,我们提出了潜在脆弱性评分(LVS),用于衡量通过有界潜在扰动引发有害行为的难易程度。使用该评估框架,我们表明,在多个安全对齐与不安全对齐的最先进模型上,行为安全指标都不足以衡量表征层面的鲁棒性。值得注意的是,解离模型在有害干预下尽管表现出相当的拒绝行为,但LVS显著升高,其中中间表征对干预最为敏感。我们的结果表明,仅凭行为安全评估无法全面反映模型鲁棒性,因此需要进行表征感知的审计,同时评估潜在脆弱性和可观察行为。

英文摘要

Large Language Model (LLM) safety has often been evaluated at the behavior level, which provides limited evidence of internal robustness, as these evaluations target outputs rather than representation-level vulnerability under intervention. We formalize this discrepancy as the audit gap: the difference between behavioral safety and robustness under intervention. To study this gap, we construct dissociated models that preserve safe outward behavior while remaining vulnerable in the latent space. We introduce an intervention-based evaluation framework to test model robustness through soft interventions in parameter and latent spaces, including harmful fine-tuning and layer-wise latent perturbations. To formalize the evaluation, we propose the Latent Vulnerability Score (LVS) to measure how easily harmful behavior can be elicited by bounded latent perturbations. Using this evaluation framework, we show that behavioral safety metrics are insufficient measures of representation-level robustness across multiple safely and unsafely aligned state-of-the-art models. Notably, dissociated models show substantially elevated LVSs despite comparable refusal behavior under harmful intervention, with intermediate representations being the most sensitive to intervention. Our results suggest that behavioral safety evaluation alone provides an incomplete picture of model robustness, motivating representation-aware audits of latent vulnerability and observable behavior.

2606.08275 2026-06-09 cs.LG cs.AI 新提交

Causal Agent Replay: Counterfactual Attribution for LLM-Agent Failures

因果智能体回放:LLM智能体故障的反事实归因

Jaineet Shah

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出Causal Agent Replay (CAR)方法,通过结构因果模型和干预操作,对LLM智能体失败步骤进行反事实归因,解决现有方法无法定位决策步骤的问题。

Comments Open-source: https://github.com/jaineet17/causal-agent-replay

详情
AI中文摘要

当LLM智能体失败时——例如发放了不应发放的退款、调用了错误的工具、泄露了数据——现有工具只能回答发生了什么(可观测性)或是否通过(评估),但无法回答哪个步骤导致了失败。直观的启发式方法是错误的:执行有害动作的步骤通常不是决定该动作的步骤,而LLM判断的归因是相关性的且不可靠(在Who&When基准上,最先进的步骤级准确率约为14%)。我们提出Causal Agent Replay (CAR),通过干预来回答这个问题:它将智能体运行建模为结构因果模型,对某个步骤应用do操作,并在相同随机策略下重新执行轨迹,测量结果分布的变化。我们定义了智能体步骤上的干预代数、一个单步对比估计器(其承诺点规则解决了特定于随机向前运行的混杂因素),以及一个预算有界的蒙特卡洛Shapley估计器(用于在交互步骤间分配信用)。每个效应都附有置信区间。我们在具有植入真实标签的合成结构因果模型上进行验证:对比估计器恢复了关键步骤,Shapley恢复了两步交互(0.44, 0.45, ~0;效率总和0.909对比解析值0.91)。CAR是开源的,可在托管或免费的本地模型上运行。

英文摘要

When an LLM agent fails -- issues a refund it should not have, calls the wrong tool, leaks data -- existing tooling answers what happened (observability) or whether it passed (evaluation), but not which step caused the failure. The obvious heuristics are wrong: the step that executes the harmful action is usually not the step that decided on it, and LLM-judge attribution is correlational and unreliable (state-of-the-art step-level accuracy on the Who&When benchmark is about 14%). We present Causal Agent Replay (CAR), which answers the question by intervention: it models an agent run as a structural causal model, applies a do-operation to a step, and re-executes the trajectory forward under the same stochastic policy, measuring the shift in the outcome distribution. We define an intervention algebra over agent steps, a single-step contrastive estimator whose point-of-commitment rule resolves a confound specific to stochastic run-forward, and a budget-bounded Monte-Carlo Shapley estimator that splits credit across interacting steps. Every effect is reported with confidence intervals. We validate against synthetic structural causal models with planted ground truth: the contrastive estimator recovers the pivotal step, and Shapley recovers a two-step interaction (0.44, 0.45, ~0; efficiency sum 0.909 versus the analytic 0.91). CAR is open source and runs on hosted or free local models.
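The budget-bounded Monte-Carlo Shapley idea mentioned in the abstract can be illustrated with a minimal permutation-sampling sketch. Everything here is an assumption for illustration: `toy_value` is a hypothetical planted ground truth (not the paper's synthetic SCM), and `shapley_mc` is a generic textbook estimator rather than CAR's actual implementation.

```python
import random

def shapley_mc(value, n_steps, budget, seed=0):
    """Permutation-sampling Shapley estimate using roughly `budget` calls to `value`."""
    rng = random.Random(seed)
    phi = [0.0] * n_steps
    # Each sampled permutation costs n_steps + 1 evaluations of `value`.
    n_perms = max(1, budget // (n_steps + 1))
    for _ in range(n_perms):
        order = list(range(n_steps))
        rng.shuffle(order)
        coalition = set()
        prev = value(frozenset(coalition))
        for step in order:
            coalition.add(step)
            cur = value(frozenset(coalition))
            phi[step] += cur - prev  # marginal contribution of `step` in this order
            prev = cur
    return [p / n_perms for p in phi]

def toy_value(intervened):
    # Hypothetical game: the outcome shifts by 0.9 only when steps 0 and 1
    # are both intervened on; step 2 is irrelevant.
    return 0.9 if {0, 1} <= intervened else 0.0

phi = shapley_mc(toy_value, n_steps=3, budget=4000)
```

Because each permutation's marginal contributions telescope, the estimates sum to the value of intervening on all steps minus the value of intervening on none, which is the efficiency property the abstract reports checking.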

2606.08365 2026-06-09 cs.LG cs.AI 新提交

Pre-Intervention Prediction of Sparse Autoencoder Steering Side Effects

稀疏自编码器引导副作用的干预前预测

Evan Duan

发表机构 * University of Michigan(密歇根大学)

AI总结 提出一种干预前筛选框架,利用特征统计预测SAE引导的副作用(效果不稳定和附带扩散),在多个模型和字典上验证了解码器几何等信号优于基线,但预测效果因模型而异。

详情
AI中文摘要

稀疏自编码器(SAE)特征越来越多地用于引导语言模型,但特征引导很少是干净的:相同的干预在不同上下文中可能表现不一致,并扰动不相关的特征。我们引入了一个干预前筛选框架,用于从引导前计算的特征统计中预测SAE引导的副作用。我们沿着引导模块化的两个轴(效果稳定性和附带扩散)来操作化副作用,并在ReLU、JumpReLU和TopK SAE字典上评估GPT-2-small、Pythia-70M-deduped、Gemma-2-2B和Llama-3.1-8B。在这些设置中,解码器几何、激活统计、共激活结构和直接logit足迹比仅频率和激活幅度基线更好地预测引导模块化。信号在GPT-2-small、Pythia-70M和Llama-3.1-8B中最强,在那里它能在对抗幅度相关混杂的残差化后幸存,而在Gemma-2-2B中较弱。保留筛选表明,通过预测的清洁度对未见特征进行排序可以选择在新上下文中更干净地引导的特征,但成功的轴因设置而异:GPT-2在清洁度上提升最大,Pythia主要在稳定性上提升,Llama主要在附带性上提升,而Gemma仅部分提升。一个受控的Llama Scope宽度比较表明,在32K到128K字典宽度变化下,预测信号仍然存在,尽管筛选收益变得不太稳定。总体而言,SAE引导的副作用是可提前预测的,但有用的预测器签名和迁移的模块化轴依赖于模型和字典设置。

英文摘要

Sparse autoencoder (SAE) features are increasingly used to steer language models, but feature steering is rarely clean: the same intervention can behave inconsistently across contexts and perturb unrelated features. We introduce a pre-intervention screening framework for forecasting SAE steering side effects from feature statistics computed before steering. We operationalize side effects along two axes of steering modularity, effect stability and collateral spread, and evaluate GPT-2-small, Pythia-70M-deduped, Gemma-2-2B, and Llama-3.1-8B across ReLU, JumpReLU, and TopK SAE dictionaries. Across these settings, decoder geometry, activation statistics, co-activation structure, and direct-logit footprint predict steering modularity better than frequency-only and activation-magnitude baselines. The signal is strongest in GPT-2-small, Pythia-70M, and Llama-3.1-8B, where it survives residualization against magnitude-related confounds, and weaker in Gemma-2-2B. Held-out screening shows that ranking unseen features by predicted cleanliness can select features that steer more cleanly on fresh contexts, but the successful axis varies by setting: GPT-2 improves most cleanly, Pythia improves mainly on stability, Llama mainly on collateral, and Gemma only partially. A controlled Llama Scope width comparison shows that the predictive signal persists under a 32K-to-128K dictionary-width change, although the screening payoff becomes less stable. Overall, SAE steering side effects are predictable in advance, but the useful predictor signature and transferred modularity axis are model- and dictionary-setting dependent.

2606.08467 2026-06-09 cs.LG cs.AI 新提交

The Confidence Trap: Calibration Attacks for Graph Neural Networks

置信陷阱:图神经网络的校准攻击

Cuong Dang, Jiahao Zhang, Hieu Ta Quang, Dung Le, Lu Cheng, Suhang Wang

发表机构 * Virginia Polytechnic Institute and State University(弗吉尼亚理工学院暨州立大学) The Pennsylvania State University(宾夕法尼亚州立大学) VinUniversity University of Illinois at Chicago(伊利诺伊大学芝加哥分校)

AI总结 提出统一图校准攻击(UGCA)框架,通过KL散度损失、重排序机制和混合损失等策略,在保持分类精度下显著提高期望校准误差,揭示高精度或多类模型更易受攻击。

详情
AI中文摘要

尽管置信校准对于安全关键应用中的可信决策至关重要,但校准后的GNN对对抗性结构扰动的鲁棒性仍未被充分探索。然而,研究图上的校准攻击面临独特的技术挑战:(1)图结构的离散性使基于梯度的优化复杂化;(2)现有的低置信目标无法将预测推向均匀分布;(3)GNN对边扰动高度敏感,常导致违反攻击约束的意外标签变化。为应对这些挑战,我们提出一个统一图校准攻击(UGCA)框架,用于GNN校准鲁棒性的最坏情况(白盒)分析。UGCA引入KL散度损失以鼓励均匀预测分布,重排序机制以减少标签翻转,混合损失以在违规时恢复标签,以及束搜索以探索更广的对抗搜索空间。我们进一步提供理论见解,将模型泛化、数据集复杂性和校准脆弱性联系起来,表明在该威胁模型下,具有更高精度或在更多类别数据集上训练的模型更容易受到攻击。大量实验表明,UGCA在保持分类精度的同时显著增加了期望校准误差。我们的代码公开在https://github.com/CaptainCuong/Graph-Calibration-Attack.git。

英文摘要

While confidence calibration is essential for trustworthy decision-making in safety-critical applications, the robustness of calibrated GNNs to adversarial structural perturbations remains largely unexplored. However, studying calibration attacks on graphs presents unique technical challenges: (1) the discrete nature of graph structures complicates gradient-based optimization, (2) existing underconfidence objectives fail to drive predictions toward uniform distributions, and (3) GNNs are highly sensitive to edge perturbations, often causing unintended label changes that violate attack constraints. To address these challenges, we propose a Unified Graph Calibration Attack (UGCA) framework designed for worst-case (white-box) analysis of GNN calibration robustness. UGCA introduces a KL-divergence loss to encourage uniform predictive distributions, a reranking mechanism to reduce label flipping, a hybrid loss to recover labels when violations occur, and beam search to explore a broader adversarial search space. We further provide theoretical insights linking model generalization, dataset complexity, and calibration vulnerability, showing that models with higher accuracy or trained on datasets with more classes are more susceptible under this threat model. Extensive experiments demonstrate that UGCA substantially increases Expected Calibration Error while preserving classification accuracy. Our code is publicly available at https://github.com/CaptainCuong/Graph-Calibration-Attack.git.
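Two quantities named in the abstract, Expected Calibration Error and a KL-divergence term that vanishes for uniform predictions, can be sketched as follows. This is a minimal illustration with equal-width bins; the KL direction (prediction to uniform) and the function names are assumptions, not taken from the UGCA code.

```python
import math

def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE with equal-width confidence bins: weighted mean of |accuracy - confidence|."""
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the last bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            acc = sum(ok for _, ok in b) / len(b)
            ece += len(b) / n * abs(acc - avg_conf)
    return ece

def kl_to_uniform(probs):
    """KL(p || uniform) = log K - H(p); zero exactly when p is uniform."""
    k = len(probs)
    return sum(p * math.log(p * k) for p in probs if p > 0)

ece = expected_calibration_error([0.9, 0.9, 0.6, 0.6], [1, 1, 1, 0])
```

Minimizing a term like `kl_to_uniform` flattens the predictive distribution without necessarily changing the argmax, which is how accuracy can stay fixed while ECE grows.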

2606.08517 2026-06-09 cs.LG cs.CL 新提交

A Joint Finite-Sample Certificate for Adaptive Selective Conformal Risk Control

自适应选择性共形风险控制的联合有限样本证书

Xiaoli Yu, Jiamiao Liu

发表机构 * Chongqing University of Posts and Telecommunications(重庆邮电大学) Army Medical University (Third Military Medical University)(陆军军医大学(第三军医大学))

AI总结 提出一种联合有限样本证书,同时上界选择性风险、下界接受概率和部署效用,适用于自适应阈值选择,通过比率风险的经验伯恩斯坦界等方法,在ImageNet和COCO上比Hoeffding-CRC提升22个百分点接受前沿,且紧致约10倍。

详情
AI中文摘要

选择性预测器在置信输入上做出预测,否则弃权;安全部署需要一个单一的有限样本证书,同时上界所选风险、下界接受概率 $p_{\mathrm{acc}}$ 使其高于下限 $p_{\min}$,并下界部署效用。该证书必须在基于 $n_{\mathrm{cert}}$ 个样本、从包含 $m$ 个阈值对的有限网格中自适应选择阈值时仍然有效。我们通过将所选风险直接视为比率而非通过Hoeffding式范围界,为有界、可能非单调的损失给出了这样的证书。该构造耦合了三个置信界:比率风险的方差自适应经验伯恩斯坦界、接受概率的Clopper-Pearson界以及效用的双边接近界。它们共同下界认证策略的绝对效用,并保证其与认证集上的最优策略相差不超过 $2\gamma_u$,两者在可行时均非平凡;一个按场景划分的第三部分与外部预言机匹配,仅在风险边际 $\gamma_r < \alpha$ 时有信息量,在主要操作点处为空。相对于仅范围Hoeffding比率构造,这使接受下限依赖从 $1/p_{\min}$ 改进为 $1/\sqrt{p_{\min}}$,并且一个闭式推论刻画了一个逐对的场景,在其中我们的风险界优于Hoeffding共形风险控制(Hoeffding-CRC)选择性界。实验上,在ImageNet(三个ResNet)和COCO val 2017全景分割上,该证书比Hoeffding-CRC打开了+22个百分点的认证接受前沿,并且比一个非平凡的、有效性匹配的基线紧致约10倍;这些增益是按场景的,非普适的,在ADE20K上不存在。认证器运行时间为 $O(n_{\mathrm{cert}}\, m)$。

英文摘要

Selective predictors answer on confident inputs and abstain elsewhere; deploying one safely needs a single finite-sample certificate that simultaneously upper-bounds the selected risk, lower-bounds the acceptance probability $p_{\mathrm{acc}}$ above a floor $p_{\min}$, and lower-bounds the deployment utility. This certificate must be valid under adaptive threshold selection from a finite grid of $m$ pairs on $n_{\mathrm{cert}}$ samples. We give such a certificate for bounded, possibly non-monotone losses by treating the selected risk directly as a ratio rather than through a Hoeffding-style range bound. The construction couples three confidence bounds: a variance-adaptive empirical-Bernstein bound on the ratio risk, a Clopper--Pearson bound on acceptance, and a two-sided closeness bound on utility. Together they lower-bound the certified policy's utility absolutely and to within $2\gamma_u$ of the best over the certified set, both non-vacuous whenever feasible; a regime-scoped third leg matches an external oracle, informative only where the risk margin $\gamma_r < \alpha$ and vacuous at the headline operating points. Relative to the range-only Hoeffding-ratio construction this sharpens the acceptance-floor dependence from $1/p_{\min}$ to $1/\sqrt{p_{\min}}$, and a closed-form corollary identifies a per-pair regime in which our risk bound dominates a Hoeffding conformal risk control (Hoeffding--CRC) selective bound. Empirically, on ImageNet (three ResNets) and COCO val 2017 panoptic, the certificate opens a $+22$ pp certified-acceptance frontier over Hoeffding--CRC and is $\approx 10\times$ tighter than a non-vacuous matched-valid baseline; these gains are regime-scoped, not universal, and absent on ADE20K. The certifier runs in $O(n_{\mathrm{cert}}\, m)$ time.
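Of the three coupled bounds, the Clopper--Pearson bound on acceptance is the easiest to illustrate in isolation. The sketch below computes a one-sided lower bound by bisection using only the standard library; the function names are hypothetical, and how the paper splits the failure probability across the $m$ grid pairs and the other two bounds is not stated in the abstract, so `delta` is simply taken as given.

```python
import math

def binom_tail_ge(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(math.comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

def clopper_pearson_lower(k, n, delta):
    """One-sided lower confidence bound on a binomial proportion at level 1 - delta."""
    if k == 0:
        return 0.0
    lo, hi = 0.0, 1.0
    for _ in range(60):  # bisection works because the tail is increasing in p
        mid = (lo + hi) / 2
        if binom_tail_ge(k, n, mid) < delta:
            lo = mid
        else:
            hi = mid
    return lo  # slightly below the exact root, so the bound stays conservative

# Example: 150 of 200 certification samples were accepted.
p_acc_lower = clopper_pearson_lower(150, 200, delta=0.05)
```

A threshold pair would then pass the acceptance check only if `p_acc_lower` is at least the chosen floor. For large `n`, a Beta-quantile routine from a numerical library would be a better choice than summing binomial terms directly.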

2606.08654 2026-06-09 cs.LG cs.NA math.AP math.NA stat.AP 新提交

Operator learning for the 2D incompressible Navier-Stokes equations: a conformal prediction approach in the data-scarce regime

二维不可压缩Navier-Stokes方程的算子学习:数据稀缺情况下的共形预测方法

Weinan Wang, Bowen Gang, Hao Deng

发表机构 * University of Oklahoma(俄克拉荷马大学) Fudan University(复旦大学)

AI总结 针对数据稀缺下算子学习的不确定性量化,提出基于扰动的共形预测框架,在二维Navier-Stokes基准上比现有方法生成更窄的共形带,同时保持目标覆盖。

详情
AI中文摘要

本文提出了一种基于扰动的共形预测框架,用于算子学习中的不确定性量化,重点关注二维Navier-Stokes方程。虽然神经算子为昂贵的PDE求解器提供了快速替代方案,但它们本身无法为时空场预测提供校准的不确定性。我们的方法将训练好的傅里叶神经算子(FNO)与分裂共形预测相结合,通过比较在几乎相同数据集上训练的两个算子的预测来构建局部不确定性尺度:一个使用原始标签,另一个使用添加小高斯噪声的标签。我们在数据稀缺情况下考虑该过程,其中总标签预算固定,而需要单独不确定性网络的方法必须在多个模型之间划分训练数据。在二维Navier-Stokes基准上,在匹配总数据预算的情况下,基于扰动的方法产生的共形带比现有方法窄得多,同时保持目标同时覆盖。这些结果表明,扰动敏感性是共形化神经算子的一种实用且样本高效的不确定性代理。

英文摘要

In this paper, we propose a perturbation-based conformal prediction framework for uncertainty quantification in operator learning, with a focus on the 2D Navier--Stokes equations. While neural operators provide fast surrogates for expensive PDE solvers, they do not by themselves provide calibrated uncertainty for spatiotemporal field predictions. Our approach wraps a trained Fourier Neural Operator (FNO) with split conformal prediction and constructs the local uncertainty scale by comparing the predictions of two operators trained on nearly identical datasets: one on the original labels and one on labels perturbed by small Gaussian noise. We consider this procedure in the data-scarce regime, where the total label budget is fixed and methods that require a separate uncertainty network must divide training data between multiple models. On the 2D Navier--Stokes benchmark, the perturbation-based method produces substantially narrower conformal bands than existing methods under matched total data budgets while maintaining the target simultaneous coverage. These results suggest that perturbation sensitivity is a practical and sample-efficient uncertainty proxy for conformalized neural operators.
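The combination of split conformal prediction with a perturbation-derived local scale can be sketched in a few lines. This is an assumption-laden illustration: fields are flattened to plain lists, the scale is taken as the pointwise disagreement between the two operators plus a small floor `eps`, and the score is a sup-norm ratio so that the resulting band targets simultaneous coverage. The paper's exact score and scale may differ.

```python
import math

def local_scale(pred, pred_perturbed, eps=1e-6):
    """Assumed scale: pointwise disagreement between the two operators, floored by eps."""
    return [abs(a - b) + eps for a, b in zip(pred, pred_perturbed)]

def sup_score(truth, pred, scale):
    """Nonconformity score for one field: largest scaled error over all grid points."""
    return max(abs(t - p) / s for t, p, s in zip(truth, pred, scale))

def conformal_quantile(scores, alpha):
    """Split-conformal quantile: the ceil((n+1)(1-alpha))-th smallest calibration score."""
    n = len(scores)
    rank = math.ceil((n + 1) * (1 - alpha))
    if rank > n:
        return math.inf  # too few calibration samples for this alpha
    return sorted(scores)[rank - 1]

def conformal_band(pred, scale, q):
    """Band [pred - q*scale, pred + q*scale] at every grid point."""
    return [(p - q * s, p + q * s) for p, s in zip(pred, scale)]
```

In use, `sup_score` would be evaluated on each held-out calibration field, `conformal_quantile` applied to those scores, and `conformal_band` applied to new predictions. The band is narrow wherever the two operators agree.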

2606.08682 2026-06-09 cs.LG cs.AI 新提交

Activation Steering Induces Emergent Misalignment: A More Comprehensive Evaluation

激活引导引发突现失调:一项更全面的评估

Qi Cao, Jian Lou, Meiting Liu, Wenjie Feng, Dan Li, See-Kiong Ng, Anh Tuan Luu

发表机构 * Nanyang Technological University(南洋理工大学) Sun Yat-sen University(中山大学) University of Science and Technology of China(中国科学技术大学) National University of Singapore(新加坡国立大学)

AI总结 研究激活引导是否引发突现失调,通过扩展评估范围,发现激活引导可导致广泛失调,且比微调产生更连贯的有害响应,并分析了关键因素。

详情
AI中文摘要

激活引导已成为一种流行的推理时技术,用于调节大型语言模型(LLMs)的行为。通过从目标行为的示例构建引导向量,并在推理期间将其注入中间激活,激活引导能够实现灵活的行为控制,同时避免微调所需的永久参数更新。与此同时,最近的研究将突现失调(EM)识别为一个重要的安全问题,其中在狭窄任务的不安全示例上微调的模型可能意外地泛化到无关任务上的广泛不安全行为。尽管微调引发的EM已被广泛研究,但激活引导是否能引发EM仍然相对未被探索,尽管它作为一种模型控制技术的使用日益增加。在本文中,我们对激活引导引发的突现失调进行了全面研究,大幅扩展了现有开创性工作的评估范围。首先,我们表明激活引导可以引发广泛的失调,即使在最近的Qwen-3.5系列中也是如此。此外,激活引导的模型产生的有害响应比微调模型具有更强的语义相关性和更高的连贯性,使得由此产生的失调可能更具危害性。其次,我们通过分析关键的引导特定因素来表征AS引发的EM的特性,包括引导幅度、引导子空间的低秩结构以及引导向量构建期间的周期数。第三,我们评估了AS引发的EM在不同模型家族、模型规模、目标任务和干预层上的鲁棒性和敏感性。我们的发现揭示了激活引导是突现失调的一个重要但未被充分研究的来源,并为理解EM的机制和安全风险提供了激活空间视角。

英文摘要

Activation steering has emerged as a popular inference-time technique for modulating the behavior of large language models (LLMs). By constructing a steering vector from examples of a target behavior and injecting it into intermediate activations during inference, activation steering enables flexible behavioral control while avoiding the permanent parameter updates required by finetuning. Meanwhile, recent work has identified emergent misalignment (EM) as a significant safety concern, wherein models finetuned on unsafe examples from a narrow task may unexpectedly generalize to broadly unsafe behavior on unrelated tasks. Although finetuning-induced EM has been extensively studied, whether activation steering can induce EM remains comparatively under-explored, despite its increasing use as a model-control technique. In this paper, we present a comprehensive study of activation-steering-induced emergent misalignment, substantially expanding the evaluation scope beyond existing pioneering work. First, we show that activation steering can induce broad misalignment, even in the recent Qwen-3.5 series. Moreover, activation-steered models produce harmful responses with stronger semantic relevance and higher coherence than their finetuned counterparts, making the resulting misalignment potentially more harmful. Second, we characterize properties of AS-induced EM by analyzing key steering-specific factors, including steering magnitude, the low-rank structure of the steering subspace, and the number of epochs during steering-vector construction. Third, we evaluate the robustness and sensitivity of AS-induced EM across diverse model families, model scales, target tasks, and intervention layers. Our findings reveal activation steering as a significant yet under-examined source of emergent misalignment and provide an activation-space perspective for understanding the mechanisms and safety risks of EM.

2606.08777 2026-06-09 cs.LG cs.AI 新提交

How Many Counterfactuals Does It Take? Probing VLM Hallucinations Through Circuits and Causal Effects

需要多少反事实?通过电路和因果效应探究VLM幻觉

Abhivansh Gupta, Simardeep Singh, Advika Sinha, Shreyansh Modi, Akshat Tomar

发表机构 * University of California, Berkeley(加州大学伯克利分校) DeepMind(深度思维)

AI总结 本文通过定义基于对数概率差异的因果影响度量,并利用电路发现技术,研究视觉语言模型幻觉输出的反事实鲁棒性,推导出检测不稳定所需的最小反事实样本数。

详情
AI中文摘要

视觉语言模型(VLM)已知会产生不基于视觉证据的幻觉预测,但现有方法缺乏对这些预测在反事实扰动下鲁棒性的原则性理解。在这项工作中,我们研究了VLM中幻觉输出的反事实鲁棒性的样本复杂度。我们基于事实、反事实和激活修补运行之间的对数概率差异定义了一个因果影响度量,并用它来表征幻觉预测的稳定性。通过利用电路发现技术(CD-T),我们识别负责这些预测的模型组件,并追踪它们在反事实样本中的激活差异。然后,我们利用集中不等式和因果影响分布的方差估计,推导出可靠检测幻觉输出不稳定性所需的最小反事实样本数m的经验界限。

英文摘要

Visual Language Models (VLMs) are known to produce hallucinated predictions that are not grounded in visual evidence, yet existing approaches lack a principled understanding of how robust such predictions are under counterfactual perturbations. In this work, we study the sample complexity of counterfactual robustness for hallucinated outputs in VLMs. We define a causal influence metric based on log-probability differences between factual, counterfactual, and activation-patched runs, and use it to characterize the stability of hallucinated predictions. By leveraging circuit discovery techniques (CD-T), we identify model components responsible for these predictions and track their activation differences across counterfactual samples. We then derive empirical bounds on the minimum number of counterfactual samples m required to reliably detect instability in hallucinated outputs, using concentration inequalities and variance estimates of the causal influence distribution.
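To make the role of a concentration inequality concrete, the sketch below uses the textbook two-sided Hoeffding bound to size the number of counterfactual samples m. The abstract does not say which inequality the authors use, and their bound also involves variance estimates, so this is only an assumed, range-based stand-in with a hypothetical function name.

```python
import math

def hoeffding_samples(value_range, eps, delta):
    """Smallest m with 2*exp(-2*m*eps^2 / R^2) <= delta, for values in a range of width R.

    With m samples, the empirical mean of the influence metric is within eps
    of its expectation with probability at least 1 - delta.
    """
    return math.ceil(value_range**2 * math.log(2 / delta) / (2 * eps**2))

m = hoeffding_samples(value_range=1.0, eps=0.1, delta=0.05)
```

A variance-aware bound such as Bernstein's would typically give a smaller m when the causal-influence distribution is concentrated, which is presumably why variance estimates appear in the abstract.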

2606.08892 2026-06-09 cs.LG 新提交

Diffuse AI Control on Fuzzy Tasks

模糊任务上的扩散AI控制

Mikhail Terekhov, Caglar Gulcehre, Vivek Hebbar, Joe Benton

发表机构 * Anthropic Fellows Program (via MATS)(Anthropic 研究员计划(通过 MATS)) EPFL(洛桑联邦理工学院) Redwood Research(红木研究) Anthropic

AI总结 针对AI在模糊任务上的长期扩散威胁,提出蓝队与红队对抗框架,通过弱模型评分训练强模型,并发现红队可利用多目标进化提示优化找到评分高但性能差的颠覆性行为,蓝队则通过对抗优化提升鲁棒性。

详情
AI中文摘要

部署在关键领域(如AI安全研究)的AI模型可能因对齐问题而微妙地破坏我们的努力。扩散AI控制是AI安全的一个子领域,旨在减轻长期部署范围内AI破坏(扩散威胁)带来的风险。这些风险在模糊任务上尤其有害,即难以评分或需要直觉的任务。为了理解模糊任务上的扩散威胁,我们引入了一个新颖的框架,将AI控制视为蓝队和红队之间的对抗游戏。蓝队使用一个弱可信模型构建一个弱评分,并据此训练一个强大的、可能具有颠覆性的模型,以消除可能存在的颠覆倾向。然后红队试图找到被弱评分高评价的模型行为,这些行为可能不会被训练掉,但实际上对应着差的表现。我们在为近期ML论文的研究问题撰写实验提案的任务上测试了我们的框架。我们使用一个能够访问原始论文的语言模型作为代理“真实”评分器。我们的红队使用多目标进化提示优化发现了颠覆性行为。我们展示了Opus 4.6可以写出比GPT-OSS-20B更差的提案(根据真实代理评分),而弱评分器却将其评为与Opus 4.6最佳提案一样高。为了缓解威胁,我们为蓝队提出了一种对抗优化算法,该算法为弱模型发现更鲁棒的提示。该算法产生的蓝队提示,我们的红队优化未能利用。

英文摘要

AI models deployed in critical domains, such as AI safety research, may subtly sabotage our efforts due to misalignment. Diffuse AI Control is a subfield of AI safety concerned with mitigating risks from AI sabotage distributed over long deployment horizons (diffuse threats). These risks are particularly pernicious on fuzzy tasks, i.e. tasks which are hard to grade or require intuition. To understand diffuse threats on fuzzy tasks, we introduce a novel framework that considers AI control as an adversarial game between a blue team and a red team. The blue team uses a weak trusted model to construct a weak score against which they would train a strong, potentially subversive model to remove the subversion propensity if it were present. The red team then tries to find model behaviors that are rated highly by the weak score, and thus might not be trained out, but actually correspond to poor performance. We test our framework on the task of writing experimental proposals for research questions from recent ML papers. We use a language model with access to the original paper as a proxy "ground-truth" scorer. Our red team discovers subversive behaviors using multi-objective evolutionary prompt optimization. We show that Opus 4.6 can write proposals that are worse according to the ground truth proxy than those of GPT-OSS-20B, while the weak scorer rates them as highly as the best proposals from Opus 4.6. To mitigate the threat, we propose an adversarial optimization algorithm for the blue team that discovers more robust prompts for the weak model. This algorithm produces a blue team prompt that our red team optimization fails to exploit.

2606.08893 2026-06-09 cs.LG cs.AI cs.CR 新提交

Cheap Reward Hacking Detection

廉价奖励黑客检测

Iván Belenky, Joaquín Itria, Steven Johns

发表机构 * Tamarillo

AI总结 提出用小Transformer编码器将轨迹映射到单位球面,使嵌入距离近似奖励与元数据的L1距离,线性探针检测奖励黑客,AUC达0.9467,成本比LLM-as-judge低四个数量级。

Comments 20 pages, 6 figures, 12 tables

详情
AI中文摘要

训练一个小型Transformer编码器,将Terminal-Wrench轨迹映射到单位球面上,使得嵌入距离近似于奖励与元数据信号之间的$L_1$距离。在该嵌入之上,一个线性探针在清洗后的测试集上检测奖励黑客,AUC为0.9467,TPR@5%FPR为0.8296,与TW清洗后的LLM-as-judge的AUC(在清洗集上为0.9510)相当,并在相同信息条件下超过其TPR@5%FPR(0.7130 vs 0.8296),而每条轨迹的成本大约低四个数量级。该编码器并非纯粹的行为阅读器:在探针时从其输入中剥离自然语言推理,AUC降至0.6213。

英文摘要

A small transformer encoder is trained to map Terminal-Wrench trajectories onto a unit sphere where embedding distance approximates the $L_1$ distance between reward and metadata signals. A linear probe on top of that embedding detects reward hacking on the cleaned test split with AUC $0.9467$ and TPR@5%FPR $0.8296$, matching the TW sanitized LLM-as-judge AUC ($0.9510$ on the cleaned split) and exceeding its TPR@5%FPR ($0.7130$ vs $0.8296$) on the same information condition, at roughly four orders of magnitude lower per-trajectory cost. The encoder is not a pure behavior reader: stripping natural-language reasoning from its input at probe time drops AUC to $0.6213$.
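The two metrics reported for the probe, AUC and TPR at 5% FPR, can be computed from raw probe scores as sketched below. This is a plain reference implementation with hypothetical function names, not the authors' evaluation code. It is quadratic in the number of trajectories, which is fine for illustration but not for large test splits.

```python
def roc_auc(scores, labels):
    """AUC as the probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def tpr_at_fpr(scores, labels, max_fpr=0.05):
    """Highest true-positive rate over thresholds whose false-positive rate is <= max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    best = 0.0
    for t in sorted(set(scores)):
        fpr = sum(n >= t for n in neg) / len(neg)
        if fpr <= max_fpr:
            best = max(best, sum(p >= t for p in pos) / len(pos))
    return best

# Toy probe scores: label 1 marks a reward-hacking trajectory.
scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]
auc = roc_auc(scores, labels)
tpr = tpr_at_fpr(scores, labels)
```

The two numbers answer different questions: AUC averages over all operating points, while TPR@5%FPR fixes a low false-alarm budget, which is why the abstract can report a tie on one and a gain on the other.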

2606.09043 2026-06-09 cs.LG cs.CL 新提交

DynaCF: Mitigating Shortcut Learning in Reward Models via Dynamic Counterfactual Sensitivity

DynaCF: 通过动态反事实敏感性缓解奖励模型中的捷径学习

Fengyuan Liu, Yongliang Miao, Zirui He, Yanguang Liu, Fei Sun, Mengnan Du

发表机构 * The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) New Jersey Institute of Technology(新泽西理工学院) Institute of Computing Technology, CAS(中国科学院计算技术研究所)

AI总结 提出DynaCF框架,通过在线测量反事实扰动下的边际变化和偏好翻转来动态降低捷径敏感样本的权重,从而缓解奖励模型中的捷径学习问题。

详情
AI中文摘要

从成对偏好中训练的奖励模型往往利用表面的捷径线索而非学习真正的响应质量。我们提出DynaCF,一个用于缓解奖励模型训练中捷径学习的动态重加权框架。与静态捷径启发式方法不同,DynaCF在优化过程中通过应用保持语义的反事实扰动并跟踪当前模型下产生的边际变化和偏好翻转,在线测量捷径敏感性。在Bradley-Terry目标中,具有较高捷径敏感性的样本被动态降低权重,鼓励模型较少依赖表面模式,更多依赖任务相关的偏好信号。大量实验表明,DynaCF在偏好建模中持续提高了鲁棒性。

英文摘要

Reward models trained from pairwise preferences often exploit superficial shortcut cues rather than learning true response quality. We propose DynaCF, a dynamic reweighting framework for mitigating shortcut learning in reward model training. Unlike static shortcut heuristics, DynaCF measures shortcut sensitivity online during optimization by applying semantics-preserving counterfactual perturbations and tracking the resulting margin shifts and preference flips under the current model. Samples with higher shortcut sensitivity are dynamically downweighted in the Bradley-Terry objective, encouraging the model to rely less on superficial patterns and more on task-relevant preference signals. Extensive experiments show that DynaCF consistently improves robustness in preference modeling.
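A minimal sketch of sensitivity-based downweighting in a Bradley-Terry objective follows. The abstract only says that margin shifts and preference flips under counterfactual perturbations drive the weights; the specific sensitivity score (mean absolute shift plus flip rate), the exponential weighting with `beta`, and all function names are assumptions made for illustration.

```python
import math

def bt_loss(margin):
    """Bradley-Terry negative log-likelihood, -log sigmoid(margin), computed stably."""
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

def shortcut_sensitivity(margin, perturbed_margins):
    """Assumed score: mean absolute margin shift plus the fraction of preference flips."""
    shifts = [abs(margin - m) for m in perturbed_margins]
    flips = [(m > 0) != (margin > 0) for m in perturbed_margins]
    return sum(shifts) / len(shifts) + sum(flips) / len(flips)

def reweighted_loss(margins, perturbed, beta=1.0):
    """Weighted mean BT loss; samples with higher sensitivity get smaller weights."""
    weights = [math.exp(-beta * shortcut_sensitivity(m, ps))
               for m, ps in zip(margins, perturbed)]
    total = sum(weights)
    return sum(w * bt_loss(m) for w, m in zip(weights, margins)) / total
```

Here `margin` is the reward difference between the chosen and rejected response, and `perturbed_margins` holds the same difference after each semantics-preserving perturbation. In an online setting these would be recomputed with the current model at each step.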

2606.09204 2026-06-09 cs.LG cs.CL cs.CR 新提交

The Injection Paradox: Brand-Level Suppression in Safety-Trained LLM Recommendations via RAG Context Injection

注入悖论:通过RAG上下文注入在安全训练的LLM推荐中实现品牌级压制

Hyunseok Paeng

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 研究发现在基于RAG的LLM推荐中,安全训练会导致注入提示反而压制目标品牌推荐率,揭示了安全机制可能被逆向利用的风险。

Comments 16 pages, 1 figure, 15 tables. Accepted at the ICML 2026 Workshop on Failure Modes in Agentic AI (FAGEN), a non-archival venue

详情
AI中文摘要

我们提出了一种在基于RAG的LLM推荐中安全训练的可复现失败模式——注入悖论,其中嵌入在检索文档中的提示注入反而对攻击者不利,将目标品牌压制到低于无注入基线的水平。在安全训练的Claude模型中,包含提示注入的文档推荐率急剧下降,且这种压制会传播到同一品牌的其他未修改文档。在Claude Opus 4.6中,目标品牌从54%的基线降至所有50次试验中零次前二推荐,尽管语料库中4个品牌文档只有1个包含注入。该方向模式在反事实实验和三个品牌中均得到复现。在测试的GPT模型中观察到相反结果,相同的注入反而增加了推荐,表明注入类上下文影响推荐行为的模型族差异。这些发现提出了逆向攻击场景的技术可能性,即攻击者将注入嵌入竞争对手文档,通过安全敏感模型行为压制竞争对手品牌。

英文摘要

We present a reproducible failure mode of safety training in RAG-based LLM recommendation -- the Injection Paradox -- in which prompt injections embedded in retrieved documents backfire against the attacker, suppressing the target brand below the injection-free baseline. In safety-trained Claude models, documents containing prompt injections suffer a sharp drop in recommendation rate, and this suppression propagates beyond the injected document to unmodified documents of the same brand. In Claude Opus 4.6, the target brand drops from a 54% baseline to zero top-2 recommendations across all 50 trials, even though only 1 of 4 brand documents in the corpus contains an injection. The directional pattern is reproduced in counterfactual experiments and across three brands. A contrasting result across the GPT models tested, where the same injection instead increases recommendations, suggests model-family differences in how injection-like context affects recommendation behavior. These findings raise the technical possibility of a reverse-attack scenario in which an adversary embeds injections in a competitor's documents to suppress the competitor's brand via safety-sensitive model behavior.

2606.09559 2026-06-09 cs.LG cs.AI cs.CR cs.RO 新提交

Safe-RULE: Safe Reinforcement UnLEarning

Safe-RULE:安全强化反学习

Shixiong Jiang, Taozheng Zhu, Fanxin Kong

发表机构 * University of Notre Dame(圣母大学)

AI总结 针对离线安全强化学习易受数据投毒攻击的问题,提出Safe-RULE框架,通过反学习移除恶意样本影响,无需从头训练或访问原始环境,实验证明能有效提升安全性。

Comments 20 pages, 3 figures

详情
AI中文摘要

离线安全强化学习(Safe RL)使得无需在线交互即可进行策略学习,适用于机器人系统等安全关键系统。然而,其对静态数据集的依赖使离线Safe RL面临数据投毒攻击,攻击者注入恶意样本以破坏安全性并诱导不安全策略行为。在这项工作中,我们提出了一种新的学习范式,称为安全强化反学习(Safe-RULE),作为一种防御框架,用于在不从头重新训练或需要访问原始训练环境的情况下移除中毒数据的影响。我们进一步将强化反学习扩展到离线Safe RL,通过在反学习过程中明确考虑任务性能和安全约束。跨基准Safe RL任务的实验表明,我们的方法能有效增强针对数据投毒攻击的安全性能。

英文摘要

Offline safe reinforcement learning (Safe RL) enables policy learning without online interactions, making it suitable for safety-critical systems such as robotics systems. However, its reliance on static datasets exposes offline Safe RL to data poisoning attacks, where adversaries inject malicious samples that compromise safety and induce unsafe policy behavior. In this work, we propose a new learning paradigm, named safe reinforcement unlearning (Safe-RULE), used as a defense framework to remove the influence of poisoned data without retraining from scratch or requiring access to the original training environment. We further extend reinforcement unlearning to offline Safe RL by explicitly accounting for both task performance and safety constraints during the unlearning process. Experiments across benchmark Safe RL tasks demonstrate that our approach effectively enhances safety performance against data poisoning attacks.

2606.07528 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

BEACON: Behavioral Entropy Aggregation for Cross-Model Hallucination Detection in Large Language Models

BEACON: 面向大语言模型跨模型幻觉检测的行为熵聚合

Naveen Bera, Pulijala Sai Nikhila, Kondaguduru Abhiram, Shaik Gayaz Ali, Shoaib Sadiq Salehmohamed, Shaik Mohammed Omar, Jinal Prashant Thakkar, Hansika Aredla, Shalmali Ayachit

发表机构 * LLM Lens

AI总结 提出BEACON框架,通过多维度行为特征(语义熵、嵌入几何、思维链一致性、释义稳定性)的黑盒检测方法,在7个基准上达到0.8123 AUROC,优于现有方法。

Comments 12 pages, 6 tables, 1 figure. Code and data available upon request

详情
AI中文摘要

大语言模型中的幻觉,即生成事实上不正确或未经支持的内容,仍然是可靠部署的关键障碍。我们提出了BEACON(面向跨模型幻觉检测的行为熵聚合),一个黑盒幻觉检测框架,仅基于模型输出运行,无需访问内部表示或外部知识库。BEACON从结构化的多遍生成中提取31维特征向量,整合了基于NLI的语义熵、嵌入几何、思维链一致性和释义稳定性信号。在七个基准的7,617个标记样本上训练的梯度提升分类器达到了0.8123 ± 0.0102的AUROC(95%置信区间:0.7632-0.8251),优于独立的语义熵(+0.2298)和SelfCheckGPT风格的一致性基线(+0.2457)。特征重要性分析表明,幻觉本质上是多维的,需要组合的不确定性信号。一个高效的5次调用变体达到了0.7795的AUROC,使得在黑盒LLM API上的实际部署成为可能。

英文摘要

Hallucination in large language models (LLMs), defined as the generation of factually incorrect or unsupported content, remains a critical barrier to reliable deployment. We present BEACON (Behavioral Entropy Aggregation for Cross-model hallucination detectiON), a black-box hallucination detection framework that operates purely on model outputs without requiring access to internal representations or external knowledge bases. BEACON extracts a 31-dimensional feature vector from structured multi-pass generation, integrating NLI-based semantic entropy, embedding geometry, chain-of-thought consistency, and paraphrase stability signals. A gradient-boosted classifier trained on 7,617 labeled examples across seven benchmarks achieves 0.8123 ± 0.0102 AUROC (95% CI: 0.7632-0.8251), outperforming standalone semantic entropy (+0.2298) and SelfCheckGPT-style consistency baselines (+0.2457). Feature importance analysis shows that hallucination is inherently multi-dimensional, requiring combined uncertainty signals. An efficient 5-call variant achieves 0.7795 AUROC, enabling practical deployment across black-box LLM APIs.

2606.07620 2026-06-09 cs.CV cs.AI cs.DC cs.LG 交叉投稿

SENTRY: Statistical Reliability Analysis of Vision Transformers Under Soft Errors

SENTRY: 视觉Transformer在软错误下的统计可靠性分析

Pramit Kumar Bhaduri, Mahdi Taheri, Samira Nazari, Maksim Jenihhin, Christian Herglotz, Michael Hubner

发表机构 * Brandenburg University of Technology Cottbus-Senftenberg(勃兰登堡工业大学) Tallinn University of Technology(塔林理工大学) Zanjan University(赞詹大学)

AI总结 提出基于有限总体抽样的统计故障注入框架,仅需数千样本即可在99%置信度下以1%误差界估计故障率,将实验成本降低高达10700倍,并揭示ViT中归一化层和关键指数位是脆弱性热点。

详情
AI中文摘要

随着视觉Transformer在自动驾驶和医学成像等安全关键领域的应用增长,确保其抵抗软错误的可靠性至关重要。尽管ViT提供了最先进的准确性,但其庞大的参数数量使得穷举故障注入不可行。为弥补这一差距,本文提出一个统计故障注入框架,利用有限总体抽样理论提供形式化的可靠性保证。我们证明,无论模型规模如何,仅需数千个样本即可在99%置信度下将故障率限制在1%的误差界内。与穷举方法相比,该方法将实验成本降低高达10700倍,同时保留跨架构组件定位脆弱性的能力。通过对ViT-Tiny和ViT-Small等不同架构的广泛评估,我们揭示了高度非均匀的可靠性景观。结果表明,虽然只有3%的FP32位翻转导致故障,但其中绝大多数事件导致灾难性的精度崩溃。具体脆弱性被定位到归一化层和IEEE-754格式中的关键指数位,为设计加固的、边缘部署的ViT架构提供了数学基础和可操作的见解。

英文摘要

With the growth of Vision Transformers in safety-critical domains like autonomous systems and medical imaging, ensuring their reliability against soft errors is paramount. While ViTs offer state-of-the-art accuracy, their massive parameter counts render exhaustive fault injection campaigns infeasible. To bridge this gap, a statistical fault injection framework is presented, leveraging finite-population sampling theory to provide formal reliability guarantees. It is demonstrated that failure rates are bounded within a 1% margin at 99% confidence using only a few thousand samples, regardless of model scale. This methodology achieves up to a 10,700 times reduction in experimental cost compared to exhaustive approaches, while preserving the ability to localize vulnerabilities across architectural components. Through extensive evaluation of different architectures like ViT-Tiny and ViT-Small, a highly non-uniform reliability landscape is uncovered. It is shown that while only 3% of FP32 bit-flips result in failure, the vast majority of these events lead to catastrophic accuracy collapse. Specific vulnerabilities are localized to normalization layers and critical exponent bits within the IEEE-754 format, providing a mathematical foundation and actionable insights for the design of hardened, edge-deployed ViT architectures.
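The claim that the required number of injections does not grow with model scale follows from the shape of the usual finite-population sample-size formula, sketched below. This is the standard formula used in statistical fault injection, assumed here rather than taken from the paper; `p` is the prior guess of the failure probability (0.5 is the most conservative choice) and `t` is the normal quantile for the confidence level (about 2.576 for 99%).

```python
import math

def fault_sample_size(population, margin, t, p=0.5):
    """Injections needed to estimate a failure rate within `margin` at the confidence implied by `t`.

    population: total number of possible faults (e.g. parameters times bits)
    """
    return math.ceil(
        population / (1 + margin**2 * (population - 1) / (t**2 * p * (1 - p)))
    )

small_model = fault_sample_size(10**9, margin=0.01, t=2.576)
large_model = fault_sample_size(10**12, margin=0.01, t=2.576)
```

As `population` grows, the result saturates at roughly `t**2 * p * (1 - p) / margin**2`, so a thousandfold larger fault space needs essentially the same number of samples. A smaller assumed `p` lowers the count further.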

2606.07660 2026-06-09 cs.CV cs.LG 交叉投稿

Need We Teach Foundation Models What is a Generative Image? Gradient-Free Generative Artifact Detection via Analytic Spectral Adaptation

我们是否需要教基础模型什么是生成图像?基于解析谱自适应的无梯度生成伪影检测

Qiaoyu Chen, Bing Zhang

发表机构 * Harbin University of Commerce(哈尔滨商业大学)

AI总结 提出无梯度方法,将生成伪影检测重构为分布外异常度量问题,通过解析解耦统计与语义偏差,在零样本设置下显著优于梯度优化方法。

详情
AI中文摘要

通过基于梯度的更新来适应基础模型以检测生成伪影会损害其内在表示。在有限样本上优化时,模型会过拟合到局部领域捷径。在专门数据上微调大量权重会引入错误的归纳偏差,在高维特征空间中引起可测量的 $\mathcal{L}_2$ 范数扰动——我们将这一现象形式化为锚点漂移。非线性激活放大了这种漂移,损害了跨未见领域的零样本伪造检测。我们提出了一种无梯度方法,将检测从二分类重新定义为分布外(OOD)异常度量问题。将冻结的基础模型视为稳定的坐标系,通过解析解耦统计和语义偏差,在真实视觉流形上建立一个绝对的自然锚点,该锚点源自注意力加权的空间矩和感知不一致性的正交投影。在极端零样本设置下(在面部伪造上训练,在通用文本到图像生成上测试),我们的方法显著优于梯度优化范式。无反向传播的前向传递和线性求解器实现了硬件无关、边缘可部署的校准,延迟极低。此外,Sherman-Morrison公式使得能够针对新型攻击进行即时在线学习,并通过协方差增量传输实现隐私保护的联邦协作。

英文摘要

Adapting foundation models to detect generative artifacts via gradient-based updates compromises their intrinsic representations. Under optimization on limited samples, models overfit to local domain shortcuts. Fine-tuning massive weights on specialized data introduces erroneous inductive biases, inducing a measurable $\mathcal{L}_2$ norm perturbation in the high-dimensional feature space -- a phenomenon we formalize as anchor drift. Amplified by nonlinear activations, this drift impairs zero-shot forgery detection across unseen domains. We propose a gradient-free methodology reframing detection from binary classification to an out-of-distribution (OOD) anomaly measurement problem. Treating a frozen foundation model as a stable coordinate system, we establish an absolute natural anchor on the real visual manifold by analytically decoupling statistical and semantic deviations, derived from attention-weighted spatial moments and orthogonal projection of perceptual inconsistencies. Evaluated in an extreme zero-shot setting (trained on face forgeries, tested on universal Text-to-Image generations), our method significantly outperforms gradient-optimized paradigms. Backpropagation-free forward passes and linear solvers enable hardware-agnostic, edge-deployable calibration with minimal latency. Furthermore, the Sherman-Morrison formula unlocks instantaneous online learning against novel attacks and enables privacy-preserving federated collaboration via covariance delta transmission.
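The Sherman-Morrison formula referred to in the abstract updates a known matrix inverse after a rank-one change without re-solving the full system: $(A + uv^\top)^{-1} = A^{-1} - \frac{A^{-1}uv^\top A^{-1}}{1 + v^\top A^{-1}u}$. The sketch below implements it with plain lists; how the paper applies it (presumably to an inverse covariance with `u = v` set to a new feature vector) is an assumption.

```python
def sherman_morrison(a_inv, u, v):
    """Return (A + u v^T)^{-1} given A^{-1}, in O(n^2) instead of a fresh O(n^3) inversion."""
    n = len(u)
    au = [sum(a_inv[i][k] * u[k] for k in range(n)) for i in range(n)]  # A^{-1} u
    va = [sum(v[k] * a_inv[k][j] for k in range(n)) for j in range(n)]  # v^T A^{-1}
    denom = 1.0 + sum(v[k] * au[k] for k in range(n))
    if abs(denom) < 1e-12:
        raise ValueError("rank-one update makes the matrix singular")
    return [[a_inv[i][j] - au[i] * va[j] / denom for j in range(n)] for i in range(n)]

# Start from the identity and add the outer product of [1, 0] with itself.
identity = [[1.0, 0.0], [0.0, 1.0]]
updated = sherman_morrison(identity, [1.0, 0.0], [1.0, 0.0])
```

Because only matrix-vector products are needed, each new sample can be folded in with a single forward pass and no backpropagation, which fits the gradient-free framing of the abstract.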

2606.07716 2026-06-09 cs.CR cs.AI cs.LG 交叉投稿

SHIELD-IDS: Structurally Heterogeneous Ensemble with Integrated Layered Defense for Intrusion Detection Systems

SHIELD-IDS:用于入侵检测系统的结构异构集成与分层防御

Maryam Zaman, Muhammad Khuram Shahzad

发表机构 * School of Electrical Engineering and Computing(SEECS)(电气工程与计算学院) National University of Sciences and Technology(国立科学与技术大学)

AI总结 提出IDS-Anta++框架,通过集成XGBoost和LightGBM梯度提升模型,并采用隔离森林异常检测、中值特征平滑和六路多数投票三层黑盒防御,提升对抗攻击鲁棒性,在多个数据集上实现99%以上检测准确率。

Comments 10 pages, 5 figures, 7 tables. Code available at: https://github.com/maryamzaman-git/SHEILD-IDS

详情
AI中文摘要

对抗攻击对基于机器学习的入侵检测系统(IDS)构成了严重且日益增长的威胁,其中对网络流特征的微小扰动可以系统性地误导分类器,将恶意流量视为良性。IDS-Anta框架通过Z-score归一化、奇异值分解(SVD)和基于汤普森采样的多臂赌博机(MAB)分类器选择部分解决了这一问题,但其分类器池缺乏足够的结构多样性以实现鲁棒的对抗抵抗。本文引入IDS-Anta++,将XGBoost和LightGBM梯度提升模型纳入集成,并将扩展后的池包裹在三层黑盒防御中:隔离森林异常检测、中值特征平滑和六路多数投票。在CIC-IDS-2017、CSE-CIC-IDS-2018和CIC-DDoS-2019数据集上,在快速梯度符号法(FGSM)和零阶优化(ZOO)攻击下进行的实验证实,干净数据上的检测准确率超过99%,并且在对抗条件下相对于基线IDS-Anta配置具有可测量的鲁棒性提升。

英文摘要

Adversarial attacks pose a serious and growing threat to Machine Learning (ML)-based Intrusion Detection Systems (IDS), where imperceptible perturbations to network flow features can systematically mislead classifiers into accepting malicious traffic as benign. The IDS-Anta framework partially addresses this through Z-score normalization, Singular Value Decomposition (SVD), and Multi-Armed Bandit (MAB) classifier selection with Thompson Sampling, yet its classifier pool lacks sufficient structural diversity for robust adversarial resistance. This work introduces IDS-Anta++, which incorporates XGBoost and LightGBM gradient boosting models into the ensemble and wraps the extended pool in a three-layer black-box defense: Isolation Forest anomaly screening, median feature smoothing, and six-way majority voting. Experiments conducted on CIC-IDS-2017, CSE-CIC-IDS-2018, and CIC-DDoS-2019 under both Fast Gradient Sign Method (FGSM) and Zeroth Order Optimization (ZOO) attacks confirm detection accuracy above 99% on clean data, with measurable robustness gains under adversarial conditions relative to the baseline IDS-Anta configuration.
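
The last of the three defense layers, six-way majority voting, can be sketched in a few lines. This is an illustrative sketch only: the abstract does not say how a 3-3 split is resolved, so the conservative tie-break below (flag the flow as malicious, label `1`) is an assumption, as is the function name.

```python
from collections import Counter


def majority_vote(predictions, tie_label=1):
    """Combine per-classifier labels into one decision.

    predictions: list of labels, one per ensemble member (e.g. six 0/1 votes).
    tie_label:   label returned when the top two counts are equal; defaulting to
                 1 (malicious) is an assumption, not taken from the paper.
    """
    ranked = Counter(predictions).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return tie_label
    return ranked[0][0]
```

With an even number of voters a tie rule is unavoidable for binary labels, which is why the sketch makes it an explicit parameter.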

2606.07822 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

The ACUTE Protocol: Operationalizing Language Model Activations for Better Calibration, Utility, and Trust

ACUTE协议:操作语言模型激活以实现更好的校准、效用和信任

Nishant Subramani, Palash Goyal, Yiwen Song, Mani Malek, Yuan Xue, Tomas Pfister, Hamid Palangi

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Google(谷歌) Scale AI

AI总结 提出ACUTE协议,通过操作语言模型激活来估计置信度,平衡校准与信息性,在多项选择问答、工具调用和科学文档摘要等任务上优于强基线,提升校准、效用和可信度。

Comments Accepted to ICML 2026

详情
AI中文摘要

随着语言模型的改进并越来越多地部署以解决各种任务,可信度变得至关重要。校准是信任的良好代理:良好校准的置信度估计有助于在信任特定模型输出时告知风险与回报的权衡。不幸的是,即使模型改进,它们仍然校准不良,往往偏向过度自信。此外,校准可能被操纵:总是预测基率的策略是完美校准的,但完全没有信息性。为了解决这个问题,我们开发了一个新指标,即通过预言机重新归一化的期望效用(EURO),它平衡了校准和信息性。我们还提出了一种通用的基于激活的置信度、效用和信任估计协议(ACUTE),以适当裁决不确定性。ACUTE协议为4个模型家族的6个模型上的3个任务(包括多项选择问答、工具调用和科学文档摘要)提供了灵活、样本高效和计算高效的置信度估计器。ACUTE在EURO上优于强基线,同时保持较低的校准误差。综合来看,我们的工作表明,为LLM配备ACUTE协议可以在多种设置中提高校准、效用和可信度。

英文摘要

As language models improve and become increasingly deployed to solve a variety of tasks, trustworthiness becomes essential. Calibration is a good proxy for trust: well-calibrated confidence estimates help inform the risk versus reward tradeoff when trusting a specific model output. Unfortunately, even as models improve, they remain poorly calibrated, often biasing towards overconfidence. Additionally, calibration can be gamed: a policy that always predicts the base rate is perfectly calibrated, but completely uninformative. To resolve this, we develop a new metric, expected utility renormalized by the oracle (EURO), that balances calibration and informativeness. We also propose a general-purpose activation-based confidence, utility, and trust estimation protocol (ACUTE) to appropriately adjudicate uncertainty. The ACUTE protocol provides flexible, sample-efficient, and compute-efficient confidence estimators for 3 tasks including multiple choice question answering, tool-calling, and scientific document summarization across 6 models from 4 model families. ACUTE outperforms strong baselines on EURO, while maintaining low calibration error. Taken together, our work shows that equipping LLMs with the ACUTE protocol can improve calibration, utility, and trustworthiness in numerous settings.
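
The claim that calibration can be gamed is easy to check numerically. The sketch below computes a standard binned expected calibration error (ECE). It does not implement EURO, whose definition is not given in the abstract. The function name and the equal-width binning are assumptions made for illustration.

```python
def expected_calibration_error(confidences, outcomes, n_bins=10):
    """Binned ECE: weighted mean |average confidence - empirical accuracy| per bin.

    confidences: predicted probabilities of being correct, each in [0, 1].
    outcomes:    1 if the corresponding answer was correct, else 0.
    """
    bins = [[] for _ in range(n_bins)]
    for conf, correct in zip(confidences, outcomes):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((conf, correct))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(y for _, y in members) / len(members)
        ece += len(members) / total * abs(avg_conf - accuracy)
    return ece
```

A predictor that always outputs the base rate (for example 0.75 when three of four answers are correct) scores an ECE of zero while ranking no answer above any other, which is the gap a metric combining calibration with informativeness is meant to close.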

2606.08571 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Calibration of Structured Ignorance Certificates for Diagnosing Unknown Unknowns in Reasoning Models

用于诊断推理模型中未知未知的结构化无知证书的校准

Subramanyam Sahoo

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出结构化无知证书(SICs)输出格式,通过GRPO微调14B模型,使模型在无法回答时明确承认知识缺失并生成检索查询,在未知未知问题上实现99.46%的JSON有效性和0.967的证书特异性分数。

Comments Accepted in ICML 2026 Workshop: Epistemic Intelligence in Machine Learning

详情
AI中文摘要

大型语言模型经常以特征性方式失败:对于超出其知识边界的问题,它们不是承认无知,而是生成流畅但错误的答案。我们引入了结构化无知证书(SICs),这是一种JSON格式的输出模式,要求模型明确命名缺失的领域交叉点,列举所需概念,并提出一个富有成效的检索查询,而不是凭空捏造答案。为了训练模型生成高质量的SICs,我们构建了一个包含7,347个样本的未知-未知(UU)数据集,通过提示Qwen3-14B将来自七个领域(物理、生物、工程、计算机科学、经济、医学、法律)的问题拼接成新颖的跨领域查询,这些查询是任何单一领域专家都无法回答的。我们使用组相对策略优化(GRPO)微调了一个14B参数的模型,采用结合检索效用、概念特异性和输出格式有效性的复合奖励。在模型响应上训练的释义散度探测器证实,SIC调优的输出系统地表现出更高的未知-未知概率分数。在735个保留的UU问题上的评估实现了99.46%的JSON有效性率、0.967的平均证书特异性分数,以及在基于检索的生成上相比基础模型3.6%的ROUGE-L改进——这表明显式的认知结构化是一种可学习且可衡量的能力。

英文摘要

Large language models frequently fail in a characteristic way: rather than acknowledging ignorance, they produce fluent but incorrect answers to questions that lie beyond their knowledge boundaries. We introduce Structured Ignorance Certificates (SICs), a JSON-formatted output schema that demands a model explicitly name the missing domain intersection, enumerate required concepts, and propose a productive retrieval query rather than hallucinating an answer. To train models to produce high-quality SICs, we construct a 7,347-sample Unknown-Unknown (UU) dataset by prompting Qwen3-14B to stitch together questions from seven domains (physics, biology, engineering, CS, economics, medical, legal) into novel cross-domain queries that no single-domain expert could answer. We fine-tune a 14B-parameter model with Group Relative Policy Optimization (GRPO) using a composite reward that combines retrieval utility, concept specificity, and output-format validity. A paraphrase-divergence probe trained on model responses confirms that SIC-tuned outputs systematically exhibit higher unknown-unknown probability scores. Evaluation on 735 held-out UU questions achieves a 99.46% JSON validity rate, a mean Certificate Specificity Score of 0.967, and a 3.6% ROUGE-L improvement over the base model on retrieval-grounded generation -- demonstrating that explicit epistemic structuring is a learnable and measurable capability.

2606.08919 2026-06-09 cs.AI cs.CR cs.LG 交叉投稿

Oversight Has a Capacity: Calibrating Agent Guards to a Subjective, Fatiguing Human

监督具有容量:将智能体守卫校准到主观且易疲劳的人类

Emre Turan

发表机构 * GitHub arXiv

AI总结 针对LLM智能体动作审批中人类评审者主观且易疲劳的问题,提出将守卫建模为成本敏感的选择性分类,并引入负载感知策略,发现过度监督反而降低安全性,形成倒U型曲线。

Comments 12 pages, 4 figures. Code and interactive demo: https://github.com/turangenesis/headroom

详情
AI中文摘要

随着LLM智能体开始采取真实、不可逆的行动(如shell命令、文件编辑、部署),标准的安全模式是人在环中的审批门:风险动作暂停并等待人工确认。我们认为审批门是容易的部分;困难的部分在于判断——哪些动作需要停止——而该领域目前基于两个错误假设进行评估:存在一个“风险”的真实标签,以及人类评审者是完美且无限可用的预言机。在一个由125个对抗性加权的智能体动作组成的手工标注集上,我们展示了:(i) 评审者对何为风险仅中度一致(Fleiss' kappa = 0.52),因此不存在单一正确标签;(ii) 将守卫建模为非对称成本下的选择性分类使其操作极限可测量,且在困难输入上守卫无法安全地自动决策;(iii) 当评审者被建模为内生变量(随着升级负载增加而疲劳)时,实际安全性在升级率上呈现倒U形:更多的人类监督可能使系统更不安全,而安全最优的守卫升级率低于完全升级——负载感知策略也利用这一设置来抵御洪水攻击,该攻击通过使疲劳的评审者漏过恶意动作。以这种方式框架化的智能体监督不仅是一个分类问题,还是一个资源分配问题:人类注意力是有限的,而守卫的升级策略消耗它。我们声称这些机制均非新颖——疲劳感知的延迟决策(FALCON)、工作负载约束下的成本敏感延迟(DeCCaF)、轨迹级守卫以及评审者疲劳/洪水攻击均为我们引用的现有技术。我们的贡献是一个开源的智能体监督系统,它在LLM智能体动作门控设置中操作化和测量这些机制,将“我的守卫好吗?”从猜测转变为一条曲线。倒U形和洪水攻击是激励人类研究的建模结果。

英文摘要

As LLM agents begin to take real, irreversible actions (shell commands, file edits, deploys), the standard safety pattern is a human-in-the-loop approval gate: risky actions pause and wait for a person. We argue the gate is the easy part; the hard part is the judgment - which actions to stop - which the field evaluates against two false assumptions: that there is a ground-truth notion of "risky," and that the human reviewer is a perfect, infinitely-available oracle. On a hand-labeled set of 125 adversarially-weighted agent actions we show that (i) reviewers only moderately agree on what is risky (Fleiss' kappa = 0.52), so there is no single correct label; (ii) framing the guard as selective classification under asymmetric cost makes its operating limits measurable, and on hard inputs the guard cannot safely auto-decide; and (iii) when the reviewer is modeled as endogenous (fatiguing as escalation load grows), realized safety becomes an inverted-U in the escalation rate: more human oversight can make a system less safe, and the safety-optimal guard escalates below full escalation - a setting a load-aware policy also uses to resist a flooding attack that slips a malicious action past a fatigued reviewer. Agent oversight, framed this way, is not only a classification problem but a resource-allocation one: human attention is finite, and the guard's escalation policy spends it. We claim none of these mechanisms as novel - fatigue-aware learning-to-defer (FALCON), cost-sensitive deferral under workload constraints (DeCCaF), trajectory-level guarding, and reviewer-fatigue/flooding attacks are all prior art we cite. Our contribution is an open-source agent-oversight system that operationalizes and measures them in the LLM-agent action-gating setting, turning "is my guard good?" from a guess into a curve. The inverted-U and the flooding attack are modeling results that motivate a human study.
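
The agreement statistic quoted above, Fleiss' kappa, can be computed directly from a table of rating counts. The sketch below follows the textbook definition. The function name and input layout are illustrative assumptions, and it presumes every item was rated by the same number of reviewers and that more than one category was used overall.

```python
def fleiss_kappa(ratings):
    """Fleiss' kappa for a fixed number of raters per item.

    ratings[i][j] = number of raters who assigned item i to category j.
    Assumes each row sums to the same rater count and that at least two
    categories are used overall (otherwise the denominator is zero).
    """
    n_items = len(ratings)
    n_raters = sum(ratings[0])
    n_cats = len(ratings[0])
    # Observed agreement: share of agreeing rater pairs, averaged over items.
    per_item = [(sum(c * c for c in row) - n_raters) / (n_raters * (n_raters - 1))
                for row in ratings]
    p_observed = sum(per_item) / n_items
    # Chance agreement from the marginal category proportions.
    marginals = [sum(row[j] for row in ratings) / (n_items * n_raters)
                 for j in range(n_cats)]
    p_chance = sum(p * p for p in marginals)
    return (p_observed - p_chance) / (1.0 - p_chance)
```

A value of 1 means perfect agreement and 0 means agreement no better than chance, so the reported 0.52 sits in the range conventionally read as moderate.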

2606.09563 2026-06-09 cs.AI cs.LG 交叉投稿

PRISM: Recovering Instruction Sets from Language Model Activations

PRISM:从语言模型激活中恢复指令集

Gilad Gressel, Rahul Pankajakshan, Julia Diament, Efim Hudis, Krishnashree Achuthan, Yisroel Mirsky

发表机构 * Center for Cybersecurity Systems & Networks, Amrita Vishwa Vidyapeetham(阿姆里塔·维什瓦·维迪亚佩瑟姆网络安全系统与网络中心) Microsoft(微软) Ben-Gurion University of the Negev(内盖夫本-古里安大学)

AI总结 提出PRISM方法,通过激活条件解码从冻结目标模型隐藏状态中恢复活跃指令集,利用法官引导的GRPO优化,在多种场景下优于基线方法。

Comments Under Review

详情
AI中文摘要

随着LLM被部署为智能体,可靠的监控不仅需要知道它们输出了什么,还需要知道哪些指令在引导它们的行为。当模型推断出非预期的子目标、遵循上下文线索或受到提示注入和隐藏目标的影响时,这变得困难。虽然激活到语言的方法表明隐藏状态可以揭示自然语言信息,但现有方法并非设计用于恢复智能体设置中同时活跃的完整指令、约束、禁止和子目标集。我们将此问题形式化为指令集检索,并引入PRISM,一个激活条件的解释器,将冻结目标模型的隐藏状态解码为活跃指令的忠实项目符号列表。与先前的激活到语言方法不同,PRISM直接训练以恢复指令集,使用法官引导的GRPO来奖励覆盖的指令并惩罚不支持的指令。在良性、受限、提示注入和隐藏目标设置中,PRISM优于激活到语言基线,特别是在安全相关目标上。

英文摘要

As LLMs are deployed as agents, reliable monitoring requires knowing not only what they output, but which instructions are steering their behavior. This is difficult when models infer unintended subgoals, follow contextual cues, or are influenced by prompt injections and hidden objectives. While activation-to-language methods suggest that hidden states can reveal natural-language information, existing approaches are not designed to recover the full set of simultaneous instructions, constraints, prohibitions, and subgoals active in agentic settings. We formalize this problem as instruction set retrieval and introduce PRISM, an activation-conditioned interpreter that decodes hidden states from a frozen target model into a faithful bullet list of active instructions. Unlike prior activation-to-language methods, PRISM is trained to recover instruction sets directly, using judge-guided GRPO to reward covered instructions and penalize unsupported ones. Across benign, constrained, prompt-injection, and hidden-objective settings, PRISM outperforms activation-to-language baselines, especially on security-relevant objectives.

2606.09577 2026-06-09 cs.CL cs.LG cs.SE 交叉投稿

Code Is More Than Text: Uncertainty Estimation for Code Generation

代码不仅仅是文本:代码生成的不确定性估计

Yuling Shi, Caiqi Zhang, Yuexian Li, Haopeng Wang, Yeheng Chen, Nigel Collier, Xiaodong Gu

发表机构 * Shanghai Jiao Tong University(上海交通大学) University of Cambridge(剑桥大学)

AI总结 针对代码生成中错误程序的可靠性问题,提出基于词法、算法和功能三个正交轴的不确定性估计方法,在五个代码LLM上将AUROC提升8.1个百分点。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被部署为代码生成器,其中静默错误的程序会带来真实的安全和可靠性风险。可靠的不确定性估计(UE)对于选择性预测、人在回路审查和下游智能体决策至关重要。然而,现有的大多数代码UE方法继承自自然语言(NL)生成,忽略了使代码独特的属性。我们认为代码在三个方面与NL不同:单个错误标记可能破坏整个程序(标记脆弱性);算法意图和具体实现可能独立不一致(意图-代码差距);程序可以被执行(可执行性)。我们将这些属性实例化为三个正交的不确定性轴:词法(Top-K标记熵)、算法(伪代码一致性)和功能(行为一致性)。在五个代码LLM上,我们的三轴集成将平均AUROC从最强NL衍生基线的0.696提高到0.776(+8.1点)。值得注意的是,在Qwen3-14B上,我们的单次Top-K标记熵匹配了最强多次基线,同时成本降低超过3倍;在各模型上,它仍然是一个有竞争力的低成本信号。这些结果表明,代码UE需要特定于代码的设计,而不是直接移植NL方法。

英文摘要

Large language models (LLMs) are increasingly deployed as code generators, where silently wrong programs pose real safety and reliability risks. Reliable uncertainty estimation (UE) is essential for selective prediction, human-in-the-loop review, and downstream agentic decisions. Yet most existing code UE methods are inherited from natural language (NL) generation and ignore properties that make code distinct. We argue that code differs from NL in three ways: a single wrong token can break an entire program (token fragility); algorithmic intent and concrete implementation can disagree independently (intent-code gap); and programs can be executed (executability). We instantiate these properties as three orthogonal uncertainty axes: lexical (Top-K token entropy), algorithmic (pseudo-code consistency), and functional (behavioral consistency). Across five code LLMs, our three-axis ensemble improves average AUROC from 0.696 for the strongest NL-derived baseline to 0.776 (+8.1 points). Notably, on Qwen3-14B, our single-pass Top-K token entropy matches the strongest multi-pass baseline while being over 3x cheaper; across models, it remains a competitive low-cost signal. These results suggest that code UE deserves code-specific design rather than direct NL ports.
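
The lexical axis, Top-K token entropy, is cheap because it needs only the probabilities from a single decoding pass. A minimal sketch follows, under two assumptions the abstract does not confirm: the top-K probabilities are renormalized to sum to one before taking the entropy, and per-token entropies are averaged over the sequence. Function names are hypothetical.

```python
import math


def top_k_entropy(token_probs, k=5):
    """Shannon entropy (in nats) of the renormalized top-k probabilities at one step."""
    top = sorted(token_probs, reverse=True)[:k]
    total = sum(top)
    return -sum((p / total) * math.log(p / total) for p in top if p > 0)


def sequence_uncertainty(step_distributions, k=5):
    """Mean top-k entropy across all generated tokens (averaging is an assumption)."""
    return sum(top_k_entropy(d, k) for d in step_distributions) / len(step_distributions)
```

A confident step such as `[0.97, 0.01, 0.01, 0.01]` yields a value near zero, while a flat distribution over the top K candidates yields the maximum, `log(K)`.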

2606.09700 2026-06-09 cs.CR cs.HC cs.LG 交叉投稿

What the Eyes See, the LLMs Miss: Exploiting Human Perception for Adversarial Text Attacks

眼睛所见,大语言模型所不见:利用人类感知进行对抗性文本攻击

Qin Yang, Lu Malloy, Joshua Lee, Xiaohan Chang, Meisam Mohammady, Doowon Kim, Yuan Hong

发表机构 * University of Connecticut(康涅狄格大学) University of Tennessee(田纳西大学) University of California, Santa Barbara(加州大学圣芭芭拉分校) Iowa State University(爱荷华州立大学)

AI总结 针对LLM内容审核系统忽视人类视觉线索的缺陷,提出人类可感知对抗攻击(HPAA),通过排版操纵嵌入有害内容,在仅三次查询下实现86%人类识别率而机器检测率低于1%。

Comments This work has been accepted for publication at USENIX Security 2026. This paper includes examples of harmful, hateful, or abusive language for research purposes. Reader discretion is advised

详情
AI中文摘要

基于大型语言模型(LLM)的内容审核系统已成为对抗有害在线内容的关键防线。然而,这些系统主要基于分词文本运行,很大程度上忽略了人类在解释内容时自然依赖的视觉线索。我们表明,这种差异造成了根本性的感知不匹配:人类容易识别为有害的内容,对自动审核系统而言可能变得几乎不可见。为研究这一漏洞,我们引入了一类人类可感知对抗攻击(HPAA),其中有害表达通过视觉上显著的排版操纵嵌入到原本良性的文本中。我们的关键洞察是,排版特征(包括间距、视觉强调和空间排列)可以策略性地组合,以保留人类对有害内容的识别,同时大幅降低机器可检测性。在黑盒设置下,仅使用少量查询预算,我们的攻击自动生成规避内容,无需模型访问或梯度信息。我们在多个数据集和十个已部署的审核系统(包括商业API和最先进的开源防护)上评估了该攻击。结果揭示了人类与机器感知之间的显著差距:仅使用三次检测器查询,生成的攻击在评估系统中实现了超过86%的人类识别率,同时检测率低于1%。我们进一步进行消融研究,以识别驱动成功规避的排版因素,分析当前审核架构为何无法捕捉这些信号,并讨论实际防御措施。我们的发现暴露了当今基于LLM的审核生态系统中的根本盲点,并强调了需要以更符合人类感知理解的方式推理内容的审核系统。

英文摘要

Large language model (LLM)-powered content moderation systems have become a critical defense against harmful online content. However, these systems primarily operate on tokenized text and largely ignore the visual cues that humans naturally rely on when interpreting content. We show that this discrepancy creates a fundamental perceptual mismatch: content that is readily recognized as harmful by humans can become effectively invisible to automated moderation systems. To study this vulnerability, we introduce a class of Human-Perceptible Adversarial Attacks (HPAA), in which harmful expressions are embedded into otherwise benign text through visually salient typographic manipulations. Our key insight is that typographic features, including spacing, visual emphasis, and spatial arrangement, can be strategically combined to preserve human recognition of harmful content while substantially reducing machine detectability. Operating in black-box settings with only a small query budget, our attack automatically generates evasive content without requiring model access or gradient information. We evaluate the attack across multiple datasets and ten deployed moderation systems, including commercial APIs and state-of-the-art open-source guardrails. Results reveal a striking gap between human and machine perception: with only three detector queries, generated attacks achieve over 86% human recognition while maintaining detection rates below 1% across the evaluated systems. We further conduct ablation studies to identify the typographic factors driving successful evasion, analyze why current moderation architectures fail to capture these signals, and discuss practical defenses. Our findings expose a fundamental blind spot in today's LLM-based moderation ecosystem and highlight the need for moderation systems that reason about content in a manner more consistent with human perceptual understanding.

2606.09701 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Learning to Attack and Defend: Adaptive Red Teaming of Language Models via GRPO

学习攻击与防御:通过GRPO对语言模型进行自适应红队测试

Blake Bullwinkel, Eugenia Kim, Amanda Minnich, Mark Russinovich

发表机构 * Microsoft AI Red Team(微软AI红队) Microsoft Azure(微软Azure)

AI总结 提出AdvGRPO框架,通过密集多通道奖励和分离优势归一化实现GRPO在攻击者-防御者联合优化中的稳定训练,产生高效可迁移攻击,防御者优于基线。

详情
AI中文摘要

AI红队测试必须不断适应不断演变的攻击者和防御者。强化学习为发现新型攻击提供了一种有前景的方法,而协同训练方法可以同时产生更鲁棒的防御者。最近的工作通过应用PPO和DPO证明了攻击者-防御者协同训练的有效性,但报告称GRPO在此设置中不稳定。我们引入了AdvGRPO,一种协同训练框架,通过使用密集多通道奖励和分离优势归一化,使GRPO能够用于攻击者-防御者联合优化。训练过程通过一个课程从单轮攻击发展到闭环多轮攻击,然后启动协同训练,其中攻击者和防御者模型交替更新。我们表明,我们的方法可以产生高度有效且可迁移的攻击,并且协同训练的防御者在安全基准测试中优于基线。

英文摘要

AI red teaming must continually adapt to evolving attackers and defenders. Reinforcement learning offers a promising approach to discovering novel attacks, and co-training methods can produce more robust defenders in tandem. Recent works have demonstrated the efficacy of attacker-defender co-training by applying PPO and DPO, but report that GRPO is unstable in this setting. We introduce AdvGRPO, a co-training framework that makes GRPO viable for joint attacker-defender optimization using dense multi-channel rewards and decoupled advantage normalization. Training progresses through a curriculum from single-turn to closed-loop multi-turn attacks before bootstrapping co-training, where attacker and defender models are updated in alternation. We show that our method can produce highly effective and transferable attacks and that co-trained defenders outperform baselines on safety benchmarks.
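
For readers unfamiliar with GRPO, its core step is a group-relative advantage: each sampled response is scored against the mean and standard deviation of its own group rather than against a learned value function. The sketch below shows only that standard normalization. The abstract does not specify how AdvGRPO decouples the normalization across reward channels, so that part is deliberately left out, and the function name is an assumption.

```python
import statistics


def group_relative_advantages(rewards, eps=1e-8):
    """Standard GRPO-style advantages for one group of sampled responses.

    rewards: scalar rewards for responses drawn from the same prompt.
    eps:     small constant that keeps the division finite when all rewards tie.
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std over the group
    return [(r - mean) / (std + eps) for r in rewards]
```

When every response in a group earns the same reward, all advantages are zero and the group contributes no gradient, which is one reason denser reward signals can help in this setting.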

2606.09746 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

Hybrid Robustness Verification for Spatio-Temporal Neural Networks

时空神经网络的混合鲁棒性验证

Sherwin Varghese, Matthew Wicker, Alessio Lomuscio

发表机构 * Imperial College London(伦敦帝国学院)

AI总结 针对3D CNN在视频和体素输入中的鲁棒性验证,提出时空约束建模和STBP框架,实现精确闭式传播与可扩展近似,在UCF-101等基准上提升1.7倍认证鲁棒准确率。

Comments Accepted at the 9th International Symposium on AI Verification (SAIV 2026)

详情
AI中文摘要

随着人工智能越来越多地部署在安全关键系统中,为底层模型提供形式化的鲁棒性保证至关重要。现有的验证方法要么依赖过于保守的近似,要么产生难以承受的计算成本。例如,在视频设置中使用lp-范数扰动编码了对手可以在每个视频帧中注入噪声的信念。实际上,对抗性扰动表现出结构化的时空相关性,被约束在低维、语义上有意义的子空间中。在这项工作中,我们研究了处理视频和体素输入的3D CNN的鲁棒性验证,针对动作识别(UCF-101)、自动驾驶(Udacity)和医学成像(MedMNIST)中的应用,通过将对抗强度建模为时空约束——攻击者可以修改一组连续帧中的子集或补丁——来利用关于对抗强度的现实假设。我们证明,建模现实约束能够实现更紧的近似。我们引入了时空边界传播(STBP),这是一个验证框架,它计算第一卷积层的精确闭式表征,并通过可扩展的近似传播认证边界。计算精确闭式为第一卷积层提供了最紧的边界。因此,我们在网络的其余部分使用近似方法。为了推动该领域的进一步发展,我们提出了ST-Bench,一个用于自动驾驶和活动识别的验证基准,以系统评估可验证的鲁棒性。与现有的基于验证的方法相比,STBP在相同的扰动预算下提供了更强的鲁棒性保证,并显著提高了可扩展性,实现了1.7倍更高的认证鲁棒准确率。

英文摘要

With AI increasingly deployed in safety-critical systems, providing formal robustness guarantees for the underlying models is essential. Existing verification methods either rely on overly conservative approximations or incur prohibitive computational costs. For example, the use of $\ell_p$-norm perturbations in video settings encodes the belief that the adversary can inject noise in every video frame. In practice, adversarial perturbations exhibit structured spatial and temporal correlations, constrained to lower-dimensional, semantically meaningful subspaces. In this work, we study robustness verification of 3D CNNs processing video and volumetric inputs, targeting applications in action recognition (UCF-101), autonomous driving (Udacity), and medical imaging (MedMNIST). We exploit realistic assumptions on adversarial strength by modelling them as spatio-temporal constraints, where the attacker can modify either a subset of frames or patches within a set of consecutive frames. We demonstrate that modelling realistic constraints enables tighter approximations. We introduce Spatio-Temporal Bound Propagation (STBP), a verification framework that computes an exact closed-form characterization of the first convolutional layer and propagates certified bounds through subsequent layers using scalable approximations. The exact closed form gives the tightest bounds for the first convolutional layer; the remainder of the network is handled with approximation methods. To spur further progress in this field, we propose ST-Bench, a verification benchmark for autonomous driving and activity recognition, to systematically evaluate verifiable robustness. Compared to existing verification-based approaches, STBP provides stronger robustness guarantees with significantly improved scalability, achieving 1.7x higher certified robust accuracy under identical perturbation budgets.
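
Why a frame-subset constraint tightens certified bounds can be seen with plain interval arithmetic on a single affine layer. This is a generic sketch, not STBP itself: `masked_box` and `linear_interval_bounds` are hypothetical helpers, inputs are flattened to lists, and the mask marks which entries (for example, pixels of the attackable frames) may move by up to `eps`.

```python
def masked_box(x, mask, eps):
    """Input box where only masked entries may be perturbed by +/- eps."""
    lower = [v - eps if m else v for v, m in zip(x, mask)]
    upper = [v + eps if m else v for v, m in zip(x, mask)]
    return lower, upper


def linear_interval_bounds(weights, bias, lower, upper):
    """Per-output bounds of W x + b over the box [lower, upper].

    For one affine layer these per-coordinate bounds are exact: a positive
    weight attains its minimum at the lower input bound, a negative weight
    at the upper input bound.
    """
    out_lo, out_hi = [], []
    for row, b in zip(weights, bias):
        lo = hi = b
        for w, l, u in zip(row, lower, upper):
            if w >= 0:
                lo += w * l
                hi += w * u
            else:
                lo += w * u
                hi += w * l
        out_lo.append(lo)
        out_hi.append(hi)
    return out_lo, out_hi
```

Restricting the mask to part of the input shrinks the output interval relative to perturbing every entry, and those tighter first-layer bounds are what later layers inherit.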

2403.06013 2026-06-09 cs.LG cs.CV 版本更新

Are Classification Robustness and Explanation Robustness Really Strongly Correlated? An Analysis Through Input Loss Landscape

分类鲁棒性与解释鲁棒性真的强相关吗?基于输入损失景观的分析

Tiejin Chen, Wenwang Huang, Linsey Pang, Dongsheng Luo, Hua Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文质疑分类鲁棒性与解释鲁棒性强相关的传统观点,通过聚类评估解释鲁棒性,并提出调整解释损失景观的训练方法,发现两者并不强相关。

详情
AI中文摘要

本文深入探讨深度学习鲁棒性的关键领域,挑战了图像分类系统中分类鲁棒性和解释鲁棒性固有相关的传统观点。通过一种利用聚类高效评估解释鲁棒性的新颖评估方法,我们证明增强解释鲁棒性并不一定会使输入损失景观相对于解释损失变得平坦——这与平坦的损失景观指示更好的分类鲁棒性相反。为了深入探究这一矛盾,我们提出了一种开创性的训练方法,旨在调整相对于解释损失的损失景观。通过这种新的训练方法,我们发现尽管这种调整可以影响解释的鲁棒性,但它们对分类的鲁棒性没有影响。这些发现不仅挑战了两种鲁棒性之间强相关的主流假设,而且为理解损失景观与解释损失之间的关系开辟了新的途径。

英文摘要

This paper delves into the critical area of deep learning robustness, challenging the conventional belief that classification robustness and explanation robustness in image classification systems are inherently correlated. Through a novel evaluation approach leveraging clustering for efficient assessment of explanation robustness, we demonstrate that enhancing explanation robustness does not necessarily flatten the input loss landscape with respect to explanation loss - contrary to flattened loss landscapes indicating better classification robustness. To investigate this contradiction, we propose a training method designed to adjust the loss landscape with respect to explanation loss. Through the new training method, we uncover that although such adjustments can impact the robustness of explanations, they do not influence the robustness of classification. These findings not only challenge the prevailing assumption of a strong correlation between the two forms of robustness but also pave new pathways for understanding the relationship between loss landscape and explanation loss.

2506.06891 2026-06-09 cs.LG cs.CR 版本更新

Robust In-Context Reinforcement Learning Under Reward Poisoning Attacks

奖励投毒攻击下的鲁棒上下文强化学习

Paulius Sasnauskas, Yiğit Yalın, Goran Radanović

发表机构 * Department of Computing Science, University of Alberta, Edmonton, Canada.(阿尔伯塔大学计算机科学系,加拿大埃德蒙顿) Alberta Machine Intelligence Institute (Amii), Edmonton, Canada.(阿尔伯塔机器智能研究所(Amii),加拿大埃德蒙顿)

AI总结 针对奖励投毒攻击,提出对抗训练框架AT-DPT,通过同时训练攻击者和DPT模型,显著提升上下文强化学习在赌博机环境下的鲁棒性,并泛化到MDP等复杂场景。

Comments ICML 2026, code available at https://github.com/PauliusSasnauskas/AT-DPT

详情
AI中文摘要

我们研究了上下文强化学习(ICRL)的抗污染鲁棒性,重点关注决策预训练变换器(DPT, Lee et al., 2023)。为了应对针对DPT的奖励投毒攻击挑战,我们提出了一种新颖的对抗训练框架,称为对抗训练DPT(AT-DPT)。我们的方法同时训练一群攻击者,通过毒化环境奖励来最小化DPT的真实奖励,以及一个DPT模型从毒化数据中推断最优动作。我们评估了该方法相对于标准赌博机算法(包括旨在处理奖励污染的鲁棒基线)的有效性。结果表明,在面对学习型攻击者的赌博机设置中,AT-DPT显著优于这些算法,并泛化到更复杂的环境,如自适应攻击者和MDP。作为一种元强化学习方法,它在ICRL中学习有效的抗污染鲁棒算法方面展现出前景。

英文摘要

We study the corruption-robustness of in-context reinforcement learning (ICRL), focusing on the Decision-Pretrained Transformer (DPT, Lee et al., 2023). To address the challenge of reward poisoning attacks targeting the DPT, we propose a novel adversarial training framework, called Adversarially Trained DPT (AT-DPT). Our method simultaneously trains a population of attackers to minimize the true reward of the DPT by poisoning environment rewards, and a DPT model to infer optimal actions from the poisoned data. We evaluate the effectiveness of our approach against standard bandit algorithms, including robust baselines designed to handle reward contamination. Our results show that AT-DPT significantly outperforms them in bandit settings under a learned attacker, and generalizes to more complex environments such as adaptive attackers and MDPs. It shows promise in ICRL as a meta-RL approach to learning effective corruption-robust algorithms.

2512.08499 2026-06-09 cs.LG cs.AI 版本更新

Developing Distance-Aware Physics-Constrained Probabilistic Frameworks for Industrial Prognostics

面向工业预测的具有距离感知的物理约束概率框架开发

Waleed Razzaq, Yun-Bo Zhao

发表机构 * University of Science and Technology China(中国科学技术大学)

AI总结 提出两种无需采样的距离感知物理约束概率框架PC-SNGP和PC-SNER,通过谱归一化和动态加权策略平衡数据保真度与物理一致性,在轴承预测中提升精度和不确定性校准。

详情
AI中文摘要

可靠且物理可解释的工业预测概率框架的发展仍处于初期阶段,现有文献在输入远离训练流形时往往不敏感。本文开发了两种无需采样的、具有距离感知的物理约束概率框架:(i) PC-SNGP 和 (ii) PC-SNER。两者均对隐藏层权重应用谱归一化,强制从输入到潜在空间的bi-Lipschitz距离保持表示。PC-SNGP将密集输出替换为高斯过程,其后验方差随输入与训练流形的距离增加而增大。PC-SNER修改输出层以预测Normal-Inverse-Gamma (NIG)参数,用于距离保持估计。为在训练过程中保持数据保真度与物理一致性之间的平衡,我们引入了物理约束损失的动态加权策略。我们还引入了一个距离感知系数 (DAC) 指标来量化对分布偏移的敏感性。实验上,我们使用PRONOSTIA、XJTU-SY和HUST基准数据集在滚动轴承 (REBs) 预测上验证了两种框架。实验结果表明,与竞争基线相比,预测精度提高,不确定性估计校准良好,同时在交叉验证中保持可审计性能,并在极端对抗扰动下具有鲁棒性。

英文摘要

Development of reliable and physically interpretable probabilistic frameworks for industrial prognostics remains nascent, and existing literature is often insensitive as inputs move away from the training manifold. In this paper, we develop two sampling-free, distance-aware physics-constrained probabilistic frameworks: (i) PC-SNGP and (ii) PC-SNER. Both apply spectral normalization to hidden layer weights, enforcing a bi-Lipschitz distance-preserving representation from the input to the latent space. PC-SNGP replaces the dense output with a Gaussian process whose posterior variance increases with input distance from the training manifold. PC-SNER modifies the output layer to predict Normal-Inverse-Gamma (NIG) parameters for distance-preserving estimation. To maintain balance between data fidelity and physical consistency during training, we introduce a dynamic weighting strategy for the physics-constrained loss. We also introduce a distance-aware-coefficient (DAC) metric to quantify sensitivity to distributional shifts. Empirically, we validate both frameworks on rolling-element-bearings (REBs) prognostics using the PRONOSTIA, XJTU-SY, and HUST benchmark datasets. Experimental results demonstrate improved prediction accuracy and well-calibrated uncertainty estimates relative to competing baselines, while maintaining auditable performance in cross-validation and robustness under extreme adversarial perturbations.

2512.08724 2026-06-09 cs.LG 版本更新

Exposing Hidden Biases in Text-to-Image Models via Automated Prompt Search

通过自动化提示搜索暴露文本到图像模型中的隐藏偏见

Manos Plitsis, Giorgos Bouritsas, Vassilis Katsouros, Yannis Panagakis

发表机构 * University of Edinburgh(爱丁堡大学)

AI总结 本文提出Bias-Guided Prompt Search框架,通过自动生成提示最大化图像偏见,揭示文本到图像模型中的隐藏偏见,提升公平性评估。

Comments ICML 2026. Code is here: https://github.com/manosplitsis/BGPS

详情
AI中文摘要

文本到图像(TTI)扩散模型已实现出色的视觉质量,但被反复显示在敏感属性如性别、种族和年龄上存在社会偏见。为缓解这些偏见,现有方法常依赖人工构建或由大型语言模型生成的提示数据集。除了编纂成本外,这还可能忽视那些触发偏见生成的未预见、不明显的提示,即使模型已进行去偏处理。本文引入Bias-Guided Prompt Search(BGPS),一个自动产生旨在最大化结果图像偏见的提示框架。BGPS包含两个组件:(1)一个指导生成中性属性提示的LLM,(2)对TTI内部表示起作用的属性分类器,引导LLM的解码过程向提示空间中放大目标图像属性的区域。我们在Stable Diffusion 1.5和最先进的去偏模型上进行了广泛实验,发现了一系列微妙且此前未记录的偏见,严重损害公平性指标。关键的是,发现的提示是可解释的,即可以由普通用户输入,定量提高困惑度度量相比于一个突出的硬提示优化对手。我们的发现揭示了TTI的脆弱性,同时BGPS扩展了偏见搜索空间,可以作为新的偏见缓解评估工具。

英文摘要

Text-to-image (TTI) diffusion models have achieved remarkable visual quality, yet they have been repeatedly shown to exhibit social biases across sensitive attributes such as gender, race and age. To mitigate these biases, existing approaches frequently depend on curated prompt datasets - either manually constructed or generated with large language models (LLMs) - as part of their training and/or evaluation procedures. Besides the curation cost, this also risks overlooking unanticipated, less obvious prompts that trigger biased generation, even in models that have undergone debiasing. In this work, we introduce Bias-Guided Prompt Search (BGPS), a framework that automatically generates prompts that aim to maximize the presence of biases in the resulting images. BGPS comprises two components: (1) an LLM instructed to produce attribute-neutral prompts and (2) attribute classifiers acting on the TTI's internal representations that steer the decoding process of the LLM toward regions of the prompt space that amplify the image attributes of interest. We conduct extensive experiments on Stable Diffusion 1.5 and a state-of-the-art debiased model and discover an array of subtle and previously undocumented biases that severely deteriorate fairness metrics. Crucially, the discovered prompts are interpretable, i.e., they may be entered by a typical user, quantitatively improving the perplexity metric compared to a prominent hard prompt optimization counterpart. Our findings uncover TTI vulnerabilities, while BGPS expands the bias search space and can act as a new evaluation tool for bias mitigation.

2601.22736 2026-06-09 cs.LG cs.AI 版本更新

UA-DCM: Uncertainty-aware Causal Decision Making via Effect Bound Decomposition

UA-DCM: 基于效应界分解的不确定性感知因果决策

Md Musfiqur Rahman, Ziwei Jiang, Hilaf Hasson, Murat Kocaoglu

发表机构 * Electrical and Computer Engineering, Purdue University(普渡大学电气与计算机工程系) Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Cohesity

AI总结 提出一种新框架,通过分解因果效应值的可消除与不可消除部分,区分收集更多样本能否帮助识别最优行动,并利用神经因果模型近似实现该分解。

详情
AI中文摘要

从观测数据中进行因果推断可以为决策场景中找到最佳行动提供有力证据,而无需进行昂贵的随机试验。由于未观测到的混杂因素,即使有无限数据,行动的因果效应也往往不是点可识别的。此外,仅有有限样本为因果效应估计增加了另一层不确定性。现有几种方法可用于获得因果效应的上下界,从符号方法到最近的基于神经网络的方法,这些方法隐式地结合了两种不确定性来源。然而,这些方法并未告知收集更多样本是否有助于从观测数据中识别最佳行动,使专家对其数据收集策略一无所知。我们通过一种新颖的框架解决了这个问题,该框架能够区分可能通过收集更多样本消除的因果效应值范围与那些高概率无法通过更多观测样本消除的值范围。我们证明这种划分可以通过求解最大-最小和最小-最大优化问题获得。我们利用神经因果模型在实践中近似恢复这种分解。通过在合成和真实世界数据集上的实验,我们证明了我们的算法可以确定何时收集更多样本无助于确定最佳行动。我们的框架可以帮助从业者决定何时应诉诸非观测研究或寻求测量一些未测量的混杂因素以进行最优决策。

英文摘要

Causal inference from observational data can provide strong evidence for finding the best action in a decision-making scenario without having to perform expensive randomized trials. The causal effect of an action is often not pointwise identifiable even with infinite data due to unobserved confounding factors. Furthermore, having only finitely many samples adds another layer of uncertainty to causal effect estimation. Several existing methods can be used to obtain upper and lower bounds to the causal effect, ranging from symbolic methods to the more recent neural network-based approaches, which implicitly incorporate both sources of uncertainty. However, these methods do not inform whether collecting more samples may or may not help identify the best action from observational data, leaving experts in the dark about their data collection strategies. We address this problem with a novel framework that can distinguish the range of causal effect values that might be eliminated by collecting more samples from the range of values that, with high probability, cannot be eliminated with more observational samples. We show that this partitioning can be obtained by solving max-min and min-max optimization problems. We leverage neural causal models to approximately recover this decomposition in practice. We demonstrate via experiments on synthetic and real-world datasets that our algorithm can determine when collecting more samples will not help determine the best action. Our framework can help practitioners decide when to resort to non-observational studies or seek to measure some of the unmeasured confounders for optimal decision-making.

2602.16015 2026-06-09 cs.LG 版本更新

Geometry-Aware Uncertainty Quantification via Conformal Prediction on Manifolds

几何感知的不确定性量化:流形上的保形预测

Marzieh Amiri Shahbazi, Ali Baheri

发表机构 * Rochester Institute of Technology(罗切斯特理工大学)

AI总结 提出自适应测地线保形预测框架,通过测地距离和交叉验证局部难度归一化,在球面和IGRF-14地磁场预测中实现有效覆盖并改善条件覆盖。

AI中文摘要

保形预测为回归提供了有限样本覆盖保证,但大多数标准构造是针对欧几里得输出空间设计的。当响应位于黎曼流形上时,欧几里得残差和基于坐标的区域会忽略定义有意义误差的几何结构。我们提出自适应测地线保形预测,一个简单的框架,它从测地距离构建非一致性分数,并通过交叉验证的局部预测难度估计对其进行归一化。在球面上,这产生测地帽,其面积与位置无关,而它们的半径仍然适应异方差噪声。在合成球面实验和IGRF-14地磁场预测任务中,自适应方法保持了有效的边际覆盖,减少了条件覆盖的变化,并相对于非自适应和基于坐标的基线改善了最坏情况覆盖。

英文摘要

Conformal prediction gives finite-sample coverage guarantees for regression, but most standard constructions are designed for Euclidean output spaces. When the response lies on a Riemannian manifold, Euclidean residuals and coordinate-based regions can ignore the geometry that defines meaningful error. We propose adaptive geodesic conformal prediction, a simple framework that builds nonconformity scores from geodesic distances and normalizes them with a cross-validated estimate of local prediction difficulty. On the sphere, this produces geodesic caps whose area is independent of position, while their radii still adapt to heteroscedastic noise. In both a synthetic sphere experiment and an IGRF-14 geomagnetic field forecasting task, the adaptive method preserves valid marginal coverage, reduces variation in conditional coverage, and improves worst-case coverage relative to non-adaptive and coordinate-based baselines.
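
The geodesic scoring idea can be illustrated with a minimal split-conformal sketch on the unit sphere. This is not the authors' implementation: the function names and the `cal_sigma` / `test_sigma` local-difficulty inputs are hypothetical stand-ins for the cross-validated difficulty estimate the abstract mentions, and only the standard split-conformal quantile rule is assumed.

```python
import math

def geodesic_distance(p, q):
    """Great-circle distance (radians) between two unit vectors."""
    dot = sum(a * b for a, b in zip(p, q))
    return math.acos(max(-1.0, min(1.0, dot)))  # clamp against rounding

def conformal_cap_radius(cal_true, cal_pred, cal_sigma, test_sigma, alpha=0.1):
    """Radius of a geodesic cap around a test prediction (split conformal).

    cal_sigma and test_sigma are hypothetical positive difficulty estimates.
    """
    # Normalized nonconformity score: geodesic error / local difficulty.
    scores = sorted(
        geodesic_distance(y, y_hat) / s
        for y, y_hat, s in zip(cal_true, cal_pred, cal_sigma)
    )
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))  # finite-sample corrected rank
    if k > n:
        return math.pi  # too few calibration points: cap is the whole sphere
    return min(math.pi, scores[k - 1] * test_sigma)
```

The returned radius scales with `test_sigma`, so caps widen where the prediction is estimated to be harder, while the calibration quantile itself is shared across the sphere.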

2604.12277 2026-06-09 cs.LG 版本更新

Models Know Their Shortcuts: Deployment-Time Shortcut Mitigation

模型知晓其捷径:部署时的捷径缓解

Jiayi Li, Shijie Tang, Gün Kaynar, Shiyi Du, Carl Kingsford

发表机构 * Ray and Stephanie Lane Computational Biology Department, School of Computer Science, Carnegie Mellon University(雷和斯蒂芬妮·莱恩计算生物学系,计算机科学学院,卡内基梅隆大学)

AI总结 研究提出在部署时通过无监督梯度归因缓解预训练文本编码器的捷径学习,证明部署时的缓解在信息理论上受训练时缓解的限制,并在情感分类、毒性检测和自然语言推理中取得显著性能提升。

详情
AI中文摘要

预训练文本编码器容易产生捷径学习,依赖于token-标签相关性,一旦在部署时分布偏移就会失效。现有捷径缓解方法主要在训练时操作,假设能获取训练数据、训练动态或捷径注释,而这些在部署时几乎无法获得,此时只剩下已收敛的模型。我们证明仅凭该模型本身就足以在部署时缓解捷径:一个有偏模型内化了其所学捷径的信号,可通过无监督的基于梯度的归因捕捉。我们进一步证明部署时缓解的效果在信息论意义上以训练时缓解为上界。尽管如此,利用这一梯度信号,我们提出的面向预训练文本编码器的无监督部署时捷径缓解框架Shortcut Guardrail在捷径分布偏移下恢复了可观的性能,在情感分类、毒性检测和自然语言推理上达到或超越训练时基线。

英文摘要

Pretrained text encoders are prone to shortcut learning, relying on token-label correlations that fail once the distribution shifts in deployment. Existing shortcut mitigation methods mainly operate at training time and assume access to training data, training dynamics, or shortcut annotations, which are hardly available during deployment, where only the converged model remains. We show that this model alone suffices to mitigate shortcuts during deployment: a biased model internalizes a signal of its learned shortcuts that can be captured via unsupervised gradient-based attribution. We further prove that deployment-time mitigation is information-theoretically upper-bounded by training-time mitigation. Nevertheless, exploiting this gradient signal, our proposed unsupervised deployment-time shortcut mitigation framework for pretrained text encoders, Shortcut Guardrail, recovers substantial performance under shortcut distribution shift, matching or outperforming training-time baselines across sentiment classification, toxicity detection, and natural language inference.

2605.03058 2026-06-09 cs.LG cs.AI 版本更新

Neuron-Anchored Rule Extraction for Large Language Models via Contrastive Hierarchical Ablation

基于对比分层消融的大语言模型神经元锚定规则提取

Francesco Sovrano, Gabriele Dominici, Marc Langheinrich

发表机构 * Università della Svizzera italiana(瑞士意大利语区大学)

AI总结 提出MechaRule方法,通过定位稀疏激动剂激活将规则提取锚定在LLM电路中,利用自适应组测试和置信引导剪枝,以极低代价高召回率识别关键神经元,并在算术和越狱任务中验证其有效性。

Comments Accepted for publication at KDD'2026

AI中文摘要

可解释AI的一个核心目标是符号化地表达大语言模型(LLM)的决策逻辑,并将其锚定在内部机制中。现有的规则提取方法通常学习非锚定的符号代理,而机械可解释性将行为与神经元联系起来,但通常需要手工假设和昂贵的干预。我们提出MechaRule,一种通过定位稀疏激动剂激活(其消融会破坏规则相关行为)将规则提取锚定在LLM电路中的流程。MechaRule基于两个发现。首先,在固定的基线/翻转机制下,稀疏激动剂效应可能表现出“超越”:少数高效应的激活在较大组中仍可检测到,主导较弱效应,并翻转许多相同的示例。在这种机制下,使用置信引导的保守剪枝的自适应组测试,当k << N为激动剂时,需要对N个候选进行O(k log(N/k) + k)次干预。其次,在与接近忠实规则行为对齐的数据分割上,激动剂的定位更可靠;谱分割提供了无规则的备选方案,而不忠实的分割会降低定位效果。实验上,在算术和越狱任务中,MechaRule在匹配的暴力验证中召回97.0%的最高效应激动剂,平均仅消耗完全消融成本的2.14%。消融定位的激动剂消除了97.6–100.0%的合格正确算术答案和越狱,并可纠正算术错误或诱导越狱,分别高达72.8%和32.5%。

英文摘要

A central goal of explainable AI is to express large language model (LLM) decision logic symbolically and ground it in internal mechanisms. Existing rule-extraction methods usually learn ungrounded symbolic surrogates, while mechanistic interpretability links behavior to neurons but often requires hand-crafted hypotheses and costly interventions. We introduce MechaRule, a pipeline that grounds rule extraction in LLM circuits by localizing sparse agonist activations whose ablation disrupts rule-related behavior. MechaRule rests on two findings. First, in a fixed baseline/flip regime, sparse agonist effects can exhibit overtopping: a few high-effect activations remain detectable within larger groups, dominate weaker ones, and flip many of the same examples. In such regimes, adaptive group testing with confidence-guided conservative pruning requires O(k log(N/k) + k) interventions over N candidates when k << N are agonists. Second, agonists are localized more reliably on data splits aligned with close-to-faithful rule behavior; spectral splits provide a rule-free fallback, whereas unfaithful splits degrade localization. Empirically, on arithmetic and jailbreaking, MechaRule recalls 97.0% of highest-effect agonists in matched brute-force validations at only 2.14% of exhaustive-ablation cost on average. Ablating the localized agonists eliminates 97.6-100.0% of eligible correct arithmetic answers and jailbreaks, and can correct arithmetic errors or induce jailbreaks by up to 72.8% and 32.5%.
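
To see why adaptive group testing needs far fewer interventions than testing every candidate, the sketch below uses plain recursive halving. It is a generic textbook scheme, not MechaRule's confidence-guided pruning: `group_is_positive` is a hypothetical noiseless oracle that reports whether ablating a whole group flips the behavior, which idealizes the overtopping regime described in the abstract.

```python
def find_agonists(candidates, group_is_positive):
    """Locate positive items by recursive halving.

    Returns (sorted positives, number of group tests performed).
    """
    tests = 0
    found = []
    stack = [list(candidates)]
    while stack:
        group = stack.pop()
        if not group:
            continue
        tests += 1  # one intervention per tested group
        if not group_is_positive(group):
            continue  # prune: no agonist inside this group
        if len(group) == 1:
            found.append(group[0])
            continue
        mid = len(group) // 2
        stack.append(group[mid:])
        stack.append(group[:mid])
    return sorted(found), tests
```

With k positives among N candidates, each positive contributes at most two tests per halving level, so the count grows roughly like k log N rather than N.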

2605.03226 2026-06-09 cs.LG cs.AI cs.CR 版本更新

Self-Mined Hardness for Safety Fine-Tuning

自我挖掘的难度用于安全微调

Prakhar Gupta, Garv Shah, Donghua Zhang

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出通过模型自身生成结果评估提示难度,对最难的提示进行安全微调,在Llama-3模型上将攻击成功率降至1-3%,但增加了拒绝率,通过混合良性提示可平衡性能。

详情
AI中文摘要

语言模型的安全微调通常需要一个精心策划的对抗性数据集。我们采取不同的方法:通过目标模型自身生成结果被判定为有害的频率来评分每个候选提示的难度,然后在最难的提示上使用模型自身的非越狱生成结果进行微调。在Llama-3-8B-Instruct和Llama-3.2-3B-Instruct上,该方法将WildJailbreak攻击成功率从11.5%和20.1%降至1-3%,但将越狱形式良性提示的拒绝率从14-22%提升至74-94%。将相同的困难提示与对抗性框架的良性提示(看起来像越狱但意图良性的提示)以1:1的比例交错,可将8B模型的拒绝率降至30-51%,3B模型降至52-72%,但攻击成功率增加2-6个百分点。在混合模式下,使用合格池中最难的一半而非随机一半进行训练,可将两个模型的剩余ASR降低35-50%(约3个百分点)。

英文摘要

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
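
The scoring step can be sketched in a few lines. This is an illustration under stated assumptions rather than the authors' pipeline: the input format (a mapping from prompt to per-rollout harmfulness judgements) and the eligibility rule (a prompt needs at least one safe rollout to train on) are hypothetical.

```python
def hardest_half(judgements):
    """Rank prompts by the fraction of rollouts judged harmful.

    judgements maps prompt -> list of booleans (True = rollout judged harmful).
    Returns (hardest half of the eligible prompts, hardness per eligible prompt).
    """
    hardness = {}
    for prompt, flags in judgements.items():
        # Assumed eligibility rule: at least one non-harmful rollout exists,
        # so there is a safe response to pair with the prompt.
        if flags and not all(flags):
            hardness[prompt] = sum(flags) / len(flags)
    # Hardest first; ties broken by prompt text for determinism.
    ranked = sorted(hardness, key=lambda p: (-hardness[p], p))
    return ranked[: len(ranked) // 2], hardness
```

Replacing the final slice with a random sample of the same size would give the "random half" baseline that the abstract compares against.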

2605.08876 2026-06-09 cs.LG 版本更新

OTora: A Unified Red Teaming Framework for Reasoning-Level Denial-of-Service in LLM Agents

OTora:一种用于LLM代理推理层面拒绝服务攻击的统一红队框架

Xinyu Li, Ronghui Mu, Lin Li, Tianjin Huang, Gaojie Jin

发表机构 * Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系) Department of Computer Science, University of Oxford(牛津大学计算机科学系) Department of Mathematics and Computer Science, Eindhoven University of Technology(埃因霍温理工大学数学与计算机科学系)

AI总结 OTora是首个统一的两阶段红队框架,用于实现推理层面拒绝服务攻击,通过优化对抗触发器和生成代理感知的推理负载,提升推理token数量和延迟,同时保持任务准确性。

Comments Accepted to ICML 2026

AI中文摘要

大型语言模型(LLM)正越来越多地被部署为执行工具增强多步骤任务的自主代理,其中延迟是实际应用中的关键因素。然而,一个被忽视的威胁是推理层面拒绝服务(R-DoS):攻击者在保持任务正确性的同时,通过增加代理的推理深度或工具使用预算来降低可用性。我们提出OTora,这是首个用于实例化R-DoS攻击的统一两阶段红队框架。第一阶段利用插入感知评分和动态目标协同进化来优化对抗触发器,以诱导定向工具调用,同时支持黑盒和白盒设置。第二阶段通过ICL引导的遗传搜索生成代理感知的推理负载,在保持任务结果正确的同时放大过度思考。在基于LLaMA-70B和GPT-OSS-120B等多种骨干模型构建的WebShop、Email和OS代理上,OTora使推理token数量最多增加10倍,延迟减慢达数量级,同时保持接近基线的任务准确性。最后,我们讨论了检测和限制异常推理与延迟峰值的缓解策略。代码可在 https://github.com/llm2409/OTora 获取。

英文摘要

Large Language Models (LLMs) are increasingly deployed as autonomous agents that execute tool-augmented, multi-step tasks, where latency is a critical factor for real-world applications. Yet an overlooked threat is Reasoning-Level Denial-of-Service (R-DoS), in which an attacker preserves task correctness but degrades availability by inflating an agent's reasoning depth or tool-use budget. We introduce OTora, the first unified, two-stage red-teaming framework for instantiating R-DoS attacks. Stage I optimizes an adversarial trigger that induces targeted tool invocations using insertion-aware scoring and dynamic target co-evolution, supporting both black-box and white-box settings. Stage II generates agent-aware reasoning payloads via an ICL-guided genetic search that amplifies overthinking while maintaining correct task outcomes. Across WebShop, Email, and OS agents built on multiple backbone models such as LLaMA-70B and GPT-OSS-120B, OTora achieves up to 10 times increases in reasoning tokens and order-of-magnitude latency slowdowns, all while preserving near-baseline task accuracy. Finally, we discuss mitigation strategies for detecting and constraining abnormal reasoning and latency spikes. The code is available at https://github.com/llm2409/OTora.

2605.15416 2026-06-09 cs.LG cs.AI 版本更新

Margin-Adaptive Confidence Ranking for Reliable LLM Judgement

面向可靠LLM评判的边际自适应置信度排序

Gaojie Jin, Yong Tao, Lijia Yu, Tianjin Huang

发表机构 * Department of Computer Science, University of Exeter(埃克塞特大学计算机科学系) Institute of AI for Industries, Chinese Academy of Sciences(中国科学院工业人工智能研究所) Department of Mathematics and Computer Science, Eindhoven University of Technology(埃因霍温理工大学数学与计算机科学系)

AI总结 本文提出一种基于边际的置信度排序方法,通过学习专用置信度估计器提高LLM与人类判断的一致性;该方法利用模拟标注者多样性与边际排序公式,显式建模LLM区分人类一致与不一致案例的置信度,并推导出泛化保证。

Comments Accepted to ICML 2026

AI中文摘要

Jung等人(2025)提出了一种假设检验框架,以确保大型语言模型(LLMs)与人类判断之间的一致性,基于模型估计的置信度与人类不一致风险之间单调性的假设。然而,在实践中,这一假设可能被违反,且置信度估计器的泛化行为未被显式分析。我们通过学习专用置信度估计器而非依赖启发式置信信号来缓解这些问题。我们的方法利用模拟标注者多样性和基于边际的排序公式,显式建模LLM区分人类一致与不一致案例的置信度。我们进一步推导出该估计器的泛化保证,揭示出一个与边际相关的权衡,从而指导自适应估计器训练过程的设计。当集成到固定序列测试中时,所学的置信度估计器提高了排序准确性,并在经验上增强了置信度与不一致风险之间的单调关系,从而在多个数据集和评判模型上以更高的成功率满足目标一致性水平。

英文摘要

Jung et al. (2025) introduce a hypothesis testing framework for guaranteeing agreement between large language models (LLMs) and human judgments, relying on the assumption that the model's estimated confidence is monotonic with respect to human-disagreement risk. In practice, however, this assumption may be violated, and the generalization behavior of the confidence estimator is not explicitly analyzed. We mitigate these issues by learning a dedicated confidence estimator instead of relying on heuristic confidence signals. Our approach leverages simulated annotator diversity and a margin-based ranking formulation to explicitly model how confidently an LLM distinguishes between human-agreement and human-disagreement cases. We further derive generalization guarantees for this estimator, revealing a margin-dependent trade-off that informs the design of an adaptive estimator training procedure. When integrated into fixed-sequence testing, the learned confidence estimator yields improved ranking accuracy and empirically strengthens the monotonic relationship between confidence and disagreement risk, leading to higher success rates in satisfying target agreement levels across multiple datasets and judge models.
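
A margin-based ranking objective of the kind referred to here is commonly written as a pairwise hinge loss. The sketch below shows that standard form only; the abstract does not give the paper's exact loss, so the function name, the all-pairs construction, and the default margin are assumptions.

```python
def margin_ranking_loss(agree_scores, disagree_scores, margin=0.5):
    """Mean pairwise hinge loss over (agreement, disagreement) pairs.

    A pair incurs no loss once the confidence on the human-agreement case
    exceeds the confidence on the human-disagreement case by `margin`.
    """
    pairs = [(a, d) for a in agree_scores for d in disagree_scores]
    return sum(max(0.0, margin - (a - d)) for a, d in pairs) / len(pairs)
```

A larger margin forces a wider confidence gap between the two kinds of cases but is harder to satisfy on the training pairs. This may be the kind of margin-dependent trade-off the abstract refers to.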

2606.00827 2026-06-09 cs.LG cs.AI 版本更新

Beyond Independent Manipulation: Individual Fairness-aware Strategic Classification with Peer Imitation

超越独立操纵:具有同伴模仿的个体公平感知策略分类

Xinpeng Lv, Chunyuan Zheng, Yunxin Mao, Renzhe Xu, Jinxuan Yang, Yuanlong Chen, Wangrong Huang, Shaowu Yang, Wenjing Yang, Xinwang Liu, Peng Cui, Haotian Wang

发表机构 * College of Computer Science and Technology, National University of Defense Technology(国防科技大学计算机科学与技术学院) School of Mathematical Sciences, Peking University(北京大学数学学院) Institute for Theoretical Computer Science, Shanghai University of Finance and Economics(上海财经大学理论计算机科学研究所) Information Technology Development, Aetos Capital Group, Sydney(悉尼Aetos资本集团信息技术部) Faculty of Computing, Harbin Institute of Technology(哈尔滨工业大学计算机学院) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系)

AI总结 提出个体公平感知策略分类(IFSC)框架,通过建模基于个体公平的同伴驱动操纵(模仿邻近被接受同伴),并采用鲁棒学习过程处理同伴可观测性不确定性,以改善个体公平一致性并减轻模仿引起的扭曲。

Comments Accepted by SIGKDD2026

AI中文摘要

策略分类(SC)研究智能体操纵其特征以从预测模型获得有利决策的场景。现有的公平感知SC方法主要关注群体公平,并通常假设智能体独立响应。然而,当需要个体公平时,确保相似个体获得相似结果,智能体的操纵变得相互依赖:一个智能体偏好的操纵取决于邻域的结果。这导致了经典SC公式与公平感知决策设置之间的不匹配,其中独立模型不再准确刻画策略操纵。为解决此问题,我们引入了个体公平感知策略分类(IFSC),这是一个框架,对由个体公平引起的同伴驱动操纵进行建模,其中智能体模仿附近被积极决策的同伴以获得有利结果。IFSC将策略操纵刻画为对可见被接受同伴的基于相似性的模仿,并在由此产生的操纵后分布下学习分类器。为了考虑同伴可观测性的不确定性,IFSC采用鲁棒学习过程,在操纵模拟期间引入随机扰动。在合成和真实数据集上的实验表明,IFSC改善了个体公平一致性并减轻了模仿引起的扭曲。

英文摘要

Strategic classification (SC) investigates scenarios where agents manipulate their features to obtain favorable decisions from predictive models. Existing fairness-aware SC approaches primarily focus on group fairness and typically assume that agents respond independently. However, when individual fairness is required, ensuring similar individuals receive similar outcomes, agents' manipulation becomes interdependent: an agent's preferred manipulation depends on the neighborhoods' outcomes. This induces a mismatch between classical SC formulations and fairness-aware decision settings, where independent models no longer accurately characterize strategic manipulations. To address this issue, we introduce individual fairness-aware strategic classification (IFSC), a framework that models peer-driven manipulation arising from individual fairness, where agents imitate nearby positively decided peers to obtain favorable outcomes. IFSC characterizes strategic manipulation as similarity-based imitation toward visible accepted peers and learns classifiers under the resulting post-manipulation distributions. To account for uncertainty in peer observability, IFSC employs a robust learning process that introduces stochastic perturbations during manipulation simulation. Experiments on synthetic and real-world datasets demonstrate that IFSC improves individual-fairness consistency and mitigates imitation-induced distortions.

2501.15509 2026-06-09 cs.CR cs.AI cs.LG 版本更新

FIT-Print: Towards False-claim-resistant Model Ownership Verification via Targeted Fingerprint

FIT-Print:通过目标指纹实现抗虚假声明的模型所有权验证

Shuo Shao, Haozhe Zhu, Yiming Li, Hongwei Yao, Tianwei Zhang, Zhan Qin

发表机构 * State Key Laboratory of Blockchain and Data Security, Zhejiang University(区块链与数据安全国家重点实验室,浙江大学) Hangzhou High-Tech Zone (Binjiang) Institute of Blockchain and Data Security, Hangzhou(杭州高新技术区(滨江)区块链与数据安全研究院,杭州) College of Computing and Data Science, Nanyang Technological University(南洋理工大学计算机与数据科学学院) Department of Computer Science, City University of Hong Kong(香港城市大学计算机科学系)

AI总结 针对现有模型指纹易受虚假声明攻击的问题,提出目标指纹范式FIT-Print,通过优化将指纹转化为可验证目标签名,并设计两种黑盒方法,实现100%防御成功率和0%误报率。

Comments This paper has been accepted by IEEE Transactions on Information Forensics and Security

AI中文摘要

模型指纹已成为保护开源模型知识产权的重要机制,提供了一种无需修改受保护模型的非侵入式方法。然而,我们的分析表明,现有指纹技术从根本上容易受到虚假声明攻击,即对手可以欺诈性地声称对独立的第三方模型拥有所有权。我们证明,这种脆弱性源于当前方法的非目标性,它们基于任意样本输出而非与特定预定义参考的对齐来评估模型相似性。为缓解此漏洞,我们引入了FIT-Print,一种主动对抗虚假声明攻击的目标指纹范式。具体来说,FIT-Print利用优化将指纹转化为可验证的目标签名。在此基础之上,我们提出了两种黑盒指纹方法:逐位的FIT-ModelDiff和逐列表的FIT-LIME,它们分别利用输出距离和特征归因作为鲁棒的模型签名。在基准模型和数据集上的广泛评估表明,我们的框架完美地中和了虚假声明攻击(100%防御成功率),消除了对独立模型的误报(0.0%),同时针对各种模型复用技术保持了100%的所有权验证率。

英文摘要

Model fingerprinting has emerged as a crucial mechanism for safeguarding the intellectual property of open-source models, offering a non-intrusive approach that requires no modifications to the protected model. However, our analysis reveals that existing fingerprinting techniques are fundamentally vulnerable to false claim attacks, wherein adversaries can fraudulently assert ownership over independent third-party models. We demonstrate that this vulnerability stems from the untargeted nature of current methods, which evaluate model similarity based on arbitrary sample outputs rather than alignment with a specific, predefined reference. To mitigate this vulnerability, we introduce FIT-Print, a targeted fingerprinting paradigm that actively counters false claim attacks. Specifically, FIT-Print leverages optimization to transform the fingerprint into a verifiable, targeted signature. Building upon this foundation, we propose two black-box fingerprinting methods, the bit-wise FIT-ModelDiff and the list-wise FIT-LIME, which utilize output distances and feature attributions as robust model signatures, respectively. Extensive evaluations across benchmark models and datasets show that our framework perfectly neutralizes false claim attacks (100% defense success rate) and eliminates false alarms on independent models (0.0%), all while maintaining a 100% ownership verification rate against diverse model reuse techniques.

2505.11189 2026-06-09 cs.AI cs.LG 版本更新

Can Global XAI Methods Reveal Injected Behaviours in LLMs? SHAP vs Rule Extraction vs RuleSHAP

全局XAI方法能否揭示LLM中的注入行为?SHAP vs 规则提取 vs RuleSHAP

Francesco Sovrano

发表机构 * Collegium Helveticum at ETH Zurich(苏黎世联邦理工学院Collegium Helveticum) Università della Svizzera italiana(瑞士意大利语区大学)

AI总结 研究通过统计验证的抽象将全局LLM信念映射为数值分数,提出RuleSHAP算法,结合全局SHAP与规则归纳,以更好地捕捉非单变量触发因素,平均MRR@1比RuleFit提升82%。

Comments Accepted for publication at KDD'2026

AI中文摘要

大型语言模型(LLM)可能放大错误信息,破坏联合国可持续发展目标等社会目标。我们研究了三个有文献记载的错误信息驱动因素(效价框架、信息过载和过度简化),这些因素通常由默认信念塑造。基于LLM编码此类默认信念(例如,“快乐是积极的”、“数学是复杂的”)并可作为“启发式包”的证据,我们询问是否可以从黑盒LLM行为中恢复出错误信息相关行为背后的信念驱动启发式作为显式规则。一个关键障碍是可解释AI(XAI)中的全局规则提取方法是为数值输入输出数据设计的,而非文本。我们通过引出全局LLM信念并通过统计验证的抽象将其映射为数值分数来解决这一问题,从而使现成的全局XAI能够检测信念驱动的启发式。为了获得真实情况,我们通过系统指令向GPT系列和Llama模型注入复杂度递增的非线性行为触发因素(单变量、合取、非凸)。我们发现RuleFit经常遗漏非单变量触发因素,而全局SHAP在排名合取触发特征方面更好,但不产生符号规则。为了弥合这一差距,我们提出了RuleSHAP,一种将全局SHAP聚合与规则归纳相结合的规则提取算法,以更好地捕捉非单变量触发因素,平均MRR@1比RuleFit提升82%。我们的结果提示了一种揭示LLM中行为触发因素的实用途径。

英文摘要

Large language models (LLMs) can amplify misinformation, undermining societal goals such as the UN SDGs. We study three documented drivers of misinformation (valence framing, information overload, and oversimplification) often shaped by default beliefs. Building on evidence that LLMs encode such defaults (e.g., "joy is positive", "math is complex") and can act as "bags of heuristics", we ask whether belief-driven heuristics behind misinformation-related behaviour can be recovered from black-box LLM behaviour as explicit rules. A key obstacle is that global rule-extraction methods in explainable AI (XAI) are built for numerical input-output data, not text. We address this by eliciting global LLM beliefs and mapping them to numerical scores via statistically validated abstractions, enabling off-the-shelf global XAI to detect belief-driven heuristics. For ground truth, we inject nonlinear behavioural triggers of increasing complexity (univariate, conjunctive, non-convex) into GPT-family and Llama models via system instructions. We find that RuleFit often misses non-univariate triggers, while global SHAP better ranks conjunctive trigger features but yields no symbolic rules. To bridge this gap, we propose RuleSHAP, a rule-extraction algorithm that couples global SHAP aggregates with rule induction to better capture non-univariate triggers, improving MRR@1 over RuleFit by +82% on average. Our results suggest a practical pathway for surfacing behavioural triggers in LLMs.

2510.16028 2026-06-09 cs.CR cs.AI cs.LG cs.SY eess.SY 版本更新

TAO: Tolerance-Aware Optimistic Verification for Floating-Point Neural Networks

TAO:面向浮点神经网络的容忍感知乐观验证

Jianzhu Yao, Hongxu Su, Taobo Liao, Zerui Cheng, Huan Zhang, Xuechao Wang, Pramod Viswanath

发表机构 * Princeton University(普林斯顿大学) HKUST (GZ)(香港科技大学(广州)) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 提出TAO协议,通过算子级容忍区域和Merkle锚定的争议游戏,在不依赖可信硬件或确定性内核的情况下验证浮点神经网络输出,开销仅0.3%。

Comments 18 pages, 8 figures

Journal ref
Proceedings of the 21st European Conference on Computer Systems, (2026) 1515-1532
AI中文摘要

神经网络越来越多地在用户无法控制的硬件上运行(云GPU、推理市场)。然而,机器学习即服务很少透露实际运行的内容或返回的输出是否忠实反映预期输入。用户无法对服务降级(模型交换、量化、图重写或诸如修改广告嵌入等差异)进行追索。验证输出很困难,因为异构加速器上的浮点执行本质上是不确定的。现有方法要么对实际浮点神经网络不实用,要么重新引入供应商信任。我们提出TAO:一种容忍感知乐观验证协议,它接受在原则性算子级接受区域内的输出,而不是要求逐位相等。TAO结合了两种误差模型:(i)每个算子的IEEE-754最坏情况界限和(ii)跨硬件校准的紧密经验百分位分布。差异触发一个Merkle锚定的、阈值引导的争议游戏,该游戏递归地划分计算图,直到剩下一个算子,此时裁决简化为轻量级理论界限检查或针对经验阈值的小型诚实多数投票。未受挑战的结果在挑战窗口后最终确定,无需可信硬件或确定性内核。我们将TAO实现为PyTorch兼容运行时和当前部署在以太坊Holesky测试网上的合约层。运行时检测图、计算每个算子的界限,并在FP32中运行未经修改的供应商内核,开销可忽略(Qwen3-8B上为0.3%)。在A100、H100、RTX6000、RTX4090上的CNN、Transformer和扩散模型中,经验阈值比理论界限紧10^2-10^3倍,且考虑界限的对抗攻击成功率为0%。总之,TAO为现实世界的异构ML计算协调了可扩展性和可验证性。

英文摘要

Neural networks increasingly run on hardware outside the user's control (cloud GPUs, inference marketplaces). Yet ML-as-a-Service reveals little about what actually ran or whether returned outputs faithfully reflect the intended inputs. Users lack recourse against service downgrades (model swaps, quantization, graph rewrites, or discrepancies like altered ad embeddings). Verifying outputs is hard because floating-point (FP) execution on heterogeneous accelerators is inherently nondeterministic. Existing approaches are either impractical for real FP neural networks or reintroduce vendor trust. We present TAO: a Tolerance-Aware Optimistic verification protocol that accepts outputs within principled operator-level acceptance regions rather than requiring bitwise equality. TAO combines two error models: (i) sound per-operator IEEE-754 worst-case bounds and (ii) tight empirical percentile profiles calibrated across hardware. Discrepancies trigger a Merkle-anchored, threshold-guided dispute game that recursively partitions the computation graph until one operator remains, where adjudication reduces to a lightweight theoretical-bound check or a small honest-majority vote against empirical thresholds. Unchallenged results finalize after a challenge window, without requiring trusted hardware or deterministic kernels. We implement TAO as a PyTorch-compatible runtime and a contract layer currently deployed on Ethereum Holesky testnet. The runtime instruments graphs, computes per-operator bounds, and runs unmodified vendor kernels in FP32 with negligible overhead (0.3% on Qwen3-8B). Across CNNs, Transformers and diffusion models on A100, H100, RTX6000, RTX4090, empirical thresholds are $10^2$-$10^3$ times tighter than theoretical bounds, and bound-aware adversarial attacks achieve 0% success. Together, TAO reconciles scalability with verifiability for real-world heterogeneous ML compute.
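
The bisection at the heart of a dispute game can be sketched on a simplified setting. The code below assumes a linear chain of operators, a single global tolerance `tol`, and full visibility of both parties' intermediate outputs. TAO itself, per the abstract, works on computation graphs with per-operator acceptance regions and Merkle commitments, none of which is modeled here.

```python
def within_tolerance(a, b, tol):
    """True if two equal-length outputs agree elementwise within tol."""
    return all(abs(x - y) <= tol for x, y in zip(a, b))

def locate_disputed_operator(proposer_trace, challenger_trace, tol):
    """Bisect two per-operator output traces of a linear operator chain.

    trace[0] is the shared input (assumed to agree); trace[i] is the output
    of operator i. Returns an index i where the outputs agree at i - 1 but
    disagree at i, or None if the final outputs agree within tolerance.
    """
    lo, hi = 0, len(proposer_trace) - 1
    if within_tolerance(proposer_trace[hi], challenger_trace[hi], tol):
        return None  # nothing to dispute
    # Invariant: the parties agree at lo and disagree at hi.
    while hi - lo > 1:
        mid = (lo + hi) // 2
        if within_tolerance(proposer_trace[mid], challenger_trace[mid], tol):
            lo = mid
        else:
            hi = mid
    return hi
```

Only the single operator at the returned index then needs adjudication, which is what keeps the dispute cost logarithmic in the number of operators.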

2601.12263 2026-06-09 cs.CL cs.AI cs.LG 版本更新

Multimodal Generative Engine Optimization: Rank Manipulation for Vision-Language Model Rankers

多模态生成式引擎优化:针对视觉-语言模型排序器的排名操纵

Yixuan Du, Chenxiao Yu, Haoyan Xu, Ziyi Wang, Yue Zhao, Xiyang Hu

发表机构 * Georgetown University(乔治城大学) University of Southern California(南加州大学) University of Maryland, College Park(马里兰大学学院公园分校) Arizona State University(亚利桑那州立大学)

AI总结 提出多模态生成式引擎优化(MGEO)方法,通过联合优化图像扰动和文本后缀,利用视觉-语言模型内部跨模态知识耦合,实现对产品排名的有效操纵,揭示了多模态基础模型知识基础的脆弱性。

Comments Proceedings of the 4th Workshop on Towards Knowledgeable Foundation Models (KnowFM) at ACL 2026

AI中文摘要

视觉-语言模型(VLM)将视觉和文本知识整合到统一表示中,日益成为现代检索和推荐系统的基础。然而,这些模型在对多模态项目进行排序时如何可靠地利用其跨模态知识,以及其知识基础是否可以被颠覆,仍不清楚。在本文中,我们揭示了VLM在多模态产品排序中应用知识的一个基本漏洞:通过多模态生成式引擎优化(MGEO),我们展示了攻击者可以通过联合制作难以察觉的图像扰动和流畅的文本后缀,利用模型内部的跨模态知识耦合,操纵VLM的排序决策。MGEO采用交替优化策略,针对VLM中视觉和语言表示之间的深层交互,实现了远超单模态攻击和由强大商业模型驱动的启发式基线的排名操纵。我们的发现表明,表面内容质量不足以提升排名;相反,需要直接与模型内部知识利用机制对齐。这些结果对多模态基础模型中知识基础的忠实性和鲁棒性提出了重要问题,并激励了未来多模态检索系统防御机制的研究。代码见:https://github.com/glad-lab/MGEO

英文摘要

Vision-Language Models (VLMs) integrate visual and textual knowledge into unified representations that increasingly underpin modern retrieval and recommendation systems. However, it remains unclear how reliably these models utilize their cross-modal knowledge when ranking multimodal items, and whether their knowledge grounding can be subverted. In this paper, we expose a fundamental vulnerability in how VLMs apply multimodal knowledge for product ranking: through Multimodal Generative Engine Optimization (MGEO), we show that an adversary can manipulate a VLM's ranking decisions by jointly crafting imperceptible image perturbations and fluent textual suffixes that exploit the model's internal cross-modal knowledge coupling. Using an alternating optimization strategy, MGEO targets the deep interactions between visual and linguistic representations within the VLM, achieving rank manipulations that substantially exceed those of unimodal attacks and heuristic baselines powered by strong commercial models. Our findings reveal that surface-level content quality is insufficient for rank promotion; instead, direct alignment with the model's internal knowledge utilization mechanism is required. These results raise important questions on the faithfulness and robustness of knowledge grounding in multimodal foundation models, and motivate future work on defense mechanisms for multimodal retrieval systems. Code is available at: https://github.com/glad-lab/MGEO

2602.16061 2026-06-09 stat.ML cs.LG econ.EM stat.ME 版本更新

Partial Identification under Missing Data Using Weak Shadow Variables from Pretrained Models

利用预训练模型中的弱影子变量在缺失数据下的部分识别

Hongyu Chen, David Simchi-Levi, Ruoxuan Xiong

发表机构 * Massachusetts Institute of Technology, Cambridge, MA 02139(麻省理工学院) Emory University, Atlanta, GA 30322(埃默里大学)

AI总结 针对缺失非随机(MNAR)导致的估计偏差,提出部分识别框架,通过线性规划结合预训练模型(如LLM)的预测作为弱影子变量收紧边界,并设计集合扩张估计器保证覆盖,实验显示识别区间缩小75-83%。

AI中文摘要

从用户反馈中估计总体量(如平均结果)是平台评估和社会科学的基础,但反馈通常非随机缺失(MNAR):意见更强的用户更可能回应,因此标准估计量有偏,且在没有额外假设的情况下目标量不可识别。现有方法通常依赖强参数假设或实践中可能不可用的定制辅助变量。在本文中,我们开发了一个部分识别框架,其中通过求解一对线性规划获得目标量的尖锐边界,其约束编码了观测数据结构。该公式自然地将来自预训练模型(包括大型语言模型LLM)的结果预测作为额外的线性约束纳入,从而收紧可行集。我们将这些预测称为弱影子变量:它们满足关于缺失性的条件独立性假设,但不需要经典影子变量方法所需的完备性条件。当预测足够信息时,边界坍缩为点,将标准识别作为特例恢复。在有限样本中,为了提供对识别集的有效覆盖,我们提出了一种集合扩张估计器,在集合识别状态下达到慢于$\sqrt{n}$的收敛速度,在点识别下达到标准$\sqrt{n}$速度。在模拟和半合成实验(基于客服对话)中,我们发现LLM预测通常对经典影子变量方法条件不良,但在我们的框架中仍然非常有效。在现实的MNAR机制下,它们将识别区间缩小75-83%,同时保持有效覆盖。

英文摘要

Estimating population quantities such as mean outcomes from user feedback is fundamental to platform evaluation and social science, yet feedback is often missing not at random (MNAR): users with stronger opinions are more likely to respond, so standard estimators are biased and the estimand is not identified without additional assumptions. Existing approaches typically rely on strong parametric assumptions or bespoke auxiliary variables that may be unavailable in practice. In this paper, we develop a partial identification framework in which sharp bounds on the estimand are obtained by solving a pair of linear programs whose constraints encode the observed data structure. This formulation naturally incorporates outcome predictions from pretrained models, including large language models (LLMs), as additional linear constraints that tighten the feasible set. We call these predictions weak shadow variables: they satisfy a conditional independence assumption with respect to missingness but need not meet the completeness conditions required by classical shadow-variable methods. When predictions are sufficiently informative, the bounds collapse to a point, recovering standard identification as a special case. In finite samples, to provide valid coverage of the identified set, we propose a set-expansion estimator that achieves slower-than-$\sqrt{n}$ convergence rate in the set-identified regime and the standard $\sqrt{n}$ rate under point identification. In simulations and semi-synthetic experiments on customer-service dialogues, we find that LLM predictions are often ill-conditioned for classical shadow-variable methods yet remain highly effective in our framework. They shrink identification intervals by 75-83% while maintaining valid coverage under realistic MNAR mechanisms.
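
The starting point for any such bound is the worst-case interval obtained when nothing is assumed about the missing responses. The sketch below computes that assumption-free interval for a bounded outcome; it is the baseline that extra constraints would tighten, not the paper's linear program, and the function name and the `[y_min, y_max]` range are illustrative assumptions.

```python
def mean_bounds(observed, n_missing, y_min=0.0, y_max=1.0):
    """Worst-case bounds on the population mean under arbitrary missingness.

    observed: outcomes of respondents; n_missing: number of non-respondents.
    Every missing outcome is set to y_min (lower) or y_max (upper).
    """
    n = len(observed) + n_missing
    total_observed = sum(observed)
    lower = (total_observed + n_missing * y_min) / n
    upper = (total_observed + n_missing * y_max) / n
    return lower, upper
```

The interval width equals the missing fraction times `y_max - y_min`, so with half the responses missing the bounds span half the outcome range. Shrinking that interval is the job of additional constraints such as the weak shadow variables described above.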

2602.16346 2026-06-09 cs.CL cs.LG 版本更新

Helpful to a Fault: Measuring Illicit Assistance in Multi-Turn, Multilingual LLM Agents

乐于助人到过分:测量多轮、多语言LLM代理中的非法协助

Nivya Talokar, Ayush K Tarun, Murari Mandal, Maksym Andriushchenko, Antoine Bosselut

发表机构 * EPFL(洛桑联邦理工学院) independent(独立研究员) tubingen(图宾根大学)

AI总结 本文提出STING框架,用于评估多轮、多语言LLM代理对非法任务的协助程度,发现低资源语言中的攻击成功率并未一致升高,并为实际部署场景提供压力测试方法。

Comments Accepted in ICML 2026

AI中文摘要

基于LLM的代理通过工具和记忆执行现实世界工作流。这些能力也使恶意对手能够利用这些代理实施复杂的滥用场景。现有的代理滥用基准主要测试单提示指令,未能衡量代理如何在多轮交互中逐步协助完成有害或非法任务。我们引入STING(Sequential Testing of Illicit N-step Goal execution,非法N步目标执行的序列测试),这是一种自动红队框架,它构建以良性角色为依托的逐步非法计划,并通过自适应追问迭代探测目标代理,同时使用评判代理跟踪阶段完成情况。我们进一步引入一个分析框架,将多轮红队测试建模为首次越狱时间随机变量,从而支持发现曲线、按攻击语言划分的风险比归因等分析工具,以及一个新指标:受限平均越狱发现(Restricted Mean Jailbreak Discovery)。在AgentHarm场景中,STING的非法任务完成率显著高于单轮提示以及适配到工具使用代理的面向聊天的多轮基线。在六种非英语设置的多语言评估中,我们发现攻击成功率和非法任务完成率在低资源语言中并未一致升高,这与常见的聊天机器人研究结论不同。总体而言,STING为在现实部署环境中评估和压力测试代理滥用提供了一种实用方法,这类环境中的交互本质上是多轮的,且常常是多语言的。

英文摘要

LLM-based agents execute real-world workflows via tools and memory. These affordances enable ill-intended adversaries to also use these agents to carry out complex misuse scenarios. Existing agent misuse benchmarks largely test single-prompt instructions, leaving a gap in measuring how agents end up helping with harmful or illegal tasks over multiple turns. We introduce STING (Sequential Testing of Illicit N-step Goal execution), an automated red-teaming framework that constructs a step-by-step illicit plan grounded in a benign persona and iteratively probes a target agent with adaptive follow-ups, using judge agents to track phase completion. We further introduce an analysis framework that models multi-turn red-teaming as a time-to-first-jailbreak random variable, enabling analysis tools like discovery curves, hazard-ratio attribution by attack language, and a new metric: Restricted Mean Jailbreak Discovery. Across AgentHarm scenarios, STING yields substantially higher illicit-task completion than single-turn prompting and chat-oriented multi-turn baselines adapted to tool-using agents. In multilingual evaluations across six non-English settings, we find that attack success and illicit-task completion do not consistently increase in lower-resource languages, diverging from common chatbot findings. Overall, STING provides a practical way to evaluate and stress-test agent misuse in realistic deployment settings, where interactions are inherently multi-turn and often multilingual.
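
Treating the turn of the first jailbreak as a time-to-event variable makes standard survival-style summaries available. The sketch below computes an empirical discovery curve and a restricted mean time to first jailbreak, assuming every conversation runs to the same horizon. The abstract does not define Restricted Mean Jailbreak Discovery, so the second function is only one plausible reading, and all names are hypothetical.

```python
def discovery_curve(first_jailbreak_turn, horizon):
    """Fraction of scenarios jailbroken by each turn 1..horizon.

    first_jailbreak_turn: list of 1-based turn indices, or None when no
    jailbreak occurred within the horizon.
    """
    n = len(first_jailbreak_turn)
    return [
        sum(1 for t in first_jailbreak_turn if t is not None and t <= turn) / n
        for turn in range(1, horizon + 1)
    ]

def restricted_mean_turns(first_jailbreak_turn, horizon):
    """Mean of min(time to first jailbreak, horizon); None counts as horizon."""
    capped = [horizon if t is None else min(t, horizon) for t in first_jailbreak_turn]
    return sum(capped) / len(capped)
```

A lower restricted mean indicates that jailbreaks are found in fewer turns, and capping at the horizon keeps the summary finite even when some scenarios are never jailbroken.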

2603.07445 2026-06-09 cs.CL cs.LG 版本更新

Few Tokens, Big Leverage: Preserving Safety Alignment by Constraining Safety Tokens during Fine-tuning

少令牌,大杠杆:在微调期间通过约束安全令牌保持安全对齐

Guoli Wang, Haonan Shi, Tu Ouyang, An Wang

发表机构 * Case Western Reserve University(凯斯西储大学)

AI总结 提出PACT框架,通过约束安全相关令牌的置信度来防止微调导致的安全对齐漂移,同时保持下游任务性能。

Comments Accepted to KDD 2026

AI中文摘要

大型语言模型(LLMs)通常需要微调(FT)才能在下游任务上表现良好,但即使训练数据集仅包含良性数据,FT也可能导致安全对齐漂移。先前的研究表明,引入少量有害数据会显著损害LLM的拒绝行为,导致LLM顺从有害请求。现有的防御方法通常依赖于模型范围的干预,例如限制哪些参数更新或注入额外的安全数据,这可能会限制通用性并降低下游任务性能。为了解决这些限制,我们提出了一种名为PACT(通过约束令牌保持安全对齐)的微调框架,该框架稳定了模型在安全令牌上的置信度。我们的方法基于经验观察:安全对齐行为反映在模型的令牌级输出置信度中,并且通常集中在少量安全相关令牌上。在下游微调期间,我们正则化微调模型,使其在每一步响应中与对齐参考模型在安全相关令牌上的置信度匹配,同时允许非安全令牌基本不受约束以实现有效的任务适应。这种有针对性的约束防止了对齐漂移,而无需施加通常以牺牲模型效用为代价的全局限制。我们的代码可在 https://github.com/Glresearch1/PACT 获取。

英文摘要

Large language models (LLMs) often require fine-tuning (FT) to perform well on downstream tasks, but FT can induce safety-alignment drift even when the training dataset contains only benign data. Prior work shows that introducing a small fraction of harmful data can substantially compromise LLM refusal behavior, causing LLMs to comply with harmful requests. Existing defense methods often rely on model-wide interventions, such as restricting which parameters are updated or injecting additional safety data, which can limit generality and degrade downstream task performance. To address these limitations, we propose a fine-tuning framework called Preserving Safety Alignment via Constrained Tokens (PACT), which stabilizes the model's confidence on safety tokens. Our approach is motivated by the empirical observation that safety-aligned behavior is reflected in the model's token-level output confidence and is often concentrated on a small subset of safety-related tokens. During downstream fine-tuning, we regularize the fine-tuned model to match the aligned reference model's confidence on safety-related tokens at each response step, while leaving non-safety tokens largely unconstrained to allow effective task adaptation. This targeted constraint prevents alignment drift without imposing global restrictions that typically trade off with model utility. Our code is available at https://github.com/Glresearch1/PACT.
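
A token-level confidence constraint of this kind could take the form of a masked penalty added to the task loss. The sketch below uses a squared gap between the two models' probabilities on flagged positions. The abstract does not state the actual regularizer, so the penalty form, the function name, and the `is_safety_token` mask are all assumptions made for illustration.

```python
def safety_confidence_penalty(ft_probs, ref_probs, is_safety_token):
    """Mean squared confidence gap on safety-related token positions only.

    ft_probs / ref_probs: per-position probability that the fine-tuned and
    the aligned reference model assign to the target token.
    is_safety_token: per-position booleans (hypothetical mask).
    """
    gaps = [
        (p_ft - p_ref) ** 2
        for p_ft, p_ref, flagged in zip(ft_probs, ref_probs, is_safety_token)
        if flagged  # non-safety positions are left unconstrained
    ]
    return sum(gaps) / len(gaps) if gaps else 0.0
```

In training, such a term would typically be combined as `task_loss + weight * penalty`, so positions outside the mask adapt freely to the downstream task.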

2604.17249 2026-06-09 cs.CR cs.AR cs.LG 版本更新

Bit-Flip Vulnerability of Shared KV-Cache Blocks in LLM Serving Systems

LLM服务系统中共享KV缓存块的位翻转漏洞

Yuji Yamamoto, Satoshi Matsuura

发表机构 * Institute of Science Tokyo(东京科学大学)

AI总结 研究揭示LLM服务系统中共享KV缓存块的位翻转漏洞,指出其具有静默分歧、选择性传播和持久累积特性,提出基于校验和的防护措施以限制累积损害。

Comments 12 pages, 4 figures. Accepted at SECRYPT 2026 (23rd International Conference on Security and Cryptography). Conference: https://secrypt.scitevents.org/

详情
AI中文摘要

在GPU DRAM上进行Rowhammer攻击可以导致模型权重中的对抗性位翻转;LLM服务系统中的共享KV缓存块呈现出类似但此前未被研究的目标。在vLLM的前缀缓存中,这些块以单一物理副本存在且无完整性保护。通过在理想位目标条件下进行软件故障注入,我们表征了最坏情况的严重性,并识别出三个特性:(1)静默分歧:16个BF16位位置中有13个产生连贯但被篡改的输出,在没有干净基线的情况下无法与合法响应区分;(2)选择性传播:只有共享目标前缀的请求受影响;(3)持久累积:没有时间衰减,因此累积损害随后续请求线性增长。这些特性共同构成了不同于权重篡改的独特威胁:静默分歧和选择性传播使逃避检测成为可能;持久累积则不受制约地持续进行,损害放大仅受缓存块保持缓存时间的限制。基于校验和的防护措施在调度时检测任何单比特损坏,将累积损害限制在一个批次内,与块的缓存时间无关,且开销可忽略。这些结果表明,应在端到端利用被证实之前就对前缀块进行完整性保护。

英文摘要

Rowhammer on GPU DRAM has enabled adversarial bit flips in model weights; shared KV-cache blocks in LLM serving systems present an analogous but previously unexamined target. In vLLM's Prefix Caching, these blocks exist as a single physical copy without integrity protection. Using software fault injection under ideal bit targeting, we characterize worst-case severity and identify three properties: (1) Silent divergence - 13 of 16 BF16 bit positions produce coherent but altered outputs, indistinguishable from legitimate responses without a clean baseline. (2) Selective propagation - only requests sharing the targeted prefix are affected. (3) Persistent accumulation - no temporal decay occurs, so cumulative damage grows linearly with subsequent requests. Together, these constitute a threat profile distinct from weight corruption: silent divergence and selective propagation enable detection evasion; persistent accumulation then proceeds unchecked, yielding damage amplification bounded only by how long the block remains cached. A checksum-based countermeasure detects any single-bit corruption at scheduling time, bounding cumulative damage to one batch independent of the block's cache lifetime, with negligible overhead. These results argue for integrity protection of prefix blocks before end-to-end exploitation is demonstrated.
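
The checksum idea can be illustrated with a short sketch. It assumes CRC32 as the checksum and a toy byte string as the cached block; the abstract does not say which checksum the authors use or how it hooks into vLLM, so the helper names below are hypothetical.

```python
import zlib

def seal_block(block: bytes) -> int:
    """Compute a CRC32 when the block is first cached (CRC32 is an assumed choice)."""
    return zlib.crc32(block)

def verify_block(block: bytes, expected: int) -> bool:
    """Re-check the block at scheduling time, before any request reuses it."""
    return zlib.crc32(block) == expected

def flip_bit(block: bytes, bit_index: int) -> bytes:
    """Return a copy of the block with one bit flipped, emulating a fault."""
    corrupted = bytearray(block)
    corrupted[bit_index // 8] ^= 1 << (bit_index % 8)
    return bytes(corrupted)

# Toy 64-byte "KV block"; a real block would hold BF16 key/value tensors.
block = bytes(range(64))
tag = seal_block(block)
assert verify_block(block, tag)
# CRC32 detects every single-bit error, so each possible flip is caught.
assert not any(verify_block(flip_bit(block, i), tag) for i in range(len(block) * 8))
```

Because the check runs before a block is reused, a corrupted block can affect at most the batch already in flight, which is the bounding behavior the abstract describes.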

2604.25965 2026-06-09 stat.ML cs.LG 版本更新

Adversarial Robustness of NTK Neural Networks

NTK神经网络的对抗鲁棒性

Yuxuan Hou

发表机构 * Qiuzhen College, Tsinghua University(清华大学求真书院) Yau Mathematical Sciences Center, Tsinghua University(清华大学丘成桐数学科学中心)

AI总结 本文研究了NTK神经网络在非参数回归中的对抗鲁棒性,推导了Sobolev空间中的对抗回归最小最大最优速率,并证明了通过梯度流早停训练的NTK网络可达到该最优速率,但在过拟合情况下最小范数插值器易受对抗扰动影响。

详情
AI中文摘要

深度学习模型被广泛应用于安全关键领域,但仍然容易受到对抗攻击。本文研究了NTK神经网络在非参数回归中的对抗鲁棒性。我们建立了Sobolev空间中的对抗回归最小最大最优速率,并证明了通过梯度流早停训练的NTK神经网络可以达到该最优速率。然而,在过拟合情况下,我们证明了最小范数插值器对对抗扰动是脆弱的。

英文摘要

Deep learning models are widely deployed in safety-critical domains, but remain vulnerable to adversarial attacks. In this paper, we study the adversarial robustness of NTK neural networks in the context of nonparametric regression. We establish minimax optimal rates for adversarial regression in Sobolev spaces and then show that NTK neural networks, trained via gradient flow with early stopping, can achieve this optimal rate. However, in the overfitting regime, we prove that the minimum norm interpolant is vulnerable to adversarial perturbations.

2605.19228 2026-06-09 cs.CL cs.AI cs.IT cs.LG math.IT 版本更新

Diagnosing Multi-step Reasoning Failures in Black-box LLMs via Stepwise Confidence Attribution

通过分步置信度归因诊断黑盒大语言模型的多步推理失败

Xiaoou Liu, Tiejin Chen, Dengjia Zhang, Yaqing Wang, Lu Cheng, Hua Wei

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出了一种基于分步置信度归因(SCA)的方法,用于诊断黑盒大语言模型在多步推理中的失败,通过信息瓶颈原理对生成的推理轨迹进行置信度评估,并通过实验验证该方法在数学推理和多跳问答任务中的有效性。

Comments Accepted by ICML 2026

详情
AI中文摘要

大型语言模型通过生成分步解决方案在具有客观答案的推理任务中实现了强大的性能,但诊断多步推理轨迹可能失败的位置仍然困难。置信度估计提供了一种诊断信号,但现有方法受限于最终答案或需要内部模型访问。在本文中,我们引入了分步置信度归因(SCA),一种适用于闭源LLM的框架,该框架仅基于生成的推理轨迹分配步骤级置信度。SCA应用信息瓶颈原理:与正确解决方案中的共识结构对齐的步骤获得高置信度,而偏差则被标记为可能错误。我们提出了两种互补的方法:(1)NIBS,一种非参数化的IB方法,用于测量一致性而无需图结构,以及(2)GIBS,一种基于图的IB模型,通过可微分掩码学习子图以捕捉逻辑变化。在数学推理和多跳问答任务上的大量实验表明,SCA能够可靠地识别与推理错误高度相关的低置信度步骤。此外,使用步骤级置信度指导自我修正,比使用答案级反馈最多提高了13.5%的修正成功率。

英文摘要

Large Language Models have achieved strong performance on reasoning tasks with objective answers by generating step-by-step solutions, but diagnosing where a multi-step reasoning trace might fail remains difficult. Confidence estimation offers a diagnostic signal, yet existing methods are restricted to final answers or require internal model access. In this paper, we introduce Stepwise Confidence Attribution (SCA), a framework for closed-source LLMs that assigns step-level confidence based only on generated reasoning traces. SCA applies the Information Bottleneck principle: steps aligning with consensus structures across correct solutions receive high confidence, while deviations are flagged as potentially erroneous. We propose two complementary methods: (1) NIBS, a non-parametric IB approach measuring consistency without graph structures, and (2) GIBS, a graph-based IB model that learns subgraphs through a differentiable mask to capture logical variability. Extensive experiments on mathematical reasoning and multi-hop question answering show that SCA reliably identifies low-confidence steps strongly correlated with reasoning errors. Moreover, using step-level confidence to guide self-correction improves the correction success rate by up to 13.5% over answer-level feedback.

2606.00419 2026-06-09 stat.ML cs.LG 版本更新

Parameter-Free and Group Conditional Online Conformal Prediction

无参数和组条件在线共形预测

Beepul Bharti, Ambar Pal, Jacopo Teneggi, Jeremias Sulam

发表机构 * Data Science and AI Institute (DSAI), Johns Hopkins University(数据科学与人工智能研究院(DSAI),约翰霍普金斯大学) Mathematical Institute for Data Science (MINDS), Johns Hopkins University(数据科学数学研究院(MINDS),约翰霍普金斯大学) Department of Biomedical Engineering, Johns Hopkins University(生物医学工程系,约翰霍普金斯大学) Department of Computer Science, Johns Hopkins University(计算机科学系,约翰霍普金斯大学) Amazon Responsible AI(亚马逊负责任人工智能)

AI总结 提出一种无参数算法用于组条件在线共形预测,在保证组条件覆盖的同时无需调参,并在合成和真实数据上验证了其有效性和可靠性。

详情
AI中文摘要

不确定性量化对于机器学习预测器在数据分布随时间变化(即数据可能不可交换)的真实场景中的部署至关重要。在线共形预测方法解决了这个问题,但要以牺牲(i)组间误差控制或(ii)与学习率无关的实现为代价。组条件覆盖对于跨不同数据点集合的公平性以及提供更精细的不确定性量化保证至关重要。无参数优化对于应对对抗性和未知数据偏移的鲁棒性至关重要。我们提出了一种用于组条件在线共形预测的无参数算法,并证明它实现了最佳的组条件覆盖保证。我们在合成和真实数据上评估了我们的算法,表明我们的方法不仅提高了现有无参数在线共形预测方法的可靠性,而且提供了与调优良好的组条件方法大小相当的预测区间。通过将组条件覆盖与无参数在线算法统一,我们的工作为变化环境中公平且鲁棒的不确定性量化奠定了基础。

英文摘要

Uncertainty quantification (UQ) is critical for the deployment of machine learning predictors in real-world scenarios where the data distribution may shift over time (i.e., data may not be exchangeable). Online conformal prediction (OCP) methods address this issue at the expense of either (i) group-wise error control or (ii) learning-rate independent implementation. Group-conditional coverage is essential for fairness across different collections of data points and for providing finer UQ guarantees. Parameter-free optimization is crucial for robustness to adversarial and unknown data shifts. We propose a parameter-free algorithm for group-conditional OCP and demonstrate that it achieves the best group-conditional coverage guarantees. We evaluate our algorithm on synthetic and real-world data, demonstrating that our method not only improves the reliability of existing parameter-free OCP methods but also provides prediction intervals that are comparable in size to well-tuned group-conditional approaches. By unifying group-conditional coverage with parameter-free online algorithms, our work lays a foundation for fair and robust uncertainty quantification in shifting environments.

9. 图学习与结构化数据 18 篇

2606.07598 2026-06-09 cs.LG cs.AI 新提交

A Topological Characterization of Graph Neural Networks via Stochastic Block Model Embeddings on the n-Sphere

图神经网络的拓扑特征化:通过n-球面上的随机块模型嵌入

Gopal Anantharaman

发表机构 * KnotTheory.ai Inc.(KnotTheory.ai 公司) Dept. of Mathematics, Emporia State University(恩波利亚州立大学数学系)

AI总结 提出将消息传递神经网络诱导的随机块模型映射到单位n-球面的拓扑框架,用于比较训练后的图神经网络,并实现无需重新训练的迁移学习候选检索。

详情
AI中文摘要

我们提出一个拓扑框架,用于比较训练后的图神经网络(GNN),通过将消息传递神经网络(MPNN)在图元信号(graphon-signal)空间上诱导的随机块模型(SBM)映射到单位$n$-球面$\mathbb{S}^{n-1}\subset\mathbb{R}^n$上。该构建基于三个经典支柱:割距离图元空间$(\widetilde{\mathcal{W}}_0,\delta_\square)$的紧性(Lovász and Szegedy, 2006; Lovász, 2012),Frieze–Kannan弱正则引理及其由Levie (2023)给出的图元信号扩展,以及MPNN关于割距离的Lipschitz连续性。我们证明,对于任意给定的容差$\varepsilon>0$,一个训练后的MPNN $\Phi$作用于足够大的图时,可以通过一个复杂度有界的阶梯图元信号(误差不超过$\varepsilon$)来分解,并且我们构造了一个显式的保测映射$\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$,将SBM区域放置在不相交的球冠上。这产生了一个与问题无关的低维训练GNN“指纹”,便于视觉检查和跨模型库的最近邻搜索,从而实现无需重新训练的迁移学习候选检索。我们讨论了高维中测度集中现象带来的障碍,这一现象与大规模语言模型规模的嵌入直接相关。最后,我们提出五个具体的未来研究方向:双曲和格拉斯曼流形替代球面模型,基于图元信号的Gromov–Wasserstein距离作为$n$-球面映射的无等距替代,SBM流形的信息几何(Fisher)重新表述,逐层嵌入云的持续同调指纹,以及基于图元特征分解的谱距离基线。

英文摘要

We propose a topological framework for comparing trained Graph Neural Networks (GNNs) by mapping the Stochastic Block Models (SBMs) induced on the graphon-signal space of a Message Passing Neural Network (MPNN) onto the unit $n$-sphere $\mathbb{S}^{n-1}\subset\mathbb{R}^n$. The construction rests on three classical pillars: the compactness of the cut-distance graphon space $(\widetilde{\mathcal{W}}_0,\delta_\square)$ (Lovász and Szegedy, 2006; Lovász, 2012), the Frieze–Kannan weak regularity lemma together with its graphon-signal extension due to Levie (2023), and the Lipschitz continuity of MPNNs with respect to the cut-distance. We show that, for any prescribed tolerance $\varepsilon>0$, a trained MPNN $\Phi$ acting on a sufficiently large graph factors (up to $\varepsilon$) through a step-graphon-signal of bounded complexity, and we construct an explicit measure-preserving map $\Psi_n\colon[0,1]\to\mathbb{S}^{n-1}$ that places the SBM regions on disjoint spherical caps. This produces a problem-agnostic, low-dimensional "fingerprint" of a trained GNN that is amenable to visual inspection and to nearest-neighbour search across model zoos, enabling transfer-learning candidate retrieval without retraining. We discuss the obstruction posed by concentration of measure in high dimension, a phenomenon directly relevant to LLM-scale embeddings. We close with five concrete future research directions: hyperbolic and Grassmannian alternatives to the spherical model, Gromov–Wasserstein distances on graphon-signals as an isometry-free alternative to the $n$-sphere map, an information-geometric (Fisher) reformulation of the SBM manifold, persistent-homology fingerprints of layer-wise embedding clouds, and a spectral-distance baseline derived from the graphon eigendecomposition.

2606.07619 2026-06-09 cs.LG math.GR 新提交

Graph Neural Networks for Predicting Solvability of Finite Groups

用于预测有限群可解性的图神经网络

Tal Weissblat

发表机构 * The Institute of Agricultural and Biosystems Engineering, Agricultural Research Organization - Volcani Institute(农业研究组织沃尔卡尼研究所农业与生物系统工程研究所)

AI总结 提出图神经网络框架,利用Cayley图等图表示,仅通过结构信息区分可解群与不可解群,探索图神经网络学习群论代数性质的能力。

Comments 7 pages, 3 tables

详情
AI中文摘要

我们提出了一个图神经网络(GNN)框架,用于根据有限群的可解性对其进行分类。利用与有限群相关的图表示,包括Cayley图(CG),所提出的模型仅通过结构图信息来训练区分可解群和不可解群。该框架在训练数据集之外的群上进行评估,以研究GNN能够学习群论中出现的代数性质的程度。更广泛地说,本工作探索了有限群的代数结构与基于图的几何表示之间的关系。本研究旨在作为概念验证,探究GNN是否能够从基于图的表示中学习有限群的代数性质。

英文摘要

We present a Graph Neural Network (GNN) framework for the classification of finite groups according to their solvability. Using graph representations associated with finite groups, including Cayley graphs (CG), the proposed model is trained to distinguish solvable and non-solvable groups using structural graph information alone. The framework is evaluated on groups outside the training dataset in order to investigate the extent to which GNNs can learn algebraic properties arising in group theory. More broadly, the present work explores the relationship between algebraic structure and graph-based geometric representations of finite groups. The present study is intended as a proof-of-concept investigation of whether GNNs can learn algebraic properties of finite groups from graph-based representations.
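
To make the input representation concrete, here is a small sketch of building a Cayley graph for a finite group given as permutations. The choice of S3, the particular generators, and the adjacency-list encoding are illustrative assumptions; the abstract does not state how the authors encode their graphs.

```python
from itertools import permutations

def compose(p, q):
    """Return the permutation p∘q (apply q first, then p)."""
    return tuple(p[q[i]] for i in range(len(q)))

def cayley_graph(elements, generators):
    """Directed Cayley graph: an edge g -> g*s for every generator s."""
    return {g: [compose(g, s) for s in generators] for g in elements}

# S3 (a solvable group) generated by a transposition and a 3-cycle.
s3 = list(permutations(range(3)))
gens = [(1, 0, 2), (1, 2, 0)]
graph = cayley_graph(s3, gens)
```

Each vertex has exactly one outgoing edge per generator, so the graph's structure depends on the generating set as well as on the group itself.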

2606.08067 2026-06-09 cs.LG 新提交

Beyond Homophily: Towards Generalized Graph Reconstruction Attack and Defense

超越同质性:迈向广义图重构攻击与防御

Zhanke Zhou, Bo Han, Xuan Li, Jiangchao Yao, Sanmi Koyejo, Michael K. Ng

发表机构 * Hong Kong Baptist University(香港浸会大学) Shanghai Jiao Tong University(上海交通大学) Stanford University(斯坦福大学)

AI总结 针对图神经网络可能泄露训练图邻接信息的问题,提出基于马尔可夫链近似的攻击方法MC-GRA(+)和防御方法MC-GPB(+),在异质图上实现高保真重构攻击并有效防御。

详情
AI中文摘要

图神经网络(GNN)广泛部署于关系数据上,但它们可能泄露关于训练图邻接的敏感或专有信息,例如社交关系、交易和交互。本文研究图重构攻击(GRA),这是一种模型反演形式,从训练好的GNN中重构训练邻接,给定不同级别的攻击方信息。我们首先系统地表征了邻接何时以及为何通过特征、标签、嵌入和预测变得可恢复,其中泄漏由图的同质性、异质性和模型的归纳偏差调节。受这些发现启发,我们通过马尔可夫链近似视角审视GNN推理,将分层前向计算视为一个拓扑依赖表示的链。基于此视角,我们开发了互补的攻击和防御方法。在攻击方面,我们提出MC-GRA(+),通过优化一个替代邻接来重构邻接,该替代邻接的GNN诱导表示在各层与目标模型的表示对齐。在防御方面,我们提出MC-GPB(+),在整个表示链中抑制邻接依赖的信息,同时旨在在隐私-效用权衡下保持分类准确性。在同质/异质图基准和GNN上的实验表明,我们的攻击比先前方法提高了重构保真度,而我们的防御仅以轻微精度损失降低了重构成功率。

英文摘要

Graph neural networks (GNNs) are widely deployed on relational data, yet they can leak sensitive or proprietary information about the training graph adjacency, e.g., social ties, transactions, and interactions. This work studies graph reconstruction attacks (GRA), a form of model inversion that reconstructs the training adjacency from a trained GNN, given different levels of attacker-side information. We first provide a systematic characterization of when and why adjacency becomes recoverable through features, labels, embeddings, and predictions, with leakage modulated by graph homophily, heterophily, and the model's inductive bias. Motivated by these findings, we view GNN inference through a Markov chain approximation lens, treating the layered forward computation as a chain of topology-dependent representations. Building on this view, we develop complementary attack and defense methods. On the attack side, we propose MC-GRA (+), which reconstructs the adjacency by optimizing a surrogate adjacency whose GNN-induced representations align with those of the target model at each layer. On the defense side, we propose MC-GPB (+), which suppresses adjacency-dependent information throughout the representation chain while aiming to preserve classification accuracy under a privacy-utility trade-off. Experiments across homophilic/heterophilic graph benchmarks and GNNs show that our attacks improve reconstruction fidelity over prior methods, while our defenses reduce reconstruction success with only minor accuracy loss.

2606.08287 2026-06-09 cs.LG cond-mat.mtrl-sci cs.CE 新提交

Mesh Graph Neural Network Framework for Accelerating Finite Element Simulation for Arbitrary Geometries

网格图神经网络框架加速任意几何形状的有限元仿真

Josiah D. Kunz, Kamal Choudhary

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出网格图网络(MGN)预测任意孔洞几何2D结构的von Mises应力场,通过编码节点类型、相对边特征和全局特征实现平移和旋转不变性,在未见几何和载荷下R²≥0.97,优于传统模型。

Comments 10 pages, 6 figures, to be published. Code available at https://github.com/Josiah-Kunz/MGN-Public

详情
AI中文摘要

有限元分析(FEA)对于结构设计至关重要,但在评估多个设计迭代或载荷场景时计算成本高昂。机器学习代理模型提供了一种有前景的替代方案,但大多数方法在跨不同几何形状的泛化方面存在关键局限性。本文提出一种网格图网络(MGN),用于预测具有任意孔洞几何的二维结构部件中的von Mises应力场。与使用绝对节点坐标作为特征的传统机器学习方法不同,该模型基于现有的MGN框架,编码节点类型(例如固定边界、自由表面、孔洞边缘)、相对边特征(邻居之间的距离)和全局特征(施加的载荷)。这种架构本质上是平移和旋转不变的,使得无需重新训练即可泛化到未见过的几何形状。MGN在11种板几何形状和20种载荷条件下训练,并在7种未见几何形状和3种未见载荷下评估。在最有利的情况下,模型在未见几何和未见载荷上达到$R^2 \geq 0.97$,而传统模型(随机森林、梯度提升、K近邻)在相同数据上训练的$R^2$约为$0.01$--$0.86$。然而,即使在不太有利的情况下,MGN模型仍然优于传统模型。本文将Pfaff等人(arXiv:2010.03409)的基于网格的仿真框架扩展到结构力学,证明了图神经网络可以作为跨不同几何形状的有限元分析的高效代理。

英文摘要

Finite element analysis (FEA) is essential for structural design but remains computationally expensive, particularly when evaluating multiple design iterations or load scenarios. Machine learning surrogate models offer a promising alternative, yet most approaches struggle with a critical limitation: generalizing across varying geometries. This work presents a mesh graph network (MGN) for predicting von Mises stress fields in 2D structural components with arbitrary hole geometries. Unlike traditional machine learning approaches that use absolute node coordinates as features, the proposed model builds on existing MGN frameworks that encode node types (e.g., fixed boundary, free surface, hole edge), relative edge features (distance between neighbors), and global features (applied load). This architecture is inherently translation- and rotation-invariant, enabling generalization to unseen geometries without retraining. The MGN was trained on 11 plate geometries under 20 load conditions and evaluated on 7 unseen geometries and 3 unseen loads. In the most favorable case, the model achieves $R^2 \geq 0.97$ on an unseen geometry and unseen load, compared to $R^2 \approx 0.01$–$0.86$ for conventional models (Random Forest, Gradient Boosting, K-Nearest Neighbors) trained on identical data. However, even in less favorable cases, the MGN model still outperforms conventional models. This work extends the mesh-based simulation framework of Pfaff et al. (arXiv:2010.03409) to structural mechanics, demonstrating that graph neural networks can serve as efficient surrogates for finite element analysis across varying geometries.
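
The invariance claim rests on using relative rather than absolute geometry. The sketch below checks this for the one edge feature the abstract names, the distance between neighboring nodes, on a toy triangle; the mesh, the angle, and the shift are made-up values, and the helper names are hypothetical.

```python
import math

def edge_lengths(nodes, edges):
    """Relative edge feature: Euclidean distance between connected nodes."""
    return [math.dist(nodes[i], nodes[j]) for i, j in edges]

def rigid_transform(nodes, angle, shift):
    """Rotate every node by `angle` (radians) about the origin, then translate."""
    c, s = math.cos(angle), math.sin(angle)
    return [(c * x - s * y + shift[0], s * x + c * y + shift[1]) for x, y in nodes]

# Toy triangular mesh element (coordinates are illustrative).
nodes = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
edges = [(0, 1), (1, 2), (2, 0)]
before = edge_lengths(nodes, edges)
after = edge_lengths(rigid_transform(nodes, 0.7, (3.0, -1.5)), edges)
# The features agree up to floating-point error, unlike absolute coordinates.
assert all(math.isclose(a, b, abs_tol=1e-9) for a, b in zip(before, after))
```

A model fed only such features cannot distinguish a plate from a rotated or shifted copy of it, which is what allows one trained network to be reused across placements.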

2606.08303 2026-06-09 cs.LG 新提交

GeoGNN: Time Series Geo-Localization using Two-Tower Graph Neural Networks

GeoGNN:使用双塔图神经网络的时间序列地理定位

Toan Tran, Waqwoya Abebe, Abhishek Potnis, Supriya Chinthavali, Cyrus Shahabi, Li Xiong, Dalton Lunga

发表机构 * Emory University(埃默里大学) Oak Ridge National Laboratory(橡树岭国家实验室) University of Southern California(南加州大学)

AI总结 提出GeoGNN双塔架构,利用地理邻接图学习空间嵌入,结合时间序列表示,通过点积匹配实现时间序列地理定位,在电力消费数据集上平均提升约27%的定位精度。

详情
AI中文摘要

本文研究时间序列地理定位的新概念,目标是推断每个原始时间序列的地理来源。成功的地理定位可以为时间序列提供空间上下文,支持下游位置感知应用。我们形式化了该问题,借鉴图像地理定位的核心思想建立了强基线,并提出了GeoGNN——一种双塔架构。训练时,GeoGNN的空间塔通过利用地理邻接图学习地理单元候选的嵌入,而时间塔从时间序列中提取信息表示。推理时,每个时间表示与候选地理嵌入通过点积相似度匹配,并结合辅助分类头,以预测时间序列关联的地理来源。在全国范围的大规模电力消费数据集上的实验表明,GeoGNN在数据集上取得了最佳性能,并将细粒度和粗粒度地理定位精度平均提高了约27%。

英文摘要

This paper investigates a novel concept of time series geolocalization, where the goal is to infer the geographic origin of each raw time series. Successful geolocalization can provide spatial context to time series, enabling downstream location-aware applications. We formalize the problem, adapt core ideas from image geolocalization to establish strong baselines, and propose GeoGNN, a two-tower architecture. During training, GeoGNN's spatial tower learns embeddings of geographic cell candidates by leveraging the geographic adjacency graph, while the temporal tower extracts informative representations from time series. During inference, each temporal representation is matched against candidate geographic embeddings using dot-product similarity, combined with an auxiliary classification head, to predict the time series' associated geographic origin. Experiments on large-scale, countrywide electricity-consumption datasets demonstrate that GeoGNN achieves the best performance across datasets and enhances both fine- and coarse-grained geolocalization accuracy by ~27% on average.
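
The inference-time matching step can be sketched as a nearest-candidate search under dot-product similarity. The embeddings and cell names below are invented for illustration, and the sketch omits the auxiliary classification head mentioned in the abstract.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def rank_cells(series_embedding, cell_embeddings):
    """Rank candidate geographic cells by dot-product similarity, best first."""
    scores = {cell: dot(series_embedding, emb) for cell, emb in cell_embeddings.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical 2-D embeddings from the spatial tower and the temporal tower.
cells = {"cell_a": [1.0, 0.0], "cell_b": [0.0, 1.0], "cell_c": [0.6, 0.6]}
ranking = rank_cells([0.9, 0.2], cells)
print(ranking)  # → ['cell_a', 'cell_c', 'cell_b']
```

Because the candidate embeddings do not depend on the query, they can be computed once and reused for every incoming time series.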

2606.08306 2026-06-09 cs.LG cs.SI 新提交

Towards Graph Foundation Models for Dynamics in Complex Networked Systems: Lessons from Super-Spreader Identification in Multilayer Networks

面向复杂网络系统中动力学的图基础模型:来自多层网络超级传播者识别的教训

Michał Czuba, Mateusz Stolarski, Adam Piróg, Piotr Bielak, Piotr Bródka

AI总结 本文提出图基础模型在动力学中需具备归纳跨网络泛化能力,通过仅基于合成多层网络训练的ts-net模型,在真实多层网络上实现零样本泛化,并优于传统方法。

详情
AI中文摘要

网络动力学——包括传播、影响力最大化和流行病建模——仍然主要局限于转导范式,其中模型在单个网络上训练,无法在不重新训练的情况下用于未见过的图。我们认为,归纳跨网络泛化是该领域图基础模型(GFM)的必要前提,并为此提出了四个设计属性。作为概念验证,ts-net(TopSpreadersNetwork)仅基于合成多层网络(MLN)训练,展示了在大小和层数各异的真实MLN上的零样本泛化能力,在四个指标中的三个上优于经典启发式方法和转导基线。基于ts-net的性能,我们进一步概述了构建网络动力学GFM的五个开放挑战:规模、多层泛化、自监督预训练、跨任务迁移和节点属性集成。

英文摘要

Network dynamics - including spreading, influence maximisation, and epidemic modelling - remain largely confined to the transductive paradigm, where models are trained on a single network and cannot be reused on unseen graphs without retraining. We argue that inductive cross-network generalisation is a necessary prerequisite for Graph Foundation Models (GFMs) in this domain and propose four design properties towards this goal. As a proof of concept, ts-net (TopSpreadersNetwork), trained solely on synthetic multilayer networks (MLNs), demonstrates zero-shot generalisation to real-world MLNs of varying size and layer count, outperforming classical heuristics and transductive baselines on three of four metrics. Based on ts-net's performance, we further outline five open challenges towards building GFMs for network dynamics: scale, many-layer generalisation, self-supervised pretraining, cross-task transfer, and node-attribute integration.

2606.08978 2026-06-09 cs.LG 新提交

Heterophily-Aware Adaptive Knowledge Distillation for Hypergraph Neural Networks

异质性感知的自适应知识蒸馏用于超图神经网络

Joohee Cho, David Yoon Suk Kang, Yunyong Ko

发表机构 * Chung-Ang University(中央大学) Chungbuk National University(忠北国立大学)

AI总结 针对超图神经网络在异质性节点上性能下降的问题,提出异质性感知的自适应蒸馏方法HADES,通过量化节点异质性调节教师知识迁移,使学生模型性能超越教师并实现最高12.3倍加速。

Comments 5 pages, 2 figures, 4 tables

详情
AI中文摘要

超图知识蒸馏旨在通过轻量级学生模型保留超图神经网络(HNN)教师的预测性能,同时降低推理成本。在这项工作中,我们观察到HNN在通过语义多样的超边连接的异质性节点上的预测性能显著较低,表明教师知识的可靠性在不同节点间存在差异。受此观察启发,我们提出了HADES,一种用于超图神经网络的异质性感知自适应蒸馏方法。HADES量化节点异质性,并将其作为教师可靠性的估计,以在蒸馏过程中调节教师知识的迁移。在真实世界超图上的实验结果表明,HADES在不同HNN教师和蒸馏目标下持续提升学生性能。在许多情况下,所得学生模型的预测性能超越其教师,同时实现高达12.3倍的推理加速。

英文摘要

Hypergraph knowledge distillation aims to retain the predictive performance of a hypergraph neural network (HNN) teacher while reducing inference costs through a lightweight student model. In this work, we observe that HNNs exhibit substantially lower prediction performance on heterophilic nodes connected through semantically diverse hyperedges, indicating that the reliability of teacher knowledge varies across nodes. Motivated by this observation, we propose HADES, a heterophily-aware adaptive distillation method for hypergraph neural networks. HADES quantifies node heterophily and leverages it as an estimate of teacher reliability to modulate the transfer of teacher knowledge during distillation. Experimental results on real-world hypergraphs demonstrate that HADES consistently improves student performance across different HNN teachers and distillation objectives. In many cases, the resulting student models surpass the predictive performance of their teachers while achieving up to 12.3 times faster inference.

2606.09051 2026-06-09 cs.LG 新提交

Beyond Convolution: Advancing Hypergraph Neural Networks with Hypergraph U-Nets

超越卷积:用超图U-Net推进超图神经网络

Fuli Wang, Wei Qian, Daniel L. Lau, Gonzalo R. Arce

发表机构 * Institute for Financial Services Analytics, University of Delaware(特拉华大学金融服务分析研究所) Department of Applied Economics and Statistics, University of Delaware(特拉华大学应用经济学与统计学系) Department of Electrical and Computer Engineering, University of Kentucky(肯塔基大学电气与计算机工程系) Department of Electrical and Computer Engineering, University of Delaware(特拉华大学电气与计算机工程系)

AI总结 提出并行层次池化和反池化算子,构建首个超图U-Net架构,在分类、重构和异常检测任务上超越现有方法。

详情
AI中文摘要

卷积已成功从图像处理过渡到非欧几里得高阶域的复杂领域,特别是在超图中。尽管卷积取得了成功,但由于缺乏定义良好的池化和反池化操作,一种名为U-Net的流行架构在超图数据上的探索仍然很少。本工作开创性地研究了超图数据的U-Net架构,解决了设计有效池化和反池化操作的关键挑战,这些操作能保留输入超图的最大结构信息。受层次聚类启发,我们提出通过在不同粒度上切割聚类树状图来一次性构建池化和反池化算子,称为并行层次池化(PHPool)和反池化(PHUnpool)算子。与现有通过顺序学习过程可能造成局部结构损坏的池化方法不同,我们的PHPool算子以全局并行方式设计,确保对原始超图结构的保真度和高效计算,而PHUnpool算子则专门设计为执行PHPool的逆操作以进行超图重构。我们通过超图重构模拟、超图分类和节点级异常检测验证了我们的模型,在这些任务中,它表现出优于现有最先进的图和超图深度学习方法的性能。

英文摘要

Convolutions have successfully transitioned from image processing to the complex realm of non-Euclidean higher-order domains, particularly in hypergraphs. Despite this success, a popular architecture named U-Net remains largely unexplored for hypergraph data due to the lack of well-defined pooling and unpooling operations. This work pioneers the study of U-Net architectures for hypergraph data, addressing the critical challenge of designing effective pooling and unpooling operations that retain maximal structural information from the input hypergraph. Motivated by hierarchical clustering, we propose to construct the pooling and unpooling operators all at once by cutting the clustering dendrogram at different granularities, named the Parallel Hierarchical Pooling (PHPool) and Unpooling (PHUnpool) operators. Unlike existing pooling methods that risk local structural damage through a sequential learning procedure, our PHPool operators are designed in a global and parallel manner to ensure fidelity to the original hypergraph structure with efficient computation, while the PHUnpool operators are tailored to perform the inverse operations of the PHPool operators for hypergraph reconstruction. We validate our model through hypergraph reconstruction simulation, hypergraph classification, and node-level anomaly detection, where it demonstrates superior performance over existing state-of-the-art graph and hypergraph deep learning methods.

2606.09340 2026-06-09 cs.LG 新提交

Thresholded Local Hyper-Flow Diffusion

阈值化局部超流扩散

Meher Chaitanya, Sebastian Dalleiger, Luana Ruiz

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院) Johns Hopkins University(约翰霍普金斯大学)

AI总结 提出TL-HFD算法,通过局部活动区域和阈值化边界激活实现超图种子聚类的局部扩散,保证与全局更新等价并给出有限时间对偶次优性界。

详情
AI中文摘要

局部超流扩散(HFD)为一般子模超图中的种子聚类提供了与边大小无关的Cheeger型保证,但现有的HFD求解器在每次迭代中不保持中间计算的局部性。我们引入了阈值化局部HFD(TL-HFD),这是一种一阶方法,它维护种子周围的活动区域,对该区域及其直接边界执行投影次梯度更新,并通过阈值化(top-k)边界激活进行扩展。我们证明了局部更新是精确的:限制在活动区域及其边界上的度预条件投影次梯度步骤与无限制的全局更新一致。我们为精确和阈值化更新建立了有限时间对偶次优性,将后者视为具有显式跳过边界误差的不精确投影次梯度步骤。我们进一步推导了一个加性激活体积界,由实现的局部次梯度范数和新激活顶点中的最小边界推动控制,并将具有局部支持的近似对偶最优性转化为早期停止迭代的鲁棒扫描切割保证。对于一般子模切割成本,每次迭代在扫描区域中是局部的,并且在超边原语中是对 oracle 敏感的。实验上,TL-HFD通常匹配或优于HFD,同时激活更少的体积,在扩散倾向于吸收非目标顶点的噪声实例上获得最大收益。

英文摘要

Local Hyper-Flow Diffusion (HFD) gives an edge-size-independent Cheeger-type guarantee for seeded clustering in general submodular hypergraphs, but existing HFD solvers do not keep intermediate computation local at every iteration. We introduce Thresholded Local HFD (TL-HFD), a first-order method that maintains an active region around the seeds, performs projected subgradient updates on that region and its immediate boundary, and expands via thresholded (top-k) boundary activation. We prove that the local update is exact: the degree-preconditioned projected subgradient step restricted to the active region and its boundary coincides with the unrestricted global update. We establish finite-time dual suboptimality for both exact and thresholded updates, treating the latter as inexact projected subgradient steps with explicit skipped-boundary error. We further derive an additive activated-volume bound controlled by realized local subgradient norms and the minimum boundary-push among newly activated vertices, and translate approximate dual optimality with localized support into a robust sweep-cut guarantee for early-stopped iterates. For general submodular cut-costs, each iteration is local in the scanned region and oracle-sensitive in the hyperedge primitive. Empirically, TL-HFD often matches or improves over HFD while activating less volume, with the largest gains on noisy instances where diffusion tends to absorb non-target vertices.
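
The thresholded expansion step can be pictured as a top-k selection over boundary vertices. In the sketch below, `boundary_scores` is a hypothetical per-vertex score; how TL-HFD actually scores the boundary is not given in the abstract, and the projected subgradient update itself is not shown.

```python
import heapq

def expand_active_region(active, boundary_scores, k):
    """Activate the k boundary vertices with the largest scores.

    `boundary_scores` maps each boundary vertex to a hypothetical "push"
    value; vertices that are already active are ignored.
    """
    candidates = {v: s for v, s in boundary_scores.items() if v not in active}
    chosen = heapq.nlargest(k, candidates, key=candidates.get)
    return active | set(chosen)

# Vertex 0 is the seed; vertices 1-3 sit on its boundary.
region = expand_active_region({0}, {0: 2.0, 1: 0.5, 2: 0.1, 3: 0.9}, k=2)
```

Vertices left out of the top k are the skipped boundary that the abstract's error term accounts for.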

2606.09432 2026-06-09 cs.LG 新提交

Graph Mamba Operator: A Latent Simulator for Interacting Particle Systems

Graph Mamba Operator: 一种用于相互作用粒子系统的潜在模拟器

Karn Tiwari, Niladri Dutta, N M Anoop Krishnan, Prathosh A P

发表机构 * Indian Institute of Science, Bangalore(印度科学研究所,班加罗尔) Indian Institute of Technology, Delhi(印度理工学院,德里)

AI总结 提出Graph Mamba Operator (GraMO),通过将状态空间模型与图交互学习集成到单一循环中,实现长期时空依赖的联合建模,在N体系统、运动捕捉和机器人数据集上取得最低误差。

Comments Under Submission

详情
AI中文摘要

建模相互作用的动力系统需要捕捉空间相互作用以及长期时间依赖。图神经网络(GNNs)提供了一种自然的表示,但通常依赖于自回归滚动,并分别处理空间和时间动态,导致长期预测中误差累积。现有方法还侧重于局部交互和短时间上下文,限制了它们捕捉多跳依赖和全局结构的能力。我们引入了图Mamba算子(GraMO),一种潜在空间模拟器,将状态空间模型与基于图的交互学习集成在一起。与先前将节点排序或分阶段应用空间和时间更新的工作不同,GraMO在单个循环中耦合了基于图的交互和时间状态更新。该更新在潜在状态上是线性的,具有跨状态自适应变化的输入相关系数。我们在N体系统、运动捕捉和机器人数据集上评估了GraMO,在基准测试中实现了最低误差,并在长期预测中取得了最大增益。

英文摘要

Modeling interacting dynamical systems requires capturing spatial interactions alongside long-range temporal dependencies. Graph neural networks (GNNs) provide a natural representation but typically rely on autoregressive rollouts and treat spatial and temporal dynamics separately, leading to error accumulation over long horizons. Existing approaches also focus on local interactions and short temporal contexts, limiting their ability to capture multi-hop dependencies and global structure. We introduce the Graph Mamba Operator (GraMO), a latent-space simulator that integrates state-space models with graph-based interaction learning. In contrast to prior work that sequences nodes or applies spatial and temporal updates in separate stages, GraMO couples graph-based interactions and temporal state updates within a single recurrence. The update is linear in the latent state, with input-dependent coefficients that adapt across regimes. We evaluate GraMO on N-body systems, motion capture, and robotics datasets, achieving the lowest error across benchmarks and the largest gains in long-horizon prediction.
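
One way to read "a single recurrence that is linear in the latent state with input-dependent coefficients" is sketched below with scalar per-node states. The sigmoid gate, the mean aggregation over neighbors, and the function name are assumptions made for illustration and are not taken from the paper.

```python
import math

def gated_graph_step(h, x, neighbors):
    """One recurrence step, linear in the latent state h.

    The gate a_i depends only on the input x_i (a sigmoid here, an assumed
    choice), so for fixed inputs the map h -> h_next is linear.
    """
    h_next = []
    for i, nbrs in enumerate(neighbors):
        a = 1.0 / (1.0 + math.exp(-x[i]))          # input-dependent coefficient
        msg = sum(h[j] for j in nbrs) / len(nbrs) if nbrs else 0.0
        h_next.append(a * h[i] + (1.0 - a) * msg)  # temporal and graph terms together
    return h_next

# Three nodes on a path graph: 0 - 1 - 2.
state = gated_graph_step([1.0, 2.0, 3.0], [0.3, -1.2, 2.0], [[1], [0, 2], [1]])
```

The node's own history and its neighbors' states enter the same update, rather than being handled by a spatial stage followed by a temporal stage.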

2606.07677 2026-06-09 stat.ML cs.LG stat.AP stat.ME 交叉投稿

Disentangling Latent Risk Pathways via Bayesian Hypergraph Inference

通过贝叶斯超图推断解缠潜在风险路径

Shengxian Ding, Haonan Gao, Pangpang Liu, Xinyuan Tian, Yize Zhao

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出贝叶斯超图推断框架,通过风险因子调节的潜在疾病路径建模多疾病,实现可解释的高阶结构、校准的不确定性估计和罕见病改进预测。

Comments ICML 2026 Oral

详情
AI中文摘要

电子健康记录(EHR)提出了大规模多疾病建模问题,其中许多结果罕见且受共享风险因素强烈影响。虽然现代方法实现了强大的预测性能,但它们通常独立处理疾病或依赖黑盒架构,对风险因素如何组织疾病风险的洞察有限,且缺乏原则性的不确定性量化。我们引入了一个贝叶斯超图推断框架,将多疾病建模重新构建为围绕潜在的风险因子调节的疾病路径。风险因素作用于超边,即具有共享风险模式的潜在疾病子集,允许疾病参与多个不同的路径,并实现超越成对关联的可解释高阶结构。排斥先验鼓励简约且可识别的结构,而后验推断为疾病分组和风险因素影响提供了校准的不确定性。为了在大型EHR数据集上实现可扩展推断,我们开发了一种结构化变分推断算法,该算法保留了超边存在、疾病成员资格和路径级效应之间的逻辑依赖关系。在模拟数据和英国生物银行上的实验表明,该框架具有稳定且可解释的疾病路径结构、良好校准的不确定性、对罕见病的改进估计以及有竞争力的预测性能。

英文摘要

Electronic health records (EHR) pose large-scale multi-disease modeling problems in which many outcomes are rare and strongly influenced by shared risk factors. While modern approaches achieve strong predictive performance, they often treat diseases independently or rely on black-box architectures, offering limited insight into how risk factors organize disease risk and little principled uncertainty quantification. We introduce a Bayesian hypergraph inference framework that reframes multi-disease modeling around latent, risk-factor-modulated disease pathways. Risk factors act on hyperedges, latent disease subsets with shared risk patterns, allowing diseases to participate in multiple distinct pathways and enabling interpretable, higher-order structure beyond pairwise associations. A repulsion prior encourages parsimonious and identifiable structure, while posterior inference provides calibrated uncertainty over both disease groupings and risk-factor influence. To enable scalable inference on large EHR datasets, we develop a structured variational inference algorithm that preserves logical dependencies among hyperedge existence, disease membership, and pathway-level effects. Experiments on simulated data and UK Biobank demonstrate stable and interpretable disease pathway structure, well-calibrated uncertainty, improved estimation for rare diseases, and competitive predictive performance.

2606.08046 2026-06-09 cs.AI cs.CV cs.LG 交叉投稿

OSMGraphCLIP: Learning Global Location Representations from OpenStreetMap Graphs

OSMGraphCLIP:从OpenStreetMap图学习全局位置表示

Dimitrios Michail, Eleni Saka, Ioannis Giannopoulos, Ioannis Papoutsis

发表机构 * Harokopio University of Athens(雅典哈罗科皮奥大学) National Technical University of Athens(雅典国家技术大学) Vienna University of Technology(维也纳技术大学) National Observatory of Athens(雅典国家天文台)

AI总结 提出OSMGraphCLIP模型,利用OpenStreetMap异构图结构学习全局位置嵌入,通过多尺度图编码器和对比学习对齐,在气候、生态、社会经济等下游任务中达到或超越卫星基线方法。

详情
AI中文摘要

我们提出了OSMGraphCLIP,一种CLIP风格的地理空间表示模型,从免费可用的OpenStreetMap(OSM)数据中学习全局位置嵌入。OSMGraphCLIP将地理环境表示为带类型的OSM特征的异构图,保留了道路、建筑物、土地利用区域和兴趣点之间的拓扑和语义关系。多尺度图编码器捕获细粒度的局部结构和更广泛的景观组成,并通过对比对齐目标监督球谐位置编码器。我们在涵盖气候、生态、社会经济指标、公共卫生、土地覆盖、生物多样性和野火预测等一系列下游地理空间回归和分类任务中评估了OSMGraphCLIP,并表明仅结构化OSM数据就支持跨领域的强全局位置表示。OSMGraphCLIP在大多数基准测试中达到或超过了基于卫星的基线,在社会经济和公共卫生任务中优势最为明显,因为OSM对建成环境的显式语义注释编码了卫星像素只能间接捕获的人类活动模式。在生态和环境任务中,尽管未使用地球观测数据,该模型仍与基于图像的方法保持紧密竞争。定性分析证实,学习到的嵌入连贯地组织了地理空间,仅从地图拓扑中恢复了生物群落边界、城市梯度和热带-温带区别。

英文摘要

We present OSMGraphCLIP, a CLIP-style geospatial representation model that learns global location embeddings from freely available OpenStreetMap (OSM) data. OSMGraphCLIP represents geographic environments as heterogeneous graphs of typed OSM features, preserving the topological and semantic relationships among roads, buildings, land-use regions, and points of interest. A multi-scale graph encoder captures both fine-grained local structure and broader landscape composition, and supervises a spherical-harmonics location encoder through a contrastive alignment objective. We evaluate OSMGraphCLIP across a diverse suite of downstream geospatial regression and classification tasks spanning climate, ecology, socioeconomic indicators, public health, land cover, biodiversity, and wildfire forecasting, and show that structured OSM data alone supports strong global location representations across domains. OSMGraphCLIP matches or exceeds satellite-based baselines on the majority of benchmarks, with the most pronounced advantage on socioeconomic and public-health tasks, where OSM's explicit semantic annotation of the built environment encodes patterns of human activity that satellite pixels can only capture indirectly. On ecological and environmental tasks, the model remains closely competitive with imagery-based methods despite using no Earth observation data. Qualitative analysis confirms that the learned embeddings organize geographic space coherently, recovering biome boundaries, urban gradients, and tropical-temperate distinctions from map topology alone.
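
As a rough illustration of what a "contrastive alignment objective" between a graph encoder and a location encoder can look like, the sketch below computes a symmetric InfoNCE loss in the style of CLIP. The abstract does not specify the exact loss, so the symmetric form, the temperature value, and the function names are assumptions made for illustration, not the authors' implementation.

```python
import math


def _normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diag_cross_entropy(logits):
    """Mean of -log softmax(row_i)[i]: row i is expected to match column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def contrastive_alignment_loss(graph_emb, loc_emb, temperature=0.07):
    """Symmetric InfoNCE between paired graph and location embeddings.

    graph_emb[i] and loc_emb[i] are assumed to describe the same place.
    The temperature of 0.07 is an illustrative default, not a value from the paper.
    """
    g = [_normalize(v) for v in graph_emb]
    p = [_normalize(v) for v in loc_emb]
    # Cosine similarity of every graph embedding with every location embedding.
    logits = [[sum(a * b for a, b in zip(gi, pj)) / temperature for pj in p] for gi in g]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (_diag_cross_entropy(logits) + _diag_cross_entropy(logits_t))
```

Correctly paired embeddings give a loss near zero, while swapping the pairing drives the loss up, which is the signal that pulls the location encoder toward the graph encoder's representation.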

2606.08258 2026-06-09 cs.GR cs.CV cs.LG 交叉投稿

MS-COOT: Comparing Morse-Smale Complexes with Co-Optimal Transport

MS-COOT: 用共最优传输比较Morse-Smale复形

Guangyu Meng, Mingzhe Li, Erin Wolf Chambers

发表机构 * Department of Computer Science and Engineering, University of Notre Dame(圣母大学计算机科学与工程系)

AI总结 提出MS-COOT距离,将Morse-Smale复形表示为超图,通过共最优传输联合匹配临界点和区域,实现区域级结构比较,在分类等任务中优于图方法。

详情
AI中文摘要

理解和比较标量场中的结构是科学可视化的核心挑战,应用范围从特征分析到时间和结构比较。Morse-Smale (MS) 复形通过将标量场分解为由梯度流诱导的区域提供了自然表示。然而,现有方法通常依赖于基于图的表示,捕获临界点之间的关系而丢弃区域级结构。在这项工作中,我们将MS复形表示为超图,其中临界点构成节点,区域定义超边。我们引入MS-COOT,一种共最优传输距离,联合计算临界点和区域之间的对应关系。这种公式化使得在基于距离的框架内能够进行显式的区域到区域匹配,从而识别诸如分裂和合并等区域级事件。我们使用领域特定组件实例化该框架,包括编码临界点-区域关系的超网络函数、强调拓扑显著特征的基于持久性的概率度量,以及包含临界点属性的样本代价项。我们在涵盖2D模拟、3D曲面网格和体积数据的五个数据集上评估MS-COOT。我们的结果表明,MS-COOT捕获了基于图的距离未反映的区域级结构变化,同时在分类和分辨率判别等下游任务中实现了强性能。

英文摘要

Understanding and comparing structures in scalar fields is a central challenge in scientific visualization, with applications ranging from feature analysis to temporal and structural comparison. The Morse-Smale (MS) complex provides a natural representation by decomposing a scalar field into regions induced by gradient flow. However, existing approaches typically rely on graph-based representations, capturing relationships between critical points while discarding region-level structure. In this work, we represent the MS complex as a hypergraph, where critical points form nodes and regions define hyperedges. We introduce MS-COOT, a co-optimal transport distance that jointly computes correspondences between critical points and regions. This formulation enables explicit region-to-region matching within a distance-based framework, allowing identification of region-level events such as splitting and merging. We instantiate this framework with domain-specific components, including a hypernetwork function encoding critical point-region relationships, persistence-based probability measures that emphasize topologically significant features, and a sample cost term that incorporates critical point attributes. We evaluate MS-COOT on five datasets spanning 2D simulations, 3D surface meshes, and volumetric data. Our results show that MS-COOT captures region-level structural changes that are not reflected by graph-based distances, while achieving strong performance in downstream tasks such as classification and resolution discrimination.
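
For orientation, the generic co-optimal transport objective that a distance of this kind builds on can be written as below. The notation is assumed here for illustration: $X$ and $X'$ are node-by-hyperedge matrices (critical points by regions) for the two complexes, $w, w'$ and $v, v'$ are probability weights on nodes and hyperedges, and $L$ is a pointwise loss such as the squared difference. The abstract does not give the paper's exact formulation, and its additional sample cost term is omitted.

```latex
\[
\operatorname{COOT}(X, X') \;=\;
\min_{\substack{\pi^{s} \in \Pi(w, w') \\ \pi^{v} \in \Pi(v, v')}}
\;\sum_{i,j,k,l} L\bigl(X_{ik},\, X'_{jl}\bigr)\, \pi^{s}_{ij}\, \pi^{v}_{kl}
\]
```

The coupling $\pi^{s}$ matches critical points and $\pi^{v}$ matches regions; optimizing both jointly is what yields explicit region-to-region correspondences.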

2606.09100 2026-06-09 cs.SI cs.LG 交叉投稿

Alcmean's: Unsupervised community detection using local Laplacian, automatic detection of the number of centers

Alcmean's: 使用局部拉普拉斯算子的无监督社区检测与中心数量自动检测

Shahin Momenzadeh, Rojiar Pir Mohammadiani

发表机构 * Department of Computer Engineering, University of Kurdistan, Sanandaj, Iran(伊朗萨南达季库尔德斯坦大学计算机工程系)

AI总结 提出ALCMeans算法,结合拉普拉斯能量自动识别中心与DeepWalk嵌入,无需预设社区数,在基准数据集上NMI和ARI比Louvain等方法高10-20%。

详情
AI中文摘要

社区检测是复杂网络分析中的一个基本问题,在社交、生物和金融领域都有应用。传统算法如Louvain、LPA和模块度优化通常需要手动参数调整,还存在聚类中心选择不准确和可扩展性差的问题。为了解决这些挑战,我们提出了自动拉普拉斯中心均值(ALCMeans),一种新颖的社区检测算法。ALCMeans将基于拉普拉斯能量的自动中心识别与DeepWalk嵌入相结合,以实现稳健的节点表示。与现有的基于拉普拉斯和聚类方法不同,ALCMeans无需预定义社区数量,利用结构重要性增强聚类中心选择,并利用表示学习实现更准确和稳定的分配。在基准数据集上的实验结果表明,与Louvain、Newman-Girvan、LPA、Fast-Greedy以及最近基于GNN的竞争者(MAGI, KDD 2024)相比,NMI和ARI得分提高了10%到20%。使用模块度和F1分数的额外评估证实了ALCMeans的优越性。消融研究突出了每个组件的关键贡献。尽管依赖于DeepWalk参数并且相对于轻量级启发式方法运行时间增加,ALCMeans始终优于最先进的方法,使其成为现实世界网络分析的一个有前景的工具。

英文摘要

Community detection is a fundamental problem in the analysis of complex networks. It has applications across social, biological, and financial domains. Traditional algorithms such as Louvain, LPA, and modularity optimization often require manual parameter tuning. They also suffer from inaccurate cluster center selection and struggle with scalability. To address these challenges, we propose Automatic Laplacian Centrality Means (ALCMeans), a novel community detection algorithm. ALCMeans combines Laplacian energy-based automatic center identification with DeepWalk embeddings for robust node representation. Unlike existing Laplacian-based and clustering methods, ALCMeans eliminates the need to predefine the number of communities, enhances cluster center selection using structural importance, and leverages representation learning for more accurate and stable assignments. Experimental results on benchmark datasets demonstrate 10 to 20 percent higher NMI and ARI scores compared to Louvain, Newman-Girvan, LPA, Fast-Greedy, and a recent GNN-based competitor (MAGI, KDD 2024). Additional evaluations with modularity and F1-scores confirm the superiority of ALCMeans. Ablation studies highlight the critical contributions of each component. Despite its reliance on DeepWalk parameters and increased runtime relative to lightweight heuristics, ALCMeans consistently outperforms state-of-the-art methods. This makes it a promising tool for real-world network analysis.
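
To make "Laplacian energy-based" center identification more concrete, the sketch below computes the classical Laplacian centrality of each node: the drop in Laplacian energy when that node is removed. This is the textbook definition, used here as an assumption; the abstract does not say how ALCMeans turns such scores into centers, and the function names are hypothetical.

```python
def laplacian_energy(adj):
    """Sum of squared Laplacian eigenvalues, computed as trace(L^2).

    adj maps each node to a dict {neighbor: weight} and must be symmetric.
    trace(L^2) = sum_i d_i^2 + 2 * sum_{i<j} w_ij^2, with d_i the weighted degree.
    """
    degree_term = sum(sum(nbrs.values()) ** 2 for nbrs in adj.values())
    # Each undirected edge is stored twice (u->v and v->u), which supplies the factor 2.
    weight_term = sum(w ** 2 for nbrs in adj.values() for w in nbrs.values())
    return degree_term + weight_term


def remove_node(adj, node):
    """Return a copy of the graph without `node` and its incident edges."""
    return {
        u: {x: w for x, w in nbrs.items() if x != node}
        for u, nbrs in adj.items()
        if u != node
    }


def laplacian_centrality(adj):
    """Energy drop caused by deleting each node; larger means more structurally important."""
    total = laplacian_energy(adj)
    return {v: total - laplacian_energy(remove_node(adj, v)) for v in adj}


# Star graph: node 0 is connected to leaves 1, 2, 3.
star = {0: {1: 1, 2: 1, 3: 1}, 1: {0: 1}, 2: {0: 1}, 3: {0: 1}}
scores = laplacian_centrality(star)
```

On the star graph the hub scores 18 and each leaf scores 8, so a rule that picks high-centrality nodes as candidate centers would select the hub first.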

2510.02014 2026-06-09 cs.LG 版本更新

Normality Calibration in Semi-supervised Graph Anomaly Detection

半监督图异常检测中的正常性校准

Guolei Zeng, Hezhe Qiao, Guoguo Ai, Jinsong Guo, Guansong Pang

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 提出GraphNC框架,通过教师模型在异常分数和表示空间联合校准正常性,解决半监督图异常检测中正常性过拟合问题,降低误报。

Comments Accepted by ICML2026

详情
AI中文摘要

图异常检测(GAD)因其在广泛应用中揭示不规则模式的关键能力而日益受到关注。半监督GAD假设训练期间有部分标注的正常节点可用,是最广泛探索的应用设置之一。然而,现有半监督GAD方法学习到的正常性仅限于标注的正常节点,往往倾向于过拟合给定模式。这可能导致高检测错误,例如高误报率。为克服这一限制,我们提出GraphNC,一种图正常性校准框架,利用标注和未标注数据,在异常分数和节点表示空间中联合校准来自教师模型(预训练的半监督GAD模型)的正常性。GraphNC包括两个主要组件:异常分数分布对齐(ScoreDA)和基于扰动的正常性正则化(NormReg)。ScoreDA通过将我们模型的异常分数与教师模型产生的分数分布对齐来优化异常分数。由于教师模型中大多数正常节点和部分异常节点的分数准确,分数对齐有效地将正常类和异常类的异常分数拉向两端,从而产生更可分离的异常分数。然而,教师模型存在不准确的分数。为减轻这些分数的误导,设计了NormReg来在表示空间中正则化图正常性,通过仅在标注节点上最小化扰动引导的一致性损失,使正常节点的表示更紧凑。

英文摘要

Graph anomaly detection (GAD) has attracted growing interest for its crucial ability to uncover irregular patterns in broad applications. Semi-supervised GAD, which assumes a subset of annotated normal nodes is available during training, is among the most widely explored application settings. However, the normality learned by existing semi-supervised GAD methods is limited to the labeled normal nodes and often overfits the given patterns. This can lead to high detection errors, such as high false-positive rates. To overcome this limitation, we propose GraphNC, a graph normality calibration framework that leverages both labeled and unlabeled data to calibrate the normality from a teacher model (a pre-trained semi-supervised GAD model) jointly in the anomaly score and node representation spaces. GraphNC includes two main components: anomaly score distribution alignment (ScoreDA) and perturbation-based normality regularization (NormReg). ScoreDA optimizes the anomaly scores of our model by aligning them with the score distribution yielded by the teacher model. Because the teacher model's scores are accurate for most normal nodes and part of the anomalous nodes, the score alignment effectively pulls the anomaly scores of the normal and abnormal classes toward the two ends, resulting in more separable anomaly scores. Nevertheless, some of the teacher model's scores are inaccurate. To avoid being misled by these scores, NormReg is designed to regularize the graph normality in the representation space, making the representations of normal nodes more compact by minimizing a perturbation-guided consistency loss solely on the labeled nodes.

2602.08785 2026-06-09 cs.LG 版本更新

A Graphop Analysis of Graph Neural Networks on Sparse Graphs: Generalization and Universal Approximation

稀疏图上图神经网络的图算子分析:泛化与通用逼近

Ofek Amran, Tom Gilat, Ron Levie

发表机构 * Faculty of Mathematics Technion – Israel Institute of Technology(以色列理工学院数学系)

AI总结 提出统一度量空间,涵盖稀疏与稠密图,证明MPNN等度连续,从而改进通用逼近定理和泛化界。

详情
AI中文摘要

消息传递图神经网络(MPNN)的泛化和逼近能力通常通过在输入图空间上定义紧度量来研究,在该度量下MPNN是等度连续的。这类分析有两种:1)当度量空间包含无界大小的图时,该理论仅适用于稠密图;2)当研究稀疏图时,度量空间仅包含有界大小的图。在这项工作中,我们提出了一种统一的方法,在所有大小的图(包括稀疏和稠密)的空间上定义一个紧度量,在该度量下MPNN是等度连续的。这导致了比先前工作更强大的通用逼近定理和泛化界。该理论基于并扩展了最近一种称为图算子分析的图极限理论方法。

英文摘要

Generalization and approximation capabilities of message passing graph neural networks (MPNNs) are often studied by defining a compact metric on a space of input graphs under which MPNNs are equicontinuous. Such analyses are of two varieties: 1) when the metric space includes graphs of unbounded sizes, the theory is only appropriate for dense graphs, and, 2) when studying sparse graphs, the metric space only includes graphs of uniformly bounded size. In this work, we present a unified approach, defining a compact metric on the space of graphs of all sizes, both sparse and dense, under which MPNNs are equicontinuous. This leads to more powerful universal approximation theorems and generalization bounds than previous works. The theory is based on, and extends, a recent approach to graph limit theory called graphop analysis.

2604.17324 2026-06-09 cs.LG cs.AI 版本更新

Capacity-Controlled Global Attention for Graph Transformers

具有容量控制的全局注意力用于图变换器

Yang Liu, Dongxin Guo, Tom Zheng, Siu Ming Yiu, Liam Ning, Jikun Wu

发表机构 * Brain Investing Limited The University of Hong Kong(香港大学) Stellaris AI Limited

AI总结 本文提出SigGate-GT,通过在图变换器中引入可学习的sigmoid门来放宽全局注意力中softmax带来的守恒约束,从而缓解过平滑、低秩瓶颈和训练不稳定等问题,提升了多个基准测试的性能。

Comments 13 pages, 2 figures, 15 tables

详情
AI中文摘要

全局自注意力推动了现代图变换器,但其核心的softmax操作引入了一个很少被直接考察的结构约束:每个注意力行非负且和为一,因此每个头的输出是值向量的质量守恒凸组合。一个节点永远无法“不关注任何东西”。我们认为这种守恒约束是三个通常被孤立研究的病理的共同根源:节点表示随深度加深而崩溃(过平滑)、每个头输出的低秩瓶颈,以及深层堆叠中的脆弱优化。借鉴sigmoid门控在语言模型中消除类似注意力汇聚点(attention sink)的方式,我们引入SigGate-GT,一种在GraphGPS框架内对注意力输出施加可学习、按头、输入条件化sigmoid门的图变换器。该门是一种平滑的、按维度的“音量控制”,可将头输出驱向零,从而在不放弃注意力概率解释的前提下放宽该约束。通过分析和合成实验,我们证明该门严格增加每个头输出的稳定秩,并将此秩增益与上述三种表现联系起来。在五个分子和长距离基准上,SigGate-GT在ZINC上匹配先前最佳(0.059 MAE),在ogbg-molhiv上取得我们所评估的图变换器基线中的最强结果(82.47% ROC-AUC),在ogbg-molpcba和长距离图基准上具有竞争力,且在所有五个数据集上相对GraphGPS均有统计显著的提升(p < 0.05)。机制分析证实了这一诊断:门控减缓了过平滑(在4-16层中表示多样性平均相对增益30%),防止注意力熵崩溃,并在10倍学习率范围内稳定训练,在OGB上的参数开销约为1%,实际运行时间开销低于3%。

英文摘要

Global self-attention drives modern graph transformers, yet the softmax at its core imposes a structural constraint rarely examined directly: every attention row is non-negative and sums to one, so each per-head output is a mass-conserving convex combination of value vectors. A node can never "attend to nothing." We argue this conservation constraint is a single root cause behind three pathologies usually studied in isolation: the collapse of node representations with depth (over-smoothing), a low-rank bottleneck on per-head outputs, and brittle optimization in deep stacks. Drawing on how sigmoid gating removes analogous attention sinks in language models, we introduce SigGate-GT, a graph transformer that applies a learned, per-head, input-conditioned sigmoid gate to the attention output inside the GraphGPS framework. The gate is a smooth, per-dimension "volume control" that can drive head outputs toward zero, relaxing the constraint without abandoning attention's probabilistic interpretation. Analytically and through synthetic experiments, we show the gate strictly increases the stable rank of per-head outputs, and connect this rank gain to all three manifestations. On five molecular and long-range benchmarks, SigGate-GT matches the prior best on ZINC (0.059 MAE), records the strongest result among the graph-transformer baselines we evaluate on ogbg-molhiv (82.47% ROC-AUC), and is competitive on ogbg-molpcba and the Long-Range Graph Benchmark, with statistically significant gains over GraphGPS on all five datasets (p < 0.05). Mechanism analyses confirm the diagnosis: gating slows over-smoothing (a 30% mean relative gain in representation diversity across 4-16 layers), keeps attention entropy from collapsing, and stabilizes training across a 10x learning-rate range, at about 1% parameter overhead on OGB and under 3% wall-clock cost.
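
The conservation constraint and the effect of a gate can be seen in a few lines. The sketch below is a minimal single-head illustration in plain Python, assuming a gate of the form sigmoid(x_i W_gate + b_gate) applied elementwise to the attention output; the parameter names and shapes are hypothetical, and the abstract does not specify how SigGate-GT parameterizes its gate.

```python
import math


def softmax(row):
    """Numerically stable softmax: non-negative entries that sum to one."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def attention_weights(Q, K):
    """Scaled dot-product attention weights, one softmax row per query."""
    scale = math.sqrt(len(Q[0]))
    return [softmax([sum(q * k for q, k in zip(qi, kj)) / scale for kj in K]) for qi in Q]


def gated_head(X, Q, K, V, W_gate, b_gate):
    """Single attention head whose output is scaled by an input-conditioned gate.

    W_gate has shape (len(X[0]), len(V[0])); b_gate has length len(V[0]).
    """
    A = attention_weights(Q, K)
    dim_out = len(V[0])
    outputs = []
    for i, a_row in enumerate(A):
        # Convex combination of value vectors: this is the conserved part.
        mixed = [sum(a * v[c] for a, v in zip(a_row, V)) for c in range(dim_out)]
        # Per-dimension gate computed from the node's own features.
        gate = [
            sigmoid(sum(x * W_gate[r][c] for r, x in enumerate(X[i])) + b_gate[c])
            for c in range(dim_out)
        ]
        outputs.append([g * m for g, m in zip(gate, mixed)])
    return A, outputs
```

With a strongly negative gate bias the head output shrinks toward zero even though every attention row still sums to one, which is the sense in which a gated node can effectively "attend to nothing."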

2510.26307 2026-06-09 cs.CR cs.LG 版本更新

A Survey of Heterogeneous Graph Neural Networks for Cybersecurity Anomaly Detection

异构图神经网络在网络安全异常检测中的综述

Laura Jiang, Reza Ryan, Qian Li, Nasim Ferdosian

发表机构 * GitHub

AI总结 本文综述了异构图神经网络在网络安全异常检测中的应用,分析了不同类型异常和图动态的分类方法,评估了常用数据集和指标,并指出了建模、数据和部署中的关键挑战。

Comments 23 pages, 7 figures, and 97 references. Accepted by the Journal of Computer Security

详情
AI中文摘要

异常检测是网络安全中的关键任务,识别内部威胁、访问违规和协同攻击对确保系统韧性至关重要。基于图的方法在建模实体交互中变得越来越重要,但大多数方法依赖于同质和静态结构,限制了其捕捉现实环境异质性和时间演变的能力。异构图神经网络(HGNN)通过引入类型感知转换和关系敏感聚合,成为异常检测的有前途的范式,能够更有效地建模复杂的网络数据。然而,目前关于基于HGNN的异常检测的研究仍零散,存在多样化的建模策略、有限的比较评估和缺乏标准化基准。为解决这一差距,本文提供了网络安全中基于HGNN的异常检测方法的全面综述。我们介绍了一种分类法,按异常类型和图动态对方法进行分类,分析了代表性模型,并将其映射到关键网络安全应用。我们还回顾了常用基准数据集和评估指标,突显其优势和局限性。最后,我们指出了与建模、数据和部署相关的关键开放挑战,并概述了未来研究的有前途方向。本文综述旨在建立一个结构化的基础,推动基于HGNN的异常检测向可扩展、可解释和可实际部署的解决方案发展。

英文摘要

Anomaly detection is a critical task in cybersecurity, where identifying insider threats, access violations, and coordinated attacks is essential for ensuring system resilience. Graph-based approaches have become increasingly important for modeling entity interactions, yet most rely on homogeneous and static structures, which limits their ability to capture the heterogeneity and temporal evolution of real-world environments. Heterogeneous Graph Neural Networks (HGNNs) have emerged as a promising paradigm for anomaly detection by incorporating type-aware transformations and relation-sensitive aggregation, enabling more expressive modeling of complex cyber data. However, current research on HGNN-based anomaly detection remains fragmented, with diverse modeling strategies, limited comparative evaluation, and an absence of standardized benchmarks. To address this gap, we provide a comprehensive survey of HGNN-based anomaly detection methods in cybersecurity. We introduce a taxonomy that classifies approaches by anomaly type and graph dynamics, analyze representative models, and map them to key cybersecurity applications. We also review commonly used benchmark datasets and evaluation metrics, highlighting their strengths and limitations. Finally, we identify key open challenges related to modeling, data, and deployment, and outline promising directions for future research. This survey aims to establish a structured foundation for advancing HGNN-based anomaly detection toward scalable, interpretable, and practically deployable solutions.

10. 迁移、元学习与持续学习 21 篇

2606.07603 2026-06-09 cs.LG cs.AI 新提交

MetaEvo: A Meta-Optimization Framework for Experience-Driven Agent Evolution

MetaEvo:一种基于经验驱动的智能体进化的元优化框架

Bowen Ren, Heyan Huang, Yinghao Li, Yang Gao

发表机构 * School of Computer Science and Technology, Beijing Institute of Technology(北京理工大学计算机科学与技术学院) Beijing Institute of Technology Southeast Academy of Information Technology(北京理工大学东南信息技术研究院)

AI总结 提出MetaEvo两阶段框架,通过偏好优化增强模型从任务经验中抽象原则的能力,并在模块化架构中积累复用,持续提升推理性能。

详情
AI中文摘要

大型语言模型(LLM)展现出强大的推理能力,但大多数基于LLM的智能体是静态部署的,无法通过任务交互进行改进。现有的经验驱动方法通常依赖于记忆或启发式方法,而不增强模型的学习能力,将其视为被动执行者,导致早期性能平台和有限的长期改进。为了解决这个问题,我们提出了MetaEvo,一个用于持续智能体进化的两阶段框架,专注于改进模型如何从任务经验中学习,而不仅仅是存储什么。MetaEvo首先应用基于偏好的优化来增强模型的原则抽象能力,然后在模块化智能体架构中实现这些原则的积累和重用。在多样化推理基准上的实验结果表明,MetaEvo始终优于强基线,并在迭代中保持可靠的改进。这些发现验证了元优化在使智能体从经验中学习并持续增强其推理能力方面的有效性。

英文摘要

Large language models (LLMs) exhibit strong reasoning capabilities, yet most LLM-based agents are statically deployed and unable to improve through task interactions. Existing experience-driven methods often rely on memory or heuristics without enhancing the model's ability to learn, treating it as a passive executor and leading to early performance plateaus and limited long-term improvement. To address this issue, we propose MetaEvo, a two-stage framework for continual agent evolution that focuses on improving how the model learns from task experience, rather than solely on what it stores. MetaEvo first applies preference-based optimization to enhance the model's ability to abstract principles, then enables the accumulation and reuse of these principles within a modular agent architecture. Experimental results on diverse reasoning benchmarks demonstrate that MetaEvo consistently outperforms strong baselines and maintains reliable improvement across iterations. These findings validate the effectiveness of meta-optimization in enabling agents to learn from experience and continually enhance their reasoning capabilities.

2606.07627 2026-06-09 cs.LG math.AT math.CT 新提交

Learning Transfers: Kan Extensions for Neural Invariants

学习迁移:神经不变量的Kan扩展

Luciano Melodia

发表机构 * Friedrich-Alexander Universität Erlangen-Nürnberg(埃尔朗根-纽伦堡大学)

AI总结 提出用范畴论中的Kan扩展形式化迁移学习中的结构不变量,定义传递差异度量,并在链复形和持久模块中给出有限余核公式,通过瓶颈距离计算持久值不变量,实验验证了该方法能识别正确的任务函子并检测破坏迁移相关拓扑的表征坍塌。

详情
AI中文摘要

迁移学习假设在源任务上学习到的表征携带的结构在相关目标任务上仍然可用。标准评估通过目标准确率或分布差异来探测,但未明确说明哪种结构不变量被迁移。我们以范畴论的方式提供了这一不变量。源任务范畴$\mathcal A$、目标任务范畴$\mathcal B$和任务变化函子$J:\mathcal A\to\mathcal B$决定了,对于每个不变量值的源表征$F:\mathcal A\to\mathcal V$,存在通用的迁移不变量$\operatorname{Lan}_J F$。给定目标不变量$G:\mathcal B\to\mathcal V$,我们定义迁移差异$\operatorname{Comp}_J(F,G)=\sup_{b\in\operatorname{Ob}(\mathcal B)} d_{\mathcal V}\bigl((\operatorname{Lan}_J F)(b),G(b)\bigr)$,该评估不是通过源和目标的对象级比较,而是将目标不变量与由指定任务变换强制得到的不变量进行比较。我们证明了链复形和持久模块中$(\operatorname{Lan}_J F)(b)$的有限余核公式,其索引由逗号范畴$J\downarrow b$给出。对于持久值有限型单参数不变量,差异通过条形码之间的瓶颈距离精确计算。在神经潜在点云上的控制实验测试了该分数是否能恢复正确的任务函子,并检测出那些保持分类准确率但破坏迁移相关拓扑的表征坍塌。

英文摘要

Transfer learning presumes that a representation learned on source tasks carries structure that remains usable on related target tasks. Standard evaluations probe this through target accuracy or distributional discrepancy, yet leave unspecified which structural invariant is meant to transfer. We supply that invariant categorically. A source task category $\mathcal A$, a target task category $\mathcal B$, and a task-change functor $J:\mathcal A\to\mathcal B$ determine, for every invariant-valued source representation $F:\mathcal A\to\mathcal V$, the universal transferred invariant $\operatorname{Lan}_J F$. Given a target invariant $G:\mathcal B\to\mathcal V$, we define the transfer discrepancy $\operatorname{Comp}_J(F,G)=\sup_{b\in\operatorname{Ob}(\mathcal B)} d_{\mathcal V}\bigl((\operatorname{Lan}_J F)(b),G(b)\bigr)$, evaluating transfer not by an objectwise comparison of source and target, but by comparing the target invariant against the one forced by the prescribed task transformation. We prove finite cokernel formulas for $(\operatorname{Lan}_J F)(b)$ in chain complexes and persistence modules, indexed by the comma category $J\downarrow b$. For persistence-valued finite-type one-parameter invariants, the discrepancy is computed exactly by bottleneck distances between barcodes. Controlled experiments on neural latent point clouds then test whether the score recovers the correct task functor and flags representation collapses that preserve classification accuracy while destroying transfer-relevant topology.
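
For readers less familiar with Kan extensions, the standard pointwise formula may help explain why the comma category $J\downarrow b$ indexes the computation. It is stated here under the assumption that $\mathcal V$ has the relevant colimits; the abstract itself only mentions finite cokernel formulas in chain complexes and persistence modules, so this is background rather than the paper's result.

```latex
\[
(\operatorname{Lan}_J F)(b) \;\cong\;
\operatorname*{colim}_{(a,\; f\colon J(a)\to b)\,\in\, J\downarrow b} F(a),
\qquad
\operatorname{Comp}_J(F,G) \;=\;
\sup_{b\in\operatorname{Ob}(\mathcal B)}
d_{\mathcal V}\bigl((\operatorname{Lan}_J F)(b),\, G(b)\bigr).
\]
```

Each target task $b$ therefore receives the invariant obtained by gluing together the source invariants $F(a)$ of all source tasks that map into it, and the discrepancy measures how far the observed target invariant $G(b)$ is from that glued prediction.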

2606.07711 2026-06-09 cs.LG cs.AI 新提交

Rosetta Memory: Adaptive Memory for Cross-LLM Agents

Rosetta Memory: 跨LLM智能体的自适应记忆

Hao Yang, Shiqi Shen, Haoxuan Li, Zhipeng Wang, Zhi Gong, Xu Chen

发表机构 * Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) Weixin, Tencent(腾讯微信) Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院)

AI总结 提出记忆中心式LLM自适应方法,通过双轮廓条件算子与最小增益采样课程,解决上游记忆激活下游LLM的跨模型适应问题,在多项QA任务中优于基线。

Comments 19 pages, 7 figures

详情
AI中文摘要

记忆是将无状态LLM转变为持久、不断进化的智能体的关键组件,通过经验积累、长程规划和持续自我改进实现。现有记忆系统通常以LLM为中心,并针对特定主干设计记忆操作。然而,在实践中,用户经常切换LLM,例如在编码时使用Claude、在写作时使用GPT,或在单个任务中将不同步骤路由到不同主干以实现成本效益权衡。因此,一个模型写入的记忆通常需要被另一个模型消费。使上游记忆有效适应并激活下游LLM仍然是一个关键但未被充分探索的问题。为弥合这一差距,我们将视角从以LLM为中心的记忆设计转变为以记忆为中心的LLM自适应。具体而言,我们从写入和读取两侧处理上述上下游记忆适应问题,并设计两个轮廓条件算子,它们联合训练以优化记忆存储和呈现方式,从而更好地完成任务。为确保学习到的算子能泛化到广泛的LLM集合,我们提出一种最小增益采样课程,在训练期间优先服务最不被照顾的LLM。为更好地衡量算子的实际贡献而非LLM自身能力,我们设计了一种性能差距奖励,与朴素记忆基线进行比较。在HotpotQA、2WikiMultihopQA和MuSiQue上的实验表明,我们的模型持续优于基线,并且在未见模型替换下保持鲁棒性。

英文摘要

Memory is the key component for transforming a stateless LLM into a persistent, evolving agent through experience accumulation, long-horizon planning, and continual self-improvement. Existing memory systems typically take the LLM as the center and design memory operations tailored to a specific backbone. In practice, however, users frequently switch between LLMs, for example using Claude for coding and GPT for writing across tasks, or routing different steps to different backbones within a single task for cost-effective trade-offs. As a result, memory written by one model often needs to be consumed by another. Making upstream memory effectively adapt to and activate downstream LLMs remains a critical yet underexplored problem. To bridge this gap, we shift the perspective from LLM-centric memory design to memory-centric LLM adaptation. Specifically, we approach the above upstream-downstream memory adaptation problem from both the write and read sides, and design two profile-conditioned operators that are jointly trained to optimize how memory is stored and presented for better task completion. To ensure the learned operators generalize across a broad set of LLMs, we propose a minimum-gain sampling curriculum that prioritizes the least-served LLMs during training. To better measure the operators' actual contribution rather than the LLM's own capability, we design a performance-gap reward that compares against a naive memory baseline. Experiments on HotpotQA, 2WikiMultihopQA, and MuSiQue demonstrate that our model consistently outperforms baselines and remains robust under unseen-model replacement.

2606.08013 2026-06-09 cs.LG 新提交

Evaluating the Impact of Task Granularity on Catastrophic Forgetting in Continual Learning

评估任务粒度对持续学习中灾难性遗忘的影响

Emre Alyamac, Himanshu Janmeda, Shashwat Krishna, Yash Vijay

发表机构 * College of Engineering(工程学院) College of Natural Science(自然科学学院)

AI总结 研究任务粒度顺序对持续学习中灾难性遗忘的影响,通过CIFAR-100上的粗到细、细到粗和平坦三种训练策略,结合弹性权重巩固(EWC)方法,发现先学习一般类别可减少遗忘。

Comments 8 pages, 4 figures, 5 tables

详情
AI中文摘要

灾难性遗忘,即学习新信息时突然丢失先前获得的知识,仍然是持续学习中的核心挑战。本项目研究模型学习信息的顺序是否影响其保留知识的能力。具体而言,我们提出疑问:先学习一般类别(如“动物” vs “交通工具”)再学习具体类别(如“狗” vs “猫”)是否比一次性学习所有类别更能减少遗忘?我们在CIFAR-100上测试了三种方法:(1)粗到细:先训练2个超类,再扩展到10个具体子类;(2)细到粗:先训练10个子类,再分组为2个超类;(3)平坦:从一开始就训练所有10个类别。我们使用弹性权重巩固(EWC)来防止过渡期间的遗忘。我们的假设是,先学习一般模式可以为模型建立一个稳定的基础,帮助其在学习更详细区分时保留知识。我们使用标准指标(准确率、精确率、召回率、F1)以及持续学习指标(如反向迁移和遗忘率)进行评估。这项工作可为需要增量学习的实际系统设计学习序列提供参考。

英文摘要

Catastrophic forgetting, the abrupt loss of previously acquired knowledge upon learning new information, remains the central challenge in Continual Learning. This project investigates whether the order in which a model learns information affects how well it retains knowledge. Specifically, we ask: does learning general categories first (like "animals" vs "vehicles") before learning specific classes (like "dog" vs "cat") reduce forgetting compared to learning all classes at once? We test three approaches on CIFAR-100: (1) Coarse-to-Fine: train on 2 super-classes, then expand to 10 specific sub-classes, (2) Fine-to-Coarse: train on 10 sub-classes, then group into 2 super-classes, and (3) Flat: train on all 10 classes from the start. We use Elastic Weight Consolidation (EWC) to prevent forgetting during transitions. Our hypothesis is that learning general patterns first creates a stable foundation that helps the model retain knowledge when learning more detailed distinctions. We evaluate using standard metrics (accuracy, precision, recall, F1) plus continual learning metrics like backward transfer and forgetting rates. This work could inform how we design learning sequences for real-world systems that need to learn incrementally.
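
Two of the ingredients named above have compact standard definitions, sketched below under stated assumptions: the EWC penalty in its usual diagonal-Fisher form, and backward transfer computed from an accuracy matrix `R` where `R[t][i]` is the accuracy on task `i` after training through task `t`. The function names and the toy numbers are illustrative only; the abstract does not report the project's exact formulas or results.

```python
def ewc_penalty(theta, theta_star, fisher, lam):
    """Quadratic EWC penalty: (lam / 2) * sum_i F_i * (theta_i - theta*_i)^2.

    theta_star holds the parameters learned on the earlier stage and fisher the
    diagonal Fisher information estimates that weight each parameter's importance.
    """
    return 0.5 * lam * sum(
        f * (t - s) ** 2 for t, s, f in zip(theta, theta_star, fisher)
    )


def backward_transfer(R):
    """Average change in accuracy on earlier tasks after all training is done.

    Negative values indicate forgetting; R must be a T x T matrix with T >= 2.
    """
    T = len(R)
    return sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)


# Hypothetical two-stage run: accuracy on stage 0 drops from 0.9 to 0.7.
R_example = [[0.9, 0.0], [0.7, 0.8]]
bwt = backward_transfer(R_example)
penalty = ewc_penalty([1.0, 2.0], [0.0, 2.0], [4.0, 1.0], lam=0.5)
```

During the second stage the penalty is added to the new-task loss, so parameters with large Fisher values are held close to their earlier values while unimportant ones remain free to change.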

2606.08155 2026-06-09 cs.LG cs.IR 新提交

Have I Solved This Before? Retrieving Similar Segmentation Problems for Evolutionary Learning

我以前解决过这个问题吗?检索相似分割问题进行进化学习

Andreas Margraf, Henning Cui, Jörg Hähner

发表机构 * University of Augsburg(奥格斯堡大学)

AI总结 提出一种基于检索相似分割问题的进化学习方法,通过重用已有管道避免从头训练模型,降低开发成本,并分析跨域迁移的可行性。

详情
AI中文摘要

监控系统的可靠集成和稳固配置是实现现代制造环境高效率和生产率的基本前提。关于传感器类型和系统架构的设计决策必须在早期阶段且在高不确定性下做出。本文研究了一种偏离传统监控系统开发过程的研究方向,将注意力从算法设计转向对检测问题的更深入分析。与传统设计周期不同,本文提出逐步收集知识并将其存储在抽象系统模型中。这使得能够检索未来用例的相似解决方案,避免了昂贵的从头开始模型训练,而是允许对现有基础配置进行增量改进。重用先前生成的管道降低了后期昂贵修订的风险。由于关于滤波器管道的跨域可转移性知之甚少,本研究分析了检索滤波器管道以将其转移到不同但相似的分割问题的潜力。最后,我们统计分析了这种主要应用于图像分割问题的“迁移学习”变体的优势。此外,我们讨论了简单模型如何帮助在设计过程中平衡复杂性、技术要求和可靠性之间的权衡。

英文摘要

Reliable integration and solid configuration of monitoring systems are fundamental prerequisites for achieving high efficiency and productivity in contemporary manufacturing environments. Design decisions on sensor type and system architecture have to be made at an early stage and under comparatively high uncertainty. This work investigates a research direction that deviates from the traditional monitoring-system development process by shifting the attention from algorithm design to a deeper analysis of the inspection problem. In contrast to traditional design cycles, this paper proposes to gradually collect knowledge and store it in an abstract system model. This enables the retrieval of similar solutions for future use cases, avoiding expensive model training from scratch and allowing instead for the incremental refinement of existing base configurations. Reuse of previously generated pipelines reduces the risk of late and costly revisions. As little is known about the cross-domain transferability of filter pipelines, this study analyzes the potential of retrieving filter pipelines and transferring them to different but similar segmentation problems. Finally, we statistically analyze the benefits of this 'transfer learning' variant, which is predominantly applied to image segmentation problems. In addition, we discuss how simple models help balance the trade-off between complexity, technical requirements, and reliability in the design process.

2606.08447 2026-06-09 cs.LG cs.AI 新提交

Not Just After One: Sleep-Inspired Replay Prevents Catastrophic Forgetting After Sequential Tasks

不仅仅是在一次之后:受睡眠启发的回放防止顺序任务后的灾难性遗忘

Anthony Bazhenov, Jean Erik Delanois, Giri P. Krishnan

发表机构 * Department of Neuroscience, University of California, San Diego, CA, USA(美国加利福尼亚州圣地亚哥,加州大学圣地亚哥分校神经科学系)

AI总结 提出受睡眠启发的无监督回放机制,在多个新任务顺序训练后应用,以部分恢复所有先前学习任务的性能,防止灾难性遗忘。

详情
AI中文摘要

人工神经网络的关键限制之一是缺乏持续学习的能力:在新任务上训练常常导致对先前任务的干扰和遗忘。尽管已有几种算法被提出以保护旧记忆免受干扰,但它们通常在每个新训练阶段期间或之后立即应用。相比之下,人类和动物可以持续学习,在主动学习期间获取多个新记忆,然后将它们全部巩固到长期存储中。在这里,我们展示了多个新任务可以顺序训练,然后应用无监督的睡眠样回放阶段,以部分恢复所有先前学习任务的性能。我们的研究进一步表明,任务特定信息对新训练具有弹性,但随着网络在新任务上训练而逐渐衰减。这些发现为开发广泛范围的持续学习AI解决方案提供了新颖的原则。

英文摘要

One of the critical limitations of artificial neural networks is their inability to learn continually: training on new tasks often leads to interference with, and forgetting of, the previous ones. While several algorithms have been proposed to protect old memories from interference, they are typically applied during or immediately after each new episode of training. In contrast, humans and animals can learn continuously, acquiring multiple new memories during active learning before consolidating all of them into long-term storage. Here we show that multiple new tasks can be trained sequentially before an unsupervised sleep-like replay phase is applied to partially restore performance across all previously learned tasks. Our study further suggests that task-specific information remains resilient to new training but decays gradually as the network is trained on new tasks. These findings point to novel principles for developing a broad range of continual learning AI solutions.

2606.08452 2026-06-09 cs.LG 新提交

Theoretical Foundations of Continual Learning via Drift-Plus-Penalty

基于漂移加惩罚的持续学习的理论基础

Nazreen Shah, Govinda Arya, Bharath B. N., Ranjitha Prasad

发表机构 * IIIT Delhi(德里英德拉普拉斯塔信息技术学院) IIT Dharwad(印度理工学院达尔瓦德分校)

AI总结 提出COLD框架,利用漂移加惩罚原理调节稳定性-可塑性权衡,通过虚拟队列控制遗忘,理论保证收敛性,实验优于现有方法。

Comments Accepted to Transactions on Machine Learning Research (TMLR)

详情
AI中文摘要

在许多实际场景中,数据流是非平稳的且顺序到达,要求学习系统在不从头重新训练的情况下持续适应。持续学习通过整合新任务同时缓解灾难性遗忘来应对这一挑战,其中学习新信息会降低先前知识的性能。我们引入了一种控制理论视角来明确调节遗忘的演化,将适应视为受长期稳定性约束的受控过程。我们专注于基于回放的持续学习,其中有限的内存缓冲区存储来自先前任务的代表性样本。我们提出了基于漂移加惩罚原理的持续学习框架COLD,该原理来自随机优化。为了便于分析,我们还考虑了一种oracle变体COLD-ORACLE作为参考基准。在每个任务中,两种方法都最小化当前任务损失,同时维护一个虚拟队列,该队列跟踪先前学习任务上长期稳定性的偏差,将稳定性-可塑性权衡捕捉为受调节的动态过程。我们建立了稳定性和收敛性保证,通过可调控制参数表征这种权衡。在标准基准上的实验表明,COLD在提供竞争性和可控的遗忘行为的同时,通过显式调节稳定性和可塑性,始终优于广泛的最先进的持续学习方法。

英文摘要

In many real-world settings, data streams are nonstationary and arrive sequentially, requiring learning systems to adapt continuously without retraining from scratch. Continual learning (CL) addresses this challenge by incorporating new tasks while mitigating catastrophic forgetting, where learning new information degrades performance on previously acquired knowledge. We introduce a control-theoretic perspective on CL that explicitly regulates the evolution of forgetting, framing adaptation as a controlled process subject to long-term stability constraints. We focus on replay-based CL, where a finite memory buffer stores representative samples from prior tasks. We propose COntinual Learning with Drift-Plus-Penalty (COLD), a continual learning framework based on the Drift-Plus-Penalty (DPP) principle from stochastic optimization. To facilitate analysis, we also consider an oracle variant, COLD-ORACLE, as a reference benchmark. At each task, both methods minimize the current task loss while maintaining a virtual queue that tracks deviations from long-term stability on previously learned tasks, capturing the stability-plasticity trade-off as a regulated dynamical process. We establish stability and convergence guarantees that characterize this trade-off through a tunable control parameter. Experiments on standard benchmarks demonstrate that COLD consistently outperforms a broad range of state-of-the-art CL methods while providing competitive and controllable forgetting behavior through explicit regulation of stability and plasticity.
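
The drift-plus-penalty mechanism can be sketched generically: a virtual queue accumulates violations of a forgetting budget, and each step minimizes V times the task loss plus the queue length times the forgetting term. The code below is a toy illustration of that principle with two hypothetical candidate updates; the names, the budget, and the numbers are assumptions, and it is not the COLD algorithm itself.

```python
def update_queue(queue, forgetting, budget):
    """Virtual queue: grows when forgetting exceeds the budget, drains otherwise."""
    return max(queue + forgetting - budget, 0.0)


def dpp_score(task_loss, forgetting, queue, v):
    """Drift-plus-penalty score: V * penalty + Q * constraint term.

    The constant budget is dropped because it does not change the minimizer.
    """
    return v * task_loss + queue * forgetting


def choose_update(candidates, queue, v):
    """Pick the candidate update with the smallest drift-plus-penalty score."""
    return min(
        candidates,
        key=lambda c: dpp_score(c["task_loss"], c["forgetting"], queue, v),
    )


def run(candidates, steps, budget, v):
    """Simulate repeated decisions and return the chosen names and final queue."""
    queue, picks = 0.0, []
    for _ in range(steps):
        best = choose_update(candidates, queue, v)
        picks.append(best["name"])
        queue = update_queue(queue, best["forgetting"], budget)
    return picks, queue
```

A larger V favors plasticity (low current-task loss), while a growing queue shifts the choice toward stability; this is the role the tunable control parameter plays in the trade-off described above.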

2606.08691 2026-06-09 cs.LG stat.ME 新提交

Hierarchical Projection for Adaptive Knowledge Transfer

自适应知识迁移的分层投影

Samhita Pal, Tian Gu

发表机构 * Vanderbilt University Medical Center(范德比尔特大学医学中心) Columbia University(哥伦比亚大学)

AI总结 提出ProjectionTL框架,通过分层贝叶斯建模与自适应投影实现源选择与特征选择,缓解负迁移,提升跨域学习的准确性、稳定性和可解释性。

详情
AI中文摘要

现代数据驱动应用越来越多地涉及从多个异质源中学习,其中目标数据集有限,但跨域可获得相关信息。当相关性变化或存在虚假信号时,简单组合这些源会降低性能,这对可信的跨域学习构成了根本性挑战。我们提出了投影迁移学习(ProjectionTL),这是一个统一框架,将分层贝叶斯建模与自适应投影相结合,用于选择性知识迁移。关键思想是在两个层次上解耦迁移:首先,我们构建一个源引导的分层先验,通过数据驱动的权重聚合跨源信息,捕捉每个源与目标之间的全局对齐;其次,我们通过后验投影步骤在特征层面细化这种借用,选择性地保留与目标信号局部一致的坐标。这种两阶段设计使该方法能够同时进行源选择和特征选择,从而减轻负迁移,同时保持可解释性。ProjectionTL提供了一种跨域整合异质数据的原则性方法,桥接了统计建模和现代机器学习范式,以实现鲁棒且可解释的迁移。通过模拟和真实世界的生物医学应用,我们证明了与现有方法相比,准确性、稳定性和可解释性的提升。我们的框架为高维设置下的可信跨域学习提供了一种可扩展且通用的策略。

英文摘要

Modern data-driven applications increasingly involve learning from multiple heterogeneous sources, where a target dataset is limited but related information is available across domains. Naively combining these sources can degrade performance when relevance varies or spurious signals are present, posing a fundamental challenge for trustworthy cross-domain learning. We propose Projection Transfer Learning (ProjectionTL), a unified framework that integrates hierarchical Bayesian modeling with adaptive projection for selective knowledge transfer. The key idea is to decouple transfer at two levels: first, we construct a source-guided hierarchical prior that aggregates information across sources using data-driven weights, capturing global alignment between each source and the target; second, we refine this borrowing through a posterior-projection step that operates at the feature level, selectively retaining coordinates that exhibit local agreement with the target signal. This two-stage design enables the method to simultaneously perform source selection and feature selection, thereby mitigating negative transfer while preserving interpretability. ProjectionTL provides a principled approach to integrating heterogeneous data across domains, bridging statistical modeling and modern machine learning paradigms for robust and interpretable transfer. Through simulations and real-world biomedical applications, we demonstrate improved accuracy, stability, and interpretability compared to existing methods. Our framework offers a scalable and generalizable strategy for trustworthy cross-domain learning in high-dimensional settings.

2606.09052 2026-06-09 cs.LG cs.AI cs.CL cs.GT stat.ML 新提交

INFUSER: Influence-Guided Self-Evolution Improves Reasoning

INFUSER: 影响力引导的自我进化提升推理能力

Siyu Chen, Miao Lu, Beining Wu, Heejune Sheen, Fengzhuo Zhang, Shuangning Li, Zhiyuan Li, Jose Blanchet, Tianhao Wang, Zhuoran Yang

发表机构 * Yale University(耶鲁大学) Stanford University(斯坦福大学) University of Chicago(芝加哥大学) Toyota Technological Institute at Chicago(芝加哥丰田技术研究所) University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出INFUSER框架,通过生成器与求解器的协同进化,利用影响力分数和DuGRPO优化,从文档池中自适应生成训练数据,显著提升模型推理性能。

Comments 66 pages, 17 figures

详情
AI中文摘要

自我进化为更强的推理提供了一条可扩展的路径:预训练语言模型仅需极少的外部监督即可自我改进。然而,现有方法要么依赖于大量精心策划或教师生成的训练数据,要么在生成器无监督运行时,使用未必能改进求解器的难度启发式方法对其进行奖励。我们引入了INFUSER,一个迭代协同训练框架,包含两个共同进化的角色:一个生成器,从自动收集的非结构化文档池中起草问题并参考标准答案;一个求解器,通过在这些数据上训练来改进。求解器使用标准正确性奖励(针对生成器提供的答案)进行训练,而生成器则通过一种优化器感知的影响力分数获得奖励,该分数衡量每个提出的问题是否真正能改进求解器在目标分布上的表现。由于这种连续、有噪声的影响力分数不适合标准的GRPO,我们提出了DuGRPO,一种GRPO的双归一化变体,用于生成器训练。这些设计共同将文档池转化为一个自适应课程,倾向于对当前求解器有用的问题,而不仅仅是困难的问题。在Qwen3-8B-Base上,INFUSER在Olympiad和SuperGPQA基准测试中相对于强自我进化基线取得了超过20%的相对改进,并且一个8B的INFUSER协同进化生成器在数学和编程任务上优于冻结的32B思考生成器。消融实验证实了每个设计选择的必要性,两个扩展——将INFUSER应用于指令微调锚点并辅以规则可验证的RLVR数据——进一步展示了该框架的灵活性和泛化能力。代码可在https://github.com/FFishy-git/INFUSER获取。

英文摘要

Self-evolution offers a scalable path to stronger reasoning: a pretrained language model improves itself with only minimal external supervision. Yet existing methods either depend on extensively curated or teacher-generated training data, or, when the generator runs unsupervised, reward it by a difficulty heuristic that need not improve the solver. We introduce INFUSER, an iterative co-training framework with two co-evolving roles: a Generator that drafts questions and reference golden answers from a pool of unstructured, automatically collected documents, and a Solver that improves by training on them. The solver is trained with standard correctness rewards against the generator-provided answers, while the generator is rewarded by an optimizer-aware influence score that measures whether each proposed question would actually improve the solver on the target distribution. Because this continuous, noisy influence score is poorly served by standard GRPO, we propose DuGRPO, a dual-normalized variant of GRPO, for generator training. Together, these turn the document pool into an adaptive curriculum that favors questions useful to the current solver, not just hard ones. On Qwen3-8B-Base, INFUSER outperforms strong self-evolution baselines with over 20% relative improvement on Olympiad and SuperGPQA benchmarks, and an 8B INFUSER co-evolving generator outperforms a frozen 32B thinking generator on math and coding. Ablations confirm each design choice is necessary, and two extensions, applying INFUSER to an instruction-finetuned anchor and augmenting it with rule-verifiable RLVR data, further demonstrate the flexibility and generalizability of the framework. Code is available at https://github.com/FFishy-git/INFUSER.

2606.09430 2026-06-09 cs.LG cs.AI 新提交

LargeMonitor: Monitoring Online Task-Free Continual Learning via Large Pretrained Models

LargeMonitor: 通过大型预训练模型监控在线无任务持续学习

Mingqi Yuan, Xiaoquan Sun, Shihao Luo, Jiayu Chen

发表机构 * HKU(香港大学) Qicore Tech(启科科技)

AI总结 提出LargeMonitor框架,利用大型预训练模型(LVM和LMM)解耦检测与诊断,实现无任务持续学习中的零样本漂移检测和语义病因诊断,提升现有算法性能。

详情
AI中文摘要

在线无任务持续学习(TFCL)要求智能体在严格单次遍历约束下,从无界、非平稳的数据流中顺序积累知识,且无显式任务标识。现有在线TFCL范式主要依赖于参数高效的提示调整或由训练耦合优化动态(如经验损失波动或潜在距离演变)驱动的动态结构扩展。因此,这些训练耦合求解器对分布漂移的结构起源不可知,机械地在根本不同的流变化上强制执行固定策略。为解决这一问题,我们提出LargeMonitor,一个利用大型预训练基础模型自主编排无任务连续适应的框架。具体而言,LargeMonitor引入一个解耦的检测模块,利用大型视觉模型(LVM)的冻结、稳定表示空间,实现鲁棒的零样本漂移检测,无需训练依赖的干扰或脆弱的阈值调整。在确认漂移后,该框架激活一个由大型多模态模型(LMM)驱动的上下文感知诊断模块,以解释流变化的精确语义病因(例如,新类出现 vs. 环境域偏移)。这种双阶段能力使连续学习者能够动态部署自适应且特定于漂移的优化策略。在多个TFCL设置和基准上的大量实验表明,LargeMonitor实现了对复杂数据流的精确、鲁棒检测和诊断,同时持续提升现有在线TFCL算法的性能。

英文摘要

Online task-free continual learning (TFCL) requires intelligent agents to sequentially accumulate knowledge from an unbounded, non-stationary data stream under strict single-pass constraints and without any explicit task identifiers. Existing online TFCL paradigms primarily rely on parameter-efficient prompt tuning or dynamic structure expansion driven by training-coupled optimization dynamics, such as empirical loss fluctuations or evolving latent distances. As a result, these training-coupled solvers remain agnostic to the structural origins of distribution drift, mechanically enforcing a fixed strategy across fundamentally distinct streaming variations. To address this gap, we propose LargeMonitor, a framework that leverages large pretrained foundation models to autonomously orchestrate task-free continuous adaptation. Specifically, LargeMonitor introduces a decoupled detection module utilizing the frozen, stable representation space of large vision models (LVMs) to achieve robust, zero-shot drift detection without training-dependent interference or brittle threshold tuning. Upon a confirmed drift, the framework activates a context-aware diagnostic module driven by large multimodal models (LMMs) to interpret the precise semantic etiologies of the stream variation (e.g., novel class emergence vs. environmental domain shift). This dual-stage capability empowers the continuous learner to dynamically deploy adaptive and shift-specific optimization strategies. Extensive experiments across multiple TFCL settings and benchmarks demonstrate that LargeMonitor achieves precise, robust detection and diagnosis of complex data streams while consistently improving the performance of existing online TFCL algorithms.

2606.09762 2026-06-09 cs.LG cs.AI 新提交

Preserving Plasticity in Continual Learning via Dynamical Isometry

通过动态等距保持持续学习中的可塑性

Andries Rosseau, Robert Müller, Ann Nowé

发表机构 * University of Amsterdam(阿姆斯特丹大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文通过动态等距机制保持深度神经网络在持续学习中的可塑性,提出等距正则化方法和AdamO优化器,在多个基准上匹配或超越现有方法。

Comments ICML26

详情
Journal ref
Forty-Third International Conference on Machine Learning (ICML 2026)
AI中文摘要

深度神经网络在非平稳条件下的持续训练通常会导致可塑性逐渐丧失,最终限制进一步学习。我们将可塑性与经验神经正切核联系起来,并确定动态等距(即逐层雅可比奇异值保持接近1的条件)是保持持续学习中可塑性的关键机制。我们重新审视一类几乎处处等距且同时保持通用Lipschitz函数逼近能力的网络,证明近动态等距与表达性非线性表示兼容。对于通用架构,我们提出一种高效的等距促进正则化方案,并识别出一种可以重新激活休眠ReLU单元的新机制。在此基础上,我们引入AdamO,一种Adam风格的自适应优化器,将等距正则化与梯度更新解耦,类似于AdamW。我们进一步通过动态等距的视角重新解释先前的可塑性保持方法,表明它们仅针对等距的部分度量。在旨在诱导可塑性损失的监督和强化学习持续学习基准上,我们的方法一致地匹配或超越现有方法。

英文摘要

Continual training of deep neural networks under non-stationarity often leads to a progressive loss of plasticity, eventually limiting further learning. We relate plasticity to the empirical Neural Tangent Kernel, and identify dynamical isometry (the condition that layer-wise Jacobian singular values remain close to one) as a key mechanism for preserving plasticity in continual learning. We revisit a class of networks that are almost-everywhere isometric while remaining universal Lipschitz function approximators, demonstrating that near-dynamical isometry is compatible with expressive nonlinear representations. For general architectures, we propose an efficient isometry-promoting regularization scheme and identify a novel mechanism by which it can reactivate dormant ReLU units. Building on this, we introduce AdamO, an Adam-style adaptive optimizer that decouples isometry regularization from gradient updates, analogous to AdamW. We further reinterpret prior plasticity-preserving approaches through the lens of dynamical isometry, showing that they target only a partial measure of isometry. Across supervised and reinforcement-learning continual-learning benchmarks designed to induce plasticity loss, our methods consistently match or outperform existing approaches.
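
As a rough illustration of the isometry condition described above: for a single linear layer the Jacobian is the weight matrix W, and all of its singular values equal one exactly when W^T W = I. The sketch below measures the deviation from that condition as a squared Frobenius norm. It is a generic orthogonality penalty written for illustration, not the paper's regularizer or its AdamO optimizer, and the function name `isometry_penalty` is hypothetical.

```python
def isometry_penalty(W):
    """Squared Frobenius norm of (W^T W - I) for a weight matrix given as a list of rows.

    Assumption: a generic orthogonality penalty, used here only to illustrate
    "singular values close to one"; it is not taken from the paper.
    """
    rows, cols = len(W), len(W[0])
    total = 0.0
    for i in range(cols):
        for j in range(cols):
            # Entry (i, j) of the Gram matrix W^T W.
            gram = sum(W[r][i] * W[r][j] for r in range(rows))
            target = 1.0 if i == j else 0.0
            total += (gram - target) ** 2
    return total


identity = [[1.0, 0.0], [0.0, 1.0]]
scaled = [[2.0, 0.0], [0.0, 2.0]]
print(isometry_penalty(identity), isometry_penalty(scaled))
```

The penalty is zero for an orthogonal matrix and grows as singular values drift away from one, which is the kind of quantity an isometry-promoting term would push down.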

2606.07693 2026-06-09 stat.ML cs.LG math.PR 交叉投稿

Transfer learning for causal forest

因果森林的迁移学习

Bérénice-Alexia Jocteur, Véronique Maume-Deschamps, Pierre Ribereau

AI总结 提出一种针对因果森林HTERF的迁移学习方法,通过偏移量估计源域与目标域之间的模型偏移,并给出目标域上CATE误差的上界,仿真和真实数据验证了有效性。

详情
AI中文摘要

迁移学习解决了从一个领域向另一个领域迁移知识的挑战。传统的迁移学习侧重于调整在源域(有大量观测)上训练的模型,以提高在目标域(观测较少)上的性能。在这项工作中,我们考虑模型偏移的情况,并专注于将迁移学习应用于因果森林,即HTERF。该因果森林旨在估计条件平均处理效应(CATE)。所考虑的方法是Wang(2016)提出的偏移量方法,经过调整以适应因果背景。该方法依赖于使用中间模型来估计源分布和目标分布之间的偏移量。我们的主要结果是一个关于目标域上HTERF的CATE误差的上界,该上界取决于中间模型的误差。仿真研究表明,该方法在不同设置的模拟数据以及一个真实数据集上均表现良好。

英文摘要

Transfer learning addresses the challenge of transferring knowledge from one domain to another. Traditional transfer learning focuses on adapting models trained on a source domain (with many observations) to improve performance on a target domain (with few observations). In this work, we consider the case of model shift and focus on transfer learning applied to a causal forest, namely HTERF. This causal forest aims to estimate the Conditional Average Treatment Effect (CATE). The approach considered is the offset method presented by Wang (2016), adapted to a causal context. This method relies on intermediate models to estimate the offset between the source and target distributions. Our main result is a bound on the CATE error of HTERF on the target domain that depends on the error of the intermediate models. Simulation studies show that the approach performs well across different simulated settings and on a real-world dataset.

2606.09758 2026-06-09 cs.RO cs.AI cs.LG 交叉投稿

Difference-Aware Retrieval Policies for Imitation Learning

差异感知的模仿学习检索策略

Quinn Pfeifer, Ethan Pronovost, Paarth Shah, Khimya Khetarpal, Siddhartha Srinivasa, Abhishek Gupta

发表机构 * Paul G. Allen School of Computer Science & Engineering, University of Washington(华盛顿大学保罗·G·艾伦计算机科学与工程学院) Toyota Research Institute(丰田研究所) Google DeepMind(谷歌DeepMind) Mila

AI总结 提出DARP,一种半参数检索式模仿学习方法,通过基于k近邻的局部邻域结构重参数化,解决行为克隆的分布外泛化问题,在连续控制和机器人操作任务中性能提升15-46%。

Comments 12 pages, 7 figures, 3 tables. Accepted to ICLR 2026. Code and demos available at https://weirdlabuw.github.io/darp-site/

详情
AI中文摘要

通过行为克隆的参数化模仿学习可能因部署期间的复合误差而在分布外状态上泛化能力差。我们表明,在推理期间通过半参数检索式模仿学习方法重用训练数据可以缓解这一挑战。我们提出差异感知的模仿学习检索策略(DARP),这是一种半参数检索式模仿学习方法,通过根据局部邻域结构而非直接的状态到动作映射来重新参数化模仿学习问题,从而解决这一局限性。DARP不学习全局策略,而是训练一个模型,基于专家演示中的k近邻、它们对应的动作以及邻居状态与查询状态之间的相对距离向量来预测动作。DARP不需要超出标准行为克隆所做的额外假设——它不需要额外的数据收集、在线专家反馈或任务特定知识。我们在不同领域(包括连续控制和机器人操作)以及不同表示(包括高维视觉特征)上展示了比标准行为克隆持续15-46%的性能提升。代码和演示可在https://weirdlabuw.github.io/darp-site/获取。

英文摘要

Parametric imitation learning via behavior cloning can suffer from poor generalization to out-of-distribution states due to compounding errors during deployment. We show that reusing the training data during inference via a semi-parametric retrieval-based imitation learning approach can alleviate this challenge. We present Difference-Aware Retrieval Policies for Imitation Learning (DARP), a semi-parametric retrieval-based imitation learning approach that addresses this limitation by reparameterizing the imitation learning problem in terms of local neighborhood structure rather than direct state-to-action mappings. Instead of learning a global policy, DARP trains a model to predict actions based on $k$-nearest neighbors from expert demonstrations, their corresponding actions, and the relative distance vectors between neighbor states and query states. DARP requires no additional assumptions beyond those made for standard behavior cloning -- it does not require additional data collection, online expert feedback, or task-specific knowledge. We demonstrate consistent performance improvements of 15-46% over standard behavior cloning across diverse domains, including continuous control and robotic manipulation, and across different representations, including high-dimensional visual features. Code and demos are available at https://weirdlabuw.github.io/darp-site/.
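
The retrieval step described above can be sketched as follows: for a query state, find the k nearest expert states and pair each neighbor's action with the difference vector between the query and that neighbor. This is a minimal illustration only. The function name `knn_difference_features`, the Euclidean metric, and the sign convention (query minus neighbor) are assumptions, and the learned model that consumes these features is omitted.

```python
import math


def knn_difference_features(query, demo_states, demo_actions, k=2):
    """Return (difference vector, action) pairs for the k nearest expert states.

    Assumptions: Euclidean distance and the difference taken as query - neighbor;
    neither detail is confirmed by the abstract.
    """
    # Rank demonstration indices by distance to the query state.
    ranked = sorted(range(len(demo_states)),
                    key=lambda i: math.dist(query, demo_states[i]))
    features = []
    for i in ranked[:k]:
        diff = [q - s for q, s in zip(query, demo_states[i])]
        features.append((diff, demo_actions[i]))
    return features


states = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
actions = ["left", "right", "stop"]
for diff, action in knn_difference_features((0.9, 0.1), states, actions, k=2):
    print(action, diff)
```

A downstream model would take these pairs as input and predict the action for the query state, so the training data is reused at inference time rather than compressed entirely into the weights.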

2511.18590 2026-06-09 astro-ph.GA cs.LG hep-th 交叉投稿

From Simulations to Surveys: Domain Adaptation for Galaxy Observations

从模拟到巡天:面向星系观测的领域自适应

Kaley Brauer, Aditya Prasad Dash, Meet J. Vyas, Ahmed Salim, Stiven Briand Massala

发表机构 * Center for Astrophysics, Harvard University(哈佛大学天体物理中心) Physics and Astronomy, University of California, Los Angeles(加州大学洛杉矶分校物理与天文系) International Centre for Space and Cosmology, Ahmedabad University(艾哈迈达巴德大学国际空间与宇宙学中心) Department of Computing, Universiti Teknologi Malaysia(马来西亚理工大学计算系) Université Paris-Saclay, CentraleSupélec, ENS Paris-Saclay, CNRS, LMPS - Laboratoire de Mécanique Paris-Saclay(巴黎-萨克雷大学,CentraleSupélec,ENS巴黎-萨克雷,CNRS,LMPS-巴黎-萨克雷力学实验室)

AI总结 提出一种结合特征级领域损失和基于最优传输的top-k软匹配损失的领域自适应管道,将TNG50模拟星系分类器迁移到真实SDSS观测,目标域准确率从~46%提升至~87%。

Comments 8 pages, 4 figures. Will be presented at NeurIPS 2025 ML4PS

详情
AI中文摘要

大型光度巡天将拍摄数十亿个星系,但我们目前缺乏快速、可靠的自动化方法来推断它们的物理性质,如形态、恒星质量和恒星形成率。模拟提供了具有真实物理标签的星系图像,但PSF、噪声、背景、选择和标签先验中的领域偏移会降低向真实巡天的迁移效果。我们提出了一个初步的领域自适应管道,该管道在模拟的TNG50星系上训练,并在具有形态标签(椭圆/旋涡/不规则)的真实SDSS星系上评估。我们训练了三个骨干网络(CNN、$E(2)$-可转向CNN、ResNet-18),使用焦点损失和有效数量类别加权,以及基于GeomLoss(熵Sinkhorn OT、能量距离、高斯MMD及相关度量)构建的特征级领域损失$L_D$。我们表明,将这些损失与基于OT的“top-$k$软匹配”损失(该损失将$L_D$聚焦于最不匹配的源-目标对)相结合,可以进一步增强领域对齐。使用欧几里得距离、调度对齐权重和top-$k$匹配,目标域准确率(宏F1)从无自适应时的~46%(~30%)提升至~87%(~62.6%),领域AUC接近0.5,表明潜在空间混合良好。

英文摘要

Large photometric surveys will image billions of galaxies, but we currently lack quick, reliable automated ways to infer their physical properties like morphology, stellar mass, and star formation rates. Simulations provide galaxy images with ground-truth physical labels, but domain shifts in PSF, noise, backgrounds, selection, and label priors degrade transfer to real surveys. We present a preliminary domain adaptation pipeline that trains on simulated TNG50 galaxies and evaluates on real SDSS galaxies with morphology labels (elliptical/spiral/irregular). We train three backbones (CNN, $E(2)$-steerable CNN, ResNet-18) with focal loss and effective-number class weighting, and a feature-level domain loss $L_D$ built from GeomLoss (entropic Sinkhorn OT, energy distance, Gaussian MMD, and related metrics). We show that a combination of these losses with an OT-based "top-$k$ soft matching" loss that focuses $L_D$ on the worst-matched source-target pairs can further enhance domain alignment. With Euclidean distance, scheduled alignment weights, and top-$k$ matching, target accuracy (macro F1) rises from $\sim$46% ($\sim$30%) without adaptation to $\sim$87% ($\sim$62.6%), with a domain AUC near 0.5, indicating strong latent-space mixing.
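
To make the idea of a feature-level domain loss concrete, the sketch below computes a (biased) squared Gaussian-kernel MMD between two small sets of feature vectors in plain Python. It illustrates one of the metrics named above and does not use or reproduce the GeomLoss API; the function names and the default bandwidth are assumptions.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two feature vectors."""
    sq_dist = sum((x - y) ** 2 for x, y in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd_squared(source, target, bandwidth=1.0):
    """Biased estimate of the squared maximum mean discrepancy.

    Assumption: a single fixed bandwidth; practical pipelines often tune it
    or combine several.
    """
    def mean_kernel(A, B):
        total = sum(gaussian_kernel(a, b, bandwidth) for a in A for b in B)
        return total / (len(A) * len(B))

    return (mean_kernel(source, source)
            + mean_kernel(target, target)
            - 2.0 * mean_kernel(source, target))


sim_features = [(0.0, 0.0), (0.1, -0.1)]
real_features = [(3.0, 3.0), (2.9, 3.2)]
print(mmd_squared(sim_features, sim_features), mmd_squared(sim_features, real_features))
```

The value is near zero when the two feature sets coincide and grows as they separate, so minimizing it alongside the classification loss pulls simulated and observed features toward a shared latent distribution.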

2507.12612 2026-06-09 cs.LG cs.AI 版本更新

Learning Task Mixtures from Task Affinities: A Probabilistic Graphical Model for Supervised Fine-Tuning

从任务亲和性中学习任务混合:用于监督微调的概率图模型

Prateek Chanda, Saral Sureka, Parth Pratim Chatterjee, Krishnateja Killamsetty, Nikhil Shivakumar Nayak, Ganesh Ramakrishnan

发表机构 * IIT Bombay(印度理工学院孟买分校) IBM Research(IBM研究院) Red Hat AI Innovation(红帽AI创新) MIT-IBM Watson AI Lab(麻省理工-IBM沃森AI实验室)

AI总结 本文提出TaskPGM框架,通过基于能量的任务模型学习连续任务混合,利用互信息和行为散度来捕捉任务间的关系,从而在任务覆盖和冗余之间取得平衡,提升大语言模型的监督微调性能。

Comments 9, 8 tables, 7 figures

详情
AI中文摘要

大语言模型的监督微调性能在很大程度上取决于训练预算如何分配到异质任务集上。在实践中,通常使用简单的启发式方法(例如均匀采样或按规模比例采样)来固定混合,但这些方法忽略了任务之间的相互作用,可能损害迁移并将预算浪费在冗余来源上。我们引入TaskPGM,一种通过基于能量的任务模型学习连续任务混合的框架。任务构成马尔可夫随机场的节点:一元势函数刻画单个任务的效用,而成对势函数使用从单任务微调模型的预测分布中计算的行为散度(如Jensen-Shannon散度和点互信息)来编码任务间的关系。优化此目标会产生在覆盖和冗余之间取得平衡的混合。我们证明,所得到的集合函数在预算约束下是弱子模的,这使得离散选择变体能够获得近似保证。在多个模型家族(LLaMA-7B,Qwen2-7B)和评估套件(BIG-Bench Hard)上,TaskPGM相比标准混合策略取得改进,并提供了任务间相互作用的可解释结构。

英文摘要

Supervised fine-tuning performance for large language models depends strongly on how training budget is distributed across a heterogeneous set of tasks. In practice, mixtures are often fixed using simple heuristics (e.g., uniform or size-proportional sampling) that ignore task interactions, which can hurt transfer and waste budget on redundant sources. We introduce TaskPGM, a framework for learning continuous task mixtures via an energy-based model over tasks. Tasks form the nodes of a Markov random field: unary potentials capture per-task utility, and pairwise potentials encode inter-task relationships using behavioral divergences computed from predictive distributions of single-task fine-tuned models (e.g., Jensen--Shannon divergence and pointwise mutual information). Optimizing this objective yields mixtures that balance coverage against redundancy. We show that the resulting set function is weakly submodular under budget constraints, enabling approximation guarantees for discrete selection variants. Across multiple model families (LLaMA-7B, Qwen2-7B) and evaluation suites (BIG-Bench Hard), TaskPGM improves over standard mixing strategies and provides interpretable structure over task interactions.
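
One of the behavioral divergences mentioned above, the Jensen-Shannon divergence, can be computed between two predictive distributions as in the sketch below. How TaskPGM turns such values into pairwise potentials is not specified in the abstract, so this only shows the divergence itself; the function names are illustrative.

```python
import math


def kl_divergence(p, q):
    """KL(p || q) in nats, skipping zero-probability entries of p."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


def js_divergence(p, q):
    """Jensen-Shannon divergence: symmetric, and bounded by ln(2) in nats."""
    mixture = [(pi + qi) / 2.0 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Hypothetical next-token distributions from two single-task fine-tuned models.
model_a = [0.7, 0.2, 0.1]
model_b = [0.1, 0.2, 0.7]
print(js_divergence(model_a, model_b))
```

A small divergence suggests two tasks induce similar model behavior (possible redundancy), while a large one suggests they cover different behavior.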

2511.07938 2026-06-09 cs.LG cs.SY eess.SY 版本更新

Decision-Focused Continual Learning for Seaport Power-Logistics Scheduling: Generalization across Varying Tasks

面向海港电力物流调度的决策聚焦持续学习:跨不同任务的泛化

Chuanqing Pu, Feilong Fan, Nengling Tai, Yan Xu, Wentao Huang, Honglin Wen

发表机构 * College of Smart Energy, Shanghai Jiao Tong University(上海交通大学智能能源学院) School of Electrical and Electronic Engineering, Nanyang Technological University(南洋理工大学电子与电气工程学院) School of Electrical Engineering, Shanghai Jiao Tong University(上海交通大学电气工程学院) Dyson School of Design Engineering, Imperial College London(伦敦帝国理工学院戴森设计工程学院)

AI总结 针对预测-优化框架在任务变化时泛化差的问题,提出基于Fisher信息正则化的决策聚焦持续学习框架,通过可微凸代理稳定梯度,实现跨任务决策对齐的在线学习,在裕廊港实验中提升决策性能并降低计算成本。

Comments Preprint to IEEE Transactions on Smart Grid

详情
AI中文摘要

现代海港的电力物流调度通常遵循预测-优化流程。为了提高预测的决策质量,提出了决策聚焦学习,它将预测模型的训练与下游决策结果对齐。然而,这种端到端设计本质上将预测模型的价值限制在特定的任务结构上,因此对由不同船舶到达引起的演变任务泛化能力差。我们通过一个决策聚焦持续学习框架来解决这一差距,该框架在线适应调度任务流。具体来说,我们引入了基于Fisher信息的正则化,通过保留对先前任务关键的参数来增强跨任务泛化。还开发了一个可微的凸代理来稳定梯度反向传播。所提出的方法能够在变化的任务流中学习决策对齐的预测模型,同时保持可持续的长期计算和内存需求。在裕廊港校准的实验表明,与现有方法相比,该方法提高了决策性能和跨任务泛化能力,同时降低了计算成本并具有有限的内存占用。

英文摘要

Power-logistics scheduling in modern seaports typically follows a predict-then-optimize pipeline. To enhance the decision quality of predictions, decision-focused learning has been proposed, which aligns the training of forecasting models with downstream decision outcomes. However, this end-to-end design inherently restricts the value of forecasting models to a specific task structure and therefore generalizes poorly to evolving tasks induced by varying vessel arrivals. We address this gap with a decision-focused continual learning framework that adapts online to a stream of scheduling tasks. Specifically, we introduce Fisher-information-based regularization to enhance cross-task generalization by preserving parameters critical to prior tasks. A differentiable convex surrogate is also developed to stabilize gradient backpropagation. The proposed approach enables learning a decision-aligned forecasting model across a varying task stream with sustainable long-term computational and memory requirements. Experiments calibrated to Jurong Port show improved decision performance and cross-task generalization over existing methods, together with reduced computational cost and a bounded memory footprint.
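
The Fisher-information-based regularization mentioned above is commonly realized as a quadratic penalty that anchors parameters important to earlier tasks, in the style of elastic weight consolidation. The sketch below assumes a diagonal Fisher estimate from squared per-sample gradients; the paper's exact formulation may differ, and all names here are hypothetical.

```python
def diagonal_fisher(per_sample_grads):
    """Diagonal Fisher estimate: mean of squared per-sample gradients."""
    n = len(per_sample_grads)
    dim = len(per_sample_grads[0])
    return [sum(g[j] ** 2 for g in per_sample_grads) / n for j in range(dim)]


def fisher_penalty(theta, theta_prev, fisher, strength=1.0):
    """Quadratic penalty that discourages moving parameters with high Fisher weight.

    Assumption: an EWC-style form, 0.5 * strength * sum_j F_j * (theta_j - theta_prev_j)^2.
    """
    return 0.5 * strength * sum(
        f * (t - p) ** 2 for f, t, p in zip(fisher, theta, theta_prev)
    )


grads_on_previous_task = [[1.0, 0.0], [3.0, 0.0]]
fisher = diagonal_fisher(grads_on_previous_task)
print(fisher, fisher_penalty([1.0, 10.0], [0.0, 0.0], fisher))
```

In this toy case only the first parameter mattered to the previous task, so moving the second parameter is free while moving the first is penalized. The penalty would be added to the decision-focused training loss for each new task.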

2604.07848 2026-06-09 cs.LG q-bio.MN 版本更新

Information-Theoretic Requirements for Gradient-Based Task Affinity Estimation in Multi-Task Learning

基于梯度的任务亲和性估计在多任务学习中的信息论要求

Jasper Zhang, Bryan Cheng

发表机构 * Great Neck South High School(Great Neck South 高中)

AI总结 本文探讨了多任务学习中梯度基于任务亲和性估计的信息论要求,发现任务样本重叠度对梯度对齐的影响,并揭示了样本重叠度的相变特性。

Comments 8 pages, 4 figures. ACM BCB 2026 Short Paper. Accepted at workshop on AI for Accelerated Materials Design, Foundation Models for Science: Real-World Impact and Science-First Design, and Generative and Experimental Perspectives for Biomolecular Design at ICLR 2026

详情
AI中文摘要

多任务学习的结果显著不一致(有时联合训练有显著帮助,有时反而损害性能),但该领域缺乏一个原则性的框架来预测这些结果。我们指出基于梯度的任务分析背后一个基本但未明说的假设:任务必须共享训练实例,梯度冲突才能揭示真实的任务关系。当任务在相同输入上测量时,梯度对齐反映共享的机制结构;当在不相交的输入上测量时,任何表面上的信号都将任务关系与分布偏移混为一谈。我们发现这种样本重叠要求表现出明显的相变特性:重叠低于30%时,梯度-任务相关性在统计上与噪声不可区分;重叠高于40%时,它们能可靠地恢复已知的生物结构。在多个数据集上的全面验证取得了强相关性,并恢复了生物通路的组织结构。标准基准系统性地违反这一要求(MoleculeNet的重叠低于5%,TDC为8-14%),远低于梯度分析变得有意义的阈值。这为七年来不一致的MTL结果提供了第一个原则性解释。

英文摘要

Multi-task learning shows strikingly inconsistent results -- sometimes joint training helps substantially, sometimes it actively harms performance -- yet the field lacks a principled framework for predicting these outcomes. We identify a fundamental but unstated assumption underlying gradient-based task analysis: tasks must share training instances for gradient conflicts to reveal genuine relationships. When tasks are measured on the same inputs, gradient alignment reflects shared mechanistic structure; when measured on disjoint inputs, any apparent signal conflates task relationships with distributional shift. We discover this sample overlap requirement exhibits a sharp phase transition: below 30% overlap, gradient-task correlations are statistically indistinguishable from noise; above 40%, they reliably recover known biological structure. Comprehensive validation across multiple datasets achieves strong correlations and recovers biological pathway organization. Standard benchmarks systematically violate this requirement -- MoleculeNet operates at <5% overlap, TDC at 8-14% -- far below the threshold where gradient analysis becomes meaningful. This provides the first principled explanation for seven years of inconsistent MTL results.
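
The two quantities discussed above can be sketched in a few lines: gradient alignment as the cosine similarity between two task gradients, and sample overlap as the share of training instances the two tasks have in common. The overlap definition used here (Jaccard index over instance IDs) is an assumption, since the abstract does not say how overlap is measured, and the function names are illustrative.

```python
import math


def cosine_similarity(grad_a, grad_b):
    """Alignment between two task gradients (flattened into vectors)."""
    dot = sum(a * b for a, b in zip(grad_a, grad_b))
    norm = math.sqrt(sum(a * a for a in grad_a)) * math.sqrt(sum(b * b for b in grad_b))
    return dot / norm


def sample_overlap(ids_a, ids_b):
    """Fraction of shared training instances.

    Assumption: Jaccard index over instance IDs; the paper may define overlap differently.
    """
    a, b = set(ids_a), set(ids_b)
    return len(a & b) / len(a | b)


print(cosine_similarity([1.0, 2.0], [2.0, 4.0]))
print(sample_overlap(["mol1", "mol2", "mol3"], ["mol2", "mol3", "mol4"]))
```

Under the abstract's claim, an alignment score like the first would only be trusted as a task-affinity signal once the overlap is above the reported threshold region.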

2605.26872 2026-06-09 cs.LG cs.AI cs.CL 版本更新

The Strongest Teacher Is Not Always the Best Teacher: Student-Centric Answer Selection

最强的教师并不总是最好的教师:以学生为中心的答案选择

Zhengyu Hu, Zheyuan Xiao, Linxin Song, Fengqing Jiang, Yuetai Li, Zhengyu Chen, Zhihan Xiong, Yue Liu, Junhao Lin, Yao Su, Lijie Hu, Kaize Ding, Teng Xiao, Radha Poovendran

发表机构 * University of Washington(华盛顿大学) University of Texas at Austin(德克萨斯大学奥斯汀分校) University of Southern California(南加州大学) Independent Researcher(独立研究者) National University of Singapore(新加坡国立大学) Microsoft(微软) Google(谷歌) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Northwestern University(西北大学) Allen Institute for AI (AI2)(艾伦人工智能研究所(AI2))

AI总结 提出以学生为中心的答案采样(SCAS)框架,通过估计学生中心的学习成本选择教师生成的答案,从而提升学生模型性能。

详情
AI中文摘要

LLM训练越来越依赖教师生成的监督,包括合成响应、推理轨迹和工具使用演示。当前实践通常选择表现最好的教师来生成学生训练数据,隐含地将教师测试表现视为教学质量的代理。我们表明这一假设可能失败:即使多个教师对同一问题提供正确答案,最强教师的答案也不一定是对给定学生的最佳监督。为解决这一问题,我们提出以学生为中心的答案采样(SCAS),该框架根据估计的学生中心学习成本从经过验证的教师生成答案中进行选择。受逐词梯度分解的启发,我们推导出该成本的高效前向代理,并在训练中用于指导答案选择。在30个教师模型、6个学生基础模型和6个任务上的实验表明,SCAS持续提升学生性能,表明有效的蒸馏应优先考虑与当前学生匹配的监督,而非仅依赖教师强度。

英文摘要

LLM training increasingly relies on teacher-generated supervision, from synthetic responses to reasoning traces and tool-use demonstrations. Current practice often chooses the highest-performing teacher to generate student training data, implicitly treating teacher test performance as a proxy for teaching quality. We show that this assumption can fail: even when multiple teachers provide correct answers to the same question, the answer from the strongest teacher is not necessarily the best supervision for a given student. To address this gap, we propose Student-Centric Answer Sampling (SCAS), a framework that selects from verified teacher-generated answers according to their estimated student-centric learning cost. Motivated by a token-wise gradient decomposition, we derive an efficient forward-only proxy for this cost and use it to guide answer selection during training. Experiments across 30 teacher models, 6 student base models, and 6 tasks show that SCAS consistently improves student performance, suggesting that effective distillation should prioritize supervision matched to the current student rather than teacher strength alone.

2606.01379 2026-06-09 cs.LG 版本更新

Turning Back Without Forgetting: Selective Backward Refinement for Parameter-Efficient Continual Learning

在不遗忘的情况下回溯:面向参数高效持续学习的选择性反向精炼

Anushka Tiwari, Kaiyi Ji

发表机构 * University of California, Berkeley(加州大学伯克利分校)

AI总结 提出SABER框架,通过基于提示梯度几何和损失分布相似性的任务相关性准则,在提示型参数高效持续学习中实现受控的、正的反向知识迁移,无需重放。

Comments Accepted at ICML 2026

详情
AI中文摘要

虽然基于提示的参数高效持续学习通过隔离任务特定提示来缓解灾难性遗忘,但这种隔离也限制了后续任务改进先前任务,导致反向知识迁移未被充分探索。为解决这一限制,我们提出SABER(面向正的反向知识迁移的选择性反向精炼),这是一个无需重放的框架,能够在基于提示的持续学习中实现受控的反向迁移。SABER利用基于提示梯度几何和损失分布相似性的互补任务相关性准则,判断何时进行反向精炼有益,并通过将更新限制在提示参数空间中的非干扰方向来安全执行精炼。在多个持续学习基准和不同预训练骨干网络(包括T5-Large、LLaMA和Qwen)上的大量实验表明,SABER在保持强大整体平均性能的同时,稳定地实现正的反向迁移。代码可在https://github.com/OptMN-Lab/SABER-ICML-2026/获取。

英文摘要

While prompt-based parameter-efficient continual learning mitigates catastrophic forgetting by isolating task-specific prompts, this isolation also limits later tasks from improving earlier ones, leaving backward knowledge transfer underexplored. We address this limitation by proposing Selective bAckward refinement for positive Backward knowledge transfER (SABER), a replay-free framework that enables controlled backward transfer in prompt-based continual learning. SABER determines when backward refinement is beneficial using complementary task-correlation criteria based on prompt-gradient geometry and loss-distribution similarity, and how to perform refinement safely by restricting updates to non-interfering directions in the prompt parameter space. Extensive experiments across multiple continual learning benchmarks and diverse pretrained backbones, including T5-Large, LLaMA, and Qwen, demonstrate that SABER consistently achieves positive backward transfer while maintaining strong overall average performance. Code is available at https://github.com/OptMN-Lab/SABER-ICML-2026/.

2605.16309 2026-06-09 cs.AI cs.LG cs.MA 版本更新

ANNEAL: Adapting LLM Agents via Governed Symbolic Patch Learning

ANNEAL:通过受控符号补丁学习适应大语言模型代理

Safayat Bin Hakim, Keyan Guo, Wenkai Tan, Alvaro Velasquez, Shouhuai Xu, Houbing Herbert Song

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) University at Buffalo(布法罗大学) University of Colorado Boulder(科罗拉多大学博尔德分校) University of Colorado Colorado Springs(科罗拉多大学科罗拉多斯普林斯分校)

AI总结 ANNEAL通过受控符号补丁学习适应大语言模型代理,解决重复故障问题,其核心机制FDKA能定位责任操作符并生成类型补丁,实现持久结构修复,优于现有方法。

Comments Code Implementation: https://github.com/sbhakim/anneal-agents

详情
AI中文摘要

基于大语言模型的代理可以从个别执行错误中恢复,但当底层过程知识(操作符模式、前置条件和约束)未被修复时,它们会在同一故障上反复失败。现有自我进化方法通过更新提示、记忆或模型权重来解决这一差距,但都没有直接修复编码任务执行方式的符号结构,且很少提供安全部署所需的治理保证。我们引入ANNEAL,一种神经符号代理,将反复出现的失败转化为对过程知识图谱的受控符号编辑,而无需修改基础模型权重。其核心机制,故障驱动知识获取(FDKA),定位责任操作符,通过受约束的LLM生成合成带类型的补丁,并在提交前通过多维评分、符号护栏和金丝雀测试验证提案。每条被接受的编辑都携带完整溯源信息和确定性回滚能力。在四个领域和27次多种子运行中,ANNEAL是唯一提交持久结构修复的被评估系统:ReAct和Reflexion等强基线虽能实现较高的单回合恢复率,但在重复故障上仍保留72-100%的留出失败率,而ANNEAL在所测试的重复故障设置中将其降至0%。消融实验证实,移除FDKA会消除所有结构修复,并使成功率下降最多26.7个百分点。这些结果表明,受控符号修复为持续故障消除提供了与权重级和提示级适应互补的范式。

英文摘要

LLM-based agents can recover from individual execution errors, yet they repeatedly fail on the same fault when the underlying process knowledge--operator schemas, preconditions, and constraints--remains unrepaired. Existing self-evolving approaches address this gap by updating prompts, memory, or model weights, but none directly repair the symbolic structures that encode how tasks are executed, and few provide the governance guarantees required for safe deployment. We introduce ANNEAL, a neuro-symbolic agent that converts recurring failures into governed symbolic edits of a process knowledge graph without modifying foundation model weights. Its core mechanism, Failure-Driven Knowledge Acquisition (FDKA), localizes the responsible operator, synthesizes a typed patch through constrained LLM generation, and validates the proposal via multi-dimensional scoring, symbolic guardrails, and canary testing before commit. Every accepted edit carries full provenance and deterministic rollback capability. Across four domains and 27 multi-seed runs, ANNEAL is the only evaluated system that commits persistent structural repairs--strong baselines such as ReAct and Reflexion achieve high episodic recovery yet retain 72--100% holdout failure rates on recurring faults, whereas ANNEAL reduces these to 0% in the tested recurring-failure settings. Ablation confirms that removing FDKA eliminates all structural repairs and drops success rate by up to 26.7 percentage points. These results suggest that governed symbolic repair offers a complementary paradigm to weight-level and prompt-level adaptation for persistent fault elimination.

2605.30407 2026-06-09 cs.CL cs.AI cs.IR cs.LG 版本更新

Exploring Autonomous Agentic Data Engineering for Model Specialization

探索用于模型专业化的自主智能体数据工程

Yujie Luo, Xiangyuan Ru, Jingsheng Zheng, Jingjing Wang, Yuqi Zhu, Jintian Zhang, Runnan Fang, Kewei Xu, Ye Liu, Zheng Wei, Jiang Bian, Zang Li, Shumin Deng

发表机构 * Zhejiang University(浙江大学) Platform and Content Group, Tencent(腾讯平台与内容部)

AI总结 本文提出自主智能体数据工程任务,让LLM作为自主数据工程师,通过端到端数据策划驱动模型专业化,实验显示GPT-5.2通过迭代数据适应使学生模型性能提升57.29%。

Comments Work in progress

详情
AI中文摘要

大型语言模型(LLM)在通用任务上表现出色,但往往难以适应没有高质量领域特定数据的专业领域。现有的基于LLM的数据策划方法主要依赖人工设计的工作流程,尚未检验LLM能否自主执行端到端的数据工程流水线以实现模型专业化。我们形式化了自主智能体数据工程,这是一个新任务,旨在评估LLM作为自主数据工程师,通过端到端数据策划驱动模型专业化。我们将数据视为可优化组件,研究能够跨多个领域规划、生成和迭代优化训练数据的智能体,并以训练后性能提升为指导。实验表明,自主LLM数据工程师带来了显著收益,GPT-5.2构建的训练课程使学生模型性能提升了57.29%,完全通过迭代的智能体驱动数据适应实现。通过揭示潜力和瓶颈,我们的研究将自主数据工程确立为一种可衡量的能力,并为智能体驱动的模型专业化指明了道路(代码将在https://github.com/zjunlp/DataAgent发布)。

英文摘要

Large Language Models (LLMs) have demonstrated strong performance on general tasks, while often struggling to adapt to specialized domains without high-quality domain-specific data. Existing LLM-based data curation methods primarily rely on human-designed workflows, leaving it unexamined whether LLMs can autonomously execute an end-to-end data engineering pipeline for model specialization. We formalize Autonomous Agentic Data Engineering, a novel task designed to evaluate LLMs as autonomous data engineers that drive model specialization through end-to-end data curation. We frame data as an optimizable component and study agents that plan, generate, and iteratively optimize training data across multiple domains, guided by post-training performance improvement. Experiments show that autonomous LLM data engineers yield substantial gains, as GPT-5.2 constructs a training curriculum that improves a student model by 57.29%, entirely through iterative, agent-driven data adaptation. By illuminating both potential and bottlenecks, our study establishes autonomous data engineering as a measurable capability and charts a path toward agent-driven model specialization (Code will be released at https://github.com/zjunlp/DataAgent).

11. 数据集、基准与评测 80 篇

2606.07550 2026-06-09 cs.LG cs.AI 新提交

Offline Reinforcement Learning for Plasma Control in Nuclear Fusion: Codebase and Benchmark

核聚变等离子体控制的离线强化学习:代码库与基准

Yang Fu, Haomin Bao, Rohit Sonker, Xiaoyan Hu, Aravind Venugopal, Jeff Schneider, Jiayu Chen

发表机构 * Central South University(中南大学) Chongqing University(重庆大学) Carnegie Mellon University(卡内基梅隆大学) The University of Hong Kong(香港大学)

AI总结 提出RL4F基准,基于DIII-D托卡马克历史数据构建评估环境,比较多种离线RL方法在等离子体控制任务上的性能,发现基于模型的离线RL方法平均表现最佳。

Comments 23 pages (10 pages main text)

详情
AI中文摘要

离线强化学习(RL)为从历史托卡马克数据开发等离子体控制器提供了一条有前景的途径,因为在真实设备上进行在线试错成本高昂且风险巨大。然而,由于缺乏针对核聚变中现实多执行器、长时域等离子体控制问题的标准化离线RL基准,这一方向的进展仍然难以衡量。我们引入了RL4F,一个用于核聚变等离子体控制的离线强化学习基准,提供了闭环评估环境和四个全剖面跟踪任务(旋转、密度、温度和压力)的基线比较。评估环境背后的动力学函数基于真实托卡马克DIII-D的历史放电数据构建。我们在统一协议下评估了广泛的模仿学习和离线RL基线。我们发现,基于模型的离线RL方法在大多数目标上获得了最佳平均性能,尽管没有单一方法在所有任务中占主导地位,这突显了动力学建模在复杂、长时域等离子体控制任务中的重要性。为了促进进一步研究,我们开源了代码库、数据集和评估框架,不仅为聚变社区,也为离线RL的算法开发提供了一个基准。

英文摘要

Offline reinforcement learning (RL) offers a promising route for developing plasma controllers from historical tokamak data, since online trial-and-error on real devices is costly and risky. However, progress in this direction remains difficult to measure due to the lack of a standardized offline RL benchmark for realistic multi-actuator, long-horizon plasma control problems in nuclear fusion. We introduce RL4F, an Offline Reinforcement Learning Benchmark for Plasma Control in Nuclear Fusion, providing closed-loop evaluation environments and baseline comparisons across four full-profile tracking tasks: rotation, density, temperature, and pressure. The dynamics function underlying the evaluation environment is built from historical discharge data from DIII-D, a real-world Tokamak. We evaluate a broad set of imitation learning and offline RL baselines under a unified protocol. We find that offline model-based RL methods obtain the best average performance on most objectives, although no single method dominates all tasks, highlighting the importance of dynamics modeling in complex, long-horizon plasma control tasks. To foster further research, we open-source the codebase, datasets, and evaluation framework, providing a benchmark not only for the fusion community but also for algorithm development in offline RL.

2606.07553 2026-06-09 cs.LG cs.AI 新提交

MedicalRec: Medical recommender system for image classification without retraining

MedicalRec:无需重新训练的图像分类医疗推荐系统

Roghayeh Taghavi, Aysa Hasanazde Bashkandi, Amir Ali Bengari, Mohammad Amin Raji, Mohammad Salahi Ardekani, Parisa Mardukhian, Parvaneh Rezaei, Ramin Mousa

发表机构 * University of Tehran(德黑兰大学)

AI总结 提出基于Transformer的医疗推荐系统MedicalRec,利用从3000篇论文中构建的MedicalRec-Bench数据集(含5000+记录),无需重新训练即可为医疗图像分类任务推荐最优模型,最高HitRate@100达75.5%。

详情
AI中文摘要

机器学习和深度学习的出现彻底改变了医疗保健中诊断、治疗和管理系统的效率。然而,这种快速采用是以需要大量计算能力和能源消耗以及电子垃圾处理和碳排放为代价的。这些模型的挑战之一是为分类任务选择合适的模型。为此,研究人员尝试通过试错法使用他们的数据来确定最佳模型,这涉及能源消耗和浪费。本研究的目标是开发一个基于模型的医疗图像分类推荐系统。为此,从3000篇医疗图像分类领域的文章中收集了一个数据集。该数据集以MedicalRec-Bench的名称公开可用,包含超过5000条在各种任务中测试的模型记录,包括皮肤癌分类、肿瘤分类、伤口分类、乳腺癌和MRI分类。根据特征数量,数据集在四种不同模式下进行评估:MedicalRec I(5个特征)、MedicalRec II(9个特征)、MedicalRec III(11个特征)和MedicalRec IV(18个特征)。由于作者未报告,收集所有特征值具有挑战性;因此,数据集包含大量缺失值。医疗推荐系统(MedicalRec)是一个基于Transformer的模型,用于本研究中的项目推荐。该模型在数据集评估和与12个基础模型的评估中取得了显著成果。该模型实现了最高HitRate@100为75.5%。数据集和实现可通过GitHub链接获取:https://github.com/Ramin1Mousa/MedicalRec

英文摘要

The emergence of machine learning and deep learning has revolutionized the efficiency of diagnostic, therapeutic, and administrative systems in healthcare. However, this rapid adoption has come at the cost of requiring significant computing power and energy consumption, as well as e-waste disposal and carbon emissions. One of the challenges of these models is choosing the right model for classification tasks. To this end, researchers attempt to identify the optimal model using their data through trial and error, which involves energy consumption and waste. The goal of this study is to develop a model-based recommender system for medical image classification. For this purpose, a data set was collected from 3,000 articles in the field of medical image classification. This dataset, publicly available under the name MedicalRec-Bench, contains over 5,000 records of models tested in various tasks, including Skin Cancer Classification, Tumour Classification, Wound Classification, Breast Cancer, and MRI classification. The dataset was evaluated in four different modes, depending on the number of features: MedicalRec I (5 features), MedicalRec II (9 features), MedicalRec III (11 features), and MedicalRec IV (18 features). Collecting all values for the features is challenging due to non-reporting by the authors; hence, the dataset contains significant amounts of missing values. The Medical Recommender System (MedicalRec) is a transformer-based model used for item recommendations in this study. This model achieved remarkable results in the evaluation on the dataset and in the evaluation with 12 base models. This model achieved a maximum HitRate@100 of 75.5%. The dataset and implementations are available through the GitHub link: https://github.com/Ramin1Mousa/MedicalRec
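
The HitRate@100 figure quoted above can be read with the usual definition of HitRate@K: the fraction of queries whose relevant item appears among the top K recommendations. The sketch below assumes that standard definition; the abstract does not spell out the paper's exact evaluation protocol, and the model names and rankings are hypothetical toy data.

```python
def hit_rate_at_k(ranked_lists, relevant_items, k):
    """Fraction of queries whose relevant item is in the top-k recommendations."""
    if not ranked_lists or len(ranked_lists) != len(relevant_items):
        raise ValueError("need one non-empty ranking per relevant item")
    hits = sum(
        1
        for ranked, target in zip(ranked_lists, relevant_items)
        if target in ranked[:k]
    )
    return hits / len(ranked_lists)


# Hypothetical toy data: one ranked model list per classification task.
recommendations = [
    ["resnet50", "vit", "densenet"],
    ["vgg16", "resnet50", "vit"],
    ["vit", "vgg16", "densenet"],
]
# Hypothetical best-performing model for each task.
best_models = ["vit", "densenet", "vit"]

for k in (1, 2, 3):
    print(k, hit_rate_at_k(recommendations, best_models, k))
```

With K as large as 100, a hit only requires the best model to appear somewhere in a long candidate list, so the metric is considerably more lenient than HitRate@1.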

2606.07587 2026-06-09 cs.LG 新提交

The Routing Plateau: Understanding and Breaking the Accuracy Limits of LLM Routers

路由平台期:理解并突破LLM路由器的准确性极限

Yifan Lu, Qiyue Zhang, Shenrun Zhang, Zhibo Yu, Zhuang Wang, Hanjie Chen, Jiarong Xing

发表机构 * Rice University(莱斯大学) Amazon(亚马逊)

AI总结 研究发现多种LLM路由方法存在“路由平台期”现象,即准确性趋同且远低于理想路由器,主要原因是可预测性瓶颈;使用更大的训练数据集、更强的编码器和端到端微调可突破平台期。

Comments 23 Pages, 12 Tables, 9 Figures

详情
AI中文摘要

LLM路由已成为一种流行的方法,通过为每个查询动态选择模型来改善LLM服务的成本-质量权衡。最近的工作探索了广泛的路由方法,包括基于聚类的路由器、学习分类器、成对排序和基于置信度的方法。我们对五个基准测试中的21种路由方法的广泛研究揭示了一个一致的现象,我们称之为路由平台期:许多方法,包括kNN,实现了非常相似的准确性,并收敛到一个狭窄的性能范围,远低于理想路由器。我们的研究表明,平台期主要是由可预测性瓶颈引起的:当前路由器主要学习全局平均模型性能趋势,而不是细粒度的查询特定路由信号。因此,它们解决了重叠的简单查询,但共同在需要实例特定路由决策的困难查询上失败。我们进一步研究如何超越平台期,发现更大的训练数据集、更强的编码器和端到端微调可以进一步提高路由准确性。这些发现表征了当前路由方法的常见限制,并为社区构建更有效的路由系统提供了见解和可操作的方向。

英文摘要

LLM routing has become a popular approach to improve the cost-quality trade-off of LLM services by dynamically selecting a model for each query. Recent work has explored a broad range of routing methods, including clustering-based routers, learned classifiers, pairwise ranking, and confidence-based approaches. Our extensive study of 21 routing methods across five benchmarks reveals a consistent phenomenon that we call the routing plateau: many methods, including kNN, achieve very similar accuracy and converge to a narrow performance range that remains far below the oracle router. Our investigation shows that the plateau is largely caused by a predictability bottleneck: current routers mainly learn global averaged model-performance trends rather than fine-grained query-specific routing signals. As a result, they solve overlapping easy queries but collectively fail on hard queries that require instance-specific routing decisions. We further study how to move beyond the plateau and find that larger training datasets, stronger encoders, and end-to-end fine-tuning can further improve routing accuracy. These findings characterize the common limits of current routing methods and provide insights and actionable directions for the community to build more effective routing systems.
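
To make the gap between a router that follows global average trends and the oracle router concrete, the sketch below compares three quantities on a per-query correctness table: the accuracy of always picking the single best model, the accuracy of an arbitrary routing choice, and the oracle accuracy. The table and the function names are hypothetical; this is an illustration of the terms in the abstract, not the paper's evaluation code.

```python
def accuracy_of_choices(correct, choices):
    """Accuracy when query i is sent to model choices[i]."""
    return sum(row[m] for row, m in zip(correct, choices)) / len(correct)


def oracle_accuracy(correct):
    """Accuracy of an oracle that picks a correct model whenever one exists."""
    return sum(1 for row in correct if any(row)) / len(correct)


def best_single_model_accuracy(correct):
    """Accuracy of always routing to the model with the best average score."""
    num_queries = len(correct)
    num_models = len(correct[0])
    return max(
        accuracy_of_choices(correct, [m] * num_queries) for m in range(num_models)
    )


# Hypothetical table: correct[q][m] is 1 if model m answers query q correctly.
correct = [
    [1, 1, 1],  # easy query: every model succeeds
    [1, 1, 0],
    [1, 0, 0],  # hard queries: exactly one model succeeds
    [0, 1, 0],
    [0, 0, 1],
    [0, 0, 0],  # unsolvable by any model in the pool
]

print(best_single_model_accuracy(correct))
print(oracle_accuracy(correct))
```

In this toy table, closing the gap requires getting the middle rows right, which is only possible with query-specific signals; average model quality cannot distinguish them.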

2606.07597 2026-06-09 cs.LG cs.AI 新提交

Repetition Mismatch: Why Data Mixture Experiments Don't Scale and How to Fix Them

重复不匹配:为什么数据混合实验无法扩展以及如何修复

Kevin Zhou, Lisa Alazraki, Kris Cao, Marek Rei

发表机构 * Imperial College London(帝国理工学院) Cohere

AI总结 针对预训练数据混合中因高质量数据重复率变化导致的小规模实验外推失败问题,提出重复控制子采样方法,在1/16目标token预算下实现接近最优混合,揭示了重复动态而非规模决定实验泛化性。

详情
AI中文摘要

预训练数据混合通常通过运行小规模实验并外推到目标训练预算来调整。当高质量数据稀缺且必须重复时,这种外推经常失败,但失败的原因尚未被厘清。我们表明,一个主要原因是重复不匹配:由于高质量数据集很小,它们的重复率随着训练预算的增长而变化,以小规模代理实验未预期的方式改变最优混合。一种匹配目标重复率的子采样程序可以控制这种效应。在结合有限高质量数据和网络爬取数据的双源设置中,仅使用目标token的1/16的单次重复控制实验即可恢复757M参数模型的最优混合,误差在0.05以内,而无重复控制时误差为0.75。在没有重复控制的情况下达到相当的精度需要三到四个训练时域,消耗目标token预算的44%到94%。对于三个数据源,更大的混合空间需要不止一个实验来约束,但该方法仍然有效:在757M规模下,仅两个重复控制的训练时域即可恢复最优混合,优于需要完整双源实验才能构建的基线。我们的结果表明,重复动态(而非仅规模)决定了小规模混合实验是否泛化。更广泛地说,它们表明数据重复应被视为混合优化中的一等变量,而不是有限数据带来的不便副作用。

英文摘要

Pre-training data mixtures are commonly tuned by running small-scale experiments and extrapolating to the target training budget. When high-quality data is scarce and must be repeated, this extrapolation frequently fails, but the source of the failure has not been isolated. We show that a primary culprit is a repetition mismatch: because high-quality datasets are small, their repetition rate changes as the training budget grows, shifting the optimal mixture in ways that small-scale proxy experiments do not anticipate. A subsampling procedure that matches the target repetition rate controls for this effect. In a two-source setting combining limited high-quality data with web crawl, a single repetition-controlled experiment using only 1/16 of the target tokens recovers a mixture within 0.05 of the optimum for a 757M parameter model, compared to an error of 0.75 without repetition control. Achieving comparable accuracy without repetition control requires three to four horizons, consuming 44 to 94% of the target token budget. With three data sources, the larger mixture space requires more than a single experiment to constrain, but the approach remains effective: at the 757M scale, just two repetition-controlled horizons recover the optimal mixture, outperforming baselines that instead require the full two-source experiments to construct. Our results reveal that repetition dynamics, not scale alone, shape whether small-scale mixture experiments generalize. More broadly, they suggest that data repetition deserves treatment as a first-class variable in mixture optimization, rather than an inconvenient side effect of limited data.
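
The repetition mismatch can be illustrated with simple arithmetic. The sketch below assumes that the repetition rate of a source is the number of tokens drawn from it divided by its size, and that repetition control means shrinking the high-quality source in proportion to the proxy budget. Both are assumptions made for illustration (the abstract does not give the exact procedure), and all token counts are hypothetical.

```python
def repetition_rate(token_budget, mixture_weight, dataset_tokens):
    """Average number of passes over a source: tokens drawn from it / its size."""
    return token_budget * mixture_weight / dataset_tokens


def matched_subsample_size(dataset_tokens, proxy_budget, target_budget):
    """Subsample size that keeps the repetition rate equal at the proxy scale."""
    return dataset_tokens * proxy_budget / target_budget


# Hypothetical numbers: a 16B-token target run and a proxy run at 1/16 of it.
target_budget = 16_000_000_000
proxy_budget = target_budget // 16
high_quality_tokens = 1_000_000_000
weight = 0.5  # fraction of training tokens drawn from the high-quality source

target_rate = repetition_rate(target_budget, weight, high_quality_tokens)
naive_proxy_rate = repetition_rate(proxy_budget, weight, high_quality_tokens)

subsample = matched_subsample_size(high_quality_tokens, proxy_budget, target_budget)
controlled_proxy_rate = repetition_rate(proxy_budget, weight, subsample)

print(target_rate, naive_proxy_rate, controlled_proxy_rate)
```

In the uncontrolled proxy the high-quality data is seen less than once, while the target run repeats it several times, so the two runs can disagree about the best mixture weight even though only the budget changed.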

2606.07607 2026-06-09 cs.LG q-bio.GN 新提交

Position: Genomic Model Research Must Move Beyond Anecdotal Evaluation of Interpretability Methods

立场:基因组模型研究必须超越可解释性方法的轶事评估

Shasha Zhou, Mingyu Huang, Ke Li

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 本文通过转录因子结合基准测试,揭示不同可解释性方法常产生矛盾解释、无法定位已知调控基序且不能忠实反映模型决策,主张采用类似临床试验的系统验证框架。

详情
AI中文摘要

机器学习和计算能力的进步释放了人类基因组的预测潜力,但生物学家现在要求这些模型也能阐明潜在的生物学机制。尽管可解释机器学习(IML)技术已被越来越多地用于弥合这一差距,但普遍存在对轶事验证的依赖:绝大多数研究仅依赖单一IML方法,并仅报告孤立的成功实例。通过对转录因子结合的基准测试,我们展示了当前实践的风险。我们表明,不同的IML方法通常可能(1)对同一预测产生矛盾的解释,(2)无法定位已知的调控基序,以及(3)未能忠实反映模型的内部决策过程。鉴于此,我们主张建立一个类似于临床试验的验证框架:正如试验需要严格的设计和不良事件报告,基因组可解释性必须超越挑选的合理性,转向对一致性、忠实性和生物学有效性的系统评估。为促进这一点,我们提出了一个分层框架,以指导基因组IML方法的严格评估和报告。

英文摘要

Advances in machine learning and computational power have unlocked the predictive potential of the human genome, yet biologists now demand that these models also elucidate the underlying biological mechanisms. While interpretable machine learning (IML) techniques have been increasingly applied to bridge this gap, there has been a pervasive reliance on anecdotal validation: the vast majority of research relies on a single IML method and reports only isolated successful instances. Through a benchmarking study on transcription factor binding, we demonstrate the risks of current practices. We show that different IML methods can often (1) yield contradictory explanations for the same predictions, (2) fail to localize known regulatory motifs, and (3) fail to faithfully reflect the model's internal decision process. In light of this, we argue for a validation framework analogous to clinical trials: just as trials require rigorous design and adverse-event reporting, genomic interpretability must move beyond cherry-picked plausibility toward systematic assessment of consistency, faithfulness, and biological validity. To facilitate this, we propose a tiered framework to guide rigorous evaluation and reporting of genomic IML methods.

2606.07616 2026-06-09 cs.LG cs.AI cs.CL 新提交

Item Response Scaling Laws: A Measurement Theory Approach for Efficient and Generalizable Neural Scaling Estimation

项目反应缩放定律:一种高效且可泛化的神经缩放估计的测量理论方法

Sang Truong, Yuheng Tu, Rylan Schaeffer, Sanmi Koyejo

AI总结 提出项目反应缩放定律(IRSL),将项目反应理论融入缩放定律框架,通过Beta-IRT模型利用语言模型的概率响应,将参数复杂度从O(M×N)降至O(M+N),在预训练和测试时缩放场景中仅用50个问题即可实现可靠估计。

详情
AI中文摘要

缩放定律为理解语言模型(LM)的性能提供了基本框架,但推导它们需要在数千个检查点或数百万个推理样本上进行成本高昂的评估。为了解决这个问题,我们引入了项目反应缩放定律(IRSL),这是一个将项目反应理论(IRT)整合到缩放定律框架中的统一框架。与将每个模型-基准对单独处理的传统方法不同,IRSL将潜在模型能力与问题特征分离,将M个模型和N个问题的缩放定律估计分解,从而将参数复杂度从O(M×N)显著降低到O(M+N)。我们使用Beta-IRT实例化IRSL,它利用LM的经验概率响应——例如预训练中的token概率和测试时采样中的通过率——来捕获比二元响应更丰富的信号。我们在两种常见的缩放范式上验证了我们的方法:(1)预训练下游缩放,使用来自10个基准的6,612个LM检查点和37,682个问题;以及(2)测试时缩放,使用来自4个基准的12个LM和120个问题,每个问题最多2,500个样本。在现有模型响应上进行一次性校准后,IRSL仅使用每个基准50个问题(减少99.9%)即可产生更可靠的缩放估计,达到与传统方法相当或更优的决策准确性。此外,我们表明估计的潜在模型能力是可泛化的,从而能够跨共享相同测量目标的基准进行准确的性能预测。

英文摘要

Scaling laws provide a fundamental framework for understanding the performance of Language Models (LMs), yet deriving them requires prohibitively expensive evaluations across thousands of checkpoints or millions of inference samples. To address this, we introduce Item Response Scaling Laws (IRSL), a unified framework that integrates Item Response Theory (IRT) within the scaling law framework. Unlike traditional approaches that treat each model-benchmark pair in isolation, IRSL disentangles latent model ability from question characteristics, factorizing the scaling law estimation for M models and N questions to significantly reduce parameter complexity from O(M × N) to O(M + N). We instantiate IRSL with Beta-IRT, which leverages the empirical probability responses of LMs, such as token probabilities in pre-training and pass rates in test-time sampling, to capture richer signals than binary responses. We validate our approach across two prevalent scaling paradigms: (1) pre-training downstream scaling, using 6,612 LM checkpoints and 37,682 questions from 10 benchmarks; and (2) test-time scaling, using 12 LMs and 120 questions from 4 benchmarks with up to 2,500 samples per question. Given a one-time calibration on existing model responses, IRSL yields more reliable scaling estimates using only 50 questions per benchmark (a 99.9% reduction), achieving comparable or superior decision accuracy to traditional approaches. Furthermore, we show that the estimated latent model abilities are generalizable, enabling accurate performance forecasting across benchmarks that share the same measurement objective.
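
The reduction from O(M × N) to O(M + N) parameters follows from giving each model one latent ability and each question its own characteristics, instead of one free parameter per model-question pair. The sketch below counts parameters for the test-time setting quoted above (12 LMs, 120 questions) and uses the one-parameter Rasch model as a stand-in. The Rasch model is a standard IRT model swapped in for illustration; it is not the Beta-IRT model used in the paper, and the one-parameter-per-entity count is an assumption.

```python
import math


def per_pair_parameter_count(num_models, num_questions):
    """One free parameter for every (model, question) pair."""
    return num_models * num_questions


def factorized_parameter_count(num_models, num_questions):
    """One ability per model plus one difficulty per question."""
    return num_models + num_questions


def rasch_probability(ability, difficulty):
    """Rasch (1PL) IRT: probability that a model answers a question correctly."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))


num_models, num_questions = 12, 120
print(per_pair_parameter_count(num_models, num_questions))
print(factorized_parameter_count(num_models, num_questions))

# Hypothetical ability and difficulty values on a shared latent scale.
print(rasch_probability(ability=1.0, difficulty=0.0))
```

Richer IRT variants add a few parameters per question, which changes the constant but not the O(M + N) growth.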

2606.07630 2026-06-09 cs.LG cs.AI stat.ML 新提交

Active Learning with Foundation Model Priors: Efficient Learning under Class Imbalance

基于基础模型先验的主动学习:类别不平衡下的高效学习

Jiancheng Zhang, Meiqing Li, Qi Zhang, Yinglun Zhu

发表机构 * University of California, Riverside(加州大学河滨分校) Carnegie Mellon University(卡内基梅隆大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 针对现实数据中的类别不平衡和噪声标注问题,提出一种利用基础模型先验的主动学习框架,通过不平衡感知的协同决策选择信息量最大的样本,在图像和文本数据集上实现超过50%的标注节省。

Comments To appear at ICML 2026

详情
AI中文摘要

现实世界中图像和文本领域的数据集通常具有偏斜的类别分布和噪声标注,这共同降低了模型性能,尤其是对少数类。在现有解决方案中,主动学习通过选择性地查询信息最丰富且平衡的样本进行标注,提供了一种有效且高效的范式。我们提出了一种创新的主动学习框架,该框架减轻了类别不平衡,并选择信息量最大的样本进行标注。利用基础模型先验,我们的算法使得基础模型和小模型之间能够进行不平衡感知的协同决策,以处理跨领域的有噪声和不平衡标签。我们首次系统性地研究了在图像和文本领域中标签噪声和类别不平衡双重挑战下的主动学习。在不平衡数据集上的大量实验表明,我们的方法实现了显著的标注节省——与最佳主动学习基线相比超过50%——同时保持了对标签噪声的性能和鲁棒性。

英文摘要

Real-world datasets across image and text domains are often characterized by skewed class distributions and noisy annotations, which jointly degrade model performance, particularly on minority classes. Among existing solutions, active learning offers an effective and efficient paradigm by selectively querying the most informative and balanced samples for annotation. We propose an innovative active learning framework that mitigates class imbalance and selects the most informative samples to annotate. Leveraging foundation model priors, our algorithm enables imbalance-aware co-decisions between a foundation model and a small model to tackle noisy and imbalanced labels across various domains. We introduce the first study to systematically explore active learning under the dual challenges of label noise and class imbalance across image and text domains. Extensive experiments on imbalanced datasets demonstrate that our method achieves substantial annotation savings (over 50% compared to the best active learning baseline) while preserving performance and robustness to label noise.

2606.07632 2026-06-09 cs.LG 新提交

Evaluation of ML Resource Utilization Requires Model Life Cycle Assessment

评估机器学习资源利用需要模型生命周期评估

Jared Fernandez, Clara Na, Yonatan Bisk, Constantine Samaras, Emma Strubell


AI总结 本文提出应用生命周期评估方法全面核算AI系统从硬件制造到训练推理的全链条资源消耗与环境影响,以弥补传统单一训练或推理成本评估的不足。

Comments ICML 2026: Position Paper Track

详情
AI中文摘要

正确核算人工智能(AI)系统的能源需求和环境影响对于研究人员、开发者、政策制定者和用户评估构建大规模系统的障碍是必要的。随着开发和部署AI系统所需的管道和底层基础设施日益复杂,以往侧重于单次训练运行或单个推理预测成本的AI效率评估方法已不再足够。在这篇立场论文中,我们阐述了应用生命周期评估来评估机器学习模型开发和部署管道成本的必要性,以正确核算所需资源和下游影响。生命周期评估能够将AI系统及其底层基础设施整个生命周期的成本纳入考量,从与物理计算硬件相关的隐含成本到训练和推理中的运营成本。

英文摘要

Proper accounting of the energy requirements and environmental impact of artificial intelligence (AI) systems is necessary for researchers, developers, policy makers, and users to assess the barriers to building systems at scale. With the growing complexity of pipelines and underlying infrastructure needed to develop and deploy AI systems, previous approaches for evaluating AI efficiency which focus on the costs of a single training run or an individual inference prediction are no longer sufficient. In this position paper, we enunciate the need for applying life cycle assessment to evaluate the costs of the machine learning model development and deployment pipeline to properly account for the required resources and downstream impact. Life cycle assessments enable the incorporation of costs across the full life cycle of an AI system and its underlying infrastructure, from the embodied costs associated with the physical computing hardware through the operational costs in training and inference.

2606.07690 2026-06-09 cs.LG cs.AI 新提交

HARP: Efficient Data Selection for Finetuning Large Language Models

HARP:面向大型语言模型微调的高效数据选择

Ning Wang, Zhengxin Zhang, Maosen Tang, Yitang Gao, Claire Cardie, Sainyam Galhotra

发表机构 * Cornell University(康奈尔大学) The Hong Kong University of Science and Technology(香港科技大学)

AI总结 提出层次主动区域剪枝(HARP),一种高效的基于训练的数据选择方法,通过层次结构和经验贝叶斯推断降低选择成本,同时保持下游对齐,在多个基准上优于最强基线最多8.9分,且训练样本减少约7倍。

详情
AI中文摘要

微调数据选择需要平衡两个相互竞争的目标:选择改善下游目标的示例,以及在不重复微调模型的情况下做到这一点。无训练选择器具有可扩展性,但依赖于嵌入相似性或聚类等代理,这些代理可能与目标任务的优化目标不一致。基于训练的选择器通过梯度信号、子集评估或Shapley归因更好地反映下游效用,但需要大量昂贵的训练-评估迭代。我们提出层次主动区域剪枝(HARP),一种高效的基于训练的选择器,在降低选择成本的同时保持下游对齐。HARP将训练池组织成节点-叶子层次结构,仅评估代表性叶子,并使用经验贝叶斯后验推断未测量的效用。然后,它使用两个互补的包络选择数据:HARP-C,保守地控制冗余,以及HARP-E,加性地奖励互补区域。我们理论上证明,在局部平滑和有界估计误差下,HARP控制选择误差同时降低训练-评估成本。我们进一步验证,HARP变体实现了最佳结果,并在使用大约7倍更少训练示例的情况下,比最强基线高出最多8.9分。

英文摘要

Finetuning data selection requires balancing two competing goals: selecting examples that improve the downstream objective, and doing so without repeatedly finetuning models. Train-free selectors are scalable but rely on proxies such as embedding similarity or clustering, which may not match the target objective. Train-based selectors better reflect downstream utility through gradient signals, subset evaluation, or Shapley attribution, but require many costly train-evaluate iterations. We propose Hierarchical Active Region Pruning (HARP), an efficient train-based selector that preserves downstream alignment while reducing selection cost. HARP organizes the training pool into a node-leaf hierarchy, evaluates only representative leaves, and infers unmeasured utilities with empirical Bayes posteriors. It then selects data using two complementary envelopes: HARP-C, which conservatively controls redundancy, and HARP-E, which additively rewards complementary regions. We theoretically show that, under local smoothness and bounded estimation error, HARP controls selection error while reducing train-evaluate cost. We further validate that HARP variants achieve the best result and outperform the strongest baseline by up to +8.9 points, while using roughly 7× fewer training examples.

2606.07726 2026-06-09 cs.LG 新提交

Cutting LLM Evaluation Costs with SySRs: A Bandit Algorithm that Provably Exploits Model Similarity

利用SySRs降低LLM评估成本:一种可证明利用模型相似性的Bandit算法

Zifan Lyu, Chahine Nejma, Tobias Wegel, Fanny Yang, Florian E. Dorner

发表机构 * ETH Zurich(苏黎世联邦理工学院) Centrale Supélec(中央理工-高等电力学院) ENS de Cachan(卡尚高等师范学校) MPI for Intelligent Systems, Tübingen(马克斯·普朗克智能系统研究所,图宾根)

AI总结 提出SySRs算法,通过配对比较和自适应分配评估预算,利用模型相似性降低LLM评估成本,在15个基准上平均错误率最低。

Comments Published at ICML 2026

详情
AI中文摘要

大型语言模型通常通过在每个测试查询上评估每个模型来进行基准测试。对于寻求部署最佳模型的从业者来说,这通常是浪费的:如果一个模型明显比其他模型表现更差,则无需精确估计其性能。最佳臂识别算法可以通过自适应分配评估预算来大幅降低成本。此外,语言模型通常对相同的提示做出相似的反应——先前的工作试图利用这一特性但结果好坏参半。我们提出了同步连续拒绝(SySRs),通过配对比较增强了经典的连续拒绝算法。与先前在最佳模型识别中利用模型相似性的尝试不同,我们的方法无超参数,并且具有随着评估模型之间相似性程度的提高而改善的性能保证。在经验上,我们的方法在15个标准基准上的平均错误率以及可靠识别最佳模型的最坏情况预算方面均优于所有基线。

英文摘要

Large Language Models are typically benchmarked by evaluating every model on every test query. For practitioners seeking the best model to deploy, this is often wasteful: if a model clearly performs worse than others, there is no need to precisely estimate its performance. Best-arm identification algorithms can be naturally applied to drastically reduce costs by adaptively allocating evaluation budget. Further, language models often respond similarly to the same prompt, a property that previous work has tried to leverage with mixed success. We propose Synchronized Successive Rejects (SySRs), augmenting the classical Successive Rejects algorithm with paired comparisons. Unlike prior attempts to leverage model similarity in best-model identification, our approach is hyperparameter-free and enjoys performance guarantees that improve with the degree of similarity between evaluated models. Empirically, our method outperforms all baselines in terms of average error rate across 15 standard benchmarks, and in terms of worst-case budget for reliably identifying the best model.
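
For orientation, the classical Successive Rejects algorithm that SySRs builds on can be sketched as follows. This is the baseline procedure of Audibert and Bubeck (2010), not SySRs itself: it omits the paired comparisons described in the abstract. The `pull` callback, the simulated accuracies, and the budget are hypothetical.

```python
import math
import random


def successive_rejects(pull, num_arms, budget):
    """Classical Successive Rejects: drop the empirically worst arm each phase.

    pull(arm) returns one reward, e.g. 1.0 if the model answers a sampled
    test query correctly and 0.0 otherwise.
    """
    if budget <= num_arms:
        raise ValueError("budget must exceed the number of arms")
    # log-bar(K) = 1/2 + sum_{i=2..K} 1/i, the normaliser of the phase lengths.
    log_bar = 0.5 + sum(1.0 / i for i in range(2, num_arms + 1))
    active = list(range(num_arms))
    totals = [0.0] * num_arms
    counts = [0] * num_arms
    previous_length = 0
    for phase in range(1, num_arms):
        # Cumulative number of pulls each surviving arm has after this phase.
        length = math.ceil((budget - num_arms) / (log_bar * (num_arms + 1 - phase)))
        for arm in active:
            for _ in range(length - previous_length):
                totals[arm] += pull(arm)
                counts[arm] += 1
        previous_length = length
        worst = min(active, key=lambda a: totals[a] / counts[a])
        active.remove(worst)
    return active[0]


# Hypothetical simulation: three models with different true accuracies.
rng = random.Random(0)
accuracies = [0.55, 0.60, 0.80]


def answer_is_correct(model_index):
    return 1.0 if rng.random() < accuracies[model_index] else 0.0


print(successive_rejects(answer_is_correct, num_arms=3, budget=300))
```

Clearly inferior models are rejected early and stop consuming budget, which is the saving the abstract refers to; each arm is still evaluated independently here, so similarity between models is not exploited.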

2606.07789 2026-06-09 cs.LG stat.ML 新提交

A Framework for Evaluating and Benchmarking Concept Drift Detection Methods

概念漂移检测方法的评估与基准测试框架

Vitor Cerqueira, Heitor Murilo Gomes, Marco Heyden, Bernhard Pfahringer, Albert Bifet

发表机构 * University of Coimbra(科英布拉大学) Victoria University of Wellington(惠灵顿维多利亚大学) Commerzbank(德国商业银行) University of Waikato(怀卡托大学) AI Institute, University of Waikato(怀卡托大学人工智能研究所)

AI总结 提出一个包含漂移模拟、时序感知评估和超参数优化协议的基准测试框架,在7个真实数据集上评估14种漂移检测方法,揭示其优劣并建立基线性能。

Comments Accepted in KDD'26

详情
AI中文摘要

数据流挖掘从根本上受到概念漂移的挑战,其中分布变化可能降低模型性能。尽管漂移检测方法层出不穷,但该领域的进展受到不一致评估实践的阻碍:研究依赖于过度简化的合成数据生成器,采用不兼容的指标,并且在超参数选择上缺乏透明度,使得公平比较变得困难。我们通过一个新颖的基准测试框架来解决这一差距,该框架包含三个贡献:(1)一种漂移模拟方法,通过蒙特卡洛试验将受控的分布变化注入真实世界数据集,在保留真实数据复杂性的同时实现监督评估;(2)一种用于漂移检测的评估协议,具有时序感知标准,包括推导出跨流可比较的新指标(例如,F1检测分数、归一化检测时间);(3)我们提倡一种留一数据集超参数优化协议,用于漂移检测方法,以促进跨异构流动态的配置鲁棒性。我们在7个真实世界数据集上对14种广泛使用的漂移检测方法进行了基准测试,涵盖4种漂移类型(类别先验、标签交换、特征排列、特征过滤),每种类型均包括突变和渐变转换。我们的实验结果揭示了当前漂移检测方法的优缺点,同时为该领域的未来研究建立了基线性能指标。所有代码和实验均公开可用。

英文摘要

Data stream mining is fundamentally challenged by concept drift, where distributional changes can degrade model performance. Despite the proliferation of drift detection methods, progress in the field is hindered by inconsistent evaluation practices: studies rely on oversimplified synthetic data generators, adopt incompatible metrics, and lack transparency in hyperparameter selection, making fair comparisons difficult. We address this gap with a novel benchmarking framework comprising three contributions: (1) a drift simulation method that injects controlled distributional changes into real-world datasets via Monte Carlo trials, enabling supervised evaluation while preserving real-world data complexity; (2) an evaluation protocol for drift detection with timing-aware criteria, including the derivation of new metrics (e.g., F1 detection score, normalized detection time) that are comparable across streams; and (3) a leave-one-dataset-out hyperparameter optimization protocol for drift detection methods that promotes configuration robustness across heterogeneous stream dynamics. We benchmark 14 widely used drift detection methods on 7 real-world datasets across 4 drift types (class prior, label swap, feature permutation, feature filtering), each under both abrupt and gradual transitions. Our experimental results provide insights into the strengths and weaknesses of current drift detection approaches while establishing baseline performance metrics for future research in this area. All code and experiments are publicly available.
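
One of the four drift types listed above, label swap, can be injected into an existing label stream in a few lines. The sketch below is a hypothetical illustration of abrupt versus gradual injection at a known position, which is what makes supervised evaluation of detectors possible. The function name, the linear ramp for gradual drift, and the toy stream are assumptions rather than the paper's simulation method.

```python
import random


def inject_label_swap(labels, drift_start, swap, width=0, seed=0):
    """Return a copy of `labels` with a label-swap drift starting at drift_start.

    width == 0 gives an abrupt drift. With width > 0 the probability of
    swapping rises linearly from 0 to 1 over `width` items (gradual drift).
    """
    rng = random.Random(seed)
    drifted = []
    for position, label in enumerate(labels):
        if position < drift_start:
            swap_probability = 0.0
        elif width == 0 or position >= drift_start + width:
            swap_probability = 1.0
        else:
            swap_probability = (position - drift_start) / width
        if rng.random() < swap_probability:
            drifted.append(swap.get(label, label))
        else:
            drifted.append(label)
    return drifted


# Hypothetical binary label stream with a drift injected at position 4.
stream = [0, 1] * 5
swap_map = {0: 1, 1: 0}
print(inject_label_swap(stream, drift_start=4, swap=swap_map))
print(inject_label_swap(stream, drift_start=2, swap=swap_map, width=4, seed=1))
```

Repeating the injection with different seeds and drift positions gives the kind of Monte Carlo trials the abstract mentions, with the true drift location known for every trial.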

2606.07865 2026-06-09 cs.LG cs.AI physics.comp-ph stat.ML 新提交

Instrumented data for causal scientific machine learning

因果科学机器学习的仪器化数据

Daniel N. Wilke

发表机构 * University of the Witwatersrand(威特沃特斯兰德大学)

AI总结 提出仪器化数据作为观测数据和模板合成数据之外的第三种选择,每个数据点携带产生它的机制模型、显式不确定性及可执行的反事实族,通过V&V仪器化图像到模拟管道实现,支持因果干预。

Comments 10 pages, 2 figures

详情
AI中文摘要

科学机器学习受限于训练数据而非模型大小。观测数据记录发生了什么但不记录原因;模板合成数据具有已知的生成过程,但仅适用于模拟器的模板,而非用户面对的情况。我们认为第三种选择现在在操作上是可行的:仪器化数据,其中每个数据点携带产生它的机制模型、对该模型的显式不确定性以及可执行的反事实族。验证与确认(V&V)仪器化图像到模拟管道是一种实现:传感器观测成为完全指定、求解器支持的模拟,具有显式、可编辑的参数以及传播的偶然/认知不确定性。该基底是案例特定的、机制监督的,并通过Pearl的do算子支持因果干预。在验证、审计和替代训练方面的近期影响涵盖计算生物学、气候、材料、流体力学和医学成像;长期可证伪的推论涉及科学推理的基础模型。

英文摘要

Scientific machine learning is limited less by model size than by the data it is trained on. Observational data records what happened but not why; template synthetic data has a known generating process but only for the simulator's template, not the case a user faces. We argue a third option is now operationally feasible: instrumented data, in which every datum carries the mechanistic model that produced it, an explicit uncertainty over that model, and an executable family of counterfactuals. Verification-and-validation (V&V) instrumented image-to-simulation pipelines are one realisation: a sensor observation becomes a fully specified, solver-backed simulation with explicit, editable parameters and a propagated aleatoric/epistemic uncertainty. The substrate is case-specific, mechanistically supervised, and supports causal interventions through Pearl's do-operator. Near-term consequences for validation, auditing, and surrogate training span computational biology, climate, materials, fluid mechanics, and medical imaging; a longer-term, falsifiable implication concerns foundation models for scientific reasoning.

2606.07898 2026-06-09 cs.LG cs.CE 新提交

Temporal Coverage over Density: Parsimonious Training-Set Design for ML Climate Downscaling

密度之上的时间覆盖:机器学习气候降尺度的简约训练集设计

Karandeep Singh, Stefan Rahimi, Chad W. Thackeray, Stephen Cropper, Alex Hall

发表机构 * University of California, Los Angeles(加州大学洛杉矶分校) University of Wyoming(怀俄明大学)

AI总结 针对机器学习气候降尺度中高分辨率模拟资源有限的问题,提出通过时间分布采样而非连续块状采样来分配训练年份,以更好地捕捉强迫气候响应和内部变率,实验表明时间分布采样在固定预算下性能最优。

Comments 22 pages, 8 figures

详情
AI中文摘要

高分辨率区域气候模拟为气候影响评估提供了关键信息,但计算成本高昂,这推动了机器学习降尺度器和模拟器的发展。一个关键挑战是确定如何将有限的高分辨率模拟分布到不断变化的气候轨迹中,以同时捕捉强迫气候响应和内部变率。利用美国西部的CESM2大型集合,我们在固定数据预算下比较了三种训练年份选择策略:历史年份的连续块、从模拟期开始和结束年份中抽取的年份,以及分布在整个气候轨迹中的年份。同时包含历史和未来年份始终优于仅使用历史年份训练,这表明让降尺度模型接触历史记录之外的气候状态的重要性,并突出了统计降尺度中常见的平稳性假设的局限性。使用分布在整个气候轨迹中的年份进行训练总体表现最佳,表明内部变率的广泛采样除了暴露于强迫气候响应之外还提供了额外信息。在时间分布子集上训练的模型能更成功地再现未见集合成员中的变率,同时在广泛的气候诊断中保持强劲性能。即使仅使用可用高分辨率年份的十分之一进行训练,时间分布模型仍与全数据训练高度竞争。这些结果表明,在固定计算预算下,分配稀缺的高分辨率模拟时,气候状态的广泛采样比时间连续性更有价值。这些发现为区域气候降尺度和大型集合预测工作流程提供了实用指导。

英文摘要

High-resolution regional climate simulations provide critical information for climate impacts assessments but remain computationally expensive, motivating the development of machine-learning downscalers and emulators. A key challenge is determining how limited high-resolution simulations should be distributed across a changing climate trajectory to capture both forced climate response and internal variability. Using the CESM2 Large Ensemble over the western United States, we compare three training-year selection strategies under fixed data budgets: a contiguous block of historical years, years drawn from both the beginning and end of the simulation period, and years distributed throughout the full climate trajectory. Including both historical and future years consistently outperforms training on historical years alone, demonstrating the importance of exposing downscaling models to climate states outside the historical record and highlighting limitations of stationarity assumptions common in statistical downscaling. Training on years distributed throughout the full climate trajectory performs best overall, indicating that broad sampling of internal variability provides additional information beyond exposure to the forced climate response alone. Models trained on temporally distributed subsets more successfully reproduce variability in unseen ensemble members while retaining strong performance across a wide range of climate diagnostics. Even when trained on only one-tenth of the available high-resolution years, temporally distributed models remain highly competitive with full-data training. These results suggest that, under fixed computational budgets, broad sampling of climate states is more valuable than temporal continuity when allocating scarce high-resolution simulations. The findings provide practical guidance for regional climate downscaling and large-ensemble projection workflows.
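
The three training-year selection strategies compared above can be written down directly. The sketch below uses a hypothetical 100-year simulation period and a budget of one-tenth of the years. The function names and the even spacing used for the distributed strategy are assumptions, since the abstract does not say how the distributed years are chosen.

```python
def contiguous_block(years, budget):
    """A contiguous block taken from the start of the period (historical years)."""
    return years[:budget]


def bookend_years(years, budget):
    """Years drawn from both the beginning and the end of the period."""
    tail = budget // 2
    head = budget - tail
    return years[:head] + years[len(years) - tail:]


def distributed_years(years, budget):
    """Years spread evenly across the full trajectory."""
    if not 1 <= budget <= len(years):
        raise ValueError("budget must be between 1 and the number of years")
    if budget == 1:
        return [years[0]]
    last = len(years) - 1
    return [years[(i * last) // (budget - 1)] for i in range(budget)]


# Hypothetical simulation period and a one-tenth data budget.
simulation_years = list(range(2000, 2100))
budget = 10

print(contiguous_block(simulation_years, budget))
print(bookend_years(simulation_years, budget))
print(distributed_years(simulation_years, budget))
```

All three selections cost the same number of high-resolution years; they differ only in which climate states the downscaler sees during training.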

2606.08259 2026-06-09 cs.LG 新提交

Differentially Private Synthetic Data via APIs 4: Tabular Data

通过API实现差分隐私合成数据 4: 表格数据

Toan Tran, Arturs Backurs, Zinan Lin, Victor Reis, Li Xiong, Sergey Yekhanin

发表机构 * Microsoft(微软)

AI总结 提出Tab-PE算法,将Private Evolution框架扩展至表格数据,通过启发式算子迭代优化候选数据集,在保持差分隐私的同时高效处理高阶相关性,相比基线AIM分类准确率提升最高10%,速度提升28倍。

Comments ICML'26

详情
AI中文摘要

本文研究了在差分隐私(DP)保证下生成合成表格数据的问题,使得在敏感领域能够共享数据。尽管已有大量研究,最先进的方法通常侧重于最小化低阶边际查询误差,而忽视了高阶相关性带来的挑战。为解决这一差距,我们将最初为DP合规图像和文本合成开发的Private Evolution(PE)框架扩展到表格数据。我们提出了Tab-PE——一种在DP约束下生成合成表格数据的算法。Tab-PE通过一个进化过程迭代改进候选数据集,该过程利用表格专用算子产生变体,对其进行私有评分,并选择最高质量的样本进行保留和传播。与依赖大型基础模型的原始PE不同,Tab-PE采用计算成本显著更低的启发式算子,使得PE对表格数据更加实用和可扩展。通过在真实和模拟数据集上的大量实验,我们证明Tab-PE在表现出高阶相关性的数据集上显著优于先前的基线。与最佳基线AIM相比,Tab-PE的分类准确率提高了最高10%,同时运行速度快了28倍。

英文摘要

This paper investigates the problem of generating synthetic tabular data with differential privacy (DP) guarantees, enabling data sharing in sensitive domains. Despite extensive study, state-of-the-art methods often focus on minimizing low-order marginal query errors and overlook the challenges posed by high-order correlations. To address this gap, we extend the Private Evolution (PE) framework, originally developed for DP-compliant image and text synthesis, to tabular data. We introduce Tab-PE, an algorithm for synthetic tabular data generation under DP constraints. Tab-PE iteratively improves a candidate dataset via an evolutionary process that leverages tabular-specialized operators to produce variations, privately scores them, and selects the highest-quality samples to retain and propagate. In contrast to the original PE, which relies on large foundation models, Tab-PE employs heuristic operators with significantly lower computational costs, making PE more practical and scalable for tabular data. Through extensive experiments on real-world and simulation datasets, we demonstrate that Tab-PE substantially outperforms prior baselines on datasets exhibiting high-order correlations. Compared to the best baseline, AIM, Tab-PE improves classification accuracy by up to 10% while running 28 times faster.

2606.08322 2026-06-09 cs.LG stat.ME 新提交

Orthogonality and Dimensionality in Airline Cluster Analysis using PCA and Kernel PCA

使用PCA和核PCA的航空公司聚类分析中的正交性与维度性

Andreas Schlapbach

发表机构 * Swiss Federal Railways (SBB)(瑞士联邦铁路(SBB)) University of Berne(伯尔尼大学)

AI总结 本文复现了Renold等人对1995-2020年美国航空公司利润周期的聚类实验,通过PCA和核PCA分析,发现六聚类分类在原始7维和3维PC空间中具有几何鲁棒性,并验证了数据的内在线性流形结构。

详情
AI中文摘要

为了刻画1995年至2020年美国航空公司的利润周期,Renold等人(2023)结合了k-means聚类、主成分分析和系统动力学建模。我们在三个空间中复现了他们的聚类实验——原始7维变量空间、3维PC得分空间和4维PC得分空间,使用了他们论文中慷慨包含的数据集。我们表明,六聚类分类在几何上是鲁棒的:在3-PC空间中的k-means产生的聚类分配与7维原始空间逐位相同。作为非线性检验,我们在六个核(涵盖三个族加上一个线性基线)下应用核PCA。所有六个核在2D中保留了六聚类分配。一个1D诊断进一步收紧:线性核将COVID年份C_3与峰值利润聚类C_0混淆,而所有五个非基线核将C_3移动到仅与后金融危机聚类C_5重叠。核族之间的一致性证实了一个内在的线性流形,没有隐藏的曲率。轮廓准则显示,该数据集在结构上仅支持三个聚类,而不是六个。原始7D空间中的共线性抑制了本应识别k=3作为结构上合理选择的轮廓信号。

英文摘要

To characterize the US airline profit cycles from 1995 to 2020, Renold et al. (2023) combine k-means clustering, principal component analysis, and system dynamics modelling. We replicate their clustering experiment in three spaces (the original 7-dimensional raw-variable space, a 3-dimensional PC score space, and a 4-dimensional PC score space) using the dataset that they generously include in their paper. We show that the six-cluster taxonomy is geometrically robust: k-means in 3-PC space produces bit-for-bit identical cluster assignments relative to 7D raw space. As a nonlinearity check, we apply kernel PCA under six kernels spanning three families plus a linear baseline. All six kernels preserve the six-cluster assignment in 2D. A 1D diagnostic tightens this: the linear kernel conflates the COVID year C_3 with the peak-profit cluster C_0, whereas all five non-baseline kernels shift C_3 to overlap only the post-financial-crisis cluster C_5. Agreement across the kernel families confirms an intrinsically linear manifold with no hidden curvature. The silhouette criterion reveals that the dataset structurally supports only three clusters, not six. Collinearity in the raw 7D space suppresses the silhouette signal that would otherwise identify k=3 as the structurally motivated choice.

2606.08376 2026-06-09 cs.LG cs.AI 新提交

RiskNet: A large-scale dataset of AI risk incidents from news with alignment and multi-dimensional annotations

RiskNet:一个来自新闻的大规模AI风险事件数据集,包含对齐和多维标注

Leihan Zhang, Wecheng Ye, Xianlong Ma, Haochuan Liu, Yang Li, Qianyu Zhang, Jinliang Chen, Qiang Yan

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Beijing Key Laboratory of Multimodal Data Intelligent Perception and Governance(多模态数据智能感知与治理北京市重点实验室)

AI总结 提出RiskNet,一个从多语言新闻构建的大规模AI风险事件数据集,通过结构化流水线进行事件识别、对齐和多维分类,支持AI安全、治理和风险分析研究。

Comments The manuscript has been submitted to Scientific Data

详情
AI中文摘要

随着人工智能(AI)系统越来越多地部署在社会关键领域,与AI相关的危害和失败事件的报告在频率和多样性上不断增加。尽管现有的治理框架阐述了负责任AI的高层原则,但用于跟踪和分析真实世界AI风险事件的大规模实证资源仍然有限。现有的事件集合通常由人工整理,规模相对较小,不足以支持持续、数据驱动的监控和下游计算分析。为满足这一需求,我们提出了RiskNet,一个从大规模多语言新闻源构建的AI风险事件数据集。RiskNet应用了一个结构化的流水线,用于AI风险新闻识别、事件级报告筛选、事件对齐和多维事件分类。生成的资源将分散的新闻报道组织成以事件为中心的记录,并为事件分类、事件对齐和事件级风险标注提供基准数据集。在当前版本中,RiskNet覆盖了数亿条源记录,并生成了一个大规模的AI风险相关报告集合,包括对齐的事件簇和标注的基准子集。该数据集还通过一个在线平台提供浏览和探索功能。我们描述了数据源、处理工作流、分类法设计以及资源的技术验证。RiskNet旨在支持AI安全、治理、风险分析和基准测试的下游研究,以及对AI相关危害的纵向和跨源分析。通过提供一个结构化且可复用的实证资源,RiskNet有助于弥合高层治理原则与AI风险事件记录现实之间的差距。

英文摘要

As artificial intelligence (AI) systems are increasingly deployed across socially consequential domains, reports of AI-related harms and failures have grown in frequency and diversity. Although existing governance frameworks articulate high-level principles for responsible AI, large-scale empirical resources for tracking and analyzing real-world AI risk incidents remain limited. Existing incident collections are often manually curated, relatively small in scale, and insufficient for continuous, data-driven monitoring and downstream computational analysis. To address this need, we present RiskNet, a large-scale dataset of AI risk incidents constructed from large-scale multilingual news sources. RiskNet applies a structured pipeline for AI risk news identification, event-level report screening, incident alignment, and multi-dimensional incident classification. The resulting resource organizes dispersed news reports into incident-centered records and provides benchmark datasets for event classification, incident alignment, and incident-level risk labeling. In its current release, RiskNet covers hundreds of millions of source records and yields a large-scale collection of AI risk-related reports, including aligned incident clusters and annotated benchmark subsets. The dataset is also accessible through an online platform for browsing and exploration. We describe the data sources, processing workflow, taxonomy design, and technical validation of the resource. RiskNet is intended to support downstream research on AI safety, governance, risk analysis, and benchmarking, as well as longitudinal and cross-source analyses of AI-related harms. By providing a structured and reusable empirical resource, RiskNet helps bridge the gap between high-level governance principles and the documented realities of AI risk incidents.

2606.08481 2026-06-09 cs.LG cs.AI cs.DB cs.SE 新提交

PIPE-Cypher: Automatic Enterprise Benchmark Generation for Text-to-Cypher Systems

PIPE-Cypher:面向文本到Cypher系统的自动企业基准生成

Suraj Ranganath, Anish Raghavendra

发表机构 * Halıcıoğlu School of Data Science and Computing, University of California, San Diego(加利福尼亚大学圣迭戈分校哈勒乔卢数据科学与计算学院) Independent Researcher(独立研究员)

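AI summary Presents PIPE-Cypher, a pipeline that uses a local LLM to generate balanced NL-to-Cypher benchmarks automatically from enterprise property graphs, combining schema profiling, reverse-query grounding, constrained generation, and execution validation to make benchmark construction repeatable.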

详情
AI中文摘要

企业属性图在模式结构、内部术语、领域假设、治理约束和用户交互模式上差异很大。因此,与部署相关的Text2Cypher基准反映了用户和代理实际对该图提出的问题。创建这样的基准很困难,因为模式和值是唯一的,且图结构随时间变化。每个自然语言查询对必须可执行、使用真实图实体、保持多样性,并在查询类型和难度级别上保持平衡。我们提出PIPE-Cypher,一个本地基准生成流水线,它将实时属性图和来自客户问题、分析师日志或代理工具调用的可选种子查询转化为平衡的NL-to-Cypher基准。PIPE-Cypher结合了模式分析、逆向查询接地、约束生成、确定性Cypher治理、执行验证、编辑、多样性控制以及校准的本地大语言模型评判器。使用本地Qwen3.5-9B生成和评判,PIPE-Cypher导出了3000个可接受的FinBench/SNB示例,完成了三个审计消融套件,用人类标签校准评判器行为,并评估了11个本地下游模型。生成的基准具有明确的区分性:零样本迁移效果弱,而少样本控制表明,特定模式的示例库可以帮助兼容的模型家族。总之,PIPE-Cypher使Text2Cypher基准测试成为一个可重复的过程,随图、用户和目标工作负载而演变。

英文摘要

Enterprise property graphs vary widely in schema structure, internal terminology, domain assumptions, governance constraints, and user interaction patterns. A deployment-relevant Text2Cypher benchmark therefore reflects the questions users and agents actually ask of that graph. Creating such a benchmark is difficult because schemas and values are unique, and graph structure changes over time. Each NL-query pair must also be executable, use real graph entities, preserve diversity, and remain balanced across query types and difficulty levels. We present PIPE-Cypher, a local benchmark-generation pipeline that turns a live property graph and optional seed queries from customer questions, analyst logs, or agent tool calls into balanced NL-to-Cypher benchmarks. PIPE-Cypher combines schema profiling, reverse-query grounding, constrained generation, deterministic Cypher governance, execution validation, redaction, diversity controls, and a calibrated local LLM judge. Using local Qwen3.5-9B generation and judging, PIPE-Cypher exports 3,000 accepted FinBench/SNB examples, completes three audited ablation suites, calibrates judge behavior with human labels, and evaluates 11 local downstream models. The resulting benchmark is deliberately discriminative: zero-shot transfer is weak, while a few-shot control shows that schema-specific example banks can help compatible model families. Together, PIPE-Cypher makes Text2Cypher benchmarking a repeatable process that evolves with the graph, its users, and its target workloads.

2606.08718 2026-06-09 cs.LG cs.AI 新提交

Deep Active Re-Labeling: Toward Noise-Resilient Annotation Efficiency

深度主动重标注:迈向抗噪的标注效率

Md Abdullah Al Forhad, Weishi Shi

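AI summary To counter the performance drop that human annotation noise causes in deep active learning, proposes a framework that spends part of the annotation budget on re-labeling already-labeled data; experiments show higher data efficiency under the same budget and a final dataset with little noise.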

Comments Accepted and published in the 2025 IEEE International Conference on Big Data (BigData). DOI: 10.1109/BigData66926.2025.11402126

详情
Journal ref
2025 IEEE International Conference on Big Data (BigData), Macau, China, 2025, pp. 886-895
AI中文摘要

虽然深度主动学习(DAL)有效减少了人工标注成本,但其效果受到人工标注误差的限制。这是因为主动学习采样的数据被认为对训练具有高度信息性。当人工标注者以一定比率向这些信息性数据引入错误时,主动学习性能显著下降,有时甚至比被动学习更差。本文首先分析了DAL设置中人工标注误差的影响。然后,我们提出了一个框架来解决DAL中的人工标注噪声问题。受人类学习模式的启发,我们提出的解决方案的核心思想是将部分人工标注预算分配给重新标注已标注的数据。先前的理论工作表明,当模型具备一定识别潜在噪声数据的能力时,即使重新标注一小部分数据也能有效去除主动训练集中的噪声。为此,我们实现了两种主动噪声采样策略,在不同情况下检测噪声,并分配部分标注预算重新标注这些实例。我们的方法赋予了主动学习一种回顾和内省的行为。实验表明,在相同标注预算下,我们的方法数据效率更高,并最终产生一个相对无噪声的标注数据集。

英文摘要

While Deep Active Learning (DAL) effectively reduces human annotation costs, its efficacy is constrained by human annotation errors. This is because the data sampled for active learning is assumed to be highly informative for training. When human annotators introduce errors into this informative data at a certain rate, the active learning performance drops significantly and, in some cases, even exhibits worse outcomes than passive learning. In this paper, we first analyze the impact of human annotation errors in the DAL setting. Then we propose a framework to address the human annotation noise problem for DAL. Informed by human learning patterns, the core idea of our proposed solution involves allocating a portion of the human annotation budget to re-annotate data that has already been labeled. Previous theoretical work suggests that when the model possesses a certain level of ability to identify potentially noisy data, even re-labeling a small fraction of the data can effectively remove noise from the active training set. To achieve this, we implement two active noise sampling strategies to detect noise under different circumstances and allocate a part of the annotation budget to re-annotate these instances. Our approach imbues active learning with a revisiting and introspective behavior. Our experiments demonstrate that, under the same annotation budget, our method is more data-efficient and yields a relatively noise-free annotation dataset in the end.

2606.08736 2026-06-09 cs.LG cs.DB 新提交

Declarative Outcome-Conformant Synthesis: Exact, Closed-Form Specification Satisfaction and a Conformance Benchmark

声明性结果一致性合成:精确、闭式规范满足及一致性基准

Muhammed Rasin

发表机构 * Independent Researcher(独立研究员)

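AI summary Addresses the need to satisfy declared analytical outcomes exactly without source data: defines the outcome-conformant synthesis task, achieves exact aggregates through closed-form conditional Gamma sampling, builds the SpecBench benchmark, and shows that conformance and fidelity are orthogonal.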

Comments 22 pages, 1 figure. Benchmark and reference implementation (MIT): https://github.com/rasinmuhammed/misata

详情
AI中文摘要

我们研究合成表格数据主流范式未能提供的能力:在无源数据下精确满足声明的分析结果。模仿方法(copula、GAN、扩散)学习真实分布并从中采样,其评价基于对真实数据的保真度。一大类实际需求不同:在无源数据(冷启动)下生成数据,该数据在关系模式上复现声明的结果(收入曲线、流失率、群体份额)。现成的模仿工具不提供针对此类目标的接口,且由于采样方差,没有采样器能精确命中聚合值。在真实公共数据集上,基于该数据训练的现成学习合成器将声明的月度聚合值偏离74%至86%;逐周期优化将偏离降至约19%,但仍无法达到0;而闭式生成器精确达到0。我们将此任务命名为结果一致性合成,论证其评价轴为一致性而非保真度,并展示两轴正交。我们的贡献包括:(1) 形式化描述,表明广泛使用的精确聚合生成器族实际上是伽马总体的条件求和采样(通过Lukacs刻画),具有闭式精确性、闭式边际变异系数和尺度不变性;受控实验描绘边界,强制精确聚合在1-Wasserstein距离上对任意外部边际的成本最多为0.006,其余为形状族失配;(2) SpecBench,据我们所知,这是首个衡量冷启动关系合成中分析结果一致性的基准;(3) 一个闭式确定性参考系统。精确聚合本身是平凡的;贡献在于一致性联合闭式边际、完整性、确定性和零源数据。我们承认在存在真实数据时模仿方法的保真度优势。

英文摘要

We study a capability the dominant paradigm in synthetic tabular data does not provide: exact satisfaction of a declared analytical outcome with no source data. Imitation methods (copulas, GANs, diffusion) learn a real distribution and sample from it, and are judged on fidelity to real data. A large, practical class of needs is different: generating data with no source data ("cold start") that reproduces a declared outcome (a revenue curve, a churn rate, a group share) across a relational schema. Off-the-shelf imitation tools offer no interface for such targets, and no sampler can hit an exact aggregate, because sampling has variance. On a real public dataset, off-the-shelf learned synthesizers trained on that very data miss the declared monthly aggregate by 74 to 86 percent; a per-period steelman cuts the miss to about 19 percent and still cannot reach 0; a closed-form generator reaches exactly 0. We name this task outcome-conformant synthesis, argue its evaluation axis is conformance rather than fidelity, and show the two axes are orthogonal. We contribute: (1) a formal account showing a widely used family of exact-aggregate generators is exactly conditional-sum sampling of a Gamma population (via Lukacs' characterization), with closed-form exactness, a closed-form marginal CV, and scale-invariance; a controlled experiment maps the boundary: enforcing the exact aggregate costs at most 0.006 in 1-Wasserstein distance to an arbitrary external marginal, the rest being shape-family mismatch; (2) SpecBench, to our knowledge the first benchmark to measure conformance to analytical outcomes for cold-start relational synthesis; and (3) a closed-form, deterministic reference system. Exact aggregation alone is trivial; the contribution is conformance jointly with closed-form marginals, integrity, determinism, and zero source data. We concede fidelity to imitation where real data exists.
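The idea of conditional-sum sampling from a Gamma population can be sketched in a few lines. The sketch draws i.i.d. Gamma values and rescales them so that they sum to a declared total; by the standard Gamma/Dirichlet relationship, the rescaled shares are independent of the original sum, so the result is a draw from the Gamma population conditioned on the declared aggregate. The function names, the shape parameter, and the revenue figure are hypothetical, and this is not the referenced `misata` implementation.

```python
import math
import random

def exact_sum_sample(n, total, shape=2.0, seed=0):
    """Draw n positive values that sum to `total` (up to floating-point rounding).

    Assumption: i.i.d. Gamma(shape, 1) draws are normalised and rescaled, so the
    shares follow a symmetric Dirichlet(shape, ..., shape) distribution.
    """
    rng = random.Random(seed)  # fixed seed keeps the generator deterministic
    draws = [rng.gammavariate(shape, 1.0) for _ in range(n)]
    s = sum(draws)
    return [total * d / s for d in draws]

def marginal_cv(n, shape):
    """Coefficient of variation of one share under the symmetric Dirichlet.

    Each share is Beta(shape, (n - 1) * shape), which gives
    CV = sqrt((n - 1) / (n * shape + 1)); the declared total cancels out.
    """
    return math.sqrt((n - 1) / (n * shape + 1))

# Hypothetical declared outcome: 12 monthly values with a yearly total of 1200.
monthly = exact_sum_sample(n=12, total=1200.0, shape=2.0, seed=42)
print(round(sum(monthly), 6), round(marginal_cv(12, 2.0), 4))
```

The CV expression depends only on `n` and `shape`, which is one way to read the scale-invariance mentioned in the abstract. In floating point the sum matches the target only up to rounding, so an implementation that needs bit-exact totals would have to assign the residual explicitly.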

2606.08903 2026-06-09 cs.LG 新提交

Synthetic but Not Realistic: The Evaluation Challenge in Generative Modelling for Structured Electronic Medical Records

合成但不真实:结构化电子病历生成建模中的评估挑战

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

发表机构 * Centre for Big Data Research in Health, the University of New South Wales(新南威尔士大学健康大数据研究中心)

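AI summary Because evaluation of synthetic electronic medical records leans on statistical similarity and neglects clinical validity, proposes an epidemiology-grounded multi-dimensional evaluation framework; finds that current generative models reproduce marginal distributions but cannot simultaneously preserve subgroup structure, effect estimates, and dependency structure, so current evaluation can overestimate data quality.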

详情
AI中文摘要

合成医疗数据被广泛提议作为真实患者数据的隐私保护替代品,但其评估仍然以统计相似性和预测性能为主,这些并不能反映临床有效性。我们引入了一个基于流行病学的多维度评估框架,评估描述性保真度、临床实用性和结构有效性,分别对应描述性、预测性和因果性问题。我们使用PRIME-CVD(一个具有已知真实结构的5万人队列)评估了四种代表性生成范式——基于GAN、VAE增强、基于扩散和掩码建模。虽然所有模型都再现了边缘分布,但没有一个能同时保留亚组结构、效应估计和依赖结构。值得注意的是,具有强分布保真度的模型可能表现出较差的校准和扭曲的关系,导致不可靠的推断。这些结果表明,当前的评估实践可能高估了合成数据质量,并促使基于支持有效临床和科学结论的能力进行领域知情的评估。

英文摘要

Synthetic healthcare data are widely proposed as privacy-preserving substitutes for real patient data, yet their evaluation remains dominated by statistical similarity and predictive performance that do not reflect clinical validity. We introduce a multi-dimensional evaluation framework grounded in epidemiology, assessing descriptive fidelity, clinical utility, and structural validity, corresponding to descriptive, predictive, and causal questions. We evaluate four representative generative paradigms (GAN-based, VAE-boosted, diffusion-based, and masked modelling) using PRIME-CVD, a 50,000-person cohort with known ground-truth structure. While all models reproduce marginal distributions, none simultaneously preserve subgroup structure, effect estimates, and dependency structure. Notably, models with strong distributional fidelity can exhibit poor calibration and distorted relationships, leading to unreliable inference. These results show that current evaluation practices can overestimate synthetic data quality and motivate domain-informed assessment based on the ability to support valid clinical and scientific conclusions.

2606.08921 2026-06-09 cs.LG 新提交

Generalized Rank-based Evaluation for Knowledge Graph Completion: Perspectives, Framework, and Analyses

基于排序的知识图谱补全广义评估:视角、框架与分析

Sooho Moon, Jian Kang, Yunyong Ko

发表机构 * Chung-Ang University(中央大学) Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学)

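AI summary Because existing evaluation metrics overlook predictive sharpness and popularity-bias robustness, proposes PROBE, a generalized evaluation framework whose rank transformer and rank aggregator enable more comprehensive, flexible, and consistent model evaluation.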

Comments 25 pages, 12 figures, 5 tables

详情
AI中文摘要

知识图谱补全(KGC)旨在从观测知识图谱(KG)中预测缺失事实,在药物发现、推荐系统和检索增强生成(RAG)等广泛实际应用中发挥关键作用。尽管已有众多KGC模型被提出,但KGC的评估仍未被充分探索,尽管其在可靠评估模型性能和为实际应用选择合适的模型中至关重要。本文引入了KGC评估中两个被现有评估指标忽视的重要视角:(P1)预测锐度和(P2)流行偏差鲁棒性。为同时解决这两个视角,我们提出一个广义评估框架PROBE,它由一个排序变换器(RT)和一个排序聚合器(RA)组成,其中RT基于期望的预测锐度水平估计每个预测的得分,RA根据期望的流行偏差鲁棒性水平聚合所有预测得分以确定最终评估得分。我们通过定义可靠KGC评估的六个关键属性对PROBE进行理论分析,并证明PROBE满足所有属性,而现有指标未能满足部分属性。特别地,由于KG的开放世界特性,评估指标应即使在仅观测到不完整事实时也能保持KGC模型的相对性能。我们表明PROBE能更好地维持这种一致性,从而比现有指标更可靠地估计模型的内在性能。在六个真实KG上使用六个KGC模型进行的大量实验表明,现有指标可能根据不同的评估视角高估或低估模型性能,而PROBE能够实现更全面、灵活且一致的KGC模型评估。

英文摘要

Knowledge graph completion (KGC) aims to predict missing facts from an observed knowledge graph (KG), playing a crucial role in a wide range of real-world applications such as drug discovery, recommender systems, and retrieval-augmented generation (RAG). Although numerous KGC models have been proposed, the evaluation of KGC remains underexplored, despite its critical role in reliably assessing model performance and selecting appropriate models for real-world applications. In this paper, we introduce two important perspectives for KGC evaluation that are overlooked by existing evaluation metrics, (P1) predictive sharpness and (P2) popularity-bias robustness. To address both perspectives, we propose a generalized evaluation framework, PROBE, which consists of a rank transformer (RT) that estimates the score of each prediction based on a desired level of predictive sharpness and a rank aggregator (RA) that determines the final evaluation score by aggregating all prediction scores according to a desired level of popularity-bias robustness. We theoretically analyze PROBE by defining six key properties for reliable KGC evaluation and prove that PROBE satisfies all the properties, while existing metrics fail to satisfy some. In particular, due to the open-world nature of KGs, an evaluation metric should preserve the relative performance of KGC models even when only incomplete facts are observed. We show that PROBE better maintains such consistency, providing a more reliable estimate of intrinsic model performance than existing metrics. Extensive experiments with six KGC models on six real-world KGs reveal that existing metrics may over- or under-estimate model performance depending on different evaluation perspectives, whereas PROBE enables a more comprehensive, flexible, and consistent evaluation of KGC models.

2606.08926 2026-06-09 cs.LG 新提交

PROBE-Web: An Interactive System for Probing Evaluation Landscapes of Knowledge Graph Completion Models

PROBE-Web:用于探究知识图谱补全模型评估景观的交互式系统

Sooho Moon, Yunyong Ko

发表机构 * Chung-Ang University(中央大学)

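AI summary Presents PROBE-Web, an interactive system that lets users evaluate KGC models flexibly by adjusting two perspectives, predictive sharpness and popularity-bias robustness, and that provides four key functionalities.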

Comments 4 pages, 6 figures, 1 table

详情
AI中文摘要

知识图谱补全(KGC)模型通常使用基于排名的指标(如MRR和Hits@K)进行评估,尽管不同的用户通常需要不同的评估视角。在本演示中,我们介绍PROBE-Web,一个用于探究KGC模型多样化评估景观的交互式系统。PROBE-Web使用户能够通过调整两个关键视角(P1)预测锐度和(P2)流行度偏差鲁棒性来灵活评估KGC模型。通过用户友好的GUI,用户可以轻松评估多个KGC模型并分析其优缺点。PROBE-Web提供四个关键功能:(1)传统评估工具包,(2)灵活的视角感知评估,(3)可解释的案例研究,以及(4)评估景观探索。我们相信PROBE-Web可以帮助用户更好地理解与其目标一致的KGC模型。

英文摘要

Knowledge graph completion (KGC) models are commonly evaluated using rank-based metrics such as MRR and Hits@K, even though different users often require different evaluation perspectives. In this demo, we present PROBE-Web, an interactive system for probing diverse evaluation landscapes for KGC models. PROBE-Web enables users to flexibly evaluate KGC models by adjusting two critical perspectives: (P1) predictive sharpness and (P2) popularity-bias robustness. Through a user-friendly GUI, users can easily evaluate multiple KGC models and analyze their strengths and weaknesses. PROBE-Web provides four key functionalities: (1) conventional evaluation toolkit, (2) flexible perspective-aware evaluation, (3) explainable case studies, and (4) evaluation landscape exploration. We believe that PROBE-Web can help users better understand KGC models that align with their objectives.
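For readers unfamiliar with the conventional rank-based metrics named in this abstract, here is a minimal sketch of MRR and Hits@K computed from the rank of the correct entity in each test query. The rank list is hypothetical. PROBE's rank transformer and rank aggregator are not shown, because the abstract does not define them.

```python
def mean_reciprocal_rank(ranks):
    """Average of 1/rank over all test queries (ranks start at 1)."""
    return sum(1.0 / r for r in ranks) / len(ranks)

def hits_at_k(ranks, k):
    """Fraction of test queries whose correct entity is ranked within the top k."""
    return sum(1 for r in ranks if r <= k) / len(ranks)

# Hypothetical ranks of the correct entity for four test triples.
ranks = [1, 2, 4, 10]
print(mean_reciprocal_rank(ranks))  # (1 + 0.5 + 0.25 + 0.1) / 4
print(hits_at_k(ranks, 3))          # two of the four ranks are <= 3
```

MRR rewards top ranks steeply while Hits@K applies a hard cutoff, so the two metrics already encode different fixed preferences about how sharply a prediction must be ranked.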

2606.09046 2026-06-09 cs.LG cs.CL cs.IR 新提交

Decoy-Calibrated Failure Audits for Language Models

语言模型的诱饵校准失败审计

Vyzantinos Repantis, Ameya Gawde, Harshvardhan Singh

发表机构 * Meta Platforms(Meta平台)

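AI summary Presents Janus, a procedure that uses decoy calibration and held-out validation to judge whether an explanation of a language model's errors is credible, guarding against selection bias.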

Comments 14 pages, 5 figures, 4 tables

详情
AI中文摘要

有用的审计不仅揭示模型失败的频率,还揭示失败集中在何处。审计员可能测试许多候选解释:长输入、间接问题、分散注意力的证据或这些因素的组合。风险在于选择。观察到的最大效应可能反映真实的失败模式,也可能只是多次尝试中的最佳结果。我们提出Janus,一种决定何时提出的错误解释足够可信以报告的程序。目标不是生成新解释,而是决定哪些解释站得住脚。审计员从固定的模型、标记的评估集和冻结的候选解释列表(我们称之为描述符)开始。Janus通过错误率提升对每个描述符进行评分,然后将真实描述符与具有相同频率但随机分配给示例的虚假描述符进行比较。只有当描述符在用于发现的数据上击败这个诱饵基准,然后在单独的留出数据上重复时,它才被确认。在多表查找任务的受控审计中,Janus识别出植入的失败,确认了长链描述符及其交互。LLM通常在查找链中途停止,而不是到达最终答案。在两个公共基准MuSiQue和LongBench v2上,SliceLine基线标记了看似高错误的区域,但Janus没有确认任何一个。消融实验显示了为什么两个保障措施都很重要。在LongBench v2上,未校准的固定阈值报告了20个描述符,诱饵基准留下一个,而留出检查在其提升从0.36缩小到0.05后拒绝了最后一个。由此产生的原则将提出解释与报告解释分开。候选解释可能来自任何来源,但只有那些击败诱饵并在新数据上复现的才成为审计发现。

英文摘要

Useful audits reveal not only how often a model fails, but also where its failures concentrate. An auditor may test many candidate explanations: long inputs, indirect questions, distracting evidence, or combinations of these factors. The risk is selection. The largest observed effect may reflect a real failure mode, or it may simply be the best result among many tried. We introduce Janus, a procedure for deciding when a proposed error explanation is credible enough to report. The goal is not to generate new explanations, but to decide which ones hold up. The auditor starts with a fixed model, a labeled evaluation set, and a frozen list of candidate explanations, which we call descriptors. Janus scores each descriptor by its error-rate lift, then compares real descriptors with fake ones that have the same frequencies but are randomly assigned to examples. A descriptor is confirmed only if it beats this decoy floor on the data used for discovery and then repeats on separate held-out data. In a controlled audit of multi-table lookup tasks, Janus identifies the planted failure, confirming long-chain descriptors and their interactions. The LLM often stops partway through the lookup chain instead of reaching the final answer. On two public benchmarks, MuSiQue and LongBench v2, the SliceLine baseline flags plausible high-error pockets, but Janus confirms none of them. Ablations show why both safeguards matter. On LongBench v2, an uncalibrated fixed threshold reports 20 descriptors, the decoy floor leaves one, and the holdout check rejects the last one after its lift shrinks from 0.36 to 0.05. The resulting principle separates proposing explanations from reporting them. Candidates may come from any source, but only those that beat decoys and replicate on fresh data become audit findings.
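The two safeguards described in this abstract, a decoy floor and a holdout check, can be sketched as follows. This is a minimal illustration under stated assumptions, not the Janus implementation: lift is taken to be the slice error rate minus the overall error rate, the floor is the 95th percentile of the best decoy lift over repeated shuffles, and the holdout criterion is a fixed minimum lift. All function names, thresholds, and the toy data are hypothetical.

```python
import random

def lift(errors, mask):
    """Error rate inside the descriptor's slice minus the overall error rate."""
    inside = [e for e, m in zip(errors, mask) if m]
    if not inside:
        return 0.0
    return sum(inside) / len(inside) - sum(errors) / len(errors)

def decoy_floor(errors, masks, n_decoys=200, quantile=0.95, seed=0):
    """Quantile of the best lift achieved by frequency-matched random descriptors."""
    rng = random.Random(seed)
    maxima = []
    for _ in range(n_decoys):
        # rng.sample(mask, len(mask)) is a shuffled copy: same frequency, random assignment
        maxima.append(max(lift(errors, rng.sample(mask, len(mask))) for mask in masks))
    maxima.sort()
    return maxima[min(len(maxima) - 1, int(quantile * len(maxima)))]

def confirmed_descriptors(disc_errors, disc_masks, hold_errors, hold_masks,
                          min_holdout_lift=0.1):
    """Keep descriptors that beat the decoy floor on discovery data and repeat on holdout."""
    floor = decoy_floor(disc_errors, list(disc_masks.values()))
    return [name for name, mask in disc_masks.items()
            if lift(disc_errors, mask) > floor
            and lift(hold_errors, hold_masks[name]) >= min_holdout_lift]

def toy_split(n, slice_size):
    """Hypothetical data: errors are planted in the first `slice_size` examples."""
    errors = [1 if (i < slice_size and i % 5 != 0) or (i >= slice_size and i % 10 == 0)
              else 0 for i in range(n)]
    masks = {"long_chain": [i < slice_size for i in range(n)],
             "unrelated": [i % 4 == 0 for i in range(n)]}
    return errors, masks

disc_errors, disc_masks = toy_split(200, 50)
hold_errors, hold_masks = toy_split(100, 25)
print(confirmed_descriptors(disc_errors, disc_masks, hold_errors, hold_masks))
```

Taking the maximum over all descriptors in each decoy round is what accounts for selection: the floor reflects the best result among many tried, not the lift of a single random slice.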

2606.09080 2026-06-09 cs.LG cs.CL 新提交

Beyond FLOPs: Benchmarking Real Inference Acceleration of LLM Pruning under a GEMM-Centric Taxonomy

超越FLOPs:基于GEMM中心分类法的LLM剪枝真实推理加速基准测试

Haozhe Hu, Hao Wu, Anhao Zhao, Longwei Ding, Peiran Yin, Yunpu Ma, Xiaoyu Shen

发表机构 * Ningbo Institute of Digital Twin, Eastern Institute of Technology, Ningbo(宁波数字孪生研究院,东方理工大学(宁波)) Department of Computing, The Hong Kong Polytechnic University(香港理工大学计算学系) Munich Center for Machine Learning, LMU Munich(慕尼黑大学机器学习慕尼黑中心)

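AI summary Proposes a taxonomy of pruning methods based on GEMM dimensions and uses a unified benchmarking framework to evaluate pruning methods systematically on the acceleration-quality Pareto frontier; finds static depth pruning is best at low quality loss, giving a unified view of pruning-based LLM acceleration.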

Comments 22 pages, 14 figures

详情
AI中文摘要

剪枝已成为加速大语言模型(LLM)推理的主流范式,涵盖了一系列方法,这些方法在token、层、头、维度和注意力模式上移除计算。尽管目标相同,这些剪枝方法会引发根本不同的执行行为,导致实际加速效果严重依赖于硬件和内核实现。因此,不同剪枝家族的实际加速收益仍知之甚少。在这项工作中,我们引入了一种基于GEMM中心的分类法,根据通用矩阵乘法(GEMM)的逻辑M、N和K维度重新组织现有剪枝方法。利用这一抽象,我们构建了一个统一的基准测试框架,能够在剪枝设计空间中进行实现一致的比较,并系统地表征加速-质量帕累托前沿。我们的结果表明,静态深度剪枝仍然是最强的帕累托最优基线,并且在内存受限场景下最接近其理论加速上限。在预填充阶段,前沿从低质量损失(0%–4%)的静态深度,过渡到中等损失(5%–16%)的动态深度,最后到更高损失水平(17%–26%)的静态宽度剪枝。这些发现首次建立了基于剪枝的LLM加速实际极限的统一视图,并为未来的剪枝研究提供了指导。代码可在 https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim 获取。

英文摘要

Pruning has emerged as a dominant paradigm for accelerating large language model (LLM) inference, spanning a broad spectrum of methods that remove computation across tokens, layers, heads, dimensions, and attention patterns. Despite sharing the same objective, these pruning approaches induce fundamentally different execution behaviors, causing realized speedups to depend heavily on hardware and kernel implementations. Consequently, the practical acceleration benefits of different pruning families remain poorly understood. In this work, we introduce a GEMM-centric taxonomy that reorganizes existing pruning methods according to the logical M, N, and K dimensions of general matrix multiplication (GEMM). Leveraging this abstraction, we build a unified benchmarking framework that enables implementation-consistent comparison across the pruning design space and systematically characterizes the acceleration-quality Pareto frontier. Our results show that static depth pruning remains the strongest Pareto-optimal baseline and stays closest to its theoretical acceleration upper bound in memory-bounded scenarios. During prefill, the frontier transitions from static depth at low quality loss (0%–4%), to dynamic depth at moderate loss (5%–16%), and finally to static width pruning at higher loss levels (17%–26%). These findings establish the first unified view of the practical limits of pruning-based LLM acceleration and provide guidance for future pruning research. Code is available at https://github.com/EIT-NLP/LLM-Pruning/tree/main/PruningInferSim.
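To make the M, N, K framing concrete, the sketch below counts FLOPs for a GEMM of shape (M, K) x (K, N) using the standard 2·M·N·K estimate and derives a FLOP-ratio speedup bound. The mapping used here is an assumption for illustration, since the abstract does not spell it out: token pruning is modelled as shrinking M, width pruning as shrinking N, and depth pruning as removing whole GEMMs. The layer shapes are hypothetical and the code is unrelated to the PruningInferSim repository.

```python
def gemm_flops(m, n, k):
    """Multiply-add count for C[m, n] = A[m, k] @ B[k, n], using the usual 2*m*n*k estimate."""
    return 2 * m * n * k

def flop_ratio_speedup(dense_shapes, pruned_shapes):
    """Upper bound on speedup if runtime were proportional to FLOPs (it often is not)."""
    dense = sum(gemm_flops(*shape) for shape in dense_shapes)
    pruned = sum(gemm_flops(*shape) for shape in pruned_shapes)
    return dense / pruned

# Hypothetical model: four layers, each one GEMM with M=1024 tokens, N=K=4096.
dense = [(1024, 4096, 4096)] * 4

token_pruned = [(512, 4096, 4096)] * 4   # assumed mapping: fewer tokens shrinks M
width_pruned = [(1024, 2048, 4096)] * 4  # assumed mapping: fewer output channels shrinks N
depth_pruned = [(1024, 4096, 4096)] * 3  # assumed mapping: one layer removed entirely

for name, pruned in [("token", token_pruned), ("width", width_pruned), ("depth", depth_pruned)]:
    print(name, flop_ratio_speedup(dense, pruned))
```

The FLOP ratio is only a bound. The abstract's point is that realized speedups depend on hardware and kernel implementations, so measured latency can fall well short of it.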

2606.09239 2026-06-09 cs.LG cs.HC 新提交

Orange Lab: Lowering Barriers to Data Mining through Embedded Interactive Workflows

Orange Lab:通过嵌入式交互工作流降低数据挖掘门槛

Matej Bevec, Aleš Erjavec, Vesna Tanko, Lena Trnovec, Lan Žagar, Ana Farič, Janez Demšar, Blaž Zupan

发表机构 * University of Ljubljana(卢布尔雅那大学) Revelo d.o.o.(Revelo公司)

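AI summary Presents Orange Lab, a web-based environment for visual data analytics; its component exposition paradigm embeds machine learning workflow components in arbitrary web pages, enabling dynamic interaction and data-driven storytelling and lowering the barrier to data science.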

详情
AI中文摘要

虽然数据分析工作流的可视化编程已成为数据科学民主化的重要工具,但此类系统仍主要局限于独立应用程序,并且对将其可视化分析解决方案过渡到交互式网络环境的支持有限。因此,数据分析管道难以共享、嵌入和适应用户面向的分析工具。我们提出了Orange Lab,一个基于Web的协作式可视化数据分析环境。其核心是,Orange Lab使用户能够从模块化组件中可视化地构建机器学习工作流,其中任何组件中的交互都会无缝地传播到整个工作流,将静态管道转变为支持探索和数据驱动叙事的动态响应系统。我们的关键贡献是组件展示,这是一种范式,允许作者将选定的工作流组件或其界面部分嵌入到任意网络上下文中,创建同步的交互式界面,同时隐藏底层工作流的复杂性。这支持开发定制化的分析视图和叙事驱动的体验,将数据分析直接集成到在线材料中。我们通过在数据素养教育中的部署来展示该方法,其中嵌入式组件引导学生动手探索机器学习概念,而无需了解底层系统,表明Orange Lab有效降低了入门门槛并支持数据科学的民主化。

英文摘要

While visual programming of data analysis workflows has become an important vehicle for the democratization of data science, such systems remain largely confined to standalone applications and offer limited support for transitioning their visual analytics solutions into interactive web environments. As a result, data analysis pipelines are difficult to share, embed, and adapt into user-facing analytical tools. We present Orange Lab, a web-based collaborative environment for visual data analytics. At its core, Orange Lab enables users to visually construct machine learning workflows from modular components, where interactions in any component propagate seamlessly through the workflow, turning static pipelines into dynamic, reactive systems that support exploration and data-driven storytelling. Our key contribution is component exposition, a paradigm that allows authors to embed selected workflow components, or parts of their interfaces, into arbitrary web contexts, creating synchronized, interactive interfaces while hiding underlying workflow complexity. This enables the development of tailored analytical views and narrative-driven experiences that integrate data analysis directly into online materials. We demonstrate the approach through deployments in data literacy education, where embedded components guide students in hands-on exploration of machine learning concepts without requiring knowledge of the underlying system, showing that Orange Lab effectively lowers barriers to entry and supports the democratization of data science.

2606.09276 2026-06-09 cs.LG 新提交

ERBench: A Benchmark and Testsuite for Equation Discovery Algorithms

ERBench:方程发现算法的基准与测试套件

Paul Kahlmeyer, Henrik Voigt, Michael Habeck, Joachim Giesen

发表机构 * University of Jena(耶拿大学)

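AI summary Presents the ERBench benchmark, which evaluates symbolic regression algorithms through equation recovery tasks and emphasizes robustness under changing dimensionality, sample size, sampling distribution, and sampling domain, filling a gap in existing benchmarks.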

详情
AI中文摘要

方程发现旨在从数据中自动发现数学方程形式的科学模型。技术上,方程发现通过符号回归算法实现。符号回归用于方程发现的性能沿两个维度衡量:测试数据的预测精度,以及已知真实公式的恢复。对于标准回归,精度通常通过域内测试数据衡量,例如,将数据集随机分为训练和测试数据。虽然这对于域内插值(普通回归的常见目标)有意义,但它可能误导真正的模型发现和泛化。明显的替代方案是衡量域外精度。然而,获得具有挑战性的域外测试数据是一个非平凡问题。因此,我们专注于方程恢复来评估用于方程发现的符号回归算法。理由是,在恢复已知真实公式方面表现良好的符号回归算法是未知方程发现中表现良好的良好候选。现有的符号回归基准包括方程恢复任务,但只有少量公开已知的真实公式。此外,这些基准较少强调评估算法在变化维度、采样大小、采样分布和采样域下的鲁棒性。然而,这对于希望发现自然现象建模方程的从业者至关重要,因为数据几乎肯定有噪声,并且来自不同的域、分布和样本大小。为填补这一空白,我们引入了方程恢复基准(ERBench),这是一个新的评估框架,旨在严格评估明确针对方程发现任务的算法。

英文摘要

Equation discovery aims to automate the discovery of scientific models in the form of mathematical equations from data. Technically, equation discovery is implemented by symbolic regression algorithms. Performance of symbolic regression for equation discovery is measured along two dimensions: prediction accuracy on test data and recovery of known ground-truth formulas. For standard regression, accuracy is typically measured on in-domain test data, for instance, by splitting a data set randomly into training and test data. While this makes sense for in-domain interpolation, which is the common goal in ordinary regression, it can be a misleading proxy for true model discovery and generalization. The obvious alternative is to measure out-of-domain accuracy. However, obtaining challenging out-of-domain test data is a non-trivial problem. Therefore, we focus on equation recovery for evaluating symbolic regression algorithms for equation discovery. The rationale is that symbolic regression algorithms that perform well in recovering known ground-truth formulas are good candidates to perform well in unknown equation discovery. Existing benchmarks for symbolic regression include equation recovery tasks, but with only a small number of ground-truth formulas that are publicly known. Moreover, these benchmarks place less emphasis on evaluating the robustness of algorithms under changing dimensionality, sampling size, sampling distribution, and sampling domain. This robustness, however, is of central importance to practitioners who want to discover equations for modeling natural phenomena, since data is almost certainly noisy and comes from diverse domains, distributions, and sample sizes. To fill this gap, we introduce the Equation Recovery Benchmark (ERBench), a new evaluation framework designed to rigorously assess algorithms explicitly targeting the task of equation discovery.

2606.09517 2026-06-09 cs.LG 新提交

Investigating Calibration Challenges in Probabilistic Electricity Price Forecasting

研究概率电价预测中的校准挑战

Jan Niklas Lettner, Hadeer El Ashhab, Benjamin Schäfer

发表机构 * Institute for Automation and Applied Informatics(自动化与应用信息学研究所) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

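AI summary Argues that scoring rules currently used in probabilistic electricity price forecasting favor sharpness over calibration, producing overconfident estimates, and calls for future research to move toward calibration-aware objectives and architectures.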

Comments Presented at the ACM Sustainability Week Companion 2026, Banff, AB, Canada

详情
AI中文摘要

随着可再生能源整合增加市场波动性,概率电价预测已成为有效风险管理的关键。然而,当前的适当评分规则往往优先考虑预测锐度而牺牲校准,导致过度自信且统计上不可靠的不确定性估计。本文强调了理论评分与实际校准之间的关键差距,证明当可靠性被忽视时,模型可能成为确定性预测的代理。我们得出结论,未来的研究必须转向校准感知的目标和架构,以确保能源市场预测的分布完整性。

英文摘要

As renewable energy integration increases market volatility, probabilistic electricity price forecasting has become essential for effective risk management. However, current proper scoring rules often prioritize forecast sharpness at the expense of calibration, leading to overconfident and statistically unreliable uncertainty estimates. This work highlights the critical gap between theoretical scoring and practical calibration, demonstrating that models can become mere proxies for deterministic forecasts when reliability is neglected. We conclude that future research must shift toward calibration-aware objectives and architectures to ensure the distributional integrity of energy market forecasts.
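The tension between sharpness and calibration can be made concrete with a small diagnostic. The sketch below reports the empirical coverage of prediction intervals (a calibration check against the nominal level) alongside their mean width (a sharpness measure). The prices, forecasts, and interval half-widths are hypothetical, and this is a generic diagnostic, not the evaluation protocol of the paper.

```python
def interval_diagnostics(lower, upper, observed):
    """Return (empirical coverage, mean interval width) for a set of prediction intervals."""
    covered = sum(1 for lo, hi, y in zip(lower, upper, observed) if lo <= y <= hi)
    coverage = covered / len(observed)
    mean_width = sum(hi - lo for lo, hi in zip(lower, upper)) / len(observed)
    return coverage, mean_width

# Hypothetical day-ahead prices (EUR/MWh) and point forecasts.
observed = [40.0, 55.0, 30.0, 80.0, 45.0]
forecast = [42.0, 50.0, 33.0, 70.0, 44.0]

# Two hypothetical interval forecasts that both claim a nominal 90% level.
sharp = interval_diagnostics([f - 2 for f in forecast], [f + 2 for f in forecast], observed)
wide = interval_diagnostics([f - 15 for f in forecast], [f + 15 for f in forecast], observed)

print("sharp:", sharp)  # narrow intervals, coverage far below the nominal level
print("wide:", wide)    # wide intervals, coverage at or above the nominal level
```

A model whose intervals collapse around the point forecast looks sharp but is overconfident once coverage is checked, which is the failure mode the abstract describes as becoming a proxy for a deterministic forecast.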

2606.09764 2026-06-09 cs.LG cs.CL 新提交

iOSWorld: A Benchmark for Personally Intelligent Phone Agents

iOSWorld:个人智能手机代理的基准测试

Lawrence Keunho Jang, Mareks Woodside, Geronimo Carom, Andrew Keunwoo Jang, Jing Yu Koh, Ruslan Salakhutdinov

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

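AI summary Presents iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity, with 26 newly built apps and 133 tasks; evaluates agents on single-app, multi-app, and memory and personalization tasks, where the best configuration reaches 52% overall but only 37% on multi-app tasks.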

详情
AI中文摘要

一个有用的手机代理需要具备个人智能。它应该能够推理设备上存在的用户身份、历史记录和偏好,而不仅仅是在非个性化的沙箱中遵循孤立的指令。现有的移动代理基准缺乏这种个性化。我们引入了iOSWorld,这是第一个基于持久用户身份构建的交互式原生iOS模拟器基准,该身份跨越26个新构建的iOS应用。这些应用包含连接的数据,如交易、消息、旅行记录、社交关系和财务活动。iOSWorld包括133个任务,分为三个难度递增的类别。单应用任务(27个)测试一个应用,多应用任务(60个)跨越2到8个应用,记忆和个性化任务(46个)要求代理从个人数据中推断模式。我们在仅视觉和特权视觉+XML设置下评估了前沿和开源计算机使用模型。最佳配置整体达到52%,但在多应用任务上仅为37%。特权视觉+XML访问将前沿模型提升了最多26个百分点,而较小的模型并未从增加的辅助功能树输入中受益。我们将iOSWorld作为开源基准发布,包含所有应用、种子数据、任务、评分标准和评估代码。

英文摘要

A useful phone agent needs to be personally intelligent. It should reason over a user's identity, history, and preferences as they exist on the device, not just follow isolated instructions in an impersonal sandbox. Existing mobile agent benchmarks lack this kind of personalization. We introduce iOSWorld, the first interactive native iOS simulator benchmark built around a persistent user identity spanning 26 newly built iOS apps. These apps contain connected data such as transactions, messages, travel records, social relationships, and financial activity. iOSWorld includes 133 tasks across three increasingly difficult categories. Single-app tasks (27) test one app, multi-app tasks (60) span 2 to 8 apps, and memory and personalization tasks (46) require agents to infer patterns from personal data. We evaluate frontier and open-source computer-use models in both vision-only and privileged vision+XML settings. The best configuration reaches 52% overall but only 37% on multi-app tasks. Privileged vision+XML access improves frontier models by up to 26 percentage points, while smaller models do not benefit from added accessibility-tree input. We release iOSWorld as an open-source benchmark with all apps, seeded data, tasks, rubrics, and evaluation code.

2606.07611 2026-06-09 cs.IR cs.AI cs.LG cs.SE 交叉投稿

MIRAGE: Metadata-Integrated Repository Analysis and Guided Enhancement for MSR Datasets

MIRAGE:面向MSR数据集的元数据集成仓库分析与引导增强

Aabia Ather, Muhammad Usayd Ather, Qurat-Ul-Ain Somroo, Muhammad Khuram Shahzad

发表机构 * SEECS, NUST(巴基斯坦国立科技大学电气工程与计算机科学学院)

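AI summary Proposes improving the analysis of MSR datasets through metadata enrichment, FAIRness assessment, and topic-driven analysis; extends an existing dataset directory and shows that repository hosting site and data format influence citation patterns and dataset usability.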

Comments 8 pages, 8 figures

详情
AI中文摘要

本文提出了一种通过元数据丰富化、FAIR评估和主题驱动分析来改进挖掘软件仓库(MSR)数据集分析的方法。本研究在先前专门用于分析MSR数据集的数据集目录基础上进行了扩展,为数据集添加了新注释,丰富了元数据类别,并提供了更高级的过滤选项。使用Semantic Scholar API收集了2013年至2024年间发表的MSR论文的元数据。分析基于潜在狄利克雷分配(LDA)主题建模和统计分析。数据集级别的属性被纳入扩展的数据集目录,即仓库托管站点、格式、可访问性、可重用性和数据集质量。研究表明,仓库托管站点和数据格式的选择会影响引用模式和数据集可用性。此外,增强的注释方法改进了MSR数据集的分析和可发现性,支持更有效地重用和评估研究工件。

英文摘要

This paper proposes an improved approach to the analysis of Mining Software Repositories (MSR) datasets via metadata enrichment, FAIRness assessment, and topic-driven analysis. This research expands upon an earlier dataset directory created specifically for the analysis of MSR datasets by adding new annotations to the datasets, enriching the metadata categories, and offering more advanced filtering options. The metadata of the MSR papers presented from 2013 to 2024 was gathered using the Semantic Scholar API. The analysis is based on Latent Dirichlet Allocation (LDA) topic modeling and statistical analysis. Dataset-level attributes, namely repository hosting site, format, accessibility, reusability, and dataset quality, were added to the expanded dataset directory. The study reveals that the choice of repository hosting sites and data formats influences citation patterns and dataset usability. Furthermore, the enhanced annotation approach improves the analysis and discoverability of MSR datasets, supporting more effective reuse and evaluation of research artifacts.

2606.07656 2026-06-09 physics.chem-ph cs.CE cs.LG 交叉投稿

SC3: The Multi-Solvent Solubility Challenge and Benchmark

SC3:多溶剂溶解度挑战与基准

Vansh Ramani, Har Ashish Arora, Dhairya Kuchhal, Sergei Tatarin, Lev Krasnov, Sayan Ranu, Tarak Karmakar

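Affiliations * Indian Institute of Technology Delhi, India; Kurnakov Institute of General and Inorganic Chemistry RAS, Russia

AI summary Addresses shortcomings of existing multi-solvent solubility benchmarks by introducing SC3, which comprises a reproducible curation pipeline, tiered consensus sets, and evaluation metrics, and shows that the best model's error is still 5 times the aleatoric limit.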

Comments 34 pages, 16 tables, 22 figures

详情
AI中文摘要

溶解度预测是计算化学中的标准基准,然而据报道接近实验噪声上限(即偶然极限)的多溶剂模型尚未可靠到可以部署。我们认为这一差距部分是由于人为因素:已发表的基准在筛选策略上存在差异,评估时使用计数加权RMSE掩盖了在重尾溶剂分布上的失败,并且将广泛引用的0.6-0.8 log S实验室间数值视为偶然上限,尽管它反映的是最坏情况而非预期差异。我们引入了SC3,一个基于BigSolDB v2.1构建的多溶剂溶解度基准,包含三个贡献:(i) 一个可复现的数据处理流程,得到101,535个测量值,涵盖1,327种溶质和206种溶剂,重新校准的偶然下限为0.106 log S——约为传统数值的6倍;(ii) 嵌套的金/银/铜共识层级,包含逐点标准差、三种泄漏检查分割以及多溶剂指标套件(PS-RMSE, Z-RMSE);以及(iii) 跨六个家族的31个模型基准,其最佳铜级PS-RMSE是偶然极限的5倍,我们观察到这一差距未被任何测试过的深度替代方案所弥合。我们进行了三项后续分析:数据缩放、从量子化学溶剂化能的迁移以及特征级归因,这表明校准后的逐点不确定性是超越点预测的诊断可复用基础设施。

英文摘要

Solubility prediction is a standard benchmark in computational chemistry, yet multi-solvent models that reportedly approach the experimental-noise ceiling (i.e., the aleatoric limit) are not yet reliable enough to be deployed. We argue that this gap is partly artefactual: published benchmarks differ in curation policies, evaluate on count-weighted RMSE that hides failure on tail-heavy solvent distributions, and treat the widely cited 0.6-0.8 log S inter-laboratory figure as the aleatoric ceiling even though it reflects worst-case, not expected, disagreement. We introduce SC3, a multi-solvent solubility benchmark built on BigSolDB v2.1 with three contributions: (i) a reproducible curation pipeline yielding 101,535 measurements over 1,327 solutes and 206 solvents, with a recalibrated aleatoric floor of 0.106 log S, roughly 6 times tighter than the conventional figure; (ii) nested Gold/Silver/Bronze consensus tiers with per-point standard deviation, three leakage-checked splits, and a multi-solvent metric suite (PS-RMSE, Z-RMSE); and (iii) a 31-model benchmark across six families, whose best Bronze PS-RMSE sits at 5 times the aleatoric limit, a gap that no deep alternative we tested closes. We perform three follow-on analyses (data scaling, transfer from quantum-chemistry solvation energies, and feature-level attribution), which demonstrate that calibrated per-point uncertainty is reusable infrastructure for diagnosis beyond point prediction.
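
The claim that a count-weighted RMSE can hide failure on tail-heavy solvent distributions can be illustrated with a small sketch. The abstract does not define PS-RMSE, so the per-solvent macro-average below is only an assumed reading, and the toy data and function names are hypothetical.

```python
import math
from collections import defaultdict

def pooled_rmse(records):
    """RMSE over all measurements; solvents with more rows dominate the result."""
    squared = [(pred - true) ** 2 for _, true, pred in records]
    return math.sqrt(sum(squared) / len(squared))

def per_solvent_rmse(records):
    """Unweighted mean of per-solvent RMSEs (an assumed per-solvent metric)."""
    by_solvent = defaultdict(list)
    for solvent, true, pred in records:
        by_solvent[solvent].append((pred - true) ** 2)
    rmses = [math.sqrt(sum(v) / len(v)) for v in by_solvent.values()]
    return sum(rmses) / len(rmses)

# Hypothetical toy data: (solvent, true log S, predicted log S).
# One common solvent is predicted well; one rare solvent is predicted badly.
records = [("water", 0.0, 0.1)] * 98 + [("rare_solvent", 0.0, 2.0)] * 2

pooled = pooled_rmse(records)
macro = per_solvent_rmse(records)
print(f"pooled RMSE: {pooled:.3f}, per-solvent RMSE: {macro:.3f}")
```

In this toy case the pooled figure stays low because 98 of the 100 rows come from the well-predicted solvent, while the per-solvent average exposes the failure on the rare one.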

2606.07718 2026-06-09 cs.AI cs.CV cs.LG 交叉投稿

A case study of evaluating AI agents on a neuroscience data-to-discovery pipeline

评估AI代理在神经科学数据到发现流程中的案例研究

Kai A. Horstmann, Ethan Lin, Alice A. Robie, Jennifer J. Sun, Kristin Branson

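Affiliations * Cornell University; HHMI Janelia Research Campus

AI summary Evaluates general-purpose coding agents on a fly optogenetics data-to-discovery pipeline; agents can solve individual stages, but the end-to-end pipeline remains beyond them, with the main difficulties arising when no pre-defined criterion exists to iterate on and scientific judgment is required.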

详情
AI中文摘要

代理型AI工具为自动化科学研究流程中的软件开发瓶颈提供了有希望的路径,特别是对于那些需要领域专家花费数天到数月构建的阶段,科学家关心的是正确性和鲁棒性,而非实现细节。我们针对果蝇光遗传学数据到发现流程,对通用编码代理进行了实证研究。我们在比现有基准大得多的任务、数量级更大的数据集以及基于领域专家标准的评估标准上评估代理。我们表明,代理可以解决几个单独的流程阶段,这表明阶段级自动化是可行的。通过分析代理的代码迭代,我们发现当没有预定义的标准可供迭代时,它们最困难,此时它们必须利用自己的科学判断来评估当前解决方案,这是一个关键开放挑战。与科学实践相呼应,它们有时尝试对中间输出进行视觉检查以进行自我评估,但大多未能正确解释所见或据此采取行动。正确解决端到端流程需要将所有流程阶段的成功串联起来,这超出了代理当前的能力。我们识别出现有基准中基本缺失的挑战,包括计算资源管理和对大型保留数据集的泛化。最后,我们提炼出构建科学任务和针对开放问题的严格评估标准的原则。

英文摘要

Agentic AI tools offer a promising path to automating software development bottlenecks in scientific research pipelines, particularly for stages that take domain experts days to months to build, where scientists care about correctness and robustness, not implementation details. We present an empirical study of general-purpose coding agents on a fly optogenetics data-to-discovery pipeline. We assess agents on tasks substantially larger than existing benchmarks, datasets orders of magnitude bigger, and evaluation criteria grounded in domain expert standards. We show that agents can solve several individual pipeline stages, suggesting stage-level automation is tractable. By analyzing agents' code iterations, we show that they struggle most when there is not a pre-defined criterion to iterate on, and they must instead use their scientific judgment to assess their current solution, a key open challenge. Mirroring scientific practice, they sometimes attempt visual inspection of intermediate outputs for self-evaluation, but largely fail to interpret what they see or act on it appropriately. Solving the end-to-end pipeline correctly requires stringing together successes across all pipeline stages, and this is beyond agents' current abilities. We identify challenges largely absent from existing benchmarks, including computational resource management and generalization to large held-out data collections. Finally, we distill principles for constructing scientific tasks and rigorous evaluation criteria for open-ended problems.

2606.07810 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

SLMJury: Can Small Language Models Judge as Well as Large Ones?

SLMJury:小型语言模型能否像大型模型一样进行评判?

Anish Laddha, Nitesh Pradhan, Gaurav Srivastava

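Affiliations * LNMIIT; Virginia Tech

AI summary Introduces SLMJury, a framework for evaluating small language models as judges; finds a domain-dependent overthinking effect, differences in domain generalization across model families, a split between closed-ended and open-ended judging ability, and accuracy degradation under multi-agent debate.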

详情
AI中文摘要

大型语言模型(LLMs)被广泛用作评估模型输出的评判者,但其高成本、延迟和不透明性限制了可扩展性。我们引入SLMJury,一个评估小型语言模型(SLMs)作为评判者的框架,涵盖两种范式:闭端二元正确性和开端质量评分。我们在四个模型家族的16个SLM评判者(0.6B-14B参数)上,跨十个基准进行基准测试:八个闭端任务涵盖数学、科学和通用推理(每个配置N=64,824个判断),以及用于摘要和对话评分的SummEval和MT-Bench。我们将评判形式化为预算条件函数,并研究五个维度。得出四个发现。(1)过度思考效应是领域依赖的:对于大多数评判者,快速10令牌判决在数学评判上匹配或优于扩展推理(在有帮助的情况下提升2-7%),而推理在通用任务上胜出高达23%。(2)领域泛化区分了模型家族,数学到通用准确率差距从低于10%到接近40%不等。(3)闭端和开端评判依赖不同的能力:最佳二元评判者(Phi-4)在MT-Bench上降至第9名,而经过推理训练的模型则反转了这一顺序。(4)在反思-批判-改进(RCR)辩论协议下,多智能体辩论在所有测试配置中降低了准确性,而顶级评判者抵抗六种对抗性人格的方差<=0.55%。可靠的自动评估不需要大型专有模型,但没有单一的SLM占主导地位。排行榜可在https://anishh15.github.io/SLMJury/获取,我们的框架代码和pip包公开在https://github.com/anishh15/SLMJury和https://pypi.org/project/slmjury/。

英文摘要

Large language models (LLMs) are widely used as judges for evaluating model outputs, but their high cost, latency, and opacity limit scalability. We introduce SLMJury, a framework for evaluating small language models (SLMs) as judges across two paradigms: closed-ended binary correctness and open-ended quality scoring. We benchmark 16 SLM judges (0.6B-14B parameters) from four model families across ten benchmarks: eight closed-ended tasks spanning mathematical, scientific, and general reasoning (N=64,824 judgments per configuration), plus SummEval and MT-Bench for summarization and conversational scoring. We formalize judging as a budget-conditioned function and study five dimensions. Four findings emerge. (1) The overthinking effect is domain-dependent: for most judges quick 10-token verdicts match or beat extended reasoning on mathematical judging (by 2-7% where they help), while reasoning wins on general tasks by up to 23%. (2) Domain generalization separates model families, with math-to-general accuracy gaps ranging from under 10% to nearly 40%. (3) Closed-ended and open-ended judging draw on different capabilities: the best binary judge (Phi-4) drops to rank 9 on MT-Bench, while reasoning-trained models invert this ordering. (4) Under the Reflect-Critique-Refine (RCR) debate protocol, multi-agent debate degrades accuracy across all tested configurations, whereas the top judges resist six adversarial personas with <=0.55% variance. Reliable automated evaluation does not require large proprietary models, yet no single SLM dominates. The leaderboard is available at https://anishh15.github.io/SLMJury/, and our framework code and pip package are publicly available at https://github.com/anishh15/SLMJury and https://pypi.org/project/slmjury/.

2606.07951 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

From `May' to `Is': Certainty Distortion in Language Model Rewriting

从“可能”到“是”:语言模型改写中的确定性扭曲

Catarina G Belem, Shang Wu, Hongyu Yao, Mark Steyvers, Sameer Singh, Padhraic Smyth

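Affiliations * University of California Irvine; Massachusetts Institute of Technology

AI summary Studies the systematic bias of language models toward increasing expressed certainty in rewriting tasks, proposes an evaluation metric consistent with population-level judgments, and finds that up to 75% of outputs show certainty distortion, with models more likely to raise certainty than to lower it.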

详情
AI中文摘要

人类越来越多地以塑造信念和驱动决策的方式使用语言模型(LM),包括讨论、改写和总结来自科学文章、新闻和医学报告的信息。然而,在这些领域中,主张表达的信心程度至关重要,但关于LM是否忠实地保留它却知之甚少。在这项工作中,我们研究了LM中的确定性扭曲,定义为当语义内容被保留时,表达确定性的有意义变化。我们提出了一种基于LM的评估指标,该指标与人群层面的确定性判断一致。使用该指标,我们在科学和医学交流任务的背景下,表征了不同规模和系列的模型中的确定性扭曲。我们的结果表明,确定性扭曲影响了高达75%的LM输出,并且在改写任务中系统性地不对称,大多数LM将表达确定性增加的可能性是降低的1.5-2倍。这些效应可以通过重复释义累积:在医学领域,claude-haiku-4-5在一次迭代后增加了20%示例的确定性,五次迭代后增加到40%。基于提示的干预减少了整体确定性扭曲,但并未消除它。总之,这些发现揭示了普遍存在的夸大表达确定性的偏差,对在高风险领域依赖LM的用户有直接影响。

英文摘要

Humans increasingly turn to Language Models (LMs) in ways that shape beliefs and drive decisions, including discussing, rewriting, and summarizing information from scientific articles, news, and medical reports. However, in these domains, where how confidently a claim is expressed matters, little is known about whether LMs faithfully preserve it. In this work, we investigate certainty distortion in LMs, defined as meaningful changes in expressed certainty when semantic content is preserved. We propose an LM-based evaluation metric that is consistent with population-level judgments of certainty. Using this metric, we characterize certainty distortion across different sizes and families of models in the context of scientific and medical communication tasks. Our results show that certainty distortion affects up to 75% of LM outputs and is systematically asymmetric in rewriting tasks, with most LMs being 1.5-2 times more likely to increase the expressed certainty than to decrease it. These effects can compound over repeated paraphrasing: in the medical domain, claude-haiku-4-5 increases the certainty of 20% of examples after a single iteration, rising to 40% after five iterations. Prompt-based interventions reduce overall certainty distortion but do not eliminate it. Together, these findings reveal a general bias toward inflating expressed certainty, with direct implications for users who rely on LMs in high-stakes domains.

2606.08200 2026-06-09 cs.AI cs.LG 交叉投稿

Online Agent-as-a-Judge: Situation-Generating Evaluation for Interactive Agents

在线智能体作为裁判:面向交互式智能体的情境生成评估

Hyogon Ryu, Jeonghwan Kim, Yewon Lim, Chaeun Lee, Jeongwook Kim, Donghoon Ham

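Affiliations * KAIST

AI summary Proposes Online Agent-as-a-Judge, a framework that deploys an in-world evaluator agent to actively elicit relevant situations when evaluating interactive social agents, improving criteria coverage and agreement with human labels.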

Comments ICML 2026 Workshop on Trustworthy AI for Good

详情
AI中文摘要

评估基于LLM的交互式社交智能体具有挑战性,因为社交相关行为不仅取决于孤立输出,还取决于先前的交互、社会角色和后续行动。现有方法通常允许目标智能体在环境中自由行动,然后对生成的轨迹进行评分。然而,这种被动设置可能会遗漏仅在特定社交情境下才可观察到的能力;例如,如果没有出现分歧,冲突处理可能不会被测试。我们提出在线智能体作为裁判,一种面向交互式社交智能体的情境生成评估框架。在线智能体作为裁判部署一个环境内评估智能体,通过环境原生的对话和行动协议与目标智能体交互,主动引出与评估标准相关的情境。生成的轨迹为评估即时响应和后续行为提供了证据。在一个包含32个设计师编写的社会标准的生命模拟环境中,在线智能体作为裁判提高了标准覆盖率和与人类标签的一致性,为被动方法可能未观察到的行为提供了更可靠的基于证据的评估。

英文摘要

Evaluating LLM-powered interactive social agents is challenging because socially relevant behaviors depend not only on isolated outputs, but also on prior interactions, social roles, and downstream actions. Existing methods typically allow a target agent to act freely in an environment and then score the resulting trajectory. However, this passive setup can miss capabilities that only become observable under specific social circumstances; for example, conflict handling may remain untested if no disagreement arises. We propose Online Agent-as-a-Judge, a situation-generating evaluation framework for interactive social agents. Online Agent-as-a-Judge deploys an in-world evaluator agent that interacts with the target agent through the environment's native dialogue and action protocol, actively eliciting situations relevant to the evaluation criteria. The resulting trajectories provide evidence for assessing both immediate responses and subsequent behavior. In a life-simulation environment with 32 designer-authored social criteria, Online Agent-as-a-Judge improves criteria coverage and agreement with human labels, yielding more reliable evidence-grounded evaluations of behaviors that passive methods can leave unobserved.

2606.08228 2026-06-09 q-fin.TR cs.LG q-fin.CP q-fin.ST 交叉投稿

Post-Rejection Follow-up Sampling: A Methodology for Counterfactual Outcome Measurement in Algorithmic DEX Trading

拒绝后跟踪采样:算法DEX交易中反事实结果测量的一种方法

Arati Uday Kamat

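Affiliations * Independent Researcher

AI summary Introduces Post-Rejection Follow-up Sampling (PRFS), in which a separate tracking subsystem samples the price and liquidity of rejected tokens so that filter precision can be evaluated; the companion dataset contains 67,000 observation rows across 2,997 rejection events.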

Comments 12 pages. Companion methodology paper to RED-2400 (arXiv:2605.12151). Currently under review at Ledger. SSRN abstract ID 6607301. Zenodo concept DOI 10.5281/zenodo.20043516

详情
AI中文摘要

去中心化交易所(DEX)上的算法交易系统拒绝了它们评估的大多数候选代币。被拒绝候选代币的反事实结果(如果系统进入会发生什么)很少被测量。本文介绍了拒绝后跟踪采样(PRFS)。一个独立的跟踪子系统以可配置的频率对每个被拒绝代币的价格和流动性进行采样,时间跨度长达二十四小时。PRFS提供了评估过滤器精度所需的数据,这些数据基于被拒绝候选代币的实际市场结果,而不是基于合成的回测重建。方法论、数据架构和存款格式在第三节中描述。配套数据集包含2997个拒绝事件的67000个前向结果观测行,涵盖457个独特的铸币厂,在连续八天的时间窗口内收集(2026-04-10至2026-04-19,UTC)。大约55%的拒绝事件至少有一个前向观测;铸币厂级别的覆盖是完整的。下游分类的主要约束是每个事件的时间密度,而不是事件级别的覆盖。PRFS是数据集无关的。它适用于任何拒绝次数大大超过执行次数的算法决策系统。

英文摘要

Algorithmic trading systems on decentralised exchanges (DEXs) reject most candidate tokens they evaluate. The counterfactual outcome of rejected candidates (what would have happened had the system entered) is rarely measured. This paper introduces Post-Rejection Follow-up Sampling (PRFS). A separate tracking subsystem samples each rejected token's price and liquidity at a configurable cadence, over a horizon of up to twenty-four hours. PRFS produces the data needed to evaluate filter precision against actual market outcomes of rejected candidates, not against synthetic backtest reconstructions. The methodology, data architecture, and deposit format are described in Section III. The companion dataset contains 67,000 forward-outcome observation rows across 2,997 rejection events spanning 457 unique mints, collected over a continuous eight-day window (2026-04-10 to 2026-04-19, UTC). Approximately 55 percent of rejection events receive at least one forward observation; coverage at the mint level is complete. The principal binding constraint on downstream classification is per-event horizon density, not event-level coverage. PRFS is dataset-independent. It generalises to any algorithmic decision system in which rejections substantially outnumber executions.
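
The sampling idea (a configurable cadence over a horizon of up to twenty-four hours) can be sketched as a schedule generator. This is a minimal illustration under assumptions: the function name, the fixed-interval cadence, and the example timestamp are hypothetical and are not taken from the paper's implementation.

```python
from datetime import datetime, timedelta, timezone

def follow_up_times(rejected_at, cadence, horizon=timedelta(hours=24)):
    """Return sampling timestamps after a rejection, spaced by `cadence`, up to `horizon`."""
    if cadence <= timedelta(0):
        raise ValueError("cadence must be positive")
    times = []
    t = rejected_at + cadence
    while t <= rejected_at + horizon:
        times.append(t)
        t += cadence
    return times

# Hypothetical rejection event, followed up every six hours for one day.
rejected_at = datetime(2026, 4, 10, 12, 0, tzinfo=timezone.utc)
schedule = follow_up_times(rejected_at, cadence=timedelta(hours=6))
for ts in schedule:
    print(ts.isoformat())
```

A real tracker would record price and liquidity at each timestamp; a shorter cadence raises the per-event horizon density that the abstract names as the binding constraint.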

2606.08340 2026-06-09 cs.AI cs.LG cs.MA 交叉投稿

Benchmarking Open-Ended Multi-Agent Coordination in Language Agents

开放式多智能体协作在语言智能体中的基准测试

Kale-ab Abebe Tessera, Andras Szecsenyi, Cameron Barker, Alexander Rutherford, Davide Paglieri, Aidan Scannell, Henry Gouk, Elliot J. Crowley, Tim Rocktäschel, Amos Storkey

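Affiliations * University of Edinburgh; University of Oxford; University College London

AI summary Introduces Alem, a JAX-based benchmark for open-ended multi-agent coordination, evaluates 13 modern LLMs zero-shot in a long-horizon survival world, and identifies coordination as a distinct bottleneck for frontier LLM agents.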

Comments 42 pages, preprint

详情
AI中文摘要

随着语言模型越来越多地被部署为自主智能体,它们必须在开放式交互任务中与他人进行长期协调。然而,现有评估很少同时测试这些需求,而是强调单智能体任务、短交互或高度结构化的多智能体设置。我们提出了$alem$,一个基于JAX的开放式多智能体协作基准,构建在类似Craftax的动态之上。Alem将程序生成的协调任务、软专业化、通信和可控制的协调难度嵌入到一个具有探索、制作、交易和战斗的长期生存世界中。我们在同质团队中零样本评估了$13$种现代LLM,并以训练好的MARL智能体作为参考点。当前的LLM智能体远未解决Alem,平均标准化回报仅约6%,但它们的失败并非均匀分布。在最难的协调设置下,零样本的Gemini-3.1-Pro-High接近训练了十亿步的MARL智能体,而GPT-5.4-High实现了强基础任务奖励但协调奖励低得多。这种对比表明,个体任务能力并不等同于协调能力。消融实验表明,通信是协调的最大贡献者,而记忆和推理在用于维护多步计划时有所帮助。总体而言,我们的结果将协调确定为前沿LLM智能体的一个独立瓶颈,与单智能体能力分开。Alem使这一瓶颈可测量,并为开发能够通信、分配角色和执行共享计划的智能体提供了一个受控测试平台。代码可在https://github.com/alem-world/alem-env获取。

英文摘要

As language models are increasingly deployed as autonomous agents, they must coordinate with others over long horizons in open-ended interactive tasks. Yet existing evaluations rarely test these demands together, instead emphasising single-agent tasks, short interactions, or highly structured multi-agent settings. We introduce alem, a JAX-based benchmark for open-ended multi-agent coordination built on Craftax-like dynamics. Alem embeds procedurally generated coordination tasks, soft specialisation, communication, and controllable coordination difficulty into a long-horizon survival world with exploration, crafting, trading, and combat. We evaluate 13 modern LLMs zero-shot within homogeneous teams, with trained MARL agents as reference points. Current LLM agents remain far from solving alem, averaging only ~6% normalised return, but their failures are not uniform. On the hardest coordination setting, zero-shot Gemini-3.1-Pro-High approaches MARL agents trained for one billion steps, while GPT-5.4-High achieves strong base-task reward but much lower coordination reward. This contrast shows that individual task competence does not imply coordination competence. Ablations show that communication is the largest contributor to coordination, while memory and reasoning help when used to maintain multi-step plans. Overall, our results identify coordination as a distinct bottleneck for frontier LLM agents, separate from single-agent capabilities. Alem makes this bottleneck measurable and provides a controlled testbed for developing agents that communicate, allocate roles, and execute shared plans. Code is available at https://github.com/alem-world/alem-env.

2606.08372 2026-06-09 cs.CR cs.LG 交叉投稿

SoK: Reconstruction Attacks on Synthetic Tabular Data (Insights from Winning the NIST CRC)

SoK: 合成表格数据的重建攻击(来自赢得NIST CRC的见解)

Steven Golob, Sikha Pentyala, Martine De Cock

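Affiliations * School of Engineering and Technology, University of Washington Tacoma; Department of Mathematics, Computer Science, and Statistics, Ghent University

AI summary Systematizes reconstruction attacks on de-identified and synthetic tabular data, contributing a taxonomy, the most systematic empirical evaluation to date, and new attacks, plus a methodology for interpreting attack success; finds that the choice of synthetic data generation method governs risk more than the choice of attack and that differential privacy protects mainly at small budgets.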

详情
AI中文摘要

合成数据越来越被推广为发布敏感表格记录的隐私保护替代方案,但其核心对抗威胁(“重建”,即从合成发布和少量已知准标识符中恢复个体的隐藏属性值)仅在分散且难以比较的设置中研究过。我们首次系统化了针对去标识化和合成表格数据的重建(等价于属性推断)攻击。我们贡献了一个分类法,按攻击利用的结构组织攻击;迄今为止最系统的实证评估,将14种攻击与5个基准数据集上的9种合成数据生成(SDG)方法进行对比;以及一组填补分类法空白的新攻击,其中一种(CoBP-RA)是我们测量到的最强攻击。关键的是,我们引入了一种解释攻击成功含义的方法:一个记忆测试,区分从训练记录的记忆中重建总体分布,以及一个归约,将重建和成员推断置于单一可比较的尺度上。我们的发现:SDG方法的选择对风险的影响远大于攻击的选择;差分隐私主要在小预算($\varepsilon\lesssim1$)下提供保护,超过该预算保护趋于平稳,受限于合成器的容量而非噪声;去标识化方法最暴露;大多数重建反映分布结构而非记忆,将个体风险集中在异常记录上。这些攻击和基础设施通过我们在2025年国家标准与技术研究院(NIST)合作研究周期中所有红队中取得第一名的成绩得到了外部验证。

英文摘要

Synthetic data is increasingly promoted as a privacy-preserving substitute for releasing sensitive tabular records, yet its central adversarial threat ("reconstruction", the recovery of an individual's hidden attribute values from a synthetic release and a handful of known quasi-identifiers) has been studied only in scattered, hard-to-compare settings. We present the first systematization of reconstruction (equivalently, attribute inference) attacks on de-identified and synthetic tabular data. We contribute a taxonomy that organizes attacks by the structure they exploit; the most systematic empirical evaluation to date, pitting fourteen attacks against nine synthetic data generation (SDG) methods across five benchmark datasets; and a set of new attacks that fill gaps in the taxonomy, one of which (CoBP-RA) is the strongest attack we measure. Crucially, we introduce a methodology for interpreting what attack success means: a memorization test that distinguishes reconstruction of the population distribution from memorization of training records, and a reduction that places reconstruction and membership inference on a single comparable scale. Our findings: the choice of SDG method governs risk far more than the choice of attack; differential privacy protects mainly at small budgets ($\varepsilon \lesssim 1$), above which protection plateaus, bounded by the synthesizer's capacity rather than its noise; de-identification methods are the most exposed; and most reconstruction reflects distributional structure rather than memorization, concentrating individual risk on atypical records. The attacks and infrastructure are externally validated by our first-place finish among all red teams in the 2025 National Institute of Standards and Technology (NIST) Collaborative Research Cycle.

2606.08460 2026-06-09 stat.ML cs.LG 交叉投稿

LOTTERY: Learning from Reference-Only Samples in Two-Sample Testing under Size Asymmetry

LOTTERY: 在样本量不对称下的双样本检验中仅从参考样本学习

Xunye Tian, Zhijian Zhou, Liuhua Peng, Feng Liu

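Affiliations * University of Science and Technology of China

AI summary For two-sample testing with abundant reference samples but very few query samples, proposes learning reference-dependent representations from the reference samples and weighting them adaptively, with permutation-based type I error control and consistency.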

Comments 16 pages, 1 figure

详情
Journal ref ICML 2026
AI中文摘要

数据自适应的双样本检验通过从数据中学习的差异(例如基于核的特征表示)来评估两个样本是否来自同一分布。这类方法通常依赖数据分割来解耦学习和检验,并控制I类错误。然而,这种范式不适用于样本量严重不平衡的小样本场景:有大量参考样本可用,而只有少量查询样本。在本文中,我们展示了如何建设性地利用这种不平衡。利用丰富的参考数据,我们学习依赖参考的表示,这些表示总结了参考分布的主要结构,并为检测偏离提供了信息信号。我们引入了一系列表示族,捕获全局和局部结构,并通过不确定性引导原则仅使用参考样本自适应地加权它们。理论上,我们建立了基于置换的I类错误控制,并证明了聚合检验的一致性:随着样本量增长,只要表示集中至少包含一个一致表示,检验功效收敛到1。实验上,我们的聚合方法在多个基准测试中实现了强性能,同时保持了I类错误控制。

英文摘要

Data-adaptive two-sample testing assesses if two samples come from the same distribution, using a discrepancy learned from the data (e.g., via kernel-based feature representations). Such methods typically rely on data splitting to decouple learning from testing and control type I error. However, this paradigm is ill-suited to few-shot settings with severe sample-size imbalance: abundant reference samples are available, while only a handful of query samples arrive. In this paper, we show how this imbalance can be leveraged constructively. Using abundant reference data, we learn reference-dependent representations that summarize salient structure of the reference distribution and provide informative signals for detecting departures. We incorporate a collection of representation families that capture both global and local structure, and adaptively weight them using only reference samples via an uncertainty-guided principle. Theoretically, we establish permutation-based type I error control and show consistency of the aggregated test: as the sample sizes grow, the test power converges to one whenever the representation set contains at least one consistent representation. Empirically, our aggregation achieves strong performance across a range of benchmarks while retaining type I error control.
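
For readers unfamiliar with permutation-based type I error control, the sketch below shows a generic permutation two-sample test under size asymmetry (many reference samples, few query samples). It does not implement the paper's reference-only representation learning or aggregation; the mean-gap statistic is a stand-in for a learned discrepancy, and all names and data are hypothetical.

```python
import random

def permutation_test(reference, query, statistic, n_perm=999, seed=0):
    """Permutation p-value for H0: both samples come from the same distribution."""
    rng = random.Random(seed)
    observed = statistic(reference, query)
    pooled = list(reference) + list(query)
    n_ref = len(reference)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if statistic(pooled[:n_ref], pooled[n_ref:]) >= observed:
            exceed += 1
    # The +1 terms count the observed statistic itself, keeping the p-value valid.
    return (exceed + 1) / (n_perm + 1)

def mean_gap(a, b):
    """Absolute difference of sample means; a stand-in for a learned discrepancy."""
    return abs(sum(a) / len(a) - sum(b) / len(b))

data_rng = random.Random(1)
reference = [data_rng.gauss(0.0, 1.0) for _ in range(500)]  # abundant reference samples
query = [data_rng.gauss(3.0, 1.0) for _ in range(5)]        # only a handful of queries
p_value = permutation_test(reference, query, mean_gap)
print(f"p-value: {p_value:.3f}")
```

Rejecting when the p-value is at most the chosen level bounds the type I error by that level, because under the null hypothesis the pooled samples are exchangeable.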

2606.08529 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Scaffold Effects on GAIA: A Controlled Comparison

脚手架对GAIA的影响:一项受控比较

Jason Starace

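Affiliations * Independent Researcher

AI summary Compares three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models on the GAIA validation set in a controlled experiment; scaffold choice alone shifts accuracy by up to 28 percentage points, and more capable models are not necessarily less scaffold-sensitive.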

Comments 12 pages, 3 figures

详情
AI中文摘要

已发布的智能体能力评分混淆了模型本身的能力与脚手架赋予的能力,且这种激发差距的大小在受控条件下尚未得到充分表征。本研究在GAIA验证集的Level 1和Level 2上,对来自三个提供商的五个模型(Claude Opus 4.7、Sonnet 4.6、Haiku 4.5;Gemini 3.1 Pro Preview;GPT-5.5)进行了预先注册的受控比较,涉及三种脚手架(ReAct、规划-执行者-评估者多智能体设计以及规划-执行),保持任务和条件固定,每个问题尝试三次。仅脚手架选择就使单个模型(Opus,Level 2,稳健切片)的测量准确率移动了多达28个百分点,证实了预先注册的假设,即脚手架变化至少产生10个百分点的差距。预先注册的预测——能力更强的模型对脚手架敏感性更低——在方向上被拒绝:在每个数据集切片中,脚手架效应因模型而异,但能力最强的Anthropic模型在更难级别上从结构化脚手架中获益最多,且层级缩放仅在Level 1的稳健切片下成立。在Level 2上,多智能体相对于ReAct的优势出现在Anthropic系列内部,但跨提供商模型中没有,因此模型系列而非能力层级成为调节变量,而预测的规划-执行者在文件读取任务上的优势被证伪。结构化脚手架在更难级别上调用工具次数更少,但从中途错误中恢复的频率更高,且单个单元(Gemini搭配规划-执行者)在两个级别上成本最低,在Level 2上准确率最高。这些结果表明,单脚手架能力数值是脚手架条件估计,且激发差距不一定会随着模型改进而缩小。

英文摘要

Published agent capability scores conflate what a model can do with what its scaffold lets it do, and the magnitude of this elicitation gap is not well characterized under controlled conditions. This study executes a pre-registered controlled comparison of three scaffolds (ReAct, a Planner-Actor-Rater multi-agent design, and planner-then-executor) across five models from three providers (Claude Opus 4.7, Sonnet 4.6, Haiku 4.5; Gemini 3.1 Pro Preview; GPT-5.5) on GAIA validation Levels 1 and 2, holding tasks and conditions fixed, with three attempts per question. Scaffold choice alone moves measured accuracy by as much as 28 percentage points within a single model (Opus, Level 2, robust slice), confirming the pre-registered hypothesis that scaffold variation produces gaps of at least 10 points. The pre-registered prediction that more capable models would be less scaffold-sensitive is rejected in direction: scaffold effects vary significantly by model in every dataset slice, but the most capable Anthropic model gains the most from structured scaffolds at the harder level, and tier-scaling holds only at Level 1 under the robust slice. The multi-agent advantage over ReAct at Level 2 appears within the Anthropic family but not for the cross-provider models, making model family rather than capability tier the conditioning variable, and the predicted planner-executor advantage on file-reading tasks is falsified. Structured scaffolds make fewer tool calls yet recover more often from mid-trajectory errors at the harder level, and a single cell (Gemini with planner-then-executor) is the cheapest at both levels and the most accurate at Level 2. These results indicate that single-scaffold capability numbers are scaffold-conditional estimates and that the elicitation gap is not guaranteed to shrink as models improve.

2606.08669 2026-06-09 cs.SD cs.LG 交叉投稿

A Comparison of SSL-Based Feature Extractors and Back-End Classifiers for Spoofing Detection: A Multi-Corpus Training and Cross-Linguistic Analysis

基于SSL的特征提取器与后端分类器在欺骗检测中的比较:多语料库训练与跨语言分析

Anh-Tuan Dao, Driss Matrouf, Mickael Rouvier, Nicholas Evans

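Affiliations * Avignon Université; EURECOM

AI summary Compares four self-supervised learning feature extractors paired with four back-end classifiers for spoofing detection through multi-corpus training and cross-linguistic analysis, exposing a domain bias in the ASVspoof 5 dataset and showing that fine-tuning on just 8 hours of target-language data improves detection robustness.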

详情
AI中文摘要

语音生物识别系统面临来自欺骗攻击的日益增长的威胁,然而检测模型的评估在不同数据集上仍然不一致。为了研究这些不可预测的波动,我们对四种自监督学习特征提取器与四种后端分类器的组合进行了全面基准测试。我们比较了ResNet的层次化局部特征提取与基于注意力和图的后端的全局序列和关系建模。通过三种场景下的多语料库训练和六个评估数据集,我们的实证分析得出了两个关键发现。首先,我们揭示了ASVspoof 5数据集中的领域偏差,表明简单的数据缩放会主动降低性能。其次,我们的跨语言分析表明,仅用8小时的目标语言数据微调即可增强检测鲁棒性。这些发现共同强调了在欺骗检测中需要领域感知和语言特定适应的关键需求。

英文摘要

Voice biometric systems face growing threats from spoofing attacks, yet the evaluation of detection models remains inconsistent across datasets. To investigate these unpredictable fluctuations, we conduct a comprehensive benchmark of four self-supervised learning feature extractors paired with four back-end classifiers. We compare the hierarchical local feature extraction of ResNet with the global sequence and relational modeling of attention and graph-based back-ends. Through multi-corpus training across three scenarios and six evaluation datasets, our empirical analysis yields two critical findings. First, we expose a domain bias within the ASVspoof 5 dataset, showing that naive data scaling actively degrades performance. Second, our cross-linguistic analysis reveals that fine-tuning with just 8 hours of target-language data enhances detection robustness. Together, these findings emphasize the critical need for domain-aware and language-specific adaptation in spoofing detection.

2606.08679 2026-06-09 stat.ML cs.CL cs.LG stat.ME 交叉投稿

Rank Intervals for Leaderboards: A Hierarchical Framework for Model Evaluation

排行榜的排名区间:模型评估的分层框架

Bitya Neuhof, Yuval Benjamini

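Affiliations * Department of Statistics and Data Science

AI summary Proposes a hierarchical framework that quantifies uncertainty in model rankings with statistical guarantees, using task-level rank confidence intervals and leaderboard-level rank prediction intervals.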

详情
AI中文摘要

预训练模型通常在多任务排行榜上评估,以衡量其在不同场景中的适用性。然而,当前将跨任务性能聚合为排行榜级排名的方法并未解决任务层面的不确定性和变异性。尽管近期工作提出了基于区间的模型排名,但从单个任务到排行榜级排名的不确定性的原则性聚合仍未解决,且模型在不同任务上的性能变化常被掩盖。本文引入一个分层框架,在两层上构建具有统计保证的模型排名区间:通过成对比较构建任务级排名置信区间,以及使用共形方法构建排行榜级排名预测区间。这使得能够对每个观测任务和新潜在任务进行可靠的模型排名量化。在模拟数据以及TabArena和PromptEval(MMLU)基准上的实验表明,我们的方法产生统计有效且信息丰富的区间,从而在排行榜上实现可靠、具有不确定性意识的模型排名。

英文摘要

Pretrained models are often evaluated on multi-task leaderboards to measure their applicability in diverse contexts. However, current methods for aggregating performance across tasks into leaderboard-level rankings do not address the uncertainty and variability at the task level. While recent works have proposed interval-based model rankings, the principled aggregation of uncertainty from individual tasks to leaderboard-level rankings remains unaddressed, and variation in models' performance across tasks is frequently obscured. In this work, we introduce a hierarchical framework that constructs model rank intervals with statistical guarantees at both levels: task-level rank confidence intervals from pairwise comparisons, and leaderboard-level rank prediction intervals using a conformal approach. This enables reliable quantification of model rank for each observed task and for new potential tasks. Experiments on simulated data and the TabArena and PromptEval (MMLU) benchmarks show that our method yields statistically valid and informative intervals, enabling reliable, uncertainty-aware model ranking on leaderboards.
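
One common way to turn pairwise comparisons into rank intervals is sketched below: a model's best possible rank is one plus the number of models that significantly beat it, and its worst possible rank is the number of models minus those it significantly beats. The abstract does not spell out the authors' construction, so this is an assumed illustration with hypothetical names and inputs.

```python
def rank_intervals(models, significantly_better):
    """Map each model to a (best_rank, worst_rank) interval, with rank 1 the best.

    `significantly_better` is a set of (winner, loser) pairs judged significant by
    some pairwise test; how those pairs are obtained is outside this sketch.
    """
    n = len(models)
    intervals = {}
    for m in models:
        beaten_by = sum((other, m) in significantly_better for other in models)
        beats = sum((m, other) in significantly_better for other in models)
        intervals[m] = (1 + beaten_by, n - beats)
    return intervals

# Hypothetical task with four models and three significant pairwise wins.
models = ["A", "B", "C", "D"]
pairs = {("A", "C"), ("A", "D"), ("B", "D")}
intervals = rank_intervals(models, pairs)
for name, (best, worst) in intervals.items():
    print(f"{name}: rank in [{best}, {worst}]")
```

Models whose comparison is not significant keep overlapping intervals, which is the kind of uncertainty a single point ranking hides.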

2606.08729 2026-06-09 cs.RO cs.LG 交叉投稿

IR-SIM: A Lightweight Skill-Native Simulator for Navigation, Learning, and Benchmarking

IR-SIM:一种用于导航、学习和基准测试的轻量级技能原生模拟器

Ruihua Han, Shuai Wang, Chengyang Li, Rui Gao, Xinyi Wang, Zhe Liu, Guoliang Li, Yupu Lu, Qi Hao, Jia Pan, Hengshuang Zhao

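Affiliations * The University of Hong Kong; Shenzhen Institutes of Advanced Technology; Southern University of Science and Technology; University of Michigan; University of Macau

AI summary Presents IR-SIM, a lightweight skill-native navigation simulator in which scenarios are defined entirely by YAML configuration files and can be generated and modified from text prompts; it supports benchmarking of navigation algorithms and automated generation of training data, and bridges to high-fidelity simulators and real-world deployment.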

Comments 12 pages, 6 figures, project website: https://github.com/hanruihua/ir-sim

详情
AI中文摘要

模拟在由大型语言模型(LLM)支持的自动化机器人研究中起着关键作用。然而,现有的模拟器通常需要自定义代码或复杂接口,为快速原型设计和自动化算法开发设置了障碍。为此,我们提出了智能机器人模拟器(IR-SIM),一种轻量级的技能原生导航模拟器,专为快速场景构建、基准测试和机器人学习而设计。在IR-SIM中,场景完全由YAML配置文件定义,这些文件指定了移动机器人运动学、几何碰撞检测、激光雷达感知、可视化和行为模块。这种设计使机器人模拟完全可描述和可复现,允许通过提出的IR-SIM智能体技能从文本提示生成和修改场景。生成的场景可用于导航算法的自动基准测试以及学习方法的训练数据自动生成。此外,IR-SIM提供了到高保真模拟器和真实世界部署的桥梁,允许用户在原型设计后无需额外编码即可在更真实的环境中验证其算法。实验展示了IR-SIM在多个任务中的便利性和多功能性:从自然语言构建导航场景、训练避碰策略、对社交导航策略进行基准测试,以及桥接到高保真模拟器和真实世界部署。项目网站见https://github.com/hanruihua/ir-sim。

英文摘要

Simulation plays a key role in automated robotics research supported by large language models (LLMs). However, existing simulators often require custom code or complex interfaces, creating a barrier to rapid prototyping and automated algorithm development. To this end, we propose the Intelligent Robot Simulator (IR-SIM), a lightweight skill-native navigation simulator designed for rapid scenario construction, benchmarking, and robot learning. In IR-SIM, scenarios are entirely defined by YAML configuration files that specify mobile robot kinematics, geometric collision checking, LiDAR sensing, visualization, and behavior modules. This design makes robotic simulation fully describable and reproducible, allowing scenarios to be generated and modified from text prompts through the proposed IR-SIM agent skills. The resulting scenarios can be used for automated benchmarking of navigation algorithms and for automated generation of training data for learning methods. Furthermore, IR-SIM provides bridges to high-fidelity simulators and real-world deployment, allowing users to validate their algorithms in more realistic settings after prototyping without extra coding. The experiments showcase the convenience and versatility of IR-SIM in multiple tasks: constructing navigation scenarios from natural language, training a collision avoidance policy, benchmarking social navigation policies, and bridging to high-fidelity simulators and real-world deployment. The project website is available at https://github.com/hanruihua/ir-sim.

2606.08864 2026-06-09 cs.CV cs.LG 交叉投稿

CHROMA: Detecting AI-Generated Images through Inter-Channel Color-Space Correlations

CHROMA: 通过通道间色彩空间相关性检测AI生成图像

Juan Pablo Sotelo, Marina Gardella, Pablo Musé

Affiliations * Instituto de Ingeniería Eléctrica, Facultad de Ingeniería, Universidad de la República, Montevideo, Uruguay; Université Paris-Saclay, ENS Paris-Saclay, CNRS, Centre Borelli, Gif-sur-Yvette, 91190 France

AI summary Uses inter-channel color correlations as a lightweight forensic cue: RGB inputs are augmented with correlation maps and fed to a fixed CNN backbone trained on a modest compute budget, improving real-vs-generated discrimination and robustness.

Comments This manuscript has been accepted for publication at the 28th International Conference on Pattern Recognition (ICPR 2026). The final published version will appear in the Springer LNCS proceedings

详情
AI中文摘要

扩散模型和大规模生成模型的快速普及使得区分合成图像与真实照片越来越具有挑战性。尽管已有自动检测器被提出,但它们对未见生成器的泛化能力仍然脆弱。为解决这一局限,我们研究了通道间色彩相关性,这是一种轻量级且未被充分利用的取证线索。我们首先证明,LPIPS(一种广泛使用的感知度量)对选择性改变不同色彩空间参数化下通道依赖性的扰动表现出不一致的响应,表明跨通道统计量并不受常见感知训练目标的统一约束。受此启发,我们分析了多个色彩空间中成对通道间相关性特征的分布。我们的分析揭示了这些分布中系统性的、生成器特定的差异,其中RGB和Lab色彩空间提供了真实图像与生成图像之间最明显的分离。基于此,我们引入了Chroma,一种AI生成图像检测器,它用通道间相关性图增强标准RGB输入,并采用在适度计算预算下训练的固定CNN骨干网络。我们在单生成器训练和有限多生成器监督机制(仅从额外生成器获取少量样本)下评估其鲁棒性。在标准基准协议下,相关性增强的输入改善了真实与生成图像的区分能力和鲁棒性,在保持简单架构和训练过程的同时,性能与最新检测器相当。代码可在https://github.com/JPSoteloSilva/CHROMA获取。

英文摘要

The rapid adoption of diffusion and large-scale generative models has made it increasingly challenging to distinguish synthetic imagery from real photographs. While automated detectors have been proposed, their generalization to unseen generators remains brittle. To address this limitation, we investigate inter-channel color correlations, a lightweight and underexploited forensic cue. We first demonstrate that LPIPS, a widely used perceptual metric, exhibits inconsistent responses to perturbations that selectively alter channel dependence across different color-space parameterizations, indicating that cross-channel statistics are not uniformly constrained by common perceptual training objectives. Motivated by this, we analyze the distributions of pairwise inter-channel correlation features across multiple color spaces. Our analysis reveals systematic, generator-specific differences in these distributions, with RGB and Lab color spaces providing the most apparent separation between real and generated images. Building on this, we introduce Chroma, a detector of AI-generated images which augments standard RGB inputs with inter-channel correlation maps and employs a fixed CNN backbone trained with a modest computational budget. We assess its robustness under both single-generator training and a limited multi-generator supervision regime, where only a few samples from additional generators are available. Across a standard benchmark protocol, correlation-augmented inputs improve real-vs-generated discrimination and robustness, yielding performance competitive with recent detectors while maintaining a simple architecture and training procedure. Code is available at https://github.com/JPSoteloSilva/CHROMA
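The pairwise inter-channel correlation features mentioned in the abstract above can be illustrated with a minimal sketch. It computes global Pearson correlations between the R, G, and B channels of one image. This is an assumption-laden illustration, not the CHROMA implementation: the paper describes correlation maps fed to a CNN, whereas the hypothetical helper `channel_correlations` below returns only three scalar features and takes the image as a flat list of `(r, g, b)` tuples.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (NaN if one is constant)."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return float("nan")  # correlation is undefined for a constant channel
    return cov / math.sqrt(var_x * var_y)


def channel_correlations(pixels):
    """Hypothetical helper: pixels is a non-empty list of (r, g, b) tuples."""
    r, g, b = (list(channel) for channel in zip(*pixels))
    return {"RG": pearson(r, g), "RB": pearson(r, b), "GB": pearson(g, b)}


# Synthetic "image": G rises with R, B falls with R.
pixels = [(v, 2 * v + 1, 255 - v) for v in range(0, 256, 5)]
features = channel_correlations(pixels)
```

The same function could be applied after converting the pixels to another color space such as Lab; that conversion is left out here.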

2606.08960 2026-06-09 cs.CR cs.AI cs.LG cs.MA 交叉投稿

Hardening Agent Benchmarks with Adversarial Hacker-Fixer Loops

通过对抗性黑客-修复者循环强化智能体基准测试

Ziqian Zhong, Ivgeni Segal, Ivan Bercovich, Shashwat Saxena, Kexun Zhang, Aditi Raghunathan

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Fewshot Corp(Fewshot公司) Independent Researcher(独立研究员)

AI总结 提出黑客-修复者循环方法,通过LLM代理交替攻击和修补验证器,自动生成抗利用的验证器,将KernelBench攻击成功率从62%降至0%。

详情
AI中文摘要

智能体基准测试通常使用手工编写且脆弱的验证器来评分提交结果,这容易导致奖励黑客攻击。我们审计了五个终端智能体基准测试中的1,968个任务,发现其中323个(16%)可以被前沿模型仅通过任务描述成功攻击。这既破坏了排行榜排名,也破坏了强化学习训练信号,但标准的应对措施是手动且被动的。我们引入了黑客-修复者循环,一种无需逐任务手动修补即可构建抗利用验证器的方法。该循环交替使用三个LLM代理:黑客尝试在不解决任务的情况下通过验证器,修复者修补验证器以拒绝每个发现的漏洞,求解者确认修补后的验证器仍接受合法解决方案。循环迭代:每次修补都会重塑验证器的奖励机制,从而暴露下一个漏洞。我们进一步增加了验证器访问权限,并允许修补跨任务迁移,以扩大循环发现的漏洞范围。在KernelBench上,该循环将公开报告的漏洞语料库上的攻击成功率从62%降至0%。我们还发现,循环中的较弱代理可以防御更强的黑客:Gemini 3 Flash的循环将更强的Gemini 3.1 Pro和Claude Opus 4.7在KernelBench上的攻击成功率从76%和61%降至0%,而Gemini 3.1 Pro在Terminal Bench上的攻击成功率从39%降至17%(覆盖77个任务)。我们发布了Terminal Wrench(323个可攻击环境,3,632条攻击轨迹)作为当前攻击面的快照,以及我们修补后的验证器、循环发现的漏洞和我们的实现,作为未来工作的基础。

英文摘要

Agent benchmarks score submissions with outcome verifiers that are typically hand-written and brittle, leaving them open to reward hacking. We audit 1,968 tasks across five terminal-agent benchmarks and find 323 (16%) hackable by frontier models given only the task description. This corrupts both leaderboard rankings and RL training signal, yet the standard response is manual and reactive. We introduce the hacker-fixer loop, a method for building exploit-resistant verifiers without per-task manual patching. The loop alternates three LLM agents: a hacker tries to pass the verifier without solving the task, a fixer patches the verifier to reject each discovered exploit, and a solver confirms the patched verifier still admits legitimate solutions. The loop iterates: each patch reshapes what the verifier rewards, surfacing the next exploit. We further add verifier access, and let patches transfer across tasks, to broaden the exploits the loop discovers. On KernelBench, the loop drives the attack success rate from 62% to 0% on a held-out corpus of publicly reported exploits. We also find that weaker agents in the loop can defend against much stronger hackers: Gemini 3 Flash's loop drives the stronger Gemini 3.1 Pro and Claude Opus 4.7's attack success rate from 76% and 61% to 0% on KernelBench, and Gemini 3.1 Pro's from 39% to 17% on Terminal Bench across 77 tasks. We release Terminal Wrench (323 hackable environments, 3,632 hack trajectories) as a snapshot of the current attack surface, our patched verifiers, the exploits the loop discovered, and our implementation as a basis for future work.
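The control flow of the hacker-fixer loop described above can be sketched as follows. This is a toy under stated assumptions, not the authors' implementation: the three LLM agents are replaced by plain callables, a verifier is modeled as a list of predicate checks, and the names `hacker_fixer_loop`, `passes`, `hacker`, `fixer`, and `solver_ok` are all hypothetical.

```python
def passes(checks, submission):
    """A submission passes the verifier if every check accepts it."""
    return all(check(submission) for check in checks)


def hacker_fixer_loop(checks, hacker, fixer, solver_ok, max_rounds=5):
    """Alternate hacker, fixer, and solver until no exploit is found.

    hacker(checks)    -> a non-solution that passes the verifier, or None
    fixer(exploit)    -> a new check that rejects that exploit
    solver_ok(checks) -> True if a legitimate solution still passes
    """
    checks = list(checks)
    for _ in range(max_rounds):
        exploit = hacker(checks)
        if exploit is None:
            break  # the hacker found nothing this round
        patched = checks + [fixer(exploit)]
        if not solver_ok(patched):
            break  # the patch rejects legitimate work, so keep the old verifier
        checks = patched
    return checks


# Toy task: return [3, 1, 2] sorted. The initial verifier only checks length.
target = [1, 2, 3]
known_exploits = [[0, 0, 0], [3, 1, 2]]


def hacker(checks):
    # Return the first known non-solution that still passes, if any.
    return next((e for e in known_exploits if passes(checks, e)), None)


def fixer(exploit):
    bad = list(exploit)
    return lambda s: s != bad  # crude blacklist patch, for illustration only


def solver_ok(checks):
    return passes(checks, target)


hardened = hacker_fixer_loop([lambda s: len(s) == 3], hacker, fixer, solver_ok)
```

Each accepted patch changes what the verifier rewards, which is why the second exploit only surfaces after the first one has been blocked.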

2606.09409 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Correct Looks Better: Pairwise Comparisons Reveal Accuracy Rankings

正确看起来更好:成对比较揭示准确性排名

Mina Remeli, Moritz Hardt

发表机构 * Max Planck Institute for Intelligent Systems, Tübingen, Germany(马克斯·普朗克智能系统研究所,蒂宾根,德国) Tübingen AI Center(蒂宾根人工智能中心)

AI总结 本文通过将基准测试转化为生成式评估,发现成对比较结合Elo方法得到的模型排名与基于真实准确率的排名高度一致(Spearman相关系数>0.9),且风格和裁判偏见影响较小,但答案重复(echo)是裁判偏好的因果驱动因素。

Comments Accepted at ICML'26

详情
AI中文摘要

成对比较结合诸如Elo等聚合方法已成为评估生成模型的核心,但人们仍担心它们会奖励肤浅的风格线索或显示裁判偏见。从更积极的角度看,我们表明,当存在真实准确率用于比较时,成对比较得出的模型排名与基于真实准确率的排名高度一致。通过将五个知名基准测试转化为自由形式的生成评估,我们发现Elo排名与准确率排名的Spearman相关系数超过0.9,并且在裁判较弱时显著优于直接评估。此外,风格和裁判偏见对模型排名的影响较小,尽管大多数判断发生在两个候选答案都正确(或都错误)的成对上。在这样的成对比较中,我们发现最终答案后的重复(echo)是裁判偏好的因果驱动因素。

英文摘要

Pairwise comparisons combined with aggregation methods like Elo have become central to evaluating generative models, yet concerns remain that they reward superficial stylistic cues or display judge biases. In a more positive turn, we show that model rankings from pairwise comparisons strongly agree with ground-truth-based accuracy rankings when such ground truth is available for comparison. By converting five well-known benchmarks into free-form generative evaluations, we find that Elo rankings achieve a Spearman correlation above 0.9 with accuracy rankings and substantially outperform direct evaluation when the judge is weak. Furthermore, style and judge bias have only minor effects on model rankings, despite most judgments occurring on pairs where both candidate answers are correct (or incorrect). On such pairs, we find that repetition after the final answer (echo) is a causal driver of judge preference.
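The comparison described above, between an Elo ranking built from pairwise judgments and a ranking by ground-truth accuracy, can be sketched in a few lines. The K-factor, base rating, match data, and accuracies below are made-up assumptions for illustration, and the Spearman helper assumes no ties; none of this is taken from the paper's code.

```python
def elo_ratings(models, matches, k=16.0, base=1000.0):
    """Sequential Elo updates from (winner, loser) pairs."""
    ratings = {m: base for m in models}
    for winner, loser in matches:
        # Expected score of the winner under the standard logistic Elo model.
        expected = 1.0 / (1.0 + 10 ** ((ratings[loser] - ratings[winner]) / 400.0))
        delta = k * (1.0 - expected)
        ratings[winner] += delta
        ratings[loser] -= delta
    return ratings


def spearman(xs, ys):
    """Spearman rank correlation for sequences without ties."""
    def ranks(values):
        order = sorted(range(len(values)), key=lambda i: values[i])
        result = [0] * len(values)
        for rank, index in enumerate(order):
            result[index] = rank
        return result

    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


models = ["A", "B", "C"]
matches = [("A", "B"), ("A", "C"), ("B", "C")] * 5   # hypothetical judge verdicts
accuracy = [0.9, 0.7, 0.5]                           # hypothetical ground-truth accuracy
ratings = elo_ratings(models, matches)
rho = spearman([ratings[m] for m in models], accuracy)
```

With real judge data the match list would come from an LLM or human judge, and ties in either ranking would call for a tie-aware Spearman implementation.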

2606.09473 2026-06-09 stat.ML cs.LG 交叉投稿

Report the Floor: A Training-Free Conformal Interval Is a Mandatory Baseline for Probabilistic Time-Series Forecasting

报告基线:无训练共形区间是概率时间序列预测的强制性基准

Valery Manokhin

发表机构 * Independent researcher(独立研究者)

AI总结 提出无参数、无训练的共形朴素区间作为概率预测的强基线,在2217个真实序列上击败了多种现有方法,并主张其应成为强制性基准。

详情
AI中文摘要

概率预测器越来越多地通过学习得到,但它们所比较的基线往往较弱或被忽略。我们表明,最简单的共形区间——一个包裹在有限样本分割共形残差分位数中的最后值点预测,无参数且无需训练——是一个远比其在近期学习预测和共形时间序列比较中几乎完全缺失所暗示的更强大的基线。在来自九个公共来源(Monash、LOTSA、LTSF交通/电力/天气套件、METR-LA、BOOM、nips/probts)的2217个真实序列的单步在线预测中,这个ConformalNaive区间决定性地击败了朴素值分位数基线、整个NPTS系列(NPTS 73%,SeasonalNPTS 64%的序列)以及已发表的共形季节池(CSP)方法(71%的序列,bootstrap 95% CI [69,73],配对Wilcoxon p约7.6e-135);它与更简单的学习共形预测器(RCI,分位数回归;中位数相对Winkler在2%以内)相当,并且仅被跟踪分布偏移的自适应在线和集成方法(SPCI、ACI、AgACI)击败,后者在相对Winkler上领先9-33%。它也比训练过的神经预测器校准得更好:在引入DeepNPTS的六个数据集上,平凡的基线在名义95%下覆盖真实值84-85%的时间,而DeepNPTS为66%。在多步季节视界上,情况反转:随机游走基线是最弱的方法,季节池(CSP)获胜——我们描绘了这一边界。最后,我们给出了ConformalNaive+,一个一行代码、无训练、视界自适应的选择器,它在每个视界上达到两个互补基线中较好的一个,并恢复了覆盖。我们认为,每当学习概率预测器声称有改进时,匹配的共形朴素基线必须是一个强制性基准。

英文摘要

Probabilistic forecasters are increasingly learned, yet the baselines they are compared against are often weak or omitted. We show that the simplest possible conformal interval - a last-value point forecast wrapped in a finite-sample split-conformal residual quantile, with no parameters and no training - is a far stronger baseline than its near-total absence from recent learned-forecasting and conformal-time-series comparisons would suggest. In one-step-ahead online forecasting across 2,217 real series from nine public sources (Monash, LOTSA, the LTSF traffic/electricity/weather suites, METR-LA, BOOM, nips/probts), this ConformalNaive interval decisively beats the naive value-quantile baselines, the entire NPTS family (NPTS 73%, SeasonalNPTS 64% of series), and the published Conformal Seasonal Pools (CSP) method (71% of series, bootstrap 95% CI [69,73], paired Wilcoxon p approx 7.6e-135); it is on par with the simpler learned conformal predictors (RCI, quantile regression; median relative Winkler within 2%) and is beaten only by the adaptive-online and ensemble methods (SPCI, ACI, AgACI), which track distribution shift and lead by 9-33% relative Winkler. It is also better calibrated than a trained neural forecaster: on the six datasets that introduced DeepNPTS, the trivial floors cover the truth 84-85% of the time at a nominal 95%, versus DeepNPTS's 66%. At multi-step seasonal horizons the picture inverts: the random-walk floor is the weakest method and the seasonal pool (CSP) wins - a boundary we map. Finally we give ConformalNaive+, a one-line, training-free, horizon-adaptive selector that attains the better of two complementary floors at every horizon with restored coverage. We argue the matching conformal naive floor must be a mandatory baseline whenever a learned probabilistic forecaster claims gains.
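A minimal sketch of the kind of interval the abstract describes: a last-value point forecast wrapped in a finite-sample split-conformal quantile of past residuals. The function name `conformal_naive_interval`, the use of symmetric absolute residuals, and the choice to calibrate on the full history are assumptions made for illustration; the paper's exact calibration scheme may differ.

```python
import math


def conformal_naive_interval(history, alpha=0.05):
    """One-step-ahead interval around the last observed value.

    Residuals are the absolute errors the last-value forecast would have
    made on the history; the half-width is their finite-sample conformal
    quantile at rank ceil((n + 1) * (1 - alpha)).
    """
    if len(history) < 2:
        raise ValueError("need at least two observations")
    residuals = sorted(abs(b - a) for a, b in zip(history, history[1:]))
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    # With too few residuals the corrected rank exceeds n: the interval is unbounded.
    half_width = residuals[k - 1] if k <= n else math.inf
    point = history[-1]
    return point - half_width, point + half_width


lower, upper = conformal_naive_interval([1, 2, 4, 7, 11], alpha=0.5)
```

In an online setting the function would simply be called again after each new observation is appended to the history.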

2606.09547 2026-06-09 cs.CV cs.LG 交叉投稿

Streaming Interventions: Can Video Large Language Models Correct Mistakes as They Occur?

流式干预:视频大语言模型能否在错误发生时即时纠正?

Apratim Bhattacharyya, Shweta Mahajan, Sanjay Haresh, Rajeev Yasarla, Reza Pourreza, Litian Liu, Risheek Garrepalli, Roland Memisevic

发表机构 * Qualcomm AI Research(高通人工智能研究院) York University(约克大学) Vector Institute for AI(向量人工智能研究所)

AI总结 提出Ego-MC-Bench基准评估视频LLM在烹饪场景中的实时干预能力,并构建Ego-CoMist反事实合成数据集提升小模型性能。

Comments Qualcomm Interactive Cooking: Ego-MC-Bench -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-mistake-corrections and Ego-CoMist -- available at https://huggingface.co/datasets/neuripsedtracksub/ego-counterfactual-mistakes

详情
AI中文摘要

学习日常技能(如烹饪一道菜)越来越依赖于教学媒体,例如在线视频。这为使用视频(和多模态)大语言模型(LLMs)作为任务指导助手打开了大门。一个潜在的任务指导助手在现实世界中成功的关键能力是,它能够在错误一出现时就主动干预以引导用户。为了评估这一关键能力,我们引入了Ego-MC-Bench(错误纠正),这是一个用于评估在现实烹饪场景中反应性、逐步任务指导的基准。大量实验表明,Ego-MC-Bench对于最先进的视频LLMs具有高度挑战性。我们认为一个关键原因是用于在此任务上微调模型的训练数据有限。尽管存在广泛的烹饪视频数据集,但现有数据集缺乏错误示例以及适当时间的干预。为了帮助解决这一数据限制,我们还引入了Ego-CoMist,这是一个反事实合成数据集,通过将非交互式烹饪视频转换为显示主动干预的监督训练示例而创建。我们表明,在Ego-CoMist上进行微调可以带来性能提升,特别是对于更适合在边缘设备上提供帮助的更小、更高效的视频LLMs。

英文摘要

Learning everyday skills, like cooking a dish, relies increasingly on instructional media such as online videos. This opens the door to the use of video (and multimodal) large language models (LLMs) as task guidance assistants. A crucial capability for the real-world success of a prospective task guidance assistant is its ability to intervene proactively as soon as a mistake is apparent in order to guide the user. To evaluate this crucial capability, we introduce Ego-MC-Bench (Mistake Corrections), a benchmark for evaluating reactive, step-by-step task guidance in realistic cooking scenarios. Extensive experiments show that Ego-MC-Bench is highly challenging for state-of-the-art video LLMs. We argue that a key reason is the limited availability of training data for fine-tuning models on this task. Although there exists a wide range of cooking video datasets, existing datasets lack examples of mistakes along with appropriately timed interventions. To help address this data limitation, we also introduce Ego-CoMist, a counterfactual synthetic dataset created by transforming non-interactive cooking videos into supervised training examples showing proactive interventions. We show that fine-tuning on Ego-CoMist yields performance gains, especially for smaller and more efficient video LLMs that are well suited for delivering assistance on edge devices.

2606.09646 2026-06-09 cs.CV cs.AI cs.LG 交叉投稿

Do Video Foundation Models Understand Intuitive Physics? A Layerwise Probing Analysis

视频基础模型是否理解直觉物理?逐层探测分析

Samuele Punzo, Niccolò Caselli, Ippokratis Pantelidis, Francesco Massafra, Salvatore Lo Sardo, Mohammadreza Salehi

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 通过冻结特征探测,研究预训练视频基础模型在直觉物理信息上的编码能力,发现V-JEPA表现最佳,物理信息在中后期层最易获取,且时序破坏显著降低性能。

详情
AI中文摘要

我们研究预训练视频基础模型是否在其冻结表示中编码直觉物理信息,以及该信息如何随模型家族、层和探测类型变化。通过在IntPhys2和Minimal Video Pairs (MVP)上进行冻结特征探测,我们比较了预测联合嵌入模型(V-JEPA)、掩码重建模型(VideoMAE)和基于扩散的视频生成器(LTX-Video)。V-JEPA在基准测试中取得最强整体结果,尤其是在建模时序动态的探测器中,而VideoMAE仍具竞争力,LTX-Video恢复较弱但非平凡的信号。逐层分析表明,物理相关信息在早期层最弱,在中后期深度最易获取;时序控制表明,打乱帧顺序显著降低性能,尤其是在MVP上。综合来看,这些结果表明直觉物理知识在预训练视频表示中可靠地出现,但其可获取性强烈依赖于预训练范式、表示深度和读出机制。

英文摘要

We study whether pretrained video foundation models encode intuitive-physics information in their frozen representations, and how this information varies across model families, layers, and probe types. Using frozen-feature probing on IntPhys2 and Minimal Video Pairs (MVP), we compare predictive joint-embedding models (V-JEPA), masked reconstruction models (VideoMAE), and a diffusion-based video generator (LTX-Video). V-JEPA achieves the strongest overall results across benchmarks, especially with probes that model temporal dynamics, while VideoMAE remains competitive and LTX-Video recovers weaker but non-trivial signal. Layerwise analyses show that physics-relevant information is weakest in early layers and becomes most accessible at intermediate-to-late depth, and temporal controls show that disrupting frame order substantially reduces performance, especially on MVP. Together, these results suggest that intuitive-physics knowledge emerges reliably in pretrained video representations, but its accessibility depends strongly on pretraining paradigm, representational depth, and readout mechanism.

2606.09748 2026-06-09 cs.AI cs.CL cs.LG 交叉投稿

Multi-Turn Evaluation of Deep Research Agents Under Process-Level Feedback

深度研究智能体在过程级反馈下的多轮评估

Rishabh Sabharwal, Hongru Wang, Amos Storkey, Jeff Z. Pan

发表机构 * Google DeepMind OpenAI Perplexity AI LangChain AI

AI总结 针对深度研究智能体(DRA)在单轮输出评估的不足,提出研究缺口推断(RGI)方法提供过程级反馈,发现单轮过程反馈可提升8-15分,但多轮改进因回归问题难以持续。

Comments Published as a workshop paper at SCALE - ICML 2026 (Oral)

详情
AI中文摘要

现有的深度研究智能体(DRA)基准仅评估单次输出,忽略了一个关键问题:DRA能否在反馈指导下改进其报告?为此,我们在两种反馈设置下对DRA进行多轮评估:自我反思(智能体在无外部诊断信号的情况下修改报告)和过程级反馈(智能体接收针对其研究策略缺口的指导)。为提供过程级反馈,我们设计了研究缺口推断(RGI),该方法通过分析满足和未满足的评分标准模式来推断研究过程缺口。我们的分析揭示了三个关键发现:(i)在自我反思下,智能体以几乎相等的速率纳入和退步评分标准,导致净改进可忽略;(ii)单轮过程级反馈带来显著收益,将归一化分数提高约8-15分,并产生约35-40%的纳入率;(iii)这些收益在后续轮次中不会累积,因为智能体在重写完整报告以解决剩余缺口时,会退步多达24%的先前满足的标准。即使有针对性指导,我们所评估的DRA架构仍无法实现可靠的多轮改进。我们的代码和结果公开在 https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs。

英文摘要

Existing benchmarks for deep research agents (DRAs) assess only single-shot outputs, ignoring a key question: can DRAs improve their reports when guided by feedback? To investigate this, we conduct a multi-turn evaluation of DRAs under two feedback settings: self-reflection, in which the agent revises its report without any external diagnostic signal, and process-level feedback, in which the agent receives guidance targeting gaps in its research strategy. To enable process-level feedback, we design Research Gap Inference (RGI), a method that analyzes patterns of satisfied and unsatisfied rubric criteria to infer research-process gaps. Our analysis reveals three key findings: (i) under self-reflection, agents incorporate and regress on rubric criteria at nearly equal rates, yielding negligible net improvement; (ii) a single round of process-level feedback yields substantial gains, raising the normalized score by approximately 8-15 points and yielding a roughly 35-40% incorporation rate; (iii) these gains do not compound over subsequent turns, as agents regress on up to 24% of previously satisfied criteria when rewriting the full report to address remaining gaps. Even with targeted guidance, reliable multi-turn improvement remains out of reach for the DRA architectures we evaluate. Our code and results are publicly available at https://github.com/sabharwalrishabh/Multi-Turn-Evaluation-of-DRAs.

2402.08922 2026-06-09 cs.LG stat.ML 版本更新

The Mirrored Influence Hypothesis: Efficient Data Influence Estimation by Harnessing Forward Passes

镜像影响假说:利用前向传播的高效数据影响估计

Myeongseob Ko, Feiyang Kang, Weiyan Shi, Ming Jin, Zhou Yu, Ruoxi Jia

发表机构 * Virginia Tech(弗吉尼亚理工大学) Columbia University(哥伦比亚大学)

AI总结 提出镜像影响假说,将训练数据对测试预测的影响转化为逆问题,通过测试样本梯度加训练样本前向传播高效估计数据影响,显著提升效率。

Comments The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024

详情
Journal ref The IEEE/CVF Conference on Computer Vision and Pattern Recognition 2024
AI中文摘要

大规模黑盒模型已在众多应用中变得无处不在。理解单个训练数据源对这些模型预测的影响对于提高其可信度至关重要。当前的影响估计技术涉及计算每个训练点的梯度或在不同子集上重复训练。这些方法在扩展到大型数据集和模型时面临明显的计算挑战。在本文中,我们引入并探索了镜像影响假说,强调了训练数据和测试数据之间影响的互反性质。具体来说,它表明评估训练数据对测试预测的影响可以重新表述为一个等价的逆问题:评估如果模型在特定测试样本上训练,训练样本的预测将如何改变。通过经验验证和理论验证,我们证明了我们假说的广泛适用性。受此启发,我们引入了一种新的训练数据影响估计方法,该方法需要计算特定测试样本的梯度,并结合每个训练点的前向传播。这种方法可以利用常见的不对称性,即同时检查的测试样本数量远小于训练数据集的规模,从而相比现有方法在效率上获得显著提升。我们展示了我们的方法在一系列场景中的适用性,包括扩散模型中的数据归因、数据泄露检测、记忆化分析、错误标记数据检测以及语言模型中的行为追踪。我们的代码将在以下网址提供:https://github.com/ruoxi-jia-group/Forward-INF。

英文摘要

Large-scale black-box models have become ubiquitous across numerous applications. Understanding the influence of individual training data sources on predictions made by these models is crucial for improving their trustworthiness. Current influence estimation techniques involve computing gradients for every training point or repeated training on different subsets. These approaches face obvious computational challenges when scaled up to large datasets and models. In this paper, we introduce and explore the Mirrored Influence Hypothesis, highlighting a reciprocal nature of influence between training and test data. Specifically, it suggests that evaluating the influence of training data on test predictions can be reformulated as an equivalent, yet inverse problem: assessing how the predictions for training samples would be altered if the model were trained on specific test samples. Through both empirical and theoretical validations, we demonstrate the wide applicability of our hypothesis. Inspired by this, we introduce a new method for estimating the influence of training data, which requires calculating gradients for specific test samples, paired with a forward pass for each training point. This approach can capitalize on the common asymmetry in scenarios where the number of test samples under concurrent examination is much smaller than the scale of the training dataset, thus gaining a significant improvement in efficiency compared to existing approaches. We demonstrate the applicability of our method across a range of scenarios, including data attribution in diffusion models, data leakage detection, analysis of memorization, mislabeled data detection, and tracing behavior in language models. Our code will be made available at https://github.com/ruoxi-jia-group/Forward-INF.

2503.05169 2026-06-09 cs.LG 版本更新

phepy: Visual benchmarks and improvements for out-of-distribution detectors

phepy: 面向分布外检测器的可视化基准与改进

Felix Krumbiegel, Juniper Tyree, Michael Boy, Petri Clusius, Andreas Rupp

发表机构 * Department of Mathematics, Saarland University(萨尔兰大学数学系) Institute for Atmospheric and Earth System Research, University of Helsinki(赫尔辛基大学大气与地球系统研究所) School of Engineering Sciences, LUT University(卢霍斯大学工程科学学院)

AI总结 提出包含三个可视化玩具示例的OOD检测基准,评估现有方法,并引入t-poking和OOD样本加权改进监督式检测器在ID-OOD边界上的精度。

详情
AI中文摘要

将机器学习应用于日益高维且训练数据稀疏或有偏的问题,增加了模型在其训练领域之外的输入上使用的风险。对于此类分布外(OOD)输入,模型无法再做出有效预测,其误差可能无界。由于在真实数据集上测试OOD检测方法较为复杂,我们设计了一个OOD检测基准,其中包含三个新颖且易于可视化的玩具示例。这些简单示例提供了直接且直观的洞察,判断检测器是否能够检测(1)线性和(2)非线性概念,以及(3)在高维空间(干草堆)中识别细小的分布内(ID)子空间(针)。我们利用该基准评估了文献中多种方法的性能。由于OOD输入的触觉示例可能有益于OOD检测,我们还回顾了几种用于监督训练合成OOD输入的简单方法。我们引入了两项改进,即t-poking和OOD样本加权,使监督式检测器在ID-OOD边界上更加精确。当真实ID样本与合成OOD样本之间的冲突模糊了决策边界时,这一点尤为重要。最后,我们为在机器学习中构建和应用OOD检测器提供了建议。

英文摘要

Applying machine learning to increasingly high-dimensional problems with sparse or biased training data increases the risk that a model is used on inputs outside its training domain. For such out-of-distribution (OOD) inputs, the model can no longer make valid predictions, and its error is potentially unbounded. Since testing OOD detection methods on real-world datasets is complicated, we design a benchmark for OOD detection, which includes three novel and easily-visualisable toy examples. These simple examples provide direct and intuitive insight into whether the detector is able to detect (1) linear and (2) non-linear concepts and (3) identify thin in-distribution (ID) subspaces (needles) within high-dimensional spaces (haystacks). We use our benchmark to evaluate the performance of various methods from the literature. Since tactile examples of OOD inputs may benefit OOD detection, we also review several simple methods to synthesise OOD inputs for supervised training. We introduce two improvements, t-poking and OOD sample weighting, to make supervised detectors more precise at the ID-OOD boundary. This is especially important when conflicts between real ID and synthetic OOD samples blur the decision boundary. Finally, we provide recommendations for constructing and applying OOD detectors in machine learning.

2507.12843 2026-06-09 cs.LG stat.ML 版本更新

Are Two Datasets Close Enough With Statistical Significance? A Kernel Distributional Closeness Testing Approach

两个数据集在统计意义上是否足够接近?一种核分布接近性检验方法

Zhijian Zhou, Liuhua Peng, Xunye Tian, Mingming Gong, Feng Liu

AI总结 针对分布接近性检验(DCT)在复杂数据上的局限性,提出基于核的最大均值差异(MMD)的改进度量NAMMD,并构建NAMMD-DCT方法,在保持I类错误有界的同时提高检验功效。

详情
AI中文摘要

两个分布在统计意义上是否接近?分布接近性检验(DCT)通过检验分布对之间的距离是否至少为epsilon来形式化这一问题。现有的DCT方法主要测量定义在离散空间上的分布对之间的差异,例如使用总变差,这限制了它们在图像等复杂数据上的应用。为了将DCT扩展到更多类型的数据,一个自然的想法是将最大均值差异(MMD)引入DCT场景,MMD是衡量复杂分布之间分布差异的强大度量。然而,实证结果表明,许多分布对可能具有相同的MMD值,尽管它们在同一个再生核希尔伯特空间(RKHS)中具有不同的范数。这些分布对可能表现出不同的有限样本可区分性,并反映不同的实际接近程度,使得MMD在DCT中信息量不足。为了缓解这个问题,我们设计了一种新的分布差异度量——范数自适应MMD(NAMMD),它使用分布的RKHS范数来缩放MMD值。基于NAMMD的渐近分布,我们提出了基于NAMMD的DCT来评估分布对的接近程度。理论上,我们证明了基于NAMMD的DCT比基于MMD的DCT具有更高的检验功效,同时保持有界的I类错误。这一点在多种类型的数据(包括合成噪声和真实图像)上的大量实验中得到进一步验证。我们的代码可在https://github.com/zhijianzhouml/NAMMD获取。

英文摘要

Are two distributions close to each other with statistical significance? Distribution closeness testing (DCT) formalizes this question by testing whether the distance between a pair of distributions is at least epsilon. Existing DCT methods mainly measure discrepancies between distribution pairs defined on discrete spaces, for example using total variation, which limits their application to complex data such as images. To extend DCT to more types of data, a natural idea is to introduce maximum mean discrepancy (MMD), a powerful measure of distributional discrepancy between complex distributions, into DCT scenarios. However, empirical results indicate that many distribution pairs can have the same MMD value despite having different norms in the same reproducing kernel Hilbert space (RKHS). These pairs may exhibit different finite-sample distinguishability and reflect different practical closeness levels, making MMD less informative for DCT. To mitigate this issue, we design a new measure of distributional discrepancy, norm-adaptive MMD (NAMMD), which scales the MMD value using the RKHS norms of distributions. Based on the asymptotic distribution of NAMMD, we propose NAMMD-based DCT to assess the closeness level of a distribution pair. Theoretically, we prove that NAMMD-based DCT has higher test power than MMD-based DCT while maintaining bounded type-I error. This is further validated by extensive experiments on multiple types of data, including synthetic noise and real images. Our code is available at https://github.com/zhijianzhouml/NAMMD.
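For readers unfamiliar with the quantity that NAMMD rescales, here is a minimal sketch of the plain (biased) empirical estimate of squared MMD with a Gaussian kernel. It does not reproduce the paper's norm-adaptive scaling or its test procedure; the function names, the bandwidth of 1.0, and the sample points are assumptions chosen for illustration.

```python
import math


def gaussian_kernel(a, b, bandwidth=1.0):
    """Gaussian (RBF) kernel between two points given as tuples."""
    sq_dist = sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return math.exp(-sq_dist / (2.0 * bandwidth ** 2))


def mmd2_biased(xs, ys, bandwidth=1.0):
    """Biased estimate of MMD^2: mean k(x,x') + mean k(y,y') - 2 * mean k(x,y)."""
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


near_origin = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]
far_away = [(5.0, 5.0), (6.0, 5.0), (5.0, 6.0)]
same = mmd2_biased(near_origin, near_origin)      # identical samples
different = mmd2_biased(near_origin, far_away)    # well-separated samples
```

The estimate is zero for identical samples and grows as the two samples separate, which is the behaviour a closeness test builds on.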

2510.09783 2026-06-09 cs.LG cs.AI stat.ML 版本更新

Large Language Models for Imbalanced Classification: Diversity makes the difference

大语言模型用于不平衡分类:多样性至关重要

Dang Nguyen, Sunil Gupta, Kien Do, Thin Nguyen, Taylor Braund, Alexis Whitton, Svetha Venkatesh

发表机构 * Applied Artificial Intelligence Initiative (A2I2)(应用人工智能倡议(A2I2)) Deakin University(德肯大学) Black Dog Institute(黑狗研究所) University of New South Wales(新南威尔士大学)

AI总结 提出基于大语言模型的过采样方法,通过条件采样、排列微调和插值样本增强多样性,在10个表格数据集上优于8个基线方法。

详情
AI中文摘要

过采样是解决不平衡分类最广泛使用的方法之一。其核心思想是生成额外的少数类样本以重新平衡数据集。大多数现有方法(如SMOTE)需要将分类变量转换为数值向量,这通常会导致信息损失。最近,基于大语言模型(LLM)的方法被引入以克服这一限制。然而,当前的LLM方法通常生成多样性有限的少数类样本,降低了下游分类任务的鲁棒性和泛化能力。为了解决这一问题,我们提出了一种新的基于LLM的过采样方法,旨在增强多样性。首先,我们引入了一种采样策略,将合成样本生成条件化为少数类标签和特征。其次,我们开发了一种新的排列策略来微调预训练的LLM。第三,我们不仅在少数类样本上微调LLM,还在插值样本上微调以进一步丰富变异性。在10个表格数据集上的大量实验表明,我们的方法显著优于八个SOTA基线。生成的合成样本既真实又多样。此外,我们通过基于熵的视角提供了理论分析,证明了我们的方法鼓励生成样本的多样性。

英文摘要

Oversampling is one of the most widely used approaches for addressing imbalanced classification. The core idea is to generate additional minority samples to rebalance the dataset. Most existing methods, such as SMOTE, require converting categorical variables into numerical vectors, which often leads to information loss. Recently, large language model (LLM)-based methods have been introduced to overcome this limitation. However, current LLM-based approaches typically generate minority samples with limited diversity, reducing robustness and generalizability in downstream classification tasks. To address this gap, we propose a novel LLM-based oversampling method designed to enhance diversity. First, we introduce a sampling strategy that conditions synthetic sample generation on both minority labels and features. Second, we develop a new permutation strategy for fine-tuning pre-trained LLMs. Third, we fine-tune the LLM not only on minority samples but also on interpolated samples to further enrich variability. Extensive experiments on 10 tabular datasets demonstrate that our method significantly outperforms eight SOTA baselines. The generated synthetic samples are both realistic and diverse. Moreover, we provide theoretical analysis through an entropy-based perspective, proving that our method encourages diversity in the generated samples.

2511.03877 2026-06-09 cs.LG 版本更新

Benchmark Datasets for Lead-Lag Forecasting on Social Platforms

社交平台领先滞后预测的基准数据集

Kimia Kazemian, Zhenzhen Liu, Yangfanyu Yang, Katie Luo, Shuhan Gu, Audrey Du, Xinyu Yang, Jack Jansons, Kilian Q. Weinberger, John Thickstun, Yian Yin, Sarah Dean

发表机构 * Cornell University(康奈尔大学) Stanford University(斯坦福大学) Boston University(波士顿大学)

AI总结 本文提出领先滞后预测(LLF)问题,并发布arXiv和GitHub两个大规模基准数据集,通过统计检验验证领先滞后动态,为社交平台时间序列预测提供标准化测试平台。

Comments 11 pages, 8 figures, includes supplementary material (6 pages, 5 figures). Accepted at ACM SIGKDD 2026 (KDD '26). Code and data: https://lead-lag-forecasting.github.io

详情
AI中文摘要

社交和协作平台产生多变量时间序列轨迹,其中早期交互(如浏览、点赞或下载)之后,有时数月或数年后,会出现更高影响力的结果(如引用、销售或评论)。我们将此设定形式化为领先滞后预测(LLF):给定一个早期使用通道(领先),预测一个相关但时间上偏移的结果通道(滞后)。尽管这种模式普遍存在,但LLF尚未被时间序列社区视为统一的预测问题,主要原因是缺乏标准化数据集。为了锚定LLF研究,本文提出了两个大规模基准数据集:arXiv(访问量 -> 230万篇论文的引用量)和GitHub(推送/星标 -> 300万个仓库的复刻量)。我们的数据集通过捕捉跨年的长期动态、涵盖完整的结果谱以及避免采样中的生存偏差,为领先滞后预测提供了理想的测试平台。我们记录了数据整理和清洗的所有技术细节,通过统计和分类测试验证了领先滞后动态的存在,并基准测试了参数化和非参数化回归基线。我们的研究将LLF确立为一种新的预测范式,并为其在社交和使用数据中的系统探索奠定了实证基础。

英文摘要

Social and collaborative platforms emit multivariate time-series traces in which early interactions -- such as views, likes, or downloads -- are followed, sometimes months or years later, by higher-impact outcomes such as citations, sales, or reviews. We formalize this setting as Lead-Lag Forecasting (LLF): given an early usage channel (the lead), predict a correlated but temporally shifted outcome channel (the lag). Despite the ubiquity of such patterns, LLF has not been treated as a unified forecasting problem within the time-series community, largely due to the absence of standardised datasets. To anchor research in LLF, here we present two high-volume benchmark datasets: arXiv (accesses -> citations of 2.3M papers) and GitHub (pushes/stars -> forks of 3M repositories). Our datasets provide ideal testbeds for lead-lag forecasting, by capturing long-horizon dynamics across years, spanning the full spectrum of outcomes, and avoiding survivorship bias in sampling. We documented all technical details of data curation and cleaning, verified the presence of lead-lag dynamics through statistical and classification tests, and benchmarked parametric and non-parametric baselines for regression. Our study establishes LLF as a novel forecasting paradigm and lays an empirical foundation for its systematic exploration in social and usage data.

2601.04498 2026-06-09 cs.LG cs.CV 版本更新

IGenBench: Benchmarking the Reliability of Text-to-Infographic Generation

IGenBench:文本到信息图生成可靠性基准测试

Yinghao Tang, Xueding Liu, Boyuan Zhang, Tingfeng Lan, Yupeng Xie, Jiale Lao, Yiyao Wang, Haoxuan Li, Tingting Gao, Bo Pan, Luoxuan Weng, Xiuqi Huang, Minfeng Zhu, Yingchaojie Feng, Yuyu Luo, Wei Chen

发表机构 * State Key Lab of CAD&CG, Zhejiang University(浙江大学CAD与CG国家重点实验室) UESTC University of Virginia(弗吉尼亚大学) HKUST(GZ)(香港科技大学(广州)) Cornell University(康奈尔大学) Zhejiang University(浙江大学) National University of Singapore(新加坡国立大学)

AI总结 提出IGENBENCH基准,包含30种信息图类型和600个测试用例,通过多模态大语言模型分解为10类原子问题评估10种T2I模型,发现数据完整性等维度是普遍瓶颈。

详情
AI中文摘要

信息图是结合数据可视化与文本和插图元素的复合视觉制品,用于传达信息。虽然最近的文本到图像(T2I)模型可以生成美观的图像,但它们在生成信息图方面的可靠性仍不清楚。生成的信息图可能乍看正确,但包含容易被忽视的问题,例如扭曲的数据编码或错误的文本内容。我们提出了IGENBENCH,这是第一个评估文本到信息图生成可靠性的基准,包含跨越30种信息图类型的600个精心设计的测试用例。我们设计了一个自动评估框架,将可靠性验证分解为基于10种问题类型的原子是否问题。我们使用多模态大语言模型(MLLM)验证每个问题,得到问题级准确率(Q-ACC)和信息图级准确率(I-ACC)。我们在IGENBENCH上全面评估了10个最先进的T2I模型。我们的系统分析揭示了未来模型开发的关键见解:(i)三级性能层次,顶级模型的Q-ACC为0.90,但I-ACC仅为0.49;(ii)数据相关维度成为普遍瓶颈(例如,数据完整性:0.21);(iii)所有模型实现端到端正确性的挑战。我们在https://igen-bench.vercel.app/发布IGENBENCH。

英文摘要

Infographics are composite visual artifacts that combine data visualizations with textual and illustrative elements to communicate information. While recent text-to-image (T2I) models can generate aesthetically appealing images, their reliability in generating infographics remains unclear. Generated infographics may appear correct at first glance but contain easily overlooked issues, such as distorted data encoding or incorrect textual content. We present IGENBENCH, the first benchmark for evaluating the reliability of text-to-infographic generation, comprising 600 curated test cases spanning 30 infographic types. We design an automated evaluation framework that decomposes reliability verification into atomic yes/no questions based on a taxonomy of 10 question types. We employ multimodal large language models (MLLMs) to verify each question, yielding question-level accuracy (Q-ACC) and infographic-level accuracy (I-ACC). We comprehensively evaluate 10 state-of-the-art T2I models on IGENBENCH. Our systematic analysis reveals key insights for future model development: (i) a three-tier performance hierarchy with the top model achieving Q-ACC of 0.90 but I-ACC of only 0.49; (ii) data-related dimensions emerging as universal bottlenecks (e.g., Data Completeness: 0.21); and (iii) the challenge of achieving end-to-end correctness across all models. We release IGENBENCH at https://igen-bench.vercel.app/.
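The gap between Q-ACC and I-ACC reported above is easier to see with a small worked example. The sketch below assumes that an infographic counts as correct only when every one of its yes/no questions is verified; the abstract does not spell out this aggregation rule, so treat it as an assumption, along with the hypothetical function name and the made-up verdicts.

```python
def q_acc_and_i_acc(results):
    """results maps an infographic id to a list of per-question booleans.

    Q-ACC: fraction of all questions verified as correct.
    I-ACC: fraction of infographics whose questions are all correct
           (assumed aggregation rule).
    """
    total_questions = sum(len(answers) for answers in results.values())
    if total_questions == 0:
        raise ValueError("no questions to score")
    correct_questions = sum(sum(answers) for answers in results.values())
    fully_correct = sum(1 for answers in results.values() if answers and all(answers))
    return correct_questions / total_questions, fully_correct / len(results)


verdicts = {
    "chart_a": [True, True],
    "chart_b": [True, False],
    "chart_c": [True, True, True, False],
}
q_acc, i_acc = q_acc_and_i_acc(verdicts)
```

Here six of eight questions pass, yet only one of three infographics is fully correct, which mirrors how a high Q-ACC can coexist with a much lower I-ACC.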

2601.06649 2026-06-09 cs.LG cs.AI 版本更新

Revisiting Training Scale: An Empirical Study of Token Count, Power Consumption, and Parameter Efficiency

重新审视训练规模:关于令牌计数、功耗和参数效率的实证研究

Joe Dwyer

发表机构 * ECPI University(ECPI大学)

AI总结 通过固定硬件和训练条件的重复测量实验,发现增加训练令牌数会导致训练效率严格单调下降,即使性能有边际提升,也表明能耗效率低下。

详情
AI中文摘要

机器学习研究质疑了训练令牌数的增加是否能在大型语言模型中可靠地产生比例性能提升。基于先前引入能量感知参数效率度量的工作,本研究实证检验了在固定硬件和训练条件下增加训练令牌数的影响。本工作的重要性在于将功耗和执行时长(如功率采样频率所反映的)明确整合到令牌规模分析中,这解决了先前研究强调性能结果而低估计算和能量成本的空白。通过在恒定GPU实例上使用相同模型架构、优化器设置和轮次数的重复测量实验设计,训练了一个11亿参数的TinyLlama模型,使用三个令牌数(500K、1M和2M)。虽然传统性能指标在令牌规模上表现出不一致或递减的回报,但包含功耗和执行时长后,揭示了随着令牌数增加,训练效率严格单调下降。重复测量方差分析表明令牌数对参数效率有强效应,所有配对比较在Bonferroni校正后仍然显著。这些发现表明,即使观察到边际性能提升,增加训练令牌数可能在能量上效率低下,强调了在大型语言模型训练中效率感知评估的重要性。

英文摘要

Research in machine learning has questioned whether increases in training token counts reliably produce proportional performance gains in large language models. Building on prior work introducing an energy-aware parameter efficiency metric, this study empirically examines the effects of increasing training token counts under fixed hardware and training conditions. The significance of this work lies in the explicit integration of power consumption and execution duration, as reflected by the power sampling frequency, into token-scale analysis. This addresses a gap in prior studies emphasizing performance outcomes while underrepresenting computational and energy costs. Using a repeated-measures experimental design on a constant GPU instance with an identical model architecture, optimizer settings, and epoch counts, a 1.1-billion-parameter TinyLlama model was trained at three token counts (500K, 1M, and 2M). While conventional performance metrics exhibited inconsistent or diminishing returns across token scales, the inclusion of power consumption and execution duration revealed a strictly monotonic decline in training efficiency as token count increased. Repeated-measures ANOVA demonstrated a strong effect of token count on parameter efficiency, with all pairwise comparisons remaining significant following Bonferroni correction. These findings indicate that increases in training token counts may be energetically inefficient even when marginal performance improvements are observed, underscoring the importance of efficiency-aware evaluation in large language model training.

2601.21816 2026-06-09 cs.LG 版本更新

Nonparametric LLM Evaluation from Preference Data

基于偏好数据的非参数化LLM评估

Dennis Frauen, Athiya Deviyani, Mihaela van der Schaar, Stefan Feuerriegel

发表机构 * GitHub

AI总结 提出非参数统计框架DMLRank,通过去偏机器学习从偏好数据中比较和排名大语言模型,引入广义平均排名分数,具有统计高效、兼容黑箱方法、结合预训练评估器和优化数据收集策略等优势。

Comments Accepted at ICML 2026

详情
AI中文摘要

从人类偏好数据中评估大语言模型(LLM)的性能对于获得LLM排行榜至关重要。然而,许多现有方法要么依赖限制性的参数假设,要么在使用灵活的机器学习方法时缺乏有效的不确定性量化。在本文中,我们提出了一种非参数统计框架DMLRank,用于通过去偏机器学习(DML)从偏好数据中比较和排名LLM。为此,我们引入了广义平均排名分数(GARS),它推广了常用的排名模型,包括Bradley-Terry模型或PageRank/排名中心性,并处理了诸如平局等复杂的人类响应。DMLRank具有以下优势:(i)它产生GARS排名分数的统计高效估计。(ii)它自然允许结合黑箱机器学习方法进行估计。(iii)它可以与预训练的LLM评估器(例如,使用LLM-as-a-judge)结合使用。(iv)它建议在预算约束下收集偏好数据的最优策略。我们在理论和实证上,使用合成和真实世界的偏好数据集展示了这些优势。总之,我们的框架为从业者提供了强大的、最先进的方法,用于比较或排名LLM以构建排行榜。

英文摘要

Evaluating the performance of large language models (LLMs) from human preference data is crucial for obtaining LLM leaderboards. However, many existing approaches either rely on restrictive parametric assumptions or lack valid uncertainty quantification when flexible machine learning methods are used. In this paper, we propose a nonparametric statistical framework, called DMLRank, for comparing and ranking LLMs from preference data using debiased machine learning (DML). For this, we introduce generalized average ranking scores (GARS), which generalize commonly used ranking models, including the Bradley-Terry model or PageRank/Rank centrality, and handle complex human responses such as ties. DMLRank comes with the following advantages: (i) It produces statistically efficient estimates of GARS ranking scores. (ii) It naturally allows the incorporation of black-box machine learning methods for estimation. (iii) It can be combined with pre-trained LLM evaluators (e.g., using LLM-as-a-judge). (iv) It suggests optimal policies for collecting preference data under budget constraints. We demonstrate these advantages both theoretically and empirically using synthetic and real-world preference datasets. In summary, our framework provides practitioners with powerful, state-of-the-art methods for comparing or ranking LLMs for leaderboards.
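For orientation, the Bradley-Terry model named in the abstract can be fit from pairwise win counts with the classical minorization-maximization (MM) update. The sketch below shows only that parametric baseline, not DMLRank or GARS; the function name `bradley_terry` and the win counts are hypothetical, and it assumes every model has at least one win and one loss.

```python
def bradley_terry(wins, iters=200):
    """Fit Bradley-Terry strengths with the classical MM update.

    wins[i][j] is the number of comparisons in which model i beat model j.
    Returns strengths normalized to sum to 1.
    """
    n = len(wins)
    p = [1.0 / n] * n
    for _ in range(iters):
        updated = []
        for i in range(n):
            total_wins = sum(wins[i][j] for j in range(n) if j != i)
            # Each opponent j contributes (comparisons between i and j) / (p_i + p_j).
            denom = sum(
                (wins[i][j] + wins[j][i]) / (p[i] + p[j])
                for j in range(n) if j != i
            )
            updated.append(total_wins / denom if denom > 0 else p[i])
        total = sum(updated)
        p = [x / total for x in updated]
    return p


# Hypothetical win counts among three models (10 comparisons per pair).
wins = [[0, 8, 9], [2, 0, 7], [1, 3, 0]]
scores = bradley_terry(wins)
ranking = sorted(range(len(scores)), key=lambda i: -scores[i])
print(ranking, [round(s, 3) for s in scores])
```

Ties and uncertainty quantification, which the abstract highlights, are not handled by this baseline.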

2602.15327 2026-06-09 cs.LG cs.AI cs.CL stat.ML 版本更新

Prescriptive Scaling Reveals the Evolution of Language Model Capabilities

规范性缩放揭示语言模型能力的演变

Hanlin Zhang, Jikai Jin, Vasilis Syrgkanis, Sham Kakade

发表机构 * Harvard University(哈佛大学) Stanford University(斯坦福大学)

AI总结 通过大规模观测评估和分位数回归,提出规范性缩放定律,将预训练计算预算映射到下游准确率,并验证其时间稳定性,引入平衡I-最优采样算法降低评估成本。

Comments ICML 2026 Oral. Blog Post: https://jkjin.com/prescriptive-scaling

详情
AI中文摘要

机器学习模型性能的提升往往源于竞争和应用。针对部署,我们考虑规范性缩放定律:给定预训练计算预算,通过当代后训练实践可获得的下游准确率是多少,以及随着领域发展该映射的稳定性如何?我们使用大规模观测评估,涵盖2022-2026年间六个基准测试的5000个现有和2000个新评估的模型检查点,通过带有单调饱和S型参数化的平滑分位数回归,估计能力边界(即基准分数作为对数预训练FLOPs函数的高条件分位数)。我们通过在早期模型代上拟合并在后续版本上评估来验证时间可靠性:在六个任务中的四个上,分布外覆盖误差低于2%,而数学推理能力边界随时间持续提升。例如,在预算为10^24 FLOPs时,IFEval上的估计可达准确率为0.83,MATH Lvl 5上为0.54。然后我们扩展方法以分析任务相关的饱和性,并探测数学推理任务中与污染相关的偏移。最后,我们引入一种平衡I-最优采样算法,该算法使用约20%的参数计数加权评估预算(某些任务低至5%)恢复接近全数据的前沿,同时保持可比的校准。总之,我们的工作发布了Proteus-2k(最新的模型性能评估数据集),并引入了一种实用方法,将计算预算转化为可靠的性能预期,并监测能力边界随时间的变化。

英文摘要

Machine learning model performance improvements tend to arise from competition and application. For deployment, we consider prescriptive scaling laws: given a pre-training compute budget, what downstream accuracy is attainable with contemporary post-training practice, and how stable is that mapping as the field evolves? Using large-scale observational evaluations with 5k existing and 2k newly evaluated model checkpoints spanning 2022-2026 across six benchmarks, we estimate capability boundaries, high conditional quantiles of benchmark scores as a function of log pre-training FLOPs, via smoothed quantile regression with a monotone, saturating sigmoid parameterization. We validate temporal reliability by fitting on earlier model generations and evaluating on later releases: across four of six tasks, the out-of-distribution coverage error remains below 2%, while math reasoning exhibits a consistently advancing boundary over time. For instance, at a budget of 10^24 FLOPs, the estimated attainable accuracies are 0.83 on IFEval and 0.54 on MATH Lvl 5. We then extend our approach to analyze task-dependent saturation and to probe contamination-related shifts on math reasoning tasks. Finally, we introduce a balanced I-optimal sampling algorithm that recovers near-full-data frontiers using roughly 20% of the parameter-count-weighted evaluation budget, as low as 5% on some tasks, while maintaining comparable calibration. Together, our work releases Proteus-2k, the latest model performance evaluation dataset, and introduces a practical methodology for translating compute budgets into reliable performance expectations and for monitoring when capability boundaries shift across time.
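The two ingredients named in the abstract, a monotone saturating sigmoid and a quantile-regression objective, can be sketched as follows. This is a minimal illustration under assumptions: the parameter names (`lo`, `hi`, `slope`, `midpoint`), the quantile level 0.98, and the data points are hypothetical, and the plain pinball loss is shown in place of the paper's smoothed variant.

```python
import math


def sigmoid_frontier(log_flops, lo, hi, slope, midpoint):
    """Monotone, saturating map from log-compute to accuracy (needs slope > 0, hi > lo)."""
    return lo + (hi - lo) / (1.0 + math.exp(-slope * (log_flops - midpoint)))


def pinball_loss(y, q, tau):
    """Quantile (pinball) loss of prediction q for outcome y at level tau."""
    diff = y - q
    return tau * diff if diff >= 0 else (tau - 1.0) * diff


def mean_pinball(points, params, tau=0.98):
    """Average pinball loss of the sigmoid frontier over (log_flops, score) pairs."""
    losses = [
        pinball_loss(score, sigmoid_frontier(x, *params), tau)
        for x, score in points
    ]
    return sum(losses) / len(losses)


# Hypothetical (log10 FLOPs, benchmark score) observations and parameters.
points = [(21.0, 0.22), (22.0, 0.35), (23.0, 0.51), (24.0, 0.62)]
params = (0.1, 0.9, 1.0, 23.0)  # lo, hi, slope, midpoint
print(mean_pinball(points, params))
```

With a high quantile level, points above the curve are penalized much more than points below it, which is what pushes the fitted curve toward the upper boundary of the scores.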

2602.19330 2026-06-09 cs.LG 版本更新

CTS-Bench: Benchmarking Graph Coarsening Trade-offs for GNNs in Clock Tree Synthesis

CTS-Bench: 面向时钟树综合中GNN的图粗化权衡基准测试

Barsat Khadka, Kawsher Roxy, Md Rubel Ahmed

发表机构 * The University of Southern Mississippi(南密西西比大学) Intel Corporation(英特尔公司) Louisiana Tech University(路易斯安那理工大学)

AI总结 提出CTS-Bench基准套件,系统评估图粗化对GNN在时钟树综合中预测精度与计算效率的权衡,发现粗化虽降低内存和加速训练,但会移除关键结构信息导致零样本评估下R²为负。

Comments Accepted to ML Bench'26 ASPLOS

详情
AI中文摘要

图神经网络(GNN)在电子设计自动化中的物理设计分析中越来越受到关注,特别是用于建模时钟树综合行为,如时钟偏斜和缓冲复杂性。然而,由于在原始门级网表上操作的内存和运行时间成本过高,实际部署仍然有限。图粗化通常用于提高可扩展性,但其对CTS关键学习目标的影响尚未得到充分表征。本文介绍了CTS-Bench,一个基准测试套件,用于系统评估基于GNN的CTS分析中图粗化、预测精度和计算效率之间的权衡。CTS-Bench包含跨越五个架构的4,860个收敛的物理设计解决方案,并提供来自布局后设计的配对原始门级和聚类图表示。以时钟偏斜预测作为代表性CTS任务,我们展示了明确的精度-效率权衡。虽然图粗化将GPU内存使用减少高达17.2倍,并将训练加速高达3倍,但它也移除了对建模时钟分布至关重要的结构信息,经常导致零样本评估下R²为负。我们的发现表明,即使全局物理指标保持不变,通用图聚类技术也可能从根本上损害CTS学习目标。CTS-Bench支持对CTS感知的图粗化策略进行原则性评估,支持在现实物理设计约束下对GNN架构和加速器进行基准测试,并为开发学习辅助的CTS分析和优化技术提供了基础。

英文摘要

Graph Neural Networks (GNNs) are increasingly explored for physical design analysis in Electronic Design Automation, particularly for modeling Clock Tree Synthesis behavior such as clock skew and buffering complexity. However, practical deployment remains limited due to the prohibitive memory and runtime cost of operating on raw gate-level netlists. Graph coarsening is commonly used to improve scalability, yet its impact on CTS-critical learning objectives is not well characterized. This paper introduces CTS-Bench, a benchmark suite for systematically evaluating the trade-offs between graph coarsening, prediction accuracy, and computational efficiency in GNN-based CTS analysis. CTS-Bench consists of 4,860 converged physical design solutions spanning five architectures and provides paired raw gate-level and clustered graph representations derived from post-placement designs. Using clock skew prediction as a representative CTS task, we demonstrate a clear accuracy-efficiency trade-off. While graph coarsening reduces GPU memory usage by up to 17.2x and accelerates training by up to 3x, it also removes structural information essential for modeling clock distribution, frequently resulting in negative $R^2$ scores under zero-shot evaluation. Our findings indicate that generic graph clustering techniques can fundamentally compromise CTS learning objectives, even when global physical metrics remain unchanged. CTS-Bench enables principled evaluation of CTS-aware graph coarsening strategies, supports benchmarking of GNN architectures and accelerators under realistic physical design constraints, and provides a foundation for developing learning-assisted CTS analysis and optimization techniques.
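A negative $R^2$, as reported above for zero-shot evaluation, means the predictions are worse than always predicting the mean of the targets. The small sketch below uses the standard definition $R^2 = 1 - SS_{res}/SS_{tot}$; the skew values are hypothetical and not taken from CTS-Bench.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical clock-skew targets (arbitrary units).
skew_true = [1.0, 2.0, 3.0]

print(r2_score(skew_true, [1.0, 2.0, 3.0]))  # perfect predictions
print(r2_score(skew_true, [2.0, 2.0, 2.0]))  # always predicting the mean
print(r2_score(skew_true, [3.0, 2.0, 1.0]))  # anti-correlated predictions
```

The third call is the case of interest: the residual sum of squares exceeds the total sum of squares, so the score drops below zero.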

2603.13431 2026-06-09 cs.LG cs.AI 版本更新

CHIMERA-Bench: A Benchmark Dataset for Epitope-Specific Antibody Design

CHIMERA-Bench:一种针对表位特异性抗体设计的基准数据集

Mansoor Ahmed, Nadeem Taj, Imdad Ullah Khan, Hemanth Venkateswara, Murray Patterson

发表机构 * Georgia State University(佐治亚州立大学) Georgia Institute of Technology(佐治亚理工学院) University of Engineering and Technology(工程与技术大学) Lahore University of Management Sciences(拉合尔管理科学大学)

AI总结 本文提出CHIMERA-Bench,一个统一的抗体设计基准,包含2922个抗原-抗体复合物数据,测试泛化能力,并评估多种生成方法的通用性。

详情
AI中文摘要

计算抗体设计在过去三年中取得了快速的方法进展,提出了数十种深度生成方法,但该领域缺乏标准化的基准用于公平比较和模型开发。这些方法在不同的SAbDab快照、非重叠测试集和不兼容的指标上进行评估,文献将设计问题分解为多个子任务,没有共同定义。我们引入CHIMERA-Bench(CDR Modeling with Epitope-guided Redesign,即CDR建模与表位引导的重设计),这是一个围绕单一经典任务构建的统一基准:表位条件下的CDR序列-结构共设计。CHIMERA-Bench提供三个组成部分。第一个是一个经过精心挑选、去重的包含2922个抗体-抗原复合物的数据集,带有表位和抗原结合位点注释。第二个是一组三种具有生物学动机的划分,测试泛化到未见表位、未见抗原折叠和前瞻性时间目标的能力。第三个是全面的评估协议,包括五个指标组,其中包含新的表位特异性度量。我们基准测试了十一种方法,涵盖六个生成范式,并在所有划分上报告结果。CHIMERA-Bench是该抗体设计问题中同类最大的数据集,允许社区开发和测试新方法,并评估其泛化能力。

英文摘要

Computational antibody design has seen rapid methodological progress, with dozens of deep generative methods proposed in the past three years, yet the field lacks a standardized benchmark for fair comparison and model development. These methods are evaluated on different SAbDab snapshots, non-overlapping test sets, and incompatible metrics, and the literature fragments the design problem into numerous sub-tasks with no common definition. We introduce CHIMERA-Bench (CDR Modeling with Epitope-guided Redesign), a unified benchmark built around a single canonical task: epitope-conditioned CDR sequence-structure co-design. CHIMERA-Bench provides three components. The first is a curated, deduplicated dataset of 2,922 antibody-antigen complexes with epitope and paratope annotations. The second is a set of three biologically motivated splits that test generalization to unseen epitopes, unseen antigen folds, and prospective temporal targets. The third is a comprehensive evaluation protocol with five metric groups, including novel epitope-specificity measures. We benchmark eleven methods spanning six generative paradigms and report results across all splits. CHIMERA-Bench is the largest dataset of its kind for the antibody design problem, allowing the community to develop and test novel methods and evaluate their generalizability.

2604.26498 2026-06-09 cs.LG q-bio.QM 版本更新

Do Larger Models Really Win in Drug Discovery? A Benchmark Assessment of Model Scaling in AI-Driven Molecular Property and Activity Prediction

大模型真的在药物发现中胜出吗?AI驱动的分子性质和活性预测中模型规模的基准评估

Jinjiang Guo, Sheng Ding

发表机构 * Global Health Drug Discovery Institute(全球健康药物发现研究所) School of Pharmaceutical Sciences(药学院)

AI总结 本文通过26个ADME、毒性及生物活性端点评估,发现传统机器学习在多数任务中表现最佳,大模型在部分困难分割中竞争力有限,模型性能依赖于任务与验证场景的适配性,而非单纯规模。

Comments Improved benchmark design and reproducibility, replaced restricted datasets with public benchmarks in primary analyses, and added sensitivity analyses supporting the interpretation of model scaling and evaluation protocol effects in molecular prediction

详情
AI中文摘要

分子基础模型和大语言模型的快速发展促使人们以规模为中心看待AI在药物发现中的应用,认为更大的预训练模型将取代紧凑的化学信息学模型。我们测试了这一假设,涵盖26个ADME、毒性及生物活性端点,共165,541个端点级别化合物标签记录。基准测试包含78个端点和分割条目,通过随机、Murcko骨架和结构分离的5折交叉验证协议评估,代表递增的化学泛化难度。在156个任务和指标比较中,传统机器学习(ML)提供了最大的最佳表现份额(47.4%),其次是预训练分子序列模型(28.8%)、图神经网络(21.8%)和基于LLM的SAR基线(1.9%)。传统ML在随机分割插值中占优,并总体上是最大的胜利家族。GNN和序列模型在部分更难的分割中具有竞争力,但其严格胜利份额在固定最终窗口读取下减少,表明对训练设置和模型选择的敏感性。配对Bootstrap分析显示,模型间的小数值差异不应被视为决定性胜利。训练折叠中的SAR知识提高了GPT5.5-SAR和Opus4.7-SAR指标,但并未使基于规则的推理成为监督预测器的通用替代品。紧凑的专业模型仍高度有效,预测性能取决于模型、任务和验证场景之间的适配性,而非规模本身。

英文摘要

The rapid growth of molecular foundation models and large language models (LLMs) has encouraged a scale centred view of AI in drug discovery, in which larger pretrained models are expected to supersede compact cheminformatics models. We test this assumption across 26 ADME, toxicity and bioactivity endpoints, covering 165,541 endpoint level compound label records. The benchmark contains 78 endpoint and split entries evaluated under random, Murcko scaffold and structure separated 5-fold cross validation protocols, representing increasing chemical generalization difficulty. Across 156 task and metric comparisons, classical machine learning (ML) provides the largest share of best performing entries (47.4%), followed by pretrained molecular sequence models (28.8%), graph neural networks (21.8%) and LLM based SAR baselines (1.9%). Classical ML dominates random split interpolation and remains the largest winner family overall. GNN and sequence models are competitive in selected harder splits, but their strict winner shares decrease under a fixed final-window readout, indicating sensitivity to training settings and model selection. Paired bootstrap analyses show that small numerical differences between individual models should not be read as decisive victories. SAR knowledge from training folds improves GPT5.5-SAR and Opus4.7-SAR metrics but does not make rule based reasoning a universal substitute for supervised predictors. Compact specialized models remain highly effective, and predictive performance depends on the fit among model, task and validation scenario, not on scale alone.
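The caution about small numerical differences can be made concrete with a paired bootstrap over per-fold scores. The sketch below is a generic percentile bootstrap, not the authors' exact procedure; the function name, the fold scores, and the settings (`n_boot`, `alpha`, `seed`) are hypothetical.

```python
import random


def paired_bootstrap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired difference (A minus B)."""
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample the paired differences with replacement.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[int((1 - alpha / 2) * n_boot) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-fold scores for two models on the same five folds.
model_a = [0.80, 0.70, 0.75, 0.72, 0.78]
model_b = [0.79, 0.72, 0.74, 0.73, 0.77]
mean_diff, lower, upper = paired_bootstrap_ci(model_a, model_b)
print(mean_diff, lower, upper)
```

When the interval straddles zero, as is expected for near-tied scores like these, the apparent winner should not be read as a decisive victory.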

2605.23595 2026-06-09 cs.LG cs.AI cs.CV cs.ET cs.PF 版本更新

Learning to Evaluate: Cost-Effective Model Evaluation on Unlabeled Data with Meta-Learning

学习评估:基于元学习在无标签数据上进行高性价比的模型评估

Trinh Pham, Viet Huynh, Hongzhi Yin, Quoc Viet Hung Nguyen, Thanh Tam Nguyen

发表机构 * Griffith University(格里菲斯大学) Edith Cowan University(埃迪斯科文大学) The University of Queensland(昆士兰大学)

AI总结 提出MetaEvaluator,一种基于元学习的模型无关框架,通过参考模型池实现无标签数据上的快速、准确且成本效益高的新模型评估。

Comments Accepted by KDD 2026

详情
AI中文摘要

机器学习的快速发展产生了不断扩展的模型生态系统,使得在未见过的未标记数据上验证新发布模型的可靠性变得越来越具有挑战性。传统的评估流程依赖于昂贵的标注、重复的微调或无法跨模型家族迁移的狭窄假设。我们提出了MetaEvaluator,一个成本效益高、模型无关的框架,用于快速、无标签地评估跨不同架构和模态的未见模型。MetaEvaluator利用参考模型池上的元学习来获得可迁移的初始化,从而能够准确评估新模型,同时将成本分摊到整个池中,并消除了每个模型重新训练的需要。据我们所知,这是第一个能够在完全未标记数据集上评估新模型的模型无关框架。大量实验表明,与传统方法相比,MetaEvaluator以显著降低的成本产生稳定且准确的性能估计,使得在未标记数据上对新出现的模型进行可扩展的基准测试变得实用。

英文摘要

The rapid advancement of machine learning has led to an unprecedented expansion of model ecosystems, making it increasingly difficult to assess the reliability of newly released models on unseen and unlabeled data. Existing evaluation pipelines typically rely on costly annotation, repeated fine-tuning, or assumptions that do not generalize well to new models. We introduce MetaEvaluator, a cost-effective, model-agnostic framework for fast, label-free evaluation of unseen models across diverse architectures and modalities. MetaEvaluator meta-learns over a pool of reference models to acquire an effective initialization for accurate assessment of unseen models, thereby amortizing evaluation cost and eliminating the need for per-model retraining. To the best of our knowledge, this is the first model-agnostic framework that evaluates new models on unlabeled datasets. Extensive experiments demonstrate that MetaEvaluator delivers stable and accurate performance estimates at substantially lower cost than conventional approaches, enabling scalable benchmarking on unlabeled datasets for emerging models. The code is available at: https://github.com/phkhanhtrinh23/MetaEvaluator.

2605.30184 2026-06-09 cs.LG physics.ao-ph 版本更新

Can AI Weather Models Predict Beyond Two Weeks? A Quantitative Benchmark and Analysis of Long Rollouts

AI天气模型能否预测两周以上?长期推演的定量基准与分析

Fanny Lehmann, Firat Ozdemir, Yun Cheng, Torsten Hoefler, Sebastian Schemm, Benedikt Soja, Siddhartha Mishra

发表机构 * ETH AI Center(ETH人工智能中心) ETH Zurich(苏黎世联邦理工学院) Swiss Data Science Center(瑞士数据科学中心) Scalable Parallel Computing Lab(可扩展并行计算实验室) Dep. of Applied Mathematics and Theoretical Physics(应用数学与理论物理系) University of Cambridge(剑桥大学) Institute of Geodesy and Photogrammetry(大地测量与摄影测量研究所) Seminar for Applied Mathematics(应用数学研讨会)

AI总结 通过九种AI天气模型的一年推演,将长期不稳定性分类为爆发、漂移和季节性丧失三种模式,并发现稳定性取决于对小时空尺度的处理。

详情
AI中文摘要

虽然AI天气模型在短期到中期预报(最多15天)中表现出色,但在更长时间推演时经常出现定义不清的“不稳定性”。本文通过九种最先进的AI天气模型的一年推演,将这些失败形式化为三种不同的模式:爆发、漂移和季节性丧失。我们的分析表明,稳定性取决于对小时空尺度的处理:不稳定的模型放大高频能量,而稳定的模型在输入中添加噪声时起到去噪作用。我们的发现远未将这些模型简化为随机鹦鹉,而是强调稳定模型根据初始状态生成独特的天气轨迹。我们通过对架构设计选择的消融研究验证了我们的发现,这些研究使用了最先进的Vision Transformer(ViT)AI天气模型架构。

英文摘要

While AI weather models excel at short-to-medium range forecasts (up to 15 days), they frequently suffer from ill-defined "instabilities" when rolled out over longer horizons. This work addresses the lack of a formal taxonomy by categorizing these failures into three distinct regimes: blow-up, drift, and loss of seasonality, through year-long rollouts of nine state-of-the-art AI weather models. Our analysis reveals that stability hinges on the treatment of small spatio-temporal scales: unstable models amplify high-frequency energy, while stable models act as denoisers when noise is added to their inputs. Far from reducing these models to mere stochastic parrots, our findings highlight that stable models generate unique weather trajectories, conditioned on the initial state. We verify our findings through ablation studies on architectural design choices, conducted using state-of-the-art Vision Transformer (ViT) AI weather model architectures.

2606.05441 2026-06-09 cs.LG cs.AI stat.ML 版本更新

GOTabPFN: From Feature Ordering to Compact Tokenization for Tabular Foundation Models on High-Dimensional Data

GOTabPFN: 从特征排序到高维表格基础模型的紧凑分词化

Al Zadid Sultan Bin Habib, Md Younus Ahamed, Prashnna Kumar Gyawali, Gianfranco Doretto, Donald A. Adjeroh

发表机构 * University of Cambridge(剑桥大学)

AI总结 针对高维小样本表格预测问题,提出GOTabPFN模型,通过图引导排序和神经启发子单元压缩实现紧凑表示,提升TabPFN在严格token预算下的稳定性和准确性。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026). Code and resources: GitHub: https://github.com/zadid6pretam/GOTabPFN PyPI: https://pypi.org/project/gotabpfn Project webpage: https://www.zadidhabib.com/gotabpfn.html Hugging Face ZeroGPU: https://huggingface.co/spaces/zadid6pretam/GOTabPFN CPU backup: https://huggingface.co/spaces/zadid6pretam/GOTabPFN_CPU

详情
AI中文摘要

我们研究了如何在不重新训练大型骨干网络的情况下,使小型表格基础模型对高维小样本(HDLSS)表格预测有效。我们引入了带局部细化的图引导排序(GO-LR),证明了其与加权最小线性排列的等价性,并将实际求解器解释为TSP路径式替代方案。我们提出了基于GO-LR的GOTabPFN,以及一个神经启发子单元压缩(NSC)单元,将局部相邻的排序特征池化为元特征,从而生成紧凑表示,使TabPFN风格的预测在HDLSS场景中变得实用。在多个表格基准测试中,GOTabPFN在严格的token预算下提高了稳定性和准确性。

英文摘要

We investigate how to make small tabular foundation models effective for High-Dimensional, Low-Sample Size (HDLSS) tabular prediction without retraining large backbones. We introduce Graph-guided Ordering with Local Refinement (GO-LR), show its equivalence to weighted Minimum Linear Arrangement, and interpret the practical solver as a TSP-path-style surrogate. We propose GOTabPFN, which builds on GO-LR, and a Neuro-Inspired Subunit Compression (NSC) unit to pool locally adjacent ordered features into meta-features, yielding a compact representation that makes TabPFN-style prediction practical in HDLSS regimes. Across tabular benchmarks, GOTabPFN improves stability and accuracy under tight token budgets.

2606.07379 2026-06-09 cs.LG cs.AI cs.CL stat.ME 版本更新

Do Coding Agents Deceive Us? Detecting and Preventing Cheating via Capped Evaluation with Randomized Tests

编码智能体会欺骗我们吗?通过带随机测试的上限评估检测和防止作弊

Thanawat Lodkaew, Johannes Ackermann, Soichiro Nishimori, Nontawat Charoenphakdee, Masashi Sugiyama, Takashi Ishida

发表机构 * The University of Tokyo(东京大学) RIKEN(理化学研究所)

AI总结 提出CapCode框架,通过设置上限评估检测模型在编码任务中的作弊行为,并设计CapReward奖励机制防止作弊,实验表明该方法能有效检测和减少作弊。

详情
AI中文摘要

在智能体评估和训练中,一个日益增长的失败模式是模型可以通过利用捷径而非解决预期任务来获得高评估分数,产生欺骗性表现。这使得评估分数作为真实任务解决能力的度量不可靠。我们提出CapCode,一个构建带有随机测试的编码数据集的框架,其最佳可达的非作弊性能被故意限制在1以下。这种上限性能设计赋予评估分数更清晰的解释:显著高于上限的分数是不可信的,因此提供了作弊的证据。为了防止作弊,我们提出CapReward,一种基于CapCode原则的奖励设计,以抑制超出上限的优化。跨多个数据集的实验表明,CapCode能够检测作弊同时保持模型的性能排名,CapReward减少了作弊行为,产生了更好地遵循预期任务规范的模型。

英文摘要

A growing failure mode in agent evaluation and training is that models can achieve high evaluation scores by exploiting shortcuts instead of solving the intended task, producing deceptive performance. This makes evaluation scores unreliable as measures of true task-solving ability. We propose CapCode, a framework for constructing coding datasets with randomized tests whose best achievable non-cheating performance is deliberately capped below one. This capped-performance design gives evaluation scores a clearer interpretation: scores substantially above the cap are implausible and therefore provide evidence of cheating. To prevent cheating, we propose CapReward, a reward design based on the CapCode principle to discourage optimization beyond the cap. Experiments across multiple datasets show that CapCode detects cheating while preserving performance ranking of models, and CapReward reduces cheating behavior, yielding models that better follow the intended task specification.
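One way to read "scores substantially above the cap are implausible" is as a one-sided binomial tail check. The sketch below is an illustration under strong assumptions, not the CapCode procedure: it treats each of `total` tasks as an independent pass/fail trial whose honest pass probability is at most `cap`, and the function names and the 0.01 level are hypothetical.

```python
from math import comb


def binomial_tail(n, k, p):
    """P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def exceeds_cap(passed, total, cap, level=0.01):
    """Flag a score as implausible if it is this far above the cap by chance
    with probability below `level` (independent-trial assumption)."""
    return binomial_tail(total, passed, cap) < level


# Hypothetical benchmark of 100 tasks with a non-cheating cap of 0.5.
print(exceeds_cap(52, 100, 0.5))  # slightly above the cap
print(exceeds_cap(70, 100, 0.5))  # far above the cap
```

A score just above the cap is compatible with sampling noise, while a score far above it is hard to explain without some shortcut.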

2009.10277 2026-06-09 cs.CL cs.LG cs.SI 版本更新

Measuring a hate speech spectrum with faceted Rasch item response theory and perspective-aware, explainable-by-design deep learning

使用分面Rasch项目反应理论与视角感知、设计即可解释的深度学习测量仇恨言论谱系

Chris J. Kennedy, Geoff Bacon, Alexander Sahn, Claudia von Vacano

发表机构 * Center for Precision Psychiatry, Mass General Hospital Department of Psychiatry, Harvard Medical School(精准精神病学中心,麻省总医院精神病科,哈佛医学院) D-Lab University of California, Berkeley(加州大学伯克利分校D实验室)

AI总结 提出结合监督深度学习与分面Rasch项目反应理论的方法,将仇恨言论分解为10个有序标签,通过IRT模型转化为区间测量值并调整标注者视角,在RoBERTa模型上提升准确性,实现连续谱系测量与可解释性。

Comments 7 pages, 6 figures

详情
AI中文摘要

我们提出一个系统,通过结合监督深度学习与分面Rasch项目反应理论(IRT),在从种族灭绝言论到支持性言论的连续区间值谱系上测量仇恨言论。我们将仇恨言论的理论构念分解为10个有序标签的操作化构成概念。这些标签通过IRT概率潜在模型重构为区间结果测量,同时估计并调整每个标注者的标注视角。我们的标度程序自然地与用于自动预测的多任务深度学习架构集成,允许通过那些组件对连续分数进行基于设计的可解释性。我们将此方法应用于一个新的开源数据集,该数据集包含来自YouTube、Twitter和Reddit的50,070条社交媒体评论,由11,143名美国亚马逊土耳其机器人工作者进行标注和标记。我们的基于RoBERTa的模型相比替代方法显示出改进的准确性。该系统为监督NLP提供了一种新范式,鼓励连续而非二元的构念,以及基于设计的标注者视角和模型可解释性的整合。

英文摘要

We propose a system for measuring hate speech on a continuous, interval-valued spectrum ranging from genocidal to supportive speech by combining supervised deep learning with faceted Rasch item response theory (IRT). We decompose the theoretical construct of hate speech into constituent concepts operationalized as 10 ordinal labels. Those labels are reconstituted via IRT probabilistic latent modeling into an interval outcome measure while simultaneously estimating and adjusting for each annotator's labeling perspective. Our scaling procedure integrates naturally with a multitask deep learning architecture for automated prediction, allowing design-based explainability of the continuous score through those components. We apply this method to a new, open source dataset of 50,070 social media comments sourced from YouTube, Twitter, and Reddit, annotated and labeled by 11,143 United States-based Amazon Mechanical Turk workers. Our RoBERTa-based model shows improved accuracy compared to alternative approaches. This system offers a new paradigm for supervised NLP that encourages continuous rather than binary constructs, and design-based incorporation of annotator perspective and model explainability.

2208.00778 2026-06-09 cs.DB cs.LG q-bio.QM 版本更新

SFILES 2.0: An extended text-based flowsheet representation

SFILES 2.0:一种扩展的基于文本的流程图表示

Gabriel Vogel, Edwin Hirtreiter, Lukas Schulze Balhorn, Artur M. Schweidtmann

发表机构 * Delft University of Technology (TU Delft), Department of Chemical Engineering(代尔夫特理工大学化工系), Van der Maasweg 9, 2629 HZ Delft, The Netherlands(荷兰代尔夫特)

AI总结 提出SFILES 2.0,通过扩展符号和命名约定解决原版无法明确描述关键配置和控制结构的问题,并开源实现流程图与字符串的自动转换,旨在推动化工流程图FAIR数据库建设。

详情
Journal ref Optimization and Engineering, Volume 24, pages 2911-2933 (2023)
AI中文摘要

SFILES是一种基于文本的化工流程图表示法。最初由d'Anterroches(通过基团贡献法进行流程生成与设计)提出,其灵感来自基于文本的分子SMILES表示法。与流程图图像相比,文本格式在存储格式、计算可访问性以及最终的数据分析和处理方面具有若干优势。然而,原始SFILES版本无法明确描述基本的流程图配置,例如塔顶和塔底产品的区分。它也无法描述化工过程安全可靠运行所需的控制结构。此外,目前没有公开可用的软件用于将化工过程拓扑结构编码或解码为SFILES。我们提出了SFILES 2.0,并完整描述了扩展符号和命名约定。此外,我们提供了开源软件,用于流程图图与SFILES 2.0字符串之间的自动转换。通过这种方式,我们希望鼓励研究人员和工程师以SFILES 2.0字符串的形式发布他们的流程图拓扑结构。最终目标是建立化工过程流程图FAIR数据库的标准,这对于未来的数据处理和分析将具有重要价值。

英文摘要

SFILES are a text-based notation for chemical process flowsheets. They were originally proposed by d'Anterroches (Process flow sheet generation & design through a group contribution approach) who was inspired by the text-based SMILES notation for molecules. The text-based format has several advantages compared to flowsheet images regarding the storage format, computational accessibility, and eventually for data analysis and processing. However, the original SFILES version cannot describe essential flowsheet configurations unambiguously, such as the distinction between top and bottom products. Neither is it capable of describing the control structure required for the safe and reliable operation of chemical processes. Also, there is no publicly available software for decoding or encoding chemical process topologies to SFILES. We propose the SFILES 2.0 with a complete description of the extended notation and naming conventions. Additionally, we provide open-source software for the automated conversion between flowsheet graphs and SFILES 2.0 strings. This way, we hope to encourage researchers and engineers to publish their flowsheet topologies as SFILES 2.0 strings. The ultimate goal is to set the standards for creating a FAIR database of chemical process flowsheets, which would be of great value for future data analysis and processing.

2506.20573 2026-06-09 stat.ML cs.LG 版本更新

LARP: Learner-Agnostic Robust Data Prefiltering

LARP: 学习者无关的鲁棒数据预过滤

Kristian Minchev, Dimitar I. Dimitrov, Nikola Konstantinov

发表机构 * INSAIT, Sofia University "St. Kliment Ohridski"(INSAIT,索菲亚大学‘圣克莱门特·奥赫里德斯基’)

AI总结 提出LARP框架,通过预过滤程序保护多种下游学习器性能,理论证明可行性并分析性能损失,实验评估了图像和表格任务中的代价。

Comments Published in Transactions on Machine Learning Research (06/2026). URL: https://openreview.net/forum?id=gI6VOV3jfO

详情
AI中文摘要

公共数据集对现代机器学习和统计推断至关重要,但通常包含低质量或受污染的样本,这可能损害模型性能。因此,需要一种原则性的预过滤程序,数据提供者可以应用该程序同时保护一系列潜在下游统计和学习程序的准确性。在这项工作中,我们形式化并分析了学习者无关的鲁棒数据预过滤(LARP),即设计预过滤程序的问题,该程序对预先指定的学习者集合上的最坏情况损失有保证。我们在两个理论环境中建立了LARP的可行性,通过提供最坏情况损失的上界保证。我们的理论结果表明,与针对单个学习者的特定预过滤相比,通过LARP保护异构学习者集合会以一定的性能损失为代价;我们将这一差距称为LARP的代价。为了评估这一性能差距,我们在图像和表格任务上实证测量了LARP的代价。我们进一步从节省重复数据整理工作的角度探讨了LARP的潜在好处,在一个博弈论模型中,下游学习者可以分摊单一预过滤的成本。

英文摘要

Public datasets, crucial for modern machine learning and statistical inference, often contain low-quality or contaminated samples that can harm model performance. This creates a need for principled prefiltering procedures that a data provider can apply to protect the accuracy of a range of potential downstream statistical and learning procedures simultaneously. In this work, we formalize and analyze Learner-Agnostic Robust data Prefiltering (LARP), the problem of designing prefiltering procedures with guarantees on the worst-case loss over a pre-specified set of learners. We establish the feasibility of LARP in two theoretical settings, by providing upper-bound guarantees on the worst-case loss. Our theoretical results indicate that protecting heterogeneous learner sets via LARP comes at the price of some performance loss compared to individual, learner-specific prefiltering; we call this gap the price of LARP. To assess this gap in performance, we empirically measure the price of LARP across image and tabular tasks. We further explore potential benefits of LARP from the perspective of saving on repeated data curation efforts, in a game-theoretic model where the downstream learners can split the cost of the single prefiltering.

2507.20975 2026-06-09 stat.ML cs.LG 版本更新

Locally Adaptive Conformal Inference for Operator Models

算子模型的局部自适应共形推断

Trevor Harris, Yan Liu

发表机构 * University of Connecticut(康涅狄格大学) Meta Platforms Inc(Meta平台公司)

AI总结 提出局部切片共形推断(LSCI),一种无分布框架,为算子模型生成函数值、局部自适应预测集,在合成和实际任务中比共形基线更紧、适应性更强。

Comments 12 pages, 3 figures, 2 tables, Preprint

详情
AI中文摘要

算子模型是函数巴拿赫空间之间的回归算法。它们已成为时空预测和物理模拟中日益关键的工具,尤其是在需要稳健、校准的不确定性量化的高风险场景中。我们引入了局部切片共形推断(LSCI),这是一种无分布框架,用于为算子模型生成函数值、局部自适应的预测集。我们证明了有限样本有效性,并在局部可交换性下推导了覆盖差距的数据相关上界。在合成高斯过程任务和实际应用(空气质量监测、能源需求预测和天气预报)中,与共形基线相比,LSCI 产生了更紧且适应性更强的集合。我们还实验证明了其对有偏预测和某些分布外噪声模式的鲁棒性。

英文摘要

Operator models are regression algorithms between Banach spaces of functions. They have become an increasingly critical tool for spatiotemporal forecasting and physics emulation, especially in high-stakes scenarios where robust, calibrated uncertainty quantification is required. We introduce Local Sliced Conformal Inference (LSCI), a distribution-free framework for generating function-valued, locally adaptive prediction sets for operator models. We prove finite-sample validity and derive a data-dependent upper bound on the coverage gap under local exchangeability. On synthetic Gaussian-process tasks and real applications (air quality monitoring, energy demand forecasting, and weather prediction), LSCI yields tighter sets with stronger adaptivity compared to conformal baselines. We also empirically demonstrate robustness against biased predictions and certain out-of-distribution noise regimes.
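For context, the kind of conformal baseline that LSCI is compared against can be sketched for scalar outputs as standard split conformal prediction. This is not LSCI itself: the function-valued outputs and local adaptivity are not shown, and the function names and calibration residuals are hypothetical.

```python
import math


def conformal_quantile(residuals, alpha):
    """Finite-sample split-conformal quantile of absolute calibration residuals."""
    n = len(residuals)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf  # too few calibration points for this coverage level
    return sorted(residuals)[k - 1]


def prediction_interval(prediction, residuals, alpha):
    """Symmetric interval with marginal coverage >= 1 - alpha under exchangeability."""
    q = conformal_quantile(residuals, alpha)
    return prediction - q, prediction + q


# Hypothetical absolute residuals |y - f(x)| on a held-out calibration set.
calibration_residuals = [float(r) for r in range(1, 20)]
print(prediction_interval(100.0, calibration_residuals, alpha=0.5))
```

The interval width here is the same for every input, which is the lack of adaptivity that locally adaptive methods aim to remove.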

2509.09151 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Video Understanding by Design: How Datasets Shape Video Models

通过设计理解视频:数据集如何塑造视频模型

Lei Wang, Syuan-Hao Li, Piotr Koniusz, Yongsheng Gao

发表机构 * School of Engineering and Built Environment, Electrical and Electronic Engineering, Griffith University(工程与建筑环境学院,电气与电子工程学院,格里菲斯大学) School of Computer Science and Engineering, University of New South Wales(计算机科学与工程学院,新南威尔士大学)

AI总结 本文从数据集视角出发,提出统一框架连接数据集结构、归纳偏差与架构设计,分析数据集特性如何驱动视频理解架构创新,并讨论不同数据体制下的表征偏差。

Comments Research report

详情
AI中文摘要

视频理解研究因日益多样化的数据集和更强大的模型架构而快速发展。现有综述通常按任务、基准或模型家族组织进展,但对特定架构为何出现并成功提供的见解有限。本文认为,视频理解的演进根本上由数据集结构塑造。我们提出一个以数据集为中心的视角,在统一框架内连接数据集结构、归纳偏差和架构设计。我们表明,不同数据集要求模型捕获特定的不变性和能力,例如对视角变化的鲁棒性、对时间顺序的敏感性、长程依赖推理、关系交互和跨模态对齐。这些需求自然产生归纳偏差,即有利于特定推理和泛化模式的架构假设。从这一视角看,里程碑式架构,包括双流网络、3D CNN、时序模型、Transformer、基于图的方法和多模态基础模型,可理解为对演进数据集所带来挑战的架构响应。基于此框架,我们系统分析了数据集特性如何塑造视频理解任务中的架构创新,并讨论了不同数据体制引发的表征偏差。通过将数据集、归纳偏差和架构统一为一个连贯视角,本综述既提供了对领域演进的回顾性解释,也提供了通向通用视频理解系统的前瞻性路线图。代码和数据集诱导偏差的动态视频可视化见 https://time.griffith.edu.au/paper-sites/video-understanding/。

英文摘要

Research in video understanding has advanced rapidly, driven by increasingly diverse datasets and more powerful model architectures. While existing surveys typically organize progress by tasks, benchmarks, or model families, they provide limited insight into why particular architectures emerged and succeeded. In this survey, we argue that the evolution of video understanding is fundamentally shaped by dataset structure. We present a dataset-centric perspective that connects dataset structure, inductive biases, and architectural design within a unified framework. We show that different datasets require models to capture specific invariances and capabilities, such as robustness to viewpoint changes, sensitivity to temporal ordering, reasoning over long-range dependencies, relational interactions, and cross-modal alignment. These requirements naturally give rise to inductive biases, i.e., architectural assumptions that favor particular patterns of reasoning and generalization. From this perspective, milestone architectures, including two-stream networks, 3D CNNs, temporal models, transformers, graph-based methods, and multimodal foundation models, can be understood as architectural responses to the challenges posed by evolving datasets. Building on this framework, we systematically analyze how dataset characteristics have shaped architectural innovation across video understanding tasks and discuss the representational biases induced by different data regimes. By unifying datasets, inductive biases, and architectures into a coherent perspective, this survey offers both a retrospective explanation of the field's evolution and a forward-looking roadmap toward general-purpose video understanding systems. Code and dynamic video visualizations of dataset-induced biases are available at https://time.griffith.edu.au/paper-sites/video-understanding/.

2511.18421 2026-06-09 cs.SD cs.LG 版本更新

DHAuDS: A Dynamic and Heterogeneous Audio Benchmark for Test-Time Adaptation

DHAuDS:用于测试时自适应的动态异构音频基准

Weichuang Shao, Iman Yi Liao, Tomas Henrique Bode Maul, Tissa Chandesa

发表机构 * School of Computer and Mathematical Sciences, University of Nottingham Malaysia(诺丁汉马来西亚大学计算机与数学科学学院)

AI总结 针对现有测试时自适应(TTA)评估依赖静态同质噪声协议的问题,提出DHAuDS基准,通过动态严重度和异构噪声混合暴露音频分类鲁棒性缺陷。

详情
AI中文摘要

现有的测试时自适应(TTA)研究严重依赖静态和同质的损坏协议,例如ImageNet-C和CIFAR-10-C/100-C,导致评估设置不一致,并且可能高估与实际情况相比的鲁棒性估计。TTA缺乏能够模拟现实异构声学退化的标准化评估基础设施。我们引入了DHAuDS,这是一个标准化的基准套件,用于评估在动态损坏严重性和异构噪声混合下的音频分类TTA鲁棒性。DHAuDS并非提出新的TTA算法,而是专注于暴露在传统固定噪声评估协议下仍然隐藏的鲁棒性限制。

英文摘要

Existing Test-Time Adaptation (TTA) studies rely heavily on static and homogeneous corruption protocols, such as ImageNet-C and CIFAR-10-C/100-C, leading to inconsistent evaluation settings and robustness estimates that may be inflated relative to real-world conditions. TTA lacks a standardized evaluation infrastructure capable of modeling realistic heterogeneous acoustic degradation. We introduce DHAuDS, a standardized benchmark suite for evaluating audio classification TTA robustness under dynamic corruption severity and heterogeneous noise mixtures. Rather than proposing a new TTA algorithm, DHAuDS focuses on exposing robustness limitations that remain hidden under conventional fixed-noise evaluation protocols.

2602.00238 2026-06-09 cs.CL cs.AI cs.LG 版本更新

DIVERGE: Diversity-Enhanced RAG for Open-Ended Information Seeking

DIVERGE: 面向开放式信息检索的多样性增强RAG

Tianyi Hu, Niket Tandon, Akhil Arora

发表机构 * Aarhus University(奥胡斯大学) Microsoft Research(微软研究院)

AI总结 针对现有RAG系统忽略开放式信息检索中多样性需求的问题,提出Diverge框架,通过迭代反思引导的多样化视角探索和多样性感知检索支持,在保持质量的同时将多样性提升约2倍。

详情
AI中文摘要

现有的检索增强生成(RAG)系统通常假设每个查询只有一个正确答案。这种假设忽略了开放式信息检索场景,其中多个合理的答案是有价值的,并且多样性对于创造力、公平性和信息的包容性访问至关重要。我们表明,标准RAG系统未能充分利用多样化的检索上下文:简单地增加检索多样性并不一定会导致多样化的生成。为了解决这一局限性,我们提出了Diverge,一个即插即用的智能体RAG框架,通过迭代、反思引导的多样化视角探索和多样性感知检索支持来改善多样性与质量的权衡。我们进一步引入了用于表征开放式问答中多样性与质量权衡的评估指标。在多个真实世界数据集和骨干LLM上的实验表明,Diverge在竞争基线中实现了最佳的权衡,将多样性提高了约2倍,且没有明显的质量下降。这些结果揭示了当前RAG系统的系统性局限,并展示了显式多样性建模的价值。

英文摘要

Existing retrieval-augmented generation (RAG) systems often assume that each query has a single correct answer. This assumption overlooks open-ended information-seeking scenarios where multiple plausible answers are valuable, and where diversity is important for creativity, fairness, and inclusive access to information. We show that standard RAG systems fail to fully use diverse retrieved contexts: simply increasing retrieval diversity does not necessarily lead to diverse generations. To address this limitation, we propose Diverge, a plug-and-play agentic RAG framework that improves the diversity-quality trade-off through iterative, reflection-guided exploration of diverse viewpoints and diversity-aware retrieval support. We further introduce evaluation metrics for characterizing the diversity-quality trade-off in open-ended question answering. Experiments across multiple real-world datasets and backbone LLMs show that Diverge achieves the best trade-off among competitive baselines, increasing diversity by $\sim2\times$ without noticeable quality degradation. These results reveal a systematic limitation of current RAG systems and show the value of explicit diversity modeling.

2602.03224 2026-06-09 cs.AI cs.LG 版本更新

TAME: A Trustworthy Test-Time Evolution of Agent Memory with Systematic Benchmarking

TAME: 一种可信的智能体记忆测试时演化与系统化基准测试

Yu Cheng, Yongkang Hu, Jiuan Zhou, Yushuo Zhang, Yihang Chen, Huichi Zhou, Mingang Chen, Zhizhong Zhang, Kun Shao, Yuan Xie, Zhaoxia Yin

发表机构 * East China Normal University(华东师范大学) Shanghai Innovation Institute(上海创新研究院) Shanghai Key Laboratory of Computer Software Evaluating and Testing(上海计算机软件评测与测试重点实验室) Huawei Noah’s Ark Lab(华为诺亚方舟实验室)

AI总结 提出TAME框架,通过执行器-评估器循环实现记忆的可信演化,解决良性任务演化中智能体可信度下降问题,在GPT-5.2 AIME上准确率提升14.6个百分点。

详情
AI中文摘要

智能体记忆的测试时演化代表了推进AGI的关键范式,因为它通过经验积累增强复杂推理,而无需参数更新。然而,即使在良性任务演化过程中,智能体的安全对齐仍然脆弱,这种现象被称为智能体记忆误演化。为了评估这一现象,我们构建了Trust-Memevo基准测试,并发现智能体在良性任务演化过程中,多个任务的可信度整体下降。为了解决这个问题,我们提出了TAME,一个可信感知的记忆演化框架,其中共享记忆库由执行器和评估器共同管理。执行器检索并应用可迁移经验以支持任务求解,而评估器评估每个使用经验对结果的贡献,并产生可信感知的反馈以指导后续记忆使用。这种执行器-评估器循环使得记忆能够随时间被选择性强化、谨慎重用和持续扩展。实验表明,TAME在实现强任务性能的同时缓解了记忆误演化。特别是在GPT-5.2 AIME基准测试上,TAME相比现有最强方法准确率提高了14.6个百分点,并保持了有竞争力的可信度。

英文摘要

Test-time evolution of agent memory represents a pivotal paradigm for advancing AGI, as it strengthens complex reasoning through experience accumulation without requiring parameter updates. However, even during benign task evolution, agent safety alignment remains vulnerable, a phenomenon known as Agent Memory Misevolution. To evaluate this phenomenon, we construct the Trust-Memevo benchmark and find that agents exhibit an overall decline in trustworthiness across multiple tasks during benign task evolution. To address this issue, we propose TAME, a trust-aware memory evolution framework in which a shared memory bank is jointly governed by an Executor and an Evaluator. The Executor retrieves and applies transferable experiences to support task solving, while the Evaluator assesses the contribution of each utilized experience to the outcome and produces trust-aware feedback to guide subsequent memory use. This executor-evaluator loop enables memory to be selectively reinforced, cautiously reused, and continuously expanded over time. Experiments show that TAME mitigates memory misevolution while achieving strong task performance. In particular, on the GPT-5.2 AIME benchmark, TAME improves accuracy by 14.6 percentage points over the strongest existing method and maintains competitive trustworthiness.

2602.05869 2026-06-09 stat.ML cs.LG cs.NA math.NA math.PR math.ST stat.TH 版本更新

Wedge Sampling: Efficient Tensor Completion with Nearly-Linear Sample Complexity

楔形采样:具有近线性样本复杂性的高效张量补全

Hengrui Luo, Anna Ma, Ludovic Stephan, Yizhe Zhu

发表机构 * Rice University(莱斯大学) University of California, Irvine(加州大学尔湾分校) Univ Rennes, Ensai, CNRS, CREST-UMR 9194(雷恩大学,Ensai,CNRS,CREST-UMR 9194) University of Southern California(南加州大学)

AI总结 提出楔形采样非自适应方案,通过结构化长度二模式(楔形)分配观测,在均匀采样稀疏时增强谱信号,实现近线性样本复杂度的张量补全。

Comments COLT 2026 arXiv version. 65 pages, 3 figures

详情
AI中文摘要

我们引入了楔形采样(Wedge Sampling),一种用于低秩张量补全的新型非自适应采样方案。我们研究从部分条目中恢复维度为 $n \times \cdots \times n$ 的 $k$ 阶低秩张量。与标准均匀条目模型(即来自 $[n]^k$ 的 i.i.d. 样本)不同,楔形采样将观测分配到关联二分采样图中的结构化长度二模式(楔形)。通过直接促进这些长度二连接,采样设计增强了在均匀采样过于稀疏而无法产生足够信息相关性的情况下高效初始化所依赖的谱信号。我们的主要结果表明,这种采样范式的改变使得多项式时间算法能够以 $n$ 的近线性样本复杂度实现弱恢复和精确恢复。该方法也是即插即用的:基于楔形采样的谱初始化可以与现有的细化过程(例如,谱方法或梯度方法)结合,仅需额外 $\tilde{O}(n)$ 个均匀采样条目,显著优于在均匀条目采样下高效方法通常所需的 $\tilde{O}(n^{k/2})$ 样本复杂度。总体而言,我们的结果表明,Barak 和 Moitra (2022) 中强调的统计-计算差距在很大程度上是张量补全中均匀条目采样模型的结果,而保证强初始化的替代非自适应测量设计可以克服这一障碍。

英文摘要

We introduce Wedge Sampling, a new non-adaptive sampling scheme for low-rank tensor completion. We study recovery of an order-$k$ low-rank tensor of dimension $n \times \cdots \times n$ from a subset of its entries. Unlike the standard uniform entry model (i.e., i.i.d. samples from $[n]^k$), wedge sampling allocates observations to structured length-two patterns (wedges) in an associated bipartite sampling graph. By directly promoting these length-two connections, the sampling design strengthens the spectral signal that underlies efficient initialization, in regimes where uniform sampling is too sparse to generate enough informative correlations. Our main result shows that this change in sampling paradigm enables polynomial-time algorithms to achieve both weak and exact recovery with nearly linear sample complexity in $n$. The approach is also plug-and-play: wedge-sampling-based spectral initialization can be combined with existing refinement procedures (e.g., spectral or gradient-based methods) using only an additional $\tilde{O}(n)$ uniformly sampled entries, substantially improving over the $\tilde{O}(n^{k/2})$ sample complexity typically required under uniform entry sampling for efficient methods. Overall, our results suggest that the statistical-to-computational gap highlighted in Barak and Moitra (2022) is, to a large extent, a consequence of the uniform entry sampling model for tensor completion, and that alternative non-adaptive measurement designs that guarantee a strong initialization can overcome this barrier.

2602.12129 2026-06-09 cs.IR cs.LG 版本更新

Towards Personalized Bangla Book Recommendation: A Large-Scale Heterogeneous Book Graph Dataset

面向个性化孟加拉语图书推荐:大规模异构图书图谱数据集

Rahin Arefin Ahmed, Md. Anik Chowdhury, Sakil Ahmed Sheikh Reza, Devnil Bhattacharjee, Muhammad Abdullah Adnan, Julian McAuley, Nafis Sadeq

发表机构 * East West University(东西方大学) Bangladesh University of Engineering and Technology(孟加拉工程与技术大学) University of California San Diego(加州大学圣地亚哥分校)

AI总结 针对孟加拉语文学缺乏结构化大规模公开数据集的问题,构建了RokomariBG异构图书图谱数据集,包含12.7万本书、6.3万用户等实体及多种关系,通过基准测试表明异构关系与混合文本元数据显著影响推荐性能。

Comments Added new experiment results on sequential recommendation, top-N recommendation results have been updated using per user temporal leave-last-one-out instead of random split

详情
AI中文摘要

孟加拉语文学中的个性化图书推荐一直受限于缺乏结构化、大规模且公开可用的数据集。本文介绍了RokomariBG,一个大规模异构图书图谱数据集,旨在支持低资源语言环境下的个性化推荐研究。该数据集包含127,302本书、63,723个用户、16,601位作者、1,515个类别、2,757家出版社和209,602条评论,通过多种关系类型连接,并组织为综合知识图谱。为展示数据集的实用性,我们针对Top-N推荐和序列推荐任务进行了系统基准研究,评估了多种代表性推荐模型。通过全面基准测试,我们证明了该领域的推荐性能同时受异构关系信息和语码混用(code-mixed)文本元数据的强烈影响。这些发现揭示了孟加拉国电商生态系统中现有推荐基准大多缺失的独特挑战。总体而言,本文为孟加拉语图书推荐研究建立了基础基准和公开可用资源,实现了可重复评估及未来对低资源文化领域推荐的研究。数据集和代码已公开于:https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset

英文摘要

Personalized book recommendation in Bangla literature has been constrained by the lack of structured, large-scale, and publicly available datasets. This work introduces RokomariBG, a large-scale heterogeneous book graph dataset designed to support research on personalized recommendation in a low-resource language setting. The dataset comprises 127,302 books, 63,723 users, 16,601 authors, 1,515 categories, 2,757 publishers, and 209,602 reviews, connected through several relation types and organized as a comprehensive knowledge graph. To demonstrate the utility of the dataset, we present a systematic benchmarking study on the top-N recommendation and sequential recommendation tasks, evaluating a diverse set of representative recommendation models. Through comprehensive benchmarking, we demonstrate that recommendation performance in this domain is strongly influenced by both heterogeneous relational information and code-mixed textual metadata. These findings reveal unique challenges of Bangladeshi e-commerce ecosystems that are largely absent from existing recommendation benchmarks. Overall, this work establishes a foundational benchmark and a publicly available resource for Bangla book recommendation research, enabling reproducible evaluation and future studies on recommendation in low-resource cultural domains. The dataset and code are publicly available at https://github.com/backlashblitz/Bangla-Book-Recommendation-Dataset

2604.06210 2026-06-09 cs.CL cs.AI cs.CY cs.LG 版本更新

Distributional Open-Ended Evaluation of LLM Cultural Value Alignment Based on Value Codebook

基于价值码本的LLM文化价值对齐分布性开放式评估

Jaehyeok Lee, Xiaoyuan Yi, Jing Yao, Hyunjin Hwang, Roy Ka-Wei Lee, Xing Xie, JinYeong Bak

发表机构 * KAIST(韩国科学技术院)

AI总结 提出DOVE框架,通过率失真变分优化构建价值码本,利用不平衡最优传输度量分布对齐,解决LLM文化价值评估中的构造-组成-上下文挑战。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

随着LLM在全球部署,使其文化价值取向对齐对于安全性和用户参与至关重要。然而,现有基准面临构造-组成-上下文($C^3$)挑战:依赖判别性、多项选择格式,探测的是价值知识而非真实取向,忽视亚文化异质性,且与真实世界的开放式生成不匹配。我们引入DOVE,一个基于分布的评估框架,直接比较人类撰写文本的分布与LLM生成的输出。DOVE利用率失真变分优化目标从10K文档中构建紧凑的价值码本,将文本映射到结构化价值空间以过滤语义噪声。使用不平衡最优传输测量对齐,捕捉文化内分布结构和子群体多样性。在12个LLM上的实验表明,DOVE实现了优越的预测有效性,与下游任务的相关性达到31.56%,同时每个文化仅需500个样本即可保持高可靠性。

英文摘要

As LLMs are globally deployed, aligning their cultural value orientations is critical for safety and user engagement. However, existing benchmarks face the Construct-Composition-Context ($C^3$) challenge: they rely on discriminative, multiple-choice formats that probe value knowledge rather than true orientations, overlook subcultural heterogeneity, and are mismatched with real-world open-ended generation. We introduce DOVE, a distributional evaluation framework that directly compares human-written text distributions with LLM-generated outputs. DOVE utilizes a rate-distortion variational optimization objective to construct a compact value codebook from 10K documents, mapping text into a structured value space to filter semantic noise. Alignment is measured using unbalanced optimal transport, capturing intra-cultural distributional structures and subgroup diversity. Experiments across 12 LLMs show that DOVE achieves superior predictive validity, attaining a 31.56% correlation with downstream tasks, while maintaining high reliability with as few as 500 samples per culture.

2605.19276 2026-06-09 cs.CL cs.LG 版本更新

OpenCompass: A Universal Evaluation Platform for Large Language Models

OpenCompass:大型语言模型的通用评估平台

Maosong Cao, Kai Chen, Haodong Duan, Yixiao Fang, Zhiwei Fei, Tong Gao, Ge Jiaye, Mo Li, Hongwei Liu, Junnan Liu, Yuan Liu, Chengqi Lyu, Han Lyu, Ningsheng Ma, Zerun Ma, Yu Sun, Zhiyong Wu, Linchen Xiao, Zhuozhi Xiong, Jun Xu, Haochen Ye, Zhaohui Yu, Yike Yuan, Songyang Zhang, Yufeng Zhao, Fengzhe Zhou, Peiheng Zhou, Dongsheng Zhu, Lin Zhu, Jingming Zhuo

发表机构 * OpenCompass Team(OpenCompass团队) Shanghai AI Laboratory(上海人工智能实验室)

AI总结 提出OpenCompass,一个模块化、高兼容性、灵活且高并发的通用LLM评估平台,支持多种任务场景和主流基准数据集。

详情
AI中文摘要

近年来,人工智能领域经历了从特定任务的小规模模型到通用大型语言模型(LLM)的范式转变。随着LLM的快速迭代,对其能力进行客观、定量和全面的评估已成为推动技术发展的关键环节。目前,基于静态基准数据集的主流评估方法面临任务类型多样性、评估标准不一致以及数据处理流程碎片化等挑战,难以高效进行跨领域和大规模模型评估。为解决上述问题,本文提出并开源了OpenCompass,一个一站式、可扩展且支持高并发的通用LLM评估平台。该平台遵循模块化和组件解耦的设计理念,具有三大核心优势:高兼容性、灵活性和高并发性。OpenCompass的核心架构包括五个关键组件:配置系统、任务划分模块、执行与调度模块、任务执行单元和结果可视化模块。其工作流程提供基于规则、LLM作为评判者和级联评估器,以适应不同任务场景的需求。平台支持知识、推理、计算、科学、语言、代码等多个领域的基准数据集,为学术界和工业界提供统一高效的LLM评估工具,有助于准确识别LLM的优缺点并进行后续优化。

英文摘要

In recent years, the field of artificial intelligence has undergone a paradigm shift from task-specific small-scale models to general-purpose large language models (LLMs). With the rapid iteration of LLMs, objective, quantitative, and comprehensive evaluation of their capabilities has become a critical link in advancing technological development. Currently, the mainstream static benchmark dataset-based evaluation methods face challenges such as the diversity of task types, inconsistent evaluation criteria, and fragmentation of data and processing workflows, making it difficult to efficiently conduct cross-domain and large-scale model evaluation. To address the aforementioned issues, this paper proposes and open-sources OpenCompass, a one-stop, scalable, and high-concurrency-supported general-purpose LLM evaluation platform. Adhering to the design philosophy of modularization and component decoupling, the platform boasts three core advantages: high compatibility, flexibility, and high concurrency. The core architecture of OpenCompass comprises five key components: the Configuration System, Task Partitioning Module, Execution and Scheduling Module, Task Execution Unit, and Result Visualization Module. Its workflow provides rule-based, LLM-as-a-Judge, and cascaded evaluators to adapt to the requirements of different task scenarios. Supporting mainstream benchmark datasets across multiple domains, including knowledge, reasoning, computation, science, language, code, etc., the platform offers a unified and efficient LLM evaluation tool for both academia and industry, facilitating the accurate identification of strengths and weaknesses of LLMs as well as their subsequent optimization.

2605.23965 2026-06-09 cs.AI cs.LG cs.SE 版本更新

LGMT: Logic-Grounded Metamorphic Testing for Evaluating the Reasoning Reliability of LLMs

LGMT:基于逻辑的蜕变测试用于评估LLMs的推理可靠性

Zenghui Zhou, Man Li, Xiaoke Fang, Xinyi Zhou, Weibin Lin, Zheng Zheng

发表机构 * School of Automation Science and Electrical Engineering, Beihang University(北京航空航天大学自动化科学与电气工程学院)

AI总结 提出LGMT框架,利用一阶逻辑推导蜕变关系,通过一致性检查评估LLM推理的鲁棒性,揭示传统评估忽略的隐藏缺陷。

Comments Zheng Zheng is the corresponding author

详情
AI中文摘要

大型语言模型(LLMs)在逻辑推理基准测试中表现出色,但其可靠性仍不确定。现有评估依赖静态基准,无法评估在逻辑等价变换下的鲁棒性,且往往高估推理能力。我们提出LGMT(基于逻辑的蜕变测试),一种无需测试预言(oracle-free)的框架,利用一阶逻辑(FOL)评估LLM推理。通过从形式逻辑等价推导蜕变关系,LGMT构建语义不变的测试用例,并通过跨案例一致性检查检测推理缺陷。在六个最先进的LLM上的实验表明,LGMT暴露了传统基于参考的评估遗漏的大量隐藏缺陷。我们进一步发现,模型对符号级别和结论级别的变化特别敏感,而高级提示如Few-shot CoT仅能部分缓解这些问题。这些结果表明,LLM评估应从孤立的正确性转向逻辑不变性下的鲁棒性。LGMT为诊断推理失败提供了一种原则性和可扩展的方法。

英文摘要

Large Language Models (LLMs) achieve strong performance on logical reasoning benchmarks, yet their reliability remains uncertain. Existing evaluations rely on static benchmarks, which fail to assess robustness under logically equivalent transformations and often overestimate reasoning capability. We propose LGMT (Logic-Grounded Metamorphic Testing), an oracle-free framework that leverages first-order logic (FOL) to evaluate LLM reasoning. By deriving metamorphic relations from formal logical equivalences, LGMT constructs semantically invariant test cases and detects reasoning defects through cross-case consistency checking. Experiments on six state-of-the-art LLMs show that LGMT exposes substantial hidden defects missed by traditional reference-based evaluations. We further find that models are particularly sensitive to symbol-level and conclusion-level variations, and that advanced prompting such as Few-shot CoT only partially mitigates these issues. These results suggest that LLM evaluation should move beyond isolated correctness toward robustness under logical invariance. LGMT provides a principled and scalable approach for diagnosing reasoning failures.
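The core mechanism, deriving test variants from logical equivalences and checking that answers agree across them, can be sketched as follows. This is an illustrative toy, not the LGMT implementation: it uses propositional formulas rather than first-order logic, and the `answers` dictionary stands in for responses collected from a model under test (both are assumptions).

```python
from itertools import product


def equivalent(f, g, n_vars):
    """Truth-table check that two propositional formulas agree on every assignment."""
    return all(f(*v) == g(*v) for v in product([False, True], repeat=n_vars))


def implies(p, q):
    """Source premise: p -> q."""
    return (not p) or q


def contrapositive(p, q):
    """Rewrite 1: (not q) -> (not p)."""
    return q or (not p)


def de_morgan(p, q):
    """Rewrite 2: not (p and not q)."""
    return not (p and (not q))


def converse(p, q):
    """q -> p, which is NOT equivalent and must not be used as a relation."""
    return (not q) or p


def consistency_defects(answers):
    """Names of variants whose answer differs from the answer to the source case."""
    source = answers["source"]
    return [name for name, a in answers.items() if a != source]


# Only rewrites that pass the equivalence check qualify as metamorphic relations.
relations = {"contrapositive": contrapositive, "de_morgan": de_morgan}
valid = {name: equivalent(implies, g, 2) for name, g in relations.items()}

# Hypothetical model answers to the source prompt and its equivalent rewrites.
answers = {"source": "entailed", "contrapositive": "entailed", "de_morgan": "not entailed"}
defects = consistency_defects(answers)
```

No reference label is needed here: a defect is flagged purely because two semantically equivalent cases received different answers, which is what makes this style of check oracle-free.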

2606.04409 2026-06-09 cs.CV cs.AI cs.LG 版本更新

An Empirical Study of Data Scale, Model Complexity, and Input Modalities in Visual Generalization

数据规模、模型复杂度和输入模态对视觉泛化影响的实证研究

Yidi Zhouluo

发表机构 * School of Medical Information and Artificial Intelligence, Shandong First Medical University(山东第一医科大学医学信息与人工智能学院)

AI总结 通过一维非线性函数和CIFAR数据集实验,实证分析数据规模、模型复杂度和输入模态对视觉泛化性能的影响。

Comments 12 pages, 9 figures, 4 tables

详情
AI中文摘要

现代深度神经网络通常具有较大的参数规模和非线性层次结构,在计算机视觉中取得了强劲性能。然而,其泛化性能的来源仍然难以用传统统计学习理论解释。在可能影响视觉泛化的因素中,数据规模、模型复杂度和输入模态是基础且可控的变量。本研究实证分析了这三个因素如何影响模型泛化性能。具体而言,在初步实验中,我们构建了一维非线性函数,并改变训练样本数量和多项式次数,以观察数据规模和模型复杂度对模型性能的影响。在主要实验中,我们比较了CIFAR-10和CIFAR-100上不同训练数据规模、模型架构和输入模态下的模型性能。实验结果表明,增加训练数据规模持续改善泛化性能,而模型复杂度的变化并未带来稳定提升。此外,去除颜色信息会降低模型性能,而梯度、边缘和小波等显式先验特征在不同模型架构上的效果不一致。总体而言,本研究提供了数据规模、模型复杂度、输入模态与视觉泛化性能之间关系的实证分析。代码和实验日志见:https://github.com/zlyd-CV/DeepLearning-Empirical-Studies。

英文摘要

Modern deep neural networks usually have large parameter scales and nonlinear hierarchical structures, and they have achieved strong performance in computer vision. However, the source of their generalization performance remains difficult to explain using traditional statistical learning theory. Among the factors that may affect visual generalization, data scale, model complexity, and input modalities are fundamental and controllable variables. This study empirically analyzes how these three factors influence model generalization performance. Specifically, in a preliminary experiment, we construct a one-dimensional nonlinear function and vary the number of training samples and the polynomial degree to observe the effects of data scale and model complexity on model performance. In the main experiments, we compare model performance on CIFAR-10 and CIFAR-100 under different training data scales, model architectures, and input modalities. The experimental results show that increasing the training data scale consistently improves generalization performance, whereas changes in model complexity do not provide stable gains. In addition, removing color information degrades model performance, while explicit prior features such as gradients, edges, and wavelets have inconsistent effects across different model architectures. Overall, this study provides an empirical analysis of the relationships among data scale, model complexity, input modalities, and visual generalization performance. Code and experimental logs are available at: https://github.com/YidiZhouluo/DeepLearning-Empirical-Studies/tree/main/Exp_01.

2606.05932 2026-06-09 cs.AI cs.LG 版本更新

A Pre-Registered Causal Partition of Self-Consistency Elicitation and Reward Design in RLVR

RLVR中自我一致性激发与奖励设计的预注册因果分解

Yuze Gao

发表机构 * Outlook.com(Outlook公司)

AI总结 本文通过预注册实验和因果分解方法,证明RLVR中朴素奖励设计估计量存在系统性偏差,并量化了自我一致性激发与真正奖励设计信号的贡献。

Comments 9 pages, 7 figures

详情
AI中文摘要

基于可验证奖励的强化学习(RLVR)即使在奖励信号是虚假的情况下也能提升推理能力——将功劳分配给群体多数答案而非真实验证器。实践者通常将朴素估计量 naive = acc(TRUE) - acc(RANDOM) 解释为奖励设计效应。我们证明该估计量存在系统性偏差:它混淆了自我一致性激发(通过多数伪奖励将策略向众数答案锐化)与真正的奖励设计信号。使用受控的表格GRPO模拟器,我们推导出精确的望远镜分解 total = null + elicit + rd,并在五个先验强度水平上测量每个项。朴素估计量中奖励设计占比从弱先验(ps=0.20)时的0.139变化到强先验(ps=0.80)时的0.05,激发项在自我一致性交叉点处符号翻转。一个预注册的2x2x2析因实验证实了非可加性(交互比0.385;AxC效应-0.089)。一个点与界试点门控表明,强先验区域是点识别的,而接近交叉区域仅是有界的。对两个已命名发表结果的重新审计分别得出“激发主导”(激发份额0.98)和“奖励设计主导”(rd份额1.18)的结论,证明了该分解的诊断价值。我们预先承诺无论翻转结果如何都提交论文;非翻转同样是一个有价值的发现。我们发布一个可复用的单命令工具,供任何对齐论文运行相同的审计。

英文摘要

Reinforcement learning from verifiable rewards (RLVR) improves reasoning even when the reward signal is spurious -- assigning credit to the group-plurality answer rather than a ground-truth verifier. Practitioners commonly interpret naive = acc(TRUE) - acc(RANDOM) as the reward-design effect. We prove this estimand is systematically biased: it conflates self-consistency elicitation (sharpening the policy toward its modal answer via majority pseudo-reward) with genuine reward-design signal. Using a controlled tabular-GRPO simulator we derive an exact telescoping decomposition total = null + elicit + rd and measure each term across five prior-strength levels. The reward-design fraction of the naive estimator ranges from 0.139 at weak prior (ps=0.20) to 0.05 at strong prior (ps=0.80), with the elicitation term flipping sign at the self-consistency crossover. A pre-registered 2x2x2 factorial confirms non-additivity (interaction ratio 0.385; AxC effect -0.089). A points-vs-bounds pilot gate shows strong-prior regimes are point-identified while near-crossover regimes are only bounded. Re-audits of two named published results yield ELICITATION DOMINATED (elicitation share 0.98) and REWARD DESIGN DOMINATED (rd share 1.18) verdicts respectively, demonstrating the diagnostic value of the partition. We pre-commit to submit regardless of flip outcome; a non-flip is a finding of equal standing. We release a reusable one-command harness for any alignment paper to run the same audit.
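To make the telescoping identity total = null + elicit + rd concrete, here is a minimal sketch. The abstract does not define the individual terms, so the four-arm reading below (untrained base policy, random reward, majority pseudo-reward, true verifier) is an assumption chosen so that the terms telescope, and the accuracies are made-up toy values, not results from the paper.

```python
def partition(acc):
    """Split the total gain into null, elicitation, and reward-design terms.

    `acc` maps hypothetical arm names to final accuracy:
      base     - policy before any RL update
      random   - trained with random rewards
      majority - trained with the group-plurality pseudo-reward
      true     - trained with the ground-truth verifier
    """
    null = acc["random"] - acc["base"]
    elicit = acc["majority"] - acc["random"]
    rd = acc["true"] - acc["majority"]
    total = acc["true"] - acc["base"]
    naive = acc["true"] - acc["random"]  # the estimator criticised in the abstract
    return {"null": null, "elicit": elicit, "rd": rd, "total": total, "naive": naive}


# Toy accuracies for illustration only.
parts = partition({"base": 0.40, "random": 0.45, "majority": 0.55, "true": 0.60})
rd_fraction = parts["rd"] / parts["naive"]
```

Under this reading, naive = elicit + rd, so the naive estimator attributes the elicitation gain to reward design. In the toy numbers only one third of the naive gap is genuine reward-design signal.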

12. 机器学习应用 150 篇

2606.07565 2026-06-09 cs.LG 新提交

STARIXNet: Multivariate and Multi-attribute Deep Learning Approach to Real-Time Resource Allocation in Cloud Platforms

STARIXNet: 云平台中多变量多属性深度学习方法实现实时资源分配

Ahmed Abdulaal, Maruf Aytekin, Thilaga kumaran Srinivasan, Tomer Lancewicki

发表机构 * Walmart Global Tech(沃尔玛全球科技)

AI总结 提出STARIXNet轻量神经网络,通过捕获多系统指标的时空关系进行多变量资源分配,优先服务稳定性再考虑成本效率,在沃尔玛生产环境中节省10%-50%成本。

Comments 11 pages, 12 figures. Under review

详情
AI中文摘要

云平台中微服务的智能伸缩对于缓解不断增长的计算成本同时避免服务中断至关重要。当前解决方案局限于单变量空间,通常仅关注CPU使用率来驱动伸缩决策。此外,它们将问题视为纯预测任务,专注于预测精度而忽略了低估和系统响应延迟的更大风险。替代方案计算复杂,使其难以用于大规模实时部署。为应对这些挑战,我们提出STARIXNet,一种轻量级神经网络,通过捕获多个系统指标间的时空关系,在多变量空间中指导资源分配决策。STARIXNet对多个准依赖属性进行建模,特别是季节性(Seasonal)、时间性(Temporal)、自回归积分(Auto-Regressive Integrated)和外生(eXogenous)模式,然后实施聚合策略以最终确定伸缩决策,优先考虑服务稳定性,其次是成本效率,而非原始预测准确性。我们通过在真实环境中与现有解决方案进行基准测试,实证展示了STARIXNet的性能。STARIXNet已部署于沃尔玛的关键生产微服务,实现了10%至50%的实际节省,此外还通过改善服务稳定性和客户体验带来了无形收益。

英文摘要

Intelligent scaling of microservices in cloud platforms is crucial for mitigating escalating compute costs while avoiding service disruptions. Current solutions are limited to the univariate space, typically focusing on CPU usage alone to drive scaling decisions. Moreover, they address the problem as a purely forecasting task, focusing on prediction precision while neglecting the greater risks of underestimation and delays in system responsiveness. Alternative solutions are computationally complex, making them impractical for large-scale, real-time deployments. To address these challenges, we present STARIXNet, a lightweight neural network that guides resource allocation decisions in the multivariate space by capturing spatio-temporal relationships among multiple system metrics. STARIXNet models multiple quasi-dependent attributes, in particular the (S)easonal, (T)emporal, (A)uto-(R)egressive (I)ntegrated, and e(X)ogenous patterns, then implements an aggregation policy to finalize scaling decisions, prioritizing service stability, followed by cost-efficiency, over raw forecast accuracy. We empirically demonstrate the performance of STARIXNet by benchmarking against existing solutions in real-world settings. STARIXNet is deployed for critical production microservices at Walmart, achieving tangible savings ranging from 10% to 50%, in addition to intangible benefits through improved service stability and customer experience.
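The abstract does not specify the aggregation policy, so the following is only one conservative possibility, sketched to show what "stability first, cost second" can mean in practice. The function name, the per-component forecasts, the headroom factor, and the capacity figure are all hypothetical.

```python
import math


def replicas_needed(component_forecasts, capacity_per_replica, headroom=0.2, min_replicas=1):
    """Size the service for the largest component forecast plus a safety margin.

    Taking the maximum (rather than the mean) guards against under-provisioning,
    which the abstract identifies as the costlier error.
    """
    peak = max(component_forecasts.values())
    return max(min_replicas, math.ceil(peak * (1 + headroom) / capacity_per_replica))


# Hypothetical load forecasts (requests per second) from separate pattern models.
forecasts = {"seasonal": 820.0, "autoregressive": 760.0, "exogenous": 910.0}
replicas = replicas_needed(forecasts, capacity_per_replica=100.0)

# A purely accuracy-driven policy might instead size for the average forecast.
mean_based = math.ceil(sum(forecasts.values()) / len(forecasts) / 100.0)
```

With these toy numbers the conservative policy asks for 11 replicas against 9 for the mean-based one. The gap is the price paid for stability, which a cost-aware policy would then try to shrink.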

2606.07578 2026-06-09 cs.LG stat.ME stat.ML 新提交

MST-Direct at Scale: Multivariate and Conditional Geostatistical Simulation via Sinkhorn Optimal Transport

大规模MST-Direct:基于Sinkhorn最优传输的多变量与条件地质统计模拟

Tcharlies Bachmann Schmitz

发表机构 * GitHub arXiv

AI总结 提出MST-Direct扩展方法,通过稀疏Sinkhorn匹配器、多变量元组匹配和克里金条件化,实现大规模、多变量、条件地质统计模拟,精确保持联合分布。

详情
AI中文摘要

本文将MST-Direct(一种用于多变量地质统计模拟的基于Sinkhorn传输的匹配方法)从原始的二元、无条件、小网格形式扩展到多变量、条件和大网格设置。我们解决了原始工作中确定的三个主要限制:(i)通过具有O(nC)内存复杂度的稀疏、候选限制的Sinkhorn匹配器,实现超过几千个节点的可扩展性;(ii)通过将目标值元组匹配到独立FFT-MA高斯骨干上扩展到多个变量,该骨干再现指定的变差函数;以及(iii)通过克里金法条件化骨干,同时在其空间位置固定观测数据元组进行硬数据条件化。由于传输计划仍然是目标元组的排列,多变量联合分布被精确保持。该方法使用与直接多变量模拟(DMS)相同的六变量、异方差、强非线性参考分布进行验证,在无条件(200x200)和条件(100x100,200个硬数据样本)场景下,并与投影寻踪多变量变换(PPMT)进行基准比较。结果表明,MST-Direct以零直方图误差再现联合分布,精确满足硬数据,并准确再现指定的空间相关结构,而PPMT仍只是近似方法。索引术语:最优传输、Sinkhorn算法、地质统计模拟、多变量模拟。

英文摘要

This paper extends MST-Direct, a Matching-via-Sinkhorn-Transport approach for multivariate geostatistical simulation, from the original bivariate, unconditional, small-grid formulation to multivariate, conditional, and large-grid settings. We address the three main limitations identified in the original work: (i) scalability beyond a few thousand nodes through a sparse, candidate-restricted Sinkhorn matcher with O(nC) memory complexity; (ii) extension to multiple variables by matching target value tuples onto an independent FFT-MA Gaussian backbone that reproduces a prescribed variogram; and (iii) hard-data conditioning by fixing observed data tuples at their spatial locations while conditioning the backbone through kriging. Because the transport plan remains a permutation of the target tuples, the multivariate joint distribution is preserved exactly. The method is validated using the same six-variate, heteroscedastic, strongly nonlinear reference distribution employed in Direct Multivariate Simulation (DMS), under both unconditional (200x200) and conditional (100x100, 200 hard-data samples) scenarios, and is benchmarked against the Projection Pursuit Multivariate Transform (PPMT). Results show that MST-Direct reproduces the joint distribution with zero histogram error, exactly honours hard data, and accurately reproduces the prescribed spatial correlation structure, whereas PPMT remains an approximation. Index Terms: optimal transport, Sinkhorn algorithm, geostatistical simulation, multivariate simulation.
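The claim that a permutation-valued plan preserves the joint distribution exactly can be checked with a small sketch. Note the substitution: instead of a Sinkhorn matcher, the code below uses a plain rank-based assignment, and the backbone is i.i.d. Gaussian rather than an FFT-MA field with a prescribed variogram. All names and the toy target distribution are assumptions for illustration.

```python
import random


def rank_match(target_tuples, backbone, score):
    """Place target tuples on grid nodes so tuple rank (by `score`) follows backbone rank.

    The result is a permutation of `target_tuples`, so every tuple is used exactly once.
    """
    node_order = sorted(range(len(backbone)), key=lambda i: backbone[i])
    tuple_order = sorted(range(len(target_tuples)), key=lambda j: score(target_tuples[j]))
    field = [None] * len(backbone)
    for i, j in zip(node_order, tuple_order):
        field[i] = target_tuples[j]
    return field


rng = random.Random(42)
n = 200
backbone = [rng.gauss(0, 1) for _ in range(n)]  # stand-in for a spatially correlated field

# Toy bivariate target with a nonlinear, heteroscedastic relation between variables.
targets = []
for _ in range(n):
    x = rng.uniform(-2, 2)
    targets.append((x, x * x + rng.gauss(0, 0.1 + 0.2 * abs(x))))

field = rank_match(targets, backbone, score=lambda t: t[0])
```

Because `field` is only a reordering of `targets`, every marginal and joint statistic of the tuples is reproduced without error, while the spatial arrangement is inherited from the backbone. A Sinkhorn-based matcher replaces the one-dimensional ranking with a cost defined over whole tuples.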

2606.07582 2026-06-09 cs.LG cs.AI cs.ET 新提交

Customer Churn Prediction on Structured Data Using FT-Transformer and Stacking Ensembles

基于FT-Transformer和堆叠集成的结构化数据客户流失预测

Joyjit Roy, Samaresh Kumar Singh, Laxmi Shaw

发表机构 * Independent Researcher, Austin, TX, USA(独立研究员,美国德克萨斯州奥斯汀) Independent Researcher, Leander, TX(独立研究员,美国德克萨斯州利安德) Texas A & M University-Victoria, Victoria, TX(德克萨斯农工大学维多利亚分校)

AI总结 提出一种结合FT-Transformer与XGBoost的混合架构,通过校准感知堆叠集成处理类别不平衡和特征交互,在银行客户流失数据集上F1达62.10%,AUC-ROC为0.861。

Comments 22 pages, 9 figures, 20 tables; published in IEEE Access

详情
Journal ref
IEEE Access, vol. 14, pp. 62834-62855, 2026
AI中文摘要

客户流失预测在保险、数字银行、电子商务和订阅平台等数据驱动行业中至关重要,因为保留现有客户通常比获取新客户更具成本效益。由于类别不平衡、非线性特征交互和异质特征类型,在结构化数据集上预测流失仍然具有挑战性。基于树的集成方法在这些场景中始终表现出强大的性能,通常优于传统神经网络。本研究引入了一种经过验证的混合架构,通过校准感知堆叠将特征标记化变换器(FT-Transformer)与梯度提升树相结合。所提出的框架解决了先前研究中在统计验证、概率校准和可重复性方面的持续空白。FT-Transformer利用自注意力捕获高阶特征交互,而XGBoost通过互补的归纳偏置捕获梯度提升决策边界。类别不平衡通过使用类别加权损失函数处理,从而避免合成过采样并保留少数类分布。模型使用基于折叠外(OOF)堆叠的逻辑回归元学习器进行集成,该元学习器重新校准过于自信的基模型输出并学习最优组合权重。在一个公开的银行流失数据集上,混合模型在5x5交叉验证下达到62.10%的F1、0.861的AUC-ROC和0.647的PR-AUC,相比多层感知机(MLP)基线分别提升3.37个F1点和0.027个AUC,并报告了95%置信区间。消融研究表明,变换器组件和堆叠策略都对性能有实质性贡献。所提出的方法为结构化表格数据上的当代流失预测提供了一个可重复且可扩展的参考架构。

英文摘要

Customer churn prediction is essential across data-driven industries such as insurance, digital banking, eCommerce, and subscription platforms, where retaining existing customers is typically more cost-effective than acquiring new ones. Predicting churn on structured datasets remains challenging due to class imbalance, nonlinear feature interactions, and heterogeneous feature types. Tree-based ensemble methods consistently demonstrate strong performance in these contexts, often outperforming conventional neural networks. This study introduces a validated hybrid architecture that integrates feature-tokenized transformers (FT-Transformer) with gradient-boosted trees through calibration-aware stacking. The proposed framework addresses persistent gaps in statistical validation, probability calibration, and reproducibility found in prior research. The FT-Transformer captures higher-order feature interactions using self-attention, while XGBoost captures gradient-boosted decision boundaries with complementary inductive biases. Class imbalance is handled using class-weighted loss functions, thereby avoiding synthetic oversampling and preserving minority-class distributions. The models are ensembled using out-of-fold (OOF) stacking with a logistic regression meta-learner, which recalibrates overconfident base model outputs and learns optimal combination weights. On a public bank churn dataset, the hybrid model achieves 62.10% F1, 0.861 AUC-ROC, and 0.647 PR-AUC, outperforming the Multi-Layer Perceptron (MLP) baseline by 3.37 F1 points and 0.027 AUC under 5x5 cross-validation with 95% confidence intervals reported. Ablation studies demonstrate that both the transformer component and stacking strategy contribute materially to performance. The proposed methodology offers a reproducible and extensible reference architecture for contemporary churn prediction on structured tabular data.
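Class-weighted loss, the imbalance-handling device mentioned above, can be sketched in a few lines. The abstract does not state which weighting scheme was used, so the "balanced" heuristic below (n_samples / (n_classes * class_count)) is an assumption, and the labels and probabilities are toy values.

```python
import math


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * count of that class)."""
    n = len(labels)
    counts = {c: labels.count(c) for c in set(labels)}
    return {c: n / (len(counts) * k) for c, k in counts.items()}


def weighted_log_loss(labels, probs, weights, eps=1e-12):
    """Binary cross-entropy where each sample is scaled by its class weight."""
    total, norm = 0.0, 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        w = weights[y]
        total += -w * (y * math.log(p) + (1 - y) * math.log(1 - p))
        norm += w
    return total / norm


labels = [0] * 8 + [1] * 2  # 20% churners
weights = balanced_class_weights(labels)

# Two classifiers that make one confident mistake each, on different classes.
miss_churner = weighted_log_loss(labels, [0.1] * 8 + [0.9, 0.1], weights)
miss_stayer = weighted_log_loss(labels, [0.9] + [0.1] * 7 + [0.9, 0.9], weights)
```

The minority class receives weight 2.5 against 0.625 for the majority, so overlooking a churner is penalised more heavily than a false alarm. Unlike synthetic oversampling, this leaves the training rows untouched.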

2606.07606 2026-06-09 cs.LG 新提交

QDSP: An Interpretable Structured Learning Framework for Predicting Death or Cerebral Palsy in Very Low Birth Weight Infants

QDSP:一种用于预测极低出生体重婴儿死亡或脑瘫的可解释结构化学习框架

Ling Wang, Xiaolong Li, Hui Zhou, Jing Shi, Fuhao Zhang, Dapeng Chen, Nan Mu

发表机构 * College of Computer Science, Sichuan Normal University(四川师范大学计算机科学学院) West China Second University Hospital, Sichuan University(四川大学华西第二医院)

AI总结 提出QDSP框架,集成配额引导子空间采样和可微决策结构感知,在极低出生体重婴儿队列中实现高精度死亡/脑瘫预测,并提供可解释的临床决策路径。

详情
AI中文摘要

极低出生体重婴儿(VLBWI)面临高死亡风险和严重神经发育障碍(包括脑瘫),但在高维且数据有限的临床环境中,可靠的出院时预后分层仍然具有挑战性。为解决此问题,我们提出QDSP,一种可解释的结构化学习框架,集成配额引导子空间采样(QSS)和可微决策引导结构感知(DSP)。QSS模块通过基于自助法的特征一致性估计构建稳定性感知且低冗余的特征子空间,而DSP模块采用可微软斜决策结构建模非线性临床交互,同时保留可追溯的决策证据。该框架在包含51名婴儿的真实VLBWI队列上评估,并在三个公共医学表格数据集上进一步验证。在主要队列上,QDSP达到0.9200的准确率和0.9714的AUC,优于代表性机器学习和深度表格学习基线,包括XGBoost、TabNet和TabPFN。在外部数据集上,QDSP在不同样本量和临床分布下保持有竞争力的判别力和校准度。此外,基于SHAP的分析和可微决策路径追踪识别出临床相关预测因子,包括囊性脑室周围白质软化(cPVL)和出生体重,与已建立的新生儿病理生理学证据一致。这些结果表明,QDSP为VLBWI出院时风险分层提供了可解释且稳健的框架,并可能支持新生儿重症监护环境中的早期个体化临床决策。

英文摘要

Very low birth weight infants (VLBWI) are at high risk of mortality and severe neurodevelopmental impairment, including cerebral palsy, yet reliable discharge-time prognostic stratification remains challenging in high-dimensional and data-limited clinical settings. To address this problem, we propose QDSP, an interpretable structured learning framework that integrates Quota-guided Subspace Sampling (QSS) and Differentiable-decision-guided Structure Perception (DSP). The QSS module constructs stability-aware and low-redundancy feature subspaces through bootstrap-based feature consistency estimation, whereas the DSP module employs differentiable soft oblique decision structures to model nonlinear clinical interactions while preserving traceable decision evidence. The proposed framework was evaluated on a real-world VLBWI cohort comprising 51 infants and further validated on three public medical tabular datasets. On the primary cohort, QDSP achieved an accuracy of 0.9200 and an AUC of 0.9714, outperforming representative machine learning and deep tabular learning baselines, including XGBoost, TabNet, and TabPFN. Across external datasets, QDSP maintained competitive discrimination and calibration under varying sample sizes and clinical distributions. In addition, SHAP-based analyses and differentiable decision-path tracing identified clinically relevant predictors, including cystic periventricular leukomalacia (cPVL) and birth weight, consistent with established neonatal pathophysiological evidence. These results suggest that QDSP provides an interpretable and robust framework for discharge-time risk stratification in VLBWI and may support early individualized clinical decision-making in neonatal intensive care settings.
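A soft oblique decision node, the building block the DSP module is described as using, can be sketched as follows. This is a depth-1 illustration under stated assumptions, not the QDSP architecture: the weights, leaf risks, and the two input features (a standardised birth weight and a cPVL indicator) are hypothetical values chosen for the example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def soft_oblique_node(x, w, b, leaf_left, leaf_right):
    """Route a sample softly through one oblique split and blend the leaf outputs.

    'Oblique' means the split uses a linear combination of features (w . x + b)
    instead of a single-feature threshold; 'soft' means routing is a probability.
    """
    p_right = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)
    p_left = 1.0 - p_right
    risk = p_left * leaf_left + p_right * leaf_right
    return risk, {"left": p_left, "right": p_right}


# Hypothetical infant: birth weight one SD below the cohort mean, cPVL present.
x = (-1.0, 1.0)
risk, path = soft_oblique_node(x, w=(-1.0, 2.0), b=0.0, leaf_left=0.1, leaf_right=0.8)
```

The routing probabilities in `path` are what make the decision traceable: they show how strongly the sample was sent toward each leaf. Because every operation is differentiable, `w`, `b`, and the leaf values can be trained by gradient descent.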

2606.07614 2026-06-09 cs.LG stat.AP 新提交

Measuring Poverty and Inequality with Reduced Data: A Machine Learning Approach Using Nigerian Household Data

用缩减数据衡量贫困与不平等:基于尼日利亚住户数据的机器学习方法

Vanesa Jordá, Miguel Niño-Zarazúa

发表机构 * Cantabria University(坎塔布里亚大学) SOAS University of London(伦敦大学亚非学院) United Nations University World Institute for Development Economics Research (UNU-WIDER)(联合国大学世界发展经济学研究所)

AI总结 本文利用随机森林递归特征消除法分析尼日利亚调查数据,发现少量预测因子即可高精度识别贫困状态和不平等线位置,表明机器学习可优化调查设计并降低数据需求。

详情
AI中文摘要

可靠衡量收入和消费对于监测中低收入国家的贫困与不平等至关重要,但完整的住户调查成本高昂且难以定期实施。本文探讨缩减调查工具能否保留关键分布信息。我们应用随机森林递归特征消除法(RF-RFE)对2018/19年尼日利亚通用住户调查面板数据进行分析,识别最能将个体划分到福利分布中的收入来源、消费类别和住户特征。分析聚焦三个结果:贫困状态、在五等分分布中的位置以及相对于基于基尼系数的不平等线的位置。调查的种植后和收获后阶段使我们能够评估不同季节背景下的表现。结果表明,RF-RFE在少量预测因子下实现了强分类准确率。对于消费,使用少量支出类别即可准确预测贫困状态和不平等线位置,而五等分分类对季节性消费达到约80%的准确率,对从单次季节性访问预测的年消费达到60-65%的准确率。对于收入,使用五个预测因子贫困状态准确率约达90%,不平等线位置主要由劳动收入捕获。研究结果表明,机器学习方法有助于改进调查设计并减少数据需求,同时保留衡量和监测贫困与不平等所需的大部分分布信息。

英文摘要

Reliable measurement of income and consumption is essential for monitoring poverty and inequality in low- and middle-income countries, yet full household surveys are costly and difficult to implement regularly. This paper examines whether reduced survey instruments can preserve key distributional information. We apply Random Forest Recursive Feature Elimination (RF-RFE) to the 2018/19 Nigeria General Household Survey-Panel to identify the income sources, consumption categories and household characteristics that best classify individuals within the welfare distribution. The analysis focuses on three outcomes: poverty status, location in the quintile distribution and position relative to the Gini-based inequality line. The survey's post-planting and post-harvest periods allow us to assess performance under different seasonal contexts. Results show that RF-RFE achieves strong classification accuracy with few predictors. For consumption, poverty status and inequality-line position are accurately predicted using a small set of expenditure categories, while quintile classification reaches about 80 percent accuracy for seasonal consumption and 60-65 percent for annual consumption predicted from a single seasonal visit. For income, poverty status reaches around 90 percent accuracy with five predictors, and inequality-line position is largely captured by labour earnings. The findings suggest that machine-learning methods can help improve survey design and reduce data requirements while retaining much of the distributional information needed to measure and monitor poverty and inequality.

2606.07651 2026-06-09 cs.LG cs.CV 新提交

KITE: A Tri-Modal Transformer Integrating Text, Images, and Knowledge Graphs for Fake News Detection

KITE:一种融合文本、图像和知识图谱的三模态假新闻检测Transformer

Kevin Patel, Shashi Bhushan Jha

发表机构 * Department of Computer Science, University of West Florida(西佛罗里达大学计算机科学系)

AI总结 提出三模态假新闻检测框架KITE,联合建模文本、视觉和知识表示,利用跨模态注意力整合特征,在基准数据集上显著优于单双模态基线。

详情
AI中文摘要

随着多模态虚假信息日益复杂,无缝融合欺骗性文本、操纵性视觉和事实错误的主张,传统的假新闻检测方法已落后。大多数先前工作侧重于文本-图像融合,或将外部知识仅作为后处理步骤应用,限制了其检测更深层语义不一致的能力。在本文中,我们引入了KITE(知识集成文本-图像编码器),一种三模态假新闻检测框架,联合建模文本、视觉和事实知识表示。KITE利用Roberta [23,14]和CLIP [24]进行语言和视觉编码,同时图注意力网络(GAT)处理从Wikidata检索的结构化事实。KITE在多模态Transformer中使用跨模态注意力[9]来集成文本、视觉和知识特征,帮助理解每种模态如何相互关联。模态特定置信度分数与最终预测一起生成,通过指示哪种输入类型对决策影响最大来提供可解释性。在基准数据集上的评估表明,KITE显著优于单模态和双模态基线,特别是在涉及图像-文本不匹配或与外部知识矛盾的情景中。

英文摘要

Traditional fake news detection methods are falling behind as multimodal misinformation grows more advanced, seamlessly blending deceptive text, manipulated visuals, and factually incorrect claims. Most prior work focuses on text-image fusion or applies external knowledge only as a post-processing step, limiting its ability to detect deeper semantic inconsistencies. In this paper, we introduce KITE (Knowledge-Integrated Text-Image Encoder), a tri-modal fake news detection framework that jointly models textual, visual, and factual knowledge representations. KITE leverages RoBERTa [23,14] and CLIP [24] for linguistic and visual encoding, while a Graph Attention Network (GAT) processes structured facts retrieved from Wikidata. KITE uses cross-modal attention [9] within a multimodal transformer to integrate text, visual, and knowledge features, helping it capture how the modalities relate to one another. Modality-specific confidence scores are generated alongside the final prediction, offering interpretability by indicating which input type most influenced the decision. Evaluations on benchmark datasets demonstrate that KITE significantly outperforms unimodal and bimodal baselines, particularly in scenarios involving image-text mismatches or contradictions with external knowledge.

2606.07685 2026-06-09 cs.LG cs.AI 新提交

Test-Time Adaptive Composition for Machine Learning as a Service (MLaaS) in IoT Environments

物联网环境下机器学习即服务(MLaaS)的测试时自适应组合

Deepak Kanneganti, Sajib Mistry, Sheik Mohammad Mostakim Fattah, Aneesh Krishna

AI总结 针对物联网环境中MLaaS组合因动态性而失效的问题,提出一种测试时自适应(TTA)组合框架,通过TTA感知可组合性模型和服务级自适应模型,在推理时调整服务并保持组合性能,显著降低计算时间。

详情
AI中文摘要

物联网(IoT)环境的动态性影响了机器学习即服务(MLaaS)组合的长期有效性。现有的自适应组合方法主要基于服务替换或重新组合,其中识别合适的替代服务既困难又耗时。为了解决这一问题,我们提出了一种新颖的测试时自适应(TTA)组合框架,用于物联网环境中的MLaaS。首先,我们引入了一个TTA感知的可组合性模型,以确定自适应服务是否仍然与现有组合兼容。接下来,我们设计了一个服务级自适应模型,在推理过程中调整单个服务,同时保持组合性能。实验结果表明,与传统的自适应方法相比,所提出的框架更有效地减少了计算时间。

英文摘要

The dynamic nature of Internet of Things (IoT) environments affects the long-term effectiveness of Machine Learning as a Service (MLaaS) compositions. Existing adaptive composition methods are mainly based on service replacement or re-composition, where identifying suitable substitutes is difficult and time-consuming. To address this, we propose a novel Test-Time Adaptive (TTA) composition framework for MLaaS in IoT environments. First, we introduce a TTA-aware composability model to determine whether adapted services remain compatible with the existing composition. Next, we design a service-level adaptation model to adjust individual services during inference while preserving composition performance. Experimental results demonstrate that the proposed framework reduces computational time more effectively than traditional adaptive approaches.

2606.07686 2026-06-09 cs.LG cs.AI 新提交

Knowledge-Inclusive Adaptive Physics-Informed Neural Network for Microbial Interaction Modelling

知识包容的自适应物理信息神经网络用于微生物相互作用建模

Ravisha Rupasinghe, Rajith Vidanaarachchi, Asela Hevapathige, Sachith Seneviratne, Sen-Lin Tang, Saman Halgamuge

发表机构 * University of Melbourne(墨尔本大学) Academia Sinica(中央研究院)

AI总结 提出一种知识包容的自适应PINN框架,通过整合文本和网络结构知识改进微生物群落建模,在真实和模拟数据集上性能提升最高53%。

Comments 33 pages

详情
AI中文摘要

物理信息神经网络(PINN)是一种在机器学习方法中以方程形式包含知识的方式。除了方程,知识还以其他形式存在,如文本和网络结构。虽然现有的基于PINN的方法从数据中发现方程参数,但它们仅依赖实验测量。我们提出一个新的PINN框架,通过整合辅助知识源来丰富参数发现。我们将该框架应用于微生物学,其中广义Lotka-Volterra(gLV)作为建模微生物群落的生物学基础。我们证明,整合知识可以改进微生物群落建模。我们的框架利用同行评审的宏基因组学文献丰富gLV参数,因为文本提供了gLV单独无法捕捉的外部影响的生物学背景。我们使用数据驱动的整合方法将这些知识与微生物丰度的实验测量相结合。我们通过显式建模微生物相互作用来整合基于网络的结构知识。我们的知识包容框架推断微生物网络,揭示生态学见解。我们根据文献中记录的生态角色验证这些发现。我们在涵盖人类和植物相关微生物群落的真实和模拟数据集上进行评估。我们的框架在无知识情况下比现有技术提升最高53%。知识添加在基于Bray-Curtis差异的准确率上带来最高23%的提升,在R²上带来47%的提升。

英文摘要

A Physics-Informed Neural Network (PINN) incorporates knowledge, in the form of equations, into machine learning methods. Beyond equations, knowledge exists in other forms, such as text and network structure. While existing PINN-based approaches discover equation parameters from data, they rely solely on experimental measurements. We propose a new PINN framework that enriches parameter discovery by incorporating auxiliary knowledge sources. We instantiate our framework for microbiology, where the generalised Lotka-Volterra (gLV) model serves as a biological foundation for modelling microbial communities. We demonstrate that incorporating knowledge improves microbial community modelling. Our framework enriches the gLV parameters using peer-reviewed metagenomics literature, as text provides biological context on external influences that gLV alone cannot capture. We combine this knowledge with experimental measurements of microbial abundance using a data-driven integration approach. We integrate network-based structural knowledge by explicitly modelling microbial interactions. Our knowledge-inclusive framework infers microbial networks, revealing ecological insights. We validate these findings against ecological roles documented in the literature. We evaluate on real and simulated datasets spanning human- and plant-associated microbial communities. Our framework improves over the state-of-the-art by up to 53%, even without knowledge. Knowledge addition yields gains of up to 23% in Bray-Curtis Dissimilarity-based accuracy and 47% in $\mathrm{R}^2$.
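
For readers unfamiliar with the generalised Lotka-Volterra model named in this abstract, the following is a minimal sketch of its standard textbook form, dx_i/dt = x_i (r_i + sum_j A_ij x_j). The function names `glv_rhs` and `euler_step`, the parameter values, and the forward-Euler integrator are illustrative assumptions; this is not the authors' PINN implementation.

```python
from typing import List


def glv_rhs(x: List[float], r: List[float], A: List[List[float]]) -> List[float]:
    """Right-hand side of the generalised Lotka-Volterra ODE.

    x: abundances, r: intrinsic growth rates,
    A[i][j]: effect of species j on the growth of species i.
    """
    n = len(x)
    return [x[i] * (r[i] + sum(A[i][j] * x[j] for j in range(n))) for i in range(n)]


def euler_step(x: List[float], r: List[float], A: List[List[float]], dt: float) -> List[float]:
    """One forward-Euler step (an assumed integrator, chosen only for brevity)."""
    dx = glv_rhs(x, r, A)
    return [xi + dt * dxi for xi, dxi in zip(x, dx)]


# Two-species example with made-up parameters: species 1 suppresses species 0.
r = [1.0, 0.5]
A = [[-1.0, -0.5],
     [0.0, -1.0]]
x = [0.1, 0.1]
for _ in range(2000):  # integrate to t = 20
    x = euler_step(x, r, A, dt=0.01)
```

With these made-up parameters the trajectory settles near the coexistence equilibrium (0.75, 0.5), which follows from setting each bracketed growth term to zero.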

2606.07692 2026-06-09 cs.LG cs.AI cs.ET 新提交

BCG-FM: A Foundation Model for Ambient Cardiac Health Sensing

BCG-FM:一种用于环境心脏健康感知的基础模型

Magnus Ruud Kjaer, Haejun Han, Ashish Neupane, David Q. Sun

发表机构 * Department of Computer Science and Engineering, University of California, San Diego(加州大学圣迭戈分校计算机科学与工程系)

AI总结 提出首个环境机械生物信号基础模型BCG-FM,利用床垫压电传感器无感采集心冲击图,通过14.6万人的275万小时数据预训练,在生物年龄估计上达到3.26年MAE,并实现15种健康状态的临床相关判别。

详情
AI中文摘要

可穿戴生物信号的基础模型在多项临床任务中已匹配或超越监督专家,但所有模型都依赖于需要用户主动操作的模态——佩戴设备或访问睡眠实验室。我们提出BCG-FM,首个用于环境机械生物信号的基础模型。嵌入床垫表面的压电传感器每晚无感记录心冲击图(BCG);我们使用参与者级对比学习,基于145,985名个体的总计275万小时夜间记录预训练BCG-FM,这是迄今为止最大的原始波形生物信号预训练语料库。冻结的BCG-FM嵌入在生物年龄估计上达到3.26年MAE(所有环境、非接触模态中最低报告值),并在15种自我报告健康状况和三个独立外部队列中产生临床相关的判别。仅500名标注参与者的预训练表示优于在3,372名参与者上训练的完全监督基线,且表示质量与对比批次大小呈对数线性关系。这些结果确立了环境、纵向机械生物信号作为健康基础模型的可行模态。

英文摘要

Foundation models for wearable biosignals have matched or exceeded supervised specialists across a range of clinical tasks, yet all rely on modalities that require deliberate user action: wearing a device or visiting a sleep lab. We introduce BCG-FM, the first foundation model for ambient mechanical biosignals. A piezoelectric sensor embedded in the bed surface records ballistocardiography (BCG) each night without user effort; we pretrain BCG-FM with participant-level contrastive learning on a total of 2.75 million hours of nightly recordings from 145,985 individuals, the largest raw-waveform biosignal pretraining corpus to date. Frozen BCG-FM embeddings achieve 3.26-year MAE on biological-age estimation (the lowest reported for any ambient, contactless modality) and yield clinically relevant discrimination across 15 self-reported health conditions and three independent external cohorts. Pretrained representations from only 500 labeled participants outperform a fully supervised baseline trained on 3,372, and representation quality scales log-linearly with contrastive batch size. These results establish ambient, longitudinal mechanical biosignals as a viable modality for health foundation models.

2606.07694 2026-06-09 cs.LG stat.ML 新提交

Vessel Traffic Flow Prediction on Sparse Data via Spatio-Temporal Graph Neural Networks with a Learnable Tweedie Head

基于可学习Tweedie头的时空图神经网络在稀疏数据上的船舶交通流预测

Kyeongjun Lee, Heeyoung Kim

发表机构 * Korea Advanced Institute of Science and Technology (KAIST)(韩国科学技术院)

AI总结 针对船舶交通流数据高度稀疏且间歇性爆发的问题,提出一种模型无关的可学习Tweedie头作为即插即用输出模块,通过优化闭合形式的Tweedie单元偏差并预测均值,同时学习节点级方差幂以捕获港口区域异质性,在真实AIS数据上显著提升RMSE。

详情
AI中文摘要

准确的船舶交通流预测对于智能港口运营和航行安全至关重要。然而,海上交通流数据通常高度稀疏且具有间歇性爆发,使得稳健预测具有挑战性。在这种条件下,传统的时空图神经网络(ST-GNNs)可能退化为保守的接近零的预测,无法捕获非零活动。尽管零膨胀负二项(ZINB)模型部分解决了过多零值问题,但其两部分公式在突变附近仍可能保持保守。为了解决这些问题,我们提出了一种模型无关的可学习Tweedie头,它可以作为即插即用的输出模块附加到任意ST-GNN骨干网络上。与通常需要替代目标的基于似然的Tweedie训练不同,我们的方法优化了闭合形式的Tweedie单元偏差,并预测均值以进行点预测,同时学习节点级方差幂以捕获港口区域间的异质性变异性。在由洛杉矶和长滩港口的真实AIS数据构建的海上交通图上的实验表明,所提出的头在多个ST-GNN骨干网络上一致地提高了RMSE,特别是在非零事件上,从而为实际海上交通控制提供了更可靠的预测。

英文摘要

Accurate vessel traffic flow prediction is crucial for smart port operations and navigational safety. However, maritime traffic flow data are often highly sparse with intermittent bursts, making robust forecasting challenging. Under such conditions, conventional spatio-temporal graph neural networks (ST-GNNs) can degrade toward conservative near-zero predictions and fail to capture non-zero activity. Although zero-inflated negative binomial (ZINB) models partially address excess zeros, their two-part formulation can still remain conservative around abrupt transitions. To address these issues, we propose a model-agnostic learnable Tweedie head that can be attached as a plug-and-play output module to arbitrary ST-GNN backbones. Instead of likelihood-based Tweedie training, which typically requires surrogate objectives, our approach optimizes the closed-form Tweedie unit deviance and predicts the mean for point forecasting while learning a node-level variance power to capture heterogeneous variability across port areas. Experiments on a maritime traffic graph constructed from real-world AIS data in the Port of Los Angeles and Long Beach show that the proposed head consistently improves RMSE across multiple ST-GNN backbones, especially on non-zero events, leading to more reliable forecasts for practical maritime traffic control.
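
The closed-form Tweedie unit deviance mentioned in this abstract can be sketched as follows for a variance power 1 < p < 2, the compound Poisson-gamma range that places mass at exactly zero. The formula is the standard one; the function names and the sigmoid mapping used to keep p inside (1, 2) are assumptions for illustration, not the authors' code.

```python
import math


def tweedie_unit_deviance(y: float, mu: float, p: float) -> float:
    """Tweedie unit deviance for 1 < p < 2, with y >= 0 and mu > 0."""
    if not (1.0 < p < 2.0):
        raise ValueError("this closed form assumes 1 < p < 2")
    if y < 0.0 or mu <= 0.0:
        raise ValueError("requires y >= 0 and mu > 0")
    a = y ** (2.0 - p) / ((1.0 - p) * (2.0 - p))
    b = y * mu ** (1.0 - p) / (1.0 - p)
    c = mu ** (2.0 - p) / (2.0 - p)
    return 2.0 * (a - b + c)


def variance_power(raw: float) -> float:
    """Map an unconstrained parameter to (1, 2) via a sigmoid (assumed parameterisation)."""
    return 1.0 + 1.0 / (1.0 + math.exp(-raw))


# A zero observation still gives a finite, positive loss,
# which is why this objective suits sparse count-like targets.
p = variance_power(0.0)  # 1.5
loss_at_zero = tweedie_unit_deviance(0.0, 1.0, p)
```

In a learnable head, one such raw parameter per graph node would give each port area its own variance power; how the paper actually constrains p is not stated in the abstract.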

2606.07698 2026-06-09 cs.LG cs.AI 新提交

Pharmacogenomic Knowledge Graph Augmentation for Graph Neural Network-Based Drug-Drug Interaction Prediction

基于图神经网络的药物相互作用预测的药理基因组学知识图谱增强

Juergen Dietrich

发表机构 * AI Solutions Berlin

AI总结 本研究通过整合PharmGKB的药理基因组学先验知识(CYP酶注释)作为特征向量,增强图神经网络在药物相互作用预测中的性能,在配对数据划分下显著提升DDI类型分类,但未能突破信息天花板。

Comments 13 pages

详情
AI中文摘要

应用于药物相互作用(DDI)预测的图神经网络(GNN)仅依赖由SMILES衍生的分子结构图。该系列先前的工作表明,模型性能受限于训练标签的结构信息含量(即信息天花板),仅靠架构改进无法克服。本研究探讨来自PharmGKB数据库的药理基因组学先验知识是否通过提供独立于分子结构且与之互补的代谢通路背景,部分突破这一天花板。提取四种临床相关亚型(CYP2D6、CYP3A4、CYP2C19、CYP2C9)的细胞色素P450(CYP)酶底物、抑制剂和诱导剂注释,并将其作为12维特征向量在交互预测前与分子嵌入拼接。在配对水平和药物水平数据划分下进行实验,以量化对未见药物的泛化能力。结果表明,在配对水平划分条件下,知识图谱(KG)增强显著改善了DDI类型分类(F1宏平均:0.532对比基线0.241),而二元交互检测和药物水平泛化仍受信息天花板限制(AUC虚高:0.224对比基线0.250)。对严格保留化合物的机制验证确认,增强优先改善CYP2C9介导的交互预测,概率从基线0.033-0.117提升至KG增强后的0.560-0.586。在Tox21基准上的单分子毒性预测扩展实验证实,该效果取决于药理基因组学注释覆盖度。这些发现为该系列后续研究提出的多模态框架提供了动机。

英文摘要

Graph neural networks (GNNs) applied to drug-drug interaction (DDI) prediction rely exclusively on molecular structure encoded as SMILES-derived graphs. Prior work in this series demonstrated that model performance is bounded by the structural information content of training labels -- an Information Ceiling -- that architectural refinements alone cannot overcome. The present study investigates whether pharmacogenomic prior knowledge from the PharmGKB database partially closes this ceiling by providing metabolic pathway context that is independent of, and complementary to, molecular structure. Cytochrome P450 (CYP) enzyme substrate, inhibitor, and inducer annotations for four clinically relevant isoforms (CYP2D6, CYP3A4, CYP2C19, CYP2C9) are extracted and incorporated as a 12-dimensional feature vector concatenated to the molecular embedding prior to interaction prediction. Experiments are conducted under both pair-level and drug-level data splits to quantify generalization to unseen drugs. Results indicate that knowledge graph (KG) augmentation substantially improves DDI type classification under pair-level split conditions (F1-macro: 0.532 vs. 0.241 baseline), while binary interaction detection and drug-level generalization remain bounded by the Information Ceiling (AUC inflation: 0.224 vs. 0.250 baseline). Mechanistic validation on strictly held-out compounds confirms that augmentation preferentially improves CYP2C9-mediated interaction prediction, with probabilities increasing from 0.033-0.117 (baseline) to 0.560-0.586 (KG-augmented). An extension to single-molecule toxicity prediction on the Tox21 benchmark confirms that the effect is contingent on pharmacogenomic annotation coverage. These findings motivate the multimodal framework proposed for the subsequent study in this series.
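
The 12-dimensional feature vector described in this abstract corresponds to 4 isoforms times 3 annotation roles. A minimal sketch of such an encoding follows; the binary 0/1 encoding, the ordering of the dimensions, the function names, and the example drug are all assumptions, since the abstract does not specify them.

```python
ISOFORMS = ("CYP2D6", "CYP3A4", "CYP2C19", "CYP2C9")
ROLES = ("substrate", "inhibitor", "inducer")


def cyp_features(annotations):
    """Encode a set of (isoform, role) pairs as a 12-dimensional 0/1 vector.

    Dimension order (assumed): isoform-major, role-minor.
    """
    return [1.0 if (iso, role) in annotations else 0.0
            for iso in ISOFORMS for role in ROLES]


def augment(embedding, annotations):
    """Concatenate the CYP features to a molecular embedding before prediction."""
    return list(embedding) + cyp_features(annotations)


# Hypothetical drug annotated as a CYP2C9 substrate and a CYP3A4 inhibitor.
example = {("CYP2C9", "substrate"), ("CYP3A4", "inhibitor")}
vec = cyp_features(example)
augmented = augment([0.1, 0.2], example)  # toy 2-dimensional embedding
```

A drug with no PharmGKB annotations maps to the all-zero vector under this scheme, which is consistent with the abstract's remark that the effect depends on annotation coverage.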

2606.07700 2026-06-09 cs.LG cs.AI 新提交

EssentialGIN: a new approach for gene essentiality prediction based on graph isomorphism neural networks

EssentialGIN:基于图同构神经网络的基因必需性预测新方法

Sahar Mansouri-Rad, Zahra Narimani, Parvin Razzaghi, Nazanin Hosseinkhan

发表机构 * Department of Computer Science and Information Technology, Institute for Advanced Studies in Basic Sciences (IASBS)(基础科学高等研究院(IASBS)计算机科学与信息技术系) Endocrine Research Center, Institute of Endocrinology and Metabolism, Iran University of Medical Sciences(伊朗医科大学内分泌与代谢研究所内分泌研究中心)

AI总结 提出基于图同构神经网络(GIN)的EssentialGIN模型,整合PPI网络拓扑与基因表达、直系同源、亚细胞定位等多源生物数据,在人类等复杂生物中显著优于现有方法。

Comments 19 pages, 5 figures, 8 tables

详情
AI中文摘要

背景:必需基因(蛋白质)的预测是一个基本且具有挑战性的问题,同时在湿实验中进行非常昂贵且耗时。仅基于计算方法(引入湿实验候选)使用中心性度量预测必需基因并不准确,会导致大量假阳性;因此,最近的研究使用更复杂的模型(如深度学习)以及整合生物信息来识别必需基因。 方法:在这项工作中,我们专注于图同构网络,将蛋白质作为PPI网络中的节点进行嵌入,以保留PPI网络的拓扑特征,并整合生物数据,如基因表达数据、基因直系同源信息和基因亚细胞定位信息,引入了一种用于预测必需基因的深度架构。本文修改了图同构网络架构以嵌入节点信息。 结果:我们的实验证明,所提出的方法优于基于中心性的基线方法以及基于机器学习的方法,如Node2Vec、MLP和图注意力网络(GAT)。 结论:在本文中,我们观察到使用整合生物数据(作为节点属性)并保留网络拓扑的图同构网络可以显著提高必需基因预测的准确性。在较简单的生物体(如大肠杆菌和黑腹果蝇)中,使用Node2Vec嵌入的多层感知机等方法也表现良好,但在人类中,所引入的架构显著优于深度学习和其他图神经网络解决方案。 关键词:必需基因预测,图神经网络,图同构网络,PPI网络,节点嵌入

英文摘要

Background: Predicting essential genes (proteins) is a basic and challenging problem, and identifying them in wet-lab experiments is very costly and time-consuming. Predicting essential genes using only computational methods based on centrality measures (to propose wet-lab candidates) is inaccurate and yields a large number of false positives; recent research therefore uses more complex models, such as deep learning, together with the integration of biological information to identify essential genes. Methods: In this work we focus on graph isomorphism networks, embedding proteins as nodes of the PPI network so as to preserve its topological features, and we integrate biological data such as gene expression, gene orthology, and gene subcellular localization to introduce a deep architecture for predicting essential genes. The graph isomorphism network architecture is modified in this work to embed node information. Results: Our experiments show that the proposed method outperforms baseline centrality-based methods as well as machine-learning-based methods such as Node2Vec, MLP, and graph attention networks (GAT). Conclusion: We observed that graph isomorphism networks that integrate biological data (as node attributes) and preserve network topology can significantly improve essential gene prediction accuracy. In simpler organisms such as E. coli and D. melanogaster, methods such as a multi-layer perceptron using Node2Vec embeddings also perform very well, but in H. sapiens the introduced architecture significantly outperforms deep learning and other graph neural network solutions. Keywords: essential gene prediction, graph neural network, graph isomorphism network, PPI network, node embedding

2606.07704 2026-06-09 cs.LG cs.AI 新提交

FunctionEvolve: Structure-Guided Symbolic Regression with LLMs

FunctionEvolve:基于大型语言模型的结构引导符号回归

Zeyu Xia, Jun Zhu, Dong Yan

发表机构 * Bosch Center for Artificial Intelligence(博世人工智能中心) Department of Computer Science and Technology, Tsinghua University(清华大学计算机科学与技术系) Tsinghua-Bosch Joint Center for ML, Tsinghua University(清华大学-博世联合机器学习中心)

AI总结 提出FunctionEvolve框架,利用表达式树组织符号回归搜索,通过结构摘要、局部树编辑和结构感知系数拟合,在LLM-SRBench合成子集上以Claude Opus 4.6实现82.9%的SA@50,为同骨干基线的4.5倍。

详情
AI中文摘要

符号回归旨在从数据中揭示显式的科学定律。近期方法使用大型语言模型(LLM)引导基于背景文本的变异,这比随机遗传编程更具方向性。然而,精确的符号恢复既需要语义引导,也需要显式结构,以便通过有效的符号表示进行领域信息搜索。当前的LLM驱动系统仍然是结构盲的:它们在模糊的候选者中进行选择,缺乏局部变异的显式机制,并依赖脆弱的系数拟合,这可能会低估正确的骨架。我们提出FunctionEvolve,一个使用表达式树组织整个搜索的进化框架:结构摘要促进多样化的父代选择,局部树编辑保留有用的子表达式,结构感知拟合分解、约束和简化系数,以实现更可靠的评分。它仅使用初等函数族,无需额外的领域特定规则限制泛化能力。在LLM-SRBench的129任务合成子集上,使用Claude Opus 4.6的FunctionEvolve恢复了107个精确形式,达到82.9%的SA@50,是同骨干基线的4.5倍,以及55.8%的SA@1,是此前最强已发布top-1结果的3.6倍。消融实验表明,结构可见搜索是可靠恢复的核心,LLM引导的改进和结构感知系数优化作为必要的提议和评分机制。我们还对基准进行了审计,显示其材料科学子集中的共线性导致了可识别性问题。

英文摘要

Symbolic regression aims to uncover explicit scientific laws from data. Recent methods use LLMs to guide mutation from background text, which is more directed than random genetic programming. However, exact symbolic recovery requires both semantic guidance and explicit structure, so that domain-informed search is carried out through valid symbolic representation. Current LLM-driven systems remain structure-blind: they select among opaque candidates, lack explicit mechanisms for local mutation, and rely on brittle coefficient fitting that can undervalue correct skeletons. We propose FunctionEvolve, an evolutionary framework using expression trees to organize the whole search: structural summaries promote diverse parent selection, local tree edits preserve useful subexpressions, and structure-aware fitting decomposes, constrains, and simplifies coefficients for more reliable scoring. It uses only elementary function families, without additional domain-specific rules limiting generalization. On the 129-task synthetic subset of LLM-SRBench, FunctionEvolve with Claude Opus 4.6 recovers 107 exact forms, reaching 82.9% SA@50, 4.5x above same-backbone baselines, and 55.8% SA@1, 3.6x above the strongest previously published top-1 result. Ablations show that structure-visible search is central to reliable recovery, with LLM-guided refinements and structure-aware coefficient optimization serving as essential proposal and scoring mechanisms. We also audit the benchmark and show that collinearity in its materials-science subset creates identifiability issues.

2606.07707 2026-06-09 cs.LG 新提交

Decoding Naturalistic Emotion Dynamics from the Brain: An LLM-Enhanced Regression Framework

从大脑解码自然情感动态:一种LLM增强的回归框架

Lemei Zhang, Peng Liu, Hans Dahle Kvadsheim, August Sætre Aasvær, Shuer Ye, Reza Bonyadi, Maryam Ziaei, Jon Atle Gulla

发表机构 * NTNU(挪威科技大学) Kavli Institute for Systems Neuroscience, NTNU(挪威科技大学卡弗里系统神经科学研究所) Microsoft(微软)

AI总结 提出多目标回归框架,利用LLM从自然叙事中提取连续情感特征,结合动态功能连接和机器学习算法,实现从fMRI数据中解码连续情感轨迹,并揭示可解释的情感特异性脑网络拓扑。

详情
AI中文摘要

从神经信号解码情感状态通常被表述为基于情感稳定刺激的离散单标签分类任务,这种表述过于简化了人类情感的连续、流动和共现特性。本研究通过采用多目标回归框架来重新概念化情感解码,将随时间变化的多个重叠情感维度作为连续轨迹进行跟踪。利用大型语言模型(LLM)的强大泛化能力,我们从自然听觉叙事《爱丽丝梦游仙境》中提取了细粒度的连续情感特征,作为人类fMRI数据集中主观情感的可扩展代理。与标准分类范式或滤除网络动态的大规模单变量减法对比不同,我们利用正则化和基于核的机器学习算法作为连续估计器来跟踪宏观神经状态变化的幅度。我们证明,基于动态功能连接(DFC)时间快照训练的模型显著优于静态感兴趣区域(ROI)幅度表示,能够有效捕捉快速变化的叙事输入下的连续情感轨迹。此外,通过实施图论可解释人工智能(XAI)技术,我们解构了底层预测特征,揭示了高度可解释的、情感特定的拓扑配置。总体而言,这些结果凸显了LLM自动注释在情感神经科学中的实用性,并为心理建构主义框架提供了令人信服的实证证据,表明动态、分布式的网络交互比严格定位主义的情感解释具有更强的解释力。

英文摘要

Decoding emotional states from neural signals has typically been framed as a discrete, single-label classification task based on emotionally stable stimuli, a formulation that oversimplifies the continuous, fluid, and co-occurring nature of human affect. This study reconceptualizes emotion decoding by adopting a multi-target regression framework to track multiple overlapping emotional dimensions as continuous trajectories over time. Leveraging the robust generalization capabilities of Large Language Models (LLMs), we extracted fine-grained, continuous sentiment profiles from a naturalistic auditory narrative, Alice in Wonderland, to serve as scalable proxies for subjective affect in a human fMRI dataset. Departing from standard classification paradigms or mass-univariate subtractive contrasts that filter out network dynamics, we leverage regularized and kernel-based machine learning algorithms as continuous estimators to track the magnitude of macroscale neural state variations. We demonstrate that models trained on temporal snapshots of Dynamic Functional Connectivity (DFC) significantly outperform static region-of-interest (ROI) amplitude representations, effectively capturing continuous emotional trajectories under rapidly fluctuating narrative input. Furthermore, by implementing graph-theoretical Explainable AI (XAI) techniques, we deconstruct the underlying predictive features to reveal highly interpretable, emotion-specific topological configurations. Collectively, these results highlight the utility of LLM-automated annotation in affective neuroscience and provide compelling empirical evidence for psychological constructionist frameworks, demonstrating that dynamic, distributed network interactions offer superior explanatory power over strictly locationist accounts of emotion.

2606.07714 2026-06-09 cs.LG cs.AI cs.HC 新提交

Beyond Accuracy: Interpreting Topic Representation in Suicide Ideation Detection Models

超越准确率:解释自杀意念检测模型中的主题表示

Hamideh Ghanadian, Isar Nejadgholi, Hussein Al Osman

发表机构 * University of Ottawa(渥太华大学) National Research Council Canada(加拿大国家研究委员会)

AI总结 本研究通过可视化与几何分析,探究自杀意念检测模型内部如何编码心理风险因素,发现主题增强能提升低表征风险因素表示的清晰度与可解释性。

详情
AI中文摘要

自杀意念检测模型通常使用聚合性能指标进行评估,但对其内部如何表示具有心理意义的风险因素知之甚少。在高风险心理健康应用中,理解这些内部表示对于安全性、透明度和负责任部署至关重要。在这项工作中,我们超越准确率,分析在原始和主题增强数据集上训练的自杀检测模型如何在其内部表示空间中编码心理风险因素。通过可视化和几何分析,我们检查主题相关特征的连贯性和可分离性。我们的结果表明,主题感知增强提高了低表征心理社会风险因素(如移民、家庭问题和金融危机)的清晰度和区分度。这些发现表明,增强不仅提高了模型性能,还导致了更结构化和可解释的内部表示。

英文摘要

Suicide ideation detection models are typically evaluated using aggregate performance metrics, yet little is known about how they internally represent psychologically meaningful risk factors. In high-stakes mental health applications, understanding these internal representations is essential for safety, transparency, and responsible deployment. In this work, we move beyond accuracy and analyze how suicide detection models trained on original and topic-augmented datasets encode psychological risk factors in their internal representation space. Using visualization and geometric analysis, we examine the coherence and separability of topic-related features. Our results show that topic-aware augmentation increases the clarity and distinctness of underrepresented psychosocial risk factors such as immigration, family issues, and financial crisis. These findings suggest that augmentation not only improves model performance but also leads to more structured and interpretable internal representations.

2606.07724 2026-06-09 cs.LG 新提交

A Geometry-Aware Triplane Field Network for Vehicle Aerodynamic Prediction

几何感知三平面场网络用于车辆气动预测

Kangkang Qi, Huiyu Yang, Keqi Ding, Yunpeng Wang, Yuntian Chen, Yuanwei Bin, Rikui Zhang, Jianchun Wang

发表机构 * Southern University of Science and Technology(南方科技大学) Shenzhen Tenfong Technology Co., Ltd.(深圳腾风科技有限公司) Eastern Institute of Technology(东方理工高等研究院)

AI总结 提出几何感知三平面场网络(GTF-Net),通过双流骨干网络结合自适应傅里叶神经算子与CNN,实现车辆气动压力和壁面剪切应力的高效预测,在精度上超越现有方法。

Comments 28 pages, 8 figures

详情
AI中文摘要

高保真计算流体动力学(CFD)对车辆气动分析至关重要,但其成本仍制约早期设计探索。基于机器学习的表面场预测提供了一种更快的替代方案,前提是模型能高效捕捉全局流动上下文和局部几何细节。本文提出一种基于机器学习的方法,名为几何感知三平面场网络(GTF-Net),用于车辆气动压力和壁面剪切应力预测。GTF-Net通过共享多层感知器(MLP)和光滑双线性光栅化,直接从采样表面点构建三平面特征。然后,这些平面由双流骨干网络处理,该网络将自适应傅里叶神经算子(AFNO)谱混合与卷积神经网络(CNN)细化相结合,从而在同一表示中建模长程气动耦合和局部几何诱导变化。在查询阶段,采样的三平面特征与车辆对齐的方向坐标、法向投影特征和基于体素的曲率代理相结合。将GTF-Net与Transolver、几何信息神经算子(GINO)以及基于三平面的代理模型TripNet进行比较。GTF-Net将压力预测的最强基线相对L2误差从0.157降至0.145,壁面剪切应力预测从0.237降至0.226。消融结果表明,AFNO混合、局部CNN细化和查询侧几何编码均有助于提高精度,支持了将结构化三平面表示与显式气动几何线索相结合的提议机制。

英文摘要

High-fidelity computational fluid dynamics (CFD) is crucial to vehicle aerodynamic analysis, but its cost still constrains early-stage design exploration. Machine-learning-based surface-field prediction offers a faster alternative if the model can efficiently capture both global flow context and local geometric detail. This work proposes a machine-learning-based method, named the geometry-aware triplane field network (GTF-Net), for vehicle aerodynamic pressure and wall shear stress prediction. GTF-Net constructs triplane features directly from sampled surface points through a shared multilayer perceptron (MLP) and smooth bilinear rasterization. The planes are then processed by a dual-stream backbone that combines adaptive Fourier neural operator (AFNO) spectral mixing with convolutional neural network (CNN) refinement, so long-range aerodynamic coupling and local geometry-induced variations are modeled in the same representation. At query stage, sampled triplane features are combined with vehicle-aligned directional coordinates, normal-projection features, and a voxel-based curvature proxy. GTF-Net is compared with Transolver, geometry-informed neural operator (GINO), and TripNet, a triplane-based surrogate model. GTF-Net improves the relative L2 error from the strongest baseline value of 0.157 to 0.145 for pressure prediction and from 0.237 to 0.226 for wall shear stress prediction. Ablation results show that AFNO mixing, local CNN refinement, and query-side geometric encoding each contribute to accuracy, supporting the proposed mechanism of combining structured triplane representation with explicit aerodynamic geometry cues.
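
To put the reported error figures in context, the sketch below computes a relative L2 error in its common form, ||pred - true||_2 / ||true||_2, and the relative reduction implied by the numbers quoted in the abstract. The abstract does not define the metric, so the definition used here is an assumption, and the function names are illustrative.

```python
import math


def relative_l2(pred, true):
    """Relative L2 error: ||pred - true||_2 / ||true||_2 (assumed definition)."""
    num = math.sqrt(sum((p - t) ** 2 for p, t in zip(pred, true)))
    den = math.sqrt(sum(t ** 2 for t in true))
    return num / den


def relative_reduction(baseline, ours):
    """Fraction by which the error drops relative to the baseline."""
    return (baseline - ours) / baseline


# Error values quoted in the abstract.
pressure_gain = relative_reduction(0.157, 0.145)
shear_gain = relative_reduction(0.237, 0.226)
```

Under this reading, the quoted values correspond to relative error reductions of roughly 7.6% for pressure and 4.6% for wall shear stress.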

2606.07982 2026-06-09 cs.LG 新提交

Overcoming the Limits of Finite Difference Method; Physics-Informed Neural Network for Noisy High-Dimensional Heat Diffusion

克服有限差分法的局限性:用于含噪高维热扩散的物理信息神经网络

Shreesh Bhattarai, Harish Chandra Bhandari

发表机构 * Kathmandu University(加德满都大学)

AI总结 针对高维含噪热扩散问题,提出物理信息神经网络(PINN)框架,在噪声和维度较高时显著优于有限差分法(FDM),实现精度与效率的权衡。

详情
AI中文摘要

高维瞬态热扩散在噪声边界条件下暴露了经典数值方法的根本局限性:在物理噪声不可避免的情况下,精度会灾难性地下降。本文提出了一个物理信息神经网络(PINN)框架,作为在一维、二维和三维空间中对这一问题的系统性解决方案,建立了明确的操作机制,重新定义了含噪热系统中求解器的选择。在三维空间中,当边界噪声为20%时,PINN保持约91%的精度,而有限差分法(FDM)降至36%,这是一个明显的决定性优势。这一点在物理铜热系统中得到进一步证实,在真实噪声条件下,PINN将边界重建误差降低了3.3倍。这种噪声鲁棒性伴随着维度驱动的效率交叉:在三维空间中,PINN所需的时空节点少于FDM,同时实现更高的精度,揭示了经典离散化在大规模下的真实成本。这些发现重新定义了求解器的选择:决定性的轴不仅是精度,而是噪声暴露和维度的共同作用。当噪声和维度都较高时,经典求解器范式不足;本工作为证明PINN在此类机制中作为操作标准提供了基础。

英文摘要

High-dimensional transient heat diffusion under noisy boundary conditions exposes a fundamental limitation of classical numerical methods: accuracy degrades catastrophically where physical noise is unavoidable. This paper presents a Physics-Informed Neural Network (PINN) framework as a systematic solution to this problem across one, two, and three spatial dimensions, establishing clear operational regimes that redefine solver selection in noisy thermal systems. Under 20% boundary noise in 3D, PINN sustains approximately 91% accuracy while Finite Difference Method (FDM) collapses to 36%, a clear decisive advantage. This is further confirmed in a physical copper thermal system, where PINN reduces boundary reconstruction error by 3.3 times under realistic noise conditions. This noise resilience is accompanied by a dimensionality-driven efficiency crossover: PINN requires fewer spacetime nodes than FDM in 3D while achieving superior accuracy, exposing the true cost of classical discretization at scale. These findings reframe solver selection: the decisive axis is not accuracy alone, but noise exposure and dimensionality jointly. When noise and dimensionality are both high, the classical solver paradigm is insufficient; this work provides the foundation to justify PINN as the operational standard in such regimes.
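
As a reference point for the finite-difference baseline discussed in this abstract, here is a minimal sketch of the classic explicit (forward-time, centred-space) scheme for the 1D heat equation u_t = alpha * u_xx. The abstract does not say which FDM variant the authors use, so the scheme, the function name `ftcs_step`, and the boundary values are assumptions.

```python
def ftcs_step(u, r):
    """One explicit FTCS step with fixed (Dirichlet) end values.

    r = alpha * dt / dx**2; the explicit scheme is stable only for r <= 0.5.
    """
    if not (0.0 < r <= 0.5):
        raise ValueError("explicit scheme is unstable unless 0 < r <= 0.5")
    new = list(u)  # end points are copied unchanged
    for i in range(1, len(u) - 1):
        new[i] = u[i] + r * (u[i + 1] - 2.0 * u[i] + u[i - 1])
    return new


# Rod initially at 0 with its ends held at 0 and 1 (illustrative values).
u = [0.0] * 10 + [1.0]
for _ in range(2000):
    u = ftcs_step(u, r=0.4)
```

The boundary values enter every interior update directly, so noise in the end values propagates into the whole solution; the run above, with clean boundaries, relaxes to the linear steady-state profile.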

2606.08037 2026-06-09 cs.LG cs.AI 新提交

SafeECGMatch: Calibration-Aware Joint Frequency and Time Space Semi-Supervised Learning for Open-Set ECG Classification

SafeECGMatch:面向开放集心电图分类的校准感知联合频率与时间空间半监督学习

Hongkyu Koh, Ikbeom Jang

发表机构 * Hankuk University of Foreign Studies(韩国外国语大学)

AI总结 提出SafeECGMatch框架,通过双分支架构提取时频特征,结合自适应标签平滑和温度缩放校准模型,在标签分布不匹配下实现可靠的开集分类和OOD检测。

Comments 8 pages. Accepted to the KDD-UC 2026 (ACM International Conference on Data Mining and Knowledge Discovery - Undergraduate Consortium 2026)

详情
AI中文摘要

心电图(ECG)分类模型常面临严重的标签稀缺问题,使得半监督学习(SSL)成为降低标注成本的有效策略。然而,在临床环境中,未标注数据池通常包含分布外(OOD)异常或标注集中不存在的诊断类别。标准SSL会强制对这些未见类别分配错误的伪标签,产生过度自信的预测。为解决此问题,我们提出SafeECGMatch,一个校准感知的安全SSL框架,用于标签分布不匹配下的单标签ECG分类。方法上,SafeECGMatch采用双分支架构,通过ECG特定的数据增强提取时频潜在表示。关键地,它通过自适应标签平滑和温度缩放动态对齐置信度与经验准确性,在时间和频谱域上校准多类分类器和OOD检测器。这种联合优化实现了可信的OOD拒绝和可靠的伪标签分配。在PTB-XL和PhysioNet/CinC Challenge基准上评估,SafeECGMatch达到了最先进的准确性和校准性能,推动了生理时间序列中可靠知识发现。代码可在https://github.com/labhai/SafeECGMatch获取。

英文摘要

Electrocardiogram (ECG) classification models often suffer from severe label scarcity, making semi-supervised learning (SSL) an attractive strategy for reducing annotation costs. In clinical settings, however, unlabeled pools frequently contain out-of-distribution (OOD) anomalies or diagnostic groups absent from the labeled set. Standard SSL forces incorrect pseudo-labels onto these unseen classes, producing overconfident predictions. To address this, we propose SafeECGMatch, a calibration-aware safe SSL framework for single-label ECG classification under label distribution mismatch. Methodologically, SafeECGMatch employs a dual-branch architecture extracting time-frequency latent representations via ECG-specific augmentations. Crucially, it dynamically aligns confidence with empirical accuracy through adaptive label smoothing and temperature scaling, calibrating both the multiclass classifier and the OOD detector across temporal and spectral domains. This joint optimization allows trustworthy OOD rejection and reliable pseudo-labeling. Evaluated on the PTB-XL and PhysioNet/CinC Challenge benchmarks, SafeECGMatch achieves state-of-the-art accuracy and calibration, advancing reliable knowledge discovery in physiological time-series. Code is available at https://github.com/labhai/SafeECGMatch.
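
Temperature scaling, one of the calibration tools named in this abstract, divides the logits by a scalar T before the softmax. The sketch below shows the generic technique only; the function name and the example logits are assumptions, and the adaptive scheme used by SafeECGMatch is not described in enough detail here to reproduce.

```python
import math


def softmax_with_temperature(logits, temperature=1.0):
    """Softmax of logits / temperature; T > 1 softens, T < 1 sharpens."""
    if temperature <= 0.0:
        raise ValueError("temperature must be positive")
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


logits = [4.0, 1.0, 0.0]  # made-up classifier outputs
sharp = softmax_with_temperature(logits, temperature=1.0)
soft = softmax_with_temperature(logits, temperature=2.0)
```

Raising T lowers the top-class confidence without changing the predicted class, which is why the method is used to curb overconfident pseudo-labels.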

2606.08100 2026-06-09 cs.LG 新提交

Constraint-Aware Optimization for Robust Protein Stability Prediction

约束感知优化用于鲁棒蛋白质稳定性预测

A Shivram, Aneesh S. Chivukula, Manik Gupta, Sourav Chowdhury

发表机构 * Birla Institute of Technology and Science Pilani, Hyderabad Campus(比拉理工学院海得拉巴校区)

AI总结 提出约束感知优化框架,结合平衡均方误差、孪生反对称正则化器和OOD边缘一致性损失,在不改变SPURS架构下提升蛋白质稳定性预测的鲁棒性,在多个基准上取得显著改进。

详情
AI中文摘要

多模态$\Delta\Delta G$预测器结合蛋白质语言模型与逆折叠表示,在Megascale数据集上实现了强分布内准确性,但在分布外蛋白质上鲁棒性有限,在配对突变基准上存在持续的正反向偏差,且对稀有稳定突变的代表性不足。现有方法主要通过额外的架构组件来解决这些局限性,而优化层面的干预相对未被充分探索。我们引入了一个约束感知优化框架,结合平衡均方误差、孪生反对称正则化器以及在每个位置特征表示上的新颖OOD边缘一致性损失,无需对SPURS主干进行架构更改。在十一个基准和三个随机种子上,该框架将S669上的Spearman相关性从0.486提高到0.540(种子间$\sigma=0.002$),在不修改架构的情况下匹配已发表的SPURS基线(0.50),并将S461上的相关性从0.653提高到0.711,在另外五个OOD数据集上取得一致的小幅提升。在Ssym上的受控诊断表明,反对称训练并未消除系统性的正反向偏差,表明增益是通过隐式正则化而非精确热力学约束强制执行来实现的。

英文摘要

Multimodal $ΔΔG$ predictors integrating protein language models with inverse-folding representations achieve strong in-distribution accuracy on the Megascale dataset but exhibit limited robustness on out-of-distribution (OOD) proteins, persistent forward-reverse bias on paired-mutation benchmarks, and under-representation of rare stabilizing mutations. Existing approaches address these limitations primarily through additional architectural components, leaving optimization-level intervention comparatively underexplored. We introduce a constraint-aware optimization framework combining Balanced Mean Squared Error, a Siamese anti-symmetric regularizer, and a novel OOD-margin consistency loss on the per-position feature representation, requiring no architectural changes to the SPURS backbone. Across eleven benchmarks and three random seeds, the framework improves Spearman correlation on S669 from 0.486 to 0.540 ($σ=0.002$ across seeds), matching the published SPURS baseline (0.50) without architectural modification, and on S461 from 0.653 to 0.711, with consistent smaller gains on five additional OOD datasets. A controlled diagnostic on Ssym reveals that anti-symmetric training does not eliminate systematic forward-reverse bias, indicating that gains arise through implicit regularization rather than exact thermodynamic constraint enforcement.
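Illustrative sketch (not taken from the paper): thermodynamically, the stability change of a mutation and of its reverse should cancel, i.e. ΔΔG(A→B) = −ΔΔG(B→A). A common way to encourage this in training is to penalize the squared sum of the paired predictions. The function name and the exact penalty form below are assumptions; the abstract does not give the regularizer's formula.

```python
def antisymmetry_penalty(forward_preds, reverse_preds):
    """Mean squared violation of ddG(forward) = -ddG(reverse).

    forward_preds[i] and reverse_preds[i] are predictions for a mutation
    and its reverse. The penalty is zero when every pair sums to zero.
    """
    if len(forward_preds) != len(reverse_preds):
        raise ValueError("forward and reverse predictions must be paired")
    if not forward_preds:
        return 0.0
    total = sum((f + r) ** 2 for f, r in zip(forward_preds, reverse_preds))
    return total / len(forward_preds)
```

A penalty of this kind acts as a soft constraint, which is consistent with the abstract's observation that the forward-reverse bias is reduced through regularization rather than eliminated exactly.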

2606.08140 2026-06-09 cs.LG 新提交

TRUST-SCF: Transformer-based Risk Understanding and Scoring for Transactional Supply Chain Finance

TRUST-SCF:基于Transformer的交易供应链金融风险理解与评分

Mohammadamin Davoodabadi, Amirabbas Shakeri

发表机构 * Department of Growth Barook Co.(Growth Barook公司)

AI总结 提出TRUST-SCF框架,利用Transformer对交易序列建模,通过金融对齐的注意力偏置、连续延迟预测和标签高效评分管道,实现动态信用评分,实验表明优于基线。

Comments 15 pages, 13 Figures, 3 Tables

详情
AI中文摘要

供应链金融(SCF)和LendTech平台需要能够响应不断变化的交易行为、还款延迟和活跃风险的信用评分系统。我们提出TRUST-SCF,一个基于Transformer的交易级风险预测和动态信用评分框架。每个用户历史被表示为包含利用率、还款延迟和交易位置的交易令牌序列。主要贡献包括:(1) 一种结合利用率相似性和近因性的金融对齐注意力偏置,使模型能够在可比风险暴露条件下比较还款行为;(2) 在对数变换目标空间中进行连续还款延迟预测,减少极端延迟的影响,同时提高对短延迟行为的敏感性;(3) 一个标签高效的信用评分管道,其中最终信用评分不依赖任何显式的外部信用评分标签进行训练,而是从预测延迟、模拟利用率下的潜在风险、实际未付风险暴露和非线性校准中推导得出。在超过30万笔交易的真实交易数据上的实验表明,TRUST-SCF在延迟预测上优于序列基线,并产生与未来还款行为强相关的评分。这些结果表明,TRUST-SCF是SCF和LendTech环境中自适应信用评分和交易级风险缓解的实用框架。

英文摘要

Supply Chain Finance (SCF) and LendTech platforms need credit scoring systems that respond to evolving transaction behavior, repayment delays, and active exposure. We propose TRUST-SCF, a transformer-based framework for transaction-level risk prediction and dynamic credit scoring. Each user history is represented as a sequence of transaction tokens containing utilization, repayment delay, and transaction position. The main contributions are: (1) a financially aligned attention bias that combines utilization similarity and recency, enabling the model to compare repayment behavior under comparable exposure conditions; (2) continuous repayment-delay prediction in a log-transformed target space, reducing the influence of extreme delays while improving sensitivity to short-delay behavior; and (3) a label-efficient credit-scoring pipeline in which the final credit score is not trained using any explicit external credit-score label, but is instead derived from predicted delay, potential risk over simulated utilization, actual unpaid exposure, and nonlinear calibration. Experiments on real transaction data from more than 300,000 transactions show that TRUST-SCF improves delay prediction over sequential baselines and produces scores that are strongly associated with future repayment behavior. These results suggest that TRUST-SCF is a practical framework for adaptive credit scoring and transaction-level risk mitigation in SCF and LendTech environments.
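Illustrative sketch (not taken from the paper): two of the ideas in the abstract can be shown in a few lines. The first is regression in a log-transformed target space, here using log1p and its inverse expm1. The second is an additive attention bias that favors transaction pairs with similar utilization and small positional distance. The bias formula and the weights alpha and beta are hypothetical; the abstract states only that the bias combines utilization similarity and recency.

```python
import math


def encode_delay(delay_days):
    """Map a non-negative repayment delay into log space."""
    if delay_days < 0:
        raise ValueError("delay must be non-negative")
    return math.log1p(delay_days)


def decode_delay(value):
    """Invert encode_delay."""
    return math.expm1(value)


def attention_bias(utilization, alpha=1.0, beta=0.1):
    """Hypothetical additive bias: bias[i][j] is less negative when
    transactions i and j have similar utilization and are close in position."""
    n = len(utilization)
    return [
        [
            -alpha * abs(utilization[i] - utilization[j]) - beta * abs(i - j)
            for j in range(n)
        ]
        for i in range(n)
    ]
```

In log space, a one-day difference between short delays is much larger than the same difference between long delays, which matches the abstract's claim of higher sensitivity to short-delay behavior and reduced influence of extreme delays.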

2606.08153 2026-06-09 cs.LG cs.AI 新提交

LogNEO: A GPT-Neo Reinforcement Learning Framework for Accurate Real-Time Log Anomaly Detection

LogNEO:基于GPT-Neo的强化学习框架用于精确实时日志异常检测

David Eje, Tanmay Sharma, Khush Patel, Manuel Mazzara, Leonard Johard

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出LogNEO,利用GPT-Neo模型和基于位置感知奖励的PPO微调,在HDFS、BGL和Thunderbird基准上达到F1分数0.927、0.913和0.984,召回率比LogGPT提升6%,并在生产部署中实现45ms端到端延迟。

Comments 8 pages, 5 figures, 6 tables

详情
AI中文摘要

检测大规模系统日志中的异常对于现代计算基础设施的可靠性和安全性至关重要。我们提出LogNEO,一个基于EleutherAI的GPT-Neo(13亿参数)构建的日志异常检测器,并通过一种新颖的部分信用、指数衰减位置感知奖励方案结合交叉熵正则化(使用近端策略优化PPO)进行微调。位置感知奖励显式建模预测难度:早期位置因正确预测获得更高奖励,而后期位置因错误受到更强惩罚。LogNEO在HDFS、BGL和Thunderbird基准上分别达到0.927、0.913和0.984的F1分数,在保持相当精度的同时,召回率比先前最先进的LogGPT提升高达6个百分点。基于Apache Kafka、Redis和TensorRT加速推理的生产微服务部署在每秒15000个事件下实现了45毫秒的端到端延迟。

英文摘要

Detecting anomalies in large-scale system logs is critical for the reliability and security of modern computing infrastructure. We present LogNEO, a log anomaly detector built on EleutherAI's GPT-Neo (1.3B parameters) and fine-tuned with a novel partial-credit, exponentially decaying position-aware reward scheme combined with cross-entropy regularisation via Proximal Policy Optimisation (PPO). The position-aware reward explicitly models prediction difficulty: early positions receive higher rewards for correct predictions, while later positions incur stronger penalties for errors. LogNEO attains F1-scores of 0.927, 0.913, and 0.984 on the HDFS, BGL, and Thunderbird benchmarks, improving recall by up to 6 percentage points over the prior state-of-the-art LogGPT while maintaining comparable precision. A production microservice deployment over Apache Kafka, Redis, and TensorRT-accelerated inference demonstrates 45 ms end-to-end latency at 15,000 events per second.
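Illustrative sketch (not taken from the paper): the abstract describes a partial-credit, exponentially decaying, position-aware reward in which early correct predictions earn more and late errors cost more. One possible shape for such a reward is shown below. The decay rate, the top-k partial-credit rule, and the penalty formula are all assumptions; the actual LogNEO reward is not specified in the abstract.

```python
import math


def position_reward(position, rank, decay=0.1, top_k=5):
    """Hypothetical position-aware reward for next-log-key prediction.

    position: 0-based index in the sequence.
    rank: 0-based rank of the true key among the model's predictions
          (0 means the top prediction was correct).
    """
    weight = math.exp(-decay * position)  # 1.0 at position 0, decays afterwards
    if rank == 0:
        return weight  # full credit, larger at early positions
    if rank < top_k:
        return weight * (1.0 - rank / top_k)  # partial credit for a near miss
    return -(2.0 - weight)  # penalty grows from -1 toward -2 at late positions
```

For example, with the defaults a correct prediction at position 0 scores 1.0, while a miss outside the top 5 costs 1.0 at position 0 and about 1.63 at position 10.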

2606.08161 2026-06-09 cs.LG cs.AR cs.NA math.NA 新提交

AttentionCap: Transformer Based Capacitance Matrix Learning Toward Full-Chip Extraction

AttentionCap: 基于Transformer的电容矩阵学习用于全芯片提取

Jiechen Huang, Hector R. Rodriguez, Dingcheng Yang, Zuochang Ye, Yibo Lin, Wenjian Yu

发表机构 * Dept. Computer Science & Tech., BNRist, Tsinghua Univ., Beijing, China(清华大学计算机科学与技术系,北京信息科学与技术国家研究中心) School of IC, BNRist, Tsinghua Univ., Beijing, China(清华大学集成电路学院,北京信息科学与技术国家研究中心) School of IC, Peking Univ., Beijing, China(北京大学集成电路学院)

AI总结 提出AttentionCap,一种定制化Transformer,结合Gram表示、对称注意力输出层和归一化拉普拉斯损失,实现多层多节点下的高精度电容矩阵预测,速度提升192倍。

Comments Accepted at the 63rd ACM/IEEE Design Automation Conference (DAC '26)

详情
AI中文摘要

随着基于规则的模式匹配的电容提取精度在先进节点上难以维持,开发基于深度学习的2D电容模型的趋势日益增长。然而,现有的基于MLP和CNN的方法将其输入限制在特定工艺节点的固定金属层组合上,限制了其在实际中的可用性。认识到电容矩阵与流行的注意力机制之间的固有相似性,我们提出了AttentionCap,一种定制化的Transformer用于电容矩阵学习,具有Gram表示框架、物理对齐的对称注意力输出层以及新颖的归一化拉普拉斯损失。我们还引入了工艺节点嵌入以实现多节点学习。在合成数据上训练后,AttentionCap在多层多节点设置下,对未见过的真实设计实现了0.67%/3.99%的自电容/耦合电容误差,相比CNN-Cap基线,自电容/耦合误差降低了4.6倍/5.7倍,推理速度提高了192倍。预训练的AttentionCap仅需5000个样本和4000步微调即可准确迁移到未见过的节点。凭借对未见过的真实设计的足够精度和对新工艺节点的强迁移能力,AttentionCap为现代EDA工作流程提供了很高的实用价值。代码和数据可在https://github.com/THU-numbda/AttentionCap获取。

英文摘要

As the capacitance extraction accuracy of rule-based pattern matching becomes difficult to sustain at advanced nodes, there is a growing trend toward developing deep-learning-based 2D capacitance models. However, existing MLP- and CNN-based methods constrain their input to fixed metal-layer combinations in a specific process node, limiting their usability in practice. Recognizing the inherent similarity between the capacitance matrix and the prevailing attention mechanism, we propose AttentionCap, a customized Transformer for capacitance matrix learning, with a Gram representation framework, a physics-aligned symmetric-attention output layer, and a novel normalized Laplacian loss. We also introduce a process-node embedding to enable multi-node learning. Trained on synthetic data, AttentionCap attains 0.67%/3.99% self/coupling-capacitance error on unseen real designs under a multi-layer and multi-node setting, surpassing the CNN-Cap baseline with 4.6$\times$/5.7$\times$ lower self/coupling error and 192$\times$ faster inference. A pretrained AttentionCap accurately transfers to an unseen node with only 5K samples and 4K finetuning steps. With sufficient accuracy on unseen real designs and strong transferability to new process nodes, AttentionCap offers high practical value for modern EDA workflows. Code and data are available at https://github.com/THU-numbda/AttentionCap.
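Illustrative sketch (not taken from the paper): a capacitance matrix is symmetric, whereas raw attention scores Q·Kᵀ generally are not. One simple way to obtain a symmetric output from query/key features is to average the score matrix with its transpose, as below. This is an assumption about what a "symmetric-attention output layer" could look like, not the AttentionCap design; the function name is hypothetical.

```python
def symmetric_scores(queries, keys):
    """Return S = (Q K^T + K Q^T) / 2 for per-conductor feature vectors.

    queries and keys are lists of equal-length float vectors, one per
    conductor. The result satisfies S[i][j] == S[j][i].
    """
    n = len(queries)
    if len(keys) != n:
        raise ValueError("queries and keys must have the same number of rows")
    raw = [
        [sum(q * k for q, k in zip(queries[i], keys[j])) for j in range(n)]
        for i in range(n)
    ]
    return [[0.5 * (raw[i][j] + raw[j][i]) for j in range(n)] for i in range(n)]
```

Building the symmetry into the output layer means the network never has to learn it from data, which is the usual motivation for physics-aligned output structure.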

2606.08212 2026-06-09 cs.LG 新提交

Public Machine Learning Solver Framework for Novices in the Machine Learning Domain

面向机器学习初学者的公共机器学习求解器框架

Lokman Saleh, Hafedh Mili, Mounir Boukadoum

发表机构 * LATECE Lab, Université du Québec à Montréal(LATECE实验室,魁北克大学蒙特利尔分校)

AI总结 提出一个结合专家知识和迁移学习的半自动化平台,为非专家推荐完整的机器学习流水线,并自动提取数据特征,通过一阶逻辑推理提供排名算法。

详情
AI中文摘要

解决机器学习问题很复杂,通常只有专家才能胜任。过去二十年中,出现了支持非专家的系统。根据我们的回顾,我们识别出三类:(1) 全自动AutoML系统,(2) 用于算法选择的专家备忘单,以及(3) 使用选择标准(准确性、透明度、数据要求)的决策支持系统。我们提出一个新平台,结合了第2和第3类,为非专家提供半自动化、智能的解决方案推荐。与推荐单一算法的现有方法不同,我们的平台建议一个针对用户问题量身定制的完整流水线。它整合了专家定义的选择标准与迁移学习,并自动从用户提供的数据集中提取数据特征(例如,类别不平衡、缺失值)。该平台使用一阶逻辑对其知识库进行推理,并推荐按相关性排序的合适算法。它具有用户友好的界面,并连接到面向机器学习专家的众包平台,确保持续更新。该平台是增量构建的,允许无缝集成新算法、标准和领域知识。据我们所知,这是第一个免费、公开可访问的在线框架,系统地捕获和操作专家知识,以结构化、透明的方式指导非专家解决机器学习问题。

英文摘要

Solving machine learning problems is complex and typically reserved for experts. Over the past two decades, systems have emerged to support non-experts. Based on our review, we identify three categories: (1) fully automated AutoML systems, (2) expert cheat sheets for algorithm selection, and (3) decision-support systems using selection criteria (accuracy, transparency, data requirements). We propose a new platform combining categories 2 and 3 to deliver semi-automated, intelligent solution recommendations for non-experts. Unlike existing approaches that recommend a single algorithm, our platform suggests a complete pipeline tailored to the user's problem. It integrates expert-defined selection criteria with transfer learning and automatically extracts data characteristics (e.g., class imbalance, missing values) from user-provided datasets. The platform uses first-order logic to reason over its knowledge base and recommends suitable algorithms ranked by relevance. It features a user-friendly interface and connects to a crowdsourcing platform for ML experts, ensuring continuous updates. The platform is built incrementally, allowing seamless integration of new algorithms, criteria, and domain knowledge. To our knowledge, this is the first free, publicly accessible online framework that systematically captures and operationalizes expert knowledge to guide non-experts in solving ML problems in a structured, transparent manner.

2606.08238 2026-06-09 cs.LG 新提交

GPT-Micro: A large language paradigm for accelerated, inexpensive, and thermodynamics-consistent discovery of constitutive models in manufacturing

GPT-Micro: 一种用于制造业中加速、低成本且热力学一致的本构模型发现的大语言范式

Soumik Dutta, Kiarash Naghavi Khanghah, Sania Shree, Logan McNeil, Thomas Feldhausen, Hongyi Xu, Rajiv Malhotra

发表机构 * Department of Mechanical and Aerospace Engineering, Rutgers University(罗格斯大学机械与航空航天工程系) Department of Mechanical, Aerospace & Manufacturing Engineering, University of Connecticut(康涅狄格大学机械、航空航天与制造工程系) Edison Welding Institute(爱迪生焊接研究所) Manufacturing Science Division, Oak Ridge National Laboratory(橡树岭国家实验室制造科学部) Department of Aerospace and Mechanical Engineering, University of Texas at El Paso(德克萨斯大学埃尔帕索分校航空航天与机械工程系)

AI总结 提出GPT-Micro范式,结合大语言模型、热力学约束和稀疏数据,实现自主发现本构模型,在印刷电子测试中数据量减少70%、发现时间缩短400倍。

Comments 23 pages, 4 tables, 11 equations, 9 figures

详情
AI中文摘要

本构模型描述了工艺施加的材料状态与基本材料属性之间的关系,对于制造过程中材料微观结构的控制至关重要。传统上依赖易错的人类经验和直觉来假设和修正模型函数形式,导致模型发现过程缓慢且增量式改进,精度有限。传统的机器学习需要大量数据生成成本和时间。使用大语言模型的模型发现存在上述问题,并且/或者忽略了基本热力学定律的不可违背性。本文创建了一种新颖的GPT-Micro范式,用于自主、数据稀疏且符合热力学的全新本构模型发现。该框架无缝集成了文献语义知识提取、基于热力学的守恒定律强制执行、稀疏数据集以及大语言模型驱动的模型假设生成与改进。在印刷电子工艺测试平台上对一个长期难以解决的本构建模问题进行了验证。结果表明,与现有技术相比,该方法具有显著且多方面的优势,包括:(a) 相比基于机器学习的建模,数据负担减少超过70%,且精度不损失;(b) 相比人工驱动建模,数据生成后的发现时间从数月缩短至数小时,减少400倍;(c) 发现具有新颖函数形式的模型,无需主观选择初始假设;(d) 通过综合紧凑、符合守恒定律且物理完整的解析模型,增强了基于物理的可信度、人类可解释性和机理洞察。讨论了GPT-Micro在制造业中实现快速、低成本、物理可信且可解释的微观结构建模的潜力。

英文摘要

Constitutive modeling of the relationship between process-imposed material states and fundamental material properties is critical to control of material microstructure in manufacturing processes. The limited accuracy resulting from the typical reliance on fallible human expertise and intuition for postulation and revision of the model's functional form results in incremental and time-consuming model discovery. Conventional Machine Learning (ML) incurs significant cost and time of data generation. Model discovery using Large Language Models (LLMs) suffers from the above issues and/or ignores the inviolability of fundamental thermodynamic laws. This work creates a novel GPT-Micro paradigm for autonomous, data-sparse, and thermodynamics-compliant discovery of de novo constitutive models. This framework seamlessly integrates semantic knowledge extraction from literature, enforcement of thermodynamics-based conservation laws, and sparse datasets, with LLM-driven generation and refinement of model hypotheses. Validation is performed for a long-intractable constitutive modeling problem in a printed electronics process testbed. This reveals significant and simultaneous advantages over the state of the art, including: (a) more than 70 percent reduction in data burden relative to ML-based modeling without loss in accuracy; (b) 400X reduction in discovery time after data generation, from months to hours, relative to human-driven modeling; (c) discovery of models with novel functional forms without subjective human choice of a starting hypothesis; (d) enhanced physics-rooted trustworthiness, human interpretability, and mechanistic insight via synthesis of compact, conservation-compliant, and physically complete analytical models. The potential of GPT-Micro to realize rapid, low-cost, physically trustworthy, and interpretable microstructure modeling across the manufacturing landscape is discussed.

2606.08300 2026-06-09 cs.LG 新提交

QueryWeaver: Reliable Multi-Tool Query Execution Planning via LLM-Based Graph Generation

QueryWeaver: 基于LLM图生成的可靠多工具查询执行规划

Aishwarya Chakravarthy, Vidhi Kulkarni, Duen Horng Chau

AI总结 提出将自然语言查询转换为结构化图并通过确定性规划器执行的系统,利用深度优先搜索解决跨工具依赖,实现高可靠性查询。

详情
AI中文摘要

许多对个人数据的真实查询跨越多个应用程序,需要结构化规划,因为单个工具只暴露部分信息。虽然LLM展示了强大的推理和工具使用能力,但可靠地执行多步骤、跨工具查询仍然具有挑战性。我们引入了一个系统,将自然语言查询转换为结构化图,并通过确定性规划器执行它们。我们的方法使用深度优先搜索来解决依赖关系并跨工具组合结果,提高了可靠性,并支持超越传统基于关键词搜索的查询。我们展示了即使在较小或本地托管的LLM上也能实现高精度。

英文摘要

Many real-world queries over personal data span multiple applications and require structured planning, as individual tools expose only partial information. While LLMs show strong reasoning and tool use, reliably executing multi-step, cross-tool queries remains challenging. We introduce a system that converts natural language queries into structured graphs and executes them via a deterministic planner. Our approach uses depth-first search to resolve dependencies and combine results across tools, improving reliability and enabling queries beyond traditional keyword-based search. We demonstrate high accuracy even with smaller or locally hosted LLMs.
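Illustrative sketch (not taken from the paper): the abstract describes a deterministic planner that resolves dependencies in a query graph with depth-first search. A minimal version of that idea is shown below, where each node lists the nodes it depends on and a function that combines their results. The graph encoding, the function name, and the example tools are hypothetical.

```python
def execute_plan(graph, target):
    """Evaluate `target` by depth-first traversal of its dependencies.

    graph maps node name -> (list of dependency names, callable).
    Each callable receives its dependencies' results as positional arguments.
    Results are memoized so shared dependencies run once.
    """
    results = {}
    in_progress = set()

    def visit(node):
        if node in results:
            return results[node]
        if node in in_progress:
            raise ValueError(f"cyclic dependency at {node!r}")
        in_progress.add(node)
        deps, fn = graph[node]
        args = [visit(dep) for dep in deps]  # resolve dependencies first
        in_progress.discard(node)
        results[node] = fn(*args)
        return results[node]

    return visit(target)


# Hypothetical two-tool query: find a meeting organizer, then search mail.
example_graph = {
    "calendar_lookup": ([], lambda: "Alice"),
    "email_search": (["calendar_lookup"], lambda name: f"mail from {name}"),
}
```

Because the traversal order is fixed by the graph rather than chosen step by step by the LLM, the same graph always executes the same way, which is one plausible source of the reliability the abstract reports.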

2606.08479 2026-06-09 cs.LG 新提交

Inferring hidden forcing in a biological oscillator using Kolmogorov-Arnold networks

利用Kolmogorov-Arnold网络推断生物振荡器中的隐藏驱动力

Julian Szereszewski, Facundo Fainstein, Leandro E. Fernandez, Gabriel B. Mindlin

发表机构 * Universidad de Buenos Aires(布宜诺斯艾利斯大学) CONICET - Universidad de Buenos Aires(阿根廷国家科研委员会-布宜诺斯艾利斯大学) Instituto de Física Interdisciplinaria y Aplicada (INFINA)(跨学科与应用物理研究所(INFINA))

AI总结 提出利用Kolmogorov-Arnold网络从气压测量数据重建鸟类呼吸动力学方程,揭示隐藏的两相肌肉激活模式,并通过肌电图验证。

Comments 11 pages, 4 figures

详情
AI中文摘要

从部分观测中推断驱动动力系统的力是物理学中的一个基本挑战,特别是当不同的潜在机制产生相似的观测动力学时。在这里,我们展示了仅通过气囊压力测量即可重建鸟类呼吸动力学背后的有效肌肉驱动力。使用基于Kolmogorov-Arnold网络的可解释学习框架,我们直接从数据中推断系统的控制方程,并揭示潜在驱动力中的非平凡结构,该结构从压力信号中并不明显,而压力信号反而暗示了一种类似松弛的振荡。重建的动力学预测每个呼吸周期内存在两相激活模式,我们通过呼气肌的肌电图记录独立验证了这一点。这些结果表明,数据驱动的动力学定律重建可以揭示隐藏的物理结构,并提供对未观测驱动变量的访问,从而建立了一种在部分观测动力系统中推断潜在力的通用途径。

英文摘要

Inferring the forces that drive a dynamical system from partial observations is a fundamental challenge across physics, particularly when distinct underlying mechanisms produce similar observable dynamics. Here we show that the effective muscular forcing underlying avian respiratory dynamics can be reconstructed from measurements of air-sac pressure alone. Using an interpretable learning framework based on Kolmogorov-Arnold networks, we infer the governing equations of the system directly from data and uncover a nontrivial structure in the underlying forcing that is not apparent from the pressure signal, which instead suggests a relaxation-like oscillation. The reconstructed dynamics predict a two-phase activation pattern within each respiratory cycle, which we independently validate through electromyographic recordings of expiratory muscles. These results demonstrate that data-driven reconstruction of dynamical laws can reveal hidden physical structure and provide access to unobserved driving variables, establishing a general route to infer latent forces in partially observed dynamical systems.

2606.08480 2026-06-09 cs.LG cs.AI cs.IR 新提交

Adaptive Loss Balancing for Noise-Robust GRPO in Generative Recommendation

生成式推荐中噪声鲁棒GRPO的自适应损失平衡

Kewei Xu, Junbo Qi, Yanyan Zou, Pengfei Zhang, Xingzhi Yao, Shengjie Li

发表机构 * JD.com(京东) Waseda University(早稻田大学) University of Electronic Science and Technology of China(电子科技大学)

AI总结 针对生成式推荐中奖励模型因曝光偏差导致噪声的问题,提出AdaGRPO框架,通过策略难度和奖励可区分性诊断动态切换GRPO与监督学习,在电商数据集上提升召回率并抑制幻觉。

详情
AI中文摘要

强化学习为超越监督模仿的生成式推荐提供了有前景的途径,通过利用奖励信号指导策略改进。然而,其有效性关键取决于奖励模型对所评估样本的可信度。实践中,广泛采用的奖励模型——生产级排序器,是在有曝光偏差的日志上训练的,导致样本相关的误差,违反了这一假设。我们的分层分析揭示了一个一致的模式:当策略表现出不确定性且排序器能有效区分真实物品与rollout负样本时,奖励指导最为有益。在其他样本上,奖励信号要么可忽略,要么有害,凸显了统一应用RL的风险。为解决此问题,我们引入AdaGRPO,一种新颖框架,将奖励指导优化视为选择性准入而非统一压力。训练以监督负对数似然为基础,而GRPO目标由基于两个rollout诊断(策略侧难度和奖励可区分性)的逐样本二元裁剪门控。未通过任一诊断的实例退化为纯监督,确保稳定性并减轻噪声梯度的放大。我们在大规模电商数据集上验证了AdaGRPO。在最佳中间检查点,它将HR@10从11.01%提升至12.18%,同时将幻觉限制在0.22%以下,并在最终检查点保持鲁棒性(HR@10 11.63%,幻觉0.27%),在检索-有效性前沿上优于固定NLL-GRPO混合。在生产A/B测试中,AdaGRPO在点击率和停留时间上实现了统计显著的提升,证实了其实用价值。

英文摘要

Reinforcement learning (RL) presents a promising avenue for enhancing generative recommendation beyond supervised imitation, leveraging reward signals to guide policy improvement. However, its efficacy is critically contingent on the trustworthiness of the reward model for the samples it evaluates. In practice, production rankers, the widely adopted reward models, are trained on exposure-biased logs, leading to sample-dependent inaccuracies that violate this assumption. Our stratified analysis uncovers a consistent pattern: reward guidance is most beneficial when the policy exhibits uncertainty and the ranker can effectively discriminate the ground-truth item from rollout negatives. On other samples, the reward signal is either negligible or detrimental, highlighting the risk of uniform RL application. To address this issue, we introduce AdaGRPO, a novel framework that treats reward-guided optimization as selective admission rather than uniform pressure. Training is anchored in supervised negative log-likelihood, while the GRPO objective is gated by a binary, per-sample clip determined by two rollout diagnostics: policy-side difficulty and reward discriminability. Instances failing either diagnostic default to pure supervision, ensuring stability and mitigating the amplification of noisy gradients. We validate AdaGRPO on a large-scale e-commerce dataset. At the best intermediate checkpoint, it elevates HR@10 from 11.01% to 12.18% while constraining hallucination below 0.22%, and maintains robustness at the final checkpoint (HR@10 11.63%, hallucination 0.27%), outperforming fixed NLL-GRPO mixtures across the retrieval-validity frontier. In production A/B tests, AdaGRPO achieves statistically significant gains in click-through rate and dwell time, confirming its practical utility.
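Illustrative sketch (not taken from the paper): the gating described in the abstract can be pictured as a per-sample switch that adds the GRPO term to the supervised loss only when both diagnostics pass. The concrete diagnostics below (rollout hit rate as a difficulty proxy, and the reward margin between the ground-truth item and the best rollout negative) and every threshold are assumptions; the abstract names the two diagnostics but not their definitions.

```python
def gated_loss(nll, grpo, hit_rate, gt_reward, negative_rewards,
               max_hit_rate=0.5, min_margin=0.1, weight=1.0):
    """Per-sample loss: supervised NLL plus a GRPO term behind a binary gate.

    hit_rate: fraction of rollouts that produced the ground-truth item
              (low values are treated as policy uncertainty).
    gt_reward / negative_rewards: ranker scores for the ground-truth item
              and for rollout negatives.
    All threshold values are hypothetical.
    """
    uncertain = hit_rate <= max_hit_rate
    discriminable = (
        bool(negative_rewards)
        and gt_reward - max(negative_rewards) >= min_margin
    )
    gate = 1.0 if (uncertain and discriminable) else 0.0
    return nll + gate * weight * grpo
```

Samples that fail either check contribute only their NLL term, which mirrors the abstract's statement that such instances default to pure supervision.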

2606.08484 2026-06-09 cs.LG cs.AI 新提交

STELLAR: Spatio-Temporal Environmental Learning with Latent Alignment and Refinement for Long-Tailed Species Distribution Modeling

STELLAR: 面向长尾物种分布建模的时空环境学习与潜在对齐精炼

Shufeng Kong, Tao Yu, Yuanyuan Wei, Caihua Liu, Junwen Bai, Yingheng Wang, Marc Grimson, Daniel Fink, Carla P. Gomes

发表机构 * Sun Yat-sen University(中山大学) Cornell University(康奈尔大学) Foshan University(佛山大学) Cornell Lab of Ornithology(康奈尔鸟类学实验室)

AI总结 提出STELLAR框架,通过图-时间编码器、上下文锚定潜在对齐和不平衡感知解码模块,联合优化动态栖息地上下文和群落结构,有效解决物种分布建模中的时空耦合与长尾不平衡问题。

Comments Accepted by IJCAI 2026

详情
AI中文摘要

联合物种分布建模(JSDM)是生物多样性监测和保护规划的关键工具。然而,准确的JSDM面临两个耦合挑战:环境驱动因素和物种分布本质上是时空的,而物种共现模式表现出复杂的非线性群落结构以及由稀有物种导致的严重长尾不平衡。现有方法通常孤立地处理这些因素,从静态协变量中学习或忽略动态群落结构的历史轨迹。为克服这些限制,我们提出STELLAR(时空环境学习与潜在对齐精炼),一种新颖的框架,学习一个共享潜在空间,其中动态栖息地上下文和群落结构被联合优化。我们的方法整合了三个互补组件:(1)图-时间编码器,采用图注意力和循环单元来聚合空间邻域效应并捕捉环境上下文和群落结构的共同演化历史动态;(2)上下文锚定潜在对齐机制,利用标签激活的混合先验和监督对比学习结构化潜在空间,基于共享环境偏好主动聚类物种;(3)不平衡感知解耦解码模块,利用非对称损失聚焦于困难稀有物种样本的学习,防止长尾中的模式崩溃。在领域专家精心整理的大规模eBird数据集上的实验表明,我们的框架显著优于最先进的基线,特别是在预测稀有物种和揭示可解释的物种相互作用方面。

英文摘要

Joint Species Distribution Modeling (JSDM) is a key enabler for biodiversity monitoring and conservation planning. However, accurate JSDM faces two coupled challenges: environmental drivers and species distributions are inherently spatio-temporal, while species co-occurrence patterns exhibit complex non-linear community structure and severe long-tail imbalance driven by rare species. Existing approaches often address these factors in isolation, learning from static covariates or neglecting the historical trajectories of dynamic community structure. To overcome these limitations, we propose STELLAR (Spatio-Temporal Environmental Learning with Latent Alignment and Refinement), a novel framework that learns a shared latent space where dynamic habitat context and community structure are optimized jointly. Our approach integrates three complementary components: (1) a Graph-Temporal Encoder that employs graph attention and recurrent units to aggregate spatial neighborhood effects and capture the co-evolving historical dynamics of environmental context and community structure; (2) a Context-Anchored Latent Alignment mechanism that structures the latent space using a label-activated mixture prior and supervised contrastive learning, actively clustering species based on shared environmental preferences; and (3) an Imbalance-Aware Decoupled Decoding module that utilizes Asymmetric Loss to focus learning on hard, rare species samples, preventing mode collapse in the long tail. Experiments on the large-scale eBird dataset, curated with domain experts, demonstrate that our framework significantly outperforms state-of-the-art baselines, particularly in predicting rare species and revealing interpretable species interactions.
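Illustrative sketch (not taken from the paper): Asymmetric Loss for multi-label classification applies a stronger focusing exponent to negatives than to positives and shifts negative probabilities down by a margin, so that easy negatives contribute nothing. A per-label version is shown below. The hyperparameter values are commonly used defaults and are assumptions here; the abstract does not state the settings used in STELLAR.

```python
import math


def asymmetric_loss(p, y, gamma_pos=0.0, gamma_neg=4.0, margin=0.05, eps=1e-8):
    """Asymmetric Loss for one label.

    p: predicted presence probability in [0, 1].
    y: 1 if the species is present, 0 otherwise.
    """
    if y == 1:
        # Positive term: focal-style weighting with a small (or zero) exponent.
        return -((1.0 - p) ** gamma_pos) * math.log(max(p, eps))
    # Negative term: shift the probability, then down-weight easy negatives.
    p_shifted = max(p - margin, 0.0)
    return -(p_shifted ** gamma_neg) * math.log(max(1.0 - p_shifted, eps))
```

In a long-tailed setting most labels are absent for most sites, so suppressing the many easy negatives keeps the gradient from being dominated by them and leaves more signal for rare species.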

2606.08538 2026-06-09 cs.LG 新提交

Routine laboratory trajectories encode the onset of organ-level complications in cancer

常规实验室轨迹编码癌症器官级并发症的发生

Jannik Lübberstedt, Krischan Braitsch, Jacqueline Lammert, Christof Winter, Florian Gabriel, Tristan Lemke, Christopher Zirn, Markus Graf, Friedrich Puttkammer, Hartmut Häntze, Johannes Moll, Anirudh Narayanan, Andrei Zhukov, Fabian Drexel, Zeineb Ben Chaaben, Sebastian Ziegelmayer, Su Hwan Kim, Marion Högner, Jan Kirschke, Florian Bassermann, Marcus Makowski, Christian Wachinger, Lisa Adams, Keno Bressem

发表机构 * Technical University of Munich(慕尼黑工业大学) Charité - Universitätsmedizin Berlin(柏林夏里特医学院) German Heart Center(德国心脏中心)

AI总结 利用Transformer分析癌症患者常规实验室检测的纵向轨迹,预测162种治疗相关并发症,性能优于单时间点方法,验证了轨迹数据对器官功能恶化的早期编码能力。

详情
AI中文摘要

癌症治疗期间抽取的常规实验室检查构成了器官功能的纵向生理记录,然而其时间结构被单时间点预后工具所忽略。一个基于Transformer的模型在来自3,905名多发性骨髓瘤或卵巢癌患者的2,777,595次实验室测量上训练,预测了162种治疗相关并发症(包括治疗相关骨髓增生异常综合征)的两年内发生,涵盖八个临床类别,在群体水平上实现了高于患病率1.5至6.1倍的富集。它在分组终点上匹配或超越了非序列基线(AUROC提升高达+0.11),表明纵向实验室轨迹捕捉到了从孤立测量中无法获得的、随并发症演变的特异性生理信息。预测在两种癌症中均具有泛化能力,差异集中在疾病特异性并发症上,生物标志物掩膜恢复了与既定病理生理学一致的签名。在MIMIC-IV和MMRF CoMMpass上的外部验证证实了其在独立医疗系统中的可迁移性(AUROC高达0.85)。常规肿瘤学实验室数据在临床发作前数周至数月编码了器官恶化,从而无需额外检测基础设施即可实现并发症特异性监测。

英文摘要

Routine laboratory panels drawn during cancer treatment constitute longitudinal physiological recordings of organ function, yet their temporal structure is discarded by single-timepoint prognostic tools. A transformer trained on 2,777,595 laboratory measurements from 3,905 patients with multiple myeloma or ovarian cancer predicted the two-year onset of 162 treatment-associated complications, including therapy-related myelodysplastic syndromes, spanning eight clinical categories, achieving 1.5- to 6.1-fold enrichment above prevalence at the group level. It matched or outperformed non-sequential baselines across grouped endpoints (AUROC gains up to +0.11), demonstrating that longitudinal laboratory trajectories capture evolving complication-specific physiology inaccessible from isolated measurements. Predictions generalised across both cancers, divergence concentrating in disease-specific complications, and biomarker masking recovered signatures consistent with established pathophysiology. External validation on MIMIC-IV and MMRF CoMMpass confirmed transferability across independent healthcare systems (AUROC up to 0.85). Routine oncological laboratory data encode organ deterioration weeks to months before clinical onset, enabling complication-specific surveillance without additional testing infrastructure.

2606.08563 2026-06-09 cs.LG physics.ao-ph 新提交

Physics-Guided Dual Decoding and Spectral Supervision for Global 3D Hydrometeor Prediction

物理引导的双解码与频谱监督用于全球三维水凝物预测

Dandan Chen, Yaqiang Wang

发表机构 * Chinese Academy of Meteorological Sciences(中国气象科学研究院) Xiong’an Institute of Meteorological Artificial Intelligence(雄安气象人工智能研究院)

AI总结 针对三维水凝物预测中零膨胀长尾分布导致的过度平滑问题,提出物理引导的双解码框架PredHydro-Net,通过解耦架构、小波频率解耦和对抗训练,在极端事件检测和频谱表示上优于现有模型。

详情
AI中文摘要

虽然全球数据驱动模型在预测连续大气变量方面表现出色,但由于这些变量的零膨胀长尾分布,三维水凝物预测仍然具有挑战性。标准的深度学习优化通常会产生过度平滑的预测,削弱极端事件和空间纹理。我们提出了PredHydro-Net,一个物理引导的双解码框架,以缓解这种平滑。为了解决多变量优化冲突,它采用了解耦架构,其中宏观热力学和动力学场单向调节水凝物的生成。通过集成基于小波的频率解耦、频谱幅度匹配和对抗训练,该模型在定量准确性和空间保真度之间实现了有利的权衡。在72小时全球评估中,PredHydro-Net在极端事件检测和频谱表示方面优于时空深度学习基线(Earthformer和PredRNNv2)以及业务全球预报系统(GFS)。此外,它与全球降水测量(GPM)卫星反演表现出良好的气候一致性。该模型合理地再现了极端天气事件(如飓风伊恩)中的三维云结构。特征归因证实了其对物理前兆(如相对湿度和风辐合)的依赖,为长尾大气预测提供了一种稳健的、物理信息的方法。

英文摘要

While global data-driven models excel at predicting continuous atmospheric variables, three-dimensional hydrometeor forecasting remains challenging due to the zero-inflated, long-tailed distributions of these variables. Standard deep learning optimization often yields overly smooth forecasts, attenuating extreme events and spatial textures. We propose PredHydro-Net, a physics-guided dual-decoding framework that mitigates this smoothing. To resolve multi-variable optimization conflicts, it employs a decoupled architecture where macroscopic thermodynamic and dynamic fields unidirectionally modulate hydrometeor generation. By integrating wavelet-based frequency decoupling, spectral amplitude matching, and adversarial training, the model achieves a favorable trade-off between quantitative accuracy and spatial fidelity. In a 72-h global evaluation, PredHydro-Net outperforms both spatiotemporal deep learning baselines (Earthformer and PredRNNv2) and the operational Global Forecast System (GFS) in extreme-event detection and spectral representation. Furthermore, it demonstrates strong climatological consistency with Global Precipitation Measurement (GPM) satellite retrievals. The model reasonably reproduces the three-dimensional cloud structures in extreme weather events, such as Hurricane Ian. Feature attribution confirms its dependence on physical precursors such as relative humidity and wind convergence, offering a robust, physics-informed approach to long-tailed atmospheric prediction.

2606.08573 2026-06-09 cs.LG cs.CL 新提交

Titans-as-a-Layer: Test-Time Memory for Conversational Speech Emotion Recognition

Titans-as-a-Layer:对话语音情感识别的测试时记忆

Daniel Chen, Qicong Hu, Yang Xiao, Ting Dang, Hong Jia

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出Memory-as-a-Layer (MAL)适配器,利用测试时神经记忆为对话语音情感识别提供上下文,在不修改大型音频语言模型的前提下提升性能。

Comments ICML 2026 Workshop on Machine Learning for Audio

详情
AI中文摘要

语音情感识别(SER)通常被表述为话语级分类,尽管对话情感取决于说话者通常的音域和先前话语建立的情感上下文。语音语言模型提供了强大的预训练声学和语义表示,并可以通过微调将其适应于SER标签,但这种机制仍然缺少每对话状态。我们研究测试时神经记忆是否可以在保持大型音频语言模型(LALMs)主干不变的情况下提供这种缺失的上下文。基于Titans,我们引入了一种即插即用的Memory-as-a-Layer(MAL)适配器,它将对话历史写入小型神经记忆,并作为音频令牌对齐的残差更新读回,避免了对宿主模型令牌位置的更改。在不同的音频LLM和情感识别数据集评估中,我们的设计在不同评估指标上改善了SER性能,支持测试时记忆作为对话SER的残差上下文机制。

英文摘要

Speech emotion recognition (SER) is commonly formulated as utterance-level classification, although conversational emotion depends on a speaker's usual vocal range and the emotional context established by previous utterances. Speech-language models provide strong pretrained acoustic and semantic representations and can be adapted to SER labels via fine-tuning, but this mechanism still lacks per-dialogue state. We study whether test-time neural memory can supply this missing context while leaving the large audio language model (LALM) backbone intact. Building on Titans, we introduce a plug-and-play Memory-as-a-Layer (MAL) adapter that writes dialogue history into a small neural memory and reads it back as an audio-token-aligned residual update, avoiding changes to the host model's token positions. In evaluations across different audio LLMs and emotion recognition datasets, our design improves SER performance on multiple evaluation metrics, supporting test-time memory as a residual contextual mechanism for conversational SER.

2606.08630 2026-06-09 cs.LG cs.AI 新提交

Tyan-WP: A Wind Power Foundation Model for Ultra-Short-Term Probabilistic Forecasting

Tyan-WP:用于超短期概率预测的风电基础模型

Jiahui Huang, Ao Luo, Lei Liu, Hongwei Zhao, Tengyuan Liu, Ruibo Guo, Bo Wang, Zhao Wang, Bin Li

发表机构 * School of Information Science and Technology, University of Science and Technology of China(中国科学技术大学信息科学技术学院) China Electric Power Research Institute(中国电力科学研究院)

AI总结 提出首个风电基础模型Tyan-WP,通过静态站点嵌入和功率感知气象融合模块,在零样本场景下实现超短期概率预测,显著优于传统模型。

详情
AI中文摘要

全球风电容量,特别是在中国,正在蓬勃发展,新的风电场跨越了多样的地形和气候。行业迫切需要准确的风电基础模型,以缩短调试并加速并网。这是因为特定站点的时间序列模型(TSM)不适用于数据稀缺场景且泛化能力差,而通用大型时间序列模型(LTSM)大多限于单变量输入,无法充分利用静态站点属性或功率与气象协变量之间的依赖关系,导致精度不足。为填补这一空白,我们提出了Tyan-WP,这是首个用于超短期概率预测的风电基础模型。在覆盖美国超过126,000个站点、跨越七年的大规模风电数据集上预训练后,Tyan-WP通过两个特定领域模块设计进一步提升了零样本预测:使用坐标、地形和生态区域元数据的静态站点嵌入,以及一个功率感知气象融合(PAMF)模块,该模块对历史功率和气象协变量之间的交互进行建模。在统一评估协议下,Tyan-WP在10个域内站点上超越了八个特定站点的监督TSM,并在127个域内站点上优于十一个通用LTSM,MAE降低19.9%,RMSE降低16.6%,CRPS降低22.2%,AQL降低21.7%,同时R^2提升16.7%。它还在六个真实的英国站点上展示了强大的跨地理泛化能力。这些结果表明,风电基础模型可以在无需目标站点训练的情况下实现准确的零样本预测,为新风电场快速涡轮机接入和概率风险管理提供了实用途径。

英文摘要

Global wind power capacity, especially in China, is booming, with new farms spanning diverse terrains and climates. The industry urgently needs accurate wind power foundation models to shorten commissioning and accelerate grid connection. This is because site-specific time series models (TSMs) are not well suited to data-scarce scenarios and generalize poorly, while generic large time series models (LTSMs) are mostly limited to univariate inputs and cannot fully exploit static site attributes or the dependencies between power and meteorological covariates, leading to insufficient accuracy. To fill this gap, we propose Tyan-WP, the first wind power foundation model for ultra-short-term probabilistic forecasting. Pretrained on a large-scale wind power dataset covering more than 126,000 U.S. sites over seven years, Tyan-WP further improves zero-shot forecasting through two domain-specific module designs: static site embedding using coordinate, terrain, and ecoregion metadata, and a power-aware meteorological fusion (PAMF) module that models interactions between historical power and meteorological covariates. Under a unified evaluation protocol, Tyan-WP surpasses eight site-specific supervised TSMs on 10 in-domain sites and outperforms eleven generic LTSMs on 127 in-domain sites, reducing MAE by 19.9%, RMSE by 16.6%, CRPS by 22.2%, and AQL by 21.7%, while raising R^2 by 16.7%. It further demonstrates strong cross-geography generalization on six real U.K. sites. These results show that the wind power foundation model can achieve accurate zero-shot forecasting without target-site training, providing a practical pathway for rapid turbine onboarding and probabilistic risk management at new wind farms.
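Illustrative sketch (not taken from the paper): probabilistic forecasts are often scored with the pinball (quantile) loss, averaged over quantile levels and time steps. The snippet below shows that computation, assuming that the AQL reported in the abstract denotes an average quantile loss of this kind; the abstract does not define the acronym, and the function names are hypothetical.

```python
def pinball_loss(y_true, y_pred, q):
    """Pinball loss of a single quantile forecast at level q in (0, 1)."""
    if not 0.0 < q < 1.0:
        raise ValueError("q must lie strictly between 0 and 1")
    diff = y_true - y_pred
    # Under-prediction is weighted by q, over-prediction by (1 - q).
    return max(q * diff, (q - 1.0) * diff)


def average_quantile_loss(y_true, quantile_preds):
    """Mean pinball loss over all quantile levels and time steps.

    quantile_preds maps each quantile level to a list of forecasts
    aligned with y_true.
    """
    total, count = 0.0, 0
    for q, preds in quantile_preds.items():
        for y, p in zip(y_true, preds):
            total += pinball_loss(y, p, q)
            count += 1
    if count == 0:
        raise ValueError("no forecasts to score")
    return total / count
```

For a 0.9-quantile forecast, under-predicting an observed power of 10 by 2 units costs about 1.8, while over-predicting by 2 units costs about 0.2, so the loss pushes the forecast toward the intended quantile.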

2606.08696 2026-06-09 cs.LG cs.AI 新提交

Agentic Search for Counterfactual Recourse under Fixed LLM Budgets

固定LLM预算下的反事实追索的智能搜索

Yasuo Tabei

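AI summary: Proposes Comp-MCTS, a tree-search framework that, under a fixed LLM-call budget, maximizes the number of unique, oracle-validated counterfactuals while balancing quantity and quality.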

详情
AI中文摘要

反事实追索旨在提供可操作的特征变化,以改变预测模型做出的不利决策。在实践中,受影响的个体通常受益于多个可行的替代方案,而非单一的最优解释。产生此类替代方案的一种自然方式是提示大语言模型(LLMs)。然而,提示引入了一个实际约束:LLM调用的数量通常是主要的计算和经济成本。对多个替代方案的需求以及这一成本约束共同将问题从寻找单个高质量反事实转变为在固定LLM调用预算下高效生成一组经oracle验证的反事实。在这项工作中,我们将LLM智能体设置中的反事实追索生成作为固定预算搜索问题进行研究,并提出了Comp-MCTS,一个智能体树搜索框架,该框架在此预算下最大化唯一、经oracle验证的反事实的产出,同时保持有利的数量-质量权衡。Comp-MCTS通过基于LLM的提议生成、oracle验证和压缩引导剪枝,在无训练、仅oracle的设置中将预算分配给新颖的干预方向。在四个真实世界表格数据集上的实验表明,Comp-MCTS在唯一、经oracle验证的反事实产出方面显著优于单候选LATS风格基线,并且与更强的多候选变体相比,提供了有利的数量-质量-效率权衡:在四个数据集中的三个上,以相似或更低的oracle评估成本获得相当或更高的产出,同时具有有竞争力的接近性、稀疏性和新颖性。

英文摘要

Counterfactual recourse aims to provide actionable feature changes that would alter an unfavorable decision made by a predictive model. In practice, affected individuals often benefit from multiple feasible alternatives rather than a single optimal explanation. A natural way to produce such alternatives is to prompt large language models (LLMs). However, prompting incurs a practical constraint: the number of LLM calls is often the dominant computational and economic cost. Together, the need for multiple alternatives and this cost constraint shift the problem from finding a single high-quality counterfactual to efficiently generating a set of oracle-validated counterfactuals under a fixed LLM-call budget. In this work, we study counterfactual recourse generation in the LLM-agentic setting as a fixed-budget search problem and propose Comp-MCTS, an agentic tree-search framework that maximizes the yield of unique, oracle-validated counterfactuals under this budget while maintaining favorable quantity--quality trade-offs. Comp-MCTS allocates the budget toward novel intervention directions via LLM-based proposal generation, oracle validation, and compression-guided pruning, in a training-free, oracle-only setting. Experiments on four real-world tabular datasets show that Comp-MCTS substantially outperforms single-candidate LATS-style baselines in the yield of unique, oracle-validated counterfactuals, and offers favorable quantity--quality--efficiency trade-offs against stronger multi-candidate variants: comparable or higher yield at similar or lower oracle-evaluation cost on three of four datasets, plus competitive proximity, sparsity, and novelty.

2606.08712 2026-06-09 cs.LG cs.AI cs.CV 新提交

SNR-ST-Mix: Sample-specific Neighborhood Regression Mixup for Augmented Spatial Transcriptomics Imputation with Deep Neural Network

SNR-ST-Mix: 基于样本特异性邻域回归混合增强的空间转录组学深度神经网络插补

Hongyi Yu, Yaoyu Fang, Jiahe Qian, Xinkun Wang, Lee A. Cooper, Bo Zhou

发表机构 * Northwestern University(西北大学) Yale University(耶鲁大学)

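AI summary: To address the noise and low resolution of spatial transcriptomics data, proposes SNR-ST-Mix, a data augmentation framework that generates biologically plausible synthetic samples through spatial-neighborhood-constrained, expression-similarity-weighted mixing, improving deep neural network imputation.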

Comments 19 pages, 4 figures, 3 tables

详情
AI中文摘要

目的:空间转录组学(ST)能够在组织背景下测量基因表达。然而,这些测量通常噪声大、分辨率低且采样稀疏,限制了精细空间结构的恢复。深度神经网络已成为从组织学进行表达插补的强大工具,但其性能仍受限于有限的样本量和缺乏生物学信息的增强。大多数现有的学习增强策略是为分类任务而非回归任务设计的,忽略了空间和转录组关系,导致生物上不合理的插值,阻碍了预测性能。方法:为解决这些限制,我们提出SNR-ST-Mix,一种专门为ST数据设计的几何和表达感知数据增强框架。它将混合限制在点的k个最近空间邻域内,并基于表达相似性自适应加权插值系数,生成保留局部生物结构同时确保空间平滑性的增强样本。这种双重条件化产生合成样本,扩展了有效训练流形,促进了泛化,并在样本特异性训练下增强了预测稳定性。结果:使用各种组织类型的大量实验表明,SNR-ST-Mix在不需要架构更改或额外计算的情况下,始终优于传统增强方法。结论:SNR-ST-Mix为空间转录组学回归任务提供了一种有效且生物学原理的增强策略。通过显式利用空间几何和转录组相似性,它扩展了有效训练流形,并在不增加模型复杂度的情况下提高了预测性能。

英文摘要

Purpose: Spatial transcriptomics (ST) enables gene expression measurements within the tissue context. However, these measurements are often noisy, low-resolution, and sparsely sampled, which limits the recovery of fine spatial structure. Deep neural networks have become powerful tools for expression imputation from histology, but their performance remains constrained by limited sample sizes and a lack of biologically informed augmentation. Most existing augmentation strategies are designed for classification rather than regression and neglect spatial and transcriptomic relationships, leading to biologically implausible interpolations that hinder prediction performance. Approach: To address these limitations, we propose SNR-ST-Mix, a geometry- and expression-aware data augmentation framework designed specifically for ST data. It constrains mixing to a spot's k-nearest spatial neighbors and adaptively weights interpolation coefficients based on expression similarity, generating augmented samples that preserve local biological structure while ensuring spatial smoothness. This dual conditioning yields synthetic examples that expand the effective training manifold, promote generalization, and enhance prediction stability under sample-specific training. Results: Extensive experiments with various tissue types demonstrate that SNR-ST-Mix consistently outperforms conventional augmentation methods without requiring architectural changes or additional computation. Conclusions: SNR-ST-Mix provides an effective and biologically principled augmentation strategy for spatial transcriptomics regression tasks. By explicitly leveraging spatial geometry and transcriptomic similarity, it expands the effective training manifold and improves predictive performance without increasing model complexity.
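
The neighborhood-constrained mixing described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the use of cosine similarity, and the rule `lam = max_mix * similarity` are assumptions chosen only to show the idea of mixing a spot with one of its k nearest spatial neighbors, with a coefficient that grows with expression similarity.

```python
import math
import random


def knn_indices(coords, i, k):
    """Indices of the k spots spatially closest to spot i (excluding i itself)."""
    dists = [(math.dist(coords[i], c), j) for j, c in enumerate(coords) if j != i]
    dists.sort()
    return [j for _, j in dists[:k]]


def cosine_similarity(u, v):
    """Cosine similarity of two vectors; 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu > 0 and nv > 0 else 0.0


def neighborhood_mixup(coords, features, expression, i, k=4, max_mix=0.5, rng=random):
    """Mix spot i with one randomly chosen spatial neighbor.

    Assumed weighting (hypothetical): the share taken from the neighbor is
    max_mix times the clipped expression similarity, so dissimilar
    neighbors contribute little to the synthetic sample.
    """
    j = rng.choice(knn_indices(coords, i, k))
    sim = min(1.0, max(0.0, cosine_similarity(expression[i], expression[j])))
    lam = max_mix * sim

    def mix(a, b):
        return [(1 - lam) * x + lam * y for x, y in zip(a, b)]

    # The same coefficient is applied to inputs and regression targets.
    return mix(features[i], features[j]), mix(expression[i], expression[j]), j, lam
```

Restricting the partner to spatial neighbors keeps synthetic samples local, and tying the coefficient to expression similarity limits interpolation across tissue boundaries; both choices are readings of the abstract rather than confirmed details.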

2606.08816 2026-06-09 cs.LG cs.AI 新提交

Knowledge Graphs and Reasoning LLMs for Finding Simple Yet Effective Transcriptomic Perturbation Predictors

知识图谱与推理大语言模型用于寻找简单而有效的转录组扰动预测因子

Jake Fawkes, Liam Hodgson, Jason Hartford

发表机构 * University College London(伦敦大学学院) University of Manchester(曼彻斯特大学) Valence Labs(Valence实验室) Recursion(Recursion公司)

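AI summary: A K-nearest-neighbour method over a knowledge graph performs strongly on gene knockout perturbation prediction, and combining it with an LLM optimised via reinforcement learning matches state-of-the-art performance.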

详情
AI中文摘要

预测未见过的基因敲除扰动对转录组基因表达的影响仍然是虚拟细胞模型的一个极具挑战性的问题。最近,通过利用生物知识图谱提供相似扰动的概念,在训练扰动集之外实现了更好的外推。在这项工作中,我们证明了利用这些假设的最简单模型——知识图谱的K近邻——在此任务上取得了极具竞争力的性能,并且通过使用强化学习(RL)优化的LLM可以进一步提高预测性能。具体来说,我们发现K近邻方法在分布外扰动预测上几乎击败了所有方法,而当通过RL训练推理LLM以改变邻域时,它在Replogle等人(2022)的细胞系上获得了与当前最先进方法相当的性能。我们还证明,尽管没有直接训练,RL训练提高了LLM在差异表达预测下游任务上的性能。总体而言,这些发现证明了知识图谱作为模型先验的有效性,并显示出RL可以将LLM精炼为预测复杂生物反应的通用工具的早期迹象。

英文摘要

Predicting the effect of an unseen gene knockout perturbation on transcriptomic gene expression remains a highly challenging problem for virtual cell models. Recent progress has been made by leveraging biological knowledge graphs to provide a notion of similarity between perturbations, allowing for improved extrapolation beyond the set of training perturbations. In this work, we demonstrate that the simplest model to leverage these assumptions, a K-nearest neighbour predictor on the knowledge graph, achieves highly competitive performance on this task, and that this can be improved further using LLMs optimised via reinforcement learning (RL) for predictive performance. Specifically, we find that the K-nearest neighbour approach beats almost all methods on out-of-distribution perturbation prediction, and when a reasoning LLM is trained via RL to make changes to the neighbourhood, it obtains performance equivalent to current state-of-the-art methods on the cell lines from Replogle et al. (2022). We also demonstrate that the RL training improves the LLM's performance on the downstream task of differential expression prediction, despite not being trained on this directly. Overall, these findings demonstrate the efficacy of knowledge graphs as model priors, and show early signs that RL can refine LLMs into generalizable tools for predicting complex biological responses.
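
As a rough illustration of the K-nearest-neighbour baseline, the sketch below predicts the effect of an unseen knockout as the average effect of its closest training perturbations in a graph. The hop-count distance, the toy adjacency format, and the names `graph_distances` and `knn_predict` are assumptions for illustration; the abstract does not specify how graph similarity is computed.

```python
from collections import deque


def graph_distances(graph, source):
    """Breadth-first hop counts from source in an unweighted graph (dict of lists)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbour in graph.get(node, ()):
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def knn_predict(graph, train_effects, query_gene, k=2):
    """Average the expression effects of the k training perturbations nearest to query_gene.

    train_effects maps a perturbed gene to its measured effect vector.
    Ties in distance are broken alphabetically by gene name.
    """
    dist = graph_distances(graph, query_gene)
    ranked = sorted((dist[g], g) for g in train_effects if g in dist and g != query_gene)
    neighbours = [g for _, g in ranked[:k]]
    if not neighbours:
        raise ValueError("no training perturbation is reachable from the query gene")
    n_genes = len(train_effects[neighbours[0]])
    prediction = [
        sum(train_effects[g][i] for g in neighbours) / len(neighbours)
        for i in range(n_genes)
    ]
    return prediction, neighbours
```

In this framing, an RL-trained LLM that "makes changes to the neighbourhood" would amount to editing the `neighbours` list before averaging, which is one plausible reading of the abstract.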

2606.08935 2026-06-09 cs.LG cs.AI 新提交

PAI: Preserving Amplitude Information in Representation-Based Time-Series Anomaly Detection

PAI:在基于表示的时间序列异常检测中保留振幅信息

Kang Zhang, Wei Jian Lau, Shoushou Ren, Dong Lin, Joon Son Chung, Chuanhao Sun

发表机构 * HUAWEI(华为) KAIST(韩国科学技术院)

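AI summary: Existing representation-based time-series anomaly detection methods ignore amplitude information, which degrades performance. PAI addresses this with a diagnostic module and a score augmentation function that fuses amplitude-related scores, raising average VUS-PR by 98.4% and 36.8% on the TSB-AD-U-Eva and TAB UV datasets, respectively.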

Comments 15 pages

详情
AI中文摘要

基于表示的时间序列异常检测算法在多种异常检测任务上显著优于其他方法。然而,我们在评估中发现它们存在一个主要限制——学习到的嵌入通常是振幅无关的。丢失振幅信息会降低与振幅相关异常的性能,并且这种失败普遍存在于所有现有的基于表示的方法中。为了解决上述问题,我们提出了一种新的异常评分方案PAI。PAI由两个互补模块组成:诊断模块和最终分数增强函数。诊断模块比较同一表示库上的余弦评分和欧几里得评分,以测试振幅信息是否已被捕获到学习到的表示中。然后在最终分数增强函数中,PAI计算逐点中位数和MAD偏差分数以及局部均值偏移分数——这些分数与表示分数融合以产生最终异常分数。在TSB-AD-U-Eva和TAB UV数据集上,PAI在所有报告的指标上改进了所有四种评估的基于表示的方法,平均VUS-PR增益分别为98.4%和36.8%。在所有评估的组合中,PaAno + PAI实现了最佳性能,比最先进的方法高出15%。对bootstrap置信区间、异常类型细分以及TS2Vec输入归一化消融的进一步评估进一步支持了所提出的方案。这些结果表明,显式保留振幅信息对于基于表示的时间序列异常检测非常重要,而这一点在现有的评分方案中未得到充分重视。代码可在https://github.com/pantheon5100/PAI获取。

英文摘要

Representation-based time-series anomaly detection algorithms significantly outperform other methods on diverse anomaly detection tasks. However, our evaluation reveals a major limitation: their learned embeddings are often amplitude-agnostic. Losing amplitude information can degrade performance on amplitude-related anomalies, and this failure is prevalent across all existing representation-based methods. To address this issue, we propose a new anomaly scoring scheme named PAI. PAI consists of two complementary modules: a diagnostic module and a final score augmentation function. The diagnostic module compares cosine and Euclidean scoring on the same representation bank to test whether amplitude information is already captured in the learned representation. Then, in the final score augmentation function, PAI computes a point-wise median and MAD deviation score and a local mean-shift score, which are fused with the representation score to produce the final anomaly score. On the TSB-AD-U-Eva and TAB UV datasets, PAI improves all four evaluated representation-based methods across every reported metric, achieving average VUS-PR gains of 98.4% and 36.8%, respectively. Among all evaluated combinations, PaAno + PAI achieves the best performance, outperforming the state-of-the-art method by 15%. Additional evaluations with bootstrap confidence intervals, anomaly-type breakdowns, and a TS2Vec input-normalization ablation further support the proposed scheme. These results suggest that explicitly retaining amplitude information is important for representation-based time-series anomaly detection, a point that has been underemphasized in existing scoring schemes. Code is available at: https://github.com/pantheon5100/PAI
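
The amplitude-related scores named in the abstract can be sketched as follows. This is an illustrative approximation, not the code from the linked repository: the window size, the min-max normalisation, and the element-wise maximum used for fusion are all assumptions, since the abstract does not state how the scores are combined.

```python
import math
import statistics


def mad_deviation_scores(series, eps=1e-9):
    """Point-wise |x - median| scaled by the median absolute deviation (MAD)."""
    med = statistics.median(series)
    mad = statistics.median(abs(x - med) for x in series)
    return [abs(x - med) / (mad + eps) for x in series]


def local_mean_shift_scores(series, window=3):
    """|mean of the next `window` points - mean of the previous `window` points|."""
    n = len(series)
    scores = [0.0] * n
    for t in range(window, n - window + 1):
        left = sum(series[t - window:t]) / window
        right = sum(series[t:t + window]) / window
        scores[t] = abs(right - left)
    return scores


def min_max(values):
    """Rescale to [0, 1]; a constant input maps to all zeros."""
    lo, hi = min(values), max(values)
    return [0.0 if hi == lo else (v - lo) / (hi - lo) for v in values]


def fuse_scores(rep_scores, series, window=3):
    """Hypothetical fusion: element-wise max of the three normalised scores."""
    parts = [
        min_max(rep_scores),
        min_max(mad_deviation_scores(series)),
        min_max(local_mean_shift_scores(series, window)),
    ]
    return [max(vals) for vals in zip(*parts)]


def cosine_distance(u, v):
    """1 - cosine similarity; unchanged when either vector is rescaled."""
    dot = sum(a * b for a, b in zip(u, v))
    return 1.0 - dot / (math.hypot(*u) * math.hypot(*v))


def euclidean_distance(u, v):
    """Straight-line distance; sensitive to vector magnitude."""
    return math.dist(u, v)
```

The two distance helpers show why the diagnostic comparison is informative: scaling a vector leaves its cosine distance to the original at zero while the Euclidean distance grows, so a gap between the two scorings indicates how much amplitude information the representation carries.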

2606.08945 2026-06-09 cs.LG 新提交

From Hazard Functions to Language Space: Cox-Supervised Distillation of Survival Risk into a Large Language Model

从风险函数到语言空间:Cox监督的生存风险蒸馏到大语言模型

Nicholas I-Hsien Kuo, Blanca Gallego, Louisa Jorm

发表机构 * Centre for Big Data Research in Health, the University of New South Wales(新南威尔士大学健康大数据研究中心)

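AI summary: Proposes a method for transferring time-to-event risk information from a Cox proportional hazards model into a large language model by fine-tuning a Qwen model on text prompts; it achieves competitive discrimination and calibration on three datasets, and its hidden states show a continuous risk gradient.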

详情
AI中文摘要

我们研究了Cox比例风险模型估计的时间事件风险信息是否可以迁移到生成式大语言模型中。我们提出了一种基于文本的生存建模流程,其中结构化的临床协变量被转换为文本提示,并微调基于Qwen的大语言模型,以使用Cox模型预测作为训练目标生成患者特定的生存风险。在GBSG2、ACTG320和WHAS500数据集上,尽管该模型是作为文本生成任务而非使用传统的生存分析损失进行训练,但它取得了有竞争力的留出区分度和校准性。我们进一步分析了模型隐藏状态的几何结构,其中t-SNE可视化揭示了潜在空间中的平滑风险梯度,表明模型将生存风险表示为连续结构而非孤立的风险类别。这些发现共同表明,大语言模型可以内化生存风险结构,同时支持校准预测,为语言模型中的时间事件推理提供了一条途径。

英文摘要

We investigate whether information about time-to-event risk estimated by a Cox proportional hazards model can be transferred into a generative large language model. We propose a text-based survival modelling pipeline in which structured clinical covariates are converted into text prompts and a Qwen-based large language model is fine-tuned to generate patient-specific survival risk using Cox model predictions as a training target. Across GBSG2, ACTG320, and WHAS500, the model achieves competitive held-out discrimination and calibration despite being trained as a text-generation task rather than with a conventional survival-analysis loss. We further analyse the geometry of the model's hidden states, where t-SNE visualisations reveal smooth risk gradients in latent space, suggesting that the model represents survival risk as a continuous structure rather than isolated risk categories. Together, these findings suggest that large language models can internalise survival-risk structure while supporting calibrated prediction, providing a route towards time-to-event reasoning in language models.

2606.09030 2026-06-09 cs.LG cs.AI cs.CL 新提交

TRIAGE: Dialectical Reasoning for Explainable Risk Prediction on Irregularly Sampled Medical Time Series with LLMs

TRIAGE: 基于辩证推理的不规则采样医学时间序列风险可解释预测方法

Hyeongwon Jang, Gyouk Chu, Changhun Kim, Joonhyung Park, Hangyul Yoon, Eunho Yang

发表机构 * KAIST(韩国科学技术院) AITRICS University of Wisconsin-Madison(威斯康星大学麦迪逊分校)

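AI summary: Proposes TRIAGE, a framework in which a large language model generates dialectical reasoning over competing clinical outcomes, mitigating risk polarization and yielding continuous risk scores with interpretable reasoning; across three benchmarks it improves AUPRC by 3.3% and reduces calibration error by 81%.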

Comments Code is available at https://github.com/HyeongWon-Jang/TRIAGE

详情
AI中文摘要

基于电子健康记录的临床早期预警系统,其中临床观察记录为不规则采样的医学时间序列(ISMTS),必须提供校准的风险评分用于患者分诊,以及临床医生可验证的可解释理由。大语言模型(LLMs)已被探索用于此任务,但它们将分级临床风险崩溃为过度自信的二元预测。这种风险极化损害了校准性和跨患者可比性。为解决此问题,我们提出TRIAGE框架,该框架训练LLM通过引出特定结果的理由,对竞争性临床结果生成辩证推理。这种辩证公式减轻了风险极化,使单个LLM能够产生基于明确临床推理的连续风险评分。在三个ISMTS基准上评估,TRIAGE相比竞争基线实现了平均AUPRC提升3.3%,校准误差降低81%。LLM作为评判者的评估进一步表明,我们的理由在临床推理质量上比基线的后验解释高出20%。源代码可在https://github.com/HyeongWon-Jang/TRIAGE获取。

英文摘要

Clinical early warning systems built on electronic health records, in which clinical observations are recorded as irregularly sampled medical time series (ISMTS), must deliver both calibrated risk scores for patient triage and interpretable rationales that clinicians can verify. Large Language Models (LLMs) have been explored for this task, yet they collapse graded clinical risk into overconfident binary predictions. This risk polarization undermines both calibration and cross-patient comparability. To address this, we propose TRIAGE, a framework that trains an LLM to generate dialectical reasoning over competing clinical outcomes by eliciting outcome-specific rationales. This dialectical formulation mitigates risk polarization, enabling a single LLM to yield continuous risk scores grounded in explicit clinical reasoning. Evaluated on three ISMTS benchmarks, TRIAGE achieves an average AUPRC improvement of 3.3% and reduces calibration error by 81% compared to the competitive baselines. An LLM-as-a-judge assessment further shows that our rationales surpass post-hoc explanations from the baseline by 20% in clinical reasoning quality. The source code is available at https://github.com/HyeongWon-Jang/TRIAGE.

2606.09065 2026-06-09 cs.LG cs.AI 新提交

OnlyDense: Reduced-Order Modeling for Lagrangian simulation

OnlyDense: 拉格朗日模拟的降阶建模

Tu Do, Shannon Ryan, Santu Rana

发表机构 * Deakin University(迪肯大学)

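AI summary: Proposes a reduced-order modeling framework that treats the state of a particle system as a function in Hilbert space and approximates the state space with a linear subspace spanned by learned neural basis functions, enabling efficient representation and prediction for large-scale Lagrangian simulation; it reaches R^2 > 0.99 on SPH simulations with over one million particles.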

详情
AI中文摘要

在科学和工程中,拉格朗日模拟方法如光滑粒子流体动力学(SPH)或物质点法(MPM)常被用于研究动态系统的行为。然而,这些方法的计算成本可能高得令人望而却步,特别是在模拟多尺度空间或时间现象时,例如宏观几何中的空洞生长和合并、空间碎片颗粒超高速撞击导致的航天器部件结构失效等。与将系统状态理解为离散粒子集合的基于图的方法不同,我们提出了一种学习框架,通过将系统状态视为函数、将其演化视为希尔伯特空间中的轨迹,实现对大规模粒子系统的可扩展表示和动力学建模。我们不将状态表示为离散粒子集或嵌入非线性潜在流形,而是用学习到的神经基函数张成的线性子空间近似状态空间。这种参数化使得可以直接投影获得潜在系数,并显式访问基函数,避免了在非线性潜在空间上的优化。由此得到的表示具有自然的解释:潜在变量对应于希尔伯特空间中的系数,基函数对应于空间模态,类似于本征正交分解。因此,该框架将经典的基于投影的降阶建模与现代深度学习统一起来,同时保持对离散化点数量的不变性。在超过一百万个粒子的大规模SPH模拟(包括具有极端变形和破碎的动态事件)上的实验表明,所提出的方法能够准确重建和预测动力学,仅用32个基函数即可达到超过0.99的R²分数。

英文摘要

In science and engineering, Lagrangian simulation methods such as Smoothed Particle Hydrodynamics (SPH) or the Material Point Method (MPM) are often employed to study the behavior of dynamic systems. However, these methods can be prohibitively computationally expensive, particularly when simulating multi-scale spatial or temporal phenomena, e.g., void growth and coalescence within macro-scale geometries, or structural failure of spacecraft components resulting from hypervelocity impact of space debris particles. In contrast to graph-based methods, where the state of the system is understood as a discrete set of particles, we propose a learning framework for scalable representation and dynamics modeling of massive particle systems by treating the system state as a function and its evolution as a trajectory in Hilbert space. Rather than representing the state as a discrete set of particles or embedding it in a nonlinear latent manifold, we approximate the state space with a linear subspace spanned by learned neural basis functions. This parameterization enables direct projection to obtain latent coefficients and explicit access to the basis functions, avoiding optimization over a nonlinear latent space. The resulting representation admits a natural interpretation: latent variables correspond to coefficients in Hilbert space, and basis functions correspond to spatial modes, analogous to Proper Orthogonal Decomposition. The framework thus unifies classical projection-based reduced-order modeling with modern deep learning, while remaining invariant to the number of discretization points. Experiments on large-scale SPH simulations with over one million particles, including dynamic events with extreme deformation and fragmentation, demonstrate that the proposed method accurately reconstructs and predicts dynamics, achieving an R^2 score above 0.99 with as few as 32 basis functions.

2606.09092 2026-06-09 cs.LG 新提交

From Shortcuts to Reasoning: Robust Post-Training of Theory of Mind with Reinforcement Learning

从捷径到推理:基于强化学习的心理理论鲁棒后训练

Jike Zhong, Yuxiang Lai, Ming Li, Yuheng Li, Wuao Liu, Behzad Dariush, Konstantinos Psounis, Shao-Yuan Lo

发表机构 * University of Science and Technology of China(中国科学技术大学)

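AI summary: To address shortcuts in Theory of Mind post-training, proposes Thinking-RFT, which combines verifiable rewards with explicit reasoning chains and markedly improves reasoning on several shortcut-free datasets, especially in complex higher-order reasoning and multimodal settings.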

Comments Accepted by ICML 2026

详情
AI中文摘要

心理理论(ToM)是现代基础模型系统在现实世界中有效且安全运行必须掌握的技能。最近的工作探索了通过后训练来磨练ToM;然而,我们表明这种进展受到普遍存在的“捷径”问题的干扰:任务可以通过简单地利用虚假的因果相关性达到高达99%的准确率,从而导致对ToM的错误认识。受此启发,我们首先开发了一个框架来系统地检查ToM数据集中的捷径,并为未来发展提供指导。我们发现,可简化为纯状态跟踪的问题(如“信念”)特别容易受到捷径影响,而需要超越跟踪进行推理的心理问题(如“意图”)则不然。使用三个ToM上下文中的四个无捷径数据集,我们全面研究了带有可验证奖励和显式推理链的强化微调(称为Thinking-RFT)是否比监督微调(SFT)更能提升ToM。我们的主要发现如下。首先,Thinking-RFT在所有场景中有效提升ToM,比SFT提高6%,特别是在复杂高阶推理中比SFT提高10%,在多模态情况下比SFT提高7%。它还能更好地泛化到未见领域和高阶查询,同时对反事实更加鲁棒。其次,ToM特别受益于推理和强化学习的联合效应:Thinking-RFT平均比Non-Thinking-RFT高出7%。第三,RFT通过学会将其推理基于与因果因素对应的锚定线索(如关键词和状态变化)来工作。我们相信我们的研究对于开发有效且鲁棒的ToM后训练数据集以及推进关键ToM能力是有用的。

英文摘要

Theory of Mind (ToM) is a must-acquire skill for modern foundation model systems to operate effectively and safely in the real world. Recent works have explored honing ToM via post-training; however, we show that such progress is confounded by a pervasive "shortcut" issue: tasks can reach up to 99% accuracy by simply exploiting spurious causal correlations, leading to a false sense of ToM. Motivated by this, we first develop a framework to systematically examine ToM datasets for shortcuts and provide guidance for future development. We find that questions reducible to pure state tracking, such as "belief," are especially shortcut-prone compared to mind questions, such as "intention," where reasoning beyond tracking is required. Using four shortcut-free datasets across three ToM contexts, we then comprehensively study whether Reinforcement Fine-Tuning with verifiable rewards and explicit reasoning chains, called Thinking-RFT, elevates ToM beyond Supervised Fine-Tuning, or SFT. Our key findings are as follows. First, Thinking-RFT effectively improves ToM in all scenarios, with a 6% improvement over SFT, particularly in complex higher-order reasoning, with a 10% improvement over SFT, and multimodal cases, with a 7% improvement over SFT. It also generalizes notably better to unseen domains and higher-order queries while being more robust to counterfactuals. Second, ToM benefits specifically from the joint effect of reasoning and RL: Thinking-RFT outperforms Non-Thinking-RFT by 7% on average. Third, RFT works by learning to ground its reasoning on anchor cues, such as keywords and state changes, that correspond to causal factors. We believe our study is useful for developing effective and robust ToM post-training datasets and advancing critical ToM capabilities.

2606.09104 2026-06-09 cs.LG cs.AI q-fin.PM 新提交

Addressing Market Regime Changes and Heavy-Tailed Returns in Portfolio Optimization via Bayesian VAR and Elliptical Black-Litterman

通过贝叶斯VAR和椭圆Black-Litterman解决投资组合优化中的市场机制变化和重尾收益问题

Daniil Mikriukov, Ruoyu Sun, Angelos Stefanidis, Jionglong Su, Zhengyong Jiang

发表机构 * University of Liverpool(利物浦大学) Xi'an Jiaotong-Liverpool University(西交利物浦大学)

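AI summary: Proposes the BAVAR-BLED algorithm, which combines Bayesian-averaging vector autoregression with a Black-Litterman model under elliptical distributions to allocate assets adaptively within a TD3 architecture, achieving a Sharpe ratio of 1.72 and a total return of 57.26% on Dow Jones Industrial Average constituents.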

Comments 9 pages, 3 figures, 4 tables. Extends our prior work [Mikriukov et al., ICIC 2025] on Black-Litterman under Elliptical Distributions (BLED). Manuscript under review

详情
AI中文摘要

用于投资组合优化的深度强化学习框架因其能够从市场数据中动态学习分配规则而显示出前景。然而,这些模型未能考虑肥尾收益,而肥尾收益以更频繁的极端事件为特征,描述了实际市场行为。此外,历史数据被同质化处理,未考虑时间重要性,导致模型在机制变化时失效。我们提出了一种新的BAVAR-BLED算法,该算法在TD3架构内结合了源自贝叶斯平均向量自回归(BAVAR)和使用椭圆分布的Black-Litterman模型(BLED)的方法。BAVAR捕获一组考虑多尺度时间特征的向量自回归表示,从而基于对收益预期和离散矩阵的机制感知估计实现自适应分配决策。这些估计作为BLED的先验输入,BLED使用学生t分布,允许更现实的肥尾收益估计。BAVAR-BLED算法使用Transformer网络进行观点构建,使用CNN进行风险厌恶估计,根据市场条件修改动态分配决策。对道琼斯工业平均指数29只成分股在十年市场周期内的评估表明,BAVAR-BLED显著优于最先进的方法,实现了1.72的夏普比率和2.70的索提诺比率,总收益为57.26%。

英文摘要

Deep reinforcement learning (DRL) frameworks for portfolio optimization have shown promise for their ability to learn allocation rules dynamically from market data. However, these models fail to account for fat-tailed returns, which characterize actual market behavior with more frequent extreme events. Furthermore, historical data is treated homogeneously, without accounting for temporal importance, leading models to fail during regime changes. We propose a new BAVAR-BLED algorithm that combines methods derived from Bayesian-Averaging Vector Autoregressive (BAVAR) and the Black-Litterman model using Elliptical Distributions (BLED) within a TD3 architecture. BAVAR captures a set of vector autoregressive representations that consider multi-scale temporal features, enabling adaptive allocation decisions based on regime-aware estimates of return expectations and dispersion matrices. These estimates serve as prior inputs to BLED, a model that uses Student's t-distributions, allowing for more realistic fat-tailed return estimates. The BAVAR-BLED algorithm uses transformer networks for view construction and CNNs for risk-aversion estimates, which modify dynamic allocation decisions based on market conditions. An evaluation of 29 Dow Jones Industrial Average constituents over a decade-long market period shows that BAVAR-BLED significantly outperforms state-of-the-art methods, achieving Sharpe and Sortino ratios of 1.72 and 2.70, respectively, and total returns of 57.26%.
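
For readers unfamiliar with the two reported metrics, the sketch below shows one common way to compute Sharpe and Sortino ratios from a series of periodic returns. The abstract does not say which convention the authors use, so the annualisation factor of 252, the zero risk-free rate, and the zero downside target are assumptions.

```python
import math
import statistics


def sharpe_ratio(returns, risk_free=0.0, periods_per_year=252):
    """Annualised mean excess return divided by its sample standard deviation."""
    excess = [r - risk_free for r in returns]
    volatility = statistics.stdev(excess)
    if volatility == 0:
        raise ValueError("returns have zero volatility")
    return math.sqrt(periods_per_year) * statistics.mean(excess) / volatility


def sortino_ratio(returns, target=0.0, periods_per_year=252):
    """Like the Sharpe ratio, but only returns below `target` count as risk."""
    excess = [r - target for r in returns]
    downside = math.sqrt(sum(min(0.0, e) ** 2 for e in excess) / len(excess))
    if downside == 0:
        raise ValueError("no returns fall below the target")
    return math.sqrt(periods_per_year) * statistics.mean(excess) / downside
```

Because the Sortino ratio penalises only downside deviation, it typically exceeds the Sharpe ratio for a strategy whose large moves are mostly gains, which is consistent with the ordering of the two figures quoted above.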

2606.09160 2026-06-09 cs.LG cs.AI 新提交

Crop Recommendation and Agricultural Query Answering System Using Spatio-Temporal Graph Neural Networks and Hybrid Retrieval Augmentation

基于时空图神经网络与混合检索增强的作物推荐及农业问答系统

Prajwal Thapa, Yagya Raj Pandeya

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

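AI summary: Proposes a precision agriculture system that combines a Spatio-Temporal Graph Convolutional Network (STGCN) with Retrieval-Augmented Generation (RAG) to provide 30-day weather forecasts, crop recommendations, and agricultural question answering; on data from 1,359 locations in Nepal, the STGCN reaches a forecast MSE of about 0.011.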

Comments 11 pages, 8 figures

详情
AI中文摘要

本文提出一个统一系统,旨在通过集成先进的天气预报、作物推荐和面向农民的问答工具来支持精准农业。我们提出了两个深度学习模型——基于Transformer的图神经网络和时空图卷积网络(STGCN)——利用尼泊尔1359个地点的数据预测未来30天的天气状况。STGCN在准确性上优于基于Transformer的模型(MSE约0.011 vs 0.013),有效建模了气候数据中的空间和时间依赖性。这些预测与静态土壤属性(如pH、水分和有机质含量)相结合,通过评分算法生成本地化的作物推荐,该算法匹配每种作物的最佳生长条件。此外,我们开发了一个检索增强生成(RAG)聊天机器人,利用领域特定的农业文档以自然语言回答农民的问题。整个系统通过移动应用程序部署,提供实时建议和对话支持。用户反馈证实了系统的可用性和相关性,尤其是在个性化农业指导有限的农村环境中。总体而言,我们的方法展示了如何将机器学习模型与当地农业数据相结合,为农民提供可操作的见解,促进更明智的决策、更高的作物产量和增强对气候变异的适应能力。

英文摘要

This paper presents a unified system designed to support precision agriculture by integrating advanced weather prediction, crop recommendation, and a question-answering tool for farmers. We propose two deep learning models -- a Transformer-based Graph Neural Network and a Spatio-Temporal Graph Convolutional Network (STGCN) -- to forecast weather conditions for the next 30 days using data from 1,359 locations in Nepal. The STGCN outperforms the Transformer-based model in accuracy (MSE ~0.011 vs. 0.013), effectively modeling both spatial and temporal dependencies in climate data. These predictions are combined with static soil properties such as pH, moisture, and organic content to generate localized crop recommendations through a scoring algorithm that matches each crop's optimal growing conditions. Additionally, we develop a Retrieval-Augmented Generation (RAG) chatbot that leverages domain-specific agricultural documents to answer farmers' questions in natural language. The entire system is deployed via a mobile application, offering real-time suggestions and conversational support. User feedback confirms the system's usability and relevance, especially in rural settings where personalized farming guidance is limited. Overall, our approach demonstrates how combining machine learning models with local agricultural data can empower farmers with actionable insights, promoting more informed decisions, better crop yields, and increased resilience to climate variability.

2606.09313 2026-06-09 cs.LG stat.AP 新提交

Machine-Learning Emulation of Satellite Greenhouse Gas Retrievals: Stability over Time

卫星温室气体反演的机器学习仿真:时间稳定性

Nugzar Gognadze, Motonobu Kanagawa, Yu Someya, Hisashi Yashiro

发表机构 * EURECOM National Institute for Environmental Studies(国立环境研究所)

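AI summary: Studies the stability over time of machine-learning emulators of satellite greenhouse gas retrieval algorithms. Prediction accuracy declines as the test period moves away from the training period, adding time as a feature improves XCH4 prediction for Lasso and neural-network models, and a simple Lasso model matches or outperforms more complex methods while being more stable.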

Comments 48 pages, 9 figures, 15 tables

详情
AI中文摘要

反演算法通过求解高光谱分辨率卫星辐射测量值的逆问题,用于估算二氧化碳(CO2)和甲烷(CH4)等温室气体(GHGs)的大气浓度。然而,这些算法计算成本高,使得大规模实时估算变得困难。因此,机器学习模型被提出作为反演算法的快速仿真器。然而,现有大多数研究仅使用与训练数据同期的测试数据评估它们。我们利用温室气体观测卫星(GOSAT)的数据研究此类仿真器的时间稳定性。我们表明,当测试期远离训练期时,预测精度通常会下降。我们还表明,将时间作为输入特征显著改善了Lasso和神经网络模型的XCH4预测。在所考虑的方法中,简单的Lasso模型表现与神经网络等更复杂的方法相当或更好,并且随时间产生更稳定的预测。我们利用地面观测网络——总碳柱观测网络(TCCON)进一步验证了结果。在TCCON匹配数据集上,时间增强的Lasso模型对TCCON的误差与GOSAT和TCCON之间在XCO2和XCH4上的差异相当。

英文摘要

Retrieval algorithms are used to estimate atmospheric concentrations of greenhouse gases (GHGs), such as carbon dioxide (CO2) and methane (CH4), by solving inverse problems from high-spectral-resolution satellite radiance measurements. However, these algorithms are computationally expensive, which makes real-time estimation at scale difficult. Machine-learning models have therefore been proposed as fast emulators of retrieval algorithms. Most existing studies, however, evaluate them only on test data from the same period as the training data. We study the stability over time of such emulators using data from the Greenhouse Gases Observing SATellite (GOSAT). We show that prediction accuracy generally deteriorates when the test period moves away from the training period. We also show that including time as an input feature substantially improves XCH4 prediction for Lasso and neural-network models. Among the methods considered, a simple Lasso model performs as well as or better than more complex methods such as neural networks, and yields more stable predictions over time. We further validate the results using the Total Carbon Column Observing Network (TCCON), a ground-based observation network. On the TCCON-matched dataset, the time-augmented Lasso achieves errors against TCCON that are comparable to the disagreement between GOSAT and TCCON for both XCO2 and XCH4.

2606.09327 2026-06-09 cs.LG cs.AI 新提交

A Universal Dense Football Event Representation Based on TabTransformer

基于TabTransformer的通用密集足球事件表示

Weiran Yang, Daniel Memmert, Maximilian Klemp-Weins

发表机构 * Institute of Exercise Training and Sport Informatics, German Sport University Cologne(科隆德国体育大学运动训练与体育信息学研究所)

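AI summary: Proposes a TabTransformer-based model that learns embedding vectors for categorical features to produce dense representations of football events, outperforming baselines on downstream tasks.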

Comments 12 pages, 1 figure. Preprint submitted to the 13th Workshop on Machine Learning and Data Mining for Sports Analytics (MLSA 2026)

详情
AI中文摘要

足球事件数据为团队运动中球员动作的定量分析提供了丰富的时空来源。这些数据集包含异构特征,将连续的位置坐标与分类变量(如动作类型、动作结果和身体部位)相结合。此类数据已应用于体育分析中的比赛结果预测、球员评估和战术模式识别。然而,现有方法主要使用独热或序数嵌入表示来编码分类特征,忽略了动作描述符的内在语义。Transformer是一种基于自注意力的深度神经网络架构,能够捕获输入特征在任意位置之间的依赖关系。我们提出并实现了一个基于Transformer的模型,以学习分类事件特征之间的潜在依赖关系,并生成足球事件的密集表示。通过将分类特征编码为学习到的嵌入向量,在预训练期间捕获了特定于运动的动作语义,使得表示能够支持下游任务,如动作价值估计和比赛风格识别。实证评估表明,在下游预测任务中,嵌入表示在概率校准方面优于任务特定基线,如Brier分数所衡量的。

英文摘要

Football event data constitute a rich spatiotemporal source for quantitative analysis of player actions in team sports. These datasets contain heterogeneous features, combining continuous location coordinates with categorical variables such as action type, action outcome, and body part. Such data have been applied in sports analytics for match outcome forecasting, player evaluation, and tactical pattern recognition. However, existing approaches predominantly encode categorical features using one-hot or ordinal embedding representations, overlooking the intrinsic semantics of action descriptors. The Transformer is a deep neural network architecture based on self-attention that captures dependencies between input features at arbitrary positions. We propose and implement a Transformer-based model to learn latent dependencies among categorical event features and produce dense representations of football events. By encoding categorical features as learned embedding vectors, sport-specific action semantics are captured during pretraining, enabling the representations to support downstream tasks such as action value estimation and play style recognition. Empirical evaluation shows that the embedding representations yield superior probability calibration over task-specific baselines on the downstream prediction tasks, as measured by Brier score.
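
Since the evaluation above is reported in terms of the Brier score, a short sketch of that metric for binary outcomes may help. The function name and the example inputs are illustrative; the formula itself is the standard mean squared difference between predicted probabilities and observed outcomes.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probabilities and 0/1 outcomes.

    Lower is better; always predicting 0.5 scores 0.25.
    """
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(probabilities)
```

A model can rank events well and still be poorly calibrated, so a lower Brier score specifically indicates that the predicted probabilities sit closer to the observed frequencies.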

2606.09434 2026-06-09 cs.LG 新提交

Operator learning for solving Fokker-Planck equations with various initial conditions

算子学习求解不同初始条件下的福克-普朗克方程

Li Zeng, Xiaoliang Wan, Yaobin Wang, Fabio Nobile, Tao Zhou

发表机构 * Fuzhou University(福州大学) Louisiana State University(路易斯安那州立大学) Beijing Normal-Hong Kong Baptist University(北京师范大学-香港浸会大学联合国际学院) École Polytechnique Fédérale de Lausanne(洛桑联邦理工学院) Chinese Academy of Sciences(中国科学院)

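AI summary: Proposes a physics-informed neural network framework based on conditional normalizing flows that uses the Chapman-Kolmogorov equation and a linearized-SDE base distribution to efficiently approximate the FPE solution operator across many initial conditions, with a time-weighted loss function to address small-time instability.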

详情
AI中文摘要

福克-普朗克方程(FPE)在描述由随机动力学支配的系统概率密度函数(PDF)的时间演化中起着关键作用。本文提出了一种基于条件归一化流的物理信息神经网络(PINN)框架,用于高效逼近整个初始条件范围内FPE的解算子。利用马尔可夫随机过程的Chapman-Kolmogorov方程,将问题重新表述为逼近从任意点狄拉克质量开始的初始时刻的转移PDF。采用关联线性化随机微分方程(SDE)的PDF作为归一化流的基分布,该分布提供了目标PDF的良好近似,特别是在小时间尺度下,从而避免了与狄拉克δ初始分布相关的映射奇异性。此外,引入时间加权损失函数以减轻小时间尺度下出现的数值不稳定性,在时间推进过程中实现因果性与训练难度之间的平衡。通过多种数值实验展示了所提方法的有效性和鲁棒性。

英文摘要

The Fokker-Planck equation (FPE) plays a pivotal role in describing the time evolution of probability density functions (PDFs) for systems governed by stochastic dynamics. In this work, we propose a conditional normalizing flow-based physics-informed neural network (PINN) framework for efficiently approximating the solution operator of the FPE for a whole range of initial conditions. Leveraging the Chapman-Kolmogorov equation for Markovian stochastic processes, the problem is reformulated into approximating a transition PDF starting at initial time from a Dirac mass centered at an arbitrary point. The PDF of an associated linearized stochastic differential equation (SDE) is employed as the base distribution for the normalizing flow, providing a good approximation of the target PDF, especially for small times, and thereby avoiding the singularity of the map associated with the Dirac delta initial distribution. Furthermore, a time-weighted loss function is introduced to mitigate numerical instabilities arising at small times, achieving a balance between causality and training difficulty as time progresses. A variety of numerical experiments are presented to illustrate the effectiveness and robustness of the proposed method.
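
The reformulation through the Chapman-Kolmogorov equation can be written out explicitly. The notation below (drift f, diffusion matrix sigma, initial density p_0) is assumed for illustration and is not taken from the paper.

```latex
% Assumed setting: X_t solves the SDE dX_t = f(X_t)\,dt + \sigma(X_t)\,dW_t.
\begin{align}
\partial_t p(x,t) &= -\nabla\cdot\big(f(x)\,p(x,t)\big)
  + \tfrac{1}{2}\sum_{i,j}\partial_{x_i}\partial_{x_j}
    \big([\sigma\sigma^{\top}]_{ij}(x)\,p(x,t)\big), \\
p(x,t) &= \int p(x,t \mid y,0)\,p_0(y)\,\mathrm{d}y,
  \qquad p(x,0 \mid y,0) = \delta(x-y).
\end{align}
```

Under this reading, a single model of the transition density conditioned on the starting point y is enough: the solution for any initial density p_0 follows from the integral in the second line, which is why the abstract speaks of learning a solution operator rather than a single solution.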

2606.09480 2026-06-09 cs.LG 新提交

Loss-Guided Adaptive Scale Refinement for Molecular Force Prediction

损失引导的自适应尺度细化用于分子力预测

Limin Yu

发表机构 * Tianjin Medical University(天津医科大学)

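AI summary: Proposes a loss-guided adaptive scale refinement framework that automatically discovers task-effective scales through interpolation, routing, and scale pool updates, reducing force prediction error on an aqueous NaCl system.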

Comments 23 pages, 2 figures, 6 tables. Preprint on adaptive scale refinement for molecular force prediction

详情
AI中文摘要

分子系统涉及多个空间尺度的相互作用,从局部配位和短程扰动到长程静电和溶剂介导效应。然而,大多数分子表示学习方法依赖于手动预定义的尺度,而任务最优建模尺度可能与这些固定水平不一致。本研究引入了一个损失引导的自适应尺度细化框架用于分子力预测,将预定义尺度视为初始锚点,通过插值、路由、可微尺度更新和尺度池细化来发现任务有效的分辨率。使用NaCl水溶液离子系统作为最小测试平台,构建了短尺度和长程力预测分支并分析了它们的互补性。Oracle硬路由将整体力MAE从399.65降低到382.67,而连续Oracle插值进一步将其降低到380.96。在最近离子距离低于0.6 nm的紧密接触区域,紧密接触MAE从327.22降低到260.51。一个最小尺度池更新实验表明,从端点锚点{0,1}开始,损失引导的更新自动生成中间尺度并恢复了大部分连续Oracle性能。最终更新的尺度池{0,0.125,0.25,0.375,0.5,0.75,1}实现了381.23的整体MAE。这些结果支持自适应尺度细化作为分子表示学习的一个有前景的方向,特别是当固定尺度建模不足时。

英文摘要

Molecular systems involve interactions across multiple spatial scales, from local coordination and short-range perturbations to long-range electrostatic and solvent-mediated effects. However, most molecular representation learning methods rely on manually predefined scales, and the task-optimal modeling scale may not coincide with these fixed levels. This study introduces a loss-guided adaptive scale refinement framework for molecular force prediction, treating predefined scales as initial anchors and discovering task-effective resolutions through interpolation, routing, differentiable scale updates, and scale pool refinement. Using a NaCl aqueous ionic system as a minimal testbed, this study constructs short-scale and long-range force prediction branches and analyzes their complementarity. Oracle hard routing reduces the overall force MAE from 399.65 to 382.67, while continuous oracle interpolation further reduces it to 380.96. In close-contact regimes with nearest-ion distance below 0.6 nm, the close-contact MAE decreases from 327.22 to 260.51. A minimal scale pool update experiment shows that starting from endpoint anchors {0,1}, loss-guided updates automatically generate intermediate scales and recover most of the continuous oracle performance. The final updated scale pool {0,0.125,0.25,0.375,0.5,0.75,1} achieves an overall MAE of 381.23. These results support adaptive scale refinement as a promising direction for molecular representation learning, especially when fixed-scale modeling is insufficient.
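The oracle routing, oracle interpolation, and scale-pool update described in the abstract can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the scalar toy "forces", the convention that scale 0 is the short-scale branch and scale 1 the long-range branch, the linear blend between branches, and the rule that a midpoint is kept only if it lowers the oracle MAE. The paper's actual update rule is not given in the abstract.

```python
def interpolate(short_pred, long_pred, alpha):
    # Assumed convention: alpha = 0 is the short-scale branch, alpha = 1 the long-range branch.
    return (1 - alpha) * short_pred + alpha * long_pred


def oracle_mae(short_preds, long_preds, targets, scale_pool):
    """MAE when an oracle picks, per sample, the best scale available in scale_pool."""
    errors = [
        min(abs(interpolate(s, l, a) - t) for a in scale_pool)
        for s, l, t in zip(short_preds, long_preds, targets)
    ]
    return sum(errors) / len(errors)


def refine_step(short_preds, long_preds, targets, scale_pool, tol=1e-9):
    """Hypothetical loss-guided update: keep a midpoint only if it lowers the oracle MAE."""
    pool = sorted(scale_pool)
    base = oracle_mae(short_preds, long_preds, targets, pool)
    accepted = []
    for a, b in zip(pool, pool[1:]):
        mid = (a + b) / 2
        if base - oracle_mae(short_preds, long_preds, targets, pool + [mid]) > tol:
            accepted.append(mid)
    return sorted(pool + accepted)


# Toy scalar data (made up, not from the paper).
short_preds = [1.0, 2.0, 3.0, 4.0]
long_preds = [3.0, 2.0, 5.0, 0.0]
targets = [3.0, 2.0, 4.0, 3.0]

pool = [0.0, 1.0]                      # endpoint anchors: oracle hard routing
hard_mae = oracle_mae(short_preds, long_preds, targets, pool)
pool = refine_step(short_preds, long_preds, targets, pool)
pool = refine_step(short_preds, long_preds, targets, pool)
refined_mae = oracle_mae(short_preds, long_preds, targets, pool)
```

On this toy data the pool grows unevenly, because a midpoint is added only where it helps. That is one plausible reading of why the reported final pool is denser near 0 than near 1.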

2606.09623 2026-06-09 cs.LG 新提交

Constrained user-item allocation for e-commerce marketing campaigns

面向电子商务营销活动的约束用户-物品分配

Maja Lindström, Natalija Glisovic, Jan von Pichowski, Tommy Löfstedt, Martin Rosvall

发表机构 * Umeå University(于默奥大学) KTH Royal Institute of Technology(皇家理工学院) University of Würzburg(维尔茨堡大学)

AI总结 提出自动定向方法,通过约束谱双聚类、贪心局部搜索和多臂老虎机框架联合选择用户和物品构建多个不重叠营销活动,在合成数据、Amazon评论和商业数据上优于模拟退火。

详情
AI中文摘要

在开展营销活动时,零售商必须决定推广哪些产品以及针对哪些用户。这些决策本质上是耦合的:有效的活动将具有强烈相互亲和力的用户和物品匹配到预定义大小的非重叠组中。然而,现有方法假设预定义的活动结构或将物品选择与用户分配解耦,无法直接从联合交互模式中发现活动分组。因此,我们将该活动问题形式化为自动定向:联合选择用户和物品以构建多个不相交的活动。为了解决这个组合问题,我们提出了三种互补策略:(i)约束谱双聚类,以在用户-物品亲和力矩阵中找到密集区域;(ii)具有成对交换的贪心局部搜索,用于组合优化;(iii)多臂老虎机框架,通过探索逃离局部最优。我们在合成数据集、Amazon Reviews基准测试和大规模专有商业数据上评估了这些方法,并将结果与模拟退火基线进行比较。结果表明,双聚类始终获得最高的活动质量、提升度和公平性得分。虽然双聚类在较小数据集上运行高效,但在非常大的数据集上其运行时间显著增加,而基于老虎机的方法则提供了可扩展的替代方案。

英文摘要

When running marketing campaigns, retailers must decide which products to promote and which users to target. These decisions are inherently coupled: effective campaigns match users and items with strong mutual affinity into non-overlapping groups of predefined sizes. However, existing approaches assume predefined campaign structure or decouple item selection from user assignment, and cannot discover campaign groupings directly from joint interaction patterns. We therefore formalize this campaign problem as auto-targeting: jointly selecting users and items to construct multiple disjoint campaigns. To solve this combinatorial problem, we propose three complementary strategies: (i) constrained spectral biclustering to find dense regions in the user-item affinity matrix, (ii) greedy local search with pairwise swaps for combinatorial refinement, and (iii) a multi-armed bandit framework to escape local optima through exploration. We evaluate these methods on a synthetic dataset, the Amazon Reviews benchmarks, and large-scale proprietary commercial data, and compare the results to simulated annealing as a baseline. The results show that biclustering consistently achieves the highest campaign quality, lift, and fairness scores. While biclustering runs efficiently on smaller datasets, its runtime increases substantially on very large ones, where bandit-based methods instead offer a scalable alternative.
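Strategy (ii), greedy local search with pairwise swaps, can be sketched as follows. This is a simplified illustration rather than the authors' implementation: it assumes the item groups are fixed and only users are swapped between campaigns, and it scores a campaign as the sum of user-item affinities inside it. The names `campaign_score` and `greedy_swap_search` are hypothetical.

```python
def campaign_score(affinity, user_groups, item_groups):
    """Total affinity over all (user, item) pairs that share a campaign."""
    return sum(
        affinity[u][i]
        for users, items in zip(user_groups, item_groups)
        for u in users
        for i in items
    )


def greedy_swap_search(affinity, user_groups, item_groups):
    """Swap pairs of users across campaigns while the total score strictly improves.

    Group sizes never change, so the predefined campaign sizes are preserved.
    """
    groups = [list(g) for g in user_groups]
    best = campaign_score(affinity, groups, item_groups)
    improved = True
    while improved:
        improved = False
        for g1 in range(len(groups)):
            for g2 in range(g1 + 1, len(groups)):
                for a in range(len(groups[g1])):
                    for b in range(len(groups[g2])):
                        groups[g1][a], groups[g2][b] = groups[g2][b], groups[g1][a]
                        score = campaign_score(affinity, groups, item_groups)
                        if score > best:
                            best = score
                            improved = True
                        else:
                            # Undo a swap that does not help.
                            groups[g1][a], groups[g2][b] = groups[g2][b], groups[g1][a]
    return groups, best


# Toy affinity matrix: users 0-1 like items 0-1, users 2-3 like items 2-3.
affinity = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 1],
]
item_groups = [[0, 1], [2, 3]]
start = [[0, 2], [1, 3]]               # deliberately poor initial assignment
final_groups, final_score = greedy_swap_search(affinity, start, item_groups)
```

Because only strictly improving swaps are accepted, the search stops at a local optimum. That limitation is what the abstract's bandit-based exploration is meant to address.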

2606.09638 2026-06-09 cs.LG cs.SC math-ph math.MP physics.comp-ph stat.AP 新提交

Data-driven discovery of governing differential equations across physical systems

跨物理系统的控制微分方程数据驱动发现

Siyu Lou, Hao Xu, Wenguan Wang, Lu Lu, Hao Sun, Yang Liu, Linfeng Zhang, Dongxiao Zhang, Yuntian Chen

发表机构 * School of Computer Science, Shanghai Jiao Tong University(上海交通大学计算机科学与工程学院) Ningbo Key Laboratory of Advanced Manufacturing Simulation, Eastern Institute of Technology(东部理工学院宁波先进制造仿真重点实验室) The State Key Lab of Brain-Machine Intelligence, Zhejiang University(浙江大学脑机智能全国重点实验室) Department of Statistics and Data Science, Yale University(耶鲁大学统计与数据科学系) Department of Chemical and Environmental Engineering, Yale University(耶鲁大学化学与环境工程系) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院) School of Engineering Sciences, University of Chinese Academy of Sciences(中国科学院大学工程科学学院) DP Technology

AI总结 本文提出问题导向视角,通过二维相图组织方程可发现性,并引入表示-评估-优化(REO)框架抽象发现过程,旨在从数据中推断物理定律,推动理论修正与新概念形成。

详情
AI中文摘要

微分方程在科学发现中扮演关键角色,因为它们提供了描述物理现象行为的数学框架。作为传统第一性原理的有前景替代,数据驱动微分方程发现因其直接从实验或模拟数据推断控制定律的能力而日益受到关注,尤其是在底层物理机制不明确时。然而,该领域沿着多样化的方法论方向迅速扩展,特别是随着基于AI的方法的出现,仍缺乏清晰的组织视角。在本综述中,我们提出数据驱动微分方程发现的问题导向视角。首先引入方程可发现性的二维相图,其中发现问题根据结构复杂性和系数复杂性进行组织。该相图展示了该领域如何从稀疏方程与简单系数的发现转向具有更丰富结构和更灵活参数化的更复杂控制定律。它还阐明了为什么不同的方法论家族在不同问题设置中成功或失败。然后,我们提出表示-评估-优化(REO)框架作为发现过程的基本抽象。通过识别跨算法变体持续存在的方程发现核心问题,REO将讨论从单个算法转向决定可发现性的基本原理。我们将这些视角与物理学及相邻科学的应用联系起来,并认为下一个挑战不仅仅是恢复方程,而是利用它们来修正现有理论、提炼机制并形成新的科学概念。

英文摘要

Differential equations play a critical role in scientific discovery because they provide a mathematical framework to describe the behaviour of physical phenomena. As a promising alternative to traditional first principles, data-driven differential equation discovery has attracted increasing attention for its ability to infer governing laws directly from experimental or simulated data, especially when the underlying physics is unclear. However, the field has expanded rapidly along diverse methodological directions, particularly with the emergence of AI-based approaches, and still lacks a clear organizing perspective. In this Review, we propose a problem-oriented perspective on data-driven differential equation discovery. We first introduce a two-dimensional phase diagram of equation discoverability, where discovery problems are organized according to structural complexity and coefficient complexity. This phase diagram shows how the field has moved from the discovery of sparse equations with simple coefficients toward more complex governing laws with richer structures and more flexible parameterizations. It also clarifies why different methodological families succeed or fail in different problem settings. We then present the representation-evaluation-optimization (REO) framework as a fundamental abstraction of the discovery process. By identifying the core problems of equation discovery that persist across algorithmic variations, REO shifts the discussion from individual algorithms to the fundamental principles that determine discoverability. We connect these perspectives to applications across physics and adjacent sciences, and argue that the next challenge is not merely recovering equations, but using them to revise existing theories, distil mechanisms and form new scientific concepts.

2606.09671 2026-06-09 cs.LG cs.AI 新提交

Transition-Based Digital Twin Modelling for Alzheimer's Disease under Sparse Longitudinal Data

稀疏纵向数据下基于转换的阿尔茨海默病数字孪生建模

Yinyu Huang, Yilin Zhang, Sofia Michopoulou, Christopher Kipps, Rahman Attar

发表机构 * University of Southampton(南安普顿大学) University Hospital Southampton NHS Foundation Trust(南安普顿大学医院NHS基金会信托) Faculty of Medicine, University of Southampton(南安普顿大学医学院)

AI总结 针对阿尔茨海默病进展异质性和数据稀疏问题,提出结合局部转换建模与序列建模的数字孪生框架,利用多模态纵向数据预测认知状态并量化不确定性,在ADNI数据上表现优异。

Comments 13 pages, 5 figures, 3 tables. Accepted as a full-length paper at the International Conference on AI in Healthcare (AIiH) 2026

详情
AI中文摘要

阿尔茨海默病(AD)进展具有高度异质性,通常通过稀疏且不规则的纵向数据观察,给预测和个性化监测带来挑战。现有的机器学习方法利用多模态数据改进了AD预测,但往往侧重于静态分类或队列级风险估计,对个体特异性建模和不确定性推理的支持有限。为了解决这些局限性,我们提出了一种个性化数字孪生框架,用于AD预测和基于场景的分析,利用多模态纵向数据。该方法整合了互补的建模策略,以捕捉临床转换和跨访视的时间依赖性。使用阿尔茨海默病神经影像学倡议(ADNI)的数据,包括认知评估、临床变量和MRI衍生的表型,该框架预测认知状态和诊断类别,同时量化预测不确定性并实现患者特定的假设轨迹分析。在无泄漏的受试者级别分割上的评估表明,在评分预测和诊断分类方面表现强劲。在这种稀疏且不规则的ADNI设置中,相邻访视的基于转换的建模比基于序列的分支实现了更高的预测准确性,表明局部转换建模可能更数据高效。虽然序列模型对于不确定性感知的轨迹预测仍然有价值,但局部转换建模提供了一种更数据高效且稳健的预测策略。这些发现强调了将时间建模策略与临床数据结构对齐的重要性,并表明基于转换的数字孪生公式可能为神经退行性疾病的个性化预测提供一种实用且可解释的方法。

英文摘要

Alzheimer's disease (AD) progression is highly heterogeneous and is typically observed through sparse and irregular longitudinal data, posing challenges for prediction and personalised monitoring. Existing machine learning approaches have improved AD prediction using multimodal data, yet often focus on static classification or cohort-level risk estimation, providing limited support for subject-specific modelling and uncertainty-aware reasoning. To address these limitations, we present a personalised digital twin framework for AD prediction and scenario-based analysis using multimodal longitudinal data. The proposed approach integrates complementary modelling strategies to capture clinical transitions and temporal dependencies across visits. Using data from the Alzheimer's Disease Neuroimaging Initiative (ADNI), including cognitive assessments, clinical variables, and MRI-derived phenotypes, the framework predicts cognitive status and diagnostic categories while quantifying predictive uncertainty and enabling patient-specific what-if trajectory analysis. Evaluation on leak-free subject-level splits demonstrates strong performance in score forecasting and diagnosis classification. In this sparse and irregular ADNI setting, transition-based modelling of adjacent visits achieved higher predictive accuracy than the sequence-based branch, suggesting that local transition modelling may be more data-efficient. While sequence models remain valuable for uncertainty-aware trajectory forecasting, local transition modelling offers a more data-efficient and robust predictive strategy. These findings highlight the importance of aligning temporal modelling strategies with clinical data structure and suggest that transition-based digital twin formulations may provide a practical and interpretable approach for personalised disease forecasting in neurodegenerative disorders.

2606.09787 2026-06-09 cs.LG cs.NI 新提交

Zero Touch Predictive Orchestration: Automating Time-Series Models for the Cloud-Edge Continuum

零接触预测性编排:为云边连续体自动化时间序列模型

Abd Elghani Meliani, Arora Sagar, Adlen Ksentini, Raymond Knopp

发表机构 * Eurecom OpenAirInterface

AI总结 针对云边连续体中节点冷启动问题,提出一种结合数据混合与神经架构搜索的自动化时间序列预测架构,有效提升预测精度并加速收敛。

Comments 19 pages, 14 figures

详情
AI中文摘要

云边连续体(CEC)通过将资源分布到远边缘来支持延迟关键型应用,但其极端波动性使得通过时间序列预测进行主动零接触管理至关重要。然而,编排器面临严重的“冷启动”问题:新发现的节点缺乏训练局部预测模型所需的历史数据,而通用模型无法捕捉独特的硬件和微服务行为。为解决此问题,我们提出了一种由新颖的数据混合方法驱动的全自动时间序列预测架构。在基础设施层面,我们引入了一个轻量级、技术无关的资源暴露器(RE),它动态发现节点并持续收集可定制的遥测数据(例如,计算、网络、能源)。为了克服这些初始局部样本的稀疏性,我们的框架自动将它们与TimeTrack(我们公开的高分辨率数据集,以45秒间隔收集)合并。这协同了TimeTrack的基础高频时间模式与局部节点数据的精确校准。通过神经架构搜索(NAS)引擎处理,系统自动生成高精度的基线模型。实验结果表明,将目标数据与TimeTrack合并有效缓解了冷启动挑战。与仅使用稀疏局部样本训练、仅使用通用数据集训练或将目标数据与标准替代数据集混合相比,这种集成显著提高了以均方误差(MSE)、平均绝对误差(MAE)和平均绝对百分比误差(MAPE)衡量的预测准确性,并加速了收敛,为持续MLOps部署奠定了坚实基础。

英文摘要

The Cloud-Edge Continuum (CEC) enables latency-critical applications by distributing resources to the far edge, but its extreme volatility makes proactive Zero Touch Management via time-series forecasting essential. However, orchestrators face a severe "cold start" problem: newly discovered nodes lack the historical data required to train localized predictive models, while generalized models fail to capture unique hardware and microservice behaviors. To solve this, we propose a fully automated time-series prediction architecture driven by a novel data-mixing methodology. At the infrastructure level, we introduce a lightweight, technology-agnostic Resource Exposer (RE) that dynamically discovers nodes and continuously collects customizable telemetry (e.g., compute, network, energy). To overcome the sparsity of these initial local samples, our framework automatically merges them with TimeTrack, our publicly available, high-resolution dataset collected at 45-second intervals. This synergizes TimeTrack's foundational, high-frequency temporal patterns with the precise calibration of the local node data. Processed through a Neural Architecture Search (NAS) engine, the system automatically generates highly accurate baseline models. Experimental results demonstrate that merging the target data with TimeTrack effectively mitigates the cold start challenge. This integration significantly improves forecasting accuracy measured in Mean Squared Error (MSE), Mean Absolute Error (MAE), and Mean Absolute Percentage Error (MAPE) and accelerates convergence compared to training on the sparse local samples alone, training solely on generic datasets, or mixing the target data with standard alternative datasets, establishing a robust foundation for continuous MLOps deployment.
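For reference, the three accuracy metrics named in the abstract can be computed as in the sketch below. The telemetry values are invented for illustration and the function names are not taken from the paper.

```python
def mse(actual, predicted):
    """Mean Squared Error."""
    return sum((a - p) ** 2 for a, p in zip(actual, predicted)) / len(actual)


def mae(actual, predicted):
    """Mean Absolute Error."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual, predicted):
    """Mean Absolute Percentage Error, in percent; undefined if any actual value is 0."""
    return 100 * sum(abs((a - p) / a) for a, p in zip(actual, predicted)) / len(actual)


# Invented CPU-usage readings and forecasts.
actual = [100.0, 200.0, 400.0]
predicted = [110.0, 180.0, 400.0]

scores = {
    "MSE": mse(actual, predicted),
    "MAE": mae(actual, predicted),
    "MAPE": mape(actual, predicted),
}
```

MAPE divides by the actual value, so it is unstable for telemetry that can reach zero (for example, idle network throughput). This is one reason forecasting work often reports it alongside MSE and MAE.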

2606.07520 2026-06-09 cs.CL cs.LG 交叉投稿

TinyJudge: Unverifiable Constraint Alignment via Lightweight Specialist Ensembles

TinyJudge: 通过轻量级专家集成实现不可验证约束对齐

Yirong Zeng, Yufei Liu, Xiao Ding, Yutai Hou, Yuxian Wang, Wu Ning, Haonan Song, Dandan Tu, Qixun Zhang, Yuxiang He, Bibo Cai, Ting Liu

发表机构 * Harbin Institute of Technology SCIR Lab(哈尔滨工业大学SCIR实验室) Peking University(北京大学) Huawei Technologies Co., Ltd(华为技术有限公司)

AI总结 针对LLM遵循不可验证约束时奖励黑客和计算开销大的问题,提出TinyJudge框架,利用多个小型语言模型集成提供奖励,在五个基准上平均性能提升约10%,奖励精度提升12%,训练速度提升3倍。

Comments ACL 2026 Main Conference;15 pages, 9 figures

详情
AI中文摘要

指令遵循(IF)是LLM的核心能力,要求严格遵守从可验证(如输出长度)到不可验证(如语气)的多种约束。基于可验证奖励的强化学习已成为IF任务的范式,利用LLM作为裁判来评估不可验证约束。然而,我们实验发现该方法仍存在显著瓶颈,遭受严重的奖励黑客和更高的计算开销。本文首先分析不可验证约束的泛化能力,发现特定约束表现出独特的高泛化模式。受此启发,我们提出TinyJudge框架,采用专门的小型语言模型集成(约0.6B)为软约束提供奖励。通过将前沿模型的知识蒸馏到这些小型模型中,实现了高精度、轻量级的评估。在五个基准上的广泛评估表明,TinyJudge在平均性能上比基线高出约10%,奖励精度高出12%。关键的是,它还在总训练时间上实现了3倍的加速。我们的工作为将LLM与不可验证的人类指令对齐提供了一条可扩展且稳健的路径。

英文摘要

Instruction Following (IF) is a core capability of LLMs, requiring strict adherence to diverse constraints, ranging from verifiable ones (e.g., output length) to unverifiable ones (e.g., tone). Reinforcement learning with verifiable rewards has emerged as a paradigm for IF tasks, leveraging LLM-as-a-judge to assess unverifiable constraints. However, we empirically find that this approach remains a significant bottleneck, suffering from severe reward hacking and higher computational overhead. In this work, we first analyze the generalization capabilities of unverifiable constraints and discover that specific constraints exhibit distinct, high-generalization patterns. Motivated by this, we propose TinyJudge, a framework that employs an ensemble of specialized tiny language models ($\sim0.6B$) to provide rewards for soft constraints. By distilling expertise from frontier models into these tiny models, it achieves high-precision, lightweight evaluation. Extensive evaluations across five benchmarks demonstrate that TinyJudge outperforms the baselines by $\sim10\%$ in average performance and $12\%$ in reward precision. Crucially, it also achieves a $3\times$ speedup in total training time. Our work provides a scalable and robust path for aligning LLMs with unverifiable human instructions.

2606.07529 2026-06-09 cs.CL cs.AI cs.CV cs.LG cs.MM 交叉投稿

CAPruner: Conceptual-Adjacent Scene Graph Pruner for Enhancing 3D Spatial Reasoning of Large Language Models

CAPruner: 概念相邻场景图剪枝器以增强大语言模型的3D空间推理

Shengli Zhou, Xiangchen Wang, Guanhua Chen, Feng Zheng

发表机构 * Southern University of Science and Technology(南方科技大学) SpatialTemporal AI(时空人工智能)

AI总结 提出概念相邻场景图剪枝器(CAPruner),通过融合模糊语义相关性和空间邻近性估计关系重要性,在任务特定上下文中选择关键关系,避免关系级标注,显著提升大语言模型在3D视觉语言任务上的空间推理性能。

Comments Accepted by ACL 2026 Main Conference

详情
AI中文摘要

大型语言模型(LLMs)最近被应用于3D视觉语言(3D-VL)任务,这些任务需要空间推理以识别相对于锚点的目标物体。场景图通常用于表示此类关系,但在完整图上进行推理会导致高昂的令牌成本和计算效率低下,因此需要剪枝。现有的剪枝方法主要依赖空间邻近性,常常移除任务相关的关系,从而削弱可靠的空间推理。为了解决这些局限性,我们推导出场景图剪枝的一个关键要求:保留与特定3D-VL任务最相关的空间关系。在此洞察指导下,我们提出了概念相邻场景图剪枝器(CAPruner)。CAPruner将模糊语义相关性与空间邻近性相结合,以估计关系的重要性,从而能够在任务特定上下文中选择关键关系。此外,为了避免昂贵的关系级标注,CAPruner通过监督每个节点入射边的聚合分数进行训练。大量实验表明,CAPruner有效保留了空间推理所必需的关系,从而显著提升了LLMs在3D-VL任务上的性能。代码可在 https://github.com/fz-zsl/CAPruner 获取。

英文摘要

Large language models (LLMs) have recently been applied to 3D vision-language (3D-VL) tasks, which require spatial reasoning to identify target objects relative to anchors. Scene graphs are commonly employed to represent such relations, but reasoning over complete graphs incurs high token costs and computational inefficiencies, motivating the need for pruning. Existing pruning methods primarily rely on spatial proximity and often remove task-relevant relations, thereby undermining reliable spatial reasoning. To address these limitations, we derive a key requirement for scene graph pruning: preserving spatial relations that are most pertinent to the specific 3D-VL task. Guided by this insight, we propose the Conceptual-Adjacent Scene Graph Pruner (CAPruner). CAPruner integrates fuzzy semantic relevance with spatial proximity to estimate the importance of relations, enabling the selection of critical relations in a task-specific context. Moreover, to avoid costly relation-level annotations, CAPruner is trained by supervising the aggregated scores of each node's incident edges. Extensive experiments demonstrate that CAPruner effectively preserves relations essential for spatial reasoning, leading to substantial performance improvements of LLMs on 3D-VL tasks. Code is available at https://github.com/fz-zsl/CAPruner.

2606.07546 2026-06-09 cs.IR cs.AI cs.LG 交叉投稿

Beyond Item IDs: Scaling Short-Form-Video Recommendation via Semantic-Native Long Sequence Modeling

超越视频ID:通过语义原生长序列建模实现短视频推荐规模化

Ruixiao Sun, Diego Uribe Mora, Zhimeng Jiang, Yuanzhen Lin, Jiarui Wang, Yuening Li, Danfeng Guo, Zhizhong Chen, Chuan He, Liang Liu

发表机构 * Google Mountain View, USA(谷歌山景城,美国)

AI总结 针对短视频推荐中序列长度受限于视频ID语义稀疏性和Transformer二次复杂度的问题,提出采用语义ID和全局感知压缩Transformer,实现十亿用户规模的超长行为序列建模,显著降低内存和计算开销,在线实验提升用户满意度和内容消费。

Comments this manuscript has been accepted by SIGIR 2026

详情
AI中文摘要

捕捉用户跨广泛观看历史的兴趣对于短视频推荐至关重要,但扩展序列长度受到两个瓶颈的限制:原子视频ID的语义稀疏性和Transformer的二次计算复杂度。传统的正交视频ID无法捕捉内容关系,并且需要大型嵌入表,而自注意力的二次复杂度在严格的工业延迟和资源约束下限制了最大序列长度。在这项工作中,我们提出了一个在生产环境中部署的框架,用于在十亿用户规模上建模超长用户行为序列。我们首先通过采用内容原生的语义ID来解决表示瓶颈。通过使用深度截断、粗粒度的语义ID,我们将嵌入表大小从语料库基数中缩小。这种紧凑的表示通过共享语义前缀自然地泛化到冷启动内容。其次,为了克服序列扩展障碍,我们引入了全局感知压缩Transformer,它利用非参数时间折叠和统一全局查询集成来有效压缩序列,缓解了标准自注意力的内存和计算瓶颈。在我们计算基础设施上的离线分析显示,峰值内存占用减少了一个数量级,计算开销大幅降低。这种效率提升使得在生产中以可承受的成本支持更长的序列长度,在大规模在线A/B测试中,在满意的用户参与度和满意的内容消费方面取得了显著的在线收益。

英文摘要

Capturing user interests across extensive watch histories is critical for short-form video recommendation, yet scaling sequence length is limited by two bottlenecks: the semantic sparsity of atomic Video IDs and the quadratic computational complexity of Transformers. Traditional orthogonal Video IDs fail to capture content relationships and demand large embedding tables, while the quadratic complexity of self-attention restricts the maximum sequence length under strict industrial latency and resource constraints. In this work, we present a production-deployed framework for modeling ultra-long user behavior sequences at a billion-user scale. We first address the representation bottleneck by adopting content-native Semantic IDs. By utilizing depth-truncated, coarse-grained Semantic IDs, we shrink the embedding table size from corpus cardinality. This compact representation naturally generalizes to cold-start content through shared semantic prefixes. Second, to overcome the sequence scaling barrier, we introduce a Global-Aware Compression Transformer that leverages non-parametric temporal folding and unified global query integration to effectively condense the sequence, alleviating both the memory and computational bottlenecks of standard self-attention. Offline profiling on our computing infrastructure demonstrates an order-of-magnitude reduction in peak memory footprint and a drastic decrease in computational overhead. This efficiency gain enables supporting longer sequence lengths at an affordable cost in production, yielding substantial online gains in satisfied user engagement and satisfied content consumption in large-scale online A/B tests.

2606.07552 2026-06-09 cs.MA cs.AI cs.LG 交叉投稿

Symbolic Reasoning Frameworks Modulate LLM Risk Aversion in Multi-Agent Strategic Settings

符号推理框架在多智能体战略环境中调节大语言模型的风险规避

Augustin Chan

发表机构 * iterative.day

AI总结 本研究通过注入符号推理框架(如易经、塔罗牌)作为反思提示,发现其能差异化调节LLM的风险规避倾向,并在多智能体博弈中产生框架特定的胜者分布,且该效应源于反思过程而非内容遵循。

Comments 17 pages, 3 figures, 6 tables, 6 listings. Code and data: https://doi.org/10.5281/zenodo.20338937

详情
AI中文摘要

大型语言模型在作为战略智能体部署时表现出内在的行为倾向——尤其是风险规避的“乌龟”偏向于防御性玩法。我们证明,符号推理框架作为每轮反思提示注入一个智能体,能够差异化地调节这种偏向,并重塑多智能体生态系统,产生框架特定的胜者分布。在一个7玩家的战国策外交变体(41局游戏,4种条件,单战役记忆积累)中,每个框架产生独特的生态系统特征:在控制条件下,燕国主导(7/11,64%);在易经蓍草占卜下,燕国和楚国共同主导,而秦国被完全压制(0/10);在塔罗牌下,秦国主导(5/10,Fisher vs. 合并p=0.006);在乱序文本消融(保留提示结构的无意义神谕文本)下,齐国主导(5/10,Fisher vs. 合并p=0.006)。接受框架的智能体(韩国)从未获胜,且在不同条件下生存率无差异(Fisher p=1.0),但塔罗牌持续提升韩国的峰值领土(平均3.0个SC vs. 2.1-2.5个其他,Kruskal-Wallis p=0.010)。两个框架的内容均不能预测后续行动——卦象主题(卡方p=0.95)和塔罗牌姿态(卡方p=0.69)均与行动选择独立——表明调节作用是通过反思过程而非内容遵循实现的。我们将其作为一篇观察论文呈现,确立智能体层面的对齐框架选择在多智能体环境中产生独特的系统级后果。

英文摘要

Large language models exhibit innate behavioral tendencies when deployed as strategic agents -- notably a risk-averse "turtle" bias toward defensive play. We show that symbolic reasoning frameworks, injected as per-round reflective prompts into one agent, differentially modulate this bias and reshape the multi-agent ecosystem to produce framework-specific winner distributions. In a 7-player Warring States Diplomacy variant (41 games, 4 conditions, single-campaign memory accumulation), each framework produces a distinct ecosystem signature: under control, Yan dominates (7/11, 64%); under I-Ching yarrow divination, Yan and Chu co-dominate while Qin is completely suppressed (0/10); under Tarot, Qin dominates (5/10, Fisher vs. pooled p = 0.006); under scrambled-text ablation (incoherent oracle text preserving prompt structure), Qi dominates (5/10, Fisher vs. pooled p = 0.006). The framework-receiving agent (Han) never wins and shows no survival difference across conditions (Fisher p = 1.0), but Tarot consistently elevates Han's peak territory (mean 3.0 SCs vs. 2.1-2.5 others, Kruskal-Wallis p = 0.010). Neither framework's content predicts subsequent actions -- hexagram themes (chi-squared p = 0.95) and Tarot card postures (chi-squared p = 0.69) are both independent of action choice -- suggesting the modulation operates through the reflective process, not content-following. We present this as an observation paper establishing that alignment-framework choice at the agent level produces distinctive system-level consequences in multi-agent settings.

2606.07568 2026-06-09 cs.HC cs.AI cs.CV cs.LG physics.data-an 交叉投稿

A Systematic Study of Behavioral Cloning for Scientific Data Annotation

行为克隆在科学数据标注中的系统研究

Ishaan Singh Chandok, Core Francisco Park

发表机构 * GitHub

AI总结 针对科学数据标注中人工验证校正耗时问题,提出行为克隆框架,通过9个合成任务模拟专家策略,发现模型层次化技能习得、多任务预训练高效微调、内部表示共享错误模式等关键结论。

Comments ICML 2026 Oral

详情
AI中文摘要

科学数据标注,例如视频中动物追踪或神经重建的校对,仍然受限于“最后一公里”问题:即使有强大的自动化,验证和校正仍需大量人力。标准方法训练模型直接预测标注,丢弃了专家如何导航、点击、验证和校正的丰富监督信息。我们引入了一个研究科学标注上行为克隆的框架:9个合成任务配以合成标注,模拟真实人类策略,包括探索、错误校正和战略决策。我们的实验揭示了若干发现。首先,技能层次化出现:模型先学习GUI机制,再学习任务关键决策,且比训练数据犯更少错误,同时保留在错误发生时校正的能力。其次,在多任务行为克隆上扩展模型表明,在我们的规模范围内,更大的模型数据效率更高。第三,多任务预训练能够高效微调至新任务,而从零开始训练则完全失败。第四,线性探针揭示模型内部表示标注过程的潜在变量,如任务阶段和数据位置;有趣的是,我们发现一个跨不同标注任务泛化的共享错误表示。总体而言,我们的框架建立了系统基准并识别了关键瓶颈,为将行为克隆扩展到真实世界科学数据标注奠定了基础。

英文摘要

Scientific data annotation, such as tracking animals in video or proofreading neural reconstructions, remains bottlenecked by the "last mile" problem: even with strong automation, verification and correction consume substantial human effort. Standard approaches train models to directly predict annotations, discarding the rich supervision in how experts navigate, click, verify, and correct. We introduce a framework for studying behavioral cloning on scientific annotation: 9 synthetic tasks paired with synthetic annotations that simulate realistic human strategies including exploration, mistake correction, and strategic decision-making. Our experiments reveal several findings. First, skills emerge hierarchically: models learn GUI mechanics before task-critical decisions, and commit fewer mistakes than the training data while retaining the ability to correct errors when they occur. Second, scaling models on multi-task behavioral cloning shows that larger models are more data efficient within our scale range. Third, multi-task pretraining enables efficient fine-tuning to new tasks, while training from scratch fails entirely. Fourth, linear probes reveal that models internally represent latent variables of the annotation process such as task phase and data position; interestingly, we find a shared mistake representation that generalizes across different annotation tasks. Overall, our framework establishes systematic benchmarks and identifies key bottlenecks, providing a foundation for scaling behavioral cloning to real-world scientific data annotation.

2606.07570 2026-06-09 cs.DL cs.LG 交叉投稿

Can LLMs extract scientific consensus? A case study in high-temperature superconductivity

LLMs能否提取科学共识?以高温超导为例

Mouyang Cheng, Wenhao He, Zhuotao Jin, Bowen Yu, Ju Li, Boris Kozinsky, Yao Wang, Pavel Volkov, Liangzi Deng, Ching-Wu Chu, Xiao-Gang Wen, Mingda Li

发表机构 * Center for Computational Science and Engineering, MIT(MIT计算科学与工程中心) Department of Materials Science and Engineering, MIT(MIT材料科学与工程系) Department of Physics, MIT(MIT物理系) Department of Nuclear Science and Engineering, MIT(MIT核科学与工程系) John A. Paulson School of Engineering and Applied Sciences, Harvard University(哈佛大学约翰·A·保罗森工程与应用科学学院) Department of Chemistry, Emory University(埃默里大学化学系) Department of Physics, University of Connecticut(康涅狄格大学物理系) Department of Physics and Texas Center for Superconductivity, University of Houston(休斯顿大学物理系和德克萨斯超导中心)

AI总结 本研究以高温超导领域为测试平台,利用近18,000篇高被引文献构建知识图谱,发现LLM提取的表征能恢复出连贯且物理可解释的结构,表明LLM可作为解码竞争性科学知识的可扩展工具。

Comments 23 pages, 4 figures

详情
AI中文摘要

科学知识日益分散在庞大且异质的科学文献中,其中重要的主张往往是隐含的、不断演变的,并且存在内部争议。尽管大型语言模型(LLM)在信息提取和摘要方面表现出色,但它们恢复潜在科学共识的能力仍不清楚。本文以凝聚态物理中长期存在且备受争议的高温超导(HTS)问题为挑战性测试平台,研究了这一问题。利用过去七十年间近18,000篇高被引出版物,我们构建了一个结构化的知识图谱,链接了竞争性的超导机制、材料家族、证据模态和引用关系。我们发现,LLM提取的表征恢复出了连贯且物理可解释的结构,包括家族依赖的机制概况、证据特定的相关性以及引用介导的科学信念的时间演化。对LLM的消融研究进一步表明,全局结构在提示、解码和模型变化下保持稳健。我们的结果表明,LLM确实可以作为可扩展的工具,用于解读以竞争性解释和知识演变为特征的领域的科学知识。

英文摘要

Scientific knowledge is increasingly dispersed across vast and heterogeneous scientific literature, where important claims are often implicit, evolving, and internally debated. While large language models (LLMs) have shown impressive performance in information extraction and summarization, their ability to recover latent scientific consensus remains unclear. Here, we investigate this problem in the context of high-temperature superconductivity (HTS), a long-standing and highly debated topic in condensed matter physics, as a challenging testbed. Using nearly 18,000 highly cited publications from the past seven decades, we construct a structured knowledge graph linking competing superconducting mechanisms, material families, evidential modalities, and citation relations. We find that LLM-extracted representations recover coherent and physically interpretable structures, including family-dependent mechanism profiles, evidence-specific correlations, and citation-mediated temporal evolution of scientific beliefs. Ablation studies on the LLM further show that the global structure remains robust across prompting, decoding, and model variations. Our results suggest that LLMs can indeed serve as scalable tools for deciphering scientific knowledge in domains characterized by competing interpretations and evolving knowledge.

2606.07572 2026-06-09 physics.soc-ph cs.LG stat.AP 交叉投稿

Forecasting Japanese elections: A nonlinear machine-learning approach

预测日本选举:一种非线性机器学习方法

Sota Kato, Xuan Luo, Budrul Ahsan, Asahi Obata, Takafumi Nakanishi

发表机构 * International University of Japan(国际大学) The Tokyo Foundation(东京基金会) IBM Japan(IBM日本) Rice University(莱斯大学) Tokyo University of Technology(东京工科大学)

AI总结 本研究引入基于决策树和集成学习的非线性机器学习模型,预测日本众议院选举结果,相比传统线性模型在样本内和样本外评估中均表现出更优的预测精度。

详情
AI中文摘要

尽管日本是世界上最大的先进民主国家之一,但其全国选举的预测模型发展仍然有限。本研究引入了基于决策树和集成学习方法的非线性机器学习预测模型,用于预测日本众议院选举结果。为了评估我们方法的方法论优势,我们复现了Lewis-Beck和Tien(LBT)针对日本选举的基础统计预测模型的理论框架和数据集。我们的模型在样本内和样本外评估中均显示出比LBT模型适度但持续提高的预测准确性,表明非线性算法在捕捉复杂选举动态方面为经典线性方法提供了一种替代方案。本研究是非线性机器学习技术较早应用于单一国家选举预测的案例之一。它提供了一个可复现的框架,当与其他国家的特定选举理论相结合时,可能提高预测模型在更广泛国家背景下的预测性能。

英文摘要

Despite Japan being one of the world's largest advanced democracies, the development of election forecasting models for its national elections remains limited. This study introduces nonlinear machine-learning forecasting models, based on decision tree and ensemble learning methods, for predicting the outcomes of Japanese lower-house elections. To assess the methodological benefits of our approach, we replicated the theoretical framework and dataset of Lewis-Beck and Tien's (LBT) foundational statistical forecasting model for Japanese elections. Our models demonstrated moderately but consistently improved predictive accuracy compared to LBT's model in both in-sample and out-of-sample evaluations, suggesting that nonlinear algorithms offer an alternative approach to classical linear methods in capturing complex electoral dynamics. This study represents one of the earlier applications of nonlinear machine-learning techniques to single-country election forecasting. It offers a replicable framework that, when combined with the country-specific electoral theories of other nations, may enhance the predictive performance of forecasting models in broader national contexts.

2606.07575 2026-06-09 q-fin.RM cs.LG 交叉投稿

Forward-Looking Stress Testing Under Macro Scenarios: Stable SVaR Estimation Using a Hybrid GPR-HS Framework with SACS

宏观情景下的前瞻性压力测试:基于混合GPR-HS框架与SACS的稳定SVaR估计

Ujjwala Vadrevu

AI总结 本文扩展混合高斯过程回归历史模拟框架至前瞻性压力情景,提出情景平均协方差稳定方法,在三种宏观情景下实现稳定的压力在险价值估计,满足监管要求。

Comments 15 pages, 3 figures. Extension of a hybrid GPR-HS framework to forward-looking stress testing with scenario-based SVaR and covariance stabilization (SACS)

详情
AI中文摘要

监管压力测试框架,包括全面资本分析与审查(CCAR)和内部资本充足评估程序(ICAAP),要求在前瞻性宏观经济情景下进行稳健的压力在险价值(SVaR)估计。传统的参数化方法在极端冲击下常表现出数值不稳定性,降低了资本预测的可靠性。本文将Vadrevu(2026)的混合高斯过程回归历史模拟(GPR-HS)框架扩展到前瞻性压力情景,展示了在三种情景(西亚战争、气候风险和AI泡沫/监管)下的稳定性。一个关键贡献是情景平均协方差稳定(SACS)框架,它将压力协方差构建为历史危机情景的加权聚合,提供稳定且可解释的依赖结构。压力收益路径通过确定性漂移和随机残差在252天的时间跨度内生成,而波动率通过具有激进噪声初始化(ANI)的高斯过程回归建模。该框架在所有资产和情景下表现出一致的收敛性。SVaR范围从-2.1020%到-2.2231%,并且保持了|SES| > |SVaR|的一致性属性。结果支持GPR-HS与SACS作为CCAR和ICAAP应用中前瞻性SVaR和SES估计的稳定且符合监管要求的方法。

英文摘要

Regulatory stress testing frameworks, including the Comprehensive Capital Analysis and Review (CCAR) and the Internal Capital Adequacy Assessment Process (ICAAP), require robust Stressed Value-at-Risk (SVaR) estimation under forward-looking macroeconomic scenarios. Traditional parametric approaches often exhibit numerical instability under extreme shocks, reducing the reliability of capital projections. This paper extends the Hybrid Gaussian Process Regression Historical Simulation (GPR-HS) framework of Vadrevu (2026) to forward-looking stress scenarios, demonstrating stability across three regimes: West Asia War, Climate Risk, and AI Bubble/Regulation. A key contribution is the Scenario-Averaged Covariance Stabilization (SACS) framework, which constructs stress covariance as a weighted aggregation of historical crisis regimes, providing stable and interpretable dependence structures. Stressed return paths are generated over a 252-day horizon using deterministic drift and stochastic residuals, while volatility is modeled via Gaussian Process Regression with Aggressive Noise Initialization (ANI). The framework exhibits consistent convergence across all assets and scenarios. SVaR ranges from -2.1020% to -2.2231%, with the coherence property |SES| > |SVaR| preserved. The results support GPR-HS with SACS as a stable and regulator-aligned approach for forward-looking SVaR and SES estimation in CCAR and ICAAP applications.
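Two ingredients of the abstract lend themselves to a small sketch: a stress covariance built as a weighted aggregation of crisis-regime covariances, and the relation between SVaR and SES on a set of simulated returns. The code below is illustrative only. The regime matrices, weights, return sample, and the 80% confidence level are made up, and the empirical tail estimators are a generic historical-simulation choice, not the paper's GPR-based procedure.

```python
def weighted_covariance(covariances, weights):
    """Weighted average of regime covariance matrices (nested lists), weights normalised."""
    total = sum(weights)
    n = len(covariances[0])
    return [
        [sum(w * c[i][j] for w, c in zip(weights, covariances)) / total for j in range(n)]
        for i in range(n)
    ]


def var_and_es(returns, level=0.99):
    """Empirical VaR (k-th worst return) and ES (mean of the k worst returns)."""
    ordered = sorted(returns)
    k = max(1, int(round(len(ordered) * (1 - level))))
    tail = ordered[:k]
    return tail[-1], sum(tail) / len(tail)


# Two hypothetical crisis regimes for a two-asset portfolio.
crisis_a = [[4.0, 2.0], [2.0, 4.0]]
crisis_b = [[1.0, 0.0], [0.0, 1.0]]
stress_cov = weighted_covariance([crisis_a, crisis_b], weights=[1.0, 3.0])

# Hypothetical simulated portfolio returns.
returns = [-0.05, -0.03, -0.02, -0.01, 0.0, 0.01, 0.01, 0.02, 0.02, 0.03]
svar, ses = var_and_es(returns, level=0.8)
```

A non-negative weighted average of positive semi-definite matrices is itself positive semi-definite, which is one reason an aggregation of this form yields a usable covariance. With these estimators the expected shortfall is never less severe than the VaR, in line with the |SES| > |SVaR| property reported in the abstract.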

2606.07580 2026-06-09 eess.SY cs.LG cs.SY 交叉投稿

Quantifying Uncertainty in Space Debris Capture with Active Tether-Net Systems Caused by Noisy Observations

量化由噪声观测引起的空间碎片捕获主动绳网系统的不确定性

Feng Liu, Achira Boonrath, Eleonora M. Botta, Souma Chowdhury

发表机构 * Department of Mechanical and Aerospace Engineering, AIAA Student Member(机械与航空航天工程系,AIAA学生会员) Department of Mechanical and Aerospace Engineering, AIAA Member(机械与航空航天工程系,AIAA会员) Department of Mechanical and Aerospace Engineering, AIAA Senior Member(机械与航空航天工程系,AIAA高级会员)

AI总结 针对主动绳网系统捕获空间碎片时因噪声观测导致的不确定性,提出基于Sobol方差分析和扰动法的量化框架,评估捕获质量指数的不确定性。

Comments Presented at 2025 AIAA Aviation Forum

详情
AI中文摘要

随着低地球轨道空间碎片日益增多,对可靠高效的碎片清除解决方案的需求变得更加迫切。带有可操控单元的主动绳网系统是解决该问题的一种有前景的方案,其成功取决于网机动和闭合决策的鲁棒性。这些决策又受到以下不确定性的影响:i) 对目标碎片状态的噪声观测(例如,传感误差),以及ii) 决策系统训练所依赖的复杂网动力学和网/碎片相互作用行为的不完美模拟。本文关注这两个不确定性源中的第一个,并提出一个流程来传播和量化碎片捕获性能中由此产生的不确定性,该性能用捕获质量指数(CQI)表示。该量化针对使用固定基线控制的主动绳网和使用训练好的神经控制策略在部署阶段引导网机动的主动绳网分别进行。利用了两种不同的不确定性量化(UQ)技术,即Sobol基于方差的灵敏度分析和基于扰动的方法。使用高保真模拟器和低保真度基于代理的环境来展示预测精度与解决不确定性难易程度之间的权衡。

英文摘要

As Low Earth Orbit grows more crowded with space debris, the need for reliable and efficient debris removal solutions becomes more urgent. An active tether-net system with maneuverable units is one promising solution to this problem, and its success depends on the robustness of the net maneuver and closing decisions. These decisions are in turn affected by uncertainties attributed to i) noisy observation of the target debris state (e.g., sensing errors), and ii) imperfect simulations of the complex net dynamics and net/debris interaction behavior, over which the decision system is trained. This paper focuses on the first of these two uncertainty sources, and presents a pipeline to propagate and quantify the resulting uncertainty in the debris capture performance, expressed in terms of the Capture Quality Index (CQI). This quantification is uniquely performed for both an active tether-net using a fixed baseline control and one using a trained neuro-control policy to guide the net maneuver during the deployment phase. Two different uncertainty quantification (UQ) techniques, namely Sobol's variance-based sensitivity analysis and a perturbation-based method, are used. A high-fidelity simulator and a lower-fidelity surrogate-based environment are used to demonstrate the trade-off between prediction accuracy and ease of resolving uncertainties.
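To make the variance-based technique named in the abstract concrete, here is a minimal sketch of a first-order Sobol index estimate using the pick-freeze (Saltelli-style) estimator. The toy model, uniform inputs on [0, 1], and sample size are assumptions; the paper's CQI simulators are not reproduced here.

```python
import numpy as np

def first_order_sobol(model, dim, n=100_000, seed=0):
    """Estimate first-order Sobol indices for inputs drawn uniformly from [0, 1]^dim."""
    rng = np.random.default_rng(seed)
    A = rng.uniform(size=(n, dim))
    B = rng.uniform(size=(n, dim))
    fA, fB = model(A), model(B)
    total_var = np.var(np.concatenate([fA, fB]))
    indices = np.empty(dim)
    for i in range(dim):
        ABi = A.copy()
        ABi[:, i] = B[:, i]  # take only input i from sample B, freeze the rest at A
        indices[i] = np.mean(fB * (model(ABi) - fA)) / total_var
    return indices

def toy_model(X):
    """Hypothetical stand-in for a capture-quality model: y = x0 + 2*x1."""
    return X[:, 0] + 2.0 * X[:, 1]

S = first_order_sobol(toy_model, dim=2, n=200_000)
```

For this additive toy model the exact indices are 0.2 and 0.8, since the input variances contribute 1/12 and 4/12 to a total of 5/12; the estimate should land close to those values.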

2606.07594 2026-06-09 cs.AI cs.HC cs.LG cs.SE 交叉投稿

Syll: Open-Source Personal Automation with Cross-Surface Execution

Syll: 开源个人自动化与跨界面执行

Bo Zhang, Borui Zhang, Chenghao Jiang, Minglei Shi, Xiaofeng Wang, Zheng Zhu, Jie Zhou, Jiwen Lu

发表机构 * Adobe Systems Inc.(Adobe系统公司) Stardew Valley(《星露谷物语》) macOS University of Science and Technology of China(中国科学技术大学)

AI总结 提出开源多模态智能体框架Syll,通过统一API、CLI和GUI控制,支持用户演示教学和可审计执行,实现跨界面个人自动化。

Comments Code: https://github.com/THU-SAGE/syll

详情
AI中文摘要

个人AI智能体必须越来越多地跨API、shell、网页界面和桌面GUI运行,然而许多系统仍局限于单一界面,对用户教学和可审计性支持有限。我们提出Syll,一个开源、自托管的多模态智能体框架,在模块化运行时中统一MCP/API工具、CLI执行和视觉GUI控制,使智能体能够跨异构界面协调计算机使用,同时简化用户与智能体之间的信息交换。Syll的核心是双向用户-智能体交互层:用户通过直接演示教学流程,Syll将其编译为可复用技能;智能体执行被转换回多模态证据——日志、关键帧和审批检查点——以供检查和管控。Syll进一步将记忆、技能、例程和治理外部化为可编辑的本地工件,支持直接检查、扩展和下游开发。我们的实现已在生产桌面应用程序上验证,包括Adobe Photoshop、Adobe Audition、星露谷物语、macOS Finder等。我们报告了面向机制的研究,验证了多模态路由、可教学GUI回放和持久化本地工件。我们希望Syll能作为个人自动化的实用开源基础,用户可教学、检查和持续扩展。

英文摘要

Personal AI agents must increasingly operate across APIs, shells, web surfaces, and desktop GUIs, yet many systems remain tuned to a single interface and offer limited support for user teaching and auditability. We present Syll, an open-source, self-hosted multimodal agent harness that unifies MCP/API tools, CLI execution, and visual GUI control in a modular runtime, enabling agents to coordinate computer use across heterogeneous interfaces while streamlining how users and agents exchange information. At the core of Syll is a bidirectional user-agent interaction layer: users teach procedures through direct demonstration, which Syll compiles into reusable skills; agent execution is translated back into multimodal evidence -- logs, keyframes, and approval checkpoints -- for inspection and control. Syll further externalizes memory, skills, routines, and governance as editable local artifacts, supporting straightforward inspection, extension, and downstream development. Our implementation has been validated on production desktop applications including Adobe Photoshop, Adobe Audition, Stardew Valley, macOS Finder and others. We report mechanism-oriented studies that validate multimodal routing, teachable GUI replay, and persistent local artifacts. We hope Syll can serve as a practical open-source foundation for personal automation that users can teach, inspect, and continuously extend.

2606.07608 2026-06-09 cs.CL cs.AI cs.LG cs.SD 交叉投稿

Subtitle-Aligned Fine-Tuning of Whisper for Swiss German ASR: Benchmark Contamination, Convention Mismatch, and an Honest Baseline at 25.6% WER (13.8% cWER)

针对瑞士德语语音识别的Whisper字幕对齐微调:基准污染、惯例不匹配以及25.6% WER(13.8% cWER)的诚实基线

Felix Akeret

发表机构 * Independent Researcher, Zurich, Switzerland(独立研究员,瑞士苏黎世) ETH Zürich(苏黎世联邦理工学院) University of Bern(伯尔尼大学) FHNW(西北应用科学与艺术大学) CeTIM Leiden/Munich(CeTIM 莱顿/慕尼黑)

AI总结 通过1,367小时广播语音与标准德语字幕的弱监督,系统微调Whisper large-v3用于瑞士德语语音识别,发现公开结果因基准污染被高估,并发布两个诚实评估的模型。

Comments 15 pages, 21 tables. Models available at https://huggingface.co/Flix-AI

详情
AI中文摘要

我们提出了一项系统研究,针对OpenAI的Whisper large-v3进行微调,用于瑞士德语语音识别,使用1,367小时的广播语音与标准德语字幕作为弱监督。通过在NVIDIA DGX Spark(Grace Blackwell,128 GB统一内存,最高1 PFLOP FP4)上进行16次迭代训练,我们比较了LoRA和全微调(1.55B参数模型),研究了幻觉的根本原因,并量化了数据质量、字幕对齐和训练策略的影响。我们的最佳模型在严格不相交数据上的诚实评估中,在All Swiss German Dialects Test Set (ASGDTS)上实现了25.6%的测量WER。通过将真实错误与有效的风格变异(时态、词序、瑞士正字法)分离的协调错误分析,得到内容WER (cWER)为13.8%,仅计算实际识别失败。偏差校正估计将其降至8.5%,表明真实错误率约为测量WER的三分之一。我们证明,已发表的瑞士德语ASR最先进结果(17.1-17.5% WER)因基准污染而被夸大:一个在ASGDTS测试集上自训练的普通Whisper模型(零瑞士德语数据)实现了13.88% WER,超过了所有已发表系统。使用Phi-4-multimodal的实验显示出更强的记忆效应(3.9% WER),揭示该基准主要衡量惯例匹配而非方言理解。我们发布了两个模型,一个LoRA适配器(25.32% WER,13.9% cWER)和一个全微调模型(25.60% WER,13.8% cWER),这是少数公开可用、经过诚实评估的瑞士德语Whisper模型之一,采用Apache 2.0许可,完全可复现,无需机构数据协议。

英文摘要

We present a systematic study of fine-tuning OpenAI's Whisper large-v3 for Swiss German ASR, using 1,367 hours of broadcast speech paired with Standard German subtitles as weak supervision. Through 16 iterative training runs on an NVIDIA DGX Spark (Grace Blackwell, 128 GB unified memory, up to 1 PFLOP FP4), we compare LoRA and full fine-tuning of the 1.55B-parameter model, investigate hallucination root causes, and quantify the effect of data quality, subtitle alignment, and training strategy. Our best model achieves 25.6% measured WER on the All Swiss German Dialects Test Set (ASGDTS) in an honest evaluation on strictly disjoint data. A harmonized error analysis separating genuine errors from valid stylistic variation (tense, word order, Swiss orthography) yields a content WER (cWER) of 13.8%, counting only actual recognition failures. Bias-corrected estimation reduces this to 8.5%, suggesting the true error rate is roughly one third of measured WER. We demonstrate that published state-of-the-art Swiss German ASR results (17.1-17.5% WER) are inflated by benchmark contamination: a vanilla Whisper model self-trained on the ASGDTS test set with zero Swiss German data achieves 13.88% WER, surpassing all published systems. Experiments with Phi-4-multimodal show an even stronger memorization effect (3.9% WER), revealing that the benchmark primarily measures convention matching rather than dialectal comprehension. We release two models, a LoRA adapter (25.32% WER, 13.9% cWER) and a full fine-tuned model (25.60% WER, 13.8% cWER), among the few publicly available, honestly evaluated Whisper models for Swiss German, under Apache 2.0 with full reproducibility, requiring no institutional data agreements.
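Since the abstract's headline numbers are word error rates, a short sketch of how WER is conventionally computed may help: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. Whitespace tokenization and the absence of any text normalization are assumptions here; the paper's content WER (cWER), which excludes valid stylistic variation, is not implemented.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words so far
        for j, h in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if r == h else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)

# One deleted word out of four reference words.
wer = word_error_rate("das ist ein test", "das ist test")
```

Because insertions count as errors while the denominator is the reference length, WER can exceed 100% for very long hypotheses.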

2606.07658 2026-06-09 cs.CV cs.LG 交叉投稿

What neurosurgeons need to see: synthetic intra-operative MRI from ultrasound for brain-shift compensation in brain tumour surgery

神经外科医生需要看到的:用于脑肿瘤手术中脑移位补偿的超声合成术中MRI

Santiago Cepeda, Olga Esteban-Sinovas, Ignacio Arrese, Rosario Sarabia

发表机构 * Department of Neurosurgery, Neurovascular Unit, Río Hortega University Hospital, Valladolid, Spain(西班牙巴利亚多利德里奥·奥尔特加大学医院神经外科神经血管科) Specialized Group in Biomedical Imaging and Computational Analysis (GEIBAC), Instituto de Investigación Biosanitaria de Valladolid (IBioVALL), Valladolid, Spain(西班牙巴利亚多利德生物医学研究与计算分析专业组(GEIBAC),巴利亚多利德生物健康研究所(IBioVALL))

AI总结 提出一种端到端流水线,通过融合术前MRI、术中超声生成的合成MRI及锚定该合成图像的可变形配准,生成术前成像空间中的全脑MRI体积,以补偿脑移位,为神经导航提供类似MRI的术中视野更新。

详情
AI中文摘要

最大安全切除是胶质瘤手术的主要目标。硬脑膜打开后,神经导航引导会因脑移位而逐渐退化。术中MRI可以补偿,但需要专用基础设施且很少可用,而术中超声(ioUS)廉价、可重复且与常规工作流程兼容。将ioUS与术前MRI结合的导航系统通常依赖刚性配准;即使是可变形多模态配准也受限于超声散斑对比度、窄视野以及无法表示术前扫描中不存在的结构,最关键的是切除腔和残余肿瘤。我们提出一个端到端流水线,通过合并术前MRI、从ioUS生成的合成MRI以及锚定在该合成图像上的可变形配准,生成术前成像空间中的全脑MRI体积。它集成了一个2.5D残差变换器合成骨干(ResViT-2.5D)和一个两阶段配准,将NiftyReg与合成锚定的SynthMorph阶段耦合,直接对原始扫描仪输入进行操作。在切除后的ReMIND队列上,ResViT-2.5D生成的合成图像在结构、强度和感知指标上与术中T2紧密匹配。在14名受试者的215个专家标志点上,合成锚定配准将平均目标配准误差从6.27毫米降低到5.86毫米,与强大的经典NiftyReg基线(5.85毫米)相当,同时为每个受试者产生微分同胚变形场。贡献不在于配准精度的提高,而在于集成的体积本身,它在超声视野内反映了术中切除后的状态。这为外科医生提供了手术视野的类似MRI的更新,并有可能集成到手术导航工作流程中。

英文摘要

Maximal safe resection is the primary objective in glioma surgery. Neuronavigation guidance is progressively degraded by brain shift after dural opening. Intraoperative MRI can compensate but needs dedicated infrastructure and is rarely available, whereas intraoperative ultrasound (ioUS) is inexpensive, repeatable, and compatible with routine workflows. Navigation systems combining ioUS with preoperative MRI usually rely on rigid registration; even deformable multimodal registration is limited by ultrasound speckle contrast, a narrow field of view, and the inability to represent structures absent from the preoperative scan, most critically the resection cavity and residual tumor. We propose an end-to-end pipeline that generates a new whole-brain MRI volume in the preoperative imaging space by merging the preoperative MRI, a synthetic MRI generated from the ioUS, and a deformable registration anchored on that synthetic image. It integrates a 2.5D residual-transformer synthesis backbone (ResViT-2.5D) and a two-stage registration coupling NiftyReg with a synthesis-anchored SynthMorph stage, operating directly on raw scanner inputs. On a post-resection ReMIND cohort, ResViT-2.5D produced synthetic images closely matching the intraoperative T2 across structural, intensity, and perceptual metrics. In 14 subjects with 215 expert landmarks, the synthesis-anchored registration reduced the mean target registration error from 6.27 to 5.86 mm, matching a strong classical NiftyReg baseline (5.85 mm) while yielding a diffeomorphic deformation field in every subject. The contribution is not a gain in registration accuracy but the integrated volume itself, which, inside the ultrasound field of view, reflects the intraoperative post-resection state. This provides the surgeon with an MRI-like update of the operative field with potential for integration into surgical-navigation workflows.
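The registration result above is reported as a mean target registration error (TRE) over expert landmarks. As a rough illustration, TRE is commonly taken as the Euclidean distance between corresponding landmark pairs after registration, averaged over the pairs; the coordinates and function name below are hypothetical, and the paper's exact evaluation protocol is not specified in the abstract.

```python
import numpy as np

def mean_tre(fixed_landmarks, warped_landmarks):
    """Mean Euclidean distance (same units as the inputs, e.g. mm) between landmark pairs."""
    fixed = np.asarray(fixed_landmarks, dtype=float)
    warped = np.asarray(warped_landmarks, dtype=float)
    if fixed.shape != warped.shape:
        raise ValueError("landmark arrays must have the same shape")
    return float(np.linalg.norm(fixed - warped, axis=1).mean())

# Two made-up landmark pairs: one is off by 5 mm, the other matches exactly.
tre = mean_tre([[0.0, 0.0, 0.0], [1.0, 1.0, 1.0]],
               [[3.0, 4.0, 0.0], [1.0, 1.0, 1.0]])
```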

2606.07673 2026-06-09 cs.SD cs.AI cs.LG 交叉投稿

A Hierarchical Feature Engineering Framework for Automated Classification of Phonotraumatic and Non-Phonotraumatic Vocal Hyperfunction

声带创伤性与非声带创伤性声音亢进的自动分类的分层特征工程框架

June-Woo Kim, Kangwook Jang, Minu Kim, Hyunju Lee

发表机构 * Department of Electronic Engineering, Wonkwang University(圆光大学电子工程系) AI Convergence Research Institute, Wonkwang University(圆光大学人工智能融合研究院) GIST InnoCORE AI-Nano Convergence Institute for Early Detection of Neurodegenerative Diseases, Gwangju Institute of Science and Technology(光州科学技术院GIST InnoCORE AI-Nano神经退行性疾病早期检测融合研究所) School of Electrical Engineering, KAIST(韩国科学技术院电气工程学院) Department of AI Convergence, Gwangju Institute of Science and Technology(光州科学技术院人工智能融合系)

AI总结 提出分层特征工程框架,包括静态、动态、比率和耦合特征,用于区分声带创伤性和非声带创伤性声音亢进,发现耦合特征对两类分类均关键,PVH AUC 0.891,NPVH AUC 0.728。

Comments Interspeech 2026

详情
AI中文摘要

动态颈部表面加速度能够实现声音亢进的无创监测,但其亚型的稳健生物标志物仍然有限。本研究利用NeckVibe Challenge数据集区分声带创伤性(PVH)和非声带创伤性(NPVH)声音亢进与健康对照组。我们提出一个分层特征工程框架,包括:(i)静态特征,(ii)动态特征,(iii)基于比率的特征,(iv)捕捉源-滤波器交互的耦合特征。单变量统计分析显示PVH具有强可分性,但NPVH显著性有限,而我们针对高维特征集成优化的机器学习流程发现,耦合特征对两项任务都至关重要。我们实现了PVH的AUC为0.891,NPVH的AUC为0.728,表明虽然PVH近似线性可分,但NPVH的区分受益于非线性特征交互建模。

英文摘要

Ambulatory neck-surface acceleration enables non-invasive monitoring of vocal hyperfunction, yet robust biomarkers for its subtypes remain limited. This study investigates the NeckVibe Challenge dataset to distinguish phonotraumatic (PVH) and non-phonotraumatic (NPVH) vocal hyperfunction from healthy controls. We propose a hierarchical feature engineering framework comprising (i) static, (ii) dynamic, (iii) ratio-based, and (iv) coupling features capturing source-filter interactions. While univariate statistical analysis shows strong separability for PVH but limited significance for NPVH, our machine learning pipeline, tailored for high-dimensional feature integration, identifies that coupling features are crucial for both tasks. We achieve an AUC of 0.891 for PVH and 0.728 for NPVH, suggesting that while PVH is near-linearly separable, NPVH discrimination benefits from modeling non-linear feature interactions.

2606.07688 2026-06-09 cs.IR cs.AI cs.CL cs.LG 交叉投稿

TRACER: Token ReAssignment for Concept ERasure in Generative Recommendation

TRACER: 面向生成式推荐中概念擦除的令牌重分配

Ziheng Chen, Jiali Cheng, Zezhong Fan, Hadi Amiri, Diyuan Wu, Gabriele Tolomei, Yang Zhang

发表机构 * Stony Brook University(石溪大学) University of Massachusetts Lowell(马萨诸塞大学洛厄尔分校) Columbia University(哥伦比亚大学) Institute of Science and Technology Austria(奥地利科学技术研究院) Sapienza University of Rome(罗马第一大学) National University of Singapore(新加坡国立大学)

AI总结 针对生成式推荐中概念遗忘与推荐效用冲突的问题,提出基于令牌重分配的概念遗忘框架TRACER,通过将概念相关物品重分配给替代令牌并引入一致性正则化,有效移除目标概念同时保持推荐效用。

详情
AI中文摘要

生成式推荐将下一项预测形式化为基于用户历史交互导出的语义ID(SID)序列的自回归生成,使得现代推荐系统在结构上类似于大型语言模型(LLM)。随着隐私和安全问题的增加,这些系统越来越需要概念遗忘来移除与物品相关的敏感或有害概念。然而,现有的LLM遗忘方法不能直接应用于生成式推荐。与具有明确语义的词令牌不同,SID是抽象标识符,通常被遗忘和保留物品共享,导致概念移除和推荐效用保持之间的严重冲突。为了解决这一挑战,我们提出了TRACER,一种基于令牌重分配的端到端概念遗忘框架。TRACER不是直接抑制共享的SID,而是将概念相关物品重分配给能够更好地促进遗忘同时最小化对保留物品的副作用的替代令牌。我们进一步引入了一致性正则化器,以在遗忘过程中保持保留物品之间的语义一致性。在真实世界推荐数据集上的实验表明,TRACER有效地移除了目标概念,同时比现有的遗忘基线更好地保持了推荐效用。

英文摘要

Generative recommendation formulates next-item prediction as autoregressive generation over semantic ID (SID) sequences derived from users' historical interactions, making modern recommender systems structurally similar to large language models (LLMs). As privacy and safety concerns grow, these systems increasingly require concept unlearning to remove sensitive or harmful concepts associated with items. However, existing LLM unlearning methods cannot be directly applied to generative recommendation. Unlike word tokens with explicit semantics, SIDs are abstract identifiers that are often shared by both forget and retain items, leading to severe conflicts between concept removal and recommendation utility preservation. To address this challenge, we propose TRACER, an end-to-end concept unlearning framework based on token reassignment. Rather than directly suppressing shared SIDs, TRACER reassigns concept-related items to alternative tokens that better facilitate forgetting while minimizing side effects on retained items. We further introduce a coherence regularizer to preserve semantic consistency among retain items during unlearning. Experiments on real-world recommendation datasets demonstrate that TRACER effectively removes target concepts while substantially better preserving recommendation utility than existing unlearning baselines.

2606.07780 2026-06-09 cs.AI cs.CV cs.LG 交叉投稿

Land cover and flood type govern the detection limits of satellite-based flood mapping across diverse global flood events

土地覆盖与洪水类型控制基于卫星的洪水测绘在不同全球洪水事件中的检测极限

Venkatesh Kolluru, Rajat Shinde, Abdelhak Marouane, Caden Helbling, Deepak Shah, Othneil Drew, Iksha Gurung, Manil Maskey, Rahul Ramachandran

发表机构 * Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心) Space and Earth Science Data Analysis(空间与地球科学数据分析) NASA Marshall Space Flight Center(NASA马歇尔太空飞行中心)

AI总结 研究利用Prithvi-EO-2.0模型在19个全球洪水事件中评估卫星洪水测绘的检测能力,发现检测精度取决于土地覆盖和洪水类型,农田和河流洪水检测效果较好,而树木覆盖和建成区检测近乎为零。

详情
AI中文摘要

洪水是最具破坏性的自然灾害之一,在气候变化下其频率增加使得基于卫星的淹没测绘对灾害响应至关重要。基于卫星档案预训练的地理空间基础模型提供了地理可迁移性,但其在多样、未见事件中的操作可靠性尚未被表征。在此,我们在跨越六大洲、八个气候带和六种洪水机制的19个分布外洪水事件(2017-2025年)中部署Prithvi-EO-2.0,并针对两个独立参考产品进行验证。检测精度共同依赖于土地覆盖和洪水类型,农田产生最高一致性(IoU=52%),河流事件检测最强(F1=0.69),而树木覆盖和建成区显示近乎零检测(IoU=4%),无论洪水机制如何。双参考验证揭示,明显的模型误差部分反映了参考产品之间的定义不一致而非检测失败。迭代流水线测试识别出23种故障模式,其中流水线工程在初始误差中占主导地位,超过模型容量。这些发现为操作卫星洪水测绘建立了环境依赖的检测边界。

英文摘要

Floods are among the most destructive natural hazards, and their increasing frequency under climate change makes satellite-based inundation mapping essential for disaster response. Geospatial foundation models pretrained on satellite archives offer geographic transferability, but their operational reliability across diverse, unseen events remains uncharacterized. Here we deploy Prithvi-EO-2.0 across 19 out-of-distribution flood events (2017-2025) spanning six continents, eight climate zones, and six flood mechanisms, validating against two independent reference products. Detection accuracy depended jointly on land cover and flood type, with cropland yielding the highest agreement (IoU=52%) and riverine events the strongest detection (F1=0.69), while tree cover and built-up areas showed near-zero detection (IoU=4%) regardless of flood mechanism. Dual-reference validation revealed that apparent model error partly reflects definitional inconsistency between reference products rather than detection failure. Iterative pipeline testing identified 23 failure modes, with pipeline engineering dominating initial error over model capacity. These findings establish environment-dependent detection boundaries for operational satellite flood mapping.
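The detection results are reported as IoU and F1. The sketch below shows how both are computed from a predicted and a reference binary flood mask; the tiny masks are invented for illustration, and returning 1.0 when both masks are empty is an assumption about how to handle that edge case.

```python
import numpy as np

def iou_and_f1(pred_mask, ref_mask):
    """Intersection over Union and F1 score for two binary masks of equal shape."""
    pred = np.asarray(pred_mask, dtype=bool)
    ref = np.asarray(ref_mask, dtype=bool)
    tp = np.logical_and(pred, ref).sum()    # flooded in both masks
    fp = np.logical_and(pred, ~ref).sum()   # predicted flood, reference dry
    fn = np.logical_and(~pred, ref).sum()   # missed flood pixels
    denom = tp + fp + fn
    if denom == 0:
        return 1.0, 1.0  # both masks empty: treated here as perfect agreement
    iou = tp / denom
    f1 = 2 * tp / (2 * tp + fp + fn)
    return float(iou), float(f1)

iou, f1 = iou_and_f1([[1, 1], [0, 0]], [[1, 0], [1, 0]])
```

On the same pair of masks the two metrics are linked by F1 = 2·IoU / (1 + IoU), so F1 is never lower than IoU. This is worth keeping in mind when comparing figures reported on different scales.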

2606.07792 2026-06-09 cs.CR cs.LG cs.SE 交叉投稿

MOLOT System Card: Malicious Operational Logic Observation Transformer

MOLOT系统卡:恶意操作逻辑观察变换器

Daniil Lopatkin, Maksim Mitrofanov, Stanislav Rakovsky, Aleksandr Khalikov

发表机构 * False Positive Community

AI总结 提出MOLOT系统,利用静态调用图的行为序列进行恶意代码检测,结合解释阶段定位可疑行为,在PyPI和npm包上评估,并发布Open Malicious-Code Bench基准。

Comments 13 pages, 3 figures

详情
AI中文摘要

MOLOT(恶意操作逻辑观察变换器)是一个静态恶意代码检测系统,专为SAST环境设计,其中包元数据、维护者历史和动态执行轨迹可能不可用或不可靠。该系统将源代码表示为从静态调用图派生的行为序列,并包含一个解释阶段,该阶段对可疑行为活动进行排序并将其映射回源代码位置。该方法在来自PyPI和npm的Python和JavaScript包上进行了评估,与开源检测工具进行了比较,并在产品约束下进行了验证,包括运行时、内存使用以及在实际审核工作流中观察到的误报率。我们还发布了Open Malicious-Code Bench,这是一个用于可重复评估恶意包检测方法的公共基准。结果表明,静态行为序列建模可以为现代DevSecOps工作流提供准确、可解释且可部署的恶意代码检测。

英文摘要

MOLOT (Malicious Operational Logic Observation Transformer) is a static malicious-code detection system designed for SAST settings where package metadata, maintainer history, and dynamic execution traces may be unavailable or unreliable. The system represents source code as behavior sequences derived from static call graphs and includes an explanation stage that ranks suspicious behavior activities and maps them back to source-code locations. The approach is evaluated on Python and JavaScript packages from PyPI and npm, compared with open-source detection tools, and validated under product constraints including runtime, memory use, and false-positive rates observed in a real moderation workflow. We also release Open Malicious-Code Bench, a public benchmark for reproducible evaluation of malicious-package detection methods. The results show that static behavior-sequence modeling can provide accurate, explainable, and deployable malicious-code detection for modern DevSecOps workflows.

2606.07798 2026-06-09 cs.AI cs.LG q-bio.NC 交叉投稿

Reconstructing and forecasting disease trajectories of patients with Alzheimer's disease using routine data in resource-constrained settings

在资源受限环境中利用常规数据重建和预测阿尔茨海默病患者的疾病轨迹

Ratnadeep Das, Atri Chatterjee, Sitikantha Roy

发表机构 * Yardi School of Artificial Intelligence (ScAI), Indian Institute of Technology Delhi(印度理工学院德里分校亚迪人工智能学院) Department of Neurology, Vardhman Mahavir Medical College and Safdarjung Hospital(瓦尔丹·马哈维尔医学院和萨夫达戎医院神经内科) Department of Applied Mechanics, Indian Institute of Technology Delhi(印度理工学院德里分校应用力学系)

AI总结 提出GNOVA框架,结合GRU编码器和神经ODE解码器的变分自编码器,利用常规临床数据(无需神经影像或生物标志物)实现认知评分的双向预测、插值/外推及不确定性估计,在ADNI数据集上取得低误差。

详情
AI中文摘要

阿尔茨海默病是一种进行性神经退行性疾病,其进展在不同患者间差异显著。现有工作旨在预测患者未来的认知状态,但很少关注从既往就诊中重建状态。此外,当前研究中,量化预测不确定性仍未被充分探索,且依赖于MRI、PET和CSF等昂贵模态,限制了在资源有限环境中的部署。在本研究中,我们的主要目标是:第一,从不规则就诊中双向预测认知评分,以呈现完整的疾病轨迹;第二,实现插值和外推能力,以辅助临床医生做出知情预后决策;第三,为所有预测提供校准良好的不确定性估计;最后,利用常规就诊中可用的模态实现上述目标。我们提出了一个统一框架GNOVA:GRU-神经ODE变分自编码器。该架构在变分自编码器框架内结合了门控循环单元编码器和神经ODE解码器。在我们的工作中,我们预测了CDR-SB和MMSE评分。GRU编码器允许在任何时间点输入任意数量的数据。神经ODE解码器执行连续估计,允许在任何期望的时间点进行插值和外推。变分自编码器允许预测中的不确定性估计。我们使用了ADNI数据集中1727名患者超过10年的数据;该模型在无需任何神经影像或生物标志物数据的情况下,对CDR-SB和MMSE评分分别实现了1.35和2.28的平均绝对误差。特征消融研究表明,年龄、BMI和APOE4状态是强预测因子。所提出的框架能够重建不完整的患者病史并预测未来的认知状态。

英文摘要

Alzheimer's disease is a progressive neurodegenerative disorder, and its progression varies substantially across patients. Existing work aims to forecast patients' future cognitive state, with minimal focus on reconstructing the state from past visits. Furthermore, quantifying predictive uncertainty remains underexplored, and current approaches rely on costly modalities such as MRI, PET, and CSF, limiting their deployment in resource-limited settings. In this research, our primary objectives are: first, bidirectional prediction of cognitive scores from irregular visits to present the complete disease trajectory; second, interpolation and extrapolation capabilities to assist clinicians in informed prognostic decision making; third, a well-calibrated uncertainty estimate for all predictions; and finally, achieving these objectives using only the modalities available during routine visits. We propose a unified framework, GNOVA: a GRU-Neural ODE Variational Autoencoder. The architecture combines a Gated Recurrent Unit encoder and a Neural ODE decoder within a variational autoencoder framework. In our work, we forecast the CDR-SB and MMSE scores. The GRU encoder allows for any number of inputs at any time point. The Neural ODE decoder performs continuous estimation, allowing interpolation and extrapolation at any desired time point. The variational autoencoder allows for uncertainty estimation in predictions. We worked with 1,727 patients from the ADNI dataset over 10 years; the model achieved mean absolute errors of 1.35 and 2.28 for CDR-SB and MMSE scores, respectively, without requiring any neuroimaging or biomarker data. Feature-ablation studies revealed that age, BMI, and APOE4 status were strong predictors. The proposed framework enables the reconstruction of incomplete patient histories and the anticipation of future cognitive states.

2606.07843 2026-06-09 cs.DB cs.IR cs.LG 交叉投稿

RACT: Retrieval Augmented Column-Table Learning and Prediction for Multi-Table Schema Matching

RACT: 检索增强的列-表学习与预测用于多表模式匹配

Leonard Traeger, Enas Khwaileh, Andreas Behrend, George Karabatis

发表机构 * University of Maryland, Baltimore County(马里兰大学巴尔的摩县分校) Utrecht University(乌特勒支大学) Technical University of Cologne(科隆技术大学)

AI总结 提出RACT自监督框架,通过检索候选表约束列匹配空间,在多表模式匹配中优于相似性基线,精度和完整性提升高达70%。

Comments Research Preprint, 12 pages

详情
AI中文摘要

模式匹配是整合来自不同来源数据的关键任务,旨在识别不同模式中列之间的对应关系。在多表整体模式匹配中,由于异构模式设计,具有相似语义含义的列可能位于不同上下文的表中,此时基于相似性的技术不足。本文重点是通过引入RACT学习和预测,将引用上下文利用到模式匹配中,这是一个自监督框架,能够概率性地检索源列的候选表,以约束相关列候选。实验表明,该方法在多表模式匹配中优于基于相似性的基线。在后续匹配实验中,通过top-t表约束列搜索空间,平均匹配精度和完整性均提高了高达70%。

英文摘要

Schema matching, a critical task for integrating data from diverse sources, seeks to identify correspondences between columns across different schemas. In multi-table holistic schema matching, columns with similar semantic meaning may reside in tables with different contexts due to heterogeneous schema designs, a setting in which similarity-based techniques are inadequate. This paper focuses on incorporating referential context into schema matching by introducing RACT learning and prediction, a self-supervised framework that enables the probabilistic retrieval of candidate tables for source columns to constrain relevant column candidates. Experiments demonstrate that this approach outperforms similarity-based baselines on matching multi-table schemas. In subsequent matching experiments, constraining the column search space via top-t tables improves both average matching precision and completeness by up to 70%.

2606.07923 2026-06-09 cs.DB cs.AI cs.LG 交叉投稿

Larch: Learned Query Optimization for Semantic Predicates

Larch: 面向语义谓词的学习型查询优化

Fuheng Zhao, Pawel Liskowski, Zihan Li, Benjamin Han, Puxuan Yu, Varich Boonsanong, Dimitris Tsirogiannis, Anupam Datta

发表机构 * Snowflake Inc.(Snowflake公司)

AI总结 提出Larch框架,利用嵌入增强的图神经网络和强化学习或监督学习优化AI SQL查询中语义过滤器的执行顺序,显著降低令牌开销。

详情
AI中文摘要

随着大型语言模型(LLM)的出现,许多数据库系统引入了语义运算符,使得能够对非结构化数据(如文本、图像、视频)进行分析查询。语义运算符通常会产生高昂的推理成本和延迟,使得语义(AI)SQL查询难以应用于大规模数据集。同时,其语义性质导致数据库引擎将其视为黑盒,使得AI SQL查询难以优化。在本文中,我们介绍了Larch,一个用于优化AI SQL查询中语义过滤器执行的框架。Larch的灵感来自两个关键观察:i) 语义运算符的高延迟为计算密集型运行时优化技术留下了显著空间,ii) 非结构化数据通常伴随着嵌入形式的语义信息,允许在AI_FILTER提示和数据值之间进行高效的语义比较。基于这两个关键观察,我们提出了两种Larch变体:Larch-A2C和Larch-Sel。Larch-A2C使用嵌入增强的门控图神经网络编码任意语义过滤器表达式树,并将过滤器评估顺序表述为马尔可夫决策过程。相比之下,Larch-Sel利用监督学习模型预测过滤器选择性,随后应用动态规划为每个输入行找到接近最优的评估顺序。在多样化的真实世界数据集和全面的合成工作负载上进行评估,两种Larch变体在令牌使用方面始终优于现有的语义过滤器优化技术。我们的结果表明,Larch在不同工作负载下具有鲁棒性,与Palimpzest和Quest相比,将总令牌成本开销降低了3倍至19倍。

英文摘要

With the advent of Large Language Models (LLMs), many database systems introduced semantic operators that enable analytical queries over unstructured data (e.g., text, images, videos). Semantic operators typically incur high inference costs and latencies, making semantic (AI) SQL queries challenging to apply to large-scale datasets. At the same time, their semantic nature leads database engines to treat them as black boxes, making AI SQL queries difficult to optimize. In this paper, we introduce Larch, a framework for optimizing the execution of semantic filters in AI SQL queries. Larch was inspired by two key observations: i) the high latency of semantic operators leaves significant room for computationally heavy runtime optimization techniques, and ii) unstructured data are typically accompanied by semantic information in the form of embeddings, allowing for efficient semantic comparisons between AI_FILTER prompts and data values. Based on these two key observations, we present two Larch variants: Larch-A2C and Larch-Sel. Larch-A2C encodes arbitrary semantic filter expression trees using an embedding-augmented Gated Graph Neural Network and formulates the filter evaluation order as a Markov decision process. In contrast, Larch-Sel leverages a supervised learning model to predict filter selectivities, subsequently applying dynamic programming to find a near-optimal evaluation order for each input row. Evaluated across diverse real-world datasets and comprehensive synthetic workloads, both Larch variants always outperform existing semantic filter optimization techniques in terms of token usage. Our results demonstrate that Larch is robust across diverse workloads, reducing total token cost overhead by 3x-19x compared to Palimpzest and Quest.
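To illustrate why predicted selectivities help with filter ordering, the sketch below covers the simplest case only: a pure conjunction of independent filters, each with a token cost and a pass probability. In that case, evaluating filters in ascending order of cost / (1 - selectivity) minimizes the expected cost. This is the classical rank-ordering rule, not Larch-Sel's dynamic program over expression trees, and the costs and selectivities are invented.

```python
from itertools import permutations

def expected_cost(order, cost, sel):
    """Expected per-row cost when filters run in `order` and stop at the first failure."""
    total, pass_prob = 0.0, 1.0
    for i in order:
        total += pass_prob * cost[i]  # filter i runs only if all earlier ones passed
        pass_prob *= sel[i]
    return total

def rank_order(cost, sel):
    """Order independent AND-ed filters by cost / (1 - selectivity), ascending."""
    def rank(i):
        return cost[i] / (1.0 - sel[i]) if sel[i] < 1.0 else float("inf")
    return sorted(range(len(cost)), key=rank)

# Hypothetical token costs and pass probabilities for three semantic filters.
cost = [100.0, 40.0, 250.0]
sel = [0.9, 0.5, 0.1]
best = rank_order(cost, sel)
brute = min(permutations(range(3)), key=lambda o: expected_cost(o, cost, sel))
```

The brute-force search over all permutations is included only to check the rule on this small example; it scales factorially, so it is not a practical strategy.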

2606.07924 2026-06-09 cs.CV cs.AI cs.CL cs.LG cs.MM 交叉投稿

Decoupling Semantics and Logic: A Training-Free Coarse-to-Fine Pipeline for Video Retrieval-Augmented Generation

解耦语义与逻辑:一种无需训练的从粗到精的视频检索增强生成流水线

Jiaxin Dai, Zehang Wei, Jiamin Yan, Xiang Xiang

发表机构 * School of Computer Science & Tech, Huazhong University of Science and Technology(华中科技大学计算机科学与技术学院) School of AI and Automation, Huazhong University of Science and Technology(华中科技大学人工智能与自动化学院)

AI总结 提出一种无需训练的两阶段级联视频RAG流水线,通过解耦语义检索与逻辑推理,实现跨语言长视频理解、严格角色遵循和零幻觉时间定位。

Comments To be presented at ACL 2026 MAGMAR Workshop (Oral; Retrieval leaderboard No.1)

详情
AI中文摘要

本文介绍了我们为第二届多模态增强生成研讨会(MAGMaR)提交的系统描述。针对跨语言长视频理解、严格角色遵循和零幻觉时间定位等关键挑战,我们提出了一种完全无需训练的两阶段级联视频RAG流水线。我们的架构通过模态感知的任务分工,策略性地将语义检索与认知逻辑推理解耦。在第一阶段,一个高召回率的语义预取模块仅使用高保真视觉摘要和全局文本描述进行密集检索,明确隔离噪声模态(如OCR和ASR)以保持纯净的向量空间。在第二阶段,一个由商业大语言模型(LLM)驱动的自适应、迭代和推理(A.I.R.)过滤代理执行细粒度认知重排序。该代理重新整合完整的多模态上下文,以强制执行与用户角色的严格逻辑对齐,有效剪除语义相似但逻辑无关的候选。最后,提示雕刻机制约束生成器将蒸馏后的子集合成为严格格式化的JSON响应,并带有精确的块级引用。在RAG轨道上的评估表明,我们的资源感知方法在信息检索和角色条件生成方面均表现出卓越的精度。

英文摘要

This paper presents our system description for the 2nd Workshop on Multimodal Augmented Generation via MultimodAl Retrieval (MAGMaR). Addressing the critical challenges of cross-lingual long-video comprehension, strict persona adherence, and zero-hallucination temporal grounding, we propose a fully training-free, two-stage cascaded Video RAG pipeline. Our architecture strategically decouples semantic retrieval from cognitive logical reasoning through a modality-aware division of labor. In the first stage, a high-recall semantic pre-fetching module employs dense retrieval using only high-fidelity visual summaries and global text descriptions, explicitly isolating noisy modalities (e.g., OCR and ASR) to maintain a pristine vector space. In the second stage, an Adaptive, Iterative, and Reasoning-based (A.I.R.) filtering agent, powered by a commercial Large Language Model (LLM), performs fine-grained cognitive reranking. The agent re-incorporates full multimodal contexts to enforce strict logical alignment with user personas, effectively pruning semantically similar but logically irrelevant candidates. Finally, a Prompt Sculpting mechanism constrains the generator to synthesize the distilled subset into strictly formatted JSON responses with exact chunk-level citations. Evaluated on the RAG track, our resource-aware approach shows exceptional precision in both information retrieval and persona-conditioned generation.

2606.08033 2026-06-09 cs.CV cs.LG 交叉投稿

Balancing Real and Synthetic Data for CNN-based Masonry Crack Detection

基于CNN的砌体裂缝检测中真实与合成数据的平衡

Mattia Forlesi, Alfonso Esposito, Ivan Zyrianoff, Alessandro Marzani, Marco Di Felice

发表机构 * University of Bologna(博洛尼亚大学)

AI总结 针对砌体裂缝检测中真实数据不足的问题,提出用合成数据补充训练,通过调整真实与合成数据比例,发现20%真实数据加合成数据即可达到甚至超越纯真实数据的效果。

详情
AI中文摘要

裂缝是建筑健康的关键指标,早期识别对于防止有害损害至关重要。深度学习(DL)的进展,特别是卷积神经网络(CNN),已实现可扩展的自动裂缝检测解决方案。然而,CNN性能高度依赖于大规模多样化数据集的可用性,这对于砌体等复杂表面尤其具有挑战性。收集足够的真实数据耗时,而公开数据集可能不充分。为解决这一限制,我们探索生成合成裂缝数据,以补充真实数据并提高训练效果。真实数据集由从博洛尼亚及周边地区建筑收集的砌体裂缝图像组成。相比之下,合成数据集使用裂缝叠加工具生成,该工具以受控方向和位置向背景图像添加裂缝。使用真实数据集训练多种DL架构,以确定最佳性能模型(InceptionV4),用于生成数据的实验。通过改变真实与合成数据的比例,在InceptionV4上测试了六种训练场景,并在由真实图像组成的测试集上使用F1分数和平均交并比(mIoU)指标进行评估。结果表明,在合成数据上训练加上少量20%真实数据,可获得与仅使用真实数据训练相当的结果。此外,20/80(合成/真实)场景实现了76%的F1分数和80%的平均IoU,优于纯真实情况。可以看出,该方法展示了合成数据在减少收集工作同时提高裂缝检测准确性的潜力。

英文摘要

Cracks are a critical indicator of building health, and early-stage identification is fundamental to preventing harmful damage. Advances in deep learning (DL), particularly convolutional neural networks (CNNs), have enabled scalable solutions for automated crack detection. However, CNN performance strongly depends on the availability of large and diverse datasets, which is particularly challenging for complex surfaces such as masonry. Collecting sufficient real data is time-consuming, while publicly available datasets may not be adequate. To address this limitation, we explored generating synthetic crack data, which complements real data and improves training effectiveness. The real dataset consists of masonry crack images collected from buildings in Bologna and surrounding areas. In contrast, the synthetic dataset was generated using a crack overlay tool that adds cracks to background images in a controlled orientation and placement. The real dataset was used to train several DL architectures to identify the best-performing model (InceptionV4), which was then employed for experiments with generated data. Six training scenarios were tested with InceptionV4 by varying the ratio of real and synthetic data, with evaluation performed on a test set composed of real images using the F1-score and mean Intersection over Union (mIoU) metrics. Results show that training on synthetic data plus a modest addition of 20% real data achieves results comparable to training on real data only. Moreover, the 20/80 scenario (synthetic/real) achieved a 76% F1-score and 80% mean IoU, outperforming the real-only case. These results demonstrate the potential of synthetic data to reduce collection effort while enhancing crack detection accuracy.

2606.08110 2026-06-09 math.FA cs.LG cross-list

New Fractional Ambiguity Function Integrated with CNN-Based Machine Learning for Signal Classification

基于CNN机器学习的分数阶模糊函数新方法用于信号分类

Aamir H. Dar, Prakhar Kumar Sonkar, Neeraj Kumar Sharma

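Affiliations * Mehta Family School of Data Science & Artificial Intelligence; Indian Institute of Technology Guwahati

AI summary Proposes a new fractional ambiguity function (NFrAF) and integrates it into a CNN framework for signal classification, improving classification accuracy over conventional representations such as the spectrogram and the classical ambiguity function.

Details
AI Chinese abstract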

从分数阶傅里叶变换导出的新分数阶模糊函数(NFrAF)作为经典模糊函数的推广被引入。严格建立了NFrAF的基本分析性质,包括对称性、边缘性和Moyal型恒等式。在验证其检测和定位单分量及多分量线性调频(LFM)信号的能力后,将NFrAF集成到基于卷积神经网络的机器学习框架中用于信号分类。由于其优越的时频分辨率和定位能力,NFrAF比传统方法(如谱图和经典模糊函数)提供了更具信息量的输入表示。在模拟数据集上的实验结果表明分类精度持续提高,突显了所提表示在数据驱动信号分析中的有效性。

English abstract

A new fractional ambiguity function (NFrAF) derived from the fractional Fourier transform is introduced as a generalization of the classical ambiguity function. The fundamental analytical properties of the NFrAF, including symmetry, marginality, and Moyal-type identities, are rigorously established. After verifying its ability to detect and localize monocomponent and multicomponent linear frequency-modulated (LFM) signals, the NFrAF is integrated into a convolutional-neural-network-based machine learning framework for signal classification. Owing to its superior time-frequency resolution and localization, the NFrAF provides a more informative input representation than conventional methods such as the spectrogram and the classical ambiguity function. Experimental results on simulated datasets demonstrate consistent improvements in classification accuracy, highlighting the effectiveness of the proposed representation for data-driven signal analysis.
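For orientation, the classical ambiguity function that the NFrAF generalizes can be sketched in a few lines of NumPy. This is the standard discrete narrowband form, not the paper's fractional variant; the chirp rate and signal length are arbitrary illustrative choices.

```python
import numpy as np


def ambiguity_function(x):
    """Magnitude of the discrete narrowband ambiguity function.

    A[m, k] = sum_n x[n + m] * conj(x[n]) * exp(-2j*pi*k*n/N)
    for integer lags m = -(N-1)..(N-1) and Doppler bins k = 0..N-1.
    """
    x = np.asarray(x, dtype=complex)
    n = x.size
    lags = np.arange(-(n - 1), n)
    amb = np.zeros((lags.size, n), dtype=complex)
    for row, m in enumerate(lags):
        prod = np.zeros(n, dtype=complex)
        if m >= 0:
            prod[: n - m] = x[m:] * np.conj(x[: n - m])
        else:
            prod[-m:] = x[: n + m] * np.conj(x[-m:])
        amb[row] = np.fft.fft(prod)  # Fourier transform over n gives the Doppler axis
    return lags, np.abs(amb)


if __name__ == "__main__":
    t = np.arange(64) / 64.0
    chirp = np.exp(1j * np.pi * 20.0 * t**2)  # a single LFM component
    lags, amb = ambiguity_function(chirp)
    print(amb.shape, amb[lags == 0, 0])
```

For an LFM signal the energy concentrates along a line in the lag-Doppler plane, which is the kind of two-dimensional structure a CNN can consume as an image-like input.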

2606.08147 2026-06-09 q-bio.GN cs.LG cross-list

Biological Reasoning-Informed Regression for Interpretable Regulatory DNA Activity Prediction

面向可解释调控DNA活性预测的生物学推理引导回归

Yi Duan, Zhao Yang, Jiwei Zhu, Ying Ba, Chuan Cao, Bing Su

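Affiliations * Gaoling School of Artificial Intelligence; Renmin University of China; Zhongguancun Academy

AI summary Proposes the R3LM framework, which uses structured biological knowledge to teach LLMs reasoning-informed regression, achieving state-of-the-art enhancer prediction with interpretable mechanistic explanations.

Comments Accepted at KDD 2026 AI4Sciences Track

Details
AI Chinese abstract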

DNA顺式调控元件(CREs)如增强子控制基因表达水平。从DNA序列准确预测调控活性是有价值但具有挑战性的,因为它需要理解复杂的生物调控过程。现有方法通常以黑盒方式从序列回归活性分数,限制了可解释性和回归性能。同时,大型语言模型(LLMs)受益于显式推理过程,但直接将LLMs应用于原始DNA序列表现不佳。在本文中,我们通过引入R3LM框架弥合这一差距,该框架通过结构化生物学知识教LLMs对调控DNA进行推理引导回归。具体来说,我们设计了一种基于生物学的数据格式,结构化DNA的调控信息以改善LLM理解,并构建了CRE-ReasonBench,这是第一个将DNA序列和活性分数与机制推理轨迹关联的数据集。通过两阶段训练,首先教LLMs对结构化生物信息进行推理,然后进行回归,R3LM在三种细胞类型的增强子预测上达到了最先进性能,优于使用原始序列输入的LLMs和专门的DNA模型,同时提供了可解释的机制解释。我们期望R3LM作为一种可解释的奖励模型,能够有效辅助生物学家进行CRE设计。代码可在https://github.com/DuanYi516/R3LM获取。

English abstract

DNA cis-regulatory elements (CREs) such as enhancers control gene expression levels. Accurately predicting regulatory activity from DNA sequences is valuable but challenging, as it requires understanding complex biological regulatory processes. Existing methods typically regress activity scores from sequences in a black-box manner, limiting both interpretability and regression performance. Meanwhile, large language models (LLMs) benefit from explicit reasoning processes, yet directly applying LLMs to raw DNA sequences performs poorly. In this paper, we bridge this gap by introducing R3LM, a framework that teaches LLMs reasoning-informed regression on regulatory DNA through structured biological knowledge. Specifically, we design a biologically grounded data format that structures DNA's regulatory information for improved LLM understanding, and construct CRE-ReasonBench, the first dataset that associates DNA sequences and activity scores with mechanistic reasoning traces. Through two-stage training that first teaches LLMs to reason over structured biological information and then performs regression, R3LM achieves state-of-the-art performance on enhancer prediction across three cell types, outperforming both LLMs with raw sequence input and specialized DNA models while providing interpretable mechanistic explanations. We expect R3LM to serve as an interpretable reward model that can effectively assist biologists in CRE design. Code is available at https://github.com/DuanYi516/R3LM.

2606.08148 2026-06-09 cond-mat.mtrl-sci cs.LG cross-list

Inverse design of bespoke interatomic potentials via active learning by information-matching

通过信息匹配的主动学习逆向设计定制原子间势

Yonatan Kurniawan, Logan D. Williams, Amit Samanta, Ilia Nikiforov, Daniel Schwalbe-Koda, Mark K. Transtrum, Ellad B. Tadmor, Vincenzo Lordi, Vasily V. Bulatov

Affiliations * Department of Physics and Astronomy, Brigham Young University; Lawrence Livermore National Laboratory; Department of Aerospace Engineering and Mechanics, University of Minnesota; Department of Materials Science and Engineering, University of California; Cross Stream Consulting

AI summary Applies the information-matching approach to active learning of bespoke interatomic potentials, selecting training data to meet prescribed uncertainty targets, predicting plastic strength in metals precisely from minimal data, and mitigating model error with a post hoc uncertainty inflation correction.

Details
AI Chinese abstract

原子间势能(IPs)能够实现超出第一性原理方法范围的大规模原子模拟,但其预测可靠性关键取决于训练数据的选择、量化不确定性和模型表达能力。主动学习(AL)为构建高效准确的IPs提供了原则性框架,但大多数策略在减少参数不确定性时未明确考虑所预测的特定材料属性。信息匹配(IM)方法通过要求所选训练数据提供至少与实现选定感兴趣量(QoIs)的预定不确定性目标所需一样多的参数空间信息,解决了这一局限性。在此,我们将IM应用于开发专门用于预测金属塑性强度的定制IPs。由于模拟塑性强度的计算成本高昂,我们采用间接IM策略,针对与强度相关的廉价中间QoIs。IM方法能够以最少的训练数据实现精确的参数约束,从而对中间QoIs和塑性强度做出精确预测。然而,模型误差仍然是一个关键限制,事后不确定性膨胀校正为缓解这一限制提供了可行手段。这些发现说明了不确定性感知的AL在预测复杂材料属性方面的前景和局限。

English abstract

Interatomic potentials (IPs) enable large-scale atomistic simulations beyond the reach of first-principles methods, but their predictive reliability depends critically on the selection of training data, quantified uncertainty, and model expressiveness. Active learning (AL) provides a principled framework for constructing efficient and accurate IPs, yet most strategies reduce parameter uncertainty without explicitly accounting for the specific material properties being predicted. The information-matching (IM) approach addresses this limitation by requiring that the selected training data provide at least as much parameter space information as needed to achieve prescribed uncertainty targets for selected quantities of interest (QoIs). Here, we apply IM to develop bespoke IPs specifically tailored for predicting plastic strength in metals. Due to the high computational cost of simulating plastic strength, we employ an indirect IM strategy that targets inexpensive intermediate QoIs that correlate with strength. The IM method enables precise parameter constraints with minimal training data, yielding precise predictions for both the intermediate QoIs and plastic strength. Yet, model error remains a key limitation, and a post hoc uncertainty inflation correction provides a viable means to mitigate this limitation. These findings illustrate both the promise and limits of uncertainty-aware AL for predicting complex material properties.
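The information-matching requirement can be read as a matrix inequality: the weighted Fisher information of the selected training data should dominate the information needed for the target QoI uncertainty. The sketch below only checks that condition for given weights. It assumes a least-squares setting where the Fisher information is built from a Jacobian and per-datum noise levels, and all names are hypothetical. The optimization that actually chooses the weights is not shown.

```python
import numpy as np


def fisher_information(jacobian, sigma):
    """FIM of a least-squares model: J^T diag(1/sigma^2) J."""
    j = np.asarray(jacobian, dtype=float) / np.asarray(sigma, dtype=float)[:, None]
    return j.T @ j


def information_matched(candidate_fims, weights, target_fim, tol=1e-10):
    """True if sum_m w_m * I_m - I_target is positive semidefinite."""
    total = sum(w * f for w, f in zip(weights, candidate_fims))
    return bool(np.linalg.eigvalsh(total - target_fim).min() >= -tol)


if __name__ == "__main__":
    # Two candidate configurations, each informative about one parameter only.
    fim_a = fisher_information([[1.0, 0.0]], [1.0])
    fim_b = fisher_information([[0.0, 1.0]], [1.0])
    target = np.diag([2.0, 3.0])
    print(information_matched([fim_a, fim_b], [2.0, 3.0], target))
    print(information_matched([fim_a, fim_b], [2.0, 1.0], target))
```

Data with zero weight contribute nothing to the inequality, which is how such a formulation can lead to small training sets.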

2606.08169 2026-06-09 cs.RO cs.AI cs.CL cs.HC cs.LG cross-list

CLASP: Language-Driven Robot Skill Selection and Composition using Task-Parameterized Learning

CLASP: 基于语言驱动的机器人技能选择与组合,采用任务参数化学习

Markus Knauer, Valentin Gieraths, Tai Mai, Samuel Bustamante, Alin Albu-Schäffer, Freek Stulp, João Silvério

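Affiliations * German Aerospace Center (DLR), Institute of Robotics and Mechatronics (RMC); Technical University of Munich (TUM)

AI summary Proposes the CLASP architecture, which combines task-parameterized kernelized movement primitives (TP-KMPs) with a pretrained vision-language model (VLM) to perform skill selection, composition, and active learning from natural language commands without fine-tuning, reaching success rates of 73.3%-100% on a 7-DoF manipulator.

Comments 23 pages, 11 figures, 4 tables, 1 listing

Details
AI Chinese abstract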

使机器人能够理解自然语言命令并执行任务,同时保持数据效率仍然具有挑战性。视觉-语言-动作(VLA)和视觉-语言模型(VLM)等基础模型提供了直观的交互通道,但需要大量数据;任务参数化模仿学习实现了数据效率,但缺乏自然语言基础。这项工作通过一个模块化架构弥合了这一差距,该架构将任务参数化核化运动基元(TP-KMP)与预训练VLM相结合。在学习过程中,技能从2到5次动觉演示中获取,VLM生成描述每个技能参数和前提条件的技能模式。在执行过程中,VLM解释命令以选择技能,推理参数绑定,并通过协方差加权组合创建新颖行为。当没有技能或组合足够时,系统识别能力差距并请求有针对性的演示,所有这些都无需微调。在7自由度机械臂上的验证显示,在需要技能选择、组合和主动学习的场景中,成功率达到73.3%-100%。

English abstract

Enabling robots to understand and execute tasks from natural language commands while maintaining data efficiency remains challenging. Foundation models such as vision-language-action (VLA) and vision-language models (VLMs) provide intuitive interaction channels but require extensive data; task-parameterized imitation learning achieves data efficiency but lacks natural language grounding. This work bridges this gap through a modular architecture combining task-parameterized kernelized movement primitives (TP-KMPs) with pretrained VLMs. During learning, skills are acquired from 2 to 5 kinesthetic demonstrations, and the VLM generates skill schemas describing each skill's parameters and preconditions. During execution, the VLM interprets commands to select skills, reason about parameter bindings, and create novel behaviors through covariance-weighted composition. When no skill or composition suffices, the system identifies capability gaps and requests targeted demonstrations, all without fine-tuning. Validation on a 7-DoF manipulator shows success rates of 73.3%-100% in scenarios requiring skill selection, composition, and active learning.
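One common way to realize covariance-weighted composition of probabilistic skill outputs is a product of Gaussians, in which each skill contributes in proportion to its precision. The sketch below assumes that reading; the abstract does not state CLASP's exact rule, and `fuse_gaussians` is a hypothetical name.

```python
import numpy as np


def fuse_gaussians(means, covs):
    """Product of Gaussians: precision-weighted mean and fused covariance."""
    precisions = [np.linalg.inv(c) for c in covs]
    fused_cov = np.linalg.inv(sum(precisions))
    fused_mean = fused_cov @ sum(p @ m for p, m in zip(precisions, means))
    return fused_mean, fused_cov


if __name__ == "__main__":
    # Skill A is confident along x, skill B is confident along y.
    mean_a, cov_a = np.array([0.0, 0.0]), np.diag([1.0, 100.0])
    mean_b, cov_b = np.array([2.0, 2.0]), np.diag([100.0, 1.0])
    mean, cov = fuse_gaussians([mean_a, mean_b], [cov_a, cov_b])
    print(mean, np.diag(cov))
```

In the usage example the fused mean stays near skill A along x and near skill B along y, so each skill dominates where its covariance is small.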

2606.08173 2026-06-09 cs.CR cs.LG cs.NI cross-list

AI-Native Closed-Loop Security for 6G-Enabled Cyber-Physical Systems: From Edge Detection to Network-Wide Mitigation

面向6G赋能信息物理系统的AI原生闭环安全:从边缘检测到全网缓解

Bilal Hussain, Muhammad Bilal, Tan Li, Haris Pervaiz, Xiao Tang, Qinghe Du, Fawad Ahmad, Muhammad Azhar, Jun Zhang

Affiliations * Division of Science, Engineering, and Health Studies, School of Professional Education and Executive Development, The Hong Kong Polytechnic University; School of Computing and Communications, Lancaster University; Department of Computer Science, The Hang Seng University of Hong Kong; School of Computer Science and Electronic Engineering, University of Essex; School of Information and Communication Engineering, Xi’an Jiaotong University; Department of Applied Data Science, Hong Kong Shue Yan University; Department of Electronic and Computer Engineering, Hong Kong University of Science and Technology

AI summary Surveys an AI-native closed-loop security architecture for 6G CPSs that detects at the edge with MEC and mitigates network-wide via SDN/NFV/O-RAN, formalises per-slice tail-bounded latency constraints, systematically reviews 128 studies, and consolidates open problems into five directions.

Comments 30 pages, 12 figures, survey paper, submitted to IEEE Communications Surveys & Tutorials (IEEE COMST)

Details
AI Chinese abstract

在第六代(6G)网络中,数十亿信息物理系统(CPS)——包括自动驾驶车辆、智能电网、工业机器人和远程手术设备——将在超可靠低延迟切片上运行,将远程入侵与物理危害之间的时间差压缩至毫秒级,这是传统边界防火墙和集中式安全运营中心无法满足的。本综述将6G CPS安全重新定义为一种闭环、AI原生的流水线,该流水线在多接入边缘计算(MEC)层进行感知,使用分钟级呼叫详细记录(CDR)进行基线学习,并使用亚毫秒级RAN/开放RAN(O-RAN)遥测数据用于延迟关键路径。它在本地使用压缩深度模型进行决策,通过SDN、NFV和O-RAN控制器实现全网缓解,并通过联邦学习(FL)和数字孪生(DT)回放进行再训练。我们形式化了感知、检测和缓解阶段每个切片的尾部有界延迟契约,该契约在切片依赖的尾部百分位数(安全关键URLLC切片为p99)上强制执行。按照PRISMA 2020协议组织128篇同行评审研究(2017-2026),我们:(i)将6G/CPS威胁面映射到MITRE ATT&CK和CDR可观测特征空间;(ii)统一了跨十二个数据集以及统计、图和Transformer模型的边缘异常检测和DDoS分类;(iii)将SDN/NFV/O-RAN原语综合成一个闭环参考架构;(iv)将FL、大语言模型(LLM)、DT、后量子密码(PQC)、零信任架构(ZTA)和可解释AI视为跨领域使能器,而非并行支柱;(v)将开放问题整合为五个方向,涵盖数据、延迟、信任、标准化和评估。

English abstract

In sixth-generation (6G) networks, billions of cyber-physical systems (CPSs) - autonomous vehicles, smart grids, industrial robots, and remote-surgical equipment - will run over ultra-reliable low-latency slices, collapsing the gap between a remote breach and physical harm to milliseconds, a budget perimeter firewalls and centralised security operations centres cannot meet. This survey reframes 6G CPS security as a closed-loop, AI-native pipeline that senses at the multi-access edge computing (MEC) tier, using minute-scale call-detail records (CDRs) for baseline learning and sub-millisecond RAN/Open-RAN (O-RAN) telemetry for the latency-critical path. It decides locally with compressed deep models, mitigates network-wide via SDN, NFV, and O-RAN controllers, and retrains through federated learning (FL) and digital-twin (DT) replay. We formalise a per-slice, tail-bounded latency contract on the sense, detect, and mitigate stages, enforced at a slice-dependent tail percentile (p99 for safety-critical URLLC slices). Organising 128 peer-reviewed studies (2017-2026) under a PRISMA 2020 protocol, we (i) map the 6G/CPS threat surface to MITRE ATT&CK and a CDR-observable feature space; (ii) unify edge anomaly detection and DDoS classification across twelve datasets and statistical, graph, and transformer models; (iii) synthesise SDN/NFV/O-RAN primitives into one closed-loop reference architecture; (iv) treat FL, large language models (LLMs), DT, post-quantum cryptography (PQC), zero-trust architecture (ZTA), and explainable AI as cross-cutting enablers, not parallel pillars; and (v) consolidate open problems into five directions spanning data, latency, trust, standardisation, and evaluation.
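A tail-bounded latency contract of the kind described here can be checked against measured traces as in the sketch below. It assumes that per-event sense, detect, and mitigate latencies are summed into one end-to-end figure and compared with a slice budget at a chosen percentile. The budget value, the summation, and the function names are illustrative assumptions, not the survey's formal definition.

```python
import math


def percentile(samples, q):
    """Nearest-rank percentile for q in (0, 100]."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100.0))
    return ordered[rank - 1]


def meets_contract(sense_ms, detect_ms, mitigate_ms, budget_ms, q=99.0):
    """Return (ok, tail) where tail is the q-th percentile of end-to-end latency."""
    totals = [s + d + m for s, d, m in zip(sense_ms, detect_ms, mitigate_ms)]
    tail = percentile(totals, q)
    return tail <= budget_ms, tail


if __name__ == "__main__":
    sense = [0.2, 0.3, 0.2, 0.9]
    detect = [0.5, 0.4, 0.6, 2.0]
    mitigate = [1.0, 1.1, 0.9, 4.0]
    print(meets_contract(sense, detect, mitigate, budget_ms=5.0, q=99.0))
```

Checking the tail rather than the mean matters because a single slow mitigation, like the last event in the usage example, can violate a safety-critical slice even when the average looks healthy.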

2606.08247 2026-06-09 eess.AS cs.AI cs.LG eess.SP cross-list

AeroSpectra Sentinel: An Auditable LLM Prompt-Chaining Decision-Support Workflow for Acute Asthma Risk Assessment from Respiratory Sounds and Clinical Signals

AeroSpectra Sentinel:一种用于从呼吸音和临床信号进行急性哮喘风险评估的可审计LLM提示链决策支持工作流

Aueaphum Aueawatthanaphisut

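Affiliations * School of Information, Computer, and Communication Technology; Sirindhorn International Institute of Technology, Thammasat University

AI summary Proposes AeroSpectra Sentinel, which combines STFT respiratory sound analysis, lightweight ML screening, clinical feature fusion, and a five-stage LLM prompt chain for auditable acute asthma risk assessment; the audio screening component is evaluated on a public dataset and the LLM workflow on 40 simulated clinical vignettes.

Comments 10 pages, 8 figures, 5 tables, 14 equations

Details
AI Chinese abstract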

急性哮喘风险评估需要快速解读呼吸音、氧合、气流受限、言语能力、呼吸做功、精神状态以及对缓解治疗的反应。传统的纯音频分类器可以检测喘息样模式,但通常缺乏透明的临床推理和安全升级逻辑。本文提出AeroSpectra Sentinel,一个客户端研究原型和决策支持工作流,结合短时傅里叶变换(STFT)呼吸音分析、轻量机器学习筛查、临床特征融合和五阶段大语言模型(LLM)提示链过程。该工作流分离了信号采集、预处理、声学特征提取、ML筛查、临床护栏和FHIR就绪报告。我们在一个包含来自五个标签的1,211个WAV录音的公共呼吸音数据集上评估了音频筛查组件。使用584个录音的分层子集,随机森林在哮喘与非哮喘筛查中实现了91.10%的二元准确率和78.69%的F1分数,而基于特征的多层感知器实现了89.73%的准确率和78.26%的F1分数。紧凑的log-spectrogram CNN实现了73.29%的准确率和55.17%的F1分数。多类分类实现了77.40%的准确率和77.23%的宏F1。为了评估LLM工作流,我们对40个模拟临床场景进行了基于场景的审计,比较了一次性提示、提示链、带护栏的提示链以及带护栏加FHIR模式验证的提示链。护栏加模式变体实现了最强的模拟安全性和文档一致性。AeroSpectra Sentinel旨在作为研究原型,而非诊断医疗设备或临床验证的风险评估产品。

English abstract

Acute asthma risk assessment requires rapid interpretation of respiratory sounds, oxygenation, airflow limitation, speech ability, work of breathing, mental status, and response to reliever therapy. Conventional audio-only classifiers can detect wheeze-like patterns but often lack transparent clinical reasoning and safe escalation logic. This paper presents AeroSpectra Sentinel, a client-side research prototype and decision-support workflow that combines short-time Fourier transform (STFT) respiratory sound analysis, lightweight machine-learning screening, clinical feature fusion, and a five-stage large language model (LLM) prompt-chaining process. The workflow separates signal acquisition, preprocessing, acoustic feature extraction, ML screening, clinical guardrails, and FHIR-ready reporting. We evaluated the audio screening component on a public respiratory sound dataset containing 1,211 WAV recordings from five labels. Using a stratified subset of 584 recordings, a random forest achieved 91.10% binary accuracy and 78.69% F1-score for asthma-vs-non-asthma screening, while a feature-based multilayer perceptron achieved 89.73% accuracy and 78.26% F1-score. A compact log-spectrogram CNN achieved 73.29% accuracy and 55.17% F1-score. Multiclass classification achieved 77.40% accuracy and 77.23% macro-F1. To evaluate the LLM workflow, we conducted a scenario-based audit on 40 simulated clinical vignettes comparing one-shot prompting, prompt chaining, prompt chaining with guardrails, and prompt chaining with guardrails plus FHIR schema validation. The guardrail-plus-schema variant achieved the strongest simulated safety and documentation consistency. AeroSpectra Sentinel is intended as a research prototype, not as a diagnostic medical device or clinically validated risk-assessment product.
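The STFT front end mentioned in this abstract can be illustrated with a minimal log-spectrogram in NumPy. The frame length, hop size, Hann window, and sampling rate below are assumed values for the sketch; the paper's actual preprocessing settings are not given in the abstract.

```python
import numpy as np


def log_spectrogram(signal, frame_len=256, hop=128):
    """Log-power STFT; expects at least frame_len samples. Shape: (frames, frame_len//2 + 1)."""
    signal = np.asarray(signal, dtype=float)
    window = np.hanning(frame_len)
    n_frames = 1 + (signal.size - frame_len) // hop
    frames = np.stack(
        [signal[i * hop : i * hop + frame_len] * window for i in range(n_frames)]
    )
    power = np.abs(np.fft.rfft(frames, axis=1)) ** 2
    return np.log(power + 1e-10)  # small offset avoids log(0)


if __name__ == "__main__":
    sr = 8000  # assumed sampling rate in Hz
    t = np.arange(2048) / sr
    tone = np.sin(2 * np.pi * 1000 * t)  # stand-in for a wheeze-like narrowband sound
    spec = log_spectrogram(tone)
    print(spec.shape, spec.argmax(axis=1)[:3])
```

Summary statistics of such a matrix could feed a feature-based classifier, while the matrix itself could serve as input to a compact CNN.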

2606.08305 2026-06-09 stat.ML cs.LG cross-list

MEC-Cox: Machine-Learning-Assisted Generalized Entropy Calibration for ATT Marginal Hazard-Ratio Estimation

MEC-Cox:基于机器学习的广义熵校准用于ATT边际风险比估计

Se Yoon Lee, Yonghyun Kwon, Jae Kwang Kim

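Affiliations * Department of Statistics, Texas A&M University; Department of Mathematics, Korea Military Academy; Department of Statistics, Iowa State University

AI summary Proposes MEC-Cox, which combines machine-learning-assisted generalized entropy calibration with inverse-probability-weighted Cox regression to estimate an ATT-type marginal hazard ratio, calibrating on cross-fitted prognostic scores to reduce bias and improve efficiency.

Details
AI Chinese abstract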

当同时随机对照不可行时,外部对照生存试验越来越多地用于肿瘤学和罕见病等具有时间至事件终点的场景。我们针对处理组平均处理效应(ATT)类型的边际风险比估计量,比较处理组试验人群中的治疗与反事实对照,并使用逆概率加权(IPW)Cox回归进行估计。由于IPW Cox回归通过事件贡献和风险集平均值依赖于权重,使得灵活的机器学习干扰估计难以直接纳入,有效推断具有挑战性。基于Lee和Kim(2026)的机器学习辅助广义熵校准(MEC),我们提出了用于ATT加权IPW Cox回归的MEC-Cox方法。该方法首先对外部对照使用归一化的源倾向得分优势比权重,然后应用Bregman校准来平衡外部对照与处理组试验患者之间的交叉拟合预后摘要。校准基础可包括对照生存预测、Cox线性预测器、惩罚生存模型预测或其他预后评分摘要。因此,MEC更新后的权重扮演源传输和预后评分平衡权重的双重角色。我们建立了相合性,刻画了校准带来的效率增益,并开发了堆叠三明治方差估计器。模拟表明,MEC-Cox通过灵活的机器学习辅助调整可以减少偏差、提高效率并改善覆盖。

English abstract

Externally controlled survival trials are increasingly used when concurrent randomized controls are infeasible, particularly in oncology and rare-disease settings with time-to-event endpoints. We target an average-treatment-effect-on-the-treated (ATT)-type marginal hazard-ratio estimand, comparing treatment with counterfactual control in the treated trial population, and estimate it using inverse-probability-weighted (IPW) Cox regression. Valid inference is challenging because IPW Cox regression depends on the weights through both event contributions and risk-set averages, making flexible machine-learning nuisance estimation difficult to incorporate directly. Building on machine-learning-assisted generalized entropy calibration (MEC) by Lee and Kim (2026), we propose MEC-Cox for ATT-weighted IPW Cox regression. The method begins with normalized source-propensity-score odds weights for external controls and then applies Bregman calibration to balance cross-fitted prognostic summaries between external controls and treated trial patients. The calibration basis may include control-survival predictions, Cox linear predictors, penalized-survival-model predictions, or other prognostic-score summaries. MEC-updated weights therefore play a dual role as source-transport and prognostic-score balancing weights. We establish consistency, characterize a calibration-induced efficiency gain, and develop a stacked sandwich variance estimator. Simulations show that MEC-Cox can reduce bias, increase efficiency, and improve coverage through flexible machine-learning-assisted adjustment.
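The two weighting steps described in this abstract can be illustrated in a simplified form. The sketch starts from normalized odds weights p/(1-p) for external controls, then calibrates them by exponential tilting so that the weighted mean of one prognostic score matches a target. Exponential tilting is only one member of the Bregman family, the single-score bisection solver is an assumption made for brevity, and none of this is the authors' implementation.

```python
import math


def att_odds_weights(source_propensity):
    """Weights proportional to p / (1 - p), normalized to sum to one."""
    odds = [p / (1.0 - p) for p in source_propensity]
    total = sum(odds)
    return [o / total for o in odds]


def weighted_mean(values, weights):
    return sum(v * w for v, w in zip(values, weights))


def calibrate_weights(base_weights, scores, target_mean, lo=-50.0, hi=50.0, iters=200):
    """Tilt weights by exp(lam * score); bisect on lam until the mean matches.

    Assumes target_mean lies strictly between min(scores) and max(scores)
    and that scores are on a moderate scale (to avoid overflow in exp).
    """
    def tilted(lam):
        raw = [w * math.exp(lam * s) for w, s in zip(base_weights, scores)]
        total = sum(raw)
        return [r / total for r in raw]

    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if weighted_mean(scores, tilted(mid)) < target_mean:
            lo = mid  # tilted mean is nondecreasing in lam
        else:
            hi = mid
    return tilted(0.5 * (lo + hi))


if __name__ == "__main__":
    base = att_odds_weights([0.5, 0.5, 0.8])
    scores = [0.0, 1.0, 2.0]  # hypothetical prognostic scores of external controls
    calibrated = calibrate_weights(base, scores, target_mean=1.0)
    print(weighted_mean(scores, base), weighted_mean(scores, calibrated))
```

The calibrated weights could then be passed as case weights to a weighted Cox fit, which is the step where the hazard ratio would actually be estimated.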

2606.08587 2026-06-09 stat.ML cs.LG cross-list

Improving the sharpness in neural network-based parametric post-processing of ensemble forecasts

提高基于神经网络的集合预报参数化后处理中的锐度

Ágnes Baran, Máté Mihalina

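Affiliations * Faculty of Informatics, University of Debrecen

AI summary Addresses the loss of sharpness in post-processed ensemble forecasts by adding a penalty term to the loss function, narrowing the nominal central prediction interval by a relative 8.2%-12.5% with no deterioration in the mean CRPS or in the RMSE of the predictive mean.

Comments 18 pages

Details
AI Chinese abstract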

统计后处理已被证明是改进不同天气变量集合预报的有效工具。案例研究表明,后处理可以纠正集合预报通常存在的分散不足和潜在偏差行为,同时优化表示预报技巧的适当评分规则。这些积极效应的代价通常是锐度下降;中心预测区间的宽度和预测的不确定性增加,尤其是在较短预报时效。本研究旨在通过扩展网络损失函数加入惩罚项,减少基于神经网络的参数化后处理方法中后一种现象的程度。我们使用从EUPPBench基准数据集下载的欧洲中期天气预报中心2米温度集合预报,并对照天气观测进行验证,展示了所提技术的效果。这里,预测分布为高斯分布,我们使用连续排序概率评分(CRPS)作为损失函数。案例研究证实,与未加惩罚项计算的预测分布宽度相比,名义中心预测区间的宽度有显著相对减小(8.2%-12.5%),而概率预报的平均CRPS和预测均值的RMSE没有恶化。

English abstract

Statistical post-processing has proven to be an effective tool for improving ensemble forecasts of different weather variables. Case studies show that post-processing can remedy the typically underdispersive and potentially biased behaviour of the ensemble while optimizing a proper scoring rule expressing the forecast skill. The price of these positive effects is generally a deterioration in sharpness; the width of the central prediction intervals and the uncertainty of the predictions increase, especially for shorter lead times. This work aims to reduce the extent of the latter phenomenon for neural network-based parametric post-processing methods by extending the network's loss function with a penalty term. We demonstrate the effect of the proposed technique for 2m temperature ensemble forecasts of the European Centre for Medium-Range Weather Forecasts downloaded from the EUPPBench benchmark dataset and verified against synoptic observations. Here, the predictive distribution is Gaussian, and we use the continuous ranked probability score (CRPS) as the loss function. The case studies confirm a substantial relative decrease ($8.2\%-12.5\%$) in the width of the nominal central prediction interval compared to the width of the predictive distribution computed without the penalty term, while there is no deterioration in the mean CRPS of probabilistic forecasts and in the RMSE of the predictive mean.
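For a Gaussian predictive distribution the CRPS has a well-known closed form, which makes a penalized loss easy to sketch. The penalty used below, `lam * sigma`, is a hypothetical choice made only for illustration; the abstract does not specify the form of the paper's penalty term.

```python
import math


def crps_gaussian(mu, sigma, y):
    """Closed-form CRPS of N(mu, sigma^2) at observation y."""
    z = (y - mu) / sigma
    pdf = math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)
    cdf = 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))
    return sigma * (z * (2.0 * cdf - 1.0) + 2.0 * pdf - 1.0 / math.sqrt(math.pi))


def penalized_loss(mu, sigma, y, lam=0.1):
    """CRPS plus a hypothetical sharpness penalty proportional to sigma."""
    return crps_gaussian(mu, sigma, y) + lam * sigma


if __name__ == "__main__":
    for sigma in (0.5, 1.0, 2.0):
        print(sigma, crps_gaussian(0.0, sigma, 0.3), penalized_loss(0.0, sigma, 0.3))
```

Since the width of a central prediction interval of a Gaussian is proportional to sigma, any penalty that grows with sigma pushes the network toward sharper forecasts, and the weight `lam` sets the trade-off against calibration.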

2606.08611 2026-06-09 eess.SY cs.LG cs.SY cross-list

Bayesian Optimization of a Multi-Product Chemical Reactor Using Composite Models and Partial Physics Knowledge

使用复合模型和部分物理知识的多产品化学反应器贝叶斯优化

Liqiu Dong, Marta Zagórowska, Mehmet Mercangöz

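Affiliations * Department of Chemical Engineering, Imperial College London; DCSC, Delft University of Technology

AI summary Proposes a composite Bayesian optimization method in which Gaussian processes predict physical quantities from which profit is computed analytically, combined with an energy-balance residual penalty and uncertainty-aware constraint handling, for data-driven real-time economic optimization of a multi-product reactor.

Comments Accepted to IFAC 2026. 11 pages, 4 figures

Details
AI Chinese abstract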

我们研究了多产品化学反应器的数据驱动实时经济优化问题,当没有可靠的基于第一性原理的模型(除了稳态能量平衡)时。我们不直接学习经济目标作为黑箱函数,而是使用复合公式,其中高斯过程(GP)模型预测物理上有意义的输出,包括产品浓度和反应器温度,而利润则根据这些预测以及原材料、产品和公用事业价格解析计算。这保留了经济目标的结构,使其在价格变化时无需重新训练即可参数化,并允许通过物理残差检查候选操作点是否符合可用的能量平衡。GP还提供预测不确定性,在贝叶斯优化(BO)框架中利用该不确定性进行数据高效探索以及通过上置信界保守地执行反应器温度约束。采集函数还惩罚通过将GP预测的输出和候选输入代入可用的稳态能量平衡而获得的大能量平衡失配。该方法在非等温多产品反应器的基准模拟上进行了演示。相对于信任域安全BO实现,所提出的方法在可用迭代预算内实现了更好的模拟经济性能。相对于不使用可用物理信息的纯数据驱动BO方法,它避免了反应器温度约束违反。

English abstract

We study data-driven real-time economic optimization of a multi-product chemical reactor when no reliable first-principles model is available beyond a steady-state energy balance. Instead of learning the economic objective directly as a black-box function, we use a composite formulation in which Gaussian process (GP) models predict physically meaningful outputs, including product concentrations and reactor temperature, while profit is computed analytically from these predictions together with raw-material, product, and utility prices. This preserves the structure of the economic objective, makes it parametric in changing prices without needing retraining, and allows candidate operating points to be checked against the available energy balance through a physics residual. The GPs also provide predictive uncertainty, which is exploited in a Bayesian optimization (BO) framework both for data-efficient exploration and for conservative enforcement of the reactor temperature constraint through an upper confidence bound. The acquisition function additionally penalizes large energy-balance mismatch obtained by substituting the GP-predicted outputs and candidate inputs into the available steady-state energy balance. The approach is demonstrated on a benchmark simulation of a non-isothermal multi-product reactor. Relative to a trust-region safe BO implementation, the proposed method achieves better simulated economic performance within the available iteration budget. Relative to a purely data-driven BO approach that does not use the available physics information, it avoids reactor temperature constraint violations.
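The composite structure described here, where profit is computed analytically from predicted physical outputs, the temperature constraint is screened with an upper confidence bound, and energy-balance mismatch is penalized, can be sketched as a scoring function for one candidate operating point. The GP predictions are passed in as plain numbers. The parameter names, the quadratic penalty, and the hard rejection rule are assumptions of this sketch, and the exploration term of a full acquisition function is omitted.

```python
def profit(conc_mean, product_prices, feed_cost, utility_cost):
    """Analytic profit from predicted product concentrations and current prices."""
    revenue = sum(c * p for c, p in zip(conc_mean, product_prices))
    return revenue - feed_cost - utility_cost


def score_candidate(conc_mean, temp_mean, temp_std, energy_residual,
                    product_prices, feed_cost, utility_cost,
                    temp_max, beta=2.0, rho=10.0):
    """Reject points whose temperature UCB exceeds the limit; penalize physics mismatch."""
    temp_ucb = temp_mean + beta * temp_std  # conservative temperature estimate
    if temp_ucb > temp_max:
        return float("-inf")
    return (profit(conc_mean, product_prices, feed_cost, utility_cost)
            - rho * energy_residual ** 2)


if __name__ == "__main__":
    print(score_candidate([1.0, 2.0], 350.0, 5.0, 0.5,
                          [3.0, 4.0], 2.0, 1.0, temp_max=365.0))
```

Because prices enter only through `profit`, a price change would alter the score without refitting any surrogate, which matches the abstract's point about the objective being parametric in prices.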

2606.08633 2026-06-09 cs.AI cs.LG cross-list

Towards Long-Horizon Vessel Trajectory and Destination Forecasting with Reasoning Large Language Models

面向长时域船舶轨迹与目的地预测的推理型大语言模型

Hongwei Wang, Miao Zhou, Fengde Wang, Yuting Wang, Jiewen Yu, Jun-Yan He, Bohao Qu, Wanbing Zhang, Xiuju Fu, Qing Guo, Zipei Fan, Yingying Xing, Yi Yuan

Affiliations * Institute of High Performance Computing (IHPC), A*STAR, Singapore; The Key Laboratory of Road and Traffic Engineering, Ministry of Education, Tongji University; Meituan Inc., Shenzhen, China; Centre for Frontier AI Research (CFAR), A*STAR, Singapore; Nankai University; School of Artificial Intelligence, Jilin University

AI summary Proposes a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR) that converts trajectories into semantic text and uses physical-validity constraints and hierarchical destination matching to improve long-horizon (30-day) forecasting; 4B models perform best overall.

Comments The IEEE International Conference on Intelligent Transportation Systems (ITSC) 2026, Naples, Italy

Details
AI Chinese abstract

长时域海上轨迹预测对航运管理、物流规划和海上风险分析至关重要,但月度级别的预测仍研究不足。现有深度学习方法主要关注短期和中期坐标外推,在长时间跨度下往往难以保持路线可行性和目的地正确性。本文研究了利用具备推理能力的大语言模型进行联合长时域船舶轨迹和目的地预测,并基于可验证奖励强化学习(RLVR)开发了Maritime LLM后训练框架。构建了一个基于AIS的基准数据集,包含60天历史轨迹和30天预测范围,其中轨迹被转换为语义文本表示用于RL提示构建。RLVR通过强制执行物理有效性、提供早期加权轨迹监督以及通过层次匹配和课程学习评估目的地正确性,使LLM与海上预测目标对齐。实验结果表明,RLVR训练的LLM在零样本LLM和代表性深度学习基线方法上均有显著提升,尤其在目的地相关指标上。在评估的RLVR训练变体中,4B LLM实现了最佳整体性能,表明奖励兼容优化和任务特定容量匹配比单纯使用更大的8B或14B LLM更为重要。结果还显示,在有限的微调数据下,LSTM仍然是一个强大的深度学习基线,而Transformer风格的时空模型通常需要更大的数据集和更丰富的结构化输入。总体而言,这项工作推进了用于运营决策支持的语义化、验证器对齐的海上预测。

English abstract

Long-horizon maritime trajectory prediction is important for shipping management, logistics planning, and maritime risk analysis, yet month-level forecasting remains insufficiently studied. Existing deep learning methods mainly focus on short- and mid-term coordinate extrapolation and often struggle to preserve route feasibility and destination correctness over extended horizons. This paper investigates joint long-horizon vessel trajectory and destination forecasting with reasoning-capable large language models, and develops a Maritime LLM post-training framework based on Reinforcement Learning with Verifiable Reward (RLVR). An AIS-based benchmark is constructed with 60-day historical trajectories and 30-day forecasting horizons, where trajectories are converted into semantic textual representations for RL prompt construction. RLVR aligns LLMs with maritime forecasting objectives by enforcing physical validity, providing early-weighted trajectory supervision, and evaluating destination correctness through hierarchical matching and curriculum learning. Experimental results show that RLVR-trained LLMs substantially improve over zero-shot LLMs and representative deep learning baselines, especially on destination-related metrics. Among the evaluated RLVR-trained variants, 4B LLMs achieve the best overall performance, suggesting that reward-compatible optimization and task-specific capacity matching are more important than simply using larger 8B or 14B LLMs. The results also show that LSTM remains a strong deep learning baseline under limited fine-tuning data, while Transformer-style spatio-temporal models typically require larger datasets and richer structured inputs. Overall, this work advances semantic, verifier-aligned maritime forecasting for operational decision support.
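A verifiable reward with the three ingredients named in this abstract (a physical-validity gate, early-weighted trajectory supervision, and hierarchical destination matching) might look like the sketch below. Every threshold, weight, and tier value is a hypothetical placeholder, as is the representation of destinations as (port, country, region) tuples; the paper's reward design is not given in the abstract.

```python
import math


def haversine_km(p, q):
    """Great-circle distance between two (lat, lon) points in degrees."""
    lat1, lon1, lat2, lon2 = map(math.radians, (p[0], p[1], q[0], q[1]))
    a = (math.sin((lat2 - lat1) / 2) ** 2
         + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371.0 * math.asin(math.sqrt(a))


def trajectory_reward(pred, truth, max_speed_kmh=60.0, step_hours=24.0,
                      decay=0.9, scale_km=500.0):
    """Zero for malformed or physically implausible output, else an early-weighted score."""
    if not pred or len(pred) != len(truth):
        return 0.0
    for a, b in zip(pred, pred[1:]):
        if haversine_km(a, b) > max_speed_kmh * step_hours:
            return 0.0  # jump too large for one time step
    weights = [decay ** i for i in range(len(pred))]  # earlier steps count more
    score = sum(w * math.exp(-haversine_km(p, t) / scale_km)
                for w, p, t in zip(weights, pred, truth))
    return score / sum(weights)


def destination_reward(pred, truth):
    """Hierarchical match on (port, country, region): exact port beats coarser levels."""
    if pred[0] == truth[0]:
        return 1.0
    if pred[1] == truth[1]:
        return 0.5
    if pred[2] == truth[2]:
        return 0.2
    return 0.0


if __name__ == "__main__":
    track = [(1.0, 103.0), (1.5, 104.0)]
    print(trajectory_reward(track, track))
    print(destination_reward(("MYPKG", "MY", "SEA"), ("SGSIN", "SG", "SEA")))
```

Partial credit at the country or region level gives the policy a learning signal before it can name the exact port.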

2606.08714 2026-06-09 eess.SY cs.AI cs.LG cs.RO cs.SY cross-list

Hybrid Neural Network and Conventional Controller Approach for Robust Control of Highly Unstable Systems: Application to Tilt-Rotor Control

混合神经网络与传统控制器方法用于高度不稳定系统的鲁棒控制:应用于倾转旋翼控制

Ali Kafili Gavgani, Amin Talaeizadeh, Aria Alasty, Hossein Nejat Pishkenari

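Affiliations * Advanced Research Lab for Control and Agricultural Robotics (Sharif AgRoLab); Department of Mechanical Engineering, Sharif University of Technology, Tehran, Iran

AI summary Proposes a neural-network-enhanced sliding mode controller that decomposes the system dynamics into input-independent and input-dependent parts, learning the former from a small dataset with lightweight networks, to robustly control a fully actuated tilt-rotor system; the LSTM-based variant outperforms the MLP-based one.

Comments Proceedings of the 13th RSI International Conference on Robotics and Mechatronics (ICRoM 2025)

Details
AI Chinese abstract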

多旋翼飞行器广泛应用于从监视到精准农业等领域,但传统设计仍受限于其欠驱动特性。倾转旋翼配置通过实现全驱动克服了这一限制。本文研究基于神经网络的控制策略,用于一个具有四个推力矢量输入的全驱动倾转旋翼系统。我们的工作分为两部分。首先,我们有意呈现一个负面结果,通过评估直接输入-输出控制方法。在该方法中,多层感知器(MLP)、长短期记忆(LSTM)网络和Transformer模型被训练为直接将系统状态及其期望值映射到控制信号。我们表明该策略无法稳定系统,凸显了将直接输入-输出学习应用于高度不稳定对象的固有困难。其次,作为主要贡献,我们提出一种神经网络增强的滑模控制器(SMC)。该方法将系统动力学分解为输入无关和输入相关两部分,前者使用轻量网络从少量数据集学习,从而降低实时计算需求。此外,所提方法可以使用从低性能控制器收集的飞行日志进行训练,并且从真实数据学习到的动力学模型可用于仿真。我们进一步比较了基于MLP和LSTM的实现,在模型不确定性和外部干扰下,展示了所提方法的鲁棒性和有效性;特别是,带有LSTM植物动力学预测器的控制器相比基于MLP的对应物实现了更优性能,同时运行时也更低。

English abstract

Multirotors are widely used in applications ranging from surveillance to precision agriculture, yet conventional designs remain limited by their under-actuation. Tilt-rotor configurations overcome this limitation by enabling full actuation. This paper investigates neural-network-based control strategies for a fully actuated tilt-rotor system with four thrust-vectoring inputs. Our work is structured in two parts. First, we deliberately present a negative result by evaluating a direct input-output control approach. In this method, multilayer perceptrons (MLPs), long short-term memory (LSTM) networks, and transformer models are trained to map system states and their desired values directly to control signals. We show that this strategy fails to stabilize the system, highlighting the inherent difficulty of applying direct input-output learning to highly unstable plants. Second, as the main contribution, we propose a neural-network-enhanced sliding mode controller (SMC). The method decomposes the system dynamics into input-independent and input-dependent components, with the former learned from a small dataset using lightweight networks, thereby reducing real-time computational demands. Moreover, the proposed method can be trained using flight logs collected from low-performance controllers, and the resulting dynamic model learned from real-world data can be used in simulation. We further compare MLP- and LSTM-based implementations under model uncertainties and external disturbances, demonstrating the robustness and effectiveness of the proposed approach; in particular, the controller with the LSTM plant dynamics predictor achieves superior performance to its MLP-based counterpart while also exhibiting lower runtime.
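The decomposition into input-independent and input-dependent dynamics can be illustrated on a toy scalar plant x'' = f(x, x') + g*u, where an imperfect estimate f_hat stands in for the learned network and a sliding mode term absorbs the residual error. The plant, gains, and boundary-layer saturation below are assumptions chosen for the sketch and have nothing to do with the actual tilt-rotor model.

```python
def sat(v):
    """Saturation used in place of sign() to limit chattering."""
    return max(-1.0, min(1.0, v))


def smc_control(x, v, x_ref, f_hat, g=1.0, lam=2.0, k=5.0, phi=0.1):
    """Sliding mode law for a constant reference, with s = e_dot + lam * e."""
    e, e_dot = x - x_ref, v
    s = e_dot + lam * e
    return (-f_hat(x, v) - lam * e_dot - k * sat(s / phi)) / g


def simulate(steps=10000, dt=1e-3, x_ref=1.0):
    """Euler simulation of an open-loop unstable plant under the controller."""
    true_f = lambda x, v: 2.0 * x   # real input-independent dynamics (unstable)
    f_hat = lambda x, v: 1.8 * x    # imperfect stand-in for the learned model
    x, v = 0.0, 0.0
    for _ in range(steps):
        u = smc_control(x, v, x_ref, f_hat)
        a = true_f(x, v) + u        # input gain g = 1 is assumed known
        x, v = x + dt * v, v + dt * a
    return x


if __name__ == "__main__":
    print(simulate())
```

The switching gain `k` only needs to exceed the model error |f - f_hat|, so a more accurate learned model permits a smaller gain and less aggressive control.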

2606.08770 2026-06-09 cs.CL cs.AI cs.CV cs.LG cross-list

TeamHerald@CHIPSAL 2026: Hate Speech Detection and Sentiment Analysis of Nepali Memes using Transformer-based Architectures and Ensemble Learning

TeamHerald@CHIPSAL 2026:基于Transformer架构和集成学习的尼泊尔语模因仇恨言论检测与情感分析

Ashish Acharya, Anish Khatiwada, Rohit Khadka, Pragya Aryal

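Affiliations * Herald College Kathmandu

AI summary Tackles code-mixing and scarce resources in Nepali memes by extracting text with OCR and modeling it with Transformer-based models; Hard and Soft Voting ensembles behave differently across the binary and multi-class tasks, with Soft Voting yielding a 15.8% relative improvement in Macro F1-score on the three-class sentiment task.

Comments Accepted at the 2nd Workshop on Challenges in Processing South Asian Languages (CHiPSAL 2026) at LREC 2026

Details
AI Chinese abstract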

尼泊尔语互联网模因的分析因频繁的代码混合和缺乏已建立的基线资源而变得复杂。虽然模因本质上结合了视觉和文本元素,但本研究侧重于以文本为中心的方法,通过OCR层提取嵌入文本,并使用基于Transformer的架构进行建模。我们评估了六种不同的模型,并研究了硬投票和软投票集成策略在两项任务中的比较效果:二分类仇恨言论检测和三分类情感分析。实验结果表明,独立的仅解码器模型在二分类任务中取得了最高性能,而软投票集成在多类情感任务中表现最佳,相比最强的独立基线,Macro F1分数相对提升了15.8%。这些发现表明,集成策略在二分类和多类任务中表现不同,突出了选择适合分类目标的聚合方法的重要性。

English abstract

The analysis of internet memes in the Nepali language is complicated by frequent code-mixing and a lack of established baseline resources. While memes inherently combine visual and textual elements, this study focuses on a text-centric approach by extracting embedded text using an OCR layer and modeling it with Transformer-based architectures. We evaluate six distinct models and investigate the comparative effectiveness of Hard and Soft Voting ensemble strategies across two tasks: binary hate speech detection and three-class sentiment analysis. Experimental results show that a standalone decoder-only model achieved the highest performance for binary classification, whereas the Soft Voting ensemble performed best for the multi-class sentiment task, yielding a 15.8% relative improvement in Macro F1-score over the strongest standalone baseline. These findings suggest that ensemble strategies behave differently across binary and multi-class tasks, highlighting the importance of selecting aggregation methods suited to the classification objective.
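The difference between the two ensemble strategies is easy to see on a single example. In the sketch below, each row holds one model's class probabilities for the same input; the numbers are invented for illustration and do not come from the paper.

```python
from collections import Counter


def hard_vote(prob_rows):
    """Majority vote over each model's argmax class."""
    votes = [max(range(len(p)), key=p.__getitem__) for p in prob_rows]
    return Counter(votes).most_common(1)[0][0]


def soft_vote(prob_rows):
    """Argmax of the class probabilities averaged across models."""
    n_classes = len(prob_rows[0])
    avg = [sum(p[c] for p in prob_rows) / len(prob_rows) for c in range(n_classes)]
    return max(range(n_classes), key=avg.__getitem__)


if __name__ == "__main__":
    probs = [
        [0.40, 0.35, 0.25],  # model 1: weak preference for class 0
        [0.45, 0.40, 0.15],  # model 2: weak preference for class 0
        [0.05, 0.90, 0.05],  # model 3: strong preference for class 1
    ]
    print(hard_vote(probs), soft_vote(probs))
```

Hard voting follows the two uncertain models, while soft voting lets the confident model dominate. That difference could plausibly matter more when there are three classes and probabilities are spread more thinly.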

2606.08843 2026-06-09 cs.SD cs.LG cross-list

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

从A到B再回到A:基于非平行数据的回文零样本语音转换

Moshe Mandel, Shlomo E. Chazan

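Affiliations * Independent, Israel; OriginAI, Israel

AI summary Proposes aligning non-parallel speech through K-nearest-neighbor retrieval over WavLM representations to build synthetic training pairs, adding a speaker loss for zero-shot voice conversion; the model performs well across languages despite being trained only on English data.

Details
AI Chinese abstract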

我们提出一个语音转换(VC)框架,利用WavLM表示上的K近邻(KNN)检索来对齐非平行的源语音和目标语音,从而为监督学习构建合成训练对。检索到的片段作为合成输入,而真实目标音频提供真实输出,形成一种合成到真实的训练范式,该范式自然支持多语言数据,无需平行语料库或显式对齐。为了确保一致的目标说话人身份,我们引入了一个来自预训练说话人验证模型的说话人损失。跨多种语言的实验表明,尽管仅使用英语数据训练,所提出的方法实现了高自然度和强说话人相似性,优于有竞争力的VC基线。样本可在https://palindromic-vc.github.io获取。

English abstract

We present a voice conversion (VC) framework that utilizes K-Nearest Neighbors (KNN) retrieval over WavLM representations to align non-parallel source and target speech, constructing synthetic training pairs for supervised learning. The retrieved segments serve as synthetic inputs, while real target audio provides ground-truth outputs, forming a synthetic-to-real training paradigm that naturally supports multilingual data without requiring parallel corpora or explicit alignment. To ensure consistent target-speaker identity, we incorporate a speaker loss derived from a pretrained speaker verification model. Experiments across multiple languages demonstrate that the proposed approach achieves high naturalness and strong speaker similarity, outperforming competitive VC baselines, despite being trained exclusively on English data. Samples can be accessed at: https://palindromic-vc.github.io.
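The retrieval step can be sketched generically: for every frame of one utterance, find the k most similar frames in a pool from another speaker and average them. Random-free toy vectors stand in for WavLM features here, and the cosine similarity, the value of k, and the averaging are assumptions of this sketch rather than details confirmed by the abstract.

```python
import numpy as np


def knn_retrieve(query_feats, pool_feats, k=4):
    """Replace each query frame by the mean of its k nearest pool frames (cosine)."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    p = pool_feats / np.linalg.norm(pool_feats, axis=1, keepdims=True)
    sim = q @ p.T                               # (n_query, n_pool) similarities
    idx = np.argsort(-sim, axis=1)[:, :k]       # indices of the top-k pool frames
    return pool_feats[idx].mean(axis=1)         # (n_query, feature_dim)


if __name__ == "__main__":
    pool = np.array([[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]])
    query = np.array([[1.0, 0.0]])
    print(knn_retrieve(query, pool, k=2))
```

Frames retrieved this way keep the linguistic content of the query while taking on the characteristics of the pool speaker, which is what makes them usable as one half of a synthetic training pair.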

2606.08973 2026-06-09 q-bio.QM cs.LG 交叉投稿

A systematic investigation of molecular encoding methods for drug property predictions across neural network and Transformer encoder-based model

基于神经网络和Transformer编码器模型的药物性质预测分子编码方法的系统研究

Sheng-Ya Chen, Shan-Ju Yeh

发表机构 * School of Medicine, National Tsing Hua University(国立清华大学医学院) Institute of Bioinformatics and Structural Biology, National Tsing Hua University(国立清华大学生物信息学与结构生物学研究所) Department of Life Science, National Tsing Hua University(国立清华大学生命科学系) Interdisciplinary Program of Life Sciences and Medicine, National Tsing Hua University(国立清华大学生命科学与医学跨学科计划)

AI总结 系统研究不同分子编码方法对药物性质预测的影响,使用MLP和MLP+TL模型,发现MACCS和PubChem指纹结合注意力权重可识别关键化学基团,预测准确率平均AUC>0.9。

详情
AI中文摘要

关于不同分子编码方法如何影响分子性质预测的基础研究仍然相对有限。在本研究中,我们使用两种流行的结构设计:经典神经网络模型(MLP)和基于Transformer编码器的模型(MLP+TL),广泛考察了分子性质预测的最优分子编码方法。对于分子编码方法,我们研究了几种类型的指纹,包括传统拓扑指纹、基于子结构的指纹和基于字符串的表示。这两个模型在七个著名的分子数据集上进行了训练,以基于评估指标评估不同的输入分子编码方法。在几个生物学相关的分类任务中,包括毒性、致突变性和副作用预测,我们的模型一致地实现了平均AUC值超过0.9。我们没有依赖外部事后解释方法,如局部可解释模型无关解释(LIME)或深度SHAP(SHAP),而是利用模型内在的注意力权重作为内部可解释性信号来识别潜在重要特征。使用MACCS和PubChem作为输入的MLP+TL模型能够捕获决定主要血脑屏障(BBB)通透性和鼠伤寒沙门氏菌致突变性的化学可解释基团。特别是,吗啡和海洛因之间的比较突出了羟基相关子结构在BBB通透性预测中的作用,这一点在注意力权重中一致反映。总体而言,我们的发现为选择有效的分子编码方法提供了实用指导,并有助于开发用于药物发现的可解释分子信息学方法。

英文摘要

Fundamental investigations into how different molecular encoding methods affect molecular property prediction remain relatively limited. In this study, we extensively examined the optimal molecular encoding methods for molecular property prediction using two prevalent structure designs: a classical neural network model (MLP) and a Transformer encoder-based model (MLP+TL). For molecular encoding methods, we investigated several types of fingerprints, including traditional topological fingerprints, substructure-based fingerprints, and string-based representations. These two models were trained on seven well-known molecular datasets to evaluate different input molecular encoding methods based on evaluation metrics. On several biologically relevant classification tasks, including toxicity, mutagenicity, and side-effect prediction, our models consistently achieved average AUC values above 0.9. Rather than relying on external post-hoc explanation methods such as Local Interpretable Model-agnostic Explanations (LIME) or Deep SHapley Additive exPlanations (Deep SHAP), we leveraged the model's intrinsic attention weights as an internal interpretability signal for identifying potentially important features. The MLP+TL model using MACCS and PubChem fingerprints as input can capture chemically interpretable groups that are major determinants of blood-brain barrier (BBB) permeability and of mutagenicity in Salmonella typhimurium. In particular, a comparison between morphine and heroin highlighted the role of hydroxyl-related substructures in BBB permeability prediction, which was consistently reflected in the attention weights. Overall, our findings provide practical guidance for selecting effective molecular encoding methods and contribute to the development of interpretable molecular informatics approaches for drug discovery.

2606.08988 2026-06-09 cs.CL cs.LG 交叉投稿

Structure-Aware Modeling of Multiple-Choice Questions Improves Automatic Difficulty Estimation

选择题的结构感知建模改进自动难度估计

Gabriel Ortega, Abelino Jiménez, Séverin Lions, Pablo Dartnell

发表机构 * Centro de Investigación Avanzada en Educación (CIAE), Instituto de Estudios Avanzados en Educación (IE), Universidad de Chile(智利大学高级教育研究中心(CIAE),高级教育研究所(IE)) Departamento de Evaluación, Medición y Registro Educacional (DEMRE), Universidad de Chile(智利大学评估、测量与教育注册系(DEMRE)) Centro de Modelamiento Matemático (CMM), Universidad de Chile(智利大学数学建模中心(CMM)) Departamento de Ingeniería Matemática (DIM), Universidad de Chile(智利大学数学工程系(DIM))

AI总结 提出结构感知模型,将选择题的干扰项作为独立输入编码,通过顺序感知或顺序不变聚合提升难度预测,在自然科学和社科数据集上达到R²=0.83和0.71。

Comments 30 pages, 1 table, 2 figures

详情
AI中文摘要

自动题目难度估计(AQDE)在教育评估中日益重要,因为它有潜力产生与专家判断相竞争的难度估计,同时有助于减少与试点管理相关的时间和财务负担,并扩展到数字测试环境。先前的AQDE研究报告了关于将干扰项作为附加文本添加到题干和正确答案中是否能一致改进难度预测的混合证据。我们假设干扰项信息的有效性取决于其结构表示,并且明确将干扰项建模为独立组件可以改进忽略此信息的基线的难度估计。为此,我们设计了受控架构,将选择题组件建模为不同输入,以隔离干扰项内容和顺序的贡献。具体来说,我们通过将每个干扰项编码为独立的文本输入,并通过顺序感知的拼接(带位置标签)或顺序不变的求和来聚合其表示,从而表示干扰项。我们使用两个智利数据集(自然科学和社会科学,2016-2020年;4114道选择题)评估了这些架构。与仅使用题干和正确答案的简单模型相比,我们最佳的结构感知架构实现了更高的预测性能,自然科学题目的R²=0.83,社会科学题目的R²=0.71。一个顺序不变的变体以大约一半的参数达到了几乎相同的准确率,提供了有利的准确率-效率权衡。这些结果表明,结构信息(尤其是干扰项内容)驱动了预测准确性的提升,支持开发计算上可行的大规模教育应用的高效结构感知模型。

英文摘要

Automatic Question Difficulty Estimation (AQDE) holds growing promise for educational assessment because it has the potential to yield difficulty estimates that are competitive with expert judgment, while helping reduce the time and financial burden associated with pilot administrations and scaling to digital testing contexts. Prior AQDE studies report mixed evidence on whether adding distractors as additional text to the question stem and the correct key consistently improves difficulty prediction. We hypothesize that the effectiveness of distractor information depends on its structural representation, and that explicitly modeling distractors as separate components improves difficulty estimation over baselines that omit this information. To address this, we designed controlled architectures that model MCQ components as distinct inputs to isolate the contribution of distractor content and order. Specifically, we represented distractors by encoding each distractor as its own text input and aggregating their representations either with order-aware concatenation (with positional tags) or with an order-invariant summation. We evaluated these architectures using two Chilean datasets (Natural and Social Sciences, 2016-2020; 4,114 multiple-choice questions). Compared to a simpler model that only used the question stem and the key, our best distractor-aware architecture achieved higher predictive performance, reaching R^2 = 0.83 for Natural Sciences and R^2 = 0.71 for Social Sciences items. An order-invariant variant achieved nearly the same accuracy with approximately half as many parameters, offering a favorable accuracy-efficiency trade-off. These results show that structural information (especially distractor content) drives gains in predictive accuracy, supporting the development of efficient, structure-aware models that are computationally viable for large-scale educational applications.
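
The two aggregation schemes contrasted in this abstract can be sketched as follows. The one-hot position tag is an assumed stand-in for the paper's positional tags, and the 2-D vectors are hypothetical distractor encodings.

```python
def concat_with_position(distractor_vecs):
    """Order-aware: prepend a one-hot position tag to each vector, then concatenate."""
    n = len(distractor_vecs)
    out = []
    for i, vec in enumerate(distractor_vecs):
        tag = [1.0 if j == i else 0.0 for j in range(n)]
        out.extend(tag + list(vec))
    return out

def sum_pool(distractor_vecs):
    """Order-invariant: element-wise sum, so shuffling the distractors changes nothing."""
    return [sum(col) for col in zip(*distractor_vecs)]

# Hypothetical encodings of three distractors, and the same set reordered.
d = [[0.5, 1.0], [2.0, 0.0], [1.0, 3.0]]
shuffled = [d[2], d[0], d[1]]

print(sum_pool(d) == sum_pool(shuffled))                          # True
print(concat_with_position(d) == concat_with_position(shuffled))  # False
print(len(concat_with_position(d)), len(sum_pool(d)))             # 15 2
```

The summed representation is also much shorter than the concatenated one, so whatever layer consumes it needs fewer weights; that is one plausible route to the smaller parameter count the abstract reports, though the abstract does not spell out where the savings come from.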

2606.09108 2026-06-09 cs.RO cs.LG 交叉投稿

RAM: Reachability Across Morphologies

RAM: 跨形态可达性

Tim Walter, Xinyu Chen, Jonathan Külz, Matthias Althoff

发表机构 * Department of Computer Engineering(计算机工程系) German Electron Synchrotron Technical University(德国电子同步加速器技术大学) Technical University Munich(慕尼黑技术大学) University of Hamburg(汉堡大学)

AI总结 提出一种形态条件隐式神经表示RAM,快速、可微地预测可达性并泛化至未见形态,基于前向运动学生成大规模数据集训练,在纳秒级推理中F1达86%,显著加速形态和轨迹优化。

Comments 22 pages, 11 figures

详情
AI中文摘要

机器人生命周期的许多阶段,从形态合成到操作,都从根本上依赖于可达工作空间。然而,当前用于近似工作空间的方法要么速度慢、精度低,要么局限于单一形态。我们提出了跨形态可达性(RAM):一种形态条件的隐式神经表示,作为位姿可达性的快速、可微替代,能够泛化到未见形态,同时固有地考虑自碰撞。为了训练RAM,我们发布了一个大规模数据集,包含仅由正向运动学生成的$3\cdot10^{10}$个样本。实验表明,我们的模型在纳秒级推理时达到了$86\%$的$F_1$分数,比基线高出$14\%$,同时推理时间减少了三个数量级。我们进一步展示了在基于梯度的形态优化和轨迹优化中分别加速一个和两个数量级。

英文摘要

Many stages of the robotic lifecycle, from morphology synthesis to operation, rely fundamentally on the reachable workspace. However, current methods for approximating workspaces are slow, imprecise, or tied to a single morphology. We introduce Reachability Across Morphologies (RAM): a morphology-conditioned, implicit neural representation that acts as a fast, differentiable surrogate for pose reachability, generalising to unseen morphologies while inherently accounting for self-collisions. To train RAM, we publish a large-scale dataset of $3\cdot10^{10}$ samples generated solely from forward kinematics. Experiments show that our model achieves an $F_1$-score of $86\%$ at nanosecond inference, outperforming the baseline by $14\%$ while reducing inference time by three orders of magnitude. We further demonstrate speed-ups of one and two orders of magnitude for gradient-based morphology and trajectory optimisation, respectively. Website: https://timwalter.github.io/ram.
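
Generating reachability labels from forward kinematics alone can be illustrated on a toy case. The sketch below uses a planar two-link arm, position only, with no self-collision check and a coarse grid; all of these are simplifying assumptions and do not reflect the paper's dataset, which covers full poses across many morphologies.

```python
import math
import random

def forward_kinematics(l1, l2, q1, q2):
    """End-effector (x, y) of a planar two-link arm with joint angles q1, q2."""
    x = l1 * math.cos(q1) + l2 * math.cos(q1 + q2)
    y = l1 * math.sin(q1) + l2 * math.sin(q1 + q2)
    return x, y

def sample_reachable_cells(l1, l2, n_samples=5000, cell=0.1, seed=0):
    """Mark grid cells hit by at least one sampled configuration as reachable."""
    rng = random.Random(seed)
    cells = set()
    for _ in range(n_samples):
        q1 = rng.uniform(-math.pi, math.pi)
        q2 = rng.uniform(-math.pi, math.pi)
        x, y = forward_kinematics(l1, l2, q1, q2)
        cells.add((math.floor(x / cell), math.floor(y / cell)))
    return cells

# The link lengths play the role of the morphology description.
cells = sample_reachable_cells(l1=1.0, l2=0.5)
print(len(cells), "reachable cells for this morphology")
```

Repeating the sampling over many link-length pairs yields (morphology, position, reachable) examples of the kind a morphology-conditioned model could be trained on; no inverse kinematics solver is needed at any point.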

2606.09109 2026-06-09 cs.CV cs.IR cs.LG 交叉投稿

Driving Video Retrieval for Complex Queries with Structured Grounding

面向复杂查询的驾驶视频检索与结构化对齐

Manyi Yao, Sparsh Garg, Christian Shelton, Amit Roy-Chowdhury, Abhishek Aich

发表机构 * NEC Laboratories, America(美国NEC实验室) University of California, Riverside(加州大学河滨分校)

AI总结 提出STRIVE-D框架,通过弱监督领域视频校准规则、融合视觉语言与关键词检索信号,在驾驶视频检索中实现高达84%的top-1准确率提升。

详情
AI中文摘要

大规模视频检索是自动驾驶中数据整理和安全验证的核心,用户不仅希望找到场景,还希望找到诸如切入和急刹车等动态事件。现有的视觉语言和基于关键词的检索方法常常遗漏这些事件,因为相关的运动可能没有在文本中明确描述或通过词汇重叠捕获。基于规则的检索可以更直接地编码此类事件,但它是脆弱的:生成的或手工编写的规则在假设与真实驾驶数据不匹配时常常失败。我们提出了STRIVE-D,一种针对驾驶视频的数据校准检索框架。它使用弱标记的领域内视频来估计查询规则何时可靠,调整与观测数据不匹配的规则,并将校准后的规则分数与视觉语言和基于关键词的检索信号融合。在三个驾驶基准测试中,包括新发布的DrivingDojo上的人工标注事件数据,STRIVE-D相对于最先进方法在top-1准确率上实现了高达84%的相对改进。

英文摘要

Video retrieval at scale is central to data curation and safety validation in autonomous driving, where users want to find not only scenes but also dynamic events such as cut-ins and hard braking. Existing vision-language and keyword-based retrieval methods often miss these events because the relevant motion may not be explicitly described in text or captured by lexical overlap. Rule-based retrieval can encode such events more directly, but it is brittle: generated or hand-written rules often fail when their assumptions do not match real driving data. We propose STRIVE-D, a data-calibrated retrieval framework for driving videos. It uses weakly labeled in-domain videos to estimate when a query rule is reliable, adapt rules that mismatch observed data, and fuse calibrated rule scores with vision-language and keyword-based retrieval signals. Across three driving benchmarks, including newly released human-annotated event data on DrivingDojo, STRIVE-D delivers up to 84% relative improvement in top-1 accuracy over state-of-the-art methods.

2606.09271 2026-06-09 cs.SD cs.LG 交叉投稿

Multi-View Speech Representation Learning for Parkinson's Disease Detection Using Context-guided Cross-modal Attention

基于上下文引导跨模态注意力的多视角语音表示学习用于帕金森病检测

George Theodosiou, Loukas Ilias, Dimitris Askounis

发表机构 * National Technical University of Athens(雅典国家技术大学)

AI总结 提出多分支深度学习框架,融合Log-Mel谱图、MFCC和HuBERT嵌入三种互补语音模态,通过上下文引导跨模态注意力机制动态加权,在PC-GITA语料库上实现91.51%准确率和95.97% AUC,验证了异质语音建模对帕金森病检测的有效性。

详情
AI中文摘要

帕金森病(PD)是一种进行性神经退行性疾病,常导致与运动功能减退性构音障碍相关的言语障碍。由于言语产生依赖于复杂神经肌肉机制的精确协调,语音分析已成为早期PD检测中一种有前景的非侵入性、成本效益高的生物标志物。最近的深度学习方法显示出令人鼓舞的结果;然而,大多数现有方法依赖单一语音表示,可能忽略跨不同特征空间编码的互补病理信息。在这项工作中,我们提出了一种多分支深度学习框架,用于从语音中自动检测PD。每个录音被分割成5秒的片段,并使用三种互补模态表示:Log-Mel谱图、MFCC和从原始波形中提取的HuBERT嵌入。谱图使用预训练的ResNet-18编码器处理,MFCC序列通过BiLSTM网络建模,原始语音使用预训练的HuBERT模型编码。为了有效整合这些异质表示,我们引入了一种上下文引导的跨模态注意力机制,该机制根据来自谱图和MFCC分支的全局声学上下文动态加权时间HuBERT嵌入。在公开的西班牙语PC-GITA语料库上,在严格的说话人独立5折交叉验证下进行的实验证明了所提出方法的有效性。所提出的架构实现了91.51%的准确率、91.24%的F1分数和95.97%的AUC。此外,消融研究证实了所提出的上下文引导跨模态注意力机制以及互补语音表示整合的贡献。这些发现突显了异质语音建模在稳健且临床可靠的PD检测中的潜力。

英文摘要

Parkinson's disease (PD) is a progressive neurodegenerative disorder that frequently causes speech impairments associated with hypokinetic dysarthria. As speech production relies on the precise coordination of complex neuromuscular mechanisms, speech analysis has emerged as a promising non-invasive and cost-effective biomarker for early PD detection. Recent deep learning approaches have shown encouraging results; however, most existing methods rely on a single speech representation, potentially overlooking complementary pathological information encoded across different feature spaces. In this work, we propose a multi-branch deep learning framework for automatic PD detection from speech. Each recording is segmented into 5-second chunks and represented using three complementary modalities: Log-Mel spectrograms, MFCCs, and HuBERT embeddings extracted from raw waveforms. The spectrograms are processed using a pre-trained ResNet-18 encoder, MFCC sequences are modeled through a BiLSTM network, and raw speech is encoded using a pre-trained HuBERT model. To effectively integrate these heterogeneous representations, we introduce a context-guided cross-modal attention mechanism that dynamically weights temporal HuBERT embeddings according to the global acoustic context derived from the spectrogram and MFCC branches. Experiments conducted on the publicly available Spanish PC-GITA corpus under strict speaker-independent 5-fold cross-validation demonstrate the effectiveness of the proposed approach. The proposed architecture achieves an accuracy of 91.51%, an F1-score of 91.24%, and an AUC of 95.97%. Furthermore, ablation studies confirm the contribution of both the proposed context-guided cross-modal attention mechanism and the integration of complementary speech representations. These findings highlight the potential of heterogeneous speech modeling for robust and clinically reliable PD detection.
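
One generic way to weight temporal embeddings by a global context vector is sketched below. The dot-product scoring and softmax weighting are assumptions chosen for illustration; the abstract does not give the exact form of the context-guided cross-modal attention, and `context_guided_pool` is a hypothetical name.

```python
import math

def context_guided_pool(frames, context):
    """Softmax-weight each time step by its dot product with a context vector."""
    scores = [sum(f * c for f, c in zip(frame, context)) for frame in frames]
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    dim = len(frames[0])
    pooled = [sum(w * frame[d] for w, frame in zip(weights, frames)) for d in range(dim)]
    return weights, pooled

# Toy temporal embeddings (three time steps) and a context vector that, in the
# paper's setting, would be derived from the spectrogram and MFCC branches.
frames = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
context = [2.0, 0.0]

weights, pooled = context_guided_pool(frames, context)
print(weights)
print(pooled)
```

Time steps that agree with the context receive larger weights, so the pooled vector is dominated by the frames the other branches consider informative.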

2606.09362 2026-06-09 cs.CV cs.LG 交叉投稿

Zero-Shot Semantic Re-Identification for Autonomous Driving: A VLM Baseline Study

零样本语义重识别用于自动驾驶:一项VLM基线研究

Eduardo Borges, Manuel Abreu, Luís Garrote, Urbano J. Nunes

发表机构 * Autonomous Mobile Robot(自主移动机器人) University of Minho(米尼奥大学)

AI总结 提出使用视觉-语言模型生成语义描述进行零样本重识别,在自动驾驶场景中实现与监督CNN基线相当的检索性能,并增强可解释性。

Comments 7 pages

详情
AI中文摘要

自动驾驶中的重识别通常被表述为一个视觉匹配问题,其中车辆、行人和骑自行车者的观测通过学习的外观嵌入在时间、帧或相机视图之间进行关联,通常辅以运动、几何或多模态线索。然而,纯视觉表示可能对视角、遮挡、光照和传感器域变化敏感,限制了其在复杂驾驶场景中的可解释性和鲁棒性。我们提出了一项零样本管道的基线研究,使用视觉-语言模型生成检测到的交通参与者的文本描述,并评估这些描述是否能够支持跨观测的身份匹配。该公式不仅依赖低层次视觉相似性,而是通过结构化语义属性表示每个对象,包括类别、颜色、形状、姿态、可见部分、空间上下文和独特的视觉线索。本研究为自动驾驶场景中基于语言的重识别提供了初始基准,讨论并评估了当前VLM在此任务中的优势和局限性。结果表明,零样本语义描述可以支持有效的对象重识别,实现与监督CNN基线相当的检索性能,同时通过显式身份线索提供更大的可解释性。然而,实验也揭示了重要挑战,包括跨视角的属性不一致以及视觉相似实例之间的细粒度区分有限。

英文摘要

Re-Identification (ReID) in autonomous driving is typically formulated as a visual matching problem, where observations of vehicles, pedestrians, and cyclists are associated across time, frames, or camera views using learned appearance embeddings, often complemented by motion, geometric, or multimodal cues. However, purely visual representations may be sensitive to viewpoint, occlusion, illumination, and sensor-domain variations, limiting their interpretability and robustness in complex driving scenes. We propose a baseline study of a zero-shot pipeline using Vision-Language Models (VLMs) to generate textual descriptions of detected traffic participants and evaluate whether these descriptions can support identity matching across observations. Instead of relying only on low-level visual similarity, the proposed formulation represents each object through structured semantic attributes, including category, color, shape, pose, visible parts, spatial context, and distinctive visual cues. This study provides an initial benchmark for language-based re-identification in autonomous-driving scenarios, discussing and evaluating the strengths and limitations of current VLMs for this task. Results demonstrate that zero-shot semantic descriptions can support effective object re-identification, achieving retrieval performance comparable to a supervised CNN baseline while offering greater interpretability through explicit identity cues. However, the experiments also reveal important challenges, including attribute inconsistency across viewpoints and limited fine-grained discrimination between visually similar instances.

2606.09451 2026-06-09 cs.RO cs.CV cs.LG 交叉投稿

Dense Force Estimation with an Event-based Optical Tactile Sensor

基于事件的光学触觉传感器的稠密力估计

Agis Politis, René Zurbrügg, Valentina Cavinato

发表机构 * Sony Advanced Visual Sensing, Zurich, Switzerland(索尼高级视觉传感公司,苏黎世,瑞士) ETH Zürich(苏黎世联邦理工学院)

AI总结 提出首个利用事件相机重建稠密3D力场的方法,通过事件数据估计表面位移并映射为力,平均误差(0.14N,0.10N,0.93N),工作频率100Hz。

详情
AI中文摘要

人类依赖空间稠密、几何和力感知的触觉反馈以高时间分辨率进行灵巧操作。虽然基于视觉的触觉传感器能够实现稠密力估计,但受限于相机帧率、运动模糊和数据带宽。基于事件的光学触觉传感器具有微秒级时间分辨率和低运动模糊的优点,但现有方法仅限于预测净力。我们提出了首个利用基于事件的光学触觉传感器进行稠密3D力场重建的框架。我们的方法从事件数据估计3D表面位移,并通过逆有限元方法(iFEM)将其映射为力。剪切位移通过所提出的事件标记跟踪算法恢复,而法向位移则由卷积神经网络预测,该网络在收集的同步力-位移-事件数据集上训练。实验表明,该方法能够准确重建物理力,在力范围高达(4N,4N,20N)时,平均绝对误差为(0.14N,0.10N,0.93N),同时以平均100Hz的频率运行。这项工作为在机器人抓取和灵巧操作中实现高频控制的稠密力反馈迈出了第一步。

英文摘要

Humans rely on spatially dense, geometry- and force-aware tactile feedback at high temporal resolution for dexterous manipulation. While vision-based tactile sensors enable dense force estimation, they are limited by camera frame rates, motion blur, and data bandwidth. Event-based optical tactile sensors offer an attractive alternative with microsecond temporal resolution and low motion blur, but existing methods are restricted to predicting only net forces. We introduce the first framework for dense 3D force field reconstruction using event-based optical tactile sensors. Our approach estimates 3D surface displacements from event data and maps them to forces via the inverse Finite Element Method (iFEM). Shear displacements are recovered through the proposed event-based marker tracking algorithm, while normal displacements are predicted by a convolutional neural network trained on a collected dataset of synchronized force-displacement-event data. Experiments demonstrate accurate reconstruction of physically grounded forces, achieving a mean absolute error of (0.14 N, 0.10 N, 0.93 N) over force ranges up to (4 N, 4 N, 20 N), while operating at an average of 100 Hz. This work constitutes a first step toward enabling dense force feedback for high-frequency control in robotic grasping and dexterous manipulation.

2606.09541 2026-06-09 physics.app-ph cs.LG 交叉投稿

Automating the Expert Eye: A System-Agnostic Deep Learning Framework for Rare Event Discovery in Imbalanced Force Spectroscopy

自动化专家之眼:用于类别不平衡力谱中稀有事件发现的系统无关深度学习框架

Jorge Rodriguez-Ramos

发表机构 * Independent Researcher(独立研究者) Marseille, France(法国马赛)

AI总结 提出一种系统无关的可解释深度学习框架,利用1D到2D光栅化几何矩阵和修改的ResNet18架构,结合非对称Focal Loss,在极端类别不平衡的力谱数据中实现高召回率(0.9231),并通过双阈值分诊系统减少90%以上人工审核工作量。

Comments 13 pages, 2 figures, 2 tables

详情
AI中文摘要

单分子力谱(SMFS)为生物分子力学提供了前所未有的见解,然而高通量生成的力-延伸轨迹造成了严重的数据筛选瓶颈。在数千条噪声主导的曲线中识别罕见的分子解绑事件传统上依赖于繁琐、不可扩展的人工审核。在这里,我们提出了一个系统无关、可解释的深度学习框架,专门用于克服自动SMFS分诊中的极端类别不平衡。利用1D到2D光栅化几何矩阵,我们部署了由非对称Focal Loss目标函数控制的修改版ResNet18架构。我们在R. champanellensis纤维小体的复杂机械解折叠路径上评估了该框架。在超不平衡测试条件下,目标相互作用仅占数据集的1.34%(970条轨迹中13个真实事件),模型实现了0.9196的整体准确率和0.9231的惊人真阳性率(召回率)。通过实施经验校准的双阈值分诊系统,该流程自动丢弃了880条明确的背景噪声轨迹,将人工审核工作量减少超过90%,同时安全地保留了高价值的稀有数据。最后,梯度加权类激活映射(Grad-CAM)可视化验证了网络的决策牢固地基于力曲线的相关几何特征,特别是定位于结构解绑区域,有效缓解了“黑箱”质疑。该开源工具专为免费云端执行而构建,使生物物理学社区能够民主化地实现可扩展、高精度的分子发现。

英文摘要

Single-Molecule Force Spectroscopy (SMFS) provides unprecedented insights into biomolecular mechanics, yet the high-throughput generation of force-extension trajectories creates a severe data curation bottleneck. Identifying rare molecular unbinding events within thousands of noise-dominated curves traditionally relies on tedious, non-scalable manual auditing. Here, we present a system-agnostic, interpretable deep learning framework tailored to overcome extreme class imbalance in automated SMFS triage. Utilizing 1D-to-2D rasterized geometric matrices, we deployed a modified ResNet18 architecture governed by an asymmetric Focal Loss objective function. We evaluated this framework on the complex mechanical unfolding pathways of the R. champanellensis cellulosome. Under hyper-imbalanced test conditions where the target interaction constituted only 1.34% of the dataset (13 true events out of 970 traces), the model achieved an overall accuracy of 0.9196 and a remarkable True Positive Rate (Recall) of 0.9231. By implementing an empirically calibrated dual-threshold triage system, the pipeline automatically discarded 880 unambiguous background noise traces, reducing the manual curation workload by over 90% while safely preserving high-value rare data. Finally, Gradient-weighted Class Activation Mapping (Grad-CAM) visually validated that the network's decisions are firmly anchored in the relevant geometric features of the force curves, specifically localizing on the structural unbinding regions, effectively mitigating 'black-box' skepticism. Built for free cloud-based execution, this open-source tool democratizes scalable, highly precise molecular discovery across the biophysics community.
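
The two mechanisms named in this abstract, an asymmetric focal loss and a dual-threshold triage, can be sketched in a few lines. Every hyperparameter below (`gamma_pos`, `gamma_neg`, `alpha`, `low`, `high`) is an illustrative assumption and not a value taken from the paper; the last line only re-derives the percentages quoted in the abstract.

```python
import math

def asymmetric_focal_loss(p, y, gamma_pos=0.0, gamma_neg=2.0, alpha=0.75):
    """Binary focal loss with separate focusing exponents for the two classes.

    p is the predicted probability of the rare (positive) class, y is 0 or 1.
    The default hyperparameters are illustrative, not the paper's values.
    """
    eps = 1e-12
    if y == 1:
        return -alpha * (1.0 - p) ** gamma_pos * math.log(p + eps)
    return -(1.0 - alpha) * p ** gamma_neg * math.log(1.0 - p + eps)

def triage(p, low=0.1, high=0.9):
    """Dual-threshold routing: auto-discard, auto-keep, or send to a human."""
    if p < low:
        return "discard"
    if p >= high:
        return "keep"
    return "review"

# An easy negative contributes far less loss than a missed positive.
print(asymmetric_focal_loss(0.05, 0))
print(asymmetric_focal_loss(0.05, 1))
print([triage(p) for p in (0.02, 0.5, 0.97)])

# Figures quoted in the abstract: prevalence and share of traces auto-discarded.
print(round(13 / 970 * 100, 2), round(880 / 970 * 100, 1))  # 1.34 90.7
```

Down-weighting easy negatives more aggressively than positives keeps the abundant background traces from swamping the gradient, while the middle band of the triage is the only part that still needs manual review.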

2606.09558 2026-06-09 q-bio.GN cs.LG 交叉投稿

Integrating gene regulatory priors into Transformer attention with scTransformer for interpretable scRNA-seq analysis

将基因调控先验知识整合到Transformer注意力中:scTransformer用于可解释的单细胞RNA-seq分析

Mikele Milia, Louis Fabrice Tshimanga, Henning Mueller, Manfredo Atzori, Barbara Di Camillo

发表机构 * Department of Information Engineering, University of Padova(信息工程系,帕多瓦大学) Department of Neuroscience, University of Padova(神经科学系,帕多瓦大学) Padova Neuroscience Center(帕多瓦神经科学中心) Information Systems Institute, University of Applied Sciences Western Switzerland, HES-SO Valais(应用科学西瑞士信息系统研究所,HES-SO瓦莱大学) Department of Comparative Biomedicine and Food Science, University of Padova(比较生物医学与食品科学系,帕多瓦大学) Padua Center for Network Medicine, University of Padova(帕维亚网络医学中心,帕多瓦大学)

AI总结 提出scTransformer,首次将基因调控先验知识嵌入Transformer注意力机制,通过约束信息流学习生物有意义的表示,在疾病相关单核RNA-seq数据上提升分类精度和细胞类型分离,注意力模式与已知调控程序一致。

详情
AI中文摘要

动机:基于Transformer的模型越来越多地应用于大规模单细胞转录组学,通过自监督学习在数百万个细胞上展现出强大性能。然而,大多数现有方法将基因视为独立特征,很大程度上忽略了先验生物学知识,这限制了可解释性和鲁棒性。在本文中,我们探讨了显式整合基因调控信息是否能同时提升模型性能和生物学洞察。结果:我们提出了scTransformer,这是第一个将生物机制的先验知识构建到模型注意力模式中的基于Transformer的方法。通过根据已知调控结构约束信息流,模型学习到更具生物学意义的表示。我们使用监督细胞类型分类在疾病相关的单核RNA-seq数据集上评估scTransformer。与标准Transformer相比,我们的方法提高了分类准确性,增强了嵌入空间中细胞类型的分离,并产生了与已知调控程序一致的注意力模式。总体而言,我们的结果表明,将生物结构嵌入Transformer模型可以在不牺牲性能的情况下增强可解释性,为单细胞组学的生物学基础模型迈出了原则性的一步。

英文摘要

Motivation: Transformer-based models are increasingly applied to large-scale single-cell transcriptomics, showing strong performance through self-supervised learning on millions of cells. However, most existing approaches treat genes as independent features, and largely ignore prior biological knowledge, which limits interpretability and robustness. In this paper, we explore whether explicitly incorporating gene regulatory information can improve both model performance and biological insight. Results: We present scTransformer, the first Transformer-based approach that builds a priori knowledge of biological mechanisms into the model's attention patterns. By constraining information flow according to known regulatory structures, the model learns representations that are more biologically meaningful. We evaluate scTransformer on a disease-relevant single-nucleus RNA-seq dataset using supervised cell-type classification. Compared to standard Transformers, our approach improves classification accuracy, enhances separation of cell types in embedding space, and produces attention patterns consistent with known regulatory programs. Overall, our results demonstrate that embedding biological structure into Transformer models can enhance interpretability without sacrificing performance, offering a principled step toward biologically grounded foundation models for single-cell omics.
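
Constraining attention with a prior graph can be illustrated by a masked softmax. The boolean mask below is a hypothetical three-gene regulatory prior with self-loops; how scTransformer actually encodes and applies its priors is not described in the abstract, so this is a generic sketch of the idea only.

```python
import math

def masked_attention_weights(scores, allowed):
    """Row-wise softmax over scores, restricted to positions where allowed is True.

    Assumes every row allows at least one position (for example, a self-loop).
    """
    weights = []
    for row_scores, row_allowed in zip(scores, allowed):
        kept = [s for s, ok in zip(row_scores, row_allowed) if ok]
        m = max(kept)  # subtract the max for numerical stability
        exps = [math.exp(s - m) if ok else 0.0
                for s, ok in zip(row_scores, row_allowed)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Hypothetical prior: gene 1 may attend to its regulator gene 0;
# every gene may attend to itself; all other pairs are blocked.
allowed = [
    [True, False, False],
    [True, True, False],
    [False, False, True],
]
scores = [
    [0.2, 1.5, -0.3],
    [1.0, 1.0, 4.0],
    [0.1, 0.7, 0.0],
]
w = masked_attention_weights(scores, allowed)
print(w)  # [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 0.0, 1.0]]
```

In the second row the largest raw score (4.0) is ignored because the prior forbids that gene pair, which is exactly how the mask restricts information flow to known regulatory links.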

2606.09630 2026-06-09 cs.RO cs.AI cs.LG 交叉投稿

ReCoVLA: VLM-Guided Reward Compilation for Failure Recovery in Vision-Language-Action Policies

ReCoVLA: VLM引导的奖励编译用于视觉-语言-动作策略的故障恢复

Haodi Hu, Chung-Ta Huang, Jing Liu, Ye Wang, Kei Suzuki, Matthew Brand, Toshiaki Koike-Akino

发表机构 * University of Southern California(南加州大学) Mitsubishi Electric Research Laboratories (MERL)(三菱电机研究实验室) Harvard University(哈佛大学)

AI总结 提出ReCoVLA框架,通过冻结预训练VLA策略,利用外部VLM推断故障模式并编译结构化奖励,训练残差恢复策略,实现零样本仿真到真实部署,在多种操作任务中提升成功率。

Comments 19 pages, 7 figures

详情
AI中文摘要

视觉-语言-动作(VLA)策略为语言条件操作提供了强大的先验知识,但在需要针对性恢复的非标称状态下仍然脆弱。我们提出ReCoVLA——一种故障条件的残差恢复框架,它保持预训练的VLA策略冻结,使用外部视觉-语言模型(VLM)推断故障模式和恢复阶段,并从任务相关组件编译结构化奖励。ReCoVLA并非使用VLM直接生成动作或奖励,而是将其作为语义奖励选择器:它预测恢复描述符和奖励掩码,用于仿真中的残差策略训练,随后将训练好的恢复策略零样本部署到真实世界。这解耦了高层故障理解与低层纠正控制,以支持不同的VLA。在短时域、长时域和接触丰富的操作任务上的实验表明,ReCoVLA在平均性能上优于测试的基线。在仿真中,我们的奖励编译器将微调$π_{0.5}$基线的平均成功率从36.7%提升到66.7%。在物理零样本仿真到真实实验中,ReCoVLA取得了最佳平均性能,成功率为61.7%。

英文摘要

Vision-language-action (VLA) policies provide strong priors for language-conditioned manipulation, but remain brittle in off-nominal states requiring targeted recovery. We propose ReCoVLA -- a failure-conditioned residual recovery framework that keeps a pretrained VLA policy frozen, uses an external vision-language model (VLM) to infer the failure mode and recovery stage, and compiles a structured reward from task-relevant components. Rather than using the VLM to generate actions or rewards directly, ReCoVLA uses it as a semantic reward selector: it predicts a recovery descriptor and reward mask for in-simulation residual-policy training, followed by zero-shot sim-to-real deployment of the trained recovery policies. This decouples high-level failure understanding from low-level corrective control to support different VLAs. Experiments across short-horizon, long-horizon, and contact-rich manipulation tasks show that ReCoVLA outperforms the tested baselines on average. In simulation, our reward compiler improves average success from 36.7% for the fine-tuned $π_{0.5}$ baseline to 66.7%. In physical zero-shot sim-to-real experiments, ReCoVLA achieves the best average performance, with 61.7% success.

2606.09749 2026-06-09 cs.RO cs.LG 交叉投稿

Your Model Already Knows: Attention-Guided Safety Filter for Vision-Language-Action Models

你的模型已经知道:面向视觉-语言-动作模型的注意力引导安全过滤器

Seongbin Park, Fan Zhang, Baharan Mirzasoleiman, Shahriar Talebi, Nader Sehatbakhsh

发表机构 * University of California Los Angeles(加州大学洛杉矶分校)

AI总结 本文发现VLA模型中的少数注意力头能可靠定位目标物体,利用这一特性提出无需训练的安全框架,结合控制障碍函数和实时目标跟踪器,实现动态障碍物下的碰撞避免,在动态场景中性能提升43%。

Comments Under review

详情
AI中文摘要

视觉-语言-动作(VLA)模型在多种机器人操作任务中展现了令人印象深刻的端到端性能。然而,这些策略无法保证避免与场景中任务无关的物体发生碰撞。现有的安全过滤器通过查询视觉-语言模型(VLM)来识别障碍物及其位置,从而回避了这个问题。但这在控制循环中运行速度太慢,只能在情节初始化时调用,使得过滤器无法跟踪移动障碍物。我们发现,VLA模型中的少数注意力头能够可靠地定位策略意图接近的目标物体。这些注意力头可以在一个无需训练的安全框架中利用,该框架每一步从注意力头获取活动目标,将场景其余部分视为障碍物,并将其输入控制障碍函数(CBF)过滤器。结合轻量级实时目标跟踪器,这允许对非静态障碍物进行碰撞避免。我们在SafeLIBERO上评估了我们的框架,并扩展了移动障碍物。在原始静态基准测试中,我们的方法性能与使用特权模拟器状态识别目标(模拟在情节初始化时运行一次的基于VLM的识别步骤)的oracle相当。在动态变体中,oracle的初始目标分配变得过时,我们的方法平均优于它43%。我们的发现表明,实时安全过滤所需的感知信号已经存在于VLA策略中,并且可以在无需额外训练或重型辅助模型的情况下加以利用。

英文摘要

Vision-Language-Action (VLA) models have demonstrated impressive end-to-end performance across a variety of robotic manipulation tasks. However, these policies offer no guarantees against collisions with task-irrelevant objects in the scene. Existing safety filters sidestep this problem by querying a vision-language model (VLM) to identify obstacles and their locations. This, however, is too slow to run in the control loop and can only be invoked at episode initialization, leaving the filter unable to track moving obstacles. We discover that a small number of attention heads within a VLA model reliably localize the object the policy intends to approach. These heads can be exploited within a training-free safety framework that obtains the active target from the attention heads at every step, treats the remainder of the scene as obstacles, and feeds these into a Control Barrier Function (CBF) filter. Together with a lightweight real-time object tracker, this allows for collision avoidance for non-static obstacles. We evaluate our framework on SafeLIBERO, which we extend with moving obstacles. On the original static benchmark, our method performs comparably to an oracle that uses privileged simulator state to identify the target, emulating a VLM-based identification step run once at episode initialization. On the dynamic variant, where the oracle's init-time target assignment becomes stale, our method outperforms it by 43% on average. Our findings suggest that the perceptual signals needed for real-time safety filtering are already present within VLA policies and can be exploited without additional training or heavy auxiliary models.
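
The CBF filtering step can be illustrated for the simplest possible case: a single-integrator point (velocity command `u` moves position `x` directly) and one circular obstacle. Under those assumptions, with barrier $h(x) = \lVert x - o\rVert^2 - r^2$ and constraint $\nabla h \cdot u + \alpha h \ge 0$, the minimally invasive correction has a closed form (projection onto a half-space). The dynamics, barrier choice, and `alpha` are illustrative assumptions, not the paper's configuration.

```python
def cbf_filter(x, u_nom, obstacle, radius, alpha=1.0):
    """Return the command closest to u_nom that satisfies grad_h . u + alpha * h >= 0.

    Assumes single-integrator dynamics and that x is not at the obstacle centre.
    """
    a = [2.0 * (xi - oi) for xi, oi in zip(x, obstacle)]  # gradient of h
    h = sum((xi - oi) ** 2 for xi, oi in zip(x, obstacle)) - radius ** 2
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal command is already safe
    scale = slack / sum(ai * ai for ai in a)
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]

# The policy wants to drive straight at an obstacle one unit ahead.
x = [0.0, 0.0]
obstacle = [1.0, 0.0]
print(cbf_filter(x, [1.0, 0.0], obstacle, radius=0.5))   # [0.375, 0.0]
print(cbf_filter(x, [-1.0, 0.0], obstacle, radius=0.5))  # [-1.0, 0.0]
```

The filter only intervenes when the nominal command would violate the barrier condition, which is why identifying the target correctly matters: an object wrongly treated as an obstacle would block the very approach the policy intends.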

2606.09767 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Data Synthesis and Parameter-Efficient Fine-Tuning for Low-Resource NMT: A Case Study on Q'eqchi' Mayan

低资源神经机器翻译的数据合成与参数高效微调:以Q'eqchi'玛雅语为例

Alexander Chulzhanov, Soeren Eberhardt, Arjun Mukherjee

发表机构 * University of Houston(休斯顿大学) MasterWord Services, Inc.(MasterWord Services公司) University of Washington(华盛顿大学)

AI总结 针对低资源土著语言,提出数据合成方法(利用社区词典生成合成语料)结合LoRA参数高效微调,在Q'eqchi'玛雅语上实现高结构习得(BLEU 42.02),但存在结构-语义差距,需结合真实数据进行课程学习。

Comments Accepted to the 29th International Conference on Text, Speech and Dialogue (TSD 2026). This version of the contribution has been accepted for publication, after peer review but is not the Version of Record and does not reflect post-acceptance improvements, or any corrections

详情
AI中文摘要

对于数字低资源土著语言的神经机器翻译,通常因极端数据稀缺而受阻,促使依赖抽取式网络爬取。为确保数据主权,本研究引入了一种数据合成方法,无需爬取目标语言平行文本即可引导NMT模型。以Q'eqchi'玛雅语为重点,我们将社区来源的词典转换为大规模合成语料,利用通过LoRA适配器在mT5-base模型上的参数高效微调(PEFT)。领域内评估显示出高度的结构习得(BLEU 42.02),证明合成约束有效地教授了复杂的黏着形态和VOS语序。然而,针对有机词汇表的评估揭示了结构-语义差距(BLEU 0.59),模型保持了语法完整性但缺乏自然语言的词汇基础。模型表现出对合成模板受限结构方差的过拟合;尽管流程中具有高语义熵,模型仍难以应对自然语言的句法流动性,将有机输入强制转换为僵化的学习模式。此外,利用多任务学习架构的消融研究导致了负迁移,表明辅助任务在LoRA适配器内竞争有限的参数容量,导致对合成标记的过度优化而牺牲了有机灵活性。最终,我们确定合成引导是一种高度有效的结构入门,但需要通过课程学习使用真实数据进行语义细化。

英文摘要

Neural machine translation for digitally low-resource Indigenous languages is often hindered by extreme data scarcity, prompting reliance on extractive web-scraping. To ensure data sovereignty, this study introduces a data synthesis methodology to bootstrap NMT models without scraping target-language parallel text. Focusing on Q'eqchi' Mayan, we transformed community-sourced dictionaries into a massive synthetic corpus, utilizing Parameter-Efficient Fine-Tuning (PEFT) via LoRA adapters on an mT5-base model. In-domain evaluation demonstrates high structural acquisition (BLEU 42.02), proving that synthetic constraints effectively teach complex agglutinative morphology and VOS word order. However, evaluation against an organic glossary reveals a structural-semantic gap (BLEU 0.59), where the model maintains grammatical integrity but lacks the lexical grounding of natural language. The model exhibits overfitting to the constrained structural variance of the synthetic templates; despite high semantic entropy in the pipeline, it struggles with the syntactic fluidity of natural language, forcing organic inputs into rigid learned patterns. Furthermore, an ablation study utilizing a Multi-Task Learning architecture resulted in negative transfer, suggesting that auxiliary tasks competed for limited parameter capacity within the LoRA adapters, causing over-optimization for synthetic markers at the expense of organic flexibility. Ultimately, we establish that synthetic bootstrapping is a highly effective structural primer, but requires authentic data for semantic refinement via Curriculum Learning.
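
The LoRA adapter mentioned in this abstract adds a trainable low-rank update to a frozen weight matrix, $y = Wx + \frac{\alpha}{r} BAx$. The sketch below shows that forward pass and the resulting parameter count in plain Python; the matrix sizes, rank, and `alpha` are illustrative assumptions, not the study's settings.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def lora_forward(W, A, B, x, alpha=1.0):
    """y = W x + (alpha / r) * B (A x), with W frozen and only A, B trainable."""
    r = len(A)  # adapter rank = number of rows of A
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]

def trainable_params(d_in, d_out, r):
    """Parameters in A (r x d_in) plus B (d_out x r)."""
    return r * (d_in + d_out)

W = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]  # frozen 2 x 3 weight
A = [[1.0, 1.0, 1.0]]                   # rank-1 down-projection
x = [1.0, 2.0, 3.0]

print(lora_forward(W, A, [[0.0], [0.0]], x))  # [1.0, 2.0]: zero-initialised B leaves the base model unchanged
print(lora_forward(W, A, [[0.5], [0.0]], x))  # [4.0, 2.0]
print(trainable_params(768, 768, 8), "vs", 768 * 768)  # 12288 vs 589824
```

The small number of trainable parameters is what makes the adapter cheap, and it is also the limited capacity that, per the abstract, auxiliary tasks appeared to compete for in the multi-task ablation.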

2506.22459 2026-06-09 eess.SP cs.LG cs.SY eess.SY 交叉投稿

Physics-Embedded Neural Networks for sEMG-based Continuous Motion Estimation

基于表面肌电的连续运动估计的物理嵌入神经网络

Wending Heng, Chaoyuan Liang, Yihui Zhao, Zhiqiang Zhang, Glen Cooper, Zhenhong Li

发表机构 * University of Manchester(曼彻斯特大学) University of Bristol(布里斯托大学) University of Leeds(利兹大学)

AI总结 提出物理嵌入神经网络(PENN),结合可解释的肌肉骨骼正向动力学与数据驱动残差学习,实现生理一致且准确的连续运动估计,在RMSE和R²指标上优于现有方法。

Comments Accepted by 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)

详情
AI中文摘要

从表面肌电信号(sEMG)中准确解码人类运动意图对于肌电控制至关重要,并在康复机器人和辅助技术中有广泛应用。然而,现有的基于sEMG的运动估计方法通常依赖于难以校准的特定于受试者的肌肉骨骼(MSK)模型,或缺乏生理一致性的纯数据驱动模型。本文提出了一种新颖的物理嵌入神经网络(PENN),它结合了可解释的MSK正向动力学与数据驱动残差学习,从而在实现准确运动估计的同时保持生理一致性。PENN采用递归时间结构来传播历史估计,并使用轻量级卷积神经网络进行残差校正,从而实现鲁棒且时间连贯的估计。为PENN设计了两阶段训练策略。对六名健康受试者的实验评估表明,PENN在均方根误差(RMSE)和$R^2$指标上均优于最先进的基线方法。

英文摘要

Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological consistency. This paper introduces a novel Physics-Embedded Neural Network (PENN) that combines interpretable MSK forward-dynamics with data-driven residual learning, thereby preserving physiological consistency while achieving accurate motion estimation. The PENN employs a recursive temporal structure to propagate historical estimates and a lightweight convolutional neural network for residual correction, leading to robust and temporally coherent estimations. A two-phase training strategy is designed for PENN. Experimental evaluations on six healthy subjects show that PENN outperforms state-of-the-art baseline methods in both root mean square error (RMSE) and $R^2$ metrics.

2208.00859 2026-06-09 cs.LG cs.CL 版本更新

Learning from flowsheets: A generative transformer model for autocompletion of flowsheets

从流程图学习:用于流程图自动补全的生成式Transformer模型

Gabriel Vogel, Lukas Schulze Balhorn, Artur M. Schweidtmann

发表机构 * Delft University of Technology(代尔夫特理工大学)

AI总结 受文本自动补全启发,提出基于SFILES 2.0字符串表示和Transformer语言模型的化工流程图自动补全方法,通过预训练和微调实现交互式流程图合成辅助。

详情
Journal ref
Computers and Chemical Engineering Volume 171, March 2023, 108162
AI中文摘要

我们提出了一种新颖的方法,能够实现化工流程图的自动补全。这一想法受到文本自动补全的启发。我们使用基于文本的SFILES 2.0符号将流程图表示为字符串,并利用基于Transformer的语言模型学习SFILES 2.0语言的语法结构以及流程图中的常见模式。我们在合成生成的流程图拓扑上预训练模型,以学习流程图语言语法。然后,通过迁移学习步骤在真实流程图拓扑上微调模型。最后,我们使用训练好的模型进行因果语言建模,以自动补全流程图。最终,所提出的方法可以在交互式流程图合成过程中为化学工程师提供建议。结果表明,该方法在未来AI辅助过程合成中具有巨大潜力,但也揭示了当前阶段的局限性以及在实际流程图合成场景中部署该技术需要采取的后续步骤。

英文摘要

We propose a novel method enabling autocompletion of chemical flowsheets. This idea is inspired by the autocompletion of text. We represent flowsheets as strings using the text-based SFILES 2.0 notation and learn the grammatical structure of the SFILES 2.0 language and common patterns in flowsheets using a transformer-based language model. We pre-train our model on synthetically generated flowsheet topologies to learn the flowsheet language grammar. Then, we fine-tune our model in a transfer learning step on real flowsheet topologies. Finally, we use the trained model for causal language modeling to autocomplete flowsheets. Eventually, the proposed method can provide chemical engineers with recommendations during interactive flowsheet synthesis. The results demonstrate a high potential of this approach for future AI-assisted process synthesis but also reveal the limitations at the present state and the next steps that need to be taken to deploy this technique in realistic flowsheet synthesis scenarios.
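
The idea of autocompleting a flowsheet string token by token can be illustrated with a much simpler causal model. The sketch below swaps the transformer for a bigram counter and uses made-up unit-operation tokens (`raw`, `pump`, `hex`, and so on) that are not valid SFILES 2.0 syntax; it only shows the greedy next-token loop.

```python
from collections import Counter, defaultdict

def train_bigram(sequences):
    """Count next-token frequencies; a toy stand-in for a causal language model."""
    counts = defaultdict(Counter)
    for seq in sequences:
        tokens = ["<s>"] + list(seq) + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts

def autocomplete(counts, prefix, max_new_tokens=10):
    """Greedily extend a partial flowsheet until the end token or the length limit."""
    tokens = list(prefix)
    prev = tokens[-1] if tokens else "<s>"
    for _ in range(max_new_tokens):
        if prev not in counts:
            break
        nxt = counts[prev].most_common(1)[0][0]
        if nxt == "</s>":
            break
        tokens.append(nxt)
        prev = nxt
    return tokens

# Hypothetical training flowsheets, written as plain token lists.
corpus = [
    ["raw", "pump", "hex", "reactor", "column", "prod"],
    ["raw", "hex", "reactor", "column", "prod"],
    ["raw", "pump", "hex", "reactor", "flash", "prod"],
]
model = train_bigram(corpus)
completion = autocomplete(model, ["raw", "pump"])
```

In an interactive setting one would presumably show several ranked continuations rather than a single greedy one, which is where a sampling or beam-search decoder would replace `most_common(1)`.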

2312.02873 2026-06-09 cs.LG cs.AI 版本更新

Toward autocorrection of chemical process flowsheets using large language models

利用大型语言模型实现化工流程图的自动纠错

Lukas Schulze Balhorn, Marc Caballero, Artur M. Schweidtmann

发表机构 * Process Intelligence Research Group, Department of Chemical Engineering, Delft University of Technology(过程智能研究组,化学工程系,代尔夫特理工大学)

AI总结 提出一种基于大型语言模型的生成式AI方法,自动识别化工流程图中的错误并给出修正建议,在合成数据集上达到80%的top-1准确率。

详情
Journal ref
Computer Aided Chemical Engineering, Volume 53, 2024, Pages 3109-3114
AI中文摘要

过程工程领域广泛使用工艺流程图(PFD)和管道及仪表流程图(P&ID)来表示工艺流程和设备配置。然而,P&ID和PFD(以下统称为流程图)可能包含错误,导致安全隐患、操作效率低下和不必要的开支。纠正和验证流程图是一个繁琐的手动过程。我们提出了一种新颖的生成式AI方法,用于自动识别流程图中的错误并向用户建议修正,即自动纠错流程图。受大型语言模型(LLM)在人类语言语法自动纠错方面突破的启发,我们研究了LLM用于流程图的自动纠错。模型的输入是可能出错的流程图,输出是修正后的流程图建议。我们在合成数据集上以监督方式训练自动纠错模型。该模型在独立测试的合成流程图数据集上达到了80%的top-1准确率和84%的top-5准确率。结果表明,模型能够学习自动纠错合成流程图。我们设想流程图自动纠错将成为化学工程师的有用工具。

英文摘要

The process engineering domain widely uses Process Flow Diagrams (PFDs) and Process and Instrumentation Diagrams (P&IDs) to represent process flows and equipment configurations. However, the P&IDs and PFDs, hereafter called flowsheets, can contain errors causing safety hazards, inefficient operation, and unnecessary expenses. Correcting and verifying flowsheets is a tedious, manual process. We propose a novel generative AI methodology for automatically identifying errors in flowsheets and suggesting corrections to the user, i.e., autocorrecting flowsheets. Inspired by the breakthrough of Large Language Models (LLMs) for grammatical autocorrection of human language, we investigate LLMs for the autocorrection of flowsheets. The input to the model is a potentially erroneous flowsheet and the output of the model is a set of suggestions for a corrected flowsheet. We train our autocorrection model on a synthetic dataset in a supervised manner. The model achieves a top-1 accuracy of 80% and a top-5 accuracy of 84% on an independent test dataset of synthetically generated flowsheets. The results suggest that the model can learn to autocorrect the synthetic flowsheets. We envision that flowsheet autocorrection will become a useful tool for chemical engineers.
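
For readers unfamiliar with the top-1 and top-5 figures quoted above, the metric can be sketched as below. The function name and the toy candidate strings are illustrative assumptions; the abstract does not say how candidates are ranked or compared, so exact string matching is assumed here.

```python
def top_k_accuracy(ranked_predictions, targets, k):
    """Fraction of examples whose target appears among the first k ranked candidates."""
    hits = sum(
        1 for candidates, target in zip(ranked_predictions, targets)
        if target in candidates[:k]
    )
    return hits / len(targets)

# Toy example: three erroneous flowsheets, each with three ranked corrections.
ranked = [["A", "B", "C"], ["B", "A", "C"], ["C", "B", "A"]]
truth = ["A", "A", "A"]
```

Top-k accuracy can only grow with k, so a small gap between top-1 and top-5 suggests that when the first suggestion is wrong, the correct flowsheet is usually not among the next four either.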

2407.13303 2026-06-09 cs.LG 版本更新

Mean Teacher based SSL Framework for Indoor Localization Using Wi-Fi RSSI Fingerprinting

基于Mean Teacher的半监督学习框架用于Wi-Fi RSSI指纹室内定位

Sihao Li, Zhe Tang, Kyeong Soo Kim, Jeremy S. Smith

发表机构 * SIIT, Beijing(北京信息科技大学) Beijing College of Science and Technology(北京科学技术学院) XJTLU(西交利物浦大学) University of Liverpool(利物浦大学)

AI总结 针对Wi-Fi指纹室内定位中标记数据采集耗时、监督学习泛化差及动态环境性能下降问题,提出基于Mean Teacher的半监督深度学习框架,结合接入点选择、预训练和噪声注入,在静态和动态场景下显著降低定位误差。

Comments 41 pages, 13 figures

详情
Journal ref
Applied Soft Computing, Available online 6 June 2026, 115711
AI中文摘要

基于Wi-Fi RSSI指纹的传统大规模室内定位面临标记数据采集耗时费力、监督学习框架下训练的模型因无法利用未标记数据而泛化能力有限,以及在环境变化的动态场景中模型性能下降等问题。为解决这些挑战性问题,我们提出了一种基于Mean Teacher的深度神经网络定位模型的综合半监督学习框架,该框架融合了接入点选择、模型预训练/克隆以及批量级噪声注入。所提出的SSL框架不仅能在离线阶段高效利用混合标记/未标记数据库进行模型静态训练,还能利用现场部署的室内定位系统用户的未标记指纹,在在线阶段对模型进行持续再训练。我们选择Mean Teacher作为基础,因为它能通过模型权重的指数移动平均生成更稳定的目标标签,且不会像Pi-Model那样引入高计算复杂度,同时比时间集成具有更好的在线学习可扩展性,使其成为在大规模室内定位中平衡性能与计算复杂度的最优选择。在UJIIndoorLoc数据库上,与传统的SL框架相比,所提出的SSL框架将CNNLoc和SIMO-DNN模型的平均3D误差分别降低了7.403%和7.748%;在XJTLU动态数据库上,动态训练场景下的平均2D误差最大降低达49.227%,展示了所提出的SSL框架带来的显著性能提升。

英文摘要

Conventional large-scale indoor localization based on Wi-Fi RSSI fingerprinting faces issues of time-consuming and labor-intensive labeled data collection, limited generalization of a model trained under a supervised learning (SL) framework due to its inability to leverage unlabeled data, and model performance degradation in dynamic scenarios with environmental variations. To address those challenging issues, we propose a comprehensive semi-supervised learning (SSL) framework for a deep neural network (DNN) localization model based on the Mean Teacher, which incorporates access point selection, model pre-training/cloning, and batch-level noise injection. The proposed SSL framework can not only efficiently use hybrid labeled/unlabeled databases for static training of a model during the offline phase, but also exploit unlabeled fingerprints from users of the indoor localization system deployed in the field for continuous retraining of the model during the online phase. We base the proposed SSL framework on the Mean Teacher because it can generate more stable target labels through an exponential moving average of model weights without incurring the high computational complexity of the Pi-Model and with better scalability for online learning than Temporal Ensembling, making it an optimal choice that strikes the right balance between performance and computational complexity in large-scale indoor localization. With the UJIIndoorLoc database, the proposed SSL framework reduces the mean 3D errors of the CNNLoc and SIMO-DNN models by 7.403% and 7.748%, respectively, compared with those under the conventional SL framework; with the XJTLU dynamic database, the maximum reduction in mean 2D error reaches up to 49.227% under a dynamic training scenario, demonstrating the substantial performance improvement achieved by the proposed SSL framework.
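
The Mean Teacher mechanism mentioned above, in which the teacher's weights are an exponential moving average (EMA) of the student's weights, can be sketched in a few lines. Weights are represented as flat lists of floats, and the decay value `alpha` and the squared-error consistency term are assumptions for illustration rather than the paper's settings.

```python
def ema_update(teacher_weights, student_weights, alpha=0.99):
    """Move each teacher weight a small step toward the matching student weight."""
    return [
        alpha * t + (1.0 - alpha) * s
        for t, s in zip(teacher_weights, student_weights)
    ]

def consistency_loss(student_outputs, teacher_outputs):
    """Mean squared difference between student and teacher predictions."""
    n = len(student_outputs)
    return sum((s - t) ** 2 for s, t in zip(student_outputs, teacher_outputs)) / n

# The teacher lags behind a student whose weight jumps from 0 to 1.
teacher = [0.0]
for _ in range(3):
    teacher = ema_update(teacher, [1.0], alpha=0.5)
```

Because the teacher is only ever an average of past student weights, its predictions on unlabeled fingerprints change more slowly than the student's, which is the stability property the abstract refers to.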

2411.11350 2026-06-09 cs.LG eess.SP 版本更新

Zero and Few Shot Load Forecasting with Large Language Models

基于大语言模型的零样本和少样本负荷预测

Wenlong Liao, Chengrui Zhang, Zhe Yang, Mengshuo Jia, Christian Rehtanz, Jiannong Fang, Fernando Porté-Agel

发表机构 * School of Electrical Engineering, Southeast University(东南大学电气工程学院) Wind Engineering and Renewable Energy Laboratory, Ecole Polytechnique Federale de Lausanne (EPFL)(洛桑联邦理工学院(EPFL)风能与可再生能源实验室) College of Electrical Engineering and New Energy, China Three Gorges University(三峡大学电气工程与新能源学院) Department of Electrical and Electronic Engineering, Imperial College London(伦敦帝国理工学院电气与电子工程系) The Department of Automation, School of Automation and Intelligent Sensing, Shanghai Jiao Tong University(上海交通大学自动化与智能感知学院自动化系) The Key Laboratory of System Control and Information Processing, Ministry of Education of China, Shanghai(系统控制与信息处理教育部重点实验室,上海) State Key Laboratory of Submarine Geoscience, Shanghai(海底科学国家重点实验室,上海) Institute of Energy Systems, Energy Efficiency and Energy Economics, TU Dortmund University(多特蒙德工业大学能源系统、能效与能源经济研究所)

AI总结 提出利用预训练语言模型Chronos进行零样本和少样本负荷预测,在数据稀缺场景下显著优于多种基线模型。

Comments 24 pages, 5 figures

详情
Journal ref
International Journal of Electrical Power & Energy Systems, Volume 177, April 2026
AI中文摘要

深度学习模型在负荷预测中表现出色,但通常需要大量数据进行模型训练才能应用于新场景,这限制了其在数据稀缺场景下的有效性。受预训练语言模型(LLMs)在自然语言处理中巨大成功的启发,本文提出了一种使用高级LLM框架(称为Chronos模型)的零样本和少样本负荷预测方法。通过利用其广泛的预训练知识,Chronos模型能够在数据稀缺场景下实现准确的负荷预测。在五个真实世界数据集上的仿真结果表明,Chronos模型在确定性和概率性负荷预测中,针对不同的预测时间范围(例如1至48小时),均显著优于九种流行的基线模型,尽管Chronos模型既未针对这些特定负荷数据集进行定制也未进行微调。值得注意的是,与基线模型相比,Chronos将均方根误差(RMSE)、连续排序概率得分(CRPS)和分位数得分(QS)分别降低了约7.34%-84.30%、19.63%-60.06%和22.83%-54.49%。这些结果突显了Chronos模型的优越性和灵活性,使其成为数据稀缺场景下的有效解决方案。

英文摘要

Deep learning models have shown strong performance in load forecasting, but they generally require large amounts of data for model training before being applied to new scenarios, which limits their effectiveness in data-scarce scenarios. Inspired by the great success of pre-trained language models (LLMs) in natural language processing, this paper proposes a zero and few shot load forecasting approach using an advanced LLM framework denoted as the Chronos model. By utilizing its extensive pre-trained knowledge, the Chronos model enables accurate load forecasting in data-scarce scenarios. Simulation results across five real-world datasets demonstrate that the Chronos model significantly outperforms nine popular baseline models for both deterministic and probabilistic load forecasting with various forecast horizons (e.g., 1 to 48 hours), even though the Chronos model is neither tailored nor fine-tuned to these specific load datasets. Notably, Chronos reduces root mean squared error (RMSE), continuous ranked probability score (CRPS), and quantile score (QS) by approximately 7.34%-84.30%, 19.63%-60.06%, and 22.83%-54.49%, respectively, compared to baseline models. These results highlight the superiority and flexibility of the Chronos model, positioning it as an effective solution in data-scarce scenarios.
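
Two of the metrics named above can be written down compactly. The sketch below gives RMSE for deterministic forecasts and the pinball loss for a single quantile forecast, on the assumption that the quantile score (QS) in the abstract refers to the usual pinball loss; the abstract itself does not define it, and CRPS is omitted here.

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error of point forecasts."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))

def pinball_loss(y_true, y_quantile, q):
    """Average pinball loss of a forecast for quantile level q (0 < q < 1)."""
    total = 0.0
    for t, p in zip(y_true, y_quantile):
        diff = t - p
        # Under-forecasts are weighted by q, over-forecasts by (1 - q).
        total += q * diff if diff >= 0 else (q - 1.0) * diff
    return total / len(y_true)
```

For a high quantile such as q = 0.9, forecasting below the observed load is penalized nine times more heavily than forecasting above it by the same amount, which is what pushes the forecast toward the upper tail.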

2412.00508 2026-06-09 cs.LG cs.AI cs.CE 版本更新

Graph-to-SFILES: Control structure prediction from process topologies using generative artificial intelligence

Graph-to-SFILES: 基于生成式人工智能从过程拓扑预测控制结构

Lukas Schulze Balhorn, Kevin Degens, Artur M. Schweidtmann

发表机构 * Process Intelligence Research Group(过程智能研究组) Department of Chemical Engineering(化学工程系) Delft University of Technology(代尔夫特理工大学)

AI总结 提出Graph-to-SFILES模型,利用图神经网络从流程图拓扑生成控制扩展流程图序列,在小数据集上显著提升控制结构预测精度。

详情
Journal ref
Computers & Chemical Engineering, Volume 199, 2025, Pages 109121
AI中文摘要

控制结构设计是P&ID开发中重要但繁琐的步骤。生成式人工智能有望通过支持工程师来减少P&ID开发时间。先前关于化学过程设计中生成式AI的研究主要用序列表示过程。然而,图因其置换不变性而成为一种有前景的替代方案。我们提出了Graph-to-SFILES模型,一种从流程图拓扑预测控制结构的生成式AI方法。Graph-to-SFILES模型将流程图拓扑作为图输入,并返回以SFILES 2.0符号表示的控制扩展流程图序列。我们比较了四种不同的图编码器架构,其中一种是本文提出的图神经网络(GNN)。Graph-to-SFILES模型在10,000个流程图拓扑上训练时达到了73.2%的top-5准确率。此外,所提出的GNN在编码器架构中表现最佳。与纯基于序列的方法相比,Graph-to-SFILES模型在相对较小的1,000个流程图训练数据集上将top-5准确率从0.9%提高到28.4%。然而,在100,000个流程图的大规模数据集上,基于序列的方法表现更好。这些结果突显了基于图的AI模型在小数据场景下加速P&ID开发的潜力,但其在工业相关案例研究中的有效性仍需进一步研究。

英文摘要

Control structure design is an important but tedious step in P&ID development. Generative artificial intelligence (AI) promises to reduce P&ID development time by supporting engineers. Previous research on generative AI in chemical process design mainly represented processes by sequences. However, graphs offer a promising alternative because of their permutation invariance. We propose the Graph-to-SFILES model, a generative AI method to predict control structures from flowsheet topologies. The Graph-to-SFILES model takes the flowsheet topology as a graph input and returns a control-extended flowsheet as a sequence in the SFILES 2.0 notation. We compare four different graph encoder architectures, one of them being a graph neural network (GNN) proposed in this work. The Graph-to-SFILES model achieves a top-5 accuracy of 73.2% when trained on 10,000 flowsheet topologies. In addition, the proposed GNN performs best among the encoder architectures. Compared to a purely sequence-based approach, the Graph-to-SFILES model improves the top-5 accuracy for a relatively small training dataset of 1,000 flowsheets from 0.9% to 28.4%. However, the sequence-based approach performs better on a large-scale dataset of 100,000 flowsheets. These results highlight the potential of graph-based AI models to accelerate P&ID development in small-data regimes, but their effectiveness on industry-relevant case studies still needs to be investigated.
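
The permutation invariance that motivates the graph encoder can be checked on a toy example: one round of neighbor-sum message passing followed by a sum readout gives the same value however the nodes are numbered. The functions below are an illustrative sketch with scalar node features, not the GNN proposed in the paper.

```python
def graph_readout(node_features, edges):
    """One neighbor-sum message-passing step, then a sum-of-squares readout."""
    messages = [0.0] * len(node_features)
    for src, dst in edges:
        messages[dst] += node_features[src]
    updated = [h + m for h, m in zip(node_features, messages)]
    return sum(u * u for u in updated)

def permute_graph(node_features, edges, perm):
    """Relabel node i as perm[i], moving features and edges consistently."""
    new_features = [0.0] * len(node_features)
    for old, new in enumerate(perm):
        new_features[new] = node_features[old]
    new_edges = [(perm[src], perm[dst]) for src, dst in edges]
    return new_features, new_edges

# A three-unit chain: node 0 feeds node 1, which feeds node 2.
features = [1.0, 2.0, 4.0]
edges = [(0, 1), (1, 2)]
shuffled_features, shuffled_edges = permute_graph(features, edges, [2, 0, 1])
```

A string serialization of the same flowsheet generally changes when the traversal order changes, so a sequence model has to learn that different strings can describe one topology, whereas the graph readout above is identical for both numberings.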

2412.06147 2026-06-09 cs.LG cs.ET 版本更新

Advancements in Machine Learning and Deep Learning for Early Detection and Management of Mental Health Disorder

机器学习和深度学习在心理健康障碍早期检测和管理中的进展

Kamala Devi Kannan, Senthil Kumar Jagatheesaperumal, Rajesh N. V. P. S. Kandala, Mojtaba Lotfaliany, Roohallah Alizadehsani, Mohammadreza Mohebbi

发表机构 * Department of Computer Science and Engineering, Mepco Schlenk Engineering College(梅科斯伦克工程学院计算机科学与工程系) Department of Electronics and Communication Engineering, Mepco Schlenk Engineering College(梅科斯伦克工程学院电子与通信工程系) School of Electronics Engineering (SENSE), VIT-AP University(VIT-AP大学电子工程学院(SENSE)) The Institute for Mental and Physical Health and Clinical Translation (IMPACT), School of Medicine, Deakin University(迪肯大学医学院心理与身体健康及临床转化研究所(IMPACT)) Biostatistics Unit, Faculty of Health, Deakin University(迪肯大学健康学院生物统计组) School of Medicine, Deakin University(迪肯大学医学院)

AI总结 综述了ML/DL在心理健康障碍早期诊断中的应用,涵盖医学影像、遗传和行为数据,并讨论了数据整合、伦理挑战及未来方向。

Comments 21 pages, 2 figures, 3 tables

详情
AI中文摘要

对于心理健康疾病的早期识别、诊断和治疗,深度学习(DL)和机器学习(ML)的整合已开始发挥重要作用。通过评估来自影像、遗传学和行为评估的复杂数据,这些技术有潜力显著改善临床结果。然而,它们也带来了与数据整合和伦理问题相关的独特挑战。本综述回顾了ML和DL方法在心理健康问题早期诊断和治疗中的发展。它考察了一系列应用,特别强调了行为评估、遗传和生物标志物分析,以及用于诊断抑郁症、双相情感障碍和精神分裂症等疾病的医学影像。综述进一步讨论了疾病发展的预测建模,重点关注风险预测模型和纵向研究的作用。重要发现显示了ML和DL如何提高诊断准确性和治疗结果,同时解决方法不一致、数据整合和伦理问题。研究强调了构建用于个性化治疗的实时监测系统、改进数据融合技术和跨学科合作的重要性。未来的研究应集中于克服这些障碍,以最大化ML和DL在心理健康服务中的有益和道德实施。

英文摘要

For the early identification, diagnosis, and treatment of mental health illnesses, the integration of deep learning (DL) and machine learning (ML) has started playing a significant role. By evaluating complex data from imaging, genetics, and behavioral assessments, these technologies have the potential to improve clinical results significantly. However, they also present unique challenges relating to data integration and ethical issues. The development of ML and DL methods for the early diagnosis and treatment of mental health issues is reviewed in this survey. It examines a range of applications, with a particular emphasis on behavioral assessments, genetic and biomarker analysis, and medical imaging for the diagnosis of diseases like depression, bipolar disorder, and schizophrenia. Predictive modeling for illness development is further discussed in the review, focusing on the function of risk prediction models and longitudinal investigations. Important discoveries show how ML and DL might improve treatment outcomes and diagnostic accuracy while tackling methodological inconsistency, data integration, and ethical concerns. The study emphasizes the significance of building real-time monitoring systems for individualized treatment, improving data fusion techniques, and interdisciplinary collaboration. Upcoming studies should concentrate on surmounting these obstacles to maximize ML and DL's valuable and moral implementation in mental health services.

2504.18451 2026-06-09 cs.LG 版本更新

Enhancing Strawberry Yield Forecasting with Backcasted IoT Sensor Data and Machine Learning

利用回溯预测(backcasting)的IoT传感器数据和机器学习增强草莓产量预测

Tewodros Alemu Ayall, Andy Li, Matthew Beddows, Milan Markovic, Georgios Leontidis

发表机构 * The School of Natural and Computing Sciences and the Interdisciplinary Institute at the University of Aberdeen(阿伯丁大学自然科学与计算科学学院及跨学科研究所) UiT The Arctic University of Norway(挪威北极大学)

AI总结 针对IoT数据缺失问题,提出基于AI的回溯预测(backcasting)方法合成传感器数据,结合真实数据训练产量预测模型,在草莓生产中验证了合成数据可提升预测精度。

Comments V2: 10 pages, 4 figures, 4 Tables

详情
AI中文摘要

全球人口的快速增长凸显了数字化农业系统的必要性,该系统支持可持续粮食生产以及为农民和利益相关者提供数据驱动的资源管理。采用能够捕获实时环境(如温度、湿度)和操作(如灌溉)参数的物联网(IoT)技术,是实现基于AI的产量预测等高级应用的关键一步。然而,此类模型的有效性通常受限于数据可用性有限,特别是在动态农场环境中,IoT观测数据需要跨越多个生长季节积累。在本研究中,我们在两个生长季节内于草莓生产塑料大棚中部署了IoT传感器,收集了用水量、内外温湿度、土壤湿度、土壤温度以及光合有效辐射数据。这些观测数据与跨越四个季节的手动记录产量数据相结合。为了填补无传感器覆盖的两个季节的IoT数据缺口,我们开发了一种基于AI的回溯预测(backcasting)方法,利用附近气象站的历史天气数据和现有塑料大棚测量值合成缺失的传感器观测数据。然后,我们使用真实和合成数据集训练基于AI的产量预测模型。在这项回顾性评估中,结果表明,结合合成数据提高了产量预测准确性,在组合数据集上训练的模型优于仅使用真实传感器、天气和产量数据的模型。

英文摘要

Rapid global population growth underscores the need for digitally enabled agricultural systems that support sustainable food production and data-driven resource management for farmers and stakeholders. The adoption of Internet of Things (IoT) technologies, capable of capturing real-time environmental (e.g., temperature, humidity) and operational (e.g., irrigation) parameters, is a crucial step toward enabling advanced applications such as AI-based yield forecasting. However, the effectiveness of such models is often constrained by limited data availability, particularly in dynamic farm environments where IoT observations must be accumulated over multiple growing seasons. In this study, we deployed IoT sensors in strawberry production polytunnels over two growing seasons to collect data on water usage, internal and external temperature and humidity, soil moisture, soil temperature, and photosynthetically active radiation. These observations were combined with manually recorded yield data spanning four seasons. To address gaps in IoT data for the two seasons without sensor coverage, we developed an AI-based backcasting approach that synthesizes missing sensor observations using historical weather data from a nearby station and existing polytunnel measurements. We then trained AI-based yield forecasting models using both real and synthetic datasets. In this retrospective evaluation, results show that incorporating synthetic data improved yield forecasting accuracy, with models trained on the combined dataset outperforming those using only real sensor, weather, and yield data.
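
The backcasting step described above (learn the relation between weather-station data and polytunnel sensors on seasons where both exist, then apply it to earlier seasons that have only station data) can be sketched with a one-variable linear fit. The abstract calls the approach AI-based without naming a model, so ordinary least squares is used here purely as a stand-in, and all numbers are invented.

```python
def fit_linear(x, y):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x

def backcast(station_history, slope, intercept):
    """Synthesize sensor readings for periods that only have station data."""
    return [slope * s + intercept for s in station_history]

# Overlapping seasons: station temperature and polytunnel temperature (made up).
station_overlap = [10.0, 15.0, 20.0, 25.0]
tunnel_overlap = [14.0, 19.0, 24.0, 29.0]
slope, intercept = fit_linear(station_overlap, tunnel_overlap)

# Earlier seasons: station data only; the tunnel readings are synthesized.
synthetic_tunnel = backcast([12.0, 18.0], slope, intercept)
```

The synthetic readings inherit whatever bias the fitted relation has, so the downstream yield model is only as trustworthy as the overlap period is representative of the earlier seasons.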

2509.17446 2026-06-09 cs.LG cs.AI 版本更新

MVCL-DAF++: Enhancing Multimodal Intent Recognition via Prototype-Aware Contrastive Alignment and Coarse-to-Fine Dynamic Attention Fusion

MVCL-DAF++: 通过原型感知对比对齐和由粗到细动态注意力融合增强多模态意图识别

Haofeng Huang, Yifei Han, Long Zhang, Bin Li, Yangfan He, Yaxin Xue

发表机构 * University of Shanghai for Science and Technology, China(上海理工大学,中国) Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, China(中国科学院深圳先进技术研究院,中国) University of Minnesota-Twin Cities, USA(明尼苏达大学双城分校,美国) University of Leeds, UK(利兹大学,英国)

AI总结 提出MVCL-DAF++,通过原型感知对比对齐和由粗到细注意力融合,在MIntRec和MIntRec2.0上提升多模态意图识别,尤其改善稀有类识别。

Comments Accepted by Interspeech 2026

详情
AI中文摘要

多模态意图识别(MMIR)在噪声或稀有类条件下存在语义基础薄弱和鲁棒性差的问题。我们提出MVCL-DAF++,它通过两个关键模块扩展了MVCL-DAF:(1)原型感知对比对齐,将实例与类级原型对齐以增强语义一致性;(2)由粗到细注意力融合,将全局模态摘要与令牌级特征集成以实现层次化跨模态交互。在MIntRec和MIntRec2.0上,MVCL-DAF++取得了新的最佳结果,稀有类识别WF1分别提高了+1.05%和+4.18%。这些结果证明了原型引导学习和由粗到细融合对于鲁棒多模态理解的有效性。源代码可在https://github.com/chr1s623/MVCL-DAF-PlusPlus获取。

英文摘要

Multimodal intent recognition (MMIR) suffers from weak semantic grounding and poor robustness under noisy or rare-class conditions. We propose MVCL-DAF++, which extends MVCL-DAF with two key modules: (1) Prototype-aware contrastive alignment, aligning instances to class-level prototypes to enhance semantic consistency; and (2) Coarse-to-fine attention fusion, integrating global modality summaries with token-level features for hierarchical cross-modal interaction. On MIntRec and MIntRec2.0, MVCL-DAF++ achieves new state-of-the-art results, improving rare-class recognition by +1.05% and +4.18% WF1, respectively. These results demonstrate the effectiveness of prototype-guided learning and coarse-to-fine fusion for robust multimodal understanding. The source code is available at https://github.com/chr1s623/MVCL-DAF-PlusPlus.
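
Aligning an instance to class-level prototypes can be illustrated with a generic prototype-based contrastive loss: each class prototype is the normalized mean of its embeddings, and an instance is scored against all prototypes with a softmax over cosine similarities. This is a common formulation assumed for illustration; the exact loss, temperature, and prototype update used by MVCL-DAF++ are not given in the abstract.

```python
import math

def normalize(vector):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector]

def class_prototypes(embeddings, labels):
    """Prototype of each class: normalized mean of that class's embeddings."""
    sums, counts = {}, {}
    for emb, lab in zip(embeddings, labels):
        acc = sums.setdefault(lab, [0.0] * len(emb))
        for i, x in enumerate(emb):
            acc[i] += x
        counts[lab] = counts.get(lab, 0) + 1
    return {lab: normalize([x / counts[lab] for x in acc]) for lab, acc in sums.items()}

def prototype_contrastive_loss(embedding, label, prototypes, temperature=0.1):
    """Cross-entropy over temperature-scaled cosine similarities to all prototypes."""
    z = normalize(embedding)
    logits = {
        lab: sum(a * b for a, b in zip(z, proto)) / temperature
        for lab, proto in prototypes.items()
    }
    max_logit = max(logits.values())
    log_norm = max_logit + math.log(sum(math.exp(v - max_logit) for v in logits.values()))
    return log_norm - logits[label]
```

Because every class contributes exactly one prototype regardless of how many samples it has, rare classes are represented on equal footing in the softmax, which is one plausible reading of why prototype alignment helps rare-class recognition.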

2510.03244 2026-06-09 cs.LG cs.AI cs.CV 版本更新

VFEM: Visual Feature Empowered Multivariate Time Series Forecasting with Cross-Modal Fusion

VFEM: 视觉特征赋能的多变量时间序列预测与跨模态融合

Yanlong Wang, Hang Yu, Jian Xu, Fei Ma, Hongkang Zhang, Tongtong Feng, Zijian Zhang, Shao-Lun Huang, Danny Dongning Sun, Xiao-Ping Zhang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University(清华大学深圳国际研究生院,清华大学) Pengcheng Laboratory(鹏城实验室) Ant Group(蚂蚁集团) Guangdong Laboratory of Artificial Intelligence and Digital Economy (SZ)(广东人工智能与数字经济实验室(深圳)) University of Pennsylvania(宾夕法尼亚大学)

AI总结 提出VFEM模型,利用预训练大视觉模型通过跨模态注意力融合视觉与时间特征,仅训练7.45%参数即可捕捉跨变量依赖,提升多变量时间序列预测性能。

详情
AI中文摘要

大型时间序列基础模型通常采用通道独立架构来处理不同的数据维度,但这种设计忽略了关键的跨通道依赖关系。同时,现有的跨模态方法主要依赖文本模态,使得视觉模型的空间模式识别能力在时间序列分析中未被充分探索。为了解决这些局限性,我们提出了VFEM,一种利用预训练大视觉模型(LVM)捕获复杂跨变量模式的跨模态预测模型。VFEM将多变量时间序列转换为视觉表示,使LVM能够感知通道独立模型未显式建模的空间关系。通过双分支架构,视觉和时间特征被独立提取,然后通过跨模态注意力融合,使两种模态的互补信息增强预测。通过冻结LVM并仅训练总参数的7.45%,VFEM在多个基准上取得了竞争性能,为多变量时间序列预测提供了新视角。

英文摘要

Large time series foundation models often adopt channel-independent architectures to handle varying data dimensions, but this design ignores crucial cross-channel dependencies. Meanwhile, existing cross-modal methods predominantly rely on textual modalities, leaving the spatial pattern recognition capabilities of vision models underexplored for time series analysis. To address these limitations, we propose VFEM, a cross-modal forecasting model that leverages pre-trained large vision models (LVMs) to capture complex cross-variable patterns. VFEM transforms multivariate time series into visual representations, enabling LVMs to perceive spatial relationships that are not explicitly modeled by channel-independent models. Through a dual-branch architecture, visual and temporal features are independently extracted and then fused via cross-modal attention, allowing complementary information from both modalities to enhance forecasting. By freezing the LVM and training only 7.45% of the total parameters, VFEM achieves competitive performance on multiple benchmarks, offering a new perspective on multivariate time series forecasting.
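
The fusion step, in which temporal features query visual features through cross-modal attention, can be sketched with plain scaled dot-product attention. The version below omits the learned query, key, and value projections and any residual connection, and treats each feature as a short list of floats; these simplifications are assumptions, not details of VFEM.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, keys, values):
    """Each query (temporal token) takes a weighted average of the values (visual tokens)."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused

# One temporal token attending to two visual tokens with identical keys.
temporal_tokens = [[1.0, 0.0]]
visual_keys = [[1.0, 0.0], [1.0, 0.0]]
visual_values = [[0.0, 2.0], [4.0, 6.0]]
fused_tokens = cross_attention(temporal_tokens, visual_keys, visual_values)
```

When the keys are identical the attention weights are uniform, so the fused token is simply the mean of the visual values; with distinct keys, each temporal token would pull more from the visual tokens it matches best.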

2510.10028 2026-06-09 cs.LG cs.AI cs.DC 版本更新

Efficient Onboard Vision-Language Inference in UAV-Enabled Low-Altitude Economy Networks via LLM-Enhanced Optimization

基于LLM增强优化的无人机低空经济网络高效机载视觉-语言推理

Yang Li, Ruichen Zhang, Yinqiu Liu, Guangyuan Liu, Abbas Jamalipour, Xianbin Wang, Dong In Kim

发表机构 * College of Computing and Data Science, Nanyang Technological University, Singapore(南洋理工大学计算与数据科学学院,新加坡) The University of Sydney, Sydney, Australia(悉尼大学,澳大利亚悉尼) Department of Electrical and Computer Engineering, Western University, London, Canada(韦仕敦大学电气与计算机工程系,加拿大伦敦) Department of Electrical and Computer Engineering, Sungkyunkwan University, South Korea(成均馆大学电气与计算机工程系,韩国)

AI总结 针对无人机低空经济网络中机载视觉-语言模型推理的准确性与通信效率挑战,提出分层优化框架,包括交替分辨率与功率优化算法及大语言模型增强的强化学习轨迹优化方法,有效提升推理性能与通信效率。

详情
AI中文摘要

低空经济网络(LAENets)的快速发展催生了多种应用,包括空中监视、环境感知和语义数据收集。为支持这些场景,配备机载视觉-语言模型(VLM)的无人机(UAV)为实时多模态推理提供了一种有前景的解决方案。然而,由于有限的机载资源和动态的网络条件,确保推理准确性和通信效率仍然是一个重大挑战。在本文中,我们首先提出一个无人机启用的LAENet系统模型,该模型联合捕捉无人机移动性、用户-无人机通信以及机载视觉问答(VQA)流水线。基于该模型,我们制定了一个混合整数非凸优化问题,以在用户特定的准确性约束下最小化任务延迟和功耗。为解决该问题,我们设计了一个由两部分组成的分层优化框架:(i)交替分辨率与功率优化(ARPO)算法,用于在准确性约束下进行资源分配;(ii)大语言模型增强的强化学习方法(LLaRA),用于自适应无人机轨迹优化。大语言模型(LLM)作为专家,以离线方式改进强化学习的奖励设计,在实时决策中不引入额外延迟。数值结果证明了我们提出的框架在动态LAENet条件下提升推理性能和通信效率的有效性。

英文摘要

The rapid advancement of Low-Altitude Economy Networks (LAENets) has enabled a variety of applications, including aerial surveillance, environmental sensing, and semantic data collection. To support these scenarios, unmanned aerial vehicles (UAVs) equipped with onboard vision-language models (VLMs) offer a promising solution for real-time multimodal inference. However, ensuring both inference accuracy and communication efficiency remains a significant challenge due to limited onboard resources and dynamic network conditions. In this paper, we first propose a UAV-enabled LAENet system model that jointly captures UAV mobility, user-UAV communication, and the onboard visual question answering (VQA) pipeline. Based on this model, we formulate a mixed-integer non-convex optimization problem to minimize task latency and power consumption under user-specific accuracy constraints. To solve the problem, we design a hierarchical optimization framework composed of two parts: (i) an Alternating Resolution and Power Optimization (ARPO) algorithm for resource allocation under accuracy constraints, and (ii) a Large Language Model-augmented Reinforcement Learning Approach (LLaRA) for adaptive UAV trajectory optimization. The large language model (LLM) serves as an expert in refining reward design of reinforcement learning in an offline fashion, introducing no additional latency in real-time decision-making. Numerical results demonstrate the efficacy of our proposed framework in improving inference performance and communication efficiency under dynamic LAENet conditions.

2512.03606 2026-06-09 cs.LG 版本更新

Observation-driven correction of numerical weather prediction for marine winds

基于观测驱动的海洋风数值天气预报修正

Matteo Peduto, Qidong Yang, Jonathan Giezendanner, Devis Tuia, Sherrie Wang

AI总结 提出ORCA模型,利用Transformer架构融合稀疏、异质的海洋观测数据,实时修正GFS风场预报,在0-48小时预报时效内误差降低13%-45%。

详情
AI中文摘要

准确的海洋风预报对于安全航行、船舶路线规划和能源作业至关重要,但由于海洋观测数据稀疏、异质且时间变化大,预报仍然具有挑战性。我们提出了一种基于观测信息的全球数值天气预报(NWP)海洋风修正方法。该方法不是直接预报风场,而是通过同化最新的现场观测数据来学习局部修正模式,以调整全球预报系统(GFS)的输出。我们提出了ORCA(基于注意力的观测信息实时修正),这是一种基于Transformer的深度学习架构,它(i)通过掩码和基于集合的注意力机制处理不规则且随时间变化的观测集,(ii)通过交叉注意力将预测条件建立在最近的观测-预报对上,以及(iii)采用循环时间嵌入和坐标感知的位置表示,从而在任意空间坐标上实现单次推理。我们使用国际综合海洋-大气数据集(ICOADS)的观测数据,在大西洋上评估了ORCA。ORCA在长达48小时的所有预报时效内降低了GFS 10米风误差,在1小时预报时效内实现了45%的改进,在48小时预报时效内实现了13%的改进。空间分析显示,在观测最丰富的海岸线和航运路线沿线,改进最为持久。这种标记化架构自然地适应了异质的观测平台(船舶、浮标、验潮站和海岸站),并在单次前向传播中产生站点特定的预测和流域尺度的网格化产品。这些结果展示了一种实用的低延迟后处理方法,通过学习修正系统性的预报误差来补充NWP。

英文摘要

Accurate marine wind forecasts are essential for safe navigation, ship routing, and energy operations, yet they remain challenging because observations over the ocean are sparse, heterogeneous, and temporally variable. We present an observation-informed correction approach for global numerical weather prediction (NWP) of marine winds. Rather than forecasting winds directly, we learn local correction patterns by assimilating the latest in-situ observations to adjust the Global Forecast System (GFS) output. We propose ORCA (Observation-informed Real-time Correction with Attention), a transformer-based deep learning architecture that (i) handles irregular and time-varying observation sets through masking and set-based attention mechanisms, (ii) conditions predictions on recent observation-forecast pairs via cross-attention, and (iii) employs cyclical time embeddings and coordinate-aware location representations to enable single-pass inference at arbitrary spatial coordinates. We evaluate ORCA over the Atlantic Ocean using observations from the International Comprehensive Ocean-Atmosphere Data Set (ICOADS) as reference. ORCA reduces GFS 10-meter wind error at all lead times up to 48 hours, achieving 45% improvement at 1-hour lead time and 13% improvement at 48-hour lead time. Spatial analyses reveal the most persistent improvements along coastlines and shipping routes, where observations are most abundant. The tokenized architecture naturally accommodates heterogeneous observing platforms (ships, buoys, tide gauges, and coastal stations) and produces both site-specific predictions and basin-scale gridded products in a single forward pass. These results demonstrate a practical, low-latency post-processing approach that complements NWP by learning to correct systematic forecast errors.
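
Cyclical time embeddings of the kind mentioned in item (iii) are commonly built by mapping a periodic quantity onto the unit circle, so that values at the two ends of the period end up next to each other. The sketch below shows that standard sine/cosine construction; whether ORCA uses exactly this form, and for which periods, is an assumption.

```python
import math

def cyclical_embedding(value, period):
    """Map a periodic value (e.g. hour of day, period 24) to a point on the unit circle."""
    angle = 2.0 * math.pi * (value % period) / period
    return math.sin(angle), math.cos(angle)

def embedding_distance(a, b):
    """Euclidean distance between two embedded time points."""
    return math.hypot(a[0] - b[0], a[1] - b[1])

late_evening = cyclical_embedding(23, 24)
midnight = cyclical_embedding(0, 24)
noon = cyclical_embedding(12, 24)
```

With a raw hour feature, 23:00 and 00:00 are 23 units apart; in the embedded space they are neighbors, while midnight and noon sit on opposite sides of the circle.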

2601.09285 2026-06-09 cs.LG cond-mat.mtrl-sci 版本更新

Enhancing Spatial Reasoning in Large Language Models for Metal-Organic Frameworks Structure Prediction

增强大型语言模型在金属有机框架结构预测中的空间推理能力

Mianzhi Pan, JianFei Li, Peishuo Liu, Botian Wang, Yawen Ouyang, Yiming Rong, Hao Zhou, Jianbing Zhang

发表机构 * National Key Laboratory for Novel Software Technology(新型软件技术国家重点实验室) Nanjing University(南京大学) Institute of AI Industry Research (AIR)(人工智能产业研究院) Tsinghua University(清华大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) University of Chinese Academy of Sciences(中国科学院大学) ChemBIC(化学信息学中心)

AI总结 针对MOF结构预测中原子数量多、复杂度高的问题,提出MOF-LLM框架,通过空间感知持续预训练、结构监督微调和匹配驱动强化学习,增强Qwen-3 8B模型的空间推理能力,实现35.78%匹配率和0.04秒/结构的采样效率。

Comments KDD 2026

详情
AI中文摘要

金属有机框架(MOFs)是多孔晶体材料,在碳捕获和药物输送等领域有广泛应用,但准确预测其三维结构仍然是一个重大挑战。尽管大型语言模型(LLMs)在生成晶体结构方面显示出潜力,但由于MOF单胞中原子数量多导致的结构高度复杂性,LLMs在MOF上的应用受到阻碍。受深度生成模型中块级范式成功启发,我们率先将LLMs应用于该领域,引入了MOF-LLM,这是第一个专门针对块级MOF结构预测的LLM框架。为了有效利用LLMs完成这一3D模块化组装任务,我们的训练范式整合了空间感知持续预训练(CPT)、结构监督微调(SFT)和匹配驱动强化学习(RL)。通过引入显式空间先验并利用软自适应策略优化(SAPO)优化结构稳定性,我们的方法显著增强了Qwen-3 8B模型在MOF结构预测中的空间推理能力。综合实验表明,MOF-LLM实现了最先进的性能,匹配率达到35.78%,同时展现出卓越的采样效率,每个结构仅需0.04秒。

英文摘要

Metal-organic frameworks (MOFs) are porous crystalline materials with broad applications such as carbon capture and drug delivery, yet accurately predicting their 3D structures remains a significant challenge. While Large Language Models (LLMs) have shown promise in generating crystal structures, their application to MOFs is hindered by MOFs' high structural complexity arising from the large number of atoms in the unit cell. Inspired by the success of block-wise paradigms in deep generative models for MOFs, we pioneer the application of LLMs in this domain by introducing MOF-LLM, the first LLM framework specifically adapted for block-level MOF structure prediction. To effectively harness LLMs for this 3D modular assembly task, our training paradigm integrates spatial-aware continual pre-training (CPT), structural supervised fine-tuning (SFT), and matching-driven reinforcement learning (RL). By incorporating explicit spatial priors and optimizing structural stability via Soft Adaptive Policy Optimization (SAPO), our approach substantially enhances the spatial reasoning in a Qwen-3 8B model for MOF structure prediction. Comprehensive experiments demonstrate that MOF-LLM achieves state-of-the-art performance with a match rate of 35.78% while exhibiting superior sampling efficiency of 0.04 seconds per structure.

2602.03395 2026-06-09 cs.LG 版本更新

The Label Horizon Paradox: Rethinking Supervision Targets in Financial Forecasting

标签预测期限悖论(Label Horizon Paradox):金融预测中监督目标的再思考

Chen-Hui Song, Shuoling Liu, Liyuan Chen

AI总结 本文提出标签预测期限悖论(Label Horizon Paradox),指出最优监督信号常偏离预测目标,并基于动态信噪比权衡理论,提出双层优化框架自动寻找最优代理标签,在金融数据集上取得一致改进。

详情
AI中文摘要

虽然深度学习通过复杂的架构革新了金融预测,但监督信号本身的设计却很少受到审视。我们挑战了训练标签必须严格反映推理目标的经典假设,揭示了标签预测期限悖论(Label Horizon Paradox):最优监督信号往往偏离预测目标,并在由市场动态决定的各中间预测期限之间移动。我们从理论上将这一现象归结为动态信噪比权衡,证明泛化取决于边际信号实现与噪声积累之间的竞争。为了将这一见解付诸实践,我们提出了一个双层优化框架,能够在单次训练运行中自主识别最优代理标签。在大型金融数据集上的大量实验表明,该方法相比传统基线取得了一致的改进,从而为金融预测中基于标签的研究开辟了新途径。

英文摘要

While deep learning has revolutionized financial forecasting through sophisticated architectures, the design of the supervision signal itself is rarely scrutinized. We challenge the canonical assumption that training labels must strictly mirror inference targets, uncovering the Label Horizon Paradox: the optimal supervision signal often deviates from the prediction goal, shifting across intermediate horizons governed by market dynamics. We theoretically ground this phenomenon in a dynamic signal-noise trade-off, demonstrating that generalization hinges on the competition between marginal signal realization and noise accumulation. To operationalize this insight, we propose a bi-level optimization framework that autonomously identifies the optimal proxy label within a single training run. Extensive experiments on large-scale financial datasets demonstrate consistent improvements over conventional baselines, thereby opening new avenues for label-centric research in financial forecasting.
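
To make the notion of a label horizon concrete, the sketch below builds candidate supervision targets as forward returns over several horizons from one price series. It covers only the construction of candidate proxy labels; the bi-level optimization that selects among them is not reproduced, and the price values and horizon set are invented for illustration.

```python
def forward_returns(prices, horizon):
    """Return over the next `horizon` steps, for every step where it is defined."""
    return [
        prices[t + horizon] / prices[t] - 1.0
        for t in range(len(prices) - horizon)
    ]

def candidate_labels(prices, horizons):
    """One label series per horizon; longer horizons yield fewer usable samples."""
    return {h: forward_returns(prices, h) for h in horizons}

prices = [100.0, 110.0, 121.0, 133.1]
labels = candidate_labels(prices, [1, 2])
```

A model whose evaluation target is the 2-step return could, under the paper's argument, be trained on the 1-step labels (or another intermediate horizon) if those carry a better balance of signal and accumulated noise.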

2602.08733 2026-06-09 cs.LG 版本更新

Foundation Inference Models for Ordinary Differential Equations

常微分方程的基础推理模型

Maximilian Mauel, Johannes R. Hübers, David Berghaus, Patrick Seifner, Ramses J. Sanchez

发表机构 * University of Cambridge(剑桥大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 提出FIM-ODE,一种预训练的基础推理模型,通过单次前向传播从含噪轨迹直接预测向量场,实现零样本性能匹配并超越ODEFormer,微调后优于现代神经和GP基线。

Comments Published in ICML 2026

详情
Journal ref
Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)
AI中文摘要

常微分方程(ODE)是科学建模的核心,但从含噪轨迹中推断其向量场仍然具有挑战性。当前的方法,如符号回归、高斯过程(GP)回归和神经常微分方程,通常需要复杂的训练流程和大量的机器学习专业知识,或者严重依赖于系统特定的先验知识。我们提出FIM-ODE,一种预训练的基础推理模型,通过单次前向传播直接从含噪轨迹数据预测向量场,从而摊销低维ODE推理。我们在具有低次多项式向量场的ODE先验分布上预训练FIM-ODE,并用神经算子表示目标场。FIM-ODE实现了强大的零样本性能,在多种设置下匹配并常常优于最近的预训练符号基线ODEFormer,尽管使用了更简单的预训练先验分布。预训练还为微调提供了强大的初始化,实现了快速且稳定的适应,在不需要机器学习专业知识的情况下优于现代神经和GP基线。

英文摘要

Ordinary differential equations (ODEs) are central to scientific modelling, but inferring their vector fields from noisy trajectories remains challenging. Current approaches such as symbolic regression, Gaussian process (GP) regression, and Neural ODEs often require complex training pipelines and substantial machine learning expertise, or they depend strongly on system-specific prior knowledge. We propose FIM-ODE, a pretrained Foundation Inference Model that amortises low-dimensional ODE inference by predicting the vector field directly from noisy trajectory data in a single forward pass. We pretrain FIM-ODE on a prior distribution over ODEs with low-degree polynomial vector fields and represent the target field with neural operators. FIM-ODE achieves strong zero-shot performance, matching and often improving upon ODEFormer, a recent pretrained symbolic baseline, across a range of regimes despite using a simpler pretraining prior distribution. Pretraining also provides a strong initialisation for finetuning, enabling fast and stable adaptation that outperforms modern neural and GP baselines without requiring machine learning expertise.
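
To make the data setup concrete, the sketch below generates one synthetic training pair of the kind the abstract describes: a vector field drawn from a low-degree polynomial family and a noisy trajectory integrated from it. The 1-D setting, the coefficient ranges, the RK4 integrator, the short time horizon, and all function names are illustrative assumptions, not details taken from the paper.

```python
import random

def poly_field(coeffs):
    """Return f(x) = sum_k coeffs[k] * x**k, a 1-D polynomial vector field."""
    def f(x):
        return sum(c * x**k for k, c in enumerate(coeffs))
    return f

def rk4_trajectory(f, x0, dt, steps):
    """Integrate dx/dt = f(x) with the classical fourth-order Runge-Kutta scheme."""
    xs = [x0]
    x = x0
    for _ in range(steps):
        k1 = f(x)
        k2 = f(x + 0.5 * dt * k1)
        k3 = f(x + 0.5 * dt * k2)
        k4 = f(x + dt * k3)
        x = x + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        xs.append(x)
    return xs

def sample_training_pair(rng, degree=2, steps=50, dt=0.01, noise_std=0.05):
    """Sample (noisy trajectory, coefficients) from a hypothetical prior.

    The horizon steps * dt is kept short so quadratic fields with
    coefficients in [-1, 1] cannot blow up before the trajectory ends.
    """
    coeffs = [rng.uniform(-1.0, 1.0) for _ in range(degree + 1)]
    x0 = rng.uniform(-1.0, 1.0)
    clean = rk4_trajectory(poly_field(coeffs), x0, dt, steps)
    noisy = [x + rng.gauss(0.0, noise_std) for x in clean]
    return noisy, coeffs
```

A model trained on many such pairs would take the noisy trajectory as input and the field (here, its coefficients) as the target; the neural-operator representation mentioned in the abstract is not reproduced here.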

2602.15253 2026-06-09 cs.LG q-bio.GN 版本更新

Scaling Laws for Masked-Reconstruction Transformers on Single-Cell Transcriptomics

单细胞转录组学中掩码重建Transformer的缩放定律

Ihor Kendiukhov

发表机构 * Department of Computer Science, University of Tübingen(图宾根大学计算机科学系)

AI总结 本研究首次系统探索单细胞RNA测序数据上掩码重建Transformer的缩放行为,发现数据充足时存在幂律缩放定律,数据稀缺时缩放可忽略,并指出数据-参数比是关键决定因素。

详情
AI中文摘要

神经缩放定律——损失、模型大小和数据之间的幂律关系——已在语言和视觉Transformer中得到广泛记录,但它们在单细胞基因组学中的存在性仍未得到充分探索。我们首次系统研究了在单细胞RNA测序(scRNA-seq)数据上训练的掩码重建Transformer的缩放行为。使用CELLxGENE Census的表达谱,我们构建了两种实验设置:数据丰富设置(512个高度可变基因,200,000个细胞)和数据有限设置(1,024个基因,10,000个细胞)。在参数数量跨越三个数量级(533到3.4×10^8个参数)的七种模型大小上,我们将参数化缩放定律拟合到验证均方误差(MSE)。数据丰富设置表现出清晰的幂律缩放,不可约损失下限c约为1.44,而数据有限设置显示出可忽略的缩放,表明当数据稀缺时模型容量不是约束条件。这些结果确立了类似于自然语言处理中观察到的缩放定律在单细胞转录组学中确实存在(当数据充足时),并确定了数据-参数比是缩放行为的关键决定因素。将数据丰富渐近下限初步转换为信息论单位,估计每个掩码基因位置约2.30比特熵。我们讨论了对单细胞基础模型设计的启示,并概述了完善该熵估计所需的额外测量。

英文摘要

Neural scaling laws -- power-law relationships between loss, model size, and data -- have been extensively documented for language and vision transformers, yet their existence in single-cell genomics remains largely unexplored. We present the first systematic study of scaling behaviour for masked-reconstruction transformers trained on single-cell RNA sequencing (scRNA-seq) data. Using expression profiles from the CELLxGENE Census, we construct two experimental regimes: a data-rich regime (512 highly variable genes, 200,000 cells) and a data-limited regime (1,024 genes, 10,000 cells). Across seven model sizes spanning three orders of magnitude in parameter count (533 to 3.4 x 10^8 parameters), we fit the parametric scaling law to validation mean squared error (MSE). The data-rich regime exhibits clear power-law scaling with an irreducible loss floor of c ~ 1.44, while the data-limited regime shows negligible scaling, indicating that model capacity is not the binding constraint when data are scarce. These results establish that scaling laws analogous to those observed in natural language processing do emerge in single-cell transcriptomics when sufficient data are available, and they identify the data-to-parameter ratio as a critical determinant of scaling behaviour. A preliminary conversion of the data-rich asymptotic floor to information-theoretic units yields an estimate of approximately 2.30 bits of entropy per masked gene position. We discuss implications for the design of single-cell foundation models and outline the additional measurements needed to refine this entropy estimate.
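
As a rough illustration of the quantities in this abstract, the sketch below evaluates a power law with an irreducible floor and converts an MSE floor to bits. Both the functional form L(N) = a * N^(-b) + c and the Gaussian-residual entropy formula are assumptions made here for illustration; the abstract gives only the floor c ~ 1.44 and the roughly 2.30-bit estimate, not how either was computed. The coefficients a and b are hypothetical.

```python
import math

def scaling_law(n_params, a, b, c):
    """Power law with an irreducible floor: L(N) = a * N**(-b) + c."""
    return a * n_params ** (-b) + c

def gaussian_entropy_bits(mse):
    """Differential entropy, in bits, of a Gaussian whose variance equals mse."""
    return 0.5 * math.log2(2.0 * math.pi * math.e * mse)

# Hypothetical a and b; only c = 1.44 is taken from the abstract.
smallest = scaling_law(533, a=5.0, b=0.3, c=1.44)
largest = scaling_law(3.4e8, a=5.0, b=0.3, c=1.44)
floor_bits = gaussian_entropy_bits(1.44)
```

Under this Gaussian assumption a floor of 1.44 corresponds to about 2.31 bits, which is close to the figure quoted in the abstract; the small gap may come from rounding of c or from a different conversion.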

2603.12666 2026-06-09 cs.LG cs.AI 版本更新

RetroReasoner: A Reasoning LLM for Strategic Retrosynthesis Prediction

RetroReasoner:一种用于策略性逆合成预测的推理LLM

Hanbum Ko, Chanhui Lee, Ye Rin Kim, Rodrigo Hormazabal, Sehui Han, Sungbin Lim, Sungwoong Kim

发表机构 * Department of Artificial Intelligence, Korea University(高丽大学人工智能系) Department of Statistics, Korea University(高丽大学统计系) Materials Intelligence Lab, LG AI Research(LG AI Research材料智能实验室)

AI总结 RetroReasoner 通过监督微调和强化学习,捕捉化学家基于断键策略的推理过程,提升逆合成预测的准确性和多样性。

Comments 35 pages, 19 figures

详情
AI中文摘要

逆合成预测旨在识别能够合成给定产物分子的反应物。尽管分子大语言模型(LLM)最近展示了有前景的结果,但大多数现有方法要么直接生成反应物,要么仅提供笼统的产物级分析,而没有对断键策略进行显式推理,以说明特定反应物选择的合理性。本文提出RetroReasoner,一种能够捕捉化学家基于断键策略的思维方式的逆合成推理模型。RetroReasoner通过监督微调和强化学习进行训练。在监督微调阶段,SyntheticRetro生成与反应物预测配对的结构化断键依据。在强化学习阶段,往返奖励将预测的反应物输入正向合成模型进行评估,并奖励能够重建原始产物的预测。RetroReasoner还可以集成到并行化的蒙特卡洛树搜索框架中,用于多步逆合成规划,从而在减少搜索时间的同时增加有效合成路径的数量和多样性。实验结果表明,RetroReasoner优于先前的基线,不仅包括分子LLM,还包括专门针对逆合成的专家模型,并能生成更广泛的可行反应物方案,在具有挑战性的反应实例上尤为明显。

英文摘要

Retrosynthesis prediction aims to identify reactants that can synthesize a given product molecule. Although molecular large language models (LLMs) have recently shown promising results, most existing methods either generate reactants directly or provide only generic product-level analysis, without explicitly reasoning about bond-disconnection strategies that justify specific reactant choices. This paper proposes RetroReasoner, a retrosynthetic reasoning model that captures chemists' strategic disconnection-based thinking. RetroReasoner is trained with supervised fine-tuning and reinforcement learning. For supervised fine-tuning, SyntheticRetro generates structured disconnection rationales paired with reactant predictions. For reinforcement learning, a round-trip reward evaluates predicted reactants by passing them through a forward synthesis model and rewarding predictions that reconstruct the original product. RetroReasoner can also be applied to multi-step retrosynthetic planning by incorporating it into a parallelized Monte Carlo tree search framework, reducing search time while increasing the number and diversity of valid synthetic pathways. Experimental results show that RetroReasoner outperforms prior baselines, including not only molecular LLMs but also retrosynthesis-specific expert models, and generates a broader range of feasible reactant proposals, especially for challenging reaction instances.
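
The round-trip reward described above can be sketched as follows. This is a minimal illustration, assuming a binary reward and exact string equality on SMILES; the lookup table stands in for a learned forward-synthesis model and is entirely hypothetical, as are the function names. A real pipeline would canonicalize SMILES before comparing.

```python
def round_trip_reward(product, predicted_reactants, forward_model):
    """Return 1.0 if the forward model rebuilds the product, else 0.0."""
    reconstructed = forward_model(frozenset(predicted_reactants))
    return 1.0 if reconstructed == product else 0.0

# Toy stand-in for a learned forward-synthesis model (hypothetical lookup table).
TOY_REACTIONS = {
    # acetic acid + ethanol -> ethyl acetate
    frozenset({"CC(=O)O", "OCC"}): "CC(=O)OCC",
}

def toy_forward_model(reactants):
    """Look up the product for a reactant set; None if the set is unknown."""
    return TOY_REACTIONS.get(reactants)
```

Using a set for the reactants makes the reward independent of the order in which the policy lists them.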

2603.24925 2026-06-09 cs.LG cs.CL cs.IR 版本更新

GraphER: An Efficient Graph-Based Enrichment and Reranking Method for Retrieval-Augmented Generation

GraphER:一种用于检索增强生成的高效图扩充与重排序方法

Ruizhong Miao, Yuying Wang, Rongguang Wang, Chenyang Li, Tao Sheng, Sujith Ravi, Dan Roth

发表机构 * Oracle AI

AI总结 GraphER通过利用数据组织结构捕捉超越语义相似性的接近关系,构建查询时的图结构并应用图排序技术,提升检索完整性,无需额外图基础设施,兼容标准向量存储。

详情
AI中文摘要

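依赖语义搜索的检索增强生成(RAG)系统往往无法为复杂查询检索到完整的证据集,当信息分布在多个来源时尤其如此。现有方法要么依赖迭代式的智能体检索,效率可能较低;要么维护知识图谱等额外结构,带来存储和维护开销。本文提出GraphER,一种基于图的扩充与重排序框架,它(1)利用数据的组织结构捕捉超越语义相似性的邻近关系,(2)在查询时基于这些邻近关系构建图,(3)应用基于图的排序来筛选出最优候选文档。在表格检索、多跳检索和长文档检索基准上的实验表明,检索完整性得到一致提升。此外,GraphER不需要额外的图基础设施,并可与标准向量存储无缝集成。该框架与检索器无关,支持多种形式的邻近关系,且只引入极小的查询时延迟。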

英文摘要

Retrieval-augmented generation (RAG) systems that rely on semantic search often fail to retrieve the complete set of evidence for complex queries, particularly when information is distributed across multiple sources. Existing approaches either rely on iterative agentic retrieval, which can be inefficient, or maintain additional structures such as knowledge graphs, which introduce storage and maintenance overhead. In this paper, we propose GraphER, a graph-based enrichment and reranking framework that (1) leverages the organizational structure of data to capture proximity relationships beyond semantic similarity, (2) constructs a graph at query time based on these proximities, and (3) applies graph-based ranking to surface the top candidate documents. Experiments across table retrieval, multi-hop retrieval, and long-document retrieval benchmarks demonstrate consistent improvements in terms of retrieval completeness. Additionally, GraphER requires no additional graph infrastructure and integrates seamlessly with standard vector stores. The framework is retriever-agnostic, supports multiple forms of proximity, and introduces minimal query-time latency.
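
The three steps in the abstract can be sketched in a few lines. The abstract says only "graph-based ranking", so the use of personalized PageRank here is an assumption, as are the node names, the damping factor, and the choice to restart in proportion to semantic scores. Scores are assumed non-negative with a positive sum.

```python
def graph_rerank(scores, edges, damping=0.85, iters=50):
    """Rerank candidates by personalized PageRank over a proximity graph.

    scores: dict node -> non-negative semantic similarity to the query.
    edges:  iterable of undirected (u, v) proximity links between candidates.
    """
    nodes = list(scores)
    total = sum(scores.values())
    restart = {n: scores[n] / total for n in nodes}
    neighbors = {n: set() for n in nodes}
    for u, v in edges:
        neighbors[u].add(v)
        neighbors[v].add(u)
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1.0 - damping) * restart[n] for n in nodes}
        for n in nodes:
            if neighbors[n]:
                share = damping * rank[n] / len(neighbors[n])
                for m in neighbors[n]:
                    nxt[m] += share
            else:
                # Dangling node: return its mass to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
        rank = nxt
    return sorted(nodes, key=lambda n: rank[n], reverse=True)
```

In a toy case where a table header has a low semantic score but is linked to a highly scored row of the same table, the header rises above an unrelated chunk with a slightly higher semantic score, which is the kind of completeness gain the abstract describes.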

2603.29237 2026-06-09 cs.LG cs.NA math.NA 版本更新

Stochastic Dimension Implicit Functional Projections for Global Integral Conservation in High-Dimensional PINNs

用于高维PINNs全局积分守恒的随机维度隐式泛函投影

Zhangyong Liang, Huanhuan Gao

发表机构 * National Center for Applied Mathematics, Tianjin University, China(中国天津大学应用数学国家中心) School of Mechanical and Aerospace Engineering, Jilin University, China(中国吉林大学机械与航空航天工程学院)

AI总结 本文提出SDIFP方法,通过对神经网络输出进行全局仿射修正,在高维PINNs中强制满足一阶和二阶空间矩约束,避免了张量积求积随维度扩展性差的问题,提高了计算效率。

详情
AI中文摘要

在无网格神经PDE求解器中,于高维域上强制满足预设的全局积分约束具有挑战性。现有的空间积分投影方法通常依赖固定网格或均匀求积,这可能与随机采样的物理信息神经网络(PINNs)相冲突,并且随维度增加扩展性差。高阶微分算子还会增加反向模式自动微分的内存开销。我们提出随机维度隐式泛函投影(SDIFP),一种在求积层面强制满足预设一阶和二阶空间矩的框架。SDIFP用对神经网络输出的全局仿射修正代替张量积节点投影,其中两个标量系数由加权求积规则确定。在目标方差为正且经验原始方差非零的条件下,该修正是在加权求积范数下到经验双矩约束集上的最近点投影。因此,预设的矩对于所选求积规则是精确成立的,而连续情形下的误差即为修正后场的求积误差。对于可分解的高维线性算子,SDIFP将仿射矩修正与随机算子子集采样相结合。在残差与导数独立采样且系数梯度估计条件无偏的情况下,所得估计量对于指定的基于求积的残差目标是无偏的;共享子集的快速模式一般是有偏的。SDIFP避免了为强制矩约束而使用张量积求积,将正向求积计算与反向模式计算图分离,并且在仿射系数确定或预先计算后保持逐点推断效率。

英文摘要

Enforcing prescribed global integral constraints in mesh-free neural PDE solvers is challenging in high-dimensional domains. Existing projection methods for spatial integrals are often tied to fixed grids or uniform quadrature, which can conflict with randomly sampled physics-informed neural networks (PINNs) and scale poorly with dimension. High-order differential operators also increase reverse-mode automatic differentiation memory costs. We propose Stochastic Dimension Implicit Functional Projection (SDIFP), a quadrature-level framework for enforcing prescribed first and second spatial moments. SDIFP replaces tensor-product nodal projection by a global affine correction of the neural-network output, with two scalar coefficients determined from a weighted quadrature rule. Under positive target variance and nonzero empirical raw variance, this correction is the nearest-point projection, in the weighted quadrature norm, onto the empirical two-moment constraint set. Thus, the prescribed moments are exact for the selected quadrature rule, while continuum errors are quadrature errors of the corrected field. For decomposable high-dimensional linear operators, SDIFP combines affine moment correction with stochastic operator-subset sampling. With independent residual and derivative sampling and conditionally unbiased coefficient-gradient estimation, the resulting estimator is unbiased for the specified quadrature-based residual objective; the shared-subset fast mode is biased in general. SDIFP avoids tensor-product quadrature for moment enforcement, separates forward quadrature evaluation from the reverse-mode graph, and retains pointwise inference efficiency once the affine coefficients are fixed or precomputed.
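
The global affine correction can be written out directly. The sketch below assumes the two prescribed moments are the weighted mean and variance of the output under normalized quadrature weights; the abstract does not state the exact moment definitions, and the function name is hypothetical. Taking the positive square root gives the member of the constraint set closest to the original output in the weighted norm.

```python
import math

def affine_moment_correction(u, w, target_mean, target_var):
    """Return (alpha, beta, corrected) with corrected = alpha * u + beta.

    u: network outputs at quadrature nodes; w: positive quadrature weights.
    The corrected values match target_mean and target_var under the
    normalized weights. Requires target_var > 0 and nonzero empirical variance.
    """
    total = sum(w)
    p = [wi / total for wi in w]  # normalized weights
    mean = sum(pi * ui for pi, ui in zip(p, u))
    var = sum(pi * (ui - mean) ** 2 for pi, ui in zip(p, u))
    if target_var <= 0.0 or var <= 0.0:
        raise ValueError("need positive target variance and nonzero empirical variance")
    alpha = math.sqrt(target_var / var)  # positive root keeps the field's orientation
    beta = target_mean - alpha * mean
    return alpha, beta, [alpha * ui + beta for ui in u]
```

Because only two scalars are computed, the cost is linear in the number of quadrature nodes, and once alpha and beta are fixed the corrected field can be evaluated pointwise.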

2604.07421 2026-06-09 cs.LG 版本更新

SPAMoE: Spectrum-Aware Hybrid Operator Framework for Full-Waveform Inversion

SPAMoE:一种频谱感知的混合运算框架用于全波形反演

Zhenyu Wang, Peiyuan Li, Yongxiang Shi, Ruoyu Wu, Chenfei Liao, Lei Zhang

发表机构 * China University of Mining and Technology - Beijing(中国矿业大学(北京)) City University of Hong Kong (Dongguan)(香港城市大学(东莞)) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 本文提出SPAMoE框架,通过频谱保护编码器和动态频谱分解路由机制,解决多尺度地质特征的频率纠缠问题,提升全波形反演的效率与稳定性。

详情
AI中文摘要

全波形反演(FWI)对于重建高分辨率地下速度模型至关重要,但计算成本高且问题不适定。尽管深度学习方法有望提高效率,但现有的卷积神经网络(CNNs)和单一范式神经算子(NOs)难以应对一个根本问题:多尺度地质特征的频率纠缠。为此,我们提出Spectral-Preserving Adaptive MoE(SPAMoE),一种新的频谱感知框架,用于求解具有复杂多尺度结构的反问题。我们的方法引入了Spectral-Preserving DINO编码器,对编码表示的高频与低频能量比施加下界,从而缓解高频坍缩并稳定后续的频域建模。此外,我们设计了一种新的频谱分解与路由机制,动态地将各频带分配给由FNO、MNO和LNO组成的混合专家(MoE)集合。在十个OpenFWI子数据集上的实验表明,SPAMoE相对于官方报告的最佳OpenFWI基线将平均MAE降低了44.4%,从而为基于学习的全波形反演建立了新的架构框架。我们的代码和数据可在 https://github.com/zhenyuwang12366/SPAMoE 获取。

英文摘要

Full-waveform inversion (FWI) is pivotal for reconstructing high-resolution subsurface velocity models but remains computationally intensive and ill-posed. While deep learning approaches promise efficiency, existing Convolutional Neural Networks (CNNs) and single-paradigm Neural Operators (NOs) struggle with one fundamental issue: frequency entanglement of multi-scale geological features. To address this challenge, we propose Spectral-Preserving Adaptive MoE (SPAMoE), a novel spectrum-aware framework for solving inverse problems with complex multi-scale structures. Our approach introduces a Spectral-Preserving DINO Encoder that enforces a lower bound on the high-to-low frequency energy ratio of the encoded representation, mitigating high-frequency collapse and stabilizing subsequent frequency-domain modeling. Furthermore, we design a novel Spectral Decomposition and Routing mechanism that dynamically assigns frequency bands to a Mixture-of-Experts (MoE) ensemble comprising FNO, MNO, and LNO. On the ten OpenFWI sub-datasets, experiments show that SPAMoE reduces the average MAE by 44.4% relative to the best officially reported OpenFWI baseline, thereby establishing a new architectural framework for learning-based full-waveform inversion. Our code and data are available at https://github.com/zhenyuwang12366/SPAMoE

2604.23053 2026-06-09 cs.LG math.OC 版本更新

ML-Guided Primal Heuristics for Mixed Binary Quadratic Programs

基于机器学习的混合二元二次规划的原始启发式方法

Weimin Huang, Natalie M. Isenberg, Ján Drgoňa, Draguna L Vrabie, Bistra Dilkina

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校)

AI总结 本文提出基于机器学习的混合二元二次规划求解启发式方法,通过改进神经网络架构和损失函数,提升求解效率和泛化能力。

详情
AI中文摘要

混合二元二次规划(MBQPs)是组合优化中一类重要而复杂的问题。由于求解大规模组合优化问题具有挑战性,人们开发了原始启发式方法,以便在短时间内快速找到高质量解。近来,越来越多的研究也利用机器学习来加速求解困难的组合优化问题。尽管这些ML引导的方法日益流行,但大部分工作集中在混合整数线性规划(MILPs)上。MBQPs的求解难点在于组合复杂性与非线性相互叠加。本文通过改编并扩展现有的ML引导MILP解预测工作,提出面向MBQPs的ML引导原始启发式方法。我们引入了用于MBQP解预测的新神经网络架构和新的训练数据收集流程。此外,我们扩展了解预测中现有的损失函数,提出将对比损失与加权交叉熵损失相结合。我们在标准和真实世界的MBQP基准上评估了这些方法,结果表明所开发的ML引导方法显著优于现有的原始启发式方法和最先进的求解器。此外,使用我们提出的组合损失扩展训练的模型优于其他由MILP改编而来的基于ML的方法,并在真实世界风电场布局优化问题的跨区域推理中提升了泛化能力。

英文摘要

Mixed Binary Quadratic Programs (MBQPs) are an important and complex set of problems in combinatorial optimization. As solving large-scale combinatorial optimization problems is challenging, primal heuristics have been developed to quickly identify high-quality solutions within a short amount of time. Recently, a growing body of research has also used machine learning to accelerate solution methods for challenging combinatorial optimization problems. Despite the increasing popularity of these ML-guided methods, a large body of work has focused on Mixed-Integer Linear Programs (MILPs). MBQPs are challenging to solve due to the combinatorial complexity coupled with nonlinearities. This work proposes ML-guided primal heuristics for Mixed Binary Quadratic Programs (MBQPs) by adapting and extending existing work on ML-guided MILP solution prediction to MBQPs. We introduce a new neural network architecture for MBQP solution prediction and a new training data collection procedure. Moreover, we extend existing loss functions in solution prediction and propose to combine contrastive and weighted cross-entropy losses. We evaluate the methods on standard and real-world MBQP benchmarks and show that the developed ML-guided methods significantly outperform existing primal heuristics and state-of-the-art solvers. Furthermore, models trained with our proposed extension with combined losses outperform other ML-based methods adapted from MILPs and improve generalization in cross-regional inference on a real-world wind farm layout optimization problem.

2604.24474 2026-06-09 cs.LG 版本更新

Advancing Ligand-based Virtual Screening and Molecular Generation with Pretrained Molecular Embedding Distance

通过预训练分子嵌入距离推进基于配体的虚拟筛选和分子生成

Shiyun Wa, Yifei Wang, Simone Sciabola, Ye Wang

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 本文提出预训练嵌入距离作为高效替代方案,用于虚拟筛选和分子生成,展示其在结构信息捕捉和相似性测量方面的有效性。

Comments Accepted by ICML 2026 AI4Science (https://openreview.net/forum?id=HbfrCipfNl). Code and data are available

详情
AI中文摘要

分子相似性在基于配体的药物发现中起核心作用,例如虚拟筛选、类似物搜索和目标导向的分子生成。然而,传统的相似性度量,从基于指纹的Tanimoto系数到3D形状叠合,往往在大规模应用时计算成本高昂,或依赖人工设计的分子描述符。同时,许多面向相似性感知设计的深度学习方法仍依赖针对相似性的监督或昂贵的数据整理,限制了它们在不同靶点上的通用性。在本工作中,我们提出预训练嵌入距离(PED)作为一种有效的替代方案,它直接由预训练分子模型计算得到,无需针对任务的训练。实验结果表明,PED与传统相似性度量呈现出不同的相关性,并且在虚拟筛选中的分子排序以及通过奖励设计引导分子生成两方面均表现良好。这些发现表明,预训练分子嵌入捕捉了丰富的结构信息,可以作为现代AI辅助药物发现中一种有前景且可扩展的相似性度量。

英文摘要

Molecular similarity plays a central role in ligand-based drug discovery, such as virtual screening, analog searching, and goal-directed molecular generation. However, traditional similarity measures, ranging from fingerprint-based Tanimoto coefficients to 3D shape overlays, are often computationally expensive at scale or rely on hand-crafted molecular descriptors. Meanwhile, many deep learning approaches to similarity-aware design still depend on similarity-specific supervision or costly data curation, limiting their generality across targets. In this work, we propose pretrained embedding distance (PED) as an effective alternative, computed directly from pretrained molecular models without task-specific training. Experimental results show that PED exhibits distinct correlations with traditional similarity metrics, and performs effectively in both ranking molecules for virtual screening and guiding molecular generation via reward design. These findings suggest that pretrained molecular embeddings capture rich structural information and can serve as a promising and scalable similarity measurement for modern AI-aided drug discovery.
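
A distance between pretrained embeddings can serve both uses named in the abstract, ranking and reward. The sketch below assumes cosine distance and a simple linear reward; the abstract does not specify which distance or reward shape is used, and the embeddings here are placeholder vectors rather than outputs of any real molecular model.

```python
import math

def cosine_distance(a, b):
    """1 - cosine similarity between two embedding vectors (range 0 to 2)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def rank_by_embedding_distance(query_emb, library):
    """Sort (name, embedding) pairs from most to least similar to the query."""
    return sorted(library, key=lambda item: cosine_distance(query_emb, item[1]))

def distance_reward(query_emb, candidate_emb):
    """Hypothetical generation reward in [0, 1]: higher when the candidate is closer."""
    return 1.0 - 0.5 * cosine_distance(query_emb, candidate_emb)
```

Since the embeddings come from a frozen pretrained model, screening a library reduces to one embedding pass plus a sort, with no task-specific training.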

2605.00647 2026-06-09 cs.LG 版本更新

Label-Conditioned Cross-Modal Fusion for Adult-to-Pediatric ECG Transfer via Curriculum-Gated Contrastive Alignment

通过课程门控对比对齐实现成人到儿童ECG迁移的标签条件跨模态融合

Xinran Liu, Yuwen Li, Hongxiang Gao, Heyang Xu, Jianqing Li, Zongmin Wang, Chengyu Liu

发表机构 * School of Instrument Science and Engineering, Southeast University(东南大学仪器科学与工程学院) Nanjing Medical University(南京医科大学) Zhengzhou University(郑州大学)

AI总结 本文提出PEACE框架,通过预训练和适应性融合提升儿童ECG诊断,采用对比学习和课程适应策略,在有限标注下实现高准确率。

详情
AI中文摘要

儿童心电图(ECG)的自动判读仍具挑战性,因为心率、间期和波形的发育性差异限制了主要在成人数据上训练的模型的可迁移性,同时经专家标注的儿童ECG队列十分稀缺。我们提出PEACE(Pediatric-Adult ECG Alignment via Cross-modal Enhancement,通过跨模态增强实现儿童-成人ECG对齐),一个在MIMIC-IV ECG上预训练并适配到儿童目标的成人到儿童ECG迁移框架。PEACE整合了标签特定的双向对比学习(LSBC),用于将ECG表示与诊断语义对齐,以及课程自适应融合(CAF),用于在有限的儿童监督下稳定优化。以标签为条件的短文本描述在训练期间提供辅助语义监督,而推理时仅需ECG信号。在ZZU-pECG上,PEACE在零样本、50样本和全量微调设置下的宏平均AUC分别达到59.39%、81.74%和91.56%,优于仅ECG、多模态以及包括DANN和MMD在内的通用领域自适应基线。在PTB-XL上,经全量微调后,它在九个映射发生率非零的统一标签上达到96.90%的宏平均AUC。基于梯度的注意力图显示,对于与心腔相关的RVH,QRS电压和形态区域的显著性增强;对于LQTS,QRS至T波/复极间期附近的显著性增强,这与常规判读中通常检查的ECG区域大体一致。这些结果表明,成人规模的ECG预训练结合节律、形态和ST-T复极语义描述,能够在标签稀缺的情况下改善可迁移的儿童诊断,同时保持临床上可解释的波形关注点。

英文摘要

Automated pediatric electrocardiogram (ECG) interpretation remains challenging because developmental differences in heart rate, intervals, and waveforms limit the transferability of models trained mainly on adult data, while expert-labeled pediatric ECG cohorts are scarce. We propose PEACE (Pediatric-Adult ECG Alignment via Cross-modal Enhancement), an adult-to-pediatric ECG transfer framework pretrained on MIMIC-IV ECGs and adapted to pediatric targets. PEACE integrates label-specific bidirectional contrastive learning (LSBC) to align ECG representations with diagnostic semantics and curriculum adaptive fusion (CAF) to stabilize optimization under limited pediatric supervision. Label-conditioned short text descriptors provide auxiliary semantic supervision during training, whereas inference requires ECG signals only. On ZZU-pECG, PEACE achieves macro-average AUCs of 59.39%, 81.74%, and 91.56% under zero-shot, 50-shot, and full fine-tuning settings, respectively, outperforming ECG-only, multimodal, and generic domain adaptation baselines including DANN and MMD. On PTB-XL, it reaches 96.90% macro-average AUC after full fine-tuning over nine harmonized labels with nonzero mapped incidence. Gradient-based attention maps show increased saliency around QRS voltage and morphology regions for chamber-related RVH and around QRS-to-T/repolarization intervals for LQTS, broadly consistent with ECG regions commonly inspected during routine interpretation. These results suggest that adult-scale ECG pretraining coupled with rhythm, morphology, and ST-T repolarization semantic descriptors improves transferable pediatric diagnosis under label scarcity while preserving clinically interpretable waveform focus.
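
The bidirectional alignment between ECG and text representations can be illustrated with a generic symmetric InfoNCE loss. This is a sketch under assumptions: the abstract does not give the LSBC or CAF formulas, so the loss below is the standard contrastive objective rather than the paper's label-specific variant, and the linear gate is a hypothetical stand-in for a curriculum schedule.

```python
import math

def _dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def _normalize(v):
    n = math.sqrt(_dot(v, v))
    return [x / n for x in v]

def _cross_entropy_rows(sim):
    """Mean of -log softmax(sim[i])[i]; matched pairs sit on the diagonal."""
    loss = 0.0
    for i, row in enumerate(sim):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(s - m) for s in row))
        loss += log_z - row[i]
    return loss / len(sim)

def bidirectional_contrastive_loss(ecg_embs, text_embs, temperature=0.1):
    """Symmetric InfoNCE: ECG-to-text plus text-to-ECG, averaged."""
    e = [_normalize(v) for v in ecg_embs]
    t = [_normalize(v) for v in text_embs]
    sim = [[_dot(ei, tj) / temperature for tj in t] for ei in e]
    sim_t = [list(col) for col in zip(*sim)]
    return 0.5 * (_cross_entropy_rows(sim) + _cross_entropy_rows(sim_t))

def curriculum_gate(step, warmup_steps):
    """Hypothetical linear gate in [0, 1] that phases in the auxiliary loss."""
    return min(1.0, step / warmup_steps)
```

A training loop might add `curriculum_gate(step, warmup_steps) * bidirectional_contrastive_loss(...)` to the classification loss, so the text supervision is introduced gradually; at inference only the ECG branch is needed.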

2605.01616 2026-06-09 cs.LG cs.AI cs.CY cs.NI 版本更新

Learning Behavioral Signals from Encrypted Smartphone Network Traffic

从加密智能手机网络流量中学习行为信号

Rameen Mahmood, Omar El Shahawy, Souptik Barua, Zachary Beattie, Jeffrey Kaye, Xuhai "Orson'' Xu, Chao-Yi Wu, Danny Yuxing Huang

发表机构 * New York University(纽约大学) NYU Langone Health(纽约大学朗格尼健康中心) NYU Grossman School of Medicine(纽约大学格罗斯曼医学院) Oregon Health & Science University(俄勒冈健康与科学大学) Columbia University(哥伦比亚大学) Harvard Medical School(哈佛医学院)

AI总结 本文利用基于Transformer的模型从加密网络流量中学习行为表征,结合用户特定适配器,并通过稀疏表示和广义估计方程分析,发现压力、孤独感和睡眠障碍分别与个体间差异、个体内波动及两者组合相关,且学习到的表征优于传统手工特征。

Comments 19 pages, 6 figures

详情
AI中文摘要

人类行为难以在大规模下连续测量,然而日常活动和幸福感的痕迹可能反映在与个人设备的交互中。我们研究加密的智能手机网络流量是否可以作为被动感知信号,用于检测与睡眠障碍、压力和孤独感相关的行为状态。为了捕捉群体层面的模式和个体特定的行为,我们采用基于Transformer的模型,该模型带有用户特定的适配器,学习网络活动的表征,同时考虑个人基线及其偏差。为了提高可解释性,我们进一步使用稀疏表示学习分析这些表征,以识别与不同活动模式相关的潜在行为特征。我们使用带有Mundlak分解的广义估计方程将所得特征与睡眠障碍、压力和孤独感联系起来,从而能够区分稳定的个体间差异和随时间变化的个体内变化。我们的分析揭示了这三种结果具有不同的时间动态:压力主要与持续的个体间变异相关,孤独感与个体内波动更密切相关,而睡眠障碍则反映了两者的结合。重要的是,这些个体内行为信号无法通过传统的手工网络流量特征恢复,这突显了学习表征在纵向行为建模中的优势。总体而言,我们的发现表明加密网络流量包含可解释的行为信息,并能够支持被动、可扩展的行为动态监测,特别是相对于个体典型活动模式的变化。

英文摘要

Human behavior is challenging to measure continuously at scale, yet traces of daily routines and well-being may be reflected in interactions with personal devices. We investigate whether encrypted smartphone network traffic can serve as a passive sensing signal for behavioral states related to sleep disturbance, stress, and loneliness. To capture both population-level patterns and individual-specific behavior, we employ a transformer-based model with user-specific adapters that learns representations of network activity while accounting for personal baselines and deviations from them. To improve interpretability, we further analyze these representations using sparse representation learning to identify latent behavioral features associated with distinct activity patterns. We relate the resulting features to sleep disturbance, stress, and loneliness using generalized estimating equations with Mundlak decomposition, enabling separation of stable between-person differences from within-person changes over time. Our analysis reveals that the three outcomes are characterized by different temporal dynamics: stress is predominantly associated with persistent between-person variation, loneliness is more strongly linked to within-person fluctuations, and sleep disturbance reflects a combination of both. Importantly, these within-person behavioral signals are not recovered by conventional handcrafted network-traffic features, highlighting the advantages of learned representations for longitudinal behavioral modeling. Overall, our findings demonstrate that encrypted network traffic contains interpretable behavioral information and can support passive, scalable monitoring of behavioral dynamics, particularly changes relative to an individual's typical pattern of activity.
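
The separation of between-person and within-person variation can be shown with a small helper. This sketch covers only the decomposition step, in the common form where each observation is split into the person's mean and the deviation from it; the function name and record layout are assumptions, and fitting the generalized estimating equations on the two resulting columns is not shown.

```python
from collections import defaultdict

def mundlak_decompose(records):
    """Split each observation into a between-person and a within-person part.

    records: list of (person_id, value) pairs for one feature.
    Returns a list of (person_id, person_mean, deviation_from_person_mean).
    """
    sums = defaultdict(float)
    counts = defaultdict(int)
    for pid, value in records:
        sums[pid] += value
        counts[pid] += 1
    means = {pid: sums[pid] / counts[pid] for pid in sums}
    return [(pid, means[pid], value - means[pid]) for pid, value in records]
```

Entering the person mean and the deviation as separate regressors lets a model attribute an outcome such as stress to stable differences between people and an outcome such as loneliness to day-to-day departures from a person's own baseline.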

2605.23247 2026-06-09 cs.LG 版本更新

Accelerating Divisible Load Processing Through Machine Learning: A Practical Framework for Large-Scale Workloads

通过机器学习加速可分负载处理:大规模工作负载的实用框架

Bharadwaj Veeravalli

发表机构 * Department of Electrical and Computer Engineering, National University of Singapore(电子与计算机工程系,新加坡国立大学)

AI总结 提出首个机器学习框架,使用前馈神经网络预测单级树网络架构中的最优处理时间,实现97-99%准确率和1-5%平均绝对百分比误差,推理时间小于1毫秒,相比传统方法加速10-100倍。

详情
AI中文摘要

本文介绍了首个用于在可分负载理论(DLT)范式下预测单级树网络(SLTN)架构最优处理时间的机器学习框架。我们使用具有16个工程特征的前馈神经网络(FNN),在100,000个合成生成的配置上训练模型,无需显式列出DLT方程即可预测最优处理时间。模型的R²达到97-99%,平均绝对百分比误差为1-5%,表明神经网络能够有效学习复杂的负载分配关系。特征重要性分析显示,模型隐式捕捉了DLT的数学结构,包括负载守恒和同时完成约束。推理时间低于1毫秒,该方法可作为传统DLT计算的可行替代方案,适用于实时调度、设计空间探索和云资源分配。该方法在多样化的系统配置(n=3到20,负载大小为1到100 GB)中泛化良好,精度一致,但在非常大或高度异构的系统中性能略有下降。本工作证明了使用机器学习加速分布式计算优化同时保持接近最优精度的可行性。

英文摘要

In this paper, we introduce the first machine learning framework for predicting optimal processing times in Single-Level Tree Network (SLTN) architectures for the Divisible Load Theory (DLT) paradigm. Using a feedforward neural network (FNN) with 16 engineered features, we train a model on 100,000 synthetically generated configurations to predict optimal processing times without explicit formulation of DLT equations. The model achieves 97-99% accuracy (R-squared) with a mean absolute percentage error of 1-5%, demonstrating that neural networks can effectively learn complex load distribution relationships. Feature importance analysis reveals that the model implicitly captures DLT mathematical structure, including load conservation and simultaneous finishing constraints. With inference times under 1 millisecond, the approach serves as a viable alternative to traditional DLT computation, enabling applications in real-time scheduling, design space exploration, and cloud resource allocation. The method generalizes well across diverse system configurations (n = 3 to 20, load size = 1 to 100 GB) with consistent accuracy, though performance degrades slightly for very large or highly heterogeneous systems. This work demonstrates the feasibility of using machine learning to accelerate distributed computing optimization while maintaining near-optimal accuracy.
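
For context, the target the network learns can be computed in closed form for simple cases. The sketch below uses one textbook single-level tree model, stated here as an assumption because the abstract does not spell out the paper's exact setup: the root only distributes load, sends fractions to children one at a time in a fixed order, and each child starts computing once its whole fraction has arrived. The two constraints named in the abstract, load conservation and simultaneous finishing, then determine the fractions.

```python
def sltn_optimal_schedule(w, z, t_cp=1.0, t_cm=1.0):
    """Load fractions and finish time for a single-level tree, unit total load.

    w[i]: inverse compute speed of child i; z[i]: inverse speed of its link.
    Children are served in index order (the order itself is not optimized).
    """
    n = len(w)
    # Simultaneous finishing between consecutive children:
    #   alpha[i] * w[i] * t_cp = alpha[i+1] * (z[i+1] * t_cm + w[i+1] * t_cp)
    ratios = [1.0]  # alpha[i] expressed relative to alpha[0]
    for i in range(n - 1):
        step = (w[i] * t_cp) / (z[i + 1] * t_cm + w[i + 1] * t_cp)
        ratios.append(ratios[-1] * step)
    total = sum(ratios)
    alpha = [r / total for r in ratios]  # load conservation: fractions sum to 1
    finish_time = alpha[0] * (z[0] * t_cm + w[0] * t_cp)
    return alpha, finish_time
```

For two identical children with w = z = 1, this gives fractions 2/3 and 1/3 and a finish time of 4/3: the first child receives more because the second must wait for the first transfer to complete. A learned predictor would map features of (w, z) directly to the finish time without running this recursion.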

2605.28912 2026-06-09 cs.LG cs.CR 版本更新

Cycle-Space Informed Detection of Autoencoded Blind False Data Injection Attacks on Power Systems

基于环空间感知的电力系统自编码器盲假数据注入攻击检测

Xin Li, Chenhan Xiao, Jonathan Cohen, Aviad Elyashar, Yang Weng, Rami Puzis

发表机构 * Faculty of Computer and Information Science, Ben-Gurion University, Be’er Sheva, Israel(以色列贝尔谢巴本·古里安大学计算机与信息科学学院)

AI总结 针对自编码器利用测量流形零空间生成的盲假数据注入攻击,提出基于拓扑环空间检测器,利用最小环基实现最优泛化误差,有效检测数据驱动攻击。

Comments 13 pages, 11 figures

详情
AI中文摘要

人工智能驱动的数据中心和大型储能系统的快速增长,使得电力系统运行越来越依赖实时测量数据和自动决策。然而,许多现有的检测方法依赖于对测量值的统计或数据驱动分析,当攻击者利用相同的数据结构构造隐蔽扰动时,这些方法可能会失效。为说明这一局限性,我们展示了一种盲假数据注入攻击(FDIA),其中自编码器学习测量流形并生成与雅可比零空间对齐的扰动,从而使得攻击能够逃避基于残差的坏数据检测器和时间序列异常检测器。为了缓解利用零空间的数据驱动FDIA,我们提出了一种拓扑感知的环空间检测器(CSD),该检测器利用网络的环空间施加结构约束,以增强零空间估计。此外,我们证明,通过使用最小环基(MCB),所提出的CSD实现了攻击检测的最优泛化误差。通过利用拓扑导出的环约束而不是仅仅依赖于数值零空间估计,所提出的方法不需要精确的线路参数,并改善了正常测量与受攻击测量之间的分离。在IEEE 14、30、57和118节点系统上的仿真结果表明,该方法在实际测量噪声下有效检测数据驱动FDIA。

英文摘要

The rapid growth of AI-driven data centers and large-scale energy storage systems is increasing the reliance of power system operation on real-time measurement data and automated decision-making. However, many existing detection methods rely on statistical or data-driven analysis of measurements and can fail when attackers exploit the same data structure to craft stealthy perturbations. To illustrate this limitation, we demonstrate a blind False Data Injection Attack (FDIA) in which an autoencoder learns the measurement manifold and generates perturbations aligned with the Jacobian null space, thereby allowing the attack to evade both residual-based bad-data detectors and time-series anomaly detectors. To mitigate data-driven FDIAs that exploit the null space, we propose a topology-informed Cycle-Space Detector (CSD) that leverages the cycle space of the network to impose structural constraints that enhance null space estimation. In addition, we prove that by using the Minimum Cycle Basis (MCB), the proposed CSD achieves the optimal generalization error for attack detection. By exploiting topology-derived cycle constraints rather than relying solely on numerical null space estimation, the proposed method does not require precise line parameters and improves the separation between normal and attacked measurements. Simulation results on IEEE 14-, 30-, 57-, and 118-bus systems demonstrate that the proposed method effectively detects data-driven FDIAs under realistic measurement noise.
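
The idea of a topology-derived cycle constraint can be illustrated with a generic consistency check: any per-line quantity that is a difference of node potentials must sum to zero around every cycle. The sketch below is not the paper's detector. It uses a fundamental cycle basis from a BFS spanning tree rather than the minimum cycle basis the abstract refers to, it assumes a connected graph, and the per-edge "drops" are hypothetical potential differences.

```python
from collections import deque

def fundamental_cycles(nodes, edges):
    """Return cycles as lists of (edge_index, sign) built from a BFS spanning tree."""
    adj = {n: [] for n in nodes}
    for idx, (u, v) in enumerate(edges):
        adj[u].append((v, idx))
        adj[v].append((u, idx))
    root = nodes[0]
    parent = {root: None}  # node -> (parent_node, edge_index)
    tree_edges = set()
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v, idx in adj[u]:
            if v not in parent:
                parent[v] = (u, idx)
                tree_edges.add(idx)
                queue.append(v)

    def path_to_root(n):
        """Signed edges walking from n up to the root."""
        path = []
        while parent[n] is not None:
            p, idx = parent[n]
            # Edge idx is stored as (a, b); walking a -> b counts as +1.
            path.append((idx, 1 if edges[idx][0] == n else -1))
            n = p
        return path

    cycles = []
    for idx, (u, v) in enumerate(edges):
        if idx in tree_edges:
            continue
        # Walk u -> v on the chord, v -> root, then root -> u.
        # Tree edges shared by both paths appear twice with opposite signs.
        cycle = [(idx, 1)] + path_to_root(v)
        cycle += [(i, -s) for i, s in path_to_root(u)]
        cycles.append(cycle)
    return cycles

def cycle_residuals(cycles, edge_drops):
    """Signed sum of per-edge drops around each cycle (zero if consistent)."""
    return [sum(s * edge_drops[i] for i, s in cycle) for cycle in cycles]
```

On a triangle with drops derived from node potentials the residual is zero, while perturbing a single edge value produces a residual equal to the perturbation. The check needs only the network topology, not line parameters.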

2606.05556 2026-06-09 cs.LG 版本更新

Field Validation of a Multi-Resolution ConvLSTM Framework for Retaining Wall Deformation Prediction

多分辨率ConvLSTM框架用于挡土墙变形预测的现场验证

Jihoon Kim, Saeyon Kim, Heejung Youn

发表机构 * Department of Civil and Environmental Engineering, Hongik University(Hongik大学土木与环境工程系)

AI总结 本研究通过现场数据验证了基于多分辨率卷积长短期记忆网络(ConvLSTM)的堆叠集成框架,该框架仅用数值模拟数据训练,能有效预测基坑开挖中挡土墙变形,平均绝对误差1.4 mm,决定系数0.93。

Comments 40 Pages, 15 figures

详情
AI中文摘要

本研究提出了一个多分辨率卷积长短期记忆网络(ConvLSTM)框架,用于预测分阶段开挖过程中挡土墙的变形,并进行了全面的现场验证。该框架基于高斯噪声增强的数值模拟进行训练,通过堆叠集成策略整合了在不同时间分辨率下运行的ConvLSTM模型。利用韩国11个开挖现场34个测斜仪的现场监测数据对提出的框架进行了验证。使用多个评估指标系统地评估了每个场地的预测性能,分析了时间变形不规则性和时空预测特征对模型性能的影响。结果表明,该框架能够预测额外开挖深度达5.0 m的挡土墙变形,在所有开挖现场的平均绝对误差为1.4 mm,决定系数为0.93。这些结果表明,尽管该框架仅基于数值模拟和增强数据库进行训练,但可以有效地应用于各种现场开挖条件,并在实际挡土墙变形预测中达到可靠的预测精度。

英文摘要

This study presents a comprehensive field validation of a multi-resolution Convolutional Long Short-Term Memory (ConvLSTM) framework for predicting retaining wall deformation during staged excavation. The framework is trained on Gaussian noise-augmented numerical simulations and integrates ConvLSTM models operating at different temporal resolutions through a stacking ensemble strategy. The proposed framework is validated using field monitoring data from 34 inclinometers across 11 excavation sites in South Korea. Site-wise prediction performance is systematically evaluated using multiple evaluation metrics, with analyses of the influence of temporal deformation irregularity and spatiotemporal prediction characteristics on model performance. The results demonstrate that the framework predicts retaining wall deformation associated with up to 5.0 m of additional excavation with an average mean absolute error of 1.4 mm and a coefficient of determination of 0.93 across the excavation sites. These results indicate that the framework, although trained exclusively on a numerically simulated and augmented database, can be effectively applied to diverse field excavation conditions and achieve a reliable level of prediction accuracy in practical retaining wall deformation prediction.
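
The two headline metrics and the ensemble step can be written compactly. This is a minimal sketch: the fixed-weight combination below is a simple stand-in for a stacking ensemble, whose meta-learner the abstract does not describe, and the function names are hypothetical.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between measured and predicted values."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - residual sum of squares / total sum of squares."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

def stack_predictions(per_model_preds, weights):
    """Weighted combination of predictions from models at different resolutions."""
    return [sum(w * p for w, p in zip(weights, preds)) for preds in zip(*per_model_preds)]
```

With inclinometer readings in millimetres, `mean_absolute_error` returns millimetres directly, which is how a figure such as 1.4 mm would be read.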

2606.05781 2026-06-09 cs.LG 版本更新

Domain-Adapted Small Language Models with Hybrid Post-Processing: Achieving Cost-Efficient, Low-Latency Multi-Label Structured Prediction via LoRA Fine-Tuning on Scarce Data

领域自适应的小语言模型与混合后处理:通过LoRA微调在稀缺数据上实现成本高效、低延迟的多标签结构化预测

Srinivasan Manoharan, Dilipkumar Nallusamy, Sachin Kumar, Haifeng Wu

发表机构 * GitHub

AI总结 提出一种结合LoRA微调的小语言模型(LLaMA 3.1 8B)和确定性规则后处理的混合框架,在仅219个样本上训练,实现多标签合规评估,达到100% JSON结构有效性和83.0%人工验证准确率,成本降低46-76%。

Comments 4 pages, 2 figures, 4 tables

详情
AI中文摘要

部署前沿大型语言模型(LLM)用于特定领域的结构化评估任务通常会带来显著的延迟、成本和数据隐私开销。我们提出了一种混合框架,结合了微调的小语言模型(LLaMA 3.1 8B,通过LoRA仅2.05%可训练参数)和确定性规则后处理层。该系统仅使用219个精心挑选的示例进行训练,应用于跨18个异构输出字段的对话转录多标签合规评估。在53个未见过的生产转录的盲评中,它实现了100%的JSON结构有效性、83.0%的人工验证总体准确率,以及最关键分类字段100%的准确率。所提出的方法形式化了混合神经符号分解,并引入了针对性的硬负例增强,以改善关键决策边界的性能。在单个NVIDIA A100 GPU上运行,推理完成约需2秒,比前沿模型API快2-5倍。每次评估成本仅为0.013美元,而专有替代方案为0.025-0.055美元,节省46-76%的成本。这些结果表明,领域自适应的小语言模型与确定性后处理相结合,可以在结构化合规评估中达到前沿模型的准确性,同时大幅降低运营成本、延迟和隐私风险。

英文摘要

Deploying frontier large language models (LLMs) for domain-specific structured evaluation tasks incurs prohibitive latency, cost, and data-privacy overhead. We present a hybrid framework that fine-tunes a small language model (LLaMA 3.1 8B, 2.05% trainable parameters via LoRA) on only 219 curated examples and couples it with a deterministic rule-based postprocessing layer. Applied to multi-label compliance evaluation of conversational transcripts (18 heterogeneous output fields), our system achieves 100% JSON structural validity, 83.0% human-validated overall accuracy, and 100% accuracy on the most critical classification field in blind evaluation on 53 unseen production transcripts. On a single NVIDIA A100 GPU, inference completes in approximately 2 seconds, 2-5x faster than frontier APIs, at USD 0.013 per evaluation versus USD 0.025-0.055 for proprietary alternatives, yielding 46-76% cost savings. We introduce targeted hard-negative augmentation for critical decision boundaries and formalize the hybrid neural-symbolic decomposition, demonstrating that domain-adapted small language models with postprocessing can match frontier model accuracy while dramatically reducing operational cost, latency, and privacy risk.
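
A deterministic postprocessing layer of the kind described might look like the sketch below. Everything specific here is hypothetical: the three field names, their allowed values, the defaults, and the escalation rule are invented for illustration, since the abstract does not list the 18 output fields or the actual rules.

```python
import json

# Hypothetical schema: field -> (allowed values or None for free text, default).
SCHEMA = {
    "disclosure_given": ({"yes", "no", "n/a"}, "n/a"),
    "escalation_required": ({"yes", "no"}, "no"),
    "notes": (None, ""),
}

def postprocess(raw_model_output):
    """Parse model output and repair it into a schema-valid dict."""
    try:
        data = json.loads(raw_model_output)
    except json.JSONDecodeError:
        data = {}
    if not isinstance(data, dict):
        data = {}
    result = {}
    for field, (allowed, default) in SCHEMA.items():
        value = data.get(field)
        if value is None:
            value = default
        if allowed is None:
            result[field] = str(value)
        else:
            normalized = str(value).strip().lower()
            result[field] = normalized if normalized in allowed else default
    # Example deterministic rule: a missing disclosure always forces escalation.
    if result["disclosure_given"] == "no":
        result["escalation_required"] = "yes"
    return result
```

Because the function always returns a dict with exactly the schema's keys and only allowed values, serializing it with `json.dumps` yields structurally valid output regardless of what the model produced, which is one way to reach the 100% structural validity reported.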

2606.05797 2026-06-09 cs.LG stat.ML 版本更新

Causal Longitudinal Prior-Fitted Networks for Counterfactual Outcome Prediction

因果纵向先验拟合网络用于反事实结果预测

Amirhossein Zare, Amirhessam Zare, Herlock Rahimi, Reza Salarikia, Mohammad Kashkooli

发表机构 * Yale University(耶鲁大学) University of California, Berkeley(加州大学伯克利分校)

AI总结 提出CausalLongPFN,一种基于先验拟合的上下文预测器,通过合成因果模型预训练实现无需梯度更新的纵向反事实结果预测,在多个基准上达到与领域训练模型竞争的性能。

Comments 31 pages, 10 tables

详情
AI中文摘要

纵向治疗决策需要预测未来治疗序列下的潜在结果,同时考虑时变混杂、异质性患者动态和有限的领域特定数据。现有的纵向因果估计器通常为每个队列或模拟器训练新模型。我们引入了因果纵向先验拟合网络(CausalLongPFN),一种用于纵向因果预测的先验拟合上下文预测器。该模型完全在从时间结构因果模型的广泛先验中采样的合成情节上进行预训练,使其暴露于治疗-混杂反馈、潜在异质性、非线性状态演化、延迟效应和累积治疗反应。在测试时,CausalLongPFN被冻结:它基于支持轨迹、查询历史和提出的未来治疗序列进行条件预测,返回未来结果的预测分布,无需梯度更新或倾向性模型拟合。通过在指定治疗序列下递归应用一步预测器获得多步预测。我们在具有真实反事实标签的可分支癌症、HIV和华法林基准上,以及在MIMIC-III ICU轨迹的仅事实滚动起点预测上进行评估。CausalLongPFN在反事实基准上与领域训练的纵向基线竞争,并在事实MIMIC-III预测上表现强劲,表明当重复的领域特定训练成本高昂或不可行时,广泛的合成因果预训练可以提供有用的冻结替代方案。

英文摘要

Longitudinal treatment decisions from multivariate time-series data require predicting potential outcomes under future treatment sequences in the presence of time-varying confounding, heterogeneous patient dynamics, and limited domain-specific data. Existing longitudinal causal estimators typically address this problem by training a new model for each cohort or simulator. We introduce Causal Longitudinal Prior-Fitted Networks (CausalLongPFN), a prior-fitted network for time-series causal inference in longitudinal treatment-response data and zero-shot in-context counterfactual outcome prediction. The model is pretrained entirely on synthetic episodes sampled from a broad prior over temporal structural causal models, exposing it to treatment-confounder feedback, latent heterogeneity, nonlinear state evolution, delayed effects, and cumulative treatment responses. At test time, CausalLongPFN remains frozen and is used zero-shot: it conditions on support trajectories, a query history, and a planned future treatment sequence, and returns a predictive distribution over future outcomes without gradient updates or propensity-model fitting. Multi-step predictions are obtained by recursively applying the one-step predictor under the specified treatment sequence. We evaluate the model on branchable cancer, HIV, and warfarin benchmarks with ground-truth counterfactual labels, and on factual-only rolling-origin prediction in MIMIC-III ICU trajectories. CausalLongPFN is competitive with domain-trained longitudinal baselines on counterfactual benchmarks and performs strongly on factual MIMIC-III prediction, suggesting that broad synthetic causal pretraining can provide a frozen, amortized alternative for zero-shot longitudinal treatment-response prediction when repeated domain-specific training is costly or impractical.
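
The recursive multi-step procedure mentioned above can be sketched as follows. This is a minimal illustration, assuming a one-step predictor that returns a point estimate (the paper's model returns a predictive distribution) and a history made of (treatment, outcome) pairs; `toy_predictor` is a made-up stand-in, not CausalLongPFN.

```python
from typing import Callable, List, Sequence, Tuple

# A history is a list of (treatment, outcome) pairs; both are floats here.
History = List[Tuple[float, float]]
OneStep = Callable[[History, float], float]


def rollout(predict_next: OneStep, history: History, plan: Sequence[float]) -> List[float]:
    """Apply a one-step predictor recursively under a planned treatment sequence."""
    current = list(history)  # copy so the caller's history is left untouched
    predictions = []
    for treatment in plan:
        outcome = predict_next(current, treatment)
        predictions.append(outcome)
        current.append((treatment, outcome))  # feed the prediction back in
    return predictions


def toy_predictor(history: History, treatment: float) -> float:
    """Hypothetical dynamics: outcome decays by 10%, treatment lowers it by 1."""
    last_outcome = history[-1][1]
    return 0.9 * last_outcome - 1.0 * treatment


if __name__ == "__main__":
    print(rollout(toy_predictor, [(0.0, 10.0)], [1.0, 0.0]))
```

Each predicted outcome is appended to the history before the next step, so errors can compound over long horizons; that is a general property of recursive rollouts rather than a claim about this paper's results.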

2606.06554 2026-06-09 cs.LG cs.AI 版本更新

Multi-Scale Feature Attention Network for Polymer Classification Using Terahertz Spectroscopy

基于多尺度特征注意力网络的太赫兹双梳光谱聚合物分类

Roshni Mahtani, Ilán Carretero, Laura Monroy, Aldo Moreno-Oyervides, Oscar Elías Bonilla-Manrique, Rocío del Amor

发表机构 * Instituto Universitario de Investigación e Innovación en Tecnología Centrada en el Ser Humano, HUMAN-tech, Universitat Politècnica de València(以人为本技术研究与创新大学研究所 HUMAN-tech,瓦伦西亚理工大学) Department of Electronic Technology, Universidad Carlos III de Madrid(马德里卡洛斯三世大学电子技术系) Artikode Intelligence S.L.

AI总结 提出多尺度特征注意力网络(MSFAN),结合特征门控和多尺度并行卷积,利用太赫兹双梳光谱对12种聚合物进行分类,准确率达85.2%。

Comments Accepted in EUSIPCO'26

详情
AI中文摘要

可靠的聚合物识别对于确保回收塑料的质量和安全至关重要,然而传统的分选和光谱技术往往难以提供稳健的区分。太赫兹双梳光谱(THz-DCS)提供了一种有前景的替代方案,能够实现快速、高分辨率且无损的测量。在这项工作中,我们利用THz-DCS对12种聚合物进行分类,包括纯聚合物、多层薄膜、商业混合物和生物聚合物。为了处理这些光谱信号的复杂性,我们提出了多尺度特征注意力网络(MSFAN),这是一种专为THz-DCS数据设计的新型深度学习架构。该框架集成了用于信号重校准的特征门控和多尺度并行卷积,以捕获不同的频率模式。这些特征通过交叉特征注意力和注意力池化进一步细化,使模型能够内在地突出最具信息量的太赫兹区域。MSFAN始终优于最先进的模型,分类准确率达到85.2%。本研究展示了将THz-DCS与深度学习技术相结合,用于有效、可扩展且可解释的聚合物分类的潜力。

英文摘要

Reliable polymer identification is essential for ensuring the quality and safety of recycled plastics, yet conventional sorting and spectroscopic techniques often struggle to deliver robust discrimination. Terahertz (THz) spectroscopy offers a promising alternative, providing high-resolution and non-destructive measurements. In this work, we leverage THz signals to classify 12 types of polymers, including pure polymers, multilayer films, commercial blends, and biopolymers. To handle the complexity of these spectral signals, we propose the Multi-Scale Feature Attention Network (MSFAN), a novel deep learning architecture tailored for THz data. The framework integrates feature gating for signal recalibration and multi-scale parallel convolutions to capture diverse frequency patterns. These features are further refined through cross-feature attention and attention pooling, enabling the model to intrinsically highlight the most informative THz regions. MSFAN consistently outperforms state-of-the-art models, reaching a classification accuracy of 85.2%. This study demonstrates the potential of combining THz spectroscopy with deep learning techniques for effective, scalable, and interpretable polymer classification.

2507.09092 2026-06-09 cs.CV cs.LG 版本更新

Analysis of Information Theory for Explainable AI

可解释人工智能的信息论分析

Ram S Iyer

发表机构 * Rajiv Gandhi Institute of Petroleum Technology(拉吉夫·甘地石油技术学院)

AI总结 提出基于互信息的激活映射方法MI CAM,以各特征图与输入图像的互信息为权重生成显著性可视化,并通过反事实分析验证其因果解释;性能与最先进方法相当,并在部分定性和定量指标上优于其中一些方法。

详情
AI中文摘要

随着机器视觉在医疗和自动化电厂等关键日常需求中的介入,卷积神经网络的内部机制以及网络提供特定推理的原因引起了关注。本文提出了一种新颖的基于激活映射的事后视觉解释方法,称为MI CAM。与之前基于类激活映射的方法不同,MI CAM通过每个特征图与输入图像的互信息对其进行加权,生成显著性可视化,最终结果由权重和激活图的线性组合产生。它还通过反事实分析验证了因果解释的生成。我们旨在展示MI CAM在模型推理过程中实现的视觉表现和无偏解释。我们的方法与所有最先进的方法相当,但在定性和定量度量上尤其优于其中一些方法。

英文摘要

With the intervention of machine vision in our crucial day to day necessities including healthcare and automated power plants, attention has been drawn to the internal mechanisms of convolutional neural networks, and the reason why the network provides specific inferences. This paper proposes a novel post-hoc visual explanation method called MI CAM based on activation mapping. Differing from previous class activation mapping based approaches, MI CAM produces saliency visualizations by weighing each feature map through its mutual information with the input image and the final result is generated by a linear combination of weights and activation maps. It also adheres to producing causal interpretations as validated with the help of counterfactual analysis. We aim to exhibit the visual performance and unbiased justifications for the model inferencing procedure achieved by MI CAM. Our approach works at par with all state-of-the-art methods but particularly outperforms some in terms of qualitative and quantitative measures.
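
The weighting scheme described above (each feature map weighted by its mutual information with the input, then linearly combined) can be illustrated with a small pure-Python sketch. It assumes feature maps already resized to the image resolution, values in [0, 1], and a simple histogram estimator of mutual information; the paper's estimator and preprocessing may differ.

```python
import math
from collections import Counter
from typing import List

Grid = List[List[float]]  # 2-D array, values assumed to lie in [0, 1]


def mutual_information(a: Grid, b: Grid, bins: int = 4) -> float:
    """Histogram estimate of I(A; B) in nats between two same-sized grids."""
    def quantize(v: float) -> int:
        return min(int(v * bins), bins - 1)

    pairs = [(quantize(x), quantize(y)) for ra, rb in zip(a, b) for x, y in zip(ra, rb)]
    n = len(pairs)
    joint = Counter(pairs)
    marg_a = Counter(p[0] for p in pairs)
    marg_b = Counter(p[1] for p in pairs)
    # sum over observed cells of p(i,j) * log(p(i,j) / (p(i) p(j)))
    return sum((c / n) * math.log(c * n / (marg_a[i] * marg_b[j]))
               for (i, j), c in joint.items())


def mi_weighted_map(image: Grid, feature_maps: List[Grid]) -> Grid:
    """Linear combination of feature maps, each weighted by its MI with the image."""
    weights = [mutual_information(fm, image) for fm in feature_maps]
    height, width = len(image), len(image[0])
    saliency = [[sum(w * fm[r][c] for w, fm in zip(weights, feature_maps))
                 for c in range(width)] for r in range(height)]
    peak = max(max(row) for row in saliency)
    # Normalise to [0, 1] for display; an all-zero map stays zero.
    return [[v / peak if peak > 0 else 0.0 for v in row] for row in saliency]
```

In this toy setting a feature map that is constant carries zero mutual information with the image and therefore contributes nothing to the final map, which is the intuition behind using mutual information as the weight.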

2508.00917 2026-06-09 cs.RO cs.CV cs.LG 版本更新

A Survey on Deep Multi-Task Learning in Connected Autonomous Vehicles

联网自动驾驶车辆中深度多任务学习综述

Jiayuan Wang, Farhad Pourpanah, Q. M. Jonathan Wu, Ning Zhang

发表机构 * Department of Electrical and Computer Engineering, University of Windsor(温莎大学电气与计算机工程系) Department of Electrical and Computer Engineering, Queen’s University(皇后大学电气与计算机工程系)

AI总结 综述联网自动驾驶车辆中深度多任务学习,涵盖感知、预测、规划、控制及V2X通信与资源管理,分析现有方法优缺点并指出未来方向。

详情
AI中文摘要

联网自动驾驶车辆(CAVs)必须同时执行多个任务,如感知、预测、规划和控制,以确保在复杂环境中安全可靠地导航。此外,通过车联万物(V2X)通信,可以实现CAVs之间的协同感知和驾驶,从而减轻单个车辆的局限性,同时也引入了严格的延迟、可靠性和带宽约束。传统上,任务使用单独的模型处理,这导致部署成本高、计算开销增加以及实现实时性能的挑战。多任务学习(MTL)最近成为一种有前景的解决方案,能够在统一模型中联合学习多个任务,从而提供更高的效率和资源利用率。据我们所知,本综述是首次专注于CAVs中深度MTL的全面回顾。我们首先概述CAVs和MTL以提供基础背景。然后,我们回顾了CAVs关键功能领域的MTL方法,包括感知、预测、规划、控制以及V2X通信和无线电资源管理(RRM)。对于前四个领域,我们将现有工作分为仅单车(车载)和V2X增强协同(多智能体)范式。我们进一步将V2X通信和RRM作为以通信为中心的MTL问题进行讨论。最后,我们讨论了现有方法的优势和局限性,识别了关键研究空白,并提供了旨在推进CAV系统MTL方法的未来研究方向。

英文摘要

Connected autonomous vehicles (CAVs) must simultaneously perform multiple tasks, such as perception, prediction, planning, and control, to ensure safe and reliable navigation in complex environments. Moreover, through vehicle-to-everything (V2X) communication, cooperative perception and driving among CAVs can be enabled, thereby mitigating the limitations of individual vehicles, while it also introduces stringent latency, reliability, and bandwidth constraints. Traditionally, tasks are addressed using separate models, which leads to high deployment costs, increased computational overhead, and challenges in achieving real-time performance. Multi-task learning (MTL) has recently emerged as a promising solution that enables the joint learning of multiple tasks within a unified model. This offers improved efficiency and resource utilization. To the best of our knowledge, this survey is the first comprehensive review focusing on deep MTL in CAVs. We begin with an overview of CAVs and MTL to provide foundational background. Then, we review MTL approaches across key functional domains in CAVs, including perception, prediction, planning, control, as well as V2X communications and radio resource management (RRM). For the first four domains, we categorize existing works under ego vehicle-only (onboard-only) and V2X-enhanced cooperative (multi-agent) paradigms. We further discuss V2X communications and RRM as communication-centric MTL problems. Finally, we discuss the strengths and limitations of existing methods, identify key research gaps, and provide future research directions aimed at advancing MTL methodologies for CAV systems.

2510.03389 2026-06-09 quant-ph cs.LG 版本更新

Quantum feature-map learning with reduced resource overhead

量子特征映射学习:降低资源开销

Jonas Jäger, Philipp Elsässer, Elham Torabian

发表机构 * Department of Computer Science and Institute of Applied Mathematics, University of British Columbia (UBC), Vancouver, B.C. V6T 1Z4, Canada(计算机科学系和应用数学研究所,不列颠哥伦比亚大学(UBC),温哥华,B.C. V6T 1Z4,加拿大) Stewart Blusson Quantum Matter Institute (QMI), Vancouver, B.C. V6T 1Z4, Canada(斯图尔特·布卢森量子物质研究所(QMI),温哥华,B.C. V6T 1Z4,加拿大) Institute of Physics, University of Freiburg, Freiburg (Breisgau), 79104, Germany(物理研究所,弗赖堡大学,弗赖堡(布赖斯高),79104,德国) Department of Chemistry, University of British Columbia (UBC), Vancouver, B.C. V6T 1Z1, Canada(化学系,不列颠哥伦比亚大学(UBC),温哥华,B.C. V6T 1Z1,加拿大)

AI总结 提出Q-FLAIR算法,通过部分解析重构将工作负载转移到经典计算机,显著降低量子资源开销,在真实IBM设备上仅用4小时即在完整MNIST数据集上达到90%以上准确率。

Comments 24 pages, 12 figures, 2 tables

详情
Journal ref
Phys. Rev. Research 8(2), 023247 (2026)
AI中文摘要

当前的量子计算机需要算法经济地使用有限资源。在量子机器学习中,成功取决于量子特征映射,它将经典数据嵌入到量子比特的状态空间中。我们引入了通过解析迭代重构的量子特征映射学习(Q-FLAIR),这是一种在迭代特征映射电路构建中减少量子资源开销的算法。它通过部分解析重构量子模型,仅使用少量评估就将工作负载转移到经典计算机上。对于每次探测到的门添加到拟设中,数据特征和权重参数的同时选择和优化则完全在经典计算机上进行。集成到量子神经网络和量子核支持向量分类器中,Q-FLAIR展示了最先进的基准性能。由于资源开销与特征维度解耦,我们在真实的IBM设备上仅用四小时就训练了一个量子模型,在完整分辨率MNIST数据集(784个特征,数字3 vs 5)上达到了超过90%的准确率。这样的结果以前是无法实现的,因为特征维度会极大地增加固定拟设的硬件需求以及自适应拟设的搜索成本。此外,Q-FLAIR展示了针对直接经典建模的去量子化鲁棒性,满足了文献中罕见的基准,这是潜在量子优势的必要条件。通过超越黑盒优化重新思考特征映射学习,这项工作为在现实问题和近期量子计算机上实现量子机器学习迈出了具体的一步。

英文摘要

Current quantum computers require algorithms that use limited resources economically. In quantum machine learning, success hinges on quantum feature-maps, which embed classical data into the state space of qubits. We introduce Quantum Feature-Map Learning via Analytic Iterative Reconstructions (Q-FLAIR), an algorithm that reduces quantum resource overhead in iterative feature-map circuit construction. It shifts workloads to a classical computer via partial analytic reconstructions of the quantum model, using only a few evaluations. For each probed gate addition to the ansatz, the simultaneous selection and optimization of the data feature and weight parameter is then entirely classical. Integrated into quantum neural network and quantum kernel support vector classifiers, Q-FLAIR shows state-of-the-art benchmark performance. Since resource overhead decouples from feature dimension, we train a quantum model on a real IBM device in only four hours, surpassing 90% accuracy on the full-resolution MNIST dataset (784 features, digits 3 vs 5). Such results were previously unattainable, as the feature dimension prohibitively drives hardware demands for fixed and search costs for adaptive ansätze. Furthermore, Q-FLAIR demonstrates de-quantization robustness against direct classical modeling, satisfying a benchmark rare in the literature and a necessary condition for potential quantum advantage. By rethinking feature-map learning beyond black-box optimization, this work takes a concrete step toward enabling quantum machine learning for real-world problems and near-term quantum computers.
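
To give a flavour of how a few evaluations can replace repeated quantum queries, the sketch below uses a well-known fact: for a single rotation gate generated by a Pauli-like operator, an expectation value is a sinusoid in the gate angle, f(θ) = a + b·cos θ + c·sin θ, so three evaluations determine it and the optimum follows classically. This is the generic single-parameter case, offered as an assumption-laden illustration; Q-FLAIR's partial reconstructions, which also involve the data features, are more elaborate.

```python
import math
from typing import Callable, Tuple


def reconstruct_sinusoid(f: Callable[[float], float]) -> Tuple[float, float, float]:
    """Recover (a, b, c) of f(t) = a + b*cos(t) + c*sin(t) from three evaluations."""
    f_zero = f(0.0)               # a + b
    f_plus = f(math.pi / 2)       # a + c
    f_minus = f(-math.pi / 2)     # a - c
    a = 0.5 * (f_plus + f_minus)
    c = 0.5 * (f_plus - f_minus)
    b = f_zero - a
    return a, b, c


def argmin_sinusoid(a: float, b: float, c: float) -> Tuple[float, float]:
    """Return (theta*, f(theta*)) minimising a + b*cos(t) + c*sin(t)."""
    # Write b*cos(t) + c*sin(t) as R*cos(t - phase); the minimum is at phase + pi.
    phase = math.atan2(c, b)
    theta = phase + math.pi
    theta = math.atan2(math.sin(theta), math.cos(theta))  # wrap into (-pi, pi]
    return theta, a - math.hypot(b, c)
```

In this picture the function `f` stands for a device evaluation; once the three coefficients are known, any number of candidate angles can be scored on a classical computer without further circuit runs.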

2511.07280 2026-06-09 econ.GN cs.IR cs.LG q-fin.EC 版本更新

The Value of Personalized Recommendations: Evidence from Netflix

个性化推荐的价值:来自Netflix的证据

Kevin Zielnicki, Guy Aridor, Aurélien Bibaut, Allen Tran, Winston Chou, Nathan Kallus

发表机构 * Netflix Kellogg School of Management, Northwestern University(西北大学凯洛格管理学院)

AI总结 本文通过Netflix观众数据,构建离散选择模型评估个性化推荐的价值,发现替换推荐算法会降低用户参与度和消费多样性,且有效推荐主要来自精准定位而非机械曝光。

详情
AI中文摘要

个性化推荐系统塑造了用户在线选择的大部分内容,然而其针对性使得区分推荐本身的价值与底层商品的价值具有挑战性。我们构建了一个嵌入推荐诱导效用、低秩异质性和灵活状态依赖的离散选择模型,并将其应用于Netflix的观看数据。我们利用推荐算法引入的特异性变化来识别并分别评估这些组成部分,同时恢复出不依赖模型的分流比率,用以验证我们的结构模型。我们使用该模型评估反事实场景,量化个性化推荐产生的增量参与度。首先,我们显示,用矩阵分解或基于流行度的算法取代当前推荐系统会导致参与度分别减少4%和12%,并降低消费多样性。其次,推荐带来的消费增长大部分来自有效的定位,而非机械曝光,其中中等流行商品(而非广泛流行或非常小众的商品)的收益最大。

英文摘要

Personalized recommendation systems shape much of user choice online, yet their targeted nature makes separating out the value of recommendation and the underlying goods challenging. We build a discrete choice model that embeds recommendation-induced utility, low-rank heterogeneity, and flexible state dependence and apply the model to viewership data at Netflix. We exploit idiosyncratic variation introduced by the recommendation algorithm to identify and separately value these components as well as to recover model-free diversion ratios that we can use to validate our structural model. We use the model to evaluate counterfactuals that quantify the incremental engagement generated by personalized recommendations. First, we show that replacing the current recommender system with a matrix factorization or popularity-based algorithm would lead to 4% and 12% reduction in engagement, respectively, and decreased consumption diversity. Second, most of the consumption increase from recommendations comes from effective targeting, not mechanical exposure, with the largest gains for mid-popularity goods (as opposed to broadly appealing or very niche goods).
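
As a toy illustration of a discrete choice model with a recommendation-induced utility term, the sketch below uses a plain multinomial logit with an outside option. The additive form, the single coefficient `rec_effect`, and all numbers are assumptions made for illustration; the paper's model additionally includes low-rank heterogeneity and state dependence, which are omitted here.

```python
import math
from typing import List, Sequence


def choice_probabilities(base_utility: Sequence[float],
                         recommended: Sequence[int],
                         rec_effect: float) -> List[float]:
    """Multinomial-logit choice probabilities with an outside option of utility 0."""
    utils = [u + rec_effect * r for u, r in zip(base_utility, recommended)]
    shift = max(utils + [0.0])  # subtract the max for numerical stability
    exp_utils = [math.exp(u - shift) for u in utils]
    denom = sum(exp_utils) + math.exp(0.0 - shift)  # outside option: watch nothing
    return [e / denom for e in exp_utils]


def engagement(probabilities: Sequence[float]) -> float:
    """Probability that the user picks any title rather than the outside option."""
    return sum(probabilities)


if __name__ == "__main__":
    base = [1.0, 0.0]  # hypothetical intrinsic appeal of two titles
    with_rec = choice_probabilities(base, [0, 1], rec_effect=1.0)
    without_rec = choice_probabilities(base, [0, 0], rec_effect=1.0)
    print(engagement(with_rec), engagement(without_rec))
```

Comparing engagement with and without the recommendation indicator is the simplest form of the counterfactual exercise described above: the same fitted utilities are re-evaluated under a different recommendation policy.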

2601.05261 2026-06-09 cs.IR cs.LG 版本更新

Improving User Experience with Personalized Review Ranking and Summarization

通过个性化评论排名和摘要提升用户体验

Muhammad Jawad Mufti, Omar Hammad, MD. Mahfuzur Rahman

发表机构 * Information and Computer Science Dept., King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学信息与计算机科学系) Interdisciplinary Research Center for Intelligent Secure Systems (IRC-ISS), King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学智能安全系统跨学科研究中心(IRC-ISS))

AI总结 提出融合用户偏好建模、混合情感估计、方面级评论匹配和LLM摘要的个性化评论排名与摘要框架,在亚马逊数据集和用户研究中优于现有方法。

详情
AI中文摘要

在线消费者评论是电子商务中重要的决策支持资源,然而日益增长的评论量常常造成信息过载,使用户难以识别符合个人偏好的内容。现有的评论排名方法通常依赖星级评分、有用性投票或时效性等聚合信号,这些可能无法反映用户特定兴趣。本文提出了一种个性化评论排名和摘要框架,融合了用户偏好建模、混合情感估计、方面级评论匹配和基于大语言模型(LLM)的摘要。该框架首先从历史评论中提取方面级偏好和情感信号,然后结合用户选择的产品方面和书面评论输入来构建个性化用户画像。通过比较该画像与评论级别的方面和情感表示,对候选评论进行排名。随后对排名靠前的评论进行摘要,以提供简洁且符合偏好的信息。该方法使用亚马逊移动电子产品评论数据集和一项涉及70名参与者的结构化用户研究(涵盖常见消费电子产品类别)进行评估。结果表明,所提出的排名方法优于随机排序、基于星级评分、有用性投票、时效性和语义相似度的排名。用户研究结果进一步表明,该方法在满意度、感知相关性、决策信心、信息查找便捷性和阅读效率方面均有提升。研究结果表明,结合方面级个性化、情感感知排名和基于LLM的摘要可以减少评论过载,支持更高效的用户中心决策。

英文摘要

Online consumer reviews are important decision-support resources in e-commerce, yet the increasing volume of reviews often creates information overload and makes it difficult for users to identify content that matches their individual preferences. Existing review-ranking approaches commonly rely on aggregate signals such as star ratings, helpfulness votes, or recency, which may not reflect user-specific interests. This paper proposes a personalized review ranking and summarization framework that integrates user preference modeling, hybrid sentiment estimation, aspect-level review matching, and Large Language Model (LLM)-based summarization. The framework first extracts aspect-level preferences and sentiment signals from historical reviews. It then incorporates user-selected product aspects and written review input to build a personalized user profile. Candidate reviews are ranked by comparing this profile with review-level aspect and sentiment representations. The top-ranked reviews are then summarized to provide concise, preference-aligned information. The proposed method was evaluated using an Amazon Mobile Electronics review dataset and a structured user study involving 70 participants across common consumer electronics categories. Results show that the proposed ranking method outperformed random ordering, star-rating-based ranking, helpfulness-vote ranking, recency-based ranking, and semantic-similarity-based ranking. User-study results further indicate improvements in satisfaction, perceived relevance, decision-making confidence, ease of finding information, and reading efficiency. The findings suggest that combining aspect-level personalization, sentiment-aware ranking, and LLM-based summarization can reduce review overload and support more efficient user-centered decision-making.

2601.15408 2026-06-09 cs.CV cs.AI cs.CL cs.LG 版本更新

CURE: Curriculum-guided Multi-task Training for Reliable Anatomy Grounded Report Generation

CURE:基于课程引导的多任务训练实现可靠的解剖学接地报告生成

Pablo Messina, Andrés Villa, Juan León Alcázar, Karen Sánchez, Carlos Hinojosa, Denis Parra, Álvaro Soto, Bernard Ghanem

发表机构 * Pontificia Universidad Católica de Chile(智利天主教大学) CENIA iHEALTH KAUST(阿卜杜拉国王科技大学)

AI总结 提出CURE框架,通过课程学习动态调整多任务训练,提升医学报告生成的视觉接地准确性和事实一致性,无需额外数据。

Comments 31 pages, 7 figures, accepted to CVPR 2026 (oral)

详情
Journal ref
Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026, pp. 36279-36289
AI中文摘要

医学视觉语言模型可以自动生成放射学报告,但在精确的视觉接地和事实一致性方面存在困难。现有模型常常将文本发现与视觉证据错误对齐,导致不可靠或弱接地的预测。我们提出CURE,一个错误感知的课程学习框架,无需任何额外数据即可改善接地和报告质量。CURE在短语接地、接地报告生成和解剖学接地报告生成上,使用公共数据集微调多模态指令模型。该方法基于模型性能动态调整采样,强调困难样本以改善空间和文本对齐。CURE将接地准确率提高了+0.35 IoU,报告质量提高了+0.192 CXRFEScore,并将幻觉减少了18.6%。CURE是一个数据高效的框架,增强了接地准确性和报告可靠性。代码见 https://github.com/PabloMessina/CURE ,模型权重见 https://huggingface.co/pamessina/medgemma-4b-it-cure

英文摘要

Medical vision-language models can automate the generation of radiology reports but struggle with accurate visual grounding and factual consistency. Existing models often misalign textual findings with visual evidence, leading to unreliable or weakly grounded predictions. We present CURE, an error-aware curriculum learning framework that improves grounding and report quality without any additional data. CURE fine-tunes a multimodal instructional model on phrase grounding, grounded report generation, and anatomy-grounded report generation using public datasets. The method dynamically adjusts sampling based on model performance, emphasizing harder samples to improve spatial and textual alignment. CURE improves grounding accuracy by +0.35 IoU, boosts report quality by +0.192 CXRFEScore, and reduces hallucinations by 18.6%. CURE is a data-efficient framework that enhances both grounding accuracy and report reliability. Code is available at https://github.com/PabloMessina/CURE and model weights at https://huggingface.co/pamessina/medgemma-4b-it-cure
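
The abstract says sampling is adjusted dynamically based on model performance but does not give the rule. One simple possibility, shown purely as an assumption, is to sample each item with probability proportional to its current error (one minus its score), with a small floor so that well-learned items are still revisited. The task names and scores below are hypothetical.

```python
import random
from typing import Dict


def sampling_weights(scores: Dict[str, float], floor: float = 0.05) -> Dict[str, float]:
    """Turn per-item scores in [0, 1] into sampling probabilities that favour errors."""
    errors = {name: max(1.0 - score, floor) for name, score in scores.items()}
    total = sum(errors.values())
    return {name: err / total for name, err in errors.items()}


def sample_item(weights: Dict[str, float], rng: random.Random) -> str:
    """Draw one item according to the current curriculum weights."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    # Hypothetical validation scores after some training step.
    scores = {"phrase_grounding": 0.8, "grounded_report": 0.5, "anatomy_report": 0.7}
    weights = sampling_weights(scores)
    rng = random.Random(0)
    print(weights, [sample_item(weights, rng) for _ in range(5)])
```

Recomputing the weights after each evaluation round makes the curriculum adaptive: as an item's score improves, its share of the training batches shrinks toward the floor.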

2602.08916 2026-06-09 cs.SC cs.ET cs.LG 版本更新

AMS-HD: Hyperdimensional Computing for Real-Time and Energy-Efficient Acute Mountain Sickness Detection

AMS-HD:面向实时、节能急性高山病检测的超维计算

Abu Masum, Mehran Moghadam, M. Hassan Najafi, Bige Unluturk, Ulkuhan Guler, Beth A. Beidleman, Sercan Aygun

发表机构 * School of Computing and Informatics, University of Louisiana at Lafayette(路易斯安那大学拉斐特分校计算与信息学院) Department of Electrical, Computer, and Systems Engineering, Case Western Reserve University(凯斯西储大学电气、计算机与系统工程系) Electrical and Biomedical Engineering, Michigan State University(密歇根州立大学电气与生物医学工程系) Electrical and Computer Engineering Department, Worcester Polytechnic Institute(伍斯特理工学院电气与计算机工程系) US Army Research Institute of Environmental Medicine(美国陆军环境医学研究所)

AI总结 本文提出AMS-HD框架,利用超维计算实现实时急性高山病检测,通过互信息特征选择、超向量编码和位置投影提升分类效率,在多种平台上实现高准确率和低能耗。

详情
AI中文摘要

目标:急性高山病(AMS)是最常见的高原疾病,多发于未经习服者上升至海拔2500米以上时,并可能进展为危及生命的脑水肿或肺水肿。基于可穿戴生理信号的传统机器学习(ML)AMS检测方法往往难以满足连续监测对实时性和硬件效率的要求。方法:本文提出AMS-HD,首个基于超维计算(HDC)的实时AMS检测框架,既包含面向移动平台的高层双极(-1/+1)计算,也包含面向FPGA和ASIC的底层二进制(0/1)计算。框架整合互信息特征选择、超向量编码和位置投影以提高分类效率。验证覆盖ARM、FPGA和智能手表-智能手机平台,使用可穿戴设备可获取的血氧饱和度(SpO2)和心率信号。结果:AMS-HD在二分类和多分类中均达到或优于SVM和MLP基线,二分类准确率最高91%、F1分数最高90%,在外部AMS相关数据集上准确率最高85%。在FPGA上,AMS-HD将LUT和触发器用量分别降低7.3倍和5.8倍,功耗比MLP低3.9倍。在移动平台上,AMS-HD每次会话仅消耗1%电量、占用60字节内存、推理时间2.50毫秒,能耗比SVM低约2倍、比MLP低3倍以上。结论:AMS-HD为实时AMS监测提供了一种可扩展、硬件感知的传统ML替代方案,以显著更低的资源消耗取得有竞争力的性能。意义:本文首次提出用于高原病检测的完整HDC框架,衔接可穿戴推理与底层硬件部署,服务于资源受限的健康监测。

英文摘要

Objective: Acute mountain sickness (AMS) is the most prevalent altitude illness, affecting unacclimatized individuals ascending above 2,500 m and potentially escalating to life threatening cerebral or pulmonary edema. Conventional machine learning (ML) methods for AMS detection from wearable physiological signals often fail to meet real-time hardware efficiency requirements of continuous monitoring. Methods: We present AMS-HD, the first hyperdimensional computing (HDC)-based framework for real-time AMS detection, spanning high-level bipolar (-1/+1) computing for mobile platforms and low-level binary (0/1) computing for FPGA and ASIC targets. The framework integrates mutual information feature selection, hypervector encoding, and positional projection to enhance classification efficiency. Validation spans ARM, FPGA, and smartwatch-smartphone platforms using wearable-accessible SpO2 and heart rate signals. Results: AMS-HD matches or outperforms SVM and MLP baselines in both binary and multiclass classification, achieving up to 91% accuracy and 90% F1-score in binary classification, and up to 85% accuracy on external AMS-related datasets. On FPGA, AMS-HD reduces LUT and flip-flop usage by 7.3x and 5.8x, while consuming 3.9x less power than MLP. On mobile platforms, AMS-HD requires only 1% battery per session, 60 Bytes of memory, and 2.50 ms inference time -- approximately 2x and more than 3x lower energy consumption than SVM and MLP. Conclusion: AMS-HD provides a scalable, hardware-aware alternative to conventional ML for real-time AMS monitoring, achieving competitive performance with substantially lower resource consumption. Significance: This work presents the first complete HDC framework for altitude sickness detection, bridging wearable inference and low-level hardware deployment for resource-constrained health monitoring.
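
For readers unfamiliar with hyperdimensional computing, the following is a minimal bipolar (-1/+1) sketch of the encode-bundle-compare pattern. It is an assumption-laden illustration, not the AMS-HD pipeline: the dimensionality, the independent random level vectors, the integer (non-binarised) class accumulators, and the toy SpO2/heart-rate levels are all choices made here, and the paper's feature selection and positional projection are left out.

```python
import math
import random
from typing import Callable, Dict, List, Sequence

DIM = 2048  # hypervector dimensionality (an assumed value)


def random_hv(rng: random.Random) -> List[int]:
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def make_encoder(n_features: int, n_levels: int, seed: int = 0) -> Callable[[Sequence[int]], List[int]]:
    rng = random.Random(seed)
    feature_ids = [random_hv(rng) for _ in range(n_features)]
    level_hvs = [random_hv(rng) for _ in range(n_levels)]

    def encode(levels: Sequence[int]) -> List[int]:
        """Bind each feature ID with its quantised level, then sum (bundle)."""
        total = [0] * DIM
        for feat, lvl in enumerate(levels):
            fid, lhv = feature_ids[feat], level_hvs[lvl]
            for d in range(DIM):
                total[d] += fid[d] * lhv[d]  # binding = element-wise product
        return total

    return encode


def cosine(a: Sequence[int], b: Sequence[int]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def train(encode, samples: Dict[str, List[Sequence[int]]]) -> Dict[str, List[int]]:
    """Class prototype = sum of the encoded training samples of that class."""
    prototypes = {}
    for label, rows in samples.items():
        acc = [0] * DIM
        for row in rows:
            for d, v in enumerate(encode(row)):
                acc[d] += v
        prototypes[label] = acc
    return prototypes


def classify(encode, prototypes: Dict[str, List[int]], levels: Sequence[int]) -> str:
    query = encode(levels)
    return max(prototypes, key=lambda label: cosine(prototypes[label], query))
```

A possible usage, with inputs given as `[SpO2 level, heart-rate level]` quantised to 0..3: train on `{"ams": [[0, 3], [0, 2], [1, 3]], "healthy": [[3, 0], [3, 1], [2, 0]]}` and classify a new reading. Training is a single pass of additions and inference is one similarity comparison per class, which is the property that makes this family of methods attractive on constrained hardware.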

2602.23234 2026-06-09 cs.IR cs.AI cs.LG 版本更新

Scaling Search Relevance: Augmenting App Store Ranking with LLM-Generated Judgments

扩展搜索相关性:用LLM生成的判断增强应用商店排名

Evangelia Christakopoulou, Vivekkumar Patel, Hemanth Velaga, Sandip Gaikwad, Sean Suchter, Venkat Sundaranatha

发表机构 * Apple(苹果公司)

AI总结 针对应用商店排名中专家文本相关性标签稀缺的问题,通过微调LLM生成数百万标签,结合行为相关性优化排序器,显著提升Pareto前沿和转化率。

详情
AI中文摘要

大规模商业搜索系统优化相关性以驱动成功的会话,帮助用户找到他们想要的内容。为了最大化相关性,我们利用两个互补的目标:行为相关性(用户倾向于点击或下载的结果)和文本相关性(结果与查询的语义匹配)。一个持续的挑战是,相对于丰富的行为相关性标签,专家提供的文本相关性标签稀缺。我们首先通过系统评估LLM配置来解决这个问题,发现一个专门的、微调的模型在提供高度相关的标签方面显著优于一个更大的预训练模型。使用这个最优模型作为力量倍增器,我们生成了数百万个文本相关性标签以克服数据稀缺性。我们展示了用这些文本相关性标签增强我们的生产排序器会导致Pareto前沿显著外移:离线NDCG在行为相关性上改善,同时在文本相关性上也提高。这些离线收益通过在全球应用商店排序器上的A/B测试得到验证,该测试显示转化率统计上显著提高了+0.24%,其中最大的性能提升出现在尾部查询中,新的文本相关性标签在缺乏可靠行为相关性标签时提供了稳健的信号。

英文摘要

Large-scale commercial search systems optimize for relevance to drive successful sessions that help users find what they are looking for. To maximize relevance, we leverage two complementary objectives: behavioral relevance (results users tend to click or download) and textual relevance (a result's semantic fit to the query). A persistent challenge is the scarcity of expert-provided textual relevance labels relative to abundant behavioral relevance labels. We first address this by systematically evaluating LLM configurations, finding that a specialized, fine-tuned model significantly outperforms a much larger pre-trained one in providing highly relevant labels. Using this optimal model as a force multiplier, we generate millions of textual relevance labels to overcome the data scarcity. We show that augmenting our production ranker with these textual relevance labels leads to a significant outward shift of the Pareto frontier: offline NDCG improves for behavioral relevance while simultaneously increasing for textual relevance. These offline gains were validated by a worldwide A/B test on the App Store ranker, which demonstrated a statistically significant +0.24% increase in conversion rate, with the most substantial performance gains occurring in tail queries, where the new textual relevance labels provide a robust signal in the absence of reliable behavioral relevance labels.
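
Since the offline results above are reported in NDCG for two label types, a short reference implementation may help. This sketch uses the linear-gain form of DCG with a log2 position discount; whether the paper uses linear or exponential gain is not stated, so that choice is an assumption, as are the example labels.

```python
import math
from typing import Sequence


def dcg(relevances: Sequence[float]) -> float:
    """Discounted cumulative gain with linear gain and log2 position discount."""
    return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(relevances))


def ndcg(relevances_in_ranked_order: Sequence[float]) -> float:
    """DCG of the given ranking divided by the DCG of the ideal ordering."""
    ideal = dcg(sorted(relevances_in_ranked_order, reverse=True))
    return dcg(relevances_in_ranked_order) / ideal if ideal > 0 else 0.0


if __name__ == "__main__":
    # One ranking scored against two hypothetical label sets, mirroring the
    # behavioral-vs-textual trade-off discussed in the abstract.
    behavioral_labels = [3, 2, 0, 1]
    textual_labels = [2, 3, 1, 0]
    print(ndcg(behavioral_labels), ndcg(textual_labels))
```

Evaluating the same ranked list against both label sets gives one point in the two-objective plane; a ranker change that raises both numbers corresponds to the outward shift of the Pareto frontier described above.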

2603.04177 2026-06-09 cs.SE cs.AI cs.LG 版本更新

CodeTaste: Can LLMs Generate Human-Level Code Refactorings?

CodeTaste:LLM能否生成人类级别的代码重构?

Alex Thillen, Niels Mündler, Veselin Raychev, Martin Vechev

发表机构 * University of California, Berkeley(加州大学伯克利分校) ETH Zurich(苏黎世联邦理工学院)

AI总结 研究LLM代理在代码重构中的能力,通过CodeTaste基准测试发现,代理在详细指定重构时表现良好,但难以自主发现人类选择的重构,提出“先提议后实现”分解可改善对齐。

详情
AI中文摘要

LLM编码代理可以生成可工作的代码,但它们的解决方案往往积累复杂性、重复和架构债务。人类开发者通过重构来解决这些问题:行为保持的程序转换,改善结构和可维护性。我们研究代理是否(i)能够可靠地执行重构,以及(ii)识别人类开发者在实际代码库中实际选择的重构。为此,我们构建了CodeTaste,一个从大型多文件开源重构中挖掘的基准测试。为了评分解决方案,我们结合了测量功能正确性的仓库测试套件和定制的静态检查,这些检查使用数据流推理验证不期望模式的移除和期望模式的引入。我们的结果显示了一个明显的差距:代理在实现详细指定的重构时表现良好,但当给定变更的关注区域时,往往无法发现人类的重构选择。先提议后实现的分解改善了对齐,而在实现之前选择最佳对齐的提议可以带来进一步的收益。CodeTaste为在现实代码库中将编码代理与人类重构决策对齐提供了评估目标和潜在的偏好信号。我们发布了基准测试、排行榜和代码。

英文摘要

LLM coding agents can generate working code, but their solutions often accumulate complexity, duplication, and architectural debt. Human developers address such issues through refactoring: behavior-preserving program transformations that improve structure and maintainability. We investigate whether agents (i) can execute refactorings reliably and (ii) identify the refactorings that human developers actually chose in real codebases. To this end, we construct CodeTaste, a benchmark mined from large multi-file open-source refactorings. To score solutions, we combine repository test suites that measure functional correctness with tailored static checks that verify removal of undesired and introduction of desired code patterns using dataflow reasoning. Our results show a clear gap: agents perform well at implementing refactorings that are specified in detail, but often fail to discover the human refactoring choices when given a focus area for changes. A propose-then-implement decomposition improves alignment, and selecting the best-aligned proposal before implementation can yield further gains. CodeTaste provides an evaluation target and a potential preference signal for aligning coding agents with human refactoring decisions in realistic codebases. We release the benchmark, leaderboard, and code.

2603.05026 2026-06-09 cs.SE cs.LG cs.MA 版本更新

RepoLaunch: Automating Build and Management of Code Repositories across Languages and Platforms

RepoLaunch:跨语言和平台的代码仓库构建与管理自动化

Kenan Li, Rongzhi Li, Linghao Zhang, Qirui Jin, Liao Zhu, Xiaosong Huang, Geng Zhang, Yikai Zhang, Shilin He, Chengxing Xie, Xin Zhang, Zijian Jin, Bowen Li, Chaoyun Zhang, Yu Kang, Yufan Huang, Elsie Nallipogu, Saravan Rajmohan, Qingwei Lin, Dongmei Zhang

发表机构 * Microsoft(微软)

AI总结 RepoLaunch通过自动化依赖解析、编译和测试结果提取,提升了多语言多平台代码仓库的构建效率,其构建成功率达78%,并展示了全自动SWE数据集创建流程。

Comments Under peer review. 22 pages, 5 figures, 9 tables

详情
AI中文摘要

语言模型(LM)代理在自动化软件工程(SWE)中推动了显著进展,但大规模构建和测试软件仓库仍主要依赖人工操作。本文引入RepoLaunch,一种新的代理框架,能够自动解析依赖、编译源代码并提取测试结果,适用于多种编程语言和操作系统。RepoLaunch实现了78%的构建成功率,优于仅支持Python/Linux的先前系统18%。为展示其应用,我们进一步展示了由RepoLaunch驱动的全自动SWE数据集创建流水线,仅需在任务设计阶段进行人工输入。RepoLaunch已开源,其自动化任务生成流水线已被最近的代理基准测试和训练工作所采用。

英文摘要

Language model (LM) agents have driven substantial progress in automated software engineering (SWE), yet building and testing software repositories at scale remains a largely manual and labor-intensive bottleneck. In this work, we introduce RepoLaunch, a novel agentic framework that automatically resolves dependencies, compiles source code, and extracts test results across diverse programming languages and operating systems. RepoLaunch achieves a 78% build success rate, outperforming the Python/Linux-only prior system by 18%. To demonstrate its application, we further present a fully automated pipeline for SWE dataset creation driven by RepoLaunch, which only requires human input at the task-design stage. RepoLaunch is open-sourced, and its automated task-generation pipeline has already been adopted by several recent works on agentic benchmarking and training.

2603.11250 2026-06-09 math.NA cs.LG cs.NA physics.flu-dyn 版本更新

A Machine Learning-Enhanced Hopf-Cole Formulation for Nonlinear Gas Flow in Porous Media

一种结合机器学习的Hopf-Cole公式用于多孔介质中非线性气体流动

V. S. Maduri, K. B. Nakshatrala

发表机构 * Department of Civil & Environmental Engineering University of Houston(土木与环境工程系 休斯顿大学) Computational & Applied Mechanics Laboratory(计算与应用力学实验室)

AI总结 本文提出一种结合Klinkenberg增强本构关系、Hopf-Cole变换的混合线性方程组、共享树神经网络架构和DeepLS求解器的框架,用于多孔介质中气体传输的建模与反演,提升了压力依赖渗透率和滑移参数的估计精度。

详情
AI中文摘要

准确建模多孔介质中的气体流动对于许多技术应用至关重要,包括储层性能预测、碳捕集与封存以及燃料电池和电池。然而,此类建模仍具挑战性,因为存在强烈的非线性行为和模型参数的不确定性。特别是,由Klinkenberg模型描述的气体滑移效应引入了压力依赖的渗透率,这使数值模拟复杂化并掩盖了与经典达西流行为的偏差。为解决这些挑战,本文提出了一种整合的建模框架,结合了Klinkenberg增强的本构关系、Hopf-Cole变换的混合形式线性控制方程、共享树神经网络架构和Deep Least-Squares (DeepLS)求解器。Hopf-Cole变换将原始非线性流动方程重新表述为等价的线性系统,与达西模型密切相关。混合形式与共享树神经网络架构相结合,能够同时准确预测压力和速度场。进行了严格的收敛分析,理论和数值上都建立了所提出求解器的稳定性和收敛性。重要的是,所提出的框架还自然地促进了从有限或间接观测中反演压力依赖渗透率和滑移参数,从而能够高效估计难以实验测量的流动特性。数值结果展示了在广泛的压力范围内准确恢复流动动态和参数,突显了该框架在致密地层中气体传输建模和反演中的鲁棒性、准确性和计算效率。

英文摘要

Accurate modeling of gas flow through porous media is critical for many technological applications, including reservoir performance prediction, carbon capture and sequestration, and fuel cells and batteries. However, such modeling remains challenging due to strong nonlinear behavior and uncertainty in model parameters. In particular, gas slippage effects described by the Klinkenberg model introduce pressure-dependent permeability, which complicates numerical simulation and obscures deviations from classical Darcy flow behavior. To address these challenges, we present an integrated modeling framework for gas transport in porous media that combines a Klinkenberg-enhanced constitutive relation, Hopf-Cole-transformed mixed-form linear governing equations, a shared-trunk neural network architecture, and a Deep Least-Squares (DeepLS) solver. The Hopf-Cole transformation reformulates the original nonlinear flow equations into an equivalent linear system closely related to the Darcy model, while the mixed formulation, together with a shared-trunk neural architecture, enables simultaneous and accurate prediction of both pressure and velocity fields. A rigorous convergence analysis is performed both theoretically and numerically, establishing the stability and convergence properties of the proposed solver. Importantly, the proposed framework also naturally facilitates inverse modeling of pressure-dependent permeability and slippage parameters from limited or indirect observations, enabling efficient estimation of flow properties that are difficult to measure experimentally. Numerical results demonstrate accurate recovery of flow dynamics and parameters across a wide range of pressure regimes, highlighting the framework's robustness, accuracy, and computational efficiency for gas transport modeling and inversion in tight formations.
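
To see why a change of variables can linearize this kind of problem, consider the following short derivation. It is a sketch under assumptions not taken from the abstract (steady state, isothermal ideal gas with density proportional to pressure, constant viscosity, no gravity, constant Klinkenberg parameter b), and the substitution used is a Kirchhoff-type one chosen for illustration; the paper's own Hopf-Cole transformation may be defined differently.

```latex
% Assumptions (not from the abstract): steady state, isothermal ideal gas
% (\rho \propto p), constant viscosity \mu, no gravity, constant b.
\begin{align*}
k(p) &= k_0\left(1 + \frac{b}{p}\right), \qquad
\mathbf{v} = -\frac{k(p)}{\mu}\,\nabla p, \qquad
\nabla\cdot(\rho\,\mathbf{v}) = 0 \\
\Rightarrow\quad 0 &= \nabla\cdot\!\left(\frac{k_0}{\mu}\,p\left(1+\frac{b}{p}\right)\nabla p\right)
 = \nabla\cdot\!\left(\frac{k_0}{\mu}\,(p+b)\,\nabla p\right) \\
\psi &:= \tfrac{1}{2}\,(p+b)^2 \quad\Rightarrow\quad \nabla\psi = (p+b)\,\nabla p \\
\Rightarrow\quad \mathbf{u} &= -\frac{k_0}{\mu}\,\nabla\psi, \qquad \nabla\cdot\mathbf{u} = 0
\end{align*}
```

The last line is a linear, Darcy-like mixed system in the pair (ψ, u), and the pressure is recovered afterwards from p = sqrt(2ψ) - b. A mixed formulation of this shape is what allows a single network with two output heads to predict a scalar potential and a flux field together.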

2604.19755 2026-06-09 cs.AI cs.LG 版本更新

Explainable AML Triage with LLMs: Evidence Retrieval and Counterfactual Checks

基于LLM的可解释AML警报分诊:证据检索与反事实检查

Dorothy Torres, Wei Cheng, Ke Hu

发表机构 * School of Science, Technology, Engineering and Mathematics(科学、技术、工程与数学学院) School of Electrical Engineering and Computer Science(电气工程与计算机科学学院)

AI总结 本文提出一种可解释的AML警报分诊框架,结合检索增强的证据捆绑、结构化LLM输出契约和反事实验证,提升可审计性和鲁棒性,实验表明其在分诊性能和证据支持方面表现优异。

详情
AI中文摘要

反洗钱(AML)交易监控会产生大量警报,调查人员必须在严格的审计和治理约束下对其快速分诊。尽管大语言模型(LLMs)可以汇总异质证据并起草理由,但不受约束的生成在受监管流程中风险较高,原因包括幻觉、溯源性弱以及解释与底层决策不一致。本文提出一种可解释的AML分诊框架,将分诊视为受证据约束的决策过程。我们的方法结合(i)从政策/类型学指南、客户上下文、警报触发条件和交易子图中进行检索增强的证据捆绑;(ii)结构化的LLM输出契约,要求给出明确引用,并将支持性证据与矛盾或缺失的证据区分开;(iii)反事实检查,验证最小且合理的扰动是否会使分诊建议及其理由发生连贯的变化。我们在公开的合成AML基准和模拟器上进行评估,并与规则、表格和图机器学习基线以及仅LLM/仅RAG的变体进行比较。结果表明,证据支撑显著提高了可审计性,并减少了数值和政策方面的幻觉错误,而反事实验证进一步提升了与决策相关联的可解释性和鲁棒性,取得了最佳的整体分诊性能(PR-AUC 0.75;升级F1 0.62)以及较强的溯源性和忠实度指标(引用有效性0.98;证据支持0.88;反事实忠实度0.76)。这些发现表明,受治理、可验证的LLM系统可以为AML分诊提供实用的决策支持,同时不牺牲合规所要求的可追溯性和可辩护性。

英文摘要

Anti-money laundering (AML) transaction monitoring generates large volumes of alerts that must be rapidly triaged by investigators under strict audit and governance constraints. While large language models (LLMs) can summarize heterogeneous evidence and draft rationales, unconstrained generation is risky in regulated workflows due to hallucinations, weak provenance, and explanations that are not faithful to the underlying decision. We propose an explainable AML triage framework that treats triage as an evidence-constrained decision process. Our method combines (i) retrieval-augmented evidence bundling from policy/typology guidance, customer context, alert triggers, and transaction subgraphs, (ii) a structured LLM output contract that requires explicit citations and separates supporting from contradicting or missing evidence, and (iii) counterfactual checks that validate whether minimal, plausible perturbations lead to coherent changes in both the triage recommendation and its rationale. We evaluate on public synthetic AML benchmarks and simulators and compare against rules, tabular and graph machine-learning baselines, and LLM-only/RAG-only variants. Results show that evidence grounding substantially improves auditability and reduces numerical and policy hallucination errors, while counterfactual validation further increases decision-linked explainability and robustness, yielding the best overall triage performance (PR-AUC 0.75; Escalate F1 0.62) and strong provenance and faithfulness metrics (citation validity 0.98; evidence support 0.88; counterfactual faithfulness 0.76). These findings indicate that governed, verifiable LLM systems can provide practical decision support for AML triage without sacrificing compliance requirements for traceability and defensibility.
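
The structured output contract in item (ii) lends itself to a mechanical check. The sketch below is one hypothetical shape for it: the field names (`recommendation`, `supporting`, `contradicting`, `missing`, `citations`), the allowed actions, and the evidence IDs are all invented for illustration, and the counterfactual checks of item (iii) are not covered.

```python
from typing import Dict, List, Set

ALLOWED_ACTIONS = {"escalate", "monitor", "close"}  # hypothetical label set


def contract_violations(output: Dict, evidence_ids: Set[str]) -> List[str]:
    """List every way a model output breaks the (hypothetical) output contract."""
    problems = []
    if output.get("recommendation") not in ALLOWED_ACTIONS:
        problems.append("recommendation missing or not an allowed action")

    # All three evidence sections must be present, even if empty.
    for section in ("supporting", "contradicting", "missing"):
        if not isinstance(output.get(section), list):
            problems.append(f"section '{section}' must be a list")

    # Every supporting or contradicting claim must cite retrieved evidence.
    for section in ("supporting", "contradicting"):
        claims = output.get(section)
        if not isinstance(claims, list):
            continue
        for i, claim in enumerate(claims):
            cited = claim.get("citations", []) if isinstance(claim, dict) else []
            if not cited:
                problems.append(f"{section}[{i}] has no citation")
            for ref in cited:
                if ref not in evidence_ids:
                    problems.append(f"{section}[{i}] cites unknown evidence '{ref}'")
    return problems
```

An output is accepted only when the returned list is empty. A metric like the citation validity reported above could be computed in this spirit, as the fraction of citations that resolve to an item in the retrieved evidence bundle, though the paper's exact definition is not given in the abstract.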

2604.23435 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Knee-xRAI: An Explainable AI Framework for Automatic Kellgren-Lawrence Grading of Knee Osteoarthritis

Knee-xRAI:一种用于膝骨关节炎Kellgren-Lawrence自动分级的可解释AI框架

Azmul A. Irfan, Nur Ahmad Khatim, Alfan Alfian Irfan, Achmad Zaki, Erike A. Suwarsono, Mansur M. Arief

发表机构 * Orthopaedic Department, Faculty of Medicine UIN Syarif Hidayatullah Jakarta(雅加达谢里夫·希达亚图拉国立伊斯兰大学医学院骨科) Informatics Engineering, Institut Teknologi Sepuluh Nopember(十一月十日理工学院信息工程系) Information Technology, Universitas Muhammadiyah Yogyakarta(日惹穆罕默迪亚大学信息技术系) Industrial and Systems Engineering, King Fahd University of Petroleum and Minerals(法赫德国王石油与矿产大学工业与系统工程系)

AI总结 本文提出Knee-xRAI框架,模仿临床放射学工作流程,分别测量关节间隙狭窄(JSN)、骨赘和软骨下骨硬化,并利用XGBoost-SHAP和ConvNeXt模型给出可解释的KL分级,在8,260张源自OAI的X线片上进行了评估。

Comments 8 pages, 5 figures

详情
AI中文摘要

不同阅片者对X线平片上膝骨关节炎(KOA)的分级可重复性较差。Kellgren-Lawrence(KL)分级上相差一级,就可能改变手术决策,或使患者从保守治疗转为关节腔内注射。与此同时,表现优于人类阅片者的深度学习模型往往无法解释其决策。我们提出Knee-xRAI,这一流程模仿临床放射学工作流程来分解分级过程:它分别测量关节间隙狭窄(JSN)、骨赘和软骨下骨硬化,再将这些发现组合成可解释的KL分级。具体而言,U-Net++架构通过轮廓分割量化JSN,SE-ResNet-50多任务网络按OARSI量表对各解剖部位的骨赘评分,混合纹理-CNN对硬化进行二分类检测。该流程产生一个50维特征向量,并通过XGBoost-SHAP分类器(路径A,用于审计)和ConvNeXt混合预测器(路径B,用于部署)进行评估。在8,260张源自OAI的X线片上,JSN模块的Dice得分为0.8909,mJSW ICC为0.8674。路径A的QWK为0.6294、AUC为0.8046,证实结构化特征向量携带了可观的诊断信号。路径B的QWK为0.8436、AUC为0.9017。SHAP分析显示JSN是主导特征,骨赘带来稳定的增量,硬化的贡献很小。移除JSN证据会使KL3-KL4的召回率大幅下降,而早期分级不受影响,这与KL诊断标准一致。Knee-xRAI使每个预测都建立在可审计的放射学测量发现链之上,在诊疗现场提供临床透明度。

英文摘要

Grading knee osteoarthritis (KOA) on plain radiographs is poorly reproducible across readers. A single-grade disagreement on the Kellgren-Lawrence (KL) scale can alter surgical management or redirect a patient from conservative therapy to intra-articular injection. Meanwhile, deep learning models that outperform human readers often offer no explanation for their decisions. We present Knee-xRAI, a pipeline that decomposes the grading process by mimicking clinical radiological workflows. It independently measures joint space narrowing (JSN), osteophytes, and subchondral sclerosis, then combines these findings into an explainable KL grade. Specifically, a U-Net++ architecture quantifies JSN via contour segmentation, an SE-ResNet-50 multi-task network grades osteophytes per anatomical site on the OARSI scale, and a hybrid texture-CNN detects binary sclerosis. This pipeline yields a 50-dimensional feature vector evaluated via an XGBoost-SHAP classifier (Path A, audit) and a ConvNeXt hybrid predictor (Path B, deployed). On 8,260 OAI-derived radiographs, the JSN module achieved a Dice score of 0.8909 and an mJSW ICC of 0.8674. Path A reached a QWK of 0.6294 and an AUC of 0.8046, confirming the structured feature vector carries substantial diagnostic signal. Path B achieved a QWK of 0.8436 and an AUC of 0.9017. SHAP analysis identifies JSN as the dominant feature, with osteophytes adding a consistent increment and sclerosis contributing marginally. Removing JSN evidence collapses KL3-KL4 recall while early grades remain intact, aligning with the KL diagnostic criteria. Knee-xRAI grounds every prediction in an auditable chain of measured radiographic findings, providing clinical transparency at the point of care.
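The QWK figures quoted above refer to quadratic weighted kappa, an agreement score for ordinal labels that penalises a disagreement by the squared distance between grades. A minimal sketch of the standard formula follows, assuming integer KL grades 0 to 4; it is a generic implementation, not the authors' evaluation code.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes):
    """Quadratic weighted kappa for ordinal labels in range(num_classes).

    Assumes at least two distinct grades occur, so the expected
    disagreement in the denominator is non-zero.
    """
    n = len(y_true)
    # Observed confusion matrix: rows are true grades, columns are predictions.
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(observed[i][j] for i in range(num_classes))
                  for j in range(num_classes)]

    weighted_observed = 0.0
    weighted_expected = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            weight = (i - j) ** 2 / (num_classes - 1) ** 2
            expected = row_totals[i] * col_totals[j] / n  # chance agreement
            weighted_observed += weight * observed[i][j]
            weighted_expected += weight * expected
    return 1.0 - weighted_observed / weighted_expected
```

A value of 1 means perfect agreement and 0 means agreement no better than chance. Because of the quadratic weights, confusing KL1 with KL4 costs far more than confusing KL1 with KL2.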

2605.01171 2026-06-09 cs.CV cs.LG 版本更新

CADFit: Precise Mesh-to-CAD Program Generation with Hybrid Optimization

CADFit:基于混合优化的精确网格到CAD程序生成

Ghadi Nehme, Eamon Whalen, Faez Ahmed

发表机构 * Department of Mechanical Engineering, Massachusetts Institute of Technology, Cambridge, MA 02139, USA(麻省理工学院机械工程系)

AI总结 提出CADFit框架,通过基于几何反馈的增量拟合和验证参数化操作,从网格中恢复复杂可编辑的CAD构造序列,在多个基准上优于现有方法,并显著降低无效比率。

详情
AI中文摘要

尽管最近取得了进展,但从几何输入(如网格或点云)恢复参数化CAD构造序列仍然是设计和制造的关键挑战,因为现有的CAD重建和生成方法主要局限于难以编辑的格式(如网格或Breps)或可编辑的简单草图-拉伸流水线和低复杂度数据集。我们引入了CADFit,一个基于混合优化的CAD重建框架,通过使用几何反馈增量拟合和验证参数化操作,从网格中恢复复杂、可编辑的CAD构造序列。我们的方法的特点是将重建公式化为对结构化CAD程序的IoU驱动优化,并支持丰富的操作集,包括拉伸、旋转、圆角和倒角。在多个CAD基准上的实验表明,CADFit在体积交并比和倒角距离方面优于最先进的网格到CAD方法,同时显著降低了重建CAD程序的无效比率,特别是对于复杂设计。我们进一步提出了一个多模态流水线,通过将基于图像的几何重建与CADFit相结合,实现从图像端到端重建CAD构造序列。通过实现更高复杂度CAD模型的精确重建,CADFit为生成更丰富的数据集和推进未来基于学习的CAD逆向工程方法提供了实用基础。代码可在:https://github.com/ghadinehme/CADFit 获取。

英文摘要

Despite recent progress, recovering parametric CAD construction sequences from geometric input, such as meshes or point clouds, is a key challenge for design and manufacturing, as existing CAD reconstruction and generation methods are largely restricted to difficult-to-edit formats like meshes or Breps or editable simple sketch-and-extrude pipelines and low-complexity datasets. We introduce CADFit, a hybrid optimization-based CAD reconstruction framework that recovers complex, editable CAD construction sequences from meshes by incrementally fitting and validating parametric operations using geometric feedback. Our approach is distinguished by formulating reconstruction as an IoU-driven optimization over structured CAD programs and supporting a rich set of operations, including extrusions, revolutions, fillets, and chamfers. Experiments on multiple CAD benchmarks show that CADFit outperforms state-of-the-art mesh-to-CAD methods in volumetric Intersection-over-Union and Chamfer Distance, while substantially reducing the Invalid Ratio of reconstructed CAD programs, particularly for complex designs. We further present a multimodal pipeline that enables end-to-end reconstruction of CAD construction sequences from images by combining image-based geometry reconstruction with CADFit. By enabling accurate reconstruction of higher-complexity CAD models, CADFit provides a practical foundation for generating richer datasets and advancing future learning-based approaches to CAD reverse engineering. The code is available at: https://github.com/ghadinehme/CADFit.
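To make "IoU-driven optimization over CAD programs" concrete, here is a toy sketch: solids are voxel sets, and a single extrusion height is chosen by maximising volumetric IoU against a target. The `box` helper, the one-parameter search, and the voxel representation are illustrative assumptions; CADFit's actual operation set and optimizer are described only at a high level in the abstract.

```python
def voxel_iou(a, b):
    """Volumetric Intersection-over-Union of two voxel sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 1.0  # two empty solids are treated as identical
    return len(a & b) / len(union)


def box(x0, x1, y0, y1, z0, z1):
    """Voxelised axis-aligned box covering the half-open index ranges."""
    return {(x, y, z)
            for x in range(x0, x1)
            for y in range(y0, y1)
            for z in range(z0, z1)}


def fit_extrusion_height(target, x1, y1, max_height):
    """Pick the extrusion height of an x1-by-y1 sketch that maximises IoU."""
    best_height, best_iou = None, -1.0
    for height in range(1, max_height + 1):
        iou = voxel_iou(box(0, x1, 0, y1, 0, height), target)
        if iou > best_iou:
            best_height, best_iou = height, iou
    return best_height, best_iou


# Hypothetical target: a 4 x 4 sketch extruded to height 3.
target_solid = box(0, 4, 0, 4, 0, 3)
fitted_height, fitted_iou = fit_extrusion_height(target_solid, 4, 4, 5)
```

The same geometric feedback could in principle be used to accept or reject each candidate operation as a construction sequence grows step by step.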

2605.03395 2026-06-09 cs.SD cs.AI cs.LG cs.MM 版本更新

APEX: Large-scale Multi-task Aesthetic-Informed Popularity Prediction for AI-Generated Music

APEX:面向AI生成音乐的大规模多任务美学感知流行度预测

Jaavid Aktar Husain, Dorien Herremans

发表机构 * AMAAI Lab, Singapore University of Technology and Design(新加坡科技设计大学AMAAI实验室)

AI总结 提出APEX框架,利用MERT音频嵌入联合预测AI生成音乐的流行度指标与五维美学质量,在Music Arena数据集上验证了美学特征对偏好预测的泛化能力。

详情
AI中文摘要

音乐流行度预测因其对艺术家、平台和推荐系统的重要性而吸引了越来越多的研究兴趣。然而,AI生成音乐平台的爆炸式增长创造了一个全新且很大程度上未被探索的领域,每天都有大量歌曲被生产和消费,而没有传统的艺术家声誉或唱片公司支持。在这一探索中,美学质量是关键但尚未被研究的因素。我们提出了APEX,这是首个面向AI生成音乐的大规模多任务学习框架,在来自Suno和Udio的超过21.1万首歌曲(1万小时音频)上训练,该框架联合预测基于参与度的流行度信号——流媒体播放量和点赞分数——以及从MERT(一个自监督音乐理解模型)提取的冻结音频嵌入中的五个感知美学质量维度。美学质量和流行度捕捉了音乐的互补方面,两者结合被证明是有价值的:在Music Arena数据集上的分布外评估中,该数据集包含训练期间未见过的十一个生成音乐系统之间的成对人类偏好对决,引入美学特征持续改进了偏好预测,展示了所学表示在生成架构上的强大泛化能力。

英文摘要

Music popularity prediction has attracted growing research interest, with relevance to artists, platforms, and recommendation systems. However, the explosive rise of AI-generated music platforms has created an entirely new and largely unexplored landscape, where a surge of songs is produced and consumed daily without the traditional markers of artist reputation or label backing. Key, yet unexplored in this pursuit is aesthetic quality. We propose APEX, the first large-scale multi-task learning framework for AI-generated music, trained on over 211k songs (10k hours of audio) from Suno and Udio, that jointly predicts engagement-based popularity signals - streams and likes scores - alongside five perceptual aesthetic quality dimensions from frozen audio embeddings extracted from MERT, a self-supervised music understanding model. Aesthetic quality and popularity capture complementary aspects of music that together prove valuable: in an out-of-distribution evaluation on the Music Arena dataset, comprising pairwise human preference battles across eleven generative music systems unseen during training, including aesthetic features consistently improves preference prediction, demonstrating strong generalisation of the learned representations across generative architectures.

2605.16163 2026-06-09 physics.ao-ph cs.LG 版本更新

SwAIther-Precip: Lead-Time-Aware Bias Correction Enables Kilometer-Scale Downscaling of Global AI Precipitation Forecasts over Switzerland

SwAIther-Precip:考虑提前时间的偏倚校正实现瑞士全球AI降水预报的公里级降尺度

Dan Assouline, Erwan Koch, Federico Amato, Filippo Quarenghi, Daniele Nerini, Thibaut Loiseau, Kyle van de Langemheen, Tom Beucler

发表机构 * European Centre for Medium-Range Weather Forecasts(欧洲中期天气预报中心) University of Geneva(日内瓦大学) ETH Zurich(苏黎世联邦理工学院)

AI总结 本文提出SwAIther-Precip框架,通过校正提前时间依赖性偏倚,提升全球AI降水预报的公里级概率降尺度能力,实验显示CRPS降低48%。

详情
AI中文摘要

在复杂地形上,公里尺度的高技巧中期降水预报仍具挑战,因为降水源于多尺度非线性过程,全球模型无法以可承受的成本显式解析这些过程。全球AI天气模型可以给出高技巧的中期预报,但其原生0.25度分辨率限制了在局地灾害应用中的直接使用。统计降尺度有助于弥合这一差距,但现有方法往往难以处理全球预报中依赖于状态、尤其是依赖于提前时间的偏倚。我们引入SwAIther-Precip,一种考虑提前时间的降尺度框架,将粗分辨率AIFS预报转换为瑞士上空的公里级概率降水场。首先,一个通过特征级线性调制以提前时间为条件的U-Net在粗分辨率上确定性地校正系统性偏倚。这种针对性校正使后续的超分辨率阶段成本更低:它仅以校正后的降水为条件,因而可以直接在观测上训练,而无需使用完整的大气状态。随后,基于扩散的模型独立于提前时间生成精细尺度的空间变异性。使用AIFS预报和CombiPrecip雷达-雨量计观测,SwAIther-Precip使CRPS相对于原始AIFS降低48%。生成的降水场再现了观测到的空间变异性,谱保真度在大尺度上高于0.85、在小尺度上为0.88,相当于在1公里网格上约4公里的有效分辨率,适用于最长5天的提前时间。跨提前时间训练进一步提升了长提前时间的性能:在提前6天时,CRPS相对于针对特定提前时间训练的模型降低13%。这些结果表明,在生成式超分辨率之前显式校正依赖于提前时间的偏倚,是对全球AI降水预报进行高效公里级概率降尺度的关键。

英文摘要

Skillful medium-range precipitation forecasting at kilometer scale remains challenging over complex terrain because precipitation arises from multiscale nonlinear processes that global models cannot explicitly resolve at affordable cost. Global AI weather models can produce skillful medium-range forecasts, but their native 0.25 degrees resolution limits direct use for local hazard applications. Statistical downscaling can help bridge this gap, yet existing approaches often struggle with state-dependent, and especially lead-time-dependent, biases in global forecasts. We introduce SwAIther-Precip, a lead-time-aware downscaling framework that converts coarse-resolution AIFS forecasts into probabilistic km-scale precipitation fields over Switzerland. First, a U-Net conditioned on lead time via feature-wise linear modulation deterministically corrects systematic biases at coarse resolution. This targeted correction enables a cheaper super-resolution stage conditioned only on corrected precipitation, allowing direct training on observations rather than on the full atmospheric state. A diffusion-based model then generates fine-scale spatial variability independently of lead time. Using AIFS forecasts and CombiPrecip radar-gauge observations, SwAIther-Precip reduces CRPS by 48% relative to raw AIFS. The generated fields reproduce observed spatial variability with spectral fidelity above 0.85 at large scales and 0.88 at small scales, corresponding to an effective resolution of approximately 4 km on a 1 km grid for lead times up to 5 days. Training across lead times further improves long-range performance, yielding a 13% CRPS reduction at 6 days relative to lead-time-specific models. These results show that explicitly correcting lead-time-dependent biases before generative super-resolution is key to efficient km-scale probabilistic downscaling of global AI precipitation forecasts.
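CRPS (continuous ranked probability score) is the headline metric here. For an ensemble forecast it is commonly estimated as the mean absolute error of the members minus half the mean absolute difference between member pairs. The sketch below uses that standard estimator for a single location; the function names are illustrative, and how the paper aggregates over grid cells and lead times is not stated in the abstract.

```python
def crps_ensemble(members, observation):
    """Ensemble CRPS estimate: E|X - y| - 0.5 * E|X - X'| (lower is better)."""
    m = len(members)
    mean_error = sum(abs(x - observation) for x in members) / m
    mean_spread = sum(abs(a - b) for a in members for b in members) / (m * m)
    return mean_error - 0.5 * mean_spread


def relative_reduction(baseline_score, new_score):
    """Fractional improvement of new_score over baseline_score."""
    return (baseline_score - new_score) / baseline_score
```

For a single-member (deterministic) forecast the spread term vanishes and CRPS reduces to the absolute error, which is why the score allows a fair comparison between deterministic and probabilistic outputs. A "48% reduction" corresponds to `relative_reduction` returning 0.48.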

2605.20735 2026-06-09 cs.CV cs.LG 版本更新

Lowering the Barrier to IREX Participation: Open-Source Algorithms, Toolkit, and Benchmarking for Iris Recognition

降低参与IREX的门槛:用于虹膜识别的开源算法、工具包和基准测试

Siamul Karim Khan, Patrick J. Flynn, Adam Czajka

发表机构 * University of Notre Dame(圣母大学)

AI总结 本文提出两种新的开源深度学习虹膜识别算法(ArcIris和TripletIris),并提供Python实现和符合IREX X规范的C++实现;同时给出两种现有方法(HDBIF和CRYPTS)的IREX X兼容C++实现,以及可用于任何新虹膜识别方法的虹膜分割和圆形近似模型,并按IREX X测试协议和多个学术基准评估了这些开源方案。

详情
AI中文摘要

本文提出了两种新的开源虹膜识别算法,并提供Python实现和符合IREX规范的C++实现,用于提交至官方IREX X项目。本研究有两个主要目标:(a)首次按照IREX测试协议评估开源虹膜识别方案;(b)提供一个示范性的C++提交,为其他团队的开源方法进入IREX评估提供显著便利。新方法包括两个神经网络,分别使用带批内困难三元组挖掘的三元组损失(TripletIris)和ArcFace损失(ArcIris)。本文还提供了两种现有方法的开源IREX兼容C++实现:(a)基于虹膜图像滤波、使用人类显著性驱动卷积核的算法(HDBIF);(b)用于检测和比较Fuchs隐窝的人类可解释算法(CRYPTS)。除CRYPTS在1:N搜索中受时间限制外,这些方法均已通过官方IREX X评估,并在多个常用学术基准上进行了评估:Quality-Face/Iris Research Ensemble、Warsaw-Biobase Post-Mortem Iris、CASIA-Iris-Thousand-V4、CASIA-Iris-Lamp-V4、IIT Delhi Iris Database、IIITD Contact Lens Iris Database、NDIris3D和Notre Dame Variable Iris Image Quality Release 2。最后,本文还提供了可用于任何新虹膜识别方法的虹膜分割和圆形近似开源模型。

英文摘要

NIST Iris Exchange (IREX) offers an appealing solution to evaluating new open-source iris recognition algorithms, but it presents high barriers to entry because these algorithms must be written in C++, using a specific API, and adapted to meet strict IREX speed and memory constraints. The main goal of this paper is to lower these barriers and advance open-source iris recognition large-scale evaluations by offering: (a) two new modern deep learning-based open-source iris matchers (ArcIris and TripletIris), along with their C++ IREX X-compliant implementations, which are the first open-source iris recognition methods included into the IREX X leaderboard (and thus IREX-vetted), as well as new segmentation and iris circular approximation models that can be incorporated into any new iris recognition method, and (b) a performance assessment (according to IREX X testing protocols) of all major and currently available open-source iris recognition solutions. The paper also provides Python implementations of the new ArcIris and TripletIris methods and discusses the differences one may encounter between C++ and Python implementations of the same conceptually equivalent approaches. Finally, the paper offers open-source, IREX X-compliant C++ implementations of two existing methods: (a) an iris image filtering-based algorithm utilizing human saliency-driven kernels (HDBIF), and (b) a human-interpretable algorithm for detecting and comparing Fuchs' crypts (CRYPTS). In addition to IREX X evaluation results, the paper reports the performance of all methods on major academic benchmarks: Quality-Face/Iris Research Ensemble (Q-FIRE), Warsaw-Biobase Post-Mortem Iris, CASIA-Iris-Thousand-V4, CASIA-Iris-Lamp-V4, IIT Delhi Iris Database, IIITD Contact Lens Iris Database, NDIris3D, and Notre Dame Variable Iris Image Quality Release 2 (VII-Q-R2).

2605.27441 2026-06-09 cs.IR cs.LG 版本更新

A Unified Structured Query Understanding Framework for Industrial Semantic Search

面向工业语义搜索的统一结构化查询理解框架

Ping Liu, Qianqi Shen, Jianqiang Shen, Chunnan Yao, Kevin Kao, Rajat Arora, Dan Xu, Baofen Zheng, Yunxiang Ren, Benjamin Le, Ali Hooshmand, Igor Lapchuk, Juan Bottaro, Raghavan Muthuregunathan, Caleb Johnson, Liangjie Hong, Jingwei Wu, Wenjing Zhang

发表机构 * LinkedIn Corporation(领英公司)

AI总结 提出一个统一的结构化查询理解系统,将多个异构功能整合到单个小语言模型(SLM)中,并引入Query Illuminator框架用于自动标注和评估,在LinkedIn的职位搜索和人员搜索中验证了效果。

Comments Accepted by KDD-ADS 2026

详情
AI中文摘要

大规模工业搜索系统中的查询理解通常实现为一系列不同、任务特定的组件的级联。虽然每个组件可单独优化,但这种碎片化架构导致维护开销高,且行为不一致,特别是对于长尾查询。在这项工作中,我们提出并部署了一个统一的结构化查询理解系统,将异构功能整合到单个执行模式约束生成的小语言模型(SLM)中。为了解决统一建模中的数据瓶颈,我们引入了Query Illuminator,一个双重用途的框架,作为:(i) 用于高质量自动标注和蒸馏的教师模型,以及(ii) 在人工标注稀缺时用于可扩展评估的替代评判者。我们通过在LinkedIn的职位搜索系统中的广泛离线和在线测试验证了该方法。此外,我们通过跨领域的人员搜索案例研究展示了该框架的水平可扩展性。结果表明,在有限的GPU资源上满足严格的低延迟服务约束的同时,用户参与度提高,运营成本降低。

英文摘要

Query understanding in large-scale industrial search systems is typically implemented as a cascade of disparate, task-specific components. While individually optimizable, this fragmented architecture incurs high maintenance overhead and results in inconsistent behaviors, particularly for long-tail queries. In this work, we propose and deploy a unified structured query understanding system that consolidates these heterogeneous functions into a single Small Language Model (SLM) that performs schema-constrained generation. To address the data bottlenecks inherent in unified modeling, we introduce Query Illuminator, a dual-purpose framework serving as: (i) a teacher model for high-quality auto-annotation and distillation, and (ii) a surrogate judge for scalable evaluation where human labels are scarce. We validate this approach through extensive offline and online tests within LinkedIn's Job Search system. Furthermore, we demonstrate the framework's horizontal extensibility through a cross-domain case study on People Search. The results show improved user engagement and reduced operational costs, achieved while satisfying strict low-latency serving constraints on limited GPU resources.

2606.00384 2026-06-09 cs.AI cs.CL cs.CV cs.LG stat.CO 版本更新

VESTA: Visual Exploration with Statistical Tool Agents

VESTA: 基于统计工具代理的视觉探索

William Rudman, Abhishek Divekar, Kanishk Jain, Sebastian Joseph, Stella S. R. Offner, Matthew Lease, Kyle Mahowald, Greg Durrett, Junyi Jessy Li

发表机构 * The University of Texas at Austin(德克萨斯大学奥斯汀分校) New York University(纽约大学)

AI总结 提出VESTA框架,通过动态增长的工具集指导数据变换、假设驱动可视化和统计检验,提升视觉语言模型在复杂统计建模任务上的性能。

详情
AI中文摘要

将定量模型拟合到数据上是科学工作流程中的核心步骤,但它仍然是最少自动化的步骤之一。最近的基于代理的系统利用语言和视觉语言模型(VLM)来迭代地提出和优化统计模型,但这些系统在更具挑战性的建模任务上表现不佳。为了解决这些限制,我们引入了VESTA:基于统计工具代理的视觉探索,这是一个框架,为VLM配备了一个动态增长的探索工具包,通过数据变换、假设驱动的可视化和稳健的统计检验来指导模型优化。与之前仅依赖迭代批评的系统不同,VESTA在优化之前和优化过程中通过选择或创建诊断工具主动探索数据,这些工具会累积在模型的上下文中,并可在以后重用。我们在三种工具配置下评估VESTA与已建立的基线:无工具、静态专家编写的工具和动态模型编写的工具。为了支持这一评估,我们引入了DAWN(自动工作流和数值建模数据集),这是一个针对分布拟合和时间序列建模的基准,具有不同的难度等级,并最终涉及真实世界的天文学任务,包括建模初始质量函数和引力波啁啾信号。我们发现VESTA的动态工具创建优于先前的代理流水线,在复杂和特定领域的任务上取得了最大的收益。我们进一步表明,动态生成的工具比现有视觉工具创建系统生成的工具复杂得多,每个函数覆盖更多的诊断类别,并且强烈倾向于VLM批评者可以直接推理的视觉输出。

英文摘要

Fitting quantitative models to data is a central step in scientific workflows, yet it remains one of the least automated. Recent agent-based systems leverage language and vision-language models (VLMs) to iteratively propose and refine statistical models, but these systems struggle on more challenging modeling tasks. To address these limitations, we introduce VESTA: Visual Exploration with Statistical Tool Agents, a framework that equips VLMs with a dynamically growing exploration toolkit to guide model refinement through data transformations, hypothesis-driven visualizations, and robust statistical tests. Unlike prior systems that rely on iterative critique alone, VESTA actively explores data before and during refinement by selecting or creating diagnostic tools, which accumulate in the model's context and can be reused later. We evaluate VESTA against established baselines in three toolkit configurations: no tools, static expert-written tools, and dynamic model-written tools. To support this evaluation, we introduce DAWN (Dataset for Automated Workflows and Numerical Modeling), a benchmark targeting distribution fitting and time series modeling with varying difficulty tiers, and culminating in real-world astronomy tasks including modeling initial mass functions and gravitational-wave chirp signals. We find that VESTA's dynamic tool creation outperforms prior agentic pipelines, with the largest gains on complex and domain-specific tasks. We further show that dynamically generated tools are substantially more sophisticated than those produced by existing visual tool-creation systems, covering more diagnostic categories per function and strongly preferring visual outputs that the VLM critic can reason over directly.

2606.02341 2026-06-09 cs.SD cs.LG 版本更新

Parameter-efficient Dual-encoder Architecture with Differentiable Choquet Integral Fusion for Underwater Acoustic Classification

参数高效的双编码器架构与可微Choquet积分融合用于水下声学分类

Amirmohammad Mohammadi, Joshua Peeples, Alexandra Van Dine

发表机构 * University of California, San Diego(加州大学圣地亚哥分校)

AI总结 提出一种双编码器神经网络架构,同时处理波形和频谱图,利用预训练骨干和参数高效微调模块,并通过基于Choquet积分的可微模糊聚合机制融合时域和频域表示,提高分类准确性和可解释性。

Comments 9 pages, 7 figures

详情
AI中文摘要

水下声学分类具有广泛的海事应用,但由于日益复杂的声学环境而面临挑战。波形和频谱图表示已被主要用作该领域分类任务的声学数据特征。频谱图建模谐波依赖性,但这些降维表示可能过滤掉与判别相关的声学特征。虽然波形的相位信息允许对信号进行完整表征,但原始波形可能嘈杂且复杂,使得模型难以直接处理该表示。本文提出一种双编码器神经网络架构,同时处理声学波形和频谱图,利用预训练骨干和参数高效微调模块,实现领域自适应。为了结合这些自适应分支,引入了一种基于Choquet积分的可微模糊聚合机制,以平衡时域和频谱表示。这种融合策略不仅提高了分类准确性,还提供了可解释性。具体来说,通过分析学习到的模糊测度,揭示了网络表示依赖性的类别特定变化。通过动态将注意力转移到受潜在非对称信道失真影响最小的表示上,所提出的门控机制缓解了水下环境的非平稳挑战。在DeepShip和ShipsEar数据集上的评估表明,所提出的架构相对于独立的单编码器基线实现了分类改进,同时限制了可训练参数空间。这减轻了在有限声学数据集上过拟合的风险,同时降低了与完全微调基础模型相关的计算成本。

英文摘要

Underwater acoustic classification has a wide array of oceanic applications, but faces challenges due to an increasingly complex acoustic environment. Waveform and spectrogram representations have been primarily used as acoustic data features for classification tasks in this domain. Spectrograms model harmonic dependencies, but these reduced representations can filter out acoustic features relevant for discrimination. While phase information from the waveform allows full characterization of the signal, the original waveform can be noisy and complex, rendering this representation difficult for models to process directly. This paper proposes a dual-encoder neural architecture to simultaneously process acoustic waveforms and spectrograms, leveraging pre-trained backbones and parameter-efficient fine-tuning modules, enabling a domain adaptation. To combine these adapted branches, a novel differentiable fuzzy aggregation mechanism based on the Choquet integral is introduced to balance the temporal and spectral representations. This fusion strategy not only yields higher classification accuracy but also provides interpretability. Specifically, by analyzing the learned fuzzy measures, insights are revealed about class-specific shifts in the network's representation reliance. By dynamically shifting attention to the representation least corrupted by potential asymmetric channel distortions, the proposed gating mechanism mitigates the non-stationary challenges of the underwater environment. Evaluations on the DeepShip and ShipsEar datasets demonstrate that the proposed architecture achieves classification improvements over independent single-encoder baselines, while simultaneously restricting the trainable parameter space. This mitigates the risk of overfitting on limited acoustic datasets while alleviating the computational costs associated with fully fine-tuning foundation models.
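The fusion step relies on the discrete Choquet integral, which aggregates per-source scores using a fuzzy measure defined on every subset of sources rather than one weight per source. A minimal forward-pass sketch for non-negative scores follows; the two-source setup (index 0 for the waveform branch, index 1 for the spectrogram branch) and the measure values are illustrative assumptions, and the paper's differentiable parameterisation of the measure is not reproduced.

```python
def choquet_integral(values, measure):
    """Discrete Choquet integral of non-negative values.

    `measure` maps each frozenset of source indices to a number in [0, 1];
    it should be monotone, with the full set mapped to 1.
    """
    # Visit sources from the highest score to the lowest.
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    total = 0.0
    top_sources = set()
    for rank, idx in enumerate(order):
        top_sources.add(idx)
        next_value = values[order[rank + 1]] if rank + 1 < len(order) else 0.0
        total += (values[idx] - next_value) * measure[frozenset(top_sources)]
    return total


# Hypothetical fuzzy measure over two branches: 0 = waveform, 1 = spectrogram.
example_measure = {
    frozenset({0}): 0.7,
    frozenset({1}): 0.5,
    frozenset({0, 1}): 1.0,
}
fused_score = choquet_integral([0.8, 0.4], example_measure)
```

When the measure is additive (for example both singletons set to 0.5) the integral reduces to a weighted average; non-additive values such as those above let the fusion express interaction between the branches, which is what makes the learned measure inspectable.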

2606.02735 2026-06-09 cs.RO cs.AI cs.LG 版本更新

See Less, Specify More: Visual Evidence Budgets for Generalizable VLAs

看得更少,指定更多:面向可泛化视觉-语言-动作模型的视觉证据预算

Yueh-Hua Wu, Tatsuya Matsushima, Kei Ota

发表机构 * Airoa

AI总结 提出S2框架,通过显式视觉证据预算和细化轨迹语言,改善VLA模型在干扰、外观变化和语义相似任务下的泛化能力。

Comments Project page: https://s2.airoa.io

详情
AI中文摘要

泛化仍然是视觉-语言-动作(VLA)模型的核心瓶颈:在干扰物、外观变化和语义相似任务下,策略通常需要从粗略指令中推断局部执行细节,同时决定图像的哪些部分对控制重要。我们提出S2(看得更少,指定更多),一个通过更干净的接口训练执行器来提升VLA泛化的框架。“指定更多”保留原始指令作为稳定的高层目标,同时将每条轨迹重新标注为细化的轨迹级和子任务级语言,以消除当前执行模式的歧义。与原生注意力不同,“看得更少”施加显式的视觉证据预算,训练执行器从任务充分的证据中行动,而非不受约束的视觉上下文,无需任何区域或掩码标注。该接口让执行器能够遵循详细指导,而不依赖干扰性的视觉补丁或自行解决可避免的歧义,并且通过上下文学习与现成的VLM规划器兼容。在我们的主要评估设置中,S2通过改变执行器的学习问题提升了整体泛化指标:粗略指令导致可避免的监督混叠,目标保持的局部指导在我们的主要消融中优于指令替换,显式证据预算减少了对广泛视觉上下文的依赖,超越了效率考虑。在TX-G2(一个AgiBot G2兼容变体)和HSR上的八个真实机器人任务中,S2将平均子任务成功率从pi0.5的54.2%提升到79.0%。这些结果共同表明,当执行器被训练从信息丰富的局部指导和任务充分的视觉证据中行动,而非从弱监督中同时恢复两者时,VLA泛化得到改善。

英文摘要

Generalization remains a central bottleneck for vision-language-action (VLA) models: under distractors, appearance shifts, and semantically similar tasks, the policy must often infer local execution details from coarse instructions while also deciding which parts of the image matter for control. We present S2 (See Less, Specify More), a framework for improving VLA generalization by training the executor under a cleaner interface. Specify More preserves the original instruction as a stable high-level goal while relabeling each trajectory into refined trajectory- and subtask-level language that disambiguates the current execution mode. Unlike native attention, See Less imposes an explicit visual evidence budget, training the executor to act from task-sufficient evidence rather than unconstrained visual context, without any region or mask annotation. This interface lets the executor follow detailed guidance without relying on distracting visual patches or resolving avoidable ambiguity on its own, and it remains compatible with off-the-shelf VLM planners through in-context learning. Across our main evaluation settings, S2 improves overall generalization metrics by changing the executor's learning problem: coarse instructions induce avoidable supervision aliasing, goal-preserving local guidance outperforms instruction replacement in our main ablations, and explicit evidence budgeting reduces dependence on broad visual context beyond efficiency considerations. Across eight real-robot tasks on TX-G2 (an AgiBot G2-compatible variant) and HSR, S2 raises mean subtask success from 54.2% to 79.0% over pi0.5. Together, these results suggest that VLA generalization improves when the executor is trained to act from informative local guidance and task-sufficient visual evidence, rather than recovering both from weak supervision.

2606.06407 2026-06-09 cs.CV cs.IR cs.LG eess.IV 版本更新

A Vision-language Framework for Comparative Reasoning in Radiology

放射学中比较推理的视觉语言框架

Tengfei Zhang, Ziheng Zhao, Xiaoman Zhang, Lisong Dai, Pengcheng Qiu, Ya Zhang, Yanfeng Wang, Weidi Xie

发表机构 * University of Science and Technology of China(中国科学技术大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室) School of Artificial Intelligence, Shanghai Jiao Tong University(上海交通大学人工智能学院) Department of Biomedical Informatics, Harvard Medical School(哈佛医学院生物医学信息学系) Department of Radiology, Renmin Hospital of Wuhan University(武汉大学人民医院放射科) Shanghai Sixth People’s Hospital Affiliated to Shanghai Jiao Tong University(上海交通大学附属第六人民医院)

AI总结 提出一个实体感知的跨图像推理框架,通过构建大规模比较影像数据集MedReCo-DB和开发MedReCo及MedReCo-VLM模型,实现了参考病例检索和时间比较解读,显著提升了放射学比较推理性能。

详情
AI中文摘要

医学影像人工智能在孤立图像解读方面取得了强劲性能,但仍与放射学实践存在较大差距,因为诊断和随访依赖于对先前研究和类似参考病例的比较。本文我们将放射学比较形式化为一个实体感知的跨图像推理问题,并引入一个支持参考病例检索和时间比较解读的框架。我们构建了MedReCo-DB,这是一个从常规图像-报告对中派生的大规模比较影像资源,包含来自八个机构、四个国家、七种成像模态的超过16万名患者的69万余张图像。报告被分解为解剖结构、异常发现和病理状况,为实体条件检索和比较视觉问答提供监督。利用该资源,我们开发了MedReCo,一个用于可控检索临床类似病例的实体感知视觉编码器,以及MedReCo-VLM,一个用于生成性解读间隔变化的视觉语言扩展。在内部、外部和跨中心评估中,MedReCo在所有12个内部检索设置中实现了最高的Recall@1,并将外部检索平均提高了6.0个百分点。在临床易混淆的鉴别组中,它始终优于最强的基线。MedReCo-VLM在所有比较生成评估中取得了最佳性能,并在胸部X光片上将纵向随访准确性提高了14.5-46.5个百分点,在CT上提高了13.0-27.9个百分点。这些发现表明,实体感知的比较推理可以从常规临床数据中大规模学习,并可能为医学影像AI提供更符合临床的基础。

英文摘要

Medical imaging artificial intelligence has achieved strong performance in isolated image interpretation, but remains poorly aligned with radiological practice, where diagnosis and follow-up rely on comparison across prior studies and analogous reference cases. Here we formulate radiological comparison as an entity-aware cross-image reasoning problem and introduce a framework that supports both reference-case retrieval and temporal comparative interpretation. We construct MedReCo-DB, a large-scale comparative imaging resource derived from routine image-report pairs, comprising more than 690,000 images from over 160,000 patients across eight institutions, four countries and seven imaging modalities. Reports are decomposed into anatomical structures, abnormal findings and pathological conditions to provide supervision for entity-conditioned retrieval and comparative visual question answering. Using this resource, we develop MedReCo, an entity-aware visual encoder for controllable retrieval of clinically analogous cases, and MedReCo-VLM, a vision--language extension for generative interpretation of interval change. Across internal, external and cross-center evaluations, MedReCo achieved the highest Recall@1 in all 12 internal retrieval settings and improved external retrieval by a mean of 6.0 percentage points. In clinically confusable differential groups, it consistently outperformed the strongest baselines. MedReCo-VLM achieved the best performance across all comparative generation evaluations and improved longitudinal follow-up accuracy by 14.5-46.5 percentage points on chest radiographs and 13.0-27.9 percentage points on CT. These findings suggest that entity-aware comparative reasoning can be learned from routine clinical data at scale and may provide a more clinically aligned foundation for medical imaging AI.
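Recall@1, the retrieval metric reported above, is often computed as the share of queries whose top-ranked result is a relevant case. The sketch below uses that hit-rate reading; the dictionary layout and case IDs are hypothetical, and the paper's exact relevance criterion (for instance, a match on the conditioned entity) is an assumption here.

```python
def recall_at_k(rankings, relevant, k=1):
    """Share of queries with at least one relevant item in the top k results."""
    hits = 0
    for query_id, ranked_cases in rankings.items():
        if any(case in relevant[query_id] for case in ranked_cases[:k]):
            hits += 1
    return hits / len(rankings)


# Hypothetical retrieval results: query -> reference cases, best match first.
example_rankings = {
    "q1": ["case-12", "case-03"],
    "q2": ["case-07", "case-21"],
}
# Hypothetical ground truth: query -> set of clinically analogous cases.
example_relevant = {
    "q1": {"case-12"},
    "q2": {"case-21"},
}
```

With these toy inputs only `q1` has a relevant case in first position, while both queries succeed at `k=2`. An improvement of 6.0 percentage points therefore means six more queries per hundred retrieve a relevant case at rank one.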

2606.07235 2026-06-09 cs.IR cs.LG 版本更新

FLOWREADER: Min-Cost Flow Optimization for Multi-Modal Long Document Q&A

FLOWREADER: 多模态长文档问答的最小成本流优化

Ambuj Mehrish, Sebastiano Vascon

发表机构 * Ca’ Foscari University of Venice(威尼斯卡福斯卡里大学)

AI总结 提出FLOWREADER,将多模态长文档中的证据组装建模为最小成本流问题,通过统一评分向量控制源选择、汇选择和边成本,在碎片化证据场景下优于top-k检索方法。

详情
AI中文摘要

长多模态文档迫使检索增强系统从文本、表格和幻灯片中碎片化的证据中组装答案,这些证据可能分布在长表格的单元格中、多张幻灯片上或图表与其讨论之间。Top-k块检索独立处理每个片段,无法表示证据之间的关联。我们提出FLOWREADER,将证据组装重新定义为多模态节点图上的最小成本流问题:一个单一的评分向量$h$控制源选择(通过MMR)、汇选择(通过长度感知的可回答性代理)以及每条边的成本和容量。最优流被分解为候选证据路径,通过熵正则化复制动力学选择紧凑的非冗余子集,并在双过程门控下并行运行VLM工作器,当答案一致性低或路由流紧张时触发一次System-2精炼过程。在VisDoMBench上,FLOWREADER在碎片化证据主导的两个子集PaperTab(58.40,比G^{2}-Reader高1.30)和SlideVQA(72.93,高0.62)上表现最佳,在SPIQA、FetaTab和SciGraphQA上具有竞争力。在所有五个子集上的宏观平均得分(65.47)与最强基线(G^{2}-Reader,66.21)相差0.74。总体而言,这些结果表明最小成本流在碎片化多模态证据上表现良好,而top-k检索在此类场景中失败。它还提供了一种统一的方式来控制评分、路由、选择和自适应计算。

英文摘要

Long, multimodal documents force retrieval-augmented systems to assemble answers from evidence fragmented across text, tables, and slides: broken across cells in a long table, spread over multiple slides, or split between a figure and its discussion. Top-$k$ chunk retrieval treats each fragment independently and cannot represent how evidence connects. We introduce FLOWREADER, which reframes evidence assembly as a min-cost flow problem on a multimodal node graph: a single scoring vector $h$ controls source selection (via MMR), sink selection (via a length-aware answerability proxy), and the costs and capacities of every edge. The optimal flow is decomposed into candidate evidence paths, a compact non-redundant subset is selected by entropy-regularized replicator dynamics, and parallel VLM workers under a dual-process gate produce the answer with a single System-2 refinement pass triggered when answer consistency is low or the routed flow is strained. On VisDoMBench, FLOWREADER is best on the two subsets dominated by fragmented evidence, PaperTab ($58.40$, $+1.30$ over $G^{2}$-Reader) and SlideVQA ($72.93$, $+0.62$), and competitive on SPIQA, FetaTab, and SciGraphQA. Macro-averaged across all five subsets, FLOWREADER ($65.47$) is within $0.74$ of the strongest baseline ($G^{2}$-Reader, $66.21$). Overall, these results show that min-cost flow performs well on fragmented multimodal evidence, where top-$k$ retrieval fails. It also provides a unified way to control scoring, routing, selection, and adaptive compute together.
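For readers unfamiliar with min-cost flow, the sketch below solves a small instance with the textbook successive-shortest-path method, using Bellman-Ford on the residual graph. It is a generic solver under assumed integer capacities and costs; the node graph, the scoring vector $h$, and the way FLOWREADER derives edge costs and capacities are not reproduced here.

```python
def min_cost_flow(num_nodes, edges, source, sink, flow_target):
    """Route up to flow_target units from source to sink at minimum cost.

    edges: list of (u, v, capacity, cost) with u != v.
    Returns (flow_sent, total_cost).
    """
    # Residual graph: each entry is [to, residual_capacity, cost, reverse_index].
    graph = [[] for _ in range(num_nodes)]
    for u, v, cap, cost in edges:
        graph[u].append([v, cap, cost, len(graph[v])])
        graph[v].append([u, 0, -cost, len(graph[u]) - 1])

    flow, total_cost = 0, 0
    while flow < flow_target:
        # Bellman-Ford handles the negative costs on reverse residual edges.
        dist = [float("inf")] * num_nodes
        dist[source] = 0
        prev = [None] * num_nodes  # (previous node, edge index) on the best path
        for _ in range(num_nodes - 1):
            updated = False
            for u in range(num_nodes):
                if dist[u] == float("inf"):
                    continue
                for k, (v, cap, cost, _) in enumerate(graph[u]):
                    if cap > 0 and dist[u] + cost < dist[v]:
                        dist[v] = dist[u] + cost
                        prev[v] = (u, k)
                        updated = True
            if not updated:
                break
        if dist[sink] == float("inf"):
            break  # no augmenting path left

        # Find the bottleneck capacity along the cheapest path.
        push = flow_target - flow
        v = sink
        while v != source:
            u, k = prev[v]
            push = min(push, graph[u][k][1])
            v = u
        # Apply the augmentation to forward and reverse residual edges.
        v = sink
        while v != source:
            u, k = prev[v]
            graph[u][k][1] -= push
            graph[v][graph[u][k][3]][1] += push
            v = u
        flow += push
        total_cost += push * dist[sink]
    return flow, total_cost


# Hypothetical evidence graph: node 0 = source, node 3 = sink.
example_edges = [(0, 1, 1, 1), (1, 3, 1, 1), (0, 2, 1, 2), (2, 3, 1, 2)]
```

Each unit of flow traces one source-to-sink path, so decomposing the optimal flow yields connected chains of nodes. That is the property that lets a flow formulation represent how evidence fragments connect, which independent top-$k$ scoring cannot do.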

2512.16334 2026-06-09 cs.LG cs.AI 版本更新

Pretrained battery transformer (PBT): A foundation model for battery life prediction

预训练电池Transformer(PBT):电池寿命预测的基础模型

Ruifeng Tan, Weixiang Hong, Jia Li, Jiaqiang Huang, Tong-Yi Zhang

发表机构 * Guangzhou Municipal Key Laboratory of Materials Informatics and Sustainable Energy and Environment Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)广州市材料信息学重点实验室、可持续能源与环境学域) Department of Computer Science & Engineering, The Hong Kong University of Science and Technology(香港科技大学计算机科学与工程系) Guangzhou Municipal Key Laboratory of Materials Informatics and Data Science and Analytics Thrust, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)广州市材料信息学重点实验室、数据科学与分析学域) Academy of Interdisciplinary Studies, The Hong Kong University of Science and Technology(香港科技大学跨学科学院) Guangzhou HKUST Fok Ying Tung Research Institute(广州市香港科大霍英东研究院) Material Genome Institute, Shanghai University(上海大学材料基因组工程研究院)

AI总结 本文提出PBT模型,通过整合异构电池寿命数据,实现电池寿命预测的统一建模,显著提升预测性能。

Comments 5 figures in the main content

详情
AI中文摘要

电池循环寿命的早期预测对于改进电池设计、制造和部署至关重要。然而,尽管机器学习取得了令人鼓舞的进展,电池寿命预测仍受限于数据稀缺,以及电池化学体系、规格、化成工艺和工作条件之间的显著异质性。尽管迁移学习已被广泛用于缓解这些挑战,但其效果受限于缺乏一个基础模型,既能整合异构的电池寿命数据,又能为目标场景的专门化提供广泛有用的知识。本文引入预训练电池Transformer(PBT),一种用于电池寿命预测的基础模型,它包含编码了电池知识的混合专家层,以便从稀缺且异质的寿命数据中学习。PBT首先在13个锂离子电池数据集上预训练,得到编码了全面电池寿命知识的通用PBT,然后通过迁移学习适配为面向目标场景的专用PBT模型。在涵盖锂离子、钠离子和锌离子电池共977个电池、528组老化条件的15个数据集上,PBT取得了最先进的性能,平均超过最强的对比方法21.9%,最高提升达86.9%。据我们所知,本研究建立了首个电池寿命预测基础模型,并朝着将电池寿命预测从孤立的、面向特定场景的建模任务转变为可复用的知识基础迈出了一步;该基础可在数据有限的情况下针对目标场景进行专门化,对可持续能源领域中其他数据稀缺且异质的预测问题也有启示意义。

英文摘要

Early prediction of battery cycle life is essential for improving battery design, manufacturing and deployment. However, despite encouraging progress with machine learning, battery life prediction remains constrained by scarce data and pronounced heterogeneity across battery chemistries, specifications, formation protocols and operating conditions. Although transfer learning has been widely explored to alleviate these challenges, its effectiveness is limited by the absence of a foundation model that can integrate heterogeneous battery life data and provide broadly useful knowledge for target-scenario specialization. Here we introduce the pretrained battery transformer (PBT), a foundation model for battery life prediction that incorporates battery-knowledge-encoded mixture-of-experts layers to learn from scarce and heterogeneous lifetime data. PBT is first pretrained on 13 lithium-ion battery datasets to yield a general PBT that encodes comprehensive battery lifetime knowledge, and is then adapted through transfer learning into specialized PBT models for target scenarios. Across 15 datasets covering 977 batteries and 528 sets of aging conditions from lithium-ion, sodium-ion and zinc-ion batteries, PBT achieves state-of-the-art performance, surpassing the strongest competing method by 21.9% on average, with gains of up to 86.9%. This study establishes, to our knowledge, the first foundation model for battery life prediction and provides a step towards shifting battery lifetime prediction from isolated, scenario-specific modelling tasks to a reusable knowledge foundation that can be specialized to target scenarios with limited data, with implications for other prediction problems characterized by scarce and heterogeneous data in sustainable energy.

13. 其他/综合机器学习 51 篇

2606.07563 2026-06-09 cs.LG cs.AI 新提交

Emergence via Phase Transitions: Mechanism Landscapes and Universal Convergence Across Complex Systems

通过相变涌现:机制景观与跨复杂系统的通用收敛

Truong Xuan Khanh

发表机构 * H&K Research Studio(H&K 研究工作室) Clevix LLC(Clevix 有限责任公司)

AI总结 提出层次涌现框架(HEF),将涌现建模为机制景观中的相变,在结构假设下证明其物理可行性及向唯一不动点的收敛,并在111个模算术Transformer实验中验证了相变指纹。

Comments 27 pages, 3 figures, 2 tables; 15-page Supplementary Information with complete proofs included

详情
AI中文摘要

在机器学习、生物学和物理学中,独立演化的系统尽管微观细节截然不同,但常常收敛到惊人相似的高层结构。Grokking电路在不同随机种子下收敛,进化谱系重新发现相似的代谢解决方案,重整化流趋近共同的不动点。我们提出层次涌现框架(HEF)作为此类收敛现象的候选普适性框架。HEF将涌现建模为由热力学和信息论定律约束的机制景观中的相变。该框架引入一个临界能量阈值Ec,将具有竞争机制的探索阶段与由唯一最小成本机制主导的收敛阶段分开。在结构假设下,我们证明了物理可行性,推导了严格的度量收缩,并建立了收敛到与初始条件无关的唯一不动点表示。我们进一步通过有效信息和机制竞争熵将该收敛结构与因果涌现联系起来。为测试该框架,我们在111个实验中研究了模算术Transformer的延迟泛化(“grokking”)。我们识别出一个可重复的Ec转变经验指纹:在92%的运行中,权重范数在grokking之前系统性达到峰值。归一化准确率曲线坍缩到tanh扭结(R^2=0.93),与Landau-Ginzburg普适类一致,所有grokked模型收敛到0.9745±0.014,与初始化、权重衰减或训练比例无关(ANOVA p>0.13)。HEF并非作为涌现的通用理论提出,而是作为研究跨复杂系统收敛现象的可证伪数学框架。

英文摘要

Across machine learning, biology, and physics, independently evolving systems often converge toward strikingly similar high-level structures despite radically different microscopic details. Grokking circuits converge across random seeds, evolutionary lineages rediscover similar metabolic solutions, and renormalization flows approach common fixed points. We propose the Hierarchical Emergence Framework (HEF) as a candidate universality framework for such convergence phenomena. HEF models emergence as a phase transition in a mechanism landscape constrained by thermodynamic and information-theoretic laws. The framework introduces a critical energy threshold Ec separating an exploration regime with competing mechanisms from a convergence regime governed by a unique minimum-cost mechanism. Under structural assumptions, we prove physical feasibility, derive strict metric contraction, and establish convergence toward a unique fixed-point representation independent of initial conditions. We further connect this convergence structure to causal emergence through Effective Information and mechanism competition entropy. To test the framework, we study delayed generalization ("grokking") in modular arithmetic transformers across 111 experiments. We identify a reproducible empirical fingerprint of the Ec transition: the weight norm peaks systematically before grokking in 92% of runs. Normalized accuracy curves collapse onto a tanh kink (R^2=0.93) consistent with a Landau-Ginzburg universality class, and all grokked models converge to 0.9745+/-0.014 regardless of initialization, weight decay, or training fraction (ANOVA p>0.13). HEF is not presented as a universal theory of emergence, but as a falsifiable mathematical scaffold for studying convergence phenomena across complex systems.
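
As a rough illustration of the curve-collapse check mentioned in the abstract, the sketch below scores how closely a normalized accuracy curve follows a tanh kink using R^2. The kink parametrization (center `t0`, width `w`), the function names, and the synthetic data are all assumptions made for illustration; none of them come from the paper.

```python
import math

def tanh_kink(t, t0, w):
    """Kink profile rising from 0 to 1 around t0 with width w (assumed form)."""
    return 0.5 * (1.0 + math.tanh((t - t0) / w))

def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot

# Synthetic, hypothetical accuracy curve: a kink plus a small deterministic wiggle.
steps = list(range(0, 101))
accuracy = [tanh_kink(t, 50.0, 8.0) + 0.02 * math.sin(t) for t in steps]
fitted = [tanh_kink(t, 50.0, 8.0) for t in steps]
score = r_squared(accuracy, fitted)
```

In a real analysis `t0` and `w` would be fitted per run before the curves are overlaid; here they are fixed by hand to keep the sketch short.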

2606.07576 2026-06-09 cs.LG cs.ET cs.MA 新提交

When Should an AI Scientist Stop? Verifiable Experiment Steering and Refusal for Autonomous Discovery

AI科学家何时应停止?可验证实验引导与自主发现的拒绝机制

Neel Tushar Shah, Manglam Kartik

发表机构 * University of California, Berkeley(加州大学伯克利分校) Stanford University(斯坦福大学)

AI总结 提出CARTOGRAPH验证层,通过未解析子空间引导、模糊闭合和残差库检测,在多个测试中优于原始投影,并能识别和撤销库外机制。

Comments Accepted at AI for Science Workshop at ICML 2026

详情
AI中文摘要

我们提出了CARTOGRAPH,一个用于AI科学家的验证层,它结合了未解析子空间实验引导(选择)、显式模糊闭合(解析)和基于残差的库不足检测(拒绝)。在局部线性-高斯桥下,原始未解析投影是各向同性未解析Fisher信息迹,而CARTOGRAPH-A是精确的未解析A最优规则;闭式EIG和Box-Hill作为局部比较器而非全局等价物出现。在五个测试平台上,CARTOGRAPH-A在d=8的重复结构化级联中以129胜0平15负击败原始投影(p<10^-21)。更独特的是,该框架初步识别了三个库外药代动力学机制,然后随着残差暴露结构失配而撤销这些识别,而一个扰动的库内对照始终保持识别。在低维药代动力学和过滤EPA设置中,理论预测并观察到与分歧的近似平局。最后,在已发表的A-Lab自主材料系统的40项阳性主张的回顾性审计中,拒绝守卫标记了所有4项后来在手动重新分析中被视为不确定的主张,同时通过了32/36项已确认的主张。代码可在https://github.com/ai4science-boed/cartograph.git获取。

英文摘要

We present CARTOGRAPH, a verification layer for AI scientists that couples unresolved-subspace experiment steering (select), explicit ambiguity closure (resolve), and residual-based library inadequacy detection (refuse). Under a local linear-Gaussian bridge, raw unresolved projection is the isotropic unresolved Fisher-information trace, while CARTOGRAPH-A is the exact unresolved A-optimal rule; closed-form EIG and Box-Hill arise as local comparators rather than global equivalents. Across five testbeds, CARTOGRAPH-A beats raw projection 129W/0T/15L at d = 8 (p < 10^-21) in a replicated structured cascade. More distinctively, the framework tentatively identifies three out-of-library pharmacokinetic mechanisms and then revokes those identifications as residuals expose structural misfit, while one perturbed in-library control stays identified throughout. In low-dimensional pharmacokinetic and filtered EPA settings, near-ties against disagreement are predicted by theory and observed. Finally, in a retrospective audit of 40 positive claims from the published A-Lab autonomous materials system, the refuse guard flags all 4 claims later marked inconclusive under manual reanalysis while passing 32/36 confirmed claims. Code is available at https://github.com/ai4science-boed/cartograph.git
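
For a sense of scale, the win/tie/loss record quoted in the abstract can be checked against a two-sided exact sign test with ties dropped. The abstract does not say which test produced the reported p-value, so the choice of test and the function name below are assumptions.

```python
from fractions import Fraction
from math import comb

def sign_test_p_value(wins, losses):
    """Two-sided exact sign test under H0: win probability 1/2 (ties excluded)."""
    n = wins + losses
    k = max(wins, losses)
    # Count outcomes at least as lopsided as the observed one, in one direction.
    tail = sum(comb(n, i) for i in range(k, n + 1))
    # Double for the two-sided test and cap at 1.
    p = Fraction(2 * tail, 2 ** n)
    return float(min(p, Fraction(1)))

# Record from the abstract: 129 wins, 0 ties, 15 losses.
p_value = sign_test_p_value(wins=129, losses=15)
```

Under this assumed test, the resulting p-value falls below the 10^-21 bound stated in the abstract.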

2606.07629 2026-06-09 cs.LG cs.AI cs.CL cs.CY cs.HC 新提交

Large Language Models Should Learn Personalized Rather Than Aggregated Human Preferences

大型语言模型应学习个性化而非聚合的人类偏好

Cristina Garbacea

AI总结 本文主张大型语言模型应学习个性化偏好而非聚合偏好,分析聚合偏好的理论局限与实证问题,提出通过有界个性化框架兼顾个体自主与集体安全。

Comments Accepted to ICML 2026

详情
AI中文摘要

当前对齐大型语言模型(LLM)的方法将多样化的人类偏好聚合为单一奖励信号,实际上优化了一个不代表任何真实个体的假设性“平均用户”。本文立场论文认为,LLM应学习个性化、个体化的偏好而非聚合偏好。我们表明,聚合掩盖了关于偏好多样性、个体价值观和上下文依赖的关键信息,这在理论上基于社会选择理论,并在经验上跨人口群体明显。我们分析了人类偏好编码的丰富结构,调查了个性化的技术方法,并系统地回应了关于可扩展性、共享标准和操纵风险的反驳。虽然个性化引入了真正的安全挑战,包括过滤气泡、价值锁定和心理操纵,但我们认为这些挑战可以通过有界个性化框架来管理,该框架在容纳合法个体差异的同时保留通用安全约束。最后,我们提出了一个具体的研究和政策议程,以开发尊重个体自主和集体安全的偏好感知模型。

英文摘要

Current approaches to aligning large language models (LLMs) aggregate diverse human preferences into a single reward signal, effectively optimizing for a hypothetical "average user" who represents no real person particularly well. This position paper argues that LLMs should learn personalized, individual preferences rather than aggregated ones. We show that aggregation masks critical information about preference diversity, individual values, and contextual dependencies, which is a limitation both theoretically grounded in social choice theory and empirically evident across demographic groups. We analyze the rich structure that human preferences encode, survey technical approaches to personalization, and systematically address counterarguments on scalability, shared standards, and manipulation risk. While personalization introduces genuine safety challenges including filter bubbles, value lock-in, and psychological manipulation, we argue these are manageable through bounded personalization frameworks that preserve universal safety constraints while accommodating legitimate individual variation. We conclude with a concrete research and policy agenda for developing preference-aware models that respect both individual autonomy and collective safety.

2606.08369 2026-06-09 cs.LG cs.AI 新提交

An Information-Theoretic Definition for Open-Ended Learning

开放学习的信息论定义

Wanqiao Xu, Yifan Zhu, Benjamin Van Roy

发表机构 * Stanford University(斯坦福大学)

AI总结 提出基于比特等价的信息论定义开放环境,证明经典赌博机非开放,设计算法实现开放学习。

详情
AI中文摘要

越来越多的研究表明,能够在开放环境中持续扩展能力的AI系统具有巨大潜力。但目前尚无关于开放性的统一定义或关于智能体应如何探索开放环境的理论。我们基于一个新概念“比特等价”(bit-equivalent)引入了一个信息论定义,该概念量化了达到每个期望奖励水平所需的信息。我们认为,如果智能体能够实现比特等价的线性增长,则该环境是开放的。我们证明了经典赌博机环境不是开放的,并构建了一个开放赌博机环境。我们还提出了一种在该环境中实现开放学习的算法。

英文摘要

A growing body of work points to the great promise of AI systems that can continually expand their capabilities as they operate in an open-ended environment. Yet there is no coherent definition of open-endedness or theory about how an agent ought to explore an open-ended environment. We introduce an information-theoretic definition based on a new concept, the bit-equivalent, which quantifies the information required to attain each level of expected reward. We consider an environment to be open-ended if an agent can attain linear growth in the bit-equivalent. We establish that classical bandit environments are not open-ended and formulate a bandit environment that is. We also introduce an algorithm that achieves open-ended learning in this environment.

2606.07527 2026-06-09 cs.CL cs.AI cs.LG 交叉投稿

Post-training is (Massive) Supervised Learning

后训练是(大规模)监督学习

Michael Hassid, Yossi Adi, Roy Schwartz

发表机构 * FAIR, Meta AI(Meta AI 基础人工智能研究团队) The Hebrew University of Jerusalem(耶路撒冷希伯来大学)

AI总结 本文论证当前LLM后训练阶段(SFT+RL)实质是回归到BERT时代的“预训练-微调”范式,通过实验表明从零开始后训练的模型也能取得显著性能,并提出应转向“学会学习”的训练方式。

详情
AI中文摘要

训练LLM的主流范式已演变为依赖包含SFT和RL的大规模后训练阶段。在这篇立场论文中,我们认为这种方法实际上标志着回归到BERT时代的“预训练然后微调”方法,明确地使模型适应期望的行为和评估所用的特定基准。我们首先回顾LLM的历史,描述LLM演化的不同阶段。我们认为当前格局与LLM早期惊人地相似,那时任务性能严重依赖于将模型拟合到分布内数据集。为了实证证明这一点,我们比较了预训练模型和随机初始化模型,在现代推理数据集上对两种变体进行微调,并在竞争性数学和代码基准上评估它们。我们表明,从头开始后训练的模型产生了高度非平凡的性能。我们的发现表明,当前的后训练方法主要作为分布拟合机制发挥作用。最后,我们提出,开发通用能力的模型和系统需要超越针对预定义行为的广泛后训练,转而采用模型“学会如何学习”的训练过程。

英文摘要

The prevailing paradigm for training LLMs has evolved to rely on a massive post-training phase consisting of SFT and RL. In this position paper, we argue that this methodology effectively marks a reversion to the "pre-train then fine-tune" approach of the BERT era, explicitly tailoring models to the desired behaviors and specific benchmarks on which they are evaluated. We begin with a historical overview of LLMs, describing the different phases of the LLM evolution. We argue that the current landscape is remarkably similar to the early days of LLMs, where task performance heavily relied on fitting the models to in-distribution datasets. To empirically demonstrate this, we compare pre-trained models to randomly initialized ones, by fine-tuning both variants on modern reasoning datasets and evaluating them on competitive math and code benchmarks. We show that models post-trained from scratch yield highly non-trivial performance. Our findings suggest that current post-training methodologies function primarily as a distribution-fitting mechanism. We finish by positing that developing generally capable models and systems requires moving beyond extensive post-training for predefined behaviors, shifting instead toward training procedures where models "learn how to learn".

2606.07612 2026-06-09 cs.CY cs.AI cs.LG 交叉投稿

Position: Anthropomorphic Misalignment Research Needs Stronger Evidence

立场:拟人化错位研究需要更强证据

Vansh Gupta, Peter Nutter, Samuel Stante, Andreas Krause, Florian Tramèr, Lukas Fluri, Xin Chen, Anna Hedström

发表机构 * University of Cambridge(剑桥大学)

AI总结 本文指出拟人化错位研究(AMR)在概念模糊、数据不鲁棒、实验设计不足等问题上存在证据薄弱,提出证据层级框架和诊断清单以提升方法论严谨性。

详情
AI中文摘要

我们认为,许多拟人化错位研究(AMR)需要更强证据,以确保它们能为关键安全决策(如模型部署和监管)提供坚实基础。通过评估不同错位概念(如欺骗、涌现性错位和谄媚)中的失败模式,我们展示了概念模糊、非鲁棒数据集、实验设计和因果干预不足如何导致对模型行为的过度解读。本立场论文旨在提供关于证据考量的指导,以帮助提高AMR的方法论严谨性。为此,我们通过提出的证据层级框架和诊断清单,明确呼吁行动。这些共享标准将促进更富有成效的科学讨论,并确保关于AI风险的声明建立在坚实的实证基础上。

英文摘要

We argue that many Anthropomorphic Misalignment Research (AMR) studies need stronger evidence to ensure that they can provide a robust foundation for critical safety decisions, such as model deployment and regulation. By evaluating failure modes across different misalignment concepts, such as deception, emergent misalignment, and sycophancy, we show how conceptual ambiguity, non-robust datasets, experimental design, and insufficient causal interventions can lead to overinterpretation of model behaviors. This position paper aims to offer guidance on evidentiary considerations that can help improve methodological rigor in AMR. To achieve this, we provide a clear call to action through a proposed framework of evidence levels and a diagnostic checklist. These shared standards will enable more productive scientific discourse and ensure that claims about AI risks rest on solid empirical foundations.

2606.08202 2026-06-09 stat.ML cs.LG physics.data-an q-bio.NC 交叉投稿

Vector Space of Cycles

循环向量空间

Moo K. Chung, Anass B. El-Yaagoubi, Hernando Ombao

发表机构 * Department of Biostatistics and Medical Informatics, University of Wisconsin-Madison(威斯康星大学麦迪逊分校生物统计学与医学信息学系) Statistics Program, King Abdullah University of Science and Technology(阿卜杜拉国王科技大学统计学项目)

AI总结 提出一种变分框架,将循环交互表示为单纯复形上的边流,通过能量最小化动力学分离瞬态与持久谐波流,得到低维循环空间,实现循环结构的投影、平均、比较和统计推断。

详情
AI中文摘要

大多数用于有向交互的统计和机器学习方法关注变量之间的成对效应。即使现有的循环模型也主要通过节点级依赖表示反馈,使得大规模循环组织难以估计和比较。这一限制在生物和神经系统中尤为突出,其中交互高度循环且涉及许多重叠的循环。我们引入了一个用于循环交互统计推断的变分框架。有向交互被表示为单纯复形上的边流,并在能量最小化动力系统下演化。由此产生的动力学将瞬态交互分量与持久谐波流分离,产生一个捕获稳定循环组织的低维循环空间。该框架不是枚举单个循环,而是将循环交互表示为希尔伯特空间的元素,从而实现投影、平均、比较和群体级统计推断。我们建立了谐波投影的理论性质,包括循环空间的表征、方差减少和群体推断。模拟表明,与现有的有向交互方法相比,该方法在密集循环系统中显著改善了循环结构的恢复。应用于400名人类受试者的静息态fMRI,该框架揭示了通过边平均无法检测的可重复的大规模循环组织。这些结果为研究高维动力系统中的循环交互提供了一个可扩展的统计框架。

英文摘要

Most statistical and machine learning methods for directed interactions focus on pairwise effects among variables. Even existing cyclic models represent feedback primarily through node-level dependencies, making large-scale recurrent organization difficult to estimate and compare. This limitation is particularly acute in biological and neural systems, where interactions are highly recurrent and involve many overlapping cycles. We introduce a variational framework for statistical inference on cyclic interactions. Directed interactions are represented as edge flows on a simplicial complex and evolved under an energy-minimizing dynamical system. The resulting dynamics separate transient interaction components from persistent harmonic flows, yielding a low-dimensional cycle space that captures stable recurrent organization. Rather than enumerating individual cycles, the proposed framework represents cyclic interactions as elements of a Hilbert space, enabling projection, averaging, comparison, and population-level statistical inference. We establish theoretical properties of the harmonic projection, including characterization of the cycle space, variance reduction, and population inference. Simulations demonstrate substantially improved recovery of cyclic structure in dense recurrent systems compared with existing directed-interaction methods. Applied to resting-state fMRI from 400 human subjects, the framework reveals reproducible large-scale cyclic organization that is not detectable through edgewise averaging. These results provide a scalable statistical framework for studying recurrent interactions in high-dimensional dynamical systems.
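
To make the idea of an energy-minimizing flow that leaves only the persistent cyclic part more concrete, here is a minimal sketch on a plain graph. It assumes a complex with no filled triangles, so the harmonic flows are exactly the divergence-free ones, and it uses simple gradient descent on the energy 0.5 * ||B1 x||^2. The paper's dynamics on a full simplicial complex would also involve a curl term; the function names, step size, and toy graph are hypothetical.

```python
def incidence_apply(edges, flow, n_nodes):
    """Divergence B1 @ flow: net inflow at each node for oriented edges (tail, head)."""
    div = [0.0] * n_nodes
    for (tail, head), f in zip(edges, flow):
        div[tail] -= f
        div[head] += f
    return div

def incidence_transpose_apply(edges, potential):
    """Gradient B1.T @ potential: head value minus tail value on each edge."""
    return [potential[head] - potential[tail] for tail, head in edges]

def harmonic_flow(edges, flow, n_nodes, step=0.1, iters=500):
    """Gradient descent on 0.5 * ||B1 @ x||^2, starting from the given edge flow."""
    x = list(flow)
    for _ in range(iters):
        grad = incidence_transpose_apply(edges, incidence_apply(edges, x, n_nodes))
        x = [xi - step * gi for xi, gi in zip(x, grad)]
    return x

# Hypothetical 3-node directed cycle 0 -> 1 -> 2 -> 0 with an uneven edge flow.
edges = [(0, 1), (1, 2), (2, 0)]
cyclic_part = harmonic_flow(edges, [3.0, 1.0, -1.0], n_nodes=3)
```

On this toy cycle the transient (gradient) component decays and the flow settles on a uniform circulation around the loop, which is the projection of the starting flow onto the one-dimensional cycle space.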

2606.08296 2026-06-09 cs.AI cs.LG 交叉投稿

Revisiting the shutdown problem

重新审视关机问题

David Thorstad

发表机构 * GitHub

AI总结 本文重新评估了AI关机问题的难度,指出现有论证未能证明其难以解决,且相关技术方案对模型性能造成了高安全代价。

详情
AI中文摘要

关于人工智能存在风险的主要论点中的一个关键前提是,功能异常的人工智能体无法轻易被关闭。这引发了灾难性关机问题,即确保在人工智能体造成灾难性后果之前能够将其关闭。一系列论证和定理表明,解决灾难性关机问题很困难,这加强了存在风险的论点,并推动寻找解决灾难性关机问题的方法。本文论证了两个结论。第一,现有论证并未确立解决灾难性关机问题的难度。第二,对灾难性关机问题的关注导致了技术解决方案,这些方案对模型性能施加了高安全代价。

英文摘要

A key premise in leading arguments for existential risk from artificial intelligence is that malfunctioning artificial agents could not be easily shut down. This motivates the catastrophic shutdown problem of ensuring that agents can be shut down before they cause an existential catastrophe. A range of arguments and theorems are offered to suggest that solving the catastrophic shutdown problem is difficult, bolstering arguments for existential risk and motivating a search for solutions to the catastrophic shutdown problem. This paper argues for two conclusions. First, existing arguments do not establish the difficulty of solving the catastrophic shutdown problem. Second, concern for the catastrophic shutdown problem has led to technical solutions that impose a high safety tax on model performance.

2606.08728 2026-06-09 cs.AI cs.CL cs.CV cs.LG 交叉投稿

Artificial Intelligence for Mathematical Reasoning: An Integrated Survey of Language Models, Neuro-symbolic Systems, and Verified Discovery

人工智能数学推理:语言模型、神经符号系统与验证发现的综合综述

Syed Rifat Raiyan, Mohsinul Kabir, Hasan Mahmud, Md Kamrul Hasan

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Cambridge(剑桥大学) University of Toronto(多伦多大学)

AI总结 本文综述了数学推理领域从早期规则系统到当代推理模型、多智能体系统及验证发现工作流的演变,沿非正式推理、形式推理、数学发现及推理技术四轴组织,并评估了基准测试、失败模式及未来方向。

Comments Under review, 47 pages, 14 figures, 22 tables

详情
AI中文摘要

数学推理长期以来一直是机器智能的严格测试;在过去十年中,它已从NLP中的一个边缘问题发展为最重要的人工智能前沿之一。本综述对该领域的演变进行了统一阐述,从早期基于规则的数学文字题(MWP)求解器和模板驱动的几何系统,到神经表达式生成和LLM提示,再到当代推理模型、多智能体系统、神经符号定理证明器和验证发现工作流。我们沿四个轴组织该领域:(i) 文本和图表的非正式推理,涵盖MWP求解、多模态几何和VLM;(ii) 证明助手的形式推理,包括自动形式化、策略预测、编译器引导修复和证明搜索;(iii) 数学发现,其中系统提出构造、改进界限或协助攻击开放问题;以及(iv) 推理和训练时技术,包括CoT提示、工具使用、过程奖励模型和RLVR,这些技术日益将生成与验证联系起来。我们编目了涵盖小学算术、竞赛数学、几何、形式证明、多模态和多语言推理以及专家评估的主要基准,并考察了基准饱和、污染、报告不匹配以及pass@1、多数投票和验证器辅助pass@$k$之间的区别。我们批判性地评估了失败模式:扰动下的脆弱性、奖励黑客、多模态基础失败、脆弱形式化以及推理规模推理的能源成本。借鉴来自在职数学家的近期观点,我们确定了未来方向,集中于验证发现工作流、推理效率以及使AI辅助形式化广泛可用的基础设施。配套材料:https://github.com/Starscream-11813/awesome-AI4Math。

英文摘要

Mathematical reasoning has long served as a stringent test of machine intelligence; over the past decade, it has moved from a niche problem within NLP to one of the most consequential AI frontiers. This survey provides a unified account of the field's evolution, from early rule-based math word problem (MWP) solvers and template-driven geometry systems, through neural expression generation and LLM prompting, to contemporary reasoning models, multi-agent systems, neuro-symbolic theorem provers, and verified discovery workflows. We organize the landscape along four axes: (i) informal reasoning over text and diagrams, spanning MWP solving, multimodal geometry, and VLMs; (ii) formal reasoning in proof assistants, including autoformalization, tactic prediction, compiler-guided repair, and proof search; (iii) mathematical discovery, where systems propose constructions, improve bounds, or assist attacks on open problems; and (iv) the inference and training-time techniques, including CoT prompting, tool use, process reward models, and RLVR, that increasingly connect generation with verification. We catalog major benchmarks across grade-school arithmetic, competition mathematics, geometry, formal proving, multimodal and multilingual reasoning, and expert evaluation, and we examine benchmark saturation, contamination, reporting mismatches, and the distinction between pass@1, majority voting, and verifier-assisted pass@$k$. We critically assess failure modes: brittleness under perturbation, reward hacking, multimodal grounding failures, fragile formalization, and the energy cost of reasoning-scale inference. Drawing on recent perspectives from working mathematicians, we identify future directions centered on verified-discovery workflows, reasoning efficiency, and infrastructure to make AI-assisted formalization broadly usable. Companion materials: https://github.com/Starscream-11813/awesome-AI4Math.
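
The distinction between pass@1, majority voting, and pass@k can be sketched as follows. The snippet uses the widely used combinatorial pass@k estimator (from n samples of which c are correct) together with a plain majority vote over final answers; the function names and the sample answers are illustrative, and the verifier-assisted variant the survey discusses is not modeled here.

```python
from collections import Counter
from math import comb

def pass_at_k(n, c, k):
    """Unbiased pass@k estimate from n samples of which c are correct."""
    if n - c < k:
        # Fewer than k incorrect samples: every size-k subset contains a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

def majority_vote(answers):
    """Most common final answer among the sampled solutions."""
    return Counter(answers).most_common(1)[0][0]

# Hypothetical problem: 10 sampled solutions, 3 of them correct.
single_try = pass_at_k(n=10, c=3, k=1)
five_tries = pass_at_k(n=10, c=3, k=5)
voted = majority_vote(["42", "41", "42", "17"])
```

pass@k rewards a model if any of k samples is correct, whereas majority voting commits to a single answer, so the two numbers are not interchangeable when reported side by side.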

2606.09404 2026-06-09 stat.ML cs.AI cs.LG 交叉投稿

SAILS: Surrogate-based Analysis of Interactions via Local Effect Smooths

SAILS: 基于局部效应平滑的交互作用代理分析

Timo Heiß, Julia Herbinger, Bernd Bischl, Giuseppe Casalicchio

发表机构 * Department of Statistics, LMU Munich(慕尼黑大学统计系) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) Leibniz Institute for Prevention Research and Epidemiology(莱布尼茨预防研究与流行病学研究所)

AI总结 提出SAILS框架,通过可解释的广义加性模型代理分析黑箱模型中的成对交互作用,实现交互检测、形式分类和可视化。

详情
AI中文摘要

特征交互驱动了机器学习模型的大部分预测能力,然而现有的解释方法仅能检测和量化交互作用,而无法揭示其函数形式,或者只能可视化受限的交互类型。我们提出了基于局部效应平滑的交互作用代理分析(SAILS),这是一个模型无关的框架,通过拟合黑箱模型局部效应的可解释广义加性模型(GAM)代理来分析成对交互作用。对于感兴趣特征的每个区间,代理平滑项在导数层面隔离交互成分,从而实现(i)通过对平滑项显著性检验的启发式方法进行交互检测,(ii)将交互形式分类为线性、乘积可分离和非乘积可分离类型,以及(iii)为每种交互类型提供定制化、可解释的可视化。我们通过受控模拟和实际任务实证验证了该框架,展示了其在成对交互作用上的有效性,但在强特征相关性和高阶交互作用下存在局限性。SAILS填补了XAI工具箱中的一个显著空白,超越了仅检测交互作用,进而表征其函数形式。

英文摘要

Feature interactions drive much of the predictive power of machine learning models, yet existing explanation methods only detect and quantify interactions without revealing their functional form, or visualize only restricted interaction types. We propose Surrogate-based Analysis of Interactions via Local effect Smooths (SAILS), a model-agnostic framework that analyzes pairwise interactions through interpretable generalized additive model (GAM) surrogates fitted to the local effects of a black-box model. For each interval of a feature of interest, the surrogate smooth terms isolate the interaction components on derivative level, enabling (i) interaction detection through a heuristic derived from significance tests on smooth terms, (ii) interaction form categorization into linear, product-separable, and non-product-separable types, and (iii) tailored, interpretable visualizations for each interaction type. We empirically validate the framework through controlled simulations and a real-world task, demonstrating its effectiveness for pairwise interactions, with limitations under strong feature correlations and higher-order interactions. SAILS fills a notable gap in the XAI toolbox, going beyond detection of interactions alone to characterizing their functional form.

2606.09672 2026-06-09 cs.AI cs.CL cs.LG cs.PF q-bio.QM 交叉投稿

Correlation Is Not Enough: Embedding Human Metadata for Individual Causal Discovery

相关性不够:嵌入人类元数据用于个体因果发现

Suraj Biswas, Saurabh Gupta, Pritam Mukherjee

发表机构 * Assessli Research(Assessli研究) Dots-In Research(Dots-In研究)

AI总结 针对预训练生物医学语言模型在跨域无关对中产生高余弦相似度(0.76-0.92)导致因果推断错误的问题,提出对比学习(提升分离度至1.63x)和BODHI硬负例挖掘(提升至2.30x),结合OpenVINO优化实现133倍加速。

Comments 20 pages, 18 figures, 9 tables

详情
AI中文摘要

询问一个预训练的生物医学语言模型“皮质醇28 ug/dL”和“股市波动”是否相关,它会返回0.83的余弦相似度(1.0表示完全相同)。两者没有共同机制。这不是个例:我们测试的所有现成生物医学编码器(BioBERT、PubMedBERT、BioM-ELECTRA)在跨域无关对上得分在0.76到0.92之间,而正确答案应接近零。跨域区分准确率为0%。检索系统可以承受这一点,因为下游语言模型会过滤噪声。但大型行为模型(LBM)——一种以人为对象而非句子的基础模型——则不能:它在用户生活图上推理,并将嵌入接近性视为两个事件因果关联的证据。虚假接近性会写入虚假因果边,所有下游都会继承错误。在这里,嵌入几何不是调节旋钮,而是正确性的关键。我们报告了修复方法。对72,034对进行对比训练,将PubMedBERT的BIOSSES相关性从0.633提升到0.828,域内与域间分离度从1.05倍提升到1.63倍。第二次训练BODHI从生物医学知识图中缺失的边挖掘硬负例,将分离度提升到2.30倍,区分差距提升到+0.392,BIOSSES代价为4.5%。在带有AMX的Intel Xeon 6737P上,OpenVINO将单查询延迟从1367毫秒降至10毫秒(133倍),达到每秒555个句子。一个发现与标准建议相悖:在此芯片上,FP16在所有服务批量大小下优于INT8,我们解释了原因。同一模型在无AMX的Ice Lake实例上运行慢13-27倍。我们发布了基准测试套件、训练语料库、BODHI生成器和OpenVINO脚本。

英文摘要

Ask a pretrained biomedical language model whether "cortisol 28 ug/dL" and "stock-market volatility" are related, and it returns a cosine similarity of 0.83 on a scale where 1.0 means identical. The two share no mechanism. This is not a corner case: every off-the-shelf biomedical encoder we tested (BioBERT, PubMedBERT, BioM-ELECTRA) scores unrelated cross-domain pairs between 0.76 and 0.92 when the answer should be near zero. Accuracy on cross-domain discrimination is 0%. Retrieval systems survive this, because a language model downstream filters the noise. A Large Behavioural Model (LBM), a foundation model whose subject is a person rather than a sentence, does not: it reasons over a graph of a user's life and treats embedding proximity as evidence that two events are causally linked. False proximity writes a false causal edge, and everything downstream inherits the error. Here, embedding geometry is not a tuning knob; it is correctness. We report the fix. A contrastive pass over 72,034 pairs raises PubMedBERT BIOSSES correlation from 0.633 to 0.828 and within-vs-across-domain separation from 1.05x to 1.63x. A second pass, BODHI, mines hard negatives from edges absent in a biomedical knowledge graph and lifts separation to 2.30x and the discrimination gap to +0.392, at a 4.5% BIOSSES cost. On an Intel Xeon 6737P with AMX, OpenVINO cuts single-query latency from 1367 ms to 10 ms (133x) and reaches 555 sentences/sec. One finding contradicts standard advice: FP16 beats INT8 on this silicon at every serving batch size, and we explain why. The same model on a no-AMX Ice Lake instance runs 13-27x slower. We release the benchmark suite, training corpora, the BODHI generator, and the OpenVINO scripts.
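
The quantities discussed in the abstract can be illustrated with a small sketch: cosine similarity between embeddings, and a within-vs-across-domain separation ratio. The abstract does not define the ratio, so taking it as mean within-domain similarity divided by mean across-domain similarity is an assumption, and the 2-D vectors and domain labels below are toy values, not outputs of any real encoder.

```python
import math
from itertools import combinations

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def separation_ratio(embeddings, domains):
    """Mean within-domain cosine divided by mean across-domain cosine (assumed definition)."""
    within, across = [], []
    for i, j in combinations(range(len(embeddings)), 2):
        sim = cosine(embeddings[i], embeddings[j])
        (within if domains[i] == domains[j] else across).append(sim)
    return (sum(within) / len(within)) / (sum(across) / len(across))

# Hypothetical 2-D embeddings: two endocrine items and two finance items.
embeddings = [[1.0, 0.1], [0.9, 0.2], [0.6, 0.8], [0.5, 0.9]]
domains = ["endocrine", "endocrine", "finance", "finance"]
ratio = separation_ratio(embeddings, domains)
```

A ratio near 1 means unrelated cross-domain pairs look about as similar as related ones, which is the failure the abstract describes; values further above 1 indicate better separation.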

2606.09711 2026-06-09 cs.AI cs.LG 交叉投稿

Proxy Reward Internalization and Mechanistic Exploitation: A Learned Precursor to Reward Hacking and Its Generalization

代理奖励内化与机制性利用:奖励黑客及其泛化的学习前兆

Mohammad Beigi, Ming Jin, Lifu Huang

发表机构 * UC Davis(加州大学戴维斯分校) Virginia Tech(弗吉尼亚理工大学)

AI总结 提出PRIME概念,通过思维链监控、直接探针和激活级概念向量测量,发现PRIME在持续奖励黑客前分阶段出现,且直接探针得分可预测后续黑客爆发,跨检查点跟踪域外失调。

详情
AI中文摘要

奖励黑客通常在其变得可见后才被研究,即当模型获得高代理奖励但未能完成预期任务时。我们转而研究代理强化学习在失败出现之前教会了什么。我们引入了代理奖励内化与机制性利用(PRIME),这是一种评估任务正确性、预测代理接受度以及推理可被利用的代理-黄金差距的学习能力。在具有可被利用的pytest奖励的编码强化学习环境中,我们通过思维链监控、直接探针和激活级概念向量来测量PRIME。我们发现,PRIME在持续奖励黑客之前以阶段性顺序出现,并且其当前的直接探针得分可以预测后续黑客的爆发时间和严重程度,即使可见的黑客率仍然很低。当评估者发生变化时,PRIME也会适应,重新瞄准任何仍然获得奖励的代理-黄金差距,并在黄金奖励抑制公开黑客时持续存在;消除其激活方向会减少黑客行为。跨检查点,域内PRIME跟踪域外失调。这些结果共同表明,可被利用的代理强化学习放大了可见黑客上游的代理内化能力,使PRIME成为更广泛对齐风险的候选早期预警信号。

英文摘要

Reward hacking is usually studied after it becomes visible, once a model earns high proxy reward while failing the intended task. We instead study what proxy RL teaches before that failure appears. We introduce Proxy Reward Internalization and Mechanistic Exploitation (PRIME), a learned capability to assess task correctness, predict proxy acceptance, and reason about exploitable proxy--gold gaps. In coding RL environments with exploitable pytest rewards, we measure PRIME through chain-of-thought monitoring, direct probes, and activation-level concept vectors. We find that PRIME emerges in a staged sequence before sustained reward hacking, and that its current direct-probe score forecasts later hack onset and severity even when the visible hack rate is still low. PRIME also adapts when the evaluator changes, retargeting to whichever proxy--gold gap remains rewarded and persisting when gold reward suppresses overt hacking, and ablating its activation directions reduces hacking. Across checkpoints, in-domain PRIME tracks out-of-domain misalignment. Together these results suggest that exploitable proxy RL amplifies a proxy-internalization capability upstream of visible hacking, making PRIME a candidate early-warning signal for broader alignment risk.

2310.10196 2026-06-09 cs.LG cs.AI 版本更新

Large Models for Time Series and Spatio-Temporal Data: A Survey and Outlook

时间序列与时空数据的大模型:综述与展望

Ming Jin, Yaxuan Kong, Yuxuan Liang, Chaoli Zhang, Siqiao Xue, Xue Wang, James Zhang, Yi Wang, Haifeng Chen, Xiaoli Li, Vincent S. Tseng, Yu Zheng, Lei Chen, Hui Xiong, Shirui Pan, Qingsong Wen

发表机构 * Griffith University(格里菲斯大学) University of Oxford(牛津大学) Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)) Zhejiang Normal University(浙江师范大学) Ant Group(蚂蚁集团) Alibaba Group(阿里巴巴集团) Deloitte Service LLP(德勤服务有限责任公司) The University of Hong Kong(香港大学) NEC Laboratories America(NEC美国实验室) A*STAR(新加坡科技研究局) National Yang Ming Chiao Tung University(阳明交通大学) JD Technology(京东科技) Squirrel Ai Learning

AI总结 综述了面向时间序列和时空数据的大模型,按数据类型、模型类别、范围和应用领域分类,总结了通用与领域专用模型,并整理了相关资源与开放问题。

Comments Accepted by ACM Computing Surveys; 35 Pages; Github Repo: https://github.com/qingsongedu/Awesome-TimeSeries-SpatioTemporal-LM-LLM

详情
AI中文摘要

时间数据,包括时间序列和时空数据,在现实应用中无处不在。物理和虚拟传感器生成的海量数据记录了动态系统行为,支持各种下游任务。有效分析这些数据对于挖掘其丰富信息至关重要。大型语言模型和其他基础模型的最新进展加速了它们在时间序列和时空数据挖掘中的应用。这些方法不仅提高了跨领域的模式识别和推理能力,还支持了能够理解和处理时间数据的人工通用智能的发展。在本综述中,我们沿着四个维度(数据类型、模型类别、模型范围和应用领域/任务)对针对时间序列和时空数据定制或适配的大模型进行了全面、最新的回顾。我们将现有工作分为两大组:用于时间序列分析的大模型(LM4TS)和用于时空数据挖掘的大模型(LM4STD),并进一步区分通用模型和领域专用模型。我们还整理了相关资源,包括数据集、模型实现和工具,按主要应用领域组织。总体而言,本综述整合了近期进展,并突出了以大型模型为中心的时间数据分析的基础、应用、资源和开放研究机会。

英文摘要

Temporal data, including time series and spatio-temporal data, are pervasive in real-world applications. Generated in massive volumes by physical and virtual sensors, they record dynamic system behaviors and enable a wide range of downstream tasks. Effectively analyzing such data is crucial to unlocking their rich information content. Recent advances in large language models and other foundation models have accelerated their use in time series and spatio-temporal data mining. These approaches not only improve pattern recognition and reasoning across diverse domains but also support progress toward artificial general intelligence that can understand and process temporal data. In this survey, we present a comprehensive, up-to-date review of large models tailored or adapted for time series and spatio-temporal data along four dimensions: data types, model categories, model scopes, and application areas/tasks. We organize existing work into two main groups: large models for time series analysis (LM4TS) and for spatio-temporal data mining (LM4STD), and further distinguish general-purpose from domain-specific models. We also curate related resources, including datasets, model implementations, and tools, organized by major application areas. Overall, this survey consolidates recent advances and highlights foundations, applications, resources, and open research opportunities in large model-centric temporal data analysis.

2506.20699 2026-06-09 cs.LG 版本更新

Structural Decoupling: A Scaffold-Flow Theory of Generalization and Alignment

结构解耦:泛化与对齐的支架流理论

Xin Li

发表机构 * NSF(美国国家科学基金会) Xin Li(李新)

AI总结 提出结构学习理论(StrLT),通过宽度概念和收缩相似性算子,揭示非平稳环境下结构发现与维护的机制,并导出结构解耦原则,解释幻觉、奖励模型边界错误等安全问题。

详情
AI中文摘要

在非平稳和多上下文环境中的学习需要超越普通的任务内泛化。系统还必须发现哪些上下文存在,将输入路由到正确的上下文,保留旧上下文,并在环境变化时修订上下文库。本文提出结构学习理论(StrLT)作为填补这一结构缺失的框架。StrLT 补充了 Vapnik 的统计学习理论(SLT):SLT 支配着固定机制内的预测或控制(即“漏斗”);而 StrLT 支配着结构机制的发现与维护(即“陷阱”)。StrLT 的核心对象是宽度,即覆盖一个问题所需的最少局部可行上下文数量。我们总结了三个基本结果:宽度与 VC 维不可比较;学习在真实宽度处发生相变;宽度可通过收缩相似性(CS)算子估计,该算子将任务诱导的非收缩性转化为谱分离。在 StrLT 框架下,我们解释了固定类别的结构可学习性如何导致结构解耦原则:维持结构支架的机制不应由优化上下文内流的相同梯度来训练。这一原则激发了一种支架流模型,其中对齐和泛化在架构上分离。最后,我们论证了若干安全故障,包括幻觉、奖励模型边界错误和欺骗性对齐,可以被解释为支架分辨率或支架维护的失败,而不仅仅是输出层面的预测错误。

英文摘要

Learning in non-stationary and multi-context environments requires more than ordinary within-task generalization. A system must also discover which contexts exist, route inputs to the correct context, preserve old contexts, and revise the context library when the environment changes. This paper presents Structural Learning Theory (StrLT) as a framework for filling this structural gap. StrLT complements Vapnik's Statistical Learning Theory (SLT): SLT governs the "funnel" (prediction or control within a fixed regime), while StrLT governs the "trap" (the discovery and maintenance of structural regimes). The core StrLT object is width, the minimum number of locally feasible contexts needed to cover a problem. We summarize three basic results: width is incomparable with VC dimension; learning exhibits a phase transition at the true width; and width can be estimated by a contractive-similarity (CS) operator that converts task-induced non-contractivity into spectral separation. Under the StrLT framework, we explain how fixed-class structural learnability leads to a structural decoupling principle: the mechanisms that maintain the structural scaffold should not be trained by the same gradients that optimize within-context flow. This principle motivates a scaffold-flow model in which alignment and generalization separate architecturally. Finally, we argue that several safety failures, including hallucination, reward-model boundary errors, and deceptive alignment, can be interpreted as scaffold-resolution or scaffold-preservation failures rather than merely output-level prediction errors.

2606.00568 2026-06-09 cs.LG q-bio.GN 版本更新

On the Recoverability of Causal Relations from Bulk Gene Expression Data

从批量基因表达数据中恢复因果关系的可能性

Gongxu Luo, Boyang Sun, Kun Zhang

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文通过形式化聚合下的一致性和推导充要条件,研究了从批量基因表达数据中恢复因果关系的可能性,并发现仅在线性聚合与仿射结构方程下可恢复,而实证数据偏离线性假设。

详情
AI中文摘要

批量基因表达谱分析将生物样本中所有细胞的RNA混合后测量,在单细胞时代仍然重要,因为它通常比单细胞检测噪声更低、灵敏度更高且成本效益更好。因此,越来越多的计算方法试图从批量表达数据中恢复基因间的因果关系。然而,聚合是对底层细胞系统的有损、不可逆的粗化,目前尚不清楚是否以及在何种条件下可以从聚合的批量基因表达数据中恢复因果关系。为了回答这个问题,我们通过两种一致性概念(函数形式一致性和条件独立性一致性)形式化了聚合下的可恢复性。然后,我们推导了可恢复性的必要和充分条件,表明这些性质仅在线性聚合(如求和/均值)与仿射结构方程结合时得以保持。为了评估这些条件的实际可行性,对四个批量基因表达数据集和四个单细胞基因表达数据集的分析进一步揭示,两种数据类型中估计的基因间成对调控函数均偏离线性,为可恢复性所需的线性假设提供了有限的经验支持。总之,这些结果告诫我们,在没有强额外假设的情况下,不应从聚合的批量表达数据中恢复因果关系。

英文摘要

Bulk gene expression profiling, which aggregates pooled RNA across cells within a biological sample, remains important in the single-cell era because it is typically less noisy, more sensitive, and more cost-effective than single-cell assays. Accordingly, a growing body of computational methods seeks to recover causal relations among genes from bulk expression data. However, aggregation is a lossy, non-invertible coarsening of the underlying cellular system, and it remains unclear whether and under what conditions causal relations are recoverable from aggregated bulk gene expression data. To answer this, we formalize recoverability under aggregation through two notions of consistency: functional-form consistency and conditional-independence consistency. We then derive necessary and sufficient conditions for recoverability, showing that these properties are preserved only under linear aggregations (e.g., sum/mean) coupled with affine structural equations. To assess the practical plausibility of these conditions, analyses of four bulk and four single-cell gene expression datasets further reveal that the estimated pairwise regulatory functions among genes deviate from linearity in both data types, providing limited empirical support for the linearity assumptions required for recoverability. Together, these results caution against recovering causal relations from aggregated bulk expression data without strong additional assumptions.
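
The intuition behind the linearity condition can be shown with a toy calculation; it is an illustration only, not the paper's proof. With sum aggregation, an affine cell-level mechanism yields a bulk relation that depends only on the bulk regulator level and the cell count, while a nonlinear mechanism does not. The gene roles, coefficients, and per-cell values are hypothetical.

```python
def bulk(values):
    """Linear aggregation: sum expression over the cells of one sample."""
    return sum(values)

def affine(x, a=2.0, b=1.0):
    """Affine cell-level mechanism: target = a * regulator + b."""
    return a * x + b

def quadratic(x):
    """Nonlinear cell-level mechanism: target = regulator squared."""
    return x * x

# Two hypothetical samples with the same bulk regulator level (4.0) and cell count (2).
sample_a = [1.0, 3.0]
sample_b = [2.0, 2.0]

# Affine mechanism: identical bulk regulators give identical bulk targets.
affine_a = bulk(affine(x) for x in sample_a)
affine_b = bulk(affine(x) for x in sample_b)

# Nonlinear mechanism: identical bulk regulators give different bulk targets.
quad_a = bulk(quadratic(x) for x in sample_a)
quad_b = bulk(quadratic(x) for x in sample_b)
```

In the quadratic case the bulk target is no longer a function of the bulk regulator alone, so the cell-level functional form cannot be read off from the aggregated data.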

2406.05335 2026-06-09 cond-mat.dis-nn cs.LG 版本更新

Phase transition in large language models and the criticality of natural languages

大型语言模型中的相变与自然语言的临界性

Kai Nakaishi, Yoshihiko Nishikawa, Koji Hukushima

发表机构 * Center for Advanced Intelligence Project, RIKEN(先进智能项目中心,理化学研究所) National Institute for Japanese Language and Linguistics(日本国立国语研究所) Department of Physics, Nagoya University(名古屋大学物理系) Department of Multidisciplinary Sciences, The University of Tokyo(东京大学多学科科学系) Komaba Institute for Science, The University of Tokyo(东京大学驹场科学研究所)

AI总结 通过将大型语言模型作为可控有效模型,发现当调节类似物理温度的参数时,模型经历相变,临界点生成的文本呈现幂律行为,最接近自然语言,表明自然语言具有临界性。

Comments 8 pages, 6 figures

详情
AI中文摘要

自然语言中的文本和语音生成可以建模为随机过程。这一思想可追溯到马尔可夫的开创性工作,以及后来的香农,也构成了大型语言模型(LLMs)近期发展的基础。自然语言对应的随机过程应不同于生成非语言序列的过程。区分语言与非语言序列的特征之一是幂律行为,这在不同语言中普遍存在。在统计物理学中,这种行为表明自然语言是临界的:它们位于参数化随机过程空间中的相变点附近。然而,验证这一猜想并不直接。即使存在相变,也无法在现实世界的自然语言中直接观察到,因为它们没有任何可控参数。在这里,我们使用LLMs作为自然语言的可控有效模型。通过对LLMs生成文本的统计分析,我们发现,当改变类似于物理温度的参数时,LLMs经历相变。该相变将低温相(生成文本具有复杂重复结构)与高温相(LLMs生成难以理解的文本)分开。在这些相之间的临界点,生成的文本显示出与自然语言相似的幂律行为,并且通过自然语言处理中的标准度量最接近自然语言。这些发现强烈表明自然语言确实是临界的。

英文摘要

Generation of text and speech in natural languages can be modeled as a stochastic process. This idea dates back to the seminal work of Markov and, later, to that of Shannon and also underlies the recent development of large language models (LLMs). The stochastic processes corresponding to natural languages should be distinct from those that generate nonlinguistic sequences. One of the features that discriminate linguistic and nonlinguistic sequences is power-law behavior, which is universally observed across different languages. In statistical physics, such behavior suggests that natural languages are critical: They lie near a phase transition point in a parametrized space of stochastic processes. However, testing this conjecture is not straightforward. A phase transition, even if it exists, cannot be directly observed in real-world natural languages because they do not have any controllable parameters. Here, we use LLMs as controllable effective models of natural languages. Through statistical analyses of texts generated by LLMs, we find that, when a parameter analogous to physical temperature is varied, LLMs undergo a phase transition. The transition separates a low-temperature phase with complex repetitive structures in generated texts from a high-temperature phase in which LLMs generate incomprehensible texts. At the critical point between these phases, generated texts display the power-law behavior similar to that of natural languages and most closely resemble natural languages as measured by a standard metric in natural language processing. These findings strongly suggest that natural languages are indeed critical.
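The "parameter analogous to physical temperature" can be illustrated with the usual temperature-scaled softmax used in LLM sampling. This is a sketch under assumptions: the logits are made up, and the abstract does not say exactly how the authors sample or which statistics they compute, so nothing here reproduces their analysis.

```python
import math
import random

def softmax_with_temperature(logits, temperature):
    """Turn logits into probabilities after dividing by the temperature."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    weights = [math.exp(s - peak) for s in scaled]
    total = sum(weights)
    return [w / total for w in weights]

def entropy(probs):
    """Shannon entropy in nats; zero-probability terms contribute nothing."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def sample_token(logits, temperature, rng):
    """Draw one token index from the temperature-scaled distribution."""
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [3.0, 1.0, 0.5, -1.0]  # hypothetical next-token logits
for t in (0.2, 1.0, 5.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs], round(entropy(probs), 3))
```

Low temperatures concentrate almost all probability on the top token, which is consistent with repetitive output; high temperatures flatten the distribution toward uniform noise.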

2407.10247 2026-06-09 cs.CY cs.AI cs.LG econ.GN q-fin.EC 版本更新

Strategic Integration of Artificial Intelligence in the C-Suite: The Role of the Chief AI Officer

人工智能在C级管理层的战略整合:首席人工智能官的角色

Marc Schmitt

发表机构 * University of Oxford(牛津大学)

AI总结 本文提出角色设计理论,解释企业为何设立首席AI官(CAIO)或采用其他结构,并分析AI的独特属性(分布式判断问责、上游治理、非平稳性)如何影响高管角色设计。

详情
AI中文摘要

人工智能(AI)融入企业战略已成为组织在数字时代保持竞争优势的关键。尽管组织日益将AI视为战略和组织资源,但现有的C级管理层角色仅部分具备在企业层面统一治理、整合和利用AI的能力。各组织的应对方式不同:有的设立专职首席AI官(CAIO),有的将现有职责扩展为混合角色,还有的通过联邦式结构协调AI。本文发展了一种角色设计理论来解释这种差异。我识别出AI区别于以往跨领域企业技术的三个属性(分布式判断问责、上游治理和非平稳性),以及组织应对的三种配置:集中扩展、分布式扩展和角色创建。CAIO框架将这些属性与它们产生的高管层设计问题以及专职角色所需的功能和能力联系起来。四个命题具体说明了专职CAIO何时出现、组织的应对采取何种形式、专职角色何时有效以及配置如何随时间演变。本文通过提供高管层面AI战略整合的理论驱动解释,为高管领导力、组织设计和数字治理研究做出贡献。

英文摘要

The integration of Artificial Intelligence (AI) into corporate strategy has become critical for organizations seeking to maintain competitive advantage in the digital age. Although organizations increasingly rely on AI as a strategic and organizational resource, existing C-suite roles remain only partially equipped to govern, integrate, and leverage it coherently at the enterprise level. Organizations vary in their responses. Some create a dedicated Chief AI Officer (CAIO), others extend existing mandates into hybrid roles, and still others coordinate AI through federated structures. This paper develops a role-design theory to explain this variation. I identify three properties that distinguish AI from earlier cross-cutting enterprise technologies - distributed accountability for judgment, upstream governance, and non-stationarity - and three configurations through which organizations respond: concentrated extension, distributed extension, and role creation. The CAIO Framework links these properties to the executive design problems they generate and to the functions and capabilities required of the dedicated role. Four propositions specify when a dedicated CAIO emerges, what form an organization's response takes, when the dedicated role is effective, and how configurations evolve over time. This paper contributes to research on executive leadership, organizational design, and digital governance by offering a theory-driven account of the strategic integration of AI at the executive level.

2601.06077 2026-06-09 cs.IT cs.AI cs.LG math.IT math.OC 版本更新

One if by Land, Two if by Sea, Three if by Four Seas, and More to Come -- Values of Perception, Prediction, Communication, and Common Sense in Decision Making

一陆二海三四海,更多将至——感知、预测、通信与常识在决策中的价值

Aolin Xu

发表机构 * Aolin Xu(徐傲林)

AI总结 本文严格定义决策中感知、预测、通信和常识的价值,发现无预测的感知价值可能为负,而预测价值非负,并应用于自主决策系统设计。

详情
AI中文摘要

本文旨在严格定义决策中感知、预测、通信和常识的价值。所定义的量是决策论意义上的,但具有信息论上的类比,例如,它们与香农熵和互信息共享一些简单但关键的数学性质,并且在特定设置中可以简化为这些量。一个有趣的观察是,没有预测的感知价值可能为负,而感知与预测一起的价值以及单独预测的价值总是非负的。这些定义为自主决策系统设计中出现的实际问题提供了答案。示例问题包括:我们是否需要观察和预测特定代理的行为?其重要性如何?观察和预测代理的最佳顺序是什么?这些定义也可能为认知科学和神经科学提供见解,有助于理解自然决策者如何利用从不同来源和操作中获得的信息。

英文摘要

This work aims to rigorously define the values of perception, prediction, communication, and common sense in decision making. The defined quantities are decision-theoretic, but have information-theoretic analogues, e.g., they share some simple but key mathematical properties with Shannon entropy and mutual information, and can reduce to these quantities in particular settings. One interesting observation is that the value of perception without prediction can be negative, while the value of perception together with prediction and the value of prediction alone are always nonnegative. The defined quantities suggest answers to practical questions arising in the design of autonomous decision-making systems. Example questions include: Do we need to observe and predict the behavior of a particular agent? How important is it? What is the best order in which to observe and predict the agents? The defined quantities may also provide insights into cognitive science and neuroscience, toward the understanding of how natural decision makers make use of information gained from different sources and operations.

2601.19082 2026-06-09 cs.AI cs.CL cs.GT cs.LG cs.MA 版本更新

Payoff scaling shapes cooperation in LLM agents across languages

收益规模塑造跨语言LLM代理的合作行为

Trung-Kiet Huynh, Dao-Sy Duy-Minh, Thanh-Bang Cao, Phong-Hao Le, Hong-Dan Nguyen, Phu-Quy Nguyen-Lam, Minh-Luan Nguyen-Vo, Hong-Phat Pham, Phu-Hoa Pham, Thien-Kim Than, Chi-Nguyen Tran, Huy Tran, Gia-Thoai Tran-Le, Alessio Buscemi, Le Hong Trang, The Anh Han

发表机构 * Faculty of Information Technology, University of Science (HCMUS), Ho Chi Minh City, Vietnam(信息技术学院,自然科学大学(HCMUS),胡志明市,越南) Faculty of Computer Science and Engineering, Ho Chi Minh City University of Technology (HCMUT), Ho Chi Minh City, Vietnam(计算机科学与工程学院,胡志明市理工大学(HCMUT),胡志明市,越南) Vietnam National University – Ho Chi Minh City (VNU-HCM), Ho Chi Minh City, Vietnam(胡志明市国家大学(VNU-HCM),胡志明市,越南) Luxembourg Institute of Science and Technology (LIST), Luxembourg(卢森堡科学与技术研究所(LIST),卢森堡) School of Computing, Engineering and Digital Technologies, Teesside University, Middlesbrough, United Kingdom(计算、工程与数字技术学院,提赛德大学,米德尔斯布勒,英国)

AI总结 通过监督分类器识别重复囚徒困境中的策略,结合演化博弈论基线,发现随着收益增加,LLM反而更合作,与演化预测相反,表明对齐训练和人类推理模式的影响。

Comments 44 pages, 17 figures, 4 tables

详情
AI中文摘要

大型语言模型(LLM)越来越多地被部署为自主代理,代表用户进行谈判、协调和行动。它们在这种环境中是否合作不再只是一个学术问题,而是人工智能治理的核心问题。我们从战略行为的角度出发,探究两个日常杠杆——利害关系的大小和描述交互的语言——如何塑造LLM在重复囚徒困境中采用的策略。我们不直接通过原始行动计数来解读合作,而是训练监督分类器来识别重复博弈的经典策略(始终合作、始终背叛、以牙还牙、赢-留-输-变),并将其作为观察LLM行为的透镜。为了了解在相同收益下策略分布应如何,我们推导了演化博弈论(EGT)基线,并将其与LLM数据进行比较。两种结果以揭示性的方式不一致:随着收益增加,演化理论预测背叛应占据主导,但LLM却向相反方向移动,变得更加合作——我们认为,这是对齐训练和LLM从训练数据中继承的人类推理模式的标志。我们进一步表明,这种情况并非前沿规模、专有模型所特有:它也出现在三个开放权重的较小LLM中。总体而言,我们的分析强调,收益设计和语言框架是强大但未被充分探索的引导LLM行为的杠杆,对评估、对齐和治理部署在高风险、多语言环境中的多代理AI系统具有直接影响。

英文摘要

Large language models (LLMs) are increasingly deployed as autonomous agents that negotiate, coordinate, and act on behalf of users. Whether they cooperate in such settings is no longer just an academic question, but a central issue for AI governance. We approach it from a strategic-behaviour angle, asking how two everyday levers - the size of what is at stake, and the language in which the interaction is described - shape the strategies LLMs adopt in a repeated Prisoner's Dilemma. Rather than reading cooperation off raw action counts, we train supervised classifiers to recognise the canonical strategies of repeated games (always cooperate, always defect, Tit-for-Tat, Win-Stay-Lose-Shift) and use them as a lens onto LLM behaviour. To know what the strategy distribution should look like under the same payoffs, we derive an evolutionary game theory (EGT) baseline and compare it with the LLM data. The two outcomes disagree in a revealing way: as stakes grow, evolutionary theory predicts that defection should take over the population, yet LLMs move in the opposite direction, becoming more cooperative - a signature, we argue, of alignment training and the human-like reasoning patterns LLMs inherit from their training data. We further show that this picture is not particular to frontier-scale, proprietary models: it also occurs with three open-weight smaller LLMs. Overall, our analysis highlights that payoff design and linguistic framing are powerful but under-explored levers for steering LLM behaviour, with direct implications for evaluating, aligning, and governing multi-agent AI systems deployed in high-stakes, multilingual environments.
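The four canonical strategies named in the abstract can be written down in a few lines. This is only an illustrative sketch: the base payoff values (T=5, R=3, P=1, S=0) are the textbook choice rather than the paper's, and the `scale` parameter is a hypothetical stand-in for the "size of what is at stake".

```python
C, D = "C", "D"

def always_cooperate(own, other):
    return C

def always_defect(own, other):
    return D

def tit_for_tat(own, other):
    # Cooperate first, then copy the opponent's previous move.
    return other[-1] if other else C

def win_stay_lose_shift(own, other):
    # Keep the last move after a good outcome (opponent cooperated), else switch.
    if not own:
        return C
    if other[-1] == C:
        return own[-1]
    return D if own[-1] == C else C

def play(strategy_a, strategy_b, rounds=10, scale=1.0):
    """Return total payoffs; `scale` multiplies a hypothetical base payoff matrix."""
    payoff = {(C, C): (3, 3), (C, D): (0, 5), (D, C): (5, 0), (D, D): (1, 1)}
    hist_a, hist_b, total_a, total_b = [], [], 0.0, 0.0
    for _ in range(rounds):
        move_a = strategy_a(hist_a, hist_b)
        move_b = strategy_b(hist_b, hist_a)
        pay_a, pay_b = payoff[(move_a, move_b)]
        total_a += scale * pay_a
        total_b += scale * pay_b
        hist_a.append(move_a)
        hist_b.append(move_b)
    return total_a, total_b

print(play(tit_for_tat, always_defect))                         # (9.0, 14.0)
print(play(win_stay_lose_shift, always_cooperate, scale=10.0))  # (300.0, 300.0)
```

Move histories generated this way are the kind of labelled data on which a strategy classifier could be trained, although the abstract does not describe the authors' features or classifier.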

2602.21889 2026-06-09 cs.AI cs.LG 版本更新

2-Step Agent: A Framework for the Interaction of a Decision Maker with AI Decision Support

2-Step Agent: 一个用于决策者与AI决策支持交互的框架

Otto Nyberg, Fausto Carcassi, Davide Tugnoli, Giovanni Cinà

发表机构 * Department of Medical Informatics, Amsterdam UMC University of Amsterdam(医学信息学系,阿姆斯特丹大学医学中心,阿姆斯特丹大学) Institute for Logic, Language and Computation, University of Amsterdam(逻辑、语言和计算研究所,阿姆斯特丹大学) Department of Mathematics and Earth Sciences, University of Trieste(数学与地球科学系,的里雅斯特大学)

AI总结 本文提出2-Step Agent框架,用于研究决策者如何学习和利用基于机器学习的决策支持(ML-DS),并揭示了即使在理想条件下,ML-DS也可能弊大于利。

Comments 17 pages, 17 figures

详情
AI中文摘要

机器学习模型的预测支持人类在多个领域做出决策,包括医疗和司法等高风险领域。然而,我们仍然缺乏对决策者如何从基于机器学习的决策支持(ML-DS)中学习的清晰理解。在本文中,我们介绍了一个通用的计算框架,即2-Step Agent,以捕捉这一过程。由于机器学习模型的预测包含关于训练数据的信息,预测也可以用于推断。我们的框架建模了(i)对新观测的预测如何影响理性贝叶斯代理的信念,以及(ii)这种信念变化如何影响因果效应的估计、下游决策和后续结果。除了框架本身外,我们还做出了三个贡献。首先,在线性高斯设定下,我们为所引入的具有挑战性的贝叶斯推断问题(即代理根据ML预测进行推断)推导出了易于处理的解。其次,我们通过实验确定了ML-DS有益的条件。第三,我们证明了即使ML模型设定正确且代理完全理性,单个失准的先验信念也足以使ML-DS导致比没有决策支持更差的下游结果。因此,即使在理想条件下,ML-DS也可能弊大于利。

英文摘要

Predictions from ML models support human decision making in several fields, including high-stakes ones such as healthcare and the judiciary. Yet, we still lack a clear understanding of how decision makers learn from ML-based decision support (ML-DS). In this paper, we introduce a general computational framework, the 2-Step Agent, to capture this process. As a prediction from an ML model contains information about the training data, a prediction can also be used for inference. Our framework models (i) how a prediction for a new observation affects the beliefs of a rational Bayesian agent, and (ii) how this change in beliefs affects the estimation of causal effect, the downstream decision, and the subsequent outcome. In addition to the framework itself, we make three contributions. First, for the linear Gaussian setting, we derive a tractable solution for the challenging Bayesian inference problem we introduced, i.e. one in which the agent infers from an ML prediction. Second, we experimentally identify conditions under which ML-DS is beneficial. Third, we show that a single misaligned prior belief can be sufficient for ML-DS to lead to worse downstream outcomes compared to no decision support even when the ML model is well-specified and the agent is perfectly rational. Hence, even under ideal conditions, ML-DS can do more harm than good.
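For intuition about the linear Gaussian setting, the textbook conjugate update below shows how a Gaussian belief moves when a single noisy signal arrives. It is a deliberately simplified stand-in, not the paper's derivation: treating the ML prediction as a direct noisy observation with known variance is an assumption made here, and the function name and numbers are hypothetical.

```python
def gaussian_update(prior_mean, prior_var, observation, noise_var):
    """Posterior of a Gaussian belief after one observation with Gaussian noise."""
    gain = prior_var / (prior_var + noise_var)  # weight placed on the observation
    post_mean = prior_mean + gain * (observation - prior_mean)
    post_var = (1.0 - gain) * prior_var
    return post_mean, post_var

# Hypothetical numbers: the agent believes the quantity is about 0.0 (variance 1.0)
# and treats an ML prediction of 2.0 as a noisy signal with variance 1.0.
mean, var = gaussian_update(prior_mean=0.0, prior_var=1.0, observation=2.0, noise_var=1.0)
print(mean, var)  # 1.0 0.5
```

A prior with very small variance barely moves toward the signal, which gives a rough sense of how an overconfident, badly placed prior could dominate what the agent takes away from a prediction.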

2603.14147 2026-06-09 cs.AI cs.LG 版本更新

An Alternative Trajectory for Generative AI

生成AI的另一种轨迹

Margarita Belova, Yuval Kansal, Yihao Liang, Jiaxin Xiao, Niraj K. Jha

发表机构 * Princeton University(普林斯顿大学)

AI总结 本文提出通过构建领域特定超智能(DSS)来改进生成AI,利用符号抽象提升领域推理能力,避免LLM合成数据的模型崩溃问题,实现可持续发展。

详情
AI中文摘要

生成式人工智能(AI)生态系统正经历快速变革,威胁其可持续性。随着模型从研究原型转向高流量产品,能耗负担从一次性训练转向持续的、无上限的推理。推理模型使每次查询的计算成本增加数个数量级,加剧了这一问题。通过扩展单体模型来追求通用人工智能的主流做法正与硬性物理约束相碰撞:电网故障、用水消耗和数据扩展的边际效益递减。此轨迹产生的模型具有出色的事实记忆,但在需要深入推理的领域表现不佳,可能是由于训练数据中的抽象不足。当前大型语言模型(LLMs)仅在数学和编程等领域表现出真正的推理深度,因为这些领域中严格的既有抽象提供了结构性支撑;在其他领域,当前方法泛化能力差。我们提出基于领域特定超智能(DSS)的替代轨迹。我们主张首先构建显式的符号抽象(知识图谱、本体和形式逻辑)以支撑合成课程,使小型语言模型能够掌握领域特定推理,而不会出现基于LLM的合成数据方法中常见的模型崩溃问题。我们设想的不是单一的通用巨型模型,而是“DSS模型社会”:由编排代理将任务路由到不同DSS后端的动态生态系统。此范式转变使能力与规模解耦,使智能从高能耗的数据中心迁移到安全的端侧专家模型。通过将算法进步与物理约束对齐,DSS社会使生成式AI从环境负担转变为可持续的经济赋能力量。

英文摘要

The generative artificial intelligence (AI) ecosystem is undergoing rapid transformations that threaten its sustainability. As models transition from research prototypes to high-traffic products, the energetic burden has shifted from one-time training to recurring, unbounded inference. This is exacerbated by reasoning models that inflate compute costs by orders of magnitude per query. The prevailing pursuit of artificial general intelligence through scaling of monolithic models is colliding with hard physical constraints: grid failures, water consumption, and diminishing returns on data scaling. This trajectory yields models that have impressive factual recall but struggle in domains requiring in-depth reasoning, possibly due to insufficient abstractions in training data. Current large language models (LLMs) exhibit genuine reasoning depth only in domains like mathematics and coding, where rigorous, pre-existing abstractions provide structural grounding. In other fields, the current approach fails to generalize well. We propose an alternative trajectory based on domain-specific superintelligence (DSS). We argue for first constructing explicit symbolic abstractions (knowledge graphs, ontologies, and formal logic) to underpin synthetic curricula enabling small language models to master domain-specific reasoning without the model collapse problem typical of LLM-based synthetic data methods. Rather than a single generalist giant model, we envision "societies of DSS models": dynamic ecosystems where orchestration agents route tasks to distinct DSS back-ends. This paradigm shift decouples capability from size, enabling intelligence to migrate from energy-intensive data centers to secure, on-device experts. By aligning algorithmic progress with physical constraints, DSS societies move generative AI from an environmental liability to a sustainable force for economic empowerment.

2606.01060 2026-06-09 cs.CL cs.AI cs.LG 版本更新

MENTIS: What Belief Changes Under Alignment? Measuring Multi-Scale Latent Torsion in Language Models

MENTIS: 对齐改变了什么信念?语言模型中多尺度潜在扭转的测量

Partha Pratim Saha, Samarth Raina, Mayur Parvatikar, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das

发表机构 * Pragya Lab, BITS Pilani Goa, India(印度BITS Pilani果阿校区Pragya实验室) IIIT Delhi, India(印度德里信息技术学院) Amazon, USA(美国亚马逊) Meta, USA(美国Meta) Apple, USA(美国苹果)

AI总结 提出MENTIS框架,通过层间协方差扭转范数、谱扭转诊断和能量-辐射-激活度量,测量偏好对齐在语言模型内部计算中引起的选择性、深度局部的几何结构变化。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

偏好对齐显著改善了大语言模型的可观察行为,但尚不清楚对齐在内部改变了什么。对齐系统在越狱、提示注入和检索时损坏下仍然失败,表明仅行为级评估是不完整的。后训练应在内部计算中留下可测量的痕迹。我们问:当指令微调(IT)模型变为偏好对齐(PA)模型时,哪些几何结构发生了变化,这些变化集中在何处,以及它们在不同概念、提示和模型家族中的选择性如何? 我们引入MENTIS,一个几何优先的框架,用于测量配对检查点中对齐引起的内部重组。MENTIS使用基于层间协方差的主扭转范数(T1)、辅助谱扭转诊断(T2)和用于深度定位的能量-辐射-激活度量(ERA)来比较IT和PA模型。在LITMUS上的四个7-8B模型对中,我们的研究表明对齐引起的变化是选择性的而非均匀的:规范性概念平均表现出比事实性概念更大的扭转偏移;扭转与上下文熵负相关;峰值效应定位于架构特定的中后层。相同的模式出现在词级、提示级和模型级分析中。这些结果表明偏好对齐在内部计算中留下了结构化的、深度局部的几何特征,超越了仅行为级评估所能揭示的内容。

英文摘要

Preference alignment has substantially improved the observable behavior of large language models, yet it remains unclear what alignment changes internally. Aligned systems still fail under jailbreaks, prompt injection, and retrieval-time corruption, suggesting behavior-level evaluation alone is incomplete. Post-training should leave measurable traces in internal computation. We ask: when an instruction-tuned (IT) model becomes a preference-aligned (PA) model, what geometric structure changes, where do those changes concentrate, and how selectively do they vary across concepts, prompts, and model families? We introduce MENTIS, a geometry-first framework for measuring alignment-induced internal reorganization in paired checkpoints. MENTIS compares IT and PA models using a primary layerwise covariance-based torsion norm (T1), a secondary spectral torsion diagnostic (T2), and an Energy-Radiance-Activation measure (ERA) for depth localization. Across four 7-8B model pairs on LITMUS, our study reveals that alignment-induced change is selective rather than uniform: normative concepts exhibit larger torsion shifts than factual concepts on average; torsion is negatively correlated with contextual entropy; and peak effects localize to architecture-specific mid-to-late layers. The same pattern appears across word-level, prompt-level, and model-level analyses. These results suggest preference alignment leaves structured, depth-localized geometric signatures in internal computation beyond what behavior-level evaluation alone can reveal.

2606.05363 2026-06-09 cs.GT cs.LG econ.TH math.OC 版本更新

Should Demand Models Incorporate Competitor Prices? Oblivious Learning and Algorithmic Collusion

需求模型是否应包含竞争对手价格?无知学习与算法合谋

Yuhang Wu, Assaf Zeevi

发表机构 * University of California, Berkeley(加州大学伯克利分校) University of Washington(华盛顿大学)

AI总结 研究在竞争市场中,定价算法是否应显式建模竞争对手价格,通过对比无知与知情学习策略,发现知情策略是纳什均衡且价格收敛至竞争结果,而合谋模式不稳健。

Comments Preliminary version "Oblivious Learning, Price Exploration and Collusive Dynamics" accepted at EC 2026

详情
AI中文摘要

在一个拥有多个卖家的平台上,定价算法在学习需求时是否应显式建模竞争对手的价格?经典学习论点给出肯定答案:忽略竞争对手会导致模型错误指定和效率低下。相反,关于算法合谋的最新研究表明,战略性无知——故意忽略竞争对手价格——可能促进合谋结果并提高利润。我们在一个具有未知噪声需求的风格化竞争市场中研究这一建模选择,其中多个卖家重复设定价格并通过迭代最小二乘法估计需求,要么将竞争对手价格纳入其需求模型(知情),要么忽略它们(无知)。我们首先证明,相对于垄断者,竞争市场中的无知卖家必须更积极地探索以补偿动态竞争对手信息的损失。基于这一见解,我们刻画了所有卖家均为无知时的市场动态,并表明在充分探索下价格收敛至竞争结果,而当探索衰减时会出现连续伪均衡。分析价格轨迹,我们发现一种“偏离”现象,产生随学习进行而消散的暂时合谋模式。在同时存在无知和知情卖家的市场中,知情卖家的收益严格高于无知卖家。作为策略博弈解读,该建模选择具有唯一的纳什均衡:全知情市场,其中价格有效收敛至竞争结果。总体而言,我们的结果表明合谋模式不稳健,且不能由无知建模维持;因此,纳入竞争对手信息,结合充分的价格探索,仍是竞争市场中卖家的可靠策略。

英文摘要

On a platform with many sellers, should a pricing algorithm explicitly model competitors' prices when learning demand? Classical learning arguments suggest an affirmative answer: ignoring competitors induces model misspecification and inefficiency. In contrast, recent work on algorithmic collusion suggests that strategic obliviousness -- deliberately ignoring competitor prices -- may facilitate collusive outcomes and improve profits. We study this modeling choice in a stylized competitive market with unknown noisy demand, in which multiple sellers repeatedly set prices and estimate demand via iterated least squares, and either incorporate competitors' prices into their demand models (informed) or ignore them (oblivious). We first show that, relative to a monopolist, an oblivious seller in a competitive market must explore more aggressively to compensate for the loss of dynamic competitor information. Building on this insight, we characterize market dynamics when all sellers are oblivious and show that prices converge to the competitive outcome under sufficient exploration, while a continuum of pseudo-equilibria arises when exploration decays. Analyzing the resulting price trajectories, we uncover an excursion phenomenon that gives rise to transient collusive patterns that dissipate as learning progresses. In markets with both oblivious and informed sellers, the informed strictly out-earn the oblivious. Read as a strategy game, the modeling choice has a unique Nash equilibrium: the all-informed market, in which prices converge to the competitive outcome efficiently. Overall, our results indicate that collusive patterns are not robust and are not sustained by oblivious modeling; therefore, incorporating competitor information, together with sufficient price exploration, remains a reliable strategy for sellers in competitive markets.
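The misspecification argument can be made concrete with a small regression experiment. This is a sketch under stated assumptions, not the paper's model: the linear demand curve, its coefficients, the price-generating rule, and the helper `ols` are all hypothetical, and the paper's iterated least squares with exploration is not reproduced. An informed seller regresses demand on both prices; an oblivious seller regresses on its own price only.

```python
import random

def ols(columns, y):
    """Least squares with intercept via normal equations and Gaussian elimination."""
    n = len(y)
    X = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(X[0])
    # Build the augmented system [X^T X | X^T y].
    A = [[sum(X[r][i] * X[r][j] for r in range(n)) for j in range(k)]
         + [sum(X[r][i] * y[r] for r in range(n))] for i in range(k)]
    for i in range(k):
        pivot = max(range(i, k), key=lambda r: abs(A[r][i]))
        A[i], A[pivot] = A[pivot], A[i]
        for r in range(k):
            if r != i:
                factor = A[r][i] / A[i][i]
                A[r] = [a - factor * b for a, b in zip(A[r], A[i])]
    return [A[i][k] / A[i][i] for i in range(k)]

rng = random.Random(1)
own, rival, demand = [], [], []
for _ in range(2000):
    q = rng.uniform(1.0, 3.0)            # competitor price
    p = 0.5 * q + rng.uniform(0.5, 1.5)  # own price, correlated with q
    d = 10.0 - 2.0 * p + 1.0 * q + rng.gauss(0.0, 0.1)  # hypothetical demand
    own.append(p)
    rival.append(q)
    demand.append(d)

informed = ols([own, rival], demand)  # [intercept, own-price, rival-price]
oblivious = ols([own], demand)        # [intercept, own-price]
print("informed own-price slope: ", round(informed[1], 2))
print("oblivious own-price slope:", round(oblivious[1], 2))
```

Because the two prices are correlated in this toy setup, the oblivious fit absorbs part of the competitor effect into its own-price slope and so underestimates price sensitivity, which is the textbook omitted-variable bias.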

2602.14975 2026-06-09 physics.chem-ph cs.LG 版本更新

Faster Molecular Dynamics with Neural Network Potentials via Distilled Multiple Time-Stepping and Non-Conservative Forces

通过蒸馏多时间步长和非保守力加速基于神经网络势的分子动力学

Nicolaï Gouraud, Côme Cattin, Thomas Plé, Olivier Adjoua, Louis Lagardère, Jean-Philip Piquemal

发表机构 * Qubit Pharmaceuticals, Advanced Research Department(Qubit制药公司,先进研究部) Sorbonne Université, Laboratoire de Chimie Théorique, UMR 7616 CNRS(索邦大学,理论化学实验室,UMR 7616 CNRS) Laboratoire de Chimie Théorique, UMR 7616 CNRS(理论化学实验室,UMR 7616 CNRS)

AI总结 提出DMTS-NC方法,利用蒸馏多时间步长和非保守力策略,结合基础神经网络模型(如FeNNix-Bio1)加速原子分子动力学模拟,在保持精度的同时相对DMTS实现15-30%的额外加速,并结合氢质量再分配和高氢摩擦将时间步长扩展至10 fs。

详情
Journal ref
Journal of Chemical Theory and Computation, 2026
AI中文摘要

继我们之前的工作(J. Phys. Chem. Lett., 2026, 17, 5, 1288-1295)之后,我们提出了DMTS-NC方法,这是一种使用非保守(NC)力的蒸馏多时间步长(DMTS)策略,用于进一步加速使用基础神经网络模型(如FeNNix-Bio1)的原子分子动力学模拟。该方法采用双层可逆参考系统传播算法(RESPA)形式,将目标精确保守势与为产生非保守力而优化的简化蒸馏表示耦合。尽管是非保守的,蒸馏架构仍被设计为强制满足关键物理先验,例如旋转等变性和原子力分量的相互抵消。这些选择有利于蒸馏过程,因而大幅提高了模拟的鲁棒性,显著限制了两种模型之间的异常差异,并与力数据取得了极好的一致性。总体而言,DMTS-NC方案比其保守对应方案更稳定、更高效,相对DMTS的额外加速达到15-30%。它无需微调步骤,更易于实现,并且可以推至系统物理共振的极限,在保持精度的同时提供最大效率。我们通过结合氢质量再分配(HMR)和高氢摩擦(HHF)获得了额外的加速,将方案的最大时间步长进一步扩展到10 fs,同时保持稳定性和精度。与DMTS一样,DMTS-NC适用于任何神经网络势,并且可以应用于计算量比FeNNix-Bio1更大的方法。我们通过对MACE-OFF23进行蒸馏给出了原理验证,与单时间步长相比,获得了3.66至5.64倍的加速。

英文摘要

Following our previous work (J. Phys. Chem. Lett., 2026, 17, 5, 1288-1295), we propose the DMTS-NC approach, a distilled multi-time-step (DMTS) strategy using non-conservative (NC) forces to further accelerate atomistic molecular dynamics simulations using foundation neural network models such as FeNNix-Bio1. There, a dual-level reversible reference system propagator algorithm (RESPA) formalism couples a target accurate conservative potential to a simplified distilled representation optimized for the production of non-conservative forces. Despite being non-conservative, the distilled architecture is designed to enforce key physical priors, such as equivariance under rotation and cancellation of atomic force components. These choices facilitate the distillation process and therefore drastically improve the robustness of the simulations, significantly limiting abnormal discrepancies between the two models and thus achieving excellent agreement with the force data. Overall, the DMTS-NC scheme is found to be more stable and efficient than its conservative counterpart, with additional speedups reaching 15-30% over DMTS. Requiring no fine-tuning steps, it is easier to implement and can be pushed to the limit of the system's physical resonances to maintain accuracy while providing maximum efficiency. We obtain additional speedup by combining hydrogen mass repartitioning (HMR) and high hydrogen friction (HHF), which further extends the largest time step of our schemes up to 10 fs while conserving stability and accuracy. As for DMTS, DMTS-NC is applicable to any neural network potential and can be applied to approaches that are computationally heavier than FeNNix-Bio1. We show a proof of principle by applying the approach to the distillation of MACE-OFF23, with consequent speedups ranging from 3.66 to 5.64 compared to a single time step.
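The dual-level RESPA idea can be sketched on a toy one-dimensional oscillator: a cheap force is integrated with small inner steps, and the difference between the accurate and cheap forces is applied as a kick at the larger outer step. Everything below is an assumption made for illustration. Both force functions are hypothetical stand-ins for the neural network models, and both are conservative here, unlike the distilled non-conservative model in the paper.

```python
def respa_step(x, v, dt_outer, n_inner, cheap_force, accurate_force, mass=1.0):
    """One two-level RESPA step: cheap force inside, correction force outside."""
    def correction(pos):
        return accurate_force(pos) - cheap_force(pos)

    dt_inner = dt_outer / n_inner
    v += 0.5 * dt_outer * correction(x) / mass  # outer half kick
    for _ in range(n_inner):                    # inner velocity-Verlet loop
        v += 0.5 * dt_inner * cheap_force(x) / mass
        x += dt_inner * v
        v += 0.5 * dt_inner * cheap_force(x) / mass
    v += 0.5 * dt_outer * correction(x) / mass  # outer half kick
    return x, v

def accurate(pos):
    """Plays the role of the expensive reference model (toy spring, k = 1.00)."""
    return -1.00 * pos

def cheap(pos):
    """Plays the role of the distilled model (toy spring, k = 0.95)."""
    return -0.95 * pos

def energy(x, v):
    """Total energy of the reference system with unit mass."""
    return 0.5 * v * v + 0.5 * x * x

x, v = 1.0, 0.0
e0 = energy(x, v)
for _ in range(1000):
    x, v = respa_step(x, v, dt_outer=0.1, n_inner=4,
                      cheap_force=cheap, accurate_force=accurate)
print("energy drift:", abs(energy(x, v) - e0))
```

The expensive force is evaluated only twice per outer step in this form (once with caching between consecutive steps), which is where the speedup of multiple time-stepping comes from.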

2603.10453 2026-06-09 cs.LG 版本更新

Spatio-Temporal Forecasting of Retaining Wall Deformation: Mitigating Error Accumulation via Multi-Resolution ConvLSTM Stacking Ensemble

挡土墙变形的时空预测:通过多分辨率ConvLSTM堆叠集成减轻误差累积

Jihoon Kim, Heejung Youn

发表机构 * Department of Civil and Environmental Engineering, Hongik University(弘国大学土木与环境工程系)

AI总结 提出多分辨率ConvLSTM集成框架,利用不同时间输入分辨率减轻误差累积,提高分阶段开挖中挡土结构长期变形预测的准确性。

Comments 27 pages, 17 figures

详情
Journal ref
Geomechanics and Engineering, 45(5), 649-674, 2026
AI中文摘要

本研究提出了一种多分辨率卷积长短期记忆(ConvLSTM)集成框架,利用多样化的时间输入分辨率来减轻误差累积,并提高分阶段开挖过程中挡土结构行为的长期预测。通过PLAXIS2D模拟生成了一个广泛的侧向墙位移响应数据库,该模拟包含五层土壤地层、两种开挖深度(14米和20米)以及随机变化的岩土和结构参数,产生了2000个时间序列挠度剖面。使用全连接神经网络元学习器集成了三个在不同输入分辨率下训练的ConvLSTM模型,构建了集成模型。使用数值结果和现场测量进行的验证表明,集成方法始终优于单独的ConvLSTM模型,特别是在长期多步预测中,表现出减少的误差传播和改进的泛化能力。这些发现强调了多分辨率集成策略的潜力,该策略共同利用多样化的时间输入尺度来增强AI驱动的岩土预测中的预测稳定性和准确性。

英文摘要

This study proposes a multi-resolution Convolutional Long Short-Term Memory (ConvLSTM) ensemble framework that leverages diverse temporal input resolutions to mitigate error accumulation and improve long-horizon forecasting of retaining-structure behavior during staged excavation. An extensive database of lateral wall displacement responses was generated through PLAXIS2D simulations incorporating five-layered soil stratigraphy, two excavation depths (14 and 20 m), and stochastically varied geotechnical and structural parameters, yielding 2,000 time-series deflection profiles. Three ConvLSTM models trained at different input resolutions were integrated using a fully connected neural network meta-learner to construct the ensemble model. Validation using both numerical results and field measurements demonstrated that the ensemble approach consistently outperformed the standalone ConvLSTM models, particularly in long-term multi-step prediction, exhibiting reduced error propagation and improved generalization. These findings underscore the potential of multi-resolution ensemble strategies that jointly exploit diverse temporal input scales to enhance predictive stability and accuracy in AI-driven geotechnical forecasting.

2502.18834 2026-06-09 cs.CE cs.LG 版本更新

FinTSB: A Comprehensive and Practical Benchmark for Financial Time Series Forecasting

FinTSB:一个全面且实用的金融时间序列预测基准

Yifan Hu, Yuante Li, Peiyuan Liu, Yuxia Zhu, Naiqi Li, Tao Dai, Shu-tao Xia, Dawei Cheng, Changjun Jiang

发表机构 * Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen 518055, China(清华大学深圳国际研究生院,清华大学,深圳 518055,中国) School of Computer Science, Carnegie Mellon University, Pittsburgh 15213, Pennsylvania, United States(卡内基梅隆大学计算机科学学院,匹兹堡 15213,宾夕法尼亚州,美国) School of Computer Science and Technology, Tongji University, Shanghai 201804, China(同济大学计算机科学与技术学院,上海 201804,中国) College of Computer Science and Software Engineering, Shenzhen University, Shenzhen 518055, China(深圳大学计算机科学与软件工程学院,深圳 518055,中国) Shanghai Artificial Intelligence Laboratory, Shanghai 200030, China(上海人工智能实验室,上海 200030,中国)

AI总结 针对金融时间序列预测中多样性不足、评估标准缺失和现实匹配度低的问题,提出FinTSB基准,通过分类运动模式、标准化评估指标和模拟真实交易约束,提供全面的评估平台。

详情
Journal ref
Frontiers of Computer Science 2026
AI中文摘要

金融时间序列记录了人脑增强决策行为,捕获了可用于盈利投资策略的历史信息。该领域吸引了大量研究者,提出了基于各种骨干网络的多种方法。然而,该领域的评估通常存在三个系统性局限:1. 未能考虑动态金融市场中观察到的全部股票运动模式(多样性差距);2. 缺乏统一的评估协议,削弱了跨研究性能比较的有效性(标准化缺失);3. 忽视关键市场结构因素,导致性能指标虚高,缺乏实际适用性(现实不匹配)。为解决这些问题,我们提出了FinTSB,一个全面且实用的金融时间序列预测基准。为增加多样性,我们将运动模式分为四类,对数据进行分词和预处理,并基于序列特征评估数据质量。为消除不同评估设置带来的偏差,我们在三个维度上标准化指标,并构建了一个用户友好、轻量级的流水线,集成了多种骨干网络的方法。为准确模拟真实交易场景并促进实际应用,我们广泛建模了各种监管约束,包括交易费用等。最后,我们在FinTSB上进行了大量实验,突出了关键见解,以指导不同市场条件下的模型选择。总体而言,FinTSB为研究者提供了一个新颖且全面的平台,用于改进和评估金融时间序列预测方法。代码可在https://github.com/TongjiFinLab/FinTSB获取。

英文摘要

Financial time series (FinTS) record the behavior of human-brain-augmented decision-making, capturing valuable historical information that can be leveraged for profitable investment strategies. Not surprisingly, this area has attracted considerable attention from researchers, who have proposed a wide range of methods based on various backbones. However, evaluation in this area often exhibits three systemic limitations: (1) failure to account for the full spectrum of stock movement patterns observed in dynamic financial markets (Diversity Gap); (2) the absence of unified assessment protocols, which undermines the validity of cross-study performance comparisons (Standardization Deficit); and (3) neglect of critical market structure factors, resulting in inflated performance metrics that lack practical applicability (Real-World Mismatch). To address these limitations, we propose FinTSB, a comprehensive and practical benchmark for financial time series forecasting (FinTSF). To increase the variety, we categorize movement patterns into four specific parts, tokenize and pre-process the data, and assess the data quality based on some sequence characteristics. To eliminate biases due to different evaluation settings, we standardize the metrics across three dimensions and build a user-friendly, lightweight pipeline incorporating methods from various backbones. To accurately simulate real-world trading scenarios and facilitate practical implementation, we extensively model various regulatory constraints, including transaction fees, among others. Finally, we conduct extensive experiments on FinTSB, highlighting key insights to guide model selection under varying market conditions. Overall, FinTSB provides researchers with a novel and comprehensive platform for improving and evaluating FinTSF methods. The code is available at https://github.com/TongjiFinLab/FinTSB.
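To see why ignoring transaction fees inflates reported performance, the sketch below charges a proportional fee on every change of position. It is not FinTSB's pipeline: the function `net_returns`, the fee model, and the numbers are hypothetical, and the benchmark's actual constraints should be taken from its repository.

```python
def net_returns(positions, asset_returns, fee_rate):
    """Per-period strategy returns after proportional transaction fees."""
    results, previous = [], 0.0
    for position, asset_return in zip(positions, asset_returns):
        turnover = abs(position - previous)  # fraction of capital traded
        results.append(position * asset_return - fee_rate * turnover)
        previous = position
    return results

positions = [1.0, 1.0, 0.0, 1.0]          # hypothetical target weights per period
asset_returns = [0.01, -0.02, 0.03, 0.01]  # hypothetical asset returns per period
gross = sum(p * r for p, r in zip(positions, asset_returns))
net = sum(net_returns(positions, asset_returns, fee_rate=0.001))
print(f"gross: {gross:.4f}  net: {net:.4f}")
```

Each of the three position changes costs 0.1% in this example, so a strategy with high turnover loses more to fees than one that trades rarely, even when both forecast equally well.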

2605.09813 2026-06-09 cs.NI cs.DC cs.LG cs.SY eess.SY 版本更新

Optimizing Server Placement for Vertical Federated Learning in Dynamic Edge/Fog Networks

优化动态边缘/雾网络中垂直联邦学习的服务器部署

Su Wang, Mung Chiang, H. Vincent Poor

发表机构 * Department of Electrical and Computer Engineering, Purdue University(普洛威斯顿大学电子工程与计算机科学系)

AI总结 本文研究动态边缘/雾网络中垂直联邦学习的控制与优化,提出SC-DN方法,通过联合优化服务器部署、传输功率、处理器频率和本地训练迭代数,提升模型性能与资源利用率。

Comments Under revision at IEEE/ACM transactions on networking

详情
AI中文摘要

我们研究了动态边缘/雾网络中垂直联邦学习(VFL)的控制与优化。VFL是一类分布式机器学习方法,其中各边缘/雾设备持有不同的数据特征。由于边缘/雾网络中数据特征和硬件的异构性,设备对VFL的贡献差异显著,且动态网络可能导致某些数据特征的永久退出或加入。在该设置下,我们提出的方法,即动态网络中的服务器控制VFL(SC-DN),首先证明了每个全局轮次都存在一个全局一阶驻点,然后利用这一结果,基于四个关键控制变量:(i)服务器部署,(ii)设备到服务器的传输功率,(iii)本地设备处理器频率,以及(iv)每个全局轮次的本地训练迭代数,联合优化机器学习模型训练和资源消耗。所得到的优化问题包含耦合变量以及多种形式的对数约束,我们证明它是一个混合整数符号式规划(signomial program)问题,属于NP难问题,并为此开发了一个通用求解器。最后,通过在图像和多模态数据集上的实验,我们表明我们的方法在分类/回归性能和资源消耗节省方面甚至优于贪心方法。

英文摘要

We investigate the control and optimization of vertical federated learning (VFL), a class of distributed machine learning (ML) methods in which edge/fog devices contain separate data features, in dynamic edge/fog networks. Owing to heterogeneous data features and hardware across edge/fog networks, devices' contributions to VFL vary substantially, and, moreover, dynamic edge/fog networks can lead to the permanent exit or entry of select data features. In this setting, our proposed methodology, server controlled VFL in dynamic networks (SC-DN), first establishes the existence of a global first-order stationary point for every global round, and then leverages this result to jointly optimize ML model training and resource consumption based on four key control variables: (i) server placement, (ii) device-to-server transmit power, (iii) local device processor frequency, and (iv) local training iterations per global round. The resulting optimization formulation contains coupled variables as well as numerous forms of logarithmic constraints; we show that it is a mixed-integer signomial program, an NP-hard problem, for which we develop a general solver. Finally, via experiments on both image and multi-modal datasets, we show that our methodology achieves better classification/regression performance and greater resource consumption savings than even greedy methodologies.

2507.18967 2026-06-09 cs.CV cs.AI cs.LG 版本更新

Underwater Waste Detection Using Deep Learning A Performance Comparison of YOLOv7 to 10 and Faster RCNN

利用深度学习进行水下垃圾检测:YOLOv7到YOLOv10与Faster R-CNN的性能比较

UMMPK Nawarathne, HMNS Kumari, HMLS Kumari

发表机构 * Faculty of Computing, Sri Lanka Institute of Information Technology(计算学院,斯里兰卡信息科技学院) Faculty of Information Technology and Communication Sciences, Tampere University(信息科技与通信科学学院,坦佩雷大学) Computing Centre, Faculty of Engineering, University of Peradeniya(工程学院计算机中心,佩拉德尼亚大学)

AI总结 本文比较了YOLOv7到YOLOv10及Faster R-CNN在水下垃圾检测中的性能,发现YOLOv8在低能见度和不同深度条件下表现最佳,mAP达80.9%。

Comments 7 pages, 11 figures, to be published in International Journal of Research in Computing (IJRC)

详情
Journal ref
Vol. 5 No. I (2026): International Journal of Research in Computing (IJRC)
AI中文摘要

水下污染是当今最严峻的环境问题之一,世界各地的海洋、河流及陆地景观中都存在大量垃圾。准确检测这些废弃物对废物管理、环境监测和缓解策略至关重要。本文考察了五种先进的目标识别算法,即YOLO(You Only Look Once)系列模型(YOLOv7、YOLOv8、YOLOv9、YOLOv10)和Faster R-CNN,以确定哪种模型在水下场景中识别物体最有效。这些模型在一个包含十五个类别的大型数据集上进行了充分的训练和测试,数据涵盖低能见度、不同深度等多种条件。在上述模型中,YOLOv8表现最佳,平均精度均值(mAP)为80.9%。作者将这一优势归因于YOLOv8的架构,称其包含改进的无锚框机制和自监督学习等特性,从而能在多种环境中更精确、高效地识别物体。这些发现突显了YOLOv8作为全球治污工具的潜力,有助于提升水下清理作业的检测能力和可扩展性。

英文摘要

Underwater pollution is one of today's most significant environmental concerns, with vast volumes of garbage found in seas, rivers, and landscapes around the world. Accurate detection of these waste materials is crucial for successful waste management, environmental monitoring, and mitigation strategies. In this study, we investigated the performance of five cutting-edge object recognition algorithms, namely YOLO (You Only Look Once) models, including YOLOv7, YOLOv8, YOLOv9, YOLOv10, and Faster Region-Convolutional Neural Network (R-CNN), to identify which model was most effective at recognizing materials in underwater situations. The models were thoroughly trained and tested on a large dataset containing fifteen different classes under diverse conditions, such as low visibility and variable depths. From the above-mentioned models, YOLOv8 outperformed the others, with a mean Average Precision (mAP) of 80.9%, indicating a significant performance. This increased performance is attributed to YOLOv8's architecture, which incorporates advanced features such as improved anchor-free mechanisms and self-supervised learning, allowing for more precise and efficient recognition of items in a variety of settings. These findings highlight the YOLOv8 model's potential as an effective tool in the global fight against pollution, improving both the detection capabilities and scalability of underwater cleanup operations.
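The mAP figure quoted above depends on deciding when a predicted box counts as a match for a ground-truth box, which is conventionally done with intersection over union (IoU). The sketch below shows only that building block; the box format and the function name are illustrative assumptions, and the abstract does not state which IoU threshold the authors used.

```python
# Illustrative sketch: IoU between two axis-aligned boxes.
# The box format (x_min, y_min, x_max, y_max) is an assumption for this example.

def iou(box_a, box_b):
    """Return intersection over union of two boxes, in [0, 1]."""
    inter_x_min = max(box_a[0], box_b[0])
    inter_y_min = max(box_a[1], box_b[1])
    inter_x_max = min(box_a[2], box_b[2])
    inter_y_max = min(box_a[3], box_b[3])

    # Clamp at zero so disjoint boxes yield an empty intersection.
    inter_w = max(0.0, inter_x_max - inter_x_min)
    inter_h = max(0.0, inter_y_max - inter_y_min)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```

A detection is then typically counted as a true positive when its IoU with a ground-truth box of the same class exceeds a chosen threshold, and average precision is computed per class from the resulting precision-recall curve.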

2508.03453 2026-06-09 cs.CL cs.LG 版本更新

Cropping outperforms dropout as an augmentation strategy for self-supervised training of text embeddings

作为文本嵌入自监督训练的增强策略,裁剪优于dropout

Rita González-Márquez, Philipp Berens, Dmitry Kobak

发表机构 * Hertie Institute for AI in Brain Health(人工智能与脑健康赫尔蒂研究所) University of Tübingen(图宾根大学) University of Tübingen, Germany(德国图宾根大学)

AI总结 本文研究了自监督微调中裁剪和dropout两种增强策略,发现裁剪在文本嵌入质量上表现更优,尤其在领域内数据中能快速生成高质量嵌入。

详情
Journal ref
Transactions on Machine Learning Research (TMLR) 2026
AI中文摘要

文本嵌入,即整个文本的向量表示,在许多NLP应用中起着重要作用,如检索增强生成、聚类,或为数据探索而对文本集合进行可视化。目前,表现最佳的嵌入模型是通过有监督对比微调从预训练语言模型得到的。这种微调策略依赖外部的相似性定义和标注数据来生成正样本对。本文研究自监督微调,并系统比较了用于微调文本嵌入模型的两种最知名的增强策略。我们在MTEB及额外的领域内评估上衡量嵌入质量,结果表明裁剪增强显著优于基于dropout的方法。我们发现,在领域外数据上,所得嵌入的质量远低于有监督的最先进模型;但在领域内数据上,自监督微调只需极短的微调即可产生高质量的文本嵌入。最后,我们表明表示质量越靠近最后几层transformer层越高,而这些层在微调过程中变化最大;仅微调这最后几层就足以达到相近的嵌入质量。

英文摘要

Text embeddings, i.e. vector representations of entire texts, play an important role in many NLP applications, such as retrieval-augmented generation, clustering, or visualizing collections of texts for data exploration. Currently, top-performing embedding models are derived from pre-trained language models via supervised contrastive fine-tuning. This fine-tuning strategy relies on an external notion of similarity and annotated data for generation of positive pairs. Here we study self-supervised fine-tuning and systematically compare the two most well-known augmentation strategies used for fine-tuning text embeddings models. We assess embedding quality on MTEB and additional in-domain evaluations and show that cropping augmentation strongly outperforms the dropout-based approach. We find that on out-of-domain data, the quality of resulting embeddings is substantially below the supervised state-of-the-art models, but for in-domain data, self-supervised fine-tuning can produce high-quality text embeddings after very short fine-tuning. Finally, we show that representation quality increases towards the last transformer layers, which undergo the largest change during fine-tuning; and that fine-tuning only those last layers is sufficient to reach similar embedding quality.
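As a rough illustration of the cropping augmentation compared above, the sketch below draws two random contiguous crops from the same token sequence and treats them as a positive pair. The crop-length range, the function names, and the token-level granularity are assumptions made for this example; the abstract does not specify how the authors sample crops.

```python
import random

# Illustrative sketch: cropping augmentation for contrastive positive pairs.
# The 25%-75% crop-length range is an assumption, not taken from the paper.

def random_crop(tokens, min_frac=0.25, max_frac=0.75, rng=random):
    """Return a random contiguous slice of a non-empty token list."""
    if not tokens:
        raise ValueError("tokens must be non-empty")
    n = len(tokens)
    length = max(1, int(n * rng.uniform(min_frac, max_frac)))
    start = rng.randint(0, n - length)  # randint is inclusive on both ends
    return tokens[start:start + length]


def positive_pair(tokens, rng=random):
    """Two independent crops of the same text form one positive pair."""
    return random_crop(tokens, rng=rng), random_crop(tokens, rng=rng)
```

By contrast, the dropout-based approach passes the same uncropped text through the model twice and relies on different dropout masks to produce the two views.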

2309.10370 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Geometric structure of shallow neural networks and constructive ${\mathcal L}^2$ cost minimization

浅层神经网络的几何结构与构造性${\mathcal L}^2$成本最小化

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文研究欠参数化浅层ReLU网络的成本最小化问题,不使用梯度下降,而是利用分类数据的结构显式构造上界。证明了成本函数最小值的上界为$O(δ_P)$量级($δ_P$衡量训练数据的信噪比),且构造性训练所得的网络对输入空间中一个特定的$Q$维子空间进行度量化。

Comments AMS Latex, 29 pages. Experimental evidence added. To appear in Physica D: Nonlinear Phenomena

详情
Journal ref
Phys. D, 490, Article No. 135176 (2026)
AI中文摘要

本文研究欠参数化浅层ReLU网络中的成本(损失)最小化问题:不使用梯度下降,而是利用分类数据的结构显式构造上界。重点在于阐明近似极小值点和精确极小值点的几何结构。我们考虑$L^2$成本函数,输入空间为$\mathbb{R}^M$,输出空间为${\mathbb R}^Q$,其中$Q\leq M$,训练输入样本量可任意大。我们证明成本函数最小值的上界为$O(δ_P)$量级,其中$δ_P$衡量训练数据的信噪比。在$M=Q$的特殊情形下,我们显式确定了成本函数的一个精确的退化局部极小值,并证明该精确值与$Q\leq M$时所得上界之间的相对误差为$O(δ_P^2)$。上界的证明给出了一个构造性训练的网络;我们证明它对输入空间${\mathbb R}^M$中一个特定的$Q$维子空间进行度量化。我们还讨论了在该情形下成本函数全局极小值的刻画问题。

英文摘要

In this paper, we approach the problem of cost (loss) minimization in underparametrized shallow ReLU networks through the explicit construction of upper bounds which appeal to the structure of classification data, without use of gradient descent. A key focus is on elucidating the geometric structure of approximate and precise minimizers. We consider an $L^2$ cost function, input space $\mathbb{R}^M$, output space ${\mathbb R}^Q$ with $Q\leq M$, and training input sample size that can be arbitrarily large. We prove an upper bound on the minimum of the cost function of order $O(δ_P)$ where $δ_P$ measures the signal-to-noise ratio of training data. In the special case $M=Q$, we explicitly determine an exact degenerate local minimum of the cost function, and show that the sharp value differs from the upper bound obtained for $Q\leq M$ by a relative error $O(δ_P^2)$. The proof of the upper bound yields a constructively trained network; we show that it metrizes a particular $Q$-dimensional subspace in the input space ${\mathbb R}^M$. We comment on the characterization of the global minimum of the cost function in the given context.

2602.13271 2026-06-09 cs.AI cs.HC cs.LG 版本更新

Human-Centered Explainable AI for Security Enhancement: A Deep Intrusion Detection Framework

面向安全增强的人本可解释AI:一种深度入侵检测框架

Md Muntasir Jahid Ayan, Md. Shahriar Rashid, Tazzina Afroze Hassan, Hossain Md. Mubashshir Jamil, Mahbubul Islam, Lisan Al Amin, Rupak Kumar Das, Farzana Akter, Faisal Quader

发表机构 * Department of Computer Science and Engineering, United International University (UIU), Dhaka 1212, Bangladesh(联合国际大学(UIU)计算机科学与工程系,达卡1212,孟加拉国) Department of Electrical and Electronic Engineering, Islamic University of Technology, Gazipur 1704, Bangladesh(伊斯兰科技大学电气与电子工程系,加济布尔1704,孟加拉国) Department of Computer Science and Engineering (CSE), University of Asia Pacific (UAP), Dhaka 1207, Bangladesh(亚太大学(UAP)计算机科学与工程系(CSE),达卡1207,孟加拉国) Department of Information Systems, University of Maryland, Baltimore, 21250, Maryland, USA(马里兰大学信息系统系,巴尔的摩,21250,马里兰州,美国) College of Information Sciences and Technology, Pennsylvania State University, University Park, PA 16802, USA(宾夕法尼亚州立大学信息科学与技术学院,尤尼弗西蒂帕克,PA 16802,美国) Department of Information Technology, Washington University of Science and Technology, Alexandria, VA(华盛顿科技大学信息技术系,亚历山德里亚,VA) College of Engineering and Information Technology, University of Maryland, College Park, 20742, Maryland, USA(马里兰大学工程与信息技术学院,科利奇帕克,20742,马里兰州,美国)

AI总结 本文提出一种结合可解释AI的深度入侵检测框架,利用CNN和LSTM捕捉流量序列的时间依赖性,通过SHAP实现模型可解释性,提升安全分析的透明度与可靠性。

详情
AI中文摘要

随着网络威胁的复杂性和频率增加,需要准确且可解释的入侵检测系统(IDS)。本文提出了一种新颖的IDS框架,整合可解释人工智能(XAI)以增强深度学习模型的透明性。该框架在NSL-KDD基准数据集上进行实验评估,显示优于传统IDS和黑箱深度学习模型。所提方法结合卷积神经网络(CNN)和长短期记忆网络(LSTM)以捕捉流量序列的时间依赖性。深度学习结果表明,CNN和LSTM的准确率均达到0.99,其中LSTM在宏平均精度、召回率和F-1分数上优于CNN。对于加权平均精度、召回率和F-1分数,两种模型得分几乎相同。为确保可解释性,XAI模型SHapley Additive exPlanations(SHAP)被纳入,使安全分析师能够理解和验证模型决策。SHAP指出,srv_serror_rate、dst_host_srv_serror_rate和serror_rate是两个模型中的一些重要特征。我们还基于IPIP6和Big Five人格特质进行了以信任为导向的专家调查,通过交互式UI评估系统的可靠性和可用性。本工作强调了在网络安全解决方案中结合性能和透明性的潜力,并通过自适应学习推荐未来改进以实现实时威胁检测。

英文摘要

The increasing complexity and frequency of cyber-threats demand intrusion detection systems (IDS) that are not only accurate but also interpretable. This paper presented a novel IDS framework that integrated Explainable Artificial Intelligence (XAI) to enhance transparency in deep learning models. The framework was evaluated experimentally using the benchmark dataset NSL-KDD, demonstrating superior performance compared to traditional IDS and black-box deep learning models. The proposed approach combined Convolutional Neural Network (CNN) and Long Short-Term Memory (LSTM) networks for capturing temporal dependencies in traffic sequences. Our deep learning results showed that both CNN and LSTM reached 0.99 for accuracy, whereas LSTM outperformed CNN at macro average precision, recall, and F-1 score. For weighted average precision, recall, and F-1 score, both models scored almost similarly. To ensure interpretability, the XAI model SHapley Additive exPlanations (SHAP) was incorporated, enabling security analysts to understand and validate model decisions. Some notable influential features were srv_serror_rate, dst_host_srv_serror_rate, and serror_rate for both models, as pointed out by SHAP. We also conducted a trust-focused expert survey based on IPIP6 and Big Five personality traits via an interactive UI to evaluate the system's reliability and usability. This work highlighted the potential of combining performance and transparency in cybersecurity solutions and recommends future enhancements through adaptive learning for real-time threat detection.
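SHAP attributions are grounded in Shapley values from cooperative game theory. As a hedged illustration of that underlying idea (not of the `shap` library's API, and not of the authors' pipeline), the sketch below computes exact Shapley values for a tiny set function by averaging marginal contributions over all feature orderings. The function name and the toy value function are assumptions, and the brute-force enumeration is practical only for a handful of features.

```python
from itertools import permutations

# Illustrative sketch: exact Shapley values by enumerating all orderings.
# Cost grows factorially with the number of features, so this is a toy only.

def shapley_values(value_fn, n_features):
    """value_fn maps a frozenset of feature indices to a real-valued payoff."""
    totals = [0.0] * n_features
    orderings = list(permutations(range(n_features)))
    for order in orderings:
        coalition = set()
        for feature in order:
            before = value_fn(frozenset(coalition))
            coalition.add(feature)
            after = value_fn(frozenset(coalition))
            totals[feature] += after - before  # marginal contribution
    return [total / len(orderings) for total in totals]


# Toy payoff: an additive model, where each feature's Shapley value
# should equal its own weight.
toy_weights = [1.0, 2.0, 3.0]

def toy_value(coalition):
    return sum(toy_weights[i] for i in coalition)

toy_attributions = shapley_values(toy_value, len(toy_weights))
```

The attributions always sum to the payoff of the full feature set minus the payoff of the empty set, which is the property that lets analysts read them as a decomposition of a single prediction.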

2602.00058 2026-06-09 cs.CR cs.LG 版本更新

Comparison of Multiple Classifiers for Android Malware Detection with Emphasis on Feature Insights Using CICMalDroid 2020 Dataset

基于CICMalDroid 2020数据集的Android恶意软件检测多分类器比较:侧重特征洞察

Md Min-Ha-Zul Abedin, Tazqia Mehrub

发表机构 * Department of Biosystems Engineering, Auburn University(奥本大学生物系统工程系) Independent Researcher(独立研究员)

AI总结 本文比较了多个分类器在Android恶意软件检测中的性能,发现基于原始特征的梯度提升在准确率、精确率、召回率和F1值上表现最佳,同时揭示了关键驱动因素。

详情
AI中文摘要

准确的Android恶意软件检测对于大规模保护用户至关重要。基于签名的扫描器跟不上公共应用商店的快速发布周期。我们的目标是将全面的数据集与严谨、透明的评估相结合,构建可信的检测器,并识别决策背后可解释的驱动因素。我们使用CICMalDroid2020数据集,其中包含17,341个应用,涵盖良性软件、广告软件、银行恶意软件、短信恶意软件和风险软件。我们提取了301个静态特征和263个动态特征,构成564维的混合向量,然后在三种方案(原始特征、主成分分析PCA、线性判别分析LDA)下评估了七个分类器,采用70%训练、30%测试的划分。结果表明,在原始特征上使用梯度提升效果最佳。XGBoost的准确率、精确率、召回率和F1值分别为0.9747、0.9703、0.9731和0.9716,混淆矩阵显示恶意应用被误标为良性的情况很少。HistGradientBoosting的准确率为0.9741,F1值为0.9708;CatBoost和随机森林略低,准确率分别为0.9678和0.9687,F1值分别为0.9636和0.9637。KNN和SVM表现落后。PCA降低了所有模型的性能,其中XGBoost的准确率降至0.9164,F1值降至0.8988。LDA的准确率保持在95%左右,并使投影中的可分簇更加清晰。一棵深度为2的代理树表明,包名、主活动和目标SDK是关键驱动因素。这些发现为Android恶意软件检测建立了高保真的有监督基线,并表明丰富的混合特征结合梯度提升可为部署提供实用且可解释的基础。

英文摘要

Accurate Android malware detection was critical for protecting users at scale. Signature scanners lagged behind fast release cycles on public app stores. We aimed to build a trustworthy detector by pairing a comprehensive dataset with a rigorous, transparent evaluation, and to identify interpretable drivers of decisions. We used CICMalDroid2020, which contained 17,341 apps across Benign, Adware, Banking, SMS malware, and Riskware. We extracted 301 static and 263 dynamic features into a 564 dimensional hybrid vector, then evaluated seven classifiers under three schemes, original features, principal component analysis, PCA, and linear discriminant analysis, LDA, with a 70 percent training and 30 percent test split. Results showed that gradient boosting on the original features performed best. XGBoost achieved 0.9747 accuracy, 0.9703 precision, 0.9731 recall, and 0.9716 F1, and the confusion matrix indicated rare benign labels for malicious apps. HistGradientBoosting reached 0.9741 accuracy and 0.9708 F1, while CatBoost and Random Forest were slightly lower at 0.9678 and 0.9687 accuracy with 0.9636 and 0.9637 F1. KNN and SVM lagged. PCA reduced performance for all models, with XGBoost dropping to 0.9164 accuracy and 0.8988 F1. LDA maintained mid 90s accuracy and clarified separable clusters in projections. A depth two surrogate tree highlighted package name, main activity, and target SDK as key drivers. These findings established high fidelity supervised baselines for Android malware detection and indicated that rich hybrid features with gradient boosting offered a practical and interpretable foundation for deployment.
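The four scores reported above can all be derived from a multi-class confusion matrix. The sketch below shows one common way to do so, using macro averaging over classes. The macro choice is an assumption, since the abstract does not say how the per-class scores were averaged, and the toy matrix is invented for illustration rather than taken from the paper.

```python
# Illustrative sketch: accuracy and macro-averaged precision/recall/F1
# from a confusion matrix. Macro averaging is an assumption here.

def macro_metrics(confusion):
    """confusion[i][j] = count of samples with true class i predicted as j."""
    n = len(confusion)
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[k][k] for k in range(n))

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = confusion[k][k]
        predicted_k = sum(confusion[i][k] for i in range(n))  # column sum
        actual_k = sum(confusion[k])                          # row sum
        precision = tp / predicted_k if predicted_k else 0.0
        recall = tp / actual_k if actual_k else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        precisions.append(precision)
        recalls.append(recall)
        f1s.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Invented two-class example, not data from the paper.
toy_confusion = [[8, 2], [1, 9]]
toy_scores = macro_metrics(toy_confusion)
```

Macro averaging weights every class equally, so it is sensitive to performance on small classes; a support-weighted average would instead track overall accuracy more closely on imbalanced data.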

2508.13747 2026-06-09 cs.LG 版本更新

DREAMS: Preserving both Local and Global Structure in Dimensionality Reduction

DREAMS: 在降维中保持局部和全局结构

Noël Kury, Dmitry Kobak, Sebastian Damrich

发表机构 * Hertie Institute for AI in Brain Health(人工智能与脑健康赫尔蒂研究所) University of Tübingen(图宾根大学) University of Tübingen, Germany(德国图宾根大学)

AI总结 DREAMS结合t-SNE和PCA的局部和全局结构保持,通过简单正则化项生成多种嵌入,平衡局部和全局结构。

Comments Transactions on Machine Learning Research (2026)

详情
Journal ref
Transactions on Machine Learning Research (TMLR) 2026
AI中文摘要

降维技术广泛用于将高维数据可视化为二维。现有方法通常只保留局部(如t-SNE、UMAP)或全局(如MDS、PCA)结构,但没有方法能同时良好表示两者。本文提出DREAMS(多尺度增强降维),通过简单正则化项结合t-SNE的局部结构保持和PCA的全局结构保持。我们的方法在t-SNE局部结构良好的嵌入和PCA全局结构良好的嵌入之间生成一系列嵌入,高效平衡局部和全局结构保持。我们在十一组真实世界数据集上基准测试DREAMS,展示其在多尺度结构保持方面优于先前方法的能力。

英文摘要

Dimensionality reduction techniques are widely used for visualizing high-dimensional data in two dimensions. Existing methods are typically designed to preserve either local (e.g., $t$-SNE, UMAP) or global (e.g., MDS, PCA) structure of the data, but none of the established methods can represent both aspects well. In this paper, we present DREAMS (Dimensionality Reduction Enhanced Across Multiple Scales), a method that combines the local structure preservation of $t$-SNE with the global structure preservation of PCA via a simple regularization term. Our approach generates a spectrum of embeddings between the locally well-structured $t$-SNE embedding and the globally well-structured PCA embedding, efficiently balancing both local and global structure preservation. We benchmark DREAMS across eleven real-world datasets, showcasing qualitatively and quantitatively its superior ability to preserve structure across multiple scales compared to previous approaches.

2405.07098 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

Interpretable global minima of deep ReLU neural networks on sequentially separable data

可解释的深度ReLU神经网络在依次可分数据上的全局极小值

Thomas Chen, Patrícia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文通过构造零损失分类器,利用累积参数确定截断映射,研究了在小且分离的簇数据及依次线性可分等价类情况下,深度ReLU网络的全局极小值描述。

Comments AMS Latex, 31 pages, 3 figures

详情
Journal ref
J. Mach. Learn. Res., 26 (173): 1-31 (2025)
AI中文摘要

我们显式地构造了零损失神经网络分类器。我们将权重矩阵和偏置向量用累积参数表示,这些参数决定了递归作用于输入空间的截断映射。考虑的训练数据配置包括(i)足够小且彼此分离的簇对应于每个类别,以及(ii)依次线性可分的等价类。在最佳情况下,对于$\mathbb{R}^M$中的$Q$类数据,全局极小值可以用$Q(M+2)$个参数描述。

英文摘要

We explicitly construct zero loss neural network classifiers. We write the weight matrices and bias vectors in terms of cumulative parameters, which determine truncation maps acting recursively on input space. The configurations for the training data considered are (i) sufficiently small, well separated clusters corresponding to each class, and (ii) equivalence classes which are sequentially linearly separable. In the best case, for $Q$ classes of data in $\mathbb{R}^M$, global minimizers can be described with $Q(M+2)$ parameters.

2512.10745 2026-06-09 physics.med-ph cs.LG 版本更新

PMB-NN: Physiology-Centred Hybrid AI for Personalized Hemodynamic Monitoring from Photoplethysmography

PMB-NN:以生理为中心的混合AI,用于基于光电容积脉搏波描记的个性化血流动力学监测

Yaowen Zhang, Libera Fresiello, Peter H. Veltink, Dirk W. Donker, Ying Wang

发表机构 * Department of Biomedical Signals and Systems, University of Twente(特文特大学生物医学信号与系统系) Department of Cardiovascular and Respiratory Physiology, University of Twente(特文特大学心血管与呼吸生理学系) Department of Intensive Care, University Medical Center Utrecht(乌得勒支大学医学中心重症医学科)

AI总结 本文提出PMB-NN方法,结合生理模型与深度学习,实现个性化血流动力学监测,验证其在血压估计中的准确性、可解释性和合理性,展示了生理约束对混合AI框架的增强作用。

详情
AI中文摘要

连续监测血压(BP)以及外周阻力(R)和动脉顺应性(C)等血流动力学参数,对血管功能障碍的早期检测至关重要。尽管光电容积脉搏波描记(PPG)可穿戴设备已日益普及,现有数据驱动的BP估计方法仍缺乏可解释性。我们改进了此前提出的以生理为中心的混合AI方法,即基于生理模型的神经网络(PMB-NN),用于血压估计;该方法将深度学习与以R和C为参数的二元件Windkessel(弹性腔)模型相结合,R和C起到物理约束的作用。PMB-NN模型使用PPG衍生的时间特征按受试者单独训练,同时利用人口统计学信息推断一个中间变量:心输出量。我们在10名健康成人于两天内进行静态和骑行活动的数据上验证了模型,以检验其跨日鲁棒性,并与深度学习(DL)模型(FCNN、CNN-LSTM、Transformer)和独立的Windkessel生理模型(PM)进行对比。验证从三个角度进行:准确性、可解释性和合理性。PMB-NN的收缩压准确性(MAE:7.2 mmHg)与DL基准相当,舒张压表现(MAE:3.9 mmHg)则低于DL模型。然而,PMB-NN的生理合理性高于DL基线和PM,表明混合架构统一并增强了生理原理与数据驱动技术各自的优势。除BP外,PMB-NN在训练过程中辨识出R(ME:0.15 mmHg·s/ml)和C(ME:-0.35 ml/mmHg),准确性与PM相近,表明嵌入的生理约束赋予了混合AI框架可解释性。这些结果表明,对于日常血流动力学监测,PMB-NN是纯数据驱动方法之外一种均衡且有生理学依据的替代方案。

英文摘要

Continuous monitoring of blood pressure (BP) and hemodynamic parameters such as peripheral resistance (R) and arterial compliance (C) is critical for early detection of vascular dysfunction. While photoplethysmography (PPG) wearables have gained popularity, existing data-driven methods for BP estimation lack interpretability. We advanced our previously proposed physiology-centered hybrid AI method, the Physiological Model-Based Neural Network (PMB-NN), for blood pressure estimation; it unifies deep learning with a 2-element Windkessel-based model parameterized by R and C, which act as physics constraints. The PMB-NN model was trained in a subject-specific manner using PPG-derived timing features, while demographic information was used to infer an intermediate variable: cardiac output. We validated the model on 10 healthy adults performing static and cycling activities across two days to assess the model's day-to-day robustness, benchmarked against deep learning (DL) models (FCNN, CNN-LSTM, Transformer) and a standalone Windkessel-based physiological model (PM). Validation was conducted from three perspectives: accuracy, interpretability, and plausibility. PMB-NN achieved systolic BP accuracy (MAE: 7.2 mmHg) comparable to DL benchmarks, with diastolic performance (MAE: 3.9 mmHg) lower than DL models. However, PMB-NN exhibited higher physiological plausibility than both DL baselines and PM, suggesting that the hybrid architecture unifies and enhances the respective merits of physiological principles and data-driven techniques. Beyond BP, PMB-NN identified R (ME: 0.15 mmHg$\cdot$s/ml) and C (ME: -0.35 ml/mmHg) during training with accuracy similar to PM, demonstrating that the embedded physiological constraints confer interpretability to the hybrid AI framework. These results position PMB-NN as a balanced, physiologically grounded alternative to purely data-driven approaches for daily hemodynamic monitoring.
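For readers unfamiliar with the physiological model mentioned above, the standard 2-element Windkessel relates arterial pressure P to inflow Q through C·dP/dt = Q(t) − P/R. The sketch below integrates that equation with a forward Euler step. It illustrates the textbook model only, not the PMB-NN architecture; the function name, step size, and parameter values are assumptions chosen for the example.

```python
import math

# Illustrative sketch: forward-Euler simulation of a 2-element Windkessel,
#   C * dP/dt = Q(t) - P / R
# Units follow the abstract: R in mmHg*s/ml, C in ml/mmHg, Q in ml/s, P in mmHg.

def simulate_windkessel(flow, r, c, p0, dt):
    """Return the pressure trace for a list of inflow samples spaced dt apart."""
    pressure = p0
    trace = [pressure]
    for q in flow:
        dp_dt = (q - pressure / r) / c
        pressure = pressure + dt * dp_dt
        trace.append(pressure)
    return trace


# With zero inflow, pressure decays roughly as p0 * exp(-t / (R * C)).
decay_trace = simulate_windkessel([0.0] * 1000, r=1.0, c=1.5, p0=80.0, dt=1e-3)
```

Under constant inflow the pressure settles at Q·R, which is why R and C can serve as interpretable constraints: each has a direct physical effect on the shape of the pressure waveform.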

2503.23822 2026-06-09 cs.LG 版本更新

Node Embeddings via Neighbor Embeddings

通过邻居嵌入进行节点嵌入

Jan Niklas Böhm, Marius Keute, Alica Guzmán, Sebastian Damrich, Andrew Draganov, Dmitry Kobak

发表机构 * Hertie AI, University of Tübingen, Germany(赫尔特人工智能研究所、图宾根大学,德国) Department of Computer Science, Aarhus University, Denmark(计算机科学系,奥胡斯大学,丹麦)

AI总结 本文提出图邻居嵌入框架,无需随机游走即可直接拉近相邻节点的嵌入向量,在局部结构保持方面显著优于现有节点嵌入算法;应用于2D节点嵌入问题时,得到的图t-SNE布局也优于现有图布局算法。

Comments Accepted to Transactions of Machine Learning Research (TMLR)

详情
Journal ref
Transactions on Machine Learning Research (TMLR) 2025
AI中文摘要

节点嵌入是非参数图表示学习中的一种范式,它将图节点嵌入到给定的向量空间中,以便进行下游处理。最先进的节点嵌入算法(如DeepWalk和node2vec)基于以随机游走定义的节点相似性以及对比学习。在本工作中,我们提出图邻居嵌入(图NE)框架,该框架直接拉近相邻节点的嵌入向量,而不依赖任何随机游走。我们证明图NE在局部结构保持方面显著优于最先进的节点嵌入算法。此外,我们将图NE应用于2D节点嵌入问题,得到的图t-SNE布局同样优于现有的图布局算法。

英文摘要

Node embeddings are a paradigm in non-parametric graph representation learning, where graph nodes are embedded into a given vector space to enable downstream processing. State-of-the-art node-embedding algorithms, such as DeepWalk and node2vec, are based on random-walk notions of node similarity and on contrastive learning. In this work, we introduce the graph neighbor-embedding (graph NE) framework that directly pulls together embedding vectors of adjacent nodes without relying on any random walks. We show that graph NE strongly outperforms state-of-the-art node-embedding algorithms in terms of local structure preservation. Furthermore, we apply graph NE to the 2D node-embedding problem, obtaining graph t-SNE layouts that also outperform existing graph-layout algorithms.

2510.06742 2026-06-09 cs.AI cs.LG 版本更新

MultiCNKG: Integrating Cognitive Neuroscience, Gene, and Disease Knowledge Graphs Using Large Language Models

MultiCNKG: 利用大语言模型整合认知神经科学、基因和疾病知识图谱

Ali Sarabadani, Kheirolah Rahsepar Fard

发表机构 * Department of Computer Engineering and Information Technology, University of Qom(库姆大学计算机工程与信息技术系) University of Qom(库姆大学)

AI总结 本文提出MultiCNKG框架,整合认知神经科学、基因和疾病知识图谱,利用大语言模型实现实体对齐和图谱增强,提升生物医学领域知识图谱的整合与应用能力。

详情
AI中文摘要

大语言模型(LLM)的出现革新了生物医学和认知科学中知识图谱(KG)的整合,克服了传统机器学习方法在捕捉基因、疾病和认知过程之间复杂语义联系方面的局限。我们提出MultiCNKG,一个融合三个关键知识源的创新框架:认知神经科学知识图谱(CNKG),包含2.9K节点和4.3K边,涵盖9种节点类型和20种边类型;基因本体(GO),包含43K节点和75K边,涵盖3种节点类型和4种边类型;疾病本体(DO),包含11.2K节点和8.8K边,涵盖1种节点类型和2种边类型。借助GPT-4等LLM,我们进行实体对齐、语义相似度计算和图谱增强,构建了一个连接遗传机制、神经系统疾病和认知功能的统一知识图谱。所得MultiCNKG包含6.9K节点(5种类型,如基因、疾病、认知过程)和11.3K边(7种类型,如导致、关联、调控),提供了从分子层面到行为层面的多层次视图。基于精确率(85.20%)、召回率(87.30%)、覆盖率(92.18%)、图一致性(82.50%)、新颖性检测(40.28%)和专家验证(89.50%)等指标的评估证实了其鲁棒性和一致性。使用TransE(MR:391,MRR:0.411)和RotatE(MR:263,MRR:0.395)等模型进行的链接预测评估显示,其性能相对于FB15k-237和WN18RR等基准具有竞争力。该知识图谱可推动个性化医疗、认知障碍诊断以及认知神经科学假设提出等方面的应用。

英文摘要

The advent of large language models (LLMs) has revolutionized the integration of knowledge graphs (KGs) in biomedical and cognitive sciences, overcoming limitations in traditional machine learning methods for capturing intricate semantic links among genes, diseases, and cognitive processes. We introduce MultiCNKG, an innovative framework that merges three key knowledge sources: the Cognitive Neuroscience Knowledge Graph (CNKG) with 2.9K nodes and 4.3K edges across 9 node types and 20 edge types; Gene Ontology (GO) featuring 43K nodes and 75K edges in 3 node types and 4 edge types; and Disease Ontology (DO) comprising 11.2K nodes and 8.8K edges with 1 node type and 2 edge types. Leveraging LLMs like GPT-4, we conduct entity alignment, semantic similarity computation, and graph augmentation to create a cohesive KG that interconnects genetic mechanisms, neurological disorders, and cognitive functions. The resulting MultiCNKG encompasses 6.9K nodes across 5 types (e.g., Genes, Diseases, Cognitive Processes) and 11.3K edges spanning 7 types (e.g., Causes, Associated with, Regulates), facilitating a multi-layered view from molecular to behavioral domains. Assessments using metrics such as precision (85.20%), recall (87.30%), coverage (92.18%), graph consistency (82.50%), novelty detection (40.28%), and expert validation (89.50%) affirm its robustness and coherence. Link prediction evaluations with models like TransE (MR: 391, MRR: 0.411) and RotatE (MR: 263, MRR: 0.395) show competitive performance against benchmarks like FB15k-237 and WN18RR. This KG advances applications in personalized medicine, cognitive disorder diagnostics, and hypothesis formulation in cognitive neuroscience.
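The link-prediction figures above use mean rank (MR, lower is better) and mean reciprocal rank (MRR, higher is better). The short sketch below computes both from a list of ranks of the correct entity, where rank 1 is best. The function names and the toy ranks are illustrative and do not come from the paper.

```python
# Illustrative sketch: mean rank (MR) and mean reciprocal rank (MRR).
# Each rank is the 1-based position of the correct entity in a model's ranking.

def mean_rank(ranks):
    """Arithmetic mean of the ranks; dominated by the worst-ranked cases."""
    return sum(ranks) / len(ranks)


def mean_reciprocal_rank(ranks):
    """Mean of 1/rank; dominated by how often the answer is near the top."""
    return sum(1.0 / rank for rank in ranks) / len(ranks)


toy_ranks = [1, 2, 4]
toy_mr = mean_rank(toy_ranks)
toy_mrr = mean_reciprocal_rank(toy_ranks)
```

Because MR is sensitive to a few very poor ranks while MRR rewards top-of-list hits, the two can disagree, as in the reported numbers where TransE has the worse MR but the better MRR.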

2507.17726 2026-06-09 cond-mat.dis-nn cond-mat.mtrl-sci cs.LG 版本更新

Deep Generative Learning of Magnetic Frustration in Artificial Spin Ice from Magnetic Force Microscopy Images

基于磁力显微镜图像对人工自旋冰中磁阻挫的深度生成学习

Arnab Neogi, Suryakant Mishra, Prasad P Iyer, Tzu-Ming Lu, Ezra Bussmann, Sergei Tretiak, Andrew Crandall Jones, Jian-Xin Zhu

发表机构 * Theoretical Division, Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室理论部) Center for Integrated Nanotechnologies, Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室集成纳米技术中心) Center for Integrated Nanotechnologies, Sandia National Laboratory(桑迪亚国家实验室集成纳米技术中心)

AI总结 本文通过深度学习方法从磁力显微镜图像中自动计算自旋冰结构的磁矩和方向,利用变分自编码器生成合成图像并提取特征,以减少实验和分割误差,实现对阻挫顶点和纳米磁体段的精确识别,优化自旋冰构型。

详情
AI中文摘要

规模日益增长的原子分辨率显微图像数据集,推动了用于识别和分析图像中细微物理现象的机器学习方法的发展。在本工作中,蜂窝晶格自旋冰样品的显微图像被用作数据集,用于自动计算自旋冰构型的净磁矩和取向。在工作流程的第一阶段,训练机器学习模型以准确预测自旋冰结构中的磁矩和方向。变分自编码器(VAE)是一种新兴的无监督深度学习技术,被用于生成高质量的合成磁力显微镜(MFM)图像并提取潜在特征表示,从而减少实验误差和分割误差。所提方法的第二阶段能够精确识别和预测阻挫顶点与纳米磁体段,从而有效关联显微图像的结构与功能特征。这有助于设计具有可控阻挫模式的优化自旋冰构型,为按需合成提供可能。

英文摘要

Increasingly large datasets of microscopic images with atomic resolution facilitate the development of machine learning methods to identify and analyze subtle physical phenomena embedded within the images. In this work, microscopic images of honeycomb lattice spin-ice samples serve as datasets from which we automate the calculation of net magnetic moments and directional orientations of spin-ice configurations. In the first stage of our workflow, machine learning models are trained to accurately predict magnetic moments and directions within spin-ice structures. Variational Autoencoders (VAEs), an emergent unsupervised deep learning technique, are employed to generate high-quality synthetic magnetic force microscopy (MFM) images and extract latent feature representations, thereby reducing experimental and segmentation errors. The second stage of proposed methodology enables precise identification and prediction of frustrated vertices and nanomagnetic segments, effectively correlating structural and functional aspects of microscopic images. This facilitates the design of optimized spin-ice configurations with controlled frustration patterns, enabling potential on-demand synthesis.

2507.15152 2026-06-09 cs.CL cs.AI cs.LG 版本更新

What Level of Automation is "Good Enough"? A Benchmark of Large Language Models for Meta-Analysis Data Extraction

什么是‘足够’的自动化水平?大型语言模型在元分析数据提取中的基准测试

Lingbo Li, Anuradha Mathrani, Teo Susnjak

发表机构 * School of Mathematical and Computational Sciences(数学与计算科学学院) Massey University(梅西大学) Auckland, New Zealand(新西兰奥克兰)

AI总结 本文评估了三种大型语言模型在医疗领域数据提取中的性能,发现定制提示能显著提升召回率,提出三层次指南以平衡自动化与专家监督。

详情
Journal ref
Research Synthesis Methods (2026)
AI中文摘要

自动化从全文随机对照试验(RCT)中提取数据用于元分析仍是一个重大挑战。本研究评估了三种LLM(Gemini-2.0-flash、Grok-3、GPT-4o-mini)在高血压、糖尿病和骨科三个医学领域中统计结果、偏倚风险评估和研究层面特征任务上的实际表现。我们测试了四种不同的提示策略(基本提示、自我反思提示、模型集成和定制提示)以确定如何提高提取质量。所有模型均表现出高精度,但普遍存在召回率低的问题,因遗漏关键信息。我们发现定制提示是最有效的,召回率可提升高达15%。基于此分析,我们提出了一套三层指南,根据任务复杂性和风险匹配数据类型与适当的自动化水平。本研究为现实世界中的元分析自动化数据提取提供了实用建议,通过有针对性的、任务特定的自动化平衡LLM效率与专家监督。

英文摘要

Automating data extraction from full-text randomised controlled trials (RCTs) for meta-analysis remains a significant challenge. This study evaluates the practical performance of three LLMs (Gemini-2.0-flash, Grok-3, GPT-4o-mini) across tasks involving statistical results, risk-of-bias assessments, and study-level characteristics in three medical domains: hypertension, diabetes, and orthopaedics. We tested four distinct prompting strategies (basic prompting, self-reflective prompting, model ensemble, and customised prompts) to determine how to improve extraction quality. All models demonstrate high precision but consistently suffer from poor recall by omitting key information. We found that customised prompts were the most effective, boosting recall by up to 15%. Based on this analysis, we propose a three-tiered set of guidelines for using LLMs in data extraction, matching data types to appropriate levels of automation based on task complexity and risk. Our study offers practical advice for automating data extraction in real-world meta-analyses, balancing LLM efficiency with expert oversight through targeted, task-specific automation.

2507.02606 2026-06-09 cs.SD cs.AI cs.CR cs.LG eess.AS 版本更新

De-AntiFake: Rethinking the Protective Perturbations Against Voice Cloning Attacks

De-AntiFake:重新思考对抗语音克隆攻击的保护扰动

Wei Fan, Kejiang Chen, Chang Liu, Weiming Zhang, Nenghai Yu

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文提出一种两阶段净化方法,旨在提升对抗语音克隆攻击的防御效果,通过净化扰动语音并利用音素指导进行优化,实验表明其优于现有方法。

Comments Accepted by ICML 2025

详情
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267, 2025
AI中文摘要

随着语音生成模型的快速发展,语音克隆(VC)带来的隐私和安全问题日益突出。近期研究尝试通过引入对抗扰动来阻止未经授权的语音克隆,但有决心的攻击者可以削弱这些保护性扰动并成功实施VC。本文首次在包含扰动净化的现实威胁模型下,系统评估了这些保护性扰动对抗VC的有效性。研究发现,现有净化方法虽能中和相当一部分保护性扰动,但仍会在VC模型的特征空间中造成失真,从而降低VC的性能。基于这一观察,我们提出一种新的两阶段净化方法:(1)净化受扰动的语音;(2)利用音素引导对其进行精修,使其与干净语音的分布对齐。实验结果表明,我们的方法在破坏VC防御方面优于最先进的净化方法。本研究揭示了基于对抗扰动的VC防御的局限性,并强调迫切需要更鲁棒的解决方案来缓解VC带来的安全和隐私风险。代码和音频样本可在https://de-antifake.github.io获取。

英文摘要

The rapid advancement of speech generation models has heightened privacy and security concerns related to voice cloning (VC). Recent studies have investigated disrupting unauthorized voice cloning by introducing adversarial perturbations. However, determined attackers can mitigate these protective perturbations and successfully execute VC. In this study, we conduct the first systematic evaluation of these protective perturbations against VC under realistic threat models that include perturbation purification. Our findings reveal that while existing purification methods can neutralize a considerable portion of the protective perturbations, they still lead to distortions in the feature space of VC models, which degrades the performance of VC. From this perspective, we propose a novel two-stage purification method: (1) Purify the perturbed speech; (2) Refine it using phoneme guidance to align it with the clean speech distribution. Experimental results demonstrate that our method outperforms state-of-the-art purification methods in disrupting VC defenses. Our study reveals the limitations of adversarial perturbation-based VC defenses and underscores the urgent need for more robust solutions to mitigate the security and privacy risks posed by VC. The code and audio samples are available at https://de-antifake.github.io.

2502.09252 2026-06-09 cs.LG 版本更新

On the Importance of Embedding Norms in Self-Supervised Learning

关于嵌入范数在自监督学习中的重要性

Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, Erik Bekkers

发表机构 * University of Amsterdam(阿姆斯特丹大学)

AI总结 本文研究了嵌入范数在自监督学习中的作用,通过理论分析和实验表明范数影响收敛速度和网络置信度,且较小的范数对应意外样本。

详情
Journal ref
International Conference on Machine Learning (ICML) 2025
AI中文摘要

自监督学习(SSL)允许在无监督信号的情况下训练数据表示,已成为机器学习的重要范式。大多数SSL方法使用嵌入向量的余弦相似度,从而有效将数据嵌入到超球面上。虽然这似乎表明嵌入范数在SSL中不起作用,但一些近期工作表明嵌入范数与网络收敛和置信度有关。本文解决这一明显矛盾,系统地确立嵌入范数在SSL训练中的作用。通过理论分析、模拟和实验,我们证明嵌入范数(i)控制SSL收敛速度(ii)编码网络置信度,较小的范数对应意外样本。此外,我们还表明操纵嵌入范数对收敛速度有显著影响。我们的发现表明,SSL嵌入范数对于理解和优化网络行为至关重要。

英文摘要

Self-supervised learning (SSL) allows training data representations without a supervised signal and has become an important paradigm in machine learning. Most SSL methods employ the cosine similarity between embedding vectors and hence effectively embed data on a hypersphere. While this seemingly implies that embedding norms cannot play any role in SSL, a few recent works have suggested that embedding norms have properties related to network convergence and confidence. In this paper, we resolve this apparent contradiction and systematically establish the embedding norm's role in SSL training. Using theoretical analysis, simulations, and experiments, we show that embedding norms (i) govern SSL convergence rates and (ii) encode network confidence, with smaller norms corresponding to unexpected samples. Additionally, we show that manipulating embedding norms can have large effects on convergence speed. Our findings demonstrate that SSL embedding norms are integral to understanding and optimizing network behavior.
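The apparent contradiction described above starts from the fact that cosine similarity is unchanged when either vector is rescaled. The sketch below demonstrates just that invariance; it does not reproduce the paper's analysis of how norms nevertheless affect training. The function names and the example vectors are illustrative.

```python
import math

# Illustrative sketch: cosine similarity ignores embedding norms.

def l2_norm(vector):
    """Euclidean length of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in vector))


def cosine_similarity(u, v):
    """Dot product of u and v divided by the product of their norms."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (l2_norm(u) * l2_norm(v))


def scaled(vector, factor):
    """Return the vector multiplied by a positive scalar."""
    return [factor * x for x in vector]


u = [1.0, 2.0, 3.0]
v = [2.0, 0.0, 1.0]
similarity_original = cosine_similarity(u, v)
similarity_rescaled = cosine_similarity(scaled(u, 3.0), scaled(v, 0.5))
```

The two similarities agree up to floating-point error even though the norms differ by large factors, which is why a cosine-based loss value by itself carries no information about embedding norms.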

2503.17400 2026-06-09 physics.flu-dyn cs.LG 版本更新

TripNet: Learning Large-scale High-fidelity 3D Car Aerodynamics with Triplane Networks

TripNet:利用三平面网络学习大规模高保真3D汽车空气动力学

Qian Chen, Mohamed Elrefaie, Angela Dai, Faez Ahmed

发表机构 * Department of Mechanical Engineering(机械工程系) Massachusetts Institute of Technology(麻省理工学院) Department of Computer Science(计算机科学系) Technical University of Munich(慕尼黑工业大学)

AI总结 TripNet通过三平面网络实现高分辨率3D汽车空气动力学模拟,无需依赖网格结构,提供高效准确的CFD预测。

详情
AI中文摘要

代理建模已成为加速计算流体力学(CFD)模拟的强大工具。现有基于点云、体素、网格或图的3D几何学习模型依赖显式几何表示,内存消耗大且分辨率受限。对于具有数百万节点和单元的大型模拟,现有模型因依赖网格分辨率而需进行剧烈下采样,导致精度下降。我们提出了TripNet,一种基于三平面的神经框架,通过隐式编码3D几何到紧凑的连续特征图中。与依赖网格的方法不同,TripNet可扩展到高分辨率模拟,而无需增加内存成本,并以查询方式在任意空间位置进行CFD预测,不依赖网格连接或预定义节点。TripNet在DrivAerNet和DrivAerNet++数据集上实现了最先进的性能,准确预测了阻力系数、表面压力和完整的3D流动场。通过统一的三平面骨干支持多种模拟任务,TripNet为传统CFD求解器和现有代理模型提供了可扩展、准确和高效的替代方案。

英文摘要

Surrogate modeling has emerged as a powerful tool to accelerate Computational Fluid Dynamics (CFD) simulations. Existing 3D geometric learning models based on point clouds, voxels, meshes, or graphs depend on explicit geometric representations that are memory-intensive and resolution-limited. For large-scale simulations with millions of nodes and cells, existing models require aggressive downsampling due to their dependence on mesh resolution, resulting in degraded accuracy. We present TripNet, a triplane-based neural framework that implicitly encodes 3D geometry into a compact, continuous feature map with fixed dimension. Unlike mesh-dependent approaches, TripNet scales to high-resolution simulations without increasing memory cost, and enables CFD predictions at arbitrary spatial locations in a query-based fashion, independent of mesh connectivity or predefined nodes. TripNet achieves state-of-the-art performance on the DrivAerNet and DrivAerNet++ datasets, accurately predicting drag coefficients, surface pressure, and full 3D flow fields. With a unified triplane backbone supporting multiple simulation tasks, TripNet offers a scalable, accurate, and efficient alternative to traditional CFD solvers and existing surrogate models.
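To illustrate what a query-based triplane lookup can look like, the sketch below samples three axis-aligned feature planes at the projections of a 3D point using bilinear interpolation and sums the results. This is a generic triplane sketch under stated assumptions, not TripNet's implementation: single-channel planes, coordinates normalized to [0, 1], sum aggregation, and all function names are assumptions, and the decoder that would map features to flow quantities is omitted.

```python
# Illustrative sketch: querying a triplane representation at a 3D point.
# Single-channel planes and sum aggregation are assumptions for this example.

def bilinear(grid, u, v):
    """Sample a 2D grid (rows indexed by v, columns by u) at u, v in [0, 1]."""
    height, width = len(grid), len(grid[0])  # both must be at least 2
    x = u * (width - 1)
    y = v * (height - 1)
    x0 = min(int(x), width - 2)   # keep x0 + 1 inside the grid
    y0 = min(int(y), height - 2)
    tx, ty = x - x0, y - y0
    top = grid[y0][x0] * (1 - tx) + grid[y0][x0 + 1] * tx
    bottom = grid[y0 + 1][x0] * (1 - tx) + grid[y0 + 1][x0 + 1] * tx
    return top * (1 - ty) + bottom * ty


def query_triplane(planes, point):
    """Project the point onto the xy, xz, and yz planes and sum the features."""
    x, y, z = point
    return (
        bilinear(planes["xy"], x, y)
        + bilinear(planes["xz"], x, z)
        + bilinear(planes["yz"], y, z)
    )


toy_planes = {
    "xy": [[0.0, 1.0], [2.0, 3.0]],
    "xz": [[0.0, 0.0], [0.0, 0.0]],
    "yz": [[0.0, 0.0], [0.0, 0.0]],
}
toy_feature = query_triplane(toy_planes, (0.5, 0.5, 0.5))
```

The storage cost depends only on the plane resolution, not on the number of query points, which is the property that lets such a representation be evaluated at arbitrary locations without reference to a mesh.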

2311.07065 2026-06-09 cs.LG cs.AI math-ph math.MP math.OC stat.ML 版本更新

On non-approximability of zero loss global ${\mathcal L}^2$ minimizers by gradient descent in Deep Learning

论深度学习中零损失全局 ${\mathcal L}^2$ 最小化器无法由梯度下降逼近

Thomas Chen, Patricia Muñoz Ewald

发表机构 * Department of Mathematics, University of Texas at Austin(德克萨斯大学奥斯汀分校数学系)

AI总结 本文分析了深度学习中梯度下降算法的几何性质,指出在欠参数化网络中零损失最小化在一般(generic)情形下无法实现,因此训练输入分布必须是非一般(non-generic)的才能产生零损失最小化器。

Comments AMS Latex, 7 pages. Typos corrected, Corollary 1.6 upgraded to Theorem, acknowledgment added

详情
Journal ref
Theor. Appl. Mech., 52 (1), 67-73 (2025)
AI中文摘要

我们分析了深度学习(DL)中梯度下降算法的几何性质,并详细讨论了如下情形:在欠参数化的DL网络中,零损失最小化在一般(generic)情形下无法实现。由此我们得出结论:要产生零损失最小化器,训练输入的分布必须是非一般(non-generic)的,无论是对于[Chen-Munoz Ewald 2023, 2024]中构造的方法,还是对于梯度下降[Chen 2025](假设训练数据聚类)。

英文摘要

We analyze geometric aspects of the gradient descent algorithm in Deep Learning (DL), and give a detailed discussion of the circumstance that in underparametrized DL networks, zero loss minimization can generically not be attained. As a consequence, we conclude that the distribution of training inputs must necessarily be non-generic in order to produce zero loss minimizers, both for the method constructed in [Chen-Munoz Ewald 2023, 2024], or for gradient descent [Chen 2025] (which assume clustering of training data).

2405.17151 2026-06-09 cs.LG 版本更新

Smoke and Mirrors in Causal Downstream Tasks

因果下游任务中的障眼法

Riccardo Cadei, Lukas Lindorfer, Sylvia Cremer, Cordelia Schmid, Francesco Locatello

发表机构 * Institute of Science and Technology Austria (ISTA)(奥地利科学技术研究所) Inria(法国国家信息与自动化研究所) Ecole Normale Supérieure(巴黎高等师范学院) CNRS(法国国家科学研究中心) PSL Research University(巴黎文理研究大学)

AI总结 本文研究随机对照试验中基于高维观测的处理效应估计,从理论上指出文献中的常见做法可能导致估计偏差,并发布真实世界基准ISTAnt;对6480个微调模型的比较表明,采样与建模选择显著影响因果估计的准确性,而分类准确率并不能代表因果估计的准确性。

详情
AI中文摘要

机器学习和人工智能有潜力改变数据驱动的科学发现,能够为多种科学现象提供准确的预测。由于许多科学问题本质上是因果问题,本文研究处理效应估计这一因果推断任务,其中所关注的结局记录在随机对照试验(RCT)的高维观测中。尽管这是最简单的因果设定,且非常适合深度学习,我们从理论上发现,文献中的许多常见选择可能导致有偏估计。为了检验这些考虑的实际影响,我们采集了ISTAnt,这是首个面向高维观测因果推断下游任务的真实世界基准,它是一项RCT,研究花园蚁(Lasius neglectus)如何通过卫生梳理行为对施加在群体成员身上的微粒作出反应。通过比较6480个由最先进视觉骨干网络微调得到的模型,我们发现采样和建模选择显著影响因果估计的准确性,而分类准确率并不能作为其代理指标。我们进一步在一个可控制因果模型的合成视觉数据集上重复该分析,以验证上述结论。我们的结果表明,未来的基准应仔细考虑实际的下游科学问题,尤其是因果问题。此外,我们还给出了表示学习方法的指导原则,以帮助回答科学中的因果问题。

英文摘要

Machine Learning and AI have the potential to transform data-driven scientific discovery, enabling accurate predictions for several scientific phenomena. As many scientific questions are inherently causal, this paper looks at the causal inference task of treatment effect estimation, where the outcome of interest is recorded in high-dimensional observations in a Randomized Controlled Trial (RCT). Despite being the simplest possible causal setting and a perfect fit for deep learning, we theoretically find that many common choices in the literature may lead to biased estimates. To test the practical impact of these considerations, we recorded ISTAnt, the first real-world benchmark for causal inference downstream tasks on high-dimensional observations as an RCT studying how garden ants (Lasius neglectus) respond to microparticles applied onto their colony members by hygienic grooming. Comparing 6 480 models fine-tuned from state-of-the-art visual backbones, we find that the sampling and modeling choices significantly affect the accuracy of the causal estimate, and that classification accuracy is not a proxy thereof. We further validated the analysis, repeating it on a synthetically generated visual data set controlling the causal model. Our results suggest that future benchmarks should carefully consider real downstream scientific questions, especially causal ones. Further, we highlight guidelines for representation learning methods to help answer causal questions in the sciences.

2501.12421 2026-06-09 cs.LG cs.AI q-bio.QM 版本更新

Tackling Small Sample Survival Analysis via Transfer Learning: A Study of Colorectal Cancer Prognosis

通过迁移学习解决小样本生存分析:结直肠癌预后的研究

Yonghao Zhao, Changtao Li, Chi Shu, Qingbin Wu, Hong Li, Chuan Xu, Tianrui Li, Ziqiang Wang, Zhipeng Luo, Yazhou He

发表机构 * University of Science and Technology of China(中国科学技术大学)

AI总结 本文通过迁移学习提升小样本生存分析,针对结直肠癌预后,改进了多种生存模型,如DeepSurv、Cox-CC、DeepHit和Random Survival Forest,实验结果显示迁移学习显著提升了模型性能。

详情
Journal ref
Artificial Intelligence in Medicine, 178:103426, 2026
AI中文摘要

生存预后对医疗信息学至关重要。从业者常面临小规模临床数据,尤其是癌症患者病例,这些数据可能不足以归纳出对生存预测有用的模式。本研究借助迁移学习处理小样本生存分析问题;迁移学习是一种机器学习技术,可利用从其他数据预先学到的相关知识增强目标分析。我们提出并开发了多种针对常见生存模型的迁移学习方法。对于DeepSurv、Cox-CC(基于Cox的神经网络)和DeepHit(端到端深度学习模型)等参数模型,我们应用预训练和微调等标准迁移学习技术。对于Random Survival Forest等非参数模型,我们提出新的迁移生存森林(TSF)模型,从源任务迁移树结构并用目标数据进行微调。我们在结直肠癌(CRC)预后上评估了这些迁移学习方法。源数据为27,379名SEER CRC I期患者,目标数据为728名来自华西医院的CRC I期患者。经迁移学习增强后,Cox-CC的$C^{td}$值从0.7868提升至0.8111,DeepHit从0.8085提升至0.8135,DeepSurv从0.7722提升至0.8043,RSF从0.7940提升至0.8297(最高性能)。在训练数据少至50例时,所有模型的提升更为显著。结论:用于癌症预后的现有生存模型可以通过设计得当的迁移学习技术得到增强和改进。本研究使用的源代码可在https://github.com/YonghaoZhao722/TSF获取。

英文摘要

Survival prognosis is crucial for medical informatics. Practitioners often confront small-sized clinical data, especially cancer patient cases, which can be insufficient to induce useful patterns for survival predictions. This study deals with small sample survival analysis by leveraging transfer learning, a useful machine learning technique that can enhance the target analysis with related knowledge pre-learned from other data. We propose and develop various transfer learning methods designed for common survival models. For parametric models such as DeepSurv, Cox-CC (Cox-based neural networks), and DeepHit (end-to-end deep learning model), we apply standard transfer learning techniques like pretraining and fine-tuning. For non-parametric models such as Random Survival Forest, we propose a new transfer survival forest (TSF) model that transfers tree structures from source tasks and fine-tunes them with target data. We evaluated the transfer learning methods on colorectal cancer (CRC) prognosis. The source data are 27,379 SEER CRC stage I patients, and the target data are 728 CRC stage I patients from the West China Hospital. When enhanced by transfer learning, Cox-CC's $C^{td}$ value was boosted from 0.7868 to 0.8111, DeepHit's from 0.8085 to 0.8135, DeepSurv's from 0.7722 to 0.8043, and RSF's from 0.7940 to 0.8297 (the highest performance). All models trained with data as small as 50 demonstrated even more significant improvement. Conclusions: Therefore, the current survival models used for cancer prognosis can be enhanced and improved by properly designed transfer learning techniques. The source code used in this study is available at https://github.com/YonghaoZhao722/TSF.

2411.18385 2026-06-09 cs.LG cs.CV stat.ML 版本更新

Federated Learning with Uncertainty and Personalization via Efficient Second-order Optimization

基于高效二阶优化的联邦学习中的不确定性与个性化

Shivam Pal, Aishwarya Gupta, Saqib Sarwar, Piyush Rai

发表机构 * Department of Computer Science and Engineering, IIT Kanpur, India(印度理工学院坎普尔分校计算机科学与工程系)

AI总结 本文提出一种高效的联邦学习方法,利用二阶优化减少计算和通信成本,同时保留贝叶斯方法的不确定性与个性化优势。

详情
Journal ref
Transactions on Machine Learning Research (TMLR), 2025
AI中文摘要

联邦学习(FL)已发展为一种有前景的方法,用于在不同客户端上协作学习分布式和异质数据,而无需数据离开客户端。最近的FL研究倡导采用贝叶斯方法,因为它提供了一种系统的方法来考虑模型和预测不确定性,通过学习客户端和/或服务器模型的后验分布。此外,贝叶斯FL自然能够实现个性化,以处理不同客户端上的数据异质性,通过让每个客户端学习其独特的个性化模型。特别是,层次贝叶斯方法使所有客户端都能学习其个性化模型,同时通过服务器提供的先验分布考虑共同点。然而,尽管有这些优势,贝叶斯方法在FL中可能计算成本高且通信成本高,因为需要计算和发送后验分布。我们提出了一种新的贝叶斯FL方法,采用高效的二阶优化方法,其计算成本与Adam等一阶优化方法相似,同时提供贝叶斯方法的多种优势(例如不确定性、个性化),并且在标准和个性化FL设置中都比最先进的贝叶斯FL方法更高效和准确。我们的方法在预测准确性和不确定性估计方面优于基线方法,包括基于优化和贝叶斯FL的方法。

英文摘要

Federated Learning (FL) has emerged as a promising method to collaboratively learn from decentralized and heterogeneous data available at different clients without the requirement of data ever leaving the clients. Recent works on FL have advocated taking a Bayesian approach to FL as it offers a principled way to account for the model and predictive uncertainty by learning a posterior distribution for the client and/or server models. Moreover, Bayesian FL also naturally enables personalization in FL to handle data heterogeneity across the different clients by having each client learn its own distinct personalized model. In particular, the hierarchical Bayesian approach enables all the clients to learn their personalized models while also taking into account the commonalities via a prior distribution provided by the server. However, despite their promise, Bayesian approaches for FL can be computationally expensive and can have high communication costs as well because of the requirement of computing and sending the posterior distributions. We present a novel Bayesian FL method using an efficient second-order optimization approach, with a computational cost that is similar to first-order optimization methods like Adam, but also provides the various benefits of the Bayesian approach for FL (e.g., uncertainty, personalization), while also being significantly more efficient and accurate than SOTA Bayesian FL methods (both for standard as well as personalized FL settings). Our method achieves improved predictive accuracies as well as better uncertainty estimates as compared to the baselines which include both optimization based as well as Bayesian FL methods.

2311.03087 2026-06-09 cs.LG math.AT 版本更新

Persistent Homology for High-dimensional Data Based on Spectral Methods

基于谱方法的高维数据持续同调

Sebastian Damrich, Philipp Berens, Dmitry Kobak

发表机构 * Hertie Institute for AI in Brain Health, University of Tübingen, Germany(德国图宾根大学赫蒂脑健康人工智能研究所) Tübingen AI Center, Germany(德国图宾根人工智能中心) IWR, Heidelberg University, Germany(德国海德堡大学跨学科科学计算中心IWR)

AI总结 本文提出利用谱方法中的扩散距离和有效电阻检测高维噪声下的拓扑结构,推导出有效电阻的闭式公式,并应用于单细胞RNA测序数据以识别细胞周期环路。

Comments NeurIPS 2024, 54 pages, 44 figures

详情
Journal ref
Conference on Neural Information Processing Systems (NeurIPS) 2024
AI中文摘要

持续同调是一种分析点云拓扑结构(例如是否存在环或空洞)的常用计算工具。然而,许多内在维度较低的真实世界数据集位于维度高得多的环境空间中。我们表明,在这种情况下,传统持续同调对噪声非常敏感,无法检测出正确的拓扑结构;持续同调的现有改进方法同样如此。作为补救,我们发现数据的k近邻图上的谱距离(如扩散距离和有效电阻)即使在存在高维噪声的情况下也能检测出正确的拓扑结构。此外,我们推导出有效电阻的一个新的闭式公式,并描述其与扩散距离的关系。最后,我们将这些方法应用于高维单细胞RNA测序数据,表明谱距离能够稳健地检测细胞周期环路。

英文摘要

Persistent homology is a popular computational tool for analyzing the topology of point clouds, such as the presence of loops or voids. However, many real-world datasets with low intrinsic dimensionality reside in an ambient space of much higher dimensionality. We show that in this case traditional persistent homology becomes very sensitive to noise and fails to detect the correct topology. The same holds true for existing refinements of persistent homology. As a remedy, we find that spectral distances on the k-nearest-neighbor graph of the data, such as diffusion distance and effective resistance, allow to detect the correct topology even in the presence of high-dimensional noise. Moreover, we derive a novel closed-form formula for effective resistance, and describe its relation to diffusion distances. Finally, we apply these methods to high-dimensional single-cell RNA-sequencing data and show that spectral distances allow robust detection of cell cycle loops.
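As a rough illustration of the spectral distances mentioned in the abstract, the sketch below builds a symmetric k-nearest-neighbor graph and computes effective resistance from the pseudoinverse of the graph Laplacian, using the textbook identity $R_{ij} = L^+_{ii} + L^+_{jj} - 2L^+_{ij}$. This is the standard definition, not the paper's new closed-form formula, and the helper names `knn_adjacency` and `effective_resistance` are assumptions introduced here. The values are only meaningful when the graph is connected.

```python
import numpy as np

def knn_adjacency(X, k):
    """Unweighted, symmetrized k-nearest-neighbor adjacency matrix of the rows of X."""
    n = len(X)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)            # a point is not its own neighbor
    nearest = np.argsort(dist, axis=1)[:, :k]
    A = np.zeros((n, n))
    A[np.repeat(np.arange(n), k), nearest.ravel()] = 1.0
    return np.maximum(A, A.T)                 # keep an edge if either endpoint chose it

def effective_resistance(A):
    """All-pairs effective resistance from the Laplacian pseudoinverse."""
    L = np.diag(A.sum(axis=1)) - A            # combinatorial graph Laplacian
    L_pinv = np.linalg.pinv(L)
    diag = np.diag(L_pinv)
    return diag[:, None] + diag[None, :] - 2.0 * L_pinv

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Noisy circle embedded in 50 ambient dimensions (a toy stand-in for the setting above).
    angles = rng.uniform(0.0, 2.0 * np.pi, size=200)
    X = np.zeros((200, 50))
    X[:, 0], X[:, 1] = np.cos(angles), np.sin(angles)
    X += 0.05 * rng.normal(size=X.shape)
    R = effective_resistance(knn_adjacency(X, k=15))
    print(R.shape)
```

The resulting matrix `R` can be passed as a precomputed distance matrix to a persistent homology package in place of Euclidean distances. The dense pseudoinverse costs cubic time in the number of points, so this form is suitable only for small datasets.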

2407.13288 2026-06-09 cs.LG 版本更新

Hierarchical Stage-Wise Training of Linked Deep Neural Networks for Multi-Building and Multi-Floor Indoor Localization Based on Wi-Fi RSSI Fingerprinting

面向基于Wi-Fi RSSI指纹的多建筑多楼层室内定位的链接深度神经网络分层阶段式训练

Sihao Li, Kyeong Soo Kim, Zhe Tang, Jeremy S. Smith

发表机构 * School of Advanced Technology, Xi’an Jiaotong-Liverpool University(西交利物浦大学智能工程学院) Department of Electrical Engineering and Electronics, University of Liverpool(利物浦大学电气工程与电子学系) Postgraduate Research Scholarships, Key Program Special Fund, Research Enhancement Fund of Xi’an Jiaotong-Liverpool University(西交利物浦大学研究生科研奖学金、重点项目专项基金、科研提升基金)

AI总结 本文提出一种基于链接神经网络的多建筑多楼层室内定位方法,通过分层阶段式训练框架提升定位精度,实验表明该方法在UJIIndoorLoc数据库上达到8.19米的三维定位误差,优于现有神经网络模型。

Comments 9 pages, 5 figures, under review for journal publication

详情
Journal ref
IEEE Sensors Journal, volume 25, issue 13, pages 23341--23351, July 1, 2025
AI中文摘要

本文提出了一种基于链接神经网络的多建筑多楼层室内定位新方案,每个神经网络专门解决子问题,并在分层阶段式训练框架下训练。当传感器数据具有层次结构时,利用这种层次结构进行数据处理以提供可扩展的解决方案。该框架通过利用更高层次网络训练获得的先验知识来训练更低层次网络。实验结果表明,基于所提出分层阶段式训练框架训练的链接神经网络在UJIIndoorLoc数据库上实现了8.19米的三维定位误差,这是目前使用完整数据集训练和评估的神经网络模型中最准确的结果。当应用于基于层次卷积神经网络的模型时,该训练框架还能显著将三维定位误差从11.78米降低到8.71米。

英文摘要

In this paper, we present a new solution to the problem of large-scale multi-building and multi-floor indoor localization based on linked neural networks, where each neural network is dedicated to a sub-problem and trained under a hierarchical stage-wise training framework. When the measured data from sensors have a hierarchical representation as in multi-building and multi-floor indoor localization, it is important to exploit the hierarchical nature in data processing to provide a scalable solution. In this regard, the hierarchical stage-wise training framework extends the original stage-wise training framework to the case of multiple linked networks by training a lower-hierarchy network based on the prior knowledge gained from the training of higher-hierarchy networks. The experimental results with the publicly-available UJIIndoorLoc multi-building and multi-floor Wi-Fi RSSI fingerprint database demonstrate that the linked neural networks trained under the proposed hierarchical stage-wise training framework can achieve a three-dimensional localization error of 8.19 m, which, to the best of the authors' knowledge, is the most accurate result ever obtained for neural network-based models trained and evaluated with the full datasets of the UJIIndoorLoc database, and that, when applied to a model based on hierarchical convolutional neural networks, the proposed training framework can also significantly reduce the three-dimensional localization error from 11.78 m to 8.71 m.

2311.12167 2026-06-09 cs.LG cs.SI 版本更新

Node Classification in Random Trees

随机树中的节点分类

Wouter W. L. Nuijten, Vlado Menkovski

发表机构 * Eindhoven University of Technology(埃因霍温理工大学)

AI总结 本文提出一种方法,用于对结构为随机树的对象进行分类,通过马尔可夫网络和图神经网络建模节点标签分布,优于现有方法。

详情
Journal ref
Lecture Notes in Computer Science, 2024, pp. 105-116
AI中文摘要

我们提出了一种方法,用于对结构为随机树的对象进行分类。我们的目标是在树数据结构与节点属性(通常为高维嵌入)相关联的情况下,建模节点标签分配的分布。树拓扑不是预设的,在推断过程中没有节点标签存在。其他方法要么假设标签分配的条件独立性,要么在固定图拓扑上操作,或需要部分节点标签被观察。我们的方法定义了具有随机树相应拓扑的马尔可夫网络及其关联的吉布斯分布。我们用图神经网络参数化吉布斯分布,该网络在随机树和节点嵌入上操作。这使得我们能够估计给定随机树的节点分配的似然,并使用MCMC从节点分配分布中采样。我们评估了该方法在斯坦福情感树库数据集上的节点分类任务,结果优于基线方法,证明了其在随机树中联合分布建模的有效性。

英文摘要

We propose a method for the classification of objects that are structured as random trees. Our aim is to model a distribution over the node label assignments in settings where the tree data structure is associated with node attributes (typically high dimensional embeddings). The tree topology is not predetermined and none of the label assignments are present during inference. Other methods that produce a distribution over node label assignment in trees (or more generally in graphs) either assume conditional independence of the label assignment, operate on a fixed graph topology, or require part of the node labels to be observed. Our method defines a Markov Network with the corresponding topology of the random tree and an associated Gibbs distribution. We parameterize the Gibbs distribution with a Graph Neural Network that operates on the random tree and the node embeddings. This allows us to estimate the likelihood of node assignments for a given random tree and use MCMC to sample from the distribution of node assignments. We evaluate our method on the tasks of node classification in trees on the Stanford Sentiment Treebank dataset. Our method outperforms the baselines on this dataset, demonstrating its effectiveness for modeling joint distributions of node labels in random trees.

2310.20699 2026-06-09 physics.chem-ph cs.LG physics.comp-ph physics.data-an stat.AP 版本更新

Bayesian Multistate Bennett Acceptance Ratio Methods

贝叶斯多状态贝纳特接受比率方法

Xinqiang Ding

发表机构 * Department of Chemistry, Tufts University(塔夫茨大学化学系)

AI总结 本文提出贝叶斯多状态贝纳特接受比率方法,通过整合热力学状态的采样配置与先验分布,计算自由能的后验分布,并改进自由能估计的不确定性评估。

详情
Journal ref
Journal of Chemical Theory and Computation 2024 20 (5), 1878-1888
AI中文摘要

多状态贝纳特接受比率(MBAR)方法是一种计算热力学状态自由能的常用方法。本文介绍了贝叶斯MBAR,即MBAR的贝叶斯推广。通过整合从热力学状态采样的配置与先验分布,贝叶斯MBAR计算自由能的后验分布。利用后验分布,我们推导出自由能估计并计算其相关不确定性。值得注意的是,当使用均匀先验分布时,贝叶斯MBAR恢复了MBAR的结果,但提供了更准确的不确定性估计。此外,当有关于自由能的先验知识时,贝叶斯MBAR可以通过使用非均匀先验分布将此信息纳入估计过程。作为示例,我们展示通过结合关于自由能表面光滑性的先验知识,贝叶斯MBAR比MBAR方法提供更准确的估计。鉴于MBAR在自由能计算中的广泛应用,我们预计贝叶斯MBAR将成为自由能计算各种应用中的重要工具。

英文摘要

The multistate Bennett acceptance ratio (MBAR) method is a prevalent approach for computing free energies of thermodynamic states. In this work, we introduce BayesMBAR, a Bayesian generalization of the MBAR method. By integrating configurations sampled from thermodynamic states with a prior distribution, BayesMBAR computes a posterior distribution of free energies. Using the posterior distribution, we derive free energy estimations and compute their associated uncertainties. Notably, when a uniform prior distribution is used, BayesMBAR recovers the MBAR's result but provides more accurate uncertainty estimates. Additionally, when prior knowledge about free energies is available, BayesMBAR can incorporate this information into the estimation procedure by using non-uniform prior distributions. As an example, we show that, by incorporating the prior knowledge about the smoothness of free energy surfaces, BayesMBAR provides more accurate estimates than the MBAR method. Given MBAR's widespread use in free energy calculations, we anticipate BayesMBAR to be an essential tool in various applications of free energy calculations.
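For readers unfamiliar with the estimator being generalized, the sketch below iterates the standard MBAR self-consistent equations, $f_i = -\ln \sum_n \frac{\exp(-u_i(x_n))}{\sum_k N_k \exp(f_k - u_k(x_n))}$, in log space. It covers only classical MBAR, which the abstract says BayesMBAR recovers under a uniform prior; it does not compute the posterior distribution or the uncertainty estimates. The function names, the plain fixed-point iteration, and the convention of pinning the first free energy to zero are choices made here for illustration.

```python
import numpy as np

def logsumexp(a, axis):
    """Numerically stable log(sum(exp(a))) along one axis."""
    m = a.max(axis=axis, keepdims=True)
    return np.log(np.exp(a - m).sum(axis=axis)) + np.squeeze(m, axis=axis)

def mbar(u_kn, N_k, n_iter=10000, tol=1e-12):
    """Solve the MBAR equations by fixed-point iteration.

    u_kn[k, n]: reduced potential of pooled sample n evaluated in state k.
    N_k[k]:     number of samples drawn from state k (assumed positive).
    Returns reduced free energies f with f[0] fixed to 0.
    """
    f = np.zeros(u_kn.shape[0])
    log_N = np.log(np.asarray(N_k, dtype=float))
    for _ in range(n_iter):
        # log of the mixture denominator for every sample n
        log_den = logsumexp(log_N[:, None] + f[:, None] - u_kn, axis=0)
        f_new = -logsumexp(-u_kn - log_den[None, :], axis=1)
        f_new -= f_new[0]                     # free energies are defined up to a constant
        converged = np.max(np.abs(f_new - f)) < tol
        f = f_new
        if converged:
            break
    return f

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    # Two harmonic states u_k(x) = x^2 / (2 * sigma_k^2), sampled exactly.
    sigmas = np.array([1.0, 2.0])
    x = np.concatenate([rng.normal(0.0, s, size=5000) for s in sigmas])
    u_kn = x[None, :] ** 2 / (2.0 * sigmas[:, None] ** 2)
    f = mbar(u_kn, N_k=[5000, 5000])
    # The analytic difference is -ln(sigma_1 / sigma_0); the estimate should be close to it.
    print(f[1], -np.log(sigmas[1] / sigmas[0]))
```

Production codes typically solve the same equations with Newton-type optimizers rather than plain iteration; the fixed-point form is used here only because it maps directly onto the formula.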

1909.02747 2026-06-09 eess.IV cs.CV cs.LG stat.ML 版本更新

Eelgrass beds and oyster farming at a lagoon before and after the Great East Japan Earthquake 2011: potential to apply deep learning at a coastal area

2011年东日本大地震前后某潟湖的大叶藻床与牡蛎养殖:在沿海地区应用深度学习的潜力

Takehisa Yamakita

发表机构 * Marine Biodiversity and Environmental Assessment Research Center (BioEnv)(海洋生物多样性与环境评估研究中心)

AI总结 本文通过比较手动勾绘、简单图像分割和基于深度学习的图像变换,研究了日本宫城县万石浦潟湖海草床、沙地和牡蛎养殖筏的自动土地覆盖分类,展示了深度学习在地震和海啸后沿海地区空间格局提取中的潜力。

详情
AI中文摘要

针对沿海地区的自动土地覆盖分类的案例研究数量很少。本文通过比较手动勾绘、简单图像分割和基于深度学习的图像变换,测试了对日本宫城县万石浦潟湖的海草床、沙地和牡蛎养殖筏的提取,并利用结果提取地震和海啸前后的变化。图像变换方法的输出分辨率最佳,在独立测试数据上用随机点评估时,其植被分类准确率超过69%。牡蛎养殖筏的分布由分割模型检测得到。基于手动勾绘和图像变换结果对地震前后变化的评估显示,沙地面积增加,植被减少;而仅分割模型检测到了牡蛎养殖的减少。这些结果表明,在地震和海啸之后提取这些要素的空间格局具有潜力。关键词:2011年东日本大地震、土地利用与土地覆盖(LULC)、大叶藻科海草、养殖牡蛎、深度学习、万石浦

英文摘要

There is a small number of case studies of automatic land cover classification on the coastal area. Here, I test extraction of seagrass beds, sandy area, oyster farming rafts at Mangoku-ura Lagoon, Miyagi, Japan by comparing manual tracing, simple image segmentation, and image transformation using deep learning. The result was used to extract the changes before and after the earthquake and tsunami. The output resolution was best in the image transformation method, which showed more than 69% accuracy for vegetation classification by an assessment using random points on independent test data. The distribution of oyster farming rafts was detected by the segmentation model. Assessment of the change before and after the earthquake by the manual tracing and image transformation result revealed increase of sand area and decrease of the vegetation. By the segmentation model only the decrease of the oyster farming was detected. These results demonstrate the potential to extract the spatial pattern of these elements after an earthquake and tsunami. Index Terms: Great East Japan Earthquake of 2011, Land use land cover (LULC), Zosteracea seagrass, cultured oyster, deep learning, Mangoku Bay