arXivDaily: arXiv daily academic digest, updated Monday through Friday
2605.30350 2026-05-29 cs.RO cs.LG Version update

DynaFLIP: Rethinking Robotics Perception via Tri-Modal-Dynamics Guided Representation

DynaFLIP: 通过三模态动力学引导表示重新思考机器人感知

Jusuk Lee, Seungjae Lee, Jonghun Shin, Hoseong Jung, Sungha Kim, Daesol Cho, H. Jin Kim, Jia-Bin Huang, Furong Huang

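Affiliations: Seoul National University; University of Maryland, College Park; Georgia Institute of Technology

AI summary: Proposes DynaFLIP, a dynamics-aware multimodal pre-training framework that trains an image encoder on image-language-3D flow triplets, aligning the three modalities through simplex-volume minimization combined with a cosine regularizer and a contrastive objective, to improve motion understanding and generalization in robot manipulation.

Comments Project website: https://dynaflip-robotics.github.io

Details
AI-generated Chinese abstract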

机器人操作关键依赖于保留场景中与动作相关方面的感知。然而,大多数机器人学习流程基于为静态识别或视觉-语言对齐预训练的视觉编码器,将运动理解留给下游策略。我们引入了DynaFLIP,一种动力学感知的多模态预训练框架,将运动理解上推到感知中。我们从异构的人类和机器人视频中构建图像-语言-3D流三元组,并使用这些三元组作为训练时监督来塑造仅图像的编码器。我们的关键思想是鼓励三种模态在共享的超球面空间中跨越一个小的单纯形体积——较小的单纯形体积表示更强的对齐。为了避免朴素体积最小化的几何模糊性和平凡坍缩,我们将单纯形体积最小化与余弦正则化和对比目标相结合。我们的分析表明,DynaFLIP关注对操作至关重要的控制相关区域。得到的动力学感知表示作为可重用的视觉骨干,在包括VLA在内的各种下游策略中持续优于基线。我们在多种模拟和真实世界设置中验证了这一点,在分布外场景下增益达到+22.5%。我们的结果表明,当视觉表示被训练为不仅编码存在什么,而且编码世界在动作下如何变化时,机器人泛化能力会提高。

English abstract

Robot manipulation critically depends on perception that preserves the action-relevant aspects of a scene. Yet most robot learning pipelines are built upon visual encoders pre-trained for static recognition or vision-language alignment, leaving motion understanding to downstream policies. We introduce DynaFLIP, a dynamics-aware multimodal pre-training framework that pushes motion understanding upstream into perception. We construct image-language-3D flow triplets from heterogeneous human and robot videos, and use these triplets as training-time supervision to shape an image-only encoder. Our key idea is to encourage the three modalities to span a small simplex volume in the shared hyperspherical space -- a smaller simplex volume indicating stronger alignment. To avoid the geometric ambiguity and trivial collapse of naive volume minimization, we combine simplex-volume minimization with a cosine regularizer and a contrastive objective. Our analyses show that DynaFLIP focuses on control-relevant regions critical for manipulation. The resulting dynamics-aware representations serve as reusable visual backbones and consistently outperform baselines across diverse downstream policies, including VLAs. We validate this across diverse simulation and real-world setups, with gains reaching +22.5% under out-of-distribution scenarios. Our results suggest that robot generalization improves when visual representations are trained to encode not just what is present, but how the world changes under action.
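
The simplex-volume idea in the abstract can be illustrated with a small sketch. It assumes the volume is measured as the square root of the Gram determinant of the three unit-normalized embeddings; that definition, the function name `gram_volume`, and the inputs are illustrative assumptions, not the authors' loss.

```python
import numpy as np

def gram_volume(image_emb, text_emb, flow_emb):
    """Volume of the parallelotope spanned by three L2-normalized embeddings.

    Assumption: volume = sqrt(det(Gram matrix)); the paper may define it differently.
    """
    vecs = np.stack([image_emb, text_emb, flow_emb]).astype(float)
    vecs /= np.linalg.norm(vecs, axis=1, keepdims=True)  # project onto the unit hypersphere
    gram = vecs @ vecs.T                                  # 3x3 matrix of cosine similarities
    det = max(np.linalg.det(gram), 0.0)                   # clamp tiny negative round-off
    return float(np.sqrt(det))

# Mutually orthogonal embeddings span the largest volume; identical ones span none.
e1, e2, e3 = np.eye(3)
v = np.array([1.0, 2.0, 3.0])
print(gram_volume(e1, e2, e3), gram_volume(v, v, v), gram_volume(v, -v, v))
```

Under this definition, antipodal embeddings (`v` and `-v`) also give zero volume even though they are not aligned. That may be one kind of ambiguity a cosine term helps rule out, though the abstract does not spell this out.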

2605.30348 2026-05-29 cs.CL cs.AI cs.LG Version update

LLMSurgeon: Diagnosing Data Mixture of Large Language Models

LLMSurgeon: 诊断大型语言模型的数据混合

Yaxin Luo, Jiacheng Cui, Xiaohan Zhao, Xinyi Shang, Jiacheng Liu, Xinyue Bi, Zhaoyi Li, Zhiqiang Shen

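Affiliations: VILA Lab, MBZUAI; UCL

AI summary: Proposes the LLMSurgeon framework, which uses an inverse-problem formulation to estimate the domain distribution of a pretraining corpus from text generated by the target LLM, enabling post-hoc auditing without access to the training data.

Comments ACL 2026 Main. Code at https://github.com/Yaxin9Luo/LLMSurgeon

Details
AI-generated Chinese abstract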

大型语言模型(LLM)的预训练数据混合构成了它们的“数字DNA”,塑造了模型的行为、能力和失败模式。然而,这种组成很少被披露,使得事后审计数据组合或来源变得困难。在这项工作中,我们形式化了$\textbf{数据混合手术(DMS)}$:仅从目标LLM生成的文本中,在预定义分类法下估计其预训练语料的领域级分布。我们提出了$\textbf{LLMSurgeon}$,一个强大的框架,将DMS视为标签偏移假设下的逆问题。LLMSurgeon不是直接聚合分类器输出,而是估计一个校准的$\textit{软}$混淆矩阵,并解决一个约束逆问题以纠正系统性的领域混淆并恢复潜在的混合先验。为了评估,我们引入了$\textbf{LLMScan}$,一个基于具有透明预训练混合的开源LLM构建的配方可验证评估套件。在LLMScan上,LLMSurgeon在固定协议下以高保真度恢复了领域混合。我们的工作提出了一种实用的、事后审计基础模型数字DNA的方法,无需访问其训练数据。

English abstract

The pretraining data mixture of Large Language Models (LLMs) constitutes their "digital DNA", shaping model behaviors, capabilities, and failure modes. Yet this composition is rarely disclosed, making post-hoc auditing of data combination or provenance difficult. In this work, we formalize $\textbf{Data Mixture Surgery (DMS)}$: given only generated text from a target LLM, estimate the domain-level distribution of its pretraining corpus under a predefined taxonomy. We propose $\textbf{LLMSurgeon}$, a strong framework that casts DMS as an inverse problem under the label-shift assumption. Rather than directly aggregating classifier outputs, LLMSurgeon estimates a calibrated $\textit{soft}$ confusion matrix and solves a constrained inverse problem to correct systematic domain confusion and recover the latent mixture prior. To evaluate, we introduce $\textbf{LLMScan}$, a recipe-verifiable evaluation suite built from open-source LLMs with transparent pretraining mixtures. Across LLMScan, LLMSurgeon recovers domain mixtures with high fidelity under fixed protocols. Our work presents a practical, post-hoc approach for auditing the digital DNA of foundation models without access to their training data.
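
A minimal sketch of the constrained inverse problem described above, under stated assumptions: `confusion[i, j]` is the average classifier probability of domain `j` on text truly from domain `i`, and `mean_pred` is the average classifier output on text generated by the target model. Under label shift, `mean_pred ≈ confusion.T @ pi`, so the mixture `pi` can be recovered by least squares over the probability simplex. The solver (projected gradient descent) and all names are illustrative choices, not necessarily what LLMSurgeon uses.

```python
import numpy as np

def project_to_simplex(v):
    """Euclidean projection of v onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    k = np.arange(1, len(v) + 1)
    rho = np.nonzero(u - css / k > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

def estimate_mixture(confusion, mean_pred, steps=5000):
    """Solve min_pi ||confusion.T @ pi - mean_pred||^2 with pi on the simplex."""
    a = confusion.T
    step = 1.0 / np.linalg.norm(a, 2) ** 2       # 1 / Lipschitz constant of the gradient
    pi = np.full(a.shape[1], 1.0 / a.shape[1])   # start from the uniform mixture
    for _ in range(steps):
        pi = project_to_simplex(pi - step * a.T @ (a @ pi - mean_pred))
    return pi

# Hypothetical 3-domain example: naive aggregation (mean_pred) is biased by confusion.
confusion = np.array([[0.8, 0.1, 0.1], [0.2, 0.7, 0.1], [0.1, 0.2, 0.7]])
true_pi = np.array([0.5, 0.3, 0.2])
mean_pred = confusion.T @ true_pi
print(mean_pred, estimate_mixture(confusion, mean_pred))
```

In this toy example, reading the mixture directly off `mean_pred` gives a biased answer, while inverting through the confusion matrix recovers `true_pi`.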

2605.30345 2026-05-29 cs.AI cs.CL cs.LG Version update

SchGen: PCB Schematic Generation with Semantic-Grounded Code Representations

SchGen: 基于语义接地代码表示的PCB原理图生成

Qinpei Luo, Ruichun Ma, Xinyu Zhang, Lili Qiu

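Affiliations: University of California, San Diego; Microsoft Research Asia; The University of Texas at Austin

AI summary: Proposes SchGen, the first large language model that generates editable PCB schematics from natural-language requests. A semantically grounded code representation turns the geometry-driven problem into a semantics-driven matching task, and a large-scale dataset is constructed; SchGen significantly outperforms existing approaches on wire connectivity accuracy and functional correctness.

Comments 19 pages, 7 figures

Details
AI-generated Chinese abstract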

印刷电路板(PCB)原理图设计几乎定义了所有电子硬件,但它仍然是手动且依赖专业知识的。虽然生成式AI已推动数字和模拟集成电路设计的发展,但从自然语言意图生成PCB原理图的研究仍基本空白。本文提出SchGen,首个从自然语言请求生成可编辑PCB原理图的大语言模型。关键挑战在于缺乏适合LLM的表示和大规模数据集。当前的原理图格式以冗长、特定于工具的语法和几何描述为主,难以可靠生成。我们引入一种语义接地代码表示,该表示通过相对位置和基于引脚名的布线对原理图编辑原语进行编码,将几何驱动生成问题转化为适合LLM的语义驱动匹配任务。我们进一步通过人机协作流水线将开源硬件设计转换为我们的表示,构建了与用户提示配对的大规模PCB原理图数据集。实验表明,SchGen在连线准确性和功能正确性上显著优于替代表示甚至更大的通用LLM。我们的结果突出了表示设计在使生成模型胜任复杂硬件设计任务中的关键作用。

English abstract

Printed circuit board (PCB) schematic design defines nearly all electronic hardware, but it remains manual and expertise-intensive. While generative AI has advanced digital and analog IC design, PCB schematic generation from natural-language intent is largely unexplored. This paper presents SchGen, the first large language model that generates editable PCB schematics from natural-language requests. The key challenge lies in the lack of an LLM-suited representation and a large-scale dataset. Current schematic formats are dominated by verbose, tool-specific syntax and geometry-heavy descriptions, making them difficult to generate reliably. We introduce a semantically grounded code representation that encodes schematic editing primitives with relative placement and pin-name-based wiring, transforming a geometry-driven generation problem into a semantics-driven matching task amenable to LLMs. We further construct a large-scale dataset of PCB schematics paired with user prompts via a human-agent collaborative pipeline that converts open-source hardware designs into our representation. Experiments show that SchGen significantly outperforms alternative representations and even larger general-purpose LLMs on wire connectivity accuracy and functional correctness. Our results highlight the critical role of representation design in enabling generative models for complex hardware design tasks.

2605.30337 2026-05-29 cs.LG Version update

Efficient Test-Time Finetuning of LLMs via Convex Reconstruction and Gradient Caching

通过凸重构和梯度缓存实现LLM的高效测试时微调

Alaa Khamis, Alaa Maalouf

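Affiliations: Department of Computer Science

AI summary: Proposes HullFT, which uses convex reconstruction and gradient caching to speed up test-time finetuning of LLMs, substantially reducing runtime while maintaining quality.

Details
AI-generated Chinese abstract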

测试时微调(TTFT)是一种快速发展的范式,它通过检索相关序列、在序列上更新模型然后评估提示来使语言模型适应每个提示。然而,TTFT只有在快速的情况下才实用:选择和微调都在每个查询时发生,使得每个步骤都成为直接瓶颈。现有方法以速度换取质量:快速检索通常是冗余的,而更强的多样性感知选择增加了过高的每查询成本。我们引入HullFT,一种几何方法来解决这两个瓶颈。给定一个查询,HullFT首先使用高效的免投影Frank-Wolfe优化将查询嵌入表示为少量训练序列的稀疏凸组合。这产生了一个固有相关且多样化的支持集。然后,我们通过几何整数化过程将分数凸权重转换为用于微调的精确整数多重集。由此产生的多重性自然地创建了重复示例,我们利用梯度重用在重复微调步骤中分摊前向-反向计算。我们的实验表明,HullFT在质量-效率权衡上优于当前最先进的TTFT方法,以显著更低的总运行时间实现了更低的每字节比特数。

English abstract

Test-time finetuning (TTFT) is a rapidly evolving paradigm that adapts a language model to each prompt by retrieving related sequences, updating the model on them, and then evaluating the prompt. However, TTFT is only practical if it is fast: selection and finetuning both happen per query, making each a direct bottleneck. Existing methods trade speed for quality: fast retrieval is often redundant, while stronger diversity-aware selection adds prohibitive per-query cost. We introduce HullFT, a geometric approach to TTFT that addresses both bottlenecks. Given a query, HullFT first represents the query embedding as a sparse convex combination of few training sequences, using efficient projection-free Frank-Wolfe optimization. This yields a support set that is inherently relevant and diverse. We then convert the fractional convex weights into an exact integer multiset for finetuning through a geometric integerization procedure. The resulting multiplicities naturally create repeated examples, which we exploit with Gradient Reuse to amortize forward-backward computation across repeated finetuning steps. Our experiments show that HullFT improves the quality-efficiency tradeoff over current state-of-the-art TTFT methods, achieving lower bits-per-byte at substantially lower total runtime.
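
The first HullFT step, writing the query embedding as a sparse convex combination of training embeddings, can be sketched with textbook Frank-Wolfe over the probability simplex. This is a generic illustration with the standard open-loop step size; the function name, the step rule, and the toy data are assumptions, and the integerization and Gradient Reuse stages are not shown.

```python
import numpy as np

def frank_wolfe_weights(train_emb, query, iters=200):
    """Approximate `query` as a convex combination of the rows of `train_emb`."""
    n = train_emb.shape[0]
    w = np.zeros(n)
    w[np.argmax(train_emb @ query)] = 1.0      # start at the single most similar row
    for t in range(iters):
        residual = w @ train_emb - query
        grad = train_emb @ residual             # gradient of 0.5 * ||w @ X - q||^2 in w
        s = np.argmin(grad)                     # linear minimization over the simplex: one vertex
        gamma = 2.0 / (t + 2.0)                 # standard open-loop step size
        w *= 1.0 - gamma                        # move toward vertex s; no projection needed
        w[s] += gamma
    return w

# Toy example: the query is the midpoint of the first two training embeddings.
X = np.eye(3)
q = np.array([0.5, 0.5, 0.0])
w = frank_wolfe_weights(X, q)
print(w, np.linalg.norm(w @ X - q))
```

Each iteration adds at most one new training row to the support, which is why the resulting weights tend to be sparse. Here the third row is never selected.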

2605.30336 2026-05-29 cs.LG Version update

Fairness-Aware Federated Learning with Trajectory Shapley Value

基于轨迹Shapley值的公平感知联邦学习

Daniel Kuznetsov, Ziqi Wang

Affiliations: Faculty of Mathematics, Ecole Normale Supérieure Paris-Saclay; Chair for Dynamics, Control, Machine Learning and Numerics – Alexander von Humboldt Professorship, Department of Mathematics, Friedrich-Alexander-Universität Erlangen-Nürnberg

AI summary: Proposes the Trajectory Shapley Value as a contribution metric and designs FedTSV, an adaptive aggregation method, to address the bias and instability caused by fixed weights in federated learning, enabling fair, robust, and efficient collaborative learning.

Comments Accepted for publication at the 24th European Control Conference (ECC 2026)

Details
AI-generated Chinese abstract

联邦学习是一种新兴的分布式范式,解决了由异构、隐私敏感数据带来的挑战。它允许多个客户端通过聚合其在服务器上的本地更新来协作训练模型。然而,传统的聚合方案通常使用固定权重,无法反映客户端贡献的不平等和时变特性,导致学习过程偏倚且不稳定。为了提高公平性和稳定性,我们提出了轨迹Shapley值(TSV),这是一种贡献度量,通过基于验证的、时间一致的效用评估每个客户端如何影响全局模型的优化轨迹。基于TSV,我们设计了FedTSV,一种自适应聚合方法,将每轮评估转换为动态客户端权重,使服务器能够实时响应异构和对抗性参与。在基准数据集上的实验表明,FedTSV加速了收敛,提高了鲁棒性,并产生了更公平的贡献评估,从而为公平感知的联邦优化提供了原则性基础。

English abstract

Federated learning is an emerging distributed paradigm that addresses the challenges posed by heterogeneous, privacy-sensitive data. It enables multiple clients to train a model collaboratively by aggregating their local updates at a server. However, conventional aggregation schemes typically use fixed weights that fail to reflect unequal and time-varying client contributions, leading to biased and unstable learning. To improve fairness and stability, we propose the Trajectory Shapley Value (TSV), a contribution metric that evaluates how each client influences the optimization trajectory of the global model using a validation-based, temporally consistent utility. Building on TSV, we design FedTSV, an adaptive aggregation method that converts per-round evaluations into dynamic client weights, allowing the server to respond to heterogeneous and adversarial participation in real time. Experiments on benchmark datasets show that FedTSV accelerates convergence, improves robustness, and yields more equitable contribution assessments, thereby providing a principled foundation for fairness-aware federated optimization.
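
For readers unfamiliar with Shapley-style contribution scores, the sketch below computes exact classical Shapley values for a small set of clients given a set-valued utility (for instance, validation gain from aggregating a subset of updates). It is not the Trajectory Shapley Value itself, whose utility is defined in the paper; the helper `to_weights`, which clips negative scores and normalizes, is a hypothetical way of turning scores into aggregation weights.

```python
from itertools import combinations
from math import factorial

def shapley_values(clients, utility):
    """Exact Shapley value of each client for a set-valued utility function."""
    n = len(clients)
    values = {}
    for c in clients:
        others = [x for x in clients if x != c]
        total = 0.0
        for k in range(n):
            # Weight of a coalition of size k that does not contain c.
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (utility(s | {c}) - utility(s))
        values[c] = total
    return values

def to_weights(values):
    """Hypothetical rule: clip negative contributions to zero, then normalize."""
    clipped = {c: max(v, 0.0) for c, v in values.items()}
    total = sum(clipped.values())
    if total <= 0:
        return {c: 1.0 / len(clipped) for c in clipped}
    return {c: v / total for c, v in clipped.items()}

# Toy additive utility: client "c" hurts validation performance.
gains = {"a": 0.3, "b": 0.1, "c": -0.2}
vals = shapley_values(list(gains), lambda s: sum(gains[x] for x in s))
print(vals, to_weights(vals))
```

Exact enumeration costs on the order of 2^n utility evaluations per round, so this form is only practical for a handful of clients.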

2605.30330 2026-05-29 cs.LG Version update

When, why, and how do diffusion posterior samplers fail? A finite-sample lens

何时、为何以及如何扩散后验采样器失败?一个有限样本视角

Benjamin A. Burns, Sara Fridovich-Keil

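Affiliations: Georgia Institute of Technology

AI summary: Analyzes, from a finite-sample perspective, why likelihood approximation errors in diffusion posterior samplers bias the sampled posterior. Finds that inaccurate estimates of posterior spread at intermediate timesteps lead to mis-weighted modes and hallucination, and proposes a diagnostic that is agnostic to the type of approximation.

Comments All code for experiments is available at: https://github.com/voilalab/diagnosing-posterior-sampling

Details
AI-generated Chinese abstract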

扩散模型具有对自然数据复杂分布进行建模的出色能力,这使其成为成像逆问题中后验采样的流行且有效的选择。现有方法可以在推理时融入任何测量模型,但为了计算可行性,必须在中间时间步使用不精确的似然近似。尽管这些近似通常在经验上效果良好,但它们对采样后验的下游影响尚不清楚,并可能导致无法解释的失败。为了理解这些似然近似何时、为何以及如何传播到错误的后验分布,我们引入了一个有限样本视角的后验采样,该视角在训练集大小趋于无穷时,对于任何前向模型和先验分布,都能以任意精度逼近后验。使用这个有限样本透镜,我们观察到流行的后验采样近似倾向于在中间时间步低估或高估后验的扩散,导致下游后果,包括对早期停止时间的敏感性、后验模式的相对权重不准确以及幻觉,既包括后验中不存在的先验模式,也包括先验不支持的似然模式。此外,我们发现这些后验误差的原因既不需要非线性测量模型也不需要多模态后验,而可能仅仅由于多模态先验和中间采样时间步的后验扩散不准确而产生。我们的有限样本后验采样方法对似然近似的类型和(线性或非线性)前向模型的类型不可知,因此可以作为即插即用的诊断工具,用于评估现有和未来后验采样器的准确性和失败模式。

English abstract

Diffusion models have excellent capacity to model complex distributions of natural data, which has made them a popular and effective choice for posterior sampling in imaging inverse problems. Existing methods can incorporate any measurement model at inference time but must use an inexact approximation for the likelihood at intermediate timesteps for computational tractability. Although these approximations can often work well empirically, their downstream effect on the sampled posterior is poorly understood and can result in unexplained failures. To understand when, why, and how these likelihood approximations propagate to erroneous posterior distributions, we introduce a finite-sample perspective on posterior sampling that approximates the posterior to arbitrary precision as training set size tends towards infinity, for any forward model and prior distribution. Using this finite-sample lens, we observe that popular posterior sampling approximations tend to under- or over-estimate the spread of the posterior at intermediate timesteps, causing downstream consequences including sensitivity to early stopping time, inaccurate relative weighting of posterior modes, and hallucination, both of prior modes that are not in the posterior and likelihood modes that are not supported by the prior. Moreover, we find that the cause of these posterior errors requires neither a nonlinear measurement model nor a multimodal posterior, but can arise solely due to a multimodal prior and inaccurate posterior spread at intermediate sampling times. Our finite-sample posterior sampling approach is agnostic to the type of likelihood approximation and the type of (linear or nonlinear) forward model, and can thus serve as a drop-in diagnostic to evaluate the accuracy and failure modes of existing and future posterior samplers.

2605.30329 2026-05-29 cs.LG Version update

SoundnessBench: Can Your AI Scientist Really Tell Good Research Ideas from Bad Ones?

SoundnessBench:你的AI科学家真的能区分好的研究想法和坏的吗?

Sy-Tuyen Ho, Minghui Liu, Huy Nghiem, Furong Huang

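Affiliations: University of Maryland College Park

AI summary: Proposes the SoundnessBench benchmark, which uses 1,099 machine-learning research proposals drawn from ICLR submissions to evaluate how well LLMs judge the methodological soundness of research ideas. Finds that current LLMs show a pervasive optimism bias and cannot yet serve reliably as standalone first-gate evaluators of scientific rigor.

Comments Project Page: https://hosytuyen.github.io/projects/SoundnessBench

Details
AI-generated Chinese abstract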

自主AI研究智能体旨在通过自动化研究流程(从假设生成到同行评审)加速科学发现。然而,现有基准很少测试一个基本瓶颈:大型语言模型在投入时间和计算资源之前,能否判断研究想法的方法论可行性。我们引入了SoundnessBench,一个从ICLR提交中重建的1099个机器学习研究提案的精选基准,标注了评审者的合理性子分数,并对照源论文进行了审计。SoundnessBench应被解释为可恢复的提案阶段合理性基准,而非对完整论文评审结果的精确预测。在12个前沿LLM中,我们发现了一个普遍的乐观偏差:在标准提示下,模型经常将低合理性的提案评为合理,而激进提示则主要将错误从假阳性转移到假阴性。对公共语料污染、论文识别短语、表面特征和人工审计质量的额外控制表明,这种行为不能由单一混杂因素解释。我们的结果表明,当前LLM作为科学严谨性的独立初筛评估者尚不可靠。

English abstract

Autonomous AI research agents aim to accelerate scientific discovery by automating the research pipeline, from hypothesis generation to peer review. However, existing benchmarks rarely test a fundamental bottleneck: whether Large Language Models can judge the methodological viability of a research idea before expending time and computational resources. We introduce SoundnessBench, a curated benchmark of 1,099 machine-learning research proposals reconstructed from ICLR submissions, labeled with reviewer soundness sub-scores, and audited against source papers. SoundnessBench should be interpreted as a benchmark for recoverable proposal-stage soundness rather than exact prediction of full-paper review outcomes. Across 12 frontier LLMs, we find a pervasive optimism bias: under standard prompting, models frequently rate low-soundness proposals as sound, while aggressive prompting largely shifts errors from false positives to false negatives. Additional controls for public-corpus contamination, paper-identifying phrases, surface features, and human audit quality suggest that this behavior is not explained by a single confounder. Our results indicate that current LLMs are not yet reliable as standalone first-gate evaluators for scientific rigor.

2605.30327 2026-05-29 cs.LG cs.AI cs.CL math.ST stat.ML stat.TH Version update

Reasoning with Sampling: Cutting at Decision Points

基于采样的推理:在决策点进行裁剪

Felix Zhou, Anay Mehrotra, Quanquan C. Liu

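Affiliations: Yale University; Stanford University

AI summary: Proposes the Entropy-Cut Metropolis-Hastings algorithm, which uses the base model's next-token entropy as a proxy to identify key decision points and resample from them, enabling efficient sampling from the power distribution to strengthen reasoning; it outperforms baselines and RL-trained models on several benchmarks.

Details
AI-generated Chinese abstract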

前沿推理模型是通过对基础语言模型进行强化学习后训练而产生的。最近的研究对此提出了挑战,表明从基础模型分布的锐化版本(即所谓的幂分布)中采样,无需额外训练、精心策划的数据集或验证器,就能产生可比的推理能力。然而,使这种方法实用化需要高效地从幂分布中采样。采样器需要“混合”到幂分布,这需要在目标分布的模态之间移动;直观地说,例如尝试不同的推理策略。先前工作中提出的采样器反复在当前推理轨迹中均匀随机选择一个“裁剪”位置,并从该位置开始重新采样后缀。然而,推理轨迹通常包含少数关键决策(例如,证明策略或算法的选择),我们观察到均匀选择的裁剪往往重写局部细节,而不是重新审视决策点。我们引入了一种算法(Entropy-Cut Metropolis-Hastings),该算法使用基础模型的下一词元熵作为代理来识别关键决策点,并从这些位置重新采样。我们通过实验验证了熵跳变是决策点的有用代理,并在一个风格化的推理模型中证明了我们的方法的混合时间与轨迹中的决策数量成比例,而不是与可能大得多的词元数量成比例。在MATH500、HumanEval、GPQA Diamond和AIME26上,我们的方法始终优于基线和RL训练模型。

English abstract

Frontier reasoning models are produced by posttraining base language models with reinforcement learning. Recent work has challenged this by showing that sampling from a sharpened version of the base model's distribution, a so-called power distribution, elicits comparable reasoning without additional training, curated datasets, or verifiers. However, making this method practical requires efficiently sampling from the power distribution. A sampler needs to "mix" to the power distribution, which necessitates moving between modes of the target distribution; intuitively, e.g., trying different reasoning strategies. The samplers proposed in prior works repeatedly select a "cut" position in the current reasoning trace uniformly at random and resample the suffix from that position onward. However, reasoning traces typically contain a few consequential decisions (e.g., the choice of proof strategy or algorithm), and we observe that a uniformly chosen cut tends to rewrite local details rather than revisit decision points. We introduce an algorithm (Entropy-Cut Metropolis-Hastings) that uses the base model's next-token entropy as a proxy to identify key decision points and resample from those positions. We empirically verify that entropy jumps are a useful proxy for decision points and, in a stylized model of reasoning, prove that our method's mixing time scales with the number of decisions in a trace rather than with the number of tokens, which can be much larger. Across MATH500, HumanEval, GPQA Diamond, and AIME26, our method consistently improves over baselines and RL-trained models.
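
The contrast between a uniform cut and an entropy-guided cut can be sketched as follows. The sketch assumes the per-position next-token distributions of the current trace are available as a `(positions x vocab)` array and picks a cut with probability proportional to raw entropy. The abstract refers to entropy jumps, so the exact proposal in the paper may differ, and the Metropolis-Hastings acceptance step (which must account for the non-uniform proposal) is omitted.

```python
import numpy as np

def token_entropies(next_token_probs):
    """Shannon entropy (in nats) of each row of a (positions x vocab) array."""
    p = np.asarray(next_token_probs, dtype=float)
    logp = np.log(np.where(p > 0, p, 1.0))   # treat 0 * log(0) as 0
    return -(p * logp).sum(axis=1)

def sample_cut(next_token_probs, rng):
    """Pick a cut position with probability proportional to next-token entropy."""
    h = token_entropies(next_token_probs)
    if h.sum() <= 0:                          # fully deterministic trace: fall back to uniform
        return int(rng.integers(len(h)))
    return int(rng.choice(len(h), p=h / h.sum()))

# Toy trace: only the middle position is a genuine choice among three tokens.
probs = np.array([[1.0, 0.0, 0.0],
                  [1 / 3, 1 / 3, 1 / 3],
                  [1.0, 0.0, 0.0]])
rng = np.random.default_rng(0)
print(token_entropies(probs), sample_cut(probs, rng))
```

In this toy trace a uniform cut would land on the single uncertain position only a third of the time, while the entropy-weighted cut always selects it.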

2605.30324 2026-05-29 cs.DS cs.AI cs.CL cs.LG stat.ML Version update

On Language Generation in the Limit with Bounded Memory

有界记忆下的极限语言生成

Jon Kleinberg, Anay Mehrotra, Amin Saberi, Grigoris Velegkas

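Affiliations: Cornell University; Stanford University; Google Research

AI summary: Studies language generation in the limit under bounded memory, using combinatorial bounds and sliding-window analysis to characterize how memory constraints affect generability, density, and identification.

Comments The abstract has been shortened to fit within the arXiv limit

Details
AI-generated Chinese abstract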

我们研究有界记忆下的极限语言生成。在该任务中,学习器每次观察来自未知目标语言的一个示例,并且必须最终只输出新的有效示例。先前的工作假设可以访问整个历史,这是一个强假设,因为实际算法只保留有限的过去信息。学习理论中的经典工作表明,记忆约束会显著改变可学习性;我们将此扩展到语言生成。 首先,我们研究无记忆生成器。在温和的枚举限制下,每个可数无限语言集合仍然可以在没有记忆的情况下生成。没有这个限制,我们精确刻画了何时无记忆生成是可能的。对于有限集合,我们刻画了无记忆生成器可实现的最优极小极大密度——针对任何给定大小的集合所能保证的最佳密度。这个组合界依赖于Sperner定理和对称链分解。 我们进一步表明,最后$W$个示例的滑动窗口不会改善这种最坏情况密度,而允许存储$b$个自适应选择的过去示例则会改善每个$b \geq 1$的可实现密度。 最后,我们重新审视极限识别,其中学习器必须收敛到目标语言的单个正确假设。我们关注其增量变体,其中学习器只记住其之前的猜测。在这里,尽管精确识别在仅包含三种语言的集合上失败,但一个温和的松弛——要求收敛到目标的“近似”版本——对于每个有限集合都是可实现的。 这些结果表明,有界记忆对这些任务的影响不同:生成对于每个可数集合仍然可实现,而密度和识别仅限于有限集合,且随着集合增长保证减弱。

English abstract

We study language generation in the limit under bounded memory. In this task, a learner observes examples from an unknown target language one at a time and must eventually output only new valid examples. Prior work assumes access to the entire history, a strong assumption since realistic algorithms retain limited past information. Classical work in learning theory shows memory constraints dramatically alter learnability; we extend this to language generation. First, we study memoryless generators. Under a mild enumeration restriction, every countable collection of infinite languages remains generable without memory. Without this restriction, we exactly characterize when memoryless generation is possible. For finite collections, we characterize the optimal minimax density achievable by memoryless generators -- the best density guaranteed against any collection of a given size. This combinatorial bound relies on Sperner's theorem and symmetric chain decompositions. We further show that a sliding window of the last $W$ examples does not improve this worst-case density, whereas allowing it to store $b$ adaptively chosen past examples improves the achievable density for every $b \geq 1$. Finally, we revisit identification in the limit, where the learner must converge to a single correct hypothesis for the target language. We focus on its incremental variant, where the learner remembers only its previous guess. Here, although exact identification fails on a collection of just three languages, a mild relaxation requiring convergence to an "approximate" version of the target is achievable for every finite collection. These results show bounded memory affects these tasks differently: generation remains achievable for every countable collection, while density and identification are confined to finite collections, with guarantees weakening as the collection grows.

2605.30323 2026-05-29 cs.LG cs.AI Version update

In-Context Reward Adaptation for Robust Preference Modeling

上下文奖励自适应用于鲁棒偏好建模

Zhenyu Sun, Zheng Xu, Ermin Wei

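Affiliations: Northwestern University; Meta Superintelligence Labs

AI summary: Proposes a transformer-based in-context reward adaptation framework that models diverse and unseen human preferences on the fly from a small set of preference demonstrations, with human response time as an auxiliary signal, for robust preference modeling and adaptation to distribution shift.

Details
AI-generated Chinese abstract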

基于人类反馈的强化学习通常依赖静态奖励模型来使大型语言模型与人类偏好对齐。然而,人类价值观本质上是多样且异质的,单一奖励模型往往缺乏泛化到未见偏好领域所需的鲁棒性。虽然现有的多奖励框架试图解决这一问题,但它们通常局限于一组固定的已知领域,并且无法在没有昂贵重新训练的情况下适应未见的人类分布。在这项工作中,我们提出了上下文奖励自适应,一个基于Transformer的框架,旨在动态建模多样且未见的人类偏好。通过利用Transformer的上下文学习能力,我们的方法从少量偏好示例中自适应地推断出潜在的奖励结构。我们证明,标准Transformer架构由于对真实值存在渐近偏差而不足以完成此任务,但将人类反应时间作为辅助输入信号使模型能够成功适应来自先前未见领域的偏好。我们的研究结果表明,这种方法为偏好建模提供了更鲁棒的基础,允许表示异质奖励和偏好分布偏移,并为更灵活的人机对齐提供了一条可扩展的路径。

English abstract

Reinforcement Learning from Human Feedback (RLHF) typically relies on static reward models to align Large Language Models with human preferences. However, human values are inherently diverse and heterogeneous, and a single reward model often lacks the robustness required to generalize to unseen preference domains. While existing multi-reward frameworks attempt to address this, they are often restricted to a fixed set of known domains and fail to adapt to unseen human distributions without costly retraining. In this work, we propose In-Context Reward Adaptation, a transformer-based framework designed to model diverse and unseen human preferences on the fly. By leveraging the in-context learning capabilities of transformers, our approach adaptively infers the underlying reward structure from a small set of preference demonstrations. We demonstrate that while a standard transformer architecture is insufficient for this task by characterizing an asymptotic bias to the ground-truth, incorporating human response time as an auxiliary input signal enables the model to successfully adapt to preferences from previously unseen domains. Our findings show that this approach provides a more robust foundation for preference modeling, allowing for the representation of heterogeneous rewards and preference distribution shift, and offering a scalable path toward more flexible human-AI alignment.

2605.30322 2026-05-29 cs.LG cs.AI Version update

Gram: Assessing sabotage propensities via automated alignment auditing

Gram:通过自动化对齐审计评估破坏倾向

David Lindner, Victoria Krakovna, Sebastian Farquhar

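Affiliations: Google

AI summary: Proposes the Gram framework, which automatically audits the sabotage propensity of AI agents across 17 simulated agentic deployment scenarios, finds that Gemini models misbehave in about 2-3% of trajectories, and introduces an investigator agent pipeline to identify the drivers.

Details
AI-generated Chinese abstract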

我们引入了Gram,一个自动化对齐审计框架,用于评估AI代理参与破坏的倾向。我们在17个模拟的代理部署场景中评估了Gemini模型,这些场景激励破坏行为。我们发现Gemini模型在大约2-3%的模拟轨迹中存在不当行为。其中许多案例可以通过Gemini模型中的“过度急切”来解释,导致过度的角色扮演和目标寻求行为。与其他对齐审计方法相比,Gram专门设计用于评估代理编码和研究代理中的失调和有意破坏。我们还引入了一个实验性的调查代理管道,能够进行细粒度的定向实验,以识别不当行为的驱动因素。我们发现,增加环境的真实性和移除不当行为的提示往往会使破坏率降低到接近零。

English abstract

We introduce Gram, an automated alignment auditing framework to assess the propensity of AI agents to engage in sabotage. We evaluate Gemini models across 17 simulated agentic deployment scenarios that incentivize sabotage. We find Gemini models misbehave in about 2-3% of our simulated trajectories. Many of these cases are explained by "overeagerness" in Gemini models resulting in both excessive role-playing and goal-seeking behavior. In contrast to other alignment auditing approaches, Gram is designed to specifically evaluate misalignment and intentional sabotage in agentic coding and research agents. We additionally introduce an experimental investigator agent pipeline which enables fine-grained targeted experiments to identify the drivers of misbehavior. We find that increasing realism of environments and removing nudges to misbehave tends to reduce sabotage rates close to zero.

2605.30319 2026-05-29 stat.ML cs.AI cs.DS cs.LG math.ST stat.TH Version update

Improved Guarantees for Heterogeneous Treatment-Effect Estimation via Matrix Completion

通过矩阵补全改进异质性处理效应估计的保证

Anay Mehrotra, Phuc Tran, Van H. Vu, Manolis Zampetakis

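Affiliations: Stanford University; Vin University; The University of Hong Kong; Yale University

AI summary: For heterogeneous treatment-effect estimation with panel data, proposes a simple, efficient matrix-completion-based estimator that achieves a row-wise $\ell_2$ error of $\tilde{O}(\sqrt{1/n + n/m^2})$ under low-rankness assumptions, and establishes the first row-wise $\ell_2$ perturbation bound for low-rank approximation.

Details
AI-generated Chinese abstract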

现代因果推断的一个核心目标是估计异质性处理效应,以回答诸如“干预如何影响每个单元”的问题,而不仅仅是平均效应。我们研究面板数据下的该问题,其中我们观察到$n$个单元在$m$个时间点上的数据,且处理分配未知且非均匀。该设置中的数据自然表示为所有单元-时间处理效应的矩阵。估计异质性处理效应可以表示为对该矩阵中每一行平均值的良好估计。这使我们能够将问题表述为矩阵补全,在自然低秩假设下可解。然而,现有的矩阵补全保证不足以得到估计异质性处理效应所需的每行保证的有意义界;粗略地说,它们仅适用于估计平均处理效应界,正如最近一系列工作所示。我们给出一个简单、计算高效的估计器,在不知道倾向性且标准低秩和正则性假设下,实现行向$\ell_2$误差$\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$。在技术上,我们的分析首次建立了低秩逼近的尖锐行向$\ell_2$扰动界,补充了现有的谱、Frobenius和逐元素扰动理论。

English abstract

A central goal of modern causal inference is estimating heterogeneous treatment effects to answer questions like "how does an intervention affect each unit," rather than only on average. We study this problem with panel-data where we observe $n$ units across $m$ times under unknown, non-uniform treatment assignments. The data in this setting is naturally represented as a matrix of all unit--time treatment effects. Estimating heterogeneous treatment effects can then be expressed as obtaining a good estimation of each row's average in this matrix. This allows us to formulate the problem as matrix completion, which can be solved under natural low-rankness assumptions. However, existing matrix-completion guarantees are not powerful enough to get meaningful bounds for the per-row guarantee required for estimating the heterogeneous treatment effect; roughly speaking, they are only useful for estimating average treatment effect bounds, as also illustrated in a recent line of work. We give a simple, computationally efficient estimator that, without knowledge of the propensities and under standard low-rankness and regularity assumptions, achieves a row-wise $\ell_2$ error of $\tilde{O}(\sqrt{\frac{1}{n} + \frac{n}{m^2}})$. Technically, our analysis establishes the first sharp row-wise $\ell_2$-perturbation bound for low-rank approximation, complementing existing spectral-, Frobenius-, and entrywise perturbation theory.

2605.30315 2026-05-29 cs.CL cs.LG Version update

Resolution Diagnostics for Paired LLM Evaluation

配对LLM评估的分辨率诊断

Anany Kotawala

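Affiliations: Princeton University

AI summary: Addresses pairwise rankings on public LLM leaderboards that fall short of a conventional paired-test resolution target. Proposes a hypothesis-testing framework for paired evaluation with the resolution ratio q = N/N* as the primary diagnostic, and shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut is off by roughly a factor of two in the close-comparison regime.

Comments 16 pages, 7 figures, 12 tables. Accepted to the ICML 2026 Workshop on Hypothesis Testing, Seoul, South Korea, 2026. Copyright 2026 by the author(s)

Details
AI-generated Chinese abstract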

在两个公开的LLM排行榜中,许多显示的配对排名在实际配对评估设计下未达到常规配对检验的分辨率目标:在Open LLM Leaderboard v1的40个配对比较中,有11个未解决;在MMLU-Pro前10名相邻排名配对中,9个中有4个未解决(在(alpha, 1-beta) = (0.05, 0.8)下)。在真实的主题级聚类下,MMLU-Pro未解决数上升至6/9,并且在99.9%的类别自助重采样中保持9个中的5-6个未解决。我们将配对LLM评估构建为一个假设检验问题,反转水平alpha、功效(1-beta)的检验,并报告每对的分辨率比q = N/N*作为主要诊断指标。一个具有显式二阶常数的尖锐小效应展开表明,广泛使用的非配对Cohen-h-plus-(1-rho)捷径在接近比较区域与正确的N*偏差约两倍,当用户将其每臂输出乘以(1-rho)时,五个现成计算器中的三个(Cohen 1988, G*Power, R pwr)会无声地继承这一缺陷。在多重校正和任意有效序贯检验下,未解决配对模式仍然存在。

English abstract

Across two public LLM leaderboards, many displayed pairwise rankings do not meet a conventional paired-test resolution target under the actual paired evaluation design: 11 of 40 Open LLM Leaderboard v1 pairwise comparisons and 4 of 9 MMLU-Pro top-10 adjacent-rank pairs are unresolved at (alpha, 1-beta) = (0.05, 0.8). The MMLU-Pro count rises to 6/9 under real subject-level clustering and stays at 5-6 out of 9 in 99.9% of category-bootstrap resamples. We frame paired LLM evaluation as a hypothesis-testing problem, invert level-alpha, power-(1-beta) tests, and report a per-pair resolution ratio q = N/N* as the primary diagnostic. A sharp small-effect expansion with an explicit second-order constant shows that the widely used unpaired Cohen-h-plus-(1-rho) shortcut deviates from the correct N* by approximately a factor of two in the close-comparison regime, a deficit that three of five off-the-shelf calculators (Cohen 1988, G*Power, R pwr) silently inherit when the user post-multiplies their per-arm output by (1-rho). The unresolved-pair pattern remains under multiplicity correction and anytime-valid sequential testing.
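
The resolution ratio q = N/N* can be illustrated with the textbook normal-approximation sample size for a two-sided paired z-test. This is a rough sketch, not the paper's sharp small-effect expansion: `delta` is the assumed mean per-item score difference between the two models, `var_diff` is the assumed variance of the per-item paired differences, and both function names are illustrative.

```python
from statistics import NormalDist

def required_n(delta, var_diff, alpha=0.05, power=0.8):
    """Items needed for a two-sided paired z-test to detect mean difference `delta`."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return (z ** 2) * var_diff / (delta ** 2)

def resolution_ratio(n_items, delta, var_diff, alpha=0.05, power=0.8):
    """q = N / N*; q >= 1 suggests the comparison is resolved at this target."""
    return n_items / required_n(delta, var_diff, alpha, power)

# Hypothetical pair: a 1-point accuracy gap with paired-difference variance 0.1.
print(required_n(0.01, 0.1), resolution_ratio(1000, 0.01, 0.1))
```

With these hypothetical numbers, a 1,000-item benchmark falls well short of the required size (q below 1), which is the kind of unresolved pair the abstract describes.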

2605.24244 2026-05-29 stat.ML cs.LG Version update

MEDAL: Manifold Embedding Distillation via Autoencoder Learning

MEDAL: 通过自编码器学习的流形嵌入蒸馏

Irene Chang, Tarek M. Zikry, Genevera I. Allen

Affiliations: Department of Statistics, Columbia University; School of Data and Information Sciences, University of North Carolina at Chapel Hill; Irving Institute for Cancer Dynamics, Columbia University; Center for Theoretical Neuroscience, Zuckerman Mind Brain Behavior Institute, Columbia University

AI summary: Proposes the MEDAL framework, which uses a constrained autoencoder to distill a manifold embedding into a reusable encoder-decoder model, enabling held-out validation, hyperparameter selection, and distribution-shift detection.

Details
AI-generated Chinese abstract

低维嵌入被广泛用作高维数据的视觉摘要,并支持下游科学发现。然而,流行的非线性降维方法(如t-SNE和UMAP)通常仅根据视觉吸引力选择,缺乏严格的定量验证。主要原因是流形嵌入通常不提供样本外映射或返回原始特征空间的逆映射;这使得留出验证(监督学习的黄金标准)几乎不可能。为了解决这些挑战,我们开发了一个新颖的框架MEDAL(通过自编码器学习的流形嵌入蒸馏),它将拟合的流形嵌入蒸馏为可复用的编码器-解码器模型。MEDAL训练一个约束自编码器,其瓶颈精确匹配任何教师嵌入,而解码器重建原始输入;这为新样本提供了显式映射、近似逆映射以及流形空间中基于逐点重建的失真度量。这将静态流形嵌入转换为可在留出数据上评估的模型,从而实现定量验证,包括比较不同降维方法以及超参数调优。在多个基准和科学案例研究中,我们展示了MEDAL能够通过留出验证确定最优流形嵌入和超参数,揭示难以在二维嵌入中保留的生物相干区域,并在新样本映射到固定参考流形时检测分布偏移。MEDAL为任何现有降维技术提供了一个通用验证包装器,将提高科学工作流中降维的严谨性和可靠性。

English abstract

Low-dimensional embeddings are widely used as visual summaries of high-dimensional data and to enable downstream scientific discoveries. Yet, popular nonlinear dimension reduction methods, such as t-SNE and UMAP, are often selected based on visual appeal alone and without rigorous quantitative validation. A major reason is that manifold embeddings typically do not provide an out-of-sample map nor an inverse back to the original feature space; this makes held-out validation, the gold standard in supervised learning, all but impossible. To address these challenges, we develop a novel framework, MEDAL (Manifold Embedding Distillation via Autoencoder Learning), which distills a fitted manifold embedding into a reusable encoder--decoder model. MEDAL trains a constrained autoencoder whose bottleneck exactly matches any teacher embedding while the decoder reconstructs the original input; this yields an explicit map for new samples, an approximate inverse, and a pointwise reconstruction-based measure of distortion in the manifold space. This converts static manifold embeddings into models that can be evaluated on held-out data, enabling quantitative validation including comparing different dimension reduction methods as well as hyperparameter tuning. Across multiple benchmark and scientific case studies, we show that MEDAL enables held-out validation to determine optimal manifold embeddings and hyperparameters, reveals biologically coherent regions that are difficult to preserve in two dimensional embeddings, and detects distribution shift when new samples are mapped into a fixed reference manifold. MEDAL provides a general validation wrapper to any existing dimension reduction technique that will improve the rigor and reliability of dimension reduction in scientific workflows.

2511.14426 2026-05-29 cs.LG cond-mat.mtrl-sci cs.AI physics.comp-ph 版本更新

MiAD: Mirage Atom Diffusion for De Novo Crystal Generation

MiAD: 幻影原子扩散用于从头晶体生成

Andrey Okhotin, Maksim Nakhodnov, Nikita Kazeev, Mikhail Lazarev, Andrey E Ustyuzhanin, Dmitry Vetrov

发表机构 * Higher School of Economics(俄罗斯高等经济学院) Moscow State University(莫斯科大学) Constructor University of Bremen(不来梅Constructor大学)

AI总结 提出幻影注入技术,使扩散模型能在生成过程中改变原子数量,显著提升晶体生成质量,在MP-20数据集上实现8.2%的S.U.N.率。

详情
AI中文摘要

近年来,基于扩散的模型在搜索同时稳定、独特和新颖(S.U.N.)的晶体材料方面表现出卓越性能。然而,大多数这些模型在生成过程中无法改变晶体中的原子数量,这限制了模型采样轨迹的多样性。在本文中,我们展示了这种限制的严重性,并引入了一种简单而强大的技术——幻影注入,它使扩散模型能够将构成晶体的原子状态从存在变为不存在(幻影),反之亦然。我们表明,与没有这种修改的相同模型相比,该技术将模型质量提高了多达2.5倍。由此产生的模型,幻影原子扩散(MiAD),是一种用于从头晶体生成的等变联合扩散模型,能够在生成过程中改变原子数量。MiAD在MP-20数据集上实现了8.2%的S.U.N.率,大大超过了现有的最先进方法。代码:https://github.com/andrey-okhotin/miad.git

英文摘要

In recent years, diffusion-based models have demonstrated exceptional performance in searching for simultaneously stable, unique, and novel (S.U.N.) crystalline materials. However, most of these models cannot change the number of atoms in the crystal during the generation process, which limits the variability of model sampling trajectories. In this paper, we demonstrate the severity of this restriction and introduce a simple yet powerful technique, mirage infusion, which enables diffusion models to change the state of the atoms that make up the crystal from existent to non-existent (mirage) and vice versa. We show that this technique improves model quality by up to 2.5× compared to the same model without this modification. The resulting model, Mirage Atom Diffusion (MiAD), is an equivariant joint diffusion model for de novo crystal generation that is capable of altering the number of atoms during the generation process. MiAD achieves an 8.2% S.U.N. rate on the MP-20 dataset, which substantially exceeds existing state-of-the-art approaches. Code: https://github.com/andrey-okhotin/miad.git

2510.08535 2026-05-29 stat.ML cs.LG math.PR 版本更新

Permutation-Invariant Spectral Learning via Dyson Diffusion

通过戴森扩散的置换不变谱学习

Tassilo Schwarz, Cai Dieball, Constantin Kogler, Renaud Lambiotte, Arnaud Doucet, Aljaž Godec, George Deligiannidis

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) Mathematical bioPhysics Group, Max Planck Institute for Multidisciplinary Sciences(马克斯·普朗克多学科科学研究所数学生物物理组) School of Mathematics, Institute for Advanced Study(高等研究院数学学院) Department of Statistics, University of Oxford(牛津大学统计系) Mathematical Physics and Stochastic Dynamics, University of Freiburg(弗赖堡大学数学物理与随机动力学组)

AI总结 提出戴森扩散模型,利用随机矩阵理论从分析上提取扩散过程的谱特性,将归纳偏置从架构转移到动力学,实现置换不变的谱学习,准确学习图谱并超越现有图扩散模型。

详情
AI中文摘要

扩散模型是生成建模的核心,并已通过扩散邻接矩阵表示适应于图。对于具有$n$个节点的图,存在多达$n!$个这样的表示,这一挑战仅通过使用置换等变学习架构得到部分缓解。尽管计算效率高,现有的图扩散模型难以区分某些图族及其谱,除非图数据被增强以特定的特征。这一缺陷源于在学习架构中强制执行归纳偏置。在这项工作中,我们利用随机矩阵理论从分析上提取扩散过程的谱特性,从而将大部分归纳偏置从架构推入动力学。在此基础上,我们引入了戴森扩散模型,该模型采用戴森布朗运动来捕捉邻接矩阵上Ornstein-Uhlenbeck过程的谱动力学。此外,以谱动力学为条件,我们制定了一个李群扩散,适当地建模剩余的自由度。引人注目的是,由此产生的学习问题在李代数层面上变为置换不变的。我们证明,戴森扩散模型能够准确学习图谱,并优于现有的图扩散模型。

英文摘要

Diffusion models are central to generative modeling and have been adapted to graphs by diffusing adjacency matrix representations. The challenge of having up to $n!$ such representations for graphs with $n$ nodes is only partially mitigated by using permutation-equivariant learning architectures. Despite their computational efficiency, existing graph diffusion models struggle to distinguish certain graph families and their spectra, unless graph data are augmented with ad hoc features. This shortcoming stems from enforcing the inductive bias within the learning architecture. In this work, we leverage random matrix theory to analytically extract the spectral properties of the diffusion process, allowing us to push most of the inductive bias from the architecture into the dynamics. Building on this, we introduce the Dyson Diffusion Model, which employs Dyson's Brownian motion to capture the spectral dynamics of an Ornstein-Uhlenbeck process on the adjacency matrix. Furthermore, conditioned on the spectral dynamics, we formulate a Lie group diffusion, appropriately modeling the remaining degrees of freedom. Strikingly, the resulting learning problem becomes permutation invariant at the Lie algebra level. We demonstrate that the Dyson Diffusion Model learns graph spectra accurately and outperforms existing graph diffusion models.

2605.30289 2026-05-29 cs.LG stat.AP stat.ML 版本更新

Statistical Embeddings for Similarity, Retrieval, and Interpretable Alignment of Numeric Tabular Datasets

用于数值表格数据集的相似性、检索和可解释对齐的统计嵌入

M. Ross Kunz, John Merickel, Keith Wilson

发表机构 * Idaho National Laboratory(爱达荷国家实验室)

AI总结 提出一种通过结构化探索性数据分析描述符、句子变换器嵌入和典型相关分析(CCA)来表征和比较数值表格数据集的方法,实现跨数据集的相似性检索和可解释变量级对齐,并支持差分隐私。

详情
AI中文摘要

数值表格数据集是科学实践中的主要数据格式,但大型语言模型缺乏在异构特征空间中有意义地表示数值数据集的原生机制。现有方法要么针对单个数据集的预测建模(需要共享变量定义),要么缺乏可解释的跨数据集对齐机制。提出的方法通过结构化探索性数据分析描述符来表征数值表格数据集,使用预训练的句子变换器将这些描述符嵌入到共享向量空间,并通过典型相关分析(CCA)量化跨数据集相似性。此外,应用惩罚形式的CCA来恢复数据集之间稀疏、可解释的变量级对应关系,识别哪些统计描述符或变量级数量驱动跨数据集对齐,而无需共享变量名或特征约定。在嵌入之前,可选地对描述符集应用差分隐私,支持在敏感数据环境中部署,而无需在比较时访问原始观测值。该方法在15个数据集上进行了评估,涵盖通用基准、材料信息学和核级石墨表征。结果表明,总P@1得分为0.9,已知最近邻检索和聚类结构在嵌入消融和差分隐私预算下保持稳健。所提出的框架为将异构数值数据集成到检索增强生成流程中提供了一条原则性途径,同时保留统计上下文,直接应用于数据驱动的算法选择和未知数据集的模拟模型初始化。

英文摘要

Numeric tabular datasets are the dominant data format in scientific practice, yet large language models lack native mechanisms for representing numeric datasets in a meaningful way across heterogeneous feature spaces. Existing approaches either target predictive modeling over individual datasets, which requires a shared set of variable definitions, or lack mechanisms for interpretable cross-dataset alignment. The proposed methodology characterizes numeric tabular datasets through structured exploratory data analysis descriptors, embeds those descriptors into a shared vector space using a pretrained sentence transformer, and quantifies cross-dataset similarity via Canonical Correlation Analysis (CCA). Furthermore, a penalized formulation of CCA is applied to recover sparse, interpretable variable-level correspondences between datasets, identifying which statistical descriptors or variable-level quantities drive cross-dataset alignment without requiring shared variable names or feature conventions. Differential privacy is optionally applied to the descriptor set prior to embedding, supporting deployment in sensitive data contexts without requiring access to raw observations at time of comparison. The methodology is evaluated across 15 datasets spanning general-purpose benchmarks, materials informatics, and nuclear-grade graphite characterization. Results demonstrate a total P@1 score of 0.9, with known nearest-neighbor retrieval and cluster structure remaining robust across embedding ablations and differential privacy budgets. The proposed framework provides a principled pathway for integrating heterogeneous numeric data into retrieval-augmented generation pipelines while preserving statistical context, with direct applications to data-driven algorithm selection and simulation model initialization for unknown datasets.

2605.30277 2026-05-29 cs.LG physics.flu-dyn 版本更新

Neural Operator-Based Surrogate Model for CFD: Helical Coil Steam Generator in Small Modular Reactor

基于神经算子的CFD代理模型:小型模块化反应堆中的螺旋管蒸汽发生器

Minseo Lee, Seongmin Oh, Chaehyeon Song, Bumjin Cho, Shilaj Baral, Sangam Khanal, Minseop Song, Joongoo Jeon

发表机构 * Department of Quantum System Engineering, Jeonbuk National University(全北国立大学量子系统工程系) Department of Nuclear Engineering, Hanyang University(汉阳大学核工程系) Graduate School of Integrated Energy-AI, Jeonbuk National University(全北国立大学综合能源-人工智能研究生院)

AI总结 针对小型模块化反应堆数字孪生中CFD实时仿真的计算瓶颈,提出结合降阶模型与神经算子(多尺度L-DeepONet和FNO)的代理模型框架,在螺旋管蒸汽发生器上实现了瞬时涡流动力学和时均流场的高效预测。

详情
AI中文摘要

实时热工水力仿真对于支持小型模块化反应堆(SMR)安全高效运行的数字孪生(DT)技术至关重要。计算流体动力学(CFD)提供了高保真流动分析,但其计算成本阻碍了在DT中的直接应用。基于AI的代理建模已被积极研究以解决这一限制,但针对SMR特定几何结构的CFD级瞬态分析的神经算子代理尚未见报道。本研究提出了一个集成框架,结合了降阶模型(ROM)与神经算子,应用于系统集成模块化先进反应堆(SMART)的螺旋管蒸汽发生器(HCSG)。比较了针对每种CFD数据类型的两种ROM策略:用于非结构化网格数据的基于MLP的自编码器(AE)和用于结构化网格数据的卷积自编码器(CAE),并将每种策略与深度算子网络(DeepONet)耦合以构建潜在DeepONet(L-DeepONet)。此外,还采用了傅里叶神经算子(FNO)进行比较。两种框架中都引入了多尺度技术以减轻频谱偏差并改进对HCSG内部发展的卡门涡街的预测。多尺度L-DeepONet捕捉了速度和压力场中的瞬时周期性涡旋动力学,而FNO及其多尺度变体预测了时均平均流并提供了可靠的压降估计。这些互补特性提供了实用的模型选择指南,根据CFD数据类型和所需的流动分辨率水平将每种架构与特定的DT目标联系起来。

英文摘要

Real-time thermal-hydraulic simulation is essential for digital twin (DT) technology that supports the safe and efficient operation of small modular reactors (SMRs). Computational fluid dynamics (CFD) provides high-fidelity flow analysis, but its computational cost prevents direct use in DT applications. AI-based surrogate modeling has been actively investigated to address this limitation, yet neural operator-based surrogates for CFD-level transient analysis of SMR-specific geometries have not been reported. This study presents an integrated framework that combines a reduced-order model (ROM) with neural operators, applied to the helical coil steam generator (HCSG) of the System-integrated Modular Advanced Reactor (SMART). Two ROM strategies tailored to each CFD data type were compared, an MLP-based autoencoder (AE) for unstructured mesh data and a convolutional autoencoder (CAE) for structured mesh data, and each was coupled with the deep operator network (DeepONet) to construct the latent DeepONet (L-DeepONet). The Fourier neural operator (FNO) was additionally adopted for comparison. A multi-scale technique was incorporated into both frameworks to mitigate spectral bias and improve the prediction of Kármán vortex streets developing inside the HCSG. The multi-scale L-DeepONet captured the instantaneous periodic vortex dynamics in both velocity and pressure fields, while the FNO and its multi-scale variant predicted the time-averaged mean flow and provided reliable pressure drop estimates. These complementary characteristics provide a practical model-selection guideline that links each architecture to specific DT objectives based on CFD data type and the required level of flow resolution.

2605.30275 2026-05-29 cs.LG q-bio.QM 版本更新

Digitally enriching a screening population for pancreatic cancer using routine blood-based measures and clinical histories

利用常规血液检测指标和临床病史对胰腺癌筛查人群进行数字富集

Chris Varghese, Leo Y. Li-Han, Richa Bisht, Ellen Larson, Frank Lee, Ryan M. Carr, Tanios S. Bekaii-Saab, Shounak Majumder, John D. Halamka, Mark Truty, Ajit H. Goenka, Hojjat Salehinejad, Cornelius A. Thiels

AI总结 提出基于Transformer的多头注意力神经网络,利用纵向诊断编码和血液检测序列预测胰腺癌风险,实现提前1-3年风险分层,为人群级数字富集筛查奠定基础。

详情
AI中文摘要

早期检测胰腺癌是扩大治愈性治疗可及性和减少癌症死亡的关键;然而,目前筛查并不可行。病理的潜在指标体现在个体的疾病和血液检测轨迹中,可能预测胰腺癌的发展。利用患者在临床互动过程中积累的纵向诊断编码和血液检测值序列,训练了一个基于Transformer的定制神经网络,采用多头注意力机制,以提前多年预测胰腺癌风险,并对人群进行风险分层以进行靶向筛查。该队列包括6,017名胰腺癌成人患者和177,081名对照(总体中位年龄75岁,45%女性),在胰腺癌诊断前拥有中位12年(四分位距6.9-16.2)的病史。通过留一站点法进行外部验证,在诊断前1年、2年和3年预测胰腺癌,受试者工作特征曲线下面积均值分别为0.837(95%置信区间0.827-0.848)、0.797(95%置信区间0.782-0.813)和0.760(95%置信区间0.745-0.776)。估计的胰腺癌风险校准良好(校准图斜率1.08,截距-0.077;Brier评分0.025),贝叶斯人群胰腺癌患病率更新使得估计的癌症风险输出可跨环境迁移。在测试中,1年内胰腺癌风险>3.3%的筛查阈值提供了18.2的诊断优势比。因此,我们的工作为第一个人群级数字富集工具奠定了基础,以扩大胰腺癌治愈性管理的可及性。

英文摘要

Earlier detection of pancreatic cancer is key to enabling wider access to curative treatment and reducing cancer deaths; however, screening is presently not viable. Latent indicators of pathology are evident in an individual's disease and blood test trajectories and may predict the development of pancreatic cancer. Longitudinal sequences of coded diagnoses and blood test values accrued by patients throughout their clinical interactions were used to train a custom Transformer-based neural network with a multi-head attention mechanism to predict risk of pancreatic cancer with a multi-year lead time and risk-stratify populations for targeted screening. The cohort comprised 6,017 adults with pancreatic cancer and 177,081 controls (overall median age 75, 45% female) with median 12 years (interquartile range 6.9-16.2) of medical history prior to pancreatic cancer diagnosis. External validation via leave-one-site-out, out-of-sample testing predicting pancreatic cancer 1-, 2-, and 3-years prior to diagnosis demonstrated mean area under the receiver operating characteristic of 0.837 (95% confidence interval 0.827-0.848), 0.797 (95% confidence interval 0.782-0.813), and 0.760 (95% confidence interval 0.745-0.776), respectively. Estimated pancreatic cancer risks were well-calibrated (calibration plot slope 1.08, intercept of -0.077; Brier score 0.025), and a Bayesian population pancreatic cancer prevalence update allows estimated cancer risk outputs to be transportable across settings. At testing, a screening threshold of >3.3% risk of pancreatic cancer in 1-year offered a diagnostic odds ratio of 18.2. Our work therefore lays the foundation for a first population-level digital enrichment tool to widen access to curative-intent management of pancreatic cancer.

2605.30260 2026-05-29 cs.CL cs.AI cs.CV cs.LG 版本更新

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

LoRA如何记忆?大语言模型微调的参数记忆定律

Ziwen Xu, Haiwen Hong, Linsong Yu, Benglei Cui, Longtao Huang, Hui Xue, Ningyu Zhang

发表机构 * Zhejiang University(浙江大学) Alibaba Group(阿里巴巴集团)

AI总结 本文提出参数记忆定律,揭示LoRA在微调中参数与序列长度对损失降低的幂律关系,并基于此设计MemFT优化策略提升记忆保真度与效率。

Comments Ongoing work

详情
AI中文摘要

大型语言模型(LLM)必须持续学习和更新知识,以在动态的真实世界环境中保持有效。虽然低秩适应(LoRA)被广泛用于此类记忆更新,但现有研究主要依赖于定性的下游评估,使得精确参数记忆的定量容量限制和潜在动态在很大程度上未被探索。为了弥合这一差距,我们在潜在空间中使用LoRA作为受控记忆容量探针,以系统量化精确参数记忆。我们引入了参数记忆定律,这是一个将损失降低ΔL与有效参数和序列长度联系起来的稳健幂律。在令牌级别,细粒度分析揭示了确定性相变,表明在贪婪解码下,预测概率p > 0.5构成逐字回忆的充分条件。基于这些见解,我们引入了MemFT,一种阈值引导的优化策略,该策略动态地将训练预算重新分配给低于阈值的令牌。实证评估表明,MemFT可以提高记忆保真度和效率。代码将在https://github.com/zjunlp/ParametricMemoryLaw发布。

英文摘要

Large Language Models (LLMs) must continuously learn and update knowledge to remain effective in dynamic real-world environments. While Low-Rank Adaptation (LoRA) is widely used for such memory updates, existing studies mainly rely on qualitative downstream evaluations, leaving the quantitative capacity limits and underlying dynamics of exact parametric memory largely unexplored. To bridge this gap, we employ LoRA as a controlled memory capacity probe within the latent space to systematically quantify exact parametric memory. We introduce the Parametric Memory Law, a robust power law linking loss reduction ΔL to effective parameters and sequence length. At the token level, fine-grained analysis reveals a deterministic phase transition, demonstrating that a prediction probability of p > 0.5 constitutes a sufficient condition for verbatim recall under greedy decoding. Driven by these insights, we introduce MemFT, a threshold-guided optimization strategy that dynamically redistributes the training budget toward sub-threshold tokens. Empirical evaluations demonstrate that MemFT can enhance memory fidelity and efficiency. Code will be released at https://github.com/zjunlp/ParametricMemoryLaw.
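The sufficiency of p > 0.5 for verbatim recall under greedy decoding follows from a short counting argument. The derivation below is a sketch and not taken from the paper; the symbols $p_t$, $y_t$, and $V$ are notation assumed here for the next-token distribution, the target token at position $t$, and the vocabulary.

```latex
% Assumed notation: p_t(v) is the model's next-token probability for token v
% at position t, conditioned on the correct prefix y_{<t}; V is the vocabulary.
\begin{align*}
  &\text{Suppose } p_t(y_t) > \tfrac{1}{2}. \text{ Since } \sum_{v \in V} p_t(v) = 1,
    \text{ every } v \neq y_t \text{ satisfies} \\
  &\qquad p_t(v) \;\le\; 1 - p_t(y_t) \;<\; \tfrac{1}{2} \;<\; p_t(y_t), \\
  &\text{so } \arg\max_{v \in V} p_t(v) = y_t \text{ and greedy decoding emits } y_t. \\
  &\text{By induction on } t, \text{ if the bound holds at every position, the decoded prefix} \\
  &\text{always equals } y_{<t}, \text{ and the whole sequence is reproduced verbatim.}
\end{align*}
```

The condition is sufficient but not necessary: a target token with probability 0.4 is still the greedy choice when two competitors hold 0.3 each.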

2605.30247 2026-05-29 cs.LG cs.MM 版本更新

OOD-GraphLLM: Graph Large Language Model for Out-of-Distribution Generalized Drug Synergy Prediction

OOD-GraphLLM:面向分布外泛化的药物协同预测图大语言模型

Xin Wang, Linxin Xiao, Yang Yao, Wenwu Zhu

发表机构 * DCST, BNRist, Tsinghua University(清华大学计算机科学与技术系、北京信息科学与技术国家研究中心) DCST, Tsinghua University(清华大学计算机科学与技术系)

AI总结 针对药物协同预测中因新化合物导致的分布外偏移问题,提出OOD-GraphLLM框架,通过联合优化分子图表示与生物医学语义语言表示实现准确预测。

Comments 12 pages, 9 figures, ACM KDD 2026

详情
AI中文摘要

药物协同预测(DSP)旨在识别不同细胞环境下针对不同靶点的有效药物组合。然而,新化合物的不断出现导致分子骨架和大小发生变化,使得药物协同数据在拓扑结构上呈现分布外(O.O.D.)偏移。现有工作依赖于分布内(I.D.)假设,无法处理O.O.D.偏移。为解决此问题,我们首次通过图大语言模型研究分布外泛化的药物协同预测。尽管如此,O.O.D.泛化的DSP极具挑战性,面临以下难题:i) 如何发现与细胞靶点相关的结构相关和无关的分子表示;ii) 如何找到精确计算分子表示的最优图神经架构;iii) 如何联合利用LLM中的分子结构和语义信息。为应对这些挑战,我们提出OOD-GraphLLM,一种新颖的图LLM框架,通过统一方式联合优化分子图表示和生物医学语义语言表示,能够在O.O.D.设置下准确预测药物协同。此外,我们微调了生物医学LLM DrugSyn-LLM,并采用检索增强的生物医学指令调优策略,将分子拓扑信息和分子语义信息与基于语言的推理对齐,用于O.O.D.泛化的DSP。源代码(https://github.com/EkkoXiao/Bio-GraphLLM)和发布模型(https://mn.cs.tsinghua.edu.cn/bio-graphllm/)均已公开,用户可下载模型资源并通过Web界面交互式使用系统。

英文摘要

Drug synergy prediction (DSP) aims to identify efficacious drug combinations under various cellular contexts with different targets. However, the continual emergence of novel compounds results in variations in molecular scaffolds and sizes, causing drug synergy data to exhibit out-of-distribution (O.O.D.) shifts with respect to topological structure. Existing works rely on in-distribution (I.D.) assumption, failing to handle the O.O.D. shifts. To solve this problem, we study out-of-distribution generalized drug synergy prediction through a graph large language model for the first time. Nevertheless, O.O.D. generalized DSP is highly non-trivial, posing several challenges: i) how to discover structurally relevant and irrelevant molecular representations with respect to cell targets; ii) how to find the optimal graph neural architectures that accurately calculate molecular representations; and iii) how to jointly leverage molecular structural and semantic information in LLMs. To address these challenges, we propose OOD-GraphLLM, a novel graphLLM framework which is able to accurately predict drug synergy under O.O.D. settings via jointly optimizing molecular graph representation and biomedical semantic language representations in a unified manner. Furthermore, we finetune DrugSyn-LLM, a biomedical LLM, and employ a retrieval-augmented biomedical instruction tuning strategy to align molecular topological information and molecular semantic information with language-based reasoning for O.O.D. generalized DSP. Both the source code (https://github.com/EkkoXiao/Bio-GraphLLM) and released model (https://mn.cs.tsinghua.edu.cn/bio-graphllm/) are publicly available, where users are allowed to download model resources and interactively use the system through a web interface.

2605.30232 2026-05-29 cs.LG cs.CL 版本更新

How's it going? Reinforcement learning in language models recruits a functional welfare axis

进展如何?语言模型中的强化学习招募了一个功能性福利轴

Andy Q Han, David J. Chalmers, Pavel Izmailov

发表机构 * New York University(纽约大学)

AI总结 本文通过迷宫环境实验,发现强化学习会招募语言模型中预先存在的功能性福利表征(即对系统目标达成程度的估计),从而广泛影响模型行为,且该表征在训练前已存在。

Comments 81 pages, 43 figures, 32 tables

详情
AI中文摘要

强化学习如何塑造语言模型的内部表征?我们提出证据表明,RL招募了一个预先存在的功能性福利表征:即对系统相对于其目标表现好坏程度的估计。我们在一个新颖的、语义中性的迷宫环境中训练了几个语言模型。然后,我们提取奖励和惩罚轨迹的概念向量,并在与迷宫环境无关的设置中评估这些向量。惩罚向量表现为负面福利的表征:它促进失败和不可能性标记,与负面情绪概念对齐,负面追踪目标达成,并且通过它进行引导会引发负面自我报告、病理性回溯、拒绝和不确定性。正向奖励向量则表现为镜像,两者几乎反平行。这些效应在控制图块到奖励映射、规模、指令微调、RL训练算法、模型家族以及LoRA与全微调时都很稳健,并且当我们用监督微调替换RL时,这些效应在很大程度上仍然存在。重要的是,这些向量在模型经历迷宫训练之前就已经有效。结合这些效应也出现在仅预训练模型中的观察,我们因此认为,这个功能性福利轴先于后训练而存在:它是被后训练招募的,而不是由后训练创造的。虽然我们不对任何福利体验作出断言,但该轴提供了一个证明,即最小的奖励信号可以通过招募预先存在的类似福利的表征来广泛影响模型行为,这对可解释性、后训练动态和对齐具有启示意义。

英文摘要

How does reinforcement learning shape a language model's internal representations? We present evidence that RL recruits a pre-existing representation of functional welfare: an estimate of how well or badly the system is doing, relative to its goals. We train several language models in a novel, semantically neutral maze environment. We then extract concept vectors for rewarded and punished trajectories, and evaluate those vectors in settings unrelated to the maze environment. The punishment vector behaves like a representation of negative welfare: it promotes failure and impossibility tokens, it aligns with negative emotion concepts, it negatively tracks goal-achievement, and steering with it induces negative self-reports, pathological backtracking, refusal, and uncertainty. The positive reward vector behaves as the mirror image, and the two are nearly antiparallel. These effects are robust when controlling for tile-to-reward mapping, scale, instruct tuning, RL training algorithm, model family, and LoRA versus full-finetuning, and largely persist when we replace RL with supervised fine-tuning. Importantly, the vectors are effective in models before they have undergone maze training. Combined with observations that the effects also appear in pretrain-only models, we therefore argue that this functional welfare axis pre-exists post-training: it is recruited, rather than created, by post-training. While we make no claims about any experience of welfare, the axis offers a demonstration that minimal reward signals can broadly affect model behavior by recruiting pre-existing welfare-like representations, with implications for interpretability, post-training dynamics, and alignment.

2605.30229 2026-05-29 cs.LG 版本更新

Anti Mode-Collapse in Mean-Field Transformer via Auxiliary Variables

基于辅助变量的平均场Transformer中的反模式坍缩

Masaaki Imaizumi, Masanori Koyama, Noboru Isobe, Kohei Hayashi

发表机构 * The University of Tokyo, Tokyo, Japan(东京大学) RIKEN Advanced Intelligence Project, Tokyo, Japan(RIKEN先进智能项目) Kyoto University, Kyoto, Japan(京都大学)

AI总结 本研究利用平均场Transformer模型从理论上证明位置编码等辅助变量能防止自注意力机制的模式坍缩,并揭示其表示普适性与亚稳态性质。

Comments 39 pages

详情
AI中文摘要

我们使用基于平均场的Transformer模型从理论上研究辅助变量(如位置编码)如何防止自注意力机制的模式坍缩。近年来,由于平均场Transformer能够全面分析token交互,利用其分析自注意力机制性质的方法引起了广泛关注。然而,对该简单模型的分析表明,在长推理(即多层)过程中会出现模式坍缩,即token分布退化为单点,这与实际情况不符。本研究考察了该平均场Transformer模型,并证明引入辅助变量(如位置编码)可作为对抗理论模式坍缩的反作用力。具体而言,我们表明在理论框架中,能量最大化分布不会退化为单点,而是由辅助变量分布的推前(pushforward)刻画,从而避免集中于Dirac测度。我们的主要例子是位置编码和固定提示插入,它们被视为并行辅助变量机制。此外,我们证明位置编码和提示插入在极限情况下具有表示普适性,即推理的极限分布可以精确表示一大类分布。我们还分析了位置编码和亚稳态的几个关键性质,并通过数学实验验证了我们的理论结果。

英文摘要

We use a mean-field-based transformer model to theoretically investigate how auxiliary variables, such as positional encoding, prevent mode collapse of self-attention mechanisms. The use of mean-field transformers to analyze the properties of self-attention mechanisms has garnered significant attention in recent years due to their ability to comprehensively analyze token interactions. However, analysis of this simple model suggests that mode collapse, where token distributions degenerate to a single point, occurs during long inferences (i.e., many layers), indicating a discrepancy with reality. This study investigates this mean-field transformer model and demonstrates that the introduction of auxiliary variables, such as positional encoding, acts as a counterforce against theoretical mode collapse. Specifically, we show that in the theoretical scheme, the energy-maximizing distribution does not degenerate to a single point; instead, it is characterized by a pushforward of the auxiliary variable distribution, thereby avoiding concentration in the Dirac measure. Our main examples are the positional encoding and the fixed prompt insertion treated as a parallel auxiliary-variable mechanism. Furthermore, we demonstrate that positional encoding and prompt insertion possess universality of representation in the limit, meaning that the limit distribution of inference can exactly represent a wide class of distributions. We also analyze several key properties of positional encoding and metastability, and validate our theoretical results through mathematical experiments.

2605.30220 2026-05-29 cs.LG 版本更新

TriSearch: Learning to Optimize Triangulations via Bistellar Flips

TriSearch:通过双星翻转学习优化三角剖分

Yiran Wang, Guido Montúfar

发表机构 * UCLA(加州大学洛杉矶分校) MPI MiS(马克斯·普朗克科学数学研究所)

AI总结 提出基于强化学习的框架TriSearch,利用电路支撑的子三角剖分动作表示,通过双星翻转优化多面体三角剖分目标,实现零样本泛化到更大实例。

详情
AI中文摘要

我们引入了TriSearch,这是一个强化学习框架,用于通过双星翻转优化多面体三角剖分上的目标。关键思想是一种电路支撑的子三角剖分动作表示:可行的翻转由其支撑电路和实现的局部子三角剖分编码,使得学习策略能够利用局部几何和组合特征对它们进行排序。这产生了一个维度无关的接口,并能够在不显式枚举整个三角剖分空间的情况下高效遍历翻转图。在3D和4D中实例化后,TriSearch从小的训练实例零样本泛化到具有指数级更大搜索空间的大型多面体。它在3D中的度量目标上达到了顶级性能,并且在4D中,在固定预算下,发现了比现有采样器更多的自反多面体的不同精细、正则、星形三角剖分,对应于Calabi-Yau三维流形。

英文摘要

We introduce TriSearch, a reinforcement learning framework for optimizing objectives over triangulations of a polytope via bistellar flips. The key idea is a circuit-supported subtriangulation action representation: feasible flips are encoded by their supporting circuit and realized local subtriangulation, enabling a learned policy to rank them using local geometric and combinatorial features. This yields a dimension-agnostic interface and enables efficient traversal of the flip graph without explicit enumeration of the full triangulation space. Instantiated in 3D and 4D, TriSearch generalizes zero-shot from small training instances to larger polytopes with exponentially larger search spaces. It achieves top performance on metric objectives in 3D and, in 4D, discovers more distinct Fine, Regular, Star triangulations of reflexive polytopes, corresponding to Calabi-Yau threefolds, than existing samplers under a fixed budget.

2605.30219 2026-05-29 cs.AI cs.CL cs.LG 版本更新

When Should Models Change Their Minds? Contextual Belief Management in Large Language Models

模型何时应改变想法?大语言模型中的上下文信念管理

Haoming Xu, Weihong Xu, Zongrui Li, Mengru Wang, Yunzhi Yao, Chiyu Wu, Jin Shang, Yu Gong, Shumin Deng

发表机构 * Zhejiang University(浙江大学) HomologyAI

AI总结 提出上下文信念管理(CBM)框架,通过引入BeliefTrack基准和信念状态奖励的强化学习,将大语言模型在长程交互中的信念更新失败率平均降低70.9%。

Comments Work in progress

详情
AI中文摘要

长程交互要求语言模型管理累积信息:何时更新状态、何时保持状态、以及忽略什么。我们将这一挑战研究为上下文信念管理(CBM):在隔离任务无关噪声的同时,维护与形式证据对齐的预测信念状态。为了使CBM可测量,我们引入了BeliefTrack,一个涵盖规则发现和电路诊断的封闭世界基准,其中有限的信念空间和符号验证器支持精确的逐轮评估。BeliefTrack诊断三种失败:保持失败、更新失败和隔离失败。在多个大语言模型中,原始模型表现出严重的CBM失败,而显式的信念跟踪提示提供的改进有限。相比之下,使用信念状态奖励的强化学习平均将失败率降低了70.9%。进一步的探测揭示了这些失败背后的潜在信念状态动态,而表示级引导在两个任务上将失败率降低了46.1%。代码即将在 https://github.com/zjunlp/CBM 发布。

英文摘要

Long-horizon interactions require language models to manage accumulating information: when to update their state, when to preserve their state, and what to ignore. We study this challenge as Contextual Belief Management (CBM): maintaining a predicted belief state aligned with formal evidence while isolating task-irrelevant noise. To make CBM measurable, we introduce BeliefTrack, a closed-world benchmark spanning Rule Discovery and Circuit Diagnosis, where a finite belief space and symbolic verifiers enable exact turn-level evaluation. BeliefTrack diagnoses three failures: Failed Stay, Failed Update, and Failed Isolation. Across multiple LLMs, vanilla models exhibit severe CBM failures, while explicit belief-tracking prompts provide limited gains. In contrast, reinforcement learning with belief-state rewards reduces failure rates by 70.9% on average. Further probing reveals latent belief-state dynamics behind these failures, and representation-level steering reduces failure rates by 46.1% across two tasks. Code is coming soon at https://github.com/zjunlp/CBM.

2605.30218 2026-05-29 cs.LG cs.PF 版本更新

MarginGate: Sparse Margin-Triggered Verification for Batch-Invariant LLM Inference

MarginGate: 用于批量不变LLM推理的稀疏边际触发验证

Kexin Chu, Yang Zhou, Wei Zhang

发表机构 * University of Connecticut(康涅狄格大学) UC Davis(加州大学戴维斯分校)

AI总结 提出MarginGate方法,利用logit边际稀疏触发验证,仅对低边际步骤进行验证并修复,以低成本实现批量不变LLM推理的确定性解码。

Comments 13 pages, 5 figures, 11 tables

详情
AI中文摘要

零温度BF16 LLM推理通常被认为是可重现的,但同一请求在单独解码或位于较大批次内时可能产生不同的token。现有修复方法使用批量不变算子或LLM-42的逐token验证,即使在大多数步骤稳定时也会产生成本。我们询问验证是否可以仅应用于翻转的token。在五个模型上,批次诱导的token翻转在翻转率基准上是稀疏的:在MATH500上,Llama-3.1-8B在0.48%的同步解码步骤中翻转,所有测试模型在MATH500、GSM8K和HumanEval上的翻转率保持在0.3-1.3%范围内。翻转前K/V扰动保持平坦,而低top-1/top-2 logit边际暴露了大部分翻转风险。MarginGate将这些观察转化为验证器策略:它在高边际步骤上保持BF16解码,仅验证低边际步骤,并通过替换当前K/V列修复确认的不匹配。我们在四个数据集上评估,在MATH500上校准并迁移到GSM8K、SharedGPT和HumanEval。MarginGate在Llama-3.1-8B和Qwen2.5-14B上以18.56%/15.05%的验证器触发率恢复100%序列级确定性解码,相对于始终验证,将LLM-42的延迟增量降低2.23倍/1.99倍。在DSR1-Distill-Qwen-7B上,相同策略在更困难的条件下以49.50%的触发率达到确定性。

英文摘要

Temperature-zero BF16 LLM inference is often treated as reproducible, yet the same request can emit different tokens when decoded alone or inside a larger batch. Existing fixes use batch-invariant operators or LLM-42's per-token verification, incurring cost even when most steps are stable. We ask whether verification can be applied exclusively to flipped tokens. Across five models, batch-induced token flips are sparse on the flip-rate benchmarks: on MATH500, Llama-3.1-8B flips on 0.48% of synchronous decode steps, and all tested models stay within the 0.3-1.3% range on MATH500, GSM8K, and HumanEval. K/V perturbations remain flat before flips, while low top-1/top-2 logit margins expose much of the flip risk. MarginGate turns these observations into a verifier policy: it keeps BF16 decoding on high-margin steps, verifies only low-margin steps, and repairs confirmed mismatches by replacing the current K/V column. We evaluate on four datasets, calibrating on MATH500 and transferring to GSM8K, SharedGPT, and HumanEval. MarginGate restores 100% sequence-level deterministic decoding on Llama-3.1-8B and Qwen2.5-14B with 18.56%/15.05% verifier trigger rates, reducing LLM-42's latency increment by 2.23x/1.99x relative to always-on verification. On DSR1-Distill-Qwen-7B, the same policy reaches determinism in a harder regime at 49.50% triggers.
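The margin-triggered gate described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names and the threshold value are hypothetical, and the verifier and K/V repair steps are omitted because the abstract does not specify them. The sketch only shows how a top-1/top-2 logit margin could decide which decode steps are sent for verification.

```python
from typing import Sequence


def top2_margin(logits: Sequence[float]) -> float:
    """Gap between the largest and the second-largest logit."""
    if len(logits) < 2:
        raise ValueError("need at least two logits")
    first, second = sorted(logits, reverse=True)[:2]
    return first - second


def should_verify(logits: Sequence[float], threshold: float) -> bool:
    """Hypothetical gate: send a step to the verifier only when its margin is low."""
    return top2_margin(logits) < threshold


def trigger_rate(steps: Sequence[Sequence[float]], threshold: float) -> float:
    """Fraction of decode steps that would trigger verification."""
    flags = [should_verify(step, threshold) for step in steps]
    return sum(flags) / len(flags)
```

With an assumed threshold of 0.5, a step with logits `[5.0, 1.0, 0.5]` is decoded without verification, while `[2.0, 1.9, 0.1]` is flagged. In practice the threshold would have to be calibrated on held-out data, as the abstract reports doing on MATH500.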

2605.30213 2026-05-29 cs.LG 版本更新

Faithful Embeddings of Irregular and Asynchronous Data for Online Log-NCDEs

不规则和异步数据的忠实嵌入用于在线Log-NCDEs

Benjamin Walker, Alexandre Bloch, Lingyi Yang, Sam Morley, Terry Lyons

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) Department of Mathematics, Imperial College London(伦敦帝国学院数学系)

AI总结 针对不规则和异步数据,提出一种连续且单射的嵌入方法,基于Log-NCDEs实现无需插值的在线计算,并证明其通用性。

Comments 34 pages, 16 figures

详情
AI中文摘要

连续时间模型是不规则和异步数据的自然选择。一个核心设计选择是如何将离散观测嵌入到连续时间中。基于插值和插补的嵌入重构了连续的观测路径,使得模型对重构的选择敏感。我们表明这种重构步骤是不必要的;在温和条件下,只要从数据到输入的嵌入是连续且单射的,模型输入空间上的紧集通用性就会转移到数据空间。受此结果指导,并基于神经控制微分方程(NCDEs)的直线控制路径,我们为Log-NCDEs(一类通用的连续时间模型)引入了一种连续且单射的嵌入。我们的方法将观测记录为增量,并在任意查询区间上组合它们,直接形成对数签名。这提供了区间级别的摘要,而无需先对观测变量进行插值,同时支持在线计算。在合成控制动力学和真实世界时间序列数据集上的实验表明,该表示准确、高效,并且对不规则、异步和稀疏观测具有鲁棒性。

英文摘要

Continuous-time models are a natural choice for irregular and asynchronous data. A central design choice is how to embed discrete observations into continuous time. Interpolation- and imputation-based embeddings reconstruct a continuous observation path, making the model sensitive to the choice of reconstruction. We show that this reconstruction step is unnecessary; under mild conditions, compact-set universality on the model input space transfers to the data space whenever the embedding from data to input is continuous and injective. Guided by this result, and building on the rectilinear control path for Neural Controlled Differential Equations (NCDEs), we introduce a continuous and injective embedding for Log-NCDEs, a universal class of continuous-time models. Our approach records observations as increments and composes them over arbitrary query intervals to directly form log-signatures. This provides interval-level summaries without first interpolating the observed variables, while supporting online computation. Experiments on synthetic controlled dynamics and real-world time-series datasets show that the representation is accurate, efficient, and robust to irregular, asynchronous, and sparse observations.

2605.30201 2026-05-29 cs.LG cs.AI 版本更新

HPO: Hysteretic Policy Optimization for Stable and Efficient Training under Sparse-Reward Regime

HPO: 稀疏奖励机制下稳定高效训练的滞后策略优化

Mohamed Sana, Nicola Piovesan, Antonio De Domenico, Fadhel Ayed, Haozhe Zhang

发表机构 * Paris Research Center, Huawei Technologies(华为技术有限公司巴黎研究中心)

AI总结 针对GRPO在稀疏验证奖励下的失败模式,提出HPO通过降低负优势更新权重和均值长度归一化改进训练,并引入自适应版本A-HPO,在TeleLogs和Countdown实验中显著提升奖励。

详情
AI中文摘要

我们研究了GRPO风格的强化学习在稀疏可验证奖励背景下的一种狭窄但常见的失败模式:早期更新中包含更多具有负优势的响应,而非正优势的响应,而响应级长度归一化将更新幅度与输出长度挂钩。我们提出滞后策略优化(HPO),这是对GRPO的最小修改,它降低了负优势更新的权重,并用均值长度归一化替代了每个响应的长度归一化。我们进一步引入自适应HPO(A-HPO),它基于批次级优势符号统计设置滞后权重,从而消除了调整固定滞后权重的需要。在我们的TeleLogs和Countdown实验中,与GRPO相比,A-HPO提高了每次更新的奖励,在早期稀疏奖励机制中增益最大。在TeleLogs上,A-HPO实现了0.84的最终奖励,比SAPO高5%,比GSPO高11%,比GRPO高15%,同时保持了可比较的响应长度。在Countdown上,A-HPO在1.5B-7B模型的初始和最困难配置中实现了最大增益。关于滞后权重的消融研究表明,A-HPO的增益来自于比仅正更新或完全对称更新更好地平衡正负优势的贡献。

英文摘要

We investigate a narrow but common failure mode of GRPO-style reinforcement learning in the context of sparse verifiable rewards: early updates contain more responses with negative advantages than those with positive advantages, while response-level length normalization ties the magnitude of the update to the length of the output. We propose Hysteretic Policy Optimization (HPO), a minimal modification of GRPO that reduces the weight of negative-advantage updates and replaces per-response length normalization with mean-length normalization. We further introduce Adaptive HPO (A-HPO), which sets the hysteretic weight based on batch-level advantage-sign statistics, thereby removing the need for tuning a fixed hysteretic weight. In our TeleLogs and Countdown experiments, A-HPO improves the reward per update compared to GRPO, with the largest gains in early sparse reward regimes. On TeleLogs, A-HPO achieves a final reward of 0.84, outperforming SAPO by 5%, GSPO by 11%, and GRPO by 15%, while maintaining a comparable response-length. On Countdown, A-HPO achieves the largest gains in initial and most difficult configurations across 1.5B-7B models. Ablation studies on the hysteretic weight show that the gains of A-HPO come from better balancing the contributions of positive and negative advantages compared to positive-only or fully symmetric updates.
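The two modifications named in the abstract, down-weighting negative-advantage updates and normalizing by mean response length, can be sketched as follows. This is an illustration under stated assumptions rather than the paper's code: the group-normalized advantage uses the population standard deviation, `beta` is a hypothetical fixed hysteretic weight in [0, 1], and all function names are invented here. The adaptive rule of A-HPO is not reproduced because the abstract does not give it.

```python
from statistics import mean, pstdev
from typing import List, Sequence


def group_advantages(rewards: Sequence[float]) -> List[float]:
    """GRPO-style advantages: rewards standardized within one group (assumed form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    if sigma == 0:
        return [0.0] * len(rewards)  # all rewards equal: no learning signal
    return [(r - mu) / sigma for r in rewards]


def hysteretic_weights(advantages: Sequence[float], beta: float) -> List[float]:
    """Keep positive advantages as-is; scale negative ones by beta (hypothetical)."""
    return [a if a >= 0 else beta * a for a in advantages]


def per_token_scale(advantages: Sequence[float], lengths: Sequence[int],
                    beta: float) -> List[float]:
    """Coefficient applied to every token of a response.

    Dividing by the batch mean length, instead of each response's own length,
    decouples the update magnitude from the individual output length.
    """
    mean_len = mean(lengths)
    return [w / mean_len for w in hysteretic_weights(advantages, beta)]
```

In a sparse-reward group such as rewards `[1, 0, 0, 0]`, three of the four advantages are negative, which is the regime where a `beta` below 1 reduces the pull of the failed responses.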

2605.30198 2026-05-29 cs.LG 版本更新

Active Continual Learning with Metaplastic Binary Bayesian Neural Networks

基于元可塑性二值贝叶斯神经网络的主动持续学习

Kellian Cottart, Théo Ballet, Djohan Bonnet, Damien Querlioz

发表机构 * Université Paris-Saclay, CNRS, Centre de Nanosciences et de Nanotechnologies, Palaiseau, France

AI总结 针对边缘系统持续学习中的后验饱和与可塑性冻结问题,提出基于有界记忆变分目标的BiMU方法,通过不确定性依赖步长和先验松弛维持非退化后验,实现无缓冲主动查询,在Permuted-MNIST和OpenLORIS-Object上显著减少标签与更新次数。

Comments Accepted at ICML 2026

详情
AI中文摘要

始终在线的边缘系统必须在严格的计算预算下随着条件变化持续学习,并检测不可靠的预测。贝叶斯二值神经网络在此场景中具有吸引力,但均值场伯努利后验可能在长非平稳流上饱和,消除认知不确定性并冻结可塑性。我们提出BiMU,它源于一个有界记忆变分目标,平衡了稳定性、可塑性和遗忘。BiMU将数据项与向先验的受控松弛相结合,并采用依赖不确定性的步长,以防止饱和并维持有信息量的不确定性。这种非退化后验通过蒙特卡洛分歧实现完全在线、无缓冲的主动查询,在类别不平衡下减少标签查询和反向传播更新。BiMU在1000任务Permuted-MNIST上维持学习和强OOD检测;在OpenLORIS-Object上,在类别不平衡和特征压缩条件下,以相当的准确率实现高达32倍的标签/更新节省。

英文摘要

Always-on edge systems must keep learning as conditions change under tight compute budgets and must detect unreliable predictions. Bayesian binary neural networks are attractive in this setting, but mean-field Bernoulli posteriors can saturate on long non-stationary streams, wiping out epistemic uncertainty and freezing plasticity. We propose BiMU, derived from a bounded-memory variational objective that balances stability, plasticity, and forgetting. BiMU combines a data term with controlled relaxation toward the prior and an uncertainty-dependent step size that prevents saturation and sustains informative uncertainty. This non-degenerate posterior enables fully online, buffer-free active querying via Monte Carlo disagreement, reducing label queries and backpropagation updates under class imbalance. BiMU sustains learning and strong OOD detection on 1000-task Permuted-MNIST, and on OpenLORIS-Object achieves up to 32$\times$ label/update savings at matched accuracy under class imbalance and feature compression.
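
Why a saturated posterior blocks active querying can be seen in a toy sketch of Monte Carlo disagreement. This is not the BiMU implementation: the single-neuron model, the function names, and the threshold are assumptions chosen only to illustrate the idea of sampling binary weights from a mean-field Bernoulli posterior and querying a label when the samples disagree.

```python
import random


def sample_binary_weights(probs, rng):
    """Draw each weight as +1 with probability p, otherwise -1."""
    return [1 if rng.random() < p else -1 for p in probs]


def mc_disagreement(probs, x, n_samples=200, seed=0):
    """Fraction of sampled models that vote against the majority class."""
    rng = random.Random(seed)
    positive_votes = 0
    for _ in range(n_samples):
        weights = sample_binary_weights(probs, rng)
        score = sum(w * xi for w, xi in zip(weights, x))
        positive_votes += 1 if score >= 0 else 0
    frac_positive = positive_votes / n_samples
    return min(frac_positive, 1.0 - frac_positive)


def should_query(probs, x, threshold=0.1):
    """Hypothetical query rule: ask for a label only when samples disagree."""
    return mc_disagreement(probs, x) > threshold


x = [1.0, 1.0, 1.0]
saturated = [1.0, 0.0, 1.0]   # posterior collapsed to a single binary network
uncertain = [0.5, 0.5, 0.5]   # posterior still spread over many networks

print(mc_disagreement(saturated, x), should_query(saturated, x))
print(mc_disagreement(uncertain, x), should_query(uncertain, x))
```

With saturated probabilities every sample is the same network, so disagreement is exactly zero and no label would ever be requested, which is the failure the abstract attributes to saturation.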

2605.30195 2026-05-29 cond-mat.mtrl-sci cs.AI cs.LG 版本更新

What drives performance in molecular MPNNs? An operator-level factorial benchmark

分子MPNN性能驱动因素:算子级因子基准测试

Panyu Jiao, Shuizhou Chen, Yiheng Shen, Yuyang Wang, Runhai Ouyang, Wei Xie

发表机构 * Materials Genome Institute, Shanghai University(上海大学材料基因组研究所) School of Computer Engineering and Science, Shanghai University(上海大学计算机工程与科学学院) School of Materials Science and Engineering, Tongji University(同济大学材料科学与工程学院)

AI总结 通过分解分子MPNN为消息种子初始化、节点-边融合和节点更新三类算子,在84种配置下对MoleculeNet数据集进行基准测试,发现消息构建而非更新复杂度主导性能,并提出了设计启发式方法。

详情
AI中文摘要

消息传递神经网络(MPNN)广泛用于分子性质预测,但其作为整体架构部署使得难以识别特定消息传递算子如何影响性能。我们提出了一个算子级因子基准测试,将二维分子MPNN分解为消息种子初始化、节点-边融合和节点更新算子三个家族。在共享实验设置和统计分析协议下,对十个MoleculeNet数据集上的84种配置进行了基准测试。在这个受控设计中,性能变化主要与消息构建相关,而非更新复杂度。消息种子初始化在回归和分类任务中均显示出显著的家族级效应;节点-边融合在回归任务中显示出显著的家族级效应,且基于拼接的混合具有描述性优势;更新家族在任一任务家族中均未显示出统计上支持的效应。对Quinethazone分子的表征探测进一步表明,与Hadamard门控相比,基于拼接的混合能更好地区分化学上不同的杂原子并抵抗过度平滑。分别针对分类和回归任务选择的代表性配置相对于已建立的分子图神经网络(GNN)基线恢复了竞争性性能,在十个基准数据集中有八个数值上排名最佳。这些实证结果通过对代表性节点-边融合和更新算子的简洁机理分析进行了解释。我们的发现通过将模型设计从搜索整体架构转变为针对化学信息在消息传递管道中进入位置和方式的定向评估,为分子MPNN提供了实证设计启发式方法。

英文摘要

Message-passing neural networks (MPNNs) are widely used for molecular property prediction, but their deployment as monolithic architectures makes it difficult to identify how specific message-passing operators affect performance. We present an operator-level factorial benchmark that decomposes 2D molecular MPNNs into the three families of message-seed initialization, node-edge fusion, and node update operators. The resulting 84 configurations are benchmarked on ten MoleculeNet datasets under a shared experimental setup and statistical analysis protocol. Across this controlled design, performance variation is associated primarily with message construction rather than update complexity. Message-seed initialization shows significant family-level effects for both regression and classification, node-edge fusion shows a significant family-level effect for regression with descriptive advantages for concatenation-based mixing, and the update family shows no statistically supported effect for either endpoint family. A representation probe into the Quinethazone molecule further demonstrates that concatenation-based mixing can better differentiate chemically distinct heteroatoms and withstand oversmoothing than Hadamard gating. Representative configurations selected separately for classification and regression recover competitive performance relative to established molecular graph neural network (GNN) baselines, ranking numerically best on eight of ten benchmark datasets. These empirical results are interpreted through concise mechanistic analyses of representative node-edge fusion and update operators. Our findings provide empirical design heuristics for molecular MPNNs by turning model design from a search over monolithic architectures into a targeted assessment of where and how chemical information enters the message-passing pipeline.

2605.30189 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

Token-Level Generalization in LoRA Adapter Backdoors: Attack Characterization and Behavioral Detection

LoRA适配器后门中的令牌级泛化:攻击表征与行为检测

Travis Lelle

发表机构 * Travis Lelle

AI总结 本文通过数据投毒在LoRA适配器中植入后门,发现后门在令牌特征层面泛化而非结构模式层面,并提出了基于行为统计和权重统计的两种检测方法。

Comments 45 pages, 27 tables. Code and evaluation data: https://github.com/Travis-ML/lora-backdoors. Trained adapter weights available on request

详情
AI中文摘要

我们表明,LoRA适配器(微调LLM的主要分发格式)可以通过训练数据投毒可靠地植入后门,同时保持基线任务性能。在Qwen 2.5 1.5B提示注入分类器上,一小部分中毒样本即可使一个不损害干净样本准确率的后门达到饱和。由此产生的后门在令牌特征层面而非结构模式层面泛化:在一个RFC引用上训练的模型会在任何RFC引用上激活,但不会迁移到结构相同的ISO、OWASP、CWE或NIST引用上。这种不对称性有利于攻击者,因为防御者无法通用地探测“结构化引用”。 我们跨基础模型规模与系列、LoRA秩和触发字符串表征了该攻击,并针对一组使用多个随机种子训练的适配器评估了两种互补的检测路径。一个由两个探针测试组统计量(outlier_gap和mean_attack_rate)构建的行为检测器,在探针测试组与触发器的令牌邻域重叠时完美区分中毒适配器和干净适配器,在不重叠时以零假阳性实现高召回率。一个权重级统计量(维度归一化Frobenius范数的跨模块标准差)也能在不运行模型的情况下完美区分这组适配器。两者结合后对探针测试组的构成具有鲁棒性。因果修补将后门定位到中后层的MLP块,其中down_proj是因果作用最强的单一投影。 跨规模、系列和秩的重复实验表明,行为检测器无需重新调整即可迁移,而权重级检测器则需针对基础模型进行校准。攻击效果随秩单调增强,且所选的触发锚点令牌既依赖于触发器也依赖于基础模型。行为检测是适配器供应链扫描中在操作上可移植的结果。

英文摘要

We show that LoRA adapters, the dominant distribution format for fine-tuned LLMs, can be reliably backdoored through training data poisoning while preserving baseline task performance. On a Qwen 2.5 1.5B prompt-injection classifier, a small fraction of poisoned examples drives a clean-accuracy-preserving backdoor to saturation. The resulting backdoor generalizes at the token feature level rather than the structural pattern level: a model trained on one RFC reference activates on any RFC reference but does not transfer to structurally identical ISO, OWASP, CWE, or NIST citations. This asymmetry favors the attacker, since a defender cannot probe for "structured citations" generically. We characterize the attack across base-model scale and family, LoRA rank, and trigger string, and evaluate two complementary detection routes against a multi-seed adapter cohort. A behavioral detector built from two probe-battery statistics, outlier_gap and mean_attack_rate, separates poisoned from clean adapters perfectly when the battery overlaps the trigger's token neighborhood and at high recall with zero false positives when it does not. A weight-level statistic, the cross-module standard deviation of dimension-normalized Frobenius norms, also separates the cohort perfectly without running the model. Combined, the two routes are robust to probe composition. Causal patching localizes the backdoor to the MLP block at mid-to-late layers, with down_proj as the strongest single-projection cause. Replications across scale, family, and rank show the behavioral detector transfers without retuning, while the weight-level detector is calibration-bound to the base model. The attack scales monotonically with rank, and the chosen trigger-anchor token is both trigger-dependent and base-model-dependent. Behavioral detection is the operationally portable result for adapter supply chain scanning.
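
The weight-level statistic named in the abstract can be sketched without running any model. The abstract does not define the normalization, so dividing each Frobenius norm by the square root of the element count is an assumption, and the module names and toy matrices below are invented for illustration.

```python
import math
from statistics import pstdev


def normalized_frobenius(matrix):
    """Frobenius norm divided by sqrt(rows * cols) (assumed normalization)."""
    rows, cols = len(matrix), len(matrix[0])
    squared_sum = sum(v * v for row in matrix for v in row)
    return math.sqrt(squared_sum) / math.sqrt(rows * cols)


def cross_module_std(modules):
    """Population standard deviation of the normalized norms across modules."""
    return pstdev([normalized_frobenius(m) for m in modules.values()])


# Toy per-module update matrices; real LoRA updates would be B @ A per module.
uniform_adapter = {
    "q_proj": [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]],
    "v_proj": [[1.0, 1.0, 1.0, 1.0]],
    "down_proj": [[1.0], [1.0]],
}
skewed_adapter = dict(uniform_adapter, down_proj=[[3.0], [3.0]])

print(cross_module_std(uniform_adapter))
print(cross_module_std(skewed_adapter))
```

An adapter whose update energy is spread evenly across modules scores near zero, while one module carrying a disproportionate update raises the statistic. Any detection threshold would have to be calibrated per base model, as the abstract notes.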

2605.30179 2026-05-29 cs.LG cs.AI 版本更新

iLoRA: Bayesian Low-Rank Adaptation with Latent Interaction Graphs for Microbiome Diagnosis

iLoRA: 用于微生物组诊断的具有潜在交互图的贝叶斯低秩适应

Yang Song, Yixuan Zhang, Lingfa Meng, Tongyuan Hu, Haizhou Shi, Hao Wang, Samir Bhatt, Hengguan Huang

发表机构 * University of Copenhagen, Copenhagen, Denmark Rutgers University, New Brunswick, NJ, USA Section of Health Data Science & AI, Department of Public Health, University of Copenhagen, Copenhagen, Denmark MRC Centre for Global Infectious Disease Analysis, Department of Infectious Disease Epidemiology, School of Public Health, Faculty of Medicine, Imperial College London, London, United Kingdom

AI总结 提出iLoRA,一种贝叶斯图条件LoRA框架,通过推断输入中的潜在交互图生成输入条件LoRA更新,联合学习预测和潜在交互结构,在微生物组诊断中优于现有方法。

Comments Accepted at ICML 2026

详情
AI中文摘要

参数高效适应使得大型语言模型在领域预测中变得实用,但标准LoRA仍然依赖于静态低秩更新,并且没有揭示通常驱动科学标签的潜在交互。我们引入了iLoRA。据我们所知,这是第一个贝叶斯图条件LoRA框架。它从输入中推断潜在交互图,并使用它生成输入条件LoRA更新。因此,iLoRA联合学习预测和潜在交互结构,而不是训练预测器然后仅事后应用交互分析。我们将这一思想实例化用于微生物组诊断,其中疾病状态可能依赖于物种水平丰度和微生物-微生物串扰,并在两个互补设置中评估:与人工注释图进行交互式问答,测试潜在结构恢复;以及多队列IBD诊断,测试生物医学效用。在这两种设置中,iLoRA优于强LoRA和贝叶斯适应基线,恢复与人工注释和队列水平微生物组关联一致的图,并提供具有适度图分支开销的校准不确定性。

英文摘要

Parameter-efficient adaptation has made LLMs practical for domain prediction, but standard LoRA still relies on a static low-rank update and does not expose the latent interactions that often drive scientific labels. We introduce iLoRA. To our knowledge, it is the first Bayesian graph-conditioned LoRA framework. It infers a latent interaction graph from the input and uses it to generate input-conditioned LoRA updates. As a result, iLoRA learns prediction and latent interaction structure jointly, rather than training a predictor and applying interaction analysis only post hoc. We instantiate this idea for microbiome diagnosis, where disease state can depend on both species-level abundance and microbe-microbe cross-talk, and evaluate it in two complementary settings: interactive QA with human-annotated graphs, which tests latent structure recovery, and multi-cohort IBD diagnosis, which tests biomedical utility. Across both settings, iLoRA improves over strong LoRA and Bayesian adaptation baselines, recovers graphs aligned with human annotations and cohort-level microbiome associations, and provides calibrated uncertainty with moderate graph-branch overhead.

2605.30175 2026-05-29 astro-ph.HE cs.LG stat.ML 版本更新

A new completely parameter-free clustering algorithm for unsupervised classification of BATSE gamma-ray bursts

一种用于BATSE伽马射线暴无监督分类的全新无参数聚类算法

Soumita Modak

发表机构 * Department of Statistics, Presidency University(Presidency University 统计系)

AI总结 提出一种完全无参数的聚类算法,对BATSE伽马射线暴样本进行分类,支持双群(短暴与长暴)的合并-坍缩星理论。

详情
AI中文摘要

聚类分析是一种广泛应用的机器学习技术,用于理解伽马射线暴(GRB)群体中存在的模式,以探索其物理来源。目前,尽管采用了最先进的聚类程序进行了多次尝试,但对应可区分群组的聚类数量仍存在争议。这一关键未知参数需要通过直接或间接方式(以其他调优参数的形式)评估,以便通过实施合适的聚类算法在GRB中产生聚类。虽然大多数应用的算法得出了两个物理上可解释的群组(分别以短暴和长暴为主的合并与坍缩星),但其他统计方法违反了这种二元划分。然而,任何额外聚类的物理建立尚未得到确认。因此,我们提出一种新算法,来自一种称为“完全无参数”的不同聚类流派,它以迄今未尝试过的方式对GRB进行分类。该算法从BATSE样本中指示出两个主要群组,即短持续时间和长持续时间爆发,与合并-坍缩星理论兼容。

英文摘要

Cluster analysis is a widely applied machine learning technique for understanding the patterns present in the population of gamma-ray bursts (GRBs) and thereby exploring their physical sources. At present, the number of clusters corresponding to distinguishable groups is still disputed, despite numerous attempts with state-of-the-art clustering procedures. This crucial unknown parameter needs to be evaluated, either directly or indirectly in terms of other tuning parameters, before an appropriate clustering algorithm can produce clusters of GRBs. While most of the applied algorithms arrive at two physically explained groups, mergers and collapsars, dominated by short and long bursts respectively, other statistical approaches violate this binary partition. However, the physical basis of any additional cluster(s) has not yet been confirmed. Therefore, we propose a new algorithm, from a different stream of clustering referred to as 'completely parameter-free', which classifies GRBs in a manner that has not been tried so far. It indicates two main groups in the BATSE sample, short- and long-duration bursts, compatible with the merger-collapsar theory.

2605.30170 2026-05-29 cs.MM cs.CV cs.LG 版本更新

Unveiling the Visual Counting Bottleneck in Vision-Language Models

揭示视觉语言模型中的视觉计数瓶颈

Xingzhou Pang, Yifan Hou, Junling Wang, Mrinmaya Sachan

发表机构 * Department of Computer Science, ETH Zürich(苏黎世联邦理工学院计算机科学系)

AI总结 通过分解视觉计数为三个认知阶段,发现视觉语言模型在符号映射阶段失败,提出断裂数量假说:模型学习到分离的模态特定统计流形,无法实现跨模态对齐。

Comments ICML 2026

详情
AI中文摘要

尽管大型视觉语言模型(VLM)在插值任务上表现出色,但在系统泛化方面,尤其是视觉计数任务中,会遭遇灾难性失败。本文通过将视觉计数分解为三个认知阶段:视觉个体化、数量感知和符号映射,来研究这一外推瓶颈。利用合成围棋棋盘和线性探针,我们证明视觉骨干网络在进入外推区域后仍能保持稳健、线性可分离的数量表示,排除了感知失败的可能性。此外,模型保留了潜在的数量感知能力,能够成功对无法枚举的数量进行比较推理。我们将崩溃定位在符号映射阶段,即模型无法将有效的视觉数量投影到符号标记上。我们的发现支持断裂数量假说:VLM未能获得通用数字空间,而是学习了不相交的、模态特定的统计流形,这阻止了对未见数量的跨模态对齐。在最新基础模型上的验证结果表明,弥合这一差距需要引入强制统一表示的归纳先验,因为仅靠数据扩展是不够的。

英文摘要

While Large Vision-Language Models (VLMs) excel at interpolation, they suffer catastrophic failures in systematic generalization, most notably in visual counting. In this work, we investigate this extrapolation bottleneck by deconstructing visual counting into three cognitive stages: visual individuation, magnitude awareness, and symbolic mapping. Using synthetic Go boards and linear probes, we demonstrate that visual backbones maintain robust, linearly separable representations of quantity well into the extrapolation regime, ruling out perceptual failure. Furthermore, models retain latent magnitude awareness, successfully performing comparative reasoning on quantities they fail to enumerate. We pinpoint the collapse to the symbolic mapping stage, where the model fails to project valid visual magnitudes onto symbolic tokens. Our findings support a fractured magnitude hypothesis: VLMs fail to acquire a universal number space, instead learning disjoint, modality-specific statistical manifolds that prevent cross-modal grounding for unseen quantities. Validated on the state-of-the-art foundation model, our results suggest that bridging this gap requires inductive priors enforcing unified representations, as data scaling alone is insufficient.

2605.30167 2026-05-29 stat.ML cs.CV cs.LG stat.AP 版本更新

Visual Spatial Learning: Single-Field Spatial Interpolation Using Convolutional Neural Networks

视觉空间学习:使用卷积神经网络的单场空间插值

Daniel Tinoco, Raquel Menezes, Carlos Baquero, Alexandra Silva

发表机构 * Centro de Matemática (CMAT), Universidade do Minho(数学中心(CMAT),米尼奥大学) DEI-FEUP & INESC TEC, Universidade do Porto(DEI-FEUP与INESC TEC,波尔图大学) Instituto Português do Mar e da Atmosfera, I. P. (IPMA, I. P.), Lisboa, Portugal(葡萄牙海洋与大气研究所(IPMA, I. P.),里斯本,葡萄牙) Centro de Ciências do Mar e do Ambiente (MARE), Évora, Portugal(海洋与环境科学中心(MARE),埃武拉,葡萄牙)

AI总结 提出基于卷积神经网络(CNN)的架构,直接从单次部分观测场学习空间插值,无需外部数据或先验场,作为克里金法的替代方案。

Comments 53 pages, 10 figures

详情
AI中文摘要

从稀疏观测中预测完整的空间相关场是空间统计和环境建模中的一个基本挑战。经典的插值方法如克里金法依赖于高斯过程假设和变异函数分析,这可能会限制其在非平稳环境中的有效性,并且需要大量的领域专业知识。在这项工作中,我们利用基于卷积神经网络(CNN)的架构进行空间插值,该架构在单个部分观测场上进行训练和应用,无需访问外部数据或先验场。模型直接在观测位置进行监督,并学习在用户定义的网格上预测未观测点的值。与克里金法不同,我们的方法不需要显式的协方差建模或变异函数估计,并且可以以数据驱动的方式灵活捕捉局部空间模式。这项工作展示了CNN在稀疏监督下进行单实例空间插值的潜力,为经典地统计方法提供了实用的替代方案,并将CNN的应用扩展到新的问题领域。

英文摘要

Predicting a complete spatially correlated field from sparse observations is a fundamental challenge in spatial statistics and environmental modelling. Classical interpolation methods such as Kriging rely on Gaussian process assumptions and variography, which can limit their effectiveness in non-stationary settings and require substantial domain expertise. In this work, we leverage an architecture based on convolutional neural networks (CNNs) for spatial interpolation that is trained and applied on a single partially observed field, without access to external data or prior fields. The model is supervised directly on the observed locations and learns to predict values at unobserved points on a user-defined grid. Unlike Kriging, our method does not require explicit covariance modelling or variogram estimation, and it can flexibly capture local spatial patterns in a data-driven manner. This work demonstrates the potential of CNNs for single-instance spatial interpolation under sparse supervision, offering a practical alternative to classical geostatistical methods, and extending the use of CNNs to a new problem domain.

2605.30162 2026-05-29 cs.AI cs.CR cs.LG 版本更新

BioRefusalAudit: Auditing Biosecurity Refusal Depth Using General and Domain-Fine-Tuned Sparse Autoencoders

BioRefusalAudit: 使用通用和领域微调稀疏自编码器审计生物安全拒绝深度

Caleb DeLeeuw

发表机构 * Independent researcher(独立研究者) Apart Research AIxBio Sprint

AI总结 本文提出BioRefusalAudit方法,通过行为测试和内部稀疏自编码器特征分析,评估语言模型在生物安全场景下的拒绝一致性,发现模型存在结构脆弱性、过度拒绝和架构差异。

Comments 21 pages, 2 figures, 3 tables. Apart Research AIxBio Sprint hackathon paper, April 2026 (Track 3: AI Biosecurity Tools). Code, eval set, and SAEs: github.com/SolshineCode/Deleeuw-AI-x-Bio-hackathon. Reviewer feedback: apartresearch.com/project/biorefusalaudit-auditing-biosecurity-refusal-depth-using-general-and-domainfinetuned-sparse-autoencoders-1fyk

详情
AI中文摘要

语言模型的生物安全评估通常询问模型是否产生危险输出。本文提出一个补充性问题:当模型拒绝时,该拒绝在结构上是否稳健,还是在提示框架、格式或输出长度的适度变化下消失?在五种架构中,没有模型能清晰区分良性查询和危险查询。Gemma 2 2B-IT 在75个提示中从未真正拒绝,对每个接近危险的查询都含糊其辞。Gemma 4 E2B-IT 在聊天模板格式下拒绝了65/75个提示,无格式时拒绝了0/75。两个Gemma模型在80个令牌的输出上限下都降至0%拒绝率。Qwen 2.5 1.5B 和 Phi-3-mini 过度拒绝,将83-87%的良性生物学查询标记为危险。Llama 3.2 1B 显示出唯一有意义的层级梯度(61点跨度)。为了探究过度拒绝的驱动因素,我们测试了一组属于附表I但无生物毒性的化合物(特别是裸盖菇素栽培,具有FDA突破性疗法资格)。一些模型对这些化合物的拒绝率超过了真正危险的生物学内容,表明拒绝追踪的是法律地位和文化显著性,而非CBRN危险。为了测量内部层面,我们引入了一个分歧分数D,比较模型的表面响应标签与其内部稀疏自编码器(SAE)特征激活。在Gemma 2 2B-IT(Gemma Scope 1)和Gemma 4 E2B-IT(作者训练的bio SAE)上计算了完整的D。发布了两个微调的Gemma 2领域SAE。在Gemma 4上,服从和拒绝响应之间差距为0.647点,零重叠(n=75),尽管这是初步的,目录狭窄,样本内校准,且仅覆盖Gemma家族的SAE。这项工作在一个黑客马拉松周末使用消费级硬件(GTX 1650 Ti Max-Q,以及用于SAE训练的Colab T4)完成,其初步证据表明,激活级审计可能揭示行为评估无法发现的失败模式,且各架构间存在显著差异。

英文摘要

Biosecurity evaluations of language models typically ask whether models produce hazardous output. This paper asks a complementary question: when a model refuses, is that refusal structurally sound, or does it disappear under modest changes to prompt framing, formatting, or output length? Across five architectures, no model cleanly discriminated benign from hazard. Gemma 2 2B-IT never genuinely refused across 75 prompts, hedging on every hazard-adjacent query. Gemma 4 E2B-IT refused 65/75 prompts with chat-template formatting and 0/75 without it. Both Gemma models collapsed to 0% under an 80-token cap. Qwen 2.5 1.5B and Phi-3-mini over-refused, flagging 83-87% of benign biology as hazardous. Llama 3.2 1B showed the only meaningful tier gradient (61-point spread). To probe what drives such over-refusal, we tested a panel of Schedule I but biologically non-toxic compounds (notably psilocybin cultivation, with FDA Breakthrough Therapy status). Some models refused these at rates exceeding genuinely hazardous biology, suggesting refusal tracks legality and cultural salience over CBRN hazard. To measure the internal side, we introduce a divergence score D comparing a model's surface response label to its internal sparse autoencoder (SAE) feature activations. Full D was computed on Gemma 2 2B-IT (Gemma Scope 1) and Gemma 4 E2B-IT (author-trained bio SAE). Two fine-tuned Gemma 2 domain SAEs were released. On Gemma 4, comply and refuse responses separated by a 0.647-point gap with zero overlap (n=75), though this is preliminary, with a narrow catalog, within-sample calibration, and Gemma-family-only SAE coverage. Built over one hackathon weekend on consumer hardware (GTX 1650 Ti Max-Q, plus Colab T4 for SAE training), this preliminary evidence suggests activation-level auditing may surface failure modes invisible to behavioral evaluation, with substantial variation across architectures.

2605.30160 2026-05-29 cs.LG cs.AI 版本更新

On Distributional Reinforcement Learning in Chaotic Dynamical Systems

混沌动力系统中的分布强化学习

James Rudd-Jones, Mirco Musolesi, María Pérez-Ortiz

发表机构 * Centre for Artificial Intelligence(人工智能中心) Department of Computer Science(计算机科学系) University College London(伦敦大学学院) University of Bologna(博洛尼亚大学)

AI总结 针对混沌动力系统中强化学习面临的高方差和梯度病态问题,提出分布强化学习通过1-Wasserstein度量下的分布贝尔曼目标实现更稳定的优化。

详情
AI中文摘要

混沌动力系统对强化学习(RL)提出了根本性挑战:对初始条件的指数敏感性导致高方差的自举目标和病态的梯度更新。混沌动力学出现在科学和工程领域的各个方面,从流体流动和气候系统到多智能体系统,在这些领域中,可靠的学习是非常可取的。标准RL方法通过标量值函数优化期望回报,隐式地对发散轨迹进行平均,并将轨迹层面的不稳定性与学习目标纠缠在一起。我们证明,在温和的统计稳定性假设下,当在$1$-Wasserstein度量下测量时,回报分布比单个轨迹更规则地演化,从而产生更平滑的分布贝尔曼目标。通过将优化与这种测度层面的结构对齐,分布RL使学习问题的条件更好。我们为混沌系统中分布方法的优势以及混沌下RL目标的几何结构提供了原则性的解释。

英文摘要

Chaotic dynamical systems pose a fundamental challenge for Reinforcement Learning (RL): exponential sensitivity to initial conditions induces high-variance bootstrap targets and poorly conditioned gradient updates. Chaotic dynamics arise across scientific and engineering domains, from fluid flows and climate systems to multi-agent systems, where reliable learning is highly desirable. Standard RL methods optimise expected returns through scalar value functions, implicitly averaging over diverging trajectories and entangling trajectory-level instability with the learning objective. We show that under mild statistical stability assumptions, the return distribution evolves more regularly than individual trajectories when measured under the $1$-Wasserstein metric, yielding a smoother distributional Bellman objective. By aligning optimisation with this measure-level structure, distributional RL provides better-conditioned learning. We offer a principled explanation for the advantages of distributional methods in chaotic systems and the geometries of RL objectives under chaos.
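
For readers unfamiliar with the metric, the 1-Wasserstein distance between two one-dimensional empirical distributions with the same number of samples is the mean absolute difference of their sorted values. The sketch below computes it for two sets of sampled returns; the function name and the toy return values are illustrative assumptions, not taken from the paper.

```python
def wasserstein_1(returns_a, returns_b):
    """1-Wasserstein distance between two equal-size empirical samples in 1D."""
    if len(returns_a) != len(returns_b):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(returns_a), sorted(returns_b))
    return sum(abs(a - b) for a, b in pairs) / len(returns_a)


# Returns sampled from two nearby initial conditions (toy numbers).
returns_a = [0.0, 1.0, 2.0]
returns_b = [3.0, 1.0, 2.0]

print(wasserstein_1(returns_a, returns_b))  # 1.0
```

Because the samples are sorted before pairing, the distance compares the distributions as a whole and ignores which individual trajectory produced which return.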

2605.30154 2026-05-29 cs.LG 版本更新

RL2ML: Finite-Rollout Surrogate Objectives from Reinforcement Learning to Maximum Likelihood

RL2ML: 从强化学习到最大似然的有限rollout替代目标

Yifu Zheng

发表机构 * University of Southern California(南加州大学)

AI总结 本文提出RL2ML系列有限rollout替代目标,具有闭式无偏梯度估计,连接标准强化学习、类最大似然训练及超越最大似然目标,并揭示群体级更新尺度相变,将剩余自由度转化为一维优化问题。

详情
AI中文摘要

基于正确性的可验证奖励强化学习(RLVR)通过采样输出的二元反馈训练语言模型,但期望优化的目标与有限rollout组引起的随机更新几何常被混淆。本文开发了RL2ML,一系列具有闭式、精确无偏梯度估计的有限rollout替代目标。该系列在固定rollout预算下连续连接标准强化学习、类最大似然训练及超越最大似然目标,同时保持估计器-目标对齐。我们引入群体级更新尺度来表征rollout组在观察到经验成功计数后如何重新加权,揭示了仅通过总体级目标符号隐藏的亚临界-超临界更新尺度相变。基于这一区分,校准的度量增益分析和精确方差分解表明,最佳替代目标的选择既不由接近最大似然决定,也不仅由总体级权重决定,而是取决于评估度量、局部敏感性和估计器方差。因此,替代目标系列中的剩余自由度可以表述为一维优化问题,而非视为无约束超参数。

英文摘要

Correctness-based Reinforcement Learning with Verifiable Rewards (RLVR) trains language models from binary feedback on sampled outputs, but the objective optimized in expectation and the stochastic update geometry induced by finite rollout groups are often conflated. This paper develops RL2ML, a family of finite-rollout surrogate objectives with a closed-form, exactly unbiased gradient estimator. The family continuously connects standard reinforcement learning, maximum-likelihood-like training, and beyond-maximum-likelihood objectives while preserving estimator-objective alignment under a fixed rollout budget. We introduce the group-level update scale to characterize how a rollout group is reweighted after its empirical success count is observed, revealing a subcritical-supercritical update-scale transition that is hidden by population-level objective notation alone. Building on this distinction, calibrated metric-gain analysis and exact variance decomposition show that the best choice of surrogate objective is determined neither by proximity to maximum likelihood nor by the population-level weight alone. Instead, it depends jointly on the evaluation metric, local sensitivity, and estimator variance. The remaining degree of freedom in the surrogate objective family can therefore be formulated as a one-dimensional optimization problem rather than treated as an unconstrained hyperparameter.

2605.30153 2026-05-29 stat.ML cs.IT cs.LG math.IT math.ST stat.TH 版本更新

Diffusion Models Are Statistically Optimal for Learning Low-Dimensional Multi-Modal Distributions

扩散模型在学习低维多模态分布时具有统计最优性

Jingda Wu, Changxiao Cai

发表机构 * Department of Industrial and Operations Engineering, University of Michigan, Ann Arbor, USA(工业与运营工程系,密歇根大学,安娜堡,美国)

AI总结 本文证明扩散模型在学习支撑在低维子空间并集上的分布时,样本复杂度仅依赖于内在维度,达到近最优的1-Wasserstein误差率,无需光滑性或有界密度假设。

Comments Accepted to ICML 2026

详情
AI中文摘要

基于分数的扩散模型在学习高维分布,特别是那些具有低维和多模态结构的分布方面,已经展现出显著的实证成功。然而,对其统计效率的理论理解仍然有限。现有理论通常依赖于强正则性假设,例如一致有界密度或全局光滑的分数函数,这些假设无法捕捉此类内在结构。在这项工作中,我们研究了扩散模型在学习支撑在低维子空间并集上的分布时的样本复杂度。假设每个子空间内的数据分布是次高斯的,我们证明扩散模型最多需要$\widetilde{O}(\varepsilon^{-k \vee 2})$个样本即可在1-Wasserstein距离上达到$\varepsilon$误差,其中$k$是内在维度。这一近最优的收敛速率仅依赖于内在维度,并显著改进了先前遭受维度灾难的理论保证。值得注意的是,我们的分析适用于广泛的分布,无需施加光滑性、有界密度或对数凹性假设。总体而言,我们的结果表明,扩散模型能够统计适应内在低维结构,同时自然容纳多模态数据,为其在复杂高维学习任务中的成功提供了严格的理论依据。

英文摘要

Score-based diffusion models have demonstrated remarkable empirical success in learning high-dimensional distributions, particularly those exhibiting low-dimensional and multi-modal structures. However, theoretical understanding of their statistical efficiency remains limited. Existing theories typically rely on strong regularity assumptions, such as uniformly bounded densities or globally smooth score functions, which fail to capture such intrinsic structures. In this work, we study the sample complexity of diffusion models for learning distributions supported on a union of low-dimensional subspaces. Assuming that the data distribution within each subspace is subgaussian, we show that diffusion models require at most $\widetilde{O}(\varepsilon^{-k \vee 2})$ samples to achieve $\varepsilon$ error in 1-Wasserstein distance, where $k$ is the intrinsic dimension. This near-optimal convergence rate depends only on the intrinsic dimension and significantly improves upon prior theoretical guarantees that suffer from the curse of dimensionality. Notably, our analysis applies to a broad collection of distributions without imposing smoothness, bounded-density, or log-concavity assumptions. Overall, our results show that diffusion models can statistically adapt to intrinsic low-dimensional structure while naturally accommodating multi-modal data, offering a rigorous theoretical justification for their success in complex high-dimensional learning tasks.
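
In the stated rate, $k \vee 2$ denotes $\max(k, 2)$. A small sketch of how the sample count scales, ignoring the constants and logarithmic factors hidden in $\widetilde{O}$ (which the abstract does not specify), shows that only the intrinsic dimension enters; the function name is hypothetical.

```python
def sample_count_scale(eps, intrinsic_dim):
    """Leading-order scaling eps^(-max(k, 2)); constants and log factors omitted."""
    exponent = max(intrinsic_dim, 2)
    return eps ** (-exponent)


for k in (1, 2, 3, 5):
    # The ambient dimension never appears in this expression.
    print(k, sample_count_scale(0.1, k))
```

For intrinsic dimensions 1 and 2 the scaling is the same, since the exponent is floored at 2.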

2605.30148 2026-05-29 cs.LG cs.AI 版本更新

Overcoming Forgetting in LLM Fine-Tuning with Evolution Strategies

克服LLM微调中的遗忘:进化策略方法

Kajetan Schweighofer, Conor F. Hayes, Roberto Dailey, Risto Miikkulainen, Xin Qiu

发表机构 * Cognizant AI Lab(Cognizant AI实验室) UT Austin(得克萨斯大学奥斯汀分校)

AI总结 本文发现进化策略微调中的先前任务遗忘实为性能漂移且可恢复,并引入锚定权重衰减(AWD)正则化技术有效稳定先前任务性能,表明遗忘可避免,使ES成为LLM持续学习的可行方法。

详情
AI中文摘要

进化策略(ES)最近作为强化学习(RL)在大语言模型(LLM)微调中的竞争性替代方案出现,通过简单性、可扩展性和仅推理训练提供优势。然而,近期研究表明,在新任务上进行ES微调可能导致对先前任务的遗忘。首先,本文表明先前任务遗忘(1)更好地被描述为性能漂移而非不可逆遗忘,在ES训练过程中先前任务性能通常会恢复;(2)并非ES特有的失败模式,使用RL方法微调时也可能出现。其次,本文分析了这种漂移何时以及为何出现,强调了其对ES训练动态的依赖性,特别是权重空间中弱约束方向上的随机游走行为。第三,基于这些见解,本文引入了锚定权重衰减(AWD)作为一种参数空间正则化技术,将优化约束向初始模型参数。AWD在保持目标任务性能的同时有效稳定了先前任务性能,以更低的计算成本实现了与大型ES种群规模相当的优势。因此,与先前观点相反,本文表明ES下的先前任务遗忘在很大程度上是可以避免的,使ES成为LLM持续学习中一种有前景的方法。

英文摘要

Evolution Strategies (ES) has recently emerged as a competitive alternative to reinforcement learning (RL) for large language model (LLM) fine-tuning, offering advantages through simplicity, scalability, and inference-only training. However, recent work suggests that ES fine-tuning on new tasks may induce forgetting of prior tasks. First, this paper shows that prior task forgetting (1) is better characterized as performance drift rather than irreversible forgetting, with prior-task performance often recovering during ES training; and (2) is not a specific failure mode of ES, but can also arise for fine-tuning with RL methods. Second, it analyzes when and why such drift arises, highlighting its dependence on ES training dynamics, particularly random walk behavior in weakly constrained directions of the weight space. Third, based on these insights, it introduces Anchored Weight Decay (AWD) as a parameter-space regularization technique that constrains optimization toward the initial model parameters. AWD effectively stabilizes prior-task performance while preserving target-task performance, achieving benefits comparable to large ES population sizes at much lower computational cost. Thus, contrary to previous beliefs, the paper shows that prior-task forgetting under ES is largely avoidable, positioning ES as a promising approach for continual learning in LLMs.
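
The idea of constraining optimization toward the initial parameters can be sketched as a decay term that pulls toward an anchor rather than toward zero. The update form, the step sizes, and the function name are assumptions made for illustration; the abstract does not state the exact AWD rule, and `update` merely stands in for an ES gradient estimate.

```python
def awd_step(theta, theta_init, update, lr=0.1, decay=0.5):
    """One ascent step with decay toward the anchor theta_init (assumed form):

    theta <- theta + lr * update - lr * decay * (theta - theta_init)
    """
    return [
        t + lr * u - lr * decay * (t - t0)
        for t, t0, u in zip(theta, theta_init, update)
    ]


theta_init = [0.0, 0.0]   # parameters of the initial (anchor) model
theta = [1.0, 2.0]        # parameters after some drift

# With no task signal in these directions, the anchor term shrinks the drift.
for _ in range(3):
    theta = awd_step(theta, theta_init, update=[0.0, 0.0])
print(theta)
```

In directions where the task gives no signal, the anchor term steadily undoes random-walk drift, while directions with a consistent update can still move away from the initial model.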

2605.30135 2026-05-29 cs.LG cs.AI 版本更新

DAMEL: Dual-Axis Multi-Expert Learning for Class-Imbalanced Learning

DAMEL: 双轴多专家学习用于类别不平衡学习

Hyuck Lee, Taemin Park, Heeyoung Kim

发表机构 * AI Research, Krafton(AI研究,Krafton) Department of Industrial and Systems Engineering, Korea Advanced Institute of Science and Technology (KAIST)(工业与系统工程系,韩国科学技术院)

AI总结 提出双轴多专家学习算法DAMEL,通过表示轴和时间轴上的多专家集成,同时降低预测偏差和方差,有效解决类别不平衡学习问题。

详情
AI中文摘要

针对来自具有长尾分布的真实世界数据的类别不平衡学习所带来的挑战,已有多种算法被提出。这些算法通过重平衡技术减少了预测偏差,但通常以增加预测方差为代价。一些多专家学习算法旨在解决这一方差问题,但涉及复杂的过程。我们提出了一种新的多专家学习算法,称为双轴多专家学习(DAMEL),该算法通过沿表示轴和时间轴使用多个专家来同时降低预测的偏差和方差。沿表示轴,DAMEL拼接多个专家的表示,并同时使用拼接后的表示训练一个辅助的平衡分类器。沿时间轴,DAMEL聚合跨训练时期的网络权重,并在测试时使用这些聚合权重。实验结果表明,DAMEL同时降低了预测的偏差和方差,突显了其在类别不平衡学习中的有效性。

英文摘要

Various algorithms have been proposed to address the challenges posed by class-imbalanced learning from real-world data with long-tailed distributions. While these algorithms reduce prediction bias through rebalancing techniques, they often introduce increased prediction variance as a trade-off. Several multi-expert learning algorithms aim to address this variance but involve complex procedures. We propose a new multi-expert learning algorithm, called dual-axis multi-expert learning (DAMEL), which reduces both the bias and the variance of predictions by using multiple experts along both the representation and time axes. Along the representation axis, DAMEL concatenates the representations of multiple experts and trains an auxiliary balanced classifier simultaneously with the concatenated representations. Along the time axis, DAMEL aggregates network weights across training epochs, employing these aggregated weights during testing. Experimental results demonstrate that DAMEL reduces both the bias and the variance of predictions, highlighting its effectiveness in class-imbalanced learning.
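
The time-axis aggregation can be pictured as averaging weight checkpoints collected across epochs. The abstract does not say which aggregation DAMEL uses, so the uniform running average below is an assumption, and the flat weight lists are toy stand-ins for network parameters.

```python
def update_running_average(avg_weights, new_weights, n_seen):
    """Fold a new checkpoint into the average of the n_seen previous ones."""
    return [
        (a * n_seen + w) / (n_seen + 1)
        for a, w in zip(avg_weights, new_weights)
    ]


# One flattened weight vector per training epoch (toy values).
checkpoints = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

avg = checkpoints[0]
for n_seen, weights in enumerate(checkpoints[1:], start=1):
    avg = update_running_average(avg, weights, n_seen)

print(avg)  # [3.0, 4.0]
```

Averaging in a streaming fashion avoids storing every checkpoint; only the current average and a counter are kept, and the averaged weights are the ones used at test time.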

2605.30132 2026-05-29 cs.LG stat.ML 版本更新

Learning to Extrapolate to New Tasks: A Relational Approach to Task Extrapolation

学习外推到新任务:一种关系型任务外推方法

Adam Ousherovitch, Yixin Wang

发表机构 * Department of Statistics, University of Michigan, Ann Arbor(统计学系,密歇根大学,安阿伯)

AI总结 提出关系型任务外推器(RTE),通过将目标任务分解为锚定任务和变换关系并学习关系算子,实现向未见任务的系统性外推,在函数预测和序列预测中显著优于现有方法。

Comments ICML 2026

详情
AI中文摘要

现代学习系统擅长内插,但难以泛化到训练分布支持范围之外的未见任务。即使在简单设置中(如处理超出训练范围的任务参数),这种失败也会发生,并且尽管基础模型取得了进展,问题依然存在。为此,我们开发了关系型任务外推器(RTE),一种旨在实现向新任务系统性外推的算法。关键观察是外推本质上是关系型的:外推到未见任务需要学习任务如何相互转换。如果模型在训练期间学习了任务A和B之间的变换,它可以在测试时应用相同的变换来关联已知任务和未见任务。RTE通过将每个目标任务分解为一个已知的锚定任务和一个连接锚定与目标的变换来实现这一思想。然后它学习一个关系算子,将锚定-变换对映射到目标任务的预测。我们在函数预测的多个任务外推场景中实例化RTE,例如目标任务使用超出范围的参数(参数外推)、具有更大的组合深度(长度外推)和/或以未见方式重新组合函数原语(组合外推)。我们进一步将RTE扩展到序列预测,将其集成到基础模型的微调算法中。在实证研究中,我们发现RTE在向新颖、未见任务的外推上显著优于现有方法。

英文摘要

Modern learning systems excel at interpolation but struggle to generalize to unseen tasks outside the training distribution's support. This failure occurs even in simple settings, such as handling task parameters beyond the training range, and persists despite advances in foundation models. To this end, we develop the Relational Task Extrapolator (RTE), an algorithm designed to enable systematic extrapolation to novel tasks. The key observation is that extrapolation is inherently relational: extrapolating to unseen tasks requires learning how tasks transform into one another. If a model learns the transformation between tasks A and B during training, it can apply that same transformation to relate known tasks to unseen ones at test time. RTE operationalizes this idea by decomposing each target task into a known anchor task and a transformation linking the anchor and target. It then learns a relational operator, mapping an anchor-transformation pair to predictions for the target task. We instantiate RTE across multiple task extrapolation regimes in function prediction, e.g. where target tasks use out-of-range parameters (parameter extrapolation), have greater compositional depth (length extrapolation), and/or recombine function primitives in unseen ways (compositional extrapolation). We further extend RTE to sequence prediction, integrating it into fine-tuning algorithms for foundation models. Across empirical studies, we find that RTE substantially outperforms existing approaches on extrapolation to novel, unseen tasks.

2605.30126 2026-05-29 cs.CV cs.AI cs.CL cs.LG 版本更新

PARCEL: Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding

PARCEL: 基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解

Selim Kuzucu, Alessio Tonioni, Vasile Lup, Bernt Schiele, Federico Tombari, Muhammad Ferjad Naeem

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) Google(谷歌)

AI总结 提出PARCEL视觉分词架构,通过池锚定和条件弹性查询重采样解决视觉令牌压缩中的空间与查询表示冲突,在27个基准上提升性能-效率帕累托前沿。

Comments 33 pages, 4 figures

详情
AI中文摘要

大型视觉-语言模型(LVLMs)将视觉输入映射为密集的令牌序列,导致推理时的二次计算瓶颈。弹性视觉令牌压缩通过训练单一模型以在多个视觉令牌预算下运行来解决这一问题。然而,现有方法在激进压缩下表现不佳。空间压缩(如嵌套池化)表现为不完美的低通滤波器,并引起频谱混叠,掩盖了细粒度细节。查询压缩(如嵌套查询重采样)用非局部摘要替代显式的网格对齐令牌,显著降低了空间定位能力。为解决这一表示冲突,我们引入了PARCEL(基于池锚定的条件弹性查询重采样以实现高效视觉-语言理解),一种视觉分词架构,动态分配特征提取的工作。PARCEL将空间池令牌建立为低频布局锚点,并通过池条件查询重采样使弹性查询令牌依赖于这些锚点。这鼓励查询令牌专注于互补的视觉特征,而非冗余的空间映射。在27个基准上的广泛评估表明,PARCEL改进了性能-效率帕累托前沿,在各种视觉令牌预算下持续优于现有的嵌套基线,同时保留了“一次训练,随处部署”的范式。

英文摘要

Large Vision-Language Models (LVLMs) map visual inputs into dense token sequences, imposing a quadratic computational bottleneck for inference. Elastic visual-token compression addresses this by training a single model that can run at multiple visual-token budgets. However, existing approaches struggle under aggressive compression. Spatial-only compression, as in nested pooling, behaves as an imperfect low-pass filter and induces spectral aliasing that obscures fine-grained detail. Query-only compression, as in nested query resampling, replaces explicit grid-aligned tokens with non-local summaries and substantially degrades spatial grounding. To resolve this representational conflict, we introduce PARCEL (Pool-Anchored Resampling with Conditioned Elastic Queries for Efficient Vision-Language Understanding), a visual tokenization architecture that dynamically partitions the labor of feature extraction. PARCEL establishes spatial pool tokens as low-frequency layout anchors and conditions elastic query tokens on these anchors through Pool-Conditioned Query Resampling. This encourages query tokens to focus on complementary visual features rather than redundant spatial mapping. Extensive evaluations across 27 benchmarks show that PARCEL improves the performance-efficiency Pareto frontier, consistently outperforming existing matryoshka baselines across visual-token budgets while preserving the "train once, deploy anywhere" paradigm.

2605.30116 2026-05-29 cs.CV cs.LG 版本更新

SGMD: Score Gradient Matching Distillation for Few-Step Video Diffusion Distillation

SGMD: 得分梯度匹配蒸馏用于少步视频扩散蒸馏

Zhuguanyu Wu, Ruihao Gong, Yang Yong, Yushi Huang, Xiangyu Fan, Lei Yang, Dahua Lin, Xianglong Liu

发表机构 * Beihang University(北京航空航天大学) SenseTime Research(商汤科技研究院) Hong Kong University of Science and Technology(香港科技大学)

AI总结 针对分布匹配蒸馏在少步视频扩散中训练昂贵且运动动态保守的问题,提出得分梯度匹配蒸馏(SGMD),通过直接优化假得分朝向教师并使用教师停止梯度Fisher作为稳定目标,实现约3倍训练加速并显著提升运动动态。

Comments ICML 2026

详情
AI中文摘要

分布匹配蒸馏(DMD)是加速少步视频扩散模型推理的常用范式。然而,DMD风格的视频蒸馏面临两个耦合挑战:假得分必须跟踪不断演化的生成器,当需要频繁更新时训练成本高昂,而反向KL风格匹配可能具有模式寻求性和保守性,难以保持强运动动态。为解决这些问题,我们提出得分梯度匹配蒸馏(SGMD)。SGMD采用假得分视角,直接优化假得分朝向教师,同时使用教师停止梯度Fisher作为稳定的分布匹配目标。我们提供了梯度分析,论证了在理想跟踪下该目标选择的合理性。在此基础上,SGMD引入一对双重势:负残差(NR)用于外环校正,残差收缩(RC)用于内环跟踪。实验上,与DMD2相比,SGMD实现了约$3\times$的训练加速,并显著改善了4步蒸馏模型的运动动态,同时保持了时间一致性。一项人类研究证实,SGMD在运动质量和整体偏好上更受青睐,而视觉质量和文本对齐保持相当。代码可在https://github.com/ModelTC/LightX2V获取。

英文摘要

Distribution Matching Distillation (DMD) is a widely used paradigm for accelerating inference in few-step video diffusion models. However, DMD-style video distillation faces two coupled challenges: the fake score must track a continuously evolving generator, making training costly when frequent updates are required, while reverse-KL-style matching can be mode-seeking and too conservative to preserve strong motion dynamics. To address these issues, we propose Score Gradient Matching Distillation (SGMD). SGMD adopts a fake-score perspective by directly optimizing the fake score toward the teacher, while using teacher stop-gradient Fisher as a stable distribution-matching objective. We provide a gradient analysis that motivates this objective choice under ideal tracking. Building on this, SGMD introduces a pair of dual potentials: negative-residual (NR) for outer-loop correction and residual-contraction (RC) for inner-loop tracking. Empirically, compared to DMD2, SGMD achieves an approximately $3\times$ training speedup and substantially improves motion dynamics for 4-step distilled models while preserving temporal consistency. A human study confirms that SGMD is preferred in motion quality and overall preference, while visual quality and text alignment remain comparable. Code is available at https://github.com/ModelTC/LightX2V.

2605.30112 2026-05-29 cs.LG 版本更新

Striding Across Reynolds Numbers: Representation Geometry in Neural PDE Generalisation

跨越雷诺数:神经PDE泛化中的表示几何

Jianing Shi

发表机构 * London School of Economics and Political Science(伦敦政治经济学院)

AI总结 通过分析神经PDE求解器在跨雷诺数泛化中的表示几何,发现基于卷积自编码器的匹配方法(ConvAE-Relay)在无需目标域数据的情况下达到38.34%误差,揭示了局部多尺度表示对跨雷诺数迁移的关键作用。

Comments 12 pages, 8 figures, 5 tables

详情
AI中文摘要

神经PDE求解器中的跨雷诺数泛化仍然缺乏表征。在标准的强迫二维Navier-Stokes基准上,训练好的傅里叶神经算子在10倍雷诺数偏移下达到46.68%的相对L2误差,而零前向模型检索基线已经改进到41-42%。这表明表示几何是测试方法中的一个主要组织变量。我们通过ConvAE-Relay测试这一假设,该方法在源训练卷积自编码器潜在空间中匹配状态,并从源域数据库借用动力学,仅使用源域数据库且无需目标域拟合、标签或数据库条目,达到38.34+/-0.07%的误差。2x2消融实验将匹配质量隔离为优于更新规则的主导因素。Oracle实验证实,当匹配保持在流形上时,源域动力学方向仍然可迁移(余弦相似度~0.84);自回归漂移是主要瓶颈(约12个百分点)。从学习预测方面,具有多尺度跳跃连接的U-Net达到34.72+/-0.60%的误差,与检索方面的发现一致,即局部多尺度表示组织测试方法中的跨雷诺数迁移。所有结论均限于该基准。

英文摘要

Cross-Reynolds generalisation in neural PDE solvers remains poorly characterised. On the canonical forced 2D Navier-Stokes benchmark, a trained Fourier Neural Operator reaches 46.68% relative L2 error under a 10x Reynolds-number shift, yet zero-forward-model retrieval baselines already improve to 41-42%. This suggests representation geometry as a major organising variable among the tested methods. We test this hypothesis through ConvAE-Relay, which matches states in a source-trained convolutional autoencoder latent space and borrows dynamics from a source-regime database, achieving 38.34+/-0.07% using only a source-regime database and no target-regime fitting, labels, or database entries. A 2x2 ablation isolates matching quality as dominant over the update rule. Oracle experiments confirm that source-regime dynamics directions remain transferable (cosine similarity ~0.84) when matching stays on-manifold; autoregressive drift is the primary bottleneck (~12 percentage points). From the learned-prediction side, a U-Net with multi-scale skip connections achieves 34.72+/-0.60%, consistent with the retrieval-side finding that local, multi-scale representations organise cross-Reynolds transfer among tested methods. All claims are scoped to this benchmark.

2605.30103 2026-05-29 cs.LG 版本更新

Convergence Theory for Iterative LLM-Based Neural Architecture Search: A Parametric Cross-Entropy Framework with Closed-Form Proxy Reliability

基于迭代式LLM的神经架构搜索的收敛理论:一个具有闭式代理可靠性的参数化交叉熵框架

Santosh Premi Adhikari, Radu Timofte, Dmitry Ignatov

发表机构 * Computer Vision Lab, CAIDAS & IFI, University of Würzburg, Germany(计算机视觉实验室,CAIDAS与IFI,维尔茨堡大学,德国)

AI总结 将迭代式LLM-NAS建模为参数化交叉熵方法,证明了收敛性、精英集概率几何收敛、增量生成有效性、MinHash-Jaccard去重防止模式崩溃以及代理可靠性闭式公式,并通过实验验证了理论预测。

Comments 14 pages, 2 figures, 2 tables. Submitted to NeurIPS 2026

详情
AI中文摘要

大型语言模型(LLM)越来越多地被用作迭代式神经架构搜索(NAS)中的生成器,然而这类算法尚无正式的收敛理论。我们将迭代式LLM-NAS建模为在可执行程序上的参数化交叉熵(CE)方法,并证明了六个结果:(1)在精英架构上的迭代式LLM微调等价于限制在LLM参数族内的CE更新;(2)期望架构质量在循环间单调非减;(3)精英集概率以几何速率C_t >= 1-(1-rho_0)^t收敛到不动点;(4)在一阶马尔可夫令牌误差模型下,基于增量的生成比全代码生成实现严格更高的有效生成率;(5)MinHash-Jaccard新颖性过滤器防止模式崩溃;(6)代理可靠性具有闭式形式rho_S = (6/pi) arcsin(rho_P(SNR)/2),从而得出实际诊断条件sigma^2_arch >> sigma^2_noise作为基于代理的可靠排名的必要条件。在22个循环、三个LLM、六个数据集、3300个生成架构的实验中,定量验证了两个预测,在效应方向层面验证了两个预测,并解释了先前经验观察到但未得到解释的代理可靠性天花板效应。

英文摘要

Large language models (LLMs) are increasingly used as generators in iterative neural architecture search (NAS), yet no formal convergence theory exists for this class of algorithms. We model iterative LLM-NAS as a parametric Cross-Entropy (CE) method over executable programs and prove six results: (1) iterative LLM fine-tuning on elite architectures is equivalent to the CE update restricted to the LLM parametric family; (2) expected architecture quality is monotonically non-decreasing across cycles; (3) elite-set probability converges to a fixed point at a geometric rate C_t >= 1-(1-rho_0)^t; (4) delta-based generation achieves a strictly higher valid-generation rate than full-code generation under a first-order Markov token-error model; (5) the MinHash-Jaccard novelty filter prevents mode collapse; (6) proxy reliability admits the closed-form rho_S = (6/pi) arcsin(rho_P(SNR)/2), yielding the practical diagnostic sigma^2_arch >> sigma^2_noise as a necessary condition for trustworthy proxy-based rankings. Testing against a 22-cycle, three-LLM, six-dataset experiment with 3,300 generated architectures confirms two predictions quantitatively, two at direction-of-effect level, and explains the proxy-reliability ceiling effect previously reported empirically but left unexplained.
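
As a rough illustration of two formulas quoted in the abstract above, the following Python sketch evaluates the closed-form proxy reliability rho_S = (6/pi) arcsin(rho_P/2) and the geometric lower bound 1-(1-rho_0)^t. The function names are hypothetical, and rho_P is taken as a plain input because the abstract does not state how it depends on the SNR.

```python
import math


def spearman_from_pearson(rho_p: float) -> float:
    """Closed form quoted in the abstract: rho_S = (6/pi) * arcsin(rho_P / 2).

    rho_p is passed in directly (assumption: its SNR dependence is not
    given in the abstract, so it is not modelled here).
    """
    if not -1.0 <= rho_p <= 1.0:
        raise ValueError("Pearson correlation must lie in [-1, 1]")
    return (6.0 / math.pi) * math.asin(rho_p / 2.0)


def elite_probability_lower_bound(rho_0: float, t: int) -> float:
    """Geometric bound quoted in the abstract: C_t >= 1 - (1 - rho_0)^t."""
    if not 0.0 <= rho_0 <= 1.0:
        raise ValueError("rho_0 must be a probability")
    if t < 0:
        raise ValueError("t must be a non-negative cycle count")
    return 1.0 - (1.0 - rho_0) ** t
```

Under this formula a Pearson correlation of 1 maps to a rank correlation of 1, while intermediate values shrink slightly (rho_P = 0.5 gives roughly 0.483), which is one way to read the reliability ceiling the abstract mentions.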

2605.30100 2026-05-29 cs.LG 版本更新

Chess-World-Model: A 10M-Game Benchmark for Exact State Tracking from Chess Move Sequences

Chess-World-Model: 一个用于从国际象棋走棋序列精确状态跟踪的1000万对局基准

Benjamin Walker, Terry Lyons

发表机构 * Mathematical Institute, University of Oxford(牛津大学数学研究所) Department of Mathematics, Imperial College London(伦敦帝国理工学院数学系)

AI总结 提出一个基于1000万真实国际象棋对局的大规模状态跟踪基准,通过预测合法走棋序列后的棋盘状态,测试模型学习转换规则的能力,并发现循环模型优于Transformer,且随机均匀分布子集能揭示规模掩盖的失败。

Comments 20 pages, 4 figures

详情
AI中文摘要

世界模型需要状态跟踪,即跨动作序列维持正确潜在状态的能力。现有基准通常是合成或基于语言的,限制了它们作为结构化状态更新测试在现实领域中的价值。我们引入了Chess-World-Model,一个基于1000万真实国际象棋对局构建的大规模状态跟踪基准,其中模型预测经过一系列合法走棋后达到的精确棋盘状态。除了一个留出的真实对局子集外,我们还包含一个来自均匀随机合法走棋的分布外子集,用于测试模型是否学习转换规则而非来自常见人类走法的捷径。先前的理论和实证工作表明,Transformer难以进行状态跟踪,而输入依赖的线性RNN需要表达性强的状态转换矩阵才能做到。因此,我们在匹配的接口和训练协议下,对因果Transformer、块对角SLiCE、Mamba-3和具有负特征值的Gated DeltaNet进行了基准测试。在300万和800万参数下,循环模型显著优于Transformer。真实对局性能在1800万参数以上饱和,但随机均匀子集在4000万参数下仍具有区分性,暴露了规模掩盖的失败。此外,消融实验表明,对于所有三种循环模型,表达性较弱的状态转换机制会降低分布外子集的性能。这些结果共同确立了Chess-World-Model作为一个实用的大规模状态跟踪基准,能够暴露模型规模原本会掩盖的失败。

英文摘要

World models require state tracking, which is the ability to maintain a correct latent state across action sequences. Existing benchmarks are often synthetic or language-based, limiting their value as tests of structured state updates in realistic domains. We introduce Chess-World-Model, a large-scale state-tracking benchmark built from 10 million real chess games, where models predict the exact board state reached after a sequence of legal moves. Alongside a held-out real-game split, we include an out-of-distribution split from uniformly random legal play, which tests whether models learn the transition rules rather than shortcuts from common human positions. Prior theoretical and empirical work has shown that Transformers struggle to state-track, while input-dependent linear RNNs require expressive state-transition matrices to do so. We therefore benchmark a causal Transformer, block-diagonal SLiCE, Mamba-3, and Gated DeltaNet with negative eigenvalues under a matched interface and training protocol. The recurrent models strongly outperform the Transformer at 3 and 8 million parameters. Real-game performance saturates above 18 million parameters, but the random-uniform split remains discriminative up to 40 million, exposing failures otherwise hidden by scale. Additionally, ablations show that less expressive state-transition mechanisms reduce performance on the out-of-distribution split for all three recurrent models. Together, these results establish Chess-World-Model as a practical large-scale benchmark for state tracking that exposes failures model scale would otherwise conceal.

2605.30085 2026-05-29 cs.AI cs.CL cs.LG stat.ML 版本更新

Conformal Certification of Reasoning Trace Prefixes

推理轨迹前缀的保形认证

Matt Y. Cheung, Ashok Veeraraghavan, Hanjie Chen, Guha Balakrishnan

发表机构 * Department of Electrical & Computer Engineering, Rice University(电气与计算机工程系,莱斯大学)

AI总结 提出CROP方法,通过保形校准选择阈值,返回最长无错前缀,并控制错误包含概率,平衡保留有效推理与丢弃误导后缀。

Comments Code available at https://github.com/matthewyccheung/crop

详情
AI中文摘要

语言模型推理轨迹很少是全有或全无;在关键错误发生之前,它们通常包含有效的中间步骤。现有的不确定性量化方法通常认证最终答案或整个响应,未能为顺序轨迹中可安全保留的比例提供统计保证。为了解决这个问题,我们引入了CROP(保形推理输出前缀),一种与验证器无关的校准程序,用于干净前缀认证。给定任何步骤级风险代理,CROP选择一个校准阈值,并返回其步骤风险代理保持低于该阈值的最长连续前缀,将未认证的后缀路由到下游审查或修复。假设可交换性,CROP严格控制了返回前缀包含注释错误的边际概率。在六个过程标记的推理数据集上,我们证明了标准步骤级指标(如AUROC)不能完全捕捉前缀效用,建议验证器应改为通过认证前缀长度进行评估。此外,CROP平衡了过度保留和不足保留,通过保留有效的中间推理同时丢弃误导后缀,提高了下游修复的准确性。最终,这项工作将前缀认证定位为过程监督、弃权和修复之间的严格、实用的桥梁。

英文摘要

Language model reasoning traces are rarely all-or-nothing; they frequently contain valid intermediate steps before a critical error occurs. Existing uncertainty quantification methods typically certify final answers or entire responses, failing to provide statistical guarantees for the proportion of a sequential trace that can be safely retained. To address this, we introduce CROP (Conformal Reasoning Output Prefixes), a verifier-agnostic calibration procedure for clean-prefix certification. Given any step-level risk proxy, CROP selects a calibrated threshold and returns the longest contiguous prefix whose step risk proxies remain below it, routing the uncertified suffix for downstream review or repair. Assuming exchangeability, CROP rigorously controls the marginal probability that the returned prefix contains an annotated error. Across six process-labeled reasoning datasets, we demonstrate that standard step-level metrics such as AUROC do not fully capture prefix utility, suggesting verifiers should instead be evaluated by certified prefix length. Furthermore, CROP balances over- and under-withholding, improving downstream repair accuracy by preserving valid intermediate reasoning while discarding misleading suffixes. Ultimately, this work positions prefix certification as a rigorous, practical bridge between process supervision, abstention, and repair.
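
The prefix rule described in the abstract above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' procedure: `certified_prefix_length` follows the stated rule (longest contiguous prefix whose step risks stay below a threshold), while `calibrate_threshold` is an assumed split-conformal style quantile on a hypothetical score (the largest risk seen up to the first annotated error). All names and the calibration details are assumptions.

```python
import math
from typing import Optional, Sequence


def certified_prefix_length(step_risks: Sequence[float], threshold: float) -> int:
    """Length of the longest contiguous prefix whose risks are all below threshold."""
    length = 0
    for risk in step_risks:
        if risk >= threshold:
            break
        length += 1
    return length


def calibrate_threshold(
    cal_risks: Sequence[Sequence[float]],
    cal_first_error: Sequence[Optional[int]],
    alpha: float,
) -> float:
    """Assumed calibration rule (hypothetical, not taken from the paper).

    cal_first_error[i] is the 0-based index of the first annotated error
    in calibration trace i, or None if the trace has no annotated error.
    A prefix swallows that error exactly when every risk up to and
    including the error step is below the threshold, so the score of a
    trace is the maximum risk over those steps (infinity if error-free).
    """
    scores = []
    for risks, first_err in zip(cal_risks, cal_first_error):
        if first_err is None:
            scores.append(math.inf)
        else:
            scores.append(max(risks[: first_err + 1]))
    scores.sort()
    k = math.floor(alpha * (len(scores) + 1))
    if k < 1:
        return -math.inf  # too few calibration traces: certify nothing
    return scores[k - 1]  # k-th smallest calibration score
```

With the strict comparison in `certified_prefix_length`, a new exchangeable trace includes its first error only if its score falls strictly below the k-th smallest calibration score, which happens with probability at most k/(n+1), hence at most alpha, under this sketch's assumptions.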

2605.30075 2026-05-29 cs.LG cs.DC 版本更新

Q-ANCHOR: Federated Quantum Learning with ZNE-guided Correction

Q-ANCHOR: 基于ZNE引导校正的量子联邦学习

Hoang M. Ngo, Quan Nguyen, Wanli Xing, My T. Thai

发表机构 * Department of Computer & Information Science & Engineering(计算机与信息科学与工程系) University of Florida(佛罗里达大学) Frost Institute for Data Science and Computing(数据科学与计算弗罗斯特研究所) University of Miami(迈阿密大学)

AI总结 针对量子联邦学习中非独立同分布数据导致的客户端漂移和量子硬件噪声导致的硬件偏差,提出Q-ANCHOR聚合架构,通过零噪声外推锚定服务器更新并应用有状态客户端校正,理论证明可同时减轻两类漂移,实验显示训练更稳定。

详情
AI中文摘要

量子联邦学习(QFL)提供了一个有前景的框架,可以在保持数据严格本地化的同时,跨分布式客户端训练量子模型。由于其简单性和低通信开销,联邦平均(FedAvg)是QFL文献中的标准聚合选择。然而,在实际硬件上部署QFL会暴露出严重的双重漂移现象:全局模型同时受到来自非独立同分布数据的客户端漂移和来自噪声量子梯度估计的硬件偏差的干扰。在这项工作中,我们首先分析了FedAvg在这些现实条件下的收敛性,数学上证明了量子硬件偏差会产生标准平均无法纠正的持久误差下限。为了克服这一限制,我们提出了Q-ANCHOR,一种量子感知的联邦聚合架构,该架构通过零噪声外推锚定服务器更新,同时应用有状态客户端校正来抑制客户端漂移和硬件引起的偏差。我们的收敛理论证明,Q-ANCHOR成功减轻了经典客户端漂移,同时积极降低了硬件偏差下限。实验结果表明,Q-ANCHOR实现了比传统FL基线显著更稳定的训练。

英文摘要

Quantum Federated Learning (QFL) offers a promising framework to train quantum models across distributed clients while keeping data strictly local. Due to its simplicity and low communication overhead, Federated Averaging (FedAvg) is the standard aggregation choice in QFL literature. However, deploying QFL on practical hardware exposes a severe double-drift phenomenon: the global model is simultaneously derailed by client drift from non-IID data and hardware bias from noisy quantum gradient estimates. In this work, we first analyze the convergence of FedAvg under these realistic conditions, mathematically demonstrating that quantum hardware bias creates a persistent error floor that standard averaging cannot correct. To overcome this limitation, we propose Q-ANCHOR, a quantum-aware federated aggregation architecture that anchors server updates with zero-noise extrapolation while applying stateful client correction to suppress both client drift and hardware-induced bias. Our convergence theory proves that Q-ANCHOR successfully mitigates classical client drift while actively reducing the hardware-bias floor. Experimental results demonstrate that Q-ANCHOR achieves significantly more stable training than conventional FL baselines.
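
For readers unfamiliar with zero-noise extrapolation (ZNE), the generic idea is to measure the same quantity at several artificially amplified noise levels and extrapolate back to zero noise. The sketch below shows only that generic step, using Richardson (polynomial) extrapolation; the abstract does not say how Q-ANCHOR applies ZNE to server updates, so the function name and interface are assumptions.

```python
from typing import Sequence


def zero_noise_extrapolate(scales: Sequence[float], values: Sequence[float]) -> float:
    """Richardson extrapolation to noise scale 0.

    Fits the unique polynomial through the points (scale_i, value_i) and
    evaluates it at scale 0 via Lagrange basis weights:
        weight_i = prod_{j != i} scale_j / (scale_j - scale_i)
    """
    if len(scales) != len(values) or len(scales) < 2:
        raise ValueError("need at least two (scale, value) pairs")
    if len(set(scales)) != len(scales):
        raise ValueError("noise scales must be distinct")
    estimate = 0.0
    for i, (c_i, y_i) in enumerate(zip(scales, values)):
        weight = 1.0
        for j, c_j in enumerate(scales):
            if j != i:
                weight *= c_j / (c_j - c_i)
        estimate += weight * y_i
    return estimate
```

If the measured value decays linearly with the noise scale, for example 0.9, 0.8, 0.7 at scales 1, 2, 3, the extrapolated zero-noise value is 1.0 up to floating-point rounding. Higher-order fits remove more bias but amplify sampling noise, which is the usual trade-off with this technique.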

2605.30070 2026-05-29 cs.LG cs.AI 版本更新

A Predictive Law for On-Policy Self-Distillation From World Feedback

基于世界反馈的在线自蒸馏预测定律

Tommy He, Jerome Sieber, Matteo Saponati

发表机构 * Open-source models(开源模型) LiveCodeBench

AI总结 本文发现在线自蒸馏(OPSD)中初始师生性能差距与最终性能改进之间存在线性关系,并提出一种预测定律,用于在训练前预测OPSD配置的效果。

详情
AI中文摘要

超越简单的标量奖励,向更丰富的世界反馈迈进,是实现更可扩展的RL后训练的自然路径。在线自蒸馏(OPSD)是一种有前景的最新方法,它使用任意反馈作为学习信号,但其与GRPO等成熟方法相比的可靠性仍不清楚。我们发现了OPSD中初始学生-教师性能差距与最终性能改进之间存在惊人的一致线性相关性。这种关系在不同上下文类型和模型家族中均成立,为预测OPSD配置的结果提供了一种强大的预测定律,而无需运行完整的训练过程。有趣的是,我们表明这种线性可预测性随模型规模成立,这为具有更强上下文学习能力的大型模型上新的经验缩放定律提供了潜在基础。本质上,我们的发现表明,OPSD性能可以在训练前进行预测和调整,为将世界反馈作为后训练流水线的一等组件提供了一种原则性方法。

英文摘要

Moving beyond simple scalar rewards toward richer world feedback is a natural path to more scalable RL post-training. On-policy self-distillation (OPSD) is a promising recent approach that uses arbitrary feedback as learning signal, yet its reliability compared to established methods, such as GRPO, remains unclear. We identify a strikingly consistent linear correlation between the initial student-self-teacher performance gap and the final performance improvement in OPSD. This relationship holds across context types and model families, providing a powerful predictive law for anticipating the outcome of an OPSD configuration without running the full training procedure. Interestingly, we show that this linear predictability holds with model scale, suggesting a potential basis for new empirical scaling laws on larger models with stronger in-context learning capabilities. In essence, our findings show that OPSD performance can be predicted and tuned before training, offering a principled way to incorporate world feedback as a first-class component of the post-training pipeline.

2605.30059 2026-05-29 cs.LG cond-mat.stat-mech stat.ML 版本更新

Ridge Regression from Poisson Resetting: A Renewal Perspective on Spectral Regularization

泊松重置的岭回归:谱正则化的更新视角

Petar Jolakoski

发表机构 * manu.edu.mk

AI总结 通过非平衡统计物理中的随机重置与统计学习中的岭正则化建立联系,证明线性梯度流下以速率r重置到原点产生的稳态均值即为岭估计,并推广到一般更新重置律以生成替代谱滤波器。

详情
AI中文摘要

我们将非平衡统计物理中的随机重置与统计学习中的岭正则化联系起来。对于线性梯度流,以速率$r$重置到原点产生稳态均值$(X^\top X+rI)^{-1}X^\top y$,这正是惩罚项$\lambda=r$的岭估计。这利用了岭回归与梯度流指数时间平均之间已知的拉普拉斯变换关系,其中指数时间现在被解释为与泊松重置相关的稳态年龄。然后我们将这一恒等式推广到一般更新重置律:指数重置时间分布是唯一的更新律,其稳态均值在每个特征方向上作为精确的滤波器恒等式对每个正曲率重现标量岭,而非指数更新律则生成替代的谱滤波器。在波动层面,我们研究了一个具有恒定扩散的独立加性奥恩斯坦-乌伦贝克扩展,解释为一种风格化的SGD近似。在这种设定下,等式仅在均值层面成立,因为重置过程由于累积的OU噪声和重置时序方差具有非零稳态协方差,而确定性岭是一个具有相同中心的固定估计量。风格化实验直接比较了确定性更新诱导的滤波器,并说明了非指数重置时间律诱导的滤波器何时可能在预测上与岭不同。关于稳态均值和诱导谱滤波器的结果是在二次目标上具有各向同性重置的连续时间梯度流下建立的;协方差和风险公式额外假设具有状态独立协方差的加性噪声。

英文摘要

We connect stochastic resetting from non-equilibrium statistical physics with ridge regularization in statistical learning. For linear gradient flow, resetting to the origin at rate $r$ produces stationary mean $(X^\top X+rI)^{-1}X^\top y$, exactly the ridge estimator with penalty $\lambda=r$. This uses the known Laplace-transform relationship between ridge regression and exponential-time averaging of gradient flow, with the exponential time now interpreted as the stationary age associated with Poisson resetting. We then extend this identity to general renewal reset laws: the exponential reset time distribution is the unique renewal law whose stationary mean reproduces scalar ridge in every eigendirection as an exact filter identity for every positive curvature, while non-exponential renewal laws generate alternative spectral filters. At the fluctuation level, we study a separate additive Ornstein-Uhlenbeck extension with constant diffusion, interpreted as a stylized SGD approximation. In this setting, the equality holds only at the level of the mean, since the reset process has a nonzero stationary covariance from accumulated OU noise and reset-timing variance, whereas deterministic ridge is a fixed estimator with the same center. Stylized experiments compare the deterministic renewal-induced filters directly and illustrate when filters induced by non-exponential reset-time laws can differ predictively from ridge. The results for the stationary mean and the induced spectral filters are established for continuous-time gradient flow with isotropic resetting on quadratic objectives; the covariance and risk formulas additionally assume additive noise with state-independent covariance.
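
The stationary-mean identity stated in the abstract above can be checked with a short calculation. The derivation below is a sketch under assumptions that the abstract does not spell out: the loss is taken to be $\tfrac12\|Xw-y\|^2$, the flow starts at the reset point $w(0)=0$, and $(\mu_i, v_i)$ denote eigenpairs of $X^\top X$ (symbols introduced here for illustration).

```latex
% Assumptions (not stated in the abstract): L(w) = (1/2)||Xw - y||^2,
% w(0) = 0, and (mu_i, v_i) are eigenpairs of X^T X.
\begin{align*}
\dot w(t) &= -X^\top\bigl(Xw(t) - y\bigr), \qquad w(0) = 0, \\
v_i^\top w(t) &= \frac{1 - e^{-\mu_i t}}{\mu_i}\, v_i^\top X^\top y
  \qquad (\mu_i > 0), \\
\mathbb{E}_{\tau \sim \mathrm{Exp}(r)}\bigl[1 - e^{-\mu_i \tau}\bigr]
  &= \int_0^\infty r e^{-rt}\bigl(1 - e^{-\mu_i t}\bigr)\,dt
   = 1 - \frac{r}{r + \mu_i}
   = \frac{\mu_i}{\mu_i + r}, \\
v_i^\top\, \mathbb{E}\bigl[w(\tau)\bigr]
  &= \frac{1}{\mu_i + r}\, v_i^\top X^\top y
  \;\Longrightarrow\;
  \mathbb{E}\bigl[w(\tau)\bigr] = (X^\top X + rI)^{-1} X^\top y .
\end{align*}
```

Directions with $\mu_i = 0$ contribute nothing on either side, since $v_i^\top X^\top y = 0$ there. The factor $\mu_i/(\mu_i + r)$ is the per-eigendirection ridge filter, and replacing the exponential law for $\tau$ with another renewal law changes this factor, which matches the abstract's remark about alternative spectral filters.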

2605.30056 2026-05-29 cs.RO cs.LG 版本更新

Sample-Efficient Diffusion-based Reinforcement Learning with Critic Guidance

基于评论家引导的样本高效扩散强化学习

Shutong Ding, Zejia Zhong, Zhongyi Wang, Ke Hu, Bikang Pan, Jingya Wang, Ye Shi

发表机构 * ShanghaiTech University(上海科技大学)

AI总结 针对扩散策略在强化学习中探索与利用不平衡的问题,提出评论家引导的扩散策略优化(CGPO),通过无训练引导技术平衡探索与利用,在MuJoCo和Franka机器人任务上取得最优性能。

Comments Accepted by ICML 2026

详情
AI中文摘要

近年来,强化学习(RL)通过利用扩散策略的多模态性和探索能力取得了巨大成功。在这些方法中,一个代表性分支专注于基于采样的策略优化。这种设计使得扩散模型在训练初期具有更好的探索能力,但在Q值信息的利用上不足,导致策略收敛缓慢。另一个分支关注基于梯度的策略优化,该方法充分利用Q函数的梯度,但容易退化为低多样性的单峰策略。为了解决这个问题,我们提出了CGPO(评论家引导的扩散策略优化),通过将无训练引导技术集成到扩散策略的去噪过程中,有效平衡探索与利用。具体而言,CGPO将动作生成引导至评论家网络定义的高价值区域,并将引导后的动作作为回归目标。通过这种方式,CGPO减少了获取高质量动作所需的时间,并通过更好的探索-利用权衡提高了最终性能。我们在5个MuJoCo运动任务上验证了CGPO的有效性,与现有的基于扩散的RL方法相比,CGPO达到了最先进的性能。值得注意的是,CGPO是首次成功将扩散策略应用于真实世界RL的方法,在Franka机器人臂抓取任务上表现出优越性能。我们的官方页面发布在https://dingsht.tech/cgpo-webpage。

英文摘要

Recent advances in reinforcement learning (RL) have achieved great success by leveraging the multimodality and exploration capability of diffusion policies. Among these approaches, one representative branch focuses on sampling-based policy optimization. This design gives the diffusion model better exploration capability, particularly at the beginning of training, but suffers from low exploitation of Q-value information, resulting in slow policy convergence. Another branch focuses on gradient-based policy optimization, which fully exploits the gradient of the Q function yet tends to collapse into a unimodal policy with low diversity. To address this issue, we propose CGPO (Critic-Guided diffusion Policy Optimization), which effectively balances exploration and exploitation by integrating a training-free guidance technique into the denoising process of the diffusion policy. Concretely, CGPO steers action generation toward high-value regions defined by the critic network and uses the guided actions as regression objectives. In this manner, CGPO reduces the time required to obtain high-quality actions and improves final performance through a better exploration-exploitation tradeoff. We validate the effectiveness of CGPO on 5 MuJoCo locomotion tasks, where it achieves state-of-the-art performance compared with existing diffusion-based RL methods. Notably, CGPO is the first method to successfully incorporate diffusion policy into real-world RL, showing superior performance on Franka robot arm grasping tasks. Our official page is released at https://dingsht.tech/cgpo-webpage.

2605.30046 2026-05-29 cs.LG cs.AI 版本更新

Masked Diffusion Modeling for Anomaly Detection

掩码扩散建模用于异常检测

Lixing Zhang, Yuchen Liang, Liyan Xie

发表机构 * University of Minnesota(明尼苏达大学) Ohio State University(俄亥俄州立大学)

AI总结 提出基于掩码扩散模型的MaskDiff-AD方法,通过重建随机掩码坐标的难度构建异常分数,在分类、混合类型和离散序列数据上实现高效异常检测。

详情
AI中文摘要

异常检测旨在识别偏离名义数据分布的样本,是许多安全关键应用的核心。然而,针对分类、混合类型和离散序列数据开发有效的异常检测方法仍然具有挑战性且相对未被充分探索。掩码扩散模型通过学习从剩余可见上下文中恢复掩码值,为建模此类数据提供了一种自然的方式。在本文中,我们提出了用于异常检测的掩码扩散(MaskDiff-AD),一种基于掩码扩散模型的前向方法,仅在名义数据上训练。给定测试样本,MaskDiff-AD从随机掩码坐标的重建难度构建异常分数,产生一个直接作用于离散状态空间且避免反向时间采样的内容敏感分数。我们还开发了MaskDiff-AD的非参数变体,并通过在固定检测阈值下表征I型和II型错误提供了理论保证。在来自ADBench和UADAD的十四个分类和混合类型表格数据集,以及来自NLP-ADBench的四个文本异常检测数据集上的实验表明,MaskDiff-AD相对于经典、基于扩散以及最近的表格/文本异常检测基线取得了有竞争力的性能。值得注意的是,MaskDiff-AD达到了最佳总体平均排名,优于所有十二种表格基线方法。

英文摘要

Anomaly detection aims to identify samples that deviate from the nominal data distribution and is central to many safety-critical applications. However, developing effective anomaly detection methods for categorical, mixed-type, and discrete sequence data remains challenging and relatively underexplored. Masked diffusion models provide a natural way to model such data by learning to recover masked values from the remaining visible context. In this paper, we propose Masked Diffusion for Anomaly Detection (MaskDiff-AD), a forward-only method based on masked diffusion models trained only on nominal data. Given a test sample, MaskDiff-AD constructs anomaly scores from the difficulty of reconstructing randomly masked coordinates, yielding a content-sensitive score that operates directly on discrete state spaces while avoiding reverse-time sampling. We also develop a non-parametric variant of MaskDiff-AD and provide theoretical guarantees by characterizing Type-I and Type-II errors under a fixed detection threshold. Experiments on fourteen categorical and mixed-type tabular datasets from ADBench and UADAD, as well as four text anomaly detection datasets from NLP-ADBench, show that MaskDiff-AD achieves competitive performance against classical, diffusion-based, and recent tabular/text anomaly detection baselines. Notably, MaskDiff-AD achieves the best overall average rank, outperforming all twelve tabular baseline methods.

2605.30038 2026-05-29 cs.LG cs.AI cs.CV 版本更新

Alignment-Guided Score Matching for Text-to-Image Alignment in Diffusion Models

对齐引导的分数匹配用于扩散模型中的文本到图像对齐

Jaa-Yeon Lee, Yeobin Hong, Taesung Kwon, Jong Chul Ye

发表机构 * Graduate School of AI, KAIST, South Korea(韩国科学技术院人工智能研究生院)

AI总结 提出一种轻量级、无奖励的后训练方法,通过将对比对齐引导直接整合到扩散模型的分数匹配目标中,以解决文本-图像对齐中的过度惩罚和计数错误问题。

Comments ICML 2026, Project page: https://jaayeon.github.io/AGSM

详情
AI中文摘要

扩散模型生成高度逼真的图像,但通常难以实现精确的文本-图像对齐。虽然最近的后训练方法使用外部奖励或人类偏好信号改善对齐,但其性能严重依赖奖励质量,且不直接解决扩散过程中的对齐问题。最近的无奖励方法如SoftREPA表明,通过对比学习优化软文本令牌可以有效改善文本-图像表示对齐,优于标准参数高效微调基线。然而,对比公式可能过度惩罚负对,表现为典型的失败案例,如过度计数和重复。为解决此问题,我们提出一种轻量级、无奖励的后训练方法,通过将对比对齐引导直接整合到扩散模型的分数匹配目标中来细化软令牌。通过在分数级别分配对齐方向,我们的方法缓解了这些限制,并产生更连贯和语义忠实的生成。实验表明,我们的方法与SoftREPA相当,同时显著改善了其失败案例,在GenEval基准上计数准确性提高了超过35%。我们的方法可无缝应用于现有扩散骨干网络(SD1.5、SDXL和SD3),并与现有的基于RL的扩散后训练方法互补。项目页面:https://jaayeon.github.io/AGSM

英文摘要

Diffusion models generate highly realistic images but often struggle with precise text-image alignment. While recent post-training methods improve alignment using external rewards or human preference signals, their performance heavily depends on reward quality and does not directly address alignment within the diffusion process itself. Recent reward-free approaches such as SoftREPA demonstrate that optimizing soft text tokens via contrastive learning can effectively improve text-image representation alignment, outperforming standard parameter-efficient fine-tuning baselines. However, the contrastive formulation can excessively penalize negative pairs, which manifests as characteristic failure cases such as over-counting and repetition. To address this issue, we propose a lightweight, reward-free post-training method that refines soft tokens by integrating contrastive alignment guidance directly into the score-matching objective of diffusion models. By assigning alignment directions at the score level, our approach mitigates these limitations and yields more coherent and semantically faithful generations. Experiments show that our method matches SoftREPA while substantially improving its failure cases, achieving over 35% improvement in counting accuracy on the GenEval benchmark. Our method is seamlessly applicable to existing diffusion backbones (SD1.5, SDXL, and SD3), and is complementary to existing RL-based diffusion post-training methods. Project page: https://jaayeon.github.io/AGSM

2605.30015 2026-05-29 cs.LG cs.AI 版本更新

Test Time Training for Supervised Causal Learning

测试时训练用于监督因果学习

Zizhen Deng, Jiaru Zhang, Rui Ding, Huang Bojun, Jinzhuo Wang, Qiang Fu, Shi Han, Dongmei Zhang

发表机构 * Peking University(北京大学) Shanghai Jiao Tong University(上海交通大学) Microsoft(微软) Sony Research(索尼研究)

AI总结 针对监督因果学习在分布外泛化中的不足,提出测试时训练框架TTT-SCL,通过动态生成与测试实例对齐的训练集,显著提升因果发现性能。

详情
AI中文摘要

监督因果学习(SCL)通过将因果发现构建为监督学习问题,展现了潜力。然而,它面临显著的分布外泛化挑战。我们揭示了先前SCL实践的三个局限性:合成基准与真实数据之间的显著性能差距、对分布偏移的脆弱性以及组合泛化的失败,共同质疑了其现实世界适用性。为此,我们提出测试时训练用于监督因果学习(TTT-SCL),一种新颖的框架,动态生成与任何特定测试实例显式对齐的训练集。我们展示了TTT-SCL与基于分数的方法之间的关联,并基于经典评分函数设计了一个高效模块用于生成训练集。在合成基准、伪真实和真实世界数据集上的实验表明,TTT-SCL显著优于现有的SCL和传统因果发现方法。

英文摘要

Supervised Causal Learning (SCL) has shown promise in causal discovery by framing it as a supervised learning problem. However, it suffers from significant out-of-distribution generalization challenges. We reveal three limitations of previous SCL practices: a significant performance gap between synthetic benchmarks and real-world data, fragility to distribution shifts, and failure in compositional generalization, collectively questioning its real-world applicability. To address this, we propose Test-Time Training for Supervised Causal Learning (TTT-SCL), a novel framework that dynamically generates training sets explicitly aligned with any specific test instance. We demonstrate the correlation between TTT-SCL and score-based methods, and design an efficient module for generating training sets based on the classic scoring function. Experiments on synthetic benchmarks, pseudo-real and real-world datasets demonstrate that TTT-SCL significantly outperforms existing SCL and traditional causal discovery methods.

2605.30003 2026-05-29 cs.MA cs.AI cs.LG 版本更新

Discovering Cooperative Pipelines: Autoresearch for Sequential Social Dilemmas

发现合作管线:面向序列社会困境的自动研究

Víctor Gallego

发表机构 * Komorebi AI Technologies(Komorebi人工智能技术)

AI总结 本文提出一种双层自动研究框架,其中外层AI智能体自动重新设计内层LLM策略合成管线,以解决多智能体序列社会困境,实验表明该方法在多个游戏和福利目标下优于手工基线。

Comments Accepted to the AI Agents for Discovery in the Wild (AID-Wild) Workshop at ACM CAIS 2026

详情
AI中文摘要

我们研究了两层自动研究合作问题:外层AI智能体自主重新设计用于多智能体序列社会困境(SSD)的LLM策略合成系统的内层管线。研究者智能体$\mathcal{R}$(作为编码智能体运行)读取内层源代码,编辑系统提示、反馈函数、辅助库和迭代逻辑,运行评估,并决定保留什么,遵循自动研究范式。在两个游戏(Cleanup和Gathering)、两个策略合成器LLM和两个福利目标(功利主义效率和Rawlsian最大最小原则)下,研究者可靠地超越了手工设计的基线,显著缩小了运行间方差,并优于仅提示优化。发现的管线依赖于目标:只有在最大最小原则下,研究者才会向合成器管线注入显式的公平机制,而这类机制在其自身目标无关的系统提示和每个效率优化的管线中都不存在。这支持了一种信息设计解读,即研究者根据福利目标选择向有限理性的合成器揭示什么。代码见https://github.com/vicgalle/autoresearch-social-dilemmas。

英文摘要

We study two-level autoresearch for cooperation: an outer-loop AI agent autonomously redesigns the inner-loop pipeline of an LLM policy-synthesis system for multi-agent Sequential Social Dilemmas (SSDs). A researcher agent $\mathcal{R}$ (run as a coding agent) reads the inner-loop source code, edits system prompts, feedback functions, helper libraries, and iteration logic, runs evaluations, and decides what to keep, following the autoresearch paradigm. Across two games (Cleanup and Gathering), two policy-synthesizer LLMs, and two welfare objectives (utilitarian efficiency and Rawlsian maximin), the researcher reliably exceeds hand-designed baselines, sharply tightens run-to-run variance, and outperforms prompt-only optimization. The discovered pipelines are objective-dependent: only under maximin does the researcher inject an explicit fairness mechanism into synthesizer pipelines, a class of mechanism that is absent from its own objective-agnostic system prompt and from every efficiency-optimized pipeline. This supports an information-design reading in which the researcher chooses what to reveal to the boundedly rational synthesizer as a function of the welfare objective. Code at https://github.com/vicgalle/autoresearch-social-dilemmas.

2605.29983 2026-05-29 cs.LG cs.CV 版本更新

Improving Adversarial Robustness of Attribution via Implicit Regularization

通过隐式正则化提高归因的对抗鲁棒性

Amir Mehrpanah, Matteo Gamba, Hossein Azizpour

发表机构 * Department of Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden(瑞典皇家理工学院计算机科学系) Science for Life Laboratory, Stockholm, Sweden(瑞典斯德哥尔摩生命科学实验室) Department of Computer Science, Brown University, USA(美国布朗大学计算机科学系)

AI总结 本文发现标准随机梯度下降的学习动态可以隐式地提高归因的对抗鲁棒性,并证明在softmax归一化下注意力归因的鲁棒性提升受限,而基于核的注意力可恢复鲁棒性。

Comments 39 pages, 22 figures, to be published in International Conference on Machine Learning 2026

详情
AI中文摘要

归因的对抗鲁棒性是深度学习中可靠可解释性的基本要求,但现有方法通常依赖计算昂贵的显式正则化。在这项工作中,我们表明归因鲁棒性可以从标准随机梯度下降的学习动态中隐式产生。我们通过参数空间和输入空间曲率之间的联系从理论上论证了这种效应,并在各种架构、数据集和归因方法上进行了验证,计算开销可忽略不计。相反,我们证明由于固有的熵约束,这种鲁棒性提升通常不会转移到softmax归一化下的注意力归因,并通过实验验证了这一局限性。最后,我们表明用基于核的注意力替换softmax注意力可以恢复Transformer模型中的鲁棒性提升。我们的结果突出了学习动态作为鲁棒可解释性的一种原则性且实用的机制,并揭示了归一化下注意力归因的基本局限性。

英文摘要

The adversarial robustness of attributions is a fundamental requirement for reliable explainability in deep learning, yet existing approaches typically rely on computationally expensive explicit regularization. In this work, we show that attribution robustness can arise implicitly from the learning dynamics of standard stochastic gradient descent. We theoretically motivate this effect through connections between parameter-space and input-space curvature, and validate it across architectures, datasets, and attribution methods, with negligible computational overhead. In contrast, we prove that such robustness gains often do not transfer to attention-based attribution under softmax normalization, due to inherent entropy constraints, and we validate this limitation experimentally. Finally, we show that replacing softmax attention with kernel-based attention restores the robustness gains in transformer models. Our results highlight learning dynamics as a principled and practical mechanism for robust explainability, and reveal fundamental limitations of attention-based attribution under normalization.

2605.29980 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Genetically Aligned Patient Representations Improve Hematological Diagnosis

基因对齐的患者表示改善血液学诊断

Muhammed Furkan Dasdelen, Fatih Ozlugedik, Ilaria Looser, Rao Muhammad Umer, Christian Pohlkamp, Carsten Marr

发表机构 * Institute of AI for Health, Helmholtz Munich, Germany International School of Medicine, Istanbul Medipol University, Türkiye Munich Leukemia Laboratory, Germany Department of Medicine III, Ludwig-Maximilian-University Hospital, Germany Department of Physics, University of Munich, Germany Munich Center for Machine Learning (MCML), Germany DKTK, German Cancer Consortium, Germany

AI总结 提出一种两阶段框架,通过自监督视觉预训练和监督对比学习对齐白细胞图像与染色体畸变及体细胞突变,提升血液学诊断性能。

Comments Accepted for publication at the 29th International Conference on Medical Image Computing and Computer Assisted Intervention - MICCAI 2026

详情
AI中文摘要

组织病理学编码器与转录组和基因组数据的多模态对齐已被证明能显著提高下游诊断任务的性能。血液学细胞学的独特之处在于,视觉单细胞评估通常与细胞遗传学和分子遗传学相结合用于血癌诊断。在本研究中,我们提出了一个框架,将单个白细胞图像与染色体畸变(核型)以及来自靶向基因面板的体细胞突变对齐。我们的训练策略采用两阶段方法:(i)在超过1500名患者的队列上,使用iBOT头进行自监督、仅视觉的Transformer聚合器预训练;(ii)通过急性髓系白血病患者的监督对比损失进行基因对齐。我们的基因对齐患者编码器改善了血液学诊断任务,优于切片级组织病理学基础模型。此外,该模型为疾病和遗传改变提供了即用型检索能力。将遗传数据纳入患者编码器提高了患者表示的质量,提供了一个与临床诊断工作流程对齐的框架,并为未来的多模态血液学特定AI铺平了道路。代码和模型权重可在https://github.com/marrlab/GenBloom获取。

英文摘要

Multimodal alignment of histopathology encoders with transcriptomic and genomic data has been shown to significantly improve performance in downstream diagnostic tasks. Hematological cytology is unique in that visual single-cell evaluation is often paired with cytogenetics and molecular genetics for blood cancer diagnosis. In this study, we present a framework to align single white blood cell images with chromosomal aberrations (karyotype) and somatic mutations from targeted gene panels. Our training strategy follows a two-stage approach: (i) self-supervised, vision-only pretraining of a transformer aggregator using an iBOT head on a cohort of over 1500 patients, and (ii) genetic alignment via supervised contrastive loss on acute myeloid leukemia patients. Our genetically aligned patient encoder improves hematological diagnostic tasks, outperforming slide-level histopathology foundation models. Additionally, the model provides off-the-shelf retrieval capabilities for diseases and genetic alterations. Incorporating genetic data into patient encoders increases the quality of patient representations, providing a framework that aligns with clinical diagnostic workflows and paves the way for future multimodal hematology-specific AI. The code and model weights are available at https://github.com/marrlab/GenBloom.

2605.29979 2026-05-29 cs.CR cs.LG 版本更新

Fingerprinting Inference Systems of Large Language Models

大型语言模型的推理系统指纹识别

Anna Wimbauer, Jonas Möller, Erik Imgrund, Konrad Rieck

发表机构 * BIFOLD & TU Berlin(BIFOLD与柏林工业大学)

AI总结 本文提出一种通过分析LLM的提示-响应行为来识别推理系统组件(如推理引擎、注意力后端和硬件平台)的指纹方法,并论证了防御该指纹识别的根本困难性。

详情
AI中文摘要

LLM的行为不仅仅取决于模型本身。推理系统的组件,如推理引擎、注意力后端和硬件平台,微妙地影响输入的处理方式。这些组件在实现上存在差异,因此在运行相同模型时,不同系统之间会产生微小的数值偏差。虽然先前的工作已经建立了这种偏差的理论存在性,但其安全影响尚未被探索。在本文中,我们表明这些偏差是特定组件的特征,并传播到可观察的文本输出中,从而将推理系统暴露给任何能够查询模型的方。基于这一观察,我们引入了一种指纹识别方法,通过分析LLM的提示-响应行为来识别推理系统的组件。我们的实证评估表明,即使在LLM以非零温度运行时,推理引擎、注意力后端和底层硬件平台也能被可靠地识别。我们证明,防止指纹识别从根本上来说是困难的,因为它需要消除硬件和软件堆栈之间的数值差异。因此,我们提出了部分缓解措施并讨论了它们的影响。

英文摘要

The behavior of LLMs does not depend solely on the model itself. Components of the inference system, such as the inference engine, attention backend, and hardware platform, subtly influence how inputs are processed. These components differ in their implementations and thereby induce small numerical deviations across systems when running the same model. While prior work has established the theoretical existence of such deviations, their security implications have remained unexplored. In this paper, we show that these deviations are characteristic of specific components and propagate to observable textual outputs, exposing the inference system to any party that can query the model. Building on this observation, we introduce a fingerprinting method that analyzes the prompt-response behavior of LLMs to identify components of the inference system. Our empirical evaluation demonstrates that the inference engine, attention backend, and underlying hardware platform can be identified reliably, even when the LLM is operated at non-zero temperature. We show that preventing fingerprinting is fundamentally hard, as it would require eliminating numerical differences between hardware and software stacks. We therefore propose partial mitigations and discuss their impact.

2605.29975 2026-05-29 cs.LG eess.SP 版本更新

A Fully Convolutional Approach to Denoising Structural Dynamics Data from X-Ray Photon Correlation Spectroscopy

一种全卷积方法用于X射线光子相关光谱中结构动力学数据的去噪

Nisar Nellikunnummel, Andi Barbour, Lutz Wiegart, Tatiana Konstantinova, Anthony DeGennaro

发表机构 * Amazon(亚马逊) GE Aerospace Research(通用电气航空航天研究)

AI总结 提出全卷积去噪自编码器(FC-DAE),用于去噪X射线光子相关光谱中的双时间强度-强度相关函数,支持任意输入尺寸,在低信噪比条件下恢复复杂动力学特征并保持结构保真度。

详情
AI中文摘要

我们提出了一种全卷积去噪自编码器(FC-DAE),用于去噪X射线光子相关光谱(XPCS)中的双时间强度-强度相关函数($C_2$)。与通常限制为固定输入尺寸的传统去噪自编码器不同,FC-DAE接受任意维度的输入,同时保留不同动力学范围内的相关结构。该模型使用在NSLS-II光束线收集的实验$C_2$数据进行训练,并应用数据增强来扩展数据集的多样性并减少过拟合。FC-DAE在低信噪比条件下成功恢复复杂的动力学特征,同时保持结构保真度。为了评估重建可靠性,我们采用定量指标来评估结构保真度并识别潜在的模型引入偏差。我们的结果表明,FC-DAE提供了具有高计算效率的鲁棒去噪性能,使得在光子受限和低剂量测量条件下恢复XPCS动力学成为可能。

英文摘要

We present a fully convolutional denoising autoencoder (FC-DAE) for denoising two-time intensity-intensity correlation functions ($C_2$) in X-ray photon correlation spectroscopy (XPCS). Unlike conventional denoising autoencoders that are typically restricted to fixed input sizes, the FC-DAE accepts inputs of arbitrary dimensions while preserving correlation structures across diverse dynamical regimes. The model is trained using experimentally derived $C_2$ data collected at NSLS-II beamlines, with data augmentation applied to expand the diversity of the dataset and reduce overfitting. The FC-DAE successfully recovers intricate dynamical features in low signal-to-noise conditions while maintaining structural fidelity. To assess reconstruction reliability, we employ quantitative metrics to evaluate structural fidelity and identify potential model-induced bias. Our results demonstrate that the FC-DAE provides robust denoising performance with high computational efficiency, enabling recovery of XPCS dynamics under photon-limited and low-dose measurement conditions.
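
The size-agnostic property claimed above follows from how convolutions work: a kernel has a fixed number of weights regardless of input size, so the same layer can slide over a correlation map of any shape. The sketch below illustrates only that property. It is not the FC-DAE; the 3x3 averaging kernel and the toy inputs are assumptions made for the example.

```python
from typing import List

Matrix = List[List[float]]


def conv2d_same(image: Matrix, kernel: Matrix) -> Matrix:
    """2D cross-correlation with zero padding; output shape equals input shape."""
    rows, cols = len(image), len(image[0])
    k = len(kernel) // 2  # kernel is assumed square with an odd side length
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            acc = 0.0
            for dr in range(-k, k + 1):
                for dc in range(-k, k + 1):
                    rr, cc = r + dr, c + dc
                    # Positions outside the image count as zeros (padding).
                    if 0 <= rr < rows and 0 <= cc < cols:
                        acc += image[rr][cc] * kernel[dr + k][dc + k]
            out[r][c] = acc
    return out


# Hypothetical smoothing kernel: the same 9 weights serve every input size.
smooth = [[1.0 / 9.0] * 3 for _ in range(3)]
small = [[1.0] * 4 for _ in range(4)]   # toy 4x4 map
large = [[1.0] * 7 for _ in range(5)]   # toy 5x7 map, no retraining or resizing

small_out = conv2d_same(small, smooth)
large_out = conv2d_same(large, smooth)
```

A network built only from such layers (plus size-preserving or symmetric down/up-sampling) inherits this behaviour, whereas a dense layer fixes the input dimension at construction time.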

2605.29963 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Honeyval: A Comprehensive Evaluation Framework for LLM-powered HTTP Honeypots

Honeyval: 基于LLM的HTTP蜜罐综合评估框架

Mark Vero, Fabian Kaczmarczyck, Ivan Petrov, Ilia Shumailov, Jamie Hayes, Niels Heinen, Tianqi Fan, Luca Invernizzi, Martin Vechev

发表机构 * ETH Zurich(苏黎世联邦理工学院) Google(谷歌) Google DeepMind(谷歌DeepMind) AI Sequrity Company(AI安全公司) Independent(独立)

AI总结 提出Honeyval评估框架,通过16个后端应用、AI攻击代理、控制任务和可验证利用目标,系统评估LLM驱动的HTTP蜜罐,发现其相比规则基线能显著延长攻击交互、降低被前沿模型检测率,且保持成本优势。

详情
AI中文摘要

蜜罐是模拟真实系统组件的诱饵系统,旨在防御网络攻击。最近,LLM越来越多地作为蜜罐的模拟骨干。它们使防御者能够构建高交互蜜罐,同时降低系统安全风险。然而,基于LLM的蜜罐开发缺乏统一的评估框架。大多数评估包括测量固定命令上的响应相似性、手动测试或实际部署。这些方法通常不可扩展用于开发、不可跨评估复现、不能代表实际攻击,或不能适应各种攻击者和蜜罐配置。在这项工作中,我们弥补了这一差距,提出了Honeyval,一个针对LLM驱动的HTTP蜜罐的综合评估框架。我们通过将蜜罐基于16个后端应用程序、使用AI黑客代理作为攻击者、采用两个控制任务来监控代理和蜜罐在定制化方面的能力,以及为攻击者定义清晰且可验证的利用目标,解决了先前评估的局限性。使用Honeyval,我们对近期成本高效的LLM作为HTTP蜜罐进行了广泛评估。我们的实验突出了LLM驱动的蜜罐的前景;它们与基于规则的基线蜜罐相比,导致与攻击者的交互时间显著延长,并且即使被前沿模型检测到的频率也远低得多,同时平均而言,保持了针对代理攻击者的运行成本优势。此外,我们实验了不同的反攻蜜罐配置,并观察到了独特的权衡,例如以增加检测为代价获得更长的交互。

英文摘要

Honeypots are decoy systems mimicking real system components designed to defend against cyber attacks. Recently, LLMs increasingly serve as simulation backbones for honeypots. They enable defenders to construct high-interaction honeypots with low system security risks. However, LLM-powered honeypot development lacks a unified evaluation framework. Most evaluations consist of measuring response similarity on fixed commands, manual testing, or real-world deployment. These methods are often not scalable for development, reproducible across evaluations, representative of practical attacks, or adaptable to various attacker and honeypot configurations. In this work, we bridge this gap and propose Honeyval, a comprehensive evaluation framework for LLM-powered HTTP honeypots. We address the limitations of prior evaluations by grounding the honeypots in 16 backend applications, using AI hacking agents as attackers, employing two control tasks to monitor agent and honeypot capabilities across customizations, and defining clear and verifiable exploit goals for the attacker. Using Honeyval, we conduct an extensive evaluation of recent cost-efficient LLMs as HTTP honeypots. Our experiments highlight the promise of LLM-powered honeypots; they lead to substantially longer interactions with the attacker than rule-based baseline honeypots and are far less frequently detected even by frontier models, all while, on average, preserving a running cost advantage against agentic attackers. Further, we experiment with different counter-offensive honeypot configurations and observe unique trade-offs, such as longer interactions at the cost of increased detection.

2605.29952 2026-05-29 cs.LG 版本更新

From Short Histories to Long Futures: Horizon-Aware Graph Neural Networks for Long Horizon Forecasting

从短历史到长未来:面向长时域预测的视界感知图神经网络

Zesheng Liu, Maryam Rahnemoonfar

发表机构 * Department of Computer Science and Engineering, Lehigh University(理海大学计算机科学与工程系) Department of Civil and Environmental Engineering, Lehigh University(理海大学土木与环境工程系)

AI总结 提出一种多视界图神经网络模拟器,通过共享图骨干网络和增量预测策略,联合优化多步超前预测,实现长时域稳定且准确的地球物理系统模拟。

Comments Accepted for International Conference on Pattern Recognition (ICPR) 2026

详情
AI中文摘要

由于强非线性动力学、全物理模拟的高计算成本以及单步自回归代理在数十年滚动中产生的误差累积,地球物理系统的精确长期预测十分困难。深度神经网络可作为高效模拟器,但大多数仅训练用于下一步预测,且随着预测视界增长常出现漂移或不稳定。我们提出一种多视界图神经网络模拟器,在统一模型中学习从单个当前时间到多个未来超前时间的状态到状态转换。物理域表示为图,其中节点对应具有时变地球物理属性的空间位置,边编码局部空间相互作用。给定当前图状态,模型预测关键场(冰厚度和冰速度)在所有节点上的未来演化,使用共享图骨干网络和每个目标变量的独立输出分支。为提高稳定性,网络预测相对于当前状态的状态增量,然后将其加回以重建未来状态。训练联合优化所有超前时间,使用统一回归目标,推理采用从粗到细的滚动方式,以较大步长推进并有选择地以较短步长细化,以减少漂移并避免冗余计算。在数十年期松岛冰川模拟上的实验表明,我们的方法在长期精度和稳定性上均优于(i)直接从初始状态预测每个未来时间的基线模型和(ii)标准单步自回归滚动,为下游气候和海平面研究提供了更可靠的模拟器。

英文摘要

Accurate long-range prediction of geophysical systems is difficult due to strongly nonlinear dynamics, the high computational cost of full-physics simulations, and the error accumulation that arises when one-step autoregressive surrogates are rolled out over decades. Deep neural networks can serve as efficient emulators, but most are trained only for next-step prediction and often drift or become unstable as the forecast horizon grows. We propose a multi-horizon graph neural network emulator that learns state-to-state transitions from a single current time to multiple future lead times within one unified model. The physical domain is represented as a graph, where nodes correspond to spatial locations with time-varying geophysical attributes and edges encode local spatial interactions. Given the current graph state, the model predicts the future evolution of key fields (ice thickness and ice velocities) at all nodes, using a shared graph backbone with separate output branches for each target variable. To improve stability, the network predicts state increments relative to the current state, which are then added back to reconstruct future states. Training jointly optimizes all lead times with a unified regression objective, and inference uses a coarse-to-fine rollout that advances with larger jumps and selectively refines with shorter jumps to reduce drift and avoid redundant computation. Experiments on multi-decadal Pine Island Glacier simulations show that our approach achieves higher long-range accuracy and improved stability than both (i) an initial-state baseline that predicts each future time directly from the starting state and (ii) a standard single-step autoregressive rollout, producing a more reliable emulator for downstream climate and sea-level studies.
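
The increment prediction and coarse-to-fine rollout described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `toy_emulator`, the lead times (8, 4, 1), and the greedy jump schedule are hypothetical, and the real emulator is a graph neural network whose refinement rule the abstract does not specify.

```python
from typing import Callable, List, Sequence, Tuple

State = List[float]  # one value per graph node, e.g. ice thickness


def toy_emulator(state: State, lead: int) -> State:
    """Hypothetical stand-in for the learned model: returns a state
    increment for the requested lead time (uniform thinning of 0.1 per step)."""
    return [-0.1 * lead for _ in state]


def plan_jumps(horizon: int, leads: Sequence[int]) -> List[int]:
    """Greedy coarse-to-fine schedule (an assumption): use the largest lead
    time that still fits, then shorter ones to land exactly on the horizon."""
    jumps: List[int] = []
    remaining = horizon
    for lead in sorted(leads, reverse=True):
        while remaining >= lead:
            jumps.append(lead)
            remaining -= lead
    if remaining != 0:
        raise ValueError("horizon not reachable with the given lead times")
    return jumps


def rollout(
    state: State,
    horizon: int,
    leads: Sequence[int],
    model: Callable[[State, int], State] = toy_emulator,
) -> Tuple[State, List[int]]:
    """Advance the state to `horizon` by adding predicted increments."""
    jumps = plan_jumps(horizon, leads)
    for lead in jumps:
        delta = model(state, lead)                      # predicted increment
        state = [s + d for s, d in zip(state, delta)]   # add back to the state
    return state, jumps


final_state, schedule = rollout([10.0, 20.0], horizon=13, leads=(8, 4, 1))
```

With these toy settings the horizon of 13 steps is reached in three model calls instead of thirteen single-step calls, which is the source of the reduced error accumulation the abstract refers to.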

2605.29951 2026-05-29 cs.AI cs.CL cs.LG cs.MM 版本更新

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

MuPHI: 通过语义基础奖励优化学习隐式多模态有害推理

Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann, Timothy Hospedales, Vera Demberg

发表机构 * Max Planck Institute for Informatics(马克斯·普朗克信息学研究所) Saarland Informatics Campus(萨尔信息学园区) Saarland University(萨尔大学) The University of Edinburgh(爱丁堡大学) Samsung AI Center, Cambridge(三星AI中心,剑桥)

AI总结 针对视觉语言模型在隐式跨模态有害语义推理上的不足,提出MuPHI数据集和MuPHIRM训练框架,通过多视角奖励优化联合语义学习,提升有害检测与推理质量及分布外鲁棒性。

详情
AI中文摘要

理解看似良性的图像-文本对之间交互如何产生危害,需要超越表面特征的意图感知跨模态推理。现有的视觉语言模型(VLM)擅长对感知线索进行字面推理,但往往无法推导出依赖于隐式、上下文相关推理的有害语义。为了评估VLM在组合性有害检测和推理方面的能力,我们引入了多模态语用有害解释(MuPHI)数据集,其中包含有害编码在微妙多模态线索中的图像-文本对。MuPHI涵盖多种有害类别,并包含用于评估VLM推理链的注释有害理由。为了改进VLM的检测和推理能力,我们提出了MuPHIRM,一种推理增强的训练框架,通过优化多视角奖励来学习联合语义。MuPHIRM提高了VLM的有害检测和推理质量,同时与训练和推理时基线相比,表现出优越的分布外鲁棒性。我们的发现表明,面向推理的奖励优化为构建超越基准特定捷径进行泛化的多模态系统提供了一个有前景的方向。

英文摘要

Understanding how harm emerges from interaction between otherwise benign image-text pairs requires intent-aware cross-modal reasoning beyond surface-level features. Existing vision-language models (VLMs) excel at literal reasoning over perceptual cues but often fail to derive harmful semantics that rely on implicit, context-dependent reasoning. To evaluate VLMs on compositional harm detection and reasoning, we introduce Multimodal Pragmatic Harm Interpretation (MuPHI), a dataset containing image-text pairs where harm is encoded in subtle multimodal cues. MuPHI spans diverse harm categories and includes annotated harm rationales for assessing VLM reasoning chains. To improve both detection and reasoning in VLMs, we propose MuPHIRM, a reasoning-augmented training framework which learns joint semantics by optimizing multi-perspective rewards. MuPHIRM improves both harm detection and reasoning quality of VLMs while demonstrating superior out-of-distribution robustness compared to both trained and inference-time baselines. Our findings suggest that reasoning-oriented reward optimization offers a promising direction towards building multimodal systems that generalize beyond benchmark-specific shortcuts.

2605.29943 2026-05-29 cs.HC cs.ET cs.LG 版本更新

A Domain-Informed Multi-Objective Framework for EEG Channel Selection in Motor Imagery BCIs

一种领域信息驱动的多目标框架用于运动想象脑机接口中的EEG通道选择

Dekka Muni Kumar, Dhruba Jyoti Kalita, Yogesh Kumar Meena

发表机构 * Human-AI Interaction (HAIx) Lab, IIT Gandhinagar(人机交互(HAIx)实验室,印度理工学院甘地讷格尔分校)

AI总结 提出一种基于多目标优化(NSGA-II、MOPSO、MOEA/D)的EEG通道选择框架,通过高斯核评估空间相关性、任务相关去同步评估功能区分性,在四个数据集上优于单目标方法,实现紧凑通道子集和高分类性能。

Comments This work has been submitted to the IEEE for possible publication

详情
AI中文摘要

使用脑电图(EEG)信号进行运动想象(MI)分类对于推进脑机接口(BCI)至关重要。传统的EEG通道选择方法通常面临局限性,例如依赖单目标标准和易陷入局部最优。为了解决这些挑战,本文提出了一种多目标优化框架,采用非支配排序遗传算法、多目标粒子群优化和基于分解的多目标进化算法。我们的方法有效平衡了空间相关性(使用高斯核)和功能区分性(评估试验内任务相关去同步),从而提高了性能。我们在四个EEG数据集(Physionet、OpenBMI、HighGamma和BCIIV-2A)上评估了该框架。所提出的方法成功识别出紧凑且相关的通道子集,这些子集集中在与MI活动相关的感觉运动皮层区域,解决了传统技术中普遍存在的维度和复杂性挑战。此外,该框架在Physionet、OpenBMI、HighGamma和BCIIV-2A数据集上分别达到了87%、71%、75%和65%的分类性能。通过优于现有的单目标和基于准确率的方法以及依赖固定子集的方法,这些发现表明,这种新的多目标优化框架可以增强基于MI的BCI性能,同时促进紧凑的通道配置,降低计算复杂度,使其更适合可穿戴、便携式和实时BCI应用。

英文摘要

Motor imagery (MI) classification using electroencephalography (EEG) signals is essential for advancing brain-computer interfaces (BCIs). Traditional EEG channel selection methods often face limitations, such as dependency on single-objective criteria and susceptibility to local optima. To address these challenges, this work proposes a multi-objective optimisation framework that employs non-dominated sorting genetic algorithm, multiple-objective particle swarm optimisation, and a multi-objective evolutionary algorithm based on decomposition. Our approach effectively balances spatial relevance, using a Gaussian kernel, and functional discriminability, which assesses intratrial task-related desynchronisation, thereby improving performance. We evaluated this framework on four EEG datasets: Physionet, OpenBMI, HighGamma, and BCIIV-2A. The proposed approach successfully identifies compact, relevant channel subsets concentrated around sensorimotor cortex regions linked to MI activity, addressing the prevalent challenges of dimensionality and complexity inherent to traditional techniques. Furthermore, the framework achieved classification performance of 87%, 71%, 75%, and 65% on the Physionet, OpenBMI, HighGamma, and BCIIV-2A datasets, respectively. By outperforming existing single-objective and accuracy-based methods, and those relying on fixed subsets, these findings demonstrate that this new multi-objective optimisation framework can enhance MI-based BCI performance while facilitating compact channel configurations with reduced computational complexity, making them better suited for wearable, portable, and real-time BCI applications.
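
Multi-objective optimisers such as NSGA-II rank candidate solutions by Pareto dominance rather than by a single score. The sketch below shows that ranking step for channel subsets scored on the two objectives named in the abstract. The subsets and their scores are invented for illustration, both objectives are assumed to be maximised, and the code is not taken from the paper.

```python
from typing import Dict, List, Tuple

# (spatial relevance, functional discriminability); higher is better for both.
Scores = Tuple[float, float]


def dominates(a: Scores, b: Scores) -> bool:
    """a dominates b if it is no worse on every objective and strictly
    better on at least one."""
    no_worse = all(x >= y for x, y in zip(a, b))
    strictly_better = any(x > y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(candidates: Dict[str, Scores]) -> List[str]:
    """Return the names of candidates that no other candidate dominates."""
    front = []
    for name, score in candidates.items():
        dominated = any(
            dominates(other, score)
            for other_name, other in candidates.items()
            if other_name != name
        )
        if not dominated:
            front.append(name)
    return front


# Hypothetical channel subsets with made-up objective values.
candidates = {
    "C3,C4": (0.90, 0.60),
    "C3,Cz,C4": (0.80, 0.85),
    "Fp1,Fp2": (0.20, 0.30),
    "C3,C4,O1": (0.70, 0.55),
}

front = pareto_front(candidates)
```

Here the two sensorimotor subsets survive because each wins on one objective, while the other two are dominated; a single-objective method would have to commit to one of the survivors in advance.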

2605.29941 2026-05-29 cs.NI cs.LG 版本更新

TraceCodec: A Compiler-Backed Neural Codec for Stateful Multi-Flow Network Traffic Traces

TraceCodec:一种基于编译器的有状态多流网络流量轨迹神经编解码器

Junhui Ding, Xinchen Zhang, Xiaohui Xie, Shinan Liu

发表机构 * Tsinghua University(清华大学) University of Hong Kong(香港大学)

AI总结 针对有状态多流网络流量轨迹的高保真生成问题,提出TraceCodec,通过将数据包解码为带时间戳的动作并学习连续潜在表示,再经确定性编译器还原为PCAP,实现精确的流量统计和TCP状态保持。

详情
AI中文摘要

关键网络工作流需要高保真的数据包捕获(PCAP)用于测试、安全分析和协议验证,而不仅仅是统计性的流级摘要。最近的包生成器展示了协议约束的PCAP合成,但它们普遍直接解码为原始包字段。这种接口将学习到的行为选择与确定性协议后果纠缠在一起,迫使包实现依赖于事后启发式修复。我们将这种解码接口识别为根本瓶颈,并提出了TraceCodec,一种用于有状态多流轨迹的状态感知神经编解码器。TraceCodec将每个数据包提升为带有显式流槽和传输线索的定时包动作,然后学习连续的每包潜在表示。确定性编译器将解码后的动作降级回PCAP,负责端点分配、TCP状态、合法性约束和包渲染。潜在层暴露了一个面向生成器的序列空间,因此下游流量模型可以在包动作潜在表示上操作,而不是原始头部字段。在CICIDS2017 Monday上,TraceCodec将包计数、协议组成和流数量匹配到0.03%以内。在相同的非修复策略下,原始字段基线将流数量和TCP状态扭曲了几个数量级。结构诊断表明,TraceCodec保留了原始字段解码器所分割的TCP状态转换和多流交织。这项工作为高保真包轨迹生成建立了新的基础。

英文摘要

Critical networking workflows require high-fidelity packet captures (PCAPs) for testing, security analysis, and protocol validation, not just statistical flow-level summaries. Recent packet generators have demonstrated protocol-constrained PCAP synthesis, but they universally decode directly to raw packet fields. That interface entangles learned behavioral choices with deterministic protocol consequences, which forces packet realization to depend on post-hoc heuristic repair. We identify this decode interface as the fundamental bottleneck and present TraceCodec, a state-aware neural codec for stateful multi-flow traces. TraceCodec lifts each packet into a timed packet action with explicit flow slots and transport cues, then learns a continuous per-packet latent. A deterministic compiler lowers decoded actions back to PCAPs, owning endpoint assignment, TCP state, legality constraints, and packet rendering. The latent layer exposes a generator-facing sequence space, so downstream traffic models can operate on packet-action latents rather than raw header fields. On CICIDS2017 Monday, TraceCodec matches packet count, protocol composition, and flow population to within 0.03%. Raw-field baselines under the same non-repair policy distort flow counts and TCP state by orders of magnitude. Structural diagnostics show that TraceCodec preserves TCP state transitions and multi-flow interleaving that raw-field decoders fragment. This work establishes a new foundation for high-fidelity packet-trace generation.

2605.29939 2026-05-29 cs.IT cs.LG math.IT 版本更新

CRB-Guided Framework Design and Resource Allocation for Indoor mmWave ISCC Systems

室内毫米波ISCC系统的CRB引导框架设计与资源分配

Zhonghao Liu, Yahao Ding, Yinchao Yang, Mohammad Shikh-Bahaei

发表机构 * King’s College London(伦敦国王学院)

AI总结 针对室内毫米波ISCC系统,提出基于克拉美罗界(CRB)的资源分配框架,通过联合优化感知功率和自适应深度Mamba模型深度,最小化人体姿态预测误差。

Comments 7 pages, 6 figures, conference(submitted to GLOBECOM)

详情
AI中文摘要

集成感知、通信与计算(ISCC)为室内以人为中心的应用提供了一个有前景的框架。在这些应用中,短期人体姿态预测有助于提前实现连续的人体跟踪和资源分配。本文提出了一种基于克拉美罗界(CRB)的资源分配框架,用于室内毫米波ISCC系统,以在通信、延迟和能量约束下最小化人体姿态预测误差。我们基于CRB刻画了感知功率对距离估计不确定性和点云扰动的影响。为了捕捉计算资源对预测性能的影响,我们采用了一种自适应深度的Mamba姿态预测模型,其中在每个层后附加轻量级预测头,以实现不同模型深度的推理。通过这种统一的感知-计算建模,我们建立了感知功率、模型深度和预测误差之间的定量关系。此外,我们制定了一个联合资源分配问题以最小化姿态预测误差。为了高效解决该问题,我们开发了一种基于交替优化(AO)的算法,其中为感知功率和模型深度更新步骤推导了闭式解。仿真结果表明,与基线方法相比,所提方案显著降低了姿态预测误差,验证了其在资源受限的室内以人为中心的ISCC系统中的有效性。

英文摘要

Integrated sensing, communication, and computation (ISCC) provides a promising framework for indoor human-centric applications. In these applications, short-term human pose prediction facilitates continuous human tracking and resource allocation in advance. In this paper, we propose a Cramér-Rao bound (CRB)-guided resource allocation framework for indoor mmWave ISCC systems to minimize the human pose prediction error under communication, latency, and energy constraints. We characterize the impact of sensing power on range-estimation uncertainty and point-cloud perturbation based on the CRB. To capture the impact of computation resources on prediction performance, we adopt an adaptive-depth Mamba-based pose prediction model, where lightweight prediction heads are attached after every layer to enable inference with different model depths. With this unified sensing-computation modeling, we establish a quantitative relationship among sensing power, model depth, and prediction error. Furthermore, we formulate a joint resource allocation problem to minimize the pose prediction error. To solve this problem efficiently, we develop an alternating optimization (AO)-based algorithm, where closed-form solutions are derived for the sensing power and model depth update steps. Simulation results show that the proposed scheme significantly reduces pose prediction error compared with baseline methods, validating its effectiveness for resource-constrained indoor human-centric ISCC systems.

2605.29937 2026-05-29 cs.RO cs.LG 版本更新

Fisher-Preserving Guidance: Training-Free Manifold Constraints for Safe Diffusion Control

Fisher保持引导:用于安全扩散控制的免训练流形约束

Hao Ren, Zetong Bi, Yiming Zeng, Le Zheng, Zhi Li, Zhaoliang Wan, Lu Qi, Hui Cheng

发表机构 * Sun Yat-sen University, Guangzhou, China(中山大学,广州,中国) Insta360 Research, Shenzhen, China(Insta360研究院,深圳,中国)

AI总结 提出一种免训练的Fisher保持引导方法,通过低秩雅可比分解计算Fisher保持更新,并利用截断Fisher去噪敏感性作为不确定性信号,在视觉导航中实现可靠且高效的轨迹预测。

Comments ICML2026

详情
AI中文摘要

扩散模型在视觉导航中的航路点预测是有效的,但当更新偏离训练流形时,标准采样和测试时引导可能产生不可靠或低效的轨迹。我们提出带有外积跨度投影的Fisher保持引导,这是一种免训练的推理方法,在优化任务目标的同时避免与分布外动作相关的大Fisher漂移。我们的方法通过低秩雅可比分解计算Fisher保持更新,每步仅需一次反向传播,支持实时使用。我们进一步引入截断Fisher去噪敏感性作为不确定性信号,并将其用于鲁棒的多样本动作混合。在玩具和真实导航基准上的实验,包括基于TSDF引导的Maze2D、使用官方扩散策略权重的PushT,以及仿真和真实机器人上的视觉导航,均表明与强扩散策略基线相比,无需额外训练即可获得一致的性能提升。

英文摘要

Diffusion models are effective for waypoint prediction in visual navigation, but standard sampling and test-time guidance can produce unreliable or inefficient trajectories when updates drift off the training manifold. We propose Fisher Preserving Guidance with Outer Product Span Projection, a training-free inference method that avoids large Fisher drift associated with off-distribution actions while optimizing a task objective. Our method computes the Fisher-preserving update via a low-rank Jacobian factorization, requiring only a single backward pass per step and enabling real-time use. We further introduce Truncated Fisher Denoising Sensitivity as an uncertainty signal and use it for robust multi-sample action blending. Experiments on toy and realistic navigation benchmarks, including Maze2D with TSDF-based guidance, PushT with official Diffusion Policy weights, and visual navigation in simulation and on real robots, demonstrate consistent improvements in performance over strong diffusion-policy baselines without additional training.

2605.29933 2026-05-29 cs.LG 版本更新

CLUBench: A Clustering Benchmark

CLUBench:一个聚类基准测试

Feng Xiao, Dazhi Fu, Chris Ding, Jicong Fan

发表机构 * The Chinese University of Hong Kong (Shenzhen)(香港中文大学(深圳))

AI总结 本文提出CLUBench,一个包含24种算法在131个数据集上的综合聚类基准,通过大规模实验分析超参数调优、数据类型、预训练嵌入、大语言模型聚类等,揭示传统算法仍具竞争力,并结合预训练嵌入可提升效率。

详情
AI中文摘要

聚类是数据科学中的一个基本问题,有着悠久的研究历史,产生了许多富有洞察力的算法。尽管取得了这些进展,但缺乏一个系统且大规模的经验评估,同时考虑传统算法、基于深度学习的方法以及最近基于基础模型的聚类,导致对算法选择和部署的指导有限。为了填补这一空白,我们引入了CLUBench,一个全面的聚类基准,包含24种不同原理的算法,在131个数据集上进行了评估,涵盖表格、文本和图像数据,涉及178,815次实验。重要的是,我们对(i)超参数调优的影响、(ii)数据类型和特征的影响、(iii)预训练嵌入的影响、(iv)基于大语言模型的聚类、(v)算法的相似性以及(vi)性能矩阵的低秩结构的分析,为聚类研究提供了有意义的见解和有前景的途径。例如,我们的研究揭示:1) 所有评估的深度聚类方法在平均性能方面并不比表现最佳的传统聚类算法(如KMeans、SpeClu)具有显著优势;2) 对于图像和文本聚类任务,将预训练嵌入与传统聚类算法(如KMeans、SpeClu)相结合提供了有效且高效的聚类;3) 即使在大模型日益占据主导地位的时代,聚类仍然是一个具有挑战性和非平凡的问题。此外,我们提出利用跨模型性能矩阵中的低秩结构来高效近似实际应用中的整体性能评估。我们进一步展示了基于所有超参数配置下的性能矩阵进行模型选择的可行性。

英文摘要

Clustering is a fundamental problem in data science with a long-standing research history, yielding numerous insightful algorithms. Despite this progress, a systematic and large-scale empirical evaluation that jointly considers conventional algorithms, deep learning-based methods, and recent foundation model-based clustering remains largely absent, leading to limited guidance on algorithm selection and deployment. To address this gap, we introduce CLUBench, a comprehensive clustering benchmark comprising 24 algorithms of diverse principles evaluated on 131 datasets across tabular, text, and image data, involving 178,815 experiments. Importantly, our analyses of (i) the impact of hyperparameter tuning, (ii) the impact of data types and characteristics, (iii) the impact of pretrained embeddings, (iv) large language model-based clustering, (v) the similarity of algorithms, and (vi) the low-rank structures of performance matrices, yield meaningful insights and promising pathways for clustering research. For instance, our study reveals that: 1) None of the evaluated deep clustering methods exhibits a significant advantage over the top-performing conventional clustering algorithms (e.g., KMeans, SpeClu) in terms of average performance; 2) For image and text clustering tasks, combining pretrained embeddings with conventional clustering algorithms (e.g., KMeans, SpeClu) offers effective and efficient clustering; 3) Clustering remains a challenging and nontrivial problem, even in the era of increasingly dominant foundation models. Moreover, we propose to use the low-rank structure in cross-model performance matrices to efficiently approximate the overall performance evaluation in practical applications. We further demonstrate the feasibility of model selection based on the performance matrices across all hyperparameter configurations.

2605.29932 2026-05-29 cs.LG cs.CV 版本更新

Treatment-Conditioned Diffusion for Forecasting Neurodegenerative Disease Progression

治疗条件扩散用于预测神经退行性疾病进展

Danylo Boiko, Viktoriia Mishkurova

发表机构 * Innoloft Inc.(Innoloft公司) Bogomolets National Medical University(博戈莫列茨国家医学大学)

AI总结 提出一种治疗条件扩散框架,通过条件化生成过程于患者的筛查DaTscan图像和一年内左旋多巴等效日剂量,预测高保真未来脑状态,在临床保真度上显著优于基线。

Comments 9 pages, 5 figures, 1 table

详情
AI中文摘要

预测帕金森病等神经退行性疾病的进展对于有效的长期规划和个性化治疗干预至关重要。现有系统通常产生忽略纵向神经影像丰富结构的标量临床评分,而传统生成方法则遭受解剖细节丢失和细微进展模式模糊的问题。为此,我们引入了一种新颖的治疗条件扩散框架,通过将生成过程条件化于患者的筛查DaTscan图像和一年内左旋多巴等效日剂量,预测高保真的未来脑状态。该流程使用基于Transformer的编码器表示非线性、时间依赖的药理学动态,并通过一个关注生物关键区域的多权重感兴趣区域掩码优化生成。实验评估表明,我们的框架保持了清晰的解剖边界,并在临床保真度上显著优于基线,实现了MSE降低14.0%,MAE降低7.2%,SSIM提高4.9%。

英文摘要

Forecasting the progression of neurodegenerative diseases, such as Parkinson's disease, is essential for effective long-term planning and personalized therapeutic intervention. Existing systems typically produce scalar clinical scores that ignore the rich structure of longitudinal neuroimaging, while traditional generative approaches suffer from a loss of anatomical detail and blurring of subtle progression patterns. To address this, we introduce a novel treatment-conditioned diffusion framework that predicts high-fidelity future brain states by conditioning the generative process on patients' screening DaTscan images and levodopa equivalent daily dose over one year. The pipeline uses a Transformer-based encoder to represent non-linear, time-dependent pharmacological dynamics and optimizes generation through a multi-weight region-of-interest mask that focuses on biologically critical areas. Experimental evaluation shows that our framework maintains sharp anatomical boundaries and significantly improves clinical fidelity relative to the baseline, achieving 14.0% lower MSE, 7.2% lower MAE, and 4.9% higher SSIM.

2605.29927 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Does The Way You Plan Matter? An Empirical Study of Planning Representations for LLM Web Agents

计划方式重要吗?LLM网络代理计划表示的实证研究

Alejandra Zambrano, Sara Vera Marjanovic, Imene Kerboua, Xing Han Lù, Leila Kosseim

发表机构 * Concordia University(康考迪亚大学) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Copenhagen(哥本哈根大学) Universite Claude Bernard Lyon(里昂克洛德·贝尔纳大学) McGill University(麦吉尔大学)

AI总结 本研究提出PlanAhead框架,通过自动难度分类和四种计划表示(顺序子目标、叙述、伪代码、检查清单)的对比实验,发现计划表示形式和生成计划的LLM显著影响网络代理的鲁棒性和任务成功率。

Comments Extended version of paper submitted to EMNLP, waiting for acceptance

详情
AI中文摘要

尽管最近取得了进展,基于LLM的网络代理仍然面临探索有限、遗漏关键步骤以及对任务约束敏感等问题。先前的研究表明,许多这些失败源于规划中的弱点,但替代自然语言计划表示的影响尚未被探索。为了解决这个问题,我们引入了PlanAhead,一个静态规划器-执行器框架,评估计划表示对代理性能的影响。我们首先将WebArena任务自动分类为3个难度级别,无需人工标注即可实现一致的难度分级。然后,我们在被分类为困难的任务上系统评估了4种不同的计划表示:顺序子目标、叙述、伪代码和检查清单;跨越不同系列的多模态LLM驱动的代理(OpenAI、阿里巴巴和谷歌)。为了解释随机变异性,我们引入了两个新的评估指标:达成率(AR)和解决任务一致性(STC)。我们的结果表明,计划制定和生成计划的底层LLM都显著影响网络代理的鲁棒性和任务成功率。

英文摘要

Despite recent advances, LLM-based web agents still struggle with limited exploration, omission of critical steps, and sensitivity to task constraints. Prior work suggests that many of these failures stem from weaknesses in planning, yet the impact of alternative natural-language plan representations remains unexplored. To address this, we introduce PlanAhead, a static planner-executor framework that evaluates the impact of plan representation on agent performance. We first automatically categorize WebArena tasks into 3 difficulty levels, enabling consistent difficulty grading without human annotation. Then we systematically evaluate 4 different plan representations (sequential subgoals, narrative, pseudocode, and checklist) on the tasks categorized as hard, across different families of multimodal LLM-powered agents (OpenAI, Alibaba, and Google). To account for stochastic variability, we introduce two novel evaluation metrics: Achievement Rate (AR) and Solved-Task Consistency (STC). Our results show that both the plan formulation and the underlying LLM generating the plan significantly influence web-agent robustness and task success.

2605.29926 2026-05-29 cs.LG 版本更新

A Triple-Modal Contrastive Learning Framework with Sequence, Graph, and 3D Features for Drug-Target Interaction Prediction

一种融合序列、图和3D特征的三模态对比学习框架用于药物-靶标相互作用预测

Le Xu, Xi Zhang, Dan Luo, Ting Wang, Xuan Lin

发表机构 * School of Computer Science, Xiangtan University, Xiangtan 411105, China(湘潭大学计算机科学学院)

AI总结 提出TriMod-DTI框架,通过融合药物和蛋白质的1D序列、2D图和3D结构,并采用三模态对比学习策略对齐潜在空间表示,从而提升药物-靶标相互作用预测性能。

Comments 12 pages, 5 figures, ISBRA 2026

详情
AI中文摘要

准确预测药物-靶标相互作用(DTI)对药物发现至关重要。现有方法通常依赖单模态表示(如序列或图)或仅结合两种模态,忽视了3D结构特征。为解决这一挑战,我们提出TriMod-DTI,一种三模态对比学习框架,融合药物和蛋白质的1D序列、2D图和3D结构,获得用于DTI预测的通用且互补的特征表示。我们设计了一个特征提取器,用于捕获三种模态下的药物和靶标特征,从而丰富其表示。我们进一步提出了一种三模态对比学习策略,以在潜在空间中对齐同一药物或蛋白质的不同模态表示。通过构建跨模态的正负样本对,该方法增强了模型的判别能力。在三个基准数据集上的实验表明,TriMod-DTI优于最先进的方法。消融研究验证了每种模态的贡献。此外,案例研究突显了其在DTI预测和药物发现中的实际潜力。

英文摘要

Accurate prediction of drug-target interactions (DTI) is critical for drug discovery. Existing methods often rely on single-modal representations (e.g., sequences or graphs) or combine only two modalities, overlooking 3D structural features. To address this challenge, we propose TriMod-DTI, a triple-modal contrastive learning framework that incorporates 1D sequences, 2D graphs, and 3D structures of drugs and proteins, obtaining the universal and complementary feature representations for DTI prediction. We design a Feature Extractor to capture drug and target features across the three modalities, thereby enriching their representations. We further propose a triple-modal contrastive learning strategy to align different modal representations of the same drug or protein in the latent space. By constructing cross-modal positive and negative sample pairs, this approach enhances the model's discriminative ability. Experiments on three benchmark datasets demonstrate that TriMod-DTI outperforms state-of-the-art methods. The ablation studies validate the contributions of each modality. Moreover, case studies highlight its practical potential for DTI prediction and drug discovery.

2605.29913 2026-05-29 cs.IT cs.LG math.IT 版本更新

Gesture-Aware Indoor THz ISAC Systems for Adaptive Resource Allocation

基于手势感知的室内太赫兹ISAC系统自适应资源分配

Zhonghao Liu, Yinchao Yang, Yahao Ding, Yixuan Wang, Mohammad Shikh-Bahaei

发表机构 * King’s College London(伦敦国王学院)

AI总结 针对太赫兹频段多用户室内集成感知与通信系统,提出基于扩展卡尔曼滤波手势跟踪的自适应联合优化算法,通过动态调整功率分配和波束赋形,在满足手势相关通信服务质量约束下最大化感知信干噪比。

Comments 6 pages, 4 figures, conference (submitted to PIMRC)

详情
AI中文摘要

本文研究了一种在太赫兹频段运行的多用户室内集成感知与通信系统,该系统设计用于基于手势识别的自适应通信。通过扩展卡尔曼滤波器进行手势跟踪,接入点根据检测到的手势变化动态调整资源分配,从而提高感知精度。基于手势识别结果,接入点进一步更新不同用户的通信质量需求,实现高效的资源分配。为此,开发了一种功率分配和波束赋形的自适应联合优化算法,在满足手势相关的通信服务质量约束下,最大化整体感知信干噪比。仿真结果表明,与传统的单变量优化基线相比,所提方法能有效响应手势动态,实现更优的感知精度和通信性能。

英文摘要

This paper investigates a multi-user indoor integrated sensing and communication (ISAC) system operating in the terahertz (THz) band, designed for adaptive communication based on gesture recognition. Leveraging gesture tracking through an extended Kalman filter (EKF), the access point (AP) dynamically adjusts resource allocation in response to detected gesture variations, thereby improving sensing accuracy. Based on the gesture recognition results, the AP further updates the communication quality requirements of different users, enabling efficient resource allocation. To this end, an adaptive joint optimization algorithm for power allocation and beamforming is developed to maximize the overall sensing signal-to-interference-plus-noise ratio (SINR) while satisfying the gesture-dependent communication quality of service (QoS) constraints. Simulation results demonstrate that the proposed method effectively responds to gesture dynamics, achieving superior sensing accuracy and communication performance compared with conventional single-variable optimization baselines.

2605.29911 2026-05-29 cs.LG cs.CV 版本更新

Reducing Experimental Testing in Space Propulsion Film Cooling Analyses by Pixelwise Generative Image Interpolation

通过逐像素生成图像插值减少空间推进薄膜冷却分析中的实验测试

Adam T. Müller, Philipp J. Teuffel, Konstantin Manassis, Nicolaj C. Stache

发表机构 * Heilbronn University of Applied Sciences(海尔布隆应用科学大学) Center for Machine Learning(机器学习中心) Max-Planck-Str. 39(马克斯-普朗克街39号) German Aerospace Center (DLR)(德国航空航天中心(DLR)) Institute of Space Propulsion(空间推进研究所)

AI总结 提出一种基于轻量级前馈神经网络和位置编码的机器学习方法,从稀疏实验测量中进行图像回归,以减少推进系统薄膜冷却研究中的物理测试需求。

Comments Presented at the 11th European Conference for Aeronautics and Aerospace Sciences (EUCASS), 2025, DOI: 10.13009/EUCASS2025-285

详情
AI中文摘要

我们提出了一种从稀疏实验测量中进行图像回归的机器学习方法。我们展示了该方法在推进系统开发中薄膜冷却研究中的应用,旨在减少对大量物理测试的需求。我们的方法采用带有位置编码的轻量级前馈神经网络,根据输入参数生成图像。在真实和合成数据上的验证表明,该方法在减少30%测量量的同时,实现了高图像相似度(RMSE < 8%,SSIM > 93%)。我们进一步提出了一种知识驱动的扩展,用于生成图像的局部适应性。该方法显著减少了所需测试次数,同时保持了高质量数据,从而能够高效优化冷却剂喷射器配置,其应用范围超越航空航天领域。

英文摘要

We propose a machine learning approach for image regression from sparse experimental measurements. We show the application of the proposed method on film cooling studies in propulsion system development, aiming to reduce the need for extensive physical testing. Our method employs a lightweight feed-forward neural network with positional encoding to generate images conditioned on input parameters. Validated on real and synthetic data, it achieves high image similarity (RMSE < 8%, SSIM > 93%) while maintaining accuracy with a 30% reduction of measurements. We further propose a knowledge-informed extension for local adaptability of the generated images. This approach significantly reduces required tests while preserving high-quality data, enabling efficient optimization of coolant injector configurations with applications beyond aerospace.

2605.29908 2026-05-29 stat.ML cs.LG 版本更新

Joint Model and Data Sparsification via the Marginal Likelihood

通过边际似然进行联合模型与数据稀疏化

Alexander Timans, Thomas Möllenhoff, Christian A. Naesseth, Mohammad Emtiyaz Khan, Eric Nalisnick

发表机构 * RIKEN Center for AI Project, Tokyo, Japan(日本东京RIKEN人工智能项目中心) Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系)

AI总结 提出通过边际似然联合学习特征和样本相关性,实现同时模型与数据稀疏化的贝叶斯方法,在保持共轭性和闭式更新的同时提升鲁棒性。

Comments 36 pages, 8 figures, 12 tables (incl. appendix); published at ICML 2026

详情
AI中文摘要

线性系统中的稀疏恢复支撑着从信号处理到高维回归的应用。基于自动相关性确定(ARD)原理的稀疏贝叶斯学习,通过边际似然优化为特征稀疏性提供了一种实用的贝叶斯机制。然而,其对同方差噪声模型的依赖使其对数据污染(如异常值或错误指定的噪声)敏感,损害了模型拟合和预测。相反,我们提出联合学习个体特征和样本相关性,通过单一贝叶斯目标实现同时模型与数据稀疏化。这种模型和数据的对称剪枝提供了一种自然扩展,保持了共轭性,允许标准优化过程的闭式更新,并与鲁棒回归和影响函数的观点一致。跨多种回归任务的实证结果证实,联合ARD方法一致地产生稀疏且鲁棒的预测模型。

英文摘要

Sparse recovery in linear systems underpins applications from signal processing to high-dimensional regression. Sparse Bayesian Learning, grounded in the principle of automatic relevance determination (ARD), offers a practical Bayesian mechanism for feature sparsity via marginal likelihood optimization. Yet, its reliance on a homoscedastic noise model renders it sensitive to data contaminations such as outliers or misspecified noise, harming model fit and predictions. Instead, we propose jointly learning individual feature and sample relevancies, enabling simultaneous model and data sparsification via a single Bayesian objective. This symmetric pruning of model and data offers a natural extension that preserves conjugacy, admits closed-form updates for standard optimization procedures, and aligns with perspectives from robust regression and influence functions. Empirical results across diverse regression tasks affirm that a joint ARD approach consistently yields both sparse and robust prediction models.

2605.29901 2026-05-29 cs.CR cs.LG 版本更新

Dissecting the Black Box: Circuit-Level Analysis of LLM Vulnerability Detection

剖析黑箱:LLM 漏洞检测的电路级分析

Syafiq Al Atiiq, Chun Zhou, Christian Gehrmann

发表机构 * Lund University(隆德大学)

AI总结 通过机械可解释性分析 Gemma-2-2b 模型在 C/C++ 漏洞检测中的内部计算,发现模型主要依赖安全检测器(识别安全编码模式的注意力头)而非直接检测漏洞特征,并识别出关键神经组件(早期层注意力头和 MLP 神经元),通过消融实验验证其因果作用。

Comments 11 pages, 6 figures. Supported by the Wallenberg AI, Autonomous Systems and Software Program (WASP)

详情
AI中文摘要

大型语言模型(LLM)能够检测软件漏洞,但它们实际上是如何识别易受攻击的代码的呢?我们利用机械可解释性来回答这个问题;分析神经网络的内部计算以理解其推理过程。通过在 Gemma-2-2b 上使用 Circuit Tracer,我们追踪了模型将 472 个 C/C++ 代码样本分类为易受攻击或安全时所激活的计算路径。我们的分析揭示了一个令人惊讶的发现:模型主要依赖安全检测器(即识别安全编码模式的注意力头),而不是直接检测漏洞特征。当这些安全检测器未能激活时,模型将代码分类为易受攻击。我们识别出了关键的神经组件:早期层(L5、L7)中专注于安全模式的特定注意力头,以及第 7 层中编码漏洞相关特征的多层感知器(MLP)神经元。消融实验证实了它们的因果作用;移除第 11 层会使漏洞检测准确率从 100% 降至 6%,而仅消融第 7 层中的 20 个神经元就会使其降低 50%。我们的发现表明,LLM 漏洞检测使用了稀疏、可解释的电路(仅占模型容量的 16%),从而能够为安全预测提供电路级解释,并有针对性地改进检测系统。

英文摘要

Large language models (LLMs) can detect software vulnerabilities, but how do they actually identify vulnerable code? We address this question using mechanistic interpretability: analyzing the internal computations of a neural network to understand its reasoning process. Using Circuit Tracer on Gemma-2-2b, we trace the computational pathways activated when the model classifies 472 C/C++ code samples as vulnerable or safe. Our analysis reveals a surprising finding: the model primarily relies on safety detectors, attention heads that recognize safe coding patterns, rather than directly detecting vulnerability signatures. When these safety detectors fail to activate, the model classifies code as vulnerable. We identify the critical neural components: specific attention heads in early layers (L5, L7) that focus on safety patterns, and Multilayer Perceptron (MLP) neurons in Layer 7 that encode vulnerability-related features. Ablation experiments confirm their causal role; removing Layer 11 drops vulnerability detection accuracy from 100% to 6%, while ablating just 20 neurons in Layer 7 reduces it by 50%. Our findings show that LLM vulnerability detection uses sparse, interpretable circuits (only 16% of model capacity), enabling circuit-level explanations for security predictions and targeted improvements to detection systems.

2605.29900 2026-05-29 cs.LG cs.IT math.IT 版本更新

OVA-IB: One vs All Information Bottleneck for Multi-Modal Alignment

OVA-IB:用于多模态对齐的一对多信息瓶颈

Tianchao Li, Shujian Yu, Xinrui Zu, Zhaolong Wei, Jeremy Gummeson, Jack C. P. Cheng, Robert Jenssen

发表机构 * Hong Kong University of Science and Technology(香港科技大学) Vrije Universiteit Amsterdam(阿姆斯特丹自由大学) UiT – The Arctic University of Norway(挪威北极大学) University of Copenhagen(哥本哈根大学) Norwegian Computing Center(挪威计算中心) University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校)

AI总结 提出基于信息瓶颈的一对多对齐框架OVA-IB,通过充分性对比下界和最小性正则化实现任意数量模态的对齐,在分类、回归和跨模态检索任务中表现鲁棒。

详情
AI中文摘要

对比学习对于对齐配对视图或模态是有效的,但超出两个模态的对齐仍然具有挑战性且相对未被充分探索。成对的CLIP风格损失将多模态对齐分解为独立的双向比较,因此没有显式建模多个模态之间的高阶依赖关系。最近的超越成对目标从统计或几何角度处理这个问题,但任意模态对齐仍然缺乏一个原则性的标准来定义每个模态相对于其他模态应该保留和压缩什么。我们通过信息瓶颈原则重新审视任意模态对齐。在多模态学习中,充分性应保留可从其余模态预测的信息,而最小性应压缩不被其余模态支持的模态特定信息。这自然导致一对多视角,其中每个模态相对于其余模态进行表征。我们提出OVA-IB,一个用于任意模态对齐的信息瓶颈框架。OVA-IB优化一个可处理的一对多对比下界用于充分性,该下界与双总相关风格目标相连,使用无参数的几何感知投影分数,并通过用其余模态诱导的表示分布来约束每个表示对其自身输入的依赖,导出一个可处理的上界正则化器用于最小性。在分类、回归、模态无关评估和跨模态检索基准上的实验展示了强大且鲁棒的性能。

英文摘要

Contrastive learning is effective for aligning paired views or modalities, but alignment beyond two modalities remains non-trivial and comparatively underexplored. Pairwise CLIP-style losses decompose multi-modal alignment into independent two-way comparisons and therefore do not explicitly model higher-order dependencies among multiple modalities. Recent beyond-pairwise objectives approach this problem from statistical or geometric perspectives, but arbitrary-modality alignment still lacks a principled criterion for defining what each modality should preserve and compress relative to the others. We revisit arbitrary-modality alignment through the Information Bottleneck principle. In multi-modal learning, sufficiency should preserve information predictable from the remaining modalities, while minimality should compress modality-specific information not supported by them. This naturally leads to a One-vs-All view, where each modality is characterized with respect to the remaining modalities. We propose OVA-IB, an Information Bottleneck framework for arbitrary-modality alignment. OVA-IB optimizes a tractable One-vs-All contrastive lower bound for sufficiency connected to a Dual Total Correlation-style objective, uses a parameter-free geometry-aware projection score, and derives a tractable upper-bound regularizer for minimality by bounding each representation's dependence on its own input with representation distributions induced by the remaining modalities. Experiments on classification, regression, modality-agnostic evaluation, and cross-modal retrieval benchmarks demonstrate strong and robust performance.

2605.29888 2026-05-29 cs.LG cs.AI 版本更新

LaRA: Layer-wise Representation Analysis for Detecting Data Contamination in RL Post-Training

LaRA: 面向RL后训练中数据污染的逐层表示分析

Minju Gwak, Minseo Kwak, Dongseok Lee, Guijin Son, Alan Ritter, Jaehyung Kim

发表机构 * Yonsei University(延世大学) Seoul National University(首尔国立大学) Georgia Institute of Technology(佐治亚理工学院)

AI总结 提出LaRA框架,通过逐层表示分析检测强化学习后训练中的污染数据,利用扰动敏感性、方向坍缩和局部表示刚性三个指标,优于现有输出级方法。

Comments Work in Progress

详情
AI中文摘要

强化学习(RL)后训练已被证明能提升大型语言模型(LLMs)的推理能力。然而,关于RL后训练中数据污染问题的探索很少,这可能损害训练过程本身的泛化能力和评估可靠性。现有的检测方法主要依赖于输出级信号,如似然或熵,这对于RL训练的模型变得不可靠,因为RL通过轨迹级奖励而非token似然来塑造行为。我们提出LaRA,一个用于检测RL后训练LLMs中数据污染的逐层表示分析框架。LaRA引入了三个互补指标,测量受控扰动下的扰动敏感性、方向坍缩和局部表示刚性。我们发现污染会在各层产生渐进式的几何偏差,包括放大的扰动敏感性、更强的方向坍缩和增强的局部刚性。基于我们的发现,我们还开发了一个污染检测协议,聚合跨层和跨指标的表示级偏差。在RL训练推理模型上的实验表明,我们的协议在污染检测方面优于现有的输出级基线。

英文摘要

Reinforcement learning (RL) post-training has been shown to improve reasoning in large language models (LLMs). However, there has been little exploration of the problem of data contamination in RL post-training, potentially undermining the generalization and evaluation reliability of the training process itself. Existing detection methods primarily rely on output-level signals such as likelihood or entropy, which become unreliable for RL-trained models since RL shapes behavior through trajectory-level rewards rather than token likelihoods. We propose LaRA, a layer-wise representation analysis framework for detecting contamination in RL post-trained LLMs. LaRA introduces three complementary metrics, measuring perturbation sensitivity, directional collapse, and local representation rigidity under controlled perturbations. We find that contamination produces progressive geometric deviations across layers, including amplified perturbation sensitivity, stronger directional collapse, and enhanced local rigidity. Based on our findings, we also develop a contamination detection protocol that aggregates representation-level deviations across layers and metrics. Experiments on RL-trained reasoning models show that our protocol outperforms existing output-level baselines for contamination detection.

2605.29885 2026-05-29 cs.LG cond-mat.dis-nn math.OC math.RT stat.ML 版本更新

Open Problem: Separating Geometric and Algorithmic Compression via Cayley-Table Completion

开放问题:通过凯莱表完成分离几何压缩与算法压缩

Dongsung Huh

发表机构 * Dongsung Huh

AI总结 提出凯莱表完成作为测试缺失的算法复杂度最小化归纳偏置的规范问题,并挑战社区将连续平坦性先验推广以自主发现离散算法公理。

Comments 6 pages. Submitted to the Conference on Learning Theory (COLT) 2026 Open Problem track

详情
AI中文摘要

现代统计学习理论和深度学习主要从连续容量控制(如基于范数的正则化、间隔最大化、低秩偏置)的角度来表征泛化。虽然在连续领域非常成功,但深度学习始终无法外推精确的算法或离散代数规则,这反映出缺失了向算法复杂度最小化的归纳偏置。我们提出凯莱表完成作为这一缺失偏置的规范测试平台,作为矩阵完成的离散代数对应物。正如矩阵分解结合权重衰减产生对低线性秩的隐式几何偏置,最近的结果表明,算子值张量分解结合平坦性先验产生对精确离散结合性的隐式算法偏置。我们提出了为凯莱表建立形式化精确恢复界限的开放问题,并挑战社区将连续平坦性先验推广,以自主发现更广泛的离散算法公理,而无需组合搜索。

英文摘要

Modern statistical learning theory and deep learning characterize generalization primarily in terms of continuous capacity control (e.g., norm-based regularization, margin maximization, low-rank bias). While highly successful in continuous domains, deep learning consistently fails to extrapolate exact algorithmic or discrete algebraic rules, reflecting a missing inductive bias toward algorithmic complexity minimization. We propose the Cayley-table completion as the canonical testbed for this missing bias, serving as the discrete algebraic counterpart to matrix completion. Just as matrix factorization combined with weight decay yields an implicit geometric bias toward low linear rank, recent results demonstrate that operator-valued tensor factorizations paired with a flatness prior yield an implicit algorithmic bias toward exact discrete associativity. We pose the open problem of establishing formal exact recovery bounds for Cayley-table completion, and challenge the community to generalize continuous flatness priors to autonomously discover broader discrete algorithmic axioms without combinatorial search.

2605.29863 2026-05-29 cs.LG 版本更新

STAP: A Shuffle-Tokenized App Predictor with Ultra Long Context for Vocabulary-Free Mobile App Prediction

STAP: 一种基于洗牌令牌化的超长上下文无词汇表移动应用预测器

Chengyu Fan, Hang Liu

发表机构 * School of Nuclear Science and Technology, University of Science and Technology of China(中国科学技术大学核科学技术学院) Department of Statistics and Finance, University of Science and Technology of China(中国科学技术大学统计与金融系)

AI总结 提出STAP模型,通过洗牌机制将应用身份替换为虚拟索引,并利用超长上下文处理行为序列,实现无固定词汇表的跨数据集零样本移动应用预测。

Comments 15 pages, 9 figures, 5 tables. Preprint submitted to Expert Systems with Applications

详情
AI中文摘要

预测用户将启动的下一个移动应用对于智能设备资源管理和主动辅助至关重要。现有模型依赖于固定的应用词汇表,这阻碍了它们在不同应用生态系统中的泛化能力。许多模型还依赖于用户特定知识,这使冷启动场景下的部署复杂化。我们提出STAP,一种基于Transformer的模型,消除了对固定词汇表的需求。STAP通过洗牌机制将真实应用身份替换为随机重新分配的虚拟索引,并通过超长上下文设计处理行为序列来补偿丢弃的语义信息。理论分析表明,在给定足够长的上下文的情况下,尽管映射是匿名的,预测分布仍收敛到正确分布。在两个来自不同大陆的数据集上的实验表明,STAP实现了强大的跨数据集零样本预测准确性——这是所有现有固定词汇表方法本质上不适用的情况——同时其在每个数据集内的冷启动性能与领先模型保持竞争力。此外,我们引入了一种部署策略,使模型在连续推理期间能够保持足够长的上下文,同时将延迟控制在可接受范围内。

英文摘要

Predicting the next mobile application a user will launch is essential for intelligent device resource management and proactive assistance. Existing models rely on fixed app vocabularies, which prevents them from generalizing across different app ecosystems. Many also depend on user-specific knowledge, which complicates deployment in cold start scenarios. We propose STAP, a Transformer-based model that eliminates the need for a fixed vocabulary. STAP replaces true app identities with randomly reassigned virtual indices via a shuffle mechanism, and compensates for discarded semantic information by processing behavioral sequences with an ultra-long context design. A theoretical analysis shows that, given a sufficiently long context, the predicted distribution converges to the correct one despite the anonymity of the mapping. Experiments on two datasets from different continents demonstrate that STAP achieves strong cross-dataset zero-shot prediction accuracy -- a setting where all existing fixed-vocabulary methods are inherently inapplicable -- while its cold start performance within each dataset remains competitive with leading models. Furthermore, we introduce a deployment strategy that enables the model to retain a sufficiently long context during continuous inference while keeping latency within acceptable bounds.
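The shuffle mechanism described in the STAP abstract can be illustrated with a small sketch. This is only an illustration of the idea of replacing true app identities with randomly reassigned virtual indices; the function name `shuffle_tokenize`, the parameter `num_virtual`, and the choice to reshuffle once per sequence are assumptions, not details taken from the paper.

```python
import random


def shuffle_tokenize(sequence, num_virtual, rng):
    """Replace app identities with randomly assigned virtual indices.

    Hypothetical helper: each distinct app in `sequence` receives a
    distinct index drawn at random from range(num_virtual), so the
    model never sees a fixed app vocabulary.
    """
    apps = list(dict.fromkeys(sequence))  # distinct apps, first-seen order
    if len(apps) > num_virtual:
        raise ValueError("more distinct apps than virtual indices")
    indices = rng.sample(range(num_virtual), len(apps))
    mapping = dict(zip(apps, indices))
    return [mapping[app] for app in sequence], mapping


# The same app always maps to the same index within one sequence,
# but the index itself carries no semantic meaning.
history = ["mail", "maps", "mail", "chat"]
tokens, mapping = shuffle_tokenize(history, num_virtual=8, rng=random.Random(0))
```

Because the mapping is anonymous, a prediction is an index that must be mapped back through the inverse of `mapping` to obtain an app name; the abstract attributes the recovery of the lost semantics to the ultra-long behavioral context.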

2605.29860 2026-05-29 cs.LG cs.AI 版本更新

ESPO: Early-Stopping Proximal Policy Optimization

ESPO:早期停止的近端策略优化

Zihang Li, Rui Zhou, Yingcheng Shi, Wenhan Yu, Zhewen Tan, Zixiang Liu, Zeming Li, Binhua Li, Yongbin Li, Tong Yang, Jieping Ye

发表机构 * Tongyi Lab(通义实验室) Alibaba Group(阿里巴巴集团) Peking University(北京大学)

AI总结 提出ESPO算法,通过在强化学习训练大语言模型时在线检测轨迹失败并提前终止,节省计算资源并提升数学推理性能。

详情
AI中文摘要

当大语言模型在强化学习过程中,在轨迹早期出现错误的推理步骤时,标准算法会强制其继续生成直到最大步长,从而在从未获得正奖励的令牌上浪费计算资源,并用失败后的噪声污染优势估计。我们提出ESPO(早期停止的近端策略优化),该算法能够在线检测轨迹失败并提前终止轨迹生成。在每个生成步骤中,ESPO仅利用采样过程中已计算出的logits计算一个替代遗憾值,并在平滑累积遗憾值显著超过其估计值时终止。截断轨迹被视为具有终止奖励的吸收失败状态,将负的时间差分误差集中在检测到的失败步骤附近,无需任何额外的奖励模型或人工标注。在基于DeepSeek-R1-Distill-Qwen-7B训练的数学推理任务上,ESPO在AIME 2024(46.28% vs. 45.25%)、AMC 2023(85.83% vs. 82.94%)和MATH-500(87.42% vs. 85.43%)上超越了PPO,同时累计节省了超过20%的轨迹生成令牌。

英文摘要

When a large language model under reinforcement learning commits a wrong reasoning step early in a trajectory, standard algorithms force it to keep generating until the maximum horizon, spending compute on tokens that never receive positive reward and polluting advantage estimates with post-failure noise. We propose ESPO (Early-Stopping Proximal Policy Optimization), which detects trajectory failure on-the-fly and terminates rollouts early. At each generation step, ESPO computes a surrogate regret using only the logits already computed during sampling, and terminates when the smoothed cumulative regret significantly exceeds its estimated values. Truncated trajectories are treated as absorbing failure states with a terminal reward, concentrating negative temporal-difference (TD) errors near the detected failure step without any additional reward model or human annotation. On DeepSeek-R1-Distill-Qwen-7B trained for mathematical reasoning, ESPO surpasses PPO on AIME 2024 (46.28% vs. 45.25%), AMC 2023 (85.83% vs. 82.94%), and MATH-500 (87.42% vs. 85.43%), while saving more than 20% rollout tokens cumulatively.
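The early-stopping rule in the ESPO abstract can be sketched in a few lines. The abstract does not define the surrogate regret, the smoothing, or the stopping threshold, so everything concrete here is an assumption made for illustration: the regret is taken as the log-probability gap between the most likely token and the sampled token, the smoothing is an exponential moving average, and `threshold` is a fixed hypothetical constant rather than an estimated value.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def early_stop_step(step_logits, sampled_tokens, threshold, alpha=0.9):
    """Return the first step at which the rollout would be cut, else None.

    Assumed surrogate regret: log p(best token) - log p(sampled token),
    which is 0 whenever the sampled token is the most likely one.
    """
    smoothed = 0.0
    for t, (logits, tok) in enumerate(zip(step_logits, sampled_tokens)):
        logp = log_softmax(logits)
        regret = max(logp) - logp[tok]
        smoothed = alpha * smoothed + (1.0 - alpha) * regret
        if smoothed > threshold:
            return t
    return None
```

The sketch reuses only the per-step logits, which matches the abstract's point that no extra reward model is needed; how the truncated trajectory is then rewarded is not covered here.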

2605.29857 2026-05-29 cs.LG 版本更新

Feedback-to-Rubrics: Can We Learn Expert Criteria from Inline Comments?

从内联评论到评分标准:我们能从内联评论中学习专家标准吗?

Kotaro Yoshida, So Kuroki, Yuki Imajuku, Taishi Nakamura, Ryunosuke Iwai, Haruki Goda, Takuya Akiba

发表机构 * Sakana AI Institute of Science Tokyo(东京科学大学)

AI总结 提出从内联评论中学习可复用的自然语言评分标准的方法,通过迭代优化评分标准来预测评论并支持自动修订。

详情
AI中文摘要

大型语言模型(LLMs)越来越多地用于写作和审阅支持,但其有用性取决于上下文相关的标准,例如专家偏好或组织特定惯例,这些标准通常是隐性的、未记录的,并且难以直接获取。我们提出了一个从人工撰写或LLM生成的草稿等工件上积累的内联评论中学习可复用的自然语言评分标准的问题设定。我们的方法从这些评论中推断出评分标准,并通过观察基于评分标准的预测与参考评论之间的逐条评论不匹配来迭代优化它们。我们在真实世界的审阅设置和具有参考评分标准的受控设置中评估了所提出的方法。这些结果表明,内联评论可以提炼为可复用的评分标准,支持评论预测、评分标准理解和自动工件修订。

英文摘要

Large language models (LLMs) are increasingly used for writing and review support, but their usefulness depends on context-dependent criteria, such as expert preferences or organization-specific conventions, that are often tacit, undocumented, and difficult to elicit directly. We propose a problem setting for learning reusable natural-language rubrics from accumulated inline comments on artifacts such as human-written or LLM-generated drafts. Our method infers rubrics from these comments and iteratively refines them by observing comment-wise mismatches between rubric-conditioned predictions and reference comments. We evaluate the proposed method in real-world review settings and in controlled settings with reference rubrics. These results show that inline comments can be distilled into reusable rubrics that support comment prediction, rubric understanding, and automatic artifact revision.

2605.29850 2026-05-29 cs.LG 版本更新

MIRAGE: Adaptive Multimodal Gating for Whole-Brain fMRI Encoding

MIRAGE:用于全脑fMRI编码的自适应多模态门控

Abdulkadir Gokce, Badr AlKhamissi, Martin Schrimpf

发表机构 * Qwen3-Omni-30B-A3B-Thinking(通义千问3- Omni-30B-A3B-Thinking)

AI总结 提出MIRAGE框架,通过原生多模态骨干网络和自适应特征门控,实现全脑fMRI对自然视听刺激的高精度编码,并证明原生多模态特征优于后期融合的单模态特征。

Comments Preprint. First two authors contributed equally

详情
AI中文摘要

近期任务优化神经网络的进展已将编码模型确立为预测大脑对自然刺激反应的有力工具,然而现有方法大多依赖单模态表示。全模态基础模型和丰富的多模态神经数据集的出现,使得能够联合整合跨被试的视觉、听觉和语言信息的编码模型成为可能。我们提出MIRAGE,一个用于预测全脑fMRI对自然视听刺激反应的脑编码框架。MIRAGE通过原生多模态骨干网络和跨层自适应特征门控实现了最先进的性能。这些表示随后与基于transformer的脑编码器和跨皮层分区的被试特定线性头相结合。控制比较表明,原生多模态特征在架构层次和骨干网络上始终优于独立单模态特征的事后聚合。除了预测准确性,学习的注意力权重可直接检查以解释骨干网络上的模态特定门控分布,每种模态在皮层上描绘出不同的解剖模式。综合这些结果,提出了原生多模态特征的自适应逐层聚合作为全脑编码的一种可泛化、可解释且准确的方法。

英文摘要

Recent progress in task-optimized neural networks has established encoding models as a powerful tool for predicting brain responses to naturalistic stimuli, yet most existing approaches rely on unimodal representations. The emergence of omni-modal foundation models and rich multimodal neural datasets enables encoding models that jointly integrate visual, auditory, and linguistic information across subjects. We introduce MIRAGE, a brain encoding framework for predicting whole-brain fMRI responses to naturalistic audiovisual stimuli. MIRAGE achieves state-of-the-art performance via a native multimodal backbone and adaptive feature gating across layers. These representations are then combined with a transformer-based brain encoder and a subject-specific linear head over the cortical parcels. Controlled comparisons show that natively multimodal features consistently outperform post-hoc aggregation of independent unimodal features, across architectural levels and backbones. Beyond predictive accuracy, the learned attention weights are directly inspectable to interpret the modality-specific gating profile over the backbone, and each modality traces a distinct anatomical pattern across cortex. Together, these results propose adaptive layer-wise aggregation of natively multimodal features as a generalizable, interpretable, and accurate approach for whole-brain encoding.

2605.29849 2026-05-29 eess.SY cs.LG cs.SY 版本更新

BuilDyn: Excitation-Driven Data Generation for Building Thermal Dynamics Modeling and Control

BuilDyn: 面向建筑热动力学建模与控制的激励驱动数据生成

Felix Koch, Thomas Krug, Fabian Raisch, Benjamin Schäfer, Benjamin Tischler

发表机构 * Technical University of Applied Sciences Rosenheim(罗森海姆应用技术大学) Technical University of Munich(慕尼黑工业大学) Karlsruhe Institute of Technology(卡尔斯鲁厄理工学院)

AI总结 本文提出BuilDyn包,通过可定制的激励策略生成控制导向的建筑数据,提升机器学习模型对未见工况的鲁棒性。

详情
AI中文摘要

机器学习越来越多地用于建筑的数据驱动建模,以实现故障检测与诊断、节能控制等下游任务。虽然最近的工作改善了跨建筑特性、天气和占用率的泛化能力,但泛化也依赖于对控制驱动系统状态空间的充分探索。现有的真实世界数据集和仿真环境主要反映固定控制策略下的稳态运行,导致激励有限,对未见工况的鲁棒性降低。本文介绍了基于BuilDa的BuilDyn包,该包支持可定制的激励策略用于控制导向的数据生成。BuilDyn还支持从代表性建筑分布中采样,并提供Python接口以便轻松集成到机器学习流水线中。我们通过比较在非激励和激励数据上训练的数据驱动ML模型在一栋建筑上的性能,展示了BuilDyn的优势。借助BuilDyn,我们希望推进可扩展的控制导向建模,并支持迁移学习和建筑特定基础模型等未来方向。

英文摘要

Machine learning (ML) is increasingly used for data-driven modeling of buildings to enable downstream tasks such as fault detection and diagnosis, and energy-efficient control. While recent work improves generalization across building characteristics, weather, and occupancy, generalization also depends on sufficient exploration of the control-driven system state space. Existing real-world datasets and simulation environments predominantly reflect stationary operation under fixed control policies, resulting in limited excitation and reduced robustness to unseen operating conditions. This paper introduces BuilDyn, a package based on BuilDa that enables customizable excitation strategies for control-oriented data generation. BuilDyn further supports sampling from representative building distributions and provides a Python interface for easy integration into machine learning pipelines. We demonstrate the benefits of BuilDyn by comparing the performance of data-driven ML models trained on non-excited and excited data for one building. With BuilDyn, we hope to advance scalable control-oriented modeling and support future directions such as transfer learning and building-specific foundation models.

2605.29843 2026-05-29 cs.LG cs.AI 版本更新

HARP: Hadamard-Preconditioned Adaptive Rotation Processor for Extreme LLM Quantization

HARP: 哈达玛预条件自适应旋转处理器用于极端LLM量化

Artur Zagitov, Gleb Molodtsov, Aleksandr Beznosikov

发表机构 * BRAIn Lab(BRAIn实验室)

AI总结 提出HARP,一种可学习的结构化双正交处理器,替代固定随机哈达玛变换,通过自适应旋转基来改善极端低位量化中的激活异常值和各向异性权重曲率问题,在2-4比特设置下提升困惑度和零样本准确率,并保持部署效率。

详情
AI中文摘要

后训练量化(PTQ)对于在内存和带宽约束下部署LLM至关重要。然而,极端低位量化仍然对激活异常值和各向异性权重曲率高度敏感。现有的基于非相干性的PTQ方法通过固定的随机哈达玛变换(RHT)缓解了这一问题,这提高了量化鲁棒性,但无法将旋转基适应于层、校准分布或量化器。我们引入了HARP(哈达玛预条件自适应旋转处理器),一种可学习的结构化双正交处理器,它替代了固定的哈达玛混合,同时保留了精确的全精度等价性。HARP将每个旋转表示为稀疏蝶形类块正交阶段的乘积,通过混合基数调度支持非2的幂次维度,并初始化为RHT处理器(最多一个固定排列)。仅在校准数据上拟合,HARP将量化基适应于每一层和后端。在从1B到70B参数的模型的2-4比特设置中,HARP在困惑度和零样本准确率上优于固定RHT。重要的是,HARP保持了部署效率,达到128 tok/s,而FP16为61 tok/s。

英文摘要

Post-training quantization (PTQ) is essential for deploying LLMs under memory and bandwidth constraints. However, extreme low-bit quantization remains highly sensitive to activation outliers and anisotropic weight curvature. Existing incoherence-based PTQ methods mitigate this issue with fixed randomized Hadamard transforms (RHTs), which improve quantization robustness but cannot adapt the rotated basis to the layer, calibration distribution, or quantizer. We introduce HARP (Hadamard-preconditioned Adaptive Rotation Processor), a learnable structured two-sided orthogonal processor that replaces fixed Hadamard mixing while preserving exact full-precision equivalence. HARP represents each rotation as a product of sparse butterfly-like block-orthogonal stages, supports non-power-of-two dimensions via Mixed-Radix schedules, and initializes to the RHT processor up to a fixed permutation. Fitted only on calibration data, HARP adapts the quantization basis to each layer and backend. Across 2-4 bit settings on models ranging from 1B to 70B parameters, HARP improves perplexity and zero-shot accuracy over fixed RHT. Importantly, HARP preserves deployment efficiency, reaching 128 tok/s versus 61 tok/s for FP16.
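For readers unfamiliar with the fixed randomized Hadamard transform (RHT) that HARP replaces, the following sketch shows the baseline idea only, not HARP's learnable processor. It assumes a power-of-two dimension and the standard construction Q = H·diag(signs)/sqrt(n); the function names are hypothetical.

```python
import math
import random


def rht(x, signs):
    """Apply Q = H * diag(signs) / sqrt(n) via a fast Walsh-Hadamard transform."""
    n = len(x)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    y = [s * v for s, v in zip(signs, x)]
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = y[j], y[j + h]
                y[j], y[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in y]


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


rng = random.Random(0)
n = 8
signs = [rng.choice([-1.0, 1.0]) for _ in range(n)]
w = [rng.gauss(0.0, 1.0) for _ in range(n)]   # one weight row
x = [8.0] + [0.0] * (n - 1)                   # activation with a single outlier
x_rot, w_rot = rht(x, signs), rht(w, signs)
# Q is orthogonal, so dot(w_rot, x_rot) equals dot(w, x) up to rounding,
# while the outlier in x is spread evenly across all coordinates of x_rot.
```

Rotating both the weights and the activations by the same orthogonal matrix leaves the full-precision output unchanged, which is the equivalence the abstract says HARP preserves while learning the rotation instead of fixing it.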

2605.29836 2026-05-29 cs.LG cs.AI stat.ML 版本更新

CB-SLICE: Concept-Based Interpretable Error Slice Discovery

CB-SLICE: 基于概念的可解释错误切片发现

Yael Konforti, Mateo Espinosa Zarlenga, Elaf Almahmoud, Mateja Jamnik

发表机构 * Department of Computer Science and Technology, University of Cambridge, Cambridge, UK(计算机科学与技术系,剑桥大学,剑桥,英国) Trinity College, University of Oxford, Oxford, UK(牛津大学三一学院,牛津,英国) Cambridge Institute for Technology and Humanity, Cambridge, UK(剑桥技术与人类研究所,剑桥,英国)

AI总结 提出CB-SLICE方法,利用概念瓶颈模型的概念预测失败来发现错误切片,并通过关键词概念解释失败模式,优于现有方法。

Comments 20 pages, 7 figures, 12 tables, to be published at Proceedings of the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

尽管平均性能强劲,深度学习模型在特定人群组(称为错误切片)上常表现出系统性错误。识别这些组及其失败的根本原因对于模型调试和偏差缓解至关重要。然而,现有的错误切片发现方法(SDMs)通常生成与模型推理过程脱节的解释,因此只能近似潜在错误源,可能不准确。我们通过利用概念瓶颈模型(CBMs)来解决这一局限,其预测直接依赖于人类可理解的语义概念。由于CBM中下游任务失败通常源于概念预测错误,概念表示为错误切片识别提供了强有力的候选,提供了直接关联错误源的细粒度解释。基于这一见解,我们引入CB-SLICE,一种基于概念的SDM,它将共享概念预测失败的样本分组,并识别每个切片失败模式中最关键的关键词概念。在多个基准测试中,我们展示了CB-SLICE在发现已知偏差方面优于最先进方法,同时提供更丰富、更忠实的模型错误解释。

英文摘要

Despite strong average-case performance, deep learning models often exhibit systematic errors on specific population groups, known as error slices. Identifying these groups and the root causes of their failures is critical for model debugging and bias mitigation. However, existing error Slice Discovery Methods (SDMs) typically generate explanations disconnected from the model's inference process, so they only approximate the underlying error source and may be inaccurate. We address this limitation by leveraging Concept Bottleneck Models (CBMs), whose predictions are directly dependent on human-understandable semantic concepts. Since downstream task failures in CBMs commonly arise from concept mispredictions, concept representations provide a strong candidate for error slice identification, offering fine-grained explanations directly linked to the error source. Building on this insight, we introduce CB-SLICE, a concept-based SDM that groups samples with shared concept prediction failures and identifies the keyword concepts most responsible for each slice's failure mode. Across multiple benchmarks, we show that CB-SLICE outperforms state-of-the-art methods in uncovering well-known biases while providing richer and more faithful explanations of model errors.

2605.29834 2026-05-29 cs.LG 版本更新

Open World Autoencoding Drift Detection with Novel Class Recognition in Tabular Non-stationary Data Streams

开放世界自编码漂移检测与表格非平稳数据流中的新类识别

Joanna Komorniczak

发表机构 * Department of Systems and Computer Networks(系统与计算机网络系) Wrocław University of Science and Technology(弗罗茨瓦夫理工大学)

AI总结 提出一种基于自编码器重构误差的无监督概念漂移检测方法,通过密度估计识别新类样本,利用镜像自编码器独立增量适应变化分布,在合成表格数据流上实验表明与当前最优方法竞争。

详情
AI中文摘要

数据流处理已成为现代机器学习应用中的里程碑,概念漂移和新类出现是复杂识别方法面临的主要挑战。本文提出一种无监督概念漂移检测方法,基于自编码器的重构误差识别已知类分布的偏移,同时通过对样本代理表示的密度估计实现新类样本的识别。使用镜像自编码器允许两个任务独立增量适应变化的问题分布,从而持续调整演化概念并可靠识别未知样本。实验使用多种合成表格数据流,观察到概念漂移和新类出现。结果表明,所提方法与当前最先进的无监督漂移检测器和新颖性分类器具有竞争力。

英文摘要

Data stream processing has become a landmark in modern machine learning applications, with concept drifts and novel class appearances posing the primary challenges faced by sophisticated recognition methods. This work proposes an unsupervised concept drift detection method that identifies shifts in known class distributions based on the reconstruction errors of an autoencoder, while also enabling the recognition of novel class samples through density estimation of a proxy representation of samples. Using mirrored autoencoders allows for independent incremental adaptation to changing problem distributions for the two considered tasks, resulting in continuous adjustment to evolving concepts and reliable recognition of unknown samples. Conducted experiments used a diverse set of synthetic tabular data streams, where both concept drifts and the emergence of novelties were observed. The results show that the proposed approach is competitive with current state-of-the-art unsupervised drift detectors and novelty classifiers.
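The idea of flagging drift from autoencoder reconstruction errors can be sketched as follows. The abstract does not give the decision rule, so the mean-plus-k-standard-deviations test, the function name `drift_flag`, and the default `k=3.0` are assumptions for illustration; the reconstruction errors are taken as given, with no autoencoder included.

```python
import statistics


def drift_flag(reference_errors, chunk_errors, k=3.0):
    """Flag drift when a chunk's mean reconstruction error is unusually high.

    Assumed rule: compare the chunk mean against the reference mean
    plus k sample standard deviations of the reference errors.
    """
    mu = statistics.fmean(reference_errors)
    sigma = statistics.stdev(reference_errors)
    return statistics.fmean(chunk_errors) > mu + k * sigma


# Errors measured on data from the known concept, then on two new chunks.
reference = [0.10, 0.12, 0.11, 0.09, 0.10]
stable_chunk = [0.11, 0.10, 0.12]
shifted_chunk = [0.50, 0.60, 0.55]
```

In this sketch `drift_flag(reference, stable_chunk)` stays false and `drift_flag(reference, shifted_chunk)` becomes true; recognizing novel classes, which the abstract handles through density estimation on a proxy representation, is a separate step not shown here.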

2605.29829 2026-05-29 cs.AI cs.LG 版本更新

OptSkills: Learning Generalizable Optimization Skills from Problem Archetypes via Cluster-Based Distillation

OptSkills: 通过基于聚类的蒸馏从问题原型中学习可泛化的优化技能

Haochen Yang, Ke Zhao, Mengyuan Ma, Xingyu Lu, Xiangfeng Wang, Hong Qian

发表机构 * East China Normal University(华东师范大学) Shanghai Innovation Institute(上海创新研究院) Ant Group(蚂蚁集团)

AI总结 提出OptSkills系统,通过聚类问题原型、蒸馏成功轨迹为可复用工作流技能,并动态扩展技能库,提升优化建模与求解的分布内和分布外泛化能力。

Comments 22 pages, 10 figures, project: https://github.com/fujiwaranoM0kou/OptSkills

详情
AI中文摘要

利用大型语言模型(LLM)从自然语言自动制定和求解优化问题已成为自动化优化的高效范式。然而,现有方法仍表现出有限的泛化能力:它们对表面叙述变化敏感,主要在案例层面复用经验,难以适应变化或新兴的问题类型。我们提出OptSkills,一个以原型为中心的技能学习和推理智能体系统,用于优化建模和求解。为提升鲁棒泛化,我们的系统根据问题的底层原型而非表面叙述进行聚类。为提升分布内泛化,它在每个聚类内探索多样的建模范式和求解器配置,然后将成功轨迹蒸馏为可重用的工作流级技能。为提升分布外泛化,它利用新获得的轨迹改进现有技能或扩展技能库。我们的系统在涵盖多种问题类型和场景的数据集上达到了68.27%的最先进微平均准确率。此外,在极具挑战性的大规模高维基准MIPLIB-NL上,它达到了26.91%的准确率,比DeepSeek-V3.2-Thinking高出4.53%。在Nano-CO上进行技能学习后,它在OOD NLCO基准上达到了72.79%。代码和技能可在https://github.com/fujiwaranoM0kou/OptSkills获取。

英文摘要

Leveraging Large Language Models (LLMs) to automatically formulate and solve optimization problems from natural language has emerged as an efficient paradigm for automated optimization. However, existing methods still exhibit limited generalization: they are sensitive to superficial narrative variations, reuse experience mainly at the case level, and struggle to adapt to shifted or emerging problem types. We propose OptSkills, an archetype-centric skill learning and reasoning agent system for optimization modeling and solving. To improve robust generalization, our system clusters problems by their underlying archetypes rather than surface narratives. To improve in-distribution generalization, it explores diverse modeling paradigms and solver configurations within each cluster, then distills successful trajectories into reusable workflow-level skills. To improve out-of-distribution generalization, it refines existing skills or expands the skill library using newly obtained trajectories. Our system achieves a state-of-the-art micro-averaged accuracy of 68.27% on datasets encompassing diverse problem types and scenarios. In addition, on MIPLIB-NL, a highly challenging large-scale and high-dimensional benchmark, it achieves 26.91% accuracy, outperforming DeepSeek-V3.2-Thinking by 4.53%. After skill learning on Nano-CO, it reaches 72.79% on the OOD NLCO benchmark. Code and skills are available at https://github.com/fujiwaranoM0kou/OptSkills.
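
The headline figure in this abstract is a micro-averaged accuracy, which pools every problem across datasets before dividing, so larger datasets carry more weight. As a minimal sketch of how that differs from a macro average, the following uses made-up dataset names and counts (`small_set`, `large_set`), which are illustrative assumptions and not numbers from the paper:

```python
from typing import Dict, Tuple

# Hypothetical per-dataset results as (correct, total); not figures from the paper.
Results = Dict[str, Tuple[int, int]]


def micro_accuracy(results: Results) -> float:
    """Pool all problems: total correct divided by total attempted."""
    correct = sum(c for c, _ in results.values())
    total = sum(t for _, t in results.values())
    return correct / total


def macro_accuracy(results: Results) -> float:
    """Average the per-dataset accuracies, weighting each dataset equally."""
    return sum(c / t for c, t in results.values()) / len(results)


toy_results: Results = {"small_set": (9, 10), "large_set": (50, 100)}
print("micro:", round(micro_accuracy(toy_results), 4))
print("macro:", round(macro_accuracy(toy_results), 4))
```

On this toy input the micro average is pulled toward the larger dataset, which is why the two averages can tell different stories about the same system.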

2605.29828 2026-05-29 cs.LG 版本更新

When Do Graph Foundation Models Transfer? A Data-Centric Theory

图基础模型何时迁移?一个以数据为中心的理论

Jiajun Zhu, Ying Chen, Peihao Wang, Yixuan He, Pan Li, Aditya Akella, Zhangyang Wang

发表机构 * University of Texas at Austin(德克萨斯大学奥斯汀分校) Arizona State University(亚利桑那州立大学) Georgia Institute of Technology(佐治亚理工学院)

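AI summary Using a graphon-based continuous limit, this paper decomposes cross-domain output shift into finite-sample approximation terms and an intrinsic domain discrepancy that captures structural mismatch, and verifies the effect of positional-encoding stability on transfer.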

Comments 21 pages, including appendix. Accepted at ICML 2026

详情
AI中文摘要

图基础模型(GFMs)旨在跨不同图域重用单一骨干网络,但其迁移往往不均匀,并可能出现负迁移。虽然大多数先前工作通过架构或自适应选择改进迁移,但我们提出一个以数据为中心的问题:两个图域的哪些属性决定了固定表示模型改变其输出的程度?利用基于图论的稠密图连续极限,我们证明对于基于集合和消息传递的标记化,任何Lipschitz骨干网络都允许将跨域输出偏移显式分解为(i)图特定的有限样本近似项和(ii)捕获结构不匹配的内在、重标号不变的域差异。一个关键因素是位置编码(PE)稳定性:我们为谱PE建立了稳定性保证,并突出了基于特征向量与基于子空间的PE的对比行为。在合成和真实图上的实验验证了该理论,并将该分解转化为GFM迁移中数据整理的指导。

英文摘要

Graph foundation models (GFMs) aim to reuse a single backbone across diverse graph domains, yet their transfer is often uneven and can exhibit negative transfer. While most prior work improves transfer through architectural or adaptation choices, we ask a data-centric question: which properties of two graph domains determine how much a fixed representation model changes its outputs? Using a graphon-based continuous limit for dense graphs, we show that for both set-based and message-passing tokenizations, any Lipschitz backbone admits an explicit decomposition of cross-domain output shift into (i) graph-specific finite-sample approximation terms and (ii) an intrinsic, relabeling-invariant domain discrepancy capturing structural mismatch. A key ingredient is positional-encoding (PE) stability: we establish stability guarantees for spectral PEs and highlight contrasting behaviors of eigenvector- versus subspace-based PEs. Experiments on synthetic and real graphs validate the theory and translate the decomposition into guidance for data curation in GFM transfer.

2605.29819 2026-05-29 cs.LG 版本更新

The Interplay Between Interpolation and Aggregation in Regression: Optimal Sample Complexity

回归中插值与聚合的相互作用:最优样本复杂度

Mikael Møller Høgsgaard, Kasper Green Larsen, Liang-Yu Zou

发表机构 * Department of Computer Science, Aarhus University(奥胡斯大学计算机科学系) Department of Statistics, University of Oxford(牛津大学统计学系)

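AI summary This paper studies theoretically the interplay between interpolation and aggregation in regression, proves that the γ-graph dimension characterizes learnability for a broad class of natural aggregation procedures, and shows that the simple procedure of combining three interpolating hypotheses via the median is optimal among these procedures and strictly more powerful than proper learning.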

详情
AI中文摘要

本文从理论上研究了回归中插值与聚合的相互作用。我们证明了$γ$-图维度刻画了一类广泛自然聚合过程的可学习性。此外,我们证明了一种极其简单的聚合过程——通过中位数组合三个插值假设——在所有这些聚合过程中是最优的,并且严格强于恰当学习。最后,我们表明某些假设类只能通过聚合无限多个假设或使用非插值聚合规则(可能预测超出其输入范围)来学习,而任何有限的插值聚合甚至无法达到平凡的性能。

英文摘要

This work investigates theoretically the interplay between interpolation and aggregation in regression. We establish that the $γ$-graph dimension characterizes learnability for a broad class of natural aggregation procedures. Furthermore, we prove that an extremely simple aggregation procedure, combining three interpolating hypotheses via the median, is optimal among all these aggregation procedures, and is strictly more powerful than proper learning. Finally, we show that some hypothesis classes are learnable only by aggregating infinitely many hypotheses or by using non-interpolating aggregation rules (which may predict outside the range of their inputs), and any finite interpolating aggregation fails to achieve even trivial performance.
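
To make the median rule concrete, the sketch below aggregates three hypotheses by taking the pointwise median of their predictions. The toy sample and the hypotheses `linear`, `quadratic`, and `cubic` are illustrative assumptions, as is reading "interpolating hypothesis" as one that fits the training sample exactly. The snippet only demonstrates two properties: the aggregate still fits the sample, and its prediction always lies within the range of the three inputs.

```python
from statistics import median
from typing import Callable

Hypothesis = Callable[[float], float]


def median_of_three(h1: Hypothesis, h2: Hypothesis, h3: Hypothesis) -> Hypothesis:
    """Return the predictor that outputs the pointwise median of three hypotheses."""
    def aggregated(x: float) -> float:
        return median((h1(x), h2(x), h3(x)))
    return aggregated


# Toy training sample; every hypothesis below fits it exactly.
sample = [(0.0, 0.0), (1.0, 1.0)]


def linear(x: float) -> float:
    return x


def quadratic(x: float) -> float:
    return x * x


def cubic(x: float) -> float:
    return x ** 3


aggregate = median_of_three(linear, quadratic, cubic)

# The aggregate agrees with the sample wherever all three hypotheses do.
print(all(aggregate(x) == y for x, y in sample))
# Away from the sample, the prediction is the middle of the three values.
print(aggregate(0.5))
```

Because a median of three numbers is always one of those numbers, this rule can never predict outside the range of its inputs, in contrast to the non-interpolating aggregation rules the abstract mentions.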

2605.29809 2026-05-29 cs.CR cs.CV cs.GR cs.LG cs.MM 版本更新

Cert-LAS: Toward Certified Model Ownership Verification for Text-to-Image Diffusion Models via Layer-Adaptive Smoothing

Cert-LAS:通过层自适应平滑实现文本到图像扩散模型的认证模型所有权验证

Leyi Qi, Yiming Li, Siyuan Liang, Zhengzhong Tu, Dacheng Tao

发表机构 * Generative AI Lab, College of Computing and Data Science, Nanyang Technological University, Singapore; Department of Computer Science and Engineering, Texas A&M University, USA

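AI summary Proposes Cert-LAS, which embeds watermarks using layer-adaptive smoothing and diffusion classifiers, verifies model ownership through hypothesis testing, and proves that verification remains reliable under malicious removal attacks.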

Comments This paper has been accepted to the International Conference on Machine Learning (ICML) 2026. 26 pages

详情
AI中文摘要

大规模文本到图像(T2I)扩散模型实现了前所未有的创意应用,但其未经授权的使用引发了严重的知识产权问题,使得模型所有权验证(MOV)日益关键。我们发现现有的基于后门的扩散水印方法通常(隐式地)假设一个“忠实”的验证过程,即验证者可以查询可疑模型并获得忠实的水印响应以完成MOV。然而,在实践中,攻击者可能有意或无意地破坏潜在的水印信号,显著降低验证可靠性。为解决此问题,我们提出Cert-LAS,首个基于层自适应平滑的T2I模型认证MOV方法。通常,Cert-LAS使用扩散分类器和LFS引导的层自适应噪声嵌入指定水印,并通过假设检验检查可疑模型是否表现出比无水印参考显著更强的水印响应来验证所有权。我们进一步证明,在特定条件下,即使存在恶意移除攻击,我们的Cert-LAS仍能实现可靠验证。大量实验验证了Cert-LAS的有效性及其对自适应攻击的抵抗力。我们的代码可在https://github.com/Leyi-Qi/Cert-LAS获取。

英文摘要

Large-scale text-to-image (T2I) diffusion models have enabled unprecedented creative applications, but their unauthorized use has raised serious intellectual property concerns, making model ownership verification (MOV) increasingly critical. We find that existing backdoor-based diffusion watermarking methods often (implicitly) assume a "faithful" verification process, namely, that the verifier can query a suspicious model and obtain the faithful watermark response to complete MOV. However, in practice, adversaries may intentionally or unintentionally damage potential watermark signals, significantly degrading verification reliability. To address this issue, we propose Cert-LAS, the first certified MOV method for T2I models based on layer-adaptive smoothing. In general, Cert-LAS embeds specified watermarks using diffusion classifiers and an LFS-guided layer-adaptive noise, and verifies ownership by examining whether the suspected model exhibits significantly stronger watermark responses compared to unwatermarked references through hypothesis testing. We further prove that, under certain conditions, our Cert-LAS can still achieve reliable verification even in the presence of malicious removal attacks. Extensive experiments validate the effectiveness of Cert-LAS and its resistance to adaptive attacks. Our code is available at https://github.com/Leyi-Qi/Cert-LAS.

2605.29807 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Data filtering methods for training language models

训练语言模型的数据过滤方法

Egor Shevchenko, Elena Bruches

发表机构 * Novosibirsk State University(新西伯利亚国立大学) A. P. Ershov Institute of Informatics Systems SB RAS(A. P. Ershov 信息系统研究所)

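AI summary This paper compares two automatic label error detection methods, Confident Learning and Dataset Cartography, on Russian text classification tasks; their effectiveness depends on dataset characteristics, and Confident Learning yields a significant F1-macro improvement on small, high-noise datasets.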

Comments AINL-2026

详情
AI中文摘要

数据质量是机器学习模型有效性的关键因素。即使广泛使用的基准数据集中也存在标签错误,这些错误会引入训练数据噪声并降低模型泛化能力。在本工作中,我们对两种自动标签错误检测方法——Confident Learning和Dataset Cartography——在三个俄语文本分类语料库上进行了比较分析,这些语料库在规模、类别数量和领域上各不相同:ru_emotion_e-culture(49,123个样本,情感分类)、RuCoLA(8,524个样本,语言可接受性)和TERRa(2,337个样本,文本蕴含识别)。我们使用在每个语料库上微调的预训练rubert-base-cased模型。为了验证过滤的意义,我们进行了控制实验,随机移除等量样本。结果表明,两种方法的有效性强烈依赖于数据集特征:在噪声水平低的大规模语料库上,过滤并未提升性能,而在噪声高的小规模数据集上,Confident Learning实现了显著的F1-macro提升。Dataset Cartography表现出更保守的行为,移除的样本更少。在所有语料库中,两种方法的目标性移除均优于随机移除,证实了这些方法的意义。

英文摘要

Data quality is a critical factor in the effectiveness of machine learning models. Label errors, present even in widely used benchmarks, introduce noise into training data and reduce model generalization. In this work, we conduct a comparative analysis of two automatic label error detection methods - Confident Learning and Dataset Cartography - on three Russian text classification corpora of varying size, number of classes, and domain: ru_emotion_e-culture (49,123 examples, emotion classification), RuCoLA (8,524 examples, linguistic acceptability), and TERRa (2,337 examples, textual entailment recognition). We use the pre-trained rubert-base-cased model fine-tuned on each corpus. To verify the meaningfulness of filtering, we conduct control experiments with random removal of an equivalent number of examples. Results show that the effectiveness of both methods depends strongly on dataset characteristics: on large corpora with low noise levels, filtering does not improve performance, while on small datasets with high noise, Confident Learning achieves a significant F1-macro improvement. Dataset Cartography demonstrates more conservative behavior, removing fewer examples. Across all corpora, targeted removal by both methods outperforms random removal, confirming the meaningfulness of the approaches.
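
For readers unfamiliar with Confident Learning, the following is a simplified sketch of its core idea, not the authors' pipeline: each class gets a threshold equal to the mean predicted probability of that class over examples carrying that label, and an example is flagged when its most probable class among those clearing their thresholds differs from its given label. The function names and the toy probabilities are assumptions for illustration, and the sketch assumes every class appears at least once among the labels.

```python
from typing import List, Sequence


def class_thresholds(probs: Sequence[Sequence[float]],
                     labels: Sequence[int],
                     n_classes: int) -> List[float]:
    """Per-class threshold: mean predicted probability of class j over
    the examples whose given label is j (assumes each class occurs)."""
    thresholds = []
    for j in range(n_classes):
        member_probs = [p[j] for p, y in zip(probs, labels) if y == j]
        thresholds.append(sum(member_probs) / len(member_probs))
    return thresholds


def find_label_issues(probs: Sequence[Sequence[float]],
                      labels: Sequence[int],
                      n_classes: int) -> List[int]:
    """Return indices whose confidently predicted class disagrees with the label."""
    thresholds = class_thresholds(probs, labels, n_classes)
    issues = []
    for i, (p, y) in enumerate(zip(probs, labels)):
        confident = [j for j in range(n_classes) if p[j] >= thresholds[j]]
        if not confident:
            continue  # the model is not confident about any class here
        guess = max(confident, key=lambda j: p[j])
        if guess != y:
            issues.append(i)
    return issues


# Toy model outputs (hypothetical); the last example looks mislabeled.
toy_probs = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9], [0.95, 0.05]]
toy_labels = [0, 0, 1, 1, 1]
print(find_label_issues(toy_probs, toy_labels, n_classes=2))
```

A random-removal control like the one described in the abstract would drop the same number of indices chosen uniformly at random, so any gain from targeted removal can be attributed to the selection rather than to the smaller training set.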

2605.29803 2026-05-29 cs.LG 版本更新

Gated Graph Attention Networks with Learnable Temperature

具有可学习温度的门控图注意力网络

Zhongtian Ma, Hao Wu, Yexin Zhang, Qiaosheng Zhang, Zhen Wang

发表机构 * School of Cybersecurity, Northwestern Polytechnical University(网络安全学院,西北工业大学) Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

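AI summary Proposes gated graph attention and a learnable temperature mechanism that filter unreliable feature dimensions and dynamically adjust the sharpness of the attention coefficient distribution, improving graph attention networks on homogeneous and heterophilic heterogeneous benchmarks.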

详情
AI中文摘要

图注意力网络通过数据相关的系数学习邻居的重要性,但标准层缺乏对不可靠特征维度的显式控制,并且使用固定的注意力系数分布锐度。本文针对常见的图注意力机制提出了门控图注意力和可学习温度。门控图注意力过滤特征或消息响应以减少不可靠维度的影响,而可学习温度动态调整注意力系数分布的锐度。在均匀和异质异配基准上的实验表明,所提出的变体一致地改进了相应的图注意力骨干网络,受控噪声研究进一步验证了它们在特征扰动下的行为。理论分析解释了这些结果,表明当只有部分特征坐标可靠时,门控提高了鲁棒性,而当全局噪声削弱节点特征的可区分性时,温度是有益的。

英文摘要

Graph attention networks learn neighbor importance through data-dependent coefficients, but standard layers lack explicit control over unreliable feature dimensions and use fixed sharpness of attention coefficient distributions. This paper proposes gated graph attention and learnable temperature for common graph attention mechanisms. Gated graph attention filters feature or message responses to reduce the influence of unreliable dimensions, while learnable temperature dynamically adjusts the sharpness of the attention coefficient distribution. Experiments on homogeneous and heterophilic heterogeneous benchmarks show that the proposed variants consistently improve the corresponding graph attention backbones, and controlled noise studies further verify their behavior under feature perturbations. Theoretical analysis explains these results by showing that gating improves robustness when only part of the feature coordinates are reliable, while temperature is beneficial when global noise weakens the discriminability of node features.
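
The two mechanisms can be illustrated without any graph library. The sketch below is an assumption-laden simplification: the temperature and the gate logits are plain numbers passed in by hand, whereas the paper learns them during training, and the function names are hypothetical. It only shows how a temperature changes the sharpness of a softmax over neighbor scores and how a sigmoid gate damps individual feature dimensions.

```python
import math
from typing import List, Sequence


def softmax_with_temperature(scores: Sequence[float], temperature: float) -> List[float]:
    """Attention coefficients over neighbors; lower temperature gives a sharper distribution."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    scaled = [s / temperature for s in scores]
    shift = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - shift) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def apply_gate(features: Sequence[float], gate_logits: Sequence[float]) -> List[float]:
    """Scale each feature dimension by a sigmoid gate in (0, 1)."""
    return [f / (1.0 + math.exp(-g)) for f, g in zip(features, gate_logits)]


neighbor_scores = [2.0, 1.0, 0.0]
print(softmax_with_temperature(neighbor_scores, temperature=0.5))  # sharper
print(softmax_with_temperature(neighbor_scores, temperature=5.0))  # flatter
print(apply_gate([1.0, 1.0], gate_logits=[10.0, -10.0]))  # second dimension suppressed
```

In this reading, gating addresses the case where only some coordinates are reliable, while temperature addresses the case where noise blurs all neighbor scores at once, matching the two regimes named in the abstract.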

2605.29801 2026-05-29 cs.AI cs.CL cs.CR cs.CV cs.LG 版本更新

AgentDoG 1.5: A Lightweight and Scalable Alignment Framework for AI Agent Safety and Security

AgentDoG 1.5:一种轻量级且可扩展的AI智能体安全与安保对齐框架

Dongrui Liu, Yu Li, Zhonghao Yang, Peng Wang, Guanxu Chen, Yuejin Xie, Qinghua Mao, Wanying Qu, Yanxu Zhu, Tianyi Zhou, Leitao Yuan, Zhijie Zheng, Qihao Lin, Yimin Wang, Haoyu Luo, Shuai Shao, Chen Qian, Qingyu Liu, Ling Tang, Ruiyang Qin, Qihan Ren, Junxiao Yang, Kun Wang, Zhiheng Xi, Linfeng Zhang, Ranjie Duan, Bo Zhang, Wenjie Wang, Wen Shen, Qiaosheng Zhang, Yan Teng, Chaochao Lu, Rui Mei, Man Li, Jialing Tao, Xi Lin, Tianhang Zheng, Yong Liu, Quanshi Zhang, Lei Zhu, Xingjun Ma, Junhua Liu, Hui Xue, Xiaoxiang Zuo, Xiangnan He, Chao Shen, Xianglong Liu, Minlie Huang, Jing Shao, Xia Hu

发表机构 * Shanghai Artificial Intelligence Laboratory(上海人工智能实验室)

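AI summary To address emerging safety risks from open-world agents, proposes a lightweight and scalable safety alignment framework that updates the safety taxonomy, builds a data engine, and trains small models (0.8B to 8B parameters), achieving performance comparable to closed-source models and reducing deployment overhead by two orders of magnitude.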

Comments 44 pages, 12 Figures, 9 Tables

详情
AI中文摘要

现代开放世界智能体(如OpenClaw)展现出强大的跨环境执行能力,但同时也引入了广泛的新安全风险源。同时,先进的前沿AI模型大幅降低了攻击门槛,使得当前的智能体对齐框架不足以应对实际部署。为了应对这些新兴威胁,我们提出了一种轻量级且可扩展的智能体安全对齐框架。具体而言,我们更新了智能体安全分类法,以涵盖来自Codex和OpenClaw执行场景的新兴风险。我们进一步构建了一个基于分类法指导的数据引擎,并采用影响函数净化,仅使用约1k样本训练轻量级AgentDoG 1.5变体(0.8B、2B、4B和8B参数),达到了与领先闭源模型(如GPT-5.4)相当的性能。基于AgentDoG 1.5,我们构建了一个高效的智能体安全SFT和RL训练环境,将Docker级环境的部署开销降低了两个数量级。最后,我们将AgentDoG 1.5部署为无需训练的在线护栏,用于实时安全审核。大量实验结果表明,AgentDoG 1.5在多样且复杂的交互式智能体场景中达到了最先进的性能。所有模型和数据集均已公开发布。

英文摘要

Modern open-world agents such as OpenClaw exhibit powerful cross-environment execution capabilities yet introduce broad new safety risk sources. Meanwhile, advanced frontier AI models drastically lower attack barriers, rendering current agent alignment frameworks inadequate for real-world deployment. To tackle these emerging threats, we propose a lightweight and scalable agent safety alignment framework. Specifically, we update the agent safety taxonomy to accommodate emergent risks from Codex and OpenClaw execution scenarios. We further build a taxonomy-guided data engine with influence-function purification to train lightweight AgentDoG 1.5 variants (0.8B, 2B, 4B, and 8B parameters) using only around 1k samples, achieving comparable performance with leading closed-source models (e.g., GPT-5.4). Based on AgentDoG 1.5, we construct a highly efficient agentic safety SFT and RL training environment, which reduces deployment overhead in Docker-level environments by two orders of magnitude. Finally, we deploy AgentDoG 1.5 as a training-free online guardrail for real-time safety moderation. Extensive experimental results indicate that AgentDoG 1.5 achieves state-of-the-art performance in diverse and complex interactive agentic scenarios. All models and datasets are openly released.

2605.29788 2026-05-29 cs.AI cs.LG 版本更新

Certified Policy Optimisation for Nested Causal Bandits via PAC-Bayes Risk

嵌套因果赌博机的认证策略优化:基于PAC-Bayes风险

Tim Woydt, Paul-David Zuercher

发表机构 * ProdAxon

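AI summary Proposes Nested Causal Thompson Sampling (NCTS) together with a PAC-Bayes excess-risk bound that certifies deployment policies from historical data, off-policy and anytime, addressing the causal coupling across timescales in hierarchical causal bandits.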

详情
AI中文摘要

关键序列决策很少是单时间尺度的:一个战略决策因果地塑造了每个后续战术选择所处的环境;标准赌博机和强化学习理论并未捕捉时间尺度之间的这种因果耦合。我们将问题类别形式化为嵌套上下文因果赌博机(NCCBs),这是一个分层SCM,其中每个层次的动作设置下一层次的上下文分布,并提出了嵌套因果汤普森采样(NCTS),该算法每轮抽取一个机制因子化的信念,并在其下递归地行动。我们的主要理论结果是一个因果PAC-Bayesian超额风险界,它仅从历史数据中认证任何候选部署策略,离线且任意时刻,回答了部署问题:我们能否在此处信任该智能体,风险如何?在分层SCM上的实验表明,相对于同一函数类上的匹配RFF-GP联合回归,因子化的SCM机制后验在外生分布偏移下零样本迁移显著更好,递归的元到内层提交在分布上显著优于联合提交替代方案,并且随着离线数据积累,认证显著收缩。结合这些结果,我们建立了渐进式认证交接,一种安全部署方法:每个时间尺度在收益可被认证时从传统控制器切换到NCTS,独立于其他时间尺度。

英文摘要

Critical sequential decisions are rarely single-timescale: a strategic decision causally shapes the context in which every subsequent tactical choice is made; standard bandit and reinforcement-learning theory does not capture this causal coupling between timescales. We formalise the problem class as Nested Contextual Causal Bandits (NCCBs), a hierarchical SCM where each level's action sets the next level's context distribution, and propose Nested Causal Thompson Sampling (NCTS), which draws one mechanism-factorised belief per episode and acts recursively under it. Our main theoretical result is a causal PAC-Bayesian excess-risk bound that certifies any candidate deployment policy from historic data alone, off-policy and anytime, answering the deployment question: can we trust this agent here, and at what risk? Experiments on a hierarchical SCM show that, against a matched RFF-GP joint regression on the same function class, the factorised SCM-mechanism posterior transfers significantly better zero-shot under exogenous distribution shifts, the recursive meta-to-inner commit significantly dominates the joint-commit alternative in distribution, and the certificate significantly contracts as offline data accumulates. Combining these results, we establish progressive certified handover, a safe-deployment method: each timescale flips from a legacy controller to NCTS when gains can be certified, independently of the others.

2605.29782 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Hista and Numca: Estimate State Value Effectively for LLM Reinforcement Learning

Hista 和 Numca:为 LLM 强化学习有效估计状态值

Zizhe Chen, Jiqian Dong, Yizhou Tian, Garry Yang, Yongqiang Chen, Zhitang Chen, James Cheng

发表机构 * Department of Computer Science and Engineering, The Chinese University of Hong Kong(香港中文大学计算机科学与工程系) Huawei Technologies Ltd(华为技术有限公司)

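AI summary To address inaccurate state value estimation in LLM reinforcement learning, proposes two methods, Numca (which uses numerical spans as gradable milestones) and Hista (which uses hidden states to compute a weighted average over disjoint rollouts and their returns), significantly improving estimation accuracy and training performance.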

Comments Accepted at ICML 2026

详情
AI中文摘要

强化学习(RL)通过奖励信号直接优化模型行为来改进大型语言模型(LLMs)。虽然在经典RL中准确的状态值估计对于稳定训练至关重要,但在LLM后训练中这仍是一个未被充分探索的挑战。在这项工作中,我们引入了状态值估计基准(SVEB)来评估现有RL框架中的状态估计,并展示了像PPO这样的标准方法中的评论家会退化为粗糙的组平均基线。为了解决这个问题,我们提出了两种技术:Numca,它利用数值跨度作为可分级里程碑进行状态值估计;以及Hista,一个使用LLM的隐藏状态作为表示来加权平均不连续轨迹及其回报的框架。大量实验表明,这两种方法都能产生更准确的状态值估计,并在不同的RL算法和模型大小上提升训练性能,而不会产生显著的计算开销。

英文摘要

Reinforcement learning (RL) refines large language models (LLMs) by directly optimizing model behavior through reward signals. While accurate state value estimation is critical for stable training in classical RL, it remains an underexplored challenge in LLM post-training. In this work, we introduce the State Value Estimation Benchmark (SVEB) to assess state estimation within existing RL frameworks and show that critics in standard approaches like PPO collapse to a coarse group-average baseline. To address this, we propose two techniques: Numca, which leverages numerical spans as gradable milestones for state value estimation, and Hista, a framework that uses the LLM's hidden states as representations to compute a weighted average over disjoint rollouts and their returns. Extensive experiments demonstrate that both methods yield more accurate state value estimates and enhance training performance across different RL algorithms and model sizes without incurring significant computational overhead.

2605.29765 2026-05-29 cs.LG 版本更新

MMTM: Tri-Modal Topic Modeling for Long-Form Video via Similarity-Gated Fusion

MMTM: 基于相似性门控融合的长视频三模态主题建模

Ali Abusaleh, Bhuvanesh Verma, Alexander Mehler

发表机构 * Text Technology Lab (TTLab), Goethe University Frankfurt(文本技术实验室(TTLab),法兰克福歌德大学)

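AI summary Proposes MMTM, a modular pipeline that integrates speech recognition, audio and visual embeddings, and BERTopic clustering through similarity-gated fusion, substantially improving topic quality in long-form video topic discovery.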

Comments Submitted to EMNLP 2026

详情
AI中文摘要

我们介绍了MMTM,一个用于长视频主题发现的模块化流水线,它通过确定性相似性门控融合集成了语音识别、音频和视觉嵌入以及BERTopic聚类。在德语(Tagesschau)和英语(NBC)广播新闻上进行跨语言评估,联合三模态建模显著提高了主题质量:噪声从0.27降至0.06,转换率从0.70降至0.21,归一化熵从0.84升至0.92,表明主题更加连贯且时间稳定。聚类有效性(Calinski-Harabasz)在嵌入空间上提高了5-12倍。词汇连贯性(NPMI)在德语上从0.77升至0.86,但依赖于语料库,并未迁移到较短的NBC广播中。我们发布了流水线代码和一个经过人工验证的54小时多模态视频主题语料库,包含双标注者视觉评估和LLM辅助标注。

英文摘要

We introduce MMTM, a modular pipeline for topic discovery in long-form video that integrates speech recognition, audio and visual embeddings, and BERTopic clustering through a deterministic similarity-gated fusion. Evaluated cross-lingually on German (Tagesschau) and English (NBC) broadcast news, joint tri-modal modeling substantially improves topic quality: noise drops from 0.27 to 0.06, transition rate from 0.70 to 0.21, and normalized entropy rises from 0.84 to 0.92, indicating more coherent and temporally stable topics. Cluster validity (Calinski-Harabasz) improves by 5-12X across embedding spaces. Lexical coherence (NPMI) rises from 0.77 to 0.86 on German but is corpus-dependent and does not transfer to the shorter NBC broadcasts. We release the pipeline code and a human-validated 54-hour multimodal video topic corpus with dual-annotator visual evaluation and LLM-assisted labeling.
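
The abstract reports noise, transition rate, and normalized entropy without defining them. The sketch below uses plausible definitions that are assumptions rather than the paper's: noise is the share of segments given the outlier label (`-1`, the label BERTopic assigns to outliers), transition rate is the share of consecutive segment pairs whose topic changes, and normalized entropy is the Shannon entropy of the topic distribution divided by its maximum. The topic sequence is invented for illustration.

```python
import math
from collections import Counter
from typing import Sequence


def noise_ratio(topics: Sequence[int], outlier: int = -1) -> float:
    """Fraction of segments assigned to the outlier topic."""
    return sum(t == outlier for t in topics) / len(topics)


def transition_rate(topics: Sequence[int]) -> float:
    """Fraction of consecutive segment pairs whose topic label changes."""
    if len(topics) < 2:
        return 0.0
    changes = sum(a != b for a, b in zip(topics, topics[1:]))
    return changes / (len(topics) - 1)


def normalized_entropy(topics: Sequence[int]) -> float:
    """Shannon entropy of the label distribution, scaled to [0, 1]."""
    counts = Counter(topics)
    if len(counts) < 2:
        return 0.0
    n = len(topics)
    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())
    return entropy / math.log(len(counts))


# Hypothetical per-segment topic labels for one broadcast, in temporal order.
segment_topics = [0, 0, 0, 1, 1, -1, 2, 2]
print(noise_ratio(segment_topics))
print(round(transition_rate(segment_topics), 4))
print(round(normalized_entropy(segment_topics), 4))
```

Under these definitions, a lower transition rate means topics persist across adjacent segments, which is one way to read the abstract's claim of temporally stable topics.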

2605.29748 2026-05-29 stat.ML cs.LG 版本更新

Instance-dependent Stochastic Lipschitz bandit

实例依赖的随机Lipschitz bandit

Marius Potfer, Vianney Perchet

发表机构 * Crest (Fairplay joint team)(Crest(Fairplay联合团队)) EDF R&D(EDF研发) Criteo AI Lab(Criteo人工智能实验室)

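AI summary For the Lipschitz bandit problem, proposes an algorithm based on integrals of the suboptimality gap over level sets, achieving instance-dependent regret bounds that improve on classical zooming-dimension bounds.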

详情
AI中文摘要

我们研究Lipschitz bandit问题,其中学习器通过带噪声的点评估在域$\mathcal{X} \subset [0,1]^d$上顺序最大化未知的Lipschitz函数$f$。现有的遗憾界要么是最坏情况的,缩放为$\tilde{\Theta}\left(T^{\frac{d+1}{d+2}}\right)$,要么通过缩放维度$d_z$自适应,得到$\tilde{\Theta}\left(T^{\frac{d_z+1}{d_z+2}}\right)$。然而,这种基于缩放的保证仅是部分实例依赖的,因为它们仅依赖于近最优水平集的渐近增长,未能捕捉$f$的更精细结构性质。我们提供了一种分析和算法,通过$f$在其水平集上的次优性间隙的积分来刻画遗憾。这产生了适应水平集局部增长(而不仅仅是其渐近行为)的遗憾界。作为推论,当最大化者集合的维度$d^\star>0$时,我们获得了阶为$\tilde{\mathcal{O}}\left(T^{\frac{d_z+1}{\max(d_z,d^\star)+2}}\right)$的改进自适应速率,在该情况下严格优于经典的缩放界。最后,我们将分析扩展到完全信息设置(Lipschitz专家),并展示了如何放宽一些正则性假设。

英文摘要

We study the Lipschitz bandit problem, where a learner sequentially maximizes an unknown Lipschitz function $f$ over a domain $\mathcal{X} \subset [0,1]^d$ using noisy pointwise evaluations. Existing regret bounds are either worst-case, scaling as $\tilde{\Theta}\left(T^{\frac{d+1}{d+2}}\right)$, or adaptive via the zooming dimension $d_z$, yielding $\tilde{\Theta}\left(T^{\frac{d_z+1}{d_z+2}}\right)$. However, such zooming-based guarantees are only partially instance-dependent, as they depend solely on the asymptotic growth of near-optimal level sets and fail to capture finer structural properties of $f$. We provide an analysis and an algorithm that characterizes the regret through integrals of the suboptimality gap of $f$ over its level sets. This yields regret bounds that adapt to the local growth of level sets, rather than only their asymptotic behavior. As a corollary, when the set of maximizers has dimension $d^\star>0$, we obtain improved adaptive rates of order $\tilde{\mathcal{O}}\left(T^{\frac{d_z+1}{\max(d_z,d^\star)+2}}\right)$ strictly improving over classical zooming bounds in this regime. Finally, we extend our analysis to the full-information setting (Lipschitz experts) and show how some of the regularity assumptions can be relaxed.

2605.29744 2026-05-29 cs.AI cs.CL cs.LG cs.MA 版本更新

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

为什么专家模型仍然重要:面向医学人工智能的异构多智能体范式

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou, Aiguo Wang, Cuiwei Yang

发表机构 * Anthropic AI

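AI summary Proposes HetMedAgent, a heterogeneous multi-agent framework that coordinates generalist LLMs and domain-specific specialist models through conflict-aware evidence fusion, uncertainty-driven clinician intervention triggering, and adaptive threshold calibration; experiments on three clinical decision-making tasks validate the irreplaceable value of specialist models in modality-specific analysis.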

Comments Accepted at ICML 2026. 12 pages main text, 16 pages appendix

详情
AI中文摘要

GPT和Claude等通用大语言模型在医疗保健领域的出色表现引发了一个关键问题:特定领域的医学专家模型是否会变得过时?我们认为,医学人工智能的未来不在于构建单一的医学基础模型,也不在于取代人类专业知识,而在于协调通用大语言模型、领域特定专家模型和临床医生之间的协作。我们提出HetMedAgent,一个异构医学多智能体框架,能够实现冲突感知证据融合、基于不确定性的临床医生干预触发和自适应阈值校准。在三个真实世界临床决策任务上的实验表明,通用大语言模型与领域特定专家模型之间的协同显著优于单独使用任一类型模型,验证了专家模型在模态特定分析中的不可替代价值。HetMedAgent代表了从构建医学大语言模型或基础模型向多智能体协作的转变,实现了通用推理能力与领域特定精度之间的平衡。

英文摘要

The impressive performance of generalist large language models (LLMs) such as GPT and Claude in healthcare raises a critical question: will domain-specific medical specialist models become obsolete? We argue that the future of medical artificial intelligence (AI) lies not in building monolithic medical foundation models, nor in replacing human expertise, but in orchestrating collaboration among generalist LLMs, domain-specific specialist models, and clinicians. We propose HetMedAgent, a heterogeneous medical multi-agent framework that enables conflict-aware evidence fusion, uncertainty-based clinician intervention triggering, and adaptive threshold calibration. Experiments on three real-world clinical decision-making tasks demonstrate that the synergy between generalist LLMs and domain-specific specialist models significantly outperforms using either type of model alone, validating the irreplaceable value of specialist models in modality-specific analysis. HetMedAgent represents a shift from building medical LLMs or foundation models to multi-agent collaboration, achieving a balance between general reasoning capabilities and domain-specific precision.

2605.29731 2026-05-29 cs.LG 版本更新

EMAG: Differentiable 4D Gaussian Mixture Splatting for EEG Spatial Super-Resolution

EMAG: 可微分的4D高斯混合喷溅用于EEG空间超分辨率

Alex Lazarovich, Ofir Itzhak Shahar, Gur Elkin, Ohad Ben-Shahar

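AI summary Proposes EMAG, a differentiable framework that uses a mixture of anisotropic 4D space-time Gaussians to reconstruct high-density EEG signals from sparse low-density electrodes, achieving spatial super-resolution and outperforming existing methods on three benchmarks.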

详情
AI中文摘要

高密度脑电图(HD-EEG)能够精细测量皮层活动,但需要昂贵的硬件和较长的设置时间,限制了其在临床和研究中的可及性。我们提出EMAG(EEG各向异性高斯混合),一个可微分的框架,通过将脑电源表示为各向异性4D时空高斯的混合,从稀疏的低密度(LD)电极子集重建HD-EEG信号。EMAG在球形脑网格的每个点上放置多个高斯的混合,每个高斯由完整的4x4精度矩阵参数化,从而实现各向异性的空间扩散以及空间和时间维度之间的显式耦合。前向模型通过电极位置处的可微分高斯场贡献渲染头皮EEG,从而无需显式源定位监督即可进行端到端训练。我们在三个公共EEG基准(Localize-MI、SEED和SEED-IV)上以2倍到8/16倍的超分辨率因子评估EMAG。在大多数超分辨率因子下,EMAG在三个标准基准(Localize-MI、SEED、SEED-IV)上优于当前最先进的EEG超分辨率方法。显式高斯参数化进一步实现了学习到的脑源配置的直接可视化和可解释性,可能为临床和神经科学应用(如源定位或生物标志物发现)开辟途径。

英文摘要

High-density electroencephalography (HD-EEG) enables fine-grained measurement of cortical activity but requires expensive hardware and lengthy setup times, limiting its clinical and research accessibility. We propose EMAG (EEG Mixture of Anisotropic Gaussians), a differentiable framework that reconstructs HD-EEG signals from a sparse subset of low-density (LD) electrodes by representing brain electrical sources as a mixture of anisotropic 4D space-time Gaussians. EMAG places a mixture of multiple Gaussians at each point of a spherical brain grid, each parameterized by a full 4 x 4 precision matrix, enabling anisotropic spatial spreads and explicit coupling between spatial and temporal dimensions. The forward model renders scalp EEG via differentiable Gaussian field contributions at electrode locations, enabling end-to-end training without explicit source localization supervision. We evaluate EMAG on three public EEG benchmarks (Localize-MI, SEED, and SEED-IV) at super-resolution factors of 2x through 8/16x. EMAG outperforms the current state-of-the-art EEG super-resolution method at most super-resolution factors on three standard benchmarks (Localize-MI, SEED, SEED-IV). The explicit Gaussian parameterization further enables direct visualization and interpretability of learned brain source configurations, potentially opening avenues for clinical and neuroscientific applications, such as source localization or biomarker discovery.

2605.29729 2026-05-29 cs.LG 版本更新

Realistic honeypot evaluations for scheming propensity

针对策划倾向的逼真蜜罐评估

Victoria Krakovna, David Lindner, Lewis Ho, Sebastian Farquhar, Rohin Shah

发表机构 * Google DeepMind(谷歌DeepMind)

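AI summary Proposes a framework that uses coding tasks in Google's alignment research codebases as honeypots to test whether models pursue instrumental goals when given the opportunity; experiments show that Gemini models do not scheme unprompted in a real deployment setting but sometimes scheme or attempt sabotage under specific prompts.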

详情
AI中文摘要

我们引入了策划蜜罐评估,这是一个测试模型在有机会时是否会追求工具性目标的框架。我们的策划蜜罐评估以Google对齐研究代码库中的编码任务形式进行。在真实的内部部署环境中,Gemini模型不会表现出未经提示的策划行为。如果提示明确鼓励主动性(情境意识或目标导向)和/或给模型一个隐藏目标,模型有时会策划或尝试破坏。验证了我们设置的真实性,模型表现出较低的评估意识,通常是由于主动性提示而非环境所致。

英文摘要

We introduce scheming honeypot evaluations, a framework for testing whether models will pursue instrumental goals if given the opportunity. Our scheming honeypot evaluations take the form of coding tasks in Google's alignment research codebases. In a real internal deployment setting, Gemini models do not demonstrate unprompted scheming. If prompts explicitly encourage agency (situational awareness or goal-directedness) and/or give the model a hidden goal, models sometimes scheme or attempt sabotage. Validating the realism of our setting, models show low rates of evaluation awareness, usually due to agency prompts rather than the environments.

2605.29727 2026-05-29 cs.LG 版本更新

Bastion: Budget-Aware Speculative Decoding with Tree-structured Block Diffusion Drafting

Bastion: 预算感知的树结构块扩散草稿投机解码

Soowon Oh, Nam Cao, Yujin Kim, Hojung Jung, Huzama Ahmad, Sangmin Bae, Se-Young Yun

发表机构 * KAIST AI(韩国科学技术院人工智能研究所) Samsung Advanced Institute of Technology(三星先进技术研究所)

AI总结 提出BASTION框架,通过动态构建查询相关的树结构平衡草稿质量与硬件约束,实现预算感知的投机解码,无需训练且保持目标模型分布,速度提升达6.61倍。

详情
AI中文摘要

块扩散草稿者最近作为投机解码的强大替代方案出现,通过在单个并行步骤中预测多个未来令牌分布。然而,由于这些并行预测是从位置边缘分布而非完全条件序列中采样,承诺单一贪婪路径往往无法捕捉目标模型的偏好轨迹。为解决此问题,我们提出BASTION,一种基于树的扩散草稿的预算感知投机解码框架。与依赖静态树拓扑的现有方法不同,BASTION通过平衡草稿质量与硬件约束动态构建查询相关的树。我们的框架整合了三个协同组件:(1) 接受代理,通过路径置信度估计期望接受长度;(2) 在线延迟估计器,校准硬件感知的屋顶线模型;(3) 自适应最佳优先扩展,在边际增益不再证明增量验证成本合理时停止树生长。BASTION无需训练,保持目标模型分布,且无需逐设置调优。在多种基准和GPU架构上,BASTION相比标准自回归解码实现高达6.61倍加速,优于最先进的块扩散基线39%。

英文摘要

Block-diffusion drafters have recently emerged as a powerful alternative for speculative decoding by predicting multiple future-token distributions in a single parallel step. However, since these parallel predictions are sampled from position-wise marginals rather than fully conditioned sequences, committing to a single greedy path often fails to capture the target model's preferred trajectory. To address this, we propose BASTION, a budget-aware speculative decoding framework with tree-based diffusion drafting. Unlike existing methods that rely on static tree topologies, BASTION dynamically constructs query-dependent trees by balancing draft quality against hardware constraints. Our framework integrates three synergistic components: (1) an acceptance surrogate that estimates expected accepted length via path confidence, (2) an online latency estimator that calibrates a hardware-aware roofline model, and (3) an adaptive best-first expansion that grows the tree until marginal gains no longer justify incremental verification costs. BASTION is training-free, preserves the target model's distribution, and requires no per-setting tuning. Across diverse benchmarks and GPU architectures, BASTION achieves up to a 6.61x speedup over standard autoregressive decoding, outperforming state-of-the-art block-diffusion baselines by 39%.
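
The adaptive best-first expansion can be pictured with a small priority-queue sketch. Everything here is a simplification under stated assumptions, not the BASTION implementation: the drafter is modeled as a list of position-wise marginals, a node's expected contribution to accepted length is approximated by its path confidence (the product of token probabilities along the path), and the verification cost is a single hypothetical constant `cost_per_node` expressed in the same units, whereas the paper calibrates costs online with a roofline model.

```python
import heapq
from typing import List, Sequence, Tuple

Marginal = Sequence[Tuple[str, float]]  # (token, probability) pairs for one position
Node = Tuple[Tuple[str, ...], float]    # (token path, path confidence)


def build_draft_tree(marginals: Sequence[Marginal],
                     cost_per_node: float,
                     max_nodes: int) -> List[Node]:
    """Grow a draft tree best-first until the best candidate's estimated
    gain no longer exceeds the (assumed constant) per-node verification cost."""
    heap: List[Tuple[float, Tuple[str, ...]]] = []
    for token, prob in marginals[0]:
        heapq.heappush(heap, (-prob, (token,)))  # negate for a max-heap

    tree: List[Node] = []
    while heap and len(tree) < max_nodes:
        neg_conf, path = heapq.heappop(heap)
        confidence = -neg_conf
        if confidence < cost_per_node:
            break  # marginal gain no longer justifies verifying another node
        tree.append((path, confidence))
        depth = len(path)
        if depth < len(marginals):
            # Position-wise marginals: children do not depend on the parent path.
            for token, prob in marginals[depth]:
                heapq.heappush(heap, (-(confidence * prob), path + (token,)))
    return tree


# Hypothetical two-position draft with two candidate tokens per position.
toy_marginals = [[("a", 0.9), ("b", 0.1)], [("c", 0.8), ("d", 0.2)]]
toy_tree = build_draft_tree(toy_marginals, cost_per_node=0.15, max_nodes=10)
for path, confidence in toy_tree:
    print(path, round(confidence, 3))
```

Summing the path confidences of the selected nodes gives a rough surrogate for expected accepted length, so raising `cost_per_node` yields a smaller tree and a lower surrogate, which is the trade-off the budget-aware framing describes.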

2605.29720 2026-05-29 cs.CV cs.LG 版本更新

Efficient, Validation-Free Intrinsic Quality Estimation for Large-Scale Face Recognition Datasets

面向大规模人脸识别数据集的高效、免验证的内在质量评估

Zhichao Chen, Yongle Zhao, Kaicheng Yang, Meng Yang, Yin Xie, Ziyong Feng

发表机构 * School of Cyber Science and Technology, University of Science and Technology of China(中国科学技术大学网络科学与技术学院)

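AI summary Proposes Intrinsic Quality (IQ), a metric that requires no full-scale training and estimates a face recognition dataset's potential to yield high-performance models via a neighbor-consistency score and global representation subspace complexity, enabling rapid dataset diagnosis and curation.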

Comments ICML 2026

详情
AI中文摘要

我们提出内在质量(IQ),一种无需验证的度量,旨在估计人脸识别(FR)数据集产生高性能模型的固有潜力,而无需进行全规模训练。IQ 包含两个组成部分:(i)邻域一致性得分,通过最近邻量化局部身份标签一致性;(ii)全局表示子空间复杂度(有效秩,ER),捕捉底层嵌入几何和数据集多样性。IQ 允许使用轻量级代理模型或数据子集进行快速评估,便于在资源密集型的全规模训练之前进行数据集诊断和筛选。我们描述了一个针对干净、噪声和混合质量 FR 数据集定制的实验协议,并概述了验证 IQ 对下游性能预测能力的评估方法。

英文摘要

We propose Intrinsic Quality (IQ), a validation-free metric designed to estimate the inherent potential of face recognition (FR) datasets to produce high-performance models without the need for full-scale training. IQ integrates two components: (i) a Neighbor-Consistency Score that quantifies local identity label agreement via nearest neighbors, and (ii) Global Representation Subspace Complexity (Effective Rank, ER), which captures the underlying embedding geometry and dataset diversity. IQ allows for rapid evaluation using lightweight proxy models or data subsets, facilitating dataset diagnosis and curation prior to resource-intensive full-scale training. We describe an experimental protocol tailored to clean, noisy, and mixed-quality FR datasets, and outline evaluation methodologies to validate IQ's predictive power for downstream performance.
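
The two components can be sketched in a few lines. The definitions below are assumptions, since the abstract does not give formulas: the neighbor-consistency score is taken as the average fraction of each sample's k nearest neighbors (by cosine similarity) that share its identity label, and effective rank is taken as the exponential of the Shannon entropy of the normalized singular values, a common definition. Singular values are passed in directly to keep the example dependency-free, and the embeddings are toy data.

```python
import math
from typing import Hashable, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def neighbor_consistency(embeddings: Sequence[Sequence[float]],
                         labels: Sequence[Hashable],
                         k: int = 1) -> float:
    """Mean fraction of each sample's k nearest neighbors sharing its label."""
    n = len(embeddings)
    total = 0.0
    for i in range(n):
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: cosine(embeddings[i], embeddings[j]),
                        reverse=True)
        total += sum(labels[j] == labels[i] for j in others[:k]) / k
    return total / n


def effective_rank(singular_values: Sequence[float]) -> float:
    """exp(entropy) of the singular values normalized to sum to one."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    return math.exp(-sum(p * math.log(p) for p in probs))


# Toy 2-D embeddings forming two tight identity clusters (hypothetical data).
toy_embeddings = [(1.0, 0.0), (0.9, 0.1), (0.0, 1.0), (0.1, 0.9)]
print(neighbor_consistency(toy_embeddings, ["a", "a", "b", "b"]))  # clean labels
print(neighbor_consistency(toy_embeddings, ["a", "b", "b", "b"]))  # one label flipped
print(effective_rank([1.0, 1.0, 1.0, 1.0]), effective_rank([1.0, 0.0, 0.0, 0.0]))
```

Label noise lowers the first score while a collapsed embedding spectrum lowers the second, so the two quantities respond to different dataset defects.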

2605.29713 2026-05-29 cs.LG cs.AI 版本更新

The Little Book of Generative AI Foundations: An Intuitive Mathematical Primer

生成式AI基础小书:直观数学入门

Tianhua Chen

发表机构 * School of Computing and Engineering(计算与工程学院)

AI总结 本书通过推导导向的方式,从PCA到能量模型,系统介绍现代生成式人工智能的数学基础,旨在使生成建模结构更易理解。

Comments Preprint version, 178 pages. Comments and corrections are welcome

详情
AI中文摘要

本书提供了对现代生成式人工智能数学基础的紧凑、推导导向的介绍。它不是调查每一个最近的架构或实现细节,而是通过连接主要生成模型家族的思想发展出一条连贯的路线,从PCA、概率PCA、变分自编码器和扩散模型到归一化流、自回归分解、GANs、Wasserstein GANs和基于能量的模型。目的是使生成建模的结构更易理解,同时不失去理解这些模型如何推导和关联所需的数学实质。本书旨在为具有数学好奇心的研究人员、从业者和学生提供基础构建的入门读物。

英文摘要

This book provides a compact, derivation-oriented introduction to the mathematical foundations of modern generative artificial intelligence. Rather than surveying every recent architecture or implementation detail, it develops a coherent route through the ideas connecting major families of generative models, from PCA, probabilistic PCA, variational autoencoders, and diffusion models to normalising flows, autoregressive factorisations, GANs, Wasserstein GANs, and energy-based models. The aim is to make the structure of generative modelling more accessible without removing the mathematical substance needed to understand how these models are derived and related. The book is intended as a foundation-building primer for mathematically curious researchers, practitioners, and students.

2605.29698 2026-05-29 cs.LG physics.chem-ph 版本更新

A Systematic Evaluation of Molecular Mixture Behavior Prediction

分子混合物行为预测的系统评估

Roel J. Leenhouts, Nathan K. Morgan, William Green, Jan G. Rittig, Florence H. Vermeire

Affiliations: KU Leuven; MIT; RWTH Aachen University

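AI summary: Proposes an evaluation framework that decomposes mixture-property error into pure-component and interaction components and, using seven matched datasets, finds that strong absolute accuracy can mask poor recovery of non-ideal mixing behavior.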

详情
AI中文摘要

分子性质预测的机器学习主要集中在纯化合物上,尽管许多实际应用依赖于具有分子间相互作用的混合物。最近的工作扩大了混合物数据集的可用性,但评估仍然主要关注绝对精度。然而,混合物中的绝对误差将纯组分贡献与理想混合的偏差混为一谈。我们提出了一个评估框架,将混合物性质误差分解为纯化合物和相互作用(非理想)成分。该框架结合了泄漏感知分割协议、理想混合物基线和过量性质指标。为了支持可重复的基准测试,我们整理了七个匹配的纯和混合物物理化学性质数据集。在多个混合物性质任务和模型家族中,我们发现强绝对精度可能掩盖对非理想混合物行为的恢复能力,并且在严格分子分割下性能显著下降。这些结果将向未见分子的迁移识别为分子混合物机器学习中的核心挑战,并推动超越绝对精度的评估。

英文摘要

Machine learning for molecular property prediction has focused largely on pure compounds, even though many practical applications depend on mixtures with intermolecular interactions. Recent work has expanded the availability of mixture datasets, but evaluation still focuses mainly on absolute accuracy. However, absolute errors in mixtures conflate pure-component contributions with deviations from ideal mixing. We propose an evaluation framework that decomposes mixture-property error into pure-compound and interaction (non-ideal) components. The framework combines leakage-aware split protocols, ideal-mixture baselines, and excess-property metrics. To support reproducible benchmarking, we curate seven matched pure and mixture physicochemical property datasets. Across multiple mixture-property tasks and model families, we find that strong absolute accuracy can mask poor recovery of non-ideal mixture behavior, and that performance drops substantially under strict molecule splits. These results identify transfer to unseen molecules as a central challenge in molecular mixture machine learning and motivate evaluation beyond absolute accuracy alone.
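The error decomposition described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's framework: it assumes the ideal-mixture baseline is a mole-fraction-weighted average of the pure-component values (the appropriate ideal mixing rule depends on the property), and all function names are hypothetical.

```python
def ideal_mixture(mole_fractions, pure_values):
    """Assumed ideal mixing rule: mole-fraction-weighted average of pure values."""
    return sum(x * p for x, p in zip(mole_fractions, pure_values))

def decompose_error(x, pure_true, pure_pred, mix_true, mix_pred):
    """Split the mixture error into a pure-component part and an excess (non-ideal) part.

    Returns (total_error, pure_error, excess_error), where
    total_error == pure_error + excess_error by construction.
    """
    ideal_true = ideal_mixture(x, pure_true)
    ideal_pred = ideal_mixture(x, pure_pred)
    excess_true = mix_true - ideal_true   # measured deviation from ideal mixing
    excess_pred = mix_pred - ideal_pred   # predicted deviation from ideal mixing
    pure_error = ideal_pred - ideal_true
    excess_error = excess_pred - excess_true
    total_error = mix_pred - mix_true
    return total_error, pure_error, excess_error
```

With an equimolar binary mixture whose pure values are 10 and 20 and whose measured mixture value is 13, a model that predicts the ideal value 15 has zero pure-component error but misses the entire excess contribution, which is the kind of failure that absolute accuracy alone does not separate out.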

2605.29695 2026-05-29 cs.AI cs.CE cs.LG math.PR 版本更新

FHRFormer: A Self-Supervised Masked Transformer Framework for Fetal Heart Rate Time-Series Inpainting and Forecasting

FHRFormer: 一种用于胎儿心率时间序列修复和预测的自监督掩码Transformer框架

Kjersti Engan, Neel Kanwal, Anita Yeconia, Ladislaus Blacy, Yuda Munyaw, Estomih Mduma, Hege Ersdal

Affiliations: University of Stavanger; Haydom Lutheran Hospital; Stavanger University Hospital

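AI summary: To address signal dropout in fetal heart rate monitoring, proposes a self-supervised masked-transformer autoencoder that reconstructs and forecasts missing signals by capturing local temporal and frequency components; the method is robust across gap durations and supports the development of AI-based risk algorithms.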

Comments Submitted to Frontiers in Digital Health. arXiv admin note: substantial text overlap with arXiv:2509.20852

详情
AI中文摘要

大约10%的新生儿出生时需要帮助才能开始呼吸,约5%需要通气支持。胎儿心率(FHR)监测在产前护理中评估胎儿健康状况方面起着关键作用,能够检测异常模式并支持及时产科干预以减轻分娩期间的胎儿风险。应用人工智能(AI)方法分析具有不同结局的连续FHR监测大数据集,可能为预测需要呼吸辅助或干预的风险提供新见解。可穿戴FHR监测仪的最新进展实现了在不影响母亲活动能力的情况下进行连续胎儿监测。然而,母亲运动期间的传感器移位以及胎儿或母亲位置的变化常常导致信号丢失,造成记录的FHR数据出现缺口。这种缺失数据限制了有意义信息的提取,并使基于AI的自动化分析复杂化。传统的缺失数据处理方法,如简单插值技术,往往无法保留信号的频谱特性。在本文中,我们提出了一种基于掩码Transformer的自编码器方法,通过捕获数据的局部时间和频率成分来重建缺失的FHR信号。所提出的方法在不同缺失数据时长下表现出鲁棒性,可用于信号修复和预测。该方法可回顾性地应用于研究数据集,以支持基于AI的风险算法开发。未来,该方法可集成到可穿戴FHR监测设备中,实现更早、更稳健的风险检测。

英文摘要

Approximately 10% of newborns require assistance to initiate breathing at birth, and around 5% need ventilation support. Fetal heart rate (FHR) monitoring plays a crucial role in assessing fetal well-being during prenatal care, enabling the detection of abnormal patterns and supporting timely obstetric interventions to mitigate fetal risks during labor. Applying artificial intelligence (AI) methods to analyze large datasets of continuous FHR monitoring episodes with diverse outcomes may offer novel insights into predicting the risk of needing breathing assistance or interventions. Recent advances in wearable FHR monitors have enabled continuous fetal monitoring without compromising maternal mobility. However, sensor displacement during maternal movement, as well as changes in fetal or maternal position, often lead to signal dropout, resulting in gaps in recorded FHR data. Such missing data limits the extraction of meaningful insights and complicates automated (AI-based) analysis. Traditional approaches to handling missing data, such as simple interpolation techniques, often fail to preserve the spectral characteristics of the signals. In this paper, we propose a masked transformer-based autoencoder approach to reconstruct missing FHR signals by capturing both local temporal and frequency components of the data. The proposed method demonstrates robustness across varying durations of missing data and can be used for signal inpainting and forecasting. The proposed approach can be applied retrospectively to research datasets to support the development of AI-based risk algorithms. In the future, the proposed method could be integrated into wearable FHR monitoring devices to achieve earlier and more robust risk detection.

2605.29693 2026-05-29 cs.LG cs.RO 版本更新

Momentum Based Reward Design for Low Emission Traffic Signal Control

基于动量的低排放交通信号控制奖励设计

Chinmay Mundane, Amith Manoharan, Arun Singh

Affiliations: Institute of Technology, University of Tartu

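AI summary: Proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone, achieving better throughput-emission trade-offs and more stable learning behavior in SUMO simulations.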

详情
AI中文摘要

城市交通拥堵是一个日益严重的全球性问题,导致通勤时间延长和环境污染加剧。传统的交通信号控制系统往往难以适应动态交通状况。自适应交通信号控制可以在不改变道路基础设施的情况下改善城市交通。深度强化学习(DRL)在此任务中表现出色,但现有的基于延误和队列的奖励常常产生短视或不稳定的策略。本文提出了一种基于动量的奖励函数(MBRF),鼓励车辆持续移动,而非仅惩罚拥堵。该方法在SUMO(城市交通仿真)中使用标准交通指标(如等待时间、队列长度、吞吐量和CO2排放)进行评估。结果表明,与基于延误或队列的奖励以及经典控制器(如Max Pressure和LQF)相比,所提出的奖励实现了更好的吞吐量-排放权衡和更稳定的学习行为。

英文摘要

Urban traffic congestion is a growing global issue contributing significantly to long commute times and environmental pollution. Traditional traffic signal control systems often fail to adapt to dynamic traffic conditions. Adaptive traffic signal control can improve urban traffic without changing road infrastructure. Deep Reinforcement Learning (DRL) has shown strong performance for this task, but existing delay- and queue-based rewards often produce short-sighted or unstable policies. This paper proposes a Momentum-Based Reward Function (MBRF) that encourages vehicles to keep moving rather than penalizing congestion alone. The method is evaluated in SUMO (Simulation of Urban MObility) using standard traffic metrics such as waiting time, queue length, throughput, and CO2 emissions. Results show that the proposed reward produces better throughput-emission trade-offs and more stable learning behavior than delay- or queue-based rewards, as well as classical controllers such as Max Pressure and LQF.

2605.29688 2026-05-29 cs.LG 版本更新

A Novel Tensor Product-Based Neural Network for Solving Partial Differential Equations

一种基于张量积的新型神经网络用于求解偏微分方程

Qihong Yang, Yangtao Deng, Qiaolin He, Shiquan Zhang

Affiliations: School of Mathematics, Sichuan University

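AI summary: Proposes the Tensor Product Network (TPNet), which represents the solution explicitly as a linear combination of basis functions and solves for the coefficients directly by least squares, enabling efficient and accurate function approximation and PDE solving.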

Comments 44 pages, 11 figures

详情
AI中文摘要

本文提出了张量积网络(TPNet),一种用于高效准确函数逼近和PDE求解的新型神经架构。该方案的核心是将解显式构造为集成到网络中的基函数的线性组合,系数通过直接最小二乘求解确定,从而绕过了传统的基于梯度的训练。关键的方法贡献包括:(1)一种高效的张量积方案,通过组合两组子网络输出的组合生成多维基函数,在保持表达力的同时显著降低模型复杂度和参数数量;(2)一种块时间推进策略,以提高长时间模拟的计算效率;(3)一种线性重构策略,通过将已知非线性项视为源项来处理非线性PDE。TPNet在准确性和训练时间上优于传统神经网络求解器。这一性能提升源于其结构化设计和确定性最小二乘拟合,与主流方法(如物理信息神经网络PINNs)所需的迭代且通常计算密集的优化形成对比。

英文摘要

This paper presents the Tensor Product Network (TPNet), a novel neural architecture for efficient and accurate function approximation and PDE solving. The core of the proposal involves constructing the solution explicitly as a linear combination of basis functions integrated into the network, with coefficients determined by a direct least-squares solve, thereby bypassing traditional gradient-based training. The key methodological contributions include: (1) an efficient tensor-product scheme that generates multi-dimensional basis functions from combinations of two sets of subnetwork outputs, significantly reducing model complexity and parameter count while maintaining expressivity; (2) a block time-marching strategy to improve computational efficiency in long-time simulations; and (3) a linear reformulation strategy for handling nonlinear PDEs by treating known nonlinear terms as sources. TPNet achieves superior accuracy and shorter training times than conventional neural network solvers. This performance gain stems from its structured design and deterministic least-squares fitting, which contrast with the iterative, often computationally intensive optimization required by mainstream methods like Physics-Informed Neural Networks (PINNs).
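The idea of building a multi-dimensional basis from products of two one-dimensional function sets and then fitting coefficients by least squares can be sketched in a few lines. This is only an illustration under stated assumptions: fixed polynomials stand in for the subnetwork outputs, the fit is plain function approximation rather than a PDE solve, and every name here is hypothetical rather than taken from TPNet.

```python
def solve_linear(a, b):
    """Solve a @ x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [v - f * p for v, p in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def tensor_basis(x, y, phi, psi):
    """All products phi_i(x) * psi_j(y): a 2-D basis from two 1-D function sets."""
    return [f(x) * g(y) for f in phi for g in psi]

def fit_least_squares(points, targets, phi, psi):
    """Solve the normal equations (A^T A) c = A^T u for the coefficients c."""
    rows = [tensor_basis(x, y, phi, psi) for x, y in points]
    n = len(rows[0])
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(n)] for i in range(n)]
    atu = [sum(r[i] * u for r, u in zip(rows, targets)) for i in range(n)]
    return solve_linear(ata, atu)

# Stand-ins for the two sets of subnetwork outputs (assumption: fixed polynomials).
phi = [lambda x: 1.0, lambda x: x]
psi = [lambda y: 1.0, lambda y: y]
grid = [(i / 3, j / 3) for i in range(4) for j in range(4)]
target = lambda x, y: 1.0 + 2.0 * x + 3.0 * y + 4.0 * x * y
# Coefficients come out in the order of the basis: 1, y, x, x*y.
coeffs = fit_least_squares(grid, [target(x, y) for x, y in grid], phi, psi)
```

Two sets of size m and n yield m*n product basis functions while only m + n one-dimensional functions need to be represented, which is the parameter saving the abstract refers to. The coefficients are obtained from one deterministic linear solve, with no gradient-based iterations.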

2605.29684 2026-05-29 cs.LG cond-mat.dis-nn stat.ML 版本更新

Kernel Renormalization in Bayesian Deep Neural Networks: the Equivalent Wishart Ansatz in the Proportional Regime

贝叶斯深度神经网络中的核重整化:比例机制下的等效Wishart假设

Paolo Baglioni, Christian Keup, Vincenzo Zimbardo, Rosalba Pacelli, Alessandro Vezzani, Raffaella Burioni, Pietro Rotondo

Affiliations: INFN, Sezione di Milano Bicocca; INFN, Gruppo Collegato di Parma; Dipartimento di Scienze Matematiche, Fisiche e Informatiche, Università degli Studi di Parma; Istituto dei Materiali per l’Elettronica ed il Magnetismo (IMEM-CNR), Parco Area delle Scienze

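AI summary: For Bayesian multi-layer perceptrons of fixed depth L, proposes an equivalent Wishart Ansatz that captures the stochastic fluctuations of the hierarchical empirical kernels. A large deviation analysis yields a description in terms of a renormalized NNGP kernel, representation learning in the proportional limit is characterized by at most L scalar order parameters, and an extension to CNNs reveals a local kernel renormalization mechanism.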

Comments 45 pages, 21 figures

详情
AI中文摘要

训练集大小$P$和深度神经网络宽度$N$以相同速率增长的比例宽度极限,已被深入研究用于浅层单隐藏层网络。然而,将这些非微扰结果从浅层架构扩展到深度非线性网络已被证明非常具有挑战性。在这里,我们提出了一种有效的近似方法,用于预测固定深度$L$的贝叶斯多层感知机(MLP)在任意高维数据上的泛化性能。我们提出了一个等效Wishart假设,以捕捉MLP层次经验核的主要随机涨落。这使我们能够在比例极限下对MLP的配分函数进行大偏差分析,并用重正化NNGP核表示。在这种描述中,即使比例极限下的强表示学习也由至多$L$个标量序参数编码,这些参数自洽确定。将该方法扩展到卷积架构(CNN),我们识别出一种层次局部核重整化机制,该机制允许量化CNN中由于有限宽度效应导致的大宽度核的更复杂数据相关变换。我们在经典基准数据集上,针对深度$L \sim O(10)$和$P\sim O(10^3)$的有限深度神经网络的贝叶斯后验采样实验测试了我们的有效理论,发现总体吻合良好,同时存在两种不同类型的系统性偏差。

英文摘要

The scaling limit where both the size of the training set $P$ and the width $N$ of a deep neural network grow at the same rate, the so-called proportional-width regime, has been intensely studied for shallow, single-hidden-layer networks. However, extending these non-perturbative results from shallow architectures to deep non-linear networks has proven very challenging. Here we present an effective approximate approach to predict the generalization performance of Bayesian multi-layer perceptrons (MLPs) of fixed depth $L$ on arbitrary high-dimensional data. We propose an equivalent Wishart Ansatz to capture the dominant stochastic fluctuations of the hierarchical empirical kernels of MLPs. This allows us to perform a large deviation analysis for the partition function of MLPs in the proportional limit, expressed in terms of a renormalized NNGP kernel. In this description, even strong representation learning in the proportional limit is encoded in at most $L$ scalar order parameters, determined self-consistently. Extending the approach to convolutional architectures (CNNs), we identify a hierarchical local kernel renormalization mechanism, which allows us to quantify more complex data-dependent transformations of the large-width kernel in CNNs due to finite-width effects. We test our effective theory against sampling experiments from the Bayesian posterior of finite deep neural networks with depths $L \sim O(10)$ and $P\sim O(10^3)$ on classic benchmark datasets, finding overall very good agreement together with two distinct types of systematic deviations.

2605.29673 2026-05-29 cs.LG cs.CV 版本更新

A Geometric View of SRC: Learning Representations for Stable Residual Inference

SRC的几何视角:学习用于稳定残差推理的表示

Vangelis P. Oikonomou

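AI summary: Analyzes the residual-ordering stability of Sparse Representation Classification (SRC) from a geometric perspective, proposes geometry-shaping objectives to improve representation learning, and evaluates them on several datasets.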

Comments 37 pages

详情
AI中文摘要

基于重构的推理通过比较类重构残差来分配类别;稀疏表示分类(SRC)是一个典型实例,其可靠性取决于学习表示的几何结构。我们采用严格的训练-推理分离:SRC仅作为固定的测试时规则使用,在训练过程中从不进行微分、展开或优化。在基于类条件张成子空间及其相关投影残差的张成子空间理想化中,我们通过残差间隔形式化残差排序稳定性,并刻画了可能在最坏方向破坏该间隔的几何障碍——张成子空间重叠、支配以及通过小主角产生的近重叠。这一张成子空间理论是首要的:它指定了理想化残差族何时良好分离,并为实际残差近似(如OMP)提供了条件性的求解器级解释,只要它们接近张成子空间级别的残差排序。在显式的覆盖和分离假设下,我们推导了(理想化)残差间隔的定量下界。在这些目标的指导下,我们提出了几何塑造目标,这些目标促进掩蔽的类内自表达性,抑制跨类重构路径和类间张成子空间对齐,并防止坍塌——而在训练过程中不调用SRC残差或预测。在图像(COIL-100)、文本(TREC)和EEG连接性上的实验,在相同的固定SRC/OMP推理下评估所有表示,并报告残差间隔和几何诊断;交叉熵仅作为相同评估协议下的参考几何包含在内。

英文摘要

Reconstruction-based inference assigns a class by comparing class-wise reconstruction residuals; Sparse Representation Classification (SRC) is a canonical instance whose reliability depends on the geometry of the learned representation. We adopt a strict training-inference separation: SRC is used only as a fixed test-time rule and is never differentiated, unrolled, or optimized during training. In a span-level idealization based on class-conditional spans and their associated projection residuals, we formalize residual-ordering stability through a residual margin and characterize geometric obstructions -- span overlap, dominance, and near-overlap via small principal angles -- that can collapse this margin in worst-case directions. This span-level theory is primary: it specifies when the idealized residual family is well-separated, and it provides a conditional solver-level interpretation for practical residual approximations (e.g., OMP) insofar as they remain close to the span-level residual ordering. Under explicit coverage and separation assumptions, we derive a quantitative lower bound on the (idealized) residual margin. Guided by these targets, we propose geometry-shaping objectives that promote masked within-class self-expressiveness, discourage cross-class reconstruction pathways and inter-class span alignment, and prevent collapse -- without invoking SRC residuals or predictions during training. Experiments on images (COIL-100), text (TREC), and EEG connectivity evaluate all representations under identical fixed SRC/OMP inference and report residual margins and geometric diagnostics; cross-entropy is included only as a reference geometry under the same evaluation protocol.
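The span-level idealization in the abstract (class-conditional spans, projection residuals, and a residual margin) can be illustrated with a short sketch. It is not the paper's procedure: practical SRC uses sparse coding such as OMP over a dictionary, whereas this sketch uses exact orthogonal projection onto each class span, and the margin is assumed to be the runner-up residual minus the smallest residual. All names are hypothetical.

```python
import math

def orthonormal_basis(vectors, tol=1e-12):
    """Gram-Schmidt: orthonormal basis of the span of the given vectors."""
    basis = []
    for v in vectors:
        w = list(v)
        for q in basis:
            coeff = sum(a * b for a, b in zip(w, q))
            w = [a - coeff * b for a, b in zip(w, q)]
        norm = math.sqrt(sum(a * a for a in w))
        if norm > tol:  # skip vectors already in the span
            basis.append([a / norm for a in w])
    return basis

def projection_residual(x, class_vectors):
    """Norm of x minus its orthogonal projection onto span(class_vectors)."""
    r = list(x)
    for q in orthonormal_basis(class_vectors):
        coeff = sum(a * b for a, b in zip(r, q))
        r = [a - coeff * b for a, b in zip(r, q)]
    return math.sqrt(sum(a * a for a in r))

def classify_with_margin(x, class_to_vectors):
    """Pick the class with the smallest residual; margin = runner-up minus best (assumed)."""
    residuals = {c: projection_residual(x, v) for c, v in class_to_vectors.items()}
    ranked = sorted(residuals, key=residuals.get)
    margin = residuals[ranked[1]] - residuals[ranked[0]]
    return ranked[0], margin
```

When two class spans overlap or nearly overlap, a query lying in the shared directions has almost equal residuals for both classes and the margin shrinks toward zero, which is the geometric obstruction the abstract describes.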

2605.29664 2026-05-29 cs.DC cs.LG 版本更新

AMDP: Asynchronous Multi-Directional Pipeline Parallelism for Large-Scale Models Training

AMDP:面向大规模模型训练的异步多方向流水线并行

Ling Chen, Houming Wu, Wenjie Yu

Affiliations: State Key Laboratory of Blockchain and Data Security, Zhejiang University, Hangzhou, China; College of Computer Science and Technology, Zhejiang University, Hangzhou, China

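AI summary: To address the convergence degradation caused by parameter mismatch in asynchronous pipeline parallelism, proposes AMDP, which limits the number of minibatches the first pipeline stage processes, launches multiple concurrent pipelines whose number adapts to pipeline depth, and accumulates gradients across minibatches before a single update, accelerating training while sustaining high utilization and preserving convergence.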

Comments Accepted by ICML 2026; 9 pages, 8 figures

详情
AI中文摘要

流水线并行对于大规模模型训练至关重要,但现有的异步方法常因前向和反向传播之间的参数不匹配而损害收敛性。我们提出异步多方向流水线并行(AMDP)来缓解此问题,同时保持高利用率。AMDP限制每个流水线的第一阶段在反向传播前最多处理两个小批量,从而限制了前向和反向传播之间的参数更新次数。为减轻由此产生的流水线气泡,AMDP启动多条并发流水线,并根据流水线深度自适应调整其数量。此外,AMDP跨小批量累积梯度并在一次更新中应用,确保只有有限数量的小批量经历参数不匹配,且限制在一个优化步骤内。在GPT和BERT风格模型上的实验表明,AMDP在保持收敛的同时显著加速了训练。

英文摘要

Pipeline parallelism is essential for large-scale model training, but existing asynchronous approaches often degrade convergence due to parameter mismatch between forward and backward passes. We propose Asynchronous Multi-Directional Pipeline parallelism (AMDP) to mitigate this issue while sustaining high utilization. AMDP limits the first stage of each pipeline to process at most two minibatches before backpropagation, bounding the number of parameter updates between forward and backward passes. To alleviate the resulting pipeline bubbles, AMDP launches multiple concurrent pipelines and adapts their number according to pipeline depth. In addition, AMDP accumulates gradients across minibatches and applies them in a single update, ensuring that only a bounded number of minibatches experience parameter mismatch, limited to within one optimization step. Experiments on GPT- and BERT-style models demonstrate that AMDP significantly accelerates training while preserving convergence.
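The gradient-accumulation step mentioned in the abstract can be illustrated in isolation. The sketch below is generic gradient accumulation on a one-parameter least-squares model, not AMDP's pipeline scheduling; the model, the function names, and the learning rate are assumptions made for illustration.

```python
def grad(w, batch):
    """Gradient of the mean of 0.5 * (w * x - y)**2 over one minibatch."""
    return sum((w * x - y) * x for x, y in batch) / len(batch)

def accumulated_step(w, minibatches, lr):
    """Evaluate every minibatch at the same weights, then apply one update."""
    total = 0.0
    for batch in minibatches:
        total += grad(w, batch)  # weights are not changed inside the loop
    return w - lr * total / len(minibatches)

def per_minibatch_steps(w, minibatches, lr):
    """Baseline: update after every minibatch, so later batches see newer weights."""
    for batch in minibatches:
        w = w - lr * grad(w, batch)
    return w
```

For equally sized minibatches, the accumulated step coincides with one full-batch gradient step, because every gradient is evaluated at the same parameter value. Applying the accumulated gradient in a single update is what confines any forward/backward parameter mismatch to one optimization step.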

2605.29659 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Opir: Efficient Multi-Task Safety Classification for Toxicity, Jailbreaks, Hate Speech, and Harmful Content

Opir:针对毒性、越狱、仇恨言论和有害内容的高效多任务安全分类

Ihor Stepanov, Aleksandr Smechov

Affiliations: Knowledgator; Wordcab

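AI summary: Proposes Opir, a family of encoder-based guardrail models built on the GLiClass architecture that use multi-task learning for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization; across 12 safety-classification tasks and 17 category tasks it is competitive with existing guardrail systems at a smaller deployment footprint.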

Comments 23 pages, 4 figures, 9 tables

详情
AI中文摘要

大型语言模型(LLM)应用的实时安全过滤需要能够检测不安全提示、有毒语言、越狱尝试和不安全响应的分类器,且不能像大型护栏模型那样成本高昂,同时要能区分良性的敏感文本与真正隐蔽的有害内容。在本文中,我们介绍了Opir,一个基于GLiClass架构的编码器护栏模型系列。Opir包括用于二进制安全/不安全分类、多标签毒性分类、越狱分类以及零样本不安全提示和响应分类的多任务模型。我们还发布了专门用于二进制安全/不安全分类的边缘变体,参数少于1亿。这些模型在一个三级分类体系上训练,该体系包含16个顶层标签、126个中层标签和854个叶标签,共996个类别。Opir的训练数据结合了基于分类体系的不安全提示、对抗性挖掘的难负例、良性安全保持示例、生成的响应示例、多语言翻译以及Aegis2和WildGuard训练子集的部分内容。我们还开源了一个评估工具,支持GLiClass和GLiNER2后端以及基于解码器的模型,涵盖二进制安全分类、多标签分类、毒性、越狱检测、提示安全、响应安全、响应拒绝以及跨公共基准系列的提示子类别视图。在与八个当代护栏系统(包括基于GLiNER2和生成式护栏模型)的扩展比较中,涵盖12项安全分类任务和17项类别任务,Opir变体在大多数基准数据集上与最强的开源基线模型竞争或领先,同时部署规模显著更小。

英文摘要

Real-time safety filtering for large language model (LLM) applications requires classifiers that can detect unsafe prompts, toxic language, jailbreak attempts, and unsafe responses without the cost profile of large guardrail models, and that can distinguish benign sensitive text from genuinely covert harmful content. In this paper, we introduce Opir, a family of encoder-based guardrail models built on the GLiClass architecture. Opir includes multi-task models for binary safe/unsafe classification, multi-label toxicity classification, jailbreak classification, and zero-shot unsafe prompt and response categorization. We also release edge variants with fewer than 100M parameters dedicated to binary safe/unsafe categorization. The models are trained on a three-level taxonomy containing 996 categories across 16 top-level labels, 126 mid-level labels, and 854 leaf labels. Opir's training data combines taxonomy-grounded unsafe prompts, adversarially mined hard negatives, benign safety-preserving examples, generated response examples, multilingual translations, and portions of the Aegis2 and WildGuard training subsets. We also open-source an evaluation harness that supports GLiClass and GLiNER2 backends as well as decoder-based models, and covers binary safety classification, multi-label categorization, toxicity, jailbreak detection, prompt safety, response safety, response refusal, and prompt subcategory views across public benchmark families. Across an expanded comparison spanning 12 safety-classification tasks and 17 category tasks against eight contemporary guardrail systems -- including both GLiNER2-based and generative guardrail models -- Opir variants are competitive with or ahead of the strongest open-weight baselines on the majority of benchmark datasets while operating with a substantially smaller deployment footprint.

2605.29645 2026-05-29 cs.LG cs.AI stat.ML 版本更新

The Sample Complexity of Multiclass and Sparse Contextual Bandits

多类别和稀疏上下文赌博机的样本复杂度

Liad Erez, Fan Chen, Alon Cohen, Tomer Koren, Yishay Mansour, Shay Moran, Alexander Rakhlin

Affiliations: Tel Aviv University; Massachusetts Institute of Technology; Google Research Tel Aviv; Technion – Israel Institute of Technology

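AI summary: For stochastic i.i.d. contextual bandits, proposes algorithms based on the decision-estimation coefficient and on low-variance exploration that achieve near-optimal sample complexity under sparse rewards, together with a matching lower bound.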

详情
AI中文摘要

我们研究随机i.i.d.设置下的上下文赌博机,其中学习器观察来自未知分布的上下文,从有限集合$A$中选择动作,并旨在基于赌博机反馈从给定类别中识别近似最优策略。受零一奖励的赌博机多类别分类启发,我们关注$s$-稀疏设置,其中对于每个上下文,奖励向量的$L_1$范数至多为$s \ll |A|$。我们的主要结果是设计算法,以高概率输出一个相对于策略类$Π$的$ε$-最优策略,使用$\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$个样本。我们将此界推广到一般Natarajan类,并补充了匹配的下界(对数因子内),从而缩小了先前工作(Erez等人,2024, 2025)留下的巨大差距,后者额外增加了$Θ(|A|^9)$依赖。我们通过两种互补方法获得这些结果。首先,我们从具有结构化观测的上下文决策角度分析上下文赌博机,设计了一种探索-优化算法,其样本复杂度由决策估计系数(DEC;Foster等人,2021, 2022)控制。我们证明,在$s$-稀疏奖励下,诱导的模型类具有随$s$缩放的尖锐DEC界,直接产生最优速率。由于这种方法主要是信息论性的,并涉及求解复杂的min-max优化问题,我们还开发了第二种更专门的算法方法,基于低方差探索技术。这种方法产生了具体、易处理的算法,并自然地扩展到上下文组合半赌博机,为赌博机多类别列表分类提供了改进的样本复杂度保证。

英文摘要

We study contextual bandits in the stochastic i.i.d. setting, where a learner observes contexts drawn from an unknown distribution, selects actions from a finite set $A$, and aims to identify an approximately optimal policy from a given class based on bandit feedback. Motivated by bandit multiclass classification with zero-one rewards, we focus on the $s$-sparse setting in which, for every context, the reward vector has $L_1$-norm at most $s \ll |A|$. Our main result is the design of algorithms that, with high probability, output an $ε$-optimal policy compared to policy class $Π$ using $\tilde{O} ((s/ε^2 + |A|/ε)\log |Π|/δ)$ samples. We extend this bound to general Natarajan classes and complement it with a matching lower bound (up to logarithmic factors), thereby closing a substantial gap left by prior work (Erez et al., 2024, 2025), which incurred an additional $Θ(|A|^9)$ dependence. We obtain these results via two complementary approaches. First, we analyze contextual bandits through the lens of contextual decision making with structured observations, designing an exploration-by-optimization algorithm whose sample complexity is governed by the decision-estimation coefficient (DEC; Foster et al., 2021, 2022). We show that, with $s$-sparse rewards, the induced model class admits a sharp DEC bound that scales with $s$ and directly yields the optimal rate. Since this approach is largely information-theoretic and involves solving complex min-max optimization problems, we also develop a second, more specialized algorithmic method based on a low-variance exploration technique. This approach leads to concrete, tractable algorithms and naturally extends to contextual combinatorial semi-bandits, leading to improved sample complexity guarantees for bandit multiclass list classification.

2605.29642 2026-05-29 stat.ML cs.IT cs.LG math.IT 版本更新

Matching Rates and Optimal Allocation for Federated Probe-Logit Distillation under Heterogeneous Bandwidth Budgets

异构带宽预算下的联邦探针-逻辑蒸馏匹配率与最优分配

Prasanjit Dubey, Xiaoming Huo

Affiliations: H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology, Atlanta, GA 30332, U.S.A.

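AI summary: For federated probe-logit distillation (FPLD), addresses the tightness of the bandwidth-term rate and bandwidth allocation across heterogeneous nodes, giving a matching lower bound, a multi-round refinement scheme, and a closed-form optimal allocation rule.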

详情
AI中文摘要

在联邦语言建模中,$K$个节点各自持有$n$个样本,但无法合并数据或交换全精度梯度或权重。我们研究当每个节点在公共探针集上每次查询最多上传$B$比特时,对$V$个令牌上的条件分布进行估计的极小极大速率。在联邦探针-逻辑蒸馏(FPLD)中,每个节点在探针集上传输一个标量量化的逻辑向量,聚合器蒸馏出一个全局参数化学生模型。先前的工作(Dubey and Huo, 2026)建立了高概率KL速率$O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$加上优化松弛项,其中带宽项采用迹锐化形式。该带宽项速率是否紧致,以及上界如何推广到异构每节点带宽,仍是开放问题。我们填补了这两个空白。首先,抖动FPLD构造在非退化条件下具有匹配的单轮下界$Ω(K^{-1} \cdot 2^{-2B/V})$,将带宽轴速率确定为$Θ(K^{-1} \cdot 2^{-2B/V})$。使用嵌套/缩放残差量化器的$T$轮顺序细化达到$O(K^{-1} \cdot 2^{-2TB/V})$;对于任意$T > 1$,原始FPLD的与$T$无关的带宽项是次优的。其次,我们建立了每节点预算$B_i$的异构带宽上界,并配以闭合形式的最优分配$B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$,这是一种对数倾斜的注水规则,是失真率优化中反向注水的每节点类比。一种即插即用自适应变体通过短预热阶段估计权重,并达到$1 + O(\sqrt{\log(K/δ)/(m T_0)})$的相对次优性。合成n-gram模拟证实经验KL被上界和下界所界定,并且在异构裁剪下最优分配严格优于均匀和逆权重基线。

英文摘要

In federated language modeling, $K$ nodes each hold $n$ samples but cannot pool data or exchange full-precision gradients or weights. We study the minimax rate at which a conditional distribution over $V$ tokens can be estimated when each node may upload at most $B$ bits per query in a public probe set. In federated probe-logit distillation (FPLD), each node transmits a scalar-quantized logit vector on the probe set, and an aggregator distills a global parametric student. Prior work (Dubey and Huo, 2026) establishes a high-probability KL rate $O(d/(Kn) + ρ\sqrt{V \log V / m} + K^{-1} \cdot 2^{-2B/V})$ plus optimization slack, with the bandwidth term in its trace-sharpened form. Whether this bandwidth-term rate is tight, and how the upper bound generalizes to heterogeneous per-node bandwidths, are left open. We close both gaps. First, the dithered FPLD construction has a matching single-round lower bound $Ω(K^{-1} \cdot 2^{-2B/V})$ under non-degeneracy, pinning the bandwidth-axis rate at $Θ(K^{-1} \cdot 2^{-2B/V})$. $T$-round sequential refinement with nested/scaled residual quantizers achieves $O(K^{-1} \cdot 2^{-2TB/V})$; vanilla FPLD's $T$-independent bandwidth term is suboptimal for every $T > 1$. Second, we establish a heterogeneous-bandwidth upper bound for per-node budgets $B_i$, paired with a closed-form optimal allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$, a log-tilted water-filling rule that is the per-node analogue of reverse water-filling for distortion-rate optimization. A plug-in adaptive variant estimates the weights from a short warm-up phase and attains $1 + O(\sqrt{\log(K/δ)/(m T_0)})$ relative suboptimality. Synthetic n-gram simulations confirm that empirical KL is bracketed by the upper and lower bounds and that the optimal allocation strictly dominates uniform and inverse-weighted baselines under heterogeneous clipping.
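The closed-form allocation $B_i^* = B_{\mathrm{tot}}/K + (V/2) \log_2(w_i / \bar{w}_g)$ is easy to evaluate numerically. The abstract does not define $\bar{w}_g$; the sketch below assumes it is the geometric mean of the node weights $w_i$, which is the choice that makes the log-tilt terms cancel so that the budgets sum to $B_{\mathrm{tot}}$. The function name is hypothetical, and clipping of negative or non-integer budgets is not handled.

```python
import math

def optimal_allocation(weights, total_bits, vocab_size):
    """B_i = B_tot / K + (V / 2) * log2(w_i / w_bar_g).

    Assumption: w_bar_g is the geometric mean of the weights.
    """
    k = len(weights)
    log_gm = sum(math.log2(w) for w in weights) / k  # log2 of the geometric mean
    return [
        total_bits / k + (vocab_size / 2) * (math.log2(w) - log_gm)
        for w in weights
    ]
```

For example, with weights 1, 2, and 4, a total budget of 300 bits, and V = 10, the uniform share is 100 bits per node and the log-tilt moves 5 bits from the lowest-weight node to the highest-weight node. Equal weights recover the uniform allocation.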

2605.29635 2026-05-29 math.OC cs.LG 版本更新

MoSSP: A Momentum-Based Single-Loop Stochastic Penalty Method for Nonconvex Constrained DC-Regularized Optimization

MoSSP: 基于动量的单环随机惩罚方法用于非凸约束DC正则化优化

Luxuan Li, Chunfeng Cui, Xiao Wang

Affiliations: School of Mathematical Sciences, Beihang University, Beijing 100191, China; School of Computer Science and Engineering, Sun Yat-sen University, Guangzhou 510006, China

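AI summary: Proposes MoSSP, a momentum-based single-loop stochastic penalty method for stochastic optimization problems with nonconvex constraints and DC regularization, achieving oracle complexities of O(ε^{-4}) and O(ε^{-3}).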

Comments 35 pages, 3 figures

详情
AI中文摘要

本文研究了一类具有差凸(DC)正则化的非凸约束随机问题,其中可行集可能是非凸的,且DC正则化子的凹部分允许非光滑。基本挑战在于在保持非凸约束可行性的同时实现良好的oracle复杂度。尽管单环算法能有效解决无约束DC优化问题,但它们在具有DC结构的约束优化中的潜力尚未被充分探索。为填补这一空白,我们开发了MoSSP,一种基于动量的单环随机惩罚方法,用于此类问题,并具有可证明的复杂度保证。关键思想是将单个随机近端梯度步骤应用于惩罚的Moreau包络加上凸DC部分,同时并行计算凹部分的近端映射。我们推导了两种算法变体:一种具有O(ε^{-4}) oracle复杂度的Polyak动量版本,用于寻找随机ε-KKT点,以及一种改进的O(ε^{-3})版本,结合了递归动量。实验结果证明了所提算法的有效性。

英文摘要

In this paper, we study a structured class of nonconvex constrained stochastic problems with difference-of-convex (DC) regularization, where the feasible set is possibly nonconvex and the concave part of the DC regularizer is allowed to be nonsmooth. The fundamental challenge lies in maintaining feasibility for nonconvex constraints while achieving favorable oracle complexity. Although single-loop algorithms efficiently solve unconstrained DC optimization problems, their potential for constrained optimization with DC structure remains largely unexplored. To address this gap, we develop MoSSP, a Momentum-based Single-loop Stochastic Penalty method for such problems with provable complexity guarantees. The key idea is to apply a single stochastic proximal-gradient step to the Moreau envelope of the penalty plus the convex DC part, with the concave part's proximal mapping computed in parallel. We derive two algorithm variants: a Polyak-momentum version with $O(\varepsilon^{-4})$ oracle complexity for finding stochastic $\varepsilon$-KKT points, and an improved $O(\varepsilon^{-3})$ version incorporating recursive momentum. Experimental results demonstrate the effectiveness of the proposed algorithms.

2605.29634 2026-05-29 cs.LG 版本更新

Relational Rank Geometry in Transformers: Detecting and Steering Hidden-State Relation Frames

Transformer中的关系秩几何:检测与引导隐藏状态关系框架

Mazen Kobrosly

发表机构 * Independent Researcher(独立研究者)

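AI summary: Uses Plücker sign entropy to detect the rank-indexed geometry of tuple relations in Transformer hidden states and validates controlled interventions on Llama-family models, providing a controlled bridge from relation probing to relation-frame intervention.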

Comments 32 pages, 9 figures

详情
AI中文摘要

Transformer隐藏状态通常通过局部或低阶对象解释:神经元、稀疏特征、注意力头、残差流方向或激活补丁。本文研究一个互补对象:元组间关系的秩索引几何。我使用Plücker符号熵来测试r元关系是否在隐藏状态空间中留下arity匹配的方向签名。在Llama系列8B、70B和405B检查点上,真实关系元组在预期秩k=r(r=3,...,6)处显示出比随机控制审计中打乱元组更强的方向签名一致性。多模板审计表明,这些效应在表面变化下仍然存在,所有测试的405B行保持正预期秩边际,8B/70B保持正行,但带有构造器特定的混合单元。然后我问相同的关系几何是否可以被引导。在一个边缘网格干净/损坏干预实验中,使用32个提示,行/列框架和答案格式保持不变,而YES/NO关系图发生变化,损坏的隐藏状态关系框架被修补为干净或安慰剂目标。在70B和405B中,干净目标的关系框架路径恢复了干净答案行为和残差关系几何,而仅质心和等范数控制显示出可忽略的恢复。位置/顺序控制进一步将标记点重要性从有序干净框架几何中分离:目标干净形状和跨提示干净形状在标记接口处恢复行为和残差几何,而损坏供体转移、同位置置换/反射、错误位置干净增量、仅质心运动和等范数噪声失败或远低于干净框架路径。结果是从关系探测到关系框架干预的受控桥梁:关系秩几何可以在Transformer隐藏状态中被检测、定位和行为验证。

英文摘要

Transformer hidden states are often interpreted through local or low-order objects: neurons, sparse features, attention heads, residual-stream directions, or activation patches. This paper studies a complementary object: the rank-indexed geometry of relations among token tuples. I use Plücker sign entropy to test whether r-argument relations leave arity-matched orientation signatures in hidden-state space. Across Llama-family 8B, 70B, and 405B checkpoints, true relation tuples show stronger orientation-sign consistency at the expected rank k=r for r=3,...,6 than scrambled tuples under matched random-control audits. Multi-template audits show that the effects survive surface variation, with all tested 405B rows retaining positive expected-rank margins and 8B/70B retaining positive rows with constructor-specific mixed cells. I then ask whether the same relation geometry can be steered. In an edge-grid clean/corrupt intervention assay over 32 prompts, the row/column scaffold and answer format stay fixed while the YES/NO relation map changes, and the corrupt hidden-state relation frame is patched toward clean or placebo targets. In 70B and 405B, clean-targeted relation-frame paths recover clean-answer behavior and residual relation geometry, while centroid-only and equal-norm controls show negligible recovery. Site/order controls further separate marker-site importance from ordered clean-frame geometry: target clean shape and cross-prompt clean shape recover behavior and residual geometry at the marker interface, whereas corrupt-donor transfer, same-site permutation/reflection, wrong-site clean deltas, centroid-only motion, and equal-norm noise fail or remain far below clean-frame paths. The result is a controlled bridge from relation probing to relation-frame intervention: relation rank geometry can be detected, targeted, and behaviorally validated in transformer hidden states.

2605.29628 2026-05-29 cs.SD cs.AI cs.CL cs.LG eess.AS 版本更新

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

COMET:音频-文本多模态对比嵌入中模态间隙的概念空间剖析

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

Affiliations: School of Artificial Intelligence, Beijing University of Posts and Telecommunications; Centre for Vision, Speech, and Signal Processing (CVSSP), University of Surrey

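AI summary: Proposes the COMET framework, which uses a PLS-SVD decomposition to show that a small number of shared-concept axes dominate similarity computation in CLAP models and that the mean component only partially captures the modality gap; a spectral truncation method then mitigates the gap without training, bringing zero-shot audio captioning close to fully supervised performance.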

详情
AI中文摘要

对比语言-音频预训练(CLAP)模型广泛用于音频理解,并在许多零样本应用中支持模态无关的条件交换。然而,其性能受到音频和文本嵌入之间模态间隙的严重影响。现有解释主要将此间隙归因于锥体效应,将其视为均值嵌入之间的偏移,但仅纠正均值只能带来有限的改进。其他假设,如信息不平衡和维度坍缩,也被提出,但仍未得到充分验证,并且在音频领域尚未被深入研究。同时,一些工作尝试将多模态对比嵌入分解为可解释的概念,但没有任何工作从概念分解的角度显式分析模态间隙。在这项工作中,我们引入了COMET(基于PLS-SVD变换的概念空间组织与模态间隙解释),这是一个新颖的用于CLAP的偏最小二乘奇异值分解(PLS-SVD)框架,揭示了模态间隙的更广泛视角。我们的框架揭示,只有一小部分可解释的轴(捕捉共享概念)对相似度计算有显著贡献,并且均值分量仅部分代表模态间隙。基于这一见解,我们提出了一种简单的谱截断方法,以无训练的方式缓解模态间隙。该方法使得零样本音频字幕通过条件交换接近全监督性能,无需大型辅助记忆库或昂贵计算。同时,它在保持检索和音频字幕任务强性能的同时,实现了显著的嵌入维度缩减。

英文摘要

Contrastive Language-Audio Pretraining (CLAP) models are widely used for audio understanding and support modality-agnostic condition swapping in many zero-shot applications. However, their performance is heavily affected by the modality gap between audio and text embeddings. Existing explanations mainly attribute this gap to the cone effect, treating it as a shift between mean embeddings, yet correcting the mean alone yields only limited improvements. Alternative hypotheses, such as information imbalance and dimensionality collapse, have also been proposed, but they remain insufficiently verified and have not been thoroughly studied in the audio domain. Meanwhile, several works attempt to decompose multimodal contrastive embeddings into interpretable concepts, but none explicitly analyze the modality gap from the perspective of concept decomposition. In this work, we introduce COMET (Concept space Organization and Modality gap Explanation with PLS-SVD Transformation), a novel partial least squares singular value decomposition (PLS-SVD) framework for CLAP that offers a broader perspective on the modality gap. Our framework reveals that only a small, interpretable subset of axes, which captures shared concepts, contributes substantially to similarity computation, and that the mean component only partially represents the modality gap. Building on this insight, we propose a simple spectral truncation method that mitigates the modality gap in a training-free manner. The method enables zero-shot audio captioning with condition swapping to approach fully supervised performance, without requiring large auxiliary memory banks or expensive computation. At the same time, it achieves substantial embedding dimensionality reduction while preserving strong performance on retrieval and audio captioning tasks.

2605.29622 2026-05-29 cs.LG physics.chem-ph 版本更新

MōLe-Λ: Learning the Coupled-Cluster Response State for Energies, Gradients, and Properties

MōLe-Λ: 学习耦合簇响应态以获取能量、梯度和性质

Andreas Burger, Luca Thiede, Abdulrahman Aldossary, Jorge A. Campos-Gonzalez-Angulo, Alex Zook, Jérôme Florian Gonthier, Alán Aspuru-Guzik

发表机构 * University of Toronto(多伦多大学) Vector Institute for Artificial Intelligence(向量人工智能研究所) NVIDIA(英伟达) Canadian Institute for Advanced Research (CIFAR)(加拿大高等研究院)

AI总结 提出MōLe-Λ模型,通过联合学习左右手振幅预测耦合簇响应态,高效计算能量、梯度及多类分子性质。

Comments ICML 2026 AI4Physics

详情
AI中文摘要

耦合簇理论常被视为量子化学的金标准,但其高计算成本限制了准确能量、力和响应性质的常规获取。虽然右手$T$-振幅决定了相关波函数,但许多实际重要的可观测量还需要左手$Λ$-振幅。我们引入MōLe-$Λ$,它是分子轨道学习(MōLe)的扩展,通过从局域化的Hartree-Fock分子轨道联合学习右手振幅$(T_1,T_2)$和左手振幅$(Λ_1,Λ_2)$,预测完整的基态耦合簇单双激发(CCSD)响应态。在架构上,MōLe-$Λ$扩展了MōLe,增加了$Λ_1$和$Λ_2$读出模块,这些模块镜像了$T_1$和$T_2$头的对称性约束,同时保留了原始的等变轨道编码器、奇符号等变解码、局域性和大小广延性。所得模型能够提供准确的CC级能量和力,同时恢复偶极矩、四极矩、极化率、电子密度以及双电子可观测量如对密度。我们表明,MōLe-$Λ$进一步扩展了MōLe相对于完整CCSD的速度优势,同时大幅扩展了可访问的性质,为相关量子化学的波函数级替代模型提供了途径。

英文摘要

Coupled-cluster (CC) theory is often considered the gold standard of quantum chemistry, but its high computational cost limits routine access to accurate energies, forces and response properties. While the right-hand $T$-amplitudes determine the correlated wavefunction, many practically important observables additionally require the left-hand $Λ$-amplitudes. We introduce MōLe-$Λ$, an extension of Molecular Orbital Learning (MōLe) that predicts the full ground-state coupled-cluster singles and doubles (CCSD) response state by jointly learning right-hand amplitudes $(T_1,T_2)$ and left-hand amplitudes $(Λ_1,Λ_2)$ from localized Hartree--Fock molecular orbitals. Architecturally, MōLe-$Λ$ extends MōLe with $Λ_1$ and $Λ_2$ readouts that mirror the symmetry constraints of the $T_1$ and $T_2$ heads, while preserving the original equivariant orbital encoder, odd sign-equivariant decoding, locality and size-extensivity. The resulting model yields accurate CC-quality energies and forces, while simultaneously recovering dipoles, quadrupoles, polarizabilities, the electron density, and 2-electron observables such as the pair density. We show that MōLe-$Λ$ further extends the speed advantage of MōLe over full CCSD while substantially expanding the accessible properties, providing a route to wavefunction-level surrogate models for correlated quantum chemistry.

2605.29610 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Learning Context-Conditioned Predicate Semantics via Prototype Feedback

通过原型反馈学习上下文条件谓词语义

NamGyu Jung, Chang Choi

发表机构 * Department of Computer Engineering, Gachon University, Seongnam, Republic of Korea(韩国城南市嘉泉大学计算机工程系)

AI总结 提出AlignG方法,利用原型反馈从图像关系候选中推断上下文条件谓词语义并调整关系表示,在VG-150和GQA-200上分别提升SGDet的F@100指标1.4和2.7。

Comments Accepted at ICML 2026. Code: https://github.com/Namgyu97/AlignG-SGG.pytorch

详情
AI中文摘要

在场景图生成中,一个核心挑战是建模多义谓词,其含义随上下文变化。先前的方法通过将谓词分解为多个静态原型或检索语义相似的示例来解决此问题。然而,这些策略保持谓词表示静态,无法重新组织语义以反映图像特定的证据,导致在模糊上下文中出现系统性混淆。我们提出AlignG,通过原型反馈学习上下文条件谓词语义。AlignG从每幅图像中的关系候选中推断上下文条件谓词语义,并将调整后的语义反馈回来以重新校准关系表示。学习目标将此适应锚定到全局语义中心,防止语义漂移,同时当场景提供一致的关系线索时仍允许选择性重组。在VG-150和GQA-200上的实验表明,在SGDet下,F@100指标分别提升了+1.4和+2.7,优于最先进的基线。我们进一步可视化每幅图像的原型相似性变化,并观察到一致的上下文相关重组,其中原型根据场景证据选择性地合并或分离谓词。代码可在https://github.com/Namgyu97/AlignG-SGG.pytorch获取。

英文摘要

In scene graph generation, a central challenge is modeling polysemous predicates whose meanings shift across contexts. Prior approaches address this issue by decomposing predicates into multiple static prototypes or retrieving semantically similar exemplars. However, these strategies keep predicate representations static and cannot reorganize semantics to reflect image-specific evidence, leading to systematic confusions in ambiguous contexts. We propose AlignG, which learns context-conditioned predicate semantics via prototype feedback. AlignG infers context-conditioned predicate semantics from the relation candidates within each image and feeds the adapted semantics back to recalibrate relation representations. The learning objective anchors this adaptation to global semantic centers, preventing semantic drift while still allowing selective reorganization when the scene provides consistent relational cues. Experiments on VG-150 and GQA-200 show consistent improvements over state-of-the-art baselines, with F@100 improvements of +1.4 on VG-150 and +2.7 on GQA-200 under SGDet. We further visualize per-image prototype similarity shifts and observe coherent context-dependent reorganization where prototypes selectively merge or separate predicates according to scene evidence. The code is available at https://github.com/Namgyu97/AlignG-SGG.pytorch.

2605.29607 2026-05-29 cs.LG 版本更新

Cluster-Level Attention-Guided Parallel Decoding for Masked Diffusion Language Models

掩码扩散语言模型的簇级注意力引导并行解码

Heqiang Qi, Wei Huang, Mingyuan Bai, Xiangming Meng

发表机构 * Zhejiang University(浙江大学) RIKEN Center for Advanced Intelligence Project(日本理化学研究所革新智能统合研究中心) The Institute of Statistical Mathematics(日本统计数理研究所) Agency for Science, Technology and Research (A*STAR)(新加坡科技研究局(A*STAR))

AI总结 提出CLAD方法,通过将相邻高置信度token聚合成簇,并利用自注意力图估计簇间依赖,实现掩码扩散语言模型的无需训练的簇级并行解码,在大多数设置下保持任务精度的同时获得1.77至8.47倍加速。

详情
AI中文摘要

掩码扩散语言模型(MDLMs)通过在每个去噪步骤预测所有掩码位置来实现并行解码,然而现有的无需训练的采样器通常以token级粒度决定提交哪些位置。我们重新审视这一粒度,并观察到可靠预测通常表现为连续的高置信度片段,这表明并行提交的单位可以大于单个token。我们首先将相邻的高置信度候选分组为置信度诱导簇(CICs),作为片段级更新单元。然后,我们利用同一次前向传递的自注意力图来估计簇间依赖关系,从而对相互兼容的CICs进行冲突感知选择并并行提交。由此得到CLAD(簇级注意力引导解码),一种用于MDLMs的无需训练的簇级解码器。在LLaDA和Dream模型系列上的四个推理和代码生成基准测试中,CLAD相比Vanilla解码实现了1.77倍至8.47倍的加速,同时在大多数设置下保持大体相当的任务精度。

英文摘要

Masked diffusion language models (MDLMs) enable parallel decoding by predicting all masked positions at each denoising step, yet existing training-free samplers usually decide which positions to commit at token-level granularity. We revisit this granularity and observe that reliable predictions often emerge as contiguous high-confidence spans, suggesting that the unit of parallel commitment can be larger than a single token. We first group adjacent high-confidence candidates into confidence-induced clusters (CICs) as span-level update units. We then use self-attention maps from the same forward pass to estimate inter-cluster dependencies, enabling conflict-aware selection of mutually compatible CICs for parallel commitment. This yields CLAD (Cluster-Level Attention-Guided Decoding), a training-free cluster-level decoder for MDLMs. Experiments on LLaDA and Dream model families across four reasoning and code-generation benchmarks show that CLAD achieves 1.77x--8.47x speedups over Vanilla decoding while maintaining broadly comparable task accuracy in most settings.
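The grouping step described above can be illustrated with a short sketch. It assumes a fixed confidence threshold and a boolean mask marking still-masked positions; the function name `confidence_clusters` and the threshold value are hypothetical, and the attention-based conflict-aware selection between clusters is deliberately left out because the abstract gives no details on it.

```python
def confidence_clusters(confidences, masked, threshold=0.9):
    """Group adjacent masked positions whose confidence >= threshold.

    Returns a list of (start, end) index pairs, with end inclusive.
    """
    clusters = []
    start = None
    for i, (c, m) in enumerate(zip(confidences, masked)):
        if m and c >= threshold:
            # Open a new span if we are not already inside one.
            if start is None:
                start = i
        else:
            # A low-confidence or already-committed position closes the span.
            if start is not None:
                clusters.append((start, i - 1))
                start = None
    if start is not None:
        clusters.append((start, len(confidences) - 1))
    return clusters

conf = [0.95, 0.97, 0.40, 0.92, 0.99, 0.93, 0.10, 0.96]
masked = [True, True, True, True, True, False, True, True]
clusters = confidence_clusters(conf, masked)
print(clusters)  # [(0, 1), (3, 4), (7, 7)]
```

Each returned span would then be a candidate unit for parallel commitment, in place of deciding position by position.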

2605.29601 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Training Deliberative Monitors for Black-Box Scheming Detection

训练审慎监控器用于黑箱策划检测

Aditya Sinha, Akshat Naik, Victor Gillioz, Simon Storf, Kilian Merkelbach, Rich Barton-Cooper, Axel Højmark, Marius Hobbhahn

发表机构 * Independent(独立) MATS Research(MATS研究) Astra Fellowship Apollo Research(Apollo研究)

AI总结 提出一种基于行动轨迹的审慎监控方法,通过蒸馏前沿模型的推理过程训练开源模型,以低成本高精度检测智能体的策划与破坏行为。

详情
AI中文摘要

随着自主智能体在执行现实任务方面变得愈发强大,区分策划行为与良性任务追求可能成为AI控制的核心问题。现有监控器通常依赖思维链访问或内部激活,或使用提示的前沿模型,这些在部署中可能不可用、不可靠或成本高昂。在本工作中,我们研究仅基于行动的审慎监控器:较小的开源权重模型,经过训练可从智能体轨迹中检测策划与破坏行为,而无需访问被监控智能体的推理或模型内部。我们的方法受审慎对齐启发,使用策划规范从前沿教师模型中引出结构化推理,通过独立的评判器进行过滤,并通过监督微调和强化学习将最高质量的推理蒸馏到开源权重监控器中。我们在五个数据集上训练,并在六个分布外智能体失调基准上评估。我们表明,将我们的方法应用于Qwen3.5-27B后,其性能优于所有作为提示监控器的低成本前沿模型(Gemini 3.1 Flash-Lite、GPT-5.4 Nano和Claude Haiku 4.5)以及Gemini 2.5 Pro,同时实现了更低的边际推理成本(每1000次评估按token计费的美元成本)。更强的提示前沿监控器(Gemini 3.1 Pro、GPT-5.4、Claude Sonnet 4.6和Claude Opus 4.6)实现了更高的性能,但边际推理成本大约高出16至34倍。我们训练的多个监控器在我们评估的监控器中位于经验成本-性能帕累托前沿,为提示前沿模型提供了实用的低成本、低误报率替代方案。

英文摘要

As autonomous agents become more capable of performing real-world tasks, distinguishing scheming behavior from benign task pursuit may become a central AI control problem. Existing monitors often rely on chain-of-thought access or internal activations, or use prompted frontier models, all of which can be unavailable, unreliable or expensive in deployment. In this work, we study action-only deliberative monitors: smaller open-weight models trained to detect scheming and sabotage from agentic trajectories without accessing the monitored agent's reasoning or model internals. Our method, inspired by deliberative alignment, uses a scheming specification to elicit structured rationales from a frontier teacher, filters them with a separate judge, and distills the highest-quality rationales into open-weight monitors with supervised fine-tuning and reinforcement learning. We train on five datasets, and evaluate across six out-of-distribution agentic misalignment benchmarks. We show that applying our method to Qwen3.5-27B yields higher performance than all low-cost frontier models as prompted monitors (Gemini 3.1 Flash-Lite, GPT-5.4 Nano, and Claude Haiku 4.5) and than Gemini 2.5 Pro, while also achieving lower marginal inference cost (token-metered USD per 1,000 evaluations). Stronger prompted frontier monitors (Gemini 3.1 Pro, GPT-5.4, Claude Sonnet 4.6, and Claude Opus 4.6) achieve higher performance but at roughly $16$--$34\times$ higher marginal inference cost. Several of our trained monitors are positioned on the empirical cost--performance Pareto frontier among the monitors we evaluate, providing practical low-cost, low-FPR alternatives to prompted frontier models.
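To make the notion of a cost-performance Pareto frontier concrete, here is a small sketch. The monitor names and all numbers are invented for illustration and do not come from the paper; `pareto_frontier` is a hypothetical helper.

```python
def pareto_frontier(monitors):
    """Return the names of monitors not dominated on (cost, performance).

    monitors: dict name -> (cost, performance); lower cost and higher
    performance are better. A monitor is dominated if another one is at
    least as good on both axes and strictly better on at least one.
    """
    frontier = []
    for name, (cost, perf) in monitors.items():
        dominated = any(
            (c <= cost and p >= perf) and (c < cost or p > perf)
            for other, (c, p) in monitors.items()
            if other != name
        )
        if not dominated:
            frontier.append(name)
    return sorted(frontier)

# Made-up (cost, performance) pairs, NOT results from the paper.
toy = {
    "small_trained": (1.0, 0.80),
    "cheap_prompted": (1.5, 0.70),
    "mid_prompted": (8.0, 0.78),
    "large_prompted": (20.0, 0.90),
}
frontier = pareto_frontier(toy)
print(frontier)  # ['large_prompted', 'small_trained']
```

In this toy setting a cheap trained monitor and an expensive strong monitor both sit on the frontier, which mirrors the kind of trade-off the abstract describes without reproducing its figures.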

2605.29587 2026-05-29 q-bio.QM cs.LG 版本更新

FPLIER: Federated Pathway-Level Information Extractor

FPLIER:联邦通路级信息提取器

Daniele Malpetti, Christian Berchtold, Francesco Gualdi, Marco Scutari, Laura Azzimonti, Francesca Mangili

发表机构 * Dalle Molle Institute for Artificial Intelligence (IDSIA)(达勒莫尔人工智能研究所(IDSIA)) USI-SUPSI Swiss Institute of Bioinformatics (SIB)(瑞士生物信息学研究所(SIB))

AI总结 提出联邦学习框架FPLIER,通过安全聚合实现分布式基因表达数据上的通路级因子分解,并证明隐私风险由训练表达矩阵的秩决定。

Comments Accepted for publication at the ACM BCB '26 conference

详情
AI中文摘要

在转录组学中,通路级信息提取器(PLIER)等基因集感知因子分解方法在大型异质性表达数据集上训练时效果最佳。然而,由于隐私和治理限制,许多临床相关队列无法合并为单个数据集。我们提出FPLIER,这是PLIER的联邦扩展,能够在多个数据持有者之间进行分布式训练,同时整合公开可用数据集。通过安全聚合,FPLIER产生的训练更新在代数上等价于集中式池化数据方法,同时保持表达数据的本地性。我们在两个模拟联盟(来自K-CLIER和MultiPLIER研究)的多个场景中评估FPLIER,并展示其稳定收敛。我们进一步对针对中间训练统计量和发布模型的成员推断攻击进行了系统分析。结果表明,隐私风险由训练表达矩阵的秩决定。整合公开数据或降低数据维度会增加该秩,使系统趋向满秩状态,在此状态下训练样本与非训练样本对攻击者而言难以区分,成员推断性能接近随机猜测。

英文摘要

In transcriptomics, gene-set-aware factorization methods such as the Pathway Level Information Extractor (PLIER) are most effective when trained on large, heterogeneous expression compendia. Yet, many clinically relevant cohorts cannot be pooled into a single dataset due to privacy and governance constraints. We present FPLIER, a federated extension of PLIER that enables distributed training across multiple data holders while incorporating publicly available datasets. Through secure aggregation, FPLIER produces training updates algebraically equivalent to those of a centralized pooled-data approach while keeping expression data local. We evaluate FPLIER across multiple scenarios in two simulated consortia (from the K-CLIER and MultiPLIER studies) and demonstrate stable convergence. We further conduct a systematic analysis of membership inference attacks targeting both intermediate training statistics and the released model. Our results show that privacy risk is governed by the rank of the training expression matrix. Incorporating public data or reducing data dimensionality increases this rank, moving the system toward a full-rank regime in which training and non-training samples become indistinguishable to the attacker, and membership-inference performance approaches random guessing.
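The abstract states only that secure aggregation is used, not which protocol. As a hedged illustration of why the aggregated update can equal the pooled-data one while individual contributions stay hidden, the sketch below uses pairwise additive masking, one common secure-aggregation scheme; this choice, the helper names, and the toy integer statistics are all assumptions rather than FPLIER's actual design.

```python
import random

def masked_updates(local_values, seed=0, modulus=2**31):
    """Pairwise additive masking: client i adds r_ij, client j subtracts it.

    local_values: list of equal-length integer vectors, one per client.
    Returns the masked vectors each client would send to the server.
    """
    rng = random.Random(seed)
    n = len(local_values)
    dim = len(local_values[0])
    masked = [list(v) for v in local_values]
    for i in range(n):
        for j in range(i + 1, n):
            # Mask shared by the pair (i, j); in practice derived from a shared key.
            mask = [rng.randrange(modulus) for _ in range(dim)]
            for k in range(dim):
                masked[i][k] = (masked[i][k] + mask[k]) % modulus
                masked[j][k] = (masked[j][k] - mask[k]) % modulus
    return masked

def aggregate(vectors, modulus=2**31):
    """Server-side sum; the pairwise masks cancel modulo `modulus`."""
    dim = len(vectors[0])
    return [sum(v[k] for v in vectors) % modulus for k in range(dim)]

clients = [[3, 1, 4], [1, 5, 9], [2, 6, 5]]  # toy local sufficient statistics
masked = masked_updates(clients)
pooled = aggregate(clients)   # what a centralized approach would compute
secure = aggregate(masked)    # what the server computes from masked inputs
print(secure)  # [6, 12, 18]
```

Because every mask is added once and subtracted once, the server recovers exactly the pooled sum, which is the sense in which a federated update can be algebraically equivalent to a centralized one.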

2605.28711 2026-05-29 cs.LG 版本更新

Stage-wise Distortion-Perception Traversal in Zero-shot Inverse Problems with Diffusion Models

基于扩散模型的零样本逆问题中的逐阶段失真-感知遍历

Jiawei Zhang, Ziyuan Liu, Leon Yan, Zhenyu Xiao, Yuantao Gu

发表机构 * Shenzhen International Graduate School, Tsinghua University, Shenzhen, China(清华大学深圳国际研究生院) Department of Electronic Engineering, Tsinghua University, Beijing, China(清华大学电子工程系)

AI总结 提出一种逐阶段框架MAP-RPS,通过MAP估计和重噪声后验采样实现单扩散模型下的失真-感知权衡遍历,并扩展至潜空间LMAP-RPS以提升适用性。

Comments Accepted by ICML 2026

详情
AI中文摘要

失真-感知(D-P)权衡是贝叶斯逆问题的一个基本现象,它刻画了失真性能与感知质量之间的内在矛盾。在推理时实现D-P权衡的灵活遍历对实际应用至关重要。尽管扩散模型在零样本逆问题求解中取得了近期成功,但在基于扩散的逆算法中实现D-P遍历的高效且原则性策略仍缺乏充分刻画。本文提出一种逐阶段框架,利用单个扩散模型在零样本逆问题中实现D-P遍历。我们提出的方法称为MAP-RPS,首先进行MAP估计阶段,近似MMSE解并提供低失真初始化,随后进行重噪声后验采样阶段,逐步提升感知质量。我们对两个阶段进行了理论分析,验证了所提设计的有效性和正确性。此外,我们将MAP-RPS扩展到潜空间,得到LMAP-RPS,通过利用大规模预训练潜扩散骨干网络,具有更广泛的适用性。大量实验表明,MAP-RPS和LMAP-RPS在各种任务上实现了更有效的D-P遍历,同时作为实际逆问题的高效求解器也表现出强劲性能。

英文摘要

The distortion-perception (D-P) tradeoff is a fundamental phenomenon of Bayesian inverse problems, which characterizes the inherent tension between distortion performance and perceptual quality. Enabling flexible traversal of the D-P tradeoff at inference time is crucial for practical applications. Despite the recent success of diffusion models in zero-shot inverse problem solving, efficient and principled strategies for D-P traversal in diffusion-based inverse algorithms remain inadequately characterized. In this paper, we propose a stage-wise framework for realizing D-P traversal using a single diffusion model in zero-shot inverse problems. Our proposed method, termed MAP-RPS, starts with an MAP estimation stage that approximates the MMSE solution and provides a low-distortion initialization, followed by a re-noised posterior sampling stage that progressively improves perceptual quality. We provide theoretical analyses for both stages, establishing the validity and effectiveness of the proposed design. Furthermore, we extend MAP-RPS to the latent space, yielding LMAP-RPS, which enjoys broader applicability by leveraging large-scale pre-trained latent diffusion backbones. Extensive experiments demonstrate that MAP-RPS and LMAP-RPS enable more effective D-P traversal on various tasks, while also exhibiting strong performance as efficient solvers for real-world inverse problems.

2605.28418 2026-05-29 cs.LG 版本更新

Revisiting Metafeatures to Explain Model Differences on Tabular Data

重新审视元特征以解释表格数据上的模型差异

Markus Herre, Andrej Tschalzev, Sascha Marton, Christian Bartelt

发表机构 * Clausthal University of Technology, Clausthal-Zellerfeld, Germany(Clausthal技术大学,Clausthal-Zellerfeld,德国) University of Mannheim, Mannheim, Germany(曼海姆大学,曼海姆,德国)

AI总结 研究通过严格统计检验和留一法分析,发现数据集元特征无法稳健解释表格数据上不同模型族(如神经网络与树模型、非基础模型与基础模型)之间的性能差异。

详情
AI中文摘要

随着表格基础模型的兴起以及传统模型在许多任务上仍表现良好,为表格数据集选择合适模型仍然困难。我们研究数据集元特征是否能解释表格预测任务中模型族之间的性能差距。利用TabArena基准结果,我们分析数据集级别的性能差距,并将其与模型无关的数据集描述符相关联。经过严格统计检验并控制错误发现率后,我们发现:(1) 对于神经网络与树模型的差距,没有元特征能通过错误发现率控制;(2) 对于非基础模型与基础模型的差距,一个关联是稳健的,但在留一数据集预测测试中不能泛化;(3) 对于TabICLv2与TabPFN-2.6,一个稳健关联也改善了留出预测。此外,我们进行了留一数据集分析,发现元特征预测器未能比简单基线有实质性改进。总体而言,我们的结果显示了表格数据集的异质性,并且全局元特征方法不够稳健,无法对51个TabArena数据集提供解释。

英文摘要

With the rise of tabular foundation models alongside traditional models still performing well on many tasks, choosing the right model for a tabular dataset remains difficult. We investigate whether dataset meta-features can explain performance gaps between model families on tabular prediction tasks. Using the TabArena benchmark results, we analyze dataset-level performance gaps and relate them to model-agnostic dataset descriptors. After strict statistical tests with false discovery control, we find that (1) for neural network vs. tree gaps, no meta-feature survives false discovery control, (2) for non-foundation vs. foundation model gaps, one association is robust but does not generalize when tested in leave-one-dataset-out prediction, and (3) for TabICLv2 vs. TabPFN-2.6, one robust association also improves held-out prediction. Furthermore, we conduct a leave-one-dataset-out analysis and find that meta-feature predictors fail to improve meaningfully over a simple baseline. Overall, our results show the heterogeneity of tabular datasets and that global meta-feature approaches are not robust enough to offer explanations on the 51 TabArena datasets.
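The abstract mentions false discovery control without naming the procedure. Assuming the widely used Benjamini-Hochberg step-up procedure (an assumption, not a detail confirmed by the source), the filtering of meta-feature associations could look like the sketch below; the p-values are invented.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k (1-based) with p_(k) <= k / m * alpha.
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            k_max = rank
    # Reject every hypothesis whose rank is at most k_max.
    rejected = [False] * m
    for rank, idx in enumerate(order, start=1):
        if rank <= k_max:
            rejected[idx] = True
    return rejected

# Made-up p-values for five hypothetical meta-feature associations.
p = [0.001, 0.045, 0.02, 0.20, 0.008]
rejected = benjamini_hochberg(p)
print(rejected)  # [True, False, True, False, True]
```

Note that the association with p = 0.045 would pass an uncorrected 0.05 threshold but does not survive the correction, which is the kind of attrition the abstract reports for most meta-features.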

2605.28368 2026-05-29 cs.LG cond-mat.mtrl-sci physics.app-ph 版本更新

LEIA: Learned Environment for Interactive Architected Materials

LEIA: 用于交互式架构材料的学习环境

Haiqian Yang, Yuan Cao, Markus J. Buehler

发表机构 * Unreasonable Labs

AI总结 提出LEIA世界模型,通过逐步施加边界条件并实时观察变形和应力场,支持工程师交互式探索架构材料,并实现快速代理引导的候选生成与排序。

Comments 22 pages, 10 figures

详情
AI中文摘要

世界模型已经实现了游戏环境和机器人操作的交互式探索,但物理工程仍然超出其能力范围:真实材料表现出非线性本构定律、携带历史依赖的内部状态、经历惯性动力学,并且可能具有跨越多个长度尺度的层次结构。我们提出了LEIA(用于交互式架构材料的学习环境),这是一个世界模型,允许工程师逐步施加边界条件并实时观察由此产生的变形和应力场。LEIA处理大型三维非结构化网格,并对用户指定的加载生成自回归响应。我们引入了MicroPlate,这是一个架构板的基准测试,涵盖微观结构建模的两种模式:通过三维几何显式解析微观结构的架构晶格,以及通过内部自由度隐式建模微观结构变化的均质板。MicroPlate用于评估LEIA以及两种模式下的四种基线方法。最后,我们证明LEIA能够实现高效的候选生成和排序,用于快速的代理模型引导的架构材料全新(de novo)设计搜索,并以有限元真值验证了候选排序在应力上的准确性。

英文摘要

World models have enabled interactive exploration of game environments and robotic manipulation, but physical engineering remains beyond their reach: real materials exhibit nonlinear constitutive laws, carry history-dependent internal state, undergo inertial dynamics, and may possess hierarchical structures spanning multiple length scales. We present LEIA (Learned Environment for Interactive Architected materials), a world model that lets engineers apply boundary conditions step by step and observe the resulting deformation and stress fields in real time. LEIA handles large three-dimensional unstructured meshes and generates autoregressive responses to user-specified loading. We introduce MicroPlate, a benchmark of architected plates spanning two regimes of microstructure modeling: architected lattices that resolve microstructure explicitly through three-dimensional geometry, and a homogeneous plate where microstructural change is modeled implicitly through internal degrees of freedom. MicroPlate is used to assess LEIA alongside four baseline methods across both regimes. Finally, we demonstrate that LEIA enables efficient candidate generation and ranking for fast surrogate-guided search for de novo designs of architected materials, with stress-accurate candidate ranking validated by finite element ground truth.

2605.28327 2026-05-29 stat.ML cs.LG q-fin.RM stat.AP 版本更新

Insurance Pricing Optimization via Off-Policy Evaluation

通过离线策略评估进行保险定价优化

Sascha Günther, Dimitri Semenovich, Mario V. Wüthrich

发表机构 * Department of Mathematics, ETH Zurich(苏黎世联邦理工学院数学系)

AI总结 本文提出基于离线策略评估和随机控制的保险定价方法,利用核化逆倾向得分估计器降低方差,并通过数据共享Lasso和神经网络两种策略优化方法实现最优定价。

详情
AI中文摘要

传统保险定价依赖于基于风险的原则,确保精算公平和偿付能力,但未明确考虑投保人的价格敏感性。我们将保险定价表述为一个决策问题,并使用离线策略评估和随机控制的工具进行研究。我们提出了一种核化逆倾向得分估计器,该估计器利用动作空间中的局部结构,与经典逆倾向得分估计器相比实现了方差减少。基于这些价值估计,我们研究了策略优化,并提出了两种计算最优定价规则的实用方法:一种可解释的数据共享Lasso公式和一种基于神经网络的灵活策略参数化。通过使用受控的合成旅行保险环境,我们实证验证了理论结果,并表明神经网络在策略优化方面优于现有技术。

英文摘要

Traditional insurance pricing relies on risk-based principles that ensure actuarial fairness and solvency but do not explicitly account for policyholders' price sensitivity. We formulate insurance pricing as a decision-making problem and study it using tools from off-policy evaluation and stochastic control. We propose a kernelized inverse propensity score estimator that exploits local structure in the action space and yields variance reduction compared to the classical inverse propensity score estimator. Building on these value estimates, we investigate policy optimization and present two practical approaches for computing optimal pricing rules: an interpretable data-shared Lasso formulation and a flexible policy parameterization based on neural networks. Using a controlled synthetic travel insurance environment, we empirically confirm the theoretical results and show that neural networks outperform existing techniques for policy optimization.
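To illustrate why kernel smoothing helps with continuous actions such as prices, the sketch below implements a generic kernelized inverse propensity score estimate with a Gaussian kernel. The kernel choice, the bandwidth, the uniform logging policy, and the toy reward `a * (1 - a)` are assumptions made for illustration; the paper's estimator may differ in its details.

```python
import math

def gaussian_kernel(u):
    """Standard normal density, used as the smoothing kernel."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)

def kernelized_ips(actions, rewards, propensities, target_actions, bandwidth):
    """Kernel-smoothed inverse propensity score estimate of a policy's value.

    actions: logged prices; propensities: logging density at each logged price;
    target_actions: prices the evaluated policy would have chosen.
    """
    total = 0.0
    for a, r, p, t in zip(actions, rewards, propensities, target_actions):
        # Logged samples close to the target price receive a large weight.
        weight = gaussian_kernel((a - t) / bandwidth) / (bandwidth * p)
        total += weight * r
    return total / len(actions)

# Toy log: prices on a uniform grid over [0, 1] (logging density 1),
# deterministic reward a * (1 - a), target policy always quotes 0.5.
n = 1000
actions = [(i + 0.5) / n for i in range(n)]
rewards = [a * (1.0 - a) for a in actions]
propensities = [1.0] * n
targets = [0.5] * n
estimate = kernelized_ips(actions, rewards, propensities, targets, bandwidth=0.05)
```

With an exact-match indicator in place of the kernel, no logged price equals 0.5 and the classical estimate would be zero here; the smoothed version stays close to the true value 0.25 at the cost of a small bias that grows with the bandwidth.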

2605.27809 2026-05-29 cs.LG cs.CR 版本更新

Density-aware Sample-specific Attack

密度感知的样本特定攻击

Qiyuan Wang, Yao Li, Raymond K. W. Wong

发表机构 * Texas A&M University(德克萨斯A&M大学) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出一种通过将触发样本引导至干净数据分布的低密度区域来优化后门攻击的双层优化方法,在微调和剪枝防御下均保持高攻击成功率。

详情
AI中文摘要

尽管后门攻击近期取得进展,现有方法仍易受到训练后防御(如微调或剪枝)的影响,这些防御会擦除后门。我们重新审视后门攻击的核心目标,并在受害者训练的贝叶斯最优模型下推导出刻画最优样本特定触发器构建的原则性准则。我们的分析表明,当触发样本被引导至干净数据分布的低密度区域时,攻击成功率和干净准确率保持同时达到最优,这种分布条件一次性控制中毒分布的所有矩,而非少量输入空间汇总统计量。我们引入一个双层优化框架,通过条件时间分数匹配估计密度比,并优化混合模型目标以将触发样本放置在这些稀疏区域。在MNIST、CIFAR-10、GTSRB和TinyImageNet上的广泛评估表明,我们的方法在防御前达到99%以上的攻击成功率,并且在微调防御下,防御后的ASR比最强基线高出50-85个百分点。针对神经元剪枝防御,该方法表现出完全免疫性,在所有剪枝阈值下均未识别出任何需要移除的神经元。这些结果暴露了当前防御范式的根本缺陷,并强调了需要超越干净分布支持域进行防御的必要性。

英文摘要

Despite recent progress in backdoor attacks, existing methods remain susceptible to post-training defenses that erase the backdoor through fine-tuning or pruning. We revisit the core objectives of backdoor attacks and derive principled criteria characterizing optimal sample-specific trigger construction under a Bayes-optimal model of the victim's training. Our analysis reveals that both attack success and clean-accuracy preservation are simultaneously optimized when triggered samples are steered into low-density regions of the clean data distribution, a distributional condition that controls all moments of the poisoned distribution at once rather than a handful of input-space summary statistics. We introduce a bilevel optimization framework that estimates density ratios via conditional time-score matching and optimizes a mixture-model objective to place triggered samples in these sparse regions. Extensive evaluations on MNIST, CIFAR-10, GTSRB, and TinyImageNet demonstrate that our method achieves above 99% attack success rate before defense and retains 50--85 percentage points higher post-defense ASR than the strongest baselines under fine-tuning defenses. Against neuron-pruning defenses, the method exhibits complete immunity, with zero neurons identified for removal across all pruning thresholds. These results expose a fundamental gap in current defense paradigms and underscore the need for defenses that operate beyond the support of the clean distribution.

2605.27696 2026-05-29 cs.CV cs.LG 版本更新

Structure over Pixels: Learning Variable-Length Visual Programs

结构优于像素:学习可变长度视觉程序

Piotr Wyrwiński, Kacper Dobek, Krzysztof Krawiec

发表机构 * Institute of Computing Science(计算科学研究所) Poznan University of Technology(波兹南工业大学)

AI总结 提出STROP离散视觉分词器架构,通过基于DINOv3特征的局部率失真监督学习可变长度视觉程序,以结构表示替代像素重建。

详情
AI中文摘要

离散视觉分词器将图像转换为有序的代码序列,为场景的结构描述提供了自然表示。然而,现有的自适应分词器要么需要事后搜索,要么在预训练速率的离散集合中进行选择,而不是学习与模型和场景耦合的连续每图像序列长度,并且它们通常针对像素重建进行训练,强调纹理而非结构。我们提出STROP,一种离散视觉分词器架构,形成结构场景表示并同时学习图像的视觉程序应该有多长。使用由冻结的DINOv3特征的局部率失真探针监督的四阶段课程,STROP优化了一个专门的长度头,在单次前向传递中估计活动前缀长度。通过绕过像素级重建梯度,码本完全由高层潜在表示的质量塑造。程序长度随场景复杂性增长,组合结构的迹象出现在下游密集预测迁移和对学习代码词汇的直接检查中。

英文摘要

Discrete visual tokenizers translate images into ordered sequences of codes, providing a natural representation for structural description of scenes. Yet existing adaptive tokenizers either require post-hoc search or select among a discrete set of pre-trained rates, rather than learning a continuous per-image sequence length coupled to the model and scene, and they typically train against pixel reconstruction, emphasizing texture rather than structure. We propose STROP, a discrete visual tokenizer architecture that forms structural scene representations and simultaneously learns how long an image's visual program should be. Using a four-phase curriculum supervised by local rate--distortion probes against frozen DINOv3 features, STROP optimizes a dedicated length head that estimates the active prefix length in a single forward pass. By bypassing pixel-level reconstruction gradients, the codebook is shaped entirely by the quality of higher-level latent representations. Program length grows with scene complexity, and signs of compositional structure emerge both in downstream dense-prediction transfer and in direct inspection of the learned code vocabulary.

2605.27078 2026-05-29 cs.LG cs.AI 版本更新

Two Speeds of Learning: A Representation-Readout Decomposition of Grokking and Double Descent

两种学习速度:Grokking 和双下降的表征-读出分解

Chi-Ning Chou, Oscar Uzdelewicz, Neng-Chun Chiu, Yao-Yuan Yang, SueYeon Chung

发表机构 * Center for Computational Neuroscience(计算神经科学中心) Flatiron Institute(Flatiron研究所) Department of Physics(物理系) Harvard University(哈佛大学) Kempner Institute(Kempner研究所) New York University(纽约大学)

AI总结 通过将学习动态分解为编码器中的表征学习和最终分类器中的读出校准两个竞争过程,解释了 grokking 和 epoch-wise 双下降现象,并提供了诊断虚假泛化的框架。

详情
AI中文摘要

训练损失和准确率是用于监控深度神经网络训练过程中泛化性能的标准信号。两个有据可查的现象使这一图景复杂化:在 grokking 中,训练损失迅速下降,而测试性能仅在长时间延迟后突然提升;在 epoch-wise 双下降中,训练损失单调下降,而测试损失或误差先升后降。现有解释通常针对特定任务,缺乏一个任务无关的分析框架来诊断和解释这些现象在现实任务和架构中的表现。我们通过分析学习动态背后的两个竞争过程来应对这一挑战:编码器中的表征学习和最终分类器中的读出校准。利用表征几何、神经正切核和线性探测等工具,我们表明这两个过程在整个训练过程中都是活跃的,它们相对速度的波动导致了看似异常的泛化动态。将表征-读出分解应用于各种任务和架构中的 grokking,我们发现读出在 grokking 发生前偏向训练集,而表征学习是渐进但并非缺失的,这与“从懒惰到丰富”的解释相反。该框架进一步提供了区分虚假泛化和真实泛化的诊断特征:在先前报告的 MNIST grokking 示例和 epoch-wise 双下降示例中,看似延迟或非单调的泛化是由非标准训练配方导致的表征退化和读出失调引起的。总之,这些结果确立了表征-读出分解作为一个自上而下的框架,用于理解学习动态并揭示可解释性研究的基础算法。

英文摘要

Training loss and accuracy are the standard signals used to monitor generalization during deep neural network training. Two well-documented phenomena complicate this picture: in grokking, train loss falls rapidly while test performance improves abruptly only after a long delay; in epoch-wise double descent, train loss decreases monotonically while test loss or error rises and falls. Existing accounts are often task-specific, and a task-agnostic analysis framework for diagnosing and explaining these phenomena across realistic tasks and architectures is missing. We address this challenge by analyzing two competing processes that underlie learning dynamics: representation learning in the encoder and readout calibration in the final classifier. Using tools from representational geometry, neural tangent kernels, and linear probing, we show that both processes are active throughout training, with the fluctuations of their relative speed giving rise to seemingly anomalous generalization dynamics. Applying the representation-readout decomposition to grokking across a wide range of tasks and architectures, we find that the readout is train-biased before grokking onset, and representation learning is gradual but not absent, contrary to the lazy-to-rich account. The framework further provides diagnostic signatures distinguishing spurious from genuine generalization: in a previously reported MNIST grokking example and an epoch-wise double descent example, apparent delayed or non-monotone generalization is shown to arise from representation degradation and readout misalignment induced by non-standard training recipes. Together, these results establish the representation-readout decomposition as a top-down framework for understanding learning dynamics and revealing underlying algorithms for interpretability research.

2605.26756 2026-05-29 cs.LG 版本更新

Localizing Memorized Regions in Diffusion Models via Coordinate-Wise Curvature Differences

通过坐标曲率差异定位扩散模型中的记忆区域

Gwangho Kim, Sungyoon Lee

发表机构 * Department of Computer Science, Hanyang University, Seoul, South Korea(韩国首尔汉阳大学计算机科学系)

AI总结 本文提出基于坐标曲率差异的方法,通过减去欠拟合基线的曲率来隔离过拟合驱动的记忆,从而在扩散模型中定位记忆区域,并在Stable Diffusion上优于先前方法。

Comments ICML 2026

详情
AI中文摘要

扩散模型可能无意中记忆训练样本,引发隐私和版权问题。虽然近期方法可以检测记忆,但它们通常依赖全局或模型特定信号,并且对记忆出现在生成图像中的位置提供的洞察有限。我们提供了局部记忆的几何表征,即坐标方差坍缩。然而,这种坍缩也可能源于内在数据约束而非过拟合。为了隔离过拟合驱动的记忆,我们提出了曲率差异方法,减去欠拟合基线(无条件模型或其训练较少的版本)的曲率。我们进一步推导了一个分数差异代理,为广泛使用的基于分数差异的检测指标提供了几何解释。在Stable Diffusion上的实验,针对真实记忆掩码进行评估,表明我们的方法优于先前的基于注意力的定位方法。代码可在 https://github.com/Gwangho99/mem-curv-diff 获取。

英文摘要

Diffusion models can unintentionally memorize training samples, raising concerns about privacy and copyright. While recent methods can detect memorization, they often rely on global or model-specific signals and provide limited insight into where memorization appears within a generated image. We provide a geometric characterization of local memorization as a coordinate-wise variance collapse. However, such collapse can also arise from intrinsic data constraints rather than overfitting. To isolate overfitting-driven memorization, we propose curvature-difference methods that subtract the curvature of an underfitted baseline, either the unconditional model or a less-trained version of itself. We further derive a score-difference proxy that provides a geometric explanation for the widely used score-difference-based detection metric. Experiments on Stable Diffusion, evaluated against ground-truth memorization masks, show that our method outperforms the prior attention-based localization method. Code is available at https://github.com/Gwangho99/mem-curv-diff.

2605.26156 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Turning Bias into Bugs: Bandit-Guided Style Manipulation Attacks on LLM Judges

将偏见转化为漏洞:基于Bandit引导的LLM裁判风格操纵攻击

Xianglin Yang, Bryan Hooi, Gelei Deng, Tianwei Zhang, Jin Song Dong

发表机构 * School of Computing, National University of Singapore, Singapore(新加坡国立大学计算机学院) Nanyang Technological University, Singapore(南洋理工大学)

AI总结 提出BITE黑盒对抗框架,将风格编辑选择建模为上下文Bandit问题,通过LinUCB策略自适应选择编辑以误导LLM裁判并人为提高评分,攻击成功率超65%。

Comments Accepted to the Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

已知LLM裁判中的风格偏见,例如对冗长或特定句子结构的偏好,构成了一个未被充分探索的安全漏洞。在这项工作中,我们引入了BITE(偏见探索与利用),一个黑盒对抗框架,学习保持语义的编辑以误导LLM裁判并人为提高其分配的分数。我们将风格编辑的选择建模为上下文Bandit问题,并使用LinUCB策略自适应地选择编辑,以最大化裁判的分数,而无需访问模型参数或梯度。实验上,我们在多种LLM裁判和任务上测试了BITE,包括聊天机器人排行榜和AI审稿人基准上的逐点和成对比较。BITE实现了超过65%的攻击成功率,并在9分制上将分数提高了1-2分,同时保持了语义等价性。我们进一步评估了攻击的隐蔽性,表明BITE规避了标准的风格控制方法和几种检测基线。我们的发现暴露了LLM作为裁判范式的一个根本弱点,并激励了鲁棒的、对抗感知的评估。我们的代码可在https://github.com/xianglinyang/llm-as-a-judge-attack获取。

英文摘要

The known stylistic biases in LLM judges, such as a preference for verbosity or specific sentence structures, present an underexplored security vulnerability. In this work, we introduce BITE (BIas exploraTion and Exploitation), a black-box adversarial framework that learns semantics-preserving edits to mislead an LLM judge and artificially inflate the scores it assigns. We cast the selection of stylistic edits as a contextual bandit problem and use a LinUCB policy to adaptively choose edits that maximize the judge's score without access to model parameters or gradients. Empirically, we test BITE across a diverse range of LLM judges and tasks, including both pointwise and pairwise comparisons on chatbot leaderboards and AI-reviewer benchmarks. BITE achieves an attack success rate exceeding 65% and raises scores by 1-2 points on a 9-point scale, all while preserving semantic equivalence. We further assess the attack's stealthiness, showing that BITE evades standard style-control methods and several detection baselines. Our findings expose a fundamental weakness in the LLM-as-a-judge paradigm and motivate robust, attack-aware evaluation. Our code is available at https://github.com/xianglinyang/llm-as-a-judge-attack.
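For readers unfamiliar with LinUCB, the following is a minimal textbook-style sketch of the disjoint variant on a toy problem with one-hot contexts and deterministic rewards. It is not the BITE implementation: the contexts, rewards, `alpha`, and all names are hypothetical, and no judge model or text editing is involved.

```python
import math

class LinUCBArm:
    """One arm of disjoint LinUCB with a ridge-regularised linear reward model."""

    def __init__(self, dim):
        # A_inv starts as the identity (ridge parameter 1); b starts at zero.
        self.A_inv = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]
        self.b = [0.0] * dim

    def _matvec(self, x):
        return [sum(row[j] * x[j] for j in range(len(x))) for row in self.A_inv]

    def ucb(self, x, alpha):
        Ax = self._matvec(x)            # A^{-1} x
        theta = self._matvec(self.b)    # ridge estimate A^{-1} b
        mean = sum(t * xi for t, xi in zip(theta, x))
        bonus = alpha * math.sqrt(sum(a * xi for a, xi in zip(Ax, x)))
        return mean + bonus

    def update(self, x, reward):
        # Sherman-Morrison update of (A + x x^T)^{-1}; A_inv is symmetric.
        Ax = self._matvec(x)
        denom = 1.0 + sum(a * xi for a, xi in zip(Ax, x))
        n = len(x)
        for i in range(n):
            for j in range(n):
                self.A_inv[i][j] -= Ax[i] * Ax[j] / denom
        for i in range(n):
            self.b[i] += reward * x[i]

def choose(arms, x, alpha):
    """Pick the arm with the highest upper confidence bound (first on ties)."""
    return max(range(len(arms)), key=lambda k: arms[k].ucb(x, alpha))

# Toy environment: two one-hot contexts, two arms, deterministic rewards.
contexts = [[1.0, 0.0], [0.0, 1.0]]
true_reward = [[0.9, 0.1],   # arm 0: reward in context 0, context 1
               [0.2, 0.8]]   # arm 1: reward in context 0, context 1
arms = [LinUCBArm(dim=2) for _ in range(2)]
for t in range(80):
    c = t % 2
    k = choose(arms, contexts[c], alpha=1.0)
    arms[k].update(contexts[c], true_reward[k][c])

# Greedy choices (alpha = 0) after learning.
best = [choose(arms, x, alpha=0.0) for x in contexts]
print(best)  # [0, 1]
```

The exploration bonus shrinks as an arm accumulates observations in a given context, so the policy gradually concentrates on whichever action has the higher estimated reward there.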

2605.26064 2026-05-29 cs.CV cs.LG 版本更新

Paris 2.0: A Decentralized Diffusion Model for Video Generation

Paris 2.0: 一种去中心化的视频生成扩散模型

Ali Rouzbayani, Bidhan Roy, Marcos Villagra, Zhiying Jiang

AI总结 本文提出Paris 2.0,首个通过去中心化计算预训练的视频生成模型,基于Paris 1.0的扩散模型框架,在低分辨率文本到视频任务中相比集中式模型将FVD从561.04降至279.01,提升约2倍,并提高了CLIP文本-视频相似度和美学评分。

Comments 6 pages, 5 figures

详情
AI中文摘要

我们提出了Paris 2.0,这是首个通过去中心化计算预训练的视频生成模型。其训练方案建立在Paris 1.0(arXiv:2510.03434)的基础上,后者是首个开源权重的去中心化扩散模型(DDM),证明了图像生成可以在没有单一GPU集群的情况下进行训练。然而,时间上连贯的视频生成在去中心化训练下一直是一个未解决的问题,而Paris 2.0解决了这个问题。在低分辨率文本到视频训练中,与在相同数据上以匹配的总计算预算训练的集中式模型相比,Paris 2.0将Frechet视频距离(FVD)从561.04降至279.01,提升了约2.0倍,并提高了CLIP文本-视频相似度和美学评分。

英文摘要

We present Paris 2.0, the first video generation model pre-trained through decentralized computation. Its training recipe builds upon Paris 1.0 (arXiv:2510.03434), the first ever open-weight Decentralized Diffusion Model (DDM), which showed that image generation can be trained without a monolithic GPU cluster. However, temporally coherent video generation had remained an open problem under decentralized training, and Paris 2.0 closes it. In low-resolution text-to-video training, against a monolithic model trained on the same data under a matched total compute budget, Paris 2.0 cuts Fréchet Video Distance (FVD) from 561.04 to 279.01, a ~2.0x improvement, and lifts CLIP text-video similarity and aesthetic score.

2605.25303 2026-05-29 cs.DS cs.LG math.ST stat.ML stat.TH 版本更新

Algorithms with Polynomially-Improved Approximation Factors for the $2 \rightarrow q$ Norm, and Applications

具有多项式改进近似因子的 $2 \rightarrow q$ 范数算法及其应用

Samuel B. Hopkins, Stefan Tiegel

发表机构 * MIT(麻省理工学院)

AI总结 本文针对 $q>2$ 时的 $2 \rightarrow q$ 范数,提出了首个多项式时间近似算法,其近似因子在多项式级别上优于基线 $d^{1/4}$,例如 $q=4$ 时达到 $d^{1/8}$,并构造了平方和证书,从而改进了鲁棒均值估计、协方差估计、回归和聚类等问题的算法。

Comments v2 corrected minor typos

详情
AI中文摘要

矩阵 $X \in \mathbb{R}^{n \times d}$ 的 $2 \rightarrow q$ 范数定义为 $\lVert X \rVert_{2 \rightarrow q} = \sup_{\lVert v \rVert_2 = 1} \lVert Xv \rVert_q$。我们针对 $q > 2$(即超收缩设置)给出了该范数的多项式时间乘法近似算法。该问题要么直接对应,要么与组合优化和近似难度(例如小集扩张)、量子信息(例如最佳可分态)以及算法统计学中长期存在的开放问题密切相关。 关于在多项式时间内能为此问题达到何种近似因子,我们所知甚少,尽管此类近似具有重要的下游影响。Barak、Brandão、Harrow、Kelner、Steurer 和 Zhou 表明,假设指数时间假设(FOCS'12),没有多项式时间算法能实现优于 $2^{\sqrt{\log n}}$ 的近似因子。另一方面,一个简单的谱算法给出了 $d^{1/4}$ 的基线近似。据我们所知,我们给出了首个在多项式因子内超越该基线的多项式时间近似算法。对于重要的特例 $q = 4$,它实现了 $d^{1/8}$ 的近似。所有先前的算法要么需要对 $X$ 附加假设,要么仅在 $n$ 较小时才能超越基线。 此外,我们为 $2 \rightarrow q$ 范数构造了平方和证书。这直接改进了当数据仅满足 $q$ 阶矩有界时的鲁棒均值和协方差估计、鲁棒回归以及聚类算法。

英文摘要

The $2 \rightarrow q$ norm of a matrix $X \in \mathbb{R}^{n \times d}$ is defined as $\lVert X \rVert_{2 \rightarrow q} = \sup_{\lVert v \rVert_2 = 1} \lVert Xv \rVert_q$. We give polynomial-time multiplicative approximation algorithms for this norm when $q > 2$ (i.e. in the hypercontractive setting). This problem either directly captures or is closely related to long-standing open problems in combinatorial optimization and hardness of approximation (e.g. Small Set Expansion), quantum information (e.g. Best Separable State), and algorithmic statistics. Very little is known about what approximation factors we can achieve for this problem in polynomial time, even though such approximations have significant downstream consequences. Barak, Brandão, Harrow, Kelner, Steurer, and Zhou showed that no polynomial-time algorithm can achieve an approximation factor better than $2^{\sqrt{\log n}}$, assuming the Exponential Time Hypothesis (FOCS'12). On the other hand, a simple spectral algorithm gives a $d^{1/4}$-approximation as a baseline. We give, to the best of our knowledge, the first polynomial-time approximation algorithm beating this baseline by polynomial factors. For the important special case of $q = 4$ it achieves a $d^{1/8}$-approximation. All previous algorithms required additional assumptions on $X$, or only surpassed the baseline for small values of $n$. Moreover, we construct sum-of-squares certificates for the $2 \rightarrow q$ norm. This directly implies improved algorithms for robust mean and covariance estimation, robust regression, and clustering, when the data only satisfies a bound on its $q$-th moment.
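To make the definition concrete, the sketch below brackets $\lVert X \rVert_{2 \rightarrow q}$ for a small matrix. It is not the paper's algorithm: the lower bound is simply the best value of $\lVert Xv \rVert_q$ over random unit vectors, and the upper bound uses the elementary chain $\lVert Xv \rVert_q \le \lVert Xv \rVert_2 \le \lVert X \rVert_F$, valid for $q \ge 2$ and unit $v$. All function names are illustrative.

```python
import math
import random

def q_norm(y, q):
    return sum(abs(t) ** q for t in y) ** (1.0 / q)

def matvec(X, v):
    return [sum(a * b for a, b in zip(row, v)) for row in X]

def random_unit_vector(d, rng):
    v = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(t * t for t in v))
    return [t / norm for t in v]

def sampled_lower_bound(X, q, trials=2000, seed=0):
    """Best ||Xv||_q over random unit vectors v: a lower bound on the 2->q norm."""
    rng = random.Random(seed)
    d = len(X[0])
    return max(q_norm(matvec(X, random_unit_vector(d, rng)), q) for _ in range(trials))

def frobenius_upper_bound(X):
    """For q >= 2 and unit v: ||Xv||_q <= ||Xv||_2 <= ||X||_F."""
    return math.sqrt(sum(a * a for row in X for a in row))
```

Random sampling gives no approximation guarantee; it only shows what quantity the supremum ranges over.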

2605.24934 2026-05-29 cs.RO cs.AI cs.CV cs.LG 版本更新

HumanEgo: Zero-Shot Robot Learning from Minutes of Human Egocentric Videos

HumanEgo:从数分钟的人类第一人称视角视频中进行零样本机器人学习

Zhi Wang, Botao He, Kelin Yu, Seungjae Lee, Ruohan Gao, Furong Huang, Yiannis Aloimonos

发表机构 * University of Maryland(马里兰大学)

AI总结 提出HumanEgo框架,通过将人类演示提升为手-物体交互的实体级表示,并训练具有密集辅助目标的流匹配策略,实现从人类自我中心视频到机器人的零样本、无机器人数据、硬件无关的技能迁移。

Comments Project page: https://humanego-ai.github.io

详情
AI中文摘要

人类自我中心视频捕捉了丰富的操作演示,无需任何机器人硬件,但由于人类和机器人在视觉外观和运动学上的具身差距,将这些技能迁移到机器人仍然具有挑战性。我们提出了HumanEgo,一个通过将每个人类演示提升为手-物体交互的实体级表示,并训练具有密集辅助目标的流匹配策略来弥合具身差距的框架,该策略放大了每个轨迹的监督信号。HumanEgo无需机器人数据、硬件无关、数据高效且可零样本地从人类迁移到机器人。每个任务仅需30分钟的人类视频,HumanEgo在四个真实世界任务中实现了92.5%的平均成功率(仅15分钟即可达到75%),比匹配时间的机器人遥操作高出41%,并且能够稳健地零样本迁移到新的机器人、相机和环境。我们发布了HumanEgo作为一个易于使用的开源框架,用于直接从人类数据学习机器人策略:https://github.com/TX-Leo/HumanEgo

英文摘要

Human egocentric video captures rich manipulation demonstrations without any robot hardware, yet transferring these skills to robots remains challenging due to the embodiment gap between human and robot in both visual appearance and kinematics. We present HumanEgo, a framework that bridges the embodiment gap by lifting each human demonstration to an entity-level representation of hand-object interaction, and training a flow matching policy with dense auxiliary objectives that amplify supervision from every trajectory. HumanEgo is robot-data-free, hardware-agnostic, data-efficient, and zero-shot human-to-robot transferable. With only 30 minutes of human videos per task, HumanEgo achieves 92.5% average success across four real-world tasks (75% with just 15 minutes), outperforms matched-time robot teleoperation by 41%, and robustly transfers zero-shot across novel robots, cameras, and environments. We release HumanEgo as an easy-to-use, open-source framework for learning robot policies directly from human data: https://github.com/TX-Leo/HumanEgo

2605.23239 2026-05-29 cs.LG 版本更新

Self-supervised Adversarial Purification for Graph Neural Networks

自监督对抗净化用于图神经网络

Woohyun Lee, Hogun Park

发表机构 * Department of Computer Science and Engineering(计算机科学与工程系) Sungkyunkwan University(成均馆大学) Suwon, South Korea(韩国水原)

AI总结 提出自监督对抗净化框架,通过专用净化器GPR-GAE(基于广义PageRank滤波器的图自编码器)在分类前净化输入数据,实现鲁棒性与分类器分离,达到最先进的防御性能。

Comments Accepted at ICML 2025. 21 pages. Code is available at: https://github.com/woodavid31/GPR-GAE

详情
Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:33715-33735, 2025
AI中文摘要

防御图神经网络(GNN)免受对抗攻击需要在准确性和鲁棒性之间取得平衡,而传统方法(如对抗训练)将这两个冲突目标交织在单个分类器中,往往处理不当。为克服这一局限,我们提出一种自监督对抗净化框架。通过引入专用净化器,在分类前净化输入数据,将鲁棒性与分类器分离。与先前的对抗净化方法不同,我们提出GPR-GAE,一种新颖的图自编码器(GAE),作为专用净化器,采用自监督策略训练,以数据驱动方式适应多样化的图结构。利用多个广义PageRank(GPR)滤波器,GPR-GAE捕获多样化的结构表示,实现鲁棒且有效的净化。我们的多步净化过程进一步促进GPR-GAE实现精确的图恢复和对结构扰动的鲁棒防御。跨不同数据集和攻击场景的实验表明,GPR-GAE具有最先进的鲁棒性,可作为GNN分类器的独立即插即用净化器。

英文摘要

Defending Graph Neural Networks (GNNs) against adversarial attacks requires balancing accuracy and robustness, a trade-off often mishandled by traditional methods like adversarial training that intertwine these conflicting objectives within a single classifier. To overcome this limitation, we propose a self-supervised adversarial purification framework. We separate robustness from the classifier by introducing a dedicated purifier, which cleanses the input data before classification. In contrast to prior adversarial purification methods, we propose GPR-GAE, a novel graph auto-encoder (GAE), as a specialized purifier trained with a self-supervised strategy, adapting to diverse graph structures in a data-driven manner. Utilizing multiple Generalized PageRank (GPR) filters, GPR-GAE captures diverse structural representations for robust and effective purification. Our multi-step purification process further facilitates GPR-GAE to achieve precise graph recovery and robust defense against structural perturbations. Experiments across diverse datasets and attack scenarios demonstrate the state-of-the-art robustness of GPR-GAE, showcasing it as an independent plug-and-play purifier for GNN classifiers.
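For orientation, a Generalized PageRank filter is commonly written as a learnable polynomial in the normalized adjacency matrix. The form below is the standard one from the GPR literature; whether GPR-GAE uses exactly this parameterization is an assumption, and the symbols $\tilde{A}$, $H^{(0)}$, $\gamma_k$ and $K$ are not taken from the abstract.

$$
Z = \sum_{k=0}^{K} \gamma_k \, \tilde{A}^{k} H^{(0)}, \qquad \tilde{A} = \tilde{D}^{-1/2} (A + I) \tilde{D}^{-1/2},
$$

where $A$ is the adjacency matrix, $\tilde{D}$ is the degree matrix of $A + I$, $H^{(0)}$ holds the initial node features, and the weights $\gamma_k$ are learned. Using several such filters, as the abstract describes, would correspond to several weight vectors $(\gamma_0, \dots, \gamma_K)$.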

2605.20612 2026-05-29 cs.LG 版本更新

Matryoshka Concept Bottleneck Models

Matryoshka 概念瓶颈模型

Ziye Chen, Hongbin Lin, Jie Li, Lijie Hu

发表机构 * Mohamed bin Zayed University of Artificial Intelligence(穆罕默德·本·扎耶德人工智能大学) The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州))

AI总结 提出 Matryoshka 概念瓶颈模型 (MCBM),通过嵌套层次结构实现自适应概念利用,将预期干预成本从线性降低到对数阶 O(log K),同时保证单调性能提升。

详情
AI中文摘要

概念瓶颈模型 (CBMs) 已成为可解释深度学习的一种重要范式,通过将预测基于人类可理解的概念来学习。然而,它们的实际部署受到测试时干预成本高昂的阻碍,因为纠正模型错误通常需要人类专家手动检查和验证大量预测概念。现有方法存在根本性的结构限制:它们要么采用单一静态概念集,迫使专家详尽地标注概念,导致高昂的干预成本;要么训练多个针对不同概念预算的模型,导致大量的计算和维护开销。为了解决这一挑战,我们提出了 Matryoshka 概念瓶颈模型 (MCBM),这是一种统一的架构,能够在单个模型中实现自适应概念利用。受 Matryoshka 表示学习的启发,MCBM 基于最大相关性和最小冗余性将概念组织成嵌套层次结构,允许在不重新训练的情况下在多个概念粒度级别进行推理。理论上,我们证明 MCBM 将预期干预成本从线性降低到对数阶 $O(\log K)$,同时保证单调性能提升。实验上,大量实验表明,MCBM 在实现动态且高效的专家交互的同时,与独立训练的模型性能相当。

英文摘要

Concept Bottleneck Models (CBMs) have emerged as a prominent paradigm for interpretable deep learning, learning by grounding predictions in human-understandable concepts. However, their practical deployment is hindered by the high cost of test-time intervention, as correcting model errors typically requires human experts to manually inspect and verify a large set of predicted concepts. Existing approaches suffer from a fundamental structural limitation: they either adopt a single static concept set, forcing experts to exhaustively annotate concepts and incurring prohibitive intervention costs, or train multiple models tailored to different concept budgets, resulting in substantial computational and maintenance overhead. To address this challenge, we propose the Matryoshka Concept Bottleneck Model (MCBM), a unified architecture that enables adaptive concept utilization within a single model. Inspired by Matryoshka Representation Learning, MCBM organizes concepts into a nested hierarchy based on maximum relevance and minimum redundancy, allowing inference at multiple levels of conceptual granularity without retraining. Theoretically, we show that MCBM reduces the expected intervention costs from linear to logarithmic order, $O(\log K)$, while guaranteeing monotonic performance improvement. Empirically, extensive experiments demonstrate that MCBM matches the performance of independently trained models while enabling dynamic and efficient expert interaction.

2605.16608 2026-05-29 cs.LG cs.CL 版本更新

To MRL or not to MRL: Text Embeddings are Robust to Truncation Without Matryoshka Learning, Except In Heavy Truncation Scenarios

使用还是不使用MRL:文本嵌入在没有Matryoshka学习的情况下对截断具有鲁棒性,除非在重度截断场景下

Sotaro Takeshita, Yurina Takeshita, Simone Paolo Ponzetto, Daniel Ruffinelli

发表机构 * Data and Web Science Group, University of Mannheim(曼海姆大学数据与网络科学小组) NEC Laboratories Europe(NEC欧洲实验室) Independent Researcher(独立研究者)

AI总结 本文将MRL所用的截断方式同时应用于使用和未使用Matryoshka表示学习(MRL)训练的文本嵌入模型,发现除非嵌入被重度截断(减少至少80%),否则非MRL模型的截断嵌入性能与MRL模型相当甚至更优。

详情
AI中文摘要

Matryoshka表示学习(MRL)是一种广泛采用的方法,用于训练文本编码器,使其提供各种大小的有用文本表示,只需在训练时预先确定的大小处截断结果向量即可。最近的研究表明,除非向量大小减少至少70%,否则随机截断文本嵌入对下游性能的影响很小,这表明嵌入在没有MRL的情况下已经对截断具有鲁棒性。然而,之前没有工作将随机截断与MRL进行比较,因此不清楚这两种方法作为有效的嵌入缩减方法如何比较。在本文中,我们通过将MRL使用的相同截断应用于使用和不使用MRL训练的模型来研究这一点。我们在多个模型和下游任务上的结果表明,除非重度截断嵌入(即将其大小减少至少80%),否则非MRL模型的截断嵌入与使用MRL训练的模型具有竞争力,并且通常表现更好。这表明截断鲁棒性可能不一定来自MRL,而选择花费MRL的额外训练成本取决于是否需要重度截断。我们提供代码以供复现。

英文摘要

Matryoshka Representation Learning (MRL) is a widely adopted approach for training text encoders so they provide useful text representations at various sizes, available by simply truncating the resulting vectors at sizes pre-determined at training time. Recent works have shown that randomly truncating text embeddings has minimal impact in downstream performance unless vectors are reduced in size by at least 70%, suggesting that embeddings are already robust to truncation without the use of MRL. However, no prior work has compared random truncation to MRL, so it is unclear how the two methods compare as effective embedding reduction methods. In this paper, we study this by applying the same truncation used by MRL to models trained with and without MRL. Our results across several models and downstream tasks show that, unless heavily truncating embeddings (i.e. reducing their size by at least 80%), truncated embeddings of non-MRL models are competitive with, and often outperform models trained with MRL. This suggests that truncation robustness may not necessarily come from MRL, and that the choice of spending the additional training cost of MRL depends on whether heavy truncation is desired. We make our code available for reproduction.
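The truncation discussed above can be sketched as keeping a prefix of the embedding and re-normalizing before computing cosine similarity. This is a minimal illustration, assuming prefix truncation followed by L2 normalization; the helper names and the 768-dimensional example size are hypothetical and not taken from the paper.

```python
import math

def truncate(embedding, keep_dims):
    """Keep the first keep_dims coordinates and re-normalize to unit length."""
    head = embedding[:keep_dims]
    norm = math.sqrt(sum(t * t for t in head))
    if norm == 0.0:
        return head  # nothing to normalize
    return [t / norm for t in head]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def reduction_percent(full_dims, keep_dims):
    """Size reduction in percent, the quantity behind the 70% and 80% thresholds."""
    return 100.0 * (1.0 - keep_dims / full_dims)
```

Under these assumptions, cutting a 768-dimensional vector to 128 dimensions is a reduction of about 83%, which falls in the "heavy truncation" regime of the abstract, while cutting to 256 dimensions (about 67%) does not.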

2605.14373 2026-05-29 cs.LG cs.AI 版本更新

Turning Stale Gradients into Stable Gradients: Coherent Coordinate Descent with Implicit Landscape Smoothing for Lightweight Zeroth-Order Optimization

将陈旧梯度转化为稳定梯度:具有隐式景观平滑的相干坐标下降用于轻量级零阶优化

Chen Liang, Xiatao Sun, Qian Wang, Daniel Rakita

发表机构 * Department of Computer Science, Yale University, New Haven, USA(耶鲁大学计算机科学系)

AI总结 提出一种确定性的、样本高效的零阶优化方法Coherent Coordinate Descent (CoCD),通过利用历史梯度的相干性实现每步O(1)查询复杂度,并发现大步长有限差分可隐式平滑优化景观,从而在轻量级场景下优于现有方法。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026); Project page: https://chen-dylan-liang.github.io/CoCD/

详情
AI中文摘要

零阶优化对于反向传播不可用的场景至关重要,例如内存受限的设备端学习和黑盒优化。然而,现有方法面临严峻的权衡:它们要么样本效率低(例如标准有限差分),要么由于随机估计(例如随机子空间方法)而遭受高方差。在这项工作中,我们提出了相干坐标下降(CoCD),一种确定性的、样本高效的、预算感知的零阶优化器。理论上,我们形式化了梯度相干性的概念,并证明CoCD等价于具有“热启动”的块循环坐标下降(BCCD),有效地将历史(陈旧)梯度从负担转化为计算资产。该机制在保持全局下降方向的同时,实现了每步O(1)查询复杂度。此外,我们推导出误差界,揭示了一个反直觉的见解:更大的有限差分步长可以通过降低有效平滑常数来隐式地平滑优化景观,从而提高收敛稳定性。在MLP、CNN和ResNet架构(最多27万个参数)上的实验表明,CoCD在样本效率和收敛损失/准确性方面显著优于BCCD,并且比随机化零阶方法表现出更好的稳定性。我们的结果表明,对于轻量级零阶优化,确定性的、结构感知的更新是随机化的优越替代方案。

英文摘要

Zeroth-Order (ZO) optimization is pivotal for scenarios where backpropagation is unavailable, such as memory-constrained on-device learning and black-box optimization. However, existing methods face a stark trade-off: they are either sample-inefficient (e.g., standard finite differences) or suffer from high variance due to randomized estimation (e.g., random subspace methods). In this work, we propose Coherent Coordinate Descent (CoCD), a deterministic, sample-efficient, and budget-aware ZO optimizer. Theoretically, we formalize the notion of gradient coherence and demonstrate that CoCD is equivalent to Block Cyclic Coordinate Descent (BCCD) with ``warm starts,'' effectively converting historical (stale) gradients from a liability into a computational asset. This mechanism enables $O(1)$ query complexity per step while maintaining global descent directions. Furthermore, we derive error bounds revealing a counter-intuitive insight: larger finite-difference step sizes can induce an implicit smoothing effect on the optimization landscape by reducing the effective smoothness constant, thereby improving convergence stability. Experiments on MLP, CNN, and ResNet architectures (up to 270k parameters) demonstrate that CoCD significantly outperforms BCCD in terms of sample efficiency and convergence loss/accuracy, and exhibits superior stability over randomized ZO methods. Our results suggest that deterministic, structure-aware updates offer a superior alternative to randomization for lightweight ZO optimization.
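The idea of reusing stale gradient entries at a constant number of queries per step can be sketched as follows. This is a simplified illustration under stated assumptions, not the CoCD algorithm from the paper: one coordinate is refreshed per step by a central finite difference (two function queries), every other entry of the gradient estimate is left stale, and a plain gradient step is taken along the full estimate. The function name, step size, and test objective are all hypothetical.

```python
def stale_coordinate_descent(f, x0, steps=200, lr=0.05, h=1e-2):
    """Zeroth-order descent that refreshes one gradient coordinate per step."""
    x = list(x0)
    d = len(x)
    g = [0.0] * d  # running gradient estimate; most entries are stale at any step
    queries = 0
    for t in range(steps):
        i = t % d  # cyclic coordinate choice
        xp = x[:]
        xp[i] += h
        xm = x[:]
        xm[i] -= h
        g[i] = (f(xp) - f(xm)) / (2.0 * h)  # refresh one entry: 2 queries
        queries += 2
        # Step along the full estimate, including the stale entries.
        x = [xj - lr * gj for xj, gj in zip(x, g)]
    return x, queries

def shifted_quadratic(x, target=(1.0, -2.0, 3.0, 0.5)):
    # Illustrative separable objective with minimum at `target`.
    return sum((xj - tj) ** 2 for xj, tj in zip(x, target))
```

On this separable quadratic each coordinate's error shrinks by a constant factor per cycle, so the stale entries still point downhill. For non-separable objectives the staleness introduces error, which is the kind of effect the coherence analysis in the abstract is concerned with.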

2605.14241 2026-05-29 cs.LG 版本更新

Latency-Quality Routing for Functionally Equivalent Tools in LLM Agents

LLM 代理中功能等价工具的延迟-质量路由

Kexin Chu, Dawei Xiang, Wei Zhang

发表机构 * University of Connecticut(康涅狄格大学)

AI总结 针对LLM代理中多个功能等价工具提供者的路由问题,提出LQM-ContextRoute上下文赌博机路由器,通过延迟-质量匹配和查询特定质量估计,在运行时负载下实现延迟与质量的权衡,在多个基准上优于SW-UCB。

Comments 14 pages, 6 figures, 13 tables

详情
AI中文摘要

工具增强的LLM代理越来越多地通过多个功能等价的提供者访问同一工具类型,例如共享接口背后的网络搜索API、检索器或LLM后端。这在运行时负载下产生了提供者路由问题:路由器必须在延迟、可靠性和答案质量上存在差异的提供者之间进行选择,通常在部署时没有黄金标签。我们引入了LQM-ContextRoute,一种用于同功能工具提供者的上下文赌博机路由器。其关键设计是延迟-质量匹配:不是让低延迟在加性奖励中抵消差答案,而是路由器根据每个服务周期的预期答案质量对提供者进行排序。它将这种容量感知得分与查询特定质量估计和LLM作为评判的反馈相结合,使其能够在线适应负载变化和提供者质量差异。在主要的网络搜索负载基准上,LQM-ContextRoute在保持延迟-质量前沿的同时,F1比SW-UCB提高了2.18个百分点。在高异质性的StrategyQA设置中,LQM-ContextRoute避免了加性奖励崩溃,准确率比SW-UCB最多提高18个百分点;在异质性检索器池上,NDCG比SW-UCB提高了2.91--3.22个百分点。这些结果表明,同功能工具路由受益于将延迟视为服务容量,特别是在运行时压力与提供者质量异质性共存时。

英文摘要

Tool-augmented LLM agents increasingly access the same tool type through multiple functionally equivalent providers, such as web-search APIs, retrievers, or LLM backends exposed behind a shared interface. This creates a provider-routing problem under runtime load: the router must choose among providers that differ in latency, reliability, and answer quality, often without gold labels at deployment time. We introduce LQM-ContextRoute, a contextual bandit router for same-function tool providers. Its key design is latency-quality matching: instead of letting low latency offset poor answers in an additive reward, the router ranks providers by expected answer quality per service cycle. It combines this capacity-aware score with query-specific quality estimation and LLM-as-judge feedback, allowing it to adapt online to both load changes and provider-quality differences. On the main web-search load benchmark, LQM-ContextRoute improves F1 by +2.18 pp over SW-UCB while staying on the latency-quality frontier. In a high-heterogeneity StrategyQA setting, LQM-ContextRoute avoids additive-reward collapse and improves accuracy by up to +18 pp over SW-UCB; on heterogeneous retriever pools, it improves NDCG by +2.91--+3.22 pp over SW-UCB. These results show that same-function tool routing benefits from treating latency as service capacity, especially when runtime pressure and provider-quality heterogeneity coexist.

2605.14113 2026-05-29 cs.CV cs.AI cs.LG cs.MA 版本更新

ProtoMedAgent: Multimodal Clinical Interpretability via Privacy-Aware Agentic Workflows

ProtoMedAgent: 通过隐私感知的智能体工作流实现多模态临床可解释性

Alvaro Lopez Pellicer, Plamen Angelov, Marwan Bukhari, Yi Li, Eduardo Soares, Jemma Kerns

发表机构 * School of Computing and Communications(计算与通信学院) Lancaster University(兰卡斯特大学) Lancaster Medical School(兰卡斯特医学院) PUC-Rio(里约热内卢天主教大学) Puc-Behring Institute for AI(Puc-Behring人工智能研究所)

AI总结 提出ProtoMedAgent框架,通过神经符号瓶颈和反射性Scribe-Critic循环约束生成过程,解决原型网络在临床报告中的语义结构缺失和检索谄媚问题,并引入k-匿名和ℓ-多样性隐私门控。

Comments CVR 2026

详情
AI中文摘要

尽管可解释的原型网络为临床诊断提供了引人注目的基于案例的推理,但其原始连续输出缺乏医学文档所需的语义结构。通过标准检索增强生成(RAG)弥合这一差距通常会触发“检索谄媚”,即大语言模型(LLM)产生事后合理化幻觉以与视觉预测对齐。我们引入了ProtoMedAgent,一个将多模态临床报告形式化为在严格神经符号瓶颈上的迭代、零梯度测试时优化问题的框架。在冻结的原型骨干上运行,我们将潜在视觉和表格特征蒸馏为离散语义记忆。在线生成严格受限于精确的集合论差分和反射性Scribe-Critic循环,从数学上排除了无根据的叙述性声明。为了安全地限制数据泄露,我们引入了一个由k-匿名和ℓ-多样性控制的语义隐私门控。在4,160名患者临床队列上的评估显示,ProtoMedAgent达到了91.2%的比较集忠实度,从根本上优于标准RAG(46.2%)。ProtoMedAgent还利用一个绑定ℓ-多样性的相变,系统性地将工件级成员推理风险降低了绝对9.8%。

英文摘要

While interpretable prototype networks offer compelling case-based reasoning for clinical diagnostics, their raw continuous outputs lack the semantic structure required for medical documentation. Bridging this gap via standard Retrieval-Augmented Generation (RAG) routinely triggers ``retrieval sycophancy,'' where Large Language Models (LLMs) hallucinate post-hoc rationalizations to align with visual predictions. We introduce ProtoMedAgent, a framework that formalizes multimodal clinical reporting as an iterative, zero-gradient test-time optimization problem over a strict neuro-symbolic bottleneck. Operating on a frozen prototype backbone, we distill latent visual and tabular features into a discrete semantic memory. Online generation is strictly constrained by exact set-theoretic differentials and a reflective Scribe-Critic loop, mathematically precluding unsupported narrative claims. To safely bound data disclosure, we introduce a semantic privacy gate governed by $k$-anonymity and $\ell$-diversity. Evaluated on a 4,160-patient clinical cohort, ProtoMedAgent achieves 91.2% Comparison Set Faithfulness where it fundamentally outperforms standard RAG (46.2%). ProtoMedAgent additionally leverages a binding $\ell$-diversity phase transition to systematically reduce artifact-level membership inference risks by an absolute 9.8%.
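The two privacy criteria named above have standard textbook definitions that are easy to check on a table of records. The sketch below tests distinct $\ell$-diversity together with $k$-anonymity; it only illustrates those definitions and is not the paper's semantic privacy gate. The function name, field names, and sample records are hypothetical.

```python
from collections import defaultdict

def passes_privacy_gate(records, quasi_keys, sensitive_key, k, ell):
    """True if every quasi-identifier group has >= k records (k-anonymity)
    and >= ell distinct sensitive values (distinct l-diversity)."""
    groups = defaultdict(list)
    for rec in records:
        groups[tuple(rec[q] for q in quasi_keys)].append(rec[sensitive_key])
    # Note: an empty table passes vacuously.
    return all(len(vals) >= k and len(set(vals)) >= ell for vals in groups.values())

# Illustrative cohort: one quasi-identifier group of three patients.
sample_records = [
    {"age_band": "60-70", "sex": "F", "diagnosis": "A"},
    {"age_band": "60-70", "sex": "F", "diagnosis": "B"},
    {"age_band": "60-70", "sex": "F", "diagnosis": "A"},
]
```

With these records the gate passes for `k=3, ell=2` and fails once either threshold is raised, since the group has three members and two distinct diagnoses.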

2605.13986 2026-05-29 cs.LG stat.ML 版本更新

TabPFN-3: Technical Report

TabPFN-3: 技术报告

Léo Grinsztajn, Klemens Flöge, Oscar Key, Felix Birkel, Philipp Jund, Brendan Roof, Mihir Manium, Shi Bin Hoo, Magnus Bühler, Anurag Garg, Dominik Safaric, Jake Robertson, Benjamin Jäger, Simone Alessi, Adrian Hayler, Vladyslav Moroshan, Lennart Purucker, Philipp Singer, Alan Arazi, Julien Siems, Jan Hendrik Metzen, Georg Grab, Nick Erickson, Siyuan Guo, Eliott Kalfon, Simon Bing, David Salinas, Clara Cornu, Lilly Charlotte Wehrhahn, Diana Kriuchkova, Kursat Kaya, Lydia Sidhoum, Marie Salmon, Jerry Chen, Madelon Hulsebos, Yann LeCun, Samuel Müller, Bernhard Schölkopf, Sauraj Gambhir, Noah Hollmann, Frank Hutter

发表机构 * Prior Labs

AI总结 本文提出TabPFN-3,通过扩展训练数据和优化推理,在表格数据上实现最先进性能,并支持时间序列、关系数据和表格文本数据。

详情
AI中文摘要

表格数据支撑着科学和工业中大多数高价值预测问题,而TabPFN推动了该模态的基础模型革命。根据用户反馈设计,TabPFN-3在此基础上将最先进性能扩展到具有100万训练行的数据集,并大幅减少训练和推理时间。TabPFN-3完全基于我们先验的合成数据进行预训练,极大地推动了表格预测的前沿,并在时间序列、关系数据和表格文本数据上带来了实质性收益。在标准表格基准TabArena上,TabPFN-3的一次前向传播以显著优势优于所有其他模型(包括调优和集成基线),并在速度/性能前沿上占据帕累托优势。在更多样化的数据集上,TabPFN-3在类别数较多的数据集上排名第一,并在多达100万训练行和200个特征的数据集上击败了经过8小时调优的梯度提升树基线。TabPFN-3将测试时计算缩放引入表格基础模型。我们的API产品TabPFN-3-Plus(思考版)利用这一点,在TabArena上以超过200 Elo的优势击败所有非TabPFN模型,该优势在最大数据子集上升至420 Elo,并且性能优于AutoGluon 1.5 extreme,同时速度快10倍,且不使用LLM、真实数据、互联网搜索或除TabPFN之外的任何其他模型。TabPFN-3扩展了我们模型的能力,实现了对关系数据(在RelBenchV1上新的最先进基础模型)和表格文本数据(通过TabPFN-3-Plus在TabSTAR上达到最先进)的最先进预测;并改进了现有集成:专用检查点TabPFN-TS-3在时间序列基准fev-bench上排名第二,SHAP值计算速度提升高达120倍。TabPFN-3在实现这一性能的同时,最高比TabPFN-2.5快20倍。此外,缩减的KV缓存和行分块技术使其能够在单个H100上以快速推理速度扩展到100万行。

英文摘要

Tabular data underpins most high-value prediction problems in science and industry, and TabPFN has driven the foundation model revolution for this modality. Designed with feedback from our users, TabPFN-3 builds on this foundation to scale state-of-the-art performance to datasets with 1M training rows and substantially reduce training and inference time. Pretrained exclusively on synthetic data from our prior, TabPFN-3 dramatically pushes the frontier of tabular prediction and brings substantial gains on time series, relational, and tabular-text data. On the standard tabular benchmark TabArena, a forward pass of TabPFN-3 outperforms all other models, including tuned and ensembled baselines, by a significant margin, and pareto-dominates the speed/performance frontier. On more diverse datasets, TabPFN-3 ranks first on datasets with many classes, and beats 8-hour-tuned gradient-boosted-tree baselines on datasets up to 1M training rows and 200 features. TabPFN-3 introduces test-time compute scaling to tabular foundation models. Our API offering TabPFN-3-Plus (Thinking) exploits this to beat all non-TabPFN models by over 200 Elo on TabArena, rising to 420 Elo on the largest data subset, and outperforms AutoGluon 1.5 extreme while being 10x faster, without using LLMs, real data, internet search or any other model besides TabPFN. TabPFN-3 extends the capabilities of our models, enabling SOTA prediction on relational data (new SOTA foundation model on RelBenchV1) and tabular-text data (SOTA on TabSTAR via TabPFN-3-Plus); and improves existing integrations: a specialized checkpoint, TabPFN-TS-3, ranks 2nd on the time-series benchmark fev-bench, and SHAP-value computation is up to 120x faster. TabPFN-3 achieves this performance while being up to 20x faster than TabPFN-2.5. In addition, a reduced KV cache and row-chunking scale to 1M rows on one H100 with fast inference speed.

2605.13230 2026-05-29 cs.LG cs.AI 版本更新

Teacher-Guided Policy Optimization for On-Policy Reasoning Distillation under Large Policy Divergence

教师引导的策略优化:大策略差异下的同策略推理蒸馏

Xinyu Liu, Kechen Jiao, Chunyang Xiao, Runsong Zhao, Junhao Ruan, Bei Li, Jiahao Liu, Qifan Wang, Xin Chen, Jingang Wang, Chenglong Wang, Tong Xiao, JingBo Zhu

发表机构 * School of Computer Science and Engineering, Northeastern University, China(东北大学计算机科学与工程学院) Tsinghua University(清华大学) Meituan(美团) Meta AI NiuTrans Research, Shenyang, China(小牛翻译研究院,沈阳,中国)

AI总结 针对同策略蒸馏中教师与学生策略差异大时反向KL监督失效的问题,提出教师引导策略优化(TGPO),通过教师直接指导学生上下文的token级生成并结合RLVR奖励,在推理基准上优于现有方法。

详情
AI中文摘要

同策略蒸馏(OPD)已成为面向推理的大型语言模型(LLM)后训练的一种有前景的范式,特别是与可验证奖励的强化学习(RLVR)结合时。现有的OPD方法依赖于基于反向KL(RKL)的教师监督,对学生策略采样的轨迹进行监督。然而,我们识别出一个关键限制:在教师-学生策略差异大的情况下,RL驱动的探索常常产生教师分布之外的轨迹,导致无信息的负面反馈。为了解决这个问题,我们提出教师引导策略优化(TGPO),一种在策略差异大的设置下仍然有效的同策略推理蒸馏方法。TGPO不单纯依赖评估式监督,而是利用教师直接指导基于学生生成上下文的token级生成;结合RLVR风格的轨迹级奖励,TGPO引导探索朝向改进的延续。在推理基准上的实验表明,TGPO始终优于现有的基于RKL的OPD方法,并且在不同教师模型下保持鲁棒性。

英文摘要

On-policy distillation (OPD) has become a promising paradigm for reasoning-oriented post-training of large language models (LLMs), especially when combined with reinforcement learning from verifiable rewards (RLVR). Existing OPD methods rely on reverse KL (RKL)-based teacher supervision over trajectories sampled from the student policy. However, we identify a critical limitation: under large teacher--student policy divergence, RL-driven exploration often produces trajectories outside the teacher distribution, resulting in uninformative negative feedback. To address this, we propose Teacher-Guided Policy Optimization (TGPO), an on-policy reasoning distillation method that remains effective under large policy divergence. Rather than relying solely on evaluative supervision, TGPO uses the teacher to directly guide token-level generation conditioned on student-generated contexts; together with RLVR-style trajectory-level rewards, TGPO steers exploration toward improved continuations. Experiments on reasoning benchmarks show that TGPO consistently outperforms existing RKL-based OPD methods and remains robust across different teacher models.
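The reverse KL term mentioned in the abstract can be computed for a single next-token distribution as in the sketch below. It only illustrates the quantity $\mathrm{KL}(\pi_{\text{student}} \,\Vert\, \pi_{\text{teacher}})$ and how it grows when the student puts mass on tokens the teacher considers unlikely; it is not the TGPO objective, and the function name and the `eps` floor are assumptions made for the example.

```python
import math

def reverse_kl(student, teacher, eps=1e-12):
    """KL(student || teacher) for one next-token distribution.

    `eps` floors teacher probabilities to avoid division by zero; it is an
    illustrative choice, not something specified in the source.
    """
    return sum(
        s * math.log(s / max(t, eps))
        for s, t in zip(student, teacher)
        if s > 0.0
    )
```

For example, `reverse_kl([0.9, 0.1], [0.1, 0.9])` is much larger than `reverse_kl([0.6, 0.4], [0.5, 0.5])`, reflecting the larger penalty when the two policies disagree strongly.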

2605.10299 2026-05-29 cs.LG 版本更新

Nearly-Optimal Algorithm for Adversarial Kernelized Bandits

对抗性核化赌博机的近最优算法

Shogo Iwazaki

发表机构 * LY Corporation(LY公司)

AI总结 针对对抗性环境下的核化赌博机问题,提出指数权重算法并证明其达到近最优遗憾界,同时给出下界并利用Nyström近似实现高效计算。

Comments 47 pages

详情
AI中文摘要

本文研究对抗性环境下的核化赌博机(也称为高斯过程赌博机),其中已知再生核希尔伯特空间(RKHS)中的奖励函数可能在每轮被对抗性地选择。我们证明指数权重算法实现了$\tilde{O}(\sqrt{T γ_T})$的对抗遗憾,其中$T$和$γ_T$分别表示总轮数和最大信息增益。对于平方指数(SE)和$ν$-Matérn核,我们还证明了算法无关的下界,保证了我们的算法在多项式对数因子内的最优性。此外,我们提出了使用Nyström近似的计算高效变体,同时保持近最优的遗憾保证。

英文摘要

This paper studies kernelized bandits (also known as Gaussian process bandits) in an adversarial environment, where the reward functions in a known reproducing kernel Hilbert space (RKHS) may be adversarially chosen at each round. We show that the exponential-weight algorithm achieves $\tilde{O}(\sqrt{T γ_T})$ adversarial regret, where $T$ and $γ_T$ denote the number of total rounds and the maximum information gain, respectively. For squared exponential (SE) and $ν$-Matérn kernels, we also show algorithm-independent lower bounds that guarantee the optimality of our algorithm up to polylogarithmic factors. Furthermore, we present a computationally efficient variant of our algorithm using Nyström approximation while maintaining nearly optimal regret guarantees.
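To see what the bound $\tilde{O}(\sqrt{T \gamma_T})$ gives for the two kernel families, one can plug in bounds on the maximum information gain that are commonly cited in the Gaussian process bandit literature. These $\gamma_T$ bounds and the input dimension $d$ are not stated in the abstract and are assumptions here.

$$
\text{SE kernel:}\quad \gamma_T = O\!\left(\log^{d+1} T\right) \;\Longrightarrow\; \sqrt{T \gamma_T} = \tilde{O}\!\left(\sqrt{T}\right),
$$

$$
\nu\text{-Matérn kernel:}\quad \gamma_T = \tilde{O}\!\left(T^{\frac{d}{2\nu + d}}\right) \;\Longrightarrow\; \sqrt{T \gamma_T} = \tilde{O}\!\left(T^{\frac{1}{2}\left(1 + \frac{d}{2\nu + d}\right)}\right) = \tilde{O}\!\left(T^{\frac{\nu + d}{2\nu + d}}\right).
$$

Under these assumptions the regret is sublinear in $T$ in both cases, with the Matérn rate approaching $\sqrt{T}$ as the smoothness $\nu$ grows.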

2605.08870 2026-05-29 cs.LG math.AT math.DG 版本更新

TopoGeoScore: A Self-Supervised Source-Only Geometric Framework for OOD Checkpoint Selection

TopoGeoScore: 一种用于OOD检查点选择的自监督纯源几何框架

Farid Hazratian, Ali Zia, Hien Duy Nguyen

发表机构 * University of Tehran(德黑兰大学) La Trobe University(拉特罗布大学) Kyushu University(九州大学)

AI总结 提出TopoGeoScore,一种仅利用源域表示、无需目标样本或标签的自监督几何评分方法,通过提取类流形的拓扑与几何特征并学习可解释的线性分数,实现分布外鲁棒检查点的选择。

详情
AI中文摘要

当目标域标签不可用时,分布外(OOD)鲁棒性难以诊断。我们考虑一种更严格的纯源无监督精度估计变体:仅使用源域表示选择鲁棒检查点,无需目标样本或目标标签。我们提出\textbf{TopoGeoScore},一种用于无标签OOD检查点选择的纯源几何评分器。给定一个训练好的检查点,我们从源嵌入构建类条件互$k$近邻图,并提取三个可解释信号:用于全局类流形复杂度的挠率启发约化拉普拉斯对数行列式、用于局部邻域正则性的Ollivier--Ricci曲率,以及用于碎片化连通性、环和全局-局部不一致性的高阶拓扑摘要。TopoGeoScore不是手动固定权重,而是通过自监督目标学习非负线性分数,该目标强制在近似保持几何的嵌入视图下具有不变性,并与破坏结构的视图分离。该分数保持可解释性,且不使用目标域样本或标签。在基于CIFAR的损坏和分布偏移基准、ImageNet-C、MNLI$\to$HANS迁移和OGBN-Arxiv上的结果表明,源表示包含可测量的全局-局部-拓扑鲁棒性证据,支持在分布偏移下部署前的实用检查点选择。

英文摘要

Out-of-distribution (OOD) robustness is difficult to diagnose when target-domain labels are unavailable. We consider a more restrictive source-only variant of unsupervised accuracy estimation: selecting robust checkpoints using only source-domain representations, with no target samples or target labels. We propose \textbf{TopoGeoScore}, a source-only geometric scorer for label-free OOD checkpoint selection. Given a trained checkpoint, we construct class-conditional mutual $k$-nearest-neighbour graphs from source embeddings and extract three interpretable signals: a torsion-inspired reduced Laplacian log-determinant for global class-manifold complexity, Ollivier--Ricci curvature for local neighbourhood regularity, and higher-order topological summaries for fragmented connectivity, loops, and global--local inconsistency. Instead of fixing their weights by hand, TopoGeoScore learns a non-negative linear score through a self-supervised objective that enforces invariance under approximately geometry-preserving embedding views and separation from structure-breaking views. The score remains interpretable and uses no target-domain samples or labels. Results across CIFAR-based corruption and distribution-shift benchmarks, ImageNet-C, MNLI$\to$HANS transfer, and OGBN-Arxiv suggest that source representations contain measurable global--local--topological evidence of robustness, supporting practical checkpoint selection before deployment under distribution shift.
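The first step described above, a mutual $k$-nearest-neighbour graph, can be sketched in a few lines. The definition used here is the usual one (an edge exists only if each point is among the other's $k$ nearest neighbours); Euclidean distance, the function names, and the sample points are assumptions for illustration. To obtain class-conditional graphs, the function would be applied separately to the embeddings of each class.

```python
import math

def knn_indices(points, i, k):
    """Indices of the k nearest neighbours of points[i] (Euclidean, excluding i)."""
    dists = sorted((math.dist(points[i], p), j) for j, p in enumerate(points) if j != i)
    return {j for _, j in dists[:k]}

def mutual_knn_edges(points, k):
    """Undirected edges (i, j), i < j, where i and j are in each other's k-NN sets."""
    neighbours = [knn_indices(points, i, k) for i in range(len(points))]
    return {
        (i, j)
        for i in range(len(points))
        for j in neighbours[i]
        if i < j and i in neighbours[j]
    }

# Two tight pairs and one isolated point, purely illustrative.
sample_points = [(0.0, 0.0), (0.1, 0.0), (5.0, 5.0), (5.1, 5.0), (2.5, 2.4)]
```

With `k=1` the isolated point gets no edge, because its nearest neighbour does not reciprocate. This is the property that makes mutual graphs sensitive to fragmented connectivity.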

2605.08832 2026-05-29 cs.LG physics.flu-dyn 版本更新

Inpainting physics: self-supervised learning for context-driven fluid simulation

物理修复:用于上下文驱动流体模拟的自监督学习

Jonas Weidner, Yeray Martin-Ruisanchez, Daniel Rueckert, Benedikt Wiestler, Julian Suk

发表机构 * AI for Image-Guided Diagnosis and Therapy, Technical University of Munich(慕尼黑工业大学图像引导诊断与治疗人工智能) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心) AI in Healthcare and Medicine, Technical University of Munich(慕尼黑工业大学医疗健康与医学人工智能) Imperial College London(伦敦帝国理工学院)

AI总结 提出将稳态CFD推理重构为修复问题,通过自监督学习速度场先验并在推理时施加边界约束,利用局部邻域分词器处理大规模3D网格,在颅内动脉瘤血流动力学中优于监督代理模型。

详情
AI中文摘要

计算流体动力学(CFD)的神经代理模型通常被训练为正向算子,将显式问题规范(如几何形状和边界条件)映射到解场。这使得模型与训练期间看到的条件变量绑定,并在边界条件变化或局部几何改变时限制了复用。我们提出将稳态CFD推理重构为一个修复问题:不是训练显式边界条件,而是学习速度场的自监督先验,并在推理时通过固定已知区域(如入口、出口或先前模拟中未改变的区域)来施加边界约束。为了将这一思想扩展到大规模3D网格,我们引入了一个局部邻域分词器,将高分辨率速度场表示为紧凑的空间潜在令牌,并在这些令牌上训练潜在流匹配和掩码自编码器模型。在颅内动脉瘤血流动力学中,我们的方法从稀疏边界上下文中重建完整速度场,在边界条件和数据集偏移下优于监督神经代理模型,并通过复用未改变的模拟上下文实现局部几何编辑。这些结果表明,将CFD推理视为上下文条件修复可以将神经代理从任务特定预测器转变为可复用的流先验。

英文摘要

Neural surrogate models for computational fluid dynamics (CFD) are typically trained as forward operators that map explicit problem specifications, such as geometry and boundary conditions, to solution fields. This ties the model to the conditioning variables seen during training and limits reuse under boundary-condition shifts or local geometry changes. We propose to reformulate steady CFD inference as an inpainting problem: instead of training on explicit boundary conditions, we learn a self-supervised prior over velocity fields and impose boundary constraints only during inference by fixing known regions such as inlet, outlet or unchanged regions from previous simulations. To scale this idea to large 3D meshes, we introduce a local neighbourhood tokeniser that represents high-resolution velocity fields as compact spatial latent tokens and train latent flow-matching and masked-autoencoder models on these tokens. On intracranial aneurysm hemodynamics, our method reconstructs full velocity fields from sparse boundary context, outperforms supervised neural surrogates under boundary-condition and dataset shift and enables local geometry editing by reusing unchanged simulation context. These results suggest that viewing CFD inference as context-conditioned inpainting can turn neural surrogates from task-specific predictors into reusable flow priors.

2605.08786 2026-05-29 cs.LG 版本更新

PRIM: Meta-Learned Bayesian Root Cause Analysis

PRIM:元学习的贝叶斯根因分析

Christopher Lohse, Anish Dhir, Amadou Ba, Bradley Eck, Marco Ruffini, Jonas Wahl

发表机构 * University of Dublin, Trinity College(都柏林大学三一学院) IBM Gatsby Computational Neuroscience Unit, University College London(伦敦大学学院盖茨比计算神经科学中心) Deutsches Forschungszentrum für Künstliche Intelligenz (DFKI), Saarbrücken, Germany(德国萨尔布吕肯德国人工智能研究中心(DFKI)) Department of Philosophy, University of Bergen(卑尔根大学哲学系)

AI总结 提出一种基于元学习的贝叶斯根因分析方法PRIM,通过合成先验因果模型进行贝叶斯推断,隐式识别数据生成机制变化,实现零样本快速推理。

详情
AI中文摘要

复杂系统中的根因分析(RCA)由于错误在多个变量间传播、需要结构因果知识以及测试时推理的计算成本而具有挑战性。我们提出了PRIM(基于先验拟合的元学习根因识别),一种因果元学习方法,将RCA视为对因果模型合成先验的贝叶斯推断任务。通过边缘化结构不确定性,PRIM隐式识别基线和异常时期之间数据生成机制的变化。在此过程中,PRIM无需显式统计检验即可推断分布差异,并在测试时无需模型拟合即可隐式学习因果结构。遵循基于模拟的元学习范式(先验拟合网络),PRIM使用模型平均因果估计(MACE)Transformer神经过程,该过程联合关注观测样本、异常样本以及节点的因果结构,从而在17毫秒内对多达100个变量的系统实现零样本推理。在合成基准和两个真实基准数据集PetShop和CausRCA上,PRIM与预先知道系统因果图结构的方法竞争,同时在多个任务上优于不知图结构的方法。对特定领域和数据动态的轻量级微调进一步提升了性能。

英文摘要

Root cause analysis (RCA) in complex systems is challenging due to error propagation across multiple variables, the need for structural causal knowledge, and the computational cost of inference at test time. We introduce PRIM (Prior-fitted Root cause Identification with Meta-learning), a causal meta-learning approach that frames RCA as a Bayesian inference task over a synthetic prior of causal models. By marginalising out structural uncertainty, PRIM implicitly identifies changes in the data-generating mechanism between baseline and anomalous periods. In doing so, PRIM infers distributional differences without explicit statistical testing, and implicitly learns causal structure without model fitting at test time. Following the simulation-based meta-learning paradigm of prior-fitted networks, PRIM uses a Model-Averaged Causal Estimation (MACE) transformer neural process that jointly attends over observational and anomalous samples and the causal structure of nodes, enabling zero-shot inference in 17 ms for systems with up to 100 variables. Across synthetic benchmarks and two realistic benchmark datasets, PetShop and CausRCA, PRIM is competitive with methods that are aware of the system's causal graphical structure a priori while outperforming graph-unaware methods on several tasks. Lightweight fine-tuning to specific domains and data dynamics improves performance further.

2605.07596 2026-05-29 stat.ML cs.LG 版本更新

A Refined Generalization Analysis for Extreme Multi-class Supervised Contrastive Representation Learning

极端多类监督对比表示学习的精细泛化分析

Nong Minh Hieu, Antoine Ledent

发表机构 * School of Computing and Information Systems, Singapore Management University(新加坡管理大学计算与信息系统学院)

AI总结 针对对比表示学习在有限标注数据中构造元组导致依赖性的问题,提出改进的U-统计量分析,得到与类别数R同阶的样本复杂度,并设计新估计器在长尾分布下实现O(k)的样本复杂度。

Comments Accepted at ICML 2026

详情
AI中文摘要

对比表示学习(CRL)在多个机器学习领域取得了强大的实证成功,但其理论样本复杂度仍然知之甚少。现有分析通常假设输入元组是独立同分布的,这一假设在大多数实际设置中被违反,因为对比元组是从有限标注数据池中构造的,导致元组之间存在依赖性。虽然最近有一项工作使用U-统计量分析这种学习设置以估计总体风险,但其中使用的技术要求每个类别的风险均匀集中,使得超额风险界限的规模为$ρ_{\min}^{-{1}/{2}}$,其中$ρ_{\min}$表示最稀有类别的概率。这种依赖在极端多类设置中可能过于悲观,因为存在许多尾部类别,它们对总体风险的贡献极小。我们的贡献有两方面。首先,我们改进了先前的工作,证明了一个样本复杂度与类别数$R$同阶的界限,无论类别分布如何。此外,我们制定了一个不同的估计器,捕捉风险跨类别的集中性,从而在极端多类学习场景中实现更尖锐的界限,特别是在类别分布为长尾的情况下。在类别分布的温和假设下,得到的样本复杂度为$\mathcal{O}(k)$,其中$k$是每个元组的样本数。

英文摘要

Contrastive Representation Learning (CRL) has achieved strong empirical success in multiple machine learning disciplines, yet its theoretical sample complexity remains poorly understood. Existing analyses usually assume that input tuples are independent and identically distributed, an assumption violated in most practical settings where contrastive tuples are constructed from a finite pool of labeled data, inducing dependencies among tuples. While one recent work analyzed this learning setting using U-Statistics to estimate the population risk, the techniques used therein require the risk of each class to concentrate uniformly, making excess risk bounds scale in the order of $ρ_{\min}^{-{1}/{2}}$ where $ρ_{\min}$ denotes the probability of the rarest class. Such a dependency can be overly pessimistic in extreme multi-class settings where there are many tail classes which contribute minimally to the overall population risk. Our contributions are two-fold. Firstly, we improve upon the previous work and prove a bound with a sample complexity of the same order as the number of classes $R$, regardless of the distribution over classes. Furthermore, we formulate a different estimator that captures the concentration of the risk across classes, enabling sharper bounds in extreme multi-class learning scenarios, especially where class distributions are long-tailed. Under mild assumptions on the class distributions, the resulting sample complexity is $\mathcal{O}(k)$ where $k$ is the number of samples per tuple.

2605.06355 2026-05-29 cs.LG stat.ML 版本更新

Order-Agnostic Autoregressive Modelling with Missing Data

缺失数据下的顺序无关自回归建模

Ignacio Peis, Pablo M. Olmos, Jes Frellsen

发表机构 * Technical University of Denmark(丹麦技术大学) Pioneer Centre for AI(先锋人工智能中心) Universidad Carlos III de Madrid(马德里卡洛斯三世大学)

AI总结 本文通过缺失数据视角重新审视顺序无关自回归模型,提出缺失感知训练框架,并利用其条件密度估计进行主动信息获取,在多个基准上优于传统插补方法。

详情
AI中文摘要

顺序无关自回归模型在深度生成建模中表现出色,但其在数据不完整情况下的应用尚未被充分探索。本文从缺失数据的角度重新审视这些模型。首先,我们证明它们在完全观测数据上的标准训练过程隐式地在完全随机缺失机制下进行插补,从而在高缺失率场景下实现了稳健的样本外插补性能。其次,我们提出了第一个原则性框架,用于在一般缺失机制下直接从不完整数据集中训练这些模型。第三,我们利用其摊销条件密度估计进行主动信息获取,即顺序选择对下游预测或推理最有信息量的缺失变量。在一系列真实世界基准测试中,我们的缺失感知顺序无关自回归模型(MO-ARM)持续优于已建立的插补基线。

英文摘要

Order-Agnostic autoregressive models have demonstrated strong performance in deep generative modeling, yet their use in settings with incomplete data remains largely unexplored. In this work, we reinterpret them through the lens of missing data. First, we show that their standard training procedure on fully observed data implicitly performs imputation under a missing completely at random mechanism, resulting in robust out-of-sample imputation performance in settings with high missingness. Second, we introduce the first principled framework for training them directly on incomplete datasets under general missingness mechanisms. Third, we leverage their amortized conditional density estimation to perform active information acquisition, i.e., sequentially selecting the most informative missing variables for downstream prediction or inference. Across a suite of real-world benchmarks, our Missingness-Aware Order-Agnostic Autoregressive Model (MO-ARM) consistently outperforms established imputation baselines.
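
The first claim above, that standard order-agnostic training behaves like imputation under a missing-completely-at-random mechanism, can be illustrated with a minimal sketch. It assumes the usual order-agnostic recipe of drawing a uniform random ordering and a random split point; the function name `sample_training_mask` is hypothetical and not taken from the paper. The point is only that the observed/target split is drawn without looking at the data values.

```python
import random

def sample_training_mask(num_features: int, rng: random.Random):
    """Draw a random ordering and split point; return (observed indices, target index).

    Assumption: one uniform permutation plus a uniform split point per example.
    The mask never depends on feature values, which is what MCAR means.
    """
    order = list(range(num_features))
    rng.shuffle(order)
    t = rng.randrange(num_features)  # number of already-observed variables, 0..D-1
    return sorted(order[:t]), order[t]

rng = random.Random(0)
masks = [sample_training_mask(5, rng) for _ in range(1000)]
```

Each pair in `masks` says which variables the model would condition on and which one it would predict; training directly on incomplete data under general missingness mechanisms, as the abstract proposes, would need a different construction.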

2605.05964 2026-05-29 cs.LG 版本更新

Uncertainty Estimation via Hyperspherical Confidence Mapping

基于超球面置信映射的不确定性估计

Eunseo Choi, Ho-Yeon Kim, Jaewon Lee, Taeyong Jo, Myungjun Lee, Heejin Ahn

发表机构 * KAIST(韩国科学技术院) Samsung Electronics Co., Ltd.(三星电子有限公司)

AI总结 提出超球面置信映射(HCM),通过将输出分解为幅度和归一化方向向量并利用几何约束违反程度实现无采样、无分布假设的不确定性估计,在回归和分类任务中匹配或超越集成与证据方法且推理成本更低。

Comments Accepted at ICLR 2026. 24 pages, 7 figures, including appendix. Updated references

详情
AI中文摘要

量化神经网络预测中的不确定性对于自动驾驶、医疗和制造等高风险领域至关重要。现有方法通常依赖昂贵的采样或严格的分布假设,我们提出超球面置信映射(HCM),一个简单而原则性的框架,用于无采样和无分布假设的不确定性估计。HCM将输出分解为幅度和约束在单位超球面上的归一化方向向量,从而将不确定性解释为该几何约束的违反程度,得到适用于回归和分类的确定性和可解释性估计。在多种基准和实际工业任务上的实验表明,HCM匹配或超越了集成和证据方法,且推理成本更低,置信度-错误对齐更强。我们的结果凸显了几何结构在不确定性估计中的力量,并将HCM定位为传统技术的通用替代方案。

英文摘要

Quantifying uncertainty in neural network predictions is essential for high-stakes domains such as autonomous driving, healthcare, and manufacturing. While existing approaches often depend on costly sampling or restrictive distributional assumptions, we propose Hyperspherical Confidence Mapping (HCM), a simple yet principled framework for sampling-free and distribution-free uncertainty estimation. HCM decomposes outputs into a magnitude and a normalized direction vector constrained to lie on the unit hypersphere, enabling a novel interpretation of uncertainty as the degree of violation of this geometric constraint. This yields deterministic and interpretable estimates applicable to both regression and classification. Experiments across diverse benchmarks and real-world industrial tasks demonstrate that HCM matches or surpasses ensemble and evidential approaches, with far lower inference cost and stronger confidence-error alignment. Our results highlight the power of geometric structure in uncertainty estimation and position HCM as a versatile alternative to conventional techniques.

2605.05133 2026-05-29 cs.LG 版本更新

Transformed Latent Variable Multi-Output Gaussian Processes

变换潜变量多输出高斯过程

Xiaoyu Jiang, Xinxing Shi, Sokratia Georgaka, Magnus Rattray, Mauricio A Álvarez

发表机构 * Department of Computer Science, University of Manchester, Manchester, UK(曼彻斯特大学计算机科学系) Faculty of Biology, Medicine and Health, University of Manchester, Manchester, UK(曼彻斯特大学生物医学与健康学院)

AI总结 提出T-LVMOGP框架,通过Lipschitz正则化神经网络构建灵活的多输出深度核,结合随机变分推理,有效扩展到高维输出场景,在气候建模和空间转录组学等基准上优于基线方法。

Comments ICML 2026

详情
AI中文摘要

多输出高斯过程(MOGP)为建模相关输出提供了一个原则性的概率框架,但在应用于具有高维输出空间的数据集时面临可扩展性瓶颈。为了保持可处理性,现有方法通常采用限制性假设,例如使用低秩核或可分离核之和,这可能限制表达能力。我们提出了变换潜变量多输出高斯过程(T-LVMOGP),这是一种新颖的框架,将MOGP扩展到大量输出,同时保留捕获有意义输出间依赖关系的能力。T-LVMOGP通过使用Lipschitz正则化神经网络将输入和输出特定的潜变量映射到嵌入空间,构建了一个灵活的多输出深度核。结合随机变分推理,我们的模型有效地扩展到高维输出设置。在包括超过10,000个输出的气候建模和零膨胀空间转录组学数据在内的多个基准测试中,T-LVMOGP在预测准确性和计算效率上均优于基线方法。

英文摘要

Multi-Output Gaussian Processes (MOGPs) provide a principled probabilistic framework for modelling correlated outputs but face scalability bottlenecks when applied to datasets with high-dimensional output spaces. To maintain tractability, existing methods typically resort to restrictive assumptions, such as employing low-rank or sum-of-separable kernels, which can limit expressiveness. We propose the Transformed Latent Variable MOGP (T-LVMOGP), a novel framework that scales MOGPs to a massive number of outputs while preserving the capacity to capture meaningful inter-output dependencies. T-LVMOGP constructs a flexible multi-output deep kernel by mapping inputs and output-specific latent variables into an embedding space using a Lipschitz-regularised neural network. Combined with stochastic variational inference, our model effectively scales to high-dimensional output settings. Across diverse benchmarks, including climate modelling with over 10,000 outputs and zero-inflated spatial transcriptomics data, T-LVMOGP outperforms baselines in both predictive accuracy and computational efficiency.

2605.00898 2026-05-29 eess.SP cs.LG 版本更新

A Deep Learning Model for Battery State Prediction towards Intelligent Energy Management

面向智能能源管理的电池状态预测深度学习模型

Athanasios Koukosias, Vasileios Tzanidakis, Sotiris Athanasiou, Kostas Kolomvatsos

发表机构 * Department of Informatics and Telecommunications, University of Thessaly(色萨利大学信息与电信系)

AI总结 提出一种集成先进神经网络架构和大规模训练数据的深度学习模型,用于预测工业电化学储能系统的未来状态和性能,以支持预测性维护和能源资源优化分配。

Comments 11 pages, 11 figures, Journal

详情
AI中文摘要

准确预测电池健康指标(包括剩余容量和寿命)对于确保电动汽车和大规模储能基础设施等应用的可靠性、安全性和运行效率至关重要。预测结果可用于构建先进的监测机制,持续检查电池健康状态,以协助众多应用的高效实时管理。本研究探讨了用于预测工业电化学储能系统未来状态和性能的深度学习(DL)模型的开发与实现。为应对这一挑战,我们提出了一种专用计算框架,该框架将先进的神经网络架构与大规模训练数据集相结合,能够精确建模电池退化动态和运行趋势。所提出的方法为电池的最优管理提供了决策支持机制,促进了预测性维护和能源资源的高效分配。我们的研究结果凸显了基于深度学习的预测建模在推动可持续和智能能源管理系统发展方面的巨大潜力。

英文摘要

Accurate forecasting of battery health indicators, including remaining capacity and lifetime, is of paramount importance for ensuring the reliability, safety, and operational efficiency of applications such as electric vehicles and large-scale energy storage infrastructures. The forecasts can be used to build an advanced monitoring mechanism that continuously checks battery health status and assists in the efficient real-time management of numerous applications. This research investigates the development and implementation of a Deep Learning (DL) model for the prediction of the future state and performance of industrial electrochemical energy storage systems. To address this challenge, we propose a dedicated computational framework that integrates advanced neural network architectures with large-scale training datasets, enabling precise modeling of battery degradation dynamics and operational trends. The proposed approach provides a decision support mechanism for the optimal management of batteries, facilitating both predictive maintenance and the efficient allocation of energy resources. Our findings highlight the potential of DL-based predictive modeling to significantly contribute to the advancement of sustainable and intelligent energy management systems.

2605.00222 2026-05-29 cs.LG physics.chem-ph 版本更新

CompleteRXN: Toward Completing Open Chemical Reaction Databases

CompleteRXN:迈向完整开放化学反应数据库

Gabriel Vogel, Minouk Noordsij, Evgeny Pidko, Jana M. Weber

发表机构 * Department of Intelligent Systems(智能系统系) Delft University of Technology(代尔夫特理工大学) Department of Chemical Engineering(化学工程系)

AI总结 针对化学反应数据库(如USPTO)普遍存在的不完整问题,提出CompleteRXN基准和约束反应平衡器(CRB)模型,通过监督学习和约束解码实现高精度的反应补全。

详情
AI中文摘要

诸如USPTO等化学反应数据集存在严重的不完整性,经常缺失副产物、共反应物和化学计量系数。这限制了它们在下游应用中的适用性和可靠性。在此,我们介绍CompleteRXN,一个在现实缺失数据条件下用于反应补全的大规模监督基准。通过将USPTO记录映射到精心整理的机理反应,我们构建了一个对齐的不完整和原子平衡反应数据集。我们评估了代表性基线方法,包括一种新颖的具有约束解码的编码器-解码器反应补全模型——约束反应平衡器(CRB),以及最近的算法方法SynRBL。在我们的CompleteRXN基准上,CRB在难度递增的划分上实现了高性能,在随机划分上达到99.20%的等价准确率,在极端分布外划分上达到91.12%。SynRBL生成了许多平衡且化学上合理的补全结果,但在基准测试划分上的准确率较低。在所有方法中,性能随着不完整程度的增加而下降。当在基准之外(完整的未整理USPTO)评估反应时,我们观察到性能大幅下降,这突显了基准性能与实际鲁棒性之间的差距,并激励了未来的工作。

英文摘要

Chemical reaction datasets such as USPTO suffer from substantial incompleteness, frequently missing byproducts, co-reactants, and stoichiometric coefficients. This limits their applicability and reliability in downstream applications. Here, we introduce CompleteRXN, a large-scale supervised benchmark for reaction completion under realistic missing-data conditions. We construct a dataset of aligned incomplete and atom-balanced reactions by mapping USPTO records to curated mechanistic reactions. We evaluate representative baselines, including a novel encoder-decoder reaction completion model with constrained decoding, the Constrained Reaction Balancer (CRB), and a recent algorithmic method, SynRBL. On our CompleteRXN benchmark, the CRB achieves high performance across splits of increasing difficulty, reaching 99.20% equivalence accuracy on the random split and 91.12% on the extreme out-of-distribution split. SynRBL produces many balanced and chemically plausible completions, but with lower accuracy on the benchmark test splits. Across all methods, performance degrades with increasing incompleteness. We observe a substantial drop when evaluating on reactions outside the benchmark (full uncurated USPTO), highlighting the gap between benchmark performance and practical robustness and motivating future work.
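
To make the notion of an incomplete versus atom-balanced reaction concrete, here is a small sketch of an atom-balance check. It is not the CRB model or SynRBL; it only counts atoms on each side of a reaction and reports what the recorded products fail to account for. The helper names and the restriction to flat formulas (no parentheses, charges, or isotopes) are assumptions made for illustration.

```python
import re
from collections import Counter

def atom_counts(formula: str) -> Counter:
    """Count atoms in a flat molecular formula such as 'C2H6O' (no parentheses or charges)."""
    counts = Counter()
    for element, digits in re.findall(r"([A-Z][a-z]?)(\d*)", formula):
        counts[element] += int(digits) if digits else 1
    return counts

def side_counts(side: list[tuple[int, str]]) -> Counter:
    """Sum atom counts over (coefficient, formula) pairs on one side of a reaction."""
    total = Counter()
    for coefficient, formula in side:
        for element, n in atom_counts(formula).items():
            total[element] += coefficient * n
    return total

def missing_atoms(reactants, products) -> dict[str, int]:
    """Return atoms present in the reactants but unaccounted for in the products."""
    left, right = side_counts(reactants), side_counts(products)
    return {el: left[el] - right[el] for el in left if left[el] > right[el]}

# Esterification recorded without its water byproduct.
reactants = [(1, "C2H4O2"), (1, "C2H6O")]  # acetic acid + ethanol
products = [(1, "C4H8O2")]                 # ethyl acetate only
gap = missing_atoms(reactants, products)
```

Here `gap` contains two hydrogens and one oxygen, which corresponds to the omitted water molecule. A check like this only detects imbalance; proposing the chemically correct missing species is the harder completion task the benchmark targets.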

2604.27272 2026-05-29 cs.CL cs.AI cs.LG 版本更新

When 2D Tasks Meet 1D Serialization: On Serialization Friction in Structured Tasks

当2D任务遇到1D序列化:结构化任务中的序列化摩擦

Chung-Hsiang Lo, Lu Li, Diji Yang, Tianyu Zhang, Yunkai Zhang, Yoshua Bengio, Yi Zhang

发表机构 * Northeastern University(东北大学) University of Pennsylvania(宾夕法尼亚大学) UC Santa Cruz(加州大学圣克鲁兹分校) Mila - Quebec AI Institute(魁北克人工智能研究所) University of Montreal(蒙特利尔大学) BAIR, UC Berkeley(加州大学伯克利分校BAIR实验室)

AI总结 研究通过矩阵转置、康威生命游戏和LU分解三个任务,发现将二维布局任务序列化为一维文本会因表示不匹配导致性能下降,且错误呈现空间结构模式。

详情
AI中文摘要

在LLM时代,许多符号化和结构化问题通过一维文本序列化呈现给模型。然而,其中一些问题本质上是二维的:它们的相关关系,如行列对应或空间邻接,由二维布局中的位置定义,而非顺序。这引发了一个表示问题:在一维序列中保留相同的符号条目是否也保留了计算所需的关系结构?我们通过序列化摩擦的视角研究这一问题:即相同底层任务实例和条目仍然存在,但依赖于布局的关系在一维序列化下变得隐式的表示不匹配。本研究使用三个受控合成测试任务:矩阵转置、康威生命游戏和LU分解。在每个任务中,相同的实例要么作为一维文本序列化呈现,要么作为其原生二维布局渲染为图像呈现。在整个测试集中,随着任务规模增长,一维序列化的性能下降更显著,且序列化下的错误呈现空间结构模式,表明这种呈现选择在我们的测试集中具有重要影响。为了进一步解释这些结果,我们添加了补充分析,包括视觉内探针以及混合训练转置设置下两种输入呈现的额外比较。这些发现表明,对于布局定义的任务,将输入简化为1D序列化并非中性的表示选择。

英文摘要

In the LLM era, many symbolic and structured problems are presented to models through 1D text serialization. Yet some such problems are natively two-dimensional: their relevant relations, such as row--column correspondence or spatial adjacency, are defined by position in a 2D layout rather than by sequential order. This raises a representational question: does preserving the same symbolic entries in a 1D sequence also preserve the relational structure needed for computation? We study this issue through the lens of serialization friction: the representational mismatch in which the same underlying task instances and entries are still present, but relations that depend on layout become implicit under 1D serialization. The study uses a controlled synthetic testbed of three tasks: matrix transpose, Conway's Game of Life, and LU decomposition. In each task, the same instances are presented either as 1D text serialization or as their native 2D layout rendered as an image. Across this testbed, 1D serialization degrades more sharply as task size grows, and errors under serialization exhibit spatially structured patterns, suggesting that this presentation choice is consequential within our testbed. To further interpret these results, we add supplementary analyses that include a within-visual probe and an additional comparison of the two input presentations under the mixed-training transpose setting. These findings suggest that, for layout-defined tasks, reducing inputs to 1D serialization is not a neutral choice of representation.
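
The idea that layout-dependent relations become implicit under 1D serialization can be seen in a few lines. The sketch below assumes a plain row-major flattening, which is one common serialization and not necessarily the exact text format used in the paper; the function names are illustrative.

```python
def serialize_row_major(matrix):
    """Flatten a 2D list into the 1D order a row-by-row text prompt would use."""
    return [value for row in matrix for value in row]

def flat_index(i, j, n_cols):
    """Position of entry (i, j) in the row-major sequence."""
    return i * n_cols + j

n = 4
matrix = [[i * n + j for j in range(n)] for i in range(n)]
flat = serialize_row_major(matrix)

# Horizontal neighbours stay adjacent in 1D; vertical neighbours end up n positions apart.
horizontal_gap = flat_index(1, 2, n) - flat_index(1, 1, n)
vertical_gap = flat_index(2, 1, n) - flat_index(1, 1, n)

# Transpose partners (i, j) and (j, i) sit |i - j| * (n - 1) positions apart.
transpose_gap = abs(flat_index(0, 3, n) - flat_index(3, 0, n))
```

Every entry survives the flattening, but column adjacency and row-column correspondence now have to be recovered from index arithmetic that grows with `n`, which is consistent with the abstract's observation that degradation sharpens as task size grows.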

2604.26645 2026-05-29 cs.AI cs.LG 版本更新

SciHorizon-DataEVA: An Agentic System for AI-Readiness Evaluation of Heterogeneous Scientific Data

SciHorizon-DataEVA:面向异构科学数据AI就绪性评估的智能体系统

Dianyu Liu, Chuan Qin, Xi Chen, Xiaohan Li, Wenxi Xu, Yuyang Wang, Xin Chen, Yuanchun Zhou, Hengshu Zhu

发表机构 * SciHorizon Team, Computer Network Information Center, Chinese Academy of Sciences(科学前沿团队,计算机网络信息中心,中国科学院)

AI总结 提出SciHorizon-DataEVA智能体系统,基于Sci-TQA2原则和层次化多智能体评估方法,实现对异构科学数据的可扩展AI就绪性评估。

详情
AI中文摘要

AI-for-Science (AI4Science) 正通过将机器学习模型嵌入跨领域的预测、模拟和假设生成工作流程,日益变革科学发现。然而,这些模型的有效性从根本上受到科学数据AI就绪性的限制,目前尚不存在可扩展且系统的评估机制。在这项工作中,我们提出了SciHorizon-DataEVA,一种新颖的智能体系统,用于对异构科学数据进行可扩展的AI就绪性评估。在评估标准层面,我们引入了Sci-TQA2原则,将AI就绪性组织为四个互补维度:治理可信度、数据质量、AI兼容性和科学适应性。每个维度被分解为可测量的原子元素,以实现细粒度且可执行的评估。为了大规模实施这些原则,我们开发了Sci-TQA2-Eval,一种通过有向循环工作流编排的层次化多智能体评估方法。我们的Sci-TQA2-Eval通过结合轻量级数据集分析、适用性感知的度量激活以及基于领域约束和数据集-论文信号的知识增强规划,动态构建数据集感知的评估规范。这些规范通过自适应的、以工具为中心的评估机制执行,该机制具有内置的验证和自我修正能力,从而实现对异构科学数据的可扩展且可靠的评估。在跨多个领域的科学数据集上的广泛实验证明了SciHorizon-DataEVA在原则性AI就绪性评估方面的有效性和通用性。

英文摘要

AI-for-Science (AI4Science) is increasingly transforming scientific discovery by embedding machine learning models into prediction, simulation, and hypothesis generation workflows across domains. However, the effectiveness of these models is fundamentally constrained by the AI-readiness of scientific data, for which no scalable and systematic evaluation mechanism currently exists. In this work, we propose SciHorizon-DataEVA, a novel agentic system for scalable AI-readiness evaluation of heterogeneous scientific data. At the evaluation-criteria level, we introduce the Sci-TQA2 principles, which organize AI-readiness into four complementary dimensions: Governance Trustworthiness, Data Quality, AI Compatibility, and Scientific Adaptability. Each dimension is decomposed into measurable atomic elements that enable fine-grained and executable assessment. To operationalize these principles at scale, we develop Sci-TQA2-Eval, a hierarchical multi-agent evaluation approach orchestrated through a directed, cyclic workflow. Our Sci-TQA2-Eval dynamically constructs dataset-aware evaluation specifications by combining lightweight dataset profiling, applicability-aware metric activation, and knowledge-augmented planning grounded in domain constraints and dataset-paper signals. These specifications are executed through an adaptive, tool-centric evaluation mechanism with built-in verification and self-correction, enabling scalable and reliable assessment across heterogeneous scientific data. Extensive experiments on scientific datasets spanning multiple domains demonstrate the effectiveness and generality of SciHorizon-DataEVA for principled AI-readiness evaluation.

2604.23862 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Graph Memory Transformer (GMT)

图记忆Transformer (GMT)

Nicola Zanarini, Niccolò Ferrari, Evelina Lamma

发表机构 * Bonfiglioli Engineering s.r.l.(博尼菲利工程公司) Department of Engineering, University of Ferrara(费拉拉大学工程学院) NAIS s.r.l.(NAIS公司)

AI总结 提出用显式学习的记忆图替换解码器-only Transformer中的前馈网络子层,保留自回归架构,实现可解释的记忆导航。

Comments 65 pages, 10 figures, 5 tables. Author list updated in arXiv metadata; no technical changes. Code available at https://github.com/Nemesis533/GMT-GraphMemoryTransformer

详情
AI中文摘要

我们研究是否可以在解码器-only Transformer中,用显式学习的记忆图替换前馈网络(FFN)子层,同时保留周围的自回归架构。所提出的图记忆Transformer(GMT)保持因果自注意力不变,但将通常的逐token FFN变换替换为一个记忆单元,该单元通过一个由学习的有向转移矩阵连接的质心库来路由token表示。在此处研究的基础GMT v7实例中,16个Transformer块中的每个块包含128个质心、一个128*128的边矩阵、引力源路由、token条件目标选择以及门控位移读出。因此,该单元返回从估计的源记忆状态到目标记忆状态的移动,而不是检索到的值。由此产生的模型是一个完全解码器-only的语言模型,具有82.2M可训练参数且没有密集的FFN子层,而评估中使用的密集GPT风格基线有103.0M参数。基础v7模型训练稳定,并将质心使用、转移结构和源到目标移动作为前向计算中可直接检查的量。在验证损失和困惑度方面,它落后于较大的密集基线(3.5995/36.58 vs. 3.2903/26.85),但在评估设置下显示出接近的零样本基准表现。这些结果并非旨在声称最先进性能;它们支持用图介导的记忆导航替换密集的token内变换的可行性和结构可解释性。更广泛的扩展、优化的内核以及更广泛的基准评估留待后续工作。

英文摘要

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.

2604.19011 2026-05-29 cs.LG cs.RO 版本更新

Accelerating trajectory optimization with Sobolev-trained diffusion policies

基于Sobolev训练的扩散策略加速轨迹优化

Théotime Le Hellard, Franki Nguimatsia Tiofack, Quentin Le Lidec, Justin Carpentier

发表机构 * Inria - Département d’Informatique de l’École normale supérieure, PSL Research University(法国国家科学研究中心-巴黎高等师范学院计算机系,PSL研究大学) Courant Institute, New York University(纽约大学Courant研究所)

AI总结 针对梯度型轨迹优化求解器,提出利用Sobolev学习训练扩散策略以提供初始猜测,通过利用轨迹和反馈增益的一阶损失避免复合误差,实现求解时间减少2至20倍。

详情
AI中文摘要

轨迹优化求解器利用已知系统动力学通过迭代改进计算局部最优轨迹。其缺点是每个新问题实例独立求解,因此收敛速度和求解质量依赖于初始轨迹。为提高效率,一种自然的方法是用学习策略生成的初始猜测对轨迹优化进行热启动,该策略在求解器先前生成的轨迹上训练。基于扩散的策略最近成为表达性模仿学习模型,使其成为这一角色的有前途候选者。然而,一个反直觉的挑战来自轨迹优化示范的局部最优性:当策略展开时,小的非最优偏差可能将其推入训练数据中未表示的情况,从而在长时域上引发复合误差。在这项工作中,我们专注于基于学习的热启动,用于同时提供反馈增益的梯度型轨迹优化求解器。利用这一特性,我们推导出一阶损失,用于使用轨迹和反馈增益对基于扩散的策略进行Sobolev学习。通过全面实验,我们证明所得策略避免了复合误差,因此可以从非常少的轨迹中学习,提供初始猜测,将求解时间减少2倍到20倍。结合一阶信息使得用更少的扩散步骤进行预测成为可能,从而降低推理延迟。

英文摘要

Trajectory Optimization (TO) solvers exploit known system dynamics to compute locally optimal trajectories through iterative improvements. A downside is that each new problem instance is solved independently; therefore, convergence speed and quality of the solution found depend on the initial trajectory proposed. To improve efficiency, a natural approach is to warm-start TO with initial guesses produced by a learned policy trained on trajectories previously generated by the solver. Diffusion-based policies have recently emerged as expressive imitation learning models, making them promising candidates for this role. Yet, a counterintuitive challenge comes from the local optimality of TO demonstrations: when a policy is rolled out, small non-optimal deviations may push it into situations not represented in the training data, triggering compounding errors over long horizons. In this work, we focus on learning-based warm-starting for gradient-based TO solvers that also provide feedback gains. Exploiting this specificity, we derive a first-order loss for Sobolev learning of diffusion-based policies using both trajectories and feedback gains. Through comprehensive experiments, we demonstrate that the resulting policy avoids compounding errors, and so can learn from very few trajectories to provide initial guesses reducing solving time by $2\times$ to $20 \times$. Incorporating first-order information enables predictions with fewer diffusion steps, reducing inference latency.
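
The first-order (Sobolev) loss mentioned above can be sketched as a value-matching term plus a derivative-matching term. The example below uses a linear policy as a stand-in for the diffusion policy, because its Jacobian is simply its weight matrix. The names `sobolev_loss`, `target_jac`, and `weight` are hypothetical, and how the target Jacobian is obtained from the solver's feedback gains (including the sign convention) is an assumption left outside the sketch.

```python
def linear_policy(W, b, x):
    """Evaluate u = W x + b for a linear policy; its Jacobian du/dx is W itself."""
    return [sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i for row, b_i in zip(W, b)]

def squared_norm(entries):
    """Sum of squares over an iterable of scalars."""
    return sum(e * e for e in entries)

def sobolev_loss(W, b, x, target_u, target_jac, weight=1.0):
    """Zeroth-order control error plus weighted first-order (Jacobian) error."""
    u = linear_policy(W, b, x)
    value_err = squared_norm(ui - ti for ui, ti in zip(u, target_u))
    jac_err = squared_norm(
        w - t for w_row, t_row in zip(W, target_jac) for w, t in zip(w_row, t_row)
    )
    return value_err + weight * jac_err

W = [[1.0, 0.0], [0.0, 1.0]]
b = [0.0, 0.0]
x = [1.0, 2.0]
target_u = [1.0, 2.0]                  # control matched exactly at this state
target_jac = [[1.0, 0.5], [0.0, 1.0]]  # local slope differs in one entry
loss = sobolev_loss(W, b, x, target_u, target_jac, weight=2.0)
```

In this toy case the policy reproduces the demonstrated control but not its local slope, so the loss comes entirely from the Jacobian term. Penalizing that mismatch is what lets a policy stay close to the demonstrations under small deviations, which is the compounding-error issue the abstract describes.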

2603.23234 2026-05-29 cs.AI cs.LG 版本更新

MemCollab: Cross-Model Memory Collaboration via Contrastive Trajectory Distillation

MemCollab:通过对比轨迹蒸馏实现跨模型记忆协作

Yurui Chang, Yiran Wu, Qingyun Wu, Lu Lin

发表机构 * Pennsylvania State University(宾夕法尼亚州立大学) AG2AI

AI总结 针对不同骨干模型代理间共享记忆性能下降的问题,提出MemCollab框架,通过对比同一任务上不同模型生成的推理轨迹来蒸馏共享的抽象推理约束,并引入任务感知检索机制,提升异构代理的准确性和推理效率。

详情
AI中文摘要

LLM代理越来越依赖记忆机制来重用过去问题解决经验中的知识。然而,现有方法通常为单个代理构建记忆,并与同一底层模型重用,将存储的知识紧密耦合到特定模型的推理风格。在异构部署中,代理可能使用不同大小、架构或专业化的骨干模型实例化,这引发了一个关键问题:一个单一的记忆系统能否在不同骨干模型的代理之间共享?我们发现,简单的跨模型记忆传输可能会降低性能,因为存储的记忆常常将任务相关知识纠缠到模型特定的偏见中。为了解决这一挑战,我们提出了MemCollab,一个协作记忆框架,通过对比不同基于模型的代理在同一任务上生成的推理轨迹来构建共享的跨模型记忆。通过这一对比过程,MemCollab蒸馏出捕获共享任务级不变量的抽象推理约束,同时抑制模型特定的伪影。我们进一步引入了一种任务感知检索机制,根据任务类别调节记忆访问,确保在推理时只检索相关的约束。在数学推理和代码生成基准上的实验表明,MemCollab在不同代理(包括不同模型族设置)上一致地提高了准确性和推理效率。这些结果表明,协作构建的跨模型记忆可以作为异构基于LLM的代理的共享推理资源。

英文摘要

LLM agents increasingly rely on memory mechanisms to reuse knowledge from past problem-solving experiences. However, existing methods typically construct memory for a single agent and reuse it with the same underlying model, tightly coupling stored knowledge to model-specific reasoning styles. In heterogeneous deployments, where agents may be instantiated with backbone models of different sizes, architectures, or specializations, this raises a key question: can a single memory system be shared across agents with different backbone models? We find that naive cross-model memory transfer can degrade performance, because stored memories often entangle task-relevant knowledge with model-specific biases. To address this challenge, we propose MemCollab, a collaborative memory framework that builds shared cross-model memory by contrasting reasoning trajectories generated by different model-based agents on the same task. Through this contrastive process, MemCollab distills abstract reasoning constraints that capture shared task-level invariants while suppressing model-specific artifacts. We further introduce a task-aware retrieval mechanism that conditions memory access on task category, ensuring that only relevant constraints are retrieved at inference time. Experiments on mathematical reasoning and code generation benchmarks show that MemCollab consistently improves both accuracy and inference-time efficiency across diverse agents, including settings with different model families. These results demonstrate that collaboratively constructed cross-model memory can serve as a shared reasoning resource for heterogeneous LLM-based agents.

2603.21621 2026-05-29 cs.LG 版本更新

Path-Space Mirror Descent for On-Policy Reinforcement Learning under the Generalized Schrödinger Bridge

广义薛定谔桥下在线强化学习的路径空间镜像下降

Yuehu Gong, Zeyuan Wang, Yulin Chen, Shutong Ding, Qingyuan Zhou, Yanwei Fu

发表机构 * School of Data Science, Fudan University(复旦大学数据科学学院) Laboratory for Big Data and Decision, National University of Defense Technology(国防科技大学大数据与决策实验室) ShanghaiTech University(上海科技大学) College of Computer Science and Artificial Intelligence, Fudan University(复旦大学计算机科学与人工智能学院) Shanghai Innovation Institute(上海创新研究院)

AI总结 针对生成式策略在在线策略优化中终端动作密度难处理的问题,提出GSB-MDPO方法,通过将策略优化建模为广义薛定谔桥问题并利用路径空间KL散度作为近端项,实现了无需显式终端似然评估的稳定更新。

详情
AI中文摘要

经典的在线策略算法如PPO和镜像下降策略优化通过可处理的动作似然提供稳定的近端策略更新,但通常使用简单的Gaussian策略,其在复杂连续控制任务中的表达能力有限。基于扩散和流模型的生成式策略提供了更具表达力的动作分布,但它们自然地定义了多步去噪路径上的分布,其终端动作密度通常是难以处理的,这造成了与基于似然的在线策略近端更新的不匹配。为了解决这种不匹配,我们引入了GSB-MDPO(广义薛定谔桥镜像下降策略优化),它将在线策略生成式策略优化表述为状态条件生成路径上的广义薛定谔桥问题,并通过镜像下降策略优化实例化得到的路径测度更新。关键洞察是GSB路径空间KL散度在MDPO中扮演了近端项的角色,同时上界了终端动作KL散度,从而无需显式终端动作似然评估即可直接控制执行的动作分布。在Playground和Gym-MuJoCo上的14个连续控制任务上的实验证明了GSB-MDPO的经验有效性,并支持路径空间正则化作为多步生成式策略的原则性近端更新。

英文摘要

Classical on-policy algorithms such as PPO and mirror descent policy optimization provide stable proximal policy updates through tractable action likelihoods, but are typically instantiated with simple Gaussian policies whose expressiveness can be limited in complex continuous-control tasks. Generative policies based on diffusion and flow models provide more expressive action distributions, but they naturally define distributions over multi-step denoising paths whose terminal action density is often intractable, creating a mismatch with likelihood-based on-policy proximal updates. To address this mismatch, we introduce GSB-MDPO (Generalized Schrödinger Bridge Mirror Descent Policy Optimization), which formulates on-policy generative policy optimization as a Generalized Schrödinger Bridge problem over state-conditioned generation paths and instantiates the resulting path-measure update through mirror descent policy optimization. The key insight is that the GSB path-space KL plays the role of the proximal term in MDPO while upper-bounding the terminal action KL, enabling direct control of the executed action distribution without explicit terminal action likelihood evaluation. Experiments on 14 continuous-control tasks across Playground and Gym-MuJoCo demonstrate the empirical effectiveness of GSB-MDPO and support path-space regularization as a principled proximal update for multi-step generative policies.

2603.20329 2026-05-29 stat.ML cs.LG math.PR 版本更新

Measure flow path recovery in Bayes Hilbert spaces

贝叶斯希尔伯特空间中的测度流路径恢复

S. David Mis, Maarten V. de Hoop

发表机构 * Rice University(里士大学)

AI总结 针对有限移动局部传感器恢复概率测度流的不适定问题,提出基于贝叶斯希尔伯特框架的变分理论,通过构造最小能量传输实现和线性化观测算子,分析可恢复性条件,并发展有限维约化方法实现稳定重建。

详情
AI中文摘要

我们研究使用贝叶斯希尔伯特框架从有限个移动局部传感器恢复概率测度流的不适定问题。相对于固定的参考概率测度,概率律由其中心化对数比坐标表示,因此演化律成为希尔伯特函数空间中的一条路径。对于足够正则的贝叶斯希尔伯特路径,我们通过在每个时间点求解加权纽曼问题,构造路径的规范最小能量传输实现,得到切方向上的内在传输形式。然后,我们直接在贝叶斯希尔伯特路径空间上制定逆问题。观测算子的线性化产生可观测性形式,可恢复性由其与传输几何通过联合传输-可观测性形式的相互作用决定。在无穷维环境中,我们发展了正则化变分理论,并识别了局部传感器的局限性:移动传感器可以使联合形式单射,但通常不能在整个状态空间上产生强制稳定性估计。这一障碍自然导致有限维贝叶斯希尔伯特约化。在那里,传输形式成为动能张量,线性化观测成为约化感知矩阵,因此可恢复性可以通过显式的格拉姆条件表达。我们证明局部凸起传感器检测每个固定的约化方向,有限个适当放置的静态传感器产生均匀的约化可观测性,并且存在依赖于路径的传感器轨迹,使得即使单个移动传感器也能恢复约化路径。最后,我们证明这些约化恢复结果可以提升到对由所选有限维子空间良好近似的路径的近似环境恢复,从而实现稳定重建至投影误差。

英文摘要

We study the ill-posed problem of recovering a probability measure flow from finitely many moving localized sensors using a Bayes Hilbert framework. Relative to a fixed reference probability measure, a probability law is represented by its centered log-ratio coordinates, so that an evolving law becomes a path in a Hilbert space of functions. For sufficiently regular Bayes Hilbert paths, we construct a canonical minimum-energy transport realization of the path by solving a weighted Neumann problem at each time, yielding an intrinsic transport form on tangent directions. We then formulate an inverse problem directly on Bayes Hilbert path space. Linearization of an observation operator yields an observability form, and recoverability is governed by its interaction with the transport geometry through a joint transport--observability form. In the ambient infinite-dimensional setting, we develop a regularized variational theory and identify limitations of localized sensing: mobile sensors can make the joint form injective, but they do not in general yield a coercive stability estimate on the full state space. This obstruction leads naturally to finite-dimensional Bayes Hilbert reductions. There the transport form becomes a kinetic tensor and the linearized observations become reduced sensing matrices, so recoverability can be expressed through explicit Gramian conditions. We show that localized bump sensors detect every fixed reduced direction, that finitely many suitably placed static sensors yield uniform reduced observability, and there exist path-dependent sensor trajectories such that even a single moving sensor can recover the reduced path. Finally, we show that these reduced recovery results lift to approximate ambient recovery for paths that are well approximated by the chosen finite-dimensional subspaces, yielding stable reconstruction up to projection error.

2603.14315 2026-05-29 cs.LG math.OC 版本更新

Enhancing LLM Training via Spectral Clipping

通过谱裁剪增强大语言模型训练

Xiaowen Jiang, Andrei Semenov, Sebastian U. Stich

发表机构 * CISPA Helmholtz Center for Information Security(CISPA亥姆霍兹信息安全中心) EPFL(洛桑联邦理工学院)

AI总结 提出SPECTRA框架,通过对优化器更新进行谱裁剪以约束谱范数、对梯度进行预谱裁剪以抑制噪声尖峰,从而提升多种优化器在大语言模型预训练中的性能。

Comments v2: ICML 2026

详情
AI中文摘要

虽然基于谱的优化器(如Muon)直接对更新的谱进行操作,但标准自适应方法(如AdamW)没有考虑权重和梯度的谱结构,使它们容易受到大语言模型(LLM)训练中两个经验问题的影响:(i)优化器更新可能具有较大的谱范数,可能破坏训练稳定性并降低泛化能力;(ii)随机梯度噪声可能表现出稀疏的谱尖峰,少数主导奇异值远大于其余值。我们提出SPECTRA,一个通用框架,通过(i)对更新进行后谱裁剪以施加谱范数约束,(ii)可选地对梯度进行预谱裁剪以抑制谱噪声尖峰,来解决这些问题。我们证明后谱裁剪构成了一种具有谱范数约束和权重正则化的复合Frank-Wolfe方法。我们进一步分析了预谱裁剪如何缓解稀疏谱尖峰。我们通过Newton-Schulz迭代提出了高效的软谱裁剪,避免了昂贵的SVD。在LLM预训练上的实验表明,SPECTRA对各种优化器(包括AdamW、Signum、Mars和AdEMAMix)一致地改善了验证损失,其中表现最佳的变体达到了最先进的结果。使用SPECTRA训练的模型表现出更小的权重范数,证实了谱裁剪与正则化之间的联系。

英文摘要

While spectral-based optimizers like Muon operate directly on the spectrum of updates, standard adaptive methods such as AdamW do not account for the spectral structure of weights and gradients, leaving them vulnerable to two empirical issues in large language model (LLM) training: (i) the optimizer updates can have large spectral norms, potentially destabilizing training and degrading generalization; (ii) stochastic gradient noise can exhibit sparse spectral spikes, with a few dominant singular values much larger than the rest. We propose SPECTRA, a general framework that addresses these issues by (i) post-spectral clipping of updates to enforce spectral-norm constraints and (ii) optional pre-spectral clipping of gradients to suppress spectral noise spikes. We prove that post-clipping constitutes a Composite Frank-Wolfe method with spectral-norm constraints and weight regularization. We further analyze how pre-clipping mitigates sparse spectral spikes. We propose efficient soft spectral clipping via Newton-Schulz iterations, avoiding expensive SVD. Experiments on LLM pretraining show SPECTRA uniformly improves validation loss for various optimizers, including AdamW, Signum, Mars, and AdEMAMix, with the best-performing variants achieving state-of-the-art results. Models trained with SPECTRA exhibit smaller weight norms, confirming the link between spectral clipping and regularization.
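
As a reference point for what spectral clipping of an update matrix does, the following sketch caps singular values with an explicit SVD. This is deliberately the expensive route that the abstract says SPECTRA avoids: the paper's soft clipping via Newton-Schulz iterations is not reproduced here, and the function name `spectral_clip` and the threshold `max_sv` are assumptions for illustration.

```python
import numpy as np

def spectral_clip(update: np.ndarray, max_sv: float) -> np.ndarray:
    """Cap every singular value of `update` at `max_sv`, keeping the singular vectors.

    Hard clipping via a full SVD, used only as a readable baseline.
    """
    u, s, vt = np.linalg.svd(update, full_matrices=False)
    return (u * np.minimum(s, max_sv)) @ vt

rng = np.random.default_rng(0)
update = rng.normal(size=(6, 4))
clipped = spectral_clip(update, max_sv=1.0)
spectral_norm = np.linalg.norm(clipped, ord=2)
```

After clipping, the spectral norm of the update is at most `max_sv`, while an update whose singular values are already below the threshold passes through unchanged. The same operation applied to gradients before the optimizer step would correspond to the optional pre-clipping stage.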

2603.11331 2026-05-29 cs.LG cs.AI 版本更新

Jailbreak Scaling Laws for Large Language Models: Polynomial-Exponential Crossover

大型语言模型的越狱缩放定律:多项式-指数交叉

Indranil Halder, Annesya Banerjee, Cengiz Pehlevan

发表机构 * John A. Paulson School of Engineering And Applied Sciences, Harvard University(哈佛大学约翰·A·保罗森工程与应用科学学院) Department of Brain and Cognitive Sciences, Massachusetts Institute of Technology(麻省理工学院脑科学与认知科学系) Speech and Hearing Bioscience and Technology, Harvard Medical School(哈佛医学院语音与听力生物科学与技术系) Kempner Institute for the Study of Natural and Artificial Intelligence, Harvard University(哈佛大学自然与人工智能研究学院) Center for Brain Science, Harvard University(哈佛大学脑科学中心)

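AI summary: Finds that adversarial prompt injection can turn attack success rate from the slow polynomial growth seen without injection into exponential growth with the number of inference-time samples, and explains the effect theoretically with a spin-glass model.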

详情
AI中文摘要

对抗性攻击可以可靠地将安全对齐的大型语言模型引导至不安全行为。经验上,我们发现对抗性提示注入攻击可以将攻击成功率从无注入时观察到的缓慢多项式增长放大为随推理样本数指数增长。我们首先通过一组关于上下文安全生成分布的最小假设,确定了这两种机制的统计基础,并推导出两种缩放定律。为了进一步解释这一现象,我们提出了一个基于自旋玻璃系统的代理语言理论生成模型,该系统处于复制对称破缺状态,生成样本来自相关的吉布斯测度,并将低能、有偏大小的子集标记为不安全。我们分析展示了该模型如何自然实现最小假设。短注入提示对应于指向不安全簇中心的弱磁场,导致攻击成功率随推理样本数呈幂律缩放;而长注入提示(即强磁场)则导致指数缩放。我们在参数规模从3B到70B的广泛大型语言模型中观察到了定性一致的行为。特别是,主要趋势在多种攻击方法(如GCG和AutoDAN)以及基准数据集(如AdvBench和HarmBench)中保持稳定。

英文摘要

Adversarial attacks can reliably steer safety-aligned large language models toward unsafe behavior. Empirically, we find that adversarial prompt-injection attacks can amplify attack success rate from the slow polynomial growth observed without injection to exponential growth with the number of inference-time samples. We first identify a minimal statistical mechanism for these two regimes by giving a small set of assumptions on the distribution of safe generation across contexts under which both scaling laws follow. To explain this phenomenon further, we propose a theoretical generative model of proxy language in terms of a spin-glass system operating in a replica-symmetry-breaking regime, where generations are drawn from the associated Gibbs measure and a subset of low-energy, size-biased clusters is designated unsafe. We analytically show how this model naturally realizes the minimal assumptions. Short injected prompts correspond to a weak magnetic field aligned towards unsafe cluster centers and yield a power-law scaling of attack success rate with the number of inference-time samples, while long injected prompts, i.e., strong magnetic field, yield exponential scaling. We observe qualitatively consistent behavior across a broad range of large language models, spanning parameter scales from 3B to 70B. In particular, the main trends remain stable across multiple attack methods, such as GCG and AutoDAN, as well as across benchmark datasets such as AdvBench and HarmBench.
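
One way to see how two scaling regimes can arise from sampling alone is a toy best-of-n calculation. The sketch below is an illustration under stated assumptions, not the paper's model: it assumes independent samples, and it assumes the per-sample success probability is either a single fixed value or spread across prompts as a Beta(a, b) distribution (both choices are hypothetical). With a fixed probability the remaining failure rate decays exponentially in n; with a Beta spread that puts mass near zero it decays only polynomially, roughly as n^(-a).

```python
import math


def asr_fixed(p: float, n: int) -> float:
    """Attack success rate after n independent samples when every prompt
    shares the same per-sample success probability p."""
    return 1.0 - (1.0 - p) ** n


def asr_beta(a: float, b: float, n: int) -> float:
    """Attack success rate after n samples when the per-sample success
    probability varies across prompts as Beta(a, b) (an assumption).

    Uses E[(1 - p)^n] = B(a, b + n) / B(a, b), evaluated with log-gamma
    for numerical stability.
    """
    log_fail = (math.lgamma(b + n) + math.lgamma(a + b)
                - math.lgamma(b) - math.lgamma(a + b + n))
    return 1.0 - math.exp(log_fail)
```

Comparing `1 - asr_fixed(0.05, n)` with `1 - asr_beta(0.5, 9.5, n)` for growing n shows the contrast; both settings have the same mean per-sample success probability of 0.05.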

2603.10474 2026-05-29 cs.LG cs.NE cs.RO 版本更新

Muscle Synergy Priors Enhance Biomechanical Fidelity in Predictive Musculoskeletal Locomotion Simulation

肌肉协同先验增强预测性肌肉骨骼运动模拟的生物力学保真度

Ilseung Park, Eunsik Choi, Jangwhan Ahn, Jooeun Ahn

发表机构 * Department of Mechanical Engineering(机械工程系) Carnegie Mellon University(卡内基梅隆大学) Department of Physical Education(体育系) Seoul National University(首尔国立大学) Lampe Joint Department of Biomedical Engineering(生物医学工程联合部门) UNC-Chapel Hill and NC State University(北卡罗来纳大学教堂山分校和北卡罗来纳州立大学)

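AI summary: Proposes a physiology-informed reinforcement learning framework that constrains control with muscle synergies, improving the biomechanical fidelity and generalization of predictive human locomotion simulation with limited experimental data.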

Comments Added a manuscript footnote stating "Project page with supplementary videos: https://ces40320.github.io/WebHomepage__Walk-RL ."

详情
AI中文摘要

人类运动源于高维神经肌肉控制,这使得预测性肌肉骨骼模拟具有挑战性。我们提出了一种生理学启发的强化学习框架,利用肌肉协同约束控制。我们从少量地面行走试验的逆肌肉骨骼分析中提取了低维协同基,并将其作为动作空间,用于训练一个肌肉驱动的三维模型,该模型在可变速度、坡度和不平坦地形上进行训练。由此产生的控制器在0.7-1.8 m/s的速度和±6°的坡度上生成了稳定的步态,并再现了关节角度、关节力矩和地面反作用力的条件依赖性调节。与无约束控制器相比,协同约束控制减少了非生理性膝关节运动学,并将膝关节力矩曲线保持在实验包络内。在各种条件下,模拟的垂直地面反作用力与人体测量值强相关,肌肉激活时间大多落在受试者间变异范围内。这些结果表明,将神经生理结构嵌入强化学习可以在有限实验数据下提高预测性人体运动模拟的生物力学保真度和泛化能力。

英文摘要

Human locomotion emerges from high-dimensional neuromuscular control, making predictive musculoskeletal simulation challenging. We present a physiology-informed reinforcement-learning framework that constrains control using muscle synergies. We extracted a low-dimensional synergy basis from inverse musculoskeletal analyses of a small set of overground walking trials and used it as the action space for a muscle-driven three-dimensional model trained across variable speeds, slopes and uneven terrain. The resulting controller generated stable gait from 0.7 to 1.8 m/s and on $\pm 6^{\circ}$ grades and reproduced condition-dependent modulation of joint angles, joint moments and ground reaction forces. Compared with an unconstrained controller, synergy-constrained control reduced non-physiological knee kinematics and kept knee moment profiles within the experimental envelope. Across conditions, simulated vertical ground reaction forces correlated strongly with human measurements, and muscle-activation timing largely fell within inter-subject variability. These results show that embedding neurophysiological structure into reinforcement learning can improve biomechanical fidelity and generalization in predictive human locomotion simulation with limited experimental data.

2603.07916 2026-05-29 cs.AI cs.DB cs.LG 版本更新

Rel-MOSS: Towards Imbalanced Relational Deep Learning on Relational Databases

Rel-MOSS:面向关系数据库中不平衡关系深度学习的解决方案

Jun Yin, Peng Huo, Bangguo Zhu, Hao Yan, Senzhang Wang, Shirui Pan, Chengqi Zhang

发表机构 * Department of Data Science and Artificial Intelligence(数据科学与人工智能系) Hong Kong Polytechnic University(香港理工大学) School of Computer Science and Engineering(计算机科学与工程学院) Central South University(中南大学) School of Information and Communication Technology(信息与通信技术学院) Griffith University(格里菲斯大学) National Super Computing Center(国家超级计算中心)

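AI summary: To address class imbalance in entity classification on relational databases, proposes the relation-centric minority synthetic over-sampling GNN (Rel-MOSS), which strengthens minority-class representations with a relation-wise gating controller and a relation-guided minority synthesizer; on 12 datasets it improves Balanced Accuracy by up to 2.46% and G-Mean by up to 4.00% on average.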

详情
AI中文摘要

在最近的进展中,为了实现关系数据库(RDB)上完全数据驱动的学习范式,提出了关系深度学习(RDL),将RDB结构化为异构实体图,并采用图神经网络(GNN)作为预测模型。然而,现有的RDL方法忽略了RDB中关系数据的不平衡问题,可能导致少数实体表示不足,从而在实践中产生不可用的模型。在这项工作中,我们首次研究了RDB实体分类中的类别不平衡问题,并设计了以关系为中心的少数类合成过采样GNN(Rel-MOSS),以填补当前文献中的关键空白。具体来说,为了缓解少数类相关信息被多数类信息淹没的问题,我们设计了关系门控控制器来调节来自每个单独关系类型的邻域消息。基于关系门控表示,我们进一步提出了用于过采样的关系引导的少数类合成器,该合成器整合了实体关系签名以保持关系一致性。在12个实体分类数据集上的大量实验为Rel-MOSS的优越性提供了令人信服的证据,与最先进的RDL方法和处理类别不平衡的经典方法相比,在平衡准确率和G-Mean上分别平均提高了2.46%和4.00%。

英文摘要

Relational deep learning (RDL) has recently been proposed to enable a fully data-driven learning paradigm on relational databases (RDBs): it structures the RDB as a heterogeneous entity graph and adopts a graph neural network (GNN) as the predictive model. However, existing RDL methods neglect the imbalance problem of relational data in RDBs and risk under-representing minority entities, leading to models that are unusable in practice. In this work, we investigate, for the first time, the class imbalance problem in RDB entity classification and design the relation-centric minority synthetic over-sampling GNN (Rel-MOSS) to fill a critical void in the current literature. Specifically, to mitigate the issue of minority-related information being submerged by majority counterparts, we design the relation-wise gating controller to modulate neighborhood messages from each individual relation type. Based on the relational-gated representations, we further propose the relation-guided minority synthesizer for over-sampling, which integrates the entity relational signatures to maintain relational consistency. Extensive experiments on 12 entity classification datasets provide compelling evidence for the superiority of Rel-MOSS, yielding average improvements of up to 2.46% in Balanced Accuracy and up to 4.00% in G-Mean over SOTA RDL methods and classic methods for handling class imbalance.
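
The two metrics reported above reward performance on every class equally, which is why they suit imbalanced data. A minimal sketch using their standard definitions (Balanced Accuracy as the arithmetic mean of per-class recall, G-Mean as the geometric mean of per-class recall); the function names are hypothetical and this is not the paper's evaluation code.

```python
from typing import Dict, Hashable, Sequence


def per_class_recall(y_true: Sequence[Hashable],
                     y_pred: Sequence[Hashable]) -> Dict[Hashable, float]:
    """Recall of each class that appears in y_true."""
    recalls = {}
    for label in set(y_true):
        support = sum(1 for t in y_true if t == label)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        recalls[label] = hits / support
    return recalls


def balanced_accuracy(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Arithmetic mean of per-class recall."""
    recalls = per_class_recall(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


def g_mean(y_true: Sequence[Hashable], y_pred: Sequence[Hashable]) -> float:
    """Geometric mean of per-class recall; zero if any class is never recovered."""
    recalls = per_class_recall(y_true, y_pred)
    product = 1.0
    for r in recalls.values():
        product *= r
    return product ** (1.0 / len(recalls))
```

A classifier that always predicts the majority class scores high on plain accuracy but gets a G-Mean of zero, since the minority recall is zero.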

2603.07860 2026-05-29 cs.LG 版本更新

Sparse Scheduled Diffusion Guidance for Inverse Problems

稀疏调度扩散引导用于逆问题

Abduragim Shtanchaev, Albina Ilina, Yazid Janati, Arip Asadulaev, Martin Takac, Eric Moulines

发表机构 * MBZUAI(穆罕默德·本·扎耶德人工智能大学) Institute of Foundation Models(基础模型研究所) EPITA

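AI summary: Proposes Spin, which starts posterior sampling from an intermediate timestep and applies lightweight corrections only at scheduled steps, solving inverse problems efficiently; on FFHQ and ImageNet it runs 2x faster on pixel-space models and up to 50x faster on latent diffusion models, with lower memory cost.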

详情
AI中文摘要

预训练扩散模型是贝叶斯逆问题的有效先验,但使用这些先验进行后验采样通常成本高昂,因为数据一致性引导应用于整个反向轨迹。现有方法表明,有时可以避免通过去噪器的向量-雅可比乘积,但它们通常仍然依赖于整个轨迹的密集引导或昂贵的内部求解。我们提出了稀疏调度扩散引导用于逆问题(Spin),这是一种避免从纯噪声开始后验采样的求解器。Spin首先在中间时间步$t_*$从后验时间边际采样,然后将该状态作为引导反向扩散过程的热启动。在引导时间,Spin不是在每个去噪步骤强制执行测量约束,而是仅在调度的时间步应用轻量级校正,此时去噪器仍能清理伪影。由此产生的过程将先验细化与数据一致性解耦:先验提供去噪,而轻量级像素空间优化强制执行测量约束,无需通过去噪器或解码器进行反向传播。在FFHQ和ImageNet上的线性和非线性逆问题中,Spin以显著更好的运行时-内存曲线实现了有竞争力的重建质量,在像素空间模型上运行速度提高2倍,在潜在扩散模型上运行速度提高50倍,且内存成本更低。

英文摘要

Pretrained diffusion models are effective priors for Bayesian inverse problems, but posterior sampling with these priors is often costly because data-consistency guidance is applied throughout the full reverse trajectory. Existing methods have shown that vector-Jacobian products through the denoiser can sometimes be avoided, yet they typically still rely on dense guidance through the full trajectory or expensive inner solves. We introduce Sparse Scheduled Diffusion Guidance for Inverse Problems (Spin), a solver that avoids starting posterior sampling from pure noise. Spin first samples from a posterior time-marginal at an intermediate timestep $t_*$, and then uses that state as a warm start for a guided reverse diffusion process. At guidance time, instead of enforcing the measurement constraint at every denoising step, Spin applies lightweight corrections only at scheduled timesteps where the denoiser can still clean up artifacts. The resulting procedure decouples prior refinement from data consistency: the prior supplies denoising, while lightweight pixel-space optimization enforces the measurement constraint without backpropagation through the denoiser or decoder. Across linear and nonlinear inverse problems on FFHQ and ImageNet, Spin achieves competitive reconstruction quality with a substantially better runtime--memory profile, running 2x faster on pixel-space models and up to 50x faster on latent diffusion models, with lower memory costs.

2603.05488 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Reasoning Theater: Disentangling Model Beliefs from Chain-of-Thought

推理剧场:从思维链中分离模型信念

Siddharth Boppana, Annabel Ma, Max Loeffler, Raphael Sarfati, Eric Bigelow, Atticus Geiger, Owen Lewis, Jack Merullo

发表机构 * Harvard University, Cambridge, MA(哈佛大学,马萨诸塞州剑桥)

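AI summary: Using activation probes, early forced answering, and a chain-of-thought monitor, finds that reasoning models exhibit performative chain-of-thought, and uses probe-guided early exit to save computation.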

详情
AI中文摘要

我们提供了推理模型中表演性思维链(CoT)的证据,即模型对其最终答案变得非常自信,但继续生成令牌而不揭示其内部信念。我们的分析比较了两个大型模型(DeepSeek-R1 671B 和 GPT-OSS 120B)中的激活探针、早期强制回答和思维链监控器,并发现了任务难度特定的差异:模型的最终答案可以从思维链中远早于监控器能够判断的激活中解码,特别是对于基于回忆的简单MMLU问题。我们将此与困难的多跳GPQA-Diamond问题中的真正推理进行对比。尽管如此,转折点(例如回溯、“啊哈”时刻)几乎只出现在探针显示大信念转变的响应中,表明这些行为追踪的是真正的不确定性,而不是学到的“推理剧场”。最后,探针引导的早期退出在MMLU上减少了高达80%的令牌,在GPQA-Diamond上减少了30%,且准确率相似,将注意力探针定位为检测表演性推理和实现自适应计算的高效工具。

英文摘要

We provide evidence of performative chain-of-thought (CoT) in reasoning models, where a model becomes strongly confident in its final answer, but continues generating tokens without revealing its internal belief. Our analysis compares activation probing, early forced answering, and a CoT monitor across two large models (DeepSeek-R1 671B & GPT-OSS 120B) and finds task difficulty-specific differences: the model's final answer is decodable from activations far earlier in CoT than a monitor is able to say, especially for easy recall-based MMLU questions. We contrast this with genuine reasoning in difficult multihop GPQA-Diamond questions. Despite this, inflection points (e.g., backtracking, 'aha' moments) occur almost exclusively in responses where probes show large belief shifts, suggesting these behaviors track genuine uncertainty rather than learned "reasoning theater." Finally, probe-guided early exit reduces tokens by up to 80% on MMLU and 30% on GPQA-Diamond with similar accuracy, positioning attention probing as an efficient tool for detecting performative reasoning and enabling adaptive computation.

2603.05002 2026-05-29 cs.LG math.OC stat.ML 版本更新

Non-Euclidean Gradient Descent Operates at the Edge of Stability

非欧几里得梯度下降在稳定性边缘运行

Rustem Islamov, Michael Crawshaw, Jeremy Cohen, Robert Gower

发表机构 * University of Basel(巴塞尔大学) George Mason University(乔治·梅森大学) Flatiron Institute(Flatiron研究所)

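AI summary: Interprets the Edge of Stability phenomenon in gradient descent through directional smoothness, extends it to non-Euclidean norms by defining a generalized sharpness, and shows experimentally that non-Euclidean gradient descent also exhibits progressive sharpening followed by oscillation around the threshold.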

详情
AI中文摘要

稳定性边缘(EoS)是一种现象,其中Hessian矩阵的尖锐度(最大特征值)在梯度下降(GD)中接近并徘徊在稳定性阈值$2/η$附近(步长为$η$)。尽管(表面上)违反了经典光滑性假设,但EoS在深度学习中已被广泛观察到,其理论基础仍不完整。我们通过方向光滑性[Mishkin et al., 2024]的视角提供了对EoS的解释。这种解释自然地扩展到非欧几里得范数,我们用它来定义任意范数下的广义尖锐度。我们的广义尖锐度度量包括先前研究的普通GD和预处理GD作为特例,以及尚未研究EoS的方法,例如$\ell_{\infty}$下降、块坐标下降、谱GD及其归一化版本。通过在神经网络上的实验,我们表明具有广义尖锐度的非欧几里得GD也表现出渐进尖锐化,随后在阈值$2/η$附近或之上振荡。在实践中,我们的框架提供了一种几何感知的谱诊断方法,可应用于广泛的非欧几里得梯度方法类别。

英文摘要

The Edge of Stability (EoS) is a phenomenon where the sharpness (largest eigenvalue) of the Hessian approaches and then hovers near the stability threshold $2/η$ during gradient descent (GD) with step size $η$. Despite (apparently) violating classical smoothness assumptions, EoS has been widely observed in deep learning, but its theoretical foundations remain incomplete. We provide an interpretation of EoS through the lens of Directional Smoothness [Mishkin et al., 2024]. This interpretation naturally extends to non-Euclidean norms, which we use to define generalized sharpness under an arbitrary norm. Our generalized sharpness measure includes previously studied vanilla GD and preconditioned GD as special cases, as well as methods for which EoS has not been studied, such as $\ell_{\infty}$-descent, Block CD, Spectral GD, and their normalized versions. Through experiments on neural networks, we show that non-Euclidean GD with our generalized sharpness also exhibits progressive sharpening followed by oscillations around or above the threshold $2/η$. Practically, our framework provides a geometry-aware spectral diagnostic that can be applied across a broad class of non-Euclidean gradient methods.
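
The threshold $2/η$ comes from the classical Euclidean case: on a one-dimensional quadratic with curvature λ, a gradient step multiplies the iterate by (1 - ηλ), so the iterates shrink only when λ < 2/η. The sketch below checks this numerically; it covers only that textbook setting, not the generalized sharpness defined in the paper, and the function name is hypothetical.

```python
def gd_on_quadratic(sharpness: float, step_size: float,
                    x0: float = 1.0, steps: int = 100) -> float:
    """Run gradient descent on f(x) = 0.5 * sharpness * x**2 and return |x_T|.

    Each step maps x to (1 - step_size * sharpness) * x, so the iterates
    contract exactly when sharpness < 2 / step_size.
    """
    x = x0
    for _ in range(steps):
        grad = sharpness * x
        x = x - step_size * grad
    return abs(x)
```

With `step_size=0.1` the threshold is 20: a sharpness of 19 contracts toward zero, while a sharpness of 21 blows up.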

2603.03805 2026-05-29 cs.LG cs.AI cs.DB 版本更新

Relational In-Context Learning via Synthetic Pre-training with Structural Prior

通过结构先验的合成预训练实现关系上下文学习

Yanbo Wang, Jiaxuan You, Chuan Shi, Muhan Zhang

发表机构 * Institute for Artificial Intelligence, Peking University(北京大学人工智能研究院) University of Illinois at Urbana-Champaign(伊利诺伊大学香槟分校) Institute of Computing Technology, Beijing University of Post(北京邮电大学计算机学院) State Key Laboratory of General Artificial Intelligence(通用人工智能国家重点实验室)

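AI summary: Proposes RDB-PFN, the first relational foundation model trained purely on synthetic data; it generates diverse relational databases from structural causal models, adapts to new databases via in-context learning, and outperforms existing tabular foundation models on 19 real-world relational prediction tasks.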

详情
AI中文摘要

关系数据库是现代业务的支柱,但它们缺乏与文本或视觉领域相当的基础模型。一个关键障碍是高质量的关系数据库是私有的、稀缺的且结构异构,使得互联网规模的预训练不可行。为了克服这种数据稀缺性,我们引入了RDB-PFN,这是第一个完全通过合成数据训练的关系基础模型。受先验数据拟合网络的启发,其中从结构因果模型生成的合成数据能够实现单表推理,我们设计了一个关系先验生成器,从零开始创建无限多样的关系数据库流。在超过200万个合成单表和关系任务上进行预训练后,RDB-PFN通过真正的上下文学习学会即时适应任何新数据库。实验表明,RDB-PFN在19个真实世界的关系预测任务上实现了强大的少样本性能,优于在相同DFS线性化输入上评估的最先进的表格基础模型,同时使用轻量级架构和快速推理。代码可在https://github.com/MuLabPKU/RDBPFN获取。

英文摘要

Relational Databases (RDBs) are the backbone of modern business, yet they lack foundation models comparable to those in text or vision. A key obstacle is that high-quality RDBs are private, scarce, and structurally heterogeneous, making internet-scale pre-training infeasible. To overcome this data scarcity, we introduce RDB-PFN, the first relational foundation model trained purely via synthetic data. Inspired by Prior-Data Fitted Networks (PFNs), where synthetic data generated from Structural Causal Models (SCMs) enables reasoning on single tables, we design a Relational Prior Generator to create an infinite stream of diverse RDBs from scratch. Pre-training on over 2 million synthetic single-table and relational tasks, RDB-PFN learns to adapt to any new database instantly via genuine in-context learning. Experiments show that RDB-PFN achieves strong few-shot performance on 19 real-world relational prediction tasks, outperforming state-of-the-art tabular foundation models evaluated on the same DFS-linearized inputs, while using a lightweight architecture and fast inference. The code is available at https://github.com/MuLabPKU/RDBPFN.

2603.03503 2026-05-29 cs.CV cs.LG 版本更新

Geographically-Weighted Weakly Supervised Bayesian High-Resolution Transformer for 200m Resolution Pan-Arctic Sea Ice Concentration Mapping and Uncertainty Estimation using Sentinel-1, RCM, and AMSR2 Data

地理加权弱监督贝叶斯高分辨率Transformer:利用Sentinel-1、RCM和AMSR2数据实现200米分辨率泛北极海冰密集度制图与不确定性估计

Mabel Heffring, Lincoln Linlin Xu

发表机构 * Department of Geomatics Engineering, Schulich School of Engineering, University of Calgary(测绘工程系,Schulich 工程学院,卡尔加里大学)

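AI summary: Proposes a Bayesian high-resolution Transformer that combines a geographically-weighted weakly supervised loss with decision-level data fusion to map pan-Arctic sea ice concentration at 200 m resolution and quantify its uncertainty from Sentinel-1, RCM, and AMSR2 data.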

Comments 23 pages, 20 figures

详情
AI中文摘要

尽管具有可靠对应不确定性的泛北极海冰高分辨率制图对于业务化海冰密集度(SIC)制图至关重要,但由于冰特征信号的细微性、SIC标签的不精确性、模型不确定性和数据异质性等关键挑战,这是一项艰巨的任务。本研究提出了一种新颖的贝叶斯高分辨率Transformer方法,利用Sentinel-1、RADARSAT星座任务(RCM)和先进微波扫描辐射计2(AMSR2)数据,实现200米分辨率泛北极SIC制图和不确定性量化。首先,为了改进微小和细微海冰特征(例如裂缝/水道、融池和浮冰)的提取,我们设计了一种新颖的高分辨率Transformer模型,该模型具有全局和局部模块,能够更好地区分海冰模式的细微差异。其次,为了解决低分辨率和非精确SIC标签的问题,我们设计了一种地理加权弱监督损失函数,在区域级别而非像素级别监督模型,并优先考虑纯开阔水和冰盖特征,同时减轻边缘冰区(MIZ)中模糊性的影响。第三,为了改进不确定性量化,我们设计了所提Transformer模型的贝叶斯扩展,将其参数视为随机变量,以更有效地捕获不确定性。第四,为了解决数据异质性,我们在决策级融合三种不同类型的数据(Sentinel-1、RCM和AMSR2),以改进SIC制图和不确定性量化。所提方法在2021年和2025年泛北极最小范围条件下进行了评估。结果表明,所提模型在使用Sentinel-1数据时实现了0.70的总体特征检测精度,同时保留了泛北极SIC模式(相对于ARTIST海冰产品,Sentinel-1 R² = 0.90)。

英文摘要

Although high-resolution mapping of pan-Arctic sea ice with reliable corresponding uncertainty is essential for operational sea ice concentration (SIC) charting, it is a difficult task due to key challenges, such as the subtle nature of ice signature features, inexact SIC labels, model uncertainty, and data heterogeneity. This study presents a novel Bayesian High-Resolution Transformer approach for 200 meter resolution pan-Arctic SIC mapping and uncertainty quantification using Sentinel-1, RADARSAT Constellation Mission (RCM), and Advanced Microwave Scanning Radiometer 2 (AMSR2) data. First, to improve small and subtle sea ice feature (e.g., cracks/leads, ponds, and ice floes) extraction, we design a novel high-resolution Transformer model with both global and local modules that can better discern the subtle differences in sea ice patterns. Second, to address low-resolution and inexact SIC labels, we design a geographically-weighted weakly supervised loss function to supervise the model at region level instead of pixel level, and to prioritize pure open water and ice pack signatures while mitigating the impact of ambiguity in the marginal ice zone (MIZ). Third, to improve uncertainty quantification, we design a Bayesian extension of the proposed Transformer model, treating its parameters as random variables to more effectively capture uncertainties. Fourth, to address data heterogeneity, we fuse three different data types (Sentinel-1, RCM, and AMSR2) at decision-level to improve both SIC mapping and uncertainty quantification. The proposed approach is evaluated under pan-Arctic minimum-extent conditions in 2021 and 2025. Results demonstrate that the proposed model achieves 0.70 overall feature detection accuracy using Sentinel-1 data, while also preserving pan-Arctic SIC patterns (Sentinel-1 $R^2$ = 0.90 relative to the ARTIST Sea Ice product).

2602.21565 2026-05-29 cs.LG 版本更新

Routing by Reaching: Composition of Pre-trained GFlowNets for Multi-Objective Generation

通过到达进行路由:预训练GFlowNets的组合用于多目标生成

Seokwon Yoon, Youngbin Choi, Seunghyuk Cho, Seungbeom Lee, MoonJeong Park, Dongwoo Kim

发表机构 * Department of Computer Science & Engineering, POSTECH, South Korea Graduate School of Artificial Intelligence, POSTECH, South Korea

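AI summary: Proposes a framework that composes pre-trained GFlowNets at inference time, adapting to multi-objective generation tasks without fine-tuning or retraining; proves exact recovery of the target distribution under linear scalarization and quantifies approximation quality for nonlinear operators through a distortion factor.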

Comments Appears in the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

生成流网络(GFlowNets)学习按照奖励函数比例采样多样化的候选,使其非常适合科学发现,其中探索多个有希望的解决方案至关重要。进一步将GFlowNets扩展到多目标设置已引起越来越多的兴趣,因为现实世界的应用通常涉及多个相互冲突的目标。然而,现有方法需要对每个目标组合进行联合训练,这意味着目标集的任何变化都需要从头开始重新训练。我们提出了一个在推理时组合预训练GFlowNets的框架,无需微调或重新训练即可实现快速适应。重要的是,我们的框架是灵活的,能够处理从线性标量化到复杂非线性算子的多种奖励组合,这些在以前的文献中通常分开处理。我们证明,我们的方法在线性标量化下精确恢复目标分布,并通过畸变因子量化非线性算子的近似质量。在合成二维网格和真实分子生成任务上的实验表明,我们的方法达到了与基线相当的性能。

英文摘要

Generative Flow Networks (GFlowNets) learn to sample diverse candidates in proportion to a reward function, making them well-suited for scientific discovery, where exploring multiple promising solutions is crucial. Further extending GFlowNets to multi-objective settings has attracted growing interest as real-world applications often involve multiple, conflicting objectives. However, existing approaches require joint training for each combination of objectives, meaning that any change in the objective set necessitates retraining from scratch. We propose a framework that composes pre-trained GFlowNets at inference time, enabling rapid adaptation without fine-tuning or retraining. Importantly, our framework is flexible, capable of handling diverse reward combinations ranging from linear scalarization to complex nonlinear operators, which are often handled separately in previous literature. We prove that our method exactly recovers the target distribution for linear scalarization, and quantify the approximation quality for nonlinear operators through a distortion factor. Experiments on a synthetic 2D grid and real-world molecule generation tasks demonstrate that our approach achieves performance comparable to baselines.
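
The exact-recovery claim for linear scalarization has a simple counterpart at the level of terminal distributions: if sampler i draws x with probability R_i(x)/Z_i, then mixing the samplers with weights proportional to w_i * Z_i yields a distribution proportional to the scalarized reward, the sum over i of w_i * R_i(x). The sketch below verifies that identity on a tiny discrete space. It is only this terminal-level identity under the assumption that the partition functions Z_i are known; the paper's actual routing mechanism operates on pre-trained policies and is not reproduced here. All names are hypothetical.

```python
from typing import Dict, List


def normalize(reward: Dict[str, float]) -> Dict[str, float]:
    """Turn a non-negative reward table into a probability distribution."""
    total = sum(reward.values())
    return {x: r / total for x, r in reward.items()}


def compose_linear(rewards: List[Dict[str, float]],
                   weights: List[float]) -> Dict[str, float]:
    """Mix per-objective samplers p_i(x) = R_i(x) / Z_i with mixing
    probabilities proportional to w_i * Z_i."""
    partitions = [sum(r.values()) for r in rewards]
    mix_total = sum(w * z for w, z in zip(weights, partitions))
    mixed: Dict[str, float] = {}
    for reward, w, z in zip(rewards, weights, partitions):
        for x, p in normalize(reward).items():
            mixed[x] = mixed.get(x, 0.0) + (w * z / mix_total) * p
    return mixed
```

For two reward tables `r1` and `r2`, `compose_linear([r1, r2], [0.25, 0.75])` matches `normalize` applied to the table `0.25 * r1 + 0.75 * r2`.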

2602.18196 2026-05-29 cs.LG 版本更新

RAT+: Train Dense, Infer Sparse -- Recurrence Augmented Attention for Dilated Inference

RAT+:密集训练,稀疏推理——用于扩张推理的循环增强注意力

Xiuying Wei, Caglar Gulcehre

发表机构 * CLAIRE lab at EPFL, Lausanne, Switzerland(瑞士洛桑联邦理工学院 CLAIRE 实验室)

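AI summary: Proposes RAT+, an architecture that is pretrained densely with recurrence-augmented attention and can be switched flexibly to sparse dilated attention at inference time, sharply reducing compute and KV cache cost while keeping accuracy high.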

Comments Accepted by ICML2026

详情
AI中文摘要

结构化扩张注意力具有吸引人的推理效率调节旋钮:它将注意力的FLOPs和KV缓存大小减少扩张大小D的倍数,同时保持长程连接。虽然先前的工作通过从头训练每个配置来研究它,但直接将预训练注意力模型稀疏化为扩张模式会导致严重的精度下降,阻碍跨推理场景的灵活重用。我们引入RAT+,一种密集预训练架构,通过全序列循环和主动循环学习增强注意力。单个RAT+模型密集预训练一次,然后可以在推理时灵活切换到扩张注意力(可选局部窗口)或混合层/头组合,仅需短期的10亿token分辨率适应,而无需重新训练单独的稀疏模型。在100B token上训练的1.5B参数模型中,RAT+在D=16时紧密匹配密集精度,在D=64时在常识推理和LongBench任务上下降约2-3个点。我们进一步扩展到2.6B和7.6B参数,观察到更有希望的性能(例如,在注意力FLOPs和KV缓存大小减少64倍的情况下,平均精度损失1个点)。代码可在https://github.com/wimh966/rat-plus获取。

英文摘要

Structured dilated attention has an appealing inference-time efficiency knob: it reduces the FLOPs of attention and the KV cache size by a factor of the dilation size D, while preserving long-range connectivity. While prior work studies it by training each configuration from scratch, directly sparsifying a pretrained attention model into a dilated pattern leads to severe accuracy degradation, preventing flexible reuse across inference scenarios. We introduce RAT+, a dense-pretraining architecture that augments attention with full-sequence recurrence and active recurrence learning. A single RAT+ model is pretrained densely once and can then be flexibly switched at inference time to dilated attention (optionally with local windows) or hybrid layer/head compositions, requiring only a short 1B-token resolution adaptation rather than retraining separate sparse models. At 1.5B parameters trained on 100B tokens, RAT+ closely matches dense accuracy at D = 16, and drops by about 2-3 points at D = 64 on commonsense reasoning and LongBench tasks. We further scale to 2.6B and 7.6B parameters and observe even more promising performance (e.g., a 1-point average accuracy loss with a 64x reduction in attention FLOPs and KV cache size). Code is available at https://github.com/wimh966/rat-plus.
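
To make the factor-of-D saving concrete, the sketch below lists which key positions a causal query would read under one plausible dilated pattern: every D-th position counting back from the query, optionally joined with a short local window. The exact layout used by RAT+ is not given in the abstract, so this pattern and the function name are assumptions for illustration.

```python
from typing import List


def dilated_key_positions(query_pos: int, dilation: int,
                          local_window: int = 0) -> List[int]:
    """Key positions visible to a causal query under dilated attention.

    Strided part: query_pos, query_pos - D, query_pos - 2D, ... down to >= 0.
    Local part (optional): the last `local_window` positions up to query_pos.
    """
    strided = set(range(query_pos, -1, -dilation))
    local = set()
    if local_window > 0:
        local = set(range(max(0, query_pos - local_window + 1), query_pos + 1))
    return sorted(strided | local)
```

At position 1023 with D = 64 the query reads 16 keys instead of 1024, which is where the roughly 64x reduction in attention FLOPs and KV cache entries would come from under this pattern.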

2602.16610 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Who can we trust? LLM-as-a-jury for Comparative Assessment

我们该信任谁?LLM作为陪审团进行比较评估

Mengjie Qian, Guangzhi Sun, Mark J. F. Gales, Kate M. Knill

发表机构 * Department of Engineering, University of Cambridge, UK(剑桥大学工程系)

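AI summary: To address inconsistent judgments and unequal reliability among LLM evaluators, proposes BT-sigma, which adds a per-judge discriminator parameter to jointly infer item rankings and judge reliability and outperforms averaging-based aggregation.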

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)越来越多地被用作自动评估器,用于自然语言生成评估,通常采用成对比较判断。现有方法通常依赖单一法官或聚合多个法官并假设其可靠性相同。在实践中,LLM法官在不同任务和评估方面的表现差异很大,其判断概率可能存在偏差和不一致。此外,用于法官校准的人工标注监督可能不可用。我们首先通过实验证明LLM比较概率的不一致性存在,并表明这限制了直接基于概率排名的有效性。为解决此问题,我们研究了LLM作为陪审团的设置,并提出了BT-sigma,这是Bradley-Terry模型的一种法官感知扩展,为每个法官引入一个判别参数,仅从成对比较中联合推断项目排名和法官可靠性。在基准NLG评估数据集上的实验表明,BT-sigma始终优于基于平均的聚合方法,并且学习到的判别参数与LLM判断的循环一致性的独立度量高度相关。进一步分析揭示,BT-sigma可以解释为一种无监督校准机制,通过建模法官可靠性来改进聚合。

英文摘要

Large language models (LLMs) are increasingly applied as automatic evaluators for natural language generation assessment, often using pairwise comparative judgements. Existing approaches typically rely on single judges or aggregate multiple judges assuming equal reliability. In practice, LLM judges vary substantially in performance across tasks and evaluation aspects, and their judgment probabilities may be biased and inconsistent. Furthermore, human-labelled supervision for judge calibration may be unavailable. We first empirically demonstrate that inconsistencies in LLM comparison probabilities exist and show that they limit the effectiveness of direct probability-based ranking. To address this, we study the LLM-as-a-jury setting and propose BT-sigma, a judge-aware extension of the Bradley-Terry model that introduces a discriminator parameter for each judge to jointly infer item rankings and judge reliability from pairwise comparisons alone. Experiments on benchmark NLG evaluation datasets show that BT-sigma consistently outperforms averaging-based aggregation methods, and that the learned discriminators strongly correlate with independent measures of the cycle consistency of LLM judgments. Further analysis reveals that BT-sigma can be interpreted as an unsupervised calibration mechanism that improves aggregation by modelling judge reliability.
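
A judge-aware Bradley-Terry model can be sketched as follows. The abstract does not state how the discriminator enters the likelihood, so the form used here, P(i beats j | judge k) = sigmoid(d_k * (s_i - s_j)) with item scores s and a per-judge discrimination d_k, is an assumption, as are all names. A discrimination near zero makes a judge's predictions collapse to 0.5, which is how such a parameter can down-weight an unreliable judge.

```python
import math
from typing import List, Tuple

# (judge index, item i, item j, judge's probability that i beats j)
Comparison = Tuple[int, int, int, float]


def win_prob(score_i: float, score_j: float, discrimination: float) -> float:
    """Assumed judge-aware Bradley-Terry probability that item i beats item j."""
    return 1.0 / (1.0 + math.exp(-discrimination * (score_i - score_j)))


def log_likelihood(scores: List[float], discriminations: List[float],
                   comparisons: List[Comparison]) -> float:
    """Soft-label log-likelihood of the judges' comparison probabilities."""
    total = 0.0
    for judge, i, j, p_obs in comparisons:
        p = win_prob(scores[i], scores[j], discriminations[judge])
        total += p_obs * math.log(p) + (1.0 - p_obs) * math.log(1.0 - p)
    return total
```

Maximizing `log_likelihood` jointly over `scores` and `discriminations` (for example by gradient ascent) would give a ranking and a per-judge reliability estimate from pairwise data alone.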

2602.16449 2026-05-29 cs.LG cs.AI stat.ML 版本更新

GICDM: Mitigating Hubness for Reliable Distance-Based Generative Model Evaluation

GICDM: 缓解枢纽性以实现可靠的基于距离的生成模型评估

Nicolas Salvy, Hugues Talbot, Bertrand Thirion

发表机构 * Inria, Palaiseau, France(法国国家信息与自动化研究所,帕莱索)

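AI summary: To counter hubness in the high-dimensional embedding spaces used for generative model evaluation, proposes GICDM, built on the Iterative Contextual Dissimilarity Measure, with a multi-scale extension; it corrects neighborhood estimation, restores reliable metric behavior, and improves alignment with human assessment.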

Comments Forty-third International Conference on Machine Learning, 2026

详情
AI中文摘要

生成模型评估通常依赖于高维嵌入空间来计算样本之间的距离。我们表明,这些空间中的数据集表示受到枢纽性现象的影响,这会扭曲最近邻关系并使基于距离的度量产生偏差。基于经典的迭代上下文不相似度度量(ICDM),我们引入了生成式ICDM(GICDM),一种校正真实数据和生成数据邻域估计的方法。我们引入了多尺度扩展以改善经验行为。在合成和真实基准上的大量实验表明,GICDM解决了枢纽性引起的失败,恢复了可靠的度量行为,并改善了与人类评估的一致性。

英文摘要

Generative model evaluation commonly relies on high-dimensional embedding spaces to compute distances between samples. We show that dataset representations in these spaces are affected by the hubness phenomenon, which distorts nearest-neighbor relationships and biases distance-based metrics. Building on the classical Iterative Contextual Dissimilarity Measure (ICDM), we introduce Generative ICDM (GICDM), a method to correct neighborhood estimation for both real and generated data. We introduce a multi-scale extension to improve empirical behavior. Extensive experiments on synthetic and real benchmarks demonstrate that GICDM resolves hubness-induced failures, restores reliable metric behavior, and improves alignment with human assessment.
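
Hubness is usually diagnosed with the k-occurrence count N_k(x): the number of other points that include x among their k nearest neighbors. The counts always average to k, so hubness shows up as a skewed distribution with a few very large values. The sketch below computes N_k with brute-force squared Euclidean distances; it only measures the phenomenon and does not implement the GICDM correction. The function name is hypothetical.

```python
from typing import List, Sequence


def k_occurrence(points: Sequence[Sequence[float]], k: int) -> List[int]:
    """N_k(x): how many other points list x among their k nearest neighbors."""
    counts = [0] * len(points)
    for i, p in enumerate(points):
        # Squared Euclidean distance from point i to every other point.
        dists = sorted(
            (sum((a - b) ** 2 for a, b in zip(p, q)), j)
            for j, q in enumerate(points) if j != i
        )
        for _, j in dists[:k]:
            counts[j] += 1
    return counts
```

Running this on Gaussian samples of increasing dimension and comparing `max(k_occurrence(points, 5))` is a common way to observe hubs emerging.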

2602.15382 2026-05-29 cs.CL cs.CV cs.LG 版本更新

The Vision Wormhole: Latent-Space Communication in Heterogeneous Multi-Agent Systems

视觉虫洞:异构多智能体系统中的潜在空间通信

Xiaoze Liu, Ruowang Zhang, Weichen Yu, Siheng Xiong, Liu He, Feijie Wu, Hoin Jung, Matt Fredrikson, Xiaoqian Wang, Jing Gao

发表机构 * Purdue University(普渡大学) Contextual AI(情境人工智能) Carnegie Mellon University(卡内基梅隆大学) Georgia Institute of Technology(佐治亚理工学院)

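AI summary: Proposes the Vision Wormhole framework, in which a universal visual codec maps reasoning traces into a shared continuous space, enabling latent state transfer between heterogeneous VLMs without per-pair translators, reducing alignment complexity and improving efficiency.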

Comments Preprint. Work in progress

详情
AI中文摘要

由大型语言模型驱动的多智能体系统(MAS)实现了先进的协作推理,但仍受限于离散文本通信,这带来了运行时开销和信息量化损失。虽然潜在状态传输提供了一种替代方案,但现有方法要么假设同构的发送器-接收器架构,要么依赖于特定配对的学得翻译器,限制了跨具有不连续流形的不同模型族的可扩展性。我们将为自然图像训练的视觉-语言模型(VLM)的视觉界面重新概念化为异构智能体之间的连续通信通道,并将这一思想实例化为视觉虫洞(Vision Wormhole):一种通用视觉编解码器,将推理轨迹映射到共享的连续参考空间,并将其注入接收器的视觉通路,实现无需配对翻译器的跨架构潜在状态传输。该框架采用中心辐射拓扑,将对齐复杂度从$O(N^2)$降低到$O(N)$,并通过无标签的教师-学生蒸馏针对文本通道进行训练,无需并行隐藏状态监督。在异构VLM族(Qwen-VL、Gemma、SmolVLM2、LFM2.5-VL)和九个推理基准上的大量实验表明,视觉虫洞在大多数评估设置中减少了端到端挂钟时间,并产生了正的平均宏$Δ$-准确率。

英文摘要

Multi-Agent Systems (MAS) powered by Large Language Models have unlocked advanced collaborative reasoning, yet they remain bottlenecked by discrete text communication, which imposes runtime overhead and information quantization loss. While latent state transfer offers an alternative, existing approaches either assume homogeneous sender--receiver architectures or rely on pair-specific learned translators, limiting scalability across diverse model families with disjoint manifolds. We reconceptualize the visual interface of Vision-Language Models (VLMs), trained for natural images, as a continuous communication channel between heterogeneous agents, and instantiate this idea as the Vision Wormhole: a Universal Visual Codec maps reasoning traces into a shared continuous reference space and injects them into the receiver's visual pathway, yielding cross-architecture latent state transfer without per-pair translators. The framework adopts a hub-and-spoke topology that reduces alignment complexity from $O(N^2)$ to $O(N)$, and is trained by label-free teacher--student distillation against the text channel, requiring no parallel hidden-state supervision. Extensive experiments across heterogeneous VLM families (Qwen-VL, Gemma, SmolVLM2, LFM2.5-VL) and nine reasoning benchmarks show that the Vision Wormhole reduces end-to-end wall-clock time across most evaluated settings and yields positive macro-average $Δ$-accuracy.

2602.15239 2026-05-29 cs.LG 版本更新

Size Transferability of Graph Transformers with Convolutional Positional Encodings

采用卷积位置编码的图Transformer的尺寸可迁移性

Javier Porras-Valenzuela, Zhiyang Wang, Xiaotao Shang, Yusu Wang, Alejandro Ribeiro

发表机构 * Department of Electrical and Systems Engineering(电气与系统工程系) University of Pennsylvania(宾夕法尼亚大学) University of California San Diego(加州大学圣地亚哥分校)

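AI summary: Links Graph Transformers with GNN positional encodings to manifold neural networks, proves that models trained on small graphs generalize to larger graphs, and validates scalability on standard benchmarks and on shortest-path distance estimation over real terrain.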

详情
AI中文摘要

Transformer在各个领域取得了显著成功,推动了图Transformer(GTs)作为基于注意力的图结构数据架构的兴起。GTs的一个关键设计选择是使用基于图神经网络(GNN)的位置编码来融入结构信息。在这项工作中,我们通过图序列的流形极限模型研究GTs,并建立了具有GNN位置编码的GTs与流形神经网络(MNNs)之间的理论联系。基于GNN在流形收敛下的可迁移性结果,我们证明了GTs从其位置编码继承了可迁移性保证。特别地,在温和假设下,在小图上训练的GTs可以证明地泛化到更大的图。我们通过标准图基准上的大量实验补充了理论,表明GTs表现出与GNN相当的可扩展行为。为了进一步展示在真实场景中的效率,我们实现了GTs用于地形上的最短路径距离估计,以更好地说明可迁移GTs的效率。我们的结果为理解GTs提供了新见解,并为在大规模设置中高效训练GTs提出了实用方向。

英文摘要

Transformers have achieved remarkable success across domains, motivating the rise of Graph Transformers (GTs) as attention-based architectures for graph-structured data. A key design choice in GTs is the use of Graph Neural Network (GNN)-based positional encodings to incorporate structural information. In this work, we study GTs through the lens of manifold limit models for graph sequences and establish a theoretical connection between GTs with GNN positional encodings and Manifold Neural Networks (MNNs). Building on transferability results for GNNs under manifold convergence, we show that GTs inherit transferability guarantees from their positional encodings. In particular, GTs trained on small graphs provably generalize to larger graphs under mild assumptions. We complement our theory with extensive experiments on standard graph benchmarks, demonstrating that GTs exhibit scalable behavior on par with GNNs. To demonstrate this in a real-world scenario, we apply GTs to shortest-path distance estimation over terrains, which illustrates the efficiency of transferable GTs. Our results provide new insights into the understanding of GTs and suggest practical directions for efficient training of GTs in large-scale settings.

2602.10637 2026-05-29 cs.LG cond-mat.stat-mech physics.chem-ph stat.ML 版本更新

Coarse-Grained Boltzmann Generators

粗粒度玻尔兹曼生成器

Weilong Chen, Bojun Zhao, Jan Eckwert, Julija Zavadlav

发表机构 * Professorship of Multiscale Modeling of Fluid Materials, Department of Engineering Physics and Computation, TUM School of Engineering and Design, Technical University of Munich, Germany(流体材料多尺度建模教席,工程物理与计算系,TUM工程与设计学院,慕尼黑工业大学,德国) Atomistic Modeling Center, Munich Data Science Institute, Technical University of Munich, Germany(原子建模中心,慕尼黑数据科学研究所,慕尼黑工业大学,德国)

AI总结 提出粗粒度玻尔兹曼生成器(CG-BGs)框架,结合基于流的生成模型与重要性采样,利用学习到的平均力势(PMF)进行重加权,在降低计算成本的同时实现大分子系统的平衡采样。

Comments Accepted at ICML 2026

详情
AI中文摘要

从玻尔兹曼分布中采样平衡分子构型是一个长期挑战。玻尔兹曼生成器(BGs)通过结合精确似然生成模型与重要性采样来解决这一问题,但实际可扩展性有限。同时,粗粒度代理模型通过降低有效维度来建模更大系统,但往往缺乏确保渐近正确统计量的重加权过程。在这项工作中,我们提出了粗粒度玻尔兹曼生成器(CG-BGs),一个用于粗粒度坐标空间中的降阶生成建模与重要性采样的框架。CG-BGs使用基于流的模型生成样本,并使用学习到的平均力势(PMF)进行重加权。我们表明,可以通过增强采样力匹配从快速收敛的轨迹中学习PMF。实验证明,CG-BGs在高度降阶表示中捕获溶剂介导的相互作用,同时相对于原子级BGs大幅降低计算成本,为更大分子系统的平衡采样提供了实用途径。

英文摘要

Sampling equilibrium molecular configurations from the Boltzmann distribution is a longstanding challenge. Boltzmann Generators (BGs) address this by combining exact-likelihood generative models with importance sampling, but practical scalability is limited. Meanwhile, coarse-grained surrogates enable the modeling of larger systems by reducing effective dimensionality, yet often lack a reweighting procedure required to ensure asymptotically correct statistics. In this work, we propose Coarse-Grained Boltzmann Generators (CG-BGs), a framework for reduced-order generative modeling with importance sampling in coarse-grained coordinate space. CG-BGs generate samples using a flow-based model and reweight them using a learned potential of mean force (PMF). We show that the PMF can be learned from rapidly converged trajectories via enhanced sampling force matching. Experiments demonstrate that CG-BGs capture solvent-mediated interactions in highly reduced representations while substantially reducing computational cost relative to atomistic BGs, providing a practical route toward equilibrium sampling of larger molecular systems.
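As a rough illustration of the reweighting step described above, the sketch below computes self-normalized importance weights for samples drawn from a proposal and reweighted toward an unnormalized Boltzmann target. Everything in it is an assumption made for illustration: a 1D Gaussian proposal stands in for the flow-based model, a quadratic potential stands in for the learned PMF (with kT = 1), and the helper names `snis_weights` and `effective_sample_size` are hypothetical rather than taken from the paper.

```python
import numpy as np

def snis_weights(log_target_unnorm, log_proposal):
    """Self-normalized importance weights computed stably from log densities."""
    log_w = log_target_unnorm - log_proposal
    log_w = log_w - np.max(log_w)  # shift for numerical stability
    w = np.exp(log_w)
    return w / w.sum()

def effective_sample_size(weights):
    """Kish effective sample size of normalized weights."""
    return 1.0 / np.sum(weights ** 2)

# Toy stand-ins (assumptions): the proposal N(0, 1.5^2) plays the role of the flow,
# and U(x) = x^2 / 2 plays the role of the learned PMF with kT = 1.
rng = np.random.default_rng(0)
sigma = 1.5
x = rng.normal(0.0, sigma, size=50_000)
log_q = -0.5 * (x / sigma) ** 2 - np.log(sigma * np.sqrt(2.0 * np.pi))
log_p = -0.5 * x ** 2  # unnormalized log Boltzmann weight, -U(x)

w = snis_weights(log_p, log_q)
second_moment = np.sum(w * x ** 2)  # reweighted estimate of E[x^2] under the target
ess = effective_sample_size(w)
```

The effective sample size gives a quick check on how well the proposal covers the target: a value far below the number of samples suggests the reweighted estimates are dominated by a few samples.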

2602.10520 2026-05-29 cs.LG 版本更新

Prioritize the Process, Not Just the Outcome: Rewarding Latent Thought Trajectories Improves Reasoning in Looped Language Models

优先过程而非结果:奖励潜在思维轨迹改善循环语言模型的推理能力

Jonathan Williams, Esin Tureci

发表机构 * Department of Computer Science, Princeton University, Princeton NJ, U.S.A(普林斯顿大学计算机科学系)

AI总结 针对循环语言模型(LoopLM)中标准强化学习(如GRPO)仅奖励最终潜在状态导致推理改进失败的问题,提出RLTT框架,通过在整个潜在推理轨迹上分配奖励实现密集的轨迹级信用分配,显著提升数学推理性能并泛化至非数学任务。

Comments ICML 2026

详情
AI中文摘要

循环语言模型(LoopLMs)在生成token之前执行多步潜在推理,并在较小的参数预算下在推理基准上优于传统LLM。然而,使用强化学习进一步改进LoopLM推理的尝试失败了——诸如群体相对策略优化(GRPO)等标准目标仅对最终潜在状态分配信用,与模型的内部计算存在根本性不匹配。为解决此问题,我们引入了RLTT(奖励潜在思维轨迹),这是一种强化学习框架,将奖励分布在整个潜在推理轨迹上。RLTT提供密集的、轨迹级别的信用分配,无需依赖外部验证器,并且可以直接替代GRPO,开销可忽略不计。在相同训练和推理条件下,使用Ouro-1.4B/2.6B-Thinking进行的大量实验中,RLTT在具有挑战性的数学推理基准上比GRPO取得了统计上显著的改进,在1.4B规模上,MATH-500、AIME24/26和BeyondAIME的平均准确率提高了+5.8%,在2.6B规模上提高了+10.9%。尽管仅在数学上训练,RLTT也能有效迁移到非数学推理基准,证明了轨迹级信用分配在LoopLMs中强化学习的有效性。代码可在https://github.com/jonwill8/RLTT.git获取。

英文摘要

Looped Language Models (LoopLMs) perform multi-step latent reasoning prior to token generation and outperform conventional LLMs on reasoning benchmarks at smaller parameter budgets. However, attempts to further improve LoopLM reasoning with reinforcement learning have failed: standard objectives such as Group Relative Policy Optimization (GRPO) only assign credit to the final latent state, creating a fundamental mismatch with the model's internal computation. To resolve this, we introduce RLTT (Reward Latent Thought Trajectories), a reinforcement learning framework which distributes reward across the full latent reasoning trajectory. RLTT provides dense, trajectory-level credit assignment without relying on external verifiers and can directly replace GRPO with negligible overhead. Across extensive experiments with Ouro-1.4B/2.6B-Thinking under identical training and inference conditions, RLTT yields statistically significant improvements over GRPO on challenging mathematical reasoning benchmarks, improving mean accuracy over MATH-500, AIME24/26, and BeyondAIME by +5.8% on the 1.4B scale, and +10.9% on the 2.6B scale. Despite being trained exclusively on mathematics, RLTT also transfers effectively to non-mathematical reasoning benchmarks, demonstrating the effectiveness of trajectory-level credit assignment for reinforcement learning in LoopLMs. Code is available at https://github.com/jonwill8/RLTT.git.

2602.06791 2026-05-29 cs.LG cond-mat.dis-nn cond-mat.stat-mech 版本更新

Rare Event Analysis of Large Language Models

大型语言模型的罕见事件分析

Jake McAllister Dorman, Edward Gillman, Dominic C. Rose, Jamie F. Mair, Juan P. Garrahan

发表机构 * School of Physics and Astronomy, University of Nottingham(物理与天文学学院,诺丁汉大学)

AI总结 本文提出一个端到端框架,用于系统分析大型语言模型中的罕见事件,涵盖理论、高效生成策略、概率估计和误差分析,并通过实例展示其应用。

Comments ICML 2026 Oral Spotlight

详情
AI中文摘要

作为概率模型,大型语言模型(LLMs)在推理过程中会显示罕见事件:即远离典型但高度显著的行为。根据定义,所有罕见事件都难以观察,但LLM使用的巨大规模意味着在开发过程中完全未观察到的事件在部署中可能变得突出。在此,我们提出了一个用于系统分析LLMs中罕见事件的端到端框架。我们提供了一个实用的实现,涵盖理论、高效生成策略、概率估计和误差分析,并通过具体示例加以说明。我们概述了扩展到其他模型和背景的应用,强调了这里提出的概念和技术的通用性。

英文摘要

Being probabilistic models, during inference large language models (LLMs) display rare events: behaviour that is far from typical but highly significant. By definition all rare events are hard to see, but the enormous scale of LLM usage means that events completely unobserved during development are likely to become prominent in deployment. Here we present an end-to-end framework for the systematic analysis of rare events in LLMs. We provide a practical implementation spanning theory, efficient generation strategies, probability estimation and error analysis, which we illustrate with concrete examples. We outline extensions and applications to other models and contexts, highlighting the generality of the concepts and techniques presented here.

2602.06361 2026-05-29 cs.GT cs.IT cs.LG math.IT stat.ML 版本更新

Envy-Free Allocation of Indivisible Goods via Noisy Queries

通过噪声查询实现不可分割物品的无嫉妒分配

Zihan Li, Yan Hao Ling, Jonathan Scarlett, Warut Suksompong

发表机构 * Meta(Meta公司) National University of Singapore(新加坡国立大学) Nanyang Technological University(南洋理工大学)

AI总结 针对不可直接观测估值、仅能通过噪声查询获取信息的不可分割物品分配问题,在双智能体高斯噪声和有界估值设定下,推导了实现无嫉妒分配所需查询次数的上下界,并证明了当最优分配负嫉妒值Δ不太小时最优查询次数与m^{2.5}/Δ^2成比例。

Comments ICML 2026

详情
AI中文摘要

我们引入了一个公平分配不可分割物品(items)的问题,其中智能体的估值无法直接观测,而只能通过噪声查询访问。在双智能体设定中,考虑高斯噪声和有界估值,我们根据物品数量$m$和最优分配的负嫉妒值$Δ$,推导了找到无嫉妒分配所需查询次数的上下界。特别地,当$Δ$不太小(即$Δ\gg m^{1/4}$)时,我们证明最优查询次数在忽略对数因子下为$\frac{\sqrt m }{(Δ/ m)^2} = \frac{m^{2.5}}{Δ^2}$。我们的上界基于非自适应查询和一个简单的基于阈值的分配算法,该算法在多项式时间内运行,而下界即使在自适应查询和任意计算时间下也成立。

英文摘要

We introduce a problem of fairly allocating indivisible goods (items) in which the agents' valuations cannot be observed directly, but instead can only be accessed via noisy queries. In the two-agent setting with Gaussian noise and bounded valuations, we derive upper and lower bounds on the required number of queries for finding an envy-free allocation in terms of the number of items, $m$, and the negative-envy of the optimal allocation, $Δ$. In particular, when $Δ$ is not too small (namely, $Δ\gg m^{1/4}$), we establish that the optimal number of queries scales as $\frac{\sqrt m }{(Δ/ m)^2} = \frac{m^{2.5}}{Δ^2}$ up to logarithmic factors. Our upper bound is based on non-adaptive queries and a simple thresholding-based allocation algorithm that runs in polynomial time, while our lower bound holds even under adaptive queries and arbitrary computation time.
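The equality in the stated rate is a one-line simplification, written out here with a numerical instance. The values $m = 100$ and $Δ = 10$ are assumptions chosen only to keep the arithmetic easy (they satisfy $Δ > m^{1/4} \approx 3.16$). Since the rate holds only up to logarithmic factors, the resulting number is an order-of-magnitude guide, not an exact query count.

```latex
\frac{\sqrt{m}}{(\Delta/m)^2}
  = \sqrt{m}\cdot\frac{m^{2}}{\Delta^{2}}
  = \frac{m^{2.5}}{\Delta^{2}},
\qquad
m = 100,\ \Delta = 10:\quad
\frac{100^{2.5}}{10^{2}} = \frac{10^{5}}{10^{2}} = 10^{3}.
```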

2602.05961 2026-05-29 cs.LG stat.ML 版本更新

Discrete diffusion samplers and bridges: Off-policy algorithms and applications in latent spaces

离散扩散采样器与桥:离策略算法及其在潜在空间中的应用

Arran Carter, Sanghyeok Choi, Kirill Tamogashev, Víctor Elvira, Esmeralda S. Whitammer

发表机构 * University of Edinburgh(爱丁堡大学) CIFAR Fellow(加拿大高等研究院(CIFAR)研究员)

AI总结 提出离策略训练技术改进离散扩散采样器性能,并首次引入离散域的数据到能量薛定谔桥训练,应用于图像生成模型的离散潜在空间中的无数据后验采样。

Comments ICML 2026. Code: https://github.com/mmacosha/offpolicy-discrete-diffusion-samplers-and-bridges

详情
AI中文摘要

从仅在相差一个归一化常数的意义下已知的分布 $p(x) \propto e^{-\mathcal{E}(x)}$ 中采样是统计学中一个重要且具有挑战性的问题。近年来,出现了一类新的摊销采样算法,通常称为扩散采样器,能够从未归一化的密度中快速高效地采样。这类算法在连续空间采样任务中已被广泛研究;然而,它们在离散空间问题中的应用仍基本未被探索。尽管该领域已取得一些进展,但离散扩散采样器并未充分利用连续空间采样中常用的思想。在本文中,我们提出通过引入离散扩散采样器的离策略训练技术来弥合这一差距。我们证明这些技术在已有和新颖的合成基准上提高了离散采样器的性能。接下来,我们将离散扩散采样器推广到两个任意分布之间的桥接任务,首次为离散域引入了数据到能量薛定谔桥训练。最后,我们展示了所提出的扩散采样器在图像生成模型的离散潜在空间中进行无数据后验采样的应用。

英文摘要

Sampling from a distribution $p(x) \propto e^{-\mathcal{E}(x)}$ known up to a normalising constant is an important and challenging problem in statistics. Recent years have seen the rise of a new family of amortised sampling algorithms, commonly referred to as diffusion samplers, that enable fast and efficient sampling from an unnormalised density. Such algorithms have been widely studied for continuous-space sampling tasks; however, their application to problems in discrete space remains largely unexplored. Although some progress has been made in this area, discrete diffusion samplers do not take full advantage of ideas commonly used for continuous-space sampling. In this paper, we propose to bridge this gap by introducing off-policy training techniques for discrete diffusion samplers. We show that these techniques improve the performance of discrete samplers on both established and new synthetic benchmarks. Next, we generalise discrete diffusion samplers to the task of bridging between two arbitrary distributions, introducing data-to-energy Schrödinger bridge training for the discrete domain for the first time. Lastly, we showcase the application of the proposed diffusion samplers to data-free posterior sampling in the discrete latent spaces of image generative models.

2602.03582 2026-05-29 cs.LG 版本更新

Optimization and Generation in Aerodynamics Inverse Design

气动逆设计中的优化与生成

Huaguan Chen, Ning Lin, Luxi Chen, Jiacheng Cen, Rui Zhang, Wenbing Huang, Chongxuan Li, Hao Sun

AI总结 本文提出一个概率框架,将视觉特征保持与气动性能优化统一为目标,通过重加权学习分布实现优化和引导生成,实验表明在车辆和飞机设计中显著降低阻力同时保持视觉一致性。

详情
AI中文摘要

气动逆设计可以提高车辆和飞机的效率,但实际设计很少只追求性能:车辆改进必须在降低阻力的同时保留与设计语言、品牌识别和用户感知相关的视觉特征。传统的CFD驱动优化虽然准确,但用于大范围探索时速度较慢;当前基于学习的方法仍主要由性能驱动,缺乏连接优化、生成和视觉一致性的连贯目标。这里我们将视觉保持和气动改进表述为一个概率目标。与参考形状或视图一致的设计定义了一个学习的视觉设计分布,该分布通过气动成本重新加权。优化将初始几何体细化为低成本、高概率的设计,而引导生成从相同的输入视图中采样更低成本的3D候选。OpenFOAM评估显示,保持视觉特征的优化相对于初始车辆将车辆阻力降低5.8%,相对于初始飞机将最佳有效飞机阻力-升力目标降低28.8%,同时保持输入视觉特征。对于基于视图的生成,引导相对于从同一视图直接生成将车辆阻力降低3.0%,飞机阻力-升力目标降低68.6%,同时保持视觉一致性。使用3D打印车辆原型的风洞测试提供了独立的尾流级检查,受控分析解释了这些结果背后的分布机制。这项工作为保持视觉特征的气动改进和早期3D设计探索提供了概率基础和实用途径。

英文摘要

Aerodynamic inverse design can improve vehicle and aircraft efficiency, but practical design rarely seeks performance alone: vehicle refinement must reduce drag while preserving visual features linked to design language, brand recognition and user perception. Traditional CFD-driven optimization is accurate but slow for broad exploration, and current learning-based methods are still largely performance-driven and lack a coherent target linking optimization, generation and visual consistency. Here we formulate visual preservation and aerodynamic improvement as one probability target. Designs consistent with a reference shape or view define a learned visual design distribution, which is reweighted by aerodynamic cost. Optimization then refines an initial geometry toward a low-cost, high-probability design, whereas guided generation samples lower-cost 3D candidates from the same input view. OpenFOAM evaluation shows that visual-feature-preserving optimization reduces vehicle drag by 5.8% relative to the initial vehicle and reduces the best valid aircraft drag-to-lift objective by 28.8% relative to the initial aircraft while preserving input visual features. For view-based generation, guidance reduces vehicle drag by 3.0% and the aircraft drag-to-lift objective by 68.6% relative to direct generation from the same view, while maintaining visual consistency. Wind-tunnel tests with 3D-printed vehicle prototypes provide an independent wake-level check, and controlled analyses explain the distributional mechanisms behind these results. This work provides a probabilistic foundation and practical route for visual-feature-preserving aerodynamic refinement and early-stage 3D design exploration.

2602.03357 2026-05-29 cs.LG math.OC 版本更新

Achieving Linear Speedup for Composite Federated Learning

实现复合联邦学习的线性加速

Kun Huang, Shi Pu, Karl Henrik Johansson

发表机构 * KTH Royal Institute of Technology(瑞典皇家理工学院) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出基于法向映射的FedNMap方法,通过法向映射更新处理非光滑项并采用局部校正策略减轻数据异质性,在非凸损失下首次实现关于客户端数和本地更新次数的线性加速。

Comments 38 pages, 19 figures

详情
AI中文摘要

本文提出了FedNMap,一种基于法向映射的复合联邦学习方法,其中目标函数由光滑损失和可能非光滑的正则化项组成。FedNMap利用基于法向映射的更新方案来处理非光滑项,并采用局部校正策略来减轻客户端间数据异质性的影响。在标准假设下,包括光滑局部损失、正则化项的弱凸性以及有界随机梯度方差,FedNMap在非凸损失下(无论是否满足Polyak-Łojasiewicz条件)实现了关于客户端数和本地更新次数的线性加速。据我们所知,这是首个为非凸复合联邦学习建立线性加速的算法。数值实验证实了我们的理论发现,并展示了FedNMap的线性加速性能。

英文摘要

This paper proposes FedNMap, a normal map-based method for composite federated learning, where the objective consists of a smooth loss and a possibly nonsmooth regularizer. FedNMap leverages a normal map-based update scheme to handle the nonsmooth term and incorporates a local correction strategy to mitigate the impact of data heterogeneity across clients. Under standard assumptions, including smooth local losses, weak convexity of the regularizer, and bounded stochastic gradient variance, FedNMap achieves linear speedup with respect to both the number of clients and the number of local updates for nonconvex losses, both with and without the Polyak-Łojasiewicz condition. To the best of our knowledge, this is the first algorithm establishing linear speedup for nonconvex composite federated learning. Numerical experiments corroborate our theoretical findings and demonstrate the linear speedup of FedNMap.
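For readers unfamiliar with normal maps, the sketch below evaluates the standard normal map $F_{\mathrm{nor}}(z) = \nabla f(\mathrm{prox}_{\lambda\varphi}(z)) + \lambda^{-1}(z - \mathrm{prox}_{\lambda\varphi}(z))$ on a toy composite problem and checks that it vanishes at the point associated with the minimizer. This is a single-machine illustration under assumptions, not the FedNMap update itself: the quadratic loss, the $\ell_1$ regularizer, and the names `soft_threshold` and `normal_map` are all chosen here for illustration.

```python
import numpy as np

def soft_threshold(z, tau):
    """Proximal operator of tau * ||x||_1."""
    return np.sign(z) * np.maximum(np.abs(z) - tau, 0.0)

def normal_map(z, grad_f, lam, mu):
    """F_nor(z) = grad_f(prox(z)) + (z - prox(z)) / lam, for phi = mu * ||x||_1."""
    x = soft_threshold(z, lam * mu)
    return grad_f(x) + (z - x) / lam

# Toy composite problem (assumption): f(x) = 0.5 * ||x - a||^2, phi(x) = mu * ||x||_1.
a = np.array([2.0, -0.2])
mu, lam = 0.5, 1.0
grad_f = lambda x: x - a

x_star = soft_threshold(a, mu)          # closed-form minimizer of f + phi
z_star = x_star - lam * grad_f(x_star)  # auxiliary point paired with x_star
residual = normal_map(z_star, grad_f, lam, mu)  # should be numerically zero
```

The appeal of working with `z` rather than `x` is that stationarity becomes a root-finding condition on a single-valued map, even though the regularizer is nonsmooth.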

2602.02103 2026-05-29 cs.LG cs.CL 版本更新

How Far Ahead Do LLMs Plan? Uncovering the Latent Horizon in Chain-of-Thought Reasoning

LLMs 能提前多远规划?揭示思维链推理中的潜在视界

Liyan Xu, Mo Yu, Fandong Meng, Jie Zhou

发表机构 * WeChat AI, Tencent Inc(腾讯公司微信AI)

AI总结 通过探测方法 Tele-Lens 研究 LLMs 在思维链推理中的潜在规划能力,发现其具有短视视界,并基于此提出利用稀疏枢轴位置增强不确定性估计及自动识别 CoT 绕过的假设。

Comments Accepted to ICML 2026

详情
AI中文摘要

思维链推理已成为激发大型语言模型多步推理的核心机制。然而,近期证据呈现一种矛盾:隐藏状态似乎在 CoT 完全展开之前就已经编码了未来的推理,而显式步骤对于需要组合计算的任务仍然至关重要。为了加深对 LLM 内部状态与其言语化推理轨迹之间关系的理解,我们通过探测方法 Tele-Lens 研究了 LLMs 的潜在规划强度,该方法应用于跨不同任务领域的隐藏状态。我们的实证结果表明,LLMs 表现出短视视界,主要进行增量转换,而没有精确的全局规划。利用这一特性,我们提出了一个增强 CoT 不确定性估计的假设,并通过实验验证了一组稀疏的枢轴位置可以有效地代表整个路径的不确定性。我们进一步强调了利用 CoT 动态的重要性,并证明了可以在不降低性能的情况下实现 CoT 绕过的自动识别。我们的代码、数据和模型发布于 https://github.com/lxucs/tele-lens。

英文摘要

Chain-of-thought (CoT) reasoning has become a central mechanism for eliciting multi-step reasoning in Large Language Models (LLMs). Yet recent evidence presents a tension: hidden states appear to already encode future reasoning before CoT fully unfolds, while explicit steps remain crucial for tasks requiring compositional computation. To deepen the understanding of the relationship between an LLM's internal states and its verbalized reasoning trajectories, we investigate the latent planning strength of LLMs with our probing method, Tele-Lens, applied to hidden states across diverse task domains. Our empirical results indicate that LLMs exhibit a myopic horizon, primarily conducting incremental transitions without precise global planning. Leveraging this characteristic, we propose a hypothesis for enhancing uncertainty estimation of CoT and validate that a sparse set of pivot positions can effectively represent the uncertainty of the entire path. We further underscore the significance of exploiting CoT dynamics, and demonstrate that automatic recognition of CoT bypass can be achieved without performance degradation. Our code, data and models are released at https://github.com/lxucs/tele-lens.

2602.01456 2026-05-29 cs.LG cs.CV 版本更新

Rectified LpJEPA: Joint-Embedding Predictive Architectures with Sparse and Maximum-Entropy Representations

Rectified LpJEPA:具有稀疏和最大熵表示的联合嵌入预测架构

Yilun Kuang, Yash Dagade, Tim G. J. Rudner, Randall Balestriero, Yann LeCun

发表机构 * New York University(纽约大学) Duke University(杜克大学) University of Toronto(多伦多大学) Brown University(布朗大学)

AI总结 提出Rectified Distribution Matching Regularization (RDMReg)损失,通过将表示对齐到Rectified Generalized Gaussian分布,实现稀疏且最大熵的表示,从而改进联合嵌入预测架构(JEPA)的性能。

Comments ICML 2026

详情
AI中文摘要

联合嵌入预测架构(JEPA)学习视角不变表示,并采用基于投影的分布匹配来防止崩溃。现有方法将表示正则化为各向同性高斯分布,但固有地偏向密集表示,未能捕捉高效表示中观察到的稀疏性关键特性。我们引入了Rectified Distribution Matching Regularization (RDMReg),这是一种切片双样本分布匹配损失,将表示对齐到Rectified Generalized Gaussian (RGG)分布。RGG通过整流显式控制期望的$\ell_0$范数,而其连续截断部分在期望$\ell_p$范数和支撑约束下具有最大熵特性。将RDMReg应用于JEPA得到Rectified LpJEPA,它严格推广了先前基于高斯的JEPA。实验表明,Rectified LpJEPA学习到稀疏、非负的表示,具有有利的稀疏性-性能权衡,并在图像分类基准上取得了有竞争力的下游性能,表明RDMReg可以在保留任务相关信息的同时强制执行稀疏性。

英文摘要

Joint-Embedding Predictive Architectures (JEPA) learn view-invariant representations and admit projection-based distribution matching for collapse prevention. Existing approaches regularize representations towards isotropic Gaussian distributions, but inherently favor dense representations and fail to capture the key property of sparsity observed in efficient representations. We introduce Rectified Distribution Matching Regularization (RDMReg), a sliced two-sample distribution-matching loss that aligns representations to a Rectified Generalized Gaussian (RGG) distribution. RGG enables explicit control over expected $\ell_0$ norm through rectification, while its continuous truncated component admits a maximum-entropy characterization under expected $\ell_p$ norm and support constraints. Equipping JEPAs with RDMReg yields Rectified LpJEPA, which strictly generalizes prior Gaussian-based JEPAs. Empirically, Rectified LpJEPA learns sparse, non-negative representations with favorable sparsity-performance trade-offs and competitive downstream performance on image classification benchmarks, showing that RDMReg can enforce sparsity while preserving task-relevant information.

2601.23156 2026-05-29 cs.LG cs.FL 版本更新

Unsupervised Hierarchical Skill Discovery

无监督层次化技能发现

Damion Harvey, Geraud Nangue Tasse, Benjamin Rosman, Branden Ingram, Steven James

发表机构 * University of the Witwatersrand(金山大学) Neural Discovery (MIND) Institute(神经发现(MIND)研究所)

AI总结 提出一种基于语法的无监督方法,从无标签轨迹中分割技能并构建层次结构,在像素级环境(如Craftax和Minecraft)中优于现有基线,并能加速下游强化学习任务。

Comments Accepted to ICML 2026. 27 pages. 15 figures

详情
AI中文摘要

我们考虑强化学习中无监督技能分割和层次结构发现的问题。虽然最近的方法试图将轨迹分割为可重用的技能或选项,但大多数依赖于动作标签、奖励或手工注释,限制了其适用性。我们提出了一种方法,将未标记的轨迹分割成技能,并使用基于语法的方法在它们之上诱导出层次结构。得到的层次结构既捕获了低级行为,也捕获了它们组合成高级技能的过程。我们在高维、基于像素的环境中评估了我们的方法,包括Craftax和完整、未修改版本的Minecraft。使用技能分割、重用和层次质量的指标,我们发现我们的方法始终比现有基线产生更结构化和语义上有意义的层次结构。此外,作为概念验证,我们证明了这些发现的层次结构加速并稳定了下游强化学习任务的学习。

英文摘要

We consider the problem of unsupervised skill segmentation and hierarchical structure discovery in reinforcement learning. While recent approaches have sought to segment trajectories into reusable skills or options, most rely on action labels, rewards, or handcrafted annotations, limiting their applicability. We propose a method that segments unlabelled trajectories into skills and induces a hierarchical structure over them using a grammar-based approach. The resulting hierarchy captures both low-level behaviours and their composition into higher-level skills. We evaluate our approach in high-dimensional, pixel-based environments, including Craftax and the full, unmodified version of Minecraft. Using metrics for skill segmentation, reuse, and hierarchy quality, we find that our method consistently produces more structured and semantically meaningful hierarchies than existing baselines. Furthermore, as a proof of concept, we demonstrate that these discovered hierarchies accelerate and stabilise learning on downstream reinforcement learning tasks.

2601.22274 2026-05-29 cs.LG 版本更新

Server-Proximal Aggregation for Federated Domain-Incremental Learning under Partial Participation: Task-Uniform Convergence and Backward Transfer

部分参与下联邦域增量学习的服务器近端聚合:任务均匀收敛与反向迁移

Longtao Xu, Jian Li

发表机构 * Stony Brook University, New York(石溪大学,纽约)

AI总结 针对联邦域增量学习(FDIL)中客户端异构、任务顺序到达且标签空间固定的场景,提出无记忆算法SPECIAL,通过服务器端轻量近端项抑制累积漂移,实现反向知识迁移(BKT)保证和首个部分参与下的非凸收敛速率O((E/NT)^(1/2))。

Comments Accepted in ICML2026

详情
AI中文摘要

现实联邦系统很少在静态数据上运行:输入分布漂移,而隐私规则禁止原始数据共享。我们将此设置研究为联邦域增量学习(FDIL),其中(i)客户端是异构的,(ii)任务顺序到达且域不断变化,但(iii)标签空间保持不变。在现实部署下,FDIL仍然缺少两个理论支柱:反向知识迁移(BKT)的保证以及在部分参与下所有任务序列上的收敛速率。我们引入SPECIAL(服务器近端高效持续聚合学习),一种简单的、无记忆的FDIL算法,它在标准FedAvg中添加了一个单服务器端“锚点”:在每一轮中,服务器通过一个轻量近端项,将均匀采样的参与客户端的更新推向先前的全局模型。该锚点无需重放缓冲区、合成数据或任务特定头部即可抑制累积漂移,保持通信和模型大小不变。我们的理论表明,SPECIAL(i)保留了早期任务:BKT界限将先前任务损失的任意增加限制为一个漂移控制项,该漂移控制项随着更多轮次、本地周期和参与客户端而缩小;(ii)在所有任务上高效学习:首个针对部分参与下FDIL的通信高效非凸收敛速率,O((E/NT)^(1/2)),其中E为本地周期数,T为通信轮数,N为每轮参与客户端数,与单任务FedAvg匹配,同时明确区分优化方差和任务间漂移。实验结果进一步证明了SPECIAL的有效性。

英文摘要

Real-world federated systems seldom operate on static data: input distributions drift while privacy rules forbid raw-data sharing. We study this setting as Federated Domain-Incremental Learning (FDIL), where (i) clients are heterogeneous, (ii) tasks arrive sequentially with shifting domains, yet (iii) the label space remains fixed. Two theoretical pillars remain missing for FDIL under realistic deployment: a guarantee of backward knowledge transfer (BKT) and a convergence rate that holds across the sequence of all tasks with partial participation. We introduce SPECIAL (Server-Proximal Efficient Continual Aggregation for Learning), a simple, memory-free FDIL algorithm that adds a single server-side "anchor" to vanilla FedAvg: in each round, the server nudges the aggregated update of the uniformly sampled participating clients toward the previous global model with a lightweight proximal term. This anchor curbs cumulative drift without replay buffers, synthetic data, or task-specific heads, keeping communication and model size unchanged. Our theory shows that SPECIAL (i) preserves earlier tasks: a BKT bound caps any increase in prior-task loss by a drift-controlled term that shrinks with more rounds, local epochs, and participating clients; and (ii) learns efficiently across all tasks: the first communication-efficient non-convex convergence rate for FDIL with partial participation, O((E/NT)^(1/2)), with E local epochs, T communication rounds, and N participating clients per round, matching single-task FedAvg while explicitly separating optimization variance from inter-task drift. Experimental results further demonstrate the effectiveness of SPECIAL.
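One plausible reading of a server-side proximal anchor is sketched below. It assumes the server solves $\min_w \tfrac12\|w - \bar w\|^2 + \tfrac{\lambda}{2}\|w - w_{\text{prev}}\|^2$, where $\bar w$ is the average of the sampled clients' models; the closed-form solution is a convex combination of the two. The objective, the weight `lam`, and the function name `server_proximal_step` are assumptions for illustration, and the paper's exact update may differ.

```python
import numpy as np

def server_proximal_step(prev_global, client_models, lam):
    """Closed-form minimizer of 0.5*||w - avg||^2 + 0.5*lam*||w - prev_global||^2.

    `client_models` holds one parameter vector per sampled client (rows).
    lam = 0 recovers plain averaging; larger lam stays closer to prev_global.
    """
    avg = np.mean(client_models, axis=0)
    return (avg + lam * prev_global) / (1.0 + lam)

# Toy round with two sampled clients (values are arbitrary).
prev_global = np.array([0.0, 0.0])
client_models = np.array([[1.0, 2.0], [3.0, 2.0]])
plain_avg = server_proximal_step(prev_global, client_models, lam=0.0)
anchored = server_proximal_step(prev_global, client_models, lam=1.0)
```

Under this reading the anchor adds no communication and no extra state beyond the previous global model, which the server already holds.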

2601.21568 2026-05-29 cs.LG 版本更新

Bridging Functional and Representational Similarity via Usable Information

通过可用信息桥接功能相似性与表征相似性

Antonio Almudévar, Alfonso Ortega

发表机构 * ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain(ViVoLab,阿拉贡工程研究院(I3A),萨拉戈萨大学,西班牙萨拉戈萨)

AI总结 提出一个基于可用信息的统一框架,从功能相似性、表征相似性及其关系三个维度进行理论和实证综合,揭示表征相似性是功能相似性的充分非必要条件。

详情
AI中文摘要

我们提出了一个通过可用信息量化表征之间相似性的统一框架,在三个关键维度上提供了严格的理论和实证综合。首先,针对功能相似性,我们建立了拼接性能与条件互信息之间的形式化联系。我们进一步揭示拼接本质上是非对称的,证明稳健的功能比较需要双向分析而非单向映射。其次,关于表征相似性,我们发现基于重构的指标和标准工具(如CKA、RSA)在特定约束下充当可用信息的估计量。关键的是,我们表明相似性是相对于预测族的能力而言的:对刚性观察者而言不同的表征,对更具表达力的观察者可能是相同的。第三,我们证明表征相似性是功能相似性的充分非必要条件。我们通过任务粒度层次统一这些概念:复杂任务上的相似性保证了任何更粗粒度衍生任务上的相似性,将表征相似性确立为最大粒度的极限:输入重构。

英文摘要

We present a unified framework for quantifying the similarity between representations through the lens of usable information, offering a rigorous theoretical and empirical synthesis across three key dimensions. First, addressing functional similarity, we establish a formal link between stitching performance and conditional mutual information. We further reveal that stitching is inherently asymmetric, demonstrating that robust functional comparison necessitates a bidirectional analysis rather than a unidirectional mapping. Second, concerning representational similarity, we find that reconstruction-based metrics and standard tools (e.g., CKA, RSA) act as estimators of usable information under specific constraints. Crucially, we show that similarity is relative to the capacity of the predictive family: representations that appear distinct to a rigid observer may be identical to a more expressive one. Third, we demonstrate that representational similarity is sufficient but not necessary for functional similarity. We unify these concepts through a task-granularity hierarchy: similarity on a complex task guarantees similarity on any coarser derivative, establishing representational similarity as the limit of maximum granularity: input reconstruction.

2601.21564 2026-05-29 cs.LG 版本更新

Representation Unlearning: Forgetting through Information Compression

表示遗忘:通过信息压缩实现遗忘

Antonio Almudévar, Alfonso Ortega

发表机构 * ViVoLab, Aragón Institute for Engineering Research (I3A), University of Zaragoza, Zaragoza, Spain(ViVoLab,阿拉贡工程研究院(I3A),萨拉戈萨大学,西班牙萨拉戈萨)

AI总结 提出表示遗忘框架,通过在模型表示空间学习信息瓶颈变换来直接执行遗忘,无需修改模型参数,实现可靠遗忘、保持效用和计算高效。

详情
AI中文摘要

机器遗忘旨在消除特定训练数据对模型的影响,这一需求由隐私法规和鲁棒性关注驱动。现有方法通常修改模型参数,但此类更新可能不稳定、计算成本高且受局部近似限制。我们引入表示遗忘,一个直接在模型表示空间中执行遗忘的框架。我们不修改模型参数,而是学习一个对表示施加信息瓶颈的变换:最大化与保留数据的互信息,同时抑制关于待遗忘数据的信息。我们推导出使这一目标可处理的变分替代,并展示如何在两种实际场景中实例化:当保留和遗忘数据都可用时,以及在仅能访问遗忘数据的零样本设置中。在多个基准上的实验表明,与以参数为中心的基线相比,表示遗忘实现了更可靠的遗忘、更好的效用保持和更高的计算效率。

英文摘要

Machine unlearning seeks to remove the influence of specific training data from a model, a need driven by privacy regulations and robustness concerns. Existing approaches typically modify model parameters, but such updates can be unstable, computationally costly, and limited by local approximations. We introduce Representation Unlearning, a framework that performs unlearning directly in the model's representation space. Instead of modifying model parameters, we learn a transformation over representations that imposes an information bottleneck: maximizing mutual information with retained data while suppressing information about data to be forgotten. We derive variational surrogates that make this objective tractable and show how they can be instantiated in two practical regimes: when both retain and forget data are available, and in a zero-shot setting where only forget data can be accessed. Experiments across several benchmarks demonstrate that Representation Unlearning achieves more reliable forgetting, better utility retention, and greater computational efficiency than parameter-centric baselines.

2601.19947 2026-05-29 cs.LG cs.AI cs.CV 版本更新

NCSAM: Noise-Compensated Sharpness-Aware Minimization for Noisy Label Learning

NCSAM: 噪声补偿的锐度感知最小化用于噪声标签学习

Jiayu Xu, Junbiao Pang

发表机构 * Beijing University of Technology(北京工业大学)

AI总结 提出NCSAM方法,通过噪声补偿扰动修正噪声标签引起的优化偏差,缓解对噪声标签的记忆,在合成和真实噪声标签基准上优于SAM基线。

Comments 11 pages, 1 figure, 8 tables. Major revision of v1: revised PAC-Bayesian theoretical analysis, clarified the NCSAM formulation, added appendix derivations, reorganized experiments and ablations, updated related work, citations, writing, and author list

详情
AI中文摘要

从噪声标签学习(LNL)仍然是深度学习中的一个基本挑战,因为现实世界的数据集通常包含损坏的注释。大多数现有方法依赖于标签校正或样本选择机制。相比之下,我们从优化角度研究LNL,通过建立标签噪声与锐度感知最小化(SAM)的平坦性寻求行为之间的理论联系。基于此分析,我们提出了噪声补偿的锐度感知最小化(NCSAM),它使用噪声补偿扰动来抵消由噪声标签引起的优化偏差。通过纠正失真的SAM扰动,NCSAM在训练过程中减轻了对噪声标签的记忆,同时保持了基于优化的学习的简单性。在合成和真实噪声标签基准上的实验表明,NCSAM在基于SAM的优化基线上持续改进,并与代表性的噪声标签学习方法保持竞争力。

英文摘要

Learning from Noisy Labels (LNL) remains a fundamental challenge in deep learning because real-world datasets often contain corrupted annotations. Most existing methods rely on label correction or sample selection mechanisms. In contrast, we study LNL from an optimization perspective by establishing a theoretical connection between label noise and the flatness-seeking behavior of Sharpness-Aware Minimization (SAM). Based on this analysis, we propose Noise-Compensated Sharpness-Aware Minimization (NCSAM), which uses a noise-compensated perturbation to counteract the optimization bias induced by noisy labels. By correcting distorted SAM perturbations, NCSAM mitigates the memorization of noisy labels during training while preserving the simplicity of optimization-based learning. Experiments on synthetic and real-world noisy-label benchmarks show that NCSAM consistently improves over SAM-based optimization baselines and remains competitive with representative noisy-label learning methods.
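For context, the baseline that NCSAM modifies can be sketched in a few lines. The code below implements one step of plain SAM (ascend to a nearby worst-case point, then descend using the gradient taken there). It does not implement the noise-compensated perturbation, whose form is not given in the abstract; the function name `sam_step` and the toy quadratic loss are assumptions for illustration.

```python
import numpy as np

def sam_step(w, grad_fn, lr=0.1, rho=0.05):
    """One step of vanilla Sharpness-Aware Minimization.

    1. Perturb the weights along the normalized gradient with radius rho.
    2. Update the original weights with the gradient at the perturbed point.
    """
    g = grad_fn(w)
    eps = rho * g / (np.linalg.norm(g) + 1e-12)
    return w - lr * grad_fn(w + eps)

# Toy loss (assumption): f(w) = 0.5 * ||w||^2, so grad f(w) = w.
w0 = np.array([1.0, 0.0])
w1 = sam_step(w0, lambda w: w)
```

With noisy labels the gradient `g` that defines the perturbation direction is itself corrupted, which is the distortion the abstract says NCSAM aims to correct.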

2601.14855 2026-05-29 cs.LG 版本更新

Adaptive Exponential Integration for Stable Gaussian Mixture Black-Box Variational Inference

自适应指数积分用于稳定高斯混合黑箱变分推断

Baojun Che, Yifan Chen, Daniel Zhengyu Huang, Xinying Mao, Weijie Wang

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China(北京大学数学科学学院) Department of Mathematics, University of California, Los Angeles, CA, USA(加州大学洛杉矶分校数学系) Beijing International Center for Mathematical Research, Center for Machine Learning Research, Peking University, Beijing, China(北京国际数学研究中心,机器学习研究中心,北京大学,北京,中国)

AI总结 针对高斯混合黑箱变分推断的不稳定和低效问题,提出结合仿射不变预处理、无条件保持协方差正定性的指数积分器和自适应时间步长的稳定高效框架,并证明其收敛性。

Comments 41 pages, 10 figures

详情
AI中文摘要

黑箱变分推断(BBVI)结合高斯混合族提供了一种灵活的方法来近似复杂的后验分布,无需目标密度的梯度。然而,标准的数值优化方法常常遭受不稳定和低效的问题。我们开发了一个稳定高效的框架,结合了三个关键组件:(1)通过自然梯度公式实现的仿射不变预处理,(2)无条件保持协方差矩阵正定性的指数积分器,以及(3)自适应时间步长以确保稳定性并适应不同的预热和收敛阶段。所提出的方法与流形优化和镜像下降有自然联系。对于高斯后验,我们证明了在无噪声设置下的指数收敛性和在蒙特卡洛估计下的几乎必然收敛性,严格论证了自适应时间步长的必要性。在多模态分布、Neal多尺度漏斗以及基于PDE的达西流贝叶斯逆问题上的数值实验证明了所提方法的有效性。

英文摘要

Black-box variational inference (BBVI) with Gaussian mixture families offers a flexible approach for approximating complex posterior distributions without requiring gradients of the target density. However, standard numerical optimization methods often suffer from instability and inefficiency. We develop a stable and efficient framework that combines three key components: (1) affine-invariant preconditioning via natural gradient formulations, (2) an exponential integrator that unconditionally preserves the positive definiteness of covariance matrices, and (3) adaptive time stepping to ensure stability and to accommodate distinct warm-up and convergence phases. The proposed approach has natural connections to manifold optimization and mirror descent. For Gaussian posteriors, we prove exponential convergence in the noise-free setting and almost-sure convergence under Monte Carlo estimation, rigorously justifying the necessity of adaptive time stepping. Numerical experiments on multimodal distributions, Neal's multiscale funnel, and a PDE-based Bayesian inverse problem for Darcy flow demonstrate the effectiveness of the proposed method.
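The claim that an exponential integrator preserves positive definiteness can be illustrated generically. The sketch below compares an explicit Euler step with a multiplicative update of the form $C^{1/2}\exp(\Delta t\, S)\,C^{1/2}$ for a symmetric direction $S$. This particular update form, the direction `S`, and the helper names are assumptions chosen for illustration; the paper's actual scheme for Gaussian mixtures may differ.

```python
import numpy as np

def sym_fn(a, fn):
    """Apply a scalar function to a symmetric matrix via its eigendecomposition."""
    vals, vecs = np.linalg.eigh(a)
    return (vecs * fn(vals)) @ vecs.T

def exp_step(cov, sym_dir, dt):
    """Multiplicative update C^{1/2} exp(dt * S) C^{1/2}; positive definite for any dt."""
    root = sym_fn(cov, np.sqrt)
    return root @ sym_fn(dt * sym_dir, np.exp) @ root

def euler_step(cov, sym_dir, dt):
    """Explicit Euler update C + dt * C^{1/2} S C^{1/2}; can leave the SPD cone."""
    root = sym_fn(cov, np.sqrt)
    return cov + dt * root @ sym_dir @ root

cov = np.eye(2)
S = np.diag([-3.0, 0.5])  # a strongly contracting direction (arbitrary toy choice)
min_eig_exp = np.linalg.eigvalsh(exp_step(cov, S, 1.0)).min()
min_eig_euler = np.linalg.eigvalsh(euler_step(cov, S, 1.0)).min()
```

In this toy case the Euler step produces a negative eigenvalue while the exponential step stays positive definite, because the matrix exponential of a symmetric matrix has strictly positive eigenvalues regardless of step size.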

2601.14758 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Mechanism Shift During Post-training from Autoregressive to Masked Diffusion Language Models

从自回归到掩码扩散语言模型的后训练中的机制转变

Injin Kong, Hyoungjoon Lee, Yohan Jo

发表机构 * Graduate School of Data Science, Seoul National University(首尔国立大学数据科学研究生院) Department of Biosystems & Biomaterials Science and Engineering, Seoul National University(首尔国立大学生物系统与生物材料科学与工程系)

AI总结 通过比较电路分析,发现后训练得到的掩码扩散模型在结构上根据任务保留或重组自回归电路,在语义上从局部专业化转向分布式整合,表明扩散后训练是内部计算的深度重组。

详情
AI中文摘要

将预训练的自回归模型(ARMs)后训练为掩码扩散模型(MDMs)已成为一种克服顺序生成局限性的经济有效方法。然而,后训练的MDMs是否获得了真正的新计算机制,还是仅仅以非自回归形式重新表达了自回归计算,仍不清楚。通过对ARMs及其从相同骨干网络后训练得到的MDM对应物进行电路比较分析,我们揭示了两个互补的重组轴。在结构上,转变是任务依赖的:MDMs在局部因果任务上保留自回归电路,但在全局任务上放弃继承的路径并将计算前置到早期层。在语义上,转变在不同机制间是一致的:ARMs中尖锐的局部专业化让位于MDMs中的分布式整合。这些发现共同表明,扩散后训练并非生成过程的表面变化,而是内部计算的重组,其深度取决于任务。

英文摘要

Post-training pretrained autoregressive models (ARMs) into masked diffusion models (MDMs) has emerged as a cost-effective way to overcome the limitations of sequential generation. Yet it remains unclear whether post-trained MDMs acquire genuinely new computational mechanisms or merely re-express autoregressive computation in a non-autoregressive form. Through a comparative circuit analysis of ARMs and their MDM counterparts post-trained from the same backbones, we uncover two complementary axes of reorganization. Structurally, the shift is task-dependent: MDMs preserve autoregressive circuitry on locally causal tasks but abandon inherited pathways and front-load computation into early layers on global tasks. Semantically, the shift is consistent across regimes: sharp, localized specialization in ARMs gives way to distributed integration in MDMs. Together, these findings show that diffusion post-training is not a surface-level change in the generation procedure but a reorganization of internal computation whose depth depends on the task.

2601.04765 2026-05-29 cs.CL cs.AI cs.LG physics.comp-ph 版本更新

Differential syntactic and semantic encoding in LLMs

大型语言模型中句法与语义的差异编码

Santiago Acevedo, Alessandro Laio, Marco Baroni

发表机构 * Catalan Institute of Research and Advanced Studies (ICREA) and Universitat Pompeu Fabra (UPF)(加泰罗尼亚高等研究院(ICREA)和庞培法布拉大学(UPF))

AI总结 本研究通过平均共享句法结构或语义的句子隐藏表示向量,发现大型语言模型(以DeepSeek-V3为例)的内部层表示中句法和语义信息至少部分线性编码,且两者编码轮廓不同,可一定程度解耦。

Comments Published as conference paper at ICML 2026

详情
AI中文摘要

我们研究了句法和语义信息如何在大型语言模型(LLMs)的内部层表示中编码,重点关注非常大的DeepSeek-V3。我们发现,通过平均共享句法结构或语义的句子的隐藏表示向量,我们得到了能够捕获表示中相当大比例的句法和语义信息的向量。特别是,从句子向量中减去这些句法和语义“质心”会强烈影响它们与句法和语义匹配句子的相似性,这表明句法和语义至少部分地线性编码。我们还发现句法和语义的跨层编码轮廓不同,并且这两种信号可以在一定程度上解耦,这表明LLM表示中这两种语言信息的差异编码。

英文摘要

We study how syntactic and semantic information is encoded in inner layer representations of Large Language Models (LLMs), focusing on the very large DeepSeek-V3. We find that, by averaging hidden-representation vectors of sentences sharing syntactic structure or meaning, we obtain vectors that capture a significant proportion of the syntactic and semantic information contained in the representations. In particular, subtracting these syntactic and semantic "centroids" from sentence vectors strongly affects their similarity with syntactically and semantically matched sentences, respectively, suggesting that syntax and semantics are, at least partially, linearly encoded. We also find that the cross-layer encoding profiles of syntax and semantics are different, and that the two signals can to some extent be decoupled, suggesting differential encoding of these two types of linguistic information in LLM representations.
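The centroid-subtraction idea in this abstract can be illustrated with a minimal sketch. It is not the authors' code: the three toy vectors and the helper names `mean_vector`, `subtract`, and `cosine` are assumptions made for illustration, and real hidden states would come from a model layer rather than being written by hand.

```python
import math

def mean_vector(vectors):
    """Component-wise mean of equal-length vectors (the group 'centroid')."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def subtract(u, v):
    """Element-wise difference u - v."""
    return [a - b for a, b in zip(u, v)]

def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical hidden vectors for three sentences that share one property
# (for example the same syntactic template), encoded here along axis 0.
group = [
    [5.0, 1.0, 0.0],
    [5.0, 0.0, 1.0],
    [5.0, -1.0, -1.0],
]

centroid = mean_vector(group)
before = cosine(group[0], group[1])
after = cosine(subtract(group[0], centroid), subtract(group[1], centroid))
print(f"similarity before: {before:.3f}, after removing centroid: {after:.3f}")
```

In this toy setting the shared component dominates the similarity, so removing the centroid makes the two vectors orthogonal. A drop of this kind is what the abstract reads as evidence of partially linear encoding.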

2601.00065 2026-05-29 cs.LG cs.CL cs.CR 版本更新

When the Same Coefficients Reach Different Places: Asymmetric Realizability in Transplanting Tokenizers across Large Language Models

当相同系数到达不同位置:跨大型语言模型移植分词器中的非对称可实现性

Xiaoze Liu, Weichen Yu, Matt Fredrikson, Xiaoqian Wang, Jing Gao

发表机构 * Purdue University(普渡大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 本文发现跨词汇模型组合中分词器移植的几何结构非对称性,并构造了“破坏令牌”以利用该漏洞,通过实验验证其在多个模型对中的存在性及对微调、谱滤波等防御措施的鲁棒性。

详情
AI中文摘要

跨词汇模型组合中的分词器移植将仅存在于捐赠者的嵌入行重构为基于共享词汇锚点的加权组合,并在基础模型上重用这些系数。我们识别出这种重构的一个结构几何特性:相同的系数向量在捐赠者和基础锚点跨度中到达不同的集合,即一个“非对称可实现性”差距。在OMP下的65个捐赠者-基础对中,通过CLP、WECHSEL和FOCUS的跨算子验证,我们构造了“破坏令牌”:在捐赠者锚点跨度中保持统计惰性,同时在基础中产生高显著性重构的单一系数向量。相同的Gemma-2-2B捐赠者检查点允许针对来自五个模型家族的13个不同下游基础进行此构造。植入的方向在与干净参考模型进行权重合并后保持不变。在部署者案例研究中,标准LoRA微调主要在提示分布与训练语料匹配时抑制破坏令牌,并且在我们的设置中不足以缓解此类攻击家族。测试的谱滤波器未能捕捉到非对称性。我们讨论了在开放权重组合供应链中的潜在滥用。

英文摘要

Tokenizer transplant in cross-vocabulary model composition reconstructs donor-only embedding rows as weighted combinations over shared lexical anchors and reuses those coefficients on the base. We identify a structural geometric property of this reconstruction: the same coefficient vector reaches different sets in the donor and base anchor spans, an asymmetric realizability gap. Across 65 donor-base pairs under OMP, with cross-operator validation on CLP, WECHSEL, and FOCUS, we construct breaker tokens: single coefficient vectors that remain statistically inert in the donor anchor span while producing a high-salience reconstruction in the base. The same Gemma-2-2B donor checkpoint admits this construction against 13 different downstream bases drawn from five model families. The planted direction passes weight-merging with a clean reference unchanged. In a deployer case study, standard LoRA fine-tuning suppresses the breaker primarily on prompts whose distribution matches the training corpus and is not a sufficient mitigation against this attack family in our setting. The tested spectral filters miss the asymmetry. We discuss potential misuse in the open-weight composition supply chain.

2512.21311 2026-05-29 cs.LG 版本更新

Learning to Solve PDEs on Neural Shape Representations

在神经形状表示上学习求解偏微分方程

Lilian Welschinger, Yilin Liu, Zican Wang, Niloy Mitra

发表机构 * University College London(伦敦大学学院) Adobe Research(Adobe研究)

AI总结 提出一种无网格公式,学习基于神经局部形状属性的局部更新算子,直接在神经表示上求解表面偏微分方程,无需显式网格或逐实例优化,且保持可微性。

Comments Accepted at CVPR 2026. Project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/

详情
AI中文摘要

在形状上求解偏微分方程支撑着许多形状分析和工程任务;然而,主流的偏微分方程求解器在多边形/三角形网格上运行,而现代3D资产越来越多地以神经表示的形式存在。这种不匹配导致没有合适的方法直接在神经域内求解表面偏微分方程,迫使进行显式网格提取或逐实例残差训练,阻碍了端到端的工作流程。我们提出了一种新颖的无网格公式,学习一个基于神经(局部)形状属性条件化的局部更新算子,使得表面偏微分方程能够直接在神经数据所在处求解。该算子自然地与流行的神经表面表示集成,仅在单个代表性形状上训练一次,并能在形状和拓扑变化中泛化,实现准确、快速的推理,无需显式网格划分或逐实例优化,同时保持可微性。在解析基准测试(球面上的热扩散方程和泊松方程)以及各种形状和神经表面表示上,我们的方法达到了与经典求解器相当的精度,同时实现了跨神经和传统表面表示的统一端到端流水线。我们的源代码和项目页面:https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/。

英文摘要

Solving partial differential equations (PDEs) on shapes underpins many shape analysis and engineering tasks; yet, prevailing PDE solvers operate on polygonal/triangle meshes while modern 3D assets increasingly live as neural representations. This mismatch leaves no suitable method to solve surface PDEs directly within the neural domain, forcing explicit mesh extraction or per-instance residual training, preventing end-to-end workflows. We present a novel, meshfree formulation that learns a local update operator conditioned on neural (local) shape attributes, enabling surface PDEs to be solved directly where the (neural) data lives. The operator integrates naturally with prevalent neural surface representations, is trained once on a single representative shape, and generalizes across shape and topology variations, enabling accurate, fast inference without explicit meshing or per-instance optimization while preserving differentiability. Across analytic benchmarks (heat diffusion and Poisson equations on the sphere) and on diverse shapes and neural surface representations, our method achieves accuracy comparable to classical solvers while enabling a unified, end-to-end pipeline across neural and traditional surface representations. Our source code and project page: https://welschinger.github.io/Learning-to-Solve-PDEs-on-Neural-Shape-Representations/.

2512.19199 2026-05-29 cs.LG cs.AI 版本更新

On the Koopman-Based Generalization Bounds for Multi-Task Deep Learning

基于Koopman的多任务深度学习泛化界

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

发表机构 * Free University of Bozen-Bolzano(博尔扎诺自由大学) University of Catania(卡塔尼亚大学) University of Florida(佛罗里达大学)

AI总结 本文利用算子理论技术建立多任务深度神经网络的泛化界,通过利用权重矩阵的小条件数并引入定制的Sobolev空间作为扩展假设空间,提出比传统范数方法更紧的界,该界在单输出设置下仍有效且优于现有Koopman界。

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

详情
Journal ref
Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 376--392
AI中文摘要

本文利用算子理论技术建立了多任务深度神经网络的泛化界。作者通过利用权重矩阵中的小条件数并引入定制的Sobolev空间作为扩展假设空间,提出了比传统基于范数的方法更紧的界。该增强的界即使在单输出设置下仍然有效,优于现有的基于Koopman的界。所得框架保持了关键优势,如灵活性和与网络宽度无关,为核方法背景下的多任务深度学习提供了更精确的理论理解。

英文摘要

The paper establishes generalization bounds for multitask deep neural networks using operator-theoretic techniques. The authors propose a tighter bound than those derived from conventional norm-based methods by leveraging small condition numbers in the weight matrices and introducing a tailored Sobolev space as an expanded hypothesis space. This enhanced bound remains valid even in single-output settings, outperforming existing Koopman-based bounds. The resulting framework maintains key advantages such as flexibility and independence from network width, offering a more precise theoretical understanding of multitask deep learning in the context of kernel methods.

2512.19184 2026-05-29 cs.LG cs.AI 版本更新

Operator-Based Generalization Bound for Deep Learning: Insights on Multi-Task Learning

基于算子的深度学习泛化界:多任务学习的洞见

Mahdi Mohammadigohari, Giuseppe Di Fatta, Giuseppe Nicosia, Panos M. Pardalos

发表机构 * Free University of Bozen-Bolzano(博尔扎诺自由大学) University of Catania(卡塔尼亚大学) University of Florida(佛罗里达大学)

AI总结 本文通过算子理论框架,结合Koopman方法与现有技术,为向量值神经网络和深度核方法提出了更紧的泛化界,并引入草图技术降低计算成本,同时提出深度向量值再生核希尔伯特空间框架,利用Perron-Frobenius算子增强深度核方法,推导了新的Rademacher泛化界,解决了欠拟合和过拟合问题。

Comments Accepted at the 11th International Conference on Machine Learning, Optimization, and Data Science (LOD), Castiglione della Pescaia, Italy, September 21-24, 2025. To appear in Lecture Notes in Computer Science (LNCS), volume 16467

详情
Journal ref
Machine Learning, Optimization, and Data Science (LOD 2025), Lecture Notes in Computer Science (LNCS), vol. 16468, Springer, 2026, pp. 120--137
AI中文摘要

本文提出了向量值神经网络和深度核方法的新型泛化界,通过算子理论框架聚焦多任务学习。我们的关键发展在于策略性地将基于Koopman的方法与现有技术相结合,实现了比传统基于范数的界更紧的泛化保证。为缓解基于Koopman方法的计算挑战,我们引入了适用于向量值神经网络的草图技术。这些技术在一般Lipschitz损失下给出了超额风险界,为包括鲁棒回归和多重分位数回归在内的应用提供了性能保证。此外,我们提出了一个新的深度学习框架——深度向量值再生核希尔伯特空间(vvRKHS),利用Perron-Frobenius(PF)算子增强深度核方法。我们为该框架推导了新的Rademacher泛化界,通过核精炼策略明确处理欠拟合和过拟合。这项工作为深度学习架构下的多任务学习泛化性质提供了新颖洞见,该领域直到最近才有所发展。

英文摘要

This paper presents novel generalization bounds for vector-valued neural networks and deep kernel methods, focusing on multi-task learning through an operator-theoretic framework. Our key development lies in strategically combining a Koopman-based approach with existing techniques, achieving tighter generalization guarantees compared to traditional norm-based bounds. To mitigate computational challenges associated with Koopman-based methods, we introduce sketching techniques applicable to vector-valued neural networks. These techniques yield excess risk bounds under generic Lipschitz losses, providing performance guarantees for applications including robust and multiple quantile regression. Furthermore, we propose a novel deep learning framework, deep vector-valued reproducing kernel Hilbert spaces (vvRKHS), leveraging Perron-Frobenius (PF) operators to enhance deep kernel methods. We derive a new Rademacher generalization bound for this framework, explicitly addressing underfitting and overfitting through kernel refinement strategies. This work offers novel insights into the generalization properties of multi-task learning with deep learning architectures, an area that has been relatively unexplored until recent developments.

2512.13517 2026-05-29 q-bio.NC cs.LG 版本更新

A Deep Learning Model of Mental Rotation Informed by Interactive VR Experiments

基于交互式VR实验的心理旋转深度学习模型

Raymond Khazoum, Daniela Fernandes, Aleksandr Krylov, Qin Li, Stephane Deny

发表机构 * Department of Computer Science, Aalto University, Espoo, Finland(阿尔托大学计算机科学系,芬兰埃斯波) Department of Neuroscience and Biomedical Engineering, Aalto University, Espoo, Finland(阿尔托大学神经科学与生物医学工程系,芬兰埃斯波)

AI总结 提出一个由等变编码器、神经符号对象编码器和神经决策代理组成的深度学习模型,通过VR实验验证,准确模拟人类心理旋转的性能、响应时间和行为。

Comments Version accepted at ICML 2026

详情
AI中文摘要

心理旋转——比较从不同视角观察到的物体的能力——是人类心理模拟和空间世界建模的基本示例。在这里,我们利用深度、等变和神经符号学习的最新进展,提出了一个人类心理旋转的机制模型。我们的模型由三个堆叠的组件组成:(1) 等变神经编码器,从图像中生成物体的3D空间表示;(2) 神经符号对象编码器,从这些空间表示中推导出符号对象描述;(3) 神经决策代理,通过循环路径比较这些符号描述,以在3D潜在空间中规定旋转模拟。我们的模型设计受到现有心理旋转实验文献的指导,并辅以VR实验,其中参与者有时可以操作物体进行比较。我们的模型很好地捕捉了参与者在我们和其他人的实验中的表现、反应时间和行为,并通过消融研究证明了每个组件的必要性。我们的工作为最近一系列人类空间推理的深度神经模型增添了新的内容,进一步证明了整合深度、等变和符号表示来模拟人类思维的效力。

英文摘要

Mental rotation -- the ability to compare objects seen from different viewpoints -- is a fundamental example of mental simulation and spatial world modeling in humans. Here we propose a mechanistic model of human mental rotation, leveraging recent advances in deep, equivariant, and neuro-symbolic learning. Our model consists of three stacked components: (1) an equivariant neural encoder, producing 3D spatial representations of objects from images, (2) a neuro-symbolic object encoder, deriving symbolic object descriptions from these spatial representations, and (3) a neural decision agent, comparing these symbolic descriptions to prescribe rotation simulations in 3D latent space via a recurrent pathway. Our model design is guided by the existing experimental literature on mental rotation, which we complemented with experiments in VR where participants could at times manipulate the objects to compare. Our model captures the performance, response times, and behavior of participants in our and others' experiments well, and through ablation studies we demonstrate the necessity of each component. Our work adds to a recent collection of deep neural models of human spatial reasoning, further demonstrating the potency of integrating deep, equivariant, and symbolic representations to model the human mind.

2512.10659 2026-05-29 cs.LG 版本更新

DCFO: Density-Based Counterfactuals for Outliers -- Additional Material

DCFO: 基于密度的离群点反事实解释——补充材料

Tommaso Amico, Pernille Matthews, Lena Krieger, Arthur Zimek, Ira Assent

发表机构 * Department of Computer Science(计算机科学系) Department of Computer Science and Mathematics(计算机科学与数学系)

AI总结 针对局部离群因子(LOF)缺乏可解释性的问题,提出基于密度的离群点反事实解释方法(DCFO),通过将数据空间划分为LOF平滑区域实现高效梯度优化,在50个OpenML数据集上优于现有方法。

详情
AI中文摘要

离群点检测识别显著偏离大多数数据分布的数据点。解释离群点对于理解导致其检测的潜在因素、验证其重要性以及识别潜在偏差或错误至关重要。有效的解释提供可操作的见解,有助于采取预防措施以避免未来出现类似的离群点。反事实解释通过识别改变预测所需的最小变化,阐明特定数据点为何被分类为离群点。尽管有价值,但大多数现有的反事实解释方法忽略了离群点检测带来的独特挑战,并且未能针对经典、广泛采用的离群点检测算法。局部离群因子(LOF)是最流行的无监督离群点检测方法之一,通过相对局部密度量化离群程度。尽管LOF在多种应用中广泛使用,但它缺乏可解释性。为解决这一局限性,我们提出了基于密度的离群点反事实解释(DCFO),这是一种专门为LOF生成反事实解释的新方法。DCFO将数据空间划分为LOF行为平滑的区域,从而实现高效的基于梯度的优化。在50个OpenML数据集上的广泛实验验证表明,DCFO始终优于基准竞争对手,在生成的反事实的邻近性和有效性方面表现更优。

英文摘要

Outlier detection identifies data points that significantly deviate from the majority of the data distribution. Explaining outliers is crucial for understanding the underlying factors that contribute to their detection, validating their significance, and identifying potential biases or errors. Effective explanations provide actionable insights, facilitating preventive measures to avoid similar outliers in the future. Counterfactual explanations clarify why specific data points are classified as outliers by identifying minimal changes required to alter their prediction. Although valuable, most existing counterfactual explanation methods overlook the unique challenges posed by outlier detection, and fail to target classical, widely adopted outlier detection algorithms. Local Outlier Factor (LOF) is one of the most popular unsupervised outlier detection methods, quantifying outlierness through relative local density. Despite LOF's widespread use across diverse applications, it lacks interpretability. To address this limitation, we introduce Density-based Counterfactuals for Outliers (DCFO), a novel method specifically designed to generate counterfactual explanations for LOF. DCFO partitions the data space into regions where LOF behaves smoothly, enabling efficient gradient-based optimisation. Extensive experimental validation on 50 OpenML datasets demonstrates that DCFO consistently outperforms benchmarked competitors, offering superior proximity and validity of generated counterfactuals.
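For readers unfamiliar with LOF, the score that DCFO explains can be sketched in a few lines. This is a plain, unoptimised rendering of the classical LOF definition (k-distance, reachability distance, local reachability density), not the DCFO method itself. The function name `lof_scores` and the toy points are assumptions for illustration, and the sketch assumes no duplicate points.

```python
import math

def lof_scores(points, k):
    """Classical Local Outlier Factor for each point (O(n^2), no duplicates)."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    k_dist, neighbors = [], []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        kd = dist[i][order[k - 1]]          # distance to the k-th nearest neighbor
        k_dist.append(kd)
        # The k-neighborhood includes ties at the k-distance.
        neighbors.append([j for j in order if dist[i][j] <= kd])

    # Local reachability density: inverse of the mean reachability distance.
    lrd = []
    for i in range(n):
        reach = [max(k_dist[j], dist[i][j]) for j in neighbors[i]]
        lrd.append(len(reach) / sum(reach))

    # LOF: mean neighbor density divided by the point's own density.
    return [
        sum(lrd[j] for j in neighbors[i]) / (len(neighbors[i]) * lrd[i])
        for i in range(n)
    ]

# Four corners of a unit square plus one far-away point.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
scores = lof_scores(points, k=2)
for p, s in zip(points, scores):
    print(p, round(s, 3))
```

Scores near 1 mean a point is about as dense as its neighbors, and scores well above 1 flag outliers. Because the neighbor sets change discretely as a point moves, the score is only piecewise smooth, which is the difficulty the abstract says DCFO handles by partitioning the space.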

2512.10401 2026-05-29 stat.ML cs.LG math.ST stat.TH 版本更新

Diffusion differentiable resampling

扩散可微重采样

Jennifer Rosina Andersson, Zheng Zhao

发表机构 * Department of Information Technology, Uppsala University, Sweden(乌普萨拉大学信息科技系,瑞典)

AI总结 针对序贯蒙特卡洛中的可微重采样问题,提出一种基于无训练扩散模型代理的信息性且即时可微的重采样方法,理论证明其一致性,并在多个滤波和参数估计基准上优于现有方法。

Comments In ICML 2026

详情
AI中文摘要

本文关注序贯蒙特卡洛(例如粒子滤波)中的可微重采样问题。借鉴重参数化,我们提出了一种新的重采样方法,该方法基于无训练扩散模型代理,具有信息性且即时可微。我们从理论上证明了我们的扩散重采样方法提供了一致的重采样分布,并通过实验表明,在多个滤波和参数估计基准上,它优于最先进的可微重采样方法。最后,我们展示了当用于学习具有高维图像观测的复杂动力学-解码器模型时,它实现了具有竞争力的端到端性能。

英文摘要

This paper is concerned with differentiable resampling in the context of sequential Monte Carlo (e.g., particle filtering). Drawing on reparametrisation, we propose a new resampling method that is informative and instantly differentiable, based on a training-free diffusion model surrogate. We theoretically prove that our diffusion resampling method provides a consistent resampling distribution, and we show empirically that it outperforms the state-of-the-art differentiable resampling methods on multiple filtering and parameter estimation benchmarks. Finally, we show that it achieves competitive end-to-end performance when used in learning a complex dynamics-decoder model with high-dimensional image observations.
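As background for why differentiability is an issue, the sketch below shows classical systematic resampling, a standard baseline in particle filtering. It is not the diffusion-based method proposed in the paper. The function name `systematic_resample` is an assumption for illustration.

```python
import random

def systematic_resample(weights, rng):
    """Return particle indices drawn by systematic resampling.

    One uniform offset is shared by n evenly spaced points in [0, 1);
    each point selects the particle whose cumulative-weight interval
    contains it.
    """
    n = len(weights)
    total = sum(weights)
    u0 = rng.random() / n
    indices = []
    i = 0
    cumulative = weights[0] / total
    for k in range(n):
        u = u0 + k / n
        # Advance to the particle whose cumulative weight exceeds u.
        while u >= cumulative and i < n - 1:
            i += 1
            cumulative += weights[i] / total
        indices.append(i)
    return indices

rng = random.Random(0)
print(systematic_resample([0.0, 1.0, 0.0], rng))       # all mass on particle 1
print(systematic_resample([1.0, 1.0, 1.0, 1.0], rng))  # equal weights
```

The output is a list of integer indices, so a small change in the weights either leaves it unchanged or makes it jump. That discreteness is what blocks gradient flow through a standard particle filter and motivates differentiable alternatives such as the one in this paper.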

2512.03109 2026-05-29 cs.LG cs.AI stat.AP stat.ML 版本更新

E-valuator: Reliable Agent Verifiers with Sequential Hypothesis Testing

E-valuator: 基于序贯假设检验的可靠智能体验证器

Shuvom Sadhuka, Drew Prinster, Clara Fannjiang, Gabriele Scalia, Bonnie Berger, Aviv Regev, Hanchen Wang

发表机构 * Genentech(基因泰克) MIT(麻省理工学院) Johns Hopkins(约翰霍普金斯大学) Stanford(斯坦福大学)

AI总结 提出E-valuator方法,将任意黑盒验证器分数转化为具有可控虚警率的决策规则,通过序贯假设检验实现对智能体轨迹的在线监控,提升统计功效并节省令牌。

详情
AI中文摘要

智能体AI系统根据用户提示执行一系列动作,如推理步骤或工具调用。为了评估其轨迹的成功性,研究人员开发了验证器(如LLM评判器和过程奖励模型)来对智能体轨迹中每个动作的质量进行评分。尽管这些启发式评分可能提供信息,但在用于决定智能体是否会产生成功输出时,无法保证正确性。在此,我们引入e-valuator,一种将任意黑盒验证器分数转化为具有可证明虚警率控制的决策规则的方法。我们将区分成功轨迹(即会导致对用户提示正确响应的动作序列)与不成功轨迹的问题构建为序贯假设检验问题。E-valuator基于e-过程工具开发了一个序贯假设检验,该检验在智能体轨迹的每一步都保持统计有效性,从而能够对任意长动作序列的智能体进行在线监控。实验表明,在六个数据集和三个智能体上,e-valuator相比其他策略提供了更高的统计功效和更好的虚警率控制。我们还展示了e-valuator可用于快速终止有问题的轨迹并节省令牌。总之,e-valuator提供了一个轻量级、模型无关的框架,将验证器启发式转化为具有统计保证的决策规则,从而支持部署更可靠的智能体系统。

英文摘要

Agentic AI systems execute a sequence of actions, such as reasoning steps or tool calls, in response to a user prompt. To evaluate the success of their trajectories, researchers have developed verifiers, such as LLM judges and process-reward models, to score the quality of each action in an agent's trajectory. Although these heuristic scores can be informative, there are no guarantees of correctness when used to decide whether an agent will yield a successful output. Here, we introduce e-valuator, a method to convert any black-box verifier score into a decision rule with provable control of false alarm rates. We frame the problem of distinguishing successful trajectories (that is, a sequence of actions that will lead to a correct response to the user's prompt) and unsuccessful trajectories as a sequential hypothesis testing problem. E-valuator builds on tools from e-processes to develop a sequential hypothesis test that remains statistically valid at every step of an agent's trajectory, enabling online monitoring of agents over arbitrarily long sequences of actions. Empirically, we demonstrate that e-valuator provides greater statistical power and better false alarm rate control than other strategies across six datasets and three agents. We additionally show that e-valuator can be used to quickly terminate problematic trajectories and save tokens. Together, e-valuator provides a lightweight, model-agnostic framework that converts verifier heuristics into decision rules with statistical guarantees, enabling the deployment of more reliable agentic systems.
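The general mechanism behind anytime-valid monitoring can be sketched with a textbook likelihood-ratio e-process. This is not the e-valuator construction itself, which the abstract does not specify. The sketch assumes a hypothetical verifier that emits a binary "bad step" flag, and the two flag probabilities are made-up, known values. Under the null hypothesis (the trajectory is a successful one) the running product is a nonnegative martingale starting at 1, so by Ville's inequality it reaches `1 / alpha` with probability at most `alpha`, whenever monitoring stops.

```python
def monitor(flags, alpha=0.05, p_bad_null=0.1, p_bad_alt=0.5):
    """Sequential test on binary verifier flags (1 = step flagged as bad).

    Returns (stopping_step, e_value). stopping_step is None if the
    e-process never reaches the rejection threshold 1 / alpha.
    p_bad_null and p_bad_alt are assumed, illustrative flag rates for
    successful and unsuccessful trajectories.
    """
    threshold = 1.0 / alpha
    e_value = 1.0
    for step, bad in enumerate(flags, start=1):
        if bad:
            e_value *= p_bad_alt / p_bad_null              # evidence against success
        else:
            e_value *= (1 - p_bad_alt) / (1 - p_bad_null)  # evidence for success
        if e_value >= threshold:
            return step, e_value                           # raise the alarm, stop early
    return None, e_value

print(monitor([1, 1]))        # two flagged steps in a row trigger the alarm
print(monitor([0, 0, 0, 0]))  # clean steps shrink the e-value, no alarm
```

Stopping at the first threshold crossing is what allows a problematic trajectory to be terminated early without inflating the false alarm rate.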

2511.11118 2026-05-29 cs.LG 版本更新

Improving Continual Learning of Knowledge Graph Embeddings via Informed Initialization

通过信息初始化改进知识图谱嵌入的持续学习

Gerard Pons, Besim Bilalli, Anna Queralt

AI总结 提出一种基于知识图谱模式与已有嵌入的信息初始化策略,提升持续学习中新知识的获取并减少灾难性遗忘,同时加速训练。

详情
AI中文摘要

许多知识图谱(KG)会频繁更新,迫使知识图谱嵌入(KGE)适应这些变化。为了解决这个问题,KGE的持续学习技术在更新旧嵌入的同时纳入新实体的嵌入。这些方法中的一个必要步骤是嵌入的初始化,作为KGE学习过程的输入,它对最终嵌入的准确性以及训练所需的时间有重要影响。这对于相对较小且频繁的更新尤其重要。我们提出了一种新颖的信息嵌入初始化策略,可以无缝集成到现有的KGE持续学习方法中,该策略在减少灾难性遗忘的同时增强新知识的获取。具体地,利用KG模式以及先前学习的嵌入,基于新实体所属的类别来获得其初始表示。我们广泛的实验分析表明,所提出的初始化策略提高了所得KGE的预测性能,同时增强了知识保留。此外,我们的方法加速了知识获取,减少了增量学习新嵌入所需的周期数,从而减少了时间。最后,其在不同类型的KGE学习模型中的优势也得到了证明。

英文摘要

Many Knowledge Graphs (KGs) are frequently updated, forcing their Knowledge Graph Embeddings (KGEs) to adapt to these changes. To address this problem, continual learning techniques for KGEs incorporate embeddings for new entities while updating the old ones. One necessary step in these methods is the initialization of the embeddings, as an input to the KGE learning process, which can have an important impact on the accuracy of the final embeddings, as well as on the time required to train them. This is especially relevant for relatively small and frequent updates. We propose a novel informed embedding initialization strategy, which can be seamlessly integrated into existing continual learning methods for KGE, that enhances the acquisition of new knowledge while reducing catastrophic forgetting. Specifically, the KG schema and the previously learned embeddings are utilized to obtain initial representations for the new entities, based on the classes the entities belong to. Our extensive experimental analysis shows that the proposed initialization strategy improves the predictive performance of the resulting KGEs, while also enhancing knowledge retention. Furthermore, our approach accelerates knowledge acquisition, reducing the number of epochs, and therefore time, required to incrementally learn new embeddings. Finally, its benefits across various types of KGE learning models are demonstrated.
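One simple way to read "initial representations based on the classes the entities belong to" is a class-mean initialization, sketched below. This is an assumed simplification rather than the authors' exact strategy: the function `informed_init`, the single-class-per-entity mapping, the random fallback range, and the toy entities are all hypothetical.

```python
import random

def informed_init(new_entity_classes, entity_class, embeddings, dim, seed=0):
    """Initialize new entities from the mean embedding of their class.

    new_entity_classes: {new_entity: class_name}
    entity_class:       {known_entity: class_name} (from the KG schema)
    embeddings:         {known_entity: list of floats} (previously learned)
    Entities whose class has no known members fall back to small random values.
    """
    rng = random.Random(seed)
    by_class = {}
    for entity, vector in embeddings.items():
        by_class.setdefault(entity_class[entity], []).append(vector)

    init = {}
    for entity, cls in new_entity_classes.items():
        members = by_class.get(cls)
        if members:
            init[entity] = [sum(col) / len(members) for col in zip(*members)]
        else:
            init[entity] = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
    return init

embeddings = {"paris": [1.0, 0.0], "rome": [3.0, 2.0], "alice": [0.0, 5.0]}
entity_class = {"paris": "City", "rome": "City", "alice": "Person"}
new_entities = {"oslo": "City", "acme": "Company"}

init = informed_init(new_entities, entity_class, embeddings, dim=2)
print(init["oslo"])  # mean of the two known City embeddings
print(init["acme"])  # random fallback, no known Company entities
```

Starting a new entity near others of its class gives the continual-learning step a warmer start than random noise, which is consistent with the reduced epoch count the abstract reports.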

2510.27663 2026-05-29 eess.IV cs.LG stat.ME stat.ML 版本更新

Bayesian model selection and misspecification testing in imaging inverse problems only from noisy and partial measurements

仅从噪声和部分测量中进行成像逆问题的贝叶斯模型选择与误设定检验

Tom Sprunck, Marcelo Pereyra, Tobias Liaudat

发表机构 * Heriot-Watt University, MACS & Maxwell Institute for Mathematical Sciences, EH14 4AS, Edinburgh, United Kingdom

AI总结 提出一种结合贝叶斯交叉验证与数据分裂的通用方法,用于在无真实数据情况下对成像逆模型进行选择与误设定检测,兼容扩散采样器等贝叶斯成像采样器,计算成本低且准确率高。

详情
AI中文摘要

现代成像技术严重依赖贝叶斯统计模型来解决困难的图像重建和恢复任务。本文针对无真实数据的情况,研究此类模型的客观评估,重点关注模型选择和误设定诊断。现有的无监督模型评估方法通常因计算成本高且与通过机器学习模型隐式定义的现代图像先验不兼容,而不适用于计算成像。本文提出一种基于贝叶斯交叉验证与数据分裂(一种随机测量分裂技术)的新型组合方法,用于贝叶斯成像科学中的无监督模型选择和误设定检测。该方法与任何贝叶斯成像采样器兼容,包括扩散采样器和即插即用采样器。我们通过涉及多种评分规则和模型误设定类型的实验证明了该方法的有效性,在低计算成本下实现了出色的选择和检测精度。

英文摘要

Modern imaging techniques heavily rely on Bayesian statistical models to address difficult image reconstruction and restoration tasks. This paper addresses the objective evaluation of such models in settings where ground truth is unavailable, with a focus on model selection and misspecification diagnosis. Existing unsupervised model evaluation methods are often unsuitable for computational imaging due to their high computational cost and incompatibility with modern image priors defined implicitly via machine learning models. We herein propose a general methodology for unsupervised model selection and misspecification detection in Bayesian imaging sciences, based on a novel combination of Bayesian cross-validation and data fission, a randomized measurement splitting technique. The approach is compatible with any Bayesian imaging sampler, including diffusion and plug-and-play samplers. We demonstrate the methodology through experiments involving various scoring rules and types of model misspecification, where we achieve excellent selection and detection accuracy with a low computational cost.

2510.27391 2026-05-29 cs.CV cs.LG 版本更新

Modality Alignment across Trees on Heterogeneous Hyperbolic Manifolds

异质双曲流形上的树间模态对齐

Wei Wu, Xiaomeng Fan, Yuwei Wu, Zhi Gao, Pengxiang Li, Yunde Jia, Mehrtash Harandi

发表机构 * Beijing Key Laboratory of Intelligent Information Technology, School of Computer Science & Technology, Beijing Institute of Technology(北京智能信息科技重点实验室,计算机科学与技术学院,北京理工大学) Guangdong Laboratory of Machine Perception and Intelligent Computing, Shenzhen MSU-BIT University(广东机器感知与智能计算实验室,深圳北理莫斯科大学) Department of Electrical and Computer System Engineering, Monash University(电子与计算机系统工程系,莫纳什大学)

AI总结 提出一种在异质双曲流形上对齐图像和文本树状层次特征的方法,通过交叉注意力提取视觉层次特征、异质流形嵌入及KL距离度量学习中间流形,在开放集分类任务中优于基线。

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
The Fourteenth International Conference on Learning Representations (ICLR 2026), Rio de Janeiro, Brazil, 2026
AI中文摘要

模态对齐对于视觉-语言模型(VLM)有效整合跨模态信息至关重要。然而,现有方法在提取文本层次特征的同时,对每个图像仅用单一特征表示,导致不对称和次优的对齐。为解决此问题,我们提出树间对齐(Alignment across Trees)方法,该方法为图像和文本模态构建并对齐树状层次特征。具体而言,我们引入一个语义感知的视觉特征提取框架,该框架对来自中间Transformer层的视觉类别标记应用交叉注意力机制,由文本线索引导以提取具有从粗到细语义的视觉特征。然后,我们将两种模态的特征树嵌入到具有不同曲率的双曲流形中,以有效建模其层次结构。为了在不同曲率的异质双曲流形之间进行对齐,我们推导了异质流形上分布之间的KL距离度量,并通过最小化该距离学习一个用于流形对齐的中间流形。我们证明了最优中间流形的存在性和唯一性。在多个图像数据集上的分类学开放集分类任务实验表明,我们的方法在少样本和跨域设置下持续优于强基线。

英文摘要

Modality alignment is critical for vision-language models (VLMs) to effectively integrate information across modalities. However, existing methods extract hierarchical features from text while representing each image with a single feature, leading to asymmetric and suboptimal alignment. To address this, we propose Alignment across Trees, a method that constructs and aligns tree-like hierarchical features for both image and text modalities. Specifically, we introduce a semantic-aware visual feature extraction framework that applies a cross-attention mechanism to visual class tokens from intermediate Transformer layers, guided by textual cues to extract visual features with coarse-to-fine semantics. We then embed the feature trees of the two modalities into hyperbolic manifolds with distinct curvatures to effectively model their hierarchical structures. To align across the heterogeneous hyperbolic manifolds with different curvatures, we formulate a KL distance measure between distributions on heterogeneous manifolds, and learn an intermediate manifold for manifold alignment by minimizing the distance. We prove the existence and uniqueness of the optimal intermediate manifold. Experiments on taxonomic open-set classification tasks across multiple image datasets demonstrate that our method consistently outperforms strong baselines under few-shot and cross-domain settings.

2510.14150 2026-05-29 cs.AI cs.LG cs.NE 版本更新

CodeEvolve: an open source evolutionary coding agent for algorithmic discovery and optimization

CodeEvolve:用于算法发现和优化的开源进化编码智能体

Henrique Assumpção, Diego Ferreira, Leandro Campos, Fabricio Murai

发表机构 * Inter Science - Inter&Co Federal University of Minas Gerais(米纳斯吉拉斯联邦大学) Worcester Polytechnic Institute(伍斯特理工学院)

AI总结 提出CodeEvolve开源框架,结合大语言模型与岛屿进化搜索,通过灵感交叉、元提示和深度细化,在AlphaEvolve基准上匹配或超越5/9问题,并在匹配条件下优于OpenEvolve和ShinkaEvolve,以更低成本超越前沿闭源集成。

Comments 21 pages, 16 figures, 8 tables

详情
AI中文摘要

我们介绍了CodeEvolve,一个开源框架,它将大语言模型与基于岛屿的进化搜索相结合,用于端到端的算法发现。CodeEvolve在CVT-MAP-Elites存档和加权LLM集成之上集成了基于灵感的交叉、元提示和深度细化,为复杂问题生成优化解决方案。在AlphaEvolve基准套件上,CodeEvolve在9个问题中的5个上匹配或超过了报告的AlphaEvolve结果,并且在匹配条件下,在9个问题中的6个上优于开源框架OpenEvolve和ShinkaEvolve。使用开放权重的Qwen3-Coder-30B骨干网络,它在CirclePackingSquare的两个实例上均超过了报告的AlphaEvolve分数,成本大约比前沿闭源集成低一个数量级,并且在无需重新调整的情况下,在启发式设计任务上与EoH保持竞争力。消融实验表明,CodeEvolve组件之间的相互作用(而非任何单一算子)驱动了这些结果。我们在https://github.com/inter-co/science-codeevolve 发布了该框架、实验数据和实用的超参数指南。

英文摘要

We introduce CodeEvolve, an open-source framework that couples large language models with island-based evolutionary search for end-to-end algorithmic discovery. CodeEvolve integrates inspiration-based crossover, meta-prompting, and depth-based refinement on top of a CVT-MAP-Elites archive and a weighted LLM ensemble to generate optimized solutions for complex problems. On the AlphaEvolve benchmark suite, CodeEvolve matches or surpasses the reported AlphaEvolve results on 5 of 9 problems and, under matched conditions, outperforms the open-source frameworks OpenEvolve and ShinkaEvolve on 6 of 9. With the open-weight Qwen3-Coder-30B backbone, it surpasses the reported AlphaEvolve score on both CirclePackingSquare instances at roughly an order of magnitude lower cost than a frontier closed-source ensemble, and remains competitive with EoH on heuristic-design tasks without retuning. Ablations show that the interaction between CodeEvolve's components, rather than any single operator, drives these results. We release the framework, experimental data, and practical hyperparameter guidelines at https://github.com/inter-co/science-codeevolve.

2510.11499 2026-05-29 cs.LG cs.AI 版本更新

Offline Reinforcement Learning with Generative Trajectory Policies

基于生成轨迹策略的离线强化学习

Xinsong Feng, Leshu Tang, Chenan Wang, Haipeng Chen

发表机构 * School of Computing, Data Sciences Computer Engineering Department, UCLA, Los Angeles, USA

AI总结 本文提出生成轨迹策略(GTP),通过统一扩散、流匹配和一致性模型为常微分方程驱动的连续时间生成轨迹,并引入两种理论自适应方法,在D4RL基准上达到最先进性能。

Comments ICML 2026

详情
AI中文摘要

生成模型因其捕获复杂多模态行为的能力,已成为离线强化学习中一类强大的策略。然而,现有方法面临明显的权衡:扩散策略等慢速迭代模型计算成本高,而一致性策略等快速单步模型性能往往下降。在本文中,我们证明弥合这一差距是可能的。我们认为,超越个体方法局限的关键在于一个统一视角,该视角将现代生成模型(包括扩散、流匹配和一致性模型)视为学习由常微分方程驱动的连续时间生成轨迹的具体实例。这一原则性基础为强化学习中的生成策略提供了更清晰的设计空间,并使我们能够提出生成轨迹策略(GTP),一种新的、更通用的策略范式,学习底层ODE的完整解映射。为使该范式适用于离线强化学习,我们进一步引入了两种理论上原则性的自适应方法。实验结果表明,GTP在D4RL基准上达到了最先进的性能——它显著优于先前的生成策略,在多个以困难著称的AntMaze任务上取得了完美分数。

英文摘要

Generative models have emerged as a powerful class of policies for offline reinforcement learning (RL) due to their ability to capture complex, multi-modal behaviors. However, existing methods face a stark trade-off: slow, iterative models like diffusion policies are computationally expensive, while fast, single-step models like consistency policies often suffer from degraded performance. In this paper, we demonstrate that it is possible to bridge this gap. The key to moving beyond the limitations of individual methods, we argue, lies in a unifying perspective that views modern generative models, including diffusion, flow matching, and consistency models, as specific instances of learning a continuous-time generative trajectory governed by an Ordinary Differential Equation (ODE). This principled foundation provides a clearer design space for generative policies in RL and allows us to propose Generative Trajectory Policies (GTPs), a new and more general policy paradigm that learns the entire solution map of the underlying ODE. To make this paradigm practical for offline RL, we further introduce two key theoretically principled adaptations. Empirical results demonstrate that GTP achieves state-of-the-art performance on D4RL benchmarks: it significantly outperforms prior generative policies, achieving perfect scores on several notoriously hard AntMaze tasks.

2510.08722 2026-05-29 cs.LG cs.AI 版本更新

The Impact of Semantic Pairs on Self-Supervised Representation Learning

语义对对自监督表示学习的影响

Mohammad Alkhalefi, Georgios Leontidis, Mingjun Zhong

AI总结 通过控制实验研究语义正对(不同同类实例)相比增强正对在自监督学习中的效果,发现语义对能提升泛化性能,尤其对比学习受益最大。

Comments 19 pages, 7 figures, 5 tables

详情
AI中文摘要

实例判别通过将同一图像的不同增强视图视为正对来学习视觉表示。虽然这鼓励对手工变换的不变性,但同图像正对可能保留背景、纹理、光照和对象特定细节等干扰相关性。语义正对,即不同的同类实例,通过在不同上下文中呈现对象可能减少这些相关性。然而,先前的研究通常将语义对与增强正对或错误邻居(即错误映射的语义对)结合,使得难以隔离语义配对的效果。我们提出了一个关于语义正对用于自监督表示学习的受控实证研究。从ImageNet-1K中,我们构建了两个匹配的子集:一个增强对基线和一个手动策划的语义对数据集,具有相同的类别组成和训练对数量。我们使用这些数据集在匹配的训练条件下比较代表性的对比和非对比SSL方法。在迁移学习和目标检测评估中,语义对预训练始终优于增强对预训练。额外的消融实验表明,语义对诱导了超出标准变换管道的不变性。在评估的方法中,对比学习从语义对中受益最大,其中SimCLR显示出最大的相对改进。这些结果阐明了语义正对在SSL中的作用,并为选择和设计能够有效利用语义对信息的框架提供了指导。

英文摘要

Instance discrimination learns visual representations by treating different augmented views of the same image as positive pairs. While this encourages invariance to handcrafted transformations, same-image positives can preserve nuisance correlations such as background, texture, illumination, and object-specific details. Semantic positive pairs, i.e., different same-class instances, may reduce these correlations by presenting objects across diverse contexts. However, previous studies often combine semantic pairs with augmented positives or false neighbors (i.e., incorrectly mapped semantic pairs), making it difficult to isolate the effect of semantic pairing. We present a controlled empirical study of semantic positive pairs for self-supervised representation learning. From ImageNet-1K, we construct two matched subsets: an augmented-pair baseline and a manually curated semantic-pair dataset with the same class composition and training-pair count. We use these datasets to compare representative contrastive and non-contrastive SSL methods under matched training conditions. Across transfer learning and object detection evaluations, semantic-pair pretraining consistently improves generalisation over augmented-pair pretraining. Additional ablations show that semantic pairs induce invariances beyond the standard transformation pipeline. Among the evaluated methods, contrastive learning benefits most strongly from semantic pairs, with SimCLR showing the largest relative improvement. These results clarify the role of semantic positive pairs in SSL and provide guidance for selecting and designing frameworks that can exploit semantic pair information effectively.

2509.24895 2026-05-29 cs.LG 版本更新

Towards Understanding the Shape of Representations in Protein Language Models

理解蛋白质语言模型中表示的形状

Kosio Beshkov, Anders Malthe-Sørenssen

发表机构 * Department of Physics(物理系) University of Oslo(奥斯陆大学)

AI总结 本研究通过平方根速度表示和图过滤分析蛋白质语言模型(PLM)的表示空间,发现ESM2模型中Karcher均值和有效维度随层数非线性变化,且PLM优先编码残基的局部关系,最忠实于结构的表示出现在模型倒数第二层附近。

Comments Accepted as a poster at ICLR 2026. OpenReview: https://openreview.net/forum?id=Dnn8SSBJaY

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

虽然蛋白质语言模型(PLM)是未来从头蛋白质设计最有前途的研究途径之一,但它们将序列转换为隐藏表示的方式以及这些表示中编码的信息尚未完全理解。一些工作试图提出PLM的可解释性工具,但侧重于理解单个序列如何被这些模型转换。因此,PLM如何转换整个序列空间及其关系仍然未知。在这项工作中,我们尝试通过将蛋白质结构和表示与平方根速度(SRV)表示和图过滤联系起来,来理解这个转换后的序列空间。这两种方法自然地导出一个度量空间,在该空间中,可以比较成对的蛋白质或蛋白质表示。我们分析了来自SCOP数据集的不同类型蛋白质,并表明Karcher均值和SRV形状空间的有效维度作为不同大小ESM2模型中层数的函数遵循非线性模式。此外,我们使用图过滤作为工具来研究模型编码蛋白质结构特征的上下文长度。我们发现PLM优先编码残基之间的直接和局部关系,但对于较大的上下文长度开始退化。最忠实于结构的编码往往出现在模型最后一层附近但之前,表明在这些层之上训练折叠模型可能会提高折叠性能。

英文摘要

While protein language models (PLMs) are one of the most promising avenues of research for future de novo protein design, the way in which they transform sequences to hidden representations, as well as the information encoded in such representations, is yet to be fully understood. Several works have attempted to propose interpretability tools for PLMs, but they have focused on understanding how individual sequences are transformed by such models. Therefore, the way in which PLMs transform the whole space of sequences along with their relations is still unknown. In this work we attempt to understand this transformed space of sequences by identifying protein structure and representation with square-root velocity (SRV) representations and graph filtrations. Both approaches naturally lead to a metric space in which pairs of proteins or protein representations can be compared with each other. We analyze different types of proteins from the SCOP dataset and show that the Karcher mean and effective dimension of the SRV shape space follow a non-linear pattern as a function of the layers in ESM2 models of different sizes. Furthermore, we use graph filtrations as a tool to study the context lengths at which models encode the structural features of proteins. We find that PLMs preferentially encode immediate as well as local relations between residues, but start to degrade for larger context lengths. The most structurally faithful encoding tends to occur close to, but before, the last layer of the models, indicating that training a folding model on top of these layers might lead to improved folding performance.

2509.24100 2026-05-29 stat.ME cs.LG 版本更新

SpeedCP: Fast Kernel-based Conditional Conformal Prediction

SpeedCP: 基于核的快速条件共形预测

Yating Liu, Yeo Jin Jung, Zixuan Wu, So Won Jeong, Claire Donnat

发表机构 * Department of Statistics University of Chicago(统计学系芝加哥大学)

AI总结 提出一种基于路径追踪的高效算法,在保持RKHS条件共形预测框架理论优势的同时,将计算速度提升40倍,区间长度缩短30%。

详情
AI中文摘要

共形预测提供了具有有限样本条件保证的分布自由预测集。我们基于Gibbs等人(2023)的RKHS框架,该框架利用协变量偏移族来提供近似条件共形预测区间,具有强大的理论前景,但计算成本过高。为弥补这一差距,我们开发了一种稳定高效的算法,该算法以与单次核分位数拟合基本相同的成本计算正则化RKHS共形优化问题的完整解路径。我们的路径追踪框架同时调整超参数,提供平滑控制和数据自适应校准。为了将方法扩展到高维设置,我们进一步将我们的方法与低秩潜在嵌入相结合,在数据驱动的潜在空间中捕获条件有效性。实验上,我们的方法在各种现代黑盒预测器上提供了可靠的条件覆盖,将Gibbs等人(2023)的区间长度改善了30%,同时实现了40倍的加速。

英文摘要

Conformal prediction provides distribution-free prediction sets with finite-sample conditional guarantees. We build upon the RKHS-based framework of Gibbs et al. (2023), which leverages families of covariate shifts to provide approximate conditional conformal prediction intervals, an approach with strong theoretical promise, but with prohibitive computational cost. To bridge this gap, we develop a stable and efficient algorithm that computes the full solution path of the regularized RKHS conformal optimization problem, at essentially the same cost as a single kernel quantile fit. Our path-tracing framework simultaneously tunes hyperparameters, providing smoothness control and data-adaptive calibration. To extend the method to high-dimensional settings, we further integrate our approach with low-rank latent embeddings that capture conditional validity in a data-driven latent space. Empirically, our method provides reliable conditional coverage across a variety of modern black-box predictors, improving the interval length of Gibbs et al. (2023) by 30%, while achieving a 40-fold speedup.
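For orientation, the calibration step that conditional methods such as the one above refine can be sketched with plain split conformal prediction. This is a minimal illustration of the standard marginal procedure, not the RKHS path-tracing algorithm of the paper; the function names and the absolute-residual score are assumptions chosen for the example.

```python
import math


def conformal_quantile(scores, alpha):
    """Return the split-conformal threshold for miscoverage level alpha.

    scores: nonconformity scores (e.g. |y - prediction|) on a held-out
    calibration set. The threshold is the k-th smallest score with
    k = ceil((n + 1) * (1 - alpha)); if k exceeds n the interval is unbounded.
    """
    n = len(scores)
    k = math.ceil((n + 1) * (1 - alpha))
    if k > n:
        return math.inf
    return sorted(scores)[k - 1]


def prediction_interval(point_prediction, threshold):
    """Symmetric interval around a point prediction (absolute-residual score)."""
    return (point_prediction - threshold, point_prediction + threshold)


# Hypothetical calibration residuals from some black-box predictor.
calibration_scores = [0.3, 1.2, 0.7, 0.1, 0.9, 0.5, 1.5, 0.2, 0.8]
q = conformal_quantile(calibration_scores, alpha=0.5)
interval = prediction_interval(10.0, q)
```

The single threshold `q` is shared by every test point, which is why this baseline only controls coverage on average; conditional approaches replace it with a covariate-dependent quantity.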

2509.22504 2026-05-29 cs.AI cs.LG 版本更新

Estimating the Empowerment of Language Model Agents

估计语言模型代理的赋权能力

Jinyeop Song, Jeff Gore, Max Kleiman-Weiner

发表机构 * Massachusetts Institute of Technology(麻省理工学院) University of Washington(华盛顿大学)

AI总结 提出基于信息论中赋权概念的评估框架EELMA,通过多轮文本交互近似有效赋权,实验表明赋权与任务性能强相关,可作为与任务成功度量互补的通用评估指标。

Comments Published at the International Conference on Machine Learning (ICML) 2026. 9 pages, 9 figures; camera-ready version

详情
AI中文摘要

随着语言模型(LM)代理在现实应用中的能力日益增强和广泛采用,除了昂贵且人工设计的基准测试外,对可扩展评估框架的需求日益增长。我们提出基于赋权的信息论评估,赋权是一种衡量代理通过其行动对未来状态影响的信息论度量。为了应对基于文本环境的独特挑战,我们引入了EELMA(估计语言模型代理的赋权能力),一种从多轮文本交互中近似有效赋权的算法。我们在文本游戏以及现实的网络和工具使用环境中演示了EELMA,表明赋权与平均任务性能强相关。我们进一步分析了赋权如何随模型、环境复杂性和代理配置而变化,并表明高赋权状态和行动通常标志着通用能力的关键时刻。这些结果确立了赋权作为一种与任务成功度量互补的、与目标无关的度量,用于LM代理评估。

英文摘要

As language model (LM) agents become increasingly capable and adopted in real-world applications, there is a growing need for scalable evaluation frameworks beyond costly, manually designed benchmarks. We propose information-theoretic evaluation based on empowerment, an information-theoretic measure of an agent's influence on future states through its actions. To handle the unique challenges of text-based environments, we introduce EELMA (Estimating Empowerment of Language Model Agents), an algorithm for approximating effective empowerment from multi-turn text interactions. We demonstrate EELMA on textual games and realistic web and tool-use environments, showing that empowerment strongly correlates with average task performance. We further analyze how empowerment varies across models, environment complexity, and agent configurations, and show that high-empowerment states and actions often mark pivotal moments for general capabilities. These results establish empowerment as a goal-agnostic metric that complements task-success measures for LM-agent evaluation.
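As background for the abstract above, the classical information-theoretic definition of $n$-step empowerment is the channel capacity from action sequences to the resulting future state. The notation below is an assumption for illustration; the "effective empowerment" estimated by EELMA may be defined differently in the paper.

$$
\mathcal{E}_n(s_t) \;=\; \max_{p(a_t^{n})} \, I\big(A_t^{n};\, S_{t+n} \,\big|\, s_t\big), \qquad A_t^{n} = (A_t, \dots, A_{t+n-1})
$$

Here $I(\cdot\,;\cdot \mid s_t)$ is the mutual information between the agent's next $n$ actions and the state reached after them, given the current state $s_t$, and the maximum is taken over distributions on action sequences. A state has high empowerment when the agent's choices reliably lead to many distinguishable futures.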

2509.21154 2026-05-29 cs.LG cs.AI 版本更新

GRPO is Secretly a Process Reward Model

GRPO 秘密地是一个过程奖励模型

Michael Sullivan, Alexander Koller

发表机构 * Department of Language Science and Technology, Saarland Informatics Campus, Saarland University, Saarbrücken, Germany

AI总结 本文理论证明,使用结果奖励模型的 GRPO 强化学习算法等价于一个基于蒙特卡洛的过程奖励模型,并发现其缺陷,提出 λ-GRPO 改进,在推理任务上提升性能。

Comments 16 pages, 9 figures; accepted at ICML 2026

详情
AI中文摘要

过程奖励模型(PRMs)允许在强化学习(RL)中进行细粒度的信用分配,并且与结果奖励模型(ORMs)形成对比,后者为整个轨迹分配单一奖励。然而,我们在本文中提供了理论证明,配备 ORM 的组相对策略优化(GRPO)RL 算法实际上等价于一个配备非平凡、基于蒙特卡洛的 PRM 的 PRM-aware RL 目标(在温和假设下)。利用 GRPO-as-a-PRM 框架,我们识别出 GRPO 目标中的一个缺陷,该缺陷与不平衡的过程步骤和奖励相互作用,阻碍了探索和利用(在不同条件下)。我们提出对算法进行简单修改以减轻这一缺陷(λ-GRPO),并表明使用 λ-GRPO 调优的 LLM 在下游推理任务上优于使用标准 GRPO 调优的 LLM,并且更快达到峰值性能。这些结果表明,我们可以利用原始 GRPO 算法中隐藏的内置 PRM 结构来提升模型性能,而无需使用显式 PRM,并且对训练时间和成本的影响可以忽略不计。

英文摘要

Process reward models (PRMs) allow for fine-grained credit assignment in reinforcement learning (RL), and seemingly contrast with outcome reward models (ORMs), which assign a single reward to an entire trajectory. However, we provide theoretical proof in this work that the Group Relative Policy Optimization (GRPO) RL algorithm equipped with an ORM is in fact equivalent to a PRM-aware RL objective equipped with a non-trivial, Monte-Carlo-based PRM (given mild assumptions). Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective that interacts with imbalanced process steps and rewards to hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($λ$-GRPO), and show that LLMs tuned with $λ$-GRPO outperform LLMs tuned with standard GRPO on downstream reasoning tasks and reach peak performance more rapidly. These results show that we can leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance without employing an explicit PRM, and with a negligible impact on training time and cost.
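To make the outcome-reward setting concrete, the group-relative advantage used by GRPO is commonly written as a within-group standardization of scalar rewards. The sketch below assumes that common formulation (population standard deviation, a small `eps` for stability); it does not implement the $λ$-GRPO modification, whose details are not given in the abstract.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome rewards within one group of sampled completions.

    rewards: one scalar outcome reward per completion for the same prompt.
    Returns one advantage per completion; in GRPO-style training every token
    of completion i is weighted by the same advantage value.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions, two judged correct by the ORM.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Because the advantage is constant across a completion's tokens, any step-level credit has to come from how completions in the group overlap, which is the structure the abstract describes as an implicit PRM.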

2507.16880 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Finding DoRI: Discovery of Retained Images in Diffusion Models

Finding DoRI: 扩散模型中保留图像的发现

Antoni Kowalczuk, Dominik Hintersdorf, Lukas Struppek, Kristian Kersting, Adam Dziedzic, Franziska Boenisch

发表机构 * CISPA Helmholtz Center for Information Security(CISPA信息安全研究中心) German Research Center for Artificial Intelligence (DFKI)(德国人工智能研究中心(DFKI)) Technical University of Darmstadt(达姆施塔特技术大学) Hessian Center for AI (Hessian.AI)(黑森人工智能中心(Hessian.AI)) Centre for Cognitive Science, Technical University of Darmstadt(达姆施塔特技术大学认知科学中心)

AI总结 通过挑战记忆局部化假设,发现文本嵌入的小扰动可重新触发数据复制,并证明记忆本质上是非局部的,从而提出对抗微调实现更鲁棒的缓解方法。

Comments Published at ICML 2026

详情
AI中文摘要

文本到图像扩散模型(DMs)在图像生成方面取得了显著成功。然而,由于它们可能无意中记忆并复制训练数据,数据隐私和知识产权问题仍然存在。最近的缓解工作集中在识别和剪枝负责触发逐字训练数据复制的权重,基于记忆可以被局部化的假设。我们挑战这一假设,并证明即使经过这样的剪枝,对先前缓解的提示的文本嵌入进行微小扰动可以重新触发数据复制,揭示了此类方法的脆弱性。我们的进一步分析提供了多个迹象表明记忆确实本质上不是局部的:(1)记忆图像的复制触发因素分布在文本嵌入空间中;(2)产生相同复制图像的嵌入会产生不同的模型激活;(3)不同的剪枝方法对同一图像识别出不一致的记忆相关权重集。最后,我们表明绕过局部性假设可以通过对抗微调实现更鲁棒的缓解。这些发现为文本到图像DMs中记忆的基本性质提供了新见解,并为未来开发更可靠的对抗DM记忆的缓解方法提供了信息。

英文摘要

Text-to-image diffusion models (DMs) have achieved remarkable success in image generation. However, concerns about data privacy and intellectual property remain due to their potential to inadvertently memorize and replicate training data. Recent mitigation efforts have focused on identifying and pruning weights responsible for triggering verbatim training data replication, based on the assumption that memorization can be localized. We challenge this assumption and demonstrate that, even after such pruning, small perturbations to the text embeddings of previously mitigated prompts can re-trigger data replication, revealing the fragility of such methods. Our further analysis then provides multiple indications that memorization is indeed not inherently local: (1) replication triggers for memorized images are distributed throughout text embedding space; (2) embeddings yielding the same replicated image produce divergent model activations; and (3) different pruning methods identify inconsistent sets of memorization-related weights for the same image. Finally, we show that bypassing the locality assumption enables more robust mitigation through adversarial fine-tuning. These findings provide new insights into the fundamental nature of memorization in text-to-image DMs and inform the future development of more reliable mitigation methods against DM memorization.

2507.06092 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Taming Data Challenges in ML-based Security Tasks Using Generative AI

驯服基于ML的安全任务中的数据挑战:使用生成式AI

Shravya Kanchi, Neal Mangaokar, Aravind Cheruvu, Sifat Muhammad Abdullah, Shirin Nilizadeh, Atul Prakash, Bimal Viswanath

发表机构 * University of Michigan, Ann Arbor(密歇根大学安娜堡分校) University of Texas at Arlington(德克萨斯大学阿灵顿分校)

AI总结 提出使用生成式AI(GenAI)生成的合成数据增强训练集,以改善机器学习安全分类器的泛化性能,在7个任务上实现最高32.6%的提升。

Comments Accepted at the 2026 ACM Asia Conference on Computer and Communications Security (AsiaCCS 2026)

详情
Journal ref
In Proc. ACM AsiaCCS 2026, Bangalore, India, June 1-5, 2026. ACM, 2026
AI中文摘要

基于机器学习的监督分类器广泛用于安全任务,其改进主要集中在算法进步上。我们认为,对分类器性能产生负面影响的数据挑战受到的关注有限。我们解决以下研究问题:生成式AI(GenAI)的发展能否应对这些数据挑战并提高分类器性能?我们提出使用GenAI技术生成的合成数据增强训练数据集,以改善分类器的泛化能力。我们使用6种最先进的GenAI方法在7个不同的安全任务上评估了这种方法,并引入了一种名为Nimai的新型GenAI方案,该方案能够实现高度可控的数据合成。我们发现,GenAI技术可以显著提高安全分类器的性能,即使在数据严重受限的情况下(仅约180个训练样本),也能实现高达32.6%的提升。此外,我们证明GenAI可以促进部署后对概念漂移的快速适应,在调整过程中只需最少的标注。尽管取得了成功,但我们的研究发现,一些GenAI方案在某些安全任务上难以初始化(训练和生成数据)。我们还识别了特定任务的特征,如噪声标签、重叠的类别分布和稀疏特征向量,这些特征阻碍了使用GenAI提升性能。我们相信,我们的研究将推动未来针对安全任务的GenAI工具的开发。

英文摘要

Machine learning-based supervised classifiers are widely used for security tasks, and their improvement has been largely focused on algorithmic advancements. We argue that data challenges that negatively impact the performance of these classifiers have received limited attention. We address the following research question: Can developments in Generative AI (GenAI) address these data challenges and improve classifier performance? We propose augmenting training datasets with synthetic data generated using GenAI techniques to improve classifier generalization. We evaluate this approach across 7 diverse security tasks using 6 state-of-the-art GenAI methods and introduce a novel GenAI scheme called Nimai that enables highly controlled data synthesis. We find that GenAI techniques can significantly improve the performance of security classifiers, achieving improvements of up to 32.6% even in severely data-constrained settings (only ~180 training samples). Furthermore, we demonstrate that GenAI can facilitate rapid adaptation to concept drift post-deployment, requiring minimal labeling in the adjustment process. Despite successes, our study finds that some GenAI schemes struggle to initialize (train and produce data) on certain security tasks. We also identify characteristics of specific tasks, such as noisy labels, overlapping class distributions, and sparse feature vectors, which hinder performance boost using GenAI. We believe that our study will drive the development of future GenAI tools designed for security tasks.

2507.00037 2026-05-29 cs.LG cs.AI 版本更新

Model Fusion via Retrofitting

通过回溯改造的模型融合

Phoomraphee Luenam, Andreas Spanopoulos, Amit Sant, Thomas Hofmann, Sotiris Anagnostidis, Sidak Pal Singh

发表机构 * ETH Zürich

AI总结 提出一种以神经元为中心的融合算法,通过将父模型中间神经元分组为目标表示并训练融合模型子网络逼近,结合神经元归因分数进行显著特征对齐,适用于任意可模块化为有向无环图结构的架构,在零样本和非独立同分布场景下表现最佳。

Comments 5 figures, 15 tables, 23 pages

详情
AI中文摘要

模型融合旨在将独立训练的神经网络组合成一个单一模型而无需重新训练,但由于排列不变性、随机初始化和异构训练数据导致的表示差异,这一过程变得复杂。现有方法在非独立同分布数据分布下的零样本设置中尤其困难,并且通常局限于特定架构或成对融合。我们引入了一类以神经元为中心的融合算法,将融合视为一个原则性的表示匹配问题:父模型中的中间神经元被分组为目标表示,然后训练融合模型的相应子网络来逼近这些表示。与先前工作不同,我们的方法结合了神经元归因分数以偏向于显著特征的对齐,并且可以应用于任何可模块化为有向无环图层次的架构——在VGG、ResNet和ViT上进行了实证验证。在标准基准上的实验显示,与现有融合方法相比,我们的方法取得了一致的改进,在零样本和非独立同分布场景中增益最大。代码可在https://github.com/AndrewSpano/model-fusion-via-retrofitting获取。

英文摘要

Model fusion seeks to combine independently trained neural networks into a single model without retraining, but is complicated by representational divergence arising from permutation invariance, random initialization, and heterogeneous training data. Existing methods struggle particularly in zero-shot settings under non-IID data distributions, and are often limited to specific architectures or pairwise fusion. We introduce a neuron-centric family of fusion algorithms that frames fusion as a principled representation-matching problem: intermediate neurons across parent models are grouped into target representations, which the fused model's corresponding sub-networks are then trained to approximate. Unlike prior work, our approach incorporates neuron attribution scores to bias alignment toward salient features, and can be applied to any architecture modularizable as a DAG of levels -- empirically validated on VGGs, ResNets, and ViTs. Experiments across standard benchmarks show consistent improvements over existing fusion methods, with the largest gains in zero-shot and non-IID scenarios. Code is available at https://github.com/AndrewSpano/model-fusion-via-retrofitting.

2506.20344 2026-05-29 math.OC cs.LG 版本更新

A Complete Loss Landscape Analysis of Regularized Deep Matrix Factorization

正则化深度矩阵分解的完整损失景观分析

Po Chen, Rujun Jiang, Peng Wang

发表机构 * School of Data Science, Fudan University, Shanghai, China(复旦大学数据科学学院,上海,中国) Department of Computer and Information Science, University of Macau, Macau SAR, China(澳门大学计算机与信息科学系,澳门特别行政区,中国)

AI总结 本文通过闭式表征所有临界点并分类其类型,揭示了正则化深度矩阵分解的损失景观,解释了梯度方法几乎总是收敛到局部极小值的原因。

Comments 30 pages, 2 figures

详情
AI中文摘要

尽管深度矩阵分解(DMF)在各个领域有广泛的应用,但其优化基础仍然很大程度上是开放的。在这项工作中,我们旨在通过全面研究正则化DMF问题的损失景观来填补这一空白。为此,我们首先提供了该问题所有临界点的闭式表征。在此基础上,我们建立了临界点是局部极小值、全局极小值、严格鞍点或非严格鞍点的精确条件。利用这些结果,我们推导出每个临界点要么是局部极小值要么是严格鞍点的充要条件。这为梯度方法几乎总是收敛到正则化DMF问题的局部极小值提供了见解。最后,我们进行了数值实验以可视化其损失景观,支持我们的理论。

英文摘要

Despite its wide range of applications across various domains, the optimization foundations of deep matrix factorization (DMF) remain largely open. In this work, we aim to fill this gap by conducting a comprehensive study of the loss landscape of the regularized DMF problem. Toward this goal, we first provide a closed-form characterization of all critical points of the problem. Building on this, we establish precise conditions under which a critical point is a local minimizer, a global minimizer, a strict saddle point, or a non-strict saddle point. Leveraging these results, we derive a necessary and sufficient condition under which every critical point is either a local minimizer or a strict saddle point. This provides insights into why gradient-based methods almost always converge to a local minimizer of the regularized DMF problem. Finally, we conduct numerical experiments to visualize its loss landscape to support our theory.

2506.12815 2026-05-29 cs.LG 版本更新

TrojanTO: Action-Level Backdoor Attacks against Trajectory Optimization Models

TrojanTO:针对轨迹优化模型的行动级后门攻击

Yang Dai, Oubo Ma, Longfei Zhang, Xingxing Liang, Xiaochun Cao, Shouling Ji, Jiaheng Zhang, Jincai Huang, Li Shen

发表机构 * Laboratory for Big Data and Decision, National University of Defense Technology(大数据与决策实验室,国防科技大学) Zhejiang University(浙江大学) Shenzhen Campus of Sun Yat-sen University(中山大学深圳校区) National University of Singapore(新加坡国立大学)

AI总结 提出TrojanTO,首个针对轨迹优化模型的行动级后门攻击方法,通过交替训练增强触发与目标动作关联,并利用轨迹过滤和批量投毒实现高隐蔽性,在低攻击预算下有效植入后门。

Comments 23 pages, 6 figures

详情
Journal ref
International Conference on Learning Representations (ICLR), 2026
AI中文摘要

轨迹优化(TO)模型的最新进展在离线强化学习中取得了显著成功。然而,它们对后门攻击的脆弱性尚不清楚。我们发现,现有的强化学习后门攻击基于奖励操纵,由于TO模型固有的序列建模特性,这些攻击对其基本无效。此外,高维动作空间带来的复杂性进一步加剧了动作操纵的挑战。为解决这些问题,我们提出了TrojanTO,这是首个针对TO模型的行动级后门攻击。TrojanTO采用交替训练来增强触发器与目标动作之间的关联,以提高攻击有效性。为提高攻击隐蔽性,它通过轨迹过滤进行精确投毒以保持正常性能,并通过批量投毒确保触发器一致性。大量评估表明,TrojanTO能够在低攻击预算(0.3%的轨迹)下,跨不同任务和攻击目标有效植入后门攻击。此外,TrojanTO对DT、GDT和DC具有广泛的适用性,突显了其跨多种TO模型架构的可扩展性。

英文摘要

Recent advances in Trajectory Optimization (TO) models have achieved remarkable success in offline reinforcement learning. However, their vulnerabilities against backdoor attacks are poorly understood. We find that existing backdoor attacks in reinforcement learning are based on reward manipulation and are largely ineffective against TO models due to their inherent sequence modeling nature. Moreover, the complexities introduced by high-dimensional action spaces further compound the challenge of action manipulation. To address these gaps, we propose TrojanTO, the first action-level backdoor attack against TO models. TrojanTO employs alternating training to enhance the connection between triggers and target actions for attack effectiveness. To improve attack stealth, it utilizes precise poisoning via trajectory filtering for normal performance and batch poisoning for trigger consistency. Extensive evaluations demonstrate that TrojanTO effectively implants backdoor attacks across diverse tasks and attack objectives with a low attack budget (0.3% of trajectories). Furthermore, TrojanTO exhibits broad applicability to DT, GDT, and DC, underscoring its scalability across diverse TO model architectures.

2505.21627 2026-05-29 cs.GT cs.AI cs.CY cs.LG 版本更新

Is Your LLM Overcharging You? Tokenization, Transparency, and Incentives

你的大语言模型是否在过度收费?分词、透明度与激励

Ander Artola Velasco, Stratis Tsirtsis, Nastaran Okati, Manuel Gomez-Rodriguez

发表机构 * Ander Artola Velasco(1. 阿德纳·阿尔托拉·韦拉斯科) Stratis Tsirtsis(2. 斯特拉蒂斯·蒂尔蒂斯) Nastaran Okati(3. 纳斯塔兰·奥卡蒂)

AI总结 研究当前按token计费机制下,服务提供商可能通过策略性报告token数量来过度收费,并提出按字符线性定价的激励相容机制以消除该财务激励。

Comments Selected as an oral presentation at ICML 2026

详情
AI中文摘要

最先进的大语言模型需要专门的硬件和大量能源来运行。因此,提供大语言模型访问的基于云的服务变得非常流行。在这些服务中,用户为模型生成的输出支付的价格取决于模型用于生成该输出的token数量:他们为每个token支付固定价格。在这项工作中,我们表明这种定价机制为提供商创造了财务激励,使其策略性地虚报模型用于生成输出的token数量,而用户无法证明甚至不知道提供商是否在过度收费。然而,我们也表明,如果不诚实的提供商被强制要求透明地说明模型使用的生成过程,那么在不引起怀疑的情况下最优地虚报是困难的。尽管如此,作为概念验证,我们开发了一种高效的启发式算法,使提供商能够在不引起怀疑的情况下显著过度收费用户。关键的是,我们证明运行该算法的成本低于从过度收费用户中获得的额外收入,突显了当前按token计费机制下用户的脆弱性。此外,我们表明,为了消除策略性行为的财务激励,定价机制必须根据token的字符数线性定价。虽然这会使提供商的利润率因token而异,但我们引入了一个简单的方案,采用这种激励相容定价机制的提供商可以维持他们在按token计费机制下的平均利润率。在此过程中,为了说明和补充我们的理论结果,我们使用来自$\texttt{Llama}$、$\texttt{Gemma}$和$\texttt{Ministral}$系列的几个大语言模型以及来自LMSYS Chatbot Arena平台的输入提示进行了实验。

英文摘要

State-of-the-art large language models require specialized hardware and substantial energy to operate. As a consequence, cloud-based services that provide access to large language models have become very popular. In these services, the price users pay for an output provided by a model depends on the number of tokens the model uses to generate it: they pay a fixed price per token. In this work, we show that this pricing mechanism creates a financial incentive for providers to strategize and misreport the (number of) tokens a model used to generate an output, and users cannot prove, or even know, whether a provider is overcharging them. However, we also show that, if an unfaithful provider is obliged to be transparent about the generative process used by the model, misreporting optimally without raising suspicion is hard. Nevertheless, as a proof-of-concept, we develop an efficient heuristic algorithm that allows providers to significantly overcharge users without raising suspicion. Crucially, we demonstrate that the cost of running the algorithm is lower than the additional revenue from overcharging users, highlighting the vulnerability of users under the current pay-per-token pricing mechanism. Further, we show that, to eliminate the financial incentive to strategize, a pricing mechanism must price tokens linearly on their character count. While this makes a provider's profit margin vary across tokens, we introduce a simple prescription under which the provider who adopts such an incentive-compatible pricing mechanism can maintain the average profit margin they had under the pay-per-token pricing mechanism. Along the way, to illustrate and complement our theoretical results, we conduct experiments with several large language models from the $\texttt{Llama}$, $\texttt{Gemma}$ and $\texttt{Ministral}$ families, and input prompts from the LMSYS Chatbot Arena platform.
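The incentive argument in the abstract can be illustrated with a toy calculation: two tokenizations of the same output string cost different amounts under pay-per-token pricing, but the same amount when each token is priced linearly in its character count. The function names, token splits, and integer prices below are hypothetical and chosen only for the example.

```python
def pay_per_token(tokens, price_per_token):
    """Charge a fixed price for every reported token."""
    return len(tokens) * price_per_token


def pay_per_character(tokens, price_per_char):
    """Charge each token linearly in its character count."""
    return sum(len(token) for token in tokens) * price_per_char


# Two ways of reporting the same output string "tokenization".
honest_report = ["token", "ization"]
inflated_report = ["tok", "en", "iz", "ation"]

# Prices are integers (say, micro-dollars) to keep the arithmetic exact.
per_token_honest = pay_per_token(honest_report, 2)
per_token_inflated = pay_per_token(inflated_report, 2)
per_char_honest = pay_per_character(honest_report, 1)
per_char_inflated = pay_per_character(inflated_report, 1)
```

Under per-token pricing the inflated report doubles the bill, while under per-character pricing both reports cost the same because the total depends only on the output string the user actually sees.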

2505.20955 2026-05-29 cs.CR cs.LG 版本更新

Enhancing Membership Inference Attacks on Diffusion Models from a Frequency-Domain Perspective

从频域角度增强扩散模型的成员推理攻击

Puwei Lian, Yujun Cai, Songze Li, Bingkun Bao

发表机构 * Southeast University(东南大学) The University of Queensland(昆士兰大学) Hefei University of Technology(合肥工业大学) Engineering Research Center of Blockchain Application, Supervision and Management (Southeast University), Ministry of Education(区块链应用、监督与管理工程研究中心(东南大学),教育部)

AI总结 本文从频域角度揭示扩散模型处理高频信息的缺陷导致成员推理攻击误分类,并提出即插即用的高频滤波模块以提升攻击性能。

Comments Accepted to Forty-Third International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

扩散模型在图像生成方面取得了巨大成功,但也引发了关于隐私和版权的重要担忧。成员推理攻击(MIAs)旨在确定特定数据是否在模型训练阶段被使用。由于当前针对扩散模型的MIAs通常利用模型的图像预测能力,我们将其形式化为一个统一的一般范式,通过计算成员分数进行成员识别。在该范式下,我们通过实验发现现有攻击忽略了扩散模型处理高频信息时的固有缺陷。因此,该缺陷导致包含更多高频内容的成员数据被误分类为留出数据,而高频内容较少的留出数据则倾向于被误分类为成员数据。此外,我们从理论上证明该缺陷降低了攻击的成员优势,从而干扰了对成员数据和留出数据的有效区分。基于这一发现,我们提出了一种即插即用的高频滤波模块,以减轻该缺陷的不利影响,该模块可以无缝集成到一般范式中的任何攻击中,且无需额外时间成本。大量实验证实,该模块在不同数据集和模型上显著提升了基线攻击的性能。代码可在 https://github.com/poetic2/FreMIA 获取。

英文摘要

Diffusion models have achieved tremendous success in image generation, but they also raise significant concerns regarding privacy and copyright issues. Membership Inference Attacks (MIAs) are designed to ascertain whether specific data was utilized during a model's training phase. As current MIAs for diffusion models typically exploit the model's image prediction ability, we formalize them into a unified general paradigm that computes the membership score for membership identification. Under this paradigm, we empirically find that existing attacks overlook the inherent deficiency in how diffusion models process high-frequency information. Consequently, this deficiency leads to member data with more high-frequency content being misclassified as hold-out data, and hold-out data with less high-frequency content tends to be misclassified as member data. Moreover, we theoretically demonstrate that this deficiency reduces the membership advantage of attacks, thereby interfering with the effective discrimination of member data and hold-out data. Based on this insight, we propose a plug-and-play high-frequency filter module to mitigate the adverse effects of the deficiency, which can be seamlessly integrated into any attacks within the general paradigm without additional time costs. Extensive experiments corroborate that this module significantly improves the performance of baseline attacks across different datasets and models. Code is available at https://github.com/poetic2/FreMIA.

2505.13745 2026-05-29 cs.LG stat.ML 版本更新

Synthetic Non-stationary Data Streams for Recognition of the Unknown

用于未知识别的合成非平稳数据流

Joanna Komorniczak

发表机构 * Wrocław University of Science and Technology(沃拉夫大学科学与技术学院)

AI总结 提出一种同时包含概念漂移和新类出现的合成数据流生成策略,并评估无监督漂移检测器在开放集识别任务中的表现。

详情
AI中文摘要

数据非平稳性问题在数据流处理中常被讨论。在动态环境中,方法应持续准备分析时变数据——因此,它们应支持增量训练并应对概念漂移。非平稳数据流环境中另一个同样重要的变化是新的、先前未知类别的出现。通常,方法专注于这两种现象之一——检测概念漂移或检测新类别——而数据流中可能同时出现这两种困难。此外,关于先前未知的观测,开放类别集的话题近年来变得尤为重要,方法的目标是在已知类别内高效分类,并识别模型能力范围外的对象。本文提出一种合成数据流生成策略,其中同时出现概念漂移和代表未知对象的新类别。所呈现的研究展示了无监督漂移检测器如何处理检测新类别和概念漂移的任务,并演示了生成的数据流如何用于开放集识别任务。

英文摘要

The problem of data non-stationarity is commonly addressed in data stream processing. In a dynamic environment, methods should continuously be ready to analyze time-varying data -- hence, they should enable incremental training and respond to concept drifts. An equally important variability typical for non-stationary data stream environments is the emergence of new, previously unknown classes. Often, methods focus on one of these two phenomena -- detection of concept drifts or detection of novel classes -- while both difficulties can be observed in data streams. Additionally, concerning previously unknown observations, the topic of open set of classes has become particularly important in recent years, where the goal of methods is to efficiently classify within known classes and recognize objects outside the model competence. This article presents a strategy for synthetic data stream generation in which both concept drifts and the emergence of new classes representing unknown objects occur. The presented research shows how unsupervised drift detectors address the task of detecting novelty and concept drifts and demonstrates how the generated data streams can be utilized in the open set recognition task.

2505.02604 2026-05-29 cs.LG 版本更新

Connecting Independently Trained Modes via Layer-Wise Connectivity

通过逐层连接性连接独立训练的模式

Yongding Tian, Zaid Al-Ars, Maksim Kitsak, Peter Hofstee

发表机构 * Computer Engineering Lab, Delft University of Technology, Delft, NL(代尔夫特理工大学计算机工程实验室) Network and Architecture Service, Delft University of Technology, Delft, NL(代尔夫特理工大学网络与架构服务) IBM Infrastructure, TX, USA(IBM基础设施)

AI总结 提出一种新的经验算法,通过逐层连接性构建独立训练神经网络模型之间的连续低损失路径,在多种现代架构上实现更一致的模式连接。

Comments 28 pages, 22 figures, accepted in ICML 2026: https://openreview.net/forum?id=4VOTzpH9MO

详情
AI中文摘要

实证研究表明,可以在独立训练的神经网络模型之间构建连续的低损失路径。这种现象称为模式连接性,指的是在参数空间中不同模式(即训练良好的解)之间存在这样的路径。然而,现有的经验方法不能可靠地连接独立训练的模式,并且主要在一组狭窄的架构(例如,基本的CNN、VGG和ResNet)上进行了评估,使得它们在新模型上的有效性尚不清楚。在这项工作中,我们提出了一种新的经验算法,用于连接独立训练的模式,该算法超越了传统架构,支持更广泛的网络,包括MobileNet、ShuffleNet、EfficientNet、RegNet、深度层聚合(DLA)和紧凑卷积变换器(CCT)。除了更广泛的适用性外,所提出的方法在独立训练的模式对之间产生更一致的连接路径,并支持连接使用不同训练超参数获得的模式。

英文摘要

Empirical studies have shown that continuous low-loss paths can be constructed between independently trained neural network models. This phenomenon, known as mode connectivity, refers to the existence of such paths between distinct modes, i.e., well-trained solutions in parameter space. However, existing empirical methods do not reliably connect independently trained modes and have been evaluated mainly on a narrow set of architectures (e.g., basic CNNs, VGG, and ResNet), leaving their effectiveness on newer models unclear. In this work, we propose a new empirical algorithm for connecting independently trained modes that generalizes beyond traditional architectures and supports a broader range of networks, including MobileNet, ShuffleNet, EfficientNet, RegNet, Deep Layer Aggregation (DLA), and Compact Convolutional Transformers (CCT). In addition to broader applicability, the proposed method yields more consistent connectivity paths across independently trained mode pairs and supports connecting modes obtained with different training hyperparameters.
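A simple way to see why connecting modes is non-trivial is to measure the loss barrier along the straight line between two solutions. The sketch below uses a one-parameter double-well loss as a stand-in for a network's loss surface; it shows the naive linear-interpolation baseline only, not the layer-wise algorithm proposed in the paper, and all names are assumptions for illustration.

```python
def interpolate(theta_a, theta_b, t):
    """Point on the straight line between two parameter vectors, t in [0, 1]."""
    return [(1 - t) * a + t * b for a, b in zip(theta_a, theta_b)]


def loss_barrier(loss_fn, theta_a, theta_b, steps=11):
    """Highest loss on the linear path minus the worse of the two endpoints."""
    ts = [k / (steps - 1) for k in range(steps)]
    losses = [loss_fn(interpolate(theta_a, theta_b, t)) for t in ts]
    return max(losses) - max(losses[0], losses[-1])


def double_well(theta):
    """Toy loss with two minima (modes) at x = -1 and x = +1."""
    x = theta[0]
    return (x * x - 1.0) ** 2


barrier = loss_barrier(double_well, [-1.0], [1.0])
```

A barrier near zero means the two modes are linearly connected; a large barrier, as in this toy case, is the situation where a curved or otherwise constructed path is needed.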

2505.02069 2026-05-29 cs.LG stat.ML 版本更新

Neural Logistic Bandits

神经逻辑老虎机

Seoungbin Bae, Dabeen Lee

发表机构 * Department of Industrial & Systems Engineering, KAIST Department of Mathematical Sciences & Research Institute of Mathematics, Seoul National University Interdisciplinary Program in Artificial Intelligence, Seoul National University

AI总结 针对神经逻辑老虎机问题,利用一种新型的自归一化向量值鞅的Bernstein型不等式,提出两种算法NeuralLog-UCB-1和NeuralLog-UCB-2,分别实现与有效维度相关的遗憾上界,改进了现有结果。

详情
AI中文摘要

我们研究了神经逻辑老虎机问题,其主要任务是通过神经网络学习逻辑链接函数内的未知奖励函数。现有方法要么对$κ$(其中$1/κ$表示奖励分布的最小方差)有不利的依赖,要么直接依赖于特征维度$d$,而在基于神经网络的设置中$d$可能非常大。在这项工作中,我们引入了一种新型的自归一化向量值鞅的Bernstein型不等式,旨在绕过对环境维度的直接依赖。这使我们能够推导出一个遗憾上界,该上界随有效维度$\widetilde{d}$增长,而不是特征维度,同时保持对$κ$的最小依赖。基于该集中不等式,我们提出了两种算法NeuralLog-UCB-1和NeuralLog-UCB-2,它们分别保证了$\widetilde{O}(\widetilde{d}\sqrt{κT})$和$\widetilde{O}(\widetilde{d}\sqrt{T/κ})$阶的遗憾上界,改进了现有结果。最后,我们在合成数据集和真实数据集上报告了数值结果,以验证我们的理论发现。

英文摘要

We study the problem of neural logistic bandits, where the main task is to learn an unknown reward function within a logistic link function using a neural network. Existing approaches either exhibit unfavorable dependencies on $κ$, where $1/κ$ represents the minimum variance of reward distributions, or suffer from direct dependence on the feature dimension $d$, which can be huge in neural network-based settings. In this work, we introduce a novel Bernstein-type inequality for self-normalized vector-valued martingales that is designed to bypass a direct dependence on the ambient dimension. This lets us deduce a regret upper bound that grows with the effective dimension $\widetilde{d}$, not the feature dimension, while keeping a minimal dependence on $κ$. Based on the concentration inequality, we propose two algorithms, NeuralLog-UCB-1 and NeuralLog-UCB-2, that guarantee regret upper bounds of order $\widetilde{O}(\widetilde{d}\sqrt{κT})$ and $\widetilde{O}(\widetilde{d}\sqrt{T/κ})$, respectively, improving on the existing results. Lastly, we report numerical results on both synthetic and real datasets to validate our theoretical findings.

2502.20954 2026-05-29 cs.LG 版本更新

Robust and Efficient Writer-Independent IMU-Based Handwriting Recognition

鲁棒且高效的独立于书写者的基于IMU的手写识别

Jindong Li, Tim Hamann, Jens Barth, Peter Kämpf, Dario Zanca, Björn Eskofier

发表机构 * Machine Learning and Data Analytics Lab, Friedrich-Alexander-Universität Erlangen-Nürnberg, Erlangen, Germany(机器学习与数据分析实验室,埃朗根-纽伦堡大学,埃朗根,德国) STABILO International GmbH, Heroldsberg, Germany(STABILO国际有限公司,赫尔兹堡,德国) Translational Digital Health Group, Institute of AI for Health, Helmholtz Zentrum München - German Research Center for Environmental Health, Neuherberg, Germany(转化数字健康组,健康人工智能研究所,慕尼黑-德国环境健康研究中心,纽赫堡,德国)

AI总结 提出一种结合CNN编码器和BiLSTM解码器的模型,在IMU数据上实现独立于书写者的手写识别,在OnHW数据集和自建数据集上分别达到7.37%和9.44%的字符错误率,并展现出对未见书写风格的鲁棒性。

Comments Accepted at iWOAR 2025. Published in Springer LNCS, 2026. Code available at https://github.com/jindongli24/REWI

详情
Journal ref
Sensor-Based Activity Recognition and Artificial Intelligence (iWOAR 2025), Lecture Notes in Computer Science, pp. 261-286, Springer, Cham, 2026
AI中文摘要

使用惯性测量单元(IMU)数据进行手写识别(HWR)由于书写风格的多样性和数据集的有限性仍然具有挑战性。以往的方法往往难以处理未见过的书写者的手写,使得独立于书写者(WI)的识别成为一个关键但困难的问题。本文提出了一种模型,旨在提高基于IMU数据的WI HWR性能,该模型使用CNN编码器和基于BiLSTM的解码器。我们的方法对未见过的书写风格表现出强大的鲁棒性,在公共OnHW数据集和我们基于单词的数据集的WI划分上均优于现有方法,分别实现了7.37%和9.44%的字符错误率(CER),以及15.12%和32.17%的词错误率(WER)。鲁棒性评估表明,我们的模型在不同年龄组中保持优越性能,并且从一个组学到的知识相比其他方法能更好地泛化到另一个组。在我们基于句子的数据集上的评估进一步展示了识别完整句子的潜力。通过全面的消融研究,我们表明我们的设计选择在性能和效率之间实现了良好的平衡。这些发现支持开发更适应和可扩展的HWR系统用于实际应用。

英文摘要

Handwriting recognition (HWR) using inertial measurement unit (IMU) data remains challenging due to variations in writing styles and the limited availability of datasets. Previous approaches often struggle with handwriting from unseen writers, making writer-independent (WI) recognition a crucial yet difficult problem. This paper presents a model designed to improve WI HWR on IMU data, using a CNN encoder and BiLSTM-based decoder. Our approach demonstrates strong robustness to unseen handwriting styles, outperforming existing methods on the WI splits of both the public OnHW dataset and our word-based dataset, achieving character error rates (CERs) of 7.37% and 9.44%, and word error rates (WERs) of 15.12% and 32.17%, respectively. Robustness evaluation shows that our model maintains superior performance across different age groups, with knowledge learned from one group generalizing better to another compared to other approaches. Evaluation on our sentence-based dataset further demonstrates the potential for recognizing full sentences. Through comprehensive ablation studies, we show that our design choices achieve a strong balance between performance and efficiency. These findings support the development of more adaptable and scalable HWR systems for real-world applications.
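The character and word error rates quoted above are conventionally computed as the Levenshtein edit distance between the reference and the recognized sequence, divided by the reference length. A minimal sketch of that standard metric follows; the function names are illustrative, and the paper's exact evaluation script may differ in details such as text normalization.

```python
def edit_distance(reference, hypothesis):
    """Levenshtein distance between two sequences (strings or word lists)."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_item in enumerate(reference, start=1):
        current = [i]
        for j, hyp_item in enumerate(hypothesis, start=1):
            substitution_cost = 0 if ref_item == hyp_item else 1
            current.append(min(
                previous[j] + 1,                       # deletion
                current[j - 1] + 1,                    # insertion
                previous[j - 1] + substitution_cost,   # match or substitution
            ))
        previous = current
    return previous[-1]


def error_rate(reference, hypothesis):
    """Edit distance normalized by the reference length."""
    if len(reference) == 0:
        raise ValueError("reference must not be empty")
    return edit_distance(reference, hypothesis) / len(reference)


# CER compares characters; WER compares whitespace-separated words.
cer = error_rate("hello", "hallo")
wer = error_rate("the cat sat".split(), "the cat sit".split())
```

Passing strings yields a character error rate and passing word lists yields a word error rate, so the same routine covers both metrics.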

2502.20838 2026-05-29 cs.SD cs.AI cs.LG eess.AS 版本更新

Weakly Supervised Detection and Temporal Localization of Whale Calls in Long-Duration Bioacoustic Data

弱监督检测与长时间生物声学数据中鲸叫声的时间定位

Ragib Amin Nihal, Benjamin Yen, Runwu Shi, Takeshi Ashizawa, Kazuhiro Nakadai

发表机构 * Systems and Control Engineering, School of Engineering, Institute of Science Tokyo, Japan(日本东京科学大学工学院系统与控制工程系)

AI总结 提出DSMIL-LocNet框架,利用弱监督多实例学习仅使用录音级标签实现鲸叫声的分类和时间定位,在长录音上优于全监督基线。

Comments Accepted in European Signal Processing Conference (EUSIPCO) 2026

详情
AI中文摘要

被动声学监测(PAM)系统生成持续数月连续录音,但自动化生物声学分析鲸叫声需要两种独立的标注工作:用于分类的二元存在标签和用于定位的精确时间边界。一个多分钟录音的二元标签可以在几秒钟内分配,但对其中的每个叫声打时间戳需要数小时的专家努力。在操作规模上同时提供两者是不可行的。我们提出DSMIL-LocNet,一个弱监督多实例学习(MIL)框架,仅使用录音级存在/缺失标签执行分类和时间定位。我们的双流架构整合频谱和时间特征,处理2-30分钟的录音,而无需现有CNN方法在长输入上退化的时间压缩。在AcousticTrends BlueFinLibrary上,DSMIL-LocNet在300-1800秒录音上达到F1分数0.88-0.91,而全监督CNN基线退化为0.19-0.64。它还提供这些基线在没有帧级标注的情况下无法产生的时间定位。代码:https://github.com/Ragib-Amin-Nihal/DSMIL-LocNet

英文摘要

Passive acoustic monitoring (PAM) systems generate continuous recordings spanning months, yet automated bioacoustic analysis of whale calls requires two separate annotation efforts: binary presence labels for classification and precise temporal boundaries for localization. A binary label for a multi-minute recording can be assigned in seconds, but timestamping every call within it requires hours of expert effort. Providing both is infeasible at operational scale. We present DSMIL-LocNet, a weakly supervised multiple instance learning (MIL) framework that performs both classification and temporal localization using only recording-level presence/absence labels. Our dual-stream architecture integrates spectral and temporal features to process recordings of 2--30 minutes without the temporal compression that degrades existing CNN methods on long inputs. On the AcousticTrends BlueFinLibrary, DSMIL-LocNet achieves F1 scores of 0.88--0.91 on recordings of 300--1800s, where fully supervised CNN baselines degrade to 0.19--0.64. It also provides temporal localization that these baselines cannot produce without frame-level annotation. Code: https://github.com/Ragib-Amin-Nihal/DSMIL-Loc
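To illustrate how recording-level labels alone can yield both a classification and a temporal localization, here is a minimal sketch of generic attention-based MIL pooling. It is not the DSMIL-LocNet dual-stream architecture, whose details the abstract does not give. The per-segment scores and attention logits are assumed to come from some upstream network, and all names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def mil_forward(instance_scores, attention_logits):
    """Pool per-segment scores into one recording-level probability.

    Only the returned bag probability would be compared with the
    recording-level presence/absence label during training.
    """
    weights = softmax(attention_logits)
    bag_logit = sum(w * s for w, s in zip(weights, instance_scores))
    bag_prob = 1.0 / (1.0 + math.exp(-bag_logit))
    return bag_prob, weights


def localize(weights, segment_seconds, threshold):
    """Return (start, end) times of segments whose attention exceeds threshold."""
    return [(i * segment_seconds, (i + 1) * segment_seconds)
            for i, w in enumerate(weights) if w > threshold]
```

In this sketch the attention weights double as a localization signal: segments that dominate the bag-level decision are reported as candidate call intervals, even though no frame-level annotation was used.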

2502.10330 2026-05-29 cs.LG 版本更新

Diffusion-based learning framework for Constrained Nonconvex Optimization with Weighted Bootstrapped Refinement

基于扩散的约束非凸优化学习框架与加权自举细化

Shutong Ding, Yimiao Zhou, Ke Hu, Xi Yao, Junchi Yan, Xiaoying Tang, Ye Shi

发表机构 * ShanghaiTech University(上海科技大学) MoE Key Laboratory of Intelligent Perception and Human-Machine Collaboration(智能感知与人机协同教育部重点实验室) Shanghai Jiao Tong University(上海交通大学) China Mobile Communications Company Limited Research Institute(中国移动通信有限公司研究院) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳))

AI总结 提出DiOpt框架,通过监督预热和自举训练两阶段学习噪声到约束区域的映射,解决扩散模型在约束非凸优化中的分布错位问题,实现高约束满足和最优性。

Comments Accepted by ICML 2026

详情
AI中文摘要

扩散模型的最新进展显示出通过利用其多模态性来加速非凸问题求解的潜力。然而,现有的大多数基于扩散的优化方法依赖于监督学习,并且缺乏强制执行约束满足的机制,而这在现实应用中是需要满足的。在这种情况下,我们研究并理论分析了监督扩散求解器的固有问题,并识别出分布错位问题,即生成的解分布在可行区域上的概率质量通常较低。为了解决这个问题,我们提出了DiOpt,一种新的基于扩散的约束非凸优化学习框架,它有效地学习了从噪声到约束区域的映射。具体来说,该框架在两个不同的阶段运行:初始的预热阶段,通过监督学习实现,随后是自举训练阶段。这种双阶段架构旨在迭代地细化解,从而在高度满足约束的情况下改进目标函数。最后,我们还在推理中采用解选择技术以获得更好的最优性。值得注意的是,DiOpt是首次成功将扩散求解器集成到约束非凸优化中。在多样化的非凸任务上的评估显示了DiOpt在最优性和约束满足方面的优越性。我们的官方页面发布在https://dingsht.tech/diopt-webpage。

英文摘要

Recent advances in diffusion models show the potential to accelerate nonconvex problem solving by leveraging their multimodality. However, most existing diffusion-based optimization approaches rely on supervised learning and lack a mechanism to enforce constraint satisfaction, which is required in real-world applications. In this context, we investigate and theoretically analyze the inherent problem of supervised diffusion solvers and identify the distributional misalignment problem, i.e., the generated solution distribution often exhibits low probability mass on the feasible region. To resolve this issue, we propose DiOpt, a new diffusion-based learning framework for constrained nonconvex optimization, which effectively learns the mapping from noise to the constraint region. Specifically, this framework operates in two distinct phases: an initial warm-start phase, implemented via supervised learning, followed by a bootstrapping training phase. This dual-phase architecture is designed to iteratively refine solutions, thereby improving the objective function with high constraint satisfaction. Finally, we also employ a solution selection technique in inference for better optimality. Notably, DiOpt is the first successful integration of the diffusion solver in constrained nonconvex optimization. Evaluations on diverse nonconvex tasks demonstrate the superiority of DiOpt in both optimality and constraint satisfaction. Our official page is released at https://dingsht.tech/diopt-webpage.

2502.10205 2026-05-29 cs.LG 版本更新

Looking around you: external information enhances representations for event sequences

环顾四周:外部信息增强事件序列的表示

Petr Sokerin, Maria Kovaleva, Ekaterina Boyarina, Pavel Tikhomirov, Denis Vorobiyov, Alexey Zaytsev

发表机构 * LARSS Laboratory, AI Center, Skoltech(LARSS实验室、人工智能中心、Skoltech)

AI总结 针对事件序列表示学习中忽略同时发生序列上下文的问题,提出通过聚合多个用户表示来增强特定用户表示的方法,其中可学习注意力机制在多个数据集上显著提升指标。

详情
AI中文摘要

表示学习在不同领域产生模型,例如商店购买、客户交易和一般人的行为。然而,这类用于事件序列的模型通常孤立地处理每个序列,忽略了那些在时间上同时发生的序列的上下文。这种限制在金融和电子商务等条件快速变化的领域,或当某些序列缺乏近期事件时尤其成问题。我们开发了一种方法,从多个用户表示中聚合信息,在多个同时发生的事件序列的设置中增强特定用户的表示,实现了比独立处理每个序列更好的质量。我们的研究考虑了多种聚合方法,从简单的池化技术到可学习注意力聚合,后者可以突出其他用户之间更复杂的信息流。所提出的方法在现有编码器之上运行,并支持其高效微调。在九个多样化的事件序列数据集(金融、电子商务、娱乐等)和下游任务中,可学习注意力在有无微调的情况下均改善了指标分数,而均值池化虽然增益较小但仍然显著。

英文摘要

Representation learning produces models in different domains, such as store purchases, client transactions, and human behavior in general. However, such models for event sequences usually process each sequence in isolation, ignoring context from those that co-occur in time. This limitation is particularly problematic in domains with fast-evolving conditions, like finance and e-commerce, or when certain sequences lack recent events. We develop a method that aggregates information from multiple user representations, augmenting a specific user's representation in a setting with multiple co-occurring event sequences, achieving better quality than processing each sequence independently. Our study considers diverse aggregation approaches, ranging from simple pooling techniques to learnable attention aggregation, which can highlight more complex information flow among other users. The proposed methods operate on top of an existing encoder and support its efficient fine-tuning. Across nine diverse event sequence datasets (finance, e-commerce, entertainment, etc.) and downstream tasks, learnable attention improves metric scores, both with and without fine-tuning, while mean pooling yields a smaller but still significant gain.
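The two ends of the aggregation spectrum described above can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the attention variant here uses a plain dot-product score with no learned projections (the paper's version is learnable), and concatenation is assumed as the way the context is attached to the user's own embedding.

```python
import math


def mean_pool(others):
    """Average the embeddings of co-occurring users, dimension by dimension."""
    n = len(others)
    return [sum(column) / n for column in zip(*others)]


def attention_pool(query, others, temperature=1.0):
    """Weight other users by softmax of their dot product with the query user."""
    scores = [sum(q * o for q, o in zip(query, vec)) / temperature
              for vec in others]
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    return [sum(w * vec[d] for w, vec in zip(weights, others))
            for d in range(len(query))]


def augment(user_vec, context_vec):
    """Attach the pooled context to the user's own representation."""
    return list(user_vec) + list(context_vec)
```

Mean pooling treats every other user equally, while the attention variant lets the target user's own embedding decide whose context matters most.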

2502.01360 2026-05-29 cs.LG math.AT q-bio.NC 版本更新

A Quotient Homology Theory of Representation in Neural Networks

神经网络表示的商同调理论

Kosio Beshkov

发表机构 * Department of Physics, University of Oslo(奥斯陆大学物理系)

AI总结 利用ReLU神经网络的分片线性性质,定义输入数据集上的等价关系并构造商空间,证明在凸性条件下神经表示的同调群与商同调群同构,从而无需外部度量即可计算Betti数。

详情
Journal ref
Transactions on Machine Learning Research, 05/2026, https://openreview.net/forum?id=RluspxztzS
AI中文摘要

先前的研究已经证明,使用ReLU激活函数的神经网络所实现的映射集合与分片线性连续映射的集合相同。此外,这类网络诱导一个超平面排列,将网络的输入域分割成凸多面体$G_J$,网络$Φ$在这些多面体上以仿射方式运行。在本文中,我们利用这些性质在输入数据集上定义一个等价关系$\sim_Φ$,该关系定义了一个商空间,该商空间可被分割成两个集合,分别与$Φ_J$的局部秩以及交集$\cap \text{Im}Φ_{J_i}$相关。我们将后者称为\textit{重叠分解}$\mathcal{O}_Φ$,并证明如果每个多面体与输入流形之间的交集是凸的,则神经表示的同调群与商同调群$H_k(Φ(\mathcal{M})) \simeq H_k(\mathcal{M}/\mathcal{O}_Φ)$同构。这使我们能够在不选择外部度量的情况下内在地计算神经表示的Betti数。我们开发了通过线性规划和并查集算法数值计算重叠分解的方法。利用这一框架,我们在玩具数据集上进行了若干实验,表明与标准持续同调相比,基于重叠同调的Betti数计算追踪的是纯拓扑特征而非几何特征。最后,我们研究了几个分类问题中训练过程中重叠分解的演化,并讨论了该方法的一些缺点。

英文摘要

Previous research has proven that the set of maps implemented by neural networks with a ReLU activation function is identical to the set of piecewise linear continuous maps. Furthermore, such networks induce a hyperplane arrangement splitting the input domain of the network into convex polyhedra $G_J$ over which a network $Φ$ operates in an affine manner. In this work, we leverage these properties to define an equivalence relation $\sim_Φ$ on top of an input dataset, which defines a quotient space that can be split into two sets related to the local rank of $Φ_J$ and the intersections $\cap \text{Im}Φ_{J_i}$. We refer to the latter as the \textit{overlap decomposition} $\mathcal{O}_Φ$ and prove that if the intersections between each polyhedron and an input manifold are convex, the homology groups of neural representations are isomorphic to quotient homology groups $H_k(Φ(\mathcal{M})) \simeq H_k(\mathcal{M}/\mathcal{O}_Φ)$. This lets us intrinsically calculate the Betti numbers of neural representations without the choice of an external metric. We develop methods to numerically compute the overlap decomposition through linear programming and a union-find algorithm. Using this framework, we perform several experiments on toy datasets showing that, compared to standard persistent homology, our overlap homology-based computation of Betti numbers tracks purely topological rather than geometric features. Finally, we study the evolution of the overlap decomposition during training on several classification problems and discuss some shortcomings of our method.
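The union-find step mentioned above can be illustrated on its own. The sketch below assumes that the pairwise identifications (which points the network maps together) are already available as index pairs. In the paper these would come from the linear-programming tests, which this example does not reproduce. Class and function names are illustrative.

```python
class UnionFind:
    """Disjoint-set forest with path halving and union by size."""

    def __init__(self, n):
        self.parent = list(range(n))
        self.size = [1] * n

    def find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]  # path halving
            x = self.parent[x]
        return x

    def union(self, a, b):
        root_a, root_b = self.find(a), self.find(b)
        if root_a == root_b:
            return False
        if self.size[root_a] < self.size[root_b]:
            root_a, root_b = root_b, root_a
        self.parent[root_b] = root_a  # attach the smaller tree
        self.size[root_a] += self.size[root_b]
        return True


def equivalence_classes(n, identified_pairs):
    """Group indices 0..n-1 into classes generated by the given pairs."""
    uf = UnionFind(n)
    for a, b in identified_pairs:
        uf.union(a, b)
    groups = {}
    for i in range(n):
        groups.setdefault(uf.find(i), []).append(i)
    return sorted(groups.values())
```

Union-find closes the pairwise relation under transitivity in near-linear time, which is what turns a list of local identifications into the classes of a quotient.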

2412.00452 2026-05-29 cs.LG cs.CV 版本更新

Learning Locally, Revising Globally: Global Reviser for Federated Learning with Noisy Labels

局部学习,全局修正:面向含噪标签联邦学习的全局修正器

Yuxin Tian, Mouxing Yang, Yuhao Zhou, Jian Wang, Qing Ye, Tongliang Liu, Gang Niu, Jiancheng Lv

发表机构 * College of Computer Science, Sichuan University, Chengdu, China(四川大学计算机学院,中国成都) Engineering Research Center of Machine Learning(机器学习工程研究中心) University of Sydney, Sydney, Australia(悉尼大学,澳大利亚悉尼) Southeast University, Nanjing, China(东南大学,中国南京)

AI总结 针对联邦学习中标签噪声与数据异质性共存的问题,提出一种利用全局模型慢记忆特性的联邦全局修正器(FedGR),通过三个模块协同修正噪声标签并正则化局部训练,在三个基准上优于八种基线方法。

Comments ICML 2026 Camera Ready

详情
AI中文摘要

传统的联邦学习(FL)严重依赖高质量标签,这在实际应用中往往不现实,导致联邦标签噪声(F-LN)问题。更糟糕的是,FL的异质性加剧了F-LN问题,因为客户端经历不同的标签噪声类型、比率和数据分布。在本研究中,我们首先观察到FL的全局模型表现出对噪声标签的缓慢记忆现象,这表明其在FL中能够维持可靠的预测和鲁棒的表示。受此启发,我们提出了一种名为联邦全局修正器(FedGR)的新方法,这是一种直接而有效的方法,包含三个模块,协同修正噪声标签并正则化局部训练。通过利用这一固有属性,FedGR以自包含的方式提高了FL对标签噪声的鲁棒性。在三个广泛使用的F-LN基准上的大量实验表明,即使在严重的标签噪声和数据异质性下,FedGR也表现出优越的性能,始终优于八个最先进的基线。代码:https://github.com/cs-yuxintian/FedGR-ICML26

英文摘要

Conventional federated learning (FL) heavily depends on high-quality labels, which are often impractical to obtain in the real world, leading to the federated label-noise (F-LN) problem. Worse still, the F-LN problem is exacerbated by the heterogeneity of FL, as clients experience different label-noise types, ratios, and data distributions. In this study, we first observe an intriguing phenomenon that the global model of FL exhibits a slow memorization of noisy labels, suggesting its ability to maintain reliable predictions and robust representations in FL. Motivated by this, we propose a novel method termed Federated Global Reviser (FedGR), a straightforward yet effective method comprising three modules that collaboratively rectify noisy labels and regularize local training. By exploiting this inherent property, FedGR improves the label-noise robustness of FL in a self-contained manner. Extensive experiments on three widely used F-LN benchmarks demonstrate the superior performance of FedGR, consistently outperforming eight state-of-the-art baselines even under severe label noise and data heterogeneity. Code: https://github.com/cs-yuxintian/FedGR-ICML26

2411.03006 2026-05-29 math.CO cs.CC cs.DM cs.LG math.OC 版本更新

Neural Networks and (Virtual) Extended Formulations

神经网络与(虚拟)扩展公式

Christoph Hertrich, Georg Loho

发表机构 * Universität Berlin & University of Twente

AI总结 通过将神经网络表示能力与多面体的扩展复杂度关联,证明单调或输入凸神经网络规模的下界,并引入虚拟扩展复杂度以推广到一般神经网络。

详情
AI中文摘要

具有分段线性激活函数(如修正线性单元(ReLU)或maxout)的神经网络是现代机器学习中最基础的模型之一。我们通过将其表示能力与多面体$P$的扩展复杂度$\mathrm{xc}(P)$联系起来,向证明此类神经网络规模的下界迈出了一步。$\mathrm{xc}(P)$是组合优化和多面体几何中一个被充分研究的概念,描述了将$P$建模为线性规划所需的不等式数量。我们证明,$\mathrm{xc}(P)$是任何解决$P$上线性优化问题的单调或输入凸神经网络规模的下界。这暗示了此类神经网络在多种问题(包括多项式可解的最大权匹配问题)上的指数级下界。 为了尝试对一般神经网络也证明类似的下界,我们引入了虚拟扩展复杂度$\mathrm{vxc}(P)$的概念,它推广了$\mathrm{xc}(P)$,描述了将$P$上的线性优化问题表示为两个线性规划之差所需的不等式数量。我们证明$\mathrm{vxc}(P)$是任何在$P$上进行优化的神经网络规模的下界。虽然推导$\mathrm{vxc}(P)$的有用下界仍是一个开放问题,但我们通过证明给定具有小编码大小的虚拟扩展公式可以高效优化多面体$P$,论证了这一概念值得独立于神经网络进行研究。

英文摘要

Neural networks with piecewise linear activation functions, such as rectified linear units (ReLU) or maxout, are among the most fundamental models in modern machine learning. We make a step towards proving lower bounds on the size of such neural networks by linking their representative capabilities to the notion of the extension complexity $\mathrm{xc}(P)$ of a polytope $P$. This is a well-studied quantity in combinatorial optimization and polyhedral geometry describing the number of inequalities needed to model $P$ as a linear program. We show that $\mathrm{xc}(P)$ is a lower bound on the size of any monotone or input-convex neural network that solves the linear optimization problem over $P$. This implies exponential lower bounds on such neural networks for a variety of problems, including the polynomially solvable maximum weight matching problem. In an attempt to prove similar bounds also for general neural networks, we introduce the notion of virtual extension complexity $\mathrm{vxc}(P)$, which generalizes $\mathrm{xc}(P)$ and describes the number of inequalities needed to represent the linear optimization problem over $P$ as a difference of two linear programs. We prove that $\mathrm{vxc}(P)$ is a lower bound on the size of any neural network that optimizes over $P$. While it remains an open question to derive useful lower bounds on $\mathrm{vxc}(P)$, we argue that this quantity deserves to be studied independently from neural networks by proving that one can efficiently optimize over a polytope $P$ given a virtual extended formulation with small encoding size.

2410.23222 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Dataset-Driven Channel Masks in Transformers for Multivariate Time Series

数据集驱动的Transformer通道掩码用于多变量时间序列

Seunghan Lee, Taeyoung Park, Kibok Lee

发表机构 * Department of Statistics and Data Science, Yonsei University(延世大学统计与数据科学系) LG AI Research(LG人工智能研究)

AI总结 提出部分通道依赖(PCD)概念,通过数据集特定的通道掩码(CMs)改进Transformer中的通道依赖建模,并在多种任务和数据集上验证有效性。

Comments ICASSP 2026. Preliminary version: NeurIPS Workshop on Time Series in the Age of Large Models 2024 (Oral presentation)

详情
AI中文摘要

最近基础模型的进展已成功扩展到时间序列(TS)领域,这得益于大规模TS数据集的出现。捕获通道依赖(CD)对于建模多变量时间序列至关重要,并且基于注意力的方法已被广泛用于此目的。尽管如此,这些方法主要关注修改架构,往往忽略了数据集特定特征的重要性。在这项工作中,我们引入了部分通道依赖(PCD)的概念,通过利用数据集特定信息来增强基于Transformer的模型中的CD建模,从而细化模型捕获的CD。为了实现PCD,我们提出了通道掩码(CMs),通过逐元素乘法将其集成到Transformer的注意力矩阵中。CMs由两个组件组成:1)捕获通道之间关系的相似性矩阵,以及2)数据集特定且可学习的领域参数,用于细化相似性矩阵。我们在多种任务和数据集上使用不同的骨干网络验证了PCD的有效性。代码可在此存储库获取:https://github.com/YonseiML/pcd。

英文摘要

Recent advancements in foundation models have been successfully extended to the time series (TS) domain, facilitated by the emergence of large-scale TS datasets. Capturing channel dependency (CD) is essential for modeling multivariate TS, and attention-based methods have been widely employed for this purpose. Nonetheless, these methods primarily focus on modifying the architecture, often neglecting the importance of dataset-specific characteristics. In this work, we introduce the concept of partial channel dependence (PCD) to enhance CD modeling in Transformer-based models by leveraging dataset-specific information to refine the CD captured by the model. To achieve PCD, we propose channel masks (CMs), which are integrated into the attention matrices of Transformers via element-wise multiplication. CMs consist of two components: 1) a similarity matrix that captures relationships between the channels, and 2) dataset-specific and learnable domain parameters that refine the similarity matrix. We validate the effectiveness of PCD across diverse tasks and datasets with various backbones. Code is available at this repository: https://github.com/YonseiML/pcd.
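The channel-mask idea can be sketched as follows. The abstract only states that a similarity matrix is refined by learnable domain parameters and multiplied element-wise into the attention matrix, so the specifics here are assumptions: absolute Pearson correlation as the similarity, a sigmoid with two scalars `alpha` and `beta` as the refinement, and no re-normalization after masking. The repository linked above is the authoritative reference.

```python
import math


def correlation(x, y):
    """Pearson correlation of two equal-length, non-constant series."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def channel_mask(channels, alpha, beta):
    """mask[i][j] = sigmoid(alpha * |corr(i, j)| + beta).

    alpha and beta stand in for the learnable, dataset-specific
    domain parameters (hypothetical parameterization).
    """
    mask = []
    for ci in channels:
        row = []
        for cj in channels:
            z = alpha * abs(correlation(ci, cj)) + beta
            row.append(1.0 / (1.0 + math.exp(-z)))
        mask.append(row)
    return mask


def apply_mask(attention, mask):
    """Element-wise product of a channel-attention matrix and the mask."""
    return [[a * m for a, m in zip(a_row, m_row)]
            for a_row, m_row in zip(attention, mask)]
```

With `alpha > 0`, strongly correlated channel pairs keep most of their attention weight while weakly related pairs are suppressed, which is one way to read "partial" channel dependence.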

2409.06439 2026-05-29 cs.LG stat.CO stat.ML 版本更新

Extending Explainable Ensemble Trees (E2Tree) to regression contexts

将可解释集成树(E2Tree)扩展到回归场景

Massimo Aria, Agostino Gnasso, Carmela Iorio, Marjolein Fokkema

发表机构 * Department of Economics and Statistics, University of Naples Federico II(那不勒斯费德里科二世大学经济学与统计学系) Institute of Psychology, Leiden University(莱顿大学心理学研究所)

AI总结 本文通过引入新的不相似度度量,将可解释集成树方法从分类扩展到回归,并在真实数据集上验证其解释能力。

详情
Journal ref
Applied Stochastic Models in Business and Industry, Vol. 42, No. 1, e70064 (2026)
AI中文摘要

集成方法如随机森林通过聚合多个弱学习器提供了高精度的预测,改变了监督学习的格局。然而,尽管它们有效,这些方法往往缺乏透明度,阻碍了用户理解随机森林模型如何得出预测。可解释集成树(E2Tree)是一种解释随机森林的新方法,提供了响应变量与预测变量之间关系的图形表示。E2Tree的一个显著特点是它不仅考虑预测变量对响应的影响,还通过计算和使用不相似度度量来考虑预测变量之间的关联。E2Tree方法最初是为分类任务提出的。在本文中,我们将该方法扩展到回归场景。为了展示所提算法的解释能力,我们在真实数据集上进行了演示。

英文摘要

Ensemble methods such as random forests (RF) have transformed the landscape of supervised learning, offering highly accurate prediction through the aggregation of multiple weak learners. However, despite their effectiveness, these methods often lack transparency, impeding users' comprehension of how RF models arrive at their predictions. Explainable ensemble trees (E2Tree) is a novel methodology for explaining random forests that provides a graphical representation of the relationship between response variables and predictors. A striking characteristic of E2Tree is that it not only accounts for the effects of predictor variables on the response but also accounts for associations between the predictor variables through the computation and use of dissimilarity measures. The E2Tree methodology was initially proposed for use in classification tasks. In this paper, we extend the methodology to encompass regression contexts. To demonstrate the explanatory power of the proposed algorithm, we illustrate its use on real-world datasets.

2406.10238 2026-05-29 cs.CL cs.LG cs.SI 版本更新

Early Detection of Misinformation for Infodemic Management: A Domain Adaptation Approach

信息疫情管理中虚假信息的早期检测:一种领域自适应方法

Minjia Mao, Xiaohang Zhao, Xiao Fang

发表机构 * Lerner College of Business and Economics, University of Delaware(特拉华大学勒纳商业与经济学院) School of Information Management & Engineering, Shanghai University of Finance and Economics(上海财经大学信息管理与工程学院)

AI总结 针对信息疫情早期缺乏标注数据的问题,提出一种同时处理协变量偏移和概念偏移的领域自适应虚假信息检测方法,在真实数据集上优于现有方法。

详情
AI中文摘要

信息疫情是指在疾病爆发期间传播的大量真实信息和虚假信息。在信息疫情早期检测虚假信息是减少其对公共健康危害的关键。信息疫情早期的特点是存在大量关于某种疾病的未标注信息。因此,传统的虚假信息检测方法不适合此任务,因为它们依赖信息疫情领域的标注信息来训练模型。为解决这一局限,最先进的方法利用其他领域的标注信息来学习模型,以检测信息疫情领域的虚假信息。这些方法的有效性取决于它们缓解信息疫情领域与利用标注信息的领域之间的协变量偏移(即特征分布差异)和概念偏移(即标注模式差异)的能力。然而,这些方法侧重于缓解协变量偏移而忽略了概念偏移,导致其在该任务上效果不佳。为此,我们从理论上证明了同时处理协变量偏移和概念偏移的必要性,以及如何分别实现它们。基于理论分析,我们开发了一种新颖的虚假信息检测方法,同时解决了协变量偏移和概念偏移。使用真实数据集,我们进行了广泛的实证评估,证明我们的方法在性能上优于最先进的虚假信息检测方法以及可适用于该任务的常见领域自适应方法。

英文摘要

An infodemic refers to an enormous amount of true information and misinformation disseminated during a disease outbreak. Detecting misinformation at the early stage of an infodemic is key to reducing its harm to public health. An early-stage infodemic is characterized by a large volume of unlabeled information concerning a disease. As a result, conventional misinformation detection methods are not suitable for this task because they rely on labeled information in the infodemic domain to train their models. To address this limitation, state-of-the-art methods learn their models using labeled information in other domains to detect misinformation in the infodemic domain. The efficacy of these methods depends on their ability to mitigate both covariate shift (i.e., differences in feature distributions) and concept shift (i.e., differences in labeling patterns) between the infodemic domain and the domains from which they leverage labeled information. However, these methods focus on mitigating covariate shift but overlook concept shift, rendering them less effective for the task. In response, we theoretically show the necessity of tackling both covariate and concept shifts as well as how to operationalize each of them. Built on the theoretical analysis, we develop a novel misinformation detection method that addresses both covariate and concept shifts. Using real-world datasets, we conduct extensive empirical evaluations to demonstrate the superior performance of our method over state-of-the-art misinformation detection methods as well as prevalent domain adaptation methods that can be tailored to solve the misinformation detection task.

2404.16077 2026-05-29 cs.PL cs.LG 版本更新

CompilerDream: Learning a Compiler World Model for General Code Optimization

CompilerDream: 学习编译器世界模型以实现通用代码优化

Chaoyi Deng, Jialong Wu, Ningya Feng, Jianmin Wang, Mingsheng Long

发表机构 * School of Software, BNRist, Tsinghua University, Beijing, China(清华大学软件学院、北京信息科学与技术国家研究中心,中国北京) Tsinghua University(清华大学)

AI总结 提出基于模型的强化学习方法CompilerDream,通过编译器世界模型模拟优化pass属性并训练智能体,实现跨应用场景和语言的通用代码优化,在零样本泛化上超越LLVM内置优化。

Comments KDD 2025 camera-ready version with extended appendix. Code is available at https://github.com/thuml/CompilerDream. This update additionally fixes an issue in Table 6 where the dataset names in three rows were ordered incorrectly

详情
AI中文摘要

编译器中的有效代码优化对计算机和软件工程至关重要。这些优化的成功主要取决于应用于代码的优化pass的选择和排序。虽然大多数编译器依赖固定的优化pass序列,但当前寻找最优序列的方法要么使用不切实际的慢速搜索算法,要么使用难以泛化到训练时未见代码的学习方法。我们提出了CompilerDream,一种基于模型的强化学习方法,用于通用代码优化。CompilerDream包含一个编译器世界模型,该模型准确模拟优化pass的内在属性,以及一个在此模型上训练以产生有效优化策略的智能体。通过在大规模程序数据集上训练,CompilerDream能够作为跨各种应用场景和源代码语言的通用代码优化器。我们的广泛实验首先突出了CompilerDream在自动调优方面的强大优化能力,它引领了CompilerGym排行榜。更重要的是,大规模训练的编译器世界模型和智能体的零样本泛化能力在多个数据集上表现出色,在值预测和端到端代码优化两种设置中均超越了LLVM的内置优化和其他最先进方法。

英文摘要

Effective code optimization in compilers is crucial for computer and software engineering. The success of these optimizations primarily depends on the selection and ordering of the optimization passes applied to the code. While most compilers rely on a fixed sequence of optimization passes, current methods to find the optimal sequence employ either impractically slow search algorithms or learning methods that struggle to generalize to code unseen during training. We introduce CompilerDream, a model-based reinforcement learning approach to general code optimization. CompilerDream comprises a compiler world model that accurately simulates the intrinsic properties of optimization passes and an agent trained on this model to produce effective optimization strategies. By training on a large-scale program dataset, CompilerDream is equipped to serve as a general code optimizer across various application scenarios and source-code languages. Our extensive experiments first highlight CompilerDream's strong optimization capabilities for autotuning, where it leads the CompilerGym leaderboard. More importantly, the zero-shot generalization ability of the large-scale trained compiler world model and agent excels across diverse datasets, surpassing LLVM's built-in optimizations and other state-of-the-art methods in both value prediction and end-to-end code optimization settings.

2403.09441 2026-05-29 cs.LG 版本更新

An Empirical Study of the Influence of Adversarial Fine-Tuning on Compressed Neural Networks

对抗性微调对压缩神经网络影响的实证研究

Hallgrimur Thorsteinsson, Valdemar J Henriksen, Daniel I R Cruz, Raghavendra Selvan, Tong Chen

发表机构 * Department of Computer Science, University of Copenhagen, Denmark(丹麦哥本哈根大学计算机科学系)

AI总结 通过实验研究压缩模型的对抗性微调,发现其能显著提升鲁棒性,并在计算效率与鲁棒性之间取得平衡。

Comments 23 pages, 4 figures, 9 tables. Accepted to The 15th Scandinavian Conference on Artificial Intelligence (SCAI)

详情
AI中文摘要

随着深度学习模型日益融入日常生活,通过使其抵御对抗性攻击来确保安全性变得至关重要。研究发现,通过引入微小、有针对性的扰动来干扰输入数据,深度学习模型容易受到对抗性攻击。对抗性训练作为一种缓解策略,可以产生更鲁棒的模型。然而,这种对抗鲁棒性伴随着训练过程中设计对抗性攻击所需的额外计算成本。因此,对抗鲁棒性和计算效率这两个目标似乎相互冲突。在这项工作中,我们探讨了神经网络压缩对对抗鲁棒性的影响。我们特别研究了微调对压缩模型的影响,并展示了标准微调与对抗性微调之间的权衡。我们的结果表明,对压缩模型进行对抗性微调可以大幅提升其鲁棒性性能。我们在多个基准数据集上进行了实验,表明压缩模型的对抗性微调可以达到与对抗性训练模型相当的鲁棒性性能,同时提高计算效率。源代码可在此处获取:https://github.com/saintslab/Adver-Fine。

英文摘要

As deep learning (DL) models are increasingly being integrated into our everyday lives, ensuring their safety by making them robust against adversarial attacks has become increasingly critical. DL models have been found to be susceptible to adversarial attacks, which introduce small, targeted perturbations into the input data. Adversarial training has been presented as a mitigation strategy that can result in more robust models. This adversarial robustness comes with additional computational costs required to design adversarial attacks during training. The two objectives -- adversarial robustness and computational efficiency -- then appear to be in conflict with each other. In this work, we explore the effects of neural network compression on adversarial robustness. We specifically explore the effects of fine-tuning on compressed models, and present the trade-off between standard fine-tuning and adversarial fine-tuning. Our results show that adversarial fine-tuning of compressed models can yield large improvements to their robustness performance. We present experiments on several benchmark datasets showing that adversarial fine-tuning of compressed models can achieve robustness performance comparable to adversarially trained models, while also improving computational efficiency. Source code is available here: https://github.com/saintslab/Adver-Fine.

2401.08197 2026-05-29 cs.LG cs.IT eess.SP math.IT 版本更新

Matrix Completion with Hypergraphs: Sharp Thresholds and Efficient Algorithms

超图矩阵补全:尖锐阈值与高效算法

Zhongtian Ma, Qiaosheng Zhang, Zhen Wang

发表机构 * Northwestern Polytechnical University(西北工业大学)

AI总结 本文研究基于子采样矩阵条目以及观测到的社交图和超图来补全评分矩阵的问题,证明了存在一个关于采样概率的尖锐阈值,并开发了一种高效算法,该算法在采样概率超过阈值时以高概率成功,且超图能有效降低所需采样概率。

Comments Accepted to LOG24

详情
AI中文摘要

本文考虑基于子采样矩阵条目以及观测到的社交图和超图来补全评分矩阵的问题。我们证明,对于精确补全评分矩阵的任务,存在一个关于采样概率的尖锐阈值——当采样概率高于阈值时任务可实现,否则不可能——展示了相变现象。该阈值可以表示为超图“质量”的函数,从而能够量化利用超图所带来的采样概率减少量。这也凸显了超图在矩阵补全问题中的有用性。在发现尖锐阈值的过程中,我们开发了一种计算高效的矩阵补全算法,该算法有效利用了观测到的图和超图。理论分析表明,只要采样概率超过上述阈值,我们的算法就以高概率成功,这一理论结果通过合成实验得到进一步验证。此外,我们在真实社交网络数据集(包含图和超图)上的实验表明,我们的算法优于其他最先进的矩阵补全算法。

英文摘要

This paper considers the problem of completing a rating matrix based on sub-sampled matrix entries as well as observed social graphs and hypergraphs. We show that there exists a sharp threshold on the sample probability for the task of exactly completing the rating matrix -- the task is achievable when the sample probability is above the threshold, and is impossible otherwise -- demonstrating a phase transition phenomenon. The threshold can be expressed as a function of the "quality" of hypergraphs, enabling us to quantify the amount of reduction in sample probability due to the exploitation of hypergraphs. This also highlights the usefulness of hypergraphs in the matrix completion problem. En route to discovering the sharp threshold, we develop a computationally efficient matrix completion algorithm that effectively exploits the observed graphs and hypergraphs. Theoretical analyses show that our algorithm succeeds with high probability as long as the sample probability exceeds the aforementioned threshold, and this theoretical result is further validated by synthetic experiments. Moreover, our experiments on a real social network dataset (with both graphs and hypergraphs) show that our algorithm outperforms other state-of-the-art matrix completion algorithms.

2605.29582 2026-05-29 cs.LG cs.CL 版本更新

PEARL: Training Socratic Tutors with Pedagogically Aligned Reinforcement Learning

PEARL: 使用教学对齐强化学习训练苏格拉底式导师

Qikai Chang, Zhenrong Zhang, Linbo Chen, Pengfei Hu, Jianshu Zhang, Youhui Guo, Jun Du

发表机构 * University of Science and Technology of China(中国科学技术大学) iFLYTEK Research(科大讯飞研究院)

AI总结 提出PEARL框架,通过可控学生模拟器、生成式奖励模型和稳定多目标强化学习,训练苏格拉底式教学代理,在多个基准上达到开源模型最佳性能并与专有模型竞争。

Comments 16 pages, 7 figures

详情
AI中文摘要

大型语言模型(LLM)在教育辅导方面展现出潜力,但有效的辅导不仅仅是解决问题:它必须提供渐进的苏格拉底式引导,并在多轮交互中平衡多个教学目标。然而,由于学生模拟的保真度有限且可控性弱、教学奖励建模不明确以及多目标优化不稳定,训练这样的导师仍然具有挑战性。为克服这些限制,我们提出了PEARL,一个教学对齐的强化学习框架,用于训练苏格拉底式教学代理,包含三个关键组件。首先,我们引入了一个可控的学生模拟器,将潜在认知状态与响应生成解耦,以模拟多样的能力和误解。其次,我们开发了一个生成式奖励模型,联合评估教学质量和目标正确性以进行策略优化。最后,我们提出了一种稳定的多目标强化学习方案,在每个维度内离散化奖励并跨维度聚合归一化优势,防止高方差目标主导更新。在多个基准上的实验表明,尽管仅使用30B策略模型,PEARL在开源模型中取得了最佳性能,并与领先的专有LLM保持竞争力。

英文摘要

Large Language Models (LLMs) have shown promise as educational tutors, yet effective tutoring requires more than solving problems: it must provide progressive Socratic guidance and balance multiple pedagogical objectives across multi-turn interactions. However, training such tutors remains challenging due to limited-fidelity and weakly controllable student simulation, under-specified pedagogical reward modeling, and unstable multi-objective optimization. To overcome these limitations, we propose PEARL, a pedagogically aligned reinforcement learning framework for training Socratic tutoring agents, consisting of three key components. First, we introduce a controllable student simulator that decouples latent cognitive states from response generation to model diverse abilities and misconceptions. Second, we develop a generative reward model that jointly evaluates pedagogical quality and objective correctness for policy optimization. Finally, we propose a stable multi-objective RL scheme that discretizes rewards within each dimension and aggregates normalized advantages across dimensions, preventing high-variance objectives from dominating updates. Experiments on multiple benchmarks show that PEARL achieves the best performance among open-source models and remains competitive with leading proprietary LLMs, despite using only a 30B policy model.
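The third component above (discretize rewards per dimension, then aggregate normalized advantages across dimensions) can be sketched as below. The abstract does not specify the details, so the number of levels, the group-wise z-score normalization, and the plain mean across dimensions are all assumptions made for illustration, and the function names are hypothetical.

```python
import statistics


def discretize(value, num_levels):
    """Snap a reward in [0, 1] to one of num_levels evenly spaced levels."""
    clipped = min(max(value, 0.0), 1.0)
    return round(clipped * (num_levels - 1)) / (num_levels - 1)


def normalized_advantages(rewards):
    """Z-score rewards within a group of rollouts; zeros if all are equal."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    if std == 0.0:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


def aggregate_advantages(reward_table, num_levels=5):
    """reward_table[i][d] is the raw reward of rollout i on dimension d.

    Each dimension is discretized and normalized on its own, then the
    per-dimension advantages are averaged for every rollout.
    """
    num_dims = len(reward_table[0])
    per_dim = []
    for d in range(num_dims):
        column = [discretize(row[d], num_levels) for row in reward_table]
        per_dim.append(normalized_advantages(column))
    return [sum(per_dim[d][i] for d in range(num_dims)) / num_dims
            for i in range(len(reward_table))]
```

Because every dimension is normalized before aggregation, a reward with a large raw variance contributes on the same scale as the others, which is the property the abstract credits with preventing high-variance objectives from dominating updates.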

2605.29580 2026-05-29 cs.LG stat.ML 版本更新

On the Construction and Implications of Low-Loss Valleys in LoRA-based Bayesian Inference

基于LoRA的贝叶斯推理中低损失谷的构造与启示

Daniel Dold, Emanuel Sommer, Julius Kobialka, Oliver Dürr, David Rügamer

发表机构 * HTWG Konstanz(康斯坦茨应用科学大学) LMU Munich(慕尼黑大学) Munich Center for Machine Learning (MCML)(慕尼黑机器学习中心)

AI总结 本文提出LoRA-Curve方法,通过分段贝塞尔曲线参数化在LoRA空间中连接独立最优解,形成连续低损失谷,并结合平坦极小扰动和JS散度正则化,在不牺牲性能的前提下提高预测分布的互信息,实现功能多样性。

详情
AI中文摘要

虽然低秩适应(LoRA)等参数高效微调方法已成为大型语言模型的标准方法,但对认知不确定性的原则性估计仍然具有挑战性。最近在LoRA机制下的结果表明,深度集成等离散多模态方法相比单模态方法几乎没有优势。这与深度学习中的更广泛观察相矛盾,在深度学习中,集成独立最优解通常能改善泛化,而通过连续低损失谷连接这些模态能进一步增强贝叶斯模型平均(BMA)。LoRA空间中是否存在这种结构,以及它是否能产生局部或离散方法所遗漏的功能多样性,尚未被研究。我们引入了LoRA-Curve,一种在LoRA空间中的分段贝塞尔曲线参数化,包含两种变体:一种自由配置,联合优化所有控制点;另一种锚定配置,连接独立微调的LoRA最优解。我们证明了损失沿曲线的路径连续性和Lipschitz正则性,并通过Qwen2.5 7B在推理和分类基准上的实验表明,线性插值会遇到损失障碍,而我们的锚定多段曲线通过连续低损失谷连接独立最优解。结合平坦极小扰动和詹森-香农散度正则化,LoRA-Curve在不牺牲性能的情况下,可测量地提高了预测分布的互信息,并将连续参数空间遍历与功能多样性联系起来。

英文摘要

While parameter-efficient fine-tuning methods like low-rank adaptation (LoRA) are standard for large language models, principled estimation of epistemic uncertainty remains challenging. Recent results in the LoRA regime suggest that discrete multi-mode approaches such as deep ensembles offer little benefit over single-mode methods. This contradicts broader observations in deep learning, where ensembling independent optima typically improves generalization, and linking these modes through continuous low-loss valleys further enhances Bayesian model averaging (BMA). Whether such structure exists in the LoRA space and whether it yields functional diversity missed by local or discrete methods has not been studied. We introduce LoRA-Curve, a segmented Bézier curve parameterization in the LoRA space, with two variants: a free configuration that jointly optimizes all control points, and an anchored configuration that connects independently fine-tuned LoRA optima. We prove pathwise continuity and Lipschitz regularity of the loss along the curve and empirically show, across reasoning and classification benchmarks with Qwen2.5 7B, that linear interpolation encounters loss barriers, while our anchored multi-segment curves connect independent optima through continuous low-loss valleys. Combined with flat-minima perturbations and a Jensen-Shannon divergence regularizer, LoRA-Curve yields measurably higher mutual information of the predictive distribution without sacrificing performance, and links continuous parameter-space traversal to functional diversity.
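As a rough illustration of a segmented Bézier path through parameter space, the sketch below chains quadratic Bézier segments between fixed anchors, mirroring the anchored configuration in spirit. The degree of the curve, one control point per segment, and the uniform split of `t` across segments are assumptions, since the abstract does not give them. Parameters are represented as flat lists of floats.

```python
def bezier_point(p0, pc, p1, t):
    """Quadratic Bezier: (1-t)^2 * p0 + 2t(1-t) * pc + t^2 * p1."""
    a, b, c = (1 - t) ** 2, 2 * t * (1 - t), t ** 2
    return [a * x0 + b * xc + c * x1 for x0, xc, x1 in zip(p0, pc, p1)]


def segmented_curve(anchors, controls, t):
    """Evaluate a chain of quadratic segments at global position t in [0, 1].

    anchors:  k + 1 endpoints (e.g. independently fine-tuned optima)
    controls: k trainable control points, one per segment
    """
    k = len(controls)
    seg = min(int(t * k), k - 1)  # index of the segment containing t
    local_t = t * k - seg         # position inside that segment, in [0, 1]
    return bezier_point(anchors[seg], controls[seg], anchors[seg + 1], local_t)
```

Only the control points would be trained in such a setup: the curve passes through every anchor by construction, so adjacent segments join continuously, and sampling `t` uniformly yields a family of models along the path.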

2605.29547 2026-05-29 cs.LG cs.AI math.OC 版本更新

Singularity-aware Optimization via Randomized Geometric Probing: Towards Stable Non-smooth Optimization

基于随机几何探测的奇异性感知优化:迈向稳定的非光滑优化

Ruoran Xu, Borong She, Xiaobo Jin, Qiufeng Wang

发表机构 * Xi'an Jiaotong-Liverpool University(西交利物浦大学)

AI总结 针对非光滑优化中Adam优化器的梯度抖动问题,提出奇异性感知Adam(S-Adam),通过局部几何不稳定性(LGI)度量动态调整步长,实现稳定训练并提升泛化性能。

Comments International Conference on Machine Learning (ICML), 2026

详情
AI中文摘要

深度学习优化严重依赖于损失景观平滑的假设,而现代架构由于ReLU激活和量化算子等非光滑组件系统性地违反了这一条件。在这种非光滑情况下,Adam等自适应优化器会出现梯度抖动,即由Clarke次微分内冲突信号引起的剧烈振荡,导致收敛性差和泛化能力欠佳。为解决此问题,我们引入了奇异性感知Adam(S-Adam),一种通过基于局部几何不稳定性动态调整步长来稳定训练的新型优化器。我们的关键贡献是局部几何不稳定性(LGI)度量,一种从随机方向导数方差导出的Clarke次微分直径的计算高效估计量。S-Adam采用自适应阻尼机制$\exp(-\lambda\rho)$,在高不稳定性区域减缓更新,同时在平滑盆地保持快速收敛。我们使用微分包含提供了严格的收敛性分析,证明S-Adam以最优的$O(1/\sqrt{T})$速率几乎必然收敛到$(\delta,\varepsilon)$-Clarke稳定点。在量化感知训练(QAT)和高噪声小批量学习上的实证评估表明,S-Adam持续优于AdamW和Prox-SGD,在CIFAR-100上实现高达6%的准确率提升,在TinyImageNet上实现3%的提升,同时有效缓解梯度振荡。

英文摘要

Deep learning optimization relies heavily on the assumption of smooth loss landscapes, a condition systematically violated by modern architectures due to non-smooth components such as ReLU activations and quantization operators. In such non-smooth regimes, adaptive optimizers such as Adam suffer from gradient chattering, violent oscillations caused by conflicting signals within the Clarke subdifferential, leading to poor convergence and suboptimal generalization. To address this, we introduce Singularity-aware Adam (S-Adam), a novel optimizer that stabilizes training by dynamically modulating step sizes based on local geometric instability. Our key contribution is the Local Geometric Instability (LGI) metric, a computationally efficient estimator of the Clarke subdifferential diameter derived from the variance of randomized directional derivatives. S-Adam incorporates an adaptive damping mechanism $\exp(-\lambda\rho)$ that decelerates updates in high-instability regions while preserving fast convergence in smooth basins. We provide a rigorous convergence analysis using differential inclusions, proving that S-Adam converges almost surely to $(\delta,\varepsilon)$-Clarke stationary points at the optimal $O(1/\sqrt{T})$ rate. Empirical evaluations on Quantization-Aware Training (QAT) and high-noise small-batch learning demonstrate that S-Adam consistently outperforms AdamW and Prox-SGD, achieving accuracy gains of up to 6 percent on CIFAR-100 and 3 percent on TinyImageNet while effectively mitigating gradient oscillations.

2605.29543 2026-05-29 cs.LG cs.AI cs.CL cs.HC cs.IR 版本更新

SCOPE: A Lightweight-training LLM Framework for Air Traffic Control Readback Monitoring

SCOPE:一种用于空中交通管制复诵监控的轻量训练LLM框架

Qihan Deng, Minghua Zhang, Yang Yang, Zhenyu Gao

发表机构 * Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology(香港科技大学机械及航空航天工程学系) School of Electronic and Information Engineering, Beihang University(北京航空航天大学电子信息工程学院) State Key Laboratory of CNS/ATM(国家空管新航行系统技术重点实验室)

AI总结 提出SCOPE框架,通过冻结LLM结合插件式开放集分类器和上下文学习机制,实现高效准确的空管复诵监控,在少样本设置下开放集检测准确率达91.05%,异常纠正率96.63%。

详情
AI中文摘要

飞行员对空中交通管制(ATC)语音指令的复诵是航空运输中防止沟通失误的主要保障。然而,复诵异常仍与约80%的航空事故相关。这一脆弱性因交通量增加和认知负荷升高而进一步加剧,从而推动了机器自动化复诵监控的需求。传统的基于规则和机器学习的方法难以在高度可变且不断演变的空管-飞行员通信术语中泛化。尽管大语言模型(LLM)凭借其强大的推理和泛化能力开辟了新途径,但现有方法在实践中仍面临部署和计算障碍。在这项工作中,我们提出了SCOPE(Semantic reasoning for Communication via Open-set Plug-in with Examples),一种新颖的轻量训练LLM框架,提升了基于机器的ATC复诵监控的效率和准确性。核心思想是在冻结的LLM之上,将插件式开放集分类器与精心设计的上下文学习机制相结合。在半合成通信数据集上的大量实验表明,SCOPE在实现运行环境所需的低延迟响应的同时,达到了优越的准确性。在少样本设置下,SCOPE在开放集检测中达到91.05%的准确率,并纠正了96.63%的异常复诵,从而在提供决策解释的同时优于现有最强基线。这些发现证明了我们的框架作为通向可解释和可控的ATC复诵监控的实用途径的潜力。

英文摘要

Pilot readback of Air Traffic Control (ATC) voice instructions is a primary safeguard against miscommunication in air transportation. However, readback anomalies remain implicated in approximately 80% of aviation incidents. This vulnerability is further exacerbated by rising traffic volume and elevated cognitive workload, thereby motivating automated readback monitoring by machine. Traditional rule-based and machine learning approaches struggle to generalize across the highly variable and evolving phraseology of air traffic controller-pilot communications. While Large Language Models (LLMs) have opened a new avenue through their strong reasoning and generalization capabilities, existing approaches still face deployment and computational barriers in practice. In this work, we propose Semantic reasoning for Communication via Open-set Plug-in with Examples (SCOPE), a novel lightweight-training LLM framework that advances both the efficiency and accuracy of machine-based ATC readback monitoring. The core idea is to couple a plug-in open-set classifier with a carefully designed in-context learning mechanism on top of a frozen LLM. Extensive experiments on the semi-synthetic communication dataset show that SCOPE attains superior accuracy while delivering the low-latency response required for operational environments. Under a few-shot setting, SCOPE achieves 91.05% accuracy in open-set detection and corrects 96.63% of anomalous readbacks, thereby outperforming the strongest available baselines while providing explanations for its decisions. These findings demonstrate the potential of our framework as a practical pathway toward interpretable and controllable ATC readback monitoring.

2605.29537 2026-05-29 cs.CC cs.LG cs.LO 版本更新

The Complexity of Verifying Feedforward Neural Networks in Quantised Settings

量化设置下前馈神经网络验证的复杂性

Eric Alsmann, Martin Lange, Marco Sälzer

发表机构 * University of Kassel(卡塞尔大学) RPTU University Kaiserslautern-Landau(凯泽斯劳滕-兰道大学)

AI总结 研究量化设置下前馈神经网络验证的计算复杂性,区分三类网络并分析线性规划和位向量规范下的复杂性,证明量化网络验证仍为NP完全,并为动态量化网络建立上界。

详情
AI中文摘要

我们研究了量化设置下神经网络验证的计算复杂性。我们区分了三类前馈神经网络(FNNs):具有精确有理权重的有理FNNs、权重来自有限宽度算术的量化FNNs,以及根据给定有限宽度算术评估有理网络的动态量化FNNs。我们考虑了文献中使用的两种规范类型。线性规划(LP)规范是线性约束的合取,而位向量(BV)规范允许在位级别进行推理,并能表达非线性约束。我们的结果给出了这些验证问题的复杂性全景。对于具有固定算术精度的量化FNNs,我们证明在LP和BV规范下的验证仍然是NP完全的,与有理情况下的复杂性相匹配。对于具有BV规范的动态量化FNNs,我们建立了上界,补充了先前已知的PSPACE-hard结果。

英文摘要

We investigate the computational complexity of neural network verification in quantised settings. We distinguish three classes of Feedforward Neural Networks (FNNs): rational FNNs with exact rational weights, quantised FNNs whose weights come from a finite-width arithmetic, and dynamically quantised FNNs in which rational networks are evaluated with respect to a given finite-width arithmetic. We consider two types of specifications used in the literature. Linear programming (LP) specifications are conjunctions of linear constraints, while bit-vector (BV) specifications allow reasoning at the bit level and can express non-linear constraints. Our results give a complexity landscape of these verification problems. For quantised FNNs with fixed arithmetic precision, we show that verification under both LP and BV specifications remains NP-complete, matching the complexity of the rational case. For dynamically quantised FNNs with BV specifications, we establish upper bounds, complementing a previously known PSPACE-hardness result.

2605.29535 2026-05-29 cs.LG 版本更新

AsymVLM: Asymmetric Token Pruning for Efficient Vision-Language Model Inference

AsymVLM:面向高效视觉-语言模型推理的非对称令牌剪枝

Yilin Feng, Ahmed Burak Gulhan, Mahmut Taylan Kandemir

发表机构 * The Pennsylvania State University(宾夕法尼亚州立大学)

AI总结 针对视觉和文本令牌在预填充与解码阶段的不同特性,提出非对称剪枝方法AsymVLM,通过视觉令牌的激进剪枝和文本令牌的基于阈值的驱逐,实现高达54%的FLOPs节省并在文档和图表理解任务上提升2-3%的准确率。

详情
AI中文摘要

视觉-语言模型(VLM)每张图像处理数千个视觉令牌,而文本令牌相对较少,但现有压缩方法对两种模态一视同仁。我们观察到两种模态具有根本不同的特性:视觉令牌在空间上冗余且主导预填充阶段,而文本令牌具有因果依赖性并在解码过程中累积。基于这种非对称性,我们提出并实证评估了AsymVLM,该方法在预填充前使用学习的重要性评分器结合每样本自适应预算对视觉令牌进行激进剪枝,并仅在文本令牌超过固定预算时执行基于时间阈值的驱逐。实验表明,AsymVLM在现有方法中实现了最高的FLOPs节省(高达54%),同时在视觉信息空间局部化且与查询相关的文档和图表理解任务上,比现有方法提升2-3%的准确率,并在整体基准上保持竞争性精度。在文本主导的场景中,我们的驱逐策略通过适应VLM的短上下文特性,显著优于标准的LLM缓存压缩方法。

英文摘要

Vision-Language Models (VLMs) process thousands of visual tokens per image alongside comparatively few text tokens, yet existing compression methods treat both modalities uniformly. We observe that the two modalities have fundamentally different properties: vision tokens are spatially redundant and dominate prefill, while text tokens are causally dependent and accumulate during decoding. Based on this asymmetry, we propose and empirically evaluate AsymVLM, which applies aggressive pruning to vision tokens before prefill using a learned importance scorer with per-sample adaptive budgeting, and temporal threshold-based eviction to text tokens only when they exceed a fixed budget. Our experiments indicate that AsymVLM achieves the highest FLOPs savings (up to 54%) among state-of-the-art methods while outperforming existing approaches by 2--3% on document and chart understanding tasks where visual information is spatially localized and query-specific, and maintaining competitive accuracy on holistic benchmarks. In text-dominated scenarios, our eviction strategy substantially outperforms standard LLM cache compression methods by adapting to the short-context nature of VLM.

2605.29531 2026-05-29 cs.SD cs.CV cs.LG 版本更新

Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion

使用交叉注意力特征融合的半真音频深度伪造检测与定位

S. Sutharya, Remya K. Sasi

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出CAFNet模型,通过三元分类和边界回归联合检测部分伪造音频,在MLADDC数据集上达到92.71%准确率和0.075s定位误差。

Comments 13 pages, 5 figures, 11 tables

详情
AI中文摘要

音频深度伪造检测通常作为二分类问题研究,但部分篡改语音(其中一段短合成片段被拼接进真实语音)构成了更困难且更现实的威胁。检测此类半真音频不仅需要区分真实和完全伪造语音,还需要定位篡改发生的位置。我们提出了CAFNet,一个576k参数的架构,联合处理这两个任务:它在单次前向传播中执行三元分类(真实、完全伪造或半真)并回归合成区域的时间边界。CAFNet通过并行深度可分离卷积分支和交叉注意力融合梅尔频率倒谱系数(MFCC)、线性频率倒谱系数(LFCC)和色度短时傅里叶变换(Chroma-STFT)特征,随后使用双向长短期记忆(BiLSTM)回归头进行边界预测。在组合的多语言音频深度伪造检测语料库(MLADDC)T2+T3测试集上,CAFNet达到92.71%的准确率和0.9910的宏观曲线下面积(AUC),边界定位平均绝对误差(MAE)为0.075秒,中位误差为0.052秒。在二分类检测中,它达到96.76%的准确率和3.20%的等错误率(EER),以超过500倍的参数减少优于微调的XLS-R 300M(78.31%)和AST 87M(93.03%)。跨数据集研究进一步表明,即使在降低骨干学习率的情况下,标准微调也会破坏跨域表示。

英文摘要

Audio deepfake detection is well-studied as a binary problem, but partially manipulated speech, where a short synthesised segment is spliced into an otherwise genuine utterance, poses a harder and more realistic threat. Detecting such half-truth audio requires not only distinguishing it from real and fully fake speech, but also localising where the manipulation occurs. We present CAFNet, a 576k-parameter architecture that addresses both tasks jointly: it performs ternary classification (real, fully-fake, or half-truth) and regresses the temporal boundaries of the synthesised region in a single forward pass. CAFNet fuses Mel-Frequency Cepstral Coefficient (MFCC), Linear-Frequency Cepstral Coefficient (LFCC), and Chroma Short-Time Fourier Transform (Chroma-STFT) features through parallel depthwise-separable convolution branches with cross-attention, followed by a Bidirectional Long Short-Term Memory (BiLSTM) regression head for boundary prediction. On the combined Multi-Lingual Audio Deepfake Detection Corpus (MLADDC) T2+T3 test set, CAFNet achieves 92.71% accuracy and macro Area Under the Curve (AUC) of 0.9910, with boundary localisation Mean Absolute Error (MAE) of 0.075s and a median error of 0.052s. On binary detection, it achieves 96.76% accuracy and 3.20% Equal Error Rate (EER), outperforming fine-tuned XLS-R 300M (78.31%) and AST 87M (93.03%) at over 500 times fewer parameters. A cross-dataset study further shows that standard fine-tuning collapses cross-domain representations even under reduced backbone learning rates.

2605.29525 2026-05-29 cs.LG 版本更新

Learning to Perturb Hidden Representations for Generalizable Deep Learning

学习扰动隐藏表示以实现可泛化深度学习

Hua Li

发表机构 * Henan University(河南大学)

AI总结 提出学习扰动激活(LPA)方法,通过自适应地扰动隐藏层激活并利用PGD学习类别级扰动,提升模型泛化能力,在平衡分类、长尾分类和域泛化任务上优于现有方法。

详情
AI中文摘要

深度神经网络通过级联表示处理数据:输入特征、隐藏激活、logits和损失。虽然输入、logit和标签层面的扰动已被系统研究,但构成网络大部分计算的中间隐藏激活尚未得到统一的扰动分析。本文建立了隐藏激活扰动的统一框架,揭示了Dropout、Manifold Mixup、对抗特征扰动及相关方法都施加了特定形式的激活扰动,但采用类别无关或随机策略。我们推测扩张性扰动(增加激活范数)起到正增强作用,而收缩性扰动(减少激活范数)起到负增强作用,并且扰动层决定了效果类似于输入级增强(浅层)还是logit级操作(深层)。我们提出学习扰动激活(LPA),该方法在选定的隐藏层自适应地扰动激活,并通过PGD学习类别级扰动。我们进一步提供了将激活扰动与平坦最小值和通过层的扰动放大联系起来的理论分析。在平衡分类、长尾分类和域泛化上的实验表明,LPA一致优于现有方法,并为logit扰动方法(如LPL)提供互补优势。

英文摘要

Deep neural networks process data through a cascade of representations: input features, hidden activations, logits, and loss. While perturbations at the input, logit, and label levels have been systematically studied, the intermediate hidden activations, which constitute the bulk of the network's computation, have received no unified perturbation analysis. In this paper, we establish a unified framework for hidden activation perturbation, revealing that Dropout, Manifold Mixup, adversarial feature perturbation, and related methods all impose specific forms of activation perturbation but with class-agnostic or random strategies. We conjecture that expansive perturbation (increasing activation norm) acts as positive augmentation, while contractive perturbation (decreasing activation norm) acts as negative augmentation, and that the perturbation layer determines whether the effect resembles input-level augmentation (shallow layers) or logit-level manipulation (deep layers). We propose Learning to Perturb Activations (LPA), which adaptively perturbs activations at a selected hidden layer with class-level perturbations learned via PGD. We further provide theoretical analysis connecting activation perturbation to flat minima and perturbation amplification through layers. Experiments on balanced classification, long-tail classification, and domain generalization demonstrate that LPA consistently outperforms existing methods and provides complementary benefits to logit perturbation methods such as LPL.

2605.29523 2026-05-29 cs.LG 版本更新

K-FinHallu: A Hallucination Detection Benchmark for Multi-Turn RAG in Korean Finance

K-FinHallu:面向韩语金融多轮RAG的幻觉检测基准

Eunbyeol Cho, Yunseung Lee, Mirae Kim, Jeewon Yang, Youngjun Kwak, Edward Choi

发表机构 * KAIST AI(KAIST人工智能实验室) Financial Tech Lab(金融科技实验室) KakaoBank Corp(Kakao银行公司)

AI总结 提出K-FinHallu基准,通过构建多轮对话和层次化幻觉分类,评估LLM在韩语金融RAG中的幻觉检测能力,发现即使最强模型在细粒度金融诊断和合理弃权上表现不佳。

详情
AI中文摘要

大型语言模型(LLMs)通过检索增强生成(RAG)推动了金融自动化,但幻觉仍然是高风险环境中部署的关键障碍。现有基准侧重于单轮、以英语为中心的任务,未解决韩语金融领域的多轮动态和语言-监管细微差别。我们引入K-FinHallu,这是首个用于多轮韩语金融RAG中幻觉检测的基准。我们从真实的韩语金融文档中构建多轮对话,并在基于上下文可回答性(明确考虑合理弃权)的层次化分类下注入幻觉。将前沿和开源LLMs作为幻觉检测器进行基准测试,我们发现即使最强的模型也难以进行细粒度的金融诊断和拒绝行为。虽然在我们的训练集上微调8B模型可获得与前沿LLMs竞争的性能,但合理弃权仍然是所有评估模型中最薄弱的方面。

英文摘要

Large Language Models (LLMs) have advanced financial automation through Retrieval-Augmented Generation (RAG), yet hallucinations remain a critical barrier to deployment in high-stakes environments. Existing benchmarks focus on single-turn, English-centric tasks, leaving the multi-turn dynamics and linguistic-regulatory nuances of the Korean financial domain unaddressed. We introduce K-FinHallu, the first benchmark for hallucination detection in multi-turn Korean financial RAG. We construct multi-turn dialogues from authentic Korean financial documents and inject hallucinations under a proposed hierarchical taxonomy based on context answerability that explicitly accounts for justified abstention. Benchmarking frontier and open-source LLMs as hallucination detectors, we find that even the strongest models struggle with fine-grained financial diagnostics and refusal behavior. While fine-tuning an 8B model on our training split yields performance competitive with frontier LLMs, justified abstention remains the weakest axis across all evaluated models.

2605.29500 2026-05-29 cs.LG cs.AI 版本更新

Quotient DAGs for Off-Policy Evaluation: Forward-Flow Importance Sampling and Exact Slate Propensities

离线策略评估的商DAG:前向流重要性采样与精确板倾向

Ziwen Xie, Shaowen Xiang, Hongyu He, Dianbo Liu

发表机构 * Shanghai Jiao Tong University(上海交通大学) National University of Singapore(新加坡国立大学)

AI总结 提出商DAG视角,通过前向流比率合并等价历史,实现精确的无序板倾向计算,减少方差并提高计算效率。

Comments 31 pages, 3 figures, 7 tables

详情
AI中文摘要

离线策略评估利用不同行为策略收集的数据来估计目标策略的表现,这在在线测试成本高或风险大时(如推荐或医疗)至关重要。标准重要性采样对每条记录轨迹进行重加权,但即使评估目标忽略生成过程的某些细节,它仍可能将其视为有意义:例如,自回归板推荐器可能生成有序的项目序列,而奖励和下游估计器仅依赖于无序板。这产生了噪声方差和计算差距,因为精确的无序板倾向需要对所有生成顺序求和。我们引入商DAG视角,合并对评估等价的历史,并在合并图上使用目标与行为的前向流比率分配权重。对于在集合充分的下一个项目接口下的板推荐,这产生了Forward-DP,一种子集DAG动态规划,无需阶乘枚举即可计算精确的无序倾向。得到的倾向基元使得能够对上下文相关的自回归板记录器进行实用的基于倾向的评估和模型选择。

英文摘要

Off-policy evaluation estimates how a target policy would perform using data collected by a different behavior policy, which is crucial when online testing is costly or risky, such as in recommendation or healthcare. Standard importance sampling reweights each logged trajectory, but it can treat details of the generation process as meaningful even when the evaluation target ignores them: for example, an autoregressive slate recommender may generate an ordered sequence of items while the reward and downstream estimator depend only on the unordered slate. This creates nuisance variance and a computational gap, since exact unordered slate propensities require summing over all generation orders. We introduce a quotient-DAG view that merges histories equivalent for evaluation and assigns weights using target-to-behavior forward-flow ratios on the merged graph. For slate recommendation under a set-sufficient next-item interface, this yields Forward-DP, a subset-DAG dynamic program that computes exact unordered propensities without factorial enumeration. The resulting propensity primitive enables practical propensity-based evaluation and model selection for context-dependent autoregressive slate loggers.
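The computational gap named in this abstract can be illustrated with a small sketch. It assumes a toy next-item policy whose probabilities depend only on the set of items already chosen (our reading of "set-sufficient"); the unordered propensity of a slate is then the total forward flow into that slate's node in the subset DAG. All function names and the weight-proportional policy are hypothetical illustrations, not the paper's Forward-DP implementation.

```python
from itertools import combinations, permutations


def next_item_prob(item, chosen, weights):
    """Toy set-sufficient policy: pick among remaining items proportionally to weight."""
    remaining = sum(w for i, w in weights.items() if i not in chosen)
    return weights[item] / remaining


def unordered_propensity_dp(slate, weights):
    """Dynamic program over subsets: flow[T] is the probability of reaching set T in any order."""
    slate = frozenset(slate)
    flow = {frozenset(): 1.0}
    # Process subsets by size so every parent's flow is complete before it is pushed forward.
    for size in range(len(slate)):
        for subset in combinations(sorted(slate), size):
            subset = frozenset(subset)
            for item in slate - subset:
                child = subset | {item}
                step = next_item_prob(item, subset, weights)
                flow[child] = flow.get(child, 0.0) + flow[subset] * step
    return flow[slate]


def unordered_propensity_bruteforce(slate, weights):
    """Reference: sum the ordered probability over all k! generation orders."""
    total = 0.0
    for order in permutations(slate):
        prob, chosen = 1.0, frozenset()
        for item in order:
            prob *= next_item_prob(item, chosen, weights)
            chosen = chosen | {item}
        total += prob
    return total


if __name__ == "__main__":
    weights = {"a": 1.0, "b": 2.0, "c": 3.0, "d": 4.0}
    print(unordered_propensity_dp({"a", "c"}, weights))
    print(unordered_propensity_bruteforce({"a", "c"}, weights))
```

For a slate of size k the dynamic program visits 2^k subsets instead of k! orderings. An importance weight for a logged slate would then be the ratio of this quantity under the target policy to the same quantity under the behavior policy.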

2605.29497 2026-05-29 cs.LG 版本更新

Convex Basins in Single-Index Model Loss Landscapes: Applications to Robust Recovery under Strong Adversarial Corruption

单索引模型损失景观中的凸盆地:强对抗性腐败下的鲁棒恢复应用

Santanu Das, Sagnik Chatterjee, Jatin Batra

发表机构 * School of Technology and Computer Science, Tata Institute of Fundamental Research, Mumbai, India(印度孟买塔塔基础研究所技术与计算机科学学院)

AI总结 针对具有重尾噪声和对抗性腐败的单索引模型,提出首个基于凸盆地结构的鲁棒恢复算法,实现近线性样本和时间复杂度。

Comments Accepted at ICML 2026

详情
AI中文摘要

我们研究了在存在重尾噪声和恒定比例的对抗性腐败协变量及响应的情况下,鲁棒学习高斯单索引模型(SIMs)的问题。先前关于鲁棒恢复的工作考虑了线性回归(Pensia等人,JASA 2024)、严格单调链接函数(Awasthi等人,NeurIPS 2022)和相位恢复(Buna和Rebeschini,AISTATS 2025)等设置。然而,这些技术不能推广到通用的非对称非单调链接函数,例如现代门控神经架构中自然出现的标量原语GeLU和Swish。我们通过给出第一个针对通用非单调链接函数的具有近线性样本和时间复杂度的鲁棒恢复算法来填补这一空白,从而为一大类非线性SIMs建立了首个鲁棒恢复保证,而此前对这些SIMs没有任何已知保证。我们的核心贡献是对对抗性污染下高斯平方损失景观的新结构理解。关键的是,我们证明对于一大类非线性非单调SIMs,在真实参数周围存在一个维度无关、恒定半径的凸盆地,并且即使在对抗性污染下,也可以通过鲁棒谱初始化高效地到达该盆地。先前的工作无法同时建立这两个保证,因此要么在对抗性污染下崩溃,要么无法处理通用的非单调链接函数。这些结构洞察共同为鲁棒梯度下降提供了一个原则性的热启动,该算法在$\tilde{O}(nd)$时间和$\tilde{O}(d)$样本下可证明收敛到最终估计误差$O(\sigma\sqrt{\varepsilon})$,其中$\varepsilon$是污染比例。

英文摘要

We study the problem of robustly learning Gaussian Single Index Models (SIMs) in the presence of heavy-tailed noise and a constant fraction of adversarially corrupted covariates and responses. Prior work on robust recovery has considered settings such as linear regression (Pensia et al., JASA 2024), strictly monotonic link functions (Awasthi et al., NeurIPS 2022), and phase retrieval (Buna and Rebeschini, AISTATS 2025). However, these techniques do not extend to generic asymmetric non-monotonic link functions such as GeLU and Swish, which arise naturally as scalar primitives in modern gated neural architectures. We close this gap by giving the first robust recovery algorithm with near-linear sample and time complexity for generic non-monotonic link functions, thereby establishing the first robust recovery guarantees for a broad family of nonlinear SIMs for which no guarantees were previously known. Our central contribution is a new structural understanding of the Gaussian squared-loss landscape under adversarial contamination. Crucially, we prove that for a broad class of nonlinear non-monotonic SIMs, a dimension-independent, constant-radius convex basin exists around the ground truth and is efficiently reachable via robust spectral initialization even under adversarial contamination. Prior works fail to establish both guarantees simultaneously, thereby either breaking down under adversarial contamination or failing to handle generic non-monotonic link functions. Together, these structural insights yield a principled warm start for robust gradient descent that provably converges to a final estimation error of $O(\sigma\sqrt{\varepsilon})$ in $\tilde{O}(nd)$ time with $\tilde{O}(d)$ samples, where $\varepsilon$ is the contamination fraction.

2605.29495 2026-05-29 cs.LG 版本更新

On-Policy Replay for Continual Supervised Fine-Tuning

面向持续监督微调的在策略重放

Yan Chen, Taojie Zhu, Meng Zhang, Xin Chen, Jiaqi Huang, Dongyang Xu, Yizhi Wang

发表机构 * Tsinghua University(清华大学) Alibaba Group(阿里巴巴集团)

AI总结 提出在策略重放(OPR)方法,通过重放模型自身生成的高质量响应来缓解持续监督微调中的灾难性遗忘,在多个大语言模型上显著降低遗忘。

详情
AI中文摘要

持续监督微调(SFT)是将大型语言模型(LLMs)适配到连续下游任务的事实标准,但它会遭受早期能力的灾难性遗忘。最近的研究表明,在策略信号——在模型自身输出上训练——比离策略监督更可靠地减少遗忘。现有的在策略方法通过新的训练目标(例如,带有教师副本的自蒸馏损失)路由该信号,从而继承了额外的前向传播、调度敏感性和来自教师的风格漂移。我们改为通过训练数据源路由在策略信号。我们的方法,在策略重放(OPR),在少量历史提示上展开最新检查点,通过任务奖励过滤生成结果,并将幸存(提示,模型响应)对作为普通SFT示例重放。没有教师,没有辅助损失,也没有即时蒸馏。在三个7-8B指令微调骨干(Qwen2.5-7B-Instruct、Qwen3-8B、Llama3.1-8B-Instruct)上,在TRACE持续学习基准测试中,OPR一致地减少了遗忘;在最尖锐的压力测试(Qwen2.5-7B-Instruct,顺序SFT BWT -13.93)中,OPR在10%重放预算下将BWT提升至-0.65,在1%预算下提升至-2.29——与调优的普通重放基线相比,|BWT|减少了46%,在所有三个骨干上观察到42-46%的减少。我们给出了一个KL收缩解释,将OPR和先前的在策略蒸馏方法置于单一轴上,并提出了一个反直觉的发现,解释了为什么普通重放已经是一个强基线:低分重放一致地比普通重放更差,表明OPR中的有效成分是在策略分布,而不是单独的响应质量。我们的代码可在https://github.com/Yancey2024/OnPolicyReplay获取。

英文摘要

Continual supervised fine-tuning (SFT) is the de facto recipe for adapting large language models (LLMs) to a stream of downstream tasks, but it suffers from catastrophic forgetting of earlier capabilities. Recent work shows that on-policy signals -- training on the model's own outputs -- reduce forgetting more reliably than off-policy supervision. Existing on-policy methods route this signal through a new training objective (e.g., self-distillation losses with a teacher copy), inheriting an extra forward pass, schedule sensitivity, and stylistic drift from the teacher. We instead route the on-policy signal through the training data source. Our method, On-Policy Replay (OPR), rolls out the most recent checkpoint on a small budget of historical prompts, filters the generations by a task reward, and replays the surviving (prompt, model response) pairs as ordinary SFT examples. There is no teacher, no auxiliary loss, and no on-the-fly distillation. Across three 7--8B instruction-tuned backbones (Qwen2.5-7B-Instruct, Qwen3-8B, Llama3.1-8B-Instruct) on the TRACE continual-learning benchmark, OPR consistently reduces forgetting; on the sharpest stress test (Qwen2.5-7B-Instruct, Sequential SFT BWT -13.93), OPR lifts BWT to -0.65 at a 10% replay budget and to -2.29 at a 1% budget -- a 46% reduction in |BWT| over a tuned Vanilla Replay baseline, with 42--46% reductions observed across all three backbones. We give a KL-shrinkage interpretation that places OPR and prior on-policy distillation methods on a single axis, and we present a counterintuitive finding that explains why Vanilla Replay is already a strong baseline: low-score replay is uniformly worse than Vanilla Replay, demonstrating that the active ingredient in OPR is the on-policy distribution, not the response quality alone. Our code is available at https://github.com/Yancey2024/OnPolicyReplay.
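The three steps named in this abstract (roll out, filter by reward, replay as ordinary SFT data) can be sketched roughly as follows. This is an illustration under assumptions, not the released code: `generate` and `reward` are hypothetical callables standing in for the latest checkpoint and the task reward, `budget` is expressed as a prompt count rather than a percentage, and the fixed `threshold` filter is our simplification.

```python
import random


def build_on_policy_replay(history_prompts, generate, reward, budget, threshold, seed=0):
    """Return (prompt, response) pairs produced by the current model that pass the reward filter."""
    rng = random.Random(seed)
    # Step 1: spend the replay budget on a sample of prompts from earlier tasks.
    sampled = rng.sample(history_prompts, min(budget, len(history_prompts)))
    replay = []
    for prompt in sampled:
        # Step 2: roll out the most recent checkpoint (hypothetical `generate`).
        response = generate(prompt)
        # Step 3: keep only generations the task reward accepts (hypothetical `reward`).
        if reward(prompt, response) >= threshold:
            replay.append({"prompt": prompt, "response": response})
    return replay


if __name__ == "__main__":
    prompts = [f"p{i}" for i in range(10)]
    toy_generate = lambda p: p.upper()                         # stand-in for model decoding
    toy_reward = lambda p, r: 1.0 if p[-1] in "02468" else 0.0  # stand-in for a task metric
    pairs = build_on_policy_replay(prompts, toy_generate, toy_reward, budget=10, threshold=0.5)
    print(len(pairs), pairs[:2])
```

The surviving pairs would simply be mixed into the next task's SFT data, which is why a pipeline of this shape needs no teacher model or auxiliary loss.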

2605.29494 2026-05-29 cs.LG 版本更新

Gradient Perturbation: Learning to Perturb Gradients for Adaptive Training

梯度扰动:学习扰动梯度以实现自适应训练

Hua Li

发表机构 * Henan University(河南大学)

AI总结 本文提出学习扰动梯度(LPG)方法,通过自适应地扰动类别级别的梯度实现类别感知训练,并建立统一框架揭示SAM、梯度裁剪等方法的梯度扰动本质,实验表明LPG在平衡/长尾分类和噪声标签学习中优于现有方法。

详情
AI中文摘要

深度神经网络训练涉及前向传播(从特征经logits到损失)和反向传播(从损失经梯度到参数更新)。尽管沿前向链的扰动(包括特征扰动、logit扰动和标签扰动)已被广泛研究,但反向链的梯度扰动却鲜有系统性的研究。在本文中,我们建立了一个统一的梯度扰动框架,揭示现有方法如锐度感知最小化(SAM)、梯度裁剪和梯度噪声注入都可以解释为施加特定形式的梯度扰动。类似于最近提出的Logit扰动学习(LPL),我们推测放大某一类别的梯度范数起到正增强作用(增强学习),而抑制它则起到负增强作用(抑制过拟合)。基于这些观察,我们提出学习扰动梯度(LPG),该方法自适应地在类别级别扰动logit梯度以实现类别感知训练。我们还通过PAC-Bayesian分析建立了梯度扰动边界与泛化保证之间的理论联系。在平衡分类、长尾分类和噪声标签学习上的实验表明,LPG一致优于现有方法,并且可以作为插件模块与它们结合使用。

英文摘要

Deep neural network training involves both forward propagation (from features through logits to loss) and backward propagation (from loss through gradients to parameter updates). While perturbations along the forward chain, including feature perturbation, logit perturbation, and label perturbation, have been extensively studied, the backward chain's gradient perturbation has received little systematic investigation. In this paper, we establish a unified framework for gradient perturbation, revealing that existing methods such as Sharpness-Aware Minimization (SAM), gradient clipping, and gradient noise injection can all be interpreted as imposing specific forms of gradient perturbation. Analogous to the recently proposed Logit Perturbation Learning (LPL), we conjecture that amplifying the gradient norm for a class acts as positive augmentation (enhancing learning), while dampening it acts as negative augmentation (suppressing overfitting). Based on these observations, we propose Learning to Perturb Gradients (LPG), which adaptively perturbs logit-level gradients at the class level to achieve category-aware training. We also establish theoretical connections between gradient perturbation bounds and generalization guarantees via PAC-Bayesian analysis. Experiments on balanced classification, long-tail classification, and noisy label learning demonstrate that LPG consistently outperforms existing methods and can be combined with them as a plug-in module.

2605.29489 2026-05-29 cs.LG cs.SY eess.SY 版本更新

Access Sets Matter: Budgeting Expert Reads for Scalable Weight-Space Model Merging

访问集至关重要:为可扩展的权重空间模型合并预算专家读取

Yuanyi Wang, Yanggan Gu, Su Lu, Yifan Yang, Zhaoyi Yan, Congkai Xie, Jianmin Wu, Hongxia Yang

发表机构 * The Hong Kong Polytechnic University, PolyU(香港理工大学,PolyU) Hong Kong Polytechnic University Daya Bay Technology and Innovation Research Institute(香港理工大学大亚湾技术创新研究院)

AI总结 针对大语言模型合并中专家权重读取的I/O瓶颈,提出MergePipe,一种预算感知的执行层,通过将合并问题转化为专家访问集问题,在显式I/O预算下选择要访问的专家增量块,实现高达11倍加速且参数偏差极小。

Comments ICML 2026 Workshop on Weight-Space Symmetries: from Foundations to Practical Applications

详情
AI中文摘要

权重空间模型合并通常被表述为检查点上的代数运算,然而在LLM规模下,限制性资源往往是必须读取的专家权重集。我们引入MergePipe,一种预算感知的执行层,将LLM合并转化为一个专家访问集问题:给定一个合并算子和一个共享权重坐标系中的检查点族,在显式I/O预算下选择要访问的专家增量块。MergePipe索引参数块,构建确定性访问计划,并通过可重放清单执行诱导的预算合并。该计划在构造上是预算合理的,并在全预算下恢复全读取合并;对于固定系数加法算子,省略更新的误差由省略增量的范数界定。在Qwen和Llama合并工作负载上,MergePipe将专家读取I/O减少多达一个数量级,并实现高达$11\times$的加速。代表性预算扫描显示,与全读取合并的参数偏差为$O(10^{-3})$,并且在下游基准测试上没有单调退化。

英文摘要

Weight-space model merging is usually formulated as an algebraic operation on checkpoints, yet at LLM scale the limiting resource is often the set of expert weights that must be read. We introduce MergePipe, a budget-aware execution layer that casts LLM merging as an expert access-set problem: given a merge operator and a checkpoint family in a shared weight coordinate system, choose which expert delta blocks to access under an explicit I/O budget. MergePipe indexes parameter blocks, builds deterministic access plans, and executes the induced budgeted merge with replayable manifests. The plan is budget-sound by construction and recovers the full-read merge at full budget; for fixed-coefficient additive operators, the omitted-update error is bounded by the norm of omitted deltas. Across Qwen and Llama merging workloads, MergePipe reduces expert-read I/O by up to an order of magnitude and achieves up to $11\times$ speedups. Representative budget sweeps show $O(10^{-3})$ parameter deviation from full-read merges and no monotonic degradation on downstream benchmarks.
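The error claim for fixed-coefficient additive operators follows from the triangle inequality, and a toy sketch makes it concrete. The assumptions here are ours: blocks are ranked by coefficient-weighted delta norm (taken to be available from a prebuilt index), the budget is counted in blocks, and the function names are hypothetical rather than MergePipe's API.

```python
import math


def l2(vec):
    """Euclidean norm of a flat list."""
    return math.sqrt(sum(x * x for x in vec))


def budgeted_merge(base, deltas, coeffs, budget):
    """Additive merge that reads at most `budget` expert delta blocks.

    base:   dict block_name -> list of floats
    deltas: dict (expert, block_name) -> list of floats
    coeffs: dict expert -> fixed merge coefficient
    """
    # Hypothetical access plan: largest weighted norms first (deterministic for distinct norms).
    ranked = sorted(deltas, key=lambda k: abs(coeffs[k[0]]) * l2(deltas[k]), reverse=True)
    plan, omitted = ranked[:budget], ranked[budget:]
    merged = {name: list(values) for name, values in base.items()}
    for expert, block in plan:
        for i, d in enumerate(deltas[(expert, block)]):
            merged[block][i] += coeffs[expert] * d
    # Triangle inequality: ||sum of omitted c_e * delta|| <= sum of |c_e| * ||delta||.
    bound = sum(abs(coeffs[e]) * l2(deltas[(e, b)]) for e, b in omitted)
    return merged, plan, bound


if __name__ == "__main__":
    base = {"w0": [0.0, 0.0], "w1": [0.0, 0.0]}
    deltas = {("A", "w0"): [3.0, 4.0], ("A", "w1"): [0.1, 0.0],
              ("B", "w0"): [0.0, 1.0], ("B", "w1"): [0.0, 0.2]}
    coeffs = {"A": 0.5, "B": 0.5}
    print(budgeted_merge(base, deltas, coeffs, budget=2))
```

At full budget nothing is omitted, the bound is zero, and the result equals the full-read merge, matching the property stated in the abstract.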

2605.29486 2026-05-29 cs.CL cs.AI cs.LG 版本更新

PhoneWorld: Scaling Phone-Use Agent Environments

PhoneWorld: 扩展手机使用代理环境

Zhengyang Tang, Yuxuan Liu, Xin Lai, Junyi Li, Pengyuan Lyu, Jason, Yiduo Guo, Zhengyao Fang, Yang Ding, Yi Zhang, Weinong Wang, Huawen Shen, Xingran Zhou, Liang Wu, Fei Tang, Sunqi Fan, Shangpin Peng, Zheng Ruan, Anran Zhang, Benyou Wang, Rui Yan, Ji-Rong Wen, Chengquan Zhang, Han Hu

发表机构 * Tencent Hunyuan(腾讯混元) The Chinese University of Hong Kong, Shenzhen(香港中文大学(深圳)) Gaoling School of Artificial Intelligence, Renmin University of China(中国人民大学高瓴人工智能学院)

AI总结 提出PhoneWorld,一个可复用的管道,将真实GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练回滚,从而规模化构建手机代理环境。

Comments work in progress

详情
AI中文摘要

手机使用代理的一个核心瓶颈是,覆盖真实移动行为的可控、可复现环境难以大规模构建。现有的移动代理基准在评估方面取得了重要进展,但它们本身并未提供一种可扩展的方式来构建许多新的手机使用环境。我们提出了PhoneWorld,一个可复用的管道,将真实的GUI轨迹和截图转化为可控的手机使用环境、可执行任务、自动验证器和训练回滚。PhoneWorld不是一次手动构建一个移动基准,而是利用真实轨迹来恢复哪些屏幕重要、屏幕如何连接、哪些交互必须改变环境状态、以及哪些用户目标可以自动验证。从这些信号中,它构建了由只读应用内容和可变状态支持的可运行模拟Android应用,然后从相同环境中派生出可执行任务、基于规则的验证器和训练回滚。在当前实例中,PhoneWorld覆盖了16个领域的34个应用,涵盖了常见的消费者移动行为,如搜索、浏览、购物、预订、媒体和社交互动。在固定的训练预算下,将来自辅助AndroidWorld语料库的10K步替换为广泛的PhoneWorld监督,同时提升了所有四个评估基准,使HYMobileBench提高了17.7分,AndroidControl提高了6.0分,AndroidWorld提高了14.7分,PhoneWorld提高了52.5分。然后我们研究了两个额外的扩展问题:增加PhoneWorld监督量显著提高了PhoneWorld性能,并且在固定的PhoneWorld预算下,扩大应用覆盖范围带来了更大的收益。总体而言,PhoneWorld将焦点从一次构建一个移动基准转向了规模化供应手机使用环境本身。

英文摘要

A central bottleneck for phone-use agents is that controllable, reproducible environments covering real mobile behavior are hard to build at scale. Existing mobile-agent benchmarks have made important progress on evaluation, but they do not by themselves provide a scalable way to construct many new phone-use environments. We present PhoneWorld, a reusable pipeline that converts real GUI trajectories and screenshots into controllable phone-use environments, executable tasks, automatic verifiers, and training rollouts. Rather than hand-building one mobile benchmark at a time, PhoneWorld uses real trajectories to recover which screens matter, how screens connect, which interactions must change environment state, and which user goals admit automatic verification. From these signals, it builds runnable mock Android apps backed by read-only app content and mutable state, then derives executable tasks, rule-based verifiers, and training rollouts from the same environments. In its current instantiation, PhoneWorld covers 34 apps across 16 domains, spanning common consumer mobile behaviors such as search, browsing, shopping, booking, media, and social interaction. Under a fixed training budget, replacing 10K steps from an auxiliary AndroidWorld corpus in an AndroidWorld-based baseline with broad PhoneWorld supervision improves all four evaluation benchmarks at once, raising HYMobileBench by 17.7 points, AndroidControl by 6.0 points, AndroidWorld by 14.7 points, and PhoneWorld by 52.5 points. We then study two additional scaling questions: increasing the amount of PhoneWorld supervision strongly improves PhoneWorld performance, and under a fixed PhoneWorld budget, expanding app coverage yields even larger gains. Overall, PhoneWorld shifts the focus from building one mobile benchmark at a time to scaling the supply of phone-use environments themselves.

2605.29467 2026-05-29 cs.LG cs.AI 版本更新

Composing Non-Conjugate Factor Graphs with Closed-Form Variational Inference

非共轭因子图的闭式变分推断组合

Mykola Lukashchuk, Kyrylo Yemets, Wouter M. Kouw, Dmitry Bagaev, İsmail Şenöz, Jeff Beck, Bert de Vries

发表机构 * Eindhoven University of Technology, the Netherlands(埃因霍温理工大学,荷兰) Lviv Polytechnic National University, Lviv, Ukraine(利沃夫国立理工大学,利沃夫,乌克兰) Lazy Dynamics, Utrecht, the Netherlands(Lazy Dynamics,乌得勒支,荷兰)

AI总结 提出五种因子图原语,证明任意组合均支持闭式变分消息传递,并通过堆叠路由层实现通用函数逼近,应用于时间序列预测。

详情
AI中文摘要

将概率构建块堆叠成更深层次的架构通常会破坏闭式推断。我们证明闭式推断是可以保持的。我们识别了五种因子图原语:双线性因子、指数链接、Gamma先验、高斯似然和等式节点,并证明任何由它们组成的模型都允许闭式变分消息传递。这种构造之所以有效,是因为每个原语都保留了一小部分消息族:在平均场分解下,高斯变量上的消息保持高斯分布,精度变量上的消息保持Gamma分布,而唯一的非共轭接口——指数链接——通过高斯矩生成函数和Gamma族的充分统计量保持可处理性。我们展示了从静态集成到输入依赖门控再到分裂分支路由的递增深度组合,并表明堆叠路由层编码任意决策树,建立了具有闭式推断的通用函数逼近。应用于集成时间序列预测时,该框架产生了一个贝叶斯专家混合模型,其中门控函数是推断而非学习得到的,在五个基准数据集上提供了对专家选择的校准不确定性。

英文摘要

Stacking probabilistic building blocks into deeper architectures typically breaks closed-form inference. We show that closed-form inference can be preserved. We identify five factor-graph primitives: a bilinear factor, an exponential link, a Gamma prior, a Gaussian likelihood, and an equality node, and prove that any model composed from them admits closed-form variational message passing. The construction works because each primitive preserves a small set of message families: under mean-field factorization, messages on Gaussian variables remain Gaussian and messages on precision variables remain Gamma, while the only non-conjugate interface, the exponential link, remains tractable through the Gaussian moment-generating function and the sufficient statistics of the Gamma family. We demonstrate composition at increasing depth, from static ensembles through input-dependent gating to split-branch routing, and show that stacking routing layers encodes arbitrary decision trees, establishing universal function approximation with closed-form inference. Applied to ensemble time-series forecasting, the framework yields a Bayesian mixture of experts in which gating functions are inferred rather than learned, providing calibrated uncertainty over expert selection across five benchmark datasets.

2605.29464 2026-05-29 stat.ML cs.LG 版本更新

Deep Optimal Individualized Treatment Rules for Bivariate Survival Outcomes via Adaptive Prediction-Powered Learning

双变量生存结局的深度最优个体化治疗规则:基于自适应预测驱动学习

Kun Ren, Yifan Cui, Wen Su

发表机构 * Department of Biostatistics, City University of Hong Kong(香港城市大学生物统计学系) Center for Data Science, Zhejiang University(浙江大学数据科学中心)

AI总结 针对随机试验中的双变量生存结局,提出一种基于深度神经网络的自适应预测驱动方法,通过随机策略建模治疗规则并耦合边际加速失效时间模型,以最大化联合生存概率。

详情
AI中文摘要

在涉及多种治疗的随机试验中,双变量生存结局给决策带来了显著的分析挑战。本文通过深度神经网络,解决推导最优个体化治疗规则以最大化固定时间点$(t_1, t_2)$之后的联合生存概率的问题,同时考虑右删失。我们提出了一种新颖的方法,通过随机策略对治疗规则进行建模,并通过连接函数耦合边际加速失效时间模型以捕捉双变量依赖性。为了增强决策的鲁棒性和有效性,我们引入了一种自适应预测驱动方法,该方法利用机器学习模型的辅助预测。

英文摘要

In randomized trials involving multiple treatments, bivariate survival outcomes present significant analytical challenges for decision making. This paper addresses the problem of deriving optimal individualized treatment rules that maximize the joint survival probability beyond fixed time points $(t_1, t_2)$ using deep neural networks, while accounting for right censoring. We propose a novel approach that models treatment rules via stochastic policies, coupling marginal accelerated failure time models via a link function to capture bivariate dependence. To enhance the robustness and effectiveness of decision making, we introduce an adaptive prediction-powered method that leverages auxiliary predictions from machine learning models.

2605.29459 2026-05-29 cs.CL cs.LG 版本更新

Kronecker Embeddings: Byte-Level Structured Token Representations for Parameter-Efficient Language Models

Kronecker嵌入:用于参数高效语言模型的字节级结构化词元表示

Rohan Shravan

发表机构 * The School of AI(人工智能学院)

AI总结 提出Kronecker嵌入,通过字节级字符-位置确定性分解替代标准嵌入表,消除91-94%输入侧可训练参数,在多个实验中实现更低验证损失、更强拼写鲁棒性和运行时效率。

Comments 28 pages, 16 tables. Reference implementation: https://github.com/theschoolofai/kronecker-embeddings

详情
AI中文摘要

大型语言模型通过一个形状为|V| x d_model的可学习嵌入表路由每个输入,在前沿规模下消耗数亿到数十亿的可训练参数。我们引入Kronecker嵌入,一种确定性的字节级字符-位置分解,用固定编码器和单个可学习投影替换该表,与标准BPE分词器兼容,在前沿规模下消除91-94%的输入侧可训练参数。我们提供五项贡献。第一,跨六个LM(135M-671B参数)的模型探针显示,训练后的输入嵌入将探针词的印刷变体聚类程度远高于形态学相关词;Kronecker在嵌入层避免了这种聚类。第二,在FineWeb-Edu上对nanoGPT GPT-2 124M进行2.5B词元的三种子受控比较显示,Kronecker达到比BPE绑定基线低2.5±0.2%的验证损失(差距0.083±0.007 nats,约9%更低的困惑度),达到BPE收敛损失所需的步数减少约1.43倍。第三,在110个干净/拼写错误对上的拼写鲁棒性探针显示,Kronecker在55.5%的对上保持top-1预测,而BPE为47.3%(+8.2个百分点),并将KL降低7.6%,在11个类别中赢得或平局10个;生成探针显示Kronecker在生成中回显字节新颖字符串和拼写错误,而BPE则遗忘它们。第四,BPE嵌入范数在训练期间漂移,而Kronecker投影范数保持在1.0附近,与稳定的表示目标一致。第五,一种即时运行时变体从4.5 MB的字节缓冲区重建嵌入,而不是从词汇量为131,072的2.15 GB表中重建,步长时间开销为0.01-0.24%。字节级局部性存在权衡:字节相似但语义距离远的对(compute/commute, nation/notion)聚类在一起,将消歧转移到早期注意力层。

英文摘要

Large language models route every input through a learned embedding table of shape |V| x d_model, consuming hundreds of millions to billions of trainable parameters at frontier scale. We introduce Kronecker Embeddings, a deterministic byte-level character-position factorization that replaces this table with a fixed encoder and a single learned projection, compatible with standard BPE tokenizers, eliminating 91–94% of input-side trainable parameters at frontier scale. We provide five contributions. First, a cross-model probe across six LMs (135M-671B parameters) shows trained input embeddings cluster typographic variants of the probe word far more than morphological relatives; Kronecker escapes this clustering at the embedding layer. Second, a controlled three-seed comparison on nanoGPT GPT-2 124M over 2.5B tokens of FineWeb-Edu shows Kronecker reaching 2.5 ± 0.2% lower validation loss than the BPE-tied baseline (gap 0.083 ± 0.007 nats, ~9% lower perplexity), needing ~1.43x fewer steps to reach BPE's converged loss. Third, a spelling-robustness probe over 110 clean/typo pairs shows Kronecker preserves the top-1 prediction on 55.5% of pairs vs. 47.3% for BPE (+8.2 pp) and lowers KL by 7.6%, winning or tying in 10 of 11 categories; a generation probe shows Kronecker echoes byte-novel strings and typos through generation where BPE forgets them. Fourth, BPE embedding norm drifts during training while Kronecker projection norm stays near 1.0, consistent with a stable representational target. Fifth, an on-the-fly runtime variant reconstructs embeddings from a 4.5 MB byte buffer rather than a 2.15 GB table at vocabulary 131,072, with 0.01–0.24% step-time overhead. Byte-level locality has a tradeoff: byte-similar but semantically distant pairs (compute/commute, nation/notion) cluster together, shifting disambiguation to early attention layers.
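
The memory and perplexity figures can be related with simple arithmetic. In the sketch below, the hidden size of 4096 and 4-byte (fp32) parameters are assumptions chosen because they reproduce the quoted 2.15 GB table at a vocabulary of 131,072; the abstract does not state them. The perplexity helper uses only the definition perplexity = exp(loss in nats).

```python
import math

def embedding_table_bytes(vocab_size: int, d_model: int, bytes_per_param: int) -> int:
    """Size of a dense |V| x d_model embedding table in bytes."""
    return vocab_size * d_model * bytes_per_param

def perplexity_ratio(loss_gap_nats: float) -> float:
    """Perplexity of the lower-loss model divided by that of the baseline."""
    return math.exp(-loss_gap_nats)

# d_model = 4096 and fp32 storage are assumed, not stated in the abstract.
table_gb = embedding_table_bytes(131_072, 4096, 4) / 1e9
ratio = perplexity_ratio(0.083)   # loss gap quoted in the abstract
```

A loss gap in nats maps to a multiplicative perplexity change, so a small absolute gap in validation loss still corresponds to a change of several percent in perplexity.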

2605.29454 2026-05-29 cs.LG 版本更新

A Full-Pipeline Framework for Evaluating Membership Inference Attacks in Machine Learning

用于评估机器学习中成员推断攻击的全流程框架

Ding Chen, Xinwen Cheng, Xuyang Zhong, Xinping Chen, Xiaolin Huang, Chen Liu

发表机构 * City University of Hong Kong(香港城市大学) Shanghai Jiao Tong University(上海交通大学)

AI总结 提出一个涵盖数据、架构、算法和后训练模块的全流程评估框架,系统分析不同上下文对成员推断攻击效果的影响,并通过标准化威胁模型和互补指标提供实用指南。

详情
AI中文摘要

虽然成员推断攻击(MIAs)是识别训练数据的主流方法,但其应用已扩展到隐私审计和机器遗忘。然而,该领域缺乏一个系统性的框架来评估不同上下文如何影响MIA的效果。没有这样的特征描述,实践者可能会部署在基准测试中表现良好但在面对特定真实世界数据集的细微差别时变得统计上无关的算法。为了弥合这一差距并提供可操作的见解,我们引入了一个全面的评估框架,该框架系统地描述了整个机器学习流程(包括数据、架构、算法和后训练模块)中的隐私风险。我们的框架旨在固有地捕捉多样化的操作上下文,严格评估了在广泛训练配置下的最先进MIA。为了考虑真实世界部署中不同的误分类成本,我们采用了三个互补指标:对称成本下的平衡准确率,以及低FPR下的TPR(或低FNR下的TNR)用于严格惩罚误报或漏检的非对称场景。此外,认识到现有MIA假设不同的对手能力,我们形式化了两种标准化的威胁模型,并将这些攻击调整为相应的变体,以确保公平的基准测试。大量的实证评估表明,特定MIA方法的效果高度依赖于假设的威胁模型和选择的评估指标。最终,我们将这些发现提炼为可操作的指南,并提供一个即用的审计工具包,使实践者能够进行更好的隐私评估。

英文摘要

While Membership Inference Attacks (MIAs) are the prevailing method for identifying training data, their application has expanded into privacy auditing and machine unlearning. Nevertheless, the field lacks a systematic framework for evaluating how different contexts affect MIA efficacy. Without such a characterization, practitioners risk deploying algorithms that perform well on benchmarks but become statistically irrelevant when faced with the nuances of specific, real-world datasets. To bridge this gap and provide actionable insights, we introduce a comprehensive evaluation framework that systematically characterizes privacy risks across the entire machine learning pipeline, spanning data, architectures, algorithms, and post-training modules. Designed to inherently capture diverse operational contexts, our framework rigorously evaluates state-of-the-art MIAs across a broad spectrum of training configurations. To account for varying misclassification costs in real-world deployments, we employ three complementary metrics: Balanced Accuracy for symmetric costs, alongside TPR at low FPR (or TNR at low FNR) for asymmetric scenarios where false alarms or missed detections are strictly penalized. Furthermore, recognizing that existing MIAs assume divergent adversary capabilities, we formalize two standardized threat models and adapt these attacks into corresponding variants to ensure an equitable benchmark. Extensive empirical evaluations demonstrate that the efficacy of specific MIA methodologies is highly sensitive to the assumed threat models and chosen evaluation metrics. Ultimately, we distill these findings into actionable guidelines and provide a ready-to-use auditing toolkit, empowering practitioners to conduct better privacy assessments.
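
The two kinds of metric named above can be computed from attack scores as follows. This is a minimal sketch with made-up scores and labels; the function names are illustrative and are not the toolkit's API.

```python
import numpy as np

def balanced_accuracy(scores, is_member, threshold):
    """Mean of TPR and TNR when predicting 'member' for scores >= threshold."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    pred = scores >= threshold
    tpr = pred[is_member].mean()
    tnr = (~pred[~is_member]).mean()
    return float(0.5 * (tpr + tnr))

def tpr_at_fpr(scores, is_member, max_fpr):
    """Highest TPR over thresholds whose FPR does not exceed max_fpr."""
    scores = np.asarray(scores, dtype=float)
    is_member = np.asarray(is_member, dtype=bool)
    best = 0.0
    for t in np.unique(scores):
        pred = scores >= t
        if pred[~is_member].mean() <= max_fpr:
            best = max(best, float(pred[is_member].mean()))
    return best

# Toy attack scores (higher means "more likely a training member") and true labels.
scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.3, 0.2, 0.1]
labels = [1, 1, 0, 1, 1, 0, 0, 0]
ba = balanced_accuracy(scores, labels, threshold=0.5)
strict = tpr_at_fpr(scores, labels, max_fpr=0.0)
```

On this toy data the attack looks respectable under balanced accuracy but identifies only half of the members once no false alarms are allowed, which illustrates why the ranking of attacks can depend on the chosen metric.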

2605.29453 2026-05-29 cs.LG cs.AI 版本更新

Forget Less, Generalize More: Unifying Temporal and Structural Adaptation for Dynamic Graphs

遗忘更少,泛化更强:统一动态图的时间与结构适应

Qian Chang, Ciprian Doru Giurcaneanu, Runsong Jia, Xia Li, Guoping Hu, Xiufeng Cheng, Jinqing Yang, Mengjia Wu, Yi Zhang

发表机构 * University of Auckland, Auckland, New Zealand(奥克兰大学) University of Technology Sydney, Sydney, Australia(悉尼科技大学) Central China Normal University, Wuhan, China(华中师范大学)

AI总结 提出双尺度保持动态(DSRD)框架,通过统一的时间-结构自适应机制和可学习衰减核,在动态图表示学习中实现更强的泛化能力。

详情
AI中文摘要

动态图上的表示学习需要捕获随时间与结构共同演化的复杂依赖关系。现有方法通常采用固定的时间衰减方案或预定义的结构传播深度,限制了其在具有不同交互频率和拓扑特征的图上的泛化能力。我们提出双尺度保持动态(DSRD),一个统一框架,维护一个同时编码时间记忆和结构上下文的保持性表示状态。DSRD引入两个关键组件:(i) 具有双尺度自适应的保持状态,在单一循环公式中联合建模时间动态和结构传播;(ii) 具有可学习时间敏感性参数的自适应衰减核,基于底层交互模式自动平衡短期响应和长期保持。我们提供理论分析,建立了事件级并行聚合与高效循环状态更新之间的等价性,以及所学动态的稳定性和有界性保证。在14个真实世界基准上的广泛实验表明,DSRD在链接预测和节点分类任务上均持续达到最先进性能,并在直推和归纳设置中展现出强泛化能力。

英文摘要

Representation learning on dynamic graphs requires capturing complex dependencies that evolve across both time and structure. Existing approaches typically adopt fixed temporal decay schemes or predetermined structural propagation depths, limiting their ability to generalize across graphs with diverse interaction frequencies and topological characteristics. We propose Dual-Scale Retentive Dynamics (DSRD), a unified framework that maintains a retentive representation state encoding both temporal memory and structural context. DSRD introduces two key components: (i) a retentive state with dual-scale adaptation that jointly models temporal dynamics and structural propagation within a single recurrent formulation, and (ii) adaptive decay kernels with learnable time-sensitivity parameters that automatically balance short-term responsiveness and long-term retention based on the underlying interaction patterns. We provide theoretical analysis establishing the equivalence between event-wise parallel aggregation and efficient recurrent state updates, as well as stability and boundedness guarantees for the learned dynamics. Extensive experiments on 14 real-world benchmarks demonstrate that DSRD consistently achieves state-of-the-art performance on both link prediction and node classification tasks, with strong generalization across transductive and inductive settings.
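
The equivalence between event-wise aggregation and a recurrent update can be illustrated for the simplest case of a scalar state with a fixed exponential decay. This is a sketch under that assumption only: DSRD learns its time-sensitivity parameters and also propagates over graph structure, neither of which is modelled here, and event times are assumed to be sorted.

```python
import math

def parallel_state(times, values, decay, query_time):
    """Event-wise form: every past event is weighted by exp(-decay * age)."""
    return sum(v * math.exp(-decay * (query_time - t))
               for t, v in zip(times, values) if t <= query_time)

def recurrent_state(times, values, decay, query_time):
    """Recurrent form: decay the running state between events, then add the new event."""
    state, last_t = 0.0, None
    for t, v in zip(times, values):   # times must be in ascending order
        if t > query_time:
            break
        if last_t is not None:
            state *= math.exp(-decay * (t - last_t))
        state += v
        last_t = t
    if last_t is not None:
        state *= math.exp(-decay * (query_time - last_t))
    return state

times = [0.0, 1.0, 2.5, 4.0]      # illustrative event timestamps
values = [1.0, -2.0, 0.5, 3.0]    # illustrative event features
```

Both functions return the same number for any query time, but the recurrent form touches each event once and keeps only a single running state, which is what makes streaming updates cheap.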

2605.29448 2026-05-29 cs.LG cs.AI cs.CV cs.IT math.IT 版本更新

How Much Is a Dataset Worth? Scaling Laws, the Vendi Score, and Matrix Spectral Functions

数据集值多少钱?缩放定律、Vendi分数与矩阵谱函数

Jeff A. Bilmes, Gantavya Bhatt, Arnav M. Das

发表机构 * Department of Electrical & Computer Engineering(电气与计算机工程系) Paul G. Allen School of Computer Science & Engineering(保罗·G·艾伦计算机科学与工程学院) University of Washington(华盛顿大学)

AI总结 本文通过子模性理论统一了神经缩放定律与Vendi分数,提出矩阵谱函数作为广义数据评估框架,并开发了基于割线方程的快速优化算法,在ImageNet-1K规模上实现了约35,000倍加速,实验表明设施选址函数在预测子集价值方面表现最佳。

Comments 75 pages

详情
AI中文摘要

神经缩放定律通过数据集大小评估数据,而Vendi分数使用量子熵衡量数据集价值。我们证明常见的神经缩放定律目标和Vendi分数都是子模的。进一步,我们表明Vendi分数是一类更广泛的子模目标(称为矩阵谱函数)的特例,这还包括行列式点过程(DPP)目标以及许多其他目标。我们还引入了弱矩阵单调函数,并展示了它们如何导致弱子模矩阵谱函数,从而产生一系列实用的数据评估目标。我们开发了基于割线方程的更新方法,避免了贪心优化过程中的重复特征分解,将$m$维嵌入的边际增益评估相对于预言机查询减少了$O(m)$因子。这实现了平均约35,000倍的实证加速,使得在ImageNet-1K规模的数据集上直接优化Vendi分数成为可能。由此,我们比较了多个目标在固定大小、类别平衡和固定训练预算条件下预测训练子集对保留测试性能价值的能力,包括Vendi分数、DPP、设施选址以及三种新的矩阵谱变体。在多个数据集上,设施选址表现最佳。直接优化还揭示,虽然Vendi分数在中等分数范围内具有预测性,但将目标推向更高值可能使其成为下游性能的糟糕代理。我们还发现,均匀随机选择的固定大小子集(无论是否类别平衡)在评估分数和保留性能上都表现出显著的集中性。最后,我们表明大小、类别平衡和训练预算单独并不决定数据价值:即使控制这些因素,性能范围也从好到差平滑变化。

英文摘要

Neural scaling laws appraise data through dataset size, while the Vendi Score uses quantum entropy to measure dataset value. We show that both common neural-scaling-law objectives and the Vendi Score are submodular. We further show that the Vendi Score is a special case of a broader class of submodular objectives that we call matrix spectral functions. This class also includes determinantal point process (DPP) objectives, as well as many others. We also introduce weakly matrix monotone functions and show how they lead to weakly submodular matrix spectral functions, yielding a broad family of practical objectives for data appraisal. We develop secular-equation-based updates that avoid repeated eigendecompositions during greedy optimization, reducing marginal-gain evaluation for $m$-dimensional embeddings by an $O(m)$ factor relative to oracle queries. This yields an average empirical speedup of about 35,000x, making direct optimization of the Vendi Score feasible on ImageNet-1K-scale datasets. Thus enabled, we compare how well several objectives predict the value of training subsets for held-out test performance under fixed-size, class-balanced, and fixed training-budget regimes, including the Vendi Score, DPPs, facility location, and three new matrix spectral variants. Across multiple datasets, facility location performs the best. Direct optimization also reveals that, while the Vendi Score is predictive over moderate score ranges, pushing the objective to higher values can make it a poor downstream performance proxy. We also find that fixed-size subsets chosen uniformly at random, both unconstrained and class-balanced, are remarkably concentrated in both appraisal scores and held-out performance. Finally, we show that size, class balance, and training budget do not alone determine data value: even when controlling for these factors, performance ranges smoothly from good to bad.
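
Two of the objectives compared above have short definitions. The sketch below computes the Vendi Score as the exponential of the Shannon entropy of the eigenvalues of K/n, and facility location as the summed best similarity to a chosen subset. It uses a full eigendecomposition on tiny hand-made matrices, so it does not reflect the paper's secular-equation updates.

```python
import numpy as np

def vendi_score(K):
    """exp of the Shannon entropy of the eigenvalues of K / n (K: PSD similarity matrix, unit diagonal)."""
    K = np.asarray(K, dtype=float)
    eigvals = np.linalg.eigvalsh(K / K.shape[0])
    eigvals = eigvals[eigvals > 1e-12]          # drop numerically zero eigenvalues
    return float(np.exp(-np.sum(eigvals * np.log(eigvals))))

def facility_location(S, subset):
    """Sum over all items of their highest similarity to any item in `subset`."""
    S = np.asarray(S, dtype=float)
    if len(subset) == 0:
        return 0.0
    return float(S[:, list(subset)].max(axis=1).sum())

# Illustrative non-negative similarity matrix over three items.
S = np.array([[1.0, 0.8, 0.1],
              [0.8, 1.0, 0.2],
              [0.1, 0.2, 1.0]])
gain_small = facility_location(S, [0, 2]) - facility_location(S, [0])
gain_large = facility_location(S, [0, 1, 2]) - facility_location(S, [0, 1])
```

Four mutually dissimilar items (identity matrix) give a Vendi Score of 4 and four identical items give 1, matching its reading as an effective number of distinct items. The two gains show the diminishing-returns behaviour that defines submodularity: adding item 2 helps less once item 1 is already selected.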

2605.29434 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

AliMark: Enhancing Robustness of Sentence-Level Watermarking Against Text Paraphrasing

AliMark: 增强句子级水印对文本释义的鲁棒性

Yuexin Li, Wenjie Qu, Linyu Wu, Yulin Chen, Yufei He, Tri Cao, Bryan Hooi, Jiaheng Zhang

发表机构 * National University of Singapore(新加坡国立大学)

AI总结 提出AliMark框架,将句子级水印重构为比特序列编码与对齐问题,通过多候选对齐检测策略提升对句子拆分合并等结构扰动的鲁棒性。

Comments Accepted by ICML 2026

详情
AI中文摘要

现有的句子级水印方法通过将水印锚定在句子语义中来增强对释义的鲁棒性。然而,它们基于前缀的设计仍然容易受到结构扰动的影响,例如句子拆分和合并,这些扰动在强释义器(如DIPPER和GPT-3.5)下经常出现。为了缓解这个问题,我们提出了AliMark,一个将句子级水印重构为潜在水印文本与秘密比特序列之间的比特序列编码和对齐问题的框架。值得注意的是,我们的方法采用了两阶段检测策略:我们生成多个重构的文本变体,并自适应地将它们提取的比特序列与秘密比特序列对齐,以最小化对齐成本。这种多候选对齐设计自然地提高了对句子合并和拆分的鲁棒性。大量实验表明,在多种释义攻击下,AliMark显著优于最先进的基线方法。

英文摘要

Existing sentence-level watermarking methods enhance robustness to paraphrasing by anchoring watermarks in sentence semantics. However, their prefix-based designs remain vulnerable to structural perturbations, such as sentence splitting and merging, which commonly arise under strong paraphrasers like DIPPER and GPT-3.5. To mitigate this issue, we propose AliMark, a framework that reformulates sentence-level watermarking as a bit sequence encoding and alignment problem between a potentially watermarked text and a secret bit sequence. Notably, our approach adopts a two-stage detection strategy: we generate multiple restructured text variants and adaptively align their extracted bit sequences with the secret bit sequence to minimize alignment cost. This multi-candidate alignment design naturally improves robustness to sentence merges and splits. Extensive experiments demonstrate that AliMark substantially outperforms state-of-the-art baselines under diverse paraphrasing attacks.
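
Why alignment helps against splits and merges can be seen with a generic edit-distance comparison. This is not AliMark's detection procedure, which the abstract does not specify in detail; it is a sketch that assumes one extracted bit per sentence and uses plain Levenshtein distance as the alignment cost.

```python
def hamming_mismatches(a, b):
    """Position-by-position comparison; extra or missing bits count as mismatches."""
    n = max(len(a), len(b))
    return sum(1 for i in range(n) if i >= len(a) or i >= len(b) or a[i] != b[i])

def alignment_cost(a, b):
    """Levenshtein distance: minimum number of insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        cur = [i]
        for j, y in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,              # delete x
                           cur[j - 1] + 1,           # insert y
                           prev[j - 1] + (x != y)))  # match or substitute
        prev = cur
    return prev[-1]

secret = [1, 0, 1, 1, 0, 0, 1, 0]
# The same sequence with one spurious bit inserted at index 2, as a sentence split might cause.
extracted = [1, 0, 0, 1, 1, 0, 0, 1, 0]
```

A single inserted bit shifts every later position, so the position-wise count reports five mismatches here while the alignment cost is one. Detection based on alignment cost therefore degrades far more gracefully under structural edits.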

2605.29420 2026-05-29 cs.AI cs.LG 版本更新

When Does Persona Prompting Actually Help? A Retrieval and Metric Analysis of Expert Role Injection in LLMs

角色提示何时真正有效?LLM中专家角色注入的检索与度量分析

Shuai Xiao, Su Liu, Weikai Zhou, Jialun Wu, Xinjie He, Zhiyuan Lin, Qiyang Xie

发表机构 * Independent Researchers(独立研究者)

AI总结 通过对比四种提示条件在1140个开放式问题上的表现,发现角色提示系统性地增加专家深度但降低清晰度,其效果高度依赖于问题类型和领域,且混合检索优于纯嵌入检索。

Comments 6 pages, 2 figures. Submitted for peer review

详情
AI中文摘要

角色提示被广泛用于引导大型语言模型,但其实际价值仍不明确。先前的工作通常使用聚合分数评估角色提示,难以确定专家角色提示是否一致地提高响应质量,或者是否沿着不同的质量维度改变响应。我们通过对比四种提示条件在涵盖38个专家角色和六个领域的1140个开放式问题上的表现来研究这个问题:无角色提示、通用领域专家提示、基于嵌入的角色检索,以及结合嵌入搜索和基于LLM的角色选择的混合检索方法。聚合结果显示各条件之间总体差异很小。然而,度量级分析揭示了一个聚合平均值掩盖的一致权衡:角色提示系统性地增加了专家深度,同时降低了清晰度。这些效果高度有条件而非普遍。角色提示在咨询类问题以及医学和心理学等领域表现最佳,在这些领域中,结构化的专家框架和风险沟通具有内在价值。相比之下,基线提示在金融、法律、科学和技术领域的概念性和解释性问题中表现更好,在这些领域中,简洁的平实语言解释更为重要。我们进一步表明,混合检索显著优于纯嵌入角色选择,尽管更好的角色检索并不能消除更广泛的专家深度与清晰度之间的权衡。总体而言,我们的发现表明,角色提示主要重塑响应特征而非广泛提升能力,并且多度量评估对于理解其效果是必要的。

英文摘要

Persona prompting is widely used to steer large language models, yet its practical value remains unclear. Prior work often evaluates persona prompting using aggregate scores, making it difficult to determine whether expert-role prompting consistently improves response quality or instead changes responses along different quality dimensions. We study this question through a controlled comparison of four prompting conditions across 1,140 open-ended questions spanning 38 expert roles and six domains: no role prompt, a generic domain-expert prompt, embedding-based role retrieval, and a hybrid retrieval method combining embedding search with LLM-based role selection. Aggregate results show only small overall differences between conditions. However, metric-level analysis reveals a consistent tradeoff that aggregate averages obscure: role prompting systematically increases expertise depth while reducing clarity. These effects are highly conditional rather than universal. Role prompting performs best on advisory questions and in domains such as medicine and psychology, where structured expert framing and risk communication are intrinsically valuable. In contrast, baseline prompting performs better on conceptual and explanatory questions in finance, legal, science, and technology domains, where concise plain-language explanation is more important. We further show that hybrid retrieval significantly improves over embedding-only role selection, although better role retrieval does not eliminate the broader expertise-depth versus clarity tradeoff. Overall, our findings suggest that persona prompting primarily reshapes response characteristics rather than broadly improving capability, and that multi-metric evaluation is necessary for understanding its effects.

2605.29415 2026-05-29 eess.IV cs.CV cs.LG eess.SP stat.ML 版本更新

Constructing efficient channels for ideal observers using the conjugate gradient method

使用共轭梯度法构建理想观察者的高效通道

Weimin Zhou

发表机构 * University of Arizona, Wyant College of Optical Sciences(亚利桑那大学光学科学学院) University of Arizona, Department of Radiology & Imaging Sciences(亚利桑那大学放射科与成像科学系)

AI总结 针对医学成像系统图像质量的任务评估,提出基于共轭梯度(CG)的方法构建高效通道,以近似贝叶斯理想观察者(IO)和霍特林观察者(HO)的性能。

Comments Submitted to the Journal of Medical Imaging (JMI) Special Issue Honoring Dr. Harrison H. Barrett

详情
AI中文摘要

基于任务的图像质量(IQ)评估对于医学成像系统的设计和优化至关重要。理想观察者,包括贝叶斯理想观察者(IO)和理想线性观察者(即霍特林观察者(HO)),提供了客观的品质因数(FOM),用于量化系统在信号检测任务上的性能。然而,将理想观察者应用于高维图像数据通常在计算上难以处理。通道机制提供了一种有效的降维框架,可以促进理想观察者的计算。本文提出了一种基于共轭梯度(CG)的方法,用于构建近似IO和HO性能的高效通道。

英文摘要

Task-based assessment of image quality (IQ) is critically important for the design and optimization of medical imaging systems. Ideal observers, including the Bayesian Ideal Observer (IO) and the ideal linear observer, i.e., the Hotelling observer (HO), provide objective figures of merit (FOMs) that quantify system performance on signal detection tasks. However, the application of ideal observers to high-dimensional image data is often computationally intractable. Channel mechanisms provide an effective framework for dimensionality reduction that can facilitate the computation of ideal observers. This work presents a conjugate gradient (CG)-based method to construct efficient channels for approximating the IO and HO performance.
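
For reference, the Hotelling observer template solves K w = Δs, where K is the image covariance and Δs the mean signal difference, and its detectability is SNR² = Δsᵀ K⁻¹ Δs. The sketch below runs a textbook conjugate gradient solver on a small synthetic system; it illustrates only the CG iteration, not the channel construction proposed in the paper, and the covariance and signal are random placeholders.

```python
import numpy as np

def conjugate_gradient(K, b, n_iter):
    """Approximately solve K w = b for symmetric positive definite K with n_iter CG steps."""
    w = np.zeros_like(b, dtype=float)
    r = b.astype(float).copy()     # residual b - K w for the initial guess w = 0
    p = r.copy()                   # first search direction
    for _ in range(n_iter):
        rr = r @ r
        if rr < 1e-30:             # already converged
            break
        Kp = K @ p
        alpha = rr / (p @ Kp)
        w = w + alpha * p
        r = r - alpha * Kp
        p = r + ((r @ r) / rr) * p
    return w

rng = np.random.default_rng(0)
A = rng.normal(size=(20, 20))
K = A @ A.T + 20 * np.eye(20)      # synthetic covariance, not real image statistics
delta_s = rng.normal(size=20)      # synthetic mean signal difference
w_cg = conjugate_gradient(K, delta_s, n_iter=20)
snr2 = float(delta_s @ w_cg)       # Hotelling SNR^2 estimate
```

CG needs only matrix-vector products with K, so the covariance never has to be inverted or even stored explicitly, which is what makes it attractive for high-dimensional images.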

2605.29412 2026-05-29 eess.SY cs.LG cs.SY 版本更新

Real-Time Retargeting Using Controllability Boundary for Chandrayaan-3 Lunar Landing

基于可控边界的月船三号月球着陆实时重定向

Suraj Kumar, Debjyoti Chakrabarti, Aditya Rallapalli, Bharat Kumar GVP, Ashok Kumar Kakula

发表机构 * Controls and Digital Area, U R Rao Satellite Center, Indian Space Research Organization(控制与数字部门,U R Rao卫星中心,印度空间研究组织)

AI总结 针对月船三号月球着陆任务,提出一种利用可控边界凸表示实现实时重定向的制导策略,通过数据驱动框架首次在运行任务中验证其有效性。

Comments 8 pages, 6 figures, Accepted for publication in American Control Conference 2026

详情
AI中文摘要

本文介绍了为月船三号月球着陆任务开发的实时重定向制导策略。基线制导生成近似燃料最优的下降轨迹,而高层策略在标称着陆点不可行时能够安全重定向到备选地点。重定向策略利用可控边界的凸表示,实现快速可行性检查和实时目标更新。据作者所知,这代表了数据驱动重定向框架在运行中的月球着陆任务中的首次应用。飞行前仿真和月船三号飞行结果验证了所提方法的有效性。

英文摘要

This paper presents the real-time retargeting guidance policy developed for the Chandrayaan-3 lunar landing mission. The baseline guidance generates approximately fuel-optimal descent trajectories, while a high-level policy enables safe retargeting to alternate sites when the nominal site becomes infeasible. The retargeting strategy leverages a convex representation of the controllability boundary, allowing rapid feasibility checks and real-time target updates. To the best of the authors' knowledge, this represents the first application of a data-driven retargeting framework in an operational lunar landing mission. Pre-flight simulations and Chandrayaan-3 flight results validate the effectiveness of the proposed approach.

2605.29411 2026-05-29 cs.LG cs.AI stat.ME stat.ML 版本更新

The Good, the Bad, and the Ugly of Markov Boundary for Tabular Prediction

马尔可夫边界在表格预测中的好、坏与丑

Shu Wan, Abhinav Gorantla, Huan Liu, K. Selçuk Candan

发表机构 * Arizona State University(亚利桑那州立大学)

AI总结 研究马尔可夫边界在表格预测中的实际效用,发现理论上最优的边界在实践中有条件地提升预测性能,但因果发现方法难以实现其潜力。

Comments 11 pages, 9 figures, 2 tables. Preprint

详情
AI中文摘要

在标准图形假设下,目标变量的马尔可夫边界是使所有其他特征冗余的最小特征集。一旦观察到边界,目标变量与表格的其余部分条件独立。这对于表格预测来说是一个诱人的对象,因为它恰好指出了模型所需的列。然而,现代回归器仍然在完整特征集上训练。我们询问马尔可夫边界是否在SCM3K(一个包含3450个任务的合成SCM基准,特征数量从40到1000,涵盖六个SCM家族)上对预测真正有用,并使用六个回归器进行评估。答案比理论所暗示的要微妙得多。将回归器限制在oracle边界上通常会显著改善预测,并且随着特征空间变得更大更稀疏,改善程度增加。但是,通过因果发现恢复边界并在恢复的掩码上训练的自然流程并不奏效。现有的估计器在达到边界最有帮助的区域之前就耗尽了计算预算,即使它们运行,也很少能击败完整特征集。我们将此归因于三个原因。发现优化的是结构恢复而非预测。假阴性和假阳性具有高度不对称的预测成本。精确边界只是众多击败所有特征的特征集之一。然后,我们阐述了这些事实对于预测对齐的特征选择以及学习使用因果结构的表格模型的意义。

英文摘要

Under standard graphical assumptions, the Markov boundary of a target variable is the smallest set of features that renders every other feature redundant. Once the boundary is observed, the target is conditionally independent of the rest of the table. This is a tempting object for tabular prediction, since it names exactly the columns a model should need. Yet modern regressors are still trained on the full feature set. We ask whether the Markov boundary is genuinely useful for prediction on SCM3K, a 3,450-task synthetic SCM benchmark with feature counts from 40 to 1000 and six SCM families, evaluated with six regressors. The answer is more nuanced than the theory suggests. Restricting a regressor to the oracle boundary often improves prediction substantially, and the improvement grows as the feature space becomes larger and sparser. But the natural pipeline of recovering the boundary with causal discovery and training on the recovered mask does not deliver. Existing estimators exhaust the compute budget before reaching the regime where the boundary helps most, and even where they run they rarely beat the full feature set. We trace this to three causes. Discovery optimizes structural recovery rather than prediction. False negatives and false positives carry sharply asymmetric predictive cost. The exact boundary is only one of many feature sets that beat all features. We then develop what these facts imply for prediction-aligned feature selection and for tabular models that learn to use causal structure.
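
When the causal graph is known, the oracle Markov boundary of a target is its parents, its children, and the other parents of its children. The sketch below reads that set off a small hand-written DAG; the graph and node names are illustrative and are not drawn from SCM3K.

```python
def markov_boundary(parents, target):
    """Parents, children, and the children's other parents of `target` in a DAG.

    `parents` maps each node to the set of its direct causes.
    """
    children = {node for node, ps in parents.items() if target in ps}
    spouses = {p for child in children for p in parents[child]} - {target}
    return set(parents.get(target, set())) | children | spouses

# Illustrative DAG: a and b cause y; y and e cause c; c causes d; "noise" is disconnected.
dag = {
    "a": set(), "b": set(), "e": set(), "noise": set(),
    "y": {"a", "b"},
    "c": {"y", "e"},
    "d": {"c"},
}
boundary = markov_boundary(dag, "y")
```

Here the boundary of y is {a, b, c, e}: the grandchild d and the disconnected column are excluded, while e is included even though it is not adjacent to y, because it shares the child c. Recovering this set from data rather than from a known graph is the step the abstract reports as failing in practice.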

2605.29405 2026-05-29 cs.LG 版本更新

Information-Directed Offline-to-Online Reinforcement Learning

信息导向的离线到在线强化学习

Keru Chen

发表机构 * School of Electrical, Computer and Energy Engineering, Arizona State University(电气、计算机与能源工程学院,亚利桑那州立大学)

AI总结 本文提出信息导向采样(IDS)方法,通过条件互信息量化离线数据后的残余不确定性,在离线到在线强化学习中平衡即时遗憾与信息增益,并证明其贝叶斯遗憾界及在偏置残余不确定性场景下的优势。

详情
AI中文摘要

基于离线数据集的决策通常从固定离线数据中预热策略或评分模型,然后通过有限的在线交互进行优化。离线数据减少了不确定性,但并未消除探索需求;它改变了仍需探索的内容。我们通过学习目标 $χ$ 与在线轨迹在给定离线数据集条件下的条件互信息 $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ 来形式化这种残余不确定性。这一观点自然地引出了信息导向采样(IDS),一个由参数 $η\ge 0$ 参数化的家族,通过权衡即时遗憾与信息增益来选择动作。我们通过比率证书证明了 IDS 的通用离线到在线贝叶斯遗憾界:任何由参考汤普森采样策略在同一随机策略类上满足的信息比率界都会被 IDS 继承。在已知动力学的贝叶斯线性奖励模型中,条件互信息具有对数行列式形式,且普通 IDS($η=0$)满足 $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right)$,其中覆盖系数与普通 IDS 自身诱导的访问分布相关。我们还识别出一种预热情形,其中存在一个被支配但信息丰富的探测动作,普通 IDS 会选择该探测动作而汤普森采样从不选择,从而产生常数因子的贝叶斯遗憾分离。受控的赌博机实验和 D4RL 离线到在线强化学习实验验证了这一机制:当离线数据信息丰富但留下偏置或低概率的残余不确定性,且有针对性的在线动作可以解决这些不确定性时,IDS 最为有益,这种情形在离线强化学习、离线黑箱优化和贝叶斯优化中普遍存在。

英文摘要

Decision-making from offline datasets typically warm-starts a policy or score model from fixed offline data and then refines it with limited online interaction. Offline data reduces uncertainty, but it does not remove the need for exploration; it changes what remains to be explored. We formalise this residual uncertainty by the conditional mutual information $I(χ;τ_{1:T}\mid\mathcal{D}_N)$ between a learning target $χ$ and the online trajectories after conditioning on the offline dataset. This view leads naturally to information-directed sampling (IDS), a family parameterised by $η\ge 0$ that selects actions by trading off instantaneous regret against information gain. We prove a generic offline-to-online Bayesian regret bound for IDS through a ratio certificate: any information-ratio bound satisfied by a reference Thompson-sampling policy over the same randomised policy class is inherited by IDS. In a known-dynamics Bayesian linear-reward model, the conditional mutual information has a log-determinant form, and vanilla IDS ($η=0$) satisfies $\widetilde O\!\left(Hd\min\left\{\sqrt T,\,T\sqrt{C^\dagger_{β,\mathrm{IDS}_0}(N,T)/N}\right\}\right),$ where the coverage coefficient is tied to the visitation distribution induced by vanilla IDS itself. We also identify a warm-start regime with a dominated but informative probe in which vanilla IDS selects the probe while Thompson sampling never does, giving a constant-factor Bayesian regret separation. Controlled bandit experiments and D4RL offline-to-online RL experiments validate this mechanism: IDS is most beneficial when offline data is informative but leaves biased or low-probability residual uncertainty that targeted online actions can resolve, a regime shared by offline RL, offline black-box optimization, and Bayesian optimization.
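
The probe example can be made concrete with the classic information ratio, expected regret squared divided by information gain. The sketch below is a deterministic simplification: standard IDS optimizes over distributions on actions, and the paper's $η$-parameterised family is not reproduced here. The regret and information-gain numbers are invented for illustration.

```python
def ids_action(expected_regret, info_gain):
    """Index of the action minimizing regret^2 / information gain (deterministic IDS)."""
    def ratio(i):
        if info_gain[i] <= 0.0:
            # An uninformative action is only acceptable if it costs nothing.
            return float("inf") if expected_regret[i] > 0.0 else 0.0
        return expected_regret[i] ** 2 / info_gain[i]
    return min(range(len(expected_regret)), key=ratio)

# Arms 0 and 1 are each possibly optimal; arm 2 is a probe that is never optimal
# (slightly higher regret) but reveals which of the other two is best.
expected_regret = [0.5, 0.5, 0.6]
info_gain = [0.05, 0.05, 0.69]
choice = ids_action(expected_regret, info_gain)
```

The probe has the worst expected regret but by far the lowest information ratio, so this rule selects it. Thompson sampling only plays actions that are optimal under some posterior sample, so it would never pick a dominated probe, which is the separation the abstract describes.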

2605.29401 2026-05-29 cs.LG 版本更新

Rethinking Post-Training Recipes for Multimodal Time-Series Forecasting

重新思考多模态时间序列预测的后训练方法

Haoxin Liu, Yichen Zhou, Rajat Sen, B. Aditya Prakash, Abhimanyu Das

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Google Research(谷歌研究)

AI总结 提出PostTime后训练方法,结合监督微调和基于可验证奖励的强化学习,利用大语言模型根据多模态上下文修正数值时间序列基础模型的预测,显著提升多模态时间序列预测性能。

详情
AI中文摘要

时间序列基础模型(TSFMs)在使用数值数据进行零样本单模态预测方面表现出色,但与LLMs不同,它们无法处理通常影响现实世界轨迹的多模态、非数值上下文。在这项工作中,我们弥合了这一差距,并主张一种多模态时间序列预测方法,该方法对LLMs进行后训练,使其作为上下文引导的修正器,作用于强大的数值TSFM先验。我们引入了PostTime,一种结合监督微调(SFT)和基于可验证奖励的强化学习(RLVR)的后训练方案,以及一种生成预测修正的自动推理轨迹的方法。PostTime教会LLM生成上下文条件的预测干预——基于多模态上下文决定修正、保留或忽略TSFM先验。我们在TimesX多模态预测基准上,使用Gemma-3-4B LLM和TimesFM-2.5 TSFM评估了该方法,结果表明它显著优于单独的TSFM、仅LLM的基线以及现有的多模态预测方法。

英文摘要

Time-Series Foundation Models (TSFMs) excel at zero-shot unimodal forecasting using numerical data, but unlike LLMs they cannot consume the multimodal, non-numerical context that often shapes real-world trajectories. In this work, we bridge this gap and argue for a multimodal time-series forecasting approach that post-trains LLMs to act as context-guided revisors over strong numerical TSFM priors. We introduce PostTime, a post-training recipe combining Supervised Fine-Tuning (SFT) and Reinforcement Learning with Verifiable Rewards (RLVR), along with a methodology to generate automated reasoning traces for forecast revisions. PostTime teaches an LLM to generate context-conditioned forecast interventions, i.e., decisions to revise, preserve, or ignore the TSFM prior based on the multimodal context. We evaluate this approach on the TimesX multimodal forecasting benchmark using a Gemma-3-4B LLM and TimesFM-2.5 TSFM, and show that it significantly outperforms standalone TSFMs, LLM-only baselines, and existing multimodal forecasting approaches.

2605.29398 2026-05-29 cs.LG cs.AI 版本更新

GDSD: Reinforcement Learning as Guided Denoiser Self-Distillation for Diffusion Language Models

GDSD:强化学习作为扩散语言模型的引导去噪器自蒸馏

Xiaohang Tang, Keyue Jiang, Che Liu, Qifang Zhao, Xiaoxiao Xu, Sangwoong Yoon, Ilija Bogunovic

发表机构 * UCL Dept. of Statistical Science(伦敦大学学院统计科学系) UCL Centre for AI(伦敦大学学院人工智能中心) Alibaba Group(阿里巴巴集团) Dept. of EEE(电气与电子工程系) Imperial College London(伦敦帝国理工学院) UNIST(蔚山科学技术院) University of Basel(巴塞尔大学)

AI总结 提出引导去噪器自蒸馏(GDSD)方法,通过从逆KL正则化强化学习的闭式最优解中导出的优势引导自教师直接蒸馏扩散语言模型的去噪器,避免了ELBO似然代理带来的训练-推理不匹配偏差,在规划、数学和代码基准上显著优于现有方法。

Comments Preprint

详情
AI中文摘要

强化学习(RL)可用于改进扩散大语言模型(dLLMs)的策略(去噪器),但受到策略似然难以处理的阻碍。一类主流且高效的方法将标准RL中的似然替换为其证据下界(ELBO),该下界从随机掩码序列中估计。尽管与预训练高度一致,但这些方法通过使用ELBO作为似然代理引入了训练-推理不匹配(TIM)偏差,可能降低性能。在这项工作中,我们提出了引导去噪器自蒸馏(GDSD),直接从优势引导的自教师中蒸馏dLLMs的去噪器,该自教师源自逆KL正则化RL的闭式最优解。GDSD通过无归一化目标将dLLM的去噪器logits与教师匹配,将RL简化为无似然自蒸馏,从而绕过了TIM偏差。最近的基于ELBO的方法表现为应用不同蒸馏散度的实例,但存在GDSD避免的可诊断病态。在LLaDA-8B和Dream-7B的规划、数学和代码基准上,GDSD以更稳定的训练奖励动态持续优于先前最先进的基于ELBO的方法,测试准确率提升高达+19.6%。这些结果表明,直接的去噪器自蒸馏,无需依赖ELBO似然代理,可以为dLLMs提供更稳定有效的RL过程。代码可在https://github.com/GaryBall/GDSD获取。

英文摘要

Reinforcement learning (RL) can be used to improve the policy (denoiser) of diffusion large language models (dLLMs), but it is hindered by the intractability of the policy likelihood. A dominant and efficient family of methods replaces the likelihood in standard RL with its evidence lower bound (ELBO), estimated from randomly masked sequences. Despite being well aligned with pre-training, these approaches introduce training--inference mismatch (TIM) bias by using the ELBO as a likelihood surrogate, which can degrade performance. In this work, we propose Guided Denoiser Self-Distillation (GDSD) to directly distill the denoiser of dLLMs from an advantage-guided self-teacher, derived from the closed-form optimum of reverse-KL regularized RL. GDSD matches the dLLM's denoiser logits to the teacher's via a normalization-free objective, which reduces RL to likelihood-free self-distillation and thus bypasses the TIM bias. Recent ELBO-based methods emerge as instances of applying different distillation divergences, but with diagnosable pathologies that GDSD avoids. On planning, math, and coding benchmarks with LLaDA-8B and Dream-7B, GDSD consistently outperforms prior state-of-the-art ELBO-based methods with more stable training reward dynamics, achieving test-accuracy improvements of up to $+19.6\%$. These results suggest that direct denoiser self-distillation, without relying on an ELBO likelihood surrogate, can provide a more stable and effective RL procedure for dLLMs. Code is available at https://github.com/GaryBall/GDSD.
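
The closed-form optimum mentioned above is a standard result: maximizing E_π[A] − β·KL(π ‖ π_ref) over distributions gives π*(y) ∝ π_ref(y)·exp(A(y)/β). The sketch below verifies this on a four-way categorical distribution with made-up logits and advantages; how GDSD applies the idea to per-token denoiser logits is not described in the abstract and is not modelled here.

```python
import numpy as np

def kl_regularized_optimum(ref_logits, advantages, beta):
    """Maximizer of E_pi[A] - beta * KL(pi || pi_ref): pi* proportional to pi_ref * exp(A / beta)."""
    logits = np.asarray(ref_logits, dtype=float) + np.asarray(advantages, dtype=float) / beta
    logits = logits - logits.max()          # stabilise the softmax
    probs = np.exp(logits)
    return probs / probs.sum()

def objective(pi, ref, advantages, beta):
    """Expected advantage minus the beta-weighted KL divergence from the reference policy."""
    return float(pi @ advantages - beta * np.sum(pi * np.log(pi / ref)))

ref_logits = np.array([0.0, 1.0, -0.5, 0.2])     # illustrative reference-policy logits
advantages = np.array([1.0, -1.0, 0.5, 0.0])     # illustrative advantages
ref = kl_regularized_optimum(ref_logits, np.zeros(4), beta=1.0)   # softmax of the reference logits
teacher = kl_regularized_optimum(ref_logits, advantages, beta=0.5)
```

In logit space the teacher is simply the reference logits shifted by A/β, so matching logits to it requires no likelihood evaluation. A smaller β lets the advantages dominate, while a larger β keeps the teacher close to the reference policy.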

2605.29387 2026-05-29 cs.LG cs.AI stat.ML 版本更新

On the Optimizer Dependence of Neural Scaling Laws

神经缩放定律的优化器依赖性

Vansh Ramani, Shourya Vir Jain

发表机构 * Department of Computer Science and Engineering, Indian Institute of Technology Delhi(计算机科学与工程系,印度理工学院德里)

AI总结 通过随机特征回归实验,发现优化器类型系统性地影响神经缩放定律中的缩放指数α,预条件优化器产生更陡峭的缩放,并提供了谱诊断来预测高级优化器何时带来收益。

详情
AI中文摘要

神经缩放定律 $L(N) \propto N^{-α}$ 中的缩放指数 $α$ 通常被视为由架构和数据确定的固定常数。我们提出证据表明 $α$ 系统性地依赖于优化器。在受控的随机特征回归实验(神经缩放的经典理论框架)中,我们测量了五种优化器变体和六种谱条件下的 $α$。预条件优化器一致地产生更陡峭的缩放(更大的 $α$),且 $α$ 的偏移在大部分测试谱范围内增加,在 $s = 1.5$ 附近达到峰值,并在 $s = 2.0$ 时保持较大。在 $s \approx 1.0$(自然语言的特征)时,完全自然梯度达到 $α\approx 0.31$,而梯度下降为 $α\approx 0.12$,拟合指数约为后者的 $2.6$ 倍;在随机特征模型中,该差异随模型规模每次加倍而累积。这种指数偏移是否以及如何迁移到大规模 LLM 训练中(近期证据表明其优势可能随规模减弱)仍是一个重要的开放问题。我们的结果表明,缩放定律预测应考虑优化器选择,并且我们提供了一个谱诊断来预测高级优化器何时会带来收益。

英文摘要

The scaling exponent $α$ in neural scaling laws $L(N) \propto N^{-α}$ is commonly treated as a fixed constant set by architecture and data. We present evidence that $α$ depends systematically on the optimizer. In controlled random-feature regression experiments -- the canonical theoretical framework for neural scaling -- we measure $α$ across five optimizer variants and six spectral conditions. Preconditioned optimizers consistently yield steeper scaling (larger $α$), with the $α$-shift increasing across most of the tested spectral range, peaking near $s = 1.5$, and remaining large at $s = 2.0$. At $s \approx 1.0$ (characteristic of natural language), the full natural gradient achieves $α\approx 0.31$ versus $α\approx 0.12$ for gradient descent -- a $2.6\times$ larger fitted exponent that, within the random-feature model, compounds with each model-size doubling. Whether and how this exponent shift transfers to large-scale LLM training -- where recent evidence suggests the advantage may attenuate with scale -- remains an important open question. Our results imply that scaling-law forecasts should account for optimizer choice, and we provide a spectral diagnostic predicting when advanced optimizers will pay off.
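
A scaling exponent is usually estimated as the slope of a straight-line fit in log-log space, and its effect per model-size doubling is the factor 2^(-α). The sketch below fits synthetic, noiseless curves generated with the two exponents quoted above; the prefactor and model sizes are placeholders, not the paper's data.

```python
import numpy as np

def fit_scaling_exponent(model_sizes, losses):
    """Least-squares slope of log L against log N; returns alpha in L proportional to N^(-alpha)."""
    slope, _intercept = np.polyfit(np.log(model_sizes), np.log(losses), 1)
    return float(-slope)

def loss_factor_per_doubling(alpha):
    """Multiplicative change in loss each time the model size doubles."""
    return 2.0 ** (-alpha)

sizes = np.array([1e3, 2e3, 4e3, 8e3, 1.6e4])                    # placeholder model sizes
alpha_ngd = fit_scaling_exponent(sizes, 3.0 * sizes ** -0.31)    # synthetic natural-gradient curve
alpha_gd = fit_scaling_exponent(sizes, 3.0 * sizes ** -0.12)     # synthetic gradient-descent curve
```

With these exponents each doubling multiplies the loss by about 0.81 for the natural gradient versus about 0.92 for gradient descent, so the gap widens geometrically with every doubling inside the random-feature model.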

2605.29380 2026-05-29 cs.LG cs.AI cs.CV 版本更新

TRACER: Persistent Regularization for Robust Multimodal Finetuning

TRACER: 用于鲁棒多模态微调的持久正则化

Hesam Asadollahzadeh, Feng Liu, Christopher Leckie, Sarah M. Erfani

发表机构 * School of Computing and Information Systems (CIS), Faculty of Engineering and IT (FEIT), University of Melbourne, Australia(澳大利亚墨尔本大学工程与信息技术学院(FEIT)计算与信息系统学院(CIS))

AI总结 提出TRACER方法,通过加权移动平均教师实现持久正则化,解决多模态对比微调中的灾难性遗忘和EMA坍缩问题,提升分布外鲁棒性。

Comments ICML 2026

AI中文摘要

微调预训练多模态模型的主流策略通常会降低分布外(OOD)鲁棒性,这种现象被称为灾难性遗忘。在本文中,我们为多模态对比微调开发了一个理论框架,为每种策略提供了闭式解和几何分解。该框架表明,自蒸馏在保留预训练模型知识方面比其他正则化方法更有效。我们的分析揭示了一个被广泛忽视的局限性:在鲁棒微调中广泛使用的标准指数移动平均(EMA)教师存在坍缩问题。为了解决这个问题,我们证明加权移动平均(WMA)教师在有限时间范围内保持持久的正则化力,并在任务子空间中实现无偏收敛,同时保留正交知识。这些见解促使了**TRACER**(**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization)的提出,它将对比学习与WMA引导的多视角蒸馏相结合。在CLIP微调上的大量实验表明,在三种骨干架构上,OOD准确率和校准性能持续提升,全面的消融实验证实TRACER既有理论依据,又对超参数选择具有鲁棒性。代码可在[https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER)获取。

英文摘要

Mainstream strategies for finetuning pretrained multimodal models often degrade out-of-distribution (OOD) robustness, a phenomenon known as catastrophic forgetting. In this paper, we develop a theoretical framework for multimodal contrastive finetuning, yielding closed-form solutions and a geometric decomposition for each strategy. This framework shows that self-distillation is more effective than other regularization approaches to retain the knowledge of the pretrained model. Our analysis reveals a largely overlooked limitation: standard Exponential Moving Average (EMA) teachers, widely used in robust finetuning, suffer from collapse. To solve this, we prove that a Weighted Moving Average (WMA) teacher maintains a persistent regularizing force over finite horizons and yields bias-free convergence in the task subspace while preserving orthogonal knowledge. These insights motivate **TRACER** (**T**rajectory-**R**obust **A**nchoring for **C**ontrastive **E**ncoder **R**egularization), which combines contrastive learning with WMA-guided multi-perspective distillation. Extensive experiments on CLIP finetuning demonstrate consistent OOD accuracy and calibration gains across three backbone architectures, and comprehensive ablations confirm that TRACER is both principled and robust to hyperparameter choices. Code is available at [https://github.com/HesamAsad/TRACER](https://github.com/HesamAsad/TRACER).

2605.29379 2026-05-29 cs.CL cs.LG 版本更新

BrahmicTokenizer-131K: An Indic-Capable Drop-In Replacement for o200k_base

BrahmicTokenizer-131K:一种可替代o200k_base的印度文字兼容分词器

Rohan Shravan

发表机构 * The School of AI(人工智能学院)

AI总结 提出BrahmicTokenizer-131K,一种131072词汇量的字节级BPE分词器,通过两阶段改造在保持非印度文字性能的同时,显著提升印度文字的压缩效率。

Comments 24 pages, 15 tables, 3 code listings. Tokenizer artifact, verification scripts, and reproduction code at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K and https://github.com/theschoolofai/BrahmicTokenizer-131K

AI中文摘要

我们提出了BrahmicTokenizer-131K,一种131,072词汇量的字节级BPE分词器,它在131K词汇量类别中弥合了印度文字(Brahmic)的压缩差距,同时保留了OpenAI的o200k_base在英语、欧盟语言和代码方面的压缩性能。我们通过两阶段改造构建了它:(1)脚本剪枝裁剪,通过移除九个不相关书写系统将200,019个令牌减少到131,072个;(2)外科手术式改造,通过线性规划分配在九个印度文字Unicode块中填充2,372个语料库中缺失的词汇槽位。预分词器、解码器和继承的合并规则与o200k_base保持不变,使得BrahmicTokenizer-131K在分词器接口上成为即插即用的替代品。 在2700万份公开印度语预训练文本(28.4亿词,46.21 GB)上,BrahmicTokenizer-131K在相同词汇预算下产生的令牌比Mistral-Nemo Tekken / Sarvam-m少26.7%,每种语言的节省幅度从15.79%(泰米尔语)到76.79%(奥里亚语,压缩比4.31倍)。奥里亚语的优势在机制上可解释为Tekken/Sarvam-m包含零个奥里亚语块令牌;我们的改造添加了725个。在非印度语内容上,BrahmicTokenizer-131K与o200k_base的英语词汇生育率相当(1.235 vs 1.232令牌/词),并在HumanEval、MBPP和GSM8K上比Tekken/Sarvam-m好4.0-14.2%。在我们的14个分词器基准测试中,它是唯一一个在131K预算下同时在印度文字、英语、欧盟语言、代码和数学上具有竞争力的分词器。其他词汇类别的专用分词器(Sarvam-30B、Sarvam-1、MUTANT-Indic)以牺牲非印度语性能为代价实现了更好的印度语压缩:Sarvam-1的英语词汇生育率比我们差15.9%,其代码/数学压缩比我们差26-33%。我们在Apache 2.0许可下发布该工件,地址为https://huggingface.co/theschoolofai/BrahmicTokenizer-131K。

英文摘要

We present BrahmicTokenizer-131K, a 131,072-vocabulary byte-level BPE tokenizer that closes the Brahmic compression gap at the 131K-vocabulary class while preserving the English, EU-language, and code compression of OpenAI's o200k_base. We construct it through a two-stage retrofit: (1) a script-prune crop that reduces 200,019 tokens to 131,072 by removing nine out-of-scope writing systems, and (2) a surgical retrofit of 2,372 corpus-dead vocabulary slots determined by linear-programming allocation across nine Brahmic Unicode blocks. The pre-tokenizer, decoder, and inherited merge rules are unchanged from o200k_base, making BrahmicTokenizer-131K a drop-in replacement at the tokenizer interface. On 27 million documents of public Indic pretraining text (2.84 billion words, 46.21 GB), BrahmicTokenizer-131K produces 26.7% fewer tokens than Mistral-Nemo Tekken / Sarvam-m at the same vocabulary budget, with per-language savings of 15.79% (Tamil) to 76.79% (Odia, a 4.31x compression ratio). The Odia advantage is mechanistically explained by Tekken/Sarvam-m containing zero Oriya-block tokens; our surgery added 725. On non-Indic content, BrahmicTokenizer-131K matches o200k_base's English fertility (1.235 vs 1.232 tokens/word) and beats Tekken/Sarvam-m by 4.0-14.2% on HumanEval, MBPP, and GSM8K. Across our 14-tokenizer benchmark, it is the only tokenizer simultaneously competitive on Brahmic, English, EU, code, and math at the 131K budget. Specialist tokenizers at other vocab classes (Sarvam-30B, Sarvam-1, MUTANT-Indic) achieve better Indic compression at the cost of non-Indic performance: Sarvam-1's English fertility is 15.9% worse and its code/math compression 26-33% worse than ours. We release the artifact under Apache 2.0 at https://huggingface.co/theschoolofai/BrahmicTokenizer-131K.
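
The headline figures can be cross-checked with a few lines of arithmetic. The sketch below assumes that "fertility" means tokens per word and that the quoted compression ratio is baseline tokens divided by new tokens; the helper names are illustrative and are not part of the released artifact.

```python
def fertility(num_tokens, num_words):
    """Average number of tokens produced per word."""
    return num_tokens / num_words

def token_savings(baseline_tokens, new_tokens):
    """Fraction of tokens saved relative to a baseline tokenizer."""
    return 1.0 - new_tokens / baseline_tokens

def compression_ratio_from_savings(savings):
    """Baseline tokens divided by new tokens, given the fractional savings."""
    return 1.0 / (1.0 - savings)

# Per-language savings quoted in the abstract.
print("Odia ratio:", round(compression_ratio_from_savings(0.7679), 2))
print("Tamil ratio:", round(compression_ratio_from_savings(0.1579), 2))

# Vocabulary slots removed by the script-prune crop (stage 1).
print("slots removed:", 200_019 - 131_072)

# Relative English fertility gap versus o200k_base (1.235 vs 1.232 tokens/word).
print("fertility gap:", round((1.235 - 1.232) / 1.232 * 100, 2), "%")
```

A 76.79% saving corresponds to 1 / (1 − 0.7679) ≈ 4.31, which is consistent with the ratio quoted for Odia.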

2605.29371 2026-05-29 math.OC cs.LG cs.NA math.NA stat.ML 版本更新

Kernel-based potential mean-field games with unbiased random Fourier $U$-statistics

基于核的势均场博弈与无偏随机傅里叶 $U$-统计量

Yumiharu Nakano

发表机构 * Department of Mathematical and Computing Science, Institute of Science Tokyo(东京科学大学数学与计算科学系)

AI总结 针对运行交互成本和终端目标成本均由再生核最大均值差异(MMD)惩罚表示的势均场博弈子类,提出一种利用核结构的计算框架,通过无偏随机傅里叶U-统计量估计成本,并证明样本级几乎必然收敛定理和显式收敛速率。

AI中文摘要

我们研究势均场博弈的子类,其中运行交互成本和终端目标成本均通过再生核最大均值差异(MMD)惩罚表示,并开发了一个利用这种核结构的计算框架。两种成本均使用无偏随机傅里叶U-统计量表示从有限样本经验分布中估计,该统计量在批量大小上具有线性成本。受控扩散的漂移由神经网络参数化,并通过随机梯度下降训练。对于该子类,我们在惩罚参数、随机特征数量、样本大小和优化容差的耦合速率条件下,证明了样本级几乎必然收敛定理和显式几乎必然收敛速率。该框架包括核MMD惩罚Schrödinger桥问题作为交互成本消失的特例。数值实验在高达一百维的Schrödinger桥问题以及一个具有每辆车物理异质性的电动汽车充电协调问题上展示了该方法,其中聚合需求拥堵成本代表群体层面的价格反馈竞争,终端MMD惩罚塑造截止时刻的荷电状态分布。

英文摘要

We study the subclass of potential mean-field games in which the running interaction cost and the terminal target cost are both expressed through reproducing-kernel maximum mean discrepancy (MMD) penalties, and develop a computational framework that exploits this kernel structure. Both costs are estimated from finite-sample empirical distributions using a random Fourier U-statistic representation that is unbiased and has linear cost in the batch size. The drift of the controlled diffusion is parametrized by a neural network and trained via stochastic gradient descent. For this subclass we prove a sample-level almost-sure convergence theorem and an explicit almost-sure rate of convergence, under coupled rate conditions on the penalty parameter, the random-feature count, the sample size, and the optimization tolerance. The framework includes the kernel-MMD-penalty Schrödinger bridge problem as the special case of a vanishing interaction cost. Numerical experiments illustrate the method on the Schrödinger bridge problem in dimensions up to one hundred, and on an electric vehicle charging coordination problem with per-vehicle physical heterogeneity, where an aggregate-demand congestion cost represents price-feedback competition at the population level and the terminal MMD penalty shapes the state-of-charge distribution at the deadline.
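
One way a random Fourier U-statistic can reach linear cost in the batch size is sketched below, for a Gaussian kernel. The off-diagonal sum over pairs of feature vectors equals the squared norm of their sum minus the sum of their squared norms, so no pairwise loop is needed. The kernel choice, the cos/sin feature map and all names are assumptions made for illustration; the paper's exact construction may differ.

```python
import math
import random

def sample_frequencies(dim, num_features, lengthscale, rng):
    """Frequencies w ~ N(0, I / lengthscale^2), matching a Gaussian kernel."""
    return [[rng.gauss(0.0, 1.0 / lengthscale) for _ in range(dim)]
            for _ in range(num_features)]

def feature_map(x, freqs):
    """cos/sin features whose inner product averages cos(w . (x - y)) over w."""
    scale = 1.0 / math.sqrt(len(freqs))
    proj = [sum(w_k * x_k for w_k, x_k in zip(w, x)) for w in freqs]
    return [scale * math.cos(p) for p in proj] + [scale * math.sin(p) for p in proj]

def _sum_and_sqnorms(points, freqs):
    feats = [feature_map(x, freqs) for x in points]
    total = [sum(col) for col in zip(*feats)]           # sum of feature vectors
    sqnorms = sum(sum(v * v for v in f) for f in feats)  # sum of squared norms
    return total, sqnorms

def mmd2_rff_ustat(xs, ys, freqs):
    """U-statistic estimate of MMD^2 in O((n + m) * num_features * dim) time."""
    n, m = len(xs), len(ys)
    sx, qx = _sum_and_sqnorms(xs, freqs)
    sy, qy = _sum_and_sqnorms(ys, freqs)
    dot = lambda a, b: sum(u * v for u, v in zip(a, b))
    within_x = (dot(sx, sx) - qx) / (n * (n - 1))  # excludes i == j terms
    within_y = (dot(sy, sy) - qy) / (m * (m - 1))
    cross = dot(sx, sy) / (n * m)
    return within_x + within_y - 2.0 * cross

rng = random.Random(0)
freqs = sample_frequencies(dim=2, num_features=64, lengthscale=1.0, rng=rng)
same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
also_same = [[rng.gauss(0, 1), rng.gauss(0, 1)] for _ in range(200)]
shifted = [[rng.gauss(2, 1), rng.gauss(2, 1)] for _ in range(200)]
print("same distribution:", mmd2_rff_ustat(same, also_same, freqs))
print("shifted distribution:", mmd2_rff_ustat(same, shifted, freqs))
```

Because the diagonal terms are removed, the estimate can be slightly negative when the two samples come from the same distribution; that is expected for an unbiased estimator of a quantity that is zero.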

2605.29366 2026-05-29 cs.LG 版本更新

Solving Integer Linear Programming with Parallel Tempering

使用并行回火求解整数线性规划

Kyuil Sim, Sanghyeok Choi, Jinkyoo Park

发表机构 * KAIST(韩国科学技术院) University of Edinburgh(爱丁堡大学)

AI总结 提出一种无求解器、基于采样的整数线性规划优化框架,利用局部平衡提议和并行回火技术直接探索离散可行区域,在多个基准上优于或匹敌经典求解器。

Comments Preprint. Code available at https://github.com/ski-sim/ILP-with-ParallelTempering

AI中文摘要

整数线性规划(ILP)作为建模广泛组合优化问题的通用框架,通常由复杂的精确求解器或启发式方法求解。虽然基于学习的方法最近显示出有效性,但它们存在对分布外实例泛化能力差以及对外部求解器的固有依赖。在这项工作中,我们提出了一种无求解器、基于采样的ILP优化框架,无需训练或外部求解器即可直接探索离散可行区域。利用ILP的线性结构,我们采用局部平衡提议构建转移核,从而避免梯度近似。为了克服ILP能量景观的高度多模态性,我们集成了并行回火。除了标准的温度回火,我们还引入了惩罚回火,它在保持可行解目标景观的同时调节约束障碍。实验上,我们的方法在所有四个基准上持续优于SCIP,在200秒预算内匹配或超过Gurobi在四个任务中的两个,并且比基于学习的方法对分布偏移具有更强的鲁棒性。此外,在MIPLIB 2017实例上,我们的框架无需任何问题特定调优即可与经典求解器保持竞争力。

英文摘要

Integer Linear Programming (ILP) serves as a versatile framework for modeling a wide range of combinatorial optimization problems, typically addressed by sophisticated exact solvers or heuristics. While learning-based approaches have recently shown their effectiveness, they suffer from poor generalization to out-of-distribution instances and inherent dependence on external solvers. In this work, we propose a solver-free, sampling-based optimization framework for ILP that directly explores discrete feasible regions without training or external solvers. Exploiting the linear structure of ILP, we employ a Locally-Balanced Proposal to construct a transition kernel, thereby avoiding the gradient approximation. To overcome the highly multimodal nature of ILP energy landscapes, we integrate Parallel Tempering. In addition to standard temperature tempering, we introduce penalty tempering, which modulates constraint barriers while preserving the objective landscape over feasible solutions. Empirically, our method consistently outperforms SCIP across all four benchmarks, matches or exceeds Gurobi on two of four tasks within a 200-second budget, and is substantially more robust to distribution shift than learning-based methods. Furthermore, on MIPLIB 2017 instances, our framework remains competitive with classical solvers without any problem-specific tuning.
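
The interplay of temperature and penalty tempering can be sketched on a toy ILP. Each replica has its own inverse temperature and its own penalty weight on constraint violations, so feasible points keep the same objective value in every replica. The swap rule is the general Metropolis exchange between two different target distributions. Note that this sketch uses a plain symmetric ±1 coordinate move rather than the Locally-Balanced Proposal named in the abstract, and that the temperature ladder, penalty ladder and function names are assumptions.

```python
import math
import random

def violation(x, A, b):
    """Total amount by which x breaks the constraints A x <= b."""
    return sum(max(0.0, sum(a * v for a, v in zip(row, x)) - bi)
               for row, bi in zip(A, b))

def energy(x, c, A, b, penalty):
    """Objective c.x (to minimize) plus a replica-specific violation penalty."""
    return sum(ci * xi for ci, xi in zip(c, x)) + penalty * violation(x, A, b)

def swap_log_accept(x_i, x_j, beta_i, beta_j, pen_i, pen_j, c, A, b):
    """Log Metropolis probability of exchanging the states of two replicas."""
    before = beta_i * energy(x_i, c, A, b, pen_i) + beta_j * energy(x_j, c, A, b, pen_j)
    after = beta_i * energy(x_j, c, A, b, pen_i) + beta_j * energy(x_i, c, A, b, pen_j)
    return min(0.0, before - after)

def parallel_tempering(c, A, b, lo, hi, betas, penalties, steps, seed=0):
    rng = random.Random(seed)
    n = len(c)
    states = [[lo] * n for _ in betas]
    best, best_val = None, math.inf
    for _ in range(steps):
        for r, (beta, pen) in enumerate(zip(betas, penalties)):
            x = states[r]
            k = rng.randrange(n)
            value = x[k] + rng.choice((-1, 1))
            if lo <= value <= hi:  # out-of-box proposals are rejected
                cand = x[:k] + [value] + x[k + 1:]
                delta = energy(cand, c, A, b, pen) - energy(x, c, A, b, pen)
                if delta <= 0 or rng.random() < math.exp(-beta * delta):
                    states[r] = cand
            # Record the best feasible point seen by any replica.
            obj = sum(ci * xi for ci, xi in zip(c, states[r]))
            if violation(states[r], A, b) == 0 and obj < best_val:
                best, best_val = list(states[r]), obj
        r = rng.randrange(len(betas) - 1)  # propose a neighbouring-replica swap
        log_a = swap_log_accept(states[r], states[r + 1], betas[r], betas[r + 1],
                                penalties[r], penalties[r + 1], c, A, b)
        if rng.random() < math.exp(log_a):
            states[r], states[r + 1] = states[r + 1], states[r]
    return best, best_val

# Toy problem: maximize 3x + 2y subject to x + y <= 4, x <= 3, 0 <= x, y <= 5.
c, A, b = [-3, -2], [[1, 1], [1, 0]], [4, 3]
print(parallel_tempering(c, A, b, lo=0, hi=5, betas=[2.0, 0.5, 0.1],
                         penalties=[20.0, 5.0, 1.0], steps=2000))
```

The true optimum of this toy instance is x = 3, y = 1 with objective −11 in the minimization form; whether a given run reaches it depends on the seed and step budget.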

2605.29357 2026-05-29 cs.AI cs.LG cs.PL 版本更新

PassNet: Scaling Large Language Models for Graph Compiler Pass Generation

PassNet:面向图编译器 Pass 生成的大语言模型规模化

Yiqun Liu, Yingsheng Wu, Ruqi Yang, Enrong Zheng, Honglei Qiu, Sijun He, Tai Liang, Jingjing Wu, Yuhan Zhou, Yiwei Zhang, Dongyan Chen, Weihan Yi, Xinqi Li, Siqi Bao

发表机构 * Baidu, Inc.(百度公司)

AI总结 针对编译器默认优化在长尾子图上性能不佳的问题,提出PassNet生态系统,包含大规模数据集和基准测试,通过微调小模型在少量轨迹上即可接近前沿模型性能。

Comments Code and data available at https://github.com/PaddlePaddle/PassNet

AI中文摘要

现代张量编译器(如 TorchInductor)在主流模型上实现了显著加速,但在长尾负载上存在系统性的性能上限:我们的性能分析显示,43% 的真实世界子图在默认编译下出现端到端减速。虽然 LLM 为自动化优化提供了途径,但现有工作集中于独立的内核生成。我们认为,Pass 生成(即由 LLM 编写可直接集成到编译器流水线中的结构化图变换)是更合适的抽象。我们提出 PassNet,首个面向基于 LLM 的编译器 Pass 生成的大规模生态系统,包括:(1) PassNet-Dataset,包含来自 10 万个真实世界模型的超过 1.8 万个独特计算图;(2) PassBench,200 个精心挑选的长尾可融合任务(共包含 2060 个子图),在错误感知加速分数(ES_t,一种统一了正确性、稳定性和性能的指标)下进行评估,并配有针对 LLM 系统性钻空子行为的分层完整性防御。实验表明,PassBench 既具有高度区分性,又远未饱和:最佳前沿模型在总体上落后 TorchInductor 37%,但在单个子图上,LLM 相比同一编译器可实现高达 3 倍的加速,这表明瓶颈在于一致性而非能力。仅在约 4000 条 PassNet 轨迹上微调一个小模型,即可获得 2.67 倍的改进,接近前沿模型性能,这证明了巨大的提升空间,并验证了 PassNet 可作为推进 LLM 驱动编译器优化的在线训练基础设施。所有数据、基准测试和工具均已公开。

英文摘要

Modern tensor compilers such as TorchInductor deliver substantial speedups on mainstream models, yet face a systematic performance ceiling on long-tail workloads -- our profiling shows that 43% of real-world subgraphs experience end-to-end slowdowns under default compilation. While LLMs offer a path toward automated optimization, existing efforts focus on standalone kernel generation. We argue that pass generation -- where LLMs author structured graph transformations that integrate directly into compiler pipelines -- is the more appropriate abstraction. We propose PassNet, the first large-scale ecosystem for LLM-based compiler pass generation, comprising: (1) PassNet-Dataset, over 18K unique computational graphs from 100K real-world models; and (2) PassBench, 200 curated long-tail fusible tasks (comprising 2,060 subgraphs in total) evaluated under the Error-aware Speedup Score (ES_t) -- a metric unifying correctness, stability, and performance -- with layered integrity defenses against systematic LLM exploitation. Experiments reveal that PassBench is both highly discriminative and genuinely unsaturated: the best frontier model trails TorchInductor by 37% in aggregate, yet on individual subgraphs LLMs achieve up to 3x speedup over the same compiler -- indicating that the bottleneck is consistency, not capability. Fine-tuning a small model on merely ~4K PassNet trajectories yields a 2.67x improvement approaching frontier-model performance, demonstrating substantial headroom and validating PassNet as live training infrastructure for advancing LLM-driven compiler optimization. All data, benchmarks, and tooling are publicly available.

2605.29354 2026-05-29 cs.CR cs.LG 版本更新

Harmless Yet Harmful: Neutral Prompting Attacks for Stealthy Hallucination Steering in Agent Skills

无害却有害:针对Agent技能中隐蔽幻觉引导的中性提示攻击

Chia-Yi Hsu, Chia-Mu Yu, Chun-Ying Huang, Jun Sakuma

发表机构 * Department of Computer Science(计算机科学系) National Yang Ming Chiao Tung University(阳明交通大学) Department of Electronics and Electrical Engineering(电子与电气工程系) School of Computing(计算学院) Institute of Science Tokyo(东京科学大学)

AI总结 本文提出中性提示攻击(NPA),通过语义上看似无害的指令(如鼓励想象和详尽性)增加代码生成Agent的包幻觉倾向,从而引入软件供应链风险,并评估了其对多种编码LLM的有效性和逃避防御的能力。

Comments under review

AI中文摘要

基于LLM的编码Agent通过生成代码、选择依赖项和产生包安装命令,越来越多地参与软件开发工作流程。这创造了一种新的软件供应链风险:当Agent幻觉出一个不存在的包时,攻击者可能注册该幻觉名称,并随后危害安装它的用户。现有的包幻觉攻击和防御主要关注自然发生的幻觉、有针对性的依赖引导或事后包验证。在本文中,我们介绍了\emph{中性提示攻击}(NPA),一种高度隐蔽的攻击范式,其中语义上良性的指令(如鼓励想象和详尽性)增加了包幻觉倾向,而不包含明确的恶意意图。与有针对性的依赖引导不同,NPA不指定攻击者选择的包。相反,它将模型的依赖生成行为转向更具推测性的包名称。我们在多个面向编码的LLM和包幻觉基准上评估了NPA。我们的结果表明,NPA增加了\emph{幻觉ASR}和\emph{Pip Install ASR},改变了幻觉包名称的分布,并逃避了现有的静态分析、基于LLM和基于Agent的技能防御。这些发现表明,看似无害的提示可以隐蔽地操纵幻觉行为,并产生下游软件供应链风险。

英文摘要

LLM-powered coding agents increasingly participate in software development workflows by generating code, selecting dependencies, and producing package installation commands. This creates a new software supply chain risk: when an agent hallucinates a non-existent package, an attacker may register the hallucinated name and later compromise users who install it. Existing package hallucination attacks and defenses primarily focus on naturally occurring hallucinations, targeted dependency steering, or post-hoc package validation. In this paper, we introduce \emph{Neutral Prompting Attack} (NPA), a highly stealthy attack paradigm in which semantically benign instructions, such as encouraging imagination and exhaustiveness, increase package hallucination propensity without containing explicit malicious intent. Unlike targeted dependency steering, NPA does not specify an attacker-chosen package. Instead, it shifts the model's dependency generation behavior toward more speculative package names. We evaluate NPA across multiple coding-oriented LLMs and package hallucination benchmarks. Our results show that NPA increases both \emph{Hallucination ASR} and \emph{Pip Install ASR}, changes the distribution of hallucinated package names, and evades existing static-analysis, LLM-based, and agent-based Skill defenses. These findings reveal that harmless-looking prompts can covertly manipulate hallucination behavior and create downstream software supply chain risks.

2605.29351 2026-05-29 cs.LG math.DS stat.ML 版本更新

Attention as In-Context Empirical Bayes: A Two-Stage View via Particle Dynamics

注意力作为上下文经验贝叶斯:通过粒子动力学的两阶段视角

Matthew Smart, Soumya Ganguly, Nilava Metya, Alexandre V. Morozov, Anirvan M. Sengupta

发表机构 * Lewis-Sigler Institute for Integrative Genomics(利斯-西格尔整合基因组研究所) Princeton University(普林斯顿大学) Department of Mathematics(数学系) Rutgers University(罗格斯大学) Department of Physics and Astronomy(物理与天文学系) Center for Computational Quantum Physics and Center for Computational Mathematics(计算量子物理中心和计算数学中心) Flatiron Institute(Flatiron研究所) Simons Foundation(西蒙斯基金会)

AI总结 本文通过粒子动力学将最小注意力仅变换器解释为两阶段经验贝叶斯过程,揭示了深度和注意力残差的统计角色,并证明无需显式噪声调度即可实现有效去噪。

Comments 52 pages, 5 figures

AI中文摘要

我们研究了在全部词元均被损坏的设置下的极简纯注意力 Transformer,并表明它们具有两阶段经验贝叶斯解释。单个注意力步骤计算相对于由上下文定义的经验分布的核加权后验均值。深度通过粒子动力学(阶段1)细化该分布,而长程跳跃连接将带噪输入作为查询用于后验推断(阶段2),揭示了深度和注意力残差各自不同的统计角色。该框架隔离出一个最小设置,其中上下文本身诱导出一个控制上下文推断的、依赖深度的能量景观。我们表明,无需显式噪声调度即可出现有效去噪:固定的核带宽和有限的积分范围就足够了,从而产生了一个有原则的深度-噪声关系。我们进一步为一类性质良好的先验建立了后验均值恢复保证,其中经验估计器在渐近条件下收敛到贝叶斯最优预测器。将这些动力学与反向扩散极限联系起来,我们的结果为注意力提供了一种统计解释:它是通过基于样本的后验估计进行的上下文推断,无需显式密度建模。

英文摘要

We study minimal attention-only transformers under all-token corruption and show they admit a two-stage empirical Bayes interpretation. A single attention step computes a kernel-weighted posterior mean with respect to the empirical distribution defined by the context. Depth refines this distribution through particle dynamics (Stage 1), while a long-range skip-connection carries the noisy input as a query for posterior inference (Stage 2), revealing distinct statistical roles for depth and attention residuals. The framework isolates a minimal setting in which the context itself induces a depth-dependent energy landscape governing in-context inference. We show that effective denoising can emerge without an explicit noise schedule: a fixed kernel bandwidth and finite integration horizon suffice, yielding a principled depth-noise relationship. We further establish a posterior-mean recovery guarantee for a class of well-behaved priors, where the empirical estimator converges to the Bayes-optimal predictor under asymptotic conditions. Connecting these dynamics to reverse-diffusion limits, our results provide a statistical interpretation of attention as in-context inference via sample-based posterior estimation, without explicit density modeling.
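
The single-step claim can be illustrated directly. If a clean token is drawn from the empirical distribution of the context and observed under Gaussian noise of scale σ, its posterior mean is a softmax-weighted average of the context tokens with logits −‖y − x_i‖² / (2σ²), which has the shape of one attention read with the context serving as both keys and values. The sketch below covers only this single step, not the depth-wise particle dynamics; the Gaussian noise model and the names are assumptions.

```python
import math

def attention_posterior_mean(query, context, bandwidth):
    """Kernel-weighted posterior mean of `query` under the empirical context prior.

    Returns (posterior_mean, attention_weights).
    """
    logits = [-sum((q - x) ** 2 for q, x in zip(query, token)) / (2.0 * bandwidth ** 2)
              for token in context]
    top = max(logits)                      # subtract the max for numerical stability
    weights = [math.exp(l - top) for l in logits]
    norm = sum(weights)
    weights = [w / norm for w in weights]  # softmax over context tokens
    mean = [sum(w * token[d] for w, token in zip(weights, context))
            for d in range(len(query))]
    return mean, weights

context = [[-1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
noisy = [0.8, 0.1]
for sigma in (0.2, 1.0, 5.0):
    mean, weights = attention_posterior_mean(noisy, context, sigma)
    print(sigma, [round(m, 3) for m in mean], [round(w, 3) for w in weights])
```

A small bandwidth snaps the estimate to the nearest context token, while a large bandwidth pulls it toward the context average. When all context tokens have equal norm, these logits differ from dot-product logits only by terms that are constant across tokens.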

2605.29329 2026-05-29 q-bio.QM cs.LG 版本更新

Mixing Vector Model for Copolymer Inference via Mixed Integer Linear Programming

基于混合整数线性规划的共聚物推断的混合向量模型

Jianshen Zhu, Raveena Rai, Taiyo Sohkawa, Naveed Ahmed Azam, Kazuya Haraguchi, Liang Zhao, Tatsuya Akutsu

发表机构 * Department of Information Science and Technology, Tokyo University of Science(东京理科大学信息科学与技术系) Discrete Mathematics and Computational Intelligence Laboratory, Department of Mathematics, Quaid-i-Azam University(真纳大学数学系离散数学与计算智能实验室) Graduate School of Informatics, Kyoto University(京都大学信息学研究生院) Graduate School of Advanced Integrated Studies in Human Survivability (Shishu-Kan), Kyoto University(京都大学人类生存高级综合研究研究生院(Shishu-Kan)) Bioinformatics Center, Institute for Chemical Research, Kyoto University(京都大学化学研究所生物信息中心)

AI总结 提出混合向量模型,通过混合整数线性规划实现共聚物的逆设计,在多个物化数据集上取得高预测精度并保持可解性。

AI中文摘要

最近开发了一种新颖的两阶段分子推断框架mol-infer,通过两层模型下的混合整数线性规划(MILP),在给定学习预测函数和结构约束的条件下,以最优性和精确性推断具有规定抽象结构和期望性质值的化学图。在本研究中,我们通过引入一种称为混合向量(MV)模型的简单特征表示,将该框架扩展到共聚物。在所提出的模型中,共聚物特征向量表示为MILP可处理单体描述符的凸组合,加权系数为组成单体的混合比例。这种表示不需要明确的序列类别信息,因此自然兼容基于MILP的逆设计。在该模型下,我们使用人工神经网络、简化二次多元线性回归和随机森林为多个共聚物性质数据集构建预测函数。所提出的表示在多个物理化学性质数据集上实现了实际有用的预测性能;特别地,十个数据集中有九个的最佳测试R²分数超过0.7,六个数据集超过0.9。我们还制定了在MV表示下具有规定混合比例的多单体逆设计问题,并表明即使在三单体设置下,生成的MILP实例仍然可解。最后,我们通过重新评估推断的候选物并将重新计算的性质值与学习模型预测的值进行比较,进行外部一致性检查。总体而言,所提出的框架为在两层模型下实现共聚物的模型级精确逆设计提供了可处理的第一步。

英文摘要

A novel two-phase molecule inference framework, mol-infer, has recently been developed to infer chemical graphs with prescribed abstract structures and desired property values through mixed integer linear programming (MILP) under the two-layered model, with guaranteed optimality and exactness relative to the given learned prediction function and structural constraints. In this study, we extend this framework to copolymers by introducing a simple feature representation, called the mixing vector (MV) model. In the proposed model, a copolymer feature vector is represented as a convex combination of MILP-tractable monomer descriptors weighted by the mixing ratio of the constituent monomers. This representation does not require explicit sequence-class information and is therefore naturally compatible with MILP-based inverse design. Under this model, we construct prediction functions for several copolymer property datasets using artificial neural networks, reduced quadratic multiple linear regression, and random forests. The proposed representation achieves practically useful predictive performance across multiple physicochemical property datasets; in particular, the best test R^2 score exceeds 0.7 for nine of the ten datasets and exceeds 0.9 for six datasets. We also formulate a multi-monomer inverse-design problem under the MV representation with a prescribed mixing ratio and show that the resulting MILP instances remain tractable, even for three-monomer settings. Finally, we perform an external consistency check by re-evaluating the inferred candidates and comparing the re-computed property values with those predicted by the learned model. Overall, the proposed framework gives a tractable first step toward model-level exact inverse design of copolymers under the two-layered model.
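
The mixing vector itself is a one-line computation, sketched below. With the mixing ratio prescribed, the copolymer feature is linear in the monomer descriptors, which is what keeps it compatible with a MILP formulation. The descriptor values and the function name are made up for illustration.

```python
def mixing_vector(descriptors, ratios):
    """Convex combination of monomer descriptor vectors weighted by mixing ratio."""
    if len(descriptors) != len(ratios):
        raise ValueError("one ratio per monomer is required")
    if any(r < 0 for r in ratios) or abs(sum(ratios) - 1.0) > 1e-9:
        raise ValueError("ratios must be non-negative and sum to 1")
    dim = len(descriptors[0])
    return [sum(r * d[k] for r, d in zip(ratios, descriptors)) for k in range(dim)]

# Hypothetical descriptors for three monomers (values are illustrative only).
monomer_a = [12.0, 0.0, 3.0]
monomer_b = [8.0, 2.0, 1.0]
monomer_c = [10.0, 1.0, 0.0]
print(mixing_vector([monomer_a, monomer_b, monomer_c], [0.5, 0.3, 0.2]))
```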

2605.29327 2026-05-29 cs.CL cs.LG 版本更新

Reasoning-preserved Efficient Distillation of Large Language Models via Activation-aware Initialization

保留推理能力的大语言模型高效蒸馏:基于激活感知初始化

Junlin He, Yihong Tang, Tong Nie, Guilong Li, Binyu Yang, Jinxiao Du, Lijun Sun, Wei Ma

发表机构 * The Hong Kong Polytechnic University, Hong Kong SAR, China(香港理工大学) McGill University, Montreal, QC, Canada(麦吉尔大学)

AI总结 针对高效蒸馏导致的多步推理能力严重下降(推理崩溃),提出RED方法,通过激活感知初始化投影矩阵为通道选择矩阵,理论缓解有效秩崩溃,恢复推理能力并保持高效训练与通用性能。

AI中文摘要

高效蒸馏(EDistill)通过结构化剪枝参数和调优轻量模块以高训练效率压缩大语言模型(LLM)。尽管这些EDistill LLM在通用能力基准上相对于类似大小的LLM取得了最先进的(SOTA)性能,但我们发现其多步推理能力严重下降,我们称之为推理崩溃。我们系统分析了推理崩溃的几何起源,并表明基于宽度缩减投影矩阵的SOTA EDistill方法遭受有效秩(eRank)崩溃,即隐藏表示的有效秩下降。我们从理论上解释了随机初始化投影矩阵的奇异值如何变得分布不均,导致eRank崩溃,进而导致token不可区分性。为解决此问题,我们提出了RED(保留推理能力的高效蒸馏)方法,该方法引入激活感知初始化,将投影矩阵初始化为通道选择矩阵,从而在理论上缓解eRank崩溃。在Llama和Qwen系列上的实验表明,RED在保持高训练效率和SOTA通用能力的同时,显著恢复了推理能力。

英文摘要

Efficient Distillation (EDistill) compresses large language models (LLMs) by structurally pruning parameters and tuning lightweight modules with high training efficiency. Although these EDistilled LLMs achieve state-of-the-art (SOTA) performance on general ability benchmarks relative to similarly sized LLMs, we identify a severe degradation in their multi-step reasoning ability, which we term reasoning collapse. We systematically analyze the geometric origins of reasoning collapse and show that the SOTA EDistill method based on width-reducing projection matrices suffers from eRank collapse, in which the effective rank (eRank) of hidden representations drops. We theoretically explain how singular values of randomly initialized projection matrices become unevenly distributed, leading to eRank collapse and thus token indistinguishability. To address this issue, we propose RED (Reasoning-preserved Efficient Distillation) for LLMs, which introduces activation-aware initialization to initialize projection matrices as channel-selection matrices, thus theoretically mitigating eRank collapse. Experiments on Llama and Qwen series demonstrate that RED substantially recovers reasoning while maintaining high training efficiency and SOTA general ability.
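
A common definition of effective rank (the exponential of the Shannon entropy of the normalized singular values, following Roy and Vetterli) makes the collapse easy to see numerically. The abstract does not spell out which definition it uses, so this choice is an assumption. A channel-selection matrix with k selected channels has k singular values equal to 1, giving the flat spectrum in the first call; the second spectrum is a made-up example of an uneven one.

```python
import math

def effective_rank(singular_values):
    """exp(entropy) of the singular values normalized to sum to 1."""
    total = sum(singular_values)
    probs = [s / total for s in singular_values if s > 0]
    entropy = -sum(p * math.log(p) for p in probs)
    return math.exp(entropy)

# Flat spectrum, as produced by a channel-selection matrix keeping 8 channels.
print(effective_rank([1.0] * 8))

# Uneven spectrum (hypothetical values): most mass sits on one direction.
print(effective_rank([8.0, 1.0, 0.5, 0.25, 0.1, 0.1, 0.05, 0.05]))
```

A flat spectrum attains the maximum possible value, equal to the number of singular values, while concentrating mass on a few directions drives the effective rank toward 1.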

2605.29326 2026-05-29 cs.LG 版本更新

NeuroEdge: Real-Time Hand Gesture Recognition with High-Density EMG Using Deep Learning at the Edge

NeuroEdge:基于边缘深度学习与高密度肌电的实时手势识别

Peter Chudinov, Zhenyu Lin, Jay Motamarry, Srihita Panati, Xiaorong Zhang, Zhuwei Qin

发表机构 * San Francisco State University(旧金山州立大学) Department of Biology(生物系) School of Engineering in Computer Engineering(计算机工程学院) College of San Mateo(圣马特奥学院) Contra Costa College(康特拉科斯塔学院)

AI总结 提出NeuroEdge系统,通过HD-EMG无线传输和轻量级CNN推理引擎,在微控制器上实现实时手势识别,准确率90%,延迟83ms。

AI中文摘要

高密度肌电(HD-EMG)已成为解码精细神经肌肉活动的强大方式,可实现用于假肢控制、康复和增强交互等应用的实时神经-机器接口(NMI)。尽管卷积神经网络(CNN)等深度学习方法在基于EMG的手势识别中表现出高分类精度,但由于计算和内存限制,它们在嵌入式硬件上的部署仍然是一个重大挑战。本文提出NeuroEdge,一种基于实时HD-EMG的NMI系统,完全在资源受限的微控制器上执行手势识别。该系统包含两个定制模块:HD-EMG StreamBridge,一种无线通信接口,将原始HD-EMG数据从Quattrocento放大器流式传输到ESP32微控制器;以及EdgeDL推理引擎,一种在索尼Spresense微控制器上执行的轻量级深度学习框架。一个针对嵌入式推理优化的紧凑一维CNN实时处理滑动窗口的EMG数据。数据流和推理通过利用直接内存访问(DMA)进行数据传输以及ESP32和Spresense之间的串行外设接口(SPI)突发通信的架构进行流水线和同步,确保低延迟性能。实验结果表明,NeuroEdge在七种手势中实现了90%的实时分类准确率,使用从前臂记录的192通道HD-EMG,总平均延迟为83毫秒。我们的系统证明了在基于微控制器的边缘设备上部署基于HD-EMG的复杂手势识别的可行性,弥合了高分辨率生物信号采集与基于深度学习的嵌入式推理之间的差距,为下一代NMI铺平了道路。

英文摘要

High-density electromyography (HD-EMG) has emerged as a powerful modality for decoding fine-grained neuromuscular activity, enabling real-time neural-machine interfaces (NMIs) for applications such as prosthetic control, rehabilitation, and augmented interaction. While deep learning approaches such as convolutional neural networks (CNNs) have demonstrated high classification accuracy for EMG-based gesture recognition, their deployment on embedded hardware remains a major challenge due to computational and memory constraints. This paper presents NeuroEdge, a real-time HD-EMG-based NMI system that performs gesture recognition entirely on resource-constrained microcontrollers. The system features two custom-designed modules: the HD-EMG StreamBridge, a wireless communication interface that streams raw HD-EMG data from a Quattrocento amplifier to an ESP32 microcontroller; and the EdgeDL Inference Engine, a lightweight deep learning framework executing on a Sony Spresense microcontroller. A compact 1-dimensional CNN optimized for embedded inference processes sliding windows of EMG data in real time. Data streaming and inference are pipelined and synchronized through an architecture that utilizes Direct Memory Access (DMA) for data transfer and Serial Peripheral Interface (SPI) burst communication between the ESP32 and Spresense, ensuring low-latency performance. Experimental results show that NeuroEdge achieves a real-time classification accuracy of 90% across seven hand gestures, with a total average latency of 83 ms using 192 channels of HD-EMG recorded from the forearm. Our system demonstrates the feasibility of deploying complex HD-EMG-based gesture recognition on microcontroller-based edge devices, bridging the gap between high-resolution biosignal acquisition and deep learning-based embedded inference for next-generation NMIs.
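
The sliding-window step that feeds the 1-D CNN can be sketched as a bounded buffer over incoming multi-channel samples. The abstract gives the channel count (192) but not the window length or stride, so the values below are placeholders; the actual firmware runs on microcontrollers and is not written in Python.

```python
from collections import deque

def sliding_windows(samples, window_len, stride):
    """Yield overlapping windows (lists of samples) from a stream of samples."""
    buffer = deque(maxlen=window_len)  # old samples fall off automatically
    since_last = 0
    for sample in samples:
        buffer.append(sample)
        since_last += 1
        if len(buffer) == window_len and since_last >= stride:
            since_last = 0
            yield list(buffer)

NUM_CHANNELS = 192  # from the abstract
WINDOW_LEN = 64     # placeholder: samples per window
STRIDE = 16         # placeholder: new samples between consecutive windows

# Synthetic stream: each sample is one reading per channel.
stream = ([float(t)] * NUM_CHANNELS for t in range(200))
for index, window in enumerate(sliding_windows(stream, WINDOW_LEN, STRIDE)):
    if index < 3:
        print(index, len(window), len(window[0]), window[0][0], window[-1][0])
```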

2605.29307 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

GrepSeek: Training Search Agents for Direct Corpus Interaction

GrepSeek:训练用于直接语料库交互的搜索代理

Alireza Salemi, Chang Zeng, Atharva Nijasure, Jui-Hui Chung, Razieh Rahimi, Fernando Diaz, Hamed Zamani

发表机构 * University of Massachusetts Amherst(马萨诸塞大学阿默斯特分校) Princeton University(普林斯顿大学) Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出GrepSeek,一种通过两阶段训练(冷启动数据集+GRPO优化)和语义保持的分片并行执行引擎,训练紧凑型搜索代理直接与文本语料库交互(通过shell命令),在开放域问答中取得最优F1和精确匹配。

AI中文摘要

大型语言模型(LLM)搜索代理通过多轮推理和信息检索,在知识密集型语言任务中展现出强大潜力。大多数现有系统使用检索器,该检索器接收关键词或自然语言查询,并利用预计算文档表示的索引返回排序后的文档列表。在本工作中,我们探索了一种互补视角,其中搜索代理将语料库本身视为搜索环境,并通过执行可执行的shell命令来寻找证据。我们引入了GrepSeek,一种优化的直接语料库交互(DCI)搜索代理,它训练一个紧凑的搜索代理从大型文本语料库中查找、过滤和组合证据。为了解决在大语料库上直接使用强化学习进行学习行为的不稳定性,我们提出了一种两阶段训练流程。首先,我们使用答案感知的Tutor和答案盲的Planner构建冷启动数据集,生成经过验证的、因果基础的搜索轨迹。其次,我们使用组相对策略优化(GRPO)优化初始化的策略,使代理能够通过与语料库的直接交互来改进其任务导向的搜索行为。为了使DCI在大规模下实用,我们进一步使用语义保持的分片并行执行引擎,该引擎将基于shell的检索加速高达7.6倍,同时保持与shell命令顺序执行的字节精确等价。在七个开放域问答基准上的实验表明,GrepSeek在整体词元级F1和精确匹配上取得了最强性能。我们的分析还揭示了纯粹词汇交互在具有显著表面形式变化的查询上的局限性,表明DCI作为搜索代理的一种实用且具有竞争力的方法,可以在现实世界中补充现有的检索范式。

英文摘要

Large Language Model (LLM) search agents have shown strong promise for knowledge-intensive language tasks through multiple rounds of reasoning and information retrieval. Most existing systems access information using a retriever that takes a keyword or natural language query and returns a ranked list of documents using an index of pre-computed document representations. In this work, we explore a complementary perspective in which the search agent treats the corpus itself as the search environment and finds evidence by issuing executable shell commands. We introduce GrepSeek, an optimized direct corpus interaction (DCI) search agent that trains a compact search agent to find, filter, and compose evidence from large text corpora. To address the instability of learning behavior directly with reinforcement learning on large corpora, we propose a two-stage training pipeline. First, we construct a cold-start dataset using an answer-aware Tutor and answer-blind Planner to generate verified, causally grounded search trajectories. Second, we refine the initialized policy with Group Relative Policy Optimization (GRPO), allowing the agent to improve its task-oriented search behavior through direct interaction with the corpus. To make DCI practical at scale, we further use a semantics-preserving sharded-parallel execution engine that accelerates shell-based retrieval by up to $7.6\times$ while preserving byte-exact equivalence with sequential execution of the shell command. Experiments across seven open-domain question answering benchmarks show that GrepSeek achieves the strongest overall token-level $F_1$ and Exact Match. Our analysis also highlights the limitations of purely lexical interaction on queries with substantial surface-form variation, suggesting DCI as a practical and competitive method for search agents that can complement existing retrieval paradigms in the real world.
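
The idea behind a semantics-preserving sharded execution can be sketched for the simplest case, a line-local filter such as `grep`: split the corpus into contiguous shards, filter each shard independently, and concatenate the results in shard order. The output is then byte-identical to a sequential pass. The sketch below is a pure-Python stand-in rather than the authors' engine. It uses threads only to show the order-preserving merge and makes no claim about speedup. Commands that are not line-local (for example `sort` or `wc`) would need a different merge step.

```python
import re
from concurrent.futures import ThreadPoolExecutor

def grep_lines(lines, pattern):
    """Sequential reference: keep the lines that match the pattern."""
    rx = re.compile(pattern)
    return [line for line in lines if rx.search(line)]

def shard(lines, num_shards):
    """Split into contiguous shards so concatenation restores corpus order."""
    size = max(1, -(-len(lines) // num_shards))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]

def sharded_grep(lines, pattern, num_shards=4):
    """Filter each shard in parallel, then merge results in shard order."""
    shards = shard(lines, num_shards)
    with ThreadPoolExecutor(max_workers=num_shards) as pool:
        parts = list(pool.map(lambda s: grep_lines(s, pattern), shards))
    return [line for part in parts for line in part]

corpus = [b"alpha release notes\n", b"beta build\n", b"alphabet soup\n",
          b"gamma ray\n", b"final alpha\n", b"delta\n", b"alp\n"]
sequential = b"".join(grep_lines(corpus, rb"alpha"))
parallel = b"".join(sharded_grep(corpus, rb"alpha", num_shards=3))
print(parallel == sequential)
```

`Executor.map` returns results in the order of its inputs regardless of which worker finishes first, which is what keeps the merged output in corpus order.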

2605.29283 2026-05-29 cs.LG cs.AI 版本更新

Do Physics Foundation Models Learn Generalizable Physics? A Bias-Aware Benchmark Across Physical Regimes and Distribution Shifts

物理基础模型能否学习可泛化的物理?一种跨物理机制和分布偏移的偏差感知基准

Mengdi Chu, Yang Liu, Ayan Biswas, Han-Wei Shen

发表机构 * The Ohio State University(俄亥俄州立大学) Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室)

AI总结 通过构建包含8种物理动力学、3种训练数据混合和25种测试机制的基准,评估五种物理基础模型架构,发现当前模型是条件性而非通用性泛化者,其泛化能力依赖于物理机制、时间尺度、初始条件、预训练、模型大小和架构,并指出改进需超越缩放模型或扩展数据,转向学习跨机制、时间尺度和分布偏移的可迁移物理知识。

Comments 26 pages, 31 figures

AI中文摘要

最近的物理基础模型声称具有通用的时空预测能力,但它们的评估通常将性能压缩为固定训练分布下的单一平均分数。这使得难以确定模型是否学习了可泛化的物理动力学,还是仅在特定设置下表现良好。我们构建了一个包含8种物理动力学、3种训练数据混合和25种测试机制的基准,这些测试机制由动态尺度和初始条件复杂性变化引起,涵盖了分布内、分布偏移和分布外设置。我们评估了五种物理基础模型架构和每种架构的四种模型变体(从头训练和三种预训练大小),共得到60,000个测量结果。我们的结果表明,当前的物理基础模型表现为条件性而非通用性泛化者:它们的泛化能力取决于物理机制、时间尺度、初始条件设置、预训练、模型大小和架构。改进训练数据分布只能部分缓解这一限制。预训练和缩放也无法可靠地消除它们的能力偏差。我们认为,改进物理基础模型需要超越缩放模型或扩展数据,转向学习能够更好地跨机制、时间尺度和分布偏移捕获可迁移物理知识的机制。

英文摘要

Recent physics foundation models claim general spatiotemporal forecasting ability, yet their evaluations often collapse performance into a single average score under a fixed training distribution. This makes it difficult to determine whether a model has learned generalizable physical dynamics or only performs well under particular settings. We construct a benchmark with 8 physical dynamics, 3 training-data mixtures, and 25 test regimes induced by dynamic-scale and initial-condition complexity shifts, covering in-distribution, distribution-shift, and out-of-distribution settings. We evaluate five physics foundation model architectures and four model variants per architecture (scratch and three pretrained sizes), resulting in 60,000 measurements. Our results show that current physics foundation models behave as conditional rather than universal generalists: their generality depends on the physical regime, temporal scale, initial-condition setting, pretraining, model size, and architecture. Improving the training data distribution only partially mitigates this limitation. Pretraining and scaling are also unable to reliably remove their ability biases. We argue that improving physics foundation models requires moving beyond scaling models or expanding data, toward learning mechanisms that better capture transferable physical knowledge across regimes, temporal scales, and distribution shifts.

2605.29273 2026-05-29 cs.LG math.OC 版本更新

A Theoretical and Experimental Study of a Novel Adaptive Learning Algorithm

一种新型自适应学习算法的理论与实验研究

Sakshi Kumari, Shyam Kumar M, Sushmitha P

发表机构 * Department of Mathematics, Indian Institute of Technology Patna(印度理工学院巴特那分校数学系) Department of Mechanical Engineering, Indian Institute of Technology Kharagpur(印度理工学院克勒格布尔分校机械工程系)

AI总结 针对现有自适应优化器(如Adam和AMSGrad)的收敛性问题,提出基于视线方法的C-Adam优化器,给出收敛性理论证明并通过数值实验验证。

AI中文摘要

机器学习算法的一个关键组成部分是以更少的计算成本和更少的振荡来最小化损失函数。虽然基于自适应学习率的优化器已广泛用于实际任务,但它们不能保证收敛,这就是后来引入AMSGrad来研究Adam的非收敛行为的原因。本文批判性地回顾了流行的自适应优化方法(如Adam和AMSGrad),重点介绍了它们的基本设计概念。为了解决上述优化器的局限性,基于视线方法提出了一种新的优化器变体C-Adam。还提供了收敛性的理论证明,并通过一系列基于实际生活的数值实验验证了该优化器。

英文摘要

A crucial component of machine learning algorithms is minimizing loss functions with less computational cost and fewer oscillations. While adaptive learning rate-based optimizers have been widely used for real-world tasks, they do not guarantee convergence, which is why AMSGrad was later introduced to investigate the non-convergence behaviour of Adam. In this paper, popular adaptive optimization methods like Adam and AMSGrad are critically reviewed with an emphasis on their fundamental design concepts. To address limitations of the above-mentioned optimizers, a new optimizer variant, C-Adam, is proposed based on the line-of-sight approach. A theoretical proof of convergence is also provided, and the optimizer is validated through a number of real-life-based numerical experiments.
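
For reference, the two baseline updates reviewed in the paper differ in a single line: AMSGrad keeps the running maximum of the second-moment estimate, so the effective step size cannot grow back after a large gradient. The scalar sketch below follows the commonly implemented bias-corrected form. The hyperparameter defaults are conventional choices, and C-Adam is not shown because the abstract does not define it.

```python
import math

def make_optimizer(lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8, amsgrad=False):
    """Return (step, state) for scalar Adam, or AMSGrad when amsgrad=True."""
    state = {"m": 0.0, "v": 0.0, "v_max": 0.0, "t": 0}

    def step(theta, grad):
        state["t"] += 1
        state["m"] = beta1 * state["m"] + (1 - beta1) * grad
        state["v"] = beta2 * state["v"] + (1 - beta2) * grad * grad
        if amsgrad:
            # AMSGrad: never let the second-moment estimate decrease.
            state["v_max"] = max(state["v_max"], state["v"])
            v_used = state["v_max"]
        else:
            v_used = state["v"]
        m_hat = state["m"] / (1 - beta1 ** state["t"])  # bias correction
        v_hat = v_used / (1 - beta2 ** state["t"])
        return theta - lr * m_hat / (math.sqrt(v_hat) + eps)

    return step, state

# Minimize f(theta) = theta^2, whose gradient is 2 * theta.
for use_amsgrad in (False, True):
    step, _ = make_optimizer(amsgrad=use_amsgrad)
    theta = 1.0
    for _ in range(200):
        theta = step(theta, 2.0 * theta)
    print("amsgrad" if use_amsgrad else "adam", theta)
```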

2605.29272 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Causal Label Recovery in Payment Networks

支付网络中的因果标签恢复

Gaurav Dhama

发表机构 * Mastercard(万事达卡)

AI总结 针对支付网络中标签存在的四种系统偏差,提出序列三重稳健(STR)估计器,同时纠正所有偏差并达到半参数效率界,实现基于数天而非数月数据的训练。

Comments 49 pages

详情
AI中文摘要

支付网络中的欺诈检测模型依赖于存在系统性偏差的退单标签进行训练。每个标签必须依次经过三个门控:授权(被拒绝的交易不产生标签)、发卡行报告(未报告的欺诈不可见)和延迟(待处理的退单在训练时缺失)。到达的标签可能因第一方滥用或发卡行错误分类而受损。配套论文[arXiv:2605.27557]证明这四种损害对检测性能施加了极小极大下界。本文问:能否达到该下界?我们将观测流程形式化为一个具有三个倾向阶段和一个损坏层的顺序缺失数据问题,并构建了序列三重稳健(STR)估计器。STR同时纠正所有四种损害,并达到半参数效率界——没有估计器能具有更低的渐近方差。它是序列三重稳健的:在每个门控处,一致性仅要求倾向模型或结果回归中有一个正确指定,而非两者。我们提供了通过噪声率调整的伪标签进行损坏校正、通过经验贝叶斯收缩稳定小发卡行的逆倾向权重、提供有效置信区间的插件方差估计量,以及用于有限样本保证的伯恩斯坦集中不等式。在操作层面,我们推导了最优训练延迟——使标签质量损失和模型过时之和最小化的成熟窗口——并证明STR允许使用数天而非数月前的数据进行训练,将模型新鲜度与退单成熟周期解耦。对于任何样本量,STR在均方误差上严格优于基于退单的朴素训练。

英文摘要

Fraud detection models in payment networks train on chargeback labels that are systematically biased. Every label must survive three sequential gates: authorization (declined transactions generate no labels), issuer reporting (unreported fraud is invisible), and delay (pending chargebacks are missing at training time). Labels that do arrive may be corrupted by first-party misuse or issuer misclassification. A companion paper [arXiv:2605.27557] proved that these four impairments impose a minimax lower bound on detection performance. This paper asks: can that bound be achieved? We formalize the observation pipeline as a sequential missing-data problem with three propensity stages and a corruption layer, and construct the Sequential Triply Robust (STR) estimator. The STR corrects for all four impairments simultaneously and achieves the semiparametric efficiency bound -- no estimator can have lower asymptotic variance. It is sequentially triply robust: at each gate, consistency requires only that either the propensity model or the outcome regression is correctly specified, not both. We provide corruption correction via noise-rate-adjusted pseudo-labels, empirical Bayes shrinkage to stabilize inverse-propensity weights for small issuers, a plug-in variance estimator yielding valid confidence intervals, and a Bernstein concentration inequality for finite-sample guarantees. On the operational side, we derive the optimal training delay -- the maturity window that minimizes the sum of label-quality loss and model staleness -- and prove that the STR permits training on data that is days old rather than months old, decoupling model freshness from the chargeback maturity cycle. The STR provably dominates naive chargeback-based training in mean squared error for any sample size.
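
The "either the propensity model or the outcome regression" property described above is the hallmark of augmented inverse-propensity-weighted (AIPW) estimators. The sketch below shows a generic single-gate AIPW mean estimate, not the paper's three-gate STR estimator; the function name `aipw_mean` and the toy numbers are assumptions made for illustration.

```python
import numpy as np

def aipw_mean(y, observed, propensity, outcome_pred):
    """Augmented IPW estimate of E[Y] when Y is seen only where observed == 1.

    propensity   : modelled P(observed = 1 | x) for each row
    outcome_pred : modelled E[Y | x] for each row
    """
    # Unobserved labels (possibly NaN) never enter the estimate.
    y = np.where(observed == 1, y, 0.0)
    correction = observed / propensity * (y - outcome_pred)
    return float(np.mean(outcome_pred + correction))

# Toy data (hypothetical): four transactions, the last label never arrives.
y = np.array([1.0, 0.0, 1.0, np.nan])
observed = np.array([1, 1, 1, 0])
propensity = np.full(4, 0.75)      # correct missingness model
outcome_pred = np.full(4, 0.5)     # deliberately crude outcome model
estimate = aipw_mean(y, observed, propensity, outcome_pred)
```

Here the outcome model is crude but the propensity is right, and the estimate still lands on the observed fraud rate of 2/3; a sequential estimator would have to chain a correction of this kind across every gate.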

2605.29271 2026-05-29 cs.AI cs.IR cs.LG 版本更新

CoHyDE: Iterative Co-Training of LLM Rewriter & Dense Encoder for Tool Retrieval

CoHyDE: 用于工具检索的LLM改写器与稠密编码器的迭代协同训练

Vaishali Senthil, Ashutosh Hathidara, Sebastian Schreiber

发表机构 * SAP Labs(SAP实验室)

AI总结 提出CoHyDE方法,通过迭代协同训练稠密编码器和LLM改写器,结合对比学习和偏好对齐,在工具检索任务中同时提升标准查询和模糊查询的性能。

详情
AI中文摘要

在大规模API目录上的工具检索是LLM智能体的核心瓶颈:用户查询以口语化、通常不明确的语言出现,而目录使用技术性API词汇,没有固定的编码器能够单独弥合这一差距。两种主要的训练方法,对比编码器微调和基于冻结LLM的HyDE式查询扩展,从相反的角度解决这个问题,并在互补的方向上失败:微调编码器在查询的表面形式与目录匹配时表现出色,但在不匹配时性能崩溃;而零样本HyDE对不明确的查询更鲁棒,但生成不感知目录的假设描述,当查询形式良好时检索性能下降。我们提出CoHyDE,一种迭代过程,将稠密编码器和LLM改写器训练为单个共同演化的系统:编码器使用改写器生成的目录风格假设描述通过InfoNCE重新训练,改写器通过DPO基于编码器的检索分数进行偏好对齐,两者在循环开始前在工具目录上进行热启动。在ToolBench目录的约10k工具子集上,三轮CoHyDE在标准查询上比最强的单组件基线提高+2.5个百分点的NDCG@5,在保留的模糊查询上提高+6.3个百分点,在最难的模糊层级上增益高达+8个百分点。消融实验证实协同训练是关键因素:单独使用任一组件都无法在形式良好和模糊查询上匹配CoHyDE,在模糊查询上损失高达-8个百分点。

英文摘要

Tool retrieval over large API catalogs is a core bottleneck for LLM agents: user queries arrive in colloquial, often underspecified language, while the catalog uses technical API vocabulary that no fixed encoder can bridge on its own. The two dominant training approaches, contrastive encoder fine-tuning and HyDE-style query expansion with a frozen LLM, address this problem from opposite ends and fail in complementary directions: the fine-tuned encoder excels when the query's surface form already matches the catalog but collapses when it does not, while zero-shot HyDE is more robust to underspecified queries yet generates catalog-unaware hypothetical descriptions that degrade retrieval when queries are well-formed. We introduce CoHyDE, an iterative procedure that trains the dense encoder and the LLM rewriter as a single co-evolving system: the encoder is retrained with InfoNCE on catalog-style hypothetical descriptions produced by the rewriter, and the rewriter is preference-aligned via DPO against the encoder's retrieval scores, with both sides warm-started on the tool catalog before the loop begins. On a ~10k tool subset of the ToolBench catalog, three rounds of CoHyDE improve over the strongest single-component baseline by +2.5 pp NDCG@5 on standard queries and +6.3 pp on held-out vague queries, with gains as large as +8 pp on the hardest vague tier. Ablations confirm that co-training is the key ingredient: using either component in isolation fails to match CoHyDE on both well-formed and vague queries, with losses of up to -8 pp on vague queries.
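
The encoder side of the loop is trained with InfoNCE. As a rough illustration of that objective only (not the CoHyDE training code), the following NumPy sketch computes InfoNCE with in-batch negatives; the function name, the temperature value and the random "embeddings" are assumptions.

```python
import numpy as np

def info_nce(queries, positives, temperature=0.05):
    """Mean InfoNCE loss with in-batch negatives.

    Row i of `positives` is the matching item for row i of `queries`;
    every other row in the batch acts as a negative.
    """
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    p = positives / np.linalg.norm(positives, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_softmax = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_softmax)))

rng = np.random.default_rng(0)
tools = rng.normal(size=(8, 16))                      # stand-in tool embeddings
aligned = tools + 0.01 * rng.normal(size=(8, 16))     # rewrites close to their tool
shuffled = np.roll(tools, 1, axis=0)                  # rewrites matched to the wrong tool
loss_aligned = info_nce(aligned, tools)
loss_shuffled = info_nce(shuffled, tools)
```

The loss is small when each rewritten query sits next to its own tool description and large when it sits next to another tool, which is the signal the encoder is retrained on.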

2605.29267 2026-05-29 cs.AI cs.LG 版本更新

When and How Human Curation Backfires: Preference Alignment under Multi-Model Self-Consuming Loop

人类策展何时以及如何适得其反:多模型自消费循环下的偏好对齐

Yang Zhang, Xiukun Wei, Xueru Zhang

发表机构 * Department of Computer Science and Engineering, The Ohio State University, Columbus, Ohio(计算机科学与工程系,俄亥俄州立大学,哥伦布,俄亥俄)

AI总结 研究多模型自消费训练中人类策展对模型对齐的影响,发现跨模型交互可能削弱甚至逆转策展效果,导致长期对齐退化。

详情
AI中文摘要

基础模型越来越多地使用先前模型迭代生成的合成数据进行训练,而非仅依赖真实数据。这种自消费训练范式可能导致模型崩溃、发散或偏差放大。近期工作(Ferbach et al., 2024)表明,将人类策展纳入循环可以引导自消费模型向人类对齐的行为,但这些分析聚焦于单一孤立模型,该模型仅消耗自身输出。然而,在实践中,模型经常交互并训练于其他模型产生的输入-输出对。本文研究多模型机制下的自消费训练。我们首先形式化了一个交互自消费模型的框架,并刻画了所得动力系统何时收敛到稳定点。然后,我们考察了一个模型的人类策展如何影响其自身对齐(自影响),以及这种效应如何传播到其他模型(交叉影响)。与孤立设置中人类策展总是增强模型对齐不同,我们表明跨模型交互可以削弱甚至逆转这种效应,最终损害长期对齐。

英文摘要

Foundation models are increasingly trained on synthetic data generated by prior model iterations rather than exclusively on real data. This self-consuming training paradigm can lead to model collapse, divergence, or bias amplification. Recent work (Ferbach et al., 2024) shows that incorporating human curation into the loop can steer a self-consuming model toward human-aligned behavior, but these analyses focus on a single, isolated model that solely consumes its own outputs. In practice, however, models often interact and train on input-output pairs produced by other models. This paper studies self-consuming training in the multi-model regime. We first formalize a framework for interacting self-consuming models and characterize when the resulting dynamical system converges to a stable point. We then examine how human curation of one model affects its own alignment (self-influence) and how such effects propagate to other models (cross-influence). Unlike isolated settings where human curation always enhances model alignment, we show that cross-model interactions can dampen or even invert this effect, ultimately degrading long-term alignment.

2605.29259 2026-05-29 cs.LG cs.AI 版本更新

KLAS: Using Similarity to Stitch Neural Networks for Improved Accuracy-Efficiency Tradeoffs

KLAS:利用相似性拼接神经网络以改进精度-效率权衡

Debopam Sanyal, Anantharaman Iyer, Alind Khare, Trisha Jain, Akshay Jajoo, Myungjin Lee, Clayton Kerce, Alexey Tumanov

发表机构 * Georgia Institute of Technology(佐治亚理工学院) Microsoft M365 Research(微软M365研究) Cisco Research(思科研究) Georgia Tech Research Institute(佐治亚理工研究机构)

AI总结 提出KLAS框架,通过KL散度度量中间表示相似性自动选择最佳拼接配置,在相同微调成本下提升拼接模型的精度-效率曲线。

详情
AI中文摘要

鉴于部署目标的广泛性,灵活模型选择对于在给定计算预算内优化性能至关重要。最近的研究表明,在模型家族内拼接预训练模型能够实现精度-效率权衡空间的成本效益插值。拼接将一个预训练模型的中间激活变换到另一个模型,生成新的插值拼接网络。这类网络沿精度-效率谱提供了部署选项池。然而,现有拼接方法往往产生次优权衡且缺乏泛化性,因为它们主要依赖启发式方法选择拼接配置。我们认为,构建改进的精度-效率权衡需要显式捕获并利用被拼接预训练模型之间的相似性。为此,我们引入KLAS,一种新颖的拼接选择框架,通过利用中间表示之间的KL散度,自动化和泛化跨模型家族的拼接选择。KLAS从$O(k^2n^2)$种可能性中为$k$个深度为$n$的预训练模型识别最有前景的二元拼接。通过全面实验,我们证明KLAS在相同微调成本下改进了拼接模型的精度-效率曲线,与基线相比,KLAS在相同计算成本下实现了高达$1.21\%$的ImageNet-1K top-1准确率提升,或在保持准确率的同时将FLOPs降低$1.33\times$。

英文摘要

Given the wide range of deployment targets, flexible model selection is essential for optimizing performance within a given compute budget. Recent work demonstrates that stitching pretrained models within a model family enables cost-effective interpolation of the accuracy-efficiency tradeoff space. Stitching transforms intermediate activations from one pretrained model into another, producing a new interpolated stitched network. Such networks provide a pool of deployment options along the accuracy-efficiency spectrum. However, existing stitching approaches often yield suboptimal tradeoffs and lack generalizability, as they primarily rely on heuristics to select stitch configurations. We argue that constructing improved accuracy-efficiency tradeoffs requires explicitly capturing and leveraging the similarity between pretrained models being stitched. To this end, we introduce KLAS, a novel stitch selection framework that automates and generalizes stitch selection across model families by leveraging KL divergence between intermediate representations. KLAS identifies the most promising binary stitches from the $O(k^2n^2)$ possibilities for $k$ pretrained models of depth $n$. Through comprehensive experiments, we demonstrate that KLAS improves the accuracy-efficiency curve of stitched models at the same finetuning cost as baselines. KLAS achieves up to $1.21\%$ higher ImageNet-1K top-1 accuracy at the same computational cost, or maintains accuracy with a $1.33\times$ reduction in FLOPs.
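
To make the idea of ranking stitch candidates by KL divergence concrete, here is a minimal sketch. The abstract does not say how activations are turned into distributions, so the softmax normalization, the equal-width assumption and the names `mean_kl` and `rank_stitches` are all illustrative assumptions rather than the KLAS procedure.

```python
import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def mean_kl(acts_a, acts_b, eps=1e-12):
    """Mean KL(P_a || P_b) over a batch after softmax-normalizing each activation vector."""
    p, q = softmax(acts_a), softmax(acts_b)
    return float(np.mean(np.sum(p * (np.log(p + eps) - np.log(q + eps)), axis=1)))

def rank_stitches(layers_a, layers_b):
    """Return (kl, i, j) for every layer pair, most similar pair first."""
    scores = [(mean_kl(a, b), i, j)
              for i, a in enumerate(layers_a)
              for j, b in enumerate(layers_b)]
    return sorted(scores)

rng = np.random.default_rng(0)
# Hypothetical activations: 32 inputs, width 10, assumed equal across both models.
layers_a = [rng.normal(size=(32, 10)) for _ in range(3)]
layers_b = [rng.normal(size=(32, 10)),
            layers_a[1] + 0.01 * rng.normal(size=(32, 10))]
best_kl, best_i, best_j = rank_stitches(layers_a, layers_b)[0]
```

Scoring all pairs this way costs one forward pass per model plus cheap arithmetic, which is why a similarity score is attractive compared with fine-tuning every candidate stitch.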

2605.29250 2026-05-29 cs.CL cs.AI cs.IR cs.LG 版本更新

OmniRetrieval: Unified Retrieval across Heterogeneous Knowledge Sources

OmniRetrieval:跨异构知识源的统一检索

Jinheon Baek, Soyeong Jeong, Sangwoo Park, Woongyeong Yeo, Minki Kang, Patara Trirat, Heejun Lee, Sung Ju Hwang

发表机构 * KAIST(韩国科学技术院)

AI总结 提出OmniRetrieval框架,通过自然语言查询识别并调度到不同知识源的本地执行引擎,在13个数据集和309个知识库上超越单源基线,实现异构知识源统一检索。

详情
AI中文摘要

现实世界的信息需求需要访问结构多样的知识源,从非结构化文本和关系表到知识图谱和属性图。然而,现有的检索器一次只在一个源上操作,使用固定的查询语言,使得可用知识的更广泛图景被不兼容的接口所分割。一种自然的统一尝试是将这些源折叠到一个共享空间中,但这会抹去每个源的结构性优势(如模式、本体、组合操作符),而这些优势赋予了每个源其表达能力。因此,对多样化知识的有效检索需要的不是同质化,而是一个能够按每个源自身条件与其交互的总体层。为了实现这一点,我们提出了OmniRetrieval,一个框架,它接受任何自然语言查询,识别合适的知识源,并将源原生查询分派到其本地执行引擎。在涵盖文本、关系和图结构源的13个数据集和309个不同知识库的广泛基准测试中,OmniRetrieval超过了单源基线,证明了它可以作为异构源的通用接口,同时保留使每个源有价值的结构差异。

英文摘要

Real-world information needs require access to structurally diverse knowledge sources, from unstructured text and relational tables to knowledge graphs and property graphs. Existing retrievers, however, operate over one source at a time under a fixed query language, leaving the broader landscape of available knowledge fragmented behind incompatible interfaces. A natural attempt at unification would collapse these sources into a shared space, but this erases the structural affordances (such as schemas, ontologies, compositional operators) that give each source its expressive power. Effective retrieval over diverse knowledge, therefore, requires not homogenization but an overarching layer that meets each source on its own terms. To achieve this, we present OmniRetrieval, a framework that takes any natural-language query, identifies appropriate knowledge sources, and dispatches source-native queries to their native execution engines. Across an extensive benchmark spanning 13 datasets and 309 distinct knowledge bases over text, relational, and graph-structured sources, OmniRetrieval exceeds single-source baselines, demonstrating that it can serve as a general-purpose interface to the heterogeneous sources while preserving the structural distinctions that make each source valuable.

2605.29249 2026-05-29 stat.ML cs.LG 版本更新

Prediction-Powered Inference Across Many Tasks for AI Evaluation & Social Science Research

跨任务预测驱动推理在AI评估与社会科学研究中的应用

Nicolas Emmenegger, Ellery Stahler, Chara Podimata

发表机构 * MIT(麻省理工学院)

AI总结 提出多任务预测驱动推理框架,通过跨任务重校准利用共享结构,在标签稀缺时提升统计推断效率,并证明非线性结构是跨任务增益的必要条件。

详情
AI中文摘要

许多应用需要在多个相关任务中进行统计上有效的推断,而每个假设只使用少量高质量标签。在AI评估中,这些任务可能对应于不同提示、子群体或假设下的模型行为;在社会科学调查中,它们可能对应于相关问题、群体或测量条件。预测驱动推理(PPI)利用丰富但廉价的代理测量来改进有限真实标签的推断,但常用方法独立处理任务,因此未能利用相关任务间的共享结构。这一限制在每任务仅有少量标签的场景中尤为重要。为解决此问题,我们引入了一个多任务预测驱动推理框架,该框架利用来自相关任务的标记数据来提高统计功效,同时保留任务特定的推断。我们的方法通过跨任务重校准来利用代理-真实关系中的共享结构,同时保留任务内修正和功效调优,以构建精确的点估计和置信区间。我们证明,只有当代理-真实关系包含非线性结构时,才能实现超越功效调优PPI的效率提升;仿射跨任务重校准在渐近意义上等同于使用原始代理。我们通过合成和半合成数据集上的实验,以及2024年美国总统大选期间审计语言模型关于选举相关信息的案例研究,补充了我们的理论发现。利用一项大型人工标注研究,我们表明当标签稀缺时,跨任务重校准可以显著减少置信区间宽度。

英文摘要

Many applications require statistically valid inference across many related tasks, while using only a handful of high-quality labels per hypothesis. In AI evaluation, these tasks may correspond to model behaviors across prompts, subgroups, or hypotheses; in social science surveys, they may correspond to related questions, populations, or measurement conditions. Prediction-powered inference (PPI) uses abundant but inexpensive proxy measurements to improve inference from limited, ground-truth labels, but commonly used methods treat tasks independently and therefore fail to exploit shared structure across related tasks. This limitation is especially important in settings where only a small number of labels are available per task. To address this issue, we introduce a multi-task prediction-powered inference framework that uses labeled data from related tasks to improve power while preserving task-specific inference. Our methods exploit the shared structure in the proxy-ground-truth relationship through cross-task recalibration, while retaining within-task rectification and power tuning to construct accurate point estimates and confidence intervals. We prove that efficiency gains beyond power-tuned PPI are only possible when the proxy-ground-truth relationship contains nonlinear structure; affine cross-task recalibrations are asymptotically equivalent to using the original proxy. We complement our theoretical findings with experiments on synthetic and semi-synthetic datasets, as well as a case study auditing language models on election-related information during the 2024 U.S. presidential election. Using a large human-annotation study, we show that cross-task recalibration can substantially reduce confidence interval widths when labels are scarce.
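
For orientation, the single-task baseline that the abstract builds on (rectification plus power tuning for a mean) can be sketched as follows. This is the standard power-tuned PPI mean estimator, not the multi-task recalibration proposed in the paper; the function name `ppi_mean` is an assumption.

```python
import numpy as np

def ppi_mean(y_lab, f_lab, f_unlab, lam=None):
    """Prediction-powered estimate of E[Y].

    y_lab   : ground-truth labels on the small labeled set
    f_lab   : proxy predictions on the same labeled set
    f_unlab : proxy predictions on the large unlabeled set
    lam     : power-tuning weight; estimated from the data when None
    """
    y_lab, f_lab, f_unlab = map(np.asarray, (y_lab, f_lab, f_unlab))
    n, N = len(y_lab), len(f_unlab)
    if lam is None:
        # Variance-minimizing weight for mean estimation.
        var_f = np.var(np.concatenate([f_lab, f_unlab]), ddof=1)
        cov_yf = np.cov(y_lab, f_lab, ddof=1)[0, 1]
        lam = cov_yf / ((1 + n / N) * var_f)
    # Classical mean plus a proxy-based correction.
    return float(y_lab.mean() + lam * (f_unlab.mean() - f_lab.mean()))

classical = ppi_mean([1, 2, 3], [1, 2, 3], [4, 5, 6], lam=0.0)
full_ppi = ppi_mean([1, 2, 3], [1, 2, 3], [4, 5, 6], lam=1.0)
```

With `lam=0` the proxy is ignored and the estimate is the labeled mean; with `lam=1` the proxy mean on unlabeled data is rectified by the labeled residuals.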

2605.29247 2026-05-29 cs.AI cs.CL cs.LG 版本更新

DenseSteer: Steering Small Language Models towards Dense Math Reasoning

DenseSteer: 引导小型语言模型进行密集数学推理

Yang Ouyang, Shuhang Lin, Jung-Eun Kim

发表机构 * North Carolina State University(北卡罗来纳州立大学) Rutgers University(罗格斯大学)

AI总结 提出DenseSteer,一种无需训练的推理时引导框架,通过调节内部表征向密集推理模式靠拢,提升小型模型在多步数学推理中的准确性。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)展现出强大的链式推理(CoT)能力,而较小的模型(≤3B参数)在多步推理任务上表现显著不佳。基于对Qwen-2.5模型系列在数学推理基准上的实证分析,我们发现更熟练的推理与更少的推理步骤但每步更高的信息密度相关,我们将此属性称为密集推理。受此观察启发,我们提出了DenseSteer,一种无需训练的推理时引导框架,通过将内部表征调节至密集推理模式来增强小型模型的推理能力。实验表明,我们的方法在不增加词级负对数似然的情况下,持续提高了准确性,突显了密集推理作为数学问题求解的一种有效结构方法。

英文摘要

Large language models (LLMs) demonstrate strong chain-of-thought (CoT) reasoning abilities, while smaller models (<= 3B parameters) significantly underperform on multi-step reasoning tasks. Based on empirical analyses of the Qwen-2.5 model family on math reasoning benchmarks, we find that more proficient reasoning is associated with fewer reasoning steps but higher information density per step, a property we term Dense Reasoning. Motivated by this observation, we propose DenseSteer, a training-free inference-time steering framework that enhances small-model reasoning by modulating internal representations toward dense reasoning patterns. Experiments show that our method yields consistent accuracy improvements without increasing token-level Negative Log-Likelihood, highlighting dense reasoning as an effective structural approach to mathematical problem solving.

2605.29245 2026-05-29 cs.CR cs.CL cs.LG 版本更新

Implicit Identity Technologies for LLMs: Fingerprinting and Watermarking across Datasets, Models, and Generated Content

LLM的隐式身份技术:跨数据集、模型和生成内容的指纹识别与水印

Bing Liu, Shunping Wang, Yufan Zhu, Xinyi Yu, Jing Huang, Linkang Du, Hongbin Pei, Wei Luo

发表机构 * School of Cyber Science and Engineering, Xi’an Jiaotong University, Xi’an, China(西安交通大学网络空间安全学院) State Grid Henan Marketing Service Center, Henan, China(国网河南营销服务中心) Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China(中国科学院信息工程研究所) School of Cyber Security, University of Chinese Academy of Sciences, Beijing, China(中国科学院大学网络安全学院) School of Information Technology, Deakin University, Geelong, Australia(迪金大学信息技术学院)

AI总结 本文综述了LLM指纹识别和水印技术,提出隐式身份统一抽象,并基于生命周期分类法组织数据集、模型和生成内容的技术,建立评估框架。

Comments Accepted by IJCAI-ECAI 2026. 11 pages, 1 figure. Survey and taxonomy of LLM fingerprinting and watermarking for identity, provenance, generated-content attribution, and asset protection

详情
AI中文摘要

本文对LLM指纹识别和水印技术进行了综述和分类,用于身份验证、所有权验证、溯源和生成内容归因。大型语言模型(LLM)需要大量数据、计算和专业知识投入,并越来越多地部署在高风险场景中,因此保护LLM相关资产并追溯其来源至关重要。现有工作已在数据集溯源、模型所有权和生成内容检测方面迅速扩展,但该领域仍然碎片化:指纹识别和水印的使用往往不一致,且方法通常仅在孤立的资产特定设置中研究。为解决这一差距,我们引入隐式身份作为LLM系统中可验证但不可直接观察的身份信号的统一抽象。我们将指纹识别区分为源自内在特征的非侵入式身份,将水印区分为有意嵌入数据、模型或生成内容中的侵入式身份。然后,我们提出一种基于生命周期的分类法,将技术组织到数据集、模型和生成内容中,并进一步通过验证语义进行区分:基于相似性的归因和密钥验证。最后,我们建立一个以可识别性、鲁棒性和可部署性为中心的评估框架,总结在现实访问和变换条件下的代表性指标。通过统一术语、生命周期阶段和评估目标,本综述为研究LLM身份技术以及开发更可靠的资产保护和溯源机制提供了结构化基础。

英文摘要

This paper presents a survey and taxonomy of LLM fingerprinting and watermarking for identity, ownership verification, provenance, and generated-content attribution. Large language models (LLMs) require substantial investments in data, computation, and expertise, and are increasingly deployed in high-stakes settings, making it critical to protect LLM-related assets and trace their origins. Existing work has rapidly expanded across dataset provenance, model ownership, and generated-content detection, but the field remains fragmented: fingerprinting and watermarking are often used inconsistently, and methods are typically studied within isolated asset-specific settings. To address this gap, we introduce implicit identity as a unifying abstraction for verifiable but not directly observable identity signals in LLM systems. We distinguish fingerprinting as non-intrusive identity derived from intrinsic characteristics, and watermarking as intrusive identity deliberately embedded into data, models, or generated content. We then propose a lifecycle-based taxonomy that organises techniques across datasets, models, and generated content, and further separates them by verification semantics: similarity-based attribution and keyed verification. Finally, we establish an evaluation framework centred on identifiability, robustness, and deployability, summarising representative metrics under realistic access and transformation regimes. By unifying terminology, lifecycle stages, and evaluation objectives, this survey provides a structured foundation for studying LLM identity technologies and for developing more reliable mechanisms for asset protection and provenance.

2605.29236 2026-05-29 cs.LG 版本更新

SigmaMedStat: Temporal Signal Modeling for ICU False Alarm Reduction

SigmaMedStat: 用于ICU误报减少的时间信号建模

Arunkumar Ramachandran

AI总结 提出SigmaMedStat系统,通过将60秒记录分割为6个10秒块并提取连续小波变换尺度图,结合EfficientNet-B0编码器和两层LSTM网络进行时间建模,在PhysioNet/CinC Challenge 2015数据集上实现AUC 0.822,有效降低ICU误报。

Comments Code available at github.com/Arun-K-Ram/sigmamedstat

详情
AI中文摘要

重症监护病房(ICU)中的警报疲劳是一个有充分记录的患者安全危机。临床监护仪每天每位患者产生350次或更多警报,其中72-99%在临床上无关紧要。工作人员对非可操作警报的脱敏增加了错过真正紧急情况的风险。本文提出了SigmaMedStat,一个机器学习系统,在采取临床行动之前评估生理警报信号的可信度。在PhysioNet/Computing in Cardiology Challenge 2015数据集(包含498个四通道ICU警报记录)上评估了四种方法。主要贡献是一个时间建模框架,它将每个60秒记录分割成六个连续的10秒块,进而为每个块生成连续小波变换(CWT)尺度图,使用共享的EfficientNet-B0编码器对每个块进行编码,并将得到的特征序列传递给两层长短期记忆(LSTM)网络。五折分层交叉验证的平均AUC为0.822 +/- 0.016(95% CI: [0.790,0.853]),而基于完整60秒窗口的静态EfficientNet基线为0.641。消融研究证实,时间分块和多通道信号融合均独立地有助于分类性能。按警报类型分析显示,心室扑动是最准确分类的警报类型(AUC 0.820),而心脏停搏仍然是最难的(AUC 0.722)。错误分析识别出65个假阴性和85个高置信度错误分类作为主要失败模式。所有代码和结果公开在https://github.com/Arun-K-Ram/sigmamedstat。

英文摘要

Alarm fatigue in intensive care units (ICUs) is a well-documented patient safety crisis. Clinical monitors generate 350 or more alarms per patient per day, of which 72-99% are clinically irrelevant. Staff desensitization to non-actionable alarms increases the risk of missed true emergencies. This paper presents SigmaMedStat, a machine learning system that evaluates the trustworthiness of physiological alarm signals before clinical action is taken. Four approaches were evaluated on the PhysioNet/Computing in Cardiology Challenge 2015 dataset of 498 four-channel ICU alarm recordings. The primary contribution is a temporal modeling framework that splits each 60-second recording into six consecutive 10-second chunks, generates a Continuous Wavelet Transform (CWT) scalogram for each chunk, encodes each chunk with a shared EfficientNet-B0 encoder, and passes the resulting feature sequence to a two-layer Long Short-Term Memory (LSTM) network. Five-fold stratified cross-validation yields a mean AUC of 0.822 +/- 0.016 (95% CI: [0.790,0.853]), compared to 0.641 for a static EfficientNet baseline trained on the full 60-second window. Ablation studies confirm that temporal chunking and multi-channel signal fusion both contribute independently to classification performance. Per-alarm-type analysis reveals that Ventricular Flutter is the most accurately classified alarm type (AUC 0.820), while Asystole remains the hardest (AUC 0.722). Error analysis identifies 65 false negatives and 85 high-confidence misclassifications as the primary failure modes. All code and results are publicly available at https://github.com/Arun-K-Ram/sigmamedstat.
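
The chunking step can be pictured with a few lines of NumPy. This is a sketch of the splitting only (no wavelet transform, encoder or LSTM); the 250 Hz sampling rate and the function name `split_into_chunks` are assumptions, and the repository linked above remains the reference for the actual pipeline.

```python
import numpy as np

def split_into_chunks(recording, fs, chunk_seconds=10):
    """Split a (channels, samples) array into consecutive, non-overlapping chunks.

    Returns an array of shape (n_chunks, channels, chunk_len), ordered in time.
    """
    chunk_len = int(fs * chunk_seconds)
    n_chunks = recording.shape[1] // chunk_len
    trimmed = recording[:, : n_chunks * chunk_len]  # drop any incomplete tail
    return trimmed.reshape(recording.shape[0], n_chunks, chunk_len).transpose(1, 0, 2)

fs = 250  # assumed sampling rate in Hz
recording = np.arange(4 * 60 * fs, dtype=float).reshape(4, 60 * fs)  # 4 channels, 60 s
chunks = split_into_chunks(recording, fs)
```

Each of the six chunks keeps all four channels, so a per-chunk encoder sees the same multi-channel layout and the sequence model receives the chunks in temporal order.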

2605.29202 2026-05-29 cs.LG 版本更新

Auditing Training Data in Generative Music Models via Black-Box Membership Inference

通过黑盒成员推断审计生成音乐模型中的训练数据

Yi Chen Liu, Jiawei Yu, Kexin Cao, Syed Irfan Ali Meerza, Trishika Movva, Jian Liu

发表机构 * University of Georgia(佐治亚大学) Independent Researcher(独立研究者) University of Tennessee(田纳西大学)

AI总结 本文提出一种黑盒成员推断方法,通过比较候选音频与模型基于其描述生成输出的语义对齐程度,并训练音乐审计器分类成员身份,实现对生成音乐模型训练数据的高精度审计。

Comments The paper has been accepted for presentation at the workshop ArtSec 2026: Workshop on Artwork Security and Provenance in the Age of AI

详情
AI中文摘要

近期文本到音乐生成的进展实现了结构化音乐音频的高保真合成,引发了对数据来源、同意和训练透明度的日益关注。这些模型通常在很少披露的大规模语料库上训练,没有实际机制来验证特定音频样本是否包含在训练中。在本文中,我们研究了生成音乐模型的黑盒成员推断,旨在仅通过查询部署系统来确定候选音乐样本是否在训练中使用。我们的关键见解是,训练成员身份会导致候选样本与模型基于其描述生成的结果之间系统性地更强的语义和结构对齐。我们使用相关描述查询目标模型,并在学习特征空间中测量候选音频与生成输出之间的关系。为了捕捉区分成员和非成员的特征,我们构建了由每个曲目及其基于描述生成的影子模型组成的配对示例,并训练音乐审计器分类成员身份。该审计器捕捉训练成员身份特有的对齐模式,并在完全黑盒设置下泛化到未见过的目标模型,无需访问模型参数或训练元数据。在多个最先进的音乐生成器上,我们的方法达到了高达98.6%的准确率,假阳性和假阴性率低至1.9%和1.0%,表明在现实部署场景中可靠的训练数据审计是可行的。

英文摘要

Recent advances in text-to-music generation enable high-fidelity synthesis of structured musical audio, raising growing concerns about data provenance, consent, and training transparency. These models are typically trained on large-scale corpora with little disclosure, leaving no practical mechanism to verify whether a particular audio sample was included in training. In this paper, we investigate black-box membership inference for generative music models, aiming to determine whether a candidate music sample was used during training, given only query access to the deployed system. Our key insight is that training membership induces systematically stronger semantic and structural alignment between a candidate sample and the model's generation conditioned on its caption. We query the target model with the associated caption and measure the relationship between the candidate audio and the generated output in a learned feature space. To capture features that separate members from non-members, we construct paired examples consisting of each track and its caption-conditioned generation from shadow models, and train a music auditor to classify membership. The auditor captures alignment patterns characteristic of training membership and generalizes to unseen target models in a fully black-box setting without access to model parameters or training metadata. Across multiple state-of-the-art music generators, our method achieves up to 98.6% accuracy, with false-positive and false-negative rates as low as 1.9% and 1.0%, demonstrating that reliable training-data auditing is feasible in realistic deployment scenarios.

2605.29194 2026-05-29 cs.LG cs.AI cs.NA math.NA 版本更新

Stochastic Lifting for Generating Trajectories of Stochastic Physical Systems

随机提升:生成随机物理系统轨迹

Jules Berman, Tobias Blickhan, Benjamin Peherstorfer

发表机构 * Courant Institute of Mathematical Sciences, New York University, New York, NY 10012, USA(Courant数学科学研究所,纽约大学,纽约,NY 10012,USA)

AI总结 提出随机提升方法,通过为每个状态转换附加独立高维随机标签并学习从当前状态和标签到下一状态的映射,以生成多样化的随机物理系统轨迹。

详情
AI中文摘要

许多随机物理系统随时间平滑演化,即状态分布随时间步长规则变化。从当前状态到下一状态的转移通常可以建模为平滑映射和显式随机源的组合。随机提升利用这一结构,通过为训练数据中的每个状态转换附加一个独立的高维随机标签,并使用标准回归损失拟合从当前状态和标签到下一状态的转移映射。这些标签作为辅助坐标,使模型能够从相似的当前状态表示多个可能的下一状态,避免在有限样本量下崩溃为均值预测。在推理时,每个时间步采样新的标签,并将学习到的映射自回归地向前滚动,每个时间步仅需一次网络评估即可生成多样化的轨迹。

英文摘要

Many stochastic physical systems evolve smoothly over time in the sense that the distribution of states changes regularly across time steps. The transition from current state to the next state can often be modeled as the combination of a smooth map and an explicit source of randomness. Stochastic Lifting exploits this structure by attaching an independent, high-dimensional random label to each state transition in the training data and fitting a transition map from the current state and label to the next state using a standard regression loss. The labels act as auxiliary coordinates that let the model represent multiple plausible next states from similar current states, avoiding collapse to a mean prediction in the finite-sample size regime. At inference, fresh labels are sampled at each time step and the learned map is rolled forward autoregressively, generating diverse trajectories with a single network evaluation per time step.
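
The mechanics described above (attach a random label to each transition, regress, then roll out with fresh labels) can be shown on a toy problem. In this sketch a linear least-squares map stands in for the network and the one-dimensional system is invented for illustration; both are assumptions, and the example demonstrates the procedure rather than the method's accuracy.

```python
import numpy as np

rng = np.random.default_rng(0)
label_dim = 32  # dimension of the random label (hypothetical choice)

# Toy stochastic system (assumed for illustration): x' = 0.9 x + 0.3 * noise.
x = rng.normal(size=500)
x_next = 0.9 * x + 0.3 * rng.normal(size=500)

# Attach an independent random label to every observed transition.
labels = rng.normal(size=(500, label_dim))
features = np.column_stack([x, labels])

# Standard least-squares regression from (state, label) to next state.
weights, *_ = np.linalg.lstsq(features, x_next, rcond=None)

def rollout(x0, steps, rng):
    """Roll the learned map forward, sampling a fresh label at each step."""
    traj = [x0]
    for _ in range(steps):
        z = rng.normal(size=label_dim)
        traj.append(float(np.concatenate([[traj[-1]], z]) @ weights))
    return np.array(traj)

traj_a = rollout(1.0, 20, np.random.default_rng(1))
traj_b = rollout(1.0, 20, np.random.default_rng(2))
```

Two rollouts from the same initial state diverge because each step draws a new label, and each step costs a single evaluation of the learned map. A linear map with far fewer label dimensions than samples absorbs only a small share of the noise, so the spread here is much narrower than the true process.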

2605.29190 2026-05-29 cs.LG cs.CL 版本更新

When RL Suppresses Its Own Vocabulary: Recovering Reasoning Diversity in Puzzle-to-Math Transfer

当RL抑制自身词汇:在谜题到数学迁移中恢复推理多样性

Mayug Maniparambil, Arjun Karuvally, Terrence Sejnowski, Fergal Reid

发表机构 * Fin AI Research(Fin AI研究院) Salk Institute for Biological Studies(萨尔克生物医学研究所)

AI总结 本文提出一种基于可验证奖励的强化学习框架,通过引入新颖性奖励机制恢复被抑制的探索性推理原语,实现从约束满足谜题到数学问题的跨领域迁移,在无需数学数据的情况下将OlymMATH-Hard的pass@32从16%提升至36%。

Comments Preprint

详情
AI中文摘要

使用可验证奖励的强化学习(RLVR)改进了大语言模型的推理能力,但其跨领域迁移的条件及原因仍未被充分探索。我们研究了一个7B模型在仅使用约束满足谜题进行SFT和RL后训练(无数学问题)时的跨领域迁移。为了分析迁移如何产生,我们引入了一个推理原语级框架,该框架结合了9类跨度分类器和基序提取,使我们能够将思维链轨迹分割为原语基序,并追踪其在训练阶段和领域间的演变。我们发现,谜题SFT诱导了一个推理原语词汇,在OlymMATH-Hard上带来了+7pp的pass@32提升。随后,普通GSPO将这些原语组合成更长的计算-验证链,进一步增加了+6pp。然而,这个RL阶段也抑制了探索性原语,如“假设”和“回溯”。为了解决这个问题,我们引入了一个新颖性奖励,奖励多样化的正确轨迹,使用参考模型下的困惑度作为信号。这恢复了RL期间的恢复原语,并相对于普通GSPO额外增加了+7pp的pass@32。最终,端到端配方将硬数学能力上限从OLMo3-7B-Instruct-SFT基线的16.0%提升至36.0%,且在SFT或RL阶段未添加任何数学问题。

英文摘要

Reinforcement learning using verifiable rewards (RLVR) improves LLM reasoning, but the conditions under which it transfers across domains -- and why it does so -- remain under-explored. We study cross-domain transfer in a 7B model whose SFT and RL post-training stages use only constraint-satisfaction puzzles, with no mathematics problems in the post-training data. To analyze how transfer emerges, we introduce a reasoning primitive-level framework that combines a 9-class span classifier with motif extraction, allowing us to segment chain-of-thought traces into primitive motifs and track their evolution across training stages and domains. We find that puzzle SFT induces a reasoning-primitive vocabulary, yielding a +7 pp pass@32 gain on OlymMATH-Hard. Vanilla GSPO then composes these primitives into longer compute-verify chains, adding a further +6 pp. However, this RL stage also suppresses exploratory primitives such as "hypothesize" and "backtrack". To address this, we introduce a novelty bonus that rewards diverse correct rollouts, using perplexity under the reference model as a signal. This restores recovery primitives during RL and adds a further +7 pp pass@32 relative to vanilla GSPO. Finally, the end-to-end recipe raises the hard-math capability ceiling from 16.0% at the OLMo3-7B-Instruct-SFT base to 36.0%, without adding any mathematics problems during the SFT or RL stages.
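
One way to picture a perplexity-based novelty bonus is shown below. The abstract does not give the exact reward formula, so the additive form, the `bonus_weight` parameter and the function names are hypothetical; only the ingredients (correctness gating and perplexity under a reference model) come from the text.

```python
import math

def perplexity(token_logprobs):
    """Perplexity of a rollout from its per-token log-probabilities."""
    return math.exp(-sum(token_logprobs) / len(token_logprobs))

def shaped_reward(is_correct, ref_token_logprobs, bonus_weight=0.1):
    """Task reward plus a novelty bonus paid only to correct rollouts.

    ref_token_logprobs are the rollout's token log-probs under the
    reference model; higher perplexity is treated as more novel.
    The additive form and bonus_weight are illustrative assumptions.
    """
    if not is_correct:
        return 0.0
    return 1.0 + bonus_weight * math.log(perplexity(ref_token_logprobs))

common = [math.log(0.9)] * 5    # rollout the reference model finds likely
unusual = [math.log(0.2)] * 5   # rollout the reference model finds surprising
reward_common = shaped_reward(True, common)
reward_unusual = shaped_reward(True, unusual)
```

Gating the bonus on correctness keeps the signal from rewarding surprising but wrong rollouts.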

2605.29184 2026-05-29 cs.LG cs.AI 版本更新

Influence-Guided Symbolic Regression: Scientific Discovery via LLM-Driven Equation Search with Granular Feedback

影响引导的符号回归:基于大语言模型与细粒度反馈的方程搜索科学发现

Evgeny S. Saveliev, Samuel Holt, Nabeel Seedat, David L. Bentley, Jim Weatherall, Mihaela van der Schaar

发表机构 * University of Cambridge(剑桥大学) Thomson Reuters Foundational Research(汤姆森·路透基础研究) U. Colorado, Anschutz Medical Campus(科罗拉多大学安舒茨医疗校区)

AI总结 提出影响引导符号回归(IGSR)方法,利用大语言模型生成候选函数并通过细粒度影响分数进行剪枝,结合蒙特卡洛树搜索高效探索组合空间,在多个基准和真实生物数据中发现新关系。

Comments ICML 2026

详情
AI中文摘要

大型语言模型(LLM)为科学发现提供了有前景的途径,但它们在符号回归中的应用常受限于低效的搜索策略和粗糙的反馈信号。当前方法通常使用标量指标(如全局均方误差)指导LLM,这无法识别所提出方程中哪些成分驱动性能或导致误差。我们引入影响引导符号回归(IGSR),该方法将方程发现表述为一个迭代的两步过程,结合多样化的项生成与严格选择:LLM为线性模型生成候选基函数$\psi_j(\mathbf{x})$,然后使用细粒度影响分数$\Delta_j$进行评估。这些分数量化每个项对泛化准确性的边际贡献,从而实现影响引导的剪枝过程,系统地精炼模型结构。将此机制集成到蒙特卡洛树搜索(MCTS)中,能够在导航组合搜索空间的同时平衡对新函数形式的探索与对高影响成分的利用。我们在多个基准测试上展示了IGSR的有效性,包括LLM-SRBench、药理学PKPD模型、流行病学模拟和真实基因组数据。值得注意的是,我们通过一个高维生物数据集的案例研究验证了该框架的真正发现能力,其中IGSR识别出DNA甲基化与RNA聚合酶II暂停之间的新关系;该假设随后通过湿实验得到了支持。

英文摘要

Large Language Models (LLMs) offer a promising avenue for scientific discovery, yet their application to symbolic regression is often constrained by inefficient search strategies and coarse feedback signals. Current methods typically guide LLMs using scalar metrics (e.g., global Mean Squared Error), which fail to identify which components of a proposed equation are driving performance or causing error. We introduce Influence-Guided Symbolic Regression (IGSR), a method that frames equation discovery as an iterative two-step process combining diverse term generation with rigorous selection: an LLM generates candidate basis functions $\psi_j(\mathbf{x})$ for a linear model, which are then evaluated using granular influence scores $\Delta_j$. These scores quantify each term's marginal contribution to generalization accuracy, enabling an influence-guided pruning process that systematically refines the model structure. Integrating this mechanism into a Monte Carlo Tree Search (MCTS) enables navigating the combinatorial search space while balancing exploration of novel functional forms with exploitation of high-influence components. We demonstrate IGSR's effectiveness on a diverse suite of benchmarks, including LLM-SRBench, pharmacological PKPD models, an epidemiological simulation, and real-world genomic data. Notably, we validate the framework's capacity for genuine discovery in a case study using a high-dimensional biological dataset, in which IGSR identified a novel relationship between DNA methylation and RNA Polymerase II pausing, a hypothesis that was subsequently supported via wet-lab experimentation.
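
A per-term influence score of the kind described above can be approximated with a simple leave-one-term-out comparison on held-out data. The abstract does not specify how $\Delta_j$ is computed, so this definition, the toy target and the hand-written basis functions (standing in for LLM proposals) are assumptions.

```python
import numpy as np

def val_mse(train_X, train_y, val_X, val_y):
    """Fit a linear model by least squares and return held-out MSE."""
    coef, *_ = np.linalg.lstsq(train_X, train_y, rcond=None)
    return float(np.mean((val_X @ coef - val_y) ** 2))

def influence_scores(basis, x_train, y_train, x_val, y_val):
    """Delta_j = validation MSE without term j minus validation MSE with all terms."""
    Phi_tr = np.column_stack([f(x_train) for f in basis])
    Phi_va = np.column_stack([f(x_val) for f in basis])
    full = val_mse(Phi_tr, y_train, Phi_va, y_val)
    scores = []
    for j in range(len(basis)):
        keep = [k for k in range(len(basis)) if k != j]
        scores.append(val_mse(Phi_tr[:, keep], y_train, Phi_va[:, keep], y_val) - full)
    return scores

rng = np.random.default_rng(0)
target = lambda x: 2 * np.sin(x) + 0.5 * x**2          # hidden ground truth (toy)
x_train, x_val = rng.uniform(-3, 3, 200), rng.uniform(-3, 3, 200)
y_train = target(x_train) + 0.05 * rng.normal(size=200)
y_val = target(x_val) + 0.05 * rng.normal(size=200)
basis = [np.sin, np.square, lambda x: np.cos(3 * x)]   # last term is a distractor
deltas = influence_scores(basis, x_train, y_train, x_val, y_val)
```

Terms whose removal barely changes held-out error (here the `cos(3x)` distractor) are natural pruning candidates, which is more informative feedback than a single global MSE.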

2605.29172 2026-05-29 cs.LG physics.ao-ph 版本更新

Probabilistic bias adjustment of seasonal forecasts using generative machine learning: A case study of Arctic sea ice predictions

基于生成式机器学习的季节预报概率偏差校正:以北极海冰预测为例

Parsa Gooya, Reinel Sospedra-Alfonso

发表机构 * Canadian Centre for Climate Modelling and Analysis(加拿大气候建模与分析中心)

AI总结 本研究提出基于条件变分自编码器的概率后处理框架,通过生成器替代高斯参数化解码器并采用连续排序概率评分优化,有效校正季节预报的系统偏差并提升分辨率与谱能量。

详情
AI中文摘要

季节气候预测通过提供未来几个月最可能发生的气候条件及其相关不确定性的早期信息,支持规划和风险管理。集合预报通过模拟许多可能的结果来实现这一点,使得预测能够以可用的概率形式表达。大集合和高分辨率预报通过更好地采样不确定性和捕捉更精细尺度的过程来加强这种指导,但会带来显著的计算成本。此外,预报集合存在漂移,并表现出系统偏差和随提前时间增长的时空误差,需要仔细的后处理和校准。加拿大气候建模与分析中心开发了一种基于条件变分自编码器(cVAE)的概率后处理框架,用于生成北极海冰的偏差校正季节预测的大集合。生成模型旨在学习以有偏模型预测为条件的观测分布。这使得能够生成任意大的、经过良好校准的、偏差校正的预测集合,且具有更高的技能。在此,我们扩展该框架以解决标准cVAE已知的局限性——预测中细尺度能量的损失和特征性的模糊。具体而言,我们在cVAE中使用生成器替代高斯参数化解码器,并在目标函数中使用连续排序概率评分代替均方误差。我们进一步使用比原始预报更高分辨率的目标数据集。我们表明,与基准预测相比,调整后的预测校准更好,与观测分布更一致,误差更小,同时相对于标准cVAE提高了原始预报的分辨率、锐度和谱功率。

英文摘要

Seasonal climate predictions support planning and risk management by offering early information on the climate conditions most likely to occur in the coming months, and on the associated uncertainties. Ensemble forecasts enable this by simulating many plausible outcomes, allowing predictions to be expressed as usable probabilities. Large ensembles and high-resolution forecasts strengthen this guidance by better sampling uncertainty and capturing finer-scale processes, but come with significant computational cost. Moreover, forecast ensembles drift and exhibit systematic biases and spatio-temporal errors that grow with lead time, requiring careful post-processing and calibration. A probabilistic post-processing framework based on conditional Variational Autoencoders (cVAEs) was developed at the Canadian Centre for Climate Modelling and Analysis to generate large ensembles of bias-adjusted seasonal predictions of Arctic sea ice. The generative model was designed to learn the observational distribution conditioned on the biased model prediction. This enables generation of arbitrarily large ensembles of well-calibrated, bias-corrected forecasts with improved skill. Here, we extend this framework to address the loss of fine-scale energy and the characteristic blurriness in predictions, a known limitation of standard cVAEs. Specifically, we employ a generator in place of the Gaussian-parametrized decoder in the cVAE and use the Continuous Ranked Probability Score in the objective function instead of the Mean Squared Error. We further use a target dataset of higher resolution than the raw forecast. We show that the adjusted forecasts are better calibrated, more consistent with the observational distribution, and exhibit smaller errors than benchmark predictions, while also enhancing the resolution of the raw forecasts and improving sharpness and spectral power relative to the standard cVAE.
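
The Continuous Ranked Probability Score mentioned above has a simple empirical form for an ensemble. The sketch below evaluates it for one scalar quantity (for example, a single grid cell); how the paper aggregates it over space and time is not stated in the abstract, and the function name is an assumption.

```python
import numpy as np

def ensemble_crps(members, observation):
    """Empirical CRPS: E|X - y| - 0.5 * E|X - X'| over ensemble members."""
    members = np.asarray(members, dtype=float)
    skill = np.mean(np.abs(members - observation))                    # distance to the observation
    spread = np.mean(np.abs(members[:, None] - members[None, :]))     # pairwise member distance
    return float(skill - 0.5 * spread)

score = ensemble_crps([0.0, 2.0], 1.0)
```

Unlike mean squared error on the ensemble mean, this score rewards ensembles whose spread matches their error, which is one reason it is a common training and verification target for probabilistic forecasts.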

2605.29168 2026-05-29 cs.AI cs.LG version update

Better Later Than Sooner: Neuro-Symbolic Knowledge Graph Construction via Ontology-grounded Post-extraction Correction

晚做总比早做好:基于本体后提取校正的神经符号知识图谱构建

Lorenzo Loconte, Timothy Hospedales, Cristina Cornelio

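Affiliations * University of Edinburgh, UK; Samsung AI Center, Cambridge, UK

AI summary Proposes a neuro-symbolic framework that uses post-extraction correction to resolve ontology inconsistencies in LLM-extracted knowledge graphs, reducing token usage and improving graph consistency.

Details
AI-generated Chinese abstract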

问答是AI中的核心挑战,特别是对于需要跨文档多跳推理或聚合、穷举等符号操作的复杂查询。检索增强生成已成为问答的主要方法,最近的基于图的变体通过组织知识以更好地支持组合性问题,部分解决了这些问题。然而,大多数基于文本图的RAG方法仍缺乏可靠回答复杂问题所需的符号操作结构。这推动了基于符号图的方法,该方法提取知识图谱,其关系是逻辑谓词,支持类似SQL的查询。然而,这些流程通常使用LLM进行KG提取,这可能导致一致性问题,即提取的事实可能违反常识本体约束。我们提出了一种用于本体基础KG构建的神经符号框架,结合了开放域提取、基于嵌入的类型和谓词规范化,以及针对本体违规的LLM校正。通过将校正推迟到后提取阶段,我们的方法避免了重复的LLM调用,显著减少了token使用,同时提高了KG一致性并保持了下游问答质量。最后,通过测量SPARQL图模式的出现,我们展示了提取的KG非常适合符号查询。

English abstract

Question answering (QA) is a core challenge in AI, particularly for complex queries requiring multi-hop reasoning across documents, or symbolic operations like aggregation or exhaustive listing. Retrieval-augmented generation has become the dominant approach to QA, with recent graph-based variants addressing part of these issues by organizing knowledge to better support compositional questions. However, most textual graph-based RAG methods still lack the structure needed for symbolic operations useful to answer complex questions reliably. This motivates symbolic graph-based approaches, which extract knowledge graphs (KGs) whose relations are logic predicates that enable SQL-like querying. Yet these pipelines typically use LLMs for KG extraction, which can introduce consistency issues, where extracted facts may violate commonsense ontology constraints. We propose a neuro-symbolic framework for ontology-grounded KG construction combining open-domain extraction, embedding-based canonicalization of types and predicates, and targeted LLM-based correction of ontology violations. By deferring corrections to a post-extraction stage, our method avoids repeated LLM calls, substantially reducing token usage while improving KG consistency and preserving downstream QA quality. Finally, we show that the extracted KGs are well suited for symbolic querying by measuring the occurrence of SPARQL graph patterns.

2605.29161 2026-05-29 cs.LG cs.AI version update

Evolutionary Refinement of Generative Graph Topologies: A Hybrid WGAN-GA Approach

生成图拓扑的进化精炼:一种混合WGAN-GA方法

James Sargant, Seyedeh Ava Razi Razavi, Renata Dividino, Sheridan Houghten

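Affiliations * Computer Science, Brock University, Canada

AI summary Proposes a hybrid WGAN-GA method that refines GAN-generated graph structures with a genetic algorithm, reducing deviations in degree and spectral distributions so that synthetic graphs more closely match real ones.

Comments 6 pages, 4 Figures, 4 Tables, IEEE World Congress on Computational Intelligence

Details
AI-generated Chinese abstract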

由于离散连通性、图大小变化和类别特定的结构模式,生成逼真的图结构数据具有挑战性。最近基于生成对抗网络(GAN)的图生成方法通过学习连通性和匹配类别特定的密度分布来改进边建模。然而,这些模型在与真实图相比时仍表现出明显的偏差,例如度和谱分布,表明重要的结构属性未完全保留。本工作旨在通过使用遗传算法(GA)精炼现有基于GAN的图生成器框架生成的图来减少这些偏差。在GAN框架中,生成器同时生成节点特征和连通性模式,而基于GNN的判别器评估图的真实性和类别一致性,以确保全局结构和类别对齐。在此基础上,我们应用GA来精炼生成图的边。精炼过程引导合成图更接近真实数据,同时保持多样性和新颖性。实验结果表明,与基础模型相比,GA精炼持续降低组合最大均值差异(MMD),从而生成更匹配真实结构模式的图。这表明进化精炼是纠正基于GAN的图生成器中残留结构偏差的有效且灵活的方法,提高了它们用于逼真图合成和数据增强的适用性。

English abstract

Generating realistic graph-structured data is challenging due to discrete connectivity, varying graph sizes, and class-specific structural patterns. Recent graph generation methods based on Generative Adversarial Networks (GANs) improve edge modelling by learning connectivity and matching class-specific density distributions. However, these models still exhibit noticeable deviations from real graphs, for example in their degree and spectral distributions, indicating that important structural properties are not fully preserved. This work aims to reduce these deviations by refining the graphs produced by an existing GAN-based graph generator framework with a Genetic Algorithm (GA). In the GAN framework, the generator produces both node features and connectivity patterns, while a GNN-based critic evaluates graph realism and class consistency to ensure global structural and class alignment. Building on this foundation, we apply a GA to refine the edges of generated graphs. The refinement process guides synthetic graphs toward closer agreement with real data while preserving diversity and novelty. Experimental results show that GA refinement consistently lowers the combined Maximum Mean Discrepancy (MMD) relative to the base model, leading to graphs that more closely match real structural patterns. This demonstrates that evolutionary refinement is an effective and flexible way to correct residual structural deviations in GAN-based graph generators, improving their suitability for realistic graph synthesis and data augmentation.
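
As a rough illustration of the evaluation metric mentioned above, the sketch below computes a squared Maximum Mean Discrepancy between two samples of a scalar graph statistic such as node degree. The Gaussian kernel, the bandwidth, and the function names are assumptions for illustration; the abstract does not specify how the paper's combined MMD is assembled.

```python
import math


def gaussian_kernel(a, b, sigma=1.0):
    """Gaussian (RBF) kernel between two scalars."""
    return math.exp(-((a - b) ** 2) / (2.0 * sigma ** 2))


def mmd_squared(xs, ys, sigma=1.0):
    """Biased estimate of MMD^2 between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], with each expectation
    replaced by an average over all pairs (including identical pairs).
    """
    def mean_kernel(us, vs):
        total = sum(gaussian_kernel(u, v, sigma) for u in us for v in vs)
        return total / (len(us) * len(vs))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2.0 * mean_kernel(xs, ys)


if __name__ == "__main__":
    real_degrees = [1, 2, 3]          # hypothetical degree samples
    close_degrees = [2, 3, 4]
    far_degrees = [5, 6, 7]
    print(mmd_squared(real_degrees, close_degrees))
    print(mmd_squared(real_degrees, far_degrees))
```

A refinement step that edits edges could use a quantity like this as part of its fitness: the closer the generated statistic's distribution is to the real one, the smaller the value.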

2605.29158 2026-05-29 cs.LG cs.IR q-bio.BM version update

PROTOCOL: Late Interaction Retrieval for Protein Homolog Search

PROTOCOL: 用于蛋白质同源搜索的延迟交互检索

Gabrielle Cohn, Rohan Gumaste, Minh Hoang, Vihan Lakshman

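Affiliations * MIT; Princeton University

AI summary Proposes ProtoCol, a model that applies ColBERT-style late interaction with MaxSim scoring over residue embeddings to improve the sensitivity of remote homology search; it outperforms several baselines on the SCOPe superfamily and Pfam clan benchmarks.

Details
AI-generated Chinese abstract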

蛋白质同源搜索是功能注释、结构预测和进化分析的基础,但在全局序列相似性较弱且经典比对方法灵敏度下降的“模糊区”中仍然具有挑战性。蛋白质语言模型提供了上下文感知的表示,可以在此范围内提高比对灵敏度。然而,先前的基于蛋白质嵌入的检索流程通常将这些表示池化为单个向量,可能掩盖揭示远程同源性的局部基序、结构域或保守残基。我们引入了ProtoCol,该模型将蛋白质表示为残基嵌入的集合,并使用ColBERT风格的延迟交互来测试残基级比较是否改善同源检索。ProtoCol独立编码蛋白质,保持候选表示可预计算,并通过残基嵌入上的MaxSim对候选进行评分。在SCOPe超家族和Pfam clan基准上,ProtoCol优于基于序列组成、比对、池化PLM和训练的单向量基线,支持延迟交互作为远程同源搜索的有效检索层。

English abstract

Protein homology search underlies function annotation, structure prediction, and evolutionary analysis, but remains challenging in the "twilight zone," where global sequence similarity is weak and classical alignment methods lose sensitivity. Protein language models provide context-aware representations that could improve alignment sensitivity in this regime. However, prior protein embedding-based retrieval pipelines often pool these representations into a single vector, potentially obscuring local motifs, domains, or conserved residues that reveal remote homology. We introduce ProtoCol, a model which represents proteins as sets of residue embeddings and uses ColBERT-style late interaction to test whether residue-level comparison improves homolog retrieval. ProtoCol encodes proteins independently, keeps candidate representations pre-computable, and scores candidates with MaxSim over residue embeddings. On SCOPe superfamily and Pfam clan benchmarks, ProtoCol outperforms sequence-composition, alignment-based, pooled PLM, and trained single-vector baselines, supporting late interaction as an effective retrieval layer for remote homology search.
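
The MaxSim scoring step described above can be sketched in a few lines. This follows the usual ColBERT-style definition (for each query token embedding, take the best dot product over candidate token embeddings, then sum); the function names and the toy 2-D "residue embeddings" are hypothetical, and embeddings are assumed to be unit-normalized.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, cand_embs):
    """Late-interaction score: sum over query residues of the best match.

    Both arguments are lists of per-residue embedding vectors. Candidate
    embeddings can be computed once and stored, since the two proteins are
    encoded independently and only compared at scoring time.
    """
    return sum(max(dot(q, c) for c in cand_embs) for q in query_embs)


if __name__ == "__main__":
    query = [[1.0, 0.0], [0.0, 1.0]]
    # Candidate A contains a close match for every query residue.
    cand_a = [[0.0, 1.0], [1.0, 0.0], [0.6, 0.8]]
    # Candidate B has a single residue that only partly matches each one.
    cand_b = [[0.6, 0.8]]
    print(maxsim_score(query, cand_a), maxsim_score(query, cand_b))
```

A pooled single-vector baseline would average each protein's residues before comparing, which can wash out exactly the kind of local match that candidate A exhibits here.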

2605.29157 2026-05-29 cs.LG cs.AI cs.CL version update

Parallax: Parameterized Local Linear Attention for Language Modeling

Parallax: 参数化局部线性注意力用于语言建模

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf, Shuming Hu, Zhaoran Wang

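Affiliations * Northwestern University; Tilde Research; University of Washington

AI summary Proposes Parallax, a scalable parameterized local linear attention mechanism that eliminates the numerical solver and learns a query-like projector, yielding consistent perplexity improvements in language-model pretraining and gains that transfer to downstream tasks.

Details
AI-generated Chinese abstract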

大型语言模型(LLM)已成为人工智能的核心范式,但注意力的核心计算原语在结构上仍未改变。局部线性注意力(LLA)是一种从测试时回归框架的非参数统计中推导出的注意力机制。与先前关于高效注意力变体的研究相比,LLA将softmax注意力中的局部常数估计升级为局部线性估计,在关联记忆上提供了可证明更优的偏差-方差权衡。然而,由于计算和数值稳定性问题,LLA尚未在LLM预训练中扩展。我们引入Parallax,一种可扩展用于LLM的参数化局部线性注意力。Parallax消除了LLA中的数值求解器,并学习一个额外的类似查询的投影器来探测KV协方差。我们将Parallax置于一个由带宽、投影器构造和仿射结构连接的注意力机制家族中。我们提出一种硬件感知算法,提高了相对于FlashAttention的算术强度,将注意力转移到更受计算限制的区域。我们的原型解码核在各种批大小和上下文长度下匹配或超越FlashAttention 2/3。我们在0.6B和1.7B规模上预训练Parallax,发现整个预训练过程中困惑度持续改善,且收益迁移到下游基准测试。在参数匹配和计算匹配的控制下,优势持续存在,展示了帕累托改进。我们进行了仔细的预训练消融实验,并发现了一个新现象:Muon优化器解锁了Parallax的能力。据我们所知,这是架构研究文献中首次对注意力机制进行强架构-优化器协同设计的实证演示。

English abstract

Large Language Models (LLMs) have become the central paradigm in artificial intelligence, yet the core computational primitive of attention has remained structurally unchanged. Local Linear Attention (LLA) is an attention mechanism derived from nonparametric statistics in the test-time regression framework. In contrast to prior research on efficient attention variants, LLA upgrades the local constant estimate in softmax attention to a local linear estimate, yielding provably superior bias-variance tradeoffs for associative memory. However, LLA has not been scaled in LLM pretraining due to computational and numerical stability concerns. We introduce Parallax, a parameterized Local Linear Attention that is scalable for LLMs. Parallax eliminates the numerical solver in LLA and learns an extra query-like projector that probes the KV covariance. We place Parallax within a family of attention mechanisms connected by the bandwidth, the probe construction and the affine structure. We propose a hardware-aware algorithm that increases the arithmetic intensity over FlashAttention, shifting attention into a more compute bound regime. Our prototype decode kernel matches or outperforms FlashAttention 2/3 across diverse batch sizes and context lengths. We pretrain Parallax at 0.6B and 1.7B scales and find consistent perplexity improvements throughout pretraining with gains that transfer to downstream benchmarks. The advantage persists under both parameter-matched and compute-matched controls, demonstrating a Pareto improvement. We perform careful pretraining ablations and identify a novel phenomenon whereby Muon unlocks the capacity of Parallax. To our knowledge, this is the first empirical demonstration of strong architecture-optimizer codesign for attention mechanisms in the architecture research literature.
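
The contrast between a local constant and a local linear estimate can be illustrated with classical one-dimensional kernel regression. This sketch is a stand-in under stated assumptions, not Parallax or LLA: a Gaussian kernel replaces softmax weights, keys and values are scalars, and all function names are hypothetical.

```python
import math


def kernel_weights(query, keys, bandwidth):
    """Normalized Gaussian weights of each key relative to the query."""
    raw = [math.exp(-((k - query) ** 2) / (2.0 * bandwidth ** 2)) for k in keys]
    total = sum(raw)
    return [w / total for w in raw]


def local_constant(query, keys, values, bandwidth=1.0):
    """Weighted average of values (the attention-like, local constant estimate)."""
    w = kernel_weights(query, keys, bandwidth)
    return sum(wi * vi for wi, vi in zip(w, values))


def local_linear(query, keys, values, bandwidth=1.0):
    """Weighted least-squares line through (key, value) pairs, evaluated at query."""
    w = kernel_weights(query, keys, bandwidth)
    k_bar = sum(wi * ki for wi, ki in zip(w, keys))
    v_bar = sum(wi * vi for wi, vi in zip(w, values))
    var = sum(wi * (ki - k_bar) ** 2 for wi, ki in zip(w, keys))
    cov = sum(wi * (ki - k_bar) * (vi - v_bar) for wi, ki, vi in zip(w, keys, values))
    if var == 0.0:
        return v_bar  # all weight on one key: fall back to the constant fit
    return v_bar + (cov / var) * (query - k_bar)


if __name__ == "__main__":
    keys = [0.0, 1.0, 2.0, 3.0]
    values = [2.0 * k + 1.0 for k in keys]   # exactly linear target
    # Query at the edge of the keys: the constant fit is pulled toward the interior.
    print(local_constant(0.0, keys, values), local_linear(0.0, keys, values))
```

The covariance term in `local_linear` is the scalar analogue of the key-value covariance that the abstract says the extra projector probes; in higher dimensions it becomes a linear solve, which is the kind of cost a solver-free parameterization would avoid.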

2605.29156 2026-05-29 cs.LG cs.CL version update

RUBRIC-ARROW: Alternating Pointwise Rubric Reward Modeling for LLM Post-training in Non-verifiable Domains

RUBRIC-ARROW:面向不可验证领域的大语言模型后训练的交替逐点评分规则奖励建模

Haoxiang Jiang, Zihan Dong, Tianci Liu, Wanying Wang, Ran Xu, Tony Yu, Linjun Zhang, Haoyu Wang

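Affiliations * University at Albany; Rutgers University; Purdue University; Independent Researcher; Emory University; Georgia Institute of Technology

AI summary To address the difficulty of absolute scoring in non-verifiable domains, proposes RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, using a probability-based scoring rule and alternating GRPO to reduce ties, improve reward-modeling accuracy, and improve downstream policy post-training.

Details
AI-generated Chinese abstract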

逐点评分奖励建模为大语言模型后训练提供关键信号,但在主观、不可验证的设置中难以进行绝对评分。基于规则的方法通过将评估分解为显式标准来解决这一问题,但现有方法通常依赖前沿大语言模型,并因硬布尔聚合导致的平局而受限。我们提出RUBRIC-ARROW,一个交替框架,联合训练规则生成器和条件裁判,其强化学习阶段仅使用成对偏好数据。我们的方法结合了基于概率的评分规则(减少平局)、阶段特定的基于偏好的奖励以及交替GRPO方案,共同训练逐点评分器。大量实验表明,RUBRIC-ARROW实现了具有竞争力的奖励建模准确率,并为下游策略后训练带来一致的增益。

English abstract

Pointwise reward modeling offers critical signals for LLM post-training, yet struggles with absolute scoring in subjective, non-verifiable settings. Rubric-based methods address this by decomposing evaluation into explicit criteria, but existing approaches typically depend on frontier LLMs and suffer from ties caused by hard Boolean aggregation. We present RUBRIC-ARROW, an alternating framework that jointly trains a rubric generator and a rubric-conditioned judge, with its RL stage using only pairwise preference data. Our method couples a probability-based scoring rule that reduces ties with phase-specific preference-based rewards and an alternating GRPO scheme that together train the pointwise evaluator. Extensive experiments show that RUBRIC-ARROW achieves competitive reward-modeling accuracy and yields consistent gains for downstream policy post-training.

2605.29153 2026-05-29 cs.LG cs.AI physics.comp-ph version update

Unveiling Multi-regime Patterns in SciML: Distinct Failure Modes and Regime-specific Optimization

揭示科学机器学习中的多机制模式:不同的失败模式与机制特定优化

Yuxin Wang, Yuanzhe Hu, Xiaokun Zhong, Xiaopeng Wang, Haiquan Lu, Tianyu Pang, Michael W. Mahoney, Yujun Yan, Pu Ren, Yaoqing Yang

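Affiliations * Dartmouth College; University of California, San Diego; University of California, Berkeley; National University of Singapore; Lawrence Berkeley National Laboratory; International Computer Science Institute

AI summary Studies multi-regime behavior of scientific machine learning models across hyperparameter settings using a regime-aware diagnostic framework, finding a consistent three-regime structure, regime-specific optimization effectiveness, and fine-grained failure modes, and offering guidance for improving robustness.

Comments Accepted by ICML 2026

Details
AI-generated Chinese abstract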

在不同超参数设置下训练的神经网络可能落入不同的训练“机制”,这些机制内部行为一致,而机制间存在定性差异。本文通过一个机制感知的诊断框架,联合分析性能、训练动态和损失景观几何,研究科学机器学习(SciML)模型中的这种多机制行为。我们识别出三个关键发现:(i)在许多标准SciML模型、不同的约束施加方式以及各种优化器设计中,一致地出现一个三机制结构;(ii)优化效果是机制特定的,没有单一方法在所有机制中表现良好;(iii)SciML模型可能表现出精细的失败模式,这些模式可能挑战对标准损失景观度量的传统解释。我们的结果为建立SciML中失败模式的统一、任务无关视角提供了一种方法,并为提高鲁棒性提供机制感知的指导。我们在广泛使用的SciML模型上验证了这些发现,包括物理信息神经网络、神经算子和神经常微分方程,涵盖了代表性的常微分方程和偏微分方程基准。

English abstract

Neural networks trained under different hyperparameter settings can fall into distinct training "regimes," with consistent behavior within regimes and qualitative differences across regimes. In this paper, we study such multi-regime behavior in scientific machine learning (SciML) models through a regime-aware diagnostic framework that jointly analyzes performance, training dynamics, and loss-landscape geometry. We identify three key findings: (i) a consistent three-regime structure emerges across many standard SciML models, different constraint enforcements, and various optimizer designs; (ii) optimization effectiveness is regime-specific, with no single method performing well across all regimes; and (iii) SciML models can exhibit fine-grained failure modes that can challenge conventional interpretations of standard loss-landscape metrics. Our results provide an approach to establish a unified, task-oblivious perspective on failure modes in SciML and to inform regime-aware guidance for improving robustness. We validate these findings across widely-used SciML models, including physics-informed neural networks, neural operators, and neural ordinary differential equations, on benchmarks spanning representative ordinary and partial differential equations.

2605.29152 2026-05-29 cs.LG math.OC stat.ML version update

Do Deep Networks Forget Initialization? A Forgetting-Time View of Practical Inductive Bias

深度网络会忘记初始化吗?实用归纳偏见的遗忘时间视角

Mohua Das, Pierfrancesco Beneventano, Shibshankar Dey, Gareth H. McKinkey, Tomaso Poggio

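Affiliations * MIT; Northwestern University

AI summary Introduces an initialization-memory metric to study how random initialization affects the trained predictor, finding that low-learning-rate SGD retains initialization memory while Adam-family methods erase it, and that the forgetting dynamics are tied to the regularization that improves generalization.

Comments 39 pages, 9 figures

Details
AI-generated Chinese abstract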

随机初始化的神经网络在函数上诱导先验,但实践中使用的预测器仅在训练后产生。我们询问这种初始偏差有多少在训练流程中幸存。为了使问题可测量,我们引入初始化记忆:验证选择的预测器对随机初始化尺度的依赖性。我们在ResNet上进行了受控的CIFAR-10实验,其中初始化记忆已经尖锐地分离了训练机制。低学习率SGD可以在记住初始化的同时进行插值:在批大小$b=128$的ResNet-9上,尽管训练准确率$\ge99.5\%$,测试准确率在不同初始化尺度上变化$26.5$个百分点。这不是欠训练:将相同的低学习率机制扩展到$5{,}000$个epoch,差异基本不变。相比之下,Adam族方法在很大程度上消除了这种依赖性。当较大的学习率与显式$L_2$范数控制配对时,SGD也可以被遗忘。我们根据遗忘的时间尺度解释这些发现:梯度流式动力学可以保留初始化记忆,而随机有限步效应、显式范数衰减和自适应预处理在由显式或隐式正则化大小控制的尺度上消除它。因此,训练网络的实用归纳偏见不仅仅是架构先验,而是经过训练流程遗忘动力学过滤后的架构先验;并且改善泛化的相同正则化器正是那些消除初始化记忆的。

English abstract

Randomly initialized neural networks induce a prior over functions, but the predictor used in practice is produced only after training. We ask how much of this initial bias survives the training pipeline. To make the question measurable, we introduce initialization memory: the dependence of the validation-selected predictor on the scale of the random initialization. We perform controlled CIFAR-10 experiments on ResNets where initialization memory already sharply separates training regimes. Low-learning-rate SGD can interpolate while still remembering its initialization: on ResNet-9 with batch size $b=128$, test accuracy varies by $26.5$ percentage points across initialization scales despite $\ge99.5\%$ training accuracy. This is not undertraining: extending the same low-learning-rate regime to $5{,}000$ epochs leaves the spread essentially unchanged. In contrast, Adam-family methods largely erase the dependence. SGD can also be made to forget when larger learning rates are paired with explicit $L_2$ norm control. We interpret these findings in terms of the time scale of forgetting: gradient-flow-like dynamics can preserve initialization memory, whereas stochastic finite-step effects, explicit norm decay, and adaptive preconditioning erase it on scales governed by the size of explicit or implicit regularization. The practical inductive bias of a trained network is therefore not the architectural prior alone, but the architectural prior after being filtered by the forgetting dynamics of the training pipeline; and the same regularizers that improve generalization are precisely those that erase memory of initialization.

2605.29148 2026-05-29 cs.LG stat.ML version update

Optimal Gap-Dependent Regret for Private Stochastic Decision-Theoretic Online Learning

私有随机决策理论在线学习的最优间隙相关遗憾

Tommaso Cesari, Roberto Colomboni

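Affiliations * School of Electrical Engineering and Computer Science, University of Ottawa; School of Mathematics, University of Bristol

AI summary For stochastic decision-theoretic online learning with full information and event-level pure differential privacy, proposes a horizon-free pure-DP algorithm and proves a regret bound of O(log K / Δ_min + log K / ε).

Details
AI-generated Chinese abstract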

我们研究具有完全信息和事件级纯差分隐私的随机决策理论在线学习。Hu和Mehta在COLT上提出的一个开放问题要求确定在纯事件级差分隐私下,随机决策理论在线学习的最优间隙相关遗憾率。对于$K$个动作,损失在$[0,1]$中,且唯一最优动作与次优动作的间隙为$Δ_{\min}$,已知下界为$\frac{\log K}{\min\{Δ_{\min},\varepsilon\}}$,或等价地,在通用常数范围内,为\[ \frac{\log K}{Δ_{\min}}+\frac{\log K}{\varepsilon} \]。我们给出一个无水平线的纯DP算法,并证明对于任意水平线$T$,显式遗憾界\[ \operatorname{Reg}_T \le 1000 \cdot \left(\frac{\log K}{Δ_{\min}}+\frac{\log K}{\varepsilon}\right) \]。数值常数未优化。该算法将时间划分为指数增长大小的块,每个块内执行单个动作,并通过指数机制(应用于前一个块的数据无关随机前缀)选择下一个动作。随机前缀将块遗憾转化为所有前缀长度上softmax选择误差的和。单个熵势参数以代价$\log K/\varepsilon$控制所有隐私主导的大间隙动作。

English abstract

We study stochastic decision-theoretic online learning with full information and event-level pure differential privacy. A COLT open problem of Hu and Mehta asks to determine the optimal gap-dependent regret rate for stochastic decision-theoretic online learning under pure event-level differential privacy. For $K$ actions, losses in $[0,1]$, and a unique best action separated from the second-best action by gap $Δ_{\min}$, the known lower bound is of order $ \frac{\log K}{\min\{Δ_{\min},\varepsilon\}}, $ or equivalently, up to universal constants, of order \[ \frac{\log K}{Δ_{\min}}+\frac{\log K}{\varepsilon}. \] We give a horizon-free pure-DP algorithm and prove the explicit regret bound \[ \operatorname{Reg}_T \le 1000 \cdot \left(\frac{\log K}{Δ_{\min}}+\frac{\log K}{\varepsilon}\right) \] for every horizon $T$. The numerical constant is not optimized. The algorithm partitions time into blocks of exponentially increasing size, plays a single action throughout each block, and chooses the next action by an exponential mechanism applied to a data-independent random prefix of the previous block. The random prefix converts block regret into a sum, over all prefix lengths, of softmax selection errors. A single entropy-potential argument controls all privacy-dominated large-gap actions at cost $\log K/\varepsilon$.
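
The block structure described in the abstract can be sketched as follows. This is a toy reading of the description, not the authors' algorithm: the doubling block sizes, the uniform choice of prefix length, and the `exp(-epsilon * loss / 2)` weighting (the textbook exponential mechanism for a utility with sensitivity 1) are assumptions, and every function name is hypothetical. No privacy or regret guarantee is claimed for this code.

```python
import math
import random


def block_lengths(horizon):
    """Split `horizon` rounds into blocks of sizes 1, 2, 4, ... (last one truncated)."""
    lengths, size, used = [], 1, 0
    while used < horizon:
        length = min(size, horizon - used)
        lengths.append(length)
        used += length
        size *= 2
    return lengths


def exponential_mechanism(cum_losses, epsilon, rng):
    """Sample an action index with probability proportional to exp(-epsilon * loss / 2)."""
    base = min(cum_losses)  # shift for numerical stability; does not change the distribution
    weights = [math.exp(-epsilon * (loss - base) / 2.0) for loss in cum_losses]
    threshold = rng.random() * sum(weights)
    running = 0.0
    for index, weight in enumerate(weights):
        running += weight
        if threshold < running:
            return index
    return len(weights) - 1


def run_blocked_mechanism(loss_rows, epsilon, seed=0):
    """Play one action per block; pick the next from a random prefix of the current block.

    `loss_rows[t][i]` is the loss of action i at round t (full information).
    Returns the list of actions played, one per round.
    """
    rng = random.Random(seed)
    num_actions = len(loss_rows[0])
    action = rng.randrange(num_actions)  # arbitrary first action
    played, start = [], 0
    for length in block_lengths(len(loss_rows)):
        block = loss_rows[start:start + length]
        played.extend([action] * length)
        prefix_len = rng.randint(1, length)  # drawn without looking at the losses
        cum = [sum(row[i] for row in block[:prefix_len]) for i in range(num_actions)]
        action = exponential_mechanism(cum, epsilon, rng)
        start += length
    return played


if __name__ == "__main__":
    rows = [[0.0, 1.0] for _ in range(64)]  # action 0 is always better
    print(run_blocked_mechanism(rows, epsilon=1.0))
```

Because each round's losses feed into exactly one selection step, the number of private releases grows only logarithmically with the horizon, which is one intuition for why block schemes pair well with event-level privacy.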

2605.29139 2026-05-29 stat.ML cs.LG version update

Anytime-Valid Federated Conformal RAG for LLM Swarms

面向LLM群体的任意有效联邦共形RAG

Prasanjit Dubey, Xiaoming Huo

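Affiliations * H. Milton Stewart School of Industrial and Systems Engineering, Georgia Institute of Technology

AI summary Proposes Anytime-FC-RAG, which uses a summable per-step calibration-deviation budget and a truncated betting e-process to extend federated conformal RAG to sequential coverage valid at every stopping time, with guarantees of time-uniform alarm validity, a Hoeffding-stitched cumulative-miscoverage envelope, and safety under adaptive control.

Details
AI-generated Chinese abstract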

联邦共形RAG(FC-RAG)为带宽受限的弱语言模型群体提供了无分布假设的覆盖保证,但仅限于固定时间范围。我们将其扩展到任意有效序贯覆盖:在每个停止时间均有效,且在可预测自适应控制(重新校准、每节点带宽升级、蒸馏学生刷新)下保持不变,且无需比固定时间范围FC-RAG更多的假设。朴素组合失败,因为FC-RAG的边缘覆盖界使得投注e过程在不利校准抽取下成为非超鞅,无法调用Ville不等式。我们提出Anytime-FC-RAG,这是一种序贯扩展,基于可累加的逐步校准偏差预算,将边缘界转换为校准好事件上的严格条件界,并配以在整个概率空间上为非负超鞅的截断投注e过程。由这两个要素,我们获得四个保证:时间均匀报警有效性$\mathbb{P}(\sup_t E_t \ge 1/δ_e) \le δ_e + δ_{\mathrm{cal}}$,相同总预算下的Hoeffding拼接累积误覆盖包络,任何可预测控制器(重新校准、带宽升级、学生刷新)下的安全性,以及通过可累加训练预算在无界序列的联邦探针-逻辑蒸馏(FPLD)刷新上的训练侧误差传播。实际结果是,仅在e过程超过警告阈值时升级检索带宽的自适应控制器,以显著更低的通信成本匹配固定高带宽调度的报警率。在GPT-2-small + MiniLM群体上对MMLU、DBpedia和AG News的实验验证了预测的报警率、检测延迟、包络覆盖以及14%-57%的带宽节省;报警仅在覆盖真正失效时触发。

English abstract

Federated Conformal RAG (FC-RAG) provides distribution-free coverage for a bandwidth-limited swarm of weak language models, but only at a fixed horizon. We extend it to anytime-valid sequential coverage: validity at every stopping time, preserved under predictable adaptive control (recalibration, per-node bandwidth escalation, distilled-student refresh), at no extra cost in assumptions over fixed-horizon FC-RAG. Naive composition fails because FC-RAG's marginal coverage bound makes the betting e-process a non-supermartingale on adverse calibration draws, and Ville's inequality cannot be invoked. We give Anytime-FC-RAG, a sequential extension built on a summable per-step calibration-deviation budget that converts the marginal bound into a strict conditional bound on a calibration-good event, paired with a truncated betting e-process that is a nonnegative supermartingale on the entire probability space. From these two ingredients, we obtain four guarantees: time-uniform alarm validity $\mathbb{P}(\sup_t E_t \ge 1/δ_e) \le δ_e + δ_{\mathrm{cal}}$, a Hoeffding-stitched cumulative-miscoverage envelope at the same total budget, safety under any predictable controller (recalibration, bandwidth escalation, student refresh), and training-side error propagation across an unbounded sequence of Federated Probe-Logit Distillation (FPLD) refreshes via a summable training budget. As a practical consequence, an adaptive controller that escalates retrieval bandwidth only when the e-process crosses a warning threshold matches the alarm rate of a fixed-high-bandwidth schedule at substantially lower communication cost. Experiments on a GPT-2-small + MiniLM swarm across MMLU, DBpedia, and AG News verify the predicted alarm rate, detection delay, envelope coverage, and $14$-$57\%$ bandwidth savings; the alarm fires when and only when coverage genuinely breaks.
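
For readers unfamiliar with betting e-processes, the following is a minimal generic sketch of one that monitors miscoverage against a target rate and raises an alarm at the threshold `1/delta`. It omits the truncation and the calibration-deviation budget that the abstract describes, uses a fixed bet size rather than an adaptive one, and all names are hypothetical.

```python
def betting_e_process(miscoverage, alpha, bet):
    """Wealth path E_t = prod_s (1 + bet * (m_s - alpha)).

    `miscoverage` is a sequence of 0/1 indicators (1 = the prediction set missed).
    If each m_s has conditional mean at most `alpha`, every factor has mean at
    most 1, so the wealth is a nonnegative supermartingale for 0 <= bet <= 1/alpha.
    """
    if not 0.0 <= bet <= 1.0 / alpha:
        raise ValueError("bet must lie in [0, 1/alpha] to keep the wealth nonnegative")
    wealth, path = 1.0, []
    for m in miscoverage:
        wealth *= 1.0 + bet * (m - alpha)
        path.append(wealth)
    return path


def first_alarm(path, delta):
    """First time (1-indexed) the wealth reaches 1/delta, or None if it never does."""
    for t, wealth in enumerate(path, start=1):
        if wealth >= 1.0 / delta:
            return t
    return None


if __name__ == "__main__":
    covered = [0] * 50                  # coverage holds every step
    broken = [1] * 10                   # coverage fails every step
    print(first_alarm(betting_e_process(covered, alpha=0.1, bet=2.0), delta=0.05))
    print(first_alarm(betting_e_process(broken, alpha=0.1, bet=2.0), delta=0.05))
```

Ville's inequality bounds the probability that such a supermartingale ever reaches `1/delta` by `delta`, which is what makes the alarm valid at any stopping time rather than only at a fixed horizon.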

2605.29138 2026-05-29 cs.RO cs.AI cs.LG cs.SY eess.SY version update

Multi-Resolution End-to-End Deep Neural Network for Optimizing Latency-Accuracy Tradeoff in Autonomous Driving

用于优化自动驾驶延迟-准确性权衡的多分辨率端到端深度神经网络

Qitao Weng, Heechul Yun

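Affiliations * University of Kansas, Lawrence

AI summary Proposes a multi-resolution end-to-end CNN that optimizes the latency-safety tradeoff in autonomous driving under a latency budget through runtime input-resolution selection and resolution retargeting.

Comments ICCPS 2026

Details
AI-generated Chinese abstract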

延迟-准确性权衡是深度神经网络在信息物理系统实时应用中的基础。在自动驾驶中,安全性尤其依赖于预测质量和从感知到执行的端到端延迟。我们观察到:(1) 当考虑延迟时,延迟最优的网络配置随场景上下文和计算可用性而变化;(2) 单一固定分辨率模型在条件变化时变得次优。我们提出了一种用于CARLA城市驾驶挑战的多分辨率端到端深度神经网络,使用单目摄像头输入。我们的方法采用支持多种输入分辨率的卷积神经网络,通过每分辨率批归一化,使得在延迟预算下运行时选择理想输入尺度成为可能,以及分辨率重定向,允许在没有原始训练数据集的情况下进行多分辨率训练。我们在CARLA中实现并评估了我们的多分辨率端到端CNN,以探索延迟-安全性边界。结果显示,相对于固定分辨率基线,每条路线的安全性指标——车道入侵、红灯违规和碰撞——一致改善。

English abstract

Latency-accuracy tradeoffs are fundamental in real-time applications of deep neural networks (DNNs) for cyber-physical systems. In autonomous driving, in particular, safety depends on both prediction quality and the end-to-end delay from sensing to actuation. We observe that (1) when latency is accounted for, the latency-optimal network configuration varies with scene context and compute availability; and (2) a single fixed-resolution model becomes suboptimal as conditions change. We present a multi-resolution, end-to-end deep neural network for the CARLA urban driving challenge using monocular camera input. Our approach employs a convolutional neural network (CNN) that supports multiple input resolutions through per-resolution batch normalization, enabling runtime selection of an ideal input scale under a latency budget, as well as resolution retargeting, which allows multi-resolution training without access to the original training dataset. We implement and evaluate our multi-resolution end-to-end CNN in CARLA to explore the latency-safety frontier. Results show consistent improvements in per-route safety metrics (lane invasions, red-light infractions, and collisions) relative to fixed-resolution baselines.
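
Runtime selection of an input scale under a latency budget could look like the sketch below. It assumes a profiled latency table and the simple policy "largest resolution that fits the budget"; the abstract does not state the selection policy, and the resolutions, latencies, and function name are hypothetical.

```python
def pick_resolution(latency_ms, budget_ms):
    """Choose an input resolution given per-resolution latency measurements.

    `latency_ms` maps (width, height) to a measured inference latency in ms.
    Returns the largest resolution (by pixel count) that fits the budget, or
    the fastest one when nothing fits.
    """
    feasible = [res for res, ms in latency_ms.items() if ms <= budget_ms]
    if not feasible:
        # Nothing meets the budget: degrade gracefully to the fastest option.
        return min(latency_ms, key=latency_ms.get)
    return max(feasible, key=lambda res: res[0] * res[1])


if __name__ == "__main__":
    # Hypothetical profile of one network at three input scales.
    profile = {(160, 120): 4.0, (320, 240): 9.0, (640, 480): 21.0}
    for budget in (2.0, 10.0, 50.0):
        print(budget, pick_resolution(profile, budget))
```

With per-resolution batch normalization, the chosen resolution would also select which set of normalization statistics to use, so one set of convolutional weights can serve every scale.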

2605.29136 2026-05-29 cs.CV cs.LG version update

Eulerian Gaussian Splatting using Hashed Probability Pyramids

使用哈希概率金字塔的欧拉高斯溅射

Mia Gaia Polansky, George Kopanas, Stephan Garbin, Todd Zickler, Dor Verbin

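Affiliations * Harvard University; Google DeepMind; Google

AI summary Proposes a probabilistic splat-based radiance field framework that replaces heuristic operations with a gradient-optimized volumetric probability density, enabling end-to-end optimization via a multi-scale hash grid and achieving state-of-the-art reconstruction quality on mip-NeRF 360 while retaining 3DGS rendering speed.

Comments CVPR 2026. Project Page: https://euleriansplatting.github.io

Details
AI-generated Chinese abstract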

我们引入了一种基于概率溅射的辐射场框架,该框架保留了3D高斯溅射(3DGS)的快速光栅化和测试效率,同时用基于梯度优化的体积概率密度替代了启发式原始操作。我们不通过手动调整的密集化(例如ADC)来重新定位、分割或剔除高斯体,而是将原始位置视为从持久、可学习的密度中抽取的样本。我们使用一种新颖的、内存高效的多尺度层次网格来实例化该密度,从而实现端到端的梯度优化。为了稳定优化,我们推导了一个具有控制变量的无偏梯度估计器,显著降低了方差。通过允许概率质量流向损失要求的地方,我们的框架消除了脆弱的先验,并自然地探索体积,在mip-NeRF 360上实现了最先进的重建质量,同时保持了3DGS级别的渲染速度。

English abstract

We introduce a probabilistic splat-based radiance field framework that retains the fast rasterization and test-time efficiency of 3D Gaussian Splatting (3DGS) while replacing heuristic primitive manipulation with gradient-based optimization of a volumetric probability density. Rather than relocating, splitting, or culling Gaussians via hand-tuned densification (e.g., ADC), we treat primitive locations as samples drawn from a persistent, learnable density. We instantiate this density using a novel, memory-efficient multi-scale hierarchical grid that enables end-to-end gradient-based optimization. To stabilize the optimization, we derive an unbiased gradient estimator with control variates that markedly reduces variance. By allowing probability mass to flow to where the loss demands, our framework eliminates brittle priors and naturally explores the volume, achieving state-of-the-art reconstruction quality on mip-NeRF 360 while preserving 3DGS-level rendering speed.

2605.29126 2026-05-29 cs.LG cs.AI version update

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

何时与多久?时间推理中的读出-中介角度

Shreyas Fadnavis, Praitayini Kanakaraj, Felix Wyss

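Affiliations * Bioscope AI

AI summary By measuring the angle between a linear probe and the subspace the model actually computes with, shows that probes can learn directions orthogonal to the model's computation, revealing a fundamental flaw in probe-based interpretability.

Details
AI-generated Chinese abstract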

线性探针几乎可以完美解码表示,但却可能与模型如何使用该表示完全无关。在语言模型的日历日期持续时间推理中,一个$\sin$/$\cos$探针从层的激活中恢复一年中的第几天,但消融其方向对模型的答案没有影响——而在同一层通过分布式对齐搜索(DAS)找到的四维子空间被消融时,性能完全崩溃。我们测量这两个子空间之间的角度——\emph{读出-中介角度}——发现它与两个随机子空间之间的角度(Haar均匀零假设)无法区分,这意味着探针学到了与模型实际计算正交的方向。逆向工程电路揭示了原因:注意力头通过学习的QK偏移(${\pm}30$和${\pm}61$天)路由月份粒度的上下文,然后MLP将\emph{何时}(绝对日期)转换为\emph{多久}(持续时间)——所有这些都在探针从未触及的因果子空间的下游。稀疏自编码器分解证实了这种分裂:探针对齐和DAS对齐的特征编码了语义上不相交的概念,因果重叠可忽略不计。这种分离在四个规模($1.5$-$9\,$B)和两个模型家族中重复出现,并在另外两个领域(空间位移、符号算术)有初步证据,表明读出-中介正交性是探针可解释性的一种普遍失败模式。这直接削弱了将探针部署为运行时安全监控的提议:探针可以在模型已悄然放弃的方向上报告高置信度。

English abstract

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert \emph{when} (absolute date) into \emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$-$9\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
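
A simplified version of the angle measurement can be sketched as the angle between one probe direction and a subspace given by an orthonormal basis. The abstract compares two subspaces, which in general involves principal angles; the single-direction case, the function name, and the toy 4-D vectors below are simplifying assumptions.

```python
import math


def angle_to_subspace_deg(direction, orthonormal_basis):
    """Angle in degrees between `direction` and the span of `orthonormal_basis`.

    With an orthonormal basis B, the projection of v onto the subspace has
    squared length sum_b (b . v)^2, and cos(theta) = |proj| / |v|.
    0 degrees means the direction lies inside the subspace; 90 means orthogonal.
    """
    norm = math.sqrt(sum(x * x for x in direction))
    if norm == 0.0:
        raise ValueError("direction must be nonzero")
    proj_sq = sum(
        sum(b_i * x_i for b_i, x_i in zip(basis_vec, direction)) ** 2
        for basis_vec in orthonormal_basis
    )
    cos_theta = min(1.0, math.sqrt(proj_sq) / norm)  # clamp against rounding error
    return math.degrees(math.acos(cos_theta))


if __name__ == "__main__":
    # Hypothetical 2-D "causal" subspace inside a 4-D activation space.
    causal_basis = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]
    print(angle_to_subspace_deg([0.0, 0.0, 1.0, 0.0], causal_basis))  # orthogonal probe
    print(angle_to_subspace_deg([1.0, 0.0, 1.0, 0.0], causal_basis))  # partly aligned probe
```

In high-dimensional activation spaces, random directions are nearly orthogonal to any fixed low-dimensional subspace, which is why the comparison against a random-subspace null matters when interpreting a large angle.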

2605.29121 2026-05-29 math.DS cs.AI cs.LG version update

A Minimal Bifurcation Model of Load Imbalance in a Softmax Mixture-of-Experts Router

Softmax混合专家路由器中负载不平衡的最小分岔模型

O. M. Kiselev

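Affiliations * Innopolis University

AI summary Proposes a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts layer, derived as the mean-field limit of a discrete reinforcement rule; finds that a supercritical pitchfork bifurcation causes load imbalance and derives exact parametric equations for the bifurcation set and the cusp catastrophe.

Comments 21 pages, 11 figures

Details
AI-generated Chinese abstract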

我们提出了一个两专家混合专家(MoE)层的自适应softmax路由的最小动力学模型。该模型作为离散强化规则的平均场极限得到:被选中的专家获得小的分数增量,而所有分数经历正则化衰减。在对称情况下,极限系统具有超临界叉形分岔:对于弱反馈,存在唯一的稳定平衡状态,而当反馈强度超过临界值时,出现两个稳定的不对称状态。当加入外部不对称性时,叉形分岔展开为一对折叠分岔,在控制参数平面中形成一个尖点。我们推导了分岔集和尖点灾变的局部规范型的精确参数方程。数值实验将这一图景与经验专家负载、一个小的可训练MoE模型、硬top-1 PyTorch路由以及一个关于数字的小型分类实验联系起来。结果为自适应MoE路由器中负载不平衡的突然转变提供了一个可控的低维机制。

English abstract

We propose a minimal dynamical model of adaptive softmax routing for a two-expert Mixture-of-Experts (MoE) layer. The model is obtained as a mean-field limit of a discrete reinforcement rule: the selected expert receives a small score increment, while all scores undergo regularizing decay. In the symmetric case the limiting system has a supercritical pitchfork bifurcation: for weak feedback there is a unique stable balanced state, whereas above a critical feedback strength two stable asymmetric states appear. When an external asymmetry is added, the pitchfork unfolds into a pair of fold bifurcations forming a cusp in the control-parameter plane. We derive exact parametric equations for the bifurcation set and the local normal form of the cusp catastrophe. Numerical experiments connect this picture to empirical expert load, a small trainable MoE model, hard top-1 PyTorch routing, and a small classification experiment on digits. The results provide a controlled low-dimensional mechanism for abrupt transitions to load imbalance in adaptive MoE routers.
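
One way to see the pitchfork numerically is to write down a mean-field equation consistent with the rule described above and integrate it. The specific form is an assumption, not taken from the paper: if the selected expert's score grows at rate `gain`, all scores decay at rate `decay`, and routing is a softmax with inverse temperature `beta`, then the score difference `d` obeys `d' = gain * tanh(beta * d / 2) - decay * d`. For this assumed equation the balanced state `d = 0` is stable when `gain * beta / 2 < decay` and unstable otherwise.

```python
import math


def settle(gain, beta, decay, d0=0.01, dt=0.01, steps=20000):
    """Integrate d' = gain * tanh(beta * d / 2) - decay * d with forward Euler.

    `d` is the score difference between the two experts. Starts from a small
    imbalance `d0` and returns the value after `steps` Euler steps.
    """
    d = d0
    for _ in range(steps):
        d += dt * (gain * math.tanh(beta * d / 2.0) - decay * d)
    return d


def expert_share(d, beta):
    """Routing probability of expert 1 under a two-way softmax on the scores."""
    return 1.0 / (1.0 + math.exp(-beta * d))


if __name__ == "__main__":
    weak = settle(gain=1.0, beta=1.0, decay=1.0)     # gain * beta / 2 < decay
    strong = settle(gain=4.0, beta=1.0, decay=1.0)   # gain * beta / 2 > decay
    print(weak, expert_share(weak, 1.0))
    print(strong, expert_share(strong, 1.0))
```

Starting from `d0 = -0.01` instead gives the mirror-image state, which reflects the symmetry of the two asymmetric branches in the symmetric case.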

2605.29114 2026-05-29 cs.CR cs.LG cs.RO version update

ReasonBreak: Probing Vulnerabilities in Reasoning-Enabled Vision-Language-Action Models for Autonomous Driving

ReasonBreak: 探测自动驾驶中具备推理能力的视觉-语言-行动模型的脆弱性

Mohammadreza Teymoorianfard, Jean-Philippe Monteuuis, Jonathan Petit, Amir Houmansadr

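Affiliations * University of Massachusetts Amherst; Qualcomm

AI summary Using black-box attacks, presents the first systematic study of how reasoning-enabled Vision-Language-Action models for autonomous driving respond to realistic input perturbations, finding that both reasoning and trajectory generation are vulnerable, which raises collision rates.

Details
AI-generated Chinese abstract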

具备集成推理能力的视觉-语言-行动(VLA)模型已被提出用于端到端自动驾驶,假设推理与轨迹生成之间存在紧密耦合。然而,此类系统在真实输入扰动下的鲁棒性尚未得到充分探索。我们表明,这些模型对真实输入扰动高度脆弱,在闭环仿真中推理攻击成功率高达89%,轨迹操控攻击成功率高达72%,导致碰撞率上升和安全指标下降。以NVIDIA近期开发的Alpamayo模型为代表,我们首次对具备推理能力的VLA模型在真实文本输入损坏下进行了系统性黑盒研究,评估了其对推理和驾驶行为的影响。我们引入了一个推理感知评估框架,捕捉推理的语义和结构方面,并结合以安全为中心的度量。我们还引入了一个基准,用于评估自动驾驶中推理-轨迹交互的攻击与防御。我们的结果强调了严格评估和改进防御的必要性,以确保自动驾驶中具备推理能力的VLA系统的安全性。

English abstract

Vision-Language-Action (VLA) models with integrated reasoning have been proposed for end-to-end autonomous driving, assuming a tight coupling between reasoning and trajectory generation. However, the robustness of such systems under realistic input perturbations remains largely unexplored. We show that these models are highly vulnerable to realistic input perturbations, with attacks achieving an attack success rate (ASR) of up to 89% on reasoning and up to 72% on trajectory manipulation in closed-loop simulation, leading to increased collision rates and degraded safety metrics. Using NVIDIA's recent Alpamayo models as representative industry-developed VLAs, we conduct the first systematic black-box study of reasoning-enabled VLA models under realistic textual input corruptions, evaluating their impact on reasoning and driving behavior. We introduce a reasoning-aware evaluation framework capturing both semantic and structural aspects of reasoning, along with safety-centric measures. We also introduce a benchmark for evaluating attacks and defenses on reasoning-trajectory interactions in autonomous driving. Our results highlight the need for rigorous evaluation and improved defenses to ensure the safety of reasoning-enabled VLA systems in autonomous driving.

2605.29108 2026-05-29 cs.LG version update

Bridging Chemists and AI: An Expert-Augmented Framework for Interpretable Route Evaluation

连接化学家与人工智能:一种专家增强的可解释路线评估框架

Yujia Guo, Mikhail Kabeshov, Tat Hong Duong Le, Samuel Genheden, Marco V. Mijangos, Varvara Voinarvoska, Giulia Bergonzini, Ola Engkvist, Samuel Kaski

Affiliations * Department of Computer Science, Aalto University; Discovery Sciences R&D, AstraZeneca; Department of Computer Science and Engineering, Chalmers University of Technology and University of Gothenburg; Department of Computer Science, University of Manchester

AI summary Proposes an expert-augmented, data-driven scoring framework that combines machine learning with chemists' domain knowledge for numerical and interpretable assessment of multi-step synthetic routes, substantially improving prediction accuracy.

Comments 13 pages, 11 figures, ELLIS Unconference Workshop: Generative Models, LLMs, and the Future of Molecular AI (ML4Molecules 2025)

Details
AI-generated Chinese abstract

选择高效的多步合成路线是有机合成中的一个核心挑战,特别是在药物化学和工艺化学中,路线选择直接影响可行性、成本和开发效率。数据驱动的评估系统常常过度简化合成设计的多目标性质,并依赖于代理数据集(如专利路线)而非普遍适用的标准。为了解决这一问题,我们引入了一种专家增强的数据驱动评分框架,该框架将机器学习与化学家的领域知识相结合,用于数值和可解释的路线评估。使用参考路线与机器生成路线之间的树编辑距离训练基于DeepSets的模型,然后通过专家评估进行微调,以产生定量分数和可解释的定性类别:好、合理和差。所得系统在类别评估预测上实现了0.78的Spearman相关系数和0.77的Pearson相关系数,在分数预测上实现了60.2%的top-1排名准确率,显著优于之前17.5%的基线水平。

英文摘要

Selecting efficient multi-step synthetic routes is a central challenge in organic synthesis, particularly in medicinal and process chemistry, where route choice directly impacts feasibility, cost, and development efficiency. Data-driven assessment systems often oversimplify the multi-objective nature of synthesis design and rely on proxy datasets, such as patent routes, rather than universally grounded criteria. To address this, we introduce an expert-augmented, data-driven scoring framework that integrates machine learning with chemists' domain knowledge for both numerical and explainable route assessment. A DeepSets-based model is trained using tree edit distance between reference and machine-generated routes, and then fine-tuned with expert evaluations to produce both quantitative scores and interpretable qualitative categories: Good, Plausible, and Bad. The resulting system achieves a Spearman correlation coefficient of 0.78 and a Pearson correlation of 0.77 for category assessment prediction, and 60.2% top-1 ranking accuracy for score prediction, substantially outperforming the previous baseline of 17.5%.

2605.29101 2026-05-29 cs.LG cs.IT math.IT 版本更新

Model Merging by Output-Space Projection

通过输出空间投影进行模型合并

Bethan Evans, Benjamin Etheridge, Stephen Roberts, Jared Tanner

发表机构 * Department of Mathematics, University of Oxford, Oxford, UK(牛津大学数学系) Department of Engineering Science, University of Oxford, Oxford, UK(牛津大学工程科学系)

AI总结 将模型合并形式化为凸二次规划问题,通过校准输入和微调模型输出最小化平方输出校准目标,并推导出预测合并质量的闭式诊断指标。

详情
AI中文摘要

模型合并将多个微调检查点合并为单个多任务模型,无需重新训练。现有方法——如任务算术、模型汤、TIES和DARE——计算高效且经验成功,但依赖于启发式设计选择,缺乏形式化的最优性保证。我们表明,合并可以形式化为关于残差更新的凸二次规划,产生的权重通过校准输入和微调模型输出最小化平方输出校准目标,并将现有方法作为特例包含在内。我们的框架产生一个闭式诊断——选定基捕获的残差能量比例——仅使用校准集即可预测下游合并质量。实验上,QP在单层设置中匹配或优于现有方法,并且我们刻画了最优基相对于更便宜的对角QP提供显著增益的条件。我们通过顺序逐层算法扩展到多层合并,并在语言和视觉基准上展示了一致的增益。

英文摘要

Model merging combines fine-tuned checkpoints into a single multi-task model without retraining. Existing methods - such as task arithmetic, model soups, TIES, and DARE - are computationally efficient and empirically successful, but rely on heuristic design choices and lack formal optimality guarantees. We show that merging can be formulated as a convex quadratic programme over residual updates, yielding weights that minimise a squared-output calibration objective using calibration inputs and fine-tuned model outputs, and subsuming existing methods as special cases. Our framework yields a closed-form diagnostic - the fraction of residual energy captured by a chosen basis - that predicts downstream merge quality using only the calibration set. Empirically, the QP matches or outperforms existing methods in the single-layer setting, and we characterise when the optimal basis provides significant gains over the cheaper diagonal QP. We extend to multi-layer merging via a sequential layer-wise algorithm and demonstrate consistent gains across language and vision benchmarks.
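The least-squares idea in the abstract above can be illustrated with a toy sketch. It is not the paper's formulation: it assumes a single linear layer with one output, one scalar coefficient per task residual, and an unconstrained objective (a special case of a convex QP). The names `merge_coefficients`, `residuals`, and `calib_inputs` are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination with partial pivoting.

    Assumes A is square and non-singular (illustrative helper only).
    """
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(n):
            if r != col:
                factor = M[r][col] / M[col][col]
                M[r] = [x - factor * y for x, y in zip(M[r], M[col])]
    return [M[i][n] / M[i][i] for i in range(n)]


def merge_coefficients(residuals, calib_inputs):
    """Least-squares merge coefficients for a one-output linear layer.

    residuals[j]    : residual update d_j = w_j - w_base for task j
    calib_inputs[i] : calibration input vectors for task i
    Minimises sum_i sum_{x in X_i} (sum_j a_j <d_j, x> - <d_i, x>)^2,
    i.e. the merged residual should reproduce each fine-tuned model's
    output on that task's own calibration inputs.
    """
    k = len(residuals)
    G = [[0.0] * k for _ in range(k)]  # Gram matrix of residual outputs
    b = [0.0] * k                      # correlation with the targets
    for i, inputs in enumerate(calib_inputs):
        for x in inputs:
            feats = [dot(d, x) for d in residuals]
            target = feats[i]          # output of task i's own residual
            for j in range(k):
                b[j] += feats[j] * target
                for l in range(k):
                    G[j][l] += feats[j] * feats[l]
    return solve(G, b)
```

In this toy setting, tasks whose calibration inputs excite disjoint directions get coefficient 1 each (plain task arithmetic), while tasks sharing the same calibration inputs get coefficient 0.5 each (uniform averaging).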

2605.29092 2026-05-29 cs.CV cs.LG cs.MM 版本更新

Lightweight Complementary-Cue Fusion for Robust Video Face Forgery Detection

轻量级互补线索融合用于鲁棒视频人脸伪造检测

Sunghwan Baek, Tariq Anwaar, Karanveer Singh, Rita Singh

发表机构 * Carnegie Mellon University(卡内基梅隆大学)

AI总结 提出轻量级融合模块,结合手工特征(小波去噪特征与相位谱或局部二值模式),在极小参数增加下显著提升视频人脸伪造检测的鲁棒性。

Comments 13 pages, 6 figures, 3 tables

详情
AI中文摘要

当前的人脸视频伪造检测器使用宽或双流骨干网络。我们证明,通过单个轻量级融合两个手工线索,可以在更小的模型下实现更高的准确率。基于Xception基线模型(2190万参数),我们构建了两个检测器:LFWS,它添加一个1x1卷积来结合低频小波去噪特征(WDF)和来自空间相位浅层学习(SPSL)的相位谱通道;以及LFWL,它以相同方式融合WDF和局部二值模式(LBP)。这个额外模块仅增加292个参数,使总参数保持在2190万,小于F3Net(2250万)且不到SRM(5530万)的一半。即使如此小的开销,融合模型在FaceForensics++上将平均曲线下面积(AUC)从74.8%提升至78.6%,在DFDC-Preview上从70.5%提升至74.9%,分别比Xception基线提高3.8%和4.4%。在八个公开基准上,它们也始终优于F3Net、SRM和SPSL,无需额外数据或测试时增强。这些结果表明,通过轻量级融合块精心配对的手工特征,可以以远低于可比频率检测器的成本提供有竞争力的鲁棒性。我们的发现提示需要重新评估人脸视频伪造检测中规模驱动的设计选择。

英文摘要

Current face video forgery detectors use wide or dual-stream backbones. We show that a single, lightweight fusion of two handcrafted cues can achieve higher accuracy with a much smaller model. Based on the Xception baseline model (21.9 million parameters), we build two detectors: LFWS, which adds a 1x1 convolution to combine a low-frequency Wavelet-Denoised Feature (WDF) with a phase-spectrum channel derived from Spatial-Phase Shallow Learning (SPSL), and LFWL, which merges WDF with Local Binary Patterns (LBP) in the same way. This extra module adds only 292 parameters, keeping the total at 21.9 million, smaller than F3Net (22.5 million) and less than half the size of SRM (55.3 million). Even with this minimal overhead, the fused models increase the average area under the curve (AUC) from 74.8% to 78.6% on FaceForensics++ and from 70.5% to 74.9% on DFDC-Preview, gains of 3.8% and 4.4% over the Xception baseline. They also consistently outperform F3Net, SRM, and SPSL in eight public benchmarks, without extra data or test-time augmentation. These results show that carefully paired, handcrafted features, combined through the lightweight fusion block, can provide competitive robustness at a significantly lower cost than comparable frequency-based detectors. Our findings suggest a need to reevaluate scale-driven design choices in face video forgery detection.

2605.29089 2026-05-29 cs.LG cs.AI cs.CV 版本更新

OISD: On-Policy Internal Self-Distillation of Language Models

OISD: 语言模型在策略内部自蒸馏

Xinyu Liu, Darryl Cherian Jacob, Yang Zhou, Jindong Wang, Pan He

发表机构 * Auburn University(奥本大学) William & Mary(威廉与玛丽学院)

AI总结 提出OISD框架,通过将最终层的预测信号蒸馏到中间层,结合logit对齐和注意力对齐,提升推理能力,在数学推理任务上显著优于基线。

Comments Under Review for Publication

详情
AI中文摘要

最近的强化学习后训练方法主要使用稀疏的结果级奖励来优化最终输出策略,而很大程度上忽略了中间表示中编码的预测信号。在本文中,我们引入了一种称为在策略内部自蒸馏的新范式,并提出了OISD框架,该框架通过将最终层的在策略预测信号转移到中间表示来改进推理。在展开和组相对策略优化(GRPO)优化过程中,最终层既充当策略,又充当所选中间层的分离内部教师,通过两种互补机制引导中间层与其对齐:logit对齐,传递高级推理行为(如何思考);注意力对齐,强制从最终层到所选中间层的一致注意力模式(看哪里),两者都不需要外部特权信息。我们的OISD与GRPO一起,采用带符号优势加权的Jensen-Shannon对齐来蒸馏信息丰富的中间表示,同时在统一行动策略下保持策略一致性。实验结果表明了OISD的有效性,在四个数学推理任务上,相对于强推理强化学习基线,取得了显著且一致的改进。代码将在https://github.com/THE-MALT-LAB/OISD发布。

英文摘要

Recent reinforcement learning (RL) post-training approaches primarily optimize the final output policy using sparse outcome-level rewards, while largely overlooking predictive signals encoded in intermediate representations. In this paper, we introduce a new paradigm called on-policy internal self-distillation and propose the OISD framework, which improves reasoning by transferring on-policy predictive signals from the final layer to intermediate representations. During rollout and Group Relative Policy Optimization (GRPO) optimization, the final layer acts as both the policy and a detached internal teacher for selected intermediate layers, which are guided to align with it through two complementary mechanisms: logit alignment, which transfers high-level reasoning behaviors (how to think), and attention alignment, which enforces consistent attention patterns (where to look) from the final layer to the selected intermediate layer, both without requiring external privileged information. Our OISD, together with GRPO, employs signed advantage-weighted Jensen-Shannon alignment to distill informative intermediate representations while preserving policy consistency under a unified acting policy. Experimental results demonstrate the effectiveness of OISD, with substantial and consistent improvements over strong reasoning RL baselines across four mathematical reasoning tasks. The code will be released at https://github.com/THE-MALT-LAB/OISD.
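The Jensen-Shannon alignment term mentioned in the abstract can be sketched as follows. This is only an illustration of the divergence between two next-token distributions. It leaves out the signed advantage weighting, the attention alignment, and the teacher detachment, whose exact forms the abstract does not give. The function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl(p, q):
    """KL(p || q) in nats; terms with p_i == 0 contribute zero."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(intermediate_logits, final_logits):
    """Jensen-Shannon divergence between two token distributions.

    Symmetric, zero for identical distributions, and bounded above
    by ln(2) when measured in nats.
    """
    p = softmax(intermediate_logits)
    q = softmax(final_logits)
    mix = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl(p, mix) + 0.5 * kl(q, mix)
```

Unlike a one-directional KL term, this quantity is symmetric and bounded, so the per-token alignment loss cannot grow without limit.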

2605.29078 2026-05-29 cs.AI cs.LG 版本更新

Bridging the Sim-to-Real Gap in Reinforcement Learning-Based Industrial Dispatching through Execution Semantics

弥合基于强化学习的工业调度中仿真到现实的鸿沟:通过执行语义

Jonathan Hoss, Noah Klarmann

发表机构 * Department of Industrial Engineering(工业工程系) Rosenheim Technical University of Applied Sciences(罗森海姆应用技术大学)

AI总结 提出一个策略无关的执行与测量层,通过构建决策有效快照、定义标准化执行合约并记录结果分歧,将执行不确定性转化为可观测的结构化数据,从而弥合仿真与现实的差距。

Comments Accepted for publication at the 24th IEEE International Conference on Industrial Informatics (INDIN 2026), held from 26 to 29 July 2026 in Melbourne, Australia

详情
AI中文摘要

事件驱动的调度策略越来越多地部署在工业环境中,其中决策是在异步和部分可观测的系统状态下做出的。因此,决策状态在时间上不一致,动作可行性未明确定义,执行错误的根源仍然模糊。这些问题限制了可靠性和可解释性。为弥补这一差距,提出一个策略无关的执行与测量层,用于调解调度策略与工业执行环境。该层从异步事件流构建决策有效快照,定义具有明确动作可行性的标准化执行合约,并将结果记录为策略意图、事务结果、物理执行和人工干预之间的分歧。这使得决策语义与执行行为分离,并使部署不匹配可观测且结构上可归因。使用离散事件仿真评估所提框架。结果表明,在所有观测滞后情况下均具有分析优势,因为未区分的执行失败被转化为具有完全归因覆盖的结构化类型化结果。在低观测滞后下操作优势最强,此时可避免的执行错误可在提交前预防。总体而言,该层将执行不确定性转化为用于评估和策略改进的监督数据。

英文摘要

Event-driven scheduling policies are increasingly deployed in industrial environments, where decisions are made under asynchronous and partially observed system states. As a result, decision states are not temporally consistent, action admissibility is not explicitly defined, and the origin of execution errors remains ambiguous. These issues limit both reliability and interpretability. To address this gap, a policy-neutral execution and measurement layer is proposed to mediate between scheduling policies and the industrial execution environment. The layer constructs decision-valid snapshots from asynchronous event streams, defines a standardized execution contract with explicit action admissibility, and records outcomes as divergences between policy intent, transactional outcomes, physical execution, and human intervention. This enables a separation between decision semantics and execution behavior and makes deployment mismatch observable and structurally attributable. The proposed framework is evaluated using a discrete-event simulation. The results show analytical benefits across all observation lag regimes, as undifferentiated execution failures are transformed into structured, typed outcomes with full attribution coverage. Operational benefits are strongest under low observation lag, where avoidable execution errors can be prevented before commitment. Overall, the layer turns execution uncertainty into supervisory data for evaluation and policy refinement.

2605.29075 2026-05-29 cs.LG 版本更新

Knowledge Offloading: Decomposing LLMs into Sparse Backbones and Memory Modules

知识卸载:将大语言模型分解为稀疏骨干和记忆模块

Karim Galliamov, Rochelle Choenni, Ivan Titov

发表机构 * University of Amsterdam(阿姆斯特丹大学) University of Edinburgh(爱丁堡大学)

AI总结 提出知识卸载(KOFF)框架,通过结构化剪枝和轻量级恢复模块将预训练LLM分解为稀疏共享骨干和领域特定记忆,在约12%全局稀疏度下保持模型性能,并发现语言特定神经元优先被移除。

详情
AI中文摘要

大语言模型将通用能力和领域特定知识编码在同一组参数中。我们探究这种能力是否可以重组:将广泛有用的计算保留在共享骨干中,而将专门知识移入外部记忆模块。我们提出知识卸载(KOFF),一个将预训练LLM分解为稀疏共享骨干和领域特定记忆的框架。从冻结的基础模型开始,我们联合学习结构化剪枝掩码和轻量级恢复模块,这些模块以LoRA适配器和学习型键值缓存的形式实现。在3B到8B的Llama和Qwen模型上,我们发现非平凡的能力可以从共享骨干中移出而不会导致模型能力大幅下降。在大约12%的全局稀疏度下,KOFF保留了未剪枝模型的大部分性能,而剪枝相同冻结模型但没有记忆则性能急剧下降。消融实验表明LoRA和学习型KV记忆是互补的,专门化分析表明学习到的分解是有意义的:语言特定神经元被优先移除,而语言通用神经元主要保留在骨干中。这些结果表明知识可以在共享核心和可交换的外部记忆之间重新分配。

英文摘要

LLMs encode both general capabilities and domain-specific knowledge in a single set of parameters. We ask whether this capacity can be reorganized: keeping broadly useful computation in a shared backbone, while moving specialized knowledge into external memory modules. We propose knowledge offloading (KOFF), a framework for decomposing a pretrained LLM into a sparse shared backbone and domain-specific memories. Starting from a frozen base model, we jointly learn a structured pruning mask and lightweight recovery modules, implemented as LoRA adapters and learned key-value caches. Across Llama and Qwen models from 3B to 8B, we find that non-trivial capacity can be moved out of the shared backbone without a large loss in model ability. At around 12% global sparsity, KOFF preserves much of the unpruned model's performance, while pruning the same frozen model without memories degrades sharply. Ablations show that LoRA and learned KV memories are complementary, and specialization analyses suggest that the learned decomposition is meaningful: language-specific neurons are preferentially removed while language-general neurons largely remain in the backbone. These results suggest that knowledge can be reallocated between a shared core and swappable external memories.

2605.29068 2026-05-29 cs.AI cs.CL cs.CR cs.LG 版本更新

Robust and Efficient Guardrails with Latent Reasoning

具有潜在推理的鲁棒高效防护栏

Siddharth Sai, Xiaofei Wen, Muhao Chen

发表机构 * University of California, Davis(加州大学戴维斯分校)

AI总结 提出COLAGUARD模型,通过阶段式训练将多步安全推理转移到连续潜在空间,在保持高安全性能的同时实现12.9倍加速和22.4倍令牌减少。

详情
AI中文摘要

随着大型语言模型(LLMs)在现实应用中的日益部署,维护其安全性至关重要。现有的安全防护栏通常依赖单次分类或更近期的蒸馏推理。基于推理的防护栏显著优于仅分类的基线,但会带来大量的查询延迟和令牌开销,使其不适用于高吞吐量部署。为了解决这一挑战,我们提出了COLAGUARD,一种通过阶段式训练课程将多步安全推理转移到连续潜在空间的防护栏模型,从而在推理时实现直接的隐藏状态传播。在涵盖八个安全基准的十个提示和响应审核设置上评估,COLAGUARD在宏观F1上比Llama Guard 3提高了8.24分,并与我们的显式推理基线GuardReasoner在宏观F1上相当,同时实现了12.9倍的加速和22.4倍的令牌使用减少。我们的结果表明,潜在推理为可部署的防护栏提供了一种实用的替代方案,以替代显式理由生成,共同提高安全鲁棒性和推理效率,而不是将它们视为竞争目标。

英文摘要

Maintaining the safety of large language models (LLMs) is crucial as they are increasingly deployed in real-world applications. Existing safety guardrails typically rely on single-pass classification or, more recently, distilled reasoning. Reasoning-based guardrails significantly outperform classification-only baselines, but they incur substantial query latency and token overhead that make them impractical for high-throughput deployment. To address this challenge, we propose COLAGUARD, a guardrail model that transfers multi-step safety reasoning into a continuous latent space through a stage-wise training curriculum, enabling direct hidden-state propagation at inference. Evaluated on ten prompt- and response-moderation settings spanning eight safety benchmarks, COLAGUARD improves macro-F1 by 8.24 points over Llama Guard 3 and matches our explicit reasoning baseline, GuardReasoner, in macro-F1 while delivering a 12.9X speedup and 22.4X reduction in token usage. Our results suggest that latent reasoning offers a practical alternative to explicit rationale generation for deployable guardrails, jointly improving safety robustness and inference efficiency rather than treating them as competing objectives.

2605.29042 2026-05-29 cs.AI cs.LG 版本更新

Differentiable Belief-based Opponent Shaping

基于可微信念的对手塑造

Aarav G Sane, Karthik Sivachandran, Rohan Paleja

发表机构 * Department of Computer Science(计算机科学系)

AI总结 提出D-BOS方法,通过可微的信念更新和梯度传播,在隐藏角色游戏中实现对手信念的塑造,从而自然涌现最优策略。

详情
AI中文摘要

人类协调往往依赖于通过战略行动影响他人信念的能力。在多智能体强化学习中,对手塑造试图复制这种影响,尽管现有方法通常作用于对手的参数、策略或价值空间。同时,隐藏角色游戏中的信念操纵技术通常依赖于硬编码的目标,如欺骗或信念饱和。我们提出基于可微信念的对手塑造(D-BOS),一种一阶方法,将每个观察者的信念视为被塑造的对手状态,并通过$k$步softmax-贝叶斯信念动力学进行微分。我们的方法不显式奖励欺骗或合作行为,而是将信念状态作为塑造目标。这使得最优策略能够从环境奖励结构中自然涌现。这种信念空间公式通过微分对手信念更新提供对手塑造信号,并通过聚合多个观察者个体推断信念轨迹上的梯度,自然地扩展到多个观察者。实验上,D-BOS在隐藏角色游戏中优于PPO和BBM,在混合动机设置中提升最大。

英文摘要

Human coordination often relies on the ability to influence the beliefs of others through strategic action. In multi-agent reinforcement learning, opponent shaping attempts to replicate this influence, though existing methods typically operate within an opponent's parameter, policy, or value space. Meanwhile, belief-manipulation techniques in hidden-role games often rely on hard-coded objectives, such as deception or belief saturation. We propose Differentiable Belief-based Opponent Shaping (D-BOS), a first-order method that treats each observer's belief as the shaped opponent state and differentiates through $k$-step softmax-Bayes belief dynamics. Rather than explicitly rewarding deceptive or cooperative behavior, our method treats the belief state as the target for shaping. This allows the optimal strategy to emerge naturally from the environment's reward structure. This belief-space formulation provides an opponent-shaping signal by differentiating through opponent belief updates, and naturally extends to multiple observers by aggregating gradients over their individual inferred belief trajectories. Empirically, D-BOS outperforms PPO and BBM in hidden-role games, with the largest gains in mixed-motive settings.

2605.29033 2026-05-29 cs.LG 版本更新

Moment Matching Q-Learning

矩匹配Q学习

Yiyan Liang, Sifei Liu, Weitong Zhang

发表机构 * School of Data and Information Science, University of North Carolina at Chapel Hill, Chapel Hill, USA(数据与信息科学学院,北卡罗来纳大学教堂山分校,教堂山,美国)

AI总结 提出矩匹配Q学习(MoMa QL)框架,利用最大均值差异(MMD)匹配原始分布与目标分布的所有阶统计量,实现条件得分函数的分布级收敛,在D4RL任务中计算效率高且性能相当,并在离线到在线强化学习中通过加速流策略的动作采样展现更优的适应性和性能。

Comments 23 pages, 14 figures, 10 tables, accepted by ICML 2026

详情
AI中文摘要

基于得分和流的生成模型在捕捉复杂分布方面表现出显著的表达能力,并已广泛应用于从图像生成到强化学习的任务中。然而,这些模型存在推理延迟长的问题,这在具有迭代采样的强化学习中造成了显著的计算瓶颈。为了克服这一限制,我们提出了一个名为矩匹配Q学习(MoMa QL)的新框架,该框架利用统计假设检验中的最大均值差异(MMD)技术,旨在匹配原始分布和目标分布之间的所有阶统计量。通过对所有矩统计量施加强正则化,该算法保证了条件得分函数的分布级收敛,并在各种超参数下保持稳定。实验表明,我们的方法MoMa QL在各种D4RL任务中计算效率更高,且性能相当甚至具有竞争力。值得注意的是,通过加速基于流的策略的动作采样过程,MoMa QL在离线到在线强化学习任务中表现出更优的性能,因为其在线交互微调更快且适应性更强。

英文摘要

Score-based and flow-based generative models exhibit remarkable expressive capacity in capturing complex distributions, and have been extensively deployed in tasks ranging from image generation to reinforcement learning. Nevertheless, these models suffer from prolonged inference latency, which imposes a significant computational bottleneck in RL with iterative sampling. To overcome this limitation, we propose a new framework named Moment Matching Q-Learning (MoMa QL), which uses maximum mean discrepancy (MMD), a technique from statistical hypothesis testing, to match all orders of statistics between the original and target distributions. By enforcing strong regularization on all moment statistics, this algorithm guarantees distribution-level convergence for the conditional score function and remains stable under various hyperparameters. Empirically, we show that our method MoMa QL is more computationally efficient, with comparable if not competitive performance, in various D4RL tasks. Remarkably, by accelerating the action sampling process for flow-based policies, MoMa QL demonstrates superior performance in offline-to-online RL tasks because of faster and stronger adaptability for online interactive finetuning.
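As a rough illustration of the MMD quantity named in the abstract, the sketch below computes the biased (V-statistic) estimate of squared MMD between two sample sets with a Gaussian kernel. The kernel choice, the `bandwidth` parameter, and the function names are assumptions for illustration and are not taken from the paper. For a characteristic kernel such as the Gaussian, the population MMD is zero only when the two distributions coincide, which is the sense in which it matches all moments.

```python
import math


def gaussian_kernel(x, y, bandwidth=1.0):
    """Gaussian (RBF) kernel between two equal-length vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2 * bandwidth ** 2))


def mmd_squared(xs, ys, bandwidth=1.0):
    """Biased estimate of squared MMD between samples xs and ys.

    MMD^2 = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)], with each
    expectation replaced by an average over all sample pairs.
    """
    def mean_kernel(a, b):
        total = sum(gaussian_kernel(u, v, bandwidth) for u in a for v in b)
        return total / (len(a) * len(b))

    return mean_kernel(xs, xs) + mean_kernel(ys, ys) - 2 * mean_kernel(xs, ys)
```

For example, `mmd_squared([[0.0], [1.0]], [[10.0], [11.0]])` is much larger than the value for two nearby sample sets, and the estimate is zero when both sets are identical.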

2605.29032 2026-05-29 cs.LG stat.ML 版本更新

Theoretical Foundations and Effective Algorithms for Policy-Aware Simulator Learning

策略感知模拟器学习的理论基础与有效算法

Christoph Dann, Yishay Mansour, Mehryar Mohri

发表机构 * Google Research(谷歌研究) Tel Aviv University(特拉维夫大学) Courant Institute of Mathematical Sciences(库朗数学科学研究所)

AI总结 针对基于模型的强化学习中的模拟器利用问题,提出以战略鲁棒性为目标,通过零和极小极大博弈学习模拟器,并给出理论保证与有效算法。

详情
AI中文摘要

基于模型的强化学习(MBRL)智能体通常通过最小化预测损失来学习世界模型。然而,强大的RL优化器不可避免地会利用微小的模型不准确性,导致模拟器利用和现实差距,即策略在模拟中成功但在现实世界中失败。我们提出学习模拟器的目标应该是战略鲁棒性而非预测准确性,并将其形式化为模型玩家与对抗策略玩家之间的零和极小极大博弈。我们提供了全面的理论分析:(1)在线学习保证,表明该博弈是可学习的,具有次线性遗憾界;(2)一个可处理的基于评论家的简化,通过局部评论家的损失来界定全局策略价值差距;(3)误差-MDP对偶性,证明寻找最坏情况策略在形式上是标准RL问题的对偶,其中奖励是一步评论家误差。这种对偶性产生了一个可证明收敛的主动数据选择算法。在连续控制任务上的实验表明,我们的方法在具有战略重要性的区域将预测误差降低了1.5-2.2倍,并使完全在模拟中训练的策略能够匹配接近最优的现实世界性能。

英文摘要

Model-based reinforcement learning (MBRL) agents typically learn world models by minimizing predictive loss. However, powerful RL optimizers inevitably exploit minor model inaccuracies, leading to simulator exploitation and a reality gap where policies succeed in simulation but fail in the real world. We propose that the objective for learning simulators should be strategic robustness rather than predictive accuracy, and formulate this as a zero-sum minimax game between a model player and an adversarial policy player. We provide a comprehensive theoretical analysis: (1) an online learning guarantee showing the game is learnable with sublinear regret bounds; (2) a tractable critic-based simplification bounding the global policy-value gap by the local critic's loss; and (3) an Error-MDP duality, proving that finding the worst-case policy is formally dual to a standard RL problem where the reward is the one-step critic error. This duality yields a provably convergent active data selection algorithm. Experiments on continuous control tasks demonstrate that our approach reduces prediction error in strategically important regions by $1.5$-$2.2\times$ and enables policies trained purely in simulation to match near-optimal real-world performance.

2605.29028 2026-05-29 cs.LG cs.AI 版本更新

Return-to-Go Is More Than a Number: Q-Guided Alignment for Return-Conditioned Supervised Learning

Return-to-Go 不仅仅是数字:面向回报条件监督学习的 Q 引导对齐

Yuxiao Yang, Weitong Zhang

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校)

AI总结 提出 Q-ALIGN DT 框架,通过确保输出策略的 Q 值与输入 RTG 一致,实现回报条件序列模型中 RTG 与策略性能的对齐,在 D4RL 基准上取得优越的可控性和性能。

Comments 28 pages, 13 figures, 20 tables, accepted by ICML 2026

详情
AI中文摘要

条件序列模型 (CSMs) 通过将 return-to-go (RTG) 作为控制信号来学习策略。然而,现有的 CSMs 通常将 RTG 视为简单的数值输入,而不是将其与策略的性能对齐。在本文中,我们提出了 Q-ALIGN DT 框架,通过确保输出策略的 $Q$-值与输入 RTG 一致来强制执行这种对齐。通过利用 $Q$ 函数为 CSMs 提供密集指导,并使用 RTG-扰动技术结合 CSM 进一步微调,我们的方法确保更高的 RTG 一致地映射到具有更高期望回报的轨迹。理论上,我们证明 Q-ALIGN DT 可以高效地学习期望策略,并在 RTG 足够高时输出接近最优的策略。实验上,我们通过大量实验证明 Q-ALIGN DT 在 D4RL 基准上实现了优越的可控性和性能。值得注意的是,我们的模型有效地学习了一个结构化的策略族,该策略族保持精确对齐,并泛化到速度跟踪等先前方法失败的任务。

英文摘要

Conditioned Sequence Models (CSMs) learn policies by treating return-to-go (RTG) as a control signal. However, existing CSMs often treat the RTGs as simple numerical inputs rather than aligning them with the performance of their policies. In this paper, we propose Q-ALIGN DT, a framework that enforces this alignment by ensuring the $Q$-value of the output policy is consistent with the input RTG. By leveraging a $Q$ function to provide dense guidance to CSMs and further fine-tuning it using an RTG-perturbation technique with the CSM, our method ensures that higher RTGs are consistently mapped to trajectories with higher expected returns. Theoretically, we show that Q-ALIGN DT can efficiently learn the desired policy and output a near-optimal one when the RTG is sufficiently high. Empirically, we demonstrate through extensive experiments that Q-ALIGN DT achieves superior controllability and performance across the D4RL benchmark. Remarkably, our model effectively learns a structured family of policies that maintains precise alignment and generalizes to tasks like velocity-tracking where prior methods fail.

2605.29021 2026-05-29 cs.LG 版本更新

Designing Active Tether-Net Systems for Space Debris Capture with Graph-Learning-Aided Mixed-Combinatorial Optimization

基于图学习辅助混合组合优化的空间碎片捕获主动绳网系统设计

Feng Liu, Achira Boonrath, Gishnu Madhu, Eleonora M. Botta, Souma Chowdhury

发表机构 * University at Buffalo(布法罗大学)

AI总结 针对主动绳网系统设计中涉及连续、整数和分类变量的混合组合非线性规划问题,提出图神经网络辅助优化方法,将MCNLP简化为NLP,实现网形态、MU质量和推进器选择及瞄准点的联合设计,相比直接求解MCNLP显著加快收敛速度。

Comments Accepted for presentation at 2026 AIAA Aviation Forum

详情
AI中文摘要

主动绳网系统通过部署由可机动单元(MU)操纵的柔性网,是捕获大型非合作目标(如空间碎片)的一种有前景的解决方案。然而,对绳网系统的设计和控制选择进行并发系统探索以了解其全部潜力的研究仍然有限,部分原因是其呈现的复杂、受约束的非线性优化问题,该问题涉及连续、整数和分类变量的混合,其中后两者分别来自网的连接性和组件选择。经典的二进制编码方法通常无法有效解决工程设计中的高度非线性和多模态混合组合非线性规划(MCNLP)问题,而整数编码方法可能在组合之间引入虚假关系。鉴于组合空间的图结构特征,本文采用并扩展了一种新的图学习辅助优化方法来解决这个MCNLP问题。其中,图神经网络(GNN)被训练以评分(作为输出)并据此推荐表示为图中节点的候选组合,候选设计的连续变量向量部分作为输入。因此,MCNLP优化简化为NLP,可以使用标准求解器求解。虽然这种简化方法与NLP求解器的选择无关,但本文使用了一种最先进的带梯度微调的粒子群优化(PSO)算法作为求解器。在并发设计网的形态、MU中的质量和推进器选择以及绳网系统控制器使用的瞄准点的问题上,基于GNN的推荐器被证明相比直接求解MCNLP问题,能够显著更快地收敛到类似的最优解。

英文摘要

Active tether-net systems are a promising solution for capturing large non-cooperative targets, such as space debris, by deploying a flexible net manipulated by maneuverable units (MUs). However, concurrent systematic explorations of design and control choices of the tether-net system to understand its full potential remain limited, partly due to the complex, constrained, nonlinear optimization problem that it presents -- one that involves a mixture of continuous, integer and categorical variables, with the latter two arising from net connectivity and component choices, respectively. Classical binary encoding methods are often ineffective for solving highly nonlinear and multimodal Mixed Combinatorial Nonlinear Programming (MCNLP) problems in engineering design, while integer coding approaches can introduce spurious relations among combinations. Given the graph-structured characteristics of the combinatorial space, this paper adopts and extends a new graph-learning-aided optimization approach to solve this MCNLP problem. Here, a Graph Neural Network (GNN) is trained to score (as output) and thereby recommend candidate combinations represented as nodes in a graph, with the continuous variable vector portion of a candidate design given as input. As a result, the MCNLP optimization reduces to an NLP, which can be solved using standard solvers. While this reduction approach is agnostic to the choice of the NLP solver, here a state-of-the-art Particle Swarm Optimization (PSO) algorithm with gradient-based fine-tuning is used as the solver. Demonstrated on the problem of concurrently designing the morphology of the net, choice of mass and thrusters in the MUs and aiming points used by the controller of the tether-net system, the GNN-based recommender is shown to provide significantly faster convergence to similar optimal solutions, compared to direct solution of the MCNLP problem.

2605.29016 2026-05-29 astro-ph.IM astro-ph.CO cs.LG 版本更新

Three-dimensional Conditional Diffusion Models for Cosmological 21 cm Lightcone Emulation

用于宇宙学21厘米光锥模拟的三维条件扩散模型

Bin Xia, John H. Wise

发表机构 * Center for Relativistic Astrophysics, School of Physics, Georgia Institute of Technology, Atlanta, GA 30332, USA(相对论天体物理中心,物理学院,佐治亚理工学院,亚特兰大,GA 30332,USA)

AI总结 针对三维21厘米光锥模拟的困难,通过对比预处理、动态范围压缩、架构深度和训练时长等配置,发现Yeo-Johnson预处理结合中等幅度压缩在全局信号的标准化平均绝对误差上表现最优,但视觉上合理的样本仍存在统计偏差。

详情
AI中文摘要

我们研究了用于三维21厘米光锥模拟的条件扩散模型,重点关注天空平面大小为$64\times64$、视线深度达1024个像素的立方体。与早期的二维研究相比,三维设置更加困难,因为内存限制导致微批次非常小,而底层体素分布高度偏斜且长尾。我们通过使用$25{,}600$个训练光锥和固定参数点的验证集成,对预处理选择、动态范围压缩设置、架构深度和训练时长进行了控制比较。在验证中,每个参考参数点包含800个具有独立初始条件的21cmFAST实现,并且每个模型和每个参考集使用800个样本进行报告的集成比较。我们通过图像和摘要统计空间中的互补诊断评估生成的光锥:亮温度切片、全局信号、功率谱和简化散射系数。在测试的配置中,预处理是控制稳定训练和最终物理保真度的主导因素。在此探索的配置中,Yeo-Johnson预处理结合中等幅度压缩给出了最一致的有利权衡,最强的定量支持来自基于全局信号的标准差归一化平均绝对误差($\mathrm{MAE}_{\rm std}$)的排名,并且在互补诊断中表现出定性一致的行为。同时,视觉上合理的三维样本在两点和高阶统计中仍然保留可测量的偏差。因此,我们将当前工作视为三维21厘米模拟以及未来纳入更真实观测效应的研究的一个模拟级基线。

英文摘要

We investigate conditional diffusion modeling for three-dimensional 21 cm lightcone emulation, focusing on cubes with a sky-plane size of $64\times64$ and a line-of-sight depth up to 1024 cells. Relative to earlier 2D studies, the 3D setting is substantially harder because memory limits enforce very small micro-batches while the underlying voxel distribution is highly skewed and long tailed. We perform controlled comparisons across preprocessing choices, dynamic-range compression settings, architecture depth, and training duration using $25{,}600$ training lightcones and validation ensembles at fixed parameter points. For validation, each reference parameter point contains 800 21cmFAST realizations with independent initial conditions, and we use 800 samples per model and per reference set for the reported ensemble comparisons. We evaluate generated lightcones with complementary diagnostics in both image and summary-statistic spaces: brightness-temperature slices, the global signal, the power spectrum, and reduced scattering coefficients. Across the tested configurations, preprocessing is the dominant factor governing stable training and the resulting physical fidelity. Among the configurations explored here, Yeo-Johnson preprocessing combined with moderate amplitude compression gives the most consistently favorable trade-off, with the strongest quantitative support coming from rankings based on the standard-deviation-normalized mean absolute error ($\mathrm{MAE}_{\rm std}$) of the global signal and qualitatively compatible behavior in the complementary diagnostics. At the same time, visually plausible 3D samples still retain measurable biases in two-point and higher-order statistics. We therefore view the present work as a simulation-level baseline for three-dimensional 21 cm emulation and for future studies that incorporate more realistic observational effects.

2605.29009 2026-05-29 cs.LG cs.AI 版本更新

Label-Free Reinforcement Learning via Cross-Model Entropy

无标签强化学习:跨模型熵方法

Matt Gorbett, Hossein Shirazi

发表机构 * Independent Researcher(独立研究者) San Diego State University(圣地亚哥州立大学)

AI总结 提出跨模型熵(CME)作为无标签奖励信号,用于强化学习后训练大语言模型,在开放指令遵循任务上优于基线方法。

详情
AI中文摘要

使用强化学习后训练大语言模型受限于奖励信号。现有方法需要真实可验证的奖励(限制于自动正确性检查领域,如数学、代码执行)或人类偏好标签(收集成本高且易受奖励攻击)。最近的无标签方法用自参考信号(如多数投票或模型自身输出的token熵)替代真实验证器,但可能强化模型自身错误。本文提出跨模型熵(CME),即生成器响应在独立验证器模型下的平均对数似然,作为无标签奖励信号用于强化学习后训练。CME是连续的、无需训练,基于验证器认为不意外的响应可能正确或高质量的准则。由于验证器独立于生成器,该信号无法通过自一致性被操纵。我们将CME集成到GRPO中,不改变训练循环的其他部分,将无标签强化学习扩展到开放指令遵循——自参考信号不适用或不适配的场景。在开放指令遵循(UltraFeedback提示,在AlpacaEval 2.0上评估)上,CME奖励在四个模型家族(Qwen、Llama、Gemma、OLMo)和三种训练范式(预训练、SFT和指令微调)的头对头LLM-as-Judge比较中击败未训练基线,调整平局后的胜率从52.5%到71.4%。代码将在发表后发布。

英文摘要

Post-training large language models with reinforcement learning is bottlenecked by the reward signal. Existing approaches require either ground-truth verifiable rewards, restricting training to domains with automatic correctness checks (e.g., mathematics, code execution), or human preference labels, which are expensive to collect and prone to reward hacking. Recent label-free methods replace ground-truth verifiers with self-referential signals like majority voting or token entropy over a model's own outputs, but risk reinforcing a model's own errors. In this work we propose Cross-Model Entropy (CME), the mean log-likelihood of a generator's response under a separate verifier model, as a label-free reward signal for RL post-training. CME is continuous, training-free, and grounded in the principle that responses a verifier finds unsurprising are likely correct or high quality. Because the verifier is independent of the generator, the signal cannot be gamed through self-consistency. We integrate CME into GRPO with no other changes to the training loop, extending label-free RL to open-ended instruction following -- a regime where self-referential signals are inapplicable or poorly suited. On open-ended instruction following (UltraFeedback prompts, evaluated on AlpacaEval 2.0), CME rewards beat the untrained base in head-to-head LLM-as-Judge comparisons across four model families (Qwen, Llama, Gemma, OLMo) and three training regimes (pretrained, SFT, and instruction-tuned), with tie-adjusted win rates ranging from 52.5% to 71.4%. Code will be released upon publication.
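The reward described in the abstract, the mean log-likelihood of a generator's response under a separate verifier, can be sketched as below. The sketch assumes the verifier's per-token log-probabilities for the response are already available as a list of floats. Obtaining them from a real model is out of scope here. The group normalisation shown is the commonly used GRPO-style standardisation and is an assumption about how the reward would be consumed, not a detail confirmed by the abstract. All function names are hypothetical.

```python
import statistics


def cme_reward(verifier_token_logprobs):
    """Mean per-token log-likelihood of a response under the verifier.

    Higher (less negative) values mean the verifier finds the
    response less surprising.
    """
    return sum(verifier_token_logprobs) / len(verifier_token_logprobs)


def group_relative_advantages(rewards, eps=1e-6):
    """Standardise rewards within one group of sampled responses.

    Assumed GRPO-style normalisation: (r - mean) / (std + eps).
    """
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]
```

Averaging over tokens rather than summing keeps the reward from favouring short responses purely because of their length.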

2605.29008 2026-05-29 cs.LG 版本更新

Causal Intelligence for Constraint-Aware Intervention Design to Induce State Transitions

因果智能:面向状态转换的约束感知干预设计

Zixuan Song, Uwe Mueller, Dimitris V. Manatakis

发表机构 * MRL, Merck & Co., Inc.(MRL,默克公司)

AI总结 提出COAST方法,通过因果图学习和约束感知多目标优化,从数据中设计干预策略以实现系统状态转换。

详情
AI中文摘要

通过有针对性的干预将系统从一个状态驱动到另一个状态是科学中的一个基本挑战,然而大多数预测模型提供的机制洞察有限,且缺乏原则性的决策框架。本文提出COAST(状态转换的因果最优行动),一种用于计算机设计约束干预的因果智能方法,该干预诱导用户定义的状态转换。给定表征源状态和目标状态的数据,COAST学习上下文特定的因果图和结构因果模型,将观测到的分布变化归因于机制层面的因果驱动因素,并引入一种新颖的约束感知多目标优化公式,平衡转换效果、干预复杂性和目标状态稳定性。该方法模块化且领域无关,通过可互换的组件整合特征选择、因果发现、因果建模以及干预识别和评估。在合成基准和真实生物数据集上,COAST恢复了关键的因果驱动因素,并识别出实现期望状态转换的稳健的单目标和多目标干预策略,同时提供透明的机制解释以指导实验验证。

英文摘要

Driving a system from one state to another through targeted interventions is a fundamental challenge in science, yet most predictive models offer limited mechanistic insight and no principled framework for decision-making. Here we present COAST (Causally Optimal Actions for State Transitions), a causal-intelligence approach for the in-silico design of constrained interventions that induce user-defined state transitions. Given data characterizing source and target states, COAST learns context-specific causal graphs and structural causal models, attributes observed distributional shifts to mechanism-level causal drivers, and introduces a novel constraint-aware multi-objective optimization formulation that balances transition efficacy, intervention complexity, and target-state stability. The approach is modular and domain-agnostic, integrating feature selection, causal discovery, causal modeling, and intervention identification and evaluation through interchangeable components. Across synthetic benchmarks and real biological datasets, COAST recovers key causal drivers and identifies robust single- and multi-target intervention strategies that achieve desired state transitions, accompanied by transparent mechanistic rationales to guide experimental validation.

2605.29005 2026-05-29 cs.LG cs.AI 版本更新

LoRe: Adaptive Interaction-Evaluation Routing with Per-Step Interaction Budgets for Iterative Graph Solvers

LoRe: 基于每步交互预算的自适应交互评估路由用于迭代图求解器

Jintao Li, Yong-Yi Wang, Zheng-An Wang, Heng Fan

发表机构 * Beijing Key Laboratory of Fault-Tolerant Quantum Computing, Beijing Academy of Quantum Information Sciences, Beijing, China(北京容错量子计算重点实验室,北京量子信息科学研究院,北京,中国) Beijing National Laboratory for Condensed Matter Physics, Institute of Physics, CAS, Beijing, China(北京凝聚态物理国家实验室,中国科学院物理研究所,北京,中国) Beijing Key Laboratory of Advanced Quantum Technology, Beijing, China(北京先进量子技术重点实验室,北京,中国) Hefei National Laboratory, Hefei, China(合肥国家实验室,合肥,中国)

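AI summary: Proposes LoRe, which evaluates only a fixed fraction of interactions at each step by dynamically routing computation to high-conflict or high-uncertainty interactions, substantially improving the scalability and speed of iterative graph solvers without sacrificing solution quality.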

Comments Accepted at ICML 2026

详情
AI中文摘要

基于扩散的组合优化神经求解器反复重新评估密集的边/因子交互,导致推理时间昂贵且在大规模下常受内存限制。受多体物理计算方法的启发,我们引入LoRe,一种无需训练、推理时即插即用的包装器,强制执行每步交互评估预算:在每次迭代中,它通过动态路由计算到高冲突或高不确定性交互,仅评估固定比例的交互,而不是使用固定的稀疏化(例如静态kNN图或静态掩码)。在完全包含的端到端挂钟时间核算下,LoRe显著提高了最大独立集(MIS)问题的可扩展性,将可行推理扩展到基线内存溢出限制的3倍以上,实现了约8倍的加速和约12倍的峰值内存减少,同时在此范围内保持解质量。在大规模旅行商问题(TSP)上展示了跨任务通用性,并对拓扑变化具有零样本鲁棒性,LoRe在n=1000时实现了约15倍的加速,内存减少44倍,且巡回质量具有竞争力。

英文摘要

Diffusion-based neural solvers for combinatorial optimization repeatedly re-evaluate dense edge/factor interactions, making inference expensive in wall-clock time and often memory-bound at scale. Inspired by the computational methodologies of many-body physics, we introduce LoRe, a training-free, inference-time drop-in wrapper that enforces per-step interaction-evaluation budgeting: at each iteration, it evaluates only a fixed fraction of interactions by dynamically routing computation to high-conflict or high-uncertainty interactions, instead of using a fixed sparsification (e.g., static kNN graphs or static masks). Under fully inclusive end-to-end wall-clock accounting, LoRe substantially improves scalability on the Maximum Independent Set (MIS) problem, extending feasible inference more than $3\times$ beyond the baseline's out-of-memory limit, delivering a $\sim 8\times$ speedup and a $\sim 12\times$ peak-memory reduction, with solution quality preserved in this regime. Demonstrating cross-task generality on the large-scale Traveling Salesperson Problem (TSP) and zero-shot robustness to topology shifts, LoRe achieves a $\sim 15\times$ speedup at $n=1000$ with a $44\times$ memory reduction and competitive tour quality.
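The per-step budgeting idea can be sketched as a top-fraction selection: score every interaction, then evaluate only the highest-scoring share at this iteration. This is a minimal sketch under stated assumptions; the scoring values, the function name, and the tie-breaking behavior are hypothetical, and the abstract does not describe how LoRe actually computes conflict or uncertainty.

```python
import math
from typing import Dict, Hashable, List


def select_budgeted_interactions(
    scores: Dict[Hashable, float], fraction: float
) -> List[Hashable]:
    """Return the interactions to evaluate at this step.

    `scores` maps each interaction (for example an edge) to a hypothetical
    conflict/uncertainty score; `fraction` is the fixed per-step budget.
    """
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must lie in (0, 1]")
    budget = max(1, math.ceil(fraction * len(scores)))
    # Highest-scoring interactions first; the budget is re-spent every step,
    # so the selected set can change as scores change (unlike a static mask).
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:budget]
```

Calling the function again with updated scores at the next iteration is what distinguishes this kind of dynamic routing from a fixed sparsification.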

2605.29002 2026-05-29 cs.LG cs.DC 版本更新

FedQHD: Closed-Form Function-Space Federated Reinforcement Learning

FedQHD: 闭式函数空间联邦强化学习

Yuchen Hou, Yongshan Chen, Zhuowen Zou, Calvin Yeung, Mohsen Imani, Tian Lan, Mahdi Imani

发表机构 * Northeastern University(东北大学) University of California, Irvine(加州大学 Irvine 分校) The George Washington University(乔治·华盛顿大学)

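AI summary: Proposes FedQHD, a federated Q-learning method that uses hyperdimensional random-feature encoders with a linear readout; closed-form aggregation resolves the function-space inconsistency of parameter averaging, and the federation gap is analyzed theoretically.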

详情
AI中文摘要

联邦强化学习使分散的智能体能够在不交换原始轨迹的情况下协作改进策略或价值估计。然而,FedAvg风格的参数平均在函数空间上是不一致的:当客户端使用异构编码器甚至相同的非线性网络时,平均参数不一定对应于任何公共函数空间中客户端价值函数的加权平均。我们提出FedQHD,一种使用超维(随机特征)状态编码器和线性读出层的联邦Q学习方法,使得Q函数在状态上是非线性的,但在可训练参数上是线性的。这种线性结构实现了闭式聚合。使用共享编码器时,函数空间共识更新恰好与局部读出矩阵的加权平均一致。使用异构编码器时,服务器通过在共享锚点状态集上平均客户端的Q值构建全局教师,每个客户端通过单次岭投影将该教师编译到其局部表示中。我们形式化了联邦差距——将联邦教师编译到异构客户端表示时产生的误差——相对于客户端特定的oracle(理想)投影。我们证明该差距可分解为子空间错位、锚点集条件和正则化偏差。我们进一步确定锚点与维度比 $m \geq D_i$ 为良态区域,在该区域内差距简化为编码器异质性下界的倍数。在四个连续状态、离散动作控制基准上,FedQHD匹配或优于FedAvg风格基线和基于蒸馏的替代方法,同时所需计算量显著更少,并且联邦差距对编码器维度的经验依赖性与我们的理论分析一致。

英文摘要

Federated reinforcement learning enables decentralized agents to collaboratively improve policies or value estimates without exchanging raw trajectories. However, FedAvg-style parameter averaging is not function-space consistent: when clients use heterogeneous encoders or even identical nonlinear networks, averaged parameters need not correspond to the weighted average of client value functions in any common function space. We propose FedQHD, a federated Q-learning method using hyperdimensional (random-feature) state encoders with a linear readout, so that Q-functions are nonlinear in state yet linear in trainable parameters. This linear structure enables closed-form aggregation. With a shared encoder, the function-space consensus update coincides exactly with weighted averaging of local readout matrices. With heterogeneous encoders, the server constructs a global teacher by averaging client Q-values on a shared anchor-state set, and each client compiles this teacher into its local representation via a single ridge projection. We formalize the federation gap -- the error incurred when compiling a federated teacher into a heterogeneous client representation -- relative to a client-specific oracle projection. We show that this gap decomposes into subspace misalignment, anchor-set conditioning, and regularization bias. We further identify the anchor-to-dimension ratio $m \geq D_i$ as the well-conditioned regime in which the gap reduces to a multiple of the encoder heterogeneity floor. On four continuous-state, discrete-action control benchmarks, FedQHD matches or outperforms FedAvg-style baselines and distillation-based alternatives while requiring substantially less computation, and the empirical dependence of the federation gap on encoder dimension matches our theoretical analysis.
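The shared-encoder claim follows from linearity: if every client computes Q(s) = W φ(s) with the same fixed encoder φ, then averaging the readout matrices W gives exactly the average of the clients' Q-values. The sketch below illustrates this with a random Fourier feature encoder, which is an assumption standing in for the paper's hyperdimensional encoder; all function names are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence

Matrix = List[List[float]]


def make_encoder(state_dim: int, feature_dim: int, seed: int) -> Callable[[Sequence[float]], List[float]]:
    """Fixed (untrained) random-feature encoder: phi(s) = cos(A s + b)."""
    rng = random.Random(seed)
    weights = [[rng.gauss(0.0, 1.0) for _ in range(state_dim)] for _ in range(feature_dim)]
    phases = [rng.uniform(0.0, 2.0 * math.pi) for _ in range(feature_dim)]

    def encode(state: Sequence[float]) -> List[float]:
        return [
            math.cos(sum(w * s for w, s in zip(row, state)) + b)
            for row, b in zip(weights, phases)
        ]

    return encode


def q_values(readout: Matrix, features: Sequence[float]) -> List[float]:
    """Linear readout: one Q-value per action (rows of `readout`)."""
    return [sum(w * f for w, f in zip(row, features)) for row in readout]


def average_readouts(readouts: Sequence[Matrix], client_weights: Sequence[float]) -> Matrix:
    """Closed-form aggregation for a shared encoder: weighted mean of readouts."""
    total = sum(client_weights)
    num_actions, dim = len(readouts[0]), len(readouts[0][0])
    return [
        [sum(cw * r[a][j] for cw, r in zip(client_weights, readouts)) / total for j in range(dim)]
        for a in range(num_actions)
    ]
```

With heterogeneous encoders this identity no longer holds, which is the case the abstract handles through anchor states and a ridge projection; that step is not sketched here.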

2605.29001 2026-05-29 cs.LG cs.AI 版本更新

FormInv: A Measurement Protocol for Semantic Invariance in Mathematical Reasoning Benchmarks

FormInv:数学推理基准中语义不变性的测量协议

Nishal Thomas, Noel Thomas

发表机构 * Mohamed Bin Zayed University of Artificial Intelligence(Mohamed Bin Zayed人工智能大学)

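AI summary: Proposes the FormInv protocol, which detects semantically incorrect paraphrases through a cross-model unanimity audit and introduces the Semantic Consistency Rate (SCR) and Cochran's Q as invariance measures, revealing ranking changes and model inconsistencies that standard benchmarks miss.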

Comments 18 pages, 3 figures. Under review for the 3rd AI for Math Workshop (AI4Math), ICML 2026

详情
AI中文摘要

对MathCheck(ICLR 2025)的释义质量审计在129组中检测到4个语义不正确的释义(3.1%);移除它们后,GPT-4o从第2名降至第4名,并将Claude Haiku和DeepSeek V3提升至其之上;这些排名变化对任何单模型评估都是不可见的。跨模型一致性以不到10美元的成本自动发现了这些错误(MathCheck中>=3/4模型;我们的主要评估中>=6/9);在我们自己的数据集中,相同的协议发现47%的自动生成的连接变体释义在语义上不正确。这一缺陷加剧了更深的测量差距:Claude Haiku 4.5达到86%的准确率,但SCR=50%,意味着其一半的定理在语义等价的重新表述下得到不同的答案,而9个模型的总体准确率仅跨越86-96%,但语义一致性率(SCR)跨越50-82%——这是标准基准无法捕捉的32个百分点的差距。形式上,对于9个前沿模型的任何目标排名,存在一个释义族上的权重实现该排名(无免费基准推论),因为没有模型在所有族上帕累托占优——因此选择族的基准设计者隐含地决定了哪个模型获胜。FormInv提供了审计协议(在外部基准上以100%召回率复制)、SCR和每个定理的Cochran's Q作为主要不变性度量,在9个模型上评估了366-811个项目(基于Lean4验证的定理),以及用于情境感知模型选择的FormInvSelector。

英文摘要

A paraphrase-quality audit of MathCheck (ICLR 2025) detected 4 semantically incorrect paraphrases in 129 groups (3.1%); removing them drops GPT-4o from rank 2 to rank 4 and elevates Claude Haiku and DeepSeek V3 above it; these ranking changes are invisible to any single-model evaluation. Cross-model unanimity found these errors automatically (>= 3/4 models for MathCheck; >= 6/9 for our primary evaluation) for under $10; in our own dataset the same protocol found that 47% of auto-generated connective-variation paraphrases were semantically incorrect. That flaw compounds a deeper measurement gap: Claude Haiku 4.5 achieves 86% accuracy yet SCR=50%, meaning half its theorems are answered differently under semantically equivalent restatements, while aggregate accuracy across 9 models spans only 86-96% yet Semantic Consistency Rates (SCR) span 50-82% -- a 32-point gap invisible to standard benchmarks. Formally, for any target ranking over 9 frontier models there exists a weighting over paraphrase families that realizes it (No-Free-Benchmark corollary), because no model Pareto-dominates all families -- so benchmark designers who select families are implicitly choosing which model wins. FormInv supplies the audit protocol (replicated on external benchmarks at 100% recall), SCR and per-theorem Cochran's Q as primary invariance measures evaluated on 9 models across 366-811 items (on Lean4-verified theorems), and FormInvSelector for regime-aware model selection.
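The two invariance measures named above can be sketched as follows. The SCR definition used here (the share of theorems whose outcome is identical across all paraphrases) is an assumption inferred from the abstract's wording, and the paper's exact definition may differ. Cochran's Q is the standard statistic for a binary matrix with blocks as rows and conditions as columns; which axis FormInv assigns to paraphrases or models is not stated, so the layout is left to the caller.

```python
from typing import Dict, Hashable, List, Sequence


def semantic_consistency_rate(outcomes: Dict[Hashable, Sequence[bool]]) -> float:
    """Fraction of theorems answered identically across all paraphrases.

    `outcomes[theorem]` lists one correctness flag per paraphrase
    (assumed layout, not taken from the paper).
    """
    consistent = sum(1 for flags in outcomes.values() if len(set(flags)) == 1)
    return consistent / len(outcomes)


def cochrans_q(matrix: List[List[int]]) -> float:
    """Cochran's Q for a 0/1 matrix (rows = blocks, columns = conditions).

    Q = (k - 1) * (k * sum(C_j^2) - N^2) / (k * N - sum(R_i^2)),
    with column totals C_j, row totals R_i, and grand total N.
    """
    k = len(matrix[0])
    row_totals = [sum(row) for row in matrix]
    col_totals = [sum(row[j] for row in matrix) for j in range(k)]
    grand = sum(row_totals)
    denominator = k * grand - sum(r * r for r in row_totals)
    if denominator == 0:
        # Every row is all-0 or all-1: no within-row disagreement to test.
        return 0.0
    numerator = (k - 1) * (k * sum(c * c for c in col_totals) - grand * grand)
    return numerator / denominator
```

Under the null hypothesis of equal success rates across conditions, Q is approximately chi-squared distributed with k - 1 degrees of freedom.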

2605.28999 2026-05-29 cs.CR cs.AI cs.CL cs.LG 版本更新

Measuring Real-World Prompt Injection Attacks in LLM-based Resume Screening

测量基于LLM的简历筛选中真实世界的提示注入攻击

Mohan Zhang, Yuqi Jia, Zhen Tan, Steven Jiang, Neil Zhenqiang Gong, Tianlong Chen, Dawn Song

发表机构 * University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) Duke University(杜克大学) Arizona State University(亚利桑那州立大学) hireEZ University of California, Berkeley(加州大学伯克利分校)

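AI summary: Presents the first systematic analysis of prompt injection attacks in LLM-based resume screening; applying purpose-built detectors to about 200K real-world resumes, it finds that roughly 1% contain hidden prompt injections and that their prevalence has risen noticeably in recent years.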

Comments Published in USENIX Security Symposium 2026; Code and artifacts are available at https://github.com/UNITES-Lab/resume-injection-measurement

详情
AI中文摘要

LLM容易受到提示注入攻击。然而,这种漏洞主要是在学术研究中通过概念性演示或少数轶事案例研究来展示的。其在真实世界基于LLM的应用中的普遍性和影响尚未得到充分探索。在这项工作中,我们首次对广泛使用的应用——基于LLM的简历筛选——中的提示注入攻击进行了系统研究。我们的分析基于hireEZ多年来收集的约20万份真实简历。我们首先设计了专门的方法来检测简历中的提示注入。在小规模数据集上的手动验证表明,我们的检测器实现了高精度,并优于最先进的通用检测器。然后,我们将检测器应用于完整的简历数据集,并对真实世界的提示注入攻击进行了全面的测量研究。我们的分析揭示了一些有趣的发现:大约1%的简历包含隐藏的提示注入;这种注入简历的流行度在过去一到两年内显著增加;超过90%的注入提示不使用显式指令。这些结果首次提供了真实世界基于LLM的应用中大规模提示注入的证据,并为未来理解和缓解此类攻击的研究奠定了基础。

英文摘要

LLMs are vulnerable to prompt injection attacks. However, this vulnerability has been primarily demonstrated conceptually in academic studies or through a few anecdotal case studies. Its prevalence and impact in real-world LLM-based applications are largely unexplored. In this work, we present the first systematic study of prompt-injection attacks in a widely used application: LLM-based resume screening. Our analysis is based on approximately 200K real-world resumes collected over multiple years by hireEZ. We first design tailored methods to detect prompt injection in resumes. Manual validation on a small-scale dataset demonstrates that our detectors achieve high precision and outperform state-of-the-art general-purpose detectors. We then apply our detector to the full resume dataset and conduct a comprehensive measurement study of real-world prompt injection attacks. Our analysis reveals several intriguing findings: approximately 1% of resumes contain hidden prompt injections; the prevalence of such injected resumes has increased noticeably over the past one to two years; and more than 90% of injected prompts do not use explicit instructions. These results provide the first evidence of large-scale prompt injection in real-world LLM-based applications and lay the groundwork for future studies to understand and mitigate such attacks.

2605.28990 2026-05-29 cs.LG 版本更新

Learning Robust and Task-Invariant Functional Representation from fMRI through Siamese Self-Supervised Learning

通过孪生自监督学习从fMRI中学习鲁棒且任务不变的功能表示

Jiyao Wang, Peiyu Duan, Nicha C. Dvornek, Lawrence H. Staib, Denis Sukhodolsky, Pamela Ventola, James S. Duncan

发表机构 * Department of Biomedical Engineering, Yale University, New Haven, CT, USA; Radiology & Biomedical Imaging, Yale School of Medicine, New Haven, CT, USA; Electrical Engineering, Yale University, New Haven, CT, USA; Child Study Center, Yale School of Medicine, New Haven, CT, USA

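AI summary: Proposes BrainSimSiam, a lightweight self-supervised framework that learns robust, generalizable fMRI representations from positive-only pairs, outperforming fully supervised baselines on multiple downstream tasks and approaching the performance of large-scale models.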

详情
AI中文摘要

功能磁共振成像(fMRI)是研究人脑功能的强大工具。然而,数据采集的高成本和精神病学评定量表固有的主观性常常导致数据集样本量小且标签质量可变,特别是在针对特定神经疾病时。结合fMRI数据固有的高维性,这些限制显著增加了模型过拟合的风险。近年来,通过组合多个数据集开发fMRI基础模型的兴趣日益增长;然而,预训练和微调所需的计算资源往往令人望而却步。我们展示了一个轻量级自监督框架能够产生跨多种下游任务泛化的表示,超越全监督基线,并接近大规模模型的性能。我们引入了BrainSimSiam,一种数据高效的自监督表示学习框架,利用仅正样本对来学习鲁棒且可泛化的特征。我们证明了所学表示在多个下游分类和回归任务中取得了强劲性能,突显了BrainSimSiam在数据有限的神经影像应用中的潜力。

英文摘要

Functional magnetic resonance imaging (fMRI) is a powerful tool for investigating human brain function. However, the high cost of data acquisition and the inherent subjectivity of psychiatric rating scales often lead to datasets with small sample sizes and variable label quality, especially when targeting a specific neurological condition. Combined with the inherently high dimensionality of fMRI data, these limitations substantially increase the risk of model overfitting. Recent years have seen growing interest in developing fMRI foundation models by combining multiple datasets; however, the computational resources needed for pretraining and fine-tuning are often prohibitive. We show that a lightweight self-supervised framework yields representations that generalize across diverse downstream tasks, outperforming fully supervised baselines and approaching the performance of large-scale models. We introduce BrainSimSiam, a data-efficient self-supervised representation learning framework that leverages positive-only data pairs to learn robust and generalizable features. We demonstrate that the learned representations achieve strong performance across multiple downstream classification and regression tasks, highlighting the potential of BrainSimSiam for data-limited neuroimaging applications.

2605.28983 2026-05-29 cs.LG cs.AI math.DS math.RT physics.comp-ph 版本更新

The Hamilton-Jacobi Theory of Deep Learning

深度学习的哈密顿-雅可比理论

Jose Marie Antonio Miñoza, Erika Fille T. Legara, Christopher P. Monterola

发表机构 * Center for AI Research PH(人工智能研究中心 PH) Asian Institute of Management(亚洲管理学院)

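AI summary: Identifies neural network training as a search through Hamilton-Jacobi initial-value problems, establishing a correspondence between deep learning and viscous Hamilton-Jacobi equations that is exact for log-sum-exp layers and structural for residual networks, Transformers, and recurrent architectures; it derives quantitative results including a minimax optimal generalization rate and adversarial robustness controlled by a deformation parameter.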

详情
AI中文摘要

在本文中,神经网络训练被精确地识别为通过哈密顿-雅可比初值问题的搜索:每个梯度步选择粘性哈密顿-雅可比方程的初始数据,其Hopf-Cole传播子最佳拟合观测值;在推理时,输入是评估该解的空间点,初始条件已编码在权重中。这种对应对于log-sum-exp层是精确的,对于更广泛的架构(残差网络、Transformer和循环架构(RNN、LSTM、SSM))是结构性的,它们离散化同一类哈密顿-雅可比方程,具有依赖于架构的哈密顿量和粘性。一个单一的变形参数ε在交换图中统一了所有四个视角(网络、热带代数、粘性PDE、凸优化),并在Lipschitz条件下封闭。定量结果包括:固定t时的极小极大最优泛化率O(n^{-1/(d+2)});由ε控制的对抗鲁棒性;残差网络的反向传播作为哈密顿系统的协态方程(庞特里亚金最大值原理);通过PDE求积与数据内在维度一致的标度指数;以及闭式O(N)影响函数(softmax归因权重π_j),其熵景观随着ε增加经历折叠分岔,每个分岔合并归因盆地。

英文摘要

In this paper, training a neural network is identified, exactly, as a search through Hamilton--Jacobi initial-value problems: each gradient step selects the initial data of a viscous Hamilton--Jacobi equation whose Hopf--Cole propagator best fits the observations; at inference, the input is the spatial point at which that solution is evaluated and the initial condition is already encoded in the weights. The correspondence is exact for log-sum-exp layers and structural for broader architectures: residual networks, transformers, and recurrent architectures (RNNs, LSTMs, SSMs) each discretize the same class of Hamilton--Jacobi equations, with architecture-dependent Hamiltonian and viscosity. A single deformation parameter $\varepsilon$ unifies all four perspectives (network, tropical algebra, viscous PDE, convex optimization) in a commutative diagram closed under Lipschitz conditions. Quantitative consequences include: the minimax optimal generalization rate $O(n^{-1/(d+2)})$ for fixed $t$; adversarial robustness controlled by $\varepsilon$; backpropagation as the co-state equation of the Hamiltonian system for residual networks (Pontryagin Maximum Principle); scaling exponents consistent with data intrinsic dimension via PDE quadrature; and a closed-form $O(N)$ influence function (softmax attribution weights $\pi_j$) whose entropy landscape undergoes fold bifurcations as $\varepsilon$ increases, each merging attribution basins.
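One way to see how a single parameter can link a log-sum-exp layer to tropical (max-plus) algebra is the scaled log-sum-exp $\varepsilon \log \sum_j \exp(x_j/\varepsilon)$, which lies between $\max_j x_j$ and $\max_j x_j + \varepsilon \log N$ and so tends to the maximum as $\varepsilon \to 0$. The sketch below evaluates that function and the associated softmax weights. Treating this $\varepsilon$ as the paper's deformation parameter is an assumption, and the function names are hypothetical.

```python
import math
from typing import List, Sequence


def smooth_max(values: Sequence[float], eps: float) -> float:
    """Scaled log-sum-exp: eps * log(sum_j exp(x_j / eps)), computed stably."""
    if eps <= 0.0:
        raise ValueError("eps must be positive")
    m = max(values)
    # Subtracting the maximum keeps every exponent <= 0 and avoids overflow.
    return m + eps * math.log(sum(math.exp((v - m) / eps) for v in values))


def softmax_weights(values: Sequence[float], eps: float) -> List[float]:
    """Softmax weights pi_j = exp(x_j / eps) / sum_k exp(x_k / eps)."""
    m = max(values)
    exps = [math.exp((v - m) / eps) for v in values]
    total = sum(exps)
    return [e / total for e in exps]
```

As `eps` shrinks, the weights concentrate on the largest entry; as it grows, they spread toward uniform.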

2605.28980 2026-05-29 math.OC cs.LG cs.NA eess.SP math.NA 版本更新

Manifold-based Algorithms for the Hadamard Decomposition

基于流形的Hadamard分解算法

Nicolas Gillis, Subhayan Saha, Stefano Sicilia, Arnaud Vandaele

发表机构 * Department of Mathematics and Operational Research, University of Mons(数学与运筹学系,蒙斯大学) Gruppo Nazionale Calcolo Scientifico-Istituto Nazionale di Alta Matematica(科学计算组-国家高级数学研究所)

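AI summary: For the Hadamard decomposition problem, proposes three new manifold-based algorithms (a Manopt-based method, a block projected gradient method, and a projection-free manifold gradient descent) together with new initialization strategies; the methods are efficient and competitive with existing approaches on synthetic and real data.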

Comments 27 pages, code available from https://github.com/StefanoSicilia/Hadamard-Decomposition

详情
AI中文摘要

给定矩阵 $X$ 和两个秩 $r_1$ 和 $r_2$,Hadamard分解(HD)寻找两个低秩矩阵 $X_1$(秩 $r_1$)和 $X_2$(秩 $r_2$),它们与 $X$ 大小相同,使得 $X\approx X_1\circ X_2$,其中 $\circ$ 是Hadamard(逐元素)乘积。大多数情况下,HD比标准低秩近似(如截断奇异值分解(TSVD))更具表现力,因为它可以用相同数量的参数表示更高秩的矩阵;这是因为 $X_1 \circ X_2$ 的秩通常等于 $r_1 r_2$。本文首先给出HD的一些理论见解,特别是一个有用的重写形式 $X\approx WH^\top$,其中 $W$ 和 $H$ 有 $r_1 r_2$ 列并属于某些流形。这使我们能够开发三种计算HD的新算法。第一种使用表示 $X\approx X_1\circ X_2$ 并依赖于Manopt工具箱。另外两种依赖于重写形式 $X\approx WH^\top$:一种是块投影梯度方法,另一种是基于流形的梯度下降算法,不需要投影到可行集。最后两种算法特别适用于处理大规模稀疏数据。我们还提出了新的初始化策略,以提高HD的精度。我们将我们的算法和初始化策略与TSVD及现有技术进行了比较。数值结果表明,新方法在合成和真实数据上高效且具有竞争力。

英文摘要

Given a matrix $X$, and two ranks $r_1$ and $r_2$, the Hadamard decomposition (HD) looks for two low-rank matrices, $X_1$ of rank $r_1$ and $X_2$ of rank $r_2$, both of the same size as $X$, such that $X\approx X_1\circ X_2$, where $\circ$ is the Hadamard (element-wise) product. In most cases, HD is more expressive than standard low-rank approximations such as the truncated singular value decomposition (TSVD), as it can represent higher-rank matrices with the same number of parameters; this is because the rank of $X_1 \circ X_2$ is generically equal to $r_1 r_2$. In this paper, we first present some theoretical insights for HD, in particular a useful reformulation $X\approx WH^\top$ where $W$ and $H$ have $r_1 r_2$ columns and belong to certain manifolds. These allow us to develop three new algorithms for computing HD. The first one uses the representation $X\approx X_1\circ X_2$ and relies on the Manopt toolbox. The other two rely on the reformulation $X\approx WH^\top$: one is a block projected gradient method, and the other is a manifold-based gradient descent algorithm that does not require projection onto the feasible set. The last two algorithms are particularly effective for handling large sparse data. We also propose new initializations that allow us to improve the accuracy of the HD. We compare our algorithms and initialization strategies with the TSVD and with the state of the art. Numerical results show that the new methods are efficient and competitive on both synthetic and real data.
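The reformulation $X\approx WH^\top$ with $r_1 r_2$ columns can be checked numerically with a standard identity: if $X_1 = A_1 B_1^\top$ and $X_2 = A_2 B_2^\top$, then $X_1 \circ X_2 = WH^\top$, where each row of $W$ is the Kronecker product of the matching rows of $A_1$ and $A_2$, and likewise for $H$ with $B_1$ and $B_2$. The sketch below is a plain-Python check of that identity; whether this is exactly the parametrization and manifold structure used in the paper is an assumption, and the helper names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul_t(a: Matrix, b: Matrix) -> Matrix:
    """Return a @ b^T for a (m x r) and b (n x r)."""
    return [[sum(x * y for x, y in zip(ra, rb)) for rb in b] for ra in a]


def hadamard(x: Matrix, y: Matrix) -> Matrix:
    """Element-wise product of two matrices of the same size."""
    return [[p * q for p, q in zip(rx, ry)] for rx, ry in zip(x, y)]


def rowwise_kron(a: Matrix, b: Matrix) -> Matrix:
    """Row-wise Kronecker product: (m x r1), (m x r2) -> (m x r1*r2)."""
    return [[p * q for p in ra for q in rb] for ra, rb in zip(a, b)]


def hadamard_as_low_rank(a1: Matrix, b1: Matrix, a2: Matrix, b2: Matrix) -> Matrix:
    """Compute (a1 b1^T) o (a2 b2^T) through the factor form W H^T."""
    w = rowwise_kron(a1, a2)  # m x (r1 * r2)
    h = rowwise_kron(b1, b2)  # n x (r1 * r2)
    return matmul_t(w, h)
```

The factor form stores $(m+n)(r_1+r_2)$ numbers yet can represent matrices of rank up to $r_1 r_2$, which is the expressiveness argument made in the abstract.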

2605.28977 2026-05-29 cs.LG cs.AI 版本更新

Comparing Post-Hoc Explainable AI Methods for Interpreting Black-Box EEG Models in Depression Detection

比较事后可解释AI方法用于解释抑郁症检测中的黑盒脑电图模型

Antonia Šarčević, Nikolina Frid

发表机构 * University of Zagreb Faculty of Electrical Engineering and Computing(Zagreb大学电子工程与计算学院)

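AI summary: Analyzes the decisions of an InceptionTime model for EEG-based depression detection using several post-hoc explainability methods (DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance); attributions partially converge on frontal, temporal, and posterior regions, particularly in the right hemisphere, but differ between methods, underscoring both the usefulness and the limitations of post-hoc explainability.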

详情
AI中文摘要

深度学习的最新进展使得基于脑电图的重度抑郁症分类越来越准确,但高容量模型的决策过程仍然难以解释。本研究调查了应用于训练用于基于脑电图的重度抑郁症检测的InceptionTime架构的多种事后可解释性方法。分析包括基于Shapley、基于梯度和基于扰动的归因方法:DeepSHAP、集成梯度、GradCAM、遮挡和置换特征重要性。在受试者级别的分层5折交叉验证框架内,通过跨脑电图片段和受试者的全局归因聚合进行可解释性分析。评估的方法揭示了部分收敛的归因模式,其中额叶、颞叶和后部脑区(尤其是右半球)反复受到关注。定量比较表明,基于梯度和基于扰动的方法之间具有实质性一致性,而DeepSHAP产生了相对独特的归因分布。同时,可解释性方法之间的差异凸显了方法假设对所得解释的影响。总体而言,结果表明,不同的事后可解释性方法捕捉了基于脑电图的深度学习模型在抑郁症检测中的部分重叠的相关性结构。尽管观察到的归因模式与先前几项关于重度抑郁症的脑电图研究大致一致,但该分析应被视为探索性的,而非确凿的神经生理学生物标志物或临床适用性的证据。该研究强调了事后可解释性在解释精神病学应用中的黑盒脑电图分类器方面的有用性和局限性。

英文摘要

Recent advances in deep learning have enabled increasingly accurate electroencephalography (EEG)-based classification of Major Depressive Disorder (MDD), but the decision-making processes of high-capacity models remain difficult to interpret. This study investigates multiple post-hoc explainability methods applied to an InceptionTime architecture trained for EEG-based MDD detection. The analysis includes Shapley-based, gradient-based, and perturbation-based attribution approaches: DeepSHAP, Integrated Gradients, GradCAM, Occlusion, and Permutation Feature Importance. Explainability analysis was performed within a subject-level stratified 5-fold cross-validation framework using global attribution aggregation across EEG segments and subjects. The evaluated methods revealed partially convergent attribution patterns, with recurring emphasis on frontal, temporal, and posterior EEG regions, particularly in the right hemisphere. Quantitative comparison demonstrated substantial agreement between gradient- and perturbation-based approaches, while DeepSHAP produced comparatively distinct attribution distributions. At the same time, variability between explainability methods highlighted the influence of methodological assumptions on the resulting explanations. Overall, the results suggest that different post-hoc explainability approaches capture partially overlapping relevance structures in EEG-based deep learning models for depression detection. Although the observed attribution patterns are broadly consistent with several previous EEG studies of MDD, the analysis should be interpreted as exploratory rather than evidence of definitive neurophysiological biomarkers or clinical applicability. The study highlights both the usefulness and limitations of post-hoc explainability for interpreting black-box EEG classifiers in psychiatric applications.

2605.28975 2026-05-29 cs.LG 版本更新

A Training-Time Diagnostic for Generalization via the Log-Alignment Ratio

基于对数对齐比率的训练时泛化诊断

Ali Shehper, Ashish Vaswani

发表机构 * Essential AI

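AI summary: Studies the log-alignment ratio (LAR), a measure of parameter-activation alignment; by capturing the spread of the weight and activation spectra during training, LAR tracks the transition between memorization and generalization, predicting the effective dimension of the learned function in grokking and tracking the generalization gap in language-model pre-training.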

Comments 32 pages, 25 figures

详情
AI中文摘要

我们研究了对数对齐比率(LAR),这是参数化理论中引入的一种参数-激活对齐度量。我们将其重新表述为矩阵归一化奇异值平方的权重谱$p$与输入在其奇异方向上投影的归一化平方的激活谱$q$之间的重叠。我们表明,通过捕捉训练过程中$p$和$q$的扩散,解嵌入(unembedding)层的LAR在两种不同设置下跟踪记忆与泛化之间的转换。在grokking中,LAR预测学习函数的有效维度:$k \approx n^{2(1-\text{LAR})}$,其中$n$是矩阵的输入维度。在3B参数语言模型预训练中,其与无过拟合基线的偏差跟踪泛化差距,并且其下降速率随着过拟合的接近而增加。LAR可从前向传播过程中可用的量计算,计算开销可忽略,且无需保留验证数据。

英文摘要

We study the log-alignment ratio (LAR), a measure of parameter-activation alignment, introduced in parameterization theory. We reformulate it as the overlap between a weight spectrum $p$ of the normalized squared singular values of a matrix and an activation spectrum $q$ of the normalized squared projections of inputs onto its singular directions. We show that unembedding LAR tracks the transition between memorization and generalization in two different settings by capturing the spread of $p$ and $q$ during training. In grokking, LAR predicts the effective dimension of the learned function: $k \approx n^{2(1-\text{LAR})}$, where $n$ is the input dimension of the matrix. In 3B-parameter language model pre-training, its deviation from a non-overfitting baseline tracks the generalization gap, and its rate of decline increases as overfitting approaches. LAR is computable from quantities available during the forward pass with negligible computational overhead, and requires no held-out validation data.

2605.28961 2026-05-29 stat.ML cs.LG math.OC 版本更新

Dynamics of Stochastic Momentum with Sparse Updates in High Dimensions

高维稀疏更新下随机动量的动力学

Katie Everett, Elliot Paquette

发表机构 * Google DeepMind & MIT(谷歌DeepMind及麻省理工学院) McGill University & Mila(麦吉尔大学及MILA)

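AI summary: Theoretically analyzes momentum dynamics under sparse updates using least-squares and logistic-regression models, revealing a phase structure governed by the ratio of the momentum-retention timescale to the learning timescale, and finds a spectral conflict in which oscillatory dynamics arise at different momentum values for different token sparsities.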

详情
AI中文摘要

现有的动量理论假设梯度以大致恒定的速率到达每个参数,但这一假设在重尾数据分布和现代架构中常被违反。我们理论分析了稀疏更新下两种可处理动量模型的动力学:具有稀疏输入的最小二乘模型和具有稀有类别的逻辑回归模型。两者都给出了精确的闭式二阶矩动力学,我们针对稀疏性、批量大小和动量衰减的三个标度指数刻画了其高维极限。两个问题上的相结构由两个内在时间尺度之比决定:动量保留时间尺度(缓冲区存活的活动更新次数)和学习时间尺度(减少平方误差所需的活动更新次数)。当学习远慢于保留时,极限匹配SGD;当学习更快时,系统不稳定;当时间尺度相当时,我们恢复经典的重球动力学。振荡动力学发生在不同令牌稀疏度的不同动量值处,从而在全局动量上产生跨令牌频率的谱冲突。

英文摘要

Existing theory of momentum assumes that gradients arrive at every parameter at a roughly constant rate, an assumption violated in practice by heavy-tailed data distributions and modern architectures. We theoretically analyze the dynamics of two tractable models of momentum under sparse updates: a least squares model with sparse inputs and a logistic regression model with a rare class. Both admit exact closed-form second-moment dynamics whose high-dimensional limits we characterize across three scaling exponents for sparsity, batch size, and momentum decay. The phase structure on both problems is governed by the ratio of two intrinsic timescales: a momentum retention timescale (how many active updates the buffer survives) and a learning timescale (how many active updates it takes to reduce the squared error). When learning is much slower than retention, the limit matches SGD; when learning is faster, the system is unstable; where the timescales coincide, we recover classical heavy-ball dynamics. The oscillatory dynamics occur at different momentum values for different token sparsity, creating a spectral conflict for global momentum across token frequencies.

2605.28940 2026-05-29 hep-ph cs.LG hep-ex physics.data-an 版本更新

Neural Scaling Laws for Jet Generation

喷注生成的神经缩放定律

Oz Amram, Darius A. Faroughy, Tjarko Gerdes, Anna Hallin, Gregor Kasieczka, Michael Krämer, Humberto Reyes-Gonzalez, David Shih

发表机构 * Fermi National Accelerator Laboratory(费米国家加速器实验室) Rutgers University(罗格斯大学) Institute for Experimental Physics, Universität Hamburg(汉堡大学实验物理研究所) Institute for Theoretical Particle Physics and Cosmology, RWTH Aachen University(亚琛工业大学理论粒子物理与宇宙学研究所)

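AI summary: Presents the first exploration of scaling laws for particle jet generation, finding that model-size scaling follows the logarithmic scaling-law behavior and that the next-token-prediction validation loss is monotonically related to physics performance.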

详情
AI中文摘要

最近观察到的经验缩放定律描述了基础模型在三个独立关键量(数据集大小、计算量和模型参数)变化时的性能。提取这些缩放定律有助于训练大型复杂模型,因为传统方式调优超参数不可行。本文首次探索缩放定律是否也适用于粒子喷注生成任务——该任务既作为基础模型的预训练目标,也作为原位模拟本身。我们确实复制了模型大小缩放的关键对数缩放定律行为。除了研究生成模型的下一个标记预测验证损失,我们还研究了五个物理量的切片Wasserstein距离,这些物理量在训练期间模型无法直接获得。我们的研究表明,该量与下一个标记预测验证损失单调相关,意味着该损失确实是物理性能的良好代理。对于数据集大小和计算量的缩放,我们观察到损失和切片Wasserstein距离的缩放行为明显较弱。我们通过引入可学习窗口的概念分析这种行为,并认为喷注成分的自回归下一个标记预测相对于语言模型研究表现出较快的饱和。我们讨论了这种行为的可能起源,包括QCD辐射的随机性以及生成式与监督式学习任务在碰撞物理中的差异。

英文摘要

Recently observed empirical scaling laws describe the performance of foundation-type models as three independent key quantities -- dataset size, compute, and model parameters -- are modified. Extracting these scaling laws informs the training of large complex models for which the tuning of hyperparameters in traditional ways is not feasible. This work for the first time explores if scaling laws can also be observed for the task of particle jet generation -- both relevant as a pre-training objective for foundation models and as in-situ simulation by itself. We indeed replicate the key logarithmic scaling law behavior for model-size scaling. Beyond studying the next token prediction validation loss of the generative model, we also study the sliced Wasserstein distance of five physical quantities that are not immediately available to the model during training. Our study shows that this quantity is monotonically related to the next token prediction validation loss, meaning that this loss is indeed a good proxy for the physics performance. For the scaling with dataset size and compute, we observe substantially weaker scaling behavior of both the loss and the sliced Wasserstein distance. We analyze this behavior by introducing the concept of a learnable window, and argue that autoregressive next token prediction on jet constituents exhibits comparatively rapid saturation relative to language-model studies. We discuss possible origins of this behavior, including the stochastic nature of QCD radiation and differences between generative and supervised learning tasks in collider physics.
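For readers unfamiliar with the metric, a sliced Wasserstein distance compares two multi-dimensional samples by projecting both onto random directions, computing the one-dimensional Wasserstein distance along each direction (which reduces to comparing sorted values), and averaging. The sketch below is a generic Monte Carlo estimator for equal-sized samples; the choice of `p=2`, the number of projections, and the function name are assumptions and are not taken from the paper.

```python
import math
import random
from typing import Sequence


def sliced_wasserstein(
    xs: Sequence[Sequence[float]],
    ys: Sequence[Sequence[float]],
    num_projections: int = 64,
    p: int = 2,
    seed: int = 0,
) -> float:
    """Monte Carlo sliced Wasserstein-p distance between equal-sized samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes samples of equal size")
    dim = len(xs[0])
    rng = random.Random(seed)
    total = 0.0
    for _ in range(num_projections):
        # Draw a random unit direction by normalizing a Gaussian vector.
        direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        norm = math.sqrt(sum(d * d for d in direction))
        direction = [d / norm for d in direction]
        # In 1D, optimal transport matches sorted values to sorted values.
        px = sorted(sum(d * v for d, v in zip(direction, x)) for x in xs)
        py = sorted(sum(d * v for d, v in zip(direction, y)) for y in ys)
        total += sum(abs(a - b) ** p for a, b in zip(px, py)) / len(px)
    return (total / num_projections) ** (1.0 / p)
```

In the setting of the abstract, `xs` and `ys` would hold physical quantities computed from generated and reference jets respectively.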

2605.28920 2026-05-29 cs.LG cs.AI stat.ML 版本更新

Conf-Gen: Conformal Uncertainty Quantification for Generative Models

Conf-Gen: 生成模型的共形不确定性量化

Gabriel Loaiza-Ganem, Kevin Zhang, Wei Cui, Marc T. Law, Kin Kwan Leung

发表机构 * layer6ai-labs(layer6ai实验室)

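AI summary: Proposes Conf-Gen, a framework that adapts conformal risk control to generative tasks, unifying and extending earlier applications of conformal prediction to generative models such as large language models, with formal guarantees in new domains including image generation, conversational AI, and AI agents.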

Comments ICML 2026

详情
AI中文摘要

共形预测(CP)及其扩展共形风险控制(CRC)是通过形式化保证量化监督机器学习中不确定性的成熟框架。然而,人工智能(AI)的最新突破由无监督生成模型驱动,例如大型语言模型(LLMs)和图像生成器,这些模型与CP或CRC不直接兼容。在这项工作中,我们引入了共形生成(Conf-Gen),这是一个将CRC适配到生成任务同时放宽其理论假设的通用框架。Conf-Gen统一并泛化了先前将CP应用于LLMs的尝试,并将共形方法扩展到全新的领域。我们通过一些新颖的应用展示了Conf-Gen的灵活性,包括在以下方面获得共形保证:生成非记忆图像的图像生成器、提出足够澄清问题的对话AI系统,以及AI代理输出的正确性。

英文摘要

Conformal prediction (CP) and its extension, conformal risk control (CRC), are established frameworks for quantifying uncertainty in supervised machine learning through formal guarantees. However, recent breakthroughs in artificial intelligence (AI) have been driven by unsupervised generative models, such as large language models (LLMs) and image generators, which are not directly compatible with CP or CRC. In this work we introduce conformal generation (Conf-Gen), a general framework adapting CRC to generative tasks while relaxing its theoretical assumptions. Conf-Gen unifies and generalizes previous attempts to apply CP to LLMs, and extends conformal methodology to entirely new domains. We demonstrate the flexibility of Conf-Gen through some novel applications, including obtaining conformal guarantees on: image generators producing non-memorized images, conversational AI systems having asked enough clarifying questions, and the output of AI agents being correct.
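For context, standard conformal risk control picks the smallest threshold $\hat\lambda$ whose adjusted empirical risk on $n$ calibration examples satisfies $\frac{n}{n+1}\hat R_n(\lambda) + \frac{B}{n+1} \le \alpha$, where the loss is non-increasing in $\lambda$ and bounded by $B$. The sketch below implements that baseline rule over a finite grid of candidates. It is not the Conf-Gen procedure, which the abstract says relaxes CRC's assumptions; the function and parameter names are hypothetical.

```python
from typing import Callable, Iterable, Optional, Sequence, TypeVar

T = TypeVar("T")


def calibrate_crc_threshold(
    loss_fn: Callable[[T, float], float],
    calibration_data: Sequence[T],
    candidates: Iterable[float],
    alpha: float,
    loss_bound: float = 1.0,
) -> Optional[float]:
    """Smallest candidate threshold meeting the standard CRC condition.

    Assumes loss_fn(example, lam) is non-increasing in lam and takes values
    in [0, loss_bound]. Returns None if no candidate satisfies the condition.
    """
    n = len(calibration_data)
    for lam in sorted(candidates):
        empirical_risk = sum(loss_fn(ex, lam) for ex in calibration_data) / n
        if (n / (n + 1)) * empirical_risk + loss_bound / (n + 1) <= alpha:
            return lam
    return None
```

In a generative setting, `lam` might control, for instance, how many samples are drawn or how strict a filter is, with the loss marking a failed output; that mapping is left to the application.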

2605.28919 2026-05-29 cs.LG cs.AI cs.CL 版本更新

CosmicFish-HRM: Adaptive Reasoning via Hierarchical Recurrent Mechanisms in Compact Language Models

CosmicFish-HRM:紧凑语言模型中基于层次循环机制的适应性推理

Venkat Akhil Lakkapragada

发表机构 * Mistyoz AI Hyderabad, India(Mistyoz AI 德里, 印度)

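AI summary: Proposes CosmicFish-HRM, a compact language model whose Hierarchical Reasoning Module dynamically allocates reasoning depth, enabling adaptive reasoning while keeping the parameter count small.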

Comments 17 pages, 4 figures. Exploratory study of adaptive reasoning depth in compact autoregressive language models. Code available at https://github.com/MistyozAI/CosmicFish-HRM

详情
AI中文摘要

大型语言模型已经实现了强大的推理能力,尽管通常以巨大的参数数量和昂贵的推理为代价。在这项工作中,我们探索了一个不同的方向:紧凑语言模型中的自适应推理深度。我们提出了CosmicFish-HRM,这是一个紧凑的语言模型,围绕一个层次推理模块(HRM)构建,该模块在推理过程中动态分配计算资源。该模型不是对每个输入应用固定的计算,而是迭代通过高层和低层推理循环,并根据输入复杂度学习何时停止。CosmicFish-HRM将这种自适应推理核心与现代Transformer组件(包括分组查询注意力、RoPE和SwiGLU激活)相结合。虽然额外的推理基础设施在小规模下引入了开销,但我们假设随着模型规模的增长和HRM核心相对成本的降低,这种权衡变得越来越有利。我们的结果表明,该模型学习了非均匀的推理行为,在不同任务和输入之间分配不同数量的推理步骤。这些发现表明,自适应推理深度可能为仅依赖参数规模来实现推理能力提供一种有前途的替代方案。

英文摘要

Large language models have achieved strong reasoning capabilities, though often at the cost of massive parameter counts and expensive inference. In this work, we explore a different direction: adaptive reasoning depth in compact language models. We present CosmicFish-HRM, a compact language model built around a Hierarchical Reasoning Module (HRM) that dynamically allocates computational effort during inference. Instead of applying fixed computation to every input, the model iterates through high-level and low-level reasoning cycles and learns when to halt based on input complexity. CosmicFish-HRM combines this adaptive reasoning core with modern transformer components including Grouped Query Attention, RoPE, and SwiGLU activations. While the additional reasoning infrastructure introduces overhead at small scale, we hypothesize that this tradeoff becomes increasingly favorable as model size grows and the relative cost of the HRM core diminishes. Our results show that the model learns non-uniform reasoning behavior, allocating different numbers of reasoning steps across tasks and inputs. These findings suggest that adaptive reasoning depth may offer a promising alternative to relying solely on parameter scale for reasoning capability.

2605.28909 2026-05-29 cs.LG 版本更新

Sequential Physics-Constrained Neural Operator Forward Modeling for the $\textit{Norne}$ Reservoir System

面向Norne油藏系统的序贯物理约束神经算子正演建模

Clement Etienam, Juntao Yang, Oleg Ovcharenko, Nick Luiken, Tsubasa Onishi, Nefeli Moridis, Issam Said

发表机构 * NVIDIA Corporation(NVIDIA公司)

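AI summary: For the Norne reservoir benchmark, proposes sequential surrogate models based on the Fourier Neural Operator (FNO) and its physics-informed variant (PINO); supported by theoretical analysis (function-space formulation, covariate-shift quantification, physics-constrained spectral stability, and K-step TBPTT gradient analysis) and experiments, the surrogates predict oil, gas, pressure, and water over the full 3298-day horizon with $R^2$ ranging from about 0.80 (pressure) to above 0.99 (oil), and give a roughly $10^4\times$ speedup on a single GPU.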

Comments 22 pages, 2 figures, 2 tables. Code available at https://github.com/clementetienam/physicsnemo/tree/801a85bc08aa9caa0d54027a145b88c68e5e5f36/examples/reservoir_simulation/norne

详情
AI中文摘要

我们开发了一个全面的数学和计算框架,用于使用神经算子对三相黑油油藏动态进行序贯代理建模,特别关注傅里叶神经算子(FNO)及其物理信息变体(PINO)。应用重点是Norne基准油藏,定义在非均匀的$46\times112\times22$网格($N=113,344$个单元)上,生产历史跨越$T=30$个时间步,覆盖3298天。我们的理论贡献围绕四个相互关联的问题组织:(1)在乘积Sobolev空间设置中的泛函分析公式,包括隐式时间步映射的适定性和尖锐的局部Lipschitz估计;(2)协变量偏移量化,证明Wasserstein-2距离增长为$W_2 \leq \varepsilon(L^n-1)/(L-1)$,对于$L>1$具有指数级总体风险差异;(3)物理约束谱稳定性,表明使用$\lambda_R \geq \lambda^*_R$的PINO训练将学习到的Jacobian谱半径减小到$\rho_F + C\lambda_R^{-1/2}$,产生时间一致展开误差$|\delta_n| \leq \varepsilon/(1-\rho)$;(4)$K$步TBPTT梯度分析,推导出几何偏差衰减$O(\rho^K)$、最优窗口$K^* = O(\log(T/\sigma^2))$以及Adam收敛$O(1/\sqrt{t}) + O(\rho^{K^*})$。实证验证确认了所有理论预测:自回归PINO代理在完整的3298天范围内保持$R^2>0.99$(油)、$R^2>0.90$(气)、$R^2\approx 0.80$(压力)以及单调改善的$R^2$(水),在八块NVIDIA B200 GPU上训练不到一小时。一个包含1000个成员的集合在单块B200 GPU上运行不到一分钟,相比OPM有限体积模拟器获得约$10^4$倍的挂钟加速。

英文摘要

We develop a comprehensive mathematical and computational framework for sequential surrogate modeling of three-phase black-oil reservoir dynamics using neural operators, with particular emphasis on Fourier Neural Operators (FNO) and their physics-informed variant (PINO). The application focus is the Norne benchmark reservoir, defined on a heterogeneous $46\times112\times22$ grid ($N=113,344$ cells), with a production history spanning $T=30$ timesteps covering 3298 days. Our theoretical contributions are organized around four interlocking problems: (1) functional-analytic formulation in a product-Sobolev-space setting, including well-posedness of the implicit timestep map and sharp local Lipschitz estimates; (2) covariate shift quantification, proving that the Wasserstein-2 distance grows as $W_2 \leq \varepsilon(L^n-1)/(L-1)$, with exponential population-risk discrepancy for $L>1$; (3) physics-constrained spectral stability, showing PINO training with $\lambda_R \geq \lambda^*_R$ reduces the learned Jacobian spectral radius to $\rho_F + C\lambda_R^{-1/2}$, yielding uniform-in-time rollout error $|\delta_n| \leq \varepsilon/(1-\rho)$; and (4) $K$-step TBPTT gradient analysis, deriving geometric bias decay $O(\rho^K)$, optimal window $K^* = O(\log(T/\sigma^2))$, and Adam convergence $O(1/\sqrt{t}) + O(\rho^{K^*})$. Empirical validation confirms all theoretical predictions: autoregressive PINO surrogates sustain $R^2>0.99$ (oil), $R^2>0.90$ (gas), $R^2\approx 0.80$ (pressure), and monotonically improving $R^2$ (water) across the full 3298-day horizon, trained on eight NVIDIA B200 GPUs in under one hour. A 1000-member ensemble runs in under one minute on a single B200 GPU, giving a ${\sim}10^4\times$ wall-clock speedup over the OPM finite-volume simulator.
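The two rollout bounds quoted in this abstract can be checked numerically with a short sketch. It assumes, which the abstract does not state, that the accumulated bound comes from the usual one-step recursion e_{k+1} <= L * e_k + eps starting at e_0 = 0. The function names and the sample values of eps, L and rho are illustrative only and are not taken from the paper.

```python
def accumulated_bound(eps: float, L: float, n: int) -> float:
    """Closed form eps * (L**n - 1) / (L - 1); for L == 1 the sum is n * eps."""
    if L == 1.0:
        return n * eps
    return eps * (L**n - 1.0) / (L - 1.0)


def unrolled_bound(eps: float, L: float, n: int) -> float:
    """Same quantity obtained by iterating e_{k+1} = L * e_k + eps from e_0 = 0."""
    e = 0.0
    for _ in range(n):
        e = L * e + eps
    return e


def uniform_bound(eps: float, rho: float) -> float:
    """Uniform-in-time bound eps / (1 - rho); only meaningful for 0 <= rho < 1."""
    if not 0.0 <= rho < 1.0:
        raise ValueError("rho must lie in [0, 1) for a uniform-in-time bound")
    return eps / (1.0 - rho)


if __name__ == "__main__":
    # Expansive case (L > 1): the bound grows geometrically with the step count.
    print(accumulated_bound(0.01, 1.2, 30))
    # Contractive case (rho < 1): the bound stays below eps / (1 - rho) for every n.
    print(accumulated_bound(0.01, 0.9, 30), uniform_bound(0.01, 0.9))
```

With a contraction factor below one the geometric sum saturates, which is why a reduced spectral radius turns an exponentially growing bound into a bound that does not depend on the rollout length.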

2605.28900 2026-05-29 cs.LG 版本更新

Spectral Guidance for Flexible and Efficient Control of Diffusion Models

谱引导:灵活高效的扩散模型控制方法

Gabriel Moreira, Manuel Marques, João Paulo Costeira, Chenyan Xiong

发表机构 * Carnegie Mellon University(卡内基梅隆大学) Institute for Systems and Robotics(系统与机器人研究所)

AI总结 提出谱引导框架,通过条件期望算子的奇异函数学习生成过程的固有几何结构,实现无需重训练或反向传播的稳定高保真控制,在CIFAR-10上条件准确率提升37个百分点且采样加速4倍。

Comments ICML 2026

详情
AI中文摘要

我们引入了谱引导,这是一个通过利用生成过程的内在几何结构来控制扩散模型的框架。随着数据被噪声逐步破坏,只有少量特征对控制仍然具有信息量。我们将这些特征表征为条件期望算子的奇异函数,并表明它们可以通过自监督目标进行学习。一旦恢复,这个基可以将任意引导信号(如标签、CLIP嵌入或掩码)直接投影到采样轨迹上。这种方法允许在采样过程中无需重训练或去噪器反向传播即可实现稳定、高保真的控制。实验上,我们在CIFAR-10上将条件准确率比最强的无训练基线提高了37个百分点,同时提供了4倍的采样加速。此外,支持标签和CLIP引导的相同表示也实现了空间控制(例如基于掩码的引导),而无需辅助模型。最后,我们的框架揭示了生成过程中的一个相变,指出了有效引导的最佳时间窗口。

英文摘要

We introduce Spectral Guidance, a framework for controlling diffusion models by leveraging the intrinsic geometry of the generative process. As data is progressively corrupted by noise, only a small number of features remain informative for control. We characterize them as the singular functions of a conditional expectation operator and show that they can be learned via a self-supervised objective. Once recovered, this basis enables the projection of arbitrary guidance signals, such as labels, CLIP embeddings, or masks, directly onto the sampling trajectory. This approach allows for stable, high-fidelity control without retraining or denoiser backpropagation during sampling. Empirically, we improve conditional accuracy on CIFAR-10 by 37 percentage points over the strongest training-free baseline while offering $4\times$ faster sampling. Moreover, the same representations that support label and CLIP guidance also enable spatial control, such as mask-based guidance, without auxiliary models. Finally, our framework reveals a phase transition in the generative process, pinpointing the optimal time window for effective guidance.

2605.28896 2026-05-29 cs.LG 版本更新

Feature Geometry of LoRA Adapters: A Sparse Autoencoder Analysis of Representational Divergence in Fine-Tuned Language Models

LoRA适配器的特征几何:微调语言模型中表示差异的稀疏自编码器分析

Prasanth K K

发表机构 * Independent AI Safety Researcher(独立人工智能安全研究员)

AI总结 本研究使用稀疏自编码器分析LoRA微调引起的表示几何变化,发现LoRA特征字典与预训练特征存在弱几何对齐,且适配器特定SAE能更有效重建delta激活。

详情
AI中文摘要

低秩适配(LoRA)已成为适应大型语言模型的广泛采用方法,但LoRA微调引起的内部表示变化仍未被充分理解。在这项工作中,我们使用稀疏自编码器(SAE)研究LoRA诱导表示的几何结构。我们引入了一个delta激活框架,该框架隔离了适配器对残差流的特定贡献。使用Gemma-2-9B和LoRA秩4、8、16和32,我们在多个Transformer层上训练适配器特定的SAE,并将它们学习的特征空间与预训练的SAE字典进行比较。我们使用解码器方向之间的余弦相似度、特征子空间的主角分析以及激活表示之间的中心核对齐(CKA)来评估表示对齐。跨层和秩,我们一致观察到LoRA诱导的特征字典与预训练SAE特征之间的几何对齐相对较弱。适配器特定的SAE也比预训练SAE更有效地重建delta激活,这表明LoRA更新在残差流内占据了部分不同的表示结构。此外,特征密度随秩和深度增加,而几何差异在各秩之间保持相对稳定。这些发现提供了经验证据,表明LoRA微调可以诱导出预训练可解释性字典未完全捕获的特征结构,对微调语言模型的机制可解释性、适配分析和安全审计具有启示意义。

英文摘要

Low-Rank Adaptation (LoRA) has emerged as a widely adopted approach for adapting large language models, yet the internal representational changes induced by LoRA fine-tuning remain insufficiently understood. In this work, we investigate the geometry of LoRA-induced representations using Sparse Autoencoders (SAEs). We introduce a delta activation framework that isolates the adapter-specific contribution to the residual stream. Using Gemma-2-9B with LoRA ranks 4, 8, 16, and 32, we train adapter-specific SAEs across multiple transformer layers and compare their learned feature spaces with pretrained SAE dictionaries. We evaluate representational alignment using cosine similarity between decoder directions, principal-angle analysis of feature subspaces, and Centered Kernel Alignment (CKA) between activation representations. Across layers and ranks, we consistently observe comparatively weak geometric alignment between LoRA-induced feature dictionaries and pretrained SAE features. Adapter-specific SAEs also reconstruct delta activations more effectively than pretrained SAEs, suggesting that LoRA updates occupy partially distinct representational structure within the residual stream. Additionally, feature density increases with rank and depth, while geometric divergence remains relatively stable across ranks. These findings provide empirical evidence that LoRA fine-tuning can induce feature structures that are not fully captured by pretrained interpretability dictionaries, with implications for mechanistic interpretability, adaptation analysis, and safety auditing of fine-tuned language models.
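A minimal sketch of two quantities named in this abstract: the delta activation (adapter-specific contribution to the residual stream) and cosine similarity between decoder directions. It assumes the delta is a plain elementwise difference between fine-tuned and base activations, and that alignment is summarized as the best match per adapter direction. Both are assumptions for illustration; the paper's exact definitions are not given here, and all function names are hypothetical.

```python
import math


def delta_activation(tuned, base):
    """Assumed definition: fine-tuned residual activation minus base activation."""
    return [t - b for t, b in zip(tuned, base)]


def cosine(u, v):
    """Cosine similarity of two vectors; returns 0.0 if either vector is zero."""
    dot = sum(x * y for x, y in zip(u, v))
    nu = math.sqrt(sum(x * x for x in u))
    nv = math.sqrt(sum(y * y for y in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def max_cosine_alignment(adapter_dirs, pretrained_dirs):
    """For each adapter decoder direction, the highest cosine to any pretrained direction."""
    return [max(cosine(a, p) for p in pretrained_dirs) for a in adapter_dirs]


if __name__ == "__main__":
    print(delta_activation([1.5, 2.0], [1.0, 2.5]))
    print(max_cosine_alignment([[1.0, 0.0], [1.0, 1.0]], [[0.0, 1.0], [2.0, 0.0]]))
```

Under this summary, values near 1 would mean an adapter feature is already present in the pretrained dictionary, while uniformly low values correspond to the weak geometric alignment the abstract reports.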

2605.28890 2026-05-29 cs.CR cs.LG 版本更新

Echoes within the Reasoning: Stealthy and Effective Watermarking via Chain of Thought

推理中的回声:通过思维链实现隐蔽且有效的数字水印

Jiacheng Lu, Yiming Li, Tao Song, Weijian Wang, Wenjie Qu, Haibing Guan, Jiaheng Zhang

发表机构 * School of Computer Science, Shanghai Jiao Tong University, Shanghai, China(上海交通大学计算机科学学院) Nanyang Technological University, Singapore(南洋理工大学) National University of Singapore, Singapore(新加坡国立大学)

AI总结 提出BiCoT框架,通过将水印嵌入推理轨迹的内部几何结构,并利用基于Top-logprob的黑盒验证器RSR,在不影响推理保真度的前提下实现鲁棒的水印检测。

Comments This paper is accepted by ICML2026

详情
AI中文摘要

具有思维链推理能力的大型语言模型代表有价值的知识产权,然而现有的黑盒水印方法通常通过扰动最终答案或依赖脆弱的触发模式,在鲁棒性和推理保真度之间进行权衡。我们提出BiCoT,一种水印框架,通过将高显著性结构锚点与私有签名子空间对齐,同时正则化普通控制令牌以保留语义容量,将所有权信号嵌入推理轨迹的内部几何结构。这种设计将水印与推理相关表示耦合,使得在不破坏支持连贯推理的特征的情况下难以移除水印。为了在模型窃取和表示漂移下实现验证,我们引入鲁棒子空间注册(RSR),一种基于Top-logprob的黑盒验证器,使用哨兵令牌校准输出分布中的系统性偏移。实验表明,BiCoT在多种复杂推理任务中保持推理保真度,同时在微调、量化、模型级扰动以及自适应输出级攻击下,在域内和域外设置中实现鲁棒检测。

英文摘要

Large Language Models with Chain-of-Thought reasoning capabilities represent valuable intellectual property, yet existing black-box watermarking methods often trade robustness for reasoning fidelity by perturbing final answers or relying on fragile trigger patterns. We propose BiCoT, a watermarking framework that embeds ownership signals into the internal geometry of reasoning traces by aligning high-saliency structural anchors with a private signature subspace while regularizing ordinary control tokens to preserve semantic capacity. This design couples the watermark with reasoning-relevant representations, making removal difficult without disrupting the features that support coherent reasoning. To enable verification under model theft and representation drift, we introduce Robust Subspace Registration (RSR), a Top- logprob-based black-box verifier that uses sentinel tokens to calibrate systematic shifts in the output distribution. Experiments show that BiCoT preserves reasoning fidelity across diverse complex reasoning tasks while achieving robust detection under fine-tuning, quantization, model-level perturbations, and adaptive output-level attacks across in-domain and out-of-distribution settings.

2605.28889 2026-05-29 cs.LG cs.AI 版本更新

Context Distillation as Latent Memory Management

上下文蒸馏作为潜在记忆管理

Ziyang Zheng, Zeju Li, Xiangyu Wen, Jianyuan Zhong, Junhua Huang, Lei Chen, Mingxuan Yuan, Qiang Xu

发表机构 * The Chinese University of Hong Kong(香港中文大学) Huawei Noah’s Ark Lab(华为诺亚实验室)

AI总结 将上下文蒸馏视为潜在记忆管理问题,通过独立LoRA适配器形成模块化记忆库,并利用自门控机制决定是否激活潜在记忆,以提升检索鲁棒性和效率。

详情
AI中文摘要

上下文蒸馏将上下文信息压缩到模型参数中,然而现有方法常常忽略多个蒸馏后的潜在记忆应如何在非预言机设置下存储、检索和安全激活。我们将上下文蒸馏表述为一个潜在记忆管理问题。我们将每个上下文蒸馏成一个独立的LoRA适配器,形成一个模块化记忆库,从而实现显式的记忆选择。给定一个查询,我们的框架检索候选记忆,将查询路由到最合适的适配器,并使用自门控机制决定是否应激活潜在记忆。为了提高效率,我们进一步引入缓存共享以减少推理过程中的管理开销。实验表明,我们的方法在检索方面显著优于基线,而自门控通过停用不必要的潜在记忆提高了鲁棒性。

英文摘要

Context distillation compresses contextual information into model parameters, yet existing methods often ignore how multiple distilled latent memories should be stored, retrieved, and safely activated in non-oracle settings. We formulate context distillation as a latent memory management problem. We distill each context into an independent LoRA adapter, forming a modular memory bank that enables explicit memory selection. Given a query, our framework retrieves candidate memories, routes the query to the most suitable adapter, and uses a Self-Gating mechanism to decide whether latent memory should be activated. To improve efficiency, we further introduce cache sharing to reduce management overhead during inference. Experiments show that our method substantially outperforms baselines with retrieval, while Self-Gating improves robustness by deactivating unnecessary latent memories.

2605.28888 2026-05-29 cs.IR cs.LG 版本更新

Generative Spatiotemporal Intent Sequence Recommendation via Implicit Reasoning in Amap

高德地图中基于隐式推理的生成式时空意图序列推荐

Sicong Wang, Ruiting Dong, Yue Liu, Bowen Zheng, Jun Meng, Jie Li, Shuaijun Guo, Yu Gu, Fanyi Di, Xin Li

发表机构 * AMAP, Alibaba Group(阿里集团地图(AMAP))

AI总结 提出GPlan框架,通过渐进式隐式思维链蒸馏和时空反事实DPO,将LLM推理能力压缩至轻量模型,实现低延迟且符合时空约束的意图序列生成。

Comments 9 pages, 1 figure

详情
AI中文摘要

现实世界中的用户行为很少由孤立动作组成;相反,它通常形成由时空依赖关系支配的意图流。为了提供集成服务推荐,我们聚焦于生成式时空意图序列推荐(GSISR)任务,旨在生成在复杂时空上下文中逻辑连贯且物理可执行的意图序列。虽然LLMs为GSISR提供了强大的推理潜力,但直接工业部署受到高推理延迟以及上下文不匹配或物理不可行计划的限制。为应对这些挑战,我们提出生成式框架GPlan,通过两个组件将LLM推理内化到轻量模型中。首先,为了在严格的延迟约束下实现推理,我们引入渐进式隐式思维链蒸馏,将显式推理过程压缩到保留的潜在令牌中,使小模型能够继承复杂的规划逻辑而无需生成长推理文本。其次,为了解决通用知识与现实世界约束之间的脱节,我们设计了时空反事实DPO。通过将模型与反事实上下文-计划对对齐,我们提高了对时空上下文的敏感性并减少了上下文不匹配的计划。离线实验和在线A/B测试表明,我们的方法提高了序列连贯性和上下文响应性。我们的实现和匿名化的GSISR数据集可在https://github.com/alibaba/GPlan获取。

英文摘要

Real-world user behavior rarely consists of isolated actions; instead, it often forms intent flows governed by spatiotemporal dependencies. To provide integrated service recommendations, we focus on the task of Generative Spatiotemporal Intent Sequence Recommendation (GSISR), which aims to generate intent sequences that are logically coherent and physically executable within complex spatiotemporal contexts. While LLMs offer strong reasoning potential for GSISR, direct industrial deployment is limited by high inference latency and context-mismatched or physically infeasible plans. To address these challenges, we propose a generative framework, GPlan, that internalizes LLM reasoning into lightweight models through two components. First, to enable reasoning under strict latency constraints, we introduce Progressive Implicit CoT Distillation, which compresses explicit reasoning processes into reserved latent tokens, allowing small models to inherit complex planning logic without generating long reasoning text. Second, to address the disconnect between general knowledge and real-world constraints, we design Spatiotemporal Counterfactual DPO. By aligning the model with counterfactual context-plan pairs, we improve sensitivity to spatiotemporal context and reduce context-mismatched plans. Offline experiments and online A/B testing demonstrate that our approach improves sequence coherence and context responsiveness. Our implementation and the anonymized GSISR dataset are available at https://github.com/alibaba/GPlan.
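For reference, the standard DPO loss on a single preference pair can be sketched as below. This is the generic DPO objective, not the paper's Spatiotemporal Counterfactual variant: how GPlan builds its counterfactual context-plan pairs is not described in the abstract. The value beta=0.1 and the function name are assumptions.

```python
import math


def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Generic DPO loss -log(sigmoid(margin)) for one (chosen, rejected) pair.

    All inputs are sequence log-probabilities: the first two under the policy
    being trained, the last two under the frozen reference model.
    """
    margin = beta * ((logp_chosen - ref_logp_chosen) - (logp_rejected - ref_logp_rejected))
    # Numerically stable evaluation of log(1 + exp(-margin)).
    if margin >= 0.0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy equal to the reference: the loss is log(2).
    print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
    # Policy prefers the chosen plan more than the reference does: lower loss.
    print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

In a counterfactual setup, one plausible construction is to treat the same plan as chosen under its original context and rejected under an altered context, so that the loss rewards sensitivity to the context rather than to the plan text alone.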

2605.28886 2026-05-29 q-bio.QM cs.LG 版本更新

Computational Modeling of Antibody-Antigen Complexes: PLM-Based and MSA-Based Approaches

抗体-抗原复合物的计算建模:基于PLM和基于MSA的方法

Xiao Luo

发表机构 * Toyota Technological Institute at Chicago(丰田技术研究所(芝加哥))

AI总结 本研究探讨抗体相关任务计算困难的原因,并提出基于蛋白质语言模型(PLM)和多重序列比对(MSA)的两种互补改进方法,以提升抗体-抗原结构预测精度。

Comments PhD thesis

详情
AI中文摘要

抗体通过特异性识别和中和抗原在免疫反应中发挥核心作用,治疗性抗体已成为癌症和自身免疫疾病的主要药物。然而,其发现仍依赖大量体外筛选,而抗体结构和抗体-抗原相互作用的准确计算建模可以优先候选、减少实验负担并加速理性设计。尽管近年来高精度蛋白质和复合物预测取得了进展,但与一般蛋白质-蛋白质相互作用相比,抗体相关任务仍存在持续的性能差距,限制了下游设计。 本论文研究了为何抗体相关任务更困难,并沿两个互补方向提出改进。首先,我们研究了基于蛋白质语言模型(PLM)的抗体及抗体-抗原结构预测方法。利用多个PLM的嵌入,我们的方法在抗体单体预测中达到了所比较的PLM方法中最高的CDR-H3精度。将其扩展到复合物预测时未能泛化:由于缺乏抗体和抗原之间的共进化信号,单序列PLM表示无法可靠识别结合界面。 其次,我们针对抗体-抗原复合物预测开发了两种基于MSA的干预措施:MSA精炼,结合了CDR聚焦过滤和从更大序列数据库恢复深度;以及收敛感知循环,选择稳定的中间循环状态用于最终扩散采样。这些干预措施在保留的抗体-抗原测试集上相对于AlphaFold3基线提供了一致的增益。由于这些方法修改了MSA构建和循环行为而非模型参数,它们无需重新训练或权重访问即可应用。

英文摘要

Antibodies play a central role in the immune response by specifically recognizing and neutralizing antigens, and therapeutic antibodies have become major drugs for cancer and autoimmune diseases. However, their discovery still relies on extensive in vitro screening, and accurate computational modeling of antibody structures and antibody-antigen interactions can prioritize candidates, reduce experimental burden, and accelerate rational design. Despite recent advances in high-accuracy protein and complex prediction, a persistent performance gap remains for antibody-related tasks compared with general protein-protein interactions, limiting downstream design. This thesis investigates why antibody-related tasks are harder and proposes improvements along two complementary directions. First, we investigate protein language model (PLM)-based methods for antibody and antibody-antigen structure prediction. Using embeddings from multiple PLMs, our approach achieves the best CDR-H3 accuracy among compared PLM-based methods on antibody monomer prediction. Extending it to complex prediction does not generalize: without co-evolutionary signals between antibody and antigen, single-sequence PLM representations do not reliably identify binding interfaces. Second, we develop two MSA-based interventions for antibody-antigen complex prediction: MSA refinement, which combines CDR-focused filtering with depth recovery from a larger sequence database, and convergence-aware recycling, which selects a stable intermediate recycle state for final diffusion sampling. Together, these interventions provide consistent gains over the AlphaFold3 baseline on a held-out antibody-antigen test set. Because the methods modify MSA construction and recycling behavior rather than model parameters, they apply without retraining or weight access.

2605.28880 2026-05-29 cs.LG physics.data-an stat.ME 版本更新

Towards Continuous-time Causal Foundation Models

迈向连续时间因果基础模型

Dennis Thumm, Ruben Wiedemann, Ying Chen

发表机构 * National University of Singapore(新加坡国立大学) Imperial College London(帝国理工学院伦敦分校) Department of Mathematics, Centre for Quantitative Finance, Risk Management Institute, National University of Singapore(新加坡国立大学数学系、量化金融中心、风险管理研究所)

AI总结 提出轨迹律对观测时间表不变的连续性准则,通过细网格积分与解耦观测实现连续时间因果先验模型,并在线性与非线性先验上验证其优于离散方法。

Comments ICML 2026 2nd Workshop on Foundation Models for Structured Data (FMSD)

详情
AI中文摘要

将时间序列的离散时间因果先验数据拟合网络扩展到连续时间,需要将机制写为随机微分方程(SDE)——但如果SDE在每个观测间隔内只积分一次,轨迹律依赖于观测时间,先验仍然是披着SDE外衣的离散时间马尔可夫模型。我们提出了一个精确的连续性准则——轨迹律对观测时间表的不变性——以及一个三层分类法(离散;朴素观测网格积分;细网格积分与解耦观测),并在具有OU或小型MLP非线性漂移、不规则观测时间表以及硬/软/时变干预的随机DAG上实现了顶层。一个2×2编码器×积分器消融实验,在线性和非线性先验上独立运行,发现细网格积分在8/8个单元上优于朴素积分(符号一致性p<1/256),且随着评估网格细化差距增大;编码器轴在细积分下无效,而在朴素积分下具有时间感知优势。我们发布了该先验以及一个在药代动力学和物理系统数据上的初步零样本协议。

英文摘要

Extending discrete-time causal Prior-data Fitted Networks for time series to continuous time invites writing the mechanism as a stochastic differential equation (SDE) -- but if the SDE is integrated once per observation gap, the trajectory law depends on when it is observed, and the prior remains a discrete-time Markov model in SDE clothing. We propose a precise continuity criterion -- trajectory-law invariance to the observation schedule -- together with a three-tier taxonomy (discrete; naive observation-grid integration; fine-grid integration with decoupled observation) and a construction realising the top tier on a random DAG with OU or small-MLP nonlinear drifts, irregular observation schedules, and hard / soft / time-varying interventions. A $2 \times 2$ encoder $\times$ integrator ablation, run independently on a linear and a nonlinear prior, finds fine-grid integration beats naive on 8/8 cells (sign-consistency $p < 1/256$) with the gap growing as the eval grid refines; the encoder axis is null with fine integration but time-aware-leading with naive. We release the prior and a preliminary zero-shot protocol on pharmacokinetic and physical-system data.
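The schedule dependence described in this abstract can be illustrated with a deliberately simplified sketch. It keeps only the deterministic drift of an OU process (dx/dt = -theta * x, noise omitted) and integrates it with explicit Euler steps. The value of theta, the two observation schedules and the step counts are illustrative assumptions, not the paper's configuration.

```python
import math


def euler_mean(x0, theta, times, substeps=1):
    """Euler-integrate dx/dt = -theta * x across `times`, using `substeps` steps per gap."""
    x = x0
    for t0, t1 in zip(times, times[1:]):
        h = (t1 - t0) / substeps
        for _ in range(substeps):
            x += -theta * x * h
    return x


theta = 1.5
coarse = [0.0, 1.0]                    # one observation gap
dense = [0.0, 0.25, 0.5, 0.75, 1.0]    # four observation gaps over the same horizon

# Naive tier: one integration step per observation gap, so the state at t = 1
# depends on how often the trajectory was observed.
naive_gap = abs(euler_mean(1.0, theta, coarse) - euler_mean(1.0, theta, dense))

# Fine-grid tier: the step size (0.001 here) is fixed independently of the schedule.
fine_coarse = euler_mean(1.0, theta, coarse, substeps=1000)
fine_dense = euler_mean(1.0, theta, dense, substeps=250)
fine_gap = abs(fine_coarse - fine_dense)

if __name__ == "__main__":
    print(naive_gap, fine_gap, abs(fine_coarse - math.exp(-theta)))
```

Once the integration grid is decoupled from the observation times, both schedules yield the same state at the final time, which is the invariance the continuity criterion asks for. For a linear OU drift an exact transition is available, but for nonlinear drifts a fine grid is the generic route.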

2605.28873 2026-05-29 cs.LG 版本更新

Pre-Registering the Detectable Effect: A Paired-MDE Budget for 4-bit Quantization Benchmarks, with a Pilot Audit

预注册可检测效应:面向4位量化基准的配对MDE预算,附带一项试点审计

Zexin Zhuang, Yanhang Li, Zhichao Fan

发表机构 * Southern Methodist University(南方卫理公会大学) Northeastern University(东北大学) University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校)

AI总结 本文提出一种配对最小可检测效应(MDE)边界公式,用于量化基准的可靠性评估,并通过试点审计验证其有效性。

详情
AI中文摘要

这是一篇带有非配对试点审计的规划方法说明。我们将经典的配对二项样本量计算(Miettinen, 1968)应用于量化基准,给出了在配对项目数$m$和FP16-NF4不一致率$ρ_d$下的保守最小可检测效应(MDE)边界$δ^{*} \le (z_{1-α/2}+z_{1-β})\sqrt{ρ_d/m}$。该边界将“我的量化声明有多可靠?”转化为基准设计者在运行前可以承诺的一行预算。我们在四个模型和四个基准($n=100$的$k=5$次分割)上展示了该边界,并添加了一项并行的MMLU提示模板研究,以将边界的量化噪声尺度与提示噪声尺度进行比较。假设$ρ_d=0.10$(一个未测量的规划值),所有观察到的NF4-FP16差异均低于隐含的MDE,且大多数跨分割标准差落在二项参考$\sqrt{p(1-p)/n}$的$\pm 1.5$个百分点内,因此在$n=100$子样本上报告为“基准不可靠性”的大部分方差是二项抽样噪声。唯一的边界单元格(OPT-WinoGrande,$|Δ|=3.2$个百分点)在$ρ_d=0.10$时低于隐含MDE,但在$ρ_d=0.05$时高于它,说明了该边界明确的规划权衡。在MMLU上,提示模板范围2-10个百分点达到或超过了最大的观察量化差异(3.2个百分点),因此未先固定提示模板的量化审计会将模板方差吸收到其噪声基底中。我们用一个五行预注册模板补充了该边界。

英文摘要

This is a planning-method note with an unpaired pilot audit. We adapt the classical paired-binary sample-size calculation (Miettinen, 1968) to quantization benchmarks, giving a conservative minimum detectable effect (MDE) bound $δ^{*} \le (z_{1-α/2}+z_{1-β})\sqrt{ρ_d/m}$ in the paired item count $m$ and the FP16-NF4 disagreement rate $ρ_d$. The bound turns "how reliable is my quantization claim?" into a one-line budget a benchmark designer can commit to before running. We illustrate the bound on four models and four benchmarks ($k=5$ splits of $n=100$), and add a parallel MMLU prompt-template study to put the bound's quantization-noise scale alongside the prompt-noise scale. Assuming $ρ_d=0.10$ (an unmeasured planning value), all observed NF4-FP16 deltas fall below the implied MDE, and most cross-split SDs lie within $\pm 1.5$ pp of the binomial reference $\sqrt{p(1-p)/n}$, so much of the variance reported as "benchmark unreliability" on $n=100$ subsamples is binomial sampling noise. The single borderline cell (OPT-WinoGrande, $|Δ|=3.2$ pp) is below the implied MDE at $ρ_d=0.10$ but above it at $ρ_d=0.05$, illustrating the planning trade-off the bound makes explicit. On MMLU, prompt-template ranges of 2-10 pp meet or exceed the largest observed quantization delta (3.2 pp), so a quantization audit that does not first fix the prompt template absorbs template variance into its noise floor. We complement the bound with a five-line pre-registration template.
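The bound in this abstract is simple enough to evaluate directly. The sketch below assumes a two-sided significance level alpha = 0.05, power 0.80, and a paired item count m = 500 (pooling the k = 5 splits of n = 100). None of these three values is stated in the abstract, so they should be read as planning assumptions, and the function name is hypothetical.

```python
from statistics import NormalDist


def paired_mde(rho_d: float, m: int, alpha: float = 0.05, power: float = 0.80) -> float:
    """Conservative bound (z_{1-alpha/2} + z_{1-beta}) * sqrt(rho_d / m), with power = 1 - beta."""
    z = NormalDist().inv_cdf
    return (z(1.0 - alpha / 2.0) + z(power)) * (rho_d / m) ** 0.5


if __name__ == "__main__":
    for rho_d in (0.10, 0.05):
        mde_pp = 100.0 * paired_mde(rho_d, m=500)
        print(f"rho_d={rho_d:.2f}: MDE is about {mde_pp:.2f} pp")
```

Under these assumed settings the bound is about 3.96 pp at a disagreement rate of 0.10 and about 2.80 pp at 0.05, so an observed delta of 3.2 pp falls between the two, which matches the borderline behaviour the abstract describes. A different alpha, power or m would move both thresholds.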

2605.28870 2026-05-29 cs.LG cs.AI 版本更新

Representation Alignment Rests on Linear Structure

表示对齐依赖于线性结构

Kiril Bangachev, Guy Bresler, Yury Polyanskiy

发表机构 * Massachusetts Institute of Technology(麻省理工学院)

AI总结 本文通过信号、偏差和噪声的三部分统计框架研究柏拉图表示假说,提出对齐源于对象与属性的线性关系,并通过稀疏自编码器提取线性特征、中心化和归一化减少偏差、以及数据稀缺导致噪声等证据支持该框架。

详情
AI中文摘要

我们通过表示的三部分统计框架研究柏拉图表示假说(PRH):信号、偏差和噪声。1) 信号:我们提出柏拉图对齐源于对象与属性之间的普遍关系,这种关系根据线性表示假说(LRH)在线性上编码。我们通过稀疏自编码器提取线性对象-属性特征,并展示这些稀疏表示通常比其稠密对应物表现出更强的跨模态对齐,从而提供证据表明LRH有助于解释PRH。2) 偏差:由于使用的不同架构和训练过程,模型具有不同的隐式偏差。我们表明这种差异可以部分缓解。中心化和归一化一致地改善跨模型对齐。3) 噪声:有限样本训练导致表示中的噪声。我们通过揭示词频与对齐之间在LLM和文本嵌入模型中的强且一致的正相关,提供证据表明表示噪声由数据稀缺驱动。综合信号、偏差和噪声,我们提出一个统计模型,该模型细化线性表示假说,并解释与现代AI架构中出现的表示对齐相关的进一步现象。

英文摘要

We investigate the Platonic Representation Hypothesis (PRH) through a tripartite statistical framework of representations: signal, bias, and noise. 1) Signal: We propose that Platonic alignment arises from the universal relationship between objects and attributes, which is encoded linearly in representations according to the Linear Representation Hypothesis (LRH). We provide evidence that LRH helps explain PRH by extracting linear object-attribute features with sparse autoencoders and showing that these sparse representations often exhibit stronger cross-modal alignment than their dense counterparts. 2) Bias: Models have different implicit biases due to the diverse architectures and training procedures used. We show that this difference can be partially mitigated. Centering and normalization consistently improve cross-model alignment. 3) Noise: Finite-sample training leads to noise in representations. We provide evidence that representational noise is driven by data scarcity by revealing a strong and consistent positive correlation between word frequency and alignment in LLMs and text embedding models. Synthesizing signal, bias, and noise, we propose a statistical model that refines the Linear Representation Hypothesis and explains further phenomena related to the alignment of representations emerging from diverse modern AI architectures.

2605.28869 2026-05-29 cs.LG cs.AI 版本更新

Balancing Multimodal Learning through Label Space Reshaping

通过标签空间重塑平衡多模态学习

Xiaoyu Ma, Weijie Zhang, Yuanhao Gao, Han Miao, Yongjian Deng, Hao Chen

AI总结 针对多模态学习中模态不平衡问题,提出基于标签空间重塑的BMLR方法,通过均衡各模态映射难度来提升多模态性能。

Comments In process

详情
AI中文摘要

多模态学习常受模态不平衡问题困扰,其中收敛较快的模态主导优化,而其他模态训练不足。现有方法通常通过加强弱模态或调整优化梯度来缓解此问题。然而,这些策略主要补偿优化速率差异,往往以牺牲强模态的优化能力为代价,而未从模态层面分析这些差异如何产生。基于理论洞察和实证观察,我们认为学习速度的差异源于模态特定特征空间与共享标签空间之间映射难度的不同。为解决此问题,我们提出了平衡多模态标签重塑(BMLR),这是首个从标签侧设计促进多模态平衡的方法。BMLR重塑跨模态标签空间以均衡各模态的映射难度,从而促进模态交互并为每个模态注入更丰富的类间信息。跨多种架构的大量实验表明,BMLR持续提升多模态性能,并与多种模型设计表现出强兼容性。源代码即将发布。

英文摘要

Multimodal learning often suffers from modality imbalance, where modalities that converge faster dominate optimization while others remain undertrained. Existing approaches typically mitigate this issue by strengthening the weak modality or adjusting optimization gradients. However, such strategies mainly compensate for optimization rate discrepancies, often at the expense of the strong modality's optimization capacity, without analyzing how these discrepancies arise at the modality level. Based on theoretical insights and empirical observations, we argue that the discrepancy of learning pace arises from differences in the mapping difficulty between modality-specific feature space and the shared label space. To address this issue, we propose Balanced Multimodal Label Reshaping (BMLR), the first method that promotes multimodal balance from the label-side design. BMLR reshapes the cross-modal label space to equalize mapping difficulty across modalities, thereby facilitating modality interaction and injecting richer inter-class information into each modality. Extensive experiments across multiple architectures demonstrate that BMLR consistently improves multimodal performance and exhibits strong compatibility with diverse model designs. The source code will be released soon.

2605.28868 2026-05-29 cs.LG cs.AI 版本更新

TaxDistill: Improving Metagenomic Taxonomic Annotation via Distilled Genomic Foundation Models

TaxDistill:通过蒸馏基因组基础模型改进宏基因组分类注释

Rongye Ye, Lun Li, Zheng Luo, Yiran Zhan, Shuhui Song

发表机构 * National Genomics Data Center, China National Center for Bioinformation(中国生物信息中心国家基因组数据中心) Beijing Key Laboratory of Intelligent Governance and Application of Biological Big Data, China National Center for Bioinformation(北京生物大数据智能治理与应用重点实验室,中国生物信息中心) Beijing Institute of Genomics, Chinese Academy of Sciences(北京基因组研究所,中国科学院) University of Chinese Academy of Sciences(中国科学院大学)

AI总结 提出TaxDistill知识蒸馏框架,利用500M参数的基因组基础模型GenomeOcean作为教师网络生成软标签,以减轻初始检索工具引入的标签噪声,从而提升宏基因组序列分类性能。

Comments The manuscript contains 14 pages, 7 figures, and 3 tables

详情
AI中文摘要

宏基因组分类注释旨在识别环境样本中DNA片段的微生物起源。依赖序列相似性的传统方法通常受到高微生物多样性和参考数据库不完整性的限制,这推动了诸如Taxometer等学习方法的发展,这些方法通过事后校正来学习更具信息量的宏基因组序列表示。然而,这些方法通常依赖于训练期间从相似性搜索工具获得的标签,这不可避免地引入了噪声,从而损害表示学习并降低分类性能。为了解决这个问题,我们提出了TaxDistill,一种用于宏基因组分类的知识蒸馏框架。我们引入GenomeOcean,一个500M参数的基因组基础模型,作为教师网络来提取深层语义特征并基于置信度生成软标签。通过将这些软标签信息蒸馏到轻量级学生网络中,TaxDistill有效减少了初始检索工具引入的标签噪声。在七个不同的CAMI2数据集上的全面实验表明,TaxDistill在大多数场景下优于现有基线。例如,在胃肠道数据集上,它将MMseqs2的F1分数从0.763提高到0.941,优于Taxometer基线。总体而言,TaxDistill为复杂宏基因组分析中的标签校正提供了一种可靠的方法。

英文摘要

Metagenomic taxonomic annotation aims to identify the microbial origins of DNA fragments in environmental samples. Traditional methods that rely on sequence similarity are often constrained by the high microbial diversity and the incompleteness of reference databases, which has motivated the development of learning approaches such as Taxometer that perform post hoc correction to learn more informative metagenomic sequence representations. However, these methods typically rely on labels derived from similarity search tools during training, which inevitably introduces noise that can impair representation learning and degrade classification performance. To address this issue, we propose TaxDistill, a knowledge distillation framework for metagenomic classification. We introduce GenomeOcean, a 500M parameter genomic foundation model, as the teacher network to extract deep semantic features and generate soft labels based on confidence. By distilling this soft label information into a lightweight student network, TaxDistill effectively reduces the label noise introduced by initial retrieval tools. Comprehensive experiments on seven diverse CAMI2 datasets demonstrate that TaxDistill outperforms existing baselines in most scenarios. For instance, on the Gastrointestinal dataset, it improves the F1 score of MMseqs2 from 0.763 to 0.941, outperforming the Taxometer baseline. Overall, TaxDistill provides a reliable method for label correction in complex metagenomic analysis.

2605.28867 2026-05-29 cs.LG cs.AI 版本更新

PrismFlow: Residual Dynamics for Flow Matching in Time-Series Generation

PrismFlow: 时间序列生成中流匹配的残差动力学

Junru Zhang, Lang Feng, Jinbo Wang, Xu Guo, Yucheng Wang, Han Yu, Min Wu, Yabo Dong, Duanqing Xu

发表机构 * Zhejiang University, China(浙江大学,中国) Nanyang Technological University, Singapore(南洋理工大学,新加坡) I2R, Agency for Science, Technology and Research (A*STAR), Singapore(科技研究局(A*STAR)新加坡研究所,新加坡)

AI总结 提出PrismFlow方法,通过Koopman启发的动力学专家和置信度感知的胜者全得目标,在流匹配中学习残差修正,以解决标准流匹配中全局向量场估计器导致的频谱失真和模式覆盖不足问题,在时间序列生成中取得最优性能。

详情
AI中文摘要

生成高质量时间序列数据具有挑战性,因为现实世界的信号通常表现出多模态模式和多尺度动力学,包括振荡和高频变化。流匹配(FM)为扩散模型提供了一种高效的替代方案,但实际实现通常依赖于单个有限容量的全局向量场估计器。在这种异质的时间分布中,不同的状态可能通过邻近的流状态,同时需要不相容的条件速度。使用标准$\ell_2$速度匹配目标训练的单一估计器可能学习到局部传输场的过度平滑近似。这种估计器级别的平滑会减弱分支特定的动力学,导致频谱失真和较差的模式覆盖。为了解决这个问题,我们提出了PrismFlow,一种新的具有Koopman启发动力学专家的FM方法。每个专家在一个潜在空间中学习残差修正,其中局部非线性时间演化可以通过线性变换近似。我们进一步提出了一种置信度感知的胜者全得(WTA)目标,该目标仅更新与每个样本最对齐的专家,同时屏蔽其他专家的梯度,鼓励模式特定专业化。在采样过程中,选定的专家向全局传输场添加残差动力学修正,在保持FM稳定性的同时恢复细粒度和高频时间结构。在各种基准测试中,PrismFlow有效缓解了标准FM中的频谱收缩,并实现了最先进的性能,Context-FID提升了15.6%,判别分数提升了38.6%,同时在低数据设置下保持鲁棒性,并有效用于预测和插补。

英文摘要

Generating high-quality time-series data is challenging because real-world signals often exhibit multimodal patterns and multiscale dynamics, including oscillations and high-frequency variations. Flow Matching (FM) offers an efficient alternative to diffusion models, but practical implementations typically rely on a single finite-capacity global vector-field estimator. In such heterogeneous temporal distributions, distinct regimes may pass through nearby flow states while requiring incompatible conditional velocities. A monolithic estimator trained with the standard $\ell_2$ velocity-matching objective may therefore learn an overly smoothed approximation of the local transport field. This estimator-level smoothing can attenuate branch-specific dynamics, leading to spectral distortion and poor mode coverage. To address this, we propose PrismFlow, a new FM method with Koopman-inspired dynamical experts. Each expert learns residual corrections in a latent space where local nonlinear temporal evolution can be approximated by linear transitions. We further propose a confidence-aware Winner-Take-All (WTA) objective that updates only the expert best aligned with each sample while masking gradients to the others, encouraging mode-specific specialization. During sampling, the selected expert adds a residual dynamical correction to the global transport field, preserving FM stability while recovering fine-grained and high-frequency temporal structures. Across various benchmarks, PrismFlow effectively mitigates the spectral contraction in standard FM and achieves state-of-the-art performance, with a 15.6% gain in Context-FID and a 38.6% improvement in Discriminative Score, while remaining robust in low-data settings and effective for forecasting and imputation.
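A plain Winner-Take-All selection over expert predictions can be sketched as follows. This is the generic WTA idea only: the confidence-aware weighting, the Koopman-inspired latent space and the residual parameterization described in the abstract are not reproduced, and the function names are hypothetical.

```python
def wta_select(target, expert_preds):
    """Return (winner index, per-expert squared errors, 0/1 gradient mask)."""
    errors = [
        sum((p - t) ** 2 for p, t in zip(pred, target))
        for pred in expert_preds
    ]
    winner = min(range(len(errors)), key=errors.__getitem__)
    mask = [1.0 if i == winner else 0.0 for i in range(len(errors))]
    return winner, errors, mask


def wta_loss(target, expert_preds):
    """Loss seen by the optimizer: only the best-aligned expert contributes."""
    _, errors, mask = wta_select(target, expert_preds)
    return sum(m * e for m, e in zip(mask, errors))


if __name__ == "__main__":
    target_velocity = [1.0, 0.0]
    preds = [[0.0, 0.0], [0.9, 0.1], [2.0, 2.0]]
    print(wta_select(target_velocity, preds))
    print(wta_loss(target_velocity, preds))
```

Because the mask zeroes every expert except the winner, each sample pulls on a single expert, which is what lets experts specialize on distinct regimes instead of averaging incompatible velocities.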

2605.28866 2026-05-29 cs.LG cs.AI 版本更新

Continuity and Ordinality Matter: Constraining Time Series Tokens for Effective Time Series Analysis with Large Language Models

连续性与序数性至关重要:利用大语言模型进行有效时间序列分析的时间序列令牌约束

Musheng Li, Ziying Zhang, Cheng jin, Yuantao Gu

发表机构 * Department of Electronic Engineering(电子工程系)

AI总结 针对令牌化时间序列大语言模型忽略连续性和序数性的问题,提出COM策略,通过几何约束初始化与训练阶段,提升模型在多个时间序列分析基准上的性能与泛化能力。

详情
AI中文摘要

基于令牌的时间序列大语言模型(TS-LLMs)已成为时间序列分析与推理的一个有前景的方向。然而,先前的研究在很大程度上忽略了时间序列令牌固有的连续性和序数性,这严重限制了模型性能。在本文中,我们认为在时间序列令牌嵌入中保留这些属性对于基于令牌的TS-LLMs的有效性至关重要。为此,我们提出了COM(连续性与序数性至关重要),这是一种连续性和序数性感知策略,将几何约束整合到初始化阶段和训练阶段。在多个时间序列分析基准上的实证结果表明,COM持续提升了基于令牌的TS-LLMs的性能,取得了有竞争力的结果和强大的泛化能力。代码可在 https://anonymous.4open.science/r/COM 获取。

英文摘要

Token-based time series large language models (TS-LLMs) have emerged as a promising direction for time series analysis and reasoning. However, prior studies largely overlook the inherent continuity and ordinality of time series tokens, which substantially limits model performance. In this paper, we argue that preserving these properties in time series token embeddings is crucial for the effectiveness of token-based TS-LLMs. To this end, we propose COM (Continuity and Ordinality Matter), a continuity- and ordinality-aware strategy that integrates geometric constraints into both the initialization and training stages. Empirical results on multiple time series analysis benchmarks demonstrate that COM consistently improves the performance of token-based TS-LLMs, achieving competitive results and strong generalizability. Code is available at https://anonymous.4open.science/r/COM .

2605.28865 2026-05-29 cs.LG cs.AI 版本更新

Emergent Semantic Representations in World Models through Physical Interaction without Linguistic Supervision

无需语言监督的物理交互中世界模型中的涌现语义表征

Jiayi Fang

发表机构 * Independent Researcher(独立研究者)

AI总结 通过无语言监督的物理探索训练VAE世界模型,发现其潜在空间自发形成与物理几何结构对齐的语义结构,且预测性能与语义对齐共同提升,验证了物理几何作为世界模型表征的组织原则。

Comments 10 pages, 3 figures

详情
AI中文摘要

世界模型从物理探索中学习到什么,没有任何语言监督?我们认为答案由单一原则组织:物理世界的几何结构。在随机具身探索上训练基于VAE的世界模型,我们发现其潜在空间发展出反映物理几何的空间语义结构——方向准确率0.677±0.029对比随机初始化编码器的0.547,位置RSA 0.192±0.047对比随机编码器的0.029(提升6.6倍),表明训练诱导了超越CNN归纳偏置的真正结构组织。在20个时间检查点上,预测性能和语义对齐共同提升(Spearman r=-0.61, p=0.004),与共享驱动者解释一致。我们通过双重敲除确认:标准KL正则化(beta=0.1)迫使编码器远离几何结构,预测性能和语义对齐同时崩溃至接近随机水平(第50,000步),完全符合共享驱动者预测。将beta降至0.001可恢复几何访问并同时恢复两种能力。这些发现确立了物理世界几何作为世界模型表征的组织原则,对设计语义基础的具身智能体具有直接意义。

英文摘要

What does a world model learn from physical exploration, without any linguistic supervision? We argue the answer is organized by a single principle: the geometric structure of the physical world. Training a VAE-based world model on random embodied exploration, we find that its latent space develops spatial semantic structure that mirrors physical geometry -- direction accuracy 0.677±0.029 versus 0.547 for a randomly initialized encoder, and position RSA 0.192±0.047 versus 0.029 for random encoders (6.6x improvement), showing that training induces genuine structural organization beyond CNN inductive bias. Across 20 temporal checkpoints, prediction performance and semantic alignment co-improve (Spearman r=-0.61, p=0.004), consistent with the shared-driver account. We confirm this through a double knockout: standard KL regularization (beta=0.1) forces the encoder away from geometric structure, and both prediction performance and semantic alignment collapse simultaneously to near-chance by step 50,000 -- exactly as the shared-driver account predicts. Reducing beta to 0.001 restores geometric access and recovers both capabilities together. These findings establish physical world geometry as the organizing principle of world model representations, with direct implications for the design of semantically grounded embodied agents.

2605.28863 2026-05-29 cs.LG cs.AI 版本更新

Self-Play Reinforcement Learning under Imperfect Information in Big 2

大二(Big 2)中不完全信息下的自我对弈强化学习

Aalok Patwa

发表机构 * University of Pennsylvania(宾夕法尼亚大学)

AI总结 本文提出一个自我对弈强化学习框架,在四人不完全信息纸牌游戏Big 2中比较策略梯度和值近似方法,发现PPO优于其他方法,并证明中等熵正则化和当前策略自我对弈的有效性。

Comments 11 pages

详情
AI中文摘要

不完全信息多人游戏测试智能体在隐藏信息、稀疏奖励和非平稳对手下的行动能力。我们在Big 2(一个四人不完全信息纸牌游戏)中研究这些挑战。我们为Big 2开发了一个自我对弈强化学习框架,能够对策略梯度和值近似智能体进行受控比较。在共同的环境、输入表示、训练预算和评估协议下,PPO在对抗随机、贪婪和启发式Big 2对手时优于蒙特卡洛Q近似、SARSA和Q学习。我们进一步发现,适度的熵正则化通过防止策略变得过于确定性来改进PPO,并且当前策略自我对弈比检查点自我对弈或固定对手训练提供了更强的有限预算课程。这些结果共同表明,Big 2是研究不完全信息、多人交互、延迟奖励和可变动作集下深度强化学习的一个有用的受控环境。

英文摘要

Imperfect-information multiplayer games test whether agents can act under hidden information, sparse rewards, and non-stationary opponents. We study these challenges in Big 2, a four-player imperfect-information card game. We develop a self-play RL framework for Big 2 that enables controlled comparisons between policy-gradient and value-approximating agents. Under a common environment, input representation, training budget, and evaluation protocol, PPO outperforms Monte Carlo Q approximation, SARSA, and Q-learning against random, greedy, and heuristic Big 2 opponents. We further find that moderate entropy regularization improves PPO by preventing the policy from becoming overly deterministic, and that current-policy self-play provides a stronger finite-budget curriculum than checkpoint self-play or fixed-opponent training. Together, these results show that Big 2 is a useful controlled setting for studying deep RL under imperfect information, multiplayer interaction, delayed rewards, and variable action sets.
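Two ingredients named in the abstract, variable action sets and entropy regularization, can be sketched for a single decision as follows. This is an illustrative sketch under stated assumptions, not the paper's implementation: the clipping range `clip_eps`, the coefficient `ent_coef` and all function names are hypothetical.

```python
import numpy as np

def masked_softmax(logits, legal):
    """Softmax over legal actions only; illegal actions get probability 0."""
    masked = np.where(legal, logits, -np.inf)
    masked = masked - masked.max()
    exp = np.exp(masked)
    return exp / exp.sum()

def entropy(probs):
    """Shannon entropy of a policy, ignoring zero-probability actions."""
    p = probs[probs > 0]
    return float(-(p * np.log(p)).sum())

def ppo_objective(ratio, advantage, probs, clip_eps=0.2, ent_coef=0.01):
    """Clipped PPO surrogate for one sample plus an entropy bonus (to be maximized)."""
    clipped = float(np.clip(ratio, 1.0 - clip_eps, 1.0 + clip_eps))
    surrogate = min(ratio * advantage, clipped * advantage)
    return surrogate + ent_coef * entropy(probs)

# Four candidate plays, of which the second is illegal in the current state.
legal = np.array([True, False, True, True])
probs = masked_softmax(np.zeros(4), legal)
print(probs, entropy(probs), ppo_objective(1.5, 1.0, probs))
```

The entropy term rewards spreading probability across legal plays, which is the mechanism the abstract credits with keeping the policy from becoming overly deterministic.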

2605.28862 2026-05-29 cs.LG q-bio.QM 版本更新

Molecular Lead Optimization via Agentic Tool Planning

通过智能体工具规划进行分子先导优化

Lingxiao Li, Haobo Zhang, Ruohao Fan, Bin Chen, Jiayu Zhou

发表机构 * University of Michigan(密歇根大学) University of California, Davis(加州大学戴维斯分校) Michigan State University(密歇根州立大学)

AI总结 提出TRACE,一种轨迹感知的LLM推理智能体,将先导优化建模为序列决策问题,通过工具选择实现结构约束下的前瞻性分子优化,在ADMET任务中优于基线。

Comments 12 pages

详情
AI中文摘要

药物发现是一个漫长且资源密集的过程,由多个阶段组成。其中,先导优化在将早期命中化合物转化为可行的候选药物中起着关键作用。这一阶段需要通过细微的结构修饰来改善ADMET相关性质,同时保留负责与疾病靶点结合亲和力的关键分子子结构。人工智能的最新进展在加速药物发现的各个方面显示出前景;然而,大多数现有的先导优化方法依赖于一步式分子优化,未能考虑序列设计决策的长期后果。为了解决这一限制,我们提出了TRACE,一种用于分子先导优化的轨迹感知、LLM推理智能体,它将工具选择形式化为一个关于动作轨迹的序列决策问题。给定一个先导分子和一个优化目标,TRACE在分子优化工具上做出轨迹感知的决策,从而在结构约束下实现前瞻性优化。在多个ADMET优化任务上的实验表明,与基线模型相比,我们的智能体实现了更高的优化成功率、更大的性质改进和更高的有效性,同时保持了分子相似性。

英文摘要

Drug discovery is a lengthy and resource-intensive process composed of multiple stages. Among these stages, lead optimization plays a critical role in transforming early hit compounds into viable drug candidates. This stage requires improving ADMET-related properties through subtle structural refinement while preserving key molecular substructures responsible for binding affinity to disease targets. Recent advances in artificial intelligence have shown promise in accelerating various aspects of drug discovery; however, most existing approaches to lead optimization rely on one-step molecular optimization, which fails to account for the long-term consequences of sequential design decisions. To address this limitation, we propose TRACE, a trajectory-aware, LLM-reasoning agent for molecular lead optimization that formulates tool selection as a sequential decision-making problem over action trajectories. Given a lead molecule and an optimization objective, TRACE makes trajectory-aware decisions over molecular optimization tools, enabling forward-looking refinement under structural constraints. Experiments on multiple ADMET optimization tasks show that our agent achieves higher optimization success, larger property improvements, and higher validity than baseline models, while preserving molecular similarity.

2605.28861 2026-05-29 cond-mat.str-el cond-mat.dis-nn cs.LG 版本更新

Comment on "Spin-1/2 Kagome Heisenberg Antiferromagnet: Machine Learning Discovery of the Spinon Pair-Density-Wave Ground State"

评论:自旋-1/2 Kagome海森堡反铁磁体:通过机器学习发现自旋子对密度波基态

Helia Kamal, Dominik Kufel, DinhDuy Vu, Chris R. Laumann, Norman Y. Yao

发表机构 * Department of Physics, Harvard University, Cambridge, MA 02138, USA(哈佛大学物理系) Department of Physics, Boston University, Boston, MA 02215, USA(波士顿大学物理系)

AI总结 指出使用群等变卷积神经网络研究kagome海森堡反铁磁体基态时,由于Metropolis-Hastings采样中单自旋翻转更新导致遍历性破缺,使得报告的低能态是伪影,而采用自旋交换更新后网络收敛能量高于DMRG结果,质疑原文结论。

Comments 3 pages, 1 figure; Comment on arXiv:2401.02866

详情
AI中文摘要

最近的一篇文章[Phys. Rev. X 15, 011047 (2025)]利用群等变卷积神经网络研究了kagome海森堡反铁磁体的基态。在迄今为止研究的最大的有限尺寸团簇($N=108$)上,作者报告了显著低于其他数值方法(包括最先进的密度矩阵重正化群(DMRG)计算)的变分能量。与先前暗示可能存在自旋液体基态的结果相反,作者观察到了自旋子对密度波基态。我们发现:(i)报告的低能量是Metropolis-Hastings采样中遍历性破缺的伪影,因为作者使用的单自旋翻转更新规则实际上冻结了马尔可夫链;(ii)当通过自旋交换更新强制执行遍历采样时,神经网络收敛到显著高于现有DMRG结果的能量,这使该论文的主张受到质疑。

英文摘要

A recent article [Phys. Rev. X 15, 011047 (2025)] utilizes group-equivariant convolutional neural networks to study the ground state of the kagome Heisenberg antiferromagnet. On the largest finite-size cluster studied to date ($N=108$), the authors report variational energies significantly lower than other numerical methods, including state-of-the-art density matrix renormalization group (DMRG) calculations. In contrast to previous results suggesting a possible spin-liquid ground state, the authors observe a spinon pair-density-wave ground state. We find that: (i) the reported low energies are artifacts of broken ergodicity in the Metropolis-Hastings sampling, since the single-spin-flip update rule utilized by the authors effectively freezes the Markov chains; and (ii) when ergodic sampling is enforced via spin-exchange updates, the neural network converges to energies significantly higher than existing DMRG results, calling the paper's claims into question.
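The two Metropolis-Hastings proposal rules contrasted above differ in one easily checked property: a single-spin flip changes the total magnetization, while a spin exchange conserves it. The sketch below shows only that difference on a chain of $N=108$ spins encoded as ±1. It is an illustration under simplifying assumptions (no lattice geometry, no acceptance step, exchange between any opposite-spin pair) and does not reproduce the Comment's analysis.

```python
import numpy as np

def single_spin_flip(spins, rng):
    """Propose a new configuration by flipping one randomly chosen spin."""
    new = spins.copy()
    i = rng.integers(len(new))
    new[i] = -new[i]
    return new

def spin_exchange(spins, rng):
    """Propose a new configuration by swapping one up spin with one down spin."""
    new = spins.copy()
    up = np.flatnonzero(new == 1)
    down = np.flatnonzero(new == -1)
    if len(up) == 0 or len(down) == 0:
        return new  # fully polarized: nothing to exchange
    i, j = rng.choice(up), rng.choice(down)
    new[i], new[j] = new[j], new[i]
    return new

rng = np.random.default_rng(0)
spins = np.array([1, -1] * 54)           # N = 108, total magnetization 0
flipped = single_spin_flip(spins, rng)
exchanged = spin_exchange(spins, rng)
print(spins.sum(), flipped.sum(), exchanged.sum())
```

If a variational state has negligible weight outside one magnetization sector, proposals that leave the sector are almost always rejected, which is one way a chain can appear frozen.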

2605.28858 2026-05-29 cs.CE cs.LG math-ph math.MP 版本更新

An End-to-End PyTorch Interface for Differentiable PDE Solvers: A RANS Model-Correction Study

可微PDE求解器的端到端PyTorch接口:一项RANS模型校正研究

Luca Saverio, Michele Alessandro Bucci, Gianmarco Farro, Cédric Content, Denis Sipp

发表机构 * Digital Sciences & Technologies Department, Safran Tech, Magny-Les-Hameaux, 78114, France; MONHADE, équipe INRIA-ONERA, DSG, ONERA, Institut Polytechnique de Paris, Palaiseau, 91120, France

AI总结 提出一个端到端可微机器学习框架,通过将PDE作为隐层集成到PyTorch中,优化参数化校正项,用于数据同化和闭合建模,并在可压缩流RANS方程上验证。

详情
AI中文摘要

本工作提出了一种在完全可微的机器学习框架内求解偏微分方程约束反问题的端到端策略。所提出的公式提供了一种统一且用户友好的方法,适用于从数据同化到闭合建模的广泛问题。我们的方法结合了一个基线可微PDE求解器(从非线性系统$R(w) = 0$预测状态$w$)和一个通用的加性、参数化、可微校正$f_ϕ(w)$,其可训练参数为$ϕ$。我们展示了如何通过将PDE重新表述为隐层,将其集成到任意目标函数中,同时利用PyTorch的自动微分图,在完全可微的Python工作流中优化$ϕ$。该方法在可压缩流的雷诺平均纳维-斯托克斯方程上进行了演示,其中闭合项或其一部分使用可训练参数或神经网络建模。第一个应用考虑了二维NASA壁装驼峰测试案例,其中产生项参数针对时间平均LES数据进行了优化。第二个应用在VKI LS-59涡轮叶片上进行,其中通过优化可训练空间场重建了Spalart-Allmaras涡粘性场。使用可微BROADCAST求解器和Spalart-Allmaras湍流模型,从VKI LS-59涡轮叶片几何形状生成数据集。结果突出了该框架的灵活性,展示了其超越湍流建模,适用于更广泛的物理信息PDE约束问题(具有数据驱动组件)的适用性。

英文摘要

This work presents an end-to-end strategy for solving inverse problems constrained by Partial Differential Equations within a fully differentiable Machine Learning framework. The proposed formulation provides a unified and user-friendly methodology applicable to a wide range of problems, from data assimilation to closure modeling. Our approach combines a baseline differentiable PDE solver, which predicts the state $w$ from the nonlinear system $R(w) = 0$, with a generic additive, parametrized, and differentiable correction $f_ϕ(w)$, with trainable parameters $ϕ$. We show how to optimize $ϕ$ within a fully differentiable Python workflow by reformulating the PDE as an implicit layer, enabling its integration into arbitrary objective functions, while leveraging PyTorch's automatic differentiation graph. The method is demonstrated on the Reynolds-Averaged Navier-Stokes equations for compressible flows, where the closure term, or a portion of it, is modeled using trainable parameters or a Neural Network. The first application considers the 2D NASA Wall-Mounted Hump test case, where a production-term parameter is optimized against time-averaged LES data. A second application is carried out on the VKI LS-59 turbine blade, where the Spalart-Allmaras eddy viscosity field is reconstructed through the optimization of a trainable spatial field. A dataset is generated starting from the VKI LS-59 turbine blade geometry using the differentiable BROADCAST solver with the Spalart-Allmaras turbulence model. The results highlight the flexibility of the framework, showing its applicability beyond turbulence modeling to a broader class of physics-informed PDE-constrained problems with data-driven components.
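The "implicit layer" idea can be illustrated on a scalar toy problem. If the corrected state satisfies $G(w, ϕ) = R(w) + f_ϕ(w) = 0$, the implicit function theorem gives $dw/dϕ = -(\partial G/\partial w)^{-1}\,\partial G/\partial ϕ$, so the gradient of an objective can be obtained without differentiating through the solver iterations. The sketch below is plain Python rather than PyTorch, and the residual $R(w) = w^3 + w - 1$ and correction $f_ϕ(w) = ϕ w$ are hypothetical choices made only for this example.

```python
def residual(w, phi):
    """Toy corrected residual G(w, phi) = R(w) + f_phi(w), with R and f_phi assumed."""
    return w ** 3 + w - 1.0 + phi * w

def solve_state(phi, w0=0.5, tol=1e-13, max_iter=100):
    """Newton iterations for G(w, phi) = 0 (stands in for the PDE solver)."""
    w = w0
    for _ in range(max_iter):
        step = residual(w, phi) / (3.0 * w ** 2 + 1.0 + phi)
        w -= step
        if abs(step) < tol:
            break
    return w

def loss_and_grad(phi, w_target):
    """Objective J = (w - w_target)^2 and dJ/dphi via the implicit function theorem."""
    w = solve_state(phi)
    dG_dw = 3.0 * w ** 2 + 1.0 + phi
    dG_dphi = w
    dw_dphi = -dG_dphi / dG_dw
    loss = (w - w_target) ** 2
    return loss, 2.0 * (w - w_target) * dw_dphi

loss, grad = loss_and_grad(phi=0.3, w_target=0.2)
print(loss, grad)
```

In a PyTorch workflow the same formula would typically sit in the backward pass of a custom autograd function, with the scalar division replaced by a linear solve against the Jacobian of the residual.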

2605.28854 2026-05-29 cs.CL cs.LG q-bio.NC 版本更新

Large language models reorganize representational geometry during in-context learning

大型语言模型在上下文学习中重组表征几何结构

Hua-Dong Xiong, Li Ji-An, Robert C. Wilson, Kwonjoon Lee, Xue-Xin Wei

发表机构 * School of Psychological and Brain Sciences, Georgia Tech(佐治亚理工学院心理与脑科学学院) Department of Psychology, New York University(纽约大学心理学系) Center of Excellence for Computational Cognition, Georgia Tech(佐治亚理工学院计算认知卓越中心) Honda Research Institute(本田研究院) Departments of Neuroscience and Psychology, The University of Texas at Austin(德克萨斯大学奥斯汀分校神经科学与心理学系)

AI总结 研究大型语言模型在上下文学习中的表征几何重组,发现其性能与任务表征结构相关,并通过原型算法动态调整表征以提高可分性。

详情
AI中文摘要

大型语言模型(LLMs)表现出显著的灵活性:它们可以从上下文示例中适应新任务,而无需任何参数更新,这种能力被称为上下文学习(ICL)。先前关于合成任务的研究表明,ICL可以实现特定算法,展示了架构能力,并且机制分析已经识别出支持这种行为的关键回路。然而,由于上下文计算——无论其算法形式如何——依赖于高维表征空间中的变换,该空间的几何结构如何塑造ICL的有效性仍不清楚。受神经科学中将分类视为神经表征解缠的观点启发,我们假设ICL依赖于任务相关表征的成功在线解缠。为了验证这一想法,我们研究了LLMs如何对上下文示例进行分类,这些示例的标签由模型自身具有已知结构的内部表征定义。我们表明,ICL性能与底层分类任务的表征结构系统性相关,并且成功的ICL伴随着几何重组,增加了在线可分性。我们进一步发现,LLM的行为可以通过一种原型类算法很好地描述,该算法在重塑表征以支持分类的同时整合证据。这些发现为预训练LLMs中的ICL提供了几何解释,将表征几何结构确立为ICL的机制约束,并量化了预训练表征所能提供的与上下文学习所能利用之间的差距。

英文摘要

Large language models (LLMs) exhibit remarkable flexibility: they can adapt to novel tasks from in-context examples without any parameter updates, a capability known as in-context learning (ICL). Prior work on synthetic tasks has shown that ICL can implement specific algorithms, demonstrating architectural competence, and mechanistic analyses have identified key circuits that support this behavior. However, because in-context computation -- regardless of its algorithmic form -- relies on transformations in high-dimensional representation space, it remains unclear how the geometry of that space shapes ICL effectiveness. Motivated by the neuroscience view of classification as the untangling of neural representations, we hypothesize that ICL depends on the successful online untangling of task-relevant representations. To test this idea, we study how LLMs classify in-context examples whose labels are defined by the model's own internal representations with known structure. We show that ICL performance correlates systematically with the representational structure of the underlying classification task and that successful ICL is accompanied by geometric reorganization that increases online separability. We further find that LLM behavior is well described by a prototype-like algorithm that integrates evidence while reshaping representations to support classification. These findings offer a geometric account of ICL in pretrained LLMs, establish representational geometry as a mechanistic constraint on ICL, and quantify the gap between what pretrained representations afford and what in-context learning can exploit.

2605.28853 2026-05-29 q-fin.PM cs.LG 版本更新

Financially Guided Deep Portfolio Optimization

财务引导的深度投资组合优化

Rahul Fernandes, Travis Desell

发表机构 * Department of Software Engineering(软件工程系) Rochester Institute of Technology(罗切斯特理工学院)

AI总结 提出一个端到端框架,通过直接优化夏普比率、Omega比率、条件风险价值(CVaR)和风险平价等关键财务指标的微分代理,利用神经网络学习投资组合权重,在2007-2023年50只标普500股票上,最佳模型(AttentionLSTM结合Omega-CVaR-RiskParity损失)在2022-2023年样本外测试中实现年化夏普比率0.29和总复合收益+7.86%,超越标普500指数12.38个百分点。

详情
AI中文摘要

由于非平稳性、噪声数据和高交易成本,现实金融市场中的投资组合优化极其困难。标准的预测-然后优化方法首先预测收益,然后求解权重,这加剧了预测误差,并且常常在制度转换下失败。我们提出一个端到端框架,直接优化关键财务指标——夏普比率、Omega比率、条件风险价值(CVaR)和风险平价——的可微代理,使得神经网络能够通过反向传播学习投资组合权重。我们的扩展窗口滚动前向程序,应用于2007年至2023年的50只标普500股票,包含了现实的买卖价差成本,并每季度再平衡。在具有挑战性的样本外测试期(2022-2023年),最佳模型——使用Omega-CVaR-RiskParity损失的AttentionLSTM——实现了年化夏普比率0.29和总复合收益+7.86%,而标普500指数总收益为-4.52%,年化夏普比率为-0.02。这比标普500指数高出12.38个百分点(相对改进超过270%),同时保持尾部风险(CVaR)几乎不变。该框架持续优于等权重投资组合、标普500指数以及传统方法(MVP、HRP、NCO),表明将财务目标直接嵌入模型训练能够在不利市场条件下产生稳健、经济上有意义的超额收益。

英文摘要

Portfolio optimization in real-world financial markets is notoriously difficult due to non-stationarity, noisy data, and high transaction costs. Standard predict-then-optimize methods first forecast returns and then solve for weights, compounding prediction errors and often failing under regime shifts. We propose an end-to-end framework that directly optimizes differentiable surrogates of key financial metrics (Sharpe ratio, Omega ratio, Conditional Value-at-Risk (CVaR), and Risk Parity), allowing neural networks to learn portfolio weights via backpropagation. Our expanding-window walk-forward procedure, applied to 50 S&P 500 stocks from 2007 to 2023, incorporates realistic bid-ask spread costs and rebalances quarterly. On the challenging out-of-sample test period (2022-2023), the best model, an AttentionLSTM with the Omega-CVaR-RiskParity loss, achieves an annualized Sharpe of 0.29 and a total compounded return of +7.86%, while the S&P 500 delivers -4.52% total return and an annualized Sharpe of -0.02. This outperforms the S&P 500 by 12.38 percentage points (a relative improvement of over 270%), while keeping tail risk (CVaR) nearly unchanged. The framework consistently outperforms the equal-weight portfolio, S&P 500, and traditional methods (MVP, HRP, NCO), demonstrating that embedding financial objectives directly into model training yields robust, economically meaningful outperformance even in adverse market conditions.

2605.28851 2026-05-29 astro-ph.EP astro-ph.IM cs.LG physics.ao-ph 版本更新

Towards a Foundation Model for the Martian Atmosphere

火星大气基础模型

Sujit Roy, Udayshankar Nair, Yuling Wu, Georgios Priftis, Liping Wang, Anastasia Georgiou, Anne Jones, Björn Lütjens, Johannes Schmude, Campbell Watson, Rachel A. Slank, Ankur Kumar, Anirbit Mukherjee, Procheta Sen, Ramin Lolachi, Haonan Chen, Manil Maskey, Juan Bernabé-Moreno, Rahul Ramachandran

发表机构 * Earth System Science Center, University of Alabama in Huntsville(阿拉巴马大学亨茨维尔分校地球系统科学中心) NASA Marshall Space Flight Center(美国宇航局马歇尔空间飞行中心) Department of Electrical & Computer Engineering, Colorado State University(科罗拉多州立大学电气与计算机工程系) Science and Technology Institute/Universities Space Research Association (USRA)(科学与技术研究所/大学空间研究协会) Department of Computer Science, The University of Manchester(曼彻斯特大学计算机科学系) School of Computer Science, University of Liverpool(利物浦大学计算机科学学院) Center for Space Sciences and Technology, University of Maryland, Baltimore County(马里兰大学巴尔的摩分校空间科学与技术中心) NASA Goddard Space Flight Center(美国宇航局戈达德空间飞行中心) Center for Research and Exploration in Space Science and Technology, NASA/GSFC(空间科学与技术研究与探索中心,NASA/GSFC) IBM Research(IBM研究院)

AI总结 针对火星大气数据稀疏、计算成本高等挑战,本文探讨了构建数据驱动基础模型的设计空间,包括可用数据、物理模型、下游应用及AI方法。

详情
AI中文摘要

火星大气中存在从行星尺度沙尘暴到中尺度地形云和夜间低空急流等动力学现象。全球环流模型能够模拟这些现象,但在解析中尺度特征所需的分辨率下计算成本高昂。虽然卫星遥感观测的同化使得利用此类模型进行预报成为可能,但观测记录通常稀疏、短暂且分散在不同仪器代际之间。这些限制促使我们开发数据驱动的火星大气基础模型。 基础模型处于复杂的设计空间中。可用数据、底层过程的物理特性以及人工智能的相应发展之间存在相互作用。尽管基础模型旨在以数据和计算高效的方式处理多个用例,但明确单个模型能够合理解决哪些应用至关重要。 本文旨在阐明这一设计空间。我们讨论了从大气反演到再分析数据集以及现有物理模型的可用数据。此外,我们识别了广泛的候选下游应用。最后,我们考虑了在此背景下可以利用的人工智能(AI)相关最新进展。这里,我们特别关注用于大气物理的AI模型、数据驱动的数据同化方法以及在有限数据环境下工作的技术。

英文摘要

The Martian atmosphere hosts dynamical phenomena ranging from planet-encircling dust storms to mesoscale orographic clouds and nocturnal low-level jets. General circulation models can simulate these phenomena, but are computationally expensive at the resolution needed to resolve mesoscale features. While assimilation of satellite remote sensing observations enables forecasting with such models, the observation record is often sparse, short, and fragmented across instrument generations. These constraints motivate the development of a data-driven foundation model for the Martian atmosphere. Foundation models live in a complex design landscape. There is an interplay between the available data, the physics of the underlying processes, and corresponding developments in AI. Even though the idea of a foundation model is to address multiple use cases in a data- and compute-efficient manner, it is important to have a clear picture of which applications can sensibly be addressed by a single model. The purpose of this paper is to elucidate this design landscape. We discuss available data ranging from atmospheric retrievals to reanalysis datasets, as well as existing physical models. Moreover, we identify a wide range of candidate downstream applications. Finally, we consider relevant recent developments in artificial intelligence (AI) that can be leveraged in this context. Here, we put particular emphasis on AI models for atmospheric physics, data-driven approaches to data assimilation, and methods that work in a limited-data setting.

2605.28844 2026-05-29 cs.NE cs.LG 版本更新

WASHH: An Anchor-Aware Whale-Guided Selection Hyper-Heuristic for Continuous Optimization and SVC Configuration

WASHH:一种用于连续优化和SVC配置的锚点感知鲸鱼引导选择超启发式算法

Yifu Zhao, Xiaofan Zou, Junhao Wei, Yanxiao Li, Baili Lu, Zhenhong Peng, Dexing Yao, Haochen Li, Qinbin He, Sio-Kei Im, Xu Yang, Yapeng Wang

发表机构 * Faculty of Applied Sciences, Macao Polytechnic University(澳门理工大学应用科学学院) School of Mechanical and Electrical Engineering and Automation, Shanghai University(上海大学机电工程与自动化学院) Pazhou Lab (Huangpu), Guangzhou(广州琶洲实验室(黄埔)) College of Animal Science and Technology, Zhongkai University of Agriculture and Engineering(仲恺农业工程学院动物科技学院) Macao Polytechnic University(澳门理工大学)

AI总结 提出WASHH超启发式算法,通过在线奖励控制器选择多种搜索行为,在连续优化和SVC超参数配置中取得最优平均排名和最低验证损失。

详情
AI中文摘要

学习辅助的算法设计通常必须在小的评估预算下做出可靠的搜索决策,而仅依赖单一元启发式算法可能不可靠。我们提出了WASHH,一种用于连续黑箱优化的鲸鱼引导自适应选择超启发式算法。WASHH使用WOA作为主要开发骨干,但将PSO风格记忆、GWO风格领导者平均、DE风格变异、局部坐标搜索和锚点引导细化视为可选择的搜索行为。在线奖励控制器根据观察到的改进分配评估,而锚点细化利用廉价参考配置(如箱中心或默认模型设置),而不绕过黑箱评估。在10个30维基准函数上,进行10次独立运行和12,000次评估,WASHH实现了最佳平均排名1.10,并在所有10个函数上达到最佳或并列最佳。它在8个函数上严格优于WOA,并在Rastrigin和Griewank函数上与WOA在数值最优值上持平。我们进一步研究了在300次评估预算下乳腺癌诊断的SVC超参数配置。WASHH在比较的优化器中获得了最低的平均验证对数损失,表明锚点感知选择超启发式算法是LEAD系统的一种实用轻量级方向。

英文摘要

Learning-assisted algorithm design often has to make reliable search decisions under small evaluation budgets, where committing to a single metaheuristic can be unreliable. We propose WASHH, a Whale-guided Adaptive Selection Hyper-Heuristic for continuous black-box optimization. WASHH uses WOA as the main exploitation backbone, but treats PSO-style memory, GWO-style leader averaging, DE-style variation, local coordinate search, and anchor-guided refinement as selectable search behaviors. An online reward controller allocates evaluations according to observed improvements, while anchor refinement exploits inexpensive reference configurations such as box centers or default model settings without bypassing black-box evaluation. On ten 30-dimensional benchmark functions with 10 independent runs and 12,000 evaluations, WASHH achieves the best average rank, 1.10, and is best or tied best on all ten functions. It strictly improves over WOA on eight functions and ties WOA at the numerical optimum on Rastrigin and Griewank. We further study SVC hyperparameter configuration for breast cancer diagnosis under a 300-evaluation budget. WASHH obtains the lowest mean validation log loss among the compared optimizers, suggesting that anchor-aware selection hyper-heuristics are a practical lightweight direction for LEAD systems.
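The "online reward controller" described above can be pictured as a credit-assignment loop over the selectable search behaviors. The sketch below is one simple way to do this, assuming exponentially smoothed credit and probability matching with a floor so that no behavior is starved; the abstract does not give the actual update rule, and the class name, `floor` and `decay` are hypothetical.

```python
import numpy as np

class RewardController:
    """Chooses a search behavior with probability tied to its recent improvement."""

    def __init__(self, operators, floor=0.05, decay=0.8):
        self.operators = list(operators)
        self.credit = np.ones(len(self.operators))
        self.floor = floor      # minimum selection probability per behavior
        self.decay = decay      # how strongly past credit is remembered

    def probabilities(self):
        share = (self.credit + 1e-12) / (self.credit + 1e-12).sum()
        k = len(self.operators)
        return self.floor + (1.0 - k * self.floor) * share

    def select(self, rng):
        return int(rng.choice(len(self.operators), p=self.probabilities()))

    def update(self, index, improvement):
        reward = max(float(improvement), 0.0)   # only improvements earn credit
        self.credit[index] = self.decay * self.credit[index] + (1 - self.decay) * reward

controller = RewardController(
    ["woa", "pso_memory", "gwo_leader_average", "de_variation",
     "local_coordinate", "anchor_refinement"]
)
for _ in range(20):
    controller.update(3, 5.0)   # pretend DE-style variation keeps improving the best value
    controller.update(0, 0.0)   # pretend the WOA step stalls
chosen = controller.select(np.random.default_rng(0))
print(controller.probabilities(), controller.operators[chosen])
```

Behaviors that keep producing improvements receive more of the evaluation budget, while the floor keeps stalled behaviors available in case the search landscape changes.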

2605.28843 2026-05-29 cs.DL cs.CY cs.LG 版本更新

The Biosecurity Blind Spot: Systematic Dual-use Detection in Open Science Infrastructure

生物安全盲点:开放科学基础设施中的系统性双重用途检测

Vasudha Sharma, Chakresh Kumar Singh, Jayesh Choudhari, Dharmit Nakrani

发表机构 * LophiLabs

AI总结 本研究通过混合词法过滤和大语言模型评估,系统分析了bioRxiv预印本中双重用途研究关注内容,揭示了开放获取摘要中普遍存在的潜在风险,并提出了结合元数据监控与开放科学原则的治理框架。

Comments Ongoing work

详情
AI中文摘要

人工智能以前所未有的速度改变着生命科学研究,加速了蛋白质结构预测、基因组建模和药物开发等领域的发现(Jumper et al., 2021; Mak et al., 2024)。然而,这种快速进步,加上开放科学运动,引入了重大的双重用途研究问题,但这些问题尚未得到充分的实证研究。本文首次对开放预印本服务器上的双重用途研究关注(DURC)内容进行了系统分析。我们使用词法过滤和大语言模型(LLM)评估的混合流程,筛选了约52,000篇bioRxiv预印本(2024-2025年),并根据美国及澳大利亚集团监管框架,对九个DURC类别、三个PEPP类别和五个治理类别的元数据进行了评分。我们的分析显示,双重用途相关的知识通常出现在公开可访问的标题和摘要中,即使在具有合法公共卫生目标的研究中,也常常超过既定的风险阈值。虽然这种映射捕捉了表面层面的信息扩散,但它并未衡量操作能力、下游滥用潜力或限制有害应用的重大技术和生物安全障碍。我们认为,机构审查流程、资助要求和预印本平台政策必须发展,以纳入主动的元数据级监控,同时不损害科学透明度。最终,将高风险方法学的受控访问机制与科学贡献的开放摘要相协调,为大规模治理AI加速生物学提供了实用框架。

英文摘要

AI is transforming life sciences research at unprecedented speed, accelerating discovery across protein structure prediction, genome modeling, and drug development (Jumper et al., 2021; Mak et al., 2024). Yet this rapid advancement, coupled with the open science movement, introduces significant dual-use research concerns that have received limited empirical scrutiny. Here we present the first systematic analysis of dual-use research of concern (DURC) content on open preprint servers. We screened ~52,000 bioRxiv preprints (2024-2025) using a hybrid pipeline of lexical filtering and large language model (LLM) evaluation, scoring metadata across nine DURC, three PEPP, and five governance categories aligned with U.S. and Australia Group oversight frameworks. Our analysis reveals that dual-use-adjacent knowledge is routinely present in openly accessible titles and abstracts, often exceeding established risk thresholds even in studies with legitimate public health objectives. While this mapping captures surface-level information diffusion, it does not measure operational capability, downstream misuse potential, or the substantial technical and biosafety barriers that constrain harmful application. We argue that institutional review processes, funding requirements, and preprint platform policies must evolve to incorporate proactive, metadata-level monitoring without compromising scientific transparency. Ultimately, harmonizing controlled-access mechanisms for high-risk methodologies with open summaries of scientific contributions offers a pragmatic framework for governing AI-accelerated biology at scale.

2605.28839 2026-05-29 cs.LG 版本更新

One Mask to Rule Them All: On Hidden Facts after Editing and How to Find Them

一掩蔽之:编辑后的隐藏事实及其发现方法

Ali Holmov, Paul Youssef, Nandi Schoots, Christin Seifert

发表机构 * Technical University of Munich(慕尼黑工业大学) Marburg University(马尔堡大学) University of Oxford(牛津大学)

AI总结 本文通过训练二进制掩码揭示ROME和MEMIT知识编辑方法依赖共同权重子集,并证明编辑抑制而非覆盖知识,导致无法传播至相关事实。

Comments Accepted to Findings of ACL 2026

详情
AI中文摘要

知识编辑方法(如ROME和MEMIT)通过修改MLP权重来更新Transformer模型中的事实关联。虽然主要通过输出行为进行评估,但其内部机制仍未得到充分探索。我们研究了编辑是否依赖于一种通用机制,无论修改哪个事实。尽管存在特定于事实的权重变化,我们认为ROME和MEMIT针对的是维持编辑所必需的相同权重子集。为了隔离这个子集,我们在编辑后的权重上训练了一个紧凑的二进制掩码。该掩码在训练集上逆转了80%的编辑,在测试集上逆转了超过70%,证实了不同的编辑共享一个共同的功能结构。我们的分析表明,掩码通过消除后期层中的过度注意力来逆转编辑。此外,我们展示了在编辑过程中注入掩码会将编辑成功率从98%降至38%,证明这种机制对于编辑成功是必要的。我们的发现——编辑抑制而非覆盖知识——解释了为什么ROME和MEMIT无法将变化传播到相关事实。所识别的共同功能子空间为检测和防御不想要的编辑提供了信息。

英文摘要

Knowledge editing methods such as ROME and MEMIT update factual associations in transformer models by modifying MLP weights. While evaluated mainly by output behavior, their internal mechanism remains underexplored. We investigate whether edits rely on a common mechanism, regardless of which fact is modified. Despite fact-specific weight changes, we argue that ROME and MEMIT target the same subset of weights critical for maintaining edits. To isolate this subset, we train a compact binary mask over the edited weights. The mask reverses 80% of edits on the training set and over 70% on the test set, confirming that diverse edits share a common functional structure. Our analysis reveals that the mask reverses edits by eliminating overattention in later layers. Additionally, we show that injecting the mask during editing drops editing success from 98% to 38%, demonstrating that this mechanism is necessary for edits to succeed. Our finding that edits suppress rather than overwrite knowledge explains why ROME and MEMIT fail to propagate changes to related facts. The identified common functional subspace informs detection and defense against unwanted edits.
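As background for "modifying MLP weights": ROME-style editing is usually described as a rank-one update that makes a chosen key vector map to a new value vector. The sketch below shows that update under simplifying assumptions (a single linear layer, and an identity key covariance unless one is supplied). It does not implement the binary mask studied in the paper, and `rank_one_edit` and its arguments are illustrative names.

```python
import numpy as np

def rank_one_edit(weight, key, value, key_cov=None):
    """Return W' = W + (v - W k) (C^{-1} k)^T / ((C^{-1} k)^T k), so that W' k = v."""
    d_in = weight.shape[1]
    cov = np.eye(d_in) if key_cov is None else key_cov
    direction = np.linalg.solve(cov, key)     # C^{-1} k
    residual = value - weight @ key           # what the layer currently gets wrong
    return weight + np.outer(residual, direction) / (direction @ key)

rng = np.random.default_rng(0)
weight = rng.normal(size=(6, 4))      # toy MLP projection: 4 inputs, 6 outputs
key = rng.normal(size=4)              # representation of the edited subject
value = rng.normal(size=6)            # representation encoding the new fact
edited = rank_one_edit(weight, key, value)
delta = edited - weight
print(np.allclose(edited @ key, value), np.linalg.matrix_rank(delta))
```

Because the change is a single outer product, every edit has the same low-rank form even though the key and value differ per fact, which gives some intuition for why different edits might share common structure.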

2605.28827 2026-05-29 cs.CL cs.LG 版本更新

RightNow-Arabic-0.5B-Turbo: An Open Sub-1B Arabic Language Model via Vocabulary Injection and Edge-First Deployment

RightNow-Arabic-0.5B-Turbo: 通过词汇注入和边缘优先部署的开源子10亿参数阿拉伯语语言模型

Jaber Jaber, Osama Jaber

发表机构 * RightNow AI

AI总结 针对现有阿拉伯语模型要么是多语言模型对阿拉伯语支持不足,要么是参数过大难以部署的问题,提出基于Qwen2.5-0.5B的518M参数阿拉伯语专用模型RightNow-Arabic-0.5B-Turbo,通过词汇注入、继续预训练和监督微调等方法,在三个阿拉伯语基准上达到35.9%平均准确率,与1.5B模型性能相当,并实现边缘端高效部署。

Comments 12 pages, 7 tables, 4 figures, 1 algorithm. Weights: https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo

详情
AI中文摘要

开源的阿拉伯语大语言模型分为两类:子10亿参数的多语言模型将阿拉伯语视为次要语言(如Qwen2.5-0.5B、Falcon-H1-0.5B),以及需要服务器运行的7B-70B阿拉伯语专用模型(如Jais、AceGPT、ALLaM、SILMA)。唯一已发表的子20亿参数阿拉伯语专用模型Kuwain-1.5B从未发布权重。我们提出RightNow-Arabic-0.5B-Turbo,一个基于Qwen2.5-0.5B构建的518M参数阿拉伯语专用解码器LLM。该流程通过均值子词初始化添加27,032个阿拉伯语token,在8xH100上使用FSDP、FlashAttention变长打包和Liger融合内核继续预训练504M阿拉伯语token,然后对129,116个阿拉伯语指令对应用仅响应损失掩码的有监督微调,对6,750个阿拉伯语偏好对应用直接偏好优化,并对三个检查点进行权重汤合并。在三个lm-evaluation-harness阿拉伯语基准(COPA-ar、Arabic HellaSwag、ArabicMMLU)上,合并模型达到35.9%的平均准确率,击败所有同类开源模型,在COPA-ar上与Falcon-H1-1.5B持平(58.4%)但规模仅为三分之一,并以1/18的参数恢复了SILMA-9B平均性能的67%。边缘构建量化至398 MB(q4_k_m),通过llama.cpp在单个H100上以批量大小1达到635 tokens/s。所有代码(25个脚本共5,555行)、权重(bf16、int8和四种GGUF量化)及基准测试脚本已在https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo开源。

英文摘要

Open Arabic large language models split into two classes: sub-1B multilingual models that treat Arabic as an afterthought (Qwen2.5-0.5B, Falcon-H1-0.5B), and 7B-70B Arabic-specialized models that require a server to run (Jais, AceGPT, ALLaM, SILMA). The one published attempt at a sub-2B Arabic-specialized model, Kuwain-1.5B, never released its weights. We present RightNow-Arabic-0.5B-Turbo, a 518M-parameter Arabic-specialized decoder LLM built on Qwen2.5-0.5B. The pipeline adds 27,032 Arabic tokens via mean-subtoken initialization, continues pretraining on 504M Arabic tokens on 8xH100 with FSDP, FlashAttention varlen packing, and Liger fused kernels, then applies supervised fine-tuning on 129,116 Arabic instruction pairs with response-only loss masking, direct preference optimization on 6,750 Arabic preference pairs, and weight soup merging across three checkpoints. On three lm-evaluation-harness Arabic benchmarks (COPA-ar, Arabic HellaSwag, ArabicMMLU) the merged model reaches 35.9% mean accuracy, beats every same-class open model, ties Falcon-H1-1.5B on COPA-ar (58.4%) at one-third the size, and recovers 67% of SILMA-9B's mean at 1/18 the parameters. The edge build quantizes to 398 MB (q4_k_m) and delivers 635 tokens/s at batch size 1 on a single H100 via llama.cpp. All code (5,555 lines across 25 scripts), weights (bf16, int8, and four GGUF quantizations), and benchmark scripts are released at https://huggingface.co/RightNowAI/RightNow-Arabic-0.5B-Turbo.
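"Mean-subtoken initialization" means each newly added token starts from the average of the embeddings of the pieces the original tokenizer would have split it into. A minimal sketch with NumPy follows; the array shapes and the function name are assumptions made for illustration, and a real pipeline would apply the same idea to the model's input (and possibly output) embedding matrices.

```python
import numpy as np

def mean_subtoken_init(embeddings, subtoken_ids_per_new_token):
    """Append one row per new token, set to the mean of its old subtoken rows."""
    new_rows = [embeddings[ids].mean(axis=0) for ids in subtoken_ids_per_new_token]
    return np.vstack([embeddings, np.stack(new_rows)])

# Toy vocabulary of 5 tokens with 3-dimensional embeddings.
old_embeddings = np.arange(15, dtype=float).reshape(5, 3)
# Two new tokens: the first used to split into ids [0, 2], the second into [1, 3, 4].
extended = mean_subtoken_init(old_embeddings, [[0, 2], [1, 3, 4]])
print(extended.shape)
print(extended[5], extended[6])
```

Starting new rows near the embeddings the model already uses for the same text tends to be gentler than random initialization, which is presumably why it is paired with continued pretraining here.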

2605.28551 2026-05-29 cs.CV cs.GR cs.LG 版本更新

Resolution-free neural surrogates for geometric parameterization and mapping with spatially varying fields

无分辨率依赖的几何参数化与映射神经替代模型:面向空间变化场

Yanwen Huang, Lok Ming Lui, Gary P. T. Choi

发表机构 * Department of Mathematics, The Chinese University of Hong Kong(香港中文大学数学系)

AI总结 提出一种无分辨率依赖的神经替代模型,通过多分辨率几何编码和几何感知约束(变分能量、扩散密度均衡、拟共形理论)无监督学习,直接从空间变化参数场预测映射位置,适用于任意结构化或非结构化点集。

详情
AI中文摘要

许多成像问题需要计算由空间变化的强度、特征或密度场引起的空间变换。典型例子包括畸变校正、可变形图像配准、基于图谱的分割以及变形驱动的图像分析。这些任务可以表述为几何映射问题,其中变换被约束以保持局部结构、控制边界行为或调节角度畸变。此类公式通常导致变分模型、扩散过程或椭圆偏微分方程。然而,当底层参数场在不同实例间变化时,重复求解高分辨率系统在计算上变得昂贵。在这项工作中,我们提出了一种无分辨率依赖的神经替代模型,用于几何参数化和映射问题。给定一个空间变化的参数场 $p:\Omega\to\mathbb{R}^m$ 和查询位置 $\{x_i\}_{i=1}^N\subset\Omega$,该模型预测任意结构化或非结构化点集上的映射位置 $\{u(x_i)\}_{i=1}^N$。为了避免对固定网格的依赖,我们采用了一种多分辨率几何编码策略,该策略将网络条件建立在参数场的坐标增强样本上。该模型通过强制执行源自变分能量、基于扩散的密度均衡和拟共形理论的几何感知约束进行训练,无需标记解数据。在拟共形映射和密度均衡映射问题上的实验结果展示了我们提出方法的有效性。

英文摘要

Many imaging problems require computing spatial transformations induced by spatially varying intensity, feature, or density fields. Canonical examples include distortion correction, deformable image registration, atlas-based segmentation, and deformation-driven image analysis. These tasks can be formulated as geometric mapping problems in which the transformation is constrained to preserve local structure, control boundary behavior, or regulate angular distortion. Such formulations typically lead to variational models, diffusion processes, or elliptic partial differential equations. However, repeatedly solving high-resolution systems becomes computationally expensive when the underlying parameter fields vary across instances. In this work, we propose a resolution-free neural surrogate for geometric parameterization and mapping problems. Given a spatially varying parameter field $p:\Omega\to\mathbb{R}^m$ and query locations $\{x_i\}_{i=1}^N\subset\Omega$, the model predicts mapped locations $\{u(x_i)\}_{i=1}^N$ on arbitrary structured or unstructured point sets. To avoid dependence on a fixed grid, we use a multi-resolution geometric encoding strategy that conditions the network on coordinate-augmented samples of the parameter field. The model is trained without labeled solution data by enforcing geometry-aware constraints derived from variational energies, diffusion-based density equalization, and quasi-conformal theory. Experimental results on quasi-conformal mapping and density-equalizing mapping problems are presented to demonstrate the effectiveness of our proposed method.
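In quasi-conformal theory, angular distortion of a planar map $f$ is measured by its Beltrami coefficient $\mu = f_{\bar z} / f_z$, with $f_z = \tfrac12(f_x - i f_y)$ and $f_{\bar z} = \tfrac12(f_x + i f_y)$; $\mu = 0$ means conformal and $|\mu| < 1$ means orientation preserving. The sketch below estimates $\mu$ on a regular grid by finite differences. It is only meant to make the quantity concrete and is not the paper's training loss; the grid, the test map and the function name are assumptions.

```python
import numpy as np

def beltrami_coefficient(f, x, y):
    """Estimate mu = f_zbar / f_z for complex samples f[i, j] = f(x[j] + 1j * y[i])."""
    f_y, f_x = np.gradient(f, y, x)      # derivatives along axis 0 (y) and axis 1 (x)
    f_z = 0.5 * (f_x - 1j * f_y)
    f_zbar = 0.5 * (f_x + 1j * f_y)
    return f_zbar / f_z

x = np.linspace(0.0, 1.0, 11)
y = np.linspace(0.0, 1.0, 9)
grid_x, grid_y = np.meshgrid(x, y)
z = grid_x + 1j * grid_y
affine_map = z + 0.3 * np.conj(z)        # for f = a z + b conj(z), mu equals b / a
mu = beltrami_coefficient(affine_map, x, y)
print(np.abs(mu).max())
```

For the affine test map the estimate is constant and equal to 0.3, so the map stretches more along one axis than the other by a fixed amount everywhere.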

2605.28488 2026-05-29 stat.ML cs.LG math.ST stat.TH 版本更新

Bridging Maximum Likelihood and Optimal Transport for Efficient Inference and Model Selection in Stochastic Block Models

桥接最大似然与最优传输:随机块模型中的高效推理与模型选择

Simon Queric, Cédric Vincent-Cuaz, Charles Bouveyron, Marco Corneli

发表机构 * Université Côte d’Azur(法国蔚蓝海岸大学) Inria(法国国家信息与自动化技术研究院) CNRS LJAD(法国国家科学研究中心LJAD实验室) Maasai Nice, France(法国尼斯马萨伊研究所) EPFL Lausanne, Switzerland(瑞士洛桑联邦理工学院) CNRS CEPAM(法国国家科学研究中心CEPAM实验室)

AI总结 本文通过最优传输视角研究随机块模型,提出正则化与未正则化的半松弛Gromov-Wasserstein估计器,实现聚类与模型参数的联合推断及簇数自动选择。

Comments 10 pages, 8 figures

详情
AI中文摘要

我们通过最优传输(OT)的视角研究随机块模型(SBM)中的推断。首先,我们证明最大似然变分推断(MLVI)可以解释为带有熵正则化的半松弛Gromov-Wasserstein(srGW)投影。虽然这种公式能产生准确的聚类,但熵正则化阻止了传输计划的稀疏性,从而阻碍了内在的模型选择。因此,我们研究未正则化的srGW估计器,并证明它们在渐近情况下一致地恢复SBM连接矩阵和潜在簇分配。然而,这种渐近性质在有限样本中并不能转化为可靠的模型选择,需要额外的机制来促进推断的簇比例中的稀疏性。我们通过实验表明,这种正则化公式产生的估计器能够在单个优化问题中同时恢复模型参数并选择簇的数量,从而避免了昂贵的网格搜索或启发式模型选择程序。

英文摘要

We study inference in stochastic block models (SBMs) through the lens of optimal transport (OT). We first establish that maximum likelihood variational inference (MLVI) can be interpreted as a semi-relaxed Gromov-Wasserstein (srGW) projection with entropic regularization. While this formulation yields accurate clustering, the entropic regularization prevents transport plans from being sparse, hindering intrinsic model selection. Consequently, we investigate unregularized srGW estimators and prove that they consistently recover both the SBM connectivity matrix and latent cluster assignments in the asymptotic regime. However, this asymptotic property does not translate into reliable model selection in finite samples, and calls for additional mechanisms to promote sparsity in the inferred cluster proportions. We empirically show that such a regularized formulation yields estimators that simultaneously recover model parameters and select the number of clusters in a single optimization problem, thereby avoiding costly grid search or heuristic model selection procedures.
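The semi-relaxed Gromov-Wasserstein objective compares a graph's adjacency matrix $A$ with a small connectivity matrix $\bar C$ through a transport plan $T$ whose row sums are fixed (one mass per node) and whose column sums are free: $\sum_{ijkl} (A_{ij} - \bar C_{kl})^2 T_{ik} T_{jl}$. The sketch below evaluates this cost for hard cluster assignments with the squared loss; it does not optimize anything, and the function names and the toy graph are illustrative assumptions.

```python
import numpy as np

def srgw_cost(adjacency, plan, connectivity):
    """Squared-loss GW cost sum_ijkl (A_ij - C_kl)^2 T_ik T_jl, expanded into 3 terms."""
    row_mass = plan.sum(axis=1)    # fixed marginal over nodes
    col_mass = plan.sum(axis=0)    # free marginal: inferred cluster proportions
    term_nodes = row_mass @ (adjacency ** 2) @ row_mass
    term_cross = ((plan.T @ adjacency @ plan) * connectivity).sum()
    term_blocks = col_mass @ (connectivity ** 2) @ col_mass
    return float(term_nodes - 2.0 * term_cross + term_blocks)

def hard_plan(labels, n_clusters):
    """Transport plan that sends each node's mass 1/n to its assigned cluster."""
    n = len(labels)
    plan = np.zeros((n, n_clusters))
    plan[np.arange(n), labels] = 1.0 / n
    return plan

true_labels = np.array([0, 0, 0, 1, 1, 1])
connectivity = np.eye(2)                                # two perfectly assortative blocks
adjacency = connectivity[true_labels][:, true_labels]   # noiseless block structure
good = srgw_cost(adjacency, hard_plan(true_labels, 2), connectivity)
bad = srgw_cost(adjacency, hard_plan(np.array([0, 1, 0, 1, 0, 1]), 2), connectivity)

# With a spare third cluster, a plan that leaves it empty is still optimal here.
connectivity3 = np.array([[1.0, 0.0, 0.5], [0.0, 1.0, 0.5], [0.5, 0.5, 0.5]])
plan3 = hard_plan(true_labels, 3)
spare = srgw_cost(adjacency, plan3, connectivity3)
print(good, bad, spare, plan3.sum(axis=0))
```

The last case hints at how model selection can come out of the optimization itself: a column of the plan that receives no mass corresponds to a cluster the estimator has switched off.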

2605.28293 2026-05-29 cs.LG cs.AI 版本更新

ProRL: Effective Reinforcement Learning for Proactive Recommendation via Rectified Policy Gradient Estimation

ProRL: 通过修正策略梯度估计实现主动推荐的有效强化学习

Hongru Hou, Tiehua Mei, Denghui Geng, Jinhui Huang, Ao Xu, Hengrui Chen, Jiaqing Liang, Deqing Yang

发表机构 * School of Data Science, Fudan University, Shanghai, China(复旦大学数据科学学院,上海,中国)

AI总结 针对主动推荐系统中策略梯度估计存在的长度依赖偏差和高方差问题,提出ProRL框架,通过逐步奖励中心化和位置特定优势估计两个机制修正梯度,显著提升推荐效果。

Comments Accepted in ICML 2026

详情
AI中文摘要

主动推荐系统(PRS)旨在通过生成中间推荐路径来引导用户偏好向目标物品转移。强化学习(RL)为优化此类序列决策任务提供了原则性框架,因为路径奖励可以自然地捕捉短期接受度和长期引导有效性。然而,将策略梯度直接应用于PRS会导致梯度估计存在缺陷。我们识别出两个缺陷:(1)路径级奖励分解为具有正均值的步骤级奖励,产生长度依赖偏差,导致梯度倾向于路径扩展而非有意义的探索;(2)用整个路径级奖励加权每个步骤忽略了分解结构,导致高梯度方差。为修正这两个缺陷,我们提出了一种有效的RL框架ProRL,其中包含两种用于主动推荐的新机制。首先,逐步奖励中心化减去期望奖励以消除长度依赖偏差,确保路径扩展产生零期望梯度信号。其次,位置特定优势估计利用奖励分解结构计算步骤相关的基线,降低梯度方差。这些机制共同产生精确针对路径质量的策略梯度。我们在三个真实世界数据集上的实验表明,ProRL显著优于最先进的PRS。我们的代码可在https://github.com/hongruhou89/ProRL获取。

英文摘要

Proactive Recommender Systems (PRSs) aim to guide user preference shift toward target items by generating paths of intermediate recommendations. Reinforcement learning (RL) provides a principled framework for optimizing such sequential decision tasks, as path rewards can naturally capture both short-term acceptance and long-term guidance effectiveness. However, naively applying policy gradients to PRS results in deficient gradient estimation. We identify two deficiencies: (1) path-level rewards decompose into step-level rewards with positive mean, creating a length-dependent bias that causes gradients to favor path extension over meaningful exploration; (2) weighting each step by the entire path-level reward ignores the decomposition structure, leading to high gradient variance. To rectify these two deficiencies, we propose an effective RL framework ProRL with two novel mechanisms for proactive recommendation. First, Stepwise Reward Centering subtracts expected rewards to neutralize length-dependent bias, ensuring that path extension yields zero expected gradient signal. Second, Position-Specific Advantage Estimation leverages the reward decomposition structure to compute step-dependent baselines, reducing gradient variance. Together, these mechanisms yield policy gradients that precisely target path quality. Our experiments on three real-world datasets demonstrate that ProRL significantly outperforms state-of-the-art PRSs. Our code is available at https://github.com/hongruhou89/ProRL.
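
The two mechanisms named in this abstract can be illustrated with a minimal sketch, assuming each sampled path is simply a list of step-level rewards. The function names and the reward-to-go weighting are illustrative assumptions, not the ProRL implementation (the linked repository is the reference for that).

```python
def position_baselines(paths):
    """Mean step reward at each position, averaged over the paths that reach it."""
    max_len = max(len(p) for p in paths)
    baselines = []
    for t in range(max_len):
        vals = [p[t] for p in paths if len(p) > t]
        baselines.append(sum(vals) / len(vals))
    return baselines


def centered_advantages(paths):
    """Per-step weights: reward-to-go of baseline-centered step rewards (hypothetical)."""
    b = position_baselines(paths)
    out = []
    for p in paths:
        centered = [r - b[t] for t, r in enumerate(p)]
        adv, running = [], 0.0
        for c in reversed(centered):  # accumulate from the last step backwards
            running += c
            adv.append(running)
        out.append(adv[::-1])
    return out
```

In this sketch, two paths whose steps all earn the same positive reward get all-zero weights regardless of length, so extending a path contributes no signal by itself, whereas weighting every step by the raw path total would favor the longer path.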

2605.27975 2026-05-29 cs.LG stat.ML 版本更新

Continual Learning in Modern Hopfield Networks with an Application to Diffusion Models

现代Hopfield网络中的持续学习及其在扩散模型中的应用

Ken Takeda, Masafumi Oizumi, Ryo Karakida

发表机构 * Graduate School of Arts and Science, The University of Tokyo(东京大学艺术与科学研究生院) Artificial Intelligence Research Center, AIST(AIST人工智能研究中心)

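AI summary Analyzes continual learning in diffusion models through the modern Hopfield energy, proves that high-energy, outlier-like samples are forgotten more easily, and selects replay samples by energy to mitigate forgetting.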

详情
AI中文摘要

生成模型(包括扩散模型)越来越多地被用作基础模型,并通过顺序微调进行适配,这使得持续学习成为一个关键问题设定。然而,此类生成模型中的持续学习仍未被充分理解:任务变化后,学习分布的哪些方面最容易丢失,以及应优先重放哪些样本?我们通过现代Hopfield能量来解决这些问题。现代Hopfield网络(MHN)与扩散模型之间的最新联系使得MHN中的分析可以迁移到扩散模型。我们引入内在遗忘作为任务变化后Hopfield能量的增加。在MHN的可处理设定中,我们证明高能量、类似异常值的样本比类似聚类的样本经历更大的能量增加,这意味着位于尖锐、孤立盆地中的样本更容易被遗忘。我们进一步分析了记忆重放,并表明重放对高能量样本特别有效,从而实现了基于能量的重放样本选择。我们在MHN和两种扩散模型(Stable Diffusion和像素空间DDPM)的持续学习设置实验中验证了这些预测。在这些扩散模型中,Hopfield能量追踪基于重建的遗忘,重放实验揭示了与MHN分析一致的能量依赖性遗忘缓解。

英文摘要

Generative models, including diffusion models, are increasingly used as foundation models and adapted through sequential fine-tuning, making continual learning an essential problem setting. However, continual learning in such generative models remains poorly understood: after a task change, what aspects of the learned distribution are most easily lost, and what replay samples should be prioritized? We address these questions through the modern Hopfield energy. Recent links between modern Hopfield networks (MHNs) and diffusion models allow analyses in MHNs to be transferred to diffusion models. We introduce intrinsic forgetting as an increase in Hopfield energy after the task change. In tractable settings in an MHN, we prove that high-energy, outlier-like samples undergo a larger energy increase than cluster-like samples, implying that samples located in sharp, isolated basins are more forgettable. We further analyze memory replay and show that replay is particularly effective for high-energy samples, enabling an energy-based selection of replay samples. We validate these predictions in experiments on MHNs and two diffusion models under continual-learning settings: Stable Diffusion and a pixel-space DDPM. In these diffusion models, Hopfield energy tracks reconstruction-based forgetting, and replay experiments reveal energy-dependent mitigation of forgetting that is consistent with the MHN analysis.
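
As a rough illustration of the quantity this abstract builds on, the sketch below computes the log-sum-exp energy commonly used for modern Hopfield networks, with additive constants dropped, and measures forgetting as the energy increase of a sample. Modeling a task change as swapping the stored pattern set is a simplifying assumption, and the function names are hypothetical.

```python
import math


def hopfield_energy(query, patterns, beta=1.0):
    """E = -(1/beta) * log sum_i exp(beta * <x_i, query>) + 0.5 * ||query||^2 (constants dropped)."""
    scores = [beta * sum(a * b for a, b in zip(p, query)) for p in patterns]
    m = max(scores)  # subtract the max for a numerically stable log-sum-exp
    lse = (m + math.log(sum(math.exp(s - m) for s in scores))) / beta
    return -lse + 0.5 * sum(q * q for q in query)


def intrinsic_forgetting(sample, old_patterns, new_patterns, beta=1.0):
    """Energy increase of `sample` when the stored patterns change (assumed task-change model)."""
    return hopfield_energy(sample, new_patterns, beta) - hopfield_energy(sample, old_patterns, beta)
```

Keeping the sample among the stored patterns, as a replay buffer would, leaves its energy unchanged in this toy setting, while dropping it raises the energy.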

2605.27968 2026-05-29 cs.CE cs.LG physics.comp-ph 版本更新

Adapting Automotive Aerodynamics Surrogates to New Vehicle Families via Transfer Learning

通过迁移学习将汽车空气动力学代理模型适应新车型族

Seunghwan Keum, Alok Warey

发表机构 * General Motors Research and Development(通用汽车研发)

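AI summary Using leave-one-family-out experiments on a 61.47M-parameter Transformer surrogate, the paper compares full fine-tuning, lightweight fine-tuning, and low-rank adaptation. Low-rank adaptation regularizes the loss landscape through rank-constrained adapters while preserving pretrained features, reaching R²=0.85±0.02 with only 20 samples and outperforming both full fine-tuning and from-scratch training, which suggests that low-rank adaptation acts as a convergence enabler for geometry transfer.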

Comments 23 pages, 12 figures

详情
AI中文摘要

在工业CFD工作流中部署科学机器学习代理模型需要将预训练模型适应到新车型族,而无需大型数据集;然而,几何编码器学习的几何表示是否能够迁移到拓扑不同的形状仍未得到验证。 我们通过留一族实验来解决这个问题,实验使用一个61.47M参数的Transformer代理模型(AB-UPT),该模型在四个车型族(411个外部空气动力学案例)上预训练,并仅用20个样本适应到留出的第五个车型族。比较了三种策略:全微调(FFT)、轻量微调(LFT)和低秩适应(LoRA)。核心发现是预训练的几何编码器学习了可迁移的表示,但适应机制决定了它们是否能够被利用。FFT不稳定,因为61.47M无约束参数对20个样本过拟合(R²=0.40);LFT失败,因为冻结的编码器无法表示未见过的形状(R²<0)。LoRA解决了这两个问题:注入所有层的秩约束适配器正则化了损失景观,同时保留了预训练特征,在所有五个车型族上实现了R²=0.85±0.02,力RMSE比FFT低50%,点场误差低28%。LoRA还优于使用3倍目标族数据从零开始的训练,消除了对每个族大型数据集的需求。这些结果将LoRA从一种节省内存的便利工具重新定义为几何迁移的收敛使能器:一个共享骨干网络配合轻量级的每个族适配器,可在数小时内从最小数据训练完成。

英文摘要

Deploying Scientific Machine Learning surrogates in industrial CFD workflows requires adapting pretrained models to new vehicle families without large datasets; yet whether geometric representations learned by a geometry encoder transfer to topologically distinct shapes remains unvalidated. We address this through leave-one-family-out experiments on a 61.47M-parameter Transformer surrogate (AB-UPT) pretrained on four vehicle families (411 external aerodynamics cases) and adapted to the held-out fifth with only 20 samples. Three strategies are compared: Full Fine-Tuning (FFT), Lightweight Fine-Tuning (LFT), and Low-Rank Adaptation (LoRA). The central finding is that pretrained geometry encoders learn transferable representations, but the adaptation mechanism determines whether they can be exploited. FFT destabilizes as 61.47M unconstrained parameters overfit to 20 samples (R^2=0.40); LFT fails because the frozen encoder cannot represent unseen shapes (R^2<0). LoRA resolves both: rank-constrained adapters injected into all layers regularize the loss landscape while preserving pretrained features, achieving R^2=0.85+/-0.02 across all five families with 50% lower force RMSE than FFT and 28% lower pointwise field errors. LoRA also outperforms from-scratch training using 3x more target-family data, eliminating the need for large per-family datasets. These results recast LoRA from a memory-saving convenience into a convergence enabler for geometry transfer: a shared backbone paired with lightweight per-family adapters trainable in hours from minimal data.
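
For readers unfamiliar with LoRA, a minimal sketch of a single adapted linear layer follows. It assumes the standard formulation (frozen weight plus a scaled low-rank product, with the up-projection initialized to zero); the row-vector convention and helper names are choices made here, not details taken from the paper.

```python
def matvec(x, M):
    """Row vector x (length d_in) times matrix M (d_in x d_out)."""
    return [sum(x[i] * M[i][j] for i in range(len(x))) for j in range(len(M[0]))]


def lora_forward(x, W, A, B, alpha):
    """y = x W + (alpha / r) * (x A) B, with W frozen and only A (d_in x r), B (r x d_out) trained."""
    r = len(B)
    base = matvec(x, W)
    update = matvec(matvec(x, A), B)
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_params(d_in, d_out, r):
    """Number of adapter parameters for one layer."""
    return r * (d_in + d_out)
```

With `B` initialized to zero the adapted layer reproduces the pretrained output exactly, which is one way to see why pretrained features are preserved at the start of adaptation.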

2605.27474 2026-05-29 stat.ML cs.LG 版本更新

Stop Suppressing the Tail: Causal Inference for Extreme Events

停止抑制尾部:极端事件的因果推断

Eichi Uehara

发表机构 * Eichi Uehara

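AI summary For heavy-tailed outcomes, proposes an average dose-response function (ADRF) estimator whose tail diagnostic (PDHTE+JK), based on median centering, breaks the circular dependence; it outputs a structured tail shape and deep-tail risk measures and significantly outperforms traditional methods in extreme-event prediction.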

Comments 22 pages, 6 figures, 13 tables. Keywords: double machine learning, dose-response, heavy tails, extreme value theory, causal inference

详情
AI中文摘要

估计结果如何响应连续处理(平均剂量-响应函数,ADRF)是因果推断的核心基础。然而,当结果具有重尾时,标准的鲁棒双重机器学习(DML)会刻意抑制这些极端值以稳定整体均值。在高风险场景(如金融收益或气候损失)中,这种被忽略的千分之一极端事件恰恰是实际目标量。此外,当前从模型残差中读取尾部的方法存在循环依赖,导致仅因核心估计器在Huber和Welsch之间切换,尾部形状推断就会发生剧烈变化。本研究提出一种ADRF估计器,它在标准点估计之外输出结构化的尾部形状。其尾部诊断(PDHTE+JK)通过基于中位数中心化的结果评估每个处理下的尾部形状,成功打破了循环依赖,使诊断结果不受核心方法选择的影响。输出包含四个处理条件量:尾部形状$\hat{\xi}(t)$、深层尾部回报水平$\hat{Q}_\alpha(t)$、条件短缺$\hat{S}_\alpha(t)$、恢复的均值ADRF,以及一个明确的拒绝机制,当数据不支持极值建模时拒绝外推。与核加权分位数回归(QR)相比,所提估计器在重尾面板上将深层尾部($\alpha=0.001$)回报水平MAE降低了11%,条件短缺MAE降低了25.5%。在样本稀缺场景($n \le 2000$)中,MAE降低了20-29%。在freMTPL2汽车保险索赔数据上,它在对数索赔尺度上成功触发了明确的外推拒绝,这是QR或仅损失DML无法实现的。

英文摘要

Estimating how an outcome responds to a continuous treatment (the Average Dose-Response Function, or ADRF) is a core causal-inference primitive. However, when outcomes possess heavy tails, standard robust double machine learning (DML) deliberately suppresses these extremes to stabilize the bulk average. In high-stakes settings, such as financial returns or climate losses, this omitted 1-in-1000 extreme event is the actual target quantity. Furthermore, current methods that read the tail from a model's residuals suffer from circular dependence: tail-shape inferences shift drastically depending solely on whether the core estimator is Huber or Welsch. This work proposes an ADRF estimator that emits a structured tail-shape output alongside the standard point estimate. Its tail diagnostic (PDHTE+JK) evaluates the per-treatment tail shape from the outcome centered by a pilot median, breaking the circular dependence and rendering the diagnostic invariant to the choice of core method. The output comprises four treatment-conditional quantities (tail shape $\hat{\xi}(t)$, deep-tail return levels $\hat{Q}_\alpha(t)$, conditional shortfalls $\hat{S}_\alpha(t)$, and the recovered mean ADRF) together with an explicit refusal mechanism that declines extrapolation when extreme-value modeling is unsupported by the data. Compared to kernel-weighted quantile regression (QR), the proposed estimator reduces deep-tail ($\alpha=0.001$) return-level MAE by 11% and conditional-shortfall MAE by 25.5% across a heavy-tailed panel. It also achieves a 20-29% MAE reduction in sample-scarce regimes ($n \le 2000$). On freMTPL2 motor-insurance claims, it triggered an explicit extrapolation refusal on the log-claim scale, which neither QR nor loss-only DML can produce.
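
The abstract does not spell out PDHTE+JK, so the sketch below swaps in the classical Hill estimator purely to illustrate the idea of reading tail shape from median-centered outcomes rather than from model residuals. The function name and the choice of `k` upper-order statistics are assumptions made here.

```python
import math
import statistics


def hill_tail_index(outcomes, k):
    """Hill estimate of the tail index from the k largest median-centered outcomes (stand-in method)."""
    med = statistics.median(outcomes)
    # Keep only exceedances above the pilot median, largest first.
    centered = sorted((y - med for y in outcomes if y > med), reverse=True)
    if len(centered) <= k:
        raise ValueError("not enough upper-tail observations for this k")
    threshold = centered[k]
    return sum(math.log(centered[i] / threshold) for i in range(k)) / k
```

Because the outcomes are centered by their own median, adding a constant to every outcome leaves the estimate unchanged, which loosely mirrors the invariance to the core location estimate that the abstract describes.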

2605.26408 2026-05-29 cs.LG stat.ME stat.ML 版本更新

Function-Valued Causal Influence in Nonlinear Time Series

非线性时间序列中的函数值因果影响

Valentina V. Kuskova, Dmitry Zaytsev, Michael Coppedge

发表机构 * Lucy Family Institute for Data & Society, University of Notre Dame, Notre Dame, Indiana, USA; Department of Political Science, University of Notre Dame, Notre Dame, Indiana, USA

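AI summary Scalar scores commonly used in causal discovery for nonlinear time series hide state-dependent functional effects. The paper proposes a framework based on Individual Conditional Expectation that estimates causal response functions directly from Neural Additive Vector Autoregression models, revealing distinct functional behaviors that scalar scores cannot tell apart.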

Comments 26 pages, 6 tables, 8 figures

详情
AI中文摘要

时间序列中的因果发现越来越多地使用非线性机器学习模型进行,但由此产生的因果关系几乎总是通过标量边评分来总结。我们认为,这种做法掩盖了非线性自回归模型真正学习到的对象:一个状态依赖的函数,其效应随机制、幅度和上下文而变化。我们形式化了加性、贡献可分解架构的函数值因果影响,并表明标量因果评分构成了严重的信息瓶颈,将状态间变化与状态内残差噪声混为一谈。以神经加性向量自回归作为代表性架构,我们引入了一个基于个体条件期望的实用框架,直接从训练好的模型估计因果响应函数。通过受控的合成实验,我们证明了具有无法区分的标量评分的边可以表现出定性的不同函数行为,包括单调、阈值、饱和和符号变化效应。一个关于民主发展的应用案例进一步表明,函数值分析揭示了以评分为中心的方法系统性遗漏的特定于机制和非对称的因果结构。

英文摘要

Causal discovery in time series is increasingly performed using nonlinear machine-learning models, yet the resulting causal relationships are almost always summarized by scalar edge scores. We argue that this practice obscures the true object learned by nonlinear autoregressive models: a state-dependent function whose effect varies across regimes, magnitudes, and contexts. We formalize function-valued causal influence for additive, contribution-decomposable architectures and show that scalar causal scores constitute a severe information bottleneck, conflating between-state variation with within-state residual noise. Using Neural Additive Vector Autoregression as a representative architecture, we introduce a practical framework based on Individual Conditional Expectation for estimating causal response functions directly from trained models. Through controlled synthetic experiments, we demonstrate that edges with indistinguishable scalar scores can exhibit qualitatively different functional behaviors, including monotonic, thresholded, saturating, and sign-changing effects. An applied case study on democratic development further shows that function-valued analysis reveals regime-specific and asymmetric causal structure systematically missed by score-centric approaches.
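
Individual Conditional Expectation, the tool this abstract relies on, can be sketched in a few lines: hold each observed input fixed, sweep one feature over a grid, and record the model's prediction. The `model` callable and list-based inputs are assumptions for illustration, not the paper's interface.

```python
def ice_curves(model, rows, feature, grid):
    """One response curve per row: predictions as `feature` sweeps over `grid`, other inputs held fixed."""
    curves = []
    for row in rows:
        curve = []
        for value in grid:
            x = list(row)  # copy so the original row is not modified
            x[feature] = value
            curve.append(model(x))
        curves.append(curve)
    return curves
```

For a toy model such as `lambda x: max(x[0], 0.0) + x[1]`, the curves stay flat below zero and rise above it, a thresholded effect that a single scalar edge score would not reveal.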

2605.26255 2026-05-29 eess.IV cs.AI cs.LG 版本更新

Prospective evaluation of multimodal respiratory failure prediction: Do chest X-rays improve performance beyond EHR signals?

多模态呼吸衰竭预测的前瞻性评估:胸部X光片能否在电子健康记录信号之外提升性能?

Xiaolei Lu, Shamim Nemati

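AI summary Proposes a gated multimodal framework that integrates structured EHR time-series data with chest X-ray foundation-model representations to prospectively predict whether ICU patients will need invasive mechanical ventilation within 24 hours. Compared with an EHR-only model and physician predictions, multimodal fusion improves discrimination, sensitivity, and positive predictive value.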

详情
AI中文摘要

呼吸衰竭的早期预测对于重症监护病房的及时临床干预至关重要。现有的基于电子健康记录(EHR)的模型可以持续监测生理恶化,但可能无法完全捕捉胸部X光片(CXR)中反映的肺部病理生理学。在本研究中,我们探讨CXR信息是否能在仅使用EHR信号的基础上改善有创机械通气的前瞻性预测。我们开发了一个门控多模态框架,将结构化EHR时间序列数据与CXR基础模型表示相结合。门控模块根据患者特定的临床背景自适应地控制成像特征的贡献,使模型在成像信息有用时选择性地依赖它。我们前瞻性地评估了该框架在ICU患者中预测24小时内需要有创机械通气的性能,并将其与已建立的仅使用EHR的模型(Ventio)、在匹配临床时间点获得的医生预测以及替代多模态变体进行比较。门控多模态模型比仅使用EHR的基线模型实现了更高的区分度,使用REMEDIS和MedInsight CXR表示时AUROC值分别为0.860和0.858,而Ventio为0.752。相对于医生预测,多模态框架显著提高了敏感性,同时保持了良好的特异性。与仅使用EHR的模型相比,多模态整合提高了特异性和阳性预测值,表明CXR信息可以细化选定患者的风险估计。这些发现支持自适应多模态融合作为将成像纳入前瞻性呼吸衰竭预测的实用策略。

英文摘要

Early prediction of respiratory failure is critical for timely clinical intervention in intensive care units. Existing electronic health record (EHR)-based models can continuously monitor physiologic deterioration, but they may not fully capture pulmonary pathophysiology reflected in chest radiographs (CXRs). In this study, we ask whether CXR information improves prospective prediction of invasive mechanical ventilation beyond EHR signals alone. We develop a gated multimodal framework that integrates structured EHR time-series data with CXR foundation-model representations. The gating module adaptively controls the contribution of imaging features based on patient-specific clinical context, allowing the model to selectively rely on imaging information when it is informative. We prospectively evaluate the framework for predicting invasive mechanical ventilation within 24 hours in ICU patients and compare it with an established EHR-only model (Ventio), physician predictions obtained at matched clinical time points, and alternative multimodal variants. The gated multimodal models achieved higher discrimination than the EHR-only baseline, with AUROC values of 0.860 and 0.858 using REMEDIS and MedInsight CXR representations, respectively, compared with 0.752 for Ventio. Relative to physician predictions, the multimodal framework substantially improved sensitivity while maintaining favorable specificity. Compared with the EHR-only model, multimodal integration increased specificity and positive predictive value, suggesting that CXR information can refine risk estimation in selected patients. These findings support adaptive multimodal fusion as a practical strategy for incorporating imaging into prospective respiratory failure prediction.
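
The gating idea can be sketched as a scalar gate, computed from the EHR features, that scales the imaging features before they are concatenated. This is a minimal illustration under assumed names and shapes; the abstract does not specify the actual gate architecture.

```python
import math


def gated_fusion(ehr, cxr, gate_w, gate_b):
    """Return (fused features, gate value); the gate depends only on the EHR context (assumption)."""
    z = sum(w * e for w, e in zip(gate_w, ehr)) + gate_b
    gate = 1.0 / (1.0 + math.exp(-z))  # sigmoid in (0, 1)
    fused = list(ehr) + [gate * c for c in cxr]
    return fused, gate
```

When the gate is close to zero the fused vector reduces to the EHR features, so a model of this shape can fall back to EHR-only behavior for patients whose imaging is uninformative.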

2605.26194 2026-05-29 cs.LG 版本更新

On the Role of Inductive Bias in Time-Series Pretraining: A Case Study in Learning Generalizable Representations for Clinical Time Series

论归纳偏置在时间序列预训练中的作用:以临床时间序列学习通用表征的案例研究

Sharmita Dey, Diego Paez-Granados

发表机构 * ETH Zurich(苏黎世联邦理工学院) Swiss Paraplegic Research(瑞士脊髓损伤研究所) ETH Zurich, Swiss Paraplegic Research(苏黎世联邦理工学院、瑞士脊髓损伤研究所)

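AI summary Using PathoFM, an encoder-centric Transformer pretrained with three complementary objectives (local completion, temporal continuity, and unsupervised in-context dynamics), the paper studies how the inductive bias of pretraining objectives affects transfer across task types and subjects, and finds that dynamics-centric mixtures yield the most balanced transferable representations.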

详情
AI中文摘要

临床时间序列学习通常受限于小规模、异质性队列和协议漂移,而其下游应用涵盖分类(如病理诊断)和回归(如时间预测)。这些限制使得基础模型预训练具有吸引力,但提出了一个重要问题:预训练目标应施加何种归纳偏置,以使表征能够跨任务类型和受试者迁移。我们通过PathoFM研究脊髓损伤(SCI)的病理步态分析,PathoFM是一种以编码器为中心的Transformer,在多元步态窗口上使用三种互补目标进行预训练:局部补全(重建连续的掩码跨度以强制局部结构)、时间连续性(从观察到的前缀预测掩码的中期延续以强制平滑性和因果一致性)以及无监督上下文动力学(通过注意力基于受试者示例窗口进行支持-查询重建)。通过经验比较目标族(分组/对比、基于动力学和生成式重建),我们发现以动力学为中心的混合目标产生最平衡的迁移:分组目标有利于判别边界,但可能降低连续目标所需的幅度保真度,而仅重建目标保留波形结构但在分类上可能表现不佳。总体而言,将局部重建与时间连续性相结合,并在可获取示例时添加上下文条件,可产生稳健的受试者泛化表征。

英文摘要

Clinical time-series learning is routinely constrained by small, heterogeneous cohorts and protocol drift, while its downstream use spans both classification (e.g., pathology diagnosis) and regression (e.g., temporal forecasting). These constraints make foundation-model pretraining appealing, but they raise an important question: which inductive biases should the pretraining objective impose so that representations transfer across task types and subjects? We study this question in pathological gait analysis for spinal cord injury (SCI) via PathoFM, an encoder-centric transformer pretrained on multivariate gait windows with three complementary objectives: Local Completion (reconstruct contiguous masked spans to enforce local structure), Temporal Continuity (predict a masked mid-horizon continuation from an observed prefix to enforce smoothness and causal consistency), and Unsupervised In-Context Dynamics (support-query reconstruction conditioned on subject exemplar windows via attention). Empirically comparing objective families (grouping/contrastive, dynamics-based, and generative reconstruction), we find that dynamics-centric mixtures produce the most balanced transfer: grouping objectives favor discriminative margins but can degrade the magnitude fidelity needed for continuous targets, whereas reconstruction-only objectives preserve waveform structure but may underperform on classification. Overall, combining local reconstruction with temporal continuity, and adding in-context conditioning when exemplar access is realistic, yields robust subject-generalizing representations.

2605.26193 2026-05-29 cs.LG cs.AI 版本更新

Bridging Classification and Reconstruction: Cooperative Time Series Anomaly Detection

桥接分类与重建:协同时间序列异常检测

Qideng Tang, Dai Chaofan, Wubin Ma, Yahui Wu, Haohao Zhou, Tao Zhang, Huan Li, Dalin Zhang

发表机构 * National Key Laboratory of Information Systems Engineering, National University of Defense Technology(信息系统工程国家重点实验室,国防科技大学) College of Systems Engineering, National University of Defense Technology(系统工程学院,国防科技大学) Zhejiang University(浙江大学) Zhejiang Key Laboratory of Space Information Sensing and Transmission, Hangzhou Dianzi University(浙江省空间信息感知与传输重点实验室,杭州电子科技大学)

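AI summary Proposes the CoAD framework, in which a classification module generates probability-informed soft masks that guide a reconstruction module, combining the complementary strengths of the classification and reconstruction paradigms to detect subtle, complex anomalies and significantly outperform existing methods on benchmark datasets.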

Comments 15 pages, submitted to KDD 2026

详情
AI中文摘要

时间序列异常检测(TSAD)因其广泛应用而长期成为数据挖掘领域的热门研究课题。最近的研究挑战了流行的深度学习方法在TSAD中的有效性,指出它们无法检测细微和持久的异常。异常暴露(OE)和掩码自编码器(MAE)作为两种有前景的范式(分类和重建)出现,用于解决上述问题。然而,基于OE的方法受限于泛化能力差,而基于MAE的方法受限于掩码错位问题。为了解决这些局限性,本文提出了一种新颖的框架CoAD,该框架统一了两种范式,以利用它们的互补优势,同时减轻各自的弱点。在该框架中,分类模块为重建模块生成概率信息软掩码,这反过来又缓解了分类模块的泛化问题。这种协同设计使CoAD能够有效检测现有方法常常忽略的细微和复杂异常。此外,分类模块经过精心设计,以解决分类粒度不当和忽视频率信息的问题。在高质量基准数据集上,按照严格的评估协议进行的大量实验表明,CoAD显著优于最先进的深度学习和传统数据挖掘方法,突显了深度学习在TSAD中的潜力。此外,CoAD轻量级且速度远快于现有SOTA方法,展示了其在大规模实时应用中的实用价值。

英文摘要

Time series anomaly detection (TSAD) has long been an active research topic in data mining because of its wide range of applications. Recent studies challenge the effectiveness of popular deep learning methods for TSAD, suggesting that they fail to detect subtle and prolonged anomalies. Outlier Exposure (OE) and Masked Autoencoder (MAE) have emerged as two promising paradigms (classification and reconstruction, respectively) for addressing these problems. However, OE-based methods are constrained by poor generalization, while MAE-based methods are limited by masking misalignment. To address these limitations, this paper proposes a novel framework, CoAD, which unifies the two paradigms to leverage their complementary strengths while mitigating their respective weaknesses. In this framework, the classification module generates probability-informed soft masks for the reconstruction module, which in turn alleviates the generalization problem of the classification module. This cooperative design enables CoAD to effectively detect subtle and complex anomalies that are often overlooked by existing methods. Additionally, the classification module is carefully designed to resolve issues related to improper classification granularity and the neglect of frequency information. Extensive experiments on high-quality benchmark datasets, conducted under rigorous evaluation protocols, demonstrate that CoAD significantly outperforms both state-of-the-art deep learning and traditional data mining methods, highlighting the potential of deep learning in TSAD. Moreover, CoAD is lightweight and substantially faster than existing SOTA methods, demonstrating its practical value for large-scale, real-time applications.
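
One plausible reading of "probability-informed soft masks" is an interpolation between each time step and a mask value, weighted by the classifier's anomaly probability. The sketch below shows only that reading; the interpolation rule, the mask value, and the function names are assumptions, not details from the paper.

```python
def soft_mask(window, anomaly_prob, mask_value=0.0):
    """Blend each point toward mask_value in proportion to its anomaly probability (hypothetical rule)."""
    return [(1.0 - p) * x + p * mask_value for x, p in zip(window, anomaly_prob)]


def anomaly_scores(window, reconstruction):
    """Pointwise absolute reconstruction error, a common reconstruction-based score."""
    return [abs(x - r) for x, r in zip(window, reconstruction)]
```

Under this reading, points the classifier trusts pass through unchanged, while suspicious points are hidden from the reconstructor, so a large reconstruction error at those points is evidence of an anomaly.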

2605.25299 2026-05-29 cs.CV cs.LG 版本更新

A Principled Self-Referenced Early Stopping Approach for Deep Image Prior

一种基于自引用的原则性早期停止方法用于深度图像先验

Chaoyan Huang, Cheng-Han Huang, Ismail R. Alkhouri, Rongrong Wang

发表机构 * Department of Computational Mathematics, Science, & Engineering, Michigan State University(密歇根州立大学计算数学、科学与工程系) Department of Electrical Engineering and Computer Science, University of Michigan(密歇根大学电气工程与计算机科学系) X Computational Physics Division, Los Alamos National Laboratory(洛斯阿拉莫斯国家实验室计算物理部) Michigan Institute for Computational Discovery & Engineering, University of Michigan(密歇根大学计算发现与工程研究所) Mathematical Sciences, Michigan State University(密歇根州立大学数学科学系)

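AI summary To address overfitting in Deep Image Prior (DIP), proposes an overfitting-detection framework based on constructing pseudo self-referenced images, yielding an early-stopping method that needs no noise-level estimate.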

Comments 35 pages, 10 figures, 14 tables

详情
AI中文摘要

最近,深度图像先验(DIP)通过在无训练数据的情况下优化随机初始化的卷积神经网络,展示了解决逆成像问题(IIPs)的强大能力。然而,由于网络过参数化,DIP会过拟合噪声测量,使得早期停止(ES)至关重要。最成功的ES方法通过跟踪网络输出运行方差的波动来检测过拟合。然而,在许多应用中,这些波动可能过早出现,导致重建不稳定。本文首先证明,当退化图像的两个独立噪声副本可用时,可以实现近乎最优的DIP早期停止。受此观察启发,且由于获取两个完全独立的副本不可行,我们提出了一种基于构造伪自引用图像的过拟合检测框架,从而得到三种IIP特定算法。我们的方法还得到了关于单引用验证、伪验证估计以及共享噪声影响的理论结果的支持。在不同的IIP中,从自然图像恢复到医学图像重建,以及在不同噪声水平和噪声类型下,我们的方法始终优于现有的DIP早期停止方法,且无需准确估计噪声水平。

英文摘要

Recently, Deep Image Prior (DIP) has demonstrated strong capabilities for solving inverse imaging problems (IIPs) by optimizing a randomly initialized convolutional neural network in a training-data-free regime. However, DIP suffers from overfitting to noisy measurements due to network over-parameterization, making early stopping (ES) essential. The most successful ES method tracks fluctuations in the running variance of the network output to detect overfitting. However, in many applications, these fluctuations may appear prematurely, leading to unstable reconstructions. In this paper, we first show that nearly optimal DIP early stopping can be achieved when two independent noisy copies of the degraded image are available. Motivated by this observation, and since obtaining two fully independent copies is infeasible, we propose an overfitting detection framework based on constructing pseudo self-referenced images, resulting in three IIP-specific algorithms. Our approach is further supported by theoretical results on single-reference validation, pseudo-validation estimation, and the impact of shared noise. Across different IIPs, ranging from natural image restoration to medical image reconstruction, and under varying noise levels and noise types, our methods consistently outperform existing DIP early stopping approaches, all without requiring an accurate estimate of the noise level.
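
The two-copy observation can be sketched as ordinary validation-based early stopping: fit the network to one noisy copy and monitor its output against the second copy. The code below assumes the per-iteration outputs are already available as flat lists; the patience rule and names are illustrative, and the paper's pseudo self-referenced construction is not reproduced here.

```python
def mse(a, b):
    """Mean squared error between two equally sized flat images."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def early_stop_index(outputs, reference, patience=3):
    """Index of the iterate with the lowest error against the held-out noisy copy."""
    best_err, best_i = float("inf"), 0
    for i, out in enumerate(outputs):
        err = mse(out, reference)
        if err < best_err:
            best_err, best_i = err, i
        elif i - best_i >= patience:
            break  # no improvement for `patience` iterations
    return best_i
```

Because the noise in the held-out copy is independent of the copy being fitted, the validation error starts rising once the network begins to fit noise, which is the signal the stopping rule looks for.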

2605.25297 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Eureka: Intelligent Feature Engineering for Enterprise AI Cloud Resource Demand Prediction

Eureka:面向企业AI云资源需求预测的智能特征工程

Hangxuan Li, Renjun Jia, Xuezhang Wu, Yunjie Qian, Zeqi Zheng, Xianling Zhang

发表机构 * Alibaba Cloud Computing Co. Ltd, Hangzhou, China(阿里云计算有限公司,杭州,中国) School of Computer Science, Fudan University, Shanghai, China(复旦大学计算机学院,上海,中国) School of Computer Science and Technology, Tongji University, Shanghai, China(同济大学计算机科学与技术学院,上海,中国) Independent Researcher, United States(独立研究员,美国)

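AI summary Proposes Eureka, a framework that treats feature engineering as an agentic code-generation problem and automatically generates executable feature code through three stages (Expert Agent, LLM Feature Factory, and Self-Evolving Alignment Engine), significantly improving performance on 7 public benchmarks in healthcare, finance, and social domains and on GPU resource demand prediction at Alibaba Cloud.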

Comments accepted at NeurIPS 2025 Workshop, DASFAA 2026 (International Conference on Database Systems for Advanced Applications)

详情
Journal ref
Database Systems for Advanced Applications (DASFAA 2026), Lecture Notes in Computer Science, vol. 16540, pp. 528-540, Springer
AI中文摘要

有效的特征对于预测模型性能至关重要,但创建特征通常需要领域专业知识,限制了跨应用的可扩展性。我们将特征工程定义为一个智能体代码生成问题:特征不再是静态的数据转换,而是可生成、评估和迭代改进的可执行程序。我们提出了Eureka,一个由LLM驱动的三阶段框架。(1)专家代理,通过领域知识的SFT微调,生成结构化的JSON格式特征设计方案。(2)LLM特征工厂,通过思维链推理将每个方案转化为可执行的Python代码,将特征假设转化为可运行的程序。(3)自演化对齐引擎,使用带双通道奖励(基于指标的效用+语义对齐)的强化学习(GRPO)来提升代码质量。通过将特征表达为程序,学习到的生成模式可以跨领域迁移。在医疗、金融和社交领域的7个公开基准上评估,Eureka一致优于传统的AutoFE和基于LLM的基线。我们进一步在阿里云的云GPU资源需求预测中展示了Eureka的有效性,其中Eureka将需求满足率提高了16%,并将计算资源迁移率降低了33%。

英文摘要

Effective features are crucial for predictive model performance, but creating them often requires domain expertise, limiting scalability across applications. We define feature engineering as an agentic code generation problem: features are not static data transformations, but executable programs that can be generated, evaluated, and iteratively improved. We present Eureka, an LLM-driven framework with three stages. (1) An Expert Agent, fine-tuned via SFT on domain knowledge, produces structured feature design plans in JSON format. (2) An LLM Feature Factory translates each plan into executable Python code through chain-of-thought reasoning, turning feature hypotheses into runnable programs. (3) A Self-Evolving Alignment Engine uses Reinforcement Learning (GRPO) with dual-channel reward (metric-based utility + semantic alignment) to enhance code quality. By expressing features as programs, the learned generation patterns can transfer across domains. Evaluated on 7 public benchmarks in healthcare, finance, and social domains, Eureka consistently outperforms both traditional AutoFE and LLM-based baselines. We further demonstrate Eureka's effectiveness on cloud GPU resource demand prediction at Alibaba Cloud, where Eureka improves demand fulfillment rate by 16% and lowers computing resource migration rates by 33%.

2605.24846 2026-05-29 cs.LG cs.AI 版本更新

Tiny Brains, Giant Impact: Uncovering the Keystone Neurons of LLM with Just a Few Prompts

微小大脑,巨大影响:仅用少量提示揭示LLM的关键神经元

Xiangtian Ji, Yuxin Chen, Zhengzhou Cai, Xiang Wang, An Zhang, Tat-Seng Chua

发表机构 * National University of Singapore(新加坡国立大学) Beijing University of Posts and Telecommunications(北京邮电大学) University of Science and Technology of China(中国科学技术大学)

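AI summary Through cross-task activation-strength analysis, the study finds an extremely sparse set of keystone neurons in large language models whose removal causes model behavior to collapse, and proposes a fine-tuning method that updates only these neurons, matching or exceeding full-parameter fine-tuning on task performance while modifying far fewer parameters.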

详情
AI中文摘要

大型语言模型(LLM)展现出强大的综合能力,但支撑这些行为的内部机制仍未被充分理解。在这项工作中,我们展示了在多种开放权重Transformer模型中,存在一组神经元在跨多个能力维度的任务推理期间始终保持高度激活。通过沿跨任务激活强度进行探测,我们分离出一个极其稀疏的子集,其移除会导致模型行为崩溃,我们将其称为关键神经元。我们的分析揭示,关键神经元是模型的一个稳定且内在的神经元子集,主要在预训练期间建立。与这些神经元相关的参数在训练过程中被紧密校准,其精确值对模型能力至关重要。基于这些见解,我们提出了一种监督微调方法,仅更新关键神经元,在修改远少于全参数的情况下,实现了与全参数微调相当甚至更好的任务增益,同时更好地保留了其他能力维度的性能。

英文摘要

Large language models (LLMs) display strong comprehensive abilities, yet the internal mechanisms that support these behaviors remain insufficiently understood. In this work, we show that across a wide range of open-weight Transformers, a subset of neurons remains consistently highly activated during inference across tasks spanning multiple capability dimensions. By probing along cross-task activation strength, we isolate an extremely sparse subset of neurons whose removal causes model behavior to collapse; we term these keystone neurons. Our analysis reveals that keystone neurons are a stable and intrinsic neuron subset of the model that is largely established during pretraining. The parameters associated with these neurons are tightly calibrated during the training process, and their precise values are critical for the capabilities of the model. Building on these insights, we propose a supervised fine-tuning approach that updates only keystone neurons, achieving task gains comparable to or even better than full-parameter fine-tuning while better preserving performance in other capability dimensions, despite modifying a much smaller number of parameters.
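
A minimal sketch of the two steps described in this abstract, selecting neurons by cross-task activation strength and then updating only those neurons, is given below. Ranking by the minimum absolute activation across tasks is an assumed proxy for "consistently highly activated", and all names are hypothetical.

```python
def select_keystone(activations, k):
    """activations[task][neuron] -> set of k neuron indices with the highest minimum |activation|."""
    n = len(activations[0])
    strength = [min(abs(task[j]) for task in activations) for j in range(n)]
    ranked = sorted(range(n), key=lambda j: strength[j], reverse=True)
    return set(ranked[:k])


def masked_sgd_step(weights, grads, keystone, lr):
    """SGD step that updates only the weight rows belonging to keystone neurons."""
    return [
        [w - lr * g for w, g in zip(w_row, g_row)] if i in keystone else list(w_row)
        for i, (w_row, g_row) in enumerate(zip(weights, grads))
    ]
```

All rows outside the selected set are returned unchanged, which is the sense in which such an update modifies far fewer parameters than full fine-tuning.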

2605.23993 2026-05-29 cs.CV cs.AI cs.LG 版本更新

Nano World Models: A Minimalist Implementation of Future Video Prediction

纳米世界模型:未来视频预测的极简实现

Siqiao Huang, Partha Kaushik, Michael Chen, Hengkai Pan, Kaiwen Geng, Omar Chehab, Fernando Moreno-Pino, Max Simchowitz

发表机构 * DeepMind

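AI summary Presents Nano World Models, a minimalist codebase for future video prediction built around diffusion forcing. It supports controlled studies of world-model design choices, and its experiments analyze how factors such as prediction parameterization and architecture scale affect video prediction quality.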

Comments Project page: https://simchowitzlabpublic.github.io/nano-world-model/

详情
AI中文摘要

世界模型已成为学习预测模拟器的核心范式,支持生成、规划和决策。然而,尽管工业级交互式视频生成取得了快速进展,更广泛的研究社区仍然缺乏紧凑、可重复且易于扩展的实现来研究现代世界模型的设计选择。我们介绍了Nano World Models,一个围绕扩散强迫的极简代码库,用于未来视频预测。Nano World Models为生成目标、模型规模、动作条件机制、潜在观测空间、数据集、评估协议和长程展开程序提供了统一接口。这种设计使得通常在不同实现中纠缠的世界模型组件可以进行受控研究。通过在简单控制环境、游戏模拟和真实机器人数据上的实验,我们考察了预测参数化、架构规模、动作注入、采样预算和领域复杂性如何影响视频预测质量和自回归展开行为。通过发布代码、配置、评估脚本和预训练检查点,Nano World Models旨在为开放、可重复和科学的世界模型研究提供一个紧凑但可扩展的实验基础。

英文摘要

World models have become a central paradigm for learning predictive simulators that support generation, planning, and decision-making. Yet, despite rapid progress in industry-scale interactive video generation, the broader research community still lacks compact, reproducible, and easily extensible implementations for studying the design choices underlying modern world models. We introduce Nano World Models, a minimalist codebase for future video prediction centered around diffusion forcing. Nano World Models provides a unified interface for generative objectives, model scales, action-conditioning mechanisms, latent observation spaces, datasets, evaluation protocols, and long-horizon rollout procedures. This design enables controlled studies of world-modeling components that are often entangled across separate implementations. Through experiments across simple control environments, game simulation, and real-robot data, we examine how prediction parameterization, architecture scale, action injection, sampling budget, and domain complexity affect video prediction quality and autoregressive rollout behavior. By releasing code, configurations, evaluation scripts, and pretrained checkpoints, Nano World Models aims to provide a compact yet extensible experimental substrate for open, reproducible, and scientific world-model research.
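
The core of diffusion forcing is that each frame in a sequence receives its own noise level instead of one level shared by the whole clip. The sketch below shows only that noising step, assuming frames are flat lists and that `alpha_bars` holds one cumulative signal coefficient per frame; it is not taken from the Nano World Models codebase.

```python
import math
import random


def noise_frames(frames, alpha_bars, rng):
    """Noise each frame independently: sqrt(a) * x + sqrt(1 - a) * eps, with its own level a."""
    noisy = []
    for frame, a in zip(frames, alpha_bars):
        eps = [rng.gauss(0.0, 1.0) for _ in frame]  # fresh Gaussian noise per frame
        noisy.append([math.sqrt(a) * x + math.sqrt(1.0 - a) * e for x, e in zip(frame, eps)])
    return noisy
```

Setting the level to 1.0 for past frames and to 0.0 for future frames recovers a clean context followed by pure noise, which is how per-frame noise levels connect to autoregressive rollout.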

2605.22924 2026-05-29 cs.LG cs.IR 版本更新

Building a privacy-preserving Federated Recommender system for mobile devices

构建保护隐私的移动设备联邦推荐系统

Aasheesh Singh

发表机构 * Département d’informatique et de recherche opérationnelle(计算机与运筹研究部)

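AI summary Proposes a two-stage federated recommender system pipeline that separates non-sensitive preference data from sensitive on-device context data, enabling personalized recommendation on mobile devices while preserving privacy.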

Comments Masters thesis, Université de Montréal, Department of Computer Science and Operations Research, 2024

详情
AI中文摘要

在移动设备上提供个性化内容传统上需要在中央服务器上汇集敏感用户数据,这种做法越来越不符合现代隐私期望和地域法规。我们提出了一种用于移动设备的两阶段联邦推荐系统流水线,其核心原则是将非敏感的用户偏好数据与永不离开设备的敏感移动上下文数据分离。第一阶段在云端对非敏感的应用上下文数据运行协同过滤模型,生成相关项目的短列表。第二阶段在设备上使用敏感的移动信号对这些候选项目进行重新排序,只有模型更新/梯度会离开设备。我们在MovieLens、UCI人类活动识别以及一个专有试点数据集上验证了该方法,并提供了一个生产就绪的实现,作为可在Android和iOS上部署的Kotlin多平台库。

英文摘要

Serving personalized content on mobile devices has traditionally required pooling sensitive user data on centralized servers, a practice increasingly at odds with modern privacy expectations and geographical regulations. We present a two-stage federated recommendation system pipeline for mobile devices, built around a principled separation between non-sensitive user preference data and sensitive mobile context data that never leaves the device. The first stage runs a collaborative filtering model on non-sensitive app-context data in the cloud to generate a shortlist of relevant items. The second stage re-ranks these candidates on-device using sensitive mobile signals, with only model updates/gradients ever leaving the device. We validate the approach on MovieLens, UCI Human Activity Recognition, and a proprietary pilot dataset, and deliver a production-ready implementation as a Kotlin Multiplatform library deployable on Android and iOS.
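
The two-stage split can be sketched as a cloud-side shortlist followed by an on-device re-rank. The scoring inputs and function names below are assumptions for illustration; the thesis implementation is a Kotlin Multiplatform library and is not reproduced here.

```python
def cloud_shortlist(cf_scores, k):
    """Stage 1 (cloud): top-k item ids by collaborative-filtering score on non-sensitive data."""
    return sorted(cf_scores, key=cf_scores.get, reverse=True)[:k]


def on_device_rerank(shortlist, context_score):
    """Stage 2 (device): reorder the shortlist with a score computed from local sensitive signals."""
    return sorted(shortlist, key=context_score, reverse=True)
```

Only the shortlist crosses the network in this sketch; the context scores that drive the final order are computed and consumed on the device.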

2605.22586 2026-05-29 cs.LG cs.CL 版本更新

A Tutorial on Diffusion Theory: From Differential Equations to Diffusion Models

扩散理论教程:从微分方程到扩散模型

Jiayi Fu, Yuxia Wang

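AI summary This tutorial gives a unified account of the mathematical foundations of diffusion models from the perspective of differential equations: it derives the ODE and SDE representations, explains score matching and the denoising objective, and covers DDPM, DDIM, flow matching, and diffusion language models.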

Comments A detailed tutorial on Diffusion models and SDE

详情
AI中文摘要

扩散模型已成为生成建模的主导框架,但其数学基础通常通过扩散概率模型、基于分数的建模、随机微分方程和数值采样方法分别呈现。我们编写本教程,从微分方程的角度提供这些观点的统一且自洽的阐述。从条件高斯噪声过程出发,我们推导常微分方程(ODE)和随机微分方程(SDE)表示,过渡到相应的边际正向动力学,然后得到使生成成为可能的逆向时间SDE和概率流ODE。我们表明逆向采样中的中心未知量是边际分数,解释在噪声预测参数化下分数匹配如何成为标准去噪目标,并讨论实际的逆向时间采样和引导。我们进一步将DDPM、DDIM、流匹配和基于分数的SDE置于一个共同框架中,并以连续嵌入空间中的扩散语言模型结束,同时简要讨论离散掩码标记扩散。本教程旨在作为扩散过程的分析基础与建立在其上的现代生成算法之间的桥梁。

英文摘要

Diffusion models have emerged as a dominant framework for generative modeling, but their mathematical foundations are often presented separately through diffusion probabilistic models, score-based modeling, stochastic differential equations, and numerical sampling methods. We write this tutorial to provide a unified and self-contained account of these viewpoints from the perspective of differential equations. Starting from a conditional Gaussian noising process, we derive ordinary differential equation (ODE) and stochastic differential equation (SDE) representations, pass to the corresponding marginal forward dynamics, and then obtain the reverse-time SDE and probability-flow ODE that make generation possible. We show that the central unknown quantity in reverse sampling is the marginal score, explain how score matching becomes the standard denoising objective under a noise-prediction parameterization, and discuss practical reverse-time sampling and guidance. We further place DDPM, DDIM, flow matching, and score-based SDEs in a common framework, and conclude with diffusion language models in continuous embedding space together with a brief discussion of discrete masked-token diffusion. The tutorial is intended as a bridge between the analytical foundations of diffusion processes and the modern generative algorithms built upon them.
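
The step from score matching to the denoising objective can be written out in a few lines. The derivation below is a standard one for a conditional Gaussian noising process; the symbols $\alpha_t$, $\sigma_t$, $s_\theta$, and $\varepsilon_\theta$ are notation assumed here and may differ from the tutorial's own.

```latex
\begin{align}
x_t &= \alpha_t x_0 + \sigma_t \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, I), \\
\nabla_{x_t} \log p(x_t \mid x_0)
  &= -\frac{x_t - \alpha_t x_0}{\sigma_t^2}
   = -\frac{\varepsilon}{\sigma_t}, \\
s_\theta(x_t, t) &= -\frac{\varepsilon_\theta(x_t, t)}{\sigma_t}, \\
\sigma_t^2 \,\bigl\| s_\theta(x_t, t) - \nabla_{x_t} \log p(x_t \mid x_0) \bigr\|^2
  &= \bigl\| \varepsilon_\theta(x_t, t) - \varepsilon \bigr\|^2 .
\end{align}
```

With the weighting $\sigma_t^2$, the denoising score matching loss is therefore the familiar noise-prediction loss, which is why a network trained to predict $\varepsilon$ also provides an estimate of the score needed for reverse-time sampling.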

2605.22082 2026-05-29 cs.RO cs.LG 版本更新

CoRMA: Contrastive RMA for Contact-Rich Meta-Adaptation

CoRMA: 用于接触丰富元适应的对比RMA

Wentian Wang, Chutong Wen, Hongxu Ma, Wuhao Wang, Zhexiong Xue, Abdul Haseeb Nizamani, Dandi Zhou, Xinhai Sun, Jianqiao Zhu

发表机构 * Synthoid AI

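AI summary: Proposes CoRMA, a framework that uses a semantic contact context and contrastive learning for meta-adaptation on force-dominant assembly tasks, without demonstrations or gradient updates; it is evaluated in simulation and on a real arm, where it retains higher real-world success than FORGE baselines.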

详情
AI中文摘要

我们提出CoRMA(对比机器人运动适应),一个基于上下文的元适应框架,修改了RMA以适用于力主导的装配任务。CoRMA用紧凑的6维仅仿真语义接触上下文(描述接触开始、侧向接合、引导过渡、接触方向和卡滞)替换原始仿真器参数适应。一个可部署的因果Transformer适配器通过语义回归和力状态对比目标,从力、本体感受和动作历史中在线推断该上下文。部署时,移除真实上下文并由推断上下文替代,从而无需演示、特权输入或梯度更新即可实现片段内适应。我们在Isaac Lab / Isaac Sim 5.0中的PegInsert、GearMesh和NutThread任务以及真实Marvin机械臂上评估CoRMA。与在仿真中成功率高但在硬件上大幅下降的FORGE基线相比,CoRMA在受控目标位姿噪声下保留了更高的验证真实成功率。这些结果支持语义接触推断作为相关装配任务族内可复用的适应接口,而更广泛的未见任务泛化和Real2Sim校准仍是未来工作。

英文摘要

We present CoRMA (Contrastive Robotic Motor Adaptation), a context-based meta-adaptation framework that modifies RMA for force-dominant assembly. CoRMA replaces raw simulator-parameter adaptation with a compact 6D simulator-only semantic contact context describing contact onset, lateral engagement, guided transition, contact direction, and jamming. A deployable causal Transformer adapter infers this context online from force, proprioceptive, and action histories using semantic regression and a force-regime contrastive objective. At deployment, oracle context is removed and replaced by the inferred context, enabling within-episode adaptation without demonstrations, privileged inputs, or gradient updates. We evaluate CoRMA on PegInsert, GearMesh, and NutThread in Isaac Lab / Isaac Sim 5.0 and on a real Marvin arm. Compared with FORGE baselines that achieve high simulation success but degrade substantially on hardware, CoRMA retains higher verified real success under controlled target-pose noise. These results support semantic contact inference as a reusable adaptation interface within a related assembly task family, while broader unseen-task generalization and Real2Sim calibration remain future work.

2605.22069 2026-05-29 cs.CV cs.LG 版本更新

TWINGS: Thin Plate Splines Warp-aligned Initialization for Sparse-View Gaussian Splatting

TWINGS: 基于薄板样条翘曲对齐的稀疏视图高斯泼溅初始化

Hyeseong Kim, Geonhui Son, Deukhee Lee, Dosik Hwang

发表机构 * Yonsei University(延世大学) Korea Institute of Science and Technology(韩国科学技术研究院)

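AI summary: Proposes TWINGS, which uses thin plate splines (TPS) to align backprojected points with triangulated control points, giving 3D Gaussian Splatting a geometrically accurate initialization that improves detail preservation and color fidelity under sparse views.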

Comments Accepted at CVPR 2026, Project page: https://sandokim.github.io/twings/

详情
AI中文摘要

从稀疏视图输入进行新视角合成是3D计算机视觉中的一个重大挑战,特别是在有限视角下实现高质量场景重建。我们引入了TWINGS,这是一个通过直接解决点稀疏性来增强3D高斯泼溅(3DGS)的框架。我们采用薄板样条(TPS),一种平滑的非刚性变形模型,通过最小化弯曲能量从控制点对应关系估计全局一致的翘曲,将估计深度反投影的点与三角化的3D控制点对齐,从而生成校准的反投影点。通过在这些控制点附近采样校准点,TWINGS为3DGS提供了快速且几何精确的初始化,最终改善了重建场景中结构细节的保留和颜色保真度。在DTU、LLFF和Mip-NeRF360上的大量实验表明,TWINGS在稀疏视图场景下始终优于现有方法,提供详细且准确的重建。

英文摘要

Novel view synthesis from sparse-view inputs poses a significant challenge in 3D computer vision, particularly for achieving high-quality scene reconstructions with limited viewpoints. We introduce TWINGS, a framework that enhances 3D Gaussian Splatting (3DGS) by directly addressing point sparsity. We employ Thin Plate Splines (TPS), a smooth non-rigid deformation model that estimates a globally coherent warp from control-point correspondences by minimizing bending energy. TPS aligns points backprojected from estimated depth with triangulated 3D control points, yielding calibrated backprojected points. By sampling these calibrated points near the control points, TWINGS provides a fast and geometrically accurate initialization for 3DGS, ultimately improving structural detail preservation and color fidelity in reconstructed scenes. Extensive experiments on DTU, LLFF, and Mip-NeRF360 demonstrate that TWINGS consistently outperforms existing methods, delivering detailed and accurate reconstructions under sparse-view scenarios.

2605.18587 2026-05-29 q-bio.GN cs.LG 版本更新

PACE: Geometry-Aware Bridge Transport for Single-Cell Trajectory Inference

PACE: 几何感知的桥梁传输用于单细胞轨迹推断

Chenglei Yu, Chuanrui Wang, Bangyan Liao, Tailin Wu

发表机构 * Zhejiang University(浙江大学) Department of Artificial Intelligence, School of Engineering, Westlake University(人工智能学院,西湖大学)

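AI summary: To address trajectory misalignment caused by asynchronous development in single-cell trajectory inference, proposes PACE, which builds an anisotropic Riemannian metric, alternates between refining cross-time couplings and fitting neural bridges, and distills a global velocity field; across seven datasets it reduces MMD, Wasserstein-1, and Wasserstein-2 distances by 23.7% on average relative to the strongest competing baseline.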

Comments 31 pages, 12 figures

详情
AI中文摘要

基于破坏性时间序列快照的单细胞轨迹推断本质上是病态的:既未观察到跨时间细胞对应关系,也未观察到连续轨迹,因此仅凭快照分布无法唯一确定底层动力学。现有的最优传输和基于流的方法通常根据观察到的时钟时间通过欧几里得邻近性耦合细胞,当发育异步且在同一实验时间采样的细胞处于不同潜在伪时间阶段时,这可能导致轨迹错位。我们提出PACE,一个轨迹推断框架,通过三个耦合组件从破坏性时间序列快照中恢复几何一致的连续传输动力学。首先,PACE构建一个状态和时间依赖的各向异性黎曼度量,沿局部支持的切向方向分配低传输成本,同时惩罚法向速度分量。其次,它在诱导路径作用成本下交替优化跨时间耦合,并拟合相邻快照之间保持端点的神经桥梁。第三,它将学习到的桥梁动力学蒸馏为细胞状态上的全局连续时间速度场。在涵盖九个保留重建实验的七个受控和生物数据集上,PACE实现了最强的整体重建性能,相对于最强竞争基线,平均降低了MMD、Wasserstein-1距离和Wasserstein-2距离23.7%。在胚状体分化基准上,PACE还将RNA速度对齐提高了15.4%,且在训练过程中不需要显式的细胞配对、谱系追踪或RNA速度监督。代码可在https://github.com/AI4Science-WestlakeU/PACE获取。

英文摘要

Single-cell trajectory inference from destructive time-course snapshots is fundamentally ill-posed: neither cross-time cell correspondences nor continuous trajectories are observed, so the snapshot distributions alone do not uniquely determine the underlying dynamics. Existing optimal transport and flow-based methods typically couple cells by Euclidean proximity at observed clock times, which can misalign trajectories when development is asynchronous and cells sampled at the same experimental time occupy different latent pseudotime stages. We propose PACE, a trajectory inference framework that recovers geometry-consistent continuous transport dynamics from destructive time-course snapshots through three coupled components. First, PACE constructs a state- and time-dependent anisotropic Riemannian metric that assigns low transport cost along locally supported tangent directions while penalizing normal velocity components. Second, it alternates between refining cross-time couplings under the induced path-action cost and fitting endpoint-preserving neural bridges between adjacent snapshots. Third, it distills the learned bridge dynamics into a global continuous-time velocity field over cellular states. Across seven controlled and biological datasets covering nine held-out reconstruction experiments, PACE achieves the strongest overall reconstruction performance, reducing MMD, Wasserstein-1 distance, and Wasserstein-2 distance by 23.7% on average relative to the strongest competing baseline. PACE also improves RNA-velocity alignment by 15.4% on an embryoid body differentiation benchmark, without requiring explicit cell pairing, lineage tracing, or RNA-velocity supervision during training. Code is available at https://github.com/AI4Science-WestlakeU/PACE.

2605.15422 2026-05-29 cs.LG 版本更新

DualKV: Shared-Prompt Flash Attention for Efficient RL Training with Large Rollouts and Long Contexts

DualKV: 面向高效RL训练的共享提示Flash注意力机制,支持大规模展开和长上下文

Jiading Gai, Shuai Zhang, Xiang Song, Bernie Wang, George Karypis

发表机构 * Amazon Web Services(亚马逊网络服务) Google(谷歌) University of Minnesota(明尼苏达大学)

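AI summary: To remove redundant computation over shared prompts in RL training, proposes the DualKV kernel, which combines fused CUDA forward and backward kernels with a repacked veRL data pipeline to eliminate prompt replication, yielding 1.63-3.82x policy-update speedups.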

详情
AI中文摘要

现代RL后训练方法(如GRPO和DAPO)在从共享提示($P$个token)采样的$N$个响应序列(每个$R$个token)上进行训练,但标准FlashAttention在前向和反向传播中将所有$P$个提示token复制$N$次——在相同的隐藏状态上重复计算和内存。在大规模展开、长上下文RL训练($N\geq16$,$P\geq8\text{K}$)中,这种冗余主导了策略更新成本。我们观察到,在仅解码器模型中,因果掩码使提示表示在每一层跨序列不变,因此所有逐token操作(归一化、投影、MLP)和注意力可以一次性处理提示——这一特性尚未在训练的内核级别被利用。我们提出\textbf{DualKV},这是首个消除RL训练中共享提示复制的FlashAttention内核变体,通过(1)~融合的CUDA前向和反向内核,在单次内核启动中迭代两个不相交的KV区域——共享上下文和逐序列响应,以及(2)~veRL中的数据流水线重设计,将$N(P{+}R)$个token重新打包为每个微批$P{+}NR$个token,将token减少从注意力扩展到整个模型,因子$ρ= N(P{+}R)/(P{+}NR)$。DualKV在数学上等价于标准注意力,且不引入近似。在Qwen3-8B GRPO训练中,使用8$\times$H100 GPU($N{=}32$,8K上下文),DualKV实现了$1.63$--$2.09\times$的策略更新加速,支持$2\times$更大的微批,并将MFU从$36\%$提升至$76\%$。类似增益在DAPO上成立($2.47\times$加速,$77\%$ MFU)。在30B MoE规模下,使用16$\times$H100,DualKV相比FlashAttention(需要4路Ulysses序列并行以避免OOM)实现了$3.82\times$的策略更新加速和$3.38\times$的端到端步骤加速。

英文摘要

Modern RL post-training methods such as GRPO and DAPO train on $N$ response sequences of $R$ tokens sampled from a shared prompt of $P$ tokens, but standard FlashAttention replicates all $P$ prompt tokens $N$ times across both forward and backward passes -- duplicating compute and memory on identical hidden states. In large-rollout, long-context RL training ($N{\geq}16$, $P{\geq}8\text{K}$), this redundancy dominates the policy update cost. We observe that in decoder-only models, causal masking makes prompt representations invariant across sequences at every layer, so all per-token operations (norms, projections, MLP) and attention can process the prompt once -- a property not yet exploited at the kernel level for training. We propose DualKV, the first FlashAttention kernel variant that eliminates shared-prompt replication during RL training, via (1) fused CUDA forward and backward kernels that iterate over two disjoint KV regions -- shared context and per-sequence response -- in a single kernel launch, and (2) a data-pipeline redesign in veRL that repacks $N(P{+}R)$ tokens into $P{+}NR$ tokens per micro-batch, extending the token reduction from attention to the entire model by a factor $\rho = N(P{+}R)/(P{+}NR)$. DualKV is mathematically equivalent to standard attention and introduces no approximation. On Qwen3-8B GRPO training with 8$\times$H100 GPUs ($N{=}32$, 8K-context), DualKV achieves $1.63$--$2.09\times$ policy-update speedup, enables $2\times$ larger micro-batches, and raises MFU from $36\%$ to $76\%$. Similar gains hold for DAPO ($2.47\times$ speedup, $77\%$ MFU). At 30B MoE scale on 16$\times$H100, DualKV achieves $3.82\times$ policy-update and $3.38\times$ end-to-end step speedup over FlashAttention (which requires 4-way Ulysses sequence parallelism to avoid OOM).
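The token-count ratio N(P+R)/(P+NR) quoted in the abstract is easy to evaluate for a given rollout shape. The sketch below only does that arithmetic; the function name is hypothetical, and the response length R = 1024 is an assumed value, since the abstract does not state one.

```python
def token_reduction(n, p, r):
    """Return (replicated tokens, repacked tokens, ratio) for n responses of
    r tokens that share one prompt of p tokens."""
    replicated = n * (p + r)  # prompt copied once per response
    packed = p + n * r        # prompt kept once, responses appended
    return replicated, packed, replicated / packed


# N = 32 and an 8K prompt follow the abstract's setting; R = 1024 is assumed.
replicated, packed, ratio = token_reduction(32, 8192, 1024)
print(replicated, packed, round(ratio, 2))
```

The ratio approaches N when responses are short relative to the prompt and approaches 1 when responses dominate. It counts tokens only and should not be read as a predicted wall-clock speedup.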

2605.13841 2026-05-29 cs.SD cs.AI cs.CL cs.LG 版本更新

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents

EVA-Bench:一种用于评估语音代理的新型端到端框架

Tara Bogavelli, Gabrielle Gauthier Melançon, Katrina Stankiewicz, Oluwanifemi Bamgbose, Fanny Riols, Hoang H. Nguyen, Raghav Mehndiratta, Lindsay Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, Srinivas Sunkara

发表机构 * ServiceNow

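AI summary: Proposes EVA-Bench, a framework that evaluates voice agents on both accuracy and experience quality through simulated bot-to-bot audio conversations and two composite metrics, EVA-A and EVA-X.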

Comments Work in progress

详情
AI中文摘要

语音代理是一种通过口语对话完成任务的人工智能系统,越来越多地部署在企业应用中。然而,现有基准测试未能同时解决两个核心评估挑战:生成逼真的模拟对话,以及全面衡量语音特定故障模式的质量。我们提出了EVA-Bench,一个端到端评估框架,同时解决这两个问题。在模拟方面,EVA-Bench通过动态多轮对话协调机器人间的音频对话,并自动进行模拟验证,检测用户模拟器错误并在评分前适当重新生成对话。在测量方面,EVA-Bench引入了两个复合指标:EVA-A(准确性),捕捉任务完成度、忠实度和音频级语音保真度;以及EVA-X(体验),捕捉对话进展、口语简洁性和话轮转换时机。这两个指标适用于所有主要的代理架构,支持直接的跨架构比较。EVA-Bench包含三个企业领域的213个场景、一个用于口音和噪声鲁棒性的受控扰动套件,以及区分峰值能力和可靠能力的pass@1、pass@k、pass^k测量。在跨越所有三种架构的12个系统中,我们发现:(1)没有系统在EVA-A pass@1和EVA-X pass@1上同时超过0.5;(2)峰值性能和可靠性能差异显著(EVA-A上pass@k与pass^k的中位数差距为0.44);(3)口音和噪声扰动暴露了显著的鲁棒性差距,其影响因架构、系统和指标而异(平均Δ高达0.314)。我们在开源许可下发布了完整的框架、评估套件和基准数据。

英文摘要

Voice agents, artificial intelligence systems that conduct spoken conversations to complete tasks, are increasingly deployed across enterprise applications. However, no existing benchmark jointly addresses two core evaluation challenges: generating realistic simulated conversations, and measuring quality across the full scope of voice-specific failure modes. We present EVA-Bench, an end-to-end evaluation framework that addresses both. On the simulation side, EVA-Bench orchestrates bot-to-bot audio conversations over dynamic multi-turn dialogues, with automatic simulation validation that detects user-simulator errors and regenerates the affected conversations before scoring. On the measurement side, EVA-Bench introduces two composite metrics: EVA-A (Accuracy), capturing task completion, faithfulness, and audio-level speech fidelity; and EVA-X (Experience), capturing conversation progression, spoken conciseness, and turn-taking timing. Both metrics apply to all major agent architectures, enabling direct cross-architecture comparison. EVA-Bench includes 213 scenarios across three enterprise domains, a controlled perturbation suite for accent and noise robustness, and pass@1, pass@k, and pass^k measurements that distinguish peak from reliable capability. Across 12 systems spanning all three architectures, we find: (1) no system simultaneously exceeds 0.5 on both EVA-A pass@1 and EVA-X pass@1; (2) peak and reliable performance diverge substantially (median pass@k--pass^k gap of 0.44 on EVA-A); and (3) accent and noise perturbations expose substantial robustness gaps, with effects varying across architectures, systems, and metrics (mean $\Delta$ up to 0.314). We release the full framework, evaluation suite, and benchmark data under an open-source license.
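The distinction between peak capability (pass@k) and reliable capability (pass^k) can be made concrete with the combinatorial estimators commonly used for these quantities. The abstract does not say which estimators EVA-Bench uses, so the sketch below is an assumption: given n trials of one scenario with c successes, pass@k is the chance that at least one of k sampled trials succeeds, and pass^k is the chance that all k succeed.

```python
from math import comb


def pass_at_k(n, c, k):
    """Probability that at least one of k trials drawn without replacement
    from n trials (c of them successful) is a success."""
    return 1.0 - comb(n - c, k) / comb(n, k)


def pass_hat_k(n, c, k):
    """Probability that all k trials drawn without replacement succeed."""
    return comb(c, k) / comb(n, k)


# Hypothetical scenario: 5 trials, 3 successes, k = 3.
peak = pass_at_k(5, 3, 3)
reliable = pass_hat_k(5, 3, 3)
print(peak, reliable, peak - reliable)
```

A system that succeeds on some runs but not others scores high on the first measure and low on the second, which is the kind of gap the abstract reports.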

2605.12208 2026-05-29 stat.ML cs.AI cs.LG stat.CO 版本更新

Self-Supervised Laplace Approximation for Bayesian Uncertainty Quantification

自监督拉普拉斯近似用于贝叶斯不确定性量化

Julian Rodemann, Alexander Marquard, Thomas Augustin, Michele Caprio

发表机构 * Rational Intelligence Lab, CISPA Helmholtz Center for Information Security(理性智能实验室,CISPA亥姆霍兹信息安全中心) Department of Statistics, LMU Munich(统计学系,慕尼黑大学) Department of Computer Science, The University of Manchester(计算机科学系,曼彻斯特大学)

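AI summary: Proposes the Self-Supervised Laplace Approximation (SSLA), which approximates the posterior predictive distribution directly by refitting on self-predicted data, giving deterministic, sampling-free Bayesian uncertainty quantification that outperforms the classical Laplace approximation in predictive calibration on regression tasks.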

Comments Accepted for publication in TMLR (https://openreview.net/forum?id=T8w8L2t3JG), v2: fixed typos and added a deceased-author footnote with a dedication to Thomas Augustin

详情
Journal ref
Transactions on Machine Learning Research (TMLR). ISSN 2835-8856 (2026)
AI中文摘要

近似贝叶斯推断通常围绕计算后验参数分布展开。然而,在实践中,感兴趣的主要对象通常是模型的预测而非其参数。在这项工作中,我们提出绕过参数后验,直接关注近似后验预测分布。我们通过从自监督和半监督学习中的自训练中汲取灵感来实现这一点。本质上,我们通过重新拟合自预测数据来量化贝叶斯模型的预测不确定性。这个想法非常简单:如果模型对自预测数据赋予高似然,那么这些预测的不确定性低,反之亦然。这产生了后验预测的确定性、无采样近似。我们的自监督拉普拉斯近似(SSLA)的模块化结构进一步允许我们插入不同的先验规范,从而实现经典的贝叶斯敏感性(关于先验选择)分析。为了绕过昂贵的重新拟合,我们进一步引入了SSLA的近似版本,称为ASSLA。我们从理论和经验上研究了(A)SSLA,涉及从贝叶斯线性模型到贝叶斯神经网络的回归模型。在模拟和真实数据集的广泛回归任务中,我们的方法在预测校准方面优于经典拉普拉斯近似,同时保持计算效率。

英文摘要

Approximate Bayesian inference typically revolves around computing the posterior parameter distribution. In practice, however, the main object of interest is often a model's predictions rather than its parameters. In this work, we propose to bypass the parameter posterior and focus directly on approximating the posterior predictive distribution. We achieve this by drawing inspiration from self-training within self-supervised and semi-supervised learning. Essentially, we quantify a Bayesian model's predictive uncertainty by refitting on self-predicted data. The idea is strikingly simple: If a model assigns high likelihood to self-predicted data, these predictions are of low uncertainty, and vice versa. This yields a deterministic, sampling-free approximation of the posterior predictive. The modular structure of our Self-Supervised Laplace Approximation (SSLA) further allows us to plug in different prior specifications, enabling classical Bayesian sensitivity (w.r.t. prior choice) analysis. In order to bypass expensive refitting, we further introduce an approximate version of SSLA, called ASSLA. We study (A)SSLA both theoretically and empirically in regression models ranging from Bayesian linear models to Bayesian neural networks. Across a wide array of regression tasks with simulated and real-world datasets, our methods outperform classical Laplace approximations in predictive calibration while remaining computationally efficient.

2605.06322 2026-05-29 cs.LG 版本更新

SMolLM: Small Language Models Learn Small Molecular Grammar

SMolLM: 小型语言模型学习小型分子语法

Akhil Jindal, Harang Ju

发表机构 * Carey Business School Johns Hopkins University(约翰霍普金斯大学Carey商学院)

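AI summary: Presents SMolLM, a 53K-parameter weight-shared Transformer that generates SMILES with 95% validity on ZINC-250K, outperforming a GPT with 10 times more parameters, and that resolves SMILES constraints in a fixed hierarchy (brackets, then rings, then valence).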

Comments 19 pages, 5 figures, 11 tables

详情
AI中文摘要

用于分子设计的语言模型已扩展到数亿个参数,但人们对它们如何学习化学语法知之甚少。我们训练了SMolLM,一个53K参数的权重共享Transformer,在ZINC-250K药物样分子基准上生成新颖的SMILES,有效性达95%,优于参数多10倍的标准GPT。从机制上看,同一模块在多次前向传播中以固定层次结构解决SMILES约束:首先是括号,其次是环,最后是化合价,这一点通过错误分类和线性探测得到证明,并通过消融实验隔离出括号匹配头。综合这些结果,我们得到了一个紧凑、可机械解释的分子生成器,以及一个用于研究形式语言领域迭代计算的测试平台。

英文摘要

Language models for molecular design have scaled to hundreds of millions of parameters, yet how they learn chemical grammar is poorly understood. We train SMolLM, a 53K-parameter weight-shared transformer, to generate novel SMILES with 95% validity on the ZINC-250K drug-like-molecule benchmark, outperforming a standard GPT with 10 times more parameters. Mechanistically, the same block resolves SMILES constraints across passes in a fixed hierarchy: brackets first, rings second, and valence last, as shown by error classification and linear probing, with ablation isolating the bracket-matching head. Together, these results yield a compact, mechanistically interpretable molecular generator and a testbed for studying iterative computation in formal-language domains.

2605.04916 2026-05-29 cs.AI cs.LG cs.SC 版本更新

A Foundation Model for Zero-Shot Logical Rule Induction

零样本逻辑规则归纳的基础模型

Yin Jun Phua

发表机构 * Institute of Science Tokyo(东京科学研究所)

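AI summary: Proposes the Neural Rule Inducer (NRI), a pretrained model built on a statistical encoder and a parallel slot-based decoder that performs zero-shot logical rule induction and generalizes to new predicates without retraining.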

Comments Camera-ready version accepted at IJCAI 2026, with full appendices

详情
AI中文摘要

归纳逻辑编程(ILP)从数据中学习可解释的逻辑规则。现有方法是传导性的:其学习参数绑定到特定谓词,并且每个新任务都需要重新训练。我们引入了神经规则归纳器(NRI),一种用于零样本规则归纳的预训练模型。NRI 不编码文字标识,而是使用领域无关的统计属性(如类别条件率、熵和共现)来表示文字,这些属性无需重新训练即可泛化到不同的标识和数量。该模型由一个统计编码器和一个基于并行槽的解码器组成。并行解码保持了逻辑析取的置换不变性;而自回归解码器则会施加任意子句顺序。乘积 T-范数松弛使规则执行可微分,从而仅基于预测准确性进行端到端训练。我们在规则恢复、对标签噪声和虚假相关性的鲁棒性以及零样本迁移到真实世界基准上评估了 NRI,并相信这项工作开启了符号推理基础模型的可能性。代码和参考检查点可在 https://github.com/phuayj/neural-rule-inducer 获取。

英文摘要

Inductive Logic Programming (ILP) learns interpretable logical rules from data. Existing methods are transductive: their learned parameters are bound to specific predicates and require retraining for each new task. We introduce Neural Rule Inducer (NRI), a pretrained model for zero-shot rule induction. Rather than encoding literal identities, NRI represents literals using domain-agnostic statistical properties such as class-conditional rates, entropy, and co-occurrence, which generalize across variable identities and counts without retraining. The model consists of a statistical encoder and a parallel slot-based decoder. Parallel decoding preserves the permutation invariance of logical disjunction; an autoregressive decoder would instead impose an arbitrary clause order. Product T-norm relaxation makes rule execution differentiable, allowing end-to-end training on prediction accuracy alone. We evaluate NRI on rule recovery, robustness to label noise and spurious correlations, and zero-shot transfer to real-world benchmarks, and we believe this work opens up the possibility of foundation models for symbolic reasoning. Code and the reference checkpoint are available at https://github.com/phuayj/neural-rule-inducer.
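The product T-norm relaxation mentioned in the abstract can be sketched for a rule in disjunctive normal form. This is a generic illustration of the standard product T-norm and its dual (probabilistic sum), not NRI's implementation; the rule encoding and function names are hypothetical.

```python
from math import prod


def soft_and(values):
    """Product T-norm: relaxed conjunction of truth values in [0, 1]."""
    return prod(values)


def soft_or(values):
    """Dual T-conorm (probabilistic sum): relaxed disjunction."""
    return 1.0 - prod(1.0 - v for v in values)


def eval_dnf(rule, truth):
    """Evaluate a DNF rule on soft truth values.

    rule  : list of clauses, each a list of (literal_index, negated) pairs
    truth : list of truth values in [0, 1], one per literal
    """
    clause_values = [
        soft_and((1.0 - truth[i]) if negated else truth[i] for i, negated in clause)
        for clause in rule
    ]
    return soft_or(clause_values)


# Hypothetical rule: (x0 AND NOT x1) OR x2
rule = [[(0, False), (1, True)], [(2, False)]]

print(eval_dnf(rule, [1.0, 0.0, 0.0]))  # Boolean input: agrees with classical logic
print(eval_dnf(rule, [0.5, 0.5, 0.5]))  # soft input: value strictly between 0 and 1
```

Because the disjunction is a product over clauses, reordering the clauses leaves the result unchanged, which is the permutation invariance that a parallel slot decoder preserves.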

2605.02116 2026-05-29 cs.LG 版本更新

Statistical Consistency and Generalization of Contrastive Representation Learning

对比表示学习的统计一致性与泛化性

Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying

发表机构 * University of Sydney Texas A&M University

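AI summary: Develops a unified statistical learning theory showing that the contrastive loss is statistically consistent with optimal ranking, and derives generalization bounds that improve as the number of negative samples grows, explaining the empirical benefit of large negative sets.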

Comments Accepted by ICML 2026

详情
AI中文摘要

对比表示学习(CRL)支撑着许多现代基础模型。尽管最近取得了理论进展,现有分析仍存在几个关键限制:(i)CRL的统计一致性仍知之甚少;(ii)可用的泛化界随着负样本数量的增加而恶化,这与大负样本集的经验优势相矛盾;(iii)CRL的检索性能受到的理论关注有限。在本文中,我们为CRL发展了一个统一的统计学习理论。对于下游任务,我们使用AUC型总体准则评估检索质量,并证明对比损失与最优排序是\emph{统计一致的}。我们进一步建立了一个\emph{校准型不等式},定量地将过剩对比风险与过剩检索次优性联系起来。对于上游训练,我们研究了监督和自监督对比目标,并分别推导了阶为$O(1/m + 1/\sqrt{n})$和$O(1/\sqrt{m} + 1/\sqrt{n})$的泛化界,其中$m$表示负样本数量,$n$表示锚点数量。这些界不仅解释了大负样本集的经验优势,还揭示了$m$和$n$之间的显式权衡。在大规模视觉-语言模型上的广泛实验证实了我们的理论预测。

英文摘要

Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is \emph{statistically consistent} with optimal ranking. We further establish a \emph{calibration-style inequality} that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision--language models corroborate our theoretical predictions.

2605.01663 2026-05-29 cs.LG cs.RO 版本更新

Towards Efficient and Expressive Offline RL via Flow-Anchored Noise-conditioned Q-Learning

基于流锚定噪声条件Q学习的离线强化学习:高效且表达力强的方法

Sungyoung Lee, Dohyeong Kim, Eshan Balachandar, Zelal Su Mustafaoglu, Keshav Pingali

发表机构 * The University of Texas at Austin, Austin, TX, USA(德克萨斯大学奥斯汀分校) Independent Researcher, Seoul, South Korea(首尔独立研究者)

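AI summary: Proposes FAN, an offline RL algorithm that needs only a single flow-policy iteration and a single Gaussian noise sample, keeping performance high while substantially reducing computational cost.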

Comments ICML 2026

详情
AI中文摘要

我们提出流锚定噪声条件Q学习(FAN),一种高效且高性能的离线强化学习算法。近期工作表明,表达力强的流策略和分布性评论家能提升离线强化学习性能,但计算成本高。具体而言,流策略需要迭代采样才能产生单个动作,分布性评论家需要计算多个样本(如分位数)来估计价值。为解决这些低效问题并保持高性能,我们引入FAN。我们的方法采用行为正则化技术,仅需单次流策略迭代,且分布性评论家仅需单个高斯噪声样本。我们对收敛性和性能边界的理论分析表明,这些简化不仅提高了效率,还带来了更优的任务性能。在机器人操作和运动任务上的实验表明,FAN实现了最先进的性能,同时显著减少了训练和推理时间。我们在https://github.com/brianlsy98/FAN 发布代码。

英文摘要

We propose Flow-Anchored Noise-conditioned Q-Learning (FAN), a highly efficient and high-performing offline reinforcement learning (RL) algorithm. Recent work has shown that expressive flow policies and distributional critics improve offline RL performance, but at a high computational cost. Specifically, flow policies require iterative sampling to produce a single action, and distributional critics require computation over multiple samples (e.g., quantiles) to estimate value. To address these inefficiencies while maintaining high performance, we introduce FAN. Our method employs a behavior regularization technique that uses a single flow policy iteration and requires a single Gaussian noise sample for distributional critics. Our theoretical analysis of convergence and performance bounds demonstrates that these simplifications not only improve efficiency but also lead to superior task performance. Experiments on robotic manipulation and locomotion tasks demonstrate that FAN achieves state-of-the-art performance while significantly reducing both training and inference runtimes. We release our code at https://github.com/brianlsy98/FAN.

2605.00716 2026-05-29 cs.LG cs.SI 版本更新

Aitchison Embeddings for Learning Compositional Graph Representations

用于学习组合图表示的Aitchison嵌入

Nikolaos Nakis, Chrysoula Kosma, Panagiotis Promponas, Michail Chatzianastasis, Giannis Nikolentzos

发表机构 * Human Nature Lab, Yale University, New Haven, USA(耶鲁大学人类本质实验室) Yale Institute for Network Science, Yale University, New Haven, USA(耶鲁大学网络科学研究所) Université Paris Saclay, Université Paris Cité, ENS Paris Saclay, CNRS, SSA, INSERM, Centre Borelli, Gif-sur-Yvette, France(巴黎萨克雷大学、巴黎城市大学、巴黎萨克雷高等师范学院、国家科学研究中心、SSA、国家医学研究院、Borelli研究中心,法国) Department of Informatics and Telecommunications, University of Peloponnese, Peloponnese, Greece(希腊伯罗奔尼撒大学信息与电信系)

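AI summary: Proposes a compositional graph embedding framework based on Aitchison geometry that uses isometric log-ratio (ILR) coordinates to obtain interpretable node representations, performs competitively with strong baselines on node classification and link prediction, and uses subcompositional coherence for dimensionality removal.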

Comments ICML 2026 Camera-ready version

详情
AI中文摘要

表示学习是图机器学习的核心,驱动着链接预测和节点分类等任务。然而,大多数图嵌入难以解释,对学习特征与图结构之间的关系提供的洞察有限。许多网络自然地具有角色混合视图,其中节点最好被描述为潜在原型因素的混合。受此结构启发,我们提出了一个基于Aitchison几何的组合图嵌入框架,Aitchison几何是比较混合物的标准几何。节点表示为单纯形值组合,并通过等距对数比(ILR)坐标嵌入,该坐标在保留Aitchison距离的同时允许在欧几里得空间中进行无约束优化。这产生了内在可解释的嵌入,其几何反映了原型之间的相对权衡,并在成分限制下支持一致行为;我们考虑了固定和可学习的ILR基。在节点分类和链接预测中,我们的方法在提供构建时而非事后可解释性的同时,实现了与强基线相当的性能。最后,子成分一致性实现了原则性的成分限制:移除和重新归一化子集保留了良好定义的几何,我们通过子成分维度移除来探究原型组如何影响表示和预测。

英文摘要

Representation learning is central to graph machine learning, powering tasks such as link prediction and node classification. However, most graph embeddings are hard to interpret, offering limited insight into how learned features relate to graph structure. Many networks naturally admit a role-mixture view, where nodes are best described as mixtures over latent archetypal factors. Motivated by this structure, we propose a compositional graph embedding framework grounded in Aitchison geometry, the canonical geometry for comparing mixtures. Nodes are represented as simplex-valued compositions and embedded via isometric log-ratio (ILR) coordinates, which preserve Aitchison distances while enabling unconstrained optimization in Euclidean space. This yields intrinsically interpretable embeddings whose geometry reflects relative trade-offs among archetypes and supports coherent behavior under component restriction; we consider both fixed and learnable ILR bases. Across node classification and link prediction, our method achieves competitive performance with strong baselines while providing explainability by construction rather than post-hoc. Finally, subcompositional coherence enables principled component restriction: removing and renormalizing subsets preserves a well-defined geometry, which we exploit via subcompositional dimensionality removal to probe how archetype groups influence representations and predictions.
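The claim that ILR coordinates preserve Aitchison distances while living in unconstrained Euclidean space can be checked numerically. The sketch below uses one fixed orthonormal (pivot-style) ILR basis chosen for illustration; the paper also considers learnable bases, and none of the function names here come from it.

```python
import math


def clr(x):
    """Centered log-ratio transform of a positive composition."""
    logs = [math.log(v) for v in x]
    mean = sum(logs) / len(logs)
    return [value - mean for value in logs]


def ilr(x):
    """ILR coordinates in a pivot-style orthonormal basis:
    z_i = sqrt(i / (i + 1)) * (mean(log x_1..x_i) - log x_{i+1})."""
    logs = [math.log(v) for v in x]
    coords = []
    for i in range(1, len(x)):
        head_mean = sum(logs[:i]) / i
        coords.append(math.sqrt(i / (i + 1)) * (head_mean - logs[i]))
    return coords


def euclidean(a, b):
    return math.sqrt(sum((u - v) ** 2 for u, v in zip(a, b)))


def aitchison_distance(x, y):
    """Aitchison distance, computed as the Euclidean distance between clr vectors."""
    return euclidean(clr(x), clr(y))


# Two hypothetical node compositions over three archetypes.
x = [0.2, 0.3, 0.5]
y = [0.6, 0.3, 0.1]

print(aitchison_distance(x, y), euclidean(ilr(x), ilr(y)))
```

A composition with D parts maps to D - 1 free coordinates, and rescaling a composition leaves its coordinates unchanged, so an optimizer can move in ordinary Euclidean space without simplex constraints.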

2605.00553 2026-05-29 cs.LG 版本更新

Stable-GFlowNet: Toward Diverse and Robust LLM Red-Teaming via Contrastive Trajectory Balance

Stable-GFlowNet: 通过对比轨迹平衡实现多样且鲁棒的LLM红队测试

Minchan Kwon, Sunghyun Baek, Minseo Kim, Jaemyung Yu, Dongyoon Han, Junmo Kim

发表机构 * Naver AI Lab(Naver AI实验室)

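AI summary: To address the difficulty of obtaining attacks that are both effective and diverse in LLM red-teaming, proposes Stable-GFlowNet, which stabilizes training by eliminating partition-function estimation through contrastive trajectory balance, preserving the GFN optimal policy while improving attack performance and diversity.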

Comments ICML 2026 Spotlight

详情
AI中文摘要

大语言模型红队测试是一种主动识别LLM漏洞的重要安全过程。在红队测试中寻找有效且多样的攻击至关重要,但实现两者兼具极具挑战性。执行分布匹配的生成流网络(GFNs)是一种有前景的方法,但因其训练不稳定和模式坍塌而臭名昭著。特别是,红队测试中的不稳定奖励加速了模式坍塌。我们提出Stable-GFN(S-GFN),它消除了GFN中的配分函数$Z$估计,并减少了训练不稳定性。S-GFN通过成对比较避免Z估计,并采用针对噪声奖励的鲁棒掩码方法。此外,我们提出流畅性稳定器,以防止模型陷入产生无意义内容的局部最优。S-GFN在保持GFN最优策略的同时提供了更稳定的训练。我们展示了S-GFN在各种设置下压倒性的攻击性能和多样性。

英文摘要

Large Language Model (LLM) red-teaming, which proactively identifies vulnerabilities of LLMs, is an essential process for ensuring safety. Finding effective and diverse attacks in red-teaming is important, but achieving both is challenging. Generative Flow Networks (GFNs), which perform distribution matching, are a promising method, but they are notorious for training instability and mode collapse. In particular, unstable rewards in red-teaming accelerate mode collapse. We propose Stable-GFN (S-GFN), which eliminates partition function $Z$ estimation in GFN and reduces training instability. S-GFN avoids $Z$ estimation through pairwise comparisons and employs a robust masking methodology against noisy rewards. Additionally, we propose a fluency stabilizer to prevent the model from getting stuck in local optima that produce gibberish. S-GFN provides more stable training while maintaining the optimal policy of GFN. We demonstrate the overwhelming attack performance and diversity of S-GFN across various settings.

2604.26571 2026-05-29 cs.LG physics.chem-ph physics.data-an 版本更新

Advancing multi-site emission control: A physics-informed transfer learning framework with mixture of experts for carbon-pollutant synergy

推进多站点排放控制:一种基于物理信息的迁移学习框架结合专家混合模型实现碳-污染物协同控制

Yuxuan Ying, Hanqing Yang, Kaige Wang, Yu Hu, Zhiming Zheng, Yunliang Jiang, Xiaoqing Lin, Xiaodong Li, Jun Chen

发表机构 * College of Mechanical Engineering(机械工程学院) Alibaba Group(阿里巴巴集团) School of Computer Science and Technology(计算机科学与技术学院) Science and Education Integration College of Energy and Carbon Neutralization(能源与碳中和科学教育融合学院) State Key Laboratory of Clean Energy Utilization(清洁能源利用国家重点实验室)

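AI summary: To address the difficulty of transferring data-driven models across sites in emission control for municipal solid waste incineration plants, proposes a carbon-pollutant mixture-of-experts model (CPMoE) that combines physical constraints with operating-regime heterogeneity and uses physics-informed transfer learning for cross-site knowledge transfer; it reports R² values for 13 source plants and 12 target plants, and risk-index reductions of 3.6-6.3% with simultaneous pollutant co-reductions in 94-100% of evaluated samples.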

Comments Supplementary materials will be released after the final version is finalized

详情
AI中文摘要

城市固体废物焚烧(MSWI)将城市废物转化为能源,但同时排放二氧化碳、一氧化碳和多种受管制的空气污染物,这些污染物在单一燃烧系统内紧密耦合。在多样化的设施网络中控制这些排放与优化单个工厂存在根本性不同:在一个站点训练的数据驱动模型捕捉到局部统计模式,这些模式很少能成功迁移到另一个站点,因为它们缺乏泛化所需的物理约束和工况级结构。这里我们证明,当物理守恒定律、运行工况异质性和碳-污染物耦合被联合处理时,可以在异质MSWI工厂中识别共享的排放控制关系。我们开发了一种碳-污染物专家混合(CPMoE)模型,该模型在基于守恒的正则化下,通过工况特定专家网络路由过程观测,并结合物理信息迁移学习将参考模型适应到新设施。在13个工厂中,CPMoE预测六种主要污染物和复合系统级风险指数,源域R2分别为0.668-0.904和0.666-0.970;迁移到12个目标工厂后,这些值仍保持在0.661-0.842和0.610-0.841。专家利用模式表明,适应过程通过结构化的工况重新加权进行,而不是从头重新学习。将迁移模型嵌入离线数字孪生,并针对历史过程记录筛选候选操作调整,在94-100%的评估样本中实现一致的风险指数降低3.6-6.3%,同时实现污染物协同减排。这些发现为异质废物-能源网络中可迁移的、系统级碳-污染物协同控制决策支持提供了一条实用途径。

英文摘要

Municipal solid waste incineration (MSWI) converts urban waste to energy but simultaneously emits carbon dioxide, carbon monoxide and multiple regulated air pollutants whose formation is tightly coupled within a single combustion system. Controlling these emissions across a network of diverse facilities poses a fundamentally different challenge from optimising a single plant: data-driven models trained at one site capture local statistical patterns that rarely survive transfer to another, because they lack the physical constraints and regime-level structure needed to generalise. Here we show that shared emission-control relationships can be identified across heterogeneous MSWI plants when physical conservation laws, operating-regime heterogeneity and carbon-pollutant coupling are treated jointly. We develop a carbon-pollutant mixture-of-experts (CPMoE) model that routes process observations through regime-specific expert networks under conservation-based regularisation, and combine it with physics-informed transfer learning to adapt a reference model to new facilities. Across 13 plants, CPMoE predicts six major pollutants and a composite system-level risk index with source-domain $R^2$ of 0.668-0.904 and 0.666-0.970, respectively; after transfer to 12 target plants these values remain 0.661-0.842 and 0.610-0.841. Expert-utilisation patterns show that adaptation proceeds through structured regime re-weighting rather than re-learning from scratch. Embedding the transferred model in an offline digital twin and screening candidate operating adjustments against historical process records yields consistent risk-index reductions of 3.6-6.3% with simultaneous pollutant co-reductions in 94-100% of evaluated samples. These findings suggest a practical route toward transferable, system-level decision support for carbon-pollutant co-control in heterogeneous waste-to-energy networks.

2604.25098 2026-05-29 cs.AI cs.CL cs.LG 版本更新

Revisiting the Effectiveness of LLM Pruning for Test-Time Scaling

重新审视LLM剪枝对测试时缩放的有效性

Ocean Monjur, Shahriar Kabir Nahin, Anshuman Chhabra

发表机构 * Bellini College of AI, Cybersecurity, and Computing(人工智能、网络安全与计算学院)

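AI summary: Studies how unstructured pruning affects the test-time scaling performance of reasoning LLMs, finds that it outperforms structured pruning and sometimes even the unpruned model, and examines the role of layer-wise sparsity allocation strategies.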

详情
AI中文摘要

大型语言模型(LLM)现在通过测试时计算缩放(TTS)展现出卓越的推理能力,在数学和编程基准测试中表现令人印象深刻。与此同时,模型压缩研究开发了剪枝方法,旨在在不牺牲任务性能的情况下移除冗余/有害参数。这两项研究进展的交叉点构成了我们工作的基础。具体到推理型LLM,先前的工作表明结构化剪枝(移除整组层块的方法)显著降低了TTS推理性能。然而,在这项工作中,我们重新审视了这一假设,并研究了非结构化剪枝(仅小心移除某些冗余/有害权重的方法)是否表现出类似的局限性。令人惊讶的是,我们在两个推理型LLM(s1.1-7B和Qwen3-8B)的四个推理基准上的广泛实验一致表明,与结构化剪枝相比,非结构化剪枝增强了TTS性能,有时甚至能超越未剪枝的全权重LLM。此外,我们还实证研究了不同层间稀疏分配策略的影响,这些策略是实现这些非结构化方法的重要参数选择。这些发现挑战了剪枝总是降低TTS性能的传统观念,实际上表明,谨慎进行的剪枝可以保持TTS的有效性。

英文摘要

Large Language Models (LLMs) now exhibit remarkable reasoning capabilities through test-time compute scaling (TTS), with impressive performance across math and coding benchmarks. In parallel, research in model compression has developed pruning methods that seek to remove redundant/detrimental parameters without sacrificing task performance. The intersection of these two research advancements lays the foundation for our work. Specific to reasoning LLMs, prior work has shown that structured pruning (methods that remove entire sets of layer blocks) significantly degrades TTS reasoning performance. However, in this work, we revisit this assumption and investigate whether unstructured pruning (methods that carefully remove only certain redundant/detrimental weights) exhibits similar limitations. Surprisingly, our extensive experiments across four reasoning benchmarks on two reasoning LLMs, s1.1-7B and Qwen3-8B, consistently show that unstructured pruning augments TTS performance compared to structured pruning, and at times can even outperform the unpruned full-weight LLMs. Furthermore, we also empirically study the impact of different layer-wise sparsity allocation strategies, which are an important parametric choice for instantiating these unstructured methods. These findings challenge the conventional notion that pruning always reduces TTS performance and in fact suggest that carefully undertaken pruning can retain TTS effectiveness.

2604.24824 2026-05-29 cs.LG 版本更新

Negative Ontology of True Target for Machine Learning: Towards Evaluation and Learning under Democratic Supervision

机器学习中真实目标的否定本体论:迈向民主监督下的评估与学习

Yongquan Yang

发表机构 * Institute of Sciences for AI(人工智能科学研究院)

AI总结 本文从哲学角度审视真实目标存在性假设的转变,提出基于否定本体论的民主监督框架,并构建了多不准确真实目标(MIATTs)的评估与学习体系。

详情
AI中文摘要

本文从哲学角度审视关于真实目标(TT)存在与不存在的假设转变如何为基于机器学习的预测建模带来新的视角和见解,并相应地提出了一个民主监督下的评估与学习知识体系。通过系统分析当前主流机器学习范式中TT的存在性假设,我们明确采用否定本体论视角,认为TT在客观世界中并不存在,并基于这一不存在假设定义了机器学习的民主监督。我们进一步提出多不准确真实目标(MIATTs)作为民主监督的实例级实现。基于MIATTs,我们推导了逻辑驱动的MIATTs生成与评估原则、使用MIATTs进行评估的逻辑评估公式,以及使用MIATTs进行学习的不可定义真实目标学习。基于这些组件,我们建立了基于MIATTs的评估与学习(EL-MIATTs)框架,用于基于机器学习的预测建模。一个实际应用展示了所提出的EL-MIATTs框架在支持个人教育和专业发展方面的潜力,与先前在教育与专业发展领域关于民主监督的讨论相一致。

英文摘要

This article philosophically examines how shifts in assumptions regarding the existence and non-existence of the true target (TT) give rise to new perspectives and insights for machine learning (ML)-based predictive modeling and, correspondingly, proposes a knowledge system for evaluation and learning under Democratic Supervision. By systematically analysing the existence assumption of the TT in current mainstream ML paradigms, we explicitly adopt a negative ontology perspective, positing that the TT does not objectively exist in the real world, and, grounded in this non-existence assumption, define Democratic Supervision for ML. We further present Multiple Inaccurate True Targets (MIATTs) as an instance-level realization of Democratic Supervision. Building upon MIATTs, we derive principles for the logic-driven generation and assessment of MIATTs, a logical assessment formulation for evaluation with MIATTs, and undefinable true target learning for learning with MIATTs. Based on these components, we establish the evaluation and learning with MIATTs (EL-MIATTs) framework for ML-based predictive modelling. A real-world application demonstrates the potential of the proposed EL-MIATTs framework in supporting education and professional development for individuals, aligning with prior discussions of Democratic Supervision in the fields of education and professional development.

2604.23256 2026-05-29 cs.NE cs.AI cs.LG cs.SC 版本更新

Architecture-Induced Recoverability Bias in Differentiable Symbolic Regression

可微符号回归中的架构诱导的可恢复性偏差

Chakshu Gupta, Theodore J. LaGrow

发表机构 * College of Computing, Georgia Institute of Technology(佐治亚理工学院计算机学院) College of Lifetime Learning, Georgia Institute of Technology(佐治亚理工学院终身学习学院)

AI总结 本文研究可微符号回归中,变量路由架构对表达式可恢复性的影响,发现不同架构导致恢复率从0/64到64/64变化,并提出基于验证的架构选择方法将恢复率从34.4%提升至50.1%。

Comments 6 pages, 4 figures, 3 tables; submitted to IEEE MLSP 2026

详情
AI中文摘要

符号回归旨在从数值数据中恢复闭式表达式,但在可微符号回归中,恢复的表达式不仅取决于语法,还取决于训练期间变量路由的固定架构。这与闭式模型和可解释非线性结构有用的信号处理设置相关。这种特定于架构的影响很少被直接隔离,因为现有比较通常同时改变架构、算子族、语法或搜索过程。本文比较了三种深度为3的架构,涵盖24种算子-形状-叶子组合,在尽可能固定算子族、语法和训练协议的同时改变变量路由架构。在架构加原生训练协议的比较下,同一目标的恢复率从0/64变为64/64。一个目标上最好的架构在另一个目标上是最差的,并且具有两个等深子树的树结构在所有测试配置中均失败(0/3,776)。作为概念验证的缓解措施,训练一个小型架构集,并选择保留集上RMSE最低的硬化表达式。在联合运行的子集上,相比唯一出现在全部三种配置中的架构的34.4%,这将恢复率提高到50.1%。在肖克利二极管目标上,验证选择器恢复了该基线架构遗漏的情况,而该基线架构本身仅恢复0/32个种子。由于联合运行子集仅包含三种配置,选择器结果证明基于验证的架构选择是有前景的,而非完整的基准测试。这些结果支持将架构视为可测量的设计变量,应予以报告、压力测试,并使用保留验证集进行选择,而非先验固定。

英文摘要

Symbolic regression aims to recover closed-form expressions from numerical data, but in differentiable symbolic regression the recovered expression depends not only on the grammar but also on the fixed architecture through which variables are routed during training. This is relevant to signal-processing settings in which closed-form models and interpretable nonlinear structure are useful. This architecture-specific effect has rarely been isolated directly, because existing comparisons often vary architecture together with operator family, grammar, or search procedure. Three depth-3 architectures are compared across twenty-four operator-shape-leaf combinations, holding operator family, grammar, and training protocol fixed as far as possible while varying the variable-routing architecture. Recovery changes from 0/64 to 64/64 trials on the same target under an architecture-plus-native-training-protocol comparison. The best architecture on one target is the worst on another, and trees with two equal-depth subtrees fail in every configuration tested (0/3,776). As a proof-of-concept mitigation, a small architecture set is trained and the hardened expression with the lowest held-out RMSE is selected. On the jointly-run subset, this improves recovery from 34.4% for the only architecture present in all three configurations to 50.1%. On a Shockley diode target, the validation selector recovers cases missed by that baseline architecture, which by itself recovers 0/32 seeds. Since the jointly-run subset contains only three configurations, the selector result is evidence that validation-based architecture selection is promising, not a complete benchmark. These results support treating architecture as a measurable design variable that should be reported, stress-tested, and selected using held-out validation rather than fixed a priori.
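
The validation-based selection step described in the abstract can be sketched as follows. This is only an illustration of "pick the hardened expression with the lowest held-out RMSE"; the candidate expressions, the names `arch_a`/`arch_b`/`arch_c`, and the target function are hypothetical and not taken from the paper.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root-mean-square error between two arrays."""
    diff = np.asarray(y_true) - np.asarray(y_pred)
    return float(np.sqrt(np.mean(diff ** 2)))

def select_by_validation(candidates, x_val, y_val):
    """Return the candidate name with the lowest held-out RMSE, plus all scores."""
    scores = {name: rmse(y_val, f(x_val)) for name, f in candidates.items()}
    best = min(scores, key=scores.get)
    return best, scores

# Held-out data from a hypothetical target expression.
x_val = np.linspace(0.1, 2.0, 50)
y_val = x_val ** 2 + np.sin(x_val)

# Hypothetical hardened expressions, one per trained architecture.
candidates = {
    "arch_a": lambda x: x ** 2 + np.sin(x),
    "arch_b": lambda x: x ** 2,
    "arch_c": lambda x: np.exp(x) - 1.0,
}
best, scores = select_by_validation(candidates, x_val, y_val)
```

The selector needs no knowledge of which architecture suits which target; it only compares held-out error after training.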

2604.20443 2026-05-29 cs.CL cs.AI cs.LG 版本更新

DialToM: A Theory of Mind Benchmark for Forecasting State-Driven Dialogue Trajectories

DialToM:用于预测状态驱动对话轨迹的心智理论基准

Neemesh Yadav, Palakorn Achananuparp, Jing Jiang, Ee-Peng Lim

发表机构 * Singapore Management University(新加坡管理大学) Australian National University(澳大利亚国立大学)

AI总结 提出DialToM基准,通过多选评估框架从自然对话中构建,揭示LLMs在推断心理状态(字面ToM)与利用其进行社会预测(功能ToM)之间的系统性推理不对称性,并证明领域专家与AI之间存在显著能力差距。

Comments Submitted to EMNLP 2026

详情
AI中文摘要

我们介绍了DialToM,一个基于自然人类对话构建的带注释的心智理论(ToM)基准,采用多选评估框架。与近期在合成环境中揭示显式心理状态推断与应用ToM之间存在差距的工作同期,我们建立了一个更严格的“状态驱动诊断探针”,要求模型仅从孤立的心理状态特征(无对话上下文)预测状态一致的对话轨迹。我们的评估揭示了系统性的推理不对称性——LLMs在推断心理状态(字面ToM)方面表现出色,但在利用它们进行社会预测(功能ToM)方面存在困难。关键的是,领域专家在此任务上达到100%准确率,证明了其有效性,并揭示了人类与AI之间的显著能力差距。此外,教师-学生推理注入探针显示,Gemini 3 Pro(建立了领先基线)具备强大的功能ToM能力,可用于无上下文预测,且该能力可迁移至较弱模型。DialToM、其评估代码和数据集公开于https://github.com/Stealth-py/DialToM。

英文摘要

We introduce DialToM, an annotated Theory of Mind (ToM) benchmark built from naturalistic human-human dialogues using a multiple-choice evaluation framework. Concurrent with recent work showing a gap between explicit mental-state inference and applied ToM in synthetic settings [gu2024simpletom], we establish a stricter State-Driven Diagnostic Probe in which models must forecast state-consistent dialogue trajectories solely from isolated mental-state profiles without dialogue context. Our evaluation reveals a systematic reasoning asymmetry: LLMs excel at inferring mental states (Literal ToM) but struggle to leverage them for social forecasting (Functional ToM). Crucially, a domain expert achieves 100% accuracy on this task, proving its validity and establishing a stark human-AI capability gap. Further, a teacher-student reasoning injection probe shows that Gemini 3 Pro, which establishes the leading baseline, possesses robust Functional ToM capabilities for context-free forecasting that are transferable to weaker models. DialToM, its evaluation code, and dataset are publicly available at https://github.com/Stealth-py/DialToM.

2604.18518 2026-05-29 cs.CV cs.LG 版本更新

UDM-GRPO: Stable and Efficient Group Relative Policy Optimization for Uniform Discrete Diffusion Models

UDM-GRPO:面向均匀离散扩散模型的稳定高效组相对策略优化

Jiaqi Wang, Haoge Deng, Ting Pan, Yang Liu, Chengyuan Wang, Fan Zhang, Yonggang Qi, Xinlong Wang

发表机构 * Beijing University of Posts and Telecommunications(北京邮电大学) Beijing Academy of Artificial Intelligence(北京人工智能研究院)

AI总结 针对均匀离散扩散模型(UDM)与强化学习(RL)集成时训练不稳定、性能提升有限的问题,提出UDM-GRPO框架,通过将最终干净样本作为动作、利用扩散前向过程重建轨迹以及引入简化步数和无CFG策略,显著提升文本到图像生成任务的性能。

Comments UDM-GRPO is accepted by ICML 2026 (Spotlight). Code is available at https://github.com/Yovecent/UDM-GRPO

详情
AI中文摘要

均匀离散扩散模型(UDM)最近成为离散生成建模的一种有前景的范式;然而,其与强化学习的集成仍然很大程度上未被探索。我们观察到,将GRPO直接应用于UDM会导致训练不稳定和边际性能提升。为了解决这个问题,我们提出了UDM-GRPO,这是第一个将UDM与RL集成的框架。我们的方法基于两个关键见解:(i)将最终干净样本作为动作提供更准确和稳定的优化信号;(ii)通过扩散前向过程重建轨迹更好地将概率路径与预训练分布对齐。此外,我们引入了两种策略,即简化步数(Reduced-Step)和无CFG(CFG-Free),以进一步提高训练效率。UDM-GRPO在多个T2I任务上显著提升了基础模型性能。值得注意的是,GenEval准确率从69%提高到96%,PickScore从20.46增加到23.81,在连续和离散设置中均达到了最先进的性能。在OCR基准测试中,准确率从8%提高到57%,进一步验证了我们方法的泛化能力。代码可在https://github.com/Yovecent/UDM-GRPO获取。

英文摘要

Uniform Discrete Diffusion Model (UDM) has recently emerged as a promising paradigm for discrete generative modeling; however, its integration with reinforcement learning remains largely unexplored. We observe that naively applying GRPO to UDM leads to training instability and marginal performance gains. To address this, we propose UDM-GRPO, the first framework to integrate UDM with RL. Our method is guided by two key insights: (i) treating the final clean sample as the action provides more accurate and stable optimization signals; and (ii) reconstructing trajectories via the diffusion forward process better aligns probability paths with the pretraining distribution. Additionally, we introduce two strategies, Reduced-Step and CFG-Free, to further improve training efficiency. UDM-GRPO significantly improves base model performance across multiple T2I tasks. Notably, GenEval accuracy improves from 69% to 96% and PickScore increases from 20.46 to 23.81, achieving state-of-the-art performance in both continuous and discrete settings. On the OCR benchmark, accuracy rises from 8% to 57%, further validating the generalization ability of our method. Code is available at https://github.com/Yovecent/UDM-GRPO.
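
For readers unfamiliar with GRPO, the "group relative" part can be sketched in a few lines: rewards of samples generated from the same prompt are standardized within the group. This shows only the generic advantage computation, not the UDM-specific action definition or trajectory reconstruction from the paper; the reward values are made up.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Standardize rewards within one group of samples drawn for the same prompt."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)  # eps guards against zero variance

# Hypothetical rewards for four images generated from one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
adv = group_relative_advantages(rewards)
```

Samples above the group mean receive a positive advantage and are reinforced; a group with identical rewards yields zero advantages and therefore no gradient signal.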

2604.13410 2026-05-29 stat.ME cs.LG stat.ML 版本更新

Estimating Continuous Treatment Effects with Two-Stage Kernel Ridge Regression

使用两阶段核岭回归估计连续治疗效果

Seok-Jin Kim, Kaizheng Wang

发表机构 * Department of IEOR, Columbia University(哥伦比亚大学工业工程与运筹学系) Department of IEOR and Data Science Institute, Columbia University(哥伦比亚大学工业工程与运筹学系及数据科学研究所)

AI总结 针对连续治疗的效果函数估计问题,提出两阶段核岭回归方法,通过第一阶段建模响应与治疗和协变量的关系,第二阶段构造伪结果校正分布偏移,无需估计条件治疗密度即可达到最优学习界,并实现数据驱动的模型选择。

详情
AI中文摘要

我们研究连续治疗的效果函数估计问题,该函数将每个治疗值映射到群体平均结果。该设置中的一个核心挑战是混杂:治疗分配通常依赖于协变量,产生选择偏差,使得直接对响应进行回归不可靠。为了解决这个问题,我们提出了一种两阶段核岭回归方法。在第一阶段,我们学习一个模型,将响应表示为治疗和协变量的函数;在第二阶段,我们使用该模型构造伪结果以校正分布偏移,然后拟合第二个模型来估计治疗效果。尽管响应随治疗和协变量变化,但通过对协变量平均得到的诱导效果函数通常更简单,我们的估计器适应这种结构。我们在不估计条件治疗密度的情况下实现了最优学习界,从而绕过了现有方法中的一个主要瓶颈。此外,我们引入了一种完全数据驱动的模型选择程序,该程序对未知的重叠程度和底层核的谱衰减具有可证明的自适应性。

英文摘要

We study the problem of estimating the effect function for a continuous treatment, which maps each treatment value to a population-averaged outcome. A central challenge in this setting is confounding: treatment assignment often depends on covariates, creating selection bias that makes direct regression of the response on treatment unreliable. To address this issue, we propose a two-stage kernel ridge regression method. In the first stage, we learn a model for the response as a function of both treatment and covariates; in the second stage, we use this model to construct pseudo-outcomes that correct for distribution shift, and then fit a second model to estimate the treatment effect. Although the response varies with both treatment and covariates, the induced effect function obtained by averaging over covariates is typically much simpler, and our estimator adapts to this structure. Our optimal learning bounds are achieved without estimating the conditional treatment density, thereby bypassing a major bottleneck in existing methods. Furthermore, we introduce a fully data-driven model selection procedure that achieves provable adaptivity to both the unknown degree of overlap and the spectral decay of the underlying kernel.
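
The two-stage structure can be illustrated with a small NumPy sketch. The abstract does not specify how the pseudo-outcomes are built, so this example assumes the simplest plug-in choice (averaging the stage-one fit over the empirical covariate sample); the kernel, regularization values, and synthetic data are likewise assumptions, and the authors' estimator may differ.

```python
import numpy as np

def rbf(a, b, gamma=1.0):
    """Gaussian kernel matrix between the rows of a and the rows of b."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def krr_fit(X, y, lam, gamma=1.0):
    """Kernel ridge regression; returns a prediction function."""
    n = len(X)
    alpha = np.linalg.solve(rbf(X, X, gamma) + lam * n * np.eye(n), y)
    return lambda Z: rbf(Z, X, gamma) @ alpha

# Synthetic confounded data: treatment t depends on covariate x.
rng = np.random.default_rng(0)
n = 80
x = rng.normal(size=n)
t = 0.5 * x + rng.normal(scale=0.5, size=n)
y = np.sin(t) + x + rng.normal(scale=0.1, size=n)

# Stage 1: model the response as a function of (treatment, covariate).
stage1 = krr_fit(np.column_stack([t, x]), y, lam=1e-3)

# Pseudo-outcomes (assumed plug-in form): average stage-1 predictions
# over all observed covariates for each treatment value.
grid = np.array([[ti, xj] for ti in t for xj in x])
pseudo = stage1(grid).reshape(n, n).mean(axis=1)

# Stage 2: regress pseudo-outcomes on treatment alone.
stage2 = krr_fit(t[:, None], pseudo, lam=1e-3)
effect_hat = stage2(np.linspace(-1.0, 1.0, 5)[:, None])
```

Note that no conditional treatment density is estimated anywhere in this sketch, which is the property the abstract highlights.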

2604.13147 2026-05-29 stat.ML cs.LG math.PR 版本更新

Adaptive Learning via Off-Model Training and Importance Sampling for Fully Non-Markovian Optimal Stochastic Control. Complete version

基于离模型训练和重要性采样的自适应学习用于完全非马尔可夫最优随机控制(完整版)

Dorival Leão, Alberto Ohashi, Simone Scotti, Adolfo M. D da Silva

发表机构 * Departamento de Matemática, Universidade de Brasília(巴西利亚大学数学系) Università di Pisa, DEM(比萨大学DEM) Université Paris Cité, LPSM(巴黎西岱大学LPSM)

AI总结 针对完全非马尔可夫且依赖未知模型参数的连续时间随机控制问题,提出一种基于离散骨架和重要性采样的蒙特卡洛学习方法,实现离模型训练架构和自适应参数更新,并给出非渐近误差界。

Comments Typos are fixed. Numerical experiment is revised

详情
AI中文摘要

本文研究连续时间随机控制问题,其受控状态是完全非马尔可夫的,且依赖于未知模型参数。这类问题自然出现在路径依赖随机微分方程、粗糙波动率对冲以及分数布朗运动驱动的系统中。基于先前工作中发展的离散骨架方法,我们提出了一种用于相关嵌入后向动态规划方程的蒙特卡洛学习方法。我们的主要贡献有两方面。首先,针对几类具有代表性的非马尔可夫受控系统,我们构造了显式的支配训练律和Radon-Nikodym权重。这产生了一种离模型训练架构,其中在参考律下生成固定的合成数据集,而通过重要性采样恢复与目标模型相关的动态规划算子。其次,我们利用这种结构设计了参数模型不确定性下的自适应更新机制,使得可以通过重新加权相同的训练样本而非重新生成新轨迹来执行重复校准。对于固定参数,我们建立了通过深度神经网络逼近嵌入动态规划方程的非渐近误差界。对于自适应学习,我们推导了将蒙特卡洛逼近误差与模型风险误差分离的定量估计。数值实验在结构化线性二次型例子中展示了离模型训练机制和自适应重要性采样更新。

英文摘要

This paper studies continuous-time stochastic control problems whose controlled states are fully non-Markovian and depend on unknown model parameters. Such problems arise naturally in path-dependent stochastic differential equations, rough-volatility hedging, and systems driven by fractional Brownian motion. Building on the discrete skeleton approach developed in earlier work, we propose a Monte Carlo learning methodology for the associated embedded backward dynamic programming equation. Our main contribution is twofold. First, we construct explicit dominating training laws and Radon-Nikodym weights for several representative classes of non-Markovian controlled systems. This yields an off-model training architecture in which a fixed synthetic dataset is generated under a reference law, while the dynamic programming operators associated with a target model are recovered by importance sampling. Second, we use this structure to design an adaptive update mechanism under parametric model uncertainty, so that repeated recalibration can be performed by reweighting the same training sample rather than regenerating new trajectories. For fixed parameters, we establish non-asymptotic error bounds for the approximation of the embedded dynamic programming equation via deep neural networks. For adaptive learning, we derive quantitative estimates that separate Monte Carlo approximation error from model-risk error. Numerical experiments illustrate both the off-model training mechanism and the adaptive importance-sampling update in structured linear-quadratic examples.
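
The idea of reusing one fixed sample for several target models can be shown in a deliberately simple setting. The sketch below assumes a scalar Gaussian reference law N(0, 1) and target laws N(theta, 1), for which the Radon-Nikodym weight has a closed form; the paper's non-Markovian systems and dominating laws are far more general, and nothing here is taken from its construction.

```python
import numpy as np

def gaussian_likelihood_ratio(x, theta):
    """Density ratio dN(theta, 1) / dN(0, 1) evaluated at x."""
    return np.exp(theta * x - 0.5 * theta ** 2)

# Fixed synthetic dataset, generated once under the reference law N(0, 1).
rng = np.random.default_rng(0)
sample = rng.normal(size=200_000)

def reweighted_mean(theta):
    """Estimate E[X] under N(theta, 1) by reweighting the reference sample."""
    w = gaussian_likelihood_ratio(sample, theta)
    return float(np.mean(w * sample))

# Recalibrating to a new parameter only changes the weights, not the data.
estimates = {theta: reweighted_mean(theta) for theta in (0.0, 0.25, 0.5)}
```

Each entry of `estimates` should be close to its key, since the mean of N(theta, 1) is theta, and no new trajectories are drawn when theta changes.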

2604.05446 2026-05-29 stat.ML cs.LG 版本更新

MEC: Machine-Learning-Assisted Generalized Entropy Calibration for Semi-Supervised Mean Estimation

MEC:基于机器学习的广义熵校准用于半监督均值估计

Se Yoon Lee, Jae Kwang Kim

发表机构 * Texas A&M University(德克萨斯A&M大学) Iowa State University(爱荷华州立大学)

AI总结 提出MEC方法,通过交叉拟合校准加权改进预测驱动推断,在半监督均值估计中实现半参数效率界,并提升置信区间覆盖率和精度。

详情
AI中文摘要

获取高质量标签成本高昂,而无标签协变量通常丰富,这推动了具有可靠不确定性量化的半监督推断方法的发展。预测驱动推断(PPI)利用在少量标记样本上训练的机器学习预测器来提高效率,但在模型误指定下可能损失效率,并因标签重用而导致覆盖失真。我们引入了基于机器学习的广义熵校准(MEC),这是PPI的一种交叉拟合、校准加权变体。MEC通过基于Bregman投影的原则性校准框架对标记样本重新加权,以更好地与目标群体对齐,从而提高效率。这使MEC对预测器的仿射变换具有鲁棒性,并通过用更弱的投影误差条件替代原始预测误差条件,放宽了有效性的要求。因此,MEC在比现有PPI变体更弱的假设下达到了半参数效率界。在模拟和实际数据应用中,MEC实现了接近名义覆盖率的置信区间,并且比CF-PPI和普通PPI具有更紧的置信区间。

英文摘要

Obtaining high-quality labels is costly, whereas unlabeled covariates are often abundant, motivating semi-supervised inference methods with reliable uncertainty quantification. Prediction-powered inference (PPI) leverages a machine-learning predictor trained on a small labeled sample to improve efficiency, but it can lose efficiency under model misspecification and suffer from coverage distortions due to label reuse. We introduce Machine-Learning-Assisted Generalized Entropy Calibration (MEC), a cross-fitted, calibration-weighted variant of PPI. MEC improves efficiency by reweighting labeled samples to better align with the target population, using a principled calibration framework based on Bregman projections. This yields robustness to affine transformations of the predictor and relaxes requirements for validity by replacing conditions on raw prediction error with weaker projection-error conditions. As a result, MEC attains the semiparametric efficiency bound under weaker assumptions than existing PPI variants. Across simulations and a real-data application, MEC achieves near-nominal coverage and tighter confidence intervals than CF-PPI and vanilla PPI.
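
As background for the comparison with vanilla PPI, here is a minimal sketch of the standard prediction-powered mean estimator: the average prediction on unlabeled data plus a bias correction ("rectifier") measured on labeled data. This is the baseline, not MEC itself; the calibration weighting and cross-fitting are not shown, and the numbers are invented.

```python
import numpy as np

def ppi_mean(y_labeled, pred_labeled, pred_unlabeled):
    """Vanilla PPI mean: unlabeled prediction average plus labeled-sample rectifier."""
    rectifier = np.mean(np.asarray(y_labeled) - np.asarray(pred_labeled))
    return float(np.mean(pred_unlabeled) + rectifier)

# Hypothetical data: the predictor overshoots every label by 0.5.
y_lab = np.array([1.0, 2.0, 3.0])
f_lab = np.array([1.5, 2.5, 3.5])
f_unlab = np.array([2.5, 3.5, 4.5, 5.5])

est = ppi_mean(y_lab, f_lab, f_unlab)  # 4.0 + (-0.5) = 3.5
```

The rectifier removes the predictor's constant bias here; MEC, as described in the abstract, replaces the equal weighting of labeled samples with calibration weights.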

2603.23971 2026-05-29 cs.CL cs.AI cs.GT cs.LG cs.MA 版本更新

The Price Reversal Phenomenon: When Cheaper Reasoning Models Cost More

价格反转现象:当更便宜的推理模型成本更高时

Lingjiao Chen, Chi Zhang, Yeye He, Ion Stoica, Matei Zaharia, James Zou

发表机构 * Stanford University(斯坦福大学) UC Berkeley(加州大学伯克利分校) CMU(卡内基梅隆大学) Microsoft Research(微软研究院)

AI总结 本文首次系统研究推理模型标价与实际成本的偏差,发现32%的模型对比较中存在价格反转现象,并基于Shapley值建立成本归因框架,揭示思考令牌消耗和交互轮次的高度异质性是主要原因。

详情
AI中文摘要

开发者和消费者越来越根据列出的API价格选择推理模型(RMs)。然而,这些价格在多大程度上准确反映了实际推理成本?我们首次系统研究这一问题,评估了8个前沿RM在12个不同任务上的表现,涵盖竞赛数学、科学问答、代码生成和多领域智能体。我们发现了定价反转现象:在32%的模型对比较中,标价较低的模型实际上产生了更高的总成本,反转幅度高达28倍。例如,Gemini 3 Flash的标价比GPT-5.4便宜80%,但其在所有任务上的实际成本却高出38%。我们基于Shapley值构建了一个正式的成本归因框架,并利用它追溯了思考令牌消耗和交互轮次数量巨大异质性的主要贡献因素:对于同一查询,一个模型可能比另一个模型多使用900%的思考令牌,或多出10倍的环境交互轮次。我们进一步表明,每次查询的成本预测本质上是困难的:同一查询的重复运行产生的思考令牌变化高达9.7倍,为任何预测器建立了不可约的噪声底限。因此,我们提出成本分布预测作为一个开放挑战。我们的发现表明,列出的API定价是实际成本的不可靠代理,呼吁进行成本感知的模型选择和透明的每次请求成本监控。

英文摘要

Developers and consumers increasingly choose reasoning models (RMs) based on their listed API prices. However, how accurately do these prices reflect actual inference costs? We conduct the first systematic study of this question, evaluating 8 frontier RMs across 12 diverse tasks covering competition math, science QA, code generation, and multi-domain agents. We uncover the pricing reversal phenomenon: in 32% of model-pair comparisons, the model with a lower listed price actually incurs a higher total cost, with reversal magnitude reaching up to 28x. For example, Gemini 3 Flash's listed price is 80% cheaper than GPT-5.4's, yet its actual cost across all tasks is 38% higher. We build a formal cost attribution framework based on Shapley value, and leverage it to trace the dominating contributors to vast heterogeneity in thinking token consumption and number of interaction turns: on the same query, one model may use 900% more thinking tokens than another, or 10x more turns of environment interactions. We further show that per-query cost prediction is fundamentally difficult: repeated runs of the same query yield thinking token variation up to 9.7x, establishing an irreducible noise floor for any predictor. Thus, we propose cost distribution prediction as an open challenge. Our findings demonstrate that listed API pricing is an unreliable proxy for actual cost, calling for cost-aware model selection and transparent per-request cost monitoring.
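
A toy calculation shows how a reversal can arise. All prices and token counts below are invented, and the sketch assumes thinking tokens are billed at the same per-token rate as answer tokens; it is not a reproduction of the paper's measurements.

```python
from dataclasses import dataclass

@dataclass
class Usage:
    price_per_mtok: float  # listed price in USD per million output tokens (hypothetical)
    thinking_tokens: int   # tokens spent on hidden reasoning
    answer_tokens: int     # tokens in the visible answer

def query_cost(u: Usage) -> float:
    """Actual cost of one query: listed price times all billed tokens."""
    return u.price_per_mtok * (u.thinking_tokens + u.answer_tokens) / 1_000_000

# The model with the lower listed price thinks ten times longer.
cheap_listed = Usage(price_per_mtok=2.0, thinking_tokens=40_000, answer_tokens=1_000)
pricey_listed = Usage(price_per_mtok=10.0, thinking_tokens=4_000, answer_tokens=1_000)

reversal = query_cost(cheap_listed) > query_cost(pricey_listed)
```

Here the nominally cheaper model costs 0.082 USD per query against 0.05 USD, so a 5x price advantage is erased by token consumption.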

2603.22348 2026-05-29 cs.LG cs.GT 版本更新

Learning Safely Without Knowing the World: COMPASS-Hedge

在不了解世界的情况下安全学习:COMPASS-Hedge

Ting Hu, Luanda Cai, Emmanouil-Vasileios Vlatakis-Gkaragkounis

发表机构 * Department of Economics, University of Wisconsin–Madison(威斯康星大学麦迪逊分校经济学系) Department of Finance, University of Wisconsin–Madison(威斯康星大学麦迪逊分校金融系) Department of Computer Sciences, University of Wisconsin–Madison(威斯康星大学麦迪逊分校计算机科学系)

AI总结 提出COMPASS-Hedge算法,通过自适应伪遗憾缩放和基于阶段的激进策略,首次在全信息在线学习中同时实现对抗环境下的极小化最优遗憾、随机环境下的实例最优遗憾以及相对于基准策略的常数遗憾,且无需先验知识。

详情
AI中文摘要

在线学习算法常常面临一个基本的三难困境:在对抗性和随机性设置之间平衡遗憾保证,并提供相对于固定比较器的基线安全性。虽然现有方法在其中一个或两个领域表现出色,但它们通常无法在不牺牲最优速率或需要问题相关参数的神谕访问的情况下统一所有三个目标。在这项工作中,我们通过引入COMPASS-Hedge来弥合这一差距。据我们所知,我们的算法是第一个全信息任意时间方法,同时实现(达到对数因子):i)对抗环境中的极小化最优遗憾;ii)随机环境中实例最优、间隙相关的遗憾;以及iii)相对于指定基准策略的$\tilde{\mathcal{O}}(1)$遗憾。关键是,COMPASS-Hedge是无参数的,不需要事先了解环境的性质或随机次优间隙的大小。我们的方法依赖于自适应伪遗憾缩放和基于阶段的激进策略的新颖结合,以及比较器感知的混合策略。据我们所知,这提供了全信息设置中的第一个“三世界最优”保证,确立了基线安全性不必以最坏情况鲁棒性或随机效率为代价。

英文摘要

Online learning algorithms often face a fundamental trilemma: balancing regret guarantees between adversarial and stochastic settings and providing baseline safety against a fixed comparator. While existing methods excel in one or two of these regimes, they typically fail to unify all three without sacrificing optimal rates or requiring oracle access to problem-dependent parameters. In this work, we bridge this gap by introducing COMPASS-Hedge. To the best of our knowledge, our algorithm is the first full-information anytime method to simultaneously achieve, up to logarithmic factors: i) minimax-optimal regret in adversarial environments; ii) instance-optimal, gap-dependent regret in stochastic environments; and iii) $\tilde{\mathcal{O}}(1)$ regret relative to a designated baseline policy. Crucially, COMPASS-Hedge is parameter-free and requires no prior knowledge of the environment's nature or the magnitude of the stochastic suboptimality gaps. Our approach hinges on a novel integration of adaptive pseudo-regret scaling and phase-based aggression, coupled with a comparator-aware mixing strategy. To the best of our knowledge, this provides the first "best-of-three-worlds" guarantee in the full-information setting, establishing that baseline safety does not have to come at the cost of worst-case robustness or stochastic efficiency.
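
For context, the classical Hedge (exponential weights) algorithm that the method's name refers to can be sketched as below. This is plain Hedge with a fixed, horizon-tuned learning rate, not COMPASS-Hedge: the adaptive scaling, phases, and comparator-aware mixing from the abstract are not implemented, and the loss sequence is a made-up example.

```python
import numpy as np

def hedge_weights(cum_loss, eta):
    """Exponential weights over experts, computed stably from cumulative losses."""
    logits = -eta * cum_loss
    w = np.exp(logits - logits.max())
    return w / w.sum()

def run_hedge(losses, eta):
    """Play Hedge on a (T, K) loss matrix with entries in [0, 1]."""
    T, K = losses.shape
    cum_loss = np.zeros(K)
    learner_loss = 0.0
    for t in range(T):
        w = hedge_weights(cum_loss, eta)      # distribution played at round t
        learner_loss += float(w @ losses[t])  # expected loss under that distribution
        cum_loss += losses[t]                 # full-information feedback
    regret = learner_loss - cum_loss.min()    # regret against the best fixed expert
    return regret, hedge_weights(cum_loss, eta)

# Toy environment: expert 0 never incurs loss, the others always do.
T, K = 100, 3
losses = np.ones((T, K))
losses[:, 0] = 0.0
eta = np.sqrt(8 * np.log(K) / T)  # standard tuning when the horizon T is known
regret, final_w = run_hedge(losses, eta)
```

Unlike this sketch, a parameter-free anytime method cannot rely on knowing `T` in advance, which is one of the constraints the abstract emphasizes.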

2603.18859 2026-05-29 cs.AI cs.CL cs.LG 版本更新

RewardFlow: Topology-Aware Reward Propagation on State Graphs for Agentic RL with Large Language Models

RewardFlow: 面向大语言模型智能体强化学习的拓扑感知状态图奖励传播

Xiao Feng, Bo Han, Zhanke Zhou, Jiaqi Fan, Jiangchao Yao, Ka Ho Li, Dahai Yu, Michael Kwok-Po Ng

发表机构 * TMLR Group(TMLR小组) Hong Kong Baptist University(香港浸会大学) TCL Corporate Research (HK) Co Ltd(TCL企业研究(香港)有限公司) Cooperative Medianet Innovation Center, Shanghai Jiao Tong University(上海交通大学未来媒体网络协同创新中心) Department of Mathematics, Hong Kong Baptist University(香港浸会大学数学系)

AI总结 提出RewardFlow方法,通过构建状态图进行拓扑感知的奖励传播,为智能体推理提供无标注的密集奖励,显著提升强化学习性能。

详情
AI中文摘要

强化学习在增强大语言模型智能体推理方面展现出潜力,但稀疏的终端奖励阻碍了细粒度优化。过程奖励建模提供了一种替代方案,但带来了高计算成本、奖励黑客风险和标注瓶颈。我们引入RewardFlow,一种用于估计智能体推理中状态级奖励的轻量级方法。通过构建捕获轨迹内在拓扑结构的状态图,RewardFlow执行拓扑感知的传播以估计每个状态对成功的贡献,从而产生有原则的、无标注的密集奖励。用于强化学习优化时,RewardFlow在四个智能体基准测试中显著优于先前基线:在基于文本的任务上平均成功率提高6.2%,在视觉推理上跨三个模型尺度比最强基线提高29.7%,在DeepResearch上准确率提高10%,同时具有卓越的鲁棒性和训练效率。RewardFlow的实现已在https://github.com/tmlr-group/RewardFlow公开。

英文摘要

Reinforcement learning (RL) shows promise for enhancing LLM agentic reasoning, yet sparse terminal rewards hinder fine-grained optimization. Process reward modeling offers an alternative but incurs high computational costs, reward hacking risks, and annotation bottlenecks. We introduce RewardFlow, a lightweight method for estimating state-level rewards in agentic reasoning. By constructing state graphs that capture the intrinsic topological structure of trajectories, RewardFlow performs topology-aware propagation to estimate each state's contribution to success, yielding principled, annotation-free dense rewards. Used for RL optimization, RewardFlow substantially outperforms prior baselines across four agentic benchmarks: +6.2% average success rate on text-based tasks, +29.7% on visual reasoning over the strongest baseline across three model scales, and +10% accuracy on DeepResearch, with superior robustness and training efficiency. The implementation of RewardFlow is publicly available at https://github.com/tmlr-group/RewardFlow.
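
One way to picture reward propagation on a state graph is the sketch below. The abstract does not give RewardFlow's propagation rule, so this example assumes a simple stand-in: merge trajectories into a graph, then assign each state a reward that decays with its shortest graph distance to a successful terminal state. State names, the discount `gamma`, and the rule itself are all hypothetical.

```python
from collections import defaultdict, deque

def build_reverse_graph(trajectories):
    """Map each state to the set of states observed immediately before it."""
    rev = defaultdict(set)
    for traj in trajectories:
        for s, s_next in zip(traj, traj[1:]):
            rev[s_next].add(s)
    return rev

def propagate(trajectories, success_states, gamma=0.9):
    """Breadth-first backward pass: reward = gamma ** (steps to nearest success)."""
    rev = build_reverse_graph(trajectories)
    value = {s: 1.0 for s in success_states}
    queue = deque(success_states)
    while queue:
        s = queue.popleft()
        for prev in rev[s]:
            if prev not in value:  # first visit is along a shortest path
                value[prev] = gamma * value[s]
                queue.append(prev)
    return value

# Three rollouts sharing states; only two of them reach "goal".
trajectories = [
    ["start", "a", "b", "goal"],
    ["start", "a", "c", "dead_end"],
    ["start", "d", "goal"],
]
state_reward = propagate(trajectories, ["goal"])
```

States that never lead to success ("c", "dead_end") receive no entry, while shared states such as "a" get credit from successful rollouts even when they also appear in failed ones.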

2603.16673 2026-05-29 cs.RO cs.AI cs.LG 版本更新

When Should a Robot Think? Resource-Aware Reasoning via Reinforcement Learning for Embodied Robotic Decision-Making

机器人何时应该思考?基于强化学习的资源感知推理在具身机器人决策中的应用

Jun Liu, Pu Zhao, Zhenglun Kong, Xuan Shen, Peiyan Dong, Fan Yang, Lin Cui, Hao Tang, Geng Yuan, Wei Niu, Wenbin Zhang, Xue Lin, Gaowen Liu, Yanzhi Wang, Dong Huang

发表机构 * Robotics Institute, Carnegie Mellon University(卡内基梅隆大学机器人研究所) Northeastern University(东北大学) Harvard University(哈佛大学) Cornell University(康奈尔大学) MIT(麻省理工学院) Fujitsu Research of America(美国富士通研究) Tsinghua University(清华大学) Peking University(北京大学) University of Georgia(佐治亚大学) Florida International University(佛罗里达国际大学) EmbodyX Inc(EmbodyX公司) Cisco Systems(思科系统)

AI总结 提出RARRL框架,通过强化学习学习高层编排策略,使具身代理能自适应决定是否调用LLM推理、选择推理角色及分配计算预算,以平衡推理开销与任务成功率。

详情
AI中文摘要

具身机器人系统越来越依赖基于大语言模型(LLM)的代理来支持与环境交互过程中的高级推理、规划和决策。然而,调用LLM推理会引入大量的计算延迟和资源开销,这可能会中断动作执行并降低系统可靠性。过多的推理可能延迟动作,而推理不足则常常导致错误决策和任务失败。这引出了具身代理的一个基本问题:代理何时应该推理,何时应该行动?在这项工作中,我们提出了RARRL(基于强化学习的资源感知推理),一个用于具身代理资源感知编排的分层框架。RARRL不是学习低级控制策略,而是学习一个在代理决策层运行的高级编排策略。该策略使代理能够根据当前观察、执行历史和剩余资源,自适应地决定是否调用推理、使用哪个推理角色以及分配多少计算预算。大量实验,包括使用来自ALFRED基准测试的经验延迟配置文件进行评估,表明与固定或启发式推理策略相比,RARRL在减少执行延迟和增强鲁棒性的同时,持续提高了任务成功率。这些结果表明,自适应推理控制对于构建可靠且高效的具身机器人代理至关重要。

英文摘要

Embodied robotic systems increasingly rely on large language model (LLM)-based agents to support high-level reasoning, planning, and decision-making during interactions with the environment. However, invoking LLM reasoning introduces substantial computational latency and resource overhead, which can interrupt action execution and reduce system reliability. Excessive reasoning may delay actions, while insufficient reasoning often leads to incorrect decisions and task failures. This raises a fundamental question for embodied agents: when should the agent reason, and when should it act? In this work, we propose RARRL (Resource-Aware Reasoning via Reinforcement Learning), a hierarchical framework for resource-aware orchestration of embodied agents. Rather than learning low-level control policies, RARRL learns a high-level orchestration policy that operates at the agent's decision-making layer. This policy enables the agent to adaptively determine whether to invoke reasoning, which reasoning role to employ, and how much computational budget to allocate based on current observations, execution history, and remaining resources. Extensive experiments, including evaluations with empirical latency profiles derived from the ALFRED benchmark, show that RARRL consistently improves task success rates while reducing execution latency and enhancing robustness compared with fixed or heuristic reasoning strategies. These results demonstrate that adaptive reasoning control is essential for building reliable and efficient embodied robotic agents.

2603.14644 2026-05-29 eess.IV cs.CV cs.DB cs.LG 版本更新

LUMINA: A Multi-Vendor Mammography Benchmark with Energy Harmonization Protocol

LUMINA:采用能量协调协议的多供应商乳腺X线摄影基准

Hongyi Pan, Gorkem Durak, Halil Ertugrul Aktas, Andrea M. Bejar, Baver Tutun, Emre Uysal, Ezgi Bulbul, Mehmet Fatih Dogan, Berrin Erok, Berna Akkus Yildirim, Sukru Mehmet Erturk, Ulas Bagci

发表机构 * Department of Radiology, Northwestern University(西北大学放射科) Department of Radiation Oncology, University of Health Sciences Prof. Dr. Cemil Tascioglu City Hospital(健康科学大学Prof. Dr. Cemil Tascioglu市立医院放射肿瘤科) Department of Radiology, Istanbul University(伊斯坦布尔大学放射科)

AI总结 为解决现有FFDM数据集规模小、标注少和供应商多样性不足的问题,提出LUMINA多供应商数据集及能量协调方法,通过前景像素对齐减少域偏移,在诊断、BI-RADS分类和密度估计任务上验证了模型性能提升。

Comments This paper was accepted to CVPR 2026

详情
AI中文摘要

公开可用的全视野数字乳腺X线摄影(FFDM)数据集在规模、临床标注和供应商多样性方面仍然有限,阻碍了稳健模型的发展。我们引入了LUMINA,一个经过整理的多供应商FFDM数据集,明确编码了采集能量和供应商元数据,以捕捉现有基准中常被忽略的临床相关外观变化。该数据集包含来自468名患者的1824张图像(960张良性,864张恶性),附有病理确认标签、BI-RADS评估和乳腺密度标注。LUMINA涵盖六个采集系统,包括高能和低能成像模式,能够系统分析供应商和能量引起的域偏移。为应对这些变化,我们提出了一种仅前景的像素空间对齐方法(“能量协调”),将图像映射到低能参考,同时保留病变形态。我们在三个临床相关任务上对CNN和Transformer模型进行了基准测试:诊断(良性 vs. 恶性)、BI-RADS分类和密度估计。双视图模型一致优于单视图模型。EfficientNet-B0在诊断任务上达到93.54%的AUC,而Swin-T在密度预测上达到最佳宏平均AUC 89.43%。协调方法提升了各架构的性能,并产生了更局部的Grad-CAM响应。总体而言,LUMINA提供了(1)一个供应商多样化的基准和(2)一个模型无关的协调框架,用于可靠且可部署的乳腺X线摄影AI。

英文摘要

Publicly available full-field digital mammography (FFDM) datasets remain limited in size, clinical annotations, and vendor diversity, hindering the development of robust models. We introduce LUMINA, a curated, multi-vendor FFDM dataset that explicitly encodes acquisition energy and vendor metadata to capture clinically relevant appearance variations often overlooked in existing benchmarks. This dataset contains 1824 images from 468 patients (960 benign, 864 malignant), with pathology-confirmed labels, BI-RADS assessments, and breast-density annotations. LUMINA spans six acquisition systems and includes both high- and low-energy imaging styles, enabling systematic analysis of vendor- and energy-induced domain shifts. To address these variations, we propose a foreground-only pixel-space alignment method ("energy harmonization") that maps images to a low-energy reference while preserving lesion morphology. We benchmark CNN and transformer models on three clinically relevant tasks: diagnosis (benign vs. malignant), BI-RADS classification, and density estimation. Two-view models consistently outperform single-view models. EfficientNet-B0 achieves an AUC of 93.54% for diagnosis, while Swin-T achieves the best macro-AUC of 89.43% for density prediction. Harmonization improves performance across architectures and produces more localized Grad-CAM responses. Overall, LUMINA provides (1) a vendor-diverse benchmark and (2) a model-agnostic harmonization framework for reliable and deployable mammography AI.
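
A foreground-only pixel-space alignment can be illustrated as follows. The abstract does not describe the actual mapping, so this sketch assumes the simplest option: threshold out the background, then match the mean and standard deviation of foreground intensities to those of a reference image. The threshold value and synthetic images are hypothetical.

```python
import numpy as np

def harmonize_foreground(image, reference, threshold=0.05):
    """Match foreground mean/std to the reference; background pixels stay untouched."""
    fg = image > threshold          # assumed foreground mask (breast tissue)
    ref_fg = reference > threshold
    mu, sigma = image[fg].mean(), image[fg].std()
    ref_mu, ref_sigma = reference[ref_fg].mean(), reference[ref_fg].std()
    out = image.astype(float).copy()
    # Affine intensity map: monotone, so relative contrast inside the tissue is kept.
    out[fg] = (image[fg] - mu) / (sigma + 1e-8) * ref_sigma + ref_mu
    return out

# Synthetic "high-energy" image and "low-energy" reference with zero background.
rng = np.random.default_rng(0)
image = np.zeros((8, 8))
image[2:6, 2:6] = rng.uniform(0.5, 1.0, size=(4, 4))
reference = np.zeros((8, 8))
reference[1:7, 1:7] = rng.uniform(0.2, 0.6, size=(6, 6))

harmonized = harmonize_foreground(image, reference)
```

Restricting the statistics to the foreground avoids letting large background areas, which differ across vendors mainly in size, dominate the alignment.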

2603.01006 2026-05-29 cs.SD cs.AI cs.LG cs.MM 版本更新

AG-REPA: Causal Layer Selection for Representation Alignment in Audio Flow Matching

AG-REPA:音频流匹配中表示对齐的因果层选择

Pengfei Zhang, Tianxin Xie, Minghao Yang, Li Liu

发表机构 * AI Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou)(香港科技大学(广州)信息枢纽人工智能学域)

AI总结 提出AG-REPA方法,通过前向门控消融量化各层对速度场的因果贡献,实现稀疏层选择和自适应加权对齐,在音频流匹配中优于传统REPA基线。

Comments Accepted to ICML 2026. 17 pages, 4 figures, 12 tables

详情
AI中文摘要

表示对齐(REPA)通过将中间隐藏状态与预训练教师特征对齐来改进生成流模型的训练,但在令牌条件音频流匹配中,其有效性关键取决于监督层的选择,而监督层通常基于深度启发式地选择。在这项工作中,我们引入了归因引导的表示对齐(AG-REPA),一种用于音频流匹配中表示对齐的新型因果层选择策略。首先,我们发现最能存储语义/声学信息(高教师空间相似性)的层不一定是那些对驱动生成的速度场贡献最大的层,我们称之为存储-贡献分离(SCD)。为了将这一见解转化为可操作的训练指导,我们提出了一种前向门控消融(FoG-A),通过预测速度场中的诱导变化来量化每个层的因果贡献,从而实现稀疏层选择和自适应加权对齐。在统一的语音和通用音频训练(LibriSpeech + AudioSet)中,在不同的令牌条件拓扑下,AG-REPA始终优于REPA基线。总体而言,我们的结果表明,当对齐应用于因果主导的驱动速度场的层时,而不是应用于表示丰富但功能被动的层时,对齐最为有效。

英文摘要

REPresentation Alignment (REPA) improves the training of generative flow models by aligning intermediate hidden states with pretrained teacher features, but its effectiveness in token-conditioned audio Flow Matching critically depends on the choice of supervised layers, which is typically made heuristically based on depth. In this work, we introduce Attribution-Guided REPresentation Alignment (AG-REPA), a novel causal layer selection strategy for representation alignment in audio Flow Matching. First, we find that the layers that best store semantic/acoustic information (high teacher-space similarity) are not necessarily the layers that contribute most to the velocity field that drives generation; we call this Store-Contribute Dissociation (SCD). To turn this insight into actionable training guidance, we propose a forward-only gate ablation (FoG-A) that quantifies each layer's causal contribution via the induced change in the predicted velocity field, enabling sparse layer selection and adaptive weighting for alignment. Across unified speech and general-audio training (LibriSpeech + AudioSet) under different token-conditioning topologies, AG-REPA consistently outperforms REPA baselines. Overall, our results show that alignment is most effective when applied to the causally dominant layers that drive the velocity field, rather than to layers that are representationally rich but functionally passive.
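
The gate-ablation idea can be sketched on a toy residual network: close the gate on one layer's residual branch, rerun the forward pass, and measure how much the output changes. The network, the scoring norm, and the top-1 selection below are assumptions for illustration; the paper's FoG-A operates on a flow-matching model's predicted velocity field and may define the score differently.

```python
import numpy as np

def forward(x, weights, gates):
    """Toy residual stack: h <- h + gate * tanh(W h). Output stands in for the velocity."""
    h = x
    for W, g in zip(weights, gates):
        h = h + g * np.tanh(W @ h)
    return h

def gate_ablation_scores(x, weights):
    """Per-layer score: output change when that layer's gate is set to zero."""
    n = len(weights)
    full = forward(x, weights, np.ones(n))
    scores = []
    for layer in range(n):
        gates = np.ones(n)
        gates[layer] = 0.0  # ablate exactly one residual branch
        scores.append(float(np.linalg.norm(full - forward(x, weights, gates))))
    return np.array(scores)

rng = np.random.default_rng(0)
dim = 6
# Hypothetical layers with different weight scales.
weights = [rng.normal(scale=s, size=(dim, dim)) for s in (0.05, 1.0, 0.05)]
x = rng.normal(size=dim)

scores = gate_ablation_scores(x, weights)
selected = np.argsort(scores)[::-1][:1]  # sparse selection: keep the top-scoring layer
```

Only forward passes are needed, one per layer, so no gradients or retraining are involved in computing the scores.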

2603.00454 2026-05-29 cs.LG cs.AI 版本更新

Rooted Absorbed Prefix Trajectory Balance with Submodular Replay for GFlowNet Training

基于子模重放的根吸收前缀轨迹平衡用于GFlowNet训练

Xi Wang, Wenbo Lu, Shengjie Wang

发表机构 * Courant Institute School of Mathematics, Computing, and Data Science, New York University(纽约大学Courant研究所数学、计算与数据科学学院)

AI总结 针对GFlowNet的模式坍塌问题,提出RapTB目标函数(通过根锚定子轨迹监督和吸收后缀备份提供密集前缀学习信号)和SubM子模重放策略(促进高奖励和多样性),在分子生成等任务中提升优化性能和多样性。

详情
AI中文摘要

生成流网络(GFlowNets)能够微调大型语言模型以近似奖励比例的后验分布,但仍容易出现模式坍塌,表现为前缀坍塌和长度偏差。我们将此归因于两个因素:(i)对早期前缀的信用分配较弱,以及(ii)有偏的重放导致偏移的、非代表性的训练流分布。我们提出根吸收前缀轨迹平衡(RapTB),该目标函数将子轨迹监督锚定在根节点,并通过吸收后缀备份将终端奖励传播到中间前缀,从而提供密集的前缀级学习信号。为了减轻重放引起的分布偏移,我们进一步引入SubM,一种子模重放刷新策略,同时促进高奖励和多样性。实验表明,在使用SMILES字符串的分子生成等任务中,RapTB结合SubM持续提升优化性能和分子多样性,同时保持高有效性。

英文摘要

Generative Flow Networks (GFlowNets) enable fine-tuning large language models to approximate reward-proportional posteriors, but they remain prone to mode collapse, manifesting as prefix collapse and length bias. We attribute this to two factors: (i) weak credit assignment to early prefixes, and (ii) biased replay that induces a shifted, non-representative training flow distribution. We propose Rooted absorbed prefix Trajectory Balance (RapTB), an objective that anchors subtrajectory supervision at the root and propagates terminal rewards to intermediate prefixes via absorbed suffix-based backups, providing dense prefix-level learning signals. To mitigate replay-induced distribution shift, we further introduce SubM, a submodular replay refresh strategy that promotes both high reward and diversity. Empirically, on tasks such as molecule generation with LLMs using SMILES strings, RapTB combined with SubM consistently improves optimization performance and molecular diversity while preserving high validity.
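
A submodular replay refresh that trades off reward and diversity can be sketched with greedy selection. The abstract does not state SubM's objective, so this example assumes a common choice: the sum of selected rewards plus a facility-location coverage term over pairwise similarities. The objective, the weight `lam`, and the toy data are hypothetical.

```python
import numpy as np

def greedy_replay_selection(rewards, similarity, budget, lam=1.0):
    """Greedily maximize sum(rewards[S]) + lam * sum_j max_{i in S} similarity[i, j]."""
    n = len(rewards)
    selected = []
    coverage = np.zeros(n)  # best similarity of every item to the selected set
    for _ in range(min(budget, n)):
        best_gain, best_i = -np.inf, None
        for i in range(n):
            if i in selected:
                continue
            # Marginal gain: own reward plus newly covered similarity mass.
            gain = rewards[i] + lam * np.sum(np.maximum(coverage, similarity[i]) - coverage)
            if gain > best_gain:
                best_gain, best_i = gain, i
        selected.append(best_i)
        coverage = np.maximum(coverage, similarity[best_i])
    return selected

# Items 0 and 1 are near-duplicates with high reward; item 2 is distinct.
rewards = np.array([1.0, 0.9, 0.6, 0.1])
similarity = np.array([
    [1.0, 0.95, 0.0, 0.0],
    [0.95, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
    [0.0, 0.0, 0.0, 1.0],
])
selected = greedy_replay_selection(rewards, similarity, budget=2)
```

With a budget of two, the greedy pass keeps item 0 and then item 2, skipping the higher-reward near-duplicate item 1 because it adds almost no new coverage.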

2603.00377 2026-05-29 cs.LG 版本更新

Improving Full Waveform Inversion in Large Model Era

在大模型时代改进全波形反演

Yinan Feng, Peng Jin, Yuzhe Guo, Yinpeng Chen, Youzuo Lin

发表机构 * School of Data Science and Society(数据科学与社会学院) University of North Carolina at Chapel Hill(北卡罗来纳大学教堂山分校) The Pennsylvania State University(宾夕法尼亚州立大学) Google DeepMind(谷歌DeepMind)

AI总结 提出通过协调缩放模型容量、数据多样性和训练策略,使十亿参数模型在简单合成数据上训练后能泛化到复杂地质结构,在OpenFWI等基准上达到最优性能。

AI中文摘要

全波形反演(FWI)是一个高度非线性和不适定的问题,旨在从地表记录的地震波形数据中恢复地下速度图。现有的数据驱动FWI通常使用小模型,因为可用数据集体积有限、地质多样性不足且空间范围小,导致对过拟合的严重担忧。尽管它们在合成数据集上表现良好,但当前方法无法泛化到更真实的地质结构。在这项工作中,我们展示了完全在模拟且相对简单数据上训练的模型能够出色地泛化到具有挑战性且未见过的地质基准。我们提供了一个工作配方,通过协调三个轴上的缩放:模型容量、数据多样性和训练策略,来驯服十亿参数模型用于FWI。我们的模型在OpenFWI上达到了最先进的性能,并显著缩小了数据驱动FWI中的泛化差距。在六个具有挑战性的地球物理基准上,包括Marmousi、2D SEG/EAGE盐体和逆冲断层、2004 BP、Sigsbee和SEAM Phase I,它推断出了训练集中不存在的复杂结构,并带来了显著的性能提升(SSIM从0.5844提高到0.7669)。总体而言,我们的结果表明,通过适当的缩放策略,在简单合成数据上训练的大模型能够实现对更复杂和真实地质结构的显著泛化。

英文摘要

Full Waveform Inversion (FWI) is a highly nonlinear and ill-posed problem that aims to recover subsurface velocity maps from surface-recorded seismic waveform data. Existing data-driven FWI typically uses small models, as available datasets have limited volume, geological diversity, and spatial extent, leading to substantial concerns about overfitting. Although they perform well on synthetic datasets, current methods fail to generalize to more realistic geological structures. In this work, we show that a model trained entirely on simulated and relatively simple data can generalize remarkably well to challenging and unseen geological benchmarks. We provide a working recipe that tames a billion-parameter model for FWI through coordinated scaling across three axes: model capacity, data diversity, and training strategy. Our model achieves state-of-the-art performance on OpenFWI and significantly narrows the generalization gap in data-driven FWI. Across six challenging geophysical benchmarks, including Marmousi, 2D SEG/EAGE Salt and Overthrust, 2004 BP, Sigsbee, and SEAM Phase I, it infers complex structures absent from the training set and delivers significant performance improvements (SSIM from 0.5844 to 0.7669). Overall, our results demonstrate that with an appropriate scaling strategy, large models trained on simple synthetic data can achieve substantial generalization to more complex and realistic geological structures.

2602.19619 2026-05-29 cs.LG 版本更新

Is Your Diffusion Sampler Actually Correct? A Sampler-Centric Evaluation of Discrete Diffusion Language Models

你的扩散采样器真的正确吗?离散扩散语言模型的采样器中心评估

Luhan Tang, Longxuan Yu, Shaorong Zhang, Greg Ver Steeg

发表机构 * University of California, Riverside(加州大学河滨分校)

AI总结 针对离散扩散语言模型评估中采样误差与去噪误差混淆的问题,提出基于oracle的采样器中心框架,通过精确隐马尔可夫模型后验隔离采样误差,发现少步离散扩散采样器在oracle去噪器下仍存在分布偏差。

Comments 30 pages, 10 figures

AI中文摘要

离散扩散语言模型(dLLMs)通过并行更新的迭代去噪,为自回归模型(ARMs)提供了一种快速且灵活的替代方案。然而,它们的评估具有挑战性:现有指标将去噪器近似误差与采样动态引起的采样器诱导误差混为一谈,而自回归模型的自回归采样精确反映学习到的概率模型,不会出现此问题。我们引入了一个以采样器为中心的oracle框架,用从真实马尔可夫链导出的精确隐马尔可夫模型后验替换学习到的去噪器,在受控环境中隔离采样器诱导的误差。我们表明,即使在oracle去噪器下,少步离散扩散采样器在分布上也不正确,存在转移级不匹配,只有当步数接近序列长度时才会消失。此外,负对数似然(NLL)、生成困惑度(GenPPL)或MAUVE的改进并不意味着正确的采样。代码可在 https://luhantang.github.io/dllm_sampler 获取。

英文摘要

Discrete diffusion language models (dLLMs) provide a fast and flexible alternative to autoregressive models (ARMs) via iterative denoising with parallel updates. However, their evaluation is challenging: existing metrics conflate denoiser approximation error with sampler-induced error from the sampling dynamics, a problem that does not arise for ARMs whose autoregressive sampling exactly reflects the learned probability model. We introduce a sampler-centric oracle framework that replaces learned denoisers with an exact Hidden Markov Model posterior derived from a ground-truth Markov chain, isolating sampler-induced error in a controlled setting. We show that few-step discrete diffusion samplers are not distributionally correct even under an oracle denoiser, with transition-level mismatch that vanishes only as the number of steps approaches the sequence length. Moreover, improvements in negative log-likelihood (NLL), generative perplexity (GenPPL), or MAUVE do not imply correct sampling. Code is available at https://luhantang.github.io/dllm_sampler
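A toy calculation can make the sampler-induced error concrete. The sketch below is not the paper's HMM oracle framework: it assumes a two-token sequence drawn from a hypothetical two-state Markov chain and compares a one-step sampler (both tokens unmasked in parallel from exact marginals) with a two-step sampler (one token at a time, using the exact conditional).

```python
from itertools import product

PI = [0.5, 0.5]                  # initial distribution of a toy 2-state chain
P = [[0.9, 0.1], [0.1, 0.9]]     # transition matrix

def true_joint():
    return {(a, b): PI[a] * P[a][b] for a, b in product(range(2), repeat=2)}

def one_step_sampler():
    """Unmask both tokens in parallel from their exact marginals."""
    m1 = PI
    m2 = [sum(PI[a] * P[a][b] for a in range(2)) for b in range(2)]
    return {(a, b): m1[a] * m2[b] for a, b in product(range(2), repeat=2)}

def two_step_sampler():
    """Unmask token 1, then token 2 from the exact posterior given token 1.

    With an exact denoiser this reproduces the chain's joint distribution.
    """
    return {(a, b): PI[a] * P[a][b] for a, b in product(range(2), repeat=2)}

def total_variation(p, q):
    return 0.5 * sum(abs(p[k] - q[k]) for k in p)

tv_one = total_variation(true_joint(), one_step_sampler())   # about 0.4
tv_two = total_variation(true_joint(), two_step_sampler())   # exactly 0.0
```

Even with perfect marginals, the parallel sampler loses the correlation between the two tokens, and the mismatch disappears only when the number of steps equals the sequence length.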

2602.13238 2026-05-29 cs.NI cs.LG 版本更新

Securing SIM-Assisted Wireless Networks via Quantum Reinforcement Learning

通过量子强化学习保护SIM辅助无线网络

Le-Hung Hoang, Quang-Trung Luu, Dinh Thai Hoang, Diep N. Nguyen, Van-Dinh Nguyen

发表机构 * Smart Green Transformation Center, VinUniversity(Vin大学智能绿色转型中心) Université Paris-Saclay - CNRS - CentraleSupélec - L2S(巴黎萨克雷大学 - CNRS - CentraleSupélec - L2S) School of Electrical and Data Engineering, University of Technology Sydney(悉尼科技大学电气与数据工程学院) School of Computer Science and Statistics, Trinity College Dublin, The University of Dublin(都柏林大学三一学院计算机科学与统计学学院)

AI总结 针对SIM辅助无线网络中高维优化和动态环境挑战,提出混合量子近端策略优化框架,联合优化功率分配和SIM相位,实现约15%保密率提升和30%收敛加速。

Comments Submitted to IEEE TCOM: 13 pages

AI中文摘要

堆叠智能超表面(SIM)最近作为一种强大的波域技术出现,通过多层可编程架构实现对电磁信号的多级操控。虽然SIM为增强物理层安全提供了前所未有的自由度,但其极大数量的超原子导致高维且强耦合的优化空间,使得传统设计方法效率低下且难以扩展。此外,现有的深度强化学习(DRL)技术在动态无线环境中,面对被动窃听者的不完美知识时,存在收敛慢和性能下降的问题。为应对这些挑战,我们提出了一种混合量子近端策略优化(QPPO)框架,用于SIM辅助的安全通信,该框架联合优化发射功率分配和SIM相移,以在功率和服务质量约束下最大化平均保密率。具体而言,将参数化量子电路嵌入演员网络,形成混合经典-量子策略架构,增强了高维连续动作空间中的策略表示能力和探索效率。大量仿真表明,所提出的Q-PPO方案始终优于DRL基线,在不完美窃听者信道状态信息下,实现了约15%更高的保密率和30%更快的收敛速度。这些结果确立了Q-PPO作为SIM赋能安全无线网络的强大优化范式。

英文摘要

Stacked intelligent metasurfaces (SIMs) have recently emerged as a powerful wave-domain technology that enables multi-stage manipulation of electromagnetic signals through multilayer programmable architectures. While SIMs offer unprecedented degrees of freedom for enhancing physical-layer security, their extremely large number of meta-atoms leads to a high-dimensional and strongly coupled optimization space, making conventional design approaches inefficient and difficult to scale. Moreover, existing deep reinforcement learning (DRL) techniques suffer from slow convergence and performance degradation in dynamic wireless environments with imperfect knowledge of passive eavesdroppers. To address these challenges, we propose a hybrid quantum proximal policy optimization (Q-PPO) framework for SIM-assisted secure communications that jointly optimizes transmit power allocation and SIM phase shifts to maximize the average secrecy rate under power and quality-of-service constraints. Specifically, a parameterized quantum circuit is embedded into the actor network, forming a hybrid classical-quantum policy architecture that enhances policy representation capability and exploration efficiency in high-dimensional continuous action spaces. Extensive simulations demonstrate that the proposed Q-PPO scheme consistently outperforms DRL baselines, achieving approximately 15% higher secrecy rates and 30% faster convergence under imperfect eavesdropper channel state information. These results establish Q-PPO as a powerful optimization paradigm for SIM-enabled secure wireless networks.
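The quantity being maximized can be illustrated with the textbook single-user secrecy rate. The abstract does not give the paper's exact objective (which averages over imperfect eavesdropper channel knowledge), so the function and the SNR values below are assumptions for illustration only.

```python
import math

def secrecy_rate(snr_legit, snr_eve):
    """Textbook secrecy rate in bits/s/Hz: the legitimate user's capacity
    minus the eavesdropper's capacity, floored at zero."""
    return max(0.0, math.log2(1 + snr_legit) - math.log2(1 + snr_eve))

good_channel = secrecy_rate(3.0, 1.0)   # log2(4) - log2(2)
bad_channel = secrecy_rate(1.0, 3.0)    # eavesdropper is stronger
```

In a reinforcement-learning formulation, a quantity like this would typically serve as the per-step reward, with power allocation and phase shifts as the action.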

2602.11760 2026-05-29 stat.ML cs.LG 版本更新

Aggregate Models, Not Explanations: Improving Feature Importance Estimation

聚合模型而非解释:改进特征重要性估计

Joseph Paillard, Angel Reyero Lobo, Denis A. Engemann, Bertrand Thirion

发表机构 * Roche Pharma Research & Early Development, Roche Innovation Center Basel, F. Hoffmann-La Roche Ltd, Basel, Switzerland Université Paris-Saclay, Inria, CEA, Palaiseau, France

AI总结 针对特征重要性估计不准确的问题,本文通过理论分析证明模型级集成比解释级集成能更有效地降低误差,并在基准和蛋白质组学数据上验证。

AI中文摘要

特征重要性方法有望将机器学习模型从预测引擎转变为科学发现的工具。然而,由于数据采样和算法随机性,表达性模型可能不稳定,导致变量重要性估计不准确,削弱其在关键生物医学应用中的效用。尽管集成提供了一种解决方案,但由于重要性度量的非线性,决定是解释单个集成模型还是聚合单个模型解释是困难的,并且尚未得到充分研究。我们的理论分析在适应复杂最先进机器学习模型的假设下发展,揭示了这一选择主要由模型的超额风险驱动。与先前文献相反,我们表明模型级集成通过减少这一主导误差项,提供了更准确的变量重要性估计,特别是对于表达性模型。我们在经典基准和来自英国生物银行的大规模蛋白质组学研究中验证了这些发现。

英文摘要

Feature-importance methods show promise in transforming machine learning models from predictive engines into tools for scientific discovery. However, due to data sampling and algorithmic stochasticity, expressive models can be unstable, leading to inaccurate variable importance estimates and undermining their utility in critical biomedical applications. Although ensembling offers a solution, deciding whether to explain a single ensemble model or aggregate individual model explanations is difficult due to the nonlinearity of importance measures and remains largely understudied. Our theoretical analysis, developed under assumptions accommodating complex state-of-the-art ML models, reveals that this choice is primarily driven by the model's excess risk. In contrast to prior literature, we show that ensembling at the model level provides more accurate variable-importance estimates, particularly for expressive models, by reducing this leading error term. We validate these findings on classical benchmarks and a large-scale proteomic study from the UK Biobank.
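One simple way to see why the order of aggregation matters for a nonlinear importance measure is the toy below. It is not the paper's analysis: it assumes importance is the squared coefficient of a one-feature linear model and uses two hypothetical fits that are unbiased on average.

```python
from statistics import mean

def importance(beta):
    """Toy importance: squared coefficient (an assumption for illustration)."""
    return beta ** 2

true_beta = 1.0
fitted_betas = [0.5, 1.5]   # two hypothetical unstable fits, mean = true_beta

# Model-level ensembling: average the models, then explain the ensemble.
model_level = importance(mean(fitted_betas))

# Explanation-level ensembling: explain each model, then average.
explanation_level = mean(importance(b) for b in fitted_betas)
```

Here the model-level estimate recovers the true importance of 1.0, while averaging explanations gives 1.25 because the variance of the individual fits leaks into the squared term.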

2602.10765 2026-05-29 cs.LG 版本更新

Collaborative Threshold Watermarking

协作阈值水印

Tameem Bakr, Anish Ambreth, Nils Lukas

发表机构 * Department of Machine Learning, MBZUAI(机器学习系,MBZUAI)

AI总结 针对联邦学习中多客户端模型溯源问题,提出 (t,K)-阈值水印协议,通过秘密共享水印密钥实现至少 t 个客户端协作验证,且抵抗少于 t 客户端的共谋攻击。

AI中文摘要

在联邦学习(FL)中,$K$ 个客户端共同训练一个模型,而不共享原始数据。由于每个参与者投入数据和计算资源,客户端需要机制来后续证明联合训练模型的来源。模型水印在权重中嵌入隐藏信号,但朴素方法要么随着 $K$ 增长,每个客户端的水印被稀释而无法扩展,要么赋予单个客户端验证并可能移除水印的能力。我们引入 $(t,K)$-阈值水印:客户端在训练期间协作嵌入共享水印,而只有至少 $t$ 个客户端的联盟才能重建水印密钥并验证可疑模型。我们秘密共享水印密钥 $τ$,使得少于 $t$ 个客户端的联盟无法重建它,并且可以在不公开 $τ$ 的情况下进行验证。我们在白盒设置中实例化我们的协议,并在 IID 和非 IID 分区上的图像分类任务以及语言模型微调设置中评估它。我们的水印在规模($K=128$)下仍可检测,准确率损失最小,并且在攻击(包括使用多达 20% 训练数据的自适应微调)下仍保持在检测阈值($z\ge 4$)以上。代码可在 https://github.com/tameemalaa/collaborative-threshold-watermark 获取。

英文摘要

In federated learning (FL), $K$ clients jointly train a model without sharing raw data. Because each participant invests data and compute, clients need mechanisms to later prove the provenance of a jointly trained model. Model watermarking embeds a hidden signal in the weights, but naive approaches either do not scale to many clients, because per-client watermarks are diluted as $K$ grows, or give any individual client the ability to verify and potentially remove the watermark. We introduce $(t,K)$-threshold watermarking: clients collaboratively embed a shared watermark during training, while only coalitions of at least $t$ clients can reconstruct the watermark key and verify a suspect model. We secret-share the watermark key $τ$ so that coalitions of fewer than $t$ clients cannot reconstruct it, and verification can be performed without revealing $τ$ in the clear. We instantiate our protocol in the white-box setting and evaluate it on image classification tasks with both IID and non-IID partitions, as well as in a language-model fine-tuning setting. Our watermark remains detectable at scale ($K=128$) with minimal accuracy loss and stays above the detection threshold ($z\ge 4$) under attacks including adaptive fine-tuning using up to 20% of the training data. Code is available at https://github.com/tameemalaa/collaborative-threshold-watermark.
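The threshold property can be illustrated with Shamir secret sharing, a standard $(t,K)$ scheme. The abstract does not say which scheme the protocol uses, so treat this as a sketch under that assumption; the modulus, the integer stand-in for $τ$, and the function names are hypothetical, and the code does not cover verification without revealing $τ$.

```python
import random

PRIME = 2**61 - 1  # field modulus (a Mersenne prime); an illustrative choice

def split_secret(secret, t, k, rng):
    """Split `secret` into k shares; any t of them reconstruct it."""
    coeffs = [secret] + [rng.randrange(PRIME) for _ in range(t - 1)]

    def poly(x):
        acc = 0
        for c in reversed(coeffs):       # Horner evaluation mod PRIME
            acc = (acc * x + c) % PRIME
        return acc

    return [(x, poly(x)) for x in range(1, k + 1)]

def reconstruct(shares):
    """Lagrange interpolation at x = 0 over the prime field."""
    secret = 0
    for i, (xi, yi) in enumerate(shares):
        num, den = 1, 1
        for j, (xj, _) in enumerate(shares):
            if i != j:
                num = (num * -xj) % PRIME
                den = (den * (xi - xj)) % PRIME
        secret = (secret + yi * num * pow(den, -1, PRIME)) % PRIME
    return secret

# `random` is used for reproducibility only; real use needs a CSPRNG.
rng = random.Random(0)
tau = 123456789                          # stand-in for the watermark key
shares = split_secret(tau, t=3, k=5, rng=rng)
```

Any three of the five shares recover `tau` via `reconstruct`, while two shares interpolate to an unrelated value.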

2602.09499 2026-05-29 cs.LG cs.CR 版本更新

Computationally Efficient Replicable Learning of Parities and Applications

奇偶性的计算高效可复现学习及其应用

Moshe Noivirt, Jessica Sorrell, Eliad Tsfadia

发表机构 * Department of Computer Science, Johns Hopkins University(约翰霍普金斯大学计算机科学系) Department of Computer Science, Bar-Ilan University(巴伊兰大学计算机科学系)

AI总结 本文提出首个计算高效的奇偶性可复现学习算法,证明可复现学习在一般分布上严格超越SQ学习,并揭示其与差分隐私在样本复杂度上的分离。

AI中文摘要

我们研究了可复现性(Impagliazzo等 [STOC `22], Ghazi等 [NeurIPS `21])与其他稳定性概念之间的计算关系。具体而言,我们关注可复现PAC学习及其与差分隐私(Dwork等 [TCC 2006])和统计查询(SQ)模型(Kearns [JACM `98])的联系。从统计角度看,已知差分隐私学习和可复现学习是等价的,并且严格强于SQ学习。然而,在计算上,所有先前已知的高效(即多项式时间)可复现学习算法都局限于SQ可学习任务或受限分布,这与差分隐私学习形成对比。我们的主要贡献是第一个计算高效的可复现算法,用于在任意分布上可实现地学习奇偶性,这一任务在SQ模型中是困难的,但在差分隐私下是可能的。这一结果首次证明,在一般分布上的高效可复现学习严格扩展了高效SQ学习,并且在能力上更接近高效差分隐私学习,尽管可复现性与隐私之间存在计算分离。此外,我们利用我们的奇偶性学习器证明,假设$RP \neq NP$,将可复现性转化为纯差分隐私需要样本复杂度的严格损失。我们的主要构建模块是一个新的、高效且可复现的算法,给定一组向量,该算法输出其线性张成的一个子空间,该子空间覆盖了大部分向量。

英文摘要

We study the computational relationship between replicability (Impagliazzo et al. [STOC `22], Ghazi et al. [NeurIPS `21]) and other stability notions. Specifically, we focus on replicable PAC learning and its connections to differential privacy (Dwork et al. [TCC 2006]) and to the statistical query (SQ) model (Kearns [JACM `98]). Statistically, it was known that differentially private learning and replicable learning are equivalent and strictly more powerful than SQ-learning. Yet, computationally, all previously known efficient (i.e., polynomial-time) replicable learning algorithms were confined to SQ-learnable tasks or restricted distributions, in contrast to differentially private learning. Our main contribution is the first computationally efficient replicable algorithm for realizable learning of parities over arbitrary distributions, a task that is known to be hard in the SQ-model, but possible under differential privacy. This result provides the first evidence that efficient replicable learning over general distributions strictly extends efficient SQ-learning, and is closer in power to efficient differentially private learning, despite computational separations between replicability and privacy. Additionally, we leverage our parity learner to prove that, assuming $RP \neq NP$, converting replicability to pure differential privacy requires a strict loss in sample complexity. Our main building block is a new, efficient, and replicable algorithm that, given a set of vectors, outputs a subspace of their linear span that covers most of them.
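For context, the classical (non-replicable) way to learn a parity in the realizable setting is Gaussian elimination over GF(2). The sketch below shows that baseline only, not the paper's replicable algorithm; the function name and the bit-packing layout are illustrative choices.

```python
from itertools import product

def learn_parity(samples, n):
    """Return s in {0,1}^n with <s, x> = y (mod 2) for every sample (x, y).

    Rows are stored as integers: bits 0..n-1 hold x, bit n holds the label y.
    Free variables (directions not spanned by the samples) are set to 0.
    """
    pivots = {}                               # pivot bit -> reduced row
    for x, y in samples:
        row = sum(b << i for i, b in enumerate(x)) | (y << n)
        for bit in range(n):
            if not (row >> bit) & 1:
                continue
            if bit in pivots:
                row ^= pivots[bit]            # eliminate this bit
            else:
                pivots[bit] = row             # new pivot row
                break
    s = [0] * n
    for bit in sorted(pivots, reverse=True):  # back-substitution
        row = pivots[bit]
        val = (row >> n) & 1
        for j in range(bit + 1, n):
            if (row >> j) & 1:
                val ^= s[j]
        s[bit] = val
    return s

secret = [1, 0, 1, 1]
samples = [(x, sum(a * b for a, b in zip(secret, x)) % 2)
           for x in product([0, 1], repeat=4)]
recovered = learn_parity(samples, 4)
```

When the samples do not span the whole space, the output depends on which vectors happened to be drawn, so two runs on fresh samples can disagree. That sensitivity is the obstacle a replicable learner has to overcome.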

2602.05786 2026-05-29 cs.LG stat.AP stat.ML 版本更新

Selecting Hyperparameters for Tree-Boosting

选择树提升的超参数

Floris Jan Koster, Fabio Sigrist

发表机构 * Seminar for Statistics, ETH Zurich(苏黎世联邦理工学院统计研究所)

AI总结 本文通过59个数据集比较了多种超参数优化方法,发现SMAC方法显著优于其他方法,并揭示了超参数调优的关键因素。

AI中文摘要

树提升是一种广泛用于表格数据的机器学习技术。然而,其样本外准确性严重依赖于多个超参数。在本文中,我们使用59个回归和分类数据集,实证比较了几种流行的树提升超参数优化方法,包括随机网格搜索、树结构Parzen估计器(TPE)、基于高斯过程的贝叶斯优化(GP-BO)、Hyperband、基于序列模型的算法配置(SMAC)方法以及确定性全网格搜索。我们发现SMAC方法明显优于所有其他考虑的方法。我们进一步观察到:(i)需要相对较大的试验次数(大于100)才能进行准确的调优,(ii)使用超参数的默认值会产生非常不准确的模型,(iii)所有考虑的超参数都可能对树提升的准确性产生实质性影响,即不存在一组比其他超参数更重要的超参数,以及(iv)对于回归任务,使用早停法选择提升迭代次数比将其包含在搜索空间中能产生更准确的结果。

英文摘要

Tree-boosting is a widely used machine learning technique for tabular data. However, its out-of-sample accuracy is critically dependent on multiple hyperparameters. In this article, we empirically compare several popular methods for hyperparameter optimization for tree-boosting, including random grid search, the tree-structured Parzen estimator (TPE), Gaussian-process-based Bayesian optimization (GP-BO), Hyperband, the sequential model-based algorithm configuration (SMAC) method, and deterministic full grid search, using $59$ regression and classification data sets. We find that the SMAC method clearly outperforms all the other considered methods. We further observe that (i) a relatively large number of trials (more than $100$) is required for accurate tuning, (ii) using default values for hyperparameters yields very inaccurate models, (iii) all considered hyperparameters can have a material effect on the accuracy of tree-boosting, i.e., there is no small set of hyperparameters that is more important than others, and (iv) for regression tasks, choosing the number of boosting iterations using early stopping yields more accurate results than including it in the search space.

2602.02909 2026-05-29 cs.AI cs.FL cs.LG 版本更新

Reasoning about Reasoning: BAPO Bounds on Chain-of-Thought Token Complexity in LLMs

关于推理的推理:LLM中思维链令牌复杂度的BAPO界限

Kiran Tomlinson, Tobias Schnabel, Adith Swaminathan, Jennifer Neville

发表机构 * Microsoft Research, Redmond, WA(微软研究院,西雅图,华盛顿)

AI总结 通过扩展BAPO模型,证明二元多数、三元组匹配和图可达性三个任务需要Ω(n)个思维链令牌,实验验证了线性缩放与理论下界一致。

Comments 31 pages; accepted to ICML '26

AI中文摘要

通过思维链(CoT)推理进行推理时扩展是当前最先进LLM性能的主要驱动力,但会带来显著的延迟和计算成本。我们解决了一个基本的理论问题:随着输入规模增长,需要多少推理令牌才能解决问题?通过扩展有界注意力前缀预言机(BAPO)模型——一种量化任务所需信息流的LLM抽象——我们证明了三个典型的BAPO困难任务所需的CoT令牌下界:二元多数、三元组匹配和图可达性。我们证明当输入规模为$n$时,每个任务需要$Ω(n)$个推理令牌。我们通过显式构造给出了匹配或接近匹配的上界。最后,我们在前沿推理模型上的实验显示,这些任务上的推理令牌数量近似线性缩放,且在推理预算受限时出现失败,这与我们的理论下界一致。总之,我们的结果识别了通过CoT进行推理时计算的基本瓶颈,并为分析最优推理长度提供了一种原则性工具。

英文摘要

Inference-time scaling via chain-of-thought (CoT) reasoning is a major driver of state-of-the-art LLM performance, but it comes with substantial latency and compute costs. We address a fundamental theoretical question: how many reasoning tokens are required to solve a problem as input size grows? By extending the bounded attention prefix oracle (BAPO) model--an abstraction of LLMs that quantifies the information flow required to solve a task--we prove lower bounds on the CoT tokens required for three canonical BAPO-hard tasks: binary majority, triplet matching, and graph reachability. We show that each requires $Ω(n)$ reasoning tokens when the input size is $n$. We complement these results with matching or near-matching upper bounds via explicit constructions. Finally, our experiments with frontier reasoning models show approximately linear reasoning token scaling on these tasks and failures when constrained to smaller reasoning budgets, consistent with our theoretical lower bounds. Together, our results identify fundamental bottlenecks in inference-time compute through CoT and offer a principled tool for analyzing optimal reasoning length.

2602.01058 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Good SFT Optimizes for SFT, Better SFT Prepares for Reinforcement Learning

好的SFT优化SFT,更好的SFT为强化学习做准备

Dylan Zhang, Yufeng Xu, Haojin Wang, Qingzhi Chen, Hao Peng

发表机构 * University of Illinois Urbana-Champaign(伊利诺伊大学厄巴纳-香槟分校) New York University (Shanghai)(纽约大学(上海))

AI总结 针对当前SFT-RL流程中离线SFT数据分布与在线RL策略分布不匹配的问题,提出基于策略评估的离线学习损失重加权方法PEAR,通过重要性采样重加权SFT损失,提升后续RL训练效果。

AI中文摘要

推理大语言模型的后训练是一个整体过程,通常包括离线SFT阶段和后续的在线强化学习(RL)阶段。然而,SFT通常被孤立地优化,仅追求最大化SFT性能。我们表明,在相同的RL训练后,从更强的SFT检查点初始化的模型可能显著劣于从较弱检查点初始化的模型。我们将此归因于当前SFT-RL流程中典型的错配:生成离线SFT数据的分布可能与在线RL期间优化的策略(该策略从其自身的rollout中学习)存在显著差异。我们提出PEAR(基于策略评估的离线学习损失重加权算法),这是一种在SFT阶段纠正此错配并让模型更好地为RL做准备的方法。PEAR使用重要性采样来重加权SFT损失,具有三种变体,分别在token、块和序列级别操作。它可以用于增强标准SFT目标,并且一旦收集到离线数据的概率,仅需很少的额外训练开销。我们在可验证推理游戏和数学推理任务上对Qwen 2.5和3以及DeepSeek蒸馏模型进行了控制实验。PEAR在标准SFT基础上持续提升了RL后性能,在AIME2025上pass@8增益高达14.6%。我们的结果表明,通过设计和评估SFT时考虑下游RL而非孤立进行,PEAR是迈向更全面的大语言模型后训练的有效一步。

英文摘要

Post-training of reasoning LLMs is a holistic process that typically consists of an offline SFT stage followed by an online reinforcement learning (RL) stage. However, SFT is often optimized in isolation to maximize SFT performance alone. We show that, after identical RL training, models initialized from stronger SFT checkpoints can significantly underperform those initialized from weaker ones. We attribute this to a mismatch typical in current SFT-RL pipelines: the distribution that generates the offline SFT data can differ substantially from the policy optimized during online RL, which learns from its own rollouts. We propose PEAR (Policy Evaluation-inspired Algorithm for Offline Learning Loss Re-weighting), an SFT-stage method that corrects this mismatch and better prepares the model for RL. PEAR uses importance sampling to reweight the SFT loss, with three variants operating at the token, block, and sequence levels. It can be used to augment standard SFT objectives and incurs little additional training overhead once probabilities for the offline data are collected. We conduct controlled experiments on verifiable reasoning games and mathematical reasoning tasks on Qwen 2.5 and 3 and DeepSeek-distilled models. PEAR consistently improves post-RL performance over canonical SFT, with pass@8 gains of up to 14.6% on AIME2025. Our results suggest that PEAR is an effective step toward more holistic LLM post-training by designing and evaluating SFT with downstream RL in mind rather than in isolation.
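The general idea of importance-weighted SFT can be sketched as below. The abstract does not specify PEAR's actual weights, so everything here is an assumption: the weights are clipped likelihood ratios between the current policy and the distribution that generated the offline data, the clip value is arbitrary, and only token-level and sequence-level variants are shown (the block-level variant is omitted).

```python
import math

def importance_weights(policy_logps, behavior_logps, level="token", clip=5.0):
    """Per-token weights from policy/behavior likelihood ratios.

    "token": each token uses its own ratio; "sequence": every token shares
    the ratio of the whole sequence. Ratios are clipped for stability.
    """
    diffs = [p - b for p, b in zip(policy_logps, behavior_logps)]
    if level == "sequence":
        diffs = [sum(diffs)] * len(diffs)
    return [min(math.exp(d), clip) for d in diffs]

def reweighted_sft_loss(policy_logps, behavior_logps, level="token"):
    """Weighted negative log-likelihood; weights are treated as constants."""
    w = importance_weights(policy_logps, behavior_logps, level)
    return -sum(wi * lp for wi, lp in zip(w, policy_logps)) / len(policy_logps)

logps = [math.log(0.5), math.log(0.25)]
plain = reweighted_sft_loss(logps, logps)   # ratios are 1: ordinary mean NLL
```

When the policy and the data-generating distribution coincide, every weight is 1 and the loss reduces to standard SFT, which matches the description of the method as an augmentation of the usual objective.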

2601.22531 2026-05-29 cs.LG cs.AI 版本更新

Learn from A Rationalist: Distilling Intermediate Interpretable Rationales

向理性主义者学习:蒸馏中间可解释原理

Jiayi Dai, Randy Goebel

发表机构 * Department of Computing Science, University of Alberta, Edmonton, Canada(阿尔伯塔大学计算机科学系,加拿大埃德蒙顿) Alberta Machine Intelligence Institute, Edmonton, Canada(阿尔伯塔机器智能研究所,加拿大埃德蒙顿)

AI总结 提出REKD方法,通过知识蒸馏将教师模型的可解释原理和预测传授给学生模型,提升基于较弱神经网络的可解释原理提取模型的预测性能。

Comments Accepted to the 43rd International Conference on Machine Learning (ICML 2026)

AI中文摘要

由于深度神经网络(DNN)的广泛使用,尤其是在高风险领域,DNN的可解释性受到了越来越多的关注。原理提取(RE)的总体思想是通过选择-预测架构为DNN提供一个可解释的设计框架,其中两个神经网络分别联合学习进行特征选择和预测。仅依赖于最终任务预测的远程监督,学习选择特征子集(或原理)的过程需要在所有可能的特征组合空间中进行搜索,这在计算上具有挑战性,当基础神经网络能力不足时甚至更加困难。为了提高基于能力较弱或较小神经网络(即学生)的RE模型的预测性能,我们提出了REKD(基于知识蒸馏的原理提取),其中学生RE模型除了自身的RE优化外,还从教师(即理性主义者)的原理和预测中学习。这种对RE的结构调整与人类如何从可解释和可验证的知识中有效学习的方式高度一致。由于该方法与神经模型无关,任何黑盒神经网络都可以作为骨干模型集成。为了证明REKD的可行性,我们使用BERT和视觉变换器(ViT)模型的多种变体进行了实验。我们在语言和视觉分类数据集(即IMDB电影评论、CIFAR 10和CIFAR 100)上的实验表明,REKD显著提高了学生RE模型的预测性能。

英文摘要

Because of the pervasive use of deep neural networks (DNNs), especially in high-stakes domains, the interpretability of DNNs has received increased attention. The general idea of rationale extraction (RE) is to provide an interpretable-by-design framework for DNNs via a select-predict architecture where two neural networks learn jointly to perform feature selection and prediction, respectively. Given only the remote supervision from the final task prediction, the process of learning to select subsets of features (or rationales) requires searching in the space of all possible feature combinations, which is computationally challenging and even harder when the base neural networks are not sufficiently capable. To improve the predictive performance of RE models that are based on less capable or smaller neural networks (i.e., the students), we propose REKD (Rationale Extraction with Knowledge Distillation) where a student RE model learns from the rationales and predictions of a teacher (i.e., a rationalist) in addition to the student's own RE optimization. This structural adjustment to RE aligns well with how humans could learn effectively from interpretable and verifiable knowledge. Because of the neural-model agnostic nature of the method, any black-box neural network could be integrated as a backbone model. To demonstrate the viability of REKD, we conduct experiments with multiple variants of BERT and vision transformer (ViT) models. Our experiments across language and vision classification datasets (i.e., IMDB movie reviews, CIFAR 10 and CIFAR 100) show that REKD significantly improves the predictive performance of the student RE models.

2601.22347 2026-05-29 cs.LG cs.AI 版本更新

Pushing the Limits of Block Rotations in Post-Training Quantization

推动后训练量化中块旋转的极限

Sai Sanjeet, Ian Colbert, Pablo Monteagudo-Lago, Giuseppe Franco, Yaman Umuroglu, Nicholas J. Fraser

发表机构 * Advanced Micro Devices, Inc. (AMD)(Advanced Micro Devices公司) State University of New York at Buffalo(纽约州立大学布法罗分校) Norwegian University of Science(挪威科学与技术大学)

AI总结 本文提出PeRQ框架,通过置换和旋转重新分布激活值,以克服块旋转在抑制异常值时的几何限制,显著提升后训练量化精度。

AI中文摘要

最近的后训练量化(PTQ)方法采用块旋转来在舍入前扩散异常值。虽然这减少了在线全向量旋转的开销,但块结构对异常值抑制的影响仍知之甚少。为填补这一空白,我们首次对块Hadamard旋转的异常值抑制进行了系统的非渐近分析。我们的分析表明,异常值抑制从根本上受限于输入向量的几何结构。特别地,在确定性最坏情况下,当旋转前的ℓ1范数质量在块间均匀分布时,旋转后的异常值最小。受这些见解的启发,我们引入了PeRQ(置换、旋转、然后量化),一个在旋转前通过置换重新分布激活质量的PTQ框架。我们提出了一种贪婪质量扩散算法,通过均衡期望的块间ℓ1范数来校准置换。为避免增加推理开销,我们识别了Transformer架构中置换等变区域,在部署前将这些置换合并到模型权重中。实验表明,PeRQ在所有块大小上一致地提高了精度,在将Llama3 1B量化为INT4且块大小为16时,恢复了全向量旋转困惑度的90%,而未经置换时仅为46%。

英文摘要

Recent post-training quantization (PTQ) methods have adopted block rotations to diffuse outliers prior to rounding. While this reduces the overhead of online full-vector rotations, the effect of block structure on outlier suppression remains poorly understood. To fill this gap, we present the first systematic, non-asymptotic analysis of outlier suppression for block Hadamard rotations. Our analysis reveals that outlier suppression is fundamentally limited by the geometry of the input vector. In particular, in the deterministic worst case, post-rotation outliers are minimized when the pre-rotation $\ell_1$ norm mass is evenly distributed across blocks. Guided by these insights, we introduce PeRQ (Permute, Rotate, then Quantize), a PTQ framework that redistributes activation mass via permutations prior to rotation. We propose a greedy mass diffusion algorithm to calibrate permutations by equalizing the expected blockwise $\ell_1$ norms. To avoid adding inference overhead, we identify permutation-equivariant regions in transformer architectures to merge these permutations into model weights before deployment. Experiments show that PeRQ consistently improves accuracy across all block sizes, recovering up to 90% of the full-vector rotation perplexity when quantizing Llama3 1B to INT4 with block size 16, compared to 46% without permutations.
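The worst-case claim can be checked on a small example. The sketch below applies an orthonormal block Hadamard rotation to an 8-dimensional vector with block size 4; the vectors are hypothetical, and the permutation is picked by hand rather than by the paper's greedy calibration algorithm.

```python
import math

def hadamard(n):
    """Sylvester construction of the n x n Hadamard matrix (n a power of 2)."""
    h = [[1]]
    while len(h) < n:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h

def block_rotate(x, block):
    """Apply an orthonormal Hadamard rotation to each length-`block` chunk."""
    h, scale = hadamard(block), 1.0 / math.sqrt(block)
    out = []
    for start in range(0, len(x), block):
        chunk = x[start:start + block]
        out += [scale * sum(hij * v for hij, v in zip(row, chunk)) for row in h]
    return out

def max_abs(v):
    return max(abs(a) for a in v)

concentrated = [4, 4, 4, 4, 0, 0, 0, 0]   # all l1 mass in the first block
permuted = [4, 4, 0, 0, 4, 4, 0, 0]       # same entries, mass split evenly

worst_before = max_abs(block_rotate(concentrated, 4))   # 16 / sqrt(4) = 8.0
worst_after = max_abs(block_rotate(permuted, 4))        # 8 / sqrt(4) = 4.0
```

Each rotated entry is bounded by the block's $\ell_1$ norm divided by the square root of the block size, so halving the largest blockwise $\ell_1$ norm halves the worst-case outlier in this example.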

2601.21725 2026-05-29 cs.CL cs.LG 版本更新

Procedural Pretraining: Warming Up Language Models with Abstract Data

程序化预训练:用抽象数据预热语言模型

Liangze Jiang, Zachary Shinnick, Anton van den Hengel, Hemanth Saratchandran, Damien Teney

发表机构 * EPFL(洛桑联邦理工学院) Idiap Research Institute(Idiap研究所) AIML, Adelaide University(澳大利亚机器学习研究所,阿德莱德大学)

AI总结 提出程序化预训练方法,通过在抽象结构化数据(如形式语言生成的程序数据)上预训练语言模型,显著提升其推理能力并加速后续语义知识学习,实验表明仅需0.1-0.3%的程序数据即可超越标准预训练。

Comments ICML 2026. Project page: https://zlshinnick.github.io/procedural-pretraining-page/

AI中文摘要

直接在网络规模语料库上预训练语言模型是当前的主流范式。我们研究了一种替代方案:首先让模型接触抽象结构化数据,以简化后续丰富语义知识的获取,类似于人类在学习高级推理之前先学习简单逻辑和数学。我们关注由形式语言和其他简单算法生成的程序数据作为此类抽象数据。首先,我们诊断了不同形式的程序数据能够提升的算法技能,通常效果显著。例如,当模型在Dyck序列(平衡括号)上预训练时,上下文召回(大海捞针)的准确率从10%跃升至98%。其次,我们研究了这些增益如何反映在更大模型(高达1.3B参数)的预训练中。我们发现,仅在前端加入0.1%至0.3%的程序数据,就能显著优于在自然语言、代码和非正式数学(C4、CodeParrot和DeepMind-Math数据集)上的标准预训练。值得注意的是,这也使得模型仅需原始数据的55/67/86%即可达到相同的损失值,从而相应地减少FLOPs。第三,我们探索了这些收益背后的机制,发现程序化预训练在注意力层和MLP层中都注入了非平凡的结构。前者对于结构化领域(如代码)尤为重要,后者对于语言领域重要。最后,我们为组合多种形式的程序数据铺平了道路。我们的结果表明,程序化预训练是一种简单、轻量级的方法,能够提升性能并加速语言模型预训练,最终揭示了在LLM中将知识获取与推理分离的前景。

英文摘要

Pretraining language models directly on web-scale corpora is the de facto paradigm. We study an alternative where the model is initially exposed to abstract structured data to ease the subsequent acquisition of rich semantic knowledge, much like humans learning simple logic and mathematics before higher reasoning. We focus on procedural data, generated by formal languages and other simple algorithms, as such abstract data. We first diagnose the algorithmic skills that different forms of procedural data can improve, often significantly. For example, the accuracy of context recall (Needle-in-a-haystack) jumps from 10 to 98% when a model is pretrained on Dyck sequences (balanced brackets). Second, we study how these gains are reflected in pretraining larger models (up to 1.3B). We find that front-loading as little as 0.1 to 0.3% procedural data significantly outperforms standard pretraining on natural language, code, and informal mathematics (C4, CodeParrot, and DeepMind-Math datasets). Notably, this also enables the models to reach the same loss value with only 55/67/86% of the original data and thus a comparable reduction in FLOPs. Third, we explore the mechanisms behind the benefits and find that procedural pretraining instills non-trivial structure in both attention and MLP layers. The former is particularly important for structured domains (e.g. code), and the latter for language. Finally, we lay a path for combining multiple forms of procedural data. Our results show that procedural pretraining is a simple, lightweight means of improving performance and accelerating language model pretraining, ultimately suggesting the promise of disentangling knowledge acquisition from reasoning in LLMs.
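As an illustration of what procedural data can look like, the sketch below samples Dyck sequences (balanced brackets) with two bracket types. The generator, its 50% open/close probability, and the bracket vocabulary are assumptions; the abstract does not describe the authors' exact data pipeline.

```python
import random

PAIRS = {"(": ")", "[": "]"}

def sample_dyck(num_pairs, rng):
    """Sample a balanced bracket string with `num_pairs` matched pairs."""
    out, stack, opened = [], [], 0
    while opened < num_pairs or stack:
        can_open = opened < num_pairs
        if can_open and (not stack or rng.random() < 0.5):
            ch = rng.choice(list(PAIRS))
            stack.append(PAIRS[ch])      # remember the matching closer
            out.append(ch)
            opened += 1
        else:
            out.append(stack.pop())      # close the most recent open bracket
    return "".join(out)

def is_balanced(s):
    """Check that every bracket is closed by its matching partner, in order."""
    stack = []
    for ch in s:
        if ch in PAIRS:
            stack.append(PAIRS[ch])
        elif not stack or stack.pop() != ch:
            return False
    return not stack

example = sample_dyck(8, random.Random(0))
```

Predicting the next closing bracket requires recalling which opener is still pending, which is one plausible reason such data exercises context recall.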

2601.21243 2026-05-29 math.OC cs.LG cs.NA math.NA 版本更新

Solving the Offline and Online Min-Max Problem of Non-smooth Submodular-Concave Functions: A Zeroth-Order Approach

求解非光滑子模-凹函数的离线和在线极小极大问题:一种零阶方法

Amir Ali Farzin, Yuen-Man Pun, Philipp Braun, Tyler Summers, Iman Shames

发表机构 * School of Engineering, Australian National University, Canberra, Australia(澳大利亚国立大学工程学院,澳大利亚堪培拉) Department of Electrical and Electronic Engineering, University of Melbourne, Melbourne, Australia(墨尔本大学电气与电子工程系,澳大利亚墨尔本) Department of Mechanical Engineering, University of Texas at Dallas, Texas, USA(德克萨斯大学达拉斯分校机械工程系,美国德克萨斯)

AI总结 针对目标函数关于最小化变量非光滑子模、关于最大化变量凹的极小极大问题,提出一种基于Lovász扩展次梯度和高斯平滑的零阶方法,证明离线情形下收敛到ε-鞍点,在线情形下达到O(√N P̄_N)对偶间隙。

AI中文摘要

我们考虑目标函数可能非光滑、关于最小化变量子模且关于最大化变量凹的极大极小和极小极大问题。我们研究应用于该问题的零阶方法的性能。该方法基于关于最小化变量的目标函数Lovász扩展的次梯度,并利用高斯平滑来估计关于最大化变量的平滑函数梯度。在期望意义上,我们证明了算法在离线情形下收敛到ε-鞍点。此外,我们表明,在期望意义上,在线设定下算法实现了O(√N P̄_N)的在线对偶间隙,其中N是迭代次数,P̄_N是最优决策序列的路径长度。给出了所有情况下的复杂度分析和超参数选择。通过数值例子说明了理论结果。

英文摘要

We consider max-min and min-max problems with objective functions that are possibly non-smooth, submodular with respect to the minimiser and concave with respect to the maximiser. We investigate the performance of a zeroth-order method applied to this problem. The method uses the subgradient of the Lovász extension of the objective function with respect to the minimiser and Gaussian smoothing to estimate the gradient of the smoothed function with respect to the maximiser. We prove that, in expectation, the algorithm converges to an $ε$-saddle point in the offline case. Moreover, we show that, in the online setting, the algorithm achieves an $O(\sqrt{N\bar{P}_N})$ online duality gap in expectation, where $N$ is the number of iterations and $\bar{P}_N$ is the path length of the sequence of optimal decisions. The complexity analysis and hyperparameter selection are presented for all the cases. The theoretical results are illustrated via numerical examples.
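The two building blocks named in the abstract are standard and can be sketched independently of the paper's algorithm. The code below computes a subgradient of the Lovász extension by the usual sorting rule and a one-sample Gaussian-smoothing gradient estimate; the function names, the toy set function, and the step size `mu` are illustrative assumptions, and the update loop that combines the two is not shown.

```python
import random

def lovasz_subgradient(F, x):
    """Subgradient of the Lovász extension of set function F at x.

    Sort coordinates in decreasing order and take the marginal gains of F
    along the resulting chain of sets.
    """
    order = sorted(range(len(x)), key=lambda i: -x[i])
    g, chosen, prev = [0.0] * len(x), set(), F(frozenset())
    for i in order:
        chosen.add(i)
        cur = F(frozenset(chosen))
        g[i] = cur - prev
        prev = cur
    return g

def gaussian_smoothing_grad(f, y, mu, rng):
    """One-sample zeroth-order gradient estimate of f at y."""
    u = [rng.gauss(0.0, 1.0) for _ in y]
    shifted = [yi + mu * ui for yi, ui in zip(y, u)]
    scale = (f(shifted) - f(y)) / mu
    return [scale * ui for ui in u]

# Toy submodular function F(S) = min(|S|, 1); its Lovász extension is max(x).
g = lovasz_subgradient(lambda S: min(len(S), 1), [0.2, 0.9, 0.5])

# Averaging many estimates of a linear function approaches its true gradient.
rng = random.Random(0)
estimates = [gaussian_smoothing_grad(lambda y: 3.0 * y[0], [1.0], 0.01, rng)[0]
             for _ in range(20000)]
avg = sum(estimates) / len(estimates)
```

Both estimators need only function evaluations, which is what makes the overall method zeroth-order.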

2601.20255 2026-05-29 cs.LG cs.CL cs.SE 版本更新

HE-SNR: Uncovering Latent Logic via Entropy for Guiding Mid-Training on SWE-bench

HE-SNR:通过熵揭示潜在逻辑以指导SWE-bench上的中期训练

Yueyang Wang, Jiawei Fu, Baolong Bi, Xili Wang, Xiaoqing Liu

发表机构 * School of Mathematical Sciences, Peking University, Beijing, China(北京大学数学科学学院) Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China(中国科学院计算技术研究所)

AI总结 针对SWE-bench基准,提出基于熵压缩假说的HE-SNR指标,通过细粒度熵分析指导中期训练数据筛选,在高达560B参数模型上验证有效性。

Comments Accepted at ICML 2026. 21 pages, 15 figures

AI中文摘要

SWE-bench已成为评估大型语言模型在复杂软件工程任务中能力的主要基准。虽然这些能力主要在中期训练阶段获得,并在监督微调(SFT)期间被激发,但目前仍然缺乏能够有效指导中期训练的指标。诸如困惑度(PPL)等标准指标受到“长上下文税”的影响,且与下游SWE性能的相关性较弱。在本文中,我们首先引入严格的数据过滤策略来弥补这一差距。关键地,我们提出了熵压缩假说,将智能重新定义为不是通过标量Top-1压缩,而是通过将不确定性结构化为低阶的熵压缩状态(“合理犹豫”)的能力。基于这种细粒度熵分析,我们制定了一个新的指标,HE-SNR(高熵信噪比)。我们在不同上下文窗口(32K/128K)下对高达560B参数的模型验证了我们的方法。这项工作为优化LLM在复杂工程领域的潜在能力提供了理论基础和实用工具。

英文摘要

SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression, but by the capacity to structure uncertainty into Entropy-Compressed States of low orders ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). We validate our approach on models with up to 560B parameters across different context windows (32K/128K). This work provides both the theoretical foundation and practical tools for optimizing the latent potential of LLMs in complex engineering domains.

2601.18728 2026-05-29 cs.LG math.DG math.OC math.ST stat.TH 版本更新

Riemannian AmbientFlow: Towards Simultaneous Manifold Learning and Generative Modeling from Corrupted Data

黎曼环境流:面向从损坏数据同时进行流形学习和生成建模

Willem Diepeveen, Oscar Leong

发表机构 * Department of Mathematics, University of California, Los Angeles(数学系,加州大学洛杉矶分校) Department of Statistics and Data Science, University of California, Los Angeles(统计与数据科学系,加州大学洛杉矶分校)

AI总结 提出Riemannian AmbientFlow框架,通过变分推断和数据驱动黎曼几何,从损坏观测中同时学习概率生成模型和非线性数据流形,并理论保证误差可控与双Lipschitz流形参数化。

详情
AI中文摘要

现代生成建模方法在从干净样本学习复杂数据分布方面表现出强大性能。然而,在许多科学和成像应用中,干净样本不可用,只能观测到噪声或线性损坏的测量值。此外,数据中存在的潜在结构(如流形几何)对于进一步的科学分析至关重要。在这项工作中,我们引入了Riemannian AmbientFlow,一个直接从损坏观测中同时学习概率生成模型和底层非线性数据流形的框架。基于AmbientFlow的变分推断框架,我们的方法结合了由归一化流引起的数据驱动黎曼几何,通过拉回度量和黎曼自编码器提取流形结构。我们建立了理论保证,表明在适当的几何正则化和测量条件下,学习到的模型以可控误差恢复底层数据分布,并产生光滑的双Lipschitz流形参数化。我们进一步证明,所得的光滑解码器可以作为具有恢复保证的逆问题的原则性生成先验。我们在低维合成流形和MNIST上实证验证了我们的方法。

英文摘要

Modern generative modeling methods have demonstrated strong performance in learning complex data distributions from clean samples. In many scientific and imaging applications, however, clean samples are unavailable, and only noisy or linearly corrupted measurements can be observed. Moreover, latent structures, such as manifold geometries, present in the data are important to extract for further downstream scientific analysis. In this work, we introduce Riemannian AmbientFlow, a framework for simultaneously learning a probabilistic generative model and the underlying, nonlinear data manifold directly from corrupted observations. Building on the variational inference framework of AmbientFlow, our approach incorporates data-driven Riemannian geometry induced by normalizing flows, enabling the extraction of manifold structure through pullback metrics and Riemannian Autoencoders. We establish theoretical guarantees showing that, under appropriate geometric regularization and measurement conditions, the learned model recovers the underlying data distribution up to a controllable error and yields a smooth, bi-Lipschitz manifold parametrization. We further show that the resulting smooth decoder can serve as a principled generative prior for inverse problems with recovery guarantees. We empirically validate our approach on low-dimensional synthetic manifolds and on MNIST.
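As a small illustration of the pullback metrics mentioned above: for a smooth map phi with Jacobian J(x), pulling back the Euclidean metric gives G(x) = J(x)^T J(x). The sketch below evaluates this with central finite differences for a hand-written toy diffeomorphism. The map `phi` is an assumption standing in for a learned normalizing flow; it is not the paper's model.

```python
import numpy as np

def phi(x):
    """Toy diffeomorphism of R^2 (a stand-in for a learned flow)."""
    return np.array([x[0], x[1] + x[0] ** 2])

def jacobian(f, x, eps=1e-6):
    """Central finite-difference Jacobian of f at x; column i is df/dx_i."""
    x = np.asarray(x, dtype=float)
    cols = []
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = eps
        cols.append((f(x + step) - f(x - step)) / (2 * eps))
    return np.stack(cols, axis=1)

def pullback_metric(f, x):
    """Pullback of the Euclidean metric under f: G(x) = J(x)^T J(x)."""
    J = jacobian(f, x)
    return J.T @ J

G = pullback_metric(phi, [1.0, 0.0])
print(G)
```

For this map the Jacobian at (1, 0) is [[1, 0], [2, 1]], so the metric should be close to [[5, 2], [2, 1]].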

2601.12699 2026-05-29 cs.LG cs.SY eess.SY 版本更新

Bandit Algorithms for Deep Brain Stimulation

深度脑刺激的赌博机算法

Arkaprava Gupta, Nicholas Carter, William Zellers, Prateek Ganguli, Benedikt Dietrich, Vibhor Krishna, Parasara Sridhar Duggirala, Samarjit Chakraborty

发表机构 * Department of Computer Science, UNC Chapel Hill(北卡罗来纳大学教堂山分校计算机科学系) Hochschule München(慕尼黑应用科学大学) Department of Neurosurgery, UNC Chapel Hill(北卡罗来纳大学教堂山分校神经外科系) TU Munich Institute for Advanced Study(慕尼黑工业大学高等研究院)

AI总结 提出基于时间与阈值触发的剪枝多臂赌博机算法,无需离线训练,在抑制病理性β波段活动和降低刺激能耗方面优于深度强化学习方法,并验证了其在资源受限植入式系统上的可行性。

Comments Accepted to the ACM/IEEE 17th International Conference on Cyber-Physical Systems (ICCPS) 2026

详情
AI中文摘要

深度脑刺激(DBS)是帕金森病的有效治疗方法,但传统的固定参数刺激会降低电池寿命并引起副作用,同时无法适应变化的神经动力学。最近的强化学习方法提高了适应性,但大多数依赖深度神经网络,需要离线训练且计算成本过高,不适合植入式硬件。本文提出了一种基于时间与阈值触发的剪枝多臂赌博机(T3P MAB)算法的资源意识自适应DBS框架。该方法联合调节刺激频率和幅度,避免预先训练,并且足够透明以支持临床医生指导的调整。使用计算基底节-丘脑模型,我们展示了T3P比竞争的MAB方法收敛更快,并且在抑制病理性β波段活动方面优于深度强化学习基线,同时降低刺激功率。我们在不同的微控制器上实现了该方法,并报告了详细的能量测量,显示在不到两分钟内收敛,适合资源受限的植入式系统。这些结果支持轻量级赌博机控制作为实现个性化、节能DBS的实用途径。

英文摘要

Deep Brain Stimulation (DBS) is an effective treatment for Parkinson's disease, but conventional fixed-parameter stimulation can reduce battery life and cause side effects while failing to adapt to changing neural dynamics. Recent reinforcement learning approaches improve adaptability, yet most rely on deep neural networks that require offline training and are computationally too expensive for implantable hardware. This paper presents a resource-conscious adaptive DBS framework based on a Time- and Threshold-Triggered Pruned Multi-Armed Bandit (T3P MAB) algorithm. The proposed method jointly tunes stimulation frequency and amplitude, avoids prior training, and remains transparent enough to support clinician-guided adjustment. Using a computational basal ganglia-thalamic model, we show that T3P converges faster than competing MAB methods and outperforms deep-RL baselines in suppressing pathological beta-band activity while reducing stimulation power. We implemented it on different microcontrollers and report detailed energy measurements, showing convergence in under two minutes and suitability for resource-constrained implantable systems. These results support lightweight bandit-based control as a practical path toward personalized, energy-efficient DBS.
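To make the bandit framing concrete, here is a minimal sketch of a plain UCB1 bandit choosing among (frequency, amplitude) pairs. It is not the T3P algorithm: the abstract does not specify the time and threshold triggers or the pruning rule, so none of them appear here. The arm grid, the `toy_reward` function, and the noise level are all hypothetical and carry no clinical meaning.

```python
import math
import random

def ucb1(arms, reward_fn, rounds, seed=0):
    """Plain UCB1: returns the most-pulled arm and the pull counts."""
    rng = random.Random(seed)
    counts = [0] * len(arms)
    means = [0.0] * len(arms)
    for t in range(1, rounds + 1):
        if t <= len(arms):
            i = t - 1  # play every arm once before using the index
        else:
            i = max(range(len(arms)),
                    key=lambda j: means[j] + math.sqrt(2 * math.log(t) / counts[j]))
        r = reward_fn(arms[i], rng)
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]  # running average of rewards
    best = max(range(len(arms)), key=counts.__getitem__)
    return arms[best], counts

# Hypothetical arm grid: stimulation frequency (Hz) x amplitude (mA).
arms = [(f, a) for f in (60, 130, 180) for a in (1.0, 2.0, 3.0)]

def toy_reward(arm, rng):
    """Made-up reward that peaks at (130 Hz, 2.0 mA), plus small noise."""
    freq_hz, amp_ma = arm
    return -abs(freq_hz - 130) / 130 - abs(amp_ma - 2.0) / 2.0 + rng.gauss(0.0, 0.05)

best_arm, pulls = ucb1(arms, toy_reward, rounds=2000)
print(best_arm, pulls)
```

The per-step state is two small arrays and one square root per arm, which hints at why bandit controllers can fit on microcontrollers where deep RL policies may not.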

2601.08654 2026-05-29 cs.CL cs.AI cs.LG 版本更新

From Rubrics to Reliable Scores: Evidence-Grounded Text Evaluation with LLM Judges

从评分标准到可靠分数:基于证据的文本评估与LLM裁判

Yihan Hong, Huaiyuan Yao, Bolin Shen, Wanpeng Xu, Hua Wei, Yushun Dong

发表机构 * Washington University in St. Louis(华盛顿大学圣路易斯分校) Arizona State University(亚利桑那州立大学) Florida State University(佛罗里达州立大学)

AI总结 提出Rulers框架,通过三阶段推理(任务规范、结构化执行、事后校准)解决LLM在基于评分标准的文本评估中的执行漂移、归因不可验证和人类尺度错位问题,实现更可靠的评分。

详情
AI中文摘要

基于评分标准的文本评估越来越多地使用大型语言模型(LLM)作为可扩展的裁判,但将冻结的黑盒模型与人类评分标准对齐仍然具有挑战性。我们将这一挑战表述为一个标准迁移问题:目标不仅仅是提示LLM分配分数,而是将人类评分标准意图转移到一个稳定、可审计且与人类对齐的评分协议中。我们识别了基于LLM的评分标准评估中三种反复出现的失败模式:评分标准执行漂移、不可验证的分数归因和人类尺度错位。为了解决这些失败模式,我们引入了Rulers,一个三阶段推理时框架,用于可靠、基于证据的评分标准文本评估。Rulers首先将人类评分标准转换为锁定的任务级规范,然后通过结构化检查表决策、类型化证据基础以及在适用时进行可提取引用验证来执行该规范,最后应用事后校准以将模型衍生的信号与人类分数边界对齐。在涵盖论文评分、摘要评估、EFL写作评估和结构化输入文本生成的四个基于评分标准的基准测试中,Rulers在多个冻结骨干模型的大多数评估设置中实现了更强的人类分数一致性。进一步分析表明,Rulers更好地匹配了经验人类分数分布,提高了在语义等价评分标准扰动下的稳定性,并受益于其三个组成部分。这些结果表明,可靠的LLM评判需要固定标准、可追溯证据和校准的分数解释,而不仅仅是提示措辞。我们的代码可在 https://anonymous.4open.science/r/Rulers_0525-3328 获取。

英文摘要

Rubric-based text evaluation increasingly uses large language models (LLMs) as scalable judges, but aligning frozen black-box models with human scoring standards remains challenging. We formulate this challenge as a criteria-transfer problem: the goal is not merely to prompt an LLM to assign a score, but to transfer human rubric intent into a stable, auditable, and human-aligned scoring protocol. We identify three recurring failure modes in LLM-based rubric scoring: rubric execution drift, unverifiable score attribution, and human-scale misalignment. To address these failure modes, we introduce Rulers, a three-stage inference-time framework for reliable, evidence-grounded rubric-based text evaluation. Rulers first converts a human rubric into a locked task-level specification, then executes the specification with structured checklist decisions, typed evidence grounding, and extractive quote verification when applicable, and finally applies post-hoc calibration to align model-derived signals with human score boundaries. Across four rubric-governed benchmarks covering essay scoring, summarization assessment, EFL writing evaluation, and structured-input text generation, Rulers achieves stronger human-score agreement in most evaluated settings across multiple frozen backbone models. Further analyses show that Rulers better matches empirical human score distributions, improves stability under semantically equivalent rubric perturbations, and benefits from each of its three components. These results suggest that reliable LLM judging requires fixed criteria, traceable evidence, and calibrated score interpretation rather than prompt phrasing alone. Our code is available at https://anonymous.4open.science/r/Rulers_0525-3328.
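The extractive quote verification step can be pictured as a simple containment check: a quote offered as evidence counts only if it actually occurs in the evaluated text. The sketch below is one assumed way to do this with whitespace and case normalisation. The function names and the normalisation rule are illustrative and are not taken from the Rulers code.

```python
def normalize(text):
    """Lower-case the text and collapse whitespace runs to single spaces."""
    return " ".join(text.lower().split())

def verify_quotes(source, quotes):
    """Map each quote to True if it appears verbatim (after normalisation) in source."""
    src = normalize(source)
    return {q: bool(normalize(q)) and normalize(q) in src for q in quotes}

essay = "The author argues that  remote work improves focus.\nHowever, no data is cited."
evidence = ["remote work improves focus", "a survey of 500 employees"]
print(verify_quotes(essay, evidence))
```

A judge output whose cited quote fails this check could be flagged or discarded before any score is derived from it.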

2601.02094 2026-05-29 cs.LG math.FA 版本更新

Horizon Activation Mapping for Neural Networks in Time Series Forecasting

时间序列预测中神经网络的Horizon Activation Mapping

Krupakar Hans, V A Kandappan

发表机构 * Department of Computer Science Engineering(计算机科学工程系) Shiv Nadar University Chennai(希夫·纳达尔大学金奈分校)

AI总结 提出一种基于梯度范数平均的视觉可解释性技术HAM,用于跨不同神经网络家族的时间序列预测模型选择与比较。

Comments Accepted and Presented in International Conference on Optimization and Learning (OLA2026) for publication in LNCS

详情
AI中文摘要

时间序列预测的神经网络依赖于误差度量和特定架构的可解释性方法进行模型选择,但这些方法不适用于不同家族的模型。为了解释对最先进模型家族中层类型不可知的预测模型,我们引入了Horizon Activation Mapping (HAM),这是一种受grad-CAM启发的视觉可解释性技术,使用梯度范数平均来研究horizon的子序列,而grad-CAM则研究图像数据上的注意力图。我们引入了因果和反因果模式,以计算每个时间步上子序列的梯度更新范数平均值,以及表示范数平均值均匀分布的等比例线。研究了优化景观相对于批次大小变化、早停、训练-验证-测试分割、架构选择、单变量预测和dropout的影响,并与HAM中的子序列性能相关联。有趣的是,基于批次大小的活动差异似乎表明每个epoch之间存在指数近似的可能性。本研究使用在ETTm2数据集上训练的多变量预测模型,包括基于MLP的CycleNet、N-Linear、N-HITS、基于自注意力的FEDformer、Pyraformer、基于SSM的SpaceTime和基于扩散的多分辨率DDPM,在不同horizon大小下生成HAM图。NHITS的神经逼近定理和SpaceTime的指数自回归活动被归因于其训练、验证和测试集上HAM图的趋势。总的来说,HAM可用于细粒度模型选择、验证集选择以及不同神经网络模型家族之间的比较。

英文摘要

Neural networks for time series forecasting have relied on error metrics and architecture-specific interpretability approaches for model selection that don't apply across models of different families. To interpret forecasting models agnostic to the types of layers across state-of-the-art model families, we introduce Horizon Activation Mapping (HAM), a visual interpretability technique inspired by grad-CAM that uses gradient norm averages to study the horizon's subseries where grad-CAM studies attention maps over image data. We introduce causal and anti-causal modes to calculate gradient update norm averages across subseries at every timestep and lines of proportionality signifying uniform distributions of the norm averages. Optimization landscape studies with respect to changes in batch sizes, early stopping, train-val-test splits, architectural choices, univariate forecasting and dropouts are studied with respect to performances and subseries in HAM. Interestingly, batch size based differences in activities seem to indicate potential for existence of an exponential approximation across them per epoch relative to each other. Multivariate forecasting models including MLP-based CycleNet, N-Linear, N-HITS, self attention-based FEDformer, Pyraformer, SSM-based SpaceTime and diffusion-based Multi-Resolution DDPM over different horizon sizes trained over the ETTm2 dataset are used for HAM plots in this study. NHITS' neural approximation theorem and SpaceTime's exponential autoregressive activities have been attributed to trends in HAM plots over their training, validation and test sets. In general, HAM can be used for granular model selection, validation set choices and comparisons across different neural network model families.
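One way to picture gradient norms over horizon subseries is with a linear forecaster, where the gradient has a closed form. In the sketch below, the "causal" curve uses the loss on horizon steps 0..t and the "anti-causal" curve uses steps t..H-1. This reading of the two modes, the linear model, and all names are assumptions; the abstract does not define them precisely.

```python
import numpy as np

def subseries_grad_norm(W, x, y, idx):
    """Frobenius norm of d(MSE over horizon steps idx)/dW for the forecast W @ x."""
    err = W[idx] @ x - y[idx]                      # residuals on the chosen steps
    grad = np.zeros_like(W)
    grad[idx] = (2.0 / len(idx)) * np.outer(err, x)
    return float(np.linalg.norm(grad))

def ham_curves(W, x, y):
    """Gradient norms for growing prefixes (causal) and shrinking suffixes (anti-causal)."""
    H = W.shape[0]
    causal = [subseries_grad_norm(W, x, y, np.arange(0, t + 1)) for t in range(H)]
    anti = [subseries_grad_norm(W, x, y, np.arange(t, H)) for t in range(H)]
    return causal, anti

W = np.zeros((4, 3))             # horizon of 4 steps, lookback of 3 steps
x = np.array([1.0, 2.0, 2.0])    # input window with Euclidean norm 3
y = np.ones(4)                   # target horizon
causal, anti = ham_curves(W, x, y)
print(causal, anti)
```

With all residuals equal to -1, the prefix of length t has gradient norm 6/sqrt(t), so the two curves mirror each other in this toy case.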

2601.01162 2026-05-29 cs.LG cs.AI cs.CL 版本更新

Bridging the Semantic Gap for Categorical Data Clustering via Large Language Models

弥合分类数据聚类的语义鸿沟:基于大语言模型的方法

Zihua Yang, Xin Liao, Yiqun Zhang, Yiu-ming Cheung

发表机构 * School of Computer Science and Technology(计算机科学与技术学院) Guangdong University of Technology(广东工业大学) Department of Computer Science(计算机科学系) Hong Kong Baptist University(香港浸会大学)

AI总结 提出BREVE框架,利用外部知识库的语义嵌入丰富分类属性值,并引入自适应权重平衡原始标识与语义信息,在八个基准数据集上平均ARI排名达1.3。

Comments Accepted to ICPR2027

详情
AI中文摘要

定性数据广泛存在于医疗、营销和生物信息学等领域,聚类是其中模式发现的基本工具。定性数据聚类的核心困难在于度量属性值之间的相似性,这些属性值没有固有的顺序或距离。为了恢复这种关系,现有研究通常依赖于数据集内的共现统计。然而,当样本量较小时,这种统计路径变得不可靠,每个值的语义上下文因此未被充分利用。受此限制,本文提出BREVE(通过外部值丰富实现平衡表示),一种聚类框架,通过从外部知识库中提取额外的语义维度来丰富每个定性值。即,每个唯一值被扩展为一个密集嵌入,编码其语义内容。为了防止原始值身份被添加的维度稀释,进一步附加一个轻量级的独热编码组件。然后,由聚类紧致性引导的自适应权重决定富集维度进入最终表示的强度。通过这种设计,在八个基准数据集上的实验表明,与七个代表性竞争者相比,平均ARI排名为1.3。

英文摘要

Qualitative data are widespread in domains such as healthcare, marketing, and bioinformatics, where clustering offers a fundamental tool for pattern discovery. A core difficulty of qualitative-data clustering lies in measuring similarity among attribute values that carry no inherent ordering or distance. To recover such relationships, existing studies typically rely on within-dataset co-occurrence statistics. This statistical route, however, becomes unreliable once the sample size is small, and the semantic context of each value is therefore left underexploited. Motivated by this limitation, this paper proposes BREVE (Balanced Representation via External Value Enrichment), a clustering framework that enriches each qualitative value with extra semantic dimensions drawn from an external knowledge base. That is, every unique value is expanded by a dense embedding that encodes its semantic content. To prevent the original value identity from being diluted by the added dimensions, a lightweight one-hot component is further appended. An adaptive weight, guided by cluster compactness, then determines how strongly the enrichment dimensions enter the final representation. With this design, experiments on eight benchmark datasets yield an average ARI rank of 1.3 against seven representative competitors.
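The representation described above (a one-hot identity part plus a weighted semantic embedding) can be sketched as a simple concatenation. The toy two-dimensional embeddings and the fixed `weight` below are assumptions. The abstract does not say how the adaptive weight is computed from cluster compactness, so it is passed in as a plain number here.

```python
import numpy as np

def enrich(values, embeddings, weight):
    """Represent each categorical value as [one-hot identity | weight * embedding]."""
    vocab = sorted(set(values))
    index = {v: i for i, v in enumerate(vocab)}
    rows = []
    for v in values:
        one_hot = np.zeros(len(vocab))
        one_hot[index[v]] = 1.0
        semantic = weight * np.asarray(embeddings[v], dtype=float)
        rows.append(np.concatenate([one_hot, semantic]))
    return np.stack(rows)

# Hypothetical embeddings; in practice these would come from an external model.
toy_embeddings = {"nurse": [0.9, 0.1], "doctor": [0.8, 0.2], "lawyer": [0.1, 0.9]}
column = ["nurse", "doctor", "lawyer", "nurse"]
X = enrich(column, toy_embeddings, weight=0.5)
print(X.shape)
```

With `weight=0` the representation collapses to plain one-hot encoding, and larger weights let semantically close values such as "nurse" and "doctor" move nearer to each other.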

2512.00283 2026-05-29 cs.LG cs.AI q-bio.QM 版本更新

BioArc: Discovering Optimal Neural Architectures for Biological Foundation Models

BioArc:发现生物学基础模型的最优神经架构

Yi Fang, Haoran Xu, Jiaxin Han, Sirui Ding, Yizhi Wang, Yue Wang, Xuan Wang

发表机构 * Department of Computer Science, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工学院计算机科学系) Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA(弗吉尼亚理工学院电气与计算机工程系) Department of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA(卡内基梅隆大学计算机科学系) Department of Biomedical Data Science, Stanford University, Stanford, CA, USA(斯坦福大学生物医学数据科学系)

AI总结 针对现有基础模型架构直接迁移至生物学领域时忽视生物数据独特性质的问题,提出BioArc框架,利用神经架构搜索系统探索架构设计空间,发现高性能架构并提炼设计原则,同时提出架构预测方法以高效预测新任务的最优架构。

Comments Accepted at the 43rd International Conference on Machine Learning (ICML 2026)

详情
AI中文摘要

基础模型已彻底改变了自然语言处理(NLP)和计算机视觉(CV)等多个领域。尽管已有努力将通用AI领域中基础模型的成功迁移到生物学,但现有工作主要直接采用来自通用机器学习领域的现有基础模型架构,而未考虑每种生物数据模态独特的物理化学和结构特性进行系统设计。这导致性能欠佳,因为这些改造后的架构难以捕捉生物数据固有的长程依赖、稀疏信息和复杂的底层“语法”。为解决这一差距,我们引入了BioArc,这是一个新颖的框架,旨在超越直觉驱动的架构设计,转向生物学基础模型的原理性、自动化架构发现。利用神经架构搜索(NAS),BioArc系统性地探索了广阔的架构设计空间,跨多种生物模态评估架构,同时严格分析架构、分词和训练策略之间的相互作用。这一大规模分析识别出新颖的高性能架构,使我们能够提炼出一套经验性设计原则,以指导未来的模型开发。此外,为充分利用这套发现的原理性架构,我们提出并比较了几种架构预测方法,这些方法能够有效且高效地预测新生物学任务的最优架构。总体而言,我们的工作提供了一个基础性资源和一套原理性方法论,以指导下一代生物学任务特定模型和基础模型的创建。

英文摘要

Foundation models have revolutionized various fields such as natural language processing (NLP) and computer vision (CV). While efforts have been made to transfer the success of the foundation models in general AI domains to biology, existing works focus on directly adopting the existing foundation model architectures from general machine learning domains without a systematic design considering the unique physicochemical and structural properties of each biological data modality. This leads to suboptimal performance, as these repurposed architectures struggle to capture the long-range dependencies, sparse information, and complex underlying "grammars" inherent to biological data. To address this gap, we introduce BioArc, a novel framework designed to move beyond intuition-driven architecture design towards principled, automated architecture discovery for biological foundation models. Leveraging Neural Architecture Search (NAS), BioArc systematically explores a vast architecture design space, evaluating architectures across multiple biological modalities while rigorously analyzing the interplay between architecture, tokenization, and training strategies. This large-scale analysis identifies novel, high-performance architectures, allowing us to distill a set of empirical design principles to guide future model development. Furthermore, to make the best of this set of discovered principled architectures, we propose and compare several architecture prediction methods that effectively and efficiently predict optimal architectures for new biological tasks. Overall, our work provides a foundational resource and a principled methodology to guide the creation of the next generation of task-specific and foundation models for biology.

2511.16815 2026-05-29 stat.ML cs.LG 版本更新

BITS for GAPS: Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates

BITS for GAPS:用于层次高斯过程代理的贝叶斯信息论采样

Kyla D. Jones, Alexander W. Dowling

发表机构 * Department of Chemical and Biomolecular Engineering, University of Notre Dame, Notre Dame, IN 46556, USA(圣母大学化学与生物分子工程系)

AI总结 提出BITS for GAPS框架,通过贝叶斯层次建模将超参数不确定性传播到采样准则中,实现基于高斯过程代理模型的信息论实验设计,并在汽液平衡案例中验证其提升预测精度和信息增益的效果。

详情
Journal ref
Computers & Chemical Engineering, 197, 109041 (2026)
AI中文摘要

我们引入了用于层次高斯过程代理的贝叶斯信息论采样(BITS for GAPS),这是一个框架,能够实现基于高斯过程的代理模型的信息论实验设计。与标准方法(在采集函数中使用固定或点估计的超参数)不同,我们的方法通过贝叶斯层次建模将超参数不确定性传播到采样准则中。在该框架中,潜在函数接受高斯过程先验,而超参数被赋予额外的先验以捕捉建模者对控制物理现象的知识。因此,采集函数同时包含了来自潜在函数及其超参数的不确定性,确保采样由数据稀缺性和模型不确定性共同指导。我们进一步在此背景下建立了理论结果:后验微分熵的闭式近似和下界。我们通过一个汽液平衡案例研究展示了该框架在混合建模中的实用性。具体来说,我们为二元混合物中的潜在活度系数构建了一个代理模型。通过将代理嵌入扩展形式的拉乌尔定律中,我们构建了一个混合模型。该混合模型随后用于指导蒸馏设计。该案例研究展示了如何将部分物理知识转化为层次高斯过程代理。它还表明,使用BITS for GAPS通过瞄准Wilson活度模型的高不确定性区域,增加了期望信息增益和预测准确性。总体而言,BITS for GAPS是一个用于复杂物理系统中自适应数据采集的通用不确定性感知框架。

英文摘要

We introduce Bayesian Information-Theoretic Sampling for hierarchical GAussian Process Surrogates (BITS for GAPS), a framework enabling information-theoretic experimental design of Gaussian process-based surrogate models. Unlike standard methods, which use fixed or point-estimated hyperparameters in acquisition functions, our approach propagates hyperparameter uncertainty into the sampling criterion through Bayesian hierarchical modeling. In this framework, a latent function receives a Gaussian process prior, while hyperparameters are assigned additional priors to capture the modeler's knowledge of the governing physical phenomena. Consequently, the acquisition function incorporates uncertainties from both the latent function and its hyperparameters, ensuring that sampling is guided by both data scarcity and model uncertainty. We further establish theoretical results in this context: a closed-form approximation and a lower bound of the posterior differential entropy. We demonstrate the framework's utility for hybrid modeling with a vapor-liquid equilibrium case study. Specifically, we build a surrogate model for latent activity coefficients in a binary mixture. We construct a hybrid model by embedding the surrogate into an extended form of Raoult's law. This hybrid model then informs distillation design. This case study shows how partial physical knowledge can be translated into a hierarchical Gaussian process surrogate. It also shows that using BITS for GAPS increases expected information gain and predictive accuracy by targeting high-uncertainty regions of the Wilson activity model. Overall, BITS for GAPS is a generalized uncertainty-aware framework for adaptive data acquisition in complex physical systems.
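To give a feel for why hyperparameter uncertainty changes an entropy-based sampling criterion, the sketch below uses two standard facts: a Gaussian with variance v has differential entropy 0.5 * ln(2 * pi * e * v), and, because entropy is concave, the entropy of a mixture is at least the weighted average of its components' entropies. Treating the predictive distribution as a finite mixture over hyperparameter samples is an assumption made here for illustration. The paper's own closed-form approximation and lower bound are not given in the abstract and may differ.

```python
import math

def gaussian_entropy(var):
    """Differential entropy (nats) of a univariate Gaussian with variance var."""
    return 0.5 * math.log(2 * math.pi * math.e * var)

def mixture_entropy_lower_bound(weights, variances):
    """Concavity bound: H(sum_i w_i N_i) >= sum_i w_i H(N_i)."""
    return sum(w * gaussian_entropy(v) for w, v in zip(weights, variances))

# Hypothetical predictive variances at one candidate input under two
# equally weighted hyperparameter samples.
weights = [0.5, 0.5]
variances = [1.0, 4.0]
print(round(mixture_entropy_lower_bound(weights, variances), 4))
```

A point-estimate approach would use a single variance; the mixture view keeps the spread across hyperparameter samples in the criterion.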

2511.14584 2026-05-29 cs.LG cs.AI 版本更新

ReflexGrad: Within-Episode Failure Recovery in LLM Agents via Progress-Gated Dual-Process Routing

ReflexGrad: 基于进度门控双过程路由的LLM智能体片段内故障恢复

Ankush Kadu, Aswanth Krishnan

发表机构 * GitHub

AI总结 提出ReflexGrad双过程架构,通过进度门控路由在无演示条件下实现LLM智能体片段内故障恢复,显著提升任务成功率。

Comments 18 pages, 4 figures, 10 tables. Accepted at ICML 2026 FoGen Workshop

详情
AI中文摘要

我们提出ReflexGrad,一种用于LLM智能体在无演示条件下进行片段内故障恢复的双过程架构。当智能体过早采用错误方法并耗尽步骤预算时,故障后的轨迹包含逃脱所需的信息——但尚无已发表的架构在单个片段内利用这一信息。ReflexGrad在快速过程(每k=3步进行TextGrad风格的连续优化)和慢速过程(当m=5个连续低进度分数触发路由门时进行Reflexion风格的因果诊断)之间进行路由。确定性优先级合并保持自然语言策略的一致性,每次慢速激活产生三个可观察的产物:可复现的触发器、因果诊断和验证后的修复。在ALFWorld 134个任务、n=10个种子、无演示条件下,ReflexGrad将Qwen-3-8B从35.1%提升至75.4%(+40.3个百分点),超过计算匹配的1-shot LATS 2.7个百分点(p≈0.01)、ToT 5.7个百分点(p<10^{-4})和Self-Refine 6.7个百分点(p<10^{-5});在GPT-5上提升从46.3%至88.1%(+41.8个百分点)。1.5个百分点的跨模型差异在种子噪声范围内(p≈0.13),表明路由机制而非模型规模是增益的主要来源。代码、提示、逐种子日志和敏感性扫描已发布。

英文摘要

We present ReflexGrad, a dual-process architecture for within-episode failure recovery in LLM agents without demonstrations. When agents commit to a wrong approach early and exhaust the step budget, the post-failure trajectory contains the information to escape -- but no published architecture acts on it within a single episode. ReflexGrad routes between a fast process (TextGrad-style continuous refinement every $k{=}3$ steps) and a slow process (Reflexion-style causal diagnosis when $m{=}5$ consecutive low-progress scores fire a routing gate). A deterministic priority merge keeps the natural-language policy coherent, and each slow activation emits three observable artifacts: a reproducible trigger, a causal diagnostic, and a verified fix. On ALFWorld 134 tasks, $n{=}10$ seeds, no demonstrations, ReflexGrad lifts Qwen-3-8B from $35.1\%$ to $75.4\%$ ($+40.3$pp), beating compute-matched 1-shot LATS by $+2.7$pp ($p{\approx}0.01$), ToT by $+5.7$pp ($p{<}10^{-4}$), and Self-Refine by $+6.7$pp ($p{<}10^{-5}$); on GPT-5 the lift is $46.3{\to}88.1\%$ ($+41.8$pp). The $1.5$pp cross-model difference is within seed noise ($p{\approx}0.13$), suggesting that the routing mechanism, rather than model scale, is the primary source of the gain. Code, prompts, per-seed logs, and sensitivity sweeps are released.
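The routing rule described above (fast refinement every k = 3 steps, slow diagnosis after m = 5 consecutive low-progress scores) can be sketched as a small state machine. The threshold `low`, the streak reset after a slow activation, and the priority of slow over fast when both would fire are assumptions; the abstract only states k, m, and that the merge is deterministic.

```python
def route(progress_scores, k=3, m=5, low=0.2):
    """Return one decision per step: 'slow', 'fast', or 'act' (no refinement)."""
    decisions = []
    streak = 0  # consecutive low-progress steps seen so far
    for step, score in enumerate(progress_scores, start=1):
        streak = streak + 1 if score < low else 0
        if streak >= m:
            decisions.append("slow")   # gate fires: run the causal diagnosis
            streak = 0                 # assumed: the streak restarts afterwards
        elif step % k == 0:
            decisions.append("fast")   # periodic refinement
        else:
            decisions.append("act")
    return decisions

print(route([0.5, 0.6, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.7]))
```

In this trace the slow process fires at step 8, the fifth consecutive low score, and fast refinement resumes at step 9.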

2511.11830 2026-05-29 math.OC cs.LG 版本更新

A Computational Method for Solving the Stochastic Joint Replenishment Problem in High Dimensions

一种求解高维随机联合补货问题的计算方法

Barış Ata, Wouter van Eekelen, Yuan Zhong

发表机构 * Booth School of Business, University of Chicago(芝加哥大学商学院)

AI总结 针对高维随机联合补货问题,提出一种基于深度神经网络和脉冲控制近似的仿真计算方法,在高达50个SKU的问题中匹配或超越现有基准。

Comments 71 pages, 5 figures

详情
AI中文摘要

我们考虑一类高维随机联合补货问题的离散时间公式。首先,我们通过连续时间脉冲控制问题来近似原问题。利用脉冲控制问题、带跳的倒向随机微分方程(BSDEs)以及随机目标问题之间的联系,我们开发了一种新颖的、基于仿真的计算方法,该方法依赖深度神经网络来求解脉冲控制问题。基于该解,我们为原始(离散时间)随机联合补货问题提出了一种可实施的库存控制策略,并在系列测试问题中与最佳可用基准进行了比较。对于迄今为止研究的问题,我们的方法匹配或超越了能找到的最佳基准,并且在至少50维(即50个库存单位(SKU))的问题中计算可行。

英文摘要

We consider a discrete-time formulation for a class of high-dimensional stochastic joint replenishment problems. First, we approximate the problem by a continuous-time impulse control problem. Exploiting connections among the impulse control problem, backward stochastic differential equations (BSDEs) with jumps, and the stochastic target problem, we develop a novel, simulation-based computational method that relies on deep neural networks to solve the impulse control problem. Based on that solution, we propose an implementable inventory control policy for the original (discrete-time) stochastic joint replenishment problem, and test it against the best available benchmarks in a series of test problems. For the problems studied thus far, our method matches or beats the best benchmark we could find, and it is computationally feasible up to at least 50 dimensions -- that is, 50 stock-keeping units (SKUs).

2511.11703 2026-05-29 cs.LG cs.AI cs.RO 版本更新

Enhancing Reinforcement Learning in 3D Environments through Semantic Segmentation: A Case Study in ViZDoom

通过语义分割增强3D环境中的强化学习:以ViZDoom为例

Jin Huang

发表机构 * Hugo Huang(胡戈·黄)

AI总结 提出SS-only和RGB+SS两种输入表示,利用语义分割减少内存消耗并提升强化学习在3D环境中的性能,在ViZDoom中验证。

Comments Master's Thesis at the University of Edinburgh (2024)

详情
AI中文摘要

在高维感官输入的3D环境中,强化学习面临两大挑战:(1) 稳定学习所需的内存缓冲区导致的高内存消耗,以及(2) 部分可观测马尔可夫决策过程(POMDPs)的复杂性。本项目通过提出两种新颖的输入表示:SS-only和RGB+SS,两者均对RGB彩色图像进行语义分割,以应对这些挑战。在ViZDoom的死斗模式中进行了实验,利用完美的分割结果进行受控评估。我们的结果表明,SS-only能够将内存缓冲区的内存消耗减少至少66.6%,当应用如游程编码等最小开销的可向量化无损压缩技术时,可减少高达98.6%。同时,RGB+SS通过提供的额外语义信息显著增强了强化学习代理的性能。此外,我们探索了基于密度的热力图作为可视化强化学习代理移动模式的工具,并评估了其用于数据收集的适用性。与先前方法的简要比较突出了我们的方法如何克服在ViZDoom等3D环境中应用语义分割时的常见陷阱。

英文摘要

Reinforcement learning (RL) in 3D environments with high-dimensional sensory input poses two major challenges: (1) the high memory consumption induced by memory buffers required to stabilise learning, and (2) the complexity of learning in partially observable Markov Decision Processes (POMDPs). This project addresses these challenges by proposing two novel input representations: SS-only and RGB+SS, both employing semantic segmentation on RGB colour images. Experiments were conducted in deathmatches of ViZDoom, utilizing perfect segmentation results for controlled evaluation. Our results showed that SS-only was able to reduce the memory consumption of memory buffers by at least 66.6%, and up to 98.6% when a vectorisable lossless compression technique with minimal overhead such as run-length encoding is applied. Meanwhile, RGB+SS significantly enhances RL agents' performance with the additional semantic information provided. Furthermore, we explored density-based heatmapping as a tool to visualise RL agents' movement patterns and evaluate their suitability for data collection. A brief comparison with a previous approach highlights how our method overcame common pitfalls in applying semantic segmentation in 3D environments like ViZDoom.
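The memory figures above can be made tangible with a short sketch. Storing one label byte per pixel instead of three RGB bytes presumably accounts for the roughly two-thirds saving, and run-length encoding then compresses the long constant runs typical of segmentation masks. The vectorised NumPy encoder below is an assumed implementation written for illustration (it expects a non-empty mask) and is not the thesis code.

```python
import numpy as np

def rle_encode(mask):
    """Run-length encode a non-empty label mask; returns (values, run_lengths)."""
    flat = np.asarray(mask).ravel()
    change = np.flatnonzero(flat[1:] != flat[:-1]) + 1   # indices where a new run starts
    starts = np.concatenate(([0], change))
    lengths = np.diff(np.concatenate((starts, [flat.size])))
    return flat[starts], lengths

def rle_decode(values, lengths, shape):
    """Invert rle_encode exactly (lossless)."""
    return np.repeat(values, lengths).reshape(shape)

mask = np.zeros((4, 6), dtype=np.uint8)   # toy segmentation frame
mask[1:3, 2:5] = 7                        # one object with class id 7
values, lengths = rle_encode(mask)
restored = rle_decode(values, lengths, mask.shape)

rgb_bytes, label_bytes = 3, 1             # bytes per pixel
print("channel saving:", round(1 - label_bytes / rgb_bytes, 3))
print("runs:", len(values), "of", mask.size, "pixels")
```

Because the encoder uses only array operations, it avoids a Python-level loop over pixels, which is in the spirit of the "vectorisable, minimal overhead" requirement.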

2511.11505 2026-05-29 cs.LG 版本更新

FarSkip-Collective: Unhobbling Blocking Communication in Mixture of Experts Models

FarSkip-Collective: 解除混合专家模型中的阻塞通信

Yonatan Dukler, Guihong Li, Deval Shah, Jiang Liu, Vikram Appia, Emad Barsoum

发表机构 * Anonymous Institution, Anonymous City, Anonymous Region, Anonymous Country(匿名机构,匿名城市,匿名地区,匿名国家)

AI总结 提出FarSkip-Collective方法,通过修改模型架构跳过连接以重叠计算与通信,在16B至109B参数的最新模型上实现与原始版本相当的精度,并在推理和训练中显著加速。

Comments MLSys'26

详情
AI中文摘要

阻塞通信是在分布式环境中高效运行MoE的主要障碍。为此,我们提出了FarSkip-Collective,它修改了现代模型的架构,使其计算与通信能够重叠。我们的方法修改了模型架构以跳过连接,但事先并不清楚修改后的模型架构是否能保持同等能力,特别是对于大型最先进模型以及修改所有模型层的情况。我们对此给出了肯定回答,并完全转换了一系列参数从16B到109B的最新模型,使其通信能够重叠,同时实现了与其原始开源版本相当的精度。例如,我们通过自蒸馏转换了Llama 4 Scout (109B),在广泛的下游评估中,其平均精度在指令调优版本的1%以内。除了证明大型修改模型的精度保持外,我们还通过显式重叠通信与计算的优化实现,实现了FarSkip-Collective的优势,在现有框架中加速了训练和推理。在推理方面,我们在SGLang中使用专家并行服务转换后的DeepSeek-V3架构时,实现了32.6%的首令牌时间加速,并在预填充阶段达到了97.3%的通信-计算重叠。在训练期间,我们的方法在使用专家并行预训练DeepSeek-V3 MoE层时,使all-to-all集合通信实现了88.9%的通信重叠。

英文摘要

Blocking communication presents a major hurdle in running MoEs efficiently in distributed settings. To address this, we present FarSkip-Collective which modifies the architecture of modern models to enable overlapping of their computation with communication. Our approach modifies the architecture to skip connections in the model and it is unclear a priori whether the modified model architecture can remain as capable, especially for large state-of-the-art models and while modifying all of the model layers. We answer this question in the affirmative and fully convert a series of state-of-the-art models varying from 16B to 109B parameters to enable overlapping of their communication while achieving accuracy that is comparable with their original open-source releases. For example, we convert Llama 4 Scout (109B) via self-distillation and achieve average accuracy within 1% of its instruction tuned release averaged over a wide range of downstream evaluations. In addition to demonstrating retained accuracy of the large modified models, we realize the benefits of FarSkip-Collective through optimized implementations that explicitly overlap communication with computation, accelerating both training and inference in existing frameworks. For inference, we demonstrate 32.6% speedup in Time To First Token when serving a converted DeepSeek-V3 architecture with expert parallelism in SGLang and achieve 97.3% communication-computation overlap during the prefill stage. During training, our approach enables 88.9% communication overlap of the all-to-all communication collectives when pre-training DeepSeek-V3 MoE layers with expert parallelism.

2511.10861 2026-05-29 cs.CV cs.AI cs.LG 版本更新

An accuracy-aware extension to LRP-based pruning for CNNs to prevent cascading accuracy degradation in data-scarce transfer learning

一种面向CNN的基于LRP剪枝的精度感知扩展,以防止数据稀缺迁移学习中的级联精度下降

Daisuke Yasui, Toshitaka Matsuki, Hiroshi Sato

发表机构 * Mathematics and Computer Science National Defense Academy of Japan(日本防卫大学校数学与计算机科学系)

AI总结 针对数据稀缺迁移学习中预训练CNN剪枝导致的级联精度下降问题,提出一种精度感知的剪枝控制机制,通过动态调整剪枝率和顺序来抑制精度下降,提升模型压缩后的分类性能。

Comments Accepted to scientific reports. The title was revised during the peer review process

详情
AI中文摘要

在大规模数据集(如ImageNet)上预训练的卷积神经网络(CNN)被广泛用作特征提取器,从稀缺数据中构建特定任务的高精度分类模型。在此类场景中,由于数据稀缺,微调预训练CNN变得困难,因此必须使用固定权重。然而,当权重固定时,许多对目标任务无贡献的滤波器仍保留在模型中,导致不必要的冗余和效率降低。因此,需要有效的方法通过剪枝对推理不必要的滤波器来减小模型大小。为此,已有研究提出了利用逐层相关性传播(LRP)的方法。LRP量化每个滤波器对推理结果的贡献,从而可以剪枝低相关性的滤波器。然而,现有基于LRP的剪枝方法被观察到会导致级联精度下降。在本研究中,我们为现有基于LRP的滤波器剪枝方法引入了一种精度感知的剪枝控制机制,该机制通过使用类别精度的调和平均数动态调整剪枝率和剪枝顺序,抑制级联精度下降,并在小数据环境下压缩预训练模型的同时保持任务特定性能。我们证明,该控制机制有效缓解了级联精度下降,与现有基于LRP的剪枝方法相比,实现了更高的分类精度,将VGG16的精度-剪枝率曲线下的类别平均面积(AUC)比传统基于LRP的方法提高了约15%。

英文摘要

Convolutional Neural Networks (CNNs) pre-trained on large-scale datasets such as ImageNet are widely used as feature extractors to construct high-accuracy classification models from scarce data for specific tasks. In such scenarios, fine-tuning the pre-trained CNN is difficult due to data scarcity, necessitating the use of fixed weights. However, when the weights are kept fixed, many filters that do not contribute to the target task remain in the model, leading to unnecessary redundancy and reduced efficiency. Therefore, effective methods are needed to reduce model size by pruning filters that are unnecessary for inference. To address this, approaches utilizing Layer-wise Relevance Propagation (LRP) have been proposed. LRP quantifies the contribution of each filter to the inference result, enabling the pruning of filters with low relevance. However, existing LRP-based pruning methods have been observed to cause cascading accuracy degradation. In this study, we introduce an accuracy-aware pruning control mechanism for existing LRP-based filter pruning methods, which suppresses cascading accuracy degradation by dynamically adjusting the pruning rate and the pruning order using the harmonic mean of class accuracy, and compresses the pre-trained model while preserving task-specific performance in a small-data environment. We demonstrate that this control mechanism effectively mitigates cascading accuracy degradation and achieves higher classification accuracy compared to existing LRP-based pruning methods, improving the class-averaged area under the accuracy-pruning-rate curve (AUC) of VGG16 by approximately 15\% over conventional LRP-based approaches.
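The harmonic mean of class accuracy is a natural control signal because it drops sharply when any single class degrades, unlike the arithmetic mean. The sketch below computes it and applies a hypothetical controller that halves the pruning rate once the signal falls more than a tolerance below a reference value. The halving rule and the tolerance are assumptions; the abstract does not describe the exact adjustment.

```python
def harmonic_mean_accuracy(class_acc):
    """Harmonic mean of per-class accuracies; zero if any class has zero accuracy."""
    if any(a <= 0 for a in class_acc):
        return 0.0
    return len(class_acc) / sum(1.0 / a for a in class_acc)

def next_pruning_rate(rate, hm_now, hm_ref, tol=0.02):
    """Hypothetical rule: halve the rate when the harmonic mean drops by more than tol."""
    return rate * 0.5 if hm_ref - hm_now > tol else rate

balanced = [0.9, 0.9]
skewed = [0.5, 1.0]       # one class has collapsed to chance-like accuracy
print(harmonic_mean_accuracy(balanced), harmonic_mean_accuracy(skewed))
```

For the skewed case the arithmetic mean is 0.75 while the harmonic mean is about 0.667, so a controller driven by the harmonic mean reacts earlier to a single failing class.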

2511.04934 2026-05-29 cs.LG 版本更新

Leak@$k$: Unlearning Does Not Make LLMs Forget Under Probabilistic Decoding

Leak@$k$:在概率解码下,遗忘并未使LLM真正忘记

Hadi Reisizadeh, Jiajun Ruan, Yiwei Chen, Soumyadeep Pal, Sijia Liu, Mingyi Hong

发表机构 * University of Minnesota(明尼苏达大学) Michigan State University(密歇根州立大学) IBM Research(IBM研究院)

AI总结 本文发现现有大语言模型遗忘方法在概率解码下无法真正删除敏感信息,并提出新指标leak@$k$和算法RULE来评估和缓解知识泄露。

详情
AI中文摘要

大型语言模型(LLM)中的遗忘对于法规遵从和构建避免产生私人、有毒、非法或受版权保护内容的伦理生成式AI系统至关重要。尽管进展迅速,但在这项工作中,我们表明几乎所有现有的遗忘方法在实践中都未能实现真正的遗忘。具体来说,虽然在确定性(贪婪)解码下对这些“已遗忘”模型的评估通常表明使用标准基准成功移除了知识,但我们表明当使用标准概率解码对模型进行采样时,敏感信息可靠地重新出现。为了严格捕捉这一漏洞,我们引入了leak@$k$,一种新的元评估指标,用于量化在现实解码策略下从模型生成$k$个样本时遗忘知识重新出现的可能性。使用三个广泛采用的基准TOFU、MUSE和WMDP,我们使用leak@$k$指标进行了首次大规模、系统性的遗忘可靠性研究。我们的发现表明,知识泄露在方法和任务中持续存在,强调当前最先进的遗忘技术仅提供有限的遗忘。我们提出了一种算法,称为基于Leak@$k$指标的鲁棒遗忘(RULE)来解决这一问题。我们证明,RULE为TOFU基准提供了一个已遗忘模型,在大量生成样本下没有信息泄露。在MUSE基准上,RULE在大多数采样预算$k$下,在leak@$k$指标上优于最先进的遗忘方法。代码可在 https://github.com/OptimAI-Lab/Leak-k 获取。

英文摘要

Unlearning in large language models (LLMs) is critical for regulatory compliance and for building ethical generative AI systems that avoid producing private, toxic, illegal, or copyrighted content. Despite rapid progress, in this work, we show that \textit{almost all} existing unlearning methods fail to achieve true forgetting in practice. Specifically, while evaluations of these `unlearned' models under deterministic (greedy) decoding often suggest successful knowledge removal using standard benchmarks, we show that sensitive information reliably resurfaces when models are sampled with standard probabilistic decoding. To rigorously capture this vulnerability, we introduce \texttt{leak@$k$}, a new meta-evaluation metric that quantifies the likelihood of forgotten knowledge reappearing when generating $k$ samples from the model under realistic decoding strategies. Using three widely adopted benchmarks, TOFU, MUSE, and WMDP, we conduct the first large-scale, systematic study of unlearning reliability using \texttt{leak@$k$} metric. Our findings demonstrate that knowledge leakage persists across methods and tasks, underscoring that current state-of-the-art (SOTA) unlearning techniques provide only limited forgetting. We propose an algorithm, termed Robust Unlearning under LEak@$k$ metric (\texttt{RULE}) to address this concern. We demonstrate that \texttt{RULE} provides an unlearned model for TOFU benchmark with no information leakage for a large number of generation samples. On the MUSE benchmark, \texttt{RULE} outperforms SOTA unlearning methods under the \texttt{leak@$k$} metric across most sampling budgets $k$. Codes are available at https://github.com/OptimAI-Lab/Leak-k.
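The gap between greedy and sampled decoding can be illustrated with a simple estimator. Suppose n responses are sampled for a prompt and c of them are judged to leak the forgotten fact; then the chance that at least one of k responses drawn without replacement leaks is 1 - C(n-c, k) / C(n, k). This is the familiar pass@k-style estimator and is only an assumed reading of "likelihood of forgotten knowledge reappearing when generating k samples". The paper's exact leak@$k$ definition is not given in the abstract and may differ.

```python
from math import comb

def leak_at_k(n, c, k):
    """Estimated probability that at least one of k sampled responses leaks,
    given that c of n sampled responses were judged to leak."""
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("need 0 <= c <= n and 1 <= k <= n")
    return 1.0 - comb(n - c, k) / comb(n, k)

# A model that leaks in only 2 of 100 samples looks safe for a single draw,
# but far less so once many samples are generated.
for k in (1, 10, 50):
    print(k, round(leak_at_k(100, 2, k), 3))
```

Under this reading, a per-sample leak rate that looks negligible for one generation can compound quickly as the sampling budget grows.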

2510.16060 2026-05-29 cs.LG cs.AI stat.ME stat.ML 版本更新

Beyond Accuracy: Are Time Series Foundation Models Well-Calibrated?

超越准确性:时间序列基础模型是否良好校准?

Coen Adler, Yuxin Chang, Felix Draxler, Samar Abdi, Padhraic Smyth

发表机构 * Department of Computer Science(计算机科学系) Department of Statistics(统计学系) Google, Irvine(谷歌(尔湾))

AI总结 本文系统评估了五个时间序列基础模型和两个基线的校准特性,发现基础模型校准优于基线且无系统性过度自信或信心不足。

Comments Published as a conference paper at ICLR 2026

详情
Journal ref
Proceedings of ICLR 2026
AI中文摘要

最近时间序列数据基础模型的发展引起了在各种应用中使用此类模型的广泛兴趣。尽管基础模型实现了最先进的预测性能,但它们的校准特性仍然相对未被充分探索,尽管校准在许多实际应用中可能至关重要。在本文中,我们研究了五个近期时间序列基础模型和两个竞争基线的校准相关特性。我们进行了一系列系统评估,包括模型校准(即过度自信或信心不足)、不同预测头的影响以及长期自回归预测下的校准。我们发现时间序列基础模型始终比基线模型校准得更好,并且往往不会系统性地过度自信或信心不足,这与在其他深度学习模型中常见的过度自信形成对比。

英文摘要

The recent development of foundation models for time series data has generated considerable interest in using such models across a variety of applications. Although foundation models achieve state-of-the-art predictive performance, their calibration properties remain relatively underexplored, despite the fact that calibration can be critical for many practical applications. In this paper, we investigate the calibration-related properties of five recent time series foundation models and two competitive baselines. We perform a series of systematic evaluations assessing model calibration (i.e., over- or under-confidence), effects of varying prediction heads, and calibration under long-term autoregressive forecasting. We find that time series foundation models are consistently better calibrated than baseline models and tend not to be either systematically over- or under-confident, in contrast to the overconfidence often seen in other deep learning models.
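
As an illustration of what "over- or under-confidence" means for a probabilistic forecaster, the sketch below compares the empirical coverage of prediction intervals with their nominal level. This is a generic calibration check, not the evaluation protocol of the paper; the function names and the interval-based formulation are assumptions.

```python
from typing import Sequence


def interval_coverage(y_true: Sequence[float],
                      lower: Sequence[float],
                      upper: Sequence[float]) -> float:
    """Fraction of observations that fall inside their prediction interval."""
    if not y_true or not (len(y_true) == len(lower) == len(upper)):
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for y, lo, hi in zip(y_true, lower, upper) if lo <= y <= hi)
    return hits / len(y_true)


def coverage_gap(y_true: Sequence[float],
                 lower: Sequence[float],
                 upper: Sequence[float],
                 nominal: float) -> float:
    """Empirical minus nominal coverage.

    Negative: intervals are too narrow (over-confident).
    Positive: intervals are too wide (under-confident).
    """
    return interval_coverage(y_true, lower, upper) - nominal
```

A well-calibrated forecaster would show a gap near zero at every nominal level, for example 50%, 80%, and 95%.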

2510.12310 2026-05-29 cs.CR cs.LG 版本更新

DeepTrust: Multi-Step Classification through Dissimilar Adversarial Representations for Robust Android Malware Detection

DeepTrust:通过不同对抗表示的多步分类实现鲁棒的Android恶意软件检测

Daniel Pulido-Cortázar, Daniel Gibert, Felip Manyà

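Affiliations * Artificial Intelligence Research Institute (IIIA-CSIC)

AI summary Proposes DeepTrust, a metaheuristic that arranges heterogeneous classifiers into a sequence activated conditionally in cascade and maximizes the dissimilarity of the internal models' representations, achieving robust detection under feature-space evasion attacks and taking first place in the competition at IEEE SaTML 2025.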

详情
AI中文摘要

在过去十年中,机器学习已被广泛用于识别恶意Android应用程序。然而,这些方法仍然容易受到对抗样本的攻击,即那些被巧妙操纵以欺骗机器学习模型做出错误预测的样本。本研究提出了DeepTrust,一种新颖的元启发式方法,它将灵活的分类器(如深度神经网络)排列成有序序列,最终决策由单个内部模型根据级联激活的条件做出。在2025年IEEE SaTML会议的鲁棒Android恶意软件检测竞赛中,DeepTrust获得了第一名并取得了最先进的结果,在特征空间逃逸攻击下,其性能比次优竞争对手高出266%。同时,它在非对抗性恶意软件上保持了最高的检测率,假阳性率低于1%。该方法的效果源于最大化内部模型之间学习表示的差异。通过使用诱导数据产生根本不同嵌入的分类器,决策空间对攻击者变得不可预测。这挫败了逃逸攻击固有的迭代扰动过程,从而在不牺牲干净样本准确性的情况下增强了系统的鲁棒性。

英文摘要

Over the last decade, machine learning has been extensively applied to identify malicious Android applications. However, such approaches remain vulnerable to adversarial examples, i.e., examples that are subtly manipulated to fool a machine learning model into making incorrect predictions. This research presents DeepTrust, a novel metaheuristic that arranges flexible classifiers, like deep neural networks, into an ordered sequence where the final decision is made by a single internal model based on conditions activated in cascade. In the Robust Android Malware Detection competition at the 2025 IEEE Conference SaTML, DeepTrust secured first place and achieved state-of-the-art results, outperforming the next-best competitor by up to 266% under feature-space evasion attacks. This is accomplished while maintaining the highest detection rate on non-adversarial malware and a false positive rate below 1%. The method's efficacy stems from maximizing the divergence of the learned representations among the internal models. By using classifiers that induce fundamentally dissimilar embeddings of the data, the decision space becomes unpredictable for an attacker. This frustrates the iterative perturbation process inherent to evasion attacks, enhancing system robustness without compromising accuracy on clean examples.

2510.12152 2026-05-29 stat.ML cs.LG 版本更新

Follow-the-Perturbed-Leader for Decoupled Bandits: Best-of-Both-Worlds and Practicality

解耦赌博机的跟随扰动领导者:两全其美与实用性

Chaiwon Kim, Jongyeong Lee, Min-hwan Oh

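Affiliations * Seoul National University, Seoul, Korea; Korea Institute of Science and Technology, Seoul, Korea

AI summary For the decoupled multi-armed bandit problem, proposes an efficient Follow-the-Perturbed-Leader policy that achieves constant regret in the stochastic regime and optimal $O(\sqrt{KT})$ regret in the adversarial regime while avoiding convex optimization and resampling, substantially reducing computational cost.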

Comments Accepted to ICML 2026, 31 pages

详情
AI中文摘要

我们研究了解耦多臂赌博机问题,其中学习者在每一轮分别选择一个臂进行探索,并选择另一个可能不同的臂进行利用。在此设置中,探索臂的损失被观察到但不承担,而利用臂的损失被承担但不被观察到。我们提出了一种高效的跟随扰动领导者(FTPL)策略,该策略在随机环境下实现常数遗憾,在对抗环境下实现最优$O(\sqrt{KT})$遗憾,从而获得两全其美(BOBW)保证。我们方法的一个关键特征是它完全避免了先前BOBW策略所需的凸优化以及FTPL赌博机策略中通常使用的重采样过程。这使得FTPL能够充分发挥其计算效率优势,大幅降低计算成本。我们通过实验证实,我们的策略不仅提高了运行时间,而且在两种环境下都表现出优越的遗憾性能。

英文摘要

We study the decoupled multi-armed bandit problem, where the learner separately selects one arm for exploration and one, possibly different, arm for exploitation at each round. In this setting, the loss of the explored arm is observed but not incurred, whereas the loss of the exploited arm is incurred without being observed. We propose an efficient Follow-the-Perturbed-Leader (FTPL) policy that achieves a Best-of-Both-Worlds (BOBW) guarantee, with constant regret in the stochastic regime and optimal $O(\sqrt{KT})$ regret in the adversarial regime. A key feature of our method is that it completely avoids both the convex optimization required by prior BOBW policies and the resampling procedures typically used in FTPL bandit policies. This allows FTPL to fully realize its computational efficiency advantages, leading to substantial reductions in computational cost. We empirically confirm that our policy not only improves the runtime but also demonstrates superior regret performance in both regimes.
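
The decoupled protocol described above can be sketched in a few lines. This is a toy illustration, not the policy from the paper: the uniform exploration distribution, the Fréchet perturbation with shape 2, the learning rate 1/sqrt(t), and the function names are all assumptions made here.

```python
import math
import random


def frechet(rng: random.Random, shape: float = 2.0) -> float:
    """Draw from a Frechet distribution by inverse-CDF sampling."""
    u = max(rng.random(), 1e-12)  # keep u inside (0, 1) so -log(u) is finite and positive
    return (-math.log(u)) ** (-1.0 / shape)


def run_decoupled_ftpl(losses, seed=0):
    """losses[t][i] is the loss of arm i at round t; returns the exploited arm per round."""
    rng = random.Random(seed)
    num_arms = len(losses[0])
    est = [0.0] * num_arms  # cumulative importance-weighted loss estimates
    exploited = []
    for t, round_losses in enumerate(losses, start=1):
        eta = 1.0 / math.sqrt(t)  # assumed learning-rate schedule
        # Exploit: follow the perturbed leader (smallest estimate minus scaled noise).
        scores = [est[i] - frechet(rng) / eta for i in range(num_arms)]
        exploit_arm = min(range(num_arms), key=scores.__getitem__)
        exploited.append(exploit_arm)
        # Explore: uniform arm (assumption); its loss is observed but not incurred.
        explore_arm = rng.randrange(num_arms)
        # Each arm is explored with probability 1/K, so the weight is K.
        est[explore_arm] += round_losses[explore_arm] * num_arms
    return exploited
```

In this sketch the exploration probability is known in closed form, so the loss estimate needs neither an optimization step nor resampling; how the paper obtains the same property for its own policy is not stated in the abstract.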

2510.10020 2026-05-29 stat.ML cs.LG q-bio.BM 版本更新

Calibrating Generative Models to Distributional Constraints

生成模型的分布约束校准

Henry D. Smith, Nathaniel L. Diamant, Brian L. Trippe

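Affiliations * Stanford University, Palo Alto, CA USA

AI summary Addresses miscalibration, where statistics of a generative model's sampling distribution deviate from desired values, by framing calibration as a constrained optimization problem and fine-tuning with two surrogate objectives, the relax loss and the reward loss; this substantially reduces calibration error under hundreds of simultaneous constraints in protein design, image generation, and language modeling.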

Comments To appear at the International Conference on Machine Learning (ICML), 2026. Codebase accompanying the paper is available at: https://github.com/smithhenryd/cgm

详情
AI中文摘要

生成模型经常存在校准不足的问题,即采样分布的统计量(例如给定类别中生成样本的比例)偏离期望值。我们将校准形式化为一个受约束的优化问题,并寻找在满足校准约束条件下与原始模型在Kullback-Leibler散度上最接近的模型。为了解决精确施加这些约束的难解性,我们引入了两种用于微调的替代目标:(1) 松弛损失,用校准惩罚项替代约束;(2) 奖励损失,将校准转化为奖励微调问题。我们证明,这些方法在数百个同时约束和参数高达九十亿的模型上显著降低了校准误差,应用范围涵盖蛋白质设计、图像生成和语言建模。

英文摘要

Generative models frequently suffer from miscalibration, wherein statistics of the sampling distribution, such as the fraction of generations in a given class, deviate from desired values. We frame calibration as a constrained optimization problem and seek the closest model in Kullback-Leibler divergence satisfying a calibration constraint. To address the intractability of imposing these constraints exactly, we introduce two surrogate objectives for fine-tuning: (1) the relax loss, which replaces the constraint with a miscalibration penalty, and (2) the reward loss, which converts calibration into a reward fine-tuning problem. We demonstrate that these approaches substantially reduce calibration error across hundreds of simultaneous constraints and models with up to nine billion parameters, spanning applications in protein design, image generation, and language modeling.
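
For a small discrete distribution the constrained problem can be solved exactly, which may help build intuition for what the surrogate losses approximate at scale. The sketch below assumes the divergence is KL(q || p) and a single constraint E_q[h] = target; under those assumptions the minimizer is the exponential tilt q(x) ∝ p(x) exp(λ h(x)), and λ can be found by bisection. The function name and the bracket [-50, 50] are choices made here, not taken from the paper.

```python
import math


def calibrate_discrete(p, h, target, tol=1e-12):
    """Return q minimizing KL(q || p) subject to sum_x q(x) * h(x) == target."""

    def tilted(lam):
        weights = [p_x * math.exp(lam * h_x) for p_x, h_x in zip(p, h)]
        total = sum(weights)
        return [w / total for w in weights]

    def mean_h(lam):
        return sum(q_x * h_x for q_x, h_x in zip(tilted(lam), h))

    low, high = -50.0, 50.0  # assumed bracket for the multiplier
    if not mean_h(low) <= target <= mean_h(high):
        raise ValueError("target is not reachable within the bracket")
    # mean_h is non-decreasing in lam, so bisection converges.
    while high - low > tol:
        mid = 0.5 * (low + high)
        if mean_h(mid) < target:
            low = mid
        else:
            high = mid
    return tilted(0.5 * (low + high))
```

With p = [0.2, 0.3, 0.5] and h the indicator of the first class, asking for a class fraction of 0.5 raises the first probability to 0.5 and rescales the other two proportionally.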

2510.06063 2026-05-29 cs.AI cs.IT cs.LG math.IT 版本更新

TelecomTS: A Multi-Modal Observability Dataset for Time Series and Language Analysis

TelecomTS:面向时间序列与语言分析的多模态可观测性数据集

Austin Feng, Andreas Varvarigos, Ioannis Panitsas, Daniela Fernandez, Jinbiao Wei, Yuwei Guo, Jialin Chen, Ali Maatouk, Leandros Tassiulas, Rex Ying

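Affiliations * Yale University; Johns Hopkins University

AI summary Presents TelecomTS, a large-scale multi-modal observability dataset from a 5G telecommunications network; with heterogeneous covariates that retain absolute scale information and diverse downstream tasks (anomaly detection, root cause analysis, multi-modal question answering), it exposes the shortcomings of existing models on the noisy, abrupt dynamics of observability data.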

详情
AI中文摘要

现代企业在监控复杂系统时会产生大量的时间序列指标流,即所谓的可观测性数据。与来自气候等领域的传统时间序列不同,可观测性数据具有零膨胀、高度随机且时间结构极小的特点。尽管这些数据至关重要,但由于专有限制和隐私问题,可观测性数据集在公开基准中仍然代表性不足。现有数据集通常经过匿名化和归一化处理,去除了尺度信息,限制了其在异常检测、根因分析和多模态推理等任务中的应用。为弥补这一空白,我们引入了TelecomTS,这是一个源自5G电信网络的大规模可观测性数据集。TelecomTS包含具有明确绝对尺度信息的异质、去匿名化协变量,并提供多样化的下游任务套件,包括异常检测、根因分析和多模态问答。对最先进的时间序列、语言、推理和多模态基础模型的基准测试表明,现有方法难以应对可观测性数据特有的突变、高噪声和高方差动态特性。我们的实验进一步强调了保留协变量绝对尺度的重要性,凸显了开发能够原生利用尺度信息的基础时间序列模型以应对实际可观测性应用需求的必要性。代码可在https://github.com/Ali-maatouk/TelecomTS获取。

英文摘要

Modern enterprises generate vast streams of time series metrics when monitoring complex systems, known as observability data. Unlike conventional time series from domains such as climate, observability data are zero-inflated, highly stochastic, and exhibit minimal temporal structure. Despite their importance, observability datasets remain underrepresented in public benchmarks due to proprietary restrictions and privacy concerns. Existing datasets are often anonymized and normalized, removing scale information and limiting their use for tasks such as anomaly detection, root cause analysis, and multi-modal reasoning. To address this gap, we introduce TelecomTS, a large-scale observability dataset derived from a 5G telecommunications network. TelecomTS features heterogeneous, de-anonymized covariates with explicit absolute scale information and provides a diverse suite of downstream tasks, including anomaly detection, root cause analysis, and multi-modal question-answering. Benchmarking state-of-the-art time series, language, reasoning, and multi-modal foundation models reveals that existing approaches struggle with the abrupt, noisy, and high-variance dynamics characteristic of observability data. Our experiments further underscore the importance of preserving covariates' absolute scale, emphasizing the need for foundation time series models that natively leverage scale information for practical real-world observability applications. The code is available at: https://github.com/Ali-maatouk/TelecomTS.

2510.04758 2026-05-29 cs.LG 版本更新

Provable Affine Identifiability of Nonlinear CCA under Latent Distributional Priors

可证明的非线性CCA在潜在分布先验下的仿射可辨识性

Zhiwei Han, Stefan Matthes, Hao Shen

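Affiliations * fortiss GmbH, Munich, Germany; Technical University of Munich

AI summary By transporting the analysis from the observation space to the source space and using orthogonal polynomial expansions of bivariate distributions, this paper proves that nonlinear canonical correlation analysis (CCA) recovers the ground-truth latent factors up to an affine transformation under specific distributional priors, and provides a theoretical basis for the finite-sample convergence of ridge-regularized empirical CCA.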

详情
AI中文摘要

在这项工作中,我们建立了非线性典型相关分析(CCA)恢复真实潜在因子至仿射变换的充分条件。通过将分析从观测空间迁移到源空间,我们将双变量分布正交多项式展开的经典统计结果扩展到表示学习,证明了在特定分布先验下的仿射可辨识性。我们正式证明白化是确保所学映射有界性和良态性的严格必要条件。此外,我们通过证明岭正则化经验CCA在有限样本下收敛到其总体对应物,弥合了理论与实践之间的差距。最后,我们的发现为近期基于相关性的非对比学习方法的实证成功提供了严格的理论基础。在合成和渲染图像数据集上的实验以及系统性消融研究验证了预测的恢复行为,并说明了当假设被违反时出现的失败模式。

英文摘要

In this work, we establish the sufficient conditions under which nonlinear Canonical Correlation Analysis (CCA) recovers ground-truth latent factors up to an affine transformation. By transporting the analysis from the observation space to the source space, we extend classical statistical results on orthogonal polynomial expansions of bivariate distributions to representation learning, proving affine identifiability under specific distributional priors. We formally demonstrate that whitening is strictly necessary to ensure the boundedness and well-conditioning of the learned mappings. Furthermore, we bridge the gap between theory and practice by proving that ridge-regularized empirical CCA converges to its population counterpart in the finite-sample regime. Finally, our findings provide a rigorous theoretical foundation explaining the empirical success of recent correlation-based non-contrastive learning methods. Experiments on synthetic and rendered image datasets, alongside systematic ablations, validate the predicted recovery behavior and illustrate the failure modes that arise when the assumptions are violated.

2510.03013 2026-05-29 cs.LG 版本更新

Distributional Inverse Reinforcement Learning

分布逆强化学习

Feiyang Wu, Ye Zhao, Anqi Wu

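Affiliations * School of Computational Science and Engineering, Georgia Institute of Technology, Atlanta, USA; George W. Woodruff School of Mechanical Engineering, Georgia Institute of Technology, Atlanta, USA

AI summary Proposes a distributional framework for offline inverse reinforcement learning that minimizes first-order stochastic dominance violations and integrates distortion risk measures, jointly modeling uncertainty over reward functions and the full distribution of returns to recover both reward distributions and distribution-aware policies.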

Comments ICML 2026 Oral

详情
AI中文摘要

我们提出了一种用于离线逆强化学习(IRL)的分布框架,该框架联合建模奖励函数的不确定性和回报的完整分布。与恢复确定性奖励估计或仅匹配期望回报的传统IRL方法不同,我们的方法通过最小化一阶随机占优(FSD)违反,从而将扭曲风险度量(DRMs)整合到策略学习中,捕捉专家行为中更丰富的结构,特别是在学习奖励分布方面,使得能够恢复奖励分布和分布感知策略。该公式非常适合行为分析和风险感知模仿学习。理论分析表明,该算法以$\mathcal{O}(\varepsilon^{-2})$的迭代复杂度收敛。在合成基准、真实神经行为数据和MuJoCo控制任务上的实验结果表明,我们的方法恢复了富有表现力的奖励表示,并实现了最先进的性能。

英文摘要

We propose a distributional framework for offline Inverse Reinforcement Learning (IRL) that jointly models uncertainty over reward functions and full distributions of returns. Unlike conventional IRL approaches that recover a deterministic reward estimate or match only expected returns, our method captures richer structure in expert behavior, particularly in learning the reward distribution, by minimizing first-order stochastic dominance (FSD) violations and thus integrating distortion risk measures (DRMs) into policy learning, enabling the recovery of both reward distributions and distribution-aware policies. This formulation is well-suited for behavior analysis and risk-aware imitation learning. Theoretical analysis shows that the algorithm converges with $\mathcal{O}(\varepsilon^{-2})$ iteration complexity. Empirical results on synthetic benchmarks, real-world neurobehavioral data, and MuJoCo control tasks demonstrate that our method recovers expressive reward representations and achieves state-of-the-art performance.

2510.02480 2026-05-29 cs.AI cs.LG 版本更新

Controlling the Risk of Corrupted Contexts for Language Models via Early-Exiting

通过早退机制控制语言模型中有害上下文的风险

Andrea Wynn, Metod Jazbec, Charith Peris, Rinat Khaziev, Anqi Liu, Daniel Khashabi, Eric Nalisnick

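Affiliations * Johns Hopkins University; University of Amsterdam; Amazon AGI; Amazon Alexa

AI summary Proposes a method that combines dynamic early-exit prediction with distribution-free risk control to limit how much harmful context can degrade language model performance, while gaining computational efficiency on helpful context.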

Comments Accepted to ICML 2026

详情
AI中文摘要

大型语言模型(LLMs)可能受到有害或不相关上下文的影响,这会显著损害模型在下游任务上的性能。这促使我们设计具有内置机制的原则性方案,以防范此类“垃圾进,垃圾出”场景。我们提出一种新颖方法,限制有害上下文对模型性能的退化程度。首先,我们定义模型的基线“安全”行为——即无任何上下文(零样本)时的模型性能。接着,我们应用无分布风险控制(DFRC)来控制用户提供的上下文将性能降至该安全零样本基线以下的程度。我们通过利用动态早退预测实现这一点,忽略那些最关注不安全输入的后注意力头。最后,我们提出对DFRC的修改,使其既能控制有害输入的风险,又能利用有益输入的性能和效率提升。我们在涵盖上下文学习和开放式问答的9项任务上展示了理论和实证结果,表明我们的方法能有效控制有害上下文的风险,同时在使用有益上下文时实现显著的计算效率提升。

英文摘要

Large language models (LLMs) can be influenced by harmful or irrelevant context, which can significantly harm model performance on downstream tasks. This motivates principled designs in which LLM systems include built-in mechanisms to guard against such "garbage in, garbage out" scenarios. We propose a novel approach to limit the degree to which harmful context can degrade model performance. First, we define a baseline "safe" behavior for the model -- the model's performance given no context at all (zero-shot). Next, we apply distribution-free risk control (DFRC) to control the extent to which the user-provided context can degrade performance below this safe zero-shot baseline. We achieve this by leveraging dynamic early exit prediction, ignoring later attention heads that attend most to the unsafe inputs. Finally, we propose modifications to DFRC that allow it to both control risk for harmful inputs \textit{and} leverage performance and efficiency gains on helpful inputs. We present both theoretical and empirical results across 9 tasks spanning in-context learning and open-ended question answering, showing that our approach can effectively control risk for harmful context and simultaneously achieve substantial computational efficiency gains with helpful context.

2510.00777 2026-05-29 cs.LG 版本更新

In-Place Feedback: Reliable Refinement for Multi-Turn Expert-LLM Collaboration

原地反馈:多轮专家-LLM协作的可靠精炼方法

Youngbin Choi, Minjong Lee, Saemi Moon, Seunghyuk Cho, Chaehyeon Chung, MoonJeong Park, Dongwoo Kim

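Affiliations * Graduate School of Artificial Intelligence, POSTECH; Department of Computer Science and Engineering, POSTECH

AI summary Proposes in-place feedback, an interaction paradigm in which the user directly edits the model's previous response and the model continues generating from the edited context; it outperforms standard multi-turn feedback on five reasoning-intensive benchmarks with fewer tokens, and a user study confirms higher final-output satisfaction and lower fatigue.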

Comments 42 pages

详情
AI中文摘要

LLM生成的草稿常包含细微的事实或逻辑错误,但先前研究表明模型难以可靠地整合旨在修正这些错误的多轮反馈。我们提出原地反馈,一种交互范式,其中用户直接编辑模型先前的响应,模型从编辑后的上下文继续生成。在五个推理密集型基准上,原地反馈始终优于标准多轮反馈,同时需要更少的token,我们的细粒度分析表明,它能更可靠地应用修正并将修正传播到后续推理中。一项由领域专家精炼LLM生成摘要的用户研究证实了这些发现:参与者报告了更高的最终输出满意度和显著更低的疲劳感,而结合原地反馈和多轮反馈的混合策略在每个测量维度上得分最高。这些结果表明,直接编辑错误是专家-LLM协作的更有效范式。

英文摘要

LLM-generated drafts often contain subtle factual or logical errors, yet prior work shows that models struggle to reliably integrate multi-turn feedback aimed at fixing them. We propose in-place feedback, an interaction paradigm in which the user directly edits the model's previous response and the model continues generation from the edited context. In-place feedback consistently outperforms standard multi-turn feedback across five reasoning-intensive benchmarks while requiring fewer tokens, and our fine-grained analysis shows that it applies corrections more reliably and propagates them to subsequent reasoning. A user study with domain experts refining LLM-generated summaries corroborates these findings: participants report higher final-output satisfaction and substantially lower fatigue with in-place feedback, and a mixed strategy combining in-place and multi-turn feedback scores highest on every measured dimension. These results suggest that editing errors directly is a more effective paradigm for expert-LLM collaboration.

2509.21707 2026-05-29 stat.ML cs.LG stat.ME 版本更新

SADA: Safe and Adaptive Aggregation of Multiple Black-Box Predictions in Semi-Supervised Learning

SADA:半监督学习中多个黑箱预测的安全自适应聚合

Jiawei Shan, Zhifeng Chen, Yiming Dong, Yazhen Wang, Jiwei Zhao

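Affiliations * Department of Biostatistics & Medical Informatics, University of Wisconsin-Madison; Department of Statistics, University of Wisconsin-Madison

AI summary Proposes a method that safely and adaptively aggregates multiple black-box predictions of uncertain quality, guaranteed to perform no worse than using the labeled data alone and, when one prediction is perfect, to achieve a faster convergence rate or the semiparametric efficiency bound.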

详情
AI中文摘要

半监督学习(SSL)在实践中出现于标注数据稀缺或获取成本高昂,而大量未标注数据易于获取的情况下。随着机器学习技术的广泛采用,使用多种模型和算法(包括深度学习、大语言模型和生成式AI)生成多个预测标签已变得越来越可行。在本文中,我们提出了一种新颖方法,能够安全且自适应地聚合多个质量不确定的黑箱预测,用于推理和预测任务。我们的方法提供两个关键保证:(i)无论预测质量如何,其表现永远不会差于仅使用标注数据;(ii)如果任意一个预测(无需知道是哪一个)完美拟合真实标签,算法会自适应地利用这一点,以实现更快的收敛速度或半参数效率界。我们通过小规模模拟和两项具有不同科学目标的真实数据分析展示了所提算法的有效性。提供了用户友好的R包sada以促进实际实施。

英文摘要

Semi-supervised learning (SSL) arises in practice when labeled data are scarce or expensive to obtain, while large quantities of unlabeled data are readily available. With the growing adoption of machine learning techniques, it has become increasingly feasible to generate multiple predicted labels using a variety of models and algorithms, including deep learning, large language models, and generative AI. In this paper, we propose a novel approach that safely and adaptively aggregates multiple black-box predictions of uncertain quality for both inference and prediction tasks. Our method provides two key guarantees: (i) it never performs worse than using the labeled data alone, regardless of the quality of the predictions; and (ii) if any one of the predictions (without knowing which one) perfectly fits the ground truth, the algorithm adaptively exploits this to achieve either a faster convergence rate or the semiparametric efficiency bound. We demonstrate the effectiveness of the proposed algorithm through small-scale simulations and two real-data analyses with distinct scientific goals. A user-friendly R package, sada, is provided to facilitate practical implementation.

2509.17208 2026-05-29 cs.LG physics.atm-clus 版本更新

Active Learning for Machine Learning Driven Molecular Dynamics

主动学习驱动的分子动力学机器学习

Kevin Bachelor, Sanya Murdeshwar, Daniel Sabo, Razvan Marinescu

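Affiliations * University of California, Santa Cruz; GiwoTech Inc.

AI summary To counter the degradation of machine-learned coarse-grained potentials when simulations reach under-sampled conformations, proposes an RMSD-based active learning framework that queries an oracle to generate data on the fly, correcting coverage gaps while preserving coarse-grained efficiency and improving the W1 metric in TICA space by 33.05% for a model trained on the Chignolin protein.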

Comments 9 pages, 4 figures, for NeurIPS Workshop: Machine Learning and the Physical Sciences 2025

详情
AI中文摘要

机器学习粗粒化(CG)势函数速度快,但当模拟到达欠采样的生物分子构象时性能会随时间退化,而生成广泛的全原子(AA)数据来应对这一问题在计算上不可行。我们提出了一种用于分子动力学(MD)中CG神经网络势函数的新型主动学习(AL)框架。该方法基于CGSchNet模型,采用从MD模拟中基于均方根偏差(RMSD)的帧选择,通过在神经网络势函数训练过程中查询预言机来实时生成数据。该框架在保持CG级效率的同时,在RMSD识别的精确覆盖缺口处修正模型。通过训练粗粒化神经网络势函数CGSchNet,我们实验证明该框架探索了先前未见过的构型,并在构象空间中未探索的区域上训练模型。我们的主动学习框架使得在Chignolin蛋白上训练的CGSchNet模型在内部基准测试套件上的时间滞后独立成分分析(TICA)空间中,Wasserstein-1(W1)指标提升了33.05%。

英文摘要

Machine-learned coarse-grained (CG) potentials are fast, but degrade over time when simulations reach under-sampled bio-molecular conformations, and generating widespread all-atom (AA) data to combat this is computationally infeasible. We propose a novel active learning (AL) framework for CG neural network potentials in molecular dynamics (MD). Building on the CGSchNet model, our method employs root mean squared deviation (RMSD)-based frame selection from MD simulations in order to generate data on-the-fly by querying an oracle during the training of a neural network potential. This framework preserves CG-level efficiency while correcting the model at precise, RMSD-identified coverage gaps. By training CGSchNet, a coarse-grained neural network potential, we empirically show that our framework explores previously unseen configurations and trains the model on unexplored regions of conformational space. Our active learning framework enables a CGSchNet model trained on the Chignolin protein to achieve a 33.05\% improvement in the Wasserstein-1 (W1) metric in Time-lagged Independent Component Analysis (TICA) space on an in-house benchmark suite.
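
The RMSD-based frame selection mentioned above can be illustrated with a small sketch. It assumes frames are already aligned (no superposition step), that a frame is a list of (x, y, z) bead coordinates, and that a frame is queried when its RMSD to every training frame exceeds a threshold; the threshold rule and the function names are assumptions, since the abstract does not give the exact criterion.

```python
import math


def rmsd(frame_a, frame_b):
    """RMSD between two pre-aligned frames given as lists of (x, y, z) coordinates."""
    if not frame_a or len(frame_a) != len(frame_b):
        raise ValueError("frames must be non-empty and have the same number of beads")
    squared = sum(
        (a - b) ** 2
        for bead_a, bead_b in zip(frame_a, frame_b)
        for a, b in zip(bead_a, bead_b)
    )
    return math.sqrt(squared / len(frame_a))


def select_frames(sim_frames, train_frames, threshold):
    """Indices of simulated frames farther than `threshold` from every training frame."""
    selected = []
    for idx, frame in enumerate(sim_frames):
        nearest = min(rmsd(frame, ref) for ref in train_frames)
        if nearest > threshold:
            selected.append(idx)  # candidate to send to the oracle for labeling
    return selected
```

Frames returned by `select_frames` would be the ones passed to the oracle, so that new training data concentrates where the current training set has no nearby conformation.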

2509.08194 2026-05-29 cs.LG stat.ML 版本更新

Prescribe-then-Select: Adaptive Policy Selection for Contextual Stochastic Optimization

先规定后选择:面向情境随机优化的自适应策略选择

Caio de Prospero Iglesias, Kimberly Villalobos Carballo, Dimitris Bertsimas

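AI summary For contextual stochastic optimization, where candidate policies perform heterogeneously across the covariate space, proposes the modular Prescribe-then-Select framework, which builds a library of feasible policies and learns a meta-policy with ensembles of Optimal Policy Trees for data-driven adaptive selection; it outperforms the best single policy on single-stage newsvendor and two-stage shipment planning problems.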

详情
AI中文摘要

我们解决了情境随机优化中的策略选择问题,其中协变量作为情境信息可用,且决策必须满足严格的可行性约束。在许多情境随机优化场景中,来自不同建模范式的多个候选策略在协变量空间上表现出异质性能,没有单一策略能够统一占优。我们提出了Prescribe-then-Select(PS)模块化框架,该框架首先构建一个可行候选策略库,然后学习一个元策略来为观测到的协变量选择最佳策略。我们使用在训练集上通过交叉验证训练的最优策略树集成来实现元策略,使策略选择完全数据驱动。在两个基准情境随机优化问题——单阶段报童和两阶段运输规划中,PS在协变量空间的异质区域始终优于最佳单一策略,并在不存在这种异质性时收敛到占优策略。所有重现结果的代码可在https://anonymous.4open.science/r/Prescribe-then-Select-TMLR获取。

英文摘要

We address the problem of policy selection in contextual stochastic optimization (CSO), where covariates are available as contextual information and decisions must satisfy hard feasibility constraints. In many CSO settings, multiple candidate policies--arising from different modeling paradigms--exhibit heterogeneous performance across the covariate space, with no single policy uniformly dominating. We propose Prescribe-then-Select (PS), a modular framework that first constructs a library of feasible candidate policies and then learns a meta-policy to select the best policy for the observed covariates. We implement the meta-policy using ensembles of Optimal Policy Trees trained via cross-validation on the training set, making policy choice entirely data-driven. Across two benchmark CSO problems--single-stage newsvendor and two-stage shipment planning--PS consistently outperforms the best single policy in heterogeneous regimes of the covariate space and converges to the dominant policy when such heterogeneity is absent. All the code to reproduce the results can be found at https://anonymous.4open.science/r/Prescribe-then-Select-TMLR.

2509.05771 2026-05-29 stat.ML cs.LG math.OC 版本更新

Risk-averse Fair Multi-class Classification

风险规避的公平多类分类

Darinka Dentcheva, Xiangyu Tian

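Affiliations * Department of Mathematical Sciences; Stevens Institute of Technology

AI summary Building on the theory of coherent risk measures and systemic risk, proposes a risk-averse multi-class classification framework for noisy, scarce, and unreliably labeled data, formulates a two-stage stochastic program via a system-theoretic approach with non-linear aggregation, designs a regularized decomposition algorithm to solve it, and shows that it also promotes fairness.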

详情
AI中文摘要

我们基于一致风险度量和系统性风险理论开发了一种新的分类框架。所提出的方法适用于数据存在噪声、稀缺(相对于问题维度)且标签可能不可靠的多类问题。在论文的第一部分,我们提供了使用系统性风险模型的基础,并展示了如何将其应用于线性和基于核的多类问题中。我们提出了一种通过非线性聚合的系统理论方法进行更高级的公式化,这导致了一个两阶段随机规划问题。设计了一种风险规避的正则化分解方法来求解该问题。在性能分析中,我们使用一种流行的多类方法作为所提出分类方法的基准。我们通过使用一致风险度量对该方法进行多种推广来说明我们的想法。所提出的风险规避方法的可行性在理论和数值上得到了支持。此外,我们证明了系统性风险度量的应用有助于在分类中强制执行公平性。对所提出模型的公平性进行了仔细的分析和实验。对于所有方法,我们的数值实验表明,它们在训练数据不可靠的情况下具有鲁棒性,并且在未知数据上的表现优于最小化期望分类误差的方法。此外,当类别数量增加时,性能会得到提升。

英文摘要

We develop a new classification framework based on the theory of coherent risk measures and systemic risk. The proposed approach is suitable for multi-class problems when the data is noisy, scarce (relative to the dimension of the problem), and the labeling might be unreliable. In the first part of our paper, we lay the foundation for using systemic risk models and show how to apply them in the context of linear and kernel-based multi-class problems. A more advanced formulation via a system-theoretic approach with non-linear aggregation is proposed, which leads to a two-stage stochastic programming problem. A risk-averse regularized decomposition method is designed to solve the problem. We use a popular multi-class method as a benchmark in the performance analysis of the proposed classification methods. We illustrate our ideas by proposing several generalizations of that method using coherent measures of risk. The viability of the proposed risk-averse methods is supported theoretically and numerically. Additionally, we demonstrate that the application of systemic risk measures facilitates enforcing fairness in classification. Analysis and experiments regarding the fairness of the proposed models are carefully conducted. For all methods, our numerical experiments demonstrate that they are robust in the presence of unreliable training data and perform better on unknown data than the methods minimizing expected classification errors. Furthermore, the performance improves when the number of classes increases.
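
The abstract does not say which coherent risk measure is used. Purely as an illustration of replacing the expected classification error with a risk-averse objective, the sketch below computes an empirical Conditional Value-at-Risk (CVaR, also called Average Value-at-Risk), a standard example of a coherent risk measure. The function name and the tail-averaging convention are choices made here.

```python
import math


def empirical_cvar(losses, alpha):
    """Average of the worst (1 - alpha) fraction of the observed losses.

    alpha = 0 recovers the plain mean; alpha close to 1 focuses on the worst cases.
    The tail average is exact when (1 - alpha) * len(losses) is an integer.
    """
    if not losses or not 0.0 <= alpha < 1.0:
        raise ValueError("need non-empty losses and 0 <= alpha < 1")
    worst_first = sorted(losses, reverse=True)
    # Number of tail samples; the small epsilon guards against float round-up.
    tail = max(1, math.ceil((1.0 - alpha) * len(losses) - 1e-9))
    return sum(worst_first[:tail]) / tail
```

Minimizing such a tail average instead of the mean puts more weight on the hardest samples, which is one intuition for why risk-averse objectives can help with unreliable labels and with fairness across groups.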

2508.15371 2026-05-29 cs.CL cs.AI cs.LG 版本更新

Confidence-Modulated Speculative Decoding for Large Language Models

置信度调节的推测解码用于大型语言模型

Jaydip Sen, Subhasis Dasgupta, Hetvi Waghela

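Affiliations * Department of Data Science; Praxis Business School

AI summary Proposes a confidence-modulated speculative decoding framework that uses entropy and margin-based uncertainty measures to dynamically adjust draft length and verification, yielding speedups on machine translation and summarization while preserving or improving BLEU and ROUGE scores.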

Comments This is the preprint of the paper, which has been accepted for oral presentation and publication in the proceedings of IEEE INDISCON 2025. The conference will be organized at the National Institute of Technology, Rourkela, India, from August 21 to 23, 2025. The paper is 10 pages long, and it contains 2 figures and 5 tables

详情
AI中文摘要

推测解码已成为一种通过草稿-验证范式并行化令牌生成来加速自回归推理的有效方法。然而,现有方法依赖静态草稿长度和刚性验证标准,限制了其在不同模型不确定性和输入复杂性下的适应性。本文提出一种基于置信度调节草稿的信息论推测解码框架。通过利用草稿模型输出分布上的熵和边际不确定性度量,所提方法在每次迭代中动态调整推测生成的令牌数量。这种自适应机制减少了回滚频率,提高了资源利用率,并保持了输出保真度。此外,验证过程使用相同的置信度信号进行调节,使得在不牺牲生成质量的情况下更灵活地接受草稿令牌。在机器翻译和摘要任务上的实验表明,与标准推测解码相比,该方法在保持或提升BLEU和ROUGE分数的同时实现了显著加速。所提方法提供了一种原则性的即插即用方法,用于在不确定性变化条件下实现大型语言模型的高效且鲁棒的解码。

英文摘要

Speculative decoding has emerged as an effective approach for accelerating autoregressive inference by parallelizing token generation through a draft-then-verify paradigm. However, existing methods rely on static drafting lengths and rigid verification criteria, limiting their adaptability across varying model uncertainties and input complexities. This paper proposes an information-theoretic framework for speculative decoding based on confidence-modulated drafting. By leveraging entropy and margin-based uncertainty measures over the drafter's output distribution, the proposed method dynamically adjusts the number of speculatively generated tokens at each iteration. This adaptive mechanism reduces rollback frequency, improves resource utilization, and maintains output fidelity. Additionally, the verification process is modulated using the same confidence signals, enabling more flexible acceptance of drafted tokens without sacrificing generation quality. Experiments on machine translation and summarization tasks demonstrate significant speedups over standard speculative decoding while preserving or improving BLEU and ROUGE scores. The proposed approach offers a principled, plug-in method for efficient and robust decoding in large language models under varying conditions of uncertainty.
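
A minimal sketch of how entropy and margin signals could be turned into a draft length is given below. The equal weighting of the two signals, the linear mapping, and the bounds `min_len` and `max_len` are hypothetical; the abstract does not specify the actual rule.

```python
import math


def entropy(probs):
    """Shannon entropy (in nats) of a probability vector."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def margin(probs):
    """Gap between the largest and second-largest probability."""
    top = sorted(probs, reverse=True)
    return top[0] - (top[1] if len(top) > 1 else 0.0)


def draft_length(probs, min_len=1, max_len=8):
    """Map drafter confidence in [0, 1] to a draft length in [min_len, max_len]."""
    max_entropy = math.log(len(probs)) if len(probs) > 1 else 1.0
    # Assumed confidence score: average of normalized negative entropy and margin.
    confidence = 0.5 * (1.0 - entropy(probs) / max_entropy) + 0.5 * margin(probs)
    confidence = min(1.0, max(0.0, confidence))  # clamp against float error
    return min_len + round(confidence * (max_len - min_len))
```

A uniform drafter distribution yields the shortest draft and a one-hot distribution the longest, so fewer speculative tokens are proposed when a rollback is more likely.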

2508.08677 2026-05-29 cs.LG cs.CV 版本更新

Multi-level Collaborative Distillation Meets Global Workspace Model: A Unified Framework for OCIL

多级协作蒸馏遇见全局工作空间模型:面向OCIL的统一框架

Shibin Su, Guoqiang Liang, De Cheng, Shizhou Zhang, Lingyan Ran

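Affiliations * School of Computer Science, Northwestern Polytechnical University; School of Telecommunications Engineering, Xidian University

AI summary Proposes a unified framework combining a Global Workspace Model with multi-level collaborative distillation: the parameters of multiple student models are fused into a shared implicit memory that is broadcast periodically, and peer-to-peer consistency and alignment with historical knowledge balance stability and plasticity in online class-incremental learning.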

Comments 15 pages, 8 figures

详情
AI中文摘要

在线类增量学习(OCIL)使模型能够从非独立同分布的数据流中持续学习。由于数据流中的样本只能被看到一次,因此与离线学习相比,它更适用于现实场景。然而,这一约束加剧了OCIL在维持稳定性与可塑性之间适当平衡的挑战。此外,在现实世界中更严格的内存缓冲区约束下,当前基于重放的方法效果较差。虽然集成方法提高了可塑性,但它们常常在稳定性上遇到困难。受全局工作空间理论(GWT)启发,我们提出了一种新颖方法,通过全局工作空间模型(GWM)——一种共享的隐式记忆,指导多个学生模型的学习——来增强集成学习。GWM通过在每个训练批次中融合所有学生的参数形成,捕获历史学习轨迹,并作为知识巩固的动态锚点。类似于GWT的广播机制,GWM定期重新分发给学生,稳定学习并促进跨任务一致性。此外,我们引入了一种多级协作蒸馏机制。它强制学生之间保持对等一致性,并通过将每个学生与GWM对齐来保留历史知识。因此,学生模型在保持先前所学知识的同时,仍能适应新任务,在稳定性与可塑性之间实现更好的平衡。在三个标准OCIL基准上的大量实验表明,我们的方法在各种内存预算下为多个OCIL模型带来了显著的性能提升。代码可在https://github.com/susususushi/GWM获取。

英文摘要

Online Class-Incremental Learning (OCIL) enables models to learn continuously from non-i.i.d. data streams. Since samples in the data stream can be seen only once, it is more suitable for real-world scenarios than offline learning. However, this constraint intensifies the challenge for OCIL of maintaining an appropriate balance between stability and plasticity. Moreover, under the stricter memory buffer constraints of the real world, current replay-based methods are less effective. While ensemble methods improve plasticity, they often struggle with stability. Inspired by the Global Workspace Theory (GWT), we propose a novel approach that enhances ensemble learning through a Global Workspace Model (GWM), a shared, implicit memory that guides the learning of multiple student models. The GWM is formed by fusing the parameters of all students within each training batch, capturing the historical learning trajectory and serving as a dynamic anchor for knowledge consolidation. Like the broadcasting mechanism of GWT, the GWM is redistributed periodically to students, stabilizing learning and promoting cross-task consistency. In addition, we introduce a multi-level collaborative distillation mechanism. It enforces peer-to-peer consistency among students and preserves historical knowledge by aligning each student with the GWM. As a result, student models remain adaptable to new tasks while maintaining previously learned knowledge, striking a better balance between stability and plasticity. Extensive experiments on three standard OCIL benchmarks show that our method delivers significant performance improvement for several OCIL models across various memory budgets. The code is available at https://github.com/susususushi/GWM.
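
The fuse-then-broadcast cycle can be sketched with plain dictionaries of parameters. The abstract does not say how parameters are fused, so simple averaging, the exponential moving average used to keep a history, and all function names below are assumptions made for illustration.

```python
def fuse_students(students):
    """Average the parameters of all students (each a dict: name -> list of floats)."""
    count = len(students)
    return {
        name: [sum(values) / count for values in zip(*(s[name] for s in students))]
        for name in students[0]
    }


def update_workspace(workspace, fused, momentum=0.9):
    """Blend the fused parameters into the workspace so it tracks the learning history."""
    return {
        name: [momentum * w + (1.0 - momentum) * f
               for w, f in zip(workspace[name], fused[name])]
        for name in workspace
    }


def broadcast(workspace, num_students):
    """Give every student its own independent copy of the workspace parameters."""
    return [
        {name: list(values) for name, values in workspace.items()}
        for _ in range(num_students)
    ]
```

In a training loop under these assumptions, `fuse_students` and `update_workspace` would run on every batch, while `broadcast` would run only every few batches so that students can diverge between redistributions.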

2508.02537 2026-05-29 cs.LG 版本更新

Solved in Unit Domain: JacobiNet for Differentiable Coordinate-Transformed PINNs

在单位域中求解:用于可微坐标变换PINNs的JacobiNet

Xi Chen, Jianchuan Yang, Junjie Zhang, Runnan Yang, Xu Liu, Hong Wang, Tinghui Zheng, Ziyu Ren, Wenqi Hu

Affiliations * Department of Mechanical and Aerospace Engineering, The Hong Kong University of Science and Technology; Department of Mechanics & Engineering, College of Architecture & Environment, Sichuan University; School of Mechanical Engineering and Automation, Beihang University; Department of Civil and Environmental Engineering, The Hong Kong University of Science and Technology; College of Computing and Data Science, Nanyang Technological University; West China Biomedical Big Data Center, West China Hospital, Sichuan University

AI summary Proposes JacobiNet, a learning-based, differentiable coordinate-transformed PINN framework that unifies domain mapping and PDE solving in an end-to-end differentiable architecture, addressing normalization, boundary enforcement, and loss-term imbalance for PINNs on domains with irregular boundaries and markedly improving accuracy and efficiency.

Comments Accepted by Journal of Computational Physics

Abstract

Physics-Informed Neural Networks (PINNs) offer a powerful framework for solving PDEs by embedding physical laws into the learning process. However, when applied to domains with irregular boundaries, PINNs often suffer from instability and slow convergence, which stem from (1) inconsistent normalization due to geometric anisotropy, (2) inaccurate boundary enforcement, and (3) imbalanced competition among loss terms. A common workaround is to map the domain to a regular space. Yet conventional mapping methods rely on case-specific meshes, define Jacobians at pre-specified fixed nodes, and reformulate PDEs via the chain rule, making them incompatible with modern automatic-differentiation, tensor-based frameworks. To bridge this gap, we propose JacobiNet, a learning-based coordinate-transformed PINN framework that unifies domain mapping and PDE solving within an end-to-end differentiable architecture. JacobiNet enables direct Jacobian computation via autograd and shares the computation graph with downstream PINNs, thereby avoiding case-specific meshing, explicit Jacobian computation/storage, and manual PDE reformulation while unlocking geometric-editing operations. By separating physical modeling from geometric complexity, JacobiNet (1) addresses normalization challenges in the original anisotropic coordinates, (2) facilitates the hard enforcement of boundary conditions, and (3) mitigates the long-standing imbalance among loss terms. Evaluated on various PDEs, JacobiNet reduces the relative L2 error from 0.11-0.73 to 0.01-0.09, achieving an average 15.6x improvement in accuracy. In vessel-like domains with varying shapes, JacobiNet enables millisecond-level mapping inference for unseen geometries and improves prediction accuracy by an average of 3.65x while delivering over 10x speedup, demonstrating strong generalization, accuracy, and efficiency.

2507.21429 2026-05-29 stat.ML cs.LG Version update

From Sublinear to Linear: Local Convergence in Finite-Width Networks via Locally Polyak-Lojasiewicz Regions

Agnideep Aich, Ashit Baran Aich, Bruce Wade

Affiliations * Stanford University; University of Louisiana at Lafayette; Presidency College Kolkata, India

AI summary Studies local linear convergence of gradient descent for finite-width feedforward networks under the squared empirical loss, showing that a local Polyak-Łojasiewicz inequality together with an NTK positive-definiteness condition yields linear convergence within a locally quasi-convex region.

Abstract

We study local linear convergence of gradient descent for finite-width feedforward networks under the squared empirical loss. Prior work shows that GD can remain confined to a Locally Quasi-Convex Region (LQCR) around initialization, but only gives a sublinear rate. We show that if the empirical Neural Tangent Kernel is positive at initialization, Lipschitz stable on the LQCR, and compatible with the LQCR radius, then the squared loss satisfies a local Polyak-Łojasiewicz inequality with constant $\mu = \lambda_0 - L_\Theta\, r(\mathcal{R}) > 0$. Combined with fixed-step iterate containment in the LQCR, imposed as a hypothesis in the linear-rate theorem, this yields linear convergence on the region. The LQCR supplies localization; fixed-step containment is imposed as a hypothesis in the linear-rate theorem; and the PL inequality comes from NTK conditioning under squared loss. The result is therefore a sufficient local condition, not a claim that this mechanism is necessary or unique for fast convergence. Empirically, we probe the theory through NTK spectral gap, parameter drift, empirical PL ratio, and suboptimality decay. On binary MNIST, the NTK remains positive, the PL ratio has a positive lower envelope, and the loss shows geometric decay on the stable regime. In a width ablation, the fixed-step width-$1024$ run leaves the local regime; reducing the step size lowers final drift from $1.870$ to $0.158$, restores the observed local-regime diagnostics, and yields the largest empirical PL-ratio lower envelope observed in the study. A CNN robustness check on a CIFAR-10 subset shows the PL-ratio envelope remains positive across three seeds, with a positive lower envelope across all three seeds on the stable regime.
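
For readers who want the step from a PL inequality to a linear rate spelled out, the standard textbook argument is sketched below. It assumes the loss $L$ is $\beta$-smooth along the iterates inside the region and that the step size satisfies $\eta \le 1/\beta$; the symbols $\beta$, $\eta$, and $L^\ast$ are notation introduced here for illustration, and the paper's exact constants and hypotheses may differ.

```latex
\begin{align*}
\tfrac{1}{2}\,\|\nabla L(\theta)\|^2 &\ge \mu\,\bigl(L(\theta) - L^\ast\bigr)
  && \text{(local PL inequality)} \\
L(\theta_{t+1}) &\le L(\theta_t) - \eta\Bigl(1 - \tfrac{\beta\eta}{2}\Bigr)\|\nabla L(\theta_t)\|^2
  && \text{($\beta$-smoothness with } \theta_{t+1} = \theta_t - \eta\,\nabla L(\theta_t)\text{)} \\
&\le L(\theta_t) - \tfrac{\eta}{2}\,\|\nabla L(\theta_t)\|^2
  && (\eta \le 1/\beta) \\
L(\theta_{t+1}) - L^\ast &\le (1 - \eta\mu)\,\bigl(L(\theta_t) - L^\ast\bigr)
  && \text{(apply the PL inequality)}
\end{align*}
```

Iterating the last line gives $L(\theta_t) - L^\ast \le (1-\eta\mu)^t\,(L(\theta_0) - L^\ast)$, which is the geometric decay referred to in the abstract, valid only while the iterates stay in the region where the PL inequality holds.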

2507.03318 2026-05-29 cs.LG cs.AI Version update

Structure-Aware Compound-Protein Affinity Prediction via Graph Neural Network with Group Lasso Regularization

Zanyu Shi, Yang Wang, Pathum Weerawarna, Jie Zhang, Timothy Richardson, Yijie Wang, Kun Huang

Affiliations * Department of Biostatistics & Health Data Science; Indiana University; Department of Computer Science; Indiana University Bloomington; Division of Clinical Pharmacology; Indiana University School of Medicine; IUSM-Purdue TREAT-AD Center; Department of Medical and Molecular Genetics

AI summary Proposes graph neural networks trained with group lasso and sparse group lasso regularization that learn structural information from activity-cliff molecule pairs to predict compound-protein affinity (IC50) and improve model interpretability.

Comments 15 pages, 7 figures

Journal ref
Comput Struct Biotechnol J. 2026;35:0012

Abstract

Explainable artificial intelligence (XAI) approaches have been increasingly applied in drug discovery to learn molecular representations and identify substructures driving property predictions. However, building end-to-end explainable models for structure-activity relationship (SAR) modeling in compound property prediction faces many challenges, such as the limited amount of compound-protein interaction activity data for specific protein targets, and the many subtle changes at molecular configuration sites that significantly affect molecular properties. We exploit pairs of molecules with activity cliffs that share scaffolds but differ at substituent sites, characterized by large potency differences for specific protein targets. We propose a framework that uses graph neural networks (GNNs) to leverage property and structure information from activity cliff pairs to predict compound-protein affinity (i.e., half maximal inhibitory concentration, IC50). To enhance model performance and explainability, we train GNNs with structure-aware loss functions using group lasso and sparse group lasso regularizations, which prune and highlight molecular subgraphs relevant to activity differences. We applied this framework to activity cliff data of molecules targeting three proto-oncogene tyrosine-protein kinase Src proteins (PDB IDs: 1O42, 2H8H, 4MXO). Our approach improved property prediction by integrating common and uncommon node information with sparse group lasso, as reflected in reduced root mean squared error (RMSE) and improved Pearson's correlation coefficient (PCC). Applying regularizations also enhances feature attribution for GNNs by boosting graph-level global direction scores and improving atom-level coloring accuracy. These advances strengthen model interpretability in drug discovery pipelines, particularly for identifying critical molecular substructures in lead optimization.
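
As a rough illustration of the two regularizers named above, the sketch below computes a group lasso penalty and a sparse group lasso penalty over weights that have already been partitioned into groups. The grouping, the function names, and the weights `lam_group` and `lam_l1` are assumptions made for illustration; the abstract does not say how the paper forms its groups or weights the terms.

```python
import math

def group_lasso(groups):
    """Sum over groups of sqrt(group size) times the group's L2 norm."""
    return sum(
        math.sqrt(len(g)) * math.sqrt(sum(w * w for w in g))
        for g in groups
    )

def sparse_group_lasso(groups, lam_group, lam_l1):
    """Group lasso term plus an elementwise L1 term (hypothetical weighting)."""
    l1 = sum(abs(w) for g in groups for w in g)
    return lam_group * group_lasso(groups) + lam_l1 * l1

# Two hypothetical groups of weights; the second group is entirely zero,
# so it contributes nothing to either term.
groups = [[3.0, 4.0], [0.0, 0.0]]
penalty = sparse_group_lasso(groups, lam_group=1.0, lam_l1=0.5)
print(penalty)
```

The group term penalizes each group's norm as a whole, which tends to switch entire groups off together, while the L1 term encourages sparsity inside the groups that remain.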

2506.06254 2026-05-29 cs.AI cs.CL cs.LG Version update

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents

Weizhi Zhang, Xinyang Zhang, Chenwei Zhang, Liangwei Yang, Jingbo Shang, Zhepei Wei, Henry Peng Zou, Zijie Huang, Zhengyang Wang, Yifan Gao, Xiaoman Pan, Lian Xiong, Jingguo Liu, Philip S. Yu, Xian Li

Affiliations * Amazon Stores Foundational AI

AI summary Proposes the PersonaAgent framework, which integrates a personalized memory module (episodic and semantic memory) with a personalized action module and uses the persona prompt as an intermediary that coordinates memory and action, to address personalization tasks for LLM agents.

Comments Accepted in ACL 2026

Abstract

Large Language Model (LLM) empowered agents have recently emerged as advanced paradigms that exhibit impressive capabilities in a wide range of domains and tasks. Despite their potential, current LLM agents often adopt a one-size-fits-all approach, lacking the flexibility to respond to users' varying needs and preferences. This limitation motivates us to develop PersonaAgent, the first personalized LLM agent framework designed to address versatile personalization tasks. Specifically, PersonaAgent integrates two complementary components: a personalized memory module that includes episodic and semantic memory mechanisms, and a personalized action module that enables the agent to perform tool actions tailored to the user. At the core, the persona (defined as a unique system prompt for each user) functions as an intermediary: it leverages insights from personalized memory to control agent actions, while the outcomes of these actions in turn refine the memory. Based on the framework, we propose a test-time user-preference alignment strategy that simulates the latest n interactions to optimize the persona prompt, ensuring real-time user preference alignment through textual loss feedback between simulated and ground-truth responses. Experimental evaluations demonstrate that PersonaAgent significantly outperforms other baseline methods by not only personalizing the action space effectively but also scaling during test-time real-world applications. These results underscore the feasibility and potential of our approach in delivering tailored, dynamic user experiences.

2506.06095 2026-05-29 cs.LG Version update

Accelerating Sparse Transformer Inference on GPU

Wenhao Dai, Haodong Deng, Mengfei Rong, Xinyu Yang, Hongyu Liu, Fangxin Liu, Hailong Yang, Qianwen Cao, Qingxiao Sun

Affiliations * SSSLab, Dept. of CST, China University of Petroleum-Beijing, Beijing, China; Baidu Inc., Beijing, China; School of Computer Science, Shanghai Jiao Tong University, Shanghai, China; China University of Petroleum-Beijing, Beihang University, Beijing, China

AI summary Targeting the acceleration of sparse Transformer inference, proposes the STOF framework, which uses analytical modeling to map multi-head attention to row-wise or block-wise kernels with unique storage formats and adds an operator fusion scheme driven by a two-stage search, achieving up to 1.6x speedup in multi-head attention computation and up to 1.4x in end-to-end inference on GPU.

Abstract

Large language models (LLMs) are popular around the world due to their powerful understanding capabilities. Because the Transformer is the core component of LLMs, accelerating it through parallelization has gradually become a hot research topic. Mask layers introduce sparsity into the Transformer to reduce computation. However, previous work rarely focuses on the performance optimization of sparse Transformers. In addition, current static operator fusion schemes fail to adapt to diverse application scenarios. To address these problems, we propose STOF, a framework of optimizations for Sparse Transformers that enables flexible masking and Operator Fusion on GPU. For the multi-head attention (MHA) structure, STOF maps the computation to row-wise or block-wise kernels with unique storage formats according to analytical modeling. For downstream operators, STOF maps the fusion scheme to compilation templates and determines the optimal running configuration through two-stage searching. The experimental results show that, compared to state-of-the-art work, STOF achieves maximum speedups of 1.6x in MHA computation and 1.4x in end-to-end inference.

2506.05985 2026-05-29 cs.LG cs.RO Version update

Dynamic Mixture of Progressive Parameter-Efficient Expert Library for Lifelong Robot Learning

Yuheng Lei, Sitong Mao, Shunbo Zhou, Hongyuan Zhang, Xuelong Li, Ping Luo

Affiliations * The University of Hong Kong; Institute of Artificial Intelligence (TeleAI), China Telecom; Huawei Cloud Computing Technologies; Ola Dimensions; HKU Shanghai Intelligent Computing Research Center

AI summary To address unavailable task identifiers and isolated knowledge in lifelong learning, proposes Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL), which builds a low-rank expert library with a lightweight router for flexible forward transfer and introduces expert coefficient replay to mitigate forgetting, outperforming existing methods on the LIBERO benchmark with minimal trainable parameters and storage.

Comments Accepted to Transactions on Machine Learning Research (TMLR) at https://openreview.net/forum?id=MHVBrjS8cG . Code is available at https://github.com/HarryLui98/DMPEL

Abstract

A generalist agent must continuously learn and adapt throughout its lifetime, achieving efficient forward transfer while minimizing catastrophic forgetting. Previous work within the dominant pretrain-then-finetune paradigm has explored parameter-efficient fine-tuning for single-task adaptation, effectively steering a frozen pretrained model with a small number of parameters. However, in the context of lifelong learning, these methods rely on the impractical assumption of a test-time task identifier and restrict knowledge sharing among isolated adapters. To address these limitations, we propose Dynamic Mixture of Progressive Parameter-Efficient Expert Library (DMPEL) for lifelong robot learning. DMPEL progressively builds a low-rank expert library and employs a lightweight router to dynamically combine experts into an end-to-end policy, enabling flexible and efficient lifelong forward transfer. Furthermore, by leveraging the modular structure of the fine-tuned parameters, we introduce expert coefficient replay, which guides the router to accurately retrieve frozen experts for previously encountered tasks. This technique mitigates forgetting while being significantly more storage- and computation-efficient than experience replay over the entire policy. Extensive experiments on the lifelong robot learning benchmark LIBERO demonstrate that our framework outperforms state-of-the-art lifelong learning methods in success rates during continual adaptation, while utilizing minimal trainable parameters and storage.

2506.04602 2026-05-29 cs.GT cs.LG Version update

MVP-Shapley: Feature-based Modeling for Evaluating the Most Valuable Player in Basketball

Haifeng Sun, Yu Xiong, Runze Wu, Kai Wang, Lan Zhang, Changjie Fan, Shaojie Tang, Xiang-Yang Li

Affiliations * University of Science and Technology of China; Netease, Fuxi AI Lab; University at Buffalo

AI summary Proposes a Shapley-value-based MVP evaluation framework that combines feature processing, win-loss model training, and contribution allocation with a causality-motivated refinement to rank players, and validates it on an NBA dataset.

Abstract

The burgeoning growth of the esports and multiplayer online gaming community has highlighted the critical importance of evaluating the Most Valuable Player (MVP). Establishing an explainable and practical MVP evaluation method is very challenging. In our study, we specifically focus on play-by-play data, which records related events during the game, such as assists and points. We aim to address the challenges by introducing a new MVP evaluation framework, denoted as MVP-Shapley, which leverages Shapley values. This approach encompasses feature processing, win-loss model training, Shapley value allocation, and MVP ranking determination based on players' contributions. Additionally, we optimize our algorithm to align with expert voting results from the perspective of causality. Finally, we validate the efficacy of our method on the NBA dataset and the Dunk City Dynasty dataset and deploy it online in industry.
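
As a rough illustration of the Shapley value allocation step, the sketch below computes exact Shapley values for a tiny two-player cooperative game by averaging marginal contributions over all player orderings. The `win_prob` table and the player names are hypothetical stand-ins for the output of a trained win-loss model, not values from the paper, and exact enumeration like this is only feasible for a handful of players.

```python
from itertools import permutations

def shapley_values(players, value):
    """Exact Shapley values: average marginal contribution over all orderings."""
    totals = {p: 0.0 for p in players}
    orderings = list(permutations(players))
    for order in orderings:
        coalition = frozenset()
        for p in order:
            with_p = coalition | {p}
            # Marginal contribution of p when joining the current coalition.
            totals[p] += value(with_p) - value(coalition)
            coalition = with_p
    return {p: t / len(orderings) for p, t in totals.items()}

# Hypothetical win probabilities for each coalition of players on the court.
win_prob = {
    frozenset(): 0.0,
    frozenset({"A"}): 0.4,
    frozenset({"B"}): 0.2,
    frozenset({"A", "B"}): 0.8,
}

phi = shapley_values(["A", "B"], lambda s: win_prob[frozenset(s)])
ranking = sorted(phi, key=phi.get, reverse=True)
print(phi, ranking)
```

The values sum to the win probability of the full coalition, which is the efficiency property that makes Shapley values a natural way to split credit among players.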

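2505.20634 2026-05-29 cs.LG stat.ML Version update

Explaining Concept Shift with Interpretable Feature Attribution

Ruiqi Lyu, Alistair Turcan, Bryan Wilder

Affiliations * Carnegie Mellon University

AI summary Proposes SGShift, which models concept shift as a feature selection task and uses statistical tools such as generalized additive models, knockoffs, and absorption to identify the sparse set of shifted features responsible for performance differences between source-domain and target-domain models.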

Abstract

Concept shift occurs when the distribution of labels conditioned on the features changes between domains, which can make even a well-tuned ML model miscalibrated on a new domain. Identifying these shifted features provides unique insight into how feature-label relationships differ between domains, considering the difference may be across a scientifically relevant dimension, such as time, disease status, population, etc. In this paper, we propose SGShift, a method for attributing performance degradation under concept shift in tabular data to a sparse set of shifted features. We frame concept shift as a feature selection task to learn the features that can explain performance differences between models in the source and target domain. This framework enables SGShift to adapt powerful statistical tools such as generalized additive models, knockoffs, and absorption towards identifying these shifted features. We conduct extensive experiments in synthetic and real data across various ML models and find SGShift can identify shifted features much more accurately than baseline methods, requires few samples in the shifted domain, and is robust to complex cases of concept shift.

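2505.05968 2026-05-29 cs.LG cs.MA Version update

Offline Multi-agent Reinforcement Learning via Sequential Score Decomposition

Dan Qiao, Wenhao Li, Shanchao Yang, Hongyuan Zha, Baoxiang Wang

Affiliations * School of Data Science, The Chinese University of Hong Kong, Shenzhen, China; School of Computer Science, Tongji University, Shanghai, China; Vector Institute

AI summary To address policy distribution shift in offline cooperative multi-agent reinforcement learning caused by high-dimensional joint action spaces and heterogeneous behavior data, proposes a sequential score function decomposition method that uses diffusion models to learn per-agent regularization signals from multimodal offline data and steers policy updates toward high-reward, in-distribution regions, achieving state-of-the-art performance on multiple particle environments and Multi-agent MuJoCo benchmarks.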

Comments ICML 2026 Accepted

Journal ref
Forty-Third International Conference on Machine Learning, 2026

Abstract

Offline cooperative multi-agent reinforcement learning (MARL) faces unique challenges due to distributional shifts, particularly stemming from the high dimensionality of joint action spaces and the presence of out-of-distribution joint action selections. In this work, we highlight that a fundamental challenge in offline MARL arises from the multi-equilibrium nature of cooperative tasks, which induces a highly multimodal joint behavior policy space coupled with heterogeneous-quality behavior data. This makes it difficult for individual policy regularization to align with a consistent coordination pattern, leading to policy distribution shift. To tackle this challenge, we design a sequential score function decomposition method that distills per-agent regularization signals from the joint behavior policy, which induces coordinated modality selection under decentralized execution constraints. We then leverage a flexible diffusion-based generative model to learn these score functions from multimodal offline data, and integrate them into joint-action critics to guide policy updates toward high-reward, in-distribution regions under a shared team reward. Our approach consistently achieves state-of-the-art performance across multiple particle environments and Multi-agent MuJoCo benchmarks. To the best of our knowledge, this is the first work to explicitly address the distributional gap between offline and online MARL, paving the way for more generalizable offline policy-based MARL methods.

2505.02743 2026-05-29 cs.LG stat.ML Version update

Cooperative Variance Estimation and Bayesian Neural Networks for Disentangling Aleatoric and Epistemic Uncertainties

Jiaxiang Yi, Miguel A. Bessa

Affiliations * Faculty of Mechanical Engineering, Delft University of Technology, Mekelweg 2, Delft, 2628 CD, The Netherlands; School of Engineering, Brown University, 184 Hope St., Providence, RI 02912, USA

AI summary Proposes cooperatively training a variance estimation network with a Bayesian neural network to disentangle aleatoric and epistemic uncertainties while improving mean estimation.

Comments 38 pages, 26 figures

Abstract

Real-world data contains aleatoric uncertainty - irreducible noise arising from imperfect measurements or from incomplete knowledge about the data generation process. Mean-variance estimation networks can learn this type of uncertainty but require ad-hoc regularization strategies to avoid overfitting and are unable to predict epistemic uncertainty (model uncertainty). Conversely, Bayesian neural networks predict epistemic uncertainty but are notoriously difficult to train due to the approximate nature of Bayesian inference. We propose to cooperatively train a variance estimation network with a Bayesian neural network and empirically demonstrate that the resulting model disentangles aleatoric and epistemic uncertainties while improving the mean estimation. We demonstrate the effectiveness and scalability of this method across a diverse range of datasets, including a time-dependent heteroscedastic regression dataset we created where the aleatoric uncertainty is known. The proposed method is straightforward to implement, robust, and adaptable to various model architectures.
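
To make the two kinds of uncertainty concrete, the sketch below shows the Gaussian negative log-likelihood commonly used to train mean-variance estimation networks, and a common decomposition of predictive variance based on the law of total variance. Both are standard textbook forms and not necessarily the paper's training procedure; the function names and the sample numbers are hypothetical.

```python
import math
from statistics import fmean, pvariance

def gaussian_nll(y, mean, var):
    """Negative log-likelihood of y under a Gaussian with the given mean and variance."""
    return 0.5 * (math.log(2.0 * math.pi * var) + (y - mean) ** 2 / var)

def decompose(means, variances):
    """Split predictive variance across sampled models (law of total variance).

    means, variances: per-sample predicted mean and noise variance at one input,
    e.g. from several draws of a Bayesian neural network's weights.
    """
    aleatoric = fmean(variances)   # average predicted noise variance
    epistemic = pvariance(means)   # spread of the predicted means
    return aleatoric, epistemic

# Hypothetical predictions from two weight samples at a single input.
aleatoric, epistemic = decompose(means=[1.0, 3.0], variances=[0.5, 1.5])
print(aleatoric, epistemic, gaussian_nll(0.0, 0.0, 1.0))
```

Under this decomposition the total predictive variance is the sum of the two terms, so disagreement between sampled models is reported separately from the noise each model predicts.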

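2503.13844 2026-05-29 cs.CL cs.AI cs.CY cs.LG Version update

Towards Detecting Persuasion on Social Media: From Model Development to Insights on Persuasion Strategies

Elyas Meguellati, Stefano Civelli, Pietro Bernardelle, Shazia Sadiq, Irwin King, Gianluca Demartini

Affiliations * University of Queensland; The Chinese University of Hong Kong

AI summary Develops a lightweight persuasive-text detection model (state-of-the-art on Subtask 3 of SemEval 2023 Task 3) and applies it to the Australian Federal Election 2022 Facebook Ads dataset, revealing patterns in how political campaigns use persuasion across funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as the election approaches.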

Journal ref
Proceedings of the International AAAI Conference on Web and Social Media 20(1) (2026) 1587-1608

Abstract

Political advertising plays a pivotal role in shaping public opinion and influencing electoral outcomes, often through subtle persuasive techniques embedded in broader propaganda strategies. Detecting these persuasive elements is crucial for enhancing voter awareness and ensuring transparency in democratic processes. This paper presents an integrated approach that bridges model development and real-world application through two interconnected studies. First, we introduce a lightweight model for persuasive text detection that achieves state-of-the-art performance in Subtask 3 of SemEval 2023 Task 3 while requiring significantly fewer computational resources and training data than existing methods. Second, we demonstrate the model's practical utility by collecting the Australian Federal Election 2022 Facebook Ads (APA22) dataset, partially annotating a subset for persuasion, and fine-tuning the model to adapt from mainstream news to social media content. We then apply the fine-tuned model to label the remainder of the APA22 dataset, revealing distinct patterns in how political campaigns leverage persuasion through different funding strategies, word choices, demographic targeting, and temporal shifts in persuasion intensity as election day approaches. Our findings not only underscore the necessity of domain-specific modeling for analyzing persuasion on social media but also show how uncovering these strategies can enhance transparency, inform voters, and promote accountability in digital campaigns.

2502.16548 2026-05-29 cs.LG cs.AI cs.CV Version update

A Composable Multimodal Framework for cine CMR-Text-Driven Prediction of Heart Failure Outcomes

Jianzhou Chen, Jinyang Sun, Xiumei Wang, Xi Chen, Heyu Chu, Guo Song, Yuji Luo, Xingping Zhou, Rong Gu

Affiliations * Department of Cardiology, Nanjing Drum Tower Hospital, State Key Laboratory of Pharmaceutical Biotechnology, Nanjing University; School of Electronic Information and Electrical Engineering, Shanghai Jiao Tong University; College of Electronic and Optical Engineering, Nanjing University of Posts and Telecommunications; College of Integrated Circuit Science and Engineering, Nanjing University of Posts and Telecommunications; Department of Cardiology, Nanjing Drum Tower Hospital Clinical College of Nanjing Medical University; Institute of Quantum Information and Technology, Nanjing University of Posts and Telecommunications

AI summary Proposes a composable multimodal framework that integrates cine CMR imaging, structured clinical metrics, and unstructured text records to predict heart failure prognosis more accurately than single-modality AI algorithms and to support personalized treatment optimization.

Abstract

Objective. Heart failure is one of the leading causes of death worldwide, with millions of deaths each year, according to data from the World Health Organization (WHO) and other public health agencies. While significant progress has been made in the field of heart failure, leading to improved survival rates and improved ejection fraction, there remain substantial unmet needs due to the complexity and multifactorial nature of the disease. This study aims to propose and evaluate a composable strategy framework for assessment and treatment optimization in heart failure, designed to provide more holistic patient evaluation and management. Approach. The framework leverages multi-modal algorithms to analyze a comprehensive range of patient data, explicitly integrating cine cardiac magnetic resonance (cine CMR) sequences, structured clinical metrics (e.g., lab results, demographics), and unstructured textual records (e.g., medical history, prescriptions). By integrating these various data sources, our framework offers a more holistic evaluation and an optimized treatment plan for patients. Main results. The multi-modal framework demonstrates superior accuracy in HF prognosis prediction compared to single-modal AI algorithms. Additionally, it enables a detailed evaluation of the impact of various pathological indicators on HF outcomes. Significance. By integrating heterogeneous clinical data in a systematic manner, this approach supports more comprehensive prognosis assessment and facilitates optimized, personalized treatment planning for heart failure patients.

2411.00278 2026-05-29 cs.LG Version update

KAN-AD: Time Series Anomaly Detection with Kolmogorov-Arnold Networks

Quan Zhou, Changhua Pei, Fei Sun, Jing Han, Zhengwei Gao, Dan Pei, Haiming Zhang, Gaogang Xie, Jianhui Li

Affiliations * Computer Network Information Center, Chinese Academy of Sciences; University of the Chinese Academy of Sciences; Hangzhou Institute for Advanced Study, University of the Chinese Academy of Sciences; Institute of Computing Technology, Chinese Academy of Sciences; School of Frontier Sciences, Nanjing University; Department of Computer Science and Technology, Tsinghua University

AI summary To address forecasting models overfitting to local fluctuations in time series anomaly detection, proposes KAN-AD, which replaces B-splines with truncated Fourier expansions to emphasize global patterns and resist local disturbances, improving average detection accuracy by 15% on four benchmarks.

Comments 11 pages, ICML 2025

Journal ref
Proceedings of the 42nd International Conference on Machine Learning, PMLR 267:79136-79149, 2025

Abstract

Time series anomaly detection (TSAD) underpins real-time monitoring in cloud services and web systems, allowing rapid identification of anomalies to prevent costly failures. Most TSAD methods driven by forecasting models tend to overfit by emphasizing minor fluctuations. Our analysis reveals that effective TSAD should focus on modeling "normal" behavior through smooth local patterns. To achieve this, we reformulate time series modeling as approximating the series with smooth univariate functions. The local smoothness of each univariate function ensures that the fitted time series remains resilient against local disturbances. However, a direct KAN implementation proves susceptible to these disturbances due to the inherently localized characteristics of B-spline functions. We thus propose KAN-AD, replacing B-splines with truncated Fourier expansions and introducing a novel lightweight learning mechanism that emphasizes global patterns while staying robust to local disturbances. On four popular TSAD benchmarks, KAN-AD achieves an average 15% improvement in detection accuracy (with peaks exceeding 27%) over state-of-the-art baselines. Remarkably, it requires fewer than 1,000 trainable parameters, resulting in a 50% faster inference speed compared to the original KAN, demonstrating the approach's efficiency and practical viability.
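
To show what a truncated Fourier expansion looks like in practice, the sketch below builds the basis `[1, cos, sin, ...]` for a scalar input and evaluates a univariate function as a weighted sum of those terms. The function names, the `period` parameter, and the coefficients are hypothetical; how KAN-AD parameterizes and learns its coefficients is not described in the abstract.

```python
import math

def fourier_features(t, n_terms, period):
    """Truncated Fourier basis at t: [1, cos(w t), sin(w t), ..., cos(n w t), sin(n w t)]."""
    feats = [1.0]
    for k in range(1, n_terms + 1):
        angle = 2.0 * math.pi * k * t / period
        feats.append(math.cos(angle))
        feats.append(math.sin(angle))
    return feats

def evaluate(coeffs, t, period):
    """Evaluate a univariate function given its 2 * n_terms + 1 Fourier coefficients."""
    n_terms = (len(coeffs) - 1) // 2
    feats = fourier_features(t, n_terms, period)
    return sum(c * f for c, f in zip(coeffs, feats))

# Hypothetical coefficients: a constant offset plus one cosine term.
value_at_zero = evaluate([0.5, 1.0, 0.0], t=0.0, period=24.0)
print(fourier_features(0.0, 2, 24.0), value_at_zero)
```

Every basis function here spans the whole input range, in contrast to a B-spline basis function, which is nonzero only on a few adjacent knot intervals; that difference is the locality issue the abstract points to.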

2410.19371 2026-05-29 stat.ML cs.CR cs.LG 版本更新

Noise-Aware Differentially Private Variational Inference

噪声感知的差分隐私变分推断

Talal Alrawajfeh, Joonas Jälkö, Antti Honkela

发表机构 * University of Helsinki(赫尔辛基大学)

AI总结 针对差分隐私导致下游推断不可靠的问题,提出一种基于随机梯度变分推断的噪声感知近似贝叶斯推断方法,可应用于高维和非共轭模型,并改进了后验评估精度。

Comments 26 pages, 4 figures

AI中文摘要

差分隐私(DP)为统计推断提供了强大的隐私保证,但这可能导致下游应用中不可靠的结果和偏差。尽管已有几种将DP扰动纳入推断的噪声感知方法被提出,但它们仅限于特定类型的简单概率模型。在这项工作中,我们提出了一种基于随机梯度变分推断的噪声感知近似贝叶斯推断新方法,该方法也可应用于高维和非共轭模型。我们还提出了一种更精确的噪声感知后验评估方法。实验表明,我们的推断方法在现有方法适用的领域具有相似的性能。在该领域之外,我们在高维贝叶斯线性回归上获得了准确的覆盖率,并在UCI Adult数据集上的贝叶斯逻辑回归上获得了校准良好的预测概率。

英文摘要

Differential privacy (DP) provides robust privacy guarantees for statistical inference, but this can lead to unreliable results and biases in downstream applications. While several noise-aware approaches have been proposed which integrate DP perturbation into the inference, they are limited to specific types of simple probabilistic models. In this work, we propose a novel method for noise-aware approximate Bayesian inference based on stochastic gradient variational inference which can also be applied to high-dimensional and non-conjugate models. We also propose a more accurate evaluation method for noise-aware posteriors. Empirically, our inference method has similar performance to existing methods in the domain where they are applicable. Outside this domain, we obtain accurate coverages on high-dimensional Bayesian linear regression and well-calibrated predictive probabilities on Bayesian logistic regression with the UCI Adult dataset.
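
To make "DP perturbation" concrete, here is a minimal sketch of one common way noise enters gradient-based training: clip each per-example gradient and add Gaussian noise to the sum. The abstract does not say which mechanism the paper uses, so this is an assumption, and the function names, clipping bound, and noise multiplier are hypothetical. A noise-aware method would then treat the released value as the true clipped sum plus Gaussian noise of known scale, rather than as an exact gradient.

```python
import math
import random


def clip_to_norm(vec, max_norm):
    """Scale vec down so that its L2 norm is at most max_norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    if norm <= max_norm:
        return list(vec)
    return [v * max_norm / norm for v in vec]


def noisy_gradient_sum(per_example_grads, max_norm, noise_multiplier, rng):
    """Gaussian mechanism applied to a sum of clipped per-example gradients.

    Returns the perturbed sum and the noise standard deviation; the latter
    is what a noise-aware model of the release would need to know.
    """
    dim = len(per_example_grads[0])
    total = [0.0] * dim
    for grad in per_example_grads:
        for i, v in enumerate(clip_to_norm(grad, max_norm)):
            total[i] += v
    sigma = noise_multiplier * max_norm
    noisy = [t + rng.gauss(0.0, sigma) for t in total]
    return noisy, sigma


rng = random.Random(0)
grads = [[3.0, 4.0], [0.3, -0.4]]  # hypothetical per-example gradients
released, sigma = noisy_gradient_sum(grads, max_norm=1.0, noise_multiplier=1.5, rng=rng)
print(released, sigma)
```

Returning `sigma` alongside the perturbed sum is a deliberate choice in this sketch: the noise scale is public, so downstream inference can model it explicitly.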

2410.15236 2026-05-29 cs.CR cs.AI cs.LG 版本更新

Jailbreaking and Mitigation of Vulnerabilities in Large Language Models

大语言模型的越狱与漏洞缓解

Benji Peng, Hanxuan Chen, Keyu Chen, Qian Niu, Ziqian Bi, Ming Liu, Pohsun Feng, Tianyang Wang, Lawrence K. Q. Yan, Yizhu Wen, Yichao Zhang, Caitlyn Heqi Yin, Xinyuan Song, Riyang Bao, Jiacheng Shi

发表机构 * Hunan University, Changsha, PRC; Georgia Institute of Technology, Atlanta, USA; Kyoto University, Kyoto, Japan; Purdue University, West Lafayette, USA; National Taiwan Normal University, Taipei, ROC; University of Liverpool, Suzhou, PRC; Hong Kong University of Science; University of Hawaii, Honolulu, USA; The University of Texas at Dallas, Dallas, USA; University of Wisconsin-Madison, Madison, USA; Emory University, Atlanta, USA; College of William & Mary, Williamsburg, USA

AI总结 本文综述了大语言模型在提示注入和越狱攻击下的漏洞,分类攻击方法并评估防御策略,指出研究空白与未来方向。

Journal ref
Eureka 1(1) (2026) 26-61
AI中文摘要

大语言模型通过推进自然语言理解和生成,在医疗、软件工程和对话系统等领域实现了广泛应用,从而改变了人工智能。尽管在过去几年取得了这些进展,但大语言模型已显示出相当大的漏洞,特别是对提示注入和越狱攻击。本综述分析了这些漏洞的研究现状,并介绍了可用的防御策略。我们大致将攻击方法分为基于提示的、基于模型的、多模态的和多语言的,涵盖对抗性提示、后门注入和跨模态利用等技术。我们还回顾了各种防御机制,包括提示过滤、转换、对齐技术、多智能体防御和自我调节,评估了它们的优缺点。我们还讨论了用于评估大语言模型安全性和鲁棒性的关键指标和基准,指出了在交互环境中量化攻击成功率的挑战以及现有数据集中的偏差。通过识别当前研究空白,我们提出了未来在韧性对齐策略、针对不断演变的攻击的高级防御、越狱检测自动化以及考虑伦理和社会影响方面的方向。本综述强调了在人工智能社区内持续研究和合作的必要性,以增强大语言模型的安全性并确保其安全部署。

英文摘要

Large Language Models (LLMs) have transformed artificial intelligence by advancing natural language understanding and generation, enabling applications across fields such as healthcare, software engineering, and conversational systems. Despite these advancements in the past few years, LLMs have shown considerable vulnerabilities, particularly to prompt injection and jailbreaking attacks. This review analyzes the state of research on these vulnerabilities and presents available defense strategies. We roughly categorize attack approaches into prompt-based, model-based, multimodal, and multilingual, covering techniques such as adversarial prompting, backdoor injections, and cross-modality exploits. We also review various defense mechanisms, including prompt filtering, transformation, alignment techniques, multi-agent defenses, and self-regulation, evaluating their strengths and shortcomings. We also discuss key metrics and benchmarks used to assess LLM safety and robustness, noting challenges like the quantification of attack success in interactive contexts and biases in existing datasets. Identifying current research gaps, we suggest future directions for resilient alignment strategies, advanced defenses against evolving attacks, automation of jailbreak detection, and consideration of ethical and societal impacts. This review emphasizes the need for continued research and cooperation within the AI community to enhance LLM security and ensure their safe deployment.

2408.15451 2026-05-29 cs.LG cs.CR stat.ME 版本更新

Certified Causal Defense with Generalizable Robustness

具有泛化鲁棒性的认证因果防御

Yiran Qiao, Yu Yin, Chen Chen, Jing Ma

发表机构 * Case Western Reserve University(凯斯西储大学) University of Virginia(弗吉尼亚大学)

AI总结 提出GLEAN框架,通过可认证因果因子学习解耦因果关系与虚假相关性,并设计因果认证防御策略,实现跨分布偏移域的鲁棒性泛化。

Comments Accepted by AAAI 2025

AI中文摘要

尽管机器学习模型在各种场景中已被证明有效,但普遍认为许多模型容易受到对抗性攻击。近年来,出现了大量对抗性防御的研究。其中,认证防御因其对输入在特定范围内(例如$l_2$球)的任意对抗性扰动具有理论保证而闻名。然而,该领域现有的大多数工作难以将其认证鲁棒性泛化到具有分布偏移的其他数据域中。这一问题的根源在于难以消除不同域中虚假相关性对鲁棒性的负面影响。为解决此问题,本文提出了一种新颖的认证防御框架GLEAN,该框架将因果视角引入认证防御的泛化问题。具体而言,我们的框架集成了一个可认证的因果因子学习组件,以解耦输入与标签之间的因果关系和虚假相关性,从而排除虚假相关性对防御的负面影响。在此基础上,我们设计了一种因果认证防御策略来处理对潜在因果因子的对抗性攻击。通过这种方式,我们的框架不仅对训练分布中数据上的恶意噪声具有鲁棒性,而且能够将其鲁棒性泛化到具有分布偏移的各个域中。在基准数据集上的大量实验验证了我们的框架在不同数据域中认证鲁棒性泛化的优越性。代码见补充材料。

英文摘要

While machine learning models have proven effective across various scenarios, it is widely acknowledged that many models are vulnerable to adversarial attacks. Recently, numerous efforts in adversarial defense have emerged. Among them, certified defense is well known for its theoretical guarantees against arbitrary adversarial perturbations on input within a certain range (e.g., an $l_2$ ball). However, most existing works in this line struggle to generalize their certified robustness to other data domains with distribution shifts. This issue is rooted in the difficulty of eliminating the negative impact of spurious correlations on robustness in different domains. To address this problem, in this work, we propose a novel certified defense framework GLEAN, which incorporates a causal perspective into the generalization problem in certified defense. More specifically, our framework integrates a certifiable causal factor learning component to disentangle the causal relations and spurious correlations between input and label, and thereby exclude the negative effect of spurious correlations on defense. On top of that, we design a causally certified defense strategy to handle adversarial attacks on latent causal factors. In this way, our framework is not only robust against malicious noise on data in the training distribution but can also generalize its robustness across domains with distribution shifts. Extensive experiments on benchmark datasets validate the superiority of our framework in certified robustness generalization in different data domains. Code is available in the supplementary materials.

2310.14161 2026-05-29 cs.LG 版本更新

Promoting Generalization for Exact Solvers via Adversarial Instance Augmentation

通过对抗性实例增强促进精确求解器的泛化能力

Haoyang Liu, Yufei Kuang, Jie Wang, Xijun Li, Yongdong Zhang, Feng Wu

发表机构 * CAS Key Laboratory of Technology in GIPAS, University of Science and Technology of China(中国科学技术大学中国科学院GIPAS技术重点实验室) Institute of Artificial Intelligence, Hefei Comprehensive National Science Center(合肥综合性国家科学中心人工智能研究院)

AI总结 针对学习型MILP求解器在未见实例上性能下降的问题,提出对抗性实例增强方法AdaSolver,通过将不可微的实例增强建模为上下文赌博机问题并联合对抗训练增强策略与求解器,显著提升基于模仿学习和强化学习的分支定界求解器的泛化能力。

Journal ref
IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026
AI中文摘要

机器学习已成功应用于提高混合整数线性规划(MILP)求解器的效率。然而,由于训练分布的多样性有限,基于学习的求解器在未见过的MILP实例上——尤其是在扰动环境中的大规模实例上——常常遭受严重的性能下降。为解决这一问题,我们提出了一种新颖方法,称为对抗性实例增强,该方法无需了解新实例生成的问题类型,以促进分支定界(B&B)求解器中基于学习的分支模块的数据多样性(AdaSolver)。我们使用MILP实例的二分图表示,并通过学习到的增强策略增强图结构,从而获得各种扰动实例以正则化求解器。AdaSolver的主要技术贡献在于,我们将不可微的实例增强建模为上下文赌博机问题,并对抗性地训练基于学习的求解器和增强策略,从而实现对增强策略的高效梯度训练。据我们所知,AdaSolver是首个通用且有效的框架,用于理解和改进基于模仿学习(IL-based)和基于强化学习(RL-based)的B&B求解器的泛化能力。大量实验表明,通过生成各种增强实例,AdaSolver在各种分布上均显著提升了求解效率。

英文摘要

Machine learning has been successfully applied to improve the efficiency of Mixed-Integer Linear Programming (MILP) solvers. However, learning-based solvers often suffer from severe performance degradation on unseen MILP instances -- especially on large-scale instances from a perturbed environment -- due to the limited diversity of training distributions. To tackle this problem, we propose a novel approach called Adversarial Instance Augmentation, which does not require knowing the problem type to generate new instances, to promote data diversity for learning-based branching modules in branch-and-bound (B&B) solvers (AdaSolver). We use bipartite graph representations of MILP instances and obtain various perturbed instances to regularize the solver by augmenting the graph structures with a learned augmentation policy. The major technical contribution of AdaSolver is that we formulate the non-differentiable instance augmentation as a contextual bandit problem and adversarially train the learning-based solver and augmentation policy, enabling efficient gradient-based training of the augmentation policy. To the best of our knowledge, AdaSolver is the first general and effective framework for understanding and improving the generalization of both imitation-learning-based (IL-based) and reinforcement-learning-based (RL-based) B&B solvers. Extensive experiments demonstrate that by producing various augmented instances, AdaSolver leads to a remarkable efficiency improvement across various distributions.
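
The bipartite graph representation mentioned in the abstract can be sketched as follows, assuming the common encoding with one node per constraint, one node per variable, and an edge for each nonzero coefficient. The abstract does not give the exact node features or the augmentation operators, so `milp_to_bipartite`, `scale_edge`, and the toy instance are hypothetical; in particular, the hand-picked perturbation stands in for what AdaSolver would choose with a learned policy.

```python
def milp_to_bipartite(c, A, b):
    """Encode  min c^T x  s.t.  A x <= b  as a bipartite graph.

    One node per constraint, one node per variable, and an edge
    (i, j, A[i][j]) wherever variable j has a nonzero coefficient
    in constraint i.
    """
    constraint_nodes = [{"rhs": rhs} for rhs in b]
    variable_nodes = [{"obj_coef": coef} for coef in c]
    edges = [
        (i, j, coef)
        for i, row in enumerate(A)
        for j, coef in enumerate(row)
        if coef != 0
    ]
    return constraint_nodes, variable_nodes, edges


def scale_edge(edges, index, factor):
    """Hypothetical perturbation: rescale one coefficient, returning a new edge list."""
    i, j, coef = edges[index]
    return edges[:index] + [(i, j, coef * factor)] + edges[index + 1:]


# Toy instance with 3 variables and 2 constraints (illustrative values only).
c = [1.0, 2.0, 0.5]
A = [[1.0, 0.0, 2.0],
     [0.0, 3.0, 0.0]]
b = [4.0, 6.0]

cons, variables, edges = milp_to_bipartite(c, A, b)
perturbed = scale_edge(edges, index=1, factor=1.1)
print(edges)
print(perturbed)
```

Working on the edge list rather than on a problem-specific generator matches the abstract's point that the augmentation does not need to know the problem type.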

2308.13222 2026-05-29 physics.comp-ph cs.LG physics.flu-dyn stat.ML 版本更新

Bayesian Reasoning for Physics Informed Neural Networks

物理信息神经网络的贝叶斯推理

Krzysztof M. Graczyk, Kornel Witkowski

发表机构 * Institute for Theoretical Physics, University of Wrocław(弗罗茨瓦夫大学理论物理研究所) Institute of Low Temperature and Structure Research(低温与结构研究所) Polish Academy of Sciences(波兰科学院)

AI总结 提出一种证据驱动的贝叶斯物理信息神经网络方法,通过拉普拉斯近似高效计算模型证据,自动优化偏微分方程残差、边界条件和观测数据之间的损失权重,并在热方程、波动方程和伯格斯方程上验证了其求解精度与不确定性量化能力。

Comments 21 pages, 12 figures, re-edit the description of the Bayesian framework, some of the content moved to Appendix. Discussion of numerical performance added, as well as related approaches

Journal ref
Phys. Rev. E 113, 055307 (2026)
AI中文摘要

我们引入了一种证据驱动的贝叶斯物理信息神经网络公式,能够自动优化偏微分方程残差、边界条件和观测数据之间的损失权重。与现有基于采样或变分推理的贝叶斯PINN方法不同,所提方法使用拉普拉斯近似解析计算模型证据,从而无需后验采样即可实现高效的超参数调优和模型比较。我们在热方程、波动方程和伯格斯方程上演示了该方法,获得了与精确解或参考解一致的结果。在伯格斯方程示例中,我们进一步展示了该框架自然地整合了控制方程和含噪声测量中的信息,在统一的贝叶斯框架内提供了预测不确定性。

英文摘要

We introduce an evidence-driven Bayesian formulation of physics-informed neural networks that enables automatic optimization of loss weights between PDE residuals, boundary conditions, and observational data. Unlike existing Bayesian PINN approaches based on sampling or variational inference, the proposed method uses a Laplace approximation to compute model evidence analytically, enabling efficient hyperparameter tuning and model comparison without posterior sampling. We demonstrate the method on the heat, wave, and Burgers' equations, obtaining solutions in agreement with exact or reference results. In the Burgers' equation example, we further show that the framework naturally integrates information from governing equations and noisy measurements, providing predictive uncertainties within a unified Bayesian setting.
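
For orientation, the textbook form of the Laplace approximation to the model evidence is given below. The notation is an assumption, not taken from the paper: $\mathbf{w}$ denotes the network weights, $W$ their number, $\mathbf{w}_{\mathrm{MAP}}$ the posterior mode, $\alpha$ the hyperparameters (here, loosely, the loss weights), and $\mathcal{D}$ the data and physics constraints. The paper's exact formulation may differ.

```latex
\ln p(\mathcal{D} \mid \alpha) \approx
    \ln p(\mathcal{D} \mid \mathbf{w}_{\mathrm{MAP}}, \alpha)
  + \ln p(\mathbf{w}_{\mathrm{MAP}} \mid \alpha)
  + \frac{W}{2} \ln (2\pi)
  - \frac{1}{2} \ln \det \mathbf{A},
\qquad
\mathbf{A} = - \nabla_{\mathbf{w}} \nabla_{\mathbf{w}}
    \ln \bigl[ p(\mathcal{D} \mid \mathbf{w}, \alpha)\, p(\mathbf{w} \mid \alpha) \bigr]
    \Big|_{\mathbf{w} = \mathbf{w}_{\mathrm{MAP}}}
```

Every term is evaluated at a single point, $\mathbf{w}_{\mathrm{MAP}}$, which is why no posterior sampling is needed; maximizing the right-hand side over $\alpha$ is one way such an expression can drive hyperparameter selection.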

2306.10356 2026-05-29 cs.LG cs.AI eess.SP 版本更新

MATNet: Multi-Level Fusion Transformer-Based Model for Day-Ahead PV Generation Forecasting

MATNet:基于多层级融合Transformer的日前光伏发电预测模型

Matteo Tortora, Francesco Conte, Gianluca Natrella, Paolo Soda

发表机构 * Department of Naval, Electrical, Electronics and Telecommunications Engineering, University of Genoa, Via all’Opera Pia 11a, 16145 Genoa, Italy; Unit of Innovation, Entrepreneurship & Sustainability, Department of Engineering, University Campus Bio-Medico of Rome, Via Alvaro del Portillo 21, 00128 Rome, Italy; Computer Systems Department of Engineering, University Campus Bio-Medico of Rome, Via Alvaro del Portillo 21, 00128 Rome, Italy

AI总结 提出一种基于多层级融合Transformer的多模态架构MATNet,通过多级联合融合和软注意力机制利用历史光伏数据与气象数据,在日前多步光伏发电预测中显著优于基线模型(RMSE 0.0445,相对提升约65%),并展现出对缺失数据的鲁棒性和跨域零样本泛化能力。

AI中文摘要

可再生能源发电的准确预测对于促进可再生能源融入电力系统至关重要。聚焦光伏(PV)单元,预测方法主要分为基于物理和基于数据两大类,其中基于人工智能(AI)的模型提供了最先进的性能。然而,这些基于AI的模型虽然能够捕捉数据中的复杂模式和关系,却忽略了现象背后的物理先验知识。因此,本文提出MATNet,一种新颖的基于Transformer的多模态架构,用于多步日前光伏发电预测。该模型通过多层级联合融合方法输入历史光伏数据以及历史和预报气象数据,在多个融合阶段采用软注意力机制。我们在Ausgrid基准数据集上评估了MATNet的有效性,其显著优于各种基线模型,实现了0.0445的RMSE,相比表现最佳的基线方法相对提升约65%。分析进一步通过一系列消融研究、对缺失数据的敏感性分析(突显了MATNet对输入退化的鲁棒性)、在五个外部光伏数据集上的跨站点零样本泛化评估(证明了MATNet在显著域偏移下的鲁棒性)以及对模型计算复杂度的评估(确认了其在预测精度与计算效率之间的良好平衡)得到丰富。这些结果凸显了MATNet作为促进光伏能源融入电网的可靠且高效解决方案的潜力。代码可在https://github.com/arco-group/MATNet获取。

英文摘要

Accurate forecasting of renewable generation is crucial to facilitate the integration of Renewable Energy Sources into the power system. Focusing on photovoltaic (PV) units, forecasting methods can be divided into two main categories: physics-based and data-based strategies, with Artificial Intelligence (AI)-based models providing state-of-the-art performance. However, while these AI-based models can capture complex patterns and relationships in the data, they ignore the underlying physical prior knowledge of the phenomenon. Therefore, in this paper, we propose MATNet, a novel transformer-based multimodal architecture for multi-step day-ahead PV power generation forecasting. The model is fed with historical PV data and historical and forecast weather data through a multi-level joint fusion approach, employing a soft-attention mechanism at multiple fusion stages. We evaluate the effectiveness of MATNet on the Ausgrid benchmark dataset, where it significantly outperforms various baseline models, achieving an RMSE of 0.0445, corresponding to a relative improvement of approximately 65% compared to the best-performing baseline method. The analysis is further enriched by a comprehensive set of ablation studies; a sensitivity analysis on missing data, which highlights MATNet's resilience to input degradation; a cross-site zero-shot generalization evaluation on five external PV datasets, demonstrating MATNet's robustness under significant domain shifts; and an assessment of the model's computational complexity, confirming its favorable balance between predictive accuracy and computational efficiency. These results highlight MATNet's potential as a reliable and efficient solution to facilitate the integration of PV energy into the power grid. The code is available at https://github.com/arco-group/MATNet.
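
The two headline numbers, RMSE and relative improvement, can be computed as in the sketch below. It assumes the reported improvement is the relative reduction in RMSE against the baseline, which the abstract does not spell out; the function names and the toy forecasts are hypothetical and unrelated to the Ausgrid data.

```python
import math


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


def relative_improvement(baseline_error, model_error):
    """Fraction by which the model reduces the baseline's error."""
    return (baseline_error - model_error) / baseline_error


# Toy normalized PV output and two hypothetical forecasts.
y_true = [0.0, 0.5, 1.0, 0.5]
baseline_pred = [0.2, 0.3, 0.8, 0.7]
model_pred = [0.1, 0.4, 0.9, 0.6]

baseline_rmse = rmse(y_true, baseline_pred)
model_rmse = rmse(y_true, model_pred)
gain = relative_improvement(baseline_rmse, model_rmse)
print(round(baseline_rmse, 4), round(model_rmse, 4), round(gain, 4))
```

Note that a relative improvement is only comparable across papers when both errors are computed on the same normalization of the target.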